# Error handling in MolPipeline

MolPipeline includes built-in error handling because real-world molecular data sets are often heterogeneous and processing can fail for many reasons. A robust pipeline must handle these cases automatically, without manual intervention.

A simple example is a malformed SMILES string that cannot be parsed, or a physico-chemical descriptor that cannot be calculated for a given molecule. In general, any processing step might fail. This becomes especially limiting for pipelines with many processing steps that are applied to large data sets.

In this notebook we use simple examples to show how error handling in MolPipeline works. This includes:
* Filtering out molecules that cannot be processed
* Replacing erroneous molecules with a fill value

In [1]:
import numpy as np

from molpipeline.pipeline import Pipeline
from molpipeline.pipeline_elements.any2mol import AutoToMolPipelineElement

from molpipeline.pipeline_elements.error_handling import (
    ErrorFilter,
    ErrorReplacer,
)

from molpipeline.pipeline_elements.mol2any import MolToFoldedMorganFingerprint

Feeding the following string to a pipeline as a SMILES makes processing fail, because the string cannot be parsed:
```python
pipeline.predict(["NotAValidSMILES"])
```

MolPipeline handles such cases gracefully: the program does not stop, and the SMILES does not have to be removed from the data set manually. The instance-based processing in MolPipeline represents these cases with `InvalidInstance`, a custom object that marks the failing SMILES as invalid. When possible, `InvalidInstance` objects are simply passed through the pipeline without any computation, like this:

```python
if isinstance(sample, InvalidInstance):
    return sample
# computation code ...
```
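The pass-through pattern above can be illustrated with a small, library-independent sketch. `ToyInvalidInstance` and `apply_step` are hypothetical names invented for this example; they only mimic the idea and are not MolPipeline's actual classes or implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyInvalidInstance:
    """Hypothetical stand-in for an invalid-instance marker (not MolPipeline's class)."""

    element_name: str  # name of the step in which the failure occurred
    message: str  # description of the failure


def apply_step(name, func, samples):
    """Apply func to each sample; pass markers through and wrap new failures."""
    results = []
    for sample in samples:
        if isinstance(sample, ToyInvalidInstance):
            # Already failed in an earlier step: skip the computation.
            results.append(sample)
            continue
        try:
            results.append(func(sample))
        except ValueError as err:
            # Record the failure instead of stopping the whole run.
            results.append(ToyInvalidInstance(name, str(err)))
    return results


# "oops" fails in the first step and is carried through the second one untouched.
parsed = apply_step("to_int", int, ["1", "2", "oops"])
doubled = apply_step("double", lambda value: 2 * value, parsed)
```

The marker keeps the name of the step that produced it, so the origin of a failure is still visible at the end of the chain.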

#### Handle errors in molecule processing

In [2]:
pipeline = Pipeline([("auto2mol", AutoToMolPipelineElement())])
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[11:39:59] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[11:39:59] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


[<rdkit.Chem.rdchem.Mol at 0x7f80921e8e40>,
 <rdkit.Chem.rdchem.Mol at 0x7f80921e9000>,
 InvalidInstance(auto2mol, Not readable input molecule: NotAValidSMILES)]

With the `ErrorFilter` we can remove `InvalidInstance` objects from the output:

In [3]:
pipeline = Pipeline(
    [("auto2mol", AutoToMolPipelineElement()), ("error_filter", ErrorFilter())]
)
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[11:39:59] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[11:39:59] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


[<rdkit.Chem.rdchem.Mol at 0x7f80921eb920>,
 <rdkit.Chem.rdchem.Mol at 0x7f80921e9070>]

Alternatively, by combining the `ErrorFilter` with an `ErrorReplacer`, the `InvalidInstance` objects can be replaced with a fill value.

In [4]:
# The ErrorFilter tracks all InvalidInstances of the registered pipeline elements.
error_filter = ErrorFilter()

# The ErrorReplacer puts a user-specified fill value at the positions of the removed invalid instances
error_replacer = ErrorReplacer.from_error_filter(error_filter, np.nan)

pipeline = Pipeline(
    [
        ("auto2mol", AutoToMolPipelineElement()),
        ("error_filter", error_filter),  # removes InvalidInstances
        (
            "error_replacer",
            error_replacer,
        ),  # fills in a replacement value at the respective positions
    ]
)
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[11:39:59] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[11:39:59] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


[<rdkit.Chem.rdchem.Mol at 0x7f80921ebbc0>,
 <rdkit.Chem.rdchem.Mol at 0x7f80921ebc30>,
 nan]
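One way to picture the interplay of filtering and replacing is to remember the positions of the removed items and reinsert a fill value there later. The following is a minimal sketch of that idea in plain Python; `ToyInvalid`, `filter_invalid`, and `replace_removed` are hypothetical helpers for illustration and do not reflect how `ErrorFilter` and `ErrorReplacer` are implemented internally.

```python
import math


class ToyInvalid:
    """Hypothetical marker for a failed item (not MolPipeline's InvalidInstance)."""


def filter_invalid(values):
    """Return the valid values and the positions of the removed markers."""
    valid, removed = [], []
    for index, value in enumerate(values):
        if isinstance(value, ToyInvalid):
            removed.append(index)
        else:
            valid.append(value)
    return valid, removed


def replace_removed(valid, removed, fill_value):
    """Rebuild a list of the original length, with fill_value at removed positions."""
    removed_positions = set(removed)
    valid_iter = iter(valid)
    total = len(valid) + len(removed)
    return [
        fill_value if index in removed_positions else next(valid_iter)
        for index in range(total)
    ]


values = ["mol_a", "mol_b", ToyInvalid()]
valid, removed = filter_invalid(values)
restored = replace_removed(valid, removed, math.nan)
```

Because the positions are recorded at filtering time, any steps between the two helpers only ever see valid items, while the final output still lines up with the input list.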

#### Handle errors in matrices and other assembled outputs

While `InvalidInstance` objects can sometimes simply be passed through the pipeline, certain situations require them to be filtered out, for example when the Morgan fingerprints of individual molecules need to be assembled into a NumPy feature matrix. In these cases we want to remove the invalid instances and build the feature matrix without the failed cases.

In [5]:
pipeline = Pipeline(
    [
        ("auto2mol", AutoToMolPipelineElement()),
        (
            "error_filter",
            ErrorFilter(),
        ),  # at this step all invalid instances are filtered out
        (
            "morgan",
            MolToFoldedMorganFingerprint(
                n_bits=2048, radius=2, output_datatype="dense"
            ),
        ),
    ]
)

# The resulting feature matrix contains only fingerprints for the two valid SMILES
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[11:39:59] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[11:39:59] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

Instead of removing failed molecules from the feature matrix entirely, we can also fill their rows with NaN. This retains the shape of the matrix and preserves the mapping between feature matrix rows and the SMILES input list.

In [6]:
error_filter = ErrorFilter()

# The ErrorReplacer puts a user-specified fill value at the positions of the removed invalid instances
error_replacer = ErrorReplacer.from_error_filter(error_filter, np.nan)

pipeline = Pipeline(
    [
        ("auto2mol", AutoToMolPipelineElement()),
        ("error_filter", error_filter),
        (
            "morgan2",
            MolToFoldedMorganFingerprint(
                n_bits=2048, radius=2, output_datatype="dense"
            ),
        ),
        (
            "error_replacer",
            error_replacer,
        ),  # here the filtered out elements are replaced with the fill value
    ]
)

# The resulting feature matrix has one row per input; the row of the invalid SMILES is filled with nan
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[11:39:59] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[11:39:59] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [nan, nan, nan, ..., nan, nan, nan]])
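Because the NaN-filled matrix keeps one row per input, failed inputs can be traced back by row index. The sketch below uses a small hand-written array as a stand-in for the pipeline output (an assumption for illustration, with 3 columns instead of 2048) and relies only on NumPy. Note that NaN cannot be stored in an integer array, which is consistent with the output above being floating point rather than `uint8`.

```python
import numpy as np

smiles = ["CCCCC", "c1ccccc1", "NotAValidSMILES"]

# Hand-written stand-in for the NaN-filled feature matrix returned by the pipeline.
features = np.array(
    [
        [0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0],
        [np.nan, np.nan, np.nan],
    ]
)

# A row belongs to a failed input if every entry in it is NaN.
failed_mask = np.isnan(features).all(axis=1)

# Map the failed rows back to the inputs that caused them.
failed_smiles = [smi for smi, failed in zip(smiles, failed_mask) if failed]

# Keep only the valid rows, e.g. for a model that cannot handle NaN.
valid_features = features[~failed_mask]
```

Keeping the mask around also makes it straightforward to align labels or predictions with the original input order.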