# Error handling in MolPipeline

MolPipeline includes error handling because real-world molecular data sets are often heterogeneous and data processing can fail for many reasons. A robust pipeline must handle these cases automatically to avoid manual intervention.

A simple example is an erroneous SMILES that cannot be parsed, or a physico-chemical descriptor that cannot be calculated for a given molecule. In general, any processing step might fail. This becomes especially limiting for pipelines with many processing steps that are applied to large data sets.

In this notebook we use simple examples to show how error handling in MolPipeline works. This includes:
* Filtering out molecules that cannot be processed
* Replacing erroneous molecules with a fill value

In [1]:
import numpy as np

from molpipeline import Pipeline, ErrorFilter, FilterReinserter, PostPredictionWrapper
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToMorganFP

from sklearn.ensemble import RandomForestClassifier

When we feed the following string to a pipeline as a SMILES, the pipeline fails:
```python
pipeline.predict(["NotAValidSMILES"])
```

MolPipeline can handle such cases gracefully, without stopping the program and without requiring you to manually remove the SMILES from the data set. The instance-based processing in MolPipeline represents each such case as an `InvalidInstance`, a custom object that marks the failing SMILES as invalid. When possible, `InvalidInstance` objects are simply passed through the pipeline without any computation, like:

```python
if isinstance(sample, InvalidInstance):
    return sample
# computation code ...
```
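
The pass-through idea can be illustrated with a small, self-contained sketch. It does not use MolPipeline itself: the `InvalidInstance` dataclass and the `run_step` helper below are hypothetical stand-ins, and the integer-parsing toy steps are assumptions chosen only to keep the example free of dependencies.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class InvalidInstance:
    """Hypothetical stand-in for a failure marker, not MolPipeline's class."""

    element_name: str  # name of the step in which the sample failed
    message: str  # reason for the failure


def run_step(name: str, func: Callable[[Any], Any], samples: List[Any]) -> List[Any]:
    """Apply func to every sample and turn failures into InvalidInstance markers."""
    results = []
    for sample in samples:
        if isinstance(sample, InvalidInstance):
            # Already failed in an earlier step: pass through without computation.
            results.append(sample)
            continue
        try:
            results.append(func(sample))
        except ValueError as err:
            results.append(InvalidInstance(name, str(err)))
    return results


# Toy two-step "pipeline": parse strings to integers, then double them.
parsed = run_step("parse", int, ["1", "2", "oops"])
doubled = run_step("double", lambda value: 2 * value, parsed)
print(doubled)
```

The third sample fails in the `parse` step and reaches the end of the toy pipeline unchanged, still carrying the name of the step that rejected it, while the two valid samples are processed normally.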

#### Handle errors in molecule processing

In [2]:
pipeline = Pipeline([("auto2mol", AutoToMol())])
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[14:52:36] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[14:52:36] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


[<rdkit.Chem.rdchem.Mol at 0x7f2433c35af0>,
 <rdkit.Chem.rdchem.Mol at 0x7f2433c35a80>,
 InvalidInstance(auto2mol, Not readable input molecule: NotAValidSMILES)]

With the `ErrorFilter` we can remove `InvalidInstance` objects from the output.

In [3]:
pipeline = Pipeline([("auto2mol", AutoToMol()), ("error_filter", ErrorFilter())])
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[14:52:36] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[14:52:36] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


[<rdkit.Chem.rdchem.Mol at 0x7f2433c35bd0>,
 <rdkit.Chem.rdchem.Mol at 0x7f2433c35b60>]

Alternatively, by using the `ErrorFilter` and the `FilterReinserter` together, the `InvalidInstance` objects can be replaced with a fill value.

In [4]:
# The ErrorFilter tracks all InvalidInstances of the registered pipeline elements.
error_filter = ErrorFilter()

# The FilterReinserter lets you re-fill the elements filtered out with ErrorFilter with a user-specified value
error_reinserter = FilterReinserter.from_error_filter(error_filter, np.nan)

pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        ("error_filter", error_filter),  # removes InvalidInstances
        (
            "error_reinserter",
            error_reinserter,
        ),  # fills a replacement value at the respective positions
    ]
)
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[14:52:36] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[14:52:36] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


[<rdkit.Chem.rdchem.Mol at 0x7f2433c35d20>,
 <rdkit.Chem.rdchem.Mol at 0x7f2433c35d90>,
 nan]
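
The filter-and-reinsert mechanics shown above can be sketched in plain Python. This is only an illustration of the bookkeeping involved, not MolPipeline's implementation: the `Invalid` marker class and the `filter_invalid` and `reinsert` helpers are hypothetical names introduced for this example.

```python
from typing import Any, List, Tuple


class Invalid:
    """Hypothetical marker object for a sample that failed processing."""


def filter_invalid(samples: List[Any]) -> Tuple[List[Any], List[int]]:
    """Drop Invalid markers and remember the positions they were removed from."""
    kept, removed_positions = [], []
    for position, sample in enumerate(samples):
        if isinstance(sample, Invalid):
            removed_positions.append(position)
        else:
            kept.append(sample)
    return kept, removed_positions


def reinsert(values: List[Any], removed_positions: List[int], fill_value: Any) -> List[Any]:
    """Rebuild a list of the original length, filling the removed positions."""
    removed = set(removed_positions)
    remaining = iter(values)
    total = len(values) + len(removed)
    return [fill_value if i in removed else next(remaining) for i in range(total)]


samples = ["a", Invalid(), "b"]
kept, removed_positions = filter_invalid(samples)
processed = [value.upper() for value in kept]  # any step that needs clean input
restored = reinsert(processed, removed_positions, None)
print(restored)  # ['A', None, 'B']
```

The key design point is that the filter and the reinserter share state (the removed positions), which mirrors why the reinserter in the cell above is created from the specific `ErrorFilter` instance it belongs to.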

#### Handle errors in matrices and other assembled outputs

While `InvalidInstance` objects can sometimes simply be passed through the pipeline, certain situations require them to be filtered out, for example when the Morgan fingerprints of individual molecules need to be assembled into a NumPy feature matrix. In these cases we want to remove the `InvalidInstance` objects and create the feature matrix without the failed cases.

In [5]:
pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        (
            "error_filter",
            ErrorFilter(),
        ),  # at this step all invalid instances are filtered out
        (
            "morgan",
            MolToMorganFP(n_bits=2048, radius=2, return_as="dense"),
        ),
    ]
)

# The resulting feature matrix contains only fingerprints for the two valid SMILES
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[14:52:36] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[14:52:36] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

Instead of removing the failed molecules from the feature matrix completely, we can also fill their rows with nan. This retains the shape of the matrix and preserves the mapping between feature matrix rows and the SMILES input list.

In [6]:
# We again combine the ErrorFilter with the FilterReinserter
error_filter = ErrorFilter()
error_reinserter = FilterReinserter.from_error_filter(error_filter, np.nan)

pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        ("error_filter", error_filter),
        (
            "morgan2",
            MolToMorganFP(n_bits=2048, radius=2, return_as="dense"),
        ),
        (
            "error_reinserter",
            error_reinserter,
        ),
    ]
)

# The resulting feature matrix has one row per input SMILES; the row of the invalid SMILES is filled with nan
pipeline.fit_transform(["CCCCC", "c1ccccc1", "NotAValidSMILES"])

[14:52:36] SMILES Parse Error: syntax error while parsing: NotAValidSMILES
[14:52:36] SMILES Parse Error: Failed parsing SMILES 'NotAValidSMILES' for input: 'NotAValidSMILES'


array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [nan, nan, nan, ..., nan, nan, nan]])

#### Inserting fill-values after a final predictor with `PostPredictionWrapper`

In scikit-learn's [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) API, the final element is usually a predictor. Since this final predictor only needs to implement a `fit` method, no other elements can follow it. However, post-processing of the predictor's output, i.e., the predictions, is necessary for consistent error handling. Therefore, we implemented a `PostPredictionWrapper` that can be used to insert fill values into an array of predictions. In this way, the element-wise correspondence between the input and output arrays can be guaranteed.

In [7]:
# create an error filter and reinserter.
error_filter = ErrorFilter()

# The Reinserter is wrapped into a PostPredictionWrapper because we execute
# the re-insertion step at the end of the pipeline, even after the predictor.
error_reinserter = PostPredictionWrapper(
    FilterReinserter.from_error_filter(error_filter, np.nan)
)

pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        ("error_filter", error_filter),
        (
            "morgan2",
            MolToMorganFP(n_bits=2048, radius=2, return_as="dense"),
        ),
        ("predictor", RandomForestClassifier(random_state=67056)),
        (
            "error_reinserter",
            error_reinserter,
        ),
    ]
)

# fit the pipeline
pipeline.fit(X=["CCCCC", "c1ccccc1"], y=[1, 0])
new_smiles_set = ["CC", "NotAValidSMILES", "CCc1ccc(N)cc1"]
predictions = pipeline.predict(new_smiles_set)
print("Predictions: ", predictions)

Predictions:  [ 0. nan  0.]


Note the nan value in the middle of the `predictions` vector. Without the `FilterReinserter` in the `PostPredictionWrapper`, we would only get the prediction vector `[0. 0.]`, which no longer lines up with the three input SMILES.

We can use the predictions vector with fill values to map back to the input SMILES and identify the problematic case:

In [8]:
np.array(new_smiles_set)[np.isnan(predictions)]

array(['NotAValidSMILES'], dtype='<U15')
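
Because the predictions stay aligned with the inputs, the same mapping can also be used to split the results into failed and successful cases. The following is a minimal sketch in plain Python; the prediction values are copied from the output above rather than recomputed, and the variable names `failed` and `succeeded` are chosen only for this example.

```python
import math

new_smiles_set = ["CC", "NotAValidSMILES", "CCc1ccc(N)cc1"]
# Values copied from the printed predictions above (not recomputed here).
predictions = [0.0, float("nan"), 0.0]

# Inputs whose prediction is the nan fill value failed somewhere in the pipeline.
failed = [smi for smi, pred in zip(new_smiles_set, predictions) if math.isnan(pred)]

# All other inputs keep their prediction, keyed by the original SMILES.
succeeded = {
    smi: pred for smi, pred in zip(new_smiles_set, predictions) if not math.isnan(pred)
}

print(failed)
print(succeeded)
```

This pairing only works because the fill values keep the output the same length as the input; with a filtered `[0. 0.]` vector, `zip` would silently pair the second prediction with the invalid SMILES.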