# Custom Sklearn estimators for cheminformatics: Scaffold splits example

MolPipeline provides custom Sklearn-like estimators for common cheminformatics tasks. The estimators comply with the [Sklearn estimator API](https://scikit-learn.org/stable/developers/develop.html) and can be used in pipelines.

This notebook shows how to use the MurckoScaffoldClustering estimator to generate scaffold splits for molecular machine learning. It applies the widely used [Murcko-type decomposition](https://www.rdkit.org/docs/GettingStartedInPython.html#murcko-decomposition) to a molecular data set. The molecules are then clustered by their scaffolds, and the resulting cluster labels can be used directly with Sklearn's group-based splitters for cluster cross-validation.

This is a simple example notebook using dummy data to illustrate the usage of custom estimators for cheminformatics like MurckoScaffoldClustering. Please look at the advanced notebooks for more detailed examples. 

[**Scaffold clustering**](#estimators)
* Murcko-scaffolds and generic scaffolds with MurckoScaffoldClustering estimator

[**Putting it together**](#fullexample)
* Train and evaluate a classifier with MolPipeline
* Cross-validation with a scaffold split
    * Combine MurckoScaffoldClustering with Sklearn's GroupKFold

In [None]:
import numpy as np
import pandas as pd
from rdkit.Chem import PandasTools
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.estimators import MurckoScaffoldClustering
from molpipeline.mol2any import MolToMorganFP

## Scaffold clustering <a class="anchor" id="estimators"></a>

MolPipeline implements custom Sklearn estimators for standard molecular machine learning tasks. For example, we created a MurckoScaffoldClustering estimator which can be used like a normal Sklearn clustering estimator.   

In [None]:
scaffold_smiles = [
    "Nc1ccccc1",
    "Cc1cc(Oc2nccc(CCC)c2)ccc1",
    "c1ccccc1",
]
linear_smiles = ["CC", "CCC", "CCCN"]

# run the scaffold clustering
scaffold_clustering = MurckoScaffoldClustering(
    n_jobs=1,
    linear_molecules_strategy="ignore",
)
scaffold_clustering.fit_predict(scaffold_smiles + linear_smiles)

The scaffold clustering above assigns the cluster label nan to linear molecules because we used linear_molecules_strategy="ignore". Alternatively, the "own_cluster" strategy groups all linear molecules into one additional cluster:

In [None]:
scaffold_clustering = MurckoScaffoldClustering(
    n_jobs=1,
    linear_molecules_strategy="own_cluster",
)
scaffold_clustering.fit_predict(scaffold_smiles + linear_smiles)
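
Linear molecules need a dedicated strategy because a molecule without rings has no Murcko scaffold. A minimal sketch with plain RDKit illustrates this. It assumes RDKit's `MurckoScaffold.MurckoScaffoldSmiles` helper and is not necessarily how MurckoScaffoldClustering computes scaffolds internally:

```python
from rdkit.Chem.Scaffolds import MurckoScaffold

# Ring-containing molecules: the scaffold keeps the ring system and drops side chains
aniline_scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles="Nc1ccccc1")
benzene_scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles="c1ccccc1")

# Linear molecule: there is no ring system, so the scaffold SMILES is empty
propane_scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles="CCC")

print(repr(aniline_scaffold), repr(benzene_scaffold), repr(propane_scaffold))
```

Aniline and benzene reduce to the same scaffold, while propane yields an empty one, which is why such molecules are either ignored or collected in their own cluster.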

In addition to the basic Murcko scaffolds, we can cluster by generic scaffolds. Generic scaffolds are Murcko scaffolds with every atom replaced by carbon, so that, for example, benzene and pyridine share the same scaffold. We can do this by setting make_generic=True:

In [None]:
scaffold_clustering = MurckoScaffoldClustering(
    n_jobs=1,
    linear_molecules_strategy="own_cluster",
    make_generic=True,
)
scaffold_clustering.fit_predict(scaffold_smiles + linear_smiles + ["c1ncccc1"])
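
The effect of generic scaffolds can be reproduced with plain RDKit. The sketch below is an illustration under the assumption that RDKit's `MurckoScaffold.MakeScaffoldGeneric` is a fair stand-in for make_generic=True; in RDKit this function sets all atoms to carbon and all bonds to single bonds. The helper `generic_scaffold_smiles` is a hypothetical name introduced only for this example:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def generic_scaffold_smiles(smiles: str) -> str:
    """Return the generic Murcko scaffold of a SMILES string (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)
    return Chem.MolToSmiles(generic)


# Plain Murcko scaffolds keep the element information, so these two differ
benzene_plain = MurckoScaffold.MurckoScaffoldSmiles(smiles="c1ccccc1")
pyridine_plain = MurckoScaffold.MurckoScaffoldSmiles(smiles="c1ncccc1")

# Generic scaffolds drop the element information, so these two coincide
benzene_generic = generic_scaffold_smiles("c1ccccc1")
pyridine_generic = generic_scaffold_smiles("c1ncccc1")

print(benzene_plain, pyridine_plain, benzene_generic, pyridine_generic)
```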

## Cross validation with scaffold split <a class="anchor" id="fullexample"></a>

A cross-validation with scaffold splits is straightforward to implement with MurckoScaffoldClustering. We can simply combine the generated scaffold clusters with Sklearn's group-based splitters, like [GroupKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html), [StratifiedGroupKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedGroupKFold.html), [LeaveOneGroupOut](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut), etc.
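
What makes these splitters suitable for scaffold splits is that no group ever appears in both the training and the test part of a fold. The following sketch checks this property with Sklearn alone. The group labels are hypothetical placeholders, not the output of MurckoScaffoldClustering, and `folds_are_group_disjoint` is a helper introduced only for this illustration:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Hypothetical cluster labels for eight samples (placeholders for scaffold clusters)
example_groups = np.array([0, 1, 0, 2, 2, 2, 0, 0])
example_X = np.arange(len(example_groups)).reshape(-1, 1)  # placeholder features


def folds_are_group_disjoint(splitter, X, groups):
    """Return True if no group occurs in both train and test of any fold."""
    for train_idx, test_idx in splitter.split(X, groups=groups):
        if set(groups[train_idx]) & set(groups[test_idx]):
            return False
    return True


group_kfold_ok = folds_are_group_disjoint(
    GroupKFold(n_splits=2), example_X, example_groups
)
logo_ok = folds_are_group_disjoint(LeaveOneGroupOut(), example_X, example_groups)

# LeaveOneGroupOut creates one fold per distinct group
n_logo_folds = LeaveOneGroupOut().get_n_splits(groups=example_groups)
print(group_kfold_ok, logo_ok, n_logo_folds)
```

With scaffold clusters as groups, this means that molecules sharing a scaffold are never split between training and test data.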

Let's set up some data and do the clustering:

In [None]:
# a list of dummy smiles
smiles_data = np.array(
    [
        "Nc1ccccc1",
        "Cc1cc(Oc2nccc(CCC)c2)ccc1",
        "c1ccccc1",
        "CCCCN",
        "CCC",
        "CCO",
        "Oc1ccccc1",
        "Oc1ccc(N)cc1",
    ],
)
# a simple dummy target variable y that indicates whether the molecule contains a nitrogen (1=has N, 0=no N)
has_nitrogen_label = np.array([1, 1, 0, 1, 0, 0, 0, 1])

# we cluster the molecules by their murcko scaffold for our cross validation split
scaffold_clustering = MurckoScaffoldClustering(
    n_jobs=1,
    linear_molecules_strategy="own_cluster",
)
groups = scaffold_clustering.fit_predict(smiles_data)

# let's look at the data in a nice dataframe
df = pd.DataFrame(
    {
        "smiles": smiles_data,
        "has_nitrogen_label": has_nitrogen_label,
        "murcko_clusters": groups.astype(int),
    },
)
PandasTools.AddMoleculeColumnToFrame(df, "smiles", "Molecule")
PandasTools.AddMurckoToFrame(df, molCol="Molecule")
df.set_index("smiles")

Now we can run a cross-validation using GroupKFold:

In [None]:
# set up a splitter that handles the cluster split for us
grouper = GroupKFold(n_splits=2)
# note: without shuffling, GroupKFold is deterministic, so this seed does not change the folds
grouper.random_state = 67056

for i, (train, test) in enumerate(
    grouper.split(smiles_data, has_nitrogen_label, groups=groups),
):
    # set up the pipeline
    pipeline = Pipeline(
        [
            ("auto2mol", AutoToMol()),
            ("morgan", MolToMorganFP(n_bits=1024, radius=2)),
            ("RandomForestClassifier", RandomForestClassifier(random_state=67056)),
        ],
    )

    # fit the pipeline to the training data
    pipeline.fit(X=smiles_data[train], y=has_nitrogen_label[train])

    # evaluate the pipeline on the test set
    predictions = pipeline.predict_proba(
        X=smiles_data[test],
    )

    # print the performance for predicting the presence of nitrogens on the test set
    for smi, pred, label in zip(
        smiles_data[test],
        predictions[:, 1],
        has_nitrogen_label[test],
        strict=True,
    ):
        print(f"fold {i}:", smi, f"prediction={pred:.2f}", f"label={label}")
    print(
        f"fold {i}:",
        "test ROC AUC score:",
        roc_auc_score(has_nitrogen_label[test], predictions[:, 1]),
    )
    print("-------------------")

The results above show that in this dummy example the presence of a nitrogen can be learned and predicted with a ROC AUC score of 1.0 on the test sets of the scaffold split.   