This notebook is a short introduction to writing your own MolPipeline pipeline elements.



# How to add a custom molecular featurization

In MolPipeline, molecular descriptors and other featurization methods are implemented as `MolToAnyPipelineElement` subclasses because they transform an RDKit molecule into some other representation, e.g., a feature vector.

## Example using `MolToDescriptorPipelineElement`

The `MolToDescriptorPipelineElement` is a specialization of `MolToAnyPipelineElement` that adds useful functionality to the interface, such as the number of features, the names of the features, and optional feature normalization. Analogously, the `MolToFingerprintPipelineElement` provides useful functions for molecular fingerprint featurization.

In the following example, we implement a new molecular descriptor that represents the composition of a molecule by counting how often each selected chemical element occurs in it.

In [1]:
import numpy as np
import numpy.typing as npt

from molpipeline.abstract_pipeline_elements.mol2any import (
    MolToDescriptorPipelineElement,
)
from molpipeline.utils.molpipeline_types import AnyTransformer, RDKitMol


class ElementCountDescriptor(MolToDescriptorPipelineElement):
    """Element count descriptor."""

    def __init__(
        self,
        elements_to_count: list[int],
        standardizer: AnyTransformer | None = None,
        name: str = "ElementCountDescriptor",
        n_jobs: int = 1,
        uuid: str | None = None,
    ) -> None:
        """Construct a new ElementCountDescriptor.

        Parameters
        ----------
        elements_to_count : list[int]
            List of atomic numbers to count in the molecule.
        standardizer : AnyTransformer | None, optional
            Standardizer to apply to the feature vector.
        name : str, default="ElementCountDescriptor"
            Name of the descriptor.
        n_jobs : int, default=1
            Number of jobs to run in parallel.
        uuid : str, optional
            Unique identifier for the descriptor.

        """
        super().__init__(
            standardizer=standardizer,
            name=name,
            n_jobs=n_jobs,
            uuid=uuid,
        )

        # Defines which chemical elements to count in a molecule.
        # The keys are the atomic numbers and the values are their positions in the feature vector.
        self.elements_dict = {element: i for i, element in enumerate(elements_to_count)}

    @property
    def n_features(self) -> int:
        """Return the number of features."""
        return len(self.elements_dict)

    @property
    def descriptor_list(self) -> list[str]:
        """Return a copy of the descriptor list."""
        return [f"atom_count_{atom_number}" for atom_number in self.elements_dict]

    def pretransform_single(self, value: RDKitMol) -> npt.NDArray[np.float64]:
        """Transform an RDKit molecule to the element count feature vector."""
        feature_vector = np.zeros(len(self.elements_dict))
        for atom in value.GetAtoms():
            atomic_number = atom.GetAtomicNum()
            if atomic_number in self.elements_dict:
                feature_vector[self.elements_dict[atomic_number]] += 1
        return feature_vector

In [2]:
from rdkit import Chem

# let's create a new ElementCountDescriptor counting carbon, nitrogen, oxygen and fluorine atoms in the molecule
counter = ElementCountDescriptor(elements_to_count=[6, 7, 8, 9])

# let's transform the molecule to our descriptor
counter.transform([Chem.MolFromSmiles("CCO")])

array([[2., 0., 1., 0.]])

The resulting feature vector shows 2 carbons, 0 nitrogens, 1 oxygen and 0 fluorines.
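The counting logic itself can be sketched without RDKit or MolPipeline by working directly on a list of atomic numbers. The helper name `count_elements` is hypothetical and used only for illustration; the sketch assumes the heavy atoms of ethanol (`CCO`) have the atomic numbers 6, 6 and 8.

```python
def count_elements(atomic_numbers, elements_to_count):
    """Count selected atomic numbers (illustrative helper, not part of MolPipeline)."""
    # Map each atomic number to its position in the feature vector.
    positions = {element: i for i, element in enumerate(elements_to_count)}
    feature_vector = [0.0] * len(positions)
    for atomic_number in atomic_numbers:
        # Atoms whose element is not in elements_to_count are ignored.
        if atomic_number in positions:
            feature_vector[positions[atomic_number]] += 1
    return feature_vector


# Heavy atoms of ethanol ("CCO"): two carbons (6) and one oxygen (8).
ethanol = [6, 6, 8]
print(count_elements(ethanol, [6, 7, 8, 9]))  # [2.0, 0.0, 1.0, 0.0]
```

The order of `elements_to_count` fixes the column order, so changing that list changes the layout of the feature vector.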

Now we create a pipeline that transforms a list of SMILES strings into a NumPy matrix of our new descriptor.

In [3]:
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol

# create a pipeline
pipeline_feat = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        (
            "ElementCountDescriptor",
            ElementCountDescriptor(elements_to_count=[6, 7, 8, 9]),
        ),
    ]
)

In [4]:
pipeline_feat.fit_transform(["CCCO", "c1ccccc1N", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"])

array([[3., 0., 1., 0.],
       [6., 1., 0., 0.],
       [8., 4., 2., 0.]])
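
Each row of the matrix corresponds to one input SMILES string and each column to one entry of `elements_to_count`. As a small sketch in plain Python, a row can be paired with the feature names that the `descriptor_list` property above generates. The helper `label_row` is hypothetical and not part of MolPipeline.

```python
elements_to_count = [6, 7, 8, 9]
# Same naming scheme as the descriptor_list property defined earlier.
feature_names = [f"atom_count_{atomic_number}" for atomic_number in elements_to_count]


def label_row(row, names):
    """Pair feature values with their names (illustrative helper)."""
    return dict(zip(names, row))


# Third row of the matrix above (caffeine).
caffeine_row = [8.0, 4.0, 2.0, 0.0]
labeled = label_row(caffeine_row, feature_names)
print(labeled)
```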

# How to add a new machine learning model

Because MolPipeline interoperates seamlessly with scikit-learn, adding a new machine learning model follows the same procedure as writing a custom scikit-learn estimator. The [Developing scikit-learn estimators](https://scikit-learn.org/stable/developers/develop.html) guide is a great resource for this and covers many technical details.

In this notebook, we give a short example of how to implement your own simplified logistic regression estimator.
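
The estimator in the next cell can be read as batch gradient descent on the mean logistic loss, without an intercept term. Under that reading, with feature matrix $X$ ($n$ rows), labels $y \in \{0, 1\}^n$, weights $\theta$ (`theta`) and step size $\eta$ (`lr`), each of the `num_iter` iterations computes:

$$
h = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\theta \leftarrow \theta - \eta \, \frac{1}{n} X^{\top} (h - y)
$$

The symbols $\eta$, $n$ and $h$ are notation introduced here for illustration. A sample is predicted as class 1 when $\sigma(x^{\top}\theta) > 0.5$.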

In [5]:
from sklearn.base import BaseEstimator, ClassifierMixin


class SimplifiedLogisticRegression(BaseEstimator, ClassifierMixin):
    """Example estimator for the simplified logistic regression algorithm."""

    def __init__(self, lr=0.01, num_iter=10000):
        self.lr = lr
        self.num_iter = num_iter

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

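    def fit(self, X, y):
        # Start from zero weights; this simplified model has no intercept term.
        self.theta = np.zeros(X.shape[1])
        for _ in range(self.num_iter):
            z = np.dot(X, self.theta)
            h = self._sigmoid(z)
            # Gradient of the mean logistic loss with respect to theta.
            gradient = np.dot(X.T, (h - y)) / y.size
            self.theta -= self.lr * gradient
        # By scikit-learn convention, fit returns the fitted estimator.
        return self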

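    def predict_proba(self, X):
        # Simplification: returns only P(y = 1) as a 1D array, whereas
        # scikit-learn classifiers return one column per class.
        return self._sigmoid(np.dot(X, self.theta))

    def predict(self, X):
        # Threshold the positive-class probability at 0.5.
        return np.array(self.predict_proba(X) > 0.5, dtype=np.int64)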

Let's perform a simple test using the presence of oxygen as our target.

In [6]:
smiles_data = ["CCCO", "c1ccccc1N", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]
y_has_oxygen = np.array([1, 0, 1])

In [7]:
# define the pipeline
pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        (
            "ElementCountDescriptor",
            ElementCountDescriptor(elements_to_count=[6, 7, 8, 9]),
        ),
        ("logistic_regression", SimplifiedLogisticRegression()),
    ]
)

# fit the model
pipeline.fit(smiles_data, y_has_oxygen)

# check the predictions on the training set
predictions = pipeline.predict(smiles_data)
predictions

array([1, 0, 1])