## Molecule featurizers

In [1]:
from chemprop.featurizers.molecule import (
    MorganBinaryFeaturizer,
    MorganCountFeaturizer,
    RDKit2DFeaturizer,
    V1RDKit2DFeaturizer,
    V1RDKit2DNormalizedFeaturizer,
)

These are example molecules to featurize.

In [2]:
from chemprop.utils import make_mol

smis = ["C" * i for i in range(1, 11)]
mols = [make_mol(smi, keep_h=False, add_h=False) for smi in smis]

### Molecule vs molgraph featurizers

Both molecule and [molgraph](./molgraph_molecule_featurizer.ipynb) featurizers take `rdkit.Chem.Mol` objects as input. Molgraph featurizers produce a `MolGraph` which is used in message passing. Molecule featurizers produce a 1D numpy array of features that can be used as [extra datapoint descriptors](../data/datapoints.ipynb).

In [3]:
from chemprop.data import MoleculeDatapoint

molecule_featurizer = MorganBinaryFeaturizer()

datapoints = [MoleculeDatapoint(mol, x_d=molecule_featurizer(mol)) for mol in mols]

molecule_featurizer(mols[0])

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

### Morgan fingerprint featurizers

Morgan fingerprints can use either a binary or a count representation of molecular substructures. The radius of the substructures, the length of the fingerprint, and whether to include chirality can all be customized. The default radius is 2, the default length is 2048, and chirality is included by default.

In [4]:
mf = MorganCountFeaturizer(radius=3, length=1024, include_chirality=False)
morgan_fp = mf(mols[0])
morgan_fp.shape, morgan_fp

((1024,), array([0, 0, 0, ..., 0, 0, 0], dtype=int32))
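
To make the difference between the two representations concrete, the sketch below uses a toy count vector rather than a real fingerprint. The values in `counts` are made up for illustration. It only shows the general idea that a binary fingerprint records presence where a count fingerprint records multiplicity.

```python
import numpy as np

# Hypothetical count fingerprint of length 8: each entry is the number of
# times a substructure hashed to that position (values are made up).
counts = np.array([0, 3, 0, 1, 0, 0, 2, 0], dtype=np.int32)

# A binary fingerprint only records whether each position was hit at all.
binary = (counts > 0).astype(np.uint8)

print(binary)        # [0 1 0 1 0 0 1 0]
print(counts.sum())  # 6
print(binary.sum())  # 3
```

The count form keeps information about repeated substructures that the binary form discards, which can matter for molecules such as the alkane chains used in this notebook.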

### RDKit molecule featurizers

Chemprop warns that the RDKit molecule featurizers are not well scaled by a `StandardScaler`. Consult the literature for more appropriate scaling methods.
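
To see why standard scaling can be a poor fit, consider a descriptor column with one extreme value. The sketch below uses made-up numbers, not real RDKit descriptors. The rank-based transform is just one possible alternative chosen for illustration; it is an assumption, not a method recommended by Chemprop.

```python
import numpy as np

# Made-up values for one descriptor column with a single extreme value.
x = np.array([0.1, 0.2, 0.4, 0.8, 1.5, 2.0, 3.0, 500.0])

# Standard scaling: subtract the mean and divide by the standard deviation.
z = (x - x.mean()) / x.std()

# One possible alternative (an assumption, not a Chemprop recommendation):
# map each value to its empirical rank, rescaled to [0, 1].
ranks = np.argsort(np.argsort(x)) / (len(x) - 1)

print(np.round(z, 3))      # the first seven values are nearly identical
print(np.round(ranks, 3))  # evenly spread between 0 and 1
```

After standard scaling, the extreme value dominates and the remaining values collapse into a narrow band, so differences between them are hard for a model to use.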

In [5]:
molecule_featurizer = RDKit2DFeaturizer()
extra_datapoint_descriptors = [molecule_featurizer(mol) for mol in mols]
extra_datapoint_descriptors[0]



array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.35978494,
        0.        , 16.043     , 12.011     , 16.03130013,  8.        ,
        0.        , -0.07755789, -0.07755789,  0.07755789,  0.07755789,
        1.        ,  1.        ,  1.        , 12.011     , 12.011     ,
       -0.07755789, -0.07755789,  0.1441    ,  0.1441    ,  2.503     ,
        2.503     ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  8.73925103,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  7.42665278,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  7.42665278,  0.        ,  0.        ,  0.  

The RDKit featurizers from v1 are also available. They rely on the `descriptastorus` package, which can be found at [https://github.com/bp-kelley/descriptastorus](https://github.com/bp-kelley/descriptastorus). This package doesn't include the following RDKit descriptors: `['AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'SPS']`. Scaled versions of the descriptors are available, but it is unknown which molecules were used to fit the scaling, so using them may cause a data leak depending on the test set used to evaluate model performance. See this [issue](https://github.com/bp-kelley/descriptastorus/issues/31) for more details about the scaling.

In [6]:
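# Unscaled v1 descriptors
molecule_featurizer = V1RDKit2DFeaturizer()
# Scaled v1 descriptors; this reassignment is the featurizer called below
molecule_featurizer = V1RDKit2DNormalizedFeaturizer()
molecule_featurizer(mols[0])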

array([1.96075662e-05, 5.77173432e-04, 3.87525506e-15, 2.72296612e-11,
       1.02515408e-07, 4.10254814e-13, 1.63521389e-11, 1.93930344e-05,
       1.22824218e-06, 2.20907757e-07, 6.35349909e-07, 3.08677419e-06,
       1.70338959e-05, 1.34072882e-05, 4.07488775e-10, 2.17523456e-08,
       6.89356874e-07, 2.63048207e-01, 1.96742684e-02, 2.50993926e-11,
       9.25841695e-11, 5.85610910e-17, 1.08871430e-06, 2.39145041e-11,
       7.52245592e-13, 1.23345732e-08, 2.94906350e-01, 9.59992784e-03,
       2.31947354e-03, 9.99390325e-01, 9.88006922e-01, 1.59186446e-08,
       4.42180049e-09, 1.00000000e+00, 7.85198619e-13, 4.14332758e-13,
       6.49617582e-11, 4.45588945e-06, 7.89307465e-03, 2.39990382e-02,
       7.89307465e-03, 4.59284380e-03, 3.24286613e-10, 1.83192891e-02,
       7.38491174e-01, 9.73505944e-01, 6.05575320e-02, 3.42737552e-07,
       1.23284669e-08, 6.13163344e-02, 3.33304127e-02, 9.93858689e-22,
       1.42492255e-01, 6.29631332e-02, 3.47228888e-02, 4.82992991e-15,
      

### Custom

Any class that defines a length (`__len__`) and returns a 1D numpy array when called with an `rdkit.Chem.Mol` can be used as a molecule featurizer.

In [7]:
import numpy as np
from rdkit import Chem

class MyMoleculeFeaturizer:
    def __len__(self) -> int:
        return 1

    def __call__(self, mol: Chem.Mol) -> np.ndarray:
        total_atoms = mol.GetNumAtoms()
        return np.array([total_atoms])

In [8]:
mf = MyMoleculeFeaturizer()
mf(mols[0])

array([1])
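
The contract above can be checked with a small helper. The sketch below is an illustration, not part of Chemprop: `check_molecule_featurizer`, `AtomCountFeaturizer`, and `FakeMol` are hypothetical names, and `FakeMol` only stands in for an `rdkit.Chem.Mol` so the snippet runs without RDKit installed. It also assumes that the featurizer's length is meant to equal the number of features it returns.

```python
import numpy as np


class FakeMol:
    """Hypothetical stand-in for rdkit.Chem.Mol exposing only GetNumAtoms()."""

    def __init__(self, n_atoms: int):
        self.n_atoms = n_atoms

    def GetNumAtoms(self) -> int:
        return self.n_atoms


class AtomCountFeaturizer:
    """Same idea as MyMoleculeFeaturizer: one feature, the atom count."""

    def __len__(self) -> int:
        return 1

    def __call__(self, mol) -> np.ndarray:
        return np.array([mol.GetNumAtoms()], dtype=float)


def check_molecule_featurizer(featurizer, mol) -> np.ndarray:
    """Hypothetical helper: verify the output is 1D and matches len(featurizer)."""
    features = np.asarray(featurizer(mol))
    if features.ndim != 1:
        raise ValueError(f"expected a 1D array, got shape {features.shape}")
    if len(features) != len(featurizer):
        raise ValueError(
            f"len(featurizer)={len(featurizer)} but got {len(features)} features"
        )
    return features


features = check_molecule_featurizer(AtomCountFeaturizer(), FakeMol(5))
print(features)  # [5.]
```

Running a check like this on one molecule before featurizing a whole dataset catches shape mismatches early.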

## Using molecule features as extra datapoint descriptors

If you only have molecule features for one molecule per datapoint, those features can be used directly as extra datapoint descriptors. If you have multiple molecules with extra features, or other extra datapoint descriptors, they first need to be concatenated into a single numpy array.

In [9]:
mol1_features = np.random.randn(len(mols), 1)
mol2_features = np.random.randn(len(mols), 2)
other_datapoint_descriptors = np.random.randn(len(mols), 3)

extra_datapoint_descriptors = np.hstack([mol1_features, mol2_features, other_datapoint_descriptors])
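
As a quick check on the concatenation above, the sketch below repeats it with a fixed number of datapoints and confirms that the result has one row per datapoint and one column per feature. The value `n_datapoints = 10` is an assumption matching the ten example molecules, and the seeded generator is only there to make the random values reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datapoints = 10  # assumed: one row per datapoint

mol1_features = rng.standard_normal((n_datapoints, 1))
mol2_features = rng.standard_normal((n_datapoints, 2))
other_datapoint_descriptors = rng.standard_normal((n_datapoints, 3))

# Stack column-wise: every input must have the same number of rows.
extra_datapoint_descriptors = np.hstack(
    [mol1_features, mol2_features, other_datapoint_descriptors]
)
print(extra_datapoint_descriptors.shape)  # (10, 6)

# Row i is the 1D descriptor vector for datapoint i.
first_row = extra_datapoint_descriptors[0]
print(first_row.shape)  # (6,)
```

Each row can then be passed as `x_d` when building the corresponding datapoint, as in the `MoleculeDatapoint(mol, x_d=...)` call shown earlier.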