In [1]:
%load_ext autoreload
%autoreload 2

## Specialized Fingerprint transformer classes

Aside from the generic molecular transformer classes, two specialized fingerprint transformers are also provided:

- **`FPVecTransformer`** is a wrapper around the original transformer that takes the parameters of the featurizer directly as input. The `kind` parameter specifies which fingerprint should be initialized.

- **`FPVecFilteredTransformer`** adds a filtering step: it checks how often each fingerprint position is non-zero and whether its values vary across molecules, then filters out the positions that do not meet user-defined thresholds.
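
The filtering idea can be sketched in plain Python, without molfeat or RDKit. This is a minimal illustration rather than the library's implementation: `select_columns` is a hypothetical helper, and the strict `<` comparison against the threshold is an assumption.

```python
def select_columns(fps, occ_threshold=0.0, del_invariant=False):
    """Return the indices of the fingerprint positions to keep.

    fps: list of equal-length lists of numbers, one row per molecule.
    Hypothetical helper for illustration; not part of molfeat.
    """
    n_rows = len(fps)
    keep = []
    for j in range(len(fps[0])):
        column = [row[j] for row in fps]
        # fraction of molecules in which this position is non-zero
        occurrence = sum(1 for v in column if v != 0) / n_rows
        # a column is invariant when every molecule has the same value
        invariant = len(set(column)) == 1
        if occurrence < occ_threshold:  # assumption: strict comparison
            continue
        if del_invariant and invariant:
            continue
        keep.append(j)
    return keep


fps = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
]
kept = select_columns(fps, occ_threshold=0.5, del_invariant=True)
print(kept)  # [2]
```

Position 0 is dropped because it is invariant, and positions 1 and 3 because they are non-zero in fewer than half of the rows. With the defaults (`occ_threshold=0.0`, `del_invariant=False`) every position is kept.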

In [2]:
import datamol as dm
import random
import numpy as np
from loguru import logger

# set printing option
np.set_printoptions(threshold=10)

# set random seeds
np.random.seed(10)
random.seed(10)

data = dm.data.freesolv().sample(500)

In [3]:
from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.fp import FPVecFilteredTransformer

trans1 = FPVecTransformer(kind="fcfp:6", length=1024, useBondTypes=False)
trans1.featurizer.params

Using backend: pytorch


{'radius': 3,
 'nBits': 1024,
 'invariants': [],
 'fromAtoms': [],
 'useChirality': False,
 'useBondTypes': False,
 'useFeatures': True}
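
The parameters above show that `kind="fcfp:6"` ended up as `radius=3` with `useFeatures=True`. This follows the usual ECFP/FCFP naming convention, where the number after the colon is a diameter and the leading `f` selects feature-based invariants. The sketch below reproduces that mapping with a hypothetical helper, `parse_circular_kind`; how molfeat parses the string internally is not shown in this notebook.

```python
def parse_circular_kind(kind):
    """Split a name such as 'fcfp:6' into a base name and Morgan-style parameters.

    Hypothetical helper for illustration; not part of molfeat.
    """
    name, _, diameter = kind.partition(":")
    # FCFP uses feature-based atom invariants, ECFP does not
    params = {"useFeatures": name.lower() == "fcfp"}
    if diameter:
        # ECFP/FCFP names quote a diameter; Morgan fingerprints take a radius
        params["radius"] = int(diameter) // 2
    return name, params


print(parse_circular_kind("fcfp:6"))  # ('fcfp', {'useFeatures': True, 'radius': 3})
print(parse_circular_kind("ecfp:4"))  # ('ecfp', {'useFeatures': False, 'radius': 2})
```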

Like the parent `MoleculeTransformer` class, **these featurizers can be copied and serialized**.

In [4]:
trans1.copy()

FPVecTransformer(kind="fcfp:6", length=1024, dtype=np.float32)

In [5]:
trans1.to_dict()

{'molfeat.trans.fp.FPVecTransformer': {'n_jobs': 1,
  'dtype': numpy.float32,
  'verbose': False,
  'useBondTypes': False,
  'kind': 'fcfp:6',
  'length': 1024}}

In [6]:
print(trans1.to_yaml())

molfeat.trans.fp.FPVecTransformer:
  dtype: !!python/name:numpy.float32 ''
  kind: fcfp:6
  length: 1024
  n_jobs: 1
  useBondTypes: false
  verbose: false



In [7]:
X = trans1(data["smiles"], enforce_dtype=True)[0]
print(X.shape, len(trans1.columns))

(1024,) 1024


Let's check the effect of filtering out bits that are very rarely activated with `FPVecFilteredTransformer`.

In [8]:
trans2 = FPVecFilteredTransformer(
    kind="fcfp:6", length=1024, useBondTypes=False, del_invariant=False, occ_threshold=0
)
# with no filtering (del_invariant=False, occ_threshold=0), the output should match the original FPVecTransformer
X2 = trans2(data["smiles"], enforce_dtype=True)[0]
np.all(X == X2)

True

In [9]:
# here we want to remove invariant columns and bits that are activated in less than 1% of the molecules
trans2 = FPVecFilteredTransformer(
    kind="fcfp:6", length=1024, useBondTypes=False, del_invariant=True, occ_threshold=0.01
)

# Since the transformer has not been fitted yet, the output is unchanged:
# there are no fitted parameters to locate rare or invariant bits in a training set
X2 = trans2(data["smiles"], enforce_dtype=True)[0]
np.all(X == X2)

True

In [10]:
# Let's fit the transformer first instead
# we will use a random sample from freesolv that is a bit larger than, and not identical to, our test set

train_data = dm.data.freesolv().sample(n=600)["smiles"]
trans2.fit(train_data)
X2 = trans2(data["smiles"], enforce_dtype=True)[0]
print(X2.shape, len(trans2.columns))

(219,) 219
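
The drop from 1024 to 219 columns comes from state learned in `fit`: the set of positions to keep is computed on the training molecules and then applied to any later input. The toy class below illustrates that pattern under simplifying assumptions (occurrence filtering only, plain lists as fingerprints). `ToyFilteredTransformer` is a stand-in written for this example, not molfeat code.

```python
class ToyFilteredTransformer:
    """Toy stand-in showing why fitting changes the output size."""

    def __init__(self, occ_threshold=0.0):
        self.occ_threshold = occ_threshold
        self.keep_ = None  # learned in fit(); None means "not fitted yet"

    def fit(self, fps):
        n_rows = len(fps)
        # keep the positions that are non-zero often enough in the training set
        self.keep_ = [
            j
            for j in range(len(fps[0]))
            if sum(1 for row in fps if row[j] != 0) / n_rows >= self.occ_threshold
        ]
        return self

    def transform(self, fps):
        if self.keep_ is None:
            # nothing learned yet: pass the fingerprints through unchanged
            return [list(row) for row in fps]
        return [[row[j] for j in self.keep_] for row in fps]


train = [[1, 0, 0], [1, 1, 0], [0, 1, 0]]
test = [[1, 1, 1]]

toy = ToyFilteredTransformer(occ_threshold=0.5)
before = toy.transform(test)  # unfitted: all 3 positions are returned
after = toy.fit(train).transform(test)  # fitted: position 2 is never set in train
print(len(before[0]), len(after[0]))  # 3 2
```

Note that the mask depends only on the training set: position 2 is set in the test molecule but is still removed.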


Parallelize the transformation

In [11]:
# Parallelization by setting n_jobs
trans3 = FPVecTransformer(kind="pharm2D", length=2048, n_jobs=4)
X3 = trans3(data["smiles"], enforce_dtype=True)[0]
print(X3.shape, len(trans3.columns))

(2048,) 2048


## Performing grid search with the specialized fingerprint transformers

We will work on the same task again. The goal here is to search over a hyperparameter space of the pipeline that includes our featurizer parameters.

This requires a featurizer compatible with scikit-learn, which takes some gymnastics to get working.
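
In practice, "compatible with scikit-learn" largely means that the featurizer exposes `get_params` and `set_params`, so that a pipeline can forward keys such as `feat__kind` to the step named `feat`. The sketch below mimics that routing in plain Python. `ToyFeaturizer` and `route_params` are hypothetical names, and the real `Pipeline.set_params` handles more cases (nested estimators, validation, replacing whole steps).

```python
class ToyFeaturizer:
    """Toy object exposing the two methods scikit-learn relies on for tuning."""

    def __init__(self, kind="rdkit", length=2048):
        self.kind = kind
        self.length = length

    def get_params(self, deep=True):
        # constructor arguments, as scikit-learn expects them
        return {"kind": self.kind, "length": self.length}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


def route_params(steps, **params):
    """Dispatch 'step__param' keys to the matching named step (simplified)."""
    for key, value in params.items():
        step_name, _, param_name = key.partition("__")
        steps[step_name].set_params(**{param_name: value})
    return steps


steps = {"feat": ToyFeaturizer()}
route_params(steps, feat__kind="fcfp:6", feat__length=512)
print(steps["feat"].get_params())  # {'kind': 'fcfp:6', 'length': 512}
```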

In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [13]:
df = dm.data.freesolv()
X, y = df["smiles"], df["expt"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [14]:
feat = FPVecTransformer(kind="rdkit")
pipe = Pipeline(
    [("feat", feat), ("scaler", StandardScaler()), ("rf", RandomForestRegressor(n_estimators=100))]
)

param_grid = dict(
    feat__kind=["fcfp:6", "rdkit", "pharm2D"], feat__length=[512, 1024], rf__n_estimators=[100, 500]
)
grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1)

grid_search.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('feat', FPVecTransformer(kind='rdkit')),
                                       ('scaler', StandardScaler()),
                                       ('rf', RandomForestRegressor())]),
             n_jobs=-1,
             param_grid={'feat__kind': ['fcfp:6', 'rdkit', 'pharm2D'],
                         'feat__length': [512, 1024],
                         'rf__n_estimators': [100, 500]})
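
To get a sense of the cost of this search, the grid can be enumerated by hand. The snippet below assumes 5-fold cross-validation, which is the default `cv` in recent scikit-learn releases. It only counts candidates and does not run any model.

```python
from itertools import product

param_grid = {
    "feat__kind": ["fcfp:6", "rdkit", "pharm2D"],
    "feat__length": [512, 1024],
    "rf__n_estimators": [100, 500],
}

keys = list(param_grid)
# one dict per combination of values, like the candidates a grid search evaluates
candidates = [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]

n_folds = 5  # assumption: default cv of GridSearchCV in recent scikit-learn releases
print(len(candidates), len(candidates) * n_folds)  # 12 60
```

Each of the 12 candidates is fitted once per fold, and the molecules are featurized again for every fit because the featurizer is a step of the pipeline.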

In [15]:
grid_search.score(X_test, y_test)

0.8405142255762323

In [16]:
grid_search.best_estimator_

Pipeline(steps=[('feat', FPVecTransformer(kind='rdkit', length=1024)),
                ('scaler', StandardScaler()), ('rf', RandomForestRegressor())])