In [1]:
%load_ext autoreload
%autoreload 2

## One featurizer to rule them all?
Contrary to many other machine learning domains, _molecular_ featurization (i.e. the process of transforming a molecule into a vector) lacks a good default. It remains unclear how we can effectively capture the richness of molecular data in a unified representation and what works best heavily depends on the nature and constraints of the task you are trying to model. It is therefore good practice to try different featurization schemes: From structural fingerprints, to physico-chemical descriptors and pre-trained embeddings.

## Don't take our word for it
To demonstrate the impact a featurizer can have, we setup two simple benchmarks.
1. To demonstrate the impact on modeling, we will use two datasets from [MoleculeNet](https://moleculenet.org/datasets-1).
2. To demonstrate the impact on search, we will use the [RDKit Benchmarking Platform](https://github.com/rdkit/benchmarking_platform).

We will compare the performance of three different featurizers:
- **ECFP6** [1]: Binary, circular fingerprints where each bit indicates the presence of particular substructures of a radius up to 3 bonds away from an atom.
- **Mordred** [2]: Continuous descriptors with more than 1800 2D and 3D descriptors.
- **ChemBERTa** [3]: Learned representations from a pre-trained SMILES transformer model.

### Modeling
We will compare the performance on two datasets using scikit-learn [AutoML](https://github.com/automl/auto-sklearn) [4, 5] models.

In [2]:
import os
import numpy as np
import pandas as pd
import datamol as dm
import autosklearn.classification
import autosklearn.regression
from sklearn.metrics import mean_absolute_error, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit
from rdkit.Chem import SaltRemover

from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer

In [22]:
def load_dataset(uri: str, readout_col: str):
    """Loads the MoleculeNet dataset"""
    df = pd.read_csv(uri)
    smiles = df["smiles"].values
    y = df[readout_col].values
    return smiles, y


def preprocess_smiles(smi):
    """Preprocesses the SMILES string"""
    with dm.without_rdkit_log():
        mol = dm.to_mol(smi, ordered=True, sanitize=False)
        mol = dm.sanitize_mol(mol)
        if mol is None: 
            return
        
        mol = dm.standardize_mol(mol, disconnect_metals=True)
        remover = SaltRemover.SaltRemover()
        mol = remover.StripMol(mol, dontRemoveEverything=True)

    return dm.to_smiles(mol)


def scaffold_split(smiles):
    """In line with common practice, we will use the scaffold split to evaluate our models"""
    scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(dm.to_mol(smi))) for smi in smiles]
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    return next(splitter.split(smiles, groups=scaffolds))


In [4]:
# Setup the featurizers
trans_ecfp = FPVecTransformer(kind="ecfp:6", n_jobs=-1)
trans_mordred = FPVecTransformer(kind="mordred", replace_nan=True, n_jobs=-1)
trans_chemberta = PretrainedHFTransformer(kind='ChemBERTa-77M-MLM', notation='smiles')



#### Lipophilicity
Lipophilicity is a regression task with 4200 molecules

In [None]:
# Prepare the Lipophilicity dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv", "exp")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if smi != ""])

X = {
    "ECFP": trans_ecfp(smiles),
    "Mordred": trans_mordred(smiles),
    "ChemBERTa": trans_chemberta(smiles),
}

In [None]:
# To make the output less verbose: 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Train a model
train_ind, test_ind = scaffold_split(smiles)

scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.regression.AutoSklearnRegressor(
        memory_limit=24576, 
        time_left_for_this_task=3600,
        n_jobs=1
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict(feats[test_ind])
    
    # Evaluate
    mae = mean_absolute_error(y_true[test_ind], y_hat)
    scores[name] = mae

scores

#### BBBP
BBBP is a binary classification task with 2050 molecules

In [14]:
# Prepare the Lipophilicity dataset
smiles, y_true = load_dataset("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/BBBP.csv", "p_np")
smiles = np.array([preprocess_smiles(smi) for smi in smiles])
smiles = np.array([smi for smi in smiles if smi is not None])

X = {
    "ECFP": trans_ecfp(smiles),
    "Mordred": trans_mordred(smiles),
    "ChemBERTa": trans_chemberta(smiles),
}

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


  0%|          | 0/2039 [00:00<?, ?it/s]

  0%|          | 0/2039 [00:00<?, ?it/s]

In [21]:
# To make the output less verbose: 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Train a model
train_ind, test_ind = scaffold_split(smiles)

scores = {}
for name, feats in X.items():
    
    # Train
    automl = autosklearn.classification.AutoSklearnClassifier(
        memory_limit=24576, 
        time_left_for_this_task=3600,
        n_jobs=1
    )
    automl.fit(feats[train_ind], y_true[train_ind])
    
    # Predict and evaluate
    y_hat = automl.predict_proba(feats[test_ind])
    y_hat = np.max(y_hat, axis=-1)
    
    # Evaluate
    auroc = roc_auc_score(y_true[test_ind], y_hat)
    scores[name] = auroc

scores



{'ECFP': 0.6308048803556228,
 'Mordred': 0.6454648633311264,
 'ChemBERTa': 0.6367161638134872}

### Search
We will evaluate the performance on the search task using 

## Citations:
1. Rogers, D., & Hahn, M. (2010). Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5), 742-754.
2. Moriwaki, H., Tian, Y. S., Kawashita, N., & Takagi, T. (2018). Mordred: a molecular descriptor calculator. Journal of cheminformatics, 10(1), 1-14.
3. Chithrananda, S., Grand, G., & Ramsundar, B. (2020). Chemberta: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885.
4. Efficient and Robust Automated Machine Learning Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum and Frank Hutter Advances in Neural Information Processing Systems 28 (2015)
5. Auto-Sklearn 2.0: The Next Generation Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer and Frank Hutter* arXiv:2007.04074 [cs.LG], 2020
