In [1]:
%load_ext autoreload
%autoreload 2

## Define your own calculator
Remember that a calculator is simply a `callable` that takes a molecule as input (either an RDKit `Chem.Mol` object or a SMILES string) and returns its features, here as a list of numbers.
We can thus easily define our own calculator!

In [2]:
import numpy as np
import datamol as dm

from molfeat.trans import MoleculeTransformer
from rdkit.Chem.rdMolDescriptors import CalcNumHeteroatoms

def my_calculator(mol):
    """My custom featurizer"""
    mol = dm.to_mol(mol)
    rng = np.random.default_rng(0)
    return [mol.GetNumAtoms(), mol.GetNumBonds(), CalcNumHeteroatoms(mol), rng.random()]

# This directly works with the MoleculeTransformer
mol_transf = MoleculeTransformer(my_calculator)
mol_transf(["CN1C=NC2=C1C(=O)N(C(=O)N2C)C"])

[array([14.        , 15.        ,  6.        ,  0.63696169])]
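The calculator contract can also be illustrated without any chemistry dependencies. The following is a minimal sketch, not part of Molfeat: `toy_calculator` only counts characters in the SMILES string (it does not parse it, so it would miscount atoms such as `Cl`), and `apply_calculator` is a hypothetical helper that mimics applying a callable to every input and checking that all feature vectors have the same length.

```python
from typing import Callable, List, Sequence


def toy_calculator(smiles: str) -> List[float]:
    """Toy stand-in for a calculator: counts characters, does NOT parse SMILES."""
    n_upper = sum(ch.isupper() for ch in smiles)  # rough heavy-atom count
    n_hetero = sum(smiles.count(sym) for sym in ("N", "O"))  # rough N/O count
    return [float(n_upper), float(n_hetero)]


def apply_calculator(
    calc: Callable[[str], Sequence[float]], inputs: Sequence[str]
) -> List[List[float]]:
    """Hypothetical helper: call `calc` on each input and check row lengths agree."""
    rows = [list(calc(x)) for x in inputs]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("calculator returned vectors of different lengths")
    return rows


features = apply_calculator(toy_calculator, ["CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "CCO"])
print(features)
```

Any callable with the same shape (one molecule in, one fixed-length vector out) can be swapped in for `toy_calculator`.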

If such a function becomes more complex, it may be easier to wrap it in a class.
This also ensures the calculator remains serializable.

In [3]:
from molfeat.calc import SerializableCalculator


class MyCalculator(SerializableCalculator):
    def __call__(self, mol):
        mol = dm.to_mol(mol)
        rng = np.random.default_rng(0)
        return [mol.GetNumAtoms(), mol.GetNumBonds(), CalcNumHeteroatoms(mol), rng.random()]


mol_transf = MoleculeTransformer(MyCalculator())
mol_transf(["CN1C=NC2=C1C(=O)N(C(=O)N2C)C"])

[array([14.        , 15.        ,  6.        ,  0.63696169])]
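To make the idea of a serializable calculator concrete, here is a generic sketch of a state round trip. It is an assumption-laden illustration and does not show how `SerializableCalculator` works internally: the class `CountingCalculator` and the methods `to_state` and `from_state` are hypothetical names chosen for this example. The point is that a class whose behaviour is fully determined by its constructor arguments can be rebuilt from a small, JSON-friendly description.

```python
import json


class CountingCalculator:
    """Hypothetical calculator fully described by its constructor arguments."""

    def __init__(self, symbols=("N", "O")):
        self.symbols = tuple(symbols)

    def __call__(self, smiles: str):
        # Toy feature: one character count per configured symbol
        return [smiles.count(sym) for sym in self.symbols]

    def to_state(self) -> dict:
        # Everything needed to rebuild the calculator, in JSON-friendly form
        return {"name": type(self).__name__, "args": {"symbols": list(self.symbols)}}

    @classmethod
    def from_state(cls, state: dict):
        return cls(**state["args"])


calc = CountingCalculator(symbols=("N", "O"))
payload = json.dumps(calc.to_state())  # could be written to disk or shared
restored = CountingCalculator.from_state(json.loads(payload))

caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
print(restored(caffeine) == calc(caffeine))
```

A plain function that closes over local variables offers no such description of its state, which is one reason a class can be the more convenient form.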

## Define your own transformer
The above example shows that in many cases, there's no direct need to create your own transformer class. You can simply use the `MoleculeTransformer` base class.
In more complex cases, such as pretrained models that benefit from batching, it is preferable to create your own subclass.

In [4]:
from sklearn.ensemble import RandomForestRegressor
from molfeat.trans.pretrained import PretrainedMolTransformer


class MyFoundationModel(PretrainedMolTransformer):
    """
    In this dummy example, we train a RF model to predict the cLogP
    then use the feature importance of the RF model as the embedding.
    """

    def __init__(self):
        super().__init__(dtype=np.float32)
        self._featurizer = MoleculeTransformer("maccs", dtype=np.float32)
        self._model = RandomForestRegressor()
        self.train_dummy_model()

    def train_dummy_model(self):
        """
        Load the pretrained model.
        In this dummy example, we train a RF model to predict the cLogP
        """
        data = dm.data.freesolv().smiles.values
        X = self._featurizer(data)
        y = np.array([dm.descriptors.clogp(dm.to_mol(smi)) for smi in data])
        self._model.fit(X, y)

    def _convert(self, inputs: list, **kwargs):
        """Convert the molecule to a format that the model expects"""
        return self._featurizer(inputs)

    def _embed(self, mols: list, **kwargs):
        """
        Embed the molecules using the pretrained model
        In this dummy example, we simply multiply the features by the importance of the feature
        """
        return [feats * self._model.feature_importances_ for feats in mols]

In [5]:
mol_transf = MyFoundationModel()
mol_transf(["CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]).shape

(1, 167)
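The split between a conversion step and an embedding step, and the reason batching can help, can be sketched in plain Python. This is an illustration under stated assumptions rather than Molfeat's implementation: `ToyBatchedEmbedder`, `batched`, and the character-count features are all hypothetical, and the weights stand in for the feature importances used above.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class ToyBatchedEmbedder:
    """Hypothetical two-stage featurizer: convert inputs, then embed them per batch."""

    def __init__(self, weights: Sequence[float], batch_size: int = 2):
        self.weights = list(weights)
        self.batch_size = batch_size
        self.n_model_calls = 0  # counts how often the "model" is invoked

    def convert(self, smiles_list: Sequence[str]) -> List[List[float]]:
        # Toy model input: string length and number of 'O' characters
        return [[float(len(s)), float(s.count("O"))] for s in smiles_list]

    def embed(self, feature_rows: List[List[float]]) -> List[List[float]]:
        # One "model call" handles a whole batch: weight each feature column
        self.n_model_calls += 1
        return [[x * w for x, w in zip(row, self.weights)] for row in feature_rows]

    def __call__(self, smiles_list: Sequence[str]) -> List[List[float]]:
        out: List[List[float]] = []
        for chunk in batched(smiles_list, self.batch_size):
            out.extend(self.embed(self.convert(chunk)))
        return out


embedder = ToyBatchedEmbedder(weights=[0.5, 2.0], batch_size=2)
embeddings = embedder(["CCO", "CC", "O"])
print(embeddings, embedder.n_model_calls)
```

With three inputs and a batch size of two, the embedding step runs twice instead of three times, which matters when each model call carries a fixed overhead.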

Here is another example, which shows how to extend Molfeat with an existing embedding language model for astrochemistry.

```bash
pip install astrochem_embedding
```

In [6]:
import torch
import datamol as dm

from astrochem_embedding import VICGAE
from molfeat.trans.pretrained import PretrainedMolTransformer

class MyAstroChemFeaturizer(PretrainedMolTransformer):
    """
    In this more practical example, we use embeddings from VICGAE, a variance-invariance-covariance
    regularized GRU autoencoder trained on SELFIES strings.
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)        
        self.featurizer = VICGAE.from_pretrained()
    
    def _embed(self, smiles, **kwargs):
        return [self.featurizer.embed_smiles(x) for x in smiles]

transformer = MyAstroChemFeaturizer(dtype=torch.float)
transformer(dm.freesolv()["smiles"][:10]).shape


torch.Size([10, 32])

## Add it to your Model Store
Molfeat has a Model Store to publish your models in a centralized location.
The default is a read-only GCP bucket, but you can replace it with your own file storage. This is useful, for example, for sharing private featurizers with your team.

In [7]:
import platformdirs
from molfeat.store.modelstore import ModelStore
from molfeat.store import ModelInfo

path = dm.fs.join(platformdirs.user_cache_dir("molfeat"), "custom_model_store")
store = ModelStore(model_store_bucket=path)
len(store.available_models)

0

In [8]:
# Let's define our model's info
info = ModelInfo(
    name="my_foundation_model",
    inputs="smiles",
    type="pretrained",
    group="my_group",
    version=0,
    submitter="Datamol",
    description="Solves chemistry!",
    representation="vector",
    require_3D=False,
    tags=["foundation_model", "random_forest"],
    authors=["Datamol"],
    reference="/fake/ref",
)

store.register(info)
store.available_models

  0%|          | 0.00/4.00 [00:00<?, ?B/s]

2023-04-01 15:10:14.556 | INFO     | molfeat.store.modelstore:register:124 - Successfuly registered model my_foundation_model !


[ModelInfo(name='my_foundation_model', inputs='smiles', type='pretrained', version=0, group='my_group', submitter='Datamol', description='Solves chemistry!', representation='vector', require_3D=False, tags=['foundation_model', 'random_forest'], authors=['Datamol'], reference='/fake/ref', created_at=datetime.datetime(2023, 4, 1, 15, 10, 14, 527157), sha256sum='9c298d589a2158eb513cb52191144518a2acab2cb0c04f1df14fca0f712fa4a1')]

## Share with the community
We invite you to share your featurizers with the community to help progress the field.
To learn more, visit [the developer documentation](../developers/create-plugin.html).
