Create a class to calculate molecular descriptors using `fit` and `transform` logic #3

miquelduranfrigola · 2024-02-24T07:39:00Z

We should calculate molecular descriptors. The most important thing is that these descriptors are well known. I suggest that we use Datamol descriptors: https://docs.datamol.io/0.9.1/tutorials/Descriptors.html, in particular, this function:

# Batch compute many descriptors for a list of compounds
df = dm.descriptors.batch_compute_many_descriptors(mols)

We then have to define a Descriptor class that has a fit and a transform method. At fit time, constant-valued descriptors should be removed, missing values imputed and, most importantly, all values normalized or scaled somehow. The following code from another Ersilia project may help:

In this case, values are binned (5 bins; we could do up to 10). We obviously lose information, but it makes the dataset very easy to deal with. I like this approach a lot: https://github.com/ligand-discovery/fragment-embedding/blob/main/fragmentembedding/physchem_desc.py
Here, we apply a robust scaler. I prefer binning, but feel free to explore this option too: https://github.com/ligand-discovery/fragment-embedding/blob/main/fragmentembedding/mordred_desc.py
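As a rough illustration of the fit/transform idea described above (drop constant columns, impute missing values, then bin), here is a minimal NumPy-only sketch. It is not the code from the linked repositories: the class name `BinnedDescriptorScaler`, median imputation and quantile-based bin edges are all assumptions made for the example.

```python
import numpy as np


class BinnedDescriptorScaler:
    """Hypothetical helper: drop constant columns, impute medians, quantile-bin."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Keep only columns with more than one distinct non-missing value
        self.keep_ = np.array(
            [np.unique(col[~np.isnan(col)]).size > 1 for col in X.T]
        )
        X = X[:, self.keep_]
        # Column medians (ignoring NaNs) are stored for imputation
        self.medians_ = np.nanmedian(X, axis=0)
        X = np.where(np.isnan(X), self.medians_, X)
        # Interior quantile cut points: n_bins - 1 edges per column
        qs = np.linspace(0, 1, self.n_bins + 1)[1:-1]
        self.edges_ = np.quantile(X, qs, axis=0).T
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)[:, self.keep_]
        X = np.where(np.isnan(X), self.medians_, X)
        # np.digitize maps each value to a bin index in 0..n_bins-1
        return np.column_stack(
            [np.digitize(X[:, j], self.edges_[j]) for j in range(X.shape[1])]
        )
```

The bin edges and medians are learned only in `fit`, so `transform` can be applied to new compounds without leaking information from them. Swapping the binning step for a robust scaler would only change what is stored in `fit` and applied in `transform`.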

It would be nice to include a save and a load method as well. We can use joblib for this.
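A possible shape for the save and load methods, using `joblib.dump` and `joblib.load`. The `ToyDescriptor` class and its attributes are hypothetical stand-ins for the real descriptor class; storing a plain dictionary of fitted state, rather than the object itself, is a design assumption of this sketch.

```python
import joblib


class ToyDescriptor:
    """Hypothetical stand-in for the fitted descriptor class."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins
        self.medians_ = None

    def fit(self, rows):
        # Middle element of each sorted column (upper median for even counts)
        cols = list(zip(*rows))
        self.medians_ = [sorted(c)[len(c) // 2] for c in cols]
        return self

    def save(self, path):
        # Persist only plain state, so the file does not depend on class pickling
        joblib.dump({"n_bins": self.n_bins, "medians_": self.medians_}, path)

    @classmethod
    def load(cls, path):
        # Rebuild the object from the stored state
        state = joblib.load(path)
        obj = cls(n_bins=state["n_bins"])
        obj.medians_ = state["medians_"]
        return obj
```

Saving the state rather than the instance keeps the file readable even if the class is later moved or renamed, at the cost of having to list the fitted attributes explicitly.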


HellenNamulinda · 2024-02-29T06:12:18Z

Hi @miquelduranfrigola,
By default, datamol's compute_many_descriptors (single molecule) and batch_compute_many_descriptors (list of molecules) compute 22 opinionated molecular properties:

mw
fsp3
n_lipinski_hba
n_lipinski_hbd
n_rings
n_hetero_atoms
n_heavy_atoms
n_rotatable_bonds
n_radical_electrons
tpsa
qed
clogp
sas
n_aliphatic_carbocycles
n_aliphatic_heterocyles
n_aliphatic_rings
n_aromatic_carbocycles
n_aromatic_heterocyles
n_aromatic_rings
n_saturated_carbocycles
n_saturated_heterocyles
n_saturated_rings

I think this is a reasonable number of features.
But do you think these features are enough, or should we add other properties to calculate? I can modify the Descriptor class (#5) to cater for that in the pipeline.

miquelduranfrigola · 2024-02-29T07:18:06Z

Hi @HellenNamulinda , let's start with these 22 features. Most likely, it won't be enough to achieve the best-possible models, but let's start here. As discussed, let's bring it to the end, and then we will go back to this point and do some feature engineering! But for now, let's work based on this.

HellenNamulinda · 2024-03-19T08:48:25Z

Just an update on this issue.

We are first completing the pipeline end to end, using the datamol descriptors (22 features).
To improve the performance of the model, we will:

Increase the number of features (properties calculated by the datamol descriptor class).
Try other descriptors, such as Mordred descriptors.
And perhaps Morgan fingerprints.

That's why this issue will stay open until we have explored the different descriptors.

miquelduranfrigola · 2024-03-20T18:29:51Z

Thanks @HellenNamulinda - great summary.

HellenNamulinda · 2024-04-10T09:14:04Z

While it is possible to increase the number of features in datamol, doing so is cumbersome when calculating many more properties. It requires passing a dictionary (properties_fn) containing functions (from RDKit) for all the additional features/descriptors to be calculated.

So, we have decided to use RDKit directly to get all the descriptors.
Mordred descriptors were also added.
This makes it possible to choose which descriptors to use (Datamol, Mordred or the RDkitClassical Descriptors).

With more features from RDKit and Mordred, FeatureWiz is now used to select the top features.
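For reference, one way to get the full RDKit descriptor set for a molecule is to loop over the (name, function) pairs that RDKit registers in `Descriptors._descList`. This is a sketch under that assumption, not necessarily how PR #9 does it; the helper name `rdkit_descriptors` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def rdkit_descriptors(smiles):
    """Hypothetical helper: return a {name: value} dict of RDKit descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Descriptors._descList holds (name, function) pairs for the classical descriptors
    return {name: fn(mol) for name, fn in Descriptors._descList}


values = rdkit_descriptors("CCO")
print(len(values), values["MolWt"])
```

The exact number of descriptors depends on the installed RDKit version, so downstream code should rely on the column names rather than on a fixed count.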

PR: #9

miquelduranfrigola · 2024-04-10T10:14:20Z

This sounds good, @HellenNamulinda , thanks so much. This gives us a good starting point with three different types of descriptors:

Small (datamol)
Mid-size (RDKit)
Large (Mordred)

Let's not look for more descriptor types for now. Conceptually, this is all we need. We can look for other options in the future but with these three we are well covered for now.

HellenNamulinda · 2024-05-14T05:50:40Z

The Morgan fingerprint features were integrated in this commit (d2ff580).
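For readers unfamiliar with Morgan fingerprints, a minimal sketch of computing them as a bit matrix with RDKit's fingerprint generator follows. The radius of 2 and 2048 bits are common defaults assumed here, not values taken from the commit, and the function name `morgan_bits` is hypothetical.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def morgan_bits(smiles_list, radius=2, n_bits=2048):
    """Hypothetical helper: one row of 0/1 Morgan fingerprint bits per molecule."""
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi!r}")
        # Fixed-length bit vector returned as a NumPy array
        rows.append(gen.GetFingerprintAsNumPy(mol))
    return np.vstack(rows)


X = morgan_bits(["CCO", "c1ccccc1"])
print(X.shape)
```

Unlike the physicochemical descriptors, these bits are already on a common 0/1 scale, so the binning or scaling step discussed earlier would not normally be applied to them.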

miquelduranfrigola assigned HellenNamulinda Feb 24, 2024

miquelduranfrigola mentioned this issue Feb 24, 2024: Model interpretability roadmap ersilia-os/zaira-chem#35 (Open, 13 tasks)

HellenNamulinda closed this as completed May 14, 2024
