Create a class to calculate molecular descriptors using fit and transform logic #3

Closed
miquelduranfrigola opened this issue Feb 24, 2024 · 7 comments

@miquelduranfrigola
Member

We should calculate molecular descriptors. The most important thing is that these descriptors are well known. I suggest that we use Datamol descriptors: https://docs.datamol.io/0.9.1/tutorials/Descriptors.html, in particular, this function:

import datamol as dm
# Batch compute many descriptors for a list of compounds
df = dm.descriptors.batch_compute_many_descriptors(mols)

We then have to define a Descriptor class that has a fit and a transform method. At fit time, constant values should be removed, missing values imputed and, most importantly, all values should be normalized or scaled somehow. The following code from another Ersilia project may help:

It would be nice to include a save and a load method as well. We can use joblib for this.
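
The fit and transform behaviour described above (drop constant columns, impute missing values, scale), together with joblib persistence, could look roughly like the sketch below. It is not the Ersilia code mentioned above. The class name `DescriptorScaler`, the median imputation, and the z-score scaling are assumptions chosen for illustration, and the input is assumed to be a purely numeric descriptor matrix.

```python
import joblib
import numpy as np


class DescriptorScaler:
    """Hypothetical helper that cleans and scales a descriptor matrix."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Per-column medians for imputation; all-NaN columns fall back to 0.0.
        has_data = ~np.isnan(X).all(axis=0)
        self.medians_ = np.zeros(X.shape[1])
        self.medians_[has_data] = np.nanmedian(X[:, has_data], axis=0)
        X = np.where(np.isnan(X), self.medians_, X)
        # Keep only columns that vary across the training set.
        stds = X.std(axis=0)
        self.keep_ = stds > 0
        self.means_ = X[:, self.keep_].mean(axis=0)
        self.stds_ = stds[self.keep_]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Reuse the medians, column mask, means and stds learned at fit time.
        X = np.where(np.isnan(X), self.medians_, X)
        return (X[:, self.keep_] - self.means_) / self.stds_

    def save(self, path):
        # Persist only the fitted arrays, not the object itself.
        joblib.dump(
            {"medians": self.medians_, "keep": self.keep_,
             "means": self.means_, "stds": self.stds_},
            path,
        )

    @classmethod
    def load(cls, path):
        state = joblib.load(path)
        obj = cls()
        obj.medians_ = state["medians"]
        obj.keep_ = state["keep"]
        obj.means_ = state["means"]
        obj.stds_ = state["stds"]
        return obj
```

Usage would be along the lines of `scaler = DescriptorScaler().fit(X_train)` followed by `scaler.transform(X_new)`. Saving only the fitted arrays keeps the stored file independent of later changes to the class; infinite values are not handled in this sketch.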

@HellenNamulinda
Collaborator

Hi @miquelduranfrigola,
By default, datamol's compute_many_descriptors (single molecule) and batch_compute_many_descriptors (list of molecules) compute 22 opinionated molecular properties:

mw
fsp3
n_lipinski_hba
n_lipinski_hbd
n_rings
n_hetero_atoms
n_heavy_atoms
n_rotatable_bonds
n_radical_electrons
tpsa
qed
clogp
sas
n_aliphatic_carbocycles
n_aliphatic_heterocyles
n_aliphatic_rings
n_aromatic_carbocycles
n_aromatic_heterocyles
n_aromatic_rings
n_saturated_carbocycles
n_saturated_heterocyles
n_saturated_rings

This seems like a good number of features.
But do you think these features are enough, or should we add other properties to calculate? I can modify the Descriptor class (#5) to cater for that in the pipeline.

@miquelduranfrigola
Member Author

Hi @HellenNamulinda, let's start with these 22 features. Most likely, they won't be enough to achieve the best possible models, but let's start here. As discussed, let's bring the pipeline to the end, and then we will come back to this point and do some feature engineering! But for now, let's work based on this.

@HellenNamulinda
Collaborator

HellenNamulinda commented Mar 19, 2024

Just an update on this issue.

We are first completing the pipeline end to end, using the datamol descriptors (22 features).
To improve the performance of the model, we will:

  • Increase the number of features (properties calculated by the datamol descriptor class).
  • Try other descriptors, such as Mordred descriptors.
  • Perhaps also try Morgan fingerprints.

That's why this issue will remain open until we have explored the different descriptors.

@miquelduranfrigola
Member Author

Thanks @HellenNamulinda - great summary.

@HellenNamulinda
Collaborator

HellenNamulinda commented Apr 10, 2024

While it is possible to increase the number of features in datamol, it can be quite cumbersome when calculating more properties. It requires passing a dictionary (properties_fn) containing functions (from RDKit) for all the additional features/descriptors to be calculated.

So, we have decided to use RDKit instead to get all the descriptors.
Mordred descriptors were also added.
This makes it possible to choose which descriptors to use (Datamol, Mordred or the RDKit classical descriptors).

With more features from RDKit and Mordred, FeatureWiz is now used to select the top features.
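
To give an idea of what selecting features from a large descriptor set involves, here is a deliberately simple correlation-based filter in NumPy. It is a stand-in for illustration only and does not reproduce FeatureWiz's algorithm or API; the function name `select_uncorrelated` and the `max_corr` and `top_k` parameters are hypothetical. It assumes constant columns have already been removed.

```python
import numpy as np


def select_uncorrelated(X, y, max_corr=0.9, top_k=None):
    """Greedy filter: rank features by |corr with y|, skip redundant ones.

    Hypothetical helper, not part of FeatureWiz or any library.
    Returns the indices of the selected columns of X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    # Absolute Pearson correlations; the target is the last row/column.
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    relevance = corr[:n_features, -1]
    kept = []
    # Visit features from most to least correlated with the target.
    for j in np.argsort(-relevance):
        # Keep a feature only if it is not too similar to any kept feature.
        if all(corr[j, k] < max_corr for k in kept):
            kept.append(int(j))
        if top_k is not None and len(kept) == top_k:
            break
    return kept
```

The indices returned can be used to subset the descriptor matrix, for example `X[:, select_uncorrelated(X, y)]`.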

PR: #9

@miquelduranfrigola
Member Author

This sounds good, @HellenNamulinda, thanks so much. This gives us a good starting point with three different types of descriptors:

  • Small (datamol)
  • Mid-size (RDKit)
  • Large (Mordred)

Let's not look for more descriptor types for now. Conceptually, this is all we need. We can look for other options in the future, but with these three we are well covered for now.

@HellenNamulinda
Collaborator

The Morgan fingerprint features were integrated in this commit (d2ff580).
