Create a class to calculate molecular descriptors using `fit` and `transform` logic #3
Comments
Hi @miquelduranfrigola,
I see these can be a good number of features.
Hi @HellenNamulinda, let's start with these 22 features. Most likely, they won't be enough to achieve the best possible models, but let's start here. As discussed, let's bring the pipeline to the end, and then we will come back to this point and do some feature engineering! But for now, let's work based on this.
Just an update on this issue. We are first bringing the pipeline to an end, using the datamol descriptors (22 features).
That's why this issue will stay open until we have explored the different descriptors.
Thanks @HellenNamulinda, great summary.
While it is possible to increase the number of features in datamol, which can be quite handy when calculating more properties, it requires passing a dictionary (`properties_fn`) containing functions (from RDKit) for all the additional features/descriptors to be calculated. So, we have decided to use RDKit instead to get all the descriptors. With more features from RDKit and mordred, FeatureWiz is now used to select the top features. PR: #9
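For readers following along, the RDKit route mentioned above could look roughly like the sketch below. It assumes RDKit is installed and relies on `Descriptors.descList`, RDKit's list of (name, function) pairs. The helper name `compute_rdkit_descriptors` is made up for illustration and is not taken from the project's code.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def compute_rdkit_descriptors(smiles):
    """Hypothetical helper: return {descriptor_name: value} for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when the SMILES string cannot be parsed.
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Each entry of descList is a (name, function) pair; the function takes a Mol.
    return {name: fn(mol) for name, fn in Descriptors.descList}


if __name__ == "__main__":
    values = compute_rdkit_descriptors("CCO")  # ethanol
    print(len(values), "descriptors computed")
```

The number of descriptors returned depends on the installed RDKit version, so downstream code should not hard-code it.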
This sounds good, @HellenNamulinda, thanks so much. This gives us a good starting point with three different types of descriptors:
The Morgan Fingerprint features were integrated in this commit (d2ff580).
We should calculate molecular descriptors. The most important thing is that these descriptors are well known. I suggest that we use Datamol descriptors: https://docs.datamol.io/0.9.1/tutorials/Descriptors.html, in particular, this function:
We then have to define a `Descriptor` class that has a `fit` and a `transform` method. At fit time, constant values should be removed, missing values imputed and, most importantly, all values should be normalized or scaled somehow. The following code from another Ersilia project may help:

It would be nice if a `save` and a `load` method are included as well. We can use `joblib` for this.
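As a rough illustration of the requested interface (this is not the Ersilia code referred to above), a minimal sketch could look like the following. It assumes the descriptors have already been computed into a numeric matrix with one row per molecule. The choice of median imputation and standard scaling, as well as every attribute name, is an assumption made for the example.

```python
import numpy as np


class Descriptor:
    """Hypothetical sketch: clean and scale a precomputed descriptor matrix."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Column medians (ignoring NaNs) are stored to impute missing values.
        self.medians_ = np.nanmedian(X, axis=0)
        X = np.where(np.isnan(X), self.medians_, X)
        # Keep only non-constant columns; all-NaN columns are dropped too,
        # because their standard deviation is NaN and NaN > 0 is False.
        self.keep_ = X.std(axis=0) > 0
        X = X[:, self.keep_]
        # Statistics for standard scaling (zero mean, unit variance).
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        X = np.where(np.isnan(X), self.medians_, X)
        X = X[:, self.keep_]
        return (X - self.mean_) / self.std_

    def save(self, path):
        # Imported lazily so fit/transform work even without joblib installed.
        import joblib
        joblib.dump(self, path)

    @classmethod
    def load(cls, path):
        import joblib
        return joblib.load(path)


if __name__ == "__main__":
    X = np.array([[1.0, 5.0, np.nan], [2.0, 5.0, 4.0], [3.0, 5.0, 6.0]])
    desc = Descriptor().fit(X)
    print(desc.transform(X).shape)  # the constant middle column is removed
```

Fitting on the training set only and reusing the stored medians, column mask, and scaling statistics at transform time keeps new molecules on the same scale as the training data.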