-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding grover featurizer #3138
Adding grover featurizer #3138
Conversation
e9b43a0
to
5d52672
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I think we should figure out how to integrate the one hot encoder into the moleculer_featurizer one hot encoder. There's already 2 one hot encoders right now in deepchem. Also had a few questions that maybe a comment or little more detail would clarify.
] | ||
} | ||
|
||
_BOND_FDIM = 14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No atom feature dim needed? Since it's not the same as len(_ATOM_FEATURES). It may be necessary when we do batching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest change, I reused utilities from dmpnn featurizer. This removed the need for BOND_FDIM as well as ATOM_FDIM in grover featurizer. The grover featurizer is similar to DMPNNFeaturizer. In longer run, I think they both should be unified.
as input and computes the following sets of features: | ||
1. a molecular graph from the input molecule | ||
2. functional groups which are used **only** during pretraining | ||
3. additional features which can **only** be used during finetuning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't the optional featurers be used in pretraining?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pretraining task depends only on a specific set of features from the molecule - the function groups, the atom vocabulary and bond vocabulary. Hence, optional features are not required in pretraining. Note: Atom vocabulary and bond vocabulary are generated at the time of training and not included in this PR.
(The implementation for normalization is based on `RDKit2DNormalized()` method | ||
in 'descriptastorus' library.) | ||
|
||
The neural network architecture requires that the features are appropriately scaled to prevent | ||
features with large ranges from dominating smaller ranged features, as well as preventing | ||
issues where features in the training set are not drawn from the same sample distribution as | ||
features in the testing set. To prevent these issues, a large sample of molecules is used to fit | ||
cumulative density functions (CDFs) to all features. | ||
|
||
CDFs were used as opposed to simpler scaling algorithms mainly because CDFs have the useful | ||
property that 'each value has the same meaning: the percentage of the population observed below | ||
the raw feature value.' | ||
When the `is_normalized` option is set as True, descriptor values are normalized across the sample | ||
by fitting a cumulative density function. CDFs were used as opposed to simpler scaling algorithms | ||
mainly because CDFs have the useful property that 'each value has the same meaning: the percentage | ||
of the population observed below the raw feature value.' | ||
|
||
Warning: Currently, the normalizing cdf parameters are not available for BCUT2D descriptors. | ||
(BCUT2D_MWHI, BCUT2D_MWLOW, BCUT2D_CHGHI, BCUT2D_CHGLO, BCUT2D_LOGPHI, BCUT2D_LOGPLOW, BCUT2D_MRHI, BCUT2D_MRLOW) | ||
|
||
Attributes | ||
---------- | ||
descriptors: List[str] | ||
List of RDKit descriptor names used in this class. | ||
|
||
Note |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some info about enabling custom descriptors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make it in a different PR, improving overall docs.
5d52672
to
19b53d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to me. Good to merge in once @tonydavis629's comments on the thread have been addressed
A note to self: I need to fix mypy error and a failing unit test before merging. |
Fixed mypy and failing unit test. Going to merge this in @rbharath as this has been hanging around for some time now. |
In this PR, I am adding grover featurizer to suite of featurizers in deepchem.