Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding grover featurizer #3138

Merged
merged 5 commits into from
Jan 20, 2023
Merged

Adding grover featurizer #3138

merged 5 commits into from
Jan 20, 2023

Conversation

arunppsg
Copy link
Contributor

In this PR, I am adding grover featurizer to suite of featurizers in deepchem.

@arunppsg arunppsg marked this pull request as draft December 13, 2022 07:41
@arunppsg arunppsg force-pushed the grover_feat branch 2 times, most recently from e9b43a0 to 5d52672 Compare December 23, 2022 05:20
@arunppsg arunppsg marked this pull request as ready for review December 23, 2022 06:56
Copy link
Collaborator

@tonydavis629 tonydavis629 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I think we should figure out how to integrate the one hot encoder into the moleculer_featurizer one hot encoder. There's already 2 one hot encoders right now in deepchem. Also had a few questions that maybe a comment or little more detail would clarify.

deepchem/feat/molecule_featurizers/grover_featurizer.py Outdated Show resolved Hide resolved
]
}

_BOND_FDIM = 14
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No atom feature dim needed? Since it's not the same as len(_ATOM_FEATURES). It may be necessary when we do batching.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest change, I reused utilities from dmpnn featurizer. This removed the need for BOND_FDIM as well as ATOM_FDIM in grover featurizer. The grover featurizer is similar to DMPNNFeaturizer. In longer run, I think they both should be unified.

as input and computes the following sets of features:
1. a molecular graph from the input molecule
2. functional groups which are used **only** during pretraining
3. additional features which can **only** be used during finetuning
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't the optional featurers be used in pretraining?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pretraining task depends only on a specific set of features from the molecule - the function groups, the atom vocabulary and bond vocabulary. Hence, optional features are not required in pretraining. Note: Atom vocabulary and bond vocabulary are generated at the time of training and not included in this PR.

Comment on lines 26 to 37
(The implementation for normalization is based on `RDKit2DNormalized()` method
in 'descriptastorus' library.)

The neural network architecture requires that the features are appropriately scaled to prevent
features with large ranges from dominating smaller ranged features, as well as preventing
issues where features in the training set are not drawn from the same sample distribution as
features in the testing set. To prevent these issues, a large sample of molecules is used to fit
cumulative density functions (CDFs) to all features.

CDFs were used as opposed to simpler scaling algorithms mainly because CDFs have the useful
property that 'each value has the same meaning: the percentage of the population observed below
the raw feature value.'
When the `is_normalized` option is set as True, descriptor values are normalized across the sample
by fitting a cumulative density function. CDFs were used as opposed to simpler scaling algorithms
mainly because CDFs have the useful property that 'each value has the same meaning: the percentage
of the population observed below the raw feature value.'

Warning: Currently, the normalizing cdf parameters are not available for BCUT2D descriptors.
(BCUT2D_MWHI, BCUT2D_MWLOW, BCUT2D_CHGHI, BCUT2D_CHGLO, BCUT2D_LOGPHI, BCUT2D_LOGPLOW, BCUT2D_MRHI, BCUT2D_MRLOW)

Attributes
----------
descriptors: List[str]
List of RDKit descriptor names used in this class.

Note
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some info about enabling custom descriptors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make it in a different PR, improving overall docs.

Copy link
Member

@rbharath rbharath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to me. Good to merge in once @tonydavis629's comments on the thread have been addressed

@arunppsg
Copy link
Contributor Author

A note to self: I need to fix mypy error and a failing unit test before merging.

@arunppsg
Copy link
Contributor Author

A note to self: I need to fix mypy error and a failing unit test before merging.

Fixed mypy and failing unit test. Going to merge this in @rbharath as this has been hanging around for some time now.

@arunppsg arunppsg merged commit ba12d6e into deepchem:master Jan 20, 2023
@arunppsg arunppsg deleted the grover_feat branch April 11, 2023 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants