Extracting molecular coordinates for QM9 dataset from sdf files #2903

Merged
merged 10 commits into from May 27, 2022


arunppsg
Contributor

@arunppsg arunppsg commented Apr 24, 2022

.sdf files include molecular coordinates, which can be used by a wide range of models for predicting molecular properties. This PR updates the existing load_sdf_files function to extract molecular coordinates from .sdf files. MolGraphConvFeaturizer has been changed accordingly to accept molecular coordinates as kwargs.

Brief demonstration of the change: in the following code,

import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_qm9(reload=False, featurizer=dc.feat.MolGraphConvFeaturizer())
train = datasets[0]
train.X[0]

the current output is

GraphData(node_features=[17, 30], edge_index=[2, 34], edge_features=None, pos=[17, 3])

whereas earlier it was

GraphData(node_features=[17, 30], edge_index=[2, 34], edge_features=None)

Additionally, this PR adds tests for load_sdf_files and MolGraphConvFeaturizer covering the new features, renames the test data directory from data to assets, and applies yapf fixes.
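
For readers unfamiliar with the format, the coordinates mentioned above sit in the atom block of each .sdf record. The following is a minimal, standard-library-only sketch of reading them from a V2000 record. It is not the code in this PR: the helper names `parse_sdf_coordinates` and `_parse_atom_block` and the hand-written water record are assumptions made for illustration, and a real loader would normally rely on a cheminformatics toolkit rather than fixed-width slicing.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]

# Hand-written V2000 record for water, used only for illustration.
WATER_SDF = """\
water
  illustrative

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.1173 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  3  1  0
M  END
$$$$
"""


def _parse_atom_block(record: List[str]) -> List[Atom]:
    """Read (symbol, x, y, z) tuples from one V2000 record (hypothetical helper)."""
    # Line 4 of a record is the counts line; its first 3 characters hold the atom count.
    n_atoms = int(record[3][0:3])
    atoms = []
    for line in record[4:4 + n_atoms]:
        # x, y and z occupy three fixed-width fields of 10 characters each.
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        symbol = line[31:34].strip()
        atoms.append((symbol, x, y, z))
    return atoms


def parse_sdf_coordinates(sdf_text: str) -> List[List[Atom]]:
    """Split SDF text on the '$$$$' record delimiter and parse each atom block."""
    molecules = []
    record: List[str] = []
    for line in sdf_text.splitlines():
        if line.startswith("$$$$"):
            if len(record) >= 4:
                molecules.append(_parse_atom_block(record))
            record = []
        else:
            record.append(line)
    return molecules


molecules = parse_sdf_coordinates(WATER_SDF)
for symbol, x, y, z in molecules[0]:
    print(symbol, x, y, z)
```

Each molecule comes back as a list with one entry per atom, which corresponds to the per-molecule [n_atoms, 3] shape that appears as pos in the GraphData output shown earlier.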

@arunppsg arunppsg marked this pull request as draft April 24, 2022 14:16
Member

@rbharath rbharath left a comment

I took a first look, but it might be easiest to review the PR on the developer call or office hours since I want to make sure I understand all the changes. Better 3D coordinate handling in general seems like a worthwhile change! Just want to make sure I understand the details.

@@ -876,7 +876,16 @@ def _featurize_shard(self,
      Boolean values indicating successful featurization for corresponding
      sample in the source.
    """
    features = [elt for elt in self.featurizer(shard[self.mol_field])]
    pos_cols = ['pos_x', 'pos_y', 'pos_z']
Member

Are these standard names for the position columns or are they specific to the qm9 dataset?
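
As context for the question above, here is a rough sketch of how three per-axis position columns could be stacked into one [n_atoms, 3] array per molecule and handed to a featurizer as a keyword argument. Only the column names come from the diff hunk. The row layout, `stack_positions`, and `toy_featurize` are hypothetical stand-ins and may not match what the PR actually does.

```python
from typing import Dict, List, Optional, Sequence

# Column names as they appear in the diff hunk; everything else is assumed.
POS_COLS = ['pos_x', 'pos_y', 'pos_z']


def stack_positions(row: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Combine per-axis coordinate lists into one [n_atoms, 3] nested list."""
    xs, ys, zs = (row[col] for col in POS_COLS)
    if not (len(xs) == len(ys) == len(zs)):
        raise ValueError("position columns must have equal length")
    return [[x, y, z] for x, y, z in zip(xs, ys, zs)]


def toy_featurize(smiles: str,
                  pos: Optional[List[List[float]]] = None) -> Dict[str, object]:
    """Hypothetical stand-in for a featurizer that accepts coordinates as a kwarg."""
    pos_shape = None if pos is None else [len(pos), 3]
    return {"mol": smiles, "pos_shape": pos_shape}


# One illustrative row (water); a real shard would hold many such rows.
row = {
    "smiles": "O",
    "pos_x": [0.0, 0.0, 0.0],
    "pos_y": [0.0, 0.7572, -0.7572],
    "pos_z": [0.1173, -0.4692, -0.4692],
}
pos = stack_positions(row)
features = toy_featurize(row["smiles"], pos=pos)
print(features)
```

Under this layout the column names are only a convention shared between the loader and the featurization step, so a dataset that used different names would just need a different `POS_COLS` value.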

@arunppsg arunppsg marked this pull request as ready for review May 16, 2022 04:42
@rbharath rbharath merged commit 4491b78 into deepchem:master May 27, 2022
@arunppsg arunppsg deleted the qm9 branch June 30, 2023 02:35