Extracting molecular coordinates for QM9 dataset from sdf files #2903
removed unused import, using unzip util
lowered shard size of qm9 dataset due to kernel died error when loading dataset with bigger shard size
I took a first look, but it might be easiest to review the PR on the developer call or office hours, since I want to make sure I understand all the changes. Better 3D coordinate handling in general seems like a worthwhile change! I just want to make sure I understand the details.
@@ -876,7 +876,16 @@ def _featurize_shard(self,
Boolean values indicating successful featurization for corresponding
sample in the source.
"""
features = [elt for elt in self.featurizer(shard[self.mol_field])]
pos_cols = ['pos_x', 'pos_y', 'pos_z']
Are these standard names for the position columns or are they specific to the qm9 dataset?
`.sdf` files come with information about molecular coordinates. These coordinates can be used in a wide range of predictive models for molecular properties. This PR updates the existing `load_sdf_files` function to extract molecular coordinates from `.sdf` files. Relevant changes have been made to `MolGraphConvFeaturizer` to support molecular coordinates as kwargs.

Brief demonstration of the change: in the following code, the current output is
GraphData(node_features=[17, 30], edge_index=[2, 34], edge_features=None, pos=[17, 3])
whereas earlier it was
GraphData(node_features=[17, 30], edge_index=[2, 34], edge_features=None)

Additionally, tests have been added for `load_sdf_files` and `MolGraphConvFeaturizer` to capture the new features, the test data directory has been renamed from `data` to `assets`, and yapf fixes have been applied.
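
For readers unfamiliar with where the coordinates live in an `.sdf` file, here is a minimal sketch of reading them from a single V2000 mol block using only the standard library. It is not the code in this PR: `parse_molblock_coordinates` and `WATER_MOLBLOCK` are hypothetical names made up for illustration, the water geometry is approximate, and V3000 blocks are not handled. The result has one `(x, y, z)` row per atom, which is the same per-atom layout that the `pos=[17, 3]` shape above describes.

```python
from typing import List, Tuple

# A hand-written V2000 mol block for water (illustrative, approximate geometry).
# Built line by line so the fixed-width columns stay explicit.
WATER_MOLBLOCK = "\n".join([
    "water",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.1173 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000   -0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
])


def parse_molblock_coordinates(
        molblock: str) -> Tuple[List[str], List[Tuple[float, float, float]]]:
    """Return atom symbols and (x, y, z) coordinates from a V2000 mol block."""
    lines = molblock.splitlines()
    # Lines 1-3 are the header; line 4 is the counts line, whose first
    # three characters hold the number of atoms.
    num_atoms = int(lines[3][0:3])
    symbols = []
    coords = []
    # The atom block follows the counts line: x, y and z occupy three
    # 10-character fields, and the element symbol sits in columns 32-34.
    for line in lines[4:4 + num_atoms]:
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbols.append(line[31:34].strip())
        coords.append((x, y, z))
    return symbols, coords


symbols, coords = parse_molblock_coordinates(WATER_MOLBLOCK)
```

Here `coords` holds one row per atom and three values per row. Splitting such rows into separate x, y and z columns would give something like the `pos_x`, `pos_y`, `pos_z` columns named in the diff, though how the PR builds those columns is not shown in the visible hunk.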