DFTYaml Loader #3295

Merged
merged 7 commits into from
Mar 29, 2023

advikavs (Contributor)

Description

Load and featurize .yaml files

@rbharath (Member) left a comment

Did a first round of review. A couple of API changes are needed; details are in the comments below.


Parameters
----------
featurizer: Featurizer

Formatting looks off here. Can you also add more details about the featurizers that we support?

featurizer: Featurizer
"""

def create_dftdataset(self,

We shouldn't create a new method name here. Instead, we should follow the existing API (create_dataset rather than create_dftdataset).

X = np.array([self._featurize_shard(shard) for shard in entries])
y = np.array([0])
w = w
return NumpyDataset(X, y, w=w)

We shouldn't need to restrict this to a NumpyDataset. If we implement _get_shards, we should be able to construct a DiskDataset directly with the create_dataset method inherited from the superclass.
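
The pattern suggested here can be sketched without DeepChem at all. The classes below are hypothetical stand-ins (ToyLoader, ToyYamlLoader, and the list-of-dicts input are assumptions for illustration, not DeepChem's actual implementation): the base class owns create_dataset, and a subclass only supplies _get_shards and _featurize_shard.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple

Entry = Dict[str, Any]


class ToyLoader:
    """Hypothetical base class: owns create_dataset, delegates the rest."""

    def create_dataset(self, inputs: Sequence[Entry],
                       shard_size: int = 2) -> List[List[Tuple[Any, int]]]:
        # One featurized shard per chunk of inputs. A real loader would
        # write each shard to disk instead of collecting them in a list.
        return [
            self._featurize_shard(shard)
            for shard in self._get_shards(inputs, shard_size)
        ]

    def _get_shards(self, inputs: Sequence[Entry],
                    shard_size: int) -> Iterator[List[Entry]]:
        raise NotImplementedError

    def _featurize_shard(self, shard: List[Entry]) -> List[Tuple[Any, int]]:
        raise NotImplementedError


class ToyYamlLoader(ToyLoader):
    """Hypothetical subclass: no new public method is needed."""

    def _get_shards(self, inputs, shard_size):
        # Yield consecutive chunks of at most shard_size entries.
        for start in range(0, len(inputs), shard_size):
            yield list(inputs[start:start + shard_size])

    def _featurize_shard(self, shard):
        # A shard holds several datapoints, so return one item per entry.
        return [(entry["true_val"], len(entry["systems"])) for entry in shard]


# Made-up entries using the 'true_val' and 'systems' keys from this PR.
entries = [
    {"true_val": 0.1, "systems": ["a"]},
    {"true_val": 0.2, "systems": ["a", "b"]},
    {"true_val": 0.3, "systems": ["c"]},
]
shards = ToyYamlLoader().create_dataset(entries, shard_size=2)
```

Because the public entry point lives in the base class, callers use the same create_dataset call for every loader, and the subclass stays limited to format-specific logic.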


Returns
-------
x: DFTEntry object

This looks possibly off. A shard has multiple datapoints, so this should be multiple DFTEntry objects, right?

deepchem/data/tests/test_dftyaml.py (outdated, resolved)

@rbharath (Member) left a comment

Almost good to go, but you need to add yaml to the requirements file for tests: https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports

true_val = shard['true_val']
systems = shard['systems']
except KeyError:
print("Unknown key")

Instead of a print, you want to raise an error:

raise ValueError("Unknown key in yaml file. Please check format for correctness.")
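
A minimal sketch of that suggestion, assuming a hypothetical helper named read_entry that receives one parsed entry as a dict (the helper name and the sample input are illustrative, not code from this PR). Chaining with `from err` keeps the original KeyError in the traceback, so the missing key is still visible.

```python
from typing import Any, Dict, List, Tuple


def read_entry(shard: Dict[str, Any]) -> Tuple[Any, List[Any]]:
    """Hypothetical helper: pull the required fields out of one parsed entry."""
    try:
        true_val = shard["true_val"]
        systems = shard["systems"]
    except KeyError as err:
        # Chain the KeyError so the traceback still names the missing key.
        raise ValueError(
            "Unknown key in yaml file. Please check format for correctness."
        ) from err
    return true_val, systems


# Usage: an entry without "systems" now fails loudly instead of printing.
try:
    read_entry({"true_val": 0.5})
except ValueError as err:
    message = str(err)
    missing_key = err.__cause__.args[0]
```

Unlike a print, the exception stops processing at the malformed entry, so later code never runs with undefined true_val or systems.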

@rbharath (Member) left a comment

LGTM

@advikavs Can you confirm CI is clear? I will go ahead and merge if so

@rbharath (Member) left a comment

On ubuntu-latest, I'm seeing this CI failure due to dqc. @advikavs could you take a look and fix?

=========================== short test summary info ============================
ERROR deepchem/feat/dft_data.py - ModuleNotFoundError: No module named 'dqc'
ERROR deepchem/models/dft/nnxc.py - ModuleNotFoundError: No module named 'dqc'
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
======================== 12 warnings, 2 errors in 8.87s ========================
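
Collection errors like these typically occur when a module imports an optional package at import time. One common way to avoid that is to guard the import, sketched below. This is an assumption about a general technique, not a description of how the failure was actually resolved; has_module, HAS_DQC, and require_dqc are hypothetical names.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical guard at the top of a module that needs the optional package.
if has_module("dqc"):
    import dqc  # noqa: F401
    HAS_DQC = True
else:
    HAS_DQC = False


def require_dqc() -> None:
    """Fail at call time, not at import time, when the dependency is absent."""
    if not HAS_DQC:
        raise ImportError("This feature requires dqc to be installed.")
```

In test files, pytest.importorskip("dqc") offers a similar effect by skipping the tests rather than erroring when the package is missing.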

@advikavs (Contributor, Author)

I have fixed the issue in this PR: #3315

@rbharath merged commit 5887076 into deepchem:master on Mar 29, 2023
@advikavs deleted the yaml branch on March 30, 2023 08:35