-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DFTYaml Loader #3295
DFTYaml Loader #3295
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a first round of review. A couple of changes needed on the API with more details in the comments below
deepchem/data/data_loader.py
Outdated
|
||
Parameters | ||
---------- | ||
featurizer: Featurizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting looks off here. Can you also add more details about the featurizers that we support?
deepchem/data/data_loader.py
Outdated
featurizer: Featurizer | ||
""" | ||
|
||
def create_dftdataset(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't create a new method name here. Instead, we should follow the existing API (create_dataset
and not create_dftdataset
)
deepchem/data/data_loader.py
Outdated
X = np.array([self._featurize_shard(shard) for shard in entries]) | ||
y = np.array([0]) | ||
w = w | ||
return NumpyDataset(X, y, w=w) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need to create only a NumpyDataset
. If we implement _get_shards
we should directly be able to construct a DiskDataset
with the inherited superclass method of create_dataset
deepchem/data/data_loader.py
Outdated
|
||
Returns | ||
------- | ||
x: DFTEntry object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks possibly off? A shard has multiple datapoints. That should be multiple DFTEntry
objects right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost good to go, but you need to add yaml
to the requirements file for tests https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
deepchem/data/data_loader.py
Outdated
true_val = shard['true_val'] | ||
systems = shard['systems'] | ||
except KeyError: | ||
print("Unknown key") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a print, you want to raise an error:
raise ValueError("Unknown key in yaml file. Please check format for correctness.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@advikavs Can you confirm CI is clear? I will go ahead and merge if so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On ubuntu-latest, I'm seeing this CI failure due to dqc. @advikavs could you take a look and fix?
=========================== short test summary info ============================
ERROR deepchem/feat/dft_data.py - ModuleNotFoundError: No module named 'dqc'
ERROR deepchem/models/dft/nnxc.py - ModuleNotFoundError: No module named 'dqc'
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!
======================== 12 warnings, 2 errors in 8.87s ========================
I have fixed the issue in this PR: #3315 |
Description
Load and featurize .yaml files