-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix splitter errors for datasets without labels #2641
Conversation
@rbharath I feel there might be a better way to do this, but for now this works as intended with the USPTO loader. |
deepchem/data/tests/test_shape.py
Outdated
w = np.random.randint(2, size=(num_datapoints, num_tasks)) | ||
ids = np.array(["id"] * num_datapoints) | ||
|
||
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here I'm thinking more if we do
dataset = dc.data.DiskDataset.from_numpy(X)
assert dataset.get_shard_size() == 100
The idea is we want to check that datasets without labels work correctly :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup! will add that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason I'm not seeing the latest CI run here.
If the CI is passing as expected we can go ahead and merge this PR in. I think code is all good. Let me know if I can merge
|
||
Note | ||
---- | ||
DiskDatasets without labels cannot be resharded! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be fixed! Can you raise a separate issue for this and we can get a fix going?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup will do!
@rbharath Could you try restarting the checks maybe? I think I opened the PR when github actions was down. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think this is good! Going to merge in
Description
Some datasets without labels such as USPTO, Swiss-Prot are not compatible with the splitters as the get_shard_size method looks for the length of y, which in this case is None. This is a crack at getting around this issue, I was able to load in and split the USPTO datasets with this.
Type of change
Please check the option that is related to your PR.
Checklist
yapf -i <modified file>
and check no errors (yapf version must be 0.22.0)mypy -p deepchem
and check no errorsflake8 <modified file> --count
and check no errors