Fix splitter errors for datasets without labels #2641

Suzukazole · 2021-08-04T17:20:05Z

Description

Some datasets without labels, such as USPTO and Swiss-Prot, are not compatible with the splitters because the get_shard_size method looks for the length of y, which in this case is None. This is a crack at getting around this issue. I was able to load in and split the USPTO datasets with this.
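For readers skimming the thread, here is a minimal sketch of the failure mode and of the kind of workaround the commit message ("change y to ids and checks") suggests. Both function names are hypothetical stand-ins written for illustration, not DeepChem's actual implementation, and the assumption that the fix counts ids is inferred from that commit message only.

```python
from typing import Optional, Sequence


def shard_size_from_labels(y: Optional[Sequence]) -> int:
    """Hypothetical pre-fix behaviour: count datapoints via the labels."""
    # len(None) raises TypeError, which is what an unlabeled dataset hits.
    return len(y)


def shard_size_from_ids(ids: Sequence) -> int:
    """Hypothetical post-fix behaviour: count datapoints via the ids."""
    # Assumption: every datapoint has an id even when it has no label,
    # so the ids are a safe thing to measure.
    return len(ids)


# An unlabeled dataset with three datapoints.
ids = ["rxn-0", "rxn-1", "rxn-2"]
y = None

print(shard_size_from_ids(ids))
```

Calling `shard_size_from_labels(y)` with `y = None` raises a `TypeError`, while the ids-based version is unaffected by the missing labels.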

Type of change

Please check the option that is related to your PR.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
- In this case, we recommend to discuss your modification on GitHub issues before creating the PR
Documentations (modification for documents)

Checklist

Suzukazole · 2021-08-04T17:21:52Z

@rbharath I feel there might be a better way to do this, but for now this works as intended with the USPTO loader.

rbharath · 2021-08-09T14:42:55Z

deepchem/data/tests/test_shape.py

+  w = np.random.randint(2, size=(num_datapoints, num_tasks))
+  ids = np.array(["id"] * num_datapoints)
+
+  dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)


So here I'm thinking more if we do

  dataset = dc.data.DiskDataset.from_numpy(X)
  assert dataset.get_shard_size() == 100

The idea is we want to check that datasets without labels work correctly :)

yup! will add that

rbharath

For some reason I'm not seeing the latest CI run here.

If the CI is passing as expected, we can go ahead and merge this PR in. I think the code is all good. Let me know if I can merge.

deepchem/data/tests/reaction_smiles.csv

rbharath · 2021-08-12T01:12:18Z

deepchem/data/tests/test_shape.py

+
+  Note
+  ----
+  DiskDatasets without labels cannot be resharded!


This should be fixed! Can you raise a separate issue for this and we can get a fix going?

Yup will do!

Suzukazole · 2021-08-12T05:18:01Z

@rbharath Could you try restarting the checks, maybe? I think I opened the PR when GitHub Actions was down.

rbharath

Ok I think this is good! Going to merge in

Suzukazole added 2 commits August 4, 2021 22:45

change y to ids and checks

4970501

Merge branch 'deepchem:master' into usptotok

7ae5640

add test

f114ba6

rbharath reviewed Aug 9, 2021

Suzukazole added 2 commits August 11, 2021 02:16

update test, docs

3573cca

yapf

5104ddd

rbharath reviewed Aug 12, 2021

Suzukazole mentioned this pull request Aug 23, 2021

Reshard fails for DiskDataset objects with no labels #2677

Closed

Trigger Checks

c99d126

rbharath approved these changes Aug 23, 2021

rbharath merged commit a8f150e into deepchem:master Aug 23, 2021
