Added distributed loading of PyTorch DiskDataset #3704

arunppsg · 2023-12-11T11:15:13Z

Description

Added support for loading a DiskDataset across multiple processes.

Type of change

Please check the option that is related to your PR.

- Bug fix (non-breaking change which fixes an issue)
- New feature (non-breaking change which adds functionality)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
  - In this case, we recommend discussing your modification in a GitHub issue before creating the PR
- Documentation (modification of documents)

Checklist

vsaravind01

The changes look good to me.

rbharath · 2023-12-15T01:38:15Z

deepchem/data/pytorch_datasets.py

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        n_shards = self.disk_dataset.get_number_shards()
        if worker_info is None:
-            first_shard = 0
-            last_shard = n_shards
+            process_id = 0


@arunppsg Can we add a unit test for the changed functionality?

Documenting an offline discussion: this test is proving difficult to write. The existing unit tests pass. Additional tests will follow in a future PR.

rbharath

@arunppsg Is there any risk for breakage with these changes? What code uses TorchDiskDataset already?

arunppsg · 2023-12-18T12:37:52Z

> @arunppsg Is there any risk for breakage with these changes? What code uses TorchDiskDataset already?

DiskDataset.make_pytorch_dataset uses the TorchDiskDataset class to build an iterable PyTorch dataset. This PR only changes how the shards are distributed across multiple workers on a single machine, so that each GPU receives a different portion of the dataset.
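
For readers following along, here is a minimal sketch of the kind of shard partitioning described above. It is not the code merged in this PR (only a fragment of that diff is shown in the review). The helper name `shards_for_process` and the `num_processes` parameter are hypothetical, and the contiguous split is an assumption made for illustration.

```python
from typing import List


def shards_for_process(n_shards: int, process_id: int,
                       num_processes: int) -> List[int]:
    """Return the shard indices one process would read.

    Hypothetical helper for illustration only; it is not part of DeepChem.
    """
    if num_processes <= 0:
        raise ValueError("num_processes must be positive")
    if not 0 <= process_id < num_processes:
        raise ValueError("process_id must be in [0, num_processes)")

    # Contiguous split: every process gets `base` shards, and the first
    # `remainder` processes take one extra so that all shards are covered.
    base, remainder = divmod(n_shards, num_processes)
    first_shard = process_id * base + min(process_id, remainder)
    last_shard = first_shard + base + (1 if process_id < remainder else 0)
    return list(range(first_shard, last_shard))
```

With a single process (`process_id = 0`, `num_processes = 1`) the helper returns every shard, which matches the `worker_info is None` branch in the diff. With several processes, each one receives a disjoint block of shards and the blocks together cover the whole dataset.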

rbharath

LGTM

added distributed loading of DiskDataset

bb454b8

vsaravind01 reviewed Dec 11, 2023

rbharath reviewed Dec 15, 2023

rbharath approved these changes Dec 19, 2023

rbharath merged commit 8130f62 into deepchem:master Dec 19, 2023
24 of 33 checks passed
