I have been testing different ways of creating data loaders for Polaris.
One of the strategies I'm looking into is inspired by Graphium:
- Save each data point in an individual pickle file.
- Group the files into directories of at most 1000 data points each.
I noticed that you're using `torch.save` and `torch.load` to save and load the files. I was expecting this to be faster, because you would not have to convert to and from a `torch.Tensor`, but it might not be... Let's look at a simple example.
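We start by defining some utility code. The original helpers aren't shown here, so the sketch below is only an assumption about what `get_pickle_archive` and `PickleDataset` might look like: one file per sample, grouped in directories of at most 1000 samples, with a `use_torch` flag that switches between `torch.save`/`torch.load` and plain `pickle`.

```python
import os
import pickle

import numpy as np
import torch


def get_pickle_archive(X, y, root, use_torch=False):
    # Save each (x, y) pair to its own file, grouped in directories of <= 1000 samples.
    for i, (x, label) in enumerate(zip(X, y)):
        subdir = os.path.join(root, f"{i // 1000:04d}")
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, f"{i:06d}.pkl")
        if use_torch:
            torch.save((torch.from_numpy(x), torch.tensor(label)), path)
        else:
            with open(path, "wb") as f:
                pickle.dump((x, label), f)
    return root


class PickleDataset(torch.utils.data.Dataset):
    # Load one sample per file; only the pickle path needs a numpy -> tensor conversion.
    def __init__(self, root, use_torch=False, length=0):
        self.root = root
        self.use_torch = use_torch
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        path = os.path.join(self.root, f"{idx // 1000:04d}", f"{idx:06d}.pkl")
        if self.use_torch:
            return torch.load(path)
        with open(path, "rb") as f:
            x, label = pickle.load(f)
        return torch.from_numpy(x), torch.tensor(label)
```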
And now we can do some benchmarking.

```python
# Create a random, toy dataset of 10k samples
X = np.random.random((10000, 64, 64, 3))
y = np.random.random(10000)
```
### Save and load using Torch
```python
# Create a dataset that saves using torch
archive = get_pickle_archive(X, y, tmpdir, use_torch=True)
dataset = PickleDataset(archive, use_torch=True, length=10000)
dataloader = torch.utils.data.DataLoader(dataset)
```
This gives: 4.13 s ± 35.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
### Save and load using Pickle
```python
# This time, we just use pickle directly.
archive = get_pickle_archive(X, y, tmpdir, use_torch=False)
dataset = PickleDataset(archive, use_torch=False, length=10000)
dataloader = torch.utils.data.DataLoader(dataset)
```
This gives: 1.72 s ± 8.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
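Both numbers look like `%timeit` output over one full pass of the loader; the exact harness isn't shown in the original, but the measurement amounts to something like this sketch:

```python
import time

start = time.perf_counter()
for _ in dataloader:  # one full pass over all 10k samples
    pass
print(f"{time.perf_counter() - start:.2f} s per epoch")
```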
### Conclusion
This could be an easy change that might get you a ~58% reduction in loading time (about 2.4× faster)!
`torch.save` adds relevant functionality (see e.g. here), but I'm not familiar enough with the internals of either Torch or Graphium to know whether that functionality is needed for your use case of saving featurized graphs (right?) to disk. Might be worth a try!