
Support for larger than memory datasets #11

Closed
JonathanSchmidt1 opened this issue Aug 25, 2022 · 23 comments · Fixed by #73
Labels
enhancement New feature or request

Comments

@JonathanSchmidt1

Hi,
You mentioned at PSI-K that the only bottleneck for larger than memory datasets was the normalization of the energies and forces. It would be great if you could add an option to explicitly enter the statistics to avoid this issue.
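For example, the statistics could be computed offline in a single streaming pass and then supplied explicitly. A minimal sketch of that idea (not MACE code; it assumes an extxyz file whose frames carry an energy and forces, and the exact quantities MACE needs, such as per-element reference energies, may differ):

```python
# Minimal sketch (not MACE code): one streaming pass over an extxyz file to get
# normalization statistics without loading the whole dataset into memory.
import numpy as np
from ase.io import iread

n_frames = 0
energy_per_atom_sum = 0.0
force_sq_sum = 0.0
n_force_components = 0

for atoms in iread("train.xyz"):  # yields one frame at a time
    energy_per_atom_sum += atoms.get_potential_energy() / len(atoms)
    forces = atoms.get_forces()
    force_sq_sum += float(np.sum(forces ** 2))
    n_force_components += forces.size
    n_frames += 1

mean_energy_per_atom = energy_per_atom_sum / n_frames
rms_force = np.sqrt(force_sq_sum / n_force_components)
print(f"mean E/atom = {mean_energy_per_atom:.6f}, RMS force = {rms_force:.6f}")
```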
best and thank you very much,
Jonathan

@ilyes319 ilyes319 self-assigned this Aug 30, 2022
@ilyes319
Contributor

Hi @JonathanSchmidt1,

Thank you for your message! I think there are several sides to this:

  • Enable parsing of pre-computed statistics to avoid running out of memory
  • Add multi-processing to the dataloader for faster and more memory-efficient loading and pre-processing (see the sketch below)

Could you give an estimate of the size of the dataset you want to load?
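For the dataloader point above, a toy sketch of what multi-process loading could look like with a PyTorch Geometric loader (an illustration only, not MACE's actual dataloader; the random Data objects stand in for real atomic graphs):

```python
# Toy sketch: multi-process loading with a PyG DataLoader (not MACE's dataloader).
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# Placeholder dataset of random "structures".
dataset = [Data(pos=torch.randn(12, 3), energy=torch.randn(1)) for _ in range(1000)]

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,    # collation and pre-processing happen in worker processes
    pin_memory=True,  # faster host-to-GPU copies
)

if __name__ == "__main__":  # guard needed for worker processes on spawn platforms
    for batch in loader:
        pass  # training step would go here
```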

@ilyes319 ilyes319 added the enhancement New feature or request label Aug 31, 2022
@JonathanSchmidt1
Author

Hi,
The solutions sound good. At the moment we are thinking about ~35M structures, with roughly 12 atoms on average, plus the corresponding forces and stresses.
best,
Jonathan

@peastman

I just ran into the same problem. I'm trying to fit a MACE model to the SPICE dataset, which is moderately large but not huge: about 1.1 million conformations, an average of around 40 atoms per molecule, with energies and forces. Converted to xyz format it's 3.9 GB. When MACE tried to load the dataset, it filled up all the memory in my computer (32 GB), then the computer hung and had to be turned off with the power switch.

In my work with TorchMD-Net, I developed an HDF5-based dataset format for dealing with this problem. It transparently pages data into memory as needed. Would something similar be helpful here?
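The idea, roughly, is a dataset that opens the file lazily and reads one conformation per item, so only the requested samples are pulled into memory. A minimal sketch of the paging approach (not the TorchMD-Net implementation; the dataset names 'pos', 'energy', 'forces' and the fixed-size layout are assumptions):

```python
# Minimal sketch of an HDF5-backed dataset that pages samples in on demand.
import h5py
import torch
from torch.utils.data import Dataset

class HDF5Conformations(Dataset):
    def __init__(self, path: str):
        self.path = path
        with h5py.File(path, "r") as f:
            self._length = len(f["energy"])
        self._file = None  # opened lazily, once per worker process

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, idx: int) -> dict:
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        f = self._file
        return {
            "pos": torch.as_tensor(f["pos"][idx]),
            "energy": torch.as_tensor(f["energy"][idx]),
            "forces": torch.as_tensor(f["forces"][idx]),
        }
```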

@gabor1
Collaborator

gabor1 commented Dec 12, 2022

The question is whether we should push multi-gpu-multi-node training, assuming that large data sets take a long time to train anyway

@peastman

Multi-GPU is nice if you happen to have them, but it isn't necessary. A single GPU can train models on large datasets. It just takes longer.

In my case it never got as far as using the GPU. It ran out of host memory first.

@ilyes319
Contributor

Hi @peastman!

So you did not get any error message but just a crash? I guess HDF5 is a good option for that. In your code, is it interfaced with a torch dataloader?

@peastman

> So you did not get any error message but just a crash?

It went deeply into swap space and started thrashing, which caused the computer to become unresponsive.

> In your code, is it interfaced with a torch dataloader?

It subclasses torch_geometric.data.Dataset.

I created a reduced version of the dataset with less than 10% of the data, so I could analyze where exactly the memory was being used. I noted the total memory used by the process at various points in run_train.py.

Immediately before the call to get_dataset_from_xyz(): 300 MB
Immediately after it returns: 2.9 GB
After creating train_loader and valid_loader: 6.1 GB
After creating the model: 8.2 GB
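(For context, the numbers above can be taken roughly like this; this is hypothetical instrumentation, not code that exists in run_train.py:)

```python
# Hypothetical snippet: print the resident set size at interesting points.
import os
import psutil

def report(tag: str) -> None:
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"{tag}: {rss_gb:.2f} GB resident")

report("before get_dataset_from_xyz")
# collections = get_dataset_from_xyz(...)  # hypothetical call site
report("after get_dataset_from_xyz")
```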

It looks like there are multiple bottlenecks in memory use. Loading the raw data with get_dataset_from_xyz() would take over 30 GB for the full dataset. Then when it creates the AtomicData objects, that more than doubles the memory use. Addressing this would involve two steps:

  1. Load the raw data in a way that doesn't require everything to be in memory at once.
  2. Create the AtomicData objects as they're needed and then discard them again, rather than building them all in advance.
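A minimal sketch of what step 2 could look like (an illustration only, not MACE's code): keep only the raw arrays resident and build the graph object inside get(), so each AtomicData-like sample exists only while its batch is being assembled. Neighbour-list construction is elided.

```python
# Sketch of lazy, per-sample graph construction (not MACE's implementation).
import torch
from torch_geometric.data import Data, Dataset

class LazyGraphDataset(Dataset):
    def __init__(self, positions, energies):
        super().__init__()
        self.positions = positions  # e.g. memory-mapped or HDF5-backed arrays
        self.energies = energies

    def len(self) -> int:
        return len(self.energies)

    def get(self, idx: int) -> Data:
        pos = torch.as_tensor(self.positions[idx], dtype=torch.float32)
        energy = torch.tensor([float(self.energies[idx])])
        # Graph construction happens here, per sample, instead of up front.
        return Data(pos=pos, energy=energy)
```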

@davkovacs davkovacs self-assigned this Jan 11, 2023
@davkovacs
Collaborator

@peastman Have you also considered using LMDB instead of HDF5? I would like to implement a new data loader that can efficiently deal with arbitrarily large datasets, and was wondering whether in your experience HDF5 is better/faster. I am currently leaning towards LMDB, but have not benchmarked them exhaustively yet.

@peastman

I'd never heard of LMDB before. It looks like something very different. LMDB is a database system, while HDF5 is just a file format designed for efficient read access. It's possible a full database would have advantages, but it's going to be a lot harder for users. There's good support for HDF5 in just about every language, and it takes minimal code to build a file. Users are also much more likely to be familiar with it already.

As for performance, HDF5 is working great for me. I have no trouble handling large datasets and I can get GPU utilization close to 100%.
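To illustrate the "minimal code to build a file" point, a generic h5py example (not the SPICE or MACE schema):

```python
# Generic h5py example: dump a few arrays to an HDF5 file with compression.
import h5py
import numpy as np

positions = np.random.rand(1000, 40, 3)  # toy data: 1000 conformations, 40 atoms
energies = np.random.rand(1000)

with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("pos", data=positions, compression="gzip")
    f.create_dataset("energy", data=energies, compression="gzip")
```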

@davkovacs davkovacs linked a pull request Feb 8, 2023 that will close this issue
@davkovacs
Collaborator

@peastman Could you perhaps try this pull request? I had a go at implementing on-the-fly data loading and statistics parsing. Let me know if you have any questions; instructions are in the README.

#73

@peastman

It moves the problem to a new place without fixing it. When I run the preprocess_data.py script I get the same behavior as before. The memory used by the process gradually increases until it fills up all available memory. Then the computer hangs and has to be shut down with the power button.

@davkovacs
Collaborator

Sorry, I was under the assumption that there were hundreds of GB of CPU RAM available and that only the GPU RAM was limited. So I moved the whole preprocessing to the CPU, and the GPU reads the preprocessed data from the HDF5 file one batch at a time.

I can try to implement a modified low-memory version of the preprocessing that processes one config at a time and writes it to disk before moving on to the next (see the sketch below).
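Roughly what that low-memory preprocessing could look like (a sketch only, not the code in the PR; it assumes a fixed number of atoms per frame for simplicity, whereas ragged systems would need per-frame groups or a flat layout with index offsets):

```python
# Sketch: stream configurations from xyz and append them to resizable HDF5
# datasets, so peak RAM stays at roughly one configuration.
import h5py
from ase.io import iread

with h5py.File("preprocessed.h5", "w") as out:
    pos_ds = forces_ds = energy_ds = None
    for atoms in iread("train.xyz"):  # one configuration in memory at a time
        if pos_ds is None:  # create resizable datasets on the first frame
            n = len(atoms)
            pos_ds = out.create_dataset("pos", shape=(0, n, 3), maxshape=(None, n, 3))
            forces_ds = out.create_dataset("forces", shape=(0, n, 3), maxshape=(None, n, 3))
            energy_ds = out.create_dataset("energy", shape=(0,), maxshape=(None,))
        for ds in (pos_ds, forces_ds, energy_ds):
            ds.resize(ds.shape[0] + 1, axis=0)
        pos_ds[-1] = atoms.get_positions()
        forces_ds[-1] = atoms.get_forces()
        energy_ds[-1] = atoms.get_potential_energy()
```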

@peastman

My computer only has 32 GB of RAM.

@davkovacs
Collaborator

I have attempted a fix, and I really hope it will work. Please use this branch and see the example in the README:

https://github.com/davkovacs/mace/tree/on_the_fly_dataloading

@peastman

Thanks! It's running now. The resident size climbed to about 12 GB and then stopped increasing. I'll let you know what happens.

@peastman

Success! I started from a dataset that's about 1 GB in the HDF5 format used by TorchMD-Net. Converting it to xyz format increased it to just under 4 GB. preprocess_data.py ran for over 3.5 hours and produced a pair of files that totaled about 56 GB(!). But training now seems to be working. It's been running for about 12 hours, and the loss is gradually decreasing.

What exactly is the loss function? And is there any way I can tell what epoch it's on?

@davkovacs
Collaborator

davkovacs commented Feb 16, 2023

Great to hear!
It should have created a logs directory which contains a file that logs the validation loss (every 2 epochs by default). There is another folder called results which contains a file that logs the loss for each batch.

For the precise form of the loss function see Appendix 5.1 of the paper
https://arxiv.org/pdf/2206.07697.pdf
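Schematically it is a weighted sum of per-atom energy and force squared errors; the exact weighting and normalization are in the appendix. With λ_E, λ_F the energy and force weights, N_b the number of atoms in configuration b, and B the batch size:

```latex
% Schematic form only -- see Appendix 5.1 of the paper for the exact expression.
\mathcal{L} \;=\; \frac{\lambda_E}{B} \sum_{b=1}^{B}
    \left( \frac{E_b - \hat{E}_b}{N_b} \right)^2
  \;+\; \frac{\lambda_F}{3 \sum_b N_b} \sum_{b=1}^{B} \sum_{i \in b} \sum_{\alpha=1}^{3}
    \left( F_{i\alpha} - \hat{F}_{i\alpha} \right)^2
```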

@peastman

I guess that means it hasn't completed two epochs yet. I'll keep watching.

@davkovacs
Collaborator

davkovacs commented Apr 8, 2023

@peastman we have significantly improved the support for large datasets. Please have a look at the multi-GPU branch. It no longer creates huge files during preprocessing, and the memory requirements should be even smaller; for me a preprocessed SPICE is now less than 8 GB. If you need any help, let me know!

@JonathanSchmidt1
Author

That's great to hear. Is the multi-GPU branch operational already?

@davkovacs
Collaborator

We still have some debugging to do for multi-GPU training, but it works for training on a single GPU.

@peastman

It works great, thanks! I have a training run going right now. I'm also looking forward to multi-GPU support.

@ilyes319
Contributor

Merged to main in #363
