
Support for larger than memory datasets #11

Closed
JonathanSchmidt1 opened this issue Aug 25, 2022 · 23 comments · Fixed by #73
Labels
enhancement New feature or request

Comments

@JonathanSchmidt1

Hi,
You mentioned at PSI-K that the only bottleneck for larger than memory datasets was the normalization of the energies and forces. It would be great if you could add an option to explicitly enter the statistics to avoid this issue.
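For example, the statistics could be computed offline in a single streaming pass and then supplied explicitly. A minimal sketch of that idea (not MACE code; it assumes an extxyz file whose frames carry an energy and forces, and the exact quantities MACE needs, such as per-element reference energies, may differ):

```python
# Minimal sketch (not MACE code): one streaming pass over an extxyz file to get
# normalization statistics without loading the whole dataset into memory.
import numpy as np
from ase.io import iread

n_frames = 0
energy_per_atom_sum = 0.0
force_sq_sum = 0.0
n_force_components = 0

for atoms in iread("train.xyz"):  # yields one frame at a time
    energy_per_atom_sum += atoms.get_potential_energy() / len(atoms)
    forces = atoms.get_forces()
    force_sq_sum += float(np.sum(forces ** 2))
    n_force_components += forces.size
    n_frames += 1

mean_energy_per_atom = energy_per_atom_sum / n_frames
rms_force = np.sqrt(force_sq_sum / n_force_components)
print(f"mean E/atom = {mean_energy_per_atom:.6f}, RMS force = {rms_force:.6f}")
```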
best and thank you very much,
Jonathan

@ilyes319 ilyes319 self-assigned this Aug 30, 2022
@ilyes319
Contributor

Hi @JonathanSchmidt1,

Thank you for your message! I think there are several sides to this:

  • Enable parsing of pre-computed statistics to avoid running out of memory
  • Add multi-processing to the dataloader for faster and more memory-efficient loading and pre-processing (see the sketch below)

Could you give an estimate of the size of the dataset you want to load?
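For the dataloader point above, a toy sketch of what multi-process loading could look like with a PyTorch Geometric loader (an illustration only, not MACE's actual dataloader; the random Data objects stand in for real atomic graphs):

```python
# Toy sketch: multi-process loading with a PyG DataLoader (not MACE's dataloader).
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# Placeholder dataset of random "structures".
dataset = [Data(pos=torch.randn(12, 3), energy=torch.randn(1)) for _ in range(1000)]

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,    # collation and pre-processing happen in worker processes
    pin_memory=True,  # faster host-to-GPU copies
)

if __name__ == "__main__":  # guard needed for worker processes on spawn platforms
    for batch in loader:
        pass  # training step would go here
```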

@ilyes319 ilyes319 added the enhancement New feature or request label Aug 31, 2022
@JonathanSchmidt1
Author

Hi,
The solutions sound good. At the moment we are thinking about ~35M structures, with roughly 12 atoms on average, plus the corresponding forces and stresses.
best,
Jonathan

@peastman

I just ran into the same problem. I'm trying to fit a MACE model to the SPICE dataset, which is moderately large but not huge: about 1.1 million conformations, an average of around 40 atoms per molecule, with energies and forces. Converted to xyz format it's 3.9 GB. When MACE tried to load the dataset, it filled up all the memory in my computer (32 GB), then the computer hung and had to be turned off with the power switch.

In my work with TorchMD-Net, I developed an HDF5-based dataset format for dealing with this problem. It transparently pages data into memory as needed. Would something similar be helpful here?
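The idea, roughly, is a dataset that opens the file lazily and reads one conformation per item, so only the requested samples are pulled into memory. A minimal sketch of the paging approach (not the TorchMD-Net implementation; the dataset names 'pos', 'energy', 'forces' and the fixed-size layout are assumptions):

```python
# Minimal sketch of an HDF5-backed dataset that pages samples in on demand.
import h5py
import torch
from torch.utils.data import Dataset

class HDF5Conformations(Dataset):
    def __init__(self, path: str):
        self.path = path
        with h5py.File(path, "r") as f:
            self._length = len(f["energy"])
        self._file = None  # opened lazily, once per worker process

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, idx: int) -> dict:
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        f = self._file
        return {
            "pos": torch.as_tensor(f["pos"][idx]),
            "energy": torch.as_tensor(f["energy"][idx]),
            "forces": torch.as_tensor(f["forces"][idx]),
        }
```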

@gabor1
Collaborator

gabor1 commented Dec 12, 2022

The question is whether we should push multi-gpu-multi-node training, assuming that large data sets take a long time to train anyway

@peastman

Multi-GPU is nice if you happen to have them, but it isn't necessary. A single GPU can train models on large datasets. It just takes longer.

In my case it never got as far as using the GPU. It ran out of host memory first.

@ilyes319
Contributor

Hi @peastman!

So you did not get any error message but just a crash? I guess HDF5 is a good option for that. In your code, is it interfaced with a torch dataloader?

@peastman

> So you did not get any error message but just a crash?

It went deeply into swap space and started thrashing, which caused the computer to become unresponsive.

> In your code, is it interfaced with a torch dataloader?

It subclasses torch_geometric.data.Dataset.

I created a reduced version of the dataset with less than 10% of the data, so I could analyze where exactly the memory was being used. I noted the total memory used by the process at various points in run_train.py.

Immediately before the call to get_dataset_from_xyz(): 300 MB
Immediately after it returns: 2.9 GB
After creating train_loader and valid_loader: 6.1 GB
After creating the model: 8.2 GB
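(For context, the numbers above can be taken roughly like this; this is hypothetical instrumentation, not code that exists in run_train.py:)

```python
# Hypothetical snippet: print the resident set size at interesting points.
import os
import psutil

def report(tag: str) -> None:
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"{tag}: {rss_gb:.2f} GB resident")

report("before get_dataset_from_xyz")
# collections = get_dataset_from_xyz(...)  # hypothetical call site
report("after get_dataset_from_xyz")
```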

It looks like there are multiple bottlenecks in memory use. Loading the raw data with get_dataset_from_xyz() would take over 30 GB for the full dataset. Then when it creates the AtomicData objects, that more than doubles the memory use. Addressing this would involve two steps:

  1. Load the raw data in a way that doesn't require everything to be in memory at once.
  2. Create the AtomicData objects as they're needed and then discard them again, rather than building them all in advance.
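A minimal sketch of what step 2 could look like (an illustration only, not MACE's code): keep only the raw arrays resident and build the graph object inside get(), so each AtomicData-like sample exists only while its batch is being assembled. Neighbour-list construction is elided.

```python
# Sketch of lazy, per-sample graph construction (not MACE's implementation).
import torch
from torch_geometric.data import Data, Dataset

class LazyGraphDataset(Dataset):
    def __init__(self, positions, energies):
        super().__init__()
        self.positions = positions  # e.g. memory-mapped or HDF5-backed arrays
        self.energies = energies

    def len(self) -> int:
        return len(self.energies)

    def get(self, idx: int) -> Data:
        pos = torch.as_tensor(self.positions[idx], dtype=torch.float32)
        energy = torch.tensor([float(self.energies[idx])])
        # Graph construction happens here, per sample, instead of up front.
        return Data(pos=pos, energy=energy)
```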

@davkovacs davkovacs self-assigned this Jan 11, 2023
@davkovacs
Collaborator

@peastman Have you also considered using LMDB instead of HDF5? I would like to implement a new data loader that can efficiently deal with arbitrarily large datasets, and was wondering whether in your experience HDF5 is better/faster. I am currently leaning towards LMDB, but have not benchmarked them exhaustively yet.

@peastman

I'd never heard of LMDB before. It looks like something very different. LMDB is a database system, while HDF5 is just a file format designed for efficient read access. It's possible a full database would have advantages, but it's going to be a lot harder for users. There's good support for HDF5 in just about every language, and it takes minimal code to build a file. Users are also much more likely to be familiar with it already.

As for performance, HDF5 is working great for me. I have no trouble handling large datasets and I can get GPU utilization close to 100%.
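To illustrate the "minimal code to build a file" point, a generic h5py example (not the SPICE or MACE schema):

```python
# Generic h5py example: dump a few arrays to an HDF5 file with compression.
import h5py
import numpy as np

positions = np.random.rand(1000, 40, 3)  # toy data: 1000 conformations, 40 atoms
energies = np.random.rand(1000)

with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("pos", data=positions, compression="gzip")
    f.create_dataset("energy", data=energies, compression="gzip")
```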

@davkovacs davkovacs linked a pull request Feb 8, 2023 that will close this issue
@davkovacs
Collaborator

@peastman Could you perhaps try this pull request? I had a go at implementing on-the-fly data loading and statistics parsing. Let me know if you have any questions; instructions are in the README.

#73

@peastman

It moves the problem to a new place without fixing it. When I run the preprocess_data.py script I get the same behavior as before. The memory used by the process gradually increases until it fills up all available memory. Then the computer hangs and has to be shut down with the power button.

@davkovacs
Collaborator

Sorry, I was under the assumption that there were hundreds of GB of CPU RAM available and that only the GPU RAM was limited. So I moved the whole preprocessing to the CPU, and the GPU reads the preprocessed data from the HDF5 file one batch at a time.

I can try to implement a modified low-memory version of the preprocessing that processes one config at a time and writes it to disk before moving on to the next (see the sketch below).
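Roughly what that low-memory preprocessing could look like (a sketch only, not the code in the PR; it assumes a fixed number of atoms per frame for simplicity, whereas ragged systems would need per-frame groups or a flat layout with index offsets):

```python
# Sketch: stream configurations from xyz and append them to resizable HDF5
# datasets, so peak RAM stays at roughly one configuration.
import h5py
from ase.io import iread

with h5py.File("preprocessed.h5", "w") as out:
    pos_ds = forces_ds = energy_ds = None
    for atoms in iread("train.xyz"):  # one configuration in memory at a time
        if pos_ds is None:  # create resizable datasets on the first frame
            n = len(atoms)
            pos_ds = out.create_dataset("pos", shape=(0, n, 3), maxshape=(None, n, 3))
            forces_ds = out.create_dataset("forces", shape=(0, n, 3), maxshape=(None, n, 3))
            energy_ds = out.create_dataset("energy", shape=(0,), maxshape=(None,))
        for ds in (pos_ds, forces_ds, energy_ds):
            ds.resize(ds.shape[0] + 1, axis=0)
        pos_ds[-1] = atoms.get_positions()
        forces_ds[-1] = atoms.get_forces()
        energy_ds[-1] = atoms.get_potential_energy()
```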

@peastman

My computer only has 32 GB of RAM.

@davkovacs
Collaborator

I have attempted a fix, and I really hope it will work. Please use this branch and see the example in the README:

https://github.com/davkovacs/mace/tree/on_the_fly_dataloading

@peastman

Thanks! It's running now. The resident size climbed to about 12 GB and then stopped increasing. I'll let you know what happens.

@peastman

Success! I started from a dataset that's about 1 GB in the HDF5 format used by TorchMD-Net. Converting it to xyz format increased it to just under 4 GB. preprocess_data.py ran for over 3.5 hours and produced a pair of files that totaled about 56 GB(!). But training now seems to be working. It's been running for about 12 hours, and the loss is gradually decreasing.

What exactly is the loss function? And is there any way I can tell what epoch it's on?

@davkovacs
Collaborator

davkovacs commented Feb 16, 2023

Great to hear!
It should have created a logs directory which contains a file that logs the validation loss (every 2 epochs by default). There is another folder called results which contains a file that logs the loss for each batch.

For the precise form of the loss function see Appendix 5.1 of the paper
https://arxiv.org/pdf/2206.07697.pdf
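Schematically it is a weighted sum of per-atom energy and force squared errors; the exact weighting and normalization are in the appendix. With λ_E, λ_F the energy and force weights, N_b the number of atoms in configuration b, and B the batch size:

```latex
% Schematic form only -- see Appendix 5.1 of the paper for the exact expression.
\mathcal{L} \;=\; \frac{\lambda_E}{B} \sum_{b=1}^{B}
    \left( \frac{E_b - \hat{E}_b}{N_b} \right)^2
  \;+\; \frac{\lambda_F}{3 \sum_b N_b} \sum_{b=1}^{B} \sum_{i \in b} \sum_{\alpha=1}^{3}
    \left( F_{i\alpha} - \hat{F}_{i\alpha} \right)^2
```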

@peastman

I guess that means it hasn't completed two epochs yet. I'll keep watching.

@davkovacs
Collaborator

davkovacs commented Apr 8, 2023

@peastman we have significantly improved the support for large datasets. Please have a look at the multi-GPU branch. It no longer creates huge files during preprocessing, and the memory requirements should be even smaller; for me a preprocessed SPICE is now less than 8 GB. If you need any help, let me know!

@JonathanSchmidt1
Author

That's great to hear. Is the multi-GPU branch operational already?

@davkovacs
Collaborator

We still have some debugging to do for multi-GPU training, but it works for training on a single GPU.

@peastman

It works great, thanks! I have a training run going right now. I'm also looking forward to multi-GPU support.

@ilyes319
Contributor

Merged to main in #363
