Support for larger than memory datasets #11
Comments
Thank you for your message! I think there are several sides to this. Could you give an estimate of the size of the dataset you want to load?
Hi,
I just ran into the same problem. I'm trying to fit a MACE model to the SPICE dataset, which is moderately large but not huge: about 1.1 million conformations averaging around 40 atoms per molecule, with energies and forces. Converted to xyz format it's 3.9 GB. When MACE tried to load the dataset, it filled up all the memory in my computer (32 GB), then the computer hung and had to be turned off with the power switch. In my work with TorchMD-Net, I developed an HDF5-based dataset format for dealing with this problem. It transparently pages data into memory as needed. Would something similar be helpful here?
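(For illustration: the idea of such a format is to keep the file on disk and read individual conformations on demand. A minimal sketch, assuming h5py and a hypothetical group layout rather than TorchMD-Net's actual schema, might look like this.)

```python
import h5py
import torch
from torch.utils.data import Dataset

class LazyHDF5Dataset(Dataset):
    """Reads one conformation at a time from disk instead of loading
    the whole dataset into memory."""

    def __init__(self, path):
        self.path = path
        self._file = None
        # Build a lightweight (group name, frame index) list up front;
        # only this index lives in memory, not the data itself.
        with h5py.File(path, "r") as f:
            self._index = [(name, i)
                           for name, grp in f.items()
                           for i in range(grp["energy"].shape[0])]

    def __len__(self):
        return len(self._index)

    def __getitem__(self, idx):
        # Open the file lazily so each DataLoader worker gets its own handle.
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        name, i = self._index[idx]
        grp = self._file[name]
        return {"positions": torch.as_tensor(grp["positions"][i]),
                "energy": torch.as_tensor(grp["energy"][i]),
                "forces": torch.as_tensor(grp["forces"][i])}
```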
The question is whether we should push multi-GPU, multi-node training, assuming that large datasets take a long time to train anyway.
Multi-GPU is nice if you happen to have them, but it isn't necessary. A single GPU can train models on large datasets; it just takes longer. In my case it never got as far as using the GPU. It ran out of host memory first.
Hi @peastman! So you did not get any error message, just a crash? I guess HDF5 is a good option for that. In your code, is it interfaced with a torch DataLoader?
It went deeply into swap space and started thrashing, which caused the computer to become unresponsive.
It subclasses torch_geometric.data.Dataset. I created a reduced version of the dataset with less than 10% of the data so I could analyze where exactly the memory was being used. I noted the total memory used by the process at various points in run_train.py, immediately before the call to […]. It looks like there are multiple bottlenecks in memory use, starting with loading the raw data with […].
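(One simple way to take that kind of measurement, shown here only as an illustration rather than what was actually done, is to log the resident set size at a few checkpoints with psutil.)

```python
import os
import psutil

def log_rss(tag: str) -> None:
    """Print the current resident set size, so memory growth can be
    attributed to specific stages of the run."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"[{tag}] resident memory: {rss_gb:.2f} GB")

# e.g. call log_rss("after loading xyz"), log_rss("after preprocessing"), ...
# at the points of interest in run_train.py
```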
@peastman Have you also considered using LMDB instead of HDF5? I would like to implement a new data loader that can efficiently deal with arbitrarily large datasets, and was wondering whether in your experience HDF5 is better or faster. I am currently leaning towards using LMDB, but have not benchmarked them exhaustively yet.
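(For context, reading from LMDB typically looks like the sketch below; the key scheme and pickled values are purely hypothetical.)

```python
import lmdb
import pickle

# Open the environment read-only; lock=False avoids writer locks for pure readers.
env = lmdb.open("dataset.lmdb", readonly=True, lock=False)
with env.begin() as txn:
    # Hypothetical scheme: each sample is pickled under a zero-padded index key.
    sample = pickle.loads(txn.get(b"00000000"))
```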
I'd never heard of LMDB before. It looks like something very different: LMDB is a database system, while HDF5 is just a file format that's designed to allow efficient read access. It's possible a full database would have advantages, but it's going to be a lot harder for users. There's good support for HDF5 in just about every language, and it takes minimal code to build a file. Users are also much more likely to be familiar with it already. As for performance, HDF5 is working great for me: I have no trouble handling large datasets, and I can get GPU utilization close to 100%.
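(As a rough illustration of the "minimal code to build a file" point; the group and field names here are made up for the example, not a required schema.)

```python
import h5py
import numpy as np

# Toy data: 1000 conformations of a 40-atom molecule.
positions = np.random.rand(1000, 40, 3).astype(np.float32)
energies = np.random.rand(1000).astype(np.float64)
forces = np.random.rand(1000, 40, 3).astype(np.float32)

with h5py.File("dataset.h5", "w") as f:
    grp = f.create_group("molecule_0")
    grp.create_dataset("positions", data=positions)
    grp.create_dataset("energy", data=energies)
    grp.create_dataset("forces", data=forces)
```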
It moves the problem to a new place without fixing it. When I run the preprocess_data.py script I get the same behavior as before: the memory used by the process gradually increases until it fills up all available memory, then the computer hangs and has to be shut down with the power button.
Sorry, I was under the assumption that there are hundreds of GB of CPU RAM and that only the GPU RAM is limited. So I moved the whole preprocessing to the CPU, and the GPU should read the preprocessed data from the HDF5 file one batch at a time. I can try to implement a modified low-memory version of the preprocessing that processes one configuration at a time and writes it to disk before going on to the next.
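(A minimal sketch of that one-configuration-at-a-time idea, using ase.io.iread to stream the xyz file; this is not the actual MACE preprocessing code, and the energy/forces key names depend on how the xyz file was written.)

```python
import h5py
from ase.io import iread  # yields one Atoms object at a time

with h5py.File("preprocessed.h5", "w") as out:
    for i, atoms in enumerate(iread("train.xyz")):
        # Only the current configuration is held in memory; it is written
        # to disk before the next one is read.
        grp = out.create_group(f"config_{i}")
        grp.create_dataset("numbers", data=atoms.get_atomic_numbers())
        grp.create_dataset("positions", data=atoms.get_positions())
        grp.create_dataset("energy", data=atoms.info["energy"])
        grp.create_dataset("forces", data=atoms.arrays["forces"])
```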
My computer only has 32 GB of RAM.
I have attempted a fix. I really hope it will work. Please use this branch and see the example in the README: https://github.com/davkovacs/mace/tree/on_the_fly_dataloading
Thanks! It's running now. The resident size climbed to about 12 GB and then stopped increasing. I'll let you know what happens.
Success! I started from a dataset that's about 1 GB in the HDF5 format used by TorchMD-Net. Converting it to xyz format increased it to just under 4 GB. preprocess_data.py ran for over 3.5 hours and produced a pair of files that totaled about 56 GB(!). But training now seems to be working. It's been running for about 12 hours, and the loss is gradually decreasing. What exactly is the loss function? And is there any way I can tell what epoch it's on?
Great to hear! For the precise form of the loss function, see Appendix 5.1 of the paper.
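(Schematically, it is a weighted sum of squared energy and force errors, roughly of the form below; the exact weighting and normalization are given in the paper's appendix.)

$$\mathcal{L} = \lambda_E \sum_{b} \left(\frac{E_b - \hat{E}_b}{N_b}\right)^2 + \frac{\lambda_F}{3} \sum_{b} \frac{1}{N_b} \sum_{i=1}^{N_b} \left\lVert \mathbf{F}_{b,i} - \hat{\mathbf{F}}_{b,i} \right\rVert^2$$

where $E_b$ and $\mathbf{F}_{b,i}$ are the reference energy and forces of configuration $b$ with $N_b$ atoms, hats denote model predictions, and $\lambda_E$, $\lambda_F$ are the energy and force weights.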
I guess that means it hasn't completed two epochs yet. I'll keep watching. |
@peastman we have significantly improved the support for large datasets. Please have a look at the multi-GPU branch. It no longer creates huge datasets during preprocessing, and the memory requirements should be even smaller. For me, a preprocessed SPICE is now less than 8 GB. If you need any help, let me know!
That's great to hear! Is the multi-GPU branch operational already?
We still have some debugging to do for multi-GPU training, but it works for training on a single GPU.
It works great, thanks! I have a training run going right now. I'm also looking forward to multi-GPU support.
Merged to main in #363.
Hi,
You mentioned at PSI-K that the only bottleneck for larger-than-memory datasets was the normalization of the energies and forces. It would be great if you could add an option to enter those statistics explicitly, to avoid this issue.
Best, and thank you very much,
Jonathan
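(For illustration, the statistics Jonathan refers to could be computed in a single streaming pass so that only one configuration is in memory at a time. This sketch assumes the shift is the mean per-atom energy and the scale is the RMS force component, which is just one common convention and not necessarily what MACE uses; the key names again depend on the xyz file.)

```python
import math
from ase.io import iread  # streams one configuration at a time

energy_per_atom_sum = 0.0
force_sq_sum = 0.0
n_configs = 0
n_atoms_total = 0

for atoms in iread("train.xyz"):
    n = len(atoms)
    energy_per_atom_sum += atoms.info["energy"] / n
    force_sq_sum += (atoms.arrays["forces"] ** 2).sum()
    n_configs += 1
    n_atoms_total += n

shift = energy_per_atom_sum / n_configs                 # mean energy per atom
scale = math.sqrt(force_sq_sum / (3 * n_atoms_total))   # RMS force component
print(f"shift = {shift:.6f} eV/atom, scale = {scale:.6f} eV/A")
```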