[v2 Feature Request]: On-the-Fly Graph Generation #858
Comments
I do not find this information in the documentation.
Related to #792.
As noted in the linked issue, the CLI does not use caching currently. If you are running out of memory when you don't expect to, there may be a memory leak in our code. Could you try using progressively smaller subsets of your data and see if you still get the OOM error?
I just use:

```bash
chemprop train -n 8 --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL
```

Even:

```bash
chemprop train --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL
```

gets killed because of OOM. So I am just running your CLI program.
I see in your CLI commands that you first tried using `-n 8`.
Dear @KnathanM |
The dataset CSV file takes 620 MB, but trying to model it on a computer with 16 GB of RAM also gets killed because of OOM.
If you have swap enabled, I guess it would just start swapping. |
By checking the memory consumption of a Python process that has just imported RDKit, and then the memory consumption after creating a tiny graph, we can arrive at a low-end estimate of 644 kilobytes of memory per molecule, or about half a megabyte for simplicity. The dataset at hand is ~10MM molecules; at even our low-end estimate of half a meg per graph (just the graph, not even the Chemprop datapoint), this dataset would require five terabytes of RAM to hold in memory (cached) all at once. I don't think there is anything for Chemprop to do here... the batch size needs to be cut way down and the graphs calculated on-the-fly to fit in the memory of basically any machine, right @KnathanM?
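For reference, a measurement of this kind can be sketched as below; `psutil`, the object count, and the benzene test SMILES are illustrative choices, not the exact snippets from the original comment, and real per-graph numbers will vary with molecule size:

```python
# Sketch: estimate per-molecule memory by watching RSS growth.
# psutil, the object count, and the test SMILES are assumptions.
import os

import psutil
from rdkit import Chem

proc = psutil.Process(os.getpid())
baseline = proc.memory_info().rss  # RSS with RDKit already imported

n = 10_000
mols = [Chem.MolFromSmiles("c1ccccc1") for _ in range(n)]  # tiny molecules

after = proc.memory_info().rss
print(f"~{(after - baseline) / n / 1024:.1f} KiB per RDKit mol")
```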
Just for some more context: with DNA Encoded Libraries, very large datasets are becoming standard. |
We do support incremental training, just not incremental loading of a dataset. Chemprop will greedily load your entire dataset at the beginning of training; this is a design decision that massively simplifies the backend work. Features are already re-calculated on-the-fly, which reduces memory consumption, but that isn't the problem here. A dataset of 10MM graphs (plus their associated data) is just too big to hold in memory all at once.

If you want batched/incremental loading of your dataset, you will need to implement it yourself. This would probably involve deciding your dataset splits beforehand and then subclassing the molecule datapoint to avoid calculating the RDKit molecule until it is needed during training (like we currently do with the features); see the sketch after this comment. If you're not familiar with using Chemprop as a module, try the training demo notebook.

We aren't going to implement this for the time being, because (1) it's very complicated, (2) we have other features to add, and (3) I suspect you are doing the Kaggle Leash Bio competition, and having us do the work for you doesn't align well with their rules IMHO.
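A minimal sketch of that lazy-datapoint idea, assuming only that a datapoint needs to expose a `mol` attribute; the class and field names here are illustrative, not Chemprop's actual API:

```python
# Illustrative sketch (not Chemprop's API): defer RDKit parsing until the
# datapoint is actually used, so only the SMILES strings live in memory.
from dataclasses import dataclass, field

from rdkit import Chem


@dataclass
class LazyMoleculeDatapoint:
    smiles: str
    y: list[float] | None = None
    _mol: Chem.Mol | None = field(default=None, repr=False, compare=False)

    @property
    def mol(self) -> Chem.Mol:
        # Parse the SMILES on first access and cache the result.
        if self._mol is None:
            self._mol = Chem.MolFromSmiles(self.smiles)
        return self._mol
```

Dropping the `_mol` cache would keep memory flat across epochs at the cost of re-parsing every SMILES, which is the slowdown mentioned in the next comment.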
As a note to future users with large datasets: I ran a quick test using the 10M molecule dataset @UnixJunkie sent me to see how much space Chemprop datapoints take up. In a Jupyter notebook, I loaded the data and tried to make all 10M datapoints:
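The test looked roughly like the following (a reconstruction; the file and column names are assumed from the CLI commands earlier in the thread):

```python
# Rough reconstruction of the quick test; "train.csv" and the
# SMILES/LABEL column names are assumed from the commands above.
import pandas as pd
from chemprop import data

df = pd.read_csv("train.csv")
smis = df["SMILES"]
ys = df[["LABEL"]].values  # shape (N, 1)

datapoints = [data.MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(smis, ys)]
```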
This ran out of memory on my machine with 192 GB. But when I only used the first 5M SMILES/labels, it fit in memory.
I agree with Jackson that if you want to use a dataset that is too large for the available memory, subclassing the molecule datapoint class to calculate the RDKit mol object on the fly would work. I'll also add that I don't plan to implement this for the time being, because I expect recreating the RDKit mol object at each epoch would significantly slow down training. And in any event, I was able to train a model on 5M datapoints with the current code.
What are you trying to do?
I am trying to do classification modeling of a very large dataset (~10M molecules).
The process went OOM while it was running on a machine equipped with 256 GB RAM...
Previous attempts
Apparently, v2 no longer supports `--no_cache_mol`?!
Screenshots
N/A.