
[v2 Feature Request]: On-the-Fly Graph Generation #858

Closed

UnixJunkie opened this issue May 7, 2024 · 14 comments
Labels: enhancement (a new feature request), wontfix (This will not be worked on)
Milestone: v2.0.1

Comments

@UnixJunkie

What are you trying to do?
I am trying to do classification modeling of a very large dataset (~10M molecules).
The process went OOM while it was running on a machine equipped with 256 GB RAM...

Previous attempts

chemprop train -n 8 --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL

Screenshots
N/A.

Apparently, v2 no longer supports --no_cache_mol?!

@UnixJunkie added the question (Further information is requested) label May 7, 2024
@UnixJunkie (Author)

usage: chemprop [-h] {train,predict,convert,fingerprint,hpopt} ...
chemprop: error: unrecognized arguments: --no_cache_mol

@UnixJunkie (Author)

I cannot find this information in the documentation.
In the v1 docs, it was findable by searching for 'large dataset'.

@UnixJunkie (Author)

related to #792

@kevingreenman (Member) commented May 7, 2024

duplicate of #792

@KnathanM (Contributor) commented May 7, 2024

As noted in the linked issue, the CLI does not currently use caching. If you are running out of memory when you don't expect to, there may be a memory leak in our code. Could you try using progressively smaller subsets of your data and see if you still get the OOM error?
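
For example, a hypothetical bisection helper along these lines (file names are illustrative) would write the subsets to re-run `chemprop train` on:

import pandas as pd

# Write progressively smaller subsets of the training CSV; pass each one
# to `chemprop train --data-path ...` to find the size where OOM starts.
df = pd.read_csv("train.csv")
for n in (1_000_000, 2_000_000, 5_000_000):
    df.head(n).to_csv(f"train_{n}.csv", index=False)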

@UnixJunkie (Author) commented May 8, 2024

I just use:

chemprop train -n 8 --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL

Even:

chemprop train --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL

gets killed because of OOM.

So, I am just running your CLI program.

@KnathanM (Contributor) commented May 8, 2024

> I just use:
>
> chemprop train -n 8 --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL
>
> Even:
>
> chemprop train --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL
>
> gets killed because of OOM.

I see in your CLI commands that you first tried using -n 8 and then tried removing that option. -n controls how many workers are used to featurize the atoms and bonds in molecules; it is unrelated to memory usage because the features are not cached.

@KnathanM reopened this May 8, 2024
@KnathanM changed the title from "[v2 QUESTION]: how to pass --no_cache_mol to 'chemprop train'?" to "[v2 Bug]: Out of memory - CLI large dataset" May 8, 2024
@UnixJunkie (Author)

Dear @KnathanM,
I just shared the dataset with you over Google Drive.

@UnixJunkie (Author) commented May 8, 2024

The dataset CSV file takes 620 MB, but trying to model it on a computer with 16 GB of RAM also gets killed because of OOM.

@UnixJunkie (Author)

If you have swap enabled, I guess it would just start swapping.

@JacksonBurns (Member)

By checking the memory consumption of Python just importing RDKit:

/usr/bin/time -v python -c "from rdkit import Chem"
	...
	Maximum resident set size (kbytes): 69624
	...

and the memory consumption when creating a tiny graph:

/usr/bin/time -v python -c "from rdkit import Chem; m = Chem.MolFromSmiles('C')"
	...
	Maximum resident set size (kbytes): 70268
	...

we can arrive at a low-end estimate of 644 kilobytes of memory per molecule, or about half a megabyte for simplicity.

The dataset at hand is ~10M molecules - at even our low-end estimate of half a megabyte per graph (just the graph, not even the Chemprop datapoint), this dataset would require five terabytes of RAM to hold in memory (cached) all at once.
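
Spelling out the arithmetic from the measurements above:

per_mol_kb = 70268 - 69624               # = 644 kB: RSS with one tiny graph minus RSS of the bare import
total_tb = 10_000_000 * 0.5 / 1_000_000  # 10M molecules x ~0.5 MB each = 5.0 TB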

I don't think there is anything for Chemprop to do here... the batches need to be cut way down and calculated on-the-fly to fit in the memory of basically any machine, right @KnathanM?

@UnixJunkie (Author)

Just for some more context: with DNA-encoded libraries, very large datasets are becoming standard.
Even with class balancing, the dataset can easily still be 500k to 1M molecules.
DNNs are supposed to support incremental learning using batches.
So, even very large datasets should be supported (provided the user has the patience to wait for the model to be trained).

@JacksonBurns (Member)

> DNNs are supposed to support incremental learning using batches.

We do support incremental training, just not incremental loading of a dataset.

Chemprop will greedily load your entire dataset at the beginning of training - chemprop.cli.train.main will call build_splits, which in turn calls build_data_from_files, which reads the input CSV in its entirety and converts all SMILES to datapoints.

This is a design decision that massively simplifies the backend work. Features are already re-calculated on-the-fly, which reduces memory consumption, but that isn't the problem here. A dataset of 10M graphs (plus their associated data) is just too big to hold in memory all at once.

If you want batched/incremental loading of your dataset you will need to implement it yourself. This would probably involve deciding your dataset splits beforehand and then subclassing the molecule datapoint to avoid calculating the RDKit molecule until it is needed during training (like we currently do with the features). If you're not familiar with using Chemprop as a module, try the training demo notebook.
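
A rough sketch of the lazy idea (the class name and the `.mol` access pattern are illustrative assumptions, not the real chemprop API; check how chemprop.data actually reaches the molecule before building on this):

from dataclasses import dataclass
from rdkit import Chem

# Hypothetical lazy datapoint: hold only the SMILES string and rebuild
# the RDKit molecule whenever it is needed, instead of keeping ~10M
# Chem.Mol objects resident in memory at once.
@dataclass
class LazyMoleculeDatapoint:
    smi: str

    @property
    def mol(self) -> Chem.Mol:
        # Re-parse on every access: trades CPU time each epoch for a
        # much smaller resident set.
        return Chem.MolFromSmiles(self.smi)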

We aren't going to implement this for the time being. This is because (1) it's very complicated, (2) we have other features to add, and (3) I suspect you are doing the Kaggle Leash Bio competition, and having us do the work for you doesn't align well with their rules IMHO.

@JacksonBurns closed this as not planned May 8, 2024
@JacksonBurns changed the title from "[v2 Bug]: Out of memory - CLI large dataset" to "[v2 Feature Request]: On-the-Fly Graph Generation" May 8, 2024
@JacksonBurns added the enhancement and wontfix labels and removed the question label May 8, 2024
@chemprop locked as resolved and limited conversation to collaborators May 8, 2024
@KnathanM (Contributor) commented May 8, 2024

As a note to future users with large datasets:

I ran a quick test using the 10M molecule dataset @UnixJunkie sent me to see how much space Chemprop datapoints take up. In a Jupyter notebook, I loaded the data and tried to make all 10M datapoints:

import pandas as pd
from chemprop.data import MoleculeDatapoint

df = pd.read_csv("10M_dataset.csv")
smis = df["SMILES"].values
ys = df[["LABEL"]].values
datapoints = [MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(smis, ys)]

This ran out of memory on my machine with 192 GB of RAM. But when I only used the first 5M SMILES/labels, it fit in memory.

smis = df["SMILES"].values[:5000000]
ys = df[["LABEL"]].values[:5000000]

Using htop I saw that my starting memory usage was 5G. After making the 5M datapoints, my memory usage was 135G. During training for 1 epoch on the 5M datapoints, the memory peaked at 155G and then went back to 135G when the training finished. So for this dataset, 5M datapoints take roughly 130G to store. Note that the main contributors to the memory usage of datapoints are the rdkit.Chem.Mol objects that are created from the SMILES. The size of the mol objects may differ between molecules. Also note that these datapoints are not featurized; featurization happens on-the-fly.
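
For reference, the per-datapoint cost implied by those htop numbers:

gb_held = 135 - 5                           # ~130 GB held once the 5M datapoints exist
kb_per_datapoint = 130e9 / 5_000_000 / 1e3  # ~26 kB per (unfeaturized) datapoint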

I agree with Jackson that if you want to use a dataset that is too large for the memory available, subclassing the molecule datapoint class to calculate the RDKit mol object on the fly would work. I'll also add that I don't plan to implement this for the time being because I expect recreating the RDKit mol object at each epoch would significantly slow down training. And in any event, I was able to train a model on 5M datapoints with the current code.

@kevingreenman added this to the v2.0.1 milestone May 23, 2024