
[v2 Feature Request]: On-the-Fly Graph Generation #858

Closed

UnixJunkie opened this issue May 7, 2024 · 14 comments
Labels: enhancement (a new feature request), wontfix (This will not be worked on)
Milestone: v2.0.1

Comments

@UnixJunkie

What are you trying to do?
I am trying to do classification modeling of a very large dataset (~10M molecules).
The process went OOM while it was running on a machine equipped with 256 GB RAM...

Previous attempts

chemprop train -n 8 --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL

Screenshots
N/A.

Apparently, v2 no longer supports --no_cache_mol?!

@UnixJunkie added the question (Further information is requested) label May 7, 2024
@UnixJunkie (Author)

usage: chemprop [-h] {train,predict,convert,fingerprint,hpopt} ...
chemprop: error: unrecognized arguments: --no_cache_mol

@UnixJunkie (Author)

I cannot find this information in the documentation.
In the v1 docs, it was findable by searching for 'large dataset'.

@UnixJunkie (Author)

related to #792

@kevingreenman (Member) commented May 7, 2024

duplicate of #792

@KnathanM (Contributor) commented May 7, 2024

As noted in the linked issue, the CLI does not currently use caching. If you are running out of memory when you don't expect to, there may be a memory leak in our code. Could you try using progressively smaller subsets of your data and see if you still get the OOM error?
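
For example, a hypothetical bisection helper along these lines (file names are illustrative) would write the subsets to re-run `chemprop train` on:

import pandas as pd

# Write progressively smaller subsets of the training CSV; pass each one
# to `chemprop train --data-path ...` to find the size where OOM starts.
df = pd.read_csv("train.csv")
for n in (1_000_000, 2_000_000, 5_000_000):
    df.head(n).to_csv(f"train_{n}.csv", index=False)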

@UnixJunkie (Author) commented May 8, 2024

I just use:

chemprop train -n 8 --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL

Even:

chemprop train --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL

gets killed because of OOM.

So, I am just running your CLI program.

@KnathanM (Contributor) commented May 8, 2024

> I just use:
>
> chemprop train -n 8 --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL
>
> Even:
>
> chemprop train --data-path train.csv -t classification --save-dir chemprop_test --smiles-column SMILES --target-column LABEL
>
> gets killed because of OOM.

I see in your CLI commands that you first tried using -n 8 and then tried removing that option. -n controls how many workers are used to featurize the atoms and bonds in molecules; it is unrelated to memory usage because the features are not cached.

@KnathanM reopened this May 8, 2024
@KnathanM changed the title from "[v2 QUESTION]: how to pass --no_cache_mol to 'chemprop train'?" to "[v2 Bug]: Out of memory - CLI large dataset" May 8, 2024
@UnixJunkie (Author)

Dear @KnathanM,
I just shared the dataset with you over Google Drive.

@UnixJunkie (Author) commented May 8, 2024

The dataset CSV file takes 620 MB, but trying to model it on a computer with 16 GB of RAM also gets killed because of OOM.

@UnixJunkie (Author)

If you have swap enabled, I guess it would just start swapping.

@JacksonBurns (Member)

By checking the memory consumption of Python just importing RDKit:

/usr/bin/time -v python -c "from rdkit import Chem"
	...
	Maximum resident set size (kbytes): 69624
	...

and the memory consumption when creating a tiny graph:

/usr/bin/time -v python -c "from rdkit import Chem; m = Chem.MolFromSmiles('C')"
	...
	Maximum resident set size (kbytes): 70268
	...

we can arrive at a low-end estimate of 644 kilobytes of memory per molecule, or about half a megabyte for simplicity.

The dataset at hand is ~10M molecules - at even our low-end estimate of half a megabyte per graph (just the graph, not even the Chemprop datapoint), this dataset would require five terabytes of RAM to hold in memory (cached) all at once.
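
Spelling out the arithmetic from the measurements above:

per_mol_kb = 70268 - 69624               # = 644 kB: RSS with one tiny graph minus RSS of the bare import
total_tb = 10_000_000 * 0.5 / 1_000_000  # 10M molecules x ~0.5 MB each = 5.0 TB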

I don't think there is anything for Chemprop to do here... the batches need to be cut way down and calculated on-the-fly to fit in the memory of basically any machine, right @KnathanM?

@UnixJunkie (Author)

Just for some more context: with DNA-encoded libraries, very large datasets are becoming standard.
Even with class balancing, the dataset can easily still be 500k to 1M molecules.
DNNs are supposed to support incremental learning using batches.
So, even very large datasets should be supported (provided the user has the patience to wait for the model to be trained).

@JacksonBurns (Member)

> DNNs are supposed to support incremental learning using batches.

We do support incremental training, just not incremental loading of a dataset.

Chemprop will greedily load your entire dataset at the beginning of training - chemprop.cli.train.main will call build_splits, which in turn calls build_data_from_files, which reads the input CSV in its entirety and converts all SMILES to datapoints.

This is a design decision that massively simplifies the backend work. Features are already re-calculated on-the-fly, which reduces memory consumption, but that isn't the problem here. A dataset of 10M graphs (plus their associated data) is just too big to hold in memory all at once.

If you want batched/incremental loading of your dataset you will need to implement it yourself. This would probably involve deciding your dataset splits beforehand and then subclassing the molecule datapoint to avoid calculating the RDKit molecule until it is needed during training (like we currently do with the features). If you're not familiar with using Chemprop as a module, try the training demo notebook.
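
A rough sketch of the lazy idea (the class name and the `.mol` access pattern are illustrative assumptions, not the real chemprop API; check how chemprop.data actually reaches the molecule before building on this):

from dataclasses import dataclass
from rdkit import Chem

# Hypothetical lazy datapoint: hold only the SMILES string and rebuild
# the RDKit molecule whenever it is needed, instead of keeping ~10M
# Chem.Mol objects resident in memory at once.
@dataclass
class LazyMoleculeDatapoint:
    smi: str

    @property
    def mol(self) -> Chem.Mol:
        # Re-parse on every access: trades CPU time each epoch for a
        # much smaller resident set.
        return Chem.MolFromSmiles(self.smi)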

We aren't going to implement this for the time being. This is because (1) it's very complicated, (2) we have other features to add, and (3) I suspect you are doing the Kaggle Leash Bio competition, and having us do the work for you doesn't align well with their rules IMHO.

@JacksonBurns closed this as not planned May 8, 2024
@JacksonBurns changed the title from "[v2 Bug]: Out of memory - CLI large dataset" to "[v2 Feature Request]: On-the-Fly Graph Generation" May 8, 2024
@JacksonBurns added the enhancement and wontfix labels and removed the question label May 8, 2024
@chemprop locked as resolved and limited conversation to collaborators May 8, 2024
@KnathanM (Contributor) commented May 8, 2024

As a note to future users with large datasets:

I ran a quick test using the 10M molecule dataset @UnixJunkie sent me to see how much space Chemprop datapoints take up. In a Jupyter notebook, I loaded the data and tried to make all 10M datapoints:

import pandas as pd
from chemprop.data import MoleculeDatapoint

df = pd.read_csv("10M_dataset.csv")
smis = df["SMILES"].values
ys = df[["LABEL"]].values
datapoints = [MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(smis, ys)]

This ran out of memory on my machine with 192 GB of RAM. But when I only used the first 5M SMILES/labels, it fit in memory.

smis = df["SMILES"].values[:5000000]
ys = df[["LABEL"]].values[:5000000]

Using htop I saw that my starting memory usage was 5G. After making the 5M datapoints, my memory usage was 135G. During training for 1 epoch on the 5M datapoints, the memory peaked at 155G and then went back to 135G when the training finished. So for this dataset, 5M datapoints take roughly 130G to store. Note that the main contributors to the memory usage of datapoints are the rdkit.Chem.Mol objects that are created from the SMILES. The size of the mol objects may differ between molecules. Also note that these datapoints are not featurized; featurization happens on-the-fly.
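
For reference, the per-datapoint cost implied by those htop numbers:

gb_held = 135 - 5                           # ~130 GB held once the 5M datapoints exist
kb_per_datapoint = 130e9 / 5_000_000 / 1e3  # ~26 kB per (unfeaturized) datapoint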

I agree with Jackson that if you want to use a dataset that is too large for the memory available, subclassing the molecule datapoint class to calculate the RDKit mol object on the fly would work. I'll also add that I don't plan to implement this for the time being because I expect recreating the RDKit mol object at each epoch would significantly slow down training. And in any event, I was able to train a model on 5M datapoints with the current code.

@kevingreenman added this to the v2.0.1 milestone May 23, 2024