
Caching logic improvement #432

Merged 18 commits on Aug 18, 2023
Conversation

DomInvivo (Collaborator) commented Aug 10, 2023

Changelogs

This is a draft PR looking to change the logic of how caching is done. The PR is motivated in #431 . I'll wait for comments and suggestions before pursuing this.

  • Removed the cache_data_path option. We don't want to load
  • Added the dataloading_from option to select whether to load from Disk or RAM
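The `dataloading_from` option accepts the two values described above. A hypothetical sketch of how such an option might be validated; the accepted values "ram" and "disk" come from this PR, but the helper name `check_dataloading_from` is illustrative, not the actual API:

```python
# Illustrative validation of the new `dataloading_from` option.
# Values "ram" and "disk" are from the PR; this helper is a sketch.
def check_dataloading_from(dataloading_from: str) -> str:
    dataloading_from = dataloading_from.lower()
    if dataloading_from not in {"ram", "disk"}:
        raise ValueError(
            f"`dataloading_from` must be 'ram' or 'disk', got '{dataloading_from}'"
        )
    return dataloading_from
```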

Checklist:

  • Was this PR discussed in an issue? Yes, in Issues for loading from RAM instead of from Disk #431
  • Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).


@DomInvivo DomInvivo linked an issue Aug 10, 2023 that may be closed by this pull request
WenkelF (Collaborator) commented Aug 11, 2023

@DomInvivo I tested the caching branch and at closer inspection found some bugs that were not evident from the changelogs:

The unit tests currently fail because we need to add from os import PathLike as Path to the imports in the datamodule or directly use os.PathLike instead.
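A minimal, self-contained illustration of the second option (using `os.PathLike` directly in the annotation instead of the alias); the `resolve_path` helper is hypothetical, not part of the datamodule:

```python
import os
from pathlib import Path
from typing import Optional, Union

# Annotating with os.PathLike directly avoids needing
# `from os import PathLike as Path`. `resolve_path` is illustrative only.
def resolve_path(data_path: Optional[Union[str, os.PathLike]] = None) -> str:
    # os.fspath accepts both str and PathLike objects
    return os.fspath(data_path) if data_path is not None else "."
```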

After this is fixed, there is an issue because the MultitaskDataset is defined as follows:

class MultitaskDataset(Dataset):
    def __init__(
        self,
        datasets: Dict[str, SingleTaskDataset],
        n_jobs=-1,
        backend: str = "loky",
        featurization_batch_size=1000,
        progress: bool = True,
        save_smiles_and_ids: bool = False,
        about: str = "",
        data_path: Optional[Union[str, os.PathLike]] = None,
        dataloading_from: str = "ram",
        files_ready: bool = False,
    ):

However, in the datamodule we call it like this:

multitask_dataset = Datasets.MultitaskDataset(
    singletask_datasets,
    n_jobs=self.featurization_n_jobs,
    backend=self.featurization_backend,
    featurization_batch_size=self.featurization_batch_size,
    progress=self.featurization_progress,
    about=about,
    save_smiles_and_ids=save_smiles_and_ids,
    data_path=self._path_to_load_from_file(stage) if processed_graph_data_path else None,
    processed_graph_data_path=processed_graph_data_path,
    files_ready=files_ready,
)  # type: ignore

It seems dataloading_from is hardcoded to “ram” and we pass the unsupported processed_graph_data_path as an argument. Instead, we need to pass dataloading_from here (see proposed change above).

With this change, it runs for “disk”. If we don’t set processed_graph_data_path, it also runs for “ram”. If we set “ram” and processed_graph_data_path, we need to check whether the data is already cached; otherwise we get an error when trying to create already existing files (see comment above).

@DomInvivo DomInvivo marked this pull request as ready for review August 17, 2023 14:21
WenkelF (Collaborator) commented Aug 17, 2023

@DomInvivo here are the main updates:

  • Simplifying logic to prepare data and do dataloading from DISK/RAM
  • Allow cached data (on DISK) to be loaded into RAM for dataloading
  • Speeding up transfer from DISK to RAM via parallelization
  • Adding a CLI option to prepare data in advance via graphium-prepare-data (make sure to re-install dependencies before using it)
  • Updating README to explain how and why to prepare data in advance

# else:
# processed_train_data_path = None
# processed_val_data_path = None

Collaborator comment on the lines above: Forgot to remove commented lines. Will do so shortly.

codecov bot commented Aug 17, 2023

Codecov Report

Merging #432 (cc91bfa) into main (10fe04b) will increase coverage by 0.81%.
Report is 41 commits behind head on main.
The diff coverage is 66.23%.

@@            Coverage Diff             @@
##             main     #432      +/-   ##
==========================================
+ Coverage   64.74%   65.56%   +0.81%     
==========================================
  Files          89       90       +1     
  Lines        8211     8226      +15     
==========================================
+ Hits         5316     5393      +77     
+ Misses       2895     2833      -62     
Flag        Coverage Δ
unittests   65.56% <66.23%> (+0.81%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Component   Coverage Δ
ipu         49.14% <ø> (ø)

DomInvivo (Collaborator, Author) left a comment: Minor comments. Good work :)

Review threads on graphium/cli/prepare_data.py (outdated, resolved)
@DomInvivo DomInvivo requested review from zhiyil1230 and s-maddrellmander and removed request for zhiyil1230 August 18, 2023 05:02
WenkelF (Collaborator) commented Aug 18, 2023

@DomInvivo some finishing touches for the PR:

  • Fixed unit tests for test_datamodule
  • Added a warning message that the save_smiles_and_ids argument in setup() is superseded by prepare_data()
  • Temporarily hardcoded no parallelization when transferring from DISK to RAM
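The warning mentioned above could be emitted along these lines; the simplified `setup()` signature here is hypothetical, for illustration only:

```python
import warnings

# Illustrative sketch of warning that save_smiles_and_ids in setup()
# is superseded by prepare_data(); the signature is simplified.
def setup(save_smiles_and_ids: bool = False) -> None:
    if save_smiles_and_ids:
        warnings.warn(
            "`save_smiles_and_ids` passed to `setup()` is superseded by the "
            "value used in `prepare_data()`."
        )
```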

WenkelF (Collaborator) commented Aug 18, 2023

@DomInvivo should be ready to merge

@DomInvivo DomInvivo merged commit 4adaaf7 into main Aug 18, 2023
5 checks passed
cwognum (Collaborator) commented Aug 23, 2023

@WenkelF @DomInvivo I know I'm a little late to the party (apologies!), but I just saw you added a new CLI in this PR.

You added a new command in the pyproject.toml and are using the @hydra.main() decorator. Moving forward, I think it would be better to add these under the already existing graphium CLI. Don't worry about it for now; I will make the changes in #441, because I've also migrated from Click to Typer there.

For your information, using the @hydra.main() decorator makes the more advanced functionality of hydra available through the CLI (e.g. tab completion, multirun, working directory management, logging management and more), so it is not functionality you want to give up. A major limitation, however, is that you cannot group commands as you can with click and typer. To prevent creating a new command for each script, and to instead have a single, clear entry-point command organized into subcommands, in many cases you are probably better off using a traditional CLI tool (e.g. Click or Typer) together with the Compose API.

For the particular command you introduced in this PR and given the changes in #441, we could for example do:

from .data import data_app
from hydra import compose, initialize

@app.command(name="prepare", help="Prepare the data in advance")
def cli(config_path: str, config_name: str) -> None:
    with initialize(version_base=None, config_path=config_path):
        cfg = compose(config_name=config_name, overrides=[])
    run_prepare_data(cfg)

This command is then available as:

graphium data prepare

Again, don't worry about it! I'll make the changes in this case, but if you see someone else making a similar change, you can point them to this comment.
Changes are made here: 7861de8

Successfully merging this pull request may close these issues.

Issues for loading from RAM instead of from Disk
3 participants