
Caching logic improvement #432

Merged 18 commits on Aug 18, 2023
Conversation

DomInvivo (Collaborator) commented Aug 10, 2023

Changelogs

This is a draft PR looking to change the logic of how caching is done. The PR is motivated in #431 . I'll wait for comments and suggestions before pursuing this.

  • Removed the cache_data_path option. We don't want to load
  • Added the dataloading_from option to select whether to load from Disk or RAM
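The `dataloading_from` option accepts the two values described above. A hypothetical sketch of how such an option might be validated; the accepted values "ram" and "disk" come from this PR, but the helper name `check_dataloading_from` is illustrative, not the actual API:

```python
# Illustrative validation of the new `dataloading_from` option.
# Values "ram" and "disk" are from the PR; this helper is a sketch.
def check_dataloading_from(dataloading_from: str) -> str:
    dataloading_from = dataloading_from.lower()
    if dataloading_from not in {"ram", "disk"}:
        raise ValueError(
            f"`dataloading_from` must be 'ram' or 'disk', got '{dataloading_from}'"
        )
    return dataloading_from
```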

Checklist:

  • Was this PR discussed in an issue? Yes, in Issues for loading from RAM instead of from Disk #431
  • Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).


@DomInvivo DomInvivo linked an issue Aug 10, 2023 that may be closed by this pull request
WenkelF (Collaborator) commented Aug 11, 2023

@DomInvivo I tested the caching branch and at closer inspection found some bugs that were not evident from the changelogs:

The unit tests currently fail because we need to add from os import PathLike as Path to the imports in the datamodule or directly use os.PathLike instead.
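A minimal, self-contained illustration of the second option (using `os.PathLike` directly in the annotation instead of the alias); the `resolve_path` helper is hypothetical, not part of the datamodule:

```python
import os
from pathlib import Path
from typing import Optional, Union

# Annotating with os.PathLike directly avoids needing
# `from os import PathLike as Path`. `resolve_path` is illustrative only.
def resolve_path(data_path: Optional[Union[str, os.PathLike]] = None) -> str:
    # os.fspath accepts both str and PathLike objects
    return os.fspath(data_path) if data_path is not None else "."
```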

After this is fixed, there is an issue because the MultitaskDataset is defined as follows:

class MultitaskDataset(Dataset):
    def __init__(
        self,
        datasets: Dict[str, SingleTaskDataset],
        n_jobs=-1,
        backend: str = "loky",
        featurization_batch_size=1000,
        progress: bool = True,
        save_smiles_and_ids: bool = False,
        about: str = "",
        data_path: Optional[Union[str, os.PathLike]] = None,
        dataloading_from: str = "ram",
        files_ready: bool = False,
    ):

However, in the datamodule we call it like this:

multitask_dataset = Datasets.MultitaskDataset(
    singletask_datasets,
    n_jobs=self.featurization_n_jobs,
    backend=self.featurization_backend,
    featurization_batch_size=self.featurization_batch_size,
    progress=self.featurization_progress,
    about=about,
    save_smiles_and_ids=save_smiles_and_ids,
    data_path=self._path_to_load_from_file(stage) if processed_graph_data_path else None,
    processed_graph_data_path=processed_graph_data_path,
    files_ready=files_ready,
)  # type: ignore

It seems dataloading_from is hardcoded to “ram” and we pass the unsupported processed_graph_data_path as an argument. Instead, we need to pass dataloading_from here (see proposed change above).

With this change, it runs for “disk”. If we don’t set processed_graph_data_path, it also runs for “ram”. If we set “ram” and processed_graph_data_path, we need to check whether the data is already cached; otherwise we get an error when trying to create already existing files (see comment above).

@DomInvivo DomInvivo marked this pull request as ready for review August 17, 2023 14:21
WenkelF (Collaborator) commented Aug 17, 2023

@DomInvivo here are the main updates:

  • Simplifying logic to prepare data and do dataloading from DISK/RAM
  • Allow cached data (on DISK) to be loaded into RAM for dataloading
  • Speeding up transfer from DISK to RAM via parallelization
  • Adding a CLI option to prepare data in advance via graphium-prepare-data (make sure to re-install dependencies before using it)
  • Updating README to explain how and why to prepare data in advance

# else:
# processed_train_data_path = None
# processed_val_data_path = None

Collaborator comment on the lines above: Forgot to remove commented lines. Will do so shortly.

codecov bot commented Aug 17, 2023

Codecov Report

Merging #432 (cc91bfa) into main (10fe04b) will increase coverage by 0.81%.
Report is 41 commits behind head on main.
The diff coverage is 66.23%.

@@            Coverage Diff             @@
##             main     #432      +/-   ##
==========================================
+ Coverage   64.74%   65.56%   +0.81%     
==========================================
  Files          89       90       +1     
  Lines        8211     8226      +15     
==========================================
+ Hits         5316     5393      +77     
+ Misses       2895     2833      -62     
Flag        Coverage Δ
unittests   65.56% <66.23%> (+0.81%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Component   Coverage Δ
ipu         49.14% <ø> (ø)

DomInvivo (Collaborator, Author) left a comment: Minor comments. Good work :)

Review threads on graphium/cli/prepare_data.py (outdated, resolved)
@DomInvivo DomInvivo requested review from zhiyil1230 and s-maddrellmander and removed request for zhiyil1230 August 18, 2023 05:02
WenkelF (Collaborator) commented Aug 18, 2023

@DomInvivo some finishing touches for the PR:

  • Fixed unit tests for test_datamodule
  • Added a warning message that the save_smiles_and_ids argument in setup() is superseded by prepare_data()
  • Temporarily hardcoded no parallelization when transferring from DISK to RAM
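The warning mentioned above could be emitted along these lines; the simplified `setup()` signature here is hypothetical, for illustration only:

```python
import warnings

# Illustrative sketch of warning that save_smiles_and_ids in setup()
# is superseded by prepare_data(); the signature is simplified.
def setup(save_smiles_and_ids: bool = False) -> None:
    if save_smiles_and_ids:
        warnings.warn(
            "`save_smiles_and_ids` passed to `setup()` is superseded by the "
            "value used in `prepare_data()`."
        )
```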

WenkelF (Collaborator) commented Aug 18, 2023

@DomInvivo should be ready to merge

@DomInvivo DomInvivo merged commit 4adaaf7 into main Aug 18, 2023
5 checks passed
cwognum (Collaborator) commented Aug 23, 2023

@WenkelF @DomInvivo I know I'm a little late to the party (apologies!), but I just saw you added a new CLI in this PR.

You added a new command in the pyproject.toml and are using the @hydra.main() decorator. Moving forward, I think it would be better to add these under the already existing graphium CLI. Don't worry about it for now; I will make the changes in #441, because I've also migrated from Click to Typer there.

For your information, using the @hydra.main() decorator makes the more advanced functionality of hydra available through the CLI (e.g. tab completion, multirun, working directory management, logging management and more), so it is not functionality you want to give up. A major limitation, however, is that you cannot group commands as you can with click and typer. To prevent creating a new command for each script, and to instead have a single, clear entry-point command organized into subcommands, in many cases you are probably better off using a traditional CLI tool (e.g. Click or Typer) together with the Compose API.

For the particular command you introduced in this PR and given the changes in #441, we could for example do:

from .data import data_app
from hydra import compose, initialize

@app.command(name="prepare", help="Prepare the data in advance")
def cli(config_path: str, config_name: str) -> None:
    with initialize(version_base=None, config_path=config_path):
        cfg = compose(config_name=config_name, overrides=[])
    run_prepare_data(cfg)

This command is then available as:

graphium data prepare

Again, don't worry about it! I'll make the changes in this case, but if you see someone else making a similar change, you can point them to this comment.
Changes are made here: 7861de8

Successfully merging this pull request may close these issues.

Issues for loading from RAM instead of from Disk
3 participants