
HF dataset loading optimizations #623

Merged: 11 commits merged into main on Jun 14, 2024
Conversation

@2015aroras (Collaborator) commented on Jun 13, 2024

Issue: Loading HF datasets for downstream evals has been slowing down the start of our runs because:

  1. Every process tries to load each HF dataset at the same time. This causes a lot of network traffic to one endpoint (potentially resulting in throttling) and possibly some contention over HF's on-disk dataset cache.
  2. Even when a dataset has already been cached locally, loading it still results in network calls because HF checks the Hub for changes to the data.

Fix: This PR tackles these issues by:

  1. Utilizing HF's load_from_disk and save_to_disk dataset methods to save a local copy of the datasets. These datasets are no longer associated with the online versions, and so loading them from disk does not result in network traffic (as far as I can tell).
  2. Making only the FS rank 0 process perform network calls for HF dataset loading. Doing this requires coordination between the processes using barrier(), but this seems to be less problematic. Note that barriers are invoked each time a dataset needs to be loaded (on the positive side, if all datasets are already in the cache then there are no barrier invocations).

In a 2-GPU interactive session, startup goes from 10 minutes without this new cache to 1 minute with the cache populated beforehand. I have already populated a copy of the cache at /net/weka/reviz/hf_datasets_cache.

I haven't yet verified that evaluation correctness is unaffected by this change (no GPUs were available), but I don't expect it to have a negative impact.

Edit: The barrier is now invoked exactly once per dataset.
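
For illustration, here is a minimal sketch of the detached-copy idea in point 1, assuming only the standard datasets API; the function name, directory layout, and "none" placeholder are hypothetical and not the PR's actual code:

import os
from typing import Optional

import datasets as hf_datasets


def load_cached_dataset(path: str, name: Optional[str], split: str, cache_dir: str):
    # Hypothetical layout: one directory per (path, name, split) under the shared cache.
    local_path = os.path.join(cache_dir, path, name or "none", split)
    if os.path.isdir(local_path):
        # The saved copy is detached from the Hub, so this load makes no network calls.
        return hf_datasets.load_from_disk(local_path)
    # Cache miss: download once, then save a detached copy for future runs.
    dataset = hf_datasets.load_dataset(path, name, split=split)
    dataset.save_to_disk(local_path)
    return dataset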

@epwalsh (Member) left a comment:

LGTM, only minor comments

from ..tokenizer import Tokenizer

log = logging.getLogger(__name__)


def load_dataset(path, name, split, datasets_cache_dir: Optional[str] = None):
@epwalsh (Member) commented:

nit: missing type hints

Suggested change
def load_dataset(path, name, split, datasets_cache_dir: Optional[str] = None):
def load_dataset(path: str, name: Optional[str], split: str, datasets_cache_dir: Optional[str] = None):

-            ds_eval_dataset = task_class(tokenizer=tokenizer, **task_kwargs)  # type: ignore
+            ds_eval_dataset = task_class(
+                tokenizer=tokenizer, datasets_cache_dir=train_config.hf_datasets_cache_dir, **task_kwargs
+            )  # type: ignore
@epwalsh (Member) commented:

I think this # type: ignore is probably on the wrong line now

@2015aroras (Collaborator, Author) replied:

Intellisense puts it there and the type checking passes with it there (and fails without it there).

Comment on lines 23 to 36
    try:
        return load_dataset_from_disk(path, name, split, datasets_cache_dir)
    except FileNotFoundError:
        log.info(
            "Path %s name %s split %s not present in local dir %s, loading from online",
            path,
            name,
            split,
            datasets_cache_dir,
        )
    # Barrier here to stop the case where FS rank 0 saves the dataset to disk before some non-zero
    # ranks try getting the dataset from disk. This would cause those non-zero ranks to bypass
    # the next barrier and cause rank 0 to be stuck at that barrier.
    barrier()
@epwalsh (Member) commented:

This seems a little sketch, but I guess it's fine as long as all ranks either have or do not have the dataset on disk, which is probably a reasonable assumption?

@2015aroras (Collaborator, Author) replied:

9f2d9d2
I've changed the logic to be simpler and more robust against this case. Now the idea is that FS local rank 0 does its thing first (load from disk or online and cache), then every other rank follows afterwards. This is less optimized but simpler and safer imo.
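
As an illustration of this ordering, a minimal sketch (not the PR's actual code): load_dataset and barrier are the functions discussed above, while get_fs_local_rank() stands in for whatever rank helper the codebase uses.

def load_dataset_coordinated(path, name, split, datasets_cache_dir):
    # FS local rank 0 goes first: load from disk if cached, otherwise download and cache.
    if get_fs_local_rank() == 0:
        dataset = load_dataset(path, name, split, datasets_cache_dir)
    # Exactly one barrier per dataset, regardless of cache state.
    barrier()
    # Every other rank loads afterwards; by now the on-disk copy is guaranteed to exist.
    if get_fs_local_rank() != 0:
        dataset = load_dataset(path, name, split, datasets_cache_dir)
    return dataset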

@epwalsh (Member) replied:

I like that much better ✅

@2015aroras merged commit 41ed20a into main on Jun 14, 2024
12 checks passed
@2015aroras deleted the shanea/hf-save-to-disk-2 branch on June 14, 2024 at 21:42