
HF dataset loading optimizations #623

Merged: 11 commits merged into main on Jun 14, 2024
Conversation

@2015aroras (Collaborator) commented on Jun 13, 2024

Issue: Loading HF datasets for downstream evals has been slowing down the start of our runs because:

  1. Every process tries to load each HF dataset at the same time. This causes a lot of network traffic to one endpoint (potentially resulting in throttling) and possibly some contention over HF's on-disk dataset cache.
  2. Even when a dataset has already been cached locally, loading it still results in network calls because HF checks the Hub for changes to the data.

Fix: This PR tackles these issues by:

  1. Utilizing HF's load_from_disk and save_to_disk dataset methods to save a local copy of the datasets. These datasets are no longer associated with the online versions, and so loading them from disk does not result in network traffic (as far as I can tell).
  2. Making only the FS rank 0 process perform network calls for HF dataset loading. Doing this requires coordination between the processes using barrier(), but this seems to be less problematic. Note that barriers are invoked each time a dataset needs to be loaded (on the positive side, if all datasets are already in the cache then there are no barrier invocations).

In a 2-GPU interactive session, startup goes from 10 minutes without this new cache to 1 minute with the cache populated beforehand. I have already populated a copy of the cache at /net/weka/reviz/hf_datasets_cache.

I haven't yet verified that evaluation correctness is unaffected by this change (no GPUs were available), but I don't expect it to have a negative impact.

Edit: The barrier is now invoked exactly once per dataset.
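
For illustration, here is a minimal sketch of the detached-copy idea in point 1, assuming only the standard datasets API; the function name, directory layout, and "none" placeholder are hypothetical and not the PR's actual code:

import os
from typing import Optional

import datasets as hf_datasets


def load_cached_dataset(path: str, name: Optional[str], split: str, cache_dir: str):
    # Hypothetical layout: one directory per (path, name, split) under the shared cache.
    local_path = os.path.join(cache_dir, path, name or "none", split)
    if os.path.isdir(local_path):
        # The saved copy is detached from the Hub, so this load makes no network calls.
        return hf_datasets.load_from_disk(local_path)
    # Cache miss: download once, then save a detached copy for future runs.
    dataset = hf_datasets.load_dataset(path, name, split=split)
    dataset.save_to_disk(local_path)
    return dataset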

@epwalsh (Member) left a comment:

LGTM, only minor comments

from ..tokenizer import Tokenizer

log = logging.getLogger(__name__)


def load_dataset(path, name, split, datasets_cache_dir: Optional[str] = None):
@epwalsh (Member) commented:

nit: missing type hints

Suggested change
def load_dataset(path, name, split, datasets_cache_dir: Optional[str] = None):
def load_dataset(path: str, name: Optional[str], split: str, datasets_cache_dir: Optional[str] = None):

-            ds_eval_dataset = task_class(tokenizer=tokenizer, **task_kwargs)  # type: ignore
+            ds_eval_dataset = task_class(
+                tokenizer=tokenizer, datasets_cache_dir=train_config.hf_datasets_cache_dir, **task_kwargs
+            )  # type: ignore
@epwalsh (Member) commented:

I think this # type: ignore is probably on the wrong line now

@2015aroras (Collaborator, Author) replied:

Intellisense puts it there and the type checking passes with it there (and fails without it there).

Comment on lines 23 to 36
    try:
        return load_dataset_from_disk(path, name, split, datasets_cache_dir)
    except FileNotFoundError:
        log.info(
            "Path %s name %s split %s not present in local dir %s, loading from online",
            path,
            name,
            split,
            datasets_cache_dir,
        )
    # Barrier here to stop the case where FS rank 0 saves the dataset to disk before some non-zero
    # ranks try getting the dataset from disk. This would cause those non-zero ranks to bypass
    # the next barrier and cause rank 0 to be stuck at that barrier.
    barrier()
@epwalsh (Member) commented:

This seems a little sketch, but I guess it's fine as long as all ranks either have or do not have the dataset on disk, which is probably a reasonable assumption?

@2015aroras (Collaborator, Author) replied:

9f2d9d2
I've changed the logic to be simpler and more robust against this case. Now the idea is that FS local rank 0 does its thing first (load from disk or online and cache), then every other rank follows afterwards. This is less optimized but simpler and safer imo.
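
As an illustration of this ordering, a minimal sketch (not the PR's actual code): load_dataset and barrier are the functions discussed above, while get_fs_local_rank() stands in for whatever rank helper the codebase uses.

def load_dataset_coordinated(path, name, split, datasets_cache_dir):
    # FS local rank 0 goes first: load from disk if cached, otherwise download and cache.
    if get_fs_local_rank() == 0:
        dataset = load_dataset(path, name, split, datasets_cache_dir)
    # Exactly one barrier per dataset, regardless of cache state.
    barrier()
    # Every other rank loads afterwards; by now the on-disk copy is guaranteed to exist.
    if get_fs_local_rank() != 0:
        dataset = load_dataset(path, name, split, datasets_cache_dir)
    return dataset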

@epwalsh (Member) replied:

I like that much better ✅

@2015aroras merged commit 41ed20a into main on Jun 14, 2024
12 checks passed
@2015aroras deleted the shanea/hf-save-to-disk-2 branch on June 14, 2024 at 21:42