New Progress Bar, Backoff, Batching #165

Open · wants to merge 16 commits into main

Conversation

@soldni (Member) commented May 23, 2024

This PR adds a few nice features to BaseParallelProcessor:

  • Refactors the progress bar out of parallel.py
  • Adds a PoolWithDebug wrapper around multiprocessing.Pool that transparently disables multiprocessing when debugging (sketched below)
  • Uses the backoff library to implement backoff and retries in case of failure
  • Adds the ability to create parallel processors that work in batch mode (the tokenizer processor will be switched to this new functionality later)
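For reference, a minimal sketch of the PoolWithDebug idea (the class name comes from this PR, but the body below is illustrative, not the actual implementation):

import multiprocessing


class _ImmediateResult:
    """Mimics the AsyncResult interface for the serial path."""

    def __init__(self, value):
        self._value = value

    def get(self, timeout=None):
        return self._value


class PoolWithDebug:
    """Runs tasks in a multiprocessing.Pool normally, or serially in the
    current process when debug=True, so breakpoints and pdb keep working."""

    def __init__(self, processes=1, debug=False):
        self.debug = debug
        self._pool = None if debug else multiprocessing.Pool(processes=processes)

    def apply_async(self, fn, args=(), kwds=None):
        if self.debug:
            # run synchronously in this process instead of dispatching
            return _ImmediateResult(fn(*args, **(kwds or {})))
        return self._pool.apply_async(fn, args, kwds or {})

    def close(self):
        if self._pool is not None:
            self._pool.close()

    def join(self):
        if self._pool is not None:
            self._pool.join()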

@soldni requested review from undfined and kyleclo, and removed the request for undfined (May 23, 2024 17:10)
@undfined (Contributor) left a comment:

Looks nice! A few questions, mostly for my understanding. Also curious whether we can run the BPP without a progress bar? It's small, but it has some perf impact in aggregate, and reporting per "worker" doesn't seem useful when distributing compute.

@@ -2,4 +2,4 @@
 PATH=/home/vscode/.cargo/bin:$PATH
 cd dolma
-source /home/vscode/miniforge3/bin/activate && pip install cmake "maturin[patchelf]>=1.1,<2.0"
+source /home/vscode/miniforge3/bin/activate && pip install cmake "maturin>=1.5,<2.0"
Contributor:

🙏

@@ -30,6 +31,8 @@ dependencies = [
     "numpy",
     "necessary>=0.4.3",
     "charset-normalizer>=3.2.0",
+    "zstandard>=0.20.0",
+    "backoff>=2.0.0",
Contributor:

Is this version pin required? There have been two minor releases since 2.0; the latest is 2.2.1.

Comment on lines +261 to +263
def __radd__(self: BPP, other: BPP) -> BPP:
"""Combine two parallel processors into one."""
return other.__add__(self)
Contributor:

Can you describe when this is useful?
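(For context: Python dispatches to __radd__ when the left operand's __add__ returns NotImplemented, and — tried first — when the right operand is an instance of a subclass of the left operand's type that overrides the reflected method. A standalone illustration with hypothetical classes, not PR code:)

class Base:
    def __add__(self, other):
        return ("add", type(other).__name__)


class Child(Base):
    def __radd__(self, other):
        return ("radd", type(other).__name__)


# The right operand is a subclass that overrides the reflected method,
# so Python tries Child.__radd__ before Base.__add__:
print(Base() + Child())  # ('radd', 'Base')
print(Base() + Base())   # ('add', 'Base')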

Comment on lines +278 to +280
"""Process multiple files. Naively calls process_single for each file, but can be overridden."""
for src_path, dst_path, single_kwargs in zip(source_paths, destination_paths, kwargs):
cls.process_single(source_path=src_path, destination_path=dst_path, queue=queue, **single_kwargs)
Contributor:

Maybe include an example of overriding this processing method?
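(A hedged sketch of such an override — the method name, signature, and increment_progressbar hook are guesses based on the snippet above, and the body is a stand-in for real batch-level work:)

from typing import Any, Dict, List
from multiprocessing import Queue


class MyBatchProcessor(BaseParallelProcessor):  # base class from this PR
    @classmethod
    def process_batch(
        cls,
        source_paths: List[str],
        destination_paths: List[str],
        queue: Queue,
        kwargs: List[Dict[str, Any]],
    ):
        # Hypothetical override: handle the whole batch at once (e.g. to
        # share tokenizer state across files) instead of delegating to
        # process_single per file; one output file per input file.
        for src_path, dst_path in zip(source_paths, destination_paths):
            with open(src_path) as src, open(dst_path, "w") as dst:
                for line in src:
                    dst.write(line.upper())  # stand-in for real processing
            cls.increment_progressbar(queue, files=1)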

cls.get_logger().warning(message)

@classmethod
def _process_batch_and_save_status(
Contributor:

Is this actually saving status or just the metadata outcome?

len(all_process_kwargs),
)
# no need to be wasteful with processes: we only need as many cores as the number of batches
num_processes = min(self.num_processes, len(batches))
Contributor:

Can you have more batches than available procs/cores?
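(On whether you can have more batches than cores: yes — multiprocessing.Pool queues pending tasks and feeds them to workers as they free up, so the min() above only avoids spawning workers that would sit idle. Standalone illustration, not PR code:)

import multiprocessing
import os


def work(batch_id):
    # return which process handled this "batch"
    return batch_id, os.getpid()


if __name__ == "__main__":
    # 8 batches but only 2 worker processes: tasks queue up, and the
    # same two PIDs show up across all results.
    with multiprocessing.Pool(processes=2) as pool:
        for batch_id, pid in pool.map(work, range(8)):
            print(batch_id, pid)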

Args:
iterables (Iterable[T]): One or more iterables to group into batches.
batch_size (int): The size of each batch. Defaults to 1.
drop_last (bool): Whether to drop the last batch if it is smaller than `batch_size`. Defaults to False.
Contributor:

What scenario would you want to drop the last batch?
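(One common scenario: downstream code that assumes exactly batch_size items per batch — e.g. fixed-shape array construction or per-batch cost accounting — where a short tail batch would need special-casing. A usage sketch, assuming batch_iterator is importable and behaves as in the tests below:)

data = [1, 2, 3, 4, 5]

# keep the short tail: the last batch has a single element
print(list(batch_iterator(data, batch_size=2)))
# -> [[(1, 2)], [(3, 4)], [(5,)]]

# drop it when every batch must have exactly batch_size items
print(list(batch_iterator(data, batch_size=2, drop_last=True)))
# -> [[(1, 2)], [(3, 4)]]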

Comment on lines +190 to +191
if not hasattr(self, "PROGRESS_BAR_CLS"):
self.PROGRESS_BAR_CLS = BaseProgressBar.from_increment_function(self)
Contributor:

Can this be run without a progress bar at all?
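(If PROGRESS_BAR_CLS is the intended hook, a no-op subclass might be enough to opt out entirely. A sketch under the assumption that BaseProgressBar exposes lifecycle and update methods roughly like these — the actual interface may differ:)

class NullProgressBar(BaseProgressBar):
    """Hypothetical: swallow all updates so nothing is rendered."""

    def start(self):
        pass

    def stop(self):
        pass

    def update(self, **counts):
        pass  # drop all increments


# hypothetical usage:
# class MyProcessor(BaseParallelProcessor):
#     PROGRESS_BAR_CLS = NullProgressBar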

Comment on lines +89 to +127
class TestBatching(TestCase):
def test_batching(self):
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 0]

output = list(batch_iterator(a, b, batch_size=2))
self.assertEqual(len(output), 3)
self.assertEqual(output[0], [(1, 2), (6, 7)])
self.assertEqual(output[1], [(3, 4), (8, 9)])
self.assertEqual(output[2], [(5,), (0,)])

def test_single_batching(self):
a = [1, 2, 3, 4, 5]

output = list(batch_iterator(a, batch_size=2))

self.assertEqual(len(output), 3)
self.assertEqual(output[0], [(1, 2)])
self.assertEqual(output[1], [(3, 4)])
self.assertEqual(output[2], [(5,)])

def test_longer_batch_than_slice(self):
a = list(range(3))
b = list(range(3, 6))
c = list(range(6, 9))

output = list(batch_iterator(a, b, c, batch_size=4))

self.assertEqual(len(output), 1)
self.assertEqual(output[0], [(0, 1, 2), (3, 4, 5), (6, 7, 8)])

def test_drop_last(self):
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 0]

output = list(batch_iterator(a, b, batch_size=2, drop_last=True))
self.assertEqual(len(output), 2)
self.assertEqual(output[0], [(1, 2), (6, 7)])
self.assertEqual(output[1], [(3, 4), (8, 9)])

-dolma_tests_skip = os.environ.get(DOLMA_TESTS_SKIP_AWS_ENV_VAR)
 LOGGER.info(f"{DOLMA_TESTS_SKIP_AWS_ENV_VAR}: {dolma_tests_skip}")
-return (dolma_tests_skip or "false").lower() == "true"
+dolma_tests_skip = yaml.safe_load(os.environ.get(DOLMA_TESTS_SKIP_AWS_ENV_VAR) or "false")
Contributor:

safe_load is duplicative if we're casting to bool() regardless below, right? More a nit than anything; I guess it doesn't hurt.
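(For reference, the behavioral difference: yaml.safe_load also accepts YAML booleans like "yes"/"on" and integers like "1", while the old comparison only matched "true" case-insensitively. Standalone illustration:)

import yaml

for raw in ["true", "True", "yes", "1", "false", "0"]:
    old = raw.lower() == "true"          # old string comparison
    new = bool(yaml.safe_load(raw))      # new YAML parsing
    print(f"{raw!r}: old={old}, new={new}")

# 'true':  old=True,  new=True
# 'True':  old=True,  new=True
# 'yes':   old=False, new=True
# '1':     old=False, new=True  (safe_load('1') == 1)
# 'false': old=False, new=False
# '0':     old=False, new=False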
