Benchmark restructuring and memory profiling (#642)
* Refactor compression benchmarks to runnable design

Plan: each benchmark exports two functions, filename_setup and
filename_run, which designate how the benchmark is run. The setup
function takes any required parameters and returns a processed param
tuple to be passed as the argument to the runner. The runner is designed
to be as slim as possible so we only measure the crucial code. Then, we
can externally call these functions and time/profile/benchmark the
runtime of the function call, allowing much finer control. Further
refactors along this design are coming soon.
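As a sketch, a runner following this setup/run convention might look like the following (the `run_benchmark` helper and the `example_*` benchmark pair are hypothetical illustrations, not the actual runner added in the later commits):

```python
import time


def run_benchmark(setup, run, *args, **kwargs):
    """Time only the run function; setup cost is excluded."""
    params = setup(*args, **kwargs)
    start = time.perf_counter()
    run(params)
    return time.perf_counter() - start


# A dummy benchmark pair following the filename_setup / filename_run convention.
def example_setup(n):
    # Do all expensive preparation here and hand the runner a param tuple.
    return (list(range(n)),)


def example_run(params):
    (data,) = params
    sum(data)  # the "crucial code" being measured


elapsed = run_benchmark(example_setup, example_run, 10_000)
print(f"example benchmark ran in {elapsed:.6f}s")
```

Because setup and run are plain functions, the same pair can be timed, profiled, or memory-sampled by swapping out the harness without touching the benchmark itself.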

* Refactor dataset iteration benchmarks

Following the previous commit, this refactors the benchmark_dataset_iter
into separate files with the same design as the now-refactored
`benchmark_compress_hub.py`. One step closer to full control.

* Add full dataset compute benchmark

It'll be nice to keep track of this as well. Might be subsumed by the
dataset_comparison file, but I'll get to that next.

* Refactor benchmark_random_access into new format

Improves `benchmark_access_hub_full.py` and uses that as a base for
`benchmark_access_hub_slice.py` which replaces functionality from
`benchmark_random_access.py` (now deleted).

* Remove unused line in benchmark_iterate_hub TF

* Local variants of iteration benchmarks using tfds

* Remove dataset compare benchmarks

Existing refactored benchmarks now cover all cases once present in this
file.

* Rename remaining un-refactored benchmarks "legacy"

Until these can be converted, I want a distinction between what is and
isn't compatible with the new runner (next few commits). This will
probably be resolved before merging.

* Fix minor issues in total access benchmarks

* Initial prototype for benchmark runner notebook

* Update benchmark runner notebook

* Add psutil to benchmark requirements

* Fix pytorch and tensorflow local benchmarks

* Add network benchmarking and expand suites

* Update .gitignore with benchmark local data

* Auto-fix issues with black

* Add time to network monitor output to plot better
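The memory profiling in the title relies on psutil (added to the benchmark requirements above); as a rough illustration of the idea that runs without extra dependencies, the peak allocation during a run function can be sampled with the stdlib tracemalloc instead (the `profile_peak_memory` helper and toy benchmark below are hypothetical):

```python
import tracemalloc


def profile_peak_memory(run, params):
    # Trace Python allocations only while the run function executes.
    tracemalloc.start()
    run(params)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak  # peak bytes allocated during the run


def toy_run(params):
    (nbytes,) = params
    buf = bytearray(nbytes)  # allocate nbytes to make the peak visible


peak = profile_peak_memory(toy_run, (1_000_000,))
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

Note that tracemalloc only sees Python-level allocations; psutil's process-level RSS also captures memory allocated by native libraries, which matters for numpy-heavy benchmarks like these.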
benchislett committed Mar 30, 2021
1 parent 286eae2 commit da105b0
Showing 18 changed files with 527 additions and 421 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -195,3 +195,7 @@ cov.xml
hub/api/cov.xml
hub/api/nested_seq
nested_seq

# Benchmark local test data (auto-downloaded)
benchmarks/hub_data
benchmarks/torch_data
16 changes: 16 additions & 0 deletions benchmarks/benchmark_access_hub_full.py
@@ -0,0 +1,16 @@
from hub import Dataset


def benchmark_access_hub_full_setup(dataset_name, field=None):
    dset = Dataset(dataset_name, cache=False, storage_cache=False, mode="r")

    keys = dset.keys
    if field is not None:
        keys = (field,)
    return (dset, keys)


def benchmark_access_hub_full_run(params):
    dset, keys = params
    for k in keys:
        dset[k].compute()
16 changes: 16 additions & 0 deletions benchmarks/benchmark_access_hub_slice.py
@@ -0,0 +1,16 @@
from hub import Dataset


def benchmark_access_hub_slice_setup(dataset_name, slice_bounds, field=None):
    dset = Dataset(dataset_name, cache=False, storage_cache=False, mode="r")

    keys = dset.keys
    if field is not None:
        keys = (field,)
    return (dset, slice_bounds, keys)


def benchmark_access_hub_slice_run(params):
    dset, slice_bounds, keys = params
    for k in keys:
        dset[k][slice_bounds[0] : slice_bounds[1]].compute()
28 changes: 28 additions & 0 deletions benchmarks/benchmark_compress_hub.py
@@ -0,0 +1,28 @@
import numpy as np
from PIL import Image

import hub


def benchmark_compress_hub_setup(
    times, image_path="./images/compression_benchmark_image.png"
):
    img = Image.open(image_path)
    arr = np.array(img)
    ds = hub.Dataset(
        "./data/bench_png_compression",
        mode="w",
        shape=times,
        schema={"image": hub.schema.Image(arr.shape, compressor="png")},
    )

    batch = np.zeros((times,) + arr.shape, dtype="uint8")
    for i in range(times):
        batch[i] = arr

    return (ds, times, batch)


def benchmark_compress_hub_run(params):
    ds, times, batch = params
    ds["image", :times] = batch
16 changes: 16 additions & 0 deletions benchmarks/benchmark_compress_pillow.py
@@ -0,0 +1,16 @@
from PIL import Image
from io import BytesIO


def benchmark_compress_pillow_setup(
    times, image_path="./images/compression_benchmark_image.png"
):
    img = Image.open(image_path)
    return (img, times)


def benchmark_compress_pillow_run(params):
    img, times = params
    for _ in range(times):
        b = BytesIO()
        img.save(b, format="png")
47 changes: 0 additions & 47 deletions benchmarks/benchmark_compress_time.py

This file was deleted.
