# Dask Modeled Auto-Chunking

This experiment aims to compare the behavior of Dask's auto-chunking feature with our trained models.
In this notebook you will find:
- The problem statement
- The data collection for the experiment
- The evaluation of the experiment results

## Problem Statement

Dask is widely recognized for its ability to parallelize computations, particularly when processing large datasets.
Efficient memory management becomes crucial in memory-intensive operations, where Dask’s strategy of chunking data into smaller blocks plays a pivotal role in both performance and memory usage.

Dask offers an auto-chunking feature (`chunks='auto'`) that chooses chunk sizes from the array's shape and dtype and a configured target chunk size.
However, in certain complex computations, particularly in specialized fields like geophysics or computational modeling, it is crucial to assess how Dask’s default auto-chunking compares to custom-trained models that are designed for domain-specific operations.

In this experiment, we will compare Dask’s auto-chunking feature with a custom pre-trained model used for envelope and GST3D. Specifically, we aim to evaluate:

- The performance of Dask’s auto-chunking in managing chunk sizes dynamically for memory-intensive tasks.
- The behavior of a custom model with manually optimized chunk sizes for envelope and GST3D operations.
- The trade-offs between Dask’s auto-chunking versus the custom approach in terms of memory efficiency and computational performance.
- Whether the auto-chunking approach is viable for highly specialized operations such as GST3D, or if manually tuned models offer significant benefits.

By analyzing these factors, this experiment seeks to uncover the most effective chunking strategy for domain-specific, memory-intensive Dask operations, and to identify scenarios where Dask’s auto-chunking may fall short compared to manually optimized alternatives.

## Data Collection

In this section, we will outline the steps needed to collect the necessary data for our experiment.
The process is organized into the following steps:

1. **Set up the environment:**
  - Configure the environment variables and global constants used during the experiment.

2. **Set up dependencies:**
  - Install the required dependencies in the virtual environment running this notebook.

3. **Set up the output directory:**
  - Create the output directory where the experiment results are saved.

4. **Generate synthetic seismic data:**
  - Generate synthetic seismic data for a given shape.

5. **Collect data for each operator:**
  - Apply each operator to the synthetic data using both Dask auto-chunking and the chunk size derived from our model.

After completing these steps, we will have the performance data needed to compare both approaches.

### Setup Environment

During the environment setup, we need to:
- Properly configure `PYTHONPATH`
- Set up dependencies

Below, we extend `sys.path` (the in-process equivalent of `PYTHONPATH`) so the notebook can import the tools we coded for the experiments.

In [1]:
import os
import sys

helpers_path = os.path.abspath('../libs/helpers')
traceq_path = os.path.abspath('../libs/traceq')

if helpers_path not in sys.path:
    sys.path.append(helpers_path)
if traceq_path not in sys.path:
    sys.path.append(traceq_path)

print(sys.path)

['/home/delucca/.pyenv/versions/3.10.14/lib/python310.zip', '/home/delucca/.pyenv/versions/3.10.14/lib/python3.10', '/home/delucca/.pyenv/versions/3.10.14/lib/python3.10/lib-dynload', '', '/home/delucca/.pyenv/versions/3.10.14/envs/dask-auto-chunking/lib/python3.10/site-packages', '/home/delucca/src/unicamp/msc/dask-auto-chunking/libs/helpers', '/home/delucca/src/unicamp/msc/dask-auto-chunking/libs/traceq']


In [3]:
!pip install bokeh



In [2]:
from pprint import pprint

NUM_INLINES = 600
NUM_XLINES = 600
NUM_SAMPLES = 600

LOG_TRANSPORTS = ['CONSOLE', 'FILE']
LOG_LEVEL = 'DEBUG'

print('Experiment config:')
pprint({
    'NUM_INLINES': NUM_INLINES,
    'NUM_XLINES': NUM_XLINES,
    'NUM_SAMPLES': NUM_SAMPLES,
    'LOG_TRANSPORTS': LOG_TRANSPORTS,
    'LOG_LEVEL': LOG_LEVEL,
}, indent=2, sort_dicts=True)

Experiment config:
{ 'LOG_LEVEL': 'DEBUG',
  'LOG_TRANSPORTS': ['CONSOLE', 'FILE'],
  'NUM_INLINES': 600,
  'NUM_SAMPLES': 600,
  'NUM_XLINES': 600}


In [3]:
import uuid
import os

from datetime import datetime

EXPERIMENT_ID = f'008-{datetime.now().strftime("%Y%m%d%H%M%S")}-{uuid.uuid4().hex[:6]}'
OUTPUT_DIR = f'./output/{EXPERIMENT_ID}'

os.makedirs(OUTPUT_DIR)

OUTPUT_DIR

'./output/008-20241008160251-5ebf1a'
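Since the output directory is meant to hold the experiment results, it can help to store the configuration next to them. The sketch below is one way to do that with the standard library; `save_experiment_config` and the `config.json` file name are hypothetical and not part of the notebook's helpers.

```python
import json
import os
import tempfile


def save_experiment_config(output_dir, config, filename="config.json"):
    """Write the experiment configuration as JSON inside output_dir.

    Hypothetical helper: the name and file layout are assumptions.
    """
    os.makedirs(output_dir, exist_ok=True)
    config_path = os.path.join(output_dir, filename)
    with open(config_path, "w", encoding="utf-8") as file:
        json.dump(config, file, indent=2, sort_keys=True)
    return config_path


# Demonstration in a throwaway directory; in the notebook this would be OUTPUT_DIR.
demo_dir = tempfile.mkdtemp()
demo_config = {"NUM_INLINES": 600, "NUM_XLINES": 600, "NUM_SAMPLES": 600}
saved_path = save_experiment_config(demo_dir, demo_config)

with open(saved_path, encoding="utf-8") as file:
    print(json.load(file))
```

Keeping the configuration with the results makes it easier to tell later which shape and log settings produced a given output folder.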

In [4]:
import dask

from bokeh.io import output_notebook
from helpers.datasets import generate_seismic_data

# Ensure Bokeh works properly in Jupyter
output_notebook()

# Disable GPU diagnostics in Dask
dask.config.set({"distributed.diagnostics.nvml": False})

# Create a synthetic seismic experiment
DATA_OUTPUT_DIR = f'{OUTPUT_DIR}/experiment'
synthetic_data_path = generate_seismic_data(
    inlines=NUM_INLINES,
    xlines=NUM_XLINES,
    samples=NUM_SAMPLES,
    output_dir=DATA_OUTPUT_DIR,
)

2024-10-08 16:02:53 - generate-seismic-data - INFO - Generating synthetic data for shape (600, 600, 600)


## Envelope

### Auto-chunking

In [10]:
import dask.array as da
import time
from dask.diagnostics import ResourceProfiler
from dask.distributed import Client
from helpers.dask_operators import envelope_from_ndarray, load_segy

client = Client(n_workers=1, threads_per_worker=1, memory_limit='16GB')

# Use Dask Profiler to monitor resource usage
resource_profiler = ResourceProfiler()

with resource_profiler:
    start_time = time.time()
    try:
        synthetic_data = load_segy(synthetic_data_path)
        print("Data shape: ", synthetic_data.shape)

        X = da.from_array(synthetic_data, chunks='auto')
        print("Chunks: ", X.chunks)
        print("Number of chunks along each axis:", [len(c) for c in X.chunks])

        result = envelope_from_ndarray(X)
    finally:
        end_time = time.time()
        client.close()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

resource_visualization = resource_profiler.visualize()
display(resource_visualization)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 43763 instead


Data shape:  (600, 600, 600)
Chunks:  ((322, 278), (322, 278), (322, 278))
Number of chunks along each axis: [2, 2, 2]


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Execution time: 8.03 seconds
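The `(322, 278)` split reported above can be checked with simple arithmetic. The sketch below assumes 4-byte samples (for example `float32`) and Dask's default target of 128 MiB per chunk (the `array.chunk-size` setting); neither is printed by the notebook, so treat both as assumptions. Under them, 322 is the largest cube edge that stays within the target, which is consistent with the reported chunks.

```python
ITEM_SIZE_BYTES = 4                     # assumption: float32 samples
TARGET_CHUNK_BYTES = 128 * 1024 ** 2    # assumption: Dask's default array.chunk-size
SHAPE = (600, 600, 600)


def largest_cubic_edge(target_bytes, item_size):
    """Largest integer edge whose cube of items fits within target_bytes."""
    edge = int(round((target_bytes / item_size) ** (1 / 3)))
    # Correct any rounding error in either direction.
    while edge ** 3 * item_size > target_bytes:
        edge -= 1
    while (edge + 1) ** 3 * item_size <= target_bytes:
        edge += 1
    return edge


def split_axis(length, edge):
    """Split an axis into blocks of `edge`, with a smaller final block if needed."""
    full_blocks, remainder = divmod(length, edge)
    return (edge,) * full_blocks + ((remainder,) if remainder else ())


edge = largest_cubic_edge(TARGET_CHUNK_BYTES, ITEM_SIZE_BYTES)
chunks = tuple(split_axis(length, edge) for length in SHAPE)
chunk_mib = edge ** 3 * ITEM_SIZE_BYTES / 1024 ** 2

print(edge)
print(chunks)
print(f"{chunk_mib:.1f} MiB per full chunk")
```

This is only a consistency check. Dask's actual heuristic is more involved, but it explains why the chunk layout does not depend on the number of workers or their memory limit.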


### Modeled chunking

In [7]:
import pickle
import numpy as np
import pandas as pd
from helpers.dask_operators import load_segy

envelope_model = None

with open('../models/memory_usage/output/models/envelope.pkl', 'rb') as file:
    envelope_model = pickle.load(file)


def extract_features(df):
    # Interaction
    df["inline_crossline"] = df["inlines"] * df["crosslines"]
    df["inline_sample"] = df["inlines"] * df["samples"]
    df["crossline_sample"] = df["crosslines"] * df["samples"]
    df["volume"] = df["inlines"] * df["crosslines"] * df["samples"]

    # Logarithmic and Exponential Transformations
    df['log_inlines'] = np.log1p(df['inlines'])
    df['log_crosslines'] = np.log1p(df['crosslines'])
    df['log_samples'] = np.log1p(df['samples'])

    # Ratios
    df['inline_to_crossline'] = df['inlines'] / (df['crosslines'] + 1)
    df['inline_to_sample'] = df['inlines'] / (df['samples'] + 1)
    df['crossline_to_sample'] = df['crosslines'] / (df['samples'] + 1)

    # Statistical Aggregates
    df['mean_inline_crossline'] = df[['inlines', 'crosslines']].mean(axis=1)
    df['std_inline_crossline'] = df[['inlines', 'crosslines']].std(axis=1)

    return df


synthetic_data = load_segy(synthetic_data_path)
target_df = pd.DataFrame([synthetic_data.shape], columns=['inlines', 'crosslines', 'samples'])
target_df = extract_features(target_df)
expected_memory_usage = envelope_model.predict(target_df)[0]

print(f"The expected memory usage is {expected_memory_usage:.2f} KB for the target shape {synthetic_data.shape}")

The expected memory usage is 6462068.00 KB for the target shape (600, 600, 600)


In [12]:
def get_optimal_chunk_size(shape, expected_memory_usage, client):
    # Get scheduler information to retrieve worker details
    scheduler_info = client.scheduler_info()
    workers = scheduler_info['workers']
    num_workers = len(workers)

    # Initialize variables for worker resources
    total_memory = 0

    for worker, details in workers.items():
        memory_limit = details['memory_limit']
        total_memory += memory_limit

    # Memory per worker (convert to GB)
    memory_per_worker_gb = (total_memory / num_workers) / 1e9

    print(f"Total Workers: {num_workers}")
    print(f"Memory per Worker: {memory_per_worker_gb:.2f} GB")

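    # Predicted memory usage is in KB, so convert it to GB
    # Note: this divides by 1024 ** 2 (KiB to GiB), while memory_per_worker_gb above
    # divides by 1e9 (bytes to decimal GB), so the comparison mixes units by about 7%
    expected_memory_usage_gb = expected_memory_usage / (1024 ** 2)

    # Calculate the optimal chunk size
    # If memory usage exceeds the per-worker memory, we need to reduce the chunk size
    if expected_memory_usage_gb > memory_per_worker_gb:
        print("Expected memory usage exceeds memory per worker. Reducing chunk size.")
        chunk_size_ratio = memory_per_worker_gb / expected_memory_usage_gb
        # Every axis is scaled by the same ratio, so chunk volume shrinks by about ratio ** 3
        chunk_size = tuple(int(dim * chunk_size_ratio) for dim in shape)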
    else:
        # If memory usage is within limits, use full shape as chunk
        chunk_size = shape

    print(f"Optimal Chunk Size: {chunk_size}")
    return chunk_size
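The function above multiplies every axis by the same memory ratio, so the chunk volume shrinks by roughly the cube of that ratio. The sketch below reproduces that rule without a Dask client and contrasts it with a cube-root variant, which is an alternative suggested here and not what the notebook ran. It assumes memory scales linearly with chunk volume; `memory_ratio`, `linear_chunk` and `cube_root_chunk` are hypothetical names.

```python
def memory_ratio(expected_kb, memory_per_worker_gb):
    """Worker memory divided by predicted usage, capped at 1 (same unit conversion as the notebook)."""
    expected_gb = expected_kb / (1024 ** 2)
    return min(1.0, memory_per_worker_gb / expected_gb)


def linear_chunk(shape, ratio):
    """Scale every axis by the ratio (the rule used in get_optimal_chunk_size)."""
    return tuple(int(dim * ratio) for dim in shape)


def cube_root_chunk(shape, ratio):
    """Scale every axis by ratio ** (1 / ndim) so the chunk volume shrinks by about `ratio`."""
    per_axis = ratio ** (1 / len(shape))
    return tuple(max(1, int(dim * per_axis)) for dim in shape)


def volume(shape):
    """Number of elements in a block of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


shape = (600, 600, 600)
expected_kb = 6462068.0  # model prediction reported above
for workers in (1, 5, 15, 25):
    ratio = memory_ratio(expected_kb, 16 / workers)
    linear = linear_chunk(shape, ratio)
    cubic = cube_root_chunk(shape, ratio)
    print(workers, linear, cubic, f"{volume(cubic) / volume(shape):.3f}")
```

The last column shows the fraction of the full volume held by one cube-root chunk, which stays close to the memory ratio. The linear rule is more conservative and produces many more, much smaller chunks.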

In [8]:
import time
import numpy as np
import dask.array as da
from dask.diagnostics import ResourceProfiler
from dask.distributed import Client
from helpers.dask_operators import envelope_from_ndarray, load_segy

client = Client(n_workers=1, threads_per_worker=1, memory_limit='16GB')

# Example usage based on your synthetic experiment shape and model's prediction
synthetic_data_shape = synthetic_data.shape  # Assuming shape is something like (inlines, crosslines, samples)
expected_memory_usage_kb = envelope_model.predict(target_df)[0]  # Model output in KB

optimal_chunk_size = get_optimal_chunk_size(synthetic_data_shape, expected_memory_usage_kb, client)
print(f"The optimal chunk size is {optimal_chunk_size}")

# Use Dask Profiler to monitor resource usage
resource_profiler = ResourceProfiler()

with resource_profiler:
    start_time = time.time()
    try:
        synthetic_data = load_segy(synthetic_data_path)
        print("Data shape: ", synthetic_data.shape)

        X = da.from_array(synthetic_data, chunks=optimal_chunk_size)
        print("Chunks: ", X.chunks)
        print("Number of chunks along each axis:", [len(c) for c in X.chunks])

        result = envelope_from_ndarray(X)
    finally:
        end_time = time.time()
        client.close()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

resource_visualization = resource_profiler.visualize()
display(resource_visualization)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 45081 instead


Total Workers: 1
Memory per Worker: 16.00 GB
Optimal Chunk Size: (600, 600, 600)
The optimal chunk size is (600, 600, 600)
Data shape:  (600, 600, 600)
Chunks:  ((600,), (600,), (600,))
Number of chunks along each axis: [1, 1, 1]


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Execution time: 5.43 seconds
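The measurement cells in this notebook all wrap the workload in the same start/stop timing code. A small context manager could factor that out; the sketch below uses `time.perf_counter`, which is generally better suited to interval measurement than `time.time`. The `timed` helper and its parameters are hypothetical, and the `sum(...)` call is only a stand-in for the real operator.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, mirroring the try/finally in the cells.
        results[label] = time.perf_counter() - start


timings = {}
with timed("stand-in workload", timings):
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} seconds")
```

Collecting the durations in a dictionary keyed by label also makes it straightforward to compare strategies once all runs have finished.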


## Evaluating with a few workers

In [22]:
n_workers = 5
n_threads = 5  # not passed to Client below; threads per worker are left to Dask's default

### Auto-chunking

In [23]:
client = Client(n_workers=n_workers, memory_limit=f'{16 / n_workers}GB')

# Use Dask Profiler to monitor resource usage
resource_profiler = ResourceProfiler()

with resource_profiler:
    start_time = time.time()
    try:
        synthetic_data = load_segy(synthetic_data_path)
        print("Data shape: ", synthetic_data.shape)

        X = da.from_array(synthetic_data, chunks='auto')
        print("Chunks: ", X.chunks)
        print("Number of chunks along each axis:", [len(c) for c in X.chunks])

        result = envelope_from_ndarray(X)
    finally:
        end_time = time.time()
        client.close()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

resource_visualization = resource_profiler.visualize()
display(resource_visualization)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 44111 instead


Data shape:  (600, 600, 600)
Chunks:  ((322, 278), (322, 278), (322, 278))
Number of chunks along each axis: [2, 2, 2]


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Execution time: 6.16 seconds


### Modeled chunking

In [24]:
client = Client(n_workers=n_workers, memory_limit=f'{16 / n_workers}GB')

synthetic_data_shape = synthetic_data.shape  # Assuming shape is something like (inlines, crosslines, samples)
expected_memory_usage_kb = envelope_model.predict(target_df)[0]  # Model output in KB

optimal_chunk_size = get_optimal_chunk_size(synthetic_data_shape, expected_memory_usage_kb, client)
print(f"The optimal chunk size is {optimal_chunk_size}")

# Use Dask Profiler to monitor resource usage
resource_profiler = ResourceProfiler()

with resource_profiler:
    start_time = time.time()
    try:
        synthetic_data = load_segy(synthetic_data_path)
        print("Data shape: ", synthetic_data.shape)

        X = da.from_array(synthetic_data, chunks=optimal_chunk_size)
        print("Chunks: ", X.chunks)
        print("Number of chunks along each axis:", [len(c) for c in X.chunks])

        result = envelope_from_ndarray(X)
    finally:
        end_time = time.time()
        client.close()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

resource_visualization = resource_profiler.visualize()
display(resource_visualization)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 36163 instead


Total Workers: 5
Memory per Worker: 3.20 GB
Expected memory usage exceeds memory per worker. Reducing chunk size.
Optimal Chunk Size: (311, 311, 311)
The optimal chunk size is (311, 311, 311)
Data shape:  (600, 600, 600)
Chunks:  ((311, 289), (311, 289), (311, 289))
Number of chunks along each axis: [2, 2, 2]


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Execution time: 6.29 seconds


## Evaluating with many workers

In [29]:
n_workers = 15
n_threads = 15  # not passed to Client below; threads per worker are left to Dask's default

### Auto-chunking

In [30]:
client = Client(n_workers=n_workers, memory_limit=f'{16 / n_workers}GB')

# Use Dask Profiler to monitor resource usage
resource_profiler = ResourceProfiler()

with resource_profiler:
    start_time = time.time()
    try:
        synthetic_data = load_segy(synthetic_data_path)
        print("Data shape: ", synthetic_data.shape)

        X = da.from_array(synthetic_data, chunks='auto')
        print("Chunks: ", X.chunks)
        print("Number of chunks along each axis:", [len(c) for c in X.chunks])

        result = envelope_from_ndarray(X)
    finally:
        end_time = time.time()
        client.close()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

resource_visualization = resource_profiler.visualize()
display(resource_visualization)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 46551 instead


Data shape:  (600, 600, 600)
Chunks:  ((322, 278), (322, 278), (322, 278))
Number of chunks along each axis: [2, 2, 2]


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Execution time: 7.61 seconds


### Modeled chunking

In [31]:
client = Client(n_workers=n_workers, memory_limit=f'{16 / n_workers}GB')

synthetic_data_shape = synthetic_data.shape  # Assuming shape is something like (inlines, crosslines, samples)
expected_memory_usage_kb = envelope_model.predict(target_df)[0]  # Model output in KB

optimal_chunk_size = get_optimal_chunk_size(synthetic_data_shape, expected_memory_usage_kb, client)
print(f"The optimal chunk size is {optimal_chunk_size}")

# Use Dask Profiler to monitor resource usage
resource_profiler = ResourceProfiler()

with resource_profiler:
    start_time = time.time()
    try:
        synthetic_data = load_segy(synthetic_data_path)
        print("Data shape: ", synthetic_data.shape)

        X = da.from_array(synthetic_data, chunks=optimal_chunk_size)
        print("Chunks: ", X.chunks)
        print("Number of chunks along each axis:", [len(c) for c in X.chunks])

        result = envelope_from_ndarray(X)
    finally:
        end_time = time.time()
        client.close()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

resource_visualization = resource_profiler.visualize()
display(resource_visualization)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 45081 instead


Total Workers: 15
Memory per Worker: 1.07 GB
Expected memory usage exceeds memory per worker. Reducing chunk size.
Optimal Chunk Size: (103, 103, 103)
The optimal chunk size is (103, 103, 103)
Data shape:  (600, 600, 600)
Chunks:  ((103, 103, 103, 103, 103, 85), (103, 103, 103, 103, 103, 85), (103, 103, 103, 103, 103, 85))
Number of chunks along each axis: [6, 6, 6]


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Execution time: 10.85 seconds


## Evaluating with a large number of workers

In [32]:
n_workers = 25
n_threads = 25  # not passed to Client below; threads per worker are left to Dask's default

### Auto-chunking

In [33]:
client = Client(n_workers=n_workers, memory_limit=f'{16 / n_workers}GB')

# Use Dask Profiler to monitor resource usage
resource_profiler = ResourceProfiler()

with resource_profiler:
    start_time = time.time()
    try:
        synthetic_data = load_segy(synthetic_data_path)
        print("Data shape: ", synthetic_data.shape)

        X = da.from_array(synthetic_data, chunks='auto')
        print("Chunks: ", X.chunks)
        print("Number of chunks along each axis:", [len(c) for c in X.chunks])

        result = envelope_from_ndarray(X)
    finally:
        end_time = time.time()
        client.close()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

resource_visualization = resource_profiler.visualize()
display(resource_visualization)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 45843 instead


Data shape:  (600, 600, 600)
Chunks:  ((322, 278), (322, 278), (322, 278))
Number of chunks along each axis: [2, 2, 2]


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.
2024-10-08 16:13:54,091 - distributed.diskutils - ERROR - Failed to remove '/tmp/dask-scratch-space/worker-tiny0s31/storage' (failed in <built-in function rmdir>): [Errno 2] No such file or directory: 'storage'
2024-10-08 16:13:54,091 - distributed.diskutils - ERROR - Failed to remove '/tmp/dask-scratch-space/worker-tiny0s31' (failed in <built-in function rmdir>): [Errno 2] No such file or directory: '/tmp/dask-scratch-space/worker-tiny0s31'
2024-10-08 16:13:55,380 - distributed.diskutils - ERROR - Failed to remove '/tmp/dask-scratch-space/worker-jto6cnu8/storage' (failed in <built-in function rmdir>): [Errno 2] No such file or directory: 'storage'
2024-10-08 16:13:55,380 - distributed.diskutils - ERROR - Failed to remove 

KilledWorker: Attempted to run task ('absolute-74f73ac830f252c13b8a1ddebafcc028', 1, 1, 0) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:43725. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
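
One plausible reading of the `KilledWorker` error is that a single auto-sized chunk no longer fits in a worker once the 16 GB budget is split 25 ways. The estimate below scales the model's whole-volume prediction down to one `(322, 322, 322)` chunk, assuming memory grows linearly with the number of samples; that assumption is not verified in this notebook, and the unit handling mirrors `get_optimal_chunk_size`.

```python
FULL_SHAPE = (600, 600, 600)
AUTO_CHUNK = (322, 322, 322)    # largest block reported by chunks='auto'
EXPECTED_FULL_KB = 6462068.0    # model prediction for the full volume
TOTAL_MEMORY_GB = 16


def volume(shape):
    """Number of elements in a block of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


def estimated_chunk_gb(chunk, full_shape, expected_full_kb):
    """Scale the full-volume prediction by the chunk's share of samples (linear assumption)."""
    share = volume(chunk) / volume(full_shape)
    return expected_full_kb * share / (1024 ** 2)


chunk_gb = estimated_chunk_gb(AUTO_CHUNK, FULL_SHAPE, EXPECTED_FULL_KB)
for workers in (1, 5, 15, 25):
    limit_gb = TOTAL_MEMORY_GB / workers
    verdict = "fits" if chunk_gb <= limit_gb else "exceeds limit"
    print(f"{workers:>2} workers: limit {limit_gb:.2f} GB, chunk estimate {chunk_gb:.2f} GB -> {verdict}")
```

Under these assumptions the per-chunk estimate fits the 15-worker limit but not the 25-worker one, which matches the runs recorded here. It does not prove that memory was the cause; the worker logs would be needed for that.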

### Modeled chunking

In [34]:
client = Client(n_workers=n_workers, memory_limit=f'{16 / n_workers}GB')

synthetic_data_shape = synthetic_data.shape  # Assuming shape is something like (inlines, crosslines, samples)
expected_memory_usage_kb = envelope_model.predict(target_df)[0]  # Model output in KB

optimal_chunk_size = get_optimal_chunk_size(synthetic_data_shape, expected_memory_usage_kb, client)
print(f"The optimal chunk size is {optimal_chunk_size}")

# Use Dask Profiler to monitor resource usage
resource_profiler = ResourceProfiler()

with resource_profiler:
    start_time = time.time()
    try:
        synthetic_data = load_segy(synthetic_data_path)
        print("Data shape: ", synthetic_data.shape)

        X = da.from_array(synthetic_data, chunks=optimal_chunk_size)
        print("Chunks: ", X.chunks)
        print("Number of chunks along each axis:", [len(c) for c in X.chunks])

        result = envelope_from_ndarray(X)
    finally:
        end_time = time.time()
        client.close()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

resource_visualization = resource_profiler.visualize()
display(resource_visualization)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 41131 instead


Total Workers: 25
Memory per Worker: 0.64 GB
Expected memory usage exceeds memory per worker. Reducing chunk size.
Optimal Chunk Size: (62, 62, 62)
The optimal chunk size is (62, 62, 62)
Data shape:  (600, 600, 600)
Chunks:  ((62, 62, 62, 62, 62, 62, 62, 62, 62, 42), (62, 62, 62, 62, 62, 62, 62, 62, 62, 42), (62, 62, 62, 62, 62, 62, 62, 62, 62, 42))
Number of chunks along each axis: [10, 10, 10]


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Execution time: 11.96 seconds
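
The timings printed by the cells above can be collected into one table for comparison. The snippet below only restates numbers already recorded in this notebook (single runs, so small differences may not be significant); `None` marks the auto-chunking run that ended in `KilledWorker`.

```python
# (workers, auto-chunking seconds, modeled-chunking seconds, modeled chunks per axis)
RECORDED_RUNS = [
    (1, 8.03, 5.43, 1),
    (5, 6.16, 6.29, 2),
    (15, 7.61, 10.85, 6),
    (25, None, 11.96, 10),
]


def format_seconds(value):
    """Render a timing, or 'failed' when the run did not complete."""
    return "failed" if value is None else f"{value:.2f}"


def faster_strategy(auto_s, modeled_s):
    """Name the strategy with the lower recorded time; a failed run loses."""
    if auto_s is None:
        return "modeled"
    if modeled_s is None:
        return "auto"
    return "auto" if auto_s < modeled_s else "modeled"


rows = [
    (workers, format_seconds(auto_s), format_seconds(modeled_s), per_axis,
     faster_strategy(auto_s, modeled_s))
    for workers, auto_s, modeled_s, per_axis in RECORDED_RUNS
]

print(f"{'workers':>7} {'auto (s)':>9} {'modeled (s)':>12} {'chunks/axis':>12} {'faster':>8}")
for workers, auto_text, modeled_text, per_axis, faster in rows:
    print(f"{workers:>7} {auto_text:>9} {modeled_text:>12} {per_axis:>12} {faster:>8}")
```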
