# Tile Generation Benchmarks for Varied Chunk Sizes

## Explanation

In this notebook we compare the performance of tiling artificially generated Zarr data with different chunk sizes. The CMIP6 data provides an excellent real-world dataset, but its resolution is relatively low. To study the impact of higher-resolution data, we artificially generated Zarr stores and use them to explore the relationship between tile generation time and chunk size.
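
To make the link between resolution and chunk size concrete, here is a rough sketch of how the size of a single chunk grows as the grid gets finer. It assumes a global latitude/longitude grid stored uncompressed as one chunk of 8-byte floats. The function name and these parameters are illustrative assumptions, not the settings used to generate the fake datasets.

```python
def single_chunk_size_mb(resolution_deg: float, itemsize_bytes: int = 8) -> float:
    """Uncompressed size, in MB (2**20 bytes), of one chunk holding a global grid.

    Assumes a regular lat/lon grid spanning 180 x 360 degrees and a single
    time step; both are assumptions made for illustration only.
    """
    n_lat = round(180 / resolution_deg)
    n_lon = round(360 / resolution_deg)
    return n_lat * n_lon * itemsize_bytes / 2**20


# Halving the grid spacing quadruples the number of cells, and so the chunk size.
for resolution in (1.0, 0.5, 0.25, 0.125):
    print(f"{resolution:>6} deg -> {single_chunk_size_mb(resolution):8.2f} MB")
```

Because the cell count scales with the square of the resolution, a modest increase in resolution quickly produces very large chunks when the whole spatial extent is kept in one chunk.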

## Setup

In [None]:
import json
import pandas as pd
from xarray_tile_test import XarrayTileTest
import sys; sys.path.append('..')
import helpers.eodc_hub_role as eodc_hub_role

In [None]:
credentials = eodc_hub_role.fetch_and_set_credentials()

Load the fake datasets, which have increasingly fine spatial resolution and thus increasingly large chunk sizes. Only the datasets whose ID contains `single_chunk` are kept.

In [None]:
# Run 3 iterations of each setting
iterations = 3
zooms = range(12)
with open('../01-generate-datasets/fake-datasets.json') as f:
    all_zarr_datasets = json.load(f)
# Keep only the single-chunk datasets, as a dict so it can be iterated with .items() below
zarr_datasets = {k: v for k, v in all_zarr_datasets.items() if 'single_chunk' in k}

## Run Tests

In [None]:
results = []

for zarr_dataset_id, zarr_dataset in zarr_datasets.items():
    zarr_tile_test = XarrayTileTest(
        dataset_id=zarr_dataset_id,
        **zarr_dataset
    )

    # Run `iterations` tile requests for each zoom level
    for zoom in zooms:
        zarr_tile_test.run_batch({'zoom': zoom}, batch_size=iterations)

    results.append(zarr_tile_test.store_results(credentials))
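
The loop above repeats each measurement several times per zoom level. As a minimal sketch of that pattern, the snippet below times a function repeatedly with `time.perf_counter`. Both `time_batch` and `fake_render_tile` are hypothetical stand-ins written for illustration; they are not the implementation of `XarrayTileTest.run_batch`.

```python
import time
from statistics import mean


def time_batch(func, arguments, batch_size):
    """Call func(**arguments) batch_size times; return elapsed times in milliseconds."""
    timings = []
    for _ in range(batch_size):
        start = time.perf_counter()
        func(**arguments)
        timings.append((time.perf_counter() - start) * 1000)
    return timings


def fake_render_tile(zoom):
    # Hypothetical stand-in for tile generation: does some arbitrary work.
    return sum(range(1000 * (zoom + 1)))


# Three timed runs per zoom level, mirroring iterations = 3 and zooms = range(12)
timings_by_zoom = {
    zoom: time_batch(fake_render_tile, {"zoom": zoom}, batch_size=3)
    for zoom in range(12)
}

for zoom, timings in timings_by_zoom.items():
    print(f"zoom {zoom:2d}: mean {mean(timings):.3f} ms over {len(timings)} runs")
```

Repeating each setting smooths out one-off effects such as cold caches or network jitter, which matters when comparing timings across chunk sizes.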

## Read and Plot Results

In [None]:
# See code in run-xarray-tests.ipynb

In [None]:
# Color each point by zoom level
expanded_df.plot.scatter(x='chunk_size_mb', y='time', c='zoom', colormap='viridis')

In [None]:
expanded_df.to_csv('results/03-chunk-size-results.csv')