# Tile Generation Benchmarks across Data Formats

## Explanation

In this notebook we compare the performance of tiling CMIP6 data stored as COG, NetCDF and Zarr. To tile the NetCDF data, we use a kerchunk reference file. The ZarrReader can read NetCDF directly, but it cannot read more than one file at once, which would make it incomparable with the pgSTAC+COG and Zarr methods.
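
For intuition about why a kerchunk reference file lets several NetCDF files be read as one dataset: a reference file is a JSON mapping from Zarr-style chunk keys to byte ranges inside the original files. The sketch below builds a toy mapping in the kerchunk version 1 layout and resolves chunk keys to their source files. The bucket, file names, offsets and lengths are made up for illustration and do not describe the reference file used in this notebook.

```python
import json

# Toy reference set in the kerchunk "version 1" layout.
# All URLs, offsets and lengths below are hypothetical placeholders.
references = {
    "version": 1,
    "refs": {
        # Small metadata documents are stored inline as strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        # Chunk keys point at [url, offset, length] inside an original file.
        "tas/0.0.0": ["s3://example-bucket/tas_day_1950.nc", 8192, 1_000_000],
        "tas/365.0.0": ["s3://example-bucket/tas_day_1951.nc", 8192, 1_000_000],
    },
}


def resolve_chunk(refs, key):
    """Return (url, offset, length) for a chunk key, or None for inline data.

    A one-element entry [url] refers to a whole file, so offset and length
    are returned as None in that case.
    """
    entry = refs["refs"][key]
    if isinstance(entry, str):
        return None
    url, *byte_range = entry
    offset, length = byte_range if byte_range else (None, None)
    return url, offset, length


# One logical array ("tas") is backed by chunks from two different files.
source_files = sorted(
    {entry[0] for entry in references["refs"].values() if isinstance(entry, list)}
)
```

Because every chunk carries its own URL, a reader that understands this mapping can treat many files as a single Zarr store, which is what makes the kerchunk case comparable to the multi-file pgSTAC+COG and Zarr cases.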

## Setup

In [None]:
%%capture
!pip install -r ../requirements.txt

In [1]:
%load_ext autoreload
%autoreload
import json
import pandas as pd
from cog_tile_test import CogTileTest
from xarray_tile_test import XarrayTileTest
import sys; sys.path.append('..')
import helpers.eodc_hub_role as eodc_hub_role
import helpers.dataframe as dataframe_helpers
import warnings
warnings.filterwarnings('ignore')

In [2]:
credentials = eodc_hub_role.fetch_and_set_credentials()

Of the available CMIP6 Zarr datasets, we load only the one with the same chunk structure as the original NetCDF data.

In [3]:
# Run 3 iterations of each setting
iterations = 3
zooms = range(12)
cog_dataset_id, cog_dataset = list(json.loads(open('../01-generate-datasets/cmip6-pgstac/cog-datasets.json').read()).items())[0]

In [4]:
kerchunk_dataset_id, kerchunk_dataset = list(json.loads(open('../01-generate-datasets/cmip6-kerchunk-dataset.json').read()).items())[0]

In [5]:
zarr_datasets = json.loads(open('../01-generate-datasets/cmip6-zarr-datasets.json').read())
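# Keep only the Zarr store whose ID contains '600_1440_1', the one chunked like the original NetCDF data
filtered_dict = {k: v for k, v in zarr_datasets.items() if '600_1440_1' in k}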
zarr_dataset_id, zarr_dataset = list(filtered_dict.items())[0]

## Run Tests

### COG Tests

In [6]:
# Based on our findings in 01-cog-gdal-tests, we run these tests with set_gdal_vars set to True.
cog_tile_test = CogTileTest(
    dataset_id=cog_dataset_id,
    lat_extent=[-59, 89],
    lon_extent=[-179, 179],
    extra_args={
        'query': cog_dataset['example_query'],
        'set_gdal_vars': True,
        'credentials': credentials
    }
)

# Run it 3 times for each zoom level
for zoom in zooms:
    cog_tile_test.run_batch({'zoom': zoom}, batch_size=iterations)

cog_results = cog_tile_test.store_results(credentials)

Caught exception: An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation: the specified rule "peer: 35.91.185.39/32, TCP, from port: 5432, to port: 5432, ALLOW" already exists
Connected to database
Wrote instance data to s3://nasa-eodc-data-store/test-results/20230902203302_CogTileTest_CMIP6_daily_GISS-E2-1-G_tas.json
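
The internals of `CogTileTest` are not shown in this notebook. As a rough illustration of how a zoom level plus a `lat_extent` and `lon_extent` can be turned into a tile request, the sketch below picks a random point inside the extents and converts it to XYZ tile indices. It assumes the common Web Mercator tiling scheme; `lonlat_to_tile` and `random_tile` are hypothetical names, not functions from the test classes.

```python
import math
import random

# Latitude limit of the Web Mercator tile grid (assumed tiling scheme).
MAX_MERCATOR_LAT = 85.0511287798


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) XYZ tile containing a lon/lat point at a zoom level."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    lat = max(-MAX_MERCATOR_LAT, min(MAX_MERCATOR_LAT, lat))
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def random_tile(zoom, lon_extent, lat_extent, rng=random):
    """Pick a random tile whose sampled point lies inside the given extents."""
    lon = rng.uniform(*lon_extent)
    lat = rng.uniform(*lat_extent)
    return lonlat_to_tile(lon, lat, zoom)
```

For example, `random_tile(5, [-179, 179], [-59, 89])` returns a pair of tile indices between 0 and 31. At zoom 0 there is a single tile, so low zoom levels request the whole dataset while high zoom levels touch only a small region.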


### Kerchunk Tests

In [7]:
kerchunk_tile_test = XarrayTileTest(
    dataset_id=kerchunk_dataset_id,
    **kerchunk_dataset
)

# Run it 3 times for each zoom level
for zoom in zooms:
    kerchunk_tile_test.run_batch({'zoom': zoom}, batch_size=iterations)

kerchunk_results = kerchunk_tile_test.store_results(credentials)

Wrote instance data to s3://nasa-eodc-data-store/test-results/20230902203312_XarrayTileTest_cmip6-kerchunk.json


### Zarr Tests

In [8]:
zarr_tile_test = XarrayTileTest(
    dataset_id=zarr_dataset_id,
    **zarr_dataset
)

# Run it 3 times for each zoom level
for zoom in zooms:
    zarr_tile_test.run_batch({'zoom': zoom}, batch_size=iterations)

zarr_results = zarr_tile_test.store_results(credentials)

Wrote instance data to s3://nasa-eodc-data-store/test-results/20230902203318_XarrayTileTest_600_1440_1_CMIP6_daily_GISS-E2-1-G_tas.zarr.json


## Read and Plot Results

In [9]:
all_urls = [cog_results, zarr_results, kerchunk_results]
all_df = dataframe_helpers.load_all_into_dataframe(credentials, all_urls)
expanded_df = dataframe_helpers.expand_timings(all_df)

In [22]:
expanded_df['data_format'] = 'Unknown'
# Define the conditions
expanded_df.loc[expanded_df['dataset_id'] == cog_dataset_id, 'data_format'] = 'COG'
expanded_df.loc[expanded_df['dataset_id'] == zarr_dataset_id, 'data_format'] = 'Zarr'
expanded_df.loc[expanded_df['dataset_id'] == kerchunk_dataset_id, 'data_format'] = 'kerchunk'

In [24]:
# pandas' built-in scatter plot raises KeyError: 'zoom' with these arguments, so it is left disabled
# expanded_df.plot.scatter(x='zoom', y='time', by='data_format')

In [25]:
import hvplot.pandas  # importing this registers the .hvplot accessor on DataFrames
expanded_df.hvplot.scatter(x='zoom', y='time', by='data_format')

In [26]:
expanded_df.to_csv('results-csvs/02-cog-kerchunk-zarr-results.csv')
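
With only three iterations per zoom level, a per-group median can be easier to compare than individual scatter points. The following is a minimal sketch, using only the standard library, that summarizes rows by `data_format` and `zoom`. It assumes each row holds one timing with the `data_format`, `zoom` and `time` columns used in the plot above; the function names are hypothetical and the sketch makes no assumption about the unit of `time`.

```python
import csv
from collections import defaultdict
from statistics import median


def median_time_by_format_and_zoom(rows):
    """Map (data_format, zoom) to the median of the 'time' values in rows."""
    groups = defaultdict(list)
    for row in rows:
        # CSV values arrive as strings; zoom may be written as "3" or "3.0".
        key = (row["data_format"], int(float(row["zoom"])))
        groups[key].append(float(row["time"]))
    return {key: median(values) for key, values in sorted(groups.items())}


def summarize_csv(path):
    """Read a results CSV and return the per-group medians."""
    with open(path, newline="") as f:
        return median_time_by_format_and_zoom(csv.DictReader(f))
```

Calling `summarize_csv('results-csvs/02-cog-kerchunk-zarr-results.csv')` after the cell above would return a dictionary keyed by `(data_format, zoom)`.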