# Profiling tiling with titiler-pgstac and titiler-xarray

This notebook profiles code for tiling CMIP6 data using two methods:

1. pgSTAC + COGs: The first method uses a (local or remote) pgSTAC database to store metadata about COGs on S3. It uses pgstac to read the STAC metadata and rio_tiler (via rasterio) to read the COGs on S3.
2. kerchunk + NetCDF: The second method uses a (local or S3) kerchunk reference file for NetCDF files stored on S3. It uses xarray to read the Zarr metadata and rio_tiler's XarrayReader to read data from the NetCDF files on S3.

Planned improvements and additions to this profiling code:

1. Test with a Zarr store: Run the profiling code on a Zarr store to compare the performance of reading from a Zarr store vs. reading from a NetCDF file via kerchunk.
2. Test different chunking strategies: Run the profiling code on a few different Zarr stores with different chunking schemes.
3. Test a higher-resolution dataset.

## Setup / Step 0

1. Load some basic libraries
2. Set an initial tile to test
3. Add some AWS credentials
4. Install any missing libraries

In [1]:
import boto3
from datetime import datetime
import io
from PIL import Image
import os
import warnings
import cmip6_zarr.eodc_hub_role
from matplotlib.pyplot import imshow
warnings.filterwarnings('ignore')

In [2]:
%%capture
!pip install -r requirements.txt

In [3]:
xyz_tile = (0,0,0)

In [4]:
#parameters
temporal_resolution = "daily"
model = "GISS-E2-1-G"
variable = "tas"
anon=True

## Profile titiler-pgstac

To achieve the best performance, we set some GDAL environment variables.

These variables are documented at https://developmentseed.org/titiler/advanced/performance_tuning/; the advice is copied into the comments below for ease of reference.

In [None]:
# By default GDAL reads the first 16KB of the file, then if that doesn't contain the entire metadata
# it makes one more request for the rest of the metadata.
# In environments where latency is relatively high, such as AWS S3,
# it may be beneficial to increase this value depending on the data you expect to read.
os.environ['GDAL_INGESTED_BYTES_AT_OPEN'] = '32768'

# It's much better to set EMPTY_DIR because this prevents GDAL from making a LIST request.
# LIST requests are made for sidecar files, which do not apply to COGs.
os.environ['GDAL_DISABLE_READDIR_ON_OPEN'] = 'EMPTY_DIR'

# Tells GDAL to merge consecutive range GET requests.
os.environ['GDAL_HTTP_MERGE_CONSECUTIVE_RANGES'] = 'YES'

# When set to YES, this attempts to download multiple range requests in parallel, reusing the same TCP connection. 
# Note this is only possible when the server supports HTTP2, which many servers don't yet support.
# There's no downside to setting YES here.
os.environ['GDAL_HTTP_MULTIPLEX'] = 'YES'
os.environ['GDAL_HTTP_VERSION'] = '2'

# Setting this to TRUE enables GDAL to use an internal caching mechanism. It's strongly recommended.
os.environ['VSI_CACHE'] = 'TRUE'
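
When comparing timings with and without settings like these, it can help to scope them to a single block and restore the previous values afterward. The following is a minimal stdlib-only sketch; `temporary_env` is a hypothetical helper, not part of this repository or of GDAL, and whether GDAL picks up a change made mid-process depends on when it reads each option, which is not verified here.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**overrides):
    """Set environment variables for the duration of a block, then restore them."""
    # Remember the previous values (None means the variable was unset).
    previous = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, value in previous.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value


# Example: scope two of the settings above to a single block.
with temporary_env(GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR", VSI_CACHE="TRUE"):
    inside = os.environ["VSI_CACHE"]  # value seen inside the block
```

A profiled call placed inside the `with` block would run with the overrides applied; outside it, the environment is as it was before.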

In [None]:
#!pip install morecantile==3.4.0 loguru titiler titiler-pgstac
#!pip install psycopg psycopg_binary psycopg_pool

In [9]:
# Uncomment this line if using a local database
# os.environ['LOCAL'] = 'True'

# Useful to always reload the module while it's being developed

from cmip6_zarr import eodc_hub_role
credentials = eodc_hub_role.fetch_and_set_credentials()
os.environ['AWS_ACCESS_KEY_ID'] = credentials['AccessKeyId']
os.environ['AWS_SECRET_ACCESS_KEY'] = credentials['SecretAccessKey']
os.environ['AWS_SESSION_TOKEN'] = credentials['SessionToken'] 
from profiler.main import cprofile_list_to_dict

In [None]:
%load_ext autoreload
%autoreload
import cmip6_pgstac.profile_pgstac as profile_pgstac

if temporal_resolution == 'daily':
    collection = f"CMIP6_daily_{model}_{variable}"
elif temporal_resolution == 'monthly':
    collection = f"CMIP6_ensemble_monthly_median_{variable}"

query = {
  "collections": [ collection ],
  "filter": {
    "op": "t_intersects",
    "args": [
      {
        "property": "datetime"
      },
      {
        "interval": [
          "1950-04-01T00:00:00Z"           
        ]
      }
    ]
  },
  "filter-lang": "cql2-json"
}

image_and_assets, cprofile = profile_pgstac.tile(*xyz_tile, query=dict(query))
cprofile

In [None]:
all_times = cprofile_list_to_dict(cprofile['cprofile'][1:])
total_time = list(all_times.values())[0]['tottime']
total_time

**NOTES:**

* There are 2 parts to the overall timing of generating the image: `pgstac-search` and `get_tile`.
* `get_tile` above is a list with one timing per TIFF. The bulk of this time is spent in `CustomSTACReader#tile`. That function has a subfunction `_reader`, which wraps `src_dst.tile` in `self.reader`.
* The `CustomSTACReader`'s `reader` attribute is `BaseReader` from `rio_tiler.io.base`. `BaseReader` has no init function, so I don't think any time is spent initializing the reader.
* `CustomSTACReader` inherits from `MultiBaseReader`, so the `#tile` function is defined in that class.
* The `MultiBaseReader#tile` function also has a `_reader` subfunction, which is called for each asset.
* The code for rio_tiler's `MultiBaseReader#tile#_reader` can be thought of as an **initialize reader** step and a **tile** step.
* The bulk of the `get_tile` time is spent in the **initialize reader** step of `MultiBaseReader#tile#_reader`. The initialization of `MultiBaseReader#reader` spends most of its time in `rasterio.open`. I have not dug into the subcalls of `rasterio.open`.
* `MultiBaseReader#tile#tile` is roughly equivalent to `rasterio.io.reader#part` and wraps the reading of the WarpedVRT, so its time should be the sum of the previous calls.
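
To make the distinction between time spent in a function itself and time spent in its subcalls concrete, here is a minimal sketch using only the standard library's `cProfile` and `pstats`. It is not the `cprofile_list_to_dict` helper used above, whose implementation is not shown here; `cumulative_times`, `open_step`, `tile_step`, and `get_tile` are hypothetical stand-ins for illustration.

```python
import cProfile
import pstats


def cumulative_times(profiler, top=5):
    """Return [(function_name, tottime, cumtime)] sorted by cumulative time, largest first."""
    stats = pstats.Stats(profiler)
    rows = []
    # stats.stats maps (file, line, function) ->
    # (primitive calls, total calls, tottime, cumtime, callers)
    for (_file, _line, func_name), (_cc, _nc, tottime, cumtime, _callers) in stats.stats.items():
        rows.append((func_name, tottime, cumtime))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top]


def open_step():
    # Stand-in for an "initialize reader" step.
    return sum(i * i for i in range(50_000))


def tile_step():
    # Stand-in for a "tile" step.
    return sum(range(10_000))


def get_tile():
    return open_step() + tile_step()


profiler = cProfile.Profile()
profiler.enable()
get_tile()
profiler.disable()

for name, tottime, cumtime in cumulative_times(profiler, top=5):
    print(f"{name:<12} tottime={tottime:.4f}s cumtime={cumtime:.4f}s")
```

In this sketch `tottime` excludes subcalls and `cumtime` includes them, so a wrapper such as `get_tile` shows a small `tottime` but a `cumtime` covering both steps.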


In [None]:
image = image_and_assets[0].data_as_image()
# data_as_image() returns a numpy.ndarray in the form (row, col, band)
imshow(image)

## Profile titiler-xarray

In [6]:
# Useful to always reload the module while it's being developed
%load_ext autoreload
%autoreload
import xarray_tile_reader
import zarr_reader

For each store, run the test and add to a results table.

Each store should be tested `n` times, and the mean time for reading the Zarr store and for reprojecting the data should be reported.

`tile`, `chunk_size`, `reading dataset`, `reproject`, and `total_time` should be reported.
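
Repeating a measurement `n` times and reporting the mean could be sketched as follows. This is an illustration only: `mean_elapsed_ms` and `fake_tile_request` are hypothetical names, and the placeholder workload stands in for a real tile request.

```python
import statistics
import time


def mean_elapsed_ms(func, n=5):
    """Call func() n times and return (mean, individual timings) in milliseconds."""
    timings_ms = []
    for _ in range(n):
        start = time.perf_counter()
        func()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings_ms), timings_ms


def fake_tile_request():
    # Placeholder workload standing in for a tile request.
    time.sleep(0.001)


mean_ms, runs = mean_elapsed_ms(fake_tile_request, n=3)
print(f"mean over {len(runs)} runs: {mean_ms:.2f} ms")
```

Keeping the individual timings alongside the mean makes it possible to see whether the first run differs from later ones, for example if caching is enabled.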

In [30]:
bucket = 'nasa-eodc-data-store'
chunk_set_paths = ['600_1440_1', '600_1440_29', '365_262_262']
results_df = {
  'kerchunk': {
      "data_store_path": f"combined_CMIP6_{temporal_resolution}_{model}_{variable}_kerchunk.json"
  }
}

for chunk_set_path in chunk_set_paths:
    results_df[chunk_set_path] = {
        'data_store_path': f"{chunk_set_path}/CMIP6_{temporal_resolution}_{model}_{variable}.zarr"
    }
    
results_df

{'kerchunk': {'data_store_path': 'combined_CMIP6_daily_GISS-E2-1-G_tas_kerchunk.json'},
 '600_1440_1': {'data_store_path': '600_1440_1/CMIP6_daily_GISS-E2-1-G_tas.zarr'},
 '600_1440_29': {'data_store_path': '600_1440_29/CMIP6_daily_GISS-E2-1-G_tas.zarr'},
 '365_262_262': {'data_store_path': '365_262_262/CMIP6_daily_GISS-E2-1-G_tas.zarr'}}

In [35]:
from profiler.main import Timer

for dataset in results_df.keys():
    data_store_path = results_df[dataset]['data_store_path']
    data_store_url = f"s3://{bucket}/{data_store_path}"
    # Only the kerchunk entry is a reference file; the others are Zarr stores.
    reference = dataset == 'kerchunk'

    ds = zarr_reader.xarray_open_dataset(data_store_url, anon=False, reference=reference)

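    # The first element of the returned value is the dataset; select the variable's array.
    dask_array = ds[0][variable]
    # Chunk size in bytes = bytes per element * length of the first chunk along each of the 3 dimensions.
    chunk_size_bytes = dask_array.dtype.itemsize * dask_array.chunks[0][0] * dask_array.chunks[1][0] * dask_array.chunks[2][0]
    chunk_size_mb = chunk_size_bytes / (1024 * 1024)
    results_df[dataset]['chunk_size_mb'] = chunk_size_mb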

    with Timer() as t:
        image_and_timings, cprofile = xarray_tile_reader.tile(
            data_store_url,
            *xyz_tile,
            reference=reference,
            anon=False,
            variable=variable,
        )
    total_time = round(t.elapsed * 1000, 2)

    timings = image_and_timings[1]

    results_df[dataset]['timings'] = {
        'time_to_open (ms)': timings['time_to_open'],
        'rio.reproject (ms)': timings['rio.reproject'],
        'total_time (ms)': total_time
    }

results_df

{'kerchunk': {'data_store_path': 'combined_CMIP6_daily_GISS-E2-1-G_tas_kerchunk.json',
  'chunk_size_mb': 3.2958984375,
  'timings': {'time_to_open (ms)': 45.19,
   'rio.reproject (ms)': 74.76,
   'total_time (ms)': 192.57}},
 '600_1440_1': {'data_store_path': '600_1440_1/CMIP6_daily_GISS-E2-1-G_tas.zarr',
  'chunk_size_mb': 3.2958984375,
  'timings': {'time_to_open (ms)': 1817.47,
   'rio.reproject (ms)': 73.47,
   'total_time (ms)': 1963.43}},
 '600_1440_29': {'data_store_path': '600_1440_29/CMIP6_daily_GISS-E2-1-G_tas.zarr',
  'chunk_size_mb': 95.5810546875,
  'timings': {'time_to_open (ms)': 135.65,
   'rio.reproject (ms)': 519.13,
   'total_time (ms)': 724.5}},
 '365_262_262': {'data_store_path': '365_262_262/CMIP6_daily_GISS-E2-1-G_tas.zarr',
  'chunk_size_mb': 95.57746887207031,
  'timings': {'time_to_open (ms)': 79.57,
   'rio.reproject (ms)': 1175.34,
   'total_time (ms)': 1325.78}}}
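
The `chunk_size_mb` values above can be checked by hand. The sketch below assumes that the store names encode the chunk shape and that each element is 4 bytes (for example float32); both are assumptions, though they are consistent with the reported numbers. `chunk_size_mb` here is a hypothetical helper that mirrors the arithmetic in the loop above.

```python
import math


def chunk_size_mb(chunk_shape, itemsize=4):
    """Size of one chunk in units of 1024 * 1024 bytes; itemsize is bytes per element."""
    return itemsize * math.prod(chunk_shape) / (1024 * 1024)


# Chunk shapes parsed from the store names; 4-byte elements are an assumption.
sizes = {}
for name in ['600_1440_1', '600_1440_29', '365_262_262']:
    shape = tuple(int(part) for part in name.split('_'))
    sizes[name] = chunk_size_mb(shape)
    print(name, sizes[name])
```

For `600_1440_1` this gives 4 × 600 × 1440 × 1 = 3,456,000 bytes, or 3.2958984375 when divided by 1024 × 1024, matching the value reported for that store and for the kerchunk reference.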