# Caching strategies for titiler-xarray's xarray.open_dataset


When creating image tiles from xarray datasets, whether they are zarr, virtual zarr or anything else xarray-readable, the two most expensive operations are (1) opening the dataset with xarray, which requires multiple S3 or HTTPS requests (how many depends on the dataset), and (2) reprojection. By caching the result of xarray.open_dataset, we hope to improve the performance of the titiler-xarray API for all subsequent requests.

This document evaluates 3 options for caching:

1. Diskcache using AWS Lambda ephemeral storage
2. Fsspec's filecache with AWS Elastic File System (EFS)
3. Redis in-memory cache with AWS Elasticache


## Estimating load + storage

We need to estimate load and storage in order to estimate costs.

### Max load estimate

We estimate a maximum load of 185 req/s at up to 500 KB per request, and 10 GB of storage for xarray metadata files.


In [1]:
metadata_size_gb = 0.0005 # most costs are in GB
max_req_per_sec = 185
estimated_storage_gb = 7

### Reasoning

* **Response load:** Most xarray dataset objects stored in memcached are <1 MB (~100 KB), so 500 KB provides an upper limit. If we store 100 datasets, that is only 0.05 GB. However, in experiments I saw storage in EFS of about 1.2 GB for 17 datasets, so 100 datasets would take up more like 7 GB.
* **Max requests per second:** GIBS max requests per day are about 16 million which is 185 req/s average (see https://www.earthdata.nasa.gov/eosdis/system-performance-and-metrics/esdis-metrics/esdis-weekly-metrics/esdis-weekly-metrics-october-9-2023)
* We expect each request to be at most 500 KB, at up to 185 requests per second, so we need a throughput of roughly 100 MB/s
    * By chance it appears that this is about the permitted throughput of EFS with bursting capacity and 1.19 GB of stored data
    * ![EFS Permitted Throughput](./efs-permitted-throughput.png)
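
The arithmetic behind these estimates can be checked in a few lines. This is a minimal sketch that uses only the figures quoted in the bullets above; the variable names are illustrative and are not part of the notebook.

```python
# Figures quoted in the reasoning above; names are illustrative.
max_request_size_kb = 500   # upper bound for one cached dataset object
max_req_per_sec = 185       # GIBS average at ~16 million requests per day

# Required throughput if every request reads a maximum-size object
required_throughput_mb_s = max_request_size_kb * max_req_per_sec / 1000

# Storage extrapolated from the EFS observation: 1.2 GB for 17 datasets
observed_gb = 1.2
observed_datasets = 17
storage_for_100_datasets_gb = observed_gb / observed_datasets * 100

print(f"Required throughput: {required_throughput_mb_s} MB/s")
print(f"Storage for 100 datasets: {storage_for_100_datasets_gb:.1f} GB")
```

The throughput works out slightly below the rounded 100 MB/s figure, and the storage extrapolation lands close to the 7 GB used for `estimated_storage_gb`.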

In [2]:
# check max_req_per_sec == max GIBS req/s
max_daily_gibs_req = 16e6
assert max_req_per_sec == round(max_daily_gibs_req / 24 / 60 / 60)

## Baseline Costs - Lambda

Titiler-xarray runs on AWS Lambda. The associated costs are for:

* run time (charged per GB-second)
* number of requests
* ephemeral storage (also charged per GB-second)

In [3]:
seconds_per_month = 60 * 60 * 24 * 30
# Duration is calculated from the time your code begins executing until it returns
# or otherwise terminates, rounded up to the nearest 1 ms. The price depends on the
# amount of memory you allocate to your function.
gb_second_cost_runtime = 0.0000166667
# All lambdas were set at the maximum memory (10GB)
memory_gb = 10
# This cost is tied to execution time, so faster execution is cheaper.
max_monthly_runtime_cost = gb_second_cost_runtime * memory_gb * seconds_per_month
print(f"Estimated monthly run time cost for lambda: ${round(max_monthly_runtime_cost, 2)}")

Estimated monthly run time cost for lambda: $432.0


In [4]:
cost_per_million_requests = 0.2
max_requests_per_month = max_daily_gibs_req * 30
max_monthly_requests_costs = cost_per_million_requests * max_requests_per_month/1e6
print(f"Estimated monthly requests cost for lambda: ${round(max_monthly_requests_costs, 2)}")

Estimated monthly requests cost for lambda: $96.0


In [5]:
gb_second_cost_storage = 0.0000000309
storage_gb = 10
max_monthly_storage_cost = gb_second_cost_storage * seconds_per_month * storage_gb
print(f"Estimated monthly storage cost for lambda: ${round(max_monthly_storage_cost, 2)}")

Estimated monthly storage cost for lambda: $0.8


In [6]:
total_max_monthly_lambda_cost = max_monthly_runtime_cost + max_monthly_requests_costs + max_monthly_storage_cost
print(f"Total max monthly cost: ${round(total_max_monthly_lambda_cost, 2)}")

Total max monthly cost: $528.8


## NAT Gateway

For options 2 and 3, a NAT Gateway is required. The Lambda and the services it communicates with, such as Elasticache and Elastic File System, may exist in the same VPC and communicate within that network. To make requests outside of that network (that is, to the internet), the Lambda needs a public IP address, which a NAT Gateway provides for services in a private subnet. More information is available from AWS: https://repost.aws/knowledge-center/internet-access-lambda-function.

In [7]:
hrs_per_month = 24 * 30
nat_gateway_cost_per_hour = 0.045
nat_gateway_run_cost = nat_gateway_cost_per_hour * hrs_per_month
print(f"Cost to run nat gateway per month: ${nat_gateway_run_cost}")

Cost to run nat gateway per month: $32.4


In [8]:
# Data Transfer OUT From Amazon EC2 To Internet is $0.09 per GB, see https://aws.amazon.com/ec2/pricing/on-demand/
# Below we find the maximum average content size is about 1400 bytes, 1.4e-6 GB
max_content_size_response_gb = 1.4e-6
gb_out_per_month = max_daily_gibs_req * 30 * max_content_size_response_gb
cost_per_gb_out = 0.09
cost_per_month = gb_out_per_month * cost_per_gb_out
print(f"Cost to transfer data out of nat gateway per month: ${cost_per_month}")

Cost to transfer data out of nat gateway per month: $60.48


## Performance Results

### Methodology

Results are presented from 4 deployments. Each deployment with the prefix `feature` uses a different caching method.

1. dev is the `dev` branch and deployment, with no caching configured.
2. feature is diskcache + Lambda ephemeral storage, code: [feat/diskcache](https://github.com/developmentseed/titiler-xarray/tree/feat/diskcache)
3. feature2 is fsspec filecache + EFS, code: [feat/fsspec-filecache](https://github.com/developmentseed/titiler-xarray/tree/feat/fsspec-filecache)
4. feature3 is elasticache, code: [feat/elasticache](https://github.com/developmentseed/titiler-xarray/tree/feat/elasticache)

The most notable changes are in infrastructure/cdk/app.py, where the cloud infrastructure was changed, and in titiler/xarray/reader.py, where the application code was changed to handle cache reads and writes.

Tests were run against either a set of [external datasets](../01-generate-datasets/external-datasets.json) or [fake datasets](../01-generate-datasets/fake-datasets.json). The external datasets are real-world scenarios including a kerchunk reference, a NetCDF file, 2 public zarr stores with consolidated metadata and one private zarr store with unconsolidated and chunked metadata (prod-giovanni-cache). The fake datasets represent variations in chunk size and number of spatial chunks.

### Performance Results Summary

* Fsspec file cache had the fastest performance on most datasets, but it performed much worse on NetCDF and timed out on the data store with unconsolidated metadata and chunked coordinate data. While the latter is an edge case that perhaps need not be supported, good NetCDF performance is very desirable.
* Diskcache and Elasticache performed roughly the same.

Tests were run numerous times. Past results may be found in the Oct 22 - 25 commit history of https://github.com/developmentseed/tile-benchmarking/blob/main/03-e2e/compare-dev-feature.ipynb

## Performance testing

We used the locust library to run performance tests. The test methods and scripts are documented in the [README.md](./README.md) in this directory.

Below, we run tests on the `dev` branch, deployed to dev-titiler-xarray.delta-backend.com which has no caching implemented.

In [9]:
import pandas as pd
import hvplot.pandas
import holoviews as hv
pd.options.plotting.backend = 'holoviews'
import os
import sys
import warnings
warnings.filterwarnings('ignore')
sys.path.append('..')
from helpers import dataframe
from helpers.eodc_hub_role import fetch_and_set_credentials

In [10]:
credentials = fetch_and_set_credentials()

## Download results from past tests

See executions from:
* external datasets: https://github.com/developmentseed/tile-benchmarking/blob/cfdc88b08a6a3219b5cd592827ebc4e042c91caf/03-e2e/compare-dev-feature.ipynb
* fake datasets: https://github.com/developmentseed/tile-benchmarking/blob/18f90c0508733994709378a3d7fe5de2c84f34d7/03-e2e/compare-dev-feature.ipynb

In [11]:
include_exclude_string = '--exclude "*" --include "*_urls_stats.csv"'

In [12]:
%%capture
!rm -rf downloaded_*_results
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/dev_2023-10-25_17-11-17/ downloaded_dev_results/ {include_exclude_string}
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature_2023-10-25_17-14-44/ downloaded_feature_results/ {include_exclude_string}
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature2_2023-10-25_17-16-13/ downloaded_feature2_results/ {include_exclude_string}
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature3_2023-10-25_17-19-40/ downloaded_feature3_results/ {include_exclude_string}
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/dev_2023-10-25_16-42-49/ downloaded_dev_results/ {include_exclude_string}
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature_2023-10-25_16-43-48/ downloaded_feature_results/ {include_exclude_string}
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature2_2023-10-25_16-54-52/ downloaded_feature2_results/ {include_exclude_string}
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature3_2023-10-25_16-55-47/ downloaded_feature3_results/ {include_exclude_string} 


In [13]:
results = { 'feature3': {}, 'feature2': {}, 'feature': {}, 'dev': {} }

def read_results(env: str = "dev"):
    # Directory of downloaded results and the suffix of the stats files to read
    directory_path = f"downloaded_{env}_results/"
    suffix = "_urls_stats.csv"  # only the per-dataset stats CSVs are of interest

    # List all files in the directory
    all_files = os.listdir(directory_path)

    # Filter the files to only include those that end with the specified suffix
    files_with_suffix = [f"{directory_path}{f}" for f in all_files if f.endswith(suffix)]

    dfs = []
    for file in files_with_suffix:
        df = pd.read_csv(file)
        df['file'] = file
        dfs.append(df)

    merged_df = pd.concat(dfs)
    merged_df['dataset'] = [file.split('/')[1].replace('_urls_stats.csv', '') for file in merged_df['file']]
    results[env]['all'] = merged_df
    # The "Aggregated" results represent aggregations across tile endpoints. 
    results[env][f'Aggregated {env}'] = merged_df[merged_df['Name'] == 'Aggregated']
    return results

In [14]:
read_results()['dev']['Aggregated dev'].drop(columns=['Type', 'Name', 'file'])

Unnamed: 0,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,...,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%,dataset
10,49,0,1500.0,1719.9174,1456.352363,5361.696805,693.0,1.51611,0.0,1500,...,1500,1600,1600,3500,5400,5400,5400,5400,5400,single_chunk_store_lat2896_lon5792.zarr
10,49,0,840.0,1072.402135,400.656153,2498.306146,693.0,2.712068,0.0,840,...,1600,1700,1900,2000,2500,2500,2500,2500,2500,with_chunks_store_lat4096_lon8192.zarr
10,48,0,520.0,586.357011,467.411134,2376.193945,693.0,4.422554,0.0,520,...,570,590,650,690,2400,2400,2400,2400,2400,single_chunk_store_lat1448_lon2896.zarr
10,48,0,660.0,989.751051,477.505606,2116.002682,693.0,2.868162,0.0,690,...,1500,1600,1900,2000,2100,2100,2100,2100,2100,with_chunks_store_lat5793_lon11586.zarr
10,49,0,420.0,583.17129,331.21642,2660.831845,408.142857,4.568501,0.0,420,...,560,620,720,1900,2700,2700,2700,2700,2700,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc
10,49,0,460.0,491.719038,352.293957,845.567034,1269.367347,4.994612,0.0,460,...,540,560,640,740,850,850,850,850,850,power_901_monthly_meteorology_utc.zarr
10,49,0,500.0,542.160083,293.473977,1670.366966,693.0,4.518945,0.0,500,...,570,610,900,1100,1700,1700,1700,1700,1700,with_chunks_store_lat2048_lon4096.zarr
10,49,0,430.0,1150.74689,392.810963,10041.495259,740.795918,2.698381,0.0,430,...,470,500,590,8800,10000,10000,10000,10000,10000,aws-noaa-oisst-feedstock_reference
10,49,0,280.0,342.073466,244.599888,1605.886792,693.0,7.156789,0.0,280,...,310,320,400,750,1600,1600,1600,1600,1600,single_chunk_store_lat724_lon1448.zarr
10,50,0,7700.0,8353.167761,6664.545591,17159.533976,472.3,0.319998,0.0,7700,...,8100,8400,11000,15000,17000,17000,17000,17000,17000,prod-giovanni-cache-GPM_3IMERGHH_06_precipitat...


# Option 1: Diskcache with ephemeral Lambda storage

The Python diskcache library can be used to cache the result of an `xarray.open_dataset` call on disk. It uses SQLite and has been shown to be faster than fsspec's filecache in local tests (see https://github.com/developmentseed/tile-benchmarking/blob/main/02-run-tests/test-xarray-open-dataset.ipynb).

Code: [feat/diskcache](https://github.com/developmentseed/titiler-xarray/tree/feat/diskcache)

### Pros:

* Diskcache includes a memoize decorator that makes it simple to implement
* Seems relatively stable (e.g. no errors in load tests)
* We can set a storage limit so we should not exceed the ephemeral storage capacity.

### Cons:

* While diskcache was faster than fsspec in local testing, when deployed using AWS Elastic File System (EFS), the application failed after a few requests ([disk IO error](https://github.com/developmentseed/titiler-xarray/issues/22#issuecomment-1771775404)). This was not surprising because SQLite is [not recommended](https://www.sqlite.org/faq.html#q5) for use with Network File System (NFS) mounts.
* You can use diskcache with Lambda's ephemeral storage, but anything stored will be lost whenever the execution environment restarts. This may still be a desirable option.
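
To make the memoize idea concrete, here is a minimal standard-library sketch of caching a function's return value on disk. It is not diskcache's implementation and omits its SQLite index, eviction and storage limit; `disk_memoize` and `open_dataset_stub` are hypothetical names, and the stub stands in for `xarray.open_dataset`. On Lambda, the cache directory would live under `/tmp`, which is where ephemeral storage is mounted.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path


def disk_memoize(cache_dir):
    """Cache a function's return values as pickle files in cache_dir."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Derive a stable file name from the function name and arguments
            key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = cache_dir / (hashlib.sha256(key).hexdigest() + ".pkl")
            if path.exists():
                with path.open("rb") as fh:
                    return pickle.load(fh)  # cache hit: skip the expensive call
            result = func(*args, **kwargs)
            with path.open("wb") as fh:
                pickle.dump(result, fh)     # cache miss: store for next time
            return result

        return wrapper

    return decorator


calls = []  # records how often the expensive function really runs


@disk_memoize(tempfile.mkdtemp())
def open_dataset_stub(url):
    """Hypothetical stand-in for an expensive xarray.open_dataset call."""
    calls.append(url)
    return {"url": url, "variables": ["precipitation"]}


first = open_dataset_stub("s3://example-bucket/store.zarr")
second = open_dataset_stub("s3://example-bucket/store.zarr")
print(first == second, len(calls))
```

The second call is served from the pickle file, so the stub runs only once. Because the files live on local disk, they disappear with the execution environment, which is the trade-off noted in the cons above.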

## Option 1 Performance tests

In [15]:
read_results('feature')['feature']['Aggregated feature'].drop(columns=['Type', 'Name', 'file'])

Unnamed: 0,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,...,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%,dataset
10,49,0,1418.340401,1479.043666,1418.340401,2439.225473,693.0,1.736724,0.0,1400,...,1500,1500,1600,1700,2400,2400,2400,2400,2400,single_chunk_store_lat2896_lon5792.zarr
10,48,0,420.0,525.009207,314.167214,1230.301654,693.0,4.816633,0.0,420,...,580,640,910,1100,1200,1200,1200,1200,1200,with_chunks_store_lat4096_lon8192.zarr
10,48,0,420.0,475.301659,404.873435,1580.917438,693.0,5.40468,0.0,420,...,450,450,530,740,1600,1600,1600,1600,1600,single_chunk_store_lat1448_lon2896.zarr
10,50,0,500.0,580.48251,406.238325,1428.285419,693.0,4.131997,0.0,500,...,660,690,870,1100,1400,1400,1400,1400,1400,with_chunks_store_lat5793_lon11586.zarr
10,44,0,120.0,264.267902,111.736084,1821.026105,430.25,8.774488,0.0,120,...,130,130,150,1600,1800,1800,1800,1800,1800,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc
10,47,0,200.0,248.966084,136.250184,758.704247,1290.12766,9.580365,0.0,200,...,280,290,410,680,760,760,760,760,760,power_901_monthly_meteorology_utc.zarr
10,47,0,410.0,381.187136,234.652849,842.40912,693.0,6.956826,0.0,410,...,430,440,520,520,840,840,840,840,840,with_chunks_store_lat2048_lon4096.zarr
10,49,0,220.0,998.602001,196.984251,11219.674018,740.795918,3.272109,0.0,220,...,280,300,380,9300,11000,11000,11000,11000,11000,aws-noaa-oisst-feedstock_reference
10,50,0,190.0,208.963919,177.126034,557.890449,693.0,9.910472,0.0,190,...,210,230,240,290,560,560,560,560,560,single_chunk_store_lat724_lon1448.zarr
10,46,0,270.0,949.519073,181.70328,8270.276655,474.23913,3.501913,0.0,270,...,350,360,730,7800,8300,8300,8300,8300,8300,prod-giovanni-cache-GPM_3IMERGHH_06_precipitat...


# Option 2: Fsspec's [filecache](https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally) with EFS

Fsspec offers a filecache option.

Pros: 
* Data is persisted across lambda container restarts.
* Not sure we need this yet, but servers can share EFS mounts.

Cons/Caveats:
* EFS can be expensive
* Code is complicated
* Throughput configuration is complicated, and ensuring we don't surpass our throughput rate may be challenging. When EFS is configured with the "bursting" throughput option, the "base" throughput rate is determined by the amount of data stored in EFS, and the file system accrues "burst credits" whenever throughput is lower than the base rate. The first time performance tests were run for fake datasets, there was a small but noticeable error rate (see https://github.com/developmentseed/tile-benchmarking/blob/25982c8e44c3e794399007e72c2170d10b2cdc23/03-e2e/compare-dev-feature.ipynb). The AWS EFS documentation states:

> Burst credits accrue when the file system consumes below its base throughput rate, and are deducted when throughput exceeds the base rate.

My guess is that when the tests were first executed we reached the throughput limit. Credits then accrued while the system was idle, and when the tests were run the following day there were 0 errors.

* There is less success when evaluating performance against "real world" datasets.
    * The NetCDF dataset and power_901_monthly_meteorology_utc.zarr had worse performance than with the other caching options. For NetCDF it was much worse, and even worse than with no cache. The reason requires more investigation.
    * The unconsolidated data store from prod-giovanni-cache (which also has chunked coordinate data) times out. The diskcache option worked much better for this case, but only this case.
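
The burst-credit behaviour quoted above can be illustrated with a small token-bucket simulation. This is a sketch under stated assumptions: the rates and credit balance below are made-up round numbers chosen to show the mechanism, not AWS's actual EFS parameters, and `simulate_burst_credits` is a hypothetical helper.

```python
def simulate_burst_credits(demand_mb_s, base_rate_mb_s, burst_rate_mb_s, initial_credits_mb):
    """Return (delivered MB/s, remaining credits in MB) for each one-second step."""
    credits = initial_credits_mb
    history = []
    for demand in demand_mb_s:
        # Throughput is capped by the burst rate and by what the credits can pay for
        permitted = min(burst_rate_mb_s, base_rate_mb_s + credits)
        delivered = min(demand, permitted)
        # Credits accrue below the base rate and are spent above it
        credits += base_rate_mb_s - delivered
        history.append((delivered, credits))
    return history


# Illustrative values only: 1 MB/s base, 100 MB/s burst, 200 MB of credits
demand = [92.5] * 4 + [0.0] * 3  # four busy seconds, then three idle seconds
history = simulate_burst_credits(demand, 1.0, 100.0, 200.0)
for second, (delivered, credits) in enumerate(history):
    print(f"t={second}s delivered={delivered:.1f} MB/s credits={credits:.1f} MB")
```

In this toy run the credits are exhausted during the third second, after which delivered throughput falls to the base rate. Idle seconds rebuild the balance, which is one plausible explanation for the error-free rerun the following day.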



## Additional Costs

https://aws.amazon.com/efs/pricing/

Estimated spend on writes is less than $1, and we are not considering provisioned storage at this time.

In [16]:
cost_per_gb_month_storage = 0.30
print(f"Estimated storage cost: ${estimated_storage_gb * cost_per_gb_month_storage}")

Estimated storage cost: $2.1


In [17]:
per_gb_read_cost = 0.03
# Assuming the same peak requests as GIBS, which is truly an upper threshold, this would be about $7,200 😱 a month
cost_per_month_for_reads = max_daily_gibs_req * metadata_size_gb * per_gb_read_cost * 30
print(f"Estimated cost for data transfer ${cost_per_month_for_reads}")

Estimated cost for data transfer $7200.0


## Option 2 Performance tests

In [18]:
read_results('feature2')['feature2']['Aggregated feature2'].drop(columns=['Type', 'Name', 'file'])

Unnamed: 0,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,...,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%,dataset
10,49,0,260.0,378.534136,224.907388,1897.327955,693.0,6.989456,0.0,260,...,300,310,440,1500,1900,1900,1900,1900,1900,single_chunk_store_lat2896_lon5792.zarr
10,46,0,160.0,265.299231,118.592266,525.174544,693.0,9.127379,0.0,160,...,410,420,450,470,530,530,530,530,530,with_chunks_store_lat4096_lon8192.zarr
10,36,0,120.0,164.773804,110.64461,543.874623,693.0,11.901037,0.0,120,...,140,150,230,520,540,540,540,540,540,single_chunk_store_lat1448_lon2896.zarr
10,43,0,170.0,312.82641,114.755425,861.999777,693.0,8.557336,0.0,170,...,450,470,560,720,860,860,860,860,860,with_chunks_store_lat5793_lon11586.zarr
10,49,0,1500.0,1460.999404,1001.980978,2133.641595,408.142857,1.818899,0.0,1500,...,1600,1700,1800,1900,2100,2100,2100,2100,2100,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc
10,47,0,260.0,310.510605,222.291937,1310.912221,1290.12766,7.904701,0.0,260,...,290,330,450,530,1300,1300,1300,1300,1300,power_901_monthly_meteorology_utc.zarr
10,45,0,140.0,189.931547,114.024488,549.856486,693.0,11.196795,0.0,140,...,220,250,330,440,550,550,550,550,550,with_chunks_store_lat2048_lon4096.zarr
10,49,0,210.0,855.105984,202.786246,10105.925414,740.795918,3.727676,0.0,210,...,230,240,260,7300,10000,10000,10000,10000,10000,aws-noaa-oisst-feedstock_reference
10,42,0,120.0,133.297186,102.902503,299.345507,693.0,14.257826,0.0,120,...,130,130,150,250,300,300,300,300,300,single_chunk_store_lat724_lon1448.zarr
10,49,49,30003.888743,30011.028302,30003.888743,30057.844673,33.0,0.085937,0.085937,30000,...,30000,30000,30000,30000,30000,30000,30000,30000,30000,prod-giovanni-cache-GPM_3IMERGHH_06_precipitat...


# Option 3: Elasticache

[AWS Elasticache](https://aws.amazon.com/elasticache/) is a managed in-memory data store which we can use with Redis or memcached. Pricing: https://aws.amazon.com/elasticache/pricing/.

Pros:
* Stable, relatively cheap

Cons:
* Not as fast as fsspec filecache with EFS

Note: Performance did not seem significantly different when using a cache.m6g.xlarge instance: compare https://github.com/developmentseed/tile-benchmarking/blob/ea9fc4fd0f5604fe8d1e0862ea75978a741d1d7e/03-e2e/compare-dev-feature.ipynb (xlarge instance) with https://github.com/developmentseed/tile-benchmarking/blob/18f90c0508733994709378a3d7fe5de2c84f34d7/03-e2e/compare-dev-feature.ipynb (cache.t3.small instance)
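
The general pattern for an in-memory cache such as Redis is cache-aside: look up a key, and on a miss open the dataset and store a serialized copy with an expiry. The sketch below is an assumption about how this could look, not the code in the feat/elasticache branch. `InMemoryKeyValueStore` is a hypothetical stand-in that mimics only the `get` and `set(..., ex=seconds)` calls of a Redis client so the example runs without a server, and serializing with `pickle` is also an assumption.

```python
import pickle
import time


class InMemoryKeyValueStore:
    """Hypothetical stand-in for a Redis client: get/set with optional expiry."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value, ex=None):
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[key] = (value, expires_at)


def cached_open(url, store, opener, ttl_seconds=3600):
    """Return the cached object for url, opening and storing it on a miss."""
    payload = store.get(url)
    if payload is not None:
        return pickle.loads(payload)
    dataset = opener(url)
    store.set(url, pickle.dumps(dataset), ex=ttl_seconds)
    return dataset


opened = []  # records how often the expensive opener really runs


def open_dataset_stub(url):
    """Hypothetical stand-in for xarray.open_dataset."""
    opened.append(url)
    return {"url": url, "variables": ["precipitation"]}


store = InMemoryKeyValueStore()
first = cached_open("s3://example-bucket/store.zarr", store, open_dataset_stub)
second = cached_open("s3://example-bucket/store.zarr", store, open_dataset_stub)
print(first == second, len(opened))
```

Unlike the on-disk options, every cache hit here still costs a network round trip to the cache plus deserialization, which may be part of why this option was not the fastest.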

## Additional cost to baseline

In [19]:
cost_per_hour = 0.034 # cache.t3.small
hours_per_month = 24 * 30
print(f"cost per month for cache.t3.small: ${cost_per_hour * hours_per_month}")

cost per month for cache.t3.small: $24.48
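
For comparison, the additional monthly costs computed in the cells of this document can be summed per option. This is a rough sketch: the dollar figures are copied from the printed outputs, the variable names are illustrative, and Option 1 is taken as adding nothing because Lambda ephemeral storage is already part of the baseline estimate.

```python
# Monthly figures (USD) copied from the cost cells in this document
nat_gateway_run = 32.40
nat_gateway_data_out = 60.48
efs_storage = 2.10
efs_reads = 7200.00
elasticache_t3_small = 24.48

# Options 2 and 3 both need the NAT Gateway; Option 1 does not
nat_gateway_total = nat_gateway_run + nat_gateway_data_out

additional_monthly_cost = {
    "Option 1: diskcache + ephemeral storage": 0.0,
    "Option 2: fsspec filecache + EFS": nat_gateway_total + efs_storage + efs_reads,
    "Option 3: Elasticache": nat_gateway_total + elasticache_t3_small,
}

for option, cost in additional_monthly_cost.items():
    print(f"{option}: ${cost:,.2f}")
```

Under these upper-bound assumptions, the EFS read charge dominates everything else, while the NAT Gateway accounts for most of the Elasticache option's extra cost.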


In [20]:
read_results('feature3')['feature3']['Aggregated feature3'].drop(columns=['Type', 'Name', 'file'])

Unnamed: 0,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,...,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%,dataset
10,49,0,1500.0,1514.725919,1444.128111,2479.043787,693.0,1.671301,0.0,1500,...,1500,1500,1700,1700,2500,2500,2500,2500,2500,single_chunk_store_lat2896_lon5792.zarr
10,48,0,430.0,483.631991,343.690888,928.624275,693.0,5.259799,0.0,430,...,510,540,640,810,930,930,930,930,930,with_chunks_store_lat4096_lon8192.zarr
10,49,0,430.0,461.707215,417.393816,856.266527,693.0,5.465216,0.0,430,...,450,480,530,630,860,860,860,860,860,single_chunk_store_lat1448_lon2896.zarr
10,48,0,460.0,527.196443,422.772875,1033.374978,693.0,4.929293,0.0,460,...,570,620,720,830,1000,1000,1000,1000,1000,with_chunks_store_lat5793_lon11586.zarr
10,48,0,130.0,230.505524,123.87102,1874.403686,411.5625,9.678562,0.0,130,...,140,140,160,1100,1900,1900,1900,1900,1900,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc
10,45,0,240.0,269.038526,143.731983,579.322664,1316.222222,8.9207,0.0,240,...,300,350,460,530,580,580,580,580,580,power_901_monthly_meteorology_utc.zarr
10,48,0,420.0,379.906156,244.664341,656.245297,693.0,6.844039,0.0,420,...,430,440,440,540,660,660,660,660,660,with_chunks_store_lat2048_lon4096.zarr
10,48,0,250.0,901.64584,218.87391,8858.197954,741.75,3.679831,0.0,250,...,300,310,370,7700,8900,8900,8900,8900,8900,aws-noaa-oisst-feedstock_reference
10,48,0,200.0,229.926662,188.894617,547.768037,693.0,9.499588,0.0,200,...,210,240,340,450,550,550,550,550,550,single_chunk_store_lat724_lon1448.zarr
10,48,0,350.0,430.707536,230.477325,1532.890652,473.229167,5.93804,0.0,350,...,420,440,760,1000,1500,1500,1500,1500,1500,prod-giovanni-cache-GPM_3IMERGHH_06_precipitat...
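
To read the four tables side by side, it helps to express each cached deployment as a percent change from `dev`. The sketch below does this for two datasets, using average response times copied from the tables above; `percent_change_vs_dev` is a hypothetical helper, and two datasets are not a substitute for the full comparison plotted later.

```python
# Average response times (ms) copied from the result tables in this document
average_response_ms = {
    "single_chunk_store_lat724_lon1448.zarr": {
        "dev": 342.073466, "feature": 208.963919,
        "feature2": 133.297186, "feature3": 229.926662,
    },
    "pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc": {
        "dev": 583.17129, "feature": 264.267902,
        "feature2": 1460.999404, "feature3": 230.505524,
    },
}


def percent_change_vs_dev(times):
    """Percent change in average response time relative to the uncached dev deployment."""
    baseline = times["dev"]
    return {
        env: (value - baseline) / baseline * 100
        for env, value in times.items()
        if env != "dev"
    }


percent_change = {
    name: percent_change_vs_dev(times) for name, times in average_response_ms.items()
}
for name, changes in percent_change.items():
    summary = ", ".join(f"{env}: {change:+.0f}%" for env, change in changes.items())
    print(f"{name} -> {summary}")
```

Negative values mean the cached deployment was faster than `dev`. For the zarr store all three caches help and feature2 helps most, while for the NetCDF file feature2 is slower than no cache at all.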


## Plot results together

In [21]:
dev_df = results['dev']['Aggregated dev']
feature_df = results['feature']['Aggregated feature']
# Copy before renaming so the frames stored in `results` are left unchanged
feature2_df = results['feature2']['Aggregated feature2'].copy()
feature3_df = results['feature3']['Aggregated feature3'].copy()
# Suffix every column except the join key
feature2_df.columns = ['dataset' if col == 'dataset' else col + ' Feature2' for col in feature2_df.columns]
feature3_df.columns = ['dataset' if col == 'dataset' else col + ' Feature3' for col in feature3_df.columns]

merged_df = pd.merge(dev_df, feature_df,  on='dataset', suffixes=(' Dev', ' Feature'))
merged_df = pd.merge(merged_df, feature2_df, on='dataset', how='outer')
merged_df = pd.merge(merged_df, feature3_df, on='dataset', how='outer')

In [22]:
dataset_specs_external = dataframe.csv_to_pandas('https://raw.githubusercontent.com/developmentseed/tile-benchmarking/cd261da49937d49f32375bc03548a3ce2d856f42/03-e2e/zarr_info.csv')
dataset_specs_fake = dataframe.csv_to_pandas('https://raw.githubusercontent.com/developmentseed/tile-benchmarking/18f90c0508733994709378a3d7fe5de2c84f34d7/03-e2e/zarr_info.csv')

dataset_specs_all = pd.concat([dataset_specs_external, dataset_specs_fake])

In [23]:
dataset_specs_all.loc[dataset_specs_all['collection_name'] == 'prod-giovanni-cache-GPM_3IMERGHH_06_precipitationCal', 'chunk_size_mb'] = 1.977
dataset_specs_all.loc[dataset_specs_all['collection_name'] == 'pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc', 'chunk_size_mb'] = 3.29

In [24]:
merged_df['Failure Rate Dev'] = merged_df['Failure Count Dev']/merged_df['Request Count Dev'] * 100
merged_df['Failure Rate Feature'] = merged_df['Failure Count Feature']/merged_df['Request Count Feature'] * 100
merged_df['Failure Rate Feature2'] = merged_df['Failure Count Feature2']/merged_df['Request Count Feature2'] * 100
merged_df['Failure Rate Feature3'] = merged_df['Failure Count Feature3']/merged_df['Request Count Feature3'] * 100

summary_df = merged_df[
    [
        'Average Response Time Dev', 'Failure Rate Dev',
        'Average Response Time Feature', 'Failure Rate Feature',
        'Average Response Time Feature2', 'Failure Rate Feature2',
        'Average Response Time Feature3', 'Failure Rate Feature3',        
        'dataset'
    ]
].sort_values('Average Response Time Dev')
merged_specs = summary_df.merge(dataset_specs_all, left_on='dataset', right_on='collection_name')

In [25]:
ylim = (0, 4000)
xlim = (0, 260)

dev_line = merged_specs.sort_values('chunk_size_mb').hvplot.line(
    x='chunk_size_mb', y='Average Response Time Dev', label='Dev', color='cyan',
    xlim=xlim, ylim=ylim
)

# One line per caching deployment
feature_line = merged_specs.sort_values('chunk_size_mb').hvplot.line(
    x='chunk_size_mb', y='Average Response Time Feature', label='Feature', color='magenta', alpha=0.4,
    xlim=xlim, ylim=ylim
)

feature2_line = merged_specs.sort_values('chunk_size_mb').hvplot.line(
    x='chunk_size_mb', y='Average Response Time Feature2', label='Feature2', color='orange', alpha=0.4,
    xlim=xlim, ylim=ylim
)

feature3_line = merged_specs.sort_values('chunk_size_mb').hvplot.line(
    x='chunk_size_mb', y='Average Response Time Feature3', label='Feature3', color='green', alpha=0.4,
    xlim=xlim, ylim=ylim
)

# Combine the four line plots
combined_plot = dev_line * feature_line * feature2_line * feature3_line
combined_plot.opts(legend_position='right')

![](lineplot.png)

In [34]:
# Create box plots for each column and combine
cols = ['Average Response Time Dev', 'Average Response Time Feature', 'Average Response Time Feature2', 'Average Response Time Feature3']

df_melted = merged_df.melt(value_vars=cols)

# Create the box plot using the transformed data
plot = df_melted.hvplot.box(by='variable', y='value', rot=45, ylim=(0,4000))

plot

![](boxplot.png)

## Appendix: Getting chunk size from NetCDF and prod-giovanni-cache datasets

In [None]:
import h5py
import numpy as np
import requests

remote_url = "https://nex-gddp-cmip6.s3-us-west-2.amazonaws.com/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/pr/pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc"
local_filepath = remote_url.split('/')[-1]
response = requests.get(remote_url, stream=True)
response.raise_for_status()

with open(local_filepath, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192): 
        f.write(chunk)

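# Chunk size in MB: number of elements per chunk times bytes per element
with h5py.File(local_filepath, 'r') as f:
    chunk_size_mb = np.prod(f['pr'].chunks) * f['pr'].dtype.itemsize / 1024 / 1024
chunk_size_mb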

In [64]:
import s3fs 
import xarray as xr

# Requires credentials for prod-giovanni-cache
s3_fs = s3fs.S3FileSystem()
s3url = 's3://prod-giovanni-cache/zarr/GPM_3IMERGHH_06_precipitationCal/'
s3_store = s3fs.S3Map(s3url, s3=s3_fs)
# ds = xr.open_dataset(s3_store, consolidated=False, engine='zarr')
# variable = 'variable'
# chunks = ds[variable].encoding['chunks']
# np.prod(chunks) * ds[variable].dtype.itemsize