# Comparing dev and feature

This notebook compares results between dev and feature titiler deployments. Running end-to-end benchmarks is documented in [https://github.com/developmentseed/tile-benchmarking/tree/main/03-e2e/README.md](https://github.com/developmentseed/tile-benchmarking/tree/main/03-e2e/README.md).

It compares titiler-xarray's dev branch at [commit 9ac1686612d](https://github.com/developmentseed/titiler-xarray/commit/9ac1686612d706e0f078a418818b16544efb11c0) with two feature deployments: one that includes [diskcache](https://github.com/developmentseed/titiler-xarray/tree/feat/diskcache) and another (feature2) that includes [fsspec's filecache using EFS](https://github.com/developmentseed/titiler-xarray/tree/feat/fsspec-filecache). The result sets loaded below are labeled `dev`, `feature2`, and `feature3`.

In [57]:
# Import libraries
import os
import pandas as pd
import hvplot.pandas
import holoviews as hv
pd.options.plotting.backend = 'holoviews'
import warnings
warnings.filterwarnings('ignore')
import sys
sys.path.append('..')
from helpers import dataframe
# You will need to set credentials to access nasa-eodc-data-store
from helpers import eodc_hub_role
credentials = eodc_hub_role.fetch_and_set_credentials()

In [58]:
# Remove any previous results
!rm -rf downloaded_dev_results/
!rm -rf downloaded_feature*_results/

In [59]:
%%capture
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/dev_2023-11-10_17-32-33/ downloaded_dev_results/
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature2_2023-11-10_17-38-01/ downloaded_feature2_results/
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature3_2023-11-10_17-41-22/ downloaded_feature3_results/ 

Parse and merge results into a single dataframe.

In [60]:
results = { 'feature3': {}, 'feature2': {}, 'dev': {} }
for env in results.keys():
    # Specify the directory path and the suffix
    directory_path = f"downloaded_{env}_results/"
    suffix = "_urls_stats.csv"  # Only the per-dataset request statistics files end with this suffix

    # List all files in the directory
    all_files = os.listdir(directory_path)

    # Filter the files to only include those that end with the specified suffix
    files_with_suffix = [f"{directory_path}{f}" for f in all_files if f.endswith(suffix)]

    dfs = []
    for file in files_with_suffix:
        df = pd.read_csv(file)
        df['file'] = file
        dfs.append(df)

    merged_df = pd.concat(dfs)
    merged_df['dataset'] = [file.split('/')[1].replace('_urls_stats.csv', '') for file in merged_df['file']]
    results[env]['all'] = merged_df
    # The "Aggregated" results represent aggregations across tile endpoints. 
    results[env][f'Aggregated {env}'] = merged_df[merged_df['Name'] == 'Aggregated']
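
The `dataset` column above is derived from each file path by taking the file name and dropping the `_urls_stats.csv` suffix. A minimal sketch of that step in isolation is shown here; the helper name `dataset_from_path` and the example path are hypothetical and used only for illustration, assuming the results files are named `<dataset>_urls_stats.csv`.

```python
import os


def dataset_from_path(path: str, suffix: str = "_urls_stats.csv") -> str:
    """Return the dataset name encoded in a results file path (hypothetical helper)."""
    filename = os.path.basename(path)
    # Strip the suffix only when it actually terminates the file name.
    if filename.endswith(suffix):
        return filename[: -len(suffix)]
    return filename


# Illustrative path, assumed to follow the <dataset>_urls_stats.csv naming pattern.
example = "downloaded_dev_results/power_901_monthly_meteorology_utc.zarr_urls_stats.csv"
print(dataset_from_path(example))
```

Using `os.path.basename` rather than `split('/')[1]` keeps the result the same for the flat directory layout used here, and would also tolerate a nested directory path.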

In [61]:
dataset_specs_all = dataframe.csv_to_pandas('zarr_info.csv')
dataset_specs_all

Unnamed: 0,collection_name,source,chunks,shape_dict,dtype,chunk_size_mb,compression,number_of_spatial_chunks,number_coordinate_chunks
0,power_901_monthly_meteorology_utc.zarr,s3://power-analysis-ready-datastore/power_901_...,"{'y': 504, 'x': 25}","{'y': 361, 'x': 576}",float64,2.403259,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, ...",16.50286,2.0
1,cmip6-pds_GISS-E2-1-G_historical_tas,s3://cmip6-pds/CMIP6/CMIP/NASA-GISS/GISS-E2-1-...,"{'y': 600, 'x': 90}","{'y': 90, 'x': 144}",float32,29.663086,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, ...",0.24,1.0
2,aws-noaa-oisst-feedstock_reference,https://ncsa.osn.xsede.org/Pangeo/pangeo-forge...,"{'y': 1, 'x': 1}","{'y': 720, 'x': 1440}",int16,1.977539,Zlib(level=4),1036800.0,1.0
3,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc,https://nex-gddp-cmip6.s3-us-west-2.amazonaws....,"{'y': 'N', 'x': '/'}","{'y': 600, 'x': 1440}",float32,,,,0.0
4,20231107090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v...,s3://podaac-ops-cumulus-protected/MUR-JPL-L4-G...,,,,,,,
5,3B42_Daily.19980101.7.nc4,s3://gesdisc-cumulus-prod-protected/TRMM_L3/TR...,,,,,,,
6,3B-DAY.MS.MRG.3IMERG.20000601-S000000-E235959....,s3://gesdisc-cumulus-prod-protected/GPM_L3/GPM...,,,,,,,


In [62]:
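# Copy the aggregated rows so that renaming columns below does not modify the frames stored in `results`
dev_df = results['dev']['Aggregated dev'].copy()
feature2_df = results['feature2']['Aggregated feature2'].copy()
feature3_df = results['feature3']['Aggregated feature3'].copy()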
dev_df.columns = ['dataset' if col == 'dataset' else col + ' Dev' for col in dev_df.columns]
feature2_df.columns = ['dataset' if col == 'dataset' else col + ' Feature2' for col in feature2_df.columns]
feature3_df.columns = ['dataset' if col == 'dataset' else col + ' Feature3' for col in feature3_df.columns]

merged_df = pd.merge(dev_df, feature2_df,  on='dataset', how='outer')
merged_df = pd.merge(merged_df, feature3_df, on='dataset', how='outer')
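
The cell above labels every column except `dataset` with its environment and then joins the frames on `dataset` with an outer merge, so a dataset that is missing from one environment still appears, with NaN in that environment's columns. The following is a small sketch of the same pattern on made-up data; the helper `label_columns` and all values are assumptions for illustration, not benchmark results.

```python
import pandas as pd

# Synthetic stand-ins for two per-environment result frames.
dev = pd.DataFrame({"dataset": ["a", "b"], "Average Response Time": [400.0, 500.0]})
feature = pd.DataFrame({"dataset": ["b", "c"], "Average Response Time": [250.0, 300.0]})


def label_columns(df: pd.DataFrame, label: str, key: str = "dataset") -> pd.DataFrame:
    """Append the environment label to every column except the join key."""
    return df.rename(columns=lambda col: col if col == key else f"{col} {label}")


merged = pd.merge(
    label_columns(dev, "Dev"),
    label_columns(feature, "Feature2"),
    on="dataset",
    how="outer",  # keep datasets that appear in only one environment
)
print(merged)
```

Because `rename` returns a new frame, this variant leaves the input frames untouched.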

In [63]:
merged_df

Unnamed: 0,Type Dev,Name Dev,Request Count Dev,Failure Count Dev,Median Response Time Dev,Average Response Time Dev,Min Response Time Dev,Max Response Time Dev,Average Content Size Dev,Requests/s Dev,...,75% Feature3,80% Feature3,90% Feature3,95% Feature3,98% Feature3,99% Feature3,99.9% Feature3,99.99% Feature3,100% Feature3,file Feature3
0,,Aggregated,19,19,47.0,51.304068,37.999023,118.35502,25.0,19.434113,...,630,640,1500,1800,2700,2700,2700,2700,2700,downloaded_feature3_results/3B-DAY.MS.MRG.3IME...
1,,Aggregated,25,0,390.0,405.982184,339.145354,603.894412,847.6,4.969779,...,140,140,1100,1400,1400,1400,1400,1400,1400,downloaded_feature3_results/pr_day_ACCESS-CM2_...
2,,Aggregated,24,0,420.0,474.938282,366.492412,745.738177,1403.583333,4.844792,...,280,330,420,470,660,660,660,660,660,downloaded_feature3_results/power_901_monthly_...
3,,Aggregated,23,0,430.0,519.810967,382.378135,993.911789,403.173913,4.776193,...,310,310,2500,2600,2600,2600,2600,2600,2600,downloaded_feature3_results/aws-noaa-oisst-fee...
4,,Aggregated,20,20,45.0,48.097489,36.910034,88.979817,25.0,20.72366,...,120,120,130,160,160,160,160,160,160,downloaded_feature3_results/3B42_Daily.1998010...
5,,Aggregated,22,0,500.0,532.980641,467.859612,672.335105,694.0,4.601611,...,430,480,490,510,570,570,570,570,570,downloaded_feature3_results/cmip6-pds_GISS-E2-...
6,,Aggregated,24,0,910.0,1014.014851,603.885571,2154.707359,33964.791667,2.494608,...,30000,30000,30000,30000,30000,30000,30000,30000,30000,downloaded_feature3_results/20231107090000-JPL...


In [64]:
merged_df['Failure Rate Dev'] = merged_df['Failure Count Dev']/merged_df['Request Count Dev'] * 100
merged_df['Failure Rate Feature2'] = merged_df['Failure Count Feature2']/merged_df['Request Count Feature2'] * 100
merged_df['Failure Rate Feature3'] = merged_df['Failure Count Feature3']/merged_df['Request Count Feature3'] * 100

summary_df = merged_df[
    [
        'Average Response Time Dev', 'Failure Rate Dev',
        'Average Response Time Feature2', 'Failure Rate Feature2',
        'Average Response Time Feature3', 'Failure Rate Feature3',        
        'dataset'
    ]
].sort_values('Average Response Time Dev')
merged_specs = summary_df.merge(dataset_specs_all, left_on='dataset', right_on='collection_name')
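
The failure rate is the failure count divided by the request count, expressed as a percentage. As a worked example, the sketch below uses the Dev request and failure counts from the first two rows of the `merged_df` output above (19 of 19 requests failed, and 0 of 25 failed).

```python
import pandas as pd

# Dev counts taken from the first two rows of the merged_df output above.
counts = pd.DataFrame(
    {
        "Request Count Dev": [19, 25],
        "Failure Count Dev": [19, 0],
    }
)

# Failure rate as a percentage of all requests.
counts["Failure Rate Dev"] = (
    counts["Failure Count Dev"] / counts["Request Count Dev"] * 100
)
print(counts["Failure Rate Dev"].tolist())  # [100.0, 0.0]
```

A dataset with a 100% failure rate should be read with care: its average response time describes failed requests only, so it is probably not comparable with the response times of environments where the same dataset was served successfully.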

In [65]:
merged_specs

Unnamed: 0,Average Response Time Dev,Failure Rate Dev,Average Response Time Feature2,Failure Rate Feature2,Average Response Time Feature3,Failure Rate Feature3,dataset,collection_name,source,chunks,shape_dict,dtype,chunk_size_mb,compression,number_of_spatial_chunks,number_coordinate_chunks
0,48.097489,100.0,229.863884,54.545455,81.920033,58.333333,3B42_Daily.19980101.7.nc4,3B42_Daily.19980101.7.nc4,s3://gesdisc-cumulus-prod-protected/TRMM_L3/TR...,,,,,,,
1,51.304068,100.0,2541.231444,0.0,749.974042,0.0,3B-DAY.MS.MRG.3IMERG.20000601-S000000-E235959....,3B-DAY.MS.MRG.3IMERG.20000601-S000000-E235959....,s3://gesdisc-cumulus-prod-protected/GPM_L3/GPM...,,,,,,,
2,405.982184,0.0,16956.33968,100.0,290.169052,0.0,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc,https://nex-gddp-cmip6.s3-us-west-2.amazonaws....,"{'y': 'N', 'x': '/'}","{'y': 600, 'x': 1440}",float32,,,,0.0
3,474.938282,0.0,429.04395,0.0,252.735926,0.0,power_901_monthly_meteorology_utc.zarr,power_901_monthly_meteorology_utc.zarr,s3://power-analysis-ready-datastore/power_901_...,"{'y': 504, 'x': 25}","{'y': 361, 'x': 576}",float64,2.403259,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, ...",16.50286,2.0
4,519.810967,0.0,538.768803,0.0,660.769386,0.0,aws-noaa-oisst-feedstock_reference,aws-noaa-oisst-feedstock_reference,https://ncsa.osn.xsede.org/Pangeo/pangeo-forge...,"{'y': 1, 'x': 1}","{'y': 720, 'x': 1440}",int16,1.977539,Zlib(level=4),1036800.0,1.0
5,532.980641,0.0,221.567378,0.0,402.474154,0.0,cmip6-pds_GISS-E2-1-G_historical_tas,cmip6-pds_GISS-E2-1-G_historical_tas,s3://cmip6-pds/CMIP6/CMIP/NASA-GISS/GISS-E2-1-...,"{'y': 600, 'x': 90}","{'y': 90, 'x': 144}",float32,29.663086,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, ...",0.24,1.0
6,1014.014851,0.0,11769.717077,62.5,14995.812411,100.0,20231107090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v...,20231107090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v...,s3://podaac-ops-cumulus-protected/MUR-JPL-L4-G...,,,,,,,


NOTE: We don't have chunk information for the prod giovanni cache dataset because it is protected (it can be added). Datasets without chunk information have empty spec columns in the table above.
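
The plot below uses `chunk_size_mb` as its x-axis, so rows without a value there will most likely not appear as points. A quick way to see which datasets are affected is to split the frame on missing values. This is a sketch on a three-row excerpt of the specs shown above; the frame name `specs` is an assumption for illustration.

```python
import pandas as pd

# Three-row excerpt of the dataset specs shown above.
specs = pd.DataFrame(
    {
        "collection_name": [
            "power_901_monthly_meteorology_utc.zarr",
            "cmip6-pds_GISS-E2-1-G_historical_tas",
            "3B42_Daily.19980101.7.nc4",
        ],
        "chunk_size_mb": [2.403259, 29.663086, float("nan")],
    }
)

# Rows with no chunk size cannot be placed on a chunk-size axis.
missing = specs[specs["chunk_size_mb"].isna()]
plottable = specs.dropna(subset=["chunk_size_mb"])

print("No chunk information:", missing["collection_name"].tolist())
print("Plottable datasets:", len(plottable))
```

Running the same split on `merged_specs` would show how many datasets actually contribute to each series in the plot.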

In [39]:
ylim = 3000
xlim = 256
dev_line = merged_specs.sort_values('chunk_size_mb').hvplot.scatter(
    x='chunk_size_mb', y='Average Response Time Dev', label='Dev', color='cyan',
    xlim=(0, xlim), ylim=(0, ylim)
)

# # Plot 'col2'
# feature_line = merged_specs.sort_values('chunk_size_mb').hvplot.line(
#     x='chunk_size_mb', y='Average Response Time Feature', label='Feature', color='magenta', alpha=0.4,
#     xlim=(0, xlim), ylim=(0, ylim)
# )

feature2_line = merged_specs.sort_values('chunk_size_mb').hvplot.scatter(
    x='chunk_size_mb', y='Average Response Time Feature2', label='Feature2', color='orange', alpha=0.4,
    xlim=(0, xlim), ylim=(0, ylim)
)

feature3_line = merged_specs.sort_values('chunk_size_mb').hvplot.scatter(
    x='chunk_size_mb', y='Average Response Time Feature3', label='Feature3', color='green', alpha=0.4,
    xlim=(0, xlim), ylim=(0, ylim)
)

# Overlay the three scatter plots
combined_plot = dev_line * feature2_line * feature3_line
combined_plot