# Comparing dev and feature

This notebook compares benchmark results from the dev titiler deployment with those from two feature deployments. Instructions for running the end-to-end benchmarks are in [https://github.com/developmentseed/tile-benchmarking/tree/main/03-e2e/README.md](https://github.com/developmentseed/tile-benchmarking/tree/main/03-e2e/README.md).

The baseline is titiler-xarray's dev branch at [commit 9ac1686612d](https://github.com/developmentseed/titiler-xarray/commit/9ac1686612d706e0f078a418818b16544efb11c0). It is compared with two feature deployments: `feature`, which includes [diskcache](https://github.com/developmentseed/titiler-xarray/tree/feat/diskcache), and `feature2`, which includes [fsspec's filecache using EFS](https://github.com/developmentseed/titiler-xarray/tree/feat/fsspec-filecache).

In [10]:
# Import libraries
import os
import pandas as pd
import hvplot.pandas
import holoviews as hv
pd.options.plotting.backend = 'holoviews'
import warnings
warnings.filterwarnings('ignore')
import sys
sys.path.append('..')
from helpers import dataframe
# You will need to set credentials to access nasa-eodc-data-store
from helpers import eodc_hub_role
credentials = eodc_hub_role.fetch_and_set_credentials()

In [11]:
# Remove any previous results for all three environments
!rm -rf downloaded_dev_results/
!rm -rf downloaded_feature_results/
!rm -rf downloaded_feature2_results/

In [12]:
%%capture
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/dev_2023-10-22_00-42-56/ downloaded_dev_results/
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature_2023-10-22_00-43-50/ downloaded_feature_results/
!aws s3 cp --recursive s3://nasa-eodc-data-store/tile-benchmarking-results/feature2_2023-10-23_23-05-31/ downloaded_feature2_results/


Parse and merge results into a single dataframe.

In [13]:
results = { 'feature2': {}, 'feature': {}, 'dev': {} }
for env in results.keys():
    # Specify the directory path and the suffix
    directory_path = f"downloaded_{env}_results/"
    suffix = "_urls_stats.csv"  # Per-dataset stats files end with this suffix

    # List all files in the directory
    all_files = os.listdir(directory_path)

    # Filter the files to only include those that end with the specified suffix
    files_with_suffix = [f"{directory_path}{f}" for f in all_files if f.endswith(suffix)]

    dfs = []
    for file in files_with_suffix:
        df = pd.read_csv(file)
        df['file'] = file
        dfs.append(df)

    merged_df = pd.concat(dfs)
    merged_df['dataset'] = [file.split('/')[1].replace('_urls_stats.csv', '') for file in merged_df['file']]
    results[env]['all'] = merged_df
    # The "Aggregated" results represent aggregations across tile endpoints. 
    results[env][f'Aggregated {env}'] = merged_df[merged_df['Name'] == 'Aggregated']
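
The parsing step above can be exercised on its own without S3 access. The following is a minimal sketch under stated assumptions: `load_stats` is a hypothetical helper (not part of the `helpers` package), and the two CSV files are made-up toy data that only mimic the `Name`, `Request Count`, `Failure Count`, and `Average Response Time` columns used in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_stats(directory, suffix="_urls_stats.csv"):
    """Concatenate every stats CSV in `directory`, tagging rows with a dataset name.

    Hypothetical helper for illustration; the notebook does this inline.
    """
    frames = []
    for path in sorted(Path(directory).glob(f"*{suffix}")):
        df = pd.read_csv(path)
        # Derive the dataset name from the file name by dropping the suffix.
        df["dataset"] = path.name[: -len(suffix)]
        frames.append(df)
    if not frames:
        # pd.concat would fail on an empty list; fail with a clearer message.
        raise FileNotFoundError(f"no files ending in {suffix!r} in {directory}")
    return pd.concat(frames, ignore_index=True)


# Toy data (made up) standing in for a downloaded results directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, avg in [("dataset_a", 120.0), ("dataset_b", 340.0)]:
        pd.DataFrame(
            {
                "Name": ["/tiles/0/0/0", "Aggregated"],
                "Request Count": [10, 10],
                "Failure Count": [0, 0],
                "Average Response Time": [avg, avg],
            }
        ).to_csv(Path(tmp) / f"{name}{'_urls_stats.csv'}", index=False)

    all_rows = load_stats(tmp)

# Keep only the rows that aggregate across tile endpoints.
aggregated = all_rows[all_rows["Name"] == "Aggregated"]
print(aggregated[["dataset", "Average Response Time"]])
```

Raising an explicit error when no files match makes a failed or skipped download easier to spot than the generic error from `pd.concat`.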

In [14]:
dataset_specs_all = dataframe.csv_to_pandas('zarr_info.csv')
#dataset_specs_all

In [15]:
dev_df = results['dev']['Aggregated dev']
feature_df = results['feature']['Aggregated feature']
feature2_df = results['feature2']['Aggregated feature2']
feature2_df.columns = ['dataset' if col == 'dataset' else col + ' Feature2' for col in feature2_df.columns]
merged_df = pd.merge(dev_df, feature_df,  on='dataset', suffixes=(' Dev', ' Feature'))
merged_df = pd.merge(merged_df, feature2_df, on='dataset', how='outer')
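
The third frame is renamed by hand because `suffixes` only applies to column names that overlap between the two frames being merged. After the first merge, every column already ends in ` Dev` or ` Feature`, so nothing would overlap in the second merge. A minimal sketch with made-up values (the frames and numbers below are illustrative only, and `rename` is used here in place of direct column assignment):

```python
import pandas as pd

# Toy frames (made up); "dataset" is the join key, as in the notebook.
dev = pd.DataFrame({"dataset": ["a", "b"], "Average Response Time": [100.0, 200.0]})
feature = pd.DataFrame({"dataset": ["a", "b"], "Average Response Time": [80.0, 150.0]})
feature2 = pd.DataFrame({"dataset": ["a", "c"], "Average Response Time": [90.0, 300.0]})

# Tag every non-key column of the third frame; rename returns a new frame.
feature2 = feature2.rename(columns=lambda c: c if c == "dataset" else f"{c} Feature2")

# Overlapping columns get the suffixes; the key column is left alone.
merged = pd.merge(dev, feature, on="dataset", suffixes=(" Dev", " Feature"))

# An outer merge keeps datasets that exist in only one of the two frames.
merged = pd.merge(merged, feature2, on="dataset", how="outer")
print(merged)
```

Because the second merge is an outer join, a dataset that appears only in `feature2` (here `c`) is kept, with missing values in the Dev and Feature columns.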

In [16]:
merged_df['Failure Rate Dev'] = merged_df['Failure Count Dev']/merged_df['Request Count Dev'] * 100
merged_df['Failure Rate Feature'] = merged_df['Failure Count Feature']/merged_df['Request Count Feature'] * 100
merged_df['Failure Rate Feature2'] = merged_df['Failure Count Feature2']/merged_df['Request Count Feature2'] * 100

summary_df = merged_df[
    [
        'Average Response Time Dev', 'Failure Rate Dev',
        'Average Response Time Feature', 'Failure Rate Feature',
        'Average Response Time Feature2', 'Failure Rate Feature2',
        'dataset'
    ]
].sort_values('Average Response Time Dev')
merged_specs = summary_df.merge(dataset_specs_all, left_on='dataset', right_on='collection_name')
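
The failure rate above is the failure count divided by the request count, expressed as a percentage. If a run ever recorded zero requests, a non-zero failure count would divide to infinity. The sketch below shows one way to guard against that; `failure_rate` is a hypothetical helper and the counts are made up.

```python
import pandas as pd


def failure_rate(failures, requests):
    """Return failures / requests * 100, with NaN where no requests were made."""
    # Series.where keeps values where the condition holds and sets NaN elsewhere.
    safe_requests = requests.where(requests > 0)
    return failures / safe_requests * 100


# Toy counts (made up): a clean run, a half-failed run, and an empty run.
failures = pd.Series([0, 5, 3])
requests = pd.Series([10, 10, 0])

rates = failure_rate(failures, requests)
print(rates)
```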

In [17]:
merged_specs

,Average Response Time Dev,Failure Rate Dev,Average Response Time Feature,Failure Rate Feature,Average Response Time Feature2,Failure Rate Feature2,dataset,collection_name,source,chunks,shape_dict,dtype,chunk_size_mb,compression,number_of_spatial_chunks,number_coordinate_chunks
0,429.02638,0.0,204.592442,0.0,317.026707,0.0,power_901_monthly_meteorology_utc.zarr,power_901_monthly_meteorology_utc.zarr,s3://power-analysis-ready-datastore/power_901_...,"{'y': 504, 'x': 25}","{'y': 361, 'x': 576}",float64,2.403259,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, ...",16.502857,2.0
1,471.743337,0.0,222.400569,0.0,1292.641275,0.0,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc,pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc,https://nex-gddp-cmip6.s3-us-west-2.amazonaws....,"{'y': 'N', 'x': '/'}","{'y': 600, 'x': 1440}",float32,,,,0.0
2,540.974925,0.0,396.836991,0.0,193.611272,0.0,cmip6-pds_GISS-E2-1-G_historical_tas,cmip6-pds_GISS-E2-1-G_historical_tas,s3://cmip6-pds/CMIP6/CMIP/NASA-GISS/GISS-E2-1-...,"{'y': 600, 'x': 90}","{'y': 90, 'x': 144}",float32,29.663086,"Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, ...",0.24,1.0
3,1067.613283,0.0,884.032641,0.0,832.400618,0.0,aws-noaa-oisst-feedstock_reference,aws-noaa-oisst-feedstock_reference,https://ncsa.osn.xsede.org/Pangeo/pangeo-forge...,"{'zlev': 1, 'y': 1, 'x': 720}","{'zlev': 1, 'y': 720, 'x': 1440}",int16,1.977539,Zlib(level=4),1440.0,2.0
4,6985.687888,0.0,842.066773,0.0,30010.420828,100.0,prod-giovanni-cache-GPM_3IMERGHH_06_precipitat...,prod-giovanni-cache-GPM_3IMERGHH_06_precipitat...,s3://prod-giovanni-cache/zarr/GPM_3IMERGHH_06_...,,,,,,,


NOTE: Chunk information is missing for the prod-giovanni-cache dataset because the dataset is protected (this information can be added).
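
In the table above, `chunk_size_mb` is empty for two rows, and a row without a chunk size cannot be placed on a plot whose x-axis is `chunk_size_mb`. The following sketch, using made-up dataset names and values, shows one way to list such rows and to keep only the plottable ones.

```python
import pandas as pd

# Toy stand-in for merged_specs (names and values are made up).
specs = pd.DataFrame(
    {
        "dataset": ["dataset_a", "dataset_b", "dataset_c"],
        "chunk_size_mb": [29.7, None, 2.4],
        "Average Response Time Dev": [541.0, 6985.7, 429.0],
    }
)

# Rows with no chunk information.
missing = specs[specs["chunk_size_mb"].isna()]
print("Missing chunk info:", list(missing["dataset"]))

# Rows that can be plotted against chunk size, ordered along the x-axis.
plottable = specs.dropna(subset=["chunk_size_mb"]).sort_values("chunk_size_mb")
print(plottable)
```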

In [19]:
ylim = 1400
xlim = 60
dev_line = merged_specs.sort_values('chunk_size_mb').hvplot.line(
    x='chunk_size_mb', y='Average Response Time Dev', label='Dev', color='cyan',
    xlim=(0, xlim), ylim=(0, ylim)
)

# Plot the feature deployment
feature_line = merged_specs.sort_values('chunk_size_mb').hvplot.line(
    x='chunk_size_mb', y='Average Response Time Feature', label='Feature', color='magenta', alpha=0.4,
    xlim=(0, xlim), ylim=(0, ylim)
)

feature2_line = merged_specs.sort_values('chunk_size_mb').hvplot.line(
    x='chunk_size_mb', y='Average Response Time Feature2', label='Feature2', color='orange', alpha=0.4,
    xlim=(0, xlim), ylim=(0, ylim)
)

# Overlay the three line plots
combined_plot = dev_line * feature_line * feature2_line
combined_plot
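
As a complement to the plot, the relative change in average response time against dev can be computed directly. This is a sketch, not part of the original analysis: it copies the values from the summary table above (the last dataset name is truncated there and is kept as shown), the `Change vs Dev ...` column names are invented for this example, and rows with a non-zero failure rate are masked on the assumption that their response times are not comparable.

```python
import pandas as pd

# Values copied from the summary table; column names for the result are made up.
summary = pd.DataFrame(
    {
        "dataset": [
            "power_901_monthly_meteorology_utc.zarr",
            "pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc",
            "cmip6-pds_GISS-E2-1-G_historical_tas",
            "aws-noaa-oisst-feedstock_reference",
            "prod-giovanni-cache-GPM_3IMERGHH_06_precipitat...",  # truncated in the output
        ],
        "Average Response Time Dev": [429.02638, 471.743337, 540.974925, 1067.613283, 6985.687888],
        "Average Response Time Feature": [204.592442, 222.400569, 396.836991, 884.032641, 842.066773],
        "Average Response Time Feature2": [317.026707, 1292.641275, 193.611272, 832.400618, 30010.420828],
        "Failure Rate Feature": [0.0, 0.0, 0.0, 0.0, 0.0],
        "Failure Rate Feature2": [0.0, 0.0, 0.0, 0.0, 100.0],
    }
)

for env in ["Feature", "Feature2"]:
    ratio = summary[f"Average Response Time {env}"] / summary["Average Response Time Dev"]
    change = (ratio - 1) * 100  # negative means faster than dev
    # Mask rows where the deployment had failures (assumption: not comparable).
    summary[f"Change vs Dev {env} (%)"] = change.where(summary[f"Failure Rate {env}"] == 0)

print(summary[["dataset", "Change vs Dev Feature (%)", "Change vs Dev Feature2 (%)"]].round(1))
```

Negative values indicate a lower average response time than dev. Each value comes from a single benchmark run per deployment.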