## Runtime Comparison

**Author:** Alex Michels

In this notebook, we compare the runtime of our clustering algorithm for travel-time analysis against analyzing each hospital separately.

## Table of Contents

* [Analyzing SPACTS](#spacts)
* [Analyzing without SPACTS](#nm)
* [Comparison](#comp)

First, we need to load some packages for our analysis:

In [1]:
import datetime
from IPython.display import display, Markdown
import numpy as np
import pandas as pd
import os

Each method had a "Part 1" and "Part 2". 

* Part 1 performed the clustering (or not) and then submitted each hospital/cluster as a separate job, pausing for 1 second between jobs.
* The jobs submitted by Part 1 are the "Part 2" jobs. They performed the analysis (travel-time catchment) for the hospital or region assigned to them by Part 1. At the end of each Part 2 job, the job records its index and runtime in seconds in a file.
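
The recording step at the end of a Part 2 job can be sketched as follows. This is a minimal illustration, not the script used for these runs: the function name `record_runtime`, the `analysis` callable, and the output layout (`REGION_INDEX,TIME` rows appended to a single CSV) are all assumptions.

```python
import csv
import os
import tempfile
import time


def record_runtime(region_index, analysis, out_path):
    """Run `analysis`, then append (region_index, elapsed seconds) to a CSV.

    Hypothetical helper: the name, arguments, and file layout are assumptions.
    """
    start = time.perf_counter()
    analysis()  # stand-in for the travel-time catchment computation
    elapsed = time.perf_counter() - start

    # Write the header only when the file does not exist yet.
    write_header = not os.path.exists(out_path)
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["REGION_INDEX", "TIME"])
        writer.writerow([region_index, elapsed])
    return elapsed


# Example: time a trivial placeholder workload for a made-up region index.
demo_path = os.path.join(tempfile.mkdtemp(), "timings_demo.csv")
demo_elapsed = record_runtime(12, lambda: sum(range(1000)), demo_path)
```

If many jobs append to the same file at once, a real implementation would likely need file locking or one output file per job.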

The "-1" region index holds the time in seconds recorded by SLURM for Part 1 (entered manually from the SLURM output); every other row holds the time in seconds recorded by the Python script for a Part 2 job. The data is at the path `../data/runtime`. Files with "NM" in the name ("no merge") hold the records without clustering, and files with "Co" in the name hold the records when applying SPACTS. We ran each method 5 times. Let's dig into the data:

In [2]:
time_path = "../data/runtime"
spacts_paths = [f"Par_Timings-Co{i}.csv" for i in range(1,6)]
no_merge_paths = [f"Par_Timings-NM{i}.csv" for i in range(1,6)]

<hr id="spacts" />

## Analyzing SPACTS

Now that we have specified the paths, let's load the SPACTS data:

In [3]:
spacts = []
for path in spacts_paths:
    spacts.append(pd.read_csv(os.path.join(time_path, path)))

So we have a list of DataFrames. Let's preview one of them:

In [4]:
spacts[0].head()

|Unnamed: 0|REGION_INDEX|TIME|
|-|-|-|
|0|-1|523.0|
|1|1|2058.843962|
|2|12|2329.047585|
|3|4|2466.289198|
|4|9|2682.624517|
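
The `Unnamed: 0` column typically appears when a DataFrame's index is written out by `to_csv` and then read back as an ordinary column. Assuming that is what happened here (the notebook does not show how the files were written), passing `index_col=0` to `pd.read_csv` restores it as the index. A small sketch using an inline copy of the first rows shown above:

```python
import io

import pandas as pd

# Inline sample in the assumed file layout; the first header field is empty,
# which is how `to_csv` writes an unnamed index.
sample_csv = io.StringIO(
    ",REGION_INDEX,TIME\n"
    "0,-1,523.0\n"
    "1,1,2058.843962\n"
    "2,12,2329.047585\n"
)

# Default read: the saved index becomes a column named "Unnamed: 0".
with_extra = pd.read_csv(sample_csv)

# Rewind and read again, using the first column as the index instead.
sample_csv.seek(0)
clean = pd.read_csv(sample_csv, index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

The analysis below only uses `REGION_INDEX` and `TIME`, so the extra column does not affect any of the results.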


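Now, let's calculate some summary statistics based on that information, including the total number of regions, average runtime, etc. Note that the statistics are computed over every row of `TIME`, so they include the Part 1 row (`REGION_INDEX` of -1), while the "Regions" column excludes it: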

In [5]:
funcs = [sum, np.mean, np.std, min, max]
table_string, second_row = "|Run|Regions|", "\n|-|-|"
for func in funcs:
    table_string += f"{func.__name__}|"
    second_row += "-|"
table_string += second_row + "\n"

for i, df in enumerate(spacts):
    table_string += f"|{i}|{len(df)-1}|"
    for func in funcs:
        table_string += f"{datetime.timedelta(seconds=func(df['TIME']))}|"
    table_string += "\n"
display(Markdown(table_string))

|Run|Regions|sum|mean|std|min|max|
|-|-|-|-|-|-|-|
|0|96|3 days, 17:48:30.234052|0:55:33.095196|0:23:31.631550|0:08:43|3:06:32.631056|
|1|96|3 days, 17:35:49.907433|0:55:25.256778|0:23:08.562301|0:08:45|3:07:27.401912|
|2|96|3 days, 18:15:35.344157|0:55:49.848909|0:24:17.107448|0:08:36|3:09:16.383331|
|3|96|3 days, 17:48:21.821496|0:55:33.008469|0:23:36.130034|0:08:32|3:09:01.244118|
|4|96|3 days, 18:17:26.305852|0:55:50.992844|0:23:27.123873|0:08:24|3:08:52.109168|
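
A possible variant of these statistics, restricted to the Part 2 rows, is sketched below on a small made-up frame (not the study data). It also makes the standard-deviation convention explicit: `np.std` defaults to the population formula (`ddof=0`), whereas pandas' `Series.std` defaults to the sample formula (`ddof=1`). The helper name `summarize` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up timings in seconds (not the study data); -1 marks the Part 1 row.
toy = pd.DataFrame(
    {"REGION_INDEX": [-1, 1, 2, 3], "TIME": [500.0, 1200.0, 1800.0, 2400.0]}
)


def summarize(df, part2_only=True):
    """Return sum/mean/std/min/max of TIME, optionally excluding Part 1."""
    times = df.loc[df["REGION_INDEX"] > -1, "TIME"] if part2_only else df["TIME"]
    return pd.Series(
        {
            "sum": times.sum(),
            "mean": times.mean(),
            "std": times.std(ddof=0),  # population std, matching np.std's default
            "min": times.min(),
            "max": times.max(),
        }
    )


part2_stats = summarize(toy)
all_stats = summarize(toy, part2_only=False)

# The same values under the two conventions for the standard deviation.
population_std = np.std(toy["TIME"].to_numpy())  # ddof=0
sample_std = toy["TIME"].std()                   # ddof=1 (pandas default)
```

With only a handful of rows the two conventions differ noticeably; with thousands of rows the difference is small.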


Let's look at the mean value of these summary stats across all runs:

In [6]:
table_string, second_row = "|", "\n|"
for func in funcs:
    table_string += f"{func.__name__}|"
    second_row += "-|"
table_string += second_row + "\n"

for func in funcs:
    values = []
    for i, df in enumerate(spacts):
        values.append(func(df['TIME']))
    res = np.mean(values)
    table_string += f"{datetime.timedelta(seconds=res)}|"
display(Markdown(table_string))

|sum|mean|std|min|max|
|-|-|-|-|-|
3 days, 17:57:08.722598|0:55:38.440439|0:23:36.111041|0:08:36|3:08:13.953917|

We can also look at the results for Part 1 and Part 2 separately:

In [7]:
table_string = "|Run|Part 1|Part 2|\n|-|-|-|\n"
for i, df in enumerate(spacts):
    table_string += f"|{i}|{datetime.timedelta(seconds=df.loc[df['REGION_INDEX']==-1, 'TIME'].values[0])}|{datetime.timedelta(seconds=sum(df.loc[df['REGION_INDEX']>-1, 'TIME']))}\n"
display(Markdown(table_string))

|Run|Part 1|Part 2|
|-|-|-|
|0|0:08:43|3 days, 17:39:47.234052
|1|0:08:45|3 days, 17:27:04.907433
|2|0:08:36|3 days, 18:06:59.344157
|3|0:08:32|3 days, 17:39:49.821496
|4|0:08:24|3 days, 18:09:02.305852


Here, we see that Part 1 for SPACTS took around 8 and a half minutes. Part 1 also submits the 96 region jobs with a 1-second pause between them, which accounts for roughly 1.5 minutes, so we can surmise that the SPACTS clustering itself took around 7 minutes. Next, we compare that against the "no merge" strategy.
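
The estimate can be checked with a few lines of arithmetic. This sketch assumes the only fixed overhead in Part 1 besides clustering is a 1-second pause per submitted job, and it uses the Part 1 times from the table above; actual submission latency is not recorded in the data.

```python
import datetime

# Part 1 times in seconds for the five SPACTS runs (from the table above):
# 0:08:43, 0:08:45, 0:08:36, 0:08:32, 0:08:24
part1_seconds = [523, 525, 516, 512, 504]

n_regions = 96
pause_per_job = 1  # seconds paused per submitted job (assumed overhead)

mean_part1 = sum(part1_seconds) / len(part1_seconds)
submission_overhead = n_regions * pause_per_job
clustering_estimate = mean_part1 - submission_overhead

print(datetime.timedelta(seconds=mean_part1))           # 0:08:36
print(datetime.timedelta(seconds=submission_overhead))  # 0:01:36
print(datetime.timedelta(seconds=clustering_estimate))  # 0:07:00
```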

<hr id="nm" />

## Analyzing without SPACTS

Now, let's load the runtime data from the analyses that didn't use SPACTS.

In [8]:
nm = []
for path in no_merge_paths:
    nm.append(pd.read_csv(os.path.join(time_path, path)))

Again, we can print summary statistics for each run:

In [9]:
funcs = [sum, np.mean, np.std, min, max]
table_string, second_row = "|Run|Regions|", "\n|-|-|"
for func in funcs:
    table_string += f"{func.__name__}|"
    second_row += "-|"
table_string += second_row + "\n"

for i, df in enumerate(nm):
    table_string += f"|{i}|{len(df)-1}|"
    for func in funcs:
        table_string += f"{datetime.timedelta(seconds=func(df['TIME']))}|"
    table_string += "\n"
display(Markdown(table_string))

|Run|Regions|sum|mean|std|min|max|
|-|-|-|-|-|-|-|
|0|7437|65 days, 14:18:56.002177|0:12:41.970422|0:09:50.028282|0:00:27.142152|2:04:54|
|1|7437|65 days, 15:37:21.864942|0:12:42.603101|0:09:50.506608|0:00:26.517856|2:04:52|
|2|7437|65 days, 22:11:24.441628|0:12:45.781721|0:09:53.965067|0:00:25.279490|2:04:48|
|3|7437|65 days, 17:48:17.816777|0:12:43.659292|0:09:50.881709|0:00:24.900945|2:04:51|
|4|7437|65 days, 12:27:17.626884|0:12:41.069861|0:09:48.581274|0:00:32.727022|2:04:45|


And get the mean value of each of these summary stats across all runs:

In [10]:
table_string, second_row = "|", "\n|"
for func in funcs:
    table_string += f"{func.__name__}|"
    second_row += "-|"
table_string += second_row + "\n"

for func in funcs:
    values = []
    for i, df in enumerate(nm):
        values.append(func(df['TIME']))
    res = np.mean(values)
    table_string += f"{datetime.timedelta(seconds=res)}|"
display(Markdown(table_string))

|sum|mean|std|min|max|
|-|-|-|-|-|
65 days, 16:28:39.550481|0:12:43.016880|0:09:50.792588|0:00:27.313493|2:04:50|

<hr id="comp" />

## Comparison

Let's compare both methods:

In [11]:
table_string = "|Method|Mean Part 1|Mean Part 2|Mean Total|STD Total|\n|-|-|-|-|-|\n"
methods = {"No Clustering": nm, "SPACTS (9.5GB)": spacts}
for key, val in methods.items():
    table_string += f"|{key}|"
    mean_part_1 = round(np.mean([df.loc[df['REGION_INDEX']==-1, 'TIME'].values[0] for df in val]))
    table_string += f"{datetime.timedelta(seconds=mean_part_1)}|"
    mean_part_2 = round(np.mean([sum(df.loc[df['REGION_INDEX']>-1, 'TIME']) for df in val]))
    table_string += f"{datetime.timedelta(seconds=mean_part_2)}|"
    table_string += f"{datetime.timedelta(seconds=mean_part_1+mean_part_2)}|"
    std_total = round(np.std([sum(df['TIME']) for df in val]))
    table_string += f"{datetime.timedelta(seconds=std_total)}|"
    table_string += "\n"

display(Markdown(table_string))

|Method|Mean Part 1|Mean Part 2|Mean Total|STD Total|
|-|-|-|-|-|
|No Clustering|2:04:50|65 days, 14:23:50|65 days, 16:28:40|3:20:46|
|SPACTS (9.5GB)|0:08:36|3 days, 17:48:33|3 days, 17:57:09|0:16:29|


Finally, we can calculate the speedup in total runtime (Part 1 plus the sum of all Part 2 jobs) from using SPACTS:

In [12]:
avg_spacts = np.mean([sum(df["TIME"]) for df in spacts])
avg_no_merge = np.mean([sum(df["TIME"]) for df in nm])
print(f"SPACTS led to a {avg_no_merge / avg_spacts:.2f}x speedup!")

SPACTS led to a 17.53x speedup!
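
The reported factor can be reproduced approximately from the rounded "Mean Total" values in the comparison table. Note that this compares total runtime summed over all jobs rather than elapsed wall-clock time. A short sketch using `datetime.timedelta` division:

```python
import datetime

# Mean totals copied from the comparison table (rounded to whole seconds).
no_clustering_total = datetime.timedelta(days=65, hours=16, minutes=28, seconds=40)
spacts_total = datetime.timedelta(days=3, hours=17, minutes=57, seconds=9)

# Dividing one timedelta by another yields a plain float ratio.
speedup = no_clustering_total / spacts_total
print(f"{speedup:.2f}x")  # 17.53x
```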
