## Runtime Comparison

**Author:** Alex Michels

In this notebook we will compare the runtime of our clustering algorithm for travel-time analysis against analyzing each hospital separately.

First, we need to load some packages for our analysis:

In [None]:
import datetime
from glob import glob
from IPython.display import display, Markdown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

Each method had a "Part 1" and "Part 2". 

* Part 1 performed the clustering (or not) and then submitted each hospital/cluster as a separate job, pausing for 1 second between jobs.
* The jobs submitted by Part 1 are the "Part 2" jobs. They performed the analysis (travel-time catchment) for the hospital or region assigned to them by Part 1. At the end of each Part 2 job, the job records its index and runtime in seconds in a file.

In each file, the row with region index "-1" holds the time in seconds that SLURM recorded for Part 1 (entered manually from the SLURM output); the remaining rows hold the Part 2 times in seconds, as recorded by the Python script. The data is at the path `../data/runtime/`. Files labeled "NM" ("no merge") hold the runs that analyzed each hospital separately, and files labeled "XXG" hold the runs that applied SPACTS with a memory limit of XX gigabytes for the computation (so the SPACTS partitioning used XX/2). We ran each method 5 times. Let's dig into the data:

In [None]:
time_path = "../data/runtime/"
runtime_file = "Par_Timings-{}-*.csv"
runtime_codes = ["20G", "26G", "32G", "40G", "48G", "56G", "64G", "72G", "NM"]
runtime_globs = [time_path + runtime_file.format(x) for x in runtime_codes]
print(runtime_globs)
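To make the pattern matching concrete, here is a minimal sketch of which file names each glob would pick up. The file names in `example_names` and the helper `names_for_code` are hypothetical and only illustrate the idea; the real files may use different suffixes. `fnmatch` applies the same shell-style wildcard rules that `glob` uses for file names.

```python
from fnmatch import fnmatch

# Same pattern template as in the cell above
runtime_file = "Par_Timings-{}-*.csv"

# Hypothetical file names, for illustration only (the real suffixes may differ)
example_names = [
    "Par_Timings-20G-run1.csv",
    "Par_Timings-NM-run1.csv",
    "Par_Timings-NM-run2.csv",
    "Par_Timings-64G-run3.txt",
]

def names_for_code(code, names):
    """Return the names that the pattern for `code` would match."""
    pattern = runtime_file.format(code)
    return [name for name in names if fnmatch(name, pattern)]

print(names_for_code("NM", example_names))
print(names_for_code("64G", example_names))
```

The `*` absorbs whatever distinguishes one run from another, so each code collects all of its runs, while files with a different extension are ignored.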

<hr id="comparison">

## Analyzing Runtimes

Now that we have specified the paths, let's load the timing data for every method:

In [None]:
csvs = {}
for code, pattern in zip(runtime_codes, runtime_globs):
    print(pattern)
    _files_to_load = glob(pattern)
    print(f"We have {len(_files_to_load)} files for code {code}")
    csvs[code] = []
    for _file in _files_to_load:
        df = pd.read_csv(_file)
        csvs[code].append(df)

So for each code we have a list of DataFrames. Let's preview one of them:

In [None]:
csvs["NM"][0].head()  # same frame as csvs[code][0], without relying on the leftover loop variable
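As a rough illustration of the layout, the sketch below builds a synthetic frame with the two columns the later cells rely on (`REGION_INDEX` and `TIME`) and separates the Part 1 row from the Part 2 rows. All values are made up, and the real files may contain additional columns.

```python
import pandas as pd

# Synthetic timing data; the numbers are invented for illustration
toy = pd.DataFrame({
    "REGION_INDEX": [-1, 0, 1, 2],
    "TIME": [120.0, 30.0, 45.5, 60.0],
})

# Part 1: the single row with REGION_INDEX == -1
part_1_time = toy.loc[toy["REGION_INDEX"] == -1, "TIME"].values[0]

# Part 2: every row with a non-negative region index
part_2_times = toy.loc[toy["REGION_INDEX"] > -1, "TIME"]

print("Part 1 (s):", part_1_time)
print("Part 2 total (s):", part_2_times.sum())
print("Overall total (s):", toy["TIME"].sum())
```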

To tell whether we achieved a speed-up, we first need the average "no merge" (NM) runtime:

In [None]:
no_merge_avg = np.mean([sum(df['TIME']) for df in csvs["NM"]])
print(datetime.timedelta(seconds=no_merge_avg))

Using that information, we can now calculate a variety of summary statistics:

* Mean Job Submission - the average time for Part 1, the clustering/job submission step.
* Mean Travel Analysis - the average time for Part 2, the travel-time calculation step, summed over all jobs in a run.
* Mean Total - the average total computing time for the method.
* STD Total - the standard deviation of the total computing time for the method.
* Turnaround Time - the maximum Part 1 time plus the maximum time of any single Part 2 job. This is the worst-case waiting time to get results back if you could run infinite (or at least 7438) jobs at once.
* Speed Up - the no merge average computing time divided by the average computing time for the method.

In [None]:
table_string = "|Method|Mean Job Submission|Mean Travel Analysis|Mean Total|STD Total|Turnaround Time|Speed Up|\n|-|-|-|-|-|-|-|\n"
for key, val in csvs.items():
    # Per-run values for this method (one entry per timing file)
    part_1_times = [df.loc[df['REGION_INDEX']==-1, 'TIME'].values[0] for df in val]
    part_2_sums = [sum(df.loc[df['REGION_INDEX']>-1, 'TIME']) for df in val]
    part_2_maxes = [max(df.loc[df['REGION_INDEX']>-1, 'TIME']) for df in val]
    totals = [sum(df['TIME']) for df in val]

    mean_part_1 = round(np.mean(part_1_times))
    mean_part_2 = round(np.mean(part_2_sums))
    mean_total = round(np.mean(totals))
    std_total = round(np.std(totals))  # population standard deviation (NumPy's default, ddof=0)
    # Worst case: slowest Part 1 plus the slowest single Part 2 job across all runs
    max_waiting_time = round(max(part_1_times) + max(part_2_maxes))

    table_string += f"|{key}|"
    table_string += f"{datetime.timedelta(seconds=mean_part_1)}|"
    table_string += f"{datetime.timedelta(seconds=mean_part_2)}|"
    table_string += f"{datetime.timedelta(seconds=mean_total)}|"
    table_string += f"{datetime.timedelta(seconds=std_total)}|"
    table_string += f"{datetime.timedelta(seconds=max_waiting_time)}|"
    table_string += f"{no_merge_avg/mean_total:.2f}x|"
    table_string += "\n"

display(Markdown(table_string))
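To check the two less obvious columns by hand, here is a small sketch on made-up numbers. The runs, the helper names `mean_total` and `turnaround`, and all timings are hypothetical; only the formulas mirror the cell above.

```python
import datetime
import numpy as np

# Made-up per-run timings in seconds: (Part 1 time, list of Part 2 job times)
baseline_runs = [(100, [50, 60, 70]), (110, [55, 65, 60])]  # stands in for "NM"
method_runs = [(40, [80, 20]), (50, [70, 30])]              # stands in for one memory limit

def mean_total(runs):
    """Average over runs of the Part 1 time plus all Part 2 job times."""
    return np.mean([part_1 + sum(part_2) for part_1, part_2 in runs])

def turnaround(runs):
    """Slowest Part 1 plus the slowest single Part 2 job, across all runs."""
    return max(part_1 for part_1, _ in runs) + max(max(part_2) for _, part_2 in runs)

# Speed-up: baseline mean total divided by the method's mean total
speed_up = mean_total(baseline_runs) / mean_total(method_runs)

print(f"Speed Up: {speed_up:.2f}x")
print("Turnaround Time:", datetime.timedelta(seconds=turnaround(method_runs)))
```

Here the baseline averages 285 seconds and the method averages 145 seconds, so the speed-up is 285 / 145, or about 1.97x. The turnaround for the method is 50 + 80 = 130 seconds, even though no single run contains both of those values, which is why it is a worst-case figure.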

Let's make a line graph to illustrate the average computing times with standard deviation error bars. First, let's collect the information we need:

In [None]:
gbs = [int(code[:-1]) for code in runtime_codes[:-1]]
means = [round(np.mean([sum(df['TIME']) for df in val])) / 3600.0 for key, val in csvs.items()][:-1]
stds = [round(np.std([sum(df['TIME']) for df in val])) / 3600.0 for key, val in csvs.items()][:-1]
gbs, means, stds

Then, we can plot it:

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
fontsize = 20
plt.errorbar(gbs, means, yerr=stds, color='red')
plt.xlabel("Memory Limit (GB)", fontsize=fontsize)
plt.xticks(gbs, fontsize=fontsize-4)
plt.ylabel("Mean computing time (hours)", fontsize=fontsize)
plt.yticks(fontsize=fontsize-4)
ax.grid(True)
plt.tight_layout()
plt.savefig("../img/Runtimes.jpg")
plt.show()
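One detail worth keeping in mind when reading the error bars: `np.std` computes the population standard deviation by default (`ddof=0`). With only 5 runs per method, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below shows the difference on made-up totals; it is not a claim about which estimator the analysis should use.

```python
import numpy as np

# Made-up total runtimes for 5 runs, in seconds
totals = [10.0, 12.0, 14.0, 16.0, 18.0]

population_std = np.std(totals)      # divides by n (NumPy's default, ddof=0)
sample_std = np.std(totals, ddof=1)  # divides by n - 1

print(f"ddof=0: {population_std:.3f}")
print(f"ddof=1: {sample_std:.3f}")
```

For these values the squared deviations sum to 40, so the two results are the square roots of 40/5 and 40/4.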