# Benchmark Data Compilation Notebook

In this notebook, we compile benchmark data from the `\bulk_benchmarks` and `\single_benchmarks` directories into two distinct `.csv` files containing the raw data.
These files serve as datasets for direct comparison with the benchmarks of the CytoSnake workflows.


## Imports 

In [1]:
import sys
import json
import pathlib
import pandas as pd
from datetime import datetime

sys.path.append("../../../")
from src.benchmark_utils import get_benchmark_files
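
The implementation of `get_benchmark_files` is not shown in this notebook. As a rough sketch only, a helper with similar behavior could look like the following. The function name `find_files_by_ext` and the recursive search are assumptions, not the actual code in `src.benchmark_utils`.

```python
import pathlib
from typing import Iterator


def find_files_by_ext(root: pathlib.Path, ext: str) -> Iterator[pathlib.Path]:
    """Hypothetical stand-in for get_benchmark_files (not the real helper).

    Yields every file under `root` whose extension matches `ext`.
    """
    # rglob walks the directory tree recursively (an assumption about the real helper);
    # sorting keeps the order stable between runs
    yield from sorted(root.rglob(f"*.{ext}"))
```

Like the call in this notebook, the result is an iterable, so it can be wrapped in `list(...)` to collect all paths at once.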

## Parameters Used in this Notebook

The following parameters are used in this notebook.

In [2]:
# inputs
working_dir = pathlib.Path().resolve()
benchmark_dir = pathlib.Path("./benchmarks").resolve(strict=True)

# output paths
# single_benchmark_csv = working_dir / "single_benchmarks.csv"

## Loading all JSON files 

Here we load all benchmark JSON files.
The file names follow the pattern `{Plate_name}_{type}_{process}_benchmarks.json`.
We also load the file size information for each plate from `file_size.json`.

In [3]:
single_json_files = list(get_benchmark_files(benchmark_dir, ext="json"))

# loading json file that contains file size information
with open("./file_size.json", encoding="utf-8", mode="r") as content:
    plate_size = json.load(content)
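
To illustrate the file name pattern described above, the sketch below splits one name into its plate and process parts. The file name is hypothetical, and the `CFReT` type marker is taken from the parsing code used in this notebook.

```python
import pathlib

# Hypothetical file name that follows {Plate_name}_{type}_{process}_benchmarks.json
example = pathlib.Path("localhost231120090001_CFReT_annotate_benchmarks.json")

# Everything before "_CFReT_" is the plate name
plate_name, _, remainder = example.stem.partition("_CFReT_")

# Everything between "_CFReT_" and "_benchmark" is the process name
process_name = remainder.split("_benchmark")[0]

print(plate_name)    # localhost231120090001
print(process_name)  # annotate
```

Splitting on the type marker rather than on every underscore keeps plate names that contain underscores intact.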

In [4]:
# applying time format
tformat = "%Y-%m-%d %H:%M:%S.%f"

# collecting all data
raw_benchmark_data = []

# iterate over each JSON file and extract its data
for single_json_file in single_json_files:
    # collect the plate name, file size, and process name from the file name
    plate_name = single_json_file.stem.split("_CFReT_")[0]
    file_size = plate_size[plate_name]
    process_name = single_json_file.stem.split("_CFReT_")[1].split("_benchmark")[0]

    # open the JSON file to extract the benchmark information
    with open(single_json_file, encoding="utf-8", mode="r") as contents:
        benchmark_data = json.load(contents)

    # access the benchmark metadata and parse the timestamps once
    meta_data = benchmark_data["metadata"]
    start_time = datetime.strptime(meta_data["start_time"], tformat)
    end_time = datetime.strptime(meta_data["end_time"], tformat)

    selected_data = {
        "pid": meta_data["pid"],
        "process_name": process_name,
        "input_data_name": plate_name,
        "start_time": start_time,
        "end_time": end_time,
        # duration in seconds
        "time_duration": (end_time - start_time).total_seconds(),
        "total_allocations": int(meta_data["total_allocations"]),
        # divide by 1024**2 (bytes to MiB, assuming the raw value is in bytes)
        "peak_memory": round(int(meta_data["peak_memory"]) / 1024**2, 3),
        "file_size": file_size,
    }

    # append to list
    raw_benchmark_data.append(selected_data)
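
The two derived columns, `time_duration` and `peak_memory`, can be checked by hand with a small worked example. The metadata values below are made up for illustration, and the peak memory is assumed to be reported in bytes.

```python
from datetime import datetime

tformat = "%Y-%m-%d %H:%M:%S.%f"

# Hypothetical metadata entry: field names match the notebook, values are invented
meta_data = {
    "start_time": "2023-12-13 11:09:07.000000",
    "end_time": "2023-12-13 11:09:09.500000",
    "peak_memory": "524288000",
}

start = datetime.strptime(meta_data["start_time"], tformat)
end = datetime.strptime(meta_data["end_time"], tformat)

# Subtracting two datetimes gives a timedelta; total_seconds() returns a float
duration_s = (end - start).total_seconds()

# Assumed to be bytes; 1024**2 bytes equal 1 MiB
peak_mib = round(int(meta_data["peak_memory"]) / 1024**2, 3)

print(duration_s)  # 2.5
print(peak_mib)    # 500.0
```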

In [5]:
# create a DataFrame and save it as a CSV file
benchmark_df = pd.DataFrame(raw_benchmark_data)
benchmark_df.to_csv("CFReT_complete_benchmark.csv", index=False)
benchmark_df

,pid,process_name,input_data_name,start_time,end_time,time_duration,total_allocations,peak_memory,file_size
0,15073,annotate,localhost231120090001,2023-12-13 11:09:07.399,2023-12-13 11:09:10.065,2.666,1162900,485.398,338.357
1,15073,annotate,localhost230405150001,2023-12-13 11:14:40.085,2023-12-13 11:14:43.923,3.838,1164111,820.28,416.983
2,15073,normalize,localhost220513100001_KK22-05-198_FactinAdjusted,2023-12-13 11:10:21.893,2023-12-13 11:10:26.258,4.365,2271682,1259.332,279.372
3,15073,annotate,localhost220512140003_KK22-05-198,2023-12-13 11:11:41.002,2023-12-13 11:11:48.497,7.495,1176907,2603.802,676.675
4,15073,feature_select,localhost230405150001,2023-12-13 11:14:53.014,2023-12-13 11:16:30.233,97.219,1704443,1167.728,416.983
5,15073,annotate,localhost220513100001_KK22-05-198_FactinAdjusted,2023-12-13 11:10:16.200,2023-12-13 11:10:19.258,3.058,1162799,1079.838,279.372
6,15073,feature_select,localhost231120090001,2023-12-13 11:09:16.733,2023-12-13 11:10:15.176,58.443,1712833,713.557,338.357
7,15073,feature_select,localhost220512140003_KK22-05-198,2023-12-13 11:12:05.866,2023-12-13 11:14:38.615,152.749,1724136,2069.401,676.675
8,15073,normalize,localhost230405150001,2023-12-13 11:14:47.516,2023-12-13 11:14:52.941,5.425,2273291,1833.306,416.983
9,15073,normalize,localhost231120090001,2023-12-13 11:09:12.628,2023-12-13 11:09:16.656,4.028,2272288,1101.74,338.357
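
CSV files store timestamps as plain strings, so the datetime columns need to be parsed again when the file is reloaded. The sketch below shows one way to do this with `pandas.read_csv`, using two rows with values copied from the table above. The in-memory `csv_text` stands in for the saved file and is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

# Two rows with values taken from the table above (index column omitted)
csv_text = """pid,process_name,input_data_name,start_time,end_time,time_duration,total_allocations,peak_memory,file_size
15073,annotate,localhost231120090001,2023-12-13 11:09:07.399,2023-12-13 11:09:10.065,2.666,1162900,485.398,338.357
15073,feature_select,localhost231120090001,2023-12-13 11:09:16.733,2023-12-13 11:10:15.176,58.443,1712833,713.557,338.357
"""

# parse_dates converts the listed columns back to datetime64 values
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_time", "end_time"])

# Recompute the duration from the timestamps as a consistency check
recomputed = (reloaded["end_time"] - reloaded["start_time"]).dt.total_seconds()
max_diff = (recomputed - reloaded["time_duration"]).abs().max()
print(max_diff < 1e-6)
```

If the recomputed durations differ from the stored `time_duration` column, the timestamps were likely parsed with the wrong format or truncated on export.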
