# Improved Benchmark Processing with CFReT-dataset using CytoSnake's `cp_process_singlecells` Workflow

This notebook compiles the memory and runtime benchmark data collected while running CytoSnake's `cp_process_singlecells` workflow into a unified benchmark profile, which can then be compared with the other benchmark profiles generated in this repo.

This benchmark profile was generated using the [`CFReT-data`](https://github.com/WayScience/CFReT_data) dataset.

**NOTE**: This benchmark profile only profiles the single-cell pipeline found [here](https://github.com/WayScience/CFReT_data/blob/main/3.process_cfret_features/1.single_cell_processing.ipynb).

In [1]:
import sys
import json
import pathlib
import subprocess
from datetime import datetime

import pandas as pd

sys.path.append("../../../")
from src.benchmark_utils import get_benchmark_files

## Generating CytoSnake `cp_process_singlecells` Benchmark Profile with CFReT Data

In this section, we generate a benchmark profile for CytoSnake's `cp_process_singlecells` workflow applied to the CFReT dataset. Detailed documentation on the CFReT dataset is available in its repository and in the controls directory of this repository.

In [2]:
# parameters
CWD_PATH = pathlib.Path(".").resolve()
BENCHMARK_DIR = pathlib.Path("./benchmarks/").resolve(strict=True)

In [3]:
# loading in file size data:
with open("./file_size.json", mode="r") as stream:
    file_sizes = json.load(stream)

# this is in MB
print(file_sizes)

{'localhost220513100001_KK22-05-198_FactinAdjusted': 327.434, 'localhost220512140003_KK22-05-198': 788.492, 'localhost231120090001': 409.934, 'localhost230405150001': 476.719}


In [4]:
# converting .bin files into .json files (~2min)
# skip this block if the json files are already in the benchmarks/ directory
for bin_path in get_benchmark_files(BENCHMARK_DIR, ext="bin"):
    json_out = BENCHMARK_DIR / f"{bin_path.stem}.json"

    # executing memray to convert the .bin file into a .json file
    memray_stats = subprocess.run(
        [
            "memray",
            "stats",
            "--json",
            "--output",
            str(json_out),
            "--force",
            str(bin_path),
        ],
        capture_output=True,
        check=True,
    )

    # stdout message
    print(
        f"{bin_path.relative_to(CWD_PATH)} was successfully converted into {json_out.relative_to(CWD_PATH)}"
    )

benchmarks/localhost230405150001_CFReT_features_annotate_benchmarks.bin was successfully converted into benchmarks/localhost230405150001_CFReT_features_annotate_benchmarks.json
benchmarks/localhost230405150001_CFReT_feature_select_benchmarks.bin was successfully converted into benchmarks/localhost230405150001_CFReT_feature_select_benchmarks.json
benchmarks/localhost220513100001_KK22-05-198_FactinAdjusted_CFReT_feature_select_benchmarks.bin was successfully converted into benchmarks/localhost220513100001_KK22-05-198_FactinAdjusted_CFReT_feature_select_benchmarks.json
benchmarks/localhost231120090001_CFReT_features_normalize_benchmarks.bin was successfully converted into benchmarks/localhost231120090001_CFReT_features_normalize_benchmarks.json
benchmarks/localhost220512140003_KK22-05-198_CFReT_converted_normalized_feature_select_benchmarks.bin was successfully converted into benchmarks/localhost220512140003_KK22-05-198_CFReT_converted_normalized_feature_select_benchmarks.json
benchmarks/
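
The comment in the cell above says to skip the conversion when the JSON files already exist. A minimal sketch of automating that check is shown below. The helper name `needs_conversion` is hypothetical and not part of this repo, and the sketch assumes that a JSON file newer than its `.bin` capture is up to date.

```python
from pathlib import Path


def needs_conversion(bin_path: Path, json_path: Path) -> bool:
    """Return True when the JSON summary is missing or older than the .bin capture.

    Hypothetical helper: assumes file modification times reflect freshness.
    """
    if not json_path.exists():
        return True
    return json_path.stat().st_mtime < bin_path.stat().st_mtime
```

Inside the conversion loop, a guard such as `if not needs_conversion(bin_path, json_out): continue` placed before the `subprocess.run` call would leave existing, up-to-date JSON files untouched.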

In [5]:
# applying time format
tformat = "%Y-%m-%d %H:%M:%S.%f"

# using all json files to compile benchmark profile
raw_benchmark_data = []
for json_path in get_benchmark_files(BENCHMARK_DIR, ext="json"):
    # collecting data from just file name
    plate_name = json_path.stem.split("_CFReT_")[0]
    process_name = json_path.stem.rsplit("_", 3)[2]

    # updating name
    if process_name == "select":
        process_name = "feature_select"

    # look up the input file size (MB); fall back to 0.0 when the plate is not listed
    file_size = file_sizes.get(plate_name, 0.0)

    # opening json file to extract benchmark information
    with open(json_path, encoding="utf-8", mode="r") as contents:
        benchmark_data = json.load(contents)

        # accessing all metadata from the benchmark
        meta_data = benchmark_data["metadata"]
        selected_data = {
            "pid": meta_data["pid"],
            "process_name": process_name,
            "input_data_name": plate_name,
            "start_time": datetime.strptime(meta_data["start_time"], tformat),
            "end_time": datetime.strptime(meta_data["end_time"], tformat),
            "time_duration": (
                datetime.strptime(meta_data["end_time"], tformat)
                - datetime.strptime(meta_data["start_time"], tformat)
            ).total_seconds(),
            "total_allocations": int(meta_data["total_allocations"]),
            "peak_memory": round(int(meta_data["peak_memory"]) / 1024**2, 3),
            "file_size": file_size,
        }
        # append to list
        raw_benchmark_data.append(selected_data)
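
The cell above derives the plate name and process name from each file name, then converts timestamps into a duration and the peak memory from bytes into units of 1024² bytes. The sketch below isolates those steps so they can be checked on their own. The function names are hypothetical, the file stems are taken from the conversion output shown earlier, and the timestamp strings are assumed to follow the `tformat` pattern used in the cell.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # same pattern as `tformat` in the cell above


def parse_benchmark_stem(stem: str) -> tuple[str, str]:
    """Split a benchmark file stem into (plate_name, process_name)."""
    # everything before "_CFReT_" is the plate name
    plate_name = stem.split("_CFReT_")[0]
    # the token just before the trailing "benchmarks" is the process name
    process_name = stem.rsplit("_", 3)[2]
    if process_name == "select":
        process_name = "feature_select"
    return plate_name, process_name


def duration_seconds(start: str, end: str) -> float:
    """Elapsed seconds between two timestamps written in TIME_FORMAT."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


def bytes_to_mib(n_bytes: int) -> float:
    """Convert bytes into units of 1024**2 bytes, rounded to 3 decimals."""
    return round(n_bytes / 1024**2, 3)


print(parse_benchmark_stem("localhost230405150001_CFReT_features_annotate_benchmarks"))
# ('localhost230405150001', 'annotate')
print(parse_benchmark_stem("localhost230405150001_CFReT_feature_select_benchmarks"))
# ('localhost230405150001', 'feature_select')
print(duration_seconds("2024-01-03 13:42:03.495", "2024-01-03 13:42:06.145"))
```

Note that dividing by `1024**2` yields mebibytes, so the `peak_memory` column is in MiB even where it is loosely labeled MB.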

In [6]:
# generating benchmark profile and saving it
benchmark_profile = pd.DataFrame(data=raw_benchmark_data)
benchmark_profile

| | pid | process_name | input_data_name | start_time | end_time | time_duration | total_allocations | peak_memory | file_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 661135 | feature_select | localhost220512140003_KK22-05-198 | 2024-01-03 13:53:07.149 | 2024-01-03 13:54:35.243 | 88.094 | 48067410 | 1573.907 | 788.492 |
| 1 | 654771 | annotate | localhost231120090001 | 2024-01-03 13:42:03.495 | 2024-01-03 13:42:06.145 | 2.65 | 11143355 | 200.382 | 409.934 |
| 2 | 656452 | feature_select | localhost220513100001_KK22-05-198_FactinAdjusted | 2024-01-03 13:42:50.285 | 2024-01-03 13:43:25.107 | 34.822 | 14987908 | 605.612 | 327.434 |
| 3 | 655244 | feature_select | localhost231120090001 | 2024-01-03 13:42:12.078 | 2024-01-03 13:42:12.442 | 0.364 | 22579 | 5.899 | 409.934 |
| 4 | 649937 | annotate | localhost220513100001_KK22-05-198_FactinAdjusted | 2024-01-03 13:41:02.547 | 2024-01-03 13:42:43.255 | 100.708 | 71670736 | 1421.573 | 327.434 |
| 5 | 659158 | feature_select | localhost230405150001 | 2024-01-03 13:45:48.016 | 2024-01-03 13:46:36.071 | 48.055 | 23197286 | 790.029 | 476.719 |
| 6 | 657366 | normalize | localhost230405150001 | 2024-01-03 13:43:28.184 | 2024-01-03 13:45:46.177 | 137.993 | 102492374 | 2263.525 | 476.719 |
| 7 | 649934 | normalize | localhost220513100001_KK22-05-198_FactinAdjusted | 2024-01-03 13:41:02.569 | 2024-01-03 13:42:48.435 | 105.866 | 72146979 | 1639.288 | 327.434 |
| 8 | 661001 | normalize | localhost220512140003_KK22-05-198 | 2024-01-03 13:48:58.042 | 2024-01-03 13:53:04.982 | 246.94 | 195564542 | 4525.042 | 788.492 |
| 9 | 655050 | normalize | localhost231120090001 | 2024-01-03 13:42:07.727 | 2024-01-03 13:42:10.582 | 2.855 | 11186824 | 200.075 | 409.934 |


In [7]:
benchmark_profile.to_csv("CFReT_cp_processing_singlecells_benchmark_profile.csv")
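
Because `to_csv` is called here without `index=False`, the row index is written as an unnamed first column. When the profile is read back for comparison, that column shows up as `Unnamed: 0` unless it is declared as the index. The sketch below illustrates this with a toy two-row frame standing in for the real profile and an in-memory buffer in place of the CSV file. Both are assumptions made for illustration only.

```python
import io

import pandas as pd

# Toy stand-in for the benchmark profile (two rows, two columns)
profile = pd.DataFrame(
    {"pid": [661135, 654771], "process_name": ["feature_select", "annotate"]}
)

# to_csv() without a path returns the CSV text; the index is included by default
csv_text = profile.to_csv()

# Reading without index_col keeps the saved index as an extra data column
naive = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index restores the original layout
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))     # ['Unnamed: 0', 'pid', 'process_name']
print(list(restored.columns))  # ['pid', 'process_name']
```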