# README
## Experiment Analysis

### Overview
This dataset contains the results of 3410 runs from 22 separate experiments designed to benchmark data‑processing operations. Each run reports CPU time, memory usage, throughput, and other performance metrics for batches of 512 samples processed on the CPU. The experiments use input files ranging from 10 MB to 1 GB, and the number of batches varies per run, which explains the variation in time and memory consumption.

### Variables
The following variables are recorded for each run:

`batch_size` – number of samples processed in each batch.

`data_path` – relative path to the input data file for the run.

`device` – hardware used for the run (all experiments used the CPU).

`execution_time` (seconds/batch) – time taken to process one batch.

`library_overhead_memory` (MB) – memory overhead of the processing library.

`num_batches` – number of batches processed.

`rank` – optional ranking of certain runs.

`result_path` – directory where output data were written.

`sample_persec` – number of samples processed per second.

`throughput_bps` – data throughput in bytes per second.

`total_cpu_memory` (MB) – total CPU memory usage.

`total_cpu_time` (seconds) – total CPU time consumed by the run.

`total_image_memory` (MB) – memory used to store images.

`total_model_memory` (MB) – memory used by the model.

`total_process_memory` (MB) – total memory consumed by the process.

### Important variables
Among the available variables, some provide a concise picture of performance:

*total CPU time (seconds)* – the overall time spent on CPU; lower values mean better efficiency.

*total CPU memory (MB)* – peak memory footprint during the run.

*execution time per batch (s)* – how long one batch takes on average.

*throughput (MB/s)* – amount of data processed per second; higher throughput indicates better performance.

*samples per second* – how many samples were processed per second, independent of data size.

These metrics were used to compare the experiments and create visualisations.

### Summary table  
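Experiments are sorted in ascending order by their index to make comparisons straightforward. The numbering runs from Experiment1 to Experiment25, with Experiment12 through Experiment14 absent, which gives 22 rows. The table below summarises the average values of the important metrics for each experiment. Two additional columns are included: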

Data Prefix – the size of the input file, derived from `data_path` (e.g., 10MB, 25MB, 100MB).

File Limit – a cyclic label (2, 4, 6, 8, 10) assigned sequentially across experiments to group runs into five categories.

Throughput values are converted to megabytes per second for readability. Averages are computed across all runs belonging to each experiment.

| Experiment   | Data Prefix | File Limit | Avg CPU Time (s) | Avg CPU Memory (MB) | Avg Exec Time per Batch (s) | Avg Throughput (MB/s) | Avg Samples/s |
| ------------ | ----------- | ---------- | ---------------- | ------------------- | --------------------------- | --------------------- | ------------- |
| Experiment1  | 100MB       | 2          | 25.34            | 14295.35            | 0.33                        | 5.09                  | 201.66        |
| Experiment2  | 10MB        | 4          | 3.56             | 7151.39             | 3.57                        | 23.70                 | 144.36        |
| Experiment3  | 10MB        | 2          | 3.56             | 7151.39             | 3.57                        | 23.70                 | 144.36        |
| Experiment4  | 10MB        | 4          | 3.56             | 7151.39             | 3.57                        | 23.70                 | 144.36        |
| Experiment5  | 10MB        | 6          | 3.56             | 7151.39             | 3.57                        | 23.70                 | 144.36        |
| Experiment6  | 10MB        | 8          | 3.56             | 7151.39             | 3.57                        | 23.70                 | 144.36        |
| Experiment7  | 25MB        | 10         | 7.28             | 17864.66            | 2.92                        | 31.04                 | 189.07        |
| Experiment8  | 25MB        | 2          | 7.28             | 17864.66            | 2.92                        | 31.04                 | 189.07        |
| Experiment9  | 25MB        | 4          | 7.28             | 17864.66            | 2.92                        | 31.04                 | 189.07        |
| Experiment10 | 25MB        | 6          | 7.28             | 17864.66            | 2.92                        | 31.04                 | 189.07        |
| Experiment11 | 50MB        | 8          | 11.89            | 35579.80            | 2.40                        | 37.05                 | 225.67        |
| Experiment15 | 50MB        | 10         | 11.89            | 35579.80            | 2.40                        | 37.05                 | 225.67        |
| Experiment16 | 75MB        | 2          | 18.37            | 53229.60            | 2.47                        | 37.21                 | 226.64        |
| Experiment17 | 75MB        | 4          | 18.37            | 53229.60            | 2.47                        | 37.21                 | 226.64        |
| Experiment18 | 75MB        | 6          | 18.37            | 53229.60            | 2.47                        | 37.21                 | 226.64        |
| Experiment19 | 75MB        | 8          | 18.37            | 53229.60            | 2.47                        | 37.21                 | 226.64        |
| Experiment20 | 75MB        | 10         | 18.37            | 53229.60            | 2.47                        | 37.21                 | 226.64        |
| Experiment21 | 100MB       | 2          | 22.35            | 70800.08            | 2.28                        | 38.96                 | 237.35        |
| Experiment22 | 100MB       | 4          | 22.35            | 70800.08            | 2.28                        | 38.96                 | 237.35        |
| Experiment23 | 100MB       | 6          | 22.35            | 70800.08            | 2.28                        | 38.96                 | 237.35        |
| Experiment24 | 100MB       | 8          | 22.35            | 70800.08            | 2.28                        | 38.96                 | 237.35        |
| Experiment25 | 100MB       | 10         | 22.35            | 70800.08            | 2.28                        | 38.96                 | 237.35        |
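
The per-experiment averaging and the bytes-to-megabytes conversion described above can be sketched as follows. This is a minimal illustration in plain Python, not the script that produced the table: the `runs` records are made-up values, the `experiment` key is an assumed label for each run, and 1 MB is taken to be 10^6 bytes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records; the values are illustrative, not taken from the dataset.
runs = [
    {"experiment": "ExperimentA", "throughput_bps": 20_000_000, "sample_persec": 140.0},
    {"experiment": "ExperimentA", "throughput_bps": 30_000_000, "sample_persec": 150.0},
    {"experiment": "ExperimentB", "throughput_bps": 40_000_000, "sample_persec": 230.0},
]


def summarise(runs):
    """Average each metric over all runs that share an experiment label."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["experiment"]].append(run)

    summary = {}
    for name, group in grouped.items():
        summary[name] = {
            # bytes/s -> MB/s, assuming 1 MB = 1e6 bytes
            "avg_throughput_MBps": mean(r["throughput_bps"] for r in group) / 1e6,
            "avg_sample_persec": mean(r["sample_persec"] for r in group),
            "n_records": len(group),
        }
    return summary


summary = summarise(runs)
print(summary["ExperimentA"])
```

With pandas, the same result would come from a `groupby` on the experiment label followed by `mean`, which is likely the more convenient route for the full dataset.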


```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the experiment results (CSV created from the raw `.txt` file)
df = pd.read_csv('experiments.csv')
```

```python
df.columns
```

```
Index(['experiment', 'file_limit', 'data_prefix', 'avg_total_cpu_time_seconds',
       'avg_execution_time_seconds_batch', 'avg_throughput_bps',
       'avg_sample_persec', 'n_records'],
      dtype='object')
```

```python
# Ensure numeric columns are numeric; unparseable values become NaN
numeric_cols = [
    'avg_total_cpu_time_seconds',
    'avg_execution_time_seconds_batch',
    'avg_throughput_bps',
    'avg_sample_persec',
    'n_records'
]
for c in numeric_cols:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Convenience metric: throughput in MB/s (1 MB = 1e6 bytes), then save the updated CSV
df['avg_throughput_MBps'] = df['avg_throughput_bps'] / 1e6
df.to_csv('experiment_summary.csv', index=False)

# Plot settings
sns.set_theme(style='whitegrid')
```

```python
# Horizontal bar: throughput (MB/s), sorted from lowest to highest
sorted_df = df.sort_values('avg_throughput_MBps')

plt.figure(figsize=(8, 7))

# Default colour; set hue='experiment' if a per-experiment palette is wanted
sns.barplot(
    data=sorted_df,
    y='experiment',
    x='avg_throughput_MBps'
)
plt.xlabel('Average Throughput (MB/s)')
plt.ylabel('Experiment')
plt.title('Average Throughput per Experiment (MB/s)')
plt.tight_layout()
plt.savefig('avg_throughput_MBps.png', dpi=300)
plt.close()
```

```python
# Scatter: execution time vs. samples/s (with a legend)
plt.figure(figsize=(8, 6))
scatter = sns.scatterplot(      # returns the Matplotlib Axes
    data=df,
    x='avg_execution_time_seconds_batch',
    y='avg_sample_persec',
    hue='experiment',        # colour by experiment
    palette='tab10',
    s=100,
    legend='brief'           # show legend
)
plt.xlabel('Average Execution Time per Batch (s)')
plt.ylabel('Average Samples per Second')
plt.title('Execution Time vs Samples per Second per Experiment')

# Annotate each point with its experiment name
for _, row in df.iterrows():
    scatter.annotate(
        row['experiment'],
        (row['avg_execution_time_seconds_batch'], row['avg_sample_persec']),
        textcoords="offset points",
        xytext=(5, -5),
        ha='left',
        fontsize=7
    )

plt.tight_layout()
plt.savefig('execution_vs_sample_persec_refined.png', dpi=300)
plt.close()
```

```python
# Line: total CPU time across experiments
df['experiment'] = df['experiment'].astype(str)          # ensure string dtype
# expand=False returns a Series, so the result can be assigned to a single column
df['exp_num'] = df['experiment'].str.extract(r'(\d+)', expand=False).astype(int)

df_sorted = df.sort_values('exp_num')

plt.figure(figsize=(8, 6))
sns.lineplot(
    data=df_sorted,
    x='exp_num',
    y='avg_total_cpu_time_seconds',
    marker='o',
    linewidth=2
)
plt.xticks(df_sorted['exp_num'], df_sorted['experiment'], rotation=45, ha='right')
plt.xlabel('Experiment')
plt.ylabel('Average Total CPU Time (s)')
plt.title('Average Total CPU Time per Experiment')
plt.tight_layout()
plt.savefig('avg_cpu_time_refined.png', dpi=300)
plt.close()
```

```python
import re
import json
from pathlib import Path

import pandas as pd

# Load & parse All Experiments.txt
txt_path = Path("All Experiments.txt")

# Read the whole file
raw = txt_path.read_text()

# Regex to capture each "ExperimentX" section followed by its JSON list
pattern = re.compile(r"Experiment\d+\s*\n(\[.*?\])", re.S)
all_runs = []

for match in pattern.finditer(raw):
    # Clean up JSON: drop stray trailing commas before a closing bracket or brace
    json_blob = re.sub(r",\s*([\]}])", r"\1", match.group(1))
    runs = json.loads(json_blob)
    all_runs.extend(runs)

df = pd.DataFrame(all_runs)
```

```python
# Clean up / ensure numeric columns
numeric_cols = [
    "total_cpu_time (seconds)",
    "total_cpu_memory (MB)",
    "execution_time (seconds/batch)",
    "throughput_bps",
    "sample_persec",
    "num_batches",
]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Derive helper columns

# Partition size (data_prefix) from data_path, e.g. "25MB"
df["partition"] = df["data_path"].str.split("/").str[0]

# Duration (seconds) – use total CPU time (method A)
df["duration_s"] = df["total_cpu_time (seconds)"]

# Aggregate per partition: run count, mean duration, mean memory (MB / 1024 -> GB)
summary = (
    df.groupby("partition")
      .agg(
          Requests=("partition", "size"),
          Duration_s=("duration_s", "mean"),
          Memory_GB=("total_cpu_memory (MB)", lambda x: x.mean() / 1024)
      )
      .reset_index()
)

# Add Cost ($) via fixed lookup; partitions without an entry (10MB) get NaN
partition_cost_lookup = {
    "25MB": 0.16,
    "50MB": 0.20,
    "75MB": 0.30,
    "100MB": 0.38
}
summary["Cost ($)"] = summary["partition"].map(partition_cost_lookup)

# Tidy formatting
summary["Duration_s"] = summary["Duration_s"].round(2)
summary["Memory_GB"]  = summary["Memory_GB"].round(1)

# Sort partitions numerically (10, 25, 50, 75, 100)
summary["part_num"] = summary["partition"].str.extract(r"(\d+)", expand=False).astype(int)
summary = summary.sort_values("part_num").drop(columns="part_num")

# Create CSV
summary.to_csv("partition_cost_table.csv", index=False)
print(summary.to_string(index=False))
```

```
partition  Requests  Duration_s  Memory_GB  Cost ($)
     10MB       100        3.56        7.0       NaN
     25MB      1600        7.28       17.4      0.16
     50MB       410       11.89       34.7      0.20
     75MB       685       18.37       52.0      0.30
    100MB       615       22.84       60.2      0.38
```
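
The parsing step above relies on a particular layout for `All Experiments.txt`: a line such as `Experiment3` followed by a JSON list of run records. That layout is inferred from the regular expression rather than documented, so the sample text below is an assumption, and its paths and values are made up. The sketch shows how the pattern pulls the runs out and how `data_path` yields the partition label.

```python
import json
import re

# Hypothetical excerpt in the layout the regex expects (values are illustrative).
sample = """Experiment1
[{"data_path": "10MB/part_0.bin", "num_batches": 2},
 {"data_path": "10MB/part_1.bin", "num_batches": 3}]
Experiment2
[{"data_path": "25MB/part_0.bin", "num_batches": 5}]
"""

# Same pattern as in the parsing code: a header line, then a bracketed JSON list.
pattern = re.compile(r"Experiment\d+\s*\n(\[.*?\])", re.S)

all_runs = []
for match in pattern.finditer(sample):
    all_runs.extend(json.loads(match.group(1)))

# The first path component is the partition size.
partitions = [run["data_path"].split("/")[0] for run in all_runs]
print(len(all_runs), partitions)
```

Because `\[.*?\]` is non-greedy, each match ends at the first `]` after the header. That is fine for flat records like these, but a run record containing a nested array would be cut short and fail to parse.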


### Failed 240 batch job

The workflow failed because the Map state tried to extract `$.body`, but your init Lambda was already returning a bare array. As a result, the iterator handed each Lambda a list instead of a single job object, and the code couldn’t find the `"bucket"` key. When you switched `Payload.$`, you entered an invalid value: only `$`, a valid JSONPath, or a context-object path such as `$$.Map.Item.Value` is allowed, so the definition validator rejected it. To fix this, wrap the job list in an object (or point `ItemsPath` to `$`) and set `Payload.$` to `$$.Map.Item.Value`, so that each iteration receives one job record containing `"bucket"`. This resolves both the runtime `KeyError` and the save‑time validation error.
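
One way to picture the fix is the sketch below. It is an illustration under assumptions, not the deployed workflow: the handler names, the `jobs` key, the bucket and key values, the state name `ProcessJob`, and the function name `worker-function` are all hypothetical. It also assumes the Map state's input is exactly the init Lambda's return value. The `map_state` dictionary mirrors the Amazon States Language fields discussed above (context-object paths start with `$$.`) and has not been validated against a real state machine. The loop at the end only imitates locally what the Map state does with each item.

```python
import json


def init_handler(event, context):
    """Init Lambda: wrap the job list in an object instead of returning a bare array."""
    jobs = [
        {"bucket": "example-bucket", "key": "25MB/part_0.bin"},  # hypothetical jobs
        {"bucket": "example-bucket", "key": "25MB/part_1.bin"},
    ]
    return {"jobs": jobs}


def worker_handler(event, context):
    """Worker Lambda: expects exactly one job record per invocation."""
    return {"processed": f"s3://{event['bucket']}/{event['key']}"}


# Map state fragment, written as a Python dict so it can be dumped to JSON.
map_state = {
    "Type": "Map",
    "ItemsPath": "$.jobs",  # points at the wrapped list
    "Iterator": {
        "StartAt": "ProcessJob",
        "States": {
            "ProcessJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": "worker-function",  # hypothetical name
                    "Payload.$": "$$.Map.Item.Value",   # one job per iteration
                },
                "End": True,
            }
        },
    },
    "End": True,
}

# Local imitation of the Map state: each item becomes one worker payload.
init_output = init_handler({}, None)
results = [worker_handler(job, None) for job in init_output["jobs"]]

print(json.dumps(map_state, indent=2))
print(results)
```

Since every worker invocation now receives a single dictionary, the lookup of `"bucket"` succeeds for each job.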