# Performance Test Reliability Experiment Analysis

This notebook analyzes the variability of DHIS2 performance tests across different GitHub runners and configurations.

## Experiment Setup

The experiment compares 4 configurations:
1. GitHub's Ubuntu runner with shareConnections off
2. GitHub's Ubuntu runner with shareConnections on  
3. BuildJet custom runner with shareConnections off
4. BuildJet custom runner with shareConnections on

Each configuration runs 24 times per day (roughly once an hour) to gather variability data.

In [12]:
import re
from pathlib import Path

import pandas as pd
import plotly.express as px

## Data Loading

Load the simulation.csv files from all experiment runs and keep the request records that have a response time.
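The loader in the next cell expects one directory per experiment group, each holding one directory per test run with a `simulation.csv` inside. As a minimal sketch of that layout, the same files could be discovered with a single glob pattern. `find_simulation_files` is a hypothetical helper, and the temporary tree with its `run-a` and `run-b` names exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path


def find_simulation_files(base_path):
    """Yield (experiment_group, test_run, csv_path) for each run directory."""
    # Assumed layout: <base>/<experiment_group>/<test_run>/simulation.csv
    for csv_path in Path(base_path).glob("*/*/simulation.csv"):
        test_dir = csv_path.parent
        yield test_dir.parent.name, test_dir.name, csv_path


# Build a throwaway tree that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for group, run in [("ubuntu-24.04", "run-a"), ("ubuntu-24.04-shared", "run-b")]:
        run_dir = base / group / run
        run_dir.mkdir(parents=True)
        (run_dir / "simulation.csv").write_text("record_type,response_time_ms\n")

    # Sort the (group, run) pairs so the result does not depend on glob order.
    found = sorted((group, run) for group, run, _ in find_simulation_files(base))

print(found)
```

A flat glob like this avoids the nested `iterdir()` loops, at the cost of silently skipping directories that do not match the expected depth.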

In [13]:
def load_experiment_data():
    """Load all experiment data from simulation.csv files."""
    base_path = Path("../experiment-workflows-flat")
    all_data = []

    for config_dir in base_path.iterdir():
        if config_dir.is_dir():
            experiment_group = config_dir.name  # Use directory name as experiment group

            # Load data from each test run
            for test_dir in config_dir.iterdir():
                if test_dir.is_dir():
                    csv_file = test_dir / "simulation.csv"
                    if csv_file.exists():
                        try:
                            df = pd.read_csv(csv_file)
                            # Filter only request records with response times
                            request_data = df[
                                (df["record_type"] == "request") & (df["response_time_ms"].notna())
                            ].copy()

                            if not request_data.empty:
                                request_data["experiment_group"] = experiment_group
                                request_data["test_run"] = test_dir.name
                                all_data.append(request_data)
                        except Exception as e:
                            print(f"Error reading {csv_file}: {e}")

    if all_data:
        return pd.concat(all_data, ignore_index=True)
    else:
        return pd.DataFrame()


# Load the data
print("Loading experiment data from CSV files...")
data = load_experiment_data()
print(f"Loaded {len(data)} requests from {data['test_run'].nunique()} test runs")
print(f"Experiment groups: {data['experiment_group'].unique()}")
print(f"Columns: {list(data.columns)}")

Loading experiment data from CSV files...
Loaded 6400 requests from 64 test runs
Experiment groups: ['buildjet-2vcpu-ubuntu-2204' 'buildjet-2vcpu-ubuntu-2204-shared'
 'ubuntu-24.04-shared' 'ubuntu-24.04']
Columns: ['record_type', 'scenario_name', 'group_hierarchy', 'request_name', 'status', 'start_timestamp', 'end_timestamp', 'response_time_ms', 'error_message', 'event_type', 'duration_ms', 'cumulated_response_time_ms', 'is_incoming', 'experiment_group', 'test_run']
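
The four group names printed above appear to encode both the runner and the connection setting. As a minimal sketch, assuming that a trailing `-shared` marks the runs with shareConnections on (an assumption inferred from the directory names, not confirmed by the data itself), the names can be split into the two experiment factors. `RunnerConfig` and `parse_experiment_group` are hypothetical helpers and not part of the notebook.

```python
from typing import NamedTuple


class RunnerConfig(NamedTuple):
    runner: str
    share_connections: bool


def parse_experiment_group(name: str) -> RunnerConfig:
    """Split a group name into runner and shareConnections flag."""
    suffix = "-shared"  # assumption: this suffix means shareConnections is on
    if name.endswith(suffix):
        return RunnerConfig(runner=name[: -len(suffix)], share_connections=True)
    return RunnerConfig(runner=name, share_connections=False)


# Group names as they appear in the loaded data.
groups = [
    "buildjet-2vcpu-ubuntu-2204",
    "buildjet-2vcpu-ubuntu-2204-shared",
    "ubuntu-24.04-shared",
    "ubuntu-24.04",
]
configs = {group: parse_experiment_group(group) for group in groups}

for group, config in configs.items():
    print(f"{group}: runner={config.runner}, shareConnections={config.share_connections}")
```

With the two factors separated, the runner effect and the shareConnections effect can be compared independently, for example by adding both as columns to `data`.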


## Data Summary

Basic statistics about the loaded data.

In [14]:
# Filter successful requests only
successful_data = data[data["status"] == "OK"].copy()

print("Data Summary:")
print(f"Total requests: {len(data)}")
print(f"Successful requests: {len(successful_data)}")
print(f"Success rate: {len(successful_data) / len(data) * 100:.1f}%")
print("\nRequests per experiment group:")
print(successful_data["experiment_group"].value_counts())

# Basic statistics using response_time_ms column
print("\nResponse time statistics (ms):")
summary_stats = (
    successful_data.groupby("experiment_group")["response_time_ms"]
    .agg(["count", "mean", "median", "std", "min", "max"])
    .round(2)
)
print(summary_stats)

Data Summary:
Total requests: 6400
Successful requests: 6400
Success rate: 100.0%

Requests per experiment group:
experiment_group
buildjet-2vcpu-ubuntu-2204           1600
buildjet-2vcpu-ubuntu-2204-shared    1600
ubuntu-24.04-shared                  1600
ubuntu-24.04                         1600
Name: count, dtype: int64

Response time statistics (ms):
                                   count    mean  median      std    min  \
experiment_group                                                           
buildjet-2vcpu-ubuntu-2204          1600  487.91   136.0  2271.73   86.0   
buildjet-2vcpu-ubuntu-2204-shared   1600  484.58   141.0  2238.93   85.0   
ubuntu-24.04                        1600  559.69   184.0  3438.84  156.0   
ubuntu-24.04-shared                 1600  594.15   186.0  3518.87  156.0   

                                       max  
experiment_group                            
buildjet-2vcpu-ubuntu-2204         26981.0  
buildjet-2vcpu-ubuntu-2204-shared  31795.0  
ubuntu

## Box Plot Analysis

The box plot below follows the conventions described in [What is a Boxplot?](https://statisticsbyjim.com/graphs/box-plot) and visualizes the distribution and variability of response times for each experiment group.

The box plot shows:
* **Box**: Interquartile range (25th to 75th percentile)
* **Line in box**: Median (50th percentile)
* **Whiskers**: Extend to the most extreme data points that lie within 1.5 * IQR of the box edges
* **Points**: Outliers beyond the whiskers
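
The whisker rule above can be made concrete with a small worked example. This is a sketch using only the standard library, with made-up sample values. The quartile convention (`method="inclusive"`) is an assumption and may differ slightly from the one Plotly applies, so the fences it produces are illustrative rather than an exact reproduction of the plot.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which a point counts as an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Separate values inside the fences from the outliers beyond them."""
    lower, upper = tukey_fences(values, k)
    inside = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inside, outliers


# Hypothetical response times in ms: seven typical requests and one very slow one.
sample = [120, 125, 130, 135, 140, 145, 150, 26000]

lower, upper = tukey_fences(sample)
inside, outliers = split_outliers(sample)

print(f"fences: {lower} .. {upper}")
print(f"whiskers end at: {min(inside)} and {max(inside)}")
print(f"outliers: {outliers}")
```

The whiskers stop at the last data point inside the fences, not at the fences themselves, which is why a single slow request shows up as an isolated point.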

In [15]:
# Create box plot grouped by experiment groups
fig = px.box(
    successful_data,
    x="experiment_group",
    y="response_time_ms",
    title="Response Time Distribution by Experiment Group",
    labels={"response_time_ms": "Response Time (ms)", "experiment_group": "Experiment Group"},
    color="experiment_group",
)

# Customize the plot
fig.update_layout(height=600, showlegend=False, xaxis_tickangle=-45, template="plotly_white")

# Add annotations explaining the box plot components (no normal distribution assumption)
fig.add_annotation(
    x=0.02,
    y=0.98,
    xref="paper",
    yref="paper",
    text="Box Plot Components:<br>"
    + "• Box: Middle 50% of data (IQR)<br>"
    + "• Line: Median (50th percentile)<br>"
    + "• Whiskers: Data within 1.5×IQR<br>"
    + "• Points: Outliers beyond whiskers<br>"
    + "• No distribution assumptions needed",
    showarrow=False,
    bgcolor="rgba(255,255,255,0.8)",
    bordercolor="black",
    borderwidth=1,
    font=dict(size=10),
    align="left",
)

fig.show()

## Variability Analysis

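Calculate and compare variability metrics across experiment groups to see which setup gives the most consistent performance. The main metrics are robust statistics (median, IQR, MAD) that do not assume a normal distribution. The coefficient of variation (std / mean) is included for comparison, although it is sensitive to outliers.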
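
To illustrate why robust statistics matter for this data, the sketch below compares the standard deviation with IQR and MAD on two made-up samples that differ by one slow request. It uses only the standard library, and the sample values and the inclusive quartile convention are assumptions chosen for the example.

```python
import statistics


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


# Hypothetical response times in ms.
clean = [130, 132, 134, 136, 138, 140, 142, 144]
with_spike = clean[:-1] + [26000]  # same run, but the slowest request took 26 s

for name, values in [("clean", clean), ("with_spike", with_spike)]:
    print(
        f"{name}: std={statistics.stdev(values):.1f} "
        f"iqr={iqr(values):.1f} mad={mad(values):.1f}"
    )
```

A single slow request inflates the standard deviation by orders of magnitude while leaving IQR and MAD unchanged, which matches the pattern in the summary table above, where `std` is in the thousands of milliseconds but the medians sit below 200 ms.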

In [16]:
# Calculate variability metrics per test run using robust statistics
variability_per_run = (
    successful_data.groupby(["experiment_group", "test_run"])["response_time_ms"]
    .agg(["count", "median", "mean", "std", "min", "max"])
    .reset_index()
)


# Add robust variability measures
def calculate_iqr(x):
    return x.quantile(0.75) - x.quantile(0.25)


def calculate_mad(x):
    """Median Absolute Deviation - robust measure of variability"""
    return (x - x.median()).abs().median()


# Calculate additional robust metrics per test run
iqr_mad_stats = (
    successful_data.groupby(["experiment_group", "test_run"])["response_time_ms"]
    .agg(
        [
            ("iqr", calculate_iqr),
            ("mad", calculate_mad),
            ("q25", lambda x: x.quantile(0.25)),
            ("q75", lambda x: x.quantile(0.75)),
        ]
    )
    .reset_index()
)

# Merge the dataframes
variability_per_run = variability_per_run.merge(iqr_mad_stats, on=["experiment_group", "test_run"])

# Calculate coefficient of variation (CV) for each test run
variability_per_run["cv"] = variability_per_run["std"] / variability_per_run["mean"]

# Summary of variability across experiment groups using robust statistics
variability_summary = (
    variability_per_run.groupby("experiment_group")
    .agg(
        {
            "median": ["median", calculate_mad],  # Median of medians, MAD of medians
            "iqr": ["median", calculate_mad],  # Median IQR, MAD of IQRs
            "mad": ["median", calculate_mad],  # Median MAD, MAD of MADs
            "cv": ["median", calculate_mad],  # Median CV, MAD of CVs
        }
    )
    .round(4)
)

print("Robust Variability Summary (using median and MAD instead of mean and std):")
print(variability_summary)

Robust Variability Summary (using median and MAD instead of mean and std):
                                   median                   iqr                \
                                   median calculate_mad  median calculate_mad   
experiment_group                                                                
buildjet-2vcpu-ubuntu-2204         133.00         11.50  55.750        13.500   
buildjet-2vcpu-ubuntu-2204-shared  134.25         17.75  63.875        19.125   
ubuntu-24.04                       184.50          2.00  52.500         6.125   
ubuntu-24.04-shared                184.00          3.25  50.875         7.250   

                                     mad                    cv                
                                  median calculate_mad  median calculate_mad  
experiment_group                                                              
buildjet-2vcpu-ubuntu-2204         12.25          2.25  4.8125        0.2583  
buildjet-2vcpu-ubuntu-2204-shared  15.50 

In [17]:
# Box plot of IQR (Interquartile Range) across test runs - robust variability measure
fig_iqr = px.box(
    variability_per_run,
    x="experiment_group",
    y="iqr",
    title="Interquartile Range (IQR) by Experiment Group",
    labels={"iqr": "IQR (ms) - Robust Variability Measure", "experiment_group": "Experiment Group"},
    color="experiment_group",
)

fig_iqr.update_layout(height=500, showlegend=False, xaxis_tickangle=-45, template="plotly_white")

fig_iqr.add_annotation(
    x=0.02,
    y=0.98,
    xref="paper",
    yref="paper",
    text="Lower IQR = More Consistent Performance<br>IQR is robust to outliers",
    showarrow=False,
    bgcolor="rgba(255,255,255,0.8)",
    bordercolor="black",
    borderwidth=1,
    font=dict(size=12),
)

fig_iqr.show()

# Box plot of MAD (Median Absolute Deviation) - another robust variability measure
fig_mad = px.box(
    variability_per_run,
    x="experiment_group",
    y="mad",
    title="Median Absolute Deviation (MAD) by Experiment Group",
    labels={"mad": "MAD (ms) - Robust Variability Measure", "experiment_group": "Experiment Group"},
    color="experiment_group",
)

fig_mad.update_layout(height=500, showlegend=False, xaxis_tickangle=-45, template="plotly_white")

fig_mad.add_annotation(
    x=0.02,
    y=0.98,
    xref="paper",
    yref="paper",
    text="Lower MAD = More Consistent Performance<br>MAD is very robust to outliers",
    showarrow=False,
    bgcolor="rgba(255,255,255,0.8)",
    bordercolor="black",
    borderwidth=1,
    font=dict(size=12),
)

fig_mad.show()

## Time Series Analysis

Analyze how performance varies over time for each configuration.
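
The test run names in this dataset, such as `trackerexportertests-20250629121400652`, end in 17 digits. As a minimal sketch, assuming these are `YYYYMMDDHHMMSS` followed by three millisecond digits (an assumption, since the naming scheme is not documented here), the full timestamp can be recovered with the standard library. `parse_run_timestamp` is a hypothetical helper, not the function used in the next cell.

```python
import re
from datetime import datetime, timedelta

# 14 digits for date and time, optionally followed by 3 millisecond digits.
RUN_NAME = re.compile(r"-(\d{14})(\d{3})?$")


def parse_run_timestamp(test_run_name):
    """Return the datetime encoded in a test run name, or None if absent."""
    match = RUN_NAME.search(test_run_name)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    millis = int(match.group(2) or 0)
    return stamp + timedelta(milliseconds=millis)


print(parse_run_timestamp("trackerexportertests-20250629121400652"))
print(parse_run_timestamp("no-timestamp-here"))
```

Second-level precision is enough for hourly runs, so dropping the milliseconds, as the next cell does, should not change the plots.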

In [18]:
# Extract timestamp from test run names
def extract_timestamp(test_run_name):
    # Use the first 14 digits (YYYYMMDDHHMMSS) of names like
    # 'trackerexportertests-20250629121357615'; the trailing digits are ignored
    match = re.search(r"(\d{14})", test_run_name)
    if match:
        timestamp_str = match.group(1)
        return pd.to_datetime(timestamp_str, format="%Y%m%d%H%M%S")
    return None


variability_per_run["timestamp"] = variability_per_run["test_run"].apply(extract_timestamp)
variability_per_run = variability_per_run.dropna(subset=["timestamp"]).sort_values("timestamp")

# Time series plot of median response times (robust to outliers)
fig_ts = px.line(
    variability_per_run,
    x="timestamp",
    y="median",
    color="experiment_group",
    title="Median Response Time Over Time",
    labels={"median": "Median Response Time (ms)", "timestamp": "Time"},
)

fig_ts.update_layout(height=500, template="plotly_white")

fig_ts.show()

# Time series plot of IQR (robust variability measure)
fig_ts_iqr = px.line(
    variability_per_run,
    x="timestamp",
    y="iqr",
    color="experiment_group",
    title="IQR (Variability) Over Time",
    labels={"iqr": "IQR (ms) - Robust Variability", "timestamp": "Time"},
)

fig_ts_iqr.update_layout(height=500, template="plotly_white")

fig_ts_iqr.show()

## Interactive Analysis

Use the cells below to explore the data interactively. You can modify the filters and create custom visualizations.

In [None]:
# Interactive filter - modify these values to explore different subsets
selected_experiment_group = "ubuntu-24.04"  # Change this to focus on specific experiment group
max_response_time = 10000  # Filter out very slow requests

filtered_data = successful_data[
    (successful_data["experiment_group"] == selected_experiment_group)
    & (successful_data["response_time_ms"] <= max_response_time)
]

print(f"Filtered data: {len(filtered_data)} requests for {selected_experiment_group}")

# Calculate statistics separately for cleaner formatting
median_time = filtered_data["response_time_ms"].median()
q75 = filtered_data["response_time_ms"].quantile(0.75)
q25 = filtered_data["response_time_ms"].quantile(0.25)
iqr = q75 - q25
mad = (filtered_data["response_time_ms"] - median_time).abs().median()

print(f"Response time stats: median={median_time:.1f}ms, IQR={iqr:.1f}ms, MAD={mad:.1f}ms")

# Histogram of filtered data
fig_hist = px.histogram(
    filtered_data,
    x="response_time_ms",
    nbins=50,
    title=f"Response Time Distribution - {selected_experiment_group}",
    labels={"response_time_ms": "Response Time (ms)", "count": "Frequency"},
)

fig_hist.show()

In [20]:
# Export summary data for further analysis using robust statistics
summary_export = (
    successful_data.groupby(["experiment_group", "test_run"])["response_time_ms"]
    .agg(
        [
            "count",
            "median",
            "mean",
            "std",
            "min",
            "max",
            ("q25", lambda x: x.quantile(0.25)),
            ("q75", lambda x: x.quantile(0.75)),
            ("iqr", lambda x: x.quantile(0.75) - x.quantile(0.25)),
            ("mad", lambda x: (x - x.median()).abs().median()),
        ]
    )
    .reset_index()
)

summary_export.to_csv("experiment_summary_robust.csv", index=False)
print("Summary data with robust statistics exported to experiment_summary_robust.csv")
print(f"Shape: {summary_export.shape}")
summary_export.head()

Summary data with robust statistics exported to experiment_summary_robust.csv
Shape: (64, 12)


| | experiment_group | test_run | count | median | mean | std | min | max | q25 | q75 | iqr | mad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | buildjet-2vcpu-ubuntu-2204 | trackerexportertests-20250629121400652 | 100 | 136.5 | 504.54 | 2650.61065 | 119.0 | 26618.0 | 125.0 | 168.25 | 43.25 | 12.5 |
| 1 | buildjet-2vcpu-ubuntu-2204 | trackerexportertests-20250629131227003 | 100 | 128.0 | 519.01 | 2482.272206 | 116.0 | 24879.0 | 122.0 | 250.75 | 128.75 | 10.0 |
| 2 | buildjet-2vcpu-ubuntu-2204 | trackerexportertests-20250629140510325 | 100 | 106.0 | 449.35 | 1646.082484 | 87.0 | 16386.0 | 97.0 | 367.5 | 270.5 | 12.0 |
| 3 | buildjet-2vcpu-ubuntu-2204 | trackerexportertests-20250629150530534 | 100 | 128.5 | 499.22 | 2383.468221 | 114.0 | 23895.0 | 121.75 | 173.25 | 51.5 | 10.5 |
| 4 | buildjet-2vcpu-ubuntu-2204 | trackerexportertests-20250629160601653 | 100 | 104.5 | 449.88 | 1686.987814 | 86.0 | 16776.0 | 96.0 | 360.5 | 264.5 | 11.5 |
