RQ1: Is there an association between repository activity characteristics (number of commits and number of contributors) and the configured workload of performance tests (number of concurrent users and total number of requests) in non-trivial repositories?


We operationalize performance-testing intensity through two per-repository measures, each analysed separately:

median concurrent users (number_of_users)

median total requests (number_of_requests)

In [79]:
import re  # needed by extract_first_numeric below

import pandas as pd
import numpy as np
from scipy.stats import spearmanr


Load CSV files

In [None]:
DATA_DIR = "E2EGit"  # change this

repository = pd.read_csv(f"{DATA_DIR}/repository.csv")
non_trivial = pd.read_csv(f"{DATA_DIR}/non_trivial_repository.csv")

gui_repo = pd.read_csv(f"{DATA_DIR}/gui_testing_repo_details.csv")
gui_tests = pd.read_csv(f"{DATA_DIR}/gui_testing_test_details.csv")

perf_tests = pd.read_csv(f"{DATA_DIR}/performance_testing_test_details.csv")

repository.columns = repository.columns.str.strip()
non_trivial.columns = non_trivial.columns.str.strip()
perf_tests.columns = perf_tests.columns.str.strip()

# --- RENAME HERE (edit only the left side if yours differs) ---
repository = repository.rename(columns={
    "name": "repository_name",          # if your repo table uses "name"
    # "repo": "repository_name",        # uncomment if that's the name
})

non_trivial = non_trivial.rename(columns={
    "name": "repository_name",          # if non-trivial table uses "name"
    # "repo": "repository_name",
})

perf_tests = perf_tests.rename(columns={
    "repo": "repository_name",          # if perf table uses "repo"
    "project": "repository_name",       # or "project"
    # if it's already repository_name, it won't change
})


# sanity check
print("repository columns:", repository.columns.tolist())
print("non_trivial columns:", non_trivial.columns.tolist())
print("perf_tests columns:", perf_tests.columns.tolist())




repository columns: ['repository_name', 'is_fork', 'commits', 'branches', 'releases', 'forks', 'main_language', 'default_branch', 'licences', 'homepage', 'watchers', 'stargazers', 'contributors', 'size', 'created_at', 'pushed_at', 'updated_at', 'total_issues', 'open_issues', 'total_pull_requests', 'open_pull_requests', 'blank_lines', 'code_lines', 'comment_lines', 'metrics', 'last_commit', 'last_commit_sha', 'has_wiki', 'is_archived', 'is_disabled', 'is_locked', 'languages', 'labels', 'topics']
non_trivial columns: ['repository_name', 'is_web_java', 'is_web_python', 'is_web_javascript', 'is_web_typescript', 'web_dependencies']
perf_tests columns: ['repository_name', 'test_path', 'threadgroup_taskset_id', 'is_jmeter', 'is_locust', 'threadgroup_taskset_name', 'number_of_users', 'ramp_up', 'loop_count', 'duration', 'number_of_requests']


Restrict to non-trivial repositories with performance tests

In [81]:
perf_tests_nt = perf_tests.merge(
    non_trivial,
    on="repository_name",
    how="inner"
)

print("Non-trivial repos with perf tests:",
      perf_tests_nt["repository_name"].nunique())


Non-trivial repos with perf tests: 84
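
The inner merge above silently drops test rows whose repository is not in the non-trivial list, and it would duplicate test rows if that list contained a repeated name. A minimal sketch of a stricter variant follows, on made-up data. `validate` and `indicator` are standard `DataFrame.merge` parameters; the frames `tests` and `selected` and their values are hypothetical stand-ins, not the dataset.

```python
import pandas as pd

# Hypothetical stand-ins for perf_tests and non_trivial.
tests = pd.DataFrame({
    "repository_name": ["a/x", "a/x", "b/y", "c/z"],
    "number_of_users": ["5", "1", "400", "15"],
})
selected = pd.DataFrame({"repository_name": ["a/x", "b/y", "d/w"]})

# many_to_one raises MergeError if `selected` lists a repository twice;
# indicator=True adds a `_merge` column recording where each row came from.
merged = tests.merge(
    selected,
    on="repository_name",
    how="left",
    validate="many_to_one",
    indicator=True,
)

kept = merged[merged["_merge"] == "both"].drop(columns="_merge")
dropped = merged.loc[merged["_merge"] == "left_only", "repository_name"].unique()

print("repositories kept:", kept["repository_name"].nunique())
print("test rows kept:", len(kept))
print("repositories dropped:", list(dropped))
```

Keeping the dropped names around makes it easy to report how many repositories with performance tests were excluded as trivial.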


Extract numeric workload values from TEXT

In [82]:
def extract_first_numeric(x):
    if pd.isna(x):
        return np.nan
    m = re.search(r"\d+(\.\d+)?", str(x))
    return float(m.group()) if m else np.nan

perf_tests_nt["users_numeric"] = perf_tests_nt["number_of_users"].apply(extract_first_numeric)
perf_tests_nt["requests_numeric"] = perf_tests_nt["number_of_requests"].apply(extract_first_numeric)

perf_tests_nt[[
    "repository_name",
    "number_of_users", "users_numeric",
    "number_of_requests", "requests_numeric"
]].head(10)


,repository_name,number_of_users,users_numeric,number_of_requests,requests_numeric
0,apache/roller,5,5.0,10.0,10.0
1,nysenate/openlegislation,1,1.0,1.0,1.0
2,nysenate/openlegislation,1,1.0,1.0,1.0
3,nysenate/openlegislation,1,1.0,1.0,1.0
4,nysenate/openlegislation,1,1.0,1.0,1.0
5,nysenate/openlegislation,1,1.0,1.0,1.0
6,eclipse/jetty.project,400,400.0,5.0,5.0
7,eclipse/jetty.project,400,400.0,5.0,5.0
8,apereo/cas,15,15.0,0.0,0.0
9,apereo/cas,1,1.0,0.0,0.0
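
The first-number rule is simple, but it has edge cases worth checking before the values are trusted. The sketch below re-implements the same regular expression under the hypothetical name `first_number` and runs it on made-up inputs. The strings are illustrative only: they are not taken from the dataset, and whether such parameterised values occur in it is an assumption.

```python
import re

import numpy as np
import pandas as pd


def first_number(value):
    """Return the first unsigned number found in value, or NaN if there is none."""
    if pd.isna(value):
        return np.nan
    match = re.search(r"\d+(\.\d+)?", str(value))
    return float(match.group()) if match else np.nan


# Hypothetical raw values, chosen to exercise the edge cases.
samples = pd.Series([
    "400",               # plain integer
    "${__P(users,50)}",  # property with a default: the default is picked up
    "${THREADS}",        # variable with no digits: becomes NaN
    None,                # missing value: stays NaN
    "1.5",               # decimal
    "1,000",             # thousands separator: only the leading 1 is kept
])

parsed = samples.apply(first_number)
print(parsed.tolist())
```

Values that depend on a variable therefore end up either missing or replaced by whatever number appears first in the expression, which may help explain part of the missing share in `median_users`.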


Aggregate workload per repository

In [83]:
perf_agg = (
    perf_tests_nt
    .groupby("repository_name", as_index=False)
    .agg(
        median_users=("users_numeric", "median"),
        median_requests=("requests_numeric", "median"),
        num_perf_tests=("test_path", "count")
    )
)

perf_agg.head()
len(perf_agg)


84

Merge with repository activity data

In [None]:
analysis_df = perf_agg.merge(
    repository,
    on="repository_name",
    how="inner"
)

analysis_df.head()


,repository_name,median_users,median_requests,num_perf_tests,is_fork,commits,branches,releases,forks,main_language,...,metrics,last_commit,last_commit_sha,has_wiki,is_archived,is_disabled,is_locked,languages,labels,topics
0,52north/sos,1.0,4.0,7,0.0,6848.0,26.0,89.0,82.0,Java,...,"language:Freemarker Template, commentLines:9, ...",2023-05-22T05:41:25,c7256ab39f4853bbd7786c5fbb19cbef5da3d8b4,0.0,0.0,0.0,0.0,Java; JavaScript; PostScript; CSS; HTML; PLpgS...,4.3.10; 4.x; 5.x; aqd e-reporting; bug; depend...,aqd; ereporting; hydrology; inspire; observati...
1,HumanSignal/label-studio,,2.0,5,0.0,3243.0,580.0,69.0,1994.0,JavaScript,...,"language:make, commentLines:21, codeLines:38, ...",2024-03-28T03:22:17,0b8e98dab311be81d420ab6a755cba69a006b2f5,0.0,0.0,0.0,0.0,JavaScript; Python; TypeScript; Stylus; HTML; ...,actions-update; audio; backend; blog; bot; bou...,annotation; annotation-tool; annotations; boun...
2,abpframework/abp,2000.0,5.0,2,0.0,34626.0,93.0,196.0,3294.0,C#,...,"language:Text, commentLines:0, codeLines:48, b...",2024-03-29T11:04:02,b2b7751c91bad5d2057bd8960bf6a3a3f8784f1e,0.0,0.0,0.0,0.0,C#; HTML; TypeScript; JavaScript; CSS; PowerSh...,.net; abp-cli; abp-community; abp-framework; a...,abp; angular; architecture; aspnet; aspnet-cor...
3,adorsys/xs2a,50.0,5.0,3,0.0,7808.0,171.0,176.0,61.0,Java,...,"language:DOS Batch, commentLines:0, codeLines:...",2024-02-22T11:21:11,7ce8fe270c0e540ee4b2a4f15167e66401c69516,0.0,0.0,0.0,0.0,Java; CSS; HTML; Shell; Dockerfile; Makefile;,bug; dependencies; duplicate; enhancement; goo...,berlin-group; nextgenpsd2; psd; psd2; psd2-xs2...
4,airsonic-advanced/airsonic-advanced,1.0,4.0,1,0.0,3154.0,7.0,382.0,99.0,JavaScript,...,,2023-03-30T05:48:14,9b43fa47566aa37a3aacdf2255cf9c97cd3b0bd9,1.0,0.0,0.0,0.0,JavaScript; Java; CSS; SCSS; Shell; Dockerfile...,breaking change; bug; dependencies; documentat...,


Compute project age (days)

In [None]:
analysis_df["created_at"] = pd.to_datetime(
    analysis_df["created_at"],
    errors="coerce",
    utc=True
)

analysis_df = analysis_df.dropna(subset=["created_at"])

now_utc = pd.Timestamp.now(tz="UTC")

analysis_df["project_age_days"] = (
    (now_utc - analysis_df["created_at"])
    .dt.total_seconds() / 86400
).astype(float)

analysis_df[[
    "repository_name",
    "commits",
    "project_age_days"
]].head()


,repository_name,commits,contributors
0,52north/sos,6848.0,15.0
1,HumanSignal/label-studio,3243.0,109.0
2,abpframework/abp,34626.0,326.0
3,adorsys/xs2a,7808.0,45.0
4,airsonic-advanced/airsonic-advanced,3154.0,88.0
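
Because the age is measured against the current time, its values change on every run. The ranks that Spearman's test uses do not change, since every repository ages by the same amount, but a fixed reference date keeps the reported day counts reproducible. A minimal sketch on made-up timestamps follows; the `snapshot` date is a hypothetical placeholder, not the dataset's actual collection date.

```python
import pandas as pd

# Made-up creation timestamps; the last entry is deliberately unparseable.
created = pd.to_datetime(
    pd.Series(["2012-03-01T10:00:00Z", "2019-07-15T08:30:00Z", "not a date"]),
    errors="coerce",  # unparseable strings become NaT instead of raising
    utc=True,
)

# Hypothetical fixed reference date, used instead of pd.Timestamp.now().
snapshot = pd.Timestamp("2024-04-01", tz="UTC")

# Age in days; rows with NaT propagate to NaN.
age_days = (snapshot - created).dt.total_seconds() / 86400

print(age_days)
```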


Check the share of missing workload values

In [86]:
print(analysis_df[["median_users", "median_requests"]].isna().mean())


median_users       0.202381
median_requests    0.023810
dtype: float64
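
If all 84 repositories are still present at this point, these shares correspond to 17 repositories without `median_users` and 2 without `median_requests`. Since each correlation drops incomplete pairs, the sample size differs per test. The sketch below counts the usable rows per pair on a toy frame; `toy`, `usable_rows`, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for analysis_df; the values are made up.
toy = pd.DataFrame({
    "commits": [10, 20, 30, 40, 50],
    "median_users": [1.0, np.nan, 5.0, np.nan, 2.0],
    "median_requests": [4.0, 2.0, np.nan, 1.0, 3.0],
})


def usable_rows(df, x, y):
    """Number of rows where both x and y are present."""
    return int(df[[x, y]].dropna().shape[0])


pair_counts = {
    workload: usable_rows(toy, "commits", workload)
    for workload in ["median_users", "median_requests"]
}
print(pair_counts)
```

Reporting these counts next to each coefficient shows how many repositories each test actually rests on.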


Spearman correlations (MAIN RESULTS)

In [87]:
# commits vs users
tmp = analysis_df[["commits", "median_users"]].dropna()
rho_users, p_users = spearmanr(tmp["commits"], tmp["median_users"])

# commits vs requests
tmp = analysis_df[["commits", "median_requests"]].dropna()
rho_req, p_req = spearmanr(tmp["commits"], tmp["median_requests"])

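# Only a cell's last expression is echoed, so this tuple is not displayed here;
# the commit results are reported in the results table.
rho_users, p_users, rho_req, p_req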

# age vs users
tmp = analysis_df[["project_age_days", "median_users"]].dropna()
rho_age_users, p_age_users = spearmanr(tmp["project_age_days"], tmp["median_users"])

# age vs requests
tmp = analysis_df[["project_age_days", "median_requests"]].dropna()
rho_age_req, p_age_req = spearmanr(tmp["project_age_days"], tmp["median_requests"])

rho_age_users, p_age_users, rho_age_req, p_age_req


(0.017205194803947176,
 0.8900898699661464,
 0.07349509872919736,
 0.5116969902207373)

Compact results table (for report)

In [None]:
results = pd.DataFrame([
    {"activity": "commits", "workload": "users", "rho": rho_users, "p": p_users},
    {"activity": "commits", "workload": "requests", "rho": rho_req, "p": p_req},
    {"activity": "age", "workload": "users", "rho": rho_age_users, "p": p_age_users},
    {"activity": "age", "workload": "requests", "rho": rho_age_req, "p": p_age_req},
])

results


,activity,workload,rho,p
0,commits,users,0.012203,0.921923
1,commits,requests,0.029952,0.789373
2,age,users,0.017205,0.89009
3,age,requests,0.073495,0.511697
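
RQ1 also names the number of contributors, which the merged frame exposes as the `contributors` column, while the table above covers commits and age. A sketch of a small helper that produces one table row for any activity/workload pair and records the sample size is shown below. The helper name `spearman_row` and the toy values are hypothetical; applying it to `analysis_df` is left as an assumption about how the analysis could be extended.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_row(df, activity, workload):
    """Spearman correlation on complete pairs, returned as one table row."""
    pair = df[[activity, workload]].dropna()
    rho, p = spearmanr(pair[activity], pair[workload])
    return {
        "activity": activity,
        "workload": workload,
        "n": len(pair),
        "rho": float(rho),
        "p": float(p),
    }


# Toy data with perfectly monotone relations, made up for illustration.
toy = pd.DataFrame({
    "contributors": [1, 2, 3, 4, 5, 6],
    "median_users": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "median_requests": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
})

table = pd.DataFrame([
    spearman_row(toy, "contributors", workload)
    for workload in ["median_users", "median_requests"]
])
print(table)
```

On the real data the call would look like `spearman_row(analysis_df, "contributors", "median_users")`, with no result implied here.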
