# 🧪 04: Full Reproduction Pipeline

This notebook runs the **entire MSR 2026 Challenge Track artifact pipeline**, covering three research questions:

1. RQ1: Testing behavior of AI coding agents
2. RQ2: Review dynamics & resolution difficulty
3. RQ3: Early acceptance signals

All outputs (figures + CSV tables) are automatically stored under:

```
output/figures/
output/tables/
```

Run all cells to fully reproduce the results presented in the paper.

In [1]:
import sys, os
sys.path.append("../src")

from src.msr2026.rq1.run_rq1 import run_rq1
from src.msr2026.rq2.run_rq2 import run_rq2
from src.msr2026.rq3.run_rq3 import run_rq3

import pandas as pd
from IPython.display import display

## ▶️ Run RQ1: Testing Behavior

In [2]:
print("=== Running RQ1 ===")
run_rq1()
print("RQ1 completed.\n")

=== Running RQ1 ===

Loading RQ1 data from HuggingFace...
Extracting test file indicators...
Merging PR-level data...


  pr["contains_test"] = pr["contains_test"].fillna(False)


Computing agent-level metrics...
Saving CSV outputs...
✔ CSV tables saved to: ../output/tables/RQ1
✔ Behavior matrix saved.
✔ Task-type matrix saved.
✔ RQ1 completed - Figures saved to: ../output/figures/RQ1
RQ1 completed.



## 📊 Load RQ1 Tables

In [3]:
import os
RQ1_TABLE_DIR = "../output/tables/RQ1/"



for f in os.listdir(RQ1_TABLE_DIR):
    if f.endswith(".csv"):
        print(f"=== {f} ===")
        display(pd.read_csv(os.path.join(RQ1_TABLE_DIR, f)))
        print()


=== rq1_average_test_files.csv ===


| Unnamed: 0 | agent | avg_test_file_count |
|---|---|---|
| 0 | Claude_Code | 0.535137 |
| 1 | Copilot | 0.629036 |
| 2 | Cursor | 0.087034 |
| 3 | Devin | 0.538764 |
| 4 | OpenAI_Codex | 0.178592 |



=== rq1_behavior_matrix.csv ===


| Unnamed: 0 | agent | test_inclusion_rate | avg_test_file_count | conditional_avg_test_file_count |
|---|---|---|---|---|
| 0 | Claude_Code | 0.03796 | 0.535137 | 14.097436 |
| 1 | Copilot | 0.036771 | 0.629036 | 17.106739 |
| 2 | Cursor | 0.009988 | 0.087034 | 8.714286 |
| 3 | Devin | 0.047337 | 0.538764 | 11.381392 |
| 4 | OpenAI_Codex | 0.011624 | 0.178592 | 15.364068 |
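The three metric columns in the behavior matrix appear to be linked. If `avg_test_file_count` is the mean over all PRs, `conditional_avg_test_file_count` is the mean over only the PRs that contain tests, and PRs without tests contribute zero test files, then the first should equal the inclusion rate multiplied by the second. The sketch below checks that reading against the values displayed above. The metric definitions are an assumption inferred from the column names, not taken from the pipeline source.

```python
import pandas as pd

# Values copied from rq1_behavior_matrix.csv as displayed above.
behavior = pd.DataFrame(
    {
        "agent": ["Claude_Code", "Copilot", "Cursor", "Devin", "OpenAI_Codex"],
        "test_inclusion_rate": [0.03796, 0.036771, 0.009988, 0.047337, 0.011624],
        "avg_test_file_count": [0.535137, 0.629036, 0.087034, 0.538764, 0.178592],
        "conditional_avg_test_file_count": [
            14.097436, 17.106739, 8.714286, 11.381392, 15.364068,
        ],
    }
)

# Assumed identity: overall mean = P(PR has tests) * mean count given tests.
reconstructed = (
    behavior["test_inclusion_rate"] * behavior["conditional_avg_test_file_count"]
)
max_abs_error = (reconstructed - behavior["avg_test_file_count"]).abs().max()
print(f"largest absolute deviation: {max_abs_error:.2e}")
```

A deviation on the order of the rounding in the displayed values would be consistent with this interpretation.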



=== rq1_conditional_test_files.csv ===


| Unnamed: 0 | agent | conditional_avg_test_file_count |
|---|---|---|
| 0 | Claude_Code | 14.097436 |
| 1 | Copilot | 17.106739 |
| 2 | Cursor | 8.714286 |
| 3 | Devin | 11.381392 |
| 4 | OpenAI_Codex | 15.364068 |



=== rq1_task_type_matrix.csv ===


| Unnamed: 0 | agent | chore | build | ci | docs |
|---|---|---|---|---|---|
| 0 | Claude_Code | 0.571429 | 0.125 | 0.0 | 0.125 |
| 1 | Copilot | 0.357143 | 0.231405 | 0.149254 | 0.091703 |
| 2 | Cursor | 0.2 | 0.238095 | 0.0625 | 0.028986 |
| 3 | Devin | 0.174377 | 0.09375 | 0.186441 | 0.066129 |
| 4 | OpenAI_Codex | 0.328605 | 0.091837 | 0.056818 | 0.113619 |



=== rq1_test_inclusion.csv ===


| Unnamed: 0 | agent | test_inclusion_rate |
|---|---|---|
| 0 | Claude_Code | 0.03796 |
| 1 | Copilot | 0.036771 |
| 2 | Cursor | 0.009988 |
| 3 | Devin | 0.047337 |
| 4 | OpenAI_Codex | 0.011624 |
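Every table above carries an `Unnamed: 0` column, which is what `pd.read_csv` produces when a CSV was written with its index and read back without `index_col`. A minimal sketch of an alternative loader follows. The helper name `load_tables` is hypothetical, and it assumes the first column of each file is a saved index. It also sorts the file names, since `os.listdir` returns entries in arbitrary order.

```python
from pathlib import Path

import pandas as pd


def load_tables(table_dir):
    """Return {file name: DataFrame} for every CSV in table_dir, in sorted order."""
    tables = {}
    for csv_path in sorted(Path(table_dir).glob("*.csv")):
        # index_col=0 treats the first column as the index; this assumes the
        # tables were written with the DataFrame index included.
        tables[csv_path.name] = pd.read_csv(csv_path, index_col=0)
    return tables
```

With this helper, `load_tables("../output/tables/RQ1")` would return the five RQ1 tables keyed by file name, ready to pass to `display`.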





## ▶️ Run RQ2: Review Dynamics

In [4]:
print("=== Running RQ2 ===")
run_rq2()
print("RQ2 completed.\n")

=== Running RQ2 ===

Loading RQ2 data from HuggingFace...

=== Cleaning Stage 1 ===
After Stage 1: 21,753 comments

=== Cleaning Stage 2 ===
After Stage 2: 18,551 comments

=== Classifying comments ===


  df["short_body"].str.contains(pattern, case=False, regex=True, na=False),



Saving CSV outputs...
✔ Comment distribution tables saved.

=== Generating Figure 1b - Normalized Comment Distribution ===
✔ Resolution matrix saved.

=== Generating Figure 2 - Resolution Heatmap ===

✔ RQ2 completed - Figures saved to: ../output/figures/RQ2
✔ RQ2 CSV tables saved to: ../output/tables/RQ2
RQ2 completed.
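The classification step's log echoes calls of the form `df["short_body"].str.contains(pattern, case=False, regex=True, na=False)`, which suggests that comments are assigned to categories by keyword regexes. The sketch below shows one way such a classifier could look. The category names come from the RQ2 tables, but every pattern here is hypothetical, as are the function name and the `comment_type` column. The patterns use non-capturing groups because pandas emits a warning when a `str.contains` pattern has capturing groups.

```python
import numpy as np
import pandas as pd

# HYPOTHETICAL keyword patterns, for illustration only; the pipeline's real
# patterns are not shown in this notebook. Dict order sets match priority.
PATTERNS = {
    "security": r"\b(?:vulnerab\w*|injection|secret|credential)\b",
    "testing": r"\b(?:tests?|coverage|assert\w*)\b",
    "documentation": r"\b(?:docs?|docstring|readme|comment)\b",
    "style": r"\b(?:naming|format\w*|lint\w*|typo)\b",
    "correctness": r"\b(?:bug|incorrect|wrong|fails?|error)\b",
}


def classify_comments(df, text_col="short_body"):
    """Label each comment with the first category whose pattern matches."""
    conditions = [
        df[text_col].str.contains(pattern, case=False, regex=True, na=False)
        for pattern in PATTERNS.values()
    ]
    # np.select takes the first true condition per row; rows with no match
    # (including missing text) fall back to "other".
    labels = np.select(conditions, list(PATTERNS.keys()), default="other")
    return df.assign(comment_type=labels)
```

Because the first matching category wins, the ordering of `PATTERNS` is itself a design decision: a comment mentioning both a test and a bug would be labeled `testing` under this ordering.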



## 📊 Load RQ2 Tables

In [6]:
RQ2_TABLE_DIR = "../output/tables/RQ2/"

for f in os.listdir(RQ2_TABLE_DIR):
    if f.endswith(".csv"):
        print(f"=== {f} ===")
        display(pd.read_csv(os.path.join(RQ2_TABLE_DIR, f)))
        print()

=== rq2_resolution_matrix.csv ===


| Unnamed: 0 | agent | correctness | documentation | other | security | style | testing |
|---|---|---|---|---|---|---|---|
| 0 | Claude_Code | 0.8 | 0.583333 | 0.834983 | 1.0 | 1.0 | 0.833333 |
| 1 | Copilot | 0.977968 | 0.981375 | 0.980065 | 1.0 | 0.991597 | 0.976419 |
| 2 | Cursor | 0.907407 | 0.88 | 0.902128 | 0.5 | 0.857143 | 0.804878 |
| 3 | Devin | 0.946809 | 0.947761 | 0.965295 | 0.8 | 0.909091 | 0.985075 |
| 4 | OpenAI_Codex | 0.730435 | 0.741379 | 0.789432 | 1.0 | 0.688525 | 0.791667 |



=== rq2_type_counts_filtered.csv ===


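| Unnamed: 0 | agent | correctness | documentation | security | style | testing |
|---|---|---|---|---|---|---|
| 0 | Claude_Code | 20 | 12 | 4 | 2 | 12 |
| 1 | Copilot | 817 | 698 | 43 | 238 | 1145 |
| 2 | Cursor | 54 | 25 | 2 | 63 | 41 |
| 3 | Devin | 188 | 134 | 10 | 198 | 134 |
| 4 | OpenAI_Codex | 115 | 116 | 6 | 61 | 96 |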



=== rq2_type_counts_raw.csv ===


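| Unnamed: 0 | agent | correctness | documentation | other | security | style | testing |
|---|---|---|---|---|---|---|---|
| 0 | Claude_Code | 20 | 12 | 303 | 4 | 2 | 12 |
| 1 | Copilot | 817 | 698 | 9481 | 43 | 238 | 1145 |
| 2 | Cursor | 54 | 25 | 470 | 2 | 63 | 41 |
| 3 | Devin | 188 | 134 | 2795 | 10 | 198 | 134 |
| 4 | OpenAI_Codex | 115 | 116 | 1268 | 6 | 61 | 96 |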



=== rq2_type_distribution_pct.csv ===


| Unnamed: 0 | agent | correctness | documentation | security | style | testing |
|---|---|---|---|---|---|---|
| 0 | Claude_Code | 0.4 | 0.24 | 0.08 | 0.04 | 0.24 |
| 1 | Copilot | 0.277797 | 0.237334 | 0.014621 | 0.080925 | 0.389323 |
| 2 | Cursor | 0.291892 | 0.135135 | 0.010811 | 0.340541 | 0.221622 |
| 3 | Devin | 0.283133 | 0.201807 | 0.01506 | 0.298193 | 0.201807 |
| 4 | OpenAI_Codex | 0.291878 | 0.294416 | 0.015228 | 0.154822 | 0.243655 |
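The distribution table looks like a row-wise normalization of the filtered counts: for Claude_Code, 20 of 50 filtered comments gives 0.4 for correctness. Despite the `pct` in the file name, the values are fractions that sum to 1 per agent, not percentages. The sketch below reproduces that step from the counts displayed above. That the pipeline computes it this way is an assumption based on the numbers matching.

```python
import pandas as pd

# Filtered counts copied from rq2_type_counts_filtered.csv as displayed above.
counts = pd.DataFrame(
    {
        "correctness": [20, 817, 54, 188, 115],
        "documentation": [12, 698, 25, 134, 116],
        "security": [4, 43, 2, 10, 6],
        "style": [2, 238, 63, 198, 61],
        "testing": [12, 1145, 41, 134, 96],
    },
    index=["Claude_Code", "Copilot", "Cursor", "Devin", "OpenAI_Codex"],
)

# Divide each row by its own total so every agent's shares sum to 1.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(6))
```

Normalizing per agent makes the comment-type mix comparable across agents whose comment volumes differ by more than an order of magnitude, such as Claude_Code and Copilot.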





## ▶️ Run RQ3: Early Acceptance Signals

In [None]:
print("=== Running RQ3 ===")
run_rq3()
print("RQ3 completed.\n")

## 📊 Load RQ3 Tables

In [7]:
RQ3_TABLE_DIR = "../output/tables/RQ3/"

for f in os.listdir(RQ3_TABLE_DIR):
    if f.endswith(".csv"):
        print(f"=== {f} ===")
        display(pd.read_csv(os.path.join(RQ3_TABLE_DIR, f)))
        print()

=== rq3_features_clipped.csv ===


| Unnamed: 0 | pr_id | agent | accepted | desc_length | churn | files_changed | is_test |
|---|---|---|---|---|---|---|---|
| 0 | 3264933329 | Claude_Code | 0 | 1928.0 | 396.0 | 3.0 | 1.0 |
| 1 | 3265118634 | Claude_Code | 1 | 649.0 | 76.0 | 11.0 | 0.0 |
| 2 | 3265640341 | Claude_Code | 1 | 4516.0 | 407.0 | 5.0 | 0.0 |
| 3 | 3265709660 | Claude_Code | 1 | 2222.0 | 300.0 | 15.0 | 0.0 |
| 4 | 3265782173 | Claude_Code | 0 | 327.0 | 221.0 | 21.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 33591 | 2857942945 | Devin | 1 | 814.0 | 85.0 | 2.0 | 0.0 |
| 33592 | 2857959763 | Devin | 0 | 352.0 | 47.0 | 1.0 | 0.0 |
| 33593 | 2858280902 | Devin | 0 | 845.0 | 14.0 | 1.0 | 0.0 |
| 33594 | 2858429985 | Devin | 1 | 735.0 | 636.0 | 23.0 | 0.0 |



=== rq3_features_raw.csv ===


| Unnamed: 0 | pr_id | agent | accepted | desc_length | churn | files_changed | is_test |
|---|---|---|---|---|---|---|---|
| 0 | 3264933329 | Claude_Code | 0 | 1928 | 396.0 | 3.0 | 1.0 |
| 1 | 3265118634 | Claude_Code | 1 | 649 | 76.0 | 11.0 | 0.0 |
| 2 | 3265640341 | Claude_Code | 1 | 4516 | 407.0 | 5.0 | 0.0 |
| 3 | 3265709660 | Claude_Code | 1 | 2222 | 300.0 | 15.0 | 0.0 |
| 4 | 3265782173 | Claude_Code | 0 | 327 | 221.0 | 21.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 33591 | 2857942945 | Devin | 1 | 814 | 85.0 | 2.0 | 0.0 |
| 33592 | 2857959763 | Devin | 0 | 352 | 47.0 | 1.0 | 0.0 |
| 33593 | 2858280902 | Devin | 0 | 845 | 14.0 | 1.0 | 0.0 |
| 33594 | 2858429985 | Devin | 1 | 735 | 636.0 | 23.0 | 0.0 |



=== rq3_mannwhitney_tests.csv ===


| Unnamed: 0 | feature | U_value | p_value |
|---|---|---|---|
| 0 | desc_length | 152014666.0 | 0.0 |
| 1 | churn | 135434546.0 | 2.763219e-142 |
| 2 | files_changed | 126548770.5 | 1.666592e-47 |
| 3 | is_test | 112556705.0 | 0.0002415459 |
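For readers who want to recompute the test table from the feature files, a minimal sketch using `scipy.stats.mannwhitneyu` follows. It assumes a two-sided test comparing rejected (`accepted == 0`) against accepted (`accepted == 1`) PRs, with the rejected group passed first. The pipeline's actual group order and `alternative` setting are not shown in this notebook, and the function name `compare_groups` is hypothetical.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def compare_groups(features, feature_cols, label_col="accepted"):
    """Two-sided Mann-Whitney U test per feature: rejected (0) vs accepted (1)."""
    rejected = features[features[label_col] == 0]
    accepted = features[features[label_col] == 1]
    rows = []
    for col in feature_cols:
        # Assumed group order: rejected first, accepted second.
        result = mannwhitneyu(
            rejected[col].dropna(), accepted[col].dropna(), alternative="two-sided"
        )
        rows.append(
            {"feature": col, "U_value": result.statistic, "p_value": result.pvalue}
        )
    return pd.DataFrame(rows)
```

A `p_value` displayed as `0.0`, as for `desc_length` above, most likely means the value fell below what a float can represent rather than being exactly zero.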



=== rq3_summary_table.csv ===


| Unnamed: 0.1 | Unnamed: 0 | Reject_median | Reject_IQR | Accept_median | Accept_IQR |
|---|---|---|---|---|---|
| 0 | desc_length | 724.0 | 1866.75 | 353.0 | 314.0 |
| 1 | churn | 160.0 | 601.0 | 79.0 | 247.0 |
| 2 | files_changed | 4.0 | 8.0 | 3.0 | 7.0 |
| 3 | is_test | 0.0 | 1.0 | 0.0 | 1.0 |
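The summary table reports a median and an interquartile range (IQR, the 75th percentile minus the 25th) for each feature, split by outcome. A sketch of how such a table could be derived from the feature files is shown below. The function name is hypothetical, and it assumes `accepted == 0` marks rejected PRs and that pandas' default linear-interpolation quantiles are acceptable.

```python
import pandas as pd


def summarize_by_outcome(features, feature_cols, label_col="accepted"):
    """Median and interquartile range per feature, split by outcome."""
    summary = {}
    for name, value in (("Reject", 0), ("Accept", 1)):
        group = features.loc[features[label_col] == value, feature_cols]
        q1, q3 = group.quantile(0.25), group.quantile(0.75)
        summary[f"{name}_median"] = group.median()
        summary[f"{name}_IQR"] = q3 - q1
    # Rows are features; columns follow the layout of rq3_summary_table.csv.
    return pd.DataFrame(summary)
```

Median and IQR are a natural pairing with the rank-based Mann-Whitney test, since both are robust to the heavy right tails typical of size features such as `churn`.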





## 🎉 Pipeline Finished
All RQ results, tables, and figures have been reproduced successfully.

You may continue exploring the analysis or reviewing individual RQ notebooks.
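As a starting point for that exploration, the sketch below inventories what the pipeline wrote. It assumes the layout implied by the log messages, `../output/figures/RQ*` and `../output/tables/RQ*` relative to the notebook. The helper name `list_artifacts` is hypothetical.

```python
from pathlib import Path


def list_artifacts(output_root="../output"):
    """Map each RQ folder under figures/ and tables/ to its sorted file names."""
    inventory = {}
    for kind in ("figures", "tables"):
        for rq_dir in sorted((Path(output_root) / kind).glob("RQ*")):
            if rq_dir.is_dir():
                inventory[f"{kind}/{rq_dir.name}"] = sorted(
                    p.name for p in rq_dir.iterdir() if p.is_file()
                )
    return inventory
```

Calling `list_artifacts()` after a full run gives a quick way to confirm that each research question produced both figures and tables.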