# AIDEV-Pop Dataset: Merge PR Data with Task Classifications

This notebook merges each agent's `prs.csv` with its `gpt_conventional_commits.csv`, joining on the PR `id`, for all agents in the AIDEV-Pop dataset.

## 1. Import Required Libraries

In [11]:
import pandas as pd
import os
from pathlib import Path

## 2. Define Data Paths and Agent List

In [12]:
# Define base path to the AIDEV-Pop dataset
base_path = Path("AIDev/aidev-pop")

# List all agent folders
agents = [d.name for d in base_path.iterdir() if d.is_dir()]
print(f"Found {len(agents)} agents: {agents}")

Found 5 agents: ['Devin', 'Copilot', 'Cursor', 'OpenAI_Codex', 'Claude_Code']
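
`Path.iterdir()` yields entries in arbitrary order, and a folder is only useful here if it holds both CSV files. The sketch below is one possible way to get a sorted list of complete agent folders; `find_complete_agents` and `REQUIRED_FILES` are hypothetical names introduced for illustration, not part of the dataset tooling.

```python
from pathlib import Path

# File names taken from this notebook; the constant name is hypothetical.
REQUIRED_FILES = ("prs.csv", "gpt_conventional_commits.csv")


def find_complete_agents(base_path: Path) -> list[str]:
    """Return sorted names of sub-folders that contain every required file."""
    return sorted(
        d.name
        for d in base_path.iterdir()
        if d.is_dir() and all((d / name).exists() for name in REQUIRED_FILES)
    )
```

Calling `find_complete_agents(base_path)` in place of the list comprehension would skip incomplete folders up front rather than at merge time.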


## 3. Load and Examine Data from Devin (Test Case)

In [13]:
# Load Devin's prs.csv
devin_prs = pd.read_csv(base_path / "Devin" / "prs.csv")
print(f"Devin PRs shape: {devin_prs.shape}")
devin_prs.head()

Devin PRs shape: (4829, 12)


Unnamed: 0,id,number,title,user,user_id,state,created_at,closed_at,merged_at,repo_url,html_url,body
0,3277367174,3243,Fix #3242: Add reasoning parameter to LLM clas...,devin-ai-integration[bot],158243242,open,2025-07-30T14:39:48Z,,,https://api.github.com/repos/crewAIInc/crewAI,https://github.com/crewAIInc/crewAI/pull/3243,# Fix #3242: Add reasoning parameter to LLM cl...
1,3277422683,3730,docs: highlight Neon Local Connect option on c...,devin-ai-integration[bot],158243242,open,2025-07-30T14:52:10Z,,,https://api.github.com/repos/neondatabase/website,https://github.com/neondatabase/website/pull/3730,# docs: highlight Neon Local Connect option on...
2,3277429161,1895,feat(crawler): replace robotstxt library with ...,devin-ai-integration[bot],158243242,open,2025-07-30T14:53:47Z,,,https://api.github.com/repos/mendableai/firecrawl,https://github.com/mendableai/firecrawl/pull/1895,# feat(crawler): replace robotstxt library wit...
3,3277452738,2557,Add documentation for resolveUsers 50-user bat...,devin-ai-integration[bot],158243242,closed,2025-07-30T15:00:13Z,2025-07-30T18:15:17Z,2025-07-30T18:15:17Z,https://api.github.com/repos/liveblocks/livebl...,https://github.com/liveblocks/liveblocks/pull/...,# Add documentation for resolveUsers 50-user b...
4,3277544065,2911,fix: resolve Fuel deployment transaction size ...,devin-ai-integration[bot],158243242,open,2025-07-30T15:28:54Z,,,https://api.github.com/repos/pyth-network/pyth...,https://github.com/pyth-network/pyth-crosschai...,"## Summary\n\nFixed the ""transaction size limi..."


In [14]:
# Load Devin's gpt_conventional_commits.csv
devin_commits = pd.read_csv(base_path / "Devin" / "gpt_conventional_commits.csv")
print(f"Devin Commits shape: {devin_commits.shape}")
devin_commits.head()

Devin Commits shape: (4827, 6)


Unnamed: 0,agent,id,title,reason,type,confidence
0,Devin,3277367174,Fix #3242: Add reasoning parameter to LLM clas...,title provides conventional commit label,fix,10
1,Devin,3277422683,docs: highlight Neon Local Connect option on c...,title provides conventional commit label,docs,10
2,Devin,3277429161,feat(crawler): replace robotstxt library with ...,title provides conventional commit label,feat,10
3,Devin,3277544065,fix: resolve Fuel deployment transaction size ...,title provides conventional commit label,fix,10
4,Devin,3277892335,feat(python-sdk): implement missing crawl_enti...,title provides conventional commit label,feat,10


In [15]:
# Check common columns to determine merge key
print("PR columns:", devin_prs.columns.tolist())
print("\nCommit columns:", devin_commits.columns.tolist())

PR columns: ['id', 'number', 'title', 'user', 'user_id', 'state', 'created_at', 'closed_at', 'merged_at', 'repo_url', 'html_url', 'body']

Commit columns: ['agent', 'id', 'title', 'reason', 'type', 'confidence']
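
The two column lists share `id` and `title`, and the shapes above differ by two rows. A minimal sketch of how the merge key could be checked before merging, using small made-up frames in place of `devin_prs` and `devin_commits` (the values are illustrative only):

```python
import pandas as pd

# Made-up stand-ins for devin_prs and devin_commits.
prs = pd.DataFrame({"id": [1, 2, 3, 4], "title": ["a", "b", "c", "d"]})
commits = pd.DataFrame(
    {"id": [1, 2, 4], "title": ["a", "b", "d"], "type": ["fix", "docs", "feat"]}
)

# Columns present in both frames are the merge-key candidates.
shared_columns = sorted(set(prs.columns) & set(commits.columns))
print("Shared columns:", shared_columns)

# A one-to-one merge needs the key to be unique on both sides.
print("PR ids unique:", prs["id"].is_unique)
print("Commit ids unique:", commits["id"].is_unique)

# PRs that have no row in the classification file.
missing_ids = prs.loc[~prs["id"].isin(commits["id"]), "id"].tolist()
print("PR ids without classification:", missing_ids)
```

Since `title` appears in both files, merging on `id` alone leaves two title columns, which is why the merge below needs a suffix.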


## 4. Create Merge Function

In [16]:
def merge_agent_data(agent_name, base_path):
    """
    Load and merge prs.csv and gpt_conventional_commits.csv for a given agent.
    
    Parameters:
    - agent_name: Name of the agent folder
    - base_path: Path object pointing to the aidev-pop directory
    
    Returns:
    - Merged DataFrame or None if files are missing
    """
    agent_path = base_path / agent_name
    prs_file = agent_path / "prs.csv"
    commits_file = agent_path / "gpt_conventional_commits.csv"
    
    # Check if both files exist
    if not prs_file.exists():
        print(f"Warning: {prs_file} not found for {agent_name}")
        return None
    if not commits_file.exists():
        print(f"Warning: {commits_file} not found for {agent_name}")
        return None
    
    # Load data
    prs_df = pd.read_csv(prs_file)
    commits_df = pd.read_csv(commits_file)
    
    print(f"{agent_name}: PRs shape={prs_df.shape}, Commits shape={commits_df.shape}")
    
    # Left-merge on 'id' so every PR is kept, even one without a classification row;
    # the overlapping 'title' column from the commits file becomes 'title_commit'
    merged_df = pd.merge(prs_df, commits_df, on='id', how='left', suffixes=('', '_commit'))
    
    print(f"{agent_name}: Merged shape={merged_df.shape}")
    
    return merged_df
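
The function above reports shapes but not how many PRs found no match. As a hedged variant, `pd.merge` accepts `validate` and `indicator` arguments that can surface duplicate keys and unmatched rows; `merge_with_report` is a hypothetical helper and the demo frames are made up.

```python
import pandas as pd


def merge_with_report(prs_df, commits_df):
    """Left-merge on 'id' and report how many PRs found no classification."""
    merged = pd.merge(
        prs_df,
        commits_df,
        on="id",
        how="left",
        suffixes=("", "_commit"),
        validate="one_to_one",  # raises pandas.errors.MergeError on duplicate ids
        indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
    )
    unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), unmatched


# Tiny made-up frames standing in for prs.csv and gpt_conventional_commits.csv.
prs_demo = pd.DataFrame({"id": [1, 2, 3], "title": ["a", "b", "c"]})
commits_demo = pd.DataFrame(
    {"id": [1, 3], "title": ["a", "c"], "type": ["fix", "feat"]}
)

merged_demo, unmatched_demo = merge_with_report(prs_demo, commits_demo)
print(f"Merged shape={merged_demo.shape}, unmatched PRs={unmatched_demo}")
```

Dropping `_merge` before returning keeps the column layout identical to `merge_agent_data`.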

## 5. Test Merge Function on Devin

In [17]:
# Test the merge function on Devin
devin_merged = merge_agent_data("Devin", base_path)
devin_merged.head()

Devin: PRs shape=(4829, 12), Commits shape=(4827, 6)
Devin: Merged shape=(4829, 17)


Unnamed: 0,id,number,title,user,user_id,state,created_at,closed_at,merged_at,repo_url,html_url,body,agent,title_commit,reason,type,confidence
0,3277367174,3243,Fix #3242: Add reasoning parameter to LLM clas...,devin-ai-integration[bot],158243242,open,2025-07-30T14:39:48Z,,,https://api.github.com/repos/crewAIInc/crewAI,https://github.com/crewAIInc/crewAI/pull/3243,# Fix #3242: Add reasoning parameter to LLM cl...,Devin,Fix #3242: Add reasoning parameter to LLM clas...,title provides conventional commit label,fix,10.0
1,3277422683,3730,docs: highlight Neon Local Connect option on c...,devin-ai-integration[bot],158243242,open,2025-07-30T14:52:10Z,,,https://api.github.com/repos/neondatabase/website,https://github.com/neondatabase/website/pull/3730,# docs: highlight Neon Local Connect option on...,Devin,docs: highlight Neon Local Connect option on c...,title provides conventional commit label,docs,10.0
2,3277429161,1895,feat(crawler): replace robotstxt library with ...,devin-ai-integration[bot],158243242,open,2025-07-30T14:53:47Z,,,https://api.github.com/repos/mendableai/firecrawl,https://github.com/mendableai/firecrawl/pull/1895,# feat(crawler): replace robotstxt library wit...,Devin,feat(crawler): replace robotstxt library with ...,title provides conventional commit label,feat,10.0
3,3277452738,2557,Add documentation for resolveUsers 50-user bat...,devin-ai-integration[bot],158243242,closed,2025-07-30T15:00:13Z,2025-07-30T18:15:17Z,2025-07-30T18:15:17Z,https://api.github.com/repos/liveblocks/livebl...,https://github.com/liveblocks/liveblocks/pull/...,# Add documentation for resolveUsers 50-user b...,Devin,Add documentation for resolveUsers 50-user bat...,The PR only adds explanatory documentation abo...,docs,10.0
4,3277544065,2911,fix: resolve Fuel deployment transaction size ...,devin-ai-integration[bot],158243242,open,2025-07-30T15:28:54Z,,,https://api.github.com/repos/pyth-network/pyth...,https://github.com/pyth-network/pyth-crosschai...,"## Summary\n\nFixed the ""transaction size limi...",Devin,fix: resolve Fuel deployment transaction size ...,title provides conventional commit label,fix,10.0


In [18]:
# Check the columns in the merged DataFrame
print("Merged columns:", devin_merged.columns.tolist())

Merged columns: ['id', 'number', 'title', 'user', 'user_id', 'state', 'created_at', 'closed_at', 'merged_at', 'repo_url', 'html_url', 'body', 'agent', 'title_commit', 'reason', 'type', 'confidence']
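
In the commits file `confidence` prints as 10, but after the merge it prints as 10.0. One plausible explanation, assuming the two Devin PRs without a classification row are the cause: a left merge fills their `confidence` with NaN, and pandas stores an integer column that contains NaN as `float64`. The sketch below reproduces this on made-up data and shows the nullable `Int64` dtype as one possible way to keep integers.

```python
import pandas as pd

# Made-up data: PR 2 has no classification row.
prs = pd.DataFrame({"id": [1, 2, 3]})
commits = pd.DataFrame({"id": [1, 3], "confidence": [10, 9]})

merged = prs.merge(commits, on="id", how="left")
dtype_after_merge = merged["confidence"].dtype
print(dtype_after_merge)  # float64, because the unmatched row holds NaN

# Nullable integer dtype keeps whole numbers and marks the gap as <NA>.
merged["confidence"] = merged["confidence"].astype("Int64")
print(merged["confidence"].dtype)  # Int64
```

Whether to convert is a design choice; leaving the column as `float64` is harmless as long as downstream code expects it.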


## 6. Process All Agents

In [19]:
# Process all agents and store merged data in a dictionary
merged_data = {}

for agent in agents:
    print(f"\n{'='*60}")
    print(f"Processing {agent}...")
    print('='*60)
    merged_df = merge_agent_data(agent, base_path)
    if merged_df is not None:
        merged_data[agent] = merged_df
    else:
        print(f"Skipping {agent} due to missing files")

print(f"\n\nSuccessfully merged data for {len(merged_data)} agents: {list(merged_data.keys())}")


Processing Devin...
Devin: PRs shape=(4829, 12), Commits shape=(4827, 6)
Devin: Merged shape=(4829, 17)

Processing Copilot...
Copilot: PRs shape=(4971, 12), Commits shape=(4970, 6)
Copilot: Merged shape=(4971, 17)

Processing Cursor...
Cursor: PRs shape=(1541, 12), Commits shape=(1541, 6)
Cursor: Merged shape=(1541, 17)

Processing OpenAI_Codex...
OpenAI_Codex: PRs shape=(21800, 12), Commits shape=(21799, 6)
OpenAI_Codex: Merged shape=(21800, 17)

Processing Claude_Code...
Claude_Code: PRs shape=(459, 12), Commits shape=(459, 6)
Claude_Code: Merged shape=(459, 17)


Successfully merged data for 5 agents: ['Devin', 'Copilot', 'Cursor', 'OpenAI_Codex', 'Claude_Code']
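
The shapes above show that some agents have slightly fewer classification rows than PRs. A minimal sketch of a per-agent coverage table, assuming an unmatched PR ends up with a missing `type` after the left merge; the dictionary here is a made-up stand-in for `merged_data`.

```python
import pandas as pd

# Made-up stand-in for the merged_data dictionary built above.
merged_data = {
    "AgentA": pd.DataFrame({"id": [1, 2, 3], "type": ["fix", None, "feat"]}),
    "AgentB": pd.DataFrame({"id": [4, 5], "type": ["docs", "fix"]}),
}

# One row per agent: total PRs and how many carry a classification.
summary = pd.DataFrame(
    {
        "prs": {agent: len(df) for agent, df in merged_data.items()},
        "classified": {
            agent: int(df["type"].notna().sum()) for agent, df in merged_data.items()
        },
    }
)
summary["unclassified"] = summary["prs"] - summary["classified"]
print(summary)
```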


## 7. Verify Merged DataFrames

In [20]:
# Display summary information for each agent
for agent, df in merged_data.items():
    print(f"\n{'='*60}")
    print(f"{agent}")
    print('='*60)
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print(f"\nFirst few rows:")
    display(df.head(3))
    print(f"\nData types:")
    print(df.dtypes)
    
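
# Save each agent's merged data to CSV once, after all summaries are shown
# (this loop must not be nested inside the loop above, or every file is rewritten per agent)
for agent, df in merged_data.items():
    output_file = base_path / agent / "merged_prs_commits.csv"
    df.to_csv(output_file, index=False)
    print(f"Saved {agent} merged data to {output_file}")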


Devin
Shape: (4829, 17)
Columns: ['id', 'number', 'title', 'user', 'user_id', 'state', 'created_at', 'closed_at', 'merged_at', 'repo_url', 'html_url', 'body', 'agent', 'title_commit', 'reason', 'type', 'confidence']

First few rows:


Unnamed: 0,id,number,title,user,user_id,state,created_at,closed_at,merged_at,repo_url,html_url,body,agent,title_commit,reason,type,confidence
0,3277367174,3243,Fix #3242: Add reasoning parameter to LLM clas...,devin-ai-integration[bot],158243242,open,2025-07-30T14:39:48Z,,,https://api.github.com/repos/crewAIInc/crewAI,https://github.com/crewAIInc/crewAI/pull/3243,# Fix #3242: Add reasoning parameter to LLM cl...,Devin,Fix #3242: Add reasoning parameter to LLM clas...,title provides conventional commit label,fix,10.0
1,3277422683,3730,docs: highlight Neon Local Connect option on c...,devin-ai-integration[bot],158243242,open,2025-07-30T14:52:10Z,,,https://api.github.com/repos/neondatabase/website,https://github.com/neondatabase/website/pull/3730,# docs: highlight Neon Local Connect option on...,Devin,docs: highlight Neon Local Connect option on c...,title provides conventional commit label,docs,10.0
2,3277429161,1895,feat(crawler): replace robotstxt library with ...,devin-ai-integration[bot],158243242,open,2025-07-30T14:53:47Z,,,https://api.github.com/repos/mendableai/firecrawl,https://github.com/mendableai/firecrawl/pull/1895,# feat(crawler): replace robotstxt library wit...,Devin,feat(crawler): replace robotstxt library with ...,title provides conventional commit label,feat,10.0



Data types:
id                int64
number            int64
title            object
user             object
user_id           int64
state            object
created_at       object
closed_at        object
merged_at        object
repo_url         object
html_url         object
body             object
agent            object
title_commit     object
reason           object
type             object
confidence      float64
dtype: object

Copilot
Shape: (4971, 17)
Columns: ['id', 'number', 'title', 'user', 'user_id', 'state', 'created_at', 'closed_at', 'merged_at', 'repo_url', 'html_url', 'body', 'agent', 't

Unnamed: 0,id,number,title,user,user_id,state,created_at,closed_at,merged_at,repo_url,html_url,body,agent,title_commit,reason,type,confidence
0,3221043102,1387,Implement getJSSyntacticDiagnosticsForFile for...,Copilot,198982749,open,2025-07-11T00:10:20Z,,,https://api.github.com/repos/microsoft/typescr...,https://github.com/microsoft/typescript-go/pul...,This PR implements the missing `getJSSyntactic...,Copilot,Implement getJSSyntacticDiagnosticsForFile for...,The PR introduces a new feature by implementin...,feat,9.0
1,3221049488,17571,Improve BCP081 diagnostic message to be more e...,Copilot,198982749,open,2025-07-11T00:14:10Z,,,https://api.github.com/repos/Azure/bicep,https://github.com/Azure/bicep/pull/17571,The BCP081 warning about missing resource type...,Copilot,Improve BCP081 diagnostic message to be more e...,The change improves the diagnostic message to ...,fix,8.0
2,3221072168,7714,Replace [KnownBuiltin] string-based comparison...,Copilot,198982749,closed,2025-07-11T00:30:43Z,2025-07-15T12:08:53Z,2025-07-15T12:08:53Z,https://api.github.com/repos/shader-slang/slang,https://github.com/shader-slang/slang/pull/7714,This PR replaces the inefficient string-based ...,Copilot,Replace [KnownBuiltin] string-based comparison...,The PR introduces a new enum-based system to r...,feat,9.0



Data types:
id                int64
number            int64
title            object
user             object
user_id           int64
state            object
created_at       object
closed_at        object
merged_at        object
repo_url         object
html_url         object
body             object
agent            object
title_commit     object
reason           object
type             object
confidence      float64
dtype: object

Cursor
Shape: (1541, 17)
Columns: ['id', 'number', 'title', 'user', 'user_id', 'state', 'created_at', 'closed_at', 'merged_at', 'repo_url', 'html_url', 'body', 'agent', 'ti

Unnamed: 0,id,number,title,user,user_id,state,created_at,closed_at,merged_at,repo_url,html_url,body,agent,title_commit,reason,type,confidence
0,3229388782,7610,Display webhook secret in dashboard,iuwqyir,16590638,closed,2025-07-14T16:51:28Z,2025-07-15T20:08:30Z,2025-07-15T20:08:30Z,https://api.github.com/repos/thirdweb-dev/js,https://github.com/thirdweb-dev/js/pull/7610,```\r\n<!--\r\n\r\n## title your PR with this ...,Cursor,Display webhook secret in dashboard,The PR introduces a new feature by adding a ne...,feat,9
1,3229430451,8571,fix: Improve error message for missing email s...,mikeldking,5640648,closed,2025-07-14T17:07:09Z,2025-07-15T16:07:24Z,2025-07-15T16:07:24Z,https://api.github.com/repos/Arize-ai/phoenix,https://github.com/Arize-ai/phoenix/pull/8571,Improve OIDC error handling to provide a user-...,Cursor,fix: Improve error message for missing email s...,title provides conventional commit label,fix,10
2,3229479550,1324,Update Twilio integration documentation,AngeloGiacco,29235343,closed,2025-07-14T17:25:38Z,2025-07-15T17:20:03Z,2025-07-15T17:20:03Z,https://api.github.com/repos/elevenlabs/eleven...,https://github.com/elevenlabs/elevenlabs-docs/...,Update Twilio integration documentation to ref...,Cursor,Update Twilio integration documentation,The change is explicitly about updating docume...,docs,9



Data types:
id               int64
number           int64
title           object
user            object
user_id          int64
state           object
created_at      object
closed_at       object
merged_at       object
repo_url        object
html_url        object
body            object
agent           object
title_commit    object
reason          object
type            object
confidence       int64
dtype: object

OpenAI_Codex
Shape: (21800, 17)
Columns: ['id', 'number', 'title', 'user', 'user_id', 'state', 'created_at', 'closed_at', 'merged_at', 'repo_url', 'html_url', 'body', 'agent', 'title_commit

Unnamed: 0,id,number,title,user,user_id,state,created_at,closed_at,merged_at,repo_url,html_url,body,agent,title_commit,reason,type,confidence
0,3172564184,2560,[alpha_factory] auto-start inproc bus,MontrealAI,24208299,closed,2025-06-24T16:13:36Z,2025-06-24T16:13:45Z,2025-06-24T16:13:45Z,https://api.github.com/repos/MontrealAI/AGI-Al...,https://github.com/MontrealAI/AGI-Alpha-Agent-...,## Summary\n- auto-start in-process EventBus c...,OpenAI_Codex,[alpha_factory] auto-start inproc bus,The changes introduce a new feature that auto-...,feat,9.0
1,3172714215,90,Add pagination to SQLAlchemy relationship reso...,simba-git,64661186,closed,2025-06-24T17:14:40Z,2025-06-24T17:41:05Z,2025-06-24T17:41:05Z,https://api.github.com/repos/featureform/enric...,https://github.com/featureform/enrichmcp/pull/90,## Summary\n- autogenerate paginated relations...,OpenAI_Codex,Add pagination to SQLAlchemy relationship reso...,The PR introduces a new feature by adding pagi...,feat,9.0
2,3172731620,2564,[alpha_factory] implement mesh registration re...,MontrealAI,24208299,closed,2025-06-24T17:20:57Z,2025-06-24T17:21:15Z,2025-06-24T17:21:15Z,https://api.github.com/repos/MontrealAI/AGI-Al...,https://github.com/MontrealAI/AGI-Alpha-Agent-...,## Summary\n- add exponential backoff and fina...,OpenAI_Codex,[alpha_factory] implement mesh registration re...,The PR introduces a new feature: implementing ...,feat,9.0



Data types:
id                int64
number            int64
title            object
user             object
user_id           int64
state            object
created_at       object
closed_at        object
merged_at        object
repo_url         object
html_url         object
body             object
agent            object
title_commit     object
reason           object
type             object
confidence      float64
dtype: object

Claude_Code
Shape: (459, 17)
Columns: ['id', 'number', 'title', 'user', 'user_id', 'state', 'created_at', 'closed_at', 'merged_at', 'repo_url', 'html_url', 'body', 'agent',

Unnamed: 0,id,number,title,user,user_id,state,created_at,closed_at,merged_at,repo_url,html_url,body,agent,title_commit,reason,type,confidence
0,3264933329,2911,Fix: Wait for all partitions in load_collectio...,weiliu1031,108661493,closed,2025-07-26T02:59:01Z,2025-07-29T07:01:20Z,,https://api.github.com/repos/milvus-io/pymilvus,https://github.com/milvus-io/pymilvus/pull/2911,## Summary\n\nFixes an issue where `load_colle...,Claude_Code,Fix: Wait for all partitions in load_collectio...,title provides conventional commit label,fix,10
1,3265118634,2,ファイルパス参照を相対パスに統一し、doc/からdocs/に統一,cm-kojimat,61827001,closed,2025-07-26T04:56:55Z,2025-07-26T22:12:24Z,2025-07-26T22:12:24Z,https://api.github.com/repos/classmethod/tsumiki,https://github.com/classmethod/tsumiki/pull/2,## 背景\n\n現在、本プロジェクトにおいて以下のパス構成の不整合が生じています：\n\n...,Claude_Code,ファイルパス参照を相対パスに統一し、doc/からdocs/に統一,The changes unify file path references and dir...,refactor,9
2,3265640341,30,Add build staleness detection for debug CLI,MSch,7475,closed,2025-07-26T13:31:19Z,2025-07-26T13:37:22Z,2025-07-26T13:37:22Z,https://api.github.com/repos/steipete/Peekaboo,https://github.com/steipete/Peekaboo/pull/30,## Summary\r\n\r\n Implements comprehensive b...,Claude_Code,Add build staleness detection for debug CLI,The PR introduces a new feature that adds buil...,feat,10



Data types:
id               int64
number           int64
title           object
user            object
user_id          int64
state           object
created_at      object
closed_at       object
merged_at       object
repo_url        object
html_url        object
body            object
agent           object
title_commit    object
reason          object
type            object
confidence       int64
dtype: object
Saved Devin merged data to AIDev/aidev-pop/Devin/merged_prs_commits.csv
Saved Copilot merged data to AIDev/aidev-pop/Copilot/merged_prs_commits.csv
Saved Cursor merged data to AIDev/aidev-pop/Cursor/merged_prs_commits.csv
Saved OpenAI_Codex merged data to AIDev/aidev-pop/OpenAI_Codex/merged_prs_commits.csv
Saved Claude_Code merged data to AIDev/aidev-pop/Claude_Code/merged_prs_commits.csv
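
After saving, it can be worth reading a file back to confirm nothing changed on the way to disk. The sketch below does a round trip in a temporary directory with a made-up frame; in the notebook the same check would point at each agent's `merged_prs_commits.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for one agent's merged data.
df = pd.DataFrame(
    {"id": [1, 2], "type": ["fix", None], "confidence": [10.0, None]}
)

with tempfile.TemporaryDirectory() as tmp:
    output_file = Path(tmp) / "merged_prs_commits.csv"
    df.to_csv(output_file, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(output_file)

same_shape = reloaded.shape == df.shape
same_columns = reloaded.columns.tolist() == df.columns.tolist()
print(f"Shape preserved: {same_shape}, columns preserved: {same_columns}")
```

Writing with `index=False` matters here: without it, reading the file back would add an `Unnamed: 0` column holding the old row index.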


## 8. Save Merged Data (Optional)