<a href="https://colab.research.google.com/github/awjans/CopilotForPRsAdoption/blob/main/scripts/AIDev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**First**, we need to define the URLs of the AIDev Parquet files that we are interested in.

In [None]:
pull_request_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pull_request.parquet'
pr_comments_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_comments.parquet'
pr_commits_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_commits.parquet'
pr_commit_details_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_commit_details.parquet'
pr_reviews_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_reviews.parquet'
pr_review_comments_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_review_comments.parquet'
pr_task_type_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_task_type.parquet'
repository_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/repository.parquet'
user_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/user.parquet'

def load_data(url):
    """Load the Parquet file at the given URL into a pandas DataFrame.

    Returns None (after printing a message) if the file cannot be loaded.
    """
    try:
        return pd.read_parquet(url)
    except Exception as e:
        print(f"Error loading data: {e}")
        print("Please ensure the URL is correct and the file is publicly accessible.")
        return None

**Second**, we need to load the data from the URLs.

In [30]:
pull_request = load_data(pull_request_file_url)
pr_comments = load_data(pr_comments_file_url)
pr_commits = load_data(pr_commits_file_url)
pr_commit_details = load_data(pr_commit_details_file_url)
pr_reviews = load_data(pr_reviews_file_url)
pr_review_comments = load_data(pr_review_comments_file_url)
pr_task_type = load_data(pr_task_type_file_url)
repository = load_data(repository_file_url)
user = load_data(user_file_url)

**Third**, gather the covariates.





In [None]:
max_created_at = pull_request['created_at'].max()
min_created_at = pull_request['created_at'].min()

print(f"Maximum created_at: {max_created_at}")
print(f"Minimum created_at: {min_created_at}")

Maximum created_at: 2025-07-30T19:36:13Z
Minimum created_at: 2024-12-24T00:23:09Z


# List of Covariates

## PR Variables
1. **Number of added lines (num_added_LOC):** The # of added LOC by a PR
2. **Number of deleted lines (num_deleted_LOC):** The # of deleted LOC by a PR
3. **PR size (PR_size):** The total number of added and deleted LOC in a PR (additions + deletions)
4. **Purpose (PR_purpose):** The purpose of a PR, e.g., bug fix, documentation, or feature. Simple keyword search in the title/body ('fix', 'bug', 'doc', …).
5. **Number of files (num_files_changed):** The # of files changed by a PR
6. **Number of commits (num_commits):** The # of commits involved in a PR
7. **Description length:** The length of a PR description
8. **PR author experience (PR_author_experience):** The # of prior PRs that were submitted by the PR author (author’s prior PR count). Query the author’s PR history in the same repo and count PRs created before the current one.
9. **Is member (is_member):** Whether or not the author is a member or outside collaborator
10. **Number of comments (num_comments):** The # of comments left on a PR
11. **Number of author comments (num_author_comments):** The # of comments left by the PR author
12. **Number of reviewer comments (num_reviewer_comments):** The # of comments left by the reviewers who participate in the discussion
13. **Number of reviewers (num_reviewers):** The # of developers who participate in the discussion.
14. **Repo age (repo_age):** Time interval between the repository creation time and PR creation time in days.

## Project variables

15. **Language (language):** The language of the repository that a PR belongs to, encoded as one of the top 10 languages or "other"
16. **Number of forks (num_forks):** The # of forks that a repository has
17. **Number of stargazers (num_stargazers):** The # of stargazers that a repository has.

## Treatment variables

18. **With Copilot for PRs (with_copilot_for_PRs):** Whether or not a PR is generated by Copilot for PRs (binary)

## Outcome variables

19. **Review time (review_time):** Time interval between the PR creation time and closed time in hours
20. **Is merged (is_merged):** Whether or not a PR is merged (binary)
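
The two outcome variables can be derived directly from the PR timestamps. The sketch below assumes the pull request table carries `created_at`, `closed_at`, and `merged_at` columns as ISO 8601 strings. Only `created_at` is confirmed by this notebook; the other two column names and the helper name `add_outcomes` are assumptions.

```python
import pandas as pd

def add_outcomes(prs: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of prs with review_time (hours) and is_merged added."""
    out = prs.copy()
    created = pd.to_datetime(out['created_at'], utc=True)
    closed = pd.to_datetime(out['closed_at'], utc=True)
    # PRs that are still open have no closed_at, so their review_time stays NaN
    out['review_time'] = (closed - created).dt.total_seconds() / 3600
    # Assumption: a non-null merged_at means the PR was merged
    out['is_merged'] = out['merged_at'].notna()
    return out

# Toy data, not taken from AIDev
demo_prs = pd.DataFrame({
    'created_at': ['2025-01-01T00:00:00Z', '2025-01-02T00:00:00Z'],
    'closed_at': ['2025-01-01T06:30:00Z', None],
    'merged_at': ['2025-01-01T06:30:00Z', None],
})
demo_outcomes = add_outcomes(demo_prs)
print(demo_outcomes[['review_time', 'is_merged']])
```

Open PRs keep a missing `review_time` rather than a default of 0, since a zero would read as an instant review.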

In [65]:
# 1 - Number of added lines (num_added_LOC): The # of added LOC by a PR
# 2 - Number of deleted lines (num_deleted_LOC): The # of deleted LOC by a PR
# 3 - PR size (PR_size): The total number of added and deleted LOC by a PR (additions + deletions)
# Drop columns left over from a previous run so this cell can be re-run safely
pull_request = pull_request.drop(columns=['num_added_LOC', 'num_deleted_LOC', 'PR_size', 'pr_id'], errors='ignore')

# Sum only the columns we are interested in, per PR
pr_commit_LOC = (pr_commit_details.groupby('pr_id')[['additions', 'deletions', 'changes']]
                                  .sum()
                                  .reset_index())

# Rename the sum columns to what we want
pr_commit_LOC = pr_commit_LOC.rename(columns={'additions': 'num_added_LOC',
                                              'deletions': 'num_deleted_LOC',
                                              'changes': 'PR_size'})

# Merge the Dataframes with a left join
pull_request = pd.merge(pull_request, pr_commit_LOC, left_on='id', right_on='pr_id', how='left')

# Garbage collect the temporary Dataframe
pr_commit_LOC = None

# Fill N/A values with defaults
pull_request['num_added_LOC'] = pull_request['num_added_LOC'].fillna(0).astype(int)
pull_request['num_deleted_LOC'] = pull_request['num_deleted_LOC'].fillna(0).astype(int)
pull_request['PR_size'] = pull_request['PR_size'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
pull_request = pull_request.drop(columns=['pr_id'], errors='ignore')


In [51]:
# 4 - Purpose (PR_purpose): The purpose of a PR, e.g., bug fix, documentation, or feature. Simple keyword search in the title/body ('fix', 'bug', 'doc', …).
#     This cell uses the 'type' label from the pr_task_type table in place of a keyword search.
# Drop columns left over from a previous run so this cell can be re-run safely
pull_request = pull_request.drop(columns=['PR_purpose'], errors='ignore')

# Make a copy of the PR Task Type Dataframe and Drop unneeded columns
pr_task = pr_task_type.copy().drop(columns=['agent', 'title', 'reason', 'confidence'], errors='ignore')

# Group by ID and keep the first record, leaving 'id' as a regular column
pr_task = pr_task.groupby('id', as_index=False).first()

# Rename the column to what we want to keep
pr_task = pr_task.rename(columns={'type': 'PR_purpose'})

# Merge the Dataframes with a left join
pull_request = pd.merge(pull_request, pr_task, left_on='id', right_on='id', how='left')

# Garbage Collect the temporary Dataframe
pr_task = None

# Fill N/A values with defaults
pull_request['PR_purpose'] = pull_request['PR_purpose'].fillna('other')


In [61]:
# 5 - Number of files (num_files_changed): The # of files changed by a PR
# Drop columns left over from a previous run so this cell can be re-run safely
pull_request = pull_request.drop(columns=['num_files_changed', 'pr_id'], errors='ignore')

# Count the distinct files changed by each PR and name the column what we want
pr_files_changed = (pr_commit_details.groupby('pr_id')['filename']
                                     .nunique()
                                     .reset_index(name='num_files_changed'))

# Merge the Dataframes with a left join
pull_request = pd.merge(pull_request, pr_files_changed, left_on='id', right_on='pr_id', how='left')

# Garbage Collect the temporary Dataframe
pr_files_changed = None

# Fill N/A values with defaults
pull_request['num_files_changed'] = pull_request['num_files_changed'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
pull_request = pull_request.drop(columns=['pr_id'], errors='ignore')


In [63]:
# 6 - Number of commits (num_commits): The # of commits involved in a PR
# Drop columns left over from a previous run so this cell can be re-run safely
pull_request = pull_request.drop(columns=['num_commits', 'pr_id'], errors='ignore')

# Count the number of Commits for the Pull Request, name the column what we want
pr_commits_count = pr_commits.groupby(['pr_id']).size().reset_index(name='num_commits')

# Merge the Dataframes using a left join
pull_request = pd.merge(pull_request, pr_commits_count, left_on='id', right_on='pr_id', how='left')

# Garbage Collect the temporary Dataframe
pr_commits_count = None

# Fill N/A values with defaults
pull_request['num_commits'] = pull_request['num_commits'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
pull_request = pull_request.drop(columns=['pr_id'], errors='ignore')
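
The drop, count, merge, and fill steps used for the count variables repeat the same pattern, so they could be factored into one helper. This is a minimal sketch, not the notebook's own code: the name `add_count_column` is hypothetical, and it assumes `id` is unique in the pull request table and that the event table links to it through `pr_id`.

```python
import pandas as pd

def add_count_column(prs, events, name, key='pr_id'):
    """Return prs with a column `name` holding the number of rows in events per PR."""
    # Drop a stale copy first so the cell can be re-run safely
    prs = prs.drop(columns=[name], errors='ignore')
    counts = events.groupby(key).size()
    # Mapping on the PR id avoids the leftover pr_id column that a merge creates
    prs[name] = prs['id'].map(counts).fillna(0).astype(int)
    return prs

# Toy data, not taken from AIDev
demo_prs = pd.DataFrame({'id': [1, 2, 3]})
demo_commits = pd.DataFrame({'pr_id': [1, 1, 3]})
demo_prs = add_count_column(demo_prs, demo_commits, 'num_commits')
print(demo_prs)
```

With such a helper, `num_commits` and `num_comments` would each become a single call on `pr_commits` and `pr_comments` respectively.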


In [14]:
# 7 - Description length: The length of a PR description
# Drop columns left over from a previous run so this cell can be re-run safely
pull_request = pull_request.drop(columns=['description_length'], errors='ignore')

# Get the length of the Pull Request body; a missing body counts as length 0
pull_request['description_length'] = pull_request['body'].str.len().fillna(0).astype(int)


In [None]:
# 8 - PR author experience (PR_author_experience): The # of prior PRs that were submitted by the PR author (author’s prior PR count).
#     Query the author’s PR history in the same repo and count PRs created before the current one.
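
One way to compute this variable is to sort the PRs by creation time and number each author's PRs within a repository. The sketch below assumes the pull request table has `user_id` and `repo_id` columns (neither is confirmed by this notebook) and that `created_at` is an ISO 8601 string. The helper name `add_author_experience` is hypothetical.

```python
import pandas as pd

def add_author_experience(prs, author_col='user_id', repo_col='repo_id'):
    """Return a copy of prs with PR_author_experience added."""
    out = prs.drop(columns=['PR_author_experience'], errors='ignore').copy()
    out['_created'] = pd.to_datetime(out['created_at'], utc=True)
    out = out.sort_values('_created', kind='stable')
    # cumcount numbers each author's PRs within a repo 0, 1, 2, ... in time order,
    # which equals the number of earlier PRs by that author in that repo
    out['PR_author_experience'] = out.groupby([repo_col, author_col]).cumcount()
    # Restore the original row order
    return out.drop(columns='_created').sort_index()

# Toy data, not taken from AIDev
demo_prs = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'user_id': [1, 1, 2, 1],
    'repo_id': [7, 7, 7, 8],
    'created_at': ['2025-02-01T00:00:00Z', '2025-01-01T00:00:00Z',
                   '2025-01-15T00:00:00Z', '2025-03-01T00:00:00Z'],
})
demo_experience = add_author_experience(demo_prs)
print(demo_experience[['id', 'PR_author_experience']])
```

This only counts PRs present in the dataset, so earlier PRs by the same author that fall outside the collection window are not reflected.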


In [None]:
# 9 - Is member (is_member): Whether or not the author is a member or outside collaborator
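
None of the tables loaded above is shown to contain membership information, so this variable may need extra data. The sketch below assumes a hypothetical `author_association` column holding GitHub's author-association value for each PR (GitHub uses values such as `MEMBER`, `COLLABORATOR`, `CONTRIBUTOR`, and `NONE`). Both the column and the helper name `add_is_member` are assumptions.

```python
import pandas as pd

# Association values treated as "member or outside collaborator".
# Whether OWNER should also count is a judgment call; it is left out here.
MEMBER_ASSOCIATIONS = {'MEMBER', 'COLLABORATOR'}

def add_is_member(prs, assoc_col='author_association'):
    """Return a copy of prs with a boolean is_member column."""
    out = prs.drop(columns=['is_member'], errors='ignore').copy()
    # Missing associations are treated as "not a member"
    out['is_member'] = out[assoc_col].isin(MEMBER_ASSOCIATIONS)
    return out

# Toy data, not taken from AIDev
demo_prs = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'author_association': ['MEMBER', 'CONTRIBUTOR', 'COLLABORATOR', None],
})
demo_members = add_is_member(demo_prs)
print(demo_members)
```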


In [64]:
# 10 - Number of comments (num_comments): The # of comments left on a PR
# Drop columns left over from a previous run so this cell can be re-run safely
pull_request = pull_request.drop(columns=['num_comments', 'pr_id'], errors='ignore')

# Count the number of Comments for the Pull Request, name the column what we want.
pr_comments_count = pr_comments.groupby(['pr_id']).size().reset_index(name='num_comments')

# Merge the Dataframes using a left join
pull_request = pd.merge(pull_request, pr_comments_count, left_on='id', right_on='pr_id', how='left')

# Garbage Collect the temporary Dataframe
pr_comments_count = None

# Fill N/A values with defaults
pull_request['num_comments'] = pull_request['num_comments'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
pull_request = pull_request.drop(columns=['pr_id'], errors='ignore')


In [None]:
# 11 - Number of author comments (num_author_comments): The # of comments left by the PR author


In [None]:
# 12 - Number of reviewer comments (num_reviewer_comments): The # of comments left by the reviewers who participate in the discussion
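
Variables 11 and 12 can be sketched together by splitting each PR's comments into those written by the PR author and those written by anyone else. The sketch assumes both the pull request table and the comments table identify people through a `user_id` column (an assumption; adjust the two `*_col` parameters to the real schema) and that `id` is unique per PR. The helper name `add_comment_split` is hypothetical.

```python
import pandas as pd

def add_comment_split(prs, comments, pr_user_col='user_id', comment_user_col='user_id'):
    """Return a copy of prs with num_author_comments and num_reviewer_comments."""
    out = prs.drop(columns=['num_author_comments', 'num_reviewer_comments'],
                   errors='ignore').copy()
    # Look up the PR author for every comment
    authors = out.set_index('id')[pr_user_col]
    by_author = comments[comment_user_col] == comments['pr_id'].map(authors)
    author_counts = comments[by_author].groupby('pr_id').size()
    reviewer_counts = comments[~by_author].groupby('pr_id').size()
    out['num_author_comments'] = out['id'].map(author_counts).fillna(0).astype(int)
    out['num_reviewer_comments'] = out['id'].map(reviewer_counts).fillna(0).astype(int)
    return out

# Toy data, not taken from AIDev
demo_prs = pd.DataFrame({'id': [1, 2], 'user_id': [100, 200]})
demo_comments = pd.DataFrame({'pr_id': [1, 1, 1, 2],
                              'user_id': [100, 300, 100, 300]})
demo_split = add_comment_split(demo_prs, demo_comments)
print(demo_split)
```

Here every non-author commenter is treated as a reviewer, which is one reading of the variable definition rather than a confirmed rule.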


In [None]:
# 13 - Number of reviewers (num_reviewers): The # of developers who participate in the discussion.
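
A possible implementation counts the distinct commenters on each PR, excluding the PR author. As before, the `user_id` columns are assumptions about the schema, excluding the author is an interpretation of "developers who participate in the discussion", and the helper name `add_num_reviewers` is hypothetical.

```python
import pandas as pd

def add_num_reviewers(prs, comments, pr_user_col='user_id', comment_user_col='user_id'):
    """Return a copy of prs with num_reviewers (distinct non-author commenters)."""
    out = prs.drop(columns=['num_reviewers'], errors='ignore').copy()
    authors = out.set_index('id')[pr_user_col]
    # Keep only comments written by someone other than the PR author
    others = comments[comments[comment_user_col] != comments['pr_id'].map(authors)]
    reviewers = others.groupby('pr_id')[comment_user_col].nunique()
    out['num_reviewers'] = out['id'].map(reviewers).fillna(0).astype(int)
    return out

# Toy data, not taken from AIDev
demo_prs = pd.DataFrame({'id': [1, 2, 3], 'user_id': [100, 200, 100]})
demo_comments = pd.DataFrame({'pr_id': [1, 1, 1, 2],
                              'user_id': [100, 300, 400, 300]})
demo_reviewers = add_num_reviewers(demo_prs, demo_comments)
print(demo_reviewers)
```

To include people who only left reviews or inline review comments, the `pr_reviews` and `pr_review_comments` tables could be concatenated with `pr_comments` before calling the helper, provided they share the same key columns.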


In [None]:
# 14 - Repo age (repo_age): Time interval between the repository creation time and PR creation time in days.
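
Repo age is a timestamp difference once the repository creation time is attached to each PR. The sketch assumes the pull request table has a `repo_id` column and the repository table has `id` and `created_at` columns. None of these is confirmed by this notebook, and the repository creation time may have to be fetched separately. The helper name `add_repo_age` is hypothetical.

```python
import pandas as pd

def add_repo_age(prs, repos, repo_created_col='created_at'):
    """Return a copy of prs with repo_age in whole days."""
    out = prs.drop(columns=['repo_age'], errors='ignore').copy()
    # Attach each PR's repository creation time, then parse both timestamps
    repo_created_raw = out['repo_id'].map(repos.set_index('id')[repo_created_col])
    repo_created = pd.to_datetime(repo_created_raw, utc=True)
    pr_created = pd.to_datetime(out['created_at'], utc=True)
    # .dt.days truncates the difference to whole days
    out['repo_age'] = (pr_created - repo_created).dt.days
    return out

# Toy data, not taken from AIDev
demo_repos = pd.DataFrame({'id': [1], 'created_at': ['2024-01-01T00:00:00Z']})
demo_prs = pd.DataFrame({'id': [10], 'repo_id': [1],
                         'created_at': ['2024-01-11T12:00:00Z']})
demo_age = add_repo_age(demo_prs, demo_repos)
print(demo_age[['id', 'repo_age']])
```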


In [None]:
# 15 - Language (language): The repository language that a PR belongs to, represented by the top 10 or others
# 16 - Number of forks (num_forks): The # of forks that a repository has
# 17 - Number of stargazers (num_stargazers): The # of stargazers that a repository has.
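
The three project variables can be looked up from the repository table in one pass. The sketch assumes the repository table has `id`, `language`, `forks`, and `stars` columns and the pull request table has `repo_id`. All of these names are assumptions, as is ranking the top languages by number of PRs rather than by number of repositories. The helper name `add_project_variables` is hypothetical.

```python
import pandas as pd

def add_project_variables(prs, repos, top_n=10):
    """Return a copy of prs with language, num_forks, and num_stargazers."""
    out = prs.drop(columns=['language', 'num_forks', 'num_stargazers'],
                   errors='ignore').copy()
    repo = repos.set_index('id')
    out['language'] = out['repo_id'].map(repo['language'])
    out['num_forks'] = out['repo_id'].map(repo['forks']).fillna(0).astype(int)
    out['num_stargazers'] = out['repo_id'].map(repo['stars']).fillna(0).astype(int)
    # Keep the top_n most frequent languages among PRs; everything else becomes 'other'
    top = out['language'].value_counts().nlargest(top_n).index
    out['language'] = out['language'].where(out['language'].isin(top), 'other')
    return out

# Toy data, not taken from AIDev
demo_repos = pd.DataFrame({'id': [1, 2, 3],
                           'language': ['Python', 'Go', None],
                           'forks': [5, 0, 2],
                           'stars': [10, 1, 0]})
demo_prs = pd.DataFrame({'id': [10, 11, 12, 13], 'repo_id': [1, 1, 2, 3]})
demo_project = add_project_variables(demo_prs, demo_repos, top_n=1)
print(demo_project)
```

Repositories with no recorded language fall into `'other'` along with the less common languages.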


**Fourth**, detect and filter out bot accounts following the methodology of Golzadeh et al. (2022): a simple “bot” username suffix check combined with a comprehensive, manually verified list of 527 bot accounts.
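
The two-part check described above could look roughly like the following. This is a sketch under stated assumptions, not the cited authors' implementation: the 527-account list is not part of this notebook, so `KNOWN_BOTS` holds made-up placeholder names, and the function name `is_bot` is hypothetical.

```python
# Placeholder entries only; the real, manually verified list must be supplied separately
KNOWN_BOTS = frozenset({'example-ci-helper', 'example-release-manager'})

def is_bot(login, known_bots=KNOWN_BOTS):
    """Return True if login ends in 'bot' / '[bot]' or appears in known_bots."""
    if not isinstance(login, str):
        return False
    name = login.lower()
    # GitHub app accounts end in '[bot]', which a plain 'bot' suffix test would miss
    return name.endswith('bot') or name.endswith('[bot]') or name in known_bots

demo_logins = ['dependabot[bot]', 'alice', 'example-ci-helper', 'my-build-bot', None]
demo_flags = [is_bot(login) for login in demo_logins]
print(demo_flags)
```

The suffix test can flag human accounts whose names happen to end in "bot", which is one reason a manually verified list is useful alongside it.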