<a href="https://colab.research.google.com/github/awjans/CopilotForPRsAdoption/blob/main/scripts/AIDev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection/Cleaning Overview
1. **PR identification**
   * Queried GitHub via GraphQL for PRs whose description contained the phrase **“Generated by Copilot”** or any of the marker tags:

     * `copilot:summary`
     * `copilot:walkthrough`
     * `copilot:poem`
     * `copilot:all`

2. **Scope**
   * Collected **18,256 PRs** from **146 early-adopter repositories** during **March 2023 – August 2023**.

3. **Control set**
   * For the same repositories, gathered **54,188 PRs** that did **not** contain any Copilot marker.
   * These served as the **untreated (control) group** for the **RQ2 comparison**.

4. **Bot filtering**
   * Removed PRs and comments authored by bots using the **high-precision method** of **Golzadeh et al. (2022)**, which included:
     * (i) Usernames ending with “bot”
     * (ii) A curated list of **527 known bot accounts**

5. **Revision extraction (RQ3)**
   * From the **18,256 Copilot-generated PRs**, retrieved the full **edit history** of PR descriptions.
   * Identified **1,437 revisions** where developers **edited the AI-suggested content**.
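
The PR identification step above can be sketched as a plain text match. This is a minimal illustration only: the helper names `find_copilot_markers` and `is_copilot_pr` are hypothetical, and the case-insensitive phrase match is an assumption, since the exact matching rules of the original GraphQL query are not given here.

```python
import re

# Marker tags listed in the overview; the matching rules are an assumption.
MARKER_TAGS = ("copilot:summary", "copilot:walkthrough", "copilot:poem", "copilot:all")
PHRASE = "generated by copilot"

_TAG_RE = re.compile("|".join(re.escape(tag) for tag in MARKER_TAGS))


def find_copilot_markers(description):
    """Return the sorted, de-duplicated marker tags found in a PR description."""
    if not description:
        return []
    return sorted(set(_TAG_RE.findall(description)))


def is_copilot_pr(description):
    """True if the description contains the phrase or any marker tag."""
    if not description:
        return False
    return PHRASE in description.lower() or bool(find_copilot_markers(description))
```

A PR whose description matches neither the phrase nor a tag would fall into the control set described in step 3.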

In [141]:
import asyncio
import matplotlib.pyplot as plt
import nest_asyncio
import numpy as np
import os
import pandas as pd
import requests
import seaborn as sns
import datetime

from dateutil import parser
from google.colab import userdata
from urllib.parse import urlparse

# **First**, we need to define the URLs of the AIDev Parquet files that we are interested in.

In [142]:
pull_request_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pull_request.parquet'
pr_comments_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_comments.parquet'
pr_commits_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_commits.parquet'
pr_commit_details_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_commit_details.parquet'
pr_reviews_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_reviews.parquet'
pr_review_comments_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_review_comments.parquet'
pr_task_type_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/pr_task_type.parquet'
repository_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/repository.parquet'
user_file_url = 'https://huggingface.co/datasets/hao-li/AIDev/resolve/main/user.parquet'


In [143]:
def load_data(url: str):
    """
    Load the Parquet file into a Pandas DataFrame from the file URL.
    """
    try:
        # For Parquet files:
        df = pd.read_parquet(url)
        return df
    except Exception as e:
        print(f"Error loading data: {e}")
        print("Please ensure the URL is correct and the file is publicly accessible.")
        return None # Return None in case of an error

In [144]:
nest_asyncio.apply()

GH_TOKEN = os.environ.get('GITHUB_TOKEN', userdata.get('GITHUB_TOKEN'))

async def get_repo_data(repo_url: str):
    # Make the Request
    print(f'Requesting: {repo_url}')
    response = requests.get(repo_url, headers={'Authorization': f'token {GH_TOKEN}'})
    response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
    # Process the JSON response
    return response.json()


def get_repo_created_at(repo_url: str):
    """
    Get the repository's created_at timestamp from the GitHub API.

    Args:
        repo_url: The GitHub API repository URL.

    Returns:
        The created_at timestamp if successful, None otherwise.
    """
    try:
        try:
            event_loop = asyncio.get_running_loop()
        except RuntimeError:
            # No event loop is running (e.g. a plain script), so start one
            data = asyncio.run(get_repo_data(repo_url))
        else:
            # Inside a notebook, nest_asyncio lets us re-enter the running loop
            data = event_loop.run_until_complete(get_repo_data(repo_url))

        # Extract the created_at value
        created_at = data.get('created_at')
        print(f"Repo: {repo_url}; Created At: {created_at}")

        if created_at:
            return pd.to_datetime(created_at)
        else:
            raise Exception(f"Error: Could not retrieve created_at for {repo_url}. Response data: {data}")

    except requests.exceptions.RequestException as e:
        print(f"Error during GitHub API request for {repo_url}: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None


# **Second**, we need to load the data from the URLs

In [145]:
pull_request = load_data(pull_request_file_url)
pr_comments = load_data(pr_comments_file_url)
pr_commits = load_data(pr_commits_file_url)
pr_commit_details = load_data(pr_commit_details_file_url)
pr_reviews = load_data(pr_reviews_file_url)
pr_review_comments = load_data(pr_review_comments_file_url)
pr_task_type = load_data(pr_task_type_file_url)
repository = load_data(repository_file_url)
user = load_data(user_file_url)

## Create a Copy of Pull_Requests

Remove the Open Pull Requests

In [146]:
metrics = pull_request.copy()
repos = repository.copy()

# Remove Open Pull Requests (closed_at is None)
print(f"Number of Pull Requests: {len(metrics)}")
metrics = metrics[metrics['closed_at'].notna()]
print(f"Number of Pull Requests: {len(metrics)}")

# Convert Timestamps
metrics['created_at'] = pd.to_datetime(metrics['created_at'])
metrics['closed_at'] = pd.to_datetime(metrics['closed_at'])

# Remove Repositories that do not have a Pull Request
print(f"Number of Repositories: {len(repos)}")
repos = repos[repos['id'].isin(metrics['repo_id'])]
print(f"Number of Repositories: {len(repos)}")
repos['created_at'] = repos.apply(lambda row: get_repo_created_at(row['url']), axis=1)
repos = repos.dropna(subset=['created_at'])
print(f"Number of Repositories: {len(repos)}")

Number of Pull Requests: 33596
Number of Pull Requests: 31284
Number of Repositories: 2807
Number of Repositories: 2404
Requesting: https://api.github.com/repos/kizuna-ai-lab/sokuji
Repo: https://api.github.com/repos/kizuna-ai-lab/sokuji; Created At: 2025-04-14T15:58:15Z
Requesting: https://api.github.com/repos/freenet/freenet-core
Repo: https://api.github.com/repos/freenet/freenet-core; Created At: 2021-07-16T13:19:06Z
Requesting: https://api.github.com/repos/vexxhost/atmosphere
Repo: https://api.github.com/repos/vexxhost/atmosphere; Created At: 2022-08-23T23:21:09Z
Requesting: https://api.github.com/repos/JonasKruckenberg/k23
Repo: https://api.github.com/repos/JonasKruckenberg/k23; Created At: 2023-10-12T10:28:02Z
Requesting: https://api.github.com/repos/JuliaLang/julia
Repo: https://api.github.com/repos/JuliaLang/julia; Created At: 2011-04-21T07:01:50Z
Requesting: https://api.github.com/repos/artsy/force
Repo: https://api.github.com/repos/artsy/force; Created At: 2014-08-12T18:14:34

# **Third**, gather the covariates

The original analysis uses two CSV files: treatment_metrics.csv and control_metrics.csv.


## PR Variables
1. **additions:** The # of added LOC by a PR
2. **deletions:** The # of deleted LOC by a PR
3. **prSize:** The total number of added and deleted LOC by a PR (additions + deletions)
4. **purpose:** The purpose of a PR, i.e., bug, document, and feature. Simple keyword search in the title/body ('fix', 'bug', 'doc', …).
5. **changedFiles:** The # of files changed by a PR
6. **commitsTotalCount:** The # of commits involved in a PR
7. **Description length:** The length of a PR description
8. **prExperience:** The # of prior PRs that were submitted by the PR author (author’s prior PR count). Query the author’s PR history in the same repo and count PRs created before the current one.
9. **isMember:** Whether or not the author is a member or outside collaborator (True/False).
10. **commentsTotalCount:** The # of comments left on a PR
11. **authorComments:** The # of comments left by the PR author
12. **reviewersComments:** The # of comments left by the reviewers who participate in the discussion
13. **reviewersTotalCount:** The # of developers who participate in the discussion.
14. **repoAge:** Time interval between the repository creation time and PR creation time in days.
15. **state**: State of the pull request (MERGED or CLOSED).
16. **bodyLength**: Length of the PR body (in characters).
17. **reviewTime**: Time taken to review the PR (in hours, floating point, no rounding).


## Project variables

18. **repoLanguage:** Programming language of the repository (e.g., Python, PHP, TypeScript, Vue). *[I'm assuming it's the top language as there is only one]*
19. **forkCount:** The # of forks that a repository has
20. **stargazerCount:** The # of stargazers that a repository has.

## Treatment variables

21. **With Copilot for PRs:** Whether or not a PR is generated by Copilot for PRs (binary)

## Outcome variables

22. **Review time (reviewTime):** Time interval between the PR creation time and closed time in hours
23. **Is merged (state):** Whether or not a PR is merged (binary)


### Order in CSV (treatment_metrics.csv and control_metrics.csv)
1. **repoLanguage**
2. **forkCount**
3. **stargazerCount**
4. **repoAge**
5. **state**
6. **deletions**
7. **additions**
8. **changedFiles**
9. **commentsTotalCount**
10. **commitsTotalCount**
11. **prExperience**
12. **isMember**
13. **authorComments**
14. **reviewersComments**
15. **reviewersTotalCount**
16. **bodyLength**
17. **prSize**
18. **reviewTime**
19. **purpose**


In [147]:
# 1 - additions: The # of added LOC by a PR
# 2 - deletions: The # of deleted LOC by a PR
# 3 - prSize: The total number of added and deleted LOC by a PR (additions + deletions)
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['additions', 'deletions', 'prSize', 'pr_id'], errors='ignore')

# Sum only the columns we are interested in, per pull request
pr_commit_LOC = (pr_commit_details.groupby('pr_id')[['additions', 'deletions', 'changes']]
                                  .sum()
                                  .reset_index())

# Rename the sum columns to what we want
pr_commit_LOC = pr_commit_LOC.rename(columns={'changes': 'prSize'})

# Merge the Dataframes with a left join
metrics = pd.merge(metrics, pr_commit_LOC, left_on='id', right_on='pr_id', how='left')

# Garbage collect the temporary Dataframe
pr_commit_LOC = None

# Fill N/A values with defaults
metrics['additions'] = metrics['additions'].fillna(0).astype(int)
metrics['deletions'] = metrics['deletions'].fillna(0).astype(int)
metrics['prSize'] = metrics['prSize'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
metrics = metrics.drop(columns=['pr_id'], errors='ignore')

In [148]:
# 4 - purpose: The purpose of a PR, i.e., bug, document, and feature. Simple keyword search in the title/body ('fix', 'bug', 'doc', …).
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['purpose'], errors='ignore')

# Make a copy of the PR Task Type Dataframe and Drop unneeded columns
pr_task = pr_task_type.copy().drop(columns=['agent', 'title', 'reason', 'confidence'], errors='ignore')

# Group by ID and get the First Record
pr_task = pr_task.groupby(['id']).first()

# Rename the column to what we want to keep
pr_task = pr_task.rename(columns={'type': 'purpose'})

# Merge the Dataframes with a left join
metrics = pd.merge(metrics, pr_task, left_on='id', right_on='id', how='left')

# Garbage Collect the temporary Dataframe
pr_task = None

# Fill N/A values with defaults
metrics['purpose'] = metrics['purpose'].fillna('other')

# TODO: Check that purpose is one of the three options: Bug, Feature, Document

In [149]:
# 5 - changedFiles: The # of files changed by a PR
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['changedFiles', 'pr_id'], errors='ignore')

# Count the number of Files changed and change the column name to what we want
pr_files_changed = (pr_commit_details.groupby(['pr_id', 'filename'])
                                     .size()
                                     .groupby(['pr_id'])
                                     .size()
                                     .reset_index(name='changedFiles'))

# Merge the Dataframes with a left join
metrics = pd.merge(metrics, pr_files_changed, left_on='id', right_on='pr_id', how='left')

# Garbage Collect the temporary Dataframe
pr_files_changed = None

# Fill N/A values with defaults
metrics['changedFiles'] = metrics['changedFiles'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
metrics = metrics.drop(columns=['pr_id'], errors='ignore')

In [150]:
# 6 - commitsTotalCount: The # of commits involved in a PR
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['commitsTotalCount', 'pr_id'], errors='ignore')

# Count the number of Commits for the Pull Request, name the column what we want.
pr_commits_count = pr_commits.groupby(['pr_id']).size().reset_index(name='commitsTotalCount')

# Merge the Dataframes using a left join
metrics = pd.merge(metrics, pr_commits_count, left_on='id', right_on='pr_id', how='left')

# Garbage Collect the temporary Dataframe
pr_commits_count = None

# Fill N/A values with defaults
metrics['commitsTotalCount'] = metrics['commitsTotalCount'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
metrics = metrics.drop(columns=['pr_id'], errors='ignore')

In [151]:
# 7 - bodyLength: The length of a PR description (in characters)
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['bodyLength'], errors='ignore')

# Get the Length of the Body of the Pull Request (a missing body counts as length 0)
metrics['bodyLength'] = metrics['body'].str.len().fillna(0).astype(int)

In [152]:
# 8 - prExperience: The # of prior PRs that were submitted by the PR author (author’s prior PR count).
#     Query the author’s PR history in the same repo and count PRs created before the current one.
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['prExperience'], errors='ignore')

# TODO: Figure out how to do this
metrics['prExperience'] = 0
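
One possible way to fill in the TODO above, offered as a sketch rather than the original study's method: count, for each PR, how many earlier PRs the same author opened in the same repository. The helper name `add_pr_experience` is hypothetical. It assumes the frame has `repo_id`, `user_id`, and `created_at` columns with no missing values and a unique index, and it only sees PRs present in the dataset, so it will undercount authors whose earlier PRs are not included.

```python
import pandas as pd


def add_pr_experience(prs):
    """Add prExperience: the author's number of earlier PRs in the same repository."""
    ordered = prs.sort_values("created_at", kind="stable")
    # cumcount() numbers the rows of each (repo, author) group starting at 0,
    # which equals the number of earlier PRs by that author in that repo.
    experience = ordered.groupby(["repo_id", "user_id"]).cumcount()
    return prs.assign(prExperience=experience.reindex(prs.index).astype(int))


# Hypothetical example data
example = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "repo_id": [10, 10, 10, 20],
    "user_id": [7, 7, 8, 7],
    "created_at": pd.to_datetime(["2025-01-03", "2025-01-01", "2025-01-02", "2025-01-04"]),
})
result = add_pr_experience(example)
```

In the notebook this would be applied as `metrics = add_pr_experience(metrics)` in place of the constant assignment.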


In [153]:
# 9 - isMember: Whether or not the author is a member or outside collaborator (True/False)
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['isMember'], errors='ignore')

# TODO: Figure out how to tell if a user is a member
metrics['isMember'] = False


In [154]:
# 10 - commentsTotalCount: The # of comments left on a PR
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics.drop(columns=['commentsTotalCount', 'pr_id'], errors='ignore', inplace=True)

# Count the number of Comments for the Pull Request, name the column what we want.
pr_comments_count = pr_comments.groupby(['pr_id']).size().reset_index(name='commentsTotalCount')

# Merge the Dataframes using a left join
metrics = pd.merge(metrics, pr_comments_count, left_on='id', right_on='pr_id', how='left')

# Garbage Collect the temporary Dataframe
pr_comments_count = None

# Fill N/A values with defaults
metrics['commentsTotalCount'] = metrics['commentsTotalCount'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
metrics.drop(columns=['pr_id'], errors='ignore', inplace=True)

In [155]:
# 11 - authorComments: The # of comments left by the PR author
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['authorComments', 'pr_id'], errors='ignore')

# Filter comments to only include those made by the PR author
# Need to merge with metrics to get the author_id for each pr_comment
author_comments = pd.merge(pr_comments, metrics[['id', 'user_id']], left_on='pr_id', right_on='id', how='left')
author_comments = author_comments[author_comments['user_id_x'] == author_comments['user_id_y']]

# Count the number of author comments per pull request
author_comments_count = author_comments.groupby(['pr_id']).size().reset_index(name='authorComments')

# Merge the Dataframes using a left join
metrics = pd.merge(metrics, author_comments_count, left_on='id', right_on='pr_id', how='left')

# Garbage Collect the temporary Dataframes
author_comments = None
author_comments_count = None

# Fill N/A values with defaults
metrics['authorComments'] = metrics['authorComments'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
metrics = metrics.drop(columns=['pr_id'], errors='ignore')

In [156]:
# 12 - reviewersComments: The # of comments left by the reviewers who participate in the discussion
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['reviewersComments', 'pr_id'], errors='ignore')

# Filter comments to exclude those made by the PR author
# Need to merge with metrics to get the author_id for each pr_comment
reviewer_comments = pd.merge(pr_comments, metrics[['id', 'user_id']], left_on='pr_id', right_on='id', how='left')
reviewer_comments = reviewer_comments[reviewer_comments['user_id_x'] != reviewer_comments['user_id_y']]

# Count the number of reviewer comments per pull request
reviewer_comments_count = reviewer_comments.groupby(['pr_id']).size().reset_index(name='reviewersComments')

# Merge the Dataframes using a left join
metrics = pd.merge(metrics, reviewer_comments_count, left_on='id', right_on='pr_id', how='left')

# Garbage Collect the temporary Dataframes
reviewer_comments = None
reviewer_comments_count = None

# Fill N/A values with defaults
metrics['reviewersComments'] = metrics['reviewersComments'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
metrics = metrics.drop(columns=['pr_id'], errors='ignore')

In [157]:
# 13 - reviewersTotalCount: The # of developers who participate in the discussion.
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['reviewersTotalCount', 'pr_id'], errors='ignore')

# Extract user_id from the nested 'user' column in pr_review_comments
pr_review_comments['user_id_from_user'] = pr_review_comments['user'].apply(lambda x: x.get('id') if isinstance(x, dict) else None)

# Extract user_id from the nested 'user' column in pr_reviews
pr_reviews['user_id_from_user'] = pr_reviews['user'].apply(lambda x: x.get('id') if isinstance(x, dict) else None)

# Get author_id from metrics for merging
metrics['author_id_from_author'] = metrics['user'].apply(lambda x: x.get('id') if isinstance(x, dict) else None)


# Map each review comment to its PR through the parent review.
# Assumption: pr_reviews['id'] is the review ID that 'pull_request_review_id' refers to.
reviewer_comments_users = pd.merge(pr_review_comments[['pull_request_review_id', 'user_id_from_user']],
                                   pr_reviews[['id', 'pr_id']],
                                   left_on='pull_request_review_id', right_on='id',
                                   how='inner')[['pr_id', 'user_id_from_user']]

# Reviewers who submitted a review
review_users = pr_reviews[['pr_id', 'user_id_from_user']]

# Combine both sources so that a reviewer active in both is counted only once
reviewers_total = pd.concat([reviewer_comments_users, review_users], ignore_index=True)

# Attach the PR author and exclude the author's own activity
reviewers_total = pd.merge(reviewers_total, metrics[['id', 'author_id_from_author']], left_on='pr_id', right_on='id', how='inner')
reviewers_total = reviewers_total[reviewers_total['user_id_from_user'] != reviewers_total['author_id_from_author']]

# Count the unique reviewers per pull request
reviewers_total = (reviewers_total.groupby('pr_id')['user_id_from_user']
                                  .nunique()
                                  .reset_index(name='reviewersTotalCount'))

# Merge the Dataframes using a left join
metrics = pd.merge(metrics, reviewers_total, left_on='id', right_on='pr_id', how='left')

# Garbage Collect temporary dataframes
reviewer_comments_users = None
review_users = None
reviewers_total = None

# Fill N/A values with defaults
metrics['reviewersTotalCount'] = metrics['reviewersTotalCount'].fillna(0).astype(int)

# Drop the 'pr_id' column that was left as an artifact of the merge
metrics = metrics.drop(columns=['pr_id'], errors='ignore')

# Drop the temporary user_id and author_id columns
pr_review_comments = pr_review_comments.drop(columns=['user_id_from_user'], errors='ignore')
pr_reviews = pr_reviews.drop(columns=['user_id_from_user'], errors='ignore')
metrics = metrics.drop(columns=['author_id_from_author'], errors='ignore')

In [159]:
# 14 - repoAge: Time interval between the repository creation time and PR creation time in days.
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['repoAge', 'repo_created_at'], errors='ignore')

# Keep only the repository ID and creation time, renamed so that the creation time
# does not collide with the PR's own 'created_at' column during the merge
repos_temp = (repos[['id', 'created_at']]
                   .rename(columns={'id': 'repo_id', 'created_at': 'repo_created_at'}))

# Merge the Dataframes using a left join
metrics = pd.merge(metrics, repos_temp, on='repo_id', how='left')

# Garbage Collect the temporary dataframe
repos_temp = None

# Drop from Metrics any Repo without a repo created date
metrics = metrics.dropna(subset=['repo_created_at'])

# Calculate the Repo Age in whole days (created_at - repo_created_at).
# Both timestamps are normalised to UTC so tz-aware and tz-naive values can be subtracted.
pr_created = pd.to_datetime(metrics['created_at'], utc=True)
repo_created = pd.to_datetime(metrics['repo_created_at'], utc=True)
metrics['repoAge'] = (pr_created - repo_created).dt.days

# Drop the unnecessary Repo Created At column
metrics = metrics.drop(columns=['repo_created_at'], errors='ignore')

# Fill N/A values with defaults
metrics['repoAge'] = metrics['repoAge'].fillna(0).astype(int)

In [None]:
# 15 - repoLanguage: The repository language that a PR belongs to (top language)
# 16 - forkCount: The # of forks that a repository has
# 17 - stargazerCount: The # of stargazers that a repository has.
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['repoLanguage', 'forkCount', 'stargazerCount'], errors='ignore')

repos = (repository.copy()
                   .drop(columns=['license', 'repo_url', 'html_url', 'full_name'], errors='ignore')
                   .rename(columns={'id': 'repo_id', 'language': 'repoLanguage', 'forks': 'forkCount', 'stars': 'stargazerCount'}))

# Group by ID and get the First Record
repos = repos.groupby(['repo_id']).first().reset_index() # Add reset_index() to make repo_id a column again

# Merge the Dataframes using a left join
metrics = pd.merge(metrics, repos, left_on='repo_id', right_on='repo_id', how='left')

# Garbage Collect the temporary Dataframe
repos = None

# Fill N/A values with defaults
metrics['repoLanguage'] = metrics['repoLanguage'].fillna('other')
metrics['forkCount'] = metrics['forkCount'].fillna(0).astype(int)
metrics['stargazerCount'] = metrics['stargazerCount'].fillna(0).astype(int)

In [None]:
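# 18 - state: State of the pull request (MERGED or CLOSED).
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['isMerged'], errors='ignore')

# A PR counts as merged when merged_at is set; notna() also treats NaN/NaT as not merged
metrics['isMerged'] = metrics['merged_at'].notna().astype(int)

# Derive the MERGED/CLOSED label described above (open PRs were removed earlier)
metrics['state'] = np.where(metrics['isMerged'] == 1, 'MERGED', 'CLOSED')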

In [None]:
# 19 - reviewTime: Time taken to review the PR (in hours, floating point, no rounding).
# Make sure we don't crash because the columns already exist (re-entrant code)
metrics = metrics.drop(columns=['reviewTime'], errors='ignore')

# Calculate review time in hours (closed_at - created_at), handling potential NaT values
metrics['reviewTime'] = (metrics['closed_at'] - metrics['created_at']).dt.total_seconds() / 3600

# Fill N/A values with defaults (e.g., for open PRs)
metrics['reviewTime'] = metrics['reviewTime'].fillna(0)

In [None]:
csv_order = ['repoLanguage',
'forkCount',
'stargazerCount',
'repoAge',
'state',
'deletions',
'additions',
'changedFiles',
'commentsTotalCount',
'commitsTotalCount',
'prExperience',
'isMember',
'authorComments',
'reviewersComments',
'reviewersTotalCount',
'bodyLength',
'prSize',
'reviewTime',
'purpose']

# Keep only the CSV columns, in the expected order
metrics = metrics[csv_order]
display(metrics.columns)

# **Fourth**, Bot detection and filtering employed the methodology of Golzadeh et al. (2022)

The method combines a simple “bot” username suffix check with a comprehensive, manually verified list of 527 bot accounts.
* groundtruthbots.csv - a list of bots from Golzadeh et al.
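
A minimal sketch of the two-part check described above. The helper names `normalise_bot_list` and `is_bot` are hypothetical, and the account names in the example are made up for illustration. How the logins are read out of groundtruthbots.csv is left open because its column layout is not described here; the sketch simply takes an iterable of logins.

```python
def normalise_bot_list(logins):
    """Lower-case a collection of known bot logins for case-insensitive lookup."""
    return frozenset(login.lower() for login in logins if login)


def is_bot(login, known_bots):
    """True if the login ends with 'bot' or appears in the curated bot list."""
    if not login:
        return False
    name = login.lower()
    return name.endswith("bot") or name in known_bots


# Hypothetical stand-in for the curated list of 527 bot accounts
known_bots = normalise_bot_list(["Example-CI-Helper"])
authors = ["alice", "renovate-bot", "example-ci-helper"]
human_authors = [a for a in authors if not is_bot(a, known_bots)]
```

The same predicate could be applied to both PR authors and comment authors before computing the metrics.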


# **Fifth**, Adoption Trend (RQ1)

* Counted occurrences of each marker tag; `copilot:summary` was the most frequent (13,231 instances).
* Visualised cumulative PRs over time (Fig. 3) and proportion of PRs per repository (Fig. 4).
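
The two RQ1 steps above can be sketched as follows. The input frame is hypothetical: it assumes one row per Copilot PR with a `created_at` timestamp and a `tags` list of the marker tags found in its description. The values are made up and do not come from the study.

```python
from collections import Counter

import pandas as pd

# Hypothetical input: one row per Copilot PR
prs = pd.DataFrame({
    "created_at": pd.to_datetime(["2023-03-01", "2023-03-01", "2023-03-05", "2023-04-02"]),
    "tags": [
        ["copilot:summary"],
        ["copilot:summary", "copilot:walkthrough"],
        ["copilot:all"],
        ["copilot:summary"],
    ],
})

# Occurrences of each marker tag across all PRs
tag_counts = Counter(tag for tags in prs["tags"] for tag in tags)

# PRs per day, then the running total that a cumulative plot would show
daily = prs.groupby(prs["created_at"].dt.normalize()).size()
cumulative = daily.cumsum()
```

Calling `cumulative.plot()` would give a curve of the kind the cumulative figure describes, assuming matplotlib is available.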


# **Sixth**, Causal Inference (RQ2)

### Propensity‑Score Estimation
Logistic regression (treatment = Copilot usage) on the 17 covariates.
Estimated each PR’s probability of receiving the treatment (ps).
### Weight Construction
Inverse‑probability weights: 1/ps for treated, 1/(1‑ps) for control.
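
The propensity-score and weighting steps can be sketched with NumPy alone. This is an illustration on synthetic data, not the study's R code: the logistic regression is fitted by Newton-Raphson, and the function names `fit_logistic`, `propensity_scores`, and `ip_weights` are hypothetical.

```python
import numpy as np


def fit_logistic(X, t, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; returns [intercept, coefs...]."""
    design = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (t - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def propensity_scores(X, beta):
    """Predicted probability of treatment for each row."""
    design = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-design @ beta))


def ip_weights(t, ps):
    """Inverse-probability weights: 1/ps for treated, 1/(1-ps) for control."""
    return np.where(t == 1, 1.0 / ps, 1.0 / (1.0 - ps))


# Synthetic covariates and treatment assignment (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_ps = 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.25 * X[:, 1])))
t = (rng.random(500) < true_ps).astype(float)

beta = fit_logistic(X, t)
ps = propensity_scores(X, beta)
weights = ip_weights(t, ps)
```

In practice a library routine would normally replace the hand-written fit; the point here is only how `ps` feeds into the weights.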
### Entropy Balancing
Applied the entropy‑balancing algorithm (equivalent to R’s ebalance) to adjust the raw weights so that the weighted means of all covariates matched exactly between groups.
After balancing, absolute mean differences for every covariate were ≤ 0.10 (Fig. 2).
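
A balance check of this kind might look like the sketch below. The text reports absolute mean differences; dividing by the pooled unweighted standard deviation is an assumption made here so that covariates on different scales are comparable, and the function name `weighted_smd` is hypothetical.

```python
import numpy as np


def weighted_smd(x, t, w):
    """Weighted mean difference (treated minus control) in pooled-SD units."""
    treated, control = t == 1, t == 0
    mean_t = np.average(x[treated], weights=w[treated])
    mean_c = np.average(x[control], weights=w[control])
    pooled_sd = np.sqrt((x[treated].var(ddof=1) + x[control].var(ddof=1)) / 2.0)
    return (mean_t - mean_c) / pooled_sd


# Made-up covariate values: three treated PRs followed by three control PRs
x = np.array([2.0, 3.0, 4.0, 1.0, 2.0, 3.0])
t = np.array([1, 1, 1, 0, 0, 0])

unweighted = weighted_smd(x, t, np.ones(6))
# Weights that shift the control mean onto the treated mean
balanced = weighted_smd(x, t, np.array([1.0, 1.0, 1.0, 0.0, 0.0, 1.0]))
```

A covariate would pass the threshold quoted above when `abs(weighted_smd(...)) <= 0.10` under the final weights.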
### Outcome Regression
* Review time (continuous): weighted ordinary least squares (lm analogue) with only the treatment indicator. The coefficient gave the Average Treatment Effect on the Treated (ATT) of ‑19.3 h (p ≈ 1.6 × 10⁻¹⁷).
* Merge outcome (binary): weighted logistic regression (glm with logit link). The exponentiated treatment coefficient yielded an odds ratio of 1.57 (95 % CI [1.35, 1.84], p < 0.001).

These two models answer RQ2.1 (review‑time reduction) and RQ2.2 (higher merge likelihood).
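
To make the link between a logit coefficient and the reported odds ratio concrete, here is a small worked sketch. The coefficient and standard error below are back-calculated from the reported odds ratio and confidence interval purely for illustration; they are not taken from the model output, and the helper name `odds_ratio_with_ci` is hypothetical.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Exponentiate a logit coefficient and its Wald interval bounds."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Back-calculated from the reported OR of 1.57 and 95% CI [1.35, 1.84]
beta = math.log(1.57)
se = (math.log(1.84) - math.log(1.35)) / (2 * 1.96)

odds_ratio, ci_low, ci_high = odds_ratio_with_ci(beta, se)
```

The round trip lands close to the reported interval, with small differences expected because the published figures are rounded.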


# The R Scripts
The main difference between PMW_merge.R and PMW_review.R is:

* PMW_merge.R includes the column isMerged, which indicates whether each pull request was merged (state == "MERGED"). This column is added to the modeling data and used in the analysis.
* PMW_review.R does not include the isMerged column in its modeling data; it focuses only on review-related metrics.
* Otherwise, both scripts process the same input data, use similar covariates, and prepare for causal inference analysis. The inclusion of isMerged in PMW_merge.R allows for analysis related to PR merge status, while PMW_review.R is focused on review characteristics.