# Pull Requests Dataset Preparation

## Load Libraries

Load needed libraries.

In [1]:
from data_preparation import *
from datetime import datetime
import numpy as np
import pandas as pd

## Load Dataset

Load the datasets from two different sources and combine them into one data frame.

In [2]:
# Paths to datasets
prs_vm1_path = '../data/prs-vm1.csv'
prs_vm2_path = '../data/prs-vm2.csv'

# Combine datasets from different sources
df_prs = pd.concat(map(pd.read_csv, [prs_vm1_path, prs_vm2_path]))
df_prs.head()

Unnamed: 0,cursor,owner,name,sshUrl,url,baseRepo,baseRef,baseRefPrefix,headRepo,headRef,...,draft,files,createdAt,publishedAt,mergedAt,closedAt,groupId,artifactId,version,filePath
0,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,turrisxyz/javaparser,Pinned-Dependencies-GitHub,...,False,2,2022-06-26,2022-06-26,,,com.github.javaparser,javaparser-parent,3.24.3-SNAPSHOT,.github/workflows/create_github_release.yml
1,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,turrisxyz/javaparser,Pinned-Dependencies-GitHub,...,False,2,2022-06-26,2022-06-26,,,com.github.javaparser,javaparser-parent,3.24.3-SNAPSHOT,.github/workflows/maven_tests.yml
2,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,turrisxyz/javaparser,Pinned-Dependencies-GitHub,...,False,2,2022-06-26,2022-06-26,,,com.github.javaparser,javaparser-parent,3.24.3-SNAPSHOT,.github/workflows/create_github_release.yml
3,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,turrisxyz/javaparser,Pinned-Dependencies-GitHub,...,False,2,2022-06-26,2022-06-26,,,com.github.javaparser,javaparser-parent,3.24.3-SNAPSHOT,.github/workflows/maven_tests.yml
4,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,iChenLei/javaparser,patch-1,...,False,1,2022-06-17,2022-06-17,,,com.github.javaparser,javaparser-parent,3.24.3-SNAPSHOT,readme.md


In [3]:
df_prs['state']

0         OPEN
1         OPEN
2         OPEN
3         OPEN
4         OPEN
          ... 
564661    OPEN
564662    OPEN
564663    OPEN
564664    OPEN
564665    OPEN
Name: state, Length: 997289, dtype: object
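The index in the output above ends at 564665 although the frame holds 997,289 rows, which suggests that `pd.concat` kept each file's own row index. The sketch below uses two small in-memory frames as hypothetical stand-ins for the CSV files to show the effect, and how `ignore_index=True` yields a unique index. Whether a unique index matters here depends on whether later steps rely on label-based access.

```python
import pandas as pd

# Hypothetical stand-ins for the two source CSV files.
df_a = pd.DataFrame({'state': ['OPEN', 'OPEN', 'OPEN']})
df_b = pd.DataFrame({'state': ['OPEN', 'OPEN']})

# Default behaviour: each frame keeps its own 0-based index.
kept = pd.concat([df_a, df_b])

# ignore_index=True renumbers the rows from 0 to n - 1.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
print(kept.index.is_unique, renumbered.index.is_unique)
```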

## Validate Data

Perform a set of sanity checks to validate the source data.

In [4]:
rows, columns = df_prs.shape
df_prs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 997289 entries, 0 to 564665
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   cursor         997289 non-null  object 
 1   owner          997289 non-null  object 
 2   name           997289 non-null  object 
 3   sshUrl         997289 non-null  object 
 4   url            997289 non-null  object 
 5   baseRepo       997289 non-null  object 
 6   baseRef        997289 non-null  object 
 7   baseRefPrefix  997289 non-null  object 
 8   headRepo       997289 non-null  object 
 9   headRef        997289 non-null  object 
 10  headRefPrefix  997289 non-null  object 
 11  title          997289 non-null  object 
 12  number         997289 non-null  int64  
 13  state          997289 non-null  object 
 14  draft          997289 non-null  bool   
 15  files          997289 non-null  int64  
 16  createdAt      997289 non-null  object 
 17  publishedAt    997289 non-nul

All columns seem to have non-null values.

### Sanity Checks

**Validate `sshUrl` column:** Verify that all SSH URLs are correctly built and match the expected repository owner/name.

In [5]:
ssh_url_check = validate_urls('sshUrl', df_prs)
assert ssh_url_check, 'sshUrl check'

Records have right scheme: True
Records have right authority: True
Records have right path: True


**Validate `url` column:** Verify that all HTTP URLs are correctly built and match the expected repository owner/name.

In [6]:
http_url_check = validate_urls('url', df_prs)
assert http_url_check, 'url check'

Records have right scheme: True
Records have right authority: True
Records have right path: True
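The `validate_urls` helper comes from the `data_preparation` module, whose source is not shown here. As a rough illustration of the kind of check it performs on the `url` column, the sketch below rebuilds the expected URL from `owner` and `name` and compares it with the stored value. The function name `check_http_urls` is hypothetical, and the real helper reports scheme, authority, and path separately.

```python
import pandas as pd

def check_http_urls(df: pd.DataFrame, column: str = 'url') -> bool:
    """Return True if every URL equals https://github.com/<owner>/<name>."""
    # Rebuild the expected URL row by row from the owner and name columns.
    expected = 'https://github.com/' + df['owner'] + '/' + df['name']
    return bool((df[column] == expected).all())

sample = pd.DataFrame({
    'owner': ['javaparser'],
    'name': ['javaparser'],
    'url': ['https://github.com/javaparser/javaparser'],
})
print(check_http_urls(sample))
```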


**Validate `baseRepo` column:** Verify that all base repository values follow the format "owner/name" and match the expected repository owner/name values.

In [7]:
base_repo_check = validate_repo_name('baseRepo', df_prs)
assert base_repo_check, 'baseRepo check'

Records have right repository name: True


**Validate `headRepo` column:** Verify that all head repository values follow the format "owner/name".

In [8]:
head_repo_check = validate_repo_name_format('headRepo', df_prs)
assert head_repo_check, 'headRepo check'

Records have right repository name format: True
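The two repository checks differ in strictness: `baseRepo` must equal the repository's own `owner/name`, while `headRepo` (often a fork) only needs to have the right shape. A minimal sketch of both ideas follows; the function names are hypothetical and are not the helpers from `data_preparation`.

```python
import pandas as pd

def check_repo_name_format(df: pd.DataFrame, column: str) -> bool:
    """True if every value looks like '<owner>/<name>' with a single slash."""
    pattern = r'[^/\s]+/[^/\s]+'
    return bool(df[column].str.fullmatch(pattern).all())

def check_repo_name(df: pd.DataFrame, column: str) -> bool:
    """True if every value equals the row's own '<owner>/<name>'."""
    expected = df['owner'] + '/' + df['name']
    return bool((df[column] == expected).all())

sample = pd.DataFrame({
    'owner': ['javaparser'],
    'name': ['javaparser'],
    'baseRepo': ['javaparser/javaparser'],
    'headRepo': ['turrisxyz/javaparser'],
})
print(check_repo_name(sample, 'baseRepo'))
print(check_repo_name_format(sample, 'headRepo'))
```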


**Validate `number` column:** Verify that all records are greater than 0.

In [9]:
number_check = validate_limit_num(1, 'number', df_prs)
assert number_check, 'number check'

Records have at least 1: True


**Validate `state` column:** Verify that all pull requests are open.

In [10]:
state_check = validate_value('OPEN', 'state', df_prs)
assert state_check, 'state check'

Records have value OPEN: True


**Validate `files` column:** Verify that all records are greater than 0.

In [11]:
files_check = validate_limit_num(1, 'files', df_prs)
assert files_check, 'files check'

Records have at least 1: True
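Both the `number` and `files` checks are lower-bound checks on a numeric column. Assuming that is all `validate_limit_num` does (its source is not shown), an equivalent check can be sketched with a vectorised comparison. The name `check_minimum` is hypothetical.

```python
import pandas as pd

def check_minimum(df: pd.DataFrame, column: str, minimum: int) -> bool:
    """True if every value in the column is at least `minimum`."""
    return bool((df[column] >= minimum).all())

# Small illustrative frame; the 0 in 'files' is deliberately invalid.
sample = pd.DataFrame({'number': [1, 42], 'files': [2, 0]})
print(check_minimum(sample, 'number', 1))
print(check_minimum(sample, 'files', 1))
```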


**Validate `createdAt` column:** Verify that all dates have the expected format "yyyy-MM-dd" and appear after a limit date.

In [12]:
last_created_date = datetime(2022, 4, 22)
created_at_check = validate_date('createdAt', df_prs)
created_at_limit_check = validate_limit_date(last_created_date, 'createdAt', df_prs)
#assert created_at_check and created_at_limit_check, 'createdAt check'

Records have right format: True
Records appear after 2022-04-22 00:00:00: False
Records appear before 2022-08-22 10:56:54.163176: True


**Validate `publishedAt` column:** Verify that all dates have the expected format "yyyy-MM-dd" and appear after a limit date.

In [13]:
last_pushed_date = datetime(2022, 4, 22)
pushed_at_format_check = validate_date('publishedAt', df_prs)
pushed_at_limit_check = validate_limit_date(last_pushed_date, 'publishedAt', df_prs)
#assert pushed_at_format_check and pushed_at_limit_check, 'publishedAt check'

Records have right format: True
Records appear after 2022-04-22 00:00:00: False
Records appear before 2022-08-22 10:56:54.714764: True
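Since the "appear after 2022-04-22" check reports `False` for both date columns, it can help to look at the offending rows rather than only at the boolean. The sketch below shows one way to validate the format with `pd.to_datetime` and to list rows older than a limit. The function names are hypothetical and the sample dates are illustrative only.

```python
from datetime import datetime

import pandas as pd

def check_date_format(df: pd.DataFrame, column: str, fmt: str = '%Y-%m-%d') -> bool:
    """True if every value parses with the given format."""
    parsed = pd.to_datetime(df[column], format=fmt, errors='coerce')
    return bool(parsed.notna().all())

def rows_before_limit(df: pd.DataFrame, column: str, limit: datetime) -> pd.DataFrame:
    """Return the rows whose date is strictly earlier than `limit`."""
    parsed = pd.to_datetime(df[column], format='%Y-%m-%d', errors='coerce')
    return df[parsed < limit]

sample = pd.DataFrame({'createdAt': ['2022-06-26', '2022-06-17', '2021-12-01']})
limit = datetime(2022, 4, 22)
print(check_date_format(sample, 'createdAt'))
print(rows_before_limit(sample, 'createdAt', limit))
```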


## Clean Dataset

Clean the current data frame.

**Remove all non-Java files:** We only care about pull requests that affect the library's code. The rest can be discarded.

In [14]:
initial_rows = df_prs.shape[0]
initial_prs = len(df_prs.groupby(['owner', 'name', 'number']).groups)

mask = df_prs['filePath'].str.match('^.+[.]java')
df_prs = df_prs[mask]

final_rows = df_prs.shape[0]
final_prs = len(df_prs.groupby(['owner', 'name', 'number']).groups)

print(f'Keep only Java files: {initial_rows - final_rows}/{initial_rows} rows and {initial_prs - final_prs}/{initial_prs} PRs have been removed')

print('')
print(f'Initial rows: {initial_rows}')
print(f'Final rows: {final_rows}')
print(f'Removed rows: {initial_rows - final_rows}')

print('')
print(f'Initial PRs: {initial_prs}')
print(f'Final PRs: {final_prs}')
print(f'Removed PRs: {initial_prs - final_prs}')

Keep only Java files: 230080/997289 rows and 611/1865 PRs have been removed

Initial rows: 997289
Final rows: 767209
Removed rows: 230080

Initial PRs: 1865
Final PRs: 1254
Removed PRs: 611
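One detail of the filter above: `Series.str.match` only anchors the pattern at the start of the string, so `'^.+[.]java'` also accepts paths that merely contain `.java` followed by more characters. The sketch below contrasts it with `Series.str.fullmatch` on a few made-up paths. Whether any such paths occur in this dataset is not known from the output above.

```python
import pandas as pd

# Made-up paths for illustration only.
paths = pd.Series(['src/Main.java', 'web/app.javascript', 'readme.md'])

# Pattern used in the cell above: anchored at the start only.
prefix_mask = paths.str.match('^.+[.]java')

# Stricter variant: the whole path must end in ".java".
full_mask = paths.str.fullmatch(r'.+[.]java')

print(prefix_mask.tolist())
print(full_mask.tolist())
```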


**Identify number of draft pull requests:** Check the number of draft pull requests. This information is useful in case we need to reduce the number of cases we are analyzing.

In [15]:
initial_rows = df_prs.shape[0]
initial_prs = len(df_prs.groupby(['owner', 'name', 'number']).groups)

mask = df_prs['draft'] == True

final_rows = df_prs[mask].shape[0]
final_prs = len(df_prs[mask].groupby(['owner', 'name', 'number']).groups)

print(f'Initial rows: {initial_rows}')
print(f'Final rows: {final_rows}')
print(f'Removed rows: {initial_rows - final_rows}')

print('')
print(f'Initial PRs: {initial_prs}')
print(f'Final PRs: {final_prs}')
print(f'Removed PRs: {initial_prs - final_prs}')

Initial rows: 767209
Final rows: 184404
Removed rows: 582805

Initial PRs: 1254
Final PRs: 126
Removed PRs: 1128


Draft pull requests account for 126 of the 1,254 pull requests (184,404 of the 767,209 rows); the "Removed" figures above count the non-draft remainder. Removing the drafts would still noticeably shrink the dataset, so we won't remove them.
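The share of draft pull requests can also be computed directly, counting each pull request once regardless of how many file rows it has. The sketch below uses a small made-up frame with the same key columns (`owner`, `name`, `number`) and `GroupBy.ngroups` to count distinct pull requests.

```python
import pandas as pd

# Made-up file-level rows: PR a/x#1 is a draft with two files.
sample = pd.DataFrame({
    'owner': ['a', 'a', 'a', 'b'],
    'name': ['x', 'x', 'x', 'y'],
    'number': [1, 1, 2, 7],
    'draft': [True, True, False, False],
})

keys = ['owner', 'name', 'number']
total_prs = sample.groupby(keys).ngroups
draft_prs = sample[sample['draft']].groupby(keys).ngroups

print(f'Draft PRs: {draft_prs}/{total_prs} ({draft_prs / total_prs:.1%})')
```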

**Change granularity of data:** As of right now, each record within the dataset represents a modified file within a pull request. We need to report only pull requests. The previous granularity was needed to discard non-code files.

In [16]:
# Remove "filePath" column
if 'filePath' in df_prs.columns:
    del df_prs['filePath']
    assert 'filePath' not in df_prs.columns

# Remove duplicates
df_prs = df_prs.drop_duplicates()
df_prs.head()

Unnamed: 0,cursor,owner,name,sshUrl,url,baseRepo,baseRef,baseRefPrefix,headRepo,headRef,...,state,draft,files,createdAt,publishedAt,mergedAt,closedAt,groupId,artifactId,version
5,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,...,OPEN,False,1,2022-06-10,2022-06-10,,,com.github.javaparser,javaparser-parent,3.24.3-SNAPSHOT
6,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,...,OPEN,False,1,2022-06-10,2022-06-10,,,com.github.javaparser,javaparser-symbol-solver-core,3.24.3-SNAPSHOT
9,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,MysterAitch/javaparser,2446--solving_records,...,OPEN,True,18,2022-06-07,2022-06-07,,,com.github.javaparser,javaparser-parent,3.24.3-SNAPSHOT
45,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,MysterAitch/javaparser,2446--solving_records,...,OPEN,True,18,2022-06-07,2022-06-07,,,com.github.javaparser,javaparser-core,3.24.3-SNAPSHOT
160,Y3Vyc29yOjY=,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,MysterAitch/javaparser,2446--solving_records,...,OPEN,True,18,2022-06-07,2022-06-07,,,com.github.javaparser,javaparser-symbol-solver-core,3.24.3-SNAPSHOT
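As the output above shows, a pull request can still appear on several rows after deduplication, once per affected `artifactId`. The sketch below reproduces this on a made-up frame: dropping `filePath` and removing duplicates yields one row per pull request and package, not one row per pull request.

```python
import pandas as pd

# Made-up file-level rows for a single pull request touching two packages.
sample = pd.DataFrame({
    'number': [10, 10, 10],
    'artifactId': ['core', 'core', 'parent'],
    'filePath': ['A.java', 'B.java', 'A.java'],
})

# Drop the file-level column, then collapse identical rows.
per_package = sample.drop(columns='filePath').drop_duplicates()

print(per_package)
print(per_package['number'].nunique())
```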


**Remove `cursor` column:** We remove the `cursor` column given that this value was only used when querying the GitHub API. It won't provide any additional information during the study analysis.

In [17]:
# Remove "cursor" column
if 'cursor' in df_prs.columns:
    del df_prs['cursor']
    assert 'cursor' not in df_prs.columns

In [18]:
df_prs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2532 entries, 5 to 564660
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   owner          2532 non-null   object 
 1   name           2532 non-null   object 
 2   sshUrl         2532 non-null   object 
 3   url            2532 non-null   object 
 4   baseRepo       2532 non-null   object 
 5   baseRef        2532 non-null   object 
 6   baseRefPrefix  2532 non-null   object 
 7   headRepo       2532 non-null   object 
 8   headRef        2532 non-null   object 
 9   headRefPrefix  2532 non-null   object 
 10  title          2532 non-null   object 
 11  number         2532 non-null   int64  
 12  state          2532 non-null   object 
 13  draft          2532 non-null   bool   
 14  files          2532 non-null   int64  
 15  createdAt      2532 non-null   object 
 16  publishedAt    2532 non-null   object 
 17  mergedAt       0 non-null      float64
 18  closed

**Keep cases that have clients:** We need to cross-check against the clients data frame to verify that every package impacted by a pull request has clients.

In [19]:
# Read clients dataset
clients_path = '../data/clients-cleaned.csv'
df_clients = pd.read_csv(clients_path)
df_clients.head()

Unnamed: 0,owner,name,sshUrl,url,stars,createdAt,pushedAt,packages,groupId,artifactId,version,path,clients,relevantClients,cowner,cname,csshUrl,curl,cstars
0,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,4290,2011-10-30,2022-07-27,5,com.github.javaparser,javaparser-core,3.24.3-SNAPSHOT,javaparser-core/,3166,27,kiegroup,optaplanner,ssh://git@github.com:kiegroup/optaplanner.git,https://github.com/kiegroup/optaplanner,2676
1,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,4290,2011-10-30,2022-07-27,5,com.github.javaparser,javaparser-core,3.24.3-SNAPSHOT,javaparser-core/,3166,27,Azure,azure-sdk-for-java,ssh://git@github.com:Azure/azure-sdk-for-java.git,https://github.com/Azure/azure-sdk-for-java,1556
2,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,4290,2011-10-30,2022-07-27,5,com.github.javaparser,javaparser-core,3.24.3-SNAPSHOT,javaparser-core/,3166,27,fabric8io,kubernetes-client,ssh://git@github.com:fabric8io/kubernetes-clie...,https://github.com/fabric8io/kubernetes-client,2579
3,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,4290,2011-10-30,2022-07-27,5,com.github.javaparser,javaparser-core,3.24.3-SNAPSHOT,javaparser-core/,3166,27,abstracta,jmeter-java-dsl,ssh://git@github.com:abstracta/jmeter-java-dsl...,https://github.com/abstracta/jmeter-java-dsl,186
4,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,4290,2011-10-30,2022-07-27,5,com.github.javaparser,javaparser-core,3.24.3-SNAPSHOT,javaparser-core/,3166,27,kiegroup,drools,ssh://git@github.com:kiegroup/drools.git,https://github.com/kiegroup/drools,4563


In [20]:
# Merge based on common keys
keys = ['owner', 'name', 'sshUrl', 'url', 'groupId', 'artifactId', 'version']
df_combined = pd.merge(df_prs, df_clients, how='inner', on=keys)
df_combined.head()

Unnamed: 0,owner,name,sshUrl,url,baseRepo,baseRef,baseRefPrefix,headRepo,headRef,headRefPrefix,...,pushedAt,packages,path,clients,relevantClients,cowner,cname,csshUrl,curl,cstars
0,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,dunwu,java-tutorial,ssh://git@github.com:dunwu/java-tutorial.git,https://github.com/dunwu/java-tutorial,1188
1,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,awspring,spring-cloud-aws,ssh://git@github.com:awspring/spring-cloud-aws...,https://github.com/awspring/spring-cloud-aws,312
2,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,kiegroup,kogito-runtimes,ssh://git@github.com:kiegroup/kogito-runtimes.git,https://github.com/kiegroup/kogito-runtimes,356
3,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,eclipse,deeplearning4j,ssh://git@github.com:eclipse/deeplearning4j.git,https://github.com/eclipse/deeplearning4j,12560
4,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,davidfantasy,mybatis-plus-generator-ui,ssh://git@github.com:davidfantasy/mybatis-plus...,https://github.com/davidfantasy/mybatis-plus-g...,409


In [21]:
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10027 entries, 0 to 10026
Data columns (total 34 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   owner            10027 non-null  object 
 1   name             10027 non-null  object 
 2   sshUrl           10027 non-null  object 
 3   url              10027 non-null  object 
 4   baseRepo         10027 non-null  object 
 5   baseRef          10027 non-null  object 
 6   baseRefPrefix    10027 non-null  object 
 7   headRepo         10027 non-null  object 
 8   headRef          10027 non-null  object 
 9   headRefPrefix    10027 non-null  object 
 10  title            10027 non-null  object 
 11  number           10027 non-null  int64  
 12  state            10027 non-null  object 
 13  draft            10027 non-null  bool   
 14  files            10027 non-null  int64  
 15  createdAt_x      10027 non-null  object 
 16  publishedAt      10027 non-null  object 
 17  mergedAt    
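Two optional `pd.merge` parameters can make this cross-check easier to inspect. `suffixes` replaces the default `_x`/`_y` names for overlapping columns such as `createdAt`, and `indicator=True` adds a `_merge` column that marks rows found in only one frame. The sketch below uses small made-up frames and a left merge; the suffix names `_pr` and `_repo` are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Made-up pull requests: the 'parent' package has no clients.
prs = pd.DataFrame({
    'artifactId': ['core', 'parent'],
    'number': [1, 2],
    'createdAt': ['2022-06-10', '2022-06-07'],
})
clients = pd.DataFrame({
    'artifactId': ['core', 'core'],
    'cname': ['client-a', 'client-b'],
    'createdAt': ['2011-10-30', '2011-10-30'],
})

merged = pd.merge(prs, clients, how='left', on='artifactId',
                  suffixes=('_pr', '_repo'), indicator=True)

# Rows marked 'left_only' are pull requests without any client.
without_clients = merged[merged['_merge'] == 'left_only']
print(merged.columns.tolist())
print(without_clients['number'].tolist())
```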

In [22]:
# Remove extra clients dataset columns
remove_columns = set(df_clients.columns) - set(df_prs)
remove_columns.add('createdAt_y')

for col in remove_columns:
    if col in df_combined.columns:
        del df_combined[col]
        assert col not in df_combined.columns

        
# Record the number of PR records before filtering
initial_prs = df_prs.shape[0]

# Remove duplicates (only keep PRs with potentially impacted clients)
df_prs = df_combined.drop_duplicates()

# Rename createdAt_x column
df_prs = df_prs.rename(columns={'createdAt_x': 'createdAt'})

final_prs = df_prs.shape[0]
print(f'Initial PRs: {initial_prs}')
print(f'Final PRs: {final_prs}')
print(f'Removed PRs: {initial_prs - final_prs}')

Initial PRs: 2532
Final PRs: 921
Removed PRs: 1611


**Note:** Manually check some of the removed cases.

In [23]:
# Merge based on common keys
keys = ['owner', 'name', 'sshUrl', 'url', 'groupId', 'artifactId', 'version']
df_combined = pd.merge(df_prs, df_clients, how='outer', on=keys)
df_combined.head()

Unnamed: 0,owner,name,sshUrl,url,baseRepo,baseRef,baseRefPrefix,headRepo,headRef,headRefPrefix,...,pushedAt,packages,path,clients,relevantClients,cowner,cname,csshUrl,curl,cstars
0,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,dunwu,java-tutorial,ssh://git@github.com:dunwu/java-tutorial.git,https://github.com/dunwu/java-tutorial,1188
1,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,awspring,spring-cloud-aws,ssh://git@github.com:awspring/spring-cloud-aws...,https://github.com/awspring/spring-cloud-aws,312
2,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,kiegroup,kogito-runtimes,ssh://git@github.com:kiegroup/kogito-runtimes.git,https://github.com/kiegroup/kogito-runtimes,356
3,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,eclipse,deeplearning4j,ssh://git@github.com:eclipse/deeplearning4j.git,https://github.com/eclipse/deeplearning4j,12560
4,javaparser,javaparser,ssh://git@github.com:javaparser/javaparser.git,https://github.com/javaparser/javaparser,javaparser/javaparser,master,refs/heads/,yufanyufan/javaparser,patch-1,refs/heads/,...,2022-07-27,5,javaparser-symbol-solver-core/,1295,11,davidfantasy,mybatis-plus-generator-ui,ssh://git@github.com:davidfantasy/mybatis-plus...,https://github.com/davidfantasy/mybatis-plus-g...,409


In [24]:
# Rename the createdAt columns produced by the merge
df_combined = df_combined.rename(columns={'createdAt_x': 'prCreatedAt'})
df_combined = df_combined.rename(columns={'createdAt_y': 'createdAt'})
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11835 entries, 0 to 11834
Data columns (total 34 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   owner            11835 non-null  object 
 1   name             11835 non-null  object 
 2   sshUrl           11835 non-null  object 
 3   url              11835 non-null  object 
 4   baseRepo         9991 non-null   object 
 5   baseRef          9991 non-null   object 
 6   baseRefPrefix    9991 non-null   object 
 7   headRepo         9991 non-null   object 
 8   headRef          9991 non-null   object 
 9   headRefPrefix    9991 non-null   object 
 10  title            9991 non-null   object 
 11  number           9991 non-null   float64
 12  state            9991 non-null   object 
 13  draft            9991 non-null   object 
 14  files            9991 non-null   float64
 15  prCreatedAt      9991 non-null   object 
 16  publishedAt      9991 non-null   object 
 17  mergedAt    
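In the output above, `number` and `files` are now `float64` and `draft` is `object`. This is a side effect of the outer merge: rows that exist only in the clients frame get missing values in the pull request columns, and pandas widens the dtypes to hold them. The sketch below reproduces the effect on made-up frames and shows one way to restore the dtypes once the unmatched rows are filtered out.

```python
import pandas as pd

prs = pd.DataFrame({'artifactId': ['core'], 'number': [1], 'draft': [False]})
clients = pd.DataFrame({'artifactId': ['core', 'parent'], 'cname': ['a', 'b']})

# 'parent' has no pull request, so its PR columns are filled with NaN.
outer = pd.merge(prs, clients, how='outer', on='artifactId')
print(outer.dtypes)

# After dropping the unmatched rows, the original dtypes can be restored.
restored = outer[outer['number'].notna()].astype({'number': 'int64', 'draft': 'bool'})
print(restored.dtypes)
```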

In [25]:
# Remove cases where no PR has been registered
df_combined = df_combined[df_combined['baseRepo'].notnull()]
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9991 entries, 0 to 9990
Data columns (total 34 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   owner            9991 non-null   object 
 1   name             9991 non-null   object 
 2   sshUrl           9991 non-null   object 
 3   url              9991 non-null   object 
 4   baseRepo         9991 non-null   object 
 5   baseRef          9991 non-null   object 
 6   baseRefPrefix    9991 non-null   object 
 7   headRepo         9991 non-null   object 
 8   headRef          9991 non-null   object 
 9   headRefPrefix    9991 non-null   object 
 10  title            9991 non-null   object 
 11  number           9991 non-null   float64
 12  state            9991 non-null   object 
 13  draft            9991 non-null   object 
 14  files            9991 non-null   float64
 15  prCreatedAt      9991 non-null   object 
 16  publishedAt      9991 non-null   object 
 17  mergedAt      

In [26]:
# Define columns order
cols = ['owner', 'name', 'sshUrl', 'url', 'stars', 'createdAt', 'pushedAt',
       'packages', 'groupId', 'artifactId', 'version', 'path', 'clients', 
       'relevantClients', 'cowner', 'cname', 'csshUrl', 'curl', 'cstars',
       'number', 'title', 'state', 'draft', 'files', 'prCreatedAt', 'publishedAt',
       'mergedAt', 'closedAt', 'baseRepo', 'baseRef', 'baseRefPrefix', 'headRepo', 
        'headRef', 'headRefPrefix']

# Check that the number of columns is preserved
assert len(cols) == len(df_combined.columns.tolist())

# Re-arrange columns order
df_combined = df_combined[cols]
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9991 entries, 0 to 9990
Data columns (total 34 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   owner            9991 non-null   object 
 1   name             9991 non-null   object 
 2   sshUrl           9991 non-null   object 
 3   url              9991 non-null   object 
 4   stars            9991 non-null   int64  
 5   createdAt        9991 non-null   object 
 6   pushedAt         9991 non-null   object 
 7   packages         9991 non-null   int64  
 8   groupId          9991 non-null   object 
 9   artifactId       9991 non-null   object 
 10  version          9991 non-null   object 
 11  path             9991 non-null   object 
 12  clients          9991 non-null   int64  
 13  relevantClients  9991 non-null   int64  
 14  cowner           9991 non-null   object 
 15  cname            9991 non-null   object 
 16  csshUrl          9991 non-null   object 
 17  curl          
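The length assertion in the cell above confirms that the number of columns is preserved, but it would not notice a duplicated or misspelled name that happens to keep the count equal. A stricter variant, sketched here on a made-up frame, compares the two sets of names before reordering.

```python
import pandas as pd

df = pd.DataFrame({'b': [1], 'a': [2], 'c': [3]})
cols = ['a', 'b', 'c']

# Names present in the frame but absent from the target order, and vice versa.
missing = set(df.columns) - set(cols)
unexpected = set(cols) - set(df.columns)
assert not missing and not unexpected, (missing, unexpected)

# Guard against a name listed twice in the target order.
assert len(cols) == len(set(cols))

df = df[cols]
print(df.columns.tolist())
```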

In [27]:
# Rename column names
df_combined = df_combined.rename(columns={
    'groupId': 'pkgGroupId',
    'artifactId': 'pkgArtifactId', 
    'version': 'pkgVersion', 
    'path': 'pkgPath', 
    'clients': 'pkgClients', 
    'relevantClients': 'pkgRelevantClients',
    'number': 'prNumber', 
    'title': 'prTitle', 
    'state': 'prState', 
    'draft': 'prDraft', 
    'files': 'prModifiedFiles', 
    'publishedAt': 'prPublishedAt',
    'mergedAt': 'prMergedAt', 
    'closedAt': 'prClosedAt'
})
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9991 entries, 0 to 9990
Data columns (total 34 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   owner               9991 non-null   object 
 1   name                9991 non-null   object 
 2   sshUrl              9991 non-null   object 
 3   url                 9991 non-null   object 
 4   stars               9991 non-null   int64  
 5   createdAt           9991 non-null   object 
 6   pushedAt            9991 non-null   object 
 7   packages            9991 non-null   int64  
 8   pkgGroupId          9991 non-null   object 
 9   pkgArtifactId       9991 non-null   object 
 10  pkgVersion          9991 non-null   object 
 11  pkgPath             9991 non-null   object 
 12  pkgClients          9991 non-null   int64  
 13  pkgRelevantClients  9991 non-null   int64  
 14  cowner              9991 non-null   object 
 15  cname               9991 non-null   object 
 16  csshUr

## Store Cleaned Dataset

Store the cleaned and combined data frames as CSV files.

**Save modified dataset into CSV file** 

In [28]:
output_path = '../data/prs-cleaned.csv'
df_prs.to_csv(output_path, index=False)

**Save combined dataset into CSV file** 

In [29]:
output_path = '../data/combined.csv'
df_combined.to_csv(output_path, index=False)
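A quick round trip can confirm that a file was written as intended. The sketch below writes a made-up frame to a temporary directory with `index=False` and reads it back; with that option the row index is not stored, so the reloaded frame has exactly the original columns. The file name is illustrative only.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a non-default index, as left behind by filtering.
df = pd.DataFrame({'owner': ['javaparser'], 'prNumber': [1]}, index=[5])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    df.to_csv(path, index=False)   # the index is not written
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded.shape)
```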