# Split the Projects Dataset

Goal: Split the data of the 100 most frequent projects into a train set (70 %), a validate set (15 %) and a test set (15 %).

### Load Data

In [1]:
import pandas as pd
import numpy as np

data = pd.read_pickle('../data/03b_Projects_Subset.pkl')
data.head(3)

| | message | author_email | project |
| --- | --- | --- | --- |
| 0 | Ensure topic as bytes when zmq_filtering enabl... | pengyao@pengyao.org | saltstack_salt |
| 1 | Fix the process_test.test_kill failure in <I> | janderson@saltstack.com | saltstack_salt |
| 2 | Add state.pkg to highstate outputters | thatch45@gmail.com | saltstack_salt |


In [2]:
from collections import Counter

project_counter = Counter(data['project'])
len(project_counter)

100

There are 100 projects.

All messages of a project must end up in the same subset so that projects can later be clustered by their style.

In [4]:
print('70 percent of the {total_projects} projects: {fraction_projects} projects'.format(total_projects = len(project_counter), fraction_projects = len(project_counter) * 0.7))
print('15 percent of the {total_projects} projects: {fraction_projects} projects'.format(total_projects = len(project_counter), fraction_projects = len(project_counter) * 0.15))

70 percent of the 100 projects: 70.0 projects
15 percent of the 100 projects: 15.0 projects


The following split is proposed:

| | number of projects |
| --- | --- |
| Training | 70 |
| Validate | 15 |
| Test | 15 |
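
With 100 projects the fractions come out as whole numbers. As a minimal sketch of how the same proportions could be turned into integer sizes for other project counts, the hypothetical helper `split_sizes` below rounds the two smaller subsets and gives the remainder to the first one. Both the helper and this rounding rule are assumptions for illustration, not part of the notebook.

```python
def split_sizes(total, fractions=(0.70, 0.15, 0.15)):
    """Return integer subset sizes that sum exactly to `total`."""
    # Round every subset except the first; the first takes whatever remains,
    # so no project is dropped or counted twice.
    rest = [round(total * f) for f in fractions[1:]]
    return [total - sum(rest)] + rest


print(split_sizes(100))  # [70, 15, 15]
print(split_sizes(103))  # [73, 15, 15]
```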

Projects are allocated to the subsets at random because the number of commit messages varies from project to project.

In [5]:
import numpy as np

random_allocation = np.concatenate((np.full((70,), 0), np.full((15,), 1), np.full((15,), 2)))

np.random.shuffle(random_allocation)

print(random_allocation)
print(Counter(random_allocation))

[1 0 0 0 0 1 0 0 0 2 1 0 0 0 1 0 0 0 0 0 0 1 0 0 2 1 2 0 1 0 0 0 0 2 2 0 0
 0 1 1 2 0 2 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 2 0 2 2 0 0 0 0 0 0 0 0 1 0 2 1
 0 0 0 2 1 0 0 0 2 0 0 0 0 1 0 0 2 0 0 0 0 0 2 0 0 0]
Counter({0: 70, 1: 15, 2: 15})
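
The shuffle above is unseeded, so rerunning the cell produces a different allocation. A minimal sketch of a repeatable variant, assuming NumPy's `Generator` API, is shown below. The function name `allocate` and the seed value 42 are arbitrary choices for illustration, and this sketch does not reproduce the allocation printed above.

```python
from collections import Counter

import numpy as np


def allocate(sizes=(70, 15, 15), seed=42):
    """Return a shuffled array that holds subset id k exactly sizes[k] times."""
    # np.repeat builds 70 zeros, 15 ones and 15 twos in one call.
    allocation = np.repeat(np.arange(len(sizes)), sizes)
    # A seeded Generator makes the shuffle repeatable (the seed is an assumption).
    rng = np.random.default_rng(seed)
    return rng.permutation(allocation)


allocation = allocate()
print(Counter(allocation.tolist()))
```

Calling `allocate()` twice with the same seed returns the same array, which makes the split reproducible across notebook runs.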


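The data is split accordingly, and each project gets a unique numeric label: its position in the sorted list of project names, which is the order in which `groupby` iterates.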

In [6]:
import warnings
warnings.filterwarnings('ignore')

train_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
validate_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
test_set = pd.DataFrame(columns=['message', 'author_email', 'project'])

for i, (_, group) in enumerate(data.groupby('project')):
    # assign() returns a labelled copy, so the groupby slice is not modified in place
    group = group.assign(label=i)
    if random_allocation[i] == 0:
        train_set = pd.concat([train_set, group])
    elif random_allocation[i] == 1:
        validate_set = pd.concat([validate_set, group])
    elif random_allocation[i] == 2:
        test_set = pd.concat([test_set, group])

train_set.reset_index(drop=True, inplace=True)
validate_set.reset_index(drop=True, inplace=True)
test_set.reset_index(drop=True, inplace=True)
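
The loop above grows three dataframes through repeated `pd.concat` calls. As a hedged alternative, the sketch below derives the label and the subset for every row with a lookup and then selects rows with boolean masks. The names `toy`, `toy_allocation` and `label_of` are hypothetical stand-ins, and the toy data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `data`: four projects with a few messages each.
toy = pd.DataFrame({
    'message': ['a', 'b', 'c', 'd', 'e', 'f'],
    'author_email': ['x@e.org', 'y@e.org', 'x@e.org', 'z@e.org', 'y@e.org', 'z@e.org'],
    'project': ['p1', 'p1', 'p2', 'p3', 'p3', 'p4'],
})
# Hypothetical allocation, one entry per project: 0 = train, 1 = validate, 2 = test.
toy_allocation = np.array([0, 1, 0, 2])

# Sorted unique names follow the same order in which groupby('project') iterates.
projects = np.sort(toy['project'].unique())
label_of = {name: i for i, name in enumerate(projects)}

# Label every row, then look up the subset id of its project.
labelled = toy.assign(label=toy['project'].map(label_of))
subset_id = toy_allocation[labelled['label'].to_numpy()]

toy_train = labelled[subset_id == 0].reset_index(drop=True)
toy_validate = labelled[subset_id == 1].reset_index(drop=True)
toy_test = labelled[subset_id == 2].reset_index(drop=True)

print(len(toy_train), len(toy_validate), len(toy_test))  # 4 1 1
```

One difference from the loop: the masks keep the original row order, whereas concatenating groups orders the rows by project. The labels also stay integers here because no empty dataframe is involved.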

### Check for Overlapping Authors

In [7]:
authors = []
authors.append(list(train_set['author_email'].unique()))
authors.append(list(validate_set['author_email'].unique()))
authors.append(list(test_set['author_email'].unique()))

train_val_overlap = len(set(authors[0]) & set(authors[1]))
train_test_overlap = len(set(authors[0]) & set(authors[2]))
val_test_overlap = len(set(authors[1]) & set(authors[2]))

print(f"There are {train_val_overlap} authors that occur in both train and validate set.")
print(f"There are {train_test_overlap} authors that occur in both train and test set.")
print(f"There are {val_test_overlap} authors that occur in both validate and test set.")

There are 208 authors that occur in both train and validate set.
There are 391 authors that occur in both train and test set.
There are 65 authors that occur in both validate and test set.
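
Because the split is made by project, an author who contributes to several projects can appear in more than one subset. The pairwise check can be sketched more generally with set intersections over every pair of subsets. The helper `pairwise_overlap` and the toy author lists below are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def pairwise_overlap(groups):
    """Map each pair of subset names to the number of members they share."""
    sets = {name: set(members) for name, members in groups.items()}
    return {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sets, 2)
    }


# Invented author lists; in the notebook these would come from
# train_set['author_email'], validate_set['author_email'] and test_set['author_email'].
toy_authors = {
    'train': ['a@e.org', 'b@e.org', 'c@e.org'],
    'validate': ['b@e.org', 'd@e.org'],
    'test': ['c@e.org', 'b@e.org'],
}
print(pairwise_overlap(toy_authors))
# {('train', 'validate'): 1, ('train', 'test'): 2, ('validate', 'test'): 1}
```

The same helper applied to the `project` columns should report zero for every pair, which is a quick sanity check that no project was split across subsets.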


### Resulting Train Set

In [8]:
authors_count = len(train_set['author_email'].unique())
projects_count = len(train_set['project'].unique())

print('Number of different Authors in the train set: ' + str(authors_count))
print('Number of different Projects in the train set: ' + str(projects_count))
print('Average amount of commit messages per project: ' + str(round(len(train_set) / projects_count, 2)))
train_set

Number of different Authors in the train set: 18272
Number of different Projects in the train set: 70
Average amount of commit messages per project: 2636.94


| | message | author_email | project | label |
| --- | --- | --- | --- | --- |
| 0 | Signup: Headstart: Add back the Headstart flow... | kwight@kwight.ca | Automattic_wp-calypso | 1.0 |
| 1 | Stats: Add delay to avoid stale `_dl` in `caly... | donpark@docuverse.com | Automattic_wp-calypso | 1.0 |
| 2 | wp: fix eslint warning. | rdsuarez@gmail.com | Automattic_wp-calypso | 1.0 |
| 3 | endpoint: post: rename `post` by `post_get` | rdsuarez@gmail.com | Automattic_wp-calypso | 1.0 |
| 4 | Analytics: Remove broken and unused `site_post... | kwight@kwight.ca | Automattic_wp-calypso | 1.0 |
| ... | ... | ... | ... | ... |
| 184581 | [WFLY-<I>] Don't drop the log-store root model... | brian.stansberry@redhat.com | wildfly_wildfly | 99.0 |
| 184582 | callbackHandle isn't being set anywhere. | jfclere@gmail.com | wildfly_wildfly | 99.0 |
| 184583 | Don't try to read an unstarted NetworkInterfac... | brian.stansberry@redhat.com | wildfly_wildfly | 99.0 |
| 184584 | [WFLY-<I>] Update the expected caller principa... | fjuma@redhat.com | wildfly_wildfly | 99.0 |


### Resulting Validate Set

In [9]:
authors_count = len(validate_set['author_email'].unique())
projects_count = len(validate_set['project'].unique())

print('Number of different Authors in the validate set: ' + str(authors_count))
print('Number of different Projects in the validate set: ' + str(projects_count))
print('Average amount of commit messages per project: ' + str(round(len(validate_set) / projects_count, 2)))
validate_set

Number of different Authors in the validate set: 4975
Number of different Projects in the validate set: 15
Average amount of commit messages per project: 2171.27


| | message | author_email | project | label |
| --- | --- | --- | --- | --- |
| 0 | Handle interrupts in FaultTolerateAlluxioMaster | aaudibert10@gmail.com | Alluxio_alluxio | 0.0 |
| 1 | Change swift ufs to return empty group/owner i... | jia.calvin@gmail.com | Alluxio_alluxio | 0.0 |
| 2 | Don't discard the buffer unnecessarily in unde... | jia.calvin@gmail.com | Alluxio_alluxio | 0.0 |
| 3 | TACHYON-<I>: Check - Check Tachyon specfic ope... | sdp@apache.org | Alluxio_alluxio | 0.0 |
| 4 | [SMALLFIX] Simplified equals implementation of... | jan.hentschel@ultratendency.com | Alluxio_alluxio | 0.0 |
| ... | ... | ... | ... | ... |
| 32564 | * How did this 'repo' get past the last massiv... | postmodern.mod3@gmail.com | ronin-ruby_ronin | 87.0 |
| 32565 | Added Script::ClassMethods#short_name. | postmodern.mod3@gmail.com | ronin-ruby_ronin | 87.0 |
| 32566 | Include UI::Output::Helpers into all Ronin Mod... | postmodern.mod3@gmail.com | ronin-ruby_ronin | 87.0 |
| 32567 | Call Database.upgrade from Database.setup. | postmodern.mod3@gmail.com | ronin-ruby_ronin | 87.0 |


### Resulting Test Set

In [10]:
authors_count = len(test_set['author_email'].unique())
projects_count = len(test_set['project'].unique())

print('Number of different Authors in the test set: ' + str(authors_count))
print('Number of different Projects in the test set: ' + str(projects_count))
print('Average amount of commit messages per project: ' + str(round(len(test_set) / projects_count, 2)))
test_set

Number of different Authors in the test set: 4321
Number of different Projects in the test set: 15
Average amount of commit messages per project: 3607.87


| | message | author_email | project | label |
| --- | --- | --- | --- | --- |
| 0 | [INTERNAL] Windows Phone <I>: Table export\n\n... | tommy.vinh.lam@sap.com | SAP_openui5 | 9.0 |
| 1 | [INTERNAL] sap.m.DateTimeInput: Islamic calend... | cahit.guerguec@sap.com | SAP_openui5 | 9.0 |
| 2 | [FIX] sap.m.ViewSettingsDialog: Dialog appears... | alexander.ivanov01@sap.com | SAP_openui5 | 9.0 |
| 3 | [INTERNAL] sap.m.ActionSheet: AfterClose event... | ivaylo.plashkov@sap.com | SAP_openui5 | 9.0 |
| 4 | [INTERNAL][FIX] sap.ui.fl: Correct id of appli... | tuan.dat.ngo@sap.com | SAP_openui5 | 9.0 |
| ... | ... | ... | ... | ... |
| 54113 | Updates test S3 bucket name to match sweeper p... | gdavison@hashicorp.com | terraform-providers_terraform-provider-aws | 96.0 |
| 54114 | tests/r/secretsmanager_secret: Use consistent ... | dirk.avery@gmail.com | terraform-providers_terraform-provider-aws | 96.0 |
| 54115 | serverlessapprepo: Migrate to service, global ... | dirk.avery@gmail.com | terraform-providers_terraform-provider-aws | 96.0 |
| 54116 | Fix letter case on aws_dms_endpoint.mongodb_se... | vitor@vitorbaptista.com | terraform-providers_terraform-provider-aws | 96.0 |


The average number of commit messages per project cannot be fully balanced across the three dataframes: a few projects have significantly more commit messages than the others, and since the allocation is random they are most likely to land in the train set, which receives 70 % of the projects.
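
To look at this imbalance more directly, the per-subset figures printed above can be collected by one helper, together with the distribution of messages per project. The sketch below uses an invented toy dataframe, and the function name `summarize` is a hypothetical choice for illustration.

```python
import pandas as pd


def summarize(subset):
    """Return author count, project count and mean messages per project."""
    projects = subset['project'].nunique()
    return {
        'authors': subset['author_email'].nunique(),
        'projects': projects,
        'messages_per_project': round(len(subset) / projects, 2),
    }


# Invented example data with the same columns as the notebook's subsets.
toy = pd.DataFrame({
    'message': ['m1', 'm2', 'm3', 'm4', 'm5'],
    'author_email': ['x@e.org', 'y@e.org', 'x@e.org', 'x@e.org', 'z@e.org'],
    'project': ['p1', 'p1', 'p1', 'p2', 'p2'],
})

stats = summarize(toy)
print(stats)  # {'authors': 3, 'projects': 2, 'messages_per_project': 2.5}

# Messages per project show how unevenly the projects contribute.
sizes = toy.groupby('project').size()
print(sizes.to_dict())  # {'p1': 3, 'p2': 2}
```

Applied to `train_set`, `validate_set` and `test_set`, the `sizes` series would show which individual projects dominate each subset.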

### Save all three Dataframes

In [12]:
train_set.to_pickle('../data/04-1a_Projects_Train_Set.pkl')
validate_set.to_pickle('../data/04-1b_Projects_Validate_Set.pkl')
test_set.to_pickle('../data/04-1c_Projects_Test_Set.pkl')