# Split the Authors Dataset

Goal: Split the dataset of the 100 most frequent authors into a train set (70 %), a validate set (15 %) and a test set (15 %).

### Load Data

In [2]:
import pandas as pd
import numpy as np

data = pd.read_pickle('../data/03a_Authors_Subset.pkl')
data.head(3)

| | message | author_email | project |
| --- | --- | --- | --- |
| 0 | Fixed an error happening when the memory stats... | michele.simionato@gmail.com | gem_oq-engine |
| 1 | Updated setup.py [skip CI] | michele.simionato@gmail.com | micheles_decorator |
| 2 | Fixed an exposure test [skip hazardlib] | michele.simionato@gmail.com | gem_oq-engine |


In [3]:
from collections import Counter

author_counter = Counter(data['author_email'])
len(author_counter)

100

There are 100 authors.

All messages of one author must end up in the same subset, so the split is made per author rather than per message. This makes it possible to cluster authors by their style.

In [4]:
print('70 percent of the {total_authors} authors: {fraction_authors} authors'.format(total_authors = len(author_counter), fraction_authors = len(author_counter) * 0.7))
print('15 percent of the {total_authors} authors: {fraction_authors} authors'.format(total_authors = len(author_counter), fraction_authors = len(author_counter) * 0.15))

70 percent of the 100 authors: 70.0 authors
15 percent of the 100 authors: 15.0 authors


The following split is proposed:

| | number of authors |
| --- | --- |
| Training | 70 |
| Validate | 15 |
| Test | 15 |
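
With 100 authors the percentages translate into whole numbers. For other author counts some rounding is needed. The following is a minimal sketch of one way to do this, assuming the remainder is assigned to the train set; the helper name `split_sizes` is hypothetical and not part of the notebook.

```python
def split_sizes(n_authors, validate_frac=0.15, test_frac=0.15):
    """Return (train, validate, test) author counts that sum to n_authors."""
    n_validate = round(n_authors * validate_frac)
    n_test = round(n_authors * test_frac)
    # Assumption: whatever is left after rounding goes to the train set.
    n_train = n_authors - n_validate - n_test
    return n_train, n_validate, n_test


sizes = split_sizes(100)
print(sizes)
```

For 100 authors this reproduces the 70/15/15 split from the table above.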

Authors are allocated to the subsets at random because the number of commit messages differs from author to author.

In [5]:
import numpy as np

# Allocation codes: 0 = train, 1 = validate, 2 = test
random_allocation = np.concatenate((np.full((70,), 0), np.full((15,), 1), np.full((15,), 2)))

np.random.shuffle(random_allocation)

print(random_allocation)
print(Counter(random_allocation))

[1 0 0 2 0 0 0 0 0 2 2 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 2 0 1 2 1 1 0 0
 0 0 2 0 0 2 0 2 0 0 0 1 0 2 0 0 2 0 0 0 0 0 0 1 1 0 2 0 0 0 1 0 0 0 0 1 0
 0 2 0 2 2 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0]
Counter({0: 70, 1: 15, 2: 15})
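
The shuffle above is not seeded, so rerunning the cell produces a different allocation. A minimal sketch of a reproducible variant is shown below. It assumes NumPy's `default_rng` generator and an arbitrary seed of 42; neither is used in the notebook itself, and the shuffled order differs from the output shown above.

```python
from collections import Counter

import numpy as np

# Assumption: the seed value 42 is arbitrary; any fixed integer gives a repeatable shuffle.
rng = np.random.default_rng(seed=42)

# 0 = train (70 authors), 1 = validate (15 authors), 2 = test (15 authors)
allocation = np.repeat([0, 1, 2], [70, 15, 15])
rng.shuffle(allocation)

counts = Counter(allocation.tolist())
print(counts)
```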


The data is split according to this allocation, and each author receives a unique numeric label.

In [6]:
import warnings
warnings.filterwarnings('ignore')

train_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
validate_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
test_set = pd.DataFrame(columns=['message', 'author_email', 'project'])

# i serves both as the author's label and as the index into random_allocation
for i, (_, messages) in enumerate(data.groupby('author_email')):
    # Attach the label to a copy instead of writing into the group slice
    messages = messages.assign(label=i)
    if random_allocation[i] == 0:
        train_set = pd.concat([train_set, messages])
    elif random_allocation[i] == 1:
        validate_set = pd.concat([validate_set, messages])
    elif random_allocation[i] == 2:
        test_set = pd.concat([test_set, messages])

train_set.reset_index(drop=True, inplace=True)
validate_set.reset_index(drop=True, inplace=True)
test_set.reset_index(drop=True, inplace=True)
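
Since every author must appear in exactly one subset, a quick check for shared authors can catch allocation mistakes. The sketch below uses small toy DataFrames as stand-ins for `train_set`, `validate_set` and `test_set`; the toy data and the helper `shared_authors` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three subsets (hypothetical data)
train_toy = pd.DataFrame({'author_email': ['a@example.org', 'a@example.org', 'b@example.org']})
validate_toy = pd.DataFrame({'author_email': ['c@example.org']})
test_toy = pd.DataFrame({'author_email': ['d@example.org', 'd@example.org']})


def shared_authors(left, right):
    """Return the set of authors that occur in both DataFrames."""
    return set(left['author_email']) & set(right['author_email'])


leaks = {
    'train/validate': shared_authors(train_toy, validate_toy),
    'train/test': shared_authors(train_toy, test_toy),
    'validate/test': shared_authors(validate_toy, test_toy),
}

# Every intersection should be empty if the split is author-disjoint.
assert all(len(authors) == 0 for authors in leaks.values())
```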

### Check for Overlapping Projects

In [7]:
projects = []
projects.append(list(train_set['project'].unique()))
projects.append(list(validate_set['project'].unique()))
projects.append(list(test_set['project'].unique()))

train_val_overlap = len(set(projects[0]) & set(projects[1]))
train_test_overlap = len(set(projects[0]) & set(projects[2]))
val_test_overlap = len(set(projects[1]) & set(projects[2]))

print(f"There are {train_val_overlap} projects that occur in both train and validate set.")
print(f"There are {train_test_overlap} projects that occur in both train and test set.")
print(f"There are {val_test_overlap} projects that occur in both validate and test set.")

There are 18 projects that occur in both train and validate set.
There are 45 projects that occur in both train and test set.
There are 0 projects that occur in both validate and test set.
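
The counts above show how many projects are shared, but not how many messages they account for. As a rough sketch, the share of messages in one subset that belong to shared projects could be computed as follows. The toy DataFrames stand in for `train_set` and `validate_set` and are hypothetical.

```python
import pandas as pd

# Toy stand-ins for two subsets (hypothetical data)
train_toy = pd.DataFrame({'project': ['p1', 'p1', 'p2', 'p3']})
validate_toy = pd.DataFrame({'project': ['p2', 'p4', 'p4']})

# Projects that occur in both subsets
shared_projects = set(train_toy['project']) & set(validate_toy['project'])

# Fraction of validate messages whose project also occurs in the train subset
share_of_messages = validate_toy['project'].isin(shared_projects).mean()

print(shared_projects, share_of_messages)
```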


### Resulting Train Set

In [8]:
authors_count = len(train_set['author_email'].unique())
projects_count = len(train_set['project'].unique())

print('Number of different Authors in the train set: ' + str(authors_count))
print('Number of different Projects in the train set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(train_set) / authors_count, 2)))
train_set

Number of different Authors in the train set: 70
Number of different Projects in the train set: 1286
Average amount of commit messages per author: 1154.79


| | message | author_email | project | label |
| --- | --- | --- | --- | --- |
| 0 | Removed value dimension restriction on Points | P.Rudiger@ed.ac.uk | pyviz_holoviews | 1.0 |
| 1 | Reverted change to sublabel positioning | P.Rudiger@ed.ac.uk | pyviz_holoviews | 1.0 |
| 2 | Increased default max_samples on decimate | P.Rudiger@ed.ac.uk | pyviz_holoviews | 1.0 |
| 3 | Small fix for stream sources on batched plots | P.Rudiger@ed.ac.uk | pyviz_holoviews | 1.0 |
| 4 | Allowed constructing empty MultiDimensionalMap... | P.Rudiger@ed.ac.uk | pyviz_holoviews | 1.0 |
| ... | ... | ... | ... | ... |
| 80830 | Rename LiSEtest to SimTest | zacharyspector@gmail.com | LogicalDash_LiSE | 99.0 |
| 80831 | Turn character.StatMapping into a Signal | zacharyspector@gmail.com | LogicalDash_LiSE | 99.0 |
| 80832 | Decruft the old unused _no_use_canvas property... | zacharyspector@gmail.com | LogicalDash_LiSE | 99.0 |
| 80833 | logging for dummy\n\nTo more rapidly identify ... | zacharyspector@gmail.com | LogicalDash_LiSE | 99.0 |


### Resulting Validate Set

In [9]:
authors_count = len(validate_set['author_email'].unique())
projects_count = len(validate_set['project'].unique())

print('Number of different Authors in the validate set: ' + str(authors_count))
print('Number of different Projects in the validate set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(validate_set) / authors_count, 2)))
validate_set

Number of different Authors in the validate set: 15
Number of different Projects in the validate set: 117
Average amount of commit messages per author: 1162.2


| | message | author_email | project | label |
| --- | --- | --- | --- | --- |
| 0 | calcs/hazard/event_based/post_processing:\n\nM... | Lars.Butler@gmail.com | gem_oq-engine | 0.0 |
| 1 | added javadoc heading to hdf5 util class\n\n\n... | Lars.Butler@gmail.com | gem_oq-engine | 0.0 |
| 2 | added missing imports in db_tests/__init__.py ... | Lars.Butler@gmail.com | gem_oq-engine | 0.0 |
| 3 | Fixed up a longer-running test, added slow attr | Lars.Butler@gmail.com | gem_oq-engine | 0.0 |
| 4 | calculators/hazard/event_based/core_next:\n\nR... | Lars.Butler@gmail.com | gem_oq-engine | 0.0 |
| ... | ... | ... | ... | ... |
| 17428 | Apply uupdates to the dropfile routine to salt... | thatch45@gmail.com | saltstack_salt | 96.0 |
| 17429 | Remove esky errors because they only confuse %... | thatch45@gmail.com | saltstack_salt | 96.0 |
| 17430 | Add event firing to salt-ssh | thatch45@gmail.com | saltstack_salt | 96.0 |
| 17431 | Fix #<I>\n\nSorry about the long wait on this ... | thatch45@gmail.com | saltstack_salt | 96.0 |


### Resulting Test Set

In [10]:
authors_count = len(test_set['author_email'].unique())
projects_count = len(test_set['project'].unique())

print('Number of different Authors in the test set: ' + str(authors_count))
print('Number of different Projects in the test set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(test_set) / authors_count, 2)))
test_set

Number of different Authors in the test set: 15
Number of different Projects in the test set: 279
Average amount of commit messages per author: 969.07


| | message | author_email | project | label |
| --- | --- | --- | --- | --- |
| 0 | [enrich][logs] Show datasource in loading iden... | acs@bitergia.com | chaoss_grimoirelab-elk | 3.0 |
| 1 | [enrich][bugzilla] Use items as the type for t... | acs@bitergia.com | chaoss_grimoirelab-elk | 3.0 |
| 2 | Removed not used code (old use of pullrequests... | acs@bitergia.com | chaoss_grimoirelab-elk | 3.0 |
| 3 | [release] Update version number to <I> | acs@bitergia.com | chaoss_grimoirelab-elk | 3.0 |
| 4 | [logs] Remove logs related to getting last upd... | acs@bitergia.com | chaoss_grimoirelab-elk | 3.0 |
| ... | ... | ... | ... | ... |
| 14531 | Marking `Uuid::uuid5()` as pure: same input le... | ocramius@gmail.com | ramsey_uuid | 78.0 |
| 14532 | Removing temporary files on failed write opera... | ocramius@gmail.com | Ocramius_ProxyManager | 78.0 |
| 14533 | Expecting `inspectionId` in the `InspectionCon... | ocramius@gmail.com | Roave_RoaveDeveloperTools | 78.0 |
| 14534 | Marked return type of `\set_exception_handler(... | ocramius@gmail.com | phpstan_phpstan | 78.0 |


The average number of commit messages per author cannot be fully balanced across the three DataFrames. A few committers have a significantly higher number of commit messages, and since 70 % of the authors go to the train set, these committers are more likely to be allocated there.
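
To see how strongly individual committers skew the average, the mean can be compared with the median and maximum number of messages per author. The following is a small sketch on hypothetical toy data; in the notebook the same calls could be applied to each of the three DataFrames.

```python
import pandas as pd

# Toy data: author 'a' has many more messages than 'b' and 'c' (hypothetical)
toy = pd.DataFrame({
    'author_email': ['a'] * 5 + ['b'] * 2 + ['c'] * 1,
    'message': ['m'] * 8,
})

# Number of messages per author
per_author = toy.groupby('author_email').size()

# A mean well above the median points to a few very active authors.
summary = per_author.agg(['mean', 'median', 'max'])
print(summary)
```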

### Save All Three DataFrames

In [11]:
train_set.to_pickle('../data/04-1a_Authors_Train_Set.pkl')
validate_set.to_pickle('../data/04-1b_Authors_Validate_Set.pkl')
test_set.to_pickle('../data/04-1c_Authors_Test_Set.pkl')
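
A pickled DataFrame can be read back with `pd.read_pickle`, as done in the load step at the top. The sketch below checks such a round trip on a hypothetical toy DataFrame written to a temporary directory, so it does not touch the files in `../data/`.

```python
import os
import tempfile

import pandas as pd

# Toy DataFrame with the same columns as the saved subsets (hypothetical data)
toy = pd.DataFrame({
    'message': ['Fix typo'],
    'author_email': ['a@example.org'],
    'project': ['demo'],
    'label': [0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_set.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises an AssertionError if values, dtypes or index differ.
pd.testing.assert_frame_equal(toy, restored)
```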