# Split the Frequent Committers Dataset

Goal: Split the dataset of the most frequent committers into a train set (70 %), a validate set (15 %) and a test set (15 %).

### Load Data

In [1]:
import pandas as pd

data = pd.read_pickle('../data/03_Subset_Frequent_Committers.pkl')
data.head(3)

Unnamed: 0,message,author_email,project
0,Fixed an error happening when the memory stats...,michele.simionato@gmail.com,gem_oq-engine
1,Updated setup.py [skip CI],michele.simionato@gmail.com,micheles_decorator
2,Fixed an exposure test [skip hazardlib],michele.simionato@gmail.com,gem_oq-engine


In [2]:
from collections import Counter

author_counter = Counter(data['author_email'])
len(author_counter)

42

There are 42 authors.

All messages of a given author must end up in the same subset, so that authors can later be clustered by their writing style.

In [3]:
print('70 percent of the {total_authors} authors: {fraction_authors} authors'.format(total_authors = len(author_counter), fraction_authors = len(author_counter) * 0.7))
print('15 percent of the {total_authors} authors: {fraction_authors} authors'.format(total_authors = len(author_counter), fraction_authors = len(author_counter) * 0.15))

70 percent of the 42 authors: 29.4 authors
15 percent of the 42 authors: 6.3 authors


Since 6.3 authors is a very small number (and not a whole one), slightly larger validate and test subsets are preferred.

The following split is proposed:

| | number of authors |
| --- | --- |
| Training | 28 |
| Validate | 7 |
| Test | 7 |

Authors are allocated to the subsets at random, because the number of commit messages differs between authors and a fixed ordering could bias a subset towards authors with many or with few messages.

In [4]:
import numpy as np

random_allocation = np.concatenate((np.full((28,), 0), np.full((7,), 1), np.full((7,), 2)))

np.random.shuffle(random_allocation)

print(random_allocation)
print(Counter(random_allocation))

[1 1 1 0 0 0 2 0 2 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 1 0 0 0 1 0 2 0 0 2 0 2
 0 2 1 0 0]
Counter({0: 28, 1: 7, 2: 7})
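
The shuffle above is unseeded, so rerunning the notebook produces a different allocation. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator is acceptable here; the seed `42` is an arbitrary illustrative value and not the one behind the results shown in this notebook:

```python
import numpy as np
from collections import Counter

# Assumption: 42 is an arbitrary seed chosen for illustration only.
rng = np.random.default_rng(42)

# One slot per author: 28 x train (0), 7 x validate (1), 7 x test (2)
allocation = np.repeat([0, 1, 2], [28, 7, 7])
rng.shuffle(allocation)  # shuffles in place

print(allocation)
print(Counter(allocation.tolist()))
```

With a fixed seed the same authors land in the same subsets on every run, which makes later results easier to compare.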


The data is split accordingly and each committer gets a unique label.

In [5]:
import warnings
# Suppress all warnings for the rest of the notebook
warnings.filterwarnings('ignore')

train_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
validate_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
test_set = pd.DataFrame(columns=['message', 'author_email', 'project'])

# groupby iterates over the authors in sorted order; the position i is used
# both as the author's label and as the index into random_allocation
for i, (author, group) in enumerate(data.groupby('author_email')):
    group['label'] = i
    if random_allocation[i] == 0:
        train_set = pd.concat([train_set, group])
    elif random_allocation[i] == 1:
        validate_set = pd.concat([validate_set, group])
    elif random_allocation[i] == 2:
        test_set = pd.concat([test_set, group])

train_set.reset_index(drop=True, inplace=True)
validate_set.reset_index(drop=True, inplace=True)
test_set.reset_index(drop=True, inplace=True)
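
The same author-level split can also be expressed without growing the frames inside a loop. The following is only a sketch on a toy frame (`toy`, `toy_allocation` and the addresses are made-up stand-ins for `data` and `random_allocation`): it maps every author to a label in sorted order, as `groupby` does, and then selects rows with boolean masks. Unlike the loop above, it keeps the original row order instead of grouping rows by author, and the label column stays an integer.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data` (hypothetical values, same three columns)
toy = pd.DataFrame({
    'message': ['a', 'b', 'c', 'd', 'e', 'f'],
    'author_email': ['x@e.org', 'y@e.org', 'x@e.org', 'z@e.org', 'y@e.org', 'z@e.org'],
    'project': ['p1', 'p2', 'p1', 'p3', 'p2', 'p1'],
})
# One subset id per author in sorted author order: 0 = train, 1 = validate, 2 = test
toy_allocation = np.array([0, 1, 2])

# Label authors in sorted order, the order in which groupby yields them
authors = sorted(toy['author_email'].unique())
label_of = {author: i for i, author in enumerate(authors)}
labelled = toy.assign(label=toy['author_email'].map(label_of))

# Look up the subset of each row via its author's label
subset = toy_allocation[labelled['label'].to_numpy()]

train_toy = labelled[subset == 0].reset_index(drop=True)
validate_toy = labelled[subset == 1].reset_index(drop=True)
test_toy = labelled[subset == 2].reset_index(drop=True)

print(train_toy)
```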

### Check for Overlapping Projects

In [6]:
projects = []
projects.append(list(train_set['project'].unique()))
projects.append(list(validate_set['project'].unique()))
projects.append(list(test_set['project'].unique()))

train_val_overlap = len(set(projects[0]) & set(projects[1]))
train_test_overlap = len(set(projects[0]) & set(projects[2]))
val_test_overlap = len(set(projects[1]) & set(projects[2]))

print(f"There are {train_val_overlap} projects that occur in both train and validate set.")
print(f"There are {train_test_overlap} projects that occur in both train and test set.")
print(f"There are {val_test_overlap} projects that occur in both validate and test set.")

There are 11 projects that occur in both train and validate set.
There are 30 projects that occur in both train and test set.
There are 2 projects that occur in both validate and test set.
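
Projects may overlap between the subsets, but authors must not. A small sanity check for that could look like the sketch below; `shared_values` is a hypothetical helper and the two frames are toy data, not the real subsets:

```python
import pandas as pd

def shared_values(first, second, column):
    """Return the values of `column` that occur in both frames."""
    return set(first[column]) & set(second[column])

# Toy stand-ins for two of the subsets (hypothetical values)
train_toy = pd.DataFrame({'author_email': ['x@e.org', 'x@e.org'],
                          'project': ['p1', 'p2']})
test_toy = pd.DataFrame({'author_email': ['z@e.org'],
                         'project': ['p1']})

author_overlap = shared_values(train_toy, test_toy, 'author_email')
project_overlap = shared_values(train_toy, test_toy, 'project')

# An author-level split must never share authors between subsets
assert not author_overlap, f'authors in both sets: {author_overlap}'
print('shared projects:', project_overlap)
```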


### Resulting Train Set

In [7]:
authors_count = len(train_set['author_email'].unique())
projects_count = len(train_set['project'].unique())

print('Number of different Authors in the train set: ' + str(authors_count))
print('Number of different Projects in the train set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(train_set) / authors_count, 2)))
train_set

Number of different Authors in the train set: 28
Number of different Projects in the train set: 504
Average amount of commit messages per author: 1521.68


Unnamed: 0,message,author_email,project,label
0,Remove redundant replenishConnRequests when re...,anacrolix@gmail.com,anacrolix_torrent,3.0
1,Ending a conn because we don't like its ID is ...,anacrolix@gmail.com,anacrolix_torrent,3.0
2,Treat closed as gotFin for purposes of destroy...,anacrolix@gmail.com,anacrolix_utp,3.0
3,Fix for getting closest nodes on our own ID,anacrolix@gmail.com,anacrolix_dht,3.0
4,Fix race in TextPexConnState,anacrolix@gmail.com,anacrolix_torrent,3.0
...,...,...,...,...
42602,Rename LiSEtest to SimTest,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
42603,Turn character.StatMapping into a Signal,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
42604,Decruft the old unused _no_use_canvas property...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
42605,logging for dummy\n\nTo more rapidly identify ...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0


### Resulting Validate Set

In [8]:
authors_count = len(validate_set['author_email'].unique())
projects_count = len(validate_set['project'].unique())

print('Number of different Authors in the validate set: ' + str(authors_count))
print('Number of different Projects in the validate set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(validate_set) / authors_count, 2)))
validate_set

Number of different Authors in the validate set: 7
Number of different Projects in the validate set: 98
Average amount of commit messages per author: 1640.43


Unnamed: 0,message,author_email,project,label
0,calcs/hazard/event_based/post_processing:\n\nM...,Lars.Butler@gmail.com,gem_oq-engine,0.0
1,added javadoc heading to hdf5 util class\n\n\n...,Lars.Butler@gmail.com,gem_oq-engine,0.0
2,added missing imports in db_tests/__init__.py ...,Lars.Butler@gmail.com,gem_oq-engine,0.0
3,"Fixed up a longer-running test, added slow attr",Lars.Butler@gmail.com,gem_oq-engine,0.0
4,calculators/hazard/event_based/core_next:\n\nR...,Lars.Butler@gmail.com,gem_oq-engine,0.0
...,...,...,...,...
11478,lxd/device/device/utils/network: Renames netwo...,thomas.parrott@canonical.com,lxc_lxd,39.0
11479,lxd/storage/filesystem: Adds StatVFS function,thomas.parrott@canonical.com,lxc_lxd,39.0
11480,lxd/db/node: Update nodeIsOffline to consider ...,thomas.parrott@canonical.com,lxc_lxd,39.0
11481,lxd/instance/exec: Simplify connection slot se...,thomas.parrott@canonical.com,lxc_lxd,39.0


### Resulting Test Set

In [9]:
authors_count = len(test_set['author_email'].unique())
projects_count = len(test_set['project'].unique())

print('Number of different Authors in the test set: ' + str(authors_count))
print('Number of different Projects in the test set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(test_set) / authors_count, 2)))
test_set

Number of different Authors in the test set: 7
Number of different Projects in the test set: 214
Average amount of commit messages per author: 2033.57


Unnamed: 0,message,author_email,project,label
0,Add a role_mentions reader,blactbt@live.de,meew0_discordrb,6.0
1,Add a newline at the end of msgbot,blactbt@live.de,discordjs_discord.js,6.0
2,Use pack instead of chr to get the hacky chars,blactbt@live.de,meew0_discordrb,6.0
3,:anchor: Make sure to correctly add the embeds...,blactbt@live.de,meew0_discordrb,6.0
4,Add some comments to permission_overwrite,blactbt@live.de,meew0_discordrb,6.0
...,...,...,...,...
14230,Apply uupdates to the dropfile routine to salt...,thatch45@gmail.com,saltstack_salt,38.0
14231,Remove esky errors because they only confuse %...,thatch45@gmail.com,saltstack_salt,38.0
14232,Add event firing to salt-ssh,thatch45@gmail.com,saltstack_salt,38.0
14233,Fix #<I>\n\nSorry about the long wait on this ...,thatch45@gmail.com,saltstack_salt,38.0


The average number of commit messages per author cannot be fully balanced across the three sets: a few committers have significantly more commit messages than the rest, and whichever set they land in (most likely the train set, which holds 28 of the 42 authors) has its average pulled up.
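
How uneven the authors are can be inspected by counting the messages per author. The sketch below uses a made-up toy frame; on the real data the same calls would be applied to `data['author_email']`. A mean well above the median points to a few very prolific committers.

```python
import pandas as pd

# Toy frame: one prolific author and two with few messages (hypothetical values)
toy = pd.DataFrame({
    'author_email': ['a@e.org'] * 10 + ['b@e.org'] * 2 + ['c@e.org'] * 3
})

# Number of commit messages per author, largest first
per_author = toy['author_email'].value_counts()

print(per_author)
print('mean:', per_author.mean(), 'median:', per_author.median())
```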

### Save all three Dataframes

In [10]:
train_set.to_pickle('../data/04-0a_Train_Set.pkl')
validate_set.to_pickle('../data/04-0b_Validate_Set.pkl')
test_set.to_pickle('../data/04-0c_Test_Set.pkl')

## Split a Second Time

This time every author should have commit messages in each of the three sets, so that the sets can be used to predict the author of a message.

In [11]:
train_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
validate_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
test_set = pd.DataFrame(columns=['message', 'author_email', 'project'])

for i, (author, group) in enumerate(data.groupby('author_email')):
    group['label'] = i
    # Keeps the old index as an 'index' column, which is dropped again below
    group.reset_index(inplace=True)
    # Draw a subset for every message of this author: 0 = train, 1 = validate, 2 = test
    random_allocation = np.random.choice(3, len(group), p=[0.7, 0.15, 0.15])
    # Append the messages one row at a time to the chosen subset
    for j, subset in enumerate(random_allocation):
        if subset == 0:
            train_set = pd.concat([train_set, group.iloc[[j]]])
        elif subset == 1:
            validate_set = pd.concat([validate_set, group.iloc[[j]]])
        elif subset == 2:
            test_set = pd.concat([test_set, group.iloc[[j]]])

train_set.reset_index(drop=True, inplace=True)
validate_set.reset_index(drop=True, inplace=True)
test_set.reset_index(drop=True, inplace=True)

train_set.drop(columns=['index'], inplace=True)
validate_set.drop(columns=['index'], inplace=True)
test_set.drop(columns=['index'], inplace=True)
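
Appending one row at a time becomes slow for tens of thousands of messages. Because every message is assigned independently with the same probabilities, the draws can also be made for all rows at once. The sketch below shows this on a toy frame; the seed, the toy values and the use of `pd.factorize` for the labels are assumptions for illustration, not what was run for the results in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, for illustration only

# Toy stand-in for `data` (hypothetical values)
toy = pd.DataFrame({
    'message': [f'msg {k}' for k in range(200)],
    'author_email': ['x@e.org'] * 120 + ['y@e.org'] * 80,
    'project': ['p'] * 200,
})

# Integer label per author, numbered in sorted author order
toy['label'] = pd.factorize(toy['author_email'], sort=True)[0]

# One independent draw per message: 0 = train, 1 = validate, 2 = test
subset = rng.choice(3, size=len(toy), p=[0.7, 0.15, 0.15])

train_toy, validate_toy, test_toy = (
    toy[subset == k].reset_index(drop=True) for k in range(3)
)

print(len(train_toy), len(validate_toy), len(test_toy))
```

The subset sizes only approximate 70/15/15 because each message is drawn independently.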

### Resulting Train Set

In [12]:
authors_count = len(train_set['author_email'].unique())
projects_count = len(train_set['project'].unique())

print('Number of different Authors in the train set: ' + str(authors_count))
print('Number of different Projects in the train set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(train_set) / authors_count, 2)))
train_set

Number of different Authors in the train set: 42
Number of different Projects in the train set: 697
Average amount of commit messages per author: 1139.52


Unnamed: 0,message,author_email,project,label
0,calcs/hazard/event_based/post_processing:\n\nM...,Lars.Butler@gmail.com,gem_oq-engine,0.0
1,added javadoc heading to hdf5 util class\n\n\n...,Lars.Butler@gmail.com,gem_oq-engine,0.0
2,added missing imports in db_tests/__init__.py ...,Lars.Butler@gmail.com,gem_oq-engine,0.0
3,"Fixed up a longer-running test, added slow attr",Lars.Butler@gmail.com,gem_oq-engine,0.0
4,calculators/hazard/event_based/core_next:\n\nR...,Lars.Butler@gmail.com,gem_oq-engine,0.0
...,...,...,...,...
47855,Correct some docstrings and pep8,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
47856,Rename LiSEtest to SimTest,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
47857,Turn character.StatMapping into a Signal,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
47858,logging for dummy\n\nTo more rapidly identify ...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0


### Resulting Validate Set

In [13]:
authors_count = len(validate_set['author_email'].unique())
projects_count = len(validate_set['project'].unique())

print('Number of different Authors in the validate set: ' + str(authors_count))
print('Number of different Projects in the validate set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(validate_set) / authors_count, 2)))
validate_set

Number of different Authors in the validate set: 42
Number of different Projects in the validate set: 446
Average amount of commit messages per author: 242.67


Unnamed: 0,message,author_email,project,label
0,Added missing site_model schema declaration,Lars.Butler@gmail.com,gem_oq-engine,0.0
1,qa_tests/_utils:\n\nAdded a utility method to ...,Lars.Butler@gmail.com,gem_oq-engine,0.0
2,Fixed a couple of pylint violations\n\n\nForme...,Lars.Butler@gmail.com,gem_oq-engine,0.0
3,fixed a few minor bugs in helpers\n\n\nFormer-...,Lars.Butler@gmail.com,gem_oq-engine,0.0
4,added missing docstring\n\n\nFormer-commit-id:...,Lars.Butler@gmail.com,gem_oq-engine,0.0
...,...,...,...,...
10187,use the adapter since that's what holds the se...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10188,"Fix logic in time setter, and make it create b...",zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10189,correct that caching method so that it removes...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10190,Add another negative case to test_contents,zacharyspector@gmail.com,LogicalDash_LiSE,41.0


### Resulting Test Set

In [14]:
authors_count = len(test_set['author_email'].unique())
projects_count = len(test_set['project'].unique())

print('Number of different Authors in the test set: ' + str(authors_count))
print('Number of different Projects in the test set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(test_set) / authors_count, 2)))
test_set

Number of different Authors in the test set: 42
Number of different Projects in the test set: 468
Average amount of commit messages per author: 244.6


Unnamed: 0,message,author_email,project,label
0,export/risk:\n\n`event_loss` tables (csv) are ...,Lars.Butler@gmail.com,gem_oq-engine,0.0
1,Added another test case for the export() api f...,Lars.Butler@gmail.com,gem_oq-engine,0.0
2,db/models:\n\nFixed a typo in a comment.,Lars.Butler@gmail.com,gem_oq-engine,0.0
3,db/routers_test:\n\nRemoved references to Uplo...,Lars.Butler@gmail.com,gem_oq-engine,0.0
4,version info now uses utc time to make applica...,Lars.Butler@gmail.com,gem_oq-engine,0.0
...,...,...,...,...
10268,Fix ValueError when last_result_idx is unset,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10269,Don't regen spot collider needlessly,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10270,"Make the kobold extra sprinty, for easier testing",zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10271,re-fix flickering knock on wood,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
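
Since every message is assigned at random, nothing in the procedure forces each author into all three sets; the outputs above report 42 authors everywhere, so it worked out for this run. A check along the following lines could make that explicit. `missing_authors` is a hypothetical helper and the frames are toy data:

```python
import pandas as pd

def missing_authors(subset_frame, all_authors):
    """Return the authors that have no message in `subset_frame`."""
    return set(all_authors) - set(subset_frame['author_email'])

# Toy stand-ins (hypothetical values): y is missing from validate, x from test
all_authors = ['x@e.org', 'y@e.org']
train_toy = pd.DataFrame({'author_email': ['x@e.org', 'x@e.org', 'y@e.org']})
validate_toy = pd.DataFrame({'author_email': ['x@e.org']})
test_toy = pd.DataFrame({'author_email': ['y@e.org']})

results = {
    name: missing_authors(frame, all_authors)
    for name, frame in [('train', train_toy),
                        ('validate', validate_toy),
                        ('test', test_toy)]
}

for name, missing in results.items():
    print(name, 'is missing', len(missing), 'author(s):', sorted(missing))
```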


### Save all three Dataframes

In [15]:
train_set.to_pickle('../data/05-0a_Authors_Train_Set.pkl')
validate_set.to_pickle('../data/05-0b_Authors_Validate_Set.pkl')
test_set.to_pickle('../data/05-0c_Authors_Test_Set.pkl')