# Split the Frequent Committers Dataset

Goal: Split the dataset of the most frequent committers into a train set (70 %), a validate set (15 %) and a test set (15 %).

### Load Data

In [59]:
import pandas as pd

data = pd.read_pickle('../data/03_Subset_Frequent_Committers.pkl')
data.head(3)

Unnamed: 0,message,author_email,project
0,Fixed an error happening when the memory stats...,michele.simionato@gmail.com,gem_oq-engine
1,Updated setup.py [skip CI],michele.simionato@gmail.com,micheles_decorator
2,Fixed an exposure test [skip hazardlib],michele.simionato@gmail.com,gem_oq-engine


In [60]:
from collections import Counter

author_counter = Counter(data['author_email'])
len(author_counter)

42

There are 42 authors.

All messages of one author must end up in the same subset, so that authors can later be clustered by their style.

In [61]:
print('70 percent of the {total_authors} authors: {fraction_authors} authors'.format(total_authors = len(author_counter), fraction_authors = len(author_counter) * 0.7))
print('15 percent of the {total_authors} authors: {fraction_authors} authors'.format(total_authors = len(author_counter), fraction_authors = len(author_counter) * 0.15))

70 percent of the 42 authors: 29.4 authors
15 percent of the 42 authors: 6.3 authors


With only 42 authors, 15 % amounts to roughly six authors, which is very few. Slightly larger validate and test subsets are therefore preferred.

The following split is proposed:

| | number of authors |
| --- | --- |
| Training | 28 |
| Validate | 7 |
| Test | 7 |

Authors are allocated to the subsets at random, because the number of commit messages differs considerably between authors.

In [62]:
import numpy as np

# Seed 13 yields an allocation that matches the proposed 28/7/7 split; most other seeds will not.
np.random.seed(13)

random_allocation = np.random.choice(3, 42, p=[0.7, 0.15, 0.15])

np.random.shuffle(random_allocation)

print(random_allocation)
print(Counter(random_allocation))

[0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 2 0 1 0 1 2 2 0 0 2 1 0 0 0 0 0 2 0 2 0 0 0
 1 2 0 0 1]
Counter({0: 28, 1: 7, 2: 7})
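
An allocation with exactly 28/7/7 authors can also be built without searching for a suitable seed. The following is only a sketch of that idea, not the method used in this notebook: it creates a label vector with the desired counts and shuffles it. The names `sizes` and `allocation` are illustrative, and the resulting assignment differs from the one printed above, so the subsets saved later would not be reproduced by it.

```python
import numpy as np
from collections import Counter

# Desired number of authors per subset: 0 = train, 1 = validate, 2 = test
sizes = {0: 28, 1: 7, 2: 7}

# The seed only makes the sketch repeatable; any seed gives the same counts
rng = np.random.default_rng(13)

# Repeat each subset id as often as requested, then shuffle the positions
allocation = np.repeat(list(sizes.keys()), list(sizes.values()))
rng.shuffle(allocation)

counts = Counter(allocation.tolist())
print(counts)
```

Because the counts are fixed before shuffling, the split sizes no longer depend on the chosen seed.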


The data is split accordingly and each committer gets a unique label.

In [63]:
import warnings
warnings.filterwarnings('ignore')

train_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
validate_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
test_set = pd.DataFrame(columns=['message', 'author_email', 'project'])

for i, (_, group) in enumerate(data.groupby('author_email')):
    group = group.copy()  # work on a copy so that adding a column does not touch `data`
    group['label'] = i  # unique integer label per committer
    if random_allocation[i] == 0:
        train_set = pd.concat([train_set, group])
    elif random_allocation[i] == 1:
        validate_set = pd.concat([validate_set, group])
    elif random_allocation[i] == 2:
        test_set = pd.concat([test_set, group])

train_set.reset_index(drop=True, inplace=True)
validate_set.reset_index(drop=True, inplace=True)
test_set.reset_index(drop=True, inplace=True)

### Check for Overlapping Projects

In [64]:
projects = []
projects.append(list(train_set['project'].unique()))
projects.append(list(validate_set['project'].unique()))
projects.append(list(test_set['project'].unique()))

train_val_overlap = len(set(projects[0]) & set(projects[1]))
train_test_overlap = len(set(projects[0]) & set(projects[2]))
val_test_overlap = len(set(projects[1]) & set(projects[2]))

print(f"There are {train_val_overlap} projects that occur in both train and validate set.")
print(f"There are {train_test_overlap} projects that occur in both train and test set.")
print(f"There are {val_test_overlap} projects that occur in both validate and test set.")

There are 23 projects that occur in both train and validate set.
There are 4 projects that occur in both train and test set.
There are 1 projects that occur in both validate and test set.
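
Projects can be shared between subsets because several committers work on the same project. The property this split was designed for is different: no author should appear in more than one subset. A minimal sketch of such a check follows. The helper `shared_values` and the toy e-mail addresses are hypothetical; in the notebook, the lists would be replaced by `train_set['author_email']`, `validate_set['author_email']` and `test_set['author_email']`.

```python
from itertools import combinations


def shared_values(subsets):
    """Return, for every pair of named subsets, the values they have in common."""
    overlaps = {}
    for (name_a, values_a), (name_b, values_b) in combinations(subsets.items(), 2):
        overlaps[(name_a, name_b)] = set(values_a) & set(values_b)
    return overlaps


# Toy data standing in for the author_email columns of the three subsets
toy_authors = {
    'train': ['a@example.com', 'b@example.com', 'a@example.com'],
    'validate': ['c@example.com'],
    'test': ['d@example.com', 'd@example.com'],
}

author_overlaps = shared_values(toy_authors)
for pair, shared in author_overlaps.items():
    print(pair, len(shared))
```

Every pair should report an empty intersection if the author-level split worked as intended.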


### Resulting Train Set

In [65]:
authors_count = len(train_set['author_email'].unique())
projects_count = len(train_set['project'].unique())

print('Number of different Authors in the train set: ' + str(authors_count))
print('Number of different Projects in the train set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(train_set) / authors_count, 2)))
train_set

Number of different Authors in the train set: 28
Number of different Projects in the train set: 575
Average amount of commit messages per author: 1694.21


Unnamed: 0,message,author_email,project,label
0,calcs/hazard/event_based/post_processing:\n\nM...,Lars.Butler@gmail.com,gem_oq-engine,0.0
1,added javadoc heading to hdf5 util class\n\n\n...,Lars.Butler@gmail.com,gem_oq-engine,0.0
2,added missing imports in db_tests/__init__.py ...,Lars.Butler@gmail.com,gem_oq-engine,0.0
3,"Fixed up a longer-running test, added slow attr",Lars.Butler@gmail.com,gem_oq-engine,0.0
4,calculators/hazard/event_based/core_next:\n\nR...,Lars.Butler@gmail.com,gem_oq-engine,0.0
...,...,...,...,...
47433,"Fixed ""is a"" op with Ident",tj@vision-media.ca,stylus_stylus,40.0
47434,removed old dynamic helper logic from the view...,tj@vision-media.ca,expressjs_express,40.0
47435,fixed property error due to parser not being p...,tj@vision-media.ca,stylus_stylus,40.0
47436,Fixed connect middleware for <I>.x,tj@vision-media.ca,stylus_stylus,40.0


### Resulting Validate Set

In [66]:
authors_count = len(validate_set['author_email'].unique())
projects_count = len(validate_set['project'].unique())

print('Number of different Authors in the validate set: ' + str(authors_count))
print('Number of different Projects in the validate set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(validate_set) / authors_count, 2)))
validate_set

Number of different Authors in the validate set: 7
Number of different Projects in the validate set: 136
Average amount of commit messages per author: 1457.57


Unnamed: 0,message,author_email,project,label
0,refactor: getFrames,avwu@qq.com,avwo_whistle,5.0
1,refactor: websocket,avwu@qq.com,avwo_whistle,5.0
2,refactor: add timeout event,avwu@qq.com,avwo_whistle,5.0
3,feat: Support for getting random port of ui se...,avwu@qq.com,avwo_whistle,5.0
4,refactor: Change getClientIp to req.clientIp,avwu@qq.com,avwo_whistle,5.0
...,...,...,...,...
10198,Rename LiSEtest to SimTest,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10199,Turn character.StatMapping into a Signal,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10200,Decruft the old unused _no_use_canvas property...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10201,logging for dummy\n\nTo more rapidly identify ...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0


### Resulting Test Set

In [67]:
authors_count = len(test_set['author_email'].unique())
projects_count = len(test_set['project'].unique())

print('Number of different Authors in the test set: ' + str(authors_count))
print('Number of different Projects in the test set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(test_set) / authors_count, 2)))
test_set

Number of different Authors in the test set: 7
Number of different Projects in the test set: 91
Average amount of commit messages per author: 1526.29


Unnamed: 0,message,author_email,project,label
0,"API CHANGE Removed Member::init_db_fields(), i...",ingo@silverstripe.com,silverstripe_silverstripe-framework,15.0
1,Set omnipay response earlier in PurchaseServic...,ingo@silverstripe.com,silverstripe_silverstripe-omnipay,15.0
2,MINOR Protection against infinite initializati...,ingo@silverstripe.com,silverstripe_silverstripe-framework,15.0
3,Allowing success and error callbacks in refresh(),ingo@silverstripe.com,silverstripe_silverstripe-framework,15.0
4,Better shell execution feedback from PDF extra...,ingo@silverstripe.com,silverstripe_silverstripe-textextraction,15.0
...,...,...,...,...
10679,Apply uupdates to the dropfile routine to salt...,thatch45@gmail.com,saltstack_salt,38.0
10680,Remove esky errors because they only confuse %...,thatch45@gmail.com,saltstack_salt,38.0
10681,Add event firing to salt-ssh,thatch45@gmail.com,saltstack_salt,38.0
10682,Fix #<I>\n\nSorry about the long wait on this ...,thatch45@gmail.com,saltstack_salt,38.0


The average number of commit messages per author cannot be perfectly balanced across the three subsets. A few committers have significantly more commit messages than the others, and since 70 % of the authors are allocated to the train set, these committers are most likely to end up there.

### Save all three Dataframes

In [68]:
train_set.to_pickle('../data/04a_Train_Set.pkl')
validate_set.to_pickle('../data/04b_Validate_Set.pkl')
test_set.to_pickle('../data/04c_Test_Set.pkl')

## Split a Second Time

This time, every subset should contain commit messages of every author, so that the data can be used to predict the author of a message.

In [74]:
train_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
validate_set = pd.DataFrame(columns=['message', 'author_email', 'project'])
test_set = pd.DataFrame(columns=['message', 'author_email', 'project'])

# No seed is set in this cell, so the allocation depends on the current global NumPy random state.
for i, (_, group) in enumerate(data.groupby('author_email')):
    group = group.reset_index()  # keeps the old index as an 'index' column, dropped below
    group['label'] = i  # unique integer label per committer
    # Draw a subset (0 = train, 1 = validate, 2 = test) for every message of this author
    random_allocation = np.random.choice(3, len(group), p=[0.7, 0.15, 0.15])
    # Select all rows of a subset at once instead of concatenating row by row
    train_set = pd.concat([train_set, group[random_allocation == 0]])
    validate_set = pd.concat([validate_set, group[random_allocation == 1]])
    test_set = pd.concat([test_set, group[random_allocation == 2]])

train_set.reset_index(drop=True, inplace=True)
validate_set.reset_index(drop=True, inplace=True)
test_set.reset_index(drop=True, inplace=True)

train_set.drop(columns=['index'], inplace=True)
validate_set.drop(columns=['index'], inplace=True)
test_set.drop(columns=['index'], inplace=True)
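
The cell above draws a subset for each message independently, so each author's share is only approximately 70/15/15. If exact per-author proportions were wanted, one option would be to shuffle the row positions of an author and cut them at fixed points. This is a sketch under that assumption, not what the notebook does; `split_indices` is a hypothetical helper, and any rounding remainder goes to the last subset.

```python
import numpy as np


def split_indices(n_rows, fractions=(0.7, 0.15, 0.15), rng=None):
    """Shuffle the positions 0..n_rows-1 and cut them into consecutive parts."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(n_rows)
    # Cut points after the first and second part; the last part takes the rest
    cuts = np.cumsum([int(round(f * n_rows)) for f in fractions[:-1]])
    return np.split(order, cuts)


rng = np.random.default_rng(0)
train_idx, validate_idx, test_idx = split_indices(100, rng=rng)
print(len(train_idx), len(validate_idx), len(test_idx))  # 70 15 15
```

The returned position arrays could then be passed to `.iloc` on one author's rows to obtain that author's train, validate and test messages.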

### Resulting Train Set

In [82]:
authors_count = len(train_set['author_email'].unique())
projects_count = len(train_set['project'].unique())

print('Number of different Authors in the train set: ' + str(authors_count))
print('Number of different Projects in the train set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(train_set) / authors_count, 2)))
train_set

Number of different Authors in the train set: 42
Number of different Projects in the train set: 693
Average amount of commit messages per author: 1132.79


Unnamed: 0,message,author_email,project,label
0,added missing imports in db_tests/__init__.py ...,Lars.Butler@gmail.com,gem_oq-engine,0.0
1,"Fixed up a longer-running test, added slow attr",Lars.Butler@gmail.com,gem_oq-engine,0.0
2,site: revised SiteCollection doc about depth info,Lars.Butler@gmail.com,gem_oq-engine,0.0
3,export/risk:\n\n`event_loss` tables (csv) are ...,Lars.Butler@gmail.com,gem_oq-engine,0.0
4,Added missing site_model schema declaration,Lars.Butler@gmail.com,gem_oq-engine,0.0
...,...,...,...,...
47572,Remove the last vestiges of numpy dependency f...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
47573,Implement character_portals_diff,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
47574,Rename LiSEtest to SimTest,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
47575,Decruft the old unused _no_use_canvas property...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0


### Resulting Validate Set

In [83]:
authors_count = len(validate_set['author_email'].unique())
projects_count = len(validate_set['project'].unique())

print('Number of different Authors in the validate set: ' + str(authors_count))
print('Number of different Projects in the validate set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(validate_set) / authors_count, 2)))
validate_set

Number of different Authors in the validate set: 42
Number of different Projects in the validate set: 451
Average amount of commit messages per author: 246.19


Unnamed: 0,message,author_email,project,label
0,added javadoc heading to hdf5 util class\n\n\n...,Lars.Butler@gmail.com,gem_oq-engine,0.0
1,calculators/hazard/event_based/core_next:\n\nR...,Lars.Butler@gmail.com,gem_oq-engine,0.0
2,"engine:\n\nWhen getting risk calculations, ord...",Lars.Butler@gmail.com,gem_oq-engine,0.0
3,added missing docstring\n\n\nFormer-commit-id:...,Lars.Butler@gmail.com,gem_oq-engine,0.0
4,db/models:\n\nFixed a typo in a comment.,Lars.Butler@gmail.com,gem_oq-engine,0.0
...,...,...,...,...
10335,correct that caching method so that it removes...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10336,Add another negative case to test_contents,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10337,re-fix flickering knock on wood,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10338,Correct some docstrings and pep8,zacharyspector@gmail.com,LogicalDash_LiSE,41.0


### Resulting Test Set

In [84]:
authors_count = len(test_set['author_email'].unique())
projects_count = len(test_set['project'].unique())

print('Number of different Authors in the test set: ' + str(authors_count))
print('Number of different Projects in the test set: ' + str(projects_count))
print('Average amount of commit messages per author: ' + str(round(len(test_set) / authors_count, 2)))
test_set

Number of different Authors in the test set: 42
Number of different Projects in the test set: 461
Average amount of commit messages per author: 247.81


Unnamed: 0,message,author_email,project,label
0,calcs/hazard/event_based/post_processing:\n\nM...,Lars.Butler@gmail.com,gem_oq-engine,0.0
1,hazard/writers:\n\nAdded pretty printing for h...,Lars.Butler@gmail.com,gem_oq-engine,0.0
2,version info now uses utc time to make applica...,Lars.Butler@gmail.com,gem_oq-engine,0.0
3,calcs/hazard/general:\n\nOrder results by id i...,Lars.Butler@gmail.com,gem_oq-engine,0.0
4,tests/calcs/risk/classical/core_test:\n\nUpdat...,Lars.Butler@gmail.com,gem_oq-engine,0.0
...,...,...,...,...
10403,Make the portal patcher iterate over portal ke...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10404,Change PawnSpot finalization so it doesn't rel...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10405,use the adapter since that's what holds the se...,zacharyspector@gmail.com,LogicalDash_LiSE,41.0
10406,Remove stale card widgets when switching rules,zacharyspector@gmail.com,LogicalDash_LiSE,41.0


### Save all three Dataframes

In [85]:
train_set.to_pickle('../data/05a_Authors_Train_Set.pkl')
validate_set.to_pickle('../data/05b_Authors_Validate_Set.pkl')
test_set.to_pickle('../data/05c_Authors_Test_Set.pkl')