# Generate Most Frequent Committers Subset

Goal: Include all messages from committers who have more than 1000 commits.

#### Load Data

In [2]:
import pandas as pd

data = pd.read_csv("data/01_Original.csv")
data.head(3)

Unnamed: 0,hash,diff,message,author_email,author_name,committer_email,committer_name,project,split
0,1de640cc59b4b3030447d567b3c99c50777bd760,a/setup.py b/setup.py\nindex <HASH>..<HASH> 1...,setup: Detect if wheel and twine installed,gcushen@users.noreply.github.com,George Cushen,gcushen@users.noreply.github.com,George Cushen,gcushen_mezzanine-api,train
1,c1cce6fe5e49df5546c30a662fd141d41f4fc389,a/Builder.php b/Builder.php\nindex <HASH>..<H...,[Builder] Adding root page in any case,g.passault@gmail.com,Gregwar,g.passault@gmail.com,Gregwar,Gregwar_Slidey,train
2,2f7d97d15ea41f4112e74429617c5daad740d7cc,a/web.go b/web.go\nindex <HASH>..<HASH> 10064...,Added web.Urlencode method,hoisie@gmail.com,Michael Hoisie,hoisie@gmail.com,Michael Hoisie,hoisie_web,train


In [3]:
# Drop the columns that are not needed for the committer subset.
data = data.drop(['hash', 'diff', 'author_email', 'author_name', 'committer_name', 'split'], axis=1)

#### Take the subset

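The Dataset Exploration showed that exactly 50 committer email addresses have more than 1000 commits.

However, the address noreply@github.com is excluded because it carries no information about the committer. The `[1:]` slice in the next cell drops the first entry of `most_common(50)`, which assumes this address is the most frequent one, so 49 committers remain.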

In [4]:
from collections import Counter

committer_count = Counter(data["committer_email"])
included_committers_counter = committer_count.most_common(50)[1:]
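The positional slice `[1:]` only works if noreply@github.com is the most frequent address. As a minimal sketch of an alternative, the selection below excludes the address by name and applies the commit threshold directly. The toy email list and the helper name `frequent_committers` are hypothetical and not part of the notebook.

```python
from collections import Counter


def frequent_committers(emails, min_commits, excluded=("noreply@github.com",)):
    """Return (email, count) pairs with more than min_commits commits, most frequent first."""
    counts = Counter(emails)
    return [
        (email, count)
        for email, count in counts.most_common()
        if count > min_commits and email not in excluded
    ]


# Toy data standing in for data["committer_email"] (hypothetical values).
toy_emails = (
    ["noreply@github.com"] * 5
    + ["a@example.com"] * 4
    + ["b@example.com"] * 3
    + ["c@example.com"] * 1
)

toy_selected = frequent_committers(toy_emails, min_commits=2)
print(toy_selected)
```

On the real data this would presumably be called with `min_commits=1000`, which avoids hard-coding both the count of 50 and the position of the excluded address.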

#### Transform to a list

In [5]:
# Keep only the email address from each (email, count) pair.
included_committer_emails = [email for email, _ in included_committers_counter]

#### Filter the dataframe

In [6]:
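# where() keeps the frame's shape and sets every non-matching row to NaN;
# those all-NaN rows are removed by dropna() in the next cell.
subset = data.where(data['committer_email'] == included_committer_emails[0])

# Append the rows of each remaining committer, most frequent first.
for committer_email in included_committer_emails[1:]:
    subset = pd.concat([subset, data.where(data['committer_email'] == committer_email)])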
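The same filtering can be expressed as a single boolean mask with `Series.isin`. The sketch below uses a small hypothetical frame rather than the real data. Unlike the loop, it keeps rows in their original order instead of grouping them by committer, and it creates no intermediate NaN rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`.
toy = pd.DataFrame({
    "message": ["m1", "m2", "m3", "m4"],
    "committer_email": [
        "a@example.com",
        "x@example.com",
        "b@example.com",
        "a@example.com",
    ],
})
wanted = ["a@example.com", "b@example.com"]

# True for every row whose committer is in the wanted list.
mask = toy["committer_email"].isin(wanted)
toy_subset = toy[mask].reset_index(drop=True)
print(toy_subset)
```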

In [10]:
subset = subset.dropna()
subset.reset_index(drop=True, inplace=True)
subset

Unnamed: 0,message,committer_email,project
0,Fixed an error happening when the memory stats...,michele.simionato@gmail.com,gem_oq-engine
1,Updated setup.py [skip CI],michele.simionato@gmail.com,micheles_decorator
2,Fixed an exposure test [skip hazardlib],michele.simionato@gmail.com,gem_oq-engine
3,Added a correction factor of <I> to point sour...,michele.simionato@gmail.com,gem_oq-engine
4,Added a comment\n\n\nFormer-commit-id: 1dcd<I>...,michele.simionato@gmail.com,gem_oq-engine
...,...,...,...
78140,refactor: remove x-whistle-https-request,avwu@qq.com,avwo_whistle
78141,refactor: hasProtocol,avwu@qq.com,avwo_whistle
78142,refactor: Add img.onerror,avwu@qq.com,avwo_whistle
78143,"reqWrite, resWrite",avwu@qq.com,avwo_whistle


#### Check whether the sum is correct

In [11]:
sum([count for _, count in included_committers_counter])

78145

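The counts sum to 78145. The subset shown above ends at index 78143, i.e. 78144 rows, which is one fewer than the sum. This would be consistent with `dropna()` having also removed one row with a missing value, for example an empty message.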
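A check along these lines can also compare the summed counts with the actual number of rows and count the rows that contain missing values. The sketch below uses hypothetical toy data, and its variable names are illustrative only.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy frame; one selected row has a missing message.
toy = pd.DataFrame({
    "message": ["m1", None, "m3", "m4"],
    "committer_email": [
        "a@example.com",
        "a@example.com",
        "b@example.com",
        "x@example.com",
    ],
})
wanted = ["a@example.com", "b@example.com"]

# Number of rows expected from the per-committer counts.
counts = Counter(toy["committer_email"])
expected_rows = sum(counts[email] for email in wanted)

# Rows actually selected, and how many of them dropna() would remove.
selected = toy[toy["committer_email"].isin(wanted)]
rows_with_missing = int(selected.isna().any(axis=1).sum())
rows_after_dropna = len(selected.dropna())

print(expected_rows, rows_with_missing, rows_after_dropna)
```

If the expected count equals the rows after `dropna()` plus the rows with missing values, the difference is fully explained by dropped rows.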

#### Save dataset

In [14]:
subset.to_pickle('data/03_Subset_Frequent_Committers.pkl')
subset.to_csv('data/03_Subset_Frequent_Committers.csv')
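By default, `to_csv` writes the index as an unnamed first column, and reading such a file back yields a column named `Unnamed: 0`. The sketch below illustrates this on a hypothetical toy frame, using in-memory buffers instead of the notebook's file paths. Whether the index should be kept in the saved file is a choice left to the notebook.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for `subset`.
toy = pd.DataFrame({"message": ["m1", "m2"], "project": ["p1", "p2"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False omits the index from the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```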