# Generate Most Frequent Committers Subset

Goal: Include all messages from committers who have more than 1000 commits.

#### Load Data

In [1]:
import pandas as pd

data = pd.read_csv("../data/01_Original.csv")
data.head(3)

Unnamed: 0,hash,diff,message,author_email,author_name,committer_email,committer_name,project,split
0,1de640cc59b4b3030447d567b3c99c50777bd760,a/setup.py b/setup.py\nindex <HASH>..<HASH> 1...,setup: Detect if wheel and twine installed,gcushen@users.noreply.github.com,George Cushen,gcushen@users.noreply.github.com,George Cushen,gcushen_mezzanine-api,train
1,c1cce6fe5e49df5546c30a662fd141d41f4fc389,a/Builder.php b/Builder.php\nindex <HASH>..<H...,[Builder] Adding root page in any case,g.passault@gmail.com,Gregwar,g.passault@gmail.com,Gregwar,Gregwar_Slidey,train
2,2f7d97d15ea41f4112e74429617c5daad740d7cc,a/web.go b/web.go\nindex <HASH>..<HASH> 10064...,Added web.Urlencode method,hoisie@gmail.com,Michael Hoisie,hoisie@gmail.com,Michael Hoisie,hoisie_web,train


In [2]:
data.drop(['hash', 'diff', 'committer_email', 'author_name', 'committer_name', 'split'], axis=1, inplace=True)

#### Take the subset

The dataset exploration showed that exactly 42 committers, identified by `author_email`, have more than 1000 commits.

In [11]:
from collections import Counter

author_count = Counter(data["author_email"])
included_authors_counter = author_count.most_common(42)
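
The value 42 is hard-coded from the exploration step. As a minimal sketch of an alternative, the cut-off can be applied to the counts directly, so the selection stays correct if the data changes. The toy e-mail addresses and the `threshold` name are made up for illustration; the notebook's threshold is 1000.

```python
from collections import Counter

# Toy stand-in for data["author_email"]; these values are invented.
emails = ["a@x.org"] * 5 + ["b@x.org"] * 3 + ["c@x.org"] * 1
threshold = 2  # hypothetical; the notebook uses 1000

author_count = Counter(emails)
# most_common() without an argument returns all authors, sorted by count.
frequent = [(email, n) for email, n in author_count.most_common() if n > threshold]
print(frequent)  # [('a@x.org', 5), ('b@x.org', 3)]
```

With this approach `len(frequent)` gives the number of included authors instead of it being an input.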

In [4]:
author_count.most_common(43)

[('michele.simionato@gmail.com', 4991),
 ('thomas.parrott@canonical.com', 3058),
 ('crynobone@gmail.com', 3054),
 ('igor.kroitor@gmail.com', 2916),
 ('jaraco@jaraco.com', 2899),
 ('postmodern.mod3@gmail.com', 2669),
 ('github@contao.org', 2285),
 ('mark@mark-story.com', 2082),
 ('thatch45@gmail.com', 1976),
 ('ccordoba12@gmail.com', 1932),
 ('pedro@algarvio.me', 1907),
 ('mjpt777@gmail.com', 1817),
 ('mitchell.hashimoto@gmail.com', 1751),
 ('moodler', 1688),
 ('ns@vivid-planet.com', 1576),
 ('marijnh@gmail.com', 1500),
 ('aaron.patterson@gmail.com', 1450),
 ('blactbt@live.de', 1379),
 ('ocramius@gmail.com', 1362),
 ('P.Rudiger@ed.ac.uk', 1356),
 ('duncan.macleod@ligo.org', 1278),
 ('fabien.potencier@gmail.com', 1243),
 ('fisharebest@gmail.com', 1219),
 ('Lars.Butler@gmail.com', 1215),
 ('matijs@matijs.net', 1207),
 ('none@none', 1176),
 ('skodak', 1155),
 ('andreas@one.com', 1152),
 ('palehose@gmail.com', 1138),
 ('jmettraux@gmail.com', 1125),
 ('anacrolix@gmail.com', 1119),
 ('mike@si

#### Transform to a list

In [6]:
included_author_emails = []

for email, count in included_authors_counter:
    # Redundant safeguard: most_common(42) already returns only authors with more than 1000 commits
    if count > 1000:
        included_author_emails.append(email)

included_author_emails

#### Filter the dataframe

In [13]:
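# where() keeps the rows of the given author and sets every other row to NaN
subset = data.where(data['author_email'] == included_author_emails[0])

# Append the masked frame of each remaining author, in order of commit count
for author_email in included_author_emails[1:]:
    subset = pd.concat([subset, data.where(data['author_email'] == author_email)])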

In [14]:
# Drop the NaN placeholder rows left by where(); this also drops any row with a missing value
subset = subset.dropna()
subset.reset_index(drop=True, inplace=True)
subset

Unnamed: 0,message,author_email,project
0,Fixed an error happening when the memory stats...,michele.simionato@gmail.com,gem_oq-engine
1,Updated setup.py [skip CI],michele.simionato@gmail.com,micheles_decorator
2,Fixed an exposure test [skip hazardlib],michele.simionato@gmail.com,gem_oq-engine
3,Added a correction factor of <I> to point sour...,michele.simionato@gmail.com,gem_oq-engine
4,Added a comment\n\n\nFormer-commit-id: 1dcd<I>...,michele.simionato@gmail.com,gem_oq-engine
...,...,...,...
68320,lxc/remote: Show the fingerprint as string not...,stgraber@ubuntu.com,lxc_lxd
68321,lxd-p2c: Update to changed cert functions,stgraber@ubuntu.com,lxc_lxd
68322,networks: Extend allowed character set for int...,stgraber@ubuntu.com,lxc_lxd
68323,Don't grab addresses from public remotes,stgraber@ubuntu.com,lxc_lxd
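
The filter above masks the whole frame once per author and then removes the placeholder rows. A shorter variant, sketched here on made-up data, selects the rows in one pass with `Series.isin`. It is not equivalent in every respect: rows stay in their original order rather than being grouped by author, and rows with a missing message are kept unless `dropna()` is called separately.

```python
import pandas as pd

# Toy frame standing in for `data`; all values are invented.
data = pd.DataFrame({
    "message": ["fix a", "fix b", None, "fix d"],
    "author_email": ["a@x.org", "b@x.org", "a@x.org", "c@x.org"],
})
included_author_emails = ["a@x.org", "b@x.org"]

# Boolean mask: True where the row's author is in the included list.
mask = data["author_email"].isin(included_author_emails)
subset = data[mask].reset_index(drop=True)
print(subset)
```

To reproduce the author-grouped ordering of the loop, the result would still need to be sorted by author.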


#### Check whether the sum is correct

In [15]:
sum([count for _, count in included_authors_counter])

68325

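The 42 included authors account for 68325 commits. The filtered dataframe ends at index 68323, i.e. 68324 rows, one fewer than this total. A likely cause is that `dropna()` also removed a row with a missing value in one of the kept columns, so the counter total and the row count should be compared explicitly.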
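
A total taken from the counter can differ from the number of rows that survive the filter, because `dropna()` removes any row with a missing value in any column. The following sketch reproduces that effect on made-up data using the same `where`/`concat`/`dropna` steps; the frame contents are assumptions for illustration only.

```python
import pandas as pd
from collections import Counter

# Toy frame; one commit of a@x.org has no message.
data = pd.DataFrame({
    "message": ["fix a", None, "fix c", "fix d"],
    "author_email": ["a@x.org", "a@x.org", "b@x.org", "c@x.org"],
})
included_author_emails = ["a@x.org", "b@x.org"]

# Total according to the counter: 2 commits for a@x.org plus 1 for b@x.org.
author_count = Counter(data["author_email"])
expected = sum(author_count[email] for email in included_author_emails)

# Same filtering steps as in the notebook.
masked = [data.where(data["author_email"] == email) for email in included_author_emails]
subset = pd.concat(masked).dropna().reset_index(drop=True)

print(expected, len(subset))  # 3 2
```

The row with the missing message is counted by the counter but dropped from the subset, which is why comparing `expected` with `len(subset)` is a stricter check than summing the counter alone.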

#### Save dataset

In [16]:
subset.to_pickle('../data/03_Subset_Frequent_Committers.pkl')
subset.to_csv('../data/03_Subset_Frequent_Committers.csv')