# Generate Most Frequent Committers Subset

Goal: Include all messages from the 100 committers who have the most commit messages.

#### Load Data

In [1]:
import pandas as pd

data = pd.read_pickle("../data/02_All_Decreased_Filesize.pkl")
data.head(3)

| | message | author_email | project |
|---|---|---|---|
| 0 | setup: Detect if wheel and twine installed | gcushen@users.noreply.github.com | gcushen_mezzanine-api |
| 1 | [Builder] Adding root page in any case | g.passault@gmail.com | Gregwar_Slidey |
| 2 | Added web.Urlencode method | hoisie@gmail.com | hoisie_web |


#### Take the subset

For the final model, only the 100 committers with the largest number of commit messages are used.
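
To make the ranking step concrete, here is a minimal sketch on made-up data. The `toy` dataframe and its addresses are hypothetical. It shows how `Counter.most_common` returns `(value, count)` pairs sorted by descending count, and how the same ranking can be obtained with pandas' `value_counts`.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the real commit data (all values are made up).
toy = pd.DataFrame({
    "message": ["fix a", "fix b", "add c", "add d", "doc e", "doc f"],
    "author_email": ["a@x.org", "b@x.org", "a@x.org", "c@x.org", "a@x.org", "b@x.org"],
})

# Counter tallies each address; most_common(n) returns the n largest
# (address, count) pairs, sorted by descending count.
toy_counts = Counter(toy["author_email"])
top_two = toy_counts.most_common(2)

# Keep only the addresses, dropping the counts.
top_two_emails = [email for email, _ in top_two]

# pandas-only way to get the same ranking.
top_two_pandas = toy["author_email"].value_counts().head(2)

print(top_two)
print(top_two_emails)
```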

In [2]:
from collections import Counter

author_count = Counter(data["author_email"])
included_authors_counter = author_count.most_common(100)

In [3]:
author_count.most_common(100)

[('michele.simionato@gmail.com', 4991),
 ('thomas.parrott@canonical.com', 3058),
 ('crynobone@gmail.com', 3054),
 ('igor.kroitor@gmail.com', 2916),
 ('jaraco@jaraco.com', 2899),
 ('postmodern.mod3@gmail.com', 2669),
 ('github@contao.org', 2285),
 ('mark@mark-story.com', 2082),
 ('thatch45@gmail.com', 1976),
 ('ccordoba12@gmail.com', 1932),
 ('pedro@algarvio.me', 1907),
 ('mjpt777@gmail.com', 1817),
 ('mitchell.hashimoto@gmail.com', 1751),
 ('moodler', 1688),
 ('ns@vivid-planet.com', 1576),
 ('marijnh@gmail.com', 1500),
 ('aaron.patterson@gmail.com', 1450),
 ('blactbt@live.de', 1379),
 ('ocramius@gmail.com', 1362),
 ('P.Rudiger@ed.ac.uk', 1356),
 ('duncan.macleod@ligo.org', 1278),
 ('fabien.potencier@gmail.com', 1243),
 ('fisharebest@gmail.com', 1219),
 ('Lars.Butler@gmail.com', 1215),
 ('matijs@matijs.net', 1207),
 ('none@none', 1176),
 ('skodak', 1155),
 ('andreas@one.com', 1152),
 ('palehose@gmail.com', 1138),
 ('jmettraux@gmail.com', 1125),
 ('anacrolix@gmail.com', 1119),
 ('mike@si

#### Transform to a list

In [4]:
included_author_emails = [author[0] for author in included_authors_counter]
included_author_emails

['michele.simionato@gmail.com',
 'thomas.parrott@canonical.com',
 'crynobone@gmail.com',
 'igor.kroitor@gmail.com',
 'jaraco@jaraco.com',
 'postmodern.mod3@gmail.com',
 'github@contao.org',
 'mark@mark-story.com',
 'thatch45@gmail.com',
 'ccordoba12@gmail.com',
 'pedro@algarvio.me',
 'mjpt777@gmail.com',
 'mitchell.hashimoto@gmail.com',
 'moodler',
 'ns@vivid-planet.com',
 'marijnh@gmail.com',
 'aaron.patterson@gmail.com',
 'blactbt@live.de',
 'ocramius@gmail.com',
 'P.Rudiger@ed.ac.uk',
 'duncan.macleod@ligo.org',
 'fabien.potencier@gmail.com',
 'fisharebest@gmail.com',
 'Lars.Butler@gmail.com',
 'matijs@matijs.net',
 'none@none',
 'skodak',
 'andreas@one.com',
 'palehose@gmail.com',
 'jmettraux@gmail.com',
 'anacrolix@gmail.com',
 'mike@silverorange.com',
 'p@shedcollective.org',
 'tj@vision-media.ca',
 'hajimehoshi@gmail.com',
 'ronnie@dio.jp',
 'jerome@leclan.ch',
 'zacharyspector@gmail.com',
 'j.boggiano@seld.be',
 'avwu@qq.com',
 'ingo@silverstripe.com',
 'stgraber@ubuntu.com',
 

#### Filter the dataframe

In [9]:
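# Select each top author's rows in ranking order, then concatenate once.
# Boolean indexing keeps only the matching rows, whereas data.where() keeps
# every row and fills the non-matching ones with NaN.
subset = pd.concat(
    [data[data['author_email'] == author_email] for author_email in included_author_emails]
)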

In [10]:
# Drop rows that contain missing values, then renumber the index from 0.
subset = subset.dropna()
subset.reset_index(drop=True, inplace=True)
subset

| | message | author_email | project |
|---|---|---|---|
| 0 | Fixed an error happening when the memory stats... | michele.simionato@gmail.com | gem_oq-engine |
| 1 | Updated setup.py [skip CI] | michele.simionato@gmail.com | micheles_decorator |
| 2 | Fixed an exposure test [skip hazardlib] | michele.simionato@gmail.com | gem_oq-engine |
| 3 | Added a correction factor of <I> to point sour... | michele.simionato@gmail.com | gem_oq-engine |
| 4 | Added a comment\n\n\nFormer-commit-id: 1dcd<I>... | michele.simionato@gmail.com | gem_oq-engine |
| ... | ... | ... | ... |
| 112799 | Prevent crash of test in not supported envs | medyk@medikoo.com | medikoo_dom-ext |
| 112800 | Fix definitions orders\n\nUneven states of Sym... | medyk@medikoo.com | medikoo_es6-symbol |
| 112801 | Ensure to not leave orphaned async call | medyk@medikoo.com | serverless_serverless |
| 112802 | refactor(CLI): Do not notify of update when ne... | medyk@medikoo.com | serverless_serverless |
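
The same selection can also be written as a single vectorised filter with `Series.isin`. The sketch below uses made-up data; `toy` and `keep` are hypothetical stand-ins for `data` and `included_author_emails`. Two differences from the cells above are worth noting: rows stay in their original order instead of being grouped by author, and rows with missing values are kept unless `dropna()` is called afterwards.

```python
import pandas as pd

# Toy stand-in for the real commit data (all values are made up).
toy = pd.DataFrame({
    "message": ["m0", "m1", None, "m3", "m4"],
    "author_email": ["a@x.org", "b@x.org", "a@x.org", "c@x.org", "a@x.org"],
})
keep = ["a@x.org", "c@x.org"]  # hypothetical stand-in for included_author_emails

# One pass over the column: True where the author is in the keep list.
mask = toy["author_email"].isin(keep)

# Boolean indexing keeps matching rows in their original order.
toy_subset = toy[mask].reset_index(drop=True)

# Optional: mirror the notebook's behaviour of discarding incomplete rows.
toy_subset_complete = toy_subset.dropna().reset_index(drop=True)

print(toy_subset)
```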


#### Check whether the sum is correct

In [11]:
sum([count for _, count in included_authors_counter])

112804

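The counts of the 100 selected committers add up to 112804 messages. The filtered dataframe above ends at index 112802, so it holds 112803 rows, one fewer than this sum. A likely explanation is that `dropna()` also removed a row from a selected committer that contained a missing value.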
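
Summing the counter only gives the expected number of rows; comparing it with the length of the filtered dataframe checks the filter itself. A minimal sketch on made-up data (the `toy` dataframe is hypothetical), showing how `dropna()` can make the two numbers differ:

```python
from collections import Counter

import pandas as pd

# Toy stand-in: author a@x.org has one commit with a missing message.
toy = pd.DataFrame({
    "message": ["m0", None, "m2", "m3"],
    "author_email": ["a@x.org", "a@x.org", "b@x.org", "c@x.org"],
})

counts = Counter(toy["author_email"])
top = counts.most_common(1)  # only the single most frequent author

# Number of rows the subset should contain according to the counter.
expected_rows = sum(count for _, count in top)

# Rows actually selected, before and after discarding incomplete rows.
selected = toy[toy["author_email"].isin([email for email, _ in top])]
rows_before_dropna = len(selected)
rows_after_dropna = len(selected.dropna())

print(expected_rows, rows_before_dropna, rows_after_dropna)
```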

#### Save dataset

In [12]:
subset.to_pickle('../data/03a_Authors_Subset.pkl')
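
A quick way to confirm that a pickle was written intact is to read it back and compare it with the in-memory dataframe. The sketch below does this for a made-up dataframe in a temporary directory; the file name `toy_subset.pkl` is hypothetical and unrelated to the project's data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the filtered subset (all values are made up).
toy_subset = pd.DataFrame({
    "message": ["m0", "m1"],
    "author_email": ["a@x.org", "b@x.org"],
    "project": ["p0", "p1"],
})

# Write to a temporary location, then load the file again.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_subset.pkl"
    toy_subset.to_pickle(path)
    restored = pd.read_pickle(path)

# DataFrame.equals compares shape, dtypes, index and values.
round_trip_ok = restored.equals(toy_subset)
print(round_trip_ok)
```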