# Microtask 1
---
Produce a listing of the number of new committers per month, and the number of commits for each of them, as a table and as a CSV file. Use the GrimoireLab enriched index for git.

In [1]:
from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

import subprocess
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

This notebook assumes an Elasticsearch instance is running locally at http://localhost:9200.

In [2]:
# elasticsearch instance
es = Elasticsearch('http://localhost:9200', verify_certs=False)

The [scikit-learn](https://github.com/scikit-learn/scikit-learn) repository is used as the data source.

In [3]:
# scikit-learn repo
repo = 'https://github.com/scikit-learn/scikit-learn.git'

Run `p2o.py` with the arguments below to build the raw index (`sklearn_raw`) and the enriched index (`sklearn`) for the scikit-learn repository. The rest of the notebook queries the enriched index.

In [None]:
subprocess.run(['p2o.py', '--enrich', '--index', 'sklearn_raw', '--index-enrich', 'sklearn', '-e', 'http://localhost:9200', '--no_inc', '--debug', 'git', repo])

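Build the query with the `elasticsearch_dsl` module. A `terms` aggregation groups the commits by `author_name`, and a `min` sub-aggregation on `author_date` gives the date of each author's first commit.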

In [4]:
# frame a query
s = Search(using=es, index='sklearn')
s.aggs.bucket('by_authors', 'terms', field='author_name', size=15000).metric('first_commit', 'min', field='author_date')
s = s.sort('author_date')

# execute the query
result = s.execute()
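For readers more familiar with the raw Elasticsearch query DSL, the request built above corresponds roughly to the body below. It is written as a plain Python dict so it can be inspected without a running cluster. This is a sketch reconstructed from the `Search` calls, not captured output, and the name `query_body` is illustrative only.

```python
import json

# Approximate request body for the Search object above (a sketch, not captured output).
query_body = {
    "sort": ["author_date"],
    "aggs": {
        "by_authors": {
            # One bucket per distinct author name, up to 15000 buckets.
            "terms": {"field": "author_name", "size": 15000},
            "aggs": {
                # Earliest author_date among each author's commits.
                "first_commit": {"min": {"field": "author_date"}}
            },
        }
    },
}

print(json.dumps(query_body, indent=2))
```

Since only the aggregation is read afterwards, adding `"size": 0` at the top level of such a body would likely save some transfer by skipping the individual hits.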

Extract the aggregation buckets from the response. Each bucket corresponds to one author.

In [5]:
buckets_result = result['aggregations']['by_authors']['buckets']
buckets = []

Convert each bucket into a record and append it to `buckets`.
Specifically, we need the date of the author's `first_commit`, the name of the `author`, and the author's `total_commits`. The `total_commits` value is the bucket's `doc_count`, that is, the number of commit documents for that author.

In [6]:
for bucket in buckets_result:
    # the min aggregation returns milliseconds since the epoch; convert to seconds
    first_commit = bucket['first_commit']['value'] / 1000
    buckets.append({
        'first_commit': datetime.utcfromtimestamp(first_commit),
        'author': bucket['key'],
        'total_commits': bucket['doc_count'],
    })
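`datetime.utcfromtimestamp` returns a naive datetime and is deprecated as of Python 3.12. The sketch below shows the same millisecond-to-datetime conversion with a timezone-aware call. The bucket is mock data shaped like one entry of `buckets_result`; its timestamp was reconstructed from the Olivier Grisel row in the table output further down, not read from the index. The helper name `bucket_to_record` is hypothetical.

```python
from datetime import datetime, timezone


def bucket_to_record(bucket):
    """Turn one aggregation bucket into a flat record (hypothetical helper)."""
    # The min aggregation reports the date as milliseconds since the epoch.
    seconds = bucket['first_commit']['value'] / 1000
    return {
        'first_commit': datetime.fromtimestamp(seconds, tz=timezone.utc),
        'author': bucket['key'],
        'total_commits': bucket['doc_count'],
    }


# Mock bucket reconstructed from the table output (assumption, not index data).
mock_bucket = {
    'key': 'Olivier Grisel',
    'doc_count': 2288,
    'first_commit': {'value': 1267623988000.0},
}

record = bucket_to_record(mock_bucket)
print(record['first_commit'].isoformat())  # 2010-03-03T13:46:28+00:00
```

Note that records built this way carry an explicit UTC offset, so a DataFrame created from them would hold timezone-aware timestamps rather than the naive ones shown in the outputs here.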

Create a pandas DataFrame from the information stored in `buckets`.

In [7]:
authors = pd.DataFrame.from_records(buckets)
authors

,author,first_commit,total_commits
0,Olivier Grisel,2010-03-03 13:46:28,2288
1,Andreas Mueller,2011-05-23 20:43:48,2093
2,Fabian Pedregosa,2010-01-05 13:26:32,1674
3,Lars Buitinck,2011-04-28 18:04:01,1283
4,Alexandre Gramfort,2010-03-02 15:01:46,1078
5,Gilles Louppe,2011-07-19 11:33:00,942
6,Gael Varoquaux,2010-01-08 08:35:18,886
7,Peter Prettenhofer,2010-10-19 00:00:25,868
8,Mathieu Blondel,2010-09-14 20:03:49,784
9,Gael varoquaux,2010-05-13 22:07:13,654


Sort the DataFrame by `first_commit` in descending order, so the most recent newcomers come first, and reset the index.

In [8]:
# most recent first commits at the top, with a fresh 0..n-1 index
authors = authors.sort_values(by='first_commit', ascending=False).reset_index(drop=True)
authors.head()

,author,first_commit,total_commits
0,Kirill,2018-03-02 08:32:42,2
1,Alexander-N,2018-03-01 23:33:05,1
2,Adam Richie-Halford,2018-03-01 14:27:26,1
3,Jan Schlüter,2018-03-01 02:03:19,1
4,Will Rosenfeld,2018-02-28 02:41:58,1


Let's visualize the data. We will plot `total_commits` for each author, ordered by `first_commit`.
Before that, let's add a `datestr` column to the DataFrame that stores the year and month of the first commit as a `YYYYMM` string. This makes it easier to see how new committers are distributed across months.

In [9]:
# year and month of the first commit as a zero-padded 'YYYYMM' string
authors['datestr'] = authors.first_commit.dt.strftime('%Y%m')
authors = authors[['author', 'datestr', 'first_commit', 'total_commits']]
# head of the DataFrame
authors.head()

,author,datestr,first_commit,total_commits
0,Kirill,201803,2018-03-02 08:32:42,2
1,Alexander-N,201803,2018-03-01 23:33:05,1
2,Adam Richie-Halford,201803,2018-03-01 14:27:26,1
3,Jan Schlüter,201803,2018-03-01 02:03:19,1
4,Will Rosenfeld,201802,2018-02-28 02:41:58,1


In [None]:
# group authors by the year and month of their first commit
by_month = authors[['first_commit', 'total_commits']].groupby([authors.first_commit.dt.year, authors.first_commit.dt.month]).agg(['min', 'max', 'count'])

In [None]:
by_month

In [None]:
# number of new committers per month
by_month['first_commit']['count']
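The task statement also asks for a CSV file, which the cells above do not write. One possible way to produce it is sketched below, working on records shaped like the entries of `buckets` so that it runs without Elasticsearch or pandas. The five sample records are copied from the `authors.head()` output above; the function name `monthly_listing`, the column names, and the `YYYY-MM` month format are illustrative choices, not part of the original notebook.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

FIELDS = ['month', 'new_committers', 'author', 'total_commits']


def monthly_listing(records):
    """Return one row per author, grouped by the month of the first commit."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec['first_commit'].strftime('%Y-%m')].append(rec)

    rows = []
    for month in sorted(grouped):
        # Within a month, list newcomers in order of their first commit.
        newcomers = sorted(grouped[month], key=lambda r: r['first_commit'])
        for rec in newcomers:
            rows.append({
                'month': month,
                'new_committers': len(newcomers),
                'author': rec['author'],
                'total_commits': rec['total_commits'],
            })
    return rows


# Sample records copied from the authors.head() output (not the full index).
sample = [
    {'author': 'Kirill', 'first_commit': datetime(2018, 3, 2, 8, 32, 42), 'total_commits': 2},
    {'author': 'Alexander-N', 'first_commit': datetime(2018, 3, 1, 23, 33, 5), 'total_commits': 1},
    {'author': 'Adam Richie-Halford', 'first_commit': datetime(2018, 3, 1, 14, 27, 26), 'total_commits': 1},
    {'author': 'Jan Schlüter', 'first_commit': datetime(2018, 3, 1, 2, 3, 19), 'total_commits': 1},
    {'author': 'Will Rosenfeld', 'first_commit': datetime(2018, 2, 28, 2, 41, 58), 'total_commits': 1},
]

rows = monthly_listing(sample)

# Write to an in-memory buffer here; a real run would use a file object.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

To write an actual file, the `io.StringIO` buffer can be replaced with something like `open('new_committers.csv', 'w', newline='')` (the file name is hypothetical). In the notebook itself, passing `buckets` instead of `sample` would cover all authors.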

In [None]:
fig, ax = plt.subplots(figsize=(20, 15))
ax.plot(authors.datestr, authors.total_commits, 'k-')
ax.set_ylabel('Total number of commits per author')
ax.set_xlabel('Month of first commit')
ax.set_xticklabels([])
plt.show()