## This is the first microtask for the project: Reporting CHAOSS Metrics under the CHAOSS org in GSoC-18.

The task is to: ***Produce a listing of the number of new committers per month, and the number of commits for each of them, as a table and as a CSV file. Use the GrimoireLab enriched index for git.***

We start by importing the necessary modules.

In [1]:
from datetime import datetime, timedelta
from elasticsearch import Elasticsearch
from IPython.display import display
from dateutil.relativedelta import relativedelta
from calendar import monthrange, month_name
from collections import defaultdict, OrderedDict
from pprint import pprint

import subprocess
import pandas as pd

Next, we specify the necessary variables.

In [2]:
# Elasticsearch should be running at this URL; otherwise the next command will give an error
ES_URL = "http://localhost:9200" 

es = Elasticsearch(ES_URL)

# URL of the repository to be analysed
repository_url = "https://github.com/aimacode/aima-python.git"

# Names of the indices in which the repository data will be stored
enriched_index_name = "aima_python_git"
raw_index_name = "aima_python_git_raw"

### Getting the data

Now we will use `p2o.py` with the **`git`** backend to download the repository and insert it into Elasticsearch.  
This command indexes two versions of the repository: a raw version and an enriched version.
`p2o.py` first downloads the repository and uploads the raw version. It then enriches the data, i.e., adds fields such as `lines_added`, `lines_removed`, `committer_name`, `author_name` and so on.  
These additional fields give us more insight into the activity in the repository.

In [None]:
subprocess.run(['p2o.py', '--enrich', '--index', raw_index_name,
      '--index-enrich', enriched_index_name, '-e', ES_URL,
      '--no_inc', '--debug', 'git', repository_url])
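
The call above does not check whether the command succeeded. Below is a minimal sketch of one way to surface a failure, assuming that `p2o.py` signals errors through a non-zero exit status (not verified here). `run_checked` is a hypothetical helper name, not part of GrimoireLab.

```python
import subprocess
import sys


def run_checked(command):
    """Hypothetical helper: run a command and raise if it exits with an error."""
    # check=True makes subprocess.run raise CalledProcessError
    # when the command returns a non-zero exit status
    return subprocess.run(command, check=True)


# Demonstration with a harmless command. In the notebook, the list passed to
# subprocess.run above (the one starting with 'p2o.py') would be used instead.
completed = run_checked([sys.executable, "-c", "pass"])
print(completed.returncode)
```

With this wrapper, a failed run stops the notebook at the indexing cell instead of surfacing later as a missing index.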

### Analysis

**NOTE:** Another tutorial is available in the [GrimoireLab Tutorial](https://grimoirelab.gitbooks.io/tutorial/python/pandas-for-grimoirelab-indexes.html), which uses the *elasticsearch_dsl* Python library and the aggregations that Elasticsearch provides. That tutorial also shows how to get all the new committers for each month.
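
To give an idea of what an aggregation-based approach could look like, here is a rough sketch of a query body that asks Elasticsearch for the date of each author's first commit. It is not taken from the linked tutorial. The function name `first_commit_per_author_query` is hypothetical, and the sketch assumes that `author_name` is indexed as a keyword field and `commit_date` as a date field in the enriched index.

```python
def first_commit_per_author_query(author_field="author_name",
                                  date_field="commit_date",
                                  max_authors=10000):
    """Hypothetical helper: build a query body that returns, for each author,
    the date of their earliest commit."""
    return {
        "size": 0,  # we only want the aggregation, not the documents
        "aggs": {
            "authors": {
                # One bucket per distinct author (assumes a keyword field)
                "terms": {"field": author_field, "size": max_authors},
                # Within each bucket, the minimum commit date
                "aggs": {"first_commit": {"min": {"field": date_field}}},
            }
        },
    }


query_body = first_commit_per_author_query()
print(query_body["aggs"]["authors"]["terms"])
```

Such a body would be passed to `es.search(index=..., body=query_body)`. Grouping the resulting first-commit dates by month would then give the new committers per month.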

We'll start by getting **all** the commits (documents in the index) from the repository.

In [3]:
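def get_all_commit_records(index=None):
    "Queries the Elasticsearch instance and returns all the documents in the index."

    # First query: only used to find out how many documents the index holds.
    # This assumes hits.total is a plain integer (Elasticsearch versions before 7).
    temp_res = es.search(index=index, body={"query": {"match_all": {}}})
    size = temp_res["hits"]["total"]
    # Second query: ask for all the documents at once. This only works while the
    # index holds no more than index.max_result_window documents (10,000 by default).
    query = {
        "size": size,
        "query": {"match_all": {}}
    }
    res = es.search(index=index, body=query)
    # Iterate over the hits actually returned rather than over the reported total
    return [hit["_source"] for hit in res["hits"]["hits"]]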
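
The function above reads `hits.total` as an integer, which is how Elasticsearch versions before 7 report it. From version 7 on, the same field is an object with a `value` key. The sketch below shows one way to read the count from either shape. `total_hits` is a hypothetical helper, and the two responses are hand-written stand-ins rather than real query results.

```python
def total_hits(response):
    """Hypothetical helper: number of matching documents in a search response."""
    total = response["hits"]["total"]
    # Elasticsearch 7+ reports {"value": N, "relation": ...};
    # older versions report a plain integer
    if isinstance(total, dict):
        return total["value"]
    return total


# Hand-written stand-ins for the two response shapes
old_style = {"hits": {"total": 1234, "hits": []}}
new_style = {"hits": {"total": {"value": 1234, "relation": "eq"}, "hits": []}}

print(total_hits(old_style))  # 1234
print(total_hits(new_style))  # 1234
```

Note that in Elasticsearch 7 and later the reported value is capped at 10,000 by default unless `track_total_hits` is enabled, so the count may be a lower bound for large indices.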

We store the documents in the `results` variable.

In [4]:
results = get_all_commit_records(enriched_index_name)

This is what a document looks like:

In [5]:
pprint(results[0])

{'Author': 'spottedMetal <spottedMetal@2679dc44-f919-0410-802c-91c6f4a87680>',
 'Committer': 'spottedMetal '
              '<spottedMetal@2679dc44-f919-0410-802c-91c6f4a87680>',
 'author_date': '2007-07-13T21:12:24',
 'author_domain': '2679dc44-f919-0410-802c-91c6f4a87680',
 'author_name': 'spottedMetal',
 'commit_date': '2007-07-13T21:12:24',
 'committer_domain': '2679dc44-f919-0410-802c-91c6f4a87680',
 'committer_name': 'spottedMetal',
 'files': 1,
 'github_repo': 'aimacode/aima-python',
 'grimoire_creation_date': '2007-07-13T21:12:24+00:00',
 'hash': 'aeeaedc6624a4a5a7ac17aef6e2d05e7a2b77b89',
 'hash_short': 'aeeaed',
 'is_git_commit': 1,
 'lines_added': 181,
 'lines_changed': 247,
 'lines_removed': 66,
 'message': 'XYEnvironment notifies observers of new objects and object '
            'moves.\n'
            'EnvCanvas draws each object either as a canvas text (string) or '
            'using\n'
            'an image from a file associated with the object class.\n'
            '\n

#### Custom functions: 
These helpers parse dates, get the start and end dates of a month, and divide the commits according to the months in which they were made.

In [6]:
def parse_date(date, custom_format=None):
    """Returns a datetime.datetime object from a string. 
    custom_format for the date can be given as input"""
    
    if custom_format:
        return datetime.strptime(date, custom_format)
    
    return datetime.strptime(date, "%Y-%m-%dT%H:%M:%S")

In [7]:
def get_end_date_of_month(date):
    "Given a date, return the end date of the month"
    
    return date + relativedelta(days = +(monthrange(date.year, date.month)[1] - date.day))

In [8]:
def get_start_date_of_month(date):
    "Given a date, return the start date of the month"
    
    return date - relativedelta(days = +date.day-1)

In [9]:
def get_bucket_name(date):
    "Given a date, return a string in the form Month-YYYY"
    
    return month_name[date.month] + "-" + str(date.year)
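
To make the behaviour of these helpers concrete, here is a small standalone sketch that computes the same month boundaries and label using only the standard library. The names `month_start`, `month_end` and `bucket_label` are hypothetical stand-ins for the functions above. Like the originals, they keep the time of day of the input date.

```python
from calendar import month_name, monthrange
from datetime import datetime


def month_start(date):
    "Hypothetical helper: first day of the month, time of day unchanged."
    return date.replace(day=1)


def month_end(date):
    "Hypothetical helper: last day of the month, time of day unchanged."
    last_day = monthrange(date.year, date.month)[1]
    return date.replace(day=last_day)


def bucket_label(date):
    "Hypothetical helper: label in the form Month-YYYY."
    return "{}-{}".format(month_name[date.month], date.year)


sample = datetime(2016, 2, 15, 10, 30)
print(month_start(sample))   # 2016-02-01 10:30:00
print(month_end(sample))     # 2016-02-29 10:30:00
print(bucket_label(sample))  # February-2016
```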

In [15]:
def get_extreme_commits_dates(commit_list):
    "Given a list of commits, return the dates of the first and the last commits"
    
    fc_date = min(parse_date(item['commit_date']) for item in commit_list)
    lc_date = max(parse_date(item['commit_date']) for item in commit_list)
    return fc_date, lc_date

In [16]:
def make_buckets(first_date, last_date):
    """Given the project start date and the last commit date, return 
    containers for months in between those dates. Each container is a month 
    containing details about all the commits and committers for that month."""
    
    buckets = OrderedDict()
    
    month_start_date = get_start_date_of_month(first_date)
    month_end_date = get_end_date_of_month(first_date)
    bucket_name = get_bucket_name(first_date)
    
    while month_end_date <= last_date:
        commit = {}
        commit['new_committers'] = defaultdict(int)
        commit['old_committers'] = defaultdict(int)
        commit['commits'] = []
        buckets[bucket_name] = commit
        
        month_start_date = month_end_date + relativedelta(days=+1)
        month_end_date = get_end_date_of_month(month_start_date)
        bucket_name = get_bucket_name(month_start_date)
    
    commit = {}
    commit['new_committers'] = defaultdict(int)
    commit['old_committers'] = defaultdict(int)
    commit['commits'] = []
    buckets[bucket_name] = commit
    
    return buckets
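
As an illustration of which buckets are expected between two dates, the sketch below lists only the month labels, stepping through `(year, month)` pairs instead of using `relativedelta`. `month_labels` is a hypothetical helper and not a drop-in replacement for `make_buckets`, which also creates the per-month containers.

```python
from calendar import month_name
from datetime import datetime


def month_labels(first_date, last_date):
    """Hypothetical helper: Month-YYYY labels of every month from
    first_date to last_date, inclusive."""
    labels = []
    year, month = first_date.year, first_date.month
    while (year, month) <= (last_date.year, last_date.month):
        labels.append("{}-{}".format(month_name[month], year))
        # Step to the next calendar month, rolling over the year after December
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


print(month_labels(datetime(2017, 11, 20), datetime(2018, 2, 3)))
# ['November-2017', 'December-2017', 'January-2018', 'February-2018']
```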

We take in a list of the commits, starting from the very first. Then we create buckets of months and put all the commits of each month in that month's bucket.

Then we separate the commits as made by an `old_committer` or a `new_committer`.

In [17]:
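def analyse_repository(commit_list):
    """Given a list of commits, return the monthly buckets with each month's
    commits counted per author, split into new and old committers."""
    
    first_date, last_date = get_extreme_commits_dates(commit_list)
    
    months = make_buckets(first_date, last_date)
    
    # Put every commit in the bucket of the month in which it was made
    for commit in commit_list:
        month = get_bucket_name(parse_date(commit['commit_date']))
        months[month]["commits"].append(commit)
        
    # Authors seen in any earlier month
    all_committers = set()
    
    for name, month in months.items():
        for commit in month['commits']:
            # Commits are attributed to the author of the commit
            committer = commit['author_name']
            if committer in all_committers:
                month['old_committers'][committer] += 1
            else:
                month['new_committers'][committer] += 1
        # Update only after the whole month is processed, so that all the commits
        # a newcomer makes in their first month are counted under new_committers
        all_committers.update(month['old_committers'], month['new_committers'])
        del month['commits']
                
    return months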
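
The classification rule can be checked on a tiny hand-made example. The sketch below is a simplified standalone version of the same idea, working on `(month, author)` pairs instead of full commit documents. The function name `new_committers_by_month` and the authors `alice` and `bob` are made up for illustration.

```python
from collections import OrderedDict, defaultdict


def new_committers_by_month(commits):
    """Hypothetical helper. commits: (month_label, author) pairs in
    chronological order. Returns month_label -> {author: number of commits},
    listing an author only in the month of their first commit."""
    per_month = OrderedDict()
    for label, author in commits:
        per_month.setdefault(label, defaultdict(int))[author] += 1

    seen = set()  # authors who committed in an earlier month
    result = OrderedDict()
    for label, counts in per_month.items():
        result[label] = {a: n for a, n in counts.items() if a not in seen}
        # Mark this month's authors as seen only after the month is processed
        seen.update(counts)
    return result


# Made-up toy data
toy = [
    ("June-2007", "alice"),
    ("July-2007", "alice"),
    ("July-2007", "bob"),
    ("July-2007", "bob"),
    ("August-2007", "bob"),
]
print(dict(new_committers_by_month(toy)))
# {'June-2007': {'alice': 1}, 'July-2007': {'bob': 2}, 'August-2007': {}}
```

Here `bob` is new in July with both of his July commits counted, while his August commit no longer counts because he is an old committer by then.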

In [18]:
Output = analyse_repository(results)

#### The number of new committers per month:

In [19]:
fmt = '{:<20}{}'

print(fmt.format('Month', 'No. of new committers'))

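# The loop variable is not called month_name, so that it does not overwrite
# calendar.month_name, which get_bucket_name relies on
for bucket_name, month in Output.items():
    if len(month["new_committers"]) != 0:
        print(fmt.format(bucket_name, len(month["new_committers"])))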

Month               No. of new committers
June-2007           1
July-2007           1
May-2010            1
August-2011         1
February-2016       1
March-2016          24
April-2016          3
May-2016            3
August-2016         1
September-2016      1
January-2017        1
February-2017       1
March-2017          20
April-2017          3
May-2017            1
June-2017           1
August-2017         2
September-2017      1
December-2017       6
January-2018        3
February-2018       2


We convert the buckets in Output to a list of dictionaries. Each dictionary contains 3 elements: `month`, `Author of commit` and `Number of commits`. Then we use the `pandas` library to convert this list into a table.

In [20]:
def get_table_from_dict(commit_dict):
    "Given the monthly buckets, return a DataFrame with one row per new committer."
    table = []
    for bucket_name, month in commit_dict.items():
        for key, val in month['new_committers'].items():
            item = {}
            item['month'] = bucket_name
            item["Author of commit"] = key
            item['Number of commits'] = val
            table.append(item)
    
    return pd.DataFrame(table)

In [21]:
table = get_table_from_dict(Output)

#### New committers with the number of commits that they made

In [22]:
display(table)

|   | Author of commit | Number of commits | month |
|---|---|---|---|
| 0 | peter.norvig | 1 | June-2007 |
| 1 | spottedMetal | 18 | July-2007 |
| 2 | srburnet | 1 | May-2010 |
| 3 | withal | 13 | August-2011 |
| 4 | norvig | 3 | February-2016 |
| 5 | greyshadows | 1 | March-2016 |
| 6 | abhishek garg | 2 | March-2016 |
| 7 | utk1610 | 1 | March-2016 |
| 8 | SnShine | 33 | March-2016 |
| 9 | Tamer Tas | 5 | March-2016 |


`get_table_from_dict` gives us the required table of new committers for each month and the number of commits that they made.  
Now we'll just use the `DataFrame.to_csv` method to write this table to a CSV file.

#### Transferring the above table into a CSV

In [23]:
table.to_csv(enriched_index_name + ".csv", index=False, sep=",")

The list of new committers with the number of commits is now available as a CSV file (`aima_python_git.csv`) in the current folder.
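
For reference, the layout of such a file can be illustrated without pandas. The sketch below writes two rows taken from the table above with the standard `csv` module into an in-memory buffer. `rows_to_csv` is a hypothetical helper, and the column order is chosen here for illustration, so it may differ from the order pandas writes.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Hypothetical helper: serialise a list of dictionaries as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two rows copied from the table shown earlier
rows = [
    {"month": "June-2007", "Author of commit": "peter.norvig", "Number of commits": 1},
    {"month": "July-2007", "Author of commit": "spottedMetal", "Number of commits": 18},
]

print(rows_to_csv(rows, ["month", "Author of commit", "Number of commits"]))
```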