## This is the third microtask for the project: Reporting CHAOSS Metrics under the CHAOSS org in GSoC-18.

This task is to: ***Produce a listing of repositories, as a table and as CSV file, with the number of commits authored, issues opened, and pull requests opened, during the last three months, ordered by the total number (commits plus issues plus pull requests)***

The workflow for this task is:
- Select the repositories to be analysed
- Query the data sources with Perceval (through `p2o.py`) and index the results into Elasticsearch
- For each repository, find its `Git` and `GitHub` enriched **indices** in Elasticsearch
- Query these **indices** for the relevant data (commits, issues, and pull requests opened in the last 3 months)
- Sort the repositories by their total count and generate the table and CSV file

We start by importing the necessary libraries and defining the necessary variables:

In [1]:
import subprocess

from dateutil.relativedelta import relativedelta
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from pprint import pprint

import pandas as pd

es = Elasticsearch()

### Data Preparation

This is the initial data preparation step. If you already have enriched indices in your local Elasticsearch instance, you can skip this step and go directly to [Analysis](#Analysis).

In [2]:
ES_URL = 'http://localhost:9200'

# We are going to add 3 repositories into the local elasticsearch cluster.
# You can add more repositories by adding their information into the list below.
repositories = [
    {
        'repo_url' : 'https://github.com/aimacode/aima-python.git',
        'org_name' : 'aimacode',
        'repo_name' : 'aima-python',
        'git_enrich' : 'aima_python_git',
        'git_raw' : 'aima_python_git_raw',
        'github_enrich' : 'aima_python_github',
        'github_raw' : 'aima_python_github_raw'
    },
    {
        'repo_url' : 'https://github.com/chaoss/grimoirelab-perceval.git',
        'org_name' : 'chaoss',
        'repo_name' : 'grimoirelab-perceval',
        'git_enrich' : 'perceval_git',
        'git_raw' : 'perceval_git_raw',
        'github_enrich' : 'perceval_github',
        'github_raw' : 'perceval_github_raw'
    },
    {
        'repo_url' : 'https://github.com/chaoss/grimoirelab-elk.git',
        'org_name' : 'chaoss',
        'repo_name' : 'grimoirelab-elk',
        'git_enrich' : 'grimoireelk_git',
        'git_raw' : 'grimoireelk_git_raw',
        'github_enrich' : 'grimoireelk_github',
        'github_raw' : 'grimoireelk_github_raw'
    }
]

# GitHub access token
token = '<YOUR GITHUB TOKEN GOES HERE>'
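
Writing each entry of `repositories` by hand becomes repetitive as the list grows. The following is a minimal sketch of a helper that derives the entry from the clone URL. The function name `make_repo_entry` and its `prefix` parameter are hypothetical and not part of the original notebook; the index-name suffixes simply follow the pattern used in the entries above.

```python
from urllib.parse import urlparse


def make_repo_entry(repo_url, prefix):
    """Build one entry for the `repositories` list from a GitHub clone URL.

    `prefix` (hypothetical parameter) is the base used for the four index
    names; the caller chooses it, as in the hand-written entries.
    """
    # The URL path looks like '/<org>/<repo>.git'.
    path = urlparse(repo_url).path.strip('/')
    org_name, repo_name = path.split('/')[:2]
    if repo_name.endswith('.git'):
        repo_name = repo_name[:-len('.git')]
    return {
        'repo_url': repo_url,
        'org_name': org_name,
        'repo_name': repo_name,
        'git_enrich': prefix + '_git',
        'git_raw': prefix + '_git_raw',
        'github_enrich': prefix + '_github',
        'github_raw': prefix + '_github_raw',
    }


entry = make_repo_entry(
    'https://github.com/chaoss/grimoirelab-perceval.git', 'perceval')
```

With this helper, adding a repository would amount to something like `repositories.append(make_repo_entry(url, prefix))`.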

Now we use `p2o.py` to fetch the data for each repository, insert it into Elasticsearch as a raw index, and produce the corresponding enriched index.

In [3]:
for repo in repositories:
    print('inserting git data for repo: ', repo['repo_name'])
    subprocess.run(['p2o.py', '--enrich', '--index', repo['git_raw'],
      '--index-enrich', repo['git_enrich'], '-e', ES_URL,
      '--no_inc', '--debug', 'git', repo['repo_url']])
    
    print('inserting github data for repo: ', repo['repo_name'])
    subprocess.run(['p2o.py', '--enrich', '--index', repo['github_raw'],
      '--index-enrich', repo['github_enrich'], '-e', ES_URL,
      '--no_inc', '--debug', 'github', repo['org_name'] , repo['repo_name'],
      '-t', token, '--sleep-for-rate'])
    
    print()

inserting git data for repo:  aima-python
inserting github data for repo:  aima-python

inserting git data for repo:  grimoirelab-perceval
inserting github data for repo:  grimoirelab-perceval

inserting git data for repo:  grimoirelab-elk
inserting github data for repo:  grimoirelab-elk
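
The two `subprocess.run` calls in the loop differ only in the backend and its arguments, so the command line can be built in one place. This is a minimal sketch, not the notebook's own code: the function name `build_p2o_command` is hypothetical, and it uses only the `p2o.py` flags that already appear in the loop.

```python
def build_p2o_command(repo, backend, es_url, token=None):
    """Return the p2o.py argument list for one repository and one backend.

    Hypothetical helper; only flags already used in the notebook appear here.
    """
    if backend not in ('git', 'github'):
        raise ValueError('unsupported backend: {}'.format(backend))
    cmd = ['p2o.py', '--enrich',
           '--index', repo[backend + '_raw'],
           '--index-enrich', repo[backend + '_enrich'],
           '-e', es_url, '--no_inc', '--debug', backend]
    if backend == 'git':
        # The git backend takes the clone URL.
        cmd.append(repo['repo_url'])
    else:
        # The github backend takes owner, repository, and an API token.
        if token is None:
            raise ValueError('the github backend needs a token')
        cmd += [repo['org_name'], repo['repo_name'],
                '-t', token, '--sleep-for-rate']
    return cmd


sample_repo = {
    'repo_url': 'https://github.com/chaoss/grimoirelab-perceval.git',
    'org_name': 'chaoss',
    'repo_name': 'grimoirelab-perceval',
    'git_enrich': 'perceval_git',
    'git_raw': 'perceval_git_raw',
    'github_enrich': 'perceval_github',
    'github_raw': 'perceval_github_raw',
}
git_cmd = build_p2o_command(sample_repo, 'git', 'http://localhost:9200')
github_cmd = build_p2o_command(sample_repo, 'github',
                               'http://localhost:9200', token='TOKEN')
```

Passing the result to `subprocess.run(cmd, check=True)` would make a failed `p2o.py` run raise `subprocess.CalledProcessError` instead of passing silently.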



<a id='Analysis'></a>
### Analysis

We start by assuming that we do not know which repositories are present in our Elasticsearch instance. Because the table must cover every repository stored locally, we group the indices by their `metadata__gelk_backend_name` and `github_repo` fields, read from the first document of each index. This gives us the **Git** and **GitHub** enriched indices for each repository stored in Elasticsearch.  
Using these index names, we can then query Elasticsearch for the number of commits, pull requests, and issues opened in the last 3 months.

**NOTE: This analysis assumes that the `github_repo` field name for the `Git` and `GitHub` data sources is the same in the corresponding indices for the repositories.**

In [4]:
index_dict = {}

for index in es.indices.get_alias("*"):
    print("Fetching data about: ", index)
    s = Search(using=es, index=index)
    s = s.source(['metadata__gelk_backend_name', 'github_repo'])
    result = s.execute().to_dict()
    try:
        result = result['hits']['hits'][0]['_source']
        repo = result['github_repo']
        enrichment_type = result['metadata__gelk_backend_name']
        print('repository name: {}, enrichment type: {}'.format(repo, enrichment_type))
    except (KeyError, IndexError):
        # Raw and system indices have no hits or lack these fields.
        print("Skipping index: ", index)
        print()
        continue
    if repo not in index_dict.keys():
        index_dict[repo] = {}
    index_dict[repo][enrichment_type] = index
    print()

Fetching data about:  grimoireelk_github_raw
Skipping index:  grimoireelk_github_raw

Fetching data about:  grimoireelk_git_raw
Skipping index:  grimoireelk_git_raw

Fetching data about:  perceval_github
repository name: chaoss/grimoirelab-perceval, enrichment type: GitHubEnrich

Fetching data about:  aima_python_git
repository name: aimacode/aima-python, enrichment type: GitEnrich

Fetching data about:  .kibana
Skipping index:  .kibana

Fetching data about:  perceval_github_raw
Skipping index:  perceval_github_raw

Fetching data about:  perceval_git
repository name: chaoss/grimoirelab-perceval, enrichment type: GitEnrich

Fetching data about:  grimoireelk_github
repository name: chaoss/grimoirelab-elk, enrichment type: GitHubEnrich

Fetching data about:  aima_python_github_raw
Skipping index:  aima_python_github_raw

Fetching data about:  aima_python_git_raw
Skipping index:  aima_python_git_raw

Fetching data about:  perceval_git_raw
Skipping index:  perceval_git_raw

Fetching data ab

Let's look at the repositories we found:

In [5]:
pprint(index_dict)

{'aimacode/aima-python': {'GitEnrich': 'aima_python_git',
                          'GitHubEnrich': 'aima_python_github'},
 'chaoss/grimoirelab-elk': {'GitEnrich': 'grimoireelk_git',
                            'GitHubEnrich': 'grimoireelk_github'},
 'chaoss/grimoirelab-perceval': {'GitEnrich': 'perceval_git',
                                 'GitHubEnrich': 'perceval_github'}}
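
The grouping step can also be separated from Elasticsearch so that it is easy to check on its own. The sketch below assumes that the `_source` of the first hit of each index has already been fetched (or is `None` when the index has no hits); the function name `group_indices` and the `samples` dictionary are hypothetical, with field values taken from the output above.

```python
def group_indices(first_sources):
    """Group enriched index names by repository and enrichment type.

    `first_sources` maps an index name to the `_source` dict of its first
    hit, or to None when the index returned no hits.
    """
    grouped = {}
    for index, source in first_sources.items():
        try:
            repo = source['github_repo']
            enrichment_type = source['metadata__gelk_backend_name']
        except (KeyError, TypeError):
            # Raw indices, .kibana, and empty indices lack these fields.
            continue
        grouped.setdefault(repo, {})[enrichment_type] = index
    return grouped


samples = {
    'perceval_git': {'github_repo': 'chaoss/grimoirelab-perceval',
                     'metadata__gelk_backend_name': 'GitEnrich'},
    'perceval_github': {'github_repo': 'chaoss/grimoirelab-perceval',
                        'metadata__gelk_backend_name': 'GitHubEnrich'},
    'perceval_git_raw': {},   # raw index: fields not present
    '.kibana': None,          # no usable hit
}
grouped = group_indices(samples)
```

Using `setdefault` replaces the explicit `if repo not in index_dict` check, and the skipped indices are exactly those without both fields.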


Now, for the main analysis, we will query the `git` and `github` enriched indices and calculate the number of commits, pull requests and issues created.

In [6]:
# TODO: cleanup code
for repo, data in index_dict.items():
    # ----- Count the number of commits -----
    git_enrich = data['GitEnrich']
    s = Search(using=es, index=git_enrich)
    # We add the fields that we want in the results.
    s = s.source(['commit_date'])
    # We are looking for all the commits that were made in the last 3 months.
    s = s.filter('range', commit_date={'gte' : 'now-3M'})
    # And we are going to arrange all these documents according to 
    # when they were created in ascending order
    s = s.sort({'commit_date': { 'order' : 'asc'}})
    # Only hits.total is read below, so this page size does not affect the
    # commit count; it only limits how many documents are returned.
    s = s[0:1000]
    commits = s.execute().to_dict()
    num_commits = commits['hits']['total']
    index_dict[repo]['num_commits'] = num_commits
    # ----- End of querying git data source -----
    
    # ----- Count the number of issues and PRs -----
    github_enrich = data['GitHubEnrich']
    s = Search(using=es, index=github_enrich)
    # We add the fields that we want in the results.
    s = s.source(['created_at', 'item_type'])
    # We are looking for all the issues/PRs that were created in the last 3 months.
    s = s.filter('range', created_at={'gte' : 'now-3M'})
    # And we are going to arrange all these documents according to 
    # when they were created in ascending order
    s = s.sort({'created_at': { 'order' : 'asc'}})
    # Issues and PRs are counted from the returned hits, so at most 1000
    # documents are counted. Raise this limit if a repository has more.
    s = s[0:1000]
    issues_prs = s.execute().to_dict()['hits']['hits']
    issues = 0
    prs = 0
    for item in issues_prs:
        if item['_source']['item_type'] == 'pull request':
            prs += 1
        else:
            issues += 1
    index_dict[repo]['num_issues'] = issues
    index_dict[repo]['num_prs'] = prs
    # -----End of querying github data source -----
    
    # calculate the total:
    index_dict[repo]['total'] = issues + prs + num_commits
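
Because issues and pull requests are counted from the returned hits, the result depends on the page size. One alternative is to let Elasticsearch do the counting with a terms aggregation on `item_type`. The sketch below only builds the request body and parses a response of the usual aggregation shape. It assumes that `item_type` is mapped as a keyword field in the enriched index; the helper names, the aggregation name `by_type`, and the `'issue'` key in the hand-written response are illustrative assumptions.

```python
def build_type_count_query(since='now-3M'):
    """Request body that counts documents per `item_type` on the server."""
    return {
        'size': 0,  # no documents needed, only the aggregation result
        'query': {'bool': {'filter': [
            {'range': {'created_at': {'gte': since}}}
        ]}},
        'aggs': {'by_type': {'terms': {'field': 'item_type'}}},
    }


def parse_type_counts(response):
    """Split the aggregation buckets into (issues, prs)."""
    buckets = response['aggregations']['by_type']['buckets']
    counts = {bucket['key']: bucket['doc_count'] for bucket in buckets}
    prs = counts.get('pull request', 0)
    # As in the loop above, anything that is not a pull request is an issue.
    issues = sum(counts.values()) - prs
    return issues, prs


# Hand-written response in the shape of a terms aggregation result,
# using the aima-python counts reported in this notebook.
fake_response = {'aggregations': {'by_type': {'buckets': [
    {'key': 'pull request', 'doc_count': 116},
    {'key': 'issue', 'doc_count': 57},
]}}}
issues, prs = parse_type_counts(fake_response)
query = build_type_count_query()
```

With the client used in this notebook, the body would be sent with something like `es.search(index=github_enrich, body=query)`; the exact call depends on the installed client version.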

In [7]:
index_dict

{'aimacode/aima-python': {'GitEnrich': 'aima_python_git',
  'GitHubEnrich': 'aima_python_github',
  'num_commits': 81,
  'num_issues': 57,
  'num_prs': 116,
  'total': 254},
 'chaoss/grimoirelab-elk': {'GitEnrich': 'grimoireelk_git',
  'GitHubEnrich': 'grimoireelk_github',
  'num_commits': 170,
  'num_issues': 14,
  'num_prs': 80,
  'total': 264},
 'chaoss/grimoirelab-perceval': {'GitEnrich': 'perceval_git',
  'GitHubEnrich': 'perceval_github',
  'num_commits': 232,
  'num_issues': 19,
  'num_prs': 79,
  'total': 330}}

We now convert the above dictionary into a pandas DataFrame, with one row per repository:

In [8]:
data_list = []
for repo, data in index_dict.items():
    data['repository'] = repo
    data_list.append(data)

indices = pd.DataFrame(data_list)

In [9]:
indices.sort_values(by=['total'], inplace=True)

In [10]:
indices

| | GitEnrich | GitHubEnrich | num_commits | num_issues | num_prs | repository | total |
|---|---|---|---|---|---|---|---|
| 1 | aima_python_git | aima_python_github | 81 | 57 | 116 | aimacode/aima-python | 254 |
| 2 | grimoireelk_git | grimoireelk_github | 170 | 14 | 80 | chaoss/grimoirelab-elk | 264 |
| 0 | perceval_git | perceval_github | 232 | 19 | 79 | chaoss/grimoirelab-perceval | 330 |


Finally, we save this DataFrame to a CSV file named `Indices.csv`:

In [11]:
indices.to_csv('Indices.csv', index=False)
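
The ranking and export can also be sketched with only the standard library, which makes the sort direction and the exported columns explicit. The task statement does not say whether the order should be ascending or descending; the notebook sorts ascending, and the hypothetical helper `rank_and_write_csv` below exposes the choice as a `descending` flag. The rows reuse the counts reported above.

```python
import csv
import io


def rank_and_write_csv(rows, out, descending=False):
    """Sort rows by 'total' and write the metric columns as CSV to `out`."""
    fields = ['repository', 'num_commits', 'num_issues', 'num_prs', 'total']
    ranked = sorted(rows, key=lambda row: row['total'], reverse=descending)
    # extrasaction='ignore' drops keys such as the index names.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(ranked)
    return ranked


rows = [
    {'repository': 'chaoss/grimoirelab-perceval', 'GitEnrich': 'perceval_git',
     'num_commits': 232, 'num_issues': 19, 'num_prs': 79, 'total': 330},
    {'repository': 'aimacode/aima-python', 'GitEnrich': 'aima_python_git',
     'num_commits': 81, 'num_issues': 57, 'num_prs': 116, 'total': 254},
    {'repository': 'chaoss/grimoirelab-elk', 'GitEnrich': 'grimoireelk_git',
     'num_commits': 170, 'num_issues': 14, 'num_prs': 80, 'total': 264},
]
buffer = io.StringIO()
ranked = rank_and_write_csv(rows, buffer, descending=True)
csv_text = buffer.getvalue()
```

To write a file instead of an in-memory buffer, open it with `open('Indices.csv', 'w', newline='')` and pass the file object as `out`.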

This concludes our 3rd microtask.