# GMD Metrics using Manuscripts2

This notebook shows users of manuscripts how to generate the [GMD metrics](https://github.com/chaoss/metrics/blob/master/2_Growth-Maturity-Decline.md) themselves with the new Manuscripts2 classes and functions. It takes an in-depth, hands-on approach to calculating these metrics, using the [grimoirelab-perceval](https://github.com/chaoss/grimoirelab-perceval) repository as the example.

We will primarily be using the git, github_prs and github_issues data sources and the [manuscripts2](https://github.com/chaoss/grimoirelab-manuscripts/tree/master/manuscripts2) module.

First: make sure that you have an Elasticsearch instance running on your computer. If it is not running at http://localhost:9200, please change the ES_URL variable below.

Second: this notebook calculates the GMD metrics from enriched indices. If you already have them in your ES instance, great! Otherwise, please read this [README.md](https://github.com/chaoss/grimoirelab-manuscripts/blob/master/manuscripts2/README.md), which explains how to get the enriched indices.

We start by importing the necessary libraries.

In [10]:
import sys
sys.path.insert(0, '..')

# analysis modules
import pandas as pd

# query and connection modules
from elasticsearch import Elasticsearch
from elasticsearch_dsl import A

# utility and support modules
from manuscripts2.elasticsearch import (Query,
                                        PullRequests,
                                        Issues,
                                        Index,
                                        buckets_to_df,
                                        calculate_bmi)
from pprint import pprint
from datetime import datetime, timezone
from dateutil import parser, relativedelta

These are some of the variables needed for the analysis. Please see the [README.md](https://github.com/chaoss/grimoirelab-manuscripts/blob/master/manuscripts2/README.md) of manuscripts2 to learn how to set up the infrastructure required to calculate these metrics.

In [2]:
ES_URL = "http://localhost:9200"
es = Elasticsearch(ES_URL)

github_issues = "perceval_github_issues"
github_prs = "perceval_github_prs"
git_data_source = "perceval_git"

github_issues_index = Index(es=es, index_name=github_issues)
github_prs_index = Index(es=es, index_name=github_prs)
git_index = Index(es=es, index_name=git_data_source)

start_date = datetime(2015, 1, 1)
end_date = datetime(2018, 7, 10)

max_size = 10000 # temporary hack to get all the values in the query

<h2><center>Issue Resolution</center></h2>

Goal: Identify how effective the community is at addressing issues identified by community participants.

Name | Question |
--- | --- |
[Open Issues](https://github.com/chaoss/metrics/tree/master/activity-metrics/open-issues.md) | What is the number of open issues? |
[Closed Issues](https://github.com/chaoss/metrics/tree/master/activity-metrics/closed-issues.md) | What is the number of closed issues? | 
[Issue Resolution Efficiency](https://github.com/chaoss/metrics/tree/master/activity-metrics/issue-resolution-efficiency.md) | What is the number of closed issues/number of abandoned issues? |  
[Open Issue Age](https://github.com/chaoss/metrics/tree/master/activity-metrics/open-issue-age.md) | What is the age of open issues? | 
[First Response to Issue Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/first-response-to-issue-duration.md) | What is the duration of time for a first response to an issue? | 
[Closed Issue Resolution Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/closed-issue-resolution-duration.md) | What is the duration of time for issues to be resolved? |

<a id="open_issues"></a>
### open issues

Here we count the issues that are currently open.

In [3]:
open_issues = Issues(github_issues_index)
open_issues.is_open()
num_open_issues = open_issues.get_cardinality("id_in_repo").get_aggs()
print(num_open_issues)

29


<a id="closed_issues"></a>
### closed issues

In [4]:
closed_issues = Issues(github_issues_index)
closed_issues.is_closed()

total_closed = closed_issues.get_cardinality("id_in_repo").get_aggs() 
print("Total issues closed: ", total_closed)

print()
print("Total issues closed by authors: ")
closed_by_authors = closed_issues.get_cardinality("id_in_repo").by_authors("author_name").fetch_aggregation_results()
by_author_buckets = closed_by_authors['aggregations']['0']['buckets']
print(buckets_to_df(by_author_buckets))

Total issues closed:  115

Total issues closed by authors: 
     0  doc_count                         key
0   27         27         Alvaro del Castillo
1   20         20              Alberto Martín
2   10         10  Jesus M. Gonzalez-Barahona
3    9          9              Manrique Lopez
4    9          9             Santiago Dueñas
5    3          3   Daniel Izquierdo Cortazar
6    3          3        David Pose Fernández
7    3          3                 Jose Miguel
8    3          3                   Quan Zhou
9    2          2    Brylie Christopher Oxley
10   2          2                      MishiR
11   2          2              Robin Muilwijk
12   2          2             Saad Bin Shahid
13   2          2                 TheReal1604
14   2          2                 iganchevup8
15   2          2                     valerio
16   1          1               Andre Klapper
17   1          1            Bogdan Vasilescu
18   1          1                  Bowen Chen
19   1          1   
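
The table above has a column named `0` that repeats `doc_count`; `0` appears to be the name of the cardinality aggregation nested inside each author bucket. As a rough illustration of how such buckets could be flattened into rows, here is a minimal sketch. `flatten_buckets` and `sample_buckets` are hypothetical names, the nested `{"value": ...}` shape is an assumption based on the usual Elasticsearch response layout, and this is not the actual implementation of `buckets_to_df`.

```python
def flatten_buckets(buckets):
    """Turn aggregation buckets into flat dicts, one per bucket (hypothetical helper)."""
    rows = []
    for bucket in buckets:
        row = {}
        for name, content in bucket.items():
            # Nested single-value metrics are assumed to look like {"value": 27}.
            if isinstance(content, dict) and "value" in content:
                row[name] = content["value"]
            else:
                row[name] = content
        rows.append(row)
    return rows


# Assumed bucket shape, filled with the first two rows printed above.
sample_buckets = [
    {"key": "Alvaro del Castillo", "doc_count": 27, "0": {"value": 27}},
    {"key": "Alberto Martín", "doc_count": 20, "0": {"value": 20}},
]

rows = flatten_buckets(sample_buckets)
for row in rows:
    print(row)
```

Each flattened row can then be handed to `pd.DataFrame(rows)` if a DataFrame is preferred.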

### issue resolution efficiency:

This is the ratio of issues closed to issues opened in each month, computed here with `calculate_bmi`.

In [5]:
opened_issues = Issues(github_issues_index)
opened_issues_by_period = opened_issues.get_cardinality("id_in_repo").by_period().get_timeseries(dataframe=True)
closed_issues_by_period = closed_issues.get_cardinality("id_in_repo").by_period(field="closed_at").get_timeseries(dataframe=True)
issue_resolution_efficiency_per_month = calculate_bmi(closed_issues_by_period, opened_issues_by_period)
pprint(issue_resolution_efficiency_per_month)

             bmi
date            
2016-01-01  1.00
2016-02-01  0.50
2016-03-01  0.50
2016-04-01  1.00
2016-05-01  0.00
2016-06-01  0.33
2016-07-01  1.00
2016-08-01  0.00
2016-09-01  0.71
2016-10-01  0.58
2016-11-01  0.57
2016-12-01  1.50
2017-01-01  0.73
2017-02-01  3.00
2017-03-01  0.71
2017-04-01  1.50
2017-05-01  0.50
2017-06-01  1.00
2017-07-01  1.00
2017-08-01  0.00
2017-09-01  0.56
2017-10-01  0.88
2017-11-01  1.18
2017-12-01  0.75
2018-01-01  1.33
2018-02-01  0.57
2018-03-01  1.25
2018-04-01  3.00
2018-05-01  0.00
2018-06-01  1.00
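
To make the ratio concrete, the following sketch computes closed/opened per month from two plain dictionaries. It is not the implementation of `calculate_bmi`: `closed_to_opened_ratio` is a hypothetical helper, the counts are made up for illustration, and skipping months with no opened issues is an assumption about how to avoid dividing by zero.

```python
def closed_to_opened_ratio(closed_by_month, opened_by_month):
    """Return {month: closed / opened}, rounded to two decimals."""
    ratios = {}
    for month, opened in sorted(opened_by_month.items()):
        if opened == 0:
            # Assumption: months without opened issues are left out.
            continue
        closed = closed_by_month.get(month, 0)
        ratios[month] = round(closed / opened, 2)
    return ratios


# Made-up monthly counts, for illustration only.
opened = {"2016-01": 2, "2016-02": 4, "2016-05": 3, "2016-06": 3}
closed = {"2016-01": 2, "2016-02": 2, "2016-06": 1}

ratios = closed_to_opened_ratio(closed, opened)
print(ratios)
```

A value above 1 means more issues were closed than opened in that month, so the backlog shrank.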


<a id="open_issue_age"></a>
### open issue age

As per the [discussion here](https://github.com/chaoss/metrics/blob/master/activity-metrics/open-issue-age.md), we'll calculate the percentiles, mean and variance, and create some visualisations for this metric.

In [6]:
issue = Issues(github_issues_index)
issue.is_open()
percentiles = issue.get_percentiles("time_open_days").get_aggs()
print("Percentiles: ", percentiles)

issue.get_extended_stats("time_open_days")
extended_stats = issue.fetch_aggregation_results()['aggregations']['0']
print("Extended stats for the issues: ")
pprint(extended_stats)

Percentiles:  451.1400146484375
Extended stats for the issues: 
{'avg': 420.9899962195035,
 'count': 29,
 'max': 892.7999877929688,
 'min': 14.149999618530273,
 'std_deviation': 315.4856400552799,
 'std_deviation_bounds': {'lower': -209.98128389105625,
                          'upper': 1051.9612763300634},
 'sum': 12208.7098903656,
 'sum_of_squares': 8026149.213941627,
 'variance': 99531.1890810896}
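
The numbers above are consistent with population statistics: `variance` equals `sum_of_squares / count - avg**2`, and `std_deviation_bounds` is `avg ± 2 * std_deviation`. The following sketch recomputes the same fields from a plain list using only the standard library. `extended_stats` here is a hypothetical helper, not a manuscripts2 function, and the sample ages are made up for illustration.

```python
import statistics


def extended_stats(values, sigma=2.0):
    """Recompute extended-stats style fields from raw values (population variance)."""
    avg = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=avg)
    std = variance ** 0.5
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": sum(values),
        "sum_of_squares": sum(v * v for v in values),
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {
            "lower": avg - sigma * std,
            "upper": avg + sigma * std,
        },
    }


# Made-up issue ages in days, for illustration only.
ages = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = extended_stats(ages)
print(stats)
```

Because issue ages cannot be negative, a negative lower bound such as the one printed above only signals a wide, skewed distribution.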


### first response to issue duration

In [11]:
issues = Issues(github_issues_index)
percentiles = issues.get_percentiles("time_to_first_attention").get_aggs()
print("Percentiles: ", percentiles)

issues.get_extended_stats("time_to_first_attention")
extended_stats = issues.fetch_aggregation_results()['aggregations']['0']
print("Extended stats for the issues: ")
pprint(extended_stats)

Percentiles:  0.19500000029802322
Extended stats for the issues: 
{'avg': 21.262884847497425,
 'count': 104,
 'max': 535.3400268554688,
 'min': 0.0,
 'std_deviation': 82.63831294030459,
 'std_deviation_bounds': {'lower': -144.01374103311176,
                          'upper': 186.5395107281066},
 'sum': 2211.340024139732,
 'sum_of_squares': 757244.9079163955,
 'variance': 6829.090765619713}


### closed issue resolution duration (Time to resolution of closed issue)

#### percentiles

In [7]:
closed_issues = Issues(github_issues_index)
closed_issues.is_closed()
percentiles = closed_issues.get_percentiles("time_to_close_days").get_aggs()
print("Percentiles: ", percentiles)

closed_issues.get_extended_stats("time_to_close_days")
extended_stats = closed_issues.fetch_aggregation_results()['aggregations']['0']
print("Extended stats for the issues: ")
pprint(extended_stats)

Percentiles:  3.6500000953674316
Extended stats for the issues: 
{'avg': 30.83017426458714,
 'count': 115,
 'max': 582.3300170898438,
 'min': 0.0,
 'std_deviation': 92.63210085425231,
 'std_deviation_bounds': {'lower': -154.4340274439175,
                          'upper': 216.09437597309176},
 'sum': 3545.470040427521,
 'sum_of_squares': 1096088.661693576,
 'variance': 8580.706108672372}


#### Moving average
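For time to issue resolution, we'll also look at the moving average of the total time to close. The example query in the cell uses monthly buckets, while the code that runs uses weekly buckets and keeps the metric name `monthly_moving_average`.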

In [12]:
'''
Example query to get moving average 
{
    "size": 0,
    "aggs": {
        "my_date_histo":{                
            "date_histogram":{
                "field":"created_at",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "time_to_close_days" } 
                },
                "the_movavg":{
                    "moving_avg":{ "buckets_path": "the_sum" } 
                }
            }
        }
    }
}
'''

closed_issues = Issues(github_issues_index)
closed_issues.is_closed()
a = A("date_histogram", field="created_at", interval="week")
a.metric("the_sum", "sum", field="time_to_close_days")
a.metric("monthly_moving_average", "moving_avg", buckets_path="the_sum")
closed_issues.add_custom_aggregation(a)
moving_average_for_closed_issues = buckets_to_df(closed_issues.fetch_aggregation_results()['aggregations']['0']['buckets'])
print("Moving average for closed issues: ")
print(moving_average_for_closed_issues)

Moving average for closed issues: 
            date_in_seconds  monthly_moving_average     the_sum
key                                                            
2016-01-18    1453075200000                0.000000    0.570000
2016-01-25    1453680000000                0.000000    0.000000
2016-02-01    1454284800000                0.000000    0.000000
2016-02-08    1454889600000                0.570000   16.299999
2016-02-15    1455494400000                0.000000    0.000000
2016-02-22    1456099200000                0.000000    0.000000
2016-02-29    1456704000000                0.000000    0.000000
2016-03-07    1457308800000                8.435000    0.070000
2016-03-14    1457913600000                5.646666    3.640000
2016-03-21    1458518400000                5.145000    4.950000
2016-03-28    1459123200000                0.000000    0.000000
2016-04-04    1459728000000                5.106000    0.760000
2016-04-11    1460332800000                0.000000    0.000000
2016-
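
The `monthly_moving_average` column above can be checked by hand: each non-zero value matches the plain average of up to five preceding non-empty buckets, with the current bucket excluded. The sketch below assumes exactly that behaviour (a window of 5, empty weeks skipped), which is inferred from the printed numbers. `trailing_moving_average` is a hypothetical helper, weeks printed with a sum of 0 are treated as empty and modelled as `None` (an assumption), and the sums are copied, rounded, from the output above.

```python
from collections import deque


def trailing_moving_average(sums, window=5):
    """Average of the previous `window` non-empty values, excluding the current one."""
    history = deque(maxlen=window)
    averages = []
    for value in sums:
        if value is None:
            # Empty bucket: no average is produced and it does not enter the window.
            averages.append(None)
            continue
        averages.append(sum(history) / len(history) if history else None)
        history.append(value)
    return averages


# Weekly sums from 2016-01-18 to 2016-04-04, with empty weeks as None.
weekly_sums = [0.57, None, None, 16.3, None, None, None, 0.07, 3.64, 4.95, None, 0.76]

averages = trailing_moving_average(weekly_sums)
for total, average in zip(weekly_sums, averages):
    print(total, average)
```

Note that this is a moving average of weekly sums of `time_to_close_days`, so a week with many closed issues weighs more than a week with few.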

## Code Development
Goal: Identify how effective the community is at merging new code into the codebase.

Name | Question |
--- | --- | 
[Code Commits](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-commits.md) | What is the number of code commits? |
[Lines of Code Changed](https://github.com/chaoss/metrics/tree/master/activity-metrics/lines-of-code-changed.md) | What is the number of lines of code changed? | 
[Code Reviews](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-reviews.md) | What is the number of code reviews?
[Code Merge Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-merge-duration.md) | What is the duration of time between code merge request and code commit?
[Code Review Efficiency](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-review-efficiency.md) | What is the number of merged code changes/number of abandoned code change requests?
[Maintainer Response to Merge Request Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/maintainer-response-to-merge-request-duration.md) | What is the duration of time for a maintainer to make a first response to a code merge request?
[Code Review Iteration](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-review-iteration.md) | What is the number of iterations that occur before a merge request is accepted or declined?
[Forks](https://github.com/chaoss/metrics/tree/master/activity-metrics/forks.md) | Forks are a concept in distributed version control systems like GitHub. It is a proxy for the approximate number of developers who have taken a shot at building and deploying the codebase *for development*.
[Pull Requests Open](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-requests-open.md) | Number of open pull requests. | 
[Pull Requests Closed](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-requests-made-closed.md) | Number of closed pull requests. | 
[Pull Request Comment Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-requests-comment-duration.md) | The difference between the timestamp of the pull request creation date and the most recent comment on the pull request.
[Pull Request Comment Diversity](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-requests-comment-diversity.md) | Number of people discussing each pull request.
[Pull Request Comments](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-request-comments.md) | Number of comments on each pull request. 


### code commits

**NOTE:** this metric is calculated from the git index (`git_index`), not from the GitHub issues or pull requests indices.

In [13]:
commits = Query(git_index)
commits.get_cardinality("hash")
total_commits = commits.get_aggs()
print("total commits: ", total_commits)

total commits:  1244


#### Number of commits by month

In [14]:
commits.get_cardinality("hash").by_period().get_timeseries(dataframe=True)

                unixtime  value
date                           
2015-08-01  1.438387e+09      8
2015-09-01  1.441066e+09      0
2015-10-01  1.443658e+09      0
2015-11-01  1.446336e+09     23
2015-12-01  1.448928e+09     21
2016-01-01  1.451606e+09     26
2016-02-01  1.454285e+09     76
2016-03-01  1.456790e+09     51
2016-04-01  1.459469e+09     24
2016-05-01  1.462061e+09     21


### lines of code changed

In [15]:
commits = Query(git_index)
lc = commits.get_sum("lines_changed").get_aggs()
la = commits.get_sum("lines_added").get_aggs()
lr = commits.get_sum("lines_removed").get_aggs()

print("Total lines changed: ", lc)
print("Total lines added: ", la)
print("Total lines removed: ", lr)

Total lines changed:  197426.0
Total lines added:  152494.0
Total lines removed:  44932.0


### Code reviews = Number of PRs

In [19]:
prs = PullRequests(github_prs_index)
print("Number of PRs: ", prs.get_cardinality("id_in_repo").get_aggs())

print("Number of PRs per quarter: ")
print(prs.get_cardinality("id_in_repo").by_period(period="quarter").get_timeseries(dataframe=True))

Number of PRs:  276
Number of PRs per quarter: 
                unixtime  value
date                           
2015-10-01  1.443658e+09      2
2016-01-01  1.451606e+09     18
2016-04-01  1.459469e+09     11
2016-07-01  1.467331e+09     13
2016-10-01  1.475280e+09     12
2017-01-01  1.483229e+09     10
2017-04-01  1.491005e+09      4
2017-07-01  1.498867e+09     16
2017-10-01  1.506816e+09     66
2018-01-01  1.514765e+09     82
2018-04-01  1.522541e+09     24
2018-07-01  1.530403e+09     18


### pull requests open

In [20]:
open_prs = PullRequests(github_prs_index)
open_prs.is_open()
# get the single valued aggregation before putting it again as a child agg
num_open_prs = open_prs.get_cardinality("id_in_repo").get_aggs() 
print("Number of Pull Requests open: ", num_open_prs)
print()
open_prs.get_cardinality("id_in_repo").by_authors("author_name")
response = open_prs.fetch_aggregation_results()['aggregations']
open_prs_by_authors = response['0']['buckets']
print("Open PRs by authors")
print(buckets_to_df(open_prs_by_authors))

Number of Pull Requests open:  8

Open PRs by authors
   0  doc_count                         key
0  2          2  Jesus M. Gonzalez-Barahona
1  2          2               Keanu Nichols
2  1          1               Gustavo Silva
3  1          1                 Jose Miguel
4  1          1      Miguel Ángel Fernández
5  1          1                     valerio


### pull requests closed

In [21]:
closed_prs = PullRequests(github_prs_index)
closed_prs.is_closed()
# get the single valued aggregation before putting it again as a child agg
num_closed_prs = closed_prs.get_cardinality("id_in_repo").get_aggs()
print("Number of closed pull requests: ", num_closed_prs)
print()

closed_prs.get_cardinality("id_in_repo").by_authors("author_name")
response = closed_prs.fetch_aggregation_results()['aggregations']
closed_prs_by_authors = response['0']['buckets']
print("Number of closed PRs by authors")
print(buckets_to_df(closed_prs_by_authors))

Number of closed pull requests:  268

Number of closed PRs by authors
      0  doc_count                         key
0   153        153                     valerio
1    33         33         Alvaro del Castillo
2    22         22             Santiago Dueñas
3    19         19  Jesus M. Gonzalez-Barahona
4    10         10              Alberto Martín
5     4          4                 Jose Miguel
6     4          4                   Quan Zhou
7     3          3        David Pose Fernández
8     2          2                 David Esler
9     2          2              Israel Herraiz
10    2          2               Keanu Nichols
11    2          2                    camillem
12    1          1           Anvesh Chaturvedi
13    1          1                 Assad (OW2)
14    1          1               Gustavo Silva
15    1          1                      Jeremy
16    1          1             Luis Cañas-Díaz
17    1          1              Manrique Lopez
18    1          1      Miguel Ángel 

## Community Growth
Goal: Identify the size of the project community and whether it's growing, shrinking, or staying the same.

Name | Question | 
--- | --- |
[Contributors](https://github.com/chaoss/metrics/tree/master/activity-metrics/contributors.md) | What is the number of contributors? |
[New Contributors](https://github.com/chaoss/metrics/tree/master/activity-metrics/new-contributors.md) | What is the number of new contributors? | 
[Contributing Organizations](https://github.com/chaoss/metrics/tree/master/activity-metrics/contributing-organizations.md) | What is the number of contributing organizations? |
[New Contributing Organizations](https://github.com/chaoss/metrics/tree/master/activity-metrics/new-contributing-organizations.md) | What is the number of new contributing organizations?
[Sub-Projects](https://github.com/chaoss/metrics/tree/master/activity-metrics/sub-projects.md) | What is the number of sub-projects?

### number of contributors

In [22]:
contributors = Query(git_index)
contributors.get_sum("lines_changed").by_authors("author_name")
contributors.get_sum("lines_added").by_authors("author_name")
contributors.get_sum("lines_removed").by_authors("author_name")
contributors.get_average("files").by_authors("author_name")

contributors_df = buckets_to_df(contributors.fetch_aggregation_results()['aggregations']['0']['buckets'])
del contributors_df['doc_count']

contributors_df = contributors_df.set_index("key")
contributors_df = contributors_df.rename(columns={'0': "lines_changed", '1': "lines_added", '2': "lines_removed", '3': "files_changed"})

print(contributors_df)

                                lines_changed  lines_added  lines_removed  \
key                                                                         
Santiago Dueñas                       61910.0      45588.0        16322.0   
valerio cosentino                     90952.0      64292.0        26660.0   
Alvaro del Castillo                   15735.0      15340.0          395.0   
Alberto Martín                        22862.0      21781.0         1081.0   
Jesus M. Gonzalez-Barahona             1232.0       1105.0          127.0   
quan                                   3654.0       3593.0           61.0   
Miguel Ángel Fernández                  190.0         46.0          144.0   
David Pose Fernández                     10.0          4.0            6.0   
camillem                                  4.0          2.0            2.0   
valerio                                 162.0        123.0           39.0   
David Esler                               3.0          2.0            1.0   

### new contributors

To find new contributors, we need the date on which each author made their first commit to the project. [This](https://grimoirelab.gitbooks.io/tutorial/python/pandas-for-grimoirelab-indexes.html) GrimoireLab tutorial shows how to obtain those dates. Grouping the first-commit dates by month gives the new authors for each month, and the same grouping works per year. (Grouping by week is also possible, but it adds complexity for little benefit.)

In [23]:
# new contributors by month
new_contributors = Query(git_index)
new_contributors.get_min("author_date").by_authors("author_name")
response = new_contributors.fetch_aggregation_results()
buckets = response['aggregations']['0']['buckets']
pprint(buckets_to_df(buckets))

               0  doc_count                             key
0   1.439921e+12        770                 Santiago Dueñas
1   1.504796e+12        323               valerio cosentino
2   1.449255e+12         59             Alvaro del Castillo
3   1.455033e+12         51                  Alberto Martín
4   1.451589e+12         19      Jesus M. Gonzalez-Barahona
5   1.459513e+12          5                            quan
6   1.518440e+12          3          Miguel Ángel Fernández
7   1.509697e+12          2            David Pose Fernández
8   1.459163e+12          2                        camillem
9   1.507653e+12          2                         valerio
10  1.508280e+12          1                     David Esler
11  1.515512e+12          1                  Israel Herraiz
12  1.457165e+12          1  J. Manrique Lopez de la Fuente
13  1.474893e+12          1                 Luis Cañas Díaz
14  1.523210e+12          1                         Prabhat
15  1.483981e+12          1             
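
The `0` column above holds each author's first commit date as epoch milliseconds. As a minimal sketch of the grouping step described earlier, the following code converts those timestamps to UTC months and lists the new authors per month. `new_authors_by_month` is a hypothetical helper, and the timestamps are the rounded values printed above, so the exact day may be slightly off.

```python
from collections import defaultdict
from datetime import datetime, timezone


def new_authors_by_month(first_commit_ms):
    """Group authors by the UTC month (YYYY-MM) of their first commit."""
    by_month = defaultdict(list)
    for author, millis in first_commit_ms.items():
        first_commit = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
        by_month[first_commit.strftime("%Y-%m")].append(author)
    return {month: sorted(authors) for month, authors in sorted(by_month.items())}


# First five rows of the output above (rounded epoch milliseconds).
first_commits = {
    "Santiago Dueñas": 1.439921e12,
    "valerio cosentino": 1.504796e12,
    "Alvaro del Castillo": 1.449255e12,
    "Alberto Martín": 1.455033e12,
    "Jesus M. Gonzalez-Barahona": 1.451589e12,
}

new_by_month = new_authors_by_month(first_commits)
for month, authors in new_by_month.items():
    print(month, len(authors), authors)
```

Switching the format string to `"%Y"` gives the same grouping per year.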

### Contributing Organizations

In [24]:
contributing_orgs = Query(github_issues_index)
contributing_orgs.get_terms("user_org")
response = contributing_orgs.fetch_aggregation_results()
buckets = response['aggregations']['0']['buckets']
organizations = pd.Series([item['key'] for item in buckets])
print(organizations)

0                       @Bitergia 
1                         Bitergia
2                         GNUmedia
3            BBVA Data & Analytics
4                               EY
5                   Geeky Engineer
6                 T-Systems Iberia
7                 @DIAL-Community 
8                           @SUSE 
9                           @adobe
10                             CMU
11                  IIIT Hyderabad
12                          Orange
13                            SUSE
14                         Samsung
15              University of Oulu
16                  Yak Shave Inc.
17    http://www.aeva.in/team.html
dtype: object
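
The list above contains the same organization under several spellings (for example `@Bitergia ` and `Bitergia`), so a raw count overstates the number of contributing organizations. The sketch below shows one possible clean-up before counting. `distinct_organizations` is a hypothetical helper, and the normalization rules (trim whitespace, drop a leading `@`, compare case-insensitively) are assumptions rather than anything manuscripts2 does.

```python
def distinct_organizations(names):
    """Deduplicate organization names that differ only by '@', spacing or case."""
    seen = {}
    for name in names:
        cleaned = name.strip().lstrip("@").strip()
        if not cleaned:
            continue
        # Keep the first spelling encountered for each normalized key.
        seen.setdefault(cleaned.casefold(), cleaned)
    return [seen[key] for key in sorted(seen)]


# A few of the values printed above.
raw_names = ["@Bitergia ", "Bitergia", "GNUmedia", "@SUSE ", "SUSE", "@adobe"]

organizations_clean = distinct_organizations(raw_names)
print(len(organizations_clean), organizations_clean)
```

Entries that are URLs or free text would still need manual review.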
