# Manuscripts: Re-visited

Manuscripts currently provides mostly aggregations of the data, such as averages and cardinalities. It isn't flexible enough to let us play with the data, for example by sorting it by different filters and values. Here we experiment with what can be done with the metrics. None of the previously written code is used, so that we can look at different approaches and, in effect, redesign the current code.

We'll still be looking at the [GMD metrics](https://github.com/chaoss/metrics/blob/master/2_Growth-Maturity-Decline.md).

In [1]:
# import the necessary libraries

# analysis modules
import pandas as pd

# query and connection modules
from elasticsearch import Elasticsearch

from elasticsearch_dsl import A, Q, Search
from elasticsearch_dsl.query import Match, MultiMatch

# utility and support modules
from manuscripts2.new_functions import EQCC, Index, buckets_to_df, calculate_bmi
from manuscripts2.derived_classes import PullRequests, Issues
from pprint import pprint
from datetime import datetime, timezone
from dateutil import parser, relativedelta

In [2]:
# declare the necessary variables
github_data_source = "perceval_github"
git_data_source = "perceval_git"

github_index = Index(github_data_source)
git_index = Index(git_data_source)

start_date = datetime(2014, 8, 1)  # "2014-08-01"
end_date = datetime(2018, 5, 22)

max_size = 10000 # temporary hack to get all the values in the query

### Types of filters we will be looking at:

Let's talk about the kind of filters we want while looking at the metrics. 

Can we look at the metrics by segregating them according to:
- Date?
 - days
 - weeks
 - months
 - years


- Organizations?
 - If people from multiple organizations are part of the project, we may want to see how they work together and which organization has the most influence.
 

- Authors?
 - What if we want all the issues by the authors that created them?
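
The groupings listed above can be sketched without Elasticsearch at all. The snippet below is a minimal illustration, assuming each issue is a plain dict; the records and the field values are made up for the example and are not taken from the index.

```python
from collections import Counter
from datetime import datetime

# Hypothetical issue records; the real data lives in the Elasticsearch index.
issues = [
    {"created_at": datetime(2016, 1, 18), "author_name": "alice", "user_org": "org-a"},
    {"created_at": datetime(2016, 1, 25), "author_name": "bob", "user_org": "org-b"},
    {"created_at": datetime(2016, 2, 8), "author_name": "alice", "user_org": "org-a"},
    {"created_at": datetime(2017, 3, 7), "author_name": "carol", "user_org": "org-a"},
]

# One key function per date granularity.
DATE_KEYS = {
    "day": lambda d: d.strftime("%Y-%m-%d"),
    "week": lambda d: "{}-W{:02d}".format(*d.isocalendar()[:2]),  # ISO year and week
    "month": lambda d: d.strftime("%Y-%m"),
    "year": lambda d: d.strftime("%Y"),
}


def count_by_date(records, granularity):
    """Count records per day, week, month or year of creation."""
    key = DATE_KEYS[granularity]
    return Counter(key(r["created_at"]) for r in records)


def count_by_field(records, field):
    """Count records per distinct value of `field` (author, organization, ...)."""
    return Counter(r[field] for r in records)


by_week = count_by_date(issues, "week")
by_month = count_by_date(issues, "month")
by_year = count_by_date(issues, "year")
by_author = count_by_field(issues, "author_name")
by_org = count_by_field(issues, "user_org")
```

In the notebook itself these groupings are delegated to Elasticsearch aggregations; the sketch only shows what each filter means.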

## Issue Resolution
Goal: Identify how effective the community is at addressing issues identified by community participants.

Name | Question | Implemented | Issue | PR | Visualisation 
--- | --- | --- | --- | --- | --- |
[Open Issues](https://github.com/chaoss/metrics/tree/master/activity-metrics/open-issues.md) | What is the number of open issues? | Yes | None | None | No
[Closed Issues](https://github.com/chaoss/metrics/tree/master/activity-metrics/closed-issues.md) | What is the number of closed issues? | Yes | None | None | No
[Issue Resolution Efficiency](https://github.com/chaoss/metrics/tree/master/activity-metrics/issue-resolution-efficiency.md) | What is the number of closed issues/number of abandoned issues? | Yes | [wg-gmd#5](https://github.com/chaoss/wg-gmd/issues/5) | None | No
[Open Issue Age](https://github.com/chaoss/metrics/tree/master/activity-metrics/open-issue-age.md) | What is the age of open issues? | Yes | None | None | No
[First Response to Issue Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/first-response-to-issue-duration.md) | What is the duration of time for a first response to an issue? | No | [wg-gmd#8](https://github.com/chaoss/wg-gmd/issues/8) | None | No
[Closed Issue Resolution Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/closed-issue-resolution-duration.md) | What is the duration of time for issues to be resolved? | Yes | [wg-gmd#7](https://github.com/chaoss/wg-gmd/issues/7) | None | No

<a id="open_issues"></a>
### Open issues

Here we count the open issues. The same query can also be broken down by author, by organization and by the month in which the issues were created.

In [3]:
open_issues = Issues(github_index)
open_issues.is_open()
num_open_issues = open_issues.get_cardinality("id_in_repo").get_aggs()
print(num_open_issues)

25


<a id="closed_issues"></a>
### Closed issues

In [4]:
closed_issues = Issues(github_index)
closed_issues.is_closed()
# remember, all the aggregations are going into an ordered dict
total_closed = closed_issues.get_cardinality("id_in_repo").get_aggs() 
print(total_closed)

closed_by_authors = closed_issues.get_cardinality("id_in_repo").by_authors("author_name").fetch_aggregation_results()
by_author_buckets = closed_by_authors['aggregations']['0']['buckets']
print(buckets_to_df(by_author_buckets))

113
     0  doc_count                         key
0   27         27         Alvaro del Castillo
1   20         20              Alberto Martín
2    9          9  Jesus M. Gonzalez-Barahona
3    9          9              Manrique Lopez
4    9          9             Santiago Dueñas
5    3          3   Daniel Izquierdo Cortazar
6    3          3        David Pose Fernández
7    3          3                 Jose Miguel
8    3          3                   Quan Zhou
9    2          2    Brylie Christopher Oxley
10   2          2                      MishiR
11   2          2              Robin Muilwijk
12   2          2             Saad Bin Shahid
13   2          2                 TheReal1604
14   2          2                 iganchevup8
15   2          2                     valerio
16   1          1               Andre Klapper
17   1          1            Bogdan Vasilescu
18   1          1                  Bowen Chen
19   1          1              Cristian Baldi
20   1          1             

### Issue resolution efficiency

In [5]:
#  issues closed / issues open per month
opened_issues = Issues(github_index)
opened_issues_by_period = opened_issues.get_cardinality("id_in_repo").by_period().get_ts()

closed_issues_by_period = closed_issues.get_cardinality("id_in_repo").by_period(field="closed_at").get_ts()

issues_closed_per_issue_opened = calculate_bmi(closed_issues_by_period, opened_issues_by_period)

# TODO:
# total number of bugs closed during the period / (total number of bugs opened during the period + total number of bugs open at the beginning of the period)
# total number of bugs still open at the end of the period / (total number of bugs opened during the period + total number of bugs open at the beginning of the period)
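
The two ratios in the TODO above can be sketched as follows. This is only an illustration: `backlog_rates` is a hypothetical helper, not part of manuscripts2, the counts are made up, and it assumes that every issue closed in the period was either open at its start or opened during it.

```python
def backlog_rates(open_at_start, opened, closed):
    """Return (closed_share, still_open_share) of the workload for one period.

    The workload is everything that could have been closed in the period:
    issues already open at the start plus issues opened during it.
    """
    workload = open_at_start + opened
    if workload == 0:
        # Nothing to work on in this period; report both shares as zero.
        return 0.0, 0.0
    still_open = workload - closed
    return closed / workload, still_open / workload


# Made-up counts for a single period.
closed_share, still_open_share = backlog_rates(open_at_start=10, opened=5, closed=6)
print(closed_share, still_open_share)
```

Under the stated assumption the two shares add up to one, so either of them is enough to describe the period.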

In [6]:
issues_closed_per_issue_opened

{'Closed/Submitted': [1.0,
  0.5,
  0.5,
  1.0,
  0,
  0.3333333333333333,
  1.0,
  0.0,
  0.7142857142857143,
  0.5833333333333334,
  0.5714285714285714,
  1.5,
  0.7272727272727273,
  3.0,
  0.7142857142857143,
  1.5,
  0.5,
  1.0,
  1.0,
  0,
  0.5555555555555556,
  0.875,
  1.1818181818181819,
  0.75,
  1.3333333333333333,
  0.5714285714285714,
  1.25,
  3.0],
 'Period': ['2016-1',
  '2016-2',
  '2016-3',
  '2016-4',
  '2016-5',
  '2016-6',
  '2016-7',
  '2016-8',
  '2016-9',
  '2016-10',
  '2016-11',
  '2016-12',
  '2017-1',
  '2017-2',
  '2017-3',
  '2017-4',
  '2017-5',
  '2017-6',
  '2017-7',
  '2017-8',
  '2017-9',
  '2017-10',
  '2017-11',
  '2017-12',
  '2018-1',
  '2018-2',
  '2018-3',
  '2018-4']}
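
A result of this shape can be produced from two per-period counts. The sketch below is not the implementation of `calculate_bmi`; `closed_per_opened` is a hypothetical stand-in, the counts are invented, and reporting `0` for periods in which nothing was opened is an assumption.

```python
def closed_per_opened(closed_by_period, opened_by_period):
    """Divide closed by opened issues for every period present in either input."""
    periods = sorted(
        set(closed_by_period) | set(opened_by_period),
        key=lambda p: tuple(int(part) for part in p.split("-")),  # "2016-10" after "2016-9"
    )
    ratios = []
    for period in periods:
        opened = opened_by_period.get(period, 0)
        closed = closed_by_period.get(period, 0)
        # Assumption: a period with no opened issues is reported as 0.
        ratios.append(closed / opened if opened else 0)
    return {"Closed/Submitted": ratios, "Period": periods}


# Invented counts, keyed the same way as the 'Period' values above.
opened = {"2016-1": 2, "2016-2": 4, "2016-10": 0}
closed = {"2016-1": 2, "2016-2": 2, "2016-10": 1}
result = closed_per_opened(closed, opened)
print(result)
```

A ratio above 1 means more issues were closed than opened in that period, that is, the backlog shrank.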

<a id="closed_issues"></a>
### Open issue age

As per the [discussion here](https://github.com/chaoss/metrics/blob/master/activity-metrics/open-issue-age.md), we'll calculate the percentile, mean and variance, and create some visualisations for this metric.

In [10]:
issue = Issues(github_index)
issue.is_open()
percentiles = issue.get_percentile("time_open_days").get_aggs()
print("Percentiles: ", percentiles)

issue.get_extended_stats("time_open_days")
extended_stats = issue.fetch_aggregation_results()['aggregations']['1']
pprint(extended_stats)

Percentiles:  472.7099914550781
{'avg': 428.90999813079833,
 'count': 25,
 'max': 835.75,
 'min': 10.800000190734863,
 'std_deviation': 291.42757593563766,
 'std_deviation_bounds': {'lower': -153.945153740477,
                          'upper': 1011.7651500020736},
 'sum': 10722.749953269958,
 'sum_of_squares': 6722345.462807083,
 'variance': 84930.03201572187}
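
The fields in this output are related to each other, which is a useful sanity check. The sketch below recomputes them from `count`, `sum` and `sum_of_squares` as printed above; that the bounds are two standard deviations around the mean is inferred from the numbers and matches the default `sigma` of 2 in Elasticsearch's `extended_stats`.

```python
import math

# Figures copied from the extended_stats output above.
count = 25
total = 10722.749953269958
sum_of_squares = 6722345.462807083

avg = total / count
# Population variance: mean of the squares minus the square of the mean.
variance = sum_of_squares / count - avg ** 2
std_deviation = math.sqrt(variance)
# Bounds at two standard deviations around the mean.
lower = avg - 2 * std_deviation
upper = avg + 2 * std_deviation

print(avg, variance, std_deviation, lower, upper)
```

The negative lower bound is a reminder that issue ages are far from normally distributed, so the bounds are only a rough guide here.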


#### visualizations

In [11]:
# visualisations
time_open_days_issues = Issues(github_index)
time_open_days_issues.is_open()
time_open_days_issues.fetch_results_from_source('time_open_days', 'id_in_repo', dataframe=True)

Unnamed: 0,id_in_repo,time_open_days
0,58,638.91
1,104,529.01
2,319,120.93
3,385,19.94
4,91,578.98
5,139,395.82
6,217,194.16
7,331,109.36
8,42,731.93
9,16,835.75
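
As a first, dependency-free visualisation, the ages listed above can be binned into a text histogram. This is only a sketch: it uses the ten rows shown here rather than the full result, and the bin width of 200 days is an arbitrary choice.

```python
from collections import Counter

# time_open_days for the ten open issues listed above.
time_open_days = [638.91, 529.01, 120.93, 19.94, 578.98,
                  395.82, 194.16, 109.36, 731.93, 835.75]

BIN_WIDTH = 200  # days per bin; an arbitrary choice for this sketch

# Map every age to the lower edge of its bin and count the issues per bin.
bins = Counter(int(age // BIN_WIDTH) * BIN_WIDTH for age in time_open_days)

for start in sorted(bins):
    label = "{:>3}-{:<4}".format(start, start + BIN_WIDTH)
    print(label, "#" * bins[start])
```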


### First response to issue

### Closed issue resolution duration (Time to resolution of closed issue)

#### percentiles

In [12]:
closed_issues = Issues(github_index)
closed_issues.is_closed()
percentiles = closed_issues.get_percentile("time_to_close_days").get_aggs()
print("Percentiles: ", percentiles)

closed_issues.get_extended_stats("time_to_close_days")
extended_stats = closed_issues.fetch_aggregation_results()['aggregations']['1']
pprint(extended_stats)

Percentiles:  3.6500000953674316
{'avg': 31.253982658819417,
 'count': 113,
 'max': 582.3300170898438,
 'min': 0.0,
 'std_deviation': 93.39081320398705,
 'std_deviation_bounds': {'lower': -155.5276437491547,
                          'upper': 218.0356090667935},
 'sum': 3531.7000404465944,
 'sum_of_squares': 1095948.062792196,
 'variance': 8721.843990902002}


#### Moving average
For time to issue resolution, we'll also look at the moving average.

In [13]:
'''
Example query to get moving average 
{
    "size": 0,
    "aggs": {
        "my_date_histo":{                
            "date_histogram":{
                "field":"created_at",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "time_to_close_days" } 
                },
                "the_movavg":{
                    "moving_avg":{ "buckets_path": "the_sum" } 
                }
            }
        }
    }
}
'''
closed_issues = Issues(github_index)
closed_issues.is_closed()
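# NOTE: the buckets here are weekly, unlike the monthly example query above, so
# "monthly_moving_average" is really a moving average over weekly sums
a = A("date_histogram", field="created_at", interval="week")
a.metric("the_sum", "sum", field="time_to_close_days")
a.metric("monthly_moving_average", "moving_avg", buckets_path="the_sum")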
closed_issues.add_custom_aggregation(a)
moving_average_for_closed_issues = buckets_to_df(closed_issues.fetch_aggregation_results()['aggregations']['0']['buckets'])

In [14]:
moving_average_for_closed_issues

Unnamed: 0_level_0,date_in_seconds,monthly_moving_average,the_sum
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2016-01-18,1453075200000,,0.570000
2016-01-25,1453680000000,,0.000000
2016-02-01,1454284800000,,0.000000
2016-02-08,1454889600000,0.570000,16.299999
2016-02-15,1455494400000,,0.000000
2016-02-22,1456099200000,,0.000000
2016-02-29,1456704000000,,0.000000
2016-03-07,1457308800000,8.435000,0.070000
2016-03-14,1457913600000,5.646666,3.640000
2016-03-21,1458518400000,5.145000,4.950000


#### visualizations

In [15]:
# visualisations
closed_issues = Issues(github_index)
closed_issues.is_closed()
closed_issue_age = closed_issues.fetch_results_from_source('time_to_close_days', 'id_in_repo', dataframe=True)
print(closed_issue_age)

    id_in_repo  time_to_close_days
0           32                0.76
1           50                3.19
2           63                0.24
3           97                2.62
4           77               71.78
5          108                2.54
6          133                0.03
7          257                0.20
8          155                1.95
9          358                0.80
10         369                1.13
11          26                0.01
12          57                2.83
13          80                0.07
14          89                0.08
15          94                0.72
16         107                0.54
17         116                0.19
18         114                7.40
19         126                3.92
20         132                5.96
21         150                7.01
22         165               42.63
23         202                6.09
24         261                5.09
25         141              158.00
26         175               14.76
27         183      

## Code Development
Goal: Identify how effective the community is at merging new code into the codebase.

Name | Question | Implemented | Issue | PR
--- | --- | --- | --- | --- |
[Code Commits](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-commits.md) | What is the number of code commits? | Yes | None | None
[Lines of Code Changed](https://github.com/chaoss/metrics/tree/master/activity-metrics/lines-of-code-changed.md) | What is the number of lines of code changed? | Yes | None | None
[Code Reviews](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-reviews.md) | What is the number of code reviews?
[Code Merge Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-merge-duration.md) | What is the duration of time between code merge request and code commit?
[Code Review Efficiency](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-review-efficiency.md) | What is the number of merged code changes/number of abandoned code change requests?
[Maintainer Response to Merge Request Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/maintainer-response-to-merge-request-duration.md) | What is the duration of time for a maintainer to make a first response to a code merge request?
[Code Review Iteration](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-review-iteration.md) | What is the number of iterations that occur before a merge request is accepted or declined?
[Forks](https://github.com/chaoss/metrics/tree/master/activity-metrics/forks.md) | Forks are a concept in distributed version control systems like GitHub. It is a proxy for the approximate number of developers who have taken a shot at building and deploying the codebase *for development*.
[Pull Requests Open](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-requests-open.md) | Number of open pull requests. | Yes | None | None | 
[Pull Requests Closed](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-requests-made-closed.md) | Number of closed pull requests. | Yes | None | None |
[Pull Request Comment Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-requests-comment-duration.md) | The difference between the timestamp of the pull request creation date and the most recent comment on the pull request.
[Pull Request Comment Diversity](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-requests-comment-diversity.md) | Number of people discussing each pull request.
[Pull Request Comments](https://github.com/chaoss/metrics/tree/master/activity-metrics/pull-request-comments.md) | Number of comments on each pull request. 


### Code commits

**NOTE:** The index has to be changed here: commits are read from the git index (`git_index`), not from the GitHub index.

In [16]:
commits = EQCC(git_index)
commits.get_cardinality("hash")
total_commits = commits.get_aggs()
print("total commits: ", total_commits)

all_commits = commits.fetch_results_from_source("hash", "commit_date")

total commits:  1194


When you go to the [perceval github repo](https://github.com/chaoss/grimoirelab-perceval), you'll see that only 1182 commits are actually present. The difference may be due to some empty commit messages.

#### by months

In [17]:
buckets_to_df(commits.get_cardinality("hash").by_period().fetch_aggregation_results()['aggregations']['0']['buckets'])

Unnamed: 0_level_0,0,date_in_seconds
key,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-08-01,8,1438387200000
2015-09-01,0,1441065600000
2015-10-01,0,1443657600000
2015-11-01,23,1446336000000
2015-12-01,21,1448928000000
2016-01-01,26,1451606400000
2016-02-01,76,1454284800000
2016-03-01,51,1456790400000
2016-04-01,24,1459468800000
2016-05-01,21,1462060800000


### Lines of code changed

In [18]:
commits = EQCC(git_index)
lc = commits.get_sum("lines_changed").get_aggs()
la = commits.get_sum("lines_added").get_aggs()
lr = commits.get_sum("lines_removed").get_aggs()

print("Total lines changed: ", lc)
print("Total lines added: ", la)
print("Total lines removed: ", lr)

Total lines changed:  177774.0
Total lines added:  133052.0
Total lines removed:  44722.0
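
In these totals, lines changed is the sum of lines added and lines removed, so the net growth of the codebase has to be derived separately. A short check with the figures printed above:

```python
# Totals copied from the output above.
lines_changed = 177774.0
lines_added = 133052.0
lines_removed = 44722.0

# "Changed" counts every touched line, whether it was added or removed.
changed_matches = lines_added + lines_removed == lines_changed

# Net growth is what is left once removals are subtracted from additions.
net_growth = lines_added - lines_removed
print(changed_matches, net_growth)
```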


### Pull requests Open

In [19]:
open_prs = PullRequests(github_index)
open_prs.is_open()
# get the single valued aggregation before putting it again as a child agg
num_open_prs = open_prs.get_cardinality("id_in_repo").get_aggs() 
open_prs.get_cardinality("id_in_repo").by_authors("author_name")
response = open_prs.fetch_aggregation_results()['aggregations']

In [20]:
open_prs.aggregations

OrderedDict([('terms_author_name',
              Terms(aggs={0: Cardinality(field='id_in_repo', precision_threshold=3000)}, field='author_name', missing='others', size=10000))])
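
For readers more familiar with raw Elasticsearch requests, the aggregation shown above corresponds roughly to the request body built below. This is a sketch: the field names and parameters are taken from the output above, while the aggregation names `"0"`, the top-level `"size": 0` and the omission of the open pull request filter are assumptions. `terms_with_cardinality` is a hypothetical helper.

```python
import json


def terms_with_cardinality(group_field, count_field, size=10000,
                           missing="others", precision_threshold=3000):
    """Build a terms aggregation with a nested cardinality aggregation."""
    return {
        "size": 0,  # assumption: only aggregations are wanted, no hits
        "aggs": {
            "0": {
                "terms": {"field": group_field, "size": size, "missing": missing},
                "aggs": {
                    "0": {
                        "cardinality": {
                            "field": count_field,
                            "precision_threshold": precision_threshold,
                        }
                    }
                },
            }
        },
    }


body = terms_with_cardinality("author_name", "id_in_repo")
print(json.dumps(body, indent=2))
```

The `missing` value explains why documents without an author name would show up under `others` instead of being dropped.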

In [21]:
print(num_open_prs)

7


In [22]:
open_prs_by_authors = response['0']['buckets']
print(buckets_to_df(open_prs_by_authors))

   0  doc_count                         key
0  2          2               Keanu Nichols
1  1          1               Gustavo Silva
2  1          1  Jesus M. Gonzalez-Barahona
3  1          1                 Jose Miguel
4  1          1      Miguel Ángel Fernández
5  1          1                     valerio
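
The columns of this table follow from the shape of the aggregation buckets: each bucket carries a `key`, a `doc_count` and one entry per nested metric. The sketch below is not the implementation of `buckets_to_df`; `flatten_buckets` is a hypothetical helper, and the two buckets are written out by hand to mirror the first rows above.

```python
def flatten_buckets(buckets):
    """Turn aggregation buckets into flat rows, unwrapping single-value metrics."""
    rows = []
    for bucket in buckets:
        row = {}
        for name, value in bucket.items():
            if isinstance(value, dict) and "value" in value:
                # Single-value metric such as cardinality or sum.
                row[name] = value["value"]
            else:
                # key, doc_count, or anything that is not a single-value metric.
                row[name] = value
        rows.append(row)
    return rows


# Hand-written buckets mirroring the first two rows of the table above.
buckets = [
    {"key": "Keanu Nichols", "doc_count": 2, "0": {"value": 2}},
    {"key": "Gustavo Silva", "doc_count": 1, "0": {"value": 1}},
]
rows = flatten_buckets(buckets)
print(rows)
```

Rows of this form can be passed directly to `pd.DataFrame(rows)` to get a table like the one above.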


### Pull requests closed

In [23]:
closed_prs = PullRequests(github_index)
closed_prs.is_closed()
# get the single valued aggregation before putting it again as a child agg
num_closed_prs = closed_prs.get_cardinality("id_in_repo").get_aggs() 
closed_prs.get_cardinality("id_in_repo").by_authors("author_name")
response = closed_prs.fetch_aggregation_results()['aggregations']

In [24]:
print(num_closed_prs)

247


In [25]:
closed_prs_by_authors = response['0']['buckets']
print(buckets_to_df(closed_prs_by_authors))

      0  doc_count                         key
0   134        134                     valerio
1    33         33         Alvaro del Castillo
2    22         22             Santiago Dueñas
3    18         18  Jesus M. Gonzalez-Barahona
4    10         10              Alberto Martín
5     4          4                 Jose Miguel
6     4          4                   Quan Zhou
7     3          3        David Pose Fernández
8     2          2                 David Esler
9     2          2              Israel Herraiz
10    2          2               Keanu Nichols
11    2          2                    camillem
12    1          1           Anvesh Chaturvedi
13    1          1                 Assad (OW2)
14    1          1               Gustavo Silva
15    1          1                      Jeremy
16    1          1             Luis Cañas-Díaz
17    1          1              Manrique Lopez
18    1          1      Miguel Ángel Fernández
19    1          1           Nicolas Lamirault
20    1      

## Community Growth
Goal: Identify the size of the project community and whether it's growing, shrinking, or staying the same.

Name | Question | Implemented | Issue | PR
--- | --- | --- | --- | --- |
[Contributors](https://github.com/chaoss/metrics/tree/master/activity-metrics/contributors.md) | What is the number of contributors? | Yes | None | None
[New Contributors](https://github.com/chaoss/metrics/tree/master/activity-metrics/new-contributors.md) | What is the number of new contributors? | Yes | None | None
[Contributing Organizations](https://github.com/chaoss/metrics/tree/master/activity-metrics/contributing-organizations.md) | What is the number of contributing organizations? | Yes | None | None
[New Contributing Organizations](https://github.com/chaoss/metrics/tree/master/activity-metrics/new-contributing-organizations.md) | What is the number of new contributing organizations?
[Sub-Projects](https://github.com/chaoss/metrics/tree/master/activity-metrics/sub-projects.md) | What is the number of sub-projects?

### Number of contributors

In [26]:
contributors = EQCC(git_index)
contributors.get_sum("lines_changed").by_authors("author_name")
contributors.get_sum("lines_added").by_authors("author_name")
contributors.get_sum("lines_removed").by_authors("author_name")
contributors.get_average("files").by_authors("author_name")
contributors.get_cardinality("author_uuid")

<manuscripts2.new_functions.EQCC at 0x113b96c18>

In [27]:
# maybe a pie chart showing the different users and the magnitude of their contributions in terms of the total number of lines changed/removed/added?

buckets_to_df(contributors.fetch_aggregation_results()['aggregations']['0']['buckets'])

Unnamed: 0,0,1,2,3,doc_count,key
0,61910.0,45588.0,16322.0,2.12053,755,Santiago Dueñas
1,71426.0,44913.0,26513.0,2.254237,295,valerio cosentino
2,15723.0,15334.0,389.0,2.528302,53,Alvaro del Castillo
3,22862.0,21781.0,1081.0,1.921569,51,Alberto Martín
4,1118.0,1048.0,70.0,2.222222,18,Jesus M. Gonzalez-Barahona
5,3654.0,3593.0,61.0,5.6,5,quan
6,190.0,46.0,144.0,11.333333,3,Miguel Ángel Fernández
7,10.0,4.0,6.0,2.0,2,David Pose Fernández
8,4.0,2.0,2.0,1.0,2,camillem
9,162.0,123.0,39.0,7.0,2,valerio


### New contributors

For new contributors, we have to get the names and counts of the people who made commits to the project. [This](https://grimoirelab.gitbooks.io/tutorial/python/pandas-for-grimoirelab-indexes.html) GrimoireLab tutorial gets the dates on which the authors made their first commits. From those dates we can derive the month of each author's first commit, and the authors whose first commit falls in a given month are the new authors for that month. We can do the same per year. (We could also group the authors by week, but that is more complex to calculate and adds little.)

In [28]:
# new contributors by month
new_contributors = EQCC(git_index)
new_contributors.get_min("author_date").by_authors("author_name")
response = new_contributors.fetch_aggregation_results()
buckets = response['aggregations']['0']['buckets']
print(buckets)

[{'key': 'Santiago Dueñas', 'doc_count': 755, '0': {'value': 1439921307000.0, 'value_as_string': '2015-08-18T18:08:27.000Z'}}, {'key': 'valerio cosentino', 'doc_count': 295, '0': {'value': 1504795590000.0, 'value_as_string': '2017-09-07T14:46:30.000Z'}}, {'key': 'Alvaro del Castillo', 'doc_count': 53, '0': {'value': 1449254774000.0, 'value_as_string': '2015-12-04T18:46:14.000Z'}}, {'key': 'Alberto Martín', 'doc_count': 51, '0': {'value': 1455033405000.0, 'value_as_string': '2016-02-09T15:56:45.000Z'}}, {'key': 'Jesus M. Gonzalez-Barahona', 'doc_count': 18, '0': {'value': 1451589385000.0, 'value_as_string': '2015-12-31T19:16:25.000Z'}}, {'key': 'quan', 'doc_count': 5, '0': {'value': 1459512989000.0, 'value_as_string': '2016-04-01T12:16:29.000Z'}}, {'key': 'Miguel Ángel Fernández', 'doc_count': 3, '0': {'value': 1518440171000.0, 'value_as_string': '2018-02-12T12:56:11.000Z'}}, {'key': 'David Pose Fernández', 'doc_count': 2, '0': {'value': 1509697434000.0, 'value_as_string': '2017-11-03T0
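
Grouping these buckets by the month of the first commit gives the new contributors per month. The sketch below uses the seven buckets that are printed in full above, reduced to the fields it needs; the variable and function names are illustrative only.

```python
from collections import defaultdict
from datetime import datetime

# First-commit dates copied from the complete buckets in the output above.
buckets = [
    {"key": "Santiago Dueñas", "0": {"value_as_string": "2015-08-18T18:08:27.000Z"}},
    {"key": "valerio cosentino", "0": {"value_as_string": "2017-09-07T14:46:30.000Z"}},
    {"key": "Alvaro del Castillo", "0": {"value_as_string": "2015-12-04T18:46:14.000Z"}},
    {"key": "Alberto Martín", "0": {"value_as_string": "2016-02-09T15:56:45.000Z"}},
    {"key": "Jesus M. Gonzalez-Barahona", "0": {"value_as_string": "2015-12-31T19:16:25.000Z"}},
    {"key": "quan", "0": {"value_as_string": "2016-04-01T12:16:29.000Z"}},
    {"key": "Miguel Ángel Fernández", "0": {"value_as_string": "2018-02-12T12:56:11.000Z"}},
]


def first_commit_month(bucket):
    """Return the month of the author's first commit as 'YYYY-MM'."""
    stamp = datetime.strptime(bucket["0"]["value_as_string"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return stamp.strftime("%Y-%m")


new_by_month = defaultdict(list)
for bucket in buckets:
    new_by_month[first_commit_month(bucket)].append(bucket["key"])

new_counts = {month: len(authors) for month, authors in sorted(new_by_month.items())}
print(new_counts)
```

Switching the format string in `first_commit_month` to `"%Y"` gives the same breakdown per year.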

### Contributing Organizations

In [29]:
github_index = Index("aima_github")
contributing_orgs = EQCC(github_index)
contributing_orgs.get_terms("user_org")
response = contributing_orgs.fetch_aggregation_results()
buckets = response['aggregations']['0']['buckets']
organizations = pd.Series([item['key'] for item in buckets])
print(organizations)


0                                                IIIT-H
1                                          Google, Inc.
2                         BITS Pilani, Hyderabad Campus
3                               University of São Paulo
4                                             @OSDLabs 
5                                         IIT Kharagpur
6                                            @Atlassian
7                                             @facebook
8                  University of Massachusetts, Amherst
9                                             IIT Mandi
10                                           freelancer
11                                Lewis & Clark College
12                                            @Aloompa 
13                                         @mesosphere 
14                 @sprinklr,@ideadevice ,NIC,iLLGaming
15                                          Databricks 
16                                           ETH Zurich
17             Student at California Baptist Uni

In [30]:
organizations = EQCC(github_index)
organizations.get_terms("author_name").by_organizations("user_orgs")
response = organizations.fetch_aggregation_results()
buckets = response['aggregations']['0']['buckets']
pprint(buckets)


[{'0': {'buckets': [{'doc_count': 184, 'key': 'Anthony Marakis'},
                    {'doc_count': 82, 'key': 'C.G.Vedant'},
                    {'doc_count': 46, 'key': 'Aman Deep Singh'},
                    {'doc_count': 38, 'key': 'Google Code Exporter'},
                    {'doc_count': 36, 'key': 'Tarun Kumar Vangani'},
                    {'doc_count': 32, 'key': 'Surya Teja Cheedella'},
                    {'doc_count': 29, 'key': 'lucasmoura'},
                    {'doc_count': 21, 'key': 'Kaivalya Rawal'},
                    {'doc_count': 20, 'key': 'Apurv Bajaj'},
                    {'doc_count': 20, 'key': 'Peter Norvig'},
                    {'doc_count': 18, 'key': 'chiragvartak'},
                    {'doc_count': 17, 'key': 'Nouman Ahmed'},
                    {'doc_count': 17, 'key': 'Vinay Varma'},
                    {'doc_count': 16, 'key': 'Angira Sharma'},
                    {'doc_count': 16, 'key': 'Rahul Goswami'},
                    {'doc_count': 15, 'key