# Manuscripts: Re-visited

Manuscripts currently provides mostly aggregations of the data, such as averages and cardinalities. It isn't flexible enough to let us play with the data, for example by sorting it by different filters and values. Here, we experiment with what can be done with the metrics. None of the previously written code is used, so that we can look at different approaches and, in effect, redesign the current code.

We'll still be looking at the [GMD metrics](https://github.com/chaoss/metrics/blob/master/2_Growth-Maturity-Decline.md).

In [12]:
# import the necessary libraries

# analysis modules
import pandas as pd

# query and connection modules
from elasticsearch import Elasticsearch

from elasticsearch_dsl import A, Q, Search
from elasticsearch_dsl.query import Match, MultiMatch

# utility and support modules
import new_functions as nf
from pprint import pprint
from datetime import datetime, timezone
from dateutil import parser, relativedelta

In [13]:
# declare the necessary variables
es = Elasticsearch("http://localhost:9200/")

github_index = "perceval_github"
git_index = "perceval_git"

start_date = datetime(2014, 8, 1)
start_date = start_date.isoformat() # "2014-08-01T00:00:00"
end_date = datetime(2018, 5, 22)
end_date = end_date.isoformat()

max_size = 10000 # temporary hack to get all the values in the query

### Types of filters we will be looking at:

Let's talk about the kinds of filters we want when looking at the metrics.

Can we look at the metrics by segregating them according to:
- Date?
 - days
 - weeks
 - months
 - years


- Organizations?
 - If people from multiple organizations are part of the project, we might want to see how they work together and which organization has the most influence.
 

- Authors?
 - What if we want all the issues by the authors that created them?

### Structure of aggregations

Aggregations are named with numeric ids, structured as follows.

Parent-child aggregations will follow the path:
- 0 (parent); we start from zero because that makes it easy to loop through multiple aggregations
  - 0.1 (child)
    - 0.01 (child's child)

Sibling-sibling aggregations will follow the path:
- 0(sibling)
- 1(sibling)
- 2(sibling)
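
As a rough sketch of this naming scheme, written against a plain Elasticsearch-style request body rather than `elasticsearch_dsl`: sibling aggregations get the ids "0", "1", "2", and each one may carry a single child named "0.1". The `build_aggs` helper and its `specs` format are hypothetical and only illustrate why numeric ids make the results easy to loop over.

```python
# Hypothetical helper: names sibling aggregations "0", "1", "2", ...
# and gives each optional child metric the id "0.1".
def build_aggs(specs, child_id="0.1"):
    aggs = {}
    for parent_id, (agg_type, params, child) in enumerate(specs):
        body = {agg_type: dict(params)}
        if child is not None:
            child_type, child_params = child
            body["aggs"] = {child_id: {child_type: dict(child_params)}}
        aggs[str(parent_id)] = body
    return aggs


count_issues = ("cardinality", {"field": "id_in_repo"})
specs = [
    ("cardinality", {"field": "id_in_repo"}, None),        # 0: total
    ("terms", {"field": "author_name"}, count_issues),     # 1: by author
    ("date_histogram",
     {"field": "created_at", "interval": "month"},
     count_issues),                                        # 2: by month
]
aggs = build_aggs(specs)

# Because the ids are consecutive integers, a plain range() visits them all.
for i in range(len(specs)):
    print(i, list(aggs[str(i)].keys()))
```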

## Issue Resolution
Goal: Identify how effective the community is at addressing issues identified by community participants.

Name | Question | Implemented | Issue | PR | Visualisation 
--- | --- | --- | --- | --- | --- |
[Open Issues](https://github.com/chaoss/metrics/tree/master/activity-metrics/open-issues.md) | What is the number of open issues? | Yes | None | None | No
[Closed Issues](https://github.com/chaoss/metrics/tree/master/activity-metrics/closed-issues.md) | What is the number of closed issues? | Yes | None | None | No
[Issue Resolution Efficiency](https://github.com/chaoss/metrics/tree/master/activity-metrics/issue-resolution-efficiency.md) | What is the number of closed issues/number of abandoned issues? | No | [wg-gmd#5](https://github.com/chaoss/wg-gmd/issues/5) | None | No
[Open Issue Age](https://github.com/chaoss/metrics/tree/master/activity-metrics/open-issue-age.md) | What is the age of open issues? | Yes | None | None | No
[First Response to Issue Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/first-response-to-issue-duration.md) | What is the duration of time for a first response to an issue? | No | [wg-gmd#8](https://github.com/chaoss/wg-gmd/issues/8) | None | No
[Closed Issue Resolution Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/closed-issue-resolution-duration.md) | What is the duration of time for issues to be resolved? | Yes | [wg-gmd#7](https://github.com/chaoss/wg-gmd/issues/7) | None | No

<a id="open_issues"></a>
### open issues

In [14]:
# We are looking at all the issues that are open and were created between 2014-08-01 and 2018-05-22

parent_id = 0
child_id = 0.1
s = Search(using=es, index=github_index)
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "open"})
q = q1 & q2
s = s.query(q)

# Number of open issues:
agg = A("cardinality", field="id_in_repo")
s.aggs.bucket(parent_id, agg)
parent_id += 1

# Open issues by the people that created them:
agg = A("terms", field="author_name", missing="others", size=max_size)
agg.metric(child_id, "cardinality", field="id_in_repo")
s.aggs.bucket(parent_id, agg)
parent_id += 1

# Open issues by the organizations that created them:
agg = A("terms", field="user_org", missing="others", size=max_size)
agg.metric(child_id, "cardinality", field="id_in_repo")
s.aggs.bucket(parent_id, agg)
parent_id += 1

# Open issues by the months in which they were created:
period = "month" # should be one of month, week or year
agg = A("date_histogram", field="created_at", interval=period)
agg.metric(child_id, "cardinality", field="id_in_repo")
s.aggs.bucket(parent_id, agg)
parent_id += 1

# apply the range filter:
s = s.filter("range", **{"created_at":{"gte":start_date, "lte":end_date}})
s = s.extra(size=0)

response = s.execute()
aggs = response.aggregations.to_dict()
for i in range(parent_id):
    try:
        df = nf.buckets_to_df(aggs[str(i)]['buckets'])
        print(df)
        print("Total:", df['value'].sum())
    except KeyError:  # single-value aggregations have 'value' but no 'buckets'
        print(aggs[str(i)]['value'])
    print()

23

                            doc_count  value
key                                         
Jesus M. Gonzalez-Barahona          4      4
Alberto Martín                      3      3
others                              3      3
Kapil Thangavelu                    2      2
Alvaro del Castillo                 1      1
Armijn Hemel                        1      1
Brylie Christopher Oxley            1      1
Daniel Izquierdo Cortazar           1      1
Germán Poo-Caamaño                  1      1
Lluis Josep Martinez                1      1
Luis Cañas-Díaz                     1      1
Maëlick                             1      1
Michael Downey                      1      1
Sachin S. Kamath                    1      1
Samuel Ytterbrink                   1      1
Total: 23

                     doc_count  value
key                                  
others                      12     12
@Bitergia                    3      3
Bitergia                     3      3
@DIAL-Community              1

Here we get the number of open issues in total, and broken down by author, by organization and by the month in which they were created.
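
`nf.buckets_to_df` comes from the local `new_functions` module, which is not shown here. As a guess at what such a helper might do, the sketch below flattens a list of buckets into rows, lifting any nested single-value metric (such as the `0.1` child) into a `value` field. The function name `flatten_buckets` is an illustrative assumption, and the two sample buckets reuse counts from the author table above.

```python
def flatten_buckets(buckets):
    """Turn Elasticsearch-style buckets into flat rows (hypothetical helper)."""
    rows = []
    for bucket in buckets:
        row = {
            # date_histogram buckets also carry a readable key_as_string
            "key": bucket.get("key_as_string", bucket["key"]),
            "doc_count": bucket["doc_count"],
        }
        for content in bucket.values():
            # Single-value metrics come back as {"value": ...}.
            if isinstance(content, dict) and "value" in content:
                row["value"] = content["value"]
        rows.append(row)
    return rows


sample_buckets = [
    {"key": "Jesus M. Gonzalez-Barahona", "doc_count": 4, "0.1": {"value": 4}},
    {"key": "Alberto Martín", "doc_count": 3, "0.1": {"value": 3}},
]
rows = flatten_buckets(sample_buckets)
total = sum(row["value"] for row in rows)
print(rows)
print("Total:", total)
```

Passing `rows` to `pd.DataFrame.from_records(rows, index="key")` would give a frame shaped like the ones printed above.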

<a id="closed_issues"></a>
### closed issues

In [15]:
parent_id = 0
child_id = 0.1
s = Search(using=es, index=github_index)
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "closed"})
q = q1 & q2
s = s.query(q)

# Number of closed issues:
agg = A("cardinality", field="id_in_repo")
s.aggs.bucket(parent_id, agg)
parent_id += 1

# Closed issues by the people that created them:
agg = A("terms", field="author_name", missing="others", size=max_size)
agg.metric(child_id, "cardinality", field="id_in_repo")
s.aggs.bucket(parent_id, agg)
parent_id += 1

# Closed issues by the organizations that created them:
agg = A("terms", field="user_org", missing="others", size=max_size)
agg.metric(child_id, "cardinality", field="id_in_repo")
s.aggs.bucket(parent_id, agg)
parent_id += 1

# Closed issues by the months in which they were created:
period = "month" # should be one of month, week or year
agg = A("date_histogram", field="created_at", interval=period)
agg.metric(child_id, "cardinality", field="id_in_repo")
s.aggs.bucket(parent_id, agg)
parent_id += 1

# apply the range filter:
s = s.filter("range", **{"created_at":{"gte":start_date, "lte":end_date}})
s = s.extra(size=0)

response = s.execute()
aggs = response.aggregations.to_dict()
for i in range(parent_id):
    try:
        df = nf.buckets_to_df(aggs[str(i)]['buckets'])
        print(df)
        print("Total:", df['value'].sum())
    except KeyError:  # single-value aggregations have 'value' but no 'buckets'
        print(aggs[str(i)]['value'])
    print()

113

                            doc_count  value
key                                         
Alvaro del Castillo                27     27
Alberto Martín                     20     20
others                             10     10
Jesus M. Gonzalez-Barahona          9      9
Manrique Lopez                      9      9
Santiago Dueñas                     9      9
Daniel Izquierdo Cortazar           3      3
David Pose Fernández                3      3
Jose Miguel                         3      3
Quan Zhou                           3      3
Brylie Christopher Oxley            2      2
Robin Muilwijk                      2      2
Saad Bin Shahid                     2      2
valerio                             2      2
Andre Klapper                       1      1
Bogdan Vasilescu                    1      1
Bowen Chen                          1      1
Cristian Baldi                      1      1
Heather Booker                      1      1
Lluis Josep Martinez                1      1
Phill

Here we get the number of closed issues in total, and broken down by author, by organization and by the month in which they were created.
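
The open and closed cells differ only in the `state` value. A minimal sketch of factoring that out is shown below. It builds a plain request body that should be roughly equivalent to what the `Search` object above describes; `issue_query_body` is a hypothetical name, and only the first (total count) aggregation is included to keep it short.

```python
def issue_query_body(state, start, end):
    """Request body for issues in `state` created between start and end."""
    return {
        "size": 0,  # aggregations only, no hits
        "query": {
            "bool": {
                "must": [
                    {"match": {"item_type": "issue"}},
                    {"match": {"state": state}},
                ],
                "filter": [
                    {"range": {"created_at": {"gte": start, "lte": end}}},
                ],
            }
        },
        "aggs": {
            "0": {"cardinality": {"field": "id_in_repo"}},
        },
    }


start, end = "2014-08-01T00:00:00", "2018-05-22T00:00:00"
open_body = issue_query_body("open", start, end)
closed_body = issue_query_body("closed", start, end)
print(open_body["query"]["bool"]["must"])
```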

### Issue resolution efficiency

<a id="open_issue_age"></a>
### open issue age

As per the [discussion here](https://github.com/chaoss/metrics/blob/master/activity-metrics/open-issue-age.md), we'll calculate the percentiles, mean and variance, and create some visualisations for this metric.

In [16]:
s = Search(using=es, index=github_index)
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "open"})
q = q1 & q2
s = s.query(q)
agg1 = A("percentiles", field="time_open_days")
agg2 = A("extended_stats", field="time_open_days")
s.aggs.bucket("open_issue_age_percentile", agg1)
s.aggs.bucket("open_issue_age_stats", agg2)
s = s.extra(size=0)
response = s.execute()
values = response.aggregations.open_issue_age_percentile.values

#### percentiles

In [17]:
percentiles = values.to_dict()
print(percentiles)
print()
open_issue_age_percentile = pd.DataFrame.from_dict(percentiles, orient='index', columns=['value'])
print(open_issue_age_percentile)

{'1.0': 18.565400066375734, '5.0': 54.01599884033203, '25.0': 170.45999908447266, '50.0': 507.9200134277344, '75.0': 664.3300170898438, '95.0': 795.547021484375, '99.0': 810.472392578125}

           value
1.0    18.565400
5.0    54.015999
25.0  170.459999
50.0  507.920013
75.0  664.330017
95.0  795.547021
99.0  810.472393


#### stats

In [18]:
extended_stats = response.aggregations.open_issue_age_stats.to_dict()
pprint(extended_stats)

{'avg': 443.7830442760302,
 'count': 23,
 'max': 814.6699829101562,
 'min': 9.640000343322754,
 'std_deviation': 275.9541022453834,
 'std_deviation_bounds': {'lower': -108.12516021473664,
                          'upper': 995.691248766797},
 'sum': 10207.010018348694,
 'sum_of_squares': 6281163.309457999,
 'variance': 76150.66654605552}
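
The derived fields in this output can be reproduced from `avg`, `count` and `sum_of_squares`. The sketch below assumes a population variance and bounds of two standard deviations around the mean, which is consistent with the numbers printed above.

```python
import math

# Values copied from the extended_stats output above.
stats = {
    "avg": 443.7830442760302,
    "count": 23,
    "sum_of_squares": 6281163.309457999,
}

# Population variance: mean of the squares minus the square of the mean.
variance = stats["sum_of_squares"] / stats["count"] - stats["avg"] ** 2
std_deviation = math.sqrt(variance)

# The bounds sit two standard deviations either side of the mean.
lower = stats["avg"] - 2 * std_deviation
upper = stats["avg"] + 2 * std_deviation

print(round(variance, 2), round(std_deviation, 2))
print(round(lower, 2), round(upper, 2))
```

The negative lower bound only reflects how widely the ages are spread; an issue age itself cannot be negative.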


#### visualizations

In [19]:
# visualisations
s = Search(using=es, index=github_index)
q0 = Q("match_all")
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "open"})
q = q0 & q1 & q2
s = s.query(q)
agg = A("cardinality", field="id_in_repo")
s.aggs.bucket("num_open_issues", agg)
s = s.extra(_source=['time_open_days', 'id_in_repo'])
s = s.extra(size=max_size)
response = s.execute()

In [20]:
open_issue_age = pd.DataFrame.from_records([hit['_source'] for hit in response.hits.hits], index="id_in_repo")
open_issue_age = open_issue_age.sort_values(by="time_open_days", ascending=True)
print(open_issue_age)

            time_open_days
id_in_repo                
384                   9.64
367                  50.21
331                  88.27
324                  95.15
319                  99.84
234                 170.13
229                 170.79
217                 173.08
140                 373.01
139                 374.73
119                 451.62
104                 507.92
91                  557.90
88                  569.10
74                  586.77
59                  612.92
58                  617.82
42                  710.84
28                  788.04
20                  793.81
19                  795.16
18                  795.59
16                  814.67


In [21]:
# Todo: create visualizations

### First response to issue

### Closed issue resolution duration (Time to resolution of closed issue)

#### percentiles

In [22]:
s = Search(using=es, index=github_index)
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "closed"})
q = q1 & q2
s = s.query(q)
agg1 = A("percentiles", field="time_to_close_days")
agg2 = A("extended_stats", field="time_to_close_days")
s.aggs.bucket("closed_issues_percentile", agg1)
s.aggs.bucket("closed_issues_stats", agg2)
s = s.extra(size=0)
response = s.execute()
values = response.aggregations.closed_issues_percentile.values
percentiles = values.to_dict()
closed_issue_percentile = pd.DataFrame.from_dict(percentiles, orient='index', columns=['value'])
print(closed_issue_percentile)

           value
1.0     0.010000
5.0     0.066000
25.0    0.710000
50.0    3.650000
75.0   12.620000
95.0  188.107996
99.0  523.223621


This shows the time taken to resolve an issue. We can see that more than 50% of the issues were resolved in under 4 days and more than 75% of the issues were resolved in under 13 days, or approximately 2 weeks!
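
The same reading can be done programmatically. As a small sketch, the hypothetical helper `share_closed_within` below returns the largest reported percentile whose value does not exceed a given number of days, which is a lower bound on the share of issues closed within that time. The percentile values are copied from the output above.

```python
# Percentiles of time_to_close_days, copied from the output above.
percentiles = {
    1.0: 0.01, 5.0: 0.066, 25.0: 0.71, 50.0: 3.65,
    75.0: 12.62, 95.0: 188.107996, 99.0: 523.223621,
}


def share_closed_within(percentiles, days):
    """Largest reported percentile whose value is at most `days`.

    Returns 0.0 if even the lowest reported percentile exceeds `days`.
    """
    reached = [p for p, value in percentiles.items() if value <= days]
    return max(reached, default=0.0)


print(share_closed_within(percentiles, 4))    # 50.0
print(share_closed_within(percentiles, 13))   # 75.0
```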

#### stats

In [23]:
extended_stats = response.aggregations.closed_issues_stats.to_dict()
pprint(extended_stats)

{'avg': 31.253982658819417,
 'count': 113,
 'max': 582.3300170898438,
 'min': 0.0,
 'std_deviation': 93.39081320398705,
 'std_deviation_bounds': {'lower': -155.5276437491547,
                          'upper': 218.0356090667935},
 'sum': 3531.7000404465944,
 'sum_of_squares': 1095948.062792196,
 'variance': 8721.843990902002}


#### Moving average
For the time to issue resolution, we'll also look at the monthly moving average.

In [24]:
'''
Example query to get moving average 
{
    "size": 0,
    "aggs": {
        "my_date_histo":{                
            "date_histogram":{
                "field":"created_at",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "time_to_close_days" } 
                },
                "the_movavg":{
                    "moving_avg":{ "buckets_path": "the_sum" } 
                }
            }
        }
    }
}
'''
s = Search(using=es, index=github_index)
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "closed"})
q = q1 & q2
s = s.query(q)
a = A("date_histogram", field="created_at", interval="1M")
a.metric("the_sum", "sum", field="time_to_close_days")
a.metric("monthly_moving_average", "moving_avg", buckets_path="the_sum")
s.aggs.bucket("the_histogram", a)
s = s.extra(size=0)
response = s.execute()
buckets = response.aggregations.the_histogram.to_dict()['buckets']

In [25]:
df = nf.buckets_to_df(buckets)
print(df)

            date_in_seconds       value
key                                    
2016-01-01    1451606400000    0.570000
2016-02-01    1454284800000    0.570000
2016-03-01    1456790400000    8.435000
2016-04-01    1459468800000    8.510000
2016-05-01    1462060800000    0.000000
2016-06-01    1464739200000    6.572500
2016-07-01    1467331200000   44.860000
2016-08-01    1470009600000   45.982000
2016-09-01    1472688000000  159.188003
2016-10-01    1475280000000  160.906003
2016-11-01    1477958400000  300.138008
2016-12-01    1480550400000  329.372011
2017-01-01    1483228800000  332.380011
2017-02-01    1485907200000    0.000000
2017-03-01    1488326400000  311.332006
2017-04-01    1491004800000  313.678006
2017-05-01    1493596800000  179.372002
2017-06-01    1496275200000  142.147999
2017-07-01    1498867200000  137.923999
2017-08-01    1501545600000    0.000000
2017-09-01    1504224000000   43.258000
2017-10-01    1506816000000  133.978000
2017-11-01    1509494400000  145.562000


#### visualizations

In [26]:
# visualisations
s = Search(using=es, index=github_index)
q0 = Q("match_all")
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "closed"})
q = q0 & q1 & q2
s = s.query(q)
agg = A("cardinality", field="id_in_repo")
s.aggs.bucket("num_closed_issues", agg)
s = s.extra(_source=['time_to_close_days', 'id_in_repo'])
s = s.extra(size=max_size)
response = s.execute()
closed_issue_age = pd.DataFrame.from_records([hit['_source'] for hit in response.hits.hits], index="id_in_repo")
closed_issue_age = closed_issue_age.sort_values(by="time_to_close_days", ascending=True)
print(closed_issue_age)

            time_to_close_days
id_in_repo                    
265                       0.00
204                       0.01
26                        0.01
354                       0.02
133                       0.03
142                       0.06
113                       0.07
17                        0.07
80                        0.07
89                        0.08
72                        0.09
210                       0.09
144                       0.10
92                        0.14
116                       0.19
257                       0.20
103                       0.20
123                       0.21
90                        0.22
63                        0.24
122                       0.24
216                       0.27
187                       0.53
327                       0.53
107                       0.54
8                         0.57
95                        0.59
221                       0.69
110                       0.71
94                        0.72
...     

## Code Development
Goal: Identify how effective the community is at merging new code into the codebase.

Name | Question | Implemented | Issue | PR
--- | --- | --- | --- | --- |
[Code Commits](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-commits.md) | What is the number of code commits? | Yes | None | None
[Lines of Code Changed](https://github.com/chaoss/metrics/tree/master/activity-metrics/lines-of-code-changed.md) | What is the number of lines of code changed? | Yes | None | None
[Code Reviews](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-reviews.md) | What is the number of code reviews?
[Code Merge Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-merge-duration.md) | What is the duration of time between code merge request and code commit?
[Code Review Efficiency](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-review-efficiency.md) | What is the number of merged code changes/number of abandoned code change requests?
[Maintainer Response to Merge Request Duration](https://github.com/chaoss/metrics/tree/master/activity-metrics/maintainer-response-to-merge-request-duration.md) | What is the duration of time for a maintainer to make a first response to a code merge request?
[Code Review Iteration](https://github.com/chaoss/metrics/tree/master/activity-metrics/code-review-iteration.md) | What is the number of iterations that occur before a merge request is accepted or declined?
[Forks](activity-metrics/forks.md) | Forks are a concept in distributed version control systems like GitHub. It is a proxy for the approximate number of developers who have taken a shot at building and deploying the codebase *for development*.
[Pull Requests Open](activity-metrics/pull-requests-open.md) | Number of open pull requests.
[Pull Requests Closed](activity-metrics/pull-requests-made-closed.md) | Number of closed pull requests.
[Pull Request Comment Duration](activity-metrics/pull-requests-comment-duration.md) | The difference between the timestamp of the pull request creation date and the most recent comment on the pull request.
[Pull Request Comment Diversity](activity-metrics/pull-requests-comment-diversity.md) | Number of people discussing each pull request.
[Pull Request Comments](activity-metrics/pull-request-comments.md) | Number of comments on each pull request. 


### code commits

In [27]:
s = Search(using=es, index=git_index)
#q = Q("match", **{"files": 0})
#s = s.query(~q)
a = A("cardinality", field="hash", precision_threshold=2000)
s.aggs.bucket("total_commits", a)
s = s.extra(_source=["hash", "commit_date"])
s = s.extra(sort={"commit_date":"asc"})
response = s.execute()
print(response.aggregations.total_commits.value)

1186


When you go to the [Perceval GitHub repo](https://github.com/chaoss/grimoirelab-perceval), you'll see that there are actually 1182 commits. The difference may be because of some empty commit messages.

#### by months

In [28]:
s = Search(using=es, index=git_index)
a = A("date_histogram", field="commit_date", interval="month")
s.aggs.bucket("commits_by_months", a)
response = s.execute()
buckets = response.aggregations.commits_by_months.to_dict()['buckets']
commits_by_month = nf.buckets_to_df(buckets)
print(commits_by_month)

            date_in_seconds
key                        
2015-08-01    1438387200000
2015-09-01    1441065600000
2015-10-01    1443657600000
2015-11-01    1446336000000
2015-12-01    1448928000000
2016-01-01    1451606400000
2016-02-01    1454284800000
2016-03-01    1456790400000
2016-04-01    1459468800000
2016-05-01    1462060800000
2016-06-01    1464739200000
2016-07-01    1467331200000
2016-08-01    1470009600000
2016-09-01    1472688000000
2016-10-01    1475280000000
2016-11-01    1477958400000
2016-12-01    1480550400000
2017-01-01    1483228800000
2017-02-01    1485907200000
2017-03-01    1488326400000
2017-04-01    1491004800000
2017-05-01    1493596800000
2017-06-01    1496275200000
2017-07-01    1498867200000
2017-08-01    1501545600000
2017-09-01    1504224000000
2017-10-01    1506816000000
2017-11-01    1509494400000
2017-12-01    1512086400000
2018-01-01    1514764800000
2018-02-01    1517443200000
2018-03-01    1519862400000
2018-04-01    1522540800000
2018-05-01    152513
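
Although the column is labelled `date_in_seconds`, the values have the magnitude of epoch milliseconds, which is how `date_histogram` bucket keys are expressed. A small sketch of converting such a key back to a calendar date (the helper name is made up for illustration):

```python
from datetime import datetime, timezone


def bucket_key_to_date(key_ms):
    """Convert a date_histogram bucket key (epoch milliseconds) to a UTC date."""
    return datetime.fromtimestamp(key_ms / 1000, tz=timezone.utc).date()


print(bucket_key_to_date(1438387200000))  # 2015-08-01
print(bucket_key_to_date(1451606400000))  # 2016-01-01
```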

### Lines of code changed

In [29]:
s = Search(using=es, index=git_index)
a1 = A("sum", field="lines_changed")
a2 = A("sum", field="lines_added")
a3 = A("sum", field="lines_removed")
s.aggs.bucket("total_lines_changed", a1)
s.aggs.bucket("total_lines_added", a2)
s.aggs.bucket("total_lines_removed", a3)
s = s.extra(size=0)
response = s.execute()

print("Total lines changed: ", response.aggregations.total_lines_changed.value)
print("Total lines added: ", response.aggregations.total_lines_added.value)
print("Total lines removed: ", response.aggregations.total_lines_removed.value)

Total lines changed:  354358.0
Total lines added:  265068.0
Total lines removed:  89290.0


## Community Growth
Goal: Identify the size of the project community and whether it's growing, shrinking, or staying the same.

Name | Question | Implemented | Issue | PR
--- | --- | --- | --- | --- |
[Contributors](https://github.com/chaoss/metrics/tree/master/activity-metrics/contributors.md) | What is the number of contributors? | Yes | None | None
[New Contributors](https://github.com/chaoss/metrics/tree/master/activity-metrics/new-contributors.md) | What is the number of new contributors? | Yes | None | None
[Contributing Organizations](https://github.com/chaoss/metrics/tree/master/activity-metrics/contributing-organizations.md) | What is the number of contributing organizations? | Yes | None | None
[New Contributing Organizations](https://github.com/chaoss/metrics/tree/master/activity-metrics/new-contributing-organizations.md) | What is the number of new contributing organizations?
[Sub-Projects](https://github.com/chaoss/metrics/tree/master/activity-metrics/sub-projects.md) | What is the number of sub-projects?

### Number of contributors

In [30]:
s = Search(using=es, index=git_index)
a = A("terms", field="author_name", size=max_size)
a.metric("lines_changed", "sum", field="lines_changed")
a.metric("lines_added", "sum", field="lines_added")
a.metric("lines_removed", "sum", field="lines_removed")
a.metric("average_files_changed", "avg", field="files")
s.aggs.bucket("contributors_contributions", a)
b = A("cardinality", field="author_name")
s.aggs.bucket("total_contributors", b)
response = s.execute()
buckets = response.aggregations.contributors_contributions.to_dict()['buckets']
contributor_contributions = []
for item in buckets:
    bucket = {}
    for key, val in item.items():
        try:
            bucket[key] = val['value']
        except:
            bucket[key] = val
    contributor_contributions.append(bucket)
contributor_contributions = pd.DataFrame.from_records(contributor_contributions, index="key")
print(contributor_contributions)

                                average_files_changed  doc_count  lines_added  \
key                                                                             
Santiago Dueñas                              2.139224       1494      91122.0   
Valerio Cosentino                            2.223986        567      86338.0   
Alberto Martín                               1.921569        102      43562.0   
Alvaro del Castillo                          2.588235        102      30664.0   
Jesus M. Gonzalez-Barahona                   2.054054         37       2058.0   
valerio cosentino                            3.666667         12       2548.0   
quan                                         5.600000         10       7186.0   
Miguel Ángel Fernández                      11.333333          6         92.0   
David Pose Fernández                         2.000000          4          8.0   
camillem                                     1.000000          4          4.0   
valerio                     

In [31]:
print("Total contributors: ", response.aggregations.total_contributors.value)

Total contributors:  19


### New contributors

For new contributors, we need the names of the people who made commits to the project and the date of each person's first commit. [This](https://grimoirelab.gitbooks.io/tutorial/python/pandas-for-grimoirelab-indexes.html) GrimoireLab tutorial gets the dates on which the authors made their first commits. From those dates we can derive the month of each author's first commit, and the authors whose first commit falls in a given month are the new authors for that month. We can do the same by year. (We could also group the authors by week, but that would be more complex to calculate and adds little.)

In [32]:
# new contributors by month
s = Search(using=es, index=git_index)
s.aggs.bucket("by_authors", "terms", field="author_name", size=10000).metric("first_commit", "min", field="author_date")
s = s.sort("author_date")  # sort() returns a new Search, so keep the result
response = s.execute()
buckets = response.aggregations.by_authors.to_dict()['buckets']

authors = []
for bucket in buckets:
    temp = {}
    temp['first_commit_date'] = datetime.strptime(bucket['first_commit']['value_as_string'][:-5], "%Y-%m-%dT%H:%M:%S")
    temp['author'] = bucket['key']
    temp['year-month'] = str(temp['first_commit_date'].year) + "-" + str(temp['first_commit_date'].month)
    authors.append(temp)
authors = pd.DataFrame.from_records(authors, index="author")
print("Authors: ")
print(authors)
print("\n\n")
print("Number of new authors by month: ")
print(authors['year-month'].value_counts(sort=True))

Authors: 
                                 first_commit_date year-month
author                                                       
Santiago Dueñas                2015-08-18 18:08:27     2015-8
Valerio Cosentino              2017-09-14 12:14:04     2017-9
Alberto Martín                 2016-02-09 15:56:45     2016-2
Alvaro del Castillo            2015-12-04 18:46:14    2015-12
Jesus M. Gonzalez-Barahona     2015-12-31 19:16:25    2015-12
valerio cosentino              2017-09-07 14:46:30     2017-9
quan                           2016-04-01 12:16:29     2016-4
Miguel Ángel Fernández         2018-02-12 12:56:11     2018-2
David Pose Fernández           2017-11-03 08:23:54    2017-11
camillem                       2016-03-28 11:08:04     2016-3
valerio                        2017-10-10 16:27:29    2017-10
David Esler                    2017-10-17 22:46:36    2017-10
Israel Herraiz                 2018-01-09 15:40:57     2018-1
J. Manrique Lopez de la Fuente 2016-03-05 08:04:02     2016-
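
The grouping by year mentioned earlier works the same way as the grouping by month. The sketch below uses only the standard library and a handful of first-commit dates copied from the output above; it also zero-pads the month so that the labels sort chronologically, which is a small departure from the `year-month` strings built in the cell above.

```python
from collections import Counter
from datetime import datetime

# A few first-commit dates copied from the output above.
first_commits = {
    "Santiago Dueñas": "2015-08-18 18:08:27",
    "Alvaro del Castillo": "2015-12-04 18:46:14",
    "Jesus M. Gonzalez-Barahona": "2015-12-31 19:16:25",
    "Alberto Martín": "2016-02-09 15:56:45",
    "Valerio Cosentino": "2017-09-14 12:14:04",
}

parsed = {author: datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
          for author, stamp in first_commits.items()}

# Zero-padded "YYYY-MM" labels sort chronologically as plain strings.
new_by_month = Counter(d.strftime("%Y-%m") for d in parsed.values())
new_by_year = Counter(d.year for d in parsed.values())

print(sorted(new_by_month.items()))
print(sorted(new_by_year.items()))
```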

### Contributing Organizations

In [33]:
s = Search(using=es, index=github_index)
a = A("terms", field="user_org", size=max_size)
s.aggs.bucket("contributing_orgs", a)
s = s.extra(size=0)
response = s.execute()
buckets = response.aggregations.contributing_orgs.to_dict()['buckets']
organizations = pd.Series([item['key'] for item in buckets])
print(organizations)

0                       @Bitergia 
1                         Bitergia
2                         GNUmedia
3              @amrita-university 
4            BBVA Data & Analytics
5                   Geeky Engineer
6                 T-Systems Iberia
7                 @DIAL-Community 
8                              CMU
9                   IIIT Hyderabad
10                          Orange
11                            SUSE
12                         Samsung
13              University of Oulu
14    http://www.aeva.in/team.html
dtype: object


In [34]:
s = Search(using=es, index=github_index)
a = A("terms", field="user_org", missing="Others", size=max_size)
a.metric("users", "terms", field="author_name")
s.aggs.bucket("users_by_org", a)
response = s.execute()

In [35]:
response.aggregations.users_by_org.to_dict()['buckets']

[{'doc_count': 164,
  'key': '@Bitergia ',
  'users': {'buckets': [{'doc_count': 131, 'key': 'valerio'},
    {'doc_count': 33, 'key': 'Alberto Martín'}],
   'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 0}},
 {'doc_count': 122,
  'key': 'Others',
  'users': {'buckets': [{'doc_count': 32, 'key': 'Jesus M. Gonzalez-Barahona'},
    {'doc_count': 31, 'key': 'Santiago Dueñas'},
    {'doc_count': 8, 'key': 'Jose Miguel'},
    {'doc_count': 7, 'key': 'Quan Zhou'},
    {'doc_count': 6, 'key': 'David Pose Fernández'},
    {'doc_count': 4, 'key': 'Keanu Nichols'},
    {'doc_count': 2, 'key': 'David Esler'},
    {'doc_count': 2, 'key': 'Gustavo Silva'},
    {'doc_count': 2, 'key': 'Kapil Thangavelu'},
    {'doc_count': 2, 'key': 'Robin Muilwijk'}],
   'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 12}},
 {'doc_count': 79,
  'key': 'Bitergia',
  'users': {'buckets': [{'doc_count': 61, 'key': 'Alvaro del Castillo'},
    {'doc_count': 10, 'key': 'Manrique Lopez'},
    {'do