# Manuscripts: Re-visited

Manuscripts currently provides mostly aggregations of data. It isn't flexible enough to let us play with the data, for example by sorting or filtering it on different fields and values.

Here, we will experiment with what can be done with the metrics. None of the previously written code is used, so that we can look at different approaches and, in effect, redesign the current code.

We'll still be looking at the [GMD metrics](https://github.com/chaoss/metrics/blob/master/2_Growth-Maturity-Decline.md).

In [1]:
# import the necessary libraries
import json
import requests

import pandas as pd

from pprint import pprint

from elasticsearch import Elasticsearch

from elasticsearch_dsl import A, Q, Search
from elasticsearch_dsl.query import Match, MultiMatch

from datetime import date, timezone
from dateutil import parser, relativedelta

In [3]:
# declare the necessary variables
es = Elasticsearch("http://localhost:9200/")

github_index = "perceval_github"
git_index = "perceval_git"

start_date = date(2017, 8, 1)
start_date = start_date.isoformat() # "2017-08-01"
end_date = date(2018, 5, 22)
end_date = end_date.isoformat()

max_size = 10000 # temporary hack to get all the values in the query

Let's talk about the kinds of filters we want while looking at the metrics.

Can we look at the metrics by segregating them according to:
- Date?
 - days
 - weeks
 - months
 - years
- Organizations?
 - If people from multiple organizations are part of the project, we might need to see how they work together and which organization has the most influence.
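
The date granularities listed above correspond naturally to the interval of an Elasticsearch `date_histogram` aggregation. As a minimal sketch, the hypothetical helper below (`date_histogram_body` and `INTERVALS` are illustrative names, not part of Manuscripts) builds the raw aggregation body as a plain dictionary. It assumes the `interval` parameter used elsewhere in this notebook; newer Elasticsearch versions call it `calendar_interval`.

```python
# Hypothetical helper, for illustration only: maps the granularities
# listed above to the raw body of a date_histogram aggregation.
INTERVALS = {"days": "day", "weeks": "week", "months": "month", "years": "year"}


def date_histogram_body(field, granularity, name="by_date"):
    """Return a request body that buckets documents by date."""
    if granularity not in INTERVALS:
        raise ValueError("unknown granularity: " + granularity)
    return {
        "size": 0,  # only the aggregation is needed, not the documents
        "aggs": {
            name: {
                "date_histogram": {
                    "field": field,
                    # assumption: the cluster still accepts `interval`
                    "interval": INTERVALS[granularity],
                }
            }
        },
    }


body = date_histogram_body("created_at", "weeks")
print(body["aggs"]["by_date"]["date_histogram"])
```

Switching from weeks to months then only changes one argument, which is the kind of flexibility we are after.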

### Issue Resolution

###### open issues

In [7]:
# We are looking at all the issues that are open and were created between 2017-08-01 and 2018-05-22

s = Search(using=es, index=github_index)
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "open"})
q = q1 & q2
s = s.query(q)
agg = A("cardinality", field="id_in_repo")
s.aggs.bucket("num_open_issues", agg)
s = s.filter("range", **{"created_at":{"gte":start_date, "lte":end_date}})
s = s.extra(size=0)

response = s.execute()
response.aggregations.num_open_issues.value

8

###### closed issues


In [8]:
# we are looking at all the closed issues

s = Search(using=es, index=github_index)
q3 = Q("match", **{"item_type":"issue"})
q4 = Q("match", **{"state": "closed"})
q = q3 & q4
s = s.query(q)
agg = A("cardinality", field="id_in_repo")
s.aggs.bucket("num_closed_issues", agg)
s = s.extra(size=0)

response = s.execute()
response.aggregations.num_closed_issues.value

113

Now suppose we want all the closed issues grouped by the people who created them:

In [20]:
s = Search(using=es, index=github_index)
q5 = Q("match", **{"item_type":"issue"})
q6 = Q("match", **{"state": "closed"})
q = q5 & q6
s = s.query(q)
users_agg = A("terms", field="author_name", missing="Others", size=max_size)

To create buckets for aggregation, we can use the approach from the esquery.py file: naming the buckets with numbers in hierarchical order. The lower-numbered bucket ("1") is the main bucket and the higher-numbered bucket ("2") is the nested bucket used inside it.
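
The numbering is easier to see in the raw request body. The sketch below builds, as a plain dictionary, roughly the aggregation section that such a query sends to Elasticsearch. The helper name `nested_terms_body` is hypothetical and only serves as an illustration; the field names are the ones used in this notebook.

```python
import json


# Hypothetical helper, for illustration only: a `terms` bucket ("1")
# with a `cardinality` metric ("2") nested inside it.
def nested_terms_body(bucket_field, metric_field, size=10000, missing="Others"):
    main_bucket, secondary_bucket = "1", "2"
    return {
        "aggs": {
            main_bucket: {
                "terms": {"field": bucket_field, "missing": missing, "size": size},
                # the nested aggregation is computed once per bucket of "1"
                "aggs": {
                    secondary_bucket: {"cardinality": {"field": metric_field}}
                },
            }
        }
    }


body = nested_terms_body("author_name", "id_in_repo")
print(json.dumps(body, indent=2))
```

Each bucket in the response then carries its own `"2"` entry holding the nested metric value.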

In [21]:
main_bucket = "1"
secondary_bucket = "2"
users_agg.metric(secondary_bucket, "cardinality", field="id_in_repo")
s.aggs.bucket(main_bucket, users_agg)
s = s.extra(size=max_size)

response = s.execute()
response.aggregations.to_dict()[main_bucket]['buckets']

[{'2': {'value': 27}, 'doc_count': 27, 'key': 'Alvaro del Castillo'},
 {'2': {'value': 20}, 'doc_count': 20, 'key': 'Alberto Martín'},
 {'2': {'value': 10}, 'doc_count': 10, 'key': 'Others'},
 {'2': {'value': 9}, 'doc_count': 9, 'key': 'Jesus M. Gonzalez-Barahona'},
 {'2': {'value': 9}, 'doc_count': 9, 'key': 'Manrique Lopez'},
 {'2': {'value': 9}, 'doc_count': 9, 'key': 'Santiago Dueñas'},
 {'2': {'value': 3}, 'doc_count': 3, 'key': 'Daniel Izquierdo Cortazar'},
 {'2': {'value': 3}, 'doc_count': 3, 'key': 'David Pose Fernández'},
 {'2': {'value': 3}, 'doc_count': 3, 'key': 'Jose Miguel'},
 {'2': {'value': 3}, 'doc_count': 3, 'key': 'Quan Zhou'},
 {'2': {'value': 2}, 'doc_count': 2, 'key': 'Brylie Christopher Oxley'},
 {'2': {'value': 2}, 'doc_count': 2, 'key': 'Robin Muilwijk'},
 {'2': {'value': 2}, 'doc_count': 2, 'key': 'Saad Bin Shahid'},
 {'2': {'value': 2}, 'doc_count': 2, 'key': 'valerio'},
 {'2': {'value': 1}, 'doc_count': 1, 'key': 'Andre Klapper'},
 {'2': {'value': 1}, 'doc_c

###### open issue age

In [28]:


s = Search(using=es, index=github_index)
q0 = Q("match_all")
q1 = Q("match", **{"item_type":"issue"})
q2 = Q("match", **{"state": "open"})
q = q0 & q1 & q2
s = s.query(q)
agg = A("cardinality", field="id_in_repo")
s.aggs.bucket("num_open_issues", agg)
s = s.extra(_source=['time_open_days', 'id_in_repo'])
s = s.extra(size=max_size)
response = s.execute()

In [29]:
open_issue_age = pd.DataFrame([hit['_source'] for hit in response.hits.hits])

In [30]:
open_issue_age

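|   | id_in_repo | time_open_days |
|---|------------|----------------|
| 0 | 58 | 617.82 |
| 1 | 104 | 507.92 |
| 2 | 319 | 99.84 |
| 3 | 91 | 557.9 |
| 4 | 139 | 374.73 |
| 5 | 217 | 173.08 |
| 6 | 331 | 88.27 |
| 7 | 19 | 795.16 |
| 8 | 28 | 788.04 |
| 9 | 74 | 586.77 |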
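
Once the ages are available as plain values, summarising them is straightforward. The sketch below uses only the standard library and the `time_open_days` values copied from the output above; the choice of summary figures (mean, median, oldest) is an assumption about what might be useful, not something Manuscripts defines.

```python
from statistics import mean, median

# `time_open_days` values copied from the output above
time_open_days = [617.82, 507.92, 99.84, 557.9, 374.73,
                  173.08, 88.27, 795.16, 788.04, 586.77]

summary = {
    "count": len(time_open_days),
    "mean_days": round(mean(time_open_days), 2),
    "median_days": round(median(time_open_days), 2),
    "oldest_days": max(time_open_days),
}
print(summary)
```

With a dataframe, the same figures could be read from `open_issue_age["time_open_days"].describe()`.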


### Code Development

In [38]:
# Total commits

s = Search(using=es, index=git_index)
#q = Q("match", **{"files": 0})
#s = s.query(~q)
a = A("cardinality", field="hash", precision_threshold=2000)
s.aggs.bucket("total_commits", a)
s = s.extra(_source=["hash", "commit_date"])
s = s.extra(sort={"commit_date":"asc"})
response = s.execute()

In [39]:
response.aggregations.total_commits.value

1186

When you go to the [perceval GitHub repo](https://github.com/chaoss/grimoirelab-perceval), you'll see that 1182 commits are actually present. The difference may be because of some empty commit messages. Note also that the `cardinality` aggregation returns an approximate count.

In [35]:
# commits by months
s = Search(using=es, index=git_index)
a = A("date_histogram", field="commit_date", interval="month")
s.aggs.bucket("commits_by_months", a)
response = s.execute()

In [37]:
response.aggregations.commits_by_months.buckets

[{'key_as_string': '2015-08-01T00:00:00.000Z', 'key': 1438387200000, 'doc_count': 16}, {'key_as_string': '2015-09-01T00:00:00.000Z', 'key': 1441065600000, 'doc_count': 0}, {'key_as_string': '2015-10-01T00:00:00.000Z', 'key': 1443657600000, 'doc_count': 0}, {'key_as_string': '2015-11-01T00:00:00.000Z', 'key': 1446336000000, 'doc_count': 46}, {'key_as_string': '2015-12-01T00:00:00.000Z', 'key': 1448928000000, 'doc_count': 34}, {'key_as_string': '2016-01-01T00:00:00.000Z', 'key': 1451606400000, 'doc_count': 60}, {'key_as_string': '2016-02-01T00:00:00.000Z', 'key': 1454284800000, 'doc_count': 152}, {'key_as_string': '2016-03-01T00:00:00.000Z', 'key': 1456790400000, 'doc_count': 98}, {'key_as_string': '2016-04-01T00:00:00.000Z', 'key': 1459468800000, 'doc_count': 44}, {'key_as_string': '2016-05-01T00:00:00.000Z', 'key': 1462060800000, 'doc_count': 38}, {'key_as_string': '2016-06-01T00:00:00.000Z', 'key': 1464739200000, 'doc_count': 66}, {'key_as_string': '2016-07-01T00:00:00.000Z', 'key': 1
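
The histogram buckets are easier to work with once the keys are turned into readable months. The sketch below does that with the standard library, using the first few buckets copied from the output above; `commits_per_month` is a hypothetical helper name used only for illustration.

```python
from datetime import datetime, timezone

# First buckets copied from the output above
buckets = [
    {"key_as_string": "2015-08-01T00:00:00.000Z", "key": 1438387200000, "doc_count": 16},
    {"key_as_string": "2015-09-01T00:00:00.000Z", "key": 1441065600000, "doc_count": 0},
    {"key_as_string": "2015-10-01T00:00:00.000Z", "key": 1443657600000, "doc_count": 0},
    {"key_as_string": "2015-11-01T00:00:00.000Z", "key": 1446336000000, "doc_count": 46},
]


def commits_per_month(buckets):
    """Map 'YYYY-MM' to the number of commits in that month."""
    counts = {}
    for bucket in buckets:
        # `key` is the bucket start in milliseconds since the epoch (UTC)
        start = datetime.fromtimestamp(bucket["key"] / 1000, tz=timezone.utc)
        counts[start.strftime("%Y-%m")] = bucket["doc_count"]
    return counts


monthly = commits_per_month(buckets)
# months in which at least one commit was made
active_months = [month for month, count in monthly.items() if count > 0]
print(monthly)
print(active_months)
```

Months with no commits still appear as buckets with a `doc_count` of 0, so they have to be filtered out explicitly if only active months are wanted.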

In [40]:
# Lines of code changed

s = Search(using=es, index=git_index)
#q = Q("match", **{"files": 0})
#s = s.query(~q)
a1 = A("sum", field="lines_changed")
a2 = A("sum", field="lines_added")
a3 = A("sum", field="lines_removed")
s.aggs.bucket("total_lines_changed", a1)
s.aggs.bucket("total_lines_added", a2)
s.aggs.bucket("total_lines_removed", a3)
s = s.extra(size=0)
#s = s.extra(_source=["hash", "commit_date"])
#s = s.extra(sort={"commit_date":"asc"})
response = s.execute()

print("Total lines changed: ", response.aggregations.total_lines_changed.value)
print("Total lines added: ", response.aggregations.total_lines_added.value)
print("Total lines removed: ", response.aggregations.total_lines_removed.value)

Total lines changed:  354358.0
Total lines added:  265068.0
Total lines removed:  89290.0


### Community Growth

In [46]:
# Number of contributors

s = Search(using=es, index=git_index)
a = A("terms", field="author_name", size=max_size)
a.metric("lines_changed", "sum", field="lines_changed")
a.metric("lines_added", "sum", field="lines_added")
a.metric("average_files_changed", "avg", field="files")
s.aggs.bucket("contributors", a)
response = s.execute()

In [47]:
response.aggregations.contributors.buckets

[{'key': 'Santiago Dueñas', 'doc_count': 1494, 'average_files_changed': {'value': 2.139223560910308}, 'lines_added': {'value': 91122.0}, 'lines_changed': {'value': 123756.0}}, {'key': 'Valerio Cosentino', 'doc_count': 567, 'average_files_changed': {'value': 2.2239858906525574}, 'lines_added': {'value': 86338.0}, 'lines_changed': {'value': 138388.0}}, {'key': 'Alberto Martín', 'doc_count': 102, 'average_files_changed': {'value': 1.9215686274509804}, 'lines_added': {'value': 43562.0}, 'lines_changed': {'value': 45724.0}}, {'key': 'Alvaro del Castillo', 'doc_count': 102, 'average_files_changed': {'value': 2.588235294117647}, 'lines_added': {'value': 30664.0}, 'lines_changed': {'value': 31438.0}}, {'key': 'Jesus M. Gonzalez-Barahona', 'doc_count': 37, 'average_files_changed': {'value': 2.054054054054054}, 'lines_added': {'value': 2058.0}, 'lines_changed': {'value': 2176.0}}, {'key': 'valerio cosentino', 'doc_count': 12, 'average_files_changed': {'value': 3.6666666666666665}, 'lines_added':

Do we see a general pattern in calculating all the metrics above?

- If we want to segregate the values by date: we use a date histogram.
- If we want to segregate the values by members or organizations: we can use a `terms` aggregation with one or more nested aggregations, each of which gives a numeric value.

We can convert the output values into a dataframe for easy processing and apply filters for date and other conditions as we like.
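
As a minimal sketch of that last step, the hypothetical helper `buckets_to_rows` below flattens `terms` buckets with nested metrics into flat rows. It uses only the standard library and three buckets copied from the contributors output above; the 100000-line threshold is an arbitrary example filter.

```python
# Three buckets copied from the contributors output above
buckets = [
    {"key": "Santiago Dueñas", "doc_count": 1494,
     "average_files_changed": {"value": 2.139223560910308},
     "lines_added": {"value": 91122.0}, "lines_changed": {"value": 123756.0}},
    {"key": "Valerio Cosentino", "doc_count": 567,
     "average_files_changed": {"value": 2.2239858906525574},
     "lines_added": {"value": 86338.0}, "lines_changed": {"value": 138388.0}},
    {"key": "Jesus M. Gonzalez-Barahona", "doc_count": 37,
     "average_files_changed": {"value": 2.054054054054054},
     "lines_added": {"value": 2058.0}, "lines_changed": {"value": 2176.0}},
]


def buckets_to_rows(buckets, key_name="key"):
    """Flatten terms buckets: one row per bucket, one column per metric."""
    rows = []
    for bucket in buckets:
        row = {key_name: bucket["key"], "doc_count": bucket["doc_count"]}
        for name, value in bucket.items():
            # nested single-value metrics look like {"value": <number>}
            if isinstance(value, dict) and "value" in value:
                row[name] = value["value"]
        rows.append(row)
    return rows


rows = buckets_to_rows(buckets, key_name="author_name")
# example filter: authors who changed more than 100000 lines
heavy_contributors = [r["author_name"] for r in rows if r["lines_changed"] > 100000]
print(heavy_contributors)
```

Each row is a flat dictionary, so `pd.DataFrame(rows)` gives one column per metric, ready for further sorting and filtering.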