## CHAOSS METRICS

This is an attempt to map all the metrics that manuscripts currently produces. I am not implementing this directly in manuscripts yet, for the following reasons:
- I need to test a bit more how the new functions work with the metrics that are currently calculated.
- The functions and classes currently used in manuscripts are quite different from the way the new functions calculate both the new metrics and the old ones. Hence the new functions cannot be plugged in directly.

In [20]:
from manuscripts2.new_functions import EQCC, Index, calculate_bmi, buckets_to_df
from manuscripts2.derived_classes import Issues, PullRequests

from elasticsearch import Elasticsearch
# utility and support modules
from pprint import pprint
from datetime import datetime, timezone, timedelta

import pandas as pd
# declare the necessary variables
es = Elasticsearch("http://localhost:9200/")

github_data_source = "aima_github"
git_data_source = "aima_git"

github_index = Index(index=github_data_source)
git_index = Index(index=git_data_source)

start_date = datetime(2015, 1, 1)
end_date = datetime.now()
end_date = end_date.replace(hour=0, minute=0, second=0, microsecond=0)

[Here is the PDF of aima_python that is generated when we create a report using manuscripts as it is.](https://github.com/aswanipranjal/gsoc-manuscripts/blob/master/aima_python.pdf)

### OVERVIEW

- Activity metrics: we have to get the trend for these:
	- Closed PRs
	- Open PRs
	- Issues Open
	- Issues Closed
	- Commits created 
   

- Authors per interval selected: the average number of developers per month, computed by quarter (that is, the average of the monthly developer counts over the three months of each quarter). If the analysis works at month level, this is simply the number of developers per month.


- BMI metrics: here, BMI (Backlog Management Index) measures how efficiently Issues and PRs are closed relative to how many are created.
	- BMI of PRs: closed PRs / submitted PRs in total, plus a trend showing the same ratio per interval (month, week, year) over the given time range.
	- BMI of issues: same as for PRs, but for issues.


- Time to close metrics:
	- Median for Days to close a PR.
	- Median for Days to close an issue.
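
The BMI ratio described above can be sketched in a few lines. This is only an illustration under stated assumptions: `bmi_by_period` is a hypothetical helper, not the `calculate_bmi` function from manuscripts2, the counts are made up, and returning 0.0 for periods with nothing submitted is an assumed convention.

```python
import pandas as pd

def bmi_by_period(closed, opened):
    """Closed/submitted ratio per period (hypothetical helper).

    Periods with nothing submitted get 0.0, which is an assumed convention.
    """
    rows = []
    for period, n_closed, n_opened in zip(closed["period"], closed["value"], opened["value"]):
        ratio = n_closed / n_opened if n_opened else 0.0
        rows.append({"Period": period, "Closed/Submitted": ratio})
    return pd.DataFrame(rows)

# Made-up counts, for illustration only
closed_counts = {"period": ["2016-2", "2016-3", "2016-4"], "value": [0, 12, 3]}
opened_counts = {"period": ["2016-2", "2016-3", "2016-4"], "value": [0, 13, 4]}

bmi = bmi_by_period(closed_counts, opened_counts)
print(bmi)
```

A ratio of 1.0 means that as many items were closed as were submitted in that period; values below 1.0 mean the backlog grew.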

In [21]:
closed_pr = PullRequests(github_index)
closed_pr.is_closed()
# get trend by month:
closed_pr.get_cardinality("id_in_repo").by_period()
print("Trend for month: ", closed_pr.get_trend())

# get trend by quarter:
closed_pr.get_cardinality("id_in_repo").by_period(period="quarter")
print("Trend for quarter: ", closed_pr.get_trend())

Trend for month:  (6, -50)
Trend for quarter:  (15, -893)
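
The tuples printed by `get_trend()` look like (last value, percentage change). The sketch below shows one formula that would produce such tuples. It is an inference from the printed numbers, not taken from the manuscripts source: `trend_from_values` is a hypothetical helper, the previous-period values are made up, and the handling of a zero last value is an assumption.

```python
def trend_from_values(values):
    """Return (last value, percent change from the previous value, relative to last).

    Hypothetical helper: the formula is inferred, not confirmed.
    """
    last, prev = values[-1], values[-2]
    if last == 0:
        # Avoid dividing by zero; these fallback values are assumptions.
        return last, (-100 if prev > 0 else 0)
    # int() truncates towards zero, so 37.5 becomes 37
    return last, int((last - prev) / last * 100)

# Made-up series: only the last two values matter
falling = trend_from_values([9, 6])
rising = trend_from_values([5, 8])
print(falling, rising)
```

Because the change is divided by the last value rather than the previous one, a sharp drop can yield a percentage far below -100.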


In [22]:
opened_pr = PullRequests(github_index)
# get trend by month:
opened_pr.get_cardinality("id_in_repo").by_period()
print("Trend for month: ", opened_pr.get_trend())
# get trend by quarter:
opened_pr.get_cardinality("id_in_repo").by_period(period="quarter")
print("Trend for quarter: ", opened_pr.get_trend())

Trend for month:  (2, -250)
Trend for quarter:  (18, -738)


In [23]:
closed_issues = Issues(github_index)
closed_issues.is_closed()
# get trend by month:
closed_issues.get_cardinality("id_in_repo").by_period(field="closed_at")
print("Trend for month: ", closed_issues.get_trend())
# get trend by quarter:
closed_issues.get_cardinality("id_in_repo").by_period(period="quarter")
print("Trend for quarter: ", closed_issues.get_trend())

Trend for month:  (4, 0)
Trend for quarter:  (6, -716)


In [24]:
open_issues = Issues(github_index)
open_issues.get_cardinality("id_in_repo").by_period()
print("Trend for month: ", open_issues.get_trend())
# get trend by quarter:
open_issues.get_cardinality("id_in_repo").by_period(period="quarter")
print("Trend for quarter: ", open_issues.get_trend())

Trend for month:  (2, -400)
Trend for quarter:  (12, -508)


In [25]:
commits = EQCC(git_index)
commits.get_cardinality("hash").by_period()
print("Trend for month: ", commits.get_trend())

commits.get_cardinality("hash").by_period(period="quarter")
print("Trend for quarter: ", commits.get_trend())

Trend for month:  (8, 37)
Trend for quarter:  (13, -792)


The values below are not displayed in the generated PDF, for a reason I have not yet identified; I will look into it. I checked these values while debugging report.py and analysing its functions, so I am confident they are correct (apart from open and closed PRs).

In [26]:
# Issues closed in the last month:
issues = Issues(github_index)
issues.is_closed()
issues.get_cardinality("id")
# approximate a month as 30 days
previous_month_date = end_date - timedelta(days=30)
issues.since(field="closed_at", start=previous_month_date).until(field="closed_at", end=end_date)
issues.get_aggs()

0

In [27]:
# Issues opened in the last month:
issues = Issues(github_index)
issues.get_cardinality("id")
# May has 31 days
previous_month_date = end_date - timedelta(days=31)
issues.since(start=previous_month_date).until(end=end_date)
issues.get_aggs()

0

There is still a small problem with how the dates are being calculated, hence these values differ from the original values, 5 and 8 respectively.
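
One possible source of the difference is the fixed 30- or 31-day window, which rarely lines up with calendar month boundaries. Whether the original report uses calendar months has not been verified here, so treat this as a sketch under that assumption; `previous_month_window` is a hypothetical helper that uses only the standard library.

```python
from datetime import datetime

def previous_month_window(ref):
    """Return (start, end) of the calendar month before the one containing ref.

    Hypothetical helper; 'end' is exclusive (first instant of ref's month).
    """
    end = ref.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if end.month == 1:
        # January rolls back to December of the previous year
        start = end.replace(year=end.year - 1, month=12)
    else:
        start = end.replace(month=end.month - 1)
    return start, end

window_start, window_end = previous_month_window(datetime(2018, 6, 15, 10, 30))
print(window_start, window_end)
```

The two values could then be passed as `start` and `end` to `since()` and `until()` in place of the `timedelta`-based dates.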

In [29]:
# PRs closed in the last month:
pr = PullRequests(github_index)
pr.is_closed()
pr.get_cardinality("id")
# May has 31 days
previous_month_date = end_date - timedelta(days=31)
pr.since(field="closed_at", start=previous_month_date).until(field="closed_at", end=end_date)
pr.get_aggs()

4

In [30]:
# PRs opened in the last month:
pr = PullRequests(github_index)
pr.get_cardinality("id")
# May has 31 days
previous_month_date = end_date - timedelta(days=31)
pr.since(start=previous_month_date).until(end=end_date)
pr.get_aggs()

7

In [32]:
# Percentile of days to close, for closed PRs
PR = PullRequests(github_index)
PR.is_closed()
PR.get_percentile("time_to_close_days")
# May has 31 days
previous_month_date = end_date - timedelta(days=31)
PR.since(start=previous_month_date).until(end=end_date)
PR.get_aggs()

2.1050000190734863

In [33]:
# Percentile of days to close, for closed issues
issues = Issues(github_index)
issues.is_closed()
issues.get_percentile("time_to_close_days")
# May has 31 days
previous_month_date = end_date - timedelta(days=31)
issues.since(start=previous_month_date).until(end=end_date)
issues.get_aggs()

The cell above produces no output because the result is None.
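
A None result is what one would expect if no closed issues fall inside the window, since a median over an empty set is undefined. A minimal sketch of that behaviour, assuming the time-to-close values are available as a plain list; `median_days_to_close` is a hypothetical helper, not part of manuscripts2:

```python
from statistics import median

def median_days_to_close(days):
    """Median of time-to-close values, or None when there are none (hypothetical helper)."""
    return median(days) if days else None

empty_window = median_days_to_close([])               # no closed items in the window
sample_window = median_days_to_close([0.2, 0.5, 3.0])  # made-up values
print(empty_window, sample_window)
```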

### Communication Channels

Nothing here, because all communication for git and GitHub happens via Issues and PRs.

### Project Activities

In [37]:
# number of commits made by month 
commits = EQCC(git_index)
commits.since(start=start_date).until(end=end_date)
commits.get_cardinality("hash").by_period()
print(pd.DataFrame(commits.get_ts()))

                        date  value      unixtime
0  2015-01-01 00:00:00+00:00      0  1.420070e+09
1  2015-02-01 00:00:00+00:00      0  1.422749e+09
2  2015-03-01 00:00:00+00:00      0  1.425168e+09
3  2015-04-01 00:00:00+00:00      0  1.427846e+09
4  2015-05-01 00:00:00+00:00      0  1.430438e+09
5  2015-06-01 00:00:00+00:00      0  1.433117e+09
6  2015-07-01 00:00:00+00:00      0  1.435709e+09
7  2015-08-01 00:00:00+00:00      0  1.438387e+09
8  2015-09-01 00:00:00+00:00      0  1.441066e+09
9  2015-10-01 00:00:00+00:00      0  1.443658e+09
10 2015-11-01 00:00:00+00:00      0  1.446336e+09
11 2015-12-01 00:00:00+00:00      0  1.448928e+09
12 2016-01-01 00:00:00+00:00      0  1.451606e+09
13 2016-02-01 00:00:00+00:00      3  1.454285e+09
14 2016-03-01 00:00:00+00:00    311  1.456790e+09
15 2016-04-01 00:00:00+00:00     74  1.459469e+09
16 2016-05-01 00:00:00+00:00     14  1.462061e+09
17 2016-06-01 00:00:00+00:00     40  1.464739e+09
18 2016-07-01 00:00:00+00:00     24  1.467331e+09


In [40]:
# number of active authors per month
authors = EQCC(git_index)
authors.get_cardinality("author_uuid").by_period()
print(pd.DataFrame(authors.get_ts()))

                         date  value      unixtime
0   2007-06-01 00:00:00+00:00      1  1.180656e+09
1   2007-07-01 00:00:00+00:00      1  1.183248e+09
2   2007-08-01 00:00:00+00:00      0  1.185926e+09
3   2007-09-01 00:00:00+00:00      0  1.188605e+09
4   2007-10-01 00:00:00+00:00      0  1.191197e+09
5   2007-11-01 00:00:00+00:00      1  1.193875e+09
6   2007-12-01 00:00:00+00:00      0  1.196467e+09
7   2008-01-01 00:00:00+00:00      1  1.199146e+09
8   2008-02-01 00:00:00+00:00      0  1.201824e+09
9   2008-03-01 00:00:00+00:00      0  1.204330e+09
10  2008-04-01 00:00:00+00:00      0  1.207008e+09
11  2008-05-01 00:00:00+00:00      0  1.209600e+09
12  2008-06-01 00:00:00+00:00      0  1.212278e+09
13  2008-07-01 00:00:00+00:00      0  1.214870e+09
14  2008-08-01 00:00:00+00:00      0  1.217549e+09
15  2008-09-01 00:00:00+00:00      0  1.220227e+09
16  2008-10-01 00:00:00+00:00      0  1.222819e+09
17  2008-11-01 00:00:00+00:00      0  1.225498e+09
18  2008-12-01 00:00:00+00:00  

### Community

In [39]:
# number of active authors by month
commits = EQCC(git_index)
commits.since(start=start_date).until(end=end_date)
commits.get_cardinality("author_uuid").by_period()
print(pd.DataFrame(commits.get_ts()))

                        date  value      unixtime
0  2015-01-01 00:00:00+00:00      0  1.420070e+09
1  2015-02-01 00:00:00+00:00      0  1.422749e+09
2  2015-03-01 00:00:00+00:00      0  1.425168e+09
3  2015-04-01 00:00:00+00:00      0  1.427846e+09
4  2015-05-01 00:00:00+00:00      0  1.430438e+09
5  2015-06-01 00:00:00+00:00      0  1.433117e+09
6  2015-07-01 00:00:00+00:00      0  1.435709e+09
7  2015-08-01 00:00:00+00:00      0  1.438387e+09
8  2015-09-01 00:00:00+00:00      0  1.441066e+09
9  2015-10-01 00:00:00+00:00      0  1.443658e+09
10 2015-11-01 00:00:00+00:00      0  1.446336e+09
11 2015-12-01 00:00:00+00:00      0  1.448928e+09
12 2016-01-01 00:00:00+00:00      0  1.451606e+09
13 2016-02-01 00:00:00+00:00      1  1.454285e+09
14 2016-03-01 00:00:00+00:00     27  1.456790e+09
15 2016-04-01 00:00:00+00:00      9  1.459469e+09
16 2016-05-01 00:00:00+00:00      8  1.462061e+09
17 2016-06-01 00:00:00+00:00      3  1.464739e+09
18 2016-07-01 00:00:00+00:00      3  1.467331e+09
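
The "authors per interval" metric from the overview averages these monthly counts by quarter. A minimal sketch with pandas, using the first six months of 2016 from the output above; the column names match the printed frame, while the quarter labels such as `2016-Q1` are an illustrative choice.

```python
import pandas as pd

# Monthly active-author counts copied from the output above (Jan to Jun 2016)
monthly = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-01", "2016-02-01", "2016-03-01",
         "2016-04-01", "2016-05-01", "2016-06-01"], utc=True),
    "value": [0, 1, 27, 9, 8, 3],
})

# Build a label such as "2016-Q1" for each month
quarter = (monthly["date"].dt.year.astype(str)
           + "-Q" + monthly["date"].dt.quarter.astype(str))

# Average number of authors per month within each quarter
quarterly_avg = monthly.groupby(quarter)["value"].mean()
print(quarterly_avg)
```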


In [42]:
# Top committers in the previous month:
authors = EQCC(git_index)
previous_month_date = end_date - timedelta(days=31)
authors.since(start=previous_month_date).until(end=end_date)
authors.get_terms(field="author_name")
print(buckets_to_df(authors.fetch_aggregation_results()['aggregations'][str(authors.parent_id-1)]['buckets']))

   doc_count              key
0          2  Aman Deep Singh
1          1              DKE


In [43]:
# Top committing orgs in the previous month:
orgs = EQCC(git_index)
previous_month_date = end_date - timedelta(days=31)
orgs.since(start=previous_month_date).until(end=end_date)
orgs.get_terms(field="author_org_name")
print(buckets_to_df(orgs.fetch_aggregation_results()['aggregations'][str(orgs.parent_id-1)]['buckets']))

   doc_count      key
0          3  Unknown


### Process

In [44]:
# Issues closed/ issues created

closed_issues = Issues(github_index)
closed_issues.since(start=start_date).until(end=end_date)
closed_issues.is_closed()
closed_issues.get_cardinality("id").by_period()
closed_ts = closed_issues.get_ts()

opened_issues = Issues(github_index)
opened_issues.since(start=start_date).until(end=end_date)
opened_issues.get_cardinality("id").by_period()
opened_ts = opened_issues.get_ts()

print(pd.DataFrame(calculate_bmi(closed_ts, opened_ts)))

     Period  Closed/Submitted
0    2015-1          0.000000
1    2015-2          0.000000
2    2015-3          0.000000
3    2015-4          0.000000
4    2015-5          0.000000
5    2015-6          0.000000
6    2015-7          0.000000
7    2015-8          0.000000
8    2015-9          0.000000
9   2015-10          0.000000
10  2015-11          0.000000
11  2015-12          0.000000
12   2016-1          0.000000
13   2016-2          1.000000
14   2016-3          0.923077
15   2016-4          0.750000
16   2016-5          1.000000
17   2016-6          1.000000
18   2016-7          0.000000
19   2016-8          1.000000
20   2016-9          1.000000
21  2016-10          1.000000
22  2016-11          0.500000
23  2016-12          0.000000
24   2017-1          0.666667
25   2017-2          0.666667
26   2017-3          0.833333
27   2017-4          0.933333
28   2017-5          1.000000
29   2017-6          0.714286
30   2017-7          1.000000
31   2017-8          0.727273
32   2017-

In [45]:
# PRs closed/ PRs submitted

closed_pr = PullRequests(github_index)
closed_pr.since(start=start_date).until(end=end_date)
closed_pr.is_closed()
closed_pr.get_cardinality("id").by_period()
closed_ts = closed_pr.get_ts()

opened_pr = PullRequests(github_index)
opened_pr.since(start=start_date).until(end=end_date)
opened_pr.get_cardinality("id").by_period()
opened_ts = opened_pr.get_ts()

print(pd.DataFrame(calculate_bmi(closed_ts, opened_ts)))

     Period  Closed/Submitted
0    2015-1          0.000000
1    2015-2          0.000000
2    2015-3          0.000000
3    2015-4          0.000000
4    2015-5          0.000000
5    2015-6          0.000000
6    2015-7          0.000000
7    2015-8          0.000000
8    2015-9          0.000000
9   2015-10          0.000000
10  2015-11          0.000000
11  2015-12          0.000000
12   2016-1          0.000000
13   2016-2          0.000000
14   2016-3          1.000000
15   2016-4          1.000000
16   2016-5          1.000000
17   2016-6          1.000000
18   2016-7          1.000000
19   2016-8          1.000000
20   2016-9          1.000000
21  2016-10          1.000000
22  2016-11          1.000000
23  2016-12          1.000000
24   2017-1          1.000000
25   2017-2          1.000000
26   2017-3          1.000000
27   2017-4          1.000000
28   2017-5          1.000000
29   2017-6          1.000000
30   2017-7          1.000000
31   2017-8          1.000000
32   2017-

In [46]:
# days to close review(PR) average
closed_pr = PullRequests(github_index)
closed_pr.since(start=start_date).until(end=end_date)
closed_pr.is_closed()
closed_pr.get_average("time_to_close_days").by_period()
print(pd.DataFrame(closed_pr.get_ts()))

                        date       value      unixtime
0  2015-01-01 00:00:00+00:00         NaN  1.420070e+09
1  2015-02-01 00:00:00+00:00         NaN  1.422749e+09
2  2015-03-01 00:00:00+00:00         NaN  1.425168e+09
3  2015-04-01 00:00:00+00:00         NaN  1.427846e+09
4  2015-05-01 00:00:00+00:00         NaN  1.430438e+09
5  2015-06-01 00:00:00+00:00         NaN  1.433117e+09
6  2015-07-01 00:00:00+00:00         NaN  1.435709e+09
7  2015-08-01 00:00:00+00:00         NaN  1.438387e+09
8  2015-09-01 00:00:00+00:00         NaN  1.441066e+09
9  2015-10-01 00:00:00+00:00         NaN  1.443658e+09
10 2015-11-01 00:00:00+00:00         NaN  1.446336e+09
11 2015-12-01 00:00:00+00:00         NaN  1.448928e+09
12 2016-01-01 00:00:00+00:00         NaN  1.451606e+09
13 2016-02-01 00:00:00+00:00         NaN  1.454285e+09
14 2016-03-01 00:00:00+00:00    1.375500  1.456790e+09
15 2016-04-01 00:00:00+00:00    0.282105  1.459469e+09
16 2016-05-01 00:00:00+00:00    1.386000  1.462061e+09
17 2016-06

In [48]:
# days to close review (PR): percentile instead of average
closed_pr = PullRequests(github_index)
closed_pr.since(start=start_date).until(end=end_date)
closed_pr.is_closed()
closed_pr.get_percentile("time_to_close_days").by_period()
print(pd.DataFrame(closed_pr.get_ts()))

                        date       value      unixtime
0  2015-01-01 00:00:00+00:00         NaN  1.420070e+09
1  2015-02-01 00:00:00+00:00         NaN  1.422749e+09
2  2015-03-01 00:00:00+00:00         NaN  1.425168e+09
3  2015-04-01 00:00:00+00:00         NaN  1.427846e+09
4  2015-05-01 00:00:00+00:00         NaN  1.430438e+09
5  2015-06-01 00:00:00+00:00         NaN  1.433117e+09
6  2015-07-01 00:00:00+00:00         NaN  1.435709e+09
7  2015-08-01 00:00:00+00:00         NaN  1.438387e+09
8  2015-09-01 00:00:00+00:00         NaN  1.441066e+09
9  2015-10-01 00:00:00+00:00         NaN  1.443658e+09
10 2015-11-01 00:00:00+00:00         NaN  1.446336e+09
11 2015-12-01 00:00:00+00:00         NaN  1.448928e+09
12 2016-01-01 00:00:00+00:00         NaN  1.451606e+09
13 2016-02-01 00:00:00+00:00         NaN  1.454285e+09
14 2016-03-01 00:00:00+00:00    0.465000  1.456790e+09
15 2016-04-01 00:00:00+00:00    0.200000  1.459469e+09
16 2016-05-01 00:00:00+00:00    0.600000  1.462061e+09
17 2016-06