
Add new functions to calculate the metrics #67

Merged
merged 1 commit into chaoss:master from adding-new-functions on Jul 6, 2018

Conversation


@aswanipranjal aswanipranjal commented Jun 12, 2018

This PR adds a new folder named `manuscripts2` containing some new classes and functions to calculate the metrics. The files are as follows:

  • new_functions.py contains the Query and Index classes, as well as functions which help parse the output from the Query class.
  • derived_classes.py contains the derived classes: Issues and PullRequests
  • Readme.md describes how to use the above 2 files

Also adding:

  • Examples/Sample_metrics.ipynb notebook showing how these classes can be used to calculate the current metrics created by manuscripts as well as the GMD metrics.

This PR addresses issues #59 and #62


aswanipranjal commented Jun 12, 2018

I believe #64 fixes the tests.


jgbarah commented Jun 12, 2018

> I believe #64 fixes the tests.

But we need to pass the tests without ordering in the main code, see my comment in the review in #64.

@jgbarah jgbarah left a comment


I've included some comments about specific names of properties of the class, and stuff like that.

But my main concern is how the class implementation is framed. I see that you are orienting it towards the JSON to produce. I was thinking more about orienting it towards the chain of calls to elasticsearch_dsl that produce it...

For example, for a simple chain, it would be just a list of the functions to chain with the corresponding parameters. Each function in the class would just add one or more es_dsl functions to the list, or even directly to the connection... If you want, I can produce an example.

The problem I see with the current approach is that it is a bit too rigid for including future functionality.
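A rough sketch of that chained approach (the `Metric` class and method names here are hypothetical; only elasticsearch_dsl's `Search`/`Q` API is real) could look like this:

```python
from elasticsearch_dsl import Search, Q


class Metric:
    """Sketch: each chainable method just extends the underlying Search object."""

    def __init__(self, es, index):
        # Start from a plain Search object and extend it as methods are chained
        self.search = Search(using=es, index=index)

    def add_query(self, key_val):
        # Add one more es_dsl call to the chain and return self for chaining
        self.search = self.search.query(Q("match", **key_val))
        return self

    def by_period(self, field, interval="month"):
        # Aggregations attach in place to the Search object
        self.search.aggs.bucket("by_period", "date_histogram",
                                field=field, interval=interval)
        return self

    def fetch(self):
        return self.search.execute()
```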



### Structure:
The main Class which is being implemented is the `EQCC` class. `EQCC` stands for `Elasticsearch Query, Connection and Compute` as it provides a connection to elasticsearch, queries it and computes the response.
Contributor


I think we need to find a better name for EQCC. As it is now, you need to know what the acronym means, or you have no clue. What about naming it "Query", since in the end it is a query for ES? I still don't like that "Query" name very much, since it says nothing about this being the root of all our metrics, but well, I find it better than EQCC.... Any other idea?


The important variables inside the `EQCC` objects are as follows:

- `s`: This is the elasticsearch_dsl Search object which is responsible for querying elasticsearch and getting the results. All the aggregations, queries and filters are applied to this Search object, and then it queries elasticsearch and fetches the required results.
Contributor


Use a name that says something, please ;-) If you cannot figure out something better, `search` would do.

- `parent_id`: This is a counter counting the aggregations that are applied to the Search object. It starts with `0` and increments as aggregations are added.
Contributor


If it counts aggregations, what about naming it `agg_counter`?

Contributor Author


It actually counts the parent aggregations, so what about `parent_agg_counter`?

- `queries`: This is a dictionary containing two lists: `must` and `must_not`. This dictionary contains the normal (`must`) and inverse (`must_not`) queries.
- `aggregations`: This is an OrderedDict which contains all the aggregations applied. An ordered dict allows us to create nested aggregations easily, as we'll see below.
Contributor


Do you really need an OrderedDict? Wouldn't a dict be good enough?

Contributor Author


@jgbarah The way this class is implemented, I need a data structure which stores the data in an ordered manner, so that nested aggregations can be created by chaining two aggregations (applying two aggregations one after the other: the second one being one of by_authors, by_organizations or by_period, and the first one becoming the child aggregation of the second).

issues = Issues(index)
issues.get_cardinality("id_in_repo").by_period("author_name")

In the first step, the first aggregation (get_cardinality) is added to the dict; on applying the second aggregation (by_period), the first aggregation is popped from the dict and added as a child aggregation of the second one.
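A sketch of that pop-and-nest step, assuming the aggregations OrderedDict holds elasticsearch_dsl `A` objects (the key names and the parent's field are placeholders, not the actual implementation):

```python
from collections import OrderedDict

from elasticsearch_dsl import A

aggregations = OrderedDict()

# get_cardinality("id_in_repo") adds the first aggregation:
aggregations["cardinality_id_in_repo"] = A("cardinality", field="id_in_repo")

# by_period() then pops that aggregation and nests it as a child of the new
# parent aggregation before storing the parent back in the dict:
child_name, child_agg = aggregations.popitem(last=True)
parent_agg = A("date_histogram", field="grimoire_creation_date", interval="month")
parent_agg.bucket(child_name, child_agg)
aggregations["date_histogram"] = parent_agg
```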

- `child_agg_counter_dict`: This dict keeps track of the number of child aggregations that have been applied under each aggregation.
Contributor


Maybe child_agg_count is good enough?

Contributor Author


@jgbarah It is actually a dict which contains individual counters for all the nested aggregations that the object contains. Example:

prs = PullRequests(index)
prs.get_cardinality("id_in_repo").by_authors()
prs.get_sum("time_open_days").by_authors()
prs.get_percentile("time_to_close_days").by_authors()

here, the child_agg_counter_dict will contain one key, terms_author_uuid (because by_authors is a terms aggregation and uses the field author_uuid), and its value will be 3.
This aggregation will give us the number of PRs, the sum of time_open_days of the open PRs and the percentile of time_to_close_days of the closed PRs, all segregated by authors, in one go!
This is functionality that I think will decrease the number of times we have to query elasticsearch to get the results.
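In other words, after the three chained calls above the bookkeeping would look roughly like this (an illustration of the idea, not the actual contents of the implementation):

```python
# by_authors() is a "terms" aggregation on author_uuid, so the same key is
# reused for each of the three chained metrics and its count ends up as 3:
child_agg_counter_dict = {
    "terms_author_uuid": 3
}
```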


---

##### EXAMPLE 4: Moar chainability
Contributor


More instead of moar?

Contributor Author


I was being creative :P https://en.wiktionary.org/wiki/moar

Contributor


;-) I didn't know about this "moar". Thanks for teaching me!!

@aswanipranjal aswanipranjal force-pushed the adding-new-functions branch 4 times, most recently from 8821bca to 56710ed on June 27, 2018 02:21
@jgbarah jgbarah left a comment


I'm sorry I couldn't review everything. But my main issue with this implementation is how chainable objects would work. Please have a look especially at my comments on the Query class.



@@ -0,0 +1,14 @@
from new_functions import Query
Contributor


You need to add the license/copyright template. Please, get it from some other file, and include it here.

Contributor Author


Okay, I'll add it to the file.

@@ -0,0 +1,510 @@
import logging
Contributor


You need to add the license/copyright template. Please, get it from some other file, and include it here.

class Index():
"""Index class which will represent an index in elasticsearch"""

def __init__(self, index, url=None):
Contributor


Instead of using a URL, use an Elasticsearch object. This allows you to have all the power of an ES object, and you don't need to re-create that functionality in this module. The pattern for using this Index object would be:

es = Elasticsearch(...)
index = Index(index_name, es)

If you need a complex Elasticsearch object, you just initialize it as needed (e.g., an SSL instance without checking certificates, or whatever you may need).

I agree that the default could be the "default" Elasticsearch instance, so that if you pass no es argument, the initialization of the object just produces a default Elasticsearch object by instantiating it as Elasticsearch() (afaik, by default it will use http://localhost:9200).
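A minimal sketch of that pattern (the Index signature is the suggestion being discussed, not the merged code; the host and index name are placeholders):

```python
from elasticsearch import Elasticsearch


class Index:
    """Represents an index in an Elasticsearch instance."""

    def __init__(self, index_name, es=None):
        # Fall back to a default local instance (http://localhost:9200)
        # when no Elasticsearch object is passed in
        self.es = es if es is not None else Elasticsearch()
        self.index_name = index_name


# A custom client (e.g. an SSL instance that skips certificate checks)
# can be passed in when needed:
es = Elasticsearch(["https://es.example.org:9200"], verify_certs=False)
github_index = Index("github_issues", es)
```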


def __init__(self, index, url=None):
"""
:url: the address/url of the elasticsearch instance to be used?
Contributor


Respect the order of the arguments, whatever you use in the list of parameters to the function (in this case, I think 'index' would be the first one).

Besides, the correct way of writing it is

  :param url: comment

(note "param")

We're using Sphinx conventions; see the Sphinx entry in Formats.
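For instance, a docstring following that convention (the wording here is just illustrative) would look like:

```python
def __init__(self, index, url=None):
    """Create a representation of an Elasticsearch index.

    :param index: name of the index
    :param url: address/url of the elasticsearch instance to be used
    """
```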

The base class of all the necessary metrics that are going to be calculated.
"""

def __init__(self, index_obj, esfilters={}, interval=None, offset=None):
Contributor


Since everything is an object, usually it is better to use names such as index instead of index_obj. The description of the parameters, below in the docstring, should specify which type index should be.

self.es = index_obj.es
self.index = index_obj.index

self.search = Search(using=self.es, index=self.index) # create the search object
Contributor


My main concern with this implementation is related to how you use this search variable, and the others related to it, such as filters, queries, etc. Why not just use a Search variable, which is extended with anything new you may need when chaining functions?

For example, I would implement add_query as:

def add_query(self, key_val={}):
    q = Q("match", **key_val)
    self.search = self.search.query(q)
    return self

This way, you could use it as:

q = Query().add_query(...)....

(I'm not talking about the parameters for add_query here, but about how to implement it).

The idea is that the user can use the chainability of functions to create nested aggs and add queries seamlessly.

```python
from new_functions import Query, Index
```
Contributor


See my notes on the Query class. My idea is that this would be used more like this:

sample = Query(github_index) \
    .add_query({"name1":"value1"}) \
    .add_inverse_query({"name2":"value2"})

@aswanipranjal

@jgbarah Please review?


jgbarah commented Jul 3, 2018

Sorry for being late with this :-(

However, when looking at the patch, I see changes in the core files (e.g., functions now returning self), but no changes in the testing files, README or the notebook... Could you please either change all of it to use chained functions, or at least add cases, examples, etc. with chained functions?

I mean, instead of (or maybe in addition to), for example:

opened_pr = PullRequests(github_index)
opened_pr.get_cardinality("id_in_repo").by_period(period="quarter")
print("Trend for quarter: ", get_trend(get_timeseries(opened_pr)))

I would like to have:

opened_pr = PullRequests(github_index) \
    .get_cardinality("id_in_repo").by_period(period="quarter")
print("Trend for quarter: ", get_trend(get_timeseries(opened_pr)))

In addition, I would like to also have functions such as get_trend or get_timeseries to be chainable, but if you find it difficult to do it now, please at least fix the above cases, and we can work on that in a future pr.

- new_functions.py: contains Query & Index classes and functions
to parse the fetched data
- derived_classes.py: contains PullRequests and Issues classes
- Readme.md describing how to use the above 2 files
- tests for the basic functions inside the Query class

Also add:
- Examples/Sample_metrics.ipynb: contains the CHAOSS and GMD
metrics that are being calculated using the new functions

Addresses issues chaoss#59 and chaoss#62
@aswanipranjal

@jgbarah I am sorry I forgot to change those two files. This fresh commit adds more examples as you wanted.

For tests, I am planning to make another PR (or I can create another commit in this PR) to add specific tests for functions such as fetch_aggregation_results and some others which have been initialised in the file (test_new_functions.py) but have not been implemented yet.
I'll add tests for the chainable methods along with those tests.

In this commit, I've changed the path of the sample_metrics.ipynb notebook to an Examples folder, because we have to treat manuscripts2 as a module and we should be able to import the functions and classes from it.

@aswanipranjal aswanipranjal changed the title from "Adding the new functions to calculate the metrics" to "Add new functions to calculate the metrics" on Jul 4, 2018
@aswanipranjal

@jgbarah ping

@jgbarah jgbarah left a comment


OK, let's go with this for now, and let's build from this up.

@jgbarah jgbarah merged commit 6b25218 into chaoss:master Jul 6, 2018