
Add new functions to calculate the metrics #67

Merged
merged 1 commit into chaoss:master from adding-new-functions on Jul 6, 2018

Conversation


@aswanipranjal aswanipranjal commented Jun 12, 2018

This PR adds a new folder named `manuscripts2` containing some new classes and functions to calculate the metrics. The files are as follows:

  • new_functions.py contains the Query and Index classes, as well as functions which help parse the output from the Query class.
  • derived_classes.py contains the derived classes: Issues and PullRequests
  • Readme.md describes how to use the above 2 files

Also adding:

  • Examples/Sample_metrics.ipynb notebook showing how these classes can be used to calculate the current metrics created by manuscripts as well as the GMD metrics.

This PR addresses issues #59 and #62


aswanipranjal commented Jun 12, 2018

I believe #64 fixes the tests.


jgbarah commented Jun 12, 2018

> I believe #64 fixes the tests.

But we need to pass the tests without ordering in the main code, see my comment in the review in #64.

@jgbarah jgbarah left a comment


I've included some comments about specific names of properties of the class, and stuff like that.

But my main concern is how the class implementation is framed. I see that you are orienting it towards the JSON to produce. I was thinking more about orienting it towards the chain of calls to elasticsearch_dsl that produce it...

For example, for a simple chain, it would be just a list of the functions to chain with the corresponding parameters. Each function in the class would just add one or more es_dsl functions to the list, or even directly to the connection... If you want, I can produce an example.

The problem I see with the current approach is that it is a bit too rigid for including future functionality.
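A rough sketch of that chained approach (the `Metric` class and method names here are hypothetical; only elasticsearch_dsl's `Search`/`Q` API is real) could look like this:

```python
from elasticsearch_dsl import Search, Q


class Metric:
    """Sketch: each chainable method just extends the underlying Search object."""

    def __init__(self, es, index):
        # Start from a plain Search object and extend it as methods are chained
        self.search = Search(using=es, index=index)

    def add_query(self, key_val):
        # Add one more es_dsl call to the chain and return self for chaining
        self.search = self.search.query(Q("match", **key_val))
        return self

    def by_period(self, field, interval="month"):
        # Aggregations attach in place to the Search object
        self.search.aggs.bucket("by_period", "date_histogram",
                                field=field, interval=interval)
        return self

    def fetch(self):
        return self.search.execute()
```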



### Structure:
The main Class which is being implemented is the `EQCC` class. `EQCC` stands for `Elasticsearch Query, Connection and Compute` as it provides a connection to elasticsearch, queries it and computes the response.
Contributor


I think we need to find a better name for EQCC. As it is now, you need to know what the acronym means, or you have no clue. What about naming it "Query", since in the end it is a query for ES? I still don't like that "Query" name very much, since it says nothing about this being the root of all our metrics, but well, I find it better than EQCC.... Any other idea?


The important variables inside the `EQCC` objects are as follows:

- `s`: This is the elasticsearch_dsl Search object which is responsible for querying elasticsearch and getting the results. All the aggregations, queries and filters are applied to this Search object, and then it queries elasticsearch and fetches the required results.
Contributor


Use a name that says something, please ;-) If you cannot figure out something better, `search` would do.

- `parent_id`: This is a counter counting the aggregations that are applied to the Search object. It starts with `0` and increments as aggregations are added.
Contributor


If it counts aggregations, what about naming it `agg_counter`?

Contributor Author


It actually counts the parent aggregations, so what about `parent_agg_counter`?

- `queries`: This is a dictionary containing two lists: `must` and `must_not`. This dictionary contains the normal (`must`) and inverse (`must_not`) queries.
- `aggregations`: This is an OrderedDict which contains all the aggregations applied. An ordered dict allows us to create nested aggregations easily, as we'll see below.
Contributor


Do you really need an OrderedDict? Wouldn't a dict be good enough?

Contributor Author


@jgbarah The way this class is implemented, I need a data structure which stores the data in an ordered manner, so that nested aggregations can be created by chaining two aggregations (applying two aggregations one after the other: the second one being one of by_authors, by_organizations or by_period, and the first one becoming the child aggregation of the second).

issues = Issues(index)
issues.get_cardinality("id_in_repo").by_period("author_name")

In the first step, the first aggregation (get_cardinality) is added to the dict; on applying the second aggregation (by_period), the first aggregation is popped from the dict and added as a child aggregation of the second one.
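A sketch of that pop-and-nest step, assuming the aggregations OrderedDict holds elasticsearch_dsl `A` objects (the key names and the parent's field are placeholders, not the actual implementation):

```python
from collections import OrderedDict

from elasticsearch_dsl import A

aggregations = OrderedDict()

# get_cardinality("id_in_repo") adds the first aggregation:
aggregations["cardinality_id_in_repo"] = A("cardinality", field="id_in_repo")

# by_period() then pops that aggregation and nests it as a child of the new
# parent aggregation before storing the parent back in the dict:
child_name, child_agg = aggregations.popitem(last=True)
parent_agg = A("date_histogram", field="grimoire_creation_date", interval="month")
parent_agg.bucket(child_name, child_agg)
aggregations["date_histogram"] = parent_agg
```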

- `child_agg_counter_dict`: This dict keeps track of the number of child aggregations that have been applied under each aggregation.
Contributor


Maybe child_agg_count is good enough?

Contributor Author


@jgbarah It is actually a dict which contains individual counters for all the nested aggregations that the object contains. Example:

prs = PullRequests(index)
prs.get_cardinality("id_in_repo").by_authors()
prs.get_sum("time_open_days").by_authors()
prs.get_percentile("time_to_close_days").by_authors()

here, the child_agg_counter_dict will contain one key, terms_author_uuid (because by_authors is a terms aggregation and uses the field author_uuid), and its value will be 3.
This aggregation will give us the number of PRs, the sum of time_open_days of the open PRs and the percentile of time_to_close_days of the closed PRs, all segregated by authors, in one go!
This is functionality that I think will decrease the number of times we have to query elasticsearch to get the results.
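In other words, after the three chained calls above the bookkeeping would look roughly like this (an illustration of the idea, not the actual contents of the implementation):

```python
# by_authors() is a "terms" aggregation on author_uuid, so the same key is
# reused for each of the three chained metrics and its count ends up as 3:
child_agg_counter_dict = {
    "terms_author_uuid": 3
}
```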


---

##### EXAMPLE 4: Moar chainability
Contributor


More instead of moar?

Contributor Author


I was being creative :P https://en.wiktionary.org/wiki/moar

Contributor


;-) I didn't know about this "moar". Thanks for teaching me!!

@aswanipranjal aswanipranjal force-pushed the adding-new-functions branch 4 times, most recently from 8821bca to 56710ed on June 27, 2018 02:21
@jgbarah jgbarah left a comment


I'm sorry I couldn't review everything. But my main issue with this implementation is how chainable objects would work. Please have a look especially at my comments on the Query class.



@@ -0,0 +1,14 @@
from new_functions import Query
Contributor


You need to add the license/copyright template. Please, get it from some other file, and include it here.

Contributor Author


Okay, I'll add it to the file.

@@ -0,0 +1,510 @@
import logging
Contributor


You need to add the license/copyright template. Please, get it from some other file, and include it here.

class Index():
"""Index class which will represent an index in elasticsearch"""

def __init__(self, index, url=None):
Contributor


Instead of using a URL, use an Elasticsearch object. This allows you to have all the power of an ES object, and you don't need to re-create that functionality in this module. The pattern for using this Index object would be:

es = Elasticsearch(...)
index = Index(index_name, es)

If you need a complex Elasticsearch object, you just initialize it as needed (e.g., an SSL instance without checking certificates, or whatever you may need).

I agree that the default could be the "default" Elasticsearch instance, so that if you pass no es argument, the initialization of the object just produces a default Elasticsearch object by instantiating it as Elasticsearch() (afaik, by default it will use http://localhost:9200).
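A minimal sketch of that pattern (the Index signature is the suggestion being discussed, not the merged code; the host and index name are placeholders):

```python
from elasticsearch import Elasticsearch


class Index:
    """Represents an index in an Elasticsearch instance."""

    def __init__(self, index_name, es=None):
        # Fall back to a default local instance (http://localhost:9200)
        # when no Elasticsearch object is passed in
        self.es = es if es is not None else Elasticsearch()
        self.index_name = index_name


# A custom client (e.g. an SSL instance that skips certificate checks)
# can be passed in when needed:
es = Elasticsearch(["https://es.example.org:9200"], verify_certs=False)
github_index = Index("github_issues", es)
```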


def __init__(self, index, url=None):
"""
:url: the address/url of the elasticsearch instance to be used?
Contributor


Respect the order of the arguments, whatever you use in the list of parameters to the function (in this case, I think 'index' would be the first one).

Besides, the correct way of writing it is

  :param url: comment

(note "param")

We're using Sphinx conventions; see the Sphinx entry in Formats.
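For instance, a docstring following that convention (the wording here is just illustrative) would look like:

```python
def __init__(self, index, url=None):
    """Create a representation of an Elasticsearch index.

    :param index: name of the index
    :param url: address/url of the elasticsearch instance to be used
    """
```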

The base class of all the necessary metrics that are going to be calculated.
"""

def __init__(self, index_obj, esfilters={}, interval=None, offset=None):
Contributor


Since everything is an object, usually it is better to use names such as index instead of index_obj. The description of the parameters, below in the docstring, should specify which type index should be.

self.es = index_obj.es
self.index = index_obj.index

self.search = Search(using=self.es, index=self.index) # create the search object
Contributor


My main concern with this implementation is related to how you use this search variable, and the others related to it, such as filters, queries, etc. Why not just use a Search variable, which is extended with anything new you may need when chaining functions?

For example, I would implement add_query as:

def add_query(self, key_val={}):
    q = Q("match", **key_val)
    self.search = self.search.query(q)
    return self

This way, you could use it as:

q = Query().add_query(...)....

(I'm not talking about the parameters for add_query here, but about how to implement it).

The idea is that the user can use the chainability of functions to create nested aggs and add queries seamlessly.

```python
from new_functions import Query, Index
```
Contributor


See my notes on the Query class. My idea is that this would be used more like this:

sample = Query(github_index) \
    .add_query({"name1":"value1"}) \
    .add_inverse_query({"name2":"value2"})

@aswanipranjal

@jgbarah Please review?


jgbarah commented Jul 3, 2018

Sorry for being late with this :-(

However, when looking at the patch, I see changes in the core files (e.g., functions now returning self), but no changes in the testing files, README or the notebook... Could you please either change all of it to use chained functions, or at least add cases, examples, etc. with chained functions?

I mean, instead of (or maybe in addition to), for example:

opened_pr = PullRequests(github_index)
opened_pr.get_cardinality("id_in_repo").by_period(period="quarter")
print("Trend for quarter: ", get_trend(get_timeseries(opened_pr)))

I would like to have:

opened_pr = PullRequests(github_index) \
    .get_cardinality("id_in_repo").by_period(period="quarter")
print("Trend for quarter: ", get_trend(get_timeseries(opened_pr)))

In addition, I would like to also have functions such as get_trend or get_timeseries to be chainable, but if you find it difficult to do it now, please at least fix the above cases, and we can work on that in a future pr.

- new_functions.py: contains Query & Index classes and functions
to parse the fetched data
- derived_classes.py: contains PullRequests and Issues classes
- Readme.md describing how to use the above 2 files
- tests for the basic functions inside the Query class

Also add:
- Examples/Sample_metrics.ipynb: contains the CHAOSS and GMD
metrics that are being calculated using the new functions

Addresses issues chaoss#59 and chaoss#62
@aswanipranjal

@jgbarah I am sorry I forgot to change those two files. This fresh commit adds more examples as you wanted.

For tests, I am planning to make another PR (or I can create another commit in this PR) to add specific tests for functions such as fetch_aggregation_results and some others which have been initialised in the file (test_new_functions.py) but have not been implemented yet.
I'll add tests for the chainable methods along with those tests.

In this commit, I've changed the path of the sample_metrics.ipynb notebook to an Examples folder, because we have to treat manuscripts2 as a module and we should be able to import the functions and classes from it.

@aswanipranjal aswanipranjal changed the title from "Adding the new functions to calculate the metrics" to "Add new functions to calculate the metrics" on Jul 4, 2018
@aswanipranjal

@jgbarah ping

@jgbarah jgbarah left a comment


OK, let's go with this for now, and let's build from this up.

@jgbarah jgbarah merged commit 6b25218 into chaoss:master Jul 6, 2018