Add new functions to calculate the metrics #67
Conversation
I believe #64 fixes the tests.
I've included some comments about specific names of class properties, and similar details.

But my main concern is how the class implementation is framed. I see that you are orienting it towards the JSON to produce. I was thinking more about orienting it towards the chain of calls to elasticsearch_dsl to produce it.

For example, for a simple chain, it would just be a list of the functions to chain, with the corresponding parameters. Each function in the class would just add one or more es_dsl functions to the list, or even apply them directly to the connection. If you want, I can produce some examples.

The problem I see with the current approach is that it is a bit too rigid for including future functionality.
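That "list of calls to chain" idea could be sketched roughly as follows; every name here is hypothetical, and the elasticsearch_dsl specifics are left out:

```python
class Query:
    """Hypothetical sketch of the 'list of chained calls' idea: instead of
    building the JSON directly, record the elasticsearch_dsl calls to make."""

    def __init__(self):
        self.calls = []  # ordered (method_name, kwargs) pairs, applied later

    def add_call(self, method, **kwargs):
        self.calls.append((method, kwargs))
        return self  # returning self keeps the API chainable

    def apply(self, search):
        """Replay the recorded chain on an elasticsearch_dsl Search object."""
        for method, kwargs in self.calls:
            search = getattr(search, method)(**kwargs)
        return search

# Each metric method would just append one or more calls:
q = Query().add_call("filter", state="closed").add_call("params", size=0)
```

Nothing is sent to elasticsearch until `apply()` replays the chain, which keeps the class loosely coupled to the library.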
manuscripts2/Readme.md (outdated):

> ### Structure:
> The main class which is being implemented is the `EQCC` class. `EQCC` stands for `Elasticsearch Query, Connection and Compute`, as it provides a connection to elasticsearch, queries it and computes the response.
I think we need to find a better name for EQCC. As it is now, you need to know what the acronym means, or you have no clue. What about naming it "Query", since in the end it is a query for ES? I still don't like that "Query" name very much, since it says nothing about this being the root of all our metrics, but well, I find it better than EQCC.... Any other idea?
manuscripts2/Readme.md (outdated):

> The important variables inside the `EQCC` objects are as follows:
>
> - `s`: This is the elasticsearch_dsl Search object which is responsible for querying elasticsearch and getting the results. All the aggregations, queries and filters are applied to this Search object, which then queries elasticsearch and fetches the required results.
Use a name that says something, please ;-) If you cannot figure out something better, `search` could do it.
manuscripts2/Readme.md (outdated):

> The important variables inside the `EQCC` objects are as follows:
>
> - `s`: This is the elasticsearch_dsl Search object which is responsible for querying elasticsearch and getting the results. All the aggregations, queries and filters are applied to this Search object, which then queries elasticsearch and fetches the required results.
> - `parent_id`: This is a counter counting the aggregations that are applied to the Search object. It starts at `0` and increments as aggregations are added.
If it counts aggregations, what about naming it `agg_counter`?
It actually counts the parent aggregations, so what about `parent_agg_counter`?
manuscripts2/Readme.md (outdated):

> - `s`: This is the elasticsearch_dsl Search object which is responsible for querying elasticsearch and getting the results. All the aggregations, queries and filters are applied to this Search object, which then queries elasticsearch and fetches the required results.
> - `parent_id`: This is a counter counting the aggregations that are applied to the Search object. It starts at `0` and increments as aggregations are added.
> - `queries`: This is a dictionary containing two lists: `must` and `must_not`. It contains the normal (`must`) and inverse (`must_not`) queries.
> - `aggregations`: This is an OrderedDict which contains all the aggregations applied. An ordered dict allows us to create nested aggregations easily, as we'll see below.
Do you really need an OrderedDict? Wouldn't a dict be good enough?
@jgbarah The way this class is implemented, I need a data structure which stores the data in an ordered manner, so that nested aggregations can be created by chaining two aggregations (or applying two aggregations one after the other, the second one being one of `by_authors`, `by_organizations` or `by_period`), with the first one becoming the child aggregation of the second:

```python
issues = Issues(index)
issues.get_cardinality("id_in_repo").by_period("author_name")
```

In the first step, the first aggregation (`get_cardinality`) is added to the dict; on applying the second aggregation (`by_period`), the first aggregation is popped from the dict and added as a child aggregation of the second.
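The pop-and-nest step described above could be sketched like this (class and method names here are illustrative, not the actual implementation):

```python
from collections import OrderedDict

class AggChain:
    """Illustrative sketch of nesting by popping the last aggregation
    and re-adding it as a child of the new parent aggregation."""

    def __init__(self):
        self.aggregations = OrderedDict()
        self.parent_id = 0

    def add_agg(self, name, agg):
        self.aggregations[name] = agg
        self.parent_id += 1
        return self

    def nest_under(self, name, parent_agg):
        # Pop the most recently added aggregation ...
        child_name, child = self.aggregations.popitem(last=True)
        # ... and attach it as a child of the new parent aggregation
        parent_agg["aggs"] = {child_name: child}
        self.aggregations[name] = parent_agg
        return self

chain = AggChain()
chain.add_agg("cardinality_id", {"cardinality": {"field": "id_in_repo"}})
chain.nest_under("by_period", {"date_histogram": {"field": "grimoire_creation_date"}})
```

`OrderedDict.popitem(last=True)` is what makes "the most recently added aggregation" well defined, which is the reason given above for not using a plain dict.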
manuscripts2/Readme.md (outdated):

> - `parent_id`: This is a counter counting the aggregations that are applied to the Search object. It starts at `0` and increments as aggregations are added.
> - `queries`: This is a dictionary containing two lists: `must` and `must_not`. It contains the normal (`must`) and inverse (`must_not`) queries.
> - `aggregations`: This is an OrderedDict which contains all the aggregations applied. An ordered dict allows us to create nested aggregations easily, as we'll see below.
> - `child_agg_counter_dict`: This dict keeps track of the number of child aggregations that have been applied in an aggregation.
Maybe `child_agg_count` is good enough?
@jgbarah It is actually a dict which contains individual counters for all the nested aggregations that the object contains. Example:

```python
prs = PullRequests(index)
prs.get_cardinality("id_in_repo").by_authors()
prs.get_sum("time_open_days").by_authors()
prs.get_percentile("time_to_close_days").by_authors()
```

Here, the child_agg_counter_dict will contain one key, `terms_author_uuid` (because `by_authors` is a `terms` aggregation and uses the field `author_uuid`), and its value will be 3.

This aggregation will give us the number of PRs, the sum of `time_open_days` of the open PRs and the percentile of `time_to_close_days` of the closed PRs, all segregated by authors, in one go!

This is functionality that I think will decrease the number of times we have to query elasticsearch and get the results.
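A toy version of that per-aggregation counter, with hypothetical names (the real class presumably keys the dict by each parent aggregation's name):

```python
from collections import defaultdict

# Counts how many child aggregations have been added under each parent agg
child_agg_counter = defaultdict(int)

def add_child_agg(parent_key):
    # The current count gives the new child a unique position under its
    # parent; the counter is then bumped for the next child.
    count = child_agg_counter[parent_key]
    child_agg_counter[parent_key] += 1
    return count

# Three metrics nested under the same terms aggregation, as in the
# PullRequests example above:
for _ in range(3):
    add_child_agg("terms_author_uuid")
```

After the loop, `child_agg_counter["terms_author_uuid"]` is 3, matching the value described in the comment above.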
> ---
>
> ##### EXAMPLE 4: Moar chainability
More instead of moar?
I was being creative :P https://en.wiktionary.org/wiki/moar
;-) I didn't know about this "moar". Thanks for teaching me!!
Force-pushed 8821bca to 56710ed.
I'm sorry I couldn't review everything. But my main issue with this implementation is how chainable objects would work. Please have a look especially at my comments on the Query class.
manuscripts2/derived_classes.py (outdated):

```python
from new_functions import Query
```
You need to add the license/copyright template. Please, get it from some other file, and include it here.
Okay, I'll add it to the file.
manuscripts2/new_functions.py (outdated):

```python
import logging
```
You need to add the license/copyright template. Please, get it from some other file, and include it here.
manuscripts2/new_functions.py (outdated):

```python
class Index():
    """Index class which will represent an index in elasticsearch"""

    def __init__(self, index, url=None):
```
Instead of using a URL, use an Elasticsearch object. This allows you to have all the power of an ES object, and you don't need to re-create that functionality in this module. The pattern for using this Index object would be:

```python
es = Elasticsearch(...)
index = Index(index_name, es)
```

If you need a complex Elasticsearch object, you just initialize it as needed (eg, an SSL instance without checking certificates, or whatever you may need).

I agree that the default could be the "default" Elasticsearch instance, so that if you pass no `es` argument, the object's initialization just produces a default Elasticsearch object by instantiating it as `Elasticsearch()` (afaik, by default it will use http://localhost:9200).
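A minimal sketch of that pattern; all names are illustrative, and a plain `object()` stands in for the real client so the sketch does not depend on the elasticsearch library being installed:

```python
class Index:
    """Sketch of the suggested pattern: take a client object, not a URL."""

    def __init__(self, index_name, es=None):
        # Default to a plain client (Elasticsearch() would point at
        # http://localhost:9200) when no client is passed in.
        self.es = es if es is not None else self._default_client()
        self.index_name = index_name

    @staticmethod
    def _default_client():
        # Stand-in for elasticsearch.Elasticsearch(); kept abstract here.
        return object()

custom_es = object()  # e.g. Elasticsearch(hosts=[...], verify_certs=False)
idx = Index("git", custom_es)
```

Any pre-configured client (SSL, auth, custom hosts) slots in unchanged, which is the point of accepting the object rather than a URL.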
manuscripts2/new_functions.py (outdated):

```python
    def __init__(self, index, url=None):
        """
        :url: the address/url of the elasticsearch instance to be used?
```
Respect the order of the arguments, whatever you use in the list of parameters to the function (in this case, I think `index` would be the first one).

Besides, the correct way of writing it is `:param url: comment` (note "param"). We're using Sphinx conventions; see the Sphinx entry in Formats.
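With those Sphinx conventions, and with `index` documented first to match the parameter order, the docstring might look like this (descriptions are illustrative):

```python
class Index:
    def __init__(self, index, url=None):
        """Create an Index wrapper.

        :param index: name of the elasticsearch index to be queried
        :param url: address/URL of the elasticsearch instance to be used;
            defaults to the local instance when not given
        """
        self.index = index
        self.url = url
```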
manuscripts2/new_functions.py
Outdated
The base class of all the necessary metrics that are going to be calculated. | ||
""" | ||
|
||
def __init__(self, index_obj, esfilters={}, interval=None, offset=None): |
Since everything is an object, usually it is better to use names such as `index` instead of `index_obj`. The description of the parameters, below in the docstring, should specify which type `index` should be.
manuscripts2/new_functions.py
Outdated
self.es = index_obj.es | ||
self.index = index_obj.index | ||
|
||
self.search = Search(using=self.es, index=self.index) # create the search object |
My main concern with this implementation is related to how you use this search variable, and the others related to it, such as `filters`, `queries`, etc. Why not just use a Search variable, which is extended with anything new you may need when chaining functions?

For example, I would implement `add_query` as:

```python
def add_query(self, key_val={}):
    q = Q("match", **key_val)
    self.search = self.search.query(q)
    return self
```

This way, you could use it as:

```python
q = Query().add_query(...)....
```

(I'm not talking about the parameters for `add_query` here, but about how to implement it.)
manuscripts2/Readme.md (outdated):

> The idea is that the user can use the chainability of functions to create nested aggs and add queries seamlessly.
>
> ```python
> from new_functions import Query, Index
> ```
See my notes on the Query class. My idea is that this would be used more like this:

```python
sample = Query(github_index) \
    .add_query({"name1": "value1"}) \
    .add_inverse_query({"name2": "value2"})
```
Force-pushed 56710ed to b264441.
@jgbarah Please review?
Sorry for being late with this :-( However, when looking at the patch, I see changes in the core files (eg, functions now returning self), but no changes in the testing files, README or in the notebook... Could you please either change all of it to use chained functions, or at least add cases, examples, etc. with chained functions? I mean, instead of (or maybe in addition to) the current unchained calls, I would like to have chained ones.

In addition, I would also like functions such as get_trend or get_timeseries to be chainable, but if you find it difficult to do it now, please at least fix the above cases, and we can work on that in a future PR.
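To make the requested contrast concrete, here is a toy stub (the `Issues` methods are stand-ins modeled on names used earlier in this thread) showing the unchained and chained styles producing the same result:

```python
class Issues:
    """Minimal hypothetical stub, just to show the two calling styles."""

    def __init__(self, index):
        self.index = index
        self.aggs = []

    def get_cardinality(self, field):
        self.aggs.append(("cardinality", field))
        return self  # returning self is what enables chaining

    def by_period(self, field):
        self.aggs.append(("date_histogram", field))
        return self

index = "git"

# Unchained style: one statement per call
issues = Issues(index)
issues.get_cardinality("id_in_repo")
issues.by_period("grimoire_creation_date")

# Chained style: same effect in one expression, since each method returns self
chained = Issues(index).get_cardinality("id_in_repo").by_period("grimoire_creation_date")
```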
- new_functions.py: contains Query & Index classes and functions to parse the fetched data
- derived_classes.py: contains PullRequests and Issues classes
- Readme.md describing how to use the above 2 files
- tests for the basic functions inside the Query class

Also add:

- Examples/Sample_metrics.ipynb: contains the CHAOSS and GMD metrics that are being calculated using the new functions

Addresses issues chaoss#59 and chaoss#62
Force-pushed b264441 to e864750.
@jgbarah I am sorry I forgot to change those two files. This fresh commit adds more examples, as you wanted. For tests, I am planning to make another PR (or I can create another commit in this PR) to add specific tests for some of these functions. In this commit, I've changed the path of the sample_metrics.ipynb notebook to an Examples folder, because we have to treat manuscripts2 as a module and we should be able to import the functions and classes from it.
@jgbarah ping
OK, let's go with this for now, and let's build up from this.
This PR adds a new folder named manuscripts2, containing some new classes and functions to calculate the metrics. The files are as follows:

- new_functions.py: contains the Query and Index classes and functions to parse the fetched data
- derived_classes.py: contains the PullRequests and Issues classes
- Readme.md: describes how to use the above two files
- tests for the basic functions inside the Query class

Also adding:

- Examples/Sample_metrics.ipynb: contains the CHAOSS and GMD metrics that are being calculated using the new functions

This PR addresses issues #59 and #62.