-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated esquery.py file. #58
Conversation
Removed all the manually created string queries in all the functions and replaced them with elasticsearch_dsl objects like A(aggregation), Q(query) and Search. Updated queries to use the current methods to search and aggregate. Closes chaoss#57
@jgbarah, @valeriocos, @acs can you please review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @aswanipranjal. If it is possible, I would suggest to add some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot. Very good job. Could you please try to address the comments? In addition, we need some tests. Could you write at least one test function per function? I think that producing some queries, and comparing that they are the same (by comparing the resulting DSL dicts, for example) would be good enogugh for a start.
manuscripts/esquery.py
Outdated
|
||
:param filters: dict with the filters to be applied | ||
:param inverse: if True include all the inverse filters (the one starting with *) | ||
:return: a string with the DSL value for these filters | ||
:return: a list of match object(dict)(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In all references to elasticsearch_dsl objects, I would expliticly mention the module. For example in this case, "list of elasticsearch_dsl match objects", or if you prefer, "es_dsl" for short. Please, add this to all this kind of references in other methods, so that it is crystal clear what they do.
Besides, what do you mean by (dict)(s)? It is better that you describe in detail the data structure that the method returns, including, if it is convenient, an example. Remember that this message should include enough information for a trained developer to be able of using the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By dicts i meant that the filters are dictionaries. I'll update the comments to be more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx
manuscripts/esquery.py
Outdated
|
||
class ElasticQuery(): | ||
""" Helper class for building Elastic queries """ | ||
|
||
AGGREGATION_ID = 1 # min aggregation identifier | ||
AGG_SIZE = 100 # Default max number of buckets | ||
ES_PRECISION = 3000 | ||
ES_PRECISION = 3000 # This is the default value for percision_threshold | ||
|
||
@classmethod | ||
def __get_query_filters(cls, filters=None, inverse=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use "filters=[]". Since it is expected to be a list, this way you don't need, in the implementation, to check for None, and it is even more explicit that you're expecting a list for that argument. I would only use None if you find a case where you need to make a difference between that "None" and an empty list as an argument.
|
||
return query_filters | ||
|
||
@classmethod | ||
def __get_query_range(cls, date_field, start=None, end=None): | ||
""" | ||
Create a string with range DSL query for the date in date_field from start to end dates. | ||
Create a filter dict with date_field from start to end dates. | ||
|
||
:param date_field: field with the date value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For dates, and other "complex" data types, please specify the actual format, if possible. For example, in this case, is it "datetime.datetime"? (besides: if it is not something such as datetime.datetime, I would rather consider changing the profile to really use something like that, so that dates are converted into a pythonic format before this method (and others below) are called. We can build some auxiliary function for that, if needed.
manuscripts/esquery.py
Outdated
query_basic = Search() | ||
|
||
# Not using this for now. | ||
# query_basic = query_basic.query("query_string", analyse_wildcard="true", query="*") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for keeping the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, nothing special. I was just going to experiment with the wildcard flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, maybe remove it now, and produce later a pr with the extra functionality, if you find it convenient.
manuscripts/esquery.py
Outdated
if not agg_id: | ||
agg_id = cls.AGGREGATION_ID | ||
query_agg = A("terms", field=field, size=cls.AGG_SIZE, order={"_count": "desc"}) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much better in the description of the method, above.
manuscripts/esquery.py
Outdated
return bounds | ||
|
||
@classmethod | ||
def __get_query_agg_ts(cls, field, time_field, interval=None, | ||
time_zone=None, start=None, end=None, | ||
agg_type='count', offset=None): | ||
""" | ||
Create a string with an aggregation DSL query for getting the time series | ||
values for field. | ||
Create an aggregation DSL object for getting the time series values for field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it really a "es_dsl aggregation object" instead of a "aggregation DSL object"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my bad. Thanks!
""" % (query_basic) | ||
|
||
# size=0 gives only the count and not the hits | ||
query = query_basic.extra(size=0) | ||
return query | ||
|
||
@classmethod | ||
def get_agg(cls, field=None, date_field=None, start=None, end=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As commented above, in other comments, try to clean up the profile a bit, so that for example fields is a list, dates are datetime or similar, etc. Please remember to do so for all methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay
offset=offset) | ||
s.aggs.bucket(agg_id, query_agg) | ||
|
||
return json.dumps(s.to_dict()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this is returning the query as a JSON file. Shouldn't it better return a es_dsl object, so that we get rid of es DSL queries completely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jgbarah, that is because in metrics.py, get_metrics_data function requests elasticsearch for the data and it requires the data to be in JSON format other wise we will get a 500 error.
We can change the metrics.py file as well to use the Search().execute method to get the data but i thought that we are probably going to change all the methods and replace them with better more agile methods anyways hence I didn't change all the files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jgbarah, @valeriocos: Should I be looking at changing all the files so that we are only using elasticsearch_dsl module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see. Not for now. Let's keep that for another pr, but let's finish this letting it produce JSON as the result.
I'll make another commit with the tests and cleaning up the comments a bit more. |
Added details to the comments in esquery.py about the data sctuctures used and the format that the parameters should be in. Added tests for every function in esquery.py file.
08bfcd9
to
9acc5c2
Compare
@jgbarah @valeriocos @acs please review. I've updated esquery.py with proper comments and added tests for all the functions. I am ready to make another PR which completes the cleaning process. Once this PR has been merged, then I'll make the second PR. Or, if not, I can make that commit in that PR a part of this PR only. |
self.assertDictEqual(self.es._ElasticQuery__get_query_range(date_field=self.date_field, | ||
start=self.start, end=self.end), test_range_dict) | ||
|
||
def test_get_query_basic(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would split this test in two: test_no_filters and test_with_filters.
} | ||
self.assertDictEqual(self.es._ElasticQuery__get_bounds(self.start, self.end), test_bounds_dict) | ||
|
||
def test_get_query_agg_ts(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would split this test in two: tests that check exceptions and tests that do not
self.assertDictEqual(test_ts_dict2.to_dict(), test_avg_ts_dict) | ||
|
||
def test_get_agg(self): | ||
"""Test the aggregation function""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, add a blank line after the method description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @aswanipranjal for this PR. Overall it looks good to me. I have added just some minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's go with this version, so that we can build on top of it.
@aswanipranjal please take note of @valeriocos suggestion, and split the tests in a separate pr.
Closes #57.
With this PR, all the string queries in esquery.py file have been converted into elasticsearch_dsl objects such as Aggregation, Query and Search.
I've also tried to update the searches and queries to the latest norms that elasticsearch follows by using
match_phrase
rather than usingtype: phrase
to search for queries.get_agg
returns a dictionary containing the search object with the necessary fields.