New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds Support for Series.value_counts() #49
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general!
Comments inline.
eland/operations.py
Outdated
|
||
return s | ||
|
||
def _terms_aggs(self, query_compiler, func): | ||
def _terms_aggs(self, query_compiler, func, buckets=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The buckets
parameter should be called size
+ add 'Parameters' argument to describe what it's doing.
eland/operations.py
Outdated
|
||
def hist(self, query_compiler, bins): | ||
return self._hist_aggs(query_compiler, bins) | ||
|
||
def _metric_aggs(self, query_compiler, func): | ||
def _metric_aggs(self, query_compiler, func, card=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of card=None
we should probably introduce a parameter like field_types
which could be numeric
or aggregatable
to describe the different functionality + add 'Parameters' comment block to method to describe this.
eland/operations.py
Outdated
|
||
s = pd.Series(data=results, index=results.keys()) | ||
s = pd.Series(data=results, index=results.keys(), name=columns[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if columns
is empty? Add test case to check behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't be possible. df[[]]
is an empty dataframe, and value_counts()
is a series method, so this would cause an AttributeError
to be thrown by the dataframe module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've added two new tests that check for KeyError
and AttributeError
in the case of passing a nonexistent column name and a DataFrame, respectively.
eland/series.py
Outdated
|
||
Parameters | ||
---------- | ||
size: int, default 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this es_size
as it an Elasticsearch specific (non-pandas) parameter. Also, make it clear in the comment this is a non-pandas parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done - 7b1f4d6
also added elasticsearch docs as an external sphinx link as :es_api_docs:
print(type(pd_vc)) | ||
print(type(ed_vc)) | ||
|
||
assert pd_vc == ed_vc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add newline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the PR! Learnt a lot about how eland works just by going through it!
eland/operations.py
Outdated
# some metrics aggs (including cardinality) work on all aggregatable fields | ||
# therefore we include an optional all parameter on operations | ||
# that call _metric_aggs | ||
if card: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line of code has a potential to be confusing since Python treats many values as booleans (for ex. empty list, True/False, 0/1 etc). Would it be possible to rename this variable somehow to give some hint of its type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has been changed to field_types
in 6476e65
This part of the code will continue to evolve as we add support for more aggregations
ed_vc = ed_s.value_counts(size=1).to_string() | ||
|
||
print(type(pd_vc)) | ||
print(type(ed_vc)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious - are we logging the print statements from the tests anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was included unintentionally. good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed - 18e92db
eland/series.py
Outdated
ES-Air 3220 | ||
Name: Carrier, dtype: int64 | ||
""" | ||
return self._query_compiler.value_counts(size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone passes a size that is somehow invalid?
Do we have any fail fast code for those cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't currently handled. Good catch, will add this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed: e574ef0
@stevedodson addressed all of your comments. Please approve when you get a chance so I can merge 💃 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of suggestions, but LGTM
pd_vc = pd_s.value_counts().to_string() | ||
ed_vc = ed_s.value_counts().to_string() | ||
|
||
assert pd_vc == ed_vc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to use pandas.testing.assert_series_equal
than string comparisons
pd_vc = pd_s.value_counts()[:1].to_string() | ||
ed_vc = ed_s.value_counts(es_size=1).to_string() | ||
|
||
assert pd_vc == ed_vc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to use pandas.testing.assert_series_equal
than string comparisons
ES-Air 3220 | ||
Name: Carrier, dtype: int64 | ||
""" | ||
if not isinstance(es_size, int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whether we should do defensive programming here is probably something to discuss - this shouldn't block the PR though + we should use type hints to avoid this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just thought it would be best to "fail fast" in the top level eland client code instead of passing something to the lower level transport mechanisms and waiting for a cryptic exception to bubble up from there or ES itself.
This PR addresses #40. It contains the following:
Series.value_counts()
with an optionalsize
parameter to determine the number of buckets to retrieve from elasticsearch.Series.value_counts()
size
parameter, and one without.example usage:
In implementing this, I reworked
DataFrame.nunique()
, having it implement_metrics_aggs
instead of_terms_aggs
to better match how elasticsearch thinks of cardinality aggregations.Since the cardinality aggregation can work on non-numeric fields, unlike the other operations that implement
_metrics_aggs
, I've added an optionalcard
parameter to_metrics_aggs
. This is not the best solution, but it achieves the desired effect. In the future, the_*_aggs
section ofoperations.py
should probably be it's own module in which we implement elasticsearch aggregations as they're categorized in the es docs.