Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Support for Series.value_counts() #49

Merged
merged 13 commits into from Nov 19, 2019

Conversation

blaklaybul
Copy link
Contributor

This PR addresses #40. It contains the following:

  1. support for Series.value_counts() with an optional size parameter to determine the number of buckets to retrieve from elasticsearch.
  2. documentation for Series.value_counts()
  3. 2 tests, one with the size parameter, and one without.

example usage:

>>> df = ed.DataFrame('localhost', 'flights')
>>> df['Carrier'].value_counts()
Logstash Airways    3331
JetBeats            3274
Kibana Airlines     3234
ES-Air              3220
Name: Carrier, dtype: int64

In implementing this, I reworked DataFrame.nunique(), having it implement _metrics_aggs instead of _terms_aggs to better match how elasticsearch thinks of cardinality aggregations.

Since the cardinality aggregation can work on non-numeric fields, unlike the other operations that implement _metrics_aggs, I've added an optional card parameter to _metrics_aggs. This is not the best solution, but it achieves the desired effect. In the future, the _*_aggs section of operations.py should probably be it's own module in which we implement elasticsearch aggregations as they're categorized in the es docs.

@blaklaybul blaklaybul self-assigned this Nov 19, 2019
@blaklaybul blaklaybul added the topic:series Issue or PR about eland.Series label Nov 19, 2019
Copy link
Contributor

@stevedodson stevedodson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general!

Comments inline.


return s

def _terms_aggs(self, query_compiler, func):
def _terms_aggs(self, query_compiler, func, buckets=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buckets parameter should be called size + add 'Parameters' argument to describe what it's doing.


def hist(self, query_compiler, bins):
return self._hist_aggs(query_compiler, bins)

def _metric_aggs(self, query_compiler, func):
def _metric_aggs(self, query_compiler, func, card=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of card=None we should probably introduce a parameter like field_types which could be numeric or aggregatable to describe the different functionality + add 'Parameters' comment block to method to describe this.


s = pd.Series(data=results, index=results.keys())
s = pd.Series(data=results, index=results.keys(), name=columns[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if columns is empty? Add test case to check behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be possible. df[[]] is an empty dataframe, and value_counts() is a series method, so this would cause an AttributeError to be thrown by the dataframe module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've added two new tests that check for KeyError and AttributeError in the case of passing a nonexistent column name and a DataFrame, respectively.

eland/series.py Outdated

Parameters
----------
size: int, default 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this es_size as it an Elasticsearch specific (non-pandas) parameter. Also, make it clear in the comment this is a non-pandas parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done - 7b1f4d6
also added elasticsearch docs as an external sphinx link as :es_api_docs:

print(type(pd_vc))
print(type(ed_vc))

assert pd_vc == ed_vc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add newline

Copy link
Contributor

@Winterflower Winterflower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the PR! Learnt a lot about how eland works just by going through it!

# some metrics aggs (including cardinality) work on all aggregatable fields
# therefore we include an optional all parameter on operations
# that call _metric_aggs
if card:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line of code has a potential to be confusing since Python treats many values as booleans (for ex. empty list, True/False, 0/1 etc). Would it be possible to rename this variable somehow to give some hint of its type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has been changed to field_types in 6476e65
This part of the code will continue to evolve as we add support for more aggregations

ed_vc = ed_s.value_counts(size=1).to_string()

print(type(pd_vc))
print(type(ed_vc))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious - are we logging the print statements from the tests anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was included unintentionally. good catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed - 18e92db

eland/series.py Outdated
ES-Air 3220
Name: Carrier, dtype: int64
"""
return self._query_compiler.value_counts(size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone passes a size that is somehow invalid?
Do we have any fail fast code for those cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't currently handled. Good catch, will add this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed: e574ef0

@blaklaybul
Copy link
Contributor Author

@stevedodson addressed all of your comments. Please approve when you get a chance so I can merge 💃

@blaklaybul blaklaybul added the enhancement New feature or request label Nov 19, 2019
Copy link
Contributor

@stevedodson stevedodson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of suggestions, but LGTM

pd_vc = pd_s.value_counts().to_string()
ed_vc = ed_s.value_counts().to_string()

assert pd_vc == ed_vc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use pandas.testing.assert_series_equal than string comparisons

pd_vc = pd_s.value_counts()[:1].to_string()
ed_vc = ed_s.value_counts(es_size=1).to_string()

assert pd_vc == ed_vc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use pandas.testing.assert_series_equal than string comparisons

ES-Air 3220
Name: Carrier, dtype: int64
"""
if not isinstance(es_size, int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whether we should do defensive programming here is probably something to discuss - this shouldn't block the PR though + we should use type hints to avoid this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought it would be best to "fail fast" in the top level eland client code instead of passing something to the lower level transport mechanisms and waiting for a cryptic exception to bubble up from there or ES itself.

@blaklaybul blaklaybul merged commit 9c9ca90 into elastic:master Nov 19, 2019
This was referenced Nov 19, 2019
@blaklaybul blaklaybul deleted the value-counts branch November 26, 2019 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request topic:series Issue or PR about eland.Series
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants