Adds Support for Series.value_counts() #49

blaklaybul · 2019-11-19T00:21:12Z

This PR addresses #40. It contains the following:

support for Series.value_counts(), with an optional size parameter that sets the number of buckets to retrieve from Elasticsearch.
documentation for Series.value_counts()
2 tests, one with the size parameter, and one without.

example usage:

>>> df = ed.DataFrame('localhost', 'flights')
>>> df['Carrier'].value_counts()
Logstash Airways    3331
JetBeats            3274
Kibana Airlines     3234
ES-Air              3220
Name: Carrier, dtype: int64

In implementing this, I reworked DataFrame.nunique() so that it uses _metric_aggs instead of _terms_aggs, which better matches how Elasticsearch treats cardinality aggregations.

Since the cardinality aggregation can work on non-numeric fields, unlike the other operations that use _metric_aggs, I've added an optional card parameter to _metric_aggs. This is not the best solution, but it achieves the desired effect. In the future, the _*_aggs section of operations.py should probably be its own module, in which we implement Elasticsearch aggregations as they're categorized in the ES docs.
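
For reference, here is a rough sketch of the two kinds of request body discussed above, assuming the standard Elasticsearch terms and cardinality aggregations. The helper names and the aggregation keys ("value_counts", "nunique") are made up for illustration and are not taken from eland's code.

```python
def terms_agg_body(field, size=10):
    """Search body for a terms aggregation: one bucket per distinct value."""
    return {
        "size": 0,  # return no hits, only aggregation results
        "aggs": {"value_counts": {"terms": {"field": field, "size": size}}},
    }


def cardinality_agg_body(field):
    """Search body for a cardinality (approximate distinct count) aggregation."""
    return {
        "size": 0,
        "aggs": {"nunique": {"cardinality": {"field": field}}},
    }


def buckets_to_counts(response, agg_name="value_counts"):
    """Flatten the buckets of a terms aggregation response into {key: doc_count}."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written response in the shape Elasticsearch returns for a terms aggregation
fake_response = {
    "aggregations": {
        "value_counts": {
            "buckets": [
                {"key": "Logstash Airways", "doc_count": 3331},
                {"key": "JetBeats", "doc_count": 3274},
            ]
        }
    }
}
counts = buckets_to_counts(fake_response)
```

The terms aggregation returns buckets (hence the size parameter), while cardinality returns a single number per field, which is why it sits more naturally with the metric aggregations.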

stevedodson

Looks good in general!

Comments inline.

stevedodson · 2019-11-19T08:55:24Z

eland/operations.py


        return s

-    def _terms_aggs(self, query_compiler, func):
+    def _terms_aggs(self, query_compiler, func, buckets=None):


The buckets parameter should be called size. Also add a 'Parameters' section to the docstring describing what it does.

stevedodson · 2019-11-19T08:55:47Z

eland/operations.py


    def hist(self, query_compiler, bins):
        return self._hist_aggs(query_compiler, bins)

-    def _metric_aggs(self, query_compiler, func):
+    def _metric_aggs(self, query_compiler, func, card=None):


Instead of card=None, we should probably introduce a parameter like field_types, which could be numeric or aggregatable, to describe the different functionality. Also add a 'Parameters' comment block to the method to describe this.
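
One possible shape for this, as a sketch only: the function name, the accepted field_types strings, and the toy field mapping below are assumptions for illustration, not eland's actual implementation.

```python
# Hypothetical mapping of field name -> (Elasticsearch type, aggregatable flag)
FIELDS = {
    "price": ("float", True),
    "Carrier": ("keyword", True),
    "notes": ("text", False),
}

NUMERIC_ES_TYPES = {
    "byte", "short", "integer", "long",
    "float", "double", "half_float", "scaled_float",
}


def select_fields(fields, field_types=None):
    """Return the names of the fields an aggregation may run on.

    Parameters
    ----------
    fields: dict
        Field name -> (Elasticsearch type, aggregatable flag).
    field_types: str, default None
        None or "numeric" selects aggregatable numeric fields only;
        "aggregatable" selects every aggregatable field.
    """
    if field_types is None or field_types == "numeric":
        return [
            name
            for name, (es_type, aggregatable) in fields.items()
            if aggregatable and es_type in NUMERIC_ES_TYPES
        ]
    if field_types == "aggregatable":
        return [name for name, (_, aggregatable) in fields.items() if aggregatable]
    raise ValueError(f"unknown field_types: {field_types!r}")
```

With this shape, a cardinality-style call would pass field_types="aggregatable", while the numeric-only metric aggregations keep the default.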

stevedodson · 2019-11-19T08:57:21Z

eland/operations.py


-        s = pd.Series(data=results, index=results.keys())
+        s = pd.Series(data=results, index=results.keys(), name=columns[0])


What if columns is empty? Add a test case to check the behaviour.

This shouldn't be possible. df[[]] is an empty DataFrame, and value_counts() is a Series method, so calling it would cause an AttributeError to be thrown by the dataframe module.

I've added two new tests that check for KeyError and AttributeError when passing a nonexistent column name and a DataFrame, respectively.

stevedodson · 2019-11-19T08:58:58Z

eland/series.py

+
+        Parameters
+        ----------
+        size: int, default 10


Maybe call this es_size, as it is an Elasticsearch-specific (non-pandas) parameter. Also, make it clear in the comment that this is a non-pandas parameter.

done - 7b1f4d6
also added elasticsearch docs as an external sphinx link as :es_api_docs:

stevedodson · 2019-11-19T08:59:14Z

eland/tests/series/test_value_counts_pytest.py

+        print(type(pd_vc))
+        print(type(ed_vc))
+
+        assert pd_vc == ed_vc


add newline

Winterflower

Thanks for making the PR! Learnt a lot about how eland works just by going through it!

Winterflower · 2019-11-19T14:04:40Z

eland/operations.py

+        # some metrics aggs (including cardinality) work on all aggregatable fields
+        # therefore we include an optional all parameter on operations
+        # that call _metric_aggs
+        if card:


This line of code has the potential to be confusing, since Python treats many values as truthy or falsy (e.g. an empty list, True/False, 0/1). Would it be possible to rename this variable to give some hint of its type?

this has been changed to field_types in 6476e65
This part of the code will continue to evolve as we add support for more aggregations
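
To illustrate the truthiness concern, here is a small sketch. Both function names are hypothetical, and the "aggregatable" string is assumed from the review suggestion rather than copied from the final code.

```python
def wants_all_fields_truthy(card=None):
    """Original style: relies on truthiness, so the expected type is unclear."""
    return True if card else False


def wants_all_fields_explicit(field_types=None):
    """Explicit style: only the documented string value enables the branch."""
    return field_types == "aggregatable"


# Five falsy values followed by four truthy ones
candidates = [None, False, 0, "", [], True, 1, "yes", [0]]
truthy_results = [wants_all_fields_truthy(c) for c in candidates]
```

The truthy version accepts any non-empty value, including ones a caller passed by mistake; the explicit version only reacts to the one documented value.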

Winterflower · 2019-11-19T14:11:51Z

eland/tests/series/test_value_counts_pytest.py

+        ed_vc = ed_s.value_counts(size=1).to_string()
+
+        print(type(pd_vc))
+        print(type(ed_vc))


Curious - are we logging the print statements from the tests anywhere?

this was included unintentionally. good catch!

fixed - 18e92db

Winterflower · 2019-11-19T14:13:47Z

eland/series.py

+        ES-Air              3220
+        Name: Carrier, dtype: int64
+        """
+        return self._query_compiler.value_counts(size)


What if someone passes a size that is somehow invalid?
Do we have any fail fast code for those cases?

This isn't currently handled. Good catch, will add this.

fixed: e574ef0

blaklaybul · 2019-11-19T14:40:47Z

@stevedodson addressed all of your comments. Please approve when you get a chance so I can merge 💃

stevedodson

Couple of suggestions, but LGTM

stevedodson · 2019-11-19T16:10:25Z

eland/tests/series/test_value_counts_pytest.py

+        pd_vc = pd_s.value_counts().to_string()
+        ed_vc = ed_s.value_counts().to_string()
+
+        assert pd_vc == ed_vc


better to use pandas.testing.assert_series_equal than string comparisons
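
A short sketch of why this matters, using hand-built Series rather than the eland test fixtures: Series.to_string() leaves out the name and dtype by default, so a string comparison can pass even when those differ.

```python
import pandas as pd
from pandas.testing import assert_series_equal

index = ["Logstash Airways", "JetBeats"]
expected = pd.Series([3331, 3274], index=index, name="Carrier")
# Same values and index, but a different name
actual = pd.Series([3331, 3274], index=index, name="Other")

# to_string() omits the name and dtype by default, so the strings are identical
strings_match = expected.to_string() == actual.to_string()

# assert_series_equal also compares name, dtype and index, so it catches the difference
try:
    assert_series_equal(expected, actual)
    series_match = True
except AssertionError:
    series_match = False
```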

stevedodson · 2019-11-19T16:10:35Z

eland/tests/series/test_value_counts_pytest.py

+        pd_vc = pd_s.value_counts()[:1].to_string()
+        ed_vc = ed_s.value_counts(es_size=1).to_string()
+
+        assert pd_vc == ed_vc


better to use pandas.testing.assert_series_equal than string comparisons

stevedodson · 2019-11-19T16:13:10Z

eland/series.py

+        ES-Air              3220
+        Name: Carrier, dtype: int64
+        """
+        if not isinstance(es_size, int):


whether we should do defensive programming here is probably something to discuss - this shouldn't block the PR though + we should use type hints to avoid this.

I just thought it would be best to "fail fast" in the top level eland client code instead of passing something to the lower level transport mechanisms and waiting for a cryptic exception to bubble up from there or ES itself.
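
For illustration, a fail-fast check along these lines could look like the sketch below. The function name validate_es_size and the error messages are hypothetical, and this is not the code from e574ef0; the type hint documents the intent, while the runtime check catches callers that ignore it.

```python
def validate_es_size(es_size: int) -> int:
    """Return es_size unchanged if it is a positive integer, otherwise raise."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(es_size, bool) or not isinstance(es_size, int):
        raise TypeError(f"es_size must be an int, got {type(es_size).__name__}")
    if es_size <= 0:
        raise ValueError(f"es_size must be a positive integer, got {es_size}")
    return es_size


def describe_failure(value):
    """Helper for the usage example: name of the exception raised, or None."""
    try:
        validate_es_size(value)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None
```

Calling validate_es_size at the top of value_counts() would surface a clear error before any request is sent to Elasticsearch.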

blaklaybul added 3 commits November 18, 2019 19:10

adds support for series.value_counts

97352cd

adds docs for series.value_counts

aa230c5

adds tests for series.value_counts

de243bd

blaklaybul requested a review from stevedodson November 19, 2019 00:21

blaklaybul self-assigned this Nov 19, 2019

blaklaybul added the topic:series Issue or PR about eland.Series label Nov 19, 2019

stevedodson requested changes Nov 19, 2019

blaklaybul added 2 commits November 19, 2019 08:34

updates keyerror language

4af21c5

adds es docs as an external source

dba2035

Winterflower reviewed Nov 19, 2019

blaklaybul added 5 commits November 19, 2019 09:19

adds parameters for metrics and terms aggs

6476e65

adds 2 tests to check for exceptions

2a9c686

explains the size parameter

7b1f4d6

removes print statements from tests

18e92db

checks that es_size is a positive integer

e574ef0

blaklaybul added the enhancement New feature or request label Nov 19, 2019

Merge branch 'master' into value-counts

f417619

stevedodson approved these changes Nov 19, 2019

blaklaybul added 2 commits November 19, 2019 11:25

implements assert_series_equal

af269a9

Merge branch 'value-counts' of github.com:blaklaybul/eland into value…

e11e38c

…-counts

blaklaybul merged commit 9c9ca90 into elastic:master Nov 19, 2019

This was referenced Nov 19, 2019

Add support for value counts #40

Closed

Support unique() on Series #62

Closed

blaklaybul deleted the value-counts branch November 26, 2019 16:43
