Add sample method to eland.DataFrame #196

mesejo · 2020-04-24T19:15:57Z

This includes the following items:

Add SampleTask
Add tests, blacken, typing and refactor
Add new file score with the class RandomScore
Add parameters n and frac

As it stands, this PR is relatively big, which is why sampling with replacement and seeding were left out. It can also be considered WIP.
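To make the `n` / `frac` parameters concrete, here is a minimal sketch of how the two could be resolved into a single row count. It assumes the pandas convention that `n` and `frac` are mutually exclusive and that `n` defaults to 1; `resolve_sample_size` is a hypothetical helper for illustration, not a function from this PR.

```python
from typing import Optional


def resolve_sample_size(
    total: int, n: Optional[int] = None, frac: Optional[float] = None
) -> int:
    """Return how many rows to sample out of `total` rows (hypothetical helper)."""
    if n is not None and frac is not None:
        raise ValueError("Please enter a value for `frac` OR `n`, not both")
    if n is None and frac is None:
        n = 1  # pandas samples a single row when neither is given
    if frac is not None:
        if not 0.0 <= frac <= 1.0:
            raise ValueError("`frac` must be in [0, 1] when sampling without replacement")
        n = round(frac * total)
    if n < 0:
        raise ValueError("`n` must be non-negative")
    # Assumption: clamp to the population size instead of raising.
    return min(n, total)


print(resolve_sample_size(100, n=10))
print(resolve_sample_size(100, frac=0.25))
```

Clamping oversized requests is a design choice made for this sketch; pandas itself raises an error when asked for more rows than exist without replacement.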

Closes #183

add SampleTask
add tests and run black
add new file score RandomScore
add parameters n and frac
add typing, refactor and reorder

elasticmachine · 2020-04-24T19:15:59Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

sethmlarson

This is an amazing start to this feature, thank you so much! 💖

I saw you marked this as potentially WIP, so I left you some starting comments and things to think about. :)

Some general comments:

Will need to add this to eland.NDFrame and eland.Series as well
Will have to add an RST doc for these methods. I should fix this by adding a generator for docs. It has somewhere to live finally within utils/ 🚀

eland/dataframe.py

eland/query_compiler.py

sethmlarson · 2020-04-24T21:11:36Z

eland/tests/dataframe/test_sample_pytest.py

+from eland.tests.common import assert_pandas_eland_frame_equal
+
+
+class TestDataFrameSample(TestData):


Eventually we will need to add a test case that calls .sample() and then other operations such as .head(), .agg(), .shape, etc.

mesejo · Apr 25, 2020

Thank you so much for your thorough review of this PR; I learned a lot from each comment you made. My doubt about the test is how to assert that it is working, that is, which assertion I should check. I have already fixed some minor issues and believe I can resolve the others early next week.

sethmlarson · Apr 25, 2020

The combination asserts will probably be easier after implementing random_state. Mostly we want to verify that we can add additional queries to our .sample() calls without pulling data from ES.

sethmlarson · Apr 25, 2020

As far as checking whether .sample() itself is working, you could test that calling .sample(10) twice gives you two different sets of rows :)
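The two ideas above (unseeded calls should differ, seeded calls should match) could be sketched as follows. This uses a stdlib stand-in rather than a live Elasticsearch index; `sample_ids` is a hypothetical helper that only mimics the contract of `.sample(n, random_state=...)`, so treat it as an illustration of the assertions, not of eland's implementation.

```python
import random
from typing import List, Optional


def sample_ids(ids: List[int], n: int, random_state: Optional[int] = None) -> List[int]:
    """Stand-in for DataFrame.sample(): pick n ids without replacement."""
    rng = random.Random(random_state)  # None seeds from system entropy
    return rng.sample(ids, n)


def test_unseeded_samples_differ() -> None:
    ids = list(range(10_000))
    first = sample_ids(ids, 10)
    second = sample_ids(ids, 10)
    assert len(set(first)) == 10   # no duplicates: sampling is without replacement
    assert set(first) <= set(ids)  # every sampled row exists in the source
    assert first != second         # equal only with negligible probability


def test_seeded_samples_match() -> None:
    ids = list(range(10_000))
    assert sample_ids(ids, 10, random_state=42) == sample_ids(ids, 10, random_state=42)


test_unseeded_samples_differ()
test_seeded_samples_match()
```

The unseeded check is probabilistic, so it only makes sense when the population is much larger than the sample; the seeded check is the deterministic one that random_state would enable.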

eland/score.py

sethmlarson · 2020-04-24T21:37:45Z

eland/tasks.py

+            query_params["query_size"] = min(self._count, query_params["query_size"])
+        else:
+            query_params["query_size"] = self._count
+


Something else to think about here is we want to order by _score (unless pandas maintains the index order after a .sample() call?)

When I checked this out locally and added query_params["query_sort_order"] = "score", I think I found a bug in TailTask: it does not pick up the current query_sort_order when resolving tasks. Something to potentially investigate outside of this issue.
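For reference, here is a rough sketch of the kind of search body this discussion points at: a `function_score` query with `random_score`, a size clamped as in the diff above, and an explicit sort on `_score`. The function name, its parameters, and the "size was already set" condition are assumptions for illustration; only the Elasticsearch query DSL keys are taken from the Elasticsearch documentation, and none of this is confirmed to match what SampleTask or RandomScore actually emit.

```python
from typing import Any, Dict, Optional


def build_sample_search(
    count: int,
    query: Optional[Dict[str, Any]] = None,
    existing_size: Optional[int] = None,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a search body that returns `count` randomly ranked hits (hypothetical)."""
    random_score: Dict[str, Any] = {}
    if seed is not None:
        # A seeded random_score also needs a field to hash; _seq_no is a common choice.
        random_score = {"seed": seed, "field": "_seq_no"}

    # Assumption: mirror the diff by clamping only when a size was already set.
    size = count if existing_size is None else min(count, existing_size)

    return {
        "size": size,
        "query": {
            "function_score": {
                "query": query or {"match_all": {}},
                "random_score": random_score,
                # Let the random value alone decide the ranking.
                "boost_mode": "replace",
            }
        },
        # Highest random score first, so the top `size` hits form the sample.
        "sort": [{"_score": {"order": "desc"}}],
    }


body = build_sample_search(10, existing_size=5, seed=42)
print(body["size"])
```

Sorting on `_score` is what turns the random scores into a random sample: without it, a sort inherited from an earlier task would decide which rows come back.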

sethmlarson · 2020-04-24T21:43:15Z

jenkins test this please

sethmlarson · 2020-05-03T13:08:03Z

Jenkins test this please

sethmlarson · 2020-05-03T13:08:44Z

The changes you submitted look great, I'll review them closely tomorrow and we can get merged soon!

sethmlarson

This looks great! 💪 A few comments for you.

eland/filter.py

eland/ndframe.py

eland/tests/dataframe/test_sample_pytest.py

sethmlarson

Nice! Merging this after CI 🎉

sethmlarson · 2020-05-04T16:54:52Z

jenkins test this please

sethmlarson · 2020-05-04T17:07:29Z

Thanks much @mesejo!

Add sample method to eland.DataFrame

ff9bb15

add SampleTask
add tests and run black
add new file score RandomScore
add parameters n and frac
add typing, refactor and reorder

sethmlarson suggested changes Apr 24, 2020

mesejo and others added 6 commits April 25, 2020 15:34

fix some issues

8c28913

Add an enforce license headers

3be37ae

Add agg compatibility logic to Field class

d9292a2

Make QueryParams a dataclass

5f6e2d4

Add random_state, refactor test and add documentation

48b8c1a

Merge branch 'master' into issue/183-add-sample

75efe0f

sethmlarson suggested changes May 4, 2020

address requested changes

18e8640

sethmlarson approved these changes May 4, 2020

sethmlarson merged commit 94dbb36 into elastic:master May 4, 2020
