Add sample method to eland.DataFrame #196

Merged
merged 8 commits into from May 4, 2020

@mesejo mesejo commented Apr 24, 2020

This includes the following items:

  • Add SampleTask
  • Add tests, blacken, typing and refactor
  • Add new file score with the class RandomScore
  • Add parameters n and frac

This PR, as it stands, is relatively big, which is why sampling with replacement and seeding were left out. It can also be considered WIP.
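For readers unfamiliar with the `n` and `frac` parameters, here is a minimal sketch of how the two could be resolved into a row count, following the pandas convention that they are mutually exclusive and that omitting both means one row. The helper name `resolve_sample_size` is hypothetical and is not taken from this PR; capping at the total row count is an assumption (pandas itself raises when `n` exceeds the number of rows and sampling is without replacement).

```python
from typing import Optional


def resolve_sample_size(
    total_rows: int, n: Optional[int] = None, frac: Optional[float] = None
) -> int:
    """Hypothetical helper: turn pandas-style n/frac into a row count."""
    if n is not None and frac is not None:
        # pandas rejects calls that pass both parameters
        raise ValueError("Please enter a value for `frac` OR `n`, not both")
    if frac is not None:
        if frac < 0:
            raise ValueError("`frac` must be non-negative")
        n = round(frac * total_rows)
    elif n is None:
        n = 1  # pandas samples a single row when neither is given
    if n < 0:
        raise ValueError("`n` must be non-negative")
    # Assumption: cap at the available rows instead of raising
    return min(n, total_rows)


print(resolve_sample_size(100, frac=0.25))
print(resolve_sample_size(5, n=10))
```

Keeping this resolution in one place means the query size only has to be computed once, whichever parameter the caller used.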

Closes #183

 add SampleTask
 add tests and run black
 add new file score RandomScore
 add parameters n and frac
 add typing, refactor and reorder
@elasticmachine

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@sethmlarson sethmlarson left a comment

This is an amazing start to this feature, thank you so much! 💖

I saw you marked this as potentially WIP, so I left you some starting comments and things to think about. :)

Some general comments:

  • Will need to add this to eland.NDFrame and eland.Series as well
  • Will have to add an RST doc for these methods. I should fix this by adding a generator for docs. It has somewhere to live finally within utils/ 🚀

eland/dataframe.py (outdated, resolved)
eland/dataframe.py (outdated, resolved)
eland/query_compiler.py (outdated, resolved)
from eland.tests.common import assert_pandas_eland_frame_equal


class TestDataFrameSample(TestData):
Contributor

Eventually we will need to add a test case that calls .sample() and then other operations such as .head(), .agg(), .shape, etc.

Contributor Author

Thank you so much for your thorough review of this PR; I learned a lot from each comment you made. My question about the test is how to assert that it is working, that is, which assertion I should check. I have already fixed some minor issues and believe I can resolve the others early next week.

Contributor

The combination asserts will probably be easier after implementing random_state. I mostly want to verify that we can add additional queries to our .sample() calls without pulling data from ES.

Contributor

As for checking whether .sample() itself is working, you could test that calling .sample(10) twice gives you two different sets of rows :)
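The check suggested above could look roughly like the sketch below. It is written against a generic `draw` callable so it runs without Elasticsearch; `samples_differ` and `fake_draw` are hypothetical names, and the local `random.sample` call is only a stand-in for an eland frame. In the real test suite the callable would wrap the frame's `.sample(n)` and return the resulting index values.

```python
import random
from typing import Callable, Hashable, Iterable


def samples_differ(draw: Callable[[int], Iterable[Hashable]], n: int = 10) -> bool:
    """Return True if two consecutive draws of n rows give different row-id sets."""
    first = set(draw(n))
    second = set(draw(n))
    # Without replacement, each draw should contain n distinct rows
    assert len(first) == n and len(second) == n
    return first != second


# Stand-in for an Elasticsearch-backed frame: 10,000 row ids sampled locally.
_row_ids = range(10_000)
_rng = random.Random(0)


def fake_draw(n: int) -> Iterable[Hashable]:
    # Assumption: mimics what a working .sample(n) would return as an index
    return _rng.sample(_row_ids, n)


assert samples_differ(fake_draw, 10)
```

A test like this can in principle fail by chance, but for an index with many more rows than `n` the probability of drawing the same set twice is negligible.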

eland/score.py (outdated, resolved)
    query_params["query_size"] = min(self._count, query_params["query_size"])
else:
    query_params["query_size"] = self._count

Contributor

Something else to think about here is we want to order by _score (unless pandas maintains the index order after a .sample() call?)

When I checked this out locally and added query_params["query_sort_order"] = "score", I think I found a bug in TailTask: it does not pick up the current query_sort_order when resolving tasks. That is something to potentially investigate outside of this issue.
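To make the ordering point concrete, here is a rough sketch of an Elasticsearch search body that samples via `random_score` and sorts by `_score`. The function name `random_sample_body` is hypothetical, and eland's actual RandomScore class and query_sort_order handling may build the request differently; only the `function_score`, `random_score`, `boost_mode` and `sort` keys are standard Elasticsearch query DSL.

```python
from typing import Any, Dict, Optional


def random_sample_body(
    size: int, base_query: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Hypothetical helper: search body returning `size` randomly scored hits."""
    return {
        "size": size,
        "query": {
            "function_score": {
                # Any existing filter stays in place; default to all documents
                "query": base_query or {"match_all": {}},
                # An empty random_score assigns each document a random score
                "random_score": {},
                # Use only the random value, ignoring the base query's score
                "boost_mode": "replace",
            }
        },
        # Highest random score first, so the top `size` hits form the sample
        "sort": [{"_score": {"order": "desc"}}],
    }


body = random_sample_body(10, {"term": {"Cancelled": True}})
print(body["query"]["function_score"]["query"])
```

Sorting on `_score` is what turns the random scores into a random selection: without it, a different sort order would return the same leading rows every time.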

@sethmlarson

jenkins test this please

@sethmlarson

Jenkins test this please

@sethmlarson

The changes you submitted look great, I'll review them closely tomorrow and we can get merged soon!

@sethmlarson sethmlarson left a comment

This looks great! 💪 A few comments for you.

eland/filter.py (outdated, resolved)
eland/filter.py (outdated, resolved)
eland/ndframe.py (resolved)
eland/tests/dataframe/test_sample_pytest.py (outdated, resolved)

@sethmlarson sethmlarson left a comment

Nice! Merging this after CI 🎉

@sethmlarson

jenkins test this please

@sethmlarson sethmlarson merged commit 94dbb36 into elastic:master May 4, 2020
@sethmlarson

Thanks much @mesejo!

Successfully merging this pull request may close these issues.

Implement DataFrame.sample() via random_score
3 participants