New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sample method to eland.DataFrame #196
Conversation
add SampleTask add tests and run black add new file score RandomScore add parameters n and frac add typing, refactor and reorder
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an amazing start to this feature, thank you so much! 💖
I saw you marked as potentially WIP, I left you some starting comments and things to think about. :)
Some general comments:
- Will need to add this to
eland.NDFrame
andeland.Series
as well - Will have to add an RST doc for these methods. I should fix this by adding a generator for docs. It has somewhere to live finally within
utils/
🚀
from eland.tests.common import assert_pandas_eland_frame_equal | ||
|
||
|
||
class TestDataFrameSample(TestData): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually will need to add a test case that calls .sample()
and then other operations such a .head()
, .agg()
, .shape
, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for your thorough review of this PR, I learn a lot with each comment you made. My doubt with the test is how do I assert is working, I mean what assertion should I check. I already fix some minor issues and believe I can solve the others early next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The combination asserts will probably be easier after implementing random_state
. Mostly want to verify that we can add additional queries to our .sample() calls without pulling data from ES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as checking whether .sample()
itself is working you could test that calling .sample(10)
twice gives you two different sets of rows :)
query_params["query_size"] = min(self._count, query_params["query_size"]) | ||
else: | ||
query_params["query_size"] = self._count | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something else to think about here is we want to order by _score
(unless pandas maintains the index
order after a .sample()
call?)
I actually think when I checked this out locally and added the query_params["query_sort_order"] = "score"
I found a bug in TailTask
not picking up the current query_sort_order
when resolving tasks. Something to potentially investigate outside of this issue.
jenkins test this please |
Jenkins test this please |
The changes you submitted look great, I'll review them closely tomorrow and we can get merged soon! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! 💪 A few comments for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Merging this after CI 🎉
jenkins test this please |
Thanks much @mesejo! |
This includes the following items:
This PR, as it stands, is relatively big, this the reason the implementation of sampling with replacement and seed were left out. This also can be consider WIP.
Closes #183