Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add iterrows() and itertuples() DataFrame API, its usage is similar to pandas #380

Merged
merged 11 commits into from Aug 20, 2021

Conversation

kxbin
Copy link
Contributor

@kxbin kxbin commented Aug 18, 2021

I'm sorry about this, because I want to squash multiple commits, and accidentally modified the branch name.
So that the PR was automatically closed unexpectedly.

For historical Conversation see PR #369

@sethmlarson Thanks for your comment
Based on these suggestions, I have completed the modification, And passed the lint and docs jobs.

Thanks to this change in #379, no need to convert _es_results_to_pandas into a generator now, because they have similar effects.
Now, the performance has been further improved and the logic is more concise.

Finally, the same 50,000 data sets, my test results are as follows:

ed.iterrows(),  It took a total of `19 seconds` after the iteration
ed.itertuples(), It took a total of `15 seconds` after the iteration
ed.to_pandas(), It took `15 seconds`

Closes #345

@elasticmachine
Copy link

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

Copy link
Contributor

@V1NAY8 V1NAY8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank You for all these changes.

eland/dataframe.py Outdated Show resolved Hide resolved
eland/dataframe.py Outdated Show resolved Hide resolved
eland/dataframe.py Outdated Show resolved Hide resolved
eland/operations.py Outdated Show resolved Hide resolved
tests/dataframe/test_iterrows_itertuples_pytest.py Outdated Show resolved Hide resolved
tests/dataframe/test_iterrows_itertuples_pytest.py Outdated Show resolved Hide resolved
@sethmlarson
Copy link
Contributor

Jenkins test this please

@kxbin
Copy link
Contributor Author

kxbin commented Aug 18, 2021

Thank You for all these changes.

Ok, wait a minute, let me try.

@kxbin
Copy link
Contributor Author

kxbin commented Aug 18, 2021

I run pytest --doctest-modules eland/tests/ failed.

Raise a elasticsearch.exceptions.NotFoundError: no such index [flights]

Maybe it’s because I don’t have the ES_TEST_CLIENT environment

@V1NAY8
Copy link
Contributor

V1NAY8 commented Aug 18, 2021

If you are using linux and have docker installed
Do the following:

>>> sysctl -w vm.max_map_count=262144
>>> ELASTICSEARCH_VERSION=elasticsearch:7.x-SNAPSHOT .ci/run-elasticsearch.sh

leave the above terminal open

Open a different terminal and then

python -m tests.setup_tests

Then run tests

@sethmlarson
Copy link
Contributor

I'd recommend only running the tests you need, some of the ML tests require a lot of resources.

@kxbin
Copy link
Contributor Author

kxbin commented Aug 19, 2021

@V1NAY8 Thank you so much for the tutorial.

I successfully set up a test environment.
After further modification and testing, it can now pass the doctests.

@sethmlarson
Copy link
Contributor

jenkins test this please

@kxbin
Copy link
Contributor Author

kxbin commented Aug 20, 2021

Oops,It seems that the test command executed by Jenkins is not the same as the one I tested locally.

Jenkins executes this command:
python -m pytest --cov-report term-missing --cov=eland/ --doctest-modules --nbval eland/tests/

Let me modify it again, I probably know the problem.

@sethmlarson
Copy link
Contributor

Awesome! Thanks for all the work you've put into this, I and others really excited to use this feature! 🤩

Copy link
Contributor

@V1NAY8 V1NAY8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@sethmlarson
Copy link
Contributor

jenkins test this please

@sethmlarson
Copy link
Contributor

jenkins test this please

@sethmlarson
Copy link
Contributor

jenkins test this please

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some modifications to your test suite to make it more concise and remove the no-ops. One more round of testing and this should be good to merge! ✨

@sethmlarson
Copy link
Contributor

jenkins test this please

@sethmlarson
Copy link
Contributor

Looks like CI is failing after removing the generator exhaustion inside of list(...). The itertuples doesn't look like a failure it's just a precision thing with floats. The iterrows failure looks stranger, will have to dive deeper. Going to check it out locally.

Btw you should do a git pull on your branch as I've made a handful of changes.

@sethmlarson
Copy link
Contributor

jenkins test this please

@sethmlarson
Copy link
Contributor

There was nothing wrong with the functionality, just the tests needed slightly different assert to make sure things worked. I think we're nearly there!

@sethmlarson sethmlarson merged commit 1aa193d into elastic:master Aug 20, 2021
@sethmlarson
Copy link
Contributor

Woo! 🎉 Thank you so much for this @kxbin! I'll mention your name next to this contribution in the next release changelog.

I hope you enjoyed contributing to Eland, if you've got time and are interested there are plenty of other issues to work on that I can assist with. You can also submit this contribution to the Elastic Contributor Program which recognizes community contributors and sends you prizes and swag too 🎁

@kxbin
Copy link
Contributor Author

kxbin commented Aug 21, 2021

Woo! 🎉 Thank you so much for this @kxbin! I'll mention your name next to this contribution in the next release changelog.

I hope you enjoyed contributing to Eland, if you've got time and are interested there are plenty of other issues to work on that I can assist with. You can also submit this contribution to the Elastic Contributor Program which recognizes community contributors and sends you prizes and swag too 🎁

Thank you very much for your assistance.
I am very happy to be able to use these new features in eland in the future! 😋

@kxbin kxbin deleted the master branch August 31, 2021 01:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can we get batch data use df.to_pandas() in the case of big data?
4 participants