
Feature Request: df.apply() #350

Open
aamirkhan34 opened this issue Jun 1, 2021 · 4 comments
@aamirkhan34

Requesting the feature df.apply() . I did not find any issues regarding this.

Thanks.

@sethmlarson
Contributor

This is unlikely to be implemented, as it's no more efficient than ed_df.to_pandas().apply(). Is there a particular use-case that would be more efficient to implement and desirable?
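To illustrate the equivalence being described, the pattern below can be sketched with pandas alone; the DataFrame contents and column name are illustrative, and an eland user would obtain pdf via ed_df.to_pandas(), which pulls every matching document client-side first:

```python
import pandas as pd

# Illustrative data; with eland this frame would come from
# pdf = ed_df.to_pandas(), materializing all documents locally.
pdf = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# apply() runs an arbitrary Python function on the client, which is
# exactly why an eland df.apply() could not run inside the cluster.
doubled = pdf["price"].apply(lambda x: x * 2)
print(doubled.tolist())  # [20.0, 40.0, 60.0]
```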

@aamirkhan34
Author

Thanks @sethmlarson.
The ed_df.to_pandas() method is slow and might run out of memory for larger samples. I want to harness the power of our Elasticsearch cluster to process the eland DataFrame using an apply method. This would be very efficient for our process.

What do you think?

Thanks.

@sethmlarson
Contributor

Unfortunately apply is very generic, you can basically pass it anything and we can't transform arbitrary Python functions into an Elasticsearch query. Is there some operation(s) in particular you're interested in?
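For contrast, a reduction such as a column mean *can* be pushed down, because it maps directly onto an Elasticsearch "avg" aggregation. A minimal sketch of the kind of query body such a pushdown produces; the field name "price" and the helper mean_query are illustrative assumptions, not from this thread:

```python
import json

def mean_query(field):
    """Build an Elasticsearch search body computing the average of `field`."""
    return {
        "size": 0,  # return no documents, only the aggregation result
        "aggs": {f"{field}_avg": {"avg": {"field": field}}},
    }

# Roughly what an eland expression like ed_df["price"].mean()
# would be translated into before being sent to the cluster.
print(json.dumps(mean_query("price")))
```

Arbitrary Python callables passed to apply() have no such translation, which is the limitation described above.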

@kxbin
Contributor

kxbin commented Aug 3, 2021

> Thanks @sethmlarson.
> The ed_df.to_pandas() method is slow and might run out of memory for larger samples. I want to harness the power of our Elasticsearch cluster to process the eland DataFrame using an apply method. This would be very efficient for our process.
>
> What do you think?
>
> Thanks.

Yeah, I also found that the ed_df.to_pandas() method is very slow for larger samples.

So maybe we can process the data in batches, like this:

pd_df_iterator = ed_df.to_pandas_in_batch(batch_size=1000)
for pd_df in pd_df_iterator:
    pd_df.apply(func)  # func: whatever per-batch function you need

After testing, I found that the speed has increased a lot, because the amount of data in each batch is bounded by batch_size.
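The batching pattern itself can be sketched in plain Python, independent of eland; iter_batches below is a hypothetical stand-in for the proposed to_pandas_in_batch(), and summing each batch stands in for the per-batch apply():

```python
def iter_batches(rows, batch_size=1000):
    """Yield successive slices of `rows`, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Process each batch independently so peak memory is bounded by
# batch_size rather than by the full result set.
results = []
for batch in iter_batches(list(range(10)), batch_size=4):
    results.append(sum(batch))  # stand-in for pd_df.apply(...)

print(results)  # [6, 22, 17]
```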

Here is a pull request to handle this situation:
Add to_pandas_in_batch() DataFrame API #369
