Add iterrows() and itertuples() DataFrame APIs; usage is similar to pandas #369

Closed
wants to merge 30 commits

Conversation

@kxbin (Contributor) commented Aug 3, 2021

Related to this issue:
Can we get batch data using df.to_pandas() in the case of big data? Closes #345

This adds the to_pandas_in_batch() DataFrame API.

We can then use the code below to get batched DataFrames when working with big data:

pd_df_iterator = ed_df.to_pandas_in_batch(batch_size=1000)
for pd_df in pd_df_iterator:
    print(pd_df)

If something in this code is wrong, please give some suggestions. Thank you!

@elasticmachine

Since this is a community-submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@cla-checker-service bot commented Aug 3, 2021

💚 CLA has been signed

@V1NAY8 (Contributor) commented Aug 3, 2021

So, I am thinking that with #368 we will eliminate the scroll API.
The current logic is: if batch_size is less than 10k (hard-coded) we use search, otherwise we use scroll.

The search API accepts only a 10k size at once; beyond that we have to use pagination to fetch the next set of results.

I am thinking we can add a new parameter to to_pandas called batch_size which can take values up to 10k.
If given more than that, we throw a warning saying 10k is the maximum and fetch results with the max batch size.

We could also turn the existing method into an iterator by default.

  • So, is a different method required for this?
  • Also, we need to document how this iterator has to be used.

Once this is done, I am thinking Collector won't be required, since the logic will be straightforward.

@sethmlarson What do you think ?
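
For illustration only, a rough sketch of how the proposed batch_size parameter might have been used; to_pandas(batch_size=...) here is hypothetical and was never merged in this form:

import eland as ed

ed_df = ed.DataFrame('localhost:9200', 'flights')

# Hypothetical: batch_size capped at the Elasticsearch search limit of 10,000.
# Each pd_df would be a plain pandas DataFrame of at most batch_size rows.
for pd_df in ed_df.to_pandas(batch_size=10000):
    print(pd_df.shape)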

@kxbin (Contributor, Author) commented Aug 3, 2021

> So, I am thinking that with #368 we will eliminate the scroll API.
> The current logic is: if batch_size is less than 10k (hard-coded) we use search, otherwise we use scroll.
>
> The search API accepts only a 10k size at once; beyond that we have to use pagination to fetch the next set of results.
>
> I am thinking we can add a new parameter to to_pandas called batch_size which can take values up to 10k.
> If given more than that, we throw a warning saying 10k is the maximum and fetch results with the max batch size.
>
> We could also turn the existing method into an iterator by default.
>
>   • So, is a different method required for this?
>   • Also, we need to document how this iterator has to be used.
>
> Once this is done, I am thinking Collector won't be required, since the logic will be straightforward.
>
> @sethmlarson What do you think ?

@V1NAY8 Thanks for the reply

I agree with you.

We can throw a warning saying 10k is the maximum and fetch results with the max batch size. I will modify it now.

In addition, I also considered turning the existing method into an iterator by default,
but will some users want to get all the data at once? (Compatibility is also a consideration.)

Maybe we can determine whether to return an iterator or a DataFrame based on the batch_size parameter (whether batch_size is None or not None).
How about this?

@V1NAY8 (Contributor) commented Aug 3, 2021

@kxbin Don't throw a warning in this PR for now. The existing code can run for both < 10k and > 10k. I will add it if needed in my other changes.

@kxbin (Contributor, Author) commented Aug 3, 2021

@V1NAY8 Okay, I'll leave throwing the warning to you.

@V1NAY8 (Contributor) commented Aug 3, 2021

So, we can turn the internals into an iterator by default. The to_pandas() method, which dumps all the data at once, can be changed to iterate internally, construct a DataFrame, and return it.
We can expose another method just to expose the iterator.
That way compatibility will be preserved 😛
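
As a loose illustration of this idea (the class and method names below are made up for the sketch and are not eland's actual internals): the internal fetch becomes a generator, to_pandas() simply consumes it fully, and a second method exposes the iterator directly.

import pandas as pd

class BatchedFrame:
    # Hypothetical stand-in for an eland-like DataFrame; records represent
    # documents already fetched from Elasticsearch.
    def __init__(self, records, batch_size=1000):
        self._records = records
        self._batch_size = batch_size

    def _iter_batches(self):
        # Internal generator: one small pandas DataFrame per batch of records.
        for start in range(0, len(self._records), self._batch_size):
            yield pd.DataFrame(self._records[start:start + self._batch_size])

    def to_pandas(self):
        # Existing behaviour preserved: everything in a single DataFrame.
        return pd.concat(self._iter_batches(), ignore_index=True)

    def to_pandas_in_batches(self):
        # New method that just exposes the internal iterator.
        return self._iter_batches()

With this shape, to_pandas() keeps the old all-at-once behaviour, while the second (hypothetically named) method gives the per-batch view, so existing callers are unaffected.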

@kxbin (Contributor, Author) commented Aug 3, 2021

> So, we can turn the internals into an iterator by default. The to_pandas() method, which dumps all the data at once, can be changed to iterate internally, construct a DataFrame, and return it.
> We can expose another method just to expose the iterator.
> That way compatibility will be preserved 😛

This idea is great 😃

@sethmlarson (Contributor)

Thanks @kxbin and @V1NAY8 for your interest in this feature. I discussed this with the team, and we think it makes sense to implement iterrows(), which the user can transform into "chunking/batching".

The reason we'd like to do this instead of a DataFrame chunking API is that it provides compatibility with pandas and allows only two ways to view your data ("all" the data, or every row) instead of any number of views based on batch_size. Does this make sense?

@kxbin (Contributor, Author) commented Aug 5, 2021

> Thanks @kxbin and @V1NAY8 for your interest in this feature. I discussed this with the team, and we think it makes sense to implement iterrows(), which the user can transform into "chunking/batching".
>
> The reason we'd like to do this instead of a DataFrame chunking API is that it provides compatibility with pandas and allows only two ways to view your data ("all" the data, or every row) instead of any number of views based on batch_size. Does this make sense?

@sethmlarson Thanks for the reply

Yeah, if we can be compatible with pandas, it makes a lot of sense to do so,
because all users are very familiar with how pandas works.

We can implement the following methods:
ed.DataFrame.iterrows()
ed.DataFrame.itertuples()

and make their usage similar to pandas:
pandas.DataFrame.iterrows()
pandas.DataFrame.itertuples()

Please give me some time; I think I can finish it. 😛
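
Assuming these methods mirror the pandas semantics as proposed, usage would look roughly like this (field names assume the Kibana flights sample data set):

import eland as ed

ed_flights = ed.DataFrame('localhost:9200', 'flights')

# iterrows(): yields (index, row) pairs, each row being a pandas Series
for index, row in ed_flights.iterrows():
    print(index, row['Carrier'])
    break

# itertuples(): yields one namedtuple per row
for row in ed_flights.itertuples():
    print(row.Index, row.Carrier)
    break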

@sethmlarson (Contributor)

Awesome, sounds great! 💪 Let me know if you have questions.

@kxbin (Contributor, Author) commented Aug 5, 2021

> Awesome, sounds great! 💪 Let me know if you have questions.

Okay 😃

@kxbin kxbin marked this pull request as draft August 5, 2021 09:09
Review comments on eland/dataframe.py (×2, outdated, resolved)
@sethmlarson (Contributor)

Btw you may want to wait for #370 to land and base your work off of this function that's been added. It may help you out a lot!

@kxbin (Contributor, Author) commented Aug 6, 2021

> Btw you may want to wait for #370 to land and base your work off of this function that's been added. It may help you out a lot!

Thank you for your corrections and tips.

I will wait for it to land, otherwise _es_result() might be modified at the same time and there will be conflicts.

@sethmlarson (Contributor)

@kxbin I've merged #370, now it should be easier to implement this feature using eland.operations.search_after_with_pit()

@kxbin (Contributor, Author) commented Aug 9, 2021

> @kxbin I've merged #370, now it should be easier to implement this feature using eland.operations.search_after_with_pit()

Thanks for the tips.

@kxbin (Contributor, Author) commented Aug 17, 2021

I think I have succeeded in the optimization.
With the same 50,000-row data set, the test speeds are now as follows:

ed.iterrows(): the full iteration took a total of `21 seconds`
ed.itertuples(): the full iteration took a total of `25 seconds`
ed.to_pandas(): took `16 seconds`

This is a good idea!

I tried converting QueryCompiler._es_results_to_pandas() into a generator itself and did some speed tests.

I used a data set of 50,000 rows for the test, and the results are as follows:

Before conversion:

ed.iterrows(): the full iteration took a total of `2 minutes 30 seconds`
ed.itertuples(): the full iteration took a total of `3 minutes 53 seconds`
ed.to_pandas(): took `15 seconds`

After conversion:

ed.iterrows(): the full iteration took a total of `1 minute 52 seconds`
ed.itertuples(): the full iteration took a total of `2 minutes 11 seconds`
ed.to_pandas(): took `1 minute 54 seconds`
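
To make the idea concrete, here is a hedged, simplified illustration of what "converting the results conversion into a generator" means; this is not the real QueryCompiler._es_results_to_pandas() code, and the field flattening and dtype handling that eland performs are omitted:

import pandas as pd

def es_hits_to_pandas_batches(hits, chunk_size=1000):
    # Instead of accumulating every hit and building one big DataFrame at the
    # end, yield a small DataFrame for each chunk of hits.
    rows = []
    for hit in hits:
        rows.append(hit['_source'])
        if len(rows) >= chunk_size:
            yield pd.DataFrame(rows)
            rows = []
    if rows:
        yield pd.DataFrame(rows)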

@kxbin kxbin requested a review from sethmlarson August 17, 2021 06:39
@sethmlarson (Contributor)

jenkins test this please

@sethmlarson (Contributor) left a review comment

This is really great for a first contribution, thanks for working on this! I have some comments below for you; however, I also have a higher-level comment: I think our search_after_hits implementation should be yielding "batches" of documents instead of individual documents.

I think this makes more sense because the "post-processing" phase is much more efficient when called less frequently, and it means we'll end up creating fewer pd.DataFrame objects. We'll still be able to use the current setup in itertuples and iterrows, with yield from pd_df.itertuples(). I can accomplish this in a separate PR.

Review comments (outdated, resolved) on: eland/operations.py (×3), eland/query_compiler.py (×2), docs/sphinx/reference/api/eland.DataFrame.iterrows.rst, docs/sphinx/reference/api/eland.DataFrame.itertuples.rst
@sethmlarson (Contributor)

Looks like the lint and docs jobs are failing; make sure nox -rs format and nox -rs docs pass.

@sethmlarson (Contributor) commented Aug 17, 2021

@kxbin Take a look at the change in #379 and adapt your PR to do essentially:

def iterrows():
    ... (setup)
    for hits in _search_yield_hits(...):
        df = _es_results_to_pandas(hits)
        df = self._post_process_...(df)
        yield from df.iterrows()

Perhaps you can even encapsulate the logic within (setup) but maybe that can be a separate PR. Focus on getting iterrows() and itertuples() integrated first :)

@kxbin kxbin deleted the branch elastic:master August 18, 2021 07:16
@kxbin kxbin closed this Aug 18, 2021
@kxbin kxbin deleted the master branch August 18, 2021 07:16
@th0ger commented Aug 18, 2021

> I am thinking we can add a new parameter to to_pandas called batch_size which can take values up to 10k.
> If given more than that, we throw a warning saying 10k is the maximum and fetch results with the max batch size.

I think this would be useful, to allow raising the scroll size from the eland default of 1000 to the Elasticsearch maximum of 10k.

@kxbin (Contributor, Author) commented Aug 18, 2021

> I am thinking we can add a new parameter to to_pandas called batch_size which can take values up to 10k.
> If given more than that, we throw a warning saying 10k is the maximum and fetch results with the max batch size.
>
> I think this would be useful, to allow raising the scroll size from the eland default of 1000 to the Elasticsearch maximum of 10k.

Yeah, really useful.

We have now adopted a better solution: we don't use batch_size anymore, and instead expose the iterrows and itertuples methods.

Then the user can implement their own batch_size in the DataFrame iteration.

@th0ger commented Aug 18, 2021

> Then the user can implement their own batch_size in the DataFrame iteration.

Would you mind providing an example of how this should be called?

@kxbin (Contributor, Author) commented Aug 19, 2021

> Then the user can implement their own batch_size in the DataFrame iteration.
>
> Would you mind providing an example of how this should be called?

Something like this:

import eland as ed
import pandas as pd

ed_flights = ed.DataFrame('localhost:9200', 'flights')
batch_size = 10000

batch_series = []

# Count rows with enumerate, since the row index (the Elasticsearch document
# _id) is not guaranteed to be a sequential integer.
for i, (index, row) in enumerate(ed_flights.iterrows()):
    batch_series.append(row)

    if (i + 1) % batch_size == 0:
        batch_dataframe = pd.DataFrame(batch_series)
        batch_series = []
        # Then, we can use this batch_dataframe to do something we want

# Build a final DataFrame from any leftover rows
if batch_series:
    batch_dataframe = pd.DataFrame(batch_series)
