[C++] Slice RecordBatch of String array with offset 0 returns whole batch #22449

Closed
asfimport opened this issue Jul 26, 2019 · 2 comments
Comments

@asfimport

We are seeing a bug very similar to ARROW-809, but for a RecordBatch of strings. Slicing a RecordBatch that has a string column with offset = 0 returns the whole batch instead of the requested rows.

import pandas as pd
import pyarrow as pa
df = pd.DataFrame({ 'b': ['test' for x in range(1000_000)]})
tbl = pa.Table.from_pandas(df)
batch = tbl.to_batches()[0]

batch.slice(0, 2).serialize().size  # 4000232

batch.slice(1, 2).serialize().size  # 240

Reporter: Sascha Hofmann / @saschahofmann
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-6046. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
The offsets buffer is not truncated in the first case. The logic for this is found here:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L226

A correct fix would be to slice the offsets buffer to the expected length (plus padding, if it exists) when the offset is zero, rather than serializing the whole thing.
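
To illustrate the idea, here is a minimal pure-Python sketch of a variable-length string layout (an int32 offsets buffer with one more entry than there are rows, plus a data buffer). It is not the Arrow C++ writer code; `build_string_array` and `truncate_for_write` are hypothetical helpers made up for this example, and alignment padding is ignored. The point is only that a slice of `length` rows needs `length + 1` offsets, whether or not the slice starts at row 0.

```python
from array import array


def build_string_array(values):
    """Return (offsets, data) in an Arrow-like variable-length layout."""
    offsets = array("i", [0])  # one entry per row, plus a leading 0
    data = bytearray()
    for v in values:
        data += v.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def truncate_for_write(offsets, data, offset, length):
    """Keep only the offsets and bytes needed for rows [offset, offset + length).

    Hypothetical helper: the same truncation is applied for offset == 0
    as for any other offset, and the offsets are rebased to start at 0.
    """
    window = offsets[offset : offset + length + 1]  # length + 1 entries
    start = window[0]
    rebased = array("i", (o - start for o in window))
    return rebased, data[start : window[-1]]


offsets, data = build_string_array(["test"] * 1000)
full_size = offsets.itemsize * len(offsets) + len(data)

new_offsets, new_data = truncate_for_write(offsets, data, 0, 2)
sliced_size = new_offsets.itemsize * len(new_offsets) + len(new_data)

print(list(new_offsets), new_data)
print(full_size, sliced_size)
```

As a rough sanity check on the report, 4000232 bytes is close to 4 bytes times the 1,000,001 offsets of the full column plus a small amount of overhead, which seems consistent with the offsets buffer being written whole while the data buffer was already truncated.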

@asfimport

Francois Saint-Jacques / @fsaintjacques:
Issue resolved by pull request 5126
#5126

asfimport added this to the 0.15.0 milestone Jan 11, 2023