[C++][Parquet] Writing nullable nested strings results in wrong data in file #26466

Closed
asfimport opened this issue Nov 4, 2020 · 1 comment

asfimport commented Nov 4, 2020

When I write a column of type struct(string) that has more elements than write_batch_size, the output contains only the first batch, repeated. Data from batches after the first is never written to the output.

I only see this behaviour with Arrow 2.0.0; with 1.0.1 the output contains all the data as expected.
 
This Python test case reproduces the problem. The last value in the output is "key-0" instead of the expected "key-1024":
 

import io
import pyarrow as pa
import pyarrow.parquet as pq

def test_struct_array():
    default_writer_batch_size = 1024
    n_samples = default_writer_batch_size + 1
    keys = [f"key-{i}" for i in range(n_samples)]
    expected = list(keys)

    struct_array = pa.StructArray.from_arrays(
        [pa.array(keys, type=pa.string())],
        names=["string"],
    )
    table = pa.table({"struct": struct_array})

    buf = io.BytesIO()
    pq.write_table(table, buf)

    actual = pq.read_table(buf).flatten()[0].to_pylist()

    assert actual[:1024] == expected[:1024]
    assert actual[-1] == expected[-1], (actual[-1], expected[-1])
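
To make the symptom concrete, here is a small pure-Python model of a batched writer. It is only a sketch and not Arrow's implementation: the report does not state the root cause, and the names `write_in_batches` and `advance_offset` are hypothetical. It assumes the "first batch, repeated" behaviour is equivalent to every batch re-reading from position 0, which is enough to show why 1025 values with a batch size of 1024 end in "key-0".

```python
def write_in_batches(values, batch_size, advance_offset=True):
    """Toy model of a batched writer (hypothetical, NOT Arrow's code).

    With advance_offset=False every batch is read starting at position 0,
    which mimics the reported symptom: the first batch is repeated.
    """
    out = []
    offset = 0
    remaining = len(values)
    while remaining > 0:
        n = min(batch_size, remaining)
        # Assumed failure mode: the read position is not advanced.
        start = offset if advance_offset else 0
        out.extend(values[start:start + n])
        offset += n
        remaining -= n
    return out


keys = [f"key-{i}" for i in range(1025)]
good = write_in_batches(keys, 1024)
bad = write_in_batches(keys, 1024, advance_offset=False)

print(good[-1])  # key-1024
print(bad[-1])   # key-0
```

In this model the second batch holds a single value, so the faulty path copies `keys[0:1]` again. The first 1024 values still match, which is why the test above only fails on its final assertion.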

 

Environment: Python 3.6
Reporter: Christian Lundgren / @chrisavl
Assignee: Christian Lundgren / @chrisavl

Note: This issue was originally created as ARROW-10493. Please see the migration documentation for further details.
