
Pyarrow casting chunked array to string type yields ArrowCapacityError #38835

Open
samster25 opened this issue Nov 21, 2023 · 1 comment
Describe the bug, including details regarding any error messages, version, and platform.

Hello,

I'm hitting an issue when casting a numeric ChunkedArray (float64 in the repro below) to a string type. The failure occurs when a resulting string chunk is too large: pyarrow performs the cast at the granularity of the input chunks, so a single output chunk can overflow the string array's capacity. Instead, it should split the output based on the target array size.

import numpy as np
import pyarrow as pa

arr = 9_999_999_999_999 * np.ones(1_000_000_000 // 8)  # 1 GB array of float64

pa_arr = pa.chunked_array([arr])

pa_arr
<pyarrow.lib.ChunkedArray object at 0x16a1cfce0>
[
  [
    9.999999999999e+12,
    9.999999999999e+12,
    9.999999999999e+12,
    9.999999999999e+12,
    9.999999999999e+12,
    ...
    9.999999999999e+12,
    9.999999999999e+12,
    9.999999999999e+12,
    9.999999999999e+12,
    9.999999999999e+12
  ]
]

pa_arr.cast(pa.string())
---------------------------------------------------------------------------
ArrowCapacityError                        Traceback (most recent call last)
line 1
----> 1 pa_arr.cast(pa.string())

File ~/code/Daft/venv/lib/python3.11/site-packages/pyarrow/table.pxi:551, in pyarrow.lib.ChunkedArray.cast()

File ~/code/Daft/venv/lib/python3.11/site-packages/pyarrow/compute.py:400, in cast(arr, target_type, safe, options, memory_pool)
    398     else:
    399         options = CastOptions.safe(target_type)
--> 400 return call_function("cast", [arr], options, memory_pool)

File ~/code/Daft/venv/lib/python3.11/site-packages/pyarrow/_compute.pyx:572, in pyarrow._compute.call_function()

File ~/code/Daft/venv/lib/python3.11/site-packages/pyarrow/_compute.pyx:367, in pyarrow._compute.Function.call()

File ~/code/Daft/venv/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/code/Daft/venv/lib/python3.11/site-packages/pyarrow/error.pxi:125, in pyarrow.lib.check_status()

ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483664
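For reference, the numbers in the error line up with the int32 offset limit: a string array's data buffer tops out just under 2**31 bytes, and formatting 125 million doubles needs slightly more than that. A back-of-the-envelope check (the per-value byte size below is derived from the error message, not from Arrow internals):

```python
# Back-of-the-envelope check of the capacity error (sizes taken from the report).
n_values = 1_000_000_000 // 8      # 125_000_000 float64 values in the 1 GB array
capacity = 2**31 - 2               # max data-buffer bytes quoted by the error
bytes_needed = 2_147_483_664       # the "have ..." figure from the error message

print(bytes_needed > capacity)     # True: a single output chunk overflows
print(bytes_needed / n_values)     # roughly 17 bytes per formatted value
```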

Component(s)

Python

@js8544
Copy link
Collaborator

js8544 commented Nov 22, 2023

An Arrow StringArray cannot contain more than int32_max bytes in total, because StringArray stores its offsets as int32. pa_arr.cast(pa.large_string()) works fine in your case.

Instead of performing the cast at the granularity of the input chunks, pyarrow should do it based on the target array size.

Sorry but I don't quite get this. Could you elaborate a bit?
