Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] [Docs] Change docstring for decompressed_size arg in pyarrow.decompress to reflect implementation #15043

Closed
amoeba opened this issue Dec 20, 2022 · 1 comment · Fixed by #15061

Comments

@amoeba
Copy link
Member

amoeba commented Dec 20, 2022

Describe the bug, including details regarding any error messages, version, and platform.

The docstring for decompressed_size in pyarrow.decompress and the underlying Codec.decompress states:

decompressed_size : int, default None
If not specified, will be computed if the codec is able to determine the uncompressed buffer size.

However, you can see from the implementation that the behavior isn't possibly and that the argument is effectively required because we always raise when it's None:

arrow/python/pyarrow/io.pxi

Lines 2091 to 2124 in 4e9158d

def decompress(self, object buf, decompressed_size=None, asbytes=False,
memory_pool=None):
"""
Decompress data from buffer-like object.
Parameters
----------
buf : pyarrow.Buffer, bytes, or memoryview-compatible object
decompressed_size : int, default None
If not specified, will be computed if the codec is able to
determine the uncompressed buffer size.
asbytes : boolean, default False
Return result as Python bytes object, otherwise Buffer
memory_pool : MemoryPool, default None
Memory pool to use for buffer allocations, if any.
Returns
-------
uncompressed : pyarrow.Buffer or bytes (if asbytes=True)
"""
cdef:
shared_ptr[CBuffer] owned_buf
CBuffer* c_buf
Buffer out_buf
int64_t output_size
uint8_t* output_buffer = NULL
owned_buf = as_c_buffer(buf)
c_buf = owned_buf.get()
if decompressed_size is None:
raise ValueError(
"Must pass decompressed_size for {} codec".format(self)
)

For example,

import pyarrow as pa

pa.decompress(pa.compress(bytes('mydata','utf-8')))

Raises:

ValueError: Must pass decompressed_size for <pyarrow.Codec name=lz4 compression_level=1> codec

I think the original intent was to support detecting or estimating the size of the output buffer for codecs that support it (like Snappy) but, the last time this was brought up, it was marked as low priority and remains unimplemented.

Should the argument be made required and the bit about "If not specified" be removed?

Component(s)

Python

@amoeba amoeba changed the title Change docstring for decompressed_size arg in pyarrow.decompress to reflect implementation [Python] Change docstring for decompressed_size arg in pyarrow.decompress to reflect implementation Dec 20, 2022
@jorisvandenbossche
Copy link
Member

Yes, I think we should certainly update the docs to match the current behaviour (which is indeed to always require the decompressed_size)

@amoeba amoeba changed the title [Python] Change docstring for decompressed_size arg in pyarrow.decompress to reflect implementation [Python] [Docs] Change docstring for decompressed_size arg in pyarrow.decompress to reflect implementation Dec 21, 2022
@jorisvandenbossche jorisvandenbossche added this to the 11.0.0 milestone Dec 22, 2022
jorisvandenbossche pushed a commit that referenced this issue Dec 22, 2022
…5061)

I opted to keep `decompressed_size` optional to reduce churn (optional->required->optional) if we ever add in detection/estimation for certain codecs in the future. Let me know what you think.
* Closes: #15043

Authored-by: Bryce Mecum <petridish@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this issue Jan 5, 2023
…ss (apache#15061)

I opted to keep `decompressed_size` optional to reduce churn (optional->required->optional) if we ever add in detection/estimation for certain codecs in the future. Let me know what you think.
* Closes: apache#15043

Authored-by: Bryce Mecum <petridish@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants