feat(python): Create string/binary arrays from iterables #430

Merged
paleolimbot merged 14 commits into apache:main from the python-builder-string branch on Apr 16, 2024

Conversation

paleolimbot (Member) commented on Apr 15, 2024

This PR adds support for building string and binary arrays from iterables.

It also cleans up a few parts of #426 that resulted in the wheel builds failing for (at least) PyPy 3.8 and 3.9. We can circle back to the performance of building from iterables (and whether or not pack_into() is essential) when all the wheels are building reliably.

import nanoarrow as na

strings = ["pizza", "yogurt", "noodles", "peanut butter sandwiches"]

na.Array(strings, na.string())
#> nanoarrow.Array<string>[4]
#> 'pizza'
#> 'yogurt'
#> 'noodles'
#> 'peanut butter sandwiches'

na.Array((s.encode() for s in strings), na.binary())
#> nanoarrow.Array<binary>[4]
#> b'pizza'
#> b'yogurt'
#> b'noodles'
#> b'peanut butter sandwiches'
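
For readers less familiar with the Arrow layout, here is a rough pure-Python sketch of what building a string array from an iterable involves: a validity bitmap, an int32 offsets buffer with one more entry than there are elements, and a data buffer of concatenated UTF-8 bytes. The function name `build_string_buffers` is hypothetical and this is not nanoarrow's implementation (which does this work in Cython/C); it only illustrates the buffers that have to be produced.

```python
import struct


def build_string_buffers(items):
    """Return (validity, offsets, data) buffers for an iterable of str or None.

    Hypothetical helper for illustration only; not part of nanoarrow.
    """
    offsets = [0]      # n + 1 entries; element i spans offsets[i]:offsets[i + 1]
    data = bytearray() # concatenated UTF-8 bytes of all non-null elements
    validity = []      # one flag per element, packed into bits below

    for item in items:
        if item is None:
            validity.append(False)  # a null contributes no data bytes
        else:
            data += item.encode("utf-8")
            validity.append(True)
        offsets.append(len(data))

    # Offsets are stored as little-endian int32 values.
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)

    # Validity is a bitmap, least-significant bit first; 1 means "not null".
    bitmap = bytearray((len(validity) + 7) // 8)
    for i, valid in enumerate(validity):
        if valid:
            bitmap[i // 8] |= 1 << (i % 8)

    return bytes(bitmap), offsets_buf, bytes(data)


validity, offsets, data = build_string_buffers(iter(["pizza", None, "yogurt"]))
print(struct.unpack("<4i", offsets), data)
```

Because the input is consumed in a single pass, the same loop works for lists and generators alike. A binary array would use the same layout, skipping the `encode()` step.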

The "build from iterable" code is now sufficiently complicated that it should be separated out. I did an initial attempt at that for this PR; however, it scrambles things up a bit and is complicated by the interdependence between the functions that sanitize arguments (e.g., c_schema(), c_array()) and the functions that build from iterable.

In the benchmark below, nanoarrow is currently faster than pyarrow for strings and for bytes without nulls, and slightly slower for bytes with nulls.

from itertools import cycle, islice
import nanoarrow as na
import pyarrow as pa

strings = ["pizza", "yogurt", "noodles", "peanut butter sandwiches"]
binary = [s.encode() for s in strings]

def many_strings():
    return islice(cycle(strings), int(1e6))

def many_strings_with_nulls():
    return islice(cycle(strings + [None]), int(1e6))

def many_bytes():
    return islice(cycle(binary), int(1e6))

def many_bytes_with_nulls():
    return islice(cycle(binary + [None]), int(1e6))

%timeit pa.array(many_strings(), pa.string())
#> 23.4 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(many_strings(), na.string())
#> 14.3 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pa.array(many_strings_with_nulls(), pa.string())
#> 21.4 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(many_strings_with_nulls(), na.string())
#> 17.1 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pa.array(many_bytes(), pa.binary())
#> 19.7 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(many_bytes(), na.binary())
#> 16.3 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pa.array(many_bytes_with_nulls(), pa.binary())
#> 17.6 ms ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit na.c_array(many_bytes_with_nulls(), na.binary())
#> 19 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
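
The `%timeit` lines above only run inside IPython. Outside of it, a comparable measurement could be sketched with the standard-library `timeit` module, as below. The helper `time_builder` is hypothetical, and `list` stands in for `na.c_array` or `pa.array` so the snippet runs without either library installed. As in the benchmark above, a fresh iterator is created on every call, since an `islice` object is exhausted after one pass; the timing therefore includes the cost of producing the items.

```python
import timeit
from itertools import cycle, islice

strings = ["pizza", "yogurt", "noodles", "peanut butter sandwiches"]


def many_strings(n=1000):
    # A new iterator per call; reusing one would yield nothing the second time.
    return islice(cycle(strings), n)


def time_builder(build, make_iterable, repeat=3, number=10):
    """Return the best observed time per call (in seconds) of build(make_iterable()).

    Hypothetical helper for illustration; `build` is any callable that
    consumes an iterable (e.g. a stand-in for na.c_array or pa.array).
    """
    timings = timeit.repeat(
        lambda: build(make_iterable()), repeat=repeat, number=number
    )
    return min(timings) / number


seconds = time_builder(list, many_strings)
print(f"{seconds * 1e6:.1f} µs per call")
```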

@@ -23,7 +23,7 @@
import warnings


-# Generate the nanoarrow_c.pxd file used by the Cython extension
+# Generate the nanoarrow_c.pxd file used by the Cython extensions
Member Author

This was basically just to force the wheels to build 😬

paleolimbot marked this pull request as ready for review on April 15, 2024, 18:56
Comment on lines -2012 to 2015
-code = ArrowBufferReserve(self._buffer._ptr, bytes_per_element)
-if code != NANOARROW_OK:
-    Error.raise_error("ArrowBufferReserve()", code)

-pack_into(self, self._buffer._ptr.size_bytes, *item)
-self._buffer._ptr.size_bytes += bytes_per_element
+write(pack(*item))
 n_values += 1
Member

Is there a reason you rolled this back?
(AFAIK this part was not the reason for the failures?)

Member Author

It wasn't a reason for the original failure, but it was exposed once that was fixed. Exactly what type of object is considered a buffer is different for (maybe just some versions of) PyPy, so Struct.pack_into() threw an error. If we have time we can roll that back in (or figure out if there's some minimum version of PyPy we have to drop to make it work), but at least this gets us back to wheels that build!

Member

Ah I see, sounds good!
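
For reference, the two approaches discussed in this thread can be sketched with the standard-library `struct` module alone. This is only an illustration under simplifying assumptions: a plain `bytearray` stands in for nanoarrow's buffer object, and the format `"<i"` and the sample items are made up. `pack()` returns a new `bytes` object that is then appended, while `pack_into()` writes in place and therefore requires a target that the interpreter accepts as a writable buffer, which is where the PyPy difference mentioned above came in.

```python
import struct

packer = struct.Struct("<i")  # one little-endian int32 per element (assumed format)
items = [(1,), (2,), (3,)]

# Approach A: pack() creates a temporary bytes object per item, which is
# then appended to the output (analogous to write(pack(*item))).
out_a = bytearray()
for item in items:
    out_a += packer.pack(*item)

# Approach B: pack_into() writes directly into preallocated, writable memory
# at a given byte offset, avoiding the temporary object.
buf = bytearray(packer.size * len(items))
offset = 0
for item in items:
    packer.pack_into(buf, offset, *item)
    offset += packer.size

# Both approaches produce identical bytes.
assert out_a == buf
```

Approach A works with any object that exposes a write or append operation, which makes it the more portable choice; approach B trades that portability for one fewer allocation per element.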

paleolimbot merged commit 0948151 into apache:main on Apr 16, 2024
11 checks passed
paleolimbot deleted the python-builder-string branch on April 19, 2024, 00:32
paleolimbot added this to the nanoarrow 0.5.0 milestone on May 22, 2024