[Python] Useful scope of the python bindings? #53


Closed
jorisvandenbossche opened this issue Sep 28, 2022 · 6 comments

@jorisvandenbossche
Member

I opened #52 as a start for Python bindings for nanoarrow, currently just a package scaffold.

I am planning to subsequently add basic introspection/consumption of ArrowArray structs and basic conversions of buffers/arrays to numpy.
Some more elaborate notes and thoughts can be found at https://docs.google.com/document/d/1119poLwF0r4AN19dGt9U8vLM07zEVdlRpP2omL1PXdc/edit?usp=sharing

More generally, it would be interesting to discuss or get feedback on what functionality would be useful or needed for other potential use cases.

@paleolimbot
Member

This is awesome!

The scope of the Python bindings certainly doesn't have to be the same as the scope of the R bindings; however, I'll offer up what I had in mind for the scope of the R bindings in case that helps with the conversation. Basically:

  • Data structure: provide an R class that holds an Array/Schema/ArrayStream and releases it when it goes out of scope. In Python you have duck typing so it matters less, but in R it helps to be able to write array <- as_nanoarrow_array(some_object) and have the as_nanoarrow_array() S3 generic live in the nanoarrow package.
  • Introspection: for debugging the R bindings themselves it's really nice to be able to "see" the Array/Schema/ArrayStream fields. This helps write tests, too, particularly when constructing Arrays from buffers.
  • Conversion from Array to R vector: ADBC makes database results available via a struct ArrowArrayStream (and so does the next GDAL release). Including conversions in nanoarrow means that these results are accessible without an Arrow C++ dependency.
  • Creating ArrowArray objects from buffers: This makes it so that you can make an array for any arrow type, regardless of whether R arrow or R nanoarrow has implemented a conversion for it. I used this (and still use this) to prototype GeoArrow extension type representations, since nested list support in R/arrow is not great and extension type support was non-existent when I started.
  • Creating ArrowArray objects from R vectors: In R this is pretty straightforward since we only have a few vector types, most of which we can zero-copy convert into Arrow arrays. ADBC lets you do database inserts using a struct ArrowArrayStream, so being able to zero-copy R vectors into that form allows database connectivity without an Arrow C++ dependency.
  • Reading ArrayStreams: This is pretty easy as long as there are data structures for schemas and arrays.
  • Creating ArrowArrayStreams: At the very least, from a list of arrays, to facilitate testing APIs that use them.
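The first bullet above (a holder class that releases its Array when it goes out of scope) can be sketched in Python with `weakref.finalize`. This is a toy illustration with an invented `release` callback standing in for the C-level `ArrowArray.release`, not the actual nanoarrow implementation:

```python
import weakref


class ArrayHolder:
    """Toy stand-in for a class owning an ArrowArray: calls the release
    callback exactly once, either explicitly or when the holder is
    garbage collected."""

    def __init__(self, data, release):
        self._data = data
        # weakref.finalize fires when the holder is collected, and is
        # guaranteed to run the callback at most once.
        self._finalizer = weakref.finalize(self, release)

    def release(self):
        # Calling the finalizer object directly is idempotent.
        self._finalizer()


released = []
holder = ArrayHolder([1, 2, 3], lambda: released.append(True))
holder.release()
assert released == [True]
holder.release()  # second call is a no-op
assert released == [True]
```

The key property being modeled is that releasing is safe to do early and safe to do twice, which is what the C data interface's `release`-and-null-out convention requires of consumers.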

For me, the R bindings are all about facilitating the development of libraries that expose data sources as Arrays or ArrayStreams. The dream is that Arrow is the in-memory standard for this kind of thing; however, without a lightweight way to convert those objects to a data frame or numpy array there is not much incentive to do so. The nanoarrow C library makes it relatively easy to create struct ArrowArrays from your data source; the nanoarrow R bindings make it trivial to make your library useful to R users.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Oct 24, 2022

Data structure: provide an R class that holds an Array/Schema/ArrayStream and releases it when it goes out of scope.

Yes, that was the first thing I was planning to do, and opened a PR with an initial version of this for Array -> #62 (will have to do something similar for ArrayStream).
Will have to think a bit about how to do this for ArrowSchema (whether the Array class keeps track of a Python Schema object (which wraps ArrowSchema), or whether it has its own ArrowSchema pointer directly, as I did for now).
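For illustration, the two designs being weighed can be sketched as follows (all class and attribute names here are hypothetical, not the actual nanoarrow API). With the first design, Python's reference counting lets several arrays share one Schema wrapper:

```python
class Schema:
    """Hypothetical wrapper around an ArrowSchema pointer."""

    def __init__(self, addr):
        self._addr = addr  # address of the C struct (just an int in this sketch)


class ArrayHoldingSchemaObject:
    """Design 1: the Array keeps a Python-level Schema object; the schema
    stays alive as long as any array (or other code) references it."""

    def __init__(self, addr, schema):
        self._addr = addr
        self.schema = schema


class ArrayHoldingSchemaPointer:
    """Design 2: the Array owns an ArrowSchema pointer directly and must
    release it together with the array."""

    def __init__(self, addr, schema_addr):
        self._addr = addr
        self._schema_addr = schema_addr


s = Schema(0x1000)
a = ArrayHoldingSchemaObject(0x2000, s)
b = ArrayHoldingSchemaObject(0x3000, s)
assert a.schema is b.schema  # one shared Schema wrapper, no copy
```

The trade-off sketched here: design 1 makes sharing and identity checks natural at the Python level, while design 2 keeps each Array self-contained at the cost of duplicating schema state.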

@paleolimbot
Member

Yeah, the fact that some_r_schema$name <- "something else" makes a deep copy of some_r_schema with my current design is probably not ideal...if it were just a classed R list I could probably avoid some copies. That design is mostly nice because it simplifies the overall structure (a nanoarrow_schema == struct ArrowSchema; a nanoarrow_array == struct ArrowArray).

@paleolimbot
Member

We now have Python bindings with some specific bullets on the roadmap ( https://arrow.apache.org/nanoarrow/main/roadmap.html#python-bindings )...feel free to reopen an issue or PR modifying those!

@kylebarron

Is there also a section of the docs that says what's in scope for the Python bindings? As an example use case, pyarrow is so large that it's hard to use in Lambda and stay under the 250 MB limit. So, for example, those docs mention that IPC isn't currently exposed, but is it intended/in scope for it to be exposed in the future?

@paleolimbot
Member

I think it's a little bit up to who has time to do all the implementing...the things I put on the roadmap are a few things that have come up (but until there's some concrete plan/resources allocated I was going to keep a lid on the issues). There's certainly no technical limitation to exposing the IPC reader!
