feat: Support for Binaryview and StringView types #367

jorisvandenbossche · 2024-01-19T14:29:48Z

Very draft PR, just putting here publicly what I experimented.

This is currently a minimal addition that just allows to inspect an array / type:

In [1]: builder = pa.lib.StringViewBuilder()
   ...: builder.append("test")
   ...: builder.append("very long string that is not inlined")
   ...: builder.append(None)
   ...: builder.append("test")
   ...: arr = builder.finish()

In [2]: import nanoarrow as na

In [3]: na.c_schema(arr.type)
Out[3]: 
<nanoarrow.c_lib.CSchema string_view>
- format: 'vu'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:

In [4]: na.c_schema_view(arr.type)
Out[4]: 
<nanoarrow.c_lib.CSchemaView>
- type: 'string_view'
- storage_type: 'string_view'

In [5]: na.c_array(arr)
Out[5]: 
<nanoarrow.c_lib.CArray string_view>
- length: 4
- offset: 0
- null_count: 1
- buffers: (140250311598080, 140250311598144, 140250311598208, 94079215855408)
- dictionary: NULL
- children[0]:

In [4]: na.c_array_view(arr)
Out[4]: 
<nanoarrow.c_lib.CArrayView>
- storage_type: 'string_view'
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - <bool validity[1 b] 11010000>
  - <string_view data[0 b] b''>
- dictionary: NULL
- children[0]:

jorisvandenbossche · 2024-01-19T14:31:08Z

src/nanoarrow/nanoarrow_types.h

-  NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO
+  NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO,
+  NANOARROW_TYPE_BINARY_VIEW,
+  NANOARROW_TYPE_STRING_VIEW,


Is this ArrowType enum meant to be stable? (i.e. can I only add new types at the end, or can I put them at a more logical place in the list?)

I think it's probably best to put it at the end to keep the existing values (we can always rearrange them in places where they're more likely to be seen).

codecov-commenter · 2024-01-19T14:33:40Z

Codecov Report

Attention: 26 lines in your changes are missing coverage. Please review.

Comparison is base (b3c952a) 88.38% compared to head (b3a9894) 88.02%.

Files	Patch %	Lines
src/nanoarrow/schema.c	0.00%	19 Missing ⚠️
src/nanoarrow/nanoarrow_types.h	0.00%	4 Missing ⚠️
src/nanoarrow/array.c	25.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #367      +/-   ##
==========================================
- Coverage   88.38%   88.02%   -0.36%     
==========================================
  Files          75       75              
  Lines       12677    12731      +54     
==========================================
+ Hits        11205    11207       +2     
- Misses       1472     1524      +52

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jorisvandenbossche · 2024-01-19T14:38:20Z

@paleolimbot the reason that the Python binding for the ArrayView currently checks the buffer types dynamically to determine the number of buffers, is that so this works on an uninitialized ArrayView (just based on the layout, without an Array already being created)? See here:

arrow-nanoarrow/python/src/nanoarrow/_lib.pyx

Lines 685 to 690 in b3c952a

    
           @property 
        
           def n_buffers(self): 
        
               for i in range(3): 
        
                   if self._ptr.layout.buffer_type[i] == NANOARROW_BUFFER_TYPE_NONE: 
        
                       return i 
        
               return 3

i.e. why not just return self._ptr.array.n_buffers here?

Because if we don't consider the variadic buffers as part of the ArrayView's buffers, then accessing like the above in Python doesn't work for those

paleolimbot · 2024-01-19T16:50:40Z

why not just return self._ptr.array.n_buffers here?

I've tried to remove all references to ArrayView.array from nanoarrow (C, Python, R, and otherwise), although I haven't removed the member from the ArrayView yet because I think some previous code relies on it. Basically, the ArrowArrayView doesn't care how the memory it's pointing to got there (i.e., there's no requirement that an ArrowArray ever existed). This is mostly helpful for reading IPC: the decoder first fills in an ArrowArrayView and only allocates the ArrowArray if requested.

Because if we don't consider the variadic buffers as part of the ArrayView's buffers, then accessing like the above in Python doesn't work for those

That's true. We could rewrite n_buffers() to += _ptr.n_variadic_buffers and rewrite buffer() to create a buffer view from the variaidic buffers if i is large enough (or error, since accessing variadic buffers directly isn't strictly needed to decode the content).

paleolimbot · 2024-01-19T16:52:10Z

src/nanoarrow/array.c

+  if (array_view->storage_type == NANOARROW_TYPE_STRING_VIEW) {
+    array_view->n_varidic_buffers = array->n_buffers - 3;
+    array_view->variadic_buffer_sizes = array->buffers[array->n_buffers - 1];
+    // array_view->variadic_buffers = ...


Suggested change

// array_view->variadic_buffers = ...

array_view->variadic_buffers = array-> buffers + 2

...I think?

rouault · 2024-01-21T16:21:34Z

Sorry for hickjacking this thread but I've become aware of the introductions of those new arrow data types because GDAL compilation broke against arrow 15.0 (due to a switch() not handling the new cases). Will be addressed in OSGeo/gdal#9116 in a minimalistic way by erroring out on those types. Do you know if those types will be actually found in serialized formats, namely Parquet and Feather ? And I hope that I won't have to add support for them in OGRLayer::WriteArrowBatch()... Their value proposition compared to regular string or binary is unclear to me. The only thing I see is that they might be a way to reduce memory usage by pointing to the same offset in case of duplicated strings? but wasn't that the purpose of dictionaries? (actually the same question would hold for the RUN_END_ENCODED stuff added in libarrow 12). As a relatively new comer to the Arrow ecosystem, I should point that the proliferation of basic data types is going to be a serious obstacle to adoption by new implementations.
Perhaps there should be some "data type negociation" mechanism where a consumer could tell to the producer something like "I just understand Int32, Float64, String. Do your best to present me only those data types by possibly morphing content to that"

paleolimbot · 2024-01-21T19:19:44Z

Do you know if those types will be actually found in serialized formats, namely Parquet and Feather ?

I don't believe it's possible to get a stringview or binaryview from Parquet; however, in theory you could get one from Feather or Arrow IPC today with Arrow 15. I think that until pyarrow implements a basic level of support it's unlikely to actually end up being used. Some day the C++ Parquet scanner might be able to return those types but there are very few people working on that part of the code base and I think it's unlikely to be implemented any time soon.

Their value proposition compared to regular string or binary is unclear to me.

I think the gist of it is that regular string and binary are slow to sort, which is why Meta's Velox, DuckDB, and (very recently) Polars have adopted it as their primary representation. Arrow added it primarily for interchange with those systems (e.g., a user-defined function based on the C Data interface) although I agree that it came at the expense unnecessary complexity for 99.9% of Arrow users.

I should point that the proliferation of basic data types is going to be a serious obstacle to adoption by new implementations.

I agree. In theory this is what nanoarrow is designed to help with (when the types are supported...so not yet). The focus of the version about to be released is testing and stability...0.5.0 is more likely to include features that could be used in a fallback sort of way (i.e., maybe ArrowArrayViewGetLogicalXXX() that would handle the dictionary, run-end-encoded, view, or normal cases at the expense of an extra switch()).

Perhaps there should be some "data type negociation" mechanism

In Arrow C++, this is most likely to be supported via an option in the readers. For Parquet In Python, the __arrow_c_schema__() and __arrow_c_stream__(requested_schema=None) methods can in theory handle this (i.e., you query __arrow_c_schema__() and if it contains types you don't understand, you pass a different schema to __arrow_c_stream__()'s requested_schema. That said, pyarrow doesn't implement requested_schema yet, but it also doesn't really implement the view types either.

And I hope that I won't have to add support for them in OGRLayer::WriteArrowBatch()

For now I think you will be hard-pressed to find a producer that actually produces REE or View types (ListView is also on the horizon if it's not already implemented in Arrow C++). It's on nanoarrow's roadmap to support all of them (but Python bindings are on the roadmap first, which is no small feat!)...perhaps it can help!

jorisvandenbossche · 2024-01-22T15:22:21Z

Basically, the ArrowArrayView doesn't care how the memory it's pointing to got there (i.e., there's no requirement that an ArrowArray ever existed). This is mostly helpful for reading IPC: the decoder first fills in an ArrowArrayView and only allocates the ArrowArray if requested.

OK, understood! I see the C doc comment actually mentions this: "use it to represent a hypothetical ArrowArray that does not exist yet, or use it to validate the buffers of a future ArrowArray.".

eitsupi · 2024-02-06T15:10:11Z

FYI, note that Python Polars can create Arrow files with the Utf8View type written to them by using the experimental option;

>>> import polars as pl
>>> import pyarrow.feather as feather
>>> pl.DataFrame({"a": ["foo"]}).write_ipc("test.arrow", future=True)
>>> feather.read_table("test.arrow")
pyarrow.Table
a: string_view
----
a: [["foo"]]

Since Rust Polars started using the Utf8View type, downstream Rust projects will use it.

paleolimbot · 2024-02-06T15:40:04Z

Nice! We're currently focused on a few other short-term nanoarrow things (e.g., Python bindings) but Joris did all the hard work here and we should be able to get this merged in the next few months with support in R and Python.

eitsupi · 2024-05-06T04:03:24Z

Hey, R polars package's Series/DataFrame to/from nanoarrow_array_stream conversion have been rewritten to only using the C Stream interface (pola-rs/r-polars#1075, pola-rs/r-polars#1076, pola-rs/r-polars#1078).
Even though nanoarrow 0.4.0 does not support StringView, It seems to work fine even when StringView is included.

polars::pl$Series(values = "foo") |>
  nanoarrow::as_nanoarrow_array_stream(future = TRUE) |>
  polars::as_polars_series()
#> polars Series: shape: (1,)
#> Series: '' [str]
#> [
#>  "foo"
#> ]

^{Created on 2024-05-06 with reprex v2.1.0}

paleolimbot · 2024-05-10T19:21:06Z

@eitsupi Sorry I missed this comment!

It seems to work fine even when StringView is included.

nanoarrow (for R or Python) can definitely transport any type (even invalid ones!), even though conversion to/from R won't work. I think you might also run into trouble printing these (although I believe I have some tests for this...you'd probably be able to see the format and buffer addresses).

feat: Support for BinaryvieW and StringView types

b3a9894

jorisvandenbossche commented Jan 19, 2024

View reviewed changes

paleolimbot reviewed Jan 19, 2024

View reviewed changes

eitsupi mentioned this pull request Feb 6, 2024

Export Utf8View type via Arrow C stream pola-rs/r-polars#781

Closed

eitsupi mentioned this pull request May 3, 2024

feat: export/import DataFrame as raw vector pola-rs/r-polars#1072

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Support for Binaryview and StringView types #367

feat: Support for Binaryview and StringView types #367

jorisvandenbossche commented Jan 19, 2024

jorisvandenbossche Jan 19, 2024

paleolimbot Jan 19, 2024

codecov-commenter commented Jan 19, 2024

jorisvandenbossche commented Jan 19, 2024

paleolimbot commented Jan 19, 2024

paleolimbot Jan 19, 2024

rouault commented Jan 21, 2024

paleolimbot commented Jan 21, 2024

jorisvandenbossche commented Jan 22, 2024

eitsupi commented Feb 6, 2024 •

edited

Loading

paleolimbot commented Feb 6, 2024

eitsupi commented May 6, 2024 •

edited

Loading

paleolimbot commented May 10, 2024

	// array_view->variadic_buffers = ...
	array_view->variadic_buffers = array-> buffers + 2

feat: Support for Binaryview and StringView types #367

Are you sure you want to change the base?

feat: Support for Binaryview and StringView types #367

Conversation

jorisvandenbossche commented Jan 19, 2024

jorisvandenbossche Jan 19, 2024

Choose a reason for hiding this comment

paleolimbot Jan 19, 2024

Choose a reason for hiding this comment

codecov-commenter commented Jan 19, 2024

Codecov Report

jorisvandenbossche commented Jan 19, 2024

paleolimbot commented Jan 19, 2024

paleolimbot Jan 19, 2024

Choose a reason for hiding this comment

rouault commented Jan 21, 2024

paleolimbot commented Jan 21, 2024

jorisvandenbossche commented Jan 22, 2024

eitsupi commented Feb 6, 2024 • edited Loading

paleolimbot commented Feb 6, 2024

eitsupi commented May 6, 2024 • edited Loading

paleolimbot commented May 10, 2024

eitsupi commented Feb 6, 2024 •

edited

Loading

eitsupi commented May 6, 2024 •

edited

Loading