Initial Arrow support #866
Conversation
@Mytherin, can you have a look please?

Looks good to me.
Cool! Thanks for being the guinea pig on the C interface iterator. Getting that working well in the other direction will help us harden the interactions with pyarrow; that could be tested in a branch and then merged later, once it's available in the production pyarrow packages.
@pitrou may have some more detailed comments on the C interface details
```cpp
} else if (format == "d:38,0") { // decimal128
	return_types.push_back(LogicalType::HUGEINT);
} else if (format == "u") {
	return_types.push_back(LogicalType::VARCHAR);
```
Are your VARCHAR values required to be UTF-8? Curious how Arrow BINARY might be handled.
Yes, they are required to be UTF-8.
To add to that, we have a `LogicalType::BLOB` that can handle arbitrary binary blobs.
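To make the mapping discussed above concrete, here is a small sketch (not DuckDB's actual code) of how a few Arrow C data interface format strings translate to DuckDB logical types. The `"z"` (Arrow binary) entry is an assumption about how BLOB could slot in; the other entries come from the snippet above and the Arrow C data interface spec.

```python
# Hypothetical mapping from Arrow C data interface format strings
# to DuckDB logical type names, mirroring the if/else chain above.
FORMAT_TO_DUCKDB = {
    "b": "BOOLEAN",       # Arrow boolean
    "l": "BIGINT",        # Arrow int64
    "d:38,0": "HUGEINT",  # Arrow decimal128(38, 0)
    "u": "VARCHAR",       # Arrow utf8 string (must be valid UTF-8)
    "z": "BLOB",          # Arrow binary (arbitrary bytes) -- assumed
}

def duckdb_type_for(fmt: str) -> str:
    """Look up the DuckDB type for an Arrow format string."""
    try:
        return FORMAT_TO_DUCKDB[fmt]
    except KeyError:
        raise NotImplementedError(f"unsupported Arrow format string: {fmt}")
```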
```cpp
bitset<STANDARD_VECTOR_SIZE + 8> temp_nullmask;
memcpy(&temp_nullmask, (uint8_t *)array.buffers[0] + bit_offset / 8, n_bitmask_bytes + 1);

temp_nullmask >>= (bit_offset % 8); // why this has to be a right shift is a mystery to me
```
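On the mystery in the comment: Arrow validity bitmaps are LSB-ordered, so after a little-endian byte load the validity bit for element k of an array with offset `off` sits at integer bit position `(off % 8) + k`, and a right shift by `off % 8` aligns element 0 with bit 0. A stdlib sketch of the same alignment (the function name is hypothetical, introduced for illustration):

```python
def aligned_validity(bitmap: bytes, bit_offset: int, length: int) -> list:
    """Return the validity bits for `length` elements of an Arrow array
    whose validity bitmap starts at `bit_offset`."""
    # mirrors the memcpy into the bitset on a little-endian machine
    bits = int.from_bytes(bitmap, "little")
    # align the bit for element 0 with bit position 0, like the >>= above
    bits >>= bit_offset % 8
    return [(bits >> k) & 1 for k in range(length)]
```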
```cpp
	StringVector::AddString(output.data[col_idx], cptr, str_len);
	break;
case UnicodeType::UNICODE:
	// this regrettably copies to normalize
```
Why is that?
```cpp
for (idx_t row = 0; row < output.size(); row++) {
	auto source_idx = data.chunk_offset + row;

	auto ms = src_ptr[source_idx] / 1000000; // nanoseconds
```
Need to use the time unit from the schema -- this will be incorrect for units other than nanoseconds?
Yes, we need to do another iteration on the many time types that Arrow seems to have.
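A sketch of what respecting the schema's unit could look like. Arrow timestamp format strings encode the unit in the third character (`tss:`, `tsm:`, `tsu:`, `tsn:` for seconds, milli-, micro-, and nanoseconds); the helper below is hypothetical, normalizing to milliseconds as the hard-coded division above does:

```python
def to_millis(value: int, fmt: str) -> int:
    """Normalize an Arrow timestamp value to milliseconds based on the
    unit encoded in the schema format string (e.g. "tsn:UTC")."""
    unit = fmt[2]  # third character of the format string is the unit
    if unit == "s":
        return value * 1000       # seconds -> ms
    if unit == "m":
        return value              # already ms
    if unit == "u":
        return value // 1000      # microseconds -> ms
    if unit == "n":
        return value // 1000000   # nanoseconds -> ms (the case handled above)
    raise NotImplementedError(f"unknown timestamp unit in {fmt!r}")
```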
```cpp
out_schema->children = root_holder->children.get();

out_schema->format = "+s"; // struct apparently
```
More precisely, you could say that the C interface uses a struct to communicate an Arrow record batch (to avoid having an unwieldy dichotomy between an Array and a RecordBatch).
https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst#record-batches
I am quite happy with the struct approach
```cpp
	batches.append(batch_import_func((uint64_t)&data, (uint64_t)&schema));
}
return from_batches_func(batches, schema_obj);
}
```
As soon as pyarrow has built-in support for the C interface stream/iterator (i.e. hopefully in the next release!) this can be basically dropped
Need to open a follow-up issue to fix the timestamp issue?
Apache Arrow defines a "standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations". There has been a long-standing feature request to support fetching DuckDB result sets as Arrow arrays (#151). In this PR, we define two interfaces between DuckDB and Arrow: reading Arrow arrays as tables in DuckDB queries, and fetching DuckDB query results as Arrow tables.
Thankfully, Arrow has defined a lightweight "C Data Interface" that allows us to provide both interfaces without any build or runtime dependency on the Arrow library itself. This interface is currently even being extended to also allow streaming data in and out of Arrow without said dependency. DuckDB already adopts the proposed streaming interface internally.
For now, the Arrow bridge is implemented for DuckDB's C++ and Python APIs, and there are some restrictions on the kinds of Arrow arrays that can be passed:

Below are some usage examples.
1. Reading Arrow Arrays as tables from DuckDB queries
Reading Arrow arrays from DuckDB is implemented as a table-producing function, `arrow_scan`. This function takes an `ArrowArrayStream*` parameter. The Python API provides a wrapper so we can use `pyarrow.Table` instances.

Using the Python Relation API, we can, for example, run arbitrary SQL on an Arrow Table from a Parquet file and convert the result to Pandas:
From C++:
2. Fetching DuckDB query results as Arrow Table
Using the Python DB API:
Using the Python Relation API:
From C++: (already a streaming API!)
See `tools/pythonpkg/tests/test_arrow.py` and `test/api/test_arrow.cpp` for some more examples.

Many thanks to @wesm, @pitrou and @fsaintjacques for making this possible!