
Initial Arrow support #866

Merged
hannes merged 39 commits into master on Aug 26, 2020

Conversation

@hannes (Member) commented Aug 25, 2020

Apache Arrow defines a "standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations". There has been a long-standing feature request to support fetching DuckDB result sets as Arrow arrays (#151). In this PR, we define two interfaces between DuckDB and Arrow:

  1. Reading Arrow Arrays as tables from DuckDB queries
  2. Fetching DuckDB query results as Arrow Arrays

Thankfully, Arrow has defined a lightweight "C Data Interface" that allows us to provide both interfaces without any build or runtime dependency on the Arrow library itself. This interface is currently being extended to also allow streaming data in and out of Arrow without said dependency. DuckDB already adopts the proposed streaming interface internally.
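
For reference, the streaming half of that C interface is just a small struct of callbacks, roughly as defined in the Arrow C stream interface proposal (reproduced here for illustration, not code from this PR):

struct ArrowArrayStream {
    // get the schema shared by all batches in the stream
    int (*get_schema)(struct ArrowArrayStream *, struct ArrowSchema *out);
    // get the next batch; an array with a NULL release callback signals end of stream
    int (*get_next)(struct ArrowArrayStream *, struct ArrowArray *out);
    // description of the last error, if any
    const char *(*get_last_error)(struct ArrowArrayStream *);
    // release resources held by the stream
    void (*release)(struct ArrowArrayStream *);
    void *private_data;
};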

For now, the Arrow bridge is implemented for DuckDB's C++ and Python APIs. There are also some restrictions on the kind of Arrow arrays that can be passed (see the sketch after this list):

  • Data types: Only signed integer types, floating point types and strings are supported
  • Dictionaries are not supported
  • Nested types are not supported
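
To make the restriction concrete, here is a rough sketch (not code from this PR; MapArrowFormat is a hypothetical helper) of how the supported types line up with the format strings of the C data interface; anything outside this set is rejected for now:

LogicalType MapArrowFormat(const std::string &format) {
    if (format == "c") return LogicalType::TINYINT;  // int8
    if (format == "s") return LogicalType::SMALLINT; // int16
    if (format == "i") return LogicalType::INTEGER;  // int32
    if (format == "l") return LogicalType::BIGINT;   // int64
    if (format == "f") return LogicalType::FLOAT;    // float32
    if (format == "g") return LogicalType::DOUBLE;   // float64
    if (format == "u") return LogicalType::VARCHAR;  // utf8 string
    throw std::runtime_error("Unsupported Arrow type: " + format);
}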

Below are some usage examples.

1. Reading Arrow Arrays as tables from DuckDB queries

Reading Arrow arrays from DuckDB is implemented as a table-producing function, arrow_scan. This function takes an ArrowArrayStream* parameter. The Python API provides a wrapper so we can use pyarrow.Table instances.

Using the Python Relation API, we can for example run arbitrary SQL on an Arrow Table from a Parquet file and convert to Pandas:

import duckdb
import pyarrow.parquet

my_arrow_table = pyarrow.parquet.read_table(parquet_filename) # for example
rel = duckdb.from_arrow_table(my_arrow_table) # or just arrow()
print(rel.query("arrow", "SELECT * FROM arrow LIMIT 10").df())

From C++:

ArrowArrayStream stream;
// fill stream with some meaning, then
auto result = con.TableFunction("arrow_scan", {Value::POINTER((uintptr_t)&stream)})->Execute();
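
"Fill stream with some meaning" boils down to installing the four stream callbacks from the C stream interface. A minimal hand-rolled sketch (not code from this PR; FillStream is a hypothetical helper that advertises a single INT32 column and immediately reports end of stream):

static void ReleaseSchema(ArrowSchema *schema) {
    schema->release = nullptr;
}

static int StreamGetSchema(ArrowArrayStream *, ArrowSchema *out) {
    static ArrowSchema child;
    static ArrowSchema *children[] = {&child};
    memset(&child, 0, sizeof(child));
    child.format = "i"; // int32 column named "a"
    child.name = "a";
    child.release = ReleaseSchema;

    memset(out, 0, sizeof(*out));
    out->format = "+s"; // top-level schema describes a record batch as a struct
    out->n_children = 1;
    out->children = children;
    out->release = ReleaseSchema;
    return 0;
}

static int StreamGetNext(ArrowArrayStream *, ArrowArray *out) {
    memset(out, 0, sizeof(*out));
    out->release = nullptr; // an array without a release callback means end of stream
    return 0;
}

static const char *StreamGetLastError(ArrowArrayStream *) {
    return nullptr;
}

static void StreamRelease(ArrowArrayStream *stream) {
    stream->release = nullptr;
}

void FillStream(ArrowArrayStream &stream) {
    stream.get_schema = StreamGetSchema;
    stream.get_next = StreamGetNext;
    stream.get_last_error = StreamGetLastError;
    stream.release = StreamRelease;
    stream.private_data = nullptr;
}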

2. Fetching DuckDB query results as Arrow Table

Using the Python DB API:

import duckdb

con = duckdb.connect()
con.execute('select 44')
print(con.fetch_arrow_table()) # or just arrow()

Using the Python Relation API:

print(duckdb.values([42]).to_arrow_table()) # or just arrow() 

From C++: (already a streaming API!)

DuckDB db(nullptr);
Connection con(db);
auto result = con.Query("SELECT 42");

ArrowSchema arrow_schema;
result->ToArrowSchema(&arrow_schema);
// do something with arrow_schema's contents

while(true) {
    ArrowArray arrow_chunk;
    auto chunk = result->Fetch();
    if (!chunk || chunk->size() == 0) break;
    chunk->ToArrowArray(&arrow_chunk);
    // Do something with arrow_chunk's contents
}
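
One caveat (an assumption based on the C data interface contract rather than something spelled out in this PR): the consumer owns what it receives, so once a chunk and the schema have been processed their release callbacks should be invoked:

    // inside the loop, after the arrow_chunk contents have been consumed
    if (arrow_chunk.release) {
        arrow_chunk.release(&arrow_chunk);
    }

// after the loop
if (arrow_schema.release) {
    arrow_schema.release(&arrow_schema);
}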

See tools/pythonpkg/tests/test_arrow.py and test/api/test_arrow.cpp for some more examples.

Many thanks to @wesm, @pitrou and @fsaintjacques for making this possible!

@hannes requested a review from Mytherin on Aug 25, 2020, 15:37

@hannes (Member Author) commented Aug 25, 2020

@Mytherin can you have a look please?

@Mytherin (Collaborator): Looks good to me

@hannes linked an issue on Aug 26, 2020 that may be closed by this pull request

@wesm left a comment

Cool! Thanks for being the guinea pig on the C interface iterator. Getting that working well going the other direction will help us harden the interactions with pyarrow; that could be tested in a branch and then merged later once it's available in the production pyarrow packages.

@pitrou may have more detailed comments on the specifics of the C interface.

} else if (format == "d:38,0") { // decimal128
    return_types.push_back(LogicalType::HUGEINT);
} else if (format == "u") {
    return_types.push_back(LogicalType::VARCHAR);

@wesm: Are your VARCHARs required to be UTF-8? Curious how Arrow BINARY might be handled.

@hannes (Member Author): They are required to be UTF-8, yes.

@Mytherin (Collaborator): To add to that, we have a LogicalType::BLOB that can handle arbitrary binary blobs.

bitset<STANDARD_VECTOR_SIZE + 8> temp_nullmask;
memcpy(&temp_nullmask, (uint8_t *)array.buffers[0] + bit_offset / 8, n_bitmask_bytes + 1);

temp_nullmask >>= (bit_offset % 8); // why this has to be a right shift is a mystery to me

    StringVector::AddString(output.data[col_idx], cptr, str_len);
    break;
case UnicodeType::UNICODE:
    // this regrettably copies to normalize

@wesm: Why is that?

for (idx_t row = 0; row < output.size(); row++) {
    auto source_idx = data.chunk_offset + row;

    auto ms = src_ptr[source_idx] / 1000000; // nanoseconds

@wesm: Need to use the time unit from the schema -- this will be incorrect for units other than nanoseconds?

@hannes (Member Author): Yes, we need to do another iteration on the many time types that Arrow seems to have.
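
To illustrate, a hedged sketch of what unit-aware handling might look like: the Arrow format string encodes the unit ("tss:", "tsm:", "tsu:", "tsn:" followed by an optional timezone), so the conversion factor has to come from the schema instead of being hard-coded to nanoseconds. The helper name and the millisecond target are assumptions for illustration only:

int64_t TimestampToMilliseconds(int64_t value, const std::string &format) {
    if (format.rfind("tss:", 0) == 0) {
        return value * 1000; // seconds -> milliseconds
    } else if (format.rfind("tsm:", 0) == 0) {
        return value; // already milliseconds
    } else if (format.rfind("tsu:", 0) == 0) {
        return value / 1000; // microseconds -> milliseconds
    } else if (format.rfind("tsn:", 0) == 0) {
        return value / 1000000; // nanoseconds -> milliseconds
    }
    throw std::runtime_error("unsupported timestamp format: " + format);
}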


out_schema->children = root_holder->children.get();

out_schema->format = "+s"; // struct apparently

@wesm: More precisely you could say that the C interface uses a struct to communicate an Arrow record batch (to avoid having an unwieldy dichotomy between an Array and a RecordBatch).

https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst#record-batches

@hannes (Member Author): I am quite happy with the struct approach.

        batches.append(batch_import_func((uint64_t)&data, (uint64_t)&schema));
    }
    return from_batches_func(batches, schema_obj);
}

@wesm: As soon as pyarrow has built-in support for the C interface stream/iterator (i.e. hopefully in the next release!) this can be basically dropped.

@hannes merged commit 9d86f78 into master on Aug 26, 2020

@wesm commented Aug 26, 2020

Need to open a follow up issue to fix the timestamp issue?

@hannes (Member Author) commented Aug 26, 2020

@wesm yes, it's in #868

Linked issue: Support Arrow for table scans and result sets