Appender support for Apache Arrow or columnar data #3412

Closed
mharmer opened this issue Apr 11, 2022 · 2 comments

Comments

@mharmer

mharmer commented Apr 11, 2022

I noticed there is a C API for fetching query results in Arrow format (added in #1978), but I don't currently see any support in the C API, or in the C or C++ appenders, for bulk-inserting Arrow/columnar data into the database.

Currently I have several billion rows of data to bulk insert that are already in columnar form in memory, and the only interfaces I'm aware of are the row-wise appenders. Inserting into a single table runs at around 50,000 rows per second (on older hardware); I assume the translation from columnar to row-wise and back is a likely bottleneck.

It doesn't appear that duckdb_data_chunk supports this either.

@Mytherin
Collaborator

You should be able to use duckdb_append_data_chunk to do batch/vectorized appends, which should indeed be much more efficient than the scalar functions.

@mharmer
Author

mharmer commented Apr 12, 2022

My mistake, the confusion came from looking at the header that was using the opaque type and I wasn't sure how to use it.

For future reference to myself and others:

  • It appears the implementation for duckdb_create_data_chunk actually returns a DataChunk*
  • The documentation for DataChunk isn't entirely clear that data can be written to it; the documented free functions all appear to be for reading data (with the exception of duckdb_vector_assign_string_element).
  • Presumably the methods on the DataChunk object itself can be used to write various data types
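
For anyone landing here later, the batch pattern discussed in this thread can be sketched in plain C without linking DuckDB. Everything named below (mock_chunk, append_fn, append_columns, counting_sink, BATCH_CAPACITY) is a hypothetical stand-in made up for illustration, not part of the DuckDB API. As far as I can tell, in the real C API the copy step would write into the buffers returned by duckdb_data_chunk_get_vector and duckdb_vector_get_data, the size would be set with duckdb_data_chunk_set_size, and the sink would be duckdb_append_data_chunk; treat that mapping as an assumption and check duckdb.h for your version.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed batch size; the real limit is whatever the library reports. */
#define BATCH_CAPACITY 2048

/* Hypothetical stand-in for a two-column chunk. NOT a DuckDB type. */
typedef struct {
    int64_t ids[BATCH_CAPACITY];
    double values[BATCH_CAPACITY];
    size_t size; /* number of valid rows in this batch */
} mock_chunk;

/* Hypothetical sink that plays the role of the chunk append call.
 * Returns 0 on success, non-zero on failure. */
typedef int (*append_fn)(const mock_chunk *chunk, void *user_data);

/* Copy two in-memory columns into fixed-size batches, column by column,
 * and hand each batch to the sink. No per-row conversion takes place.
 * Returns the number of rows appended, or -1 if the sink fails. */
static long append_columns(const int64_t *ids, const double *values,
                           size_t row_count, append_fn append,
                           void *user_data) {
    mock_chunk chunk;
    size_t offset = 0;
    while (offset < row_count) {
        size_t n = row_count - offset;
        if (n > BATCH_CAPACITY) {
            n = BATCH_CAPACITY;
        }
        /* One memcpy per column slice instead of one call per value. */
        memcpy(chunk.ids, ids + offset, n * sizeof *ids);
        memcpy(chunk.values, values + offset, n * sizeof *values);
        chunk.size = n;
        if (append(&chunk, user_data) != 0) {
            return -1;
        }
        offset += n;
    }
    return (long)offset;
}

/* Example sink that only counts what it receives. */
typedef struct {
    size_t batches;
    size_t rows;
} sink_stats;

static int counting_sink(const mock_chunk *chunk, void *user_data) {
    sink_stats *stats = (sink_stats *)user_data;
    stats->batches += 1;
    stats->rows += chunk->size;
    return 0;
}
```

Calling append_columns with 5,000 rows and counting_sink delivers three batches (2048, 2048 and 904 rows). The point of the design is that the per-call overhead is paid once per batch rather than once per value, which is the difference between the row-wise appender and the chunk-based append suggested above.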
