
Add support for querying/attaching external databases #5048

Closed
wants to merge 7 commits

Conversation

@kouta-kun (Contributor) commented Oct 20, 2022:

This PR adds support for two functions that allow for loading tables from external duckdb databases:

> call attach_external(DATABASE_PATH, overwrite=[true/false], temporary=[true/false]);
This one loads the tables from DATABASE_PATH into the current database. A big issue right now is that using temporary=true produces a segmentation fault. Apparently the view gets added into the CatalogSet with a parent pointer of 0x70, but I haven't been able to find the cause of that.

> select * from query_external(DATABASE_PATH, TABLE_NAME);
This one returns a temporary table with the contents of table TABLE_NAME from DATABASE_PATH. It was implemented as a workaround for the temporary=true crash, but some people might prefer to use it instead of attach_external.

Partially accomplishes #1985 (no remote support yet)

@Mytherin (Collaborator) left a comment:

Thanks for the PR! This is very exciting functionality, but there is still some work to be done. In particular, we should also think about benchmarking this. Ideally, a scan of a different database will perform exactly the same as a scan of the same database.

Several comments below:


namespace duckdb {

// this is private in query_result for some reason, I'd rather include it from there but for now I'll do this
Mytherin (Collaborator):

It is private because you're not supposed to access the internals of the iterator - but are supposed to use it through the begin and end methods on the query result. Either way, copying the code seems like a bad idea.

I don't think you want to use the QueryResultIterator to begin with for this functionality as it is extremely slow and mostly provided as a convenience wrapper. See below.
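A minimal, self-contained sketch (plain C++, not DuckDB's actual classes) of the pattern being suggested here: the iterator type stays private, and consumers go through the public begin()/end() pair instead of copying the iterator's internals:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a query result. The iterator's internals are private;
// range-for works purely through the public begin()/end() interface.
class Result {
public:
	explicit Result(std::vector<int> rows) : rows_(std::move(rows)) {}

	class Iterator {
	public:
		int operator*() const { return owner_->rows_[pos_]; }
		Iterator &operator++() { ++pos_; return *this; }
		bool operator!=(const Iterator &other) const { return pos_ != other.pos_; }
	private:
		friend class Result;
		Iterator(const Result *owner, std::size_t pos) : owner_(owner), pos_(pos) {}
		const Result *owner_;
		std::size_t pos_;
	};

	Iterator begin() const { return Iterator(this, 0); }
	Iterator end() const { return Iterator(this, rows_.size()); }

private:
	std::vector<int> rows_;
};
```

Usage is then just `for (auto v : result) { ... }`, with no need to duplicate any private code.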

result->file_name = input.inputs[0].GetValue<string>();
result->table = input.inputs[1].GetValue<string>();

result->database = make_unique<DuckDB>(result->file_name);
Mytherin (Collaborator):

This should probably open a read-only connection to avoid contention. Ideally the buffer manager and thread pool are also shared between the instances, but we can implement that as an enhancement later.

data.rowiterator = make_unique<QueryResultIterator>(data.tablerelation.get());
}

auto dconn = Connection(context.db->GetDatabase(context));
Mytherin (Collaborator):

Opening a new connection seems unnecessary here.


DuckDB db {data.file_name};
auto econn = Connection(db);
auto dconn = Connection(context.db->GetDatabase(context));
Mytherin (Collaborator) commented Oct 21, 2022:

We shouldn't be opening a new connection to the database, we need to use the existing client context that is passed in (ClientContext), otherwise our transaction context does not carry over. This is likely why you were hitting that segfault.


idx_t count = 0;

while ((data.rowiterator)->result != nullptr && count < STANDARD_VECTOR_SIZE) {
Mytherin (Collaborator):

Iterating over a result set one-row-at-a-time is extremely slow. This needs to be changed to iterate over data chunks instead. Ideally also in parallel.
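To illustrate the access pattern being asked for, here is a self-contained sketch (a plain `std::vector<int>` stands in for a column, so this is not DuckDB code): data is consumed one chunk at a time, mirroring how DuckDB hands out DataChunk objects of up to STANDARD_VECTOR_SIZE rows, rather than one row at a time:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Mirrors STANDARD_VECTOR_SIZE; the exact value is an assumption here.
constexpr std::size_t kChunkSize = 2048;

// Copies `source` into `sink` one chunk at a time. The per-chunk bulk
// insert replaces the slow per-row path (one push_back per row).
void CopyInChunks(const std::vector<int> &source, std::vector<int> &sink) {
	for (std::size_t offset = 0; offset < source.size(); offset += kChunkSize) {
		std::size_t count = std::min(kChunkSize, source.size() - offset);
		sink.insert(sink.end(), source.begin() + offset,
		            source.begin() + offset + count);
	}
}
```

The same shape applies to the table function: each call to the scan function should fill one output chunk from one input chunk, which also makes parallelizing over chunks straightforward.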

attach_duckdb.named_parameters["temporary"] = LogicalType::BOOLEAN;
set.AddFunction(attach_duckdb);

TableFunction query_external_duckdb("query_external", {LogicalType::VARCHAR, LogicalType::VARCHAR},
Mytherin (Collaborator):

query_duckdb?


vector<vector<Value>> table_values;
unique_ptr<DataChunk> trow = nullptr;
while ((trow = table->Fetch()) != nullptr) {
Mytherin (Collaborator):

This copies over the entire table, which is undesirable. This should be rewritten to create views over the query_external function.
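One hypothetical shape for that rewrite, sketched as a helper that emits a CREATE VIEW statement over query_external (the function and table names come from this PR; the helper itself is illustrative and does not handle identifier-escaping edge cases):

```cpp
#include <string>

// Instead of materializing TABLE_NAME, attach_external could register a
// view whose body calls query_external, so rows are read lazily at scan
// time rather than copied at attach time.
std::string BuildViewSQL(const std::string &db_path, const std::string &table) {
	return "CREATE VIEW \"" + table + "\" AS SELECT * FROM query_external('" +
	       db_path + "', '" + table + "');";
}
```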

for (idx_t row_idx = 0; row_idx < trow->size(); row_idx++) {
vector<Value> row_values;
for (idx_t col = 0; col < table->names.size(); col++) {
row_values.push_back(trow->GetValue(col, row_idx));
Mytherin (Collaborator):

While this materialization should be removed entirely, I would just like to re-iterate that any use of the Value API is extremely slow. The Value type is intended to be used during planning and optimizing, not during execution.
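The cost difference can be illustrated with a self-contained stand-in (a `std::variant` plays the role of `duckdb::Value`; this is not DuckDB code): per-cell access boxes every cell into a tagged union, while vectorized execution reads the typed column buffer directly:

```cpp
#include <cstdint>
#include <variant>
#include <vector>

// Stand-in for duckdb::Value: a tagged union over possible cell types.
using BoxedValue = std::variant<int64_t, double>;

// Slow path: box every cell before using it, as GetValue-style code does.
int64_t SumBoxed(const std::vector<int64_t> &col) {
	int64_t total = 0;
	for (int64_t raw : col) {
		BoxedValue v = raw; // one box (tag + branch on access) per cell
		total += std::get<int64_t>(v);
	}
	return total;
}

// Fast path: operate on the typed buffer directly, as execution code should.
int64_t SumDirect(const std::vector<int64_t> &col) {
	int64_t total = 0;
	for (int64_t raw : col) {
		total += raw;
	}
	return total;
}
```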

result->database = make_unique<DuckDB>(result->file_name);
result->connection = make_unique<Connection>(*result->database);

result->tablerelation = result->connection->Table(result->table)->Execute();
Mytherin (Collaborator):

Generally we don't want to do any execution in the bind. We can use TableInfo to fetch the metadata of the table. The actual execution should happen in the table_function_init_global_t function. Otherwise we also (pointlessly) materialize an entire table when EXPLAIN or DESCRIBE are used.
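A self-contained sketch of the bind/init split being asked for (the names are illustrative, not DuckDB's actual signatures): bind resolves metadata only, so EXPLAIN and DESCRIBE never touch the data, and the scan is set up in the init step:

```cpp
#include <string>
#include <vector>

// Produced at bind time: schema only, no rows.
struct BindData {
	std::string table;
	std::vector<std::string> column_names;
};

// Produced at init time: holds the actual scan state.
struct GlobalState {
	std::vector<int> rows; // stands in for the opened scan
};

// Analogue of the bind function: a TableInfo-style metadata lookup.
BindData Bind(const std::string &table) {
	return BindData{table, {"a", "b"}}; // column names would come from metadata
}

// Analogue of table_function_init_global_t: execution may start here,
// and only here, so planning-only statements stay cheap.
GlobalState InitGlobal(const BindData &bind) {
	(void)bind;
	return GlobalState{{1, 2, 3}};
}
```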

string table = "";
unique_ptr<DuckDB> database = nullptr;
unique_ptr<Connection> connection = nullptr;
unique_ptr<QueryResult> tablerelation = nullptr;
Mytherin (Collaborator):

We should not be using a QueryResult here to begin with, as the QueryResult structure will invoke an extra copy of the table unnecessarily. Can't we scan the underlying DataTable directly?

In fact - we already have a function that efficiently scans a DataTable object, namely our own TableScan (in table_scan.cpp). Perhaps we can just re-engineer that so that it also works for external tables?

@neverchanje commented:

Hi @kouta-kun, is there any progress on this PR? It's a very interesting feature, and I also have a similar requirement: to union two duckdb tables that share the same schema. I think it would be a very good enhancement to duckdb.
