
Add support for querying/attaching external databases #5048

Closed
wants to merge 7 commits

Conversation

@kouta-kun (Contributor) commented Oct 20, 2022:

This PR adds support for two functions that allow for loading tables from external duckdb databases:

> call attach_external(DATABASE_PATH, overwrite=[true/false], temporary=[true/false]);
This one loads the tables from DATABASE_PATH into the current database. A big issue right now is that using temporary=true produces a segmentation fault. Apparently the view gets added into the CatalogSet with a parent pointer of 0x70, but I haven't been able to find the cause of that.

> select * from query_external(DATABASE_PATH, TABLE_NAME);
This one returns a temporary table with the contents of table TABLE_NAME from DATABASE_PATH. It was implemented as a workaround for the temporary=true crash, but some people might prefer to use it instead of attach_external.

Partially accomplishes #1985 (no remote support yet)

@Mytherin (Collaborator) left a comment:

Thanks for the PR! This is very exciting functionality, but there is still some work to be done. In particular, we should also think about benchmarking this. Ideally, a scan of a different database will perform exactly the same as a scan of the same database.

Several comments below:


namespace duckdb {

// this is private in query_result for some reason, I'd rather include it from there but for now I'll do this
Mytherin (Collaborator):

It is private because you're not supposed to access the internals of the iterator - but are supposed to use it through the begin and end methods on the query result. Either way, copying the code seems like a bad idea.

I don't think you want to use the QueryResultIterator to begin with for this functionality as it is extremely slow and mostly provided as a convenience wrapper. See below.
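A minimal, self-contained sketch (plain C++, not DuckDB's actual classes) of the pattern being suggested here: the iterator type stays private, and consumers go through the public begin()/end() pair instead of copying the iterator's internals:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a query result. The iterator's internals are private;
// range-for works purely through the public begin()/end() interface.
class Result {
public:
	explicit Result(std::vector<int> rows) : rows_(std::move(rows)) {}

	class Iterator {
	public:
		int operator*() const { return owner_->rows_[pos_]; }
		Iterator &operator++() { ++pos_; return *this; }
		bool operator!=(const Iterator &other) const { return pos_ != other.pos_; }
	private:
		friend class Result;
		Iterator(const Result *owner, std::size_t pos) : owner_(owner), pos_(pos) {}
		const Result *owner_;
		std::size_t pos_;
	};

	Iterator begin() const { return Iterator(this, 0); }
	Iterator end() const { return Iterator(this, rows_.size()); }

private:
	std::vector<int> rows_;
};
```

Usage is then just `for (auto v : result) { ... }`, with no need to duplicate any private code.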

result->file_name = input.inputs[0].GetValue<string>();
result->table = input.inputs[1].GetValue<string>();

result->database = make_unique<DuckDB>(result->file_name);
Mytherin (Collaborator):

This should probably open a read-only connection to avoid contention. Ideally the buffer manager and thread pool are also shared between the instances, but we can implement that as an enhancement later.

data.rowiterator = make_unique<QueryResultIterator>(data.tablerelation.get());
}

auto dconn = Connection(context.db->GetDatabase(context));
Mytherin (Collaborator):

Opening a new connection seems unnecessary here.


DuckDB db {data.file_name};
auto econn = Connection(db);
auto dconn = Connection(context.db->GetDatabase(context));
Mytherin (Collaborator) commented Oct 21, 2022:

We shouldn't be opening a new connection to the database, we need to use the existing client context that is passed in (ClientContext), otherwise our transaction context does not carry over. This is likely why you were hitting that segfault.


idx_t count = 0;

while ((data.rowiterator)->result != nullptr && count < STANDARD_VECTOR_SIZE) {
Mytherin (Collaborator):

Iterating over a result set one-row-at-a-time is extremely slow. This needs to be changed to iterate over data chunks instead. Ideally also in parallel.
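To illustrate the access pattern being asked for, here is a self-contained sketch (a plain `std::vector<int>` stands in for a column, so this is not DuckDB code): data is consumed one chunk at a time, mirroring how DuckDB hands out DataChunk objects of up to STANDARD_VECTOR_SIZE rows, rather than one row at a time:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Mirrors STANDARD_VECTOR_SIZE; the exact value is an assumption here.
constexpr std::size_t kChunkSize = 2048;

// Copies `source` into `sink` one chunk at a time. The per-chunk bulk
// insert replaces the slow per-row path (one push_back per row).
void CopyInChunks(const std::vector<int> &source, std::vector<int> &sink) {
	for (std::size_t offset = 0; offset < source.size(); offset += kChunkSize) {
		std::size_t count = std::min(kChunkSize, source.size() - offset);
		sink.insert(sink.end(), source.begin() + offset,
		            source.begin() + offset + count);
	}
}
```

The same shape applies to the table function: each call to the scan function should fill one output chunk from one input chunk, which also makes parallelizing over chunks straightforward.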

attach_duckdb.named_parameters["temporary"] = LogicalType::BOOLEAN;
set.AddFunction(attach_duckdb);

TableFunction query_external_duckdb("query_external", {LogicalType::VARCHAR, LogicalType::VARCHAR},
Mytherin (Collaborator):

query_duckdb?


vector<vector<Value>> table_values;
unique_ptr<DataChunk> trow = nullptr;
while ((trow = table->Fetch()) != nullptr) {
Mytherin (Collaborator):

This copies over the entire table, which is undesirable. This should be rewritten to create views over the query_external function.
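One hypothetical shape for that rewrite, sketched as a helper that emits a CREATE VIEW statement over query_external (the function and table names come from this PR; the helper itself is illustrative and does not handle identifier-escaping edge cases):

```cpp
#include <string>

// Instead of materializing TABLE_NAME, attach_external could register a
// view whose body calls query_external, so rows are read lazily at scan
// time rather than copied at attach time.
std::string BuildViewSQL(const std::string &db_path, const std::string &table) {
	return "CREATE VIEW \"" + table + "\" AS SELECT * FROM query_external('" +
	       db_path + "', '" + table + "');";
}
```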

for (idx_t row_idx = 0; row_idx < trow->size(); row_idx++) {
vector<Value> row_values;
for (idx_t col = 0; col < table->names.size(); col++) {
row_values.push_back(trow->GetValue(col, row_idx));
Mytherin (Collaborator):

While this materialization should be removed entirely, I would just like to re-iterate that any use of the Value API is extremely slow. The Value type is intended to be used during planning and optimizing, not during execution.
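The cost difference can be illustrated with a self-contained stand-in (a `std::variant` plays the role of `duckdb::Value`; this is not DuckDB code): per-cell access boxes every cell into a tagged union, while vectorized execution reads the typed column buffer directly:

```cpp
#include <cstdint>
#include <variant>
#include <vector>

// Stand-in for duckdb::Value: a tagged union over possible cell types.
using BoxedValue = std::variant<int64_t, double>;

// Slow path: box every cell before using it, as GetValue-style code does.
int64_t SumBoxed(const std::vector<int64_t> &col) {
	int64_t total = 0;
	for (int64_t raw : col) {
		BoxedValue v = raw; // one box (tag + branch on access) per cell
		total += std::get<int64_t>(v);
	}
	return total;
}

// Fast path: operate on the typed buffer directly, as execution code should.
int64_t SumDirect(const std::vector<int64_t> &col) {
	int64_t total = 0;
	for (int64_t raw : col) {
		total += raw;
	}
	return total;
}
```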

result->database = make_unique<DuckDB>(result->file_name);
result->connection = make_unique<Connection>(*result->database);

result->tablerelation = result->connection->Table(result->table)->Execute();
Mytherin (Collaborator):

Generally we don't want to do any execution in the bind. We can use TableInfo to fetch the metadata of the table. The actual execution should happen in the table_function_init_global_t function. Otherwise we also (pointlessly) materialize an entire table when EXPLAIN or DESCRIBE are used.
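A self-contained sketch of the bind/init split being asked for (the names are illustrative, not DuckDB's actual signatures): bind resolves metadata only, so EXPLAIN and DESCRIBE never touch the data, and the scan is set up in the init step:

```cpp
#include <string>
#include <vector>

// Produced at bind time: schema only, no rows.
struct BindData {
	std::string table;
	std::vector<std::string> column_names;
};

// Produced at init time: holds the actual scan state.
struct GlobalState {
	std::vector<int> rows; // stands in for the opened scan
};

// Analogue of the bind function: a TableInfo-style metadata lookup.
BindData Bind(const std::string &table) {
	return BindData{table, {"a", "b"}}; // column names would come from metadata
}

// Analogue of table_function_init_global_t: execution may start here,
// and only here, so planning-only statements stay cheap.
GlobalState InitGlobal(const BindData &bind) {
	(void)bind;
	return GlobalState{{1, 2, 3}};
}
```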

string table = "";
unique_ptr<DuckDB> database = nullptr;
unique_ptr<Connection> connection = nullptr;
unique_ptr<QueryResult> tablerelation = nullptr;
Mytherin (Collaborator):

We should not be using a QueryResult here to begin with, as the QueryResult structure will invoke an extra copy of the table unnecessarily. Can't we scan the underlying DataTable directly?

In fact - we already have a function that efficiently scans a DataTable object, namely our own TableScan (in table_scan.cpp). Perhaps we can just re-engineer that so that it also works for external tables?

@neverchanje commented:

Hi @kouta-kun, is there any progress on this PR? It's a very interesting feature, and I also have a similar requirement: to union two duckdb tables that share the same schema. I think it would be a very good enhancement to duckdb.
