Skip to content

[Relation] Add MaterializedRelation #11835

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
May 3, 2024
Merged

Conversation

Tishj
Copy link
Contributor

@Tishj Tishj commented Apr 25, 2024

This PR adds a new Relation type.

MaterializedRelation

This Relation represents a materialized data set, likely produced by a MaterializedQueryResult.
It contains a ColumnDataCollection and allows us to efficiently scan this with the existing Logical + PhysicalColumnDataScan.

ColumnDataRef

To do this it introduces the ColumnDataRef and BoundColumnDataRef classes.
We don't want to make unnecessary copies of this (potentially giant) column data collection, so the ColumnDataRef only gets a reference to the collection.

Because TableRef's need to be Serializable though, we can't rely on this reference after deserialization.
If the ColumnDataRef is serialized to disk, it writes the entire column data collection to disk as well.
When the ColumnDataRef is deserialized, it creates a copy of the collection that it now owns.

As a result, the ColumnDataRef has an optionally_owned_ptr, and this trickles down into BoundColumnDataRef, LogicalColumnDataScan and PhysicalColumnDataScan (which already had this quality)

optionally_owned_ptr

To encapsulate the scenario where a class contains this pattern:

//! The owned object (if any)
unique_ptr<T> owned_object;
//! The referenced object
optional_ptr<T> object;

We create a new class optionally_owned_ptr which makes sure that these constraints are respected:

  • if owned_object is set, object must refer to it
  • if the instance is moved, both pointers become null
  • the instance can not be copied

is_owned() can be used to check whether the pointer is owning
Because the class is neither an optional_ptr or a unique_ptr it is intentionally not convertible to either.

Usage

In the Python API, in the internal execute method (shared by query/sql and execute) we previously created a ValueRelation if the statement was not a SELECT, this is now replaced by a MaterializedRelation.

Benchmark results:

These are timings measuring fetchall() from relations created with either CALL or SELECT statements.
The CALL statements utilize the MaterializedRelation.

call_repeat_row_10000000
timing: 4.085824966430664

call_repeat_row_2048
timing: 0.0007717609405517578

call_repeat_row_2500000
timing: 1.0177351236343384

call_repeat_row_50000
timing: 0.018980026245117188

select_repeat_row_10000000
timing: 4.076231598854065

select_repeat_row_2048
timing: 0.0007649660110473633

select_repeat_row_2500000
timing: 1.0249409675598145

select_repeat_row_50000
timing: 0.019537687301635742

Future work

The plan is to extend the DuckDBPyRelation to store a MaterializedRelation internally when DuckDBPyRelation.execute is called, any future reference of the relation after that (by a replacement scan or by incrementally building on top of it) will make use of the MaterializedRelation.

Tishj added 26 commits April 23, 2024 09:52
…ataCollection and Copy is called, we perform a copy of the column data collection
@carlopi
Copy link
Contributor

carlopi commented Apr 25, 2024

As always, impressive work.

Only note for posterity: this might make so that going through serialization and then deserialization we might start from 1 object somewhere in memory to having a few copies, potentially with degenerate cases like circular linked list that expand to infinite data.

Responsibility is on the users of optionally_owned_ptr to guarantee that data structure are acyclical.

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks great - some minor comments.

  • Maybe we can also do some tests where we create a materialized relation and then query it.
  • Perhaps also something like, we destroy the connection it came from and then query it in a different one.
  • Maybe also some performance tests, e.g. how fast is the query of a materialized query result?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 26, 2024 13:13
@Tishj
Copy link
Contributor Author

Tishj commented Apr 26, 2024

Maybe we can also do some tests where we create a materialized relation and then query it.

I'll add those 👍

Perhaps also something like, we destroy the connection it came from and then query it in a different one.

MaterializedRelations are currently not standalone, I think this requires rethinking the structure of Relation a bit
Perhaps we need to add a BaseRelation.
Relation would then add the ClientContextWrapper, MaterializedRelation would inherit from BaseRelation instead.
We could then decide on a case by case basis if MaterializedRelation can support it (CreateView is a no for example?)

I think this change will not be trivial and should probably be separate.

Maybe also some performance tests, e.g. how fast is the query of a materialized query result?

Good idea, I'll add it to the regression_test_python.py script

@Tishj Tishj marked this pull request as ready for review April 29, 2024 14:10
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 29, 2024 14:11
@Tishj Tishj marked this pull request as ready for review April 30, 2024 07:13
@Mytherin
Copy link
Collaborator

Mytherin commented Apr 30, 2024

The benchmarks seem to use too much memory for the CI, could you perhaps reduce the load there?

Run python scripts/regression_test_python.py --threads=2 --out-file=current.csv
/home/runner/work/_temp/dab0e3f6-9880-413b-9a38-42d9f6fa3f8e.sh: line 1: 7239 Killed python scripts/regression_test_python.py --threads=2 --out-file=current.csv
Error: Process completed with exit code 137.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 1, 2024 08:45
@Tishj Tishj marked this pull request as ready for review May 1, 2024 11:57
@Mytherin Mytherin merged commit e484b61 into duckdb:main May 3, 2024
@Mytherin
Copy link
Collaborator

Mytherin commented May 3, 2024

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 6, 2024
Merge pull request duckdb/duckdb#11835 from Tishj/materialized_relation
Merge pull request duckdb/duckdb#11913 from hawkfish/icu-basictz
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants