[Parquet | ExportDatabase] Deal with unsupported parquet types in EXPORT DATABASE. #8798

Merged on Sep 14, 2023 (30 commits)

Tishj (Contributor) commented Sep 5, 2023

Write as VARCHAR

We don't support writing some types to Parquet (yet).
To make sure this does not blow up in EXPORT/IMPORT DATABASE, we fall back to VARCHAR for types that are not supported, relying on the ability to round-trip the values to and from strings.

HUGEINT

HUGEINT was previously written as DOUBLE, but when testing with test_all_types() this results in a conversion exception on IMPORT, so I've removed that path and fall back to VARCHAR there as well.
HUGEINT is a type supported by the Parquet writer, but the write is lossy, so for EXPORT DATABASE we fall back to VARCHAR.
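
To illustrate why an integer written as DOUBLE may fail to come back, here is a minimal sketch. It is not DuckDB code: it uses int64_t as a stand-in for the 128-bit HUGEINT (an assumption made for portability), and both helper names are hypothetical. The same mechanism applies at 128 bits, where the largest value rounds up to a power of two that no longer fits the integer type.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: does `value` survive integer -> double -> integer?
bool RoundTripsThroughDouble(int64_t value) {
	const double as_double = static_cast<double>(value);
	// 2^63 is exactly representable as a double, but it is one past INT64_MAX,
	// so a double at or above it cannot be converted back.
	const double two_pow_63 = 9223372036854775808.0;
	if (as_double >= two_pow_63 || as_double < -two_pow_63) {
		return false; // out of range on the way back
	}
	return static_cast<int64_t>(as_double) == value;
}

// Hypothetical helper: does `value` survive integer -> string -> integer?
bool RoundTripsThroughString(int64_t value) {
	return std::stoll(std::to_string(value)) == value;
}
```

In this sketch, INT64_MAX fails the double round trip because it rounds up to 2^63, while the string round trip preserves it. That is the property the VARCHAR fallback relies on.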

UNION

Since UNION loses information on a cast to string, we add additional logic for it.
When exporting a database containing a UNION column to Parquet, the column is now written as a STRUCT instead, including the tag information.

Since EXPORT DATABASE writes a schema.sql, we don't lose the type information.
We allow casting a STRUCT to UNION if the struct value has a valid representation for UNION internals.
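
As a rough picture of what "a valid representation for UNION internals" could mean, the sketch below models one exported row as a tag plus one slot per member and checks a few plausible invariants. All type names, enum values, and rules here are assumptions made for illustration; they are not the checks DuckDB performs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one member slot in the exported STRUCT.
struct MemberSlot {
	bool is_null;
	std::string value; // ignored when is_null is true
};

// Hypothetical model of one exported row: the tag plus one slot per UNION member.
struct ExportedUnionRow {
	bool tag_is_null; // true when the whole UNION value is NULL
	std::size_t tag;  // index of the active member; ignored when tag_is_null
	std::vector<MemberSlot> members;
};

// Hypothetical result codes; not DuckDB's own enum.
enum class RowCheck { VALID, NO_MEMBERS, TAG_OUT_OF_RANGE, INACTIVE_MEMBER_SET };

RowCheck CheckRow(const ExportedUnionRow &row) {
	if (row.members.empty()) {
		return RowCheck::NO_MEMBERS;
	}
	if (!row.tag_is_null && row.tag >= row.members.size()) {
		return RowCheck::TAG_OUT_OF_RANGE;
	}
	// Only the member selected by the tag may hold a non-NULL value.
	for (std::size_t i = 0; i < row.members.size(); i++) {
		const bool is_active = !row.tag_is_null && i == row.tag;
		if (!is_active && !row.members[i].is_null) {
			return RowCheck::INACTIVE_MEMBER_SET;
		}
	}
	return RowCheck::VALID;
}
```

Under these assumed rules, a struct that sets two members at once, or whose tag points past the last member, would be rejected rather than cast to UNION.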

Tishj (Contributor, Author) commented Sep 5, 2023

As pointed out by Carlo, this does have a flaw when typed nulls are involved:

statement ok
begin transaction;

statement ok
create table tbl2 (a UNION(a bit, b bool));

statement ok
insert into tbl2 VALUES
	(NULL),
	(union_value(a := NULL)),
	(union_value(b := NULL));

statement ok
SELECT union_tag(a) FROM tbl2;
#┌────────────────┐
#│  union_tag(a)  │
#│ enum('a', 'b') │
#├────────────────┤
#│                │
#│ a              │
#│ b              │
#└────────────────┘

statement ok
EXPORT DATABASE '__TEST_DIR__/union' (FORMAT PARQUET);

statement ok
rollback;

statement ok
IMPORT DATABASE '__TEST_DIR__/union';

statement ok
SELECT union_tag(a) FROM tbl2;
#┌────────────────┐
#│  union_tag(a)  │
#│ enum('a', 'b') │
#├────────────────┤
#│                │
#│                │
#│                │
#└────────────────┘

Tishj (Contributor, Author) commented Sep 5, 2023

To fix this, we now also write the tag field in the Parquet export.
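
A small sketch of why the stored tag matters for typed NULLs, using the three rows from the test above. The names are hypothetical and the model is deliberately simplified (member validity only, with -1 standing in for a NULL tag); it is an illustration of the reasoning, not the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int NO_TAG = -1; // stands in for a NULL tag

// Validity of each member in one row: true means the member is non-NULL.
typedef std::vector<bool> MemberValidity;

// Without a stored tag, the best guess is the first non-NULL member.
int InferTag(const MemberValidity &valid) {
	for (std::size_t i = 0; i < valid.size(); i++) {
		if (valid[i]) {
			return static_cast<int>(i);
		}
	}
	return NO_TAG; // every member is NULL, so the tag cannot be recovered
}

// With a stored tag, the exported row carries the answer explicitly.
struct ExportedRow {
	int tag;
	MemberValidity valid;
};

int StoredTag(const ExportedRow &row) {
	return row.tag;
}
```

For NULL, union_value(a := NULL), and union_value(b := NULL), every member is NULL, so InferTag returns NO_TAG for all three, which matches the empty tags seen after IMPORT. With the tag written alongside the members, the three rows stay distinguishable as NO_TAG, 0, and 1.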

@github-actions github-actions bot marked this pull request as draft September 5, 2023 15:46
@Tishj Tishj marked this pull request as ready for review September 7, 2023 06:58
@github-actions github-actions bot marked this pull request as draft September 7, 2023 10:18
@Tishj Tishj marked this pull request as ready for review September 8, 2023 07:39
@github-actions github-actions bot marked this pull request as draft September 8, 2023 10:30
@Tishj Tishj marked this pull request as ready for review September 8, 2023 10:30
Mytherin (Collaborator) left a comment

Thanks for the PR! Looks good - two minor comments.

Could you also have a look into fully covering from_struct.cpp in the tests?

Review comments: extension/parquet/parquet_writer.cpp (outdated, resolved); src/common/types.cpp (resolved)

Tishj (Contributor, Author) commented Sep 11, 2023

> Thanks for the PR! Looks good - two minor comments.
>
> Could you also have a look into fully covering from_struct.cpp in the tests?

Some of them are already asserted by the AllowImplicitCastFromStruct function, such as:

		if (!cast_data.child_cast_info[i].function(source_child_vector, result_child_vector, count, child_parameters)) {
			all_converted = false;
		}

We have already made sure the types are the same, so this child cast should not fail.

		if (entry.init_local_state) {
			CastLocalStateParameters child_params(parameters, entry.cast_data);
			child_state = entry.init_local_state(child_params);
		}

I haven't seen anything use this when a direct cast is made, but I'll have another look.

	case UnionInvalidReason::NO_MEMBERS:
		throw ConversionException("The produced UNION does not have any members");

This can never happen, but I wanted to add cases for all known invalid reasons.

@github-actions github-actions bot marked this pull request as draft September 11, 2023 11:26
Mytherin (Collaborator) commented

I think the JSON casts use the local state

@Tishj Tishj marked this pull request as ready for review September 11, 2023 13:45
@github-actions github-actions bot marked this pull request as draft September 12, 2023 08:47
@Tishj Tishj marked this pull request as ready for review September 12, 2023 09:43
@Mytherin Mytherin merged commit ba71015 into duckdb:main Sep 14, 2023
48 of 49 checks passed
Mytherin (Collaborator) commented

Thanks!
