Refactor and nested types support for Parquet Reader #1314

Merged on Jan 24, 2021 (43 commits)

hannes (Member) commented Jan 21, 2021

This PR refactors and extends the Parquet reader. The major feature addition is support for nested types in Parquet files, which are mapped to DuckDB's STRUCT and LIST types. Under the hood, the Parquet reader now reads strings without copying them, which should improve performance. While I was at it, I also added DATE support to the Parquet reader, which should improve the TPC-H timings ^^

hannes (Member, Author) commented Jan 22, 2021

Performance went down ~10%, so I need to investigate what's going on there.

hannes merged commit fa178a2 into duckdb:master on Jan 24, 2021

hannes (Member, Author) commented Jan 24, 2021

CC @oerling

hannes deleted the parquetrefactor branch January 24, 2021 12:30

hannes (Member, Author) commented Jan 24, 2021

@apkwan @dforsber this changes quite a lot in the Parquet reader, so perhaps you can try the latest master before we release.

rprovodenko commented

It seems that there is a bug: after all data has been Fetched, the parquet reader returns nullptr instead of an empty DataChunk

hannes (Member, Author) commented Jan 27, 2021

> It seems that there is a bug: after all data has been Fetched, the parquet reader returns nullptr instead of an empty DataChunk

Could you please post a full example of what goes wrong in a new issue? Not sure I understand.

rprovodenko commented

> > It seems that there is a bug: after all data has been Fetched, the parquet reader returns nullptr instead of an empty DataChunk
>
> Could you please post a full example of what goes wrong in a new issue? Not sure I understand.

Ah, actually, disregard that. It seems that the behaviour of duckdb has changed in query-result.

Used to be:

	//! Fetches a DataChunk from the query result. Returns an empty chunk if the result is empty, or nullptr on failure.
	virtual unique_ptr<DataChunk> Fetch() = 0;

Now:

	//! Fetches a DataChunk of normalized (flat) vectors from the query result.
	//! Returns nullptr if there are no more results to fetch.
	DUCKDB_API virtual unique_ptr<DataChunk> Fetch();

So, once the result is exhausted, it used to return an empty chunk; now it returns nullptr.
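For anyone hitting the same thing, here is a minimal sketch of a fetch loop that tolerates both end-of-result conventions. It does not use DuckDB itself: `MockChunk`, `MockResult`, and `CountRows` are hypothetical stand-ins written for this example, and only the `Fetch()` return type and the two end-of-result behaviours are taken from the header comments quoted above.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a data chunk; NOT the real DuckDB DataChunk.
struct MockChunk {
    std::size_t row_count = 0;
    std::size_t size() const { return row_count; }
};

// Hypothetical stand-in for a query result; NOT the real DuckDB QueryResult.
// It hands out chunks with the given row counts and then signals the end
// either with an empty chunk (old convention) or with nullptr (new convention).
class MockResult {
public:
    MockResult(std::vector<std::size_t> chunk_sizes, bool legacy_end)
        : chunk_sizes_(std::move(chunk_sizes)), legacy_end_(legacy_end) {}

    std::unique_ptr<MockChunk> Fetch() {
        if (next_ < chunk_sizes_.size()) {
            auto chunk = std::make_unique<MockChunk>();
            chunk->row_count = chunk_sizes_[next_++];
            return chunk;
        }
        if (legacy_end_) {
            // Old convention: an empty chunk means "no more data".
            return std::make_unique<MockChunk>();
        }
        // New convention: nullptr means "no more data".
        return nullptr;
    }

private:
    std::vector<std::size_t> chunk_sizes_;
    bool legacy_end_;
    std::size_t next_ = 0;
};

// Counts all rows, stopping on either nullptr or an empty chunk.
template <class Result>
std::size_t CountRows(Result &result) {
    std::size_t total = 0;
    while (true) {
        auto chunk = result.Fetch();
        if (!chunk || chunk->size() == 0) {
            break; // end of result under either convention
        }
        total += chunk->size();
    }
    return total;
}
```

Checking for both conditions lets the same loop work before and after the change. One caveat: under the old comment nullptr meant failure, so a loop like this would treat a failed fetch as a normal end of result; code that needs to tell the two apart would have to check the result's error state separately.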
