MultiFileReader refactor #11806

Merged

Mytherin merged 52 commits into duckdb:main from samansmink:parquet-refactor-pr1

May 3, 2024

Contributor

samansmink commented Apr 24, 2024

This PR is the first step of a refactor of the MultiFileReader.

Goals

The goals of the refactor as a whole are:

allow late expansion of a list of files or globs
allow switching the MultiFileReader of existing table functions (so that TableFunctions can be reused through the catalog from within an extension)

Current problem

To illustrate why this is important, let's look at an example:

SELECT * FROM "s3://bucket/**/*.parquet"

For the above query, we currently expand the glob fully in the bind phase. This can be very slow if the bucket is large, for several reasons:

the bind phase is single-threaded, so this process is fully sequential
we cannot reduce the amount of file listing through limits or filter pushdown
this does not allow extensions to implement more clever ways to scan big sets of files
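To make the cost of eager expansion concrete, here is a minimal sketch of a lazily expanded listing. It is not DuckDB code: `LazyListingSketch`, `TakeFiles` and the generated file names are hypothetical, and each call to `FetchNextPage` stands in for one remote listing request. The point is only that a consumer with a limit triggers as many listing requests as it needs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical lazy listing: files are discovered page by page, on demand.
class LazyListingSketch {
public:
	LazyListingSketch(std::size_t total_files_p, std::size_t page_size_p)
	    : total_files(total_files_p), page_size(page_size_p == 0 ? 1 : page_size_p) {
	}

	// Writes the i-th path into `result`; returns false once the listing is exhausted.
	// Further pages are only fetched when `i` lies beyond what was listed so far.
	bool GetFile(std::size_t i, std::string &result) {
		while (i >= listed.size() && listed.size() < total_files) {
			FetchNextPage();
		}
		if (i >= listed.size()) {
			return false;
		}
		result = listed[i];
		return true;
	}

	std::size_t RequestCount() const {
		return request_count;
	}

private:
	// Stand-in for one remote listing request (assumption: paths are made up here).
	void FetchNextPage() {
		request_count++;
		for (std::size_t k = 0; k < page_size && listed.size() < total_files; k++) {
			listed.push_back("s3://bucket/file_" + std::to_string(listed.size()) + ".parquet");
		}
	}

	std::size_t total_files;
	std::size_t page_size;
	std::size_t request_count = 0;
	std::vector<std::string> listed;
};

// Consumer that stops after `limit` files, as a scan with a LIMIT might.
std::vector<std::string> TakeFiles(LazyListingSketch &listing, std::size_t limit) {
	std::vector<std::string> out;
	std::string file;
	for (std::size_t i = 0; i < limit && listing.GetFile(i, file); i++) {
		out.push_back(file);
	}
	return out;
}
```

With 1000 files and a page size of 100, taking 5 files costs a single listing request in this sketch, whereas a fully expanded list would need all ten.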

This PR

Because the MultiFileReader code is already quite complex, and is used by some pretty complex parts of DuckDB (mainly the CSV reader), the refactor will be done in steps. In this first step I aim to achieve the following:

Allow subclassing the MultiFileReader and injecting a factory method for it in a TableFunction
Add a MultiFileList interface
Add a SimpleMultiFileList class that implements the MultiFileList interface but is still a fully expanded list of files internally
Switch from using a vector<string> to using the new MultiFileList class in most MultiFileReader code
Switch the Parquet Scan to use the MultiFileList in a way that shows how the MultiFileList interface enables interacting with a lazily generated file list
Add virtual methods for Filter Pushdown to MultiFileList and MultiFileReader
Add a new (unused) bind method to the MultiFileReader that subclasses can use to implement custom binding for a multi-file read
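The first few steps can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the actual DuckDB classes: `FileListSketch`, `SimpleFileListSketch`, `ReaderSketch`, `PrefixedReaderSketch`, `TableFunctionSketch` and `MakeReader` are all hypothetical names, and the real interfaces carry considerably more state. It shows a file list behind an interface, a reader that can be subclassed, and a table function that carries an optional factory for the reader.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical file list interface: callers ask for one file at a time.
class FileListSketch {
public:
	virtual ~FileListSketch() = default;
	// Writes the i-th path into `result`; returns false past the end of the list.
	virtual bool GetFile(std::size_t i, std::string &result) = 0;
};

// Eager implementation: a fully expanded list of files internally.
class SimpleFileListSketch : public FileListSketch {
public:
	explicit SimpleFileListSketch(std::vector<std::string> files_p) : files(std::move(files_p)) {
	}
	bool GetFile(std::size_t i, std::string &result) override {
		if (i >= files.size()) {
			return false;
		}
		result = files[i];
		return true;
	}

private:
	std::vector<std::string> files;
};

// Reader with a virtual hook, so a subclass decides how the list is built.
class ReaderSketch {
public:
	virtual ~ReaderSketch() = default;
	virtual std::unique_ptr<FileListSketch> CreateFileList(std::vector<std::string> paths) {
		return std::unique_ptr<FileListSketch>(new SimpleFileListSketch(std::move(paths)));
	}
};

// Custom reader, as an extension might provide: rewrites every path.
class PrefixedReaderSketch : public ReaderSketch {
public:
	std::unique_ptr<FileListSketch> CreateFileList(std::vector<std::string> paths) override {
		for (auto &path : paths) {
			path = "s3://bucket/" + path;
		}
		return ReaderSketch::CreateFileList(std::move(paths));
	}
};

// Table function carrying an optional factory for the reader.
struct TableFunctionSketch {
	std::function<std::unique_ptr<ReaderSketch>()> get_reader;
};

// Uses the injected factory if present, otherwise the default reader.
std::unique_ptr<ReaderSketch> MakeReader(const TableFunctionSketch &function) {
	if (function.get_reader) {
		return function.get_reader();
	}
	return std::unique_ptr<ReaderSketch>(new ReaderSketch());
}
```

Setting `get_reader` on an existing table function swaps the reader without touching the scan code, which only ever talks to the interfaces.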

Note that functionally nothing should have changed in DuckDB with this PR.

TODOs

The JSON and CSV readers still store a vector of strings in their bind data. This should switch to using a MultiFileList, but some work on serialization and backwards compatibility remains there
This PR just adds complexity to an already quite complex part of DuckDB; we should investigate where things can be simplified
I'm still missing a way for a custom MultiFileReader to pass state between MultiFileReader::BindOptions and MultiFileReader::FinalizeBind

Notable points to review

The logic in ParquetScanInitGlobal has changed and is non-trivial, to say the least
I have some assert(false) calls in ParquetScanInitGlobal where we may want to throw an InternalException instead
As a result of making the MultiFileReader methods non-static, I now have some odd empty MultiFileReader objects in several places. The idea is that those places should properly initialize a MultiFileReader at some point and reuse the same one.
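On the assert(false) question, the practical difference is that assertions are compiled out when NDEBUG is defined, while a thrown exception is still raised in release builds. A minimal sketch of the exception variant follows, assuming a hypothetical state enum and exception type. `FileStateSketch`, `InternalErrorSketch` and `DescribeState` are illustrative only; the actual ParquetFileState values are not shown on this page.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical states, for illustration only.
enum class FileStateSketch { UNOPENED, OPENING, OPEN, CLOSED };

// Stand-in for an internal-error exception type.
class InternalErrorSketch : public std::logic_error {
public:
	explicit InternalErrorSketch(const std::string &message) : std::logic_error(message) {
	}
};

const char *DescribeState(FileStateSketch state) {
	switch (state) {
	case FileStateSketch::UNOPENED:
		return "unopened";
	case FileStateSketch::OPENING:
		return "opening";
	case FileStateSketch::OPEN:
		return "open";
	case FileStateSketch::CLOSED:
		return "closed";
	}
	// Only reached for a value outside the enum. Unlike assert(false), this
	// still fires when assertions are disabled via NDEBUG.
	throw InternalErrorSketch("unexpected file state");
}
```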

samansmink added 24 commits

March 28, 2024 11:08


          poc-parquet-scan-refactor

feaae9a


          wip: moving more stuff out of parquet bind data

543d956


          wip: lazy file generation should work now. Warning! union_readers is broken, see test/sql/copy/parquet/recursive_parquet_union_by_name.test

4b5fcb8


          wip: create abstract base class

0fa8fe6


          small fix

7cc49a5


          first step to switching to a MultiFileReader-based interface

6db3f9b


          revert parquet extension

95eca11


          wip extensible multifilereader

e210a07


          fix some small parquet refactor issue

93d8d92


          supporting parquet options in custom multifilereader binds

047bc95


          add support for custom MultiFileReader custom generated columns

332e4a2


          refactor finalizechunk to pass client context

3abd45c


          first cleanup

b8014ef


          no more static template methods


          more cleanup for MultiFileList refactor

60bca2a


          parquet refactor: fix union readers optimization

c27d56c


          remove MultiFileReader for csv for now


          fix more issues from parquet refactor

eb0b76d


          format-fix

10d478e


          avoid serialization issues for 1st PR

d0b45d1


          format

ac7a6d6


          Merge branch 'main' into parquet-refactor-pr1

29e3fcc


          format

681cf8d


          more cleanup of multifilereader refactor

878ec1d

samansmink requested a review from Mytherin

April 24, 2024 09:18

Tishj reviewed

extension/parquet/parquet_extension.cpp Outdated

+              static void ParseFileRowNumberOption(MultiFileReaderBindData &bind_data, ParquetOptions &options,
+                                                   vector<LogicalType> &return_types, vector<string> &names) {
+              	if (options.file_row_number) {
+              		if (std::find(names.begin(), names.end(), "file_row_number") != names.end()) {

Contributor

Tishj Apr 24, 2024 (edited)

This check is case-sensitive; doesn't FILE_ROW_NUMBER break in the same way?
I understand this is existing logic that was merely moved, but it still seems problematic.
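A small sketch of the concern, using only the standard library: an exact `std::find` misses a column named FILE_ROW_NUMBER, while a case-insensitive comparison catches it. `EqualsIgnoreCase` and `HasColumnIgnoreCase` are hypothetical helpers written for this illustration (ASCII only), not functions from the PR.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Compares two ASCII strings while ignoring case.
bool EqualsIgnoreCase(const std::string &a, const std::string &b) {
	return a.size() == b.size() &&
	       std::equal(a.begin(), a.end(), b.begin(), [](unsigned char x, unsigned char y) {
		       return std::tolower(x) == std::tolower(y);
	       });
}

// Returns true if any existing column name collides with `candidate`, ignoring case.
bool HasColumnIgnoreCase(const std::vector<std::string> &names, const std::string &candidate) {
	return std::any_of(names.begin(), names.end(),
	                   [&](const std::string &name) { return EqualsIgnoreCase(name, candidate); });
}
```

For names `{"id", "FILE_ROW_NUMBER"}`, the exact `std::find` for "file_row_number" reports no match, whereas `HasColumnIgnoreCase` reports the collision.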

Tishj reviewed

extension/parquet/parquet_extension.cpp (outdated, resolved)

Tishj reviewed

extension/parquet/parquet_extension.cpp (outdated, resolved)

Mytherin reviewed

Collaborator

Mytherin left a comment

Thanks for the PR! Looks good - some comments below:

extension/json/json_functions/read_json_objects.cpp (outdated, resolved)

extension/json/json_functions/read_json.cpp (outdated, resolved)

extension/parquet/parquet_extension.cpp (outdated, resolved)

extension/parquet/parquet_extension.cpp (outdated, resolved)

extension/parquet/parquet_extension.cpp (outdated, resolved)

src/common/multi_file_reader.cpp (outdated, resolved)

src/common/multi_file_reader.cpp (resolved)

src/common/multi_file_reader.cpp (outdated, resolved)

src/include/duckdb/common/multi_file_reader_options.hpp (outdated, resolved)

src/include/duckdb/function/table_function.hpp (resolved)

rustyconover mentioned this pull request

Predicate Pushdown for scans duckdb/duckdb_iceberg#2

Open

duckdb-draftbot marked this pull request as draft

April 30, 2024 07:50

samansmink added 11 commits

April 30, 2024 10:34


          small multifilereader tweaks

5dfdded


          split files, use multifile scan in parquet

c333ecd


          fix casing issue

7998d3c


          format, fix comments

38900b0


          fix more multifile reader issues, add mutex for parquet

b006556


          format & tidy

059e7b3


          fix spatial patch

2f9803c


          pass dummy TableFunction

f0b3470


          remove assertions and add TODOs

dfb5664


          format

84ab7b7


          make tidy

134078d

Mytherin marked this pull request as ready for review

May 1, 2024 10:18

Mytherin reviewed

Collaborator

Mytherin left a comment

Thanks for the fixes! Some more comments:

extension/parquet/parquet_extension.cpp (outdated, resolved)

extension/parquet/parquet_extension.cpp (outdated, resolved)

extension/parquet/parquet_extension.cpp

+              	//! Flag to indicate the file is being opened
+              	ParquetFileState file_state;
+              	//! Mutexes to wait for the file when it is being opened
+              	unique_ptr<mutex> file_mutex;

Collaborator

Mytherin May 1, 2024

Can we turn this into a regular mutex?

Contributor Author

samansmink May 2, 2024

A vector doesn't allow that: std::mutex is not movable.

The idea is that with this you can:

get a ref to this mutex
drop the global state lock
grab the file lock
release the file lock
regrab the global state lock

The goal is to open a new file without holding the global state lock while doing so. In the meantime, the vector holding the mutexes may get resized, so each mutex needs an address that stays stable.
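The locking sequence above can be sketched as follows. This is an illustration with hypothetical types (`FileEntrySketch`, `GlobalStateSketch`, `AddFile`, `OpenFile`), not the code from parquet_extension.cpp. Holding each mutex through a `unique_ptr` keeps the entries movable and keeps every mutex at a fixed heap address when the vector reallocates.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-file entry. std::mutex is neither copyable nor movable,
// so it is held through a unique_ptr to keep the entry movable.
struct FileEntrySketch {
	std::string path;
	bool opened = false;
	std::unique_ptr<std::mutex> file_mutex;
};

struct GlobalStateSketch {
	std::mutex lock;
	std::vector<FileEntrySketch> files;
};

// Appends a file under the global lock; this may reallocate the vector.
void AddFile(GlobalStateSketch &state, const std::string &path) {
	std::lock_guard<std::mutex> guard(state.lock);
	FileEntrySketch entry;
	entry.path = path;
	entry.file_mutex.reset(new std::mutex());
	state.files.push_back(std::move(entry));
}

void OpenFile(GlobalStateSketch &state, std::size_t file_idx) {
	std::unique_lock<std::mutex> global_guard(state.lock);
	// 1. Get a reference to the file's mutex. The mutex lives on the heap, so
	//    the reference stays valid even if the vector is resized later.
	std::mutex &file_mutex = *state.files[file_idx].file_mutex;
	// 2. Drop the global state lock so other threads can make progress.
	global_guard.unlock();
	{
		// 3. Grab the file lock; the slow open would happen here.
		std::lock_guard<std::mutex> file_guard(file_mutex);
		// ... open the file (omitted in this sketch) ...
	} // 4. The file lock is released at the end of this scope.
	// 5. Regrab the global state lock before touching shared state again.
	global_guard.lock();
	state.files[file_idx].opened = true;
}
```

A plain `std::mutex` member would make `FileEntrySketch` non-movable, and a vector of such entries could not grow.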

extension/parquet/parquet_extension.cpp

		shared_ptr<ParquetReader> initial_reader;
		vector<string> files;

Collaborator

Mytherin May 1, 2024

Nightly test run here: https://github.com/samansmink/duckdb/actions/runs/8908217982. PTAL at the ThreadSan results.

extension/parquet/parquet_extension.cpp (outdated, resolved)

extension/parquet/parquet_extension.cpp (outdated, resolved)

src/include/duckdb/common/multi_file_list.hpp (outdated, resolved)

src/include/duckdb/common/multi_file_list.hpp (outdated, resolved)

src/include/duckdb/common/multi_file_list.hpp (outdated, resolved)

src/common/multi_file_list.cpp (outdated, resolved)

samansmink added 3 commits

May 2, 2024 09:47


          make multifilelist thread-safe

0311c7c


          copy MultiFileList on filter pushdown to ensure consistency under parallel access

c90394c


          finishing touches to multifilereader

8e02018

duckdb-draftbot marked this pull request as draft

May 2, 2024 10:53

samansmink marked this pull request as ready for review

May 2, 2024 11:01

Contributor Author

samansmink commented May 2, 2024

I have significantly reworked the MultiFileList to make it thread-safe. Some notes:

Subclasses of the MultiFileList now implement locking (if necessary)
The SimpleMultiFileList is completely constant, so it can be lock-free
The ComplexFilterPushdown method now returns a new list instead of modifying the existing one in place. This was the crux of making this work nicely: MultiFileLists are now effectively immutable, even though internally they can expand lazily.

This solves basically all problems: multiple scans can reuse the same MultiFileList without issues, and we can just leave the MultiFileList in the BindData. This even honors, to some degree, the fact that the bind_data is const: the list only changes its internal state, so its result should be consistent even under parallel use.
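A minimal sketch of the "return a new list" idea, with hypothetical names (`ImmutableFileListSketch`, `PushdownFilter`) and a plain predicate standing in for real filter expressions. It is not the actual ComplexFilterPushdown signature. Because the original list is never modified, any scan that already holds it keeps seeing the same files.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class ImmutableFileListSketch {
public:
	explicit ImmutableFileListSketch(std::vector<std::string> files_p) : files(std::move(files_p)) {
	}

	// Returns a new, filtered list and leaves *this untouched.
	std::unique_ptr<ImmutableFileListSketch>
	PushdownFilter(const std::function<bool(const std::string &)> &keep) const {
		std::vector<std::string> filtered;
		for (const auto &file : files) {
			if (keep(file)) {
				filtered.push_back(file);
			}
		}
		return std::unique_ptr<ImmutableFileListSketch>(new ImmutableFileListSketch(std::move(filtered)));
	}

	const std::vector<std::string> &Files() const {
		return files;
	}

private:
	// Never modified after construction, so concurrent readers need no lock.
	const std::vector<std::string> files;
};
```

For example, filtering a list of `data/year=2023/a.parquet` and `data/year=2024/b.parquet` on the substring `/year=2024/` yields a one-element list, while the original still reports both files.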

What this currently means is that when a filter is pushed down into a MultiFileReader and hive partitioning or filename is enabled, the GlobMultiFileList is transformed into a SimpleMultiFileList during filter pushdown. In the future we can:

add lazy glob expansion
add filter pushdown into the glob

If you think this is good and threadsan has no objections, I think this is good to go.

I will update the PR description with an explanation of how to use the MultiFileList for future reference once this is approved.

Collaborator

Mytherin commented May 2, 2024

Sounds great!


          make tidy

dadbae2

duckdb-draftbot marked this pull request as draft

May 2, 2024 20:41

samansmink marked this pull request as ready for review

May 2, 2024 20:42

Contributor Author

samansmink commented May 2, 2024

Nightly run for threadsan started here: https://github.com/samansmink/duckdb/actions/runs/8930201930. ThreadSan seems happy with the changes.

Mytherin merged commit 4ae9499 into duckdb:main

41 checks passed

Collaborator

Mytherin commented May 3, 2024

Thanks! Looks good

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request


          chore: Update vendored sources to duckdb/duckdb@4ae9499

599155f

Merge pull request duckdb/duckdb#11806 from samansmink/parquet-refactor-pr1

samansmink mentioned this pull request

Query with hive partitioning much slower than hardcoded path #13217

Open

2 tasks

samansmink mentioned this pull request

Add predicate pushdown duckdb/duckdb_iceberg#72

Closed

Labels

Changes Requested