Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiFileReader refactor #11806

Merged
merged 52 commits into from
May 3, 2024
Merged

Conversation

samansmink
Copy link
Contributor

This PR is the first step of a refactor of the MultiFileReader.

Goals

The goals of the refactor as a whole are:

  • allow late expansion of a list of file(globs)
  • allow switching MultiFileReaders of existing table functions (to allow reusing TableFunctions through catalog from in an extension)

Current problem

To illustrate why this is important lets look at an example:

SELECT * FROM "s3://bucket/**/*.parquet"

What we currently do for the above query is to fully expand the glob in the bind phase. This can be very slow if the bucket is large as it has several problems:

  • bind phase is single threaded, meaning that this process is fully sequential
  • we cannot reduce the amount of file listing to be done with limits or filter pushdown
  • this does not allow extensions to implement more clever ways to scan big sets of files

This PR

Because the MultiFileReader code is already quite complex, and is used by some pretty complex parts of duckdb (the CSV reader mainly). The refactor will be done in steps. In this first step I aim to achieve:

  • Allow subclassing the MultiFileReader and injecting a factory method for it in a TableFunction
  • Add a MultiFileList interface
  • Add a SimpleMultiFileList class that implements the MultiFileList interface but still being a fully expanded list of files internally.
  • Switch from using a vector<string> to using the new MultiFileList class in most MultiFileReader code
  • Switch the Parquet Scan to use the MultiFileList in a way that shows how the MultiFileList interface enables interacting with a lazily generated file list
  • Add virtual methods for Filter Pushdown to MultiFileList and MultiFileReader
  • Adds a new (unused) bind method to the MultiFileReader that can be used by subclasses to implement custom binding for a multifileread

Note that functionally nothing should have changed in DuckDB with this PR

Todo's

  • The JSON and CSV reader still store a vector of strings in their bind data, this should switch to using a MultiFileList, but some work regarding serialization and backwards compatibility remains there
  • This PR just adds complexity to an already quite complex part of DuckDB, we should investigate where things can be simplified
  • I'm currently still missing a way for a custom MultiFileReader to pass state between MultiFileReader::BindOptions and MultiFileReader::FinalizeBind

Notable points to review

  • The logic in ParquetScanInitGlobal has changed and is non trivial to say the least
  • I have some assert(false) in ParquetScanInitGlobal where we may want to just throw internalexceptions?
  • The result of making the MultiFileReader methods non-static is that I have some weird empty MultiFileReader objects in several places now, the idea being that those place should make sure they should properly initialize a multifilereader at some point and reuse the same one.

static void ParseFileRowNumberOption(MultiFileReaderBindData &bind_data, ParquetOptions &options,
vector<LogicalType> &return_types, vector<string> &names) {
if (options.file_row_number) {
if (std::find(names.begin(), names.end(), "file_row_number") != names.end()) {
Copy link
Contributor

@Tishj Tishj Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is case sensitive, doesn't FILE_ROW_NUMBER break in the same way?
I understand this is existing logic that is merely moved, but still this seems problematic

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good - some comments below:

extension/json/json_functions/read_json_objects.cpp Outdated Show resolved Hide resolved
extension/json/json_functions/read_json.cpp Outdated Show resolved Hide resolved
extension/parquet/parquet_extension.cpp Outdated Show resolved Hide resolved
extension/parquet/parquet_extension.cpp Outdated Show resolved Hide resolved
extension/parquet/parquet_extension.cpp Outdated Show resolved Hide resolved
src/common/multi_file_reader.cpp Outdated Show resolved Hide resolved
src/common/multi_file_reader.cpp Show resolved Hide resolved
src/common/multi_file_reader.cpp Outdated Show resolved Hide resolved
src/include/duckdb/common/multi_file_reader_options.hpp Outdated Show resolved Hide resolved
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 30, 2024 07:50
@Mytherin Mytherin marked this pull request as ready for review May 1, 2024 10:18
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes! Some more comments:

extension/parquet/parquet_extension.cpp Outdated Show resolved Hide resolved
extension/parquet/parquet_extension.cpp Outdated Show resolved Hide resolved
//! Flag to indicate the file is being opened
ParquetFileState file_state;
//! Mutexes to wait for the file when it is being opened
unique_ptr<mutex> file_mutex;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we turn this into a regular mutex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector doesn't like that: mutex is not movable.

The idea is that with this you can:

  • get a ref to this mutex
  • drop the global state lock
  • grab the file lock
  • release the file lock
  • regrab the global state lock

with the idea of opening a new file without locking global state while doing so: in the meantime the vector holding the mutex may get resized.

shared_ptr<ParquetReader> initial_reader;
vector<string> files;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nightly test run here - https://github.com/samansmink/duckdb/actions/runs/8908217982 - PTAL at the threadsan

extension/parquet/parquet_extension.cpp Outdated Show resolved Hide resolved
extension/parquet/parquet_extension.cpp Outdated Show resolved Hide resolved
src/include/duckdb/common/multi_file_list.hpp Outdated Show resolved Hide resolved
src/include/duckdb/common/multi_file_list.hpp Outdated Show resolved Hide resolved
src/include/duckdb/common/multi_file_list.hpp Outdated Show resolved Hide resolved
src/common/multi_file_list.cpp Outdated Show resolved Hide resolved
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 2, 2024 10:53
@samansmink samansmink marked this pull request as ready for review May 2, 2024 11:01
@samansmink
Copy link
Contributor Author

@Mytherin PTAL

I have significantly reworked the MultiFileList to make it thread-safe. Some notes:

  • Subclasses of the MultiFileList now implement locking (if necessary)
  • The SimpleMultiFileList is completely constant, so it can be lock-free
  • The ComplexFilterPushdown method now returns a new list instead of modifying the existing one in-place. This was the crux to make this work nicely: MultiFileLists are now effectively immutable even though internally they can expand lazily.

This solves basically all problems: Multiple scans can reuse the same MultiFileLists without problems, and we can just leave the multifilelist in the BindData (even honoring the fact that the bind_data are const to some degree, as the list only changes its internal state, its result should be consistent even under parallel use).

Currently what this means is that when a filter is pushed down into a MultiFileReader AND hive partitioning or filename is enabled, the GlobMultiFileList gets transformed into a SimpleMultiFileList during filter pushdown. In the future we can:

  • add lazy glob expansion
  • add filter pushdown into the glob

If you think this is good and threadsan has no objections, I think this is good to go.

I will update the PR description with an explanation of how to use the MultiFileList for future reference once this is approved.

@Mytherin
Copy link
Collaborator

Mytherin commented May 2, 2024

Sounds great!

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 2, 2024 20:41
@samansmink samansmink marked this pull request as ready for review May 2, 2024 20:42
@samansmink
Copy link
Contributor Author

Nightly run started here for threadsan: https://github.com/samansmink/duckdb/actions/runs/8930201930, ThreadSan seems happy with the changes

@Mytherin Mytherin merged commit 4ae9499 into duckdb:main May 3, 2024
41 checks passed
@Mytherin
Copy link
Collaborator

Mytherin commented May 3, 2024

Thanks! Looks good

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 6, 2024
Merge pull request duckdb/duckdb#11806 from samansmink/parquet-refactor-pr1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants