MultiFileReader refactor #11806
…broken, see test/sql/copy/parquet/recursive_parquet_union_by_name.test
static void ParseFileRowNumberOption(MultiFileReaderBindData &bind_data, ParquetOptions &options,
                                     vector<LogicalType> &return_types, vector<string> &names) {
    if (options.file_row_number) {
        if (std::find(names.begin(), names.end(), "file_row_number") != names.end()) {
This is case-sensitive; doesn't FILE_ROW_NUMBER break in the same way?
I understand this is existing logic that is merely moved, but this still seems problematic.
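To make the concern above concrete, here is a minimal sketch of a case-insensitive collision check. Both helper names (EqualsIgnoreCase, HasColumnIgnoreCase) are hypothetical and not part of this PR or of DuckDB; the sketch uses plain std types and only handles ASCII case folding.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not DuckDB's API): ASCII case-insensitive string equality.
bool EqualsIgnoreCase(const std::string &a, const std::string &b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(), [](unsigned char x, unsigned char y) {
               return std::tolower(x) == std::tolower(y);
           });
}

// Hypothetical helper: true if any existing column name matches `candidate`
// regardless of case, e.g. "FILE_ROW_NUMBER" vs "file_row_number".
bool HasColumnIgnoreCase(const std::vector<std::string> &names, const std::string &candidate) {
    return std::any_of(names.begin(), names.end(),
                       [&](const std::string &name) { return EqualsIgnoreCase(name, candidate); });
}
```

With a check like this, a file that already contains a FILE_ROW_NUMBER column would be detected, whereas the std::find comparison in the excerpt only matches the exact lowercase spelling.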
Thanks for the PR! Looks good - some comments below:
Thanks for the fixes! Some more comments:
//! Flag to indicate the file is being opened
ParquetFileState file_state;
//! Mutexes to wait for the file when it is being opened
unique_ptr<mutex> file_mutex;
Can we turn this into a regular mutex?
vector doesn't like that: mutex is not movable.
The idea is that with this you can:
- get a ref to this mutex
- drop the global state lock
- grab the file lock
- release the file lock
- regrab the global state lock
The goal is to open a new file without holding the global state lock while doing so. In the meantime the vector holding the mutex may get resized, so the mutex has to live behind a pointer for the reference to stay valid.
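The lock handoff described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: FileEntry, GlobalState and OpenFileWithoutGlobalLock are hypothetical names, and the "slow open" is only a comment.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-file entry. std::mutex is neither copyable nor movable, so it
// is heap-allocated; the entry itself can then be moved when the vector grows.
struct FileEntry {
    std::string path;
    bool opened = false;
    std::unique_ptr<std::mutex> file_mutex;
    explicit FileEntry(std::string p) : path(std::move(p)), file_mutex(new std::mutex()) {}
};

// Hypothetical global scan state guarded by a single lock.
struct GlobalState {
    std::mutex lock;
    std::vector<FileEntry> files;
};

void OpenFileWithoutGlobalLock(GlobalState &state, std::size_t idx) {
    std::unique_lock<std::mutex> global_guard(state.lock);
    // 1. Get a reference to the file mutex. It points at the heap object, so it
    //    stays valid even if `files` is reallocated later.
    std::mutex &file_mutex = *state.files[idx].file_mutex;
    // 2. Drop the global state lock.
    global_guard.unlock();
    {
        // 3. Grab the file lock; the slow open would happen here while other
        //    threads are free to take the global lock.
        std::lock_guard<std::mutex> file_guard(file_mutex);
    } // 4. Release the file lock.
    // 5. Regrab the global state lock and re-index the vector instead of
    //    reusing a FileEntry reference taken before the unlock.
    global_guard.lock();
    state.files[idx].opened = true;
}
```

A real implementation would also need a state flag (such as the file_state member in the excerpt above) so that two threads do not both decide to open the same file; the sketch leaves that out.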
shared_ptr<ParquetReader> initial_reader;
vector<string> files;
Nightly test run here - https://github.com/samansmink/duckdb/actions/runs/8908217982 - PTAL at the threadsan
@Mytherin PTAL I have significantly reworked the MultiFileList to make it thread-safe. Some notes:
This solves basically all problems: Multiple scans can reuse the same MultiFileLists without problems, and we can just leave the multifilelist in the BindData (even honoring the fact that the bind_data are const to some degree, as the list only changes its internal state, its result should be consistent even under parallel use). Currently what this means is that when a filter is pushed down into a MultiFileReader AND hive partitioning or filename is enabled, the
If you think this is good and threadsan has no objections, I think this is good to go. I will update the PR description with an explanation of how to use the MultiFileList for future reference once this is approved.
Sounds great!
Nightly run started here for threadsan: https://github.com/samansmink/duckdb/actions/runs/8930201930, ThreadSan seems happy with the changes
Thanks! Looks good
Merge pull request duckdb/duckdb#11806 from samansmink/parquet-refactor-pr1
This PR is the first step of a refactor of the MultiFileReader.
Goals
The goals of the refactor as a whole are:
Current problem
To illustrate why this is important, let's look at an example:
What we currently do for the above query is to fully expand the glob in the bind phase. This can be very slow if the bucket is large as it has several problems:
This PR
Because the MultiFileReader code is already quite complex, and is used by some pretty complex parts of DuckDB (mainly the CSV reader), the refactor will be done in steps. In this first step I aim to achieve:
- MultiFileList interface
- SimpleMultiFileList class that implements the MultiFileList interface but still being a fully expanded list of files internally.
- … vector<string> to using the new MultiFileList class in most MultiFileReader code
- … MultiFileList in a way that shows how the MultiFileList interface enables interacting with a lazily generated file list
Note that functionally nothing should have changed in DuckDB with this PR
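As a rough illustration of the idea, the sketch below shows a file-list interface with an eager implementation and a lazy one. All names carry a Sketch suffix because they are hypothetical: the actual MultiFileList and SimpleMultiFileList classes in this PR have a different and larger API, and the lazy "expansion" here just fabricates numbered paths instead of listing a bucket.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface: callers ask for the i-th file instead of holding a vector<string>.
class FileListSketch {
public:
    virtual ~FileListSketch() = default;
    // Writes the i-th path to `out`; returns false if the list has no i-th file.
    virtual bool TryGetFile(std::size_t i, std::string &out) = 0;
};

// Eager variant: still a fully expanded list of files internally.
class SimpleFileListSketch : public FileListSketch {
public:
    explicit SimpleFileListSketch(std::vector<std::string> paths) : paths_(std::move(paths)) {}
    bool TryGetFile(std::size_t i, std::string &out) override {
        if (i >= paths_.size()) {
            return false;
        }
        out = paths_[i];
        return true;
    }

private:
    std::vector<std::string> paths_;
};

// Lazy variant: only expands as far as callers have asked for. The mutex guards
// the internal state so several scans could share one list.
class LazyFileListSketch : public FileListSketch {
public:
    LazyFileListSketch(std::string prefix, std::size_t total) : prefix_(std::move(prefix)), total_(total) {}
    bool TryGetFile(std::size_t i, std::string &out) override {
        std::lock_guard<std::mutex> guard(lock_);
        if (i >= total_) {
            return false;
        }
        while (expanded_.size() <= i) {
            // Stand-in for an expensive step such as listing the next page of a glob.
            expanded_.push_back(prefix_ + std::to_string(expanded_.size()) + ".parquet");
        }
        out = expanded_[i];
        return true;
    }
    std::size_t ExpandedCount() const {
        std::lock_guard<std::mutex> guard(lock_);
        return expanded_.size();
    }

private:
    mutable std::mutex lock_;
    std::string prefix_;
    std::size_t total_;
    std::vector<std::string> expanded_;
};

// A consumer that only needs the first n files never forces a full expansion.
std::vector<std::string> TakeFirst(FileListSketch &list, std::size_t n) {
    std::vector<std::string> result;
    std::string path;
    for (std::size_t i = 0; i < n && list.TryGetFile(i, path); i++) {
        result.push_back(path);
    }
    return result;
}
```

The point of the design is that code written against the interface works unchanged with either variant, which is what allows the eager list to be swapped for a lazy one in a later step.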
Todo's
- … MultiFileList, but some work regarding serialization and backwards compatibility remains there
- MultiFileReader::BindOptions and MultiFileReader::FinalizeBind
Notable points to review
- ParquetScanInitGlobal has changed and is non-trivial, to say the least
- assert(false) in ParquetScanInitGlobal where we may want to just throw internal exceptions?
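For the second point, a small sketch of the trade-off: assert(false) is compiled out when NDEBUG is defined, so an unexpected state would pass silently in release builds, while a thrown exception is always active. The enum and function below are hypothetical, and std::logic_error is only a stand-in for whatever internal exception type the codebase would use.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical file states; the real ParquetFileState values are not shown in this PR excerpt.
enum class FileStateSketch { UNOPENED, OPENING, OPEN, CLOSED };

// Hypothetical helper: maps a state to a label and throws on anything unexpected.
const char *DescribeState(FileStateSketch state) {
    switch (state) {
    case FileStateSketch::UNOPENED:
        return "unopened";
    case FileStateSketch::OPENING:
        return "opening";
    case FileStateSketch::OPEN:
        return "open";
    case FileStateSketch::CLOSED:
        return "closed";
    }
    // An assert(false) here would vanish under NDEBUG; the throw does not.
    throw std::logic_error("unexpected file state");
}
```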