Read Parquet files faster #47964
Conversation
Force-pushed from 88777cd to 570ae57
            "bigger than background_schedule_pool_size."
        : getCurrentExceptionMessage(/* with_stacktrace */ true));
    abort();
}
(This is unrelated to the rest of the PR. Without this, the server gives you a quiet SIGABRT if you set max_thread_pool_size too small. Now it'll log an error.)
Some speed numbers. Better than before, worse than what is possible.
I will start reviewing the changes. Looks very promising, great work!
src/IO/ParallelReadBuffer.cpp (Outdated)
}
chassert(false);
Why?
Just to make it obvious that this branch always returns, when skimming the code. But looks like it's more confusing than helpful, removing.
Looks good in general. I will go deep into details later.
Couldn't actually remove any logic from FormatFactory, each bit seems useful, almost all of it existed already, in different places. Moved things around a little to hopefully make it more clear. Lmk if you have better ideas. Especially for how to organize the whole thing better, I don't like how awkward the whole
Same, couldn't think of a simpler way to do it (especially if we're going to make it decode columns in parallel too). Reorganized a little and added comments, open to suggestions.
if (response.getStatus() == Poco::Net::HTTPResponse::HTTPStatus::HTTP_PARTIAL_CONTENT)
{
    *res.file_size += requested_range_begin;
    res.seekable = true;
}
else
{
    res.seekable = response.has("Accept-Ranges") && response.get("Accept-Ranges") == "bytes";
}
This changes behavior slightly: previously StorageURL would send a HEAD request with "Range: 0-" for this check, now it's without "Range" (except if this parseFileInfo() is called from nextImpl(), which is not the important call site). I expect it's ok because HTTP servers are supposed to report "Accept-Ranges: bytes" anyway (?), but I don't actually know what HTTP servers are out there and what headers they send. For S3 this works, for althttpd this didn't work even before the change, didn't try anything else.
Still looking at the test failures.
Ok, fixed, there were two pre-existing bugs in ReadBufferFromS3::{seek,setReadUntilPosition} that I didn't notice, which previously weren't being hit because these methods were mostly unused.
@@ -197,15 +199,15 @@ bool ReadBufferFromS3::nextImpl()

off_t ReadBufferFromS3::seek(off_t offset_, int whence)
{
    if (offset_ == offset && whence == SEEK_SET)
The first bug.
read_until_position = position;
impl.reset();
The second bug (`impl` was being destroyed without resetting `working_buffer`, which points into `impl`).
Tests still fail, at least some of them because order of rows is not preserved by default anymore. Should we make
Set it to true by default in this pull request. Then we will merge. Then we will change it to false in another pull request. This will allow us to highlight it better in the changelog.
Would be nice to add a virtual column for row index, then detect
Force-pushed from 8f55cfd to 8dc28ad
Couldn't reproduce the CachedOnDiskReadBufferFromFile assertion failure locally: https://s3.amazonaws.com/clickhouse-test-reports/47964/8dc28ad500c79074bb7d638b5ec5f06b3c6ed30e/stress_test__debug_.html. From log and core dump, it appears to be this scenario:
But this doesn't seem related to this PR, and the test doesn't fail on master (and has failed on the PR multiple times), so I'm not sure. @kssenii, please review the last commit ("Hopefully fix assertion failure in CachedOnDiskReadBufferFromFile").
> Couldn't reproduce the CachedOnDiskReadBufferFromFile assertion failure locally: https://s3.amazonaws.com/clickhouse-test-reports/47964/8dc28ad500c79074bb7d638b5ec5f06b3c6ed30e/stress_test__debug_.html
I've never seen this assertion fail before; I'd think it is related to the changes.
another CachedOnDiskReadBufferFromFile wants to read the same segment. It takes the remote_file_reader, probably seeks it to 0 (within working_buffer), then fails an assert when getFileOffsetOfBufferEnd() (732) is different from downloaded_size (0).
I'd think the problem is in another place; see these lines of the logs:
2023.04.07 07:44:39.755076 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> CachedOnDiskReadBufferFromFile(00170_test/yaq/fqcczxfqquhmvacyzmwqwnoglwagg): Read 732 bytes, read type REMOTE_FS_READ_AND_PUT_IN_CACHE, position: 0, offset: 732
2023.04.07 07:44:39.777827 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Updated state from DOWNLOADING to PARTIALLY DOWNLOADED
2023.04.07 07:44:39.777959 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Resetting downloader from befe85ec-9b42-4d3c-b454-c75455a15a04:3363
2023.04.07 07:44:39.778193 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Complete batch. (File segment: [0, 731], key: b2c2cee274e671de0e4265cbb005a440, state: PARTIALLY_DOWNLOADED, downloaded size: 0, reserved size: 732, downloader id: None, current write offset: 0, first non-downloaded offset: 0, caller id: befe85ec-9b42-4d3c-b454-c75455a15a04:3363, detached: 0, kind: Regular, unbound: 0)
732 bytes were read, then we returned from `nextImpl` without writing anything to cache, and there is no error in the log; this is not normal. Probably the reason is that we passed `working_buffer.begin()` instead of `position()` (therefore it could silently write nothing; let's add an assertion like `assert(data != nullptr)` to `FileSegment::write`)? But this is a bit strange: there was a guarantee that the impl buffer is not seekable (the `restricted_seek` check guaranteed that) and we always passed the external buffer forward into `impl`, so it was fine to use `working_buffer.begin()`.
Undid most of the nonsense refactoring in CachedOnDiskReadBufferFromFile. Switched from making it accept pre-existing reader in any state to making sure that pre-existing reader is always in the correct state.
Force-pushed from 46e4ad8 to 3fd74ca
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Reading files in Parquet format is now much faster. IO and decoding are parallelized (controlled by the `max_threads` setting), and only the required data ranges are read. This is still not very efficient; possible future improvements: