
Read Parquet files faster #47964

Merged (7 commits, Apr 17, 2023)

Conversation

@al13n321 (Member) commented Mar 24, 2023

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Reading files in Parquet format is now much faster. IO and decoding are parallelized (controlled by max_threads setting), and only required data ranges are read.

This is still not very efficient. Possible future improvements:

  • Implement parallel reading for StorageFile (likely by implementing factory versions of MMapReadBufferFromFileDescriptor and ReadBufferFromFileDescriptorPRead). With this PR, decoding happens in max_threads threads, but they all read from one ReadBuffer, locking a mutex. No good, but still faster than one thread.
  • Do parallel reading separately from parallel parsing, to allow a different number of threads for each. In particular, use lots of download threads when reading from a different region, as Parquet often wants to read lots of short ranges.
  • Do decoding ourselves instead of using arrow, to avoid some copying. Maybe reimplement everything, or maybe reuse arrow's metadata decoding and lower-level data decoding (low enough level that we don't have to copy unnecessarily, if such level exists).
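
For reference, a minimal usage sketch of the new behavior (hedged: the dataset is the hits.parquet file used in the benchmarks below, and the thread count is just an example):

-- Decoding, and for url()/s3() sources also the reading of the required byte ranges,
-- is now parallelized across up to max_threads threads.
SELECT AdvEngineID, COUNT(*) AS c
FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/hits.parquet')
GROUP BY AdvEngineID
ORDER BY c DESC
LIMIT 10
SETTINGS max_threads = 16;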

@robot-ch-test-poll4 added the pr-performance label (Pull request with some performance improvements) on Mar 24, 2023
@Avogar self-assigned this on Mar 24, 2023
@al13n321 force-pushed the fast-parquet branch 2 times, most recently from 88777cd to 570ae57 on March 30, 2023 03:43
"bigger than background_schedule_pool_size."
: getCurrentExceptionMessage(/* with_stacktrace */ true));
abort();
}
@al13n321 (Member, author), on the code above:

(This is unrelated to the rest of the PR. Without this, the server gives you a quiet SIGABRT if you set max_thread_pool_size too small. Now it'll log an error.)

@al13n321 marked this pull request as ready for review on March 30, 2023 05:54
@al13n321 (Member, author) commented:

Some speed numbers. Better than before, worse than what is possible.

  • Querying one column from a local file didn't get faster (or slower); it still reads sequentially. (SELECT AdvEngineID, COUNT(*) AS c FROM file('ds/hits.parquet') GROUP BY AdvEngineID ORDER BY c DESC LIMIT 10: 0.75 s)
  • All columns, local file: ~8x faster; decoding is parallel (after reading sequentially). (SELECT sum(ignore(*)) AS c FROM file('ds/hits.parquet'): 102 s -> 12.7 s)
  • One column, S3, local region: 50x faster; both reading and decoding are parallel. (SELECT AdvEngineID, COUNT(*) AS c FROM url('https://clickhouse-datasets-us-west-2.s3.amazonaws.com/hits.parquet') GROUP BY AdvEngineID ORDER BY c DESC LIMIT 10: 49 s -> 1 s)
  • All columns, S3, local region: 4x faster. (SELECT sum(ignore(*)) AS c FROM url('https://clickhouse-datasets-us-west-2.s3.amazonaws.com/hits.parquet'): 193 s -> 46 s)
  • One column, remote region (US <- Europe): 27x faster. (SELECT AdvEngineID, COUNT(*) AS c FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/hits.parquet') GROUP BY AdvEngineID ORDER BY c DESC LIMIT 10: 385 s -> 14 s)
  • Increasing max_threads from 16 to 128 makes the previous query another 3x faster (5 s). Increasing to 256 (more than the number of HTTP requests the query does) does nothing; it's still 5x slower than the local region, not sure why.
  • All columns, remote region: 7x faster. (SELECT sum(ignore(*)) AS c FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/hits.parquet'): 800 s -> 115 s)
  • All of the above gets about the same speed with the url() and s3() table functions, as it should: S3 is just HTTP with extra steps. (Before this PR this wasn't always the case; the comparisons above use url(). See the sketch after this list.)
  • If the query does anything nontrivial with the fetched data, it gets bottlenecked on #38755 (Data processing is not parallelized if the source returns one stream of data) instead.
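
A sketch of the url()/s3() equivalence noted in the list above (the bare-URL form of s3() for a public bucket is an assumption here):

-- After this PR, these two should run at roughly the same speed:
SELECT AdvEngineID, COUNT(*) AS c FROM url('https://clickhouse-datasets-us-west-2.s3.amazonaws.com/hits.parquet') GROUP BY AdvEngineID ORDER BY c DESC LIMIT 10;
SELECT AdvEngineID, COUNT(*) AS c FROM s3('https://clickhouse-datasets-us-west-2.s3.amazonaws.com/hits.parquet') GROUP BY AdvEngineID ORDER BY c DESC LIMIT 10;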

@Avogar (Member) commented Mar 30, 2023:

I will start reviewing the changes. Looks very promising, great work!
Also, please check the test failures; most of them are related.

}
chassert(false);
A reviewer (Member) asked about the chassert(false) above:
Why?

@al13n321 (Member, author) replied:
Just to make it obvious, when skimming the code, that this branch always returns. But it looks like it's more confusing than helpful; removing.

@Avogar (Member) commented Mar 30, 2023:

Looks good in general. I will go deeper into the details later.
Also, it would be really good to try to refactor/simplify the code in FormatFactory::getInputImpl and in ParquetBlockInputFormat, because it was a bit hard to read and understand all the logic.

@al13n321 (Member, author) replied:

> refactor/simplify code in FormatFactory::getInputImpl

Couldn't actually remove any logic from FormatFactory; each bit seems useful, and almost all of it already existed, just in different places. Moved things around a little to hopefully make it clearer. Let me know if you have better ideas, especially for how to organize the whole thing better; I don't like how awkward the random_access_input_creator thing turned out, but I couldn't think of anything better that would work.

> ParquetBlockInputFormat

Same, couldn't think of a simpler way to do it (especially if we're going to make it decode columns in parallel too). Reorganized a little and added comments, open to suggestions.

Comment on lines +812 to +837
if (response.getStatus() == Poco::Net::HTTPResponse::HTTPStatus::HTTP_PARTIAL_CONTENT)
{
    /// A 206 response means the server honored the Range header: the size parsed from
    /// the response covers only the requested range, so shift it by the range start,
    /// and the file is seekable via range requests.
    *res.file_size += requested_range_begin;
    res.seekable = true;
}
else
{
    /// Otherwise fall back to the Accept-Ranges header to decide seekability.
    res.seekable = response.has("Accept-Ranges") && response.get("Accept-Ranges") == "bytes";
}
@al13n321 (Member, author), on the code above:

This changes behavior slightly: previously StorageURL would send a HEAD request with "Range: 0-" for this check; now it is sent without "Range" (except when this parseFileInfo() is called from nextImpl(), which is not the important call site). I expect this is OK because HTTP servers are supposed to report "Accept-Ranges: bytes" anyway(?), but I don't actually know what HTTP servers are out there and what headers they send. For S3 this works; for althttpd it didn't work even before the change; didn't try anything else.

@al13n321 (Member, author) commented:

Still looking at the test failures.

@al13n321 (Member, author) commented:

OK, fixed: there were two pre-existing bugs in ReadBufferFromS3::{seek,setReadUntilPosition} that I hadn't noticed; they previously weren't being hit because these methods were mostly unused.

@@ -197,15 +199,15 @@ bool ReadBufferFromS3::nextImpl()

off_t ReadBufferFromS3::seek(off_t offset_, int whence)
{
    if (offset_ == offset && whence == SEEK_SET)
@al13n321 (Member, author), on the code above:

The first bug.

read_until_position = position;
impl.reset();
@al13n321 (Member, author), on the code above:

The second bug (impl was being destroyed without resetting working_buffer, which points into impl).

@al13n321 (Member, author) commented Apr 4, 2023:

Tests still fail, at least some of them because the order of rows is no longer preserved by default.

Should we make the input_format_parquet_preserve_order setting default to true, in case anyone relies on the ordering? It's tempting to default it to false to get better speed out of the box, especially in benchmarks.

@alexey-milovidov (Member) commented:

Set it to true by default in this pull request. Then we will merge. Then we will change it to false in another pull request. It will allow us to highlight it better in the changelog.
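
Whatever the default ends up being, a user who depends on file order can pin the setting per query; a minimal sketch, assuming the local hits.parquet file from the benchmarks above:

-- Keep rows in the file's original order, at the cost of some of the new parallel speedup.
SELECT *
FROM file('ds/hits.parquet')
LIMIT 5
SETTINGS input_format_parquet_preserve_order = 1;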

@al13n321 (Member, author) commented Apr 4, 2023:

It would be nice to add a virtual column for the row index, then detect ORDER BY _row_idx and switch to order-preserving mode. It wouldn't help with the problem of breaking people's existing queries; it would just be a prettier interface than a setting.
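
A sketch of what that proposed interface might look like (purely hypothetical: the _row_idx virtual column is not implemented in this PR):

-- Hypothetical: an ORDER BY on a row-index virtual column would switch the reader
-- into order-preserving mode without a separate setting.
SELECT AdvEngineID
FROM file('ds/hits.parquet')
ORDER BY _row_idx
LIMIT 10;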

@al13n321 force-pushed the fast-parquet branch 3 times, most recently from 8f55cfd to 8dc28ad on April 7, 2023 00:38
@al13n321 (Member, author) commented:

Couldn't reproduce the CachedOnDiskReadBufferFromFile assertion failure locally: https://s3.amazonaws.com/clickhouse-test-reports/47964/8dc28ad500c79074bb7d638b5ec5f06b3c6ed30e/stress_test__debug_.html . From log and core dump, it appears to be this scenario:

  1. One CachedOnDiskReadBufferFromFile downloads a file segment, then fails to write it to cache (probably because of an exception). The FileSegment ends up in the following state:
  • segment_range = [0, 731]
  • downloaded_size = 0
  • remote_file_reader->getFileOffsetOfBufferEnd() == 732
  • remote_file_reader->working_buffer size = 732
  2. Another CachedOnDiskReadBufferFromFile wants to read the same segment. It takes the remote_file_reader, probably seeks it to 0 (within working_buffer), then fails an assert when getFileOffsetOfBufferEnd() (732) is different from downloaded_size (0).

But this doesn't seem related to this PR; on the other hand, the test doesn't fail on master and has failed on the PR multiple times, so I'm not sure.

@kssenii, plz review the last commit ("Hopefully fix assertion failure in CachedOnDiskReadBufferFromFile").

@kssenii (Member) left a comment:

> Couldn't reproduce the CachedOnDiskReadBufferFromFile assertion failure locally: https://s3.amazonaws.com/clickhouse-test-reports/47964/8dc28ad500c79074bb7d638b5ec5f06b3c6ed30e/stress_test__debug_.html .

I've never seen this assertion fail before; I'd think it is related to the changes.

> another CachedOnDiskReadBufferFromFile wants to read the same segment. It takes the remote_file_reader, probably seeks it to 0 (within working_buffer), then fails an assert when getFileOffsetOfBufferEnd() (732) is different from downloaded_size (0).

I'd think the problem is in another place; see these lines of the log:

2023.04.07 07:44:39.755076 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> CachedOnDiskReadBufferFromFile(00170_test/yaq/fqcczxfqquhmvacyzmwqwnoglwagg): Read 732 bytes, read type REMOTE_FS_READ_AND_PUT_IN_CACHE, position: 0, offset: 732
2023.04.07 07:44:39.777827 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Updated state from DOWNLOADING to PARTIALLY DOWNLOADED
2023.04.07 07:44:39.777959 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Resetting downloader from befe85ec-9b42-4d3c-b454-c75455a15a04:3363
2023.04.07 07:44:39.778193 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Complete batch. (File segment: [0, 731], key: b2c2cee274e671de0e4265cbb005a440, state: PARTIALLY_DOWNLOADED, downloaded size: 0, reserved size: 732, downloader id: None, current write offset: 0, first non-downloaded offset: 0, caller id: befe85ec-9b42-4d3c-b454-c75455a15a04:3363, detached: 0, kind: Regular, unbound: 0)

732 bytes were read, then we returned from nextImpl without writing anything to the cache, and there is no error in the log; this is not normal. Probably the reason is that we passed working_buffer.begin() instead of position() (so it could silently write nothing; let's add an assertion to FileSegment::write like assert(data != nullptr)?). But this is a bit strange: there was a guarantee that the impl buffer is not seekable (the restricted_seek check guaranteed that) and we always passed the external buffer forward into impl, so it used to be fine to use working_buffer.begin().

(Review comment on src/Disks/IO/CachedOnDiskReadBufferFromFile.cpp, now outdated and resolved.)
@al13n321 (Member, author) commented:

Undid most of the nonsense refactoring in CachedOnDiskReadBufferFromFile. Switched from making it accept a pre-existing reader in any state to making sure the pre-existing reader is always in the correct state.

Labels: pr-performance (Pull request with some performance improvements)
Projects: none yet
Linked issues (may be closed by merging): none yet
6 participants