
Experiment with asynchronous readers #26791

Merged
merged 65 commits from async-reads into master
Aug 31, 2021

Conversation

alexey-milovidov
Member

@alexey-milovidov alexey-milovidov commented Jul 26, 2021

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Currently it is unrelated, and with the current setup it makes performance slightly worse most of the time.
But it is aimed at fixing the absolutely terrible performance of point queries from external storage (#23199)
by requesting reads of multiple columns at once (vertical parallelization).
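For intuition, here is a minimal sketch of the vertical parallelization idea (illustrative only; AsyncReader and its names are hypothetical, not this PR's actual interface): submit a read per column up front and only then wait, so a point query over high-latency storage pays the round-trip latency once instead of once per column.

#include <fstream>
#include <future>
#include <sstream>
#include <string>
#include <vector>

/// Toy asynchronous reader: every submitted read runs as its own task.
/// (A real implementation would use a shared thread pool.)
struct AsyncReader
{
    std::future<std::string> submit(const std::string & path)
    {
        return std::async(std::launch::async, [path]
        {
            std::ifstream file(path, std::ios::binary);
            std::ostringstream data;
            data << file.rdbuf();
            return data.str();
        });
    }
};

/// Vertical parallelization: all column reads are in flight concurrently
/// before we wait on any of them.
std::vector<std::string> readColumns(AsyncReader & reader, const std::vector<std::string> & column_paths)
{
    std::vector<std::future<std::string>> futures;
    for (const auto & path : column_paths)
        futures.push_back(reader.submit(path));

    std::vector<std::string> columns;
    for (auto & f : futures)
        columns.push_back(f.get());
    return columns;
}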

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jul 26, 2021
@alexey-milovidov alexey-milovidov marked this pull request as draft July 26, 2021 01:50
@alexey-milovidov
Member Author

We can use

   preadv2() and pwritev2()
       These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.

       RWF_NOWAIT (since Linux 4.14)
              Do not wait for data which is not immediately available.  If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock.  If some data was
              successfully read, it will return the number of bytes read.  If no bytes were read, it will return -1 and set errno to EAGAIN.  Currently, this flag is meaningful only for preadv2().

for synchronous reads + thread pool for asynchronous reads.
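For reference, a minimal sketch of how that fast path could look (illustrative, not this PR's code): try a non-blocking preadv2 first; if the data is not in the page cache, hand the read over to the thread pool.

// Requires Linux >= 4.14 for RWF_NOWAIT (g++ defines _GNU_SOURCE by default).
#include <sys/uio.h>
#include <cerrno>
#include <optional>
#include <stdexcept>

/// Returns the number of bytes read if the data was already in the page cache,
/// std::nullopt if the read would block and should go to the thread pool.
std::optional<ssize_t> tryReadNoWait(int fd, char * buf, size_t size, off_t offset)
{
    iovec iov{};
    iov.iov_base = buf;
    iov.iov_len = size;

    ssize_t res = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
    if (res >= 0)
        return res;                     /// Served synchronously from the page cache.
    if (errno == EAGAIN)
        return std::nullopt;            /// Not cached: offload to the thread pool.
    if (errno == ENOSYS || errno == EOPNOTSUPP)
        return std::nullopt;            /// Kernel too old: always use the thread pool.
    throw std::runtime_error("preadv2 failed");  /// Real I/O errors are reported as usual.
}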

@alexey-milovidov
Member Author

I manually checked all the tests that slowed down in the performance test,
both with clickhouse-client/clickhouse-benchmark and with the performance-test script.
They did not show any difference.

At the same time, I remember that the codec* tests were not stable...

But if this difference reproduces across multiple runs of the performance test, that will be suspicious.

@alexey-milovidov alexey-milovidov marked this pull request as ready for review August 30, 2021 22:42
@alexey-milovidov
Member Author

Checked on mtlog-perftest03j:

Queries executed: 6456.

localhost:9000, queries 3254, QPS: 6.090, RPS: 108078873.607, MiB/s: 8908.154, result RPS: 6.090, result MiB/s: 0.000.
localhost:9001, queries 3202, QPS: 6.114, RPS: 108501542.512, MiB/s: 8942.991, result RPS: 6.114, result MiB/s: 0.000.

percentile      9000 (old)      9001 (new)
0.000%          0.124 sec.      0.123 sec.
10.000%         0.135 sec.      0.134 sec.
20.000%         0.140 sec.      0.138 sec.
30.000%         0.146 sec.      0.144 sec.
40.000%         0.151 sec.      0.150 sec.
50.000%         0.158 sec.      0.158 sec.
60.000%         0.166 sec.      0.166 sec.
70.000%         0.176 sec.      0.178 sec.
80.000%         0.187 sec.      0.187 sec.
90.000%         0.205 sec.      0.203 sec.
95.000%         0.216 sec.      0.216 sec.
99.000%         0.230 sec.      0.232 sec.
99.900%         0.246 sec.      0.250 sec.
99.990%         0.258 sec.      0.254 sec.

No difference proven at 99.5% confidence

Port 9000 is the old version, 9001 is the new one.
Actually the new version is slightly better, but the difference is not statistically significant.

@alexey-milovidov
Member Author

OK, I've found the reason: the performance test has

            <!-- mmap shows some improvements in perf tests -->
            <min_bytes_to_use_mmap_io>64Mi</min_bytes_to_use_mmap_io>

but it's not used in production.

In this PR, the user has to activate the mmap read method explicitly, in addition to setting min_bytes_to_use_mmap_io.
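For example, something along the lines of SET local_filesystem_read_method = 'mmap', min_bytes_to_use_mmap_io = 67108864 (values illustrative; compare the SET statements later in this thread).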

Member

@alesapin alesapin left a comment

LGTM

@alesapin
Member

alesapin commented Aug 31, 2021

mmap_io | clickhouse_driver.errors.ServerException: Code: 115.
mmap_io | DB::Exception: Unknown setting local_filesystem_read_method. Stack trace:

I think it's OK for the old (master) version?

@alesapin
Member

Integration tests: no related failures.

@alesapin alesapin merged commit 5259991 into master Aug 31, 2021
@alesapin alesapin deleted the async-reads branch August 31, 2021 10:17
@alexey-milovidov
Member Author

Experiment: compare pread_threadpool with the default pread on the mtlog-perftest server with a slow HDD RAID and a clean cache.

echo 3 | sudo tee /proc/sys/vm/drop_caches

Then run two concurrent queries SELECT count(), sum(ignore(*)) FROM hits_1000m in two terminal windows.

  1. Run both queries with pread.
    They finished at the same time, in 89.427 sec, 2.67 GB/s.

  2. Run both queries with pread_threadpool, setting read_priority to 0 for one query and to 1 for the other.
    They finished at the same time, in 71.802 sec, 3.33 GB/s.

So, pread_threadpool is better for slow HDDs.
Currently the pool is hardcoded to 16 threads, but different values may be better.

Despite the different read_priority settings, the queries finished at the same time, most likely because the lower priority query takes data from the page cache after it has been read by the higher priority query (see the sketch at the end of this comment).

  3. Run one query with pread and another with pread_threadpool.
    The query with pread finished in 98.173 sec. and the query with pread_threadpool finished in 94.337 sec. (4% faster).

  4. Do the same but with a warm cache.
    No difference (<3% difference depending on which query started first).

  5. Run both queries with pread_threadpool, setting read_priority to 0 for one query and to 1 for the other. Also set min_bytes_to_use_direct_io to 1 to activate O_DIRECT, so data is never read from the cache and priority should have an effect.

It failed with an Invalid argument error.

  6. Copy the table to another table and run the queries against different tables with pread_threadpool, setting read_priority to 0 for one query and to 1 for the other. This way they cannot reuse each other's data, and priorities should take effect.

Note: the tables may have different read performance due to different placement of the files on the HDD platters (nearer to or farther from the center).

The effect of read_priority is clearly visible.
The second query started to process data only after about a third of the first query had been processed (so it was completely preempted). Then it continued to process data, but much slower than the first query, until the first query finished.

High priority query: 147.093 sec.
Low priority query: 191.454 sec.

Swapped tables:

High priority query: 175.486 sec.
Low priority query: 188.640 sec.

But the low priority query still slows down the higher priority query.

  7. The same with prefetch enabled.

Now the low priority query waited until the high priority query had completely finished.
Most likely this is because with prefetching every thread pushes reads for every column in advance.

High priority query: 121.232 sec.
Low priority query: 188.658 sec.

  8. A single query with prefetch enabled.

93.583 sec.

It is still 25% better than for concurrent queries.

  9. Read from different tables concurrently with pread.

Without prefetch: 162.786 sec. and 202.415 sec.
With prefetch: 307.606 sec. and 315.921 sec.

So, pread_threadpool with priorities is definitely better.

  10. The same, but with the number of threads lowered to max_threads = 4.

160.886 sec. and 162.142 sec.
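The preemption behavior above is what you get from a pool that serves pending reads in priority order. A minimal sketch (illustrative shape, assuming that a smaller read_priority value wins; not this PR's implementation):

#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <tuple>
#include <vector>

/// Pending read requests with the smallest priority value are served first,
/// so a steady stream of high priority reads starves queued low priority ones.
class PriorityReadPool
{
public:
    explicit PriorityReadPool(size_t num_threads)
    {
        for (size_t i = 0; i < num_threads; ++i)
            workers.emplace_back([this] { run(); });
    }

    ~PriorityReadPool()
    {
        {
            std::lock_guard lock(mutex);
            shutdown = true;
        }
        cv.notify_all();
        for (auto & worker : workers)
            worker.join();
    }

    void submit(int priority, std::function<void()> read_task)
    {
        {
            std::lock_guard lock(mutex);
            tasks.push(Task{priority, seq++, std::move(read_task)});
        }
        cv.notify_one();
    }

private:
    struct Task
    {
        int priority = 0;
        uint64_t seq = 0;               /// Keeps FIFO order among equal priorities.
        std::function<void()> fn;

        bool operator<(const Task & rhs) const
        {
            /// priority_queue pops the "largest" element, so invert the order:
            /// the smallest (priority, seq) pair must end up on top.
            return std::tie(priority, seq) > std::tie(rhs.priority, rhs.seq);
        }
    };

    void run()
    {
        while (true)
        {
            Task task;
            {
                std::unique_lock lock(mutex);
                cv.wait(lock, [this] { return shutdown || !tasks.empty(); });
                if (shutdown && tasks.empty())
                    return;
                task = std::move(const_cast<Task &>(tasks.top()));
                tasks.pop();
            }
            task.fn();                  /// Perform the actual read outside the lock.
        }
    }

    std::mutex mutex;
    std::condition_variable cv;
    std::priority_queue<Task> tasks;
    std::vector<std::thread> workers;
    uint64_t seq = 0;
    bool shutdown = false;
};

With this shape, a query that keeps submitting priority 0 reads keeps all 16 workers busy, and priority 1 requests only run when the priority 0 queue is empty, which matches the preemption observed above.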

@alexey-milovidov
Member Author

alexey-milovidov commented Sep 1, 2021

But we also need to test on EBS and on a local SSD.
And with reading a smaller number of columns.

@alexey-milovidov
Member Author

When reading data from the page cache, pread_threadpool is faster with prefetch (because it uses userspace prefetching with double buffering), while pread is slower with prefetch (because it relies on OS readahead).
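A minimal sketch of double-buffered userspace prefetching (illustrative shape, not this PR's code): while the caller consumes one buffer, the next range is already being read into the other, so reads overlap with processing.

#include <future>
#include <unistd.h>
#include <utility>
#include <vector>

class DoubleBufferedReader
{
public:
    DoubleBufferedReader(int fd_, size_t buf_size_) : fd(fd_), buf_size(buf_size_)
    {
        bufs[0].resize(buf_size);
        bufs[1].resize(buf_size);
        schedule();                     /// Prefetch the first range immediately.
    }

    /// Returns the next chunk and its size; 0 means EOF, stop calling next() after that.
    std::pair<const char *, ssize_t> next()
    {
        ssize_t bytes = pending.get();  /// Wait for the in-flight read.
        const char * ready = bufs[current].data();
        if (bytes > 0)
        {
            offset += bytes;
            current ^= 1;
            schedule();                 /// Read ahead into the other buffer.
        }
        return {ready, bytes};          /// Caller consumes this while the next read runs.
    }

private:
    void schedule()
    {
        pending = std::async(std::launch::async, [this]
        {
            return ::pread(fd, bufs[current].data(), buf_size, offset);
        });
    }

    int fd;
    size_t buf_size;
    off_t offset = 0;
    int current = 0;
    std::vector<char> bufs[2];
    std::future<ssize_t> pending;
};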

@alexey-milovidov
Member Author

Experiment: read data from the page cache, a single column, and compare various methods:

clickhouse-benchmark --iterations 100 --local_filesystem_read_method pread --local_filesystem_read_prefetch 0 --query "SELECT sum(ignore(WatchID)) FROM hits_1000m"

50.000%         0.292 sec.
60.000%         0.311 sec.
70.000%         0.319 sec.
80.000%         0.334 sec.
90.000%         0.356 sec.
95.000%         0.368 sec.
99.000%         0.397 sec.

clickhouse-benchmark --iterations 100 --local_filesystem_read_method pread --local_filesystem_read_prefetch 1 --query "SELECT sum(ignore(WatchID)) FROM hits_1000m"

50.000%         0.288 sec.
60.000%         0.302 sec.
70.000%         0.318 sec.
80.000%         0.336 sec.
90.000%         0.354 sec.
95.000%         0.368 sec.
99.000%         0.376 sec.

clickhouse-benchmark --iterations 100 --local_filesystem_read_method pread_threadpool --local_filesystem_read_prefetch 0 --query "SELECT sum(ignore(WatchID)) FROM hits_1000m"

50.000%         0.268 sec.
60.000%         0.268 sec.
70.000%         0.270 sec.
80.000%         0.273 sec.
90.000%         0.277 sec.
95.000%         0.282 sec.
99.000%         0.303 sec.

clickhouse-benchmark --iterations 100 --local_filesystem_read_method pread_threadpool --local_filesystem_read_prefetch 1 --query "SELECT sum(ignore(WatchID)) FROM hits_1000m"

50.000%         0.248 sec.
60.000%         0.249 sec.
70.000%         0.250 sec.
80.000%         0.252 sec.
90.000%         0.257 sec.
95.000%         0.265 sec.
99.000%         0.274 sec.

/// Check if data is already in page cache with preadv2 syscall.

/// We don't want to depend on new Linux kernel.
static std::atomic<bool> has_pread_nowait_support{true};
Collaborator

std::atomic is superfluous, right? Since initializing a static variable is thread safe.
Also, maybe it's worth initializing it before any reads, to avoid the atomic requirement?

Member Author

It's lazily initialized, which is not thread safe.
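The pattern under discussion looks roughly like this (illustrative sketch, not the PR's exact code): the static initialization of true is indeed thread safe, but the flag is flipped later, by whichever call first discovers that the kernel lacks RWF_NOWAIT, and that write can race with concurrent readers, hence the std::atomic.

#include <atomic>
#include <cerrno>
#include <sys/uio.h>

static std::atomic<bool> has_pread_nowait_support{true};

ssize_t readNoWaitIfSupported(int fd, iovec * iov, int iovcnt, off_t offset)
{
    if (!has_pread_nowait_support.load(std::memory_order_relaxed))
    {
        errno = EAGAIN;                 /// Pretend "would block": caller falls back to the thread pool.
        return -1;
    }

    ssize_t res = preadv2(fd, iov, iovcnt, offset, RWF_NOWAIT);
    if (res == -1 && (errno == ENOSYS || errno == EOPNOTSUPP))
        has_pread_nowait_support.store(false, std::memory_order_relaxed);  /// Set lazily, on the first failed call.
    return res;
}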

@azat
Collaborator

azat commented Sep 1, 2021

@alexey-milovidov interesting numbers, but what about real queries (when the data is in the local filesystem) in a production env, do you think it will give any benefit?
To me, simply reading and processing the column (i.e. decompressing and passing blocks on to the next step of the pipeline) instead of using a thread pool for reading blocks looks like a better pattern.

Currently the pool is hardcoded to 16 threads, but different values may be better.

Especially when the underlying storage is some RAID device; for a regular HDD, 16 threads does not look optimal.

@alexey-milovidov
Member Author

alexey-milovidov commented Sep 1, 2021

To me, simply reading and processing the column (i.e. decompressing and passing blocks on to the next step of the pipeline) instead of using a thread pool for reading blocks looks like a better pattern.

When data is not in the page cache, it is beneficial to limit parallelism for HDDs.

Especially when the underlying storage is some RAID device; for a regular HDD, 16 threads does not look optimal.

This server uses a RAID of 8 HDDs.

@alexey-milovidov
Member Author

Experiment: compare pread and pread_threadpool on a server with EBS in Yandex Cloud.

The server has 80 vCPUs.
Note: EBS performance is capped at 430 MB/sec.

SELECT toStartOfMonth(created_at) AS d, count() AS c, bar(c, 0, 5000, 100) FROM github_events WHERE actor_login = 'alexey-milovidov' GROUP BY d ORDER BY d

There is no difference between pread and pread_threadpool on data from page cache.
When data is in page cache, enabling prefetch adds some minor slowdown.

There is no difference between pread and pread_threadpool on cold data.
On cold data, enabling prefetch makes no difference for pread_threadpool and adds minor slowdown for pread.

Result: pread_threadpool is no worse than pread on a server with EBS.

@alexey-milovidov
Member Author

alexey-milovidov commented Sep 2, 2021

Experiment: multicolumn short query on a server with EBS in Yandex Cloud.

SELECT sum(ignore(*)) FROM github_events WHERE repo_name IN ('ClickHouse/ClickHouse', 'yandex/ClickHouse')

When reading from page cache there is no difference between pread and pread_threadpool.
When reading from EBS there is no difference between pread and pread_threadpool.

@alexey-milovidov
Member Author

I've also tested on an Intel Optane SSD.
There is no difference between pread and pread_threadpool.

@alexey-milovidov
Member Author

Experiment: testing read_priority on a server with EBS in Yandex Cloud.

  1. Run queries in two terminal windows concurrently:

First session:

SET local_filesystem_read_method = 'pread_threadpool', read_priority = 0
SELECT sum(ignore(*)) FROM github_events

Second session:

SET local_filesystem_read_method = 'pread_threadpool', read_priority = 1
SELECT sum(ignore(*)) FROM github_events

Data is being read at constant speed of 430 MB/sec (the limit of EBS).

The first query started processing data instantly.

The second query started only after several tens of seconds, and then processed data at nearly the same speed, since it picks up data from the page cache. The second query also paused a few times in the middle; that is when it had to read something from disk but got preempted by the first query.

The first query finished in 618.832 sec.
The second query finished in 634.192 sec.

  2. What if we run only a single query?

612.859 sec.

The same speed as for two concurrent queries (within 1% difference). This is a very good result.

  3. What if we run two concurrent queries with the pread method (no priorities)?

689.694 sec.

Result: the pread_threadpool method is a clear win, and priorities work perfectly on a server with EBS.

@alexey-milovidov
Member Author

I see that we need to enable the pread_threadpool method by default, but I also want to add the following parameters:

  1. Per-query and per-user maximum concurrency of reads.
  2. A read_share parameter as an addition to read_priority, for soft priorities.

@azat
Collaborator

azat commented Oct 14, 2021

Run both queries with pread_threadpool, setting read_priority to 0 for one query and to 1 for the other. Also set min_bytes_to_use_direct_io to 1 to activate O_DIRECT, so data is never read from the cache and priority should have an effect.

It failed with an Invalid argument error.

Should be addressed in #30191
