
Experiment with asynchronous readers #26791

Merged
merged 65 commits from async-reads into master
Aug 31, 2021

Conversation

alexey-milovidov
Member

@alexey-milovidov alexey-milovidov commented Jul 26, 2021

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Currently it is unrelated, and with the current setup it makes performance slightly worse most of the time.
But it is aimed at fixing the absolutely terrible performance of point queries from external storage (#23199)
by requesting reads of multiple columns at once (vertical parallelization).
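For intuition, here is a minimal sketch of the vertical parallelization idea (illustrative only; AsyncReader and its names are hypothetical, not this PR's actual interface): submit a read per column up front and only then wait, so a point query over high-latency storage pays the round-trip latency once instead of once per column.

#include <fstream>
#include <future>
#include <sstream>
#include <string>
#include <vector>

/// Toy asynchronous reader: every submitted read runs as its own task.
/// (A real implementation would use a shared thread pool.)
struct AsyncReader
{
    std::future<std::string> submit(const std::string & path)
    {
        return std::async(std::launch::async, [path]
        {
            std::ifstream file(path, std::ios::binary);
            std::ostringstream data;
            data << file.rdbuf();
            return data.str();
        });
    }
};

/// Vertical parallelization: all column reads are in flight concurrently
/// before we wait on any of them.
std::vector<std::string> readColumns(AsyncReader & reader, const std::vector<std::string> & column_paths)
{
    std::vector<std::future<std::string>> futures;
    for (const auto & path : column_paths)
        futures.push_back(reader.submit(path));

    std::vector<std::string> columns;
    for (auto & f : futures)
        columns.push_back(f.get());
    return columns;
}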

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jul 26, 2021
@alexey-milovidov alexey-milovidov marked this pull request as draft July 26, 2021 01:50
@alexey-milovidov
Member Author

We can use

   preadv2() and pwritev2()
       These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.

       RWF_NOWAIT (since Linux 4.14)
              Do not wait for data which is not immediately available.  If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock.  If some data was
              successfully read, it will return the number of bytes read.  If no bytes were read, it will return -1 and set errno to EAGAIN.  Currently, this flag is meaningful only for preadv2().

for synchronous reads + thread pool for asynchronous reads.
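For reference, a minimal sketch of how that fast path could look (illustrative, not this PR's code): try a non-blocking preadv2 first; if the data is not in the page cache, hand the read over to the thread pool.

// Requires Linux >= 4.14 for RWF_NOWAIT (g++ defines _GNU_SOURCE by default).
#include <sys/uio.h>
#include <cerrno>
#include <optional>
#include <stdexcept>

/// Returns the number of bytes read if the data was already in the page cache,
/// std::nullopt if the read would block and should go to the thread pool.
std::optional<ssize_t> tryReadNoWait(int fd, char * buf, size_t size, off_t offset)
{
    iovec iov{};
    iov.iov_base = buf;
    iov.iov_len = size;

    ssize_t res = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
    if (res >= 0)
        return res;                     /// Served synchronously from the page cache.
    if (errno == EAGAIN)
        return std::nullopt;            /// Not cached: offload to the thread pool.
    if (errno == ENOSYS || errno == EOPNOTSUPP)
        return std::nullopt;            /// Kernel too old: always use the thread pool.
    throw std::runtime_error("preadv2 failed");  /// Real I/O errors are reported as usual.
}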

@alexey-milovidov
Member Author

I manually checked all the tests that slowed down in the performance test,
both with clickhouse-client/clickhouse-benchmark and with the performance-test script.
They did not show any difference.

At the same time, I remember that the codec* tests were not stable...

But if this difference reproduces across multiple runs of the performance test, that will be suspicious.

@alexey-milovidov alexey-milovidov marked this pull request as ready for review August 30, 2021 22:42
@alexey-milovidov
Member Author

Checked on mtlog-perftest03j:

Queries executed: 6456.

localhost:9000, queries 3254, QPS: 6.090, RPS: 108078873.607, MiB/s: 8908.154, result RPS: 6.090, result MiB/s: 0.000.
localhost:9001, queries 3202, QPS: 6.114, RPS: 108501542.512, MiB/s: 8942.991, result RPS: 6.114, result MiB/s: 0.000.

percentile      9000 (old)      9001 (new)
0.000%          0.124 sec.      0.123 sec.
10.000%         0.135 sec.      0.134 sec.
20.000%         0.140 sec.      0.138 sec.
30.000%         0.146 sec.      0.144 sec.
40.000%         0.151 sec.      0.150 sec.
50.000%         0.158 sec.      0.158 sec.
60.000%         0.166 sec.      0.166 sec.
70.000%         0.176 sec.      0.178 sec.
80.000%         0.187 sec.      0.187 sec.
90.000%         0.205 sec.      0.203 sec.
95.000%         0.216 sec.      0.216 sec.
99.000%         0.230 sec.      0.232 sec.
99.900%         0.246 sec.      0.250 sec.
99.990%         0.258 sec.      0.254 sec.

No difference proven at 99.5% confidence

Port 9000 is the old version, 9001 is the new one.
Actually the new version is slightly better, but the difference is not statistically significant.

@alexey-milovidov
Member Author

OK, I've found the reason: the performance test has

            <!-- mmap shows some improvements in perf tests -->
            <min_bytes_to_use_mmap_io>64Mi</min_bytes_to_use_mmap_io>

but it's not used in production.

In this PR, the user has to activate the mmap read method explicitly, in addition to setting min_bytes_to_use_mmap_io.
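For example, something along the lines of SET local_filesystem_read_method = 'mmap', min_bytes_to_use_mmap_io = 67108864 (values illustrative; compare the SET statements later in this thread).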

Member

@alesapin alesapin left a comment

LGTM

@alesapin
Member

alesapin commented Aug 31, 2021

mmap_io | clickhouse_driver.errors.ServerException: Code: 115.
mmap_io | DB::Exception: Unknown setting local_filesystem_read_method. Stack trace:

I think it's OK for the old (master) version?

@alesapin
Member

Integration tests: no related failures.

@alesapin alesapin merged commit 5259991 into master Aug 31, 2021
@alesapin alesapin deleted the async-reads branch August 31, 2021 10:17
@alexey-milovidov
Member Author

Experiment: compare pread_threadpool with the default pread on the mtlog-perftest server with a slow HDD RAID and a clean cache.

echo 3 | sudo tee /proc/sys/vm/drop_caches

Then run two concurrent queries SELECT count(), sum(ignore(*)) FROM hits_1000m in two terminal windows.

  1. Run both queries with pread.
    They finished at the same time, in 89.427 sec, 2.67 GB/s.

  2. Run both queries with pread_threadpool, setting read_priority to 0 for one query and to 1 for the other.
    They finished at the same time, in 71.802 sec, 3.33 GB/s.

So, pread_threadpool is better for slow HDDs.
Currently the pool is hardcoded to 16 threads, but different values may be better.

Despite the different read_priority settings, the queries finished at the same time, most likely because the lower priority query takes data from the page cache after it has been read by the higher priority query (see the sketch at the end of this comment).

  3. Run one query with pread and another with pread_threadpool.
    The query with pread finished in 98.173 sec. and the query with pread_threadpool finished in 94.337 sec. (4% faster).

  4. Do the same but with a warm cache.
    No difference (<3% difference depending on which query started first).

  5. Run both queries with pread_threadpool, setting read_priority to 0 for one query and to 1 for the other. Also set min_bytes_to_use_direct_io to 1 to activate O_DIRECT, so data is never read from the cache and priority should have an effect.

It failed with an Invalid argument error.

  6. Copy the table to another table and run the queries against different tables with pread_threadpool, setting read_priority to 0 for one query and to 1 for the other. This way they cannot reuse each other's data, and priorities should take effect.

Note: the tables may have different read performance due to different placement of the files on the HDD platters (nearer to or farther from the center).

The effect of read_priority is clearly visible.
The second query started to process data only after about a third of the first query had been processed (so it was completely preempted). Then it continued to process data, but much slower than the first query, until the first query finished.

High priority query: 147.093 sec.
Low priority query: 191.454 sec.

Swapped tables:

High priority query: 175.486 sec.
Low priority query: 188.640 sec.

But the low priority query still slows down the higher priority query.

  7. The same with prefetch enabled.

Now the low priority query waited until the high priority query had completely finished.
Most likely this is because with prefetching every thread pushes reads for every column in advance.

High priority query: 121.232 sec.
Low priority query: 188.658 sec.

  8. A single query with prefetch enabled.

93.583 sec.

It is still 25% better than for concurrent queries.

  9. Read from different tables concurrently with pread.

Without prefetch: 162.786 sec. and 202.415 sec.
With prefetch: 307.606 sec. and 315.921 sec.

So, pread_threadpool with priorities is definitely better.

  10. The same, but with the number of threads lowered to max_threads = 4.

160.886 sec. and 162.142 sec.
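The preemption behavior above is what you get from a pool that serves pending reads in priority order. A minimal sketch (illustrative shape, assuming that a smaller read_priority value wins; not this PR's implementation):

#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <tuple>
#include <vector>

/// Pending read requests with the smallest priority value are served first,
/// so a steady stream of high priority reads starves queued low priority ones.
class PriorityReadPool
{
public:
    explicit PriorityReadPool(size_t num_threads)
    {
        for (size_t i = 0; i < num_threads; ++i)
            workers.emplace_back([this] { run(); });
    }

    ~PriorityReadPool()
    {
        {
            std::lock_guard lock(mutex);
            shutdown = true;
        }
        cv.notify_all();
        for (auto & worker : workers)
            worker.join();
    }

    void submit(int priority, std::function<void()> read_task)
    {
        {
            std::lock_guard lock(mutex);
            tasks.push(Task{priority, seq++, std::move(read_task)});
        }
        cv.notify_one();
    }

private:
    struct Task
    {
        int priority = 0;
        uint64_t seq = 0;               /// Keeps FIFO order among equal priorities.
        std::function<void()> fn;

        bool operator<(const Task & rhs) const
        {
            /// priority_queue pops the "largest" element, so invert the order:
            /// the smallest (priority, seq) pair must end up on top.
            return std::tie(priority, seq) > std::tie(rhs.priority, rhs.seq);
        }
    };

    void run()
    {
        while (true)
        {
            Task task;
            {
                std::unique_lock lock(mutex);
                cv.wait(lock, [this] { return shutdown || !tasks.empty(); });
                if (shutdown && tasks.empty())
                    return;
                task = std::move(const_cast<Task &>(tasks.top()));
                tasks.pop();
            }
            task.fn();                  /// Perform the actual read outside the lock.
        }
    }

    std::mutex mutex;
    std::condition_variable cv;
    std::priority_queue<Task> tasks;
    std::vector<std::thread> workers;
    uint64_t seq = 0;
    bool shutdown = false;
};

With this shape, a query that keeps submitting priority 0 reads keeps all 16 workers busy, and priority 1 requests only run when the priority 0 queue is empty, which matches the preemption observed above.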

@alexey-milovidov
Member Author

alexey-milovidov commented Sep 1, 2021

But we also need to test on EBS and on a local SSD.
And with reading a smaller number of columns.

@alexey-milovidov
Member Author

When reading data from the page cache, pread_threadpool is faster with prefetch (because it uses userspace prefetching with double buffering), while pread is slower with prefetch (because it relies on OS readahead).
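A minimal sketch of double-buffered userspace prefetching (illustrative shape, not this PR's code): while the caller consumes one buffer, the next range is already being read into the other, so reads overlap with processing.

#include <future>
#include <unistd.h>
#include <utility>
#include <vector>

class DoubleBufferedReader
{
public:
    DoubleBufferedReader(int fd_, size_t buf_size_) : fd(fd_), buf_size(buf_size_)
    {
        bufs[0].resize(buf_size);
        bufs[1].resize(buf_size);
        schedule();                     /// Prefetch the first range immediately.
    }

    /// Returns the next chunk and its size; 0 means EOF, stop calling next() after that.
    std::pair<const char *, ssize_t> next()
    {
        ssize_t bytes = pending.get();  /// Wait for the in-flight read.
        const char * ready = bufs[current].data();
        if (bytes > 0)
        {
            offset += bytes;
            current ^= 1;
            schedule();                 /// Read ahead into the other buffer.
        }
        return {ready, bytes};          /// Caller consumes this while the next read runs.
    }

private:
    void schedule()
    {
        pending = std::async(std::launch::async, [this]
        {
            return ::pread(fd, bufs[current].data(), buf_size, offset);
        });
    }

    int fd;
    size_t buf_size;
    off_t offset = 0;
    int current = 0;
    std::vector<char> bufs[2];
    std::future<ssize_t> pending;
};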

@alexey-milovidov
Member Author

Experiment: read data from the page cache, a single column, and compare various methods:

clickhouse-benchmark --iterations 100 --local_filesystem_read_method pread --local_filesystem_read_prefetch 0 --query "SELECT sum(ignore(WatchID)) FROM hits_1000m"

50.000%         0.292 sec.
60.000%         0.311 sec.
70.000%         0.319 sec.
80.000%         0.334 sec.
90.000%         0.356 sec.
95.000%         0.368 sec.
99.000%         0.397 sec.

clickhouse-benchmark --iterations 100 --local_filesystem_read_method pread --local_filesystem_read_prefetch 1 --query "SELECT sum(ignore(WatchID)) FROM hits_1000m"

50.000%         0.288 sec.
60.000%         0.302 sec.
70.000%         0.318 sec.
80.000%         0.336 sec.
90.000%         0.354 sec.
95.000%         0.368 sec.
99.000%         0.376 sec.

clickhouse-benchmark --iterations 100 --local_filesystem_read_method pread_threadpool --local_filesystem_read_prefetch 0 --query "SELECT sum(ignore(WatchID)) FROM hits_1000m"

50.000%         0.268 sec.
60.000%         0.268 sec.
70.000%         0.270 sec.
80.000%         0.273 sec.
90.000%         0.277 sec.
95.000%         0.282 sec.
99.000%         0.303 sec.

clickhouse-benchmark --iterations 100 --local_filesystem_read_method pread_threadpool --local_filesystem_read_prefetch 1 --query "SELECT sum(ignore(WatchID)) FROM hits_1000m"

50.000%         0.248 sec.
60.000%         0.249 sec.
70.000%         0.250 sec.
80.000%         0.252 sec.
90.000%         0.257 sec.
95.000%         0.265 sec.
99.000%         0.274 sec.

/// Check if data is already in page cache with preadv2 syscall.

/// We don't want to depend on new Linux kernel.
static std::atomic<bool> has_pread_nowait_support{true};
Collaborator

std::atomic is superfluous, right? Since initializing a static variable is thread safe.
Also, maybe it's worth initializing it before any reads, to avoid the atomic requirement?

Member Author

It's lazily initialized, which is not thread safe.
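The pattern under discussion looks roughly like this (illustrative sketch, not the PR's exact code): the static initialization of true is indeed thread safe, but the flag is flipped later, by whichever call first discovers that the kernel lacks RWF_NOWAIT, and that write can race with concurrent readers, hence the std::atomic.

#include <atomic>
#include <cerrno>
#include <sys/uio.h>

static std::atomic<bool> has_pread_nowait_support{true};

ssize_t readNoWaitIfSupported(int fd, iovec * iov, int iovcnt, off_t offset)
{
    if (!has_pread_nowait_support.load(std::memory_order_relaxed))
    {
        errno = EAGAIN;                 /// Pretend "would block": caller falls back to the thread pool.
        return -1;
    }

    ssize_t res = preadv2(fd, iov, iovcnt, offset, RWF_NOWAIT);
    if (res == -1 && (errno == ENOSYS || errno == EOPNOTSUPP))
        has_pread_nowait_support.store(false, std::memory_order_relaxed);  /// Set lazily, on the first failed call.
    return res;
}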

@azat
Collaborator

azat commented Sep 1, 2021

@alexey-milovidov interesting numbers, but what about real queries (when the data is in the local filesystem) in a production env, do you think it will give any benefit?
To me, simply reading and processing the column (i.e. decompressing and passing blocks on to the next step of the pipeline) instead of using a thread pool for reading blocks looks like a better pattern.

Currently the pool is hardcoded to 16 threads, but different values may be better.

Especially when the underlying storage is some RAID device; for a regular HDD, 16 threads does not look optimal.

@alexey-milovidov
Member Author

alexey-milovidov commented Sep 1, 2021

To me, simply reading and processing the column (i.e. decompressing and passing blocks on to the next step of the pipeline) instead of using a thread pool for reading blocks looks like a better pattern.

When data is not in the page cache, it is beneficial to limit parallelism for HDDs.

Especially when the underlying storage is some RAID device; for a regular HDD, 16 threads does not look optimal.

This server uses a RAID of 8 HDDs.

@alexey-milovidov
Member Author

Experiment: compare pread and pread_threadpool on a server with EBS in Yandex Cloud.

The server has 80 vCPUs.
Note: EBS performance is capped at 430 MB/sec.

SELECT toStartOfMonth(created_at) AS d, count() AS c, bar(c, 0, 5000, 100) FROM github_events WHERE actor_login = 'alexey-milovidov' GROUP BY d ORDER BY d

There is no difference between pread and pread_threadpool on data from page cache.
When data is in page cache, enabling prefetch adds some minor slowdown.

There is no difference between pread and pread_threadpool on cold data.
On cold data, enabling prefetch makes no difference for pread_threadpool and adds minor slowdown for pread.

Result: pread_threadpool is no worse than pread on a server with EBS.

@alexey-milovidov
Member Author

alexey-milovidov commented Sep 2, 2021

Experiment: multicolumn short query on a server with EBS in Yandex Cloud.

SELECT sum(ignore(*)) FROM github_events WHERE repo_name IN ('ClickHouse/ClickHouse', 'yandex/ClickHouse')

When reading from page cache there is no difference between pread and pread_threadpool.
When reading from EBS there is no difference between pread and pread_threadpool.

@alexey-milovidov
Member Author

I've also tested on an Intel Optane SSD.
There is no difference between pread and pread_threadpool.

@alexey-milovidov
Member Author

Experiment: testing read_priority on a server with EBS in Yandex Cloud.

  1. Run queries in two terminal windows concurrently:

First session:

SET local_filesystem_read_method = 'pread_threadpool', read_priority = 0
SELECT sum(ignore(*)) FROM github_events

Second session:

SET local_filesystem_read_method = 'pread_threadpool', read_priority = 1
SELECT sum(ignore(*)) FROM github_events

Data is being read at constant speed of 430 MB/sec (the limit of EBS).

The first query started processing data instantly.

The second query started only after several tens of seconds, and then processed data at nearly the same speed, since it picks up data from the page cache. The second query also paused a few times in the middle; that is when it had to read something from disk but got preempted by the first query.

The first query finished in 618.832 sec.
The second query finished in 634.192 sec.

  2. What if we run only a single query?

612.859 sec.

The same speed as for two concurrent queries (within 1% difference). This is a very good result.

  3. What if we run two concurrent queries with the pread method (no priorities)?

689.694 sec.

Result: the pread_threadpool method is a clear win, and priorities work perfectly on a server with EBS.

@alexey-milovidov
Member Author

I see that we need to enable the pread_threadpool method by default, but I also want to add the following parameters:

  1. Per-query and per-user maximum concurrency of reads.
  2. A read_share parameter as an addition to read_priority, for soft priorities.

@azat
Collaborator

azat commented Oct 14, 2021

Run both queries with pread_threadpool, setting read_priority to 0 for one query and to 1 for the other. Also set min_bytes_to_use_direct_io to 1 to activate O_DIRECT, so data is never read from the cache and priority should have an effect.

It failed with an Invalid argument error.

Should be addressed in #30191
