
Predict size of hash table for GROUP BY #33439

Merged
merged 22 commits into ClickHouse:master from nickitat:nickitat_cache_hash_table_sizes on Mar 30, 2022

Conversation

nickitat
Member

@nickitat nickitat commented Jan 6, 2022

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Sizes of hash tables used during aggregation are now collected and used in later queries to avoid hash table resizes.

Detailed description / Documentation draft:
...

I see the following options to improve the current implementation:

  • For now, statistics collection and usage are not implemented for merging partially aggregated results. This happens in a notable number of queries (distributed, in-order, cube & rollup), so it may be worth doing.
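
The general idea, as a minimal sketch (types and names here are illustrative, not the actual ClickHouse code; the real implementation uses an LRU cache bounded by max_entries_for_hash_table_stats): keep a cache from a key identifying the query to the hash table size observed at the end of aggregation, and use that value as a preallocation hint the next time the same query runs.

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <optional>
#include <unordered_map>

/// Illustrative sketch only.
class HashTableSizeHints
{
public:
    std::optional<std::size_t> getHint(std::uint64_t query_key) const
    {
        std::lock_guard lock(mutex);
        if (auto it = hints.find(query_key); it != hints.end())
            return it->second;
        return std::nullopt;
    }

    /// Called when aggregation finishes, with the number of keys actually observed.
    void update(std::uint64_t query_key, std::size_t observed_size)
    {
        std::lock_guard lock(mutex);
        auto & hint = hints[query_key];
        /// Keep the maximum of the observed values until the prediction turns out
        /// to be too wrong (see the review discussion of this condition below).
        if (hint == 0 || observed_size > hint || observed_size < hint / 2)
            hint = observed_size;
    }

private:
    mutable std::mutex mutex;
    std::unordered_map<std::uint64_t, std::size_t> hints;  /// query key -> predicted number of keys
};

On the query path, the aggregator would consult the hint before creating its hash tables and preallocate accordingly.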

Tl;dr results

Issue #24317

@robot-clickhouse robot-clickhouse added the pr-performance Pull request with some performance improvements label Jan 6, 2022
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jan 6, 2022
@alexey-milovidov
Member

Maybe something like SYSTEM DROP HASH TABLE STATISTICS should be provided, by analogy with SYSTEM DROP * CACHE.

It looks very low-level; maybe we can avoid adding this command.
I don't want users to use it :)

The only "observable behaviour" here is performance, and in theory performance tests on a sufficiently rich set of queries may be enough. But there are enough places where things may go wrong, and relying only on perf tests is insufficient IMO.

Just check that nothing is broken, using our existing tests.
Another option is to add ProfileEvents and then write a test that checks their values in system.query_log.

@@ -483,6 +483,10 @@ class IColumn;
M(Bool, optimize_rewrite_sum_if_to_count_if, true, "Rewrite sumIf() and sum(if()) function countIf() function when logically equivalent", 0) \
M(UInt64, insert_shard_id, 0, "If non zero, when insert into a distributed table, the data will be inserted into the shard `insert_shard_id` synchronously. Possible values range from 1 to `shards_number` of corresponding distributed table", 0) \
\
M(Bool, collect_hash_table_stats_during_aggregation, true, "Enable collecting hash table statistics to optimize memory allocation", 0) \
M(UInt64, max_entries_for_hash_table_stats, 10'000, "How much entries hash table statistics collected during aggregation is allowed to have", 0) \
Member

much -> many.

Member Author

done

$CLICKHOUSE_CLIENT -q "SYSTEM FLUSH LOGS"
$CLICKHOUSE_CLIENT \
--param_query_id="$query_id" \
-q "WITH ( \
Member

It should work without newline escaping (inside a string literal).

Member Author

done

@alexey-milovidov
Member

BTW, what performance improvement do you observe with manual testing?

}

std::mutex mutex;
CachePtr hash_table_stats;
Collaborator

Worth exporting this LRUCache in the same manner as others, in AsynchronousMetrics.cpp:

{
    if (auto uncompressed_cache = getContext()->getUncompressedCache())
    {
        new_values["UncompressedCacheBytes"] = uncompressed_cache->weight();
        new_values["UncompressedCacheCells"] = uncompressed_cache->count();
    }
}

Member Author

Thanks, I replaced the part of the description about SYSTEM DROP HASH TABLE STATISTICS with what you suggested.

Member Author

done

@nickitat
Member Author

nickitat commented Jan 7, 2022

BTW, what performance improvement do you observe with manual testing?

Here are some numbers
display-name    0       SELECT number FROM numbers(10000000) GROUP BY number FORMAT Null SETTINGS collect_hash_table_stats_during_aggregation = 0
prewarm 0       hash_table_sizes_stats.query0.prewarm0  0 0.7254605293273926
query   0       hash_table_sizes_stats.query0.run0      0 0.724083662033081
query   0       hash_table_sizes_stats.query0.run1      0 0.7410163879394531
query   0       hash_table_sizes_stats.query0.run2      0 0.7370643615722656
query   0       hash_table_sizes_stats.query0.run3      0 0.7427923679351807
query   0       hash_table_sizes_stats.query0.run4      0 0.7495505809783936
query   0       hash_table_sizes_stats.query0.run5      0 0.7057178020477295
query   0       hash_table_sizes_stats.query0.run6      0 0.7060608863830566
client-time     0       5.106584181998187       5.10628604888916
display-name    1       SELECT number FROM numbers(10000000) GROUP BY number FORMAT Null SETTINGS collect_hash_table_stats_during_aggregation = 1
prewarm 1       hash_table_sizes_stats.query1.prewarm0  0 0.5177068710327148
query   1       hash_table_sizes_stats.query1.run0      0 0.5138120651245117
query   1       hash_table_sizes_stats.query1.run1      0 0.516057014465332
query   1       hash_table_sizes_stats.query1.run2      0 0.511322021484375
query   1       hash_table_sizes_stats.query1.run3      0 0.5145668983459473
query   1       hash_table_sizes_stats.query1.run4      0 0.5127284526824951
query   1       hash_table_sizes_stats.query1.run5      0 0.5113763809204102
query   1       hash_table_sizes_stats.query1.run6      0 0.5138087272644043
client-time     1       3.593963317998714       3.5936715602874756
display-name    2       SELECT number FROM numbers_mt(500000) GROUP BY number format Null SETTINGS collect_hash_table_stats_during_aggregation = 0
prewarm 2       hash_table_sizes_stats.query2.prewarm0  0 0.03299689292907715
query   2       hash_table_sizes_stats.query2.run0      0 0.030817031860351562
query   2       hash_table_sizes_stats.query2.run1      0 0.030933618545532227
query   2       hash_table_sizes_stats.query2.run2      0 0.03640604019165039
query   2       hash_table_sizes_stats.query2.run3      0 0.026384830474853516
query   2       hash_table_sizes_stats.query2.run4      0 0.03062891960144043
query   2       hash_table_sizes_stats.query2.run5      0 0.025572776794433594
query   2       hash_table_sizes_stats.query2.run6      0 0.02617502212524414
client-time     2       0.20719847900181776     0.20691823959350586
display-name    3       SELECT number FROM numbers_mt(500000) GROUP BY number format Null SETTINGS collect_hash_table_stats_during_aggregation = 1
prewarm 3       hash_table_sizes_stats.query3.prewarm0  0 0.015240907669067383
query   3       hash_table_sizes_stats.query3.run0      0 0.01188802719116211
query   3       hash_table_sizes_stats.query3.run1      0 0.012000560760498047
query   3       hash_table_sizes_stats.query3.run2      0 0.0118408203125
query   3       hash_table_sizes_stats.query3.run3      0 0.0110015869140625
query   3       hash_table_sizes_stats.query3.run4      0 0.012171268463134766
query   3       hash_table_sizes_stats.query3.run5      0 0.014752864837646484
query   3       hash_table_sizes_stats.query3.run6      0 0.011242151260375977
client-time     3       0.08522102699862444     0.08489727973937988
display-name    4       SELECT number FROM numbers_mt(1000000) GROUP BY number format Null SETTINGS collect_hash_table_stats_during_aggregation = 0
prewarm 4       hash_table_sizes_stats.query4.prewarm0  0 0.04354548454284668
query   4       hash_table_sizes_stats.query4.run0      0 0.03917956352233887
query   4       hash_table_sizes_stats.query4.run1      0 0.04040980339050293
query   4       hash_table_sizes_stats.query4.run2      0 0.03927874565124512
query   4       hash_table_sizes_stats.query4.run3      0 0.03807783126831055
query   4       hash_table_sizes_stats.query4.run4      0 0.04324507713317871
query   4       hash_table_sizes_stats.query4.run5      0 0.03917741775512695
query   4       hash_table_sizes_stats.query4.run6      0 0.04053497314453125
client-time     4       0.2802344739975524      0.2799034118652344
display-name    5       SELECT number FROM numbers_mt(1000000) GROUP BY number format Null SETTINGS collect_hash_table_stats_during_aggregation = 1
prewarm 5       hash_table_sizes_stats.query5.prewarm0  0 0.032332420349121094
query   5       hash_table_sizes_stats.query5.run0      0 0.033093929290771484
query   5       hash_table_sizes_stats.query5.run1      0 0.03340029716491699
query   5       hash_table_sizes_stats.query5.run2      0 0.03559160232543945
query   5       hash_table_sizes_stats.query5.run3      0 0.030649662017822266
query   5       hash_table_sizes_stats.query5.run4      0 0.03403615951538086
query   5       hash_table_sizes_stats.query5.run5      0 0.03227496147155762
query   5       hash_table_sizes_stats.query5.run6      0 0.030020475387573242
client-time     5       0.22936771600143402     0.22906708717346191
So, at least it performs well in the case of a large single-level HT, and in the case of a two-level one when the table's size is such that previously we decided to convert the aggregation to two-level near the end of the aggregation (but now we're able to do that from the beginning). For bigger tables the benefit of early conversion to two-level becomes less noticeable.

Since the setting is enabled by default, all existing perf tests actually test this logic too, so I didn't spend much time exploring different queries. The tests actually showed some improvements (one, two), but also some regressions, so there is something to look at. I intend to do that at the first chance I get. UPD: check the comment below.

@UnamedRus
Contributor

It looks very low-level; maybe we can avoid adding this command.

In other DBMSes like PostgreSQL, there are commands which allow altering table data statistics:
https://postgrespro.ru/docs/postgresql/10/sql-dropstatistics

@alexey-milovidov
Member

But this statistic is not about tables' data, it is something more ephemeral.

@nickitat
Member Author

nickitat commented Jan 29, 2022

Some queries from the "group_by_fixed_keys" test have slowed down. This happened because previously we ended up with a hash table of 1024 elements, while now the query will use a HT of 512 elements (preallocated at the beginning). This is true for every second power of two, but it seems to make things worse only for fairly small sizes (~4096 on my machine); for bigger tables it actually gives us perf benefits (PLEASE check the "Experiment results" section below). So I just left all this logic untouched.

Experiment results

The following query from the "group_by_fixed_keys" test was used:

    WITH number % SIZE AS k, toUInt64(k) AS k1, k1 + 1 AS k2
  SELECT k1, k2, count()
    FROM numbers(100000000)
GROUP BY k1, k2
  FORMAT Null
SETTINGS collect_hash_table_stats_during_aggregation = CS

I encourage you not to compare the values from different columns (but only from adjacent rows), since they are from different runs.

<SIZE>_<CS> Current implementation If we did like before
255_0 1.5808916091918945 1.5261054039001465
255_1 1.6315653324127197 1.552426815032959
257_0 1.5843119621276855 1.5510237216949463
257_1 1.5761158466339111 1.555800199508667
511_0 1.7008326053619385 1.663393497467041
511_1 1.7077183723449707 1.6842105388641357
513_0 1.651308298110962 1.6336848735809326
513_1 1.6486756801605225 1.6197535991668701
1023_0 1.7353456020355225 1.6944849491119385
1023_1 1.8785557746887207 1.6881804466247559
1025_0 1.6594724655151367 1.6393749713897705
1025_1 1.6534018516540527 1.6477224826812744
2047_0 1.9330251216888428 1.8986682891845703
2047_1 1.944204568862915 1.9351270198822021
2049_0 1.7217907905578613 1.6737751960754395
2049_1 1.7120473384857178 1.685054063796997
4095_0 1.9070887565612793 1.882392168045044
4095_1 2.0191121101379395 1.8864517211914062
4097_0 1.8290636539459229 1.8191907405853271
4097_1 1.8279407024383545 1.7885808944702148
8191_0 1.9226725101470947 1.9338366985321045
8191_1 1.8923923969268799 1.8756773471832275
8193_0 1.902085542678833 1.8574156761169434
8193_1 1.8811531066894531 1.880134105682373
16383_0 2.0350911617279053 1.977320909500122
16383_1 2.0209827423095703 1.9731152057647705
16385_0 2.0228984355926514 1.933598518371582
16385_1 1.9846243858337402 1.9249930381774902
32767_0 2.1377968788146973 2.0998997688293457
32767_1 2.0927042961120605 2.088822841644287
32769_0 3.0240345001220703 2.3491389751434326
32769_1 2.7204763889312744 2.344632625579834
65535_0 3.0606114864349365 2.3248038291931152
65535_1 2.7485265731811523 2.2993004322052
65537_0 3.06917142868042 2.3265013694763184
65537_1 3.0379860401153564 2.350400447845459
131071_0 3.2479352951049805 2.5082695484161377
131071_1 3.236302137374878 2.424055576324463
131073_0 6.623970985412598 5.846493721008301
131073_1 4.761634588241577 5.839216470718384
262143_0 6.754836082458496 6.059891700744629
262143_1 4.791410207748413 5.989444255828857
262145_0 6.685712814331055 5.900468111038208
262145_1 6.6301140785217285 6.011543273925781
524287_0 6.935018301010132 6.260606050491333
524287_1 7.00163722038269 6.196463584899902
524289_0 8.178144693374634 7.380904197692871
524289_1 7.439335346221924 7.432044744491577

Without preallocation we start with a hash table of 2^8 elements and then make it 4 times bigger when the size becomes equal to 2^(k+1) + 1 for the next k. This process is asymmetric in the sense that, say, a query with 2^9+1 unique keys and a query with 2^10+1 keys will both end up with a HT of 2^12 elements. With preallocation we act more fairly, and a query with 2^9+1 keys will use a HT of 2^11 elements. Effectively, this lowers the number of cache misses: e.g. for 524289_0 vs 524289_1 I observed something like 35.5% vs 32% in perf stat, which is pretty much the same ratio as 8.2s vs 7.4s (this consideration ignores the clean processing time, but it seems to be sufficiently small to make this rough statement plausible).
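
To make the asymmetry concrete, here is a rough model of that growth behaviour (the starting size, the ×4 growth step and the 50% load-factor threshold are simplifying assumptions matching the description above, not the exact HashTable code):

#include <cstddef>
#include <cstdio>

/// Without a hint: start with 2^8 buckets and grow 4x once the table would be more than half full.
std::size_t finalSizeWithoutHint(std::size_t num_keys)
{
    std::size_t buckets = 1 << 8;
    for (std::size_t size = 0; size < num_keys; ++size)
    {
        if (size + 1 > buckets / 2)  /// the next insert would exceed the 50% load factor
            buckets *= 4;
    }
    return buckets;
}

/// With a hint: preallocate the smallest power of two that keeps the load factor at or below 50%.
std::size_t finalSizeWithHint(std::size_t predicted_keys)
{
    std::size_t buckets = 1 << 8;
    while (buckets / 2 < predicted_keys)
        buckets *= 2;
    return buckets;
}

int main()
{
    /// (1 << 9) + 1 keys: 4096 buckets when growing on the fly vs 2048 when preallocated.
    std::printf("%zu %zu\n", finalSizeWithoutHint((1 << 9) + 1), finalSizeWithHint((1 << 9) + 1));
    /// (1 << 10) + 1 keys: both approaches end up with 4096 buckets.
    std::printf("%zu %zu\n", finalSizeWithoutHint((1 << 10) + 1), finalSizeWithHint((1 << 10) + 1));
}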

For bigger sizes it also looks OK (here the benefit of not doing reallocations also kicks in):

<SIZE>_<CS> Current implementation
1048575_0 8.115857362747192
1048575_1 7.616235017776489
1048577_0 8.04812216758728
1048577_1 8.017758131027222
2097151_0 8.882615566253662
2097151_1 8.804941892623901
2097153_0 9.062370777130127
2097153_1 9.2320556640625
4194303_0 9.837358951568604
4194303_1 10.021861791610718
4194305_0 9.858987092971802
4194305_1 9.566895961761475
8388607_0 11.311129331588745
8388607_1 11.123503923416138
8388609_0 10.947338104248047
8388609_1 10.44078254699707

Other slowdowns look irrelevant.

@nickitat nickitat force-pushed the nickitat_cache_hash_table_sizes branch 2 times, most recently from 193e203 to 53ca06a Compare February 6, 2022 17:21
@nickitat
Member Author

nickitat commented Feb 6, 2022

Seems like something happened with perf tests in CI; I see the same picture in other recent PRs.

@alexey-milovidov
Member

There are errors in perf tests in master, @Avogar will fix.

@nickitat nickitat marked this pull request as ready for review February 7, 2022 23:17
@nickitat
Member Author

nickitat commented Feb 9, 2022

@alexey-milovidov Could you please suggest whom I can ask to review?

@vdimir vdimir self-assigned this Feb 10, 2022
@nickitat
Member Author

nickitat commented Feb 10, 2022

CI spotted a couple of small problems in tests. I will fix them soon (it shouldn't affect the implementation).

}

private:
CachePtr getHashTableStatsCache(const Params & params, [[maybe_unused]] std::lock_guard<std::mutex> & cache_lock)
Member

What is cache_lock passed for? Just so we don't forget to lock the mutex before calling it?

Member Author

@nickitat nickitat Feb 20, 2022

Yep, and it makes all operations with the cache sequential, which is also handy.
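
The pattern being discussed, sketched in isolation (class and member names here are illustrative, not the PR code): a private helper takes a std::lock_guard reference purely so that it can only be called while the mutex is held.

#include <memory>
#include <mutex>

struct Cache { /* ... */ };
using CachePtr = std::shared_ptr<Cache>;

class StatsHolder
{
public:
    CachePtr getOrCreateCache()
    {
        std::lock_guard lock(mutex);   /// the only place the mutex is locked
        return getCacheImpl(lock);     /// the held lock is passed along as evidence
    }

private:
    /// The reference parameter is never used; it only forces every caller to hold
    /// `mutex`, which also keeps all operations on the cache sequential.
    CachePtr getCacheImpl([[maybe_unused]] std::lock_guard<std::mutex> & cache_lock)
    {
        if (!cache)
            cache = std::make_shared<Cache>();
        return cache;
    }

    std::mutex mutex;
    CachePtr cache;
};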

, max_size_to_preallocate_for_aggregation(max_size_to_preallocate_for_aggregation_)
{
if (!select_query_)
throw Exception(ErrorCodes::LOGICAL_ERROR, "query ptr cannot be null");
Member

We could also check that it is actually DB::ASTSelectQuery, just in case.

Suggested change
throw Exception(ErrorCodes::LOGICAL_ERROR, "query ptr cannot be null");
throw Exception(ErrorCodes::LOGICAL_ERROR, "Query ptr cannot be null");

Member Author

it is being checked in select_query->as<DB::ASTSelectQuery &>()

@@ -237,8 +469,7 @@ class CompiledAggregateFunctionsHolder final : public CompiledExpressionCacheEnt

#endif

Aggregator::Aggregator(const Params & params_)
: params(params_)
Aggregator::Aggregator(const Params & params_) : params(params_), stats_updater(std::make_unique<StatisticsUpdater>(*this))
Member

Passing this from the ctor looks dangerous because the object is not fully constructed yet, but it seems in this case it's fine: we just store a reference to this somewhere.

https://isocpp.org/wiki/faq/ctors#using-this-in-ctors
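
A tiny illustration of why this is usually safe as long as no member of the half-constructed object is read (illustrative shape, not the PR code):

#include <memory>

class Aggregator;

class StatisticsUpdater
{
public:
    /// Only stores the reference; no member of the (possibly half-constructed)
    /// Aggregator is accessed here, which is what makes the pattern safe.
    explicit StatisticsUpdater(const Aggregator & aggregator_) : aggregator(aggregator_) {}

private:
    const Aggregator & aggregator;
};

class Aggregator
{
public:
    Aggregator() : stats_updater(std::make_unique<StatisticsUpdater>(*this)) {}

private:
    std::unique_ptr<StatisticsUpdater> stats_updater;
};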

Member

Also, maybe don't construct stats_updater at all if statistics are disabled, and just leave it nullptr?

Member Author

Also, maybe don't construct stats_updater at all if statistics are disabled, and just leave it nullptr?

done

const auto cache = getHashTableStatsCache(params, lock);
const auto hint = cache->get(params.key);
// We'll maintain the maximum among all the observed values until the next prediction turns out to be too wrong.
if (!hint || observed_size > *hint || observed_size < *hint / 2)
Member

It's a really minor thing, but I suppose it's easier to read it like this, because we immediately see the range where the condition is not applied: [hint/2, hint].

Suggested change
if (!hint || observed_size > *hint || observed_size < *hint / 2)
if (!hint || observed_size < *hint / 2 || *hint < observed_size)

Member Author

done

}
}

CachePtr getCache()
Member

It's used only for getStats, so maybe let's expose a getStats method here and return HashTablesCacheStatistics.

Member Author

done

@@ -912,29 +913,60 @@ class Aggregator final
bool compile_aggregate_expressions;
size_t min_count_to_compile_aggregate_expression;

struct StatsCollectingParams
{
StatsCollectingParams();
Member

What's the default constructor used for? I suppose we can provide more guarantees that we won't get an inconsistent StatsCollectingParams if we either:

  • remove the default ctor,
  • use key != 0 instead of collect_hash_table_stats_during_aggregation (it can't be collect_hash_table_stats_during_aggregation = true and key = 0 at the same time, can it? In that case collect_hash_table_stats_during_aggregation may be redundant),
  • make the fields const to be sure that we construct only valid params,
  • or even create StatsCollectingParams only when collect_hash_table_stats_during_aggregation = true, otherwise leave it nullptr.

Member Author

The default ctor is used in those calls where we don't pass any statistics-collection-related params.

use key != 0 instead of collect_hash_table_stats_during_aggregation

done
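
The resulting shape of the params, as a sketch based on this thread (field and method names follow the discussion, but this is not the verbatim code): a zero key means "do not collect or use statistics", so the separate boolean becomes redundant.

#include <cstddef>
#include <cstdint>

struct StatsCollectingParams
{
    /// Default ctor: used by callers that don't pass any statistics-related settings;
    /// key == 0 then means that collection and use of statistics are disabled.
    StatsCollectingParams() = default;

    StatsCollectingParams(std::uint64_t key_, std::size_t max_entries_, std::size_t max_size_to_preallocate_)
        : key(key_)
        , max_entries_for_hash_table_stats(max_entries_)
        , max_size_to_preallocate_for_aggregation(max_size_to_preallocate_)
    {}

    bool isCollectionAndUseEnabled() const { return key != 0; }

    std::uint64_t key = 0;
    std::size_t max_entries_for_hash_table_stats = 0;
    std::size_t max_size_to_preallocate_for_aggregation = 0;
};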

@nickitat nickitat force-pushed the nickitat_cache_hash_table_sizes branch 2 times, most recently from c89daed to 50bfcc7 Compare February 20, 2022 10:51
@nickitat
Member Author

FunctionalStatelessTestFlakyCheck failed because of a timeout: 300 long tests didn't fit in the timeout. But there were no failures, so I think we can ignore that.

@nickitat nickitat requested a review from vdimir February 21, 2022 14:14
@nickitat nickitat marked this pull request as draft February 22, 2022 22:41
@nickitat
Member Author

Actually, eliminating resizes in the ordinary merging phase seems to give another ~10% for numbers_mt(10M) and ~20% for numbers_mt(50M) (i.e. for aggregation by high-cardinality columns).

I want to work on it a little.

@nickitat nickitat force-pushed the nickitat_cache_hash_table_sizes branch from 96477da to a6b7f9e Compare February 26, 2022 22:22
@nickitat nickitat force-pushed the nickitat_cache_hash_table_sizes branch from 725b2cf to df0dfa9 Compare March 20, 2022 23:41
@nickitat nickitat force-pushed the nickitat_cache_hash_table_sizes branch from df0dfa9 to 6cf6f75 Compare March 21, 2022 12:53
@nickitat nickitat marked this pull request as ready for review March 21, 2022 19:52
Member

@vdimir vdimir left a comment

LGTM, but maybe @kitaisreal wants to take another look.

Comment on lines +64 to +65
if (!params.isCollectionAndUseEnabled())
throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "Collection and use of the statistics should be enabled.");
Member

Maybe just return null and don't check isCollectionAndUseEnabled on the caller side?

Member Author

I prefer straightforward code )

for (size_t i = 0; i < data_variants.size(); ++i)
sizes[i] = data_variants[i]->size();
const auto median_size = sizes.begin() + sizes.size() / 2; // not precisely though...
std::nth_element(sizes.begin(), median_size, sizes.end());
Member

Why do we use the median (and not, let's say, the max)? Actually, I'm not familiar with AggregatedDataVariants and didn't get why we have many data_variants; could you explain it to me?

Member Author

There is usually a separate variant for each thread of aggregation; this is a relevant place in the code.
The median seems to be the correct choice from statistical considerations, and experiments also agree with that.

Collaborator

@kitaisreal kitaisreal left a comment

Looks good; needs minor clarifications and a check of the performance tests, and then it is ready to be merged.

@@ -284,6 +284,9 @@
\
M(MainConfigLoads, "Number of times the main configuration was reloaded.") \
\
M(HashTablesPreallocatedElements, "How many elements were preallocated in hash tables for aggregation.") \
M(HashTablesInitedAsTwoLevel, "How many hash tables were inited as two-level for aggregation.") \
Collaborator

We need to rename the event to something related to aggregation, and also "Inited" looks strange; maybe AggregationHashTablesInitializedAsTwoLevel, although I am not sure this event makes sense.

Member Author

renamed

@@ -284,6 +284,9 @@
\
M(MainConfigLoads, "Number of times the main configuration was reloaded.") \
\
M(HashTablesPreallocatedElements, "How many elements were preallocated in hash tables for aggregation.") \
Collaborator

Same, we need to change the name so it's clear that this is related to aggregation. Also, could you explain how this event could help the client?

Member Author

It is mostly for testing, and could be useful in debugging since all the added traces are of the 'debug' level.

throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "Collection and use of the statistics should be enabled.");

std::lock_guard lock(mutex);
const auto cache = getHashTableStatsCache(params, lock);
Collaborator

The mutex could be used only to protect getting or initializing the cache; after that, LRUCache guarantees thread safety.

Member Author

That's true, but it is nice to have operations with the cache sequential.

@kitaisreal
Collaborator

Performance tests look good. We can additionally investigate why preallocating does not work well for StringHashTable in a separate pull request.

@kitaisreal kitaisreal self-assigned this Mar 22, 2022
@nickitat
Member Author

nickitat commented Mar 22, 2022

Performance tests look good. We can additionally investigate why preallocating does not work well for StringHashTable in a separate pull request.

My understanding is that it is because a StringHT is 4 HTs under the hood for different size classes, and unless the distribution of input strings between size classes is uniform, preallocation will not make much difference.

@nickitat
Member Author

nickitat commented Mar 22, 2022

Perf tests in CI: 1, 2, 3.

Measurements done on my dev-machine
Query Old median New median
SELECT number FROM numbers(5000000) GROUP BY number FORMAT Null 0.6044783592224121 0.4008183479309082
SELECT number FROM numbers(10000000) GROUP BY number FORMAT Null 1.2764267921447754 0.8201110363006592
SELECT number FROM numbers_mt(500000) GROUP BY number FORMAT Null 0.021776676177978516 0.00978708267211914
SELECT number FROM numbers_mt(1000000) GROUP BY number FORMAT Null 0.02394711971282959 0.013240933418273926
SELECT number FROM numbers_mt(10000000) GROUP BY number FORMAT Null 0.1833176612854004 0.11336398124694824
SELECT number FROM numbers_mt(50000000) GROUP BY number FORMAT Null 0.7955801486968994 0.5713565349578857
WITH number % 524289 AS k, toUInt64(k) AS k1, k1 + 1 AS k2 SELECT k1, k2, count() FROM numbers(100000000) GROUP BY k1, k2 FORMAT Null 4.682622194290161 3.3435096740722656
SELECT number FROM numbers_mt(10000000) GROUP BY number FORMAT Null SETTINGS group_by_two_level_threshold = 1e12, group_by_two_level_threshold_bytes = 1e12 1.665412187576294 1.4745798110961914
SELECT number FROM numbers_mt(50000000) GROUP BY number FORMAT Null SETTINGS group_by_two_level_threshold = 1e12, group_by_two_level_threshold_bytes = 1e12 8.194661140441895 6.724031209945679
SELECT WatchID FROM hits_10m_single GROUP BY WatchID FORMAT Null 0.19792938232421875 0.13988232612609863
SELECT WatchID FROM hits_100m_single GROUP BY WatchID FORMAT Null 1.583496332168579 1.445692539215088
SELECT ClientIP AS x, x - 1, x - 2, x - 3, count() AS c FROM hits_10m_single GROUP BY x, x - 1, x - 2, x - 3 ORDER BY c DESC LIMIT 10 0.09448504447937012 0.09355950355529785
SELECT ClientIP AS x, x - 1, x - 2, x - 3, count() AS c FROM hits_100m_single GROUP BY x, x - 1, x - 2, x - 3 ORDER BY c DESC LIMIT 10 0.7526161670684814 0.7189710140228271
SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM hits_10m_single WHERE SearchPhrase != '' GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10 0.09185647964477539 0.08975696563720703
SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM hits_100m_single WHERE SearchPhrase != '' GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10 0.5960164070129395 0.5273172855377197
SELECT min(MobilePhoneModel) FROM hits_10m_single WHERE MobilePhoneModel != '' GROUP BY intHash32(UserID) % 1000000 FORMAT Null 0.03293633460998535 0.021028995513916016
SELECT min(MobilePhoneModel) FROM hits_100m_single WHERE MobilePhoneModel != '' GROUP BY intHash32(UserID) % 1000000 FORMAT Null 0.1293332576751709 0.1220090389251709

For example, for SELECT number FROM numbers_mt(50000000) GROUP BY number the difference here is much bigger than in CI; it is likely because of the different number of threads used (it was 16 on my machine).

Also an interesting side-effect.

@alexey-milovidov
Member

Performance numbers are really great 👍

@@ -0,0 +1,96 @@
#!/usr/bin/env bash
Collaborator

@nickitat, this test needs to be fixed.

@nickitat
Member Author

Failures look irrelevant, but let's rerun just in case.

@nickitat
Member Author

@Mergifyio update

@mergify
Contributor

mergify bot commented Mar 30, 2022

update

✅ Branch has been successfully updated

@nickitat nickitat merged commit 30f2a94 into ClickHouse:master Mar 30, 2022
@nickitat nickitat linked an issue Mar 30, 2022 that may be closed by this pull request
if (worthConvertToTwoLevel(
params.group_by_two_level_threshold,
hint->sum_of_sizes,
/*group_by_two_level_threshold_bytes*/ 0,
Contributor

Does it mean two-level aggregation will never be used? Why?

Member Author

it will be activated if sum_of_sizes >= params.group_by_two_level_threshold
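
In other words, a zero threshold simply disables that particular criterion. A simplified sketch of how such a check looks (not necessarily the verbatim ClickHouse function):

#include <cstddef>

/// Each threshold is ignored when it is zero, so passing 0 for the bytes threshold
/// only disables the size-in-bytes criterion, not two-level aggregation as a whole.
bool worthConvertToTwoLevel(
    std::size_t group_by_two_level_threshold,
    std::size_t result_size,
    std::size_t group_by_two_level_threshold_bytes,
    std::size_t result_size_bytes)
{
    return (group_by_two_level_threshold && result_size >= group_by_two_level_threshold)
        || (group_by_two_level_threshold_bytes && result_size_bytes >= group_by_two_level_threshold_bytes);
}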

@Nathaniel-Han

Hi guys @nickitat, have you thought about applying a similar strategy to the hash join? I am wondering about the improvement it could bring to the hash join.

@alexey-milovidov
Member

@Nathaniel-Han it should help and is worth trying,
but it will help to a lesser extent, because more time is spent moving data between blocks than resizing the hash table.
