sql: non-uniform hash-sharded index with power-of-2 number of buckets #91109

michae2 · 2022-11-02T00:11:08Z

Looks like the hash bucket calculation mod(fnv32(crdb_internal.datums_to_bytes(x, y)), z) is poorly distributed when x and y are INT8 columns with equal values, and z is a power of 2. For example:

CREATE TABLE ijk (i, j, k) AS SELECT x, x, x FROM generate_series(0, 1048575) s (x);
CREATE INDEX ON ijk (i) USING HASH WITH BUCKET_COUNT = 8;
CREATE INDEX ON ijk (i, j) USING HASH WITH BUCKET_COUNT = 8;
CREATE INDEX ON ijk (i, j, k) USING HASH WITH BUCKET_COUNT = 8;
SELECT crdb_internal_i_shard_8, COUNT(crdb_internal_i_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;
SELECT crdb_internal_i_j_shard_8, COUNT(crdb_internal_i_j_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;
SELECT crdb_internal_i_j_k_shard_8, COUNT(crdb_internal_i_j_k_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;

Gives:

demo@127.0.0.1:26257/defaultdb> SELECT crdb_internal_i_shard_8, COUNT(crdb_internal_i_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;
  crdb_internal_i_shard_8 | count
--------------------------+---------
                        0 | 131071
                        1 | 131071
                        2 | 131072
                        3 | 131072
                        4 | 131073
                        5 | 131073
                        6 | 131072
                        7 | 131072
(8 rows)


Time: 770ms total (execution 770ms / network 0ms)

demo@127.0.0.1:26257/defaultdb> SELECT crdb_internal_i_j_shard_8, COUNT(crdb_internal_i_j_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;
  crdb_internal_i_j_shard_8 | count
----------------------------+---------
                          1 | 376905
                          3 |  16348
                          5 | 638976
                          7 |  16347
(4 rows)


Time: 836ms total (execution 835ms / network 0ms)

demo@127.0.0.1:26257/defaultdb> SELECT crdb_internal_i_j_k_shard_8, COUNT(crdb_internal_i_j_k_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;
  crdb_internal_i_j_k_shard_8 | count
------------------------------+---------
                            0 | 131073
                            1 | 131071
                            2 | 131071
                            3 | 131072
                            4 | 131072
                            5 | 131073
                            6 | 131072
                            7 | 131072
(8 rows)


Time: 976ms total (execution 976ms / network 0ms)

The hash-sharded index is still unevenly distributed with 16 buckets:

demo@127.0.0.1:26257/defaultdb> CREATE INDEX ij16 ON ijk (i, j) USING HASH WITH BUCKET_COUNT = 16;
CREATE INDEX


Time: 5.629s total (execution 5.629s / network 0.000s)

demo@127.0.0.1:26257/defaultdb> SELECT crdb_internal_i_j_shard_16, COUNT(crdb_internal_i_j_shard_16) FROM ijk GROUP BY 1 ORDER BY 1;
  crdb_internal_i_j_shard_16 | count
-----------------------------+---------
                           1 | 184357
                           3 |  12224
                           5 | 487461
                           7 |   8174
                           9 | 192548
                          11 |   4124
                          13 | 151515
                          15 |   8173
(8 rows)


Time: 819ms total (execution 819ms / network 0ms)

With 14 it's evenly distributed across half the buckets:

demo@127.0.0.1:26257/defaultdb> CREATE INDEX ij14 ON ijk (i, j) USING HASH WITH BUCKET_COUNT = 14;
CREATE INDEX


Time: 6.913s total (execution 6.912s / network 0.000s)

demo@127.0.0.1:26257/defaultdb> SELECT crdb_internal_i_j_shard_14, COUNT(crdb_internal_i_j_shard_14) FROM ijk GROUP BY 1 ORDER BY 1;
  crdb_internal_i_j_shard_14 | count
-----------------------------+---------
                           1 | 149778
                           3 | 150045
                           5 | 149398
                           7 | 149531
                           9 | 150521
                          11 | 149379
                          13 | 149924
(7 rows)


Time: 846ms total (execution 846ms / network 0ms)

With a prime number of buckets it's fine:

demo@127.0.0.1:26257/defaultdb> CREATE INDEX ij17 ON ijk (i, j) USING HASH WITH BUCKET_COUNT = 17;
CREATE INDEX


Time: 6.533s total (execution 6.532s / network 0.000s)

demo@127.0.0.1:26257/defaultdb> SELECT crdb_internal_i_j_shard_17, COUNT(crdb_internal_i_j_shard_17) FROM ijk GROUP BY 1 ORDER BY 1;
  crdb_internal_i_j_shard_17 | count
-----------------------------+--------
                           0 | 61416
                           1 | 61852
                           2 | 61537
                           3 | 61534
                           4 | 61605
                           5 | 62347
                           6 | 61745
                           7 | 61611
                           8 | 61527
                           9 | 61494
                          10 | 62198
                          11 | 61752
                          12 | 61749
                          13 | 61353
                          14 | 61502
                          15 | 61921
                          16 | 61433
(17 rows)


Time: 849ms total (execution 849ms / network 0ms)

This example is (admittedly) contrived, but there could be other cases like this. Maybe we should default to creating a prime number of buckets?

Jira issue: CRDB-21107

Epic CRDB-27601

The text was updated successfully, but these errors were encountered:

We've discovered a case where non-random data and a power-of-2 number of buckets cause an uneven distribution of data among buckets in a hash- sharded index. There's an old trick of using a prime number of buckets in hash tables to avoid this sort of unfortunate clustering. Let's use that. Fixes: cockroachdb#91109 Epic: None Release note (performance improvement): We now recommend using a prime number of buckets in hash-sharded indexes to increase the chance of an even distribution of data among buckets. The default value of `sql.defaults.default_hash_sharded_index_bucket_count` has been changed to a prime number.

ajwerner · 2022-11-02T17:30:07Z

From a prescient @RaduBerinde

What feels wrong is that we do mod 8, so we only look at the last few bits of the fnv32, which is not great for a non-cryptographic hash. Using a prime number for the buckets would fix that. I think we could also mod against a bigger prime number before modding down to 8, or shuffle the bits somehow.

I think this is a good idea. The question is how to pick a prime that's larger than the bucket count to mod with first. If we had a limit on the number of buckets, which I'm sure we should have but I'm less sure we do have, we could just do the smallest prime larger than that limit.

We've discovered a case where a combination of non-random data and a power-of-two number of buckets causes an uneven distribution of rows in a hash-sharded index. While this specific case is very contrived, it illustrates a small weakness in the current hash-shard calculation: modulo by a power-of-two number of buckets only uses the last few bits of the hash value. Radu suggested this would be a problem in cockroachdb#67865 and also suggested a fix: add an intermediate modulo by a larger prime before modulo by num buckets, so we'll try that. Fixes: cockroachdb#91109 Epic: None Release note (performance improvement): This change updates the shard calculation of newly-created hash-sharded indexes so that uneven distributions of rows are less likely.

ajwerner · 2022-11-02T22:33:58Z

The thing that's weird about your input is that it has a lot of repetition of bytes because the values are all the same for every tuple. I don't know how often that comes up. If you did this:

CREATE TABLE ijk (i, j, k) AS SELECT * FROM generate_series(0, 100), generate_series(0, 100), generate_series(0, 100) ;
CREATE INDEX ON ijk (i) USING HASH WITH BUCKET_COUNT = 8;
CREATE INDEX ON ijk (i, j) USING HASH WITH BUCKET_COUNT = 8;
CREATE INDEX ON ijk (i, j, k) USING HASH WITH BUCKET_COUNT = 8;
SELECT crdb_internal_i_shard_8, COUNT(crdb_internal_i_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;
SELECT crdb_internal_i_j_shard_8, COUNT(crdb_internal_i_j_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;
SELECT crdb_internal_i_j_k_shard_8, COUNT(crdb_internal_i_j_k_shard_8) FROM ijk GROUP BY 1 ORDER BY 1;

Then you get a perfect distribution.

ajwerner · 2022-11-08T15:20:02Z

Deep down, I think what's going on here is that the bitstring is the same if the column values are the same. This is not a real problem in practice. I'm going to leave this around as documentation but am closing as won't fix.

michae2 · 2023-07-21T23:01:57Z

@odessit55 found another case using DECIMAL:

CREATE TABLE products (
  ts DECIMAL NOT NULL PRIMARY KEY USING HASH WITH (bucket_count=16),
  product_id INT8
);

INSERT INTO products SELECT generate_series(0, 1023), 101;

Which shows a similar problem:

demo@127.0.0.1:26257/demoapp/defaultdb> SELECT crdb_internal_ts_shard_16 AS shard, COUNT(*) FROM products GROUP BY 1 ORDER BY 1;
  shard | count
--------+--------
      0 |     1
      1 |   127
      2 |     1
      3 |   126
      4 |     1
      5 |   126
      6 |     1
      7 |   127
      8 |     3
      9 |   126
     10 |     2
     11 |   128
     12 |     1
     13 |   127
     14 |     1
     15 |   126
(16 rows)

Time: 3ms total (execution 3ms / network 0ms)

Even if we expect real-world data to be better distributed, it doesn't look great when simple user tests result in non-uniform distributions.

rafiss · 2023-08-22T18:47:20Z

@michae2 in that most recent example, the bucket count was specified manually, so it seems this issue is going beyond just changing the default bucket count. Do you think we should change the hashing function?

michae2 · 2023-08-22T20:05:33Z

@michae2 in that most recent example, the bucket count was specified manually, so it seems this issue is going beyond just changing the default bucket count. Do you think we should change the hashing function?

Yes, I think we should.

rafiss · 2023-08-23T21:32:23Z

One idea is to use mod(fnv32(md5(crdb_internal.datums_to_bytes(columns))), bucket_count). In that example:

> create table t(a decimal primary key, b int as (mod(fnv32(md5(crdb_internal.datums_to_bytes(a))), 16)) virtual);

> INSERT INTO t SELECT generate_series(0, 1023);

> SELECT b AS shard, COUNT(*) FROM t GROUP BY 1 ORDER BY 1;
  shard | count
--------+--------
      0 |    51
      1 |    54
      2 |    65
      3 |    68
      4 |    52
      5 |    67
      6 |    58
      7 |    67
      8 |    74
      9 |    73
     10 |    73
     11 |    56
     12 |    72
     13 |    63
     14 |    64
     15 |    67
(16 rows)

michae2 added C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-schema-deprecated Use T-sql-foundations instead T-sql-queries SQL Queries Team labels Nov 2, 2022

blathers-crl bot added this to Triage in SQL Foundations Nov 2, 2022

blathers-crl bot added this to Triage in SQL Queries Nov 2, 2022

michae2 added the E-quick-win Likely to be a quick win for someone experienced. label Nov 2, 2022

This was referenced Nov 2, 2022

the performance of hash sharded index insert not good #90749

Closed

sql: use prime number of buckets by default for hash-sharded indexes #91140

Closed

michae2 mentioned this issue Nov 2, 2022

sql: modulo by prime in shard calculation for hash-sharded indexes #91155

Closed

postamar moved this from Triage to Cold storage in SQL Foundations Nov 8, 2022

ajwerner closed this as completed Nov 8, 2022

SQL Queries automation moved this from Triage to Done Nov 8, 2022

SQL Foundations automation moved this from Cold storage to Done Nov 8, 2022

exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Nov 8, 2022

exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023

michae2 reopened this Jul 21, 2023

SQL Foundations automation moved this from Done to Triage Jul 21, 2023

mgartner moved this from Done to New Backlog in SQL Queries Jul 21, 2023

michae2 changed the title ~~sql: non-uniform hash-sharded index on two equal INT8 columns with power-of-2 number of buckets~~ sql: non-uniform hash-sharded index with power-of-2 number of buckets Jul 25, 2023

michae2 added A-hash-sharding Hash-sharded indexes and removed E-quick-win Likely to be a quick win for someone experienced. labels Jul 25, 2023

rafiss self-assigned this Aug 23, 2023

rafiss mentioned this issue Aug 23, 2023

catalog: change hash-sharded indexes to use md5 #109374

Merged

craig bot closed this as completed in 9d3115d Sep 8, 2023

SQL Foundations automation moved this from Triage to Done Sep 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: non-uniform hash-sharded index with power-of-2 number of buckets #91109

sql: non-uniform hash-sharded index with power-of-2 number of buckets #91109

michae2 commented Nov 2, 2022 •

edited by exalate-issue-sync bot

ajwerner commented Nov 2, 2022

ajwerner commented Nov 2, 2022

ajwerner commented Nov 8, 2022

michae2 commented Jul 21, 2023

rafiss commented Aug 22, 2023

michae2 commented Aug 22, 2023

rafiss commented Aug 23, 2023

sql: non-uniform hash-sharded index with power-of-2 number of buckets #91109

sql: non-uniform hash-sharded index with power-of-2 number of buckets #91109

Comments

michae2 commented Nov 2, 2022 • edited by exalate-issue-sync bot

ajwerner commented Nov 2, 2022

ajwerner commented Nov 2, 2022

ajwerner commented Nov 8, 2022

michae2 commented Jul 21, 2023

rafiss commented Aug 22, 2023

michae2 commented Aug 22, 2023

rafiss commented Aug 23, 2023

michae2 commented Nov 2, 2022 •

edited by exalate-issue-sync bot