
Fix and improve hash token distribution algorithm #145

Closed
wants to merge 1 commit

Conversation

jasonmp85 (Collaborator)

There are 2^32 distinct "hash tokens" in our hash space, but we were using UINT32_MAX (2^32 - 1) in the code instead. Because of this, shard counts one might expect to divide the space evenly (such as 16, 32, or 256) produced shards with fewer tokens than they should have had: with 16 shards, for example, the increment came out to floor((2^32 - 1) / 16) = 268435455 rather than 268435456. The leftover tokens were all stuffed into the last shard, causing uneven load.

Though fixing the UINT32_MAX bug solves the above case, it still doesn't deal with the remainder, which can be as large as `shardCount - 1`. We could continue stuffing it into the top shard, but I find it nicer to have all shard sizes within one token of one another.

We previously divided the hash token count by the shard count to get a "hash token increment" and added that increment on each iteration, so each shard's start token is effectively shardIndex * (hashCount / shardCount), and the truncation error from the inner integer division gets multiplied by shardIndex. By regrouping as (shardIndex * hashCount) / shardCount, the truncation happens after the multiplication, the remainder is spread across the shards, and we get "nice" shards, as the sketch below shows.
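
To make the difference concrete, here is a minimal standalone sketch, not pg_shard's actual code: the names and the 7-shard example are illustrative, and tokens are shown in [0, 2^32) rather than the signed int32 range pg_shard really uses.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 64-bit arithmetic throughout: 2^32 doesn't fit in 32 bits, and the
     * new grouping multiplies before dividing, which would overflow. */
    const int64_t hashTokenCount = INT64_C(4294967296); /* 2^32 */
    const int64_t shardCount = 7; /* 2^32 mod 7 = 4, so there's a remainder */

    for (int64_t shardIndex = 0; shardIndex < shardCount; shardIndex++)
    {
        /* old grouping: the division truncates first, so every shard spans
         * floor(2^32 / shardCount) tokens and the remainder piles up at
         * the top of the hash space */
        int64_t oldStart = shardIndex * (hashTokenCount / shardCount);

        /* new grouping: multiply first, divide last; the truncation is
         * interleaved across shards, so sizes differ by at most one token */
        int64_t newStart = (shardIndex * hashTokenCount) / shardCount;
        int64_t newEnd = ((shardIndex + 1) * hashTokenCount) / shardCount - 1;

        printf("shard %lld: old start %lld, new range [%lld, %lld], size %lld\n",
               (long long) shardIndex, (long long) oldStart,
               (long long) newStart, (long long) newEnd,
               (long long) (newEnd - newStart + 1));
    }

    return 0;
}
```

With 7 shards the old grouping gives six shards of 613566756 tokens and a last shard of 613566760 (the 4 leftover tokens stuffed in), while the new grouping produces a mix of 613566756 and 613566757, all within one token of each other.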

\set VERBOSITY default
-- pg_shard ensures all shards are roughly the same size
SELECT max_value::integer-min_value::integer AS shard_size
jasonmp85 (Collaborator, Author)


Compare the output of this query to that in #146 to see the difference between the two algorithms.

Member


Should we add whitespace: `max_value::integer-min_value::integer AS shard_size` =>
`max_value::integer - min_value::integer AS shard_size`

@jasonmp85 (Collaborator, Author)

Closing in favor of #146.

@jasonmp85 jasonmp85 closed this Oct 5, 2015
@jasonmp85 jasonmp85 deleted the better_hash_token_distribution branch October 5, 2015 07:00