
Fix and improve hash token distribution algorithm #145

Closed
wants to merge 1 commit

Conversation

jasonmp85 (Collaborator)

There are 2^32 distinct "hash tokens" in our hash space, but we were using UINT32_MAX (2^32 - 1) in the code instead. Because of this, shard counts one might expect to divide the space evenly (such as 16, 32, or 256) produced shards with fewer tokens than they should have had: with 16 shards, for example, the increment came out to floor((2^32 - 1) / 16) = 268435455 rather than 268435456. The leftover tokens were all stuffed into the last shard, causing uneven load.

Though fixing the UINT32_MAX bug solves the above case, it still doesn't deal with the remainder, which can be as large as `shardCount - 1`. We could continue stuffing it into the top shard, but I find it nicer to have all shard sizes within one token of one another.

We previously divided the hash token count by the shard count to get a "hash token increment" and added that increment on each iteration, so each shard's start token is effectively shardIndex * (hashCount / shardCount), and the truncation error from the inner integer division gets multiplied by shardIndex. By regrouping as (shardIndex * hashCount) / shardCount, the truncation happens after the multiplication, the remainder is spread across the shards, and we get "nice" shards, as the sketch below shows.
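
To make the difference concrete, here is a minimal standalone sketch, not pg_shard's actual code: the names and the 7-shard example are illustrative, and tokens are shown in [0, 2^32) rather than the signed int32 range pg_shard really uses.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 64-bit arithmetic throughout: 2^32 doesn't fit in 32 bits, and the
     * new grouping multiplies before dividing, which would overflow. */
    const int64_t hashTokenCount = INT64_C(4294967296); /* 2^32 */
    const int64_t shardCount = 7; /* 2^32 mod 7 = 4, so there's a remainder */

    for (int64_t shardIndex = 0; shardIndex < shardCount; shardIndex++)
    {
        /* old grouping: the division truncates first, so every shard spans
         * floor(2^32 / shardCount) tokens and the remainder piles up at
         * the top of the hash space */
        int64_t oldStart = shardIndex * (hashTokenCount / shardCount);

        /* new grouping: multiply first, divide last; the truncation is
         * interleaved across shards, so sizes differ by at most one token */
        int64_t newStart = (shardIndex * hashTokenCount) / shardCount;
        int64_t newEnd = ((shardIndex + 1) * hashTokenCount) / shardCount - 1;

        printf("shard %lld: old start %lld, new range [%lld, %lld], size %lld\n",
               (long long) shardIndex, (long long) oldStart,
               (long long) newStart, (long long) newEnd,
               (long long) (newEnd - newStart + 1));
    }

    return 0;
}
```

With 7 shards the old grouping gives six shards of 613566756 tokens and a last shard of 613566760 (the 4 leftover tokens stuffed in), while the new grouping produces a mix of 613566756 and 613566757, all within one token of each other.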

\set VERBOSITY default
-- pg_shard ensures all shards are roughly the same size
SELECT max_value::integer-min_value::integer AS shard_size
jasonmp85 (Collaborator, Author)


Compare the output of this query to that in #146 to see the difference between the two algorithms.

Member


Should we add whitespace: `max_value::integer-min_value::integer AS shard_size` =>
`max_value::integer - min_value::integer AS shard_size`

@jasonmp85 (Collaborator, Author)

Closing in favor of #146.

@jasonmp85 jasonmp85 closed this Oct 5, 2015
@jasonmp85 jasonmp85 deleted the better_hash_token_distribution branch October 5, 2015 07:00