Data skew in low-concurrency scenarios #44

Open
robertsong9972 opened this issue Apr 15, 2022 · 4 comments

@robertsong9972

Data skew in low-concurrency scenarios

According to the source code, the step is reset to 0 whenever the timestamp advances. This causes a problem when the generated id is used as the sharding key in a sharding strategy.
The affected code:

	if now == n.time {
		n.step = (n.step + 1) & n.stepMask

		// the step overflowed within the same millisecond: spin until the next one
		if n.step == 0 {
			for now <= n.time {
				now = time.Since(n.epoch).Nanoseconds() / 1000000
			}
		}
	} else {
		// the timestamp advanced, so the step is reset to 0
		n.step = 0
	}
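
For context, with the library's default bit sizes (10 node bits, 12 step bits) the step field sits in the lowest 12 bits of every id:

	// Default bwmarrin/snowflake layout:
	//   id = time<<22 | node<<12 | step
	// At QPS well below 1000 every call lands in a new millisecond,
	// so the step is always 0 and the low 12 bits of every id are 0.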

Consider the following circumstance:

Scenario

  1. Order data is sharded across tables order_tab_00000000 through order_tab_00001000;
  2. An API for users to create new orders: create_order;
  3. The API uses the logic above to generate the order id, and its QPS is less than 10.

Impact

  1. After taking the generated id modulo the number of tables, every result is less than 10, so almost all records end up in order_tab_00000000 through order_tab_00000009, which causes data skew (see the sketch below).
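
A minimal sketch of the effect, using the bwmarrin/snowflake package with its default 12 step bits; the shard key (id % 1024) and the power-of-two table count are illustrative choices that make the skew easy to see, not the exact setup above:

	package main

	import (
		"fmt"
		"time"

		"github.com/bwmarrin/snowflake"
	)

	func main() {
		node, err := snowflake.NewNode(1)
		if err != nil {
			panic(err)
		}

		// Generate IDs slowly, well under one per millisecond (low QPS).
		buckets := map[int64]int{}
		for i := 0; i < 20; i++ {
			id := node.Generate().Int64()
			buckets[id%1024]++ // hypothetical shard key: id mod table count
			time.Sleep(5 * time.Millisecond)
		}

		// Every id was generated in a fresh millisecond, so its step field
		// (the low 12 bits) is 0 and id%1024 is always 0: prints map[0:20].
		fmt.Println(buckets)
	}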

Improvement

  1. Make the step advance round by round instead of resetting it to 0;

Code that could be used as a reference:

	// Proposed Generate: keep incrementing the step even when the
	// millisecond changes, instead of resetting it to 0.
	func (n *Node) Generate() ID {
		n.mu.Lock()
		now := time.Since(n.epoch).Nanoseconds() / 1000000
		// Always advance the step; it wraps to 0 only when it overflows the mask.
		n.step = (n.step + 1) & n.stepMask
		// If the step wrapped within the same millisecond, wait for the next one.
		if now == n.time && n.step == 0 {
			for now <= n.time {
				now = time.Since(n.epoch).Nanoseconds() / 1000000
			}
		}
		n.time = now
		r := ID((now)<<n.timeShift |
			(n.node << n.nodeShift) |
			(n.step),
		)
		n.mu.Unlock()
		return r
	}
@patchthecode

Does this mean this lib is prone to errors?

@bwmarrin
Owner

@robertsong9972

Thanks for reporting an issue :)

However, I'm struggling to follow what you're trying to explain, how the current method is causing a problem, and why it should be changed :( I'm not sure if there's any more explanation you can provide, or if there's someone else who is familiar with the topic you're discussing and can help clarify the issue.

Are you saying the library can generate non-unique IDs with low concurrency? Or are you saying that the way the IDs would sort would be a problem? What does "Make the step round by round" mean?

"After calculating with mod all of the result of id generated will be less than 10, which means most of the records will be stored into order_tab_00000000-order_tab_00000009, it caused data skew;"

You're doing something to truncate the numbers or reduce them? None of the IDs would be less than 10 normally.

I'm also not really clear on the importance of the "order_tab_00000000-order_tab_00000009" tables.

I'm just all around confused, sorry. :(

@bwmarrin
Owner

@patchthecode - I'm not sure :(. Though I've not heard any other reports of the library not generating properly unique IDs. If you run into any issues or if you can help explain the above concern I'd be happy to look into it.

@bwmarrin
Owner

I think I'm putting together what's being asked here, maybe.

@robertsong9972

Are you talking about data sharding via MongoDB (or maybe some other DB)?

By "order_tab_00000000-order_tab_00001000" are you saying you have 1001 order tables?

Are you taking the modulus of the snowflake ID and 1001 (the number of tables you have) to try and get a random number between 0 and 1000?

If you're dividing by 1001 I think it should work, but if you happen to be using 1000 I could see how that would return 0, since the last four digits of a snowflake using 12 step bits would be 0000 and the entire number would divide into 1000 leaving no remainder. But if you divide by 1001 that shouldn't be an issue. Another option would be to change the step bits to a smaller value.
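
Something like this for that last option (the exact values are just illustrative); NodeBits and StepBits are package-level variables and need to be set before the first node is created:

	// Shrink the step field so fewer low bits sit at zero under light load.
	// Set these before creating any nodes.
	snowflake.NodeBits = 10 // default
	snowflake.StepBits = 8  // illustrative: up to 256 IDs per node per millisecond
	node, err := snowflake.NewNode(1)
	if err != nil {
		panic(err)
	}
	_ = node // use node.Generate() as usual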

By "round by round" do you mean just allow the step count to continue to count up and only reset to zero once it overflows (hits the max number it could be) ? If we did that the library performance under heavy load would be impacted negatively and since there's other ways to solve the problem I'm assuming you're working on I'd rather not do that.

Finally, would it work to just use a rand(1000)-style call to get a random number from 0 to 1000 and tell the DB which shard to put the data into?
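
Something along these lines (illustrative; it assumes the shard doesn't need to be derived from the ID again later):

	// Pick the shard at random instead of deriving it from the snowflake ID.
	// Uses math/rand and fmt; the table-name format matches the scenario above.
	shard := rand.Intn(1001) // uniform over 0..1000
	table := fmt.Sprintf("order_tab_%08d", shard)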

If I'm interpreting your issue wrong - please let me know! Hopefully this helps some, I know it's a late attempt to help.
