Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uniqueness broken since 2.2.9? #153

Open
robhogan opened this issue Feb 19, 2020 · 21 comments
Open

Uniqueness broken since 2.2.9? #153

robhogan opened this issue Feb 19, 2020 · 21 comments

Comments

@robhogan
Copy link

robhogan commented Feb 19, 2020

Hi,

From https://github.com/dylang/shortid/blob/master/README.md#shortid---

Can generate any number of ids without duplicates, even millions per day.

I've been trying to figure out how shortid guarantees uniqueness, and I've come to the conclusion that in fact it doesn't at all (any more).

The way this module worked as at https://github.com/dylang/shortid/tree/2.2.8/lib is that it encoded a number of variables (version, worker, seconds offset, counter), mixing in some randomness so that the ID wasn't guessable - however the original data was still hidden between those random bits, and it could be decoded by bit-masking. Given different input the ID was therefore guaranteed to be different (at least subject to #105).

Since 0e9c560 that's all broken. The number to be "encoded" now only influences the string length, but doesn't actually make it into the string at all

str = str + format(random, alphabet.get(), 1);

The string is actually completely random, and has at least as much chance of collision as any random string of its length and alphabet. None of version, worker ID, or time have any impact whatsoever on what is generated.

Please tell me I'm wrong!

Edit: I notice @ai has been closing issues with the assertion that this package is deprecated, but in anticipation of that I'll just note -

  • This project is not marked as deprecated on either NPM or github. It still has almost 1M downloads per week and ~80k dependent projects. If it's deprecated that needs more visibility.
  • It's had a fairly massive behaviour change that means it won't do what people expect it to do, and this was changed in a semver patch release. The readme still promises the old functionality, even though there have been 5 further new versions since.
  • nanoid, though it seems good at what it does, is not a like-for-like replacement - that's a random generator, not a unique generator.

It seems to me like when maintenance changed hands after 2.2.8 there might've been an honest misunderstanding about what this package did / was meant to do / how it worked.

@robhogan robhogan changed the title Broken since 2.2.9? Uniqueness broken since 2.2.9? Feb 20, 2020
@robhogan
Copy link
Author

robhogan commented Feb 20, 2020

If indeed this package has no active maintainers, I'd like to volunteer to pick it up and restore the original short-and-unique functionality. There's a place for both this package and nanoid since they serve different purposes.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

Finally, I found an expert to talk about the unique ID topic! I like the topic, but it is hard to find a person to talk about my passion :).

Collision Probability

TLDR; old shortid didn’t use a unique machine ID. As a result, old way had a bigger collision probability that the current one.

Yeap, you are right, we moved from mixing environment data to using hardware randomness. But we made it for the reason.

Uniqueness guarantee is a very complex problem. And it became even more complex when we start to generate random IDs across multiple computers in the cluster (at the same time). The problematic was studied very well in different UUID versions.

  • “UUID v1” way is to use “node ID” (often it is MAC address), process ID, current time, local counter. If you are sure that each component is unique and use has enough bits space in ID for each component, you can be sure that IDs are unique.
  • “UUID v4” is to use “real” randomness (hardware random number) and use so big ID, so collision probability will extremely small.

shortid used the mix of “UUID v1” and “UUID v4” ways. It uses a few unique components like cluster-ID, time and hardware random bytes. But this way created a few big problems and we had a multiple issues in this repo:

  • Node.js and browser JS environment has no access to MAC address. As a result, since the beginning, we had no real unique component for proper “UUID v1” implementation. If you called the old version of shortid on many machines at the same time you will get higher collision probability.
  • It is possible to set a unique cluster ID manually. But it was easy to forget to do it and docs were not good about it. Honestly, I have no idea how to do it properly in the browser.
  • The lack of MAC was fixed with hardware bytes. But mixing “UUID v1” and “UUID v4” since they have very different ways to guarantee uniqueness. It is very easy to make a mistake and increase collision probability.

In that commit, I removed the old bytes packing system to:

  • Increase random bytes by removing time-based bytes. It should reduce collision probability between different machines. For instance, after that fix, we improved ID distribution and closed that issue.
  • Increase the performance.

Major vs Patch Release

Yeap, I thought about the major releases. But here are my reasons why I didn’t do it:

  1. Most of the projects used shortid are dead right now. Nobody will update dependencies.
  2. shortid “public API” was to give random IDs. Implementation can be threaded as a private ID.
  3. The fix anyway decreased collision probability.

Deprecation

Yeap, we can do it more properly. Let’s discussed after the collision topic.

Solution

I believe that the whole way of mixing “UUID v1” and “UUID v4” was wrong. “UUID v1” is very very hard to implement.

Right now “UUID v4” is a much safer way since it is easy to implement and using >= 122 bits of hardware random generator gives extremely low collision probability:

For there to be a one in a billion chance of duplication, 103 trillion version 4 IDs must be generated.

UUID v4 is an industry-standard right now. It is safe to use.

I made a new library based on good “UUID v4” way: Nano ID. It is faster, smaller and has less collision probability than old and new shortid versions.

@robhogan
Copy link
Author

Thanks for the detailed response, appreciate it. It sounds like your main complaint about shortid @ 2.2.8 was that it wasn't clear that "worker" ID should be set differently on each process, and that in some environments (browser) it's barely even possible, so users would encounter collisions if they were trying to use it as a replacement for UUIDv4.

However, assuming a user did read the docs, did know what they were doing and did set worker ID correctly (or only ever had one process generating IDs), then shortid was cast-iron guaranteed unique within the scope of your cluster. I personally ran a 5-node backend service with it for years, and I never had to guess what my demand would be, never had to customise ID length, never had to check for collisions, and could've gone for years more without thinking about it, at least assuming I never updated to 2.2.9.

ID length

There's also the space efficiency consideration - for a random ID generator to be useful it has to have a very sparsely populated range of potential values, so that a new random ID has a very low probability of collision. UUIDv4 has a colossal range and no two will ever be generated the same, but it has to be very long to achieve that. It's entirely dependent on the size of my record set vs the size of my id space.

If my IDs are based on worker + timestamp + counter, then it doesn't matter how many records I have already. Every new second of time guarantees a big empty chunk of id space, and we know how to fill it efficiently. I could generate a shortid every 1ms across half a dozen workers from now for the next 10 years and I'd never need more than 7 characters, and there'd be zero probability of collision. My id space is relatively efficiently filled, because I'm filling it with an algorithm and I know where the gaps are.

So in the case of ID generation at a fairly constant rate over a very long time span, shortid is shorter than nanoid, and has a lower (zero) collision probability.

Comparison

Of course UUIDv4 is also practically unique, but because it aims to be globally unique across all instances of everything, it's definitely not short, and this matters when you're conscious of things like index size or, device storage, document size, power drain...

nanoid is a shorter UUIDv4 with a customisable collision probability. That's great for distributed processes where you either can't assign worker IDs or you have so many of them that encoding their IDs is a problem. It's also good for short-lived applications where bytes used to encode a timestamp are a waste.

shortid is a shorter UUIDv1 for local uniqueness - mainly backend systems with a known, identifiable number of instances. It does/did something that can't be done with random ids.

Consider that shortid describes itself as an alternative to mongodb's ObjectID, and look at how ObjectIDs are generated by the mongo process. It's the same process ID + timestamp + sequence approach, because over a controlled cluster that works extremely well.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

I personally ran a 5-node backend service with it for years

How do you set worker ID for now?

@robhogan
Copy link
Author

On that multi-instance system (on AWS) I think I just used the last part of a private IP address since I had one node process per instance/IP on a small subnet.

On a manually scaled system it's obviously fairly easy to set an ID as an environment variable or similar as part of provisioning.

On an auto scaling GCP system I'm working on now I could probably set an integer instance ID as instance metadata, and when a new instance boots that instance could allocate itself the lowest unused integer (the cluster has access to cluster metadata via an http API).

Lots of ways to tackle that problem - more of a devops thing really. The point is that if you can determine an ID per worker, this library would give you unique IDs.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

Hm. The current API doesn’t explain how cluster-ID is important. We can revert random-based generation and add new API which will be clear about node ID:

let shortid = require('shortid')

shortid.generate()
//=> Error, You didn’t set a unique worker ID. Read github.com/dylang/shortid#worker-id show to set it properly

shortid.uniqueWorkerId(getIPAddress())
shortid.generate() //=> "PPBqWA9"

@robhogan
Copy link
Author

uniqueWorkerId() was already there, it's just called worker() .

I guess it you could improve the function naming, but it's all in the readme.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

Yeap, I forget to explain my reasons behind this idea:

  1. Axiom: we are creating an open-source for all users, not only for developers with good habits.
  2. Most developers do not read docs. API should be self-explained and do not allow to “shoot in the leg”.
  3. Throwing error without explicitly setting worker ID will protect users from forgetting to set worker ID. It will tell them that in this case, it is important to read docs about worker ID. Docs will give them simple examples of how to create a unique ID in different environments.
  4. shortid.uniqueWorkerId() will explain that ID should be unique. It stops users from putting there 0 just to avoid an error message. But maybe we can keep worker() and error with a link to docs will be enough.

@robhogan
Copy link
Author

I don't really think that's necessary but it is subjective, if you want to make it idiot-proof at the expense of simplicity in the majority case I guess that's your call.

What's not subjective at all is the fact that 0e9c560 should not have been approved as a patch release, or at all frankly - that was obviously a mistake. Hopefully it hasn't caused a massive problem to too many people.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

What's not subjective at all is the fact that 0e9c560 should not have been approved as a patch release, or at all frankly - that was obviously a mistake

It is much more complicated.

2.2.9 worse for people with the correct worker ID. But most of the users use shortid in a browser or do not set correct worker ID.

Users who don’t set worker ID in 2.2.9 have lower collision probability. But I agree that people who set worker ID now have bigger collision probability (but now very big). Because more people use shortid wrong, 2.2.9 made world better.

This is why we can’t just revert 2.2.8. It will make better for you, but worse for most of the users.

This is why I suggest a better way. Change API to force everyone to set correct worker ID and revert old method.

@robhogan
Copy link
Author

If you think about how this is used, anyone who has been storing ids with >=2.2.9 now has random ids in their database. There’s no way to revert things so that we won’t collide with those random ids - the id space is of all current users of this package is permanently polluted. I’m not sure where you go from here.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

Why we can’t change version here? https://github.com/dylang/shortid/blob/master/lib/build.js#L13

@robhogan
Copy link
Author

Because we don’t encode the version any more - we now set the first character completely randomly because generate doesn’t respect the data passed to it. All possible versions are therefore taken.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

Hm, yeap, you are right. We uses version to increase ID size.

@robhogan
Copy link
Author

robhogan commented Feb 20, 2020

You can only really publish a new major version 3, which restores 2.2.8 behaviour, and explain that v3 ids may collide with v2 ids, because unfortunately v2 ids were non-uniquely generated contrary to the docs.

There’s nothing that can be done for people who relied on the package behaving as documented after 2.2.9, unfortunately. They may now have an ID migration to have to go through.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

We can’t just restore 2.2.8 without solving the problem of forgetting a set worker ID. But we can do both: restore 2.2.8 and fix API to force everyone to set worker ID.

@robhogan
Copy link
Author

You can make the API whatever you like in v3, obviously. The behaviour of 2.2.8 is the important part - ie zero-collision guarantee.

@ai
Copy link
Collaborator

ai commented Feb 20, 2020

You mentioned that you want to become a maintainer. I can help you with 3.0 with reviewing PRs and then I will give you access.

But it is important for me to find a solution for everyone, not only for people who set worker ID correctly, since 2.2.9 was released to fix problem of worker ID bad practices.

@robhogan
Copy link
Author

Forgetting about v3 for the moment, I do really think you should make it very clear (shout it, big letters!) in the readme that 2.2.9 and onwards do not offer uniqueness, to the extent of publishing a warning/deprecation notice to npm. The readme at the moment is considerably out of step with what this package now does.

I don't mean to be dramatic but I really feel sorry for people who have used this package long-term thinking it was a good choice for time-based unique IDs - now assuming they've been patching their dependencies they have unknowingly introduced an insidious bug - an increasing possibility of collisions - and no way to fix/avoid collisions with the random IDs they have already accidentally generated.

In the short term I'll maintain a fork, https://www.npmjs.com/package/@rh389/shortid, which will always guarantee uniqueness.

CC @dylang, if he's still out there!

@ai
Copy link
Collaborator

ai commented Feb 27, 2020

Sure, send PR to docs

@GottZ
Copy link

GottZ commented Mar 19, 2020

literally went here because of this example of "trusting docs without checking": https://www.youtube.com/watch?v=SLpUKAGnm-g

i was curious and ended up here.

docs should probably point out that id's can in fact collide and uniquenes should always be checked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants