Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore using timebased decentralized UUID for autogenerated IDs #5941

Closed
s1monw opened this Issue Apr 25, 2014 · 8 comments

Comments

Projects
None yet
5 participants
@s1monw
Copy link
Contributor

commented Apr 25, 2014

Today we use UUIDs which are very costly to compute since they use SecureRandom and they have some properties that might not play very well / or as well as they could with the Lucene internal datastructures since they don't share a common prefix. With a time based ID we can potentially remove the need for bloom fitlers on the ID fields since the shared prefixes across the IDs might be all in memory and that would speedup updates dramatically. Something along the lines of flake ids would be very interesting since it has this property

@s1monw s1monw added the enhancement label Apr 25, 2014

@GaelTadh GaelTadh self-assigned this Apr 25, 2014

@GaelTadh

This comment has been minimized.

Copy link
Contributor

commented Apr 25, 2014

What we are proposing is since a machine may have several es instances running in parallel. We will generate a secure xor of the mac address, prefixed by a 64bit timestamp and postfixed by an atomic counter.

timestamp64bit-macAddr-counter

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Apr 25, 2014

+1 for prefixing the ID with timestamp, or anything "predictable": this is better for Lucene's term dictionary since sometimes it will be able to short-circuit a lookup and know the ID cannot reside in a given segment. It's also important that the IDs are fixed-length, so that all terms happen in the "leaf blocks" of the dictionary. Finally, something like base36 is best, since we tell the terms dict writing to put 25 - 48 terms per block (so up to base 48 should be good too).

@GaelTadh

This comment has been minimized.

Copy link
Contributor

commented Apr 25, 2014

@mikemccand why does the encoding matter ? Or do you mean between 36 and 48 bytes long ?

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Apr 25, 2014

@GaelTadh what I mean is each digit should have no more than 48 values (and lower would be better: less scanning within a block). This is because the terms dictionary assigns terms into blocks of at most 48 terms at once.

However ... thinking about this more, I think this matters less, since the IDs won't be sequential here.

@GaelTadh

This comment has been minimized.

Copy link
Contributor

commented Apr 25, 2014

Ahh ok that makes sense.

GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue Apr 28, 2014

Use a timestamp/macaddress based UUID
Using SecureRandom as a UUID generator is slow and doesn't allow us
to take adavantage of some lucene optimizations around ids with common
prefixes. This commit will allow us to use a
timestamp64bit-macAddr-counter UUID. Since the macAddr may be shared
among several nodes running on the same hardware we use an xor of the
macaddr with a SecureRandom number generated on startup.

See elastic#5941
@GaelTadh

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2014

I used jmh (http://openjdk.java.net/projects/code-tools/jmh/) to do some micro benchmarks on the following code :

,,,,

@GenerateMicroBenchmark
public void secureRandomUUID(){
    randomBase64UUID(SecureRandomHolder.INSTANCE);
}

@GenerateMicroBenchmark
public void insecureRandomUUID(){
    randomBase64UUID(new Random());
}


@GenerateMicroBenchmark
public void timestampUUID(){
    timestampBase64UUID();
}

,,,

Which gave the following output :

Benchmark Mode Samples Mean Mean error Units
o.s.UUIDBenchmark.insecureRandomUUID thrpt 200 5155.120 16.137 ops/ms
o.s.UUIDBenchmark.secureRandomUUID thrpt 200 469.264 3.287 ops/ms
o.s.UUIDBenchmark.timestampUUID thrpt 200 5703.520 13.350 ops/ms
@GaelTadh

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2014

I'm gonna run these again and remove the new call from insecureRandom.

@GaelTadh

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2014

Benchmark Mode Samples Mean Mean error Units
o.s.UUIDBenchmark.insecureRandomUUID thrpt 200 7348.913 16.807 ops/ms
o.s.UUIDBenchmark.secureRandomUUID thrpt 200 475.986 1.489 ops/ms
o.s.UUIDBenchmark.timestampUUID thrpt 200 5698.841 23.908 ops/ms

GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 1, 2014

Use a timestamp/macaddres/sequence number as the uuid for objects wit…
…hout an incoming id.

Wire up the timestampUUID generator to indexing.

See elastic#5941

GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 2, 2014

Add time based UUIDs.
Incorporate some of the changes from @kimchy and @s1monw.
Move the UUID generators into their own classes and provide a common
interface as a first step to moving them under a singleton. Use a better
method of getting the mac address and fall back to a secure random address
if it fails. Add tests to test conccurency and shared prefix integrity of
UUIDs. Use PaddedAtomicLongs to hold the sequence number and lasttime.
Check to see if a time slip has occured as described by @s1monw in a CAS loop.
Next step is to move the impls under a singleton.

See elastic#5941

@kevinkluge kevinkluge added the v1.2.0 label May 5, 2014

GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 6, 2014

Support for timebased flakelike uuids.
Reduce number of time bytes to 6 reducing total number of bytes to 20.
Validate that we have a mac address that contains data to avoid getting
addresses that are just 00:00:00:00:00:00 which can happen on virtualized
machines. Remove use of ByteBuffer on puts to reduce overhead. Add code to
attempt to prevent time slips.

See elastic#5941

@kevinkluge kevinkluge added v1.3.0 and removed v1.2.0 labels May 12, 2014

@clintongormley clintongormley added v1.4.0 and removed v1.3.0 labels Jul 11, 2014

@mikemccand mikemccand added the v2.0.0 label Sep 1, 2014

@mikemccand mikemccand closed this in 9c1ac95 Sep 2, 2014

mikemccand added a commit that referenced this issue Sep 2, 2014

Use Flake IDs instead of random UUIDs when auto-generating id field
Flake IDs give better lookup performance in Lucene since they share
predictable prefixes (timestamp).

Closes #7531

Closes #6004

Closes #5941

mikemccand added a commit that referenced this issue Sep 8, 2014

Use Flake IDs instead of random UUIDs when auto-generating id field
Flake IDs give better lookup performance in Lucene since they share
predictable prefixes (timestamp).

Closes #7531

Closes #6004

Closes #5941
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.