Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add time based UUIDs #6004

Closed
wants to merge 5 commits into from

Conversation

Projects
None yet
7 participants
@GaelTadh
Copy link
Contributor

commented May 1, 2014

See #5941.

Using SecureRandom as a UUID generator is slow and doesn't allow us
to take adavantage of some lucene optimizations around ids with common
prefixes. This commit will allow us to use a
timestamp64bit-macAddr-counter UUID. Since the macAddr may be shared
among several nodes running on the same hardware we use an xor of the
macaddr with a SecureRandom number generated on startup.

GaelTadh added some commits Apr 28, 2014

Use a timestamp/macaddress based UUID
Using SecureRandom as a UUID generator is slow and doesn't allow us
to take adavantage of some lucene optimizations around ids with common
prefixes. This commit will allow us to use a
timestamp64bit-macAddr-counter UUID. Since the macAddr may be shared
among several nodes running on the same hardware we use an xor of the
macaddr with a SecureRandom number generated on startup.

See #5941
Use a timestamp/macaddres/sequence number as the uuid for objects wit…
…hout an incoming id.

Wire up the timestampUUID generator to indexing.

See #5941
@kimchy

This comment has been minimized.

Copy link
Member

commented May 1, 2014

can we add a unit test that verifies the shared prefix behavior of this new UUID?

}
}

} catch (Exception uhe){

This comment has been minimized.

Copy link
@kimchy

kimchy May 1, 2014

Member

should we catch Throwable here to be safe? Also, maybe do a simple debug log it so we can see why it fails if we really want to?

@kimchy

This comment has been minimized.

Copy link
Member

commented May 1, 2014

LAST_TIMESTAMP handling is not thread safe, but at the end, an 8 bytes incremental sequence number with salted mac address should protect from time shifts and restarts. I wonder if we can remove it completely? Need to think about this.

Also, now we are moving to 22 byte UID, compared to previously having 16 byte UID, potentially, the sequence number can be smaller, but then, we might need to make sure we generate / make it incremental within the same timestamp as the seq generation scope is smaller which entails synchronization.

@kimchy

This comment has been minimized.

Copy link
Member

commented May 1, 2014

also, the mac address code is flaky, its not only on OSX that we can potentially fail, but also on systems that are virtualized, ... . Check https://github.com/cowtowncoder/java-uuid-generator/blob/3.0/src/main/java/com/fasterxml/uuid/EthernetAddress.java for inspiration.

@@ -1546,6 +1581,36 @@ public static String randomBase64UUID(Random random) {
}
}

public static String timestampBase64UUID() {

This comment has been minimized.

Copy link
@s1monw

s1monw May 1, 2014

Contributor

a couple of things:

  • I think this should be it's own class that can be a singleton. We should have a TimeBasedUUID.java and RandomBasedUUID.java that way we can put in the right impl on initialisation time and don't need the if here.
  • if we have dedicated impls then we can simply use member variables rather than static variable.s
  • we should use a PaddedAtomicLong for the timestamp and update in a CAS loop like:
do {
  long last = lastTimestamp.get();
  if (lastTimeStamp.compareAndSet(last, timestamp) {
    break;
  } 
} while(last < timestamp);
  • I wonder if 32bit sequence IDs are enough?
  • do we really need the ByteBuffer instance of can we maybe just use a simple copy method for the bytes, the long should be simple as well just a static method with 4 shifts?
Add time based UUIDs.
Incorporate some of the changes from @kimchy and @s1monw.
Move the UUID generators into their own classes and provide a common
interface as a first step to moving them under a singleton. Use a better
method of getting the mac address and fall back to a secure random address
if it fails. Add tests to test conccurency and shared prefix integrity of
UUIDs. Use PaddedAtomicLongs to hold the sequence number and lasttime.
Check to see if a time slip has occured as described by @s1monw in a CAS loop.
Next step is to move the impls under a singleton.

See #5941
@GaelTadh

This comment has been minimized.

Copy link
Contributor Author

commented May 2, 2014

I've pushed some new changes including the unit tests. @s1monw I can remove the ByteBuffer stuff if we really want to and add our own. I'm pulling things into a singleton now.

long last;

final byte[] uuidBytes = new byte[22];
do {

This comment has been minimized.

Copy link
@s1monw

s1monw May 3, 2014

Contributor

I think we are missing the time is going backwards checks here? Essentially we should:

  • check the current timestamp and if it goes backwards add one
  • try update the timestamp
    • if success step out
    • else read the current timestamp again and retry

GaelTadh added some commits May 6, 2014

Support for timebased flakelike uuids.
Reduce number of time bytes to 6 reducing total number of bytes to 20.
Validate that we have a mac address that contains data to avoid getting
addresses that are just 00:00:00:00:00:00 which can happen on virtualized
machines. Remove use of ByteBuffer on puts to reduce overhead. Add code to
attempt to prevent time slips.

See #5941
Remove unneeded variable.
Simplify mac address validation routing and remove unneed variable.

@kevinkluge kevinkluge added v1.2.0 and removed v1.2.0 labels May 8, 2014

@clintongormley clintongormley changed the title Enhancement/5941 Add time based UUIDs Jul 11, 2014

@clintongormley clintongormley added v1.4.0 and removed v1.3.0 labels Jul 11, 2014

private PaddedAtomicLong lastTimestamp = new PaddedAtomicLong();

byte[] secureMungedAddress = MacAddressProvider.getSecureMungedAddress();

This comment has been minimized.

Copy link
@mikemccand

mikemccand Aug 21, 2014

Contributor

I think these members can be final?

} while(last < timestamp);

putLong(uuidBytes,timestamp,0,6); //Only use 6 bytes of the timestamp this will suffice beyond the year 10000
System.arraycopy(secureMungedAddress,0,uuidBytes,6,secureMungedAddress.length);

This comment has been minimized.

Copy link
@mikemccand

mikemccand Aug 21, 2014

Contributor

Maybe assert secureMungedAddress.length == 6 here since we are relying on that?


putLong(uuidBytes,timestamp,0,6); //Only use 6 bytes of the timestamp this will suffice beyond the year 10000
System.arraycopy(secureMungedAddress,0,uuidBytes,6,secureMungedAddress.length);
putLong(uuidBytes,sequenceNumber.incrementAndGet(),12,8);

This comment has been minimized.

Copy link
@mikemccand

mikemccand Aug 21, 2014

Contributor

Hmm do we really need 8 bytes for the sequence number? Flake IDs just use 2 bytes ...

This comment has been minimized.

Copy link
@GaelTadh

GaelTadh Aug 21, 2014

Author Contributor

That's a good question, honestly probably 2-3 bytes is fine we just need to test around expected levels of ntp timeslips at high indexing volumes. Since we don't inc the timestamp on slip, during a slip we may have a large number of events at the same timestamp. Perhaps we should just inc the timestamp on slip ?

This comment has been minimized.

Copy link
@mikemccand

mikemccand Aug 21, 2014

Contributor

Also, I wonder if we should randomize the sequenceID on init, to try to thwart the "machine/JVM goes down, clock goes backwards, JVM restarted" case ...

This comment has been minimized.

Copy link
@mikemccand

mikemccand Aug 21, 2014

Contributor

That's a good question, honestly probably 2-3 bytes is fine we just need to test around expected levels of ntp timeslips at high indexing volumes

+1 for 2 bytes: this buys us 64K values per msec slip which should be more than enough.

Perhaps we should just inc the timestamp on slip ?

Maybe we could inc if sequenceId is 0? This way we don't risk racing through the clock after a big backwards time shift?

break;
}
} while(last < timestamp);

This comment has been minimized.

Copy link
@mikemccand

mikemccand Aug 21, 2014

Contributor

Could we maybe simplify this loop to something like this? CAS concurrency is hard to think about!

while (true) {
  last = lastTimestamp.get();
  timestamp = System.currentTimeMillis();
  if (timestamp < last) {
      // Clock shifted backwards: don't increment here, since that risks running away, but it does mean we risk exhausting sequence ids for
      // a long backwards clock shift:
      timestamp = last;
      break;
  } else if (timestamp == last) {
      break;
  } else if (lastTimestamp.compareAndSet(last, timestamp)) {
      break;
  }
}

This comment has been minimized.

Copy link
@mikemccand

mikemccand Aug 21, 2014

Contributor

Thinking more about this: is it really necessary to use paranoia-concurrency here?

What's really wrong with a simple volatile lastTimeStamp? Yeah, it won't be perfect, and sometimes it will move backwards a bit because of thread switching from the compare to the assign, but this is in fact harmless in practice, because after this ID is assigned we go on to index the doc and there's no temporal order guaranteed.

}
}

public static long getTimestampFromUUID(String uuid) throws IOException {

This comment has been minimized.

Copy link
@mikemccand

mikemccand Aug 21, 2014

Contributor

Maybe add a comment that this is just for testing?

I think it's important that we (API wise, internal to ES) make it clear that these are opaque IDs, we are free to change/improve over time, we / users shouldn't expect to be able to introspect into an ES-assigned ID to glean things like timestamp, etc.

address = getMacAddress();
} catch( SocketException se ) {
logger.warn("Unable to get mac address, will use a dummy address",se);
address = constructDummyMulticastAddress();

This comment has been minimized.

Copy link
@pickypg

pickypg Aug 30, 2014

Member

This happens regardless in the next if-statement since address remains null.


protected static class SecureRandomHolder {
// class loading is atomic - this is a lazy & safe singleton
protected static final SecureRandom INSTANCE = new SecureRandom();

This comment has been minimized.

Copy link
@pickypg

pickypg Aug 30, 2014

Member

Seems strange to only expose the singleton instance to the package or child classes. Could also simplify access by exposing a class level accessor to it:

public static SecureRandom getRandom() {
    return SecureRandomHolder.INSTANCE;
}

If it's only intended for package-level use, then perhaps a package protected class should provide it instead.

/**
*
*/
public interface UUIDGenerator {

This comment has been minimized.

Copy link
@pickypg

pickypg Aug 30, 2014

Member

If we want everyone to use one implementation versus another, then I suggest adding a static final UUIDGenerator DEFAULT = new ... to the interface. That way anyone can simply reference it and we can avoid multiple instances all over the place.

This should also hit at the point where @mikemccand suggested that we should not hold ourselves to any specific implementation; users could reference the constant and always be using the latest version and never have to actually know what implementation that is.

mungedAddress = address;
}
} catch (IOException ie){
assertFalse(ie!=null);

This comment has been minimized.

Copy link
@pickypg

pickypg Aug 30, 2014

Member

Why not just let the test throw the IOException and fail automatically, thereby avoiding the try-catch altogether? Alternatively, you can always invoke fail(); to fail a unit test.

/**
*
*/
public class PaddedAtomicLong extends AtomicLong {

This comment has been minimized.

Copy link
@pickypg

pickypg Aug 30, 2014

Member

Does this class add any value to anything except RandomBasedUUID? Feel like it should just be a static inner class within that.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Aug 31, 2014

Thanks @pickypg, I started from @GaelTadh patch and then folded in
most of your & my feedback into a new branch here:

mikemccand@e5e21b7

The one thing I didn't do is move the singleton to UUIDGenerator
interface; I think Strings.* APIs is good place to expose access to
these different generators?

I removed the PaddedAtomicLong: I think it's overkill here, and a
distraction.

I left the CAS loop to ensure lastTimestamp grows monotonically, but
tried to simplify it.

I removed APIs to parse out the time-based UUID: it's supposed to be
opaque to callers.

I broke out SecureRandomHolder as its own package private class.

I dropped sequence number from 8 bytes to 3, and now increment the
timestamp when the bottom two bytes are 0, to deal with a biggish
clock slip backwards while JVM is up. I also randomized the sequence
id on init for best effort protection of backwards clock-slip while
JVM is down.

@pickypg

This comment has been minimized.

Copy link
Member

commented Aug 31, 2014

@mikemccand mikemccand removed the review label Sep 1, 2014

@mikemccand mikemccand closed this in 9c1ac95 Sep 2, 2014

@mikemccand mikemccand removed the v1.4.0 label Sep 2, 2014

mikemccand added a commit that referenced this pull request Sep 2, 2014

Use Flake IDs instead of random UUIDs when auto-generating id field
Flake IDs give better lookup performance in Lucene since they share
predictable prefixes (timestamp).

Closes #7531

Closes #6004

Closes #5941

mikemccand added a commit that referenced this pull request Sep 8, 2014

Use Flake IDs instead of random UUIDs when auto-generating id field
Flake IDs give better lookup performance in Lucene since they share
predictable prefixes (timestamp).

Closes #7531

Closes #6004

Closes #5941
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.