Timeuuid #68

Closed
wants to merge 3 commits into
from

Conversation

Projects
None yet
2 participants
@GrLawrence

This is for issue 66, it addresses the generation of timeuuids in multi-threaded environments and regenerates the clock sequence only when needed.

@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Sep 25, 2012

Contributor

As I was saying in issue #66 I don't think we need to abide by the RFC, the sequence clock is throw away data. And not even Cassandra's internal support abides by how the RFC wants the clockSequence to work. You can see their source here:

https://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/utils/UUIDGen.java

Here are my main problems with what you have provided, lockings should be done sparingly because they can add alot of overhead. And I don't know if the overhead is worth it for a throw away bit, when we can generate it randomly alot cheaper without blocking threads.

As an added note I am switching back to the Random Byte filling, because the GetTimestamp is slightly slower.

    var r = new Random();

    var sw = new Stopwatch();
    var bytes = new byte[2];

    sw.Start();
    for(int i = 0; i < 10000000; i++) {
        r.NextBytes(bytes);
    }
    sw.Stop();

    Console.WriteLine(sw.Elapsed);

    var ts = 0L;
    sw.Restart();
    for(int i = 0; i< 10000000; i++) {
        ts = Stopwatch.GetTimestamp();
        bytes = new byte[] {
                (byte)(ts),
                (byte)(ts >> 0x8)
            };
    }
    sw.Stop();

    Console.WriteLine(sw.Elapsed);

It produced the following numbers pretty reliably:

00:00:00.2551339
00:00:00.3656215

Do you have any testing to show that the locking is faster in a multi-threaded environment?

Contributor

nberardi commented Sep 25, 2012

As I was saying in issue #66 I don't think we need to abide by the RFC, the sequence clock is throw away data. And not even Cassandra's internal support abides by how the RFC wants the clockSequence to work. You can see their source here:

https://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/utils/UUIDGen.java

Here are my main problems with what you have provided, lockings should be done sparingly because they can add alot of overhead. And I don't know if the overhead is worth it for a throw away bit, when we can generate it randomly alot cheaper without blocking threads.

As an added note I am switching back to the Random Byte filling, because the GetTimestamp is slightly slower.

    var r = new Random();

    var sw = new Stopwatch();
    var bytes = new byte[2];

    sw.Start();
    for(int i = 0; i < 10000000; i++) {
        r.NextBytes(bytes);
    }
    sw.Stop();

    Console.WriteLine(sw.Elapsed);

    var ts = 0L;
    sw.Restart();
    for(int i = 0; i< 10000000; i++) {
        ts = Stopwatch.GetTimestamp();
        bytes = new byte[] {
                (byte)(ts),
                (byte)(ts >> 0x8)
            };
    }
    sw.Stop();

    Console.WriteLine(sw.Elapsed);

It produced the following numbers pretty reliably:

00:00:00.2551339
00:00:00.3656215

Do you have any testing to show that the locking is faster in a multi-threaded environment?

@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Sep 25, 2012

Contributor

With the code how it currently exists, we get 4-8 duplicates out of 100,000 tries. This is probably un realistic given how tight the loop is.

But I wanted to give you method a benefit of the doubt so I timed both.

private static void Main(string[] args)
{
    var sw = new Stopwatch();
    TaskFactory factory = new TaskFactory();
    Task[] tasks = new Task[100];

    sw.Start();
    for (int i = 0; i < tasks.Length; i++) {
        Task t = factory.StartNew(WithLock);
        tasks[i] = t;
    }

    Task.WaitAll(tasks);
    sw.Stop();
    Console.WriteLine("with-lock: " + sw.Elapsed);

    factory = new TaskFactory();
    tasks = new Task[100];

    sw.Restart();
    for (int i = 0; i < tasks.Length; i++)
    {
        Task t = factory.StartNew(NoLock);
        tasks[i] = t;
    }

    Task.WaitAll(tasks);
    sw.Stop();
    Console.WriteLine("no-lock:   " + sw.Elapsed);
    Console.Read();
}

private static object _lock = new object();
private static DateTimeOffset _lastTs = TimestampHelper.UtcNow();
private static byte[] _clock = GuidGenerator.GetClockSequenceBytes();

private static void NoLock()
{
    var guid = Guid.NewGuid();

    for (int i = 0; i < 1000; i++)
    {
        var ts = TimestampHelper.UtcNow();
        guid = GuidGenerator.GenerateTimeBasedGuid(ts, GuidGenerator.GetClockSequenceBytes(), GuidGenerator.DefaultNode);
    }
}

private static void WithLock()
{
    var guid = Guid.NewGuid();

    for(int i = 0; i < 1000; i++) {
        lock(_lock) {
            var ts = TimestampHelper.UtcNow();
            if (ts <= _lastTs)
                _clock = GuidGenerator.GetClockSequenceBytes();

            _lastTs = ts;

            guid = GuidGenerator.GenerateTimeBasedGuid(ts, _clock, GuidGenerator.DefaultNode);
        }
    }
}

The lock is about 10 times slower.

with-lock: 00:00:00.2390053
no-lock:   00:00:00.0260270

Realistically we have to weight speed verse absolute accuracy. Because in both situations we are dealing with a really tight loop. So is it worth slowing down the generator by 10x to make sure an 8 out of 100,000 chance doesn't happen? What do you think?

Contributor

nberardi commented Sep 25, 2012

With the code how it currently exists, we get 4-8 duplicates out of 100,000 tries. This is probably un realistic given how tight the loop is.

But I wanted to give you method a benefit of the doubt so I timed both.

private static void Main(string[] args)
{
    var sw = new Stopwatch();
    TaskFactory factory = new TaskFactory();
    Task[] tasks = new Task[100];

    sw.Start();
    for (int i = 0; i < tasks.Length; i++) {
        Task t = factory.StartNew(WithLock);
        tasks[i] = t;
    }

    Task.WaitAll(tasks);
    sw.Stop();
    Console.WriteLine("with-lock: " + sw.Elapsed);

    factory = new TaskFactory();
    tasks = new Task[100];

    sw.Restart();
    for (int i = 0; i < tasks.Length; i++)
    {
        Task t = factory.StartNew(NoLock);
        tasks[i] = t;
    }

    Task.WaitAll(tasks);
    sw.Stop();
    Console.WriteLine("no-lock:   " + sw.Elapsed);
    Console.Read();
}

private static object _lock = new object();
private static DateTimeOffset _lastTs = TimestampHelper.UtcNow();
private static byte[] _clock = GuidGenerator.GetClockSequenceBytes();

private static void NoLock()
{
    var guid = Guid.NewGuid();

    for (int i = 0; i < 1000; i++)
    {
        var ts = TimestampHelper.UtcNow();
        guid = GuidGenerator.GenerateTimeBasedGuid(ts, GuidGenerator.GetClockSequenceBytes(), GuidGenerator.DefaultNode);
    }
}

private static void WithLock()
{
    var guid = Guid.NewGuid();

    for(int i = 0; i < 1000; i++) {
        lock(_lock) {
            var ts = TimestampHelper.UtcNow();
            if (ts <= _lastTs)
                _clock = GuidGenerator.GetClockSequenceBytes();

            _lastTs = ts;

            guid = GuidGenerator.GenerateTimeBasedGuid(ts, _clock, GuidGenerator.DefaultNode);
        }
    }
}

The lock is about 10 times slower.

with-lock: 00:00:00.2390053
no-lock:   00:00:00.0260270

Realistically we have to weight speed verse absolute accuracy. Because in both situations we are dealing with a really tight loop. So is it worth slowing down the generator by 10x to make sure an 8 out of 100,000 chance doesn't happen? What do you think?

nberardi added a commit that referenced this pull request Sep 25, 2012

decided to add an option of Fast vs NoDuplicates, since NoDuplicates …
…is 10 times slower, this should satisfy issue #68 and #66
@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Sep 25, 2012

Contributor

Finally think we have a good option for this.

Contributor

nberardi commented Sep 25, 2012

Finally think we have a good option for this.

@nberardi nberardi closed this Sep 25, 2012

@GrLawrence

This comment has been minimized.

Show comment Hide comment
@GrLawrence

GrLawrence Sep 25, 2012

Thanks for the effort taking those timings. I was just going to dig into
those. We (I and others at W3i) thought collisions could be a problem and
we were able to reproduce them with this multi-threaded loop. Looking at
the spec, it recommended obtaining a lock (similar to what Hector is using)
so I implemented that and it seems to correct the problem.

For us, we envision generating many UUIDs and using them for column names
so we're concerned about accidentally overwriting data, so the uniqueness
is important to us. While your stats show it's 10x slower, it still seems
fast enough for low volumes.

Maybe an override is in order to choose which implementation meets the
user's need?

Thanks,
Grant

On Tue, Sep 25, 2012 at 9:59 AM, Nick Berardi notifications@github.comwrote:

With the code how it currently exists, we get 4-8 duplicates out of
100,000 tries. This is probably un realistic given how tight the loop is.

But I wanted to give you method a benefit of the doubt so I timed both.

private static void Main(string[] args)
{
var sw = new Stopwatch();

TaskFactory factory = new TaskFactory();
Task[] tasks = new Task[100];

sw.Start();
for (int i = 0; i < tasks.Length; i++) {
    Task t = factory.StartNew(WithLock);
    tasks[i] = t;
}

Task.WaitAll(tasks);
sw.Stop();
Console.WriteLine("with-lock: " + sw.Elapsed);

factory = new TaskFactory();
tasks = new Task[100];

sw.Restart();
for (int i = 0; i < tasks.Length; i++)
{
    Task t = factory.StartNew(NoLock);
    tasks[i] = t;
}

Task.WaitAll(tasks);
sw.Stop();
Console.WriteLine("no-lock:   " + sw.Elapsed);
Console.Read();}

private static object _lock = new object();private static DateTimeOffset _lastTs = TimestampHelper.UtcNow();private static byte[] _clock = GuidGenerator.GetClockSequenceBytes();
private static void NoLock(){
var guid = Guid.NewGuid();

for (int i = 0; i < 1000; i++)
{
    var ts = TimestampHelper.UtcNow();
    guid = GuidGenerator.GenerateTimeBasedGuid(ts, GuidGenerator.GetClockSequenceBytes(), GuidGenerator.DefaultNode);
}}

private static void WithLock(){
var guid = Guid.NewGuid();

for(int i = 0; i < 1000; i++) {
    lock(_lock) {
        var ts = TimestampHelper.UtcNow();
        if (ts <= _lastTs)
            _clock = GuidGenerator.GetClockSequenceBytes();

        _lastTs = ts;

        guid = GuidGenerator.GenerateTimeBasedGuid(ts, _clock, GuidGenerator.DefaultNode);
    }
}}

The lock is about 10 times slower.

with-lock: 00:00:00.2390053
no-lock: 00:00:00.0260270

Realistically we have to weight speed verse absolute accuracy. Because in
both situations we are dealing with a really tight loop. So is it worth
slowing down the generator by 10x to make sure an 8 out of 100,000 chance
doesn't happen? What do you think?


Reply to this email directly or view it on GitHubhttps://github.com/managedfusion/fluentcassandra/pull/68#issuecomment-8857291.

People do not care how much you know until they know how much you care.
--John Maxwell

Thanks for the effort taking those timings. I was just going to dig into
those. We (I and others at W3i) thought collisions could be a problem and
we were able to reproduce them with this multi-threaded loop. Looking at
the spec, it recommended obtaining a lock (similar to what Hector is using)
so I implemented that and it seems to correct the problem.

For us, we envision generating many UUIDs and using them for column names
so we're concerned about accidentally overwriting data, so the uniqueness
is important to us. While your stats show it's 10x slower, it still seems
fast enough for low volumes.

Maybe an override is in order to choose which implementation meets the
user's need?

Thanks,
Grant

On Tue, Sep 25, 2012 at 9:59 AM, Nick Berardi notifications@github.comwrote:

With the code how it currently exists, we get 4-8 duplicates out of
100,000 tries. This is probably un realistic given how tight the loop is.

But I wanted to give you method a benefit of the doubt so I timed both.

private static void Main(string[] args)
{
var sw = new Stopwatch();

TaskFactory factory = new TaskFactory();
Task[] tasks = new Task[100];

sw.Start();
for (int i = 0; i < tasks.Length; i++) {
    Task t = factory.StartNew(WithLock);
    tasks[i] = t;
}

Task.WaitAll(tasks);
sw.Stop();
Console.WriteLine("with-lock: " + sw.Elapsed);

factory = new TaskFactory();
tasks = new Task[100];

sw.Restart();
for (int i = 0; i < tasks.Length; i++)
{
    Task t = factory.StartNew(NoLock);
    tasks[i] = t;
}

Task.WaitAll(tasks);
sw.Stop();
Console.WriteLine("no-lock:   " + sw.Elapsed);
Console.Read();}

private static object _lock = new object();private static DateTimeOffset _lastTs = TimestampHelper.UtcNow();private static byte[] _clock = GuidGenerator.GetClockSequenceBytes();
private static void NoLock(){
var guid = Guid.NewGuid();

for (int i = 0; i < 1000; i++)
{
    var ts = TimestampHelper.UtcNow();
    guid = GuidGenerator.GenerateTimeBasedGuid(ts, GuidGenerator.GetClockSequenceBytes(), GuidGenerator.DefaultNode);
}}

private static void WithLock(){
var guid = Guid.NewGuid();

for(int i = 0; i < 1000; i++) {
    lock(_lock) {
        var ts = TimestampHelper.UtcNow();
        if (ts <= _lastTs)
            _clock = GuidGenerator.GetClockSequenceBytes();

        _lastTs = ts;

        guid = GuidGenerator.GenerateTimeBasedGuid(ts, _clock, GuidGenerator.DefaultNode);
    }
}}

The lock is about 10 times slower.

with-lock: 00:00:00.2390053
no-lock: 00:00:00.0260270

Realistically we have to weight speed verse absolute accuracy. Because in
both situations we are dealing with a really tight loop. So is it worth
slowing down the generator by 10x to make sure an 8 out of 100,000 chance
doesn't happen? What do you think?


Reply to this email directly or view it on GitHubhttps://github.com/managedfusion/fluentcassandra/pull/68#issuecomment-8857291.

People do not care how much you know until they know how much you care.
--John Maxwell

@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Sep 25, 2012

Contributor

I gave the user an option in the latest source. This should be enough. It will be safe by default, but if the users want speed over accuracy they now have the option.

By the way this 10x number is not constant, so take that into consideration. The more you generate the slower it gets compared to the fast way. at 50,000 it was about 8x at 100,000 it was about 10x and I imagine it just keeps increasing from there.

Contributor

nberardi commented Sep 25, 2012

I gave the user an option in the latest source. This should be enough. It will be safe by default, but if the users want speed over accuracy they now have the option.

By the way this 10x number is not constant, so take that into consideration. The more you generate the slower it gets compared to the fast way. at 50,000 it was about 8x at 100,000 it was about 10x and I imagine it just keeps increasing from there.

@GrLawrence

This comment has been minimized.

Show comment Hide comment
@GrLawrence

GrLawrence Sep 25, 2012

I like it, thanks.

On Tue, Sep 25, 2012 at 10:34 AM, Nick Berardi notifications@github.comwrote:

I gave the user an option in the latest source. This should be enough. It
will be safe by default, but if the users want speed over accuracy they now
have the option.


Reply to this email directly or view it on GitHubhttps://github.com/managedfusion/fluentcassandra/pull/68#issuecomment-8858607.

People do not care how much you know until they know how much you care.
--John Maxwell

I like it, thanks.

On Tue, Sep 25, 2012 at 10:34 AM, Nick Berardi notifications@github.comwrote:

I gave the user an option in the latest source. This should be enough. It
will be safe by default, but if the users want speed over accuracy they now
have the option.


Reply to this email directly or view it on GitHubhttps://github.com/managedfusion/fluentcassandra/pull/68#issuecomment-8858607.

People do not care how much you know until they know how much you care.
--John Maxwell

@ghost ghost assigned nberardi Oct 11, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment