
Distributed transactions #22

Closed
andybons opened this issue Jun 3, 2014 · 32 comments
@andybons
Contributor

andybons commented Jun 3, 2014

This is an umbrella issue for now for accounting purposes. Smaller issues should be created for the various paths a transaction can take.

@Frank-Jin

Hi, I am learning the design of Cockroach's transactions. I have a question: Cockroach's transactions are designed to support SSI, but I think they can't prevent the write-skew problem.

[screenshot sshot-278: write-skew example]

I think Cockroach can't detect this conflict. Maybe it's my misunderstanding.

@tbg
Member

tbg commented Nov 12, 2014

Hi fenglin99,
the read timestamp cache should cause one of the two transactions to restart. Your diagram is in absolute time, but the txns will really use their internal timestamps for all reads (ideally that is their initial provisional commit timestamp, but whenever any op in the txn returns a higher one, they must use that one or a later one). An SSI transaction wants every operation to actually be carried out at the timestamp it supplied; anything else pushes the commit timestamp, which SSI does not tolerate.
When writing, the write consults the read timestamp cache for that key. If there have been any read ops with a timestamp higher than the write's, the write will return that later timestamp.
In your example, because each transaction reads the other's key, each write will "see" the other's timestamp. So the only way both transactions commit is if they both commit at the same timestamp. That should not be possible, since our timestamps have a logical component and the two transactions touch a common set of keys and should be causally connected. Even so:
@spencerkimball what happens in the above example if the two transactions originate on different nodes and happen at exactly the same hybrid logical clock time? This is conceivable (albeit unlikely) if the two nodes don't communicate much. Both txns would end up using the same timestamp, and the read-timestamp-cache trap wouldn't work. Is there another mechanism? If not, wouldn't the read timestamp cache have to add an extra tick on top of the last read timestamp, which would mean txns that write keys they read need weird workarounds? I sure hope not.

@Frank-Jin

So the candidate timestamp can't be changed in an SSI transaction; if it is, the transaction will restart.
Is that right?
Another question: does that mean Cockroach can't support long SSI transactions, because the read cache is only kept for 10s?
Thanks!

@Frank-Jin

And where do we get the initial candidate timestamp? Is it assigned by the client or by a Cockroach server node? If it comes from a Cockroach node, when is it assigned: on the first read or the first write?

@spencerkimball
Member

GitHub is annoying in that you can't respond to individual questions.

@tschottdorf: you can't write a value at the same timestamp as a previous read. The requirement is that the write timestamp be bumped up to the previous read's timestamp + 1 logical tick, so it doesn't matter if the two transactions pick identical candidate timestamps. One will end up +1 logical tick ahead.

@fenglin99: if the candidate timestamp is changed in an SSI txn, the txn will restart. An SSI Cockroach txn which keeps sending writes for longer than 10s will end up pushing its timestamp forward, so SSI will require a restart at the end. On the restart, however, the intents are still in place. Having to always retry a long transaction is pretty terrible, though... if you're running extremely long transactions like this, you probably want to reconsider what you're doing, or possibly switch to using SI isolation.

@fenglin99: the initial candidate timestamp is taken from the gateway node which receives the first command(s) which are part of the txn. This timestamp must be assigned by a node in the cluster because it should be within the maximum clock offset of the cluster's "true time". The client will have no guarantee of being synchronized properly as it's not estimating clock offsets to other nodes on a constant basis.

@tbg
Member

tbg commented Nov 12, 2014

@spencerkimball what does an SSI transaction do that reads x, then writes x? Why won't that restart forever?

@spencerkimball
Member

The timestamp cache contains a txn id for each entry if applicable. If a write encounters an entry in the timestamp cache with its own txn id, it's able to avoid incrementing its timestamp.

@tbg
Member

tbg commented Nov 12, 2014

Ah, perfect.

@Frank-Jin

@spencerkimball ok, I see, thanks

@spencerkimball
Member

@fenglin99 you might want to take a look at kv/txn_correctness_test.go for tests which specifically verify the operation of the txn model with respect to various write anomalies.

@alkfbb

alkfbb commented Nov 16, 2014

Hi, I am reading the transaction part of the Cockroach design and have a question about each range's read timestamp cache. The read timestamp cache maintains the "latest timestamp" at which each key was read. What exactly does "latest timestamp" mean: the candidate commit timestamp of the transaction that read the key, or its start (snapshot) timestamp?

@tbg
Member

tbg commented Nov 16, 2014

Hi,
It's the timestamp which was passed to the read operation by the readers. For a txn, that's its original timestamp (when writing, the current provisional commit timestamp is used instead).

@alkfbb

alkfbb commented Nov 16, 2014

Hi, you mean a transaction only has a provisional commit timestamp? But then how does a transaction determine which version of a key is visible? Just by using its provisional commit timestamp? Do all read operations read the latest committed version of a key?

@tbg
Member

tbg commented Nov 16, 2014

Hi alkfbb, sorry. What I said was misleading. I'm editing the comment above to be more appropriate.

@cockroach-team

Hi there,

I have some more doubts about the interaction between distributed
transactions. Hope you can help; I really couldn't figure it out by
myself :S

Assume we are running transactions in SI. We will run a distributed
transaction, say TX1, composed of two operations, one for reading a value X
and one for updating a value Y.

The 99% clock skew is 10... And let's say the candidate timestamp of TX1 is
5.
When TX1 is reading X, it encounters a write intent with timestamp 4 and
that transaction has not committed. So it just pushes the commit time of
the writer transaction to 6 (By the way, how is this done? Is that 'push'
written to the transaction table, or is it sent to the coordinator of that
transaction? Does the coordinator need to wait sufficiently long time
before committing? Because incoming transactions may continually push the
commit time forward). As a result, TX1 does not read that intent.
When TX1 is updating Y, however it encounters a read for Y happened with
timestamp 6 by another transaction. So now the write timestamp for TX1 is
updated to 7.

So when committing, TX1 commits with timestamp 7. However, it didn't read
the value of X with timestamp 6. Isn't that incorrect?

Thank you very much in advance,
Lee


@tbg
Member

tbg commented Nov 20, 2014

Hi Lee,

the txn status is checked via the transaction table (that's the single source of truth). Whatever is written there is fact, and all updates to this table are atomic.
Pushing a txn means updating that other txn's candidate timestamp (unless it has already committed at the time you write; in that case the push fails. This will happen in practice.)

I'm not sure I see the problem with your example. You're reading a consistent cut of your database (at "time" 5) throughout your transaction. The transaction will commit with a later ts 7 (pushed to accommodate concurrent readers, in your example). Whatever is written between 5 and 7 doesn't concern you (there would be an uncertainty restart, but let's ignore that).
If you want to find an issue with the consistency of the snapshot, you would have to start cooking up examples that read more than one key. But since we're always reading at the same ts (the initial ts of your txn), that will be hard.

Best,
Tobias

@spencerkimball
Member

Lee,

The example you've given is actually correct behavior for snapshot isolation. It's also the root of why snapshot isolation doesn't guarantee serializable consistency--you may read values from a candidate timestamp which is earlier than your final commit timestamp. This is the basis for the write skew anomaly.

@cockroach-team

Thank you Spencer and Tobias! You are absolutely right :P..
Somehow I got the wrong idea that the snapshot timestamp and commit timestamp
are treated identically even in snapshot isolation, which is wrong :)


@cockroach-team

As far as DB isolation goes, SI exhibits relatively few anomalies, and is
even the default isolation mode on some commercial databases. Here is the
description in Wikipedia:

http://en.wikipedia.org/wiki/Snapshot_isolation

See especially the write-skew example in the Definition section, which is
similar to the example you have been pondering.

Spencer and I had a vigorous discussion as to what the default isolation
level for Cockroach should be. I like SI for performance, but he likes SSI
for correctness. Since he's the one actually doing the work, he wins. : )
Also, he's probably right. Programmers tend to trip up on these anomalies
at the worst possible time - only after their code is running in production.

~Andy


@spencerkimball
Member

There are enough issues with performance I might change my mind and make snapshot the default. Ultimately, Oracle ran with snapshot isolation claiming to be "serializable" for many years (a decade plus?) in probably 10s of thousands of installations. Further, most people using MySQL across the entire industry never go above "read committed". Isolation matters, but when you are talking about the difference between snapshot isolation and serializable consistency, it's a pretty marginal benefit.

That said, I think the difference between "read committed" and "snapshot isolation" is somewhat more dramatic. With "read committed" (and "repeatable read"), decent programmers constantly have to think about when to read "for update" which strikes me as exactly the sort of thing you want to avoid having them spend their time doing (and often getting wrong).

If you're interested in the correctness of the transaction model, take a look at https://github.com/cockroachdb/cockroach/blob/master/kv/txn_correctness_test.go. It makes sense to start reading the file at about line 613. Everything above that is part of the testing harness.

@cockroach-team

Hello guys, I am also trying to become a contributor to Cockroach. I just
started reading your documentation.
You maintain a read cache for the last 10s. So if there is a write txi with
candidate ts(txi) less than the last evicted timestamp ts(le),
then we must update ts(txi) to ts(le). But if txi is not related to
ts(le) by key, what is the reason to update it?
Why do we need to update the low-water mark when a new range replica leader is
elected?
What should I read to understand your logic more deeply?


@cockroach-team

Btw, when we update the timestamp of the current rw txn with the last
evicted timestamp, we are making a pushing action, am I right?

@tbg
Member

tbg commented Dec 8, 2014

Hi Rustem,
the read ts cache is kept by key, so you only update your provisional commit timestamp if you were trying to write a key that might already have been read at a later timestamp. Technically, pushing only happens between two transactions (one trying to push the other), but we sometimes say a timestamp is "pushed" when we just mean it is increased to accommodate, for instance, a larger entry from the tsCache.

It's great that you want to contribute. The items below look like they could be a good starting point - usually everything leads down the rabbit hole sooner or later. Just assign yourself and off you go.
Also you definitely want to participate in our code reviews - it's a great (and really the only) way to get accustomed to our conventions as well as the codebase.

@Rustem

Rustem commented Dec 8, 2014

Thank you, Tobias (@tschottdorf), for the nice explanation of pushing.
Actually I am from the Python world, but I think the language is not a problem at all.
Could you please recommend a set of papers that would give me a clear picture of the distributed transactions you implemented and why they work? I have read the first three: Cahill, Yabandeh, and the one about Calvin. I just want to understand why this works. I have read the Spanner paper many times, but still have a few misunderstandings. Maybe it's better to start coding rather than reading.

See the Cahill paper for one possible implementation of SSI. This is another great paper. For a discussion of SSI implemented by preventing read-write conflicts (in contrast to detecting them, called write-snapshot isolation), see the Yabandeh paper, which is the source of much inspiration for Cockroach’s SSI. 

@Rustem

Rustem commented Dec 8, 2014

Why do we need to update the low-water mark when a new range replica leader is
elected? Because the election is also a transaction?

@tbg
Member

tbg commented Dec 8, 2014

the tsCache is associated with the respective replica leader. The new leader's clock will not be exactly in sync with the old leader's, and if we don't raise the high-water mark to account for the offset, we could wind up not pushing the timestamp of a write that should have been pushed, because the old leader might have seen a read at that timestamp before losing leadership.

@cockroach-team

Now I get it. I forgot that the timestamp is taken from the range replica
leader node. Each node has a clock offset ε such that RealTime - ε <=
NodeTime <= RealTime + ε. That's why an rw transaction just after a new
election is at risk of not being pushed, and why serializability might be
lost. Am I right?


@andybons andybons modified the milestone: v0.1 (Beta) Jan 13, 2015
@yorkxu

yorkxu commented Jul 14, 2016

Hi,
I read the distributed transaction part of cockroach design doc.
I have some doubt about how to choose a timestamp:

For multiple nodes, the timestamp of the node coordinating the transaction t is used. In addition, a maximum timestamp t+ε is supplied to provide an upper bound on timestamps for already-committed data (ε is the maximum clock skew). As the transaction progresses, any data read which have timestamps greater than t but less than t+ε cause the transaction to abort and retry with the conflicting timestamp tc, where tc > t. The maximum timestamp t+ε remains the same. This implies that transaction restarts due to clock uncertainty can only happen on a time interval of length ε.

If the data read has a timestamp greater than t-ε but less than t, how do we decide whether the transaction that wrote it committed before or after the start timestamp of the reading transaction?

Or is there another mechanism that guarantees all data with timestamps greater than t-ε but less than t was committed before the reading transaction with timestamp t started?

Thank you very much.

@tbg
Member

tbg commented Jul 14, 2016

Hi @yorkxu, the transaction won't read such data (it will come back with a higher timestamp). We optimize that a bit more using the node's timestamp (which limits these restarts to one per node), but essentially within that uncertainty interval [t, t+eps) you will restart as you encounter values, moving your start timestamp forward. If you pass t+eps, you know that anything that you see after must have started after you in absolute time.
We never read uncommitted values. If there is a provisional value, the standard intent resolution procedures kick in.
If you find committed values in your past (maybe that's what you're asking about?), of course you always read them.

I hope that helped,
Tobias

@yorkxu

yorkxu commented Jul 14, 2016

Thank you @tschottdorf

If a transaction named txn1 has start timestamp t1, and as it progresses it reads data with commit timestamp t2 (t2 greater than t1-ε but less than t1, where ε is the maximum clock skew), will txn1 read this data?

I mean the commit timestamp t2 may be later than t1 in absolute time, am I right?

Thank you very much

@tbg
Member

tbg commented Jul 14, 2016

t1 will read t2's write (which is in its past) if t2 has committed at that point. That is, either there is a committed value or it sees a provisional value and checks on t2's central transaction record. This record will either confirm the value as committed (in which case again it is read) or not committed, in which case t2 and t1 will undergo conflict resolution and the result of the read depends on its outcome (and may have to mean that t1 has to wait for t2 to complete if it can't abort it).

@tbg
Member

tbg commented Jul 14, 2016

See also the section "Transaction interactions" in https://github.com/cockroachdb/cockroach/blob/master/docs/design.md.
