
Distributed transactions #22

Closed
andybons opened this issue Jun 3, 2014 · 32 comments
@andybons
Contributor

andybons commented Jun 3, 2014

This is an umbrella issue for now for accounting purposes. Smaller issues should be created for the various paths a transaction can take.

@Frank-Jin

Hi, I am learning the design of Cockroach's transactions. I have a question: Cockroach's transactions are designed to support SSI, but I think they can't prevent the write-skew problem.

[screenshot sshot-278: write-skew example]

I think Cockroach can't detect this conflict. Maybe it's my misunderstanding.

@tbg
Member

tbg commented Nov 12, 2014

Hi fenglin99,
the read timestamp cache should cause one of the two transactions to restart. Your diagram is in absolute time, but the txns will really use their internal timestamps for all reads (ideally that is their initial provisional commit timestamp, but whenever any op in the txn returns a higher one, they must use that one or a later one). An SSI transaction wants every operation to actually be carried out at the timestamp it supplied; anything else pushes the commit timestamp, which SSI does not tolerate.
When writing, the write consults the read timestamp cache for that key. If there have been any read ops with a timestamp higher than the write's, the write will return that later timestamp.
In your example, because each transaction reads the other's key, each write will "see" the other's timestamp. So the only way both transactions commit is if they both commit at the same timestamp. That should not be possible, since our timestamps have a logical component and the two transactions touch a common set of keys and should be causally connected. Even so:
@spencerkimball what happens in the above example if the two transactions originate on different nodes and happen at exactly the same hybrid logical clock time? This is conceivable (albeit unlikely) if the two nodes don't communicate much. Both txns would end up using the same timestamp, and the read-timestamp-cache trap wouldn't work. Is there another mechanism? If not, wouldn't the read timestamp cache have to add an extra tick on top of the last read timestamp, which would mean txns that write keys they read need weird workarounds? I sure hope not.

@Frank-Jin

So the candidate timestamp can't be changed in an SSI transaction; if it is, the transaction will restart.
Is that right?
Another question: does that mean Cockroach can't support long SSI transactions, because the read cache is only kept for 10s?
Thanks!

@Frank-Jin

And where do we get the initial candidate timestamp? Is it assigned by the client or by a Cockroach server node? If it comes from a Cockroach node, when is it assigned: on the first read or the first write?

@spencerkimball
Member

GitHub is annoying in that you can't respond to individual questions.

@tschottdorf: you can't write a value at the same timestamp as a previous read. The requirement is that the write timestamp be bumped up to the previous read's timestamp + 1 logical tick, so it doesn't matter if the two transactions pick identical candidate timestamps. One will end up +1 logical tick ahead.

@fenglin99: if the candidate timestamp is changed in an SSI txn, the txn will restart. An SSI Cockroach txn which keeps sending writes for longer than 10s will end up pushing its timestamp forward, so SSI will require a restart at the end. On the restart, however, the intents are still in place. Having to always retry a long transaction is pretty terrible, though... if you're running extremely long transactions like this, you probably want to reconsider what you're doing, or possibly switch to using SI isolation.

@fenglin99: the initial candidate timestamp is taken from the gateway node which receives the first command(s) which are part of the txn. This timestamp must be assigned by a node in the cluster because it should be within the maximum clock offset of the cluster's "true time". The client will have no guarantee of being synchronized properly as it's not estimating clock offsets to other nodes on a constant basis.

@tbg
Member

tbg commented Nov 12, 2014

@spencerkimball what does an SSI transaction do that reads x, then writes x? Why won't that restart forever?

@spencerkimball
Member

The timestamp cache contains a txn id for each entry if applicable. If a write encounters an entry in the timestamp cache with its own txn id, it's able to avoid incrementing its timestamp.

@tbg
Member

tbg commented Nov 12, 2014

Ah, perfect.

@Frank-Jin

@spencerkimball ok, I see, thanks

@spencerkimball
Member

@fenglin99 you might want to take a look at kv/txn_correctness_test.go for tests which specifically verify the operation of the txn model with respect to various write anomalies.

@alkfbb

alkfbb commented Nov 16, 2014

Hi, I am reading the transaction part of the Cockroach design and have a question about each range's read timestamp cache. The read timestamp cache maintains the "latest timestamp" at which each key was read. What exactly does "latest timestamp" mean: the candidate commit timestamp of the transaction that read the key, or its start (snapshot) timestamp?

@tbg
Member

tbg commented Nov 16, 2014

Hi,
It's the timestamp which was passed to the read operation by the readers. For a txn, that's its original timestamp (when writing, the current provisional commit timestamp is used instead).

@alkfbb

alkfbb commented Nov 16, 2014

Hi, you mean a transaction only has a provisional commit timestamp? But then how does a transaction determine which version of a key is visible? Just by using its provisional commit timestamp? Do all read operations read the latest committed version of a key?

@tbg
Member

tbg commented Nov 16, 2014

Hi alkfbb, sorry. What I said was misleading. I'm editing the comment above to be more appropriate.

@cockroach-team

Hi there,

I have some more doubts about the interaction between distributed
transactions. Hope you can help; I really couldn't figure it out by
myself :S

Assume we are running transactions in SI. We will run a distributed
transaction, say TX1, composed of two operations, one for reading a value X
and one for updating a value Y.

The 99% clock skew is 10... And let's say the candidate timestamp of TX1 is
5.
When TX1 is reading X, it encounters a write intent with timestamp 4 and
that transaction has not committed. So it just pushes the commit time of
the writer transaction to 6 (By the way, how is this done? Is that 'push'
written to the transaction table, or is it sent to the coordinator of that
transaction? Does the coordinator need to wait sufficiently long time
before committing? Because incoming transactions may continually push the
commit time forward). As a result, TX1 does not read that intent.
When TX1 is updating Y, however it encounters a read for Y happened with
timestamp 6 by another transaction. So now the write timestamp for TX1 is
updated to 7.

So when committing, TX1 commits with timestamp 7. However, it didn't read
the value of X with timestamp 6. Isn't that incorrect?

Thank you very much in advance,
Lee


@tbg
Member

tbg commented Nov 20, 2014

Hi Lee,

the txn status is checked via the transaction table (that's the single source of truth). Whatever is written there is fact, and all updates to this table are atomic.
Pushing a txn means updating that other txn's candidate timestamp (unless it has already committed at the time you write; in that case the push fails. This will happen in practice.)

I'm not sure I see the problem with your example. You're reading a consistent cut of your database (at "time" 5) throughout your transaction. The transaction will commit with a later ts 7 (pushed to accommodate concurrent readers, in your example). Whatever is written between 5 and 7 doesn't concern you (there would be an uncertainty restart, but let's ignore that).
If you want to find an issue with the consistency of the snapshot, you would have to start cooking up examples that read more than one key. But since we're always reading at the same ts (the initial ts of your txn), that will be hard.

Best,
Tobias

@spencerkimball
Member

Lee,

The example you've given is actually correct behavior for snapshot isolation. It's also the root of why snapshot isolation doesn't guarantee serializable consistency--you may read values from a candidate timestamp which is earlier than your final commit timestamp. This is the basis for the write skew anomaly.

@cockroach-team

Thank you Spencer and Tobias! You are absolutely right :P..
Somehow I got the wrong idea that the snapshot timestamp and commit timestamp
are treated identically even in snapshot isolation, which is wrong :)


@cockroach-team

As far as DB isolation goes, SI exhibits relatively few anomalies, and is
even the default isolation mode on some commercial databases. Here is the
description in Wikipedia:

http://en.wikipedia.org/wiki/Snapshot_isolation

See especially the write-skew example in the Definition section, which is
similar to the example you have been pondering.

Spencer and I had a vigorous discussion as to what the default isolation
level for Cockroach should be. I like SI for performance, but he likes SSI
for correctness. Since he's the one actually doing the work, he wins. : )
Also, he's probably right. Programmers tend to trip up on these anomalies
at the worst possible time - only after their code is running in production.

~Andy


@spencerkimball
Member

There are enough issues with performance I might change my mind and make snapshot the default. Ultimately, Oracle ran with snapshot isolation claiming to be "serializable" for many years (a decade plus?) in probably 10s of thousands of installations. Further, most people using MySQL across the entire industry never go above "read committed". Isolation matters, but when you are talking about the difference between snapshot isolation and serializable consistency, it's a pretty marginal benefit.

That said, I think the difference between "read committed" and "snapshot isolation" is somewhat more dramatic. With "read committed" (and "repeatable read"), decent programmers constantly have to think about when to read "for update" which strikes me as exactly the sort of thing you want to avoid having them spend their time doing (and often getting wrong).

If you're interested in the correctness of the transaction model, take a look at https://github.com/cockroachdb/cockroach/blob/master/kv/txn_correctness_test.go. It makes sense to start reading the file at about line 613. Everything above that is part of the testing harness.

@cockroach-team

Hello guys, I am also trying to become a contributor to Cockroach. I just
started reading your documentation.
You maintain a read cache for the last 10s. So if there is a write txi with
candidate ts(txi) less than the last evicted timestamp ts(le),
then we must update ts(txi) to ts(le). But if txi is not related to
ts(le) by key, what is the reason to update it?
Why do we need to update the low-water mark when a new range replica leader is
elected?
What should I read to understand your logic more deeply?


@cockroach-team

Btw, when we update the timestamp of the current rw txn with the last
evicted timestamp, we are making a pushing action, am I right?

@tbg
Member

tbg commented Dec 8, 2014

Hi Rustem,
the read ts cache is kept by key, so you only update your provisional commit timestamp if you were trying to write a key that might already have been read at a later timestamp. Technically, pushing only happens between two transactions (one trying to push the other), but we sometimes say a timestamp is "pushed" when we just mean it is increased to accommodate, for instance, a larger entry from the tsCache.

It's great that you want to contribute. The items below look like they could be a good starting point - usually everything leads down the rabbit hole sooner or later. Just assign yourself and off you go.
Also you definitely want to participate in our code reviews - it's a great (and really the only) way to get accustomed to our conventions as well as the codebase.

@Rustem

Rustem commented Dec 8, 2014

Thank you, Tobias (@tschottdorf), for the nice explanation of pushing.
Actually I am from the Python world, but I think the language is not a problem at all.
Could you please recommend a set of papers that would give me a clear picture of the distributed transactions you implemented and why they work? I have read the first three: Cahill, Yabandeh, and the one about Calvin. I just want to understand why this works. I have read the Spanner paper many times, but still have a few misunderstandings. Maybe it's better to start coding rather than reading.

See the Cahill paper for one possible implementation of SSI. This is another great paper. For a discussion of SSI implemented by preventing read-write conflicts (in contrast to detecting them, called write-snapshot isolation), see the Yabandeh paper, which is the source of much inspiration for Cockroach’s SSI. 

@Rustem

Rustem commented Dec 8, 2014

Why do we need to update the low-water mark when a new range replica leader is
elected? Because the election is also a transaction?

@tbg
Member

tbg commented Dec 8, 2014

the tsCache is associated with the respective replica leader. The new leader's clock will not be exactly in sync with the old leader's, and if we don't raise the high-water mark to account for the offset, we could wind up not pushing the timestamp of a write that should have been pushed, because the old leader might have seen a read at that timestamp before losing leadership.

@cockroach-team

Now I get it. I forgot that the timestamp is taken from the range replica
leader node. Each node has a clock offset ε such that RealTime - ε <=
NodeTime <= RealTime + ε. That's why an rw transaction just after a new
election is at risk of not being pushed, and why serializability might be
lost. Am I right?


@andybons andybons modified the milestone: v0.1 (Beta) Jan 13, 2015
@yorkxu

yorkxu commented Jul 14, 2016

Hi,
I read the distributed transaction part of cockroach design doc.
I have some doubt about how to choose a timestamp:

For multiple nodes, the timestamp of the node coordinating the transaction t is used. In addition, a maximum timestamp t+ε is supplied to provide an upper bound on timestamps for already-committed data (ε is the maximum clock skew). As the transaction progresses, any data read which have timestamps greater than t but less than t+ε cause the transaction to abort and retry with the conflicting timestamp tc, where tc > t. The maximum timestamp t+ε remains the same. This implies that transaction restarts due to clock uncertainty can only happen on a time interval of length ε.

If the data read has a timestamp greater than t-ε but less than t, how do we decide whether the transaction that wrote it committed before or after the start timestamp of the reading transaction?

Or is there another mechanism that guarantees all data with timestamps greater than t-ε but less than t was committed before the reading transaction with timestamp t started?

Thank you very much.

@tbg
Member

tbg commented Jul 14, 2016

Hi @yorkxu, the transaction won't read such data (it will come back with a higher timestamp). We optimize that a bit more using the node's timestamp (which limits these restarts to one per node), but essentially within that uncertainty interval [t, t+eps) you will restart as you encounter values, moving your start timestamp forward. If you pass t+eps, you know that anything that you see after must have started after you in absolute time.
We never read uncommitted values. If there is a provisional value, the standard intent resolution procedures kick in.
If you find committed values in your past (maybe that's what you're asking about?), of course you always read them.

I hope that helped,
Tobias

@yorkxu

yorkxu commented Jul 14, 2016

Thank you @tschottdorf

If a transaction named txn1 has start timestamp t1, and as it progresses it reads data with commit timestamp t2 (t2 greater than t1-ε but less than t1, where ε is the maximum clock skew), will txn1 read this data?

I mean the commit timestamp t2 may be later than t1 in absolute time, am I right?

Thank you very much

@tbg
Member

tbg commented Jul 14, 2016

t1 will read t2's write (which is in its past) if t2 has committed at that point. That is, either there is a committed value or it sees a provisional value and checks on t2's central transaction record. This record will either confirm the value as committed (in which case again it is read) or not committed, in which case t2 and t1 will undergo conflict resolution and the result of the read depends on its outcome (and may have to mean that t1 has to wait for t2 to complete if it can't abort it).

@tbg
Member

tbg commented Jul 14, 2016

See also the section "Transaction interactions" in https://github.com/cockroachdb/cockroach/blob/master/docs/design.md.
