storage: timestampCache ignores txn IDs on timestamp collisions #9083

Closed
bdarnell opened this Issue Sep 3, 2016 · 4 comments

@bdarnell
Member

bdarnell commented Sep 3, 2016

The timestamp cache avoids adding redundant spans to its interval tree. However, its definition of "redundant" is incomplete. A timestamp cache entry consists of a timestamp, a key span, and a transaction ID; spans are considered redundant based solely on their timestamp and keys without regard for the transaction ID.
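
To make the shape of the problem concrete, here is a minimal sketch of a cache entry and the redundancy check. All names (tscache, cacheEntry, redundant) are illustrative; this is not the actual storage package code, just the structure the bug depends on.

```go
package tscache // illustrative package name, not the real one

import "bytes"

// Timestamp is a simplified hybrid-logical-clock timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// cacheEntry is a simplified timestamp cache entry: a key span, the
// timestamp of the access, and the ID of the transaction that performed
// it (nil for non-transactional operations).
type cacheEntry struct {
	start, end []byte // key span [start, end)
	timestamp  Timestamp
	txnID      []byte
}

// redundant reports whether `add` can be skipped because `existing`
// already covers it. This mirrors the bug described above: the decision
// looks only at keys and timestamps, never at txnID, so an entry owned
// by one transaction can absorb a read that another transaction needs
// to "own".
func redundant(existing, add cacheEntry) bool {
	return bytes.Compare(existing.start, add.start) <= 0 &&
		bytes.Compare(existing.end, add.end) >= 0 &&
		!existing.timestamp.Less(add.timestamp)
}
```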

The purpose of the transaction ID is to allow transactions to read their own writes or to write after they have done a read. This turns out to matter only for the most recent read or write on a given key, so as long as timestamps are globally unique, everything is fine. When two transactions have the same timestamp, problems arise. (Duplicate timestamps may sound rare, but there are various mechanisms that can funnel transactions onto the same timestamps, especially in the presence of clock offsets between nodes.)

Consider two transactions executing SELECT COUNT(*) FROM tbl followed by INSERT INTO tbl (x) VALUES (?) (writing the count that was just read into the table). Serializability requires that this produce no duplicates in the table: two transactions may both read from the table, but only one will be able to write and commit because it will encounter the other transaction's read in the timestamp cache.
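
For reference, the read-then-write pattern looks roughly like this when driven through database/sql; the driver import, placeholder syntax, and connection string are assumptions for illustration, not part of the original report.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any Postgres-wire driver works against CockroachDB
)

// readThenWrite runs the transaction from the example: read the current
// row count, then insert it back into the table ($1 is the pq placeholder
// syntax for the "?" above).
func readThenWrite(db *sql.DB) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	var count int
	if err := tx.QueryRow("SELECT COUNT(*) FROM tbl").Scan(&count); err != nil {
		tx.Rollback()
		return err
	}
	if _, err := tx.Exec("INSERT INTO tbl (x) VALUES ($1)", count); err != nil {
		tx.Rollback()
		return err
	}
	// If two copies of this run concurrently, serializability says they
	// must not both commit the same count; one has to restart and observe
	// the other's insert.
	return tx.Commit()
}

func main() {
	// Placeholder connection string for a local single-node cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := readThenWrite(db); err != nil {
		log.Fatal(err)
	}
}
```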

When the two transactions have different timestamps, the one with the higher timestamp effectively "owns" the newest span in the timestamp cache, thus guaranteeing that only this transaction will be allowed to commit. When they have the same timestamp, however, the one that reads second "steals" ownership of that span. If the first transaction has already performed its write, this could allow both transactions to write and commit.

The solution is to split the baby: when considering a new span with the same timestamp as one already in the timestamp cache, the transaction ID for all shared parts of the span must be cleared, so neither transaction "owns" it. Then both transactions will be forced to restart when they attempt to write, since they will see a conflict with the existing read (which now appears to be non-transactional).
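
Sketched against the simplified types from the first snippet (again illustrative, not the actual patch), the rule might look like:

```go
package tscache

import "bytes"

// overlaps reports whether the key spans of two entries intersect.
func overlaps(a, b *cacheEntry) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// handleCollision applies the rule described above: when a new span
// carries exactly the same timestamp as an overlapping entry owned by a
// different transaction, the shared keys must lose their owner. For
// brevity this clears the txn ID on both whole entries; the real fix has
// to slice out just the shared sub-span so non-overlapping parts keep
// their owners.
func handleCollision(existing, add *cacheEntry) {
	if existing.timestamp == add.timestamp &&
		!bytes.Equal(existing.txnID, add.txnID) &&
		overlaps(existing, add) {
		existing.txnID = nil
		add.txnID = nil
	}
}
```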

bdarnell self-assigned this Sep 3, 2016

@petermattis
Contributor

petermattis commented Sep 3, 2016

Wow, nice find.

@bdarnell
Member

bdarnell commented Sep 4, 2016

As I've worked on tests for this, I've found that there are other issues that don't require duplicate timestamps. Consider the "left partial overlap" scenario (similar arguments apply to many other branches of this switch statement). Once the old span has been truncated, future calls to timestampCache.GetMaxRead on keys within the overlap by the new span's transaction will look past the new span and "see" all the way to the low water mark, instead of stopping at the old span. Much of the slicing and dicing of intervals (introduced in #4789) violates the semantics of GetMaxRead when transaction IDs are considered.

I don't think we can provide the documented semantics of GetMaxRead while retaining the memory savings of #4789. Fortunately, I'm not sure we have to. We use the timestamp cache to tell us whether a transaction can act at a certain timestamp, or if it needs to be pushed into the future. As long as the value returned by GetMaxRead is in the past, it doesn't matter when in the past it is.

I propose to change the interface to GetMaxRead (and GetMaxWrite) so that instead of taking a transaction ID argument, they return the transaction ID of the latest span in the cache. The caller can then recognize its own transaction ID and see if it's a match. It doesn't actually need to know about what's behind it.
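
A rough sketch of that interface change, with illustrative names and a naive linear scan standing in for the real interval tree (ties at the same timestamp are addressed in the follow-up comment below):

```go
package tscache

import "bytes"

// timestampCache is a toy stand-in for the interval-tree-backed cache.
type timestampCache struct {
	lowWater Timestamp
	entries  []cacheEntry
}

// GetMaxRead no longer takes the caller's transaction ID. It returns the
// timestamp of the newest read overlapping [start, end) together with the
// ID of the transaction that performed it (nil if unowned), and lets the
// caller decide whether that read was its own.
func (tc *timestampCache) GetMaxRead(start, end []byte) (Timestamp, []byte) {
	maxTS, maxTxnID := tc.lowWater, []byte(nil)
	for _, e := range tc.entries {
		if bytes.Compare(e.start, end) >= 0 || bytes.Compare(start, e.end) >= 0 {
			continue // no overlap with the queried span
		}
		if maxTS.Less(e.timestamp) {
			maxTS, maxTxnID = e.timestamp, e.txnID
		}
	}
	return maxTS, maxTxnID
}

// needsPush shows the caller side: a write at writeTS by txnID must be
// pushed above any read it did not perform itself.
func needsPush(tc *timestampCache, start, end, txnID []byte, writeTS Timestamp) bool {
	readTS, readTxnID := tc.GetMaxRead(start, end)
	if readTxnID != nil && bytes.Equal(readTxnID, txnID) {
		return false // the newest read on this span was our own
	}
	return !readTS.Less(writeTS) // readTS >= writeTS: push required
}
```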

@petermattis
Contributor

petermattis commented Sep 4, 2016

I'm probably missing something: why does GetMaxRead on keys within the overlap look past the new span? The transaction IDs between the new span and the transaction will differ, right?

@bdarnell
Member

bdarnell commented Sep 4, 2016

As far as I can tell there's no reason for GetMaxRead to look past the first span. I've made the refactor to have it return rather than take a transaction ID and it seems to be working. There was just one thing slightly tricky about it, which is that when there are two non-overlapping cache entries at the same timestamp, we must report a nil txnID instead of using either of them.
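
In terms of the earlier sketch, that tie rule could be expressed as a small merge step when combining results from multiple overlapping entries (illustrative only):

```go
package tscache

import "bytes"

// combine merges two (timestamp, txnID) results, e.g. from two cache
// entries that both overlap the queried span. On a timestamp tie between
// two different transactions, no single transaction owns the result, so
// the returned txnID is nil and either writer will be pushed.
func combine(ts1 Timestamp, id1 []byte, ts2 Timestamp, id2 []byte) (Timestamp, []byte) {
	if ts1.Less(ts2) {
		return ts2, id2
	}
	if ts2.Less(ts1) {
		return ts1, id1
	}
	if bytes.Equal(id1, id2) {
		return ts1, id1 // same timestamp, same owner
	}
	return ts1, nil // same timestamp, different owners: unowned
}
```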

bdarnell added a commit to bdarnell/cockroach that referenced this issue Sep 4, 2016

storage: Handle timestamp collisions in timestampCache
When two intervals in the timestamp cache have the same timestamp,
neither of them can be said to own that timestamp. We must adjust
intervals and clear transaction IDs whenever this collision occurs.

Failure to do so allowed one transaction to write after another
transaction had read at the same timestamp, leading to a violation of
serializability.

Fixes #9083

tamird closed this in c0fd012 Sep 8, 2016
