GC of Transaction table #2062

Closed
tbg opened this issue Aug 12, 2015 · 20 comments · Fixed by #3185

@tbg (Member) commented Aug 12, 2015

This is the missing part of #1873.

GC of proto.Transaction

A Transaction record can be removed if its last heartbeat is sufficiently
old (say, > 4*HeartbeatInterval) and if no intents which should be visible
remain (i.e. we only need to ensure that for committed transactions).
If no heartbeat timestamp is present, the Transaction timestamp is used instead.
Those are HLC timestamps, but on the time scales involved, this is of no concern.
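
For concreteness, a minimal Go sketch of that removal rule; the types, field names and `heartbeatInterval` value are invented for illustration and are not the actual CockroachDB API:

```go
package main

import (
	"fmt"
	"time"
)

const heartbeatInterval = time.Second // placeholder value, not the real setting

// txnRecord is a stand-in for proto.Transaction with only the fields
// relevant to the GC decision.
type txnRecord struct {
	Status        string    // "PENDING", "ABORTED" or "COMMITTED"
	Timestamp     time.Time // transaction timestamp
	LastHeartbeat time.Time // zero if no heartbeat was ever recorded
	Intents       []string  // keys of possibly still-visible intents
}

// gcable reports whether the record may be removed at time now.
func gcable(t txnRecord, now time.Time) bool {
	hb := t.LastHeartbeat
	if hb.IsZero() {
		// No heartbeat timestamp present: fall back to the txn timestamp.
		hb = t.Timestamp
	}
	if now.Sub(hb) <= 4*heartbeatInterval {
		return false // too recent; the coordinator may still be alive
	}
	// Only for committed transactions do we need to make sure that no
	// possibly-visible intents remain before removing the record.
	if t.Status == "COMMITTED" && len(t.Intents) > 0 {
		return false
	}
	return true
}

func main() {
	old := time.Now().Add(-10 * heartbeatInterval)
	fmt.Println(gcable(txnRecord{Status: "ABORTED", Timestamp: old}, time.Now())) // true
}
```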

The GC queue is in charge of periodically grooming the Transaction table.
It can't ever GC a successfully committed Transaction which lists possibly-
unresolved intents, though. Encountering one of those requires prior intent
resolution. It will be easiest if the queue just gives those intents to the range
to handle them asynchronously as skipped intents, which should render
the Transaction GCable in the next pass.

Any successful ResolveIntent (or collection thereof) should be followed by
an invocation of a new RPC ResolveTransaction (or overload EndTransaction
for this purpose) which updates the Transaction record, removing
the resolved intents from the list stored within the Transaction.
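
A rough sketch of the bookkeeping such an update would do; the record layout and names are made up, and only the "delete the record once the list is empty" behavior is from this thread:

```go
package txngc

// record is an illustrative stand-in for the Transaction record; only the
// stored intent keys matter for this sketch.
type record struct {
	Intents []string
}

// removeResolved drops the freshly resolved intents from the record and
// reports whether the record can now be deleted outright (i.e. the new
// list is empty). This mirrors the ResolveTransaction idea above; it is
// not the actual RPC.
func removeResolved(rec record, resolved map[string]bool) (record, bool) {
	remaining := rec.Intents[:0]
	for _, key := range rec.Intents {
		if !resolved[key] {
			remaining = append(remaining, key)
		}
	}
	rec.Intents = remaining
	return rec, len(rec.Intents) == 0
}
```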

PushTxn

When executing against an absent on-disk record, the Transaction is
considered ABORTED if the intent's Transaction timestamp is more than
2*HeartbeatInterval old, and no new entry is written in that case.

The choice of {2,4}*HeartbeatInterval above makes sure that PushTxn
will always find the Transaction record or decide correctly that it's ABORTED
without the need to write a new one.
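
A sketch of that decision in Go, with invented names; the real PushTxn has more inputs and outcomes than this:

```go
package txngc

import "time"

const heartbeatInterval = time.Second // placeholder value

// pushAgainstMissingRecord models the case above: the pushee has no on-disk
// transaction record, so the only timestamp available is the one taken from
// the intent. If that timestamp is more than 2*heartbeatInterval old, the
// pushee is reported as ABORTED without writing a new record; otherwise the
// push cannot conclude anything yet.
func pushAgainstMissingRecord(intentTxnTimestamp, now time.Time) (aborted bool) {
	return now.Sub(intentTxnTimestamp) > 2*heartbeatInterval
}
```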

successful EndTransaction

Resolve local intents right away. If that includes all intents, we can remove
the record entirely. Otherwise, add the remainder to the Transaction record and return them for asynchronous resolution.
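
As a sketch (all names invented), the commit path described above could look roughly like this:

```go
package txngc

// commitOutcome captures what EndTransaction does with the intents on a
// successful commit, per the description above.
type commitOutcome struct {
	DeleteRecord bool     // true if resolving local intents covered everything
	StoreOnTxn   []string // external intents persisted on the record
	ResolveAsync []string // the same external intents, resolved asynchronously
}

// commit resolves local intents right away (via resolveLocal) and decides
// what to do with the record and the remaining external intents.
func commit(localIntents, externalIntents []string, resolveLocal func(key string)) commitOutcome {
	for _, key := range localIntents {
		resolveLocal(key) // local intents are resolved immediately
	}
	if len(externalIntents) == 0 {
		return commitOutcome{DeleteRecord: true}
	}
	return commitOutcome{StoreOnTxn: externalIntents, ResolveAsync: externalIntents}
}
```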

failed EndTransaction

This fails either because of regression errors (these amount to bugs) or,
more interestingly, because the Transaction was aborted by a concurrent
writer.

The prime objective is to remove the intents as soon as possible, but we can't
write on error. So we leave the Transaction record and resolve all intents
asynchronously.

Note that in the case of regression errors (which, again, amount to bugs),
this can leave a Transaction in the PENDING state; it will likely never
commit and can be aborted due to timeout at some point.

tbg self-assigned this Aug 12, 2015
tbg changed the title from "missing part of #" to "missing part of #1873" Aug 12, 2015
@spencerkimball (Member)

I think you should also remove the transaction record even if there are non-local intents. You'd just do it after successful asynchronous resolution of intents, assuming the EndTransaction was called with all intents.

@tbg (Member Author) commented Aug 12, 2015

yes, the RPC which updates the intents list will kill the entry when the new list is empty.

tbg added a commit to tbg/cockroach that referenced this issue Aug 14, 2015
this is step two (and not the last) for safely garbage collecting
transaction entries (see cockroachdb#2062).

EndTransaction calls now store the non-local portion of the intents
in their transaction record (if there are any; otherwise the record
is instantly deleted; this is not new in this change).

on each run of the GC queue for a given range, its transaction table
is scanned and the following actions are taken:

* intents referenced from any "old" transaction records are resolved
  asynchronously
* old pending transactions are pushed (which will succeed), effectively
  aborting them
* old aborted transactions are added to the GC request.

TODO:
* necessary to send extra GC request?
* unregister intents from their Txn's entries when they've been resolved
  (currently they are never removed, so each GC run will waste work resolving
  a long gone intent every time and can never GC Txns with non-local intents)
@bdarnell (Member)

I'm worried about the timing dependence here. I think we need some extra safeguards: HeartbeatTxn and EndTransaction should refuse to write a transaction record that could have been GC'd previously. I think this means that each such command must include the previous heartbeat timestamp, and if it is at least 2*HeartbeatInterval old by the time the command is applied, the command becomes a noop.
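
Something like the following check, as a sketch of that safeguard; the 2*HeartbeatInterval cutoff is from this comment, everything else is invented:

```go
package txngc

import "time"

const heartbeatInterval = time.Second // placeholder value

// becomesNoop sketches the proposed safeguard: a HeartbeatTxn or
// EndTransaction carries the previous heartbeat timestamp, and if that
// timestamp is already at least 2*heartbeatInterval old when the command
// applies, the command does nothing rather than recreate a record that may
// have been GC'd.
func becomesNoop(prevHeartbeat, applyTime time.Time) bool {
	return applyTime.Sub(prevHeartbeat) >= 2*heartbeatInterval
}
```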

The first sentence is unclear; I would separate the committed and uncommitted case explicitly:

A committed Transaction record can be deleted iff all of its intents are resolved. A pending or aborted Transaction record can be deleted if its last heartbeat time was at least 4*HeartbeatInterval ago (since a transaction this old will not be allowed to commit).

I'm unconvinced by "Those are HLC timestamps, but on the time scales involved, this is of no concern." We should always be explicit about which clocks are being used for any comparison. GC decisions are (presumably) made outside of raft and can see the real time, but HeartbeatTxn (and other raft commands) see only a timestamp which was captured before the command was submitted to raft. I think we need to make sure that the GC queue only makes a preliminary decision about whether a txn is GCable, and the real work is done inside a new raft command which repeats the time check (to make sure it is well-ordered with respect to any concurrent commands in the raft pipeline). In fact, if we do this I'm not sure if we even need the 2* vs 4* distinction.

@tbg (Member Author) commented Aug 14, 2015

I'm worried about the timing dependence here. I think we need some extra safeguards: HeartbeatTxn and EndTransaction should refuse to write a transaction record that could have been GC'd previously. I think this means that each such command must include the previous heartbeat timestamp, and if it is at least 2*HeartbeatInterval old by the time the command is applied, the command becomes a noop.

I think we can rely on the GC metadata here. That contains a timestamp from the HLC clock which is basically the Now() used for all GC decisions, and it is persisted along with the execution of the GC request (which actually clears keys from the engine and goes through Raft normally).

Consequently, if you're heartbeating/pushing/committing and there isn't an entry, you get the persisted GC metadata from the engine. If that timestamp minus the GC threshold is higher than what you're operating at, then the entry might have been GCed. Then for a Push you succeed (telling the pusher the Pushee has been aborted, but without writing a new entry), and for a Heartbeat or EndTransaction, you return an error. To make it easier, if there isn't a catch and we can do this, we should probably put some more info into the metadata (i.e. the threshold timestamps which were used for that GC cycle) so that the above code needs less time arithmetic.
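
A sketch of that check; the metadata layout here is invented, and only the comparison mirrors the description above:

```go
package txngc

import "time"

// gcMetadata is an illustrative stand-in for the persisted GC metadata: the
// HLC "Now()" used for the last GC run plus the age threshold it applied.
type gcMetadata struct {
	LastGCNow   time.Time
	TxnGCWindow time.Duration
}

// mightHaveBeenGCed is consulted by a heartbeat/push/commit that finds no
// transaction record: if the operation's timestamp is older than what the
// last GC run would have kept, the entry may already have been GC'd. A push
// can then report the pushee as aborted (without writing a new entry); a
// heartbeat or EndTransaction should return an error instead.
func mightHaveBeenGCed(meta gcMetadata, opTimestamp time.Time) bool {
	return opTimestamp.Before(meta.LastGCNow.Add(-meta.TxnGCWindow))
}
```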

@spencerkimball (Member)

There are benefits to leaving the current compaction-based GC of transaction records in place. This mechanism relies on getting a cluster-wide low water mark on oldest extant intent and allowing the normal compaction to clean up any straggling transaction records.

  • Compaction is already implemented
  • Not as dicey on the timing related stuff, though I think what you're proposing will work--it's just harder to puzzle out
  • Don't need to store or update intents on committed transaction records

@tbg (Member Author) commented Aug 15, 2015

re @spencerkimball's comment in #2111:

How can an actor which encounters an intent and is able to push the attached txn distinguish between the txn having been auto-committed (meaning the intent will have been concurrently cleaned up) vs. still pending (in which case the txn record should be written as either ABORTED or with a pushed timestamp)?

It's an annoying edge case. At the point at which the reader tries the push and doesn't find the entry in the table, there isn't an intent any more (it's been removed synchronously with the commit).
If you knew that was what was happening, logically the Push should fail (after all, if there were an entry, it would read COMMITTED). But without knowing whether there actually is an intent (which you don't at that point, because you don't want to send the intent along with the Push), you of course can't distinguish this from the case in which the Pushee simply hasn't written to the Txn table yet.

But the current behavior should be fine: the entry is recreated as PENDING or ABORTED. Of course nobody's ever going to touch that entry again, so it'll get GC'ed eventually. The Pusher will either retry or try to resolve the intent (which should work, because there isn't an intent), and only when it writes again will it find that the value is in fact there (tough luck) and have to deal with it.

The alternative to the above is to only remove the Txn entry if the transaction committed in a single round-trip (after we've introduced the Batch handling on Replica) because that means that nobody else can ever see any intents.

@tbg (Member Author) commented Aug 15, 2015

There are benefits to leaving the current compaction-based GC of transaction records in place. This mechanism relies on getting a cluster-wide low water mark on oldest extant intent and allowing the normal compaction to clean up any straggling transaction records.

I've thought about this some more. My major discomfort with this was that it's a centralized system, but I'm getting more comfortable with it and think it might be the better solution. I'm 70% through the changes proposed above, but I'm going to freeze that until we've reached a decision. Let me outline my understanding of how I think it would work; please add anything I missed. A rough sketch follows the list.

  • each leader Replica periodically updates its RangeTree entry to include its oldest intent (after each GC run).
  • one replica set (probably the one which has the root) is in charge of periodically combing through the RangeTree and collecting the global oldest intent's timestamp. That is then added to Gossip (with infinite lifetime and, for diagnostics, the Range reporting it - the danger of this system is that a single Range can stop this GC globally if its lowest timestamp remains the same).
  • when a Replica runs GC, it gets this timestamp from Gossip (if there's nothing: ZeroTimestamp, i.e. no GC for txn records). It can then kill off all Txn records which are older than that with the issued GC command, unless they're still being heartbeat. The GC scan also gets the oldest timestamp among all intents found locally (potentially trying to resolve synchronously, and taking into account only those that couldn't be resolved) - it's already mostly doing this - to perform step one above.
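
A rough sketch of the replica-side step (the last bullet), with all names invented and the gossiped watermark taken as given:

```go
package txngc

import "time"

// txnEntry is an illustrative view of a scanned transaction-table entry.
type txnEntry struct {
	Key           string
	Timestamp     time.Time // txn timestamp on the record
	LastHeartbeat time.Time // last heartbeat seen for the record
}

// gcTxnRecords collects the keys of transaction records that may be killed
// off given the gossiped cluster-wide oldest-intent timestamp. A zero
// watermark means no information is available, so no txn records are GC'd.
// Records that are still being heartbeat (newer than heartbeatCutoff) are
// left alone.
func gcTxnRecords(entries []txnEntry, gossipedOldestIntent, heartbeatCutoff time.Time) []string {
	if gossipedOldestIntent.IsZero() {
		return nil // no cluster-wide watermark: don't GC txn records at all
	}
	var toGC []string
	for _, e := range entries {
		stillHeartbeat := e.LastHeartbeat.After(heartbeatCutoff)
		if e.Timestamp.Before(gossipedOldestIntent) && !stillHeartbeat {
			toGC = append(toGC, e.Key)
		}
	}
	return toGC
}
```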

Assuming the RangeTree stuff is on solid ground, I think this could work nicely.

I'm still a little concerned about the RangeTree though. With many ranges, it will constantly be in flux, rotating around etc. Have we checked that it can actually keep up with the txn contention on that scale? I had the impression that every mutation on the tree requires, more or less, exclusive access because there's a lot of walking around the tree involved. Since it needs to be updated synchronously with splits/merges/rebalances, that thing really needs to work smoothly and not have a lot of Txn restarts.
For the oldest intent age, well, it doesn't matter too much, because we can read the tree at a past timestamp. For updating the oldest intent age as well, that's just a CPut on a single node. It's really more generally about how expensive the upkeep is.

I've looked at the RangeTree code and there's a lot of work to be done on it, including the Txn handling (currently it just takes a client.Txn, so a restart kills us). I'll take some time to think things through. @BramGruneir, probably going to bother you with a thing or two :-)

tbg changed the title from "missing part of #1873" to "GC of Transaction table" Aug 15, 2015
@petermattis (Collaborator)

I'm a little worried about the RangeTree stuff as well. It feels like a solution in search of a problem. I'm not sure it is the right solution for doing mapreduce operations over tables (e.g. for backfilling a newly created index). I haven't thought enough about the tracking of the oldest intent to decide if it is overkill there or not.

@tbg (Member Author) commented Aug 15, 2015

I think we need to properly flesh out the RangeTree in general. I'm not even sure you can use it for much because it keeps changing from under your hands, so everyone reading it consistently will have contention (and for updating it, I suspect that's what's happening anyways). If you read at a past timestamp, not sure what the old version is worth because it doesn't correspond to the actual ranges any more.

@BramGruneir (Member)

Just let me know what improvements/changes need to go into it. With the change to using a classic rb tree instead of the llrb one, it reduces the number of rotations required to keep it balanced.
And if it's not the right solution, let's get rid of it, since it slows down splits and merges.

@bdarnell (Member)

Using the range-local GC metadata timestamp satisfies my concerns about clock sensitivity.

A cluster-wide GC watermark makes me very nervous. In addition to the RangeTree concerns expressed here, I'm worried about the use of gossip. One replica may GC with one timestamp, then after a leadership change another replica could GC with an older timestamp. Do we really need the cluster-wide watermark? As long as Push and Heartbeat refuse to create new transaction records that are already old enough to be GC'd, I don't see why we'd need it.

@tbg (Member Author) commented Aug 15, 2015

It would save us from having to store a list of intents on successfully committed transactions. I don't see why Gossip would be an issue. The info is always from the RangeTree (possibly at some past timestamp, so the oldest intent timestamp might be lower than what it needs to be). If a replica gossips one and then another takes over and gossips something lower, then it must have read at an even older timestamp for whatever reason. That doesn't mean anything is incorrect though, just less optimal. You could even (maybe because the Range containing the RangeTree root becomes unavailable) suddenly see the zero timestamp. At which point you just don't GC Txn entries at all.

But yeah, the globality of it worries me. If I trusted in the RangeTree I think I would be less concerned, but I'm not even sure if the concept of it winds up practical. We haven't discussed it much.

@bdarnell (Member)

Ping. What's the status here?

@tbg (Member Author) commented Sep 30, 2015

@spencerkimball and I need to pick up the discussion again. I have a half-finished branch in my graveyard for the local option.
Should also mention that if we did auto-GC when the last intent is resolved, there's no guarantee that a "slow" concurrent Push (from someone still trying to resolve) won't come in and recreate a transaction record we just deleted, so we'd need precautions against that as well.
This issue is a bit on the trickier side still.

@spencerkimball (Member)

How about if we always write the transaction record on the first write? Because the transaction records are located on the same range as the first key, this wouldn't require any extra RPCs or latency.

Then, if a push txn request arrives for a txn which is missing, it's a simple failure. There is one special case that I can think of, but it'll still work if we just treat the missing txn record as a failure (it will retry):

  • Txn writes keys "a" and "b" in parallel to different ranges.
  • Write for key "a", and associated txn record are delayed.
  • Write for key "b" succeeds, but another write for "b" arrives and attempts to push extant txn.
  • Push arrives at "a"'s range, but write for key "a", and associated txn record haven't yet been written.

@spencerkimball (Member)

@tschottdorf I'm experimenting with adding a begin transaction request.

@tbg (Member Author) commented Oct 1, 2015

The idea sounds good (not the one about adding BeginTransaction, though). I'm not too worried about parallel transactions because they're likely not working right now for multiple reasons. Why are you considering a dedicated BeginTransaction? That seems premature unless there's something I'm missing.

@spencerkimball (Member)

Because I want to create a transaction record preemptively, not lazily. This is integral to the scheme of ignoring erstwhile heartbeats and push txn requests which would otherwise recreate garbage txn records.

@tbg (Member Author) commented Oct 1, 2015

But you're optimizing against a delay in parallel usage, which I don't think we'll have in beta. That seems like something we'd want to do when we want to make parallel txns work.

@spencerkimball (Member)

The current plan (after some offline discussion with Tobias) is the following (a rough sketch follows the list):

  • Introduce a BeginTransaction request to be sent in a batch with the first txn write.
  • The txn record is written when executing BeginTransaction.
  • In the event that a push or heartbeat arrives for a txn without a record, write an aborted txn record.
  • BeginTransaction will fail upon discovering an aborted txn record.
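
A minimal sketch of the asymmetry this plan relies on; the map-based store and all names are invented, and only the behaviors match the bullets above:

```go
package txngc

import "errors"

// store maps txn ID to status; purely illustrative.
type store map[string]string

// beginTransaction writes the record up front, alongside the first txn
// write, and fails if it finds that the txn was already aborted.
func beginTransaction(s store, txnID string) error {
	if s[txnID] == "ABORTED" {
		return errors.New("transaction already aborted")
	}
	s[txnID] = "PENDING"
	return nil
}

// onPushOrHeartbeatWithoutRecord handles a push or heartbeat that arrives
// for a txn without a record: instead of resurrecting a (possibly GC'd)
// record as PENDING, an aborted record is written.
func onPushOrHeartbeatWithoutRecord(s store, txnID string) {
	if _, ok := s[txnID]; !ok {
		s[txnID] = "ABORTED"
	}
}
```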

spencerkimball added a commit that referenced this issue Oct 8, 2015
BeginTransactionRequest is automatically inserted by the KV client
immediately before the first transactional write. On execution,
BeginTransaction creates the transaction record.

If a heartbeat arrives for a txn that has no transaction record, it's
ignored. If a push txn arrives for a txn that has no transaction record,
the txn is considered aborted.

This solves the problem of errant heartbeats or pushes recreating a GC'd
transaction record, addressing #2062.
tbg added a commit to tbg/cockroach that referenced this issue Nov 21, 2015
see cockroachdb#2062.

on each run of the GC queue for a given range, the transaction
and sequence prefixes are scanned and the following actions are taken:

* old pending transactions are pushed (which will succeed), effectively
  aborting them
* old aborted transactions are added to the GC request.
* aborted and committed transactions will have the intents referenced
  in their record resolved synchronously and are GCed (on success)
* sequence cache entries which are "old" and belong to "old" (or
  nonexistent) transactions are deleted.
tbg closed this as completed in #3185 Nov 24, 2015