rfc: backup/restore/ingest #8966

Merged

danhhz merged 1 commit into cockroachdb:master from danhhz:rfc_backup_restore on Sep 26, 2016

Conversation

@danhhz
Contributor

@danhhz danhhz commented Aug 30, 2016

Any durable datastore is expected to have the ability to save a snapshot of data
and later restore from that snapshot. Even in a system that can gracefully
handle a configurable number of node failures, there are other motivations: a
general sense of security, "Oops I dropped a table", legally required data
archiving, and others.

Additionally, in an era where it's easy and useful to produce datasets in the
hundreds-of-gigabytes to terabyte range, CockroachDB has an opportunity for a
competitive advantage. First-class support for using consistent snapshots as an
input to a bulk data pipeline (without any possibility of affecting production
traffic), as well as the ability to very quickly serve the output of such
pipelines, could be a deciding factor for potential customers.

Closes #8191.



@danhhz
Contributor Author

danhhz commented Aug 30, 2016

I was hitting the limit of my knowledge of the system, so there are a few fuzzy details left. Please poke holes in what's here.

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 60 [r1] (raw file):

(for testing and nfs mounts).

_TODO(dan): Is there any advantage of using WriteBatch over a file of proto

WriteBatches are faster to apply than higher-level formats, or at least they should be. This difference may be irrelevant since we have to rekey the backup anyway to apply it.


docs/RFCS/backup_restore.md, line 76 [r1] (raw file):

## Full Backup

An hlc timestamp is given as the end time (likely `now` or `now - max skew`) and

Must be no greater than now - max offset to avoid any possibility of uncertainty-related restarts. Choosing a time further in the past (up to a few minutes) will help minimize interference with in-progress transactions.

s/skew/offset/g
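For concreteness, a minimal Go sketch of the timestamp rule above; the `slack` back-off and the function itself are illustrative, not part of the RFC:

```go
package backup

import "time"

// chooseBackupEndTime picks the wall-time portion of a backup's end
// timestamp. Assumption: a fixed extra slack is acceptable; the discussion
// above only mandates the now - maxOffset bound.
func chooseBackupEndTime(now time.Time, maxOffset, slack time.Duration) time.Time {
	// No greater than now - maxOffset (avoids uncertainty-related restarts);
	// backing off further reduces interference with in-progress transactions.
	return now.Add(-(maxOffset + slack))
}
```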


docs/RFCS/backup_restore.md, line 77 [r1] (raw file):

An hlc timestamp is given as the end time (likely `now` or `now - max skew`) and
the start time is implicitly `0,0`. All range descriptors are scanned in a txn

If the DistSender supported parallel RPCs, could we use that instead of building a new custom scheduler?


docs/RFCS/backup_restore.md, line 85 [r1] (raw file):

`system.jobs` table. Link to an RFC for creating new system tables._

Via rpc (not Raft), a replica of each segment is tasked with backing it up. The

What kind of RPC?


docs/RFCS/backup_restore.md, line 86 [r1] (raw file):

Via rpc (not Raft), a replica of each segment is tasked with backing it up. The
replica waits until the end time + max skew is in the past and then flushes

"In the past" needs to be defined precisely. Does this operation require the range lease?

What exactly does it mean to flush rocksdb?


docs/RFCS/backup_restore.md, line 89 [r1] (raw file):

RocksDB and grabs a snapshot. If the node doesn't have the entire keyrange
because a split has happened in the interim, the rpc is failed and the
coordinator retries with more fine-grained segments.

Alternately, the backup RPC could continue, and do a (potentially) remote Scan to get the data from the right side of the split.


docs/RFCS/backup_restore.md, line 113 [r1] (raw file):

for each keyrange is calculated. The end time for the incremental backup is
selected as described above, but the start time for each segment is set using
lookups into this tree.

Is it a requirement to be able to do an incremental backup on top of other incremental backups that don't cover the whole key range? Seems simpler to require that there be one "last backup" timestamp for the whole backup job.
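A sketch of the lookup being discussed, assuming each prior backup is recorded as a key span plus the time at which it ended; a linear scan stands in for the interval tree:

```go
package backup

// backedUpSpan records one keyrange covered by a previous (full or
// incremental) backup and that backup's end time. Names are illustrative.
type backedUpSpan struct {
	startKey, endKey string // covers [startKey, endKey)
	endTime          int64
}

// incrementalStartTime returns the start time for a segment: the latest
// prior end time whose span fully contains the segment. ok == false means
// no earlier backup covers it, so a full backup of the segment is needed.
// A real implementation would use an interval tree instead of a scan, and
// would split segments that are only partially covered.
func incrementalStartTime(spans []backedUpSpan, segStart, segEnd string) (start int64, ok bool) {
	for _, s := range spans {
		if s.startKey <= segStart && segEnd <= s.endKey && (!ok || s.endTime > start) {
			start, ok = s.endTime, true
		}
	}
	return start, ok
}
```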


docs/RFCS/backup_restore.md, line 117 [r1] (raw file):

Each sstable in RocksDB has an explicit keyrange bound and an implicit hlc
timestamp bound. RocksDB will use the former to select relevant sstables during
iteration, but doesn't understand the structure of our keys and so cannot use

Could we add this to rocksdb? I haven't looked at how their filter interface works closely enough to see if this is feasible but bigtable had this kind of metadata associated with each sstable.
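For illustration, the pruning check this metadata would enable, assuming each sstable carried its min/max MVCC timestamps as user-defined properties (`sstableMeta` is a hypothetical structure, not an existing RocksDB one):

```go
package backup

// sstableMeta is a hypothetical per-sstable property recording the MVCC
// timestamp bounds of the entries inside.
type sstableMeta struct {
	minTimestamp, maxTimestamp int64
}

// canSkip reports whether an incremental backup over the half-open window
// (startTime, endTime] could avoid opening this sstable entirely.
func canSkip(m sstableMeta, startTime, endTime int64) bool {
	// True iff no entry in the table can fall inside the window.
	return m.maxTimestamp <= startTime || m.minTimestamp > endTime
}
```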


docs/RFCS/backup_restore.md, line 182 [r1] (raw file):

## In-Place Restore

Why would in-place restore be desirable (as opposed to restoring to a new ID and swapping with a rename)?


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 60 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

WriteBatches are faster to apply than higher-level formats, or at least they should be. This difference may be irrelevant since we have to rekey the backup anyway to apply it.

Ditto what Ben said about speed. I think I'd stick with a WriteBatch even with the rekeying requirement. A naive equivalent proto would have a lot of allocation overhead. Rekeying a WriteBatch could likely be done in C++ very quickly.

docs/RFCS/backup_restore.md, line 117 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Could we add this to rocksdb? I haven't looked at how their filter interface works closely enough to see if this is feasible but bigtable had this kind of metadata associated with each sstable.

Yes, we can add knowledge of our keys to the sstable metadata. This is what is being proposed for Option 1.

Comments from Reviewable

@danhhz
Contributor Author

danhhz commented Aug 31, 2016

Thanks for the comments! I responded to a couple and am going to need to think through some of the others.


Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 60 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto what Ben said about speed. I think I'd stick with a WriteBatch even with the rekeying requirement. A naive equivalent proto would have a lot of allocation overhead. Rekeying a WriteBatch could likely be done in C++ very quickly.

I should have been more explicit; I indeed meant because we're always rekeying anyway. Good enough for y'all, good enough for me. Will remove this.

@petermattis, out of curiosity, why would the equivalent proto have more allocation overhead?


docs/RFCS/backup_restore.md, line 76 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Must be no greater than now - max offset to avoid any possibility of uncertainty-related restarts. Choosing a time further in the past (up to a few minutes) will help minimize interference with in-progress transactions.

s/skew/offset/g

I was thinking through the "few minutes in the past" bit while writing this, but got stuck. We'd have to do something to get around this in tests (testing knob?). It seems like sending an error back and making the backup coordinator retry the keyrange at the same timestamp until there's no conflict eliminates needing a parameter we have to tune here. This also elegantly solves how it works in tests.

What's the difference between skew and offset?


docs/RFCS/backup_restore.md, line 77 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If the DistSender supported parallel RPCs, could we use that instead of building a new custom scheduler?

A conversation I had with Andrei after I sent this out convinced me that I should likely be using DistSender for this, so I'm going to think that through a bit and update. Can you expand a bit on what you meant by "a new custom scheduler"?

docs/RFCS/backup_restore.md, line 113 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is it a requirement to be able to do an incremental backup on top of other incremental backups that don't cover the whole key range? Seems simpler to require that there be one "last backup" timestamp for the whole backup job.

Yeah, that'd be simpler, but only a little. This lets you do something like a full backup every week + an hourly incremental backup of some really important table + a daily incremental backup of everything.

docs/RFCS/backup_restore.md, line 117 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yes, we can add knowledge of our keys to the sstable metadata. This is what being proposed for Option 1.

I think Ben meant the iterator side, which, as far as I can see, can't use our metadata to skip reading an sstable entirely.

That bit seems reasonable to add to rocksdb, but the iterator also has to return tombstones for the incremental backup case, which seems to me to be much less likely to be something they'd want upstream.


docs/RFCS/backup_restore.md, line 182 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why would in-place restore be desirable (as opposed to restoring to a new ID and swapping with a rename)?

I was imagining a huge table that the user didn't want to store twice while it was being restored. If this isn't something we think we need, it simplifies some things to get rid of it, so I'm happy to.

Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 60 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I should have been more explicit; I indeed meant because we're always rekeying anyway. Good enough for y'all, good enough for me. Will remove this.

@petermattis, out of curiosity, why would the equivalent proto have more allocation overhead?

Say you had a proto like:
```protobuf
message WriteBatch {
  message Op {
    enum Type {
      PUT = 1;
      MERGE = 2;
      DELETE = 3;
    }
    optional Type type = 1;
    optional bytes key = 2;
    optional bytes value = 3;
  }
  repeated Op ops = 1;
}
```

This gets translated into something like:

```go
type Type int

const (
	PUT    Type = 1
	MERGE  Type = 2
	DELETE Type = 3
)

type Op struct {
	Type  Type
	Key   []byte
	Value []byte
}

type WriteBatch struct {
	Ops []Op
}
```

The way proto deserialization works, every Op.{Key,Value} will be a separately allocated slice.


Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Sep 1, 2016

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 60 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Say you had a proto like:

```protobuf
message WriteBatch {
  message Op {
    enum Type {
      PUT = 1;
      MERGE = 2;
      DELETE = 3;
    }
    optional Type type = 1;
    optional bytes key = 2;
    optional bytes value = 3;
  }
  repeated Op ops = 1;
}
```

This gets translated into something like:

```go
type Type int

const (
	PUT    Type = 1
	MERGE  Type = 2
	DELETE Type = 3
)

type Op struct {
	Type  Type
	Key   []byte
	Value []byte
}

type WriteBatch struct {
	Ops []Op
}
```

The way proto deserialization works, every Op.{Key,Value} will be a separately allocated slice.

But you don't have to fully deserialize a proto to re-key it. You could use the lower-level proto decoder interface (is that even exposed in Go?) to do a zero-copy pass over the data and find where the keys are.

As long as the new table ID encodes to the same number of bytes as the old one, you can do it in-place. If the table ID has gotten too big, you'll need to do a lot of moving data around (but not necessarily independent allocations).
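A rough sketch of that zero-copy pass, using Go's low-level `protowire` package (a modern stand-in; in 2016 this would have been hand-rolled). The `Op.key` field number and the equal-length-prefix rewrite are the assumptions described above:

```go
package rekey

import (
	"bytes"
	"errors"

	"google.golang.org/protobuf/encoding/protowire"
)

const keyFieldNum = 2 // assumed field number of Op.key

// rekeyInPlace walks a serialized Op without unmarshaling it and rewrites
// the table-ID prefix of the key field in place. Only works when the old
// and new prefixes encode to the same number of bytes.
func rekeyInPlace(op, oldPrefix, newPrefix []byte) error {
	if len(oldPrefix) != len(newPrefix) {
		return errors.New("prefix lengths differ; in-place rewrite impossible")
	}
	b := op
	for len(b) > 0 {
		num, typ, n := protowire.ConsumeTag(b)
		if n < 0 {
			return protowire.ParseError(n)
		}
		b = b[n:]
		if num == keyFieldNum && typ == protowire.BytesType {
			key, m := protowire.ConsumeBytes(b)
			if m < 0 {
				return protowire.ParseError(m)
			}
			if bytes.HasPrefix(key, oldPrefix) {
				copy(key[:len(newPrefix)], newPrefix) // key aliases op, so this edits in place
			}
			b = b[m:]
			continue
		}
		if n = protowire.ConsumeFieldValue(num, typ, b); n < 0 {
			return protowire.ParseError(n)
		}
		b = b[n:]
	}
	return nil
}
```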


docs/RFCS/backup_restore.md, line 76 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I was thinking through the "few minutes in the past" bit while writing this, but got stuck. We'd have to do something to get around this in tests (testing knob?). It seems like sending an error back and making the backup coordinator retry the keyrange at the same timestamp until there's no conflict eliminates needing a parameter we have to tune here. This also elegantly solves how it works in tests.

What's the difference between skew and offset?

We wouldn't have to do "a few minutes in the past" for tests; it would just be a recommendation for clusters that are concurrently serving traffic (since a backup at ~now could force other transactions to restart, while one in the past wouldn't. A read-only transaction can't always see that it's causing concurrent write transactions to restart, so there's not an error to watch for).

As defined in https://tools.ietf.org/html/rfc2330#section-10.1, "offset" is the difference between two clocks (or between a clock and "true" time), while "skew" is the derivative of offset (how quickly the clock is gaining or losing time). CockroachDB is only concerned with offsets.


docs/RFCS/backup_restore.md, line 77 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

A conversation I had with Andrei after I sent this out convinced me that I should likely be using DistSender for this, so I'm going to think that through a bit and update. Can you expand a bit on what you meant by "a new custom scheduler"?

The "custom scheduler" I was talking about is the mechanism that is going to query range descriptors, map work onto those ranges, send requests to nodes, handle retries, wait for completion, etc. This is all stuff that the DistSender does, and backup seems like a good fit for a parallelized DistSender

docs/RFCS/backup_restore.md, line 113 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Yeah, that'd be simpler, but only a little. This lets you do something like a full backup every week + an hourly incremental backup of some really important table + a daily incremental backup of everything.

It's only a little simpler for the backup itself, but there is also more complexity on the restore side (it must make sure that it has all the necessary pieces), and there is operational complexity that falls outside cockroachdb's scope (which files in S3 are safe to delete without making other backups unrecoverable?).
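Under the simpler one-timestamp-per-job model, the restore-side completeness check is a small amount of code; a sketch with illustrative names:

```go
package restore

import "sort"

// backupDesc describes one backup in a chain; startTime == 0 marks a full
// backup. These names are illustrative, not the RFC's actual descriptors.
type backupDesc struct {
	startTime, endTime int64
}

// validateChain reports whether the backups form an unbroken chain of
// (start, end] windows covering everything up to restoreTime.
func validateChain(backups []backupDesc, restoreTime int64) bool {
	sort.Slice(backups, func(i, j int) bool {
		return backups[i].startTime < backups[j].startTime
	})
	var covered int64 // all data in (0, covered] is restorable so far
	for _, b := range backups {
		if b.startTime != covered {
			return false // a gap (or overlap) breaks the chain
		}
		covered = b.endTime
	}
	return covered >= restoreTime
}
```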

docs/RFCS/backup_restore.md, line 117 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I think Ben meant the iterator side, which, as far as I can see, can't use our metadata to skip reading an sstable entirely.

That bit seems reasonable to add to rocksdb, but the iterator also has to return tombstones for the incremental backup case, which seems to me to be much less likely to be something they'd want upstream.

Tombstones have keys; shouldn't the timestamps of those keys be included in the metadata so that an sstable is loaded whenever it contains relevant tombstones?

docs/RFCS/backup_restore.md, line 182 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I was imagining a huge table that the user didn't want to store twice while it was being restored. If this isn't something we think we need, it simplifies some things to get rid of it, so I'm happy to.

I don't think this actually saves you from storing the data twice, since the overwritten data won't go away immediately. The benefits seem small given the complexity.

Comments from Reviewable

@vivekmenezes
Contributor

docs/RFCS/backup_restore.md, line 30 [r1] (raw file):

be a deciding factor for potential customers.

# Detailed design

Goals

@madelynnblue
Contributor

Review status: 0 of 1 files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 91 [r1] (raw file):

coordinator retries with more fine-grained segments.

The RocksDB snapshot is iterated and MVCC entries between the start and end time

Why do we need all MVCC entries instead of just the most recent one per key? Assuming the default GC of 24h, if a backup is restored more than 24h after backup, everything but the most recent MVCC entry will immediately be collected during the next GC cycle. Seems like an easy way to reduce backup/restore time without losing any data that a user would have access to anyway. (Near the end of this RFC you say it would only use the most recent, so I think something is off.)


docs/RFCS/backup_restore.md, line 110 [r1] (raw file):

The user requests an incremental backup and gives a path to a full backup.
Previous incremental backups may also be provided. The descriptors are read in,
checked for consistency, and an interval tree with the most recent backup time

What does consistency mean? That the two table descriptors are identical? What if a column/index has been added?


docs/RFCS/backup_restore.md, line 146 [r1] (raw file):

The user requests a restore and gives a path to a full backup and any
incremental backups. Either all or a subset of tables are selected. A record of
the restore's progress is written and used to continue if the coordinating node

We discussed punting on this issue and making the client the coordinator for now since this depends on yet another feature that lacks an RFC. Does it make sense to do the simpler approach for now?


Comments from Reviewable

@danhhz danhhz force-pushed the rfc_backup_restore branch from 910ba6e to 35d3526 on September 7, 2016 19:55
@danhhz
Contributor Author

danhhz commented Sep 7, 2016

Sorry for the delay, RFAL


Review status: 0 of 1 files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 30 [r1] (raw file):

Previously, vivekmenezes wrote…

Goals

Done.

docs/RFCS/backup_restore.md, line 42 [r1] (raw file):

Previously, vivekmenezes wrote…

Separate out non-goals

Done.

docs/RFCS/backup_restore.md, line 60 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

But you don't have to fully deserialize a proto to re-key it. You could use the lower-level proto decoder interface (is that even exposed in go?) to do a zero-copy pass over the data and find where the keys are.

As long as the new table id encodes to the same number of bytes as the old one, you can do it in-place. If the table ID has gotten too big, you'll need to do a lot of moving data around (but not necessarily independent allocations)

If we're concerned about allocations, I think we could reasonably do the rekeying in C++; it really only requires porting our varints, which seems doable. I don't see anything in rocksdb that allows for reading a write batch without having the entire thing in memory. So if doing it on-the-fly is something we care about (do we?), this would be easier by repeating and length-prefixing the marshalled version of Peter's `Op`.

Ultimately, I'm thinking ahead to bulk ingest. It's much easier to assume someone can produce protos than rocksdb WriteBatches from Hadoop or Spark. If we can make the proto performance workable, it seems better than having two formats.
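A sketch of that length-prefixed stream, assuming `op` is an already-marshalled `Op`; the framing here is illustrative, not an agreed format:

```go
package ingest

import (
	"bufio"
	"encoding/binary"
	"io"
)

// writeOp appends one marshalled Op to the stream as a uvarint length
// followed by the bytes, so readers never need the whole file in memory.
func writeOp(w io.Writer, op []byte) error {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(op)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err := w.Write(op)
	return err
}

// readOp reads the next marshalled Op; a clean io.EOF on the length
// varint marks the end of the stream.
func readOp(r *bufio.Reader) ([]byte, error) {
	n, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	op := make([]byte, n)
	if _, err := io.ReadFull(r, op); err != nil {
		return nil, err
	}
	return op, nil
}
```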


docs/RFCS/backup_restore.md, line 76 [r1] (raw file):

We wouldn't have to do "a few minutes in the past" for tests; it would just be a recommendation for clusters that are concurrently serving traffic (since a backup at ~now could force other transactions to restart, while one in the past wouldn't. A read-only transaction can't always see that it's causing concurrent write transactions to restart, so there's not an error to watch for).

Done.

As defined in https://tools.ietf.org/html/rfc2330#section-10.1, "offset" is the difference between two clocks (or between a clock and "true" time), while "skew" is the derivative of offset (how quickly the clock is gaining or losing time). CockroachDB is only concerned with offsets.

TIL!


docs/RFCS/backup_restore.md, line 77 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The "custom scheduler" I was talking about is the mechanism that is going to query range descriptors, map work onto those ranges, send requests to nodes, handle retries, wait for completion, etc. This is all stuff that the DistSender does, and backup seems like a good fit for a parallelized DistSender

Agreed. I've rewritten some of this to more explicitly talk about DistSender.

docs/RFCS/backup_restore.md, line 83 [r1] (raw file):

Previously, vivekmenezes wrote…

I'm working on PR for creating new system tables.

I saw #7073; is that what you're referring to? I'd like to keep this TODO here until that merges.

docs/RFCS/backup_restore.md, line 85 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What kind of RPC?

Done.

docs/RFCS/backup_restore.md, line 86 [r1] (raw file):

"In the past" needs to be defined precisely.

Removed "in the past", realized it wasn't needed.

Does this operation require the range lease?

Done. (I said yes, is there any way to relax this restriction?)

What exactly does it mean to flush rocksdb?

Done


docs/RFCS/backup_restore.md, line 89 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Alternately, the backup RPC could continue, and do a (potentially) remote Scan to get the data from the right side of the split.

Done.

docs/RFCS/backup_restore.md, line 91 [r1] (raw file):

Previously, mjibson (Matt Jibson) wrote…

Why do we need all MVCC entries instead of just the most recent one per key? Assuming the default GC of 24h, if a backup is restored more than 24h after backup, everything but the most recent MVCC entry will immediately be collected during the next GC cycle. Seems like an easy way to reduce backup/restore time without losing any data that a user would have access to anyway. (Near the end of this RFC you say it would only use the most recent, so I think something is off.)

We only need all of them if we want to time travel on the restored data.

docs/RFCS/backup_restore.md, line 110 [r1] (raw file):

Previously, mjibson (Matt Jibson) wrote…

What does consistency mean? That the two table descriptors are identical? What if a column/index has been added?

It was referring to start and end time consistency. Done.

docs/RFCS/backup_restore.md, line 113 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's only a little simpler for the backup itself, but there is also more complexity on the restore side (it must make sure that it has all the necessary pieces), and there is operational complexity that falls outside cockroachdb's scope (which files in S3 are safe to delete without making other backups unrecoverable?).

These all seem like small problems to me compared to the potential benefits. OTOH, there's no reason we can't revisit this decision once someone has actually requested this feature. Removed.

docs/RFCS/backup_restore.md, line 117 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Tombstones have keys; shouldn't the timestamps of those keys be included in the metadata so that an sstable is loaded whenever it contains relevant tombstones?

It's not the loading that's an issue so much as the rocksdb iterator interface not having any mechanism to indicate a tombstone. We need something more like a `DiffIterator`, which is pretty different from any of their existing uses.

Mentioned this more explicitly in the doc.


docs/RFCS/backup_restore.md, line 146 [r1] (raw file):

Previously, mjibson (Matt Jibson) wrote…

We discussed punting on this issue and making the client the coordinator for now since this depends on yet another feature that lacks an RFC. Does it make sense to do the simpler approach for now?

We'll certainly implement it in that order, but I think having a survivable coordinator is the end goal, so the RFC should reflect that.

docs/RFCS/backup_restore.md, line 182 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think this actually saves you from storing the data twice, since the overwritten data won't go away immediately. The benefits seem small given the complexity.

Done.

Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Sep 8, 2016

Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions.


docs/RFCS/backup_restore.md, line 86 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

"In the past" needs to be defined precisely.

Removed "in the past", realized it wasn't needed.

Does this operation require the range lease?

Done. (I said yes, is there any way to relax this restriction?)

What exactly does it mean to flush rocksdb?

Done

Why does it need to flush? RocksDB snapshots can refer to memtables as well as sstables.

docs/RFCS/backup_restore.md, line 117 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

It's not the loading that's an issue so much as the rocksdb iterator interface not having any mechanism to indicate a tombstone. We need something more like a DiffIterator, which is pretty different from any of their existing uses.

Mentioned this more explicitly in the doc.

If incremental backups are always taken more frequently than the MVCC GC threshold, then we don't need to worry about rocksdb-level tombstones (our own MVCC tombstones will be visible).

docs/RFCS/backup_restore.md, line 5 [r2] (raw file):

- Start Date: 2016-07-13
- Authors: Daniel Harrison
- RFC PR: (#8966)[https://github.com/cockroachdb/cockroach/pull/8966]

The markdown link syntax is brackets followed by parens.


docs/RFCS/backup_restore.md, line 37 [r2] (raw file):

- Backup and restore are distributed and don't interrupt production traffic
- Table-level backup scheduling
- Restore to a different number of machines than were backed up

Also: restore to a cluster running a newer version of cockroachdb than was used for the backup. (moving to an older version is also desirable when feasible)


docs/RFCS/backup_restore.md, line 100 [r2] (raw file):

(potentially) remote scan to get the data from the right side of the split.

The RocksDB snapshot is iterated and for each key the latest MVCC entry before

What about intents? I think we're going to have to resolve all intents before we can proceed with the backup, since transaction records (which are not MVCC) cannot be backed up consistently with the rest of the data. We may not have to write any code to handle this (the existing consistent read code and store-level intent resolution may do the job), but this could slow down the backup process and introduce new sources of errors.


docs/RFCS/backup_restore.md, line 171 [r2] (raw file):

interleaved tables are being restored, they must be restored together to use it.

Similar to Postgres, the restored table's constraints are marked as `NOT VALID`:

All constraints or just constraints referencing other tables? CHECK and UNIQUE constraints should be fine, right?


docs/RFCS/backup_restore.md, line 191 [r2] (raw file):

ingest, when possible.

_TODO(dan): Can we allow the user to provide the mvcc timestamp? This could be

I think user-supplied timestamps for an initial restore into fresh keyspace are fine (but of questionable use). Adding new data with user-supplied timestamps to an existing and visible table seems problematic. I think incremental ingestion will have to use the current time unless we're willing to break various consistency properties.


docs/RFCS/backup_restore.md, line 223 [r2] (raw file):

Distributed SQL is already building a framework for scalable computation. It
seems likely that this could be used to segment files of overlapping keyranges
(which are much easier for users to produce).

Why are these easier for users to produce?


docs/RFCS/backup_restore.md, line 244 [r2] (raw file):

## Only Backup Primary Data
Secondary indexes are derivable from the primary data, so we could skip backing
them up to save space. Recomputing them on restore will probably make it too

For the bulk ingestion case, third-parties will probably not be able to produce appropriate index records, so we'll need to process their data to generate index records (and then sort everything into non-overlapping ranges) anyway.


docs/RFCS/backup_restore.md, line 258 [r2] (raw file):

semantics should be during restore. Should backups from before the gc threshold
be ineligible for timetravel after restore or should we add complications to the
gc worker to understand restored data?

Restored data should be GC'd (or not) just like regular data. I think the users who would be interested in the option to back up and restore time travel data would also set a very large GC threshold for the tables where they use this feature.


docs/RFCS/backup_restore.md, line 261 [r2] (raw file):

# Restoring a Fraction of a Table (Options)
- We may want first-class support for restoring a segment of primary keys (or a

What's the use case for this?


Comments from Reviewable

@danhhz danhhz force-pushed the rfc_backup_restore branch 2 times, most recently from 6f5838b to c9d75d5 on September 8, 2016 15:27
@danhhz
Contributor Author

danhhz commented Sep 8, 2016

Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions.


docs/RFCS/backup_restore.md, line 86 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why does it need to flush? RocksDB snapshots can refer to memtables as well as sstables.

Holdover from when I thought we'd be copying sstables directly. Done.

docs/RFCS/backup_restore.md, line 117 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If incremental backups are always taken more frequently than the MVCC GC threshold, then we don't need to worry about rocksdb-level tombstones (our own MVCC tombstones will be visible).

Oh, huh. I guess it makes sense that we'd need our own tombstones. Where can I see what they look like?

I don't particularly agree that "incremental backups are always taken more frequently than the MVCC GC threshold" is a reasonable requirement, however.


docs/RFCS/backup_restore.md, line 5 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The markdown link syntax is brackets followed by parens.

Only if you want your links to work

docs/RFCS/backup_restore.md, line 37 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Also: restore to a cluster running a newer version of cockroachdb than was used for the backup. (moving to an older version is also desirable when feasible)

Done.

docs/RFCS/backup_restore.md, line 100 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What about intents? I think we're going to have to resolve all intents before we can proceed with the backup, since transaction records (which are not MVCC) cannot be backed up consistently with the rest of the data. We may not have to write any code to handle this (the existing consistent read code and store-level intent resolution may do the job), but this could slow down the backup process and introduce new sources of errors.

I said for now that we'd block until they're resolved (I think this is what our engine does now?), but it doesn't seem great. Any ideas on how we'd do something better? Push them I guess?

docs/RFCS/backup_restore.md, line 171 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

All constraints or just constraints referencing other tables? CHECK and UNIQUE constraints should be fine, right?

Yup. Clarified.

docs/RFCS/backup_restore.md, line 191 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think user-supplied timestamps for an initial restore into fresh keyspace are fine (but of questionable use). Adding new data with user-supplied timestamps to an existing and visible table seems problematic. I think incremental ingestion will have to use the current time unless we're willing to break various consistency properties.

Deleted.

docs/RFCS/backup_restore.md, line 223 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why are these easier for users to produce?

- The vast majority of Hadoop and MapReduce jobs hash-shard their keys and then sort them within a shard.
- Also, if we implemented a version of `COPY FROM` that outputs whatever it reads into these files, we'd end up with overlapping keyranges.

docs/RFCS/backup_restore.md, line 244 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

For the bulk ingestion case, third-parties will probably not be able to produce appropriate index records, so we'll need to process their data to generate index records (and then sort everything into non-overlapping ranges) anyway.

Done.

docs/RFCS/backup_restore.md, line 258 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Restored data should be GC'd (or not) just like regular data. I think the users who would be interested in the option to back up and restore time travel data would also set a very large GC threshold for the tables where they use this feature.

Done.

docs/RFCS/backup_restore.md, line 261 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What's the use case for this?

Restoring a couple of records that were accidentally deleted.

Comments from Reviewable

@danhhz danhhz mentioned this pull request Sep 12, 2016
@madelynnblue
Contributor

Review status: 0 of 1 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 91 at r1 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

We only need all of them if we want to time travel on the restored data.

I guess I don't understand. My argument is that under normal uses, users won't be able to time travel anyway because the data will be GC'd right after restore. How does keeping many MVCC values allow time travel if they do a restore more than the GC interval after the backup?

Comments from Reviewable

@danhhz
Contributor Author

danhhz commented Sep 13, 2016

Review status: 0 of 1 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 91 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

I guess I don't understand. My argument is that under normal uses, users won't be able to time travel anyway because the data will be GCd right after restore. How does keeping many MVCC values allow time travel if they do a restore more than the GC interval after the backup?

This is addressed below: "Users who would be interested in the option to back up and restore time travel data would also set a very large GC threshold for the tables where they use this feature."

Comments from Reviewable

@danhhz
Contributor Author

danhhz commented Sep 13, 2016

@bdarnell any outstanding concerns?


Review status: 0 of 1 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


Comments from Reviewable

@bdarnell
Contributor

LGTM


Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


docs/RFCS/backup_restore.md, line 117 at r1 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Oh, huh. I guess it makes sense that we'd need our own tombstones. Where can I see what they look like?

I don't particularly agree that "incremental backups are always taken more frequently than the MVCC GC threshold" is a reasonable requirement, however.

Our tombstones use the `deleted` flag in `MVCCMetadata`.

Reducing the MVCC GC threshold is expensive due to all the full table scans required - more expensive than running incremental backups more often. So I think that requiring incremental backups to be run within the GC window may be reasonable (at least at first, until we have more experience with how GC will be tuned in practice), and would cut out or postpone a good amount of complexity here.


docs/RFCS/backup_restore.md, line 100 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I said for now that we'd block until they're resolved (I think this is what our engine does now?), but it doesn't seem great. Any ideas on how we'd do something better? Push them I guess?

We don't block at the engine level (and even if we did, there's no guarantee that something else would come along and resolve the intent for us). The backup process must either go through the Store at the right level for the Store's read/push/resolve loop to take effect, or implement such a loop itself.

Comments from Reviewable

@danhhz
Contributor Author

danhhz commented Sep 26, 2016

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


docs/RFCS/backup_restore.md, line 117 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Our tombstones use the deleted flag in MVCCMetadata.

Reducing the MVCC GC threshold is expensive due to all the full table scans required - more expensive than running incremental backups more often. So I think that requiring incremental backups to be run within the GC window may be reasonable (at least at first, until we have more experience with how GC will be tuned in practice), and would cut out or postpone a good amount of complexity here.

Fair enough. I'm going to operate under the assumption that we can just handle both types of tombstones. If it's more complexity than I'm imagining, I'll revisit this.

docs/RFCS/backup_restore.md, line 100 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We don't block at the engine level. (and even if we did, there's no guarantee that something else would come along and resolve the intent for us). The backup process must either go through the Store at the right level for the Store's read/push/resolve loop to take effect or implement such a loop itself.

Done.

Comments from Reviewable

@danhhz danhhz merged commit 2d0258d into cockroachdb:master Sep 26, 2016
@danhhz danhhz deleted the rfc_backup_restore branch October 25, 2016 20:22