delta snapshots computation #12298

Open
patelprateek opened this issue Jan 26, 2024 · 9 comments

Comments

@patelprateek

I have a use case where a few large indexing machines ingest all the data (in real time) and periodically (every X seconds) run a compaction and checkpoint the index to distributed storage. The serving machines (read-only servers) read the latest snapshot whenever a new one is available (copying SST files from remote storage to local SSD). One limitation of this approach is that when the indexes are pretty large, for example a few hundred GB to TB scale, taking a snapshot of the entire index, uploading it to remote storage, and having the read-only servers load these giant indexes onto local SSD every few minutes consumes a lot of bandwidth and affects performance on the online path (even if the amount of updates in those N seconds was on the order of a few MBs).

Is there a way to transfer only the deltas between snapshots, so that the read servers can load just the new SST files and clean up the old ones without having to run any compaction on the serving side, and so that the indexers, instead of checkpointing the entire index, periodically upload only the delta (with any metadata)? Any ideas on how to accomplish this, or on whether this is even a feasible approach, would be appreciated.

@jowlyzhang
Contributor

If I understand correctly, there are two steps involved in this pipeline that you want to check can be optimized:

Step 1): The large indexing machine periodically checkpoints to distributed storage. Each time a checkpoint happens, it uploads the entire index (the whole DB). We want to check if this can be improved.
Step 2): The read-only servers periodically load from distributed storage onto local SSD and serve read traffic. Each time the loading happens, they load the entire index too. We want to check if this can be improved.

RocksDB's backup feature supports incremental backup; essentially, it avoids copying duplicated files. It seems to me that using this feature for step 1) can achieve what you need; you would need to implement a rocksdb::Env object for your distributed file system.
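Here is a minimal sketch of that flow, not a definitive implementation: it assumes the backup_engine.h header from recent RocksDB releases and uses hypothetical local paths, and it passes Env::Default() where your setup would pass the custom Env wrapping GCS/S3.

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/utilities/backup_engine.h"

int main() {
  // Open the indexer's DB (path is a placeholder).
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/indexer_db", &db);
  assert(s.ok());

  // Env::Default() keeps the sketch runnable; a custom rocksdb::Env
  // implementation for the distributed file system would go here instead.
  rocksdb::BackupEngine* backup_engine = nullptr;
  auto io_s = rocksdb::BackupEngine::Open(
      rocksdb::Env::Default(),
      rocksdb::BackupEngineOptions("/tmp/index_backups"), &backup_engine);
  assert(io_s.ok());

  // Incremental by design: each call copies only files that the backup
  // directory does not already contain, so a frequent cadence uploads
  // roughly the delta rather than the whole index.
  io_s = backup_engine->CreateNewBackup(db, /*flush_before_backup=*/true);
  assert(io_s.ok());

  delete backup_engine;
  delete db;
  return 0;
}
```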

For step 2): can you use the backup feature again and treat what is on the local storage as a backup of what is on the distributed storage?
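A corresponding sketch for the read-only side, again with placeholder paths; RestoreDBFromLatestBackup materializes the latest backup into the serving directory, after which the server would open the DB there.

```cpp
#include <cassert>

#include "rocksdb/utilities/backup_engine.h"

int main() {
  // Read-only view of the backup directory (placeholder path; in this
  // setup the Env would again wrap the distributed storage).
  rocksdb::BackupEngineReadOnly* backup_engine = nullptr;
  auto io_s = rocksdb::BackupEngineReadOnly::Open(
      rocksdb::Env::Default(),
      rocksdb::BackupEngineOptions("/tmp/index_backups"), &backup_engine);
  assert(io_s.ok());

  // db_dir and wal_dir are the same directory here for simplicity.
  io_s = backup_engine->RestoreDBFromLatestBackup("/tmp/serving_db",
                                                  "/tmp/serving_db");
  assert(io_s.ok());

  delete backup_engine;
  return 0;
}
```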

I'm curious how the read-only server handles a new snapshot: does it reopen the DB every time a new download is made?

@patelprateek
Author

@jowlyzhang : thanks, your understanding is correct.
Our current operations are:
Indexing (write-only data ingestion jobs)

  1. Indexers: each shard is indexed by one indexer (1 indexer per shard), at a few hundred GBs per shard. Storage: local SSD or network-attached SSD.
  2. Every 30 minutes, at a synchronization point, we run a full compaction and copy the full index (all SST files) to distributed storage like GCS or S3 (see the sketch after this list). So even if the incremental data within those 30 minutes was a few hundred megabytes, we end up copying the entire hundreds of GBs from local/network SSDs to remote storage.
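For reference, that full-snapshot step is roughly equivalent to this minimal sketch using RocksDB's Checkpoint utility; paths are placeholders, and the actual upload to GCS/S3 happens outside RocksDB.

```cpp
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/utilities/checkpoint.h"

// Runs a full compaction, then materializes a consistent snapshot of `db`
// into snapshot_dir (hard links where the file system allows). The snapshot
// directory is what gets uploaded to remote storage.
void full_snapshot(rocksdb::DB* db, const std::string& snapshot_dir) {
  rocksdb::Status s =
      db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  assert(s.ok());

  rocksdb::Checkpoint* checkpoint = nullptr;
  s = rocksdb::Checkpoint::Create(db, &checkpoint);
  assert(s.ok());
  s = checkpoint->CreateCheckpoint(snapshot_dir);
  assert(s.ok());
  delete checkpoint;
}
```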

Serving (read-only, serving queries)

  1. One server can load multiple shards (produced by multiple indexers), so typically we serve TB-scale data on a single serving machine. We limit it to 1 TB per serving node to ensure we can scale up new servers in under 10 minutes.
  2. Servers copy data from remote GCS/S3 buckets to local SSD and then open multiple RocksDB instances (one for each shard served on this machine).
  3. When a new snapshot appears, background threads copy the new full snapshot to local SSDs, then close the current instance and open a new DB instance pointing at the new snapshot (see the sketch after this list).
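The swap in step 3 looks roughly like this minimal sketch; the path is a placeholder and the snapshot is assumed to be fully downloaded already.

```cpp
#include <cassert>
#include <string>

#include "rocksdb/db.h"

// Closes the instance serving the previous snapshot and opens a fresh
// read-only instance on the directory holding the newly downloaded one.
void swap_to_new_snapshot(rocksdb::DB*& db, const std::string& new_path) {
  delete db;  // close the old instance

  rocksdb::Options options;
  rocksdb::DB* new_db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::OpenForReadOnly(options, new_path, &new_db);
  assert(s.ok());
  db = new_db;
}
```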

We want to make this process more efficient so that data ingested on the indexing side becomes available on the serving side faster, i.e. we want to take these snapshots every minute (or possibly every 30 seconds).
The main bottlenecks are copying full snapshots (from local to remote for the indexer) and (copying the full snapshot from remote to local + opening and initializing new instances).
I was wondering if incremental checkpoints can help us scale, so that even when our indexes are large but the incremental updates are only a few hundred MBs, we can reduce the end-to-end latency.

@jowlyzhang
Contributor

@patelprateek Thank you for the context. Do you think the incremental capability of the RocksDB backup feature can help here?

@patelprateek
Author

@jowlyzhang : my thought process was that if there is a way to propagate deltas of the index such that they can be replicated easily among the query servers, that would be the way to go.

Another approach is having servers both ingest and serve queries on the same node (read-write, with rate limiting enabled so ingestion does not impact query read latencies). The issue in that scenario is that all replicated servers would essentially be ingesting the same data and doing the same compaction work. I wanted to avoid that, hence I was leaning toward the indexers doing the compaction and the read-only servers just replicating the state of the DB, which is compute-efficient. Alternatively, one master node could ingest and compact while all other replicas copy some delta files and metadata, so they don't have to do compaction work and can make new data queryable within a few seconds.

My question for the RocksDB team was whether this is feasible. What you suggested might help, but I don't know if it lets us scale: for example, do you see any issues if we have, say, a 128 GB index and try to take an incremental backup every 30 seconds (the new data arriving every 30 seconds is probably a few hundred MBs)?

@jowlyzhang
Contributor

Your current workflow already does checkpointing every 30 seconds, right? Backup is built on top of checkpointing, so it shouldn't be more expensive, since it's incremental.

@patelprateek
Author

No, the current cadence is 30 minutes (per my update above in the thread), with a full snapshot (not incremental). That's why I wanted to understand the perf implications if we need to take an incremental backup every 30 seconds.

@jowlyzhang
Contributor

In that case, you would need to do a DB reopen on the read-only servers every 30 seconds, right? I think sometimes it will even take longer than 30 seconds for the DB to open.

@patelprateek
Author

Yes, I was wondering if it's possible to apply an incremental update without having to re-open.
I don't know the implementation details, but if the incremental update can tell us something like "files x, y were deleted and new files a, b were added", then possibly we could copy just those new files, or replicate them, to end up in the same state?

@jowlyzhang
Contributor

I see what you mean. This feature is backup, though: it's mainly for backing up a DB and is only used to restore one when an accident happens, which is rare, so it's designed in a way that requires a reopen. Your scenario is about serving online, real-time changes. We have a secondary instance feature that can catch up with the primary's changes, but that's for accessing a common set of files on the same file system.
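For completeness, a minimal sketch of opening such a secondary instance, assuming primary and secondary share the same file system (paths are placeholders):

```cpp
#include <cassert>

#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  // Secondaries are expected to keep all files open.
  options.max_open_files = -1;

  // "/shared/primary_db" is the primary's path on the shared file system;
  // "/tmp/secondary_info" holds the secondary's own info logs.
  rocksdb::DB* secondary = nullptr;
  rocksdb::Status s = rocksdb::DB::OpenAsSecondary(
      options, "/shared/primary_db", "/tmp/secondary_info", &secondary);
  assert(s.ok());

  // Replays the primary's new MANIFEST/WAL updates in place, without a
  // DB reopen.
  s = secondary->TryCatchUpWithPrimary();
  assert(s.ok());

  delete secondary;
  return 0;
}
```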
