delta snapshots computation #12298
If I understand correctly, there are two steps in this pipeline that you want to check whether they can be optimized. Step 1: a large indexing machine periodically checkpoints to distributed storage. Each time a checkpoint happens, it uploads the entire index (the whole DB), and we want to check whether this can be improved. RocksDB's backup feature supports incremental backup; essentially, it does not copy duplicated files. It seems to me that using this feature for step 1 can achieve what you need, you would need to implement a
For step 2: can you use the backup feature again and treat what is on the local storage as a backup of what is on the distributed storage? I'm curious how the read-only server handles a new snapshot: does it reopen the DB every time a new download is made?
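As a hedged illustration of why incremental backup helps here (this is a simplified sketch of the idea, not RocksDB's actual implementation): SST files are immutable once written, so a backup destination can skip any file it already holds and copy only the files produced since the last backup. A minimal simulation of that dedup decision:

```python
def plan_incremental_upload(local_ssts, remote_ssts):
    """Decide which SST files an incremental backup must copy.

    local_ssts / remote_ssts: dicts mapping filename -> size in bytes.
    Because SSTs are immutable, a file already present remotely with the
    same size can be skipped; only genuinely new files are uploaded.
    """
    return {
        name: size
        for name, size in local_ssts.items()
        if remote_ssts.get(name) != size
    }

# Example: only the SST produced in the last interval needs uploading.
remote = {"000010.sst": 64 << 20, "000011.sst": 32 << 20}
local = {"000010.sst": 64 << 20, "000011.sst": 32 << 20, "000012.sst": 8 << 20}
print(sorted(plan_incremental_upload(local, remote)))  # ['000012.sst']
```

Real incremental backup (e.g. RocksDB's BackupEngine) also handles WAL files, checksums, and metadata, but the bandwidth win comes from exactly this kind of skip-unchanged-files logic.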
@jowlyzhang: thanks, your understanding is correct.
Serving machines (read-only) serve queries.
We want to make this process efficient by making data ingested on the indexing side available on the serving side faster, i.e. we want to take these snapshots every minute (or possibly every 30 seconds).
@patelprateek Thank you for the context. Do you think the incremental capability of the RocksDB backup feature can help here?
@jowlyzhang: my thought process was that if there is a way to propagate deltas of the index such that they can be replicated easily among query servers, that would be the way to go. My question for the RocksDB team is whether this is feasible. What you suggested might help, but I don't know whether it allows us to scale. For example, do you see any issues if we have, say, a 128 GB index and try to take an incremental backup every 30 seconds (the new data arriving in those 30 seconds is probably a few hundred MBs)?
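To make the scale concrete, a quick back-of-envelope calculation using the numbers from the thread (128 GB index, roughly 300 MB of new data per 30-second interval; 300 MB is an assumed midpoint of "a few hundred MBs"):

```python
# Assumed numbers from the thread: 128 GB index, ~300 MB delta per interval.
index_bytes = 128 * 1024**3   # full snapshot size
delta_bytes = 300 * 1024**2   # new data per 30-second interval
fraction = delta_bytes / index_bytes

# An incremental backup moves well under 1% of a full snapshot's bytes.
print(f"{fraction:.4%}")  # 0.2289%
```

So an incremental backup copies roughly 0.2% of what a full snapshot would per interval, which is why the incremental approach is attractive at this cadence.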
Your current workflow already does checkpointing every 30 seconds, right? Backup is built on top of checkpointing, so it shouldn't be more expensive since it's incremental.
No, the current cadence is 30 minutes (per my update above in the thread) for a full snapshot (not incremental); that's why I wanted to understand any perf implications if we need to take an incremental backup every 30 seconds.
In that case, you would need to do a DB reopen on the read-only servers every 30 seconds, right? I think sometimes it will even take longer than 30 seconds for the DB to open. |
Yes, I was wondering whether it's possible to apply incremental updates without having to reopen.
I see what you mean. This feature is backup, though; it's mainly for backing up a DB and is only used to restore a DB when an accident happens, which is rare, so it is designed in a way that requires a reopen. Your use case is about serving online, real-time changes. We have a secondary instance feature that can catch up with the primary's changes, but that's for accessing a common set of files on the same file system.
I have a use case with a few large indexing machines that ingest all the data (in real time) and periodically (every X seconds) run a compaction and checkpoint the index to some distributed storage. The serving machines (read-only servers) read the latest snapshot whenever a new one is available (copying SST files from remote storage to local SSD). One limitation of this approach: if my indexes are pretty large, for example a few hundred GB to TB scale, taking a snapshot of the entire index, uploading it to remote storage, and having the read-only servers load these giant indexes onto local SSD every few minutes takes a lot of bandwidth and affects performance on the online path (even if the amount of updates in those N seconds was on the order of a few MBs).
Is there a way to transfer only the deltas between snapshots, so that the read servers can load just the new SST files and clean up the old ones without running any compaction on the serving side, and the indexers, instead of checkpointing the entire index, periodically upload the delta (with any metadata)? Any ideas on how to accomplish this, or whether this is even a feasible approach, would be appreciated.
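The delta described above can be expressed as a diff between two checkpoint manifests (the lists of SST files each snapshot contains). A minimal sketch, assuming each snapshot ships a manifest of its SST filenames alongside the files (the manifest format here is hypothetical, not a RocksDB artifact):

```python
def snapshot_delta(old_manifest, new_manifest):
    """Diff two checkpoint manifests (iterables of SST filenames).

    Returns which files a read-only server must download (present only in
    the new snapshot) and which it can delete (compacted away upstream).
    """
    old, new = set(old_manifest), set(new_manifest)
    return {
        "download": sorted(new - old),
        "delete": sorted(old - new),
    }

# Example: one file was compacted away, one new file was produced.
delta = snapshot_delta(
    ["000010.sst", "000011.sst"],
    ["000011.sst", "000013.sst"],
)
print(delta)  # {'download': ['000013.sst'], 'delete': ['000010.sst']}
```

This captures the bandwidth savings the question is after; the hard part the thread goes on to discuss is that making the serving DB pick up such a delta safely, without a full reopen, is not something the backup/restore path supports out of the box.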