[Feature] Compress/decompress snapshots before/after uploading/downloading to/from object store #255
I was looking into this, because we do compression in our backup-restore-sidecar, which is for postgres and rethinkdb databases. I would like to reuse this implementation and create a PR here. But when inspecting the stored files (GCP buckets in our case), there are always two instances of every backup.
What is the rationale of this approach? From what I understood from reading the code, uploading the backup in chunks should speed up the upload process. But then the single file with the whole content can be skipped.
@majst01 Thanks a lot for the offer of contribution! ❤️
This is because of the approach of uploading in chunks. In S3, the individual chunks are automatically cleaned up. I was under the impression that the same is true in GCS and other storage providers. But I checked the documentation of GCS composite objects and it looks like the source parts are not automatically cleaned up.
Apparently, not in S3. But yes, in GCS. Perhaps other providers too. Thanks for pointing this out. I will check the other providers' documentation and raise a separate issue for this.
OK, interesting.
I'm afraid I do not have the numbers, as the chunking implementation was done quite some time ago. But I do remember that the main reason for implementing chunked upload was that we started facing upload timeouts on flaky networks, and since the retry would start the upload from the beginning, the upload would practically never complete in such cases. Hence, the chunked upload, to make it possible to upload under flaky networks, and not just for large files.
Yes. The idea of this issue was to compress the one file and then upload the chunks of the compressed file, if needed. On the restoration side, the compressed file would be downloaded and decompressed.
Just opened #268
@majst01 Regarding your question:
Chunked upload of objects to the bucket basically chunks the object on the client side and first uploads the chunks to the bucket. In order to finalize this as a proper object on the bucket, the client finally "composes" the chunks into a logical encapsulation, which you would see on the GCS bucket as the final object. See here for more details about GCS composite objects and composing sub-objects. And as @amshuman-kr mentioned, chunked upload was adopted to avoid failing uploads due to poor network connections.
Indeed, it would make more sense to compress the single file first and then let the GCS API chunk it and upload it.
That would be great @majst01 . Thanks for that 😀
+1 Quite some time ago, that was indeed extremely painful and we couldn't afford an unreliable backup, so we went for chunks (more effort, not as efficient, but it was magnitudes more important to have reliable backups and no alerts of failed backups, which increase in likelihood with larger etcds). So that is the history of that. :-)
Awesome, thank you very much @majst01. 👍
A first shot of my WIP can be seen here: https://github.com/gardener/etcd-backup-restore/compare/master...majst01:compression?expand=1 The most difficult part is the CLI flag handling :-) I don't want to open a draft PR already; I want to get first feedback on my approach. So if someone would spend some minutes to have a look, this would be much appreciated.
@majst01 Thanks. That was quick! The changes generally look good and I liked the idea of supporting multiple compression methods. I had a couple of points though.

1. Backward compatibility, especially while restoring from uncompressed previous snapshots. This is mentioned in the issue description above. Since this change will be rolled out to all live etcds in gardener landscapes, we need to make sure the restoration from existing snapshots works. I was thinking of some kind of metadata. Rules of thumb based on snapshot file extension might be ok too.
2. Streaming or on-the-go compression/decompression, perhaps using streaming readers/writers. Typically, we allocate volumes with only a limited buffer over the expected etcd data size, and the additional storage needed during compression/decompression would eat into that buffer. Do you think stream/on-the-go compression/decompression is possible?
I pushed a small enhancement to support uncompressed snapshots as well. No metadata needed, because everything is decided based on the file extension.
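The extension-based decision can be sketched as follows. Note that the suffix-to-method table here is hypothetical and not necessarily the exact mapping used in the branch; the point is that an unknown suffix falls back to "uncompressed", which keeps restoration of older, uncompressed snapshots working.

```go
package main

import (
	"fmt"
	"strings"
)

// compressionSuffixes maps snapshot file-name suffixes to the compression
// method used. The entries are illustrative, not the actual ones.
var compressionSuffixes = map[string]string{
	".tar.gz":  "gzip",
	".tar.zst": "zstd",
	".tar.lz4": "lz4",
}

// detectCompression decides, purely from the snapshot name, whether and how
// a snapshot was compressed. Unknown suffixes are treated as uncompressed,
// so no extra metadata is required.
func detectCompression(snapName string) (method string, compressed bool) {
	for suffix, m := range compressionSuffixes {
		if strings.HasSuffix(snapName, suffix) {
			return m, true
		}
	}
	return "", false
}

func main() {
	for _, name := range []string{"Full-00000000-000001.tar.gz", "Full-00000000-000001"} {
		m, ok := detectCompression(name)
		fmt.Printf("%s -> compressed=%v method=%q\n", name, ok, m)
	}
}
```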
mholt/archiver is able to compress/decompress a stream as well.
Thanks a lot! The change looks good.
Sounds good to me 👍 We can pick up on-the-go compression/decompression later, before picking up encryption.
One question arises for me: how do you guys actually test this sidecar? It's a bit hard because deployment is usually done via etcd-druid, which is managed by gardener. We don't have a lot of gardener seeds where we can modify image vectors easily. Any hint is welcome.
@shreyas-s-rao should be able to help more with that. But there is the sample helm chart, which might be of help.
OK, thanks. I was able to test the snapshot creation locally with the local snapstore, including compression. @shreyas-s-rao how do you test restoration in this setup?
@majst01 I tested your branch for both backup and restoration. I had to make a small adjustment (please see the patch below).

```diff
diff --git a/pkg/compress/compress.go b/pkg/compress/compress.go
index 4ba7a0df..b7f95e0e 100644
--- a/pkg/compress/compress.go
+++ b/pkg/compress/compress.go
@@ -44,7 +44,7 @@ func (c *Compressor) Compress(snap *snapstore.Snapshot) error {
 	if !c.enabled {
 		return nil
 	}
-	err := archiver.Archive([]string{snap.SnapDir}, path.Join(snap.SnapDir, snap.SnapName))
+	err := archiver.Archive([]string{snap.SnapDir}, path.Join(snap.SnapDir, snap.SnapName + c.extension))
 	if err != nil {
 		return err
 	}
```

I could verify that compression/decompression was working well for full snapshots. But the delta snapshots are not compressed/decompressed. I checked the code for delta snapshots and found it not conducive to file compression. In fact, the implementation as well as the snapstore interface is more conducive to on-the-go compression, especially during backup upload. The temporary file creation is currently being done in the individual snapstore implementations (only if needed). IMHO, instead of changing the delta snapshot backup code to be more file-compression friendly, it is better to implement on-the-go compression/decompression, which can then be used in all the cases (full and delta, backup and restoration). In this context, I found golang's standard streaming support for compression promising.

PS: Last but not least, I am keenly aware that this might be asking a bit too much from you. So, it is perfectly OK if you do not pursue this change. In that case, we will pick this up along the lines you have already shown.
Hi @amshuman-kr I am a bit short on time to spend more on my actual effort, so if you would like to pick up my ideas, go ahead. I am happy to help in any form you wish or ask for.
@majst01 Thanks a lot for the help already. Much appreciated ❤️ |
/assign |
Feature (What you would like to be added):

At present, (full and incremental) snapshots are stored in the object store uncompressed. Can we optionally support storing compressed (full and incremental) snapshots in the object store? I.e. compress the snapshots before uploading to the object store and decompress them after downloading from the object store for restoration. This could be controlled by a configuration/CLI flag, e.g. `compressSnapshots: true` or `--compress-snapshots=true`.

Motivation (Why is this needed?):

While storing uncompressed snapshots in the object store may lead to some simplicity in implementation and optimization in time performance, it also leads to higher network and storage usage.
Approach/Hint to implement the solution (optional):

Rather than transport-level compression via the `Transfer-Encoding` header, object stores commonly support a `Content-Encoding: gzip` header to mark the content in the object store as compressed, while supporting the `Content-Type` header for the actual uncompressed content. So, it makes sense for `etcd-backup-restore` to set `Content-Encoding: gzip` in the object store to indicate that the content is compressed while maintaining the original `Content-Type`.