Restic CLI is faster than Backrest on Windows #92
Comments
Interesting report. It may be worth doing this benchmarking in an empty repository: it's very possible that the pack files already exist on the second run if you're rerunning with the same parent and set of files, since restic does content-based deduplication. On the second run it's likely nothing is being uploaded. Backrest uses this function to read backup progress events in its own goroutine. I wouldn't expect it to ever be the bottleneck in a backup (though restic does output a lot of bytes of progress events).
Unfortunately, my understanding of Go is only rudimentary. I did, however, converse with ChatGPT about that, and it mentioned that there are other, non-blocking ways of reading from STDOUT: https://chat.openai.com/share/f4d8b443-203f-406b-8130-3c3116eb79f7 My test is with a folder that contains close to 4 million files. No files are changed, and it is absolutely repeatable: running the CLI command takes about half the time of running it in Backrest. It's repeatable, no matter the order.
Mind sharing some generic test commands? I'll spend some time benchmarking tonight on an artificial dataset.
This is the command I run on the command line:
I've also set the env variables. The final output from the CLI is this:
The result in Backrest looks like this: Does that help?
Definitely helps & also good to know that this is on Windows. Thanks! I'll try benchmarking this on my Windows machine (and can compare with Linux). I think there's a good chance that your intuition re: bad buffering somewhere is right.
Looked at this a bit on Linux, but I wasn't able to reproduce such a big speed discrepancy (just normal variance between runs). I suspect I'll need to benchmark specifically on Windows. It's very possible something about the way I'm doing IO is buffering there specifically.
Let me know if I can help by running special builds |
Validated this tonight using Windows Terminal (GPU-accelerated rendering) through Parallels on an M1 Pro Mac (allocated 4 processors and 8 GB RAM). First I created a test directory with a 1G zeroes file and a 1G random file. I then created two local repos: one for CLI and one for Backrest. The initial Backrest backup took 14 seconds, while the CLI backup took 12 seconds. Not a huge difference. I then added a ~22G random file. The CLI backed it up in 3m35s, while Backrest took 5m57s, so I think there is room for improvement.
I've been trying various things to see if I can speed it up with 10G random test files. These take about 3 minutes under backrest and 2 minutes under CLI. I added cumulative timing around the various calls here: https://github.com/garethgeorge/backrest/blob/main/pkg/restic/outputs.go#L127-L145
This could become an issue with larger backups. Back-of-the-napkin math: the callback would add about 13 minutes to a 1 TB backup, while json.Unmarshal would add about 4.5 minutes. I think it's possible to call the callback in a goroutine - I briefly tested it, and nothing was obviously broken. Also, there are faster alternatives to the json package - though we may be reaching a point of diminishing returns. Regardless, this only accounts for 10 of the 60 seconds of difference I see between Backrest and the CLI - there are still 50 seconds unaccounted for. I tried black-box debugging for a while, which didn't get me very far:
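The cumulative-timing approach described above can be sketched like this (an illustrative reconstruction, not the actual code in outputs.go; `handleLine` and the field names are assumptions):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Cumulative timers: over millions of progress events, per-call costs that
// are individually tiny add up to minutes, so we accumulate them.
var unmarshalTotal, callbackTotal time.Duration

// handleLine times json.Unmarshal and the progress callback separately so
// their total costs can be compared at the end of a run.
func handleLine(line []byte, cb func(map[string]any)) error {
	start := time.Now()
	var ev map[string]any
	if err := json.Unmarshal(line, &ev); err != nil {
		return err
	}
	unmarshalTotal += time.Since(start)

	start = time.Now()
	cb(ev)
	callbackTotal += time.Since(start)
	return nil
}

func main() {
	_ = handleLine([]byte(`{"message_type":"status","percent_done":0.5}`), func(ev map[string]any) {})
	fmt.Println("unmarshal:", unmarshalTotal, "callback:", callbackTotal)
}
```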
I also took a pprof (without the code changes mentioned above) although I'm not very adept at reading them.
Maybe someone else will have more insight into how to interpret this. I tried one more thing to rule out some kind of performance issue in the Go stdlib: I created a program to print a bunch of lines of text, and another program to exec the first program, read its output using
Really interesting results re: the cumulative benchmarking & thanks for the detailed investigation. I'm going to take a deeper look at that chart tomorrow -- but leaving some initial impressions: I'm surprised the callback is taking that long, but in retrospect it makes sense that it might.
It should be safe, and is probably the right idea, to fork this update out into a goroutine. Edit: I'm remembering that this is already debounced, so it's pretty interesting that it's still taking ~5 seconds of the total runtime of the benchmark.
Implemented 6c4bada based on the benchmarking results; it should be a decent speed boost. Per the rest of the results, I'm having a hard time telling if there's anything immediately obvious on the Go side that would be slowing things down. It looks like the vast majority of the time is simply spent in syscall wait on output from the restic process. I do wonder if the goroutine issue you found had an outsized performance penalty. It may have essentially blocked restic from writing logs until the fsync on the oplog finished, which could cause the disk head to thrash around? But that's a bit of a reach -- I'm curious whether those numbers are on an HDD or SSD?
FYI, I'm still working on this as I get time. I did some inconclusive benchmarking the other night when I got my Windows environment set up a little more fully, but I then realized I was skipping over some of the important code (such as the callback). I need to run some more experiments before reporting what I saw. However, I had a question about 6c4bada. My understanding is that we are trying to move the call to I see that the code has a I will try to run some benchmarks specifically to see if that commit improved the speed at all. On the day I ran pprof testing, I did create similar code, but without the WaitGroup, and I didn't see any immediate errors - but I'm not familiar enough with BoltDB or the data you're saving in this callback to understand whether it's important that these calls be serialized.
Right -- the intent of the fix is to buffer up to 1 write at a time running in a goroutine that doesn't block the callback. So we're spawning a goroutine and, if the callback fires again, it will block while waiting for the previously created goroutine to finish it's write. Since we debounce the callback to fire every 1 seconds at most, my thinking is this should be plenty of time for the oplog update and network IO it triggers to run in most cases. If the system is sufficiently bogged down that the updates are taking more than 1 second we could instead consider dropping them on the floor, but for now we still throttle in that case. WDYT?
Serialization of the calls to boltdb isn't important; the operation log implementation is thread safe :) I'll be very interested to see your new benchmarking results & also give some thought to where else CPU cycles might be going. My expectation is that for devices with relatively slow disk IO there should be a small performance improvement. 4e2bf1f also fixed a case where a backpressured connection could possibly cause backups to hang entirely (though I've never seen a report of this happening in the wild). |
Hmm, I should emphasize that the revision is a modest performance uplift. I think it may be sitting within the sample variance (or may simply be a non-factor if you're storing the oplog on a different disk from your backup dataset). Definitely need to keep looking at other places where the slowdown could be coming from. One thing to note is that restic itself is very chatty during backup operations in JSON mode. A possible fix we could attempt is an upstream change to restic that reduces the amount of IO pressure it generates during backup runs. It's not uncommon for restic to output many megabytes of JSON over the course of a backup -- though it's still surprising that reading this across process boundaries would cause the slowdown we're seeing.
Not sure I understand? There is really no difference in performance: I don't know Go, unfortunately. In C#, I would just put the JSON string into a concurrent queue and go through the queue, parsing and handling the JSON on a completely separate thread.
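For what it's worth, the Go analog of the C# pattern described above would be a buffered channel (Go's stand-in for ConcurrentQueue) drained by a parsing goroutine, so the stdout reader never waits on parsing. A sketch under those assumptions (`startParser` is an invented name):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// startParser consumes raw JSON lines from a channel and parses/handles
// them on its own goroutine. The returned channel closes when the input
// channel is closed and all lines have been handled.
func startParser(lines <-chan string, handle func(map[string]any)) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for l := range lines {
			var ev map[string]any
			if err := json.Unmarshal([]byte(l), &ev); err == nil {
				handle(ev)
			}
		}
	}()
	return done
}

func main() {
	lines := make(chan string, 128) // buffered: producer rarely blocks
	count := 0
	done := startParser(lines, func(ev map[string]any) { count++ })
	lines <- `{"message_type":"status"}`
	close(lines)
	<-done
	fmt.Println("events handled:", count)
}
```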
Backrest is already using a thread to handle output from restic. It's still fairly unclear from any of the results we have so far where the performance difference is going. From what I can tell, this is a Windows-only performance issue. In 6c0b47d I've gone ahead and introduced a buffered pipe between the backrest process and the output-parsing logic. I've also added a benchmark to the restic package, which we can hopefully use to measure changes going forward. I'll try to post some results from my Windows system with and without the buffering. Edit: if you're interested in giving a new build with the buffering changes a go, see the snapshot builds from https://github.com/garethgeorge/backrest/actions/runs/8672450469
@garethgeorge Just an FYI, I created a similar benchmark to the one in your commit, and I think I was seeing a 2m backup time, where I had seen a 3m backup time in the UI. But my memory is fuzzy, so I want to verify this. I had hinted at this here:
I'll try to get some time here to run some side-by-side tests again soon. I'd like to write a benchmark that clearly demonstrates the issue (even better if we can run it in CI) - I'd hate to add a bunch of code around output capturing that ultimately doesn't improve anything. |
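A benchmark along the lines suggested above could feed synthetic restic-style JSON status lines through the output-handling path and measure throughput. A sketch (the field names mimic restic's status messages, but the harness itself is hypothetical):

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"testing"
)

// BenchmarkParseStatusLines measures how quickly synthetic restic-style
// status lines can be scanned and unmarshalled, isolating the parsing
// cost from restic itself.
func BenchmarkParseStatusLines(b *testing.B) {
	line := []byte(`{"message_type":"status","percent_done":0.5,"bytes_done":1024}` + "\n")
	input := bytes.Repeat(line, 1000)
	b.SetBytes(int64(len(input)))
	for i := 0; i < b.N; i++ {
		sc := bufio.NewScanner(bytes.NewReader(input))
		for sc.Scan() {
			var ev map[string]any
			if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
				b.Fatal(err)
			}
		}
	}
}

func main() {
	// testing.Benchmark runs the function outside a test binary.
	fmt.Println(testing.Benchmark(BenchmarkParseStatusLines))
}
```

Something like this would be runnable in CI, though it only captures the parsing half of the problem; the cross-process IO behavior on Windows would still need a benchmark that actually spawns a child process.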
Very much agree that this is probably the most important step in tracking this down. I forgot to collect my results to post here, but I ran through my new benchmark tests this weekend and wasn't really able to see much change in performance when disabling output handling, or even when simulating additional load in the callback. Restic appears to run fully in parallel with the callback handling. I think what I need to set up here is (or would love to see if anyone has some bandwidth)
So far I've compared the latter two cases and, running as administrator, I don't think I saw a difference on my system, but my Windows knowledge is pretty limited (I primarily program and work on Linux), so it's easily possible I'm missing something here! The profiling and results posted so far have been really helpful (and motivating to keep digging on this issue :) ).
Agreed, I think adding the buffered pipe is the last "shot in the dark" type change I'll make to address this (and it didn't seem to have much impact), but I find some peace of mind in knowing that there's explicit buffering eliminating backpressure between backrest (consuming bytes) and restic (producing bytes) -- outside of whatever buffering the kernel is doing. Let's try to lock down some real profiling data or repros to make a targeted fix, especially since it sounds like you've been able to reproduce the problem in your Windows VM.
While investigating #86, I copied the exact command that Backrest runs and ran it myself from the command line.
The summary line prints that 197 seconds elapsed. That's 3:17.
However, Backrest takes 6:25.
Is Backrest reading Restic's STDOUT in a blocking manner and thereby slowing down the whole process?