
Restic CLI is faster than Backrest on Windows #92

Open
dhilgarth opened this issue Feb 14, 2024 · 19 comments
Labels: bug (Something isn't working), p2

Comments

@dhilgarth
Contributor

While investigating #86, I copied the exact command that Backrest runs and ran it myself from the command line.
The summary line reports that 197 seconds elapsed, which is 3:17.
However, Backrest takes 6:25.

Is Backrest reading Restic's STDOUT in a blocking manner and thereby slowing down the whole process?

@garethgeorge
Owner

Interesting report; it may be worth doing this benchmarking against an empty repository. Restic does content-based deduplication, so if you're rerunning with the same parent and the same set of files, the pack files very likely already exist and nothing is actually being uploaded on the second run.

Backrest uses this function to read backup progress events in its own goroutine. I wouldn't expect it to ever be the bottleneck in a backup (though restic does output a lot of bytes of progress events).
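For context, conceptually the reader is doing something like this (a simplified sketch with illustrative names, not the actual pkg/restic code):

```go
// Minimal sketch: read restic's --json progress stream line by line in a
// separate goroutine so the parent doesn't block on the child's stdout.
// The event type and handling below are illustrative, not Backrest's API.
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"os/exec"
)

type progressEvent struct {
	MessageType string  `json:"message_type"`
	PercentDone float64 `json:"percent_done"`
}

func main() {
	cmd := exec.Command("restic", "backup", "--json", "/data")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	done := make(chan struct{})
	go func() {
		defer close(done)
		scanner := bufio.NewScanner(stdout)
		scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long status lines
		for scanner.Scan() {
			var ev progressEvent
			if err := json.Unmarshal(scanner.Bytes(), &ev); err != nil {
				continue // non-JSON or unrecognized line; skip it
			}
			// hand ev to a progress callback here
			_ = ev
		}
	}()

	<-done
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}
```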

@dhilgarth
Contributor Author

Unfortunately, my understanding of Go is only rudimentary. I did, however, discuss this with ChatGPT, and it mentioned that there are other, non-blocking ways of reading from STDOUT: https://chat.openai.com/share/f4d8b443-203f-406b-8130-3c3116eb79f7

My test is with a folder that contains close to 4 million files. No files change between runs, and the result is absolutely repeatable regardless of the order I run things in: the CLI command takes about half the time of the same backup in Backrest.

garethgeorge added the bug (Something isn't working) and p2 labels on Feb 14, 2024
@garethgeorge
Owner

Mind sharing some generic test commands? I'll spend some time benchmarking tonight on an artificial dataset.

@dhilgarth
Contributor Author

This is the command I run on the command line:

C:\Program Files (x86)\restic> .\restic-0.16.4.exe backup --json --exclude-caches -o sftp.args=-oBatchMode=yes Z:\Projects --tag plan:projects --parent a9839cf1b29e97e3a362cfe6556747d312799655fb19d368d5ae619cc63d5233

I've also set the env variables RESTIC_REPOSITORY, RESTIC_PASSWORD, AZURE_ACCOUNT_NAME and AZURE_ACCOUNT_KEY.

The final output from the CLI is this:

{"message_type":"summary","files_new":0,"files_changed":0,"files_unmodified":3832732,"dirs_new":0,"dirs_changed":0,"dirs_unmodified":470557,"data_blobs":0,"tree_blobs":0,"data_added":0,"total_files_processed":3832732,"total_bytes_processed":703919075979,"total_duration":197.1801373,"snapshot_id":"bc13916175dd5fd71d4626afb38876f2dee2892217dc836dfec24c1695163633"}

The result in Backrest looks like this:

[screenshot: the backup result as shown in the Backrest UI]

Does that help?

@garethgeorge
Owner

garethgeorge commented Feb 14, 2024

Definitely helps, and also good to know that this is on Windows. Thanks! I'll try benchmarking this on my Windows machine (and can compare with Linux). I think there's a good chance your intuition re: bad buffering somewhere is right.

@garethgeorge
Owner

Looked at this a bit on Linux, but I wasn't able to reproduce such a big speed discrepancy (just normal variance between runs). I suspect I'll need to benchmark specifically on Windows; it's very possible something about the way I'm doing IO buffers badly there specifically.

@dhilgarth
Contributor Author

Let me know if I can help by running special builds.

@johnmaguire
Contributor

johnmaguire commented Apr 6, 2024

Validated this tonight using Windows Terminal (GPU-accelerated rendering) through Parallels on an M1 Pro Mac (allocated 4 processors and 8 GB RAM).

First I created a test directory with a 1G zeroes file and a 1G random file. I then created two local repos: one for CLI and one for backrest. The initial backrest backup took 14 seconds, while the CLI backup took 12 seconds. Not a huge difference.

I then added a ~22G random file. The CLI backed it up in 3m35s, while backrest took 5m57s, so I think there is room for improvement.

@johnmaguire
Contributor

johnmaguire commented Apr 6, 2024

I've been trying various things to see if I can speed it up with 10G random test files. These take about 3 minutes under backrest and 2 minutes under CLI.

I added cumulative timing around the various calls here: https://github.com/garethgeorge/backrest/blob/main/pkg/restic/outputs.go#L127-L145

{"json.Unmarshal": "2.7009415s", "event.Validate": "26.83ms", "callback": "8.4080815s"}

This could become an issue with larger backups - back of the napkin math: callback would add about 13 minutes to a 1 TB backup, while json.Unmarshal would add about 4.5 minutes. I think it's possible to call the callback in a goroutine - I briefly tested it, and nothing was obviously broken. Also, there are faster alternatives to the json package - though we may be reaching a point of diminishing returns.

Regardless, this only accounts for about 10 of the 60 seconds of difference I see between backrest and the CLI; the other 50 seconds are still unaccounted for.

I tried black box debugging for a while, which didn't get me very far:

I also took a pprof profile (without the code changes mentioned above), although I'm not very adept at reading them.

(pprof) top10
Showing nodes accounting for 361.27s, 98.92% of 365.21s total
Dropped 399 nodes (cum <= 1.83s)
Showing top 10 nodes out of 37
      flat  flat%   sum%        cum   cum%
   361.01s 98.85% 98.85%    361.13s 98.88%  runtime.cgocall
     0.05s 0.014% 98.86%    236.28s 64.70%  io.copyBuffer
     0.04s 0.011% 98.87%    235.56s 64.50%  syscall.Syscall6
     0.03s 0.0082% 98.88%    235.72s 64.54%  os.(*File).read (inline)
     0.03s 0.0082% 98.89%    361.17s 98.89%  syscall.SyscallN
     0.03s 0.0082% 98.90%    235.57s 64.50%  syscall.readFile
     0.02s 0.0055% 98.90%      7.18s  1.97%  github.com/garethgeorge/backrest/pkg/restic.readBackupProgressEntries
     0.02s 0.0055% 98.91%      5.42s  1.48%  go.etcd.io/bbolt.(*DB).Update
     0.02s 0.0055% 98.92%    235.70s 64.54%  internal/poll.(*FD).Read
     0.02s 0.0055% 98.92%    235.59s 64.51%  syscall.Read
(pprof) top10 -cum
Showing nodes accounting for 361.12s, 98.88% of 365.21s total
Dropped 399 nodes (cum <= 1.83s)
Showing top 10 nodes out of 37
      flat  flat%   sum%        cum   cum%
     0.03s 0.0082% 0.0082%    361.17s 98.89%  syscall.SyscallN
   361.01s 98.85% 98.86%    361.13s 98.88%  runtime.cgocall
         0     0% 98.86%    236.28s 64.70%  io.Copy (inline)
     0.05s 0.014% 98.87%    236.28s 64.70%  io.copyBuffer
         0     0% 98.87%    236.28s 64.70%  os.(*File).WriteTo
         0     0% 98.87%    236.28s 64.70%  os.genericWriteTo
         0     0% 98.87%    236.28s 64.70%  os/exec.(*Cmd).Start.func2
         0     0% 98.87%    236.28s 64.70%  os/exec.(*Cmd).writerDescriptor.func1
         0     0% 98.87%    235.73s 64.55%  os.(*File).Read
     0.03s 0.0082% 98.88%    235.72s 64.54%  os.(*File).read (inline)
(pprof) web

[image: pprof web call-graph visualization]

Maybe someone else will have more insight into how to interpret this.


I tried one more thing to rule out some kind of performance issue in the Go stdlib: I created a program that prints a bunch of lines of text, and another program that execs the first program, reads its output using bufio.NewScanner, and prints the lines out again. I then used the Measure-Command { } cmdlet in PowerShell to time the two programs. They each ran in 30 seconds. You can see the repo here: https://github.com/johnmaguire/test-wrap/tree/main
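Conceptually, the wrapper half of that experiment looks roughly like this (a sketch, not the exact code from the test-wrap repo; printer.exe stands in for the child program that prints the lines):

```go
// Rough sketch of the wrapper side of the experiment: exec a child program
// and echo its stdout line by line through bufio.Scanner.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("printer.exe") // hypothetical child that prints many lines
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		fmt.Println(scanner.Text()) // re-print each line, as the experiment did
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}
```

Both the child and the wrapper were then timed with Measure-Command in PowerShell, as described above.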

@garethgeorge
Owner

garethgeorge commented Apr 7, 2024

Really interesting results re: the cumulative benchmarking, and thanks for the detailed investigation. I'm going to take a deeper look at that chart tomorrow, but leaving some initial impressions: I'm surprised the callback is taking that long, but in retrospect it makes sense that it might.

It should be safe and is probably the right idea to fork this update out into a goroutine.

Edit: I'm remembering that this is already debounced, so it's pretty interesting that it's still taking ~5 seconds of the total runtime of the benchmark.

@garethgeorge
Owner

garethgeorge commented Apr 8, 2024

Implemented 6c4bada based on the benchmarking results; it should be a decent speed boost.

As for the rest of the results, I'm having a hard time telling whether there's anything immediately obvious on the Go side that would be slowing things down. It looks like the vast majority of the time is simply spent in syscalls waiting on output from the restic process.

I do wonder if the goroutine issue you found had an outsized performance penalty. It may have essentially blocked restic from writing logs until the fsync on the oplog finished, which could cause the disk head to thrash. But that's a bit of a reach; I'm curious whether those numbers are from an HDD or an SSD?

@johnmaguire
Contributor

johnmaguire commented Apr 12, 2024

FYI, I'm still working on this as I get time. I did some inconclusive benchmarking the other night once I got my Windows environment set up a little more fully, but I then realized I was skipping over some of the important code (such as the callback). I need to run some more experiments before reporting what I saw.

However, I had a question about 6c4bada. My understanding is that we are trying to move the call to orchestrator.OpLog.Update into a goroutine, so that if it is slow, it doesn't block the main read loop.

I see that the code has a sync.WaitGroup created before starting the backup, and on each progress entry, we block on the WG completing. Since we add 1 to the WG before starting the goroutine, and remove it only when the goroutine is exiting, do we have the same behavior as before that the goroutine is blocking the read loop?

I will try to run some benchmarks specifically to see if that commit improved the speed at all. On the day I ran pprof testing, I did create similar code, but without the WaitGroup, and I didn't see any immediate errors - but I'm not familiar enough with BoltDB or the data you're saving in this callback to understand whether it's important that these calls be serialized.

@dhilgarth
Contributor Author

I just tested v0.17.1 and there is no change at all in the execution time. The backup from 08:12:12 PM is v0.17.1; the ones below are from v0.14.0, I think (I hadn't updated in a few weeks):

[screenshot: Backrest backup history showing the run durations]

@garethgeorge
Owner

garethgeorge commented Apr 12, 2024

I see that the code has a sync.WaitGroup created before starting the backup, and on each progress entry, we block on the WG completing. Since we add 1 to the WG before starting the goroutine, and remove it only when the goroutine is exiting, do we have the same behavior as before that the goroutine is blocking the read loop?

Right -- the intent of the fix is to buffer at most one write at a time, running in a goroutine that doesn't block the callback.

So we're spawning a goroutine and, if the callback fires again, it will block while waiting for the previously created goroutine to finish its write. Since we debounce the callback to fire at most once per second, my thinking is this should be plenty of time for the oplog update and the network IO it triggers to run in most cases. If the system is sufficiently bogged down that the updates are taking more than a second, we could instead consider dropping them on the floor, but for now we still throttle in that case.
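Roughly, the pattern looks like this (a sketch of the idea with illustrative names, not the literal code from the commit):

```go
// Sketch of the "at most one in-flight oplog write" pattern: a new progress
// event only blocks if the previous write hasn't finished yet.
package main

import (
	"fmt"
	"sync"
	"time"
)

// ProgressEntry is a stand-in for Backrest's progress event type.
type ProgressEntry struct{ PercentDone float64 }

// updateOplog simulates a slow operation-log write.
func updateOplog(e ProgressEntry) {
	time.Sleep(200 * time.Millisecond)
	fmt.Printf("persisted %.0f%%\n", e.PercentDone*100)
}

func main() {
	var wg sync.WaitGroup

	onProgress := func(e ProgressEntry) {
		wg.Wait() // block only if the previous update is still running
		wg.Add(1)
		go func() {
			defer wg.Done()
			updateOplog(e)
		}()
	}

	for i := 0; i <= 4; i++ {
		onProgress(ProgressEntry{PercentDone: float64(i) / 4})
		time.Sleep(1 * time.Second) // events arrive debounced, roughly 1s apart
	}
	wg.Wait() // flush the last write before exiting
}
```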

WDYT?

I will try to run some benchmarks specifically to see if that commit improved the speed at all. On the day I ran pprof testing, I did create similar code, but without the WaitGroup, and I didn't see any immediate errors - but I'm not familiar enough with BoltDB or the data you're saving in this callback to understand whether it's important that these calls be serialized.

Serialization of the calls to boltdb isn't important; the operation log implementation is thread safe :)

I'll be very interested to see your new benchmarking results & also give some thought to where else CPU cycles might be going. My expectation is that for devices with relatively slow disk IO there should be a small performance improvement. 4e2bf1f also fixed a case where a backpressured connection could possibly cause backups to hang entirely (though I've never seen a report of this happening in the wild).

@garethgeorge
Owner

garethgeorge commented Apr 12, 2024

I just tested v17.1 and there is no change at all in the execution time. The backup from 08:12:12 PM is v17.1, the ones below are v14.0 I think (haven't updated in a few weeks):

Hmm, emphasis on the fact that the revision is only a modest performance uplift. I think it may be sitting within the sample variance (or it's simply a non-factor if you're storing the oplog on a different disk from your backup dataset). We definitely need to keep looking at other places the slowdown could be coming from.

One thing to note is that restic itself is very chatty during backup operations in JSON mode. A possible fix we can attempt here is an upstream change to restic that reduces the amount of IO pressure it generates during backup runs. It's not uncommon for restic to output many megabytes of JSON during the course of a backup -- though it's still surprising that reading this across process boundaries would cause the slowdown we're seeing.

@dhilgarth
Contributor Author

dhilgarth commented Apr 13, 2024

Hmm, emphasis on the fact that the fix is a modest performance uplift

Not sure I understand? There is really no difference in performance:
v0.17.1: 7m46s
v0.14.0: between 7m10s and 7m47s

I don't know Go, unfortunately. In C#, I would just put the JSON string into a concurrent queue and have a completely separate thread go through the queue, parsing and handling the JSON.
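(For reference, the rough Go equivalent of that C# pattern is a buffered channel drained by a dedicated consumer goroutine; a minimal sketch with illustrative names, not Backrest's actual code:)

```go
// Minimal sketch of the "concurrent queue + separate parsing thread" idea in
// Go: the reader pushes raw JSON lines into a buffered channel and a consumer
// goroutine parses and handles them off the hot read path.
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

func main() {
	lines := make(chan []byte, 1024) // bounded queue between reader and parser

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // consumer: parse and handle away from the read loop
		defer wg.Done()
		for line := range lines {
			var ev map[string]any
			if err := json.Unmarshal(line, &ev); err != nil {
				continue
			}
			fmt.Println("handled", ev["message_type"])
		}
	}()

	// producer: in the real code this would be the loop reading restic's stdout
	lines <- []byte(`{"message_type":"status","percent_done":0.5}`)
	lines <- []byte(`{"message_type":"summary","total_duration":197.18}`)
	close(lines)
	wg.Wait()
}
```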

garethgeorge changed the title from "Restic CLI is faster than Backrest" to "Restic CLI is faster than Backrest on Windows" on Apr 13, 2024
@garethgeorge
Owner

garethgeorge commented Apr 13, 2024

Backrest is already using a separate thread to handle output from restic. It's still fairly unclear from the results we have so far where the performance difference is coming from. From what I can tell, this is a Windows-only performance issue.

In 6c0b47d I've introduced a buffered pipe between the restic process and the output parsing logic. I've also added a benchmark to the restic package which we can hopefully use to measure changes going forward. I'll try to post some results from my Windows system with and without the buffering.
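Conceptually, the buffered pipe decouples the two sides like this (a simplified sketch, not the actual implementation): one goroutine drains the child's stdout into an in-memory queue as fast as the OS delivers it, and the parser consumes from that queue at its own pace.

```go
// Sketch of a buffered pipe: a background goroutine drains src into a queue
// of chunks so a slow consumer doesn't back-pressure the producer (beyond
// the queue's memory budget). Illustrative only.
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

type chunkBufferedReader struct {
	chunks <-chan []byte
	cur    []byte
}

func newChunkBufferedReader(src io.Reader) *chunkBufferedReader {
	ch := make(chan []byte, 4096) // roughly 4096 * 32 KiB before blocking
	go func() {
		defer close(ch)
		buf := make([]byte, 32*1024)
		for {
			n, err := src.Read(buf)
			if n > 0 {
				ch <- append([]byte(nil), buf[:n]...) // copy the chunk into the queue
			}
			if err != nil {
				return
			}
		}
	}()
	return &chunkBufferedReader{chunks: ch}
}

func (r *chunkBufferedReader) Read(p []byte) (int, error) {
	for len(r.cur) == 0 {
		chunk, ok := <-r.chunks
		if !ok {
			return 0, io.EOF
		}
		r.cur = chunk
	}
	n := copy(p, r.cur)
	r.cur = r.cur[n:]
	return n, nil
}

func main() {
	// In the real code src would be the restic process's stdout pipe.
	src := strings.NewReader(`{"message_type":"status"}` + "\n" + `{"message_type":"summary"}` + "\n")
	scanner := bufio.NewScanner(newChunkBufferedReader(src))
	for scanner.Scan() {
		fmt.Println("line:", scanner.Text())
	}
}
```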

Edit: if you're interested in giving a new build with the buffering changes a go, see the snapshot builds from https://github.com/garethgeorge/backrest/actions/runs/8672450469

@johnmaguire
Contributor

johnmaguire commented Apr 16, 2024

@garethgeorge Just an FYI, I created a similar benchmark to the one in your commit, and I think I was seeing a 2m backup time, where I had seen a 3m backup time in the UI. But my memory is fuzzy, so I want to verify this. I had hinted at this here:

I did some inconclusive benchmarking the other night once I got my Windows environment set up a little more fully, but I then realized I was skipping over some of the important code (such as the [progress] callback). I need to run some more experiments before reporting what I saw.

I'll try to get some time here to run some side-by-side tests again soon. I'd like to write a benchmark that clearly demonstrates the issue (even better if we can run it in CI) - I'd hate to add a bunch of code around output capturing that ultimately doesn't improve anything.

@garethgeorge
Owner

garethgeorge commented Apr 17, 2024

I'll try to get some time here to run some side-by-side tests again soon. I'd like to write a benchmark that clearly demonstrates the issue (even better if we can run it in CI) - I'd hate to add a bunch of code around output capturing that ultimately doesn't improve anything.

Very much agree that this is probably the most important step in tracking this down. I forgot to collect my results to post here, but I ran through my new benchmark tests this weekend and wasn't really able to see much change in performance when disabling output handling, or even when simulating additional load in the callback. Restic appears to run fully in parallel with the callback handling.

I think what I need to set up here (or would love to see, if anyone has some bandwidth) is the following; a rough sketch of a harness for the Go-based cases follows the list:

  • a standard benchmark dataset e.g. a directory of a million small files or similar
  • backup times running restic through a .bat script
  • backup times running restic from Go with no stdout capture
  • backup times running restic from Go with Backrest's stdout processing.
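Here's roughly what I have in mind for the Go-based harness (a sketch only; the dataset path, repo path, and environment values are placeholders, and it assumes a restic binary on PATH plus an already-initialized local repo):

```go
// Rough benchmark harness sketch: run restic against a fixed dataset, once
// discarding stdout and once scanning it line by line, so the two Go-based
// cases above can be compared with `go test -bench .`. Placeholder paths.
package resticbench

import (
	"bufio"
	"io"
	"os"
	"os/exec"
	"testing"
)

func runBackup(b *testing.B, consume bool) {
	cmd := exec.Command("restic", "backup", "--json", `C:\bench\dataset`)
	cmd.Env = append(os.Environ(),
		`RESTIC_REPOSITORY=C:\bench\repo`, // placeholder pre-initialized local repo
		"RESTIC_PASSWORD=benchmark")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		b.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		b.Fatal(err)
	}
	if consume {
		scanner := bufio.NewScanner(stdout)
		scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
		for scanner.Scan() {
			// a fuller harness would json.Unmarshal and invoke the callback here
		}
	} else {
		_, _ = io.Copy(io.Discard, stdout) // no line-by-line processing at all
	}
	if err := cmd.Wait(); err != nil {
		b.Fatal(err)
	}
}

func BenchmarkBackupNoCapture(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runBackup(b, false)
	}
}

func BenchmarkBackupWithCapture(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runBackup(b, true)
	}
}
```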

So far I've compared the latter two cases and, running as administrator, I don't think I saw a difference on my system, but my Windows knowledge is pretty limited (I primarily program and work on Linux), so it's entirely possible I'm missing something here!

The profiling and results posted so far have been really helpful (and motivating to keep digging on this issue :) ).

I'd hate to add a bunch of code around output capturing that ultimately doesn't improve anything.

Agreed. I think adding the buffered pipe is the last "shot in the dark" type change I'll make to address this (and it didn't seem to have much impact), but I find some peace of mind in knowing that there's explicit buffering eliminating backpressure between backrest (consuming bytes) and restic (producing bytes), outside of whatever buffering the kernel is doing.

Let's try to lock down some real profiling data or repros to make a targeted fix, especially since it sounds like you've been able to reproduce the problem in your Windows VM.
