Multiple Connections/Streams #6794

Open
Slind14 opened this issue Jun 26, 2022 · 18 comments

@Slind14

Slind14 commented Jun 26, 2022

Are there any plans for supporting multiple concurrent connections for the data transfer? Or is this already possible somehow?

Doing backups across > 1G networks is quite slow, due to the bottleneck of a single connection.
For cross-continent backups it can be even worse: a single connection won't be able to utilize a 1G link and sits at 200M max.

@ThomasWaldmann
Member

You can run multiple borg processes in parallel, backing up to one repo per process.
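
For illustration, a minimal sketch of that approach, assuming the input can be split into two independent parts (repo URLs, source paths and passphrase handling are placeholders):

```bash
# Hypothetical: two independent borg runs, each writing to its own repository,
# so their caches and locks (keyed by repo id) do not collide.
export BORG_PASSPHRASE='...'

borg create ssh://backup@example.org/./repo-part1::'{hostname}-{now}' /data/part1 &
borg create ssh://backup@example.org/./repo-part2::'{hostname}-{now}' /data/part2 &
wait  # wait for both backups to finish
```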

@Slind14
Author

Slind14 commented Jun 26, 2022

Hi Thomas,

we use it to back up data from a data warehouse. We can't split the data across multiple repos without losing consistency, I'm afraid.
Is there another option?

@ThomasWaldmann
Member

no. not being able to saturate your connection with 1 borg likely comes from internal processing being single-threaded and not internally queued.

but not sure how you ensure consistency. if you used a snapshot to get consistency, you could also run multiple borg to save the snapshot.

@ThomasWaldmann
Member

Is this the first backup you are doing or is there already data in the repo from previous backups?

@Slind14
Author

Slind14 commented Jun 26, 2022

it is not the first backup; we just got to the point where they can't complete within a day anymore.

When we use iperf3 to measure the bandwidth, we can see that a single connection only gets 100-200M while multiple connections get > 900M.

For data centers that are not on the other side of the world, we get a higher bandwidth for a single connection. So I doubt it is borg directly. Btw. borg CPU usage is always sitting at 10-20% of one core while uploading. Only when saving the file cache does it go to 100% and bandwidth to 0. The files are also quite large (multiple GB).


We do have a hardlink-based snapshot. How would we run multiple borg processes and ensure that they are not cannibalizing each other and also that we end up with a consistent backup?
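
(For reference, the kind of measurement described above can be reproduced with iperf3's parallel-stream option; the host name is a placeholder:)

```bash
# Single TCP stream - often limited by latency and TCP window size on long-haul links:
iperf3 -c backup.example.org

# Eight parallel streams - typically come much closer to saturating the link:
iperf3 -c backup.example.org -P 8
```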

@ThomasWaldmann
Member

borg manages caching, indexes and locking based on the repo id (which is unique and random). so you can run borg on the same machine, as the same user, at the same time IF you use different repos.

so you could partition your input data set and give each part to another borg.

@ThomasWaldmann
Member

also wondering why a not-first backup takes that long. does the dedup not work or is it really lots of NEW data?

@Slind14
Author

Slind14 commented Jun 26, 2022

also wondering why a not-first backup takes that long. does the dedup not work or is it really lots of NEW data?

There is more new data than 100 MBit/s can handle.

@Slind14
Author

Slind14 commented Jun 26, 2022

borg manages caching, indexes and locking based on the repo id (which is unique and random). so you can run borg on the same machine, as the same user, at the same time IF you use different repos.

so you could partition your input data set and give each part to another borg.

Unfortunately, partitioning is not possible with the way the data is stored. 90% of it is under the same directory, spread across around one million files.

@ThomasWaldmann
Member

ok.

iirc there is some --upload-buffer (or so) option, maybe you can try using that to speed it up.

you use some fast compression (default is lz4, zstd,1 .. zstd,3 would also work i guess)?
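
A hedged sketch of combining those two knobs (repo URL, archive name and the buffer size are placeholders; --upload-buffer takes a size in MiB in recent borg versions, if I recall correctly):

```bash
# Hypothetical invocation: fast zstd compression plus a larger upload buffer
# for a remote repository over ssh.
borg create \
    --compression zstd,3 \
    --upload-buffer 100 \
    ssh://backup@example.org/./repo::'{hostname}-{now}' \
    /data
```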

@ThomasWaldmann
Member

another idea is not to use different repos for partitions of the data, but for different times.

not pretty, but would work: use a different repo depending on weekday.
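
A minimal sketch of that idea (the repo base URL is a placeholder): select the repository from the day of the week before invoking borg.

```bash
# Hypothetical: one repository per weekday (1 = Monday ... 7 = Sunday).
DOW=$(date +%u)
borg create "ssh://backup@example.org/./repo-weekday-${DOW}::{hostname}-{now}" /data
```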

@Slind14
Author

Slind14 commented Jun 26, 2022

iirc there is some --upload-buffer (or so) option, maybe you can try using that to speed it up.

the data is already compressed, hence we don't use any compression


Are there any plans to support multi-connection uploads? Would it be a major change or something simple?

@Slind14
Author

Slind14 commented Jun 26, 2022

another idea is not to use different repo for partitions of the data, but for different times.

the majority of the new data is from the last 24 hours :( it is all in the same place - not really possible to split it.

@ThomasWaldmann
Member

--upload-buffer is about buffering, not compression.

@Slind14
Author

Slind14 commented Jun 26, 2022

--upload-buffer is about buffering, not compression.

Sorry I quoted the wrong line. ;)

@Slind14
Author

Slind14 commented Jun 26, 2022

Unfortunately, changing the buffer does not help.

Restic added parallel uploads not too long ago; if borg had something similar, it would be great.

restic/restic#3593
restic/restic#3513

@RonnyPfannschmidt
Contributor

with the current backend structure, multi-connection uploads are not sensibly possible, as the log-structured store is not concurrent and the encryption scheme is also not yet prepared for such a scenario

i would imagine that a major refactor would be necessary to support them

@Slind14
Author

Slind14 commented Jun 27, 2022

I see, thank you.
