Adding a feature to help with server side data integrity verification #330

Open
kevinvinv opened this issue Jan 16, 2018 · 13 comments

@kevinvinv commented Jan 16, 2018

On Dec 19, 2017, at 7:10 PM, Gilbert (Gang) Chen gchen@acrosync.com wrote:

Hi, Kevin,

Thank you for your support!

Yes, I think that is a good idea. The implementation can be very simple -- after the snapshot file, say, 'snapshots/test/1', has been uploaded, upload another file named 'snapshots/test/1.chunks', which contains the names of all chunks referenced by '1'.

<><><><><><>

On Mon, Dec 18, 2017 at 10:53 PM, Kevin Vannorsdel kv@vannorsdel.com wrote:
Hi Gilbert,

I think that I really need a way to verify that the storage contains all the necessary chunks for each snapshot. I am operating an SFTP server and about 8 family members back up to me, and I need to know that their data is reliably present.

The -check option doesn't work because I also don't want to know their passwords.

What I am thinking is that I need an unencrypted list of chunks to be uploaded with each snapshot… so that I can crawl through the list on the server side and verify the presence of every chunk.
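For illustration only, a server-side presence check over such a '1.chunks' file could be as small as the sketch below. It assumes a made-up manifest format of one hex chunk ID per line, a placeholder storage path, and the usual chunks/<first two characters>/<rest> layout of local/SFTP storages; none of this exists in Duplicacy today.

```sh
#!/bin/sh
# Hypothetical sketch: confirm every chunk listed in an unencrypted
# 'snapshots/test/1.chunks' manifest (one hex chunk ID per line) is present.
storage=/srv/duplicacy-storage            # placeholder storage root
manifest="$storage/snapshots/test/1.chunks"
missing=0

while read -r chunk; do
    [ -z "$chunk" ] && continue
    # Assumes chunks are stored as chunks/<first two chars>/<remaining chars>.
    path="$storage/chunks/$(echo "$chunk" | cut -c1-2)/$(echo "$chunk" | cut -c3-)"
    if [ ! -f "$path" ]; then
        echo "missing chunk: $chunk"
        missing=$((missing + 1))
    fi
done < "$manifest"

echo "$missing missing chunk(s)"
```

Presence alone says nothing about corruption; the later comments in this thread discuss adding sizes and hashes.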

@jt70471 (Contributor) commented Jan 17, 2018

You should have your "clients" perform duplicacy checks on their end. My scripts will check, back up, copy, prune, and check again... all automated.

@kevinvinv (Author) commented Jan 17, 2018

Care to share? :)

I am concerned about client-side checking for two reasons:

  1. Bandwidth required over the internet for remote users
  2. Cluelessness of my clients in general

Perhaps your scripts address these issues?

@jt70471 (Contributor) commented Jan 17, 2018

Check is not a bandwidth hog; it certainly uses less than running the backups from the client to your server in the first place. As for the cluelessness of your clients, I presume you set up duplicacy for them, and I should add that I don't use the GUI...

The backup script is kicked off by cron, and it runs the copy and prune scripts if the backup is successful, etc. I'm not sure which OS you're running; if Linux or Mac, it should be relatively easy to modify for your environment, but if Windows, all bets are off... :)

In my environment, I back up to storage location nas02, then copy to usb01, and copy off-site to Backblaze B2.

scripts.zip
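The attached scripts aside, the check/backup/copy/prune chain described above boils down to something like the following minimal sketch. The storage names nas02, usb01, and b2 come from the comment; the repository path and the -keep retention values are placeholders.

```sh
#!/bin/sh
# Illustrative only -- not the attached scripts. Run the chain and stop at
# the first failing step.
cd /path/to/repository || exit 1          # placeholder repository path

duplicacy check -storage nas02                         || exit 1
duplicacy backup -storage nas02                        || exit 1
duplicacy copy -from nas02 -to usb01                   || exit 1
duplicacy copy -from nas02 -to b2                      || exit 1
duplicacy prune -keep 0:360 -keep 7:30 -storage nas02  || exit 1
duplicacy check -storage nas02                         || exit 1
```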

@williamgs commented Jan 18, 2018

@jt70471 Thank you for sharing this!
As a side note, @TheBestPessimist has kindly shared some scripts for Windows in this thread.

Back on topic: I'm not saying a server-side check wouldn't be a good idea; I don't know.
Presumably each client has the credentials/keyring to run a backup job, so maybe you could set up a post-backup script that emails you the results of a "duplicacy check" rather than sharing their credentials?
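A rough sketch of that idea, assuming a working mail command on each client; the repository path and email address are placeholders:

```sh
#!/bin/sh
# Post-backup report: run a check on the client and mail only the result to
# the storage operator, so no credentials ever leave the client.
cd /path/to/repository || exit 1          # placeholder repository path
log=$(mktemp)

if duplicacy check > "$log" 2>&1; then
    subject="duplicacy check OK on $(hostname)"
else
    subject="duplicacy check FAILED on $(hostname)"
fi

mail -s "$subject" admin@example.com < "$log"
rm -f "$log"
```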

@kevinvinv (Author) commented Jan 18, 2018

Thanks!! VERY kind of you.

I am using the CLI, and yeah, I set it up for everyone. I have actually found that check is a HUGE time consumer. Even when I run it locally on my NAS, it can take a number of days to check 300 GB. You don't think check downloads each chunk to do the verify? I assumed it did.

My concern with the cluelessness of my clients is that I doubt they would even tell me if a script reported a check failure... so I wanted a way to do it on my side...

@jt70471 (Contributor) commented Jan 18, 2018

@kevinvinv (Author) commented Jan 18, 2018

I forgot a detail... the -check option ONLY checks for the EXISTENCE of the chunk files in the storage... that is why it is so fast.

To check chunk integrity (hash consistency) you have to run -check -files... and that downloads each chunk and makes sure its hash is correct. This is what I want to do, and this is what obviously takes a long time.

The OP was a suggestion for a way to allow server-side PARTIAL verification of the chunk files without needing the clients' password info. It seems to me that if a chunk is present and its size is correct, then it is LIKELY intact and uncorrupted. Even better would be to include a standard hash in the lookup file so that the server-side checker could also verify it with a standard hash calculator (if this makes any sense).
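To make that concrete, suppose the lookup file carried one line per chunk of the form "<chunk id> <size in bytes> <sha256>" -- an invented format, not anything Duplicacy writes today. A server-side checker could then verify presence, size, and hash without any client passwords; the sketch below assumes GNU stat and a placeholder storage path.

```sh
#!/bin/sh
# Hypothetical lookup-file verifier: "<chunk id> <size> <sha256>" per line.
# Presence and size are nearly free; the sha256 costs one local read per chunk.
storage=/srv/duplicacy-storage            # placeholder storage root
manifest="$storage/snapshots/test/1.chunks"

while read -r chunk size hash; do
    path="$storage/chunks/$(echo "$chunk" | cut -c1-2)/$(echo "$chunk" | cut -c3-)"
    if [ ! -f "$path" ]; then
        echo "MISSING        $chunk"
    elif [ "$(stat -c %s "$path")" != "$size" ]; then
        echo "SIZE MISMATCH  $chunk"
    elif [ "$(sha256sum "$path" | cut -d ' ' -f 1)" != "$hash" ]; then
        echo "HASH MISMATCH  $chunk"
    fi
done < "$manifest"
```

Note that the hash would have to be taken over the stored (encrypted) chunk file rather than the plaintext, otherwise the server could never recompute it.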

@TheBestPessimist (Contributor) commented Jan 18, 2018

@jt70471, would you be interested in sending me a pull request with your Linux scripts here: duplicacy utils?

I intend to make a wiki page here with scripts and utils (approved by @gilbertchen, of course), and it would be useful to have versions of the automation scripts for all OSes.

@jt70471 (Contributor) commented Jan 18, 2018

Hi, the scripts as written are too specific to my setup to be of much value to a wide audience. I have hardcoded hostnames, storage locations, etc. Time permitting, I'll rework the scripts to make them more generic, so that arguments can be passed to the script, etc.

@jt70471 (Contributor) commented Jan 18, 2018

@kevinvinv, agreed that check with -files performs much differently than without it. I use check without the -files parameter. I'd only use -files if you're worried about bit rot or similar, and that's the reason I have three copies of the backed-up data: two on-site and one off-site.

@TowerBR commented Jan 18, 2018

@jt70471, I found your scripts very interesting, especially these parts:

${duplicacy} ${global_args} prune ${prune_args} -storage nas02
rc=$?

"rc" stands for return code?

@jt70471 (Contributor) commented Jan 19, 2018

Yes, rc stands for return code. The scripts were not written for general use; if I have time, I'll rewrite them so that you can pass arguments to the script rather than hardcoding storage locations, etc.
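For anyone following along, the pattern in the excerpt reduces to the snippet below; $? is the exit status of the last command, and the -keep values here merely stand in for the script's ${prune_args}.

```sh
duplicacy prune -keep 0:360 -storage nas02
rc=$?                                     # $? = exit status of the last command
if [ "$rc" -ne 0 ]; then
    echo "prune failed with exit code $rc" >&2
    exit "$rc"
fi
```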

@markfeit (Contributor) commented Feb 28, 2018

The presence or absence of chunks in one snapshot versus another could be used to do some basic traffic analysis of the amount of file activity on a source machine, and that is something I wouldn't want revealed to anyone not holding the decryption key for the backup.

One alternative would be for Duplicacy to write a sidecar file alongside each chunk containing a hash of the chunk. That's all the information something on the storage side would need to verify the integrity of the chunks it has stored. If the server is satisfied that the chunks it stores are intact, the clients can continue doing the chunks-present check. Passing both would mean the snapshot set is restorable.
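As a minimal sketch of that alternative, suppose every chunk upload also wrote a neighbouring "<chunk>.sha256" file containing the hash of the stored (still encrypted) chunk; this is not an existing Duplicacy feature. The storage host could then verify its chunks without holding any keys:

```sh
#!/bin/sh
# Sidecar-hash sketch: compare each stored chunk against its hypothetical
# "<chunk>.sha256" sidecar; no decryption keys are needed.
storage=/srv/duplicacy-storage            # placeholder storage root

find "$storage/chunks" -type f ! -name '*.sha256' | while read -r chunk; do
    sidecar="$chunk.sha256"
    if [ ! -f "$sidecar" ]; then
        echo "no sidecar:    $chunk"
    elif [ "$(sha256sum "$chunk" | cut -d ' ' -f 1)" != "$(cat "$sidecar")" ]; then
        echo "corrupt chunk: $chunk"
    fi
done
```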
