Skip to content

(4.x.x) Deduplicating BLOB Store#2314

Merged
duncdrum merged 20 commits intoeXist-db:develop-4.x.xfrom
adamretter:feature/binary-dedup-4.x.x
Feb 10, 2019
Merged

(4.x.x) Deduplicating BLOB Store#2314
duncdrum merged 20 commits intoeXist-db:develop-4.x.xfrom
adamretter:feature/binary-dedup-4.x.x

Conversation

@adamretter
Copy link
Contributor

@adamretter adamretter commented Nov 27, 2018

This PR provides a new Deduplicating BLOB Store to eXist-db, the design and new features are explained in my blog here: https://blog.adamretter.org.uk/blob-deduplication/

Binaries of eXist-db 4.5.0 patched with this new BLOB Store can be downloaded for testing from: http://static.adamretter.org.uk/blob-dedup/

NOTE: This PR increments the storage format versions of the collections.dbx and Journal files. So a full Backup and Restore is required from previous versions of eXist-db.

In addition this PR adds the following features:

  • Updates eXist-db to use the FasterXML Java UUID Generator.
  • For Binary Documents, the digest and file size are now available in the XML-RPC and XML:DB APIs.
  • For Binary Documents, the digest and file size now appear in the properties dialog of the Java Admin Client.
  • Adds Binary deduplication to database Backup and Restore, this is disabled by default but can be enabled via a flag or command line option. This option is also exposed in the Java Admin Client as a checkbox.
  • Adds the XQuery function util:binary-doc-content-digest which calculates a digest for a Binary Document. It is also optimised to just retrieve the digest if the digest type matches that used for Binary deduplication (e.g. BLAKE2B-256).

NOTE: This PR first requires:

  1. (4.x.x) New implementation of the Transaction Manager #2305
  2. (4.x.x) Copying an XML Resource must remove the nodes of any XML resource it overwrites! #2310

@adamretter
Copy link
Contributor Author

@dizzzz @wolfgangmm could you please review?

@dizzzz
Copy link
Member

dizzzz commented Dec 26, 2018

@adamretter yes I will do


JComboBox collections;
JTextField backupTarget;
JCheckBox deduplicateBlobs;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this probably needs some documentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Member

@dizzzz dizzzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the proof is in the pudding ; I am impressed by the work, there might be some documentation needed on the new backup parameter. I hope the performance is as good or better than it was. I shared ideas on (1) nr of files in a directory [performance, inodes] and (2) compression of the blobs and a concern regarding virus scanners on windows hosts (corporate policy)

@dizzzz
Copy link
Member

dizzzz commented Dec 26, 2018

one question: if one removes a blob file (e.g. virus scanner) is this detected? @adamretter

@adamretter
Copy link
Contributor Author

@dizzzz If the file is not available an exception will be thrown and logged.

@duncdrum duncdrum added the needs documentation Signals issues or PRs that will require an update to the documentation repo label Jan 5, 2019
@wolfgangmm
Copy link
Member

@adamretter could you rebase? Looks all good to me code wise, but would test some more.

@dizzzz
Copy link
Member

dizzzz commented Jan 10, 2019

The two pre-requisites are pulled in.....

@adamretter adamretter force-pushed the feature/binary-dedup-4.x.x branch from 12b6754 to 15145e4 Compare February 5, 2019 00:30
@adamretter
Copy link
Contributor Author

@wolfgangmm Okay it is now rebased as you requested.

@joewiz
Copy link
Member

joewiz commented Feb 6, 2019

With approvals from Dannes and Wolfgang, we just need to address the failing appveyor tests.

@dizzzz
Copy link
Member

dizzzz commented Feb 6, 2019

in only one test scenario, an upload fails.... time out ; no existdb issue afai can see

image

@joewiz
Copy link
Member

joewiz commented Feb 7, 2019

Do the appveyor failures in the other two ports also seem harmless? They were about unresolved dependencies on de.betterform#betterform-exist;5.1-SNAPSHOT-20160615. See #2460 (comment) and #2461 (comment).

@joewiz
Copy link
Member

joewiz commented Feb 9, 2019

Looks like this needs to be rebased now that #2456 was merged first.

@joewiz
Copy link
Member

joewiz commented Feb 9, 2019

I should note that the other ports (#2460 and #2461) have been merged (!), so at the moment our branches are not aligned. We should address this ASAP, especially before any releases.

@dizzzz
Copy link
Member

dizzzz commented Feb 9, 2019

I dont understand... all PRs should be in right now. Is this a GitHub issue?

@joewiz
Copy link
Member

joewiz commented Feb 9, 2019

@dizzzz I am also unclear why this PR shows a conflict, while the already-merged associated ports (#2460 and #2461) do not. The conflict here, though, is certainly due to the switch in locations of the extensions in #2456 - but that also had ports (#2463 and #2464). Resolving the conflict should be a matter of moving them over to the location that #2456 put them in, I guess.

@ljo
Copy link
Member

ljo commented Feb 9, 2019

@joewiz yes, sorry, moving the files should do it. Regarding the appveyor failure it is not a code problem, but something in appveyor. We also had problems reported by travis which were not there on subsequent runs.

…rmat as we now have the Blob Store instead of filesystem binary storage
…we now have the Blob Store instead of filesystem binary storage
…ment util:binary-doc-content-digest($binary-resource, $algorithm)
@adamretter adamretter force-pushed the feature/binary-dedup-4.x.x branch from 15145e4 to 4ebaad2 Compare February 10, 2019 12:15
@duncdrum duncdrum merged commit 98f4aa2 into eXist-db:develop-4.x.x Feb 10, 2019
@joewiz
Copy link
Member

joewiz commented Feb 12, 2019

@adamretter Will this help with the problem described in https://stackoverflow.com/q/54640371/659732? If so, we could chime in and say that the forthcoming release will include the feature...

@adamretter
Copy link
Contributor Author

@joewiz Unrelated I am afraid.

@joewiz
Copy link
Member

joewiz commented Feb 12, 2019

@adamretter Ok, thanks!

@adamretter adamretter deleted the feature/binary-dedup-4.x.x branch April 8, 2019 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement new features, suggestions, etc. needs documentation Signals issues or PRs that will require an update to the documentation repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants