Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(4.x.x) Deduplicating BLOB Store #2314

Merged
merged 20 commits into from Feb 10, 2019

Conversation

@adamretter
Copy link
Member

@adamretter adamretter commented Nov 27, 2018

This PR provides a new Deduplicating BLOB Store to eXist-db, the design and new features are explained in my blog here: https://blog.adamretter.org.uk/blob-deduplication/

Binaries of eXist-db 4.5.0 patched with this new BLOB Store can be downloaded for testing from: http://static.adamretter.org.uk/blob-dedup/

NOTE: This PR increments the storage format versions of the collections.dbx and Journal files. So a full Backup and Restore is required from previous versions of eXist-db.

In addition this PR adds the following features:

  • Updates eXist-db to use the FasterXML Java UUID Generator.
  • For Binary Documents, the digest and file size are now available in the XML-RPC and XML:DB APIs.
  • For Binary Documents, the digest and file size now appear in the properties dialog of the Java Admin Client.
  • Adds Binary deduplication to database Backup and Restore, this is disabled by default but can be enabled via a flag or command line option. This option is also exposed in the Java Admin Client as a checkbox.
  • Adds the XQuery function util:binary-doc-content-digest which calculates a digest for a Binary Document. It is also optimised to just retrieve the digest if the digest type matches that used for Binary deduplication (e.g. BLAKE2B-256).

NOTE: This PR first requires:

  1. #2305
  2. #2310
@adamretter adamretter requested review from wolfgangmm and dizzzz Nov 27, 2018
@adamretter adamretter force-pushed the adamretter:feature/binary-dedup-4.x.x branch from 0b7bb32 to 12b6754 Dec 14, 2018
@adamretter
Copy link
Member Author

@adamretter adamretter commented Dec 24, 2018

@dizzzz @wolfgangmm could you please review?

@dizzzz
Copy link
Member

@dizzzz dizzzz commented Dec 26, 2018

@adamretter yes I will do

@@ -57,6 +52,7 @@

JComboBox collections;
JTextField backupTarget;
JCheckBox deduplicateBlobs;

This comment has been minimized.

@dizzzz

dizzzz Dec 26, 2018
Member

this probably needs some documentation

This comment has been minimized.

@adamretter

adamretter Dec 26, 2018
Author Member

Sure.

@dizzzz
dizzzz approved these changes Dec 26, 2018
Copy link
Member

@dizzzz dizzzz left a comment

the proof is in the pudding ; I am impressed by the work, there might be some documentation needed on the new backup parameter. I hope the performance is as good or better than it was. I shared ideas on (1) nr of files in a directory [performance, inodes] and (2) compression of the blobs and a concern regarding virus scanners on windows hosts (corporate policy)

@dizzzz
Copy link
Member

@dizzzz dizzzz commented Dec 26, 2018

one question: if one removes a blob file (e.g. virus scanner) is this detected? @adamretter

@adamretter
Copy link
Member Author

@adamretter adamretter commented Dec 26, 2018

@dizzzz If the file is not available an exception will be thrown and logged.

@wolfgangmm
Copy link
Member

@wolfgangmm wolfgangmm commented Jan 10, 2019

@adamretter could you rebase? Looks all good to me code wise, but would test some more.

@dizzzz
Copy link
Member

@dizzzz dizzzz commented Jan 10, 2019

The two pre-requisites are pulled in.....

@adamretter adamretter force-pushed the adamretter:feature/binary-dedup-4.x.x branch from 12b6754 to 15145e4 Feb 5, 2019
@adamretter
Copy link
Member Author

@adamretter adamretter commented Feb 5, 2019

@wolfgangmm Okay it is now rebased as you requested.

@dizzzz
dizzzz approved these changes Feb 6, 2019
@joewiz
Copy link
Member

@joewiz joewiz commented Feb 6, 2019

With approvals from Dannes and Wolfgang, we just need to address the failing appveyor tests.

@dizzzz
Copy link
Member

@dizzzz dizzzz commented Feb 6, 2019

in only one test scenario, an upload fails.... time out ; no existdb issue afai can see

image

@joewiz
Copy link
Member

@joewiz joewiz commented Feb 7, 2019

Do the appveyor failures in the other two ports also seem harmless? They were about unresolved dependencies on de.betterform#betterform-exist;5.1-SNAPSHOT-20160615. See #2460 (comment) and #2461 (comment).

@joewiz
Copy link
Member

@joewiz joewiz commented Feb 9, 2019

Looks like this needs to be rebased now that #2456 was merged first.

@joewiz
Copy link
Member

@joewiz joewiz commented Feb 9, 2019

I should note that the other ports (#2460 and #2461) have been merged (!), so at the moment our branches are not aligned. We should address this ASAP, especially before any releases.

@dizzzz
Copy link
Member

@dizzzz dizzzz commented Feb 9, 2019

I dont understand... all PRs should be in right now. Is this a GitHub issue?

@joewiz
Copy link
Member

@joewiz joewiz commented Feb 9, 2019

@dizzzz I am also unclear why this PR shows a conflict, while the already-merged associated ports (#2460 and #2461) do not. The conflict here, though, is certainly due to the switch in locations of the extensions in #2456 - but that also had ports (#2463 and #2464). Resolving the conflict should be a matter of moving them over to the location that #2456 put them in, I guess.

@ljo
Copy link
Member

@ljo ljo commented Feb 9, 2019

@joewiz yes, sorry, moving the files should do it. Regarding the appveyor failure it is not a code problem, but something in appveyor. We also had problems reported by travis which were not there on subsequent runs.

adamretter added 20 commits Nov 7, 2018
…rmat as we now have the Blob Store instead of filesystem binary storage
…we now have the Blob Store instead of filesystem binary storage
…ment util:binary-doc-content-digest($binary-resource, $algorithm)
@adamretter adamretter force-pushed the adamretter:feature/binary-dedup-4.x.x branch from 15145e4 to 4ebaad2 Feb 10, 2019
@duncdrum duncdrum merged commit 98f4aa2 into eXist-db:develop-4.x.x Feb 10, 2019
2 of 3 checks passed
2 of 3 checks passed
Codacy/PR Quality Review Not up to standards. This pull request quality could be better.
Details
Travis CI - Pull Request Build Passed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
@joewiz
Copy link
Member

@joewiz joewiz commented Feb 12, 2019

@adamretter Will this help with the problem described in https://stackoverflow.com/q/54640371/659732? If so, we could chime in and say that the forthcoming release will include the feature...

@adamretter
Copy link
Member Author

@adamretter adamretter commented Feb 12, 2019

@joewiz Unrelated I am afraid.

@joewiz
Copy link
Member

@joewiz joewiz commented Feb 12, 2019

@adamretter Ok, thanks!

@adamretter adamretter deleted the adamretter:feature/binary-dedup-4.x.x branch Apr 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants