
attempt to clarify wording

jhpoelen committed Mar 6, 2019
1 parent 79ef624 commit e6d3937d1b56518bc3b9c042cf07bc9d2a5a3660
Showing with 1 addition and 1 deletion.
  1. +1 −1 analysis.md
@@ -199,7 +199,7 @@ $ diff eml.xml.older eml.xml.newer

So, even though the content in the meta.xml and occurrence.txt is identical, a semantically insignificant change in the eml.xml causes Preston to think that a new version of the dataset at http://www.gbif.se/ipt/archive.do?r=nrm-herpetology was published. Because Preston only detects binary content drift, a whole new version is stored in Preston's content-addressed file store. Multiplied across the >40k datasets in the network, this behavior leads to disproportionate growth of the total size of stored dataset versions even when few records change within the datasets. Assuming that publishing organizations like gbif.se are unlikely to change their publication methods anytime soon, Preston will have to get smarter about detecting changes in a dataset over time. If not, the amount of storage needed will increase rapidly, causing data mobility and storage issues. Alternatives include chunking the dataset into smaller parts (file contents, or chunks of files) or removing/ignoring parts of the content that carry no meaning (pubDates/packageIds). Chunking in combination with Merkle trees is used by platforms such as git, ipfs and dat-project to reduce communication overhead and storage requirements by detecting duplication of (chunked) content across directories and files.
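To illustrate the second alternative, the sketch below computes a "semantic" hash of an eml.xml by blanking out the packageId and pubDate before hashing, so a republication like the one diffed above would map to the same hash. This is only a sketch of the idea, not Preston's implementation; the file names follow the eml.xml.older/eml.xml.newer example above.

```python
# Sketch: a "semantic" content hash for eml.xml that ignores fields which
# change on every (re)publication, such as packageId and pubDate.
# Illustration only, not how Preston computes content hashes.
import hashlib
import re

def semantic_hash(eml_text):
    # Blank out the packageId attribute and the pubDate element before
    # hashing, so republications with identical records map to the same hash.
    text = re.sub(r'packageId="[^"]*"', 'packageId=""', eml_text)
    text = re.sub(r'<pubDate>[^<]*</pubDate>', '<pubDate/>', text)
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

if __name__ == '__main__':
    with open('eml.xml.older', encoding='utf-8') as f:
        older = f.read()
    with open('eml.xml.newer', encoding='utf-8') as f:
        newer = f.read()
    # Equal semantic hashes would mean the republished eml.xml carries no
    # meaningful change, so no new dataset version would need to be stored.
    print(semantic_hash(older) == semantic_hash(newer))
```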

- A naive estimate of duplication similar to the example above (e.g., same file size, but different content hash) leads to an upper bound of 26GB of content with duplicates out of ~283GB. The estimate assumed that datasets stored by Preston with the same file size are likely duplicates. This naive duplication estimate suggests that introducing a semantic content hash would only reduce the amount of unique content by about 10% (ed. 2019-03-05 - later analysis showed a similar estimate of 24GB of non-unique content, out of a total of 185GB, for the bzip2-compressed content of unpacked dwca's). The average size of files with unique (non-repeated) hashes was 7MB, while files with repeated hashes averaged ~100KB in size. This observation is consistent with anecdotal evidence that larger collections tend to be more actively managed, leading to more updates to larger archives/datasets relative to smaller datasets.
+ A naive estimate of duplication similar to the example above (e.g., same file size, but different content hash) leads to an upper bound of 26GB of content with duplicates out of ~283GB. The estimate assumed that datasets stored by Preston with the same file size are likely duplicates. This naive duplication estimate suggests that introducing a semantic content hash would only reduce the amount of unique content by about 10% (ed. 2019-03-05 - later analysis found about 15% duplicate files, or 24GB of duplicate files out of a total of 185GB, excluding meta.xml and eml.xml, after unpacking all dwca's and re-compressing with bzip2). The average size of files with unique (non-repeated) hashes was 7MB, while files with repeated hashes averaged ~100KB in size. This observation is consistent with anecdotal evidence that larger collections tend to be more actively managed, leading to more updates to larger archives/datasets relative to smaller datasets.
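For reference, a minimal sketch of the naive estimate described above: walk the content store, group files by size, and count everything beyond the first file in each same-size group as a likely duplicate. The store location ('data') is a placeholder, and the sketch ignores hashes entirely, so it only bounds, rather than measures, duplication.

```python
# Sketch of the naive duplication estimate: files that share a file size are
# assumed to be likely duplicates, giving an upper bound on non-unique content.
# The store path and layout are assumptions for illustration.
import os
from collections import defaultdict

def estimate_duplicate_bytes(store_dir):
    # Group stored files by size.
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(store_dir):
        for name in files:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)
    # Count every file beyond the first in each same-size group as a duplicate.
    return sum(size * (len(paths) - 1)
               for size, paths in by_size.items() if len(paths) > 1)

if __name__ == '__main__':
    # 'data' stands in for the location of the content-addressed store.
    upper_bound = estimate_duplicate_bytes('data')
    print('estimated upper bound on duplicate content: %.1f GB' % (upper_bound / 1e9))
```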

More analysis is needed to understand the duplication of information in the datasets gathered in the period Sept 2018 - Jan 2019. This understanding would provide a basis for more efficient storage and processing of biodiversity datasets from the iDigBio, GBIF and BioCASe networks.
