added content-drift analysis

jhpoelen committed Feb 6, 2019
1 parent 34b84a6 commit 3d5c37be7163301e653334244e30c316b46fe145

# Use Cases

* [`Detecting Linkrot`](#detecting-linkrot)
* [`Content Drift`](#content-drift)


## Detecting Linkrot

[Linkrot](https://en.wikipedia.org/wiki/Link_rot) is a well-documented and frequently occurring phenomenon in which the content associated with a link becomes permanently unavailable.

URLs are used to access content. While the URLs might be static, the content often is not. High turnover or error rates in the content linked to by a URL can be a sign of an unstable, actively maintained, or randomly changing dataset. Since Preston continuously tracks URLs and their content, we can use its output, a biodiversity dataset graph, to detect linkrot (and content drift).

### Reproduce
If you'd like to reproduce the results below, please:
Issues opened following this analysis:
1. https://github.com/gbif/ipt/issues/1427 - expected http status 404 on missing dataset, but got 302.
1. https://github.com/gbif/ipt/issues/1428 - only a changed archive/dataset should result in a new version


## Content Drift

Preston tracked the iDigBio, GBIF, and BioCASe networks over the period Sept 2018 - Jan 2019. Each registry, dataset, DwC-A, or EML file was downloaded into content-addressed storage. In addition, logs of the discovery and download processes were captured in the Preston crawl history as nquads. This history contains the registry URLs, meta-data URLs, and dataset URLs, as well as references to the content that was retrieved from these URLs. These content references are expressed as sha256 hashes, which uniquely represent the downloaded content: if two files have the same sha256 hash, they contain the same byte-by-byte content. A distinction is made between binary content drift and semantic content drift. The former can be detected by comparing content hashes like sha256: even if all bytes except one are the same across two files, their content hashes will be very different. The latter, semantic content drift, compares content retrieved from the same URI based on the (relevant) information it contains.
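The hash-based identity check described above can be sketched in a few lines of Python. This is an illustrative sketch, not Preston's actual code; the `sha256_uri` helper and the sample byte strings are made up, but the `hash://sha256/...` URI shape matches the references used below.

```python
import hashlib

def sha256_uri(data: bytes) -> str:
    """Return a Preston-style content reference for a blob of bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

a = b"occurrence data v1"
b = b"occurrence data v2"  # differs from `a` by a single byte

# Identical bytes always yield the identical hash URI ...
print(sha256_uri(a) == sha256_uri(a))  # True
# ... while a single changed byte yields a completely different hash,
# so a hash comparison detects *binary* drift but says nothing about
# whether the change was semantically meaningful.
print(sha256_uri(a) == sha256_uri(b))  # False
```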

### Binary Content Drift

The graph below shows the cumulative size of tracked datasets. Due to a bug in Preston (fixed in v0.0.9), few new datasets were tracked: only the first versions of registries were used to discover datasets, so in the period Sept 2018 - Jan 2019 mostly dataset URLs from Sept 2018 were used to track content. Even without the addition of new dataset endpoints/URLs, the graph shows a positive linear relationship between time and the size of the content registry, with an almost 10x increase over a 5-month period. Anecdotal evidence suggests that this increase is unlikely to come from newly added records. Instead, the increase in size is due to binary content drift, though not necessarily semantic content drift.

<img src="https://raw.githubusercontent.com/bio-guoda/preston/20190204-size-time-cumulative.png" width="50%"/>

For instance, two versions of DwC-A served by http://www.gbif.se/ipt/archive.do?r=nrm-herpetology had content hashes hash://sha256/59f32445a50646d923f8ba462a7d87a848632f28bd93ac579de210e3375714de (retrieved 2018-09-05) and hash://sha256/a7e64e7a64fdbceb35d92427b722c20456a127fc7422219f43215c5654e9b80b (retrieved 2018-09-16) respectively. However, close inspection of the content of the zip files shows:

```
$ unzip -l dwca.zip.old
Archive: dwca.zip.old
Length Date Time Name
--------- ---------- ----- ----
2653386 2018-08-28 10:44 occurrence.txt
5128 2018-08-28 10:44 eml.xml
10749 2018-08-28 10:44 meta.xml
--------- -------
2669263 3 files

$ unzip -l dwca.zip.new
Archive: dwca.zip.new
Length Date Time Name
--------- ---------- ----- ----
2653386 2018-09-11 10:45 occurrence.txt
5128 2018-09-11 10:45 eml.xml
10749 2018-09-11 10:45 meta.xml
--------- -------
2669263 3 files
```
On even closer inspection, the only changes were a packageId and publication date ("pubDate") in the eml.xml files:
```
$ diff eml.xml.older eml.xml.newer
5c5
< packageId="64be02e0-0c64-11dd-84d1-b8a03c50a862/v32.142" system="http://gbif.org" scope="system"
---
> packageId="64be02e0-0c64-11dd-84d1-b8a03c50a862/v32.144" system="http://gbif.org" scope="system"
50c50
< 2018-08-28
---
> 2018-09-11
97c97
< <dc:replaces>64be02e0-0c64-11dd-84d1-b8a03c50a862/v32.142.xml</dc:replaces>
---
> <dc:replaces>64be02e0-0c64-11dd-84d1-b8a03c50a862/v32.144.xml</dc:replaces>
```
So, even though the content of meta.xml and occurrence.txt is identical, a semantically insignificant change in the eml.xml causes Preston to conclude that a new version of the dataset at http://www.gbif.se/ipt/archive.do?r=nrm-herpetology was published. Because Preston only detects binary content drift, a whole new version is stored in Preston's content-addressed file store. Multiplied across the >40k datasets in the network, a disproportionate growth in the total size of dataset versions is expected even when few records change in the datasets.

Assuming that publishing organizations are unlikely to change their publication methods anytime soon, Preston will have to get smarter about detecting changes in a dataset over time. If not, the amount of storage needed will increase rapidly, causing data mobility and storage issues. Alternatives include chunking the dataset into smaller parts (file contents, or chunks of files) or removing/ignoring parts of the content that carry no meaning (pubDates/packageIds). Chunking in combination with merkle trees is used by platforms such as git, ipfs, and dat-project to reduce communication overhead and storage requirements by detecting duplication of (chunked) content in directories and files.
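The chunking alternative mentioned above can be sketched by hashing each archive member separately instead of the whole zip: unchanged members (occurrence.txt, meta.xml) would deduplicate against blobs already in the store, and only the changed eml.xml would need new storage. This is a minimal sketch of the idea, not Preston's implementation; the in-memory archives and their contents are made-up stand-ins for the DwC-A versions shown above.

```python
import hashlib
import io
import zipfile

def member_hashes(zip_bytes: bytes) -> dict:
    """Map each archive member name to the sha256 of its uncompressed content."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: hashlib.sha256(zf.read(name)).hexdigest()
                for name in zf.namelist()}

def changed_members(old_zip: bytes, new_zip: bytes) -> list:
    """Names of members whose content differs between two archive versions."""
    old, new = member_hashes(old_zip), member_hashes(new_zip)
    return sorted(n for n in set(old) | set(new) if old.get(n) != new.get(n))

def make_zip(members: dict) -> bytes:
    """Build a small in-memory zip for demonstration purposes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()

old = make_zip({"occurrence.txt": "same records", "eml.xml": "pubDate 2018-08-28"})
new = make_zip({"occurrence.txt": "same records", "eml.xml": "pubDate 2018-09-11"})
print(changed_members(old, new))  # only eml.xml would need a new blob
```

With per-member hashing, the nrm-herpetology example above would store one new ~5 KB eml.xml instead of a new ~2.7 MB archive, which is the deduplication effect that merkle-tree-based systems like git and ipfs exploit.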
Issues opened following this analysis:
* https://github.com/bio-guoda/preston/issues/10
