
Preston Analysis

Preston records how and when datasets are discovered and accessed as RDF/N-Quads. These crawl records, or biodiversity dataset graphs, can be loaded into triple stores like Apache Jena Fuseki for discovery and analysis.

This page contains some SPARQL queries to discover and analyze the dataset graph.

Use Cases

Detecting Linkrot

Linkrot is a well-documented and frequently occurring phenomenon in which the content associated with a link becomes permanently unavailable.

URLs are used to access content. While a URL might be static, its content often is not. A high turnover or error rate in the content linked to by a URL can be a sign of an unstable, actively maintained, or randomly changing dataset. Since Preston continuously tracks URLs and their content, we can use its output, a biodiversity dataset graph, to detect linkrot (and content drift).
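
To illustrate the idea (a minimal sketch, not Preston's actual implementation; the URLs and hashes below are made up), the change rate per URL can be derived from a list of (url, content hash) observations, where a failed retrieval is recorded as blank, here represented by None:

```python
from collections import defaultdict

# Made-up crawl observations: (url, content hash); None marks a failed access.
observations = [
    ("http://example.org/eml.xml", "hash://sha256/aaa"),
    ("http://example.org/eml.xml", None),
    ("http://example.org/eml.xml", "hash://sha256/bbb"),
    ("http://example.org/dwca.zip", "hash://sha256/ccc"),
    ("http://example.org/dwca.zip", "hash://sha256/ccc"),
]

def versions_per_url(observations):
    """Count distinct content versions seen per URL; a failure counts as a
    version of its own, mirroring Preston's "blank" content records."""
    seen = defaultdict(set)
    for url, content_hash in observations:
        seen[url].add(content_hash)
    return {url: len(hashes) for url, hashes in seen.items()}

# A high count signals an unstable (or actively changing) URL.
print(versions_per_url(observations))
```

A URL whose observations alternate between blanks and distinct hashes accumulates a high version count, which is exactly the signal the query below surfaces.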


If you'd like to reproduce the results below, please:

  1. download and install Apache Jena Fuseki
  2. start Jena Fuseki (./fuseki-server). If it runs well, you will see the administration console at http://localhost:3030
  3. download the data file
  4. using the Fuseki administration console, create a new dataset and add the data file as an in-memory resource. If this procedure fails, try decompressing the bz2 file and add its uncompressed version instead.
  5. copy the query (found below) into the SPARQL query section, leave the default options, and click the arrow at the upper right to execute the query. The results should appear at the bottom.

Change Rate of URLs

To detect the rate of change of URLs, the following query was created. It generates a list of URLs in decreasing order of change rate. Note that Preston records each failed attempt to access a URL's content as "blank" content.

SELECT DISTINCT ?url (COUNT(?url) as ?totalVersions)
WHERE {
  { ?url <> ?firstVersion }
  { ?aVersion (<>|^<>)* ?firstVersion }
  { ?aVersion <> ?generationTime }
} GROUP BY ?url ORDER BY DESC(?totalVersions)

Using Poelen, Jorrit H. (2018). A biodiversity dataset graph (Version 0.0.1) [Data set]. Zenodo, this query produced the following results:

Each of the top ten URLs in the result had a totalVersions count of 140.

Tracking the Origin of a URL

Now that we have noticed that some URLs have many versions, we'd like to understand how such a URL was discovered.

SELECT ?originUrl ?originHash ?originCollection ?dateTime
WHERE {
  { ?originCollection <> <> }
  { ?originHash ?p ?originCollection }
  { ?originHash <> ?dateTime }
  { ?originUrl <> ?x }
  { ?x (<>|^<>)* ?originHash }
}

which produced the following result:

originUrl:
originHash: hash://sha256/e285cbc69418e2847a3727ec650edfc7e1c405dc71dc9a93b859a28028e79cab
originCollection: af32ab2e-7be6-42ca-a570-ad79fe0e32bb
dateTime: 2018-09-01T19:02:16.675Z

Now we know that the result was retrieved from the GBIF registry on 2018-09-01T19:02:16.675Z with content hash hash://sha256/e285cbc69418e2847a3727ec650edfc7e1c405dc71dc9a93b859a28028e79cab. After retrieving that specific registry chunk, we notice:

      "key": "af32ab2e-7be6-42ca-a570-ad79fe0e32bb",
      "installationKey": "286d31fd-2be5-4df7-be2d-20448e158c81",
      "publishingOrganizationKey": "a30d7f59-d3d4-4e89-97dc-de9cf837f591",
      "doi": "10.15468/b95s7t",
     "endpoints": [
          "key": 260345,
          "type": "DWC_ARCHIVE",
          "url": "",
          "description": "Orphaned dataset awaiting adoption.",
          "createdBy": "MattBlissett",
          "modifiedBy": "MattBlissett",
          "created": "2018-03-06T16:14:55.983+0000",
          "modified": "2018-03-06T16:14:55.983+0000",
          "machineTags": []
          "key": 127629,
          "type": "EML",
          "url": "",
          "createdBy": "a30d7f59-d3d4-4e89-97dc-de9cf837f591",
          "modifiedBy": "a30d7f59-d3d4-4e89-97dc-de9cf837f591",
          "created": "2016-07-13T05:53:31.072+0000",
          "modified": "2016-07-13T05:53:31.072+0000",
          "machineTags": []

This seems to indicate that the DwC-A of this dataset was orphaned and (temporarily?) archived by GBIF. However, our suspicious URL, the EML file, was not orphaned or (temporarily?) archived.

On inspecting the different versions of the EML file, we find most versions are blank nodes, indicating failed attempts to retrieve the content. In one instance some content was retrieved: hash://sha256/9944f274ee46c33a577e170bb3fd85a4b824741eb7bcc18a002c8b77ca8f3e3a. This specific content turns out to be an HTML page, not the advertised EML file. The first few lines of this HTML page, with hash ending in 3e3a, look like:

<html xmlns="" xml:lang="en">
 	    <meta name="copyright" lang="en" content="GBIF" />
 		<title>IPT setup</title>
	  <link rel="stylesheet" type="text/css" media="all" href="" />
		<link rel="stylesheet" type="text/css" media="all" href="" />
		<link rel="stylesheet" type="text/css" media="all" href="" />
 		<link rel="stylesheet" type="text/css" href=""/>
 		<link rel="shortcut icon" href="" type="image/x-icon" />


Using a biodiversity dataset graph tracked by a Preston instance over the period early September to late October 2018, we were able to detect ongoing outages (or linkrot) of an EML file related to a dataset registered in the GBIF network.

Unfortunately, since Preston was not running before the EML file was orphaned/removed, we do not have a copy of it, and I am not aware of another openly available method or service to retrieve this historic content. My running theory is that the referenced (dead) link points to the original location of a now misbehaving website/service. Another theory is that the GBIF team is relocating the archive associated with the collection with id af32ab2e-7be6-42ca-a570-ad79fe0e32bb, and is in the process of setting up a new IPT (Integrated Publishing Toolkit) instance. Without a configuration history associated with the dataset/collection with key af32ab2e-7be6-42ca-a570-ad79fe0e32bb, we don't know how the configuration or associated content changed over time, simply because this content is not being tracked in an open manner.

It appears that the dataset was first published in 2016, so the longevity, or availability period, of the dataset was about two years. Extending this simple example, a more continuous and wide-scale monitoring scheme can be constructed to monitor the health of our digital datasets. Also, by employing content tracking techniques, we have an effective tool to monitor, and perhaps delay, a natural phenomenon in our digital infrastructures: linkrot.

Issues opened following this analysis:

  1. expected HTTP status 404 on a missing dataset, but got 302
  2. only a changed archive/dataset should result in a new version

Content Drift

Preston tracked the iDigBio, GBIF and BioCASe networks over the period Sept 2018 - Jan 2019. Each registry, dataset, DwC-A or EML file was downloaded into content-addressed storage. Also, a log of the discovery and download processes was captured in the Preston crawl history as N-Quads. This history contains the registry URLs, metadata URLs, and dataset URLs, as well as references to the content retrieved from those URLs. These content references are expressed as sha256 hashes, which uniquely represent the downloaded content: if two files have the same sha256 content hash, they contain the same byte-for-byte content.

A distinction is made between binary content drift and semantic content drift. The former can be detected by comparing content hashes like sha256: even though all bytes except one may be the same in two files, their content hashes will be very different. The latter, semantic content drift, compares the content of two files retrieved from the same URI based on the (relevant) information they contain.
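
The avalanche behavior of sha256 (one changed byte yields a completely different digest) can be demonstrated in a few lines of Python; the record content below is made up for illustration:

```python
import hashlib

# Two made-up records differing in exactly one byte.
data_a = b"occurrenceID,scientificName\n1,Apis mellifera\n"
data_b = b"occurrenceID,scientificName\n2,Apis mellifera\n"

hash_a = hashlib.sha256(data_a).hexdigest()
hash_b = hashlib.sha256(data_b).hexdigest()

# Preston-style content identifiers:
print("hash://sha256/" + hash_a)
print("hash://sha256/" + hash_b)

# The digests are entirely different even though only one input byte changed.
assert hash_a != hash_b
```

This is why binary content drift is cheap to detect: comparing two 64-character digests suffices, with no need to compare the files themselves.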

Binary Content Drift in GBIF, iDigBio and BioCASe networks

The graph below shows the cumulative size of ~40k tracked datasets. Due to a bug in Preston (fixed in v0.0.9), few new datasets were tracked, because only the first version of each registry was used to discover datasets. So, in the period Sept 2018 - Jan 2019, mostly dataset URLs from Sept 2018 were used to track content. Even without the addition of new dataset endpoints/URLs, the graph shows a positive linear relationship between time and the size of the tracked content, with about a 10x increase in total size over a five-month period. Anecdotal evidence suggests that this increase is unlikely to come from newly added records. Instead, the increase in size is due to binary content drift, but not necessarily semantic content drift.

The graph was produced using the output of preston ls and preston check to extract content hashes along with their associated access timestamps and file sizes.

An example of binary, but not semantic, content drift is two versions of a DwC-A served by . These dataset versions had content hashes hash://sha256/59f32445a50646d923f8ba462a7d87a848632f28bd93ac579de210e3375714de (retrieved 2018-09-05) and hash://sha256/a7e64e7a64fdbceb35d92427b722c20456a127fc7422219f43215c5654e9b80b (retrieved 2018-09-16), respectively. However, close inspection of the zip files shows that the sizes of their contents are identical:

$ unzip -l
  Length      Date    Time    Name
---------  ---------- -----   ----
  2653386  2018-08-28 10:44   occurrence.txt
     5128  2018-08-28 10:44   eml.xml
    10749  2018-08-28 10:44   meta.xml
---------                     -------
  2669263                     3 files

$ unzip -l
  Length      Date    Time    Name
---------  ---------- -----   ----
  2653386  2018-09-11 10:45   occurrence.txt
     5128  2018-09-11 10:45   eml.xml
    10749  2018-09-11 10:45   meta.xml
---------                     -------
  2669263                     3 files
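
This effect is easy to reproduce in miniature. The sketch below (hypothetical member content, using Python's standard zipfile module) builds two archives whose members have identical bytes but different timestamps; the member sizes match, yet the archives hash to different identifiers:

```python
import hashlib
import io
import zipfile

def make_archive(date_time):
    # Build a small DwC-A-like zip in memory; only the member
    # timestamps depend on the caller-supplied date_time tuple.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("occurrence.txt", "eml.xml", "meta.xml"):
            info = zipfile.ZipInfo(name, date_time=date_time)
            zf.writestr(info, "unchanged content for " + name)
    return buf.getvalue()

older = make_archive((2018, 8, 28, 10, 44, 0))
newer = make_archive((2018, 9, 11, 10, 45, 0))

# Same member names and sizes ...
assert (zipfile.ZipFile(io.BytesIO(older)).infolist()[0].file_size
        == zipfile.ZipFile(io.BytesIO(newer)).infolist()[0].file_size)

# ... but different archive bytes, hence different content hashes.
assert hashlib.sha256(older).hexdigest() != hashlib.sha256(newer).hexdigest()
```

A content-addressed store sees the second archive as an entirely new version, even though nothing of substance changed.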

On even closer inspection, the only changes were a packageId and publication date ("pubDate") in the eml.xml files:

$ diff eml.xml.older eml.xml.newer
<          packageId="64be02e0-0c64-11dd-84d1-b8a03c50a862/v32.142" system="" scope="system"
>          packageId="64be02e0-0c64-11dd-84d1-b8a03c50a862/v32.144" system="" scope="system"
<       2018-08-28
>       2018-09-11
<           <dc:replaces>64be02e0-0c64-11dd-84d1-b8a03c50a862/v32.142.xml</dc:replaces>
>           <dc:replaces>64be02e0-0c64-11dd-84d1-b8a03c50a862/v32.144.xml</dc:replaces>

So, even though the content of meta.xml and occurrence.txt is identical, a semantically insignificant change in eml.xml causes Preston to conclude that a new version of the dataset was published. Because Preston only detects binary content drift, a whole new version is stored in Preston's content-addressed file store. Multiplying this behavior across the >40k datasets in the network, a disproportionate growth of the total size of dataset versions is expected even when few records change in the datasets.

Assuming that publishing organizations are unlikely to change their publication methods anytime soon, Preston will have to get smarter about detecting changes in a dataset over time. If not, the amount of storage needed will increase rapidly, causing data mobility and storage issues. Alternatives include chunking the dataset into smaller parts (file contents, or chunks of files) or removing/ignoring parts of the content that carry no meaning (pubDates/packageIds). Chunking in combination with Merkle trees is used by platforms such as git, IPFS and the Dat project to reduce communication overhead and storage requirements by detecting duplication of (chunked) content in directories and files.
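
One way to sidestep this (a sketch, not an existing Preston feature; the masking patterns are assumptions that only cover the fields seen in the diff above) is a "semantic" hash that masks the volatile EML fields before hashing:

```python
import hashlib
import re

def semantic_hash(eml_text):
    # Mask fields that change on every re-publication but carry no
    # record-level meaning: packageId, pubDate and dc:replaces.
    masked = re.sub(r'packageId="[^"]*"', 'packageId=""', eml_text)
    masked = re.sub(r"<pubDate>[^<]*</pubDate>", "<pubDate/>", masked)
    masked = re.sub(r"<dc:replaces>[^<]*</dc:replaces>", "<dc:replaces/>", masked)
    return "hash://sha256/" + hashlib.sha256(masked.encode("utf-8")).hexdigest()

older = '<eml packageId="64be02e0/v32.142"><pubDate>2018-08-28</pubDate></eml>'
newer = '<eml packageId="64be02e0/v32.144"><pubDate>2018-09-11</pubDate></eml>'

# Binary hashes differ, but the semantic hashes agree.
assert hashlib.sha256(older.encode()).digest() != hashlib.sha256(newer.encode()).digest()
assert semantic_hash(older) == semantic_hash(newer)
```

Under such a scheme, the two DwC-A versions above would collapse into one stored version; the trade-off is that the masking rules must be maintained per content type.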

A naive estimate of duplication similar to the example above (e.g., same file size but a different content hash) leads to an upper bound of 26GB of content with duplicates out of ~283GB total. The estimate assumed that datasets stored by Preston with the same file size are likely duplicates. This naive duplication estimate suggests that introducing a semantic content hash would only reduce the amount of unique content by about 10%. The average file size of unique (non-repeated) hashes was 7MB, while repeated hashes averaged ~100KB. This observation is consistent with anecdotal evidence that larger collections tend to be more actively managed, leading to more updates to larger archives/datasets relative to smaller ones.
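
The size-based estimate can be sketched as follows; the (hash, size) inventory below is made up and stands in for real preston ls / preston check output:

```python
from collections import defaultdict

# Hypothetical content inventory: (content hash, size in bytes).
inventory = [
    ("hash://sha256/aaa", 2669263),
    ("hash://sha256/bbb", 2669263),  # same size, different hash: suspect duplicate
    ("hash://sha256/ccc", 4096),
    ("hash://sha256/ddd", 7340032),
]

def duplicate_upper_bound(inventory):
    """Sum the sizes of all but one member of every same-size group:
    an upper bound on what semantic deduplication could save."""
    by_size = defaultdict(list)
    for content_hash, size in inventory:
        by_size[size].append(content_hash)
    return sum(size * (len(hashes) - 1) for size, hashes in by_size.items())

print(duplicate_upper_bound(inventory))
```

This is an upper bound because files of equal size can still differ meaningfully; only a semantic comparison of same-size pairs would tighten it.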

More analysis is needed to understand the duplication of information in the datasets gathered in the period Sept 2018 - Jan 2019. This understanding would provide a basis for more efficient storage and processing of biodiversity datasets from the iDigBio, GBIF and BioCASe networks.

Issues opened following this analysis: