Data quality

mwainwright edited this page Oct 17, 2012 · 1 revision

Note: the text from the previous version of this page has been moved to the discussion page for now.

This page covers data quality and quality assurance (QA) processing and information for data on CKAN (usually as encapsulated by Resources).

Examples:

  • Does the resource exist (i.e. the URL does not return a 404), and is the API up?
  • Does it conform to a schema (if it has one)?

For more information, see this blog post by Stefan Urbanek: http://ckan.org/2011/01/20/data-quality-what-is-it/

Quality Assurance Information

The CKAN QA extension can be used to provide basic QA information about each resource in a CKAN instance (see the Readme file in the repository for installation instructions). For each resource, the following data is currently calculated:

  • openness score: A number between 0 and 5 (inclusive) based on Tim Berners-Lee's five-star scheme. For more information, see http://lab.linkeddata.deri.ie/2010/star-scheme-by-example
  • openness score reason: A string of text giving the reason that the particular star rating was chosen.
  • failure count: The number of consecutive times that a given resource has received a score of 0.
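
As a minimal sketch of how the failure count behaves, the function below increments the count on a score of 0. It assumes that any non-zero score resets the count (this is implied by "consecutive" but not stated outright), and the function name is hypothetical, not part of the QA extension.

```python
def next_failure_count(previous_count: int, openness_score: int) -> int:
    """Return the updated number of consecutive zero scores."""
    if openness_score == 0:
        return previous_count + 1
    # Assumption: a non-zero score ends the run of failures.
    return 0


# Replay a short, made-up history of scores for one resource.
count = 0
for score in [0, 0, 3, 0]:
    count = next_failure_count(count, score)
print(count)
```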

How QA Information Is Calculated

The current process for calculating an openness score rating is as follows:

  • A HEAD request is sent to the resource URL.
  • If this fails for any reason, the resource is given an openness score of 0.
  • Next, we try to determine the content type of the resource.
  • Firstly, we try to guess the MIME type from the CKAN resource object based on the file extension.
  • If this fails, we then try to read the content-type header from the HEAD request response.
  • If this also fails, we finally try to read the format field of the CKAN resource object.
  • If all 3 attempts fail, the resource is given a score of 0.
  • If we have a content type, the openness score is calculated from the following table:
Openness Score    Content Type
1                 text/plain
1                 text
1                 txt
2                 application/vnd.ms-excel
2                 application/vnd.ms-excel.sheet.binary.macroenabled.12
2                 application/vnd.ms-excel.sheet.macroenabled.12
2                 application/vnd.openxmlformats-officedocument.spreadsheet.sheet
2                 xls
3                 text/csv
3                 application/json
3                 text/xml
3                 csv
3                 xml
3                 json
4                 application/rdf+xml
4                 rdf
  • The openness score reason is currently one of the following messages:
Openness Score    Reason
0                 unrecognised content type
0                 not obtainable
1                 obtainable via web page
2                 machine readable format
3                 open and standardized format
4                 ontologically represented
5                 fully Linked Open Data as appropriate
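
The steps above can be sketched in Python roughly as follows. This is an illustration, not the QA extension's actual code: the function names are hypothetical, the HEAD request is reduced to a boolean `head_ok` argument, and the file-extension guess uses the standard library's `mimetypes` module. The sketch also makes three assumptions: a failed HEAD request maps to the reason "not obtainable", a missing content type maps to "unrecognised content type", and a content type that is found but not listed in the table also scores 0.

```python
import mimetypes
from typing import Optional, Tuple

# Content type -> openness score, copied from the table above.
SCORE_BY_CONTENT_TYPE = {
    "text/plain": 1, "text": 1, "txt": 1,
    "application/vnd.ms-excel": 2,
    "application/vnd.ms-excel.sheet.binary.macroenabled.12": 2,
    "application/vnd.ms-excel.sheet.macroenabled.12": 2,
    "application/vnd.openxmlformats-officedocument.spreadsheet.sheet": 2,
    "xls": 2,
    "text/csv": 3, "application/json": 3, "text/xml": 3,
    "csv": 3, "xml": 3, "json": 3,
    "application/rdf+xml": 4, "rdf": 4,
}

# Score -> reason, copied from the reasons table above (non-zero scores only).
REASON_BY_SCORE = {
    1: "obtainable via web page",
    2: "machine readable format",
    3: "open and standardized format",
    4: "ontologically represented",
    5: "fully Linked Open Data as appropriate",
}


def guess_content_type(url: str,
                       header_content_type: Optional[str],
                       resource_format: Optional[str]) -> Optional[str]:
    """Try the three sources in the documented order; None if all fail."""
    guessed, _encoding = mimetypes.guess_type(url)  # 1. file extension
    if guessed:
        return guessed.lower()
    if header_content_type:  # 2. content-type header from the HEAD response
        # Drop parameters such as "; charset=utf-8".
        return header_content_type.split(";")[0].strip().lower()
    if resource_format:  # 3. format field of the CKAN resource
        return resource_format.strip().lower()
    return None


def openness_score(head_ok: bool, url: str,
                   header_content_type: Optional[str] = None,
                   resource_format: Optional[str] = None) -> Tuple[int, str]:
    """Return (score, reason) for one resource."""
    if not head_ok:
        return 0, "not obtainable"  # assumed pairing of failure and reason
    content_type = guess_content_type(url, header_content_type, resource_format)
    score = SCORE_BY_CONTENT_TYPE.get(content_type, 0) if content_type else 0
    if score == 0:
        return 0, "unrecognised content type"
    return score, REASON_BY_SCORE[score]


print(openness_score(True, "http://example.com/data",
                     header_content_type="application/rdf+xml; charset=utf-8"))
# (4, 'ontologically represented')
```

Because the extension guess is tried first, a URL whose extension maps to a MIME type outside the table scores 0 in this sketch even if the header or format field would have matched. Whether the real extension behaves the same way is not stated on this page.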

Updating QA Information

QA information is currently updated in two ways:

  • On demand, by running the QA update from the command line (see the QA extension Readme for details). This can therefore be scheduled as a cron job and called regularly.
  • New in CKAN 1.5.1: QA information is calculated automatically (in the background) for individual CKAN resources each time a new one is added, and each time an existing resource URL is changed.

Where QA Information Is Saved

QA results are currently saved to CKAN's TaskStatus table, not directly onto resource objects.

Three key/value pairs are currently saved for each resource:

  • openness_score
  • openness_score_reason
  • openness_score_failure_count
Each entry consists of the following information:
  • entity_id: the ID of the CKAN resource
  • entity_type: resource
  • task_type: qa
  • last_updated: time at which the QA task finished
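
To make the shape of these entries concrete, here is a small sketch that builds the three entries for one resource as plain dictionaries. The helper name is hypothetical, and two details are assumptions rather than facts from this page: the field names `key` and `value` for the key/value pair, and storing every value as a string.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def build_qa_entries(resource_id: str, score: int, reason: str,
                     failure_count: int,
                     finished_at: Optional[datetime] = None) -> List[Dict[str, str]]:
    """Build one TaskStatus-style entry per QA key for a single resource."""
    finished_at = finished_at or datetime.now(timezone.utc)
    values = {
        "openness_score": score,
        "openness_score_reason": reason,
        "openness_score_failure_count": failure_count,
    }
    return [
        {
            "entity_id": resource_id,      # the ID of the CKAN resource
            "entity_type": "resource",
            "task_type": "qa",
            "key": key,                    # assumed field name
            "value": str(value),           # assumed to be stored as text
            "last_updated": finished_at.isoformat(),
        }
        for key, value in values.items()
    ]


for entry in build_qa_entries("example-resource-id", 3,
                              "open and standardized format", 0):
    print(entry["key"], "=", entry["value"])
```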

Viewing QA Reports

QA information can be viewed at '/qa' on any CKAN instance that has the QA extension installed.

API Access

QA results can be read from the Task Status table using CKAN API v3.
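
A minimal sketch of such a read, using only the Python standard library, is shown below. It assumes the `task_status_show` action of the CKAN action API, called with `entity_id`, `task_type` and `key`, and the usual `success`/`result` response envelope. Check the action name and parameters against the API documentation for your CKAN version. The site URL and resource ID are placeholders, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def task_status_url(site_url: str, resource_id: str, key: str) -> str:
    """Build the API v3 URL for one QA key of one resource."""
    query = urlencode({"entity_id": resource_id, "task_type": "qa", "key": key})
    return f"{site_url.rstrip('/')}/api/3/action/task_status_show?{query}"


def read_qa_value(site_url: str, resource_id: str, key: str) -> str:
    """Fetch one QA value; raises RuntimeError if the API reports failure."""
    with urlopen(task_status_url(site_url, resource_id, key)) as response:
        payload = json.load(response)
    if not payload.get("success"):
        raise RuntimeError(f"CKAN API call failed: {payload.get('error')}")
    return payload["result"]["value"]


# Only builds and prints the URL; call read_qa_value() against a real instance.
print(task_status_url("http://demo.example.org/", "abc-123", "openness_score"))
```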