
Schemas for Tabular Data Challenge #40

Closed · 1 of 4 tasks
Lawrence-G opened this issue Mar 28, 2017 · 44 comments

@Lawrence-G

Schemas for Tabular Data

Category

  • Data
  • Document
  • Technical
  • Other Suggestions

Suggested by

Originally Submitted by pwalsh on Mon, 13/03/2017 on standards.data.gov.uk

Short Description

Much data published by governments is in common tabular data formats: CSV, Excel, and ODS. This is true for the UK government and governments around the world. To provide assurances around reusability of tabular data, consumers (users) need information on the "primitive types" for each column of data (example: is it a number? is it a date?). This also allows for quality checks to ensure consistency and integrity of the data.

Publishing Table Schema with tabular data sources provides this information. Table Schema has previously been used in work by Open Knowledge International (OKI) with the Cabinet Office to check the validity of 25K fiscal data files against publication guidelines. Table Schema is also used widely by other organisations working with public data, such as the Open Data Institute (ODI).

User Need

I've written several user stories below. Each user story applies equally to a range of users. The user personas are as follows:

  • Developer: a user reusing public data in derived databases, visualisations, or data processing pipelines.
  • Business analyst: a user looking to public data as a source of information for analysis of business use cases that involve some component of public good.
  • Citizen: a non-technical user who expects government to publish consistent, high quality data.

User stories

As a user, I want all public data published by government to conform to a known schema, so I can use this information to validate the data.

As a user, I want public data published by government to have a schema, so I can read the schema and understand at a glance the type of information in the data, and the possibilities for reuse.

Expected Benefits

  • Vastly increased reuse of public data.
  • Increased trust in publication flows, generated by those flows producing quality data.

Functional Needs

The functional needs that the proposal must address.

@philarcher

W3C has a formal Recommendation in this space: Metadata Vocabulary for Tabular Data, which encodes the Model for Tabular Data and Metadata on the Web. The work was originally inspired and subsequently led by Jeni T, with Dan Brickley as co-chair. Jeremy Tandy (Met Office) was a key WG member too.
The WG looked at tabular data in the wild (all its use cases come with real examples) and handled awkward realities like multiple lines of headings, right-to-left tables and non-ASCII characters. It uses the Web to link the metadata definitions which, like Table Schema, are defined in JSON. The metadata can be embedded in the CSV/TSV, placed at a well-known location, or linked from the CSV directly. You can override a defined metadata file with one of your own. All this, of course, makes the metadata definitions reusable - handy for regularly published datasets, for example. The combination of CSV/TSV and the metadata file means that you can use the data directly or transform it programmatically into JSON or RDF. The standard supports URI templating out of the box, but it also has extension points for extra rules so that you can use it as the basis for something like OpenRefine that would, for example, transform dates into a standard format, or normalise either of UK|United Kingdom into a regular form.

The potential here is that a set of metadata files can be defined and maintained centrally, with tooling to validate a published CSV/TSV file and reduced effort to create visualisations etc. It gets away from the notion of packages that are made available for download and local processing (what I call using the Web as a glorified USB stick), and makes linking across multiple datasets and to things like the registries much easier. If you like, it's 5-star data in CSV.
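
To illustrate the URI templating mentioned above, a CSVW metadata file can declare an aboutUrl template that turns each row into a linkable resource. A minimal sketch (the file name, column names and example.org URIs are invented for illustration):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "aboutUrl": "http://example.org/country/{code}",
    "columns": [
      { "name": "code", "titles": "Code", "datatype": "string" },
      { "name": "name", "titles": "Name", "datatype": "string" }
    ]
  }
}
```

Here {code} is expanded from the code column of each row, so the same centrally maintained metadata file can assign stable URIs across every release of a regularly published dataset.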

@edent

edent commented May 19, 2017

Also - CSV on the Web https://w3c.github.io/csvw/primer/

@DavidUnderdown

The National Archives also developed a CSV Schema Language and validator http://digital-preservation.github.io/csv-schema - currently departments are expected to supply a CSV file meeting a supplied schema when transferring digital records to The National Archives, to meet their Public Records Act obligations.

@frankieroberto

There's also Frictionless Data Packages, which apparently have some traction (e.g. they're supported by the Open Data Institute and Google.org).

@gheye

gheye commented Jun 21, 2019

Data Description Language

This comment is a proposal on defining a descriptive language for data within government. We're asking for your feedback so we can develop and write this guidance based on advice from the government data community.

We have published blog posts on this topic:

  1. Excel Spreadsheets - which outlines how a simple data description language can assist in sharing data, and the reality that spreadsheets are not disappearing. https://dataingovernment.blog.gov.uk/2019/06/10/improving-how-we-manage-spreadsheet-data/

  2. Why we need Data Standards (due for publication soon), which is an introduction to data standards and some of the proposals.

The aim is to have a common way of describing files, ‘a data description language’, which organisations can use across all file types and formats.

There are several priorities for the data description language. We need to make sure that:

  • it is easy to use

  • the process for defining the language is community driven, using tools such as Slack and GitHub

  • the language is useful and adds value for the community

  • we can explain the basics of the language in one page that we can publish with the guidance

We also need to make sure that the data description language follows existing standards wherever possible. This includes:

    • aligning with CSV on the Web - TAG:Text

    • CSV RFC 4180 - promoting the use of double quotes around key items of text

    • verifying new tags against Dublin Core to check whether the suggested tags already exist or whether we should create them. Suggestions on tagging frameworks to use are much appreciated.

We will produce a number of documents alongside this proposal including:

  • The one-page description of the tags, which will largely follow the format of the proposal below

  • A more detailed specification of the language - this will follow the style of the previously
    published API documentation. This is currently being worked on.

Proposed Tags
Please note that these tags can be provided in a separate file for a CSV file, or on a page within a spreadsheet. Each item will exist on a separate line.

Links have been added in the second column for those tags which come from Dublin Core. An Area column and a proposed Compulsory column have been added. The last item, parse-to-end-file, has been removed. Two items for reference data have been removed: register-collection and register-standard. declare-datasheet has been added so the sheet the data sits on can be defined. One standard item, 'standard', has been removed, and standard-url has been changed to conformsTo.

| Area | Property (Tag) | Example | Comment | Compulsory |
| --- | --- | --- | --- | --- |
| Core | creator | creator:"Russell Singh russell.singh@digital.cabinet-office.gov.uk"<br>creator:"Indira Singh, Sue Chan, Gregory Pie" | Dublin Core. This can be a comma separated list of creators. | |
| Core | contributor | contributor:"justine.gornall@company.co.uk" | Dublin Core. This can be a comma separated list of contributors (see creator for handling multiple entries). | |
| Core | title | title:"GDS Employees" | Dublin Core. | |
| Core | created | created:2002-10-02 | Dublin Core. The format of this date, and whether double quotes are used, needs to be agreed. | |
| Core | identifier | identifier:"0000015_GDS_SDA_XLS" | Dublin Core. | |
| Core | description | description:"All heights at GDS" | Dublin Core. | |
| Core | valid | valid:"2012-2013" | Aligns with Dublin Core. This refers to the date-valid range. What acceptable range of values would we accept here? Suggested by DWP. | |
| Core | replaces | replaces:"GDS Employees V1" | Aligns with Dublin Core. The document or item that this replaces. What acceptable range of values would we accept here? Suggested by DWP. | |
| Core | license | license:"https://opensource.org/licenses/MIT" | Aligns with Dublin Core. Proposed by ONS. The license that applies to the document. | |
| XLS\ODF | declare-header | declare-header:"A1:A2"<br>declare-header:"Sheet2!A1:A2" | For spreadsheet data. | |
| XLS\ODF | declare-datasheet | declare-datasheet:"MyDataSheet" | For spreadsheet data. | |
| XLS\ODF\CSV | column-type | column-type:"ColumnName:String"<br>column-type:"Country:String"<br>column-type:"Age:Number" | For spreadsheet data. This item can be repeated for each of the columns in the data set. Amended following feedback from @davidread; this now aligns much more closely with CSVW. | |
| XLS\ODF | declare-data | declare-data:"Sheet2!A4:D8"<br>declare-data:"A4:D8" | For spreadsheet data. | |
| Core | format | format:"xls" | To allow future expansion of the data description language. Aligns with Dublin Core. Assumption is that double quotes are not required. | |
| CSV | file-delimiter | file-delimiter:"," | To be used if another file delimiter is being used. | |
| Proposal | fileformat-puid | fileformat-puid:"fmt/62" | Proposed by The National Archives. There is a REST API for obtaining information on the PUID (PRONOM Unique Identifier): http://www.nationalarchives.gov.uk/PRONOM/fmt/62 | |
| Proposal | fileformat-creating-application | fileformat-creating-application:"Excel 1997" | Proposed by The National Archives. Should this be shortened to creating-app? | |
| Proposal | standard-comment | standard-comment:"RFC4180" | One comment tag may be used. | |
| Core | conformsTo | conformsTo:"https://tools.ietf.org/html/rfc4180" | The standard the file must conform to. Changed following feedback from @pwin. | |
| Core | doc-sensitivity | | A future proposal for documents that should have sensitivity applied to them. Should this align with the Dublin Core tag accessRights? What would be an acceptable range of values? | |
| Proposal | register-column | register-column:"Address" | The column from the dataset that should have reference data applied to it. The term used in GDS for reference data is "register". Should this allow a series of columns? | |
| Proposal | register-url | register-url:"https://www.registers.service.gov.uk/registers/ddat-profession-capability-framework" | The URL to the specific register that this applies to. This may be machine readable or human readable depending on the usage. | |
| CSV | top-row-header | top-row-header:true | To aid processing. True or false. | |
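
For illustration, a complete sidecar description using these tags might look like the following. This is a sketch assembled from the examples in the table above (the values are the table's own placeholders, adapted for a CSV file):

```
title:"GDS Employees"
creator:"Indira Singh, Sue Chan, Gregory Pie"
created:2002-10-02
identifier:"0000015_GDS_SDA_XLS"
format:"csv"
file-delimiter:","
conformsTo:"https://tools.ietf.org/html/rfc4180"
column-type:"Country:String"
column-type:"Age:Number"
top-row-header:true
```

Each tag sits on its own line, as proposed above.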

@pwin

pwin commented Jul 1, 2019

  • I don't think there is much special about government that requires it to have something particular to itself, though it has authority for specific data and is also a big player in the market, and consequently has influence. The latter can be a problem for society generally if government decides to go the 'wrong way' - just read Programmed Inequality
  • Many of the tags above are related to the concept of the resource and less to the specific distribution. I think that it is important to separate these so that we have an approach to creating catalogues that is more logical. DCAT v2 mentions this:

an important distinction between a dataset as an abstract idea and a distribution as a manifestation of the dataset.

Most of the other bits that you mention above @gheye are schema or processing instructions and I think it is important to separate these. Also, there might be rules of different types within the constraints for validation, or within the instructions for processing. So I think that these aspects need to be handled as part of an interoperability metamodel - see EIRA as an example. But if you think that this over-complicates things then either slim down your proposal so that you're not trying to replicate the many illustrations above, or else join in with the EC activity.

@gheye

gheye commented Jul 1, 2019

Hi @pwin

Thanks for your comments.

I am coming up to see you and we can discuss this in more detail.

We do need to align with international standards and work with them, but in a dispersed organisation such as the UK government we need to be flexible and, in some cases, more lightweight.

We can discuss more when I come up to Glasgow.

Best Wishes,

gheye

@davidread

Along with the existing user stories of documenting and validating the data, I'd like to suggest another:

  • As a data scientist/engineer, I want to load the data into a data store, so that I can do queries/analysis

This is helped by having a schema. I'll just explain the situation we have at MOJ: we started a simple internal data catalogue, with each data table described by metadata including a schema describing the column properties. Whilst you can load a CSV into a data store (or a Parquet file into a data frame) and let it auto-detect column types, it often makes mistakes - for example converting numbers to dates, treating dates as text, dropping the leading 0 in telephone numbers, interpreting nulls as strings, choosing int16 when int64 will be needed in future, etc. So to make this more reliable, colleagues have written some little tools to convert the existing schema to a number of related schema formats, suited to various data stores, for example:

Ideally all these things would accept a schema in the same format. Pandas and Spark accept it programmatically. Glue and BigQuery express it in slightly different JSON formats. 🤷‍♂

But if we accept we need converter tools for this use case, I don't think this user story imposes anything extra on the schema format - just defining the name and type of each column would cover it. The set of allowable column types is probably worth discussing - I guess the SQL types are a reasonable start. See our conversion table between pandas, Spark and AWS Glue.
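
For a sense of what that minimal schema format could look like, here is a Frictionless Table Schema fragment defining just a name and type per column. This is an illustrative sketch, not MOJ's actual schema; the field names are invented:

```json
{
  "fields": [
    { "name": "telephone", "type": "string" },
    { "name": "record_date", "type": "date" },
    { "name": "case_count", "type": "integer" }
  ]
}
```

Declaring telephone as a string, for example, is what stops a loader from dropping the leading 0.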

It would be great to hear if this use case is common or not, probably from others establishing data warehouse functionality, with data catalogues and ETL needing schemas.

@davidread

To summarize, these are the suggested options to meet the challenge:

To advance this discussion, perhaps we should compare the suitability of these in meeting the user needs (for developers, business analysts and citizens), identified by @pwalsh in the challenge.

@timwis

timwis commented Jul 2, 2019

Hm, I would suggest that the "Data Description Language" that @gheye proposes is rather distinct from how we describe the fields/schema of the data, and should warrant its own conversation/thread -- particularly since there are a lot of existing standards out there around metadata worth discussing, and they don't necessarily include field-level information.

If we prefer to treat it as a single standard, I would re-emphasise tabular data packages as an existing standard that prescribes the lowest common denominator of fields and allows for extensions, and also addresses field-level information via table schema.

@gheye

gheye commented Jul 4, 2019

Hi All,

My view is this does not preclude DCAT 2 or CSV on the Web, and it can be viewed as a stepping stone.

At the lowest level, for a department using Excel or CSV that is not in a position to move to DCAT or CSV on the Web, can we provide a list of common tags to be used across government? These tags should allow a user to more easily migrate to a larger, more comprehensive framework.

Can they not also be used in CSV on the Web and form the basis of a DCAT design, especially since they align as closely as possible with international standards? Additional tags will only be created where they do not exist in one of the standards.

The ask is therefore simple and quite basic: can we agree a common set of tags to be used across government? After we have done this, we would create recommendations and migration strategies to CSV on the Web and DCAT 2, or an international standard that we agree on.

@gheye

gheye commented Jul 9, 2019

Hi,

I have amended the column definition above to align more closely with CSVW.

Gareth H.

@gheye

gheye commented Jul 15, 2019

Following feedback from NHS Digital and GDS, I have updated the table to include:

  1. A proposed compulsory column

  2. An Area that the item applies to. For want of a better term, I have marked all those created from requests that are not CSV, spreadsheet or Core as 'Proposal'.

@gheye

gheye commented Jul 23, 2019

A suggestion from @pwin that is worth discussing is whether we should have a mediatype tag.

A list of possible media types is shown here:

https://www.iana.org/assignments/media-types/media-types.xhtml

@gheye

gheye commented Jul 31, 2019

I received excellent and extensive feedback from ONS at the individual item level. Most of the changes have been added to the table above.

@gheye

gheye commented Aug 7, 2019

There are three items that NHS Digital would like to add to this list from schema.org:

schema:datePublished
schema:spatialCoverage
schema:temporalCoverage

Please confirm whether you agree or disagree with these items.

@frankieroberto

@gheye I'm a bit confused. Are you proposing a new "Data Description Language" standard? My understanding was that the adopted open standards could only be existing standards, not newly-created ones? Or does your proposal build on top of CSV on the web or Frictionless Table Schema?

@gheye

gheye commented Aug 7, 2019

Hi @frankieroberto,

We are not proposing a new data description language. If you look at the individual items in the list above, almost all of them come from existing standards.

The only new items we are proposing are those that fill a need raised in government where a tag does not currently exist. If you can find an existing tag for any of the additional tags above, then please point us towards it.

As mentioned above, this should be viewed as a stepping stone to something such as CSV on the Web. In fact, the syntax has been aligned with that as far as possible.

This is a proposal for an initial set of tags that the UK government could begin to adopt before moving to something larger. It is a stepping stone.
Regards.

@MikeThacker1

Some points I raised in a chat with @gheye that he asked me to add here:

@pwin

pwin commented Aug 7, 2019

w3c/dxwg#868 from the DXWG issues is relevant to the discussion about accreted datasets etc. So is the discussion within DXWG on qualified relations

@PeterParslow

There are two kinds of standards being discussed here, without always being clear on which:

  • structures for the schema definition language: here the suggested standards offer a range of choices

  • semantics - the actual tags that are being proposed. The core ones almost all come from Dublin Core, and are effectively a declaration that 'the government requires these metadata elements about any schema'. The others are 'the meat', e.g. what do you call the array elements that give the name & description of each column?

Several of the proposed standards may have the same mix; Table Schema & CSV on the Web only do the latter - describe how to define the CSV structure, not specify the metadata expected to accompany the schema definition.

(They are pretty much the core of all metadata standards. GEMINI distinguishes 'metadata about the metadata' and 'metadata about the dataset' - but I don't think many people like the two terms!)

Second point: the DCLG-sponsored Brownfield Land Register schema was developed by iStandUK using the 'table-schema' approach (in order to exploit the ODI's CSVLint tool). This is in use by many (all?) English planning authorities, placing data records on data.gov.uk.

@MikeThacker1

On Peter's second point, the conformsTo tag should reference the relevant schema, in this case the one from iStandUK. Both CSVLint and an LGA CSV validator tool validate against the schema. data.gov.uk used to allow you to identify all datasets that conform to a given schema, but it has now dropped that functionality. If GDS uses conformsTo, we should be able to revive (somewhere) discovery of datasets that conform to a given schema. This is needed for joining datasets from many publishers in local government.

@Lawrence-G

The following proposals for this challenge (Dublin Core, Schema.org, CSVW) are the result of the GDS data standards workshops held over the last couple of months and comments and suggestions made on GitHub.
The workshops focussed on the data description language for tabular data structures that @gheye published here, and identified that the language proposal was trying to solve a number of different problems: to create a standard for:

  • data shared privately between individuals/govt organisations so that they can catalogue their data
  • data shared privately between individuals/govt organisations so that they can easily share and combine data sets, and ensure interoperability
  • published data so that it can be easily catalogued, found and shared

It was agreed to post proposals for the different standards referenced in the language as separate recommendations. For this purpose, GDS worked with ONS on the proposals for Dublin Core, schema.org and CSV on the Web.

I'm posting on behalf of the team who put the proposal together.

Proposal: Recommend Dublin Core to describe data shared privately

Introduction

This proposal is to recommend that government departments use the Dublin Core schema as a minimum set of metadata to be associated with data they are sharing privately. For tabular data being published, we have a separate proposal with the Open Standards Board to use Schema.org.

Using the Dublin Core schema to describe tabular data that is shared privately means individuals/teams/departments would be consistent in the way they describe the contents of their tabular data resources, especially in departments where no metadata is currently collected or published, allowing the data to be easily catalogued, validated, reused and found. It is the first step in achieving metadata maturity across government departments and reaching a common core set of metadata associated with a dataset.

This proposal is based on the idea that the metadata elements associated with Dublin Core represent the core set of metadata elements conserved across government. Since Dublin Core sets a foundation for many more complicated standards such as DCAT (a recommended standard at the higher end of the metadata maturity spectrum), it ensures that the same elements are preserved when a more complicated metadata standard is implemented, without the need for complex translation between standards.

Please note that this proposal is based on two assumptions:

  1. Dublin Core should be used as a “core metadata elements” standard and be considered as the first step in achieving metadata maturity

  2. The data should be identified using persistent resolvable identifiers as recommended by the Open Standards for Government.

This proposal builds on the Open Standards Board recommendation of the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.

User need approach

The user need identified by this proposal is to maximise consistency in and give context to the tabular data being shared across government. Adopting the Dublin Core schema should help to achieve the first step in sharing data with associated metadata to ensure trust and increase the confidence in handling of data.

Users in the context of this proposal are government workers who create, share and maintain tabular data, and need to be able to validate it. Individual users include but are not limited to data scientists, business analysts, people who need to use a spreadsheet application to do basic analysis and developers who process data in a range of software.

Achieving the expected benefits

If departments use the Dublin Core schema, they would be consistent in the way they describe their tabular data. The Dublin Core Metadata Element Set is one of the simplest and most widely used metadata schemas, since it represents a minimum set of 15 metadata elements conserved across metadata standards. For example, at ONS, all of the Dublin Core metadata elements are conserved in a much more complex metadata model which captures the metadata needs of a statistical organisation.

Dublin Core comprises 15 "core" metadata elements, whereas the "qualified" Dublin Core set includes additional metadata elements to provide greater specificity and granularity. If Dublin Core is adopted as an Open Standard for government, the Government Digital Service will produce guidance on how these elements should be used. GDS will advocate for the 15 core metadata elements to be conserved across government departments.
Additionally, the Government Data Architecture Community (GDAC), fostered by ONS, has offered to work with GDS to deliver half-day workshops that will help government users understand why using Dublin Core is useful and helpful.

Using the Dublin Core standard will mean government users have an improved idea of what kinds of information should be recorded, where and how.
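
As a sketch of what this looks like in practice, here are some of the 15 core elements expressed as a simple record accompanying a shared file. The values are invented, and JSON is just one possible encoding, since Dublin Core elements are independent of coding syntax:

```json
{
  "dc:title": "GDS Employees",
  "dc:creator": "Government Digital Service",
  "dc:description": "Heights of GDS employees",
  "dc:date": "2002-10-02",
  "dc:format": "text/csv",
  "dc:identifier": "0000015_GDS_SDA_XLS",
  "dc:language": "en",
  "dc:rights": "Open Government Licence v3.0"
}
```

The remaining elements (subject, publisher, contributor, type, source, relation and coverage) follow the same pattern.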

How Dublin Core complements other standards

The Dublin Core standard is at the core of the majority of metadata standards and is limited to only 15 core elements, which GDS will advocate that the cross-government community use to share data across departments where no other metadata is captured. Dublin Core elements are independent of coding syntax. As the metadata maturity of a government department improves, the Dublin Core metadata elements will become part of much more complex standards, as is the case at ONS (whose model includes standards such as DCAT and ADMS, amongst others). Using Dublin Core elements is a first step in reaching that maturity and ensures that at least a minimal set of metadata is captured and shared across government departments in a standardised way.

However, when departments are publishing data openly, they should use schema.org to describe the data, as this is a collection of metadata schemas that are focused on SEO and are particularly targeted at webmasters. Schema.org is already in use by Data.gov.uk and GOV.UK. Whilst Dublin Core elements describe both physical and web resources, schema.org has been designed specifically for search engine optimization. Schema.org is also a complex and mature metadata standard, and the majority of government departments are required to support it when publishing their open data datasets to data.gov.uk pages.

Whilst schema.org is a fantastic standard for publishing data on the web, government departments often lack any means of publishing common and coherent metadata to fit their in-house needs. The Dublin Core set of elements forms a basis for many other standards, which can be easily adopted as the organisation matures in its metadata journey. The schema.org metadata standard cannot be used, nor is it designed to work, in the same way.

Other steps to achieving interoperability

This proposal is only concerned with promoting a consistent and accurate set of metadata associated with tabular data when government is sharing data privately.

When government is sharing data publicly, i.e. publishing the data, we have a separate proposal to the Open Standards Board to use schema.org (link), since this has been the preferred option in the open linked data community.

When wanting to describe how the data is shaped and formatted within a particular file, we have a separate proposal to the Open Standards Board for the use of CSV on the Web.

Proposal: Recommend schema.org to describe tabular data you are publishing

Introduction

This proposal is to recommend that government departments use schema.org to describe open data they are publishing. For data being shared but not published, we have a separate proposal with the Open Standards Board to use Dublin Core as the minimal set of metadata elements conserved across government.

Using schema.org to describe tabular data that is published by individuals/teams/departments means search engines can better find the data and display structured results to end users more efficiently. Describing the contents of published tabular data resources in a consistent way will also allow the data to be easily catalogued, validated and reused.

This proposal:

  • follows the Open Standards Board adoption of the schema.org JobPosting schema in 2016 to ensure consistent formatting of job posts across government
  • builds on the Open Standards Board recommendation of the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.

User need approach

The user need identified by this proposal is to make government-published data easier to find and to maximise its use. Officially adopting schema.org should help achieve this, since the standard has already been used to publish open data on government websites for some time.

Users in the context of this proposal are government workers who create, share and maintain tabular data, and want people to be able to find it. Individual users include but are not limited to data scientists, business analysts, technology policy advisors, economists and members of the public.

Often, openly published government data can be difficult to find. Web publishers who include schema.org markup generally tend to have a competitive SEO advantage over those who don’t so it makes sense to adopt Schema.org within government. Schema.org allows context to be provided for an otherwise ambiguous webpage, improving the quality of search results for users.

The user needs identified for adopting schema.org are for users to be able to:

  • find and reuse data published by the government on GOV.UK and 3rd party websites
  • perform advanced searches by item type, for example, event or location
  • search for data published by government regardless of where it is published

Achieving the expected benefits

If departments use schema.org, they would be consistent in the way they describe their published data. schema.org is supported by the major search engines and takes full advantage of the semantic web.

schema.org is a set of tags that aims to make annotating HTML elements with machine-readable tags much easier. schema.org is already used by Government websites publishing data, including GOV.UK and Data.gov.uk.

Different bits of government and associated agencies publish tabular data in different formats, and they regularly exclude pertinent information. Sometimes, even when the information is included, it is not machine readable and so is not easily findable. Using schema.org as an open standard for published data across government will help online users find relevant, accurate data to fit their needs. schema.org will also make it easier for aggregators, search engines and others to reuse data published by the government, in turn making it easier for users to find data relevant to them on non-government services.
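
As an illustration of the sort of markup involved, a minimal schema.org Dataset description might look like this. This sketch uses invented values and shows only a small subset of the available properties:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "GDS Employees",
  "description": "Heights of GDS employees, updated monthly.",
  "license": "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
  "publisher": {
    "@type": "Organization",
    "name": "Government Digital Service"
  },
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.gov.uk/data/gds-employees.csv"
  }
}
```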

When schema.org may not be suitable

The schema.org standard is a powerful tool for helping online users find the information they need. However, when departments or government individuals are sharing data between themselves, they should use Dublin Core tags to describe data, as this is a collection of tags that are less focused on the web, and are more suited to structuring the data.

Whilst schema.org is a fantastic standard for publishing data on the web, government departments often lack any means of publishing common and coherent metadata to fit their in-house needs. The Dublin Core set of elements forms a basis for (and is part of) many other more complicated standards (such as DCAT), which can be easily adopted as the organisation matures in its metadata journey. The schema.org metadata standard cannot be used, nor is it designed to work, in the same way; however, the metadata elements described as "core" are also shared with the schema.org standard.

Other steps to achieving interoperability

This proposal is only concerned with promoting consistent and accurate data when government is publishing data openly.

When government is sharing data privately, we have a separate proposal to use Dublin Core.

When wanting to describe how the data is shaped and formatted within a particular file, we have a separate proposal to the Open Standards Board for the use of CSV on the Web.

Proposal: Recommend CSV on the Web to annotate tabular data column properties

Introduction

A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. CSV files may be of a significant size, but they can be generated and manipulated easily, and there is a significant body of software available to handle them. Indeed, popular spreadsheet applications (Microsoft Excel, iWork's Numbers, or OpenOffice.org), as well as numerous other applications, can produce and consume these files. However, although these tools make conversion to CSV easy, it is resisted by some publishers because CSV is a much less rich format that can't express important detail that the publishers want to express, such as annotations, the meaning of identifier codes, etc.

Existing formats for tabular data are format-oriented and hard to process (e.g. Excel); inextensible (e.g. CSV/tab separated values (TSV)); or they assume the use of particular technologies (e.g. SQL dumps). None of these formats allow developers to pull in multiple data sets, manipulate, visualize and combine them in flexible ways. Other information relevant to these datasets, such as access rights and provenance, is not easy to find. CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata.

To address these issues, CSVW seeks to provide:

  • a metadata vocabulary for CSV data
  • access methods for CSV metadata
  • a mapping mechanism for transforming CSV into various formats (e.g. RDF [rdf11-concepts], JSON [RFC7159], or XML [xml])

This proposal is to recommend that government departments use CSV on the Web (CSVW) to process CSVs into an annotated data model, so that CSVs can be annotated, interoperable and more easily shared. This is an open and established standard and is currently being used and recommended by ONS. It should be noted that this standard deals with tabular data; a standard which deals with a wider array of formats might be recommended in future.

CSVW is a W3C standard for metadata descriptions for tabular data, and will give government a standard way to express useful metadata about CSV files and other kinds of tabular data.

With CSVW, after the tabular data is annotated, the model is used as the basis for creating RDF or JSON. For example, CSVW assumes that the first row of a CSV is a header row containing the titles for the columns, and that each subsequent row constitutes a record with properties. Once a model for the tabular data has been established, the file can be easily integrated with other data.
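
A minimal CSVW metadata file expressing this model might look like the following sketch (the file name and columns are invented; the @context and tableSchema structure are those defined by the W3C specification):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "employees.csv",
  "tableSchema": {
    "columns": [
      { "name": "name", "titles": "Name", "datatype": "string" },
      { "name": "start_date", "titles": "Start date", "datatype": "date" },
      { "name": "height_cm", "titles": "Height (cm)", "datatype": "integer" }
    ]
  }
}
```

A CSVW processor can use the datatype annotations to validate each cell and to emit correctly typed JSON or RDF.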

This proposal builds on the Open Standards Board recommendation of the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.

User need approach

Different bits of government and associated agencies publish tabular data in different formats, and they regularly exclude pertinent information on the column properties, making it hard for files to be shared. Using CSVW as an open standard for tabular data shared across government will help government workers aggregate and reuse data.

CSVW will help users consolidate different tabular data sources into one file, or load their data into a data store so that they can run queries and analysis. To help users document, validate and catalogue their data, we have separately proposed Dublin Core and Schema.org as Open Standards.

CSVW is a standard for describing column properties in tabular data. Users can load a CSV into a data store and let it auto-detect column types, but mistakes are often made: converting numbers to dates, treating dates as text, dropping the leading 0 in telephone numbers, interpreting nulls as strings, choosing int16 when int64 will be needed in future, etc. To make data transfer more reliable, CSVW can be used to convert the existing schema to a consistent format that data stores know how to read. The consistent format will also mean different tabular data files can be easily integrated.

Users in the context of this proposal include:

  • developers who have to pull in multiple data sets, and manipulate, visualize and combine them in flexible ways
  • statisticians who want to reuse statistical data and not be inhibited by a lack of explicit definition of column heading meanings
  • suppliers and consumers of CSV data including government organisations
  • linked open data users, since this format can be easily parsed and added to open data portals

Achieving the expected benefits

As is noted by the W3C Working Group, CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata (the Working Group's aims are listed above).

If departments use CSVW, they would be consistent in the way they describe their tabular data column properties. CSVW is already in use across government, and support for the standard is noted in the W3C Working Group's Use Cases and Requirements brief.

When CSVW may not be suitable

CSVW is suitable for tabular information formats but not for other formats.

Other steps to achieving interoperability

This proposal is only concerned with promoting consistent and accurate tabular data when government is transporting data into a data store or another tabular data file.

When government wants to share details about the data it holds so that it can be catalogued and found easily, we have separate proposals to the Open Standards Board to use Dublin Core and Schema.org.

@davidread

I agree with @pwin, @PeterParslow and @timwis that we are discussing two rather distinct sorts of metadata here.

I'd describe them as Descriptive Metadata (DCAT, DC, schema.org) and Structural Metadata (Table Schema, CSVW, TNA's CSV Schema Language). @Lawrence-G Please can a separate challenge be created for the former? These two areas have rather distinct user needs and merit separate discussions. And a bit of "divide and conquer" is probably needed when the latest proposals are 9 screenfuls long :)

@Lawrence-G

@davidread That may be the best approach but I’ll leave it to the Open Standards Board to decide. This challenge has acted as a touchpaper to (re)ignite the conversation over these past months and this extended proposal is the result. To stall the momentum would be a shame but we do need to get things clearly defined. I hope that the work carried out so far will help inform the final profile.

Yes, the nine pages are somewhat epic. For the assessments, I have three links:

Dublin Core Metadata Assessment

Schema.org Assessment

CSV on the Web (CSVW) Assessment

As part of the open standards process, any standards recommended for use in a standards proposal are assessed using the 47 questions. The Open Standards Board agreed on the criteria used in the assessment. These questions are based on the EU CAMSS (Common Assessment Method for Standards and Specifications).
A negative answer to a particular question does not automatically indicate a failure; certain questions are weighted more than others, and the assessment will be taken as a guide to the suitability of a standard.

@PeterParslow

Regarding the Descriptive Metadata, I hope that the Open Standards Board takes into account the collection of Statutory Instruments collectively known as "The INSPIRE Regulations".

Starting with 2009/3157 (http://www.legislation.gov.uk/uksi/2009/3157/contents), and amended by 2012 No. 1672, it has since become quite a collection because of the range of "EU Exit" amendments. But basically it requires most public bodies in the UK to use a particular metadata "standard" to describe rather a lot of their data. Defra has invested in ensuring that the UK has a "standard" which allows this to work with data.gov.uk: GEMINI (https://www.agi.org.uk/gemini/).

Luckily, it's based on Dublin Core, and the Geospatial Commission is funding some work that is likely to result in GEMINI including advice on a Schema.org encoding.

I do urge proper thought before issuing "standards advice" that contradicts a Statutory Instrument. Sometimes "stalling the momentum" might be better than confusing the organisations one is trying to influence!

@pwin

pwin commented Dec 2, 2019

I agree with @PeterParslow on the need to move forward carefully. Not every organisation or line of business has the same requirements or is at the same level of maturity.
Other steps, such as putting in place a persistent identifier scheme for all information assets and then building that up into a set of catalogues, could be more readily implemented across the board and would help get other metadata into place.

@davidread

Can we reduce the proposal from recommending the whole of schema.org to just their Dataset schema? schema.org includes schemas for everything from DrugStrength to Electrician to PublicToilet.

schema.org is a disrupter, a competitor to existing data standards in numerous fields - e.g. library cataloguing has for decades used standards like MARC, BibFrame and FRBR. If Open Standards adopts the whole of schema.org wholesale without considering the more established open standards in each field, that could bias government adoption towards schema.org in all those fields.

@davidread

On schema.org the primary reason this proposal recommends it is:

Web publishers who include schema.org markup generally tend to have a competitive SEO advantage over those who don’t so it makes sense to adopt Schema.org within government.

Perhaps the supporters can expand on this point? A quick search brought up this summary from Moz:

Whether structured data affects rankings has been the subject of much discussion and many experiments. As of yet, there is no conclusive evidence that this markup improves rankings. But there are some indications that search results with more extensive rich snippets (like those created using Schema) will have a better click-through rate. For best results, experiment with Schema markup to see how your audience responds to the resulting rich snippets.
https://moz.com/learn/seo/schema-structured-data

@davidread

Again on schema.org, a secondary reason for the recommendation is:

Describing contents of published tabular data resources in a consistent way will also allow the data to be easily catalogued, validated and reused.

schema.org's Dataset is a big, sprawling standard, with over 100 properties, covering a wide variety of use cases. I strongly dispute the idea that simply adopting it wholesale will bring consistency to government metadata. Fields are all optional - each publisher will choose their own. The sheer number of options rather defeats the idea of standardization.

For example, just reading the standard you'll see lots of options for specifying the publisher of the data - whilst data.gov.uk and gov.uk have a simple model of allowing one or more departments attached as the "publisher", and no other role, schema.org's Dataset can have a publisher, author, creator, producer, provider, publisherImprint, funder, sponsor, sdPublisher, sourceOrganization, contributor, copyrightHolder. The definitions are ambiguous and confusing so people will get them wrong. Some will feel the only thing needed is which org captured the data ("producer"?), some will only include the "publisher", and some will think the copyrightHolder is the only thing to record. Whilst a human can make sense of one record, you cannot compare or automatically process them in bulk. So that is not a standard.

To take another field as an example - "date". There is a huge variety of ways to express dates: contentReferenceTime, dateCreated, dateModified, datePublished, expires, sdDatePublished, temporal, temporalCoverage. Now if you're trying to search for data on a topic and want to filter by data collected in the past year, you've got a really tough problem writing that query for datasets that could express their date 8 different ways.

With so much complexity, organizations will assign these differently every time, and those trying to make sense across government will have an impossible task.

The only obvious alternative is DCAT, which is no better in this respect, indeed comes with more linked data extensibility/complexity, so I wouldn't favour that.

In truth I rather like the core of schema.org's Dataset schema. I assume the likes of data.gov.uk and gov.uk have defined a subset of 10ish properties that they read (out of the 100 in the standard) and the rest are disregarded. I would suggest that this proposal does a similar thing and specifies which fields are recommended (a "profile" of the standard), which the Open Standards board points to in recommending this standard.

I've been involved in user research where publishers ask questions like "if you release a monthly set of figures on a topic, should each month be recorded as a new Dataset or just a Distribution within the existing Dataset"? schema.org doesn't have a view (and DCAT is unclear, I believe). This sort of problem can be solved by best-practice examples / supplementary guidance, so I'd like to see this in the proposal too.

@davidread

@PeterParslow rightly points out that the substantial proportion of datasets covered by the INSPIRE law need to have GEMINI-format metadata. To add some more background (someone correct me if this has changed since I was last involved), GEMINI is a specialist format used mainly by those in the environmental research community. GEMINI is a standard for data with a spatial element to it (whereas schema.org is for all types of data). The UK's INSPIRE datasets' metadata are all harvested into data.gov.uk, where the GEMINI metadata is simply translated to schema.org Dataset format and exposed like that to give the benefits of search engine discoverability.

So we already have both metadata formats being published for these datasets, which makes total sense to me. I'm not convinced that Open Standards Board making a recommendation for schema.org's Dataset would confuse publishers of INSPIRE data, but this could be usefully clarified in accompanying guidance.

@PeterParslow

@davidread A reasonable summary - just a couple of points:

  • GEMINI is used across the whole range of geographic information, not just environmental research. About half the records in data.gov.uk are created by harvesting GEMINI records; I think there are over 300 UK organisations contributing them - including most local authorities.

  • many of these relate to INSPIRE, but not all. But then INSPIRE's scope is " any data with a direct or indirect reference to a specific location or geographical area;" (http://www.legislation.gov.uk/uksi/2009/3157/regulation/2) for which a public authority is responsible.

For the GEMINI community, I can say we're glad that data.gov.uk translates parts of the GEMINI record to schema.org. A current Geospatial Commission project is likely to recommend that we continue this way, with a few proposed improvements to the schema.org mapping.

In conversation with Rosalie Marshall of GDS, I've volunteered to draft some accompanying guidance for GEMINI authors. I'd like to include the mappings*, so they can 'see for themselves' that publishing their GEMINI records will satisfy this Schema.org proposal. That will be much easier with David's other suggestion of specifying which Schema.org elements are in mind. Rosalie suggested that would be in guidance. Even the starting point, https://schema.org/Dataset, has 110+ properties, many of which on a quick look seem quite irrelevant.

*I'll probably base it on https://github.com/geonetwork/core-geonetwork/wiki/JSON-LD---ISO19139-mapping-proposal - GeoNetwork is a very common tool for creating & managing GEMINI records, and already has an optional Schema.org output.

@PeterParslow

GEMINI comes with quite extensive guidance on things like "publisher" (which organisations to specify) - https://www.agi.org.uk/agi-groups/standards-committee/uk-gemini/40-gemini/1062-gemini-datasets-and-data-series#23

and "date" (which date?) - https://www.agi.org.uk/agi-groups/standards-committee/uk-gemini/40-gemini/1062-gemini-datasets-and-data-series#8

If these need to be improved, I can do that easily. If they need to change, that needs a bit more governance!

There's also a related publication on improving your metadata quality: https://www.agi.org.uk/about/resources/category/81-gemini?download=100:metadata-guidelines-part-3-april-2015

Feel free to borrow any of these; it's all CC-BY

@davidread

Thanks @PeterParslow. That sounds useful to clarify the mapping to schema.org.

Thanks also for the links to the GEMINI guidance on publisher and date. schema.org has a similar list of definitions. You talk about governance to get metadata publishers to improve, but my experience shows it is tough to do. In Europe and the UK there have been various incentives, metrics, outreach, bottom-up efforts etc. I think the best thing is to require metadata against a small standard, with strict validation on submission, with short and clear guidance and convenient online tools that help you meet the standard.

Complexity is the main enemy in this space. You should include in a standard just enough fields for dataset publishers to satisfy the key user needs of open data users. The main need is "discoverability" (e.g. search engines), and you can do most things with title, description and some stricter machine-readable fields - publisher, date, link to the data. It's a disservice to offer the metadata author 110 fields. The next most useful thing in dataset metadata is probably structural metadata (a data dictionary), which schema.org's Dataset doesn't cover. Structural metadata is more CSVW's domain; however, that suffers from the complexity issue too.

@PeterParslow

Thoroughly agree @davidread - getting metadata quality to improve is more about education than governance. And much easier if we start with a small set. GEMINI has 21 mandatory fields - the ones you mention plus keywords, a link to the licence and a link to the specification (we've found keywords are important to most searchers, and the spec helps people decide between 'hits').

https://www.agi.org.uk/40-gemini/1250-element-summary

My comment about governance was really about what happens if this discussion concludes that the 'kinds of date' need to be different, because GEMINI is built on an ISO standard (19115:2003 - itself an evolution from Dublin Core) and accompanying European (INSPIRE) guidance.

@pwin

pwin commented Dec 5, 2019

I know it is slightly off topic, but I think we'd be better off getting persistent identifiers in place for tables, and for some of the fields in tabular data that are used for linking, than getting a set of metadata in place. I'm presuming here that the main reasons for documenting these tables are to find them and to merge/aggregate the data they contain. The choice of metadata is quite a rabbit hole, with domains having their own issues and preferences. However, knowing that "this CSV dataset" contains the same data as "that XML dataset", or knowing that the identifiers used in the "Local Authority" column in one CSV are from the same set as the "council" identifiers in another, is going to be of much more practical use than which specific items from schema.org are used.

@davidread

On CSVW:

I totally agree with the proposal's aim to improve government data by using structural metadata. It is great to record, for each column, what the data type is, any data standards or code lists it follows, or whether it references another dataset. As the proposal says, by defining these things in structural metadata, you can automate both the checking of the data and the loading of the data into different data stores. And I'd go further - this support for common identifiers, standards and linking may well end up incentivising these key elements. We can all agree that if you have quality, inter-connected data then you have a great basis for quickly drawing valuable insights from your data.

Let me expand a bit on the user needs, to help us evaluate a structural metadata standard like this. I believe the key user needs are:

  • "validate" - validate that a CSV file doesn't have the wrong columns or bad data in them, so that:
    • you can automatically check published CSVs are usable, correctly reference items in other datasets and can be aggregated across multiple publishers
    • in an internal data pipeline, you can catch errors early by validating at every stage
  • "load" - load the CSV into a datastore (e.g. database, dataframe, data warehouse), and get the column types setup sensibly, and references to other tables marked, so that you can:
    • make meaningful queries - e.g. a query for all records from April 2017 to March 2018 is hard if the date is loaded as a string
    • visualize - e.g. if you've correctly interpreted lat/long then you can plot records on a map
    • combine/aggregate

CSVW Positives:

CSVW Negatives:

  • very low adoption - hardly any organizations publish datasets with CSVW annotations and there are few tools (none support it fully)
  • complex - high bar to understanding it, publishing it and consuming it. e.g. a column can have 19 different sorts of annotations. The author can choose from three different vocabularies for defining the descriptive metadata e.g. dc:description, dcat:description, schema:description
  • a key design criterion was the ability to convert CSV to RDF, to marry up with W3C's linked data vision, which adds complexity, rather than focussing on the central use cases of "validate" and "load"
  • existing attempts at publishing CSVW are of doubtful value, e.g. see this Twitter thread: https://twitter.com/ldodds/status/1195398689532940288

In comparison, I think Frictionless Data's Table Schema achieves 80% of the value with 20% of the complexity. It has built up a much stronger claim to having actually improved data flows, a stronger community and an array of tooling. The standard is developed in the usual open source way on GitHub, and the likes of the Open Data Institute have given it support in their csvlint tool (along with CSVW). So it's an open standard, with a more lightweight process than the formality of a W3C working group.

I think there is a serious question mark over CSVW because it has not gained much traction so far. Would UK government be happy to embrace it, with the risk that the rest of the data world doesn't? Government might find itself investing in tools and trying to make the tech work, only to find another standard becomes dominant, and the investment is wasted.

CSVW is also a big standard that hasn't been tested much and can't easily be iterated (there is a lot of overhead in a formal W3C working group: ensuring worldwide compatibility, seeking consensus, etc.). Big-bang, inflexible specs are another risk.

To mitigate these risks I'd suggest that ONS and/or other interested parties might:

  • define a "profile" of CSVW, suggesting a subset of the vocabulary we should use, based on a narrow set of user needs that add the most value, to concentrate efforts, build momentum and deliver value
  • publish some example metadata, to help people understand the standard and build momentum

and then come back for approval from the Open Standards Board with some evidence of the standard becoming established and delivering value.


@lisssek

lisssek commented Dec 6, 2019

Thank you @pwin, @PeterParslow and @timwis for this really useful discussion and for being our critical friends in this challenge, helping us shape these proposals.

Throughout the GDS discovery in considering the needs for metadata across government departments, it became clear that there is a vast gap in how, across government, we collect and publish metadata. Some organisations are advanced and ahead of others in adapting their standards, while others collect no descriptive metadata at all. The aim of recommending rather than mandating Dublin Core elements (which, as @PeterParslow pointed out, form the basis for the majority of metadata standards) is to help departments which do not publish any metadata to start with Dublin Core and eventually mature to more complex standards such as DCAT.

You are all right in saying that there are various application profiles where different use cases and requirements are captured, but at its core the Dublin Core "core elements" constitute the bare minimum. The same bare minimum is also found in schema.org. I completely agree with @davidread that schema.org is a vast and comprehensive standard, and I like your idea of just recommending schema.org's Dataset schema and recommending the use of the same persistent elements in Dublin Core. I think the next step in these proposals is to provide a comprehensive recommendation and examples as to how we envisage these standards being used, and GDS will be working with you and the community on publishing this guidance to GOV.UK.

I think we have also already pointed out in the proposals that all of these standards are a starting point on a maturity framework, and we would like to help departments progress along that framework. @PeterParslow, you made an excellent point that improving metadata quality is more about education than governance. These proposals are the first step in that, and will be accompanied by comprehensive engagement, training and workshops across departments to help them improve the quality of metadata they publish, with input from the Government Data Architect community.

Also, @pwin, you made an excellent point on persistent identifiers. Without them, no matter what metadata standards we use, we have the problem of not being able to link or refer to data and metadata. The Open Data Board has recommended a standard for it, and I think the COGS team at ONS is doing some work in this space as well. This issue will be discussed at the Board meeting on Monday.

And @PeterParslow, we have redrafted the proposal on Dublin Core to confine the remit and reference GEMINI and INSPIRE, as you suggested during a call with Rosalie. The guidance that will be published on GOV.UK will also reflect exactly what is and isn't in scope when it comes to geospatial data and following Dublin Core; essentially, these different standards can be aligned so there is clarity on which metadata standard to follow when.

@frankieroberto

In our discovery on this, we concluded that neither CSVW nor Frictionless Data was really widely adopted enough to be particularly useful. Frictionless Data has the better tooling, though, and is a lot simpler to understand and implement.

In the short-term, I suspect that recommending either isn't helpful. Right now, probably the most useful thing you can do is to publish CSV in a consistent way, and then have some really good HTML pages documenting what all the columns mean in clear, understandable language. Not machine-readable, but at least human-readable.

Longer term, it might be the case that different de facto standards for CSV metadata emerge for different domains (eg publishing statistics vs financial transparency data).

@davidread

@frankieroberto I really appreciate the user research insights. Several have said that now is not the time to be agreeing or recommending a particular standard for structural metadata for open data.

However, I think many of us here are keen on the potential benefits in this area (and we're slowly working out how to express the vision in an understandable way!). So let's use the energy we collectively have in government to do lots of 'alpha'-stage work in this space. We can try different approaches and standards, each time trying to cultivate a little ecosystem of publishers and consumers that benefit from the metadata. And once we find a successful formula, only then do we start shouting about it, firm up the standard, and scale up.

There are some great practical ideas and growing consensus in this thread of where we can start:

  • working with a small standard and iterating it (great to hear @pwin that W3C are moving towards supporting that way of working!)
  • creating simple guidance, examples and even training (amazing @lisssek!)
  • community - it's super that @gheye and Rosalie are driving our community to discuss these things, build the goodwill needed and drive it forward together

The highest profile work on this is probably the ONS COGS project in the stats community of UK government. I'm really interested to see what can be achieved here - the people are distributed but closely networked, they include both producers and consumers of data & stats, and they have plenty of technical clout - it seems like a really fertile opportunity.

@Lawrence-G

Today, the Open Standards Board accepted these proposals as recommended standards, with a few conditions that will be added to the profiles when published on GOV.UK.

Thank you, everyone, who contributed to this challenge.

@bdsharmaco

Hi everyone, I have recently applied schema.org Dataset markup to NHS Digital's publications on https://digital.nhs.uk. There are around 2,000 now flowing, and in theory visible in Google's Dataset Search tool. I would be open to discussing how we can consistently apply schema.org, Dublin Core and any recommended standard to the data, so it is embedded in the HTML web content we publish. Any feedback is welcome, along with improvement suggestions.
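
For others wanting to do something similar, the usual mechanism is a JSON-LD script block embedded in the page. A minimal sketch (the values and URL path are invented, not NHS Digital's actual markup):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example statistics publication",
  "description": "An illustrative dataset description embedded in an HTML page.",
  "url": "https://digital.nhs.uk/data-and-information/example-publication"
}
</script>
```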

