
Schemas for Tabular Data Challenge #40

Closed · 1 of 4 tasks
Lawrence-G opened this issue Mar 28, 2017 · 44 comments

@Lawrence-G

Schemas for Tabular Data

Category

  • Data
  • Document
  • Technical
  • Other Suggestions

Suggested by

Originally Submitted by pwalsh on Mon, 13/03/2017 on standards.data.gov.uk

Short Description

Much data published by governments is in common tabular data formats: CSV, Excel, and ODS. This is true for the UK government and governments around the world. To provide assurances around reusability of tabular data, consumers (users) need information on the "primitive types" for each column of data (example: is it a number? is it a date?). This also allows for quality checks to ensure consistency and integrity of the data.

Publishing Table Schema with tabular data sources provides this information. Table Schema has previously been used in work by Open Knowledge International (OKI) with the Cabinet Office to check the validity of 25K fiscal data files against publication guidelines. Table Schema is also used widely by other organisations working with public data, such as the Open Data Institute (ODI).

User Need

I've written several user stories below. Each user story applies equally to a range of users. The user personas are as follows:

  • Developer: a user reusing public data in derived databases, visualisations, or data processing pipelines.
  • Business analyst: a user looking to public data as a source of information for analysis of business use cases that involve some component of public good.
  • Citizen: a non-technical user who expects government to publish consistent, high quality data.

User stories

As a user, I want all public data published by government to conform to a known schema, so I can use this information to validate the data.

As a user, I want public data published by government to have a schema, so I can read the schema and understand at a glance the type of information in the data, and the possibilities for reuse.

Expected Benefits

  • Vastly increased reuse of public data.
  • Increased trust in publication flows, generated by those flows producing quality data.

Functional Needs

The functional needs that the proposal must address.

@philarcher

W3C has a formal Recommendation in this space: Metadata Vocabulary for Tabular Data, which encodes the Model for Tabular Data and Metadata on the Web. The work was originally inspired and subsequently led by Jeni T, with Dan Brickley as co-chair. Jeremy Tandy (Met Office) was a key WG member too.
The WG looked at tabular data in the wild (all its use cases come with real examples) and handled awkward realities like multiple lines of headings, right-to-left tables and non-ASCII characters. It uses the Web to link the metadata definitions which, like Table Schema, are defined in JSON. The metadata can be embedded in the CSV/TSV, placed at a well-known location, or linked from the CSV directly. You can override a defined metadata file with one of your own. All this, of course, makes the metadata definitions reusable - handy for regularly published datasets, for example. The combination of CSV/TSV and the metadata file means that you can use the data directly or transform it programmatically into JSON or RDF. The standard supports URI templating out of the box, but it also has extension points for extra rules so that you can use it as the basis for something like OpenRefine that would, for example, transform dates into a standard format, or normalise either of UK|United Kingdom into a regular form.

The potential here is that a set of metadata files can be defined and maintained centrally, with tooling to validate a published CSV/TSV file and reduced effort to create visualisations etc. It gets away from the notion of packages that are made available for download and local processing (what I call using the Web as a glorified USB stick), and makes linking across multiple datasets and to things like the registries much easier. If you like, it's 5-star data in CSV.
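
To illustrate the URI templating mentioned above, a CSVW metadata file can declare an aboutUrl template that turns each row into a linkable resource. A minimal sketch (the file name, column names and example.org URIs are invented for illustration):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "aboutUrl": "http://example.org/country/{code}",
    "columns": [
      { "name": "code", "titles": "Code", "datatype": "string" },
      { "name": "name", "titles": "Name", "datatype": "string" }
    ]
  }
}
```

Here {code} is expanded from the code column of each row, so the same centrally maintained metadata file can assign stable URIs across every release of a regularly published dataset.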

@edent

edent commented May 19, 2017

Also - CSV on the Web https://w3c.github.io/csvw/primer/

@DavidUnderdown

The National Archives also developed a CSV Schema Language and validator http://digital-preservation.github.io/csv-schema - currently departments are expected to supply a CSV file meeting a supplied schema when transferring digital records to The National Archives, to meet their Public Records Act obligations.

@frankieroberto

There's also Frictionless Data Packages, which apparently have some traction (e.g. they're supported by the Open Data Institute and Google.org).

@gheye

gheye commented Jun 21, 2019

Data Description Language

This comment is a proposal on defining a descriptive language for data within government. We're asking for your feedback so we can develop and write this guidance based on advice from the government data community.

We have published blog posts on this topic:

  1. Excel Spreadsheets - which outlines how a simple data description language can assist in sharing data, and the reality that spreadsheets are not disappearing. https://dataingovernment.blog.gov.uk/2019/06/10/improving-how-we-manage-spreadsheet-data/

  2. Why we need Data Standards (due for publication soon), which is an introduction to data standards and some of the proposals.

The aim is to have a common way of describing files, ‘a data description language’, which organisations can use across all file types and formats.

There are several priorities for the data description language. We need to make sure that:

  • it is easy to use

  • the process for defining the language is community driven, using tools such as Slack and GitHub

  • the language is useful and adds value for the community

  • we can explain the basics of the language in one page that we can publish with the guidance

We also need to make sure that the data description language follows existing standards wherever possible. This includes:

    • aligning with CSV on the Web - TAG:Text

    • CSV RFC 4180 - promoting the use of double quotes around key items of text

    • verifying new tags against Dublin Core to check whether the suggested tags already exist or whether we should create them. Suggestions on tagging frameworks to use are much appreciated.

We will produce a number of documents alongside this proposal including:

  • The one-page description of the tags, which will largely follow the format of the proposal below

  • A more detailed specification of the language - this will follow the style of the previously
    published API documentation. This is currently being worked on.

Proposed Tags
Please note that these tags can be provided in a separate file for a CSV file, or on a page within a spreadsheet. Each item will exist on a separate line.

Links have been added in the second column for those tags which come from Dublin Core. An Area column and a proposed Compulsory column have been added. The last item, parse-to-end-file, has been removed. Two items for reference data have been removed: register-collection and register-standard. declare-datasheet has been added so the sheet the data sits on can be defined. One standard item, 'standard', has been removed, and standard-url has been changed to conformsTo.

| Area | Property (Tag) | Example | Comment | Compulsory |
| --- | --- | --- | --- | --- |
| Core | creator | creator:"Russell Singh russell.singh@digital.cabinet-office.gov.uk"<br>creator:"Indira Singh, Sue Chan, Gregory Pie" | Dublin Core. This can be a comma separated list of creators. | |
| Core | contributor | contributor:"justine.gornall@company.co.uk" | Dublin Core. This can be a comma separated list of contributors (see creator for handling multiple entries). | |
| Core | title | title:"GDS Employees" | Dublin Core. | |
| Core | created | created:2002-10-02 | Dublin Core. The format of this date, and whether double quotes are used, needs to be agreed. | |
| Core | identifier | identifier:"0000015_GDS_SDA_XLS" | Dublin Core. | |
| Core | description | description:"All heights at GDS" | Dublin Core. | |
| Core | valid | valid:"2012-2013" | Aligns with Dublin Core. This refers to the date-valid range. What acceptable range of values would we accept here? Suggested by DWP. | |
| Core | replaces | replaces:"GDS Employees V1" | Aligns with Dublin Core. The document or item that this replaces. What acceptable range of values would we accept here? Suggested by DWP. | |
| Core | license | license:"https://opensource.org/licenses/MIT" | Aligns with Dublin Core. Proposed by ONS. The license that applies to the document. | |
| XLS\ODF | declare-header | declare-header:"A1:A2"<br>declare-header:"Sheet2!A1:A2" | For spreadsheet data. | |
| XLS\ODF | declare-datasheet | declare-datasheet:"MyDataSheet" | For spreadsheet data. | |
| XLS\ODF\CSV | column-type | column-type:"ColumnName:String"<br>column-type:"Country:String"<br>column-type:"Age:Number" | For spreadsheet data. This item can be repeated for each of the columns in the data set. Amended following feedback from @davidread; this now aligns much more closely with CSVW. | |
| XLS\ODF | declare-data | declare-data:"Sheet2!A4:D8"<br>declare-data:"A4:D8" | For spreadsheet data. | |
| Core | format | format:"xls" | To allow future expansion of the data description language. Aligns with Dublin Core. Assumption is that double quotes are not required. | |
| CSV | file-delimiter | file-delimiter:"," | To be used if another file delimiter is being used. | |
| Proposal | fileformat-puid | fileformat-puid:"fmt/62" | Proposed by The National Archives. There is a REST API for obtaining information on the PUID (PRONOM Unique Identifier): http://www.nationalarchives.gov.uk/PRONOM/fmt/62 | |
| Proposal | fileformat-creating-application | fileformat-creating-application:"Excel 1997" | Proposed by The National Archives. Should this be shortened to creating-app? | |
| Proposal | standard-comment | standard-comment:"RFC4180" | One comment tag may be used. | |
| Core | conformsTo | conformsTo:"https://tools.ietf.org/html/rfc4180" | The standard the file must conform to. Changed following feedback from @pwin. | |
| Core | doc-sensitivity | | A future proposal for documents that should have sensitivity applied to them. Should this align with the Dublin Core tag accessRights? What would be an acceptable range of values? | |
| Proposal | register-column | register-column:"Address" | The column from the dataset that should have reference data applied to it. The term used in GDS for reference data is "register". Should this allow a series of columns? | |
| Proposal | register-url | register-url:"https://www.registers.service.gov.uk/registers/ddat-profession-capability-framework" | The URL to the specific register that this applies to. This may be machine readable or human readable depending on the usage. | |
| CSV | top-row-header | top-row-header:true | To aid processing. True or false. | |
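
For illustration, a complete sidecar description using these tags might look like the following. This is a sketch assembled from the examples in the table above (the values are the table's own placeholders, adapted for a CSV file):

```
title:"GDS Employees"
creator:"Indira Singh, Sue Chan, Gregory Pie"
created:2002-10-02
identifier:"0000015_GDS_SDA_XLS"
format:"csv"
file-delimiter:","
conformsTo:"https://tools.ietf.org/html/rfc4180"
column-type:"Country:String"
column-type:"Age:Number"
top-row-header:true
```

Each tag sits on its own line, as proposed above.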

@pwin

pwin commented Jul 1, 2019

  • I don't think there is much special about government that requires it to have something particular to itself, though it has authority for specific data and is also a big player in the market, and consequently has influence. The latter can be a problem for society generally if government decides to go the 'wrong way' - just read Programmed Inequality
  • Many of the tags above are related to the concept of the resource and less to the specific distribution. I think that it is important to separate these so that we have an approach to creating catalogues that is more logical. DCAT v2 mentions this:

an important distinction between a dataset as an abstract idea and a distribution as a manifestation of the dataset.

Most of the other bits that you mention above @gheye are schema or processing instructions and I think it is important to separate these. Also, there might be rules of different types within the constraints for validation, or within the instructions for processing. So I think that these aspects need to be handled as part of an interoperability metamodel - see EIRA as an example. But if you think that this over-complicates things then either slim down your proposal so that you're not trying to replicate the many illustrations above, or else join in with the EC activity.

@gheye

gheye commented Jul 1, 2019

Hi @pwin

Thanks for your comments.

I am coming up to see you and we can discuss this in more detail.

We do need to align with international standards and work with them, but in a dispersed organisation such as the UK government we need to be flexible and, in some cases, more lightweight.

We can discuss more when I come up to Glasgow.

Best Wishes,

gheye

@davidread

Along with the existing user stories of documenting and validating the data, I'd like to suggest another:

  • As a data scientist/engineer, I want to load the data into a data store, so that I can do queries/analysis

This is helped by having a schema. I'll just explain the situation we have at MOJ: we started a simple internal data catalogue, with each data table described by metadata including a schema describing the column properties. Whilst you can load a CSV into a data store (or a Parquet file into a data frame) and let it auto-detect column types, it often makes mistakes - for example converting numbers to dates, treating dates as text, dropping the leading 0 in telephone numbers, interpreting nulls as strings, choosing int16 when int64 will be needed in future, etc. So to make this more reliable, colleagues have written some little tools to convert the existing schema to a number of related schema formats, suited to various data stores, for example:

Ideally all these things would accept a schema in the same format. Pandas and Spark accept it programmatically. Glue and BigQuery express it in slightly different JSON formats. 🤷‍♂

But if we accept we need converter tools for this use case, I don't think this user story imposes anything extra on the schema format - just defining the name and type of each column would cover it. The set of allowable column types is probably worth discussing - I guess the SQL types are a reasonable start. See our conversion table between pandas, Spark and AWS Glue.
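
For a sense of what that minimal schema format could look like, here is a Frictionless Table Schema fragment defining just a name and type per column. This is an illustrative sketch, not MOJ's actual schema; the field names are invented:

```json
{
  "fields": [
    { "name": "telephone", "type": "string" },
    { "name": "record_date", "type": "date" },
    { "name": "case_count", "type": "integer" }
  ]
}
```

Declaring telephone as a string, for example, is what stops a loader from dropping the leading 0.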

It would be great to hear if this use case is common or not, probably from others establishing data warehouse functionality, with data catalogues and ETL needing schemas.

@davidread

To summarize, these are the suggested options to meet the challenge:

To advance this discussion, perhaps we should compare the suitability of these in meeting the user needs (for developers, business analysts and citizens), identified by @pwalsh in the challenge.

@timwis

timwis commented Jul 2, 2019

Hm, I would suggest that the "Data Description Language" that @gheye proposes is rather distinct from how we describe the fields/schema of the data, and should warrant its own conversation/thread -- particularly since there are a lot of existing standards out there around metadata worth discussing, and they don't necessarily include field-level information.

If we prefer to treat it as a single standard, I would re-emphasise tabular data packages as an existing standard that prescribes the lowest common denominator of fields and allows for extensions, and also addresses field-level information via table schema.

@gheye

gheye commented Jul 4, 2019

Hi All,

My view is this does not preclude DCAT 2 or CSV on the Web, and it can be viewed as a stepping stone.

At the lowest level, for a department using Excel or CSV that is not in a position to move to DCAT or CSV on the Web, can we provide a list of common tags to be used across government? These tags should allow a user to more easily migrate to a larger, more comprehensive framework.

Can they not also be used in CSV on the Web and form the basis of a DCAT design, especially since they align as closely as possible with international standards? Additional tags will only be created where they do not exist in one of the standards.

The ask is therefore simple and quite basic: can we agree a common set of tags to be used across government? After we have done this, we would create recommendations and migration strategies to CSV on the Web and DCAT 2, or an international standard that we agree on.

@gheye

gheye commented Jul 9, 2019

Hi,

I have amended the column definition above to align more closely with CSVW.

Gareth H.

@gheye

gheye commented Jul 15, 2019

Following feedback from NHS Digital and GDS, I have updated the table to include:

  1. A proposed compulsory column

  2. An Area that the item applies to. For want of a better term, I have marked all those created from requests that are not CSV, spreadsheet or Core as 'Proposal'.

@gheye

gheye commented Jul 23, 2019

A suggestion from @pwin that is worth discussing is whether we should have a mediatype tag.

A list of possible media types is shown here:

https://www.iana.org/assignments/media-types/media-types.xhtml

@gheye

gheye commented Jul 31, 2019

I received excellent and extensive feedback from ONS at the individual item level. Most of the changes have been added to the table above.

@gheye

gheye commented Aug 7, 2019

There are three items that NHS Digital would like to add to this list from schema.org:

schema:datePublished
schema:spatialCoverage
schema:temporalCoverage

Please confirm whether you agree or disagree with these items.

@frankieroberto

@gheye I'm a bit confused. Are you proposing a new "Data Description Language" standard? My understanding was that the adopted open standards could only be existing standards, not newly-created ones? Or does your proposal build on top of CSV on the web or Frictionless Table Schema?

@gheye

gheye commented Aug 7, 2019

Hi @frankieroberto,

We are not proposing a new data description language. If you look at the individual items in the list above, almost all of them come from existing standards.

The only new items we are proposing are those that fill a need raised in government where a tag does not currently exist. If you can find an existing tag for any of the additional tags above, then please point us towards it.

As mentioned above, this should be viewed as a stepping stone to something such as CSV on the Web. In fact, the syntax has been aligned with that as far as possible.

This is a proposal for an initial set of tags that the UK government could begin to adopt before moving to something larger. It is a stepping stone.
Regards.

@MikeThacker1

Some points I raised in a chat with @gheye that he asked me to add here:

@pwin

pwin commented Aug 7, 2019

w3c/dxwg#868 from the DXWG issues is relevant to the discussion about accreted datasets etc. So is the discussion within DXWG on qualified relations

@PeterParslow

There are two kinds of standards being discussed here, without always being clear on which:

  • structures for the schema definition language: here the suggested standards offer a range of choices

  • semantics - the actual tags that are being proposed. The core ones almost all come from Dublin Core, and are effectively a declaration that 'the government requires these metadata elements about any schema'. The others are 'the meat', e.g. what do you call the array elements that give the name & description of each column?

Several of the proposed standards may have the same mix; Table Schema & CSV on the Web only do the latter - describe how to define the CSV structure, not specify the metadata expected to accompany the schema definition.

(They are pretty much the core of all metadata standards. GEMINI distinguishes 'metadata about the metadata' and 'metadata about the dataset' - but I don't think many people like the two terms!)

Second point: the DCLG-sponsored Brownfield Land Register schema was developed by iStandUK using the 'table-schema' approach (in order to exploit the ODI's CSVLint tool). This is in use by many (all?) English planning authorities, placing data records on data.gov.uk.

@MikeThacker1

On Peter's second point, the conformsTo tag should reference the relevant schema, in this case the one from iStandUK. Both CSVLint and an LGA CSV validator tool validate against the schema. data.gov.uk used to allow you to identify all datasets that conform to a given schema, but it has now dropped that functionality. If GDS uses conformsTo, we should be able to revive (somewhere) discovery of datasets that conform to a given schema. This is needed for joining datasets from many publishers in local government.

@Lawrence-G

The following proposals for this challenge (Dublin Core, Schema.org, CSVW) are the result of the GDS data standards workshops held over the last couple of months and comments and suggestions made on GitHub.
The workshops focussed on the data description language for tabular data structures that @gheye published here, and identified that the language proposal was trying to solve a number of different problems: to create a standard for:

  • data shared privately between individuals/govt organisations so that they can catalogue their data
  • data shared privately between individuals/govt organisations so that they can easily share and combine data sets, and ensure interoperability
  • published data so that it can be easily catalogued, found and shared

It was agreed to post proposals for the different standards referenced in the language as separate recommendations. For this purpose, GDS worked with ONS on the proposals for Dublin Core, schema.org and CSV on the Web.

I'm posting on behalf of the team who put the proposal together.

Proposal: Recommend Dublin Core to describe data shared privately

Introduction

This proposal is to recommend that government departments use the Dublin Core schema as a minimum set of metadata to be associated with data they are sharing privately. For tabular data being published, we have a separate proposal with the Open Standards Board to use Schema.org.

Using the Dublin Core schema to describe tabular data that is shared privately means individuals/teams/departments would be consistent in the way they describe the contents of their tabular data resources, especially in departments where no metadata is currently collected or published, allowing the data to be easily catalogued, validated, reused and found. It is the first step in achieving metadata maturity across government departments and reaching a common core set of metadata associated with a dataset.

This proposal is based on the idea that the metadata elements associated with Dublin Core represent the core set of metadata elements conserved across government. Since Dublin Core sets a foundation for many more complicated standards such as DCAT (a recommended standard at the higher end of the metadata maturity spectrum), it ensures that the same elements are preserved when a more complicated metadata standard is implemented, without the need for complex translation between standards.

Please note that this proposal is based on two assumptions:

  1. Dublin Core should be used as a “core metadata elements” standard and be considered as the first step in achieving metadata maturity

  2. The data should be identified using persistent resolvable identifiers as recommended by the Open Standards for Government.

This proposal builds on the Open Standards Board recommendation of the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.

User need approach

The user need identified by this proposal is to maximise consistency in and give context to the tabular data being shared across government. Adopting the Dublin Core schema should help to achieve the first step in sharing data with associated metadata to ensure trust and increase the confidence in handling of data.

Users in the context of this proposal are government workers who create, share and maintain tabular data, and need to be able to validate it. Individual users include but are not limited to data scientists, business analysts, people who need to use a spreadsheet application to do basic analysis and developers who process data in a range of software.

Achieving the expected benefits

If departments use the Dublin Core schema, they would be consistent in the way they describe their tabular data. The Dublin Core Metadata Element Set is one of the simplest and most widely used metadata schemas, since it represents a minimum set of 15 metadata elements conserved across metadata standards. For example, at ONS, all of the Dublin Core metadata elements are conserved in a much more complex metadata model which captures the metadata needs of a statistical organisation.

Dublin Core comprises 15 "core" metadata elements, whereas the "qualified" Dublin Core set includes additional metadata elements to provide greater specificity and granularity. If Dublin Core is adopted as an Open Standard for government, the Government Digital Service will produce guidance on how these elements should be used. GDS will advocate for the 15 core metadata elements to be conserved across government departments.
Additionally, the Government Data Architecture Community (GDAC), fostered by ONS, has offered to work with GDS to deliver half-day workshops that will help government users understand why using Dublin Core is useful and helpful.

Using the Dublin Core standard will mean government users have an improved idea of what kinds of information should be recorded, where and how.
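
As a sketch of what this looks like in practice, here are some of the 15 core elements expressed as a simple record accompanying a shared file. The values are invented, and JSON is just one possible encoding, since Dublin Core elements are independent of coding syntax:

```json
{
  "dc:title": "GDS Employees",
  "dc:creator": "Government Digital Service",
  "dc:description": "Heights of GDS employees",
  "dc:date": "2002-10-02",
  "dc:format": "text/csv",
  "dc:identifier": "0000015_GDS_SDA_XLS",
  "dc:language": "en",
  "dc:rights": "Open Government Licence v3.0"
}
```

The remaining elements (subject, publisher, contributor, type, source, relation and coverage) follow the same pattern.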

How Dublin Core complements other standards

The Dublin Core standard is at the core of the majority of metadata standards and is limited to only 15 core elements, which GDS will advocate that the cross-government community use to share data across departments where no other metadata is captured. Dublin Core elements are independent of coding syntax. As the metadata maturity of a government department improves, the Dublin Core metadata elements will become part of much more complex standards, as is the case at ONS (whose model includes standards such as DCAT and ADMS, amongst others). Using Dublin Core elements is a first step in reaching that maturity and ensures that at least a minimal set of metadata is captured and shared across government departments in a standardised way.

However, when departments are publishing data openly, they should use schema.org to describe the data, as this is a collection of metadata schemas that are focused on SEO and are particularly targeted at webmasters. Schema.org is already in use by Data.gov.uk and GOV.UK. Whilst Dublin Core elements describe both physical and web resources, schema.org has been designed specifically for search engine optimization. Schema.org is also a complex and mature metadata standard, and the majority of government departments are required to support it when publishing their open data datasets to data.gov.uk pages.

Whilst schema.org is a fantastic standard for publishing data on the web, government departments often lack any means of publishing common and coherent metadata to fit their in-house needs. The Dublin Core set of elements forms a basis for many other standards, which can be easily adopted as the organisation matures in its metadata journey. The schema.org metadata standard cannot be used, nor is it designed to work, in the same way.

Other steps to achieving interoperability

This proposal is only concerned with promoting a consistent and accurate set of metadata associated with tabular data when government is sharing data privately.

When government is sharing data publicly, i.e. publishing the data, we have a separate proposal to the Open Standards Board to use schema.org (link), since this has been the preferred option in the open linked data community.

When wanting to describe how the data is shaped and formatted within a particular file, we have a separate proposal to the Open Standards Board for the use of CSV on the Web.

Proposal: Recommend schema.org to describe tabular data you are publishing

Introduction

This proposal is to recommend that government departments use schema.org to describe open data they are publishing. For data being shared but not published, we have a separate proposal with the Open Standards Board to use Dublin Core as the minimal set of metadata elements conserved across government.

Using schema.org to describe tabular data that is published by individuals/teams/departments means search engines can better find the data and display structured results to end users more efficiently. Describing the contents of published tabular data resources in a consistent way will also allow the data to be easily catalogued, validated and reused.

This proposal:

  • follows the Open Standards Board adoption of the schema.org JobPosting schema in 2016 to ensure consistent formatting of job posts across government
  • builds on the Open Standards Board recommendation of the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.

User need approach

The user need identified by this proposal is to make government-published data easier to find and to maximise its use. Officially adopting schema.org should help achieve this, since the standard has already been used to publish open data on government websites for some time.

Users in the context of this proposal are government workers who create, share and maintain tabular data, and want people to be able to find it. Individual users include but are not limited to data scientists, business analysts, technology policy advisors, economists and members of the public.

Often, openly published government data can be difficult to find. Web publishers who include schema.org markup generally tend to have a competitive SEO advantage over those who don’t so it makes sense to adopt Schema.org within government. Schema.org allows context to be provided for an otherwise ambiguous webpage, improving the quality of search results for users.

The user needs identified for adopting schema.org are for users to be able to:

  • find and reuse data published by the government on GOV.UK and 3rd party websites
  • perform advanced searches by item type, for example, event or location
  • search for data published by government regardless of where it is published

Achieving the expected benefits

If departments use schema.org, they would be consistent in the way they describe their published data. schema.org is supported by the major search engines and takes full advantage of the semantic web.

schema.org is a set of tags that aims to make annotating HTML elements with machine-readable tags much easier. schema.org is already used by Government websites publishing data, including GOV.UK and Data.gov.uk.

Different bits of government and associated agencies publish tabular data in different formats, and they regularly exclude pertinent information. Sometimes, even when the information is included, it is not machine readable and so is not easily findable. Using schema.org as an open standard for published data across government will help online users find relevant, accurate data to fit their needs. schema.org will also make it easier for aggregators, search engines and others to reuse data published by the government, in turn making it easier for users to find data relevant to them on non-government services.
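
As an illustration of the sort of markup involved, a minimal schema.org Dataset description might look like this. This sketch uses invented values and shows only a small subset of the available properties:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "GDS Employees",
  "description": "Heights of GDS employees, updated monthly.",
  "license": "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
  "publisher": {
    "@type": "Organization",
    "name": "Government Digital Service"
  },
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.gov.uk/data/gds-employees.csv"
  }
}
```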

When schema.org may not be suitable

The schema.org standard is a powerful tool for helping online users find the information they need. However, when departments or government individuals are sharing data between themselves, they should use Dublin Core tags to describe data, as this is a collection of tags that are less focused on the web, and are more suited to structuring the data.

Whilst schema.org is a fantastic standard for publishing data on the web, government departments often lack any means of publishing common and coherent metadata to fit their in-house needs. The Dublin Core set of elements forms a basis for (and is part of) many other more complicated standards (such as DCAT), which can be easily adopted as the organisation matures in its metadata journey. The schema.org metadata standard cannot be used, nor is it designed to work, in the same way; however, the metadata elements described as "core" are also shared with the schema.org standard.

Other steps to achieving interoperability

This proposal is only concerned with promoting consistent and accurate data when government is publishing data openly.

When government is sharing data privately, we have a separate proposal to use Dublin Core.

When wanting to describe how the data is shaped and formatted within a particular file, we have a separate proposal to the Open Standards Board for the use of CSV on the Web.

Proposal: Recommend CSV on the Web to annotate tabular data column properties

Introduction

A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. CSV files may be of a significant size, but they can be generated and manipulated easily, and there is a significant body of software available to handle them. Indeed, popular spreadsheet applications (Microsoft Excel, iWork's Numbers, or OpenOffice.org), as well as numerous other applications, can produce and consume these files. However, although these tools make conversion to CSV easy, it is resisted by some publishers because CSV is a much less rich format that can't express important detail that the publishers want to express, such as annotations, the meaning of identifier codes, etc.

Existing formats for tabular data are format-oriented and hard to process (e.g. Excel); inextensible (e.g. CSV/tab separated values (TSV)); or they assume the use of particular technologies (e.g. SQL dumps). None of these formats allow developers to pull in multiple data sets, manipulate, visualize and combine them in flexible ways. Other information relevant to these datasets, such as access rights and provenance, is not easy to find. CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata.

To address these issues, CSVW seeks to provide:

  • a metadata vocabulary for CSV data
  • access methods for CSV metadata
  • a mapping mechanism for transforming CSV into various formats (e.g. RDF [rdf11-concepts], JSON [RFC7159], or XML [xml])

This proposal is to recommend that government departments use CSV on the Web (CSVW) to process CSVs into an annotated data model, so that CSVs can be annotated, interoperable and more easily shared. This is an open and established standard and is currently being used and recommended by ONS. It should be noted that this standard deals with tabular data; a standard which deals with a wider array of formats might be recommended in future.

CSVW is a W3C standard for metadata descriptions for tabular data, and will give government a standard way to express useful metadata about CSV files and other kinds of tabular data.

With CSVW, after the tabular data is annotated, the model is used as the basis for creating RDF or JSON. For example, CSVW assumes that the first row of a CSV is a header row containing the titles for the columns, and that each subsequent row constitutes a record with properties. Once a model for the tabular data has been established, the file can be easily integrated with other data.
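
A minimal CSVW metadata file expressing this model might look like the following sketch (the file name and columns are invented; the @context and tableSchema structure are those defined by the W3C specification):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "employees.csv",
  "tableSchema": {
    "columns": [
      { "name": "name", "titles": "Name", "datatype": "string" },
      { "name": "start_date", "titles": "Start date", "datatype": "date" },
      { "name": "height_cm", "titles": "Height (cm)", "datatype": "integer" }
    ]
  }
}
```

A CSVW processor can use the datatype annotations to validate each cell and to emit correctly typed JSON or RDF.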

This proposal builds on the Open Standards Board recommendation of the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.

User need approach

Different bits of government and associated agencies publish tabular data in different formats, and they regularly exclude pertinent information on the column properties, making it hard for files to be shared. Using CSVW as an open standard for tabular data shared across government will help government workers aggregate and reuse data.

CSVW will help users consolidate different tabular data sources into one file, or load their data into a data store so that they can run queries and analysis. To help users document, validate and catalogue their data, we have separately proposed Dublin Core and Schema.org as Open Standards.

CSVW is a standard for describing column properties in tabular data. Users can load a CSV into a data store and let it auto-detect column types, but mistakes are often made: converting numbers to dates, treating dates as text, dropping the leading 0 in telephone numbers, interpreting nulls as strings, choosing int16 when int64 will be needed in future, etc. To make data transfer more reliable, CSVW can be used to convert the existing schema to a consistent format that data stores know how to read. The consistent format will also mean different tabular data files can be easily integrated.

Users in the context of this proposal include:

  • developers who have to pull in multiple data sets, and manipulate, visualize and combine them in flexible ways
  • statisticians who want to reuse statistical data and not be inhibited by a lack of explicit definition of column heading meanings
  • suppliers and consumers of CSV data including government organisations
  • linked open data users, since this format can be easily parsed and added to open data portals

Achieving the expected benefits

As is noted by the W3C Working Group, CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata (the Working Group's aims are listed above).

If departments use CSVW, they would be consistent in the way they describe their tabular data column properties. CSVW is already in use across government, and support for the standard is noted in the W3C Working Group's Use Cases and Requirements brief.

When CSVW may not be suitable

CSVW is suitable for tabular information formats but not for other formats.

Other steps to achieving interoperability

This proposal is only concerned with promoting consistent and accurate tabular data when government is transporting data into a data store or another tabular data file.

When government wants to share details about the data it holds so that it can be catalogued and found easily, we have separate proposals to the Open Standards Board to use Dublin Core and Schema.org.

@davidread

I agree with @pwin, @PeterParslow and @timwis that we are discussing two rather distinct sorts of metadata here.

I'd describe them as Descriptive Metadata (DCAT, DC, schema.org) and Structural Metadata (Table Schema, CSVW, TNA's CSV Schema Language). @Lawrence-G Please can a separate challenge be created for the former? These two areas have rather distinct user needs and merit separate discussions. And a bit of "divide and conquer" is probably needed when the latest proposals are 9 screenfuls long :)

@Lawrence-G

@davidread That may be the best approach but I’ll leave it to the Open Standards Board to decide. This challenge has acted as a touchpaper to (re)ignite the conversation over these past months and this extended proposal is the result. To stall the momentum would be a shame but we do need to get things clearly defined. I hope that the work carried out so far will help inform the final profile.

Yes, the nine pages are somewhat epic. For the assessments, I have three links:

Dublin Core Metadata Assessment

Schema.org Assessment

CSV on the Web (CSVW) Assessment

As part of the open standards process, any standards recommended for use in a standards proposal are assessed using the 47 questions. The Open Standards Board agreed on the criteria used in the assessment. These questions are based on the EU CAMSS (Common Assessment Method for Standards and Specifications).
A negative answer to a particular question does not automatically indicate a failure; certain questions are weighted more than others, and the assessment will be taken as a guide to the suitability of a standard.

@PeterParslow

Regarding the Descriptive Metadata, I hope that the Open Standards Board takes into account the collection of Statutory Instruments collectively known as "The INSPIRE Regulations".

Starting with 2009/3157 (http://www.legislation.gov.uk/uksi/2009/3157/contents), and amended by 2012 No. 1672, it has since become quite a collection because of the range of "EU Exit" amendments. But basically it requires most public bodies in the UK to use a particular metadata "standard" to describe rather a lot of their data. Defra has invested in ensuring that the UK has a "standard" which allows this to work with data.gov.uk: GEMINI (https://www.agi.org.uk/gemini/).

Luckily, it's based on Dublin Core, and the Geospatial Commission is funding some work that is likely to result in GEMINI including advice on a Schema.org encoding.

I do urge proper thought before issuing "standards advice" that contradicts a Statutory Instrument. Sometimes "stalling the momentum" might be better than confusing the organisations one is trying to influence!

@pwin

pwin commented Dec 2, 2019

I agree with @PeterParslow on the need to move forward carefully. Not every organisation or line of business has the same requirements or is at the same level of maturity.
Other steps, such as putting in place a persistent identifier scheme for all information assets and then building that up into a set of catalogues, could be more readily implemented across the board and would help get other metadata into place.

@davidread

Can we reduce the proposal from recommending the whole of schema.org to just their Dataset schema? schema.org includes schemas for everything from DrugStrength to Electrician to PublicToilet.

schema.org is a disrupter, a competitor to existing data standards in numerous fields - e.g. library cataloguing has for decades used standards like MARC, BibFrame and FRBR. If Open Standards adopts the whole of schema.org wholesale without considering the more established open standards in each field, that could bias government adoption towards schema.org in all those fields.

@davidread

On schema.org the primary reason this proposal recommends it is:

Web publishers who include schema.org markup generally tend to have a competitive SEO advantage over those who don’t so it makes sense to adopt Schema.org within government.

Perhaps the supporters can expand on this point? A quick search brought up this summary from Moz:

Whether structured data affects rankings has been the subject of much discussion and many experiments. As of yet, there is no conclusive evidence that this markup improves rankings. But there are some indications that search results with more extensive rich snippets (like those created using Schema) will have a better click-through rate. For best results, experiment with Schema markup to see how your audience responds to the resulting rich snippets.
https://moz.com/learn/seo/schema-structured-data

@davidread

Again on schema.org, a secondary reason for the recommendation is:

Describing contents of published tabular data resources in a consistent way will also allow the data to be easily catalogued, validated and reused.

schema.org's Dataset is a big, sprawling standard, with over 100 properties, covering a wide variety of use cases. I strongly dispute the idea that simply adopting it wholesale will bring consistency to government metadata. Fields are all optional - each publisher will choose their own. The sheer number of options rather defeats the idea of standardization.

For example, just reading the standard you'll see lots of options for specifying the publisher of the data - whilst data.gov.uk and gov.uk have a simple model of allowing one or more departments attached as the "publisher", and no other role, schema.org's Dataset can have a publisher, author, creator, producer, provider, publisherImprint, funder, sponsor, sdPublisher, sourceOrganization, contributor, copyrightHolder. The definitions are ambiguous and confusing so people will get them wrong. Some will feel the only thing needed is which org captured the data ("producer"?), some will only include the "publisher", and some will think the copyrightHolder is the only thing to record. Whilst a human can make sense of one record, you cannot compare or automatically process them in bulk. So that is not a standard.

To take another field as an example - "date". There is a huge variety of ways to express dates: contentReferenceTime, dateCreated, dateModified, datePublished, expires, sdDatePublished, temporal, temporalCoverage. Now if you're trying to search for data on a topic and want to filter by data collected in the past year, you've got a really tough problem writing that query for datasets that could express their date 8 different ways.

With so much complexity, organizations will assign these differently every time, and those trying to make sense across government will have an impossible task.

The only obvious alternative is DCAT, which is no better in this respect, indeed comes with more linked data extensibility/complexity, so I wouldn't favour that.

In truth I rather like the core of schema.org's Dataset schema. I assume the likes of data.gov.uk and gov.uk have defined a subset of 10ish properties that they read (out of the 100 in the standard) and the rest are disregarded. I would suggest that this proposal does a similar thing and specifies which fields are recommended (a "profile" of the standard), which the Open Standards board points to in recommending this standard.

I've been involved in user research where publishers ask questions like "if you release a monthly set of figures on a topic, should each month be recorded as a new Dataset or just a Distribution within the existing Dataset"? schema.org doesn't have a view (and DCAT is unclear, I believe). This sort of problem can be solved by best-practice examples / supplementary guidance, so I'd like to see this in the proposal too.

@davidread

@PeterParslow rightly points out that the substantial proportion of datasets covered by the INSPIRE law need to have GEMINI-format metadata. To add some more background (someone correct me if this has changed since I was last involved), GEMINI is a specialist format used mainly by those in the environmental research community. GEMINI is a standard for data with a spatial element to it (whereas schema.org is for all types of data). The UK's INSPIRE datasets' metadata are all harvested into data.gov.uk, where the GEMINI metadata is simply translated to schema.org Dataset format and exposed like that to give the benefits of search engine discoverability.

So we already have both metadata formats being published for these datasets, which makes total sense to me. I'm not convinced that Open Standards Board making a recommendation for schema.org's Dataset would confuse publishers of INSPIRE data, but this could be usefully clarified in accompanying guidance.

@PeterParslow

@davidread A reasonable summary - just a couple of points:

  • GEMINI is used across the whole range of geographic information, not just environmental research. About half the records in data.gov.uk are created by harvesting GEMINI records; I think there are over 300 UK organisations contributing them - including most local authorities.

  • many of these relate to INSPIRE, but not all. But then INSPIRE's scope is " any data with a direct or indirect reference to a specific location or geographical area;" (http://www.legislation.gov.uk/uksi/2009/3157/regulation/2) for which a public authority is responsible.

For the GEMINI community, I can say we're glad that data.gov.uk translates parts of the GEMINI record to schema.org. A current Geospatial Commission project is likely to recommend that we continue this way, with a few proposed improvements to the schema.org mapping.

In conversation with Rosalie Marshall of GDS, I've volunteered to draft some accompanying guidance for GEMINI authors. I'd like to include the mappings*, so they can 'see for themselves' that publishing their GEMINI records will satisfy this Schema.org proposal. That will be much easier with David's other suggestion of specifying which Schema.org elements are in mind. Rosalie suggested that would be in guidance. Even the starting point, https://schema.org/Dataset, has 110+ properties, many of which on a quick look seem quite irrelevant.

*I'll probably base it on https://github.com/geonetwork/core-geonetwork/wiki/JSON-LD---ISO19139-mapping-proposal - GeoNetwork is a very common tool for creating & managing GEMINI records, and already has an optional Schema.org output.

@PeterParslow

GEMINI comes with quite extensive guidance on things like "publisher" (which organisations to specify) - https://www.agi.org.uk/agi-groups/standards-committee/uk-gemini/40-gemini/1062-gemini-datasets-and-data-series#23

and "date" (which date?) - https://www.agi.org.uk/agi-groups/standards-committee/uk-gemini/40-gemini/1062-gemini-datasets-and-data-series#8

If these need to be improved, I can do that easily. If they need to change, that needs a bit more governance!

There's also a related publication on improving your metadata quality: https://www.agi.org.uk/about/resources/category/81-gemini?download=100:metadata-guidelines-part-3-april-2015

Feel free to borrow any of these; it's all CC-BY

@davidread

Thanks @PeterParslow. That sounds useful to clarify the mapping to schema.org.

Thanks also for the links to the GEMINI guidance on publisher and date. schema.org has a similar list of definitions. You talk about governance to get metadata publishers to improve, but my experience shows it is tough to do. In Europe and the UK there have been various incentives, metrics, outreach, bottom-up efforts etc. I think the best thing is to require metadata against a small standard, with strict validation on submission, with short and clear guidance and convenient online tools that help you meet the standard.

Complexity is the main enemy in this space. You should include in a standard just enough fields for dataset publishers to satisfy the key user needs of open data users. The main need is "discoverability" (e.g. search engines), and you can do most things with title, description and some stricter machine-readable fields - publisher, date, link to the data. It's a disservice to offer the metadata author 110 fields. The next most useful thing in dataset metadata is probably structural metadata (a data dictionary), which schema.org's Dataset doesn't cover. Structural metadata is more CSVW's domain; however, that suffers from the complexity issue too.

@PeterParslow

Thoroughly agree @davidread - getting metadata quality to improve is more about education than governance. And much easier if we start with a small set. GEMINI has 21 mandatory fields - the ones you mention plus keywords, a link to the licence and a link to the specification (we've found keywords are important to most searchers, and the spec helps people decide between 'hits').

https://www.agi.org.uk/40-gemini/1250-element-summary

My comment about governance was really about what happens if this discussion concludes that the 'kinds of date' need to be different, because GEMINI is built on an ISO standard (19115:2003 - itself an evolution from Dublin Core) and accompanying European (INSPIRE) guidance.

@pwin

pwin commented Dec 5, 2019

I know it is slightly off topic, but I think we'd be better off getting persistent identifiers in place for tables, and for some of the fields in tabular data that are used for linking, than getting a set of metadata in place. I'm presuming here that the main reasons for documenting these tables are to find them and to merge/aggregate the data they contain. The choice of metadata is quite a rabbit hole, with domains having their own issues and preferences. However, knowing that "this CSV dataset" contains the same data as "that XML dataset", or knowing that the identifiers used in the "Local Authority" column in one CSV are from the same set as the "council" identifiers in another, is going to be of much more practical use than which specific items from schema.org are used.

@davidread

On CSVW:

I totally agree with the proposal's aim to improve government data by using structural metadata. It is great to record, for each column, what the data type is, any data standards or code lists it follows, or whether it references another dataset. As the proposal says, by defining these things in structural metadata, you can automate both the checking of the data and the loading of the data into different data stores. And I'd go further - this support for common identifiers, standards and linking may well end up incentivising these key elements. We can all agree that if you have quality, inter-connected data then you have a great basis for quickly drawing valuable insights from your data.

Let me expand a bit on the user needs, to help us evaluate a structural metadata standard like this. I believe the key user needs are:

  • "validate" - validate that a CSV file doesn't have the wrong columns or bad data in them, so that:
    • you can automatically check published CSVs are usable, correctly reference items in other datasets and can be aggregated across multiple publishers
    • in an internal data pipeline, you can catch errors early by validating at every stage
  • "load" - load the CSV into a datastore (e.g. database, dataframe, data warehouse), and get the column types setup sensibly, and references to other tables marked, so that you can:
    • make meaningful queries - e.g. a query for all records from April 2017 to March 2018 is hard if the date is loaded as a string
    • visualize - e.g. if you've correctly interpreted lat/long then you can plot records on a map
    • combine/aggregate

CSVW Positives:

CSVW Negatives:

  • very low adoption - hardly any organizations publish datasets with CSVW annotations and there are few tools (none support it fully)
  • complex - high bar to understanding it, publishing it and consuming it. e.g. a column can have 19 different sorts of annotations. The author can choose from three different vocabularies for defining the descriptive metadata e.g. dc:description, dcat:description, schema:description
  • a key design criterion was the ability to convert CSV to RDF, to marry up with W3C's linked data vision, which adds complexity, rather than focussing on the central use cases of "validate" and "load"
  • existing attempts at publishing CSVW are of doubtful value, e.g. see this Twitter thread: https://twitter.com/ldodds/status/1195398689532940288

In comparison, I think Frictionless Data's Table Schema achieves 80% of the value with 20% of the complexity. It has built up a much stronger claim to having actually improved data flows, a stronger community and an array of tooling. The standard is developed in the usual open source way on GitHub, and the likes of the Open Data Institute have given it support in their csvlint tool (along with CSVW). So it's an open standard, with a more lightweight process than the formality of a W3C working group.

I think there is a serious question mark over CSVW because it has not gained much traction so far. Would UK government be happy to embrace it, with the risk that the rest of the data world doesn't? Government might find itself investing in tools and trying to make the tech work, only to find another standard becomes dominant, and the investment is wasted.

CSVW is also a big standard that hasn't been tested much and can't easily be iterated (there is a lot of overhead in a formal W3C working group: ensuring worldwide compatibility, seeking consensus, etc.). Big-bang, inflexible specs are another risk.

To mitigate these risks I'd suggest that ONS and/or other interested parties might:

  • define a "profile" of CSVW, suggesting a subset of the vocabulary we should use, based on a narrow set of user needs that add the most value, to concentrate efforts, build momentum and deliver value
  • publish some example metadata, to help people understand the standard and build momentum

and then come back for approval from the Open Standards Board with some evidence of the standard becoming established and delivering value.


@lisssek

lisssek commented Dec 6, 2019

Thank you @pwin, @PeterParslow and @timwis for this really useful discussion and for being our critical friends in this challenge, helping us shape these proposals.

Throughout the GDS discovery in considering the needs for metadata across government departments, it became clear that there is a vast gap in how, across government, we collect and publish metadata. Some organisations are advanced and ahead of others in adapting their standards, while others collect no descriptive metadata at all. The aim of recommending rather than mandating Dublin Core elements (which, as @PeterParslow pointed out, form the basis for the majority of metadata standards) is to help departments which do not publish any metadata to start with Dublin Core and eventually mature to more complex standards such as DCAT.

You are all right in saying that there are various application profiles where different use cases and requirements are captured, but at its core the Dublin Core "core elements" constitute the bare minimum. The same bare minimum is also found in schema.org. I completely agree with @davidread that schema.org is a vast and comprehensive standard, and I like your idea of just recommending schema.org's Dataset schema and recommending the use of the same persistent elements in Dublin Core. I think the next step in these proposals is to provide a comprehensive recommendation and examples as to how we envisage these standards being used, and GDS will be working with you and the community on publishing this guidance to GOV.UK.

I think we have also already pointed out in the proposals that all of these standards are a starting point on a maturity framework, and we would like to help departments progress along that framework. @PeterParslow, you made an excellent point that improving metadata quality is more about education than governance. These proposals are the first step in that, and will be accompanied by comprehensive engagement, training and workshops across departments to help them improve the quality of metadata they publish, with input from the Government Data Architect community.

Also, @pwin, you made an excellent point on persistent identifiers. Without them, no matter what metadata standards we use, we have the problem of not being able to link or refer to data and metadata. The Open Data Board has recommended a standard for it, and I think the COGS team at ONS is doing some work in this space as well. This issue will be discussed at the Board meeting on Monday.

And @PeterParslow, we have redrafted the proposal on Dublin Core to confine the remit and reference GEMINI and INSPIRE, as you suggested during a call with Rosalie. The guidance that will be published on GOV.UK will also reflect exactly what is and isn't in scope when it comes to geospatial data and following Dublin Core; essentially, these different standards can be aligned so there is clarity on which metadata standard to follow when.

@frankieroberto

In our discovery on this, we concluded that neither CSVW nor Frictionless Data was really widely adopted enough to be particularly useful. Frictionless Data has the better tooling, though, and is a lot simpler to understand and implement.

In the short-term, I suspect that recommending either isn't helpful. Right now, probably the most useful thing you can do is to publish CSV in a consistent way, and then have some really good HTML pages documenting what all the columns mean in clear, understandable language. Not machine-readable, but at least human-readable.

Longer term, it might be the case that different de facto standards for CSV metadata emerge for different domains (eg publishing statistics vs financial transparency data).

@davidread

@frankieroberto I really appreciate the user research insights. Several have said that now is not the time to be agreeing or recommending a particular standard for structural metadata for open data.

However, I think many of us here are keen on the potential benefits in this area (and we're slowly working out how to express the vision in an understandable way!). So let's use the energy we collectively have in government to do lots of 'alpha'-stage work in this space. We can try different approaches and standards, each time trying to cultivate a little ecosystem of publishers and consumers that benefit from the metadata. And once we find a successful formula, only then do we start shouting about it, firm up the standard, and scale up.

There are some great practical ideas and growing consensus in this thread of where we can start:

  • working with a small standard and iterating it (great to hear @pwin that W3C are moving towards supporting that way of working!)
  • creating simple guidance, examples and even training (amazing @lisssek!)
  • community - it's super that @gheye and Rosalie are driving our community to discuss these things, build the goodwill needed and drive it forward together

The highest profile work on this is probably the ONS COGS project in the stats community of UK government. I'm really interested to see what can be achieved here - the people are distributed but closely networked, they include both producers and consumers of data & stats, and they have plenty of technical clout - it seems like a really fertile opportunity.

@Lawrence-G

Today, the Open Standards Board accepted these proposals as recommended standards, with a few conditions that will be added to the profiles when published on GOV.UK.

Thank you, everyone, who contributed to this challenge.

@bdsharmaco

Hi everyone, I have recently applied schema.org Dataset markup to NHS Digital's publications on https://digital.nhs.uk. There are around 2,000 now flowing, and in theory visible in Google's Dataset Search tool. I would be open to discussing how we can consistently apply schema.org, Dublin Core and any recommended standard to the data, so it is embedded in the HTML web content we publish. Any feedback is welcome, along with improvement suggestions.
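
For others wanting to do something similar, the usual mechanism is a JSON-LD script block embedded in the page. A minimal sketch (the values and URL path are invented, not NHS Digital's actual markup):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example statistics publication",
  "description": "An illustrative dataset description embedded in an HTML page.",
  "url": "https://digital.nhs.uk/data-and-information/example-publication"
}
</script>
```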

