
Schemas for Tabular Data Challenge #40

Open
Lawrence-G opened this issue Mar 28, 2017 · 20 comments

Comments

@Lawrence-G commented Mar 28, 2017

Schemas for Tabular Data

Category

  • Data
  • Document
  • Technical
  • Other Suggestions

Suggested by

Originally Submitted by pwalsh on Mon, 13/03/2017 on standards.data.gov.uk

Short Description

Much data published by governments is in common tabular data formats: CSV, Excel, and ODS. This is true for the UK government and governments around the world. To provide assurances around reusability of tabular data, consumers (users) need information on the "primitive types" for each column of data (example: is it a number? is it a date?). This also allows for quality checks to ensure consistency and integrity of the data.

Publishing Table Schema with tabular data sources provides this information. Table Schema has previously been used in work by Open Knowledge International (OKI) with the Cabinet Office to check the validity of 25K fiscal data files against publication guidelines. Table Schema is also used widely by other organisations working with public data, such as the Open Data Institute (ODI).
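To make the "primitive types" idea concrete, here is a minimal sketch of a Table Schema-style check in Python. The field names, sample row, and the simple type-to-cast mapping are all illustrative assumptions, not taken from this thread:

```python
import csv
import io
from datetime import date

# A minimal Table Schema-style descriptor (field names are illustrative).
schema = {
    "fields": [
        {"name": "region", "type": "string"},
        {"name": "spend", "type": "number"},
        {"name": "period", "type": "date"},
    ]
}

# How each declared primitive type is checked, by attempting a cast.
CASTS = {"string": str, "number": float, "date": date.fromisoformat}

def check_row(row, schema):
    """Return True if every cell casts cleanly to its declared type."""
    for field in schema["fields"]:
        try:
            CASTS[field["type"]](row[field["name"]])
        except ValueError:
            return False
    return True

# Hypothetical spend data as it might arrive in a published CSV.
raw = "region,spend,period\nNorth East,1200.50,2017-03-01\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print(all(check_row(r, schema) for r in rows))  # True
```

A real implementation would use a library such as frictionless-py rather than hand-rolled casts, but the principle is the same: the schema travels with the data and makes quality checks mechanical.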

User Need

I've written several user stories below. Each user story applies equally to a range of users. The user personas are as follows:

  • Developer: a user reusing public data in derived databases, visualisations, or data processing pipelines.
  • Business analyst: a user looking to public data as a source of information for analysis of business use cases that involve some component of public good.
  • Citizen: a non-technical user who expects government to publish consistent, high quality data.

User stories

As a user, I want all public data published by government to conform to a known schema, so I can use this information to validate the data.

As a user, I want public data published by government to have a schema, so I can read the schema and understand at a glance the type of information in the data, and the possibilities for reuse.

Expected Benefits

  • Vastly increased reuse of public data.
  • Increased trust in publication flows, generated by those flows producing quality data.

Functional Needs

The functional needs that the proposal must address.

@philarcher commented Apr 18, 2017

W3C has a formal Recommendation in this space: Metadata Vocabulary for Tabular Data, which encodes the Model for Tabular Data and Metadata on the Web. The work was originally inspired and subsequently led by Jeni T, with Dan Brickley as co-chair. Jeremy Tandy (Met Office) was a key WG member too.

The WG looked at tabular data in the wild (all its use cases come with real examples) and handled awkward realities like multiple lines of headings, right-to-left tables and non-ASCII characters. It uses the Web to link the metadata definitions that, like Table Schema, are defined in JSON. The metadata can be embedded in the CSV/TSV, placed at a well-known location, or linked from the CSV directly. You can override a defined metadata file with one of your own. All this, of course, makes the metadata definitions reusable - handy for regularly published datasets, for example. The combination of CSV/TSV and the metadata file means that you can use the data directly or transform it programmatically into JSON or RDF. The standard supports URI templating out of the box, but it also has extension points for extra rules, so you can use it as the basis for something like OpenRefine that would, for example, transform dates into a standard format, or normalise either of UK|United Kingdom into a regular form.
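As a rough illustration of the CSVW approach described above, a minimal metadata document might look like the following, expressed here as a Python dict. The file URL, column names and date format are hypothetical:

```python
import json

# A minimal CSVW-style metadata document for a hypothetical payments.csv.
# The file URL, column names and the date format are illustrative only.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "payments.csv",
    "tableSchema": {
        "columns": [
            {"name": "department", "titles": "Department", "datatype": "string"},
            {"name": "amount", "titles": "Amount", "datatype": "decimal"},
            {
                "name": "payment_date",
                "titles": "Payment date",
                "datatype": {"base": "date", "format": "yyyy-MM-dd"},
            },
        ]
    },
}

# The metadata would normally live in a payments.csv-metadata.json file
# alongside the CSV it describes.
print(json.dumps(metadata, indent=2))
```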

The potential here is that a set of metadata files can be defined and maintained centrally with tooling to validate a published CSV/TSV file with reduced effort to create visualisations etc. It gets away from the notion of packages that are made available for download and local processing (what I call using the Web as a glorified USB stick), and makes linking across multiple datasets and to things like the registries much easier. If you like, it's 5 star data in CSV.

@edent commented May 19, 2017

Also - CSV on the Web https://w3c.github.io/csvw/primer/

@Lawrence-G Lawrence-G added Challenge and removed Suggestion labels May 19, 2017

@davidread davidread referenced this issue Apr 17, 2018
@DavidUnderdown commented Sep 5, 2018

The National Archives also developed a CSV Schema Language and validator (http://digital-preservation.github.io/csv-schema). Currently, departments are expected to supply a CSV file meeting a supplied schema when transferring digital records to The National Archives, to meet their Public Records Act obligations.

@frankieroberto commented Feb 8, 2019

There's also Frictionless Data Packages, which apparently has some traction (e.g. it is supported by the Open Data Institute and Google.org).
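For comparison, a minimal Frictionless tabular data package descriptor (a datapackage.json) wraps a Table Schema inside a resource entry. The package name, file path and field names below are illustrative:

```python
import json

# A minimal Frictionless tabular data package descriptor. The package name,
# file path and field names are illustrative, not taken from the thread.
datapackage = {
    "name": "example-spend-data",
    "profile": "tabular-data-package",
    "resources": [
        {
            "name": "spend",
            "path": "spend.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "department", "type": "string"},
                    {"name": "amount", "type": "number"},
                ],
            },
        }
    ],
}

# This would normally be saved as datapackage.json next to spend.csv.
print(json.dumps(datapackage, indent=2))
```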

@gheye commented Jun 21, 2019

Data Description Language

This comment is a proposal on defining descriptive language for data within government. We’re asking for your feedback so we can develop and write this guidance based on advice from the government data community.

We have published blog posts on this topic:

  1. Excel Spreadsheets, which outlines how a simple data description language can assist in sharing data, and the reality that spreadsheets are not disappearing. https://dataingovernment.blog.gov.uk/2019/06/10/improving-how-we-manage-spreadsheet-data/

  2. Why we need Data Standards (due for publication soon) which is an introduction to data standards and some of the proposals.

The aim is to have a common way of describing files, ‘a data description language’, which organisations can use across all file types and formats.

There are several priorities for the data description language. We need to make sure that:

  • it is easy to use

  • the process for defining the language is community driven, by using tools such as Slack and GitHub

  • the language is useful and adds value for the community

  • we can explain the basics of the language in one page that we can publish with the guidance

  • We also need to make sure that the data description language follows existing standards wherever possible. This includes:

    • aligning with CSV on the web - TAG:Text

    • CSV RFC 4180 - promoting the use of double quotes around key items of text

    • verifying new tags against the Dublin Core to validate whether the suggested tags currently exist or if we should create them. Suggestions on tagging frameworks to be used are much appreciated.

We will produce a number of documents alongside this proposal including:

  • The one-page description of the tags, which will largely follow the format of the proposal below

  • A more detailed specification of the language - this will follow the language of the previously published API documentation. This is currently being worked on.

Proposed Tags
Please note that these tags can be provided in a separate file for a CSV file, or on a page within a spreadsheet. Each item will exist on a separate line.

Links have been added in the second column for the items that come from Dublin Core. An 'Area' column and a proposed 'Compulsory' column have been added. The last item, parse-to-end-file, has been removed. Two reference data items, register-collection and register-standard, have been removed. declare-datasheet has been added so that the sheet the data sits on can be defined. The 'standard' item has been removed, and standard-url has been renamed to conformsTo.

| Area | Property (tag) | Example | Comment | Compulsory |
| --- | --- | --- | --- | --- |
| Core | creator | creator:"Russell Singh russell.singh@digital.cabinet-office.gov.uk"<br>creator:"Indira Singh, Sue Chan, Gregory Pie" | Dublin Core. This can be a comma-separated list of creators. | |
| Core | contributor | contributor:"justine.gornall@company.co.uk" | Dublin Core. This can be a comma-separated list of contributors; see above for multiple contributors. | |
| Core | title | title:"GDS Employees" | Dublin Core. | |
| Core | created | created:2002-10-02 | Dublin Core. The format of this date, and whether double quotes are used, needs to be agreed. | |
| Core | identifier | identifier:"0000015_GDS_SDA_XLS" | Dublin Core. | |
| Core | description | description:"All heights at GDS" | Dublin Core. | |
| Core | valid | valid:"2012-2013" | Aligns with Dublin Core. Refers to the valid date range. What range of values would be acceptable here? Suggested by DWP. | |
| Core | replaces | replaces:"GDS Employees V1" | Aligns with Dublin Core. The document or item that this replaces. What range of values would be acceptable here? Suggested by DWP. | |
| Core | license | license:"https://opensource.org/licenses/MIT" | Aligns with Dublin Core. Proposed by ONS. The license that applies to the document. | |
| XLS\ODF | declare-header | declare-header:"A1:A2"<br>declare-header:"Sheet2!A1:A2" | For spreadsheet data. | |
| XLS\ODF | declare-datasheet | declare-datasheet:"MyDataSheet" | For spreadsheet data. | |
| XLS\ODF\CSV | column-type | column-type:"ColumnName:String"<br>column-type:"Country:String"<br>column-type:"Age:Number" | For spreadsheet data. This item can be repeated for each column in the data set. Amended following feedback from @davidread; it now aligns much more closely with CSVW. | |
| XLS\ODF | declare-data | declare-data:"Sheet2!A4:D8"<br>declare-data:"A4:D8" | For spreadsheet data. | |
| Core | format | format:"xls" | To allow future expansion of the data description language. Aligns with Dublin Core. The assumption is that double quotes are not required. | |
| CSV | file-delimiter | file-delimiter:"," | To be used if another file delimiter is being used. | |
| Proposal | fileformat-puid | fileformat-puid:"fmt/62" | Proposed by The National Archives. There is a REST API for obtaining information on the PUID (PRONOM Unique Identifier): http://www.nationalarchives.gov.uk/PRONOM/fmt/62 | |
| Proposal | fileformat-creating-application | fileformat-creating-application:"Excel 1997" | Proposed by The National Archives. Should this be shortened to creating-app? | |
| Proposal | standard-comment | standard-comment:"RFC4180" | One comment tag may be used. | |
| Core | conformsTo | conformsTo:"https://tools.ietf.org/html/rfc4180" | The standard the file must conform to. Changed following feedback from @pwin. | |
| Core | doc-sensitivity | | A future proposal for documents that should have a sensitivity marking applied. Should this align with the Dublin Core tag accessRights? What would be an acceptable range of values? | |
| Proposal | register-column | register-column:"Address" | The column from the dataset that should have reference data applied to it. The term used in GDS for reference data is "register". Should this allow a series of columns? | |
| Proposal | register-url | register-url:"https://www.registers.service.gov.uk/registers/ddat-profession-capability-framework" | The URL of the specific register that this applies to. This may be machine readable or human readable depending on the usage. | |
| CSV | top-row-header | top-row-header:true | To aid processing. True or false. | |
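A rough sketch of how the proposed one-tag-per-line syntax could be parsed. The grammar here (split on the first colon, optional surrounding double quotes) is my assumption, since the proposal leaves these rules open; repeated tags such as column-type would need list handling in a fuller implementation:

```python
import re

# One `property:"value"` tag per line, as in the table above. The exact
# grammar is an assumption: split on the first colon, with optional
# surrounding double quotes.
TAG_RE = re.compile(r'^\s*([\w-]+):\s*"?(.*?)"?\s*$')

def parse_tags(text):
    """Parse tag lines into a property -> value dict."""
    tags = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        match = TAG_RE.match(line)
        if match:
            tags[match.group(1)] = match.group(2)
    return tags

example = '''
title:"GDS Employees"
created:2002-10-02
column-type:"Age:Number"
top-row-header:true
'''

tags = parse_tags(example)
print(tags["title"])           # GDS Employees
print(tags["column-type"])     # Age:Number
print(tags["top-row-header"])  # true
```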
@pwin commented Jul 1, 2019

  • I don't think there is much that is special about government that requires it to have something particular to itself, though it is an authority for specific data, and is also a big player in the market and consequently has influence. The latter can be a problem for society generally if government decides to go the 'wrong way' - just read Programmed Inequality
  • Many of the tags above are related to the concept of the resource and less to the specific distribution. I think that it is important to separate these so that we have an approach to creating catalogues that is more logical. DCAT v2 mentions this:

an important distinction between a dataset as an abstract idea and a distribution as a manifestation of the dataset.

Most of the other bits that you mention above @gheye are schema or processing instructions and I think it is important to separate these. Also, there might be rules of different types within the constraints for validation, or within the instructions for processing. So I think that these aspects need to be handled as part of an interoperability metamodel - see EIRA as an example. But if you think that this over-complicates things then either slim down your proposal so that you're not trying to replicate the many illustrations above, or else join in with the EC activity.

@gheye commented Jul 1, 2019

Hi @pwin

Thanks for your comments.

I am coming up to see you and we can discuss this in more detail.

We do need to align with international standards and work with them, but in a dispersed organisation such as the UK government we need to be flexible and, in some cases, more lightweight.

We can discuss more when I come up to Glasgow.

Best Wishes,

gheye

@davidread commented Jul 1, 2019

Along with the existing user stories of documenting and validating the data, I'd like to suggest another:

  • As a data scientist/engineer, I want to load the data into a data store, so that I can do queries/analysis

This is helped by having a schema. I'll just explain the situation we have at MOJ: we started a simple internal data catalogue, with each data table described as metadata including a schema describing the column properties. Whilst you can load a CSV into a data store and let it auto-detect column types, or load a Parquet file into a data frame, auto-detection often makes mistakes, for example converting numbers to dates, treating dates as text, dropping the leading 0 in telephone numbers, interpreting nulls as strings, or choosing int16 when int64 will be needed in future. So to make this more reliable, colleagues have written some little tools to convert the existing schema to a number of related schema formats, suited to various data stores, for example:

Ideally all these things would accept a schema in the same format. Pandas and Spark accept it programmatically. Glue and Big Query express it in slightly different JSON formats. 🤷‍♂

But if we accept that we need converter tools for this use case, I don't think this user story imposes anything extra on the schema format - just defining the name and type of each column would cover it. The set of allowable column types is probably worth discussing - I guess the SQL types are a reasonable start. See our conversion table between pandas, Spark and AWS Glue.
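A minimal sketch of the kind of converter described above: mapping one schema's column types onto the type names of several target systems. The mappings shown are illustrative simplifications, not MOJ's actual conversion table:

```python
# Illustrative mapping from generic schema types to the type names used by
# three target systems. These are simplifications, not MOJ's actual table.
TYPE_MAP = {
    # source     pandas            Spark          Glue
    "string":  ("object",          "StringType",  "string"),
    "number":  ("float64",         "DoubleType",  "double"),
    "integer": ("Int64",           "LongType",    "bigint"),
    "date":    ("datetime64[ns]",  "DateType",    "date"),
    "boolean": ("bool",            "BooleanType", "boolean"),
}

TARGETS = {"pandas": 0, "spark": 1, "glue": 2}

def convert(schema, target):
    """Render a schema's columns as {name: type} for the chosen target."""
    idx = TARGETS[target]
    return {col["name"]: TYPE_MAP[col["type"]][idx]
            for col in schema["fields"]}

# A telephone-number column declared as string keeps its leading zeros.
schema = {"fields": [{"name": "phone", "type": "string"},
                     {"name": "count", "type": "integer"}]}

print(convert(schema, "pandas"))  # {'phone': 'object', 'count': 'Int64'}
print(convert(schema, "glue"))    # {'phone': 'string', 'count': 'bigint'}
```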

It would be great to hear if this use case is common or not, probably from others establishing data warehouse functionality, with data catalogues and ETL needing schemas.

@davidread commented Jul 1, 2019

To summarize, these are the suggested options to meet the challenge:

To advance this discussion, perhaps we should compare the suitability of these in meeting the user needs (for developers, business analysts and citizens), identified by @pwalsh in the challenge.

@timwis commented Jul 2, 2019

Hm, I would suggest that the "Data Description Language" that @gheye proposes is rather distinct from how we describe the fields/schema of the data, and should warrant its own conversation/thread -- particularly since there are a lot of existing standards out there around metadata worth discussing, and they don't necessarily include field-level information.

If we prefer to treat it as a single standard, I would re-emphasise tabular data packages as an existing standard that prescribes the lowest common denominator of fields and allows for extensions, and also addresses field-level information via table schema.

@gheye commented Jul 4, 2019

Hi All,

My view is that this does not preclude DCAT 2 or CSV on the Web, and can be viewed as a stepping stone.

At the lowest level, for departments using Excel or CSV that are not in a position to move to DCAT or CSV on the Web, can we provide a list of common tags to be used across government? These tags should allow a user to migrate more easily to a larger, more comprehensive framework.

Can they not also be used in CSV on the Web, and form the basis of a DCAT design, especially since they align as closely as possible with international standards? Additional tags will only be created where they do not exist in one of the standards.

The ask is therefore simple and quite basic: can we agree a common set of tags to be used across government? Once we have done this, we can create recommendations and migration strategies to CSV on the Web and DCAT 2, or to an international standard that we agree on.

@gheye commented Jul 9, 2019

Hi,

I have amended the column definition above to align more closely with CSVW.

Gareth H.

@gheye commented Jul 15, 2019

Following feedback from NHS Digital and GDS, I have updated the table to include:

  1. A proposed compulsory column

  2. An 'Area' that the item applies to. For want of a better term, all those created from requests that are not CSV, spreadsheet, or core have been put under 'Proposal'.

@gheye commented Jul 23, 2019

Note: a suggestion worth discussing, from @pwin, is whether we should have a mediatype tag.

@gheye commented Jul 31, 2019

I received excellent and extensive feedback from ONS at the individual item level. Most of the changes have been added to the table above.

@gheye commented Aug 7, 2019

There are three items that NHS Digital would like to add to this list, from schema.org:

schema:datePublished
schema:spatialCoverage
schema:temporalCoverage

Please confirm whether you agree or disagree with these items.

@frankieroberto commented Aug 7, 2019

@gheye I'm a bit confused. Are you proposing a new "Data Description Language" standard? My understanding was that the adopted open standards could only be existing standards, not newly-created ones? Or does your proposal build on top of CSV on the web or Frictionless Table Schema?

@gheye commented Aug 7, 2019

Hi @frankieroberto

We are not proposing a new data description language. If you look at the individual items in the list above almost all of them come from existing standards.

The only new items we are proposing are those that fill a need raised in government where a tag does not currently exist. If you can find an existing tag that covers any of the additional tags above, please point us towards it.

As mentioned above, this should be viewed as a stepping stone to something such as CSV on the Web. In fact, the syntax has been aligned with it as far as possible.

This is a proposal for an initial set of tags that the UK government could begin to adopt before moving to something larger. It is a stepping stone.

Regards.

@MikeThacker1 commented Aug 7, 2019

Some points I raised in a chat with @gheye that he asked me to add here:

@pwin commented Aug 7, 2019

w3c/dxwg#868 from the DXWG issues is relevant to the discussion about accreted datasets etc. So is the discussion within DXWG on qualified relations
