Schemas for Tabular Data Challenge #40
W3C has a formal Recommendation in this space: Metadata Vocabulary for Tabular Data, which encodes the Model for Tabular Data and Metadata on the Web. The work was originally inspired and subsequently led by Jeni T, with Dan Brickley as co-chair; Jeremy Tandy (Met Office) was a key WG member too. The potential here is that a set of metadata files can be defined and maintained centrally, with tooling to validate a published CSV/TSV file and reduced effort to create visualisations etc. It gets away from the notion of packages that are made available for download and local processing (what I call using the Web as a glorified USB stick), and makes linking across multiple datasets and to things like the registries much easier. If you like, it's 5-star data in CSV.
Also - CSV on the Web: https://w3c.github.io/csvw/primer/
The National Archives also developed a CSV Schema Language and validator (http://digital-preservation.github.io/csv-schema). Currently, departments are expected to supply a CSV file meeting a supplied schema when transferring digital records to The National Archives to meet their Public Records Act obligations.
There's also Frictionless Data Packages, which apparently has some traction (e.g. it is supported by the Open Data Institute and Google.org).
Data Description Language This comment is a proposal on defining a descriptive language for data within government. We're asking for your feedback so we can develop and write this guidance based on advice from the government data community. We have published blog posts on this topic:
The aim is to have a common way of describing files, ‘a data description language’, which organisations can use across all file types and formats. There are several priorities for the data description language. We need to make sure that:
We will produce a number of documents alongside this proposal including:
Proposed Tags Links have been added in the second column for those which come from Dublin Core. An area column has been added, along with a proposed compulsory column. The last item, parse-to-end-file, has been removed. Two items for reference data have been removed: register-collection and register-standard. Added declare-datasheet so the sheet the data sits on can be defined. Removed one standard item, 'standard', and changed standard-url to conformsTo.
Most of the other bits that you mention above, @gheye, are schema or processing instructions, and I think it is important to separate these. Also, there might be rules of different types within the constraints for validation, or within the instructions for processing. So I think that these aspects need to be handled as part of an interoperability metamodel - see EIRA as an example. But if you think that this over-complicates things, then either slim down your proposal so that you're not trying to replicate the many illustrations above, or else join in with the EC activity.
Hi @pwin, thanks for your comments. I am coming up to see you and we can discuss this in more detail. We do need to align with international standards and work with them, but in a dispersed organisation such as the UK government we need to be flexible and, in some cases, more lightweight. We can discuss more when I come up to Glasgow. Best wishes, gheye
Along with the existing user stories of documenting and validating the data, I'd like to suggest another:
This is helped by having a schema. I'll just explain the situation we have at MOJ: we started a simple internal data catalogue, with each data table described as metadata including a schema describing the column properties. Whilst you can load a CSV into a data store and let it auto-detect column types, or load a Parquet file into a data frame and let it infer them, auto-detection often makes mistakes - for example converting numbers to dates, treating dates as text, dropping the leading 0 in telephone numbers, interpreting nulls as strings, choosing int16 when int64 will be needed in future, etc. So to make this more reliable, colleagues have written some little tools to convert the existing schema to a number of related schema formats, suited to various data stores, for example:
Ideally all these things would accept a schema in the same format. Pandas and Spark accept it programmatically; Glue and BigQuery express it in slightly different JSON formats. 🤷♂ But if we accept we need converter tools for this use case, I don't think this user story imposes anything extra on the schema format - just defining the name and type of each column would cover it. The set of allowable column types is probably worth discussing - I guess the SQL types are a reasonable start. See our conversion table between pandas, Spark and AWS Glue. It would be great to hear whether this use case is common, probably from others establishing data warehouse functionality, with data catalogues and ETL needing schemas.
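To illustrate the converter-tool idea, here is a minimal sketch of turning one shared column schema into both a pandas dtype dict and a Glue-style column list. The schema format, type names and mappings are illustrative assumptions, not the actual MOJ tooling:

```python
# Hypothetical shared schema: just column names and types.
SCHEMA = {
    "name": "prisons",
    "columns": [
        {"name": "prison_id", "type": "character"},
        {"name": "capacity", "type": "int"},
        {"name": "opened", "type": "date"},
    ],
}

# Illustrative mapping from the shared type names to target systems.
TYPE_MAP = {
    "character": {"pandas": "string", "glue": "string"},
    "int": {"pandas": "Int64", "glue": "bigint"},  # nullable Int64 avoids float coercion
    "date": {"pandas": "object", "glue": "date"},  # parse dates separately in pandas
}

def to_pandas_dtypes(schema):
    """Dtype dict suitable for pandas.read_csv(dtype=...)."""
    return {c["name"]: TYPE_MAP[c["type"]]["pandas"] for c in schema["columns"]}

def to_glue_columns(schema):
    """Column list in the shape AWS Glue's table definitions expect."""
    return [{"Name": c["name"], "Type": TYPE_MAP[c["type"]]["glue"]}
            for c in schema["columns"]]

print(to_pandas_dtypes(SCHEMA))
print(to_glue_columns(SCHEMA))
```

As noted above, the converters are trivial once the schema pins down name and type; all the real work is in agreeing the shared type vocabulary.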
To summarize, these are the suggested options to meet the challenge:
To advance this discussion, perhaps we should compare the suitability of these in meeting the user needs (for developers, business analysts and citizens) identified by @pwalsh in the challenge.
Hm, I would suggest that the "Data Description Language" that @gheye proposes is rather distinct from how we describe the fields/schema of the data, and should warrant its own conversation/thread -- particularly since there are a lot of existing standards out there around metadata worth discussing, and they don't necessarily include field-level information. If we prefer to treat it as a single standard, I would re-emphasise tabular data packages as an existing standard that prescribes the lowest common denominator of fields and allows for extensions, and also addresses field-level information via table schema.
Hi all, my view is that this does not preclude DCAT 2 or CSV on the Web, and it can be viewed as a stepping stone. At the lowest level, for a department using Excel or CSV that is not in a position to move to DCAT or CSV on the Web, can we provide a list of common tags that are used across government? These tags should allow a user to migrate more easily to a larger, more comprehensive framework. Can they not also be used in CSV on the Web, and form the basis of a DCAT design, especially since they align as closely as possible to international standards? Additional tags will only be created where they do not exist in one of the standards. The ask is therefore simple and quite basic: can we agree a common set of tags to be used across government? After we have done this, we would create recommendations and migration strategies to CSV on the Web and DCAT 2, or an international standard that we agree on.
Hi, I have amended the column definition above to align more closely with CSVW. Gareth H.
Following feedback from NHS Digital and GDS, I have updated the table to include:
One suggestion worth discussing, from @pwin, is whether we include mediatype. A list of possible media types is shown here: https://www.iana.org/assignments/media-types/media-types.xhtml
I received excellent and extensive feedback from ONS at the individual item level. Most of the changes have been added to the table above. |
There are three items that NHS Digital would like to add to this list from schema.org: schema:datePublished. Please confirm whether you agree or disagree with these items.
@gheye I'm a bit confused. Are you proposing a new "Data Description Language" standard? My understanding was that the adopted open standards could only be existing standards, not newly-created ones? Or does your proposal build on top of CSV on the web or Frictionless Table Schema? |
We are not proposing a new data description language. If you look at the individual items in the list above, almost all of them come from existing standards. The only new items we are proposing are those that fill a need raised in government where a tag does not currently exist. If you can find an existing tag that covers any of the additional tags above, then please point us towards it. As mentioned above, this should be viewed as a stepping stone to something such as CSV on the Web; in fact, the syntax has been aligned with it as far as possible. This is a proposal for an initial set of tags that the UK government could begin to adopt before moving to something larger. It is a stepping stone.
Some points I raised in a chat with @gheye that he asked me to add here:
w3c/dxwg#868 from the DXWG issues is relevant to the discussion about accreted datasets etc. So is the discussion within DXWG on qualified relations.
There are two kinds of standards being discussed here, without always being clear on which:
Several of the proposed standards may have the same mix; Table Schema and CSV on the Web only do the latter - they describe how to define the CSV structure, not what metadata is expected to accompany the schema definition. (They are pretty much the core of all metadata standards. GEMINI distinguishes 'metadata about the metadata' and 'metadata about the dataset' - but I don't think many people like the two terms!) Second point: the DCLG-sponsored Brownfield Land Register schema was developed by iStandUK using the Table Schema approach (in order to exploit the ODI's CSVLint tool). This is in use by many (all?) English planning authorities, placing data records on data.gov.uk.
On Peter's second point, the conformsTo tag should reference the relevant schema, in this case the one from iStandUK. Both CSVLint and an LGA CSV validator tool validate against the schema. data.gov.uk used to allow you to identify all datasets that conform to a given schema, but they have now dropped that functionality. If GDS uses conformsTo, we should be able to revive (somewhere) discovery of datasets that conform to a given schema. This is needed for joining datasets from many publishers in local government.
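The discovery that was dropped is conceptually simple: if every catalogue record carries a conformsTo tag, finding all datasets against one schema is a filter. A minimal sketch, with invented catalogue entries and a hypothetical schema URL:

```python
# Invented catalogue records; in practice these would come from a
# harvester such as data.gov.uk. The schema URL is hypothetical.
CATALOGUE = [
    {"title": "Brownfield land: Leeds", "conformsTo": "https://example.org/brownfield-schema"},
    {"title": "Brownfield land: Bristol", "conformsTo": "https://example.org/brownfield-schema"},
    {"title": "Spend over £25k", "conformsTo": "https://example.org/spend-schema"},
]

def conforming(catalogue, schema_url):
    """Titles of all datasets declaring conformance to schema_url."""
    return [d["title"] for d in catalogue if d.get("conformsTo") == schema_url]

print(conforming(CATALOGUE, "https://example.org/brownfield-schema"))
```

The point is that a consistent conformsTo tag makes cross-publisher joins a query rather than a research project.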
The following proposals for this challenge (Dublin Core, schema.org, CSVW) are the result of the GDS data standards workshops held over the last couple of months, and of comments and suggestions made on GitHub.
It was agreed to post proposals for the different standards referenced in the language as separate recommendations. For this purpose, GDS worked with ONS on the proposals for Dublin Core, schema.org and CSV on the Web. I'm posting on behalf of the team who put the proposal together.
Proposal: Recommend Dublin Core to describe data shared privately
Introduction
This proposal is to recommend that government departments use the Dublin Core schema as a minimum set of metadata to be associated with data they are sharing privately. For tabular data being published, we have a separate proposal with the Open Standards Board to use schema.org. Using the Dublin Core schema to describe tabular data that is shared privately means individuals, teams and departments would be consistent in the way they describe the contents of their tabular data resources, especially in departments where no metadata is currently collected or published, allowing the data to be easily catalogued, validated and reused, as well as to be more findable. It is the first step in achieving metadata maturity across government departments and reaching a common core set of metadata associated with a dataset. This proposal is based on the idea that the metadata elements associated with Dublin Core represent the core set of metadata elements conserved across government. Since Dublin Core sets a foundation for many more complicated standards such as DCAT (which is a recommended standard at the higher end of the metadata maturity spectrum), it ensures that the same elements can be captured when a more complicated metadata standard is implemented, without the need for complex translation between standards. Please note that this proposal is based on two assumptions: This proposal builds on the Open Standards Board recommendation of the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.
User need approach
The user need identified by this proposal is to maximise consistency in, and give context to, the tabular data being shared across government. Adopting the Dublin Core schema should help to achieve the first step in sharing data with associated metadata, to ensure trust and increase confidence in the handling of data. Users in the context of this proposal are government workers who create, share and maintain tabular data, and need to be able to validate it. Individual users include, but are not limited to, data scientists, business analysts, people who need to use a spreadsheet application to do basic analysis, and developers who process data in a range of software.
Achieving the expected benefits
If departments use the Dublin Core schema, they would be consistent in the way they describe their tabular data. The Dublin Core Metadata Element Set is one of the simplest and most widely used metadata schemas, since it represents a minimum set of 15 metadata elements conserved across metadata standards. For example, in ONS, all of the Dublin Core metadata elements are conserved in a much more complex metadata model which captures the metadata needs of a statistical organisation. Dublin Core comprises 15 "core" metadata elements, whereas the "qualified" Dublin Core set includes additional metadata elements to provide greater specificity and granularity. If Dublin Core is adopted as an open standard for government, the Government Digital Service will produce guidance on how these elements should be used. GDS will advocate for the 15 core metadata elements to be conserved across government departments. Using the Dublin Core standard will mean government users have an improved idea of what kinds of information should be recorded, where and how.
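For reference, the 15 elements of the Dublin Core Metadata Element Set are easy to enumerate, and checking a record against them is a one-liner. A minimal sketch; the example record and its values are invented:

```python
# The 15 terms of the Dublin Core Metadata Element Set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

def missing_elements(record):
    """Dublin Core elements absent (or empty) in a metadata record."""
    return [e for e in DC_ELEMENTS if not record.get(e)]

# Invented example of a partially completed record.
record = {
    "title": "Monthly prison population",
    "creator": "Ministry of Justice",
    "date": "2019-06-01",
    "format": "text/csv",
    "language": "en",
}
print(missing_elements(record))
```

A check like this is about all the tooling a department needs to adopt the "minimum set" the proposal describes.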
How Dublin Core complements other standards
The Dublin Core standard is at the core of the majority of metadata standards and is limited to only 15 core elements, which GDS will advocate the cross-government community use to share data across departments where no other metadata is captured. Dublin Core elements are independent of coding syntax. As the metadata maturity of a government department improves, the Dublin Core metadata elements will become part of a much more complex standard, as is the case in ONS (the ONS model includes standards such as DCAT and ADMS, amongst others). Using Dublin Core elements is a first step in reaching that maturity, and ensures that at least a minimal set of metadata is captured and shared across government departments in a standardised way. However, when departments are publishing data openly, they should use schema.org to describe it, as this is a collection of metadata schemas that are focused on SEO and are particularly targeted at webmasters. Schema.org is already in use by data.gov.uk and GOV.UK. Whilst Dublin Core elements describe both physical and web resources, schema.org has been designed specifically for search engine optimisation. Schema.org is also a complex and mature metadata standard, and the majority of government departments are required to support it when publishing their open datasets to data.gov.uk pages. Whilst schema.org is a fantastic standard for publishing data on the web, government departments often lack any means of publishing common and coherent metadata to fit their in-house needs. The Dublin Core set of elements forms a basis for many other standards which can be easily adopted as the organisation matures in its metadata journey. The schema.org metadata standard cannot be used, nor is it designed to work, in the same way.
Other steps to achieving interoperability
This proposal is only concerned with promoting a consistent and accurate set of metadata associated with tabular data when government is sharing data privately. When government is sharing data publicly, i.e. publishing the data, we have a separate proposal to the Open Standards Board to use schema.org (link), since this has been the preferred option in the open linked data community. When wanting to describe how the data is shaped and formatted within a particular file, we have a separate proposal to the Open Standards Board for the use of CSV on the Web.
Proposal: Recommend schema.org to describe tabular data you are publishing
Introduction
This proposal is to recommend that government departments use schema.org to describe open data they are publishing. For data being shared but not published, we have a separate proposal with the Open Standards Board to use Dublin Core as the minimal set of metadata elements conserved across government. Using schema.org to describe tabular data that is published by individuals, teams and departments means search engines can better find the data and display structured results to end users more efficiently. Describing the contents of published tabular data resources in a consistent way will also allow the data to be easily catalogued, validated and reused. This proposal:
User need approach
The user need identified by this proposal is to make government published data easier to find and to maximise its use. Officially adopting schema.org should help to achieve this, since the standard has been used to publish open data to government websites for a while. Users in the context of this proposal are government workers who create, share and maintain tabular data, and want people to be able to find it. Individual users include, but are not limited to, data scientists, business analysts, technology policy advisors, economists and members of the public. Often, openly published government data can be difficult to find. Web publishers who include schema.org markup generally tend to have a competitive SEO advantage over those who don't, so it makes sense to adopt schema.org within government. Schema.org allows context to be provided for an otherwise ambiguous webpage, improving the quality of search results for users. The user needs identified for adopting schema.org are for users to be able to:
Achieving the expected benefits
If departments use schema.org, they would be consistent in the way they describe their published data. schema.org is supported by the major search engines and takes full advantage of the semantic web. schema.org is a set of tags that aims to make annotating HTML elements with machine-readable tags much easier. schema.org is already used by government websites publishing data, including GOV.UK and data.gov.uk. Different bits of government, and associated agencies, publish tabular data in different formats, and they regularly exclude pertinent information. Sometimes, even when the data is included, it is not machine readable, so is not easily findable. Using schema.org as an open standard for published data across government will help online users find relevant, accurate data to fit their needs. schema.org will also make it easier for aggregators, search engines and others to reuse data published by the government, in turn making it easier for users to find data relevant to them on non-government services.
When schema.org may not be suitable
The schema.org standard is a powerful tool for helping online users find the information they need. However, when departments or government individuals are sharing data between themselves, they should use Dublin Core tags to describe it, as this is a collection of tags that are less focused on the web and more suited to structuring the data. Whilst schema.org is a fantastic standard for publishing data on the web, government departments often lack any means of publishing common and coherent metadata to fit their in-house needs. The Dublin Core set of elements forms a basis for (and is part of) many other more complicated standards (such as DCAT) which can be easily adopted as the organisation matures in its metadata journey. The schema.org metadata standard cannot be used, nor is it designed to work, in the same way; however, the metadata elements described as "core" are also present in the schema.org standard.
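To make the schema.org proposal concrete: the markup search engines read is a small JSON-LD object embedded in the page. A minimal sketch of a schema.org Dataset description; the dataset, URLs and values are invented:

```python
import json

# Invented example of a schema.org Dataset record, serialised as the
# JSON-LD that would be embedded in a publishing page.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Monthly prison population",
    "description": "Prison population figures for England and Wales.",
    "publisher": {"@type": "Organization", "name": "Ministry of Justice"},
    "datePublished": "2019-06-01",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.gov.uk/prison-population.csv",  # hypothetical
    }],
}

# The snippet a publisher would place in the page's HTML.
snippet = '<script type="application/ld+json">%s</script>' % json.dumps(dataset)
print(snippet)
```

Note how few properties are needed to make the record findable - a point picked up later in the thread when discussing a "profile" of the standard.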
Other steps to achieving interoperability
This proposal is only concerned with promoting consistent and accurate data when government is publishing data openly. When government is sharing data privately, we have a separate proposal to use Dublin Core. When wanting to describe how the data is shaped and formatted within a particular file, we have a separate proposal to the Open Standards Board for the use of CSV on the Web.
Proposal: Recommend CSV on the Web to annotate tabular data column properties
Introduction
A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. CSV files may be of a significant size, but they can be generated and manipulated easily, and there is a significant body of software available to handle them. Indeed, popular spreadsheet applications (Microsoft Excel, iWork's Numbers, or OpenOffice.org), as well as numerous other applications, can produce and consume these files. However, although these tools make conversion to CSV easy, it is resisted by some publishers because CSV is a much less rich format that can't express important detail that the publishers want to express, such as annotations, the meaning of identifier codes, etc. Existing formats for tabular data are format-oriented and hard to process (e.g. Excel); un-extensible (e.g. CSV/Tab Separated Values (TSV)); or they assume the use of particular technologies (e.g. SQL dumps). None of these formats allows developers to pull in multiple datasets, manipulate, visualise and combine them in flexible ways. Other information relevant to these datasets, such as access rights and provenance, is not easy to find. CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata. To address these issues, CSVW seeks to provide:
This proposal is to recommend that government departments use CSV on the Web (CSVW) to process CSVs into an annotated data model, so that CSVs can be annotated, interoperable and more easily shared. This is an open and established standard, and is currently being used and recommended by ONS. It should be noted that this standard deals with tabular data; a standard which deals with an extended array of formats might be recommended for use in the future. CSVW is a W3C standard for metadata descriptions of tabular data, and will give government a standard way to express useful metadata about CSV files and other kinds of tabular data. With CSVW, after the tabular data is annotated, the model is used as the basis to create RDF or JSON. For example, CSVW assumes that the first row of a CSV is a header row containing the titles for the CSV cells. CSVW also assumes each row in the CSV file constitutes a record with properties. After assuming a model for the tabular data, the file can be easily integrated with other data. This proposal builds on the Open Standards Board recommendation of the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.
User need approach
Different bits of government, and associated agencies, publish tabular data in different formats, and they regularly exclude pertinent information on the column properties, making it hard for files to be shared. Using CSVW as an open standard for tabular data to be shared across government will help government workers aggregate and reuse data. Users in the context of this proposal include:
Achieving the expected benefits
As is noted by the W3C Working Group, CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata. To address these issues, the CSV on the Web Working Group seeks to provide:
If departments use CSVW, they would be consistent in the way they describe their tabular data column properties. CSVW is already in use across government, and support for the standard is noted in the W3C Working Group's Use Cases and Requirements brief.
When CSVW may not be suitable
CSVW is suitable for tabular information formats, but not for other formats.
Other steps to achieving interoperability
This proposal is only concerned with promoting consistent and accurate tabular data when government is transporting data into a data store or another tabular data file. When government wants to share details about the data we hold so it can be catalogued and found easily, we have separate proposals to the Open Standards Board to use Dublin Core and schema.org.
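To show what the CSVW annotation looks like in practice, here is a minimal sketch of a CSVW metadata document describing the columns of a hypothetical CSV. The file name, schema URL and column set are invented for illustration:

```python
import json

# Minimal CSVW metadata document (per the W3C Metadata Vocabulary for
# Tabular Data) for a hypothetical prison-population.csv file.
csvw = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "prison-population.csv",  # hypothetical file
    "dc:title": "Monthly prison population",
    "dc:conformsTo": "https://example.gov.uk/schemas/prison-population",  # hypothetical
    "tableSchema": {
        "columns": [
            {"name": "prison_id", "titles": "Prison ID",
             "datatype": "string", "required": True},
            {"name": "month", "titles": "Month", "datatype": "date"},
            {"name": "population", "titles": "Population", "datatype": "integer"},
        ],
        "primaryKey": "prison_id",
    },
}
print(json.dumps(csvw, indent=2))
```

This is the piece that the descriptive-metadata standards above don't cover: names, datatypes and constraints per column, which is exactly what a validator or a data store loader needs.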
I agree with @pwin, @PeterParslow and @timwis that we are discussing two rather distinct sorts of metadata here. I'd describe them as Descriptive Metadata (DCAT, DC, schema.org) and Structural Metadata (Table Schema, CSVW, TNA's CSV Schema Language). @Lawrence-G please can a separate challenge be created for the former? These two areas have rather distinct user needs and merit separate discussions. And a bit of "divide and conquer" is probably needed when the latest proposals are 9 screenfuls long :)
@davidread That may be the best approach, but I'll leave it to the Open Standards Board to decide. This challenge has acted as a touchpaper to (re)ignite the conversation over these past months, and this extended proposal is the result. To stall the momentum would be a shame, but we do need to get things clearly defined. I hope that the work carried out so far will help inform the final profile. Yes, the nine pages are somewhat epic. For the assessments, I have three links: Dublin Core Metadata Assessment, CSV on the Web (CSVW) Assessment. As part of the open standards process, any standards recommended for use in a standards proposal are assessed using the 47 questions. The Open Standards Board agreed on the criteria used in the assessment. These questions are based on the EU CAMSS (Common Assessment Method for Standards and Specifications).
Regarding the Descriptive Metadata, I hope that the Open Standards Board takes into account the collection of Statutory Instruments collectively known as "The INSPIRE Regulations". Starting with 2009/3157 (http://www.legislation.gov.uk/uksi/2009/3157/contents), and amended by 2012 No. 1672, it has since become quite a collection because of the range of "EU Exit" amendments. But basically it requires most public bodies in the UK to use a particular metadata "standard" to describe rather a lot of their data. Defra has invested in ensuring that the UK has a "standard" which allows this to work with data.gov.uk: GEMINI (https://www.agi.org.uk/gemini/). Luckily, it's based on Dublin Core, and the Geospatial Commission is funding some work that is likely to result in GEMINI including advice on a Schema.org encoding. I do urge proper thought before issuing "standards advice" that contradicts a Statutory Instrument. Sometimes "stalling the momentum" might be better than confusing the organisations one is trying to influence!
I agree with @PeterParslow on the need to move forward carefully. Not every organisation or line of business has the same requirements or is at the same level of maturity.
Can we reduce the proposal from recommending the whole of schema.org to just its Dataset schema? schema.org includes schemas for everything from DrugStrength to Electrician to PublicToilet. schema.org is a disrupter, a competitor to existing data standards in numerous fields; e.g. library cataloguing has for decades used standards like MARC, BibFrame and FRBR. If Open Standards adopts the whole of schema.org wholesale, without considering the more established open standards in each field, that could bias government adoption towards schema.org in all those fields.
On schema.org, the primary reason this proposal recommends it is:
Perhaps the supporters can expand on this point? A quick search brought this summary from Mozilla:
Again on schema.org, a secondary reason for the recommendation is:
schema.org's Dataset is a big, sprawling standard, with over 100 properties, to cover a wide variety of use cases. I strongly dispute the idea that simply adopting it wholesale will bring consistency to government metadata. Fields are all optional - each publisher will choose their own. The sheer number of options rather defeats the idea of standardisation. For example, just reading the standard you'll see lots of options for specifying the publisher of the data - whilst data.gov.uk and gov.uk have a simple model of allowing one or more departments attached as the "publisher", and no other role, schema.org's Dataset can have a publisher, author, creator, producer, provider, publisherImprint, funder, sponsor, sdPublisher, sourceOrganization, contributor and copyrightHolder. The definitions are ambiguous and confusing, so people will get them wrong. Some will feel the only thing needed is which org captured the data ("producer"?), some will only include the "publisher", and some will think the copyrightHolder is the only thing to record. Whilst a human can make sense of one record, you cannot compare or automatically process them in bulk. So that is not a standard. To take another field as an example - "date". There is a huge variety of ways to express dates: contentReferenceTime, dateCreated, dateModified, datePublished, expires, sdDatePublished, temporal, temporalCoverage. Now if you're trying to search for data on a topic, and want to filter by data collected in the past year, you've got a really tough query to write for datasets that could express their date 8 different ways. With so much complexity, organisations will assign these differently every time, and those trying to make sense across government will have an impossible task. The only obvious alternative is DCAT, which is no better in this respect - indeed it comes with more linked data extensibility/complexity - so I wouldn't favour that. In truth I rather like the core of schema.org's Dataset schema.
I assume the likes of data.gov.uk and gov.uk have defined a subset of 10-ish properties that they read (out of the 100+ in the standard) and the rest are disregarded. I would suggest that this proposal does a similar thing and specifies which fields are recommended (a "profile" of the standard), which the Open Standards Board points to in recommending this standard. I've been involved in user research where publishers ask questions like "if you release a monthly set of figures on a topic, should each month be recorded as a new Dataset, or just a Distribution within the existing Dataset?" schema.org doesn't have a view (and DCAT is unclear, I believe). This sort of problem can be solved by best-practice examples / supplementary guidance, so I'd like to see this in the proposal too.
@PeterParslow rightly points out that the substantial proportion of datasets covered by the INSPIRE law need to have GEMINI-format metadata. To add some more background (someone correct me if this has changed since I was last involved): GEMINI is a specialist format for use mainly by those in the environmental research community. GEMINI is a standard for data with a spatial element to it (whereas schema.org is for all types of data). The UK's INSPIRE datasets' metadata are all harvested into data.gov.uk, where the GEMINI metadata is simply translated to schema.org Dataset format and exposed like that, to give the benefits of search engine discoverability. So we already have both metadata formats being published for these datasets, which makes total sense to me. I'm not convinced that the Open Standards Board making a recommendation for schema.org's Dataset would confuse publishers of INSPIRE data, but this could be usefully clarified in accompanying guidance.
@davidread A reasonable summary - just a couple of points:
For the GEMINI community, I can say we're glad that data.gov.uk translates parts of the GEMINI record to schema.org. A current Geospatial Commission project is likely to recommend that we continue this way, with a few proposed improvements to the schema.org mapping. In conversation with Rosalie Marshall of GDS, I've volunteered to draft some accompanying guidance for GEMINI authors. I'd like to include the mappings*, so they can 'see for themselves' that publishing their GEMINI records will satisfy this schema.org proposal. That will be much easier with David's other suggestion - of specifying which schema.org elements are in mind. Rosalie suggested that would be in guidance. Even the starting point, https://schema.org/Dataset, has 110+ properties, many of which on a quick look seem quite irrelevant. *I'll probably base it on https://github.com/geonetwork/core-geonetwork/wiki/JSON-LD---ISO19139-mapping-proposal - GeoNetwork is a very common tool for creating & managing GEMINI records, and already has an optional schema.org output. |
GEMINI comes with quite extensive guidance on things like "publisher" (which organisations to specify) - https://www.agi.org.uk/agi-groups/standards-committee/uk-gemini/40-gemini/1062-gemini-datasets-and-data-series#23 and "date" (which date?) - https://www.agi.org.uk/agi-groups/standards-committee/uk-gemini/40-gemini/1062-gemini-datasets-and-data-series#8 If these need to be improved, I can do that easily. If they need to change, that needs a bit more governance! There's also a related publication on improving your metadata quality: https://www.agi.org.uk/about/resources/category/81-gemini?download=100:metadata-guidelines-part-3-april-2015 Feel free to borrow any of these; it's all CC-BY |
Thanks @PeterParslow. That sounds useful to clarify the mapping to schema.org. Thanks also for the links to the GEMINI guidance on publisher and date; schema.org has a similar list of definitions. You talk about governance to get metadata publishers to improve, but my experience is that this is tough to do: in Europe and the UK there have been various incentives, metrics, outreach and bottom-up efforts. I think the best approach is to require metadata against a small standard, with strict validation on submission, short and clear guidance, and convenient online tools that help you meet the standard. Complexity is the main enemy in this space. A standard should include just enough fields for dataset publishers to satisfy the key needs of open data users. The main need is "discoverability" (e.g. search engines), and you can do most things with a title, a description and a few stricter machine-readable fields - publisher, date, link to the data. It's a disservice to offer the metadata author 110 fields. The next most useful thing in dataset metadata is probably structural metadata (a data dictionary), which schema.org's Dataset doesn't cover. Structural metadata is more CSVW's domain; however, that suffers from the complexity issue too. |
Thoroughly agree @davidread - getting metadata quality to improve is more about education than governance. And it's much easier if we start with a small set. GEMINI has 21 mandatory fields - the ones you mention plus keywords, a link to the licence and a link to the specification (we've found the former is important to most searchers, and the spec helps people decide between 'hits'): https://www.agi.org.uk/40-gemini/1250-element-summary My comment about governance was really about what happens if this discussion concludes that the 'kinds of date' need to be different, because GEMINI is built on an ISO standard (19115:2003 - itself an evolution from Dublin Core) and accompanying European (INSPIRE) guidance. |
I know it is slightly off topic, but I think we'd be better off getting persistent identifiers in place for tables, and for some of the fields in tabular data that are used for linking, than getting a set of metadata in place. I'm presuming here that the main reasons for documenting these tables are to find them and to merge/aggregate the data they contain. The choice of metadata is quite a rabbit hole, with domains having their own issues and preferences. However, knowing that "this CSV dataset" contains the same data as "that XML dataset", or that the identifiers used in the "Local Authority" column of one CSV are from the same set as the "council" identifiers in another, is going to be of much more practical use than which specific items from schema.org are used. |
On CSVW: I totally agree with the proposal's aim to improve government data by using structural metadata. It is great to record for each column what the data type is, any data standards or code lists it follows, or whether it references another dataset. As the proposal says, by defining these things in structural metadata, you can automate both the checking of the data and the loading of the data into different data stores. And I'd go further - this support for common identifiers, standards and linking may well incentivise these key elements. We can all agree that if you have quality, inter-connected data then you have a great basis for quickly drawing valuable insights from your data.

Let me expand a bit on the user needs, to help us evaluate a structural metadata standard like this. I believe the key user needs are:

- "validate" - validate that a CSV file doesn't have the wrong columns or bad data in them, so that:
  - you can automatically check published CSVs are usable, correctly reference items in other datasets and can be aggregated across multiple publishers
  - in an internal data pipeline, you can catch errors early by validating at every stage
- "load" - load the CSV into a datastore (e.g. database, dataframe, data warehouse), with the column types set up sensibly and references to other tables marked, so that you can:
  - make meaningful queries - e.g. a query for all records from April 2017 to March 2018 is hard if the date is loaded as a string
  - visualize - e.g. if you've correctly interpreted lat/long then you can plot records on a map
  - combine/aggregate

CSVW positives:

- offers a large vocabulary, which can cover a huge range of use cases. It also links with, and is extensible via, numerous other linked data standards. (A large vocabulary is a double-edged sword though - see complexity below.)
- has some key supporters, e.g. https://blog.ldodds.com/2019/01/26/talk-tabular-data-on-the-web/

CSVW negatives:

- very low adoption - hardly any organizations publish datasets with CSVW annotations and there are few tools (none support it fully)
- complex - a high bar to understanding it (https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/), publishing it and consuming it. E.g. a column can have 19 different sorts of annotations, and the author can choose from three different vocabularies for the descriptive metadata: dc:description, dcat:description, schema:description
- a key design criterion was the ability to convert CSV to RDF, to marry up with W3C's linked data vision, which adds complexity rather than focusing on the central use cases of "validate" and "load"
- existing attempts at publishing CSVW are of doubtful value, e.g. see this Twitter thread: https://twitter.com/ldodds/status/1195398689532940288

In comparison, I think Frictionless Data's Table Schema achieves 80% of the value with 20% of the complexity. It has built up a much stronger claim to having actually improved data flows, a stronger community and an array of tooling. The standard is developed in the usual open source way on GitHub, and the likes of the Open Data Institute have given it support in their csvlint tool (along with CSVW). So it's an open standard, with a more lightweight process than the formality of a W3C working group. I think there is a serious question mark over CSVW because it has not got much traction so far. Would UK government be happy to embrace it with the risk that the rest of the data world doesn't? Government might find itself investing in tools and trying to make the tech work, only to find another one becomes the dominant standard, and the investment is wasted. CSVW is also a big standard that hasn't been tested much and can't easily be iterated (there is a lot of overhead in a formal W3C working group: ensuring worldwide compatibility, seeking consent etc.). Big-bang, hard-to-change specs - another risk. To mitigate these risks I'd suggest that ONS and/or other interested parties might:

- define a "profile" of CSVW, suggesting a subset of the vocabulary we should use, based on a narrow set of user needs that add the most value, to concentrate efforts to build momentum and deliver value
- publish some example metadata, to help people understand the standard and build momentum

And come back for approval from the Open Standards Board with some evidence of the standard becoming established and delivering value.
|
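To ground the CSVW discussion above, here is a minimal, hypothetical csv-metadata.json sketch (the file names and columns are invented) showing how a table group covers the "validate" and "load" needs: column datatypes, a range constraint, a primary key and a foreign-key link between two CSV files:

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [
    {
      "url": "road-traffic-2019.csv",
      "tableSchema": {
        "columns": [
          {"name": "local_authority_code", "titles": "Local Authority", "datatype": "string", "required": true},
          {"name": "count_date", "titles": "Date", "datatype": "date"},
          {"name": "vehicle_count", "titles": "Vehicles", "datatype": {"base": "integer", "minimum": 0}}
        ],
        "foreignKeys": [
          {
            "columnReference": "local_authority_code",
            "reference": {"resource": "local-authorities.csv", "columnReference": "code"}
          }
        ]
      }
    },
    {
      "url": "local-authorities.csv",
      "tableSchema": {
        "columns": [
          {"name": "code", "datatype": "string", "required": true},
          {"name": "name", "datatype": "string"}
        ],
        "primaryKey": "code"
      }
    }
  ]
}
```

A validator that implements the CSVW model can use this to reject a row with a negative vehicle count or a local authority code that doesn't appear in the reference file; a loader can use the same document to create typed columns and a foreign-key relationship.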
That's a thoughtful reflection with several useful points.

Re: "...CSVW is also a big standard, that hasn't been tested much, and can't easily be iterated (there is a lot of overhead of a formal W3C working group, ensuring worldwide compatibility, seeking consent etc.) Big bang, intransient specs - another risk....", W3C is moving to a much more responsive "evergreen standards" process that will be easier to work with. The Dataset Exchange WG (DXWG) is moving to that already in the coming year. Proposals for modification of CSVW could perhaps follow this formula.

Using a "profile" approach is very sensible too, and the DXWG work on content negotiation by profile might meld well with this approach.
|
Thank you @pwin, @PeterParslow and @timwis for this really useful discussion and for being our critical friends in this challenge, helping us shape these proposals. Throughout the GDS discovery into the needs for metadata across government departments, it became clear that there is a vast gap in how, across government, we collect and publish metadata. Some organisations are advanced and ahead of others in adopting standards, while others collect no descriptive metadata at all. The aim of recommending rather than mandating Dublin Core elements (which, as @PeterParslow pointed out, form the basis for the majority of metadata standards) is to help departments which do not publish any metadata to start with Dublin Core and eventually mature to more complex standards such as DCAT. You are all right in saying that there are various application profiles where different use cases and requirements are captured, but at its core Dublin Core's "core elements" constitute the bare minimum. The same bare minimum is also found in schema.org.

I completely agree with @davidread that schema.org is a vast and comprehensive standard, and I like the idea of recommending a profile of schema.org's Dataset schema alongside the use of the same persistent elements in Dublin Core. I think the next step in these proposals is to provide a comprehensive recommendation and examples of how we envisage these standards being used, and GDS will be working with you and the community on publishing this guidance to GOV.UK. As we have already pointed out in the proposals, all of these standards are a starting point on a maturity framework, and we would like to help departments progress along it. @PeterParslow, you made an excellent point that improving metadata quality is more about education than governance. These proposals are the first step in that and will be accompanied by comprehensive engagement, training and workshops across departments to help them improve the quality of the metadata they publish, with input from the Government Data Architect community.

Also, @pwin, you made an excellent point on persistent identifiers. Without them, no matter what metadata standards we use, we cannot reliably link to or refer to data and metadata. The Open Standards Board has recommended a standard for this, and I think the COGS team at ONS is doing some work in this space as well. This issue will be discussed at the Board meeting on Monday. And @PeterParslow, we have redrafted the proposal on Dublin Core to confine the remit and reference GEMINI and INSPIRE, as you suggested during a call with Rosalie. The guidance published on GOV.UK will also set out exactly what is and isn't in scope when it comes to geospatial data and following Dublin Core; essentially, these different standards can be aligned so there is clarity on which metadata standard to follow when. |
In our discovery on this we concluded that neither CSVW nor Frictionless Data was really widely adopted enough to be particularly useful. Frictionless Data has the better tooling though, and is a lot simpler to understand and implement. In the short term, I suspect that recommending either isn't helpful. Right now, probably the most useful thing you can do is to publish CSV in a consistent way, and then have some really good HTML pages documenting what all the columns mean in clear, understandable language. Not machine-readable, but at least human-readable. Longer term, it might be the case that different de facto standards for CSV metadata emerge for different domains (eg publishing statistics vs financial transparency data). |
@frankieroberto I really appreciate the user research insights. Several people have said that now is not the time to be agreeing or recommending a particular standard for structural metadata for open data. However, I think many of us here are keen on the potential benefits in this area (and we're slowly working out how to express the vision in an understandable way!). So let's use the energy we collectively have in government to do lots of 'alpha'-stage work in this space. We can try different approaches and standards, each time trying to cultivate a little ecosystem of publishers and consumers that benefit from the metadata. And once we find a successful formula, only then do we start shouting about it, firm up the standard, and scale up. There are some great practical ideas and growing consensus in this thread on where we can start:
The highest-profile work on this is probably the ONS COGS project in the stats community of UK government. I'm really interested to see what can be achieved here - the people are distributed but closely networked, they include both producers and consumers of data & stats, and they have plenty of technical clout - it seems like a really fertile opportunity. |
Today, the Open Standards Board accepted these proposals as recommended standards with a few conditions that will be added to the profiles when published on GOV.UK Thank you, everyone, who contributed to this challenge. |
Hi everyone, I have recently applied schema.org Dataset markup to NHS Digital's publications on https://digital.nhs.uk. There are around 2000 now flowing, and in theory visible in Google's Dataset Search tool. I would welcome views on how we can consistently apply schema.org, Dublin Core and any recommended standard to the data, so it is embedded in the HTML web content we publish. Any feedback is welcome, along with improvement suggestions. |
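For readers unfamiliar with this technique: markup of this kind is typically embedded in each publication's HTML page inside a `<script type="application/ld+json">` element, which is what Google's Dataset Search crawls. A trimmed, illustrative sketch (the property values here are invented for the example, not NHS Digital's actual markup) might look like:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example Monthly Statistics Publication",
  "description": "A plain-English summary of what the publication contains.",
  "publisher": {"@type": "GovernmentOrganization", "name": "NHS Digital"},
  "datePublished": "2019-12-01",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://digital.nhs.uk/data/example.csv"
  }
}
```

At minimum, Dataset Search expects name and description; the publisher, date and distribution properties improve how results are filtered and displayed.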
Thanks Bharat. The impact on Google's Dataset Search is dramatic. A valuable piece of work.
Nicholas
|
Schemas for Tabular Data
Category
Suggested by
Originally Submitted by pwalsh on Mon, 13/03/2017 on standards.data.gov.uk
Short Description
Much data published by governments is in common tabular data formats: CSV, Excel, and ODS. This is true for the UK government and governments around the world. To provide assurances around reusability of tabular data, consumers (users) need information on the "primitive types" for each column of data (example: is it a number? is it a date?). This also allows for quality checks to ensure consistency and integrity of the data.
Publishing a Table Schema with tabular data sources provides this information. Table Schema has previously been used in work by Open Knowledge International (OKI) with the Cabinet Office to check the validity of 25K fiscal data files against publication guidelines. Table Schema is also used widely by other organisations working with public data, such as the Open Data Institute (ODI).
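For context, a Table Schema describing the "primitive types" of a tabular file is a small JSON document. A minimal, invented example for a fiscal spend CSV (the field names here are hypothetical, not from the Cabinet Office work) might be:

```json
{
  "fields": [
    {"name": "transaction_id", "type": "string", "constraints": {"required": true, "unique": true}},
    {"name": "date", "type": "date"},
    {"name": "amount", "type": "number", "constraints": {"minimum": 0}},
    {"name": "supplier", "type": "string"}
  ],
  "primaryKey": "transaction_id"
}
```

A validator can then report any row where, for example, the date column doesn't parse as a date or the amount is negative, which is the basis of the quality checks described above.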
User Need
I've written several user stories below. Each user story applies equally to a range of users. The user personas are as follows:
User stories
As a user, I want all public data published by government to conform to a known schema, so I can use this information to validate the data.
As a user, I want public data published by government to have a schema, so I can read the schema and understand at a glance the type of information in the data, and the possibilities for reuse.
Expected Benefits
Functional Needs