Use DCAT standard for format field #1336

davidread · 2013-11-21T17:07:39Z

This is a suggestion to express the resource format in the database using mime-types where possible. This is what DCAT specifies, so this would facilitate interoperability between data catalogs. Falling in with standards is a good thing for the community generally. However the mapping between formats in CKAN and DCAT is probably 1:1, so there is no intrinsic gain to the CKAN model by doing this.

We currently do store a mimetype value, often filled in by the data_storer or ckanext_qa, but it would be good to fill it in as a matter of course and be in step with the format field.

Current CKAN convention:

resource.format - file extension e.g. 'csv'
resource.mimetype - mimetype of the object, as seen on download e.g. 'application/zip' or 'text/csv'.
resource.mimetype_inner - the mimetype of a 'contained' file. e.g. when resource.mimetype='application/zip', resource.mimetype_inner='text/csv'. If resource.mimetype is not a container, resource.mimetype_inner is empty.

CKAN's edit form asks for the format, using autocomplete based on previous entries (and a standard list?). The mimetype and mimetype_inner are filled automatically by datastorer? (CHECK)

DCAT formats:

DCAT "Distribution" (equivalent to a CKAN resource) has:

dcat:mediaType - mimetype according to IANA e.g. 'text/csv'
dct:format - rdfs:label (i.e. human-readable string) e.g. 'CSV'

dcat:mediaType "SHOULD be used when the media type of the distribution is defined in IANA, otherwise dct:format MAY be used with different values."

Proposal

It makes sense for CKAN to hold details about lots of formats, along the lines of the ECPortal and DGU lists

Each format would have properties including:

Whether it has a dcat:mediaType or dct:format
The canonical dcat:mediaType or dct:format string
Short display string (for showing in the search results) e.g. "XLS"
Long display string (for the dropdown) e.g. "Excel Spreadsheet (XLS)"

'format' has always been the principal field in CKAN, probably used by a host of customized templates and extensions, so there would be an advantage for it to be a 'displayable' version of the format. The mimetype and mimetype_inner are rarely used, so could be recycled much more easily.

Since we have the table of formats, we could refer to format in the resource in any number of ways, even by ID. However people using our API would find it useful to see the short display string as well as have the DCAT info. So I suggest we have in the database:

resource.format_id - foreign key to format table
and when you package_show, you see:
resource.format - short display string
resource.format_long - long display string
resource.mimetype - dcat:mediaType (if it has one)
resource.mimetype_inner - dcat:mediaType (if there is contained data and it has a mimetype)
resource.format_canonical - the dct:format string (if no mimetype)
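A minimal sketch of how such a format table could drive the package_show output. The `Format` class, the `FORMATS` entries and `show_format_fields` are hypothetical names for illustration, not existing CKAN code, and container handling (mimetype_inner) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Format:
    """One row of the proposed format table (hypothetical layout)."""
    id: str
    short_name: str                    # e.g. shown in search results
    long_name: str                     # e.g. shown in the dropdown
    media_type: Optional[str] = None   # dcat:mediaType, when one exists
    dct_format: Optional[str] = None   # dct:format, used when there is no media type


# Illustrative entries only; the real list would be much longer.
FORMATS = {
    "csv": Format("csv", "CSV", "Comma Separated Values (CSV)", media_type="text/csv"),
    "xls": Format("xls", "XLS", "Excel Spreadsheet (XLS)",
                  media_type="application/vnd.ms-excel"),
    "shp": Format("shp", "SHP", "Shapefile (SHP)", dct_format="Shapefile"),
}


def show_format_fields(format_id):
    """Expand resource.format_id into the fields package_show would return."""
    fmt = FORMATS[format_id]
    return {
        "format": fmt.short_name,
        "format_long": fmt.long_name,
        "mimetype": fmt.media_type or "",
        "mimetype_inner": "",  # containers are not handled in this sketch
        # dct:format string is only filled in when there is no mimetype
        "format_canonical": "" if fmt.media_type else (fmt.dct_format or ""),
    }
```

Only `format_id` would be stored on the resource; everything else is derived at display time, so existing templates that read `format` keep working.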

To map this to DCAT:

dcat:mediaType = resource.mimetype_inner or resource.mimetype
dct:format = resource.format_canonical
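The mapping above could be sketched roughly as follows. `resource_to_dcat` is a hypothetical helper operating on a plain package_show-style dict; it is not part of CKAN or ckanext-dcat.

```python
def resource_to_dcat(resource):
    """Return the DCAT format properties for a resource dict.

    Prefers the inner mimetype over the container mimetype, and only
    falls back to dct:format when no mimetype is available.
    """
    props = {}
    media_type = resource.get("mimetype_inner") or resource.get("mimetype")
    if media_type:
        props["dcat:mediaType"] = media_type
    elif resource.get("format_canonical"):
        props["dct:format"] = resource["format_canonical"]
    return props


# Example: a zipped CSV is reported by its contained media type.
zipped_csv = {"mimetype": "application/zip", "mimetype_inner": "text/csv"}
shapefile = {"mimetype": "", "mimetype_inner": "", "format_canonical": "Shapefile"}
```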

Other formats

If data is in an unlisted format, then in the form, instead of using the dropdown, you just want to type free text. We could store this in the db as resource.format_other. And package_show would use that text as both short and long name, with blank mimetype and format_canonical.

Inner format

The resource.mimetype can be different from resource.mimetype_inner when the data comes in a container such as ZIP, TAR, RAR, GZIP (are there any examples which are not compression?). DCAT says nothing about containers, so in the mapping we just ignore it if there is one.

Multiple mime-types for a file format.

e.g. Excel files are defined by IANA as:

application/vnd.ms-excel
application/vnd.ms-excel.addin.macroEnabled.12
application/vnd.ms-excel.sheet.binary.macroEnabled.12
application/vnd.ms-excel.sheet.macroEnabled.12
application/vnd.ms-excel.template.macroEnabled.12

and then there are the unofficial ones:

application/excel
application/x-excel
application/x-msexcel

I suggest we don't worry about the variants and just pick what looks canonical to us:
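As an illustration only, folding the unofficial variants into one value might look like the sketch below. Treating application/vnd.ms-excel as the canonical choice is an assumption made here, not something settled in this ticket, and `canonical_mimetype` is a hypothetical helper.

```python
# Assumed canonical value for legacy Excel files (an illustrative choice).
CANONICAL_EXCEL = "application/vnd.ms-excel"

# Unofficial variants listed above, keyed in lower case.
MIMETYPE_ALIASES = {
    "application/excel": CANONICAL_EXCEL,
    "application/x-excel": CANONICAL_EXCEL,
    "application/x-msexcel": CANONICAL_EXCEL,
}


def canonical_mimetype(mimetype):
    """Map a known variant to its canonical form; leave anything else as given."""
    cleaned = mimetype.strip()
    return MIMETYPE_ALIASES.get(cleaned.lower(), cleaned)
```

The macro-enabled IANA types are deliberately passed through unchanged, since they are official registrations rather than aliases.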

Question: Separate XLS and XLSX?

General users might not care about the differences between these formats. They both open in Excel the same. (Assuming they have Excel 2007 or later which is pretty likely.) It could be a slight faff to users entering it in the form to get it right, and may result in high inaccuracies as they don't care.

However I believe it is useful to some client software to know whether a file is XLS or XLSX format. It saves having to do detection in Messytables, although I can't remember the details. It is not too hard to use scripts to correct the resource from XLS to XLSX and vice-versa, so that could solve the accuracy thing.

Also, it would be slightly more correct to use the different mime-types for these two types:

application/vnd.ms-excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
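A correction script of the kind mentioned could tell the two apart from the first few bytes of the file. This is a sketch, assuming the resource is already known to be an Excel file: XLS is an OLE2 compound document, while XLSX is a ZIP container, so a ZIP signature alone does not prove a file is XLSX. `guess_excel_mimetype` is a hypothetical helper.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary Office files, including XLS
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP container, used by XLSX

XLS_MIMETYPE = "application/vnd.ms-excel"
XLSX_MIMETYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def guess_excel_mimetype(first_bytes):
    """Guess XLS vs XLSX from the leading bytes of a known Excel file.

    Returns None when neither signature matches.
    """
    if first_bytes.startswith(OLE2_MAGIC):
        return XLS_MIMETYPE
    if first_bytes.startswith(ZIP_MAGIC):
        return XLSX_MIMETYPE
    return None
```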

Thoughts welcome.

When no mime-type exists

e.g. a Shapefile looks like a zip, yet it should be described in CKAN as a Shapefile. Follow DCAT's suggestion to fall-back on using dct:format. e.g. 'Shapefile' in this case.

This could also be used for GeoJSON to differentiate it from JSON.

Landing pages

It is common to put in a resource.url which is not the data, but a web page from which you can get the data. This might be because there are vast numbers of links. Or maybe a form needs to be filled in to obtain the data. A good legitimate example is a web form for making SPARQL requests.

DCAT expresses these not on a dcat:Distribution, but as the dcat:landingPage property of a dcat:Dataset. It would make sense in CKAN to add a resource.resource_type for landing page (if not done so already) and a corresponding tab for it in the form, to allow easy mapping to DCAT.

API resources

Where a dataset is accessed via an API, the base URL is stored in a CKAN resource, and CKAN sets resource.resource_type="api" instead of "file". APIs in DCAT are differentiated in a dcat:Distribution by having an 'accessURL' instead of a 'downloadURL'.
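In a mapper, that distinction might be sketched like this. `url_property` is a hypothetical helper, and it assumes every resource dict carries a `url` key.

```python
def url_property(resource):
    """Choose the DCAT URL property for a resource based on its resource_type."""
    if resource.get("resource_type") == "api":
        return {"dcat:accessURL": resource["url"]}
    # Anything not marked as an API is treated as a direct download here.
    return {"dcat:downloadURL": resource["url"]}
```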

An API is not a format in itself. The classic API example is a SPARQL endpoint. This might return data in RDF/XML, TTL or JSON format, but CKAN needs to display this resource as 'SPARQL'. So until now in CKAN, we've set resource.format="api/sparql", even though that is not the format of the response. It's not clear how DCAT would represent it as a SPARQL endpoint, so I suggest we continue regarding SPARQL as a format.

Other API examples: WMS, WFS, ONS Open API, Socrata Open Data API.

Centralized list of formats

You could imagine OKF setting up a central server that serves the full list of formats, for CKANs everywhere to use. We envisaged a system like this for licenses, but it was difficult. Every time you ran a CKAN instance it wanted to go and get the list, causing delays on start-up. You worry that the format server will go down, and that might cause the CKAN to not work at all.

I don't think there are so many formats, or many new ones under the sun to worry about, so I suggest just having them in a Python file checked into CKAN code. This can always be changed to a central server in future if called for.

cygri · 2013-11-21T17:20:38Z

Looks all good to me.

On the centralised list of formats, personally, I'd prefer to see the list of formats managed as a separate GitHub project and as a data file rather than as code. This opens it up for uses besides CKAN (for example, in other DCAT-using contexts!), and with a bit of luck (and work) it could become a de facto standard list of formats.

davidread · 2013-11-21T17:40:20Z

Thanks @cygri I like that idea!

Do you happen to know how DCAT would represent a SPARQL endpoint? I can't believe W3C have not considered this...

cygri · 2013-11-21T17:46:40Z

I guess we'd defer to VoID for describing SPARQL endpoints, which brings a bunch of handy things specifically for describing datasets with RDF distributions. See here. VoID can be used in conjunction with DCAT; a void:Dataset is a subclass of dcat:Dataset.

When it comes to distinguishing other specific APIs, DCAT doesn't provide anything to distinguish them. One approach would be to define subclasses of dcat:Distribution for specific types of APIs.

davidread · 2013-11-21T18:03:37Z

That's great about VoID for SPARQL and subclassing for other APIs. Maybe this is the spark to start a DCAT mapper tool...

wardi · 2013-11-26T13:05:57Z

If adding a new database table and actions for editing the list of valid formats is being considered I suggest using a tag vocabulary instead. Its tables and actions already exist, and a validator like is_one_of_tag_vocabulary('foo') is easy to add.

If adding a static list as a configuration file, please make it similarly general so other fields could use a scheme like that for validation, e.g. is_one_of_static_list('foo') => development.ini: ckan.list.foo = /srv/bar/baz/list.json => /srv/bar/baz/list.json content
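A rough sketch of the static-list idea in plain Python. It assumes the JSON file holds a flat array of strings and that `config` is a dict-like view of the ini settings; a real CKAN validator would raise CKAN's own Invalid exception rather than ValueError, and this function does not exist in CKAN today.

```python
import json


def is_one_of_static_list(list_name, config):
    """Build a validator that accepts only values from a configured JSON list.

    The file path is looked up under the 'ckan.list.<list_name>' key.
    """
    path = config["ckan.list." + list_name]
    with open(path) as f:
        allowed = set(json.load(f))

    def validator(value):
        if value not in allowed:
            raise ValueError("%r is not in the %r list" % (value, list_name))
        return value

    return validator
```

Reading the file once when the validator is built keeps per-request validation cheap, at the cost of needing a restart to pick up changes to the list.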

davidread · 2013-11-26T13:08:14Z

Lots of questions thrown up by today's ckan dev chat:

Need to outline the UI for selecting format.

Fear that the 'free text' format option might mean multiple expressions for the same format. Many site admins haven't the time to tidy these up so maybe just don't allow 'other' as free text - just store as 'other'.

If we do allow 'other' as 'free text', over the API we need users to be explicit the value is free text, so that in other circumstances we can validate it.

Extensions can add/remove formats from the short list and long list. Or simply make it configurable.

c.f. with schema validation of format in canadian setup.

Have config option to switch on validation? Not every site would want it?

If the list of formats was stored in the db it could get very complicated, with migrations of the existing resources' formats, ones added via config etc. Better as a separate data file. If it was going to be stored in the db, then maybe use TagVocabulary.

Compare with related DCAT work:

micheldumontier · 2013-12-12T18:30:37Z

I'd like to see this get resolved asap. preliminary analysis of datahub.io shows significant variation - i documented it here: http://help.datahub.io/discussions/problems/12-consistency-in-descriptions-of-formats

i like the idea of specifying the mime-type and the file format. we need clear specification for when there is a zip wrapper over one or more file formats. e.g. a tar.gz that contain n-triples or n-quads for instance.

davidread · 2013-12-13T10:20:33Z

The list of formats is developing as a data file in #1350

I guess it doesn't matter too much how CKAN stores the format, as long as the mapping to DCAT is correct. And this is progressing in ckanext-dcat.

So I think this ticket could usefully focus on how the data is stored in CKAN and the UI for selecting a format in the form.

It seems most useful to have some sort of drop-down or autocomplete to select the format, and to save it as a mimetype in preference, falling back to the file extension if no mimetype exists, and allowing free text as a last resort.

@micheldumontier agreed that the proliferation of mime-types is a problem - I imagine providing the user with a drop-down to select the format from would solve this - do you agree? And we could do with a script to tidy up the mime-types to all be the canonical ones.

@micheldumontier can you explain a use case for specifying both the mime-type and file format? Most of the time there is a direct correlation between mime-type and file format, but not always - are you interested in the differences and if so, which ones?

micheldumontier · 2013-12-14T00:49:52Z

@davidread - yes autocomplete + drop down of available mime-types is desirable.

right, so for instance, we're publishing data as gzipped content (.gz), but the content type inside that gzipped file is n-triples or n-quads. Users/programs need to know this if they're going to pull the content out correctly. so i've been publishing my dataset descriptions annotated with 2 dc:format descriptions.

so i wonder whether this is only an issue with archived data?
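For the gzipped n-triples case, a first guess at both values can be taken from the file name alone. This is only a sketch: the two lookup tables are small illustrative samples, `mimetypes_from_filename` is a hypothetical helper, and archives holding several files (e.g. a tar.gz) would need to be opened to describe their contents.

```python
import os

# Illustrative samples, not complete lists.
CONTAINER_MIMETYPES = {".gz": "application/gzip", ".zip": "application/zip"}
CONTENT_MIMETYPES = {
    ".nt": "application/n-triples",
    ".nq": "application/n-quads",
    ".csv": "text/csv",
}


def mimetypes_from_filename(filename):
    """Return (mimetype, mimetype_inner) guessed from a file name.

    mimetype_inner is '' when the file is not a container or when the
    contained type cannot be guessed from the name.
    """
    stem, ext = os.path.splitext(filename.lower())
    if ext in CONTAINER_MIMETYPES:
        inner_ext = os.path.splitext(stem)[1]
        return CONTAINER_MIMETYPES[ext], CONTENT_MIMETYPES.get(inner_ext, "")
    return CONTENT_MIMETYPES.get(ext, ""), ""
```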

philipashlock · 2014-11-19T16:41:14Z

"DCAT says nothing about containers, so in the mapping we just ignore it if there is one."

Has there been any more discussion about using dcat:mediaType or anything else for resource.mimetype_inner in a way that distinguishes it from using dcat:mediaType for resource.mimetype?

cc: @philarcher1

davidread · 2016-05-24T11:20:36Z

This issue was closed due to inactivity. Feel free to reopen if you have more feedback or are interested in working on it.

akuckartz · 2016-06-18T12:47:53Z

This issue was mentioned regarding the German "Open Government Data Standard 2.0" (draft).

See https://joinup.ec.europa.eu/asset/ogd2_0/issue/format#comment-18461 and https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/question-usage-mimetype-and-mediatype-zipped/container-distribu

Maybe this issue therefore should be reopened? (I generally think that issues should not be closed due to inactivity.)

This was referenced Dec 9, 2013

Typing tags gives random results #1370

Closed

Resource 'Format' field is opaque #1371

Closed

This was referenced Jun 27, 2014

Map Distribution formats to something that CKAN understands ckan/ckanext-dcat#18

Closed

Distribution's format type in DCAT XML ckan/ckanext-dcat#17

Closed

amercader added this to the CKAN 2.4 milestone Oct 22, 2014

TkTech closed this as completed May 10, 2016
