Use DCAT standard for format field #1336

Closed
davidread opened this issue Nov 21, 2013 · 12 comments

Comments

@davidread
Contributor

This is a suggestion to express the resource format in the database using mime-types where possible. This is what DCAT specifies, so this would facilitate interoperability between data catalogs. Falling in with standards is a good thing for the community generally. However the mapping between formats in CKAN and DCAT is probably 1:1, so there is no intrinsic gain to the CKAN model by doing this.

We currently do store a mimetype value, often filled in by the data_storer or ckanext_qa, but it would be good to fill it in as a matter of course and be in step with the format field.

Current CKAN convention:

  • resource.format - file extension e.g. 'csv'
  • resource.mimetype - mimetype of the object, as seen on download e.g. 'application/zip' or 'text/csv'.
  • resource.mimetype_inner - the mimetype of a 'contained' file. e.g. when resource.mimetype='application/zip', resource.mimetype_inner='text/csv'. If resource.mimetype is not a container, resource.mimetype_inner is empty.

CKAN's edit form asks for the format, using autocomplete based on previous entries (and a standard list?). The mimetype and mimetype_inner are filled automatically by datastorer? (CHECK)

DCAT formats:

DCAT "Distribution" (equivalent to a CKAN resource) has:

  • dcat:mediaType - mimetype according to IANA e.g. 'text/csv'
  • dct:format - rdfs:label (i.e. human-readable string) e.g. 'CSV'

dcat:mediaType "SHOULD be used when the media type of the distribution is defined in IANA, otherwise dct:format MAY be used with different values."

Proposal

It makes sense for CKAN to hold details about lots of formats, along the lines of the ECPortal and DGU lists

Each format would have properties including:

  • Whether it has a dcat:mediaType or dct:format
  • The canonical dcat:mediaType or dct:format string
  • Short display string (for showing in the search results) e.g. "XLS"
  • Long display string (for the dropdown) e.g. "Excel Spreadsheet (XLS)"

'format' has always been the principal field in CKAN, probably used by a host of customized templates and extensions, so there would be an advantage for it to be a 'displayable' version of the format. The mimetype and mimetype_inner are rarely used, so could be recycled much more easily.

Since we have the table of formats, we could refer to the format in the resource in any number of ways, even by ID. However people using our API would find it useful to see the short display string as well as have the DCAT info. So I suggest we have in the database:

  • resource.format_id - foreign key to format table
    and when you package_show, you see:
  • resource.format - short display string
  • resource.format_long - long display string
  • resource.mimetype - dcat:mediaType (if it has one)
  • resource.mimetype_inner - dcat:mediaType (if there is contained data and it has a mimetype)
  • resource.format_canonical - the dct:format string (if no mimetype)
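
A minimal sketch of how the format table and the package_show expansion could fit together. The `FORMATS` dict, its keys and `expand_format` are hypothetical names for illustration, not existing CKAN code; an in-memory dict stands in for the format table, the CSV and Shapefile display strings are made up, and mimetype_inner is assumed to be stored on the resource and passed through unchanged.

```python
# Hypothetical stand-in for the proposed format table, keyed by format_id.
FORMATS = {
    "csv": {
        "short": "CSV",
        "long": "Comma Separated Values (CSV)",
        "media_type": "text/csv",  # dcat:mediaType, when IANA defines one
        "dct_format": None,        # dct:format string, only if no media type
    },
    "xls": {
        "short": "XLS",
        "long": "Excel Spreadsheet (XLS)",
        "media_type": "application/vnd.ms-excel",
        "dct_format": None,
    },
    "shp": {
        "short": "SHP",
        "long": "Shapefile (SHP)",
        "media_type": None,
        "dct_format": "Shapefile",
    },
}


def expand_format(resource):
    """Return the format fields package_show might expose for a stored resource."""
    fmt = FORMATS[resource["format_id"]]
    return {
        "format": fmt["short"],
        "format_long": fmt["long"],
        "mimetype": fmt["media_type"] or "",
        # Assumption: the inner mimetype is stored as-is on the resource.
        "mimetype_inner": resource.get("mimetype_inner", ""),
        "format_canonical": fmt["dct_format"] or "",
    }
```

With this shape, `expand_format({"format_id": "shp"})` yields a blank mimetype and 'Shapefile' as format_canonical.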

To map this to DCAT:

  • dcat:mediaType = resource.mimetype_inner or resource.mimetype
  • dct:format = resource.format_canonical
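
The mapping above could be sketched like this. `to_dcat` is a hypothetical helper, and the plain dict it returns is only a stand-in for whatever RDF serialisation a real mapper would produce.

```python
def to_dcat(resource):
    """Map a resource's package_show fields onto DCAT Distribution properties."""
    distribution = {}

    # dcat:mediaType = resource.mimetype_inner or resource.mimetype
    media_type = resource.get("mimetype_inner") or resource.get("mimetype")
    if media_type:
        distribution["dcat:mediaType"] = media_type

    # dct:format = resource.format_canonical (set only when there is no mimetype)
    if resource.get("format_canonical"):
        distribution["dct:format"] = resource["format_canonical"]

    return distribution
```

For a zipped CSV (mimetype 'application/zip', mimetype_inner 'text/csv') this gives dcat:mediaType 'text/csv', so the container is dropped from the DCAT output.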

Other formats

If data is in an unlisted format, then in the form, instead of using the dropdown, you just want to type free text. We could store this in the db as resource.format_other. And package_show would use that text as both short and long name, with blank mimetype and format_canonical.

Inner format

The resource.mimetype can be different from resource.mimetype_inner when the data comes in a container such as ZIP, TAR, RAR, GZIP (are there any examples which are not compression?). DCAT says nothing about containers, so in the mapping we just ignore it if there is one.

Multiple mime-types for a file format.

e.g. Excel files are defined by IANA as:

  • application/vnd.ms-excel
  • application/vnd.ms-excel.addin.macroEnabled.12
  • application/vnd.ms-excel.sheet.binary.macroEnabled.12
  • application/vnd.ms-excel.sheet.macroEnabled.12
  • application/vnd.ms-excel.template.macroEnabled.12

and then there are the unofficial ones:

  • application/excel
  • application/x-excel
  • application/x-msexcel

I suggest we don't worry about the variants and just pick what looks canonical to us:

Question: Separate XLS and XLSX?

General users might not care about the differences between these formats. They both open in Excel the same. (Assuming they have Excel 2007 or later which is pretty likely.) It could be a slight faff to users entering it in the form to get it right, and may result in high inaccuracies as they don't care.

However I believe it is useful to some client software to know whether a file is XLS or XLSX format. It saves having to do detection in Messytables, although I can't remember the details. It is not too hard to use scripts to correct the resource from XLS to XLSX and vice-versa, so that could solve the accuracy thing.

Also, it would be slightly more correct to use the different mime-types for these two types:

  • application/vnd.ms-excel
  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
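
A rough sketch of what a tidy-up of variant mime-types might look like. The alias table and `canonical_mimetype` are hypothetical, and treating the three unofficial Excel types as aliases of application/vnd.ms-excel is only one possible choice of canonical value; the XLSX type is left distinct.

```python
# Hypothetical alias table: unofficial variants mapped to a chosen canonical type.
MIMETYPE_ALIASES = {
    "application/excel": "application/vnd.ms-excel",
    "application/x-excel": "application/vnd.ms-excel",
    "application/x-msexcel": "application/vnd.ms-excel",
}


def canonical_mimetype(mimetype):
    """Return the canonical form of a mime-type, or the cleaned input if unknown."""
    cleaned = mimetype.strip()
    # Look up case-insensitively, but keep the original casing when no alias exists.
    return MIMETYPE_ALIASES.get(cleaned.lower(), cleaned)
```

A script correcting existing resources could run each stored resource.mimetype through a function like this.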

Thoughts welcome.

When no mime-type exists

e.g. a Shapefile looks like a zip, yet it should be described in CKAN as a Shapefile. Follow DCAT's suggestion to fall-back on using dct:format. e.g. 'Shapefile' in this case.

This could also be used for GeoJSON to differentiate it from JSON.

Landing pages

It is common to put in a resource.url which is not the data, but a web page from which you can get the data. This might be because there are vast numbers of links. Or maybe a form needs to be filled in to obtain the data. A good legitimate example is a web form for making SPARQL requests.

DCAT expresses these not on a dcat:Distribution, but as the dcat:landingPage property of a dcat:Dataset. It would make sense in CKAN to add a resource.resource_type for landing page (if not done so already) and a corresponding tab for it in the form, to allow easy mapping to DCAT.

API resources

Where a dataset is accessed via an API, the base URL is stored in a CKAN resource, and CKAN sets resource.resource_type="api" instead of "file". APIs in DCAT are differentiated in a dcat:Distribution by having an 'accessURL' instead of a 'downloadURL'.
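
As an illustration of that distinction, a mapper might choose the URL property from resource_type along these lines. `url_property` is a hypothetical helper and the returned dict is just a stand-in for real RDF output.

```python
def url_property(resource):
    """Pick the DCAT URL property for a CKAN resource based on its resource_type."""
    if resource.get("resource_type") == "api":
        # APIs are described by where to access them, not a file to download.
        return {"dcat:accessURL": resource["url"]}
    return {"dcat:downloadURL": resource["url"]}
```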

An API is not a format in itself. The classic API example is a SPARQL endpoint. This might return data in RDF/XML, TTL or JSON format, but CKAN needs to display this resource as 'SPARQL'. So until now in CKAN, we've set resource.format="api/sparql", even though that is not the format of the response. It's not clear how DCAT would represent it as a SPARQL endpoint, so I suggest we continue regarding SPARQL as a format.

Other API examples: WMS, WFS, ONS Open API, Socrata Open Data API.

Centralized list of formats

You could imagine OKF setting up a central server that serves the full list of formats, for CKANs everywhere to use. We envisaged a system like this for licenses, but it was difficult. Every time you ran a CKAN instance it wanted to go and get the list, causing delays on start-up. You worry that the format server will go down, and that might cause the CKAN to not work at all.

I don't think there are so many formats, or many new ones under the sun to worry about, so I suggest just having them in a Python file checked into CKAN code. We can always switch to a central server in future if called for.

@cygri

cygri commented Nov 21, 2013

Looks all good to me.

On the centralised list of formats, personally, I'd prefer to see the list of formats managed as a separate GitHub project and as a data file rather than as code. This opens it up for uses besides CKAN (for example, in other DCAT-using contexts!), and with a bit of luck (and work) it could become a de facto standard list of formats.

@davidread
Contributor Author

Thanks @cygri I like that idea!

Do you happen to know how DCAT would represent a SPARQL endpoint? I can't believe W3C have not considered this...

@cygri

cygri commented Nov 21, 2013

I guess we'd defer to VoID for describing SPARQL endpoints, which brings a bunch of handy things specifically for describing datasets with RDF distributions. See here. VoID can be used in conjunction with DCAT; a void:Dataset is a subclass of dcat:Dataset.

When it comes to distinguishing other specific APIs, DCAT doesn't provide anything to distinguish them. One approach would be to define subclasses of dcat:Distribution for specific types of APIs.

@davidread
Contributor Author

That's great about VoID for SPARQL and subclassing for other APIs. Maybe this is the spark to start a DCAT mapper tool...

@wardi
Contributor

wardi commented Nov 26, 2013

If adding a new database table and actions for editing the list of valid formats is being considered I suggest using a tag vocabulary instead. Its tables and actions already exist, and a validator like is_one_of_tag_vocabulary('foo') is easy to add.

If adding a static list as a configuration file, please make it similarly general so other fields could use a scheme like that for validation, e.g. is_one_of_static_list('foo') => development.ini: ckan.list.foo = /srv/bar/baz/list.json => /srv/bar/baz/list.json content
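
A sketch of how such a validator could work, assuming the list file is a JSON array of allowed strings. `is_one_of_static_list` does not exist in CKAN; here a plain dict stands in for the CKAN config and ValueError stands in for CKAN's own validation error.

```python
import json


def is_one_of_static_list(list_name, config):
    """Build a validator accepting only values from the JSON list named in config."""
    # e.g. config["ckan.list.foo"] = "/srv/bar/baz/list.json"
    path = config["ckan.list." + list_name]
    with open(path) as f:
        allowed = set(json.load(f))

    def validator(value):
        if value not in allowed:
            raise ValueError("%r is not in the %r list" % (value, list_name))
        return value

    return validator
```

The list is read once when the validator is built, so a site would need to restart to pick up changes to the file.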

@davidread
Contributor Author

Lots of questions thrown up by today's ckan dev chat:

Need to outline the UI for selecting format.

Fear that the 'free text' format option might mean multiple expressions for the same format. Many site admins haven't the time to tidy these up so maybe just don't allow 'other' as free text - just store as 'other'.

If we do allow 'other' as 'free text', over the API we need users to be explicit that the value is free text, so that in other circumstances we can validate it.

Extensions can add/remove formats from the short list and long list. Or simply make it configurable.

Cf. the schema validation of format in the Canadian setup.

Have config option to switch on validation? Not every site would want it?

If the list of formats was stored in the db it could get v complicated, with migrations of the existing resources' formats, ones added via config etc. Better as a separate data file. If it was going to be stored in the db, then maybe use TagVocabulary.

Compare with related DCAT work:

@micheldumontier

I'd like to see this get resolved asap. preliminary analysis of datahub.io shows significant variation - i documented it here: http://help.datahub.io/discussions/problems/12-consistency-in-descriptions-of-formats

i like the idea of specifying the mime-type and the file format. we need a clear specification for when there is a zip wrapper over one or more file formats, e.g. a tar.gz that contains n-triples or n-quads.

@davidread
Contributor Author

The list of formats is developing as a data file in #1350

I guess it doesn't matter too much how CKAN stores the format, as long as the mapping to DCAT is correct. And this is progressing in ckanext-dcat.

So I think this ticket could usefully focus on how the data is stored in CKAN and the UI for selecting a format in the form.

It seems most useful to have some sort of drop-down or autocomplete to select the format, and save it as a mimetype in preference, as an extension if no mimetype exists, or allow free text as a last resort.

@micheldumontier agreed that the proliferation of mime-types is a problem - I imagine providing the user with a drop-down to select the format from would solve this - do you agree? And we could do with a script to tidy up the mime-types to all be the canonical ones.

@micheldumontier can you explain a use case for specifying both the mime-type and file format? Most of the time there is a direct correlation between mime-type and file format, but not always - are you interested in the differences and if so, which ones?

@micheldumontier

@davidread - yes autocomplete + drop down of available mime-types is desirable.

right, so for instance, we're publishing data as gzipped content (.gz), but the content type inside that gzipped file is n-triples or n-quads. Users/programs need to know this if they're going to pull the content out correctly. so i've been publishing my dataset descriptions annotated with 2 dc:format descriptions.

so i wonder whether this is only an issue with archived data?
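
A small sketch of how the gzipped case could be recorded with the two existing CKAN fields, guessing from the filename alone. The lookup tables and `describe` are hypothetical, and a multi-file archive such as a tar.gz is not handled.

```python
import os

# Hypothetical lookup tables; a real list would be much longer.
CONTAINER_MIMETYPES = {".gz": "application/gzip", ".zip": "application/zip"}
CONTENT_MIMETYPES = {
    ".nt": "application/n-triples",
    ".nq": "application/n-quads",
    ".csv": "text/csv",
}


def describe(filename):
    """Guess mimetype and mimetype_inner from a filename such as 'data.nt.gz'."""
    stem, ext = os.path.splitext(filename.lower())
    if ext in CONTAINER_MIMETYPES:
        # The extension before the container suffix hints at the contents.
        inner_ext = os.path.splitext(stem)[1]
        return {
            "mimetype": CONTAINER_MIMETYPES[ext],
            "mimetype_inner": CONTENT_MIMETYPES.get(inner_ext, ""),
        }
    return {"mimetype": CONTENT_MIMETYPES.get(ext, ""), "mimetype_inner": ""}
```

For 'data.nt.gz' this records application/gzip as the mimetype and application/n-triples as the mimetype_inner, so a client knows what it will find after decompressing.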

@philipashlock

DCAT says nothing about containers, so in the mapping we just ignore it if there is one.

Has there been any more discussion about using dcat:mediaType or anything else for resource.mimetype_inner in a way that distinguishes it from using dcat:mediaType for resource.mimetype?

cc: @philarcher1

@TkTech closed this as completed May 10, 2016

@davidread
Contributor Author

This issue was closed due to inactivity. Feel free to reopen if you have more feedback or are interested in working on it.

@akuckartz

This issue was mentioned regarding the German "Open Government Data Standard 2.0" (draft).

See https://joinup.ec.europa.eu/asset/ogd2_0/issue/format#comment-18461 and https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/question-usage-mimetype-and-mediatype-zipped/container-distribu

Maybe this issue therefore should be reopened? (I generally think that issues should not be closed due to inactivity.)


8 participants