Use DCAT standard for format field #1336
Comments
Looks all good to me. On the centralised list of formats, personally, I'd prefer to see the list of formats managed as a separate GitHub project and as a data file rather than as code. This opens it up for uses besides CKAN (for example, in other DCAT-using contexts!), and with a bit of luck (and work) it could become a de facto standard list of formats.
Thanks @cygri I like that idea! Do you happen to know how DCAT would represent a SPARQL endpoint? I can't believe W3C have not considered this...
I guess we'd defer to VoID for describing SPARQL endpoints, which brings a bunch of handy things specifically for describing datasets with RDF distributions. See here. VoID can be used in conjunction with DCAT; a When it comes to distinguishing other specific APIs, DCAT doesn't provide anything to distinguish them. One approach would be to define subclasses of
That's great about VoID for SPARQL and subclassing for other APIs. Maybe this is the spark to start a DCAT mapper tool...
If adding a new database table and actions for editing the list of valid formats is being considered, I suggest using a tag vocabulary instead. Its tables and actions already exist, and a validator like is_one_of_tag_vocabulary('foo') is easy to add. If adding a static list as a configuration file, please make it similarly general so other fields could use the same scheme for validation, e.g. is_one_of_static_list('foo') => development.ini: ckan.list.foo = /srv/bar/baz/list.json => srv/bar/baz/list.json content
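The static-list idea above could look roughly like this. It is a minimal sketch, assuming the list file is a JSON array of strings; the `Invalid` exception class and the body of `is_one_of_static_list` are illustrative stand-ins, not existing CKAN API, and the path would come from a config option such as `ckan.list.foo`.

```python
import json


class Invalid(Exception):
    """Stand-in for a validation error (hypothetical, not CKAN's own class)."""


def is_one_of_static_list(path):
    """Build a validator that accepts only values found in a JSON list file.

    `path` is assumed to point at a JSON array of strings.
    """
    with open(path, encoding="utf-8") as f:
        allowed = set(json.load(f))

    def validator(value):
        # Reject anything that is not in the loaded list.
        if value not in allowed:
            raise Invalid("%r is not in the list loaded from %s" % (value, path))
        return value

    return validator
```

Because the validator is built from a path rather than hard-coded to formats, any other field could reuse it with its own list file.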
Lots of questions thrown up by today's CKAN dev chat: We need to outline the UI for selecting a format. There is a fear that a 'free text' format option might lead to multiple expressions for the same format. Many site admins haven't the time to tidy these up, so maybe just don't allow 'other' as free text and store it as 'other'. If we do allow 'other' as free text, then over the API we need users to be explicit that the value is free text, so that in other circumstances we can validate it. Extensions could add/remove formats from the short list and long list, or we could simply make it configurable. Compare with the schema validation of format in the Canadian setup. Have a config option to switch on validation? Not every site would want it? If the list of formats was stored in the db it could get very complicated, with migrations of the existing resources' formats, ones added via config etc., so it is better as a separate data file. If it was going to be stored in the db, then maybe use TagVocabulary. Compare with related DCAT work:
I'd like to see this resolved asap. Preliminary analysis of datahub.io shows significant variation, which I documented here: http://help.datahub.io/discussions/problems/12-consistency-in-descriptions-of-formats I like the idea of specifying both the mime-type and the file format. We need a clear specification for when there is a zip wrapper over one or more file formats, e.g. a tar.gz that contains n-triples or n-quads.
The list of formats is developing as a data file in #1350. I guess it doesn't matter too much how CKAN stores the format, as long as the mapping to DCAT is correct. And this is progressing in ckanext-dcat. So I think this ticket could usefully focus on how the data is stored in CKAN and the UI for selecting a format in the form. It seems most useful to have some sort of drop-down or autocomplete to select the format, and to save it as a mimetype in preference, as an extension if no mimetype exists, or as free text as a last resort. @micheldumontier agreed about the proliferation of mime-types - I imagine providing the user with a drop-down to select the format from would solve this - do you agree? And we could do with a script to tidy up the mime-types so they are all the canonical ones. @micheldumontier can you explain a use case for specifying both the mime-type and file format? Most of the time there is a direct correlation between mime-type and file format, but not always - are you interested in the differences and if so, which ones?
@davidread - yes, autocomplete plus a drop-down of available mime-types is desirable. Right, so for instance, we're publishing data as gzipped content (.gz), but the content type inside that gzipped file is n-triples or n-quads. Users/programs need to know this if they're going to pull the content out correctly, so I've been publishing my dataset descriptions annotated with two dc:format descriptions. I wonder whether this is only an issue with archived data?
Has there been any more discussion about using cc: @philarcher1
This issue was closed due to inactivity. Feel free to reopen if you have more feedback or are interested in working on it.
This issue was mentioned regarding the German "Open Government Data Standard 2.0" (draft). See https://joinup.ec.europa.eu/asset/ogd2_0/issue/format#comment-18461 and https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/question-usage-mimetype-and-mediatype-zipped/container-distribu Maybe this issue therefore should be reopened? (I generally think that issues should not be closed due to inactivity.)
This is a suggestion to express the resource format in the database using mime-types where possible. This is what DCAT specifies, so this would facilitate interoperability between data catalogs. Falling in with standards is a good thing for the community generally. However, the mapping between formats in CKAN and DCAT is probably 1:1, so there is no intrinsic gain to the CKAN model by doing this.
We currently do store a mimetype value, often filled in by the data_storer or ckanext_qa, but it would be good to fill it in as a matter of course and be in step with the format field.
Current CKAN convention:
CKAN's edit form asks for the format, using autocomplete based on previous entries (and a standard list?). The mimetype and mimetype_inner are filled automatically by datastorer? (CHECK)
DCAT formats:
DCAT "Distribution" (equivalent to a CKAN resource) has:
dcat:mediaType "SHOULD be used when the media type of the distribution is defined in IANA, otherwise dct:format MAY be used with different values."
Proposal
It makes sense for CKAN to hold details about lots of formats, along the lines of the ECPortal and DGU lists
Each format would have properties including:
'format' has always been the principal field in CKAN, probably used by a host of customized templates and extensions, so there would be an advantage for it to be a 'displayable' version of the format. The mimetype and mimetype_inner are rarely used, so could be recycled much more easily.
Since we have the table of formats, we could refer to format in the resource in any number of ways, even by ID. However, people using our API would find it useful to see the short display string as well as have the DCAT info. So I suggest we have in the database:
and when you package_show, you see:
To map this to DCAT:
Other formats
If data is in an unlisted format, then in the form, instead of using the dropdown, you just want to type free text. We could store this in the db as resource.format_other. And package_show would use that text as both short and long name, with blank mimetype and format_canonical.
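The fallback for unlisted formats could be sketched as follows. This is only an illustration under stated assumptions: the format table is a plain dict, the key names `short_name` and `long_name` and the helper `format_for_display` are invented here, and using the table key as `format_canonical` for listed formats is a guess. Only `format_other`, `mimetype` and `format_canonical` are names taken from the text.

```python
# Hypothetical format table; the entry and its key names are illustrative only.
FORMATS = {
    "CSV": {
        "short_name": "CSV",
        "long_name": "Comma Separated Values",
        "mimetype": "text/csv",
    },
}


def format_for_display(resource):
    """Return the format details that package_show might expose for a resource."""
    listed = FORMATS.get(resource.get("format"))
    if listed is not None:
        # Assumption: format_canonical carries the table key for listed formats.
        return dict(listed, format_canonical=resource["format"])
    # Unlisted format: the free text serves as both short and long name,
    # and mimetype and format_canonical stay blank.
    other = resource.get("format_other", "")
    return {
        "short_name": other,
        "long_name": other,
        "mimetype": "",
        "format_canonical": "",
    }
```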
Inner format
The resource.mimetype can be different from resource.mimetype_inner when the data comes in a container such as ZIP, TAR, RAR, GZIP (are there any examples which are not compression?). DCAT says nothing about containers, so in the mapping we just ignore it if there is one.
Multiple mime-types for a file format.
e.g. Excel files are defined by IANA as:
and then there are the unofficial ones:
I suggest we don't worry about the variants and just pick what looks canonical to us:
Question: Separate XLS and XLSX?
General users might not care about the differences between these formats. Both open in Excel the same way (assuming they have Excel 2007 or later, which is pretty likely). It could be a slight faff for users entering it in the form to get it right, and may result in many inaccuracies because they don't care.
However, I believe it is useful to some client software to know whether a file is XLS or XLSX format. It saves having to do detection in Messytables, although I can't remember the details. It is not too hard to use scripts to correct the resource from XLS to XLSX and vice-versa, so that could solve the accuracy problem.
Also, it would be slightly more correct to use the different mime-types for these two types:
Thoughts welcome.
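As a sketch of what "pick what looks canonical" plus a tidy-up script might amount to: the two constants below are the IANA-registered Excel media types, while the alias table is an illustrative selection of unregistered variants, not a complete or authoritative list. Keeping XLS and XLSX separate is just one possible answer to the question above, and `canonical_mimetype` is a hypothetical helper name.

```python
# The two IANA-registered Excel media types.
XLS = "application/vnd.ms-excel"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Unregistered variants seen in the wild; this selection is illustrative.
ALIASES = {
    "application/msexcel": XLS,
    "application/x-msexcel": XLS,
    "application/x-excel": XLS,
    "application/excel": XLS,
}


def canonical_mimetype(value):
    """Map a known variant to its canonical media type; pass others through."""
    # Media type names are case-insensitive, so normalise before the lookup.
    cleaned = value.strip().lower()
    return ALIASES.get(cleaned, cleaned)
```

A clean-up script could run every stored mimetype through a function like this and write back any value that changes.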
When no mime-type exists
e.g. a Shapefile looks like a zip, yet it should be described in CKAN as a Shapefile. Follow DCAT's suggestion to fall-back on using dct:format. e.g. 'Shapefile' in this case.
This could also be used for GeoJSON to differentiate it from JSON.
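The DCAT rule quoted earlier (dcat:mediaType when an IANA media type exists, dct:format otherwise) reduces to a small choice. A minimal sketch, assuming the mimetype is left blank when no IANA type applies; `dcat_format_properties` is a hypothetical helper and the property names are given as plain strings rather than RDF terms.

```python
def dcat_format_properties(mimetype, display_name):
    """Choose DCAT properties for a distribution's format.

    Uses dcat:mediaType when an IANA media type is known, otherwise falls
    back to dct:format with the display name (e.g. 'Shapefile').
    """
    if mimetype:
        return {"dcat:mediaType": mimetype}
    return {"dct:format": display_name}
```

For example, a CSV resource with mimetype `text/csv` would get `dcat:mediaType`, while a Shapefile with a blank mimetype would get `dct:format` set to 'Shapefile'.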
Landing pages
It is common to put in a resource.url which is not the data, but a web page from which you can get the data. This might be because there are vast numbers of links. Or maybe a form needs to be filled in to obtain the data. A good legitimate example is a web form for making SPARQL requests.
DCAT expresses these not on a dcat:Distribution, but as the dcat:landingPage property of a dcat:Dataset. It would make sense in CKAN to add a resource.resource_type for landing page (if not done so already) and a corresponding tab for it in the form, to allow easy mapping to DCAT.
API resources
Where a dataset is accessed via an API, the base URL is stored in a CKAN resource, and CKAN sets resource.resource_type="api" instead of "file". APIs in DCAT are differentiated in a dcat:Distribution by having an 'accessURL' instead of a 'downloadURL'.
An API is not a format in itself. The classic API example is a SPARQL endpoint. This might return data in RDF/XML, TTL or JSON format, but CKAN needs to display this resource as 'SPARQL'. So until now in CKAN, we've set resource.format="api/sparql", even though that is not the format of the response. It's not clear how DCAT would represent it as a SPARQL endpoint, so I suggest we continue regarding SPARQL as a format.
Other API examples: WMS, WFS, ONS Open API, Socrata Open Data API.
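Taken together with the landing page section, the URL mapping might be sketched like this. The "api" and "file" values come from the text; the `"landing_page"` value is hypothetical (the text only proposes adding such a resource_type), as is the helper name `dcat_url_property`.

```python
def dcat_url_property(resource):
    """Pick the DCAT property that should carry a resource's URL.

    Returns a (subject, property) pair, where subject says whether the URL
    belongs on the dcat:Dataset or on a dcat:Distribution.
    """
    resource_type = resource.get("resource_type", "file")
    if resource_type == "landing_page":  # hypothetical value
        return ("dataset", "dcat:landingPage")
    if resource_type == "api":
        return ("distribution", "dcat:accessURL")
    # Plain files, and anything unrecognised, are treated as downloads.
    return ("distribution", "dcat:downloadURL")
```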
Centralized list of formats
You could imagine OKF setting up a central server that serves the full list of formats, for CKANs everywhere to use. We envisaged a system like this for licenses, but it was difficult. Every time you ran a CKAN instance it wanted to go and get the list, causing delays on start-up. You worry that the format server will go down, and that might cause the CKAN to not work at all.
I don't think there are so many formats, or many new ones under the sun to worry about, so I suggest just having them in a Python file checked into CKAN code. We can always switch to a central server in future if called for.