Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource format not guessed automatically #1350

Merged
merged 6 commits into from
Mar 26, 2014

Conversation

kindly
Copy link
Contributor

@kindly kindly commented Dec 4, 2013

[WIP] The resource format isn't guessed automatically, but CKAN claims to guess it automatically. I recommend removing the text for for 2.2 and fix guessing for 2.3.

automatically

@nigelbabu
Copy link
Contributor Author

At the dev meeting, we decided to do this guessing in a custom validator.

@kindly
Copy link
Contributor

kindly commented Dec 5, 2013

This has turned into a major rewrite of how to handle resource formats.

  • It consists of a new json file which maps input formats to canonical we formatted formats. This is configurable by the instance by specifying an alternative path. See https://github.com/okfn/ckan/blob/9fadea50e08379164ab1b1070544f9795c80a669/ckan/config/resource_formats.json
    At the moment only the second field in the list is actually used for display.
  • All logic concerning manipulating formats has been moved to a validator. Helpers have also been, added which the validators call, that do the work of loading the file.
  • Adding formats by default if the user leaves blank based on the mimetype, this is also a validator.

ISSUES
The format field not get converted to its clean form when saving and showing in the database. This could result in extra revisions when the data underneath has not changes. This also means we will be converting to uppercase more resources than we do currently. This is all fairly easily override nonetheless.

@amercader
Copy link
Member

To be decided for 2.2 or 2.3 pending review

@davidread you are probably interested in this

url = data.get(key[:-1] + ('url',), '')
type, encoding = mimetypes.guess_type(url)
if type:
data[key] = type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"value = data[key]" and "data[key] = type" Can you be more specific with your variable names?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you can't get away from data and key for a validator, but value and type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value is used everywhere in the validators as the value of that key. So it may be bad but its a least consistent.

Also type, encoding is what is the documented return types.
http://docs.python.org/3.4/library/mimetypes.html#module-mimetypes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on value.

In the context of the mimetype docs "type" might be a useful abbreviation of "mimetype". But here I think 'mimetype' would be clearer. 'type' is way too overloaded in coding everywhere... I'm just nitpicking sorry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change it to mimetype! I agree its better.

@wardi
Copy link
Contributor

wardi commented Dec 5, 2013

My understanding from the call on Tuesday was that the format field is going from what was usually a file extension value (or almost any value a user might like to fill in) to a field that will contain a mime type some of the time instead.

Reading the code it looks like the the format being stored is the file extension, not the mime type. Did I misunderstand the intent during call or am I misreading the code?

@kindly
Copy link
Contributor

kindly commented Dec 5, 2013

This pull request is not the one that changes the format field to be a mimetype even though I wanted to prepare for the time if we decide to move that way... I would not try and squeeze that in version 2.2, as I do not think we have settled on this decision.

I have tried to make the new behaviour as backwards compatible as possible (except for the fact it fills in a format if left empty and removes the nasty "unified_format" stuff).

@amercader
Copy link
Member

@kindly can we have a small separate pull request with just the template notice string change, so at least we have this fixed before the string freeze? Cheers

@wardi
Copy link
Contributor

wardi commented Dec 10, 2013

what's missing is a way to update all the resources based on the new validator. This depends on the schema so it doesn't feel right to make it a migration. What about a paster command that does package_update(package_show(id)) on all the packages? paster dataset touch-all or something

kindly added a commit that referenced this pull request Dec 12, 2013
[#1350] Remove misleading resource format notice
@davidread
Copy link
Contributor

David to come back to looking at this in a couple of weeks

["MXD", "ESRI ArcGIS project file", "application/x-mxd", []],
["TAR", "TAR Compressed File", "application/x-tar", []],
["PNG", "PNG Image File", "image/png", []],
["RSS", "RSS feed", "application/rss+xml", []]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add this line?

["GeoJSON", "Geographic JavaScript Object Notation", "application/json", []]

["SVG", "SVG vector image", "image/svg+xml", ["svg"]],
["PPT", "Powerpoint Presentation", "application/vnd.ms-powerpoint", []],
["JPEG", "JPG Image File", "image/jpeg", ["jpeg", "jpg"]],
["Sparql Results", "Sparql Results", "application/sparql-results+xml", []],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is useful to split SPARQL endpoints into two. This is a list of resource formats, and people recognize a SPARQL URL as an API. I don't think it is going to happen that someone publishes a text file with a SPARQL query or a SPARQL response in it, are they? Can we just get rid of the "Sparql query" line and rename "Sparql Results" to ["SPARQL", "SPARQL endpoint", "application/sparql-results+xml", []],?

["NetCDF", "NetCDF File", "application/netcdf", []],
["ArcGIS Map Service", "ArcGIS Map Service", "ArcGIS Map Service", ["arcgis map service"]],
["TSV", "Tab Separated Values File", "text/tsv", []],
["WFS", "Web Feature Service", "text/xml", []],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mimetype for WFS should be "application/xml" i.e. the same as for "XML".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that setting it to the same as for XML would spoil the if_empty_guess_format as it might see "file.xml" and may say it is a WFS file. Better to have no mimetype for WFS?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually for WFS (at least 1.1.0) the mime type defined by the OGC standard is text/xml. I think is probably fine to leave it like this for the time being because of what you mention regarding if_empty_guess

davidread pushed a commit that referenced this pull request Mar 26, 2014
[#1350] Resource format not guessed automatically
@davidread davidread merged commit 7587c6e into master Mar 26, 2014
@davidread davidread deleted the 1350-resource-format-change branch March 26, 2014 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants