Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core model multilingual fields #2678

Closed
amercader opened this issue Oct 7, 2015 · 31 comments

Comments

Projects
None yet
7 participants
@amercader
Copy link
Member

commented Oct 7, 2015

The consensus for handling multilingual metadata fields seems to be to follow the way ckanext-fluent handles them.

For custom fields this looks uncontroversial:

{
...
  "favourite_fruit": {
    "en": "Peach",
    "ca": "Préssec",
    "es": "Melocotón"
  }
...
}

There is an issue though with using this approach on existing core fields like title or notes. Up until this point these have always had a single string value, and changing this to return a dict on some occasions would break things, both in core (eg in templates) or in extensions.

  1. For dealing with this in the current API version I'm proposing that for these particular core fields that need to be translated we add a separate field:

    {
    ...
    "title_lang": {
      "en": "Annual Budget",
      "ca": "Pressupost Anual",
      "es": "Presupuesto Anual"
    }
    ...
    }
    

    I have no strong preference for what the _lang suffix actually is called (_trans, _i18n ...)

    The list of fields that would need these is:

    • Dataset title
    • Dataset notes
    • Resource name
    • Resource description
    • Resource format
    • Tag name (these are likely to be handled separately)

    Just to clarify, for now it would be ckanext-fluent (or any other extension) who will create these fields, no changes on ckan core would be made.

  2. Moving forward, on a new API version, we can decide whether the same pattern can be applied directly to core fields as well:

    {
    ...
    "title": {
      "en": "Annual Budget",
      "ca": "Pressupost Anual",
      "es": "Presupuesto Anual"
    }
    ...
    }
    

    The main issue I see is what happens with those CKAN instances (the majority) that don't handle multilingual metadata. Do we allow string and dict values on the same field? (that doesn't sound like a good idea tbh). Do we always enforce a language key, even if there is only one for the instance default locale? eg:

    {
    ...
    "title": {
      "en": "Annual Budget"
    }
    ...
    }
    

    or

    {
    ...
    "title": {
      "es": "Presupuesto Anual"
    }
    ...
    }
    
@amercader

This comment has been minimized.

Copy link
Member Author

commented Oct 7, 2015

@davidread @brew @rossjones @joetsoi @wardi let me know what you think, specially about point 1 as it relates to #2668

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 7, 2015

👍 and I'm fine with the _lang suffix

For a field like frequency I'd rather use a controlled list that stores a single string value with the translations in the schema, like how the scheming 'select' preset works, not store all the languages in every record, so that might not be a great example.

Is resource format a multilingual field? I thought those were file extensions, essentially

@rossjones

This comment has been minimized.

Copy link
Collaborator

commented Oct 7, 2015

Same here (+1) -> _lang makes a lot of sense, but I think having a controlled list for frequency is really a different issue. Even if it were a fixed list, it would presumably have a translation.

WRT your question on could we do this for the core fields in API v4 even if an instance is single-language - I'd say yes, even though it possibly provides more work for the caller - I'd guess ckanapi (and libs in other languages) could take care of abstracting this away if there's only one language, or by allowing a user to specify a language and defaulting to en.

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 7, 2015

@rossjones you're right, frequency as a controlled list is a different discussion completely.

I think we can live with title and title_lang etc for quite a while. Adding multilingual fields to ckan core is a larger project that we'll need to plan out more carefully. I think it will involve breaking any currently existing IDatasetForm plugins.

@davidread

This comment has been minimized.

Copy link
Contributor

commented Oct 7, 2015

Thanks for making this proposal. So to be clear we'd see something like this in the API:

"title": "Ottawa.ca Web Visitation – Mont",
"title_lang": {
    "fr": "Visites du site Web Ottawa.ca – Mensuellement",
    "en": "Ottawa.ca Web Visitation – Monthly"
},

And the "title" could be in whatever language - no guarantees. But it should be a repeat of one of those in title_lang.

I think that's a good standard to agree to.

I think layout makes more sense than having a 'multilingual' element at the top level with the multilingual versions of everything below. That's the system I think is what the Pan-European Data Portal is doing - they said in April they are lumping translations to all fields into a single extra field. So if the API had it that way too then e.g. the translated resource titles would not be under the resource - that would not very nice.

Regarding a potential new API version, I'm not keen in making "title" multilingual for these few sites that need it. I'd rather that dict stayed in the "title_lang" field. Then the majority of clients can continue to use the simple "title" field as normal, and the few that really care about multi-language can take advantage of the "title_lang" field, if it is there. These little barriers to simple API use all add up.

Now, assuming we agree this, what's the next step? Do we just document it and encourage existing sites (Ca, EU Publications Office) to move that way?

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 7, 2015

The next step that's important to me is just having the core ckan templates for breadcrumbs and other places look for a "title_lang" field and use it if it's there. Everything else is easy to do from an extension.

@amercader

This comment has been minimized.

Copy link
Member Author

commented Oct 8, 2015

Apologies for the frequency example which was indeed terrible, have now replaced.

Comments:

Is resource format a multilingual field? I thought those were file extensions, essentially

On most cases yes. There a couple of translatable names on our default values like "Atom feed" and when harvesting DCAT datasets or others I've definitely have ended up with stuff like "Comma Separate Values" or "Web Map Service". Probably not a big deal to have it multilingual or not.

And the "title" could be in whatever language - no guarantees. But it should be a repeat of one of those in title_lang.

Can you clarify how fluent would handle this @wardi ? I'd really expect to always have the same language on this fields, definitely if you have ckan.default_locale set up, otherwise defaulting to one of them.

The next step that's important to me is just having the core ckan templates for breadcrumbs and other places look for a "title_lang" field and use it if it's there. Everything else is easy to do from an extension.

I really don't want to have {% if _trans %} all over the place, so what about:

  • Having a helper function to get a metadata field value that will take care of checking if the field is multilingual and returns the relevant value depending on the current language:

    h.get(dataset_dict, 'title')

  • Having the dict that the templates get already have the title, notes etc already set up to the relevant language. I'm not sure about what the best hook is to do that, I think that @brew and @wardi may have already talked about this?

I now realized that we will need Group/Org title and description as well.

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 8, 2015

@amercader fluent would detect that it's being used on a core field (scheming already does this to work with extras) and if so, switch to storing its values in <fieldname>_lang.

Next it adds an output validator to the core field <fieldname> that pulls the value from <fieldname>_lang, choosing first the language set in ckan.default_locale, and if that's not present, choosing the first language available (alphabetical based on language code, just to be consistent)

This approach has one drawback: the title etc. stored in the database won't match the one returned from the API. Do you think that will matter?

For the helper, yes, let's just create a h.get_language_field or similar that behaves like scheming_language_text

@amercader

This comment has been minimized.

Copy link
Member Author

commented Oct 9, 2015

@wardi I might be missing something, but can't the appropriate language for <fieldname> (ie the locale chosen by the user or the default) be set up at the validator level as well so the dict already has the relevant values and we don't need the helper?

This approach has one drawback: the title etc. stored in the database won't match the one returned from the API. Do you think that will matter?

What is the value for <fieldname> stored in the database? is it empty, one of the values, or a dumped dict?

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 9, 2015

@amercader at least for datasets we return the json straight from solr in package_show not pass it through the validation logic, so we'd be making everything slower just to potentially swap in a language. I also prefer the results from the API to be independent of a users' language. I'm not used to language being an API parameter.

What would you like stored in <fieldname>? I think a blank string is fine

@amercader

This comment has been minimized.

Copy link
Member Author

commented Oct 15, 2015

Sorry, this slipped off my mind.
Understood regarding validators. What about using the same logic on the before_view hook that is only called on the dict when rendering a page, not the API. I think the old multilingual extension does something similar https://github.com/ckan/ckan/blob/master/ckanext/multilingual/plugin.py#L317

Regarding the DB not having the same value that the API, the only issue I can think of is if you are migrating the db to a new instance and don't have the fluent extension enabled you are going to end up with an empty title. But maybe this is an edge case.

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 15, 2015

@amercader the before_view hook isn't called on the dataset editing pages, and I'm not sure it should be #2668. Another problem with before_view is it applies to all dataset types in a ckan instance, not just the ones that might have multilingual values. I think it's useful to have the title value contain something when returned from the API so I favour an output validator and modification to core to look for _lang fields instead of a before_view plugin.

@amercader

This comment has been minimized.

Copy link
Member Author

commented Oct 15, 2015

Sounds sensible. I was trying to avoid changing the templates to much with helpers call, but it looks like it is indeed the best approach.

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 15, 2015

Now the hard decision:

  • h.get(pkg_dict, 'title')
  • h.get_lang(pkg_dict, 'title')
  • h.get_language(pkg_dict, 'title')
  • h.get_language_field(pkg_dict, 'title')
@metaodi

This comment has been minimized.

Copy link
Member

commented Oct 15, 2015

Why do we need the <fieldname>_lang fields? We could just save the dicts in the <fieldsname> have a plugin (core? fluent?) that implements before_view. Or am I missing something? That way instances that don't support multilingual metadata wouldn't change at all and those who do return a dict as described above.

A caller of the API would have to check what kind of value is returned, but I don't think that's bad, because if we have separate fields, what language should be returned in the <fieldname>? There is not always a default langauge. This is probably more a political question than a technical one. I currently don't understand the need for separate fields.

We currently try to implement exactly that, if you want to have a look: https://github.com/ogdch/ckanext-switzerland/blob/3a7279cec1d7b166d4658f48d6d98b4227037a69/ckanext/switzerland/plugin.py#L175-L201 (still work in progress).

@davidread

This comment has been minimized.

Copy link
Contributor

commented Oct 16, 2015

@metaodi Because API clients expect the CKAN API to return a title as a string, and changing that to a dict would break existing users of the API. It also complicates the writing of all clients, whether they are particularly interested in making use of the alternative translations or not.

@davidread

This comment has been minimized.

Copy link
Contributor

commented Oct 16, 2015

@wardi

My understanding is that this method is not for getting the name of the language. You're getting a field's value, ideally in a particular language, but falling back to another language. So something along these lines:

h.get_value_in_language(pkg_dict, 'title')
h.get_translated(pkg_dict, 'title')

I'd also prefer suffix _multilang to _lang. I just can't see how "lang" can make you clearly think "this contains a number of alternative translations" or "this is a multi-language field". "lang" makes me think we are providing the name of the language.

@metaodi

This comment has been minimized.

Copy link
Member

commented Oct 16, 2015

@davidread but that's why we have API versions, right? We just define that on API version X each field is either a string or a dict. I guess in the future it should always be a dict, even if you only use one language. Or you could specify the language as a parameter and then be sure to only get strings.
I think to use multiple fields should be avoided as its only for lazy clients that can anyway stay on an old version of the API.

@amercader

This comment has been minimized.

Copy link
Member Author

commented Oct 16, 2015

@metaodi see points 1 and 2 on my original comment regarding API versions. These changes are focused on the current (v3) version of the API. For clients expecting a string value for title and all of a sudden getting a dict that would be a major breakage. On a new API version we can move to the dict-like value on these fields.

@amercader amercader closed this Oct 16, 2015

@amercader amercader reopened this Oct 16, 2015

@amercader

This comment has been minimized.

Copy link
Member Author

commented Oct 16, 2015

I'm also +1 in the name reflecting that you are getting the value. The more explicit name for me is

h.get_translated_value(pkg_dict, 'title')

but happy to shorten it to h.get_translated.

This will work for all kinds of dicts right (resource, orgs, etc)? as it will only look for a _lang (or _multilang) field at the dict root regardless of what type of object the dict actually is.

Perhaps to be consistent then I'd prefer the field suffix to be _trans or _translated.

How does this sound?

@davidread

This comment has been minimized.

Copy link
Contributor

commented Oct 16, 2015

Those names suit me

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 16, 2015

at least one of the values isn't translated :-)

I read title_lang like title[lang] or title_by_language

@wardi

This comment has been minimized.

Copy link
Contributor

commented Oct 16, 2015

I still prefer _lang but I'm fine with _translated

@joetsoi

This comment has been minimized.

Copy link
Contributor

commented Nov 2, 2015

does anyone object to an _ equivalent that gettext has, as having h.get_translated_value(pkg.<blah>) is making the templates pretty noisy? If so what should the shortened value be

joetsoi added a commit to joetsoi/ckan that referenced this issue Nov 2, 2015

[ckan#2678] Add core translation get_translated helper
Change package templates to use helper
@LaurentGoderre

This comment has been minimized.

Copy link
Contributor

commented Nov 2, 2015

Here is a PR to implement this in fluent ckan/ckanext-fluent#20

@wardi

This comment has been minimized.

Copy link
Contributor

commented Nov 3, 2015

@LaurentGoderre we've settled on _translated as the suffix for these fields

@LaurentGoderre

This comment has been minimized.

Copy link
Contributor

commented Nov 3, 2015

Gotcha!

@LaurentGoderre

This comment has been minimized.

Copy link
Contributor

commented Nov 9, 2015

fluent now supports this. What is the status on the templates changes for this? Can I help?

@joetsoi

This comment has been minimized.

Copy link
Contributor

commented Nov 10, 2015

there's a few more I have to double check but i should just create a pr at this point and get it merged.

@LaurentGoderre

This comment has been minimized.

Copy link
Contributor

commented Nov 10, 2015

open the PR and I can probably help test it

wardi added a commit that referenced this issue Jan 8, 2016

wardi added a commit that referenced this issue Jan 8, 2016

wardi added a commit that referenced this issue Jan 8, 2016

Merge pull request #2729 from ckan/2678-core-multilingual-fields
[#2678] Add core translation get_translated helper
@wardi

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2016

fixed by #2729 thanks @joetsoi

@wardi wardi closed this Jan 8, 2016

wardi added a commit to wardi/ckan that referenced this issue Apr 13, 2016

[ckan#2678] Add core translation get_translated helper
Change package templates to use helper

Conflicts:
	ckan/lib/helpers.py
	ckan/templates/package/confirm_delete.html
	ckan/templates/package/read.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.