Multilingual Datasets, the Government of Canada approach

wardi edited this page Sep 10, 2014 · 1 revision

Update 2014-09:

This page describes the approach we took to launch http://data.gc.ca/ with English and French versions of every text field. We are no longer taking this approach on new sites, which may include datasets in one or more languages (not consistently two languages). We are building https://github.com/open-data/ckanext-fluent/ to solve this problem in a nicer way. Please jump in if you'd like to help.

Problem We're Solving

The Government of Canada Official Languages Act requires that both official languages have an equal status in communications to the public.

CKAN's multilingual extensions allow registering translations for complete field values and displaying them to the user, but not associating them with specific datasets/resources or exposing them in the API. There is no simple way to ensure translations are updated when a dataset or resource is modified, which could result in text in the wrong language being displayed. There is also no way to provide different translations for the same string in different contexts, which may be required.

Our Approach

We add a new dataset field: language.

This field contains a list of pairs, each made up of an ISO 639-2/T three-letter language code and an ISO 3166-1 three-letter country code. The language and country codes are joined with a semicolon and a space, and the pairs are joined with a vertical bar.
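
As a rough illustration of this format, the following Python sketch splits a language field value into pairs and joins them back together. The function names `parse_language_field` and `format_language_field` are hypothetical and not part of CKAN or our extension. The sketch also assumes that whitespace around the vertical bar should be tolerated and that spaces are written around it on output.

```python
def parse_language_field(value):
    """Split a language field value into (language, country) pairs.

    Hypothetical helper: whitespace around the separators is ignored.
    """
    pairs = []
    for pair in value.split("|"):
        # Each pair looks like "eng; CAN": language code, semicolon, country code
        language, country = (part.strip() for part in pair.split(";"))
        pairs.append((language, country))
    return pairs


def format_language_field(pairs):
    """Join (language, country) pairs back into a single field value."""
    return " | ".join(
        "{}; {}".format(language, country) for language, country in pairs
    )
```

For example, `parse_language_field("eng; CAN | fra; CAN")` gives the pairs `("eng", "CAN")` and `("fra", "CAN")`, and passing those pairs to `format_language_field` reproduces the original string.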

For our datasets this field will always contain the value: "eng; CAN | fra; CAN". Note: this will be changed to follow BCP 47 in the future (e.g. "fr" instead of "fra"). This is interpreted as:

  1. The language found in all translated fields is Canadian English, e.g. title, notes... contain Canadian English text
  2. Canadian French versions of translated fields are stored as original field name + _fra, e.g. title_fra, notes_fra... contain Canadian French text (will become *_fr in the future)
  3. Choice fields will contain both Canadian English and Canadian French in the same string separated by a vertical bar and spaces, e.g. "Annually | Annuel"
  4. Very few special characters are allowed in tags, so we use a different approach with tag vocabularies. Tag vocabulary tags will contain the Canadian English tag name + (two spaces) + the Canadian French tag name, e.g.

    "Economics and Industry  Économie et industrie"
    

    This might be changed in the future to store only a single language and use the multilingual extension for the other.

  5. The tags field for free-form tagging is not used because tags can't be easily associated with a particular language. We have chosen instead to use new keywords and keywords_fra (keywords_fr in the future) fields, with comma-separated tags stored as text values.
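
To make these conventions concrete, here is a minimal Python sketch of how a client might read them back out of a dataset dictionary. All four function names are hypothetical and are not part of CKAN or of our extension. The sketch assumes a dataset is available as a plain dict, that "eng" is the base language with no suffix, and that the current `_fra` suffix (not the future `_fr`) is in use.

```python
def translated_field(dataset, field, language):
    """Return the value of a translated field for a language code.

    Assumes "eng" values live in the base field (e.g. "title") and other
    languages in field + "_" + code (e.g. "title_fra").
    """
    key = field if language == "eng" else "{}_{}".format(field, language)
    return dataset.get(key)


def split_choice(value):
    """Split a choice value such as "Annually | Annuel" by language."""
    english, french = (part.strip() for part in value.split("|", 1))
    return {"eng": english, "fra": french}


def split_vocabulary_tag(name):
    """Split a vocabulary tag on the first run of two spaces."""
    english, french = name.split("  ", 1)
    return {"eng": english.strip(), "fra": french.strip()}


def split_keywords(value):
    """Split a comma-separated keywords string into a list of keywords."""
    return [keyword.strip() for keyword in value.split(",") if keyword.strip()]
```

With these helpers, `split_choice("Annually | Annuel")` yields `"Annually"` for `eng` and `"Annuel"` for `fra`, and `split_vocabulary_tag("Economics and Industry  Économie et industrie")` separates the two tag names. Note that `split_choice` would misbehave if a choice label itself contained a vertical bar, which is one way the combined strings are fragile.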

Some of our "translated fields" are actually URLs, where the different language versions are URLs pointing to information in the correct language. These are used for linking to the web site for the program responsible for the dataset or for supporting human-readable materials, not for the actual resource URLs.

Limitations

This approach would allow aggregation of datasets in multiple languages, but tags and choice fields are problematic because of the way multiple languages are wedged into the same strings.