Locales support #33

Closed
patapizza opened this issue May 30, 2017 · 1 comment

Currently, Duckling scopes rules at two levels of granularity: rules specific to one language, and common rules shared across languages. For example, AmountOfMoney has a couple of common rules, plus more specific rules for English, French, etc.

Today, if we want to handle country-specific forms, we have to include them under the language umbrella. Although this has worked well so far, it is not ideal. For instance, we had to include Cantonese variations for Time within the Chinese rules (#21).

An improvement for Duckling would be to add a finer granularity for rules: scoping by locale.
This is something we've had in mind for a while, though we are not able to prioritize it today. The goal here is to open the discussion and maybe come up with an actual implementation.

patapizza commented Jun 28, 2017

On another note, some languages are not included in the two-letter ISO 639-1 standard but do have an ISO 639-3 representation (e.g. Filipino).
Some languages that have an ISO 639-1 representation are considered macrolanguages in ISO 639-3 (e.g. Chinese). Cantonese has no ISO 639-1 representation, so we've included Cantonese variations for Time within the Chinese rules. But Cantonese does have an ISO 639-3 representation (yue).

From a locale perspective, as Cantonese is an official language in Hong Kong and Macau, it is represented as zh-HK and zh-MO.
Locales capture regional particularities of a language. For example, in French the numeral 70 is pronounced "soixante-dix" in France (fr-FR), and "septante" in Belgium (fr-BE) and Switzerland (fr-CH).
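As a rough illustration of how locale-scoped rules could layer on top of language-wide ones, the sketch below collects the words accepted for the numeral 70 in French. Everything in it (`Lang`, `Region`, `langWords`, `localeWords`, `wordsFor`) is hypothetical and not taken from Duckling, and treating "soixante-dix" as valid in every French locale is an assumption of the sketch.

```haskell
-- Hypothetical types for illustration; not Duckling's actual definitions.
data Lang = French
  deriving (Eq, Show)

data Region = France | Belgium | Switzerland
  deriving (Eq, Show)

-- Language-wide vocabulary for the numeral 70.
-- Assumption: "soixante-dix" is accepted in every French locale.
langWords :: [(Lang, [String])]
langWords = [(French, ["soixante-dix"])]

-- Locale-specific additions, keyed by (language, region).
localeWords :: [((Lang, Region), [String])]
localeWords =
  [ ((French, Belgium), ["septante"])
  , ((French, Switzerland), ["septante"])
  ]

-- Words accepted for 70: the language-wide list, extended with the
-- locale-specific list when a region is given and has an entry.
wordsFor :: Lang -> Maybe Region -> [String]
wordsFor lang mRegion = base ++ extra
  where
    base = maybe [] id (lookup lang langWords)
    extra = case mRegion of
      Nothing -> []
      Just region -> maybe [] id (lookup (lang, region) localeWords)
```

With these tables, `wordsFor French (Just Belgium)` yields both forms, while `wordsFor French Nothing` and `wordsFor French (Just France)` yield only the language-wide one.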

To summarize, it seems like we could add ISO-639-3 languages if they don't derive from another ISO-639-1 (macro)language (e.g. fil for Filipino), and support locales as language extensions (e.g. zh-HK/zh-MO for Cantonese, fr-BE for French in Belgium).

Edit: It seems like Tagalog relates to Filipino the same way Castilian relates to Spanish. Tagalog has an ISO 639-1 representation (tl). So let's keep using ISO 639-1 language codes with locale extensions.
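To make the "two-letter language code plus optional region" convention concrete, here is a minimal sketch of a tag parser. `parseLocaleTag` is a hypothetical helper, not a Duckling function; it only checks the shape of the tag (two ASCII letters, an optional `-` or `_`, two more ASCII letters) and does not verify that the codes are actually registered in ISO 639-1 or ISO 3166-1.

```haskell
import Data.Char (isAlpha, isAscii, toLower, toUpper)

-- Hypothetical helper, not part of Duckling: split a tag such as "fr-BE"
-- or "zh_HK" into a lowercase two-letter language code and an optional
-- uppercase two-letter region code. Returns Nothing for any other shape.
parseLocaleTag :: String -> Maybe (String, Maybe String)
parseLocaleTag tag =
  case break isSeparator tag of
    (lang, "")
      | validCode lang -> Just (map toLower lang, Nothing)
    (lang, _ : region)
      | validCode lang && validCode region ->
          Just (map toLower lang, Just (map toUpper region))
    _ -> Nothing
  where
    isSeparator c = c == '-' || c == '_'
    -- Exactly two ASCII letters; three-letter codes are rejected.
    validCode code = length code == 2 && all (\c -> isAscii c && isAlpha c) code
```

Under this sketch, `parseLocaleTag "fr-BE"` gives `Just ("fr", Just "BE")`, a bare `"zh"` gives `Just ("zh", Nothing)`, and a three-letter code such as `"fil"` is rejected with `Nothing`.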

facebook-github-bot pushed a commit that referenced this issue Oct 13, 2017
Summary:
* Locales support for the library, following `<Lang>_<Region>` with ISO 639-1 code for `<Lang>` and ISO 3166-1 alpha-2 code for `<Region>` (#33)
* `Locale` opaque type (composite of `Lang` and `Region`) with `makeLocale` smart constructor to only allow valid `(Lang, Region)` combinations
* API: `Context`'s `lang` parameter has been replaced by `locale`, with optional `Region` and backward compatibility.
*  `Rules/<Lang>.hs` exposes
  - `langRules`: cross-locale rules for `<Lang>`, from `<Dimension>/<Lang>/Rules.hs`
  - `localeRules`: locale-specific rules, from `<Dimension>/<Lang>/<Region>/Rules.hs`
  - `defaultRules`: `langRules` + specific rules from select locales to ensure backward-compatibility
* Corpus, tests & classifiers
  - 1 classifier per locale, with default classifier (`<Lang>_XX`) when no locale provided (backward-compatible)
  - Default classifiers are built on existing corpus
  - Locale classifiers are built on
  - `<Dimension>/<Lang>/Corpus.hs` exposes a common `corpus` to all locales of `<Lang>`
  - `<Dimension>/<Lang>/<Region>/Corpus.hs` exposes `allExamples`: a list of examples specific to the locale (following `<Dimension>/<Lang>/<Region>/Rules.hs`).
  - Locale classifiers use the language corpus extended with the locale examples as training set.
  - Locale examples need to use the same `Context` (i.e. reference time) as the language corpus.
  - For backward compatibility, `<Dimension>/<Lang>/Corpus.hs` can also expose `defaultCorpus`, which is `corpus` augmented with specific examples. This is controlled by `getDefaultCorpusForLang` in `Duckling.Ranking.Generate`.
  - Tests run against each classifier to make sure runtime works as expected.
* MM/DD (en_US) vs DD/MM (en_GB) example to illustrate
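The smart-constructor idea and the MM/DD versus DD/MM example from the summary above can be sketched in a few lines. This is a self-contained illustration, not Duckling's code: the types are redefined locally with a small subset of languages and regions, `validRegions` and `monthDay` are hypothetical names, and two behaviours are assumptions of the sketch: an invalid `(Lang, Region)` pair falls back to the region-less locale, and the region-less English default reads dates as MM/DD.

```haskell
-- Hypothetical, self-contained types (illustrative subset only).
data Lang = EN | FR
  deriving (Eq, Show)

data Region = US | GB | BE | CH
  deriving (Eq, Show)

-- In a real module the `Locale` constructor would stay unexported, so
-- that `makeLocale` is the only way to build a value.
data Locale = Locale Lang (Maybe Region)
  deriving (Eq, Show)

-- Regions accepted for each language in this sketch.
validRegions :: Lang -> [Region]
validRegions EN = [US, GB]
validRegions FR = [BE, CH]

-- Smart constructor. Assumption: an invalid region is dropped, which
-- falls back to the region-less (default) locale for the language.
makeLocale :: Lang -> Maybe Region -> Locale
makeLocale lang (Just region)
  | region `elem` validRegions lang = Locale lang (Just region)
makeLocale lang _ = Locale lang Nothing

-- Keep a (month, day) pair only if both numbers are in a plausible range.
validMonthDay :: Int -> Int -> Maybe (Int, Int)
validMonthDay m d
  | m >= 1 && m <= 12 && d >= 1 && d <= 31 = Just (m, d)
  | otherwise = Nothing

-- Interpret the two numbers of "a/b" as (month, day).
-- en_GB reads DD/MM; every other locale in this sketch reads MM/DD.
monthDay :: Locale -> Int -> Int -> Maybe (Int, Int)
monthDay (Locale EN (Just GB)) a b = validMonthDay b a
monthDay _ a b = validMonthDay a b
```

For the input `3/4`, `monthDay (makeLocale EN (Just US)) 3 4` reads March 4th and `monthDay (makeLocale EN (Just GB)) 3 4` reads April 3rd. An input such as `13/4` is only readable under the DD/MM ordering.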

Reviewed By: JonCoens, blandinw

Differential Revision: D6038096

fbshipit-source-id: f29c28d