New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Locales support #33
Comments
On another note, there are languages not included in the two-letters ISO-639-1 standard, but have a ISO-639-3 representation (e.g. Filipino). From a locale perspective, as Cantonese is an official language in Hongkong and Macau, it is represented as To summarize, it seems like we could add ISO-639-3 languages if they don't derive from another ISO-639-1 (macro)language (e.g. Edit: It seems like Tagalog relates to Filipino the same way Castilian relates to Spanish. Tagalog has a ISO-639-1 representation (tg). So let's keep using ISO-639-1 language codes with locales extensions. |
Summary: * Locales support for the library, following `<Lang>_<Region>` with ISO 639-1 code for `<Lang>` and ISO 3166-1 alpha-2 code for `<Region>` (#33) * `Locale` opaque type (composite of `Lang` and `Region`) with `makeLocale` smart constructor to only allow valid `(Lang, Region)` combinations * API: `Context`'s `lang` parameter has been replaced by `locale`, with optional `Region` and backward compatibility. * `Rules/<Lang>.hs` exposes - `langRules`: cross-locale rules for `<Lang>`, from `<Dimension>/<Lang>/Rules.hs` - `localeRules`: locale-specific rules, from `<Dimension>/<Lang>/<Region>/Rules.hs` - `defaultRules`: `langRules` + specific rules from select locales to ensure backward-compatibility * Corpus, tests & classifiers - 1 classifier per locale, with default classifier (`<Lang>_XX`) when no locale provided (backward-compatible) - Default classifiers are built on existing corpus - Locale classifiers are built on - `<Dimension>/<Lang>/Corpus.hs` exposes a common `corpus` to all locales of `<Lang>` - `<Dimension>/<Lang>/<Region>/Corpus.hs` exposes `allExamples`: a list of examples specific to the locale (following `<Dimension>/<Lang>/<Region>/Rules.hs`). - Locale classifiers use the language corpus extended with the locale examples as training set. - Locale examples need to use the same `Context` (i.e. reference time) as the language corpus. - For backward compatibility, `<Dimension>/<Lang>/Corpus.hs` can expose also `defaultCorpus`, which is `corpus` augmented with specific examples. This is controlled by `getDefaultCorpusForLang` in `Duckling.Ranking.Generate`. - Tests run against each classifier to make sure runtime works as expected. * MM/DD (en_US) vs DD/MM (en_GB) example to illustrate Reviewed By: JonCoens, blandinw Differential Revision: D6038096 fbshipit-source-id: f29c28d
Currently, Duckling has two different granularity levels for rules: language-wise, and common rules across languages. As an example,
AmountOfMoney
has a couple of common rules, and more specific rules for English, French, etc.Today, if we want to handle country-specific forms, we'd include them under the language umbrella. Although this has worked well so far, it is not ideal. We had to include Cantonese variations for
Time
within the Chinese rules (#21).An improvement for Duckling would be to add a finer granularity for rules: scoping by locale.
This is something we've had in mind for a while, though we are not able to prioritize it today. The goal here is to open the discussion and maybe come up with an actual implementation.
The text was updated successfully, but these errors were encountered: