defmodule Cldr.LanguageTag do
@moduledoc """
Represents a language tag as defined in [RFC 5646](https://tools.ietf.org/html/rfc5646)
with extensions "u" and "t" as defined in [BCP 47](https://tools.ietf.org/html/bcp47).
Language tags are used to help identify languages, whether spoken,
written, signed, or otherwise signaled, for the purpose of
communication. This includes constructed and artificial languages
but excludes languages not intended primarily for human
communication, such as programming languages.
## Syntax
A language tag is composed from a sequence of one or more "subtags",
each of which refines or narrows the range of language identified by
the overall tag. Subtags, in turn, are a sequence of alphanumeric
characters (letters and digits), distinguished and separated from
other subtags in a tag by a hyphen ("-", [Unicode] U+002D).
There are different types of subtag, each of which is distinguished
by length, position in the tag, and content: each subtag's type can
be recognized solely by these features. This makes it possible to
extract and assign some semantic information to the subtags, even if
the specific subtag values are not recognized. Thus, a language tag
processor need not have a list of valid tags or subtags (that is, a
copy of some version of the IANA Language Subtag Registry) in order
to perform common searching and matching operations. The only
exceptions to this ability to infer meaning from subtag structure are
the grandfathered tags listed in the productions 'regular' and
'irregular' below. These tags were registered under [RFC3066] and
are a fixed list that can never change.
The syntax of the language tag in ABNF is:

    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags

    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]

    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag

    extlang       = 3ALPHA              ; selected ISO 639 codes
                    *2("-" 3ALPHA)      ; permanently reserved

    script        = 4ALPHA              ; ISO 15924 code

    region        = 2ALPHA              ; ISO 3166-1 code
                  / 3DIGIT              ; UN M.49 code

    variant       = 5*8alphanum         ; registered variants
                  / (DIGIT 3alphanum)

    extension     = singleton 1*("-" (2*8alphanum))

                                        ; Single alphanumerics
                                        ; "x" reserved for private use
    singleton     = DIGIT               ; 0 - 9
                  / %x41-57             ; A - W
                  / %x59-5A             ; Y - Z
                  / %x61-77             ; a - w
                  / %x79-7A             ; y - z

    privateuse    = "x" 1*("-" (1*8alphanum))

    grandfathered = irregular           ; non-redundant tags registered
                  / regular             ; during the RFC 3066 era

    irregular     = "en-GB-oed"         ; irregular tags do not match
                  / "i-ami"             ; the 'langtag' production and
                  / "i-bnn"             ; would not otherwise be
                  / "i-default"         ; considered 'well-formed'
                  / "i-enochian"        ; These tags are all valid,
                  / "i-hak"             ; but most are deprecated
                  / "i-klingon"         ; in favor of more modern
                  / "i-lux"             ; subtags or subtag
                  / "i-mingo"           ; combination
                  / "i-navajo"
                  / "i-pwn"
                  / "i-tao"
                  / "i-tay"
                  / "i-tsu"
                  / "sgn-BE-FR"
                  / "sgn-BE-NL"
                  / "sgn-CH-DE"

    regular       = "art-lojban"        ; these tags match the 'langtag'
                  / "cel-gaulish"       ; production, but their subtags
                  / "no-bok"            ; are not extended language
                  / "no-nyn"            ; or variant subtags: their meaning
                  / "zh-guoyu"          ; is defined by their registration
                  / "zh-hakka"          ; and all of these are deprecated
                  / "zh-min"            ; in favor of a more modern
                  / "zh-min-nan"        ; subtag or sequence of subtags
                  / "zh-xiang"

    alphanum      = (ALPHA / DIGIT)     ; letters and numbers

All subtags have a maximum length of eight characters. Whitespace is
not permitted in a language tag. There is a subtlety in the ABNF
production 'variant': a variant starting with a digit has a minimum
length of four characters, while those starting with a letter have a
minimum length of five characters.
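
The length and content rules above can be sketched in a few lines. This is
only an illustration under simplifying assumptions: `SubtagShape` is a
hypothetical module that is not part of `Cldr`, and it handles only simple
tags of the form language, script, region and variants (no extlang,
extensions, private use or grandfathered tags).

```elixir
defmodule SubtagShape do
  # Hypothetical helper: guesses the type of each subtag from its
  # position, length and content alone, without any subtag registry.
  def classify(tag) when is_binary(tag) do
    [language | rest] = String.split(tag, "-")
    [{:language, language} | Enum.map(rest, &{shape(&1), &1})]
  end

  defp shape(subtag) do
    len = String.length(subtag)

    cond do
      len == 4 and alpha?(subtag) -> :script
      len == 2 and alpha?(subtag) -> :region
      len == 3 and digits?(subtag) -> :region
      # A variant that starts with a digit may be only four characters long
      len == 4 and digits?(String.first(subtag)) and alphanum?(subtag) -> :variant
      len in 5..8 and alphanum?(subtag) -> :variant
      true -> :unknown
    end
  end

  defp alpha?(s), do: s =~ ~r/^[A-Za-z]+$/
  defp digits?(s), do: s =~ ~r/^[0-9]+$/
  defp alphanum?(s), do: s =~ ~r/^[A-Za-z0-9]+$/
end

SubtagShape.classify("zh-Hant-TW")
# => [language: "zh", script: "Hant", region: "TW"]

SubtagShape.classify("de-1996")
# => [language: "de", variant: "1996"]
```

The sketch never consults a list of known subtags, which is the point the
syntax section makes: the shape of a subtag is enough to assign it a type.
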
## Unicode BCP 47 Extension type "u" - Locale
Extension | Description                      | Examples
--------- | -------------------------------- | ---------
ca        | Calendar type                    | buddhist, chinese, gregory
cf        | Currency format style            | standard, account
co        | Collation type                   | standard, search, phonetic, pinyin
cu        | Currency type                    | ISO4217 code like "USD", "EUR"
fw        | First day of the week identifier | sun, mon, tue, wed, ...
hc        | Hour cycle identifier            | h12, h23, h11, h24
lb        | Line break style identifier      | strict, normal, loose
lw        | Word break identifier            | normal, breakall, keepall
ms        | Measurement system identifier    | metric, ussystem, uksystem
nu        | Number system identifier         | arabext, armnlow, roman, tamldec
rg        | Region override                  | The value is a unicode_region_subtag for a regular region (not a macroregion), suffixed by "ZZZZ"
sd        | Subdivision identifier           | A unicode_subdivision_id, which is a unicode_region_subtag concatenated with a unicode_subdivision_suffix.
ss        | Break suppressions identifier    | none, standard
tz        | Timezone identifier              | Short identifiers defined in terms of a TZ time zone database
va        | Common variant type              | POSIX style locale variant
## Unicode BCP 47 Extension type "t" - Transforms
Extension | Description
--------- | -----------------------------------------
m0        | Transform extension mechanism: to reference an authority or rules for a type of transformation
s0        | Transform source: for non-languages/scripts, such as fullwidth-halfwidth conversion.
d0        | Transform destination: for non-languages/scripts, such as fullwidth-halfwidth conversion.
i0        | Input Method Engine transform
k0        | Keyboard transform
t0        | Machine Translation: Used to indicate content that has been machine translated
h0        | Hybrid Locale Identifiers: h0 with the value 'hybrid' indicates that the -t- value is a language that is mixed into the main language tag to form a hybrid
x0        | Private use transform
Extensions are formatted by specifying keyword pairs after an extension
separator. The example `de-DE-u-co-phonebk` specifies German as spoken in
Germany with a collation of `phonebk`. Another example, `en-Latn-AU-u-cf-account`,
represents English as spoken in Australia, written in the Latin script (`Latn`),
with currencies formatted in the `account` (accounting) style.
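
As a rough illustration of how the keyword pairs can be read back out of a
tag, the following sketch splits off the "u" extension and pairs up keys and
values. `UExtensionSketch` is a hypothetical module, not the parser used by
`Cldr`, and it assumes that every key has exactly one value and that no other
extension or private use section follows the "u" extension.

```elixir
defmodule UExtensionSketch do
  # Hypothetical helper: returns the "u" extension keywords of a tag
  # as a map, or an empty map when the tag has no "u" extension.
  def keywords(tag) when is_binary(tag) do
    case String.split(String.downcase(tag), "-u-", parts: 2) do
      [_basic_tag, extension] ->
        extension
        |> String.split("-")
        # Pair each key with the value that follows it; a trailing
        # key without a value is dropped.
        |> Enum.chunk_every(2, 2, :discard)
        |> Map.new(fn [key, value] -> {key, value} end)

      [_no_extension] ->
        %{}
    end
  end
end

UExtensionSketch.keywords("de-DE-u-co-phonebk")
# => %{"co" => "phonebk"}

UExtensionSketch.keywords("en-US-u-co-phonebk-nu-arab")
# => %{"co" => "phonebk", "nu" => "arab"}
```

A real parser also has to handle keys with several values or none, attributes,
and the "t" and private use sections, which this sketch deliberately ignores.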
"""
alias Cldr.LanguageTag.{Parser, U}
if Code.ensure_loaded?(Jason) do
  @derive Jason.Encoder
end

defstruct language: nil,
          language_subtags: [],
          script: nil,
          territory: nil,
          language_variant: nil,
          locale: %{},
          transform: %{},
          extensions: %{},
          private_use: [],
          requested_locale_name: nil,
          canonical_locale_name: nil,
          cldr_locale_name: nil,
          rbnf_locale_name: nil,
          gettext_locale_name: nil,
          backend: nil

@type t :: %__MODULE__{
        language: String.t(),
        language_subtags: [String.t()],
        script: String.t() | nil,
        territory: Cldr.territory(),
        language_variant: String.t() | nil,
        locale: Cldr.LanguageTag.U.t(),
        transform: map(),
        extensions: map(),
        private_use: [String.t()],
        requested_locale_name: String.t(),
        canonical_locale_name: String.t(),
        cldr_locale_name: String.t() | nil,
        rbnf_locale_name: String.t() | nil,
        gettext_locale_name: String.t() | nil,
        backend: Cldr.backend()
      }
@doc """
Parse a locale name into a `Cldr.LanguageTag` struct.
## Arguments
* `locale_name` is any valid locale name returned by `Cldr.known_locale_names/1`
## Returns
* `{:ok, language_tag}` or
* `{:error, reason}`
"""
def parse(locale_name) when is_binary(locale_name) do
  Parser.parse(locale_name)
end
@doc """
Parse a locale name into a `Cldr.LanguageTag` struct, raising an exception on error.
## Arguments
* `locale_name` is any valid locale name returned by `Cldr.known_locale_names/1`
## Returns
* `language_tag` or
* raises an exception
"""
@spec parse!(Cldr.Locale.locale_name()) :: t() | none()
def parse!(locale_string) when is_binary(locale_string) do
  Parser.parse!(locale_string)
end
@doc """
Reconstitute a textual language tag, suitable for passing
to a collator, from a `Cldr.LanguageTag` struct.
## Arguments
* `locale` is a `Cldr.LanguageTag` struct returned by `Cldr.Locale.new!/2`
## Returns
* A formatted string representation of the language tag that is also
parseable back into a `Cldr.LanguageTag.t()`
## Example
    iex> {:ok, locale} = Cldr.validate_locale "en-US-u-co-phonebk-nu-arab", MyApp.Cldr
    iex> Cldr.LanguageTag.to_string(locale)
    "en-Latn-US-u-co-phonebk-nu-arab"
"""
@spec to_string(t) :: String.t()
def to_string(%__MODULE__{} = locale) do
  basic_tag =
    [
      locale.language,
      locale.language_subtags,
      locale.script,
      locale.territory,
      locale.language_variant
    ]
    |> List.flatten()
    |> Enum.reject(&is_nil/1)
    |> Enum.join("-")

  locale_extension = U.to_string(locale.locale)

  if locale_extension != "" do
    basic_tag <> "-u-" <> locale_extension
  else
    basic_tag
  end
end
end