BCP47 provides tools to parse, validate, normalize, and match language
tags following the BCP 47 standard. BCP 47 (Best Current Practice 47) is
the IETF standard that governs how human languages are identified in
internet protocols. It defines tags like en-US (English, United States).
The package bundles access to the IANA Language Subtag Registry, the authoritative source of valid language, script, region, and variant subtags.
You can install the development version of BCP47 from
GitHub with:
# install.packages('pak')
pak::pak('christopherkenny/BCP47')

| Function | Description |
|---|---|
| bcp_parse() | Decompose a tag into its subtag components |
| bcp_validate() | Check whether subtags appear in the IANA registry |
| bcp_normalize() | Apply canonical casing and substitute preferred values |
| bcp_match_language() | Find the best available language for a set of preferences |
| bcp_process_registry() | Download and parse the IANA registry |
| bcp_cache_*() | Manage the local registry cache |

bcp_parse() decomposes a tag into its RFC 5646 components. All subtags
are returned in lower-case.
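
To make the idea of decomposition concrete, here is a minimal base R sketch that classifies subtags by position and shape. It is an illustration only, not how bcp_parse() is implemented: the helper name sketch_parse() is hypothetical, and the sketch handles only language, script, region, and variant subtags (no extlang, extensions, private use, or grandfathered tags).

```r
# Hypothetical helper, not part of BCP47: a simplified positional classifier
# for tags of the form language[-script][-region][-variant...].
sketch_parse <- function(tag) {
  # Lower-case everything and split on hyphens
  parts <- strsplit(tolower(tag), '-', fixed = TRUE)[[1]]
  out <- list(
    language = parts[1],
    script = NA_character_,
    region = NA_character_,
    variants = character(0)
  )
  rest <- parts[-1]

  # Script: exactly four letters, directly after the language
  if (length(rest) > 0 && grepl('^[a-z]{4}$', rest[1])) {
    out$script <- rest[1]
    rest <- rest[-1]
  }

  # Region: two letters or three digits
  if (length(rest) > 0 && grepl('^([a-z]{2}|[0-9]{3})$', rest[1])) {
    out$region <- rest[1]
    rest <- rest[-1]
  }

  # Variants: 5 to 8 alphanumerics, or 4 characters starting with a digit
  is_variant <- grepl('^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$', rest)
  out$variants <- rest[is_variant]

  out
}

sketch_parse('de-1901')$variants
```

The shape rules (four letters for a script, two letters or three digits for a region) come from RFC 5646; a real parser also has to validate ordering and handle the subtag kinds this sketch skips.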
library(BCP47)
bcp_parse('en-US')
#> $language
#> [1] "en"
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] NA
#>
#> $region
#> [1] "us"
#>
#> $variants
#> NULL
#>
#> $extensions
#> list()
#>
#> $private
#> NULL
bcp_parse('zh-Hans-CN')
#> $language
#> [1] "zh"
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] "hans"
#>
#> $region
#> [1] "cn"
#>
#> $variants
#> NULL
#>
#> $extensions
#> list()
#>
#> $private
#> NULL
bcp_parse('de-1901')
#> $language
#> [1] "de"
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] NA
#>
#> $region
#> [1] NA
#>
#> $variants
#> [1] "1901"
#>
#> $extensions
#> list()
#>
#> $private
#> NULL

bcp_match_language() implements the RFC 4647 “Lookup” scheme. It finds
the best available language for a user’s ordered list of preferences,
progressively stripping subtags to find a match.
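
The progressive stripping can be sketched in a few lines of base R. This is a rough illustration of the RFC 4647 Lookup idea, not the package's implementation: sketch_lookup() is a hypothetical name, comparison is assumed to be case-insensitive, and the wildcard range `*` is not handled.

```r
# Hypothetical helper, not part of BCP47: try each preference in order,
# truncating it from the right until something in `available` matches.
sketch_lookup <- function(preferences, available, default = NA_character_) {
  avail_lower <- tolower(available)
  for (pref in preferences) {
    parts <- strsplit(tolower(pref), '-', fixed = TRUE)[[1]]
    while (length(parts) > 0) {
      hit <- match(paste(parts, collapse = '-'), avail_lower)
      if (!is.na(hit)) {
        # Return the tag as spelled in `available`
        return(available[hit])
      }
      # Drop the last subtag
      parts <- parts[-length(parts)]
      # RFC 4647: a trailing single-character subtag is dropped as well
      if (length(parts) > 0 && nchar(parts[length(parts)]) == 1) {
        parts <- parts[-length(parts)]
      }
    }
  }
  default
}

sketch_lookup(c('en-US', 'fr'), c('en', 'de'))
```

Each preference is exhausted before the next one is tried, so an exact match on a lower-priority preference never beats a truncated match on a higher-priority one.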
# User prefers en-US, then French. Only 'en' and 'de' are available.
bcp_match_language(c('en-US', 'fr'), c('en', 'de'))
#> [1] "en"
# Prefer Traditional Chinese, fall back to Simplified, then English
bcp_match_language(
c('zh-Hant-TW', 'zh-Hans', 'en'),
c('zh-Hans', 'en', 'fr')
)
#> [1] "zh-Hans"
# No match — return a default
bcp_match_language('pt-BR', c('fr', 'de'), default = 'en')
#> [1] "en"

bcp_validate() and bcp_normalize() check and canonicalize tags
against the IANA registry. They download (and cache) the registry on
first use.
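
The casing part of canonicalization needs no registry and can be sketched in base R. The helper below is hypothetical (sketch_case() is not a BCP47 function) and covers only the RFC 5646 casing convention: lower-case language, title-case four-letter script, upper-case two-letter region. It does not suppress default scripts or substitute preferred values, which require registry data, and it ignores the special handling of subtags that follow an extension singleton.

```r
# Hypothetical helper, not part of BCP47: apply conventional casing only.
sketch_case <- function(tag) {
  parts <- strsplit(tolower(tag), '-', fixed = TRUE)[[1]]
  # Skip the first subtag (the language), which stays lower-case
  for (i in seq_along(parts)[-1]) {
    p <- parts[i]
    if (grepl('^[a-z]{4}$', p)) {
      # Four letters: treat as a script and use title case
      parts[i] <- paste0(toupper(substr(p, 1, 1)), substr(p, 2, 4))
    } else if (grepl('^[a-z]{2}$', p)) {
      # Two letters: treat as a region and use upper case
      parts[i] <- toupper(p)
    }
  }
  paste(parts, collapse = '-')
}

sketch_case('en-us')
sketch_case('sr-latn')
```

Unlike bcp_normalize(), this sketch leaves 'en-Latn-US' unchanged, because dropping Latn depends on knowing the default script for English.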
# Check whether subtags are registered
bcp_validate('en-US') # TRUE
bcp_validate('xx-ZZ') # FALSE — neither subtag is registered
# Canonicalize casing and suppress default scripts
bcp_normalize('en-us') # "en-US" (region uppercased)
bcp_normalize('en-Latn-US') # "en-US" (Latn is the default script for English)
bcp_normalize('sr-latn') # "sr-Latn" (Latn is not the default for Serbian)

The IANA registry is parsed into a tidy data frame you can query directly:
reg <- bcp_process_registry()
head(reg)
# Find all scripts
reg[reg$type == 'script', c('subtag', 'description')]
# Check the registry date
attr(reg, 'last_update')

Registry data is cached locally to avoid repeated downloads:
bcp_cache_path() # where the cache lives
bcp_cache_size() # how big it is
bcp_cache_update() # refresh from IANA
bcp_cache_clear() # delete the cache