BCP47

BCP47 provides tools to parse, validate, normalize, and match language tags following the BCP 47 standard. BCP 47 (Best Current Practice 47) is the IETF standard that governs how human languages are identified in internet protocols; it defines tags like en-US (English, United States).

The package bundles access to the IANA Language Subtag Registry, the authoritative source of valid language, script, region, and variant subtags.

Installation

You can install the development version of BCP47 from GitHub with:

# install.packages('pak')
pak::pak('christopherkenny/BCP47')

Core Functions

Function                  Description
bcp_parse()               Decompose a tag into its subtag components
bcp_validate()            Check whether subtags appear in the IANA registry
bcp_normalize()           Apply canonical casing and substitute preferred values
bcp_match_language()      Find the best available language for a set of preferences
bcp_process_registry()    Download and parse the IANA registry
bcp_cache_*()             Manage the local registry cache

Examples

Parsing

bcp_parse() decomposes a tag into its RFC 5646 components. All subtags are returned in lowercase, whatever casing the input uses.

library(BCP47)

bcp_parse('en-US')
#> $language
#> [1] "en"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] NA
#> 
#> $region
#> [1] "us"
#> 
#> $variants
#> NULL
#> 
#> $extensions
#> list()
#> 
#> $private
#> NULL

bcp_parse('zh-Hans-CN')
#> $language
#> [1] "zh"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] "hans"
#> 
#> $region
#> [1] "cn"
#> 
#> $variants
#> NULL
#> 
#> $extensions
#> list()
#> 
#> $private
#> NULL

bcp_parse('de-1901')
#> $language
#> [1] "de"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] NA
#> 
#> $region
#> [1] NA
#> 
#> $variants
#> [1] "1901"
#> 
#> $extensions
#> list()
#> 
#> $private
#> NULL
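
To make the decomposition concrete, here is a minimal base-R sketch of how the subtags of a simple tag can be told apart by shape alone. It is not the package's implementation: sketch_parse() is a hypothetical helper, and it assumes a tag of the form language[-script][-region][-variant...], ignoring extlang, extension, and private-use subtags.

```r
# Hypothetical helper (not part of BCP47): classify subtags by their shape.
# Assumes the simplified form language[-script][-region][-variant...].
sketch_parse <- function(tag) {
  parts <- strsplit(tolower(tag), '-', fixed = TRUE)[[1]]
  out <- list(
    language = parts[1],
    script   = NA_character_,
    region   = NA_character_,
    variants = character(0)
  )
  for (p in parts[-1]) {
    no_variant_yet <- length(out$variants) == 0
    if (grepl('^[a-z]{4}$', p) && is.na(out$script) &&
        is.na(out$region) && no_variant_yet) {
      out$script <- p                      # four letters: script
    } else if (grepl('^([a-z]{2}|[0-9]{3})$', p) &&
               is.na(out$region) && no_variant_yet) {
      out$region <- p                      # two letters or three digits: region
    } else if (grepl('^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$', p)) {
      out$variants <- c(out$variants, p)   # 5-8 alphanumerics, or digit + 3
    } else {
      stop('Unexpected subtag in this simplified sketch: ', p)
    }
  }
  out
}

str(sketch_parse('zh-Hans-CN'))
str(sketch_parse('de-1901'))
```

The sketch only checks the shape of each subtag; whether a subtag is actually registered is a separate question answered by the IANA registry.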

Language Matching

bcp_match_language() implements the RFC 4647 “Lookup” scheme. It walks through a user’s ordered list of preferences and, for each one, progressively strips subtags from the right until the remaining tag matches an available language.

# User prefers en-US, then French. Only 'en' and 'de' are available.
bcp_match_language(c('en-US', 'fr'), c('en', 'de'))
#> [1] "en"

# Prefer Traditional Chinese, fall back to Simplified, then English
bcp_match_language(
  c('zh-Hant-TW', 'zh-Hans', 'en'),
  c('zh-Hans', 'en', 'fr')
)
#> [1] "zh-Hans"

# No match — return a default
bcp_match_language('pt-BR', c('fr', 'de'), default = 'en')
#> [1] "en"
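
The fallback behaviour can be sketched in a few lines of base R. This is an illustration of the RFC 4647 Lookup idea under simplifying assumptions, not the code bcp_match_language() runs: sketch_lookup() is a hypothetical helper, comparison is case-insensitive, and the wildcard range "*" is not handled.

```r
# Hypothetical helper (not part of BCP47): RFC 4647-style lookup by truncation.
sketch_lookup <- function(preferences, available, default = NA_character_) {
  avail_lc <- tolower(available)
  for (pref in preferences) {
    parts <- strsplit(tolower(pref), '-', fixed = TRUE)[[1]]
    while (length(parts) > 0) {
      hit <- match(paste(parts, collapse = '-'), avail_lc)
      if (!is.na(hit)) {
        return(available[hit])             # return the tag as the caller spelled it
      }
      parts <- parts[-length(parts)]       # drop the rightmost subtag
      # A single-character subtag left at the end is dropped as well
      if (length(parts) > 0 && nchar(parts[length(parts)]) == 1) {
        parts <- parts[-length(parts)]
      }
    }
  }
  default                                  # nothing matched any preference
}

sketch_lookup(c('en-US', 'fr'), c('en', 'de'))
sketch_lookup('pt-BR', c('fr', 'de'), default = 'en')
```

Each preference is exhausted before the next is tried, which is why a more specific but lower-priority preference can still lose to a truncated higher-priority one.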

Validation and Normalization

bcp_validate() and bcp_normalize() check and canonicalize tags against the IANA registry. They download (and cache) the registry on first use.

# Check whether subtags are registered
bcp_validate('en-US') # TRUE
bcp_validate('xx-ZZ') # FALSE — neither subtag is registered

# Canonicalize casing and suppress default scripts
bcp_normalize('en-us') # "en-US"  (region uppercased)
bcp_normalize('en-Latn-US') # "en-US"  (Latn is the default script for English)
bcp_normalize('sr-latn') # "sr-Latn" (Latn is not the default for Serbian)
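
The casing half of normalization follows a positional convention from RFC 5646: everything is lowercase except two-letter subtags (uppercase) and four-letter subtags (title case) that are neither the first subtag nor located after a singleton. A rough base-R sketch of that rule alone, with sketch_case() as a hypothetical helper, might look like this. It does not consult the registry, so it cannot suppress default scripts or substitute preferred values the way bcp_normalize() does.

```r
# Hypothetical helper (not part of BCP47): RFC 5646 casing convention only.
sketch_case <- function(tag) {
  parts <- strsplit(tolower(tag), '-', fixed = TRUE)[[1]]
  for (i in seq_along(parts)) {
    n <- nchar(parts[i])
    if (n == 1) break            # singleton: everything after it stays lowercase
    if (i == 1) next             # the first subtag stays lowercase
    if (n == 2) {
      parts[i] <- toupper(parts[i])
    } else if (n == 4) {
      parts[i] <- paste0(toupper(substr(parts[i], 1, 1)), substr(parts[i], 2, 4))
    }
  }
  paste(parts, collapse = '-')
}

sketch_case('en-us')
sketch_case('ZH-HANS-CN')
sketch_case('en-ca-x-ca')
```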

Registry Access

The IANA registry is parsed into a tidy data frame you can query directly:

reg <- bcp_process_registry()
head(reg)

# Find all scripts
reg[reg$type == 'script', c('subtag', 'description')]

# Check the registry date
attr(reg, 'last_update')
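
For readers curious about what the raw registry looks like before it becomes a data frame: the IANA file is a series of records separated by lines containing %%, each made of Key: value fields. The base-R sketch below splits a short, abbreviated sample into records. It is an assumption-laden illustration rather than what bcp_process_registry() does internally; in particular it ignores folded continuation lines, and the sample holds only two records.

```r
# Abbreviated sample in the registry's record format (two records only)
sample_registry <- c(
  'Type: language', 'Subtag: en', 'Description: English',
  'Added: 2005-10-16', 'Suppress-Script: Latn',
  '%%',
  'Type: script', 'Subtag: Latn', 'Description: Latin',
  'Added: 2005-10-16'
)

is_sep    <- sample_registry == '%%'
record_id <- cumsum(is_sep)[!is_sep]       # which record each field belongs to
fields    <- sample_registry[!is_sep]

key   <- sub(':.*$', '', fields)           # text before the first colon
value <- sub('^[^:]*: *', '', fields)      # text after the first colon

# One named character vector per record
records <- lapply(split(seq_along(fields), record_id), function(idx) {
  setNames(value[idx], key[idx])
})

records[[1]][['Suppress-Script']]
unname(vapply(records, function(r) r[['Type']], character(1)))
```

The Suppress-Script field on the language record is the piece of registry data that lets a normalizer drop a redundant script subtag.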

Cache Management

Registry data is cached locally to avoid repeated downloads:

bcp_cache_path() # where the cache lives
bcp_cache_size() # how big it is
bcp_cache_update() # refresh from IANA
bcp_cache_clear() # delete the cache

License

MIT; see LICENSE and LICENSE.md.
