In [22]:
library(tidyverse)

The goal of readr’s locales is to encapsulate common options that vary between languages and localities. This includes:

* The names of months and days, used when parsing dates.
* The default time zone, used when parsing datetimes.
* The character encoding, used when reading non-ASCII strings.
* The default date and time formats, used when guessing column types.
* The decimal and grouping marks, used when reading numbers.

(Strictly speaking these are not locales in the usual technical sense of the word because they also contain information about time zones and encoding.)

To create a new locale, you use the **`locale()`** function:

In [2]:
locale()

<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
        (Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
        June (Jun), July (Jul), August (Aug), September (Sep), October
        (Oct), November (Nov), December (Dec)
AM/PM:  AM/PM

All of the parsing functions in readr take a `locale` argument. You’ll most often use it with **`read_csv()`**, **`read_fwf()`** or **`read_table()`**. readr is designed to work the same way across systems, so the default locale is English-centric, like R itself. If you’re not in an English-speaking country, this makes initial import a little harder, because you have to override the defaults. But the payoff is big: you can share your code and know that it will work on any other system. Base R takes a different approach: it uses system defaults, so typical data import is a little easier, but sharing code is harder.
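As a rough illustration of passing a locale to an import function, the sketch below writes a tiny semicolon-delimited file with French month names and decimal commas, then reads it back. The file contents and the column names `jour` and `prix` are invented for this example.

```r
library(readr)

# Hypothetical two-row file: French month names, decimal commas, ";" delimiter.
path <- tempfile(fileext = ".csv")
writeLines(c("jour;prix", "1 janvier 2015;1,5", "14 mars 2015;2,25"), path)

# French date names plus a decimal comma.
french <- locale("fr", decimal_mark = ",")

prices <- read_delim(
  path,
  delim = ";",
  col_types = cols(
    jour = col_date(format = "%d %B %Y"),  # %B matches the locale's full month names
    prix = col_double()                    # parsed with the locale's decimal mark
  ),
  locale = french
)
prices
```

Because the locale is spelled out in the call, the same code should behave identically on a machine with different system settings.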

In [3]:
args(locale)

# Dates and times  

### Names of months and days

The first argument to `locale()` is `date_names`, and it controls what values are used for month and day names. The easiest way to specify it is with an ISO 639 language code:

In [4]:
locale('fr') # French

<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   dimanche (dim.), lundi (lun.), mardi (mar.), mercredi (mer.), jeudi
        (jeu.), vendredi (ven.), samedi (sam.)
Months: janvier (janv.), février (févr.), mars (mars), avril (avr.), mai (mai),
        juin (juin), juillet (juil.), août (août), septembre (sept.),
        octobre (oct.), novembre (nov.), décembre (déc.)
AM/PM:  AM/PM

In [5]:
locale('vi') # Vietnamese

<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   Chủ Nhật (CN), Thứ Hai (Th 2), Thứ Ba (Th 3), Thứ Tư (Th 4), Thứ Năm
        (Th 5), Thứ Sáu (Th 6), Thứ Bảy (Th 7)
Months: tháng 1 (thg 1), tháng 2 (thg 2), tháng 3 (thg 3), tháng 4 (thg 4),
        tháng 5 (thg 5), tháng 6 (thg 6), tháng 7 (thg 7), tháng 8 (thg
        8), tháng 9 (thg 9), tháng 10 (thg 10), tháng 11 (thg 11),
        tháng 12 (thg 12)
AM/PM:  SA/CH

You can list all the codes with **`date_names_langs()`**, or look them up in Wikipedia’s list of ISO 639 codes.

In [6]:
date_names_langs()
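To inspect the names stored for a single language before building a locale, something like the following may help. It is a minimal sketch that assumes `date_names_lang()` returns a list with `mon` and `day_ab` elements.

```r
library(readr)

# Date names that readr stores for one language code.
fr_names <- date_names_lang("fr")
fr_names$mon     # full month names
fr_names$day_ab  # abbreviated day names

# Check that a code is available before building a locale from it.
"fr" %in% date_names_langs()
```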

Specifying a locale allows you to parse dates in other languages:

In [7]:
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))

In [8]:
parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))

For many languages, it’s common to find that diacritics have been stripped so they can be stored as ASCII. You can tell the locale to expect this with the `asciify` option:

In [9]:
parse_date("1 août 2015", "%d %B %Y", locale = locale("fr"))

"1 parsing failure.
row col           expected         actual
  1  -- date like %d %B %Y 1 ao<fb>t 2015
"

In [10]:
parse_date("1 aout 2015", "%d %B %Y", locale = locale("fr", asciify = TRUE))

Note that the quality of the translations is variable, especially for the rarer languages. If you discover that they’re not quite right for your data, you can create your own with **`date_names()`**. The following example creates a locale with Māori date names:

In [11]:
maori <- locale(date_names(
  day = c("Rātapu", "Rāhina", "Rātū", "Rāapa", "Rāpare", "Rāmere", "Rāhoroi"),
  mon = c("Kohi-tātea", "Hui-tanguru", "Poutū-te-rangi", "Paenga-whāwhā",
    "Haratua", "Pipiri", "Hōngongoi", "Here-turi-kōkā", "Mahuru",
    "Whiringa-ā-nuku", "Whiringa-ā-rangi", "Hakihea")
))
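A custom locale like this can then be passed to the parsing functions in the usual way. The sketch below repeats the definition so that it runs on its own. The input string is made up for illustration.

```r
library(readr)

# Same Māori date names as above, repeated so this snippet is self-contained.
maori <- locale(date_names(
  day = c("Rātapu", "Rāhina", "Rātū", "Rāapa", "Rāpare", "Rāmere", "Rāhoroi"),
  mon = c("Kohi-tātea", "Hui-tanguru", "Poutū-te-rangi", "Paenga-whāwhā",
    "Haratua", "Pipiri", "Hōngongoi", "Here-turi-kōkā", "Mahuru",
    "Whiringa-ā-nuku", "Whiringa-ā-rangi", "Hakihea")
))

# "Haratua" is the fifth month name, so %B should resolve it to May.
d <- parse_date("1 Haratua 2015", "%d %B %Y", locale = maori)
d
```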

### Timezones

Unless otherwise specified, readr assumes that times are in UTC, Coordinated Universal Time (the successor to GMT and, for almost all intents and purposes, identical to it). UTC is well suited to data because it doesn’t observe daylight saving time, which avoids a whole class of potential problems. If your data isn’t already in UTC, you’ll need to supply a `tz` in the locale:

In [12]:
parse_datetime("2001-10-10 20:10")

[1] "2001-10-10 20:10:00 UTC"

In [13]:
parse_datetime("2001-10-10 20:10", locale = locale(tz = "Pacific/Auckland"))

[1] "2001-10-10 20:10:00 NZDT"

You can see a complete list of time zones with **`OlsonNames()`**.

In [14]:
OlsonNames()

If you’re American, note that “EST” is a Canadian time zone that does not have DST. It’s not Eastern Standard Time! Instead use:

* PST/PDT = “US/Pacific”
* CST/CDT = “US/Central”
* MST/MDT = “US/Mountain”
* EST/EDT = “US/Eastern”
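One way to see what the `tz` setting changes is to parse the same clock reading under two locales and compare the resulting instants. This is a small sketch; the July 2015 timestamp is an arbitrary example chosen to fall within US daylight saving time.

```r
library(readr)

utc <- parse_datetime("2001-10-10 20:10")
nz  <- parse_datetime("2001-10-10 20:10", locale = locale(tz = "Pacific/Auckland"))

# Same clock reading, different instants: the gap is Auckland's UTC offset on that date.
nz_gap <- as.numeric(difftime(utc, nz, units = "hours"))
nz_gap

# A US example using a region name rather than an abbreviation such as "EST".
eastern  <- parse_datetime("2015-07-04 12:00", locale = locale(tz = "US/Eastern"))
noon_utc <- parse_datetime("2015-07-04 12:00")
us_gap <- as.numeric(difftime(eastern, noon_utc, units = "hours"))
us_gap
```

Comparing instants with `difftime()` avoids relying on how the time zone abbreviation happens to be printed on a given system.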

### Default formats

Locales also provide default date and time formats. The date format is used when guessing column types. The default date format is `%AD`, a flexible YMD parser:

In [15]:
#YYYY-MM-DD
guess_parser("2010-10-10")

In [16]:
#YYYY/MM/DD
guess_parser('2010/10/10')

If you’re an American, you might want to use your illogical date system:

In [17]:
parse_guess("01/31/2013")

In [18]:
parse_guess("01/31/2013", locale = locale(date_format = '%m/%d/%Y'))
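The effect on type guessing can be checked directly with `guess_parser()`, which reports the parser readr would pick. This is a short sketch; the name `us_dates` is just a label chosen for the example.

```r
library(readr)

# Locale whose default date format is month/day/year.
us_dates <- locale(date_format = "%m/%d/%Y")

guess_default <- guess_parser("01/31/2013")                     # default locale
guess_us      <- guess_parser("01/31/2013", locale = us_dates)  # custom date format

c(default = guess_default, us = guess_us)
```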

The time format is also used when guessing column types. The default time format is `%AT`, a flexible HMS parser:

In [19]:
parse_guess("17:55:14")

17:55:14

In [20]:
parse_guess("5:55:14 PM")

17:55:14

In [21]:
# Example of a non-standard time

parse_guess('h5m55s14 PM', locale = locale(time_format = 'h%Hm%Ms%S %p'))

17:55:14

# Character

All readr functions yield strings encoded in UTF-8. This encoding is the most likely to give good results in the widest variety of settings. By default, readr assumes that your input is also in UTF-8. That assumption is less likely to hold, especially when you’re working with older datasets.

The following code illustrates the problems with encodings:

In [37]:
library(stringi)
x <- "Émigré cause célèbre déjà vu.\n"
y <- stri_conv(x, "UTF-8", "latin1")

"the Unicode codepoint \U0000fffd cannot be converted to destination encoding"

In [38]:
x
y

In [39]:
parse_character(y, locale = locale(encoding = 'latin1'))

If you don’t know what encoding the file uses, try **`guess_encoding()`**. It’s not 100% perfect (as it’s fundamentally a heuristic), but should at least get you pointed in the right direction:

In [40]:
guess_encoding(y)

encoding,confidence
ASCII,1
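The results of the cells above depend on how the session encodes the string literal. A sketch that avoids that dependence, under the assumption that writing raw Latin-1 bytes to a temporary file is an acceptable stand-in for an older dataset, might look like this:

```r
library(readr)

# \u escapes keep the literal independent of how this script itself is encoded.
x <- "\u00c9migr\u00e9 cause c\u00e9l\u00e8bre d\u00e9j\u00e0 vu."

# Convert to Latin-1 bytes and write them, plus a newline, to a temporary file.
latin1_bytes <- iconv(x, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]
path <- tempfile(fileext = ".txt")
writeBin(c(latin1_bytes, charToRaw("\n")), path)

# Declaring the input encoding lets readr convert the text to UTF-8.
decoded <- read_lines(path, locale = locale(encoding = "latin1"))
identical(decoded, x)

# The heuristic guess for the same file; short inputs give it little to work with.
guess_encoding(path)
```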


# Number

Some countries use the decimal point, while others use the decimal comma. The `decimal_mark` option controls which one readr uses when parsing doubles:

In [46]:
parse_double('1,32', locale = locale(decimal_mark = ','))

Additionally, when writing out big numbers, you might have `1,000,000`, `1.000.000`, `1 000 000`, or `1'000'000`. The grouping mark is ignored by the more flexible number parser:

In [47]:
parse_number("$1,234.56")

In [48]:
parse_number("$1.234,56", locale = locale(decimal_mark = ',', grouping_mark = '.'))

In [49]:
# readr is smart enough to guess that if you're using , for decimals then
# you're probably using . for grouping:

parse_number("$1.234,56", locale = locale(decimal_mark = ','))
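The same inference appears to work in the other direction, and other grouping marks can be set explicitly. The following is a brief sketch with invented inputs:

```r
library(readr)

# Apostrophe grouping with the default decimal point.
swiss <- parse_number("1'000'000.5", locale = locale(grouping_mark = "'"))

# Only the grouping mark is given; readr then uses "," as the decimal mark.
dotted <- parse_number("1.000.000,5", locale = locale(grouping_mark = "."))

c(swiss = swiss, dotted = dotted)
```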