# __08 Data Import with readr__

In [1]:
# libraries
library(tidyverse)

# config
repr_html.tbl_df <- function(obj, ..., rows = 6) repr:::repr_html.data.frame(obj, ..., rows = rows)
options(dplyr.summarise.inform = FALSE)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.1
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



Most of readr’s functions are concerned with turning flat files into
data frames:
* `read_csv()` reads comma-delimited files, `read_csv2()` reads
semicolon-separated files (common in countries where `,` is used
as the decimal mark), `read_tsv()` reads tab-delimited files, and
`read_delim()` reads files with any delimiter.

* `read_fwf()` reads fixed-width files. You can specify fields either
by their widths with `fwf_widths()` or by their positions with
`fwf_positions()`. `read_table()` reads a common variation of
fixed-width files where columns are separated by whitespace.

* `read_log()` reads Apache style log files. (But also check out
webreadr, which is built on top of `read_log()` and provides
many more helpful tools.)
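
As a rough sketch of how the less common readers are called, the snippet below writes two small temporary files and reads them back with `read_delim()` and `read_fwf()`. The file contents and column names are made up for illustration, and `col_types = cols()` is only there to silence the column specification message.

```r
library(readr)

# Pipe-delimited file: read_delim() needs the delimiter spelled out.
pipe_file <- tempfile(fileext = ".txt")
writeLines(c("id|name|score", "1|ann|3.5", "2|bob|4.0"), pipe_file)
pipe_df <- read_delim(pipe_file, delim = "|", col_types = cols())

# Fixed-width file: every field occupies a fixed number of characters.
# Here the name takes 4 characters and the age takes 2.
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("ann 35", "bob 41"), fwf_file)
fwf_df <- read_fwf(
  fwf_file,
  col_positions = fwf_widths(c(4, 2), col_names = c("name", "age")),
  col_types = cols()
)

pipe_df
fwf_df
```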

In [3]:
heights <- read_csv('data/heights.csv')

Parsed with column specification:
cols(
  earn = col_double(),
  height = col_double(),
  sex = col_character(),
  ed = col_double(),
  age = col_double(),
  race = col_character()
)



You can also supply an inline CSV file. This is useful for experimenting with readr and for creating reproducible examples to share
with others:

In [4]:
read_csv(
    'a, b, c
    1, 2, 3
    4, 5, 6'
)

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


The data might not have column names. You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings,
and instead label the columns sequentially from `X1` to `Xn`:

In [5]:
read_csv('1,2,3\n4,5,6', col_names = FALSE)

X1,X2,X3
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


Alternatively you can pass `col_names` a character vector, which
will be used as the column names:

In [6]:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

x,y,z
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


Another option that commonly needs tweaking is `na`. This specifies
the value (or values) used to represent missing values in
your file:

In [7]:
read_csv("a,b,c\n1,2,.", na = ".")

a,b,c
<dbl>,<dbl>,<lgl>
1,2,
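
Since `na` accepts several values, a file that mixes missing-value markers can be handled in one call. The sketch below assumes a made-up file in which both `.` and `-99` stand for missing data. Note that supplying `na` replaces the default `c("", "NA")` rather than adding to it.

```r
library(readr)

# Hypothetical file: "." and "-99" both mean "missing".
csv_file <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,-99,.", "4,5,6"), csv_file)

# col_types = cols() only silences the column specification message.
df <- read_csv(csv_file, na = c(".", "-99"), col_types = cols())
df
```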


## __Parsing a Vector__

Before we get into the details of how readr reads files from disk, we
need to take a little detour to talk about the `parse_*()` functions.
These functions take a character vector and return a more specialized
vector like a logical, integer, or date:

In [8]:
str(parse_logical(c('TRUE', 'FALSE', 'NA')))

 logi [1:3] TRUE FALSE NA


In [9]:
str(parse_integer(c('1', '2', '3')))

 int [1:3] 1 2 3


In [10]:
str(parse_date(c('2010-01-01', '1979-10-14')))

 Date[1:2], format: "2010-01-01" "1979-10-14"


Like all functions in the tidyverse, the `parse_*()` functions are
uniform: the first argument is a character vector to parse, and the `na`
argument specifies which strings should be treated as missing:

In [11]:
parse_integer(c('1', '231', '.', '456'), na = '.')

In [12]:
# if parsing fails, you'll get a warning
x <- parse_integer(c('123', '345', 'abc', '123.45'))

“2 parsing failures.
row col               expected actual
  3  -- an integer                abc
  4  -- no trailing characters    .45
”


In [13]:
x

In [14]:
problems(x)

row,col,expected,actual
<int>,<int>,<chr>,<chr>
3,,an integer,abc
4,,no trailing characters,.45
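
One way to use this table, sketched here with the same input as above, is to pull out the original strings that failed so they can be inspected or cleaned. The variable names `input` and `bad` are illustrative only.

```r
library(readr)

input <- c("123", "345", "abc", "123.45")
x <- suppressWarnings(parse_integer(input))

# Elements that failed to parse come back as NA ...
is.na(x)

# ... and problems() records their positions, so the raw strings can be recovered.
bad <- input[problems(x)$row]
bad
```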


Using parsers is mostly a matter of understanding what’s available
and how they deal with different types of input. There are eight particularly important parsers:

* `parse_logical()` and `parse_integer()` parse logicals and integers,
respectively. There’s basically nothing that can go wrong
with these parsers so I won’t describe them here further.
* `parse_double()` is a strict numeric parser, and `parse_number()`
is a flexible numeric parser. These are more complicated than
you might expect because different parts of the world write
numbers in different ways.
* `parse_character()` seems so simple that it shouldn’t be necessary.
But one complication makes it quite important: character
encodings.
* `parse_factor()` creates factors, the data structure that R uses to
represent categorical variables with fixed and known values.
* `parse_datetime()`, `parse_date()`, and `parse_time()` allow
you to parse various date and time specifications. These are the
most complicated because there are so many different ways of
writing dates.

### __Numbers__

It seems like it should be straightforward to parse a number, but
three problems make it tricky:
* People write numbers differently in different parts of the world.
For example, some countries use `.` between the integer and
fractional parts of a real number, while others use `,`.
* Numbers are often surrounded by other characters that provide
some context, like “$1000” or “10%”.
* Numbers often contain “grouping” characters to make them
easier to read, like “1,000,000”, and these grouping characters
vary around the world.

To address the first problem, readr has the notion of a “locale,” an
object that specifies parsing options that differ from place to place.
When parsing numbers, the most important option is the character
you use for the decimal mark. You can override the default value
of `.` by creating a new locale and setting the `decimal_mark` argument:

In [15]:
parse_double('1.23')

In [16]:
parse_double('1,23', locale = locale(decimal_mark = ','))

`parse_number()` addresses the second problem: it ignores
non-numeric characters before and after the number. This is particularly
useful for currencies and percentages, but also works to extract
numbers embedded in text:

In [17]:
parse_number('$100')

In [18]:
parse_number('20%')

In [19]:
parse_number('It cost $123.45')

The final problem is addressed by combining `parse_number()` with a locale: `parse_number()` ignores the locale’s “grouping
mark”:

In [20]:
parse_number('$123,456,789')

In [21]:
parse_number('123.456.789',
             locale = locale(grouping_mark = '.'))

In [23]:
parse_number("123'456'789",
             locale = locale(grouping_mark = "'"))
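
The decimal mark and grouping mark can be set together in one locale. The sketch below uses a made-up European-style amount and also contrasts the strict `parse_double()` with the lenient `parse_number()`. The name `eu` is just an illustrative variable.

```r
library(readr)

# "." groups thousands and "," marks the decimals.
eu <- locale(decimal_mark = ",", grouping_mark = ".")
amount <- parse_number("1.234,56 EUR", locale = eu)
amount

# parse_double() is strict: stray characters make the parse fail (NA plus a warning).
strict <- suppressWarnings(parse_double("$100"))

# parse_number() is lenient and keeps only the number.
lenient <- parse_number("$100")
```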

### __Strings__

It seems like `parse_character()` should be really simple: it could
just return its input. Unfortunately life isn’t so simple, as there are
multiple ways to represent the same string. To understand what’s
going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a
string using `charToRaw()`:

In [25]:
charToRaw('Hadley')

[1] 48 61 64 6c 65 79

Each hexadecimal number represents a byte of information: 48 is H,
61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called
ASCII. ASCII does a great job of representing English characters,
because it’s the American Standard Code for Information Interchange.

readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded
when you read it, and always uses it when writing. This is a
good default, but will fail for data produced by older systems that
don’t understand UTF-8. If this happens to you, your strings will
look weird when you print them. Sometimes just one or two characters
might be messed up; other times you’ll get complete gibberish.
For example:

In [26]:
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

To fix the problem you need to specify the encoding in `parse_character()` :

In [27]:
parse_character(x1, locale = locale(encoding = 'Latin1'))

In [28]:
parse_character(x2, locale = locale(encoding = 'Shift-JIS'))
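
To see why `x1` needs the Latin1 hint, it can help to compare the bytes of the same character under two encodings. This is a small base-R sketch, not part of readr: `"\u00f1"` is the letter ñ, which takes two bytes in UTF-8 but a single byte (`f1`, the one that appears in `x1`) in Latin1.

```r
# "\u00f1" is n with a tilde; R stores \u escapes as UTF-8.
n_tilde <- "\u00f1"

# Two bytes in UTF-8 ...
utf8_bytes <- charToRaw(n_tilde)
utf8_bytes

# ... but a single byte once converted to Latin1.
latin1_bytes <- charToRaw(iconv(n_tilde, from = "UTF-8", to = "latin1"))
latin1_bytes

# rawToChar() reverses charToRaw() for plain ASCII text.
ascii_text <- rawToChar(as.raw(c(0x48, 0x61, 0x64, 0x6c, 0x65, 0x79)))
ascii_text
```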

How do you find the correct encoding? If you’re lucky, it’ll be
included somewhere in the data documentation. Unfortunately,
that’s rarely the case, so readr provides `guess_encoding()` to help
you figure it out. It’s not foolproof, and it works better when you
have lots of text (unlike here), but it’s a reasonable place to start.
Expect to try a few different encodings before you find the right one:

In [29]:
guess_encoding(charToRaw(x1))

encoding,confidence
<chr>,<dbl>
ISO-8859-1,0.46
ISO-8859-9,0.23


In [30]:
guess_encoding(charToRaw(x2))

encoding,confidence
<chr>,<dbl>
KOI8-R,0.42


### __Factors__

R uses factors to represent categorical variables that have a known
set of possible values. Give `parse_factor()` a vector of known
levels to generate a warning whenever an unexpected value is
present:

In [31]:
fruit <- c('apple', 'banana')
parse_factor(c('apple', 'banana', 'bananana'), levels = fruit)

“1 parsing failure.
row col           expected   actual
  3  -- value in level set bananana
”
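
A short sketch of what the result looks like after the warning, using the same input: the unexpected value becomes `NA`, the levels stay as declared, and `problems()` returns the same table that the warning printed.

```r
library(readr)

fruit <- c("apple", "banana")
f <- suppressWarnings(
  parse_factor(c("apple", "banana", "bananana"), levels = fruit)
)

levels(f)     # only the declared levels
is.na(f)      # the unexpected value is NA
problems(f)   # one row describing the failure
```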


### __Dates, Date-Times, and Times__

You pick between three parsers depending on whether you want a
date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of
seconds since midnight). When called without any additional arguments:

* `parse_datetime()` expects an ISO8601 date-time. ISO8601 is
an international standard in which the components of a date are
organized from biggest to smallest: year, month, day, hour,
minute, second:

In [32]:
parse_datetime('2010-10-01T2010')

[1] "2010-10-01 20:10:00 UTC"

In [33]:
# if time is omitted, it will be set to midnight
parse_datetime('20101010')

[1] "2010-10-10 UTC"

* `parse_date()` expects a four-digit year, a `-` or `/`, the month, a `-`
or `/`, then the day:

In [34]:
parse_date('2010-10-01')

In [36]:
parse_date('03/03/1999', format = '%m/%d/%Y')

* `parse_time()` expects the hour, `:`, minutes, optionally `:` and
seconds, and an optional a.m./p.m. specifier:

In [38]:
library(hms)
parse_time('01:10 am')

01:10:00

In [39]:
parse_time('20:20:01')

20:20:01

If these defaults don’t work for your data you can supply your own
date-time format, built up of the following pieces:

* _Year_
    - `%Y` (4 digits).
    - `%y` (2 digits; 00-69 → 2000-2069, 70-99 → 1970-1999).
* _Month_ 
    - `%m` (2 digits).
    - `%b` (abbreviated name, like “Jan”).
    - `%B` (full name, “January”).

* _Day_
    - `%d` (2 digits).
    - `%e` (optional leading space).

* _Time_
    - `%H` (0-23 hour format).
    - `%I` (0-12, must be used with `%p`).
    - `%p` (a.m./p.m. indicator).
    - `%M` (minutes).
    - `%S` (integer seconds).
    - `%OS` (real seconds).
    - `%Z` (time zone as a name, e.g., `America/Chicago`). Note: beware
    of abbreviations. If you’re American, note that “EST” is a Canadian
    time zone that does not have daylight saving time. It is not
    Eastern Standard Time! We’ll come back to this in “Time Zones”.
    - `%z` (as offset from UTC, e.g., `+0800`).

* _Nondigits_
    - `%.` (skips one nondigit character).
    - `%*` (skips any number of nondigits).

If you’re using `%b` or `%B` with non-English month names, you’ll need
to set the `lang` argument to `locale()`. See the list of built-in
languages in `date_names_langs()`, or if your language is not already
included, create your own with `date_names()`:

In [40]:
parse_date('1 janvier 2015', '%d %B %Y', locale = locale('fr'))
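
The format pieces matter most when a string is ambiguous. As a small sketch with made-up inputs, the same text yields different dates depending on the format string, and `%b` picks up an abbreviated English month name under the default locale:

```r
library(readr)

# The same string means different dates under different formats.
month_first <- parse_date("01/02/15", "%m/%d/%y")
day_first   <- parse_date("01/02/15", "%d/%m/%y")

# Abbreviated month name plus a 24-hour time.
stamp <- parse_datetime("14 Oct 1979 20:10", "%d %b %Y %H:%M")

month_first
day_first
stamp
```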

## __Parsing a File__

In [41]:
challenge <- read_csv(readr_example('challenge.csv'))

Parsed with column specification:
cols(
  x = col_double(),
  y = col_logical()
)

“1000 parsing failures.
 row col           expected     actual                                                                           file
1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
.... ... .................. .......... ..............................................................................
See problems(...) for more details.
”


In [42]:
problems(challenge)

row,col,expected,actual,file
<int>,<chr>,<chr>,<chr>,<chr>
1001,y,1/0/T/F/TRUE/FALSE,2015-01-16,'/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
1002,y,1/0/T/F/TRUE/FALSE,2018-05-18,'/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
1003,y,1/0/T/F/TRUE/FALSE,2015-09-05,'/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
⋮,⋮,⋮,⋮,⋮
1998,y,1/0/T/F/TRUE/FALSE,2015-08-16,'/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
1999,y,1/0/T/F/TRUE/FALSE,2020-02-04,'/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'
2000,y,1/0/T/F/TRUE/FALSE,2019-01-06,'/home/concerta/R/x86_64-pc-linux-gnu-library/4.0/readr/extdata/challenge.csv'


A good strategy is to work column by column until there are no
problems remaining. Here all of the reported problems are in the `y`
column: readr guessed `col_logical()` from the first 1000 rows, but
from row 1001 onward the column holds dates such as 2015-01-16.
To fix the call, start by copying and pasting the column specification
into your original call, then adjust the type of `y`. Reading it as a
character column is a safe first step:

In [44]:
challenge <- readr_example('challenge.csv') %>%
    read_csv(col_types = cols(x = col_double(),
                              y = col_character()))

That gets rid of the parsing failures, but if you look at the last few rows, you’ll
see that `y` holds dates stored in a character vector:

In [45]:
tail(challenge)

x,y
<dbl>,<chr>
0.8052743,2019-11-21
0.1635163,2018-03-29
0.471939,2014-08-04
0.7183186,2015-08-16
0.2698786,2020-02-04
0.6082372,2019-01-06


You can fix that by specifying that y is a date column:

In [46]:
challenge <- readr_example('challenge.csv') %>%
    read_csv(col_types = cols(x = col_double(),
                              y = col_date()))

In [47]:
tail(challenge)

x,y
<dbl>,<date>
0.8052743,2019-11-21
0.1635163,2018-03-29
0.471939,2014-08-04
0.7183186,2015-08-16
0.2698786,2020-02-04
0.6082372,2019-01-06


Every `parse_xyz()` function has a corresponding `col_xyz()` function.
You use `parse_xyz()` when the data is in a character vector in
R already; you use `col_xyz()` when you want to tell readr how to
load the data.
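
The pairing can be sketched with dates. The temporary file and its two rows are made up for illustration; the point is that `parse_date()` and `col_date()` end up producing the same values.

```r
library(readr)

# Already in R as a character vector: use parse_date().
from_vector <- parse_date(c("2019-11-21", "2018-03-29"))

# Still in a file: declare the type with col_date() while reading.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("x,y", "0.81,2019-11-21", "0.16,2018-03-29"), csv_file)
from_file <- read_csv(
  csv_file,
  col_types = cols(x = col_double(), y = col_date())
)

# Both routes give the same dates.
all(from_file$y == from_vector)
```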

## __Writing to a File__

readr also comes with two useful functions for writing data back to
disk: `write_csv()` and `write_tsv()`. Both functions increase the
chances of the output file being read back in correctly by:
- Always encoding strings in UTF-8.
- Saving dates and date-times in ISO8601 format so they are
easily parsed elsewhere.

If you want to export a CSV file to Excel, use `write_excel_csv()`.
It writes a special character (a “byte order mark”) at the start of
the file, which tells Excel that you’re using the UTF-8 encoding.

The most important arguments are `x` (the data frame to save) and
`path` (the location to save it; newer readr releases call this
argument `file`). You can also specify how missing values are written
with `na`, and whether to append to an existing file with `append`:

`write_csv(challenge, "challenge.csv")`

Note that the type information is lost when you save to CSV: the file
stores only text, so column types have to be guessed or specified again
when it is read back in.
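
A minimal sketch of that loss, using a made-up data frame with an integer and a factor column. The output path is passed positionally so the call does not depend on whether the argument is named `path` or `file`.

```r
library(readr)

df <- data.frame(
  id = 1:3,                          # integer
  grade = factor(c("a", "b", "a"))   # factor
)

csv_file <- tempfile(fileext = ".csv")
write_csv(df, csv_file)
back <- read_csv(csv_file, col_types = cols())

# The values survive, but the types are re-guessed on the way back in.
class(df$id)
class(back$id)
class(df$grade)
class(back$grade)
```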

`write_rds()` and `read_rds()` avoid this problem. They are uniform
wrappers around the base functions `saveRDS()` and `readRDS()`, and
store data in R’s custom binary format called RDS, which keeps column
types intact.
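
For comparison, here is the same made-up data frame sent through an RDS round trip. This is a sketch rather than a benchmark; it only checks that the column types come back unchanged.

```r
library(readr)

df <- data.frame(
  id = 1:3,                          # integer
  grade = factor(c("a", "b", "a"))   # factor
)

rds_file <- tempfile(fileext = ".rds")
write_rds(df, rds_file)
back <- read_rds(rds_file)

# Integer stays integer and factor stays factor.
class(back$id)
class(back$grade)
```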

The feather package implements a fast binary file format that
can be shared across programming languages:

```r
library(feather)
write_feather(challenge, 'challenge.feather')
read_feather('challenge.feather')
```