# Chapter 8. Data Import with readr

In [1]:
library(feather)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Getting Started

> - `read_csv()` reads comma-delimited files, `read_csv2()` reads semicolon-separated files (common in countries where `,` is used as the decimal mark), `read_tsv()` reads tab-delimited files, and `read_delim()` reads files with any delimiter.
> - `read_fwf()` reads fixed-width files. You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`. `read_table()` reads a common variation of fixed-width files where columns are separated by white space.
> - `read_log()` reads Apache style log files. (But also check out **webreadr**, which is built on top of `read_log()` and provides many more helpful tools.)
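
A rough sketch of the two ways to describe fixed-width fields mentioned above. The inline data and the column names are made up for illustration, and wrapping the text in `I()` to mark it as literal data is an assumption about how you want to pass it; a file path works the same way.

```r
library(readr)

# Hypothetical fixed-width data: name in characters 1-5, age in characters 6-8.
fwf_text <- I("Alice 34\nBob   27\n")
col_spec <- cols(name = col_character(), age = col_integer())

# Describe the fields by their widths ...
by_width <- read_fwf(
  fwf_text,
  fwf_widths(c(5, 3), c("name", "age")),
  col_types = col_spec
)

# ... or by their start and end positions.
by_position <- read_fwf(
  fwf_text,
  fwf_positions(c(1, 6), c(5, 8), c("name", "age")),
  col_types = col_spec
)

by_width
by_position
```

Both calls should give the same two-row tibble; widths are usually easier to type, positions are handy when you only need a few fields out of a wide record.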

In [2]:
heights <- read_csv("../data/heights.csv")
summary(heights)
head(heights)


── Column specification ────────────────────────────────────────────────────────
cols(
  earn = col_double(),
  height = col_double(),
  sex = col_character(),
  ed = col_double(),
  age = col_double(),
  race = col_character()
)




      earn            height          sex                  ed      
 Min.   :   200   Min.   :57.50   Length:1192        Min.   : 3.0  
 1st Qu.: 10000   1st Qu.:64.01   Class :character   1st Qu.:12.0  
 Median : 20000   Median :66.45   Mode  :character   Median :13.0  
 Mean   : 23155   Mean   :66.92                      Mean   :13.5  
 3rd Qu.: 30000   3rd Qu.:69.85                      3rd Qu.:16.0  
 Max.   :200000   Max.   :77.05                      Max.   :18.0  
      age            race          
 Min.   :18.00   Length:1192       
 1st Qu.:29.00   Class :character  
 Median :38.00   Mode  :character  
 Mean   :41.38                     
 3rd Qu.:51.00                     
 Max.   :91.00                     

earn,height,sex,ed,age,race
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
50000,74.42444,male,16,45,white
60000,65.53754,female,16,58,white
30000,63.6292,female,16,29,white
50000,63.10856,female,16,91,other
51000,63.40248,female,17,39,white
9000,64.39951,female,15,26,white


In [3]:
read_csv(
  "a,b,c
   1,2,3
   4,5,6"
)
read_csv(
  "The first line of metadata
   The second line of metadata
   x,y,z
   1,2,3",
  skip = 2
)
read_csv(
  "# A comment I want to skip
   x,y,z
   1,2,3",
  comment = "#"
)
read_csv(
  "1,2,3
   4,5,6",
  col_names = FALSE
)
read_csv(
  "1,2,3
   4,5,6",
  col_names = c("x", "y", "z")
)

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


x,y,z
<dbl>,<dbl>,<dbl>
1,2,3


x,y,z
<dbl>,<dbl>,<dbl>
1,2,3


X1,X2,X3
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


x,y,z
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


> There are a few good reasons to favor **readr** functions over the base equivalents:
> - They are typically much faster (~10x) than their base equivalents. Long-running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try `data.table::fread()`. It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
> - They produce tibbles, and they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.
> - They are more reproducible. Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.
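
To make the point about column names concrete, here is a small comparison, assuming a made-up header that contains a space. The data and variable names are illustrative only.

```r
library(readr)

csv_text <- "first name,group\nAda,a\nGrace,b\n"

# Base R rewrites the header into a syntactic name and returns a data.frame.
base_df <- read.csv(text = csv_text)
names(base_df)

# readr keeps the header as written and returns a tibble.
readr_df <- read_csv(I(csv_text), col_types = cols(.default = col_character()))
names(readr_df)
```

With base R the first column comes back as `first.name`; with **readr** it stays `first name`, so you refer to it with backticks.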

### Exercises

1. What function would you use to read a file where fields are separated with “|”?
1. Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?
1. What are the most important arguments to `read_fwf()`?
1. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, `read_csv()` assumes that the quoting character will be ", and if you want to change it you’ll need to use `read_delim()` instead. What arguments do you need to specify to read the following text into a data frame?

    ```
    "x,y\n1,'a,b'"
    ```

1. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

    ```r
    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")
    ```

In [4]:
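# 1.
# read_delim() does not trim whitespace by default, so the indented
# second line turns x into a character column (pass trim_ws = TRUE to avoid this).
read_delim(
  "1|2|3
   4|5|6",
  col_names = c("x", "y", "z"),
  delim = "|"
)

# 4. read_csv() also accepts a quote argument, so read_delim() is not required.
read_csv("x,y\n1,'a,b'", quote = "'")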

x,y,z
<chr>,<dbl>,<dbl>
1,2,3
4,5,6


x,y
<dbl>,<chr>
1,"a,b"


In [5]:
# 5.
read_csv("a,b\n1,2,3\n4,5,6", col_names = c("a", "b", "c"), skip = 1)
read_csv("a,b,c\n1,2\n1,2,3,4", col_names = c("a", "b", "c", "d"), skip = 1)
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_delim("a;b\n1;3", delim = ";")

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


“1 parsing failure.
row col  expected    actual         file
  1  -- 4 columns 2 columns literal data
”


a,b,c,d
<dbl>,<dbl>,<dbl>,<dbl>
1,2,,
1,2,3.0,4.0


“2 parsing failures.
row col                     expected    actual         file
  1  a  closing quote at end of file           literal data
  1  -- 2 columns                    1 columns literal data
”


a,b
<dbl>,<chr>
1,


a,b
<chr>,<chr>
1,2
a,b


a,b
<dbl>,<dbl>
1,3


## Parsing a Vector

In [6]:
str(parse_logical(c("TRUE", "FALSE", "NA")))
str(parse_integer(c("1", "231", ".", "456"), na = "."))
str(parse_date(c("2010-01-01", "1979-10-14")))

 logi [1:3] TRUE FALSE NA
 int [1:4] 1 231 NA 456
 Date[1:2], format: "2010-01-01" "1979-10-14"


In [7]:
parse_integer(c("123", "345", "abc", "123.45"))
problems(parse_integer(c("123", "345", "abc", "123.45")))

“2 parsing failures.
row col               expected actual
  3  -- an integer             abc   
  4  -- no trailing characters 123.45
”


row,col,expected,actual
<int>,<int>,<chr>,<chr>
3,,an integer,abc
4,,no trailing characters,123.45


> - `parse_logical()` and `parse_integer()` parse logicals and integers, respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.
> - `parse_double()` is a strict numeric parser, and `parse_number()` is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.
> - `parse_character()` seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.
> - `parse_factor()` creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
> - `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to parse various date and time specifications. These are the most complicated because there are so many different ways of writing dates.

### Numbers
> - People write numbers differently in different parts of the world. For example, some countries use `.` in between the integer and fractional parts of a real number, while others use `,`.
> - Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.
> - Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.

In [8]:
parse_double("1,23", locale = locale(decimal_mark = ","))
parse_number("It cost $123.45")
parse_number("123'456'789", locale = locale(grouping_mark = "'"))

### Strings

> Each hexadecimal number represents a byte of information: `48` is `H`, `61` is `a`, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
>
> Things get more complicated for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding. For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages). In Latin1, the byte b1 is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
>
> **readr** uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing. This is a good default, but will fail for data produced by older systems that don’t understand UTF-8. If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you’ll get complete gibberish.

> If you’d like to learn more I’d recommend reading the detailed explanation at *http://kunststube.net/encoding/*.
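
The Latin1 versus Latin2 example from the text can be tried directly. This is a minimal sketch that assumes your platform's iconv knows the encoding names `ISO-8859-1` and `ISO-8859-2`; the variable name is arbitrary.

```r
library(readr)

# The single byte 0xb1, as it might appear in a file from an older system.
byte_b1 <- rawToChar(as.raw(0xb1))

# The same byte decodes to a different character under each encoding.
as_latin1 <- parse_character(byte_b1, locale = locale(encoding = "ISO-8859-1"))
as_latin2 <- parse_character(byte_b1, locale = locale(encoding = "ISO-8859-2"))

as_latin1
as_latin2
```

If the encodings are available, the first result should be the plus-minus sign and the second the Polish letter mentioned above.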

In [9]:
charToRaw("Hadley")

[1] 48 61 64 6c 65 79

In [10]:
x <- "El Ni\xf1o was particularly bad this year"
guess_encoding(charToRaw(x))
parse_character(x, locale = locale(encoding = "Latin1"))

encoding,confidence
<chr>,<dbl>
ISO-8859-1,0.46
ISO-8859-9,0.23


In [11]:
x <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
guess_encoding(charToRaw(x))
parse_character(x, locale = locale(encoding = "Shift-JIS"))

encoding,confidence
<chr>,<dbl>
KOI8-R,0.42


### Factors

In [12]:
problems(parse_factor(c("apple", "banana", "bananana"), levels = c("apple", "banana")))
parse_factor(c("apple", "banana", "bananana"), levels = c("apple", "banana"))

row,col,expected,actual
<int>,<int>,<chr>,<chr>
3,,value in level set,bananana


“1 parsing failure.
row col           expected   actual
  3  -- value in level set bananana
”


### Dates, Date-Times, and Times

In [13]:
parse_datetime("20101010")
parse_time("01:10 am")
parse_date("01/02/15", "%m/%d/%y")

date_names_langs()
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))

[1] "2010-10-10 UTC"

01:10:00

### Exercises

1. What are the most important arguments to `locale()`?
1. What happens if you try and set `decimal_mark` and `grouping_mark` to the same character? What happens to the default value of `grouping_mark` when you set `decimal_mark` to `","`? What happens to the default value of `decimal_mark` when you set the `grouping_mark` to `"."`?
1. I didn’t discuss the `date_format` and `time_format` options to `locale()`. What do they do? Construct an example that shows when they might be useful.
1. If you live outside the US, create a new locale object that encapsulates the settings for the types of files you read most commonly.
1. What’s the difference between `read_csv()` and `read_csv2()`?
1. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
1. Generate the correct format string to parse each of the following dates and times:

    ```r
    d1 <- "January 1, 2010"
    d2 <- "2015-Mar-07"
    d3 <- "06-Jun-2017"
    d4 <- c("August 19 (2015)", "July 1 (2015)")
    d5 <- "12/30/14" # Dec 30, 2014
    t1 <- "1705"
    t2 <- "11:15:10.12 PM"
    ```

In [14]:
# 7.
parse_date("January 1, 2010", "%B %d, %Y")
parse_date("2015-Mar-07", "%Y-%b-%d")
parse_date("06-Jun-2017", "%d-%b-%Y")
parse_date(c("August 19 (2015)", "July 1 (2015)"), "%B %d (%Y)")
parse_date("12/30/14", "%m/%d/%y") # Dec 30, 2014
parse_time("1705", "%H%M")
parse_time("11:15:10.12 PM", "%I:%M:%OS %p")

17:05:00

23:15:10.12

## Parsing a File

> **readr** guesses the type of each column: it reads the first 1000 rows and applies some (moderately conservative) heuristics to them.

> I highly recommend always supplying `col_types`, building up from the printout provided by **readr**. This ensures that you have a consistent and reproducible data import script. If you rely on the default guesses and your data changes, **readr** will continue to read it in. If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
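
A sketch of what a strict import could look like, using made-up inline data with one bad value. The names `csv_text`, `spec`, and `outcome` are placeholders, and the `tryCatch()` wrapper is only there so the example keeps running after the error.

```r
library(readr)

# Hypothetical inline data with one value that is not a valid date.
csv_text <- I("x,y\n1,2010-01-01\n2,not-a-date\n")

# Spell out the expected type of every column instead of relying on guessing.
spec <- cols(
  x = col_double(),
  y = col_date()
)

# readr warns about the bad value; the warning is silenced to keep the output short.
df <- suppressWarnings(read_csv(csv_text, col_types = spec))
problems(df)

# stop_for_problems() turns the recorded parsing problems into an error.
outcome <- tryCatch(
  {
    stop_for_problems(df)
    "no problems"
  },
  error = function(e) "stopped"
)
outcome
```

In a real script you would call `stop_for_problems()` without the wrapper, so a changed input file halts the run instead of silently producing `NA`s.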

> If you’re having major parsing problems, sometimes it’s easier to just read into a character vector of lines with `read_lines()`, or even a character vector of length 1 with `read_file()`. Then you can use the string parsing skills you’ll learn later to parse more exotic formats.
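
As a small illustration of the line-by-line fallback, the following reads a made-up, non-tabular input with `read_lines()` and pulls the numbers out with base string functions. The input format is invented for the example.

```r
library(readr)

# Hypothetical input: a comment line followed by key=value records.
raw_text <- I("# report v1\nid=1;value=10\nid=2;value=20")

lines <- read_lines(raw_text)

# Drop the comment line, then keep only the number after "value=".
records <- lines[!startsWith(lines, "#")]
values <- as.numeric(sub(".*value=", "", records))
values
```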

In [15]:
guess_parser("2010-10-01")
guess_parser("15:01")
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
guess_parser(c("12,352,561"))

parse_guess("2010-10-10")

In [16]:
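# Default guessing only inspects the first 1000 rows, so y is guessed as logical and later rows fail.
problems(read_csv(readr_example("challenge.csv"))) %>% head(10)
# Supplying the column types ("d" = double, "D" = date) avoids the bad guess.
summary(read_csv(readr_example("challenge.csv"), col_types = "dD"))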


── Column specification ────────────────────────────────────────────────────────
cols(
  x = col_double(),
  y = col_logical()
)




row,col,expected,actual,file
<int>,<chr>,<chr>,<chr>,<chr>
1001,y,1/0/T/F/TRUE/FALSE,2015-01-16,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1002,y,1/0/T/F/TRUE/FALSE,2018-05-18,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1003,y,1/0/T/F/TRUE/FALSE,2015-09-05,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1004,y,1/0/T/F/TRUE/FALSE,2012-11-28,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1005,y,1/0/T/F/TRUE/FALSE,2020-01-13,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1006,y,1/0/T/F/TRUE/FALSE,2016-04-17,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1007,y,1/0/T/F/TRUE/FALSE,2011-05-14,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1008,y,1/0/T/F/TRUE/FALSE,2020-07-18,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1009,y,1/0/T/F/TRUE/FALSE,2011-04-30,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'
1010,y,1/0/T/F/TRUE/FALSE,2010-05-11,'/opt/conda/lib/R/library/readr/extdata/challenge.csv'


       x                  y             
 Min.   :   0.001   Min.   :2010-01-03  
 1st Qu.:   0.497   1st Qu.:2013-04-08  
 Median :   0.998   Median :2016-08-11  
 Mean   :1265.592   Mean   :2016-09-11  
 3rd Qu.:2577.250   3rd Qu.:2020-01-19  
 Max.   :4999.000   Max.   :2023-09-06  
                    NA's   :1000        

In [17]:
summary(read_csv(readr_example("challenge.csv"), guess_max = 1001))


── Column specification ────────────────────────────────────────────────────────
cols(
  x = col_double(),
  y = col_date(format = "")
)




       x                  y             
 Min.   :   0.001   Min.   :2010-01-03  
 1st Qu.:   0.497   1st Qu.:2013-04-08  
 Median :   0.998   Median :2016-08-11  
 Mean   :1265.592   Mean   :2016-09-11  
 3rd Qu.:2577.250   3rd Qu.:2020-01-19  
 Max.   :4999.000   Max.   :2023-09-06  
                    NA's   :1000        

In [18]:
# Read every column as character, then let type_convert() guess the types from all rows.
summary(type_convert(read_csv(readr_example("challenge.csv"), col_types = cols(.default = "c"))))


── Column specification ────────────────────────────────────────────────────────
cols(
  x = col_double(),
  y = col_date(format = "")
)




       x                  y             
 Min.   :   0.001   Min.   :2010-01-03  
 1st Qu.:   0.497   1st Qu.:2013-04-08  
 Median :   0.998   Median :2016-08-11  
 Mean   :1265.592   Mean   :2016-09-11  
 3rd Qu.:2577.250   3rd Qu.:2020-01-19  
 Max.   :4999.000   Max.   :2023-09-06  
                    NA's   :1000        

## Writing to a File

> If you want to export a CSV file to Excel, use `write_excel_csv()`—this writes a special character (a “byte order mark”) at the start of the file, which tells Excel that you’re using the UTF-8 encoding.
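
One way to see the byte order mark is to inspect the first bytes of the written file. This is a minimal sketch with a throwaway one-row data frame and a temporary file.

```r
library(readr)

path <- tempfile(fileext = ".csv")
write_excel_csv(data.frame(city = "Malm\u00f6"), path)

# The first three bytes should be the UTF-8 byte order mark: ef bb bf.
bom <- readBin(path, what = "raw", n = 3)
bom
```

A file written with plain `write_csv()` would start directly with the header text instead.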

> The **feather** package implements a fast binary file format that can be shared across programming languages.

In [19]:
# Type information is lost when saving to CSV: on re-reading, y is guessed as logical again.
read_csv(readr_example("challenge.csv"), col_types = "dD") %>% write_csv("/tmp/challenge.csv")
problems(read_csv("/tmp/challenge.csv")) %>% head(10)


── Column specification ────────────────────────────────────────────────────────
cols(
  x = col_double(),
  y = col_logical()
)




row,col,expected,actual,file
<int>,<chr>,<chr>,<chr>,<chr>
1001,y,1/0/T/F/TRUE/FALSE,2015-01-16,'/tmp/challenge.csv'
1002,y,1/0/T/F/TRUE/FALSE,2018-05-18,'/tmp/challenge.csv'
1003,y,1/0/T/F/TRUE/FALSE,2015-09-05,'/tmp/challenge.csv'
1004,y,1/0/T/F/TRUE/FALSE,2012-11-28,'/tmp/challenge.csv'
1005,y,1/0/T/F/TRUE/FALSE,2020-01-13,'/tmp/challenge.csv'
1006,y,1/0/T/F/TRUE/FALSE,2016-04-17,'/tmp/challenge.csv'
1007,y,1/0/T/F/TRUE/FALSE,2011-05-14,'/tmp/challenge.csv'
1008,y,1/0/T/F/TRUE/FALSE,2020-07-18,'/tmp/challenge.csv'
1009,y,1/0/T/F/TRUE/FALSE,2011-04-30,'/tmp/challenge.csv'
1010,y,1/0/T/F/TRUE/FALSE,2010-05-11,'/tmp/challenge.csv'


In [20]:
read_csv(readr_example("challenge.csv"), col_types = "dD") %>% write_feather("/tmp/challenge.feather")
summary(read_feather("/tmp/challenge.feather"))

       x                  y             
 Min.   :   0.001   Min.   :2010-01-03  
 1st Qu.:   0.497   1st Qu.:2013-04-08  
 Median :   0.998   Median :2016-08-11  
 Mean   :1265.592   Mean   :2016-09-11  
 3rd Qu.:2577.250   3rd Qu.:2020-01-19  
 Max.   :4999.000   Max.   :2023-09-06  
                    NA's   :1000        

> For rectangular data:
> - **haven** reads SPSS, Stata, and SAS files.
> - **readxl** reads Excel files (both *.xls* and *.xlsx*).
> - **DBI**, along with a database-specific backend (e.g., **RMySQL**, **RSQLite**, **RPostgreSQL**, etc.) allows you to run SQL queries against a database and return a data frame.
> 
> For hierarchical data: use **jsonlite** (by Jeroen Ooms) for JSON, and **xml2** for XML. Jenny Bryan has some excellent worked examples at *https://jennybc.github.io/purrr-tutorial/*.
> 
> For other file types, try the [*R data import/export manual*](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.
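
For the JSON case, a minimal sketch with **jsonlite**, assuming a made-up array of records that all share the same fields:

```r
library(jsonlite)

# Hypothetical JSON: an array of records with identical keys.
json_text <- '[{"name": "Ada", "langs": 2}, {"name": "Grace", "langs": 3}]'

# fromJSON() simplifies an array of flat records into a data frame.
people <- fromJSON(json_text)
people
```

More deeply nested JSON does not simplify this cleanly and usually needs some reshaping afterwards.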