In [3]:
library(tidyverse)

# Introduction

The key problem that readr solves is **parsing** a flat file into a tibble. Parsing is the process of taking a text file and turning it into a rectangular tibble where each column has the appropriate type. Parsing takes place in three basic stages:
* **Rectangle parsers**: The flat file is parsed into a rectangular matrix of strings.

* **Column specification**: The type of each column is determined.

* **Vector parsers**: Each column of strings is parsed into a vector of a more specific type.

Each `parse_*()` is coupled with a `col_*()` function, which will be used in the process of parsing a complete tibble.
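As a small illustration of that pairing, the sketch below parses the same strings once as a vector with `parse_integer()` and once as a column with `col_integer()`. The inline CSV text and the column name `x` are made up for the example; it relies on readr treating a string that contains a newline as literal data rather than a file path.

```r
library(readr)

# Vector parser: character vector in, integer vector out.
v <- parse_integer(c("1", "2", "3"))

# Matching column specification, applied while reading a whole (inline) file.
df <- read_csv("x\n1\n2\n3\n", col_types = cols(x = col_integer()))

# Both routes should yield the same integer values.
all(v == df$x)
```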

# Vector parsers

It’s easiest to learn the vector parsers using the `parse_*()` functions. These all take a character vector and some options. They return a new vector the same length as the old, along with an attribute describing any problems.

### Atomic vectors

**`parse_logical()`**, **`parse_integer()`**, **`parse_double()`**, and **`parse_character()`** are straightforward parsers that produce the corresponding atomic vector.

In [4]:
parse_integer(c('1', '2', '3', '4'))

In [5]:
parse_double(c('1.23', '2.53'))

In [8]:
parse_logical(c('true', 'false'))

By default, readr expects `.` as the decimal mark and `,` as the grouping mark. You can override this default using **`locale()`**.
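A minimal sketch of overriding the marks with `locale()`. The input strings and variable names are invented for illustration; `decimal_mark` and `grouping_mark` are the `locale()` arguments that control this behaviour.

```r
library(readr)

# Default locale: "." is the decimal mark.
plain <- parse_double("1.23")

# European-style input: tell readr that "," is the decimal mark.
comma <- parse_double("1,23", locale = locale(decimal_mark = ","))

# parse_number() also handles grouping marks; here "." groups and "," is the decimal mark.
grouped <- parse_number(
  "1.234,56",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)

c(plain, comma, grouped)
```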

#### Flexible numeric parser

**`parse_integer()`** and **`parse_double()`** are strict: the input string must be a single number with no leading or trailing characters. **`parse_number()`** is more flexible: it ignores non-numeric prefixes and suffixes, and knows how to deal with grouping marks. This makes it suitable for reading currencies and percentages:

In [9]:
try(parse_integer('100$'))

"1 parsing failure.
row col               expected actual
  1  -- no trailing characters   100$
"

In [12]:
parse_number('100$')

parse_number('99.25%')
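To make the contrast between the strict and flexible parsers concrete, the sketch below runs both over the same made-up vector of strings. The strict parser is wrapped in `suppressWarnings()` only to keep the output short.

```r
library(readr)

prices <- c("$1,234.56", "99.25%", "It cost $123.45")

# parse_number() drops the non-numeric prefix/suffix and the grouping mark.
flexible <- parse_number(prices)
flexible

# The strict parser rejects the same input: each failure becomes NA,
# and the details are attached as a "problems" attribute.
strict <- suppressWarnings(parse_double(prices))
strict
problems(strict)
```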

### Date/times

readr supports three types of date/time data:

* dates: number of days since 1970-01-01.
* times: number of seconds since midnight.
* datetimes: number of seconds since midnight 1970-01-01.

In [13]:
parse_datetime('2010-10-01 21:45')

[1] "2010-10-01 21:45:00 UTC"

In [14]:
parse_date('2010-10-01')

In [15]:
parse_time('1:00pm')

13:00:00
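The numeric representations listed above can be checked by converting parsed values with `as.numeric()`. This is only a sketch with hand-picked inputs close to the origin of each scale; the datetime is interpreted as UTC, which is readr's default time zone.

```r
library(readr)

d  <- parse_date("1970-01-02")
dt <- parse_datetime("1970-01-01 00:01:00")
tm <- parse_time("00:01:00")

as.numeric(d)   # days since 1970-01-01
as.numeric(dt)  # seconds since 1970-01-01 00:00:00 UTC
as.numeric(tm)  # seconds since midnight
```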

Each function takes a `format` argument which describes the format of the string. If not specified, it uses a default value:

* **`parse_datetime()`** recognises **ISO8601** datetimes.

* **`parse_date()`** uses the `date_format` specified by the `locale()`. The default value is `%AD` which uses an automatic date parser that recognises dates of the format `Y-m-d` or `Y/m/d`.

* **`parse_time()`** uses the `time_format` specified by the `locale()`. The default value is `%At` which uses an automatic time parser that recognises times of the form `H:M` optionally followed by seconds and am/pm.

In most cases, you will need to supply a format:

In [20]:

parse_date("1 January, 2010", format = '%d %B, %Y')

In [21]:

parse_date("06/10/01", format = '%d/%m/%y')
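The format matters because the same string can denote different dates. The sketch below parses one ambiguous string under three formats, then uses a French locale for a month name; the input strings and variable names are chosen for illustration only.

```r
library(readr)

# The same string means different dates under different formats.
month_first <- parse_date("01/02/15", format = "%m/%d/%y")
day_first   <- parse_date("01/02/15", format = "%d/%m/%y")
year_first  <- parse_date("01/02/15", format = "%y/%m/%d")

# Month names are locale-dependent; %B needs the matching locale.
french <- parse_date("1 janvier 2015", format = "%d %B %Y", locale = locale("fr"))

c(month_first, day_first, year_first, french)
```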

### Factors

When reading a column that has a known set of values, you can read directly into a factor. `parse_factor()` will generate a warning if a value is not in the supplied levels.

In [24]:
parse_factor(c('Banana', 'Coconut'), levels = c('Banana', 'Coconut', 'Apple'))

In [26]:
result <- parse_factor(c('Banana', 'Durian'), levels = c('Apple', 'Banana', 'Coconut'))
result

"1 parsing failure.
row col           expected actual
  2  -- value in level set Durian
"

Inspect the problems in detail using **`problems()`**:

In [27]:
problems(result)

row,col,expected,actual
2,,value in level set,Durian
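As a further sketch, the example below shows what happens to the unexpected value itself, and how `levels = NULL` lets readr take the levels from the data instead. The vector `fruit` is made up; `suppressWarnings()` is used only to keep the output quiet.

```r
library(readr)

fruit <- c("Banana", "Durian", "Banana")

# With explicit levels, the unexpected value becomes NA (with a warning).
strict <- suppressWarnings(
  parse_factor(fruit, levels = c("Apple", "Banana", "Coconut"))
)
is.na(strict)

# With levels = NULL, the levels are taken from the data in order of appearance.
loose <- parse_factor(fruit, levels = NULL)
levels(loose)
```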


# Column specification

readr uses some heuristics to guess the type of each column. You can access these results yourself using `guess_parser()`:

In [28]:
guess_parser(c('1', '12.5', '37'))

In [30]:
guess_parser(c('true', 'false', 'true'))

In [31]:
guess_parser(c('Hello', '5.23'))

In [34]:
guess_parser('2001/10/06')

Guesses are fairly strict. For example, we don’t guess that currencies are numbers, even though we can parse them:

In [36]:
guess_parser('100%')

guess_parser('5$')
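To see the guesses side by side, the sketch below applies `guess_parser()` to several single strings and collects the results in a named vector. The inputs are illustrative, and the exact guess for plain whole numbers may differ between readr versions.

```r
library(readr)

inputs <- c("37", "12.5", "true", "2001/10/06", "1,000", "100%", "5$")

# Guess each input on its own and collect the results in a named vector.
guesses <- vapply(inputs, guess_parser, character(1))
guesses

# Strings that are guessed as "character" can still be parsed on request.
forced <- parse_number(c("100%", "5$"))
forced
```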

There are two parsers that will never be guessed: `col_skip()` and `col_factor()`. You will always need to supply these explicitly.

You can see the specification that readr would generate for a given file by using `spec_csv()`, `spec_tsv()` and so on:

In [38]:
path <- readr_example('mtcars.csv')
path

In [39]:
x <- spec_csv(path)
x


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  mpg = col_double(),
  cyl = col_double(),
  disp = col_double(),
  hp = col_double(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_double(),
  am = col_double(),
  gear = col_double(),
  carb = col_double()
)




For bigger files, you can often make the specification simpler by changing the default column type using `cols_condense()`:

In [41]:
mtcars_spec <- spec_csv(path)

mtcars_spec


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  mpg = col_double(),
  cyl = col_double(),
  disp = col_double(),
  hp = col_double(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_double(),
  am = col_double(),
  gear = col_double(),
  carb = col_double()
)




In [42]:
cols_condense(mtcars_spec)

cols(
  .default = col_double()
)

By default readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in `challenge.csv` the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows:

In [45]:
challenge_path <- readr_example('challenge.csv')

In [50]:
challenge <- read_csv(challenge_path)


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  x = col_double(),
  y = col_logical()
)

"1000 parsing failures.
 row col           expected     actual                                                                file
1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
.... ... .................. .......... ...................................................................
See problems(...) for more details.
"

In [49]:
challenge <- read_csv(challenge_path, guess_max = 10001)


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  x = col_double(),
  y = col_date(format = "")
)



Another way is to manually specify the `col_types`, as described below.



# Rectangle parsers

readr comes with five parsers for rectangular file formats:

* **`read_csv()`** and **`read_csv2()`** for csv files
* **`read_tsv()`** for tab-separated files
* **`read_fwf()`** for fixed-width files
* **`read_log()`** for web log files

Each of these functions first calls `spec_xxx()` (as described above), and then parses the file according to that column specification:

In [52]:
df <- read_csv(challenge_path)


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  x = col_double(),
  y = col_logical()
)

"1000 parsing failures.
 row col           expected     actual                                                                file
1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
.... ... .................. .......... ...................................................................
See problems(...) for more details.
"

The rectangular parsing functions almost always succeed; they’ll only fail if the format is severely messed up. Instead, readr will generate a data frame of problems. The first few will be printed out, and you can access them all with **`problems()`**:

In [53]:
problems(df)

row,col,expected,actual,file
1001,y,1/0/T/F/TRUE/FALSE,2015-01-16,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1002,y,1/0/T/F/TRUE/FALSE,2018-05-18,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1003,y,1/0/T/F/TRUE/FALSE,2015-09-05,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1004,y,1/0/T/F/TRUE/FALSE,2012-11-28,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1005,y,1/0/T/F/TRUE/FALSE,2020-01-13,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1006,y,1/0/T/F/TRUE/FALSE,2016-04-17,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1007,y,1/0/T/F/TRUE/FALSE,2011-05-14,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1008,y,1/0/T/F/TRUE/FALSE,2020-07-18,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1009,y,1/0/T/F/TRUE/FALSE,2011-04-30,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1010,y,1/0/T/F/TRUE/FALSE,2010-05-11,'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
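The same mechanism can be reproduced on a tiny inline file, which may be easier to reason about than `challenge.csv`. In this sketch the data and the name `small` are made up, and the row numbering reported by `problems()` can differ between readr versions.

```r
library(readr)

# The second data row cannot be parsed as an integer.
small <- suppressWarnings(
  read_csv("x\n1\nb\n3\n", col_types = cols(x = col_integer()))
)

small$x          # the bad value is stored as NA
problems(small)  # one row describing what was expected and what was found
```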


You’ve already seen one way of handling bad guesses: increasing the number of rows used to guess the type of each column.

In [54]:
df2 <- read_csv(challenge_path, guess_max = 1001)


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  x = col_double(),
  y = col_date(format = "")
)



Another approach is to manually supply the column specification.

### Overriding the defaults

In the previous examples, you may have noticed that readr printed the column specification that it used to parse the file:

In [56]:
mtcars_path <- readr_example('mtcars.csv')

In [58]:
mtcars_data <- read_csv(mtcars_path)


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  mpg = col_double(),
  cyl = col_double(),
  disp = col_double(),
  hp = col_double(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_double(),
  am = col_double(),
  gear = col_double(),
  carb = col_double()
)



You can also access it after the fact using **`spec()`**:

In [59]:
spec(mtcars_data)

cols(
  mpg = col_double(),
  cyl = col_double(),
  disp = col_double(),
  hp = col_double(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_double(),
  am = col_double(),
  gear = col_double(),
  carb = col_double()
)

(This also allows you to access the full column specification if you’re reading a very wide file. By default, readr will only print the specification of the first 20 columns.)

If you want to manually specify the column types, you can start by copying and pasting this code, and then tweaking it to fix the parsing problems.

In [60]:
spec(read_csv(challenge_path))


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  x = col_double(),
  y = col_logical()
)

"1000 parsing failures.
 row col           expected     actual                                                                file
1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
.... ... .................. .......... ...................................................................
See problems(...) for more details.
"


In [65]:
df3 <- read_csv(challenge_path)


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  x = col_double(),
  y = col_logical()
)

"1000 parsing failures.
 row col           expected     actual                                                                file
1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 'C:/Users/dell/Anaconda3/Lib/R/library/readr/extdata/challenge.csv'
.... ... .................. .......... ...................................................................
See problems(...) for more details.
"

In [67]:
read_csv(challenge_path, col_types = cols(
    x = col_double(),
    y = col_date(format = '')
))

x,y
404,
4172,
3004,
787,
37,
2332,
2489,
1449,
3665,
3863,


In general, it’s good practice to supply an explicit column specification. It is more work, but it ensures that you get warnings if the data changes in unexpected ways. To be really strict, you can use `stop_for_problems(df3)`. This will throw an error if there are any parsing problems, forcing you to fix those problems before proceeding with the analysis.
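A minimal sketch of the strict approach on inline data. The data frame `bad` is made up, and `tryCatch()` is used here only so that the example runs to completion instead of halting on the error.

```r
library(readr)

bad <- suppressWarnings(
  read_csv("x\n1\nb\n", col_types = cols(x = col_integer()))
)

# stop_for_problems() turns recorded parsing problems into an error.
outcome <- tryCatch(
  {
    stop_for_problems(bad)
    "no problems"
  },
  error = function(e) "stopped"
)
outcome
```

In a real script you would normally call `stop_for_problems()` directly, so that the analysis halts as soon as the input no longer matches the specification.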

### Available column specifications

The available specifications are (with string abbreviations in brackets):

* **`col_logical()`** [l], containing only T, F, TRUE or FALSE.
* **`col_integer()`** [i], integers.
* **`col_double()`** [d], doubles.
* **`col_character()`** [c], everything else.
* **`col_factor(levels, ordered)`** [f], a fixed set of values.
* **`col_date(format = "")`** [D], with the locale’s `date_format`.
* **`col_time(format = "")`** [t], with the locale’s `time_format`.
* **`col_datetime(format = "")`** [T], ISO8601 date times.
* **`col_number()`** [n], numbers containing the `grouping_mark`.
* **`col_skip()`** [_, -], don’t import this column.
* **`col_guess()`** [?], parse using the “best” type based on the input.

Use the `col_types` argument to override the default choices. There are two ways to use it:

*  With a string: `"dc__d"` reads the first column as double, the second as character, skips the next two, and reads the last column as a double. (There’s no way to use this form with types that take additional parameters.)

In [70]:
read_csv(mtcars_path, col_types = 'dddddddiiii') %>% glimpse()

Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17...
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,...
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,...
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, ...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3....
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,...
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,...
$ vs   <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
$ am   <int> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,...
$ gear <int> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,...
$ carb <int> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,...


*  With a (named) list of col objects:

```r
iris_data <- read_csv(
    'iris.csv',
    col_types = cols(
        Sepal.Length = col_double(),
        Sepal.Width = col_double(),
        Petal.Length = col_double(),
        Petal.Width = col_double(),
        Species = col_factor(levels = c("setosa", "versicolor", "virginica"))
    )
)
```

  
```r
# Or with their abbreviations:
iris_data <- read_csv(
    'iris.csv',
    col_types = cols(
        Sepal.Length = 'd',
        Sepal.Width = 'd',
        Petal.Length = 'd',
        Petal.Width = 'd',
        Species = col_factor(c("setosa", "versicolor", "virginica"))
    )
)
```

    
```r
# Any omitted columns will be parsed automatically,
# so the previous call will lead to the same result as:
read_csv(
    'iris.csv',
    col_types = cols(
        Species = col_factor(c("setosa", "versicolor", "virginica"))
    )
)
```
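The calls above assume an `iris.csv` file on disk. As a self-contained sketch of the same two styles, the example below uses a two-column inline excerpt (made up for illustration) and compares the resulting column classes.

```r
library(readr)

csv <- "Sepal.Length,Species\n5.1,setosa\n7.0,versicolor\n"

# Compact string: one letter per column ("d" = double, "c" = character).
by_string <- read_csv(csv, col_types = "dc")

# cols() with an abbreviation, plus a col_*() object where parameters are needed.
by_cols <- read_csv(csv, col_types = cols(
  Sepal.Length = "d",
  Species = col_factor(levels = c("setosa", "versicolor", "virginica"))
))

sapply(by_string, class)
sapply(by_cols, class)
```

The compact string cannot express the factor levels, which is why the second call falls back to `col_factor()` for `Species`.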

If you only want to read specified columns, use **`cols_only()`**:

In [81]:
read_csv(mtcars_path, 
         col_types = cols_only(
             mpg = 'd',
             disp = 'd'
         )) %>% head()

mpg,disp
21.0,160
21.0,160
22.8,108
21.4,258
18.7,360
18.1,225


### Output

The output of all these functions is a tibble. Note that characters are never automatically converted to factors (i.e. no more `stringsAsFactors = FALSE`) and column names are left as is, not munged into valid R identifiers (i.e. there is no `check.names = TRUE`). Row names are never set.

Attributes store the column specification (**`spec()`**) and any parsing problems (**`problems()`**).
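These properties can be checked on a small inline file. In the sketch below the data and variable names are invented, and base R's `read.csv()` is included only as a point of comparison for column-name handling.

```r
library(readr)

csv <- "my col,grp\n1,a\n2,b\n"

tbl  <- read_csv(csv, col_types = "dc")
base <- read.csv(text = csv)

names(tbl)      # column names kept as written
names(base)     # base R rewrites them into syntactic names
class(tbl$grp)  # character, not factor
spec(tbl)       # the column specification stored with the tibble
```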