In [2]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [3]:
#read_csv() reads comma delimited files, read_csv2() reads semicolon separated files (common in countries where , is used as the decimal mark), read_tsv() reads tab delimited files, and read_delim() reads files with any delimiter.

#read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed width files where columns are separated by white space.

#read_log() reads Apache style log files. (But also check out webreadr which is built on top of read_log() and provides many more helpful tools.)
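
#A minimal sketch of the delimiter variants described above. The inline strings are made-up data, used here only so the calls run without any files on disk.

```r
library(readr)

# Semicolon-separated fields, with "," as the decimal mark
semi <- read_csv2("a;b\n1,5;2\n3,25;4")

# Tab-separated fields
tabs <- read_tsv("a\tb\n1\t2\n3\t4")

# Any single-character delimiter, here ":"
colon <- read_delim("a:b\n1:2\n3:4", delim = ":")

semi$a   # the values "1,5" and "3,25" are read as the doubles 1.5 and 3.25
```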

In [4]:
#read_csv() uses the first line of the data for the column names, which is a very common convention.
#The options below tweak this behaviour.

#You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.
#Example

read_csv("The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3", skip = 2)
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

read_csv("# A comment I want to skip
  x,y,z
  1,2,3", comment = "#")
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

x,y,z
<dbl>,<dbl>,<dbl>
1,2,3


x,y,z
<dbl>,<dbl>,<dbl>
1,2,3


In [5]:
#The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

#Example
read_csv("1,2,3\n4,5,6", col_names = FALSE)
#> # A tibble: 2 x 3
#>      X1    X2    X3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

X1,X2,X3
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


In [6]:
#Alternatively you can pass col_names a character vector which will be used as the column names:

#Example

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
#> # A tibble: 2 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

x,y,z
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


In [7]:
#Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

read_csv("a,b,c\n1,2,.", na = ".")
#> # A tibble: 1 x 3
#>       a     b c    
#>   <dbl> <dbl> <lgl>
#> 1     1     2 NA

a,b,c
<dbl>,<dbl>,<lgl>
1,2,


In [8]:
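#Why use the readr functions rather than base R's read.csv()?

#They are typically much faster than their base equivalents.
#They produce tibbles, and they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.
#They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.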
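
#A small sketch of the column-name difference, assuming a made-up header with a space and a purely numeric name. read.csv() is called with its defaults here (check.names = TRUE).

```r
library(readr)

csv_text <- "first name,2020\nAda,1\nGrace,2"

base_df  <- read.csv(text = csv_text)  # base R rewrites names into syntactic ones
readr_df <- read_csv(csv_text)         # readr keeps the names as written

names(base_df)   # "first.name" "X2020"
names(readr_df)  # "first name" "2020"
```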

In [9]:
#Exercises

#What function would you use to read a file where fields were separated with “|”?

#We can use read_delim():

read_delim("a|b|c\n1|2|3\n4|5|6", delim = "|")

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


In [10]:
#Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

#read_csv() and read_tsv() have the same arguments. The only difference is that read_csv() expects comma delimited input and read_tsv() expects tab delimited input.
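
#One way to check this for the installed readr version is to compare the formal argument names of the two functions. This is only a sketch; the exact list depends on the readr version, so no output is shown.

```r
library(readr)

csv_args <- names(formals(read_csv))
tsv_args <- names(formals(read_tsv))

# Arguments the two functions share
shared_args <- intersect(csv_args, tsv_args)
shared_args

# Arguments found in only one of the two functions (expected to be empty)
union(setdiff(csv_args, tsv_args), setdiff(tsv_args, csv_args))
```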

In [11]:
#What are the most important arguments to read_fwf()?

#Apart from file, the most important argument is col_positions, which tells read_fwf() where each column begins and ends. It is built with fwf_widths() or fwf_positions().
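
#A minimal sketch of both ways to describe the column positions, assuming a made-up two-column layout. The column names name and age and the inline text are illustrative, not part of the exercise.

```r
library(readr)

fwf_text <- "John 23\nMary 31"

# Describe the columns by width: 5 characters, then 2 characters
by_width <- read_fwf(fwf_text, fwf_widths(c(5, 2), c("name", "age")))

# Describe the same columns by start and end position (1-based, inclusive)
by_position <- read_fwf(
  fwf_text,
  fwf_positions(start = c(1, 6), end = c(4, 7), col_names = c("name", "age"))
)

by_width
by_position
```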

In [13]:
#Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"

#The argument is quote, and we can use it in read_csv(), read_csv2(), and read_tsv() as well. For example:

read_csv("x,y\n1,'a,b'", quote = "'")

x,y
<dbl>,<chr>
1,"a,b"


In [15]:
#Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

read_csv("a,b\n1,2,3\n4,5,6")

#Only two column names are provided, so the values in the third column are dropped.

“2 parsing failures.
row col  expected    actual         file
  1  -- 2 columns 3 columns literal data
  2  -- 2 columns 3 columns literal data
”

a,b
<dbl>,<dbl>
1,2
4,5


In [17]:
read_csv("a,b,c\n1,2\n1,2,3,4")
#Only three column names are provided. The first data row has only two values, so its third column is filled with NA. The second data row has four values, so the extra fourth value is dropped.

“2 parsing failures.
row col  expected    actual         file
  1  -- 3 columns 2 columns literal data
  2  -- 3 columns 4 columns literal data
”

a,b,c
<dbl>,<dbl>,<dbl>
1,2,
1,2,3.0


In [19]:
read_csv("a,b\n\"1")
#The opening quote \" is dropped because there is no matching closing quote. The only data row has a single value, so the second column is filled with NA.

“2 parsing failures.
row col                     expected    actual         file
  1  a  closing quote at end of file           literal data
  1  -- 2 columns                    1 columns literal data
”

a,b
<dbl>,<chr>
1,


In [21]:
read_csv("a,b\n1,2\na,b")

#Because the second data row contains the strings a and b, both columns are parsed as character vectors.

a,b
<chr>,<chr>
1,2
a,b


In [22]:
##Parsing a Vector

In [23]:
#The parse_*() functions take a character vector and return a more specialised vector like a logical, integer, or date:

In [24]:
str(parse_logical(c("TRUE", "FALSE", "NA")))
#>  logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
#>  int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
#>  Date[1:2], format: "2010-01-01" "1979-10-14"

 logi [1:3] TRUE FALSE NA
 int [1:3] 1 2 3
 Date[1:2], format: "2010-01-01" "1979-10-14"


In [25]:
#Like all functions in the tidyverse, the parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing:

parse_integer(c("1", "231", ".", "456"), na = ".")
#> [1]   1 231  NA 456

In [26]:
#If there are many parsing failures, you’ll need to use problems() to get the complete set. This returns a tibble, which you can then manipulate with dplyr.
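
#A short sketch of problems() on a vector with two values that cannot be parsed as integers. The input values are made up, and suppressWarnings() is used only to keep the output quiet.

```r
library(readr)

# "abc" and "123.45" are not valid integers, so both become NA
x <- suppressWarnings(parse_integer(c("123", "345", "abc", "123.45")))
x

# One row per failure, describing what was expected and what was found
probs <- problems(x)
probs
```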

In [27]:
#There are eight particularly important parsers:

#parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.

#parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.

#parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.

#parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values.

#parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.
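
#A minimal sketch of parse_factor() with a known set of levels. The fruit names are made-up example data, and suppressWarnings() only hides the parsing-failure warning.

```r
library(readr)

fruit <- c("apple", "banana")

# "bananana" is not one of the known levels, so it becomes NA
f <- suppressWarnings(
  parse_factor(c("apple", "banana", "bananana"), levels = fruit)
)

f
levels(f)
```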

In [28]:
#Numbers

In [29]:
#Parsing numbers seems straightforward, but three problems make it tricky:

#People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.

#Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.

#Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.
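
#A sketch of how readr addresses each of the three problems: a locale with a different decimal mark, parse_number() to ignore surrounding characters, and a locale with a different grouping mark. The input strings are illustrative.

```r
library(readr)

# Problem 1: the decimal mark differs by region
dec <- parse_double("1,23", locale = locale(decimal_mark = ","))   # 1.23

# Problem 2: numbers surrounded by other characters
dollars <- parse_number("$100")                                    # 100
percent <- parse_number("20%")                                     # 20

# Problem 3: grouping characters, with the default mark and with "."
grouped_us <- parse_number("$123,456,789")                         # 123456789
grouped_eu <- parse_number("123.456.789",
                           locale = locale(grouping_mark = "."))   # 123456789
```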