In this notebook, we will learn about:

* [Tibbles](#Tibbles), a modern take on R data frames
* [Data import](#data-import)

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# Tibbles

In [2]:
iris # the famous iris data set used by Sir Ronald Fisher

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


In [3]:
(iris_tibble <- as_tibble(iris)) # convert to a tibble

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


In [4]:
tibble(
    x = seq(100, 1000, 100),
    y = 50,   # a length-one vector is recycled to match the other columns
    z = x + 50   # columns are created in order, so z can refer to x
)

x,y,z
<dbl>,<dbl>,<dbl>
100,50,150
200,50,250
300,50,350
400,50,450
500,50,550
600,50,650
700,50,750
800,50,850
900,50,950
1000,50,1050
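
Recycling in `tibble()` is stricter than in `data.frame()`: only length-one vectors are expanded. The following is a minimal sketch of that difference; the object names `ok`, `recycled`, and `result` are illustrative only.

```r
library(tibble)

# A length-one value is recycled to match the longest column.
ok <- tibble(x = 1:4, y = 0)
nrow(ok)

# data.frame() recycles a length-two vector against four rows without complaint...
recycled <- data.frame(x = 1:4, y = 1:2)
recycled$y

# ...whereas tibble() raises an error, since this is usually a mistake.
result <- tryCatch(
    tibble(x = 1:4, y = 1:2),
    error = function(e) "error"
)
result
```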


To create tibbles by specifying rows, use `tribble()` (for **tr**ansposed t**ibble**).

In [5]:
tribble(
    ~letter, ~Letter, ~number, # column names are written as one-sided formulas (~name)
    #--|---|---
    "a", "A", 1,
    "f", "F", 6,
    "k", "K", 11,
    "p", "P", 16,
    "u", "U", 21,
    "z", "Z", 26
)

letter,Letter,number
<chr>,<chr>,<dbl>
a,A,1
f,F,6
k,K,11
p,P,16
u,U,21
z,Z,26
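
As a quick check that `tribble()` is only a different layout for the same data, the sketch below builds a small table row by row and column by column and compares the columns. The names `by_row` and `by_col` are made up for this example.

```r
library(tibble)

# Row-wise specification: one line per observation.
by_row <- tribble(
    ~letter, ~number,
    "a", 1,
    "f", 6
)

# Column-wise specification of the same data.
by_col <- tibble(letter = c("a", "f"), number = c(1, 6))

identical(by_row$letter, by_col$letter)
identical(by_row$number, by_col$number)
```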


Data frames and tibbles print differently.

In [6]:
print(iris)

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa
13           4.8         3.0          1.4         0.1     setosa
14           4.3         3.0          1.1         0.1     setosa
15           5.8         

In [7]:
print(iris_tibble) # only the first 10 rows, with type info, are printed

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10

You can control the default printing behavior in a couple of ways.

In [8]:
mpg %>%
    print(n = 20, width = Inf)

# A tibble: 234 x 11
   manufacturer model              displ  year   cyl trans      drv     cty
   <chr>        <chr>              <dbl> <int> <int> <chr>      <chr> <int>
 1 audi         a4                   1.8  1999     4 auto(l5)   f        18
 2 audi         a4                   1.8  1999     4 manual(m5) f        21
 3 audi         a4                   2    2008     4 manual(m6) f        20
 4 audi         a4                   2    2008     4 auto(av)   f        21
 5 audi         a4                   2.8  1999     6 auto(l5)   f        16
 6 audi         a4                   2.8  1999     6 manual(m5) f        18

In [9]:
options(tibble.print_min = 20, tibble.print_max = 20, tibble.width = Inf)
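
Options set this way stay in effect for the rest of the session. To change them only temporarily, one common base R idiom is to keep the list returned by `options()` and pass it back afterwards. The sketch below uses made-up values of 3 and leaves the session's settings as it found them.

```r
library(tibble)

before <- getOption("tibble.print_min")

# options() returns the previous values of the options it changes.
old <- options(tibble.print_min = 3, tibble.print_max = 3)

# With tibble 2.1.3 (the version loaded above) this shows 3 rows.
print(as_tibble(iris))

# Passing the saved list back restores the earlier settings.
options(old)
after <- getOption("tibble.print_min")
identical(before, after)
```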

In [10]:
print(mpg)

# A tibble: 234 x 11
   manufacturer model              displ  year   cyl trans      drv     cty
   <chr>        <chr>              <dbl> <int> <int> <chr>      <chr> <int>
 1 audi         a4                   1.8  1999     4 auto(l5)   f        18
 2 audi         a4                   1.8  1999     4 manual(m5) f        21
 3 audi         a4                   2    2008     4 manual(m6) f        20
 4 audi         a4                   2    2008     4 auto(av)   f        21
 5 audi         a4                   2.8  1999     6 auto(l5)   f        16
 6 audi         a4                   2.8  1999     6 manual(m5) f        18

In [11]:
df <- tibble(
    x = runif(10),
    y = rnorm(10)
)

In [12]:
df$x # `$` extracts a column by name only

In [13]:
df[['x']] # double brackets can extract variables using names

In [14]:
df[[1]] # or using positions

To use `$` or `[[` in a pipe, you have to use the special placeholder `.`.

In [15]:
df %>% .$x

In [16]:
df %>% .[[1]]
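
An alternative that avoids the placeholder is `pull()` from dplyr, which extracts a single column as a vector. A minimal sketch, rebuilding a small `df` so that the block runs on its own:

```r
library(dplyr)   # provides pull() and re-exports %>%
library(tibble)

df <- tibble(x = runif(10), y = rnorm(10))

# pull() accepts a column name or a position.
by_name <- df %>% pull(x)
by_position <- df %>% pull(1)

identical(by_name, df$x)
identical(by_position, df[[1]])
```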

In [17]:
var_name <- "x" # store variable name in a variable

In [18]:
df$var_name # doesn't work: `$` looks for a column literally named var_name

“Unknown or uninitialised column: 'var_name'.”


NULL

In [19]:
df[[var_name]]
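
Two related behaviors are worth seeing side by side: single brackets keep the result a tibble, and tibbles never partially match column names with `$`. The sketch below uses a hypothetical column called `x_values` to make the partial matching visible.

```r
library(tibble)

df <- tibble(x_values = c(0.1, 0.2, 0.3))
var_name <- "x_values"

by_name <- df[[var_name]]   # `[[` evaluates var_name, so this is the column vector
one_col <- df[var_name]     # single brackets return a one-column tibble

# A plain data frame partially matches names with `$`; a tibble does not.
plain <- as.data.frame(df)
partial_df <- plain$x                   # matches x_values
partial_tbl <- suppressWarnings(df$x)   # NULL, normally with a warning
```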

We can use backticks to create column names that are not syntactically valid R names, e.g., numbers.

In [20]:
(cubes <- tibble(
    `1` = 1:10,
    `2` = `1`^2,
    `3` = `1`^3
))

1,2,3
<int>,<dbl>,<dbl>
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216
7,49,343
8,64,512
9,81,729
10,100,1000


In [21]:
rename(cubes, x = `1`, squares = `2`, cubes = `3`)

x,squares,cubes
<int>,<dbl>,<dbl>
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216
7,49,343
8,64,512
9,81,729
10,100,1000
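
Columns with such names can still be extracted without renaming them. The sketch below, which uses a shortened version of the table, shows the two usual ways and contrasts them with `data.frame()`, which rewrites non-syntactic names by default.

```r
library(tibble)

cubes <- tibble(`1` = 1:3, `2` = `1`^2)

with_dollar <- cubes$`2`       # backticks are needed with `$`
with_brackets <- cubes[["2"]]  # a quoted string works with `[[`

# data.frame() applies check.names = TRUE by default and alters the name.
df_names <- names(data.frame(`1` = 1:3))
df_names
```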


# Data import

We will focus on the `read_csv()` function, but there are others available:

* `read_csv2()`: uses semicolon as delimiter
* `read_tsv()`: uses tab as delimiter
* `read_delim()`: can use any delimiter
* `read_fwf()`: reads fixed-width files
* `read_table()`: reads fixed-width files where columns are separated by white space
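
A short sketch of three of these readers on inline data (the data values are made up). Note that `read_csv2()` also treats the comma as the decimal mark, which is why `1,5` below is read as 1.5.

```r
library(readr)

# Semicolon-delimited, with a comma as the decimal mark.
semi <- read_csv2("a;b\n1,5;2\n")

# Tab-delimited.
tabs <- read_tsv("a\tb\n1\t2\n")

# Any single-character delimiter, here a pipe.
pipes <- read_delim("a|b\n1|2\n", delim = "|")

semi$a
```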

In [22]:
(heights <- read_csv("https://raw.githubusercontent.com/ambujtewari/stats306-fall2017/master/data/heights.csv"))

Parsed with column specification:
cols(
  earn = col_double(),
  height = col_double(),
  sex = col_character(),
  ed = col_double(),
  age = col_double(),
  race = col_character()
)



earn,height,sex,ed,age,race
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
50000,74.42444,male,16,45,white
60000,65.53754,female,16,58,white
30000,63.62920,female,16,29,white
50000,63.10856,female,16,91,other
51000,63.40248,female,17,39,white
9000,64.39951,female,15,26,white
29000,61.65633,female,12,49,white
32000,72.69854,male,17,46,white
2000,72.03947,male,15,21,hispanic
27000,72.23493,male,12,26,white
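
The "Parsed with column specification" message reports the types that `read_csv()` guessed. If the guesses are not what you want, the `col_types` argument lets you state them yourself. The sketch below uses a few made-up rows in the spirit of the heights data rather than the real file, and reads `ed` as an integer instead of the default double.

```r
library(readr)

# Hypothetical inline data standing in for a real file.
people <- read_csv(
    "earn,sex,ed\n50000,male,16\n60000,female,16\n",
    col_types = cols(
        earn = col_double(),
        sex = col_character(),
        ed = col_integer()    # override the default guess of double
    )
)
sapply(people, class)
```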


To create short examples illustrating the behavior of `read_csv()`, we can specify the contents of a CSV file inline.

In [23]:
read_csv("a, b, c
1, 2, 3
4, 5, 6
")

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


You might want to skip a few rows at the beginning that contain metadata.

In [24]:
read_csv("First row to skip
Second row to skip
Third row to skip
a, b, c
1, 2, 3
4, 5, 6
", skip = 3)

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


You can also skip comment lines by specifying a comment character.

In [25]:
read_csv("# First comment line
a, b, c
# This separates the header from the data
1, 2, 3
4, 5, 6
# Another comment line
", comment = '#')

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


Set `col_names = FALSE` when the file has no header row. The columns are then named `X1`, `X2`, ...

In [26]:
read_csv("1, 2, 3
4, 5, 6
", col_names = FALSE)

X1,X2,X3
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6


You can specify your own column names.

In [27]:
read_csv("1, 2, 3
4, 5, 6
", col_names = c("a", "b", "c"))

a,b,c
<dbl>,<dbl>,<dbl>
1,2,3
4,5,6
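
When a file does have a header but you would rather use your own names, `skip` and `col_names` can be combined. A minimal sketch with made-up header names `old_a` and `old_b`:

```r
library(readr)

renamed <- read_csv(
    "old_a,old_b\n1,2\n3,4\n",
    skip = 1,                  # drop the existing header line
    col_names = c("a", "b")    # supply replacement names
)
names(renamed)
```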


You can specify how missing values are represented in the file.

In [28]:
read_csv("a, b, c
1, 2, 3
4, ?, 6
", na = "?")

a,b,c
<dbl>,<dbl>,<dbl>
1,2.0,3
4,,6
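
The `na` argument also accepts a vector when a file uses several markers for missing values. Its default is `c("", "NA")`, and a value you supply replaces that default, so `"NA"` has to be listed again if it should still count as missing. A sketch with made-up markers:

```r
library(readr)

survey <- read_csv(
    "a,b,c\n1,?,3\n4,missing,6\n7,8,NA\n",
    na = c("?", "missing", "NA")   # all three strings are read as NA
)
is.na(survey$b)
is.na(survey$c)
```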


You can write a tibble to a csv file using `write_csv()`.

In [29]:
write_csv(cubes, "cubes.csv")

In [30]:
cubes2 <- read_csv("cubes.csv")

Parsed with column specification:
cols(
  `1` = col_double(),
  `2` = col_double(),
  `3` = col_double()
)



In [31]:
cubes2

1,2,3
<dbl>,<dbl>,<dbl>
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216
7,49,343
8,64,512
9,81,729
10,100,1000


In [32]:
file.remove("cubes.csv")
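
Notice that the first column of `cubes` was `<int>` but comes back as `<dbl>` in `cubes2`: a CSV file stores only text, so column types are guessed again on reading. The sketch below repeats the round trip on a small made-up tibble, using `tempfile()` so that no file is left in the working directory.

```r
library(readr)
library(tibble)

squares <- tibble(n = 1:3, sq = n^2)
path <- tempfile(fileext = ".csv")   # temporary location for the round trip

write_csv(squares, path)
squares2 <- read_csv(path)

sapply(squares, typeof)    # n is stored as integer
sapply(squares2, typeof)   # after the round trip, n is read back as double

unlink(path)               # clean up the temporary file
```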