In this notebook, we will learn about:

* [Tibbles](#Tibbles), a modern take on R data frames
* [Data import](#Data-import)

In [1]:
library(tidyverse)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


# Tibbles

In [2]:
iris # the famous iris data set used by Sir Ronald Fisher

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


In [3]:
(iris_tibble <- as_tibble(iris)) # convert to a tibble

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


In [4]:
tibble(
    x = seq(100, 1000, 100),
    y = 50,   # length-1 values are recycled to match the other columns
    z = x + 50   # columns are created sequentially, so later ones can refer to earlier ones
)

x,y,z
100,50,150
200,50,250
300,50,350
400,50,450
500,50,550
600,50,650
700,50,750
800,50,850
900,50,950
1000,50,1050
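
As a hedged illustration of the recycling rule: `tibble()` only recycles values of length 1, whereas base `data.frame()` also recycles longer vectors whose length divides the number of rows. The column values below are made up for the example.

```r
library(tibble)

# A length-1 value is recycled to match the other column.
ok <- tibble(x = 1:4, y = 0)

# A length-2 column next to a length-4 column is an error in tibble().
bad <- tryCatch(
    tibble(x = 1:4, y = 1:2),
    error = function(e) "error"
)

# data.frame() recycles the shorter column instead of complaining.
base_df <- data.frame(x = 1:4, y = 1:2)
```

The stricter rule tends to surface length mismatches early rather than hiding them behind silent recycling.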


To create tibbles by specifying rows, use `tribble()` (for **tr**ansposed t**ibble**).

In [5]:
tribble(
    ~letter, ~Letter, ~number, # note the use of R formulas to specify variable names in tribble()
    #--|---|---
    "a", "A", 1,
    "f", "F", 6,
    "k", "K", 11,
    "p", "P", 16,
    "u", "U", 21,
    "z", "Z", 26
)

letter,Letter,number
a,A,1
f,F,6
k,K,11
p,P,16
u,U,21
z,Z,26
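
To make the row-wise versus column-wise distinction concrete, here is a small sketch that builds the same table both ways and checks that the results agree. The names `by_row` and `by_col` and the data values are invented for illustration.

```r
library(tibble)

# Row-wise: each line of the call is one observation.
by_row <- tribble(
    ~name, ~score,
    "ann", 90,
    "bob", 85
)

# Column-wise: each argument is one variable.
by_col <- tibble(
    name = c("ann", "bob"),
    score = c(90, 85)
)

# Both constructions should describe the same table.
same <- isTRUE(all.equal(by_row, by_col))
```

`tribble()` is mostly a readability aid for small, hand-typed tables; for anything larger, column-wise construction or data import is usually more practical.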


Data frames and tibbles print differently.

In [6]:
print(iris)

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa
13           4.8         3.0          1.4         0.1     setosa
14           4.3         3.0          1.1         0.1     setosa
15           5.8         

In [7]:
print(iris_tibble) # only the first 10 rows, with type info, are printed

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with 140 more rows


You can control the default printing behavior in a couple of ways.

In [8]:
mpg %>%
    print(n = 20, width = Inf)

# A tibble: 234 × 11
   manufacturer              model displ  year   cyl      trans   drv   cty
          <chr>              <chr> <dbl> <int> <int>      <chr> <chr> <int>
1          audi                 a4   1.8  1999     4   auto(l5)     f    18
2          audi                 a4   1.8  1999     4 manual(m5)     f    21
3          audi                 a4   2.0  2008     4 manual(m6)     f    20
4          audi                 a4   2.0  2008     4   auto(av)     f    21
5          audi                 a4   2.8  1999     6   auto(l5)     f    16
6          audi                 a4   2.8  1999     6 manual(m5)     f    18
7          audi                 a4   3.1  2008     6   auto(av)     f    18
8          audi         a4 quattro   1.8  1999     4 manual(m5)     4    18
9          audi         a4 quattro   1.8  1999     4   auto(l5)     4    16
10         audi         a4 quattro   2.0  2008     4 manual(m6)     4    20
11         audi         a4 quattro   2.0  2008     4   auto(s6)    

In [9]:
options(tibble.print_min = 20, tibble.print_max = 20, tibble.width = Inf)

In [10]:
print(mpg)

# A tibble: 234 × 11
   manufacturer              model displ  year   cyl      trans   drv   cty
          <chr>              <chr> <dbl> <int> <int>      <chr> <chr> <int>
1          audi                 a4   1.8  1999     4   auto(l5)     f    18
2          audi                 a4   1.8  1999     4 manual(m5)     f    21
3          audi                 a4   2.0  2008     4 manual(m6)     f    20
4          audi                 a4   2.0  2008     4   auto(av)     f    21
5          audi                 a4   2.8  1999     6   auto(l5)     f    16
6          audi                 a4   2.8  1999     6 manual(m5)     f    18
7          audi                 a4   3.1  2008     6   auto(av)     f    18
8          audi         a4 quattro   1.8  1999     4 manual(m5)     4    18
9          audi         a4 quattro   1.8  1999     4   auto(l5)     4    16
10         audi         a4 quattro   2.0  2008     4 manual(m6)     4    20
11         audi         a4 quattro   2.0  2008     4   auto(s6)    
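
Because `options()` changes printing for the rest of the session, it can be handy to undo the change afterwards. The sketch below relies on the fact that `options()` returns the previous values; the built-in `mtcars` data set is used only as a convenient example.

```r
library(tibble)

# options() returns the previous values, so keep them for later.
old <- options(tibble.print_min = 20, tibble.print_max = 20, tibble.width = Inf)

print(as_tibble(mtcars))   # prints 20 rows and all columns

# Restore whatever was set before.
options(old)
```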

In [11]:
df <- tibble(
    x = runif(10),
    y = rnorm(10)
)

In [12]:
df$x # extracts a variable, name only

In [13]:
df[['x']] # double brackets can extract variables using names

In [14]:
df[[1]] # or using positions

To use `$` or `[[` in a pipe, you have to use the special placeholder `.`:

In [15]:
df %>% .$x

In [16]:
df %>% .[[1]]
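
An alternative that avoids the placeholder is `pull()` from dplyr, which is not used in the cells above but does the same job inside a pipe. A minimal sketch, with made-up values in place of the random ones:

```r
library(dplyr)
library(tibble)

df <- tibble(x = c(0.2, 0.5, 0.9), y = c(-1, 0, 1))

# pull() extracts a single column as a vector, by name or by position.
by_name <- df %>% pull(x)
by_position <- df %>% pull(1)
```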

In [17]:
var_name <- "x" # store variable name in a variable

In [18]:
df$var_name # doesn't work: `$` looks for a column literally named "var_name"

“Unknown column 'var_name'”

NULL

In [19]:
df[[var_name]]
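
The difference between the two operators can be summarized in a short, self-contained sketch. The data values are invented, and `suppressWarnings()` is used only to keep the unknown-column warning out of the output.

```r
library(tibble)

df <- tibble(x = c(0.1, 0.2, 0.3), y = c(1, 2, 3))
var_name <- "x"

# `[[` evaluates its argument, so it sees the string "x".
via_brackets <- df[[var_name]]

# `$` takes the name literally and looks for a column called "var_name".
via_dollar <- suppressWarnings(df$var_name)
```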

We can use backticks to create variable names that are not valid R names, e.g., numbers.

In [20]:
(cubes <- tibble(
    `1` = 1:10,
    `2` = `1`^2,
    `3` = `1`^3
))

1,2,3
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216
7,49,343
8,64,512
9,81,729
10,100,1000


In [21]:
rename(cubes, x = `1`, squares = `2`, cubes = `3`)

x,squares,cubes
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216
7,49,343
8,64,512
9,81,729
10,100,1000
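
Non-syntactic names need the same care when columns are extracted later. The following sketch, using a smaller hypothetical tibble called `powers`, shows that `$` needs backticks while `[[` takes the name as an ordinary string.

```r
library(tibble)

powers <- tibble(
    `1` = 1:3,
    `2` = `1`^2
)

# With `$`, a non-syntactic name must be wrapped in backticks.
with_dollar <- powers$`2`

# With `[[`, the name is just a string.
with_brackets <- powers[["2"]]
```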


# Data import

We will focus on the `read_csv()` function but there are other ones available:

* `read_csv2()`: uses semicolon as delimiter
* `read_tsv()`: uses tab as delimiter
* `read_delim()`: can use any delimiter
* `read_fwf()`: to read fixed-width files
* `read_table()`: to read fixed-width files where columns are separated by white space
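
As a small sketch of one of these alternatives, the example below writes a pipe-delimited file to a temporary location and reads it back with `read_delim()`. The file contents are invented, and `col_types = cols()` is passed only to let readr guess the types without printing the column specification.

```r
library(readr)

# Write a tiny pipe-delimited file to a temporary path.
path <- tempfile(fileext = ".txt")
writeLines(c("a|b|c", "1|2|3", "4|5|6"), path)

# read_delim() takes the delimiter explicitly.
pipe_data <- read_delim(path, delim = "|", col_types = cols())
```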

In [22]:
heights <- read_csv("data/heights.csv")

Parsed with column specification:
cols(
  earn = col_double(),
  height = col_double(),
  sex = col_character(),
  ed = col_integer(),
  age = col_integer(),
  race = col_character()
)
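
The printed specification can be copied into the `col_types` argument to make the import explicit and reproducible. The sketch below does this for a small hypothetical file; its column names echo `heights.csv`, but the rows are made up.

```r
library(readr)

# A tiny stand-in file; the values are invented for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c("height,sex,age", "74,male,45", "66,female,58"), path)

# Declaring the types up front means nothing is guessed.
people <- read_csv(
    path,
    col_types = cols(
        height = col_double(),
        sex = col_character(),
        age = col_integer()
    )
)
```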


To create short examples illustrating the behavior of `read_csv()`, we can specify the contents of a CSV file inline.

In [23]:
read_csv("a, b, c
1, 2, 3
4, 5, 6
")

a,b,c
1,2,3
4,5,6


You might want to skip a few rows in the beginning that have metadata.

In [24]:
read_csv("First row to skip
Second row to skip
Third row to skip
a, b, c
1, 2, 3
4, 5, 6
", skip = 3)

a,b,c
1,2,3
4,5,6


You can also skip comment lines by specifying a comment character.

In [25]:
read_csv("# First comment line
a, b, c
# This separates the header from the data
1, 2, 3
4, 5, 6
# Another comment line
", comment = '#')

a,b,c
1,2,3
4,5,6


Set `col_names = FALSE` when you don't have column names in the file. The column names are then set to X1, X2, ...

In [26]:
read_csv("1, 2, 3
4, 5, 6
", col_names = FALSE)

X1,X2,X3
1,2,3
4,5,6


You can specify your own column names.

In [27]:
read_csv("1, 2, 3
4, 5, 6
", col_names = c("a", "b", "c"))

a,b,c
1,2,3
4,5,6


You can specify how missing values are represented in the file.

In [28]:
read_csv("a, b, c
1, 2, 3
4, ?, 6
", na = "?")

a,b,c
1,2.0,3
4,,6
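
The `na` argument also accepts a character vector, which helps when a file uses more than one marker for missing data. A minimal sketch, assuming a hypothetical file that uses both `?` and `missing`:

```r
library(readr)

# A small file with two different missing-value markers.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,?,6", "7,8,missing"), path)

# Every string listed in `na` is read as NA.
marked <- read_csv(path, na = c("?", "missing"), col_types = cols())
```

Since the markers are converted before type guessing, the affected columns should still be read as numeric rather than character.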
