# Section 2 - Data Wrangling

## Short Introduction to Piping

The pipe operator `%>%` forwards a value or the result of an expression into the next function call, where it becomes the first argument by default. It is provided by the `magrittr` package.

In [1]:
# install.packages("magrittr")
library(magrittr)

# examples
c(1,2,3,4) %>% sum()                        # equivalent to: sum(c(1,2,3,4))
c(1,2,3,4) %>% sum() %>% log(base = 10)     # equivalent to: log(sum(c(1,2,3,4)), base = 10)

# pipe into an argument other than the first one with the `.` placeholder
c(1,2,3,4) %>% sum() %>% log(1000, .)       # equivalent to: log(1000, base = sum(c(1,2,3,4)))
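
A quick way to check that a pipe chain and its nested form are the same computation is to compare them directly. The sketch below reuses the vector from the cell above; the variable names are only illustrative.

```r
library(magrittr)

v <- c(1, 2, 3, 4)

# default: the left-hand side becomes the first argument of the next call
piped  <- v %>% sum() %>% log(base = 10)
nested <- log(sum(v), base = 10)
identical(piped, nested)

# with `.`: the left-hand side is placed where the dot stands (here: `base`)
piped_dot  <- v %>% sum() %>% log(1000, .)
nested_dot <- log(1000, base = sum(v))
identical(piped_dot, nested_dot)
```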

In [2]:
# --------------------

## `data.table`s

### 1. Overview

A `data.table` is an enhanced version of a `data.frame`: the syntax is more concise, and operations are memory efficient and usually a lot faster. Since a `data.table` inherits from `data.frame`, it stays compatible with functions that expect a `data.frame`.

A `data.table` can modify columns by reference, whereas modifications of a `data.frame` lead to copies. It does not require row names. All columns must have the same length.

Basic syntax: `DT[i = row(s), j = column(s), by = grouped by]`

("Take `DT`, subset rows by `i`, then compute `j` grouped by `by`.")

Big advantage of `data.table`s: within `[]` the column names are handled like variables, i.e. they are evaluated in the environment of the `data.table`, which makes column operations very powerful (more on that later)!

Do not forget: R interprets everything in `" "` or `' '` as a string. Hence, to refer to a column of a `data.table`, write its name without quotes; if the name is not a valid R name (e.g. it contains a space), wrap it in `` ` ` `` (backticks).
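
To make the `DT[i, j, by]` reading concrete, here is a minimal sketch on a small made-up table. The table `sales`, its columns and the table `prices` are hypothetical and only serve as illustration.

```r
library(data.table)

# hypothetical toy data, not part of the course material
sales <- data.table(region = c("north", "south", "north", "south"),
                    units  = c(10, 20, 30, 40),
                    price  = c(2, 2, 3, 3))

# i:  keep the rows with more than 10 units
# j:  compute the revenue from the bare column names `units` and `price`
# by: return one result row per region
rev_by_region <- sales[units > 10, .(revenue = sum(units * price)), by = region]
rev_by_region

# a column name containing a space has to be wrapped in backticks inside `[]`
prices <- data.table("unit price" = c(2, 3))
mean_price <- prices[, mean(`unit price`)]
mean_price
```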

### 2. Creating and Loading Tables

`data.table`s require the `data.table` package.

A `data.table` can be built from vectors, converted from another object, or loaded from a (local) file.

Hint: if a column is missing some entries (w.r.t. the other columns), the elements in this column are recycled.

In [3]:
# install.packages('data.table')
library(data.table)

# example
DT <- data.table(x = 1:6,
                 'names' = rep(c("Donata", "Philipp"), 3),
                 y = c(1,2),
                 z = factor(c('A', 'A', 'B', 'C', 'A', 'B')))     # vector y is recycled
class(DT)
DT

x,names,y,z
1,Donata,1,A
2,Philipp,2,A
3,Donata,1,B
4,Philipp,2,C
5,Donata,1,A
6,Philipp,2,B


In [4]:
# conversion (typically for `data.frame`s, but other objects work too)
titanic_dt <- as.data.table(Titanic)           # Titanic is a 4-dimensional contingency table (class `table`) from the `datasets` package
class(titanic_dt)
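
For plain `data.frame`s there are two common ways to convert. The sketch below contrasts `as.data.table()`, which returns a new object, with `setDT()`, which converts the existing object in place; the data frame `df` is made up for illustration.

```r
library(data.table)

# hypothetical example data
df <- data.frame(id = 1:3, score = c(0.5, 0.7, 0.9))

# as.data.table() returns a converted copy, `df` itself stays a data.frame
dt_copy <- as.data.table(df)
was_dt_before <- is.data.table(df)
was_dt_before

# setDT() converts `df` itself by reference, no assignment needed
setDT(df)
is.data.table(df)
```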

In [5]:
# loading (local) data into a table
# often: different functions for different file types available

# CSV:
# DT <- fread('path_to_file/file.csv')

# XLSX (excel files):
# install.packages("readxl")
library(readxl)
# DF <- read_excel('path_to_file/file.xlsx')     # returns a tibble (a kind of data.frame) -> convert with as.data.table() :)
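
Since the paths above are placeholders, a runnable way to try `fread()` is a round trip through a temporary file. This is only a sketch: the table content is invented, and `fwrite()` is the `data.table` counterpart for writing CSV files.

```r
library(data.table)

# write a small made-up table to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:3, label = c("a", "b", "c")), tmp)

# fread() detects the separator and column types and returns a data.table
loaded <- fread(tmp)
class(loaded)
loaded

# clean up the temporary file
unlink(tmp)
```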

In [6]:
# get path to current working directory
# getwd()

# define file path for any system (different OS often means different representations of paths!)
file.path('first', 'second', 'third')

### 3. Inspecting Tables

The first step in every analysis is to inspect the data to get a brief overview.

Hint: the same functions also work on a part of a `data.table`, so a specific part can be analyzed on its own.

In [7]:
# inspect the size of the table
ncol(DT)                            # number of columns in the table
nrow(DT)                            # number of rows in the table
dim(DT)                             # nrow and ncol

In [8]:
# inspect the basic statistics of each column
summary(DT)                                       # problem: not very helpful for strings

       x           names                 y       z    
 Min.   :1.00   Length:6           Min.   :1.0   A:3  
 1st Qu.:2.25   Class :character   1st Qu.:1.0   B:2  
 Median :3.50   Mode  :character   Median :1.5   C:1  
 Mean   :3.50                      Mean   :1.5        
 3rd Qu.:4.75                      3rd Qu.:2.0        
 Max.   :6.00                      Max.   :2.0        

In [9]:
# inspect string columns
DT[, unique(names)]          # return all unique elements of a certain variable ...
DT[, unique(z)]

DT[, table(names)]           # ... or return how often each element occurs
DT[, table(z)]

names
 Donata Philipp 
      3       3 

z
A B C 
3 2 1 
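
As a sketch of the hint about inspecting only a part of a table: the block below rebuilds the example table under the hypothetical name `demo` (so the original `DT` stays untouched) and looks at the rows of one person only.

```r
library(data.table)

# same content as the example table, under a separate (hypothetical) name
demo <- data.table(x = 1:6,
                   names = rep(c("Donata", "Philipp"), 3),
                   y = rep(c(1, 2), 3),
                   z = factor(c("A", "A", "B", "C", "A", "B")))

head(demo, 3)     # first three rows
str(demo)         # column types at a glance

# inspect only a part of the table
donata <- demo[names == "Donata"]
dim(donata)
summary(donata$x)
```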

In [10]:
# --------------------

## Row Subsetting

### 1. Subsetting Rows by Indices

Similar to subsetting rows in matrices, except that the comma after the row index is optional.

In [11]:
# access one row
DT[2,]                                                   # the comma can be left out as the row index is the first entry
DT[2]

# access consecutive rows
DT[2:4]

# access multiple (not necessarily consecutive) rows
DT[c(2,4,6)]

x,names,y,z
2,Philipp,2,A


x,names,y,z
2,Philipp,2,A


x,names,y,z
2,Philipp,2,A
3,Donata,1,B
4,Philipp,2,C


x,names,y,z
2,Philipp,2,A
4,Philipp,2,C
6,Philipp,2,B


### 2. Subsetting Rows by Logical Conditions

Again similar to subsetting rows in matrices, with the difference that columns can be referred to directly by name.

Operators: `==`, `>`, `<`, `!=`, `%in%`.

Concatenate multiple conditions with `&` and/or `|`.

In [12]:
# examples
DT[names == 'Donata']                          # here equivalent to: DT[names != 'Philipp']
DT[x < 4]
DT[z %in% c('B', 'C')]

DT[z %in% c('B', 'C') & x < 4]                 # "intersection" of the two row subsets
DT[z %in% c('B', 'C') | names == 'Donata']     # "union" of the two row subsets

x,names,y,z
1,Donata,1,A
3,Donata,1,B
5,Donata,1,A


x,names,y,z
1,Donata,1,A
2,Philipp,2,A
3,Donata,1,B


x,names,y,z
3,Donata,1,B
4,Philipp,2,C
6,Philipp,2,B


x,names,y,z
3,Donata,1,B


x,names,y,z
1,Donata,1,A
3,Donata,1,B
4,Philipp,2,C
5,Donata,1,A
6,Philipp,2,B
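
Two further condition styles that fit the same pattern are negation with `!` and range checks with `%between%` (an operator from `data.table`, inclusive on both ends). The sketch uses a rebuilt copy of the example table under the hypothetical name `demo`.

```r
library(data.table)

demo <- data.table(x = 1:6,
                   names = rep(c("Donata", "Philipp"), 3),
                   y = rep(c(1, 2), 3),
                   z = factor(c("A", "A", "B", "C", "A", "B")))

# negate a condition: all rows whose `z` is neither 'B' nor 'C'
not_bc <- demo[!(z %in% c("B", "C"))]
not_bc

# range condition: same as demo[x >= 2 & x <= 4]
mid_rows <- demo[x %between% c(2, 4)]
mid_rows
```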


In [13]:
# --------------------

## Column Operations

### 1. Working with Columns

Accessing columns is similar to accessing columns in matrices, but the column names are written without quotes.

Hint: do not access columns by index but rather by the column name to prevent bugs. (Except if you want to access consecutive columns and know exactly where they are.)

In [14]:
# access one column
DT[, names]

# access a cell
DT[2, names]

# access multiple columns
DT[, list(x, names)]
DT[, .(x, names)]

DT[, c(x, names)]             # this returns the elements of the chosen columns (and rows) as one vector (here coerced to character) instead of a `data.table`

x,names
1,Donata
2,Philipp
3,Donata
4,Philipp
5,Donata
6,Philipp


x,names
1,Donata
2,Philipp
3,Donata
4,Philipp
5,Donata
6,Philipp
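
When the column names are stored in a character vector, they can still be selected by name. The sketch below shows two ways that should give the same result, the `..` prefix and `with = FALSE`; `demo` and `cols` are hypothetical names.

```r
library(data.table)

demo <- data.table(x = 1:6,
                   names = rep(c("Donata", "Philipp"), 3),
                   y = rep(c(1, 2), 3),
                   z = factor(c("A", "A", "B", "C", "A", "B")))

cols <- c("x", "names")

# `..cols` tells data.table to look up `cols` outside the table
by_prefix <- demo[, ..cols]

# `with = FALSE` treats `j` as a plain vector of column names
by_with <- demo[, cols, with = FALSE]

all.equal(by_prefix, by_with)
```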


### 2. Column Operators

Back to the big advantage of `data.table`s: it is possible to apply functions to the columns inside the `[]` environment! Even multiple operations (provided in a list) can be applied, which returns a table of the results.

Hint: if the operations are combined with `c()` instead of a list, the output is a vector again.

In [15]:
# applying function inside and outside
mean_outside <- mean(DT$x, na.rm = TRUE)
mean_inside <- DT[, mean(x, na.rm = TRUE)]             # na.rm = TRUE ignores NA values (if there are any)
mean_outside == mean_inside

# applying the function on some rows of the column
sum_outside <- sum(DT$x[2:5])
sum_inside <- DT[2:5, sum(x)]
sum_outside == sum_inside

# simpler with this trick:
sum_philipp <- DT[names == 'Philipp', sum(x)]

# multiple operations
sum_and_mean <- DT[, .(mean(x), sum(x))]
DT[, .(mean_x = mean(x), sum_x = sum(x))]              # even names can be provided :)

# operation applied to columns
DT[, x+y]

mean_x,sum_x
3.5,21
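
The difference between wrapping the operations in `.()` and in `c()` can be checked directly. A minimal sketch on a rebuilt copy of the example table (hypothetical name `demo`):

```r
library(data.table)

demo <- data.table(x = 1:6,
                   names = rep(c("Donata", "Philipp"), 3),
                   y = rep(c(1, 2), 3),
                   z = factor(c("A", "A", "B", "C", "A", "B")))

# list via .(): the result is a data.table with one column per operation
as_table <- demo[, .(mean_x = mean(x), sum_x = sum(x))]
class(as_table)

# c(): the result is a plain numeric vector
as_vector <- demo[, c(mean_x = mean(x), sum_x = sum(x))]
class(as_vector)
```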


### 3. Advanced Commands: `*apply()` over columns

Columns of a table are exposed as a list, therefore other functions (applying to lists) can be applied to each column via functions from the `*apply()` family.

Hint: I think `sapply()` is the simplest to use :)

In [16]:
# examples
sapply(DT, class)

# Note that columns can be selected by a vector computed outside the table by setting `with = FALSE`. In this case,
# `colnames(iris_dt) != "Species"` returns a logical vector and `iris_dt` is subsetted by that logical vector.
iris_dt <- as.data.table(iris)
sapply(iris_dt[, colnames(iris_dt) != "Species", with = FALSE], sum)     # == sapply(iris_dt[, 1:4], sum)
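
One possible alternative that stays inside `[]` is `lapply()` over `.SD`, the subset of columns named in `.SDcols`. This is a sketch, not the approach used in the cell above; `iris_demo` and `num_cols` are hypothetical names, chosen so that `iris_dt` is left untouched.

```r
library(data.table)

iris_demo <- as.data.table(iris)
num_cols  <- setdiff(colnames(iris_demo), "Species")

# .SD holds the columns listed in .SDcols; lapply() returns one result per column
col_sums <- iris_demo[, lapply(.SD, sum), .SDcols = num_cols]
col_sums

# the same column sums via sapply(), as in the cell above
col_sums_sapply <- sapply(iris_demo[, ..num_cols], sum)
all.equal(unlist(col_sums), col_sums_sapply)
```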

In [17]:
# --------------------

## The `by` Option

The `by` argument evaluates the `j` expression separately for each group.

Hint: the `by =` can be omitted, but is usually included for clarity.

In [18]:
# examples
DT[, sum(y), by=names]
DT[, .(sum_y = sum(y), mean_x = mean(x)), by=z]

names,V1
Donata,3
Philipp,6


z,sum_y,mean_x
A,4,2.666667
B,3,4.5
C,2,4.0
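
Grouping also works with more than one column by passing a list to `by`, and `by` can be given positionally as the third argument. A small sketch on a rebuilt copy of the example table (hypothetical name `demo`):

```r
library(data.table)

demo <- data.table(x = 1:6,
                   names = rep(c("Donata", "Philipp"), 3),
                   y = rep(c(1, 2), 3),
                   z = factor(c("A", "A", "B", "C", "A", "B")))

# group by every combination of `names` and `z` that occurs in the table
by_two <- demo[, .(sum_y = sum(y)), by = .(names, z)]
by_two

# `by` given positionally vs. by name
positional <- demo[, sum(y), names]
named      <- demo[, sum(y), by = names]
all.equal(positional, named)
```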


In [19]:
# --------------------

## Counting Occurrences with `.N`

`.N` holds the number of observations/rows. If no `by` is provided, it is equivalent to `nrow()`; with `by`, it is the number of rows in each group.

In [20]:
# examples
DT[, .N] == nrow(DT)
DT[, .N, by = z]                         # number of observations for each level in `z`
DT[names == 'Donata', .N, by = z]

z,N
A,3
B,2
C,1


z,N
A,2
B,1
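
Because `.N` is an ordinary value inside `j`, it can be combined with other expressions, for example to turn counts into shares. A minimal sketch with the hypothetical names `demo` and `counts`:

```r
library(data.table)

demo <- data.table(x = 1:6,
                   names = rep(c("Donata", "Philipp"), 3),
                   y = rep(c(1, 2), 3),
                   z = factor(c("A", "A", "B", "C", "A", "B")))

# count and relative frequency of each level in `z`
counts <- demo[, .(n = .N, share = .N / nrow(demo)), by = z]
counts

# sort the result by descending count
counts[order(-n)]
```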


In [21]:
# --------------------

## Extending Tables

### 1. Creating New Columns

The `:=` operator updates the table in place (no copy is made/needed!), i.e. it changes the input by reference.

Columns can also be removed using `:= NULL`.

In [22]:
# example
DT[, sum_x_y := x+y]
DT
DT[, z := NULL]
DT

x,names,y,z,sum_x_y
1,Donata,1,A,2
2,Philipp,2,A,4
3,Donata,1,B,4
4,Philipp,2,C,6
5,Donata,1,A,6
6,Philipp,2,B,8


x,names,y,sum_x_y
1,Donata,1,2
2,Philipp,2,4
3,Donata,1,4
4,Philipp,2,6
5,Donata,1,6
6,Philipp,2,8
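
`:=` can also be combined with `by`; in that case the value computed per group is written back to every row of that group. A sketch with a rebuilt copy of the example table (hypothetical names `demo` and `mean_x_by_name`):

```r
library(data.table)

demo <- data.table(x = 1:6,
                   names = rep(c("Donata", "Philipp"), 3),
                   y = rep(c(1, 2), 3))

# add the group mean of `x` as a new column, in place
demo[, mean_x_by_name := mean(x), by = names]
demo

# remove it again
demo[, mean_x_by_name := NULL]
colnames(demo)
```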


### 2. Advanced: Multiple Assignments

Create or delete multiple columns at once, or fill a new column with specific entries based on other columns.

In [23]:
# create multiple new columns
iris_dt[1:3]
iris_dt[, `:=` (Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width)][1:3]
iris_dt[, `:=` (Sepal.Area = NULL, Petal.Area = NULL)][1:3]

# add new column based on other column
iris_dt[Species == "setosa", color := "orange"]
unique(iris_dt[, .(Species, color)])
iris_dt[Species == "versicolor", color := "purple"]
unique(iris_dt[, .(Species, color)])
iris_dt[Species == "virginica", color := "pink"]
unique(iris_dt[, .(Species, color)])

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa


Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,Sepal.Area,Petal.Area
5.1,3.5,1.4,0.2,setosa,17.85,0.28
4.9,3.0,1.4,0.2,setosa,14.7,0.28
4.7,3.2,1.3,0.2,setosa,15.04,0.26


Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa


Species,color
setosa,orange
versicolor,
virginica,


Species,color
setosa,orange
versicolor,purple
virginica,


Species,color
setosa,orange
versicolor,purple
virginica,pink
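
The three conditional assignments above can also be written as a single step with a named lookup vector. This is only one possible variant; `iris_demo` and `species_colors` are hypothetical names, so `iris_dt` is not modified.

```r
library(data.table)

iris_demo <- as.data.table(iris)

# named vector: the names are the species, the values are the colors
species_colors <- c(setosa = "orange", versicolor = "purple", virginica = "pink")

# look up the color of each row by its species name and assign it in place
iris_demo[, color := unname(species_colors[as.character(Species)])]
color_map <- unique(iris_demo[, .(Species, color)])
color_map
```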


### 3. Copying Tables

Create a real copy of a table in memory by using `copy()`. This copy is no longer linked to the original, hence changes in one table do not affect the other one (in contrast to "copies" created via `<-`, which are just a second name for the same table).

In [24]:
# example
DT_copy <- DT
DT %>% colnames()
DT_copy[, names := NULL]                                      # changes in this "copy" affect the original table
DT %>% colnames()

DT_copy_real <- copy(DT)
DT_copy_real[, 'names' := rep(c("Donata", "Philipp"), 3)]
DT_copy_real %>% colnames()
DT %>% colnames()

In [25]:
# --------------------

##### End of Section 2!