## A brief history of R

R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data scientists for developing statistical software and data analysis.

R was created by (R)oss Ihaka and (R)obert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

R is an implementation of the S programming language (Bell Labs) combined with lexical scoping semantics, inspired by (S)cheme. 
R is named partly after the first names of the first two R authors and partly as a play on the name of S.

+ Current version
    + 3.6.1 (2019-07-05)
+ Project website
    + https://www.r-project.org

# Introduction to programming with R

## Benefits of using R

+ R has one of the richest ecosystems to perform data analysis
+ More than 12,000 packages available in CRAN (Comprehensive R Archive Network)
+ Self-validation process so that packages can follow best practices for development (CII Best Practices Badge Program)
+ Designed for statistical analysis, machine learning and data mining

<table>
    <tr>
        <td><img src="images/1.png" /></td>
    </tr>
</table>

## Pros and Cons

<table>
    <tr>
        <td><img src="images/2.png" /></td>
    </tr>
</table>

## Installing R (Windows, Linux, MacOS)
+ Download and install the latest R distribution from CRAN: https://cran.r-project.org
+ Download and install the latest version of RStudio: https://www.rstudio.com
+ While in RStudio, set the working directory
    + Find out existing working directory: getwd()
    + Set a new working directory: setwd("/.../...")
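
The two calls above can be tried out safely with a throwaway folder. This is only a sketch: `tempdir()` stands in for a real project folder, and the variable name `old_wd` is chosen for illustration.

```r
# Remember the current working directory so it can be restored afterwards.
old_wd <- getwd()
print(old_wd)

# tempdir() is a stand-in for a real project folder (an assumption for illustration).
setwd(tempdir())
print(getwd())

# Restore the original working directory.
setwd(old_wd)
```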

In [None]:
# At the R prompt we type expressions. 
# The <- symbol is the assignment operator.

x <- 1 # or x = 1
print(x)

## Data Types

R has five basic classes of objects:
+ character
+ numeric (real numbers)
+ integer
+ complex
+ logical (TRUE/FALSE)
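
As a quick illustration of the five classes listed above, the `class()` function reports the class of any object. The values used here are arbitrary examples.

```r
# class() reports the class of an object.
class("hello")   # character
class(3.14)      # numeric
class(2L)        # integer (the L suffix requests an integer)
class(1 + 2i)    # complex
class(TRUE)      # logical

# A number typed without the L suffix is stored as numeric,
# even if it looks like an integer.
class(2)         # numeric
```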

### Strings and characters

In [None]:
# character - text is represented as a sequence of characters (letters, numbers, and symbols)
s1 <- "do more with less"
print(s1)

In [None]:
# Single quotes within double quotes.
s2 <- "The 'R' project for statistical computing"
print(s2)

In [None]:
# Double quotes within single quotes.
s3 <- 'The "R" project for statistical computing'
print(s3)

In [None]:
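# We want to avoid the following cases: an unescaped quote of the same kind
# ends the string early and causes a syntax error.
# s4 <- "This "is" totally unacceptable"
# s5 <- 'This 'is' absolutely wrong'
# If the same kind of quote is needed inside a string, escape it with a backslash.
s4 <- "This \"is\" acceptable"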

### Numbers

In [None]:
# Numbers in R are generally treated as numeric objects.
n1 <- 1
print(n1)

# There is also a special number Inf which represents infinity. 
# This way Inf can be used in ordinary calculations, e.g. 1 / Inf.
n2 <- 1 / Inf
print(n2)

### Complex numbers

In [None]:
# Complex numbers.
# (The variable is named z rather than c to avoid confusion with the c() function.)
z <- 1+0i
print(z)

### Logical

In [None]:
# Logical.
l <- TRUE
print(l)

### Vectors

In [None]:
# The most basic type of R object is a vector. 
# A vector can only contain objects of the same class.
# A vector can be created using the c() function to concatenate things together.

x <- c(0.5, 0.6) # numeric
print(x)

In [None]:
x <- c(TRUE, FALSE) # logical
print(x)

In [None]:
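x <- c(T, F) # logical; T and F are shorthands for TRUE and FALSE, but they can be reassigned, so the full names are safer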
print(x)

In [None]:
x <- c("a", "b", "c") # character 
print(x)

In [None]:
x <- 1:10 # integer 
print(x)

In [None]:
x <- c(1+0i, 2+4i) # complex
print(x)

In [None]:
# We can also use the vector() function to initialize vectors.
x <- vector("numeric", length = 10)
print(x)

In [None]:
# We can sort the elements of a vector by using the sort() function in the following way.
v <- c(4,78,-45,6,89,678)
sort.v <- sort(v)
print(sort.v)

In [None]:
# Or sort the elements in the reverse order.
revsort.v <- sort(v, decreasing = TRUE)
print(revsort.v) 

Missing values are denoted by NA, or by NaN for undefined mathematical operations.
+ is.na() is used to test objects if they are NA
+ is.nan() is used to test for NaN
+ NA values have a class also, so there are integer NA, character NA, etc. 
+ A NaN value is also NA but the converse is not true

In [None]:
# Create a vector with NA and NaN values.
x <- c(1, 2, NaN, NA, 4)
print(x)

# Return a logical vector indicating which elements are NA
is.na(x)

# Return a logical vector indicating which elements are NaN
is.nan(x)
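
The points about NA classes and the NA/NaN relationship can be checked directly. The following is a small sketch; the vector `x` is an arbitrary example.

```r
# NA is logical by default, but typed NA constants exist for other classes.
class(NA)             # logical
class(NA_integer_)    # integer
class(NA_real_)       # numeric
class(NA_character_)  # character

# NaN counts as NA, but NA does not count as NaN.
is.na(NaN)   # TRUE
is.nan(NA)   # FALSE

# Many summary functions return NA unless missing values are removed explicitly.
x <- c(1, 2, NA, 4)
mean(x)
mean(x, na.rm = TRUE)
```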

In [None]:
# Factors are used to represent categorical data and can be unordered or ordered. 
# One can think of a factor as an integer vector where each integer has a label. 
# Factors are important in statistical modeling and functions like lm() and glm().

x <- factor(c("yes", "yes", "no", "yes", "no"))
print(x)
table(x)
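
The "integer vector with labels" view of a factor can be made visible with `levels()` and `unclass()`. This is a minimal sketch; the variable `y` and the chosen level order are for illustration only.

```r
# By default the levels are sorted alphabetically, so "no" comes before "yes".
x <- factor(c("yes", "yes", "no", "yes", "no"))
levels(x)

# The levels argument sets the order explicitly; the first level is the
# baseline level used by modeling functions such as lm().
y <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
levels(y)

# unclass() reveals the underlying integer codes.
unclass(y)
```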

### Lists

In [None]:
# Lists are a special type of vector that can contain elements of different classes. 
# Lists, in combination with the various apply functions, are a powerful tool.
# Lists can be explicitly created using the list() function, which takes an arbitrary number of arguments.

x <- list(1, "a", TRUE, 1 + 4i)
print(x[3])
print(x)

### Matrices

In [None]:
# Matrices are vectors with a dimension attribute. 
# The dimension attribute is itself an integer vector of length 2 (number of rows, number of columns).
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

In [None]:
# Matrices are constructed column-wise, so entries can be thought of starting in 
# the “upper left” corner and running down the columns.
dim(m)

In [None]:
# Matrices can also be created directly from vectors by adding a dimension attribute.
m <- 1:10
dim(m) <- c(2, 5)
print(m)

In [None]:
# Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.
x <- 1:3
y <- 10:12
cbind(x, y)
rbind(x, y)

### Data Frames

Data frames are usually created by reading in a dataset using the read.table() or read.csv() functions. However, data frames can also be created explicitly with the data.frame() function, or they can be coerced from other types of objects such as lists.

In [None]:
# Data frames are used to store tabular data in R.
# The basic structure of a data frame is that there is one observation per row and each column represents 
# a variable, a measure, feature, or characteristic of that observation. 
# Unlike matrices, data frames can store different classes of objects in each column. 
x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
nrow(x)
ncol(x)
print(x)
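
Coercion from a list, mentioned above, could look like the sketch below. The list contents are made up for illustration.

```r
# A named list whose elements all have the same length...
l <- list(
    id = 1:3,
    city = c("New York", "Seattle", "Los Angeles"),
    visited = c(TRUE, FALSE, TRUE)
)

# ...can be coerced to a data frame; each list element becomes a column.
df <- as.data.frame(l, stringsAsFactors = FALSE)
print(df)

# str() shows that every column keeps its own class.
str(df)
```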

In [None]:
# R objects can have names. 
# Here is an example of assigning names to an integer vector.

x <- 1:3
names(x)
names(x) <- c("New York", "Seattle", "Los Angeles")
print(x)

In [None]:
# Matrices can have both column and row names.
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
print(m)

In [None]:
# Column names and row names can be set separately using the colnames() and rownames() functions.
colnames(m) <- c("h", "f")
rownames(m) <- c("x", "z")
print(m)

<table style="width:60%">
    <tr>
        <td><img src="images/3.png" /></td>
    </tr>
</table>

## I/O

There are several functions for reading data into R:
+ `read.table`, `read.csv`: for reading tabular data
+ `readLines`: for reading lines of a text file
+ `source`: for reading in R code files (inverse of dump)
+ `dget`: for reading in R code files (inverse of dput)
+ `load`: for reading in saved workspaces
+ `unserialize`: for reading single R objects in binary form

There are analogous functions for writing data to files:
+ `write.table`: for writing tabular data to text files (e.g. CSV) or connections
+ `writeLines`: for writing character data line-by-line to a file or connection
+ `dump`: for dumping a textual representation of multiple R objects
+ `dput`: for outputting a textual representation of an R object
+ `save`: for saving an arbitrary number of R objects in binary format (possibly compressed) to a file
+ `serialize`: for converting an R object into a binary format for outputting to a connection (or file)
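
To see how the reading and writing functions pair up, here is a small round-trip sketch. The data frame is made up, and `tempfile()` is used so that nothing is written into the working directory.

```r
# A small data frame to write out (made-up values).
df <- data.frame(id = 1:3, score = c(9.5, 7.25, 8))

# tempfile() returns throwaway file paths.
csv_path <- tempfile(fileext = ".csv")
r_path <- tempfile(fileext = ".R")

# write.table() / read.table() round trip through a CSV text file.
write.table(df, csv_path, sep = ",", row.names = FALSE)
df_csv <- read.table(csv_path, header = TRUE, sep = ",")

# dput() / dget() round trip through a textual R representation.
dput(df, file = r_path)
df_r <- dget(r_path)

print(df_csv)
print(df_r)
```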

The `read.table()` function is one of the most commonly used functions for reading data and has the following arguments:
+ <b>file</b>: the name of a file or a connection
+ <b>header</b>: logical indicating if the file has a header line
+ <b>sep</b>: a string indicating how the columns are separated
+ <b>colClasses</b>: a character vector indicating the class of each column in the dataset
+ <b>nrows</b>: the number of rows in the dataset; by default read.table() reads an entire file
+ <b>comment.char</b>: a character string indicating the comment character; this defaults to "#"
+ <b>skip</b>: the number of lines to skip from the beginning
+ <b>stringsAsFactors</b>: should character variables be coded as factors? This defaults to TRUE in R versions before 4.0.0 and to FALSE from R 4.0.0 onwards
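
The sketch below exercises several of these arguments at once. To stay self-contained it uses the `text` argument of `read.table()` to parse a string instead of a file; the passenger records are invented for illustration.

```r
# Made-up sample data; the first line is a comment that read.table() skips.
raw <- "# passengers (made-up sample)
name,age,survived
Alice,29,TRUE
Bob,41,FALSE
Carol,35,TRUE"

passengers <- read.table(
    text = raw,                  # parse this string instead of a file
    header = TRUE,               # the first non-comment line holds column names
    sep = ",",                   # columns are separated by commas
    colClasses = c("character", "integer", "logical"),
    comment.char = "#",          # lines starting with # are ignored
    stringsAsFactors = FALSE     # keep text columns as character
)
str(passengers)
```

Setting `colClasses` explicitly avoids the type guessing that `read.table()` otherwise performs, which can matter for large files.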

In [None]:
data <- read.table("data/titanic.csv", header = TRUE, nrows = 10, sep = ",")
head(data)

## Subsetting R Objects

There are three operators that can be used to extract subsets of R objects.
+ The `[` operator always returns an object of the same class as the original; it can be used to select multiple elements of an object
+ The `[[` operator is used to extract elements of a list or a data frame; it can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame
+ The `$` operator is used to extract elements of a list or data frame by literal name; its semantics are similar to that of `[[`
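
The difference between the three operators is easiest to see on a small list. This is a minimal sketch; the list and the variable `name` are arbitrary.

```r
x <- list(foo = 1:4, bar = 0.6)

# [ keeps the container: the result is still a list (of length 1).
class(x["foo"])    # list

# [[ extracts the element itself, here an integer vector.
class(x[["foo"]])  # integer

# $ behaves like [[ with a literal name.
identical(x$foo, x[["foo"]])  # TRUE

# [[ accepts a computed name, while $ only takes the name literally.
name <- "bar"
x[[name]]   # 0.6
x$name      # NULL, because no element is literally called "name"
```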


In [None]:
# Vectors are basic objects in R and they can be subsetted using the [ operator.
x <- c("a", "b", "c", "d", "e", "f")

# Extract the first element.
print(x[1])

# Extract the second element.
print(x[2])

# Here we extract the first four elements of the vector.
print(x[1:4])

# The sequence does not have to be in order; we can specify any arbitrary integer vector.
print(x[c(1, 3, 4)])

In [None]:
# Matrices can be subsetted in the usual way with (i,j) type indices. 
# Here, we create a simple 2x3 matrix with the matrix function.

x <- matrix(1:6, 2, 3)
print(x)
cat("\n")

# We can access the (1,2) or the (2,1) element of this matrix using the appropriate indices.
print(x[1, 2])
print(x[2, 1])

In [None]:
# Indices can also be missing. This behavior is used to access entire rows or columns of a matrix.
# Extract the first row.
print(x[1,])

In [None]:
# Extract the second column.
print(x[,2])

In [None]:
x <- list(foo = 1:4, bar = 0.6)
print(x)

# The [[ operator can be used to extract single elements from a list. 
# Here we extract the first element of the list.
print(x[[1]])
print(x[["bar"]])
cat("\n")

# We can also use the $ operator to extract elements by name.
print(x$bar)

In [None]:
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
print(x)

In [None]:
# Get the 3rd element of the 1st element.
x[[1]][[3]]

In [None]:
# Get the 2nd element of the 2nd element.
x[[2]][[2]]

## Vectorized Operations

Many operations in R are vectorized, meaning that operations occur in parallel in certain R objects. This allows us to write code that is efficient, concise, and easier to read than in non-vectorized languages.
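
To make the contrast concrete, the sketch below adds two vectors first with an explicit loop, as a non-vectorized language would require, and then with a single vectorized expression. The variable names are chosen for illustration.

```r
x <- 1:4
y <- 6:9

# Non-vectorized version: loop over the indices and add element by element.
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
    z_loop[i] <- x[i] + y[i]
}

# Vectorized version: a single expression does the same work.
z_vec <- x + y

print(z_loop)
print(z_vec)
```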

In [None]:
# The simplest example is when adding two vectors together.
x <- 1:4
print(x)
cat("\n")

y <- 6:9
print(y)
cat("\n")

z <- x + y
print(z)

In [None]:
# We can also do other operations in a vectorized manner, such as 
# logical comparisons, subtraction, multiplication and division.
x > 2
x >= 2
x < 3
y == 8
x - y
x * y
x / y

In [None]:
# Matrix operations are also vectorized. This way, we can do element-by-element 
# operations on matrices without having to loop over every element.
x <- matrix(1:4, 2, 2)
print(x)
y <- matrix(rep(10, 4), 2, 2)
print(y)

In [None]:
# Element-wise multiplication.
x * y

In [None]:
# Element-wise division.
x / y

In [None]:
# True matrix multiplication.
x %*% y

## Control Structures

Control structures in R allow us to control the flow of execution of a series of R expressions. Control structures allow us to put logic into our R code, rather than just always executing the same R code every time. Moreover, control structures allow us to respond to inputs or to features of the data and execute different R expressions accordingly.

Commonly used control structures are:
+ `if-else`: testing a condition and acting on it
+ `for`: execute a loop a fixed number of times
+ `while`: execute a loop while a condition is true
+ `repeat`: execute an infinite loop (must break out of it to stop)
+ `break`: break the execution of a loop
+ `next`: skip an iteration of a loop


## `if-else`

<pre>
if(&lt;condition1&gt;) {
    "do something"
} else if(&lt;condition2&gt;) {
    "do something different"
} else {
    "do something different"
}
</pre>
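
A concrete instance of the three-branch template might look like this. The temperature value and the thresholds are made up for illustration.

```r
# Hypothetical example: classify a temperature in degrees Fahrenheit.
temp <- 72

if (temp > 80) {
    label <- "hot"
} else if (temp > 60) {
    label <- "mild"
} else {
    label <- "cold"
}
print(label)
```

Only the first branch whose condition is TRUE is executed; the remaining conditions are not evaluated.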

In [None]:
# Generate a uniform random number.
# runif(n, a, b) generates n uniform random numbers between a and b.
x <- runif(1, 0, 10) 
print(x)
if(x > 3) {
    y <- 10 
} else {
    y <- 0 
}
print(y)

## `for` Loop

In R, `for` loops take an iterator variable and assign it successive values from a sequence or vector.

`for` loops are most commonly used for iterating over the elements of an object (list, vector, etc.).

In [None]:
for(i in 1:5) {
    print(i) 
}

cat("\n")

x <- c("a", "b", "c", "d")
for(letter in x) {
    print(letter)
}

## `while` Loop

`while` loops begin by testing a condition. If it is true, then they execute the loop body.

Once the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits.

In [None]:
count <- 0
while(count < 10) {
    print(count)
    count <- count + 1
}

## `repeat` Loop

`repeat` initiates an infinite loop right from the start. The only way to exit a `repeat` loop is to call `break`.

One possible paradigm might be an iterative algorithm where we are searching for a solution and do not want to stop until we are close enough to it. In this kind of situation, we often don’t know in advance how many iterations it will take to get close enough to the solution.
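
As one possible illustration of this paradigm, the sketch below approximates a square root with Newton's method and stops once two successive estimates are closer than a tolerance. The target value, starting guess, tolerance, and variable names are all arbitrary choices.

```r
# Approximate sqrt(2) with Newton's method (illustrative values).
target <- 2
x <- 1          # starting guess
tol <- 1e-8     # stop when successive estimates differ by less than this
iterations <- 0

repeat {
    x_new <- (x + target / x) / 2   # Newton update for x^2 = target
    iterations <- iterations + 1
    if (abs(x_new - x) < tol) {
        break                       # close enough: leave the loop
    }
    x <- x_new
}

print(x_new)
print(iterations)
```

In practice it is safer to also cap the number of iterations, since a `repeat` loop whose stopping condition is never met runs forever.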


In [None]:
x <- runif(1, 0, 10)
tol <- 3

repeat {
    print(x)
    if(abs(x) < tol) {
        break
    } else {
        x <- runif(1, 0, 10)
    } 
}

## `next`, `break`

+ `next` is used to skip an iteration of a loop.
+ `break` is used to exit a loop immediately, regardless of what iteration the loop may be on.

In [None]:
for(i in 1:30) { 
    if(i <= 20) {
        next # Skip the first 20 iterations.
    }
    print(i)
}

cat("\n")

for(i in 1:30) { 
    print(i)
    if(i > 20) {
        break
    }
}

## Functions in R

Functions are often used to encapsulate a sequence of expressions that need to be executed numerous times, perhaps under slightly different conditions. Functions are also often written when code must be shared with others.

The writing of a function allows a developer to create an interface to the code, that is explicitly specified with a set of parameters. This interface provides an abstraction of the code to potential users.

Functions are defined using the function() directive and are stored as R objects just like anything else. In particular, they are R objects of class `function`.
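
A minimal sketch of these ideas follows. The function `power` and its arguments are hypothetical names chosen for illustration.

```r
# A hypothetical function with two arguments; the second has a default value.
power <- function(base, exponent = 2) {
    base^exponent   # the value of the last expression is returned
}

class(power)                    # function

power(3)                        # 9, exponent falls back to its default
power(2, 3)                     # 8, arguments matched by position
power(exponent = 3, base = 2)   # 8, arguments matched by name, in any order
```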

In [None]:
# The num argument has a default value of 5, which is used when f() is called without arguments.
# When specifying function arguments by name, it does not matter in what order we specify them.
f <- function(num = 5) {
    for(i in seq_len(num)) {
        cat(i, "apple(s)\n", sep = " ")
    }
}

f(num = 3)
cat("\n")
f()

## Loop Functions

Writing `for` and `while` loops is useful when programming but not particularly easy when working interactively on the command line. R provides several functions that implement looping in a compact form:

+ `lapply()`: Loop over a list and evaluate a function on each element
+ `sapply()`: Same as lapply but try to simplify the result
+ `apply()`: Apply a function over the margins of an array
+ `tapply()`: Apply a function over subsets of a vector

The `lapply()` function does the following simple series of operations:
+ it loops over a list, iterating over each element in that list
+ it applies a function to each element of the list (a function that we specify) and returns a list (the l is for “list”)

lapply(X, FUN)
+ <b>X</b>: A list or vector
+ <b>FUN</b>: Function applied to each element of X

(`rnorm(n, mean, sd)` generates n random values from a normal distribution with the given mean and standard deviation; the defaults are a mean of 0 and a standard deviation of 1.)

In [None]:
x <- list(a = 1:5)
print(x)
print(lapply(x, mean))

The `sapply()` function behaves similarly to `lapply()`; the main difference is in the return value. 

`sapply()` will try to simplify the result of `lapply()` if possible. Essentially, `sapply()` calls `lapply()` on its input and then applies the following algorithm:
+ If the result is a list where every element is length 1, then a vector is returned
+ If the result is a list where every element is a vector of the same length (> 1), a matrix is returned
+ If it can’t figure things out, a list is returned

sapply(X, FUN)
+ <b>X</b>: A list or vector
+ <b>FUN</b>: Function applied to each element of X
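
The three simplification cases can be reproduced with small lists. This is a sketch; the lists `x` and `y` and the anonymous function are made up for illustration.

```r
x <- list(a = 1:4, b = 5:8)

# Every result has length 1: sapply() returns a named vector.
sapply(x, mean)

# Every result has the same length (> 1): sapply() returns a matrix
# with one column per list element.
sapply(x, range)

# Results of different lengths cannot be simplified: a list is returned.
y <- list(a = 1:4, b = 1:2)
sapply(y, function(v) v[v > 1])
```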


In [None]:
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
print(x)

In [None]:
# lapply() returns a list (as usual); note that each element of the list has length 1.
print(lapply(x, mean))

In [None]:
# Here’s the result of calling sapply() on the same list.
print(sapply(x, mean))

The `tapply()` function applies a function (mean, median, min, max, etc.) to each group of values in a vector, where the groups are defined by a factor.

tapply(X, INDEX, FUN = NULL)
+ <b>X</b>: An object, usually a vector
+ <b>INDEX</b>: A factor, or a list of factors, each of the same length as X
+ <b>FUN</b>: Function applied to each group of elements of X

The `gl()` function generates factors by specifying the pattern of their levels:
+ gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)

In [None]:
# Simulate some data.
x <- c(rnorm(3), runif(3), rnorm(3))
print(x)

# Define some groups with a factor variable.
# The gl() function generates factors by specifying the pattern of their levels.
# n is an integer giving the number of levels.
# k is an integer giving the number of replications.
f <- gl(3, 3)
print(f)

print(tapply(x, f, mean))

The `apply()` function is used to evaluate a function over the margins of an array.

It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). However, it can be used with general arrays, for example, to take the average of an array of matrices. 

Using `apply()` is not really faster than writing a loop, but it works in one line and is highly compact.

apply(X, MARGIN, FUN)
+ <b>X</b>: an array or matrix
+ <b>MARGIN</b>: the dimension(s) over which the function is applied
    + <b>MARGIN = 1</b>: the function is applied to each row
    + <b>MARGIN = 2</b>: the function is applied to each column
    + <b>MARGIN = c(1, 2)</b>: the function is applied to each individual cell (every row and column combination)
+ <b>FUN</b>: the function to apply. Built-in functions like mean, median, sum, min, and max as well as user-defined functions can be used
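
The effect of each `MARGIN` value can be seen on a small matrix. This is a minimal sketch; the matrix and the variable names are chosen for illustration.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

row_totals <- apply(m, 1, sum)   # one value per row
col_totals <- apply(m, 2, sum)   # one value per column

# MARGIN = c(1, 2) applies the function to every cell individually.
squared <- apply(m, c(1, 2), function(v) v^2)

# A user-defined function works the same way (here: the spread of each column).
col_spread <- apply(m, 2, function(col) max(col) - min(col))

print(row_totals)
print(col_totals)
print(squared)
print(col_spread)
```

For plain sums and means, base R also provides the dedicated functions `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()`.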

In [None]:
x <- matrix(rnorm(50), 10, 5)
print(x)

In [None]:
# Take the sum of each row.
# (The result is named row_sums so that it does not mask the built-in rowSums() function.)
row_sums <- apply(x, 1, sum)
print(row_sums)

In [None]:
# Take the mean of each row.
# (The result is named row_means so that it does not mask the built-in rowMeans() function.)
row_means <- apply(x, 1, mean)
print(row_means)

In [None]:
# Take the sum of each column.
# (The result is named col_sums so that it does not mask the built-in colSums() function.)
col_sums <- apply(x, 2, sum)
print(col_sums)

In [None]:
# Take the mean of each column.
# (The result is named col_means so that it does not mask the built-in colMeans() function.)
col_means <- apply(x, 2, mean)
print(col_means)
cat("\n")

## The dplyr Package

The `dplyr` package was developed by Hadley Wickham of RStudio and is an optimized and distilled version of his `plyr` package. 

The `dplyr` package simplifies existing functionality in R. One important contribution of the `dplyr` package is that it provides a grammar (in particular, verbs) for data manipulation and for operating on data frames. With this grammar, we can communicate what we are doing to a data frame in a way that other people can understand. Another useful contribution is that the `dplyr` functions are very fast, as many key operations are coded in C++.

Some of the key verbs provided by the dplyr package are:
+ `select`: return a subset of the columns of a data frame, using a flexible notation
+ `filter`: extract a subset of rows from a data frame based on logical conditions
+ `arrange`: reorder rows of a data frame
+ `rename`: rename variables in a data frame
+ `mutate`: add new variables/columns or transform existing variables
+ `group_by` / `summarize`: generate summary statistics of different variables in the data frame, possibly within strata
+ `%>%`: the “pipe” operator is used to connect multiple verb actions together into a pipeline
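
Before turning to the Chicago data, the verbs can be tried on `mtcars`, a data set that ships with R. It is used here only as a stand-in, the sketch assumes the `dplyr` package is installed, and the derived column names are made up for illustration.

```r
library(dplyr)

# mtcars ships with R; it is used here only as a stand-in data set.
result <- mtcars %>%
    select(mpg, cyl, wt) %>%                        # keep three columns
    filter(wt < 4) %>%                              # keep the lighter cars
    rename(weight = wt) %>%                         # new name on the left, old on the right
    mutate(weight_kg = weight * 1000 * 0.4536) %>%  # wt is recorded in units of 1000 lbs
    group_by(cyl) %>%                               # one group per cylinder count
    summarize(mean_mpg = mean(mpg), n_cars = n()) %>%
    arrange(desc(mean_mpg))                         # best fuel economy first

print(result)
```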

In [None]:
library(dplyr)

# We will be using a dataset containing air pollution and temperature data for the city of Chicago in the U.S.
chicago <- readRDS("data/chicago")
dim(chicago)
head(chicago)

## `select()`

The `select()` function can be used to select columns of a data frame that we want to focus on.

In [None]:
# Suppose we wanted to take the first 3 columns only. 
# We could for example use numerical indices. 
names(chicago)[1:3]

In [None]:
# But we can also use the names directly.
head(select(chicago, city:dptp))

In [None]:
# We can also omit variables by using the negative sign.
head(select(chicago, -(city:dptp)))

In [None]:
# The select() function also allows a special syntax that allows us to specify variable names based on patterns. 
# For example, if we want to keep every variable that ends with a “2”, we could do:
subset <- select(chicago, ends_with("2"))
head(subset)

In [None]:
# Or if we wanted to keep every variable that starts with a “d”, we could do:
subset <- select(chicago, starts_with("d"))
head(subset)

## `filter()`

The `filter()` function is used to extract subsets of rows from a data frame. This function is similar to the existing `subset()` R function but is faster.

In [None]:
# Suppose we wanted to extract the rows of the chicago data frame where the levels of PM2.5 
# are greater than 30 and the temperature is greater than 80 degrees Fahrenheit.
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(chic.f)
summary(chic.f$pm25tmean2)
summary(chic.f$tmpd)

## `arrange()`

The `arrange()` function is used to reorder rows of a data frame according to one of the variables/columns.

In [None]:
# Let's order the rows of the data frame by an ascending order of date (oldest observations first).
chicago <- arrange(chicago, date)
head(chicago)

# Or arrange the rows in descending order of date (newest observations first).
chicago <- arrange(chicago, desc(date))
head(chicago)

## `rename()`

The `rename()` function allows us to rename a variable in a data frame.

The syntax inside the `rename()` function is to have the new name on the left-hand side of the = sign and the old name on the right-hand side.

In [None]:
# Here we see the names of the first five variables in the chicago data frame.
head(chicago[,1:5])

In [None]:
# These names are pretty obscure and could be renamed to something more sensible.
chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago[,1:5])

## `mutate()`
The `mutate()` function exists to compute transformations of variables in a data frame. 

Often, we want to create new variables that are derived from existing variables and `mutate()` provides a clean interface for doing that.

In [None]:
# For example, let's say we want to detrend the air pollution data by subtracting the mean from the data. 
# That way we can look at whether a given day’s air pollution level is higher than or less than average.
# We create a pm25detrend variable that subtracts the mean from the pm25 variable.
chicago_mutated <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago_mutated)

## `group_by()`

The `group_by()` and the `summarize()` functions are used to generate summary statistics from the data frame within strata defined by a variable.

In [None]:
# For example, we might want to know what the average annual level of PM2.5 is. 
# The stratum is the year and that is something we can derive from the date variable.

# First, we can create a year variable using as.POSIXlt().
chicago_mutated <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
head(chicago_mutated)

# Next we create a separate data frame that splits the original data frame by year.
years <- group_by(chicago_mutated, year)

# Finally, we compute summary statistics for each year with the summarize() function.
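# (The result is named yearly_summary so that it does not mask the built-in summary() function.)
yearly_summary <- summarize(years,
    pm25 = mean(pm25, na.rm = TRUE),
    o3 = max(o3tmean2, na.rm = TRUE),
    no2 = median(no2tmean2, na.rm = TRUE))
head(yearly_summary)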

## `%>%`

The pipeline operator `%>%` allows us to string together multiple dplyr functions in a sequence of operations, in a left-to-right fashion, i.e.

`first(x) %>% second %>% third`
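
To see what the operator buys us, the sketch below writes the same computation twice: once as nested calls and once as a pipeline. It assumes the `dplyr` package is installed and uses the built-in `mtcars` data set purely as a stand-in.

```r
library(dplyr)

# Nested calls have to be read from the inside out...
nested <- arrange(filter(select(mtcars, mpg, cyl), cyl == 4), desc(mpg))

# ...while the pipeline reads left to right, one step per line.
piped <- mtcars %>%
    select(mpg, cyl) %>%
    filter(cyl == 4) %>%
    arrange(desc(mpg))

# Both forms produce the same data frame.
print(all.equal(nested, piped))
```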

In [None]:
# Let's compute the max pollutant level by month. 
mutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%
    group_by(month) %>%
    summarize(pm25_max = max(pm25, na.rm = TRUE),
        o3_max = max(o3tmean2, na.rm = TRUE),
        no2_max = max(no2tmean2, na.rm = TRUE))