# Lesson 01 - R syntax and basic data types

## Introduction to R
* R is a scripting language designed for interactive statistical programming and plotting.
* Like Bash, Perl and Python, R code is run by an interpreter rather than compiled ahead of time
  * Note: R tools are free software - You may use them for any purpose, modify and improve them if needed, and share your improvements with the whole community.
* In R, comments start with a '#' character
  * `[Some code if any] # Comment`
* **In a Jupyter Terminal, run**:
  * `cd intro-R`
  * `cat script.R `
  * `Rscript script.R`

## Exercise 1: Invitation to R from a Jupyter (IPython) Notebook
* Let's skip “hello world” and start analyzing data
* In cells below, we will load a data table, get a summary of each column, and make scatterplots of every pair of variables

In [None]:
library(datasets)

In [None]:
?airquality    # The "?" operator provides documentation

In [None]:
summary(airquality)

In [None]:
pairs(airquality)

## The R Assignment Operator
* In R, the assignment operator is "`<-`"
  * It looks like a left-pointing arrow
    * `a <- 1`
* In most languages, assignment is done using the '=' symbol
  * R supports both, but they are different operators
    * `a = 1  #` Mostly used to name function arguments or data frame columns
    * `a <- 1 #` The `<-` operator has higher precedence than `=`
  * Google's R style guide forbids '=' for assignment
    * https://google.github.io/styleguide/Rguide.xml

### Chained assignment

In [None]:
a = 1

In [None]:
a

In [None]:
b <- 2
b

(b <- 3)

In [None]:
a
b
b <- a <- b + a
a
b

In [None]:
(a <- 1)
(b <- 3)
b <- (a <- b) + a
a
b

### Mixed assignment (NOT recommended!)

In [None]:
b = a <- 1
a
b

In [None]:
b <- a = 2    # Error: parsed as (b <- a) = 2, because "<-" binds tighter than "="
a
b

In [None]:
b <- (a = 3)
a
b

### After all, the arrow notation is more intuitive in R

In [None]:
print(a <- 1) # Assign 1 to "a", then print "a"
print(a = 2)  # Call print() with argument "a" set to 2

In [None]:
?print

In [None]:
print((a = 3))  # After all, the <- operator makes the code cleaner

a <- 4          # Even better: not doing assignment in a function call
print(a)        # This is safer: commenting out the print() call does not remove the assignment
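
The difference between naming an argument and assigning a variable can be checked directly. This is a minimal sketch: the helper `half()` and the variable names `value` and `total` are hypothetical, and it assumes no variable called `value` already exists in the workspace.

```r
half <- function(value) value / 2    # hypothetical helper

r1 <- half(value = 10)               # "=" names the argument; no variable is created
created_by_equals <- exists("value", inherits = FALSE)

r2 <- half(total <- 10)              # "<-" assigns to "total", then passes its value
created_by_arrow <- exists("total", inherits = FALSE)

created_by_equals
created_by_arrow
```

Both calls return the same result, but only the second one leaves a new variable behind.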

### Be careful with your coding style and spacing
Some editors support shortcuts to automatically insert "<-".
* https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts

In [None]:
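print(a < -1)    # Comparison: is "a" less than -1?
print(a <-2)     # Assignment: "a" is now 2
print(a<-3)      # Assignment: "a" is now 3
print(a<--4)     # Assignment: "a" is now -4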

## Data types
### Atomic data types (characters, numeric, complex, logical)

**Characters and strings**
* Single and double quotation marks can be used interchangeably

In [None]:
'This is a "string".'
"John's car crashed."
cat("Agent Smith's glasses are so \"cool\"!")

**Numeric and integer numbers**
* The integer division must be done with the `%/%` operator

In [None]:
22 / 7
22 %/% 7
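
As a short sketch of how numeric and integer values differ, the cell below also uses the remainder operator `%%` and the `L` suffix, neither of which appears in the lesson text above.

```r
class(22)          # "numeric": plain number literals are stored as doubles
class(22L)         # "integer": the L suffix marks an integer literal

q <- 22 %/% 7      # integer quotient
r <- 22 %% 7       # remainder
q * 7 + r == 22    # quotient and remainder rebuild the dividend
```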

**Complex numbers**
* You may use the `x+yi` notation when writing constants, or use the `complex()` function for computed components
* To get the real component, use `Re()`
* To get the imaginary component, use `Im()`

In [None]:
k <- (0+1i) * complex(imaginary = 1)    # i * i = -1+0i
Re(k)
Im(k)

(113/355)**Re(k)
(113/355)^Re(k)
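
A few more base functions work on complex values. The sketch below goes slightly beyond the lesson text by using `Mod()` and `Conj()`; the variable names are illustrative.

```r
z <- complex(real = 3, imaginary = 4)
modulus <- Mod(z)                 # Distance from the origin: sqrt(3^2 + 4^2)
conjugate <- Conj(z)              # Same real part, opposite imaginary part

# sqrt() only returns a complex result when its input is complex
root <- sqrt(as.complex(-1))
root
```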

**Logical values**

In [None]:
(22 > 7) ^ (113 > 355)    # Not XOR: "^" is exponentiation, so this is 1 ^ 0 = 1
xor(22 > 7, 113 > 355)    # Logical exclusive OR: TRUE

In [None]:
(22 > 7) && (113 > 355)
(22 > 7) || (113 > 355)

In [None]:
! TRUE||TRUE    # "!" binds tighter than "||": evaluated as (!TRUE) || TRUE
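
The cells above use the scalar operators `&&` and `||`. As a hedged sketch, the element-wise forms `&` and `|` (not covered in the lesson text) behave as follows on logical vectors; `p` and `q` are illustrative names.

```r
p <- c(TRUE, TRUE, FALSE)
q <- c(TRUE, FALSE, FALSE)

both   <- p & q        # Element-wise AND
either <- p | q        # Element-wise OR
one_of <- xor(p, q)    # Element-wise exclusive OR

scalar  <- (22 > 7) && (113 > 355)    # "&&" works on single values
negated <- !(TRUE || TRUE)            # Parentheses force the OR to run first
```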

### Vectors
* Vector - an ordered sequence of data elements of a single type
* Created with the `c()` function

In [None]:
v1 <- c(1,2,3,4,5)
v1

In [None]:
v2 <- c("Peter's", 'apple', '"orange"')
v2

* `sprintf()` is vectorized: a single call formats every element of a vector

In [None]:
v3 <- c(NA, 2, 3, 5, 7)    # Missing values are written with the "NA" keyword
sprintf("%.12f", v3)       # Each element is formatted as a float with 12 digits after "."
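
When several vectors are passed, `sprintf()` pairs their elements up position by position. A minimal sketch with made-up names and scores:

```r
who    <- c("Alice", "Bob")      # illustrative data
points <- c(90L, 85L)

lines <- sprintf("%s scored %d", who, points)    # One string per pair of elements
lines

shares <- sprintf("%.1f%%", c(12.5, 50))         # "%%" prints a literal percent sign
shares
```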

* Operators `+, -, *, /, <, >, <=, >=, ==, !=` and functions such as `sqrt(), log(), exp()` are element-wise
  * The dot-product, or matrix multiplication, is done with the `%*%` operator

In [None]:
(v4 <- log(v3) / sqrt(v3))    # Element-wise division

(vv1 <- v1 %*% v1)            # Dot-product as a matrix multiplication
sprintf("Proof: %d", 1*1 + 2*2 + 3*3 + 4*4 + 5*5)

* Built-in summary functions: `sort(), min(), max(), mean(), median(), quantile(), sum(), var()`
  * Note: most of these functions return `NA` when the input contains `NA` values; pass the named argument "`na.rm=TRUE`" to ignore them

In [None]:
var(v4, na.rm=TRUE)
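
The effect of `na.rm` is easiest to see on a small vector. This sketch reuses the values of `v3` under the illustrative name `x`, and adds `is.na()`, which the lesson text does not mention.

```r
x <- c(NA, 2, 3, 5, 7)

with_na    <- mean(x)                   # NA: the missing value propagates
without_na <- mean(x, na.rm = TRUE)     # Mean of 2, 3, 5 and 7
n_missing  <- sum(is.na(x))             # Count of missing values

with_na
without_na
n_missing
```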

### Lists
* List - an ordered sequence of data elements of any type, including complex numbers, vectors and other lists
* We can set a name to each element in a list

In [None]:
l1 <- list('Table', 2, 4+3i, TRUE, c(3,4))
l1

In [None]:
l2 <- list(name="Alice", org="McGill U.", 98765.43)
l2
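
Element names can be inspected and changed with `names()`. In this sketch the list `person` mirrors `l2`, and the label "salary" is a hypothetical name chosen for the unnamed element.

```r
person <- list(name = "Alice", org = "McGill U.", 98765.43)

element_names <- names(person)     # Unnamed elements get an empty name ""
element_names

names(person)[3] <- "salary"       # "salary" is a hypothetical label
n_elements <- length(person)
str(person)                        # Compact overview of each element and its type
```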

### Indexing and Sequences in R
* Vectors, lists and arrays are indexed using square brackets
  * The first index is 1
  * An index can be a range (`x:y`), a logical mask (a vector of TRUE/FALSE) or a vector of integers

In [None]:
c(2,4,8)[1]    # Create a vector with 3 numbers, select the first

In [None]:
v1
v1 >= 3        # TRUE if value is greater than or equal to 3
v1[v1 >= 3]    # Logical mask: select all values greater than or equal to 3

In [None]:
v2
v2[c(1,3)]     # A vector of indexes

* From lists:
  * The single bracket `[]` notation will return a sublist
  * The double bracket `[[]]` notation will return a single value
* Use the dollar sign `$` notation to select named members

In [None]:
l1
l1[4]
l1[[4]]

In [None]:
l2
l2[1:2]
l2$org
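
The difference between `[]` and `[[]]` shows up in the type of the result. A small sketch, where `items` is an illustrative copy of `l1`:

```r
items <- list('Table', 2, 4+3i, TRUE, c(3,4))

sub <- items[5]       # Single brackets: a list of length 1
val <- items[[5]]     # Double brackets: the numeric vector itself

class(sub)
class(val)
second <- items[[5]][2]    # Extract the element, then index into it
second
```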

* Sequence of numbers: `seq(start, stop, by=step)`, where values are in [start, stop]
* Sequence of N numbers: `seq(start, by=step, length.out=N)`
* Range made of N values: `seq(start, stop, length.out=N)`, where values are in [start, stop]

In [None]:
seq(2, 5.9, by=2)
seq(1, 5, by=2)

In [None]:
v1
v1[seq(5, 1, by=-2)]

In [None]:
seq(3, by=2, length.out=20)
seq(0, 10, length.out=21)
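
The three forms of `seq()` can be compared by looking at where each sequence ends. A minimal sketch with illustrative variable names:

```r
by_step  <- seq(2, 5.9, by = 2)              # Stops at 4: the next value, 6, would exceed 5.9
by_count <- seq(3, by = 2, length.out = 20)  # 20 values starting at 3
evenly   <- seq(0, 10, length.out = 21)      # 21 evenly spaced values, both ends included

by_step
length(by_count)
by_count[20]
evenly[2] - evenly[1]
```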

### Matrices and Arrays
* Vector: a 1-dimensional array
* Matrix: a 2-dimensional array
  * Use the `t()` function to transpose a matrix

In [None]:
m1 <- matrix(c(2,4,3,1,5,7), nrow=3, ncol=2)
m1

In [None]:
m1 * m1         # Element-wise product (3x2)
m1 %*% t(m1)    # Matrix product (3x3)
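
A sketch of how the shapes work out, assuming the same values as `m1` under the illustrative name `m`. Note that `matrix()` fills the values column by column.

```r
m <- matrix(c(2,4,3,1,5,7), nrow=3, ncol=2)    # Columns: (2,4,3) and (1,5,7)

dim(m)                # Rows, then columns
m[3, 2]               # Row 3, column 2

wide <- m %*% t(m)    # (3x2) times (2x3) gives 3x3
gram <- t(m) %*% m    # (2x3) times (3x2) gives 2x2
dim(wide)
gram
```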

* Array: N-dimensional collection of data entries of a single type

In [None]:
a1 <- array(1:20, dim=c(2,2,5))
a1[1,,]
a1[,2,]
a1[,,1]
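
Leaving an index empty keeps that whole dimension, so the result has fewer dimensions than the original array. A short sketch, where `a` is an illustrative copy of `a1`:

```r
a <- array(1:20, dim=c(2,2,5))

slice <- a[1,,]        # Fix the first index: a 2x5 matrix remains
dim(slice)

single <- a[2,1,3]     # All three indexes given: one value
single

is.null(dim(1:5))      # A plain vector carries no "dim" attribute
```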

### Data Frames
* Data frame - Collection of vectors of equal length, possibly of different types, stored as columns
  * Example: the airquality data was imported as a data frame

In [None]:
myFrame <- data.frame(employee=c("Alice", "Bob", "Carol"), org=c("McGill U.", "U. de Mtl", "U. Laval"))
myFrame

* Column names: use `names()` or `colnames()`
* Number of rows: use `nrow()`. See also `dim()`, `ncol()`

In [None]:
names(myFrame)
colnames(myFrame)
sprintf("%d rows", nrow(myFrame))

* Matrix indexing: use single bracket `[i,j]` notation, where `i` and `j` are indexes. An empty index means "all"
* Column indexing: use the dollar sign (`$`) notation to select a named column

In [None]:
myFrame[2,]            # Entire row 2
myFrame$org[c(1,3)]    # Rows 1 and 3 of the "org" column
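
Bracket indexing also accepts column names and logical masks. This is a minimal sketch on a hypothetical data frame `staff`; `stringsAsFactors = FALSE` is set explicitly so that text columns stay character vectors on any R version.

```r
staff <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  org  = c("McGill U.", "U. de Mtl", "U. Laval"),
  stringsAsFactors = FALSE
)

one_cell   <- staff[2, "org"]                   # Row 2 of column "org"
two_names  <- staff[c(1, 3), "name"]            # Rows 1 and 3 of column "name"
laval_rows <- staff[staff$org == "U. Laval", ]  # Logical mask on rows, all columns
shape      <- dim(staff)                        # Number of rows, number of columns

laval_rows
```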

* Use `head()` to get the first 6 rows. Set `n = N` to select the first `N` rows
* Use `tail()` to get the last 6 rows. Set `n = N` to select the last `N` rows

In [None]:
head(myFrame, n=2)
tail(myFrame, n=2)

* Appending Data
  * Use `rbind()` to append rows to a matrix or data frame
  * Use `cbind()` to append columns to a matrix or data frame

In [None]:
df1 <- data.frame(one=c(10,11,12), two=c("w","w","x"))
df2 <- data.frame(one=c(20,21,22), two=c("y","y","z"))

In [None]:
df12v <- rbind(df1, df2)
df12v
df12v$one

In [None]:
df12h <- cbind(df1, df2)
df12h
df12h$one
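
Because both data frames use the same column names, `cbind()` produces duplicated names and `$` only reaches the first match. One possible way around this, sketched with illustrative data frames `left` and `right` and hypothetical new column names:

```r
left  <- data.frame(one=c(10,11,12), two=c("w","w","x"))
right <- data.frame(one=c(20,21,22), two=c("y","y","z"))

wide <- cbind(left, right)
wide_names <- names(wide)    # Each name appears twice
first_one  <- wide$one       # Only the first "one" column is returned

# Hypothetical unique names make every column reachable
names(wide) <- c("one_a", "two_a", "one_b", "two_b")
wide$one_b

tall <- rbind(left, right)   # Same columns, rows stacked
nrow(tall)
```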

## Exercise 2 - Analyzing Data from a Data Frame
* From the airquality data, find the `mean()` of `Temp` for the days when `Ozone` was below its `median()`
  * Hint: the `Ozone` column contains `NA` values
* Challenge: process the data in one line