# Introduction to R

<div class="alert alert-block alert-info">
<b>Overview of R:</b> 

- R is a widely used programming language, so many resources are available. If you need to do some data wrangling, chances are someone has already had to do the same thing and may even have written a package for it, so it is usually worth searching for the specific analysis or visualization you need.

- [RStudio](https://www.rstudio.com/) is a user-friendly interface with great capabilities for plotting and debugging. You can also run R in Jupyter, like we are doing here!

- RStudio also provides several [cheatsheets](https://www.rstudio.com/resources/cheatsheets) for some of the most commonly used functionalities.

- Additionally, when you know the name of a function, you can open its documentation directly by entering `?functionOfInterest`.

- The power of R lies both in its statistical capabilities and in its many plotting options for visualizing data. Higher-level visualizations are usually made with [ggplot2](https://ggplot2.tidyverse.org/), which offers an enormous range of possibilities. You can find some examples in the [R Graph Gallery](http://www.r-graph-gallery.com/).

</div>

In this notebook, we will be covering the basics of R. The goal is for you to be able to interpret R code that will be used in future modules in this class. Here is a cheat sheet for R basics that may be a helpful resource for you along the way:
- https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf

### Getting started: R can be used to perform simple operations


In [None]:
# addition
4 + 3

In [None]:
# subtraction
4 - 1

In [None]:
# multiplication
5 * 3

In [None]:
# division
5/2

In [None]:
# raise a number to a power
2^3

In [None]:
# take a square root
sqrt(8)
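
Two more arithmetic operators are worth knowing, because the remainder operator shows up again in the loops section of this notebook. The short sketch below uses arbitrary numbers and made-up variable names to show integer division with `%/%` and the remainder with `%%`:

```r
# integer division: how many whole times does 2 fit into 7?
quotient <- 7 %/% 2
quotient

# remainder (modulo): what is left over after that division?
remainder <- 7 %% 2
remainder

# a remainder of 0 after dividing by 2 means the number is even
is_even <- 8 %% 2 == 0
is_even
```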

### R data types 

In [None]:
# character: "a", "swc"
c <- 'c'
typeof(c)

In [None]:
# numeric: 2, 15.5 (typeof() reports these as "double")
n <- 2.0
typeof(n)

In [None]:
# integer: 2L (the L tells R to store this as an integer)
i <- 2L
typeof(i)

In [None]:
# logical: TRUE, FALSE
l <- TRUE
typeof(l)
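
It can also help to know how to test for a type and convert between types. The following is a minimal sketch with illustrative variable names (they are not used elsewhere in this notebook). It relies on the base R `class()`, `is.*()` and `as.*()` functions:

```r
# class() gives a higher-level label than typeof()
num_value <- 2.5
typeof(num_value)
class(num_value)

# is.*() functions test for a type and return TRUE or FALSE
is.numeric(num_value)
is.character(num_value)

# as.*() functions convert a value from one type to another
int_from_text <- as.integer("7")
typeof(int_from_text)

text_from_num <- as.character(15.5)
text_from_num
```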

### Basics of data structures in R

#### Atomic Vectors

A vector is the simplest data structure in R and is the building block for more complex data structures. An atomic vector can only contain one type of data, such as logical, integer, double, or character. Atomic vectors are one-dimensional and can be initialized with ```c()```. Let's start by making a vector and storing it as variable x:

In [None]:
x <- c(3,1,2,4)
x

As we mentioned earlier, atomic vectors can only store one type of data. In this case, we are storing numeric data, which R reports as type `double`.

In [None]:
typeof(x)
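
To see the one-type rule in action, here is a small sketch (the variable names are made up for illustration) of what happens when different types are mixed inside `c()`. R converts every element to the most flexible type present:

```r
# mixing a number and a logical: TRUE is converted to the number 1
mixed_num_lgl <- c(1.5, TRUE)
mixed_num_lgl
typeof(mixed_num_lgl)

# adding a character value converts every element to character
mixed_with_text <- c(1.5, TRUE, "a")
mixed_with_text
typeof(mixed_with_text)
```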

In [None]:
#length of vector
length(x)

In [None]:
#returning elements at positions 3 and 1
x[c(3, 1)]

In [None]:
#omitting values
x[-c(3, 1)]

In [None]:
#returning an ordered vector
x[order(x)]
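
Besides positions, a vector can also be indexed with a logical vector, which becomes important when we subset data frames later on. A minimal sketch, using a throwaway vector called `demo_vec`:

```r
demo_vec <- c(3, 1, 2, 4)

# a comparison returns one TRUE/FALSE per element
is_large <- demo_vec > 2
is_large

# indexing with the logical vector keeps only the TRUE positions
large_values <- demo_vec[is_large]
large_values

# which() returns the positions that are TRUE
large_positions <- which(demo_vec > 2)
large_positions
```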

We can pair vectors with mathematical functions. Here we see that applying a mathematical function to x applies it to all elements of x.

In [None]:
sqrt(x)

In [None]:
x+1

In [None]:
# square every element
x^2

#### Lists
A list in R is still a vector, but unlike an atomic vector, each element of a list can be of a different type.


In [None]:
# initialize with list()
my.list <- list(c(1:3), 'a', c(TRUE, FALSE))

In [None]:
my.list

In [None]:
# can name each element
my.named.list <- list(one = c(1:3), two = 'a', three = c(TRUE, FALSE))

In [None]:
my.named.list

In [None]:
# access specific list elements
my.named.list$one

In [None]:
my.named.list[['two']]

In [None]:
my.named.list[[3]]

In [None]:
#Lists are made up of atomic vectors or other lists
long.list <- list(first = 1, second = list(1,2,3))

In [None]:
long.list
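
One detail that often trips people up is the difference between single and double brackets on a list. The sketch below uses a hypothetical `demo_list` to show that `[ ]` returns a smaller list, while `[[ ]]` returns the element itself:

```r
demo_list <- list(one = 1:3, two = "a")

# single brackets keep the list wrapper
sub_list <- demo_list["one"]
typeof(sub_list)

# double brackets extract the element stored inside
element <- demo_list[["one"]]
typeof(element)

# names() and length() describe the list as a whole
names(demo_list)
length(demo_list)
```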

#### Matrix

Matrices are two-dimensional data structures that can only hold one type of data (usually numeric). A matrix is a special case of an array: a matrix has exactly two dimensions, while arrays can have more.


Let's first make a matrix using the following syntax:

```matrix(data, nrow, ncol, byrow, dimnames)```

In [None]:
# Here, we make a matrix, M, with 4 rows and 3 columns with numbers going sequentially from 3 to 14
M <- matrix(c(3:14), nrow = 4, ncol =3, byrow = TRUE)
print(M)

In [None]:
#check the dimensions of the matrix
dim(M)

In [None]:
# adding row and column names to our matrix
rownames <- c("a", "b", "c", "d")
colnames <- c("e", "f", "g")
P <- matrix(c(3:14), nrow = 4, ncol=3, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)

In [None]:
#all elements of a matrix have the same type
typeof(M)

What if we want to access a certain element of the matrix?

In [None]:
#accessing the element in the first row, third column using column and row numbers
P[1,3]

In [None]:
#accessing the element in the first row, third column using column and row names
P['a','g']

In [None]:
#accessing all elements in a row of the matrix
P[3,]

In [None]:
#accessing all elements in a column
P[,1]

Let's create another matrix to be able to try out some matrix mathematics

In [None]:
J <- matrix(c(3:14), nrow = 4, ncol=3, byrow = FALSE, dimnames = list(rownames, colnames))
print(J)

In [None]:
# Add the matrices.
addition_result <- P + J
print("Result of addition:")
print(addition_result)

In [None]:
# Subtract the matrices
subtraction_result <- P - J
print("Result of subtraction:")
print(subtraction_result)

In [None]:
# Multiply the matrices.
mult_result <- P * J
print("Result of multiplication:")
print(mult_result)

#note: `*` multiplies element by element; this is not matrix multiplication, which uses the %*% operator
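
For comparison, here is a minimal sketch of true matrix multiplication with `%*%`. The matrices `left` and `right` are made up for this example. Note that the number of columns in the first matrix must equal the number of rows in the second:

```r
left <- matrix(1:6, nrow = 2)    # 2 rows, 3 columns
right <- matrix(1:6, nrow = 3)   # 3 rows, 2 columns

# matrix product: the result has 2 rows and 2 columns
product <- left %*% right
dim(product)
print(product)
```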

In [None]:
# Divide the matrices
div_result <- P / J
print("Result of division:")
print(div_result)

#### Dataframe

Earlier, we said that vectors can be used to build more complex data structures. Let's see how by combining multiple vectors into a data frame, one of the most commonly used data structures in R.

In [None]:
a <- c(1,0,1,1)
b <- c('w','r','h','i')
c <- c(5,6,3,1)

test_frame <-data.frame(a,b,c)
test_frame

Under the hood, data frames are lists and can be accessed as such:

In [None]:
test_frame$a

But they also have matrix-like properties: they possess rows and columns.

In [None]:
dim(test_frame)

In [None]:
test_frame[,1]

In [None]:
test_frame[1,]
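
The two views of a data frame (list of columns, and table of rows and columns) can be checked directly. This is a small sketch on a made-up `demo_frame`. `stringsAsFactors = FALSE` is set explicitly so the text column stays character in older versions of R as well:

```r
demo_frame <- data.frame(id = c(1, 2, 3),
                         label = c("x", "y", "z"),
                         stringsAsFactors = FALSE)

# list-like behavior: each column is one list element
is.list(demo_frame)
length(demo_frame)
names(demo_frame)
demo_frame[["label"]]

# matrix-like behavior: rows and columns
nrow(demo_frame)
ncol(demo_frame)
demo_frame[2, "label"]
```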

### Basic dataframe subsetting

Often you'll have a lot of data stored in dataframes and won't be able to look at it all at once. To get an idea of the first and last rows of the data frame use the commands `head` and `tail`, respectively.

In [None]:
head(test_frame)

In [None]:
tail(test_frame)
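
Our `test_frame` only has four rows, so `head` and `tail` show all of it. On a longer data frame you can control how many rows are shown with the `n` argument. A quick sketch on a hypothetical ten-row data frame:

```r
long_frame <- data.frame(id = 1:10, squared = (1:10)^2)

# first three rows
first_rows <- head(long_frame, n = 3)
first_rows

# last two rows
last_rows <- tail(long_frame, n = 2)
last_rows
```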

We can also subset the dataframe using ```subset()```. Not familiar with how to use the subset function? Let's figure out how to use this function using ```?```

In [None]:
?subset

In [None]:
subset(test_frame, b=='w')

Finally, you can also subset a dataframe using logical statements and `[]`.

In [None]:
#first let's see what happens when we apply a logical statement to a data frame column
test_frame$b == 'w'
test_frame[,2] == 'w'

In [None]:
#now let's feed the logical statement into []
logical <- test_frame$b == 'w'
test_frame[logical,]

In [None]:
#you can also combine these into one line
test_frame[test_frame$b == 'w',]
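
Logical statements can be combined with `&` (and) and `|` (or) to subset on more than one condition. The sketch below uses a made-up `pets` data frame, so the names and numbers are purely illustrative:

```r
pets <- data.frame(name = c("Ada", "Bo", "Cy", "Di"),
                   age = c(2, 7, 4, 9),
                   weight = c(3.5, 12.0, 8.2, 4.1),
                   stringsAsFactors = FALSE)

# & keeps rows where both conditions are TRUE
older_and_light <- pets[pets$age > 3 & pets$weight < 10, ]
older_and_light

# | keeps rows where at least one condition is TRUE
young_or_heavy <- pets[pets$age < 3 | pets$weight > 10, ]
young_or_heavy

# subset() accepts the same combined condition
subset(pets, age > 3 & weight < 10)
```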

<div class="alert alert-block alert-success">
Your turn! Use your preferred method to subset test_frame to only contain rows where the sum of the two numeric columns (a and c) is greater than 4.
</div>

### Functions

Just like in Python, we can create functions in R. However, the syntax is slightly different.

In general, functions bundle several operations so that they can be repeated with a single command instead of being re-typed each time. If you find yourself doing something over and over again, it should probably go into a function.

Functions help you clean up your code and keep it organized. Better-organized code means fewer mistakes, and the mistakes that remain are easier to catch.

The best thing about functions is that once you create and adequately test your function, you can re-use it from analysis to analysis!

<b>Basic structure of a function in R:</b>

```
#function documentation
my_function <- function(my_input_argument) {
    #some operation(s)
    return(output)
}
```

In [None]:
fahrenheit_to_celsius <- function(temp_F){
    temp_C <- (temp_F - 32) * 5 / 9
    return(temp_C)
}

In [None]:
fahrenheit_to_celsius(32)
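
Functions can take more than one argument, and arguments can have default values. As an illustration, here is a hypothetical companion function that converts in the other direction and rounds the result. The name `celsius_to_fahrenheit` and the `digits` argument are our own choices for this sketch:

```r
# convert Celsius to Fahrenheit, rounded to `digits` decimal places
celsius_to_fahrenheit <- function(temp_C, digits = 1) {
    temp_F <- temp_C * 9 / 5 + 32
    return(round(temp_F, digits))
}

# uses the default of one decimal place
celsius_to_fahrenheit(37)

# overrides the default by naming the argument
celsius_to_fahrenheit(37, digits = 0)
```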

### Conditional Statements

A simple if, else statement in R

<b>Structure of an if, else statement:</b>

```
if( ){
    statement 1
} else {
    statement 2
}
```

In [None]:
y <- 4

if(y >= 0){
    print("Non-negative number")
} else {
    print("Negative number")
}

Adding multiple conditions

<b>Structure of an if, else statement with multiple conditions:</b>

```
if ( ) {
    statement 1
} else if ( ) {
    statement 2
} else {
    statement 3
}
```

In [None]:
if ( y > 5) {
    print("y is larger than 5")
} else if ( y == 5) {
    print("y is 5")
} else {
    print("y is less than 5")
}
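
An `if` statement expects a single TRUE or FALSE. When you want to make the same decision for every element of a vector, base R provides the vectorized `ifelse()` function. A minimal sketch with made-up values:

```r
values <- c(-2, 0, 7)

# ifelse(test, value if TRUE, value if FALSE) works element by element
size_labels <- ifelse(values > 5, "larger than 5", "5 or less")
size_labels
```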

<div class="alert alert-block alert-success">
Your turn! Write a function which takes a numeric input and returns the log10. If the input number is negative, instead print out "Input is negative number".
</div>

### For and While Loops

<b>For loop structure:</b>
    
```
for () {
    statement
}
```

In [None]:
x <- c(2,5,3,9,8,11,6)
count <- 0
for (val in x) {
    # count the even values: val %% 2 is the remainder after dividing by 2
    if (val %% 2 == 0) count <- count + 1
}
print(count)
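
The loop above walks over the values themselves. If you also need the position of each value, for example to store results in a second vector, one common pattern is to loop over `seq_along()`. The names `scores`, `doubled` and `idx` below are made up for this sketch:

```r
scores <- c(2, 5, 3)

# create an empty numeric vector of the right length to hold the results
doubled <- numeric(length(scores))

# seq_along(scores) produces the positions 1, 2, 3
for (idx in seq_along(scores)) {
    doubled[idx] <- scores[idx] * 2
}
print(doubled)
```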

<b>While loop structure:</b>

```
while () {
    statement
}
```

In [None]:
i <- 1
while (i < 6) {
    print(i)
    i <- i+1
}

### Alternatives to Loops
Although loops are very useful and easy to understand, sometimes you want to make your code more streamlined. There are multiple ways in R to avoid loops and use fewer lines of code.

Often loops aren't necessary. Many R functions are vectorized, meaning they automatically apply to every element in a vector!

In [None]:
my.vector <- seq(1,20,4)
my.vector

In [None]:
#instead of using a for loop
for (val in my.vector){
    print(log10(val))
}

In [None]:
#just use the log10 function directly on the vector
log10(my.vector)
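
The same idea works for the even-number count from the for loop section. Here is a sketch of a loop-free version. It uses a fresh vector `loop_vals` holding the same numbers, and relies on `sum()` counting each TRUE as 1:

```r
loop_vals <- c(2, 5, 3, 9, 8, 11, 6)

# one TRUE/FALSE per element: is the value even?
even_flags <- loop_vals %% 2 == 0
even_flags

# summing a logical vector counts the TRUE values
even_count <- sum(even_flags)
even_count
```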

#### lapply and sapply
`lapply` applies a function to every element of a list and returns the results as a list.

In [None]:
my.list <- list(a = c(1,2,3), b = c(7,8,9), c = c(4,5,6))
my.list

In [None]:
#max() cannot be applied directly to a list, so this cell raises an error
max(my.list)

In [None]:
#instead let's use lapply, but first let's figure out how to use it
?lapply

In [None]:
max.values <- lapply(my.list, max)
max.values
typeof(max.values)

In [None]:
#this output isn't in an easy to use format, let's change that with unlist
unlist(max.values)

In [None]:
#and what about sapply, how does that compare?
sapply(my.list, max)
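
You are not limited to built-in functions: `lapply` and `sapply` also accept a function that you write yourself, including one defined on the spot. A small sketch on a made-up list called `demo_groups`:

```r
demo_groups <- list(a = c(1, 2, 3), b = c(7, 8, 9), c = c(4, 5, 6))

# a built-in function: the mean of each element
group_means <- sapply(demo_groups, mean)
group_means

# an anonymous function: how many values in each element are above 2?
counts_above_two <- sapply(demo_groups, function(v) sum(v > 2))
counts_above_two

# lapply with the same function returns a list instead of a vector
counts_as_list <- lapply(demo_groups, function(v) sum(v > 2))
typeof(counts_as_list)
```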

#### apply
`apply` applies a function to every row or every column of a matrix or dataframe.

In [None]:
my.data <- data.frame(a = c(1,2,3), b = c(7,8,9), c = c(4,5,6))
my.data

In [None]:
#again, let's figure out how to use apply
?apply

In [None]:
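# the second argument is the margin: 1 applies the function to each row
apply(my.data, 1, max)
# 2 applies the function to each column
apply(my.data, 2, max)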
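
For a few very common row and column operations, base R also ships dedicated functions such as `rowSums()` and `colSums()`. The sketch below uses a throwaway matrix to show that they give the same totals as `apply` with `sum`:

```r
demo_matrix <- matrix(1:6, nrow = 2)
print(demo_matrix)

# totals via apply: margin 1 for rows, margin 2 for columns
row_totals <- apply(demo_matrix, 1, sum)
col_totals <- apply(demo_matrix, 2, sum)
row_totals
col_totals

# the dedicated helpers return the same numbers
rowSums(demo_matrix)
colSums(demo_matrix)
```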

<div class="alert alert-block alert-success">
Your turn! Use your preferred apply commands to get the minimum values in my.list and my.data.
</div>

### R practice

R has a number of built-in datasets. Just for practice, let's use one of R's existing datasets called mtcars. Remember, if you're not sure about how to get started, you can always google!

In [None]:
data(mtcars)

In [None]:
dim(mtcars)
head(mtcars)
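
Before starting the exercises, it can help to get an overview of the dataset. The following sketch uses only base R functions. The variable names `n_cars` and `n_vars` are our own:

```r
data(mtcars)

# column names and the type of each column
names(mtcars)
str(mtcars)

# a numeric summary of one column
summary(mtcars$mpg)

# number of rows (cars) and columns (variables)
n_cars <- nrow(mtcars)
n_vars <- ncol(mtcars)
n_cars
n_vars
```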

***Based on what we've covered so far, take a minute to try out some of these exercises:***

1.  Subset the vector `mtcars[,1]` for values greater than `15.0`.

<div class="alert alert-block alert-success">

You should be getting a vector with the following values: 

21 21 22.8 21.4 18.7 18.1 24.4 22.8 19.2 17.8 16.4 17.3 15.2 32.4 30.4 33.9 21.5 15.5 15.2 19.2 27.3 26 30.4 15.8 19.7 21.4

</div>

2. Subset the dataframe `mtcars` for rows with `mpg` greater than or equal to 21 miles per gallon.

<div class="alert alert-block alert-success">
You should get a dataframe with only the following rows: 'Mazda RX4' 'Mazda RX4 Wag' 'Datsun 710' 'Hornet 4 Drive' 'Merc 240D' 'Merc 230' 'Fiat 128' 'Honda Civic' 'Toyota Corolla' 'Toyota Corona' 'Fiat X1-9' 'Porsche 914-2' 'Lotus Europa' 'Volvo 142E'

</div>

3. Subset `mtcars` for rows with `cyl` less than `6` and `gear` exactly equal to `4`.

<div class="alert alert-block alert-success">

You should get a dataframe with only the following rows: 'Datsun 710' 'Merc 240D' 'Merc 230' 'Fiat 128' 'Honda Civic' 'Toyota Corolla' 'Fiat X1-9' 'Volvo 142E'

</div>

4. Subset `mtcars` for rows with `mpg` greater than or equal to 21 miles per gallon. Also, select only the columns `mpg` through `hp`.

<div class="alert alert-block alert-success">

You should have a 14 x 4 dataframe with only the listed columns and rows: 'Mazda RX4' 'Mazda RX4 Wag' 'Datsun 710' 'Hornet 4 Drive' 'Merc 240D' 'Merc 230' 'Fiat 128' 'Honda Civic' 'Toyota Corolla' 'Toyota Corona' 'Fiat X1-9' 'Porsche 914-2' 'Lotus Europa' 'Volvo 142E'

</div>

5. Add a new column to mtcars called `efficiency` that is equal to `qsec`/`wt`.

<div class="alert alert-block alert-success">

You should have a 32 x 12 dataframe with the following values for efficiency: 6.2824427480916, 5.92, 8.02155172413793, 6.04665629860031, ..., 4.57413249211356, 5.5956678700361, 4.08963585434174, 6.69064748201439

</div>

6. You have decided you want to purchase a car from the mtcars dataset! Come up with your own criteria for your ideal car, making sure to include data from 4 different columns and to set quantitative thresholds and logical criteria as appropriate. Then write your own function to determine if a car (each row) passes your criteria. Because `lapply` on a dataframe iterates over columns, apply your function to each row either with `apply` (using margin 1) or with `lapply` over the row indices, and create a new column that specifies whether each car passes your criteria.

Read up on all the columns in the dataset here: https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars