# Introduction to R

<div class="alert alert-block alert-info">
<b>Overview of R:</b> 

- R is a widely used programming language, so many resources are available. If you need to do some data wrangling, chances are that someone has already had to do the same thing and may even have written a package for it, so it is usually worth searching for the specific analysis or visualization you need.

- [RStudio](https://www.rstudio.com/) is a user-friendly interface with great capabilities for plotting and debugging.

- RStudio also provides several [cheatsheets](https://www.rstudio.com/resources/cheatsheets) for some of the most commonly used functionalities.

- Additionally, when you know the name of a function, you can open its documentation directly by entering ```?functionOfInterest```.

- The power of R lies both in its statistical capabilities and in its many plotting options for visualizing data. Higher-level visualizations are usually made with [ggplot2](http://ggplot2.org/), which offers an enormous range of possibilities. You can find some examples in the [R Graph Gallery](http://www.r-graph-gallery.com/).

</div>

In this notebook, we will be covering the basics of R. The goal is for you to be able to interpret R code that will be used in future modules in this class. Here is a cheat sheet for R basics that may be a helpful resource for you along the way:
- https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf

### Getting started: R can be used to perform simple operations


In [None]:
# addition
4 + 3

In [None]:
# subtraction
4 - 1

In [None]:
# multiplication
5 * 3

In [None]:
# division
5/2

In [None]:
# raise a number to a power
2^3

In [None]:
# take a square root
sqrt(8)
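
Two more operators often appear alongside the ones above: integer division and the remainder (modulo), which the loop example later in this notebook relies on. A minimal sketch with arbitrarily chosen numbers:

```r
# integer division: how many whole times does 2 fit into 7?
7 %/% 2        # 3

# remainder (modulo): what is left over after that division?
7 %% 2         # 1

# roots other than the square root can be written as fractional powers
8^(1/3)        # cube root of 8, approximately 2

# parentheses control the order of operations
(4 + 3) * 2    # 14
4 + 3 * 2      # 10
```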

### R data types 

In [None]:
# character: "a", "swc"
c <- 'c'
typeof(c)

In [None]:
# numeric: 2, 15.5 (typeof() reports these as "double")
n <- 2.0
typeof(n)

In [None]:
# integer: 2L (the L tells R to store this as an integer)
i <- 2L
typeof(i)

In [None]:
# logical: TRUE, FALSE
l <- TRUE
typeof(l)
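
The types above can be tested for and converted between. The sketch below uses base R's `is.*()` and `as.*()` families; the values are arbitrary and chosen only for illustration.

```r
n <- 2.0
i <- 2L

# is.*() functions test for a type and return TRUE or FALSE
is.numeric(n)       # TRUE
is.integer(n)       # FALSE: 2.0 is stored as a double
is.integer(i)       # TRUE
is.character("a")   # TRUE

# as.*() functions convert between types
as.integer("7")     # the integer 7
as.character(15.5)  # the string "15.5"
as.numeric(TRUE)    # 1 (FALSE would become 0)

# a conversion that cannot work returns NA and raises a warning
as.numeric("swc")
```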

### Basics of data structures in R

#### Atomic Vectors

A vector is a simple data structure in R. It can be used to make more complex data structures in R. Atomic vectors can only contain one type of data: logical, integer, double, character. They are one-dimensional and can be initialized with ```c()```. Let's start by making a vector and storing it as variable x:

In [None]:
x <- c(3,1,2,4)
x

As we mentioned earlier, they can only store one type of data. In this case, we are storing numerical data.

In [None]:
typeof(x)
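
Because an atomic vector holds a single type, R silently converts (coerces) the elements when types are mixed inside ```c()```. The following sketch, using made-up values, shows which type wins in a few common cases:

```r
# mixing numbers, text and logicals: everything becomes character
mixed <- c(1, "a", TRUE)
mixed                  # "1" "a" "TRUE"
typeof(mixed)          # "character"

# mixing integer and double: everything becomes double
typeof(c(1L, 2.5))     # "double"

# mixing logical and integer: everything becomes integer
typeof(c(TRUE, 2L))    # "integer"
```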

In [None]:
#length of vector
length(x)

In [None]:
#returning the elements at positions 3 and 1
x[c(3, 1)]

In [None]:
#returning an ordered vector
x[order(x)]

In [None]:
#omitting values
x[-c(3, 1)]

We can pair vectors with mathematical functions. Here we see that applying a mathematical function to x applies it to all elements of x.

In [None]:
sqrt(x)

In [None]:
x+1

In [None]:
# ** is accepted as an alternative spelling of ^, so this squares each element
x**2
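
Comparisons are applied element by element in the same way, and the resulting logical vector can be used to pick out elements. A short sketch, re-creating the same ```x``` so that it runs on its own:

```r
x <- c(3, 1, 2, 4)

# the comparison returns one TRUE/FALSE per element
x > 2            # TRUE FALSE FALSE TRUE

# use the logical vector to keep only the matching elements
x[x > 2]         # 3 4

# positions of the matching elements
which(x > 2)     # 1 4

# TRUE counts as 1, so sum() counts the matches
sum(x > 2)       # 2
```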

#### Lists
A list in R is still a vector, but (unlike an atomic vector) it can store elements of different types.


In [None]:
# initialize with list()
my.list <- list(c(1:3), 'a', c(TRUE, FALSE))

In [None]:
my.list

In [None]:
# can name each element
my.named.list <- list(one = c(1:3), two = 'a', three = c(TRUE, FALSE))

In [None]:
my.named.list

In [None]:
# access specific list elements
my.named.list$one

In [None]:
my.named.list[['two']]

In [None]:
my.named.list[[3]]

In [None]:
# list elements can be atomic vectors or other lists
long.list <- list(first = 1, second = list(1,2,3))

In [None]:
long.list
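
Single brackets and double brackets behave differently on lists, which is a common source of confusion. The sketch below rebuilds the two lists from above; the variable names ```sublist``` and ```element``` are introduced here only for illustration.

```r
my.named.list <- list(one = c(1:3), two = 'a', three = c(TRUE, FALSE))
long.list <- list(first = 1, second = list(1, 2, 3))

# single brackets return a (shorter) list
sublist <- my.named.list['one']
typeof(sublist)          # "list"

# double brackets (or $) return the element itself
element <- my.named.list[['one']]
typeof(element)          # "integer"

# nested lists are accessed one level at a time
long.list$second[[2]]    # 2

# names() lists the element names, str() gives a compact overview
names(my.named.list)     # "one" "two" "three"
str(long.list)
```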

#### Matrix

Matrices are two-dimensional data structures that can only hold one type of data (usually numeric). A matrix is a special case of an array: a matrix always has exactly two dimensions, while an array can have more.
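
To make the matrix/array distinction concrete, here is a small sketch that builds a three-dimensional array with base R's ```array()```. The names ```arr``` and ```m``` and the dimensions are arbitrary choices for this illustration.

```r
# a 2 x 3 x 4 array filled with the numbers 1 to 24
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)           # 2 3 4

# three indices are needed to reach a single element
arr[1, 2, 3]       # 15

# fixing the third index leaves a 2 x 3 matrix
arr[, , 1]

# a matrix counts as an array, but a 3-d array is not a matrix
m <- matrix(1:6, nrow = 2)
is.array(m)        # TRUE
is.matrix(arr)     # FALSE
```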


Let's first make a matrix using the following syntax:

```matrix(data, nrow, ncol, byrow, dimnames)```

In [None]:
# Here, we make a matrix, M, with 4 rows and 3 columns with numbers going sequentially from 3 to 14
M <- matrix(c(3:14), nrow = 4, ncol =3, byrow = TRUE)
print(M)


In [None]:
#check the dimensions of the matrix
dim(M)

In [None]:
# adding row and column names to our matrix
rownames = c("a", "b", "c", "d")
colnames = c("e", "f", "g")
P <- matrix(c(3:14), nrow = 4, ncol=3, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)

In [None]:
#all elements of matrices should be of the same type
typeof(M)

What if we want to access a certain element of the matrix?

In [None]:
#accessing the element in the first row, third column using column and row numbers
P[1,3]

In [None]:
#accessing the element in the first row, third column using column and row names
P['a','g']

In [None]:
#accessing all elements in a row of the matrix
P[3,]

In [None]:
#accessing all elements in a column
P[,1]

Let's create another matrix to be able to try out some matrix mathematics.

In [None]:
J <- matrix(c(3:14), nrow = 4, ncol=3, byrow = FALSE, dimnames = list(rownames, colnames))
print(J)

In [None]:
# Add the matrices.
addition_result <- P + J
print("Result of addition:")
print(addition_result)

In [None]:
# Subtract the matrices
subtraction_result <- P - J
print("Result of subtraction:")
print(subtraction_result)

In [None]:
# Multiply the matrices element by element (the matrix product would use %*% instead)
mult_result <- P * J
print("Result of multiplication:")
print(mult_result)

In [None]:
# Divide the matrices element by element
div_result <- P / J
print("Result of division:")
print(div_result)
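
All four operators above work element by element. The matrix product from linear algebra uses a separate operator, ```%*%```. A minimal sketch on two small made-up matrices (```A``` and ```B``` are not part of the notebook's data):

```r
A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
B <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)

# element-wise product: each entry of A times the matching entry of B
elementwise <- A * B
print(elementwise)

# matrix product: rows of A combined with columns of B
matrix_product <- A %*% B
print(matrix_product)

# t() transposes a matrix (rows become columns)
print(t(A))
```

For ```%*%``` to work, the number of columns of the first matrix has to match the number of rows of the second, which is why ```P %*% J``` would fail for the two 4 x 3 matrices above while ```P %*% t(J)``` would not.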

#### Dataframe

Earlier, we said that vectors can be used to build more complex data structures. Let's see how by combining multiple vectors into a data frame, one of the most commonly used data structures in R.

In [None]:
a <-c(1,0,1,1)
b <- c('w','r','h','i')
c <- c(5,6,3,1)

test_frame <-data.frame(a,b,c)
test_frame

Under the hood, data frames are lists and can be accessed as such.

In [None]:
test_frame$a

But they also have matrix-like properties: they possess rows and columns

In [None]:
dim(test_frame)

In [None]:
test_frame[,1]

In [None]:
test_frame[1,]

We can also subset the dataframe using ```subset()```. Not familiar with how to use the subset function? Let's figure out how to use this function using ```?```

In [None]:
?subset

In [None]:
subset(test_frame, b=='w')
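
Rows can also be selected with a logical condition inside square brackets, and ```subset()``` accepts a ```select``` argument for choosing columns. The sketch below rebuilds the same data frame so that it runs on its own; the new column ```d``` and the name ```picked``` are made up for this illustration.

```r
test_frame <- data.frame(a = c(1, 0, 1, 1),
                         b = c('w', 'r', 'h', 'i'),
                         c = c(5, 6, 3, 1))

# compact overview of column names and types
str(test_frame)

# add a new column computed from existing ones
test_frame$d <- test_frame$a + test_frame$c

# keep rows where column c is larger than 2 (note the trailing comma)
test_frame[test_frame$c > 2, ]

# combine conditions with & and keep only columns b through c
picked <- subset(test_frame, a == 1 & c > 2, select = b:c)
picked
```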

### Functions

Just like in Python, we can create functions in R. However, the syntax is slightly different.

In general, remember that functions are used to link several operations which can be repeated by typing just one command instead of having to re-type the whole thing. If you find yourself doing something over and over again, it should probably go into a function.

Functions help you clean up your code and keep things organized. Better-organized code means fewer mistakes, and the mistakes that remain are easier to catch.

The best thing about functions is that once you create and adequately test your function, you can re-use it from analysis to analysis!

<div class="alert alert-block alert-warning">

<b>Basic structure of a function in R:</b>

```#function documentation```

```my_function <- function(my_input_argument) {```
  
  ```#some operation (s)```
  
  ```return(output)```
```}```
</div>

In [None]:
fahrenheit_to_celsius <- function(temp_F){
    temp_C <- (temp_F - 32) * 5 / 9
    return(temp_C)
 }

fahrenheit_to_celsius(32)
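
A few more properties of R functions are worth seeing in action: they work on whole vectors when their arithmetic does, arguments can have default values, and functions can call each other. In the sketch below, ```celsius_to_kelvin``` and ```fahrenheit_to_kelvin``` are hypothetical helpers added for illustration.

```r
fahrenheit_to_celsius <- function(temp_F) {
    temp_C <- (temp_F - 32) * 5 / 9
    return(temp_C)
}

# the function works on a whole vector because its arithmetic does
fahrenheit_to_celsius(c(32, 212))    # 0 100

# an argument with a default value can be left out when calling
celsius_to_kelvin <- function(temp_C, offset = 273.15) {
    temp_C + offset    # the last evaluated expression is returned
}
celsius_to_kelvin(0)                 # 273.15

# functions can be combined inside other functions
fahrenheit_to_kelvin <- function(temp_F) {
    celsius_to_kelvin(fahrenheit_to_celsius(temp_F))
}
fahrenheit_to_kelvin(32)             # 273.15
```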


### Conditional Statements

A simple if, else statement in R

<div class="alert alert-block alert-warning">
<b>Structure of an if, else statement:</b>

```
if ( ) {
    statement 1
} else {
    statement 2
}
```

</div>

In [None]:
y <- 4

# y >= 0 also covers zero, which is what "non-negative" means
if (y >= 0) {
    print("Non-negative number")
} else {
    print("Negative number")
}

Adding multiple conditions

<div class="alert alert-block alert-warning">
<b>Structure of an if, else statement with multiple conditions:</b>

```
if ( ) {
    statement 1
} else if ( ) {
    statement 2
} else {
    statement 3
}
```

</div>

In [None]:
if ( y > 5) {
print("y is larger than 5")
} else if ( y == 5) {
print("y is 5")
} else {
print("y is less than 5")
}
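
An ```if``` statement expects a single TRUE or FALSE. To apply a condition to every element of a vector, base R provides ```ifelse()```, and several conditions can be combined inside ```if()``` with ```&&``` (and) and ```||``` (or). A small sketch with made-up values:

```r
values <- c(-2, 0, 7)

# ifelse() evaluates the condition for each element separately
labels <- ifelse(values >= 0, "Non-negative number", "Negative number")
labels

# && requires both conditions to hold
y <- 4
if (y > 0 && y %% 2 == 0) {
    print("y is positive and even")
}
```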

### For and While Loops

<div class="alert alert-block alert-warning">
<b>For loop structure:</b>
    
```
for ( ) {
    if ( ) statement
}
```

</div>

In [None]:
x <- c(2,5,3,9,8,11,6)
count <- 0
# count how many values in x are even (remainder 0 when divided by 2)
for (val in x) {
    if (val %% 2 == 0) count <- count + 1
}
print(count)
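
Because comparisons in R work on whole vectors, the same count can be obtained without a loop. The sketch below also shows ```seq_along()```, which loops over positions instead of values; the name ```idx``` is arbitrary.

```r
x <- c(2, 5, 3, 9, 8, 11, 6)

# vectorized alternative: TRUE counts as 1 when summed
even_count <- sum(x %% 2 == 0)
even_count    # 3

# looping over positions gives access to both the index and the value
for (idx in seq_along(x)) {
    if (x[idx] %% 2 == 0) {
        print(paste("position", idx, "holds the even value", x[idx]))
    }
}
```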

<div class="alert alert-block alert-warning">
<b>While loop structure:</b>

```
while ( ) {
    statement
}
```

</div>

In [None]:
i <- 1
while (i < 6) {
print(i)
i = i+1
}
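
Loops can be cut short with ```break``` (leave the loop entirely) and ```next``` (skip to the next iteration). A minimal sketch that collects the even numbers up to 6; the variable name ```evens``` is made up for this example.

```r
evens <- c()
i <- 0
while (TRUE) {
    i <- i + 1
    if (i > 6) break          # stop the loop once i passes 6
    if (i %% 2 != 0) next     # skip odd numbers
    evens <- c(evens, i)      # append the even number
}
print(evens)                  # 2 4 6
```

Without the ```break```, ```while (TRUE)``` would never stop, so a loop written this way always needs an exit condition inside its body.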

### R practice

R has a number of built-in datasets. Just for practice, let's use one of R's existing datasets called mtcars. Remember, if you're not sure about how to get started, you can always google!

In [None]:
data(mtcars)

In [None]:
head(mtcars)
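
Before subsetting an unfamiliar data frame, it usually helps to check its size, column names and column types. A short sketch using base R functions on mtcars:

```r
data(mtcars)

# number of rows and columns
dim(mtcars)                    # 32 11

# column names and a compact overview of the column types
colnames(mtcars)
str(mtcars)

# summary statistics for one column
summary(mtcars$mpg)

# mtcars uses the car models as row names, which can be used for indexing
rownames(mtcars)[1:3]
mtcars["Mazda RX4", "mpg"]     # 21
```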

***Based on what we've covered so far, take a minute to try out some of these exercises:***

1.  Subset the vector “mtcars[,1]” for values greater than “15.0”.

<div class="alert alert-block alert-success">

You should be getting a vector with the following values: 

21 21 22.8 21.4 18.7 18.1 24.4 22.8 19.2 17.8 16.4 17.3 15.2 32.4 30.4 33.9 21.5 15.5 15.2 19.2 27.3 26 30.4 15.8 19.7 21.4

</div>

2. Subset the dataframe “mtcars” for rows with “mpg” greater than or equal to 21 miles per gallon.

<div class="alert alert-block alert-success">
You should get a dataframe with only the following rows: 'Mazda RX4' 'Mazda RX4 Wag' 'Datsun 710' 'Hornet 4 Drive' 'Merc 240D' 'Merc 230' 'Fiat 128' 'Honda Civic' 'Toyota Corolla' 'Toyota Corona' 'Fiat X1-9' 'Porsche 914-2' 'Lotus Europa' 'Volvo 142E'

</div>

3. Subset “mtcars” for rows with “cyl” less than “6” and “gear” exactly equal to “4”.

<div class="alert alert-block alert-success">

You should get a dataframe with only the following rows: 'Datsun 710' 'Merc 240D' 'Merc 230' 'Fiat 128' 'Honda Civic' 'Toyota Corolla' 'Fiat X1-9' 'Volvo 142E'

</div>

4. Subset “mtcars” for rows with “mpg” greater than or equal to 21 miles per gallon. Also, select only the columns “mpg” through “hp”.

<div class="alert alert-block alert-success">

You should have a 14 x 4 dataframe with only the listed columns and rows: 'Mazda RX4' 'Mazda RX4 Wag' 'Datsun 710' 'Hornet 4 Drive' 'Merc 240D' 'Merc 230' 'Fiat 128' 'Honda Civic' 'Toyota Corolla' 'Toyota Corona' 'Fiat X1-9' 'Porsche 914-2' 'Lotus Europa' 'Volvo 142E'

</div>

### Loading R packages: ggplot2

Every time you load a library, you attach a new environment to R's search path. The library's functions become available because R searches these environments for the function you call. Sometimes loading a new library masks an existing function that has the same name. In that case, you can still reach a specific version with the package specifier, ```package::function```.
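
The sketch below illustrates masking and the package specifier using only packages that ship with R. The replacement ```sd``` function is deliberately silly and purely hypothetical; it stands in for a same-named function from a newly loaded library.

```r
v <- c(1, 2, 3, 4)

# the environments R searches, in order
search()

# sd() is found in the stats package
sd(v)

# defining a function with the same name masks the original
sd <- function(x) "my own sd"
sd(v)            # "my own sd"

# the package specifier still reaches the stats version
stats::sd(v)

# removing our definition makes the original visible again
rm(sd)
sd(v)
```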

Here we will continue to use the mtcars dataset to explore the ggplot2 library, a commonly used library for plotting graphs in R.


In [None]:
library('ggplot2')

ggplot2 works primarily with dataframes. We have to supply ggplot with a dataframe, in this case mtcars. As we go through ggplot, a key thing to notice is how the plot can be continually enhanced by adding layers and themes (generally indicated by a ```+``` sign followed by another statement) to an existing plot.


Here is a ggplot data visualization cheat sheet for your own reference: https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

Let's start by initializing a basic ggplot with the mtcars dataset, using the mpg and cyl columns of the mtcars dataframe. The ```aes()``` function is what we will use to specify which columns go on the x and y axes.

In [None]:
ggplot(mtcars, aes(x=mpg, y=cyl))

Do you see a blank ggplot? If so, don't worry, you did this step correctly. While we have x and y labels that match the columns we initialized the plot with, we don't see any plotted data. This is because ggplot does not make assumptions about the plot you are meaning to draw. Initializing the ggplot only tells ggplot what dataframe and what x and y columns from the dataframe should be used.

Now let's make a scatterplot. We will do this by adding on a layer using ```geom_point()```.

In [None]:
ggplot(mtcars, aes(x=mpg, y=cyl)) + geom_point()

From here, we can add on a smoothing layer using the method "lm" to draw a line of best fit. 

In [None]:
ggplot(mtcars, aes(x=mpg, y=cyl)) + geom_point() + geom_smooth(method='lm')

The ranges of the x and y axes were set automatically by ggplot, but what if we want to change them? Let's add another couple of layers that adjust the range of the x and y axes.

In [None]:
ggplot(mtcars, aes(x=mpg, y=cyl)) + geom_point() + geom_smooth(method='lm')+ xlim(c(0, 25)) + ylim(c(0, 9))
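
One detail worth knowing: ```xlim()``` and ```ylim()``` drop the observations that fall outside the chosen range, so the fitted line is computed without them and ggplot prints a warning about removed rows. The sketch below uses ```coord_cartesian()``` instead, which zooms in without discarding data, and adds a title, axis labels and a theme. The label text and the variable name ```p``` are arbitrary choices for this illustration, and the code assumes ggplot2 is installed.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg, y = cyl)) +
    geom_point() +
    geom_smooth(method = 'lm') +
    # zoom to the same ranges while keeping every point in the fit
    coord_cartesian(xlim = c(0, 25), ylim = c(0, 9)) +
    # title and axis labels
    labs(title = "Cylinders versus fuel efficiency",
         x = "Miles per gallon",
         y = "Number of cylinders") +
    # one of the built-in themes
    theme_minimal()

# a ggplot object is drawn when it is printed
p
```

Storing the plot in a variable makes it easy to keep adding layers, for example ```p + theme_bw()``` to try a different theme.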

### Ggplot Practice

Now that we have learned the basics of ggplot, let's read in the hg19 chromosome size dataset and practice plotting with it.

In [None]:
hg19_chrom <- read.table('data/hg19.chrom.sizes.txt',header = FALSE, sep = "\t")

In [None]:
head(hg19_chrom)
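
When a file is read with ```header = FALSE```, R names the columns V1, V2, and so on. The sketch below shows this on a tiny made-up table passed in through the ```text``` argument of ```read.table()```, and how ```col.names``` can supply more descriptive names. The rows and the names ```chrom``` and ```size``` are invented for illustration and are not taken from the hg19 file.

```r
# three made-up, tab-separated rows standing in for a file
toy_text <- "chrA\t1000\nchrB\t750\nchrC\t500"

# without a header, columns get default names
toy <- read.table(text = toy_text, header = FALSE, sep = "\t")
names(toy)            # "V1" "V2"

# col.names assigns descriptive names while reading
toy_named <- read.table(text = toy_text, header = FALSE, sep = "\t",
                        col.names = c("chrom", "size"))
str(toy_named)

# rows can then be sorted by a column using its name
toy_named[order(toy_named$size), ]
```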

***Now that we have loaded in the data, use ggplot to plot a barplot of the data where the x-axis is the chromosome name and the y-axis is the size of the chromosome. If you have extra time, try adding a title and informative x- and y-axis labels, try out different ggplot themes, etc.***