# Day 1 AM: Introduction to R, RStudio and the `data.frame`

## Using R and RStudio

R is a flexible language specialized for data analysis and visualization. This workshop focuses on **tabular** data that can be loaded into an R `data.frame` for **exploratory** analysis and visualization. Other aspects of R, such as general-purpose programming, modeling for statistical inference, and the use of Bioconductor for specialized assay analysis, are de-emphasized in this workshop.

Most R users work in the RStudio graphical user interface (GUI) environment. We introduce this environment to illustrate:

- The anatomy of RStudio
- The R console
- Writing, executing and "sourcing" R scripts
- Using R markdown and notebooks for literate programming
- Getting help
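As a minimal sketch of the "sourcing" and help items in the list: the script here is written to a temporary file purely for illustration (the file and the variable `x_mean` are made up for this example). In practice you would save the script from the RStudio editor and source that file.

```r
# Write a tiny two-line script to a temporary file (illustrative only).
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10", "x_mean <- mean(x)"), script_path)

# source() runs every line of the script in the current session.
source(script_path)
x_mean

# Common ways to ask for help from the console (uncomment to try them):
# ?mean                        # help page for a function
# help("mean")                 # the same request in function-call form
# example(mean)                # run the examples from the help page
# help.search("regression")    # keyword search over installed help pages
```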

## Overview of the exploratory data analysis pipeline

The exploratory data analysis pipeline typically consists of the following steps:

- Converting messy data into tidy data
- Manipulating tidy data
- Visualizing tidy data
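The three steps can be sketched in base R as follows. This is only an illustration under assumptions: the data frame `messy`, its columns and its values are invented here, and the workshop itself may use other tools for each step.

```r
# Messy (wide) data: one row per subject, one column per time point.
messy <- data.frame(
  pid    = 1:3,
  before = c(5.1, 6.0, 5.5),
  after  = c(4.2, 5.1, 5.0)
)

# Step 1. Messy -> tidy: one row per (pid, time) observation.
tidy <- data.frame(
  pid   = rep(messy$pid, times = 2),
  time  = rep(c("before", "after"), each = nrow(messy)),
  score = c(messy$before, messy$after)
)

# Step 2. Manipulate: mean score for each time point.
means <- aggregate(score ~ time, data = tidy, FUN = mean)
means

# Step 3. Visualize: a simple bar chart of the group means.
barplot(means$score, names.arg = means$time, ylab = "mean score")
```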

We will cover these stages in reverse order in this workshop, since the first two stages are quite dry without the right motivation. First, however, we cover some essential concepts and show how data is loaded in the first place.

## Types, collections and variable assignments

### Strings

In [1]:
"This is a string"

In [64]:
substr("This is a string", 6, 10)

In [65]:
paste("gene", 1:10)

In [66]:
paste("Hello", "world", sep=", ")
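A short sketch of what the string cells do. `paste()` is vectorized and joins its arguments with a space unless `sep` says otherwise, while `substr()` uses 1-based, inclusive positions. The variable names are illustrative only.

```r
# paste() recycles the shorter argument across the longer one.
labels_space      <- paste("gene", 1:3)             # "gene 1" "gene 2" "gene 3"
labels_underscore <- paste("gene", 1:3, sep = "_")  # "gene_1" "gene_2" "gene_3"
labels_tight      <- paste0("gene", 1:3)            # "gene1"  "gene2"  "gene3"

# substr() extracts characters 6 through 10, counting from 1.
substr("This is a string", 6, 10)                   # "is a "

# nchar() counts characters, including spaces.
nchar("This is a string")                           # 16
```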

### Numbers

In [2]:
42

In [3]:
3.14

In [4]:
0.5 + 0.5i

### Boolean values

In [7]:
TRUE

In [8]:
2 > 3

### Factors

In [10]:
sex <- as.factor(c("M", "F"))

In [11]:
sex

In [14]:
str(sex)

 Factor w/ 2 levels "F","M": 2 1
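The `str()` output shows that a factor stores integer codes (`2 1`) that point into a sorted set of levels (`"F"`, `"M"`). A minimal sketch of how to look at both parts, and how to set the level order explicitly; the `dose` variable is hypothetical.

```r
sex <- as.factor(c("M", "F"))

levels(sex)        # "F" "M": levels are sorted alphabetically by default
as.integer(sex)    # 2 1: each value is stored as an index into the levels

# The order of the levels can be given explicitly instead.
dose <- factor(c("low", "high", "low"), levels = c("low", "high"))
levels(dose)       # "low" "high"
table(dose)        # counts per level, reported in level order
```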


### Missing values

In [15]:
NA

In [16]:
4 * NA
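Arithmetic involving `NA` returns `NA`, so missing values propagate through calculations. A small sketch of the usual ways to detect and skip them; the vector `x` is made up for illustration.

```r
x <- c(1, NA, 3)

is.na(x)                 # FALSE TRUE FALSE: flags the missing entry
mean(x)                  # NA: the missing value propagates
mean(x, na.rm = TRUE)    # 2: drop missing values before averaging
sum(is.na(x))            # 1: count of missing values
```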

### Vectors

In [20]:
5:10

In [21]:
10:5

In [22]:
c(1,1,2,3,5,8)

In [35]:
seq(1, 10, by=3)

In [37]:
rep(1:4, 2)

In [36]:
rep(1:4, each=2)

In [38]:
rnorm(5, 100, 15)

In [70]:
sample(c("H", "T"), 5, replace = TRUE)
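`rnorm()` and `sample()` return different values on every run. The sketch below spells out the positional arguments of `rnorm(5, 100, 15)` by name and uses `set.seed()` to make the draws repeatable. The seed value and variable names are arbitrary choices for this example.

```r
# rnorm(5, 100, 15) means: 5 draws, mean 100, standard deviation 15.
set.seed(2024)
draws_a <- rnorm(n = 5, mean = 100, sd = 15)

# Resetting the same seed reproduces exactly the same draws.
set.seed(2024)
draws_b <- rnorm(n = 5, mean = 100, sd = 15)
identical(draws_a, draws_b)   # TRUE

# sample() with replace = TRUE can pick the same item more than once.
flips <- sample(c("H", "T"), size = 5, replace = TRUE)
flips
```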

### Matrices

In [23]:
matrix(1:12, nrow=4)

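|      | [,1] | [,2] | [,3] |
|------|------|------|------|
| [1,] | 1 | 5 | 9 |
| [2,] | 2 | 6 | 10 |
| [3,] | 3 | 7 | 11 |
| [4,] | 4 | 8 | 12 |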


In [24]:
matrix(1:12, nrow=4, byrow=TRUE)

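|      | [,1] | [,2] | [,3] |
|------|------|------|------|
| [1,] | 1 | 2 | 3 |
| [2,] | 4 | 5 | 6 |
| [3,] | 7 | 8 | 9 |
| [4,] | 10 | 11 | 12 |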
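The two cells differ only in fill order: by default `matrix()` fills column by column, and `byrow=TRUE` fills row by row. A brief sketch that makes the difference explicit; the names `m_col` and `m_row` are illustrative.

```r
m_col <- matrix(1:12, nrow = 4)                 # filled down the columns
m_row <- matrix(1:12, nrow = 4, byrow = TRUE)   # filled across the rows

dim(m_col)     # 4 3: four rows, three columns
m_col[2, 1]    # 2: second value lands in row 2 of column 1
m_row[2, 1]    # 4: first row is 1, 2, 3, so row 2 starts at 4

# Filling by row equals transposing a column-filled 3 x 4 matrix.
identical(m_row, t(matrix(1:12, nrow = 3)))     # TRUE
```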


### Lists

In [26]:
list(a=1, b=2)

In [27]:
list(a=5:10, b= 10:5)

### Assignment

In [32]:
greet <- "hello"

In [34]:
greet

In [47]:
my.vec <- 5:10

In [48]:
my.vec

In [28]:
my.list <- list(a=5:10, b= 10:5)

In [29]:
my.list

In [30]:
my.matrix <- matrix(1:12, nrow=4, byrow=TRUE)

In [31]:
my.matrix

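|      | [,1] | [,2] | [,3] |
|------|------|------|------|
| [1,] | 1 | 2 | 3 |
| [2,] | 4 | 5 | 6 |
| [3,] | 7 | 8 | 9 |
| [4,] | 10 | 11 | 12 |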


### Indexing

#### Vectors

In [57]:
my.vec

In [49]:
my.vec[1]

In [50]:
my.vec[-1]

In [53]:
my.vec[-c(1,3)]

In [51]:
my.vec[2:4]
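R vectors are indexed from 1, and a negative index drops elements rather than counting from the end. The sketch below annotates the cells above with their results and adds logical indexing as one further, commonly used form.

```r
my.vec <- 5:10

my.vec[1]            # 5: the first element (indexing starts at 1)
my.vec[-1]           # 6 7 8 9 10: everything except the first element
my.vec[-c(1, 3)]     # 6 8 9 10: drop the first and third elements
my.vec[2:4]          # 6 7 8: a contiguous slice

# A logical vector of the same length keeps the TRUE positions.
my.vec[my.vec > 7]   # 8 9 10
```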

#### Lists

In [58]:
my.list

In [46]:
my.list$a

In [40]:
my.list[1]

In [43]:
my.list[[1]]
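The difference between the last two cells is easy to miss: single brackets return a shorter list, while double brackets (and `$`) return the element itself. A minimal sketch:

```r
my.list <- list(a = 5:10, b = 10:5)

class(my.list[1])      # "list": still a list, with one element
class(my.list[[1]])    # "integer": the vector stored in the first slot

# These three forms all extract the same vector.
identical(my.list$a, my.list[[1]])       # TRUE
identical(my.list$a, my.list[["a"]])     # TRUE

# Once extracted, the vector can be indexed as usual.
my.list[["a"]][2]      # 6
```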

#### Matrices

In [59]:
my.matrix

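|      | [,1] | [,2] | [,3] |
|------|------|------|------|
| [1,] | 1 | 2 | 3 |
| [2,] | 4 | 5 | 6 |
| [3,] | 7 | 8 | 9 |
| [4,] | 10 | 11 | 12 |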


In [56]:
my.matrix[2,3]

In [61]:
my.matrix[2,]

In [62]:
my.matrix[,3]

In [63]:
my.matrix[2:3, 2:3]

|      | [,1] | [,2] |
|------|------|------|
| [1,] | 5 | 6 |
| [2,] | 8 | 9 |


## Getting data into a `data.frame`

### Creating a `data.frame` from scratch

A data frame is a list of vectors of the same length, where each vector holds values of a single type and is treated as a column.

In [69]:
my.df <- data.frame(pid=1:4, values=rnorm(4))

In [68]:
my.df

| pid | values |
|-----|--------|
| 1 | 2.6379455 |
| 2 | -0.4258357 |
| 3 | 1.6738928 |
| 4 | -1.4678309 |
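To illustrate that each column is a vector with its own type, here is a sketch with three columns of different types. The `sex` column and all values are made up for this example, with fixed numbers replacing `rnorm()` so the result is repeatable.

```r
typed.df <- data.frame(
  pid    = 1:4,                               # integer column
  sex    = factor(c("M", "F", "F", "M")),     # factor column
  values = c(2.5, -0.4, 1.7, -1.5)            # numeric column
)

sapply(typed.df, class)   # the type of each column
typed.df$values           # a column is extracted like a list element
length(typed.df)          # 3: the "length" of a data frame is its column count
is.list(typed.df)         # TRUE: a data frame is a list underneath
```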


## Understanding the `data.frame`
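
A minimal sketch of base R functions commonly used to inspect a data frame. It assumes a small data frame like the `my.df` created earlier, re-created here with fixed, made-up values so the block stands alone.

```r
my.df <- data.frame(pid = 1:4, values = c(2.5, -0.4, 1.7, -1.5))

dim(my.df)              # 4 2: number of rows, then number of columns
names(my.df)            # "pid" "values": the column names
str(my.df)              # compact summary of each column's type and contents
head(my.df, n = 2)      # the first two rows
summary(my.df$values)   # minimum, quartiles, mean and maximum of one column

# Rows can be selected with a logical condition on a column.
positive <- my.df[my.df$values > 0, ]
positive
```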

## Exporting a `data.frame`
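
One common way to export a data frame is `write.csv()`. The sketch below writes to a temporary file, which is an assumption made so the example runs anywhere; a real analysis would use a path of your choosing. The data values are made up.

```r
my.df <- data.frame(pid = 1:4, values = c(2.5, -0.4, 1.7, -1.5))

# row.names = FALSE avoids writing an extra column of row numbers.
out_path <- tempfile(fileext = ".csv")
write.csv(my.df, file = out_path, row.names = FALSE)

# Reading the file back is a quick check that the export worked.
round_trip <- read.csv(out_path)
all.equal(my.df, round_trip)
```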

## Installing packages from `CRAN` and `Bioconductor`