#### R in a nutshell

- A statistical programming environment
- Originally designed and implemented by statisticians
- Widely popular due to its extensive collection of community-contributed packages
- Quickly gaining ground on traditional proprietary data analytics tools such as SAS and Stata

#### Learning Objectives

- Understand basic programming concepts: variables, assignment, functions, loops, conditions
- Understand core R concepts: data loading, data types, data access, libraries
- Understand advanced R concepts: data manipulation, visualization
- Understand HPC concepts: running R code on the Palmetto supercomputer via batch submission scripts

#### Materials in this notebook are based on two lessons by Software Carpentry and Data Carpentry:

- Introduction to Programming using R
- Data Analysis and Visualization in R for Ecology

## Where am I?

In [None]:
getwd()

## Variables and assignment

In [None]:
read.csv("data/combined.csv")

- *variable* : label, name, identifier ...
- *value*    : the actual content represented by a *variable*
- *assignment* : the act of assigning a value to a variable
- R's assignment notation:  *variable* <- *value*

In [None]:
x <- 2

In [None]:
x

In [None]:
weight_kg <- 97

In [None]:
weight_lb <- weight_kg * 2.2

In [None]:
weight_lb

In [None]:
weight_kg <- 90

In [None]:
weight_lb <- weight_kg * 2.2
weight_lb
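
The cells above recompute `weight_lb` after changing `weight_kg` because assignment stores a value, not a formula. A minimal sketch of that behavior, reusing the names above (`weight_lb_stale` is a hypothetical helper name added only for illustration):

```r
weight_kg <- 97
weight_lb <- weight_kg * 2.2    # computed once, from the current value of weight_kg

weight_kg <- 90                 # reassigning weight_kg does not touch weight_lb
weight_lb_stale <- weight_lb    # hypothetical name: still based on 97 kg

weight_lb <- weight_kg * 2.2    # recompute to pick up the new value
print(c(stale = weight_lb_stale, updated = weight_lb))
```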

*Read data into a variable:*

In [None]:
surveys <- read.csv("data/combined.csv")
head(surveys)

*Header is TRUE or Header is FALSE, that is the question!*

In [None]:
surveys <- read.csv("data/combined.csv", header = FALSE)
head(surveys)

In [None]:
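# stringsAsFactors = TRUE is explicit so the Factors section below also works
# on R >= 4.0, where the default changed to FALSE
surveys <- read.csv("data/combined.csv", stringsAsFactors = TRUE)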
head(surveys)
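
To see what `header` changes without opening the real file, here is a small sketch that reads a made-up two-row CSV through the `text` argument of `read.csv`. The data and the names `csv_text`, `with_header`, and `without_header` are assumptions for illustration only.

```r
# Hypothetical inline CSV: one header line followed by two data rows
csv_text <- "species,weight\nDM,40\nDO,48"

with_header    <- read.csv(text = csv_text)                  # header = TRUE is the default
without_header <- read.csv(text = csv_text, header = FALSE)  # first line treated as data

names(with_header)      # column names taken from the first line
names(without_header)   # auto-generated names V1, V2
nrow(with_header)       # 2 data rows
nrow(without_header)    # 3 rows, because the header line is counted as data
```

With `header = FALSE`, the `weight` column also becomes text, since its first entry is the word "weight".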

*How to get help?*

In [None]:
?read.csv

## Data Types

- Data frames are the *de facto* data structure for tabular data in R. They are conceptually equivalent to an Excel spreadsheet, but more powerful and versatile.
- Matrices (two dimensions) and vectors (one dimension) are also available for computational purposes.
- A data frame represents a table whose columns are vectors of the same length but possibly different data types.
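
As a rough illustration of the points above, the sketch below builds a data frame from three vectors of different types and contrasts it with a matrix. All values and names (`species`, `weight`, `adult`, `animals`, `m`) are made up and do not come from the surveys file.

```r
# Hypothetical example data
species <- c("DM", "DO", "PP")       # character vector
weight  <- c(40, 48, 17)             # numeric vector
adult   <- c(TRUE, TRUE, FALSE)      # logical vector

# Columns must have the same length, but each keeps its own type
animals <- data.frame(species, weight, adult, stringsAsFactors = FALSE)
str(animals)

# A matrix, by contrast, holds a single data type
m <- matrix(1:6, nrow = 2)
dim(m)
```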

In [None]:
class(surveys)

*Structure of a data frame*

In [None]:
str(surveys)

In [None]:
summary(surveys)

*Size of a data frame*

In [None]:
dim(surveys)

In [None]:
nrow(surveys)

In [None]:
ncol(surveys)

*Content of a data frame*

In [None]:
head(surveys)

In [None]:
head(surveys, n=10)

In [None]:
tail(surveys)

In [None]:
tail(surveys, n=10)

*Names*

In [None]:
names(surveys)

In [None]:
surveys_colnames <- names(surveys)

In [None]:
surveys_colnames

In [None]:
surveys_rownames <- rownames(surveys)
str(surveys_rownames)

## Data frames access: indexing and subsetting

As with an Excel spreadsheet, we can extract specific data from a data frame via 'coordinates': row/column combinations.

*Accessing a single element*

In [None]:
surveys[1,1]

In [None]:
surveys[1,2]

*Accessing a block of elements*

In [None]:
surveys[1:5,2]

In [None]:
surveys[2,3:7]

In [None]:
surveys[1:5,3:7]

*Accessing scattered groups of elements*

In [None]:
?c

In [None]:
surveys[c(2:4,6:7),]

*Excluding data with the `-` notation:*

In [None]:
surveys[1:5, -3]
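
The `-` notation works on rows as well as columns, and it accepts a vector of positions. A small sketch on a made-up data frame (`df` and the result names are hypothetical):

```r
# Hypothetical 4 x 3 data frame used only for illustration
df <- data.frame(id    = 1:4,
                 month = c(7, 7, 8, 8),
                 year  = c(1977, 1977, 1978, 1978))

no_year  <- df[, -3]          # drop the third column
no_first <- df[-1, ]          # drop the first row
no_12    <- df[-c(1, 2), ]    # drop rows 1 and 2

names(no_year)
nrow(no_first)
no_12$id
```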

*Accessing columns by names:*

In [None]:
surveys[1:5,"month"]

In [None]:
surveys[["month"]][1:5]

In [None]:
surveys$month[1:5]
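
The three forms above return the same vector, but single brackets with only a column name behave differently: they return a one-column data frame. A minimal sketch on a made-up data frame (`df` and the result names are hypothetical):

```r
# Hypothetical data frame used only for illustration
df <- data.frame(month = c(7, 7, 8), year = c(1977, 1977, 1978))

as_frame  <- df["month"]      # name only, single brackets: a one-column data frame
by_coord  <- df[, "month"]    # row/column form: a plain vector
by_double <- df[["month"]]    # double brackets: a plain vector
by_dollar <- df$month         # dollar sign: a plain vector

class(as_frame)
identical(by_coord, by_double)
identical(by_double, by_dollar)
```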

**Challenge:**

- Create a data frame containing only the observations from row 200 to the end of the `surveys` data set
- Create a data frame containing the row that is in the middle of the data frame. Store the content in a variable named `surveys_middle`.
- Combine `nrow` with the `-` notation to reproduce the behavior of `head(surveys)`

## Factors

- Special class, representing categorical data
- Can be ordered or unordered
- Stored as integers with labels (text) associated with these unique integers
- Look and behave like character vectors but are integers under the hood
- Once created, a `factor` object can only contain a pre-defined set of values, known as *levels*.
- *Levels* are sorted alphabetically by default.
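
A minimal sketch of the points above, using made-up values rather than the surveys data. The names `sex_demo`, `sex_bad`, and `sizes` are hypothetical.

```r
# Unordered factor: levels are sorted alphabetically by default
sex_demo <- factor(c("M", "F", "F", "M"))
levels(sex_demo)          # the labels
as.integer(sex_demo)      # the integer codes stored under the hood

# Assigning a value that is not a level stores NA (R normally warns here)
sex_bad <- sex_demo
suppressWarnings(sex_bad[1] <- "unknown")
is.na(sex_bad[1])

# Ordered factor: the order of the levels is given explicitly
sizes <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
sizes[1] < sizes[2]       # comparisons are meaningful for ordered factors
```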

In [None]:
str(surveys)

In [None]:
levels(surveys$sex)

In [None]:
nlevels(surveys$sex)

*Converting factors:*

In [None]:
as.character(surveys$sex)

In [None]:
f <- factor(c(1990,1983,1977,1998,1990))

In [None]:
f

In [None]:
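as.numeric(f) # incorrect: returns the internal integer codes, not the years

In [None]:
as.numeric(as.character(f)) # works: converts via the level labels

In [None]:
as.numeric(levels(f))[f] # recommended: converts only the unique levels, then indexes by f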

*Renaming factors:*

In [None]:
plot(surveys$sex)

In [None]:
sex <- surveys$sex

In [None]:
levels(sex)

In [None]:
levels(sex)[1] <- "missing"

In [None]:
plot(sex)

*Using `stringsAsFactors=FALSE`*

In [None]:
surveys <- read.csv('data/combined.csv', stringsAsFactors = TRUE)
str(surveys)

In [None]:
surveys <- read.csv('data/combined.csv', stringsAsFactors = FALSE)
str(surveys)
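
The effect of `stringsAsFactors` can also be checked on a small made-up CSV passed through the `text` argument of `read.csv`. The data and variable names below are assumptions for illustration. Note that the default changed from `TRUE` to `FALSE` in R 4.0.0, so setting the argument explicitly keeps the result the same across R versions.

```r
# Hypothetical inline CSV with one text column and one numeric column
csv_text <- "sex,weight\nM,40\nF,48\nF,17"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_string <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$sex)     # text column converted to a factor
levels(as_factor$sex)    # its levels
class(as_string$sex)     # text column kept as character
levels(as_string$sex)    # NULL: character vectors have no levels
```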