# R Basics

- Use the console to use R as a simple calculator

In [None]:
1+2

- The assignment operator is "<-". The "=" symbol can also be used

In [None]:
a=2+3
b<-10/a

In [None]:
a <- 3
b <- sqrt(a)
b

## Data Classes

* R has five basic atomic classes
    * Numeric
        - Double is equivalent to numeric.
        - Numbers in R are treated as numeric unless specified otherwise.
    * Integer
    * Complex
    * Character
    * Logical
        - TRUE or FALSE
* You can convert data from one type to another using the `as.<Type>` functions
* To test whether an object is of a given type, use the `is.<Type>` functions; `class()` reports the class itself.

In [None]:
c <- 2i
d <- TRUE
d

In [None]:
as.numeric(d); as.character(b); is.complex(c)
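
The difference between `class()`, the `is.<Type>` tests, and the `as.<Type>` conversions can be seen in a short sketch. The variable names `n`, `k` and `s` are made up for illustration.

```r
n <- 42        # numeric (double) by default
k <- 42L       # the L suffix creates an integer
class(n)       # "numeric"
class(k)       # "integer"
is.integer(n)  # FALSE: 42 is stored as a double
is.integer(k)  # TRUE
s <- as.character(k)   # "42"
is.character(s)        # TRUE
```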

## Data Objects - Vectors

* Vectors can only contain elements of the same class
* Vectors can be constructed by
    -  Using the `c()` function (concatenate)
    -  Using the `vector()` function
* Coercion will occur when mixed objects are passed to the `c()` function, as if the `as.<Type>()` function is explicitly called
* Use `[index]` to access an individual element
    -  Indices start from 1

In [None]:
# "#" indicates comment
# "<-" performs assignment operation (you can use "=" as well, but "<-" is preferred)
# numeric (double is the same as numeric)
d <- c(1,2,3)
# character
d <- c("1","2","3")
# you can convert an object with as.TYPE
# as.numeric changes the character vector created above to numeric
as.numeric(d)

In [None]:
# The conversion doesn't always work though
as.numeric("a")

In [None]:
x <- c(0.5, 0.6) ## numeric
x <- c(TRUE, FALSE) ## logical
x <- c(T, F) ## logical
x <- c("a", "b", "c") ## character
# The ":" operator can be used to generate integer sequences
x <- 9:29 ## integer
x <- c(1+0i, 2+4i) ## complex
x <- vector("numeric", length = 10)
# Coercion will occur when objects of different classes are mixed
y <- c(1.7, "a") ## character
y <- c(TRUE, 2) ## numeric
y <- c("a", TRUE) ## character
# Can also coerce explicitly
x <- 0:6
class(x)

In [None]:
as.logical(x)
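
Indexing with `[index]` is not shown in the cells above, so here is a minimal sketch of it. The vectors `v` and `z` are made-up examples.

```r
v <- c(10, 20, 30, 40, 50)
v[1]            # first element (indices start from 1): 10
v[c(2, 4)]      # several elements at once: 20 40
v[-1]           # everything except the first element: 20 30 40 50
v[length(v)]    # last element: 50
z <- vector("character", length = 3)   # three empty strings
z[2] <- "b"     # assign to a single position
z
```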

- Vectorized Operations

In [None]:
x <- 1:4; y <- 6:9
x + y

In [None]:
x > 2

In [None]:
x * y

In [None]:
print( x[x >= 3] )

## Data Objects - Matrices

* Matrices are vectors with a dimension attribute
* R matrices can be constructed
    -  Using the `matrix()` function
    -  Passing a `dim` attribute to a vector
    -  Using the `cbind()` or `rbind()` functions
* R matrices are constructed column-wise
* Use `[<index>,<index>]` to access an individual element

In [None]:
# Create a matrix using the matrix() function
m <- matrix(1:6, nrow = 2, ncol = 3)
m

In [None]:
dim(m)

In [None]:
attributes(m)
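
Column-wise construction is easiest to see next to its alternative. The sketch below uses the `byrow` argument of `matrix()`; the names `by_col` and `by_row` are made up.

```r
by_col <- matrix(1:6, nrow = 2)                 # default: fill column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill row by row instead
by_col
by_row
by_col[2, 1]   # 2
by_row[2, 1]   # 4
```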

In [None]:
# Pass a dim attribute to a vector
m <- 1:10
m

In [None]:
dim(m) <- c(2, 5)
m

In [None]:
# Row binding and column binding
x <- 1:3
y <- 10:12
cbind(x, y)

In [None]:
rbind(x, y)

In [None]:
# Slicing
m

In [None]:
# element at 2nd row, 3rd column
m[2,3]

In [None]:
# entire 2nd row of m
m[2,]

In [None]:
# entire 3rd column of m
m[,3]

## Data Objects - Lists

* Lists are a special kind of vector that can contain objects of different classes
* Lists can be constructed by using the `list()` function
* Lists can be indexed using `[[  ]]`


In [None]:
# Use the list() function to construct a list
x <- list(1, "a", TRUE, 1 + 4i)
x
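
A short sketch of list indexing: `[[ ]]` extracts the element itself, while `[ ]` returns a shorter list. The name `lst` is made up.

```r
lst <- list(1, "a", TRUE, 1 + 4i)
lst[[2]]          # the element itself: "a"
class(lst[[2]])   # "character"
lst[2]            # a list of length 1 that contains "a"
class(lst[2])     # "list"
```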

## Names

* R objects can have names


In [None]:
# Each element in a vector can have a name
x <- 1:3
names(x)

In [None]:
names(x) <- c("a","b","c")
names(x)

In [None]:
x

In [None]:
# Lists
x <- list(a = 1, b = 2, c = 3)
x

In [None]:
# Names can be used to refer to individual element
x$a

In [None]:
# Columns and rows of matrices
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
m

## Querying Object Attributes

* The `class()` function
* The `str()` function
* The `attributes()` function reveals attributes of an object (it returns `NULL` for a plain vector that has no attributes set)
    -  Class
    -  Names
    -  Dimensions
    -  Length
    -  User defined attributes
* These functions work on all objects (including functions)

In [None]:
m <- matrix(1:10, nrow = 2, ncol = 5)
str(matrix)

In [None]:
str(m)

In [None]:
str(str)
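
As a rough illustration of `attributes()`, the sketch below compares a plain vector, a vector with a user-defined attribute, and a matrix. The attribute name `units` and the variables `v` and `mat` are made up.

```r
v <- 1:3
attributes(v)                    # NULL: no attributes set yet
attr(v, "units") <- "seconds"    # hypothetical user-defined attribute
attributes(v)                    # now lists $units
mat <- matrix(1:6, nrow = 2)
attributes(mat)$dim              # 2 3
length(mat)                      # 6
```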

## Data Class - Factors

* Factors are used to represent categorical data.
* Factors can be unordered or ordered.
* Factors are treated specially by modelling functions like `lm()` and `glm()`

In [None]:
# Use the factor() function to construct a factor
# The order of levels can be set with the levels argument
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
x
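
The sketch below looks inside a factor and shows an ordered one. The variables `f` and `size` and the level names are made up.

```r
f <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
levels(f)       # "yes" "no"
as.integer(f)   # underlying integer codes: 1 1 2 1 2
table(f)        # counts per level: yes 3, no 2
# An ordered factor supports comparisons between levels
size <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"), ordered = TRUE)
size < "high"   # TRUE FALSE TRUE
```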

## Date and Time

* R has a `Date` class for dates, while times are represented by the `POSIXct` and `POSIXlt` classes
* One can convert a text string to date using the `as.Date()` function
* The `strptime()` function can deal with dates and times in different formats.
* The package "`lubridate`" provides many additional and convenient features

In [None]:
# Dates are stored internally as the number of days since 1970-01-01
x <- as.Date("1970-01-01")
x

In [None]:
as.numeric(x)

In [None]:
x+1

In [None]:
# Times are stored internally as the number of seconds since 1970-01-01
x <- Sys.time() ; x

In [None]:
as.numeric(x)

In [None]:
p <- as.POSIXlt(x)
names(unclass(p))

In [None]:
p$sec
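
Text in other layouts can be parsed by passing a format string. The sketch below uses made-up input strings with `as.Date()` and `strptime()`; the time zone `"UTC"` is an arbitrary choice.

```r
d1 <- as.Date("03/15/2017", format = "%m/%d/%Y")
d1                                   # "2017-03-15"
d2 <- as.Date("2017-05-01")
as.numeric(d2 - d1)                  # 47 days between the two dates
t1 <- strptime("2017-03-15 13:45:30", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(t1)                            # "POSIXlt" "POSIXt"
t1$hour                              # 13
format(d1, "%d %B %Y")               # month name depends on the locale
```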

## Missing Values

* Missing values are denoted by `NA`; undefined mathematical operations produce `NaN`.
    - `is.na()` is used to test objects if they are `NA`
    - `is.nan()` is used to test for `NaN`
    - `NA` values have a class also, so there are integer `NA`, character `NA`, etc.
    - A `NaN` value is also `NA` but the converse is not true

In [None]:
x <- c(1,2, NA, 10,3)
is.na(x)

In [None]:
is.nan(x)

In [None]:
x <- c(1,2, NaN, NA,4)
is.na(x)

In [None]:
is.nan(x)
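
In practice, missing values usually have to be handled before summarising. A small sketch, using a made-up vector `vals`:

```r
vals <- c(1, 2, NA, 10, 3)
mean(vals)                  # NA: missing values propagate
mean(vals, na.rm = TRUE)    # 4
sum(is.na(vals))            # 1 missing value
vals[!is.na(vals)]          # 1 2 10 3
0 / 0                       # NaN
class(NA_integer_)          # "integer"
class(NA_character_)        # "character"
```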

## Distributions and Random Variables

* For each distribution R provides four functions: density (`d`), cumulative distribution (`p`), quantile (`q`), and random generation (`r`)
    - The function name is of the form `[d|p|q|r]<name of distribution>`
    - e.g. `qbinom()` gives the quantile of a binomial distribution


In [None]:
# Random generation from a uniform distribution.
runif(10, 2, 4)

In [None]:
# You can name the arguments in the function call.
runif(10, min = 2, max = 4)

In [None]:
# Given a cumulative probability and the degrees of freedom, find the t-value (quantile).
qt(p=0.975, df = 8)

In [None]:
# The inverse of the above function call
pt(2.306, df = 8)
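
The same four prefixes apply to the normal distribution. The sketch below is illustrative only; the seed is arbitrary.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
dnorm(0)                         # density of N(0, 1) at 0, about 0.399
pnorm(1.96)                      # P(Z <= 1.96), about 0.975
qnorm(0.975)                     # about 1.96
pnorm(qnorm(0.975))              # 0.975, since q inverts p
draws <- rnorm(5, mean = 10, sd = 2)
length(draws)                    # 5
```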

## User Defined Functions
* Similar to other languages, functions in R are defined by using the `function` keyword
* The return value is the last expression in the function body to be evaluated.
* Functions can be nested
* Functions are R objects
    - For example, they can be passed as an argument to other functions

In [None]:
newDef <- function(a,b)
 {
     x = runif(10,a,b)
     mean(x)
 }
newDef(-1,1)
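
Because functions are objects, they can be passed as arguments. The sketch below uses a hypothetical helper `summarise_draws` (not part of R) with default argument values.

```r
# Hypothetical helper: apply a summary function f to n uniform draws
summarise_draws <- function(n, f = mean, lower = 0, upper = 1) {
  draws <- runif(n, lower, upper)
  f(draws)          # the last expression is the return value
}
set.seed(42)        # arbitrary seed
summarise_draws(10)                              # uses the default f = mean
summarise_draws(10, f = max)                     # pass another function
summarise_draws(10, function(v) sum(v > 0.5))    # pass an anonymous function
```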

## Control Structures
* Control structures allow one to control the flow of execution.

<table>
<tr><td><code>if … else</code></td><td>testing a condition</td></tr>
<tr><td><code>for</code></td><td>executing a loop (with fixed number of iterations)</td></tr>
<tr><td><code>while</code></td><td>executing a loop while a condition is true</td></tr>
<tr><td><code>repeat</code></td><td>executing an infinite loop</td></tr>
<tr><td><code>break</code></td><td>breaking the execution of a loop</td></tr>
<tr><td><code>next</code></td><td>skipping to next iteration</td></tr>
<tr><td><code>return</code></td><td>exit a function</td></tr>
</table>
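
A minimal sketch of `if … else` and `return`, using a hypothetical helper `sign_label`:

```r
# Hypothetical helper that classifies a number
sign_label <- function(v) {
  if (v > 0) {
    return("positive")     # return() exits the function immediately
  } else if (v < 0) {
    return("negative")
  }
  "zero"                   # reached only when v == 0
}
sign_label(3)    # "positive"
sign_label(-2)   # "negative"
sign_label(0)    # "zero"
```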

## Loops
### for Loops

In [None]:
x <- c("a", "b", "c", "d")
# These loops have the same effect
# Loop through the indices
for(i in 1:4) {
  print(x[i])
}


In [None]:
# Loop using the seq_along() function
for(i in seq_along(x)) {
  print(x[i])
}

In [None]:
# Loop through the elements
for(letter in x) {
  print(letter)
}

In [None]:
for(i in 1:4) print(x[i])

### while loops

* The `while` loop can be used to repeat a set of instructions
* It is often used when you do not know in advance how many times the instructions will be executed.
* The basic format for a `while` loop is `while(cond) expr`

In [None]:
counter <- as.integer(readline(prompt="Enter an integer: "))

factorial <- 1
while (counter > 0)
{
   factorial <- factorial * counter
   counter <- counter - 1
}

print(factorial)

### repeat loops

* The `repeat` loop is similar to the `while` loop. 
* The difference is that the body of a `repeat` loop always runs at least once, whereas a `while` loop runs its body only if the condition is true the first time it is evaluated.
* Another difference is that you have to explicitly specify when to stop the loop using the `break` command.

### break and next statements

* The `break` statement is used to stop the execution of the current loop. 
  - It will break out of the current loop. 
* The `next` statement skips the remaining statements in the current iteration and moves on to the next iteration of the loop.
  - In a `for` loop, `next` advances the loop variable to its next value.


In [None]:
counter <- as.integer(readline(prompt="Enter an integer: "))

dble_factorial <- 1

repeat
{
    dble_factorial <- dble_factorial * counter
    
    if (counter <= 2)
        break
    else
        counter = counter - 2
    
}

print(dble_factorial)
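
The sketch below combines `next` and `break` in a `for` loop that needs no user input. The variable `odd_sum` is made up.

```r
odd_sum <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next    # skip even numbers
  if (i > 7) break         # stop the loop once i exceeds 7
  odd_sum <- odd_sum + i
}
odd_sum   # 1 + 3 + 5 + 7 = 16
```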

## The apply Function
* The `apply()` function evaluates a function over the margins of an array
    - More concise than `for` loops (not necessarily faster)

In [None]:
# X: array objects
# MARGIN: a vector giving the subscripts which the function will be applied over
# FUN: a function to be applied
str(apply)

In [None]:
x <- matrix(rnorm(200), 20, 10)
# Row means
apply(x, 1, mean)

In [None]:
# Column sums
apply(x, 2, sum)

In [None]:
# 25th and 75th Quantiles for rows
apply(x, 1, quantile, probs = c(0.25, 0.75))

In [None]:
dim(x)

In [None]:
# Change the dimensions of x
dim(x) <- c(2,2,50)
# Take average over the first two dimensions
apply(x, c(1, 2), mean)

In [None]:
rowMeans(x, dims = 2)
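
To make the comparison with `for` loops concrete, the sketch below computes row means three ways on a small made-up matrix `mat`.

```r
mat <- matrix(1:6, nrow = 2)
row_means_loop <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
  row_means_loop[i] <- mean(mat[i, ])    # explicit loop over rows
}
row_means_apply <- apply(mat, 1, mean)   # same result in one call
row_means_loop                                # 3 4
all.equal(row_means_loop, row_means_apply)    # TRUE
all.equal(row_means_apply, rowMeans(mat))     # TRUE
```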

## R for Data Science

* The `tidyverse` is a collection of R packages developed by RStudio’s chief scientist Hadley Wickham. 
     * `ggplot2` for data visualisation.
     * `dplyr` for data manipulation.
     * `tidyr` for data tidying.
     * `readr` for data import.
     * `purrr` for functional programming.
     * `tibble` for tibbles, a modern re-imagining of data frames.
* These packages work well together as part of a larger data analysis pipeline.
* To learn more about these tools and how they work together, read [R for Data Science](http://r4ds.had.co.nz/). 

## Tidyverse

* What is Tidy Data?
     * "Tidy data" is a term that describes a standardized approach to structuring datasets to make analyses and visualizations easier. 
* The core tidy data principles
     * Variables make up the columns
     * Observations make up the rows
     * Values go into cells
* `library(tidyverse)` will load the core tidyverse packages:

In [None]:
library(tidyverse)

## Installing and Using Packages

* To install a package, `mypackage`, enter `install.packages('mypackage')` at the R prompt
* To use a package, `mypackage`, enter `library(mypackage)`
* When you share R scripts, notebooks, etc. with others or use them on multiple systems, there is no guarantee that `mypackage` will be available.
   * In this case `library(mypackage)` may give you an error.
   * R provides a function called `require` that tries to load the package and returns `FALSE` (with a warning) if it is not installed
        * If installed, the package will be loaded
        * Use a condition to install the package if the `require` call fails

In [None]:
if ( !require('lubridate')){
    install.packages('lubridate')
    library('lubridate')  # load the package after installing it
} 
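
The same pattern can be wrapped in a small function. `load_or_install` below is a hypothetical helper, not part of R; it is one possible sketch, built on `requireNamespace()` and `library(..., character.only = TRUE)`.

```r
# Hypothetical helper: load a package, installing it first if needed
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)                 # only reached when pkg is missing
  }
  library(pkg, character.only = TRUE)     # pkg is a string, not a bare name
  invisible(TRUE)
}
load_or_install("stats")   # 'stats' ships with R, so nothing is installed here
```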

## Tidyverse 

* Packages that are part of tidyverse but not loaded automatically
   * `lubridate` for dates and date-times
   * `magrittr` provides the pipe, `%>%`, used throughout the tidyverse.
   * `readxl` for .xls and .xlsx sheets.
   * `haven` for SPSS, Stata, and SAS data. 
* Packages that are not in the tidyverse, but are tidyverse-adjacent. They are very useful for importing data from other sources:
   * `jsonlite` for JSON.
   * `xml2` for XML.
   * `httr` for web APIs.
   * `rvest` for web scraping.
   * `DBI` for relational databases


## Readr Package

* `readr` provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf).
* `readr` supports seven file formats with seven `read_` functions:
    * `read_csv()`: comma separated (CSV) files
    * `read_csv2()`: semicolon separated file and "," for decimal point
    * `read_tsv()`: tab separated files
    * `read_delim()`: general delimited files
    * `read_fwf()`: fixed width files
    * `read_table()`: tabular files where columns are separated by white-space.
    * `read_log()`: web log files
* Usage
    * `read_delim(file, delim)`
        - `file`: path to a file, a connection, or literal data
        - `delim`: the character used to separate fields


In [None]:
# read daily usage report for Sol in AY 2016-17
# usage is reported in terms of SUs used and jobs submitted for
#  serial (1 cpu), single or smp ( > 1 cpu but max of 1 node) and 
#  parallel or multi node (> 1 node)  jobs 
daily <- read_delim('http://webapps.lehigh.edu/hpc/training/soldaily1617-public.csv',delim=";")
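
To try the `read_` functions without network access, the sketch below writes a small made-up semicolon-separated file to a temporary location and reads it back. The column names and values are invented and only loosely mimic the usage report.

```r
library(readr)
# Made-up sample data written to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("Day;Name;Total",
             "2017-02-01;user1;12,5",
             "2017-02-02;user2;7,25"), path)
# Read every column as text with an explicit delimiter
semi <- read_delim(path, delim = ";",
                   col_types = cols(.default = col_character()))
semi
# read_csv2() assumes ";" as separator and "," as decimal mark
eu <- read_csv2(path)
eu$Total   # 12.50 7.25
```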

## Dplyr

* `dplyr` is a grammar of data manipulation, providing a consistent set of verbs to  solve the most common data manipulation challenges:
  * `mutate()` adds new variables that are functions of existing variables
  * `select()` picks variables based on their names.
  * `filter()` picks cases based on their values.
  * `summarise()` reduces multiple values down to a single summary.
  * `arrange()` changes the ordering of the rows.
* These all combine naturally with `group_by()` which allows you to perform any operation "by group"
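
The verbs can be chained with the pipe. The sketch below runs them on a small made-up data frame `toy`; its column names mirror the ones used in this notebook, but the values are invented.

```r
library(dplyr)
# Made-up usage records
toy <- data.frame(
  Department = c("Physics", "Physics", "Chemistry", "Chemistry", "Biology"),
  Total      = c(100, 50, 30, 70, 10),
  TotalJ     = c(4, 2, 1, 3, 1),
  stringsAsFactors = FALSE
)
toy %>% mutate(PerJob = Total / TotalJ)           # add a column derived from others
toy_summary <- toy %>%
  filter(Total > 20) %>%                          # keep rows by value
  group_by(Department) %>%                        # operate "by group"
  summarise(Total = sum(Total), Jobs = sum(TotalJ)) %>%
  arrange(desc(Total)) %>%                        # largest total first
  select(Department, Total, Jobs)                 # pick columns by name
toy_summary
```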



In [None]:
daily %>% head

In [None]:
# Number of core hours available per month for AY 2016-17
# Oct 1, 2016: Initial launch with 780 cpus
# Mar 15, 2017: Added 192 cpus
# May 1, 2017: Added 312 cpus
# Total Available at end of AY 2016-17: 1284 cpus
ay1617su <- c(580320.00,561600.00,580320.00,580320.00,524160.00,580320.00,699840.00,955296.00,924480.00,955296.00,955296.00,924480.00)

In [None]:
monthly <- daily %>% 
  group_by(Month=floor_date(as.Date(Day), "month"),Name,Department,PI,PIDept,Status) %>% 
  summarize(Serial=sum(as.double(Serial)), # Single core or serial SUs consumed
    Single=sum(as.double(Single)), # Single node - multi core SUs consumed
    Multi=sum(as.double(Multi)), # Multi node SUs consumed
    Total=sum(as.double(Total)), # Total SUs consumed
    SerialJ=sum(as.double(SerialJ)), # Number of single core or serial jobs
    SingleJ=sum(as.double(SingleJ)), # Number of single node - multi core jobs
    MultiJ=sum(as.double(MultiJ)), # Number of multi node jobs
    TotalJ=sum(as.double(TotalJ))) # Total number of jobs
monthly %>% head

## Sol Monthly Usage for AY 2016-17

In [None]:
monthly %>% 
  group_by(Month) %>%   
  summarize(Total=round(sum(as.double(Total)),2),Jobs=round(sum(as.double(TotalJ)))) %>%
  mutate(Available=ay1617su,Unused=Available-Total,Percent=round(Total/Available*100,2)) -> monthlyusage
monthlyusage

## Sol usage per PI's Department



In [None]:
library(knitr)
monthly %>%
  group_by(PIDept) %>%
  summarize(Total=round(sum(as.double(Total)),2),Jobs=round(sum(as.double(TotalJ)))) -> monthlypidept
monthlypidept %>% kable

## Sol usage by user's department or major

In [None]:
monthly %>%
  group_by(Department) %>%
  summarize(Serial=round(sum(as.double(Serial))),SMP=round(sum(as.double(Single))),DMP=round(sum(as.double(Multi))),Total=round(sum(as.double(Total)),2),Jobs=round(sum(as.double(TotalJ)))) %>%
  arrange(desc(Total)) -> monthlyuser
monthlyuser

## Need code for creating LaTeX documents

- The `xtable` package works in RMarkdown documents but doesn't work in Jupyter Notebooks

In [None]:
library(xtable)
monthlyuser %>% xtable

## Sol usage by user affiliation

In [None]:
monthly %>%
  group_by(Status) %>%
  summarize(Total=round(sum(as.double(Total)),2)) -> monthlystatus
monthlystatus

## Tidyr

* The goal of `tidyr` is to help you create tidy data.
* Tidy data is data where:
   * Each variable is in a column.
   * Each observation is a row.
   * Each value is a cell.
* Tidy data describes a standard way of storing data that is used wherever possible throughout the `tidyverse`.
* If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.

* Two fundamental verbs of data tidying:
   * `gather()` takes multiple columns and gathers them into key-value pairs
   * `spread()` takes two columns (key and value) and spreads them into multiple columns


In [None]:
daily %>% 
  filter(as.Date(Day) >= "2017-02-01" & as.Date(Day) <= "2017-03-01") %>% 
  select(Day,Name,Department,PI,PIDept,Serial,Single,Multi) %>% 
  gather(JobType,Usage,Serial:Multi) %>% 
  filter(as.double(Usage) > 100 ) -> tmp
tmp %>% arrange(Usage) %>% head

In [None]:
tmp %>% arrange(Usage) %>% 
  spread(JobType,Usage,fill = 0.0) %>% head
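
The round trip is easier to follow on a tiny table. The sketch below uses a made-up data frame `wide`; the last call shows `pivot_longer()`, which newer tidyr releases (1.0.0 and later) offer alongside `gather()`.

```r
library(tidyr)
# Made-up wide table: one usage column per job type
wide <- data.frame(
  Name   = c("user1", "user2"),
  Serial = c(10, 0),
  Multi  = c(5, 20),
  stringsAsFactors = FALSE
)
long <- gather(wide, JobType, Usage, Serial:Multi)   # wide -> long (key-value pairs)
long
back <- spread(long, JobType, Usage)                 # long -> wide again
back
# Equivalent reshaping with the newer interface
long2 <- pivot_longer(wide, Serial:Multi, names_to = "JobType", values_to = "Usage")
```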

## Other Tidyr functions

* `separate()`: Splitting a single variable into several

In [None]:
daily %>% 
   select(c(Department,Day,Total)) %>% 
   separate(Day,c("Year","Month","Day"),sep="-") -> tmp
head(tmp)

* `unite()`: Merging several variables into one

In [None]:
tmp %>%
  unite(Day,c("Year","Month","Day"),sep="/") %>%
  tail

## Data Visualization

* __Data visualization__ or __data visualisation__ is viewed by many disciplines as a modern equivalent of visual communication.
* It involves the creation and study of the visual representation of data.
* A primary goal of data visualization is to communicate information clearly and efficiently via statistical graphics, plots and information graphics. 
* Data visualization is both an art and a science.

### Data Visualization Tools

* There is a vast number of data visualization tools targeted at different audiences
* A few used by academic researchers 
     * Tableau
     * Google Charts
     * R
     * Python
     * Matlab
     * GNUPlot

   


## ggplot2 Package


- "gg" stands for Grammar-of-Graphics
- The idea is that any data graphics can be described by specifying
    - A dataset
    - Visual marks that represent data points
    - A coordinate system
- `ggplot2` package in R is an implementation of it
    - Versatile
    - Clear and consistent interface
    - Beautiful output
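
The three ingredients map directly onto `ggplot2` calls. The sketch below uses a made-up data frame `toy_status`; the y-axis limit is an arbitrary choice.

```r
library(ggplot2)
# Made-up data: total usage per user status
toy_status <- data.frame(
  Status = c("Faculty", "Graduate", "Undergraduate"),
  Total  = c(120, 300, 45)
)
toy_plot <- ggplot(toy_status, aes(x = Status, y = Total)) +   # dataset and mappings
  geom_col() +                                                 # visual marks: bars
  coord_cartesian(ylim = c(0, 350)) +                          # coordinate system
  labs(title = "Usage by status (made-up data)")
toy_plot
```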




In [None]:
p <- monthlystatus %>%
  ggplot(aes(x=Status,y=Total)) + geom_col()
p

In [None]:
p + coord_flip()

## Bar Chart Monthly Usage

In [None]:
p <- monthlyusage %>%
  ggplot(aes(Month,Percent)) + geom_col()
p

* Add Plot Title and Caption, and x and y labels

In [None]:
p + labs(title="Sol Usage", y="Percent", x="Month", caption="AY 2016-17")

## Line Charts

In [None]:
p <- daily %>%
  group_by(Day, PIDept) %>%
  summarize(Total=round(sum(as.double(Total)),2),Jobs=round(sum(as.double(TotalJ)))) %>%
  ggplot(aes(Day,Total)) + geom_line(aes(col = PIDept))
p

The plot is very busy. There are several options to clean this up.
* summarize by week or month
* take a running average or add a smoothing function
* create separate plots for each Department using `facet_wrap`

In [None]:
p + facet_wrap( ~PIDept)

It's not very useful at first try, which may be due to the data. It would be ideal if each subplot had its own y-axis limits. This is achieved by adding _scales="free"_ as an option to facet_wrap. You can also adjust the number of rows or columns by adding _nrow_ or _ncol_ as an option. Finally, the legend is redundant. You can remove the legend by adding _theme(legend.position='none')_.

Combining all these options we get

In [None]:
p + facet_wrap( ~PIDept, scales = "free", ncol = 2) + theme(legend.position='none')

## Animations

__If a Picture is Worth a Thousand Words, Then Is a Video Worth a Million?__

What is an animation really? Just a collection of images that appear at a high frequency or frame rate. If you have a collection of pictures, you can convert them to gif, mpeg, or any other video format using tools like ImageMagick or ffmpeg. R provides tools that will convert a collection of images from plots to video provided you have one of these conversion tools.

In [None]:
if(!require('animation')){
    install.packages('animation')
}
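if(!require('gganimate')){
    # The cells below use the frame aesthetic and the gganimate() function
    # from early (pre-1.0) gganimate releases; newer releases use a different API
    install.packages('gganimate')
}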

In [None]:
weeklyusage_status <- daily %>%
  group_by(Week=floor_date(as.Date(Day), "week"),Status) %>% 
  summarize(Total=round(sum(as.double(Total)),2),Jobs=round(sum(as.double(TotalJ)))) %>%
  ggplot(aes(Week,Total,frame=Week,cumulative=TRUE)) + geom_line(aes(col = Status)) +
  facet_wrap( ~Status, scales = "free", ncol = 2) + theme(legend.position='none')

In [None]:
ani.options(interval = 0.1, ani.width = 640, ani.height = 480)
gganimate(weeklyusage_status,'weeklystatus.gif')

![Usage by Users Status](weeklystatus.gif)