# R: a quick overview

R is a widely used programming language, so many resources are available. If you need to do some data wrangling, chances are that someone before you has already had to do the same thing and may even have written a package for those specific operations. It's usually worth looking around a bit before implementing your own function.

Here we're working either from within the console on the cluster or from a notebook, but I would also recommend that you get [RStudio](https://www.rstudio.com/). It offers a more user-friendly interface and has some great capabilities for plotting and debugging.

RStudio also provides several [cheatsheets](https://www.rstudio.com/resources/cheatsheets) for some of the most commonly used functionality.

Additionally, when you know the name of a function, you can open its documentation directly by entering `?functionOfInterest`.
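As a small sketch, here are a few related ways to reach the documentation. The function `mean` is only a stand-in for whatever function you are interested in, and the search phrase is an arbitrary example.

```r
# open the help page for a function (both lines do the same)
?mean
help('mean')

# search the installed help pages when you don't know the exact function name
??'standard deviation'

# show the arguments of a function without opening its help page
args(sd)
```

The `??` search only covers packages that are installed on your machine, so results can differ between the cluster and your laptop.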

The power of R lies both in its statistical capabilities and in its many plotting options for visualizing data. Higher-level visualizations are usually created with [ggplot2](http://ggplot2.org/), which offers an enormous range of possibilities. You can find some examples in the [R Graph Gallery](http://www.r-graph-gallery.com/).

There are six types of data structures in R, of which we'll look at the four main ones (the other two are rarely used). Knowing these structures will give you a better understanding of the language, which in turn improves your programming skills.

# Getting Started

In [3]:
# R can be used to perform simple operations:

# addition
3 + 5

# subtraction
10 - 4

# multiplication
3 * 4

# division
8 / 3

# raise a number to a power
2^8

# take a root
sqrt(4)
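A few more operators that come up often, as a short sketch in the same style as the cell above. The numbers are arbitrary examples.

```r
# integer division and remainder (modulo)
8 %/% 3     # 2
8 %% 3      # 2

# operator precedence: ^ is evaluated before the unary minus
-2^2        # -4
(-2)^2      # 4

# roots other than the square root can be written as fractional powers
27^(1/3)    # 3
```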



# Data Structure: Vector, Atomic

Atomic vectors can only contain one type of data: logical, integer, double, or character.

In [5]:
my.vector <- c(1, 2, 3)

my.vector2 <- c(1:3)

# atomic vectors are one dimensional
print(my.vector2)

# they take only one type of data
typeof(my.vector2)

my.char.vector <- c('a', 'b', 'c')
typeof(my.char.vector)

[1] 1 2 3


In [6]:
length(my.vector)

In [7]:
# vectorized operations
my.vector + 3

my.vector + my.vector
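Two consequences of the "one type only" rule and of vectorized operations are worth seeing once: type coercion and recycling. The following is a minimal sketch; the variable names `mixed`, `long` and `short` are made up for this illustration.

```r
# mixing types in c() coerces everything to the most flexible type
mixed <- c(1, 'a', TRUE)
typeof(mixed)          # "character"
mixed                  # "1" "a" "TRUE"

# 1:3 creates integers, c(1, 2, 3) creates doubles
typeof(1:3)            # "integer"
typeof(c(1, 2, 3))     # "double"

# recycling: the shorter vector is repeated to match the longer one
long <- 1:6
short <- c(10, 20)
long + short           # 11 22 13 24 15 26
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.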

# Data Structure: Vector, Lists

A list stores multiple elements, and each element can be of a different type.

In [3]:
my.list <- list(1:3, 'a', c(TRUE, FALSE))

my.list

In [6]:
my.named.list <- list(one = 1:3, two = 'a', three = c(TRUE, FALSE))
my.named.list

In [7]:
my.named.list$one
my.named.list[['two']]
my.named.list[[3]]


Lists are made up of atomic vectors or other lists.

In [10]:
typeof(my.named.list$one)

long.listed.list <- list(first = 1, second = list(2,3))
long.listed.list
typeof(long.listed.list$second)
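The difference between single and double brackets is a common source of confusion with lists, so here is a small sketch. The names `example.list` and `nested` are made up and simply mirror the lists used above.

```r
example.list <- list(one = 1:3, two = 'a', three = c(TRUE, FALSE))

# single brackets return a (shorter) list
sub.list <- example.list['one']
typeof(sub.list)        # "list"

# double brackets return the element itself
element <- example.list[['one']]
typeof(element)         # "integer"

# nested elements are reached by chaining the operators
nested <- list(first = 1, second = list(2, 3))
nested$second[[2]]      # 3

# str() prints a compact overview of the structure
str(nested)
```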

# Data Structure: Attributes

Attributes are not that important in the beginning, but they are good to know about. They are used to store metadata.

In [11]:
# Names
x <- c(a = 1, b = 2, c = 3)
x

In [13]:
# Factors
sex.char <- c('m', 'm', 'f')
sex.factor <- factor(sex.char, levels = c('m','f'))
sex.factor
table(sex.factor)


sex.factor
m f 
2 1 
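To see that names and factor levels really are stored as attributes, you can inspect them directly. This is a short sketch; the `unit` attribute is an arbitrary example of custom metadata.

```r
x <- c(a = 1, b = 2, c = 3)

# names are stored as an attribute of the vector
attributes(x)
names(x)                 # "a" "b" "c"

# arbitrary metadata can be attached with attr()
attr(x, 'unit') <- 'cm'
attr(x, 'unit')          # "cm"

# a factor is an integer vector with 'levels' and 'class' attributes
sex.factor <- factor(c('m', 'm', 'f'), levels = c('m', 'f'))
typeof(sex.factor)       # "integer"
levels(sex.factor)       # "m" "f"
as.integer(sex.factor)   # 1 1 2
```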

# Data Structure: Matrices / Arrays

These multi-dimensional data structures can only hold one type of data (usually numeric). A matrix is a special case of an array with exactly two dimensions, while arrays can have more.

In [18]:
my.matrix <- matrix(1:6, ncol = 3, nrow = 2)
my.matrix

typeof(my.matrix)

# dimensions are shown as: rows columns
dim(my.matrix)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


In [17]:
# accessing matrix row
my.matrix[1,]

# accessing matrix column
my.matrix[,1]
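The following sketch shows how the fill order works, how to pick a single element, and what an array with more than two dimensions looks like. The names `by.col`, `by.row`, `product` and `my.array` are made up for this example.

```r
# matrices are filled column by column unless byrow = TRUE
by.col <- matrix(1:6, nrow = 2)
by.row <- matrix(1:6, nrow = 2, byrow = TRUE)
by.row[1, ]                 # 1 2 3

# a single element: row 2, column 3
by.col[2, 3]                # 6

# arithmetic is element-wise; %*% performs matrix multiplication
by.col * 2
product <- by.col %*% t(by.col)
dim(product)                # 2 2

# an array with three dimensions: 2 rows, 3 columns, 2 layers
my.array <- array(1:12, dim = c(2, 3, 2))
dim(my.array)               # 2 3 2
my.array[1, 2, 2]           # 9
```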

# Data Structure: Data Frame

Data frames are one of the most commonly used data structures.

In [19]:
my.df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))

my.df

x,y
1,a
2,b
3,c


Under the hood, data frames are lists and can be accessed as such.

In [21]:
typeof(my.df)
my.df$x
my.df[['y']]

But they also have matrix-like properties: they possess rows and columns.

In [22]:
dim(my.df)

my.df[,1]
my.df[1,]

x,y
1,a
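When you meet an unfamiliar data frame, a handful of base R functions give a quick overview of its list-like and matrix-like sides. This is a minimal sketch; `inspect.df` is a made-up name for a data frame like the one above.

```r
inspect.df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))

# compact overview: dimensions, column names and column types
str(inspect.df)

# matrix-like view: number of rows and columns
nrow(inspect.df)        # 3
ncol(inspect.df)        # 2

# list-like view: length() counts the columns
length(inspect.df)      # 2
names(inspect.df)       # "x" "y"

# first rows and a per-column summary
head(inspect.df, 2)
summary(inspect.df)
```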


This mix of list and matrix properties makes data frames flexible. Additionally, two data frames can be combined if their dimensions match up.

In [23]:
my.other.df <- data.frame(x = 5:7, y = c('x', 'y', 'z'))

# combine by columns
my.column.combined.df <- cbind(my.df, my.other.df)
my.column.combined.df

# combine by rows
my.row.combined.df <- rbind(my.df, my.other.df)
dim(my.row.combined.df)

x,y,x.1,y.1
1,a,5,x
2,b,6,y
3,c,7,z
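What "dimensions match up" means in practice can be sketched as follows: `rbind()` needs matching column names, while `cbind()` needs the same number of rows. The data frames `df.a`, `df.b` and `df.c` are made up for this illustration.

```r
df.a <- data.frame(x = 1:3, y = c('a', 'b', 'c'))
df.b <- data.frame(x = 5:7, y = c('x', 'y', 'z'))

# rbind() stacks rows and needs matching column names
stacked <- rbind(df.a, df.b)
dim(stacked)                       # 6 2

# cbind() puts columns side by side and needs the same number of rows
side.by.side <- cbind(df.a, z = c(TRUE, FALSE, TRUE))
names(side.by.side)                # "x" "y" "z"

# mismatched column names make rbind() fail; try() catches the error here
df.c <- data.frame(x = 1:3, w = c('a', 'b', 'c'))
result <- try(rbind(df.a, df.c), silent = TRUE)
inherits(result, 'try-error')      # TRUE
```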


Subsetting: 

In [29]:
# subset operators: [, [[, $
my.df[1,]
my.df$x
my.df[[1]]


x,y
1,a


In [28]:
# subset types
# 1. using positive integers
my.column.combined.df[c(1,3),]

# 2. using negative integers (omitting these parts)
my.column.combined.df[c(-1,-3),]

# 3. logical vectors (careful with recycling: a shorter vector is repeated to fit)
my.column.combined.df[c(TRUE,FALSE,TRUE,FALSE),]
my.column.combined.df[c(TRUE),]


,x,y,x.1,y.1
1,1,a,5,x
3,3,c,7,z


,x,y,x.1,y.1
2,2,b,6,y


,x,y,x.1,y.1
1,1,a,5,x
3,3,c,7,z


x,y,x.1,y.1
1,a,5,x
2,b,6,y
3,c,7,z


In [32]:
# subset types continued
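# 4. Nothing: an empty index returns everything along that dimension (here: all columns);
#    mostly useful with matrices, arrays and data frames
my.column.combined.df[c(1,3),]

# 5. Zero: returns a zero-length result, mainly used for generating test data
#    (here a data frame with 0 columns; for a numeric vector the output would be numeric(0))
my.column.combined.df[0]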

# 6. Character vectors, if names are present
my.df['x']

,x,y,x.1,y.1
1,1,a,5,x
3,3,c,7,z


x
1
2
3
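The subset operators differ in what they return: some keep the data frame structure, others simplify to a vector. A minimal sketch of this, using a made-up data frame `sub.df` that mirrors the one above:

```r
sub.df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))

# [ with a single index keeps the data frame structure
class(sub.df['x'])                   # "data.frame"

# [[ and $ pull out the column as a vector
class(sub.df[['x']])                 # "integer"

# [ with a row and a column index simplifies a single column to a vector ...
class(sub.df[, 'x'])                 # "integer"

# ... unless drop = FALSE is given
class(sub.df[, 'x', drop = FALSE])   # "data.frame"
```

Using `drop = FALSE` is a reasonable habit inside functions, where the number of selected columns is not known in advance.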


One of the most widely used subsetting types is logical subsetting: elements are extracted based on a condition.

In [35]:
# extract the rows where column x is greater than or equal to 2
my.df[my.df$x >= 2, ]
# what actually happens: the condition creates a logical vector, which is then used for extraction
my.df$x >= 2

,x,y
2,2,b
3,3,c


In [36]:
# it is also possible to provide multiple conditions
my.df[my.df$x >= 2 & my.df$y == 'c', ]

,x,y
3,3,c


In [45]:
# if we have a regular vector, we can use the command 'which' to do the same
some.vector <- c(1:10)
some.vector[which(some.vector > 8)]
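For vectors, the logical condition can also be used directly as the index; `which()` matters mostly when missing values are involved. The sketch below illustrates this, with `with.na` as a made-up example vector.

```r
some.vector <- 1:10

# the logical vector can be used directly, without which()
some.vector[some.vector > 8]                 # 9 10

# which() returns the positions of the TRUE values
which(some.vector > 8)                       # 9 10

# the difference shows up with missing values
with.na <- c(5, NA, 12)
with.na[with.na > 8]                         # NA 12
with.na[which(with.na > 8)]                  # 12

# %in% tests membership in a set of values
some.vector[some.vector %in% c(2, 4, 100)]   # 2 4
```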

# Loops and 'if - else'

Loops are used when an operation has to be repeated several times. The two most common types are 'for' loops and 'while' loops.

In [10]:
# for loops are used when a definite end is in sight and you know how many times an operation has to be performed
# here I want to know the value of each element in my.vector
for(i in seq_along(my.vector)){
    print(paste('Iteration:', i))
    print(paste('Value at position', i, 'is:', my.vector[i]))
}

# while loops are used if the total number of iterations is not clear from the beginning
my.condition <- TRUE  # while this condition is true, keep looping
my.counter <- 0  # counts the number of loops performed
while(my.condition){
    my.counter <- my.counter + 1
    # stop as soon as five random numbers between 0 and 1 sum to more than 3
    if(sum(runif(5)) > 3){
        my.condition <- FALSE
    }
}
my.counter

[1] "Iteration: 1"
[1] "Value at position 1 is: 1"
[1] "Iteration: 2"
[1] "Value at position 2 is: 2"
[1] "Iteration: 3"
[1] "Value at position 3 is: 3"
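Many loops over a vector can be replaced by a vectorized expression or by `sapply()`. The sketch below compares the three approaches on a made-up vector `loop.vector`, and also shows why `seq_along()` is a safer way to generate loop indices than `1:length()`.

```r
loop.vector <- c(1, 2, 3)

# loop version: square each element
squared.loop <- numeric(length(loop.vector))   # pre-allocate the result
for (i in seq_along(loop.vector)) {
    squared.loop[i] <- loop.vector[i]^2
}

# vectorized version: the operation is applied to all elements at once
squared.vec <- loop.vector^2

# sapply() applies a function to each element and simplifies the result
squared.apply <- sapply(loop.vector, function(v) v^2)

identical(squared.loop, squared.vec)           # TRUE

# 1:length(x) counts backwards for an empty vector, seq_along(x) does not
empty <- c()
1:length(empty)                                # 1 0
seq_along(empty)                               # integer(0)
```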


In [11]:
# if-else statements check whether or not a certain condition holds true and react accordingly
if(length(my.vector) >= 3){
    print('we expected this')
} else {
    print('This is odd. Maybe you should recreate my.vector')
}

[1] "we expected this"
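An `if` statement expects a single TRUE or FALSE. For more than two cases you can chain `else if`, and for a whole vector of conditions there is the vectorized `ifelse()`. A minimal sketch with made-up names (`values`, `size`, `labels`):

```r
values <- c(1, 5, 10)

# chain several cases with else if
n <- length(values)
if (n == 0) {
    size <- 'empty'
} else if (n < 3) {
    size <- 'short'
} else {
    size <- 'long enough'
}
size                                    # "long enough"

# ifelse() checks the condition for every element
labels <- ifelse(values >= 5, 'high', 'low')
labels                                  # "low" "high" "high"
```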
