<div style="text-align: center"><img width=150px src="http://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/200px-R_logo.svg.png"></div>

# An introduction to R and notebooks

R is a language for statistical computing and graphics. It is distributed for free (under the GNU General Public License) and used by many people around the world to analyse data.  

You can learn more about R at [r-project.org](https://www.r-project.org/about.html) and [An Introduction to R](https://cran.r-project.org/doc/manuals/r-release/R-intro.html).

Using an analysis notebook helps to make your research reproducible and reduces the chance of errors (such as mistakes when transcribing numbers from analysis to report). It also means that if your data are updated, the whole analysis can be re-run automatically, without pointing and clicking through multiple menus (as you would with alternative software such as SPSS or Prism).

This notebook demonstrates R within a Jupyter notebook, adapting some material from the Microsoft Azure tutorial for R notebooks. 

(NB: although we'll use a notebook here, one is not required to use R for statistical analysis. You could instead download R and run it on your own computer; all you need is available at [r-project.org](https://www.r-project.org/about.html). If you install R on your computer, a popular add-on is [RStudio](https://www.rstudio.com/products/rstudio/download) (described as "a set of integrated tools designed to help you be more productive with R").)

# Get started

- First go to Microsoft Azure Notebooks: [notebooks.azure.com](https://notebooks.azure.com/).
- Follow the link to **Sign In**
- Use your university account (xxx@qmul.ac.uk) to log in (the same account that you use to check QMUL email)
- To access this tutorial, go to your projects and select the 'Upload GitHub Repo' button
- Enter [https://github.com/brentnall/r-azure-notebook-intro](https://github.com/brentnall/r-azure-notebook-intro) as the GitHub repository
- Give your project a name (e.g. r-intro)
- Then import. This will copy it into your Azure storage.
- Once in your storage, open the project you just cloned (Introduction to R)
- To view the notebook click on the file "Introduction to R.ipynb"
- You will then be able to edit, change and follow the instructions below on your own copy


# Notebook basics

There are two basic types of 'cells' that we're going to use in our Jupyter notebook. 

- Markdown
- Code 

Markdown is where you write your text, or tell the notebook to display an image.
Code is where you do your analysis (in R for us). You choose the type of cell using the menu above.

![title](menu.png)

This is a markdown cell. 
- When it has been 'run' it will look nice, and the heading line will look like a title, without the hash symbols in front.
- When it is being edited it will have a box around it.
- When it has been edited, but not yet (re)run, you'll still see the hash symbols. In general, if not run then markdown cells won't have nice formatting, and code cells won't have any output.

To choose the type of cell (markdown or code) select the box at the top in the menu, and change the selection. Of course code might cause problems in a markdown cell, and vice versa. So you need to make sure you have the right type of cell for what you are entering.

To get each cell to format or run, click the "Run" button (to the left of the markdown/code selection) after editing it.

To run all the cells again from the start to the end hit the >> button.

# Using R


## Your first R command

The cell below this is the first code cell in this notebook, and it runs a simple command in R:

In [1]:
# In a code cell the hash symbol (#) starts a comment - anything after it on the same line is ignored by R
print("hello world")

[1] "hello world"


**TASK FOR YOU** Try to repeat what was just done in the code cell: insert a new code cell below this one (select this cell, then use the menu option Insert) and write the R code to print out "why did the chicken cross the road?"

## R as a calculator 

You can use R like a calculator. For example 10 times 20 = ?

In [2]:
10 * 20

In [3]:
# Division example
2/3

**TASK FOR YOU** How would you add 123 + 125 and get the answer printed out?

In [4]:
#enter your code below and then click run

The elementary arithmetic operators are the usual `+`, `-`, `*`, `/`, and `^` (raise to a power), along with all the common arithmetic functions: `log`, `exp`, `sin`, `cos`, `tan`, `sqrt`, and so on.

In [5]:
# e.g. what is the square root of 4?
sqrt(4)
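
As a small illustrative sketch (the object names `a`, `b` and `d` are made up for this example), the operators follow the usual order of precedence, and brackets can be used to change the order of evaluation:

```r
# Multiplication is done before addition, as in ordinary arithmetic
a <- 2 + 3 * 4      # 14
# Brackets force the addition to happen first
b <- (2 + 3) * 4    # 20
# log() is the natural logarithm, so it undoes exp()
d <- log(exp(1))    # 1
a
b
d
```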

**TASK FOR YOU** write code below to calculate the square of 2 and print it out

In [6]:
#write your code below and then hit run

## Saving values

In R we can also save or 'assign' values to an 'object'. This is very useful and needed for more complex analysis. 

In [7]:
# For example
myval <- 2/3

**TASK FOR YOU** Write the code to print out `myval`

In [8]:
# enter your code here and run it


## Vectors and assignment

R is more advanced than a calculator. More generally R operates on named *data structures*. This is jargon, but worth learning in order to follow the help files in R.

The simplest *data structure* is the numeric *vector*, which is an ordered collection of numbers (and the simplest vector is just a single number - a vector with length 1).

To set up a vector named `x`, say, consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7, use the following R command:

In [9]:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

As we saw above, in a notebook, the previous cell won't show any output. You can see the contents of `x` by simply running `x` in a code cell:

In [10]:
x

`c()` is a *function*. This function puts the numbers together into a vector (`c` stands for concatenation).

The assignment operator (`<-`) consists of the two characters `<` ("less than") and `-` ("minus") occurring strictly side-by-side. It looks like the point of an arrow pointing to the object receiving the value of the expression. In most contexts `=` can be used as an alternative, so if you like you can use `=` instead.
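
A minimal sketch of the two forms of assignment side by side (the object names `val_arrow` and `val_equals` are invented for this illustration):

```r
# Both lines create an object holding the same value
val_arrow <- 2/3
val_equals = 2/3

# Check that the two objects are exactly the same
identical(val_arrow, val_equals)   # TRUE
```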



**TASK FOR YOU** Now try to repeat what we did. Enter a vector called `x2` with elements 1,2,3,4, and print it out

In [11]:
##Enter your code here and then run it

One nice thing about R is that it is easy to get help. For example, suppose you forgot what `c()` does. You can look it up by entering `?c`.

**TASK FOR YOU** Get the R help file for `c()` by entering `?c`

In [12]:
#Enter ?c then click run to get help on the function c()

Examples at the bottom of the help file are often the most practically useful part of the help file. They show you how it can be used.

### Some maths using vectors

You can also do some maths using the vector, as we did for numbers above. For example, the following statement displays the reciprocals of the values in `x` (but doesn't assign those values to any variable):

In [13]:
1/x

The following code creates a vector `y` with 11 entries consisting of two copies of `x` with a zero in the middle place.

In [14]:
y <- c(x, 0, x)
y
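
Arithmetic on a vector is applied element by element. The following sketch illustrates this with the same five numbers as `x` (the names `doubled` and `shifted` are made up for the example):

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)   # the same vector as above

doubled <- x * 2          # each element multiplied by 2
shifted <- x - mean(x)    # each element minus the overall mean

doubled
shifted
```

Both results have the same length as `x`, because the single number on the right-hand side is re-used for every element.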

### Some other operations on vectors
- `max(x)` and `min(x)` select the largest and smallest elements of `x`, respectively
- `length(x)` is the number of elements in `x`
- `sum(x)` gives the total of the elements in `x`
- `mean(x)` gives the mean (sum divided by length)
- `sd(x)` gives the standard deviation

In [15]:
# For example
min(x)
max(x)
length(x)
sum(x)
mean(x)
sd(x)

Help is also available for these, e.g. --

In [16]:
?mean

In the help file you'll see that you can specify more than just `mean(x)`. 

For example, you might enter `mean(x, trim=0.1)`. This would take the mean after removing 10% of the observations from each end of the sorted data. Like the median, a trimmed mean is sometimes used to reduce the effect of outliers when giving an estimate of the average.

In [17]:
#for example, suppose data are all 1 except for one, which is 1000
y<-c(1,1,1,1,1,1,1,1,1,1000)
mean(y)

This is perhaps not a good description of the average in the data (e.g. if 1000 is infeasible). If we trim 20% of the data from both the top and the bottom (i.e. remove the two smallest and the two largest values) then we get:

In [18]:
mean(y, trim=0.2)

The main point of this is that a function such as `mean()` can have more than one *argument*. We gave `mean()` two arguments: `y` and `trim=0.2`.
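
Arguments can be given by position or by name. The sketch below shows three equivalent calls (the result names `by_position`, `by_name` and `reordered` are invented for this illustration; the argument names `x` and `trim` are those listed in `?mean`):

```r
y <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1000)   # same data as above

by_position <- mean(y, 0.2)            # second argument matched by position
by_name <- mean(x = y, trim = 0.2)     # both arguments named explicitly
reordered <- mean(trim = 0.2, x = y)   # named arguments can come in any order

by_position
by_name
reordered
```

Naming the arguments takes a little more typing, but usually makes the code easier to read later.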

## Logical variables

Another useful feature is logical quantities. The elements of a logical vector can have the values `TRUE`, `FALSE` (or `NA` for "not available"). 


Logical vectors are generated by *conditions*. The following expression, for example, sets `temp` as a vector of the same length as `x` with values `FALSE` corresponding to elements of `x` where the condition is not met and `TRUE` where it is:

In [19]:
temp <- x > 13
temp
x

i.e. only the last element of x is greater than 13.
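
Logical values are treated as 1 (`TRUE`) and 0 (`FALSE`) in arithmetic, which makes it easy to count how many elements meet a condition. A short sketch (the names `n_above` and `where_above` are made up for the example):

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
temp <- x > 13

n_above <- sum(temp)          # number of TRUE values: 1
where_above <- which(temp)    # position of the TRUE value: 5

n_above
where_above
```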

The basic logical operators are `<`, `<=`, `>`, `>=`, `==` for exact equality, and `!=` for inequality


In [20]:
#For example, does 2 equal 3?
2 == 3

In [21]:
# Is 2 NOT EQUAL to 3?
2!=3

**TASK FOR YOU** Write code to test whether any element of `x` equals 3.1.

In [22]:
# try to write the command here--

## Missing values

Sometimes there will be missing values in your data. When an element or value is "not available" or a "missing value" you can reserve a place for it within a vector by assigning it the special value `NA`. In general, any operation on an `NA` becomes an `NA`, e.g. `NA + 1` gives `NA`. The function `is.na(x)` gives a logical vector of the same size as `x` with value `TRUE` if and only if the corresponding element in `x` is `NA`:


In [23]:
z <- c(1:3,NA)

z

ind <- is.na(z)

ind

There is also a second kind of "missing" value, the `NaN` or not-a-number, which is produced by a numerical computation that cannot be sensibly performed:

In [24]:
0/0
Inf - Inf
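
Because operations on `NA` give `NA`, summary functions such as `mean()` return `NA` when any value is missing. Many of them accept an `na.rm = TRUE` argument to drop missing values first. A small sketch (the names `m_all` and `m_obs` are invented for this example):

```r
z <- c(1:3, NA)

m_all <- mean(z)                  # NA, because one element is missing
m_obs <- mean(z, na.rm = TRUE)    # 2, the mean of 1, 2 and 3

m_all
m_obs

is.na(0/0)    # TRUE: is.na() also flags NaN
is.nan(NA)    # FALSE: is.nan() is TRUE only for NaN
```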

## Character vectors

Character quantities and character vectors are used frequently in R, for example as plot labels. They're defined by a sequence of characters inside double quotes, for example:

In [25]:
"x-values"
"New iteration results"

As with numbers, character vectors may be concatenated into a vector using the `c()` function. 


In [26]:
labs <- c("Label 1", "Label 2")
labs
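
Labels like these can also be built rather than typed out one by one. One possible approach, sketched here, uses the base R functions `paste()` and `paste0()` (the names `labs2` and `codes` are made up for the example):

```r
# paste() joins its arguments with a space in between
labs2 <- paste("Label", 1:2)    # "Label 1" "Label 2"

# paste0() joins them with no separator
codes <- paste0("L", 1:3)       # "L1" "L2" "L3"

labs2
codes
```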

## Index vectors; selecting and modifying subsets of a data set

Subsets of the elements of a vector may be selected by appending to the name of the vector an *index vector* in square brackets.  There are different ways to do this. We consider two.

### (1) A logical vector

In this case the index vector should be the same length as the vector from which elements are to be selected. Values corresponding to `TRUE` in the index vector are selected and those corresponding to `FALSE` are omitted. For example, the following expression creates an object `y` which contains the non-missing values of `x`, in the same order. Note that if `x` has missing values, `y` is shorter than `x`.

In [27]:
y <- x[!is.na(x)]
y

### (2) A vector of the positions wanted

For example:

In [28]:
x[2]    # The second component of x
x[1:3] # Selects the first 3 elements of x 
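
Index vectors can also be used to leave elements out, or to modify part of a vector. The sketch below uses a made-up vector `v` (not part of the tutorial data) so that `x` is left unchanged:

```r
v <- c(10.4, 5.6, NA, 6.4, 21.7)   # hypothetical vector with one missing value

dropped <- v[-1]        # a negative index drops that position (here the first)
picked <- v[c(1, 5)]    # positions 1 and 5

v[is.na(v)] <- 0        # replace the missing value with 0

dropped
picked
v
```

Replacing missing values with 0 is only for illustration; whether that is sensible depends on your data.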

## Other data structures

Vectors are an important type of object in R, and if you understand the above then it is easier to work with other structures. These two are quite useful to understand:

- *Data frames* are matrix-like structures in which the columns can be of different types. Think of data frames as 'data matrices' with one row per observational unit but with (possibly) both numerical and categorical variables. Many experiments are best described by data frames: the treatments are categorical but the response is numeric. See [Data frames](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Data-frames).

- *Factors* provide compact ways to handle categorical data. See [Factors](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors). 
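
As a minimal sketch of both structures, using invented data (the names `treatment`, `response` and `trial`, and all the values, are assumptions for illustration only):

```r
# A factor holding a categorical variable
treatment <- factor(c("placebo", "drug", "drug", "placebo"))

# A numeric response for the same four units
response <- c(1.2, 3.4, 2.9, 0.8)

# A data frame with one row per unit and columns of different types
trial <- data.frame(treatment = treatment, response = response)

levels(trial$treatment)   # the categories: "drug" "placebo"
table(trial$treatment)    # number of rows in each category
str(trial)                # compact description of the data frame
```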

# A sample R session

The code in this walkthrough introduces you to various features of R for some basic data analysis. To begin with, we will use R to make up some data (simulation). We'll generate two random normal vectors of x- and y-coordinates where y = x + error

In [29]:
# the first data vector is 20 samples from a standard Normal distribution (mean 0, sd 1)
myvar1 <- rnorm(20)

# the second data vector is the first data vector plus 20 new, independent standard Normal samples (the error)
myvar2 <- myvar1 + rnorm(20)

# then we can put them into a data frame 
dummy <- data.frame(x=myvar1, y=myvar2)

#and print out so you can see the data we made up
dummy

| x `<dbl>` | y `<dbl>` |
|---|---|
| -1.16869874 | -1.36203218 |
| -2.33064002 | -4.23269642 |
| -1.05494695 | -2.21824381 |
| -3.11050427 | -5.16398949 |
| -0.6770887 | -1.63478504 |
| 1.06818944 | 0.75879851 |
| -1.33968659 | 0.33794906 |
| -0.58182184 | -0.49978789 |
| -1.35193161 | -1.60808181 |
| 0.23153044 | 1.66417573 |


Every time you run this, you'll get different numbers in your table (a new simulation is run each time).
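
If you want the same random numbers each time (for example, so that a report can be reproduced exactly), one option is to set the random seed with `set.seed()` before simulating. A sketch, where the seed value 123 and the names `first_run` and `second_run` are arbitrary choices for this example:

```r
set.seed(123)            # 123 is an arbitrary choice of seed
first_run <- rnorm(5)

set.seed(123)            # reset to the same seed
second_run <- rnorm(5)

# The two sets of random numbers are exactly the same
identical(first_run, second_run)   # TRUE
```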

Let's get a quick summary of these data

In [30]:
# A quick way to get summary statistics of columns in the data frame
summary(dummy)

       x                 y          
 Min.   :-3.1105   Min.   :-5.1640  
 1st Qu.:-1.2114   1st Qu.:-1.6514  
 Median :-0.6295   Median :-0.4377  
 Mean   :-0.5087   Mean   :-0.8484  
 3rd Qu.: 0.3438   3rd Qu.: 0.2739  
 Max.   : 1.9178   Max.   : 1.6642  

This tells us the minimum, maximum, median, mean, 1st quartile (75% of the data lie above this point) and 3rd quartile (25% of the data lie above this point) for the `x` and `y` columns that we made up.
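
The quartiles can also be computed directly with `quantile()`. The sketch below uses a small made-up vector (`vals`) rather than the simulated data, so that the answer is easy to check by eye:

```r
vals <- c(1, 2, 3, 4, 5)   # small made-up data set

# 1st and 3rd quartiles (the 25% and 75% points)
quartiles <- quantile(vals, c(0.25, 0.75))
quartiles   # 25% is 2, 75% is 4
```

The same call works on a column of a data frame, e.g. `quantile(dummy$x, c(0.25, 0.75))`.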

We can also do this one statistic at a time. For example:

In [31]:
#To access each column we use a dollar sign
#e.g. x
dummy$x

In [32]:
#another way to get mean of x
mean(dummy$x)

In [33]:
#mean of first 10 values of y
mean(dummy$y[1:10])

In [34]:
cor.test(dummy$x, dummy$y)


	Pearson's product-moment correlation

data:  dummy$x and dummy$y
t = 5.4441, df = 18, p-value = 3.587e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5319059 0.9127102
sample estimates:
      cor 
0.7887672 


In [35]:
cor.test(dummy$x, dummy$y, method="spearman")


	Spearman's rank correlation rho

data:  dummy$x and dummy$y
S = 378, p-value = 0.0005621
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7157895 
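
`cor.test()` returns an object whose components can be picked out individually with the dollar sign, which is handy if you want to re-use the numbers rather than read them off the printed output. A sketch on freshly simulated data (the seed and the names `sim_x`, `sim_y` and `res` are assumptions for this example, so the numbers will differ from those shown above):

```r
set.seed(1)   # arbitrary seed so the example is repeatable
sim_x <- rnorm(20)
sim_y <- sim_x + rnorm(20)

res <- cor.test(sim_x, sim_y)

res$estimate   # the sample correlation
res$p.value    # the p-value
res$conf.int   # the 95 percent confidence interval
```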


# Plots

One way to do plots is to generate a graphics file. We'll generate PNG files, but you can generate many other types in a similar way (e.g. PDF, EPS, JPEG; for example, use `pdf()` instead of `png()`, and see `?pdf` for further help).

In [36]:

#To save a graphic you first need to open the file (called 'plot1.png' here)
png('plot1.png')

##Then you make the chart.
##This is a standard R scatterplot
plot(dummy$x, dummy$y)

## Then - important - close the file using 'dev.off()'
dev.off()

Then show the chart 'plot1.png' in a markdown cell via (click edit to see the code for this if you see a chart already) --
![title](plot1.png)

(You can also use the plot in other software - it has been saved onto your cloud storage for the project.)

In [37]:
# Now let's repeat, but add some labels

#To save a graphic you first need to open the file (let's save it as a PDF this time, called 'plot1b.pdf')
pdf('plot1b.pdf')

##Then you make the chart.
##This is a standard R scatterplot, but now we've added some labels to the plot
plot(dummy$x, dummy$y, xlab="TEMPERATURE", ylab="CONSUMPTION OF CHOCOLATE ICE CREAM", main="I love ice cream when it is hot")

## Then - important - close the file using 'dev.off()'
dev.off()

So now you can produce a simple scatter plot in R and save it as a PNG or a PDF file. For example, you can import PNG files into Word, and you can open the PDF using a viewer such as Adobe Acrobat. The files will be saved in the cloud storage associated with this notebook.
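
The size of the saved chart can also be controlled when the file is opened. The following is a sketch only: the file name `plot_sized.pdf`, the dimensions and the plotted values are made up for illustration, and the file is written to a temporary directory so that nothing in your project is overwritten.

```r
# Hypothetical output file in R's temporary directory
plot_file <- file.path(tempdir(), "plot_sized.pdf")

# For pdf() the width and height are given in inches
pdf(plot_file, width = 6, height = 4)

# A simple scatterplot of made-up values
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")

# Close the file, as before
dev.off()

# Check that the file was written
file.exists(plot_file)
```

In the notebook you would normally give a plain file name (as in the cells above) so that the file appears alongside your project.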