Introduction to R

R is a language and environment for statistical computing and graphics. R and its libraries/packages implement a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and are highly extensible.

Vast capabilities, a wide range of statistical and graphical techniques
Excellent community support: mailing list, blogs, tutorials
Easy to extend by writing new functions

1. Lab Setup and basics in R

Option: Use RStudio (free and open-source integrated development environment for R)

Start the RStudio program
Open a new document and save the file.
The window in the upper-left is your R script. This is where you will write instructions for R to carry out.
The window in the lower-left is the R console. This is where results will be displayed.

Option: Use the terminal. This is what I am going to be using today:

Start an R session by typing R at the shell prompt:

End R session:

q()

1.1 R Basics

Let's start by getting comfortable in the R environment:

The user interacts with R by inputting commands at the prompt (>). For example, the sessionInfo() command reports the R version and the packages currently loaded. We can also ask R to do basic calculations for us:

1 + 1
[1] 2

Additional operators include -, *, /, and ^ (exponentiation). As an example, the following command calculates the (approximate) area of a circle with a radius of 2:

3.14 * 2^2
[1] 12.56
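As a small sketch of the remaining operators and of operator precedence (the numbers here are illustrative and not part of the lab data):

```r
# Subtraction, multiplication, division, exponentiation
diff_val = 7 - 2        # 5
prod_val = 7 * 2        # 14
quot_val = 7 / 2        # 3.5
pow_val  = 7 ^ 2        # 49

# ^ binds tighter than *, so these two are the same
area_a = 3.14 * 2^2
area_b = 3.14 * (2^2)

# Parentheses change the order of evaluation
area_c = (3.14 * 2)^2   # 6.28 squared, not the area of the circle

print(c(diff_val, prod_val, quot_val, pow_val))
print(c(area_a, area_b, area_c))
```

When in doubt about precedence, adding parentheses is usually the safest choice.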

1.1.1 Installing and Loading Libraries

Install and load a package. From experience, if you receive an error such as could not find function, make sure you have loaded the library! (I have made this mistake one too many times.)

install.packages("ggplot2")
library(ggplot2)

Alternative: In RStudio, go to the "Packages" tab and click the "Install" button. Search in the pop-up window and click "Install".

1.1.2 Using R help

help(help)
help(sqrt)
?sqrt

1.2 Variables

You can create variables in R: named units that store values. You can refer to these variables later to examine or use their stored values:

r = 2

In the above command, I created a variable named r, and assigned the value 2 to it (using the = operator). Note that the above command didn't prompt R to generate any output; the assignment happens silently. However, I can now call on r to check its stored value:

r
[1] 2

I can use stored variables for future operations:

3.14 * r^2
[1] 12.56

R has some built-in variables that we can directly make use of. For example, the pi variable stores a more accurate version of the constant $\pi$ than our 3.14:

pi
[1] 3.141593

Now, can you make sense of the following operations (notice how I can change the value stored in r with a new assignment operation):

r = 3
area = pi * r^2
area
[1] 28.27433

Lastly, R can use and handle other "classes" of values than just numbers. For example, character strings:

circle = "cake"
circle
[1] "cake"
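To check which class a value belongs to, the class() function can be used. A minimal sketch, where the variable names radius and is_round are made up for illustration:

```r
radius   = 2        # a number
circle   = "cake"   # a character string
is_round = TRUE     # a logical value (hypothetical example)

class(radius)     # "numeric"
class(circle)     # "character"
class(is_round)   # "logical"
```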

Question: try the following command in R:

circle = cake

Does it run successfully? What is the problem?

1.3 R Functions

Functions are conveniently packaged operations that take zero or more inputs and generate the desired outcome. We use a couple of examples to illustrate the concept of R functions. The first one, the very basic c() function, combines values into a vector:

c(1, 2, 3)
[1] 1 2 3

Notice that you call functions by providing parameters (values in the parentheses) as input. They then (most times) return values as output. You can, of course, use variables as input, or assign the returned value to new variables. Imagine two researchers individually collected sample measurements of the same population, and now would like to combine their data. They can do so with:

samples1 = c(3, 4, 2, 4, 7, 5, 5, 6, 3, 2)
samples2 = c(2, 3)
samples_all = c(samples1, samples2)
samples_all
 [1] 3 4 2 4 7 5 5 6 3 2 2 3
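Once values are in a vector, other functions can operate on the whole vector at once. A brief sketch using the combined samples from above (re-entered here so the block runs on its own):

```r
samples_all = c(3, 4, 2, 4, 7, 5, 5, 6, 3, 2, 2, 3)

length(samples_all)   # number of elements: 12
sum(samples_all)      # 46
mean(samples_all)     # 46 / 12
samples_all * 2       # arithmetic is applied element by element
samples_all[1:3]      # first three elements: 3 4 2
```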

The second example, t.test(), does exactly what its name suggests: it performs a t-test between two vectors, to see if the difference in their means is statistically significant:

t.test(samples1, samples2)

	Welch Two Sample t-test

data:  samples1 and samples2
t = 2.2047, df = 3.9065, p-value = 0.09379
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4340879  3.6340879
sample estimates:
mean of x mean of y 
      4.1       2.5

Certain function parameters have names, and you can explicitly invoke them during function calls. For example, here you will notice that the test performed is a two-sided test. What if we wanted to perform a one-sided test, to see if the average of samples1 is significantly higher than that of samples2? For this, we can invoke the alternative parameter in t.test(), which lets us select one of the options ("two.sided", "less", or "greater"), depending on the alternative hypothesis we are interested in.

t.test(x = samples1, y = samples2, alternative = "greater")

	Welch Two Sample t-test

data:  samples1 and samples2
t = 2.2047, df = 3.9065, p-value = 0.04689
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.04217443        Inf
sample estimates:
mean of x mean of y 
      4.1       2.5

You can check the full list of parameters for functions in R with the command ? + function name. For example ?t.test gives you the full documentation for the function t.test.
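The printed report is convenient to read, but the value returned by t.test() can also be stored and inspected piece by piece. A minimal sketch, assuming the same two sample vectors as above (the name res is arbitrary):

```r
samples1 = c(3, 4, 2, 4, 7, 5, 5, 6, 3, 2)
samples2 = c(2, 3)

# Store the result instead of printing it
res = t.test(x = samples1, y = samples2, alternative = "greater")

names(res)     # components available in the result
res$p.value    # the one-sided p-value shown in the report above
res$estimate   # the two sample means
```

This is handy when a p-value or estimate needs to be reused in later calculations rather than copied by hand.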

1.4 Writing R functions

The functions we used so far are built-in. Just like variables, we can also create our own functions, by invoking the function keyword.

area_circle = function(r) {
    return(pi * r^2)
}
area_circle(r = 3)
[1] 28.27433
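Parameters of your own functions can also be given default values. The sketch below is a hypothetical variant of area_circle (named area_circle_default here to keep it separate) in which the radius defaults to 1:

```r
# r defaults to 1 when the caller does not supply it (illustrative default)
area_circle_default = function(r = 1) {
    return(pi * r^2)
}

area_circle_default()           # uses the default radius, so returns pi
area_circle_default(r = 3)      # same result as area_circle(r = 3)
area_circle_default(c(1, 2))    # works element by element on a vector
```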

Question: study the following two functions, aimed at calculating the overall mean of samples collected by two separate researchers.
- What happened in each function?
- What are their differences?
- Which one is better?

overall_mean1 = function(samples1, samples2) {
    samples_all = c(samples1, samples2)
    return(mean(samples_all))
}
overall_mean2 = function(samples1, samples2) {
    mean1 = mean(samples1)
    mean2 = mean(samples2)
    return((mean1 + mean2) / 2)
}

Hint: imagine the following scenarios:
- If the first researcher collected a lot more samples than the second one, which way is better?
- If the first researcher collected a lot more samples than the second one, but their experimental protocol is flawed, leading to overestimation of measurements, which way is better?
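To make the scenarios concrete, the two calculations can be compared on the unequal-sized samples used earlier. This is only a sketch; the names pooled and averaged are made up for illustration:

```r
samples1 = c(3, 4, 2, 4, 7, 5, 5, 6, 3, 2)   # 10 measurements
samples2 = c(2, 3)                            # 2 measurements

pooled   = mean(c(samples1, samples2))              # what overall_mean1 computes
averaged = (mean(samples1) + mean(samples2)) / 2    # what overall_mean2 computes

pooled     # 46 / 12: every measurement counts equally
averaged   # 3.3: every researcher counts equally

# The pooled mean equals a weighted mean of the two group means,
# with weights equal to the group sizes
weighted.mean(c(mean(samples1), mean(samples2)),
              w = c(length(samples1), length(samples2)))
```

The two numbers differ because the first researcher contributes five times as many measurements as the second.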

1.5 Inspecting data

We will use an example project of the most popular baby names in the United States and the United Kingdom. A cleaned and merged version of the data file is available at http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv.

In order to read data from a file, you have to know what kind of file it is.

?read.csv

Q. What would you use for other file types?
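As one possible answer for tab-separated text, base R provides read.delim() and the more general read.table(). The sketch below writes a tiny made-up file to a temporary location first, so it does not depend on any real data:

```r
# Write a small tab-separated file to a temporary location (toy data)
tmp = tempfile(fileext = ".tsv")
writeLines(c("name\tcount", "alice\t3", "bob\t5"), tmp)

# read.delim() assumes tab-separated values with a header row
toy = read.delim(tmp)

# read.table() is the general form: state the header and separator explicitly
toy2 = read.table(tmp, header = TRUE, sep = "\t")

toy
all.equal(toy, toy2)   # compare the two results
```

Other formats (for example Excel or SPSS files) generally need add-on packages, which are not covered here.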

Read in the file and assign the result to the name baby.names.

baby.names = read.csv(file="https://www.dropbox.com/s/kr76pha2p82snj4/babyNames.csv?dl=1")

Look at the first few rows (head() shows 6 by default):

head(baby.names)

What kind of object is our variable?

class(baby.names)
str(baby.names)

2. Working with Data in R

Here we will work further with the baby.names data that you loaded above.

2.1 Working with data.frames in R

Usually, data read into R will be stored as a data.frame.

A data.frame is a list of vectors of equal length
Each vector in the list forms a column
Each column can be a different type of vector
Typically columns are variables and the rows are observations

A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order).
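These properties can be seen on a small hand-made data.frame. The column names and values below are invented purely for illustration:

```r
toy = data.frame(
    name  = c("ann", "bob", "cy"),   # character column
    count = c(10, 20, 30),           # numeric column
    girl  = c(TRUE, FALSE, FALSE)    # logical column
)

dim(toy)             # rows first, then columns: 3 3
nrow(toy)            # 3
ncol(toy)            # 3
sapply(toy, class)   # each column has its own type
is.list(toy)         # TRUE: a data.frame is a list of columns
```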

Check the dimensions of the data.frame then extract the first three rows of the data.frame.

dim(baby.names)
baby.names[1:3,]

Extract the first three columns of the data.frame.

baby.names[,1:3]

Output a specific column of the data.frame.

baby.names$Name

Output only the unique values within a column.

unique(baby.names$Name)

2.2 Summarize data in a data.frame

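Using the base function sum(): how many rows list the name "Jill"? Note that this counts rows, not babies (the number of babies per row is in the Count column), and that the comparison is case-sensitive.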

sum(baby.names$Name == "Jill")

Extract rows where Name == "Jill". This can also be done with the subset command in R.

baby.names[baby.names$Name == "Jill",]
subset(baby.names, Name == "Jill")

Relational and logical operators

Operator	Meaning
`==`	equal to
`!=`	not equal to
`>`	greater than
`>=`	greater than or equal to
`<`	less than
`<=`	less than or equal to
`%in%`	contained in
`&`	and
`|`	or
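A short sketch of how these operators combine when subsetting rows. The data.frame below is a made-up toy example, not the baby names data:

```r
toy = data.frame(
    Name  = c("ann", "bob", "cy", "dee"),
    Year  = c(1999, 2004, 2010, 2001),
    Count = c(12, 7, 30, 5)
)

toy$Year >= 2001                           # logical vector, one value per row
toy[toy$Year >= 2001 & toy$Count > 6, ]    # both conditions must hold
toy[toy$Year < 2000 | toy$Count > 20, ]    # either condition may hold
toy[toy$Name %in% c("ann", "dee"), ]       # membership test
sum(toy$Name != "bob")                     # TRUE counts as 1, so this counts rows
```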

Exercise # 1:

How many female babies are listed in the table?

How many babies were born after 2003? Save the subset in a new dataframe.

2.3 Adding columns

Add a new column specifying the country.

head(baby.names)
table(baby.names$Location)

Output:

> head(baby.names)
           Location Year    Sex    Name Count  Percent Name.length
1 England and Wales 1996 Female  sophie  7087 2.394273           6
2 England and Wales 1996 Female   chloe  6824 2.305421           5
3 England and Wales 1996 Female jessica  6711 2.267245           7
4 England and Wales 1996 Female   emily  6415 2.167244           5
5 England and Wales 1996 Female  lauren  6299 2.128055           6
6 England and Wales 1996 Female  hannah  5916 1.998662           6

> table(baby.names$Location)

               AK                AL                AR                AZ 
             8685             31652             23279             42775 
               CA                CO                CT                DC 
           133257             35181             21526             11074 
               DE England and Wales                FL                GA 
             8625            227449             77218             58715 
               HI                IA                ID                IL 
            12072             22748             16330             66576 
               IN                KS                KY                LA 
            40200             24548             27041             35370 
               MA                MD                ME                MI 
            33190             35279             10030             51601 
               MN                MO                MS                MT 
            32946             37078             24520              9759 
               NC                ND                NE                NH 
            51874              8470             17549              9806 
               NJ                NM                NV                NY 
            47219             18673             21894             89115 
               OH                OK                OR                PA 
            55633             29857             27524             53943 
               RI                SC                SD                TN 
             9020             29738              9687             39714 
               TX                UT                VA                VT 
           113754             29828             44859              5519 
               WA                WI                WV                WY 
            41231             32858             13726              5786

baby.names$Country = "US"
baby.names$Country[baby.names$Location == "England and Wales"] = "UK"
head(baby.names)

Output:

> head(baby.names)
           Location Year    Sex    Name Count  Percent Name.length Country
1 England and Wales 1996 Female  sophie  7087 2.394273           6      UK
2 England and Wales 1996 Female   chloe  6824 2.305421           5      UK
3 England and Wales 1996 Female jessica  6711 2.267245           7      UK
4 England and Wales 1996 Female   emily  6415 2.167244           5      UK
5 England and Wales 1996 Female  lauren  6299 2.128055           6      UK
6 England and Wales 1996 Female  hannah  5916 1.998662           6      UK

table(baby.names$Country)

2.4 Replacing data entries

Especially when it comes to metadata, some cleaning may be needed before you run statistics. Here let's take a look at the Sex column.

table(baby.names$Sex)

Do you notice any discrepancy in the output?

If we ran statistics on this column, the inconsistent coding of males would be treated as separate categories. To fix the column you can run:

baby.names$Sex = gsub("M$", "Male", baby.names$Sex)

Why do we need the $ sign? What happens if we omit it?

Check the output table again.
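The effect of the $ anchor can be tried safely on a small toy vector before touching the real column. The vector below is invented for illustration and is not the actual Sex column:

```r
sex = c("Female", "Male", "M", "M", "Male")   # toy vector

with_anchor    = gsub("M$", "Male", sex)   # $ anchors the match at the end of the string
without_anchor = gsub("M", "Male", sex)    # every M is replaced, wherever it occurs

with_anchor
without_anchor

# An alternative that avoids regular expressions altogether
sex[sex == "M"] = "Male"
table(sex)
```

Testing a pattern on a few representative values first is a cheap way to avoid corrupting a column.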

3. Exporting data

Now that we have made some changes to our data set, we might want to save those changes to a file.

3.1 Save the output as a csv file

getwd() # Check current working directory. Is this where you want to save your file?
setwd("/home/hutlab_public/Desktop") # Change the current working directory 
getwd()
dir.create("R_Tutorial") # Create a new directory
setwd("/home/hutlab_public/Desktop/R_Tutorial")
write.csv(baby.names, file="babyNames_v2.csv")

How would you save other file formats?
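For tab-separated output, one option is write.table(). A minimal sketch with a toy data.frame and a temporary file (the file name toy.tsv is hypothetical):

```r
toy = data.frame(name = c("ann", "bob"), count = c(10, 20))
out = file.path(tempdir(), "toy.tsv")   # temporary location

# Tab-separated, no row names, no quotes around strings
write.table(toy, file = out, sep = "\t", row.names = FALSE, quote = FALSE)

readLines(out)   # inspect the raw lines that were written
```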

Locate and open the file outside of R.

3.2 Save the output as an R object

save(baby.names, file="babyNames.Rdata")

How do you load an R object?

?load
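A round trip with save() and load() might look like the sketch below, which uses a toy object and temporary files rather than the baby names data. The saveRDS()/readRDS() pair is shown as an alternative for storing a single object:

```r
toy = data.frame(name = c("ann", "bob"), count = c(10, 20))
rdata_file = tempfile(fileext = ".Rdata")

save(toy, file = rdata_file)
rm(toy)                       # remove the object from the workspace
loaded = load(rdata_file)     # restores the object under its original name
loaded                        # load() returns the names of the restored objects

# saveRDS()/readRDS() store one object and let you choose its name on reading
rds_file = tempfile(fileext = ".rds")
saveRDS(toy, rds_file)
toy_copy = readRDS(rds_file)
identical(toy, toy_copy)
```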

4. Basic statistics

Descriptive statistics of single variables are straightforward:

Find the mean of baby name lengths:

mean(baby.names$Name.length)

Find the median of baby name lengths:

median(baby.names$Name.length)

Find the standard deviation of baby name lengths:

sd(baby.names$Name.length)

Summarize the baby name lengths:

summary(baby.names$Name.length)

Exercise #3:

Which are the longest names?

Which are the shortest names?

summary(baby.names)

5. Simple graphs

5.1 Boxplots

Compare the length of baby names for boys and girls using a boxplot.

p = ggplot(data = baby.names, aes(x = Sex, y = Name.length)) +
        geom_boxplot()

ggsave(plot = p,
       filename = "basic_box_introR.png",
       width = 7, height = 6)

p

Adding color to the boxplot:

p2 = ggplot(baby.names, aes(x = Sex, y = Name.length, fill = Sex)) + 
         geom_boxplot() + 
         theme_bw() + 
         labs(y = "Length of Name")

ggsave(plot = p2,
       filename = "fill_box_introR.png", 
       width = 7, height = 6)

p3 = ggplot(baby.names, aes(x = Sex, y = Name.length, color = Sex)) + 
         geom_boxplot(lwd = 2) + 
         theme_bw() + 
         labs(y = "Length of Name")

ggsave(plot = p3,
       filename = "color_box_introR.png", 
       width = 7, height = 6)

p2

p3

Change the layout of the plot:

Add a plot title.
Add a title to the y-axis.
Change the color of the boxplot. Good places to look up color names are:
- http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
- http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3

5.2 Histograms

How many names were recorded for each year?

Check the timeframe that is included in the table.
How many records were obtained in total?

p4 = ggplot(baby.names, aes(x = Name.length)) + 
         geom_histogram()

ggsave(plot = p4,
       filename = "basic_histogram_introR.png", 
       width = 7, height = 6)

p4

Exercise # 4: Take a look at ?geom_histogram and change the layout of the plot. Googling works well as well!

tidyverse

The "tidyverse" is

a dialect of R
a collection of R packages with a shared design philosophy focused on manipulating data frames

One prominent feature of the tidyverse is the pipe: %>%. Pipes take input from their left and pass it to the function on their right. So these two lines of code will produce the same result:

f(x,y)

x %>% f(y)

This makes code more readable when chaining multiple operations performed on an input data frame.

The example command below takes the baby.names data, filters out the "England and Wales" observations, groups by Year and Sex, then computes the average name length by group, and arranges the result in descending order. This is done using several functions from the dplyr package from the tidyverse family.

library(dplyr)

baby.names %>%
    filter(Location != "England and Wales") %>% 
    group_by(Year, Sex) %>%
    summarise(mean_length = mean(Name.length)) %>%
    arrange(-mean_length)

You can see that the command ends up looking similar to the English sentence describing what it does. The final result shows that Females in 1989 had the longest names on average.

Base R introduced its own pipe in version 4.1.0, which looks like this: |>. There are some subtle differences in behavior, but for the most part the two pipes are interchangeable.
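A minimal sketch of the base pipe on a toy vector, assuming R 4.1.0 or later (no packages needed):

```r
x = c(1, 4, 9)

nested = sum(sqrt(x))            # read inside-out
piped  = x |> sqrt() |> sum()    # read left to right

nested   # 6
piped    # 6
```

Both forms call the same functions in the same order; the piped version simply lists them in the order they are applied.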

Other tutorials using the baby names data: rpub