Permalink
Switch branches/tags
Nothing to show
Find file Copy path
a023938 Mar 30, 2016
3 contributors

Users who have contributed to this file

@mikelove @rafalab @alexnones
223 lines (162 sloc) 10.2 KB
layout title
page
Getting Started

Getting Started

In this book we will be using the R programming language for all our analysis. You will learn R and statistics simultaneously. However, we assume you have some basic programming skills and knowledge of R syntax. If you don't, your first homework, listed below, is to complete a tutorial. Here we give step-by-step instructions on how to get set up to follow along.

Installing R

The first step is to install R. You can download and install R from the Comprehensive R Archive Network (CRAN). It is relatively straightforward, but if you need further help you can try the following resources:

Installing RStudio

The next step is to install RStudio, a program for viewing and running R scripts. Technically you can run all the code shown here without installing RStudio, but we highly recommend this integrated development environment (IDE). Instructions are here and for Windows we have special instructions.

Learn R Basics

The first homework assignment is to complete an R tutorial to familiarize yourself with the basics of programming and R syntax. To follow this book you should be familiar with the difference between lists (including data frames) and numeric vectors, for-loops, how to create functions, and how to use the sapply and replicate functions.

If you are already familiar with R, you can skip to the next section. Otherwise, you should go through the swirl tutorial, which teaches you R programming and data science interactively, at your own pace and in the R console. Once you have R installed, you can install swirl and run it the following way:

install.packages("swirl")
library(swirl)
swirl()

Alternatively you can take the try R interactive class from Code School.

There are also many open and free resources and reference guides for R. Two examples are:

Two key things you need to know about R is that you can get help for a function using help or ?, like this:

?install.packages
help("install.packages")

and the hash character represents comments, so text following these characters is not interpreted:

##This is just a comment

Installing Packages

The first R command we will run is install.packages. If you took the swirl tutorial you should have already done this. R only includes a basic set of functions. It can do much more than this, but not everybody needs everything so we instead make some functions available via packages. Many of these functions are stored in CRAN. Note that these packages are vetted: they are checked for common errors and they must have a dedicated maintainer. You can easily install packages from within R if you know the name of the packages. As an example, we are going to install the package rafalib which we use in our first data analysis examples:

install.packages("rafalib")

We can then load the package into our R sessions using the library function:

library(rafalib)

From now on you will see that we sometimes load packages without installing them. This is because once you install the package, it remains in place and only needs to be loaded with library. If you try to load a package and get an error, it probably means you need to install it first.

Importing Data into R

The first step when preparing to analyze data is to read in the data into R. There are several ways to do this and we will discuss three of them. But you only need to learn one to follow along.

In the life sciences, small datasets such as the one used as an example in the next sections are typically stored as Excel files. Although there are R packages designed to read Excel (xls) format, you generally want to avoid this and save files as comma delimited (Comma-Separated Value/CSV) or tab delimited (Tab-Separated Value/TSV/TXT) files. These plain-text formats are often easier for sharing data with collaborators, as commercial software is not required for viewing or working with the data. We will start with a simple example dataset containing female mouse weights.

The first step is to find the file containing your data and know its path.

Paths and the Working Directory

When you are working in R it is useful to know your working directory. This is the directory or folder in which R will save or look for files by default. You can see your working directory by typing:

getwd()

You can also change your working directory using the function setwd. Or you can change it through RStudio by clicking on "Session".

The functions that read and write files (there are several in R) assume you mean to look for files or write files in the working directory. Our recommended approach for beginners will have you reading and writing to the working directory. However, you can also type the full path, which will work independently of the working directory.

Projects in RStudio

We find that the simplest way to organize yourself is to start a Project in RStudio (Click on "File" and "New Project"). When creating the project, you will select a folder to be associated with it. You can then download all your data into this folder. Your working directory will be this folder.

Option 1: Read file over the Internet

You can navigate to the femaleMiceWeights.csv file by visiting the data directory of dagdata on GitHub. If you navigate to the file, you need to click on Raw on the upper right hand corner of the page.

GitHub page screenshot.

Now you can copy and paste the URL and use this as the argument to read.csv. Here we break the URL into a base directory and a filename and then combine with paste0 because the URL would otherwise be too long for the page. We use paste0 because we want to put the strings together as is, if you were specifying a file on your machine you should use the smarter function, file.path, which knows the difference between Windows and Mac file path connectors. You can specify the URL using a single string to avoid this extra step.

dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
url <- paste0(dir, "femaleMiceWeights.csv")
dat <- read.csv(url)

Option 2: Download file with your browser to your working directory

There are reasons for wanting to keep a local copy of the file. For example, you may want to run the analysis while not connected to the Internet or you may want to ensure reproducibility regardless of the file being available on the original site. To download the file, as in option 1, you can navigate to the femaleMiceWeights.csv. In this option we use your browser's "Save As" function to ensure that the downloaded file is in a CSV format. Some browsers add an extra suffix to your filename by default. You do not want this. You want your file to be named femaleMiceWeights.csv. Once you have this file in your working directory, then you can simply read it in like this:

dat <- read.csv("femaleMiceWeights.csv")

If you did not receive any message, then you probably read in the file successfully.

Option 3: Download the file from within R

We store many of the datasets used here on GitHub. You can save these files directly from the Internet to your computer using R. In this example, we are using the download.file function in the downloader package to download the file to a specific location and then read it in. We can assign it a random name and a random directory using the function tempfile, but you can also save it in directory with the name of your choosing.

library(downloader) ##use install.packages to install
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "femaleMiceWeights.csv" 
url <- paste0(dir, filename)
if (!file.exists(filename)) download(url, destfile=filename)

We can then proceed as in option 2:

dat <- read.csv(filename)

Option 4: Download the data package (Advanced)

Many of the datasets we include in this book are available in custom-built packages from GitHub. The reason we use GitHub, rather than CRAN, is that on GitHub we do not have to vet packages, which gives us much more flexibility.

To install packages from GitHub you will need to install the devtools package:

install.packages("devtools")

Note to Windows users: to use devtools you will have to also install Rtools. In general you will need to install packages as administrator. One way to do this is to start R as administrator. If you do not have permission to do this, then it is a bit more complicated.

Now you are ready to install a package from GitHub. For this we use a different function:

library(devtools)
install_github("genomicsclass/dagdata")

The file we are working with is actually included in this package. Once you install the package, the file is on your computer. However, finding it requires advanced knowledge. Here are the lines of code:

dir <- system.file(package="dagdata") #extracts the location of package
list.files(dir)
list.files(file.path(dir,"extdata")) #external data is in this directory

And now we are ready to read in the file:

filename <- file.path(dir,"extdata/femaleMiceWeights.csv")
dat <- read.csv(filename)