# echen/ggplot2-tutorial

### Subversion checkout URL

You can clone with
or
.
Quick introduction to ggplot2 (no knowledge of R assumed)
R
latest commit b7efe10658
authored
 Failed to load latest commit information. data Jan 17, 2012 README.md Oct 7, 2012 ggplot2-tutorial.R Jan 17, 2012

# Introduction

This is a bare-bones introduction to ggplot2, a visualization package in R. It assumes no knowledge of R.

There is also a literate programming version of this tutorial in `ggplot2-tutorial.R`.

# Preview

Given Fisher's iris data set and one simple command...

``````qplot(Sepal.Length, Petal.Length, data = iris, color = Species)
``````

...we can produce this plot of sepal length vs. petal length, colored by species.

# Installation

You can download R here. After installation, you can launch R in interactive mode by either typing `R` on the command line or opening the standard GUI (which should have been included in the download).

# R Basics

## Vectors

Vectors are a core data structure in R, and are created with `c()`. Elements in a vector must be of the same type.

``````numbers = c(23, 13, 5, 7, 31)
names = c("edwin", "alice", "bob")
``````

Elements are indexed starting at 1, and are accessed with `[]` notation.

``````numbers[1] # 23
names[1] # edwin
``````

## Data frames

Data frames are like matrices, but with named columns of different types (similar to database tables).

``````books = data.frame(
title = c("harry potter", "war and peace", "lord of the rings"), # column named "title"
author = c("rowling", "tolstoy", "tolkien"),
num_pages = c("350", "875", "500")
)
``````

You can access columns of a data frame with `\$`.

``````books\$title # c("harry potter", "war and peace", "lord of the rings")
books\$author[1] # "rowling"
``````

You can also create new columns with `\$`.

``````books\$num_bought_today = c(10, 5, 8)
books\$num_bought_yesterday = c(18, 13, 20)

books\$total_num_bought = books\$num_bought_today + books\$num_bought_yesterday
``````

Suppose you want to import a TSV file into R as a data frame.

For example, consider the `data/students.tsv` file (with columns describing each student's age, test score, and name).

``````13   100 alice
14   95  bob
13   82  eve
``````

We can import this file into R using `read.table()`.

``````students = read.table("data/students.tsv",
header = F, # file does not contain a header (`F` is short for `FALSE`),
# so we must manually specify column names
sep = "\t", # file is tab-delimited
col.names = c("age", "score", "name") # column names
)
``````

We can now access the different columns in the data frame with `students\$age`, `students\$score`, and `students\$name`.

For an example of a file in a different format, look at the `data/studentsWithHeader.tsv` file.

``````age,score,name
13,100,alice
14,95,bob
13,82,eve
``````

Here we have the same data, but now the file is comma-delimited and contains a header. We can import this file with

``````students = read.table("data/students.tsv",
sep = ",",
header = T  # first line contains column names, so we can
)               # immediately call `students\$age`
``````

(Note: there is also a `read.csv` function.)

## help

There are many more options that `read.table` can take. For a list of these, just type `help(read.table)` (or `?read.table`) at the prompt to access documentation.

``````# These work for other functions as well.
``````

# ggplot2

With these R basics in place, let's dive into the ggplot2 package.

## Installation

One of R's greatest strengths is its excellent set of packages. To install a package, you can use the `install.packages()` function.

``````install.packages("ggplot2")
``````

To load a package into your current R session, use `library()`.

``````library(ggplot2)
``````

## Scatterplots with qplot()

Let's look at how to create a scatterplot in ggplot2. We'll use the `iris` data frame that's automatically loaded into R.

What does the data frame contain? We can use the `head` function to look at the first few rows.

``````head(iris) # by default, head displays the first 6 rows. see `?head`
head(iris, n = 10) # we can also explicitly set the number of rows to display

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1         3.5          1.4         0.2  setosa
4.9         3.0          1.4         0.2  setosa
4.7         3.2          1.3         0.2  setosa
4.6         3.1          1.5         0.2  setosa
5.0         3.6          1.4         0.2  setosa
5.4         3.9          1.7         0.4  setosa
``````

(The data frame actually contains three types of species: setosa, versicolor, and virginica.)

Let's plot `Sepal.Length` against `Petal.Length` using ggplot2's `qplot()` function.

``````qplot(Sepal.Length, Petal.Length, data = iris)
# Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.
# * First argument `Sepal.Length` goes on the x-axis.
# * Second argument `Petal.Length` goes on the y-axis.
# * `data = iris` means to look for this data in the `iris` data frame.
``````

To see where each species is located in this graph, we can color each point by adding a `color = Species` argument.

``````qplot(Sepal.Length, Petal.Length, data = iris, color = Species) # dude!
``````

Similarly, we can let the size of each point denote petal width, by adding a `size = Petal.Width` argument.

``````qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)
# We see that Iris setosa flowers have the narrowest petals.
``````

``````qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width, alpha = I(0.7))
# By setting the alpha of each point to 0.7, we reduce the effects of overplotting.
``````

Finally, let's fix the axis labels and add a title to the plot.

``````qplot(Sepal.Length, Petal.Length, data = iris, color = Species,
xlab = "Sepal Length", ylab = "Petal Length",
main = "Sepal vs. Petal Length in Fisher's Iris data")
``````

## Other common geoms

In the scatterplot examples above, we implicitly used a point geom, the default when you supply two arguments to `qplot()`.

``````# These two invocations are equivalent.
qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")
qplot(Sepal.Length, Petal.Length, data = iris)
``````

But we can also easily use other types of geoms to create more kinds of plots.

### Barcharts: geom = "bar"

``````movies = data.frame(
director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
minutes = c(124, 163, 195, 600, 187)
)

# Plot the number of movies each director has.
qplot(director, data = movies, geom = "bar", ylab = "# movies")
# By default, the height of each bar is simply a count.
``````

``````# But we can also supply a different weight.
# Here the height of each bar is the total running time of the director's movies.
qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "total length (min.)")
``````

### Line charts: geom = "line"

``````qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species)
# Using a line geom doesn't really make sense here, but hey.
``````

``````# `Orange` is another built-in data frame that describes the growth of orange trees.
qplot(age, circumference, data = Orange, geom = "line",
colour = Tree,
main = "How does orange tree circumference vary with age?")
``````

``````# We can also plot both points and lines.
qplot(age, circumference, data = Orange, geom = c("point", "line"), colour = Tree)
``````

And that's it with what I'll cover.

# Next Steps

I skipped over a lot of aspects of R and ggplot2 in this intro.

For example,

• There are many geoms (and other functionalities) in ggplot2 that I didn't cover, e.g., boxplots and histograms.
• I didn't talk about ggplot2's layering system, or the grammar of graphics it's based on.

So I'll end with some additional resources on R and ggplot2.

Edwin Chen :: @echen :: http://blog.echen.me

Something went wrong with that request. Please try again.