Let’s explore a little data set from my life. I wanted to know how long it really takes me to get to work / home. So I tracked the minutes it took me to commute (door to door).
-
Start RStudio.
-
Create a new project (File → New Project… → New Directory → New Project).
-
Begin a new R Notebook (File → New File → R Notebook) and save it in the same folder where you saved your new project.
-
Download my data set in form of this csv file (click on ‘Raw’, then cmd + s to save the file) and save it in the same folder.
You can copy the code in the sections below into your R Notebook and execute it by, for example, pressing cmd + enter (mac) or ctrl + enter (windows) to execute the current line or selection.
library(tidyverse)
dd <- read_csv("commute.csv")
Let’s have a look at the data
dd
## # A tibble: 34 x 4
## direction date by minutes
## <chr> <date> <chr> <dbl>
## 1 home_work 2015-08-11 metro 51
## 2 home_work 2015-08-12 metro 49
## 3 home_work 2015-08-18 metro 50
## 4 home_work 2015-08-19 metro 45
## 5 home_work 2015-08-20 metro 60
## 6 home_work 2015-08-24 metro 41
## 7 home_work 2015-08-25 metro 46
## 8 home_work 2015-08-26 metro 46
## 9 home_work 2015-08-31 bike 50
## 10 home_work 2015-09-01 metro 41
## # … with 24 more rows
Ok, loading the data worked.
Let’s explore the data set!
How much time did it take me each day?
dd %>% ggplot(aes(date, minutes)) + geom_point()
(If the x-axis doesn’t look like this, try dd$date <- as.Date(dd$date)
to change the type of the date
column, then plot again.)
What does the distribution of commute times look like? Let’s make a histogram.
dd %>% ggplot(aes(minutes)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We notice the binwidth message. Let’s set the binwidth to 1 minute.
dd %>% ggplot(aes(minutes)) + geom_histogram(binwidth = 1)
Train or bike is saved in by
column. Let’s change the color in the
histogram based on the by
column.
dd %>% ggplot(aes(minutes, fill = by)) + geom_histogram(binwidth = 1)
Do you see how the blue-ish bars “bump up” the red bars (e.g., around 50 minutes)? There are two ways to get rid of that.
- Put bars next to each other
dd %>% ggplot(aes(minutes, fill = by)) + geom_histogram(binwidth = 1, position = "dodge")
That’s not good here because now the blue-ish metro bars are to the right of the red bike bars, so if we want to say which one is faster we just introduced a bias in favor of bike (by moving those bars to the left).
- Two separate histograms
dd %>% ggplot(aes(minutes)) + geom_histogram(binwidth = 1) + facet_grid(~by)
Wow, that is simple! I really like the facet_grid
command. Note that
both histograms use the same scales, that makes it good for comparison.
Actually, it would be better, if we had on histogram below the other.
All we have to do is moving by
to the left and add a dot (otherwise
you will get an error – because the ~
can’t be the last symbol, I
believe).
dd %>% ggplot(aes(minutes)) + geom_histogram(binwidth = 1) + facet_grid(by~.)
Perfect!
Summary:
- There is no clear winner, train and bike seem to be relatively similar on average.
- We see that I have way more records for train than bike commutes.
- Looking at the spread of the data points, I realize why I often hesitate for a moment when I say “I need 45 minutes to work”.
- Seems like train can’t get faster than 41 minutes but can have painful slow outliers.
We can arrive at this informative plot in only two lines:
dd <- read_csv("commute.csv")
dd %>% ggplot(aes(minutes)) + geom_histogram(binwidth = 1) + facet_grid(by~.)
The .
becomes
direction
dd %>% ggplot(aes(minutes)) + geom_histogram(binwidth = 1) + facet_grid(by ~ direction)
Interesting. The slow train rides where always from work to home (i.e., at night) while the really fast rides where from home to work (i.e., morning, ah well more like noon).
All plots stacked vertically as before?
dd %>% ggplot(aes(minutes)) + geom_histogram(binwidth = 1) + facet_grid(by + direction ~ .)
I like that. With this layout, I feel so much faster to “see the story” of these data points.
Let’s help the eye with some colors.
dd %>% ggplot(aes(minutes, fill = direction)) + geom_histogram(binwidth = 1) + facet_grid(by + direction ~ .)
Let’s help the eye with some nicer colors.
dd %>% ggplot(aes(minutes, fill = direction)) + geom_histogram(binwidth = 1) + facet_grid(by + direction ~ .) +
scale_fill_brewer(palette = "Set1")
I think it would be okay / reasonable to stop here. For the fun of it (aka to use ggplot), we can try to squeeze more out of the data set, but with limited prospects of new insights.
Let’s repeat the plot from the beginning
dd %>% ggplot(aes(date, minutes)) + geom_point()
In this plot I notice two things
- I recorded more often during the August–September time.
- The extreme cases with a long commute occured in the end of the time window.
To visualize #1 we can just make another histogram.
dd %>% ggplot(aes(date)) + geom_histogram(binwidth = 14)
I played with the width of the bins, and chose 2 weeks = 14 days. Ok, seems there was a time when I was most motivated to record my commutes.
To visualize #2 (and to come back to the original question: is there a trend over time?), let’s try to add a line.
dd %>% ggplot(aes(date, minutes)) + geom_point() + geom_line()
Hm, this connected the points with a line (definitely useful in other cases) but here we want something like an average line.
dd %>% ggplot(aes(date, minutes)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Great! This suggests that my commuting times where stable until October
and then got really worse. But careful, this line is already an
interpretation of our data. ggplot was actually fitting some local
regression (loess
) model in the background. If you are a modelling
person you probably have your own model for your data. For now let’s
simplify the fitted line to a correlation (-> linear regression model,
lm
).
dd %>% ggplot(aes(date, minutes)) + geom_point() + geom_smooth(method = lm)
Let’s explore the compositional power of ggplot again. Is there the trend the same for both, bike and train?
Earlier, we used fill
for the color of the bars, here we use colour
(the points and lines have no fill
property).
dd %>% ggplot(aes(date, minutes, colour = by)) + geom_point() + geom_smooth(method = lm)
Small change, big result!
(But let’s not overinterpret the meaning of these lines for this small data set.)
I think the gray error bands around the lines are important, but we get rid of them to keep things clean for this demonstration.
dd %>% ggplot(aes(date, minutes, colour = by)) + geom_point() + geom_smooth(method = lm, se = FALSE)
How about additionally splitting the data by direction
? We could use
the facets, as we did
above.
dd %>% ggplot(aes(date, minutes, colour = by)) + geom_point() + geom_smooth(method = lm, se = FALSE) +
facet_grid(direction~.)
If we want to have only one plot, we could, for each direction,
- Give different shapes to the points
- Give different styles to the lines
- Both
Here we go.
- Give different shapes to the points
dd %>%
ggplot(aes(date, minutes, colour = by, shape = direction)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
- Give different styles to the lines
dd %>%
ggplot(aes(date, minutes, colour = by, lty = direction)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
- Both
dd %>%
ggplot(aes(date, minutes, colour = by, shape = direction, lty = direction)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
Great, ggplot understands that shape
belongs to the points and lty
to the lines.
Hey, it would be nice to have an “empty” point instead of the triangle
for the work_home
data points. You can find a list of possible symbols
by googling “ggplot point
shapes”. We add
scale_shape_manual
which means that we manually define the codes for
all shapes that are used in the plot (here, 2 values, namely 19 (filled
point) and 1 (empty point)).
dd %>%
ggplot(aes(date, minutes, colour = by, shape = direction, lty = direction)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
scale_shape_manual(values = c(19, 1))
Let’s have another workshop sometime where I tell you about reshaping your data…