Skip to content
Switch branches/tags
Go to file
Cannot retrieve contributors at this time
515 lines (408 sloc) 18.4 KB
title: "Linear regression in R"
theme: readable
toc: yes
toc_depth: 3
toc: yes
toc_depth: 2
date: "April 24, 2016"
```{r setup, include = FALSE}
knitr::opts_chunk$set(cache = TRUE)
setwd('~/projects/BI-TECH-CP303/projects/project 1')
bikeshare = read.delim('./data/bikeshare_2015.tsv',
sep = '\t',
header = TRUE)
# Linear regression
In this tutorial we'll learn:
1. how to fit linear regression models
2. how to split data into test and train sets
3. how to tune our models and select features
## Data preparation
As always when you open R, start by reloading the packages you expect to use.
In the code below, we're loading the `dplyr` and `ggplot2` libraries.
We're working with the Capital Bikeshare again, so start by reading in
the data file.
```{r load-data, eval = FALSE}
# you might want to set the working directory first.
bikeshare = read.delim('bikeshare_2015.tsv',
sep = '\t',
header = TRUE)
## The `lm()` function
The function for creating a linear model in R is `lm()` and the primary
arguments are *formula* and *data*. Formulas in R are expressed with a tilde,
*e.g*`y ~ x`. Let's fit the model: $rentals = \beta_0 + \beta_1*crossing$.
```{r lm, fig.height = 7, fig.width = 5, fig.align = 'center'}
model = lm(num_rentals ~ crossing, data = bikeshare)
# view what is returned in the lm object
# get model output
# plot it
ggplot(bikeshare, aes(x = crossing, y = num_rentals)) +
geom_smooth(method = 'lm') +
geom_point(size = 3, alpha = 0.60) +
facet_grid(season ~. ) +
The `attributes()` function can be called on just about any object in R and it
returns a list of all the things inside. It's a great way to explore
objects and see what values are contained inside that could be used in other
analysis. For example, extracting the residuals via `model$residuals` is useful
if we want to print diagnostic plots like those above.
Run `summary()` on the `lm` object to see detailed results. The *Call*
section prints the model specification, and the *Residuals* section
contains a summary of the distribution of the errors. *Coefficients* section contains the estimated coefficients, standard errors, *t-* and *p-*values for each variable in the model. Our model ends up being `rentals = 1987 + 40*(crossings)`, which means that the average number of rentals is 1987 when there are no crosswalks, and the average increases by 40 rentals for every additional crosswalk within a quarter mile.
### Multivariate regression
```{r, eval = FALSE, include = FALSE}
# lets include the number of parking areas
ggplot(bikeshare, aes(y = no_empty_docks, x = no_bikes)) +
geom_smooth(method = 'lm', se = FALSE, color = 'black') +
geom_point(aes(color = num_rentals, size = num_rentals), alpha = 0.70) +
facet_grid(season ~. ) +
scale_colour_gradient(limits = c(2, 28540), low = 'blue', high = 'yellow') +
scale_size(range = c(0, 15)) +
We can fit regressions with multiple covariates the same way by adding variables
with the `+` sign. Let's fit another model that includes the number of crosswalks
and the number of parking lots as predictors:
```{r multi-lm}
model = lm(num_rentals ~ crossing + parking, data = bikeshare)
Let's try one more, this time we'll include `season`, a factor variable:
```{r lm-factor, fig.height = 4, fig.width = 4, fig.align = 'center'}
# lets include season this time
model = lm(num_rentals ~ crossing + parking + season, data = bikeshare)
ggplot(bikeshare, aes(x = season, y = num_rentals)) +
geom_boxplot() +
You might have noticed that one of the seasons is missing. By default R chooses a
reference group that is represented in the intercept term. In this case `Autumn`
is the reference group for the `season` variable because
by default R orders factors alphabetically. If you want to order the
`season` category differently, just specify the `levels` in the `factor` function.
```{r factor-relevel}
bikeshare$season = factor(bikeshare$season,
levels = c('Spring', 'Summer', 'Autumn', 'Winter'))
model = lm(num_rentals ~ crossing + season, data = bikeshare)
The interpretation of the new model is that stations without crosswalks or parking have an average of 2,687 rentals in the spring. Those same stations can expect an additional 232 rentals in the summer, and about 906 less in the fall and 2,116 less in the winter.
# The *caret* package
Remember last time when we did exploratory data analysis (EDA) with `ggpairs`
because it allowed us to view relationships between many variable simultaneously?
We're in a similar situation now because there are around 70 modeling variables
to choose from, so how do we start developing models?
Lucky for us there's the `caret` package (short for **c**lassification **a**nd
**re**gression **t**raining). `caret` is great for model development because it
integrates many modeling methods in R into one unified syntax. That means more reusable
code for us! *caret* contains helper functions that provide a unified
framework for data cleaning/splitting, model training, and comparison. I highly
recommend the
[optional reading](
this week which provides a great overview of the *caret* package.
```{r, eval = FALSE}
install.packages('caret', dependencies = TRUE)
set.seed(1234) # set a seed
Setting a seed in R insures that you get identical results each time you run
your code. Since re-sampling methods are inherently probabilistic, every time we
rerun them we'll get slightly different answers. Setting the seed to the same
number insures that we get identical randomness each time the code is run, and
that's helpful for debugging.
## Train and test data
Before analysis we'll divide data into train and
test sets. Check out
[this]( nice overview for more
details. The *training* set is typically about 75% of the data and is used for
all the model development. Once we have a model we're satisfied with, we use our
*testing* set, the other 25% to generate model predictions. Splitting the data
into the two groups, train and test, generates two types of errors, in-sample
and out-of-sample errors. *In-sample* errors are the errors derived from same
data the model was built with. *Out-of-sample* errors are derived from measuring
the error on a fresh data set. We are interested in the out-of-sample error
because this quantity represents how'd we'd expect the model to perform in the
future on brand new data.
Here's how to split the data with *caret*:
# select the training observations
in_train = createDataPartition(y = bikeshare$num_rentals,
p = 0.75, # 75% in train, 25% in test
list = FALSE)
head(in_train) # row indices of observations in the training set
train = bikeshare[in_train, ]
test = bikeshare[-in_train, ]
**Note:** I recommend doing all data processing and aggregation steps *before*
splitting out your train/test sets.
## Model training
Our workhorse function in the *caret* package in the `train` function. This
function can be used to evaluate performance parameters, choose optimal models
based on the values of those parameters, and estimate model performance. For
regression we can use it in place of the `lm()` function. Here's our last
regression model using the train function.
Now that you're familiar with how to specify model equations with the `~` character
you should recognize the model:
```{r lm-caret}
model_fit = train(num_rentals ~ crossing + parking + season,
data = train,
method = 'lm',
metric = 'RMSE')
# get predictions
out_of_sample_predictions = predict(model_fit, newdata = test)
# compare predictions against the observed values
errors = data.frame(predicted = out_of_sample_predictions,
observed = test$num_rentals,
error = out_of_sample_predictions - test$num_rentals)
# plot the out-of-sample errors
ggplot(data = errors, aes(x = predicted, y = observed)) +
geom_abline(aes(intercept = 0, slope = 1),
size = 3, alpha = 0.70, color = 'red') +
geom_point(size = 3, alpha = 0.80) +
ggtitle('out-of-sample errors') +
Our prediction accuracy is not so great for this model. The in-sample RMSE is
about 2,863 which means that on average the predictions are off by about 2,863
What happens if we build a model with all the variables in it? To do that,
I'm using the `select` verb from `dplyr` to remove variables that aren't
predictors, like the station name, id, and lat/long.
```{r full-model}
full_model = train(num_rentals ~ .,
data = select(train, -station, -id, -lat, -long),
method = 'lm')
# get predictions
out_of_sample_predictions = predict(full_model, newdata = test)
# compare predictions against the observed values
errors = data.frame(predicted = out_of_sample_predictions,
observed = test$num_rentals,
error = out_of_sample_predictions - test$num_rentals)
# plot the out-of-sample errors
ggplot(data = errors, aes(x = predicted, y = observed)) +
geom_abline(aes(intercept = 0, slope = 1),
size = 3, alpha = 0.70, color = 'red') +
geom_point(size = 3, alpha = 0.80) +
ggtitle('out-of-sample errors, full model') +
The in-sample RMSE is about 2,281, so definitely an improvement over the previous
model, but this model is really complex and probably not going to be usable by
Pronto. How can we reduce the complexity of the model, but maintain reasonable
predictive accuracy?
# Assignment 3
1. Try a couple different models based on the hypotheses you tested in the
first two assignments. Can you improve on the RMSE?
## Preprocessing
Shrinkage methods require that the predictors are normalized to be on the same
scale. We can accomplish this by centering and scaling the data. You center a
variable by subtracting the mean of the variable from from each observation. To
scale your observations you then divide the centered observation by the variable
standard deviation. Now the variable follows a standard normal distribution with
mean = 0 and standard deviation = 1.
The *caret* package has lots of convenient functions for
[preprocessing data](, check 'em
```{r, warning = FALSE}
full_model_scaled = train(num_rentals ~ .,
data = select(train, -station, -id, -lat, -long),
method = 'lm',
preProcess = c('center', 'scale'))
Coefficients estimated with normalized data have a different interpretation than
coefficients from un-normalized data. In this case when the data are scaled the
intercept has a better interpretation, it's the expected number of rentals when
all the predictors are at their average value. So, in this case, when all the
predictors are at their average values, we expect about 2813 rentals per season.
In the previous full-model we had an intercept of about -878.261, which could be
interpreted as the expected number of rentals when all the other predictors
have a value of 0. That's pretty unsatisfying for a couple reasons. First, we
can't have negative rentals! Second, for a lot of the predictors it doesn't make
sense to plug in 0's. What does it mean to have a duration of 0?
Centering and scaling fix the non-interpret ability of the previous models.
Since we divide by the standard deviation during scaling, the non-intercept
coefficients in the centered and scaled model can be interpreted as the
increase in $y$ associated with a 1 standard deviation increase in $x$.
# Model Selection
## Variable combination
A simple method to reduce model complexity is to combine some of the variables.
For example the data set contains a variable for *nightclub*, *pub* and *bar*,
likewise there's a variable for
*cafe*, *restaurant* and *fast_food*. Maybe we can retain information
and remove some variables.
```{r, warning = FALSE}
bikeshare$food = bikeshare$fast_food + bikeshare$restaurant + bikeshare$cafe
bikeshare$nightlife = bikeshare$bar + bikeshare$pub + bikeshare$nightclub
bikeshare$tourism =
bikeshare$tourism_artwork +
bikeshare$tourism_hotel +
bikeshare$tourism_information +
# save new modeling dataset in new variable
to_model =
bikeshare %>%
select(-station, -id, -lat, -long, -fast_food, -restaurant, -cafe, -bar,
-pub, -nightclub, -tourism_artwork, -tourism_hotel,
-tourism_information, -tourism_museum)
Try out your own categories, these are just a few to get you started.
We'll learn how to make categories computationally when we cover clustering.
We've change the data frame, don't forget to redefine the train and test sets!
```{r, warning = FALSE}
train = bikeshare[in_train, ]
test = bikeshare[-in_train, ]
train = select(train, -station, -id, -lat, -long)
test = select(test, -station, -id, -lat, -long)
# how does our new full-model compare?
full_model = train(num_rentals ~ .,
data = train,
method = 'lm')
## Subset selection
We haven't talked much about computational limitations yet, but it's a good
time to start. Selection methods can be *extremely* slow. Why? Because we have
$2^p = 2^{117}$ possible variable combinations. I recommend
doing some combining before trying these methods. I'll leave the combining
up to you, but to make sure these models can run in less than infinite time,
I'm going to remove a bunch of predictors so you get the idea.
```{r, warning = FALSE}
# forward selection
forward_model = train(num_rentals ~ .,
data = train,
method = 'leapForward',
preProcess = c('center', 'scale'),
# try models of size 1 - 23
tuneGrid = expand.grid(nvmax = 1:23),
trControl = trainControl(method = 'cv', number = 5))
# what does this return?
# what what should the number of variables, k, be?
# what metric was used?
# here's a handful of other useful plots and summaries
# compare all the models
plot(forward_model$finalModel, scale = 'adjr2')
# backward_selection
backward_model = train(num_rentals ~ .,
data = train,
method = 'leapBackward',
preProcess = c('center', 'scale'),
tuneGrid = expand.grid(nvmax = 1:23),
trControl = trainControl(method = 'cv', number = 5))
plot(backward_model$finalModel, scale = 'adjr2')
plot(varImp(backward_model, scale = TRUE))
# steps in both directions
hybrid_model = train(num_rentals ~ .,
data = train,
method = 'leapSeq',
preProcess = c('center', 'scale'),
tuneGrid = expand.grid(nvmax = 1:23),
trControl = trainControl(method = 'cv', number = 5))
plot(hybrid_model$finalModel, scale = 'adjr2')
## Shrinkage
### Ridge regression
# ridge regression
ridge_model = train(num_rentals ~ .,
data = train,
method = 'ridge',
preProcess = c('center', 'scale'),
tuneLength = 10,
# reducing the cv for speed
trControl = trainControl(method = 'cv', number = 5))
# get the coefficients for the model
# NOTE: shrinkage methods don't have intercept terms
ridge_coefs = predict(ridge_model$finalModel, type = 'coef', mode = 'norm')$coefficients
# ridge regression with variable selection
ridge_model2 = train(num_rentals ~ .,
data = train,
method = 'foba',
preProcess = c('center', 'scale'),
tuneLength = 10,
trControl = trainControl(method = 'cv', number = 5))
Selection, ridge regression, and lasso are just a couple techniques at our
disposal for decreasing our model size. See
[this page]( for
a list of other available options to try out if you like.
### Lasso
lasso_model = train(num_rentals ~ .,
data = train,
method = 'lasso',
preProc = c('scale', 'center'),
tuneLength = 10,
trControl = trainControl(method = 'cv', number = 5))
# get the model coefficients
lasso_coefs = predict(lasso_model$finalModel, type = 'coef', mode = 'norm')$coefficients
# Measuring predictive accuracy
All right, now we've got a nice collection of models. Which one should we
results = resamples(list(forward_selection = forward_model,
backward_selection = backward_model,
hybrid_selection = hybrid_model,
ridge_regression = ridge_model,
lasso_regeression = lasso_model))
# compare RMSE and R-squared
# plot results
Those are in-sample statistics however, so if we want to compare the model's
out-of-sample prediction accuracy, we need to compute the RMSE using the test
data we held out. Let's compare two models: backward selection and
backward_predictions = predict(backward_model, test)
sqrt(mean((backward_predictions - test$rentals)^2 , na.rm = TRUE))
lasso_predictions = predict(lasso_model, test)
sqrt(mean((lasso_predictions - test$rentals)^2 , na.rm = TRUE))
# Assignment 4