---
title: "Linear regression in R"
output:
  html_document:
    theme: readable
    toc: yes
    toc_depth: 3
  pdf_document:
    toc: yes
    toc_depth: 2
date: "April 24, 2016"
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(cache = TRUE)
require(dplyr)
require(ggplot2)
require(GGally)
require(scales)
require(caret)

setwd('~/projects/BI-TECH-CP303/projects/project 1')

bikeshare = read.delim('./data/bikeshare_2015.tsv',
                       sep = '\t',
                       header = TRUE)
```
# Linear regression

In this tutorial we'll learn:

1. how to fit linear regression models
2. how to split data into test and train sets
3. how to tune our models and select features
## Data preparation

As always when you open R, start by reloading the packages you expect to use.
In the code below, we're loading the `dplyr` and `ggplot2` libraries.
We're working with the Capital Bikeshare data again, so start by reading in
the data file.
```{r load-data, eval = FALSE}
library(dplyr)
library(ggplot2)

# you might want to set the working directory first
setwd('/path/to/your/data')

bikeshare = read.delim('bikeshare_2015.tsv',
                       sep = '\t',
                       header = TRUE)
```
## The `lm()` function

The function for creating a linear model in R is `lm()`, and its primary
arguments are *formula* and *data*. Formulas in R are expressed with a tilde,
*e.g.* `y ~ x`. Let's fit the model: $rentals = \beta_0 + \beta_1 \times crossing$.
```{r lm, fig.height = 7, fig.width = 5, fig.align = 'center'}
model = lm(num_rentals ~ crossing, data = bikeshare)

# view what is returned in the lm object
attributes(model)

# get model output
summary(model)

# plot it
ggplot(bikeshare, aes(x = crossing, y = num_rentals)) +
  geom_smooth(method = 'lm') +
  geom_point(size = 3, alpha = 0.60) +
  facet_grid(season ~ .) +
  theme_minimal()
```
The `attributes()` function can be called on just about any object in R, and it
returns a list of everything inside. It's a great way to explore an object and
see which values it contains that could be used in other analyses. For example,
extracting the residuals via `model$residuals` is useful if we want to make
diagnostic plots.
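Here's a minimal sketch of that idea: a residuals-versus-fitted plot built
entirely from values stored inside the `lm` object.

```{r residuals-plot, fig.height = 4, fig.width = 4, fig.align = 'center'}
# pull the fitted values and residuals out of the lm object and plot
# them against each other to check for patterns in the errors
diagnostics = data.frame(fitted = model$fitted.values,
                         residuals = model$residuals)

ggplot(diagnostics, aes(x = fitted, y = residuals)) +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  geom_point(size = 3, alpha = 0.60) +
  theme_minimal()
```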
Run `summary()` on the `lm` object to see detailed results. The *Call*
section prints the model specification, and the *Residuals* section
contains a summary of the distribution of the errors. The *Coefficients*
section contains the estimated coefficients, standard errors, and *t*- and
*p*-values for each variable in the model. Our model ends up being
`rentals = 1987 + 40*(crossings)`, which means that the average number of
rentals is 1,987 when there are no crosswalks, and the average increases by 40
rentals for every additional crosswalk within a quarter mile.
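As a quick sanity check, here's a small sketch that reconstructs a prediction
by hand from the fitted coefficients, using a hypothetical station with 5
crosswalks nearby:

```{r coefs-by-hand}
# intercept + slope * number of crosswalks, pulled straight from the model
coefs = coef(model)
coefs[1] + coefs[2] * 5

# predict() gives the same answer
predict(model, newdata = data.frame(crossing = 5))
```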
### Multivariate regression

```{r, eval = FALSE, include = FALSE}
# let's include the number of parking areas
ggplot(bikeshare, aes(y = no_empty_docks, x = no_bikes)) +
  geom_smooth(method = 'lm', se = FALSE, color = 'black') +
  geom_point(aes(color = num_rentals, size = num_rentals), alpha = 0.70) +
  facet_grid(season ~ .) +
  scale_colour_gradient(limits = c(2, 28540), low = 'blue', high = 'yellow') +
  scale_size(range = c(0, 15)) +
  theme_minimal()
```
We can fit regressions with multiple covariates the same way, by adding
variables with the `+` sign. Let's fit another model that includes both the
number of crosswalks and the number of parking lots as predictors:

```{r multi-lm}
model = lm(num_rentals ~ crossing + parking, data = bikeshare)
summary(model)
```
Let's try one more. This time we'll include `season`, a factor variable:

```{r lm-factor, fig.height = 4, fig.width = 4, fig.align = 'center'}
# let's include season this time
model = lm(num_rentals ~ crossing + parking + season, data = bikeshare)
summary(model)

ggplot(bikeshare, aes(x = season, y = num_rentals)) +
  geom_boxplot() +
  theme_minimal()
```
You might have noticed that one of the seasons is missing from the output. By
default R chooses a reference group that is absorbed into the intercept term.
In this case `Autumn` is the reference group for the `season` variable because
by default R orders factor levels alphabetically. If you want to order the
`season` category differently, just specify the `levels` in the `factor()`
function.
```{r factor-relevel}
bikeshare$season = factor(bikeshare$season,
                          levels = c('Spring', 'Summer', 'Autumn', 'Winter'))

model = lm(num_rentals ~ crossing + season, data = bikeshare)
summary(model)
```
The interpretation of the new model is that stations with no crosswalks nearby
average about 2,687 rentals in the spring. Those same stations can expect an
additional 232 rentals in the summer, about 906 fewer in the fall, and 2,116
fewer in the winter.
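We can verify that reading with `predict()`. Here's a small sketch using
hypothetical stations with zero crosswalks, one per season:

```{r season-predictions}
# predicted rentals for a station with no crosswalks in each season
new_stations = data.frame(crossing = 0,
                          season = c('Spring', 'Summer', 'Autumn', 'Winter'))
predict(model, newdata = new_stations)
```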
# The *caret* package

Remember last time when we did exploratory data analysis (EDA) with `ggpairs`
because it allowed us to view relationships between many variables
simultaneously? We're in a similar situation now because there are around 70
modeling variables to choose from, so how do we start developing models?

Lucky for us there's the *caret* package (short for **c**lassification **a**nd
**re**gression **t**raining). *caret* is great for model development because it
integrates many of R's modeling methods into one unified syntax. That means
more reusable code for us! *caret* contains helper functions that provide a
unified framework for data cleaning/splitting, model training, and comparison.
I highly recommend the
[optional reading](https://github.com/erinshellman/BI-TECH-CP303/blob/master/reading/regression/v28i05.pdf)
this week, which provides a great overview of the *caret* package.
```{r, eval = FALSE}
install.packages('caret', dependencies = TRUE)
library(caret)

set.seed(1234) # set a seed
```
Setting a seed in R ensures that you get identical results each time you run
your code. Since re-sampling methods are inherently probabilistic, every time
we rerun them we'll get slightly different answers. Setting the seed to the
same number gives us identical randomness each time the code is run, and that's
helpful for debugging.
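You can see the effect with any random draw; resetting the seed reproduces it
exactly:

```{r seed-demo}
set.seed(1234)
sample(10, 3)

set.seed(1234)
sample(10, 3) # the identical draw, because we reset the seed
```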
## Train and test data

Before analysis we'll divide the data into train and test sets. Check out
[this](http://topepo.github.io/caret/training.html) nice overview for more
details. The *training* set is typically about 75% of the data and is used for
all the model development. Once we have a model we're satisfied with, we use
our *testing* set, the other 25%, to generate model predictions. Splitting the
data into these two groups gives us two types of errors, in-sample and
out-of-sample. *In-sample* errors are measured on the same data the model was
built with. *Out-of-sample* errors are measured on a fresh data set. We are
interested in the out-of-sample error because this quantity represents how
we'd expect the model to perform in the future on brand new data.
Here's how to split the data with *caret*:

```{r}
# select the training observations
in_train = createDataPartition(y = bikeshare$num_rentals,
                               p = 0.75, # 75% in train, 25% in test
                               list = FALSE)

head(in_train) # row indices of observations in the training set

train = bikeshare[in_train, ]
test = bikeshare[-in_train, ]

dim(train)
dim(test)
```
**Note:** I recommend doing all data processing and aggregation steps *before*
splitting out your train/test sets.
## Model training

Our workhorse function in the *caret* package is the `train()` function. It
can be used to evaluate performance parameters, choose optimal models based on
the values of those parameters, and estimate model performance. For regression
we can use it in place of the `lm()` function. Here's our last regression
model refit using `train()`. Now that you're familiar with specifying model
equations with the `~` character, you should recognize the model:
```{r lm-caret}
model_fit = train(num_rentals ~ crossing + parking + season,
                  data = train,
                  method = 'lm',
                  metric = 'RMSE')

print(model_fit)
summary(model_fit)

# get predictions
out_of_sample_predictions = predict(model_fit, newdata = test)

# compare predictions against the observed values
errors = data.frame(predicted = out_of_sample_predictions,
                    observed = test$num_rentals,
                    error = out_of_sample_predictions - test$num_rentals)

# plot the out-of-sample errors
ggplot(data = errors, aes(x = predicted, y = observed)) +
  geom_abline(aes(intercept = 0, slope = 1),
              size = 3, alpha = 0.70, color = 'red') +
  geom_point(size = 3, alpha = 0.80) +
  ggtitle('out-of-sample errors') +
  theme_minimal()
```
Our prediction accuracy is not so great for this model. The in-sample RMSE is
about 2,863, which means that on average the predictions are off by about
2,863 rentals.
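We can compute the out-of-sample analogue directly from the `errors` data
frame we just assembled:

```{r oos-rmse}
# root mean squared error of the held-out predictions
sqrt(mean(errors$error^2, na.rm = TRUE))
```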
What happens if we build a model with all the variables in it? To do that,
I'm using the `select` verb from `dplyr` to remove variables that aren't
predictors, like the station name, id, and lat/long.

```{r full-model}
full_model = train(num_rentals ~ .,
                   data = select(train, -station, -id, -lat, -long),
                   method = 'lm')

full_model

# get predictions
out_of_sample_predictions = predict(full_model, newdata = test)

# compare predictions against the observed values
errors = data.frame(predicted = out_of_sample_predictions,
                    observed = test$num_rentals,
                    error = out_of_sample_predictions - test$num_rentals)

# plot the out-of-sample errors
ggplot(data = errors, aes(x = predicted, y = observed)) +
  geom_abline(aes(intercept = 0, slope = 1),
              size = 3, alpha = 0.70, color = 'red') +
  geom_point(size = 3, alpha = 0.80) +
  ggtitle('out-of-sample errors, full model') +
  theme_minimal()
```
The in-sample RMSE is about 2,281, so that's definitely an improvement over
the previous model, but this model is really complex and probably not usable
by Pronto. How can we reduce the complexity of the model but maintain
reasonable predictive accuracy?
# Assignment 3

1. Try a couple different models based on the hypotheses you tested in the
first two assignments. Can you improve on the RMSE?
## Preprocessing

Shrinkage methods require that the predictors are normalized to be on the same
scale. We can accomplish this by centering and scaling the data. You center a
variable by subtracting the mean of the variable from each observation. To
scale, you then divide the centered observations by the variable's standard
deviation. The standardized variable then has mean = 0 and standard
deviation = 1.
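Here's that computation by hand on a single predictor, along with R's built-in
`scale()`, which does the same thing (using `crossing` purely as an
illustration):

```{r standardize-by-hand}
# center and scale the crossing variable manually
centered = bikeshare$crossing - mean(bikeshare$crossing, na.rm = TRUE)
standardized = centered / sd(bikeshare$crossing, na.rm = TRUE)

# mean ~ 0 and sd ~ 1 after standardizing
round(c(mean(standardized, na.rm = TRUE), sd(standardized, na.rm = TRUE)), 2)

# scale() does both steps in one call
head(scale(bikeshare$crossing))
```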
The *caret* package has lots of convenient functions for
[preprocessing data](http://topepo.github.io/caret/preprocess.html), check 'em
out!
```{r, warning = FALSE}
full_model_scaled = train(num_rentals ~ .,
                          data = select(train, -station, -id, -lat, -long),
                          method = 'lm',
                          preProcess = c('center', 'scale'))
```
Coefficients estimated with normalized data have a different interpretation
than coefficients from un-normalized data. When the data are scaled, the
intercept has a better interpretation: it's the expected number of rentals
when all the predictors are at their average values. So, in this case, when
all the predictors are at their average values, we expect about 2,813 rentals
per season. In the previous full model we had an intercept of about -878,
which could be interpreted as the expected number of rentals when all the
other predictors have a value of 0. That's pretty unsatisfying for a couple
of reasons. First, we can't have negative rentals! Second, for a lot of the
predictors it doesn't make sense to plug in 0's. What does it mean to have a
duration of 0?
Centering and scaling fix the interpretability problems of the previous
models. Since we divide by the standard deviation during scaling, the
non-intercept coefficients in the centered and scaled model can be interpreted
as the increase in $y$ associated with a 1 standard deviation increase in $x$.
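One nice consequence: on the standardized scale the coefficients are directly
comparable, so ranking them by magnitude gives a rough sense of which
predictors matter most. A quick sketch:

```{r scaled-coefs}
# largest standardized coefficients in the centered-and-scaled full model,
# excluding the intercept
scaled_coefs = coef(full_model_scaled$finalModel)
head(sort(abs(scaled_coefs[-1]), decreasing = TRUE))
```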
# Model Selection

## Variable combination

A simple method for reducing model complexity is to combine some of the
variables. For example, the data set contains variables for *nightclub*,
*pub*, and *bar*; likewise there are variables for *cafe*, *restaurant*, and
*fast_food*. Maybe we can collapse each group into a single variable, removing
predictors while retaining most of the information.
```{r, warning = FALSE}
bikeshare$food = bikeshare$fast_food + bikeshare$restaurant + bikeshare$cafe
bikeshare$nightlife = bikeshare$bar + bikeshare$pub + bikeshare$nightclub
bikeshare$tourism =
  bikeshare$tourism_artwork +
  bikeshare$tourism_hotel +
  bikeshare$tourism_information +
  bikeshare$tourism_museum

# save the new modeling dataset in a new variable
to_model =
  bikeshare %>%
  select(-station, -id, -lat, -long, -fast_food, -restaurant, -cafe, -bar,
         -pub, -nightclub, -tourism_artwork, -tourism_hotel,
         -tourism_information, -tourism_museum)
```
Try out your own categories; these are just a few to get you started.
We'll learn how to make categories computationally when we cover clustering.
We've changed the data frame, so don't forget to redefine the train and test
sets!
```{r, warning = FALSE}
train = bikeshare[in_train, ]
test = bikeshare[-in_train, ]

train = select(train, -station, -id, -lat, -long)
test = select(test, -station, -id, -lat, -long)

dim(train)
dim(test)

# how does our new full model compare?
full_model = train(num_rentals ~ .,
                   data = train,
                   method = 'lm')
```
## Subset selection

We haven't talked much about computational limitations yet, but now is a good
time to start. Selection methods can be *extremely* slow. Why? Because we have
$2^p = 2^{117}$ possible variable combinations. I recommend doing some
combining before trying these methods. I'll leave the combining up to you,
but to make sure these models can run in less than infinite time, I'm going
to remove a bunch of predictors so you get the idea.
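Just to put a number on that:

```{r combinations}
2^117 # about 1.7e35 candidate models, far too many to search exhaustively
```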
```{r, warning = FALSE}
# forward selection
forward_model = train(num_rentals ~ .,
                      data = train,
                      method = 'leapForward',
                      preProcess = c('center', 'scale'),
                      # try models of size 1 - 23
                      tuneGrid = expand.grid(nvmax = 1:23),
                      trControl = trainControl(method = 'cv', number = 5))

# what does this return?
attributes(forward_model)

# what should the number of variables, k, be?
forward_model$bestTune

# what metric was used?
forward_model$metric

# here's a handful of other useful plots and summaries
print(forward_model)
summary(forward_model)
plot(forward_model)
plot(varImp(forward_model))

# compare all the models
plot(forward_model$finalModel, scale = 'adjr2')

# backward selection
backward_model = train(num_rentals ~ .,
                       data = train,
                       method = 'leapBackward',
                       preProcess = c('center', 'scale'),
                       tuneGrid = expand.grid(nvmax = 1:23),
                       trControl = trainControl(method = 'cv', number = 5))

plot(backward_model)
plot(backward_model$finalModel, scale = 'adjr2')
plot(varImp(backward_model, scale = TRUE))

# steps in both directions
hybrid_model = train(num_rentals ~ .,
                     data = train,
                     method = 'leapSeq',
                     preProcess = c('center', 'scale'),
                     tuneGrid = expand.grid(nvmax = 1:23),
                     trControl = trainControl(method = 'cv', number = 5))

plot(hybrid_model)
plot(hybrid_model$finalModel, scale = 'adjr2')
plot(varImp(hybrid_model))
```
## Shrinkage

### Ridge regression
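Ridge regression shrinks the coefficients toward zero by penalizing their
size: instead of minimizing the residual sum of squares alone, it minimizes
$RSS + \lambda \sum_{j} \beta_j^2$, where the tuning parameter $\lambda$
controls how aggressively the coefficients shrink. *caret* tunes $\lambda$
for us via cross-validation below.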
```{r}
# ridge regression
ridge_model = train(num_rentals ~ .,
                    data = train,
                    method = 'ridge',
                    preProcess = c('center', 'scale'),
                    tuneLength = 10,
                    # reducing the cv folds for speed
                    trControl = trainControl(method = 'cv', number = 5))

print(ridge_model)
plot(ridge_model)
plot(ridge_model$finalModel)
plot(varImp(ridge_model))

# get the coefficients for the model
# NOTE: shrinkage methods don't have intercept terms
ridge_coefs = predict(ridge_model$finalModel, type = 'coef', mode = 'norm')$coefficients

# ridge regression with variable selection
ridge_model2 = train(num_rentals ~ .,
                     data = train,
                     method = 'foba',
                     preProcess = c('center', 'scale'),
                     tuneLength = 10,
                     trControl = trainControl(method = 'cv', number = 5))

print(ridge_model2)
plot(ridge_model2)
plot(varImp(ridge_model2))
```
Selection, ridge regression, and the lasso are just a few of the techniques at
our disposal for decreasing model size. See
[this page](http://topepo.github.io/caret/Feature_Selection_Wrapper.html) for
a list of other available options to try out if you like.
### Lasso
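The lasso works like ridge regression but penalizes the absolute value of the
coefficients, minimizing $RSS + \lambda \sum_{j} |\beta_j|$. That change lets
the penalty push coefficients all the way to zero, so the lasso performs
variable selection as a side effect of shrinkage.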
```{r}
lasso_model = train(num_rentals ~ .,
                    data = train,
                    method = 'lasso',
                    preProc = c('scale', 'center'),
                    tuneLength = 10,
                    trControl = trainControl(method = 'cv', number = 5))

print(lasso_model)
plot(lasso_model)
plot(varImp(lasso_model))
plot(lasso_model$finalModel)

# get the model coefficients
lasso_coefs = predict(lasso_model$finalModel, type = 'coef', mode = 'norm')$coefficients
```
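To see the selection in action, here's a sketch that extracts the coefficients
at the cross-validated penalty and counts how many the lasso zeroed out (this
assumes caret's `lasso` method, which fits via the `elasticnet` package):

```{r lasso-zeros}
# coefficients at the fraction chosen by cross-validation
best_fraction = lasso_model$bestTune$fraction
selected_coefs = predict(lasso_model$finalModel, type = 'coefficients',
                         mode = 'fraction', s = best_fraction)$coefficients

# how many predictors were dropped entirely?
sum(selected_coefs == 0)
```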
# Measuring predictive accuracy

All right, now we've got a nice collection of models. Which one should we
report?

```{r}
results = resamples(list(forward_selection = forward_model,
                         backward_selection = backward_model,
                         hybrid_selection = hybrid_model,
                         ridge_regression = ridge_model,
                         lasso_regression = lasso_model))

# compare RMSE and R-squared
summary(results)

# plot results
dotplot(results)
```
Those are in-sample statistics, however, so if we want to compare the models'
out-of-sample prediction accuracy, we need to compute the RMSE using the test
data we held out. Let's compare two of the models, backward selection and the
lasso:
```{r}
backward_predictions = predict(backward_model, test)
sqrt(mean((backward_predictions - test$num_rentals)^2, na.rm = TRUE))

lasso_predictions = predict(lasso_model, test)
sqrt(mean((lasso_predictions - test$num_rentals)^2, na.rm = TRUE))
```
# Assignment 4