### Linear Regression using TidyModels

In this lab exercise we will go through:
- simple linear regression
- multiple linear regression
- transformations of predictors (using `recipes`)

In [None]:
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(ISLR))
suppressPackageStartupMessages(library(MASS))

In [None]:
head(petrol)

## Simple Linear Regression

We use the `Boston` data set, which contains various statistics for 506 neighborhoods in Boston.

Agenda: Build a simple linear regression model that relates the median value of owner-occupied homes (`medv`) as the response to the percentage of the population that belongs to a lower status (`lstat`) as the predictor.

In the step below, we create a parsnip specification for a linear regression model.

In [None]:
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

In [None]:
lm_spec

In [None]:
head(Boston)

Once we have the specification we can fit it by supplying a formula expression and the data we want to fit the model on. 

The formula is written in the form `y ~ x`, where `y` is the name of the response and `x` is the name of the predictor. The names used in the formula must match the names of the variables in the data set passed to `data`.

In [None]:
lm_fit <- lm_spec %>% fit(medv ~ lstat, data = Boston)
lm_fit

The result of this fit is a parsnip model object. This object contains the underlying fit as well as some parsnip-specific information. If we want to look at the underlying fit object, we can access it with `lm_fit$fit` or with `pluck()`:

In [None]:
lm_fit %>% 
  pluck("fit")

In [None]:
lm_fit %>% 
  pluck("fit") %>%
  summary()
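
As an alternative to `pluck("fit")`, recent versions of parsnip also provide `extract_fit_engine()`. The sketch below assumes such a version is installed and refits the same model so that it runs on its own; the name `lm_engine_fit` is only illustrative.

```r
library(tidymodels)
data("Boston", package = "MASS")

# Same specification and fit as above
lm_fit <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm") %>%
  fit(medv ~ lstat, data = Boston)

# Pull out the underlying lm object without indexing into the list directly
lm_engine_fit <- extract_fit_engine(lm_fit)

class(lm_engine_fit)
coef(lm_engine_fit)
```

The extracted object is an ordinary `lm` fit, so base R functions such as `summary()` or `coef()` work on it directly.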

The `tidy()` function returns the parameter estimates of an `lm` object.

In [None]:
tidy(lm_fit)

`glance()` can be used to extract the model statistics.

In [None]:
glance(lm_fit)

If we like the model fit, we can generate predictions using the `predict()` function.

In [None]:
predict(lm_fit, new_data = Boston)
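
`predict()` is not limited to the training data. The following sketch predicts for a few made-up `lstat` values (the tibble `new_obs` is hypothetical, not part of the lab), requests confidence intervals with `type = "conf_int"`, and uses `augment()` to put the predictions next to the observed values.

```r
library(tidymodels)
data("Boston", package = "MASS")

lm_fit <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm") %>%
  fit(medv ~ lstat, data = Boston)

# Hypothetical new observations; only the predictor column is needed
new_obs <- tibble(lstat = c(5, 10, 15))

# Point predictions are returned in a column named .pred
point_pred <- predict(lm_fit, new_data = new_obs)

# Confidence intervals are returned as .pred_lower and .pred_upper
ci_pred <- predict(lm_fit, new_data = new_obs, type = "conf_int")

bind_cols(new_obs, point_pred, ci_pred)

# augment() appends the predictions to the data that is passed in
augmented <- augment(lm_fit, new_data = Boston)
head(augmented)
```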

### Exercise

Agenda: Build a simple linear regression model that relates `medv` as response to `age` as the predictor


In [None]:
# your code here


In [None]:
#hidden test cases 


In [None]:
# your code here


In [None]:
#hidden test cases 


## Multiple Linear Regression

The multiple linear regression model can be fit in much the same way as the simple linear regression model. The only difference is how we specify the predictors. We use the same formula expression `y ~ x`, but we can specify multiple predictors by separating them with `+`.

In [None]:
lm_fit2 <- lm_spec %>% 
  fit(medv ~ lstat + age, data = Boston)

lm_fit2

In [None]:
tidy(lm_fit2)
predict(lm_fit2, new_data = Boston)

A shortcut when using formulas is the form `y ~ .`, which means: set `y` as the response and use all remaining variables as predictors.

In [None]:
lm_fit3 <- lm_spec %>% 
  fit(medv ~ ., data = Boston)

lm_fit3

## Interaction Terms


An interaction term is represented as the product of two or more independent variables/predictors.

There are two ways of including an interaction term: `x:y` and `x * y`
 - `x:y` will include only the interaction between `x` and `y`
 - `x * y` will include the interaction between `x` and `y` as well as the main effects `x` and `y`, i.e. it is short for `x:y + x + y`
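
The difference between the two operators can be checked without fitting anything, by looking at the columns of the design matrix that base R builds from each formula. This is a small illustrative sketch; the variable names `star_terms`, `long_terms` and `colon_terms` are not part of the lab.

```r
data("Boston", package = "MASS")

# Columns generated by the shorthand and by the spelled-out version
star_terms  <- colnames(model.matrix(medv ~ lstat * age, data = Boston))
long_terms  <- colnames(model.matrix(medv ~ lstat + age + lstat:age, data = Boston))

# Columns generated when only the interaction is requested
colon_terms <- colnames(model.matrix(medv ~ lstat:age, data = Boston))

star_terms
# "(Intercept)" "lstat" "age" "lstat:age"

identical(star_terms, long_terms)
# TRUE

colon_terms
# "(Intercept)" "lstat:age"
```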

In [None]:
lm_fit4 <- lm_spec %>%
  fit(medv ~ lstat * age, data = Boston)

lm_fit4

Note that the interaction term is named `lstat:age`.

Sometimes we want to perform transformations and have them applied as part of the model fit, as a pre-processing step. We will use the recipes package for this task.

We use `step_interact()` to specify the interaction term. Next, we create a workflow object to combine the linear regression model specification `lm_spec` with the pre-processing specification `rec_spec_interact`. The workflow can then be fitted much like a parsnip model specification.

In [None]:
rec_spec_interact <- recipe(medv ~ lstat + age, data = Boston) %>%
  step_interact(~ lstat:age)

lm_wf_interact <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_interact)

lm_wf_interact %>% fit(Boston)

Notice that since we specified the variables in the recipe, we don’t need to specify them when fitting the workflow object. Furthermore, take note of the name of the interaction term: `step_interact()` avoids special characters in variable names, so with its default separator the term is called `lstat_x_age` rather than `lstat:age`.
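
To see what the recipe produces before it reaches the model, it can be prepped and baked on its own. The sketch below is a minimal illustration that assumes the default separator of `step_interact()`; the object name `baked_interact` is hypothetical.

```r
library(tidymodels)
data("Boston", package = "MASS")

rec_spec_interact <- recipe(medv ~ lstat + age, data = Boston) %>%
  step_interact(~ lstat:age)

# prep() estimates the recipe, bake(new_data = NULL) returns the processed training data
baked_interact <- rec_spec_interact %>%
  prep() %>%
  bake(new_data = NULL)

# The interaction column appears alongside the original variables
names(baked_interact)
head(baked_interact)
```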

## Non-linear transformations of the predictors

Just as we could use recipes to create interaction terms between variables, we can apply transformations to individual variables as well. If you are familiar with the dplyr package and its `mutate()` function, `step_mutate()` works in much the same way.

It is a good idea to keep as much of the pre-processing as possible inside recipes so that the transformations are applied consistently to new data.

In [None]:
rec_spec_pow2 <- recipe(medv ~ lstat, data = Boston) %>%
  step_mutate(lstat2 = lstat ^ 2)

lm_wf_pow2 <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_pow2)

lm_wf_pow2 %>% fit(Boston)
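
The benefit of keeping the transformation inside the recipe shows up at prediction time: new data only needs the raw `lstat` column, and the workflow creates `lstat2` itself. The following sketch refits the workflow so that it runs standalone; `new_obs` and `lm_wf_pow2_fit` are illustrative names with made-up input values.

```r
library(tidymodels)
data("Boston", package = "MASS")

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

rec_spec_pow2 <- recipe(medv ~ lstat, data = Boston) %>%
  step_mutate(lstat2 = lstat ^ 2)

lm_wf_pow2_fit <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_pow2) %>%
  fit(Boston)

# Hypothetical new observations: no lstat2 column is supplied
new_obs <- tibble(lstat = c(5, 10, 15))
pow2_pred <- predict(lm_wf_pow2_fit, new_data = new_obs)
pow2_pred

# The fitted model has a coefficient for the derived lstat2 term
pow2_coefs <- lm_wf_pow2_fit %>%
  extract_fit_parsnip() %>%
  tidy()
pow2_coefs
```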

## Qualitative Predictors

We will now turn our attention to the `Carseats` data set. We will attempt to predict `Sales` of child car seats in 400 locations based on a number of predictors. One of these variables is `ShelveLoc` which is a qualitative predictor that indicates the quality of the shelving location. 

`ShelveLoc` takes on three possible values:
- Bad
- Medium
- Good

If you pass such a variable to `lm()`, it will generate dummy variables automatically using the following convention:

In [None]:
Carseats %>%
  pull(ShelveLoc) %>%
  contrasts()

So we have no problems including qualitative predictors when using `lm` as the engine.

In [None]:
lm_spec %>% 
  fit(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)

However, as with so many things, we cannot always guarantee that the underlying engine knows how to deal with qualitative variables. Recipes can be used to handle this as well. `step_dummy()` performs the same transformation, turning one qualitative variable with `C` levels into `C-1` indicator variables.

While this might seem unnecessary right now, some of the engines used later on do not handle qualitative variables, and this step will be necessary.

We are also using the `all_nominal_predictors()` selector to select all character and factor predictor variables. This allows us to select by type rather than having to type out the names.

In [None]:
rec_spec <- recipe(Sales ~ ., data = Carseats) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(~ Income:Advertising + Price:Age)

lm_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec)

lm_wf %>% fit(Carseats)
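
To inspect the indicator columns that `step_dummy()` creates, the recipe can be prepped and baked on its own. This is an illustrative sketch (the name `baked_carseats` is hypothetical); it relies on the default naming of `step_dummy()` and `step_interact()`.

```r
library(tidymodels)
data("Carseats", package = "ISLR")

rec_spec <- recipe(Sales ~ ., data = Carseats) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(~ Income:Advertising + Price:Age)

# Processed training data as it would be handed to the engine
baked_carseats <- rec_spec %>%
  prep() %>%
  bake(new_data = NULL)

names(baked_carseats)

# ShelveLoc has three levels, so it is replaced by two indicator columns
baked_carseats %>%
  dplyr::select(starts_with("ShelveLoc")) %>%
  head()
```

The original `ShelveLoc` factor no longer appears in the baked data; its first level, `Bad`, acts as the reference level and is represented by both indicators being zero.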