# Part 1: Regression

The first type of machine learning problem we will explore is called a regression problem. A regression problem is one in which we use a set of features (or independent variables) to try to predict a continuous output (e.g. a real valued number). By showing a model enough examples, the hope is that the model can be trained to predict the output value given just the set of features, where the prediction is as close to the real value as possible.

# 1) Loading and Preprocessing

For this regression tutorial we will use a dataset from UCI's machine learning repository ([link](http://archive.ics.uci.edu/ml/datasets/Auto+MPG)) concerning city-cycle fuel consumption in miles per gallon. We will use this dataset and its 7 independent variables to predict the target, miles per gallon (`mpg`), for different car makes and models.

## Load Data

Instead of being a built-in `sklearn` dataset, the `auto-mpg` dataset is stored in a `.csv` file that can be accessed from the UCI repository, so we'll use `pandas` to load in a local copy. This dataset will require some preprocessing, which we will do after performing some exploratory data analysis (EDA).

First, let's import some packages we'll need.

In [1]:
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read in the `auto-mpg` dataset using `pandas`.

In [2]:
data = pd.read_csv('data/auto-mpg.csv', index_col='car name')
data.head()

car name,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
chevrolet chevelle malibu,18.0,8,307.0,130,3504,12.0,70,1
buick skylark 320,15.0,8,350.0,165,3693,11.5,70,1
plymouth satellite,18.0,8,318.0,150,3436,11.0,70,1
amc rebel sst,16.0,8,304.0,150,3433,12.0,70,1
ford torino,17.0,8,302.0,140,3449,10.5,70,1


Below is the information for the variable types of each of the columns from the UCI machine learning repository's [website](https://archive.ics.uci.edu/ml/datasets/auto+mpg):
1. **mpg**: continuous
2. **cylinders**: multi-valued discrete
3. **displacement**: continuous
4. **horsepower**: continuous
5. **weight**: continuous
6. **acceleration**: continuous
7. **model year**: multi-valued discrete
8. **origin**: multi-valued discrete
9. **car name**: string (unique for each instance)

## Missing Data Preprocessing

Let's take a little more time to explore this dataset and perform any preprocessing necessary. One of the most important steps before we start any machine learning problem is to get a better understanding of the data at hand.

First, we see that the original dataset has 398 rows and 9 columns (1 column to identify the unique cars, 1 column for the target variable, and 7 columns of independent variables). Because we loaded `car name` as the index, `data.shape` reports 8 columns.

In [3]:
data.shape

(398, 8)

### Missing values

Next, we want to check to see if there are any missing values.

In [4]:
data.isna().any()

mpg             False
cylinders       False
displacement    False
horsepower      False
weight          False
acceleration    False
model year      False
origin          False
dtype: bool

At first glance it doesn't seem like we are missing any values, but if we check the UCI repository, the documentation mentions there are indeed missing values. Further investigation into the data set description file provided by UCI tells us that the `horsepower` column has 6 missing values: 

In [5]:
!cat ./data/auto-mpg.names | grep 'missing'

8. Missing Attribute Values:  horsepower has 6 missing values


Looking through the unique values in the `horsepower` column below, we can see that all the values are string numerals except for `?`.

In [6]:
data['horsepower'].sort_values(ascending=False).unique()

array(['?', '98', '97', '96', '95', '94', '93', '92', '91', '90', '89',
       '88', '87', '86', '85', '84', '83', '82', '81', '80', '79', '78',
       '77', '76', '75', '74', '72', '71', '70', '69', '68', '67', '66',
       '65', '64', '63', '62', '61', '60', '58', '54', '53', '52', '49',
       '48', '46', '230', '225', '220', '215', '210', '208', '200', '198',
       '193', '190', '180', '175', '170', '167', '165', '160', '158',
       '155', '153', '152', '150', '149', '148', '145', '142', '140',
       '139', '138', '137', '135', '133', '132', '130', '129', '125',
       '122', '120', '116', '115', '113', '112', '110', '108', '107',
       '105', '103', '102', '100'], dtype=object)

We will have to handle these values, either by removing the rows that contain them completely, or by using some strategy to generate proxy values that provide some approximation of what the values may have been. 

In general, if we have a large dataset, it is okay to go ahead and drop a couple rows, but given that our dataset is small in this case, we will try to generate proxy values using a technique known as **imputation**.

In order to fill the `?` values using imputation, we'll need to convert them to NaN (not a number) values, which is a common representation of missing data in Python. We'll also convert the column's type from strings to floats so that our imputer can calculate the mean.

In [7]:
data = data.replace('?', np.nan)
data = data.astype({'horsepower': 'float'})
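As an alternative to the two-step replace-and-cast above, `pandas` can do the conversion in one call. The sketch below uses a small made-up column (the name `hp` and its values are illustrative, not taken from the dataset) to show how `pd.to_numeric` with `errors='coerce'` turns unparseable entries such as `?` into NaN.

```python
import pandas as pd

# Toy column standing in for `horsepower`; the values are made up for illustration
hp = pd.Series(['130', '165', '?', '150'], name='horsepower')

# errors='coerce' converts anything that cannot be parsed as a number into NaN
hp_numeric = pd.to_numeric(hp, errors='coerce')

print(hp_numeric.isna().sum())  # how many entries became NaN
print(hp_numeric.dtype)         # the column is now a float type
```

Note that this approach coerces every non-numeric entry, not just `?`, so it is worth inspecting the unique values first, as we did above.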

In [8]:
data[data['horsepower'].isna()]

car name,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
ford pinto,25.0,4,98.0,,2046,19.0,71,1
ford maverick,21.0,6,200.0,,2875,17.0,74,1
renault lecar deluxe,40.9,4,85.0,,1835,17.3,80,2
ford mustang cobra,23.6,4,140.0,,2905,14.3,80,1
renault 18i,34.5,4,100.0,,2320,15.8,81,2
amc concord dl,23.0,4,151.0,,3035,20.5,82,1


### Train/Test split

Next, before we perform any imputing or encoding of our variables, we'll want to split our dataset into `train` and `test` data. This will let us fit the preprocessing steps to only the `train` data and then apply it to both the `train` and `test` datasets. If we don't do this, our preprocessing steps have the potential to introduce certain patterns from our `test` data, which can lead to [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).

First we need to: **set the random seed!**

In [9]:
rand_seed = 10
np.random.seed(rand_seed)

While imputation is useful for features, it doesn't make sense to impute the output variable. Let's just remove any rows from the data with missing output variable values.

In [10]:
data.dropna(axis=0, subset=['mpg'], inplace=True)
data.shape

(398, 8)

Turns out there wasn't any missing data. Regardless, this is an important step to do just in case there is missing data!

Now we can extract the output variable `mpg` from the `DataFrame` to make the `X` and `y` variables. We use a capital `X` to denote that it is a matrix, or 2-D array, and a lowercase `y` to denote that it is a vector, or 1-D array.

In [11]:
X = data.drop(columns='mpg')
y = data['mpg'].astype(np.float64)
X.shape, y.shape

((398, 7), (398,))

Now we can use the `train_test_split` function to split the entire dataset into 80% `train` data and 20% `test` data:

In [12]:
from sklearn.model_selection import train_test_split

X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(X, y, test_size=0.2)

print('XTrain shape:', X_train_raw.shape, 'YTrain shape:', y_train_raw.shape, '\n')
print('XTest shape:', X_test_raw.shape, 'YTest shape:', y_test_raw.shape)

XTrain shape: (318, 7) YTrain shape: (318,) 

XTest shape: (80, 7) YTest shape: (80,)
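The split above relies on the global NumPy seed we set earlier. A common alternative is to pass `random_state` directly to `train_test_split`. The sketch below, on made-up toy arrays (`X_toy` and `y_toy` are illustrative names), checks that two calls with the same `random_state` produce identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features (values are arbitrary)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same random_state
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)

# Compare X_train, X_test, y_train, y_test pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```

Passing `random_state` explicitly keeps the split reproducible even if other code draws random numbers from the global generator in between.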


### Imputation

Imputation is the name given to the preprocessing step that fills in missing values. Here we'll impute any missing values using the average, or mean, of the data that does exist, which is a reasonable guess for a data point when all we have is the data itself. To do that we'll use the `SimpleImputer` to assign the mean to all missing values, fitting it against the train data only.

There are also other strategies that can be used to impute missing data ([see documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)).

In [13]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan,
                        strategy='mean', 
                        copy=True)
imputer.fit(X_train_raw);
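To see what fitting on the train data only means in practice, here is a minimal sketch on made-up arrays (`X_tr`, `X_te`, and `toy_imputer` are illustrative names). The imputer learns the mean from the train values and reuses that same mean for the test data, so nothing from the test set influences the fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data with missing entries (values are made up)
X_tr = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_te = np.array([[np.nan], [100.0]])

toy_imputer = SimpleImputer(strategy='mean')
toy_imputer.fit(X_tr)  # learns the mean of the observed train values only

print(toy_imputer.statistics_)      # mean of 1, 3 and 5
print(toy_imputer.transform(X_te))  # the test NaN is filled with the train mean
```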

Before we proceed to actually transforming the train and test datasets, let's also fit a **One Hot Encoder** to transform our categorical data.

## Categorical Data Processing

As we saw from the documentation, the `auto-mpg` dataset contains both categorical and continuous features, which will each need to be preprocessed in different ways. We'll want to transform the categorical variables into indicator variables (which are either 0 or 1) using a technique known as one-hot encoding.

Let's make a list of the categorical variable names to be transformed into indicator variables.

In [14]:
# Define the variable names that are categorical for use later
cat_var_names = ['cylinders', 'model year', 'origin']
X_train_raw_cat = X_train_raw[cat_var_names]
X_train_raw_cat.head()

car name,cylinders,model year,origin
datsun 210,4,79,3
datsun 210 mpg,4,81,3
honda civic,4,74,3
ford maverick,6,73,1
volkswagen rabbit,4,75,2


### Categorical Variable Encoding (One-hot & Dummy)

Many machine learning algorithms require that categorical data be encoded numerically in some fashion. A common technique used is called One-hot-encoding, which creates `k` new variables for a single categorical variable with `k` categories (or levels), where each new variable is coded with a `1` for the observations that contain that category, and a `0` for each observation that doesn't. 

However, when using some machine learning algorithms, such as linear regression, ridge regression and elastic net regression (which we will use first), we can run into the so-called ["Dummy Variable Trap"](https://www.algosome.com/articles/dummy-variable-trap-regression.html) when using one-hot encoding on multiple categorical variables within the same set of features. This occurs because the columns of each set of one-hot-encoded variables can be added together to create a single column of all `1`s, and so the sets are multicollinear when multiple one-hot-encoded variables exist within a given model. This can lead to misleading results when using the aforementioned algorithms.

To resolve this, we can simply add an intercept term to our model (which is all `1`s) and remove the first one-hot-encoded variable for each categorical variable, resulting in `k-1` so-called "Dummy Variables".

Luckily the `OneHotEncoder` from `sklearn` can perform both one-hot and dummy encoding simply by setting the `drop` parameter. Let's use it to transform the `cylinders`, `model year`, and `origin` variables into `k-1` dummy variables.

In [15]:
from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions
dummy_e = OneHotEncoder(categories='auto', drop='first', handle_unknown='ignore', sparse=False)
dummy_e.fit(X_train_raw_cat);

Before using the dummy encoder, there are 21 total unique values (or possible variables) among the categorical variables. After we apply the dummy encoder, this dimension will be reduced to 18 total unique values.

In [16]:
num_unique = sum([len(cat) for cat in dummy_e.categories_])
print(f"{num_unique} total unique values among the categorical variables")

21 total unique values among the categorical variables
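To make the difference between one-hot and dummy encoding concrete, the sketch below encodes a made-up three-level variable both ways (the array `toy_cat` is illustrative). It also shows the property behind the dummy variable trap: each row of a one-hot encoding sums to `1`, which duplicates the intercept column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical variable with k = 3 levels (values are made up)
toy_cat = np.array([['usa'], ['europe'], ['japan'], ['usa']])

# The default output is a sparse matrix; .toarray() makes it dense for printing
one_hot = OneHotEncoder().fit_transform(toy_cat).toarray()
dummy = OneHotEncoder(drop='first').fit_transform(toy_cat).toarray()

print(one_hot.shape, dummy.shape)  # k columns versus k-1 columns
print(one_hot.sum(axis=1))         # every row of the one-hot encoding sums to 1
```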


### [OPTIONAL] Using `pandas`

Optionally, you can use `pandas` to do one-hot-encoding or dummy encoding. The problem with this, as we'll see in Day 3 of this workshop, is that we cannot include it in a `sklearn` pipeline, which will be a useful thing to do. Similar to the `OneHotEncoder`, we can set the optional parameter `drop_first` to change the behavior of the function from one-hot-encoding to dummy encoding.

In [17]:
X_train_raw_dummy = pd.get_dummies(X_train_raw, columns=cat_var_names, drop_first=True)
X_train_raw.shape, X_train_raw_dummy.shape

((318, 7), (318, 22))

## Continuous Data Preprocessing

Preprocessing continuous data requires different steps than categorical data. We'll still want to impute continuous data, but here we can use the mean, median, or even more complex methods to make guesses at the missing data values. We don't need to create indicator variables; instead, we need to normalize our variables, which helps improve the performance of many machine learning models.

Let's subset out the continuous variables to be normalized.

In [18]:
X_train_raw_num = X_train_raw.drop(columns=cat_var_names)
X_train_raw_num.head()

car name,displacement,horsepower,weight,acceleration
datsun 210,85.0,65.0,2020,19.2
datsun 210 mpg,85.0,65.0,1975,19.4
honda civic,120.0,97.0,2489,15.0
ford maverick,250.0,88.0,3021,16.5
volkswagen rabbit,90.0,70.0,1937,14.0


### Normalization

[Normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) is a transformation that puts data into some known "normal" scale. We use normalization to improve the performance of many machine learning algorithms (see [here](https://en.wikipedia.org/wiki/Feature_scaling)). There are many forms of normalization, but perhaps the most useful to machine learning algorithms is called the "z-score" also known as the standard score. 

To z-score normalize the data, we simply subtract the mean of the data, and divide by the standard deviation. This results in data with a mean of `0` and a standard deviation of `1`.

We'll use the `StandardScaler` from `sklearn` to do normalization.

In [19]:
from sklearn.preprocessing import StandardScaler
norm_e = StandardScaler()
norm_e.fit(X_train_raw_num)
norm_e.mean_, norm_e.var_

(array([ 193.79716981,  104.22292994, 2980.69811321,   15.59559748]),
 array([1.09690854e+04, 1.50012228e+03, 7.23107248e+05, 7.90174162e+00]))
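The z-score formula described above can be checked by hand. This sketch, on a made-up column (`toy` is an illustrative name), subtracts the mean and divides by the population standard deviation, then compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (values are made up)
toy = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual z-score: subtract the mean, divide by the standard deviation (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation using scikit-learn
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```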

## Combine it all together

Now let's combine what we've learned to preprocess the entire dataset. On Day 3, we'll learn how to do this using an sklearn object called a `Pipeline`. While these objects are extremely useful for preventing data leakage and structuring preprocessing, they require some setup, so we will use our preprocessors directly for now.

### Transform the `train` and `test` Input Data

Because we've already fit our preprocessors on the train data only, we can use them to transform both the train and test data without any data leakage.

First, use the imputer to fill the missing values.

In [20]:
# Impute the data
X_train_imp = imputer.transform(X_train_raw)
X_test_imp = imputer.transform(X_test_raw)

# Check for missing values
np.isnan(X_train_imp).any(), np.isnan(X_test_imp).any()

(False, False)

Subset out the categorical and numerical features separately. 

In [21]:
# Get the categorical and numerical variable column indices
feature_map = {idx:feat for idx, feat in enumerate(imputer.feature_names_in_)}
cat_var_idx = [idx for idx, feat in feature_map.items() if feat in cat_var_names]
num_var_idx = [idx for idx, feat in feature_map.items() if feat not in cat_var_names]

# Slice the training array
X_train_cat = X_train_imp[:, cat_var_idx]
X_train_num = X_train_imp[:, num_var_idx]

# Slice the test array
X_test_cat = X_test_imp[:, cat_var_idx]
X_test_num = X_test_imp[:, num_var_idx]

Apply the dummy encoder to the categorical variables and the normalizer to the numerical variables.

In [22]:
warnings.filterwarnings('ignore')

# Categorical feature encoding
X_train_dummy = dummy_e.transform(X_train_cat)
X_test_dummy = dummy_e.transform(X_test_cat)

X_train_dummy.shape, X_test_dummy.shape

((318, 18), (80, 18))

In [23]:
# Numerical feature standardization
X_train_norm = norm_e.transform(X_train_num)
X_test_norm = norm_e.transform(X_test_num)

X_train_norm.shape, X_test_norm.shape

((318, 4), (80, 4))

Finally, merge the categorical and numerical columns back into one array.

In [24]:
X_train = np.hstack((X_train_dummy, X_train_norm))
X_test = np.hstack((X_test_dummy, X_test_norm))

X_train.shape, X_test.shape

((318, 22), (80, 22))

### Transform the `train` and `test` Outcome Variable

Similarly to how we transformed the continuous variables for the input data, we will want to standardize the outcome/dependent variable, `mpg`. Here, we'll use the `fit_transform` method on the train data, which performs both the `fit` and `transform` steps in a single call, as we don't need to worry about any prior fitting of preprocessors. The test data is then transformed with the scaler fitted on the train data.

In [25]:
mpg_scaler = StandardScaler()
y_train = mpg_scaler.fit_transform(y_train_raw.values.reshape(-1, 1))
y_test = mpg_scaler.transform(y_test_raw.values.reshape(-1, 1))

In scikit-learn, as soon as you have `X_train`, `X_test`, `y_train`, and `y_test`, everything else is just a matter of choosing your model and its parameters. But this should not be trivialized: selecting a model and that model's parameters is *very* important. While we will not cover it here, choosing the correct model and parameters is the core skill of applying machine learning algorithms, and can have dramatic effects on the performance of your predictions.

# 2) Building models

There are numerous machine learning models that can be used to model data and generate powerful predictions. These vary widely in the types of algorithms and statistical techniques that are used when building these models. Some models are purposefully built for regression problems, while others are more suited towards classification. Many models can also be used for both sets of problems with small tweaks to their algorithms.

For our dataset, let's start with the most basic (and probably most common) regression model that exists: **Linear regression (or Ordinary Least Squares regression)**. Although fairly simple, linear regression is a very powerful model in its own right and can be effective when applied to certain regression problems.

## Linear Models: Linear Regression

At a high level, linear regression is nothing more than finding the best straight line, or line of best fit (hyperplane in multi-dimensional space), through a set of data points that most accurately captures the pattern that exists within those data points.

In a univariate case (2-D), this looks something like this:

![linear-regression](images/linear_regression_line.png)

In a multivariate case (3 dimensions or more), the line turns into a hyperplane which tries to capture as much of the information about the multi-dimensional data points as possible:

![linear-regression](images/linear_regression_hyperplane.jpeg)

The general equation for a linear regression model is quite simple. All it includes are slope values (also known in machine learning as weights), or $\beta$'s,  and an intercept (also known as a bias term), which is really just a special case of a weight, generally denoted as $\beta_0$. The univariate equation is probably familiar to a lot of you:

<center>Univariate regression:</center>

$$y = b + mx $$
$$Y = \beta_0 + \beta_1X_1$$
<br>
<center>Multivariate regression:</center>

$$Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p$$

The goal of linear regression is to find a combination of these $\beta_j$ values (one for each of the $p$ features, plus the intercept) such that the model passes through, or as close as possible to, as many data points as possible. In other words, we are trying to find the values of $\beta$ that minimize the aggregate distance between our linear model and the data points.

We can formalize this into an optimization problem and pursue a strategy that is known in machine learning as minimizing the **cost function**. In the case of linear regression, the cost function we are trying to minimize is the **Mean Squared Error (MSE)** function:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 $$
<br>
<center>i: ith data point</center>

<center>n: number of data points</center>
<center>$Y_i$: The real value of the ith data point</center>
<center>$\hat{Y}_i$: The predicted value of the ith data point</center>

The mean squared error is simply the sum of the squared errors (the squared distances between each actual value and the value predicted by the linear model), divided by the number of data points to get the mean.

By minimizing this function, we will be able to find our optimal linear regression solution that best represents the patterns inherent within the data.
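As a minimal sketch of this idea, the code below fits a univariate line by least squares with NumPy on synthetic data (the true intercept `2.0`, slope `3.0`, and noise level are made up for illustration), computes the MSE from the formula above, and checks that nudging the coefficients away from the least squares solution increases the MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise (coefficients chosen for illustration)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=50)

# Design matrix with a column of ones for the intercept (beta_0)
A = np.column_stack([np.ones_like(x), x])

# Least squares solution: the betas that minimize the sum of squared errors
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# MSE of the fitted line, computed directly from the formula
mse_fit = np.mean((y - A @ beta) ** 2)

# MSE of a slightly different line, for comparison
beta_other = beta + np.array([0.1, -0.1])
mse_other = np.mean((y - A @ beta_other) ** 2)

print(beta)
print(mse_fit, mse_other)
```

Any perturbation of the fitted coefficients gives a larger MSE here, which is what it means for the least squares solution to minimize the cost function.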

### GLM - Ordinary Least Squares (OLS) Linear Regression

Now let's start modeling with the basic [OLS linear regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) provided by scikit-learn:

In [26]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression(n_jobs=1)  # CPUs to use

In [27]:
lin_reg.fit(X_train, y_train)

LinearRegression(n_jobs=1)

And we are done! Using scikit-learn to fit a linear regression model is as easy as that.

We can see how well we fit the training set. For regression models, the `.score()` method returns the proportion of variance in the output variable that is explained by the model predictions. This is known as $R^2$, or R-squared. There are many other performance metrics that can be used when predicting continuous variables.

Let's look at the $R^2$ for the training data:

In [28]:
print('Train R^2: %.04f' % (lin_reg.score(X_train, y_train)))

Train R^2: 0.8772


And the test data:

In [29]:
print('Test R^2: %.04f' % (lin_reg.score(X_test, y_test)))

Test R^2: 0.8385
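To connect the `.score()` method to its definition, here is a small sketch on synthetic data (the arrays `X_toy` and `y_toy` and the coefficients are made up). It computes $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares and compares it with the value returned by scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic regression data (coefficients chosen for illustration)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, model.score(X_toy, y_toy)))
```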


Another common metric used in regression problems is the **Root Mean Squared Error (RMSE)**, which is calculated by taking the square root of the MSE. We can interpret it as the typical size of the error made when predicting `mpg`, as RMSE is measured in the same units as the target variable. Note that because we standardized `mpg` above, the values reported here are in units of standard deviations of `mpg` rather than miles per gallon.

Here's the RMSE for the training data:

In [30]:
# `squared=False` returns the RMSE; scikit-learn 1.4+ also provides `root_mean_squared_error`
from sklearn.metrics import mean_squared_error as mse
train_pred = lin_reg.predict(X_train)
test_pred = lin_reg.predict(X_test)

print('Train RMSE: %.04f' % (mse(y_train, train_pred, squared=False)))

Train RMSE: 0.3504


And the test data:

In [31]:
print('Test RMSE: %.04f' % (mse(y_test, test_pred, squared=False)))

Test RMSE: 0.4030


Similarly for MSE:

In [32]:
print('Train MSE: %.04f' % (mse(y_train, train_pred)))
print('Test MSE: %.04f' % (mse(y_test, test_pred)))

Train MSE: 0.1228
Test MSE: 0.1624
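Because the target was standardized, an RMSE computed on the scaled values is expressed in standard deviations of the target. The sketch below, using made-up `mpg` values and made-up predictions, shows one way to convert such an RMSE back to the original units by multiplying by the scaler's `scale_` attribute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy targets and predictions in original units (values are made up)
y_raw = np.array([18.0, 15.0, 25.0, 32.0, 21.0]).reshape(-1, 1)
pred_raw = np.array([19.0, 17.0, 24.0, 29.0, 22.0]).reshape(-1, 1)

# Standardize both with a scaler fitted on the targets
scaler = StandardScaler().fit(y_raw)
y_std = scaler.transform(y_raw)
pred_std = scaler.transform(pred_raw)

# RMSE in standardized units and in original units
rmse_std = np.sqrt(np.mean((y_std - pred_std) ** 2))
rmse_raw = np.sqrt(np.mean((y_raw - pred_raw) ** 2))

# Multiplying by the standard deviation recovers the original-unit RMSE
print(np.isclose(rmse_std * scaler.scale_[0], rmse_raw))
```

The same rescaling applies to MAE. For MSE, the factor is the square of the standard deviation.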


A final commonly used metric in regression is the **Mean Absolute Error (MAE)**. As the name suggests, this can be calculated by taking the mean of the absolute errors. 

In [33]:
from sklearn.metrics import mean_absolute_error as mae
print('Train MAE: %.04f' % (mae(y_train, train_pred)))
print('Test MAE: %.04f' % (mae(y_test, test_pred)))

Train MAE: 0.2637
Test MAE: 0.3151


### GLM - Ridge (L2) Regression

If we fit our models too closely to our training data, this can lead to a phenomenon called **overfitting**. It may seem like a good thing when we are able to match our data as closely as possible, but there are often differences between the data samples in our test set and those in our training set. To avoid this, most models are paired with some form of regularization (or penalization) that tries to account for unseen data in the test set. This may reduce the performance on our training data, but can lead to better predictions on test data and improve overall generalization.


For linear regression models, one form of regularization is known as **Ridge (L2) regression**. Ordinary least squares uses the least squares loss (which is the loss function used to calculate our MSE cost function):

$$ L(\beta) = \sum_{i=1}^{n} (y_i - \hat y_i)^2 $$

In ridge regression we additionally penalize the coefficients by adding a regularization term:

$$ L(\beta) = \sum_{i=1}^{n} (y_i - \hat y_i)^2  + \alpha \sum_{j=1}^{p} \beta_j^2 $$

This regularization term aims to minimize the size of any one coefficient (or weight), penalizing any reliance on a given subset of features which commonly leads to overfitting.

Ridge regression takes a **hyperparameter**, called alpha, $\alpha$ (sometimes lambda, $\lambda$). This hyperparameter indicates how much regularization should be done. In other words, how much to care about the coefficient penalty term vs how much to care about the sum of squared errors term. The higher the value of alpha the more regularization, and the smaller the resulting coefficients will be. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) for more. 

If we use an `alpha` value of `0` then we get the same solution as the OLS regression done above. Let's verify that.

In [34]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=0,  # regularization
                  solver='auto',
                  random_state = rand_seed) 
ridge_reg.fit(X_train, y_train)

# Predictions
ridge_train_pred = ridge_reg.predict(X_train)
ridge_test_pred = ridge_reg.predict(X_test)

In [35]:
print('Train RMSE: %.04f' % (mse(y_train, ridge_train_pred, squared=False)))
print('Test RMSE: %.04f' % (mse(y_test, ridge_test_pred, squared=False)))

Train RMSE: 0.3504
Test RMSE: 0.4030


Generally we don't know what the best hyperparameter values should be, so we need some type of trial-and-error method to determine them. We won't cover it today (it's covered in detail on Day 2), but scikit-learn provides a `RidgeCV` model that does just that. It fits a ridge regression model by first using cross-validation to find a good value of alpha. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV) for more.
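For a quick preview, here is a hedged sketch of `RidgeCV` on synthetic data. The candidate alphas, the 5-fold setting, and the arrays `X_toy` and `y_toy` are all assumptions chosen for illustration, not recommendations for the `auto-mpg` data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Synthetic regression data (coefficients chosen for illustration)
X_toy = rng.normal(size=(80, 4))
y_toy = X_toy @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=80)

# Candidate regularization strengths to compare with 5-fold cross-validation
candidate_alphas = [0.01, 0.1, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=candidate_alphas, cv=5).fit(X_toy, y_toy)

# The alpha that performed best across the folds
print(ridge_cv.alpha_)
```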

As a quick sanity check, let's see if we can improve on our baseline linear regression model using a ridge model with alpha set to 0.1.

In [36]:
ridge_reg = Ridge(alpha=0.1,  # regularization
                  solver='auto',
                  random_state = rand_seed) 
ridge_reg.fit(X_train, y_train)

# Predictions
ridge_train_pred = ridge_reg.predict(X_train)
ridge_test_pred = ridge_reg.predict(X_test)

In [37]:
print('Train RMSE: %.04f' % (mse(y_train, ridge_train_pred, squared=False)))
print('Test RMSE: %.04f' % (mse(y_test, ridge_test_pred, squared=False)))

Train RMSE: 0.3507
Test RMSE: 0.4012


Looks like despite doing slightly worse on the training set, it did a bit better than using regular OLS on the test set!
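To illustrate how alpha controls the amount of regularization, the sketch below fits ridge models with increasing alpha on synthetic data (the arrays and the alpha values are made up) and prints the overall size of the coefficient vector, which shrinks as alpha grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic regression data (coefficients chosen for illustration)
X_toy = rng.normal(size=(60, 5))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=60)

toy_alphas = [0.01, 1.0, 100.0]
norms = []
for a in toy_alphas:
    coefs = Ridge(alpha=a).fit(X_toy, y_toy).coef_
    norms.append(np.linalg.norm(coefs))  # Euclidean length of the coefficient vector
    print(f"alpha={a}: coefficient norm = {norms[-1]:.4f}")
```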

### GLM - Lasso (L1) Regression

**Lasso (L1) regression** is another form of regularized regression that penalizes the coefficients in a least squares loss. Rather than taking a squared penalty of the coefficients, Lasso uses an absolute value penalty: 

$$ L(\beta) = \sum_{i=1}^{n} (y_i - \hat y_i)^2  + \alpha \sum_{j=1}^{p} |\beta_j| $$

This has a similar effect of making the coefficients smaller, but also has a tendency to force some coefficients to exactly 0. This leads to what are called **sparser** models, and is another way to reduce overfitting introduced by more complex models.

See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) for more.

In [38]:
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.01,  # regularization
                  random_state = rand_seed) 
lasso_reg.fit(X_train, y_train)

# Predictions
lasso_train_pred = lasso_reg.predict(X_train)
lasso_test_pred = lasso_reg.predict(X_test)

In [39]:
print('Train RMSE: %.04f' % (mse(y_train, lasso_train_pred, squared=False)))
print('Test RMSE: %.04f' % (mse(y_test, lasso_test_pred, squared=False)))

Train RMSE: 0.3916
Test RMSE: 0.4333


Here we can see that even with a small alpha we have too much regularization, which leads to worse performance on both the train and test datasets. We would call this model **underfit**.

Taking a look at our feature coefficients, we can see that many of them are 0:

In [40]:
lasso_reg.coef_

array([ 0.02214324,  0.        , -0.29571066,  0.        , -0.00089397,
       -0.19637668, -0.10984901, -0.        , -0.        , -0.        ,
        0.        ,  0.01116894,  0.16207   ,  0.71175003,  0.43468887,
        0.62432226,  0.        ,  0.11584334, -0.        , -0.21747105,
       -0.50786204,  0.        ])
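The sparsity effect can be seen more directly by counting zero coefficients as alpha grows. This sketch uses synthetic data in which only the first two of ten features carry signal (the data, coefficients, and alpha values are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first two of the ten features influence y
X_toy = rng.normal(size=(200, 10))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

toy_alphas = [0.001, 1.0, 10.0]
zero_counts = []
for a in toy_alphas:
    coefs = Lasso(alpha=a).fit(X_toy, y_toy).coef_
    zero_counts.append(int(np.sum(coefs == 0)))  # coefficients driven exactly to 0
    print(f"alpha={a}: {zero_counts[-1]} of {coefs.size} coefficients are zero")
```

With a large enough alpha every coefficient is set to zero and the model predicts only the intercept, which is the extreme case of underfitting.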

## Non-Linear Models: K-Nearest Neighbors (KNN)

With more complex data, the relationships between the features and the target may be difficult to capture with a linear model. In these cases, it can be useful to use models that are able to capture non-linear dependencies in the data.

One such model is known as the **K-Nearest Neighbors (KNN)** algorithm. This algorithm is based on feature similarity, and uses data points that are similar to each other to predict the value of new data points. It does so by using a **distance metric** to quantify the distance, and therefore the similarity, between a set of points. In a KNN regression model, this distance metric is used to find the `k` data points that are most similar to the data point to be predicted in the feature space, and their target values are averaged to produce the prediction.

![KNN](images/KNN.png)

The most commonly used distance metric for KNN is known as the **Euclidean distance**. For two points $a$ and $b$ described by $p$ features:

$$ \text{Euclidean distance} = \sqrt{\sum_{j=1}^{p}(a_j - b_j)^2}$$

By averaging the target values of the `k` points with the smallest Euclidean distance to a given data point, we can derive a predicted value for that data point.
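The prediction step can be sketched by hand. The code below, on random toy data (all names and values are illustrative), computes Euclidean distances from a new point to every training point, averages the targets of the `k` closest ones, and compares the result with scikit-learn's `KNeighborsRegressor` using uniform weights.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Toy training data and a new point to predict (values are random)
X_toy = rng.normal(size=(30, 2))
y_toy = rng.normal(size=30)
x_new = np.array([0.1, -0.2])
k = 3

# Euclidean distance from the new point to every training point
dists = np.sqrt(np.sum((X_toy - x_new) ** 2, axis=1))

# Indices of the k closest training points, then the mean of their targets
nearest = np.argsort(dists)[:k]
manual_pred = y_toy[nearest].mean()

# The same prediction from scikit-learn
knn = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)
sk_pred = knn.predict(x_new.reshape(1, -1))[0]

print(np.isclose(manual_pred, sk_pred))
```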

### Feature encoding

For the KNN model, we won't be fitting a bias term (or intercept), and so using dummy encoded categorical variables is inappropriate. Instead we'll use one-hot encoding for these models. Let's revisit our data preprocessing so that our data is in the right format.

In [41]:
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions
ohe = OneHotEncoder(categories='auto', drop=None, handle_unknown='ignore', sparse=False)
ohe.fit(X_train_raw_cat);

In [42]:
warnings.filterwarnings('ignore')

# Categorical feature encoding
X_train_ohe = ohe.transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)

X_train_ohe.shape, X_test_ohe.shape

((318, 21), (80, 21))

In [43]:
X_train_nonlinear = np.hstack((X_train_ohe, X_train_norm))
X_test_nonlinear = np.hstack((X_test_ohe, X_test_norm))

X_train_nonlinear.shape, X_test_nonlinear.shape

((318, 25), (80, 25))

### K-nearest neighbors regression

Just like the linear regression models, scikit-learn provides a very easy interface to train a KNN model ([see here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)). A quick look at the documentation shows that there are many more hyperparameters that can be altered compared to the previous models. KNN is a model whose performance varies much more with these hyperparameters, so it is important to apply some **hyperparameter tuning** method to try combinations of different values. Again, we won't cover specific methods today, but it is an important point to remember when using KNN models in the future.

Unlike linear regression models, a KNN model can be used for both regression and classification problems, so we should be sure to use the `KNeighborsRegressor` class from sklearn.

In [44]:
def tune_k_neighbors(n_neighbors, X_train, y_train, X_test, y_test):
    
    for n in n_neighbors:
        
        knn_reg = KNeighborsRegressor(n_neighbors=n,
                                      weights='uniform',  # ‘distance’ weights points by inverse of their distance
                                      algorithm='auto',  # out of ‘ball_tree’, ‘kd_tree’, ‘brute’
                                      leaf_size=30)  # for tree algorithms
        knn_reg.fit(X_train, y_train)
        
        # Predictions
        knn_train_pred = knn_reg.predict(X_train)
        knn_test_pred = knn_reg.predict(X_test)
        
        print("WHEN n = %d" % (n))
        print('Train RMSE: %.04f' % (mse(y_train, knn_train_pred, squared=False)))
        print('Test RMSE: %.04f' % (mse(y_test, knn_test_pred, squared=False)))
        print()


In [45]:
from sklearn.neighbors import KNeighborsRegressor

# Example of hyperparameter tuning for the `k` neighbors value
n_list = [2, 4, 6]
tune_k_neighbors(n_list, X_train_nonlinear, y_train, X_test_nonlinear, y_test)

WHEN n = 2
Train RMSE: 0.2345
Test RMSE: 0.4508

WHEN n = 4
Train RMSE: 0.3320
Test RMSE: 0.3874

WHEN n = 6
Train RMSE: 0.3686
Test RMSE: 0.3988



We can see that the performance varies considerably with `n`, and when n=4 we get our best test performance yet! This suggests that more complex, non-linear models can sometimes lead to better results.

## Challenge

Another popular model that can be used for both classification and regression is a **Support Vector Machine (SVM)**. Using the datasets, train an SVM model and evaluate its performance. See if you can also tweak the hyperparameters to inspect how the model varies in performance. Make sure to use `X_train` and `y_train` as your training inputs, and `X_test` as your test input.

You can find the documentation for the [sklearn.svm.SVR here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html).

In [46]:
from sklearn.svm import SVR

svm_reg = SVR() # Add additional hyperparameters for the model