# STUDY GROUP - M01S11
## Multiple Regression

### Objectives
You will be able to:
* Understand the danger of multicollinearity and how to check for it
* Understand the benefit of feature scaling and normalization and how to perform them
* Understand the need for, and the process of, handling categorical features with label encoding/one-hot encoding
* Explain why it's important to split your data into training and testing sets to validate that your model works well on "new" data
    1. Understand how k-fold cross validation runs multiple train-test splits on your data set to give a more reliable estimate of how well the model predicts

### Multicollinearity

* The interpretation of a regression coefficient is that it represents the average change in the dependent variable for each 1-unit change in a predictor, assuming that all the other predictor variables are kept constant. Multicollinearity causes problems precisely because of that assumption: when two predictors are correlated, changes in one are associated with changes in the other, so the effect of one cannot be cleanly separated while the other is "held constant". Because of this, the coefficient estimates can fluctuate widely in response to small changes in the model or the data. As a result, you may not be able to trust the p-values associated with correlated predictors.

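* `correlations = data.corr()` computes the pairwise correlations between columns; an absolute correlation above roughly 0.75 is generally considered high
* `sns.heatmap(correlations);` plots the correlation matrix as a heatmap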
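The check above can be sketched without plotting by listing the predictor pairs whose absolute correlation exceeds the threshold. This is a minimal sketch on synthetic data: the column names `x1`, `x2`, `x3` and the values are made up for illustration, and 0.75 is the rule-of-thumb cutoff from the notes, not a hard rule.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data: x2 is built to be almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
data = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

# Absolute pairwise correlations between the predictors.
correlations = data.corr().abs()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(correlations.shape, dtype=bool), k=1)
pairs = correlations.where(mask).stack()

# Pairs above the rule-of-thumb threshold are candidates for multicollinearity.
high_pairs = pairs[pairs > 0.75]
print(high_pairs)
```

Only the (`x1`, `x2`) pair should be flagged here, since `x3` is generated independently.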
    
   

### Feature Scaling & Normalization

Often, your dataset will contain features whose magnitudes vary widely. If we leave these magnitudes unchanged, the coefficient sizes will vary widely in magnitude as well. This can give the false impression that some variables are less important than others.

**Log transformation**

Log transformation is a very useful tool when you have data that clearly does not follow a normal distribution. It can help reduce skewness when you have skewed data, and can help reduce the variability of the data. Note that the logarithm is only defined for strictly positive values.
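As a rough illustration of the effect on skewness, the sketch below log-transforms a synthetic right-skewed feature and compares the sample skewness before and after. The data is generated for the example (log-normal draws), so it is an assumption, not a data set from these notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed feature (illustrative only).
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Natural-log transform; this requires all values to be > 0.
logged = np.log(skewed)

skew_before = skewed.skew()
skew_after = logged.skew()
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

If a feature contains zeros, `np.log1p` (which computes log(1 + x)) is one common workaround.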


**Min-max scaling**

When performing min-max scaling, you can transform x to get the transformed $x'$ by using the formula:

$$x' = \dfrac{x - \min(x)}{\max(x)-\min(x)}$$

This way of scaling brings all values between 0 and 1.
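A minimal sketch of the min-max formula in NumPy, using a small made-up array:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # illustrative values

# Subtract the minimum, then divide by the range (max - min).
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # values are 0, 0.25, 0.5 and 1
```

The smallest value always maps to 0 and the largest to 1. scikit-learn's `MinMaxScaler` applies the same transformation column by column.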

**Standardization**

When performing standardization, you use the following formula:

$$x' = \dfrac{x - \bar x}{\sigma}$$

$x'$ will have mean $\mu = 0$ and standard deviation $\sigma = 1$.

Note that standardization does not make data *more* normal; it just changes the mean and the standard deviation!
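The sketch below standardizes a synthetic skewed feature and checks both claims: the result has mean 0 and standard deviation 1, while the skewness (the shape of the distribution) is unchanged. The data is generated for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed feature (illustrative only).
x = pd.Series(rng.lognormal(size=1000))

# Standardize: subtract the mean, divide by the standard deviation.
# pandas' .std() uses the sample formula (ddof=1) by default.
x_std = (x - x.mean()) / x.std()

print("mean:", x_std.mean(), "std:", x_std.std())
print("skewness before:", x.skew(), "after:", x_std.skew())
```

Because standardization only shifts and rescales the values, a skewed feature stays exactly as skewed as it was.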

**Mean normalization**

When performing mean normalization, you use the following formula:

$$x' = \dfrac{x - \text{mean}(x)}{\max(x)-\min(x)}$$

The distribution will have values between -1 and 1, and a mean of 0.
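A minimal sketch of mean normalization on a small made-up array:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # illustrative values; mean is 5.5, range is 8

# Center on the mean, then divide by the range (max - min).
x_meannorm = (x - x.mean()) / (x.max() - x.min())
print(x_meannorm)  # values are -0.4375, -0.1875, 0.0625 and 0.5625
```

The result is centered on 0, and every value lies between -1 and 1.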

**Unit vector transformation**
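When performing unit vector transformations, you create a new variable $x'$ by dividing $x$ by its norm. Each component of $x'$ falls in the range [-1, 1], or [0, 1] when all values of $x$ are non-negative:

$$x'= \dfrac{x}{{||x||}}$$

Recall that the norm of $x$ is $||x||= \sqrt{x_1^2+x_2^2+...+x_n^2}$.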
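A minimal sketch of the unit vector transformation, using a made-up two-element vector whose norm is easy to check by hand ($\sqrt{3^2 + 4^2} = 5$):

```python
import numpy as np

x = np.array([3.0, 4.0])  # illustrative values

# np.linalg.norm computes the Euclidean (L2) norm by default.
norm = np.linalg.norm(x)
x_unit = x / norm
print(norm, x_unit)  # norm is 5; x_unit is 0.6 and 0.8
```

After the transformation the vector has length 1.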

### Label Encoding/One-hot Encoding for Categorical Features

* How to identify categorical features:
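    1. .info() - look for 'object' data types (how pandas typically stores strings)
    2. .describe() - numeric columns whose min, max, and quartiles take only a few distinct values are likely categorical
    3. .plot.scatter() - categorical features will appear histogram-like (points stacked in a few vertical columns) rather than as a cluster of dots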
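The checks above can also be done programmatically. The sketch below is one possible approach on a hypothetical toy DataFrame: the column names, the values, and the `max_levels` cutoff are all assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.1, 27.2, 22.5, 31.0],
    "cylinders": [8, 8, 4, 4, 6, 4],
    "origin": ["USA", "USA", "Japan", "Europe", "USA", "Japan"],
})

# Non-numeric columns (strings) are categorical candidates.
non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]

# Numeric columns with only a few distinct values may also be categorical.
max_levels = 3  # arbitrary cutoff chosen for this tiny example
low_cardinality = [
    col for col in df.columns
    if is_numeric_dtype(df[col]) and df[col].nunique() <= max_levels
]

print("non-numeric:", non_numeric)
print("numeric, few levels:", low_cardinality)
```

On real data the cutoff needs judgment: a numeric column with few distinct values is only a candidate, and whether to treat it as categorical depends on what it measures.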

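In [2]:
# Label Encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()

# `feature` stands in for your own categorical column; these values are illustrative only
feature = ['USA', 'Europe', 'Asia', 'USA']
feature_series = pd.Series(feature)
cat_feature = feature_series.astype('category')
# fit_transform assigns each category an integer label
feature_encoded = lb_make.fit_transform(cat_feature)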

In [None]:
# One hot encoding 
pd.get_dummies(cat_feature)

In [None]:
# Label Binarizer
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
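origin_dummies = lb.fit_transform(cat_feature)
# fit_transform returns a NumPy array, so convert it back to a DataFrame
# (with exactly two classes the array has a single column, so columns=lb.classes_ would not match)
origin_dum_df = pd.DataFrame(origin_dummies,columns=lb.classes_)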

### k-fold Cross Validation with Train/Test Split

* Underfitting & Overfitting - An overfit model is not generalizable and will not hold up on future cases. An underfit model does not make full use of the information available and produces weaker predictions than is feasible.

To detect either problem, compute the residuals separately on the training set and the test set:

$r_{i,train} = y_{i,train} - \hat y_{i,train}$

$r_{i,test} = y_{i,test} - \hat y_{i,test}$

To get a summarized measure over all the instances in the test set and training set, a popular metric is the **(Root) Mean Squared Error**:

RMSE = $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2}$

MSE = $\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2$

Again, you can compute these for both the training and the test set. A test (R)MSE that is much larger than the training (R)MSE is an indication of overfitting.
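The comparison can be sketched as follows. The data here is synthetic (three made-up predictors and a linear target with noise), and the 80/20 split and `random_state=42` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data only.
linreg = LinearRegression().fit(X_train, y_train)

# MSE is the mean of the squared residuals; RMSE is its square root.
train_mse = mean_squared_error(y_train, linreg.predict(X_train))
test_mse = mean_squared_error(y_test, linreg.predict(X_test))
train_rmse, test_rmse = np.sqrt(train_mse), np.sqrt(test_mse)

print("train MSE:", train_mse, "train RMSE:", train_rmse)
print("test MSE: ", test_mse, "test RMSE: ", test_rmse)
```

With a simple linear model on linear data the two errors should be of similar size; a much larger test error would point to overfitting.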

When using train-test-split, random samples of the data are created for the training and the test set. The problem with this is that the training and test MSE strongly depend on how the training and test sets were created. Let's see how this happens in practice using the auto-mpg data.
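The auto-mpg data is not included in these notes, so the sketch below uses synthetic data as a stand-in. It repeats the same 80/20 split with ten different random seeds and records the test MSE each time; the data, the seeds, and the split size are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

test_mses = []
for seed in range(10):
    # Same data and model each time; only the random split changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    test_mses.append(mean_squared_error(y_test, model.predict(X_test)))

spread = max(test_mses) - min(test_mses)
print("test MSE per split:", np.round(test_mses, 3))
print("spread between best and worst split:", spread)
```

The spread shows how much a single test MSE depends on which rows happened to land in the test set.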

**K-Fold Cross Validation** expands on the idea of training and testing splits by splitting the entire dataset into $K$ equal sections ("folds") of data. We then iteratively train $K$ linear regression models on the data, with each model using a different section as the testing set and all other sections combined as the training set.

We can then average the individual results from each of these linear models to get a Cross-Validation MSE. This will be closer to the model's actual MSE, since "noisy" results that are higher than average will tend to cancel out the "noisy" results that are lower than average.
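To make the mechanics concrete, the sketch below writes the K-fold loop out by hand with scikit-learn's `KFold`. The data is synthetic, and `n_splits=5`, `shuffle=True`, and `random_state=0` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mses = []
held_out = []
for train_idx, test_idx in kf.split(X):
    # Train on K-1 folds, evaluate on the one fold that was held out.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    held_out.extend(test_idx)

# The cross-validation MSE is the average over the K folds.
cv_mse = np.mean(fold_mses)
print("MSE per fold:", np.round(fold_mses, 3))
print("cross-validation MSE:", cv_mse)
```

Every row is used for testing exactly once, which is why the averaged result depends less on any single split.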

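In [4]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# X (features) and y (target) are assumed to be defined already, e.g. from the auto-mpg data
linreg = LinearRegression()

# scoring="neg_mean_squared_error" returns the negative MSE per fold,
# so these averages are negative; flip the sign to report an MSE
cv_5_results = np.mean(cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error"))
cv_10_results = np.mean(cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error"))
cv_20_results = np.mean(cross_val_score(linreg, X, y, cv=20, scoring="neg_mean_squared_error"))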