# Linear Regression: Advanced
***Part [2/2]***

---

### Objectives
- Review **building a model** (with only numerical data) in StatsModels.
    - Model Summary
    - Interpreting coefficients


- Evaluating **Model Performance.** *(Beyond $R^2$)*


- Encoding **categorical variables.**


- Checking **assumptions of Linear Regression.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

from sklearn.metrics import mean_absolute_error, mean_squared_log_error

from statsmodels.stats.outliers_influence import variance_inflation_factor

## Review: Building a basic model.

## Evaluating performance.

## Processing: Categorical Variables & Scaling.

## Remodeling: Include All Predictors


### Interpreting coefficients.

## Checking for Assumptions

### 1. Linearity

**The relationship between the target and each predictor is linear.** Check this by drawing a scatter plot of the predictor against the target and looking for evidence that the relationship does not follow a straight line, or look at the correlation coefficient, which measures the strength of the linear association.

**What can I do if it looks like I'm violating this assumption?**

- Consider log-scaling your data.
- Consider a different type of model!
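
As a rough sketch of the check and the log-scaling remedy, the example below compares the correlation of a predictor with a raw target and with its log. The data is synthetic and the column names (`x`, `y`, `log_y`) are made up for illustration; they are not from the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y grows exponentially with x.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = np.exp(0.5 * x) * rng.lognormal(mean=0.0, sigma=0.1, size=200)
df = pd.DataFrame({"x": x, "y": y, "log_y": np.log(y)})

# Pearson correlation of the predictor with the raw and the log-scaled target.
r_raw = df["x"].corr(df["y"])
r_log = df["x"].corr(df["log_y"])
print(f"corr(x, y)     = {r_raw:.3f}")
print(f"corr(x, log y) = {r_log:.3f}")

# To inspect the shape visually: df.plot.scatter(x="x", y="y")
```

In this setup the log-scaled target should correlate more strongly with `x`, because the underlying relationship is linear only on the log scale.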

### 2. Normality
The normality assumption states that the model _residuals_ should follow a normal distribution.
**Note**: the normality assumption talks about the model residuals and not about the distributions of the variables!

**How can I check for this?**
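* Check the Omnibus value in the model summary. This is a test for normality of the residuals: `Prob(Omnibus)` is the p-value under the null hypothesis that the residuals are normally distributed, so a small value (e.g. below 0.05) is evidence *against* normality. It is not the probability that the errors are normal.
    - Normal Test *https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html*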
* Build a QQ-Plot.

**What can I do if it looks like I'm violating this assumption?**
* Drop outliers 
* Consider log-scaling your data 
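
A minimal sketch of the normality test linked above, using `scipy.stats.normaltest` on two made-up sets of "residuals" (one drawn from a normal distribution, one skewed). The arrays are synthetic stand-ins; with a real model you would pass its residuals instead.

```python
import numpy as np
from scipy import stats

# Synthetic "residuals" (assumptions for illustration only).
rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(size=500),
    "skewed": rng.exponential(size=500) - 1.0,  # centered, right-skewed
}

# Null hypothesis: the sample comes from a normal distribution.
p_values = {}
for name, resid in samples.items():
    result = stats.normaltest(resid)
    p_values[name] = result.pvalue
    print(f"{name}: statistic={result.statistic:.2f}, p-value={result.pvalue:.4f}")
```

A very small p-value for the skewed sample means normality is rejected for it; a large p-value only means the test found no evidence against normality.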

***Demo: Sample Residual Distributions and their QQ-plots.***

<img src='https://github.com/learn-co-students/dsc-01-10-11-regression-assumptions-online-ds-ft-031119/raw/master/images/inhouse_qq_plots.png' width=700/>

---

```python
# Demo of `qqplot` code.
# `residuals` is assumed to hold the residuals of a fitted model (e.g. `model.resid`).
fig = sm.graphics.qqplot(residuals, line='45', fit=True);
```

### 3. Homoskedasticity

The errors should be homoskedastic. That is, the errors have the same variance across the whole range of the data.

In a residual plot, this means the residuals are evenly spread throughout the range, with no funnel or cone shape.

<img src='https://github.com/learn-co-students/dsc-01-10-11-regression-assumptions-online-ds-ft-031119/raw/master/images/homo_2.png' width=700/>

**How can I check for this?**

* Check the Durbin-Watson score, with care. It tests for autocorrelation in the residuals (independence of errors), not for constant variance. The statistic ranges from 0 to 4, and values between ~1.5 and ~2.5 suggest little autocorrelation.
    - Documentation: *https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html*
    - Demonstration: *https://www.statology.org/durbin-watson-test-python/*
* Build an error plot, i.e. a plot of errors for a particular predictor (vs. the values of that predictor).

**What can I do if it looks like I'm violating this assumption?**

* Consider dropping extreme values.
* Consider log-scaling your target.
* Consider a different type of model!

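```python
# Demo of a homoskedasticity check: residuals vs. fitted values.
# `fitted` is assumed to be a fitted StatsModels results object.
plt.scatter(x=fitted.fittedvalues, y=fitted.resid)
plt.axhline(0, color='k', linestyle='--')
```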

### 4. Multicollinearity

The interpretation of a regression coefficient is that it represents the average change in the dependent variable for each 1 unit change in a predictor, assuming that all the other predictor variables are kept constant. Multicollinearity occurs when 2 or more of the independent variables are highly correlated with each other. In that case, "holding the others constant" is no longer realistic, and the coefficient estimates become unstable and hard to interpret.

**How can I check for this?**
1. Use `variance_inflation_factor()`
2. Look at a scatter matrix 
3. Look at a heatmap 

**What can I do if it looks like I'm violating this assumption?**
- Remove features that are highly collinear with each other.
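
As a rough sketch of the correlation-based check, the example below builds a small synthetic feature table and lists the pairs of predictors whose absolute correlation exceeds a cutoff. The column names and the `0.8` threshold are assumptions chosen for illustration, not values from the lesson.

```python
import numpy as np
import pandas as pd

# Synthetic predictors (an assumption): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=n),
    "x3": rng.normal(size=n),
})

# Absolute pairwise correlations between predictors.
corr = X.corr().abs()

# Collect each pair once (upper triangle) if it exceeds the cutoff.
threshold = 0.8  # illustrative cutoff
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)

# The same matrix can be drawn as a heatmap: sns.heatmap(corr, annot=True)
```

Only the `x1`/`x2` pair should be reported, which makes one of the two a candidate for removal.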

#### Important: Note on *VIF*
> *The variance inflation factor is a measure for the increase of the
variance of the parameter estimates if an additional variable, given by
exog_idx is added to the linear regression. It is a measure for
multicollinearity of the design matrix, exog.*
>
> ***One recommendation is that if VIF is greater than 5, then the explanatory
variable given by exog_idx is highly collinear with the other explanatory
variables***, *and the parameter estimates will have large standard errors
because of this.*


***Important***
- **When using VIF, you must include an intercept (constant column) for the results of this test to be accurate.**

In [None]:
# Writing a function to create VIF dictionary.
def create_vif_dictionary(X):
    """
    Return a dictionary mapping each column name of X to its VIF.

    Parameters
    ----------
    X: Pandas dataframe of predictive variables only.
        Should have `.columns` and `.values` attributes.
        Include a constant column for accurate results.
    """
    
    vif_dct = {}

    # Loop through each column and set the variable name to the VIF. 
    for i in range(len(X.columns)):
        # Calculate VIF
        vif = variance_inflation_factor(X.values, i)
        
        # Extract column name for dictionary key.
        v = X.columns[i]
        
        # Set value in dictionary.
        vif_dct[v] = vif

    return vif_dct