<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Evaluating" data-toc-modified-id="Evaluating-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Evaluating</a></span><ul class="toc-item"><li><span><a href="#What-does-it-mean-to-Evaluate-your-model?" data-toc-modified-id="What-does-it-mean-to-Evaluate-your-model?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span><strong><font color="red">What does it mean to Evaluate your model?</font></strong></a></span></li><li><span><a href="#So-What?" data-toc-modified-id="So-What?-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span><strong><font color="orange">So What?</font></strong></a></span></li><li><span><a href="#Now-What?" data-toc-modified-id="Now-What?-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span><strong><font color="green">Now What?</font></strong></a></span></li></ul></li><li><span><a href="#Feature-Engineering,-Feature-Evaluation,-and-Feature-Selection" data-toc-modified-id="Feature-Engineering,-Feature-Evaluation,-and-Feature-Selection-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Feature Engineering, Feature Evaluation, and Feature Selection</a></span><ul class="toc-item"><li><span><a href="#What-is-Feature-Engineering?" data-toc-modified-id="What-is-Feature-Engineering?-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><strong><font color="red">What is Feature Engineering?</font></strong></a></span></li><li><span><a href="#So-What?" data-toc-modified-id="So-What?-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span><strong><font color="orange">So What?</font></strong></a></span></li><li><span><a href="#Now-What?" data-toc-modified-id="Now-What?-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span><strong><font color="green">Now What?</font></strong></a></span></li></ul></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#What-is-a-Model?" 
data-toc-modified-id="What-is-a-Model?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span><strong><font color="red">What is a Model?</font></strong></a></span></li><li><span><a href="#So-What?" data-toc-modified-id="So-What?-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span><strong><font color="orange">So What?</font></strong></a></span></li><li><span><a href="#Now-What?" data-toc-modified-id="Now-What?-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span><strong><font color="green">Now What?</font></strong></a></span></li></ul></li></ul></div>

import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

from math import sqrt
from scipy import stats

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")


import env
import util
from wrangle_zillow import wrangle_zillow
import explore
import split_scale
import features_zillow
import model_zillow
import evaluate

## Evaluating

### **<font color=red>What does it mean to Evaluate your model?</font>**

There is a well-known saying, attributed to the statistician George Box: *“All models are wrong, but some are useful.”* Keeping that in mind, how do we know if our model is useful?

>**Evaluating a model** is when we use specific metrics to measure how well our model can predict our target variable, y, using our chosen feature(s), X.

>**Important Terminology:**

- **y:** The actual target value (y variable value)


- **$\hat{y}$ or yhat:** The predicted target value (predicted y variable value)


- **$\bar{y}$ or ybar:** The mean of all of the actual target values (mean of all y variable values). This is your Baseline prediction value. 

    <font color=purple>A Baseline prediction is when you predict that every target value will be equal to the mean of all of the target values or ybar.</font> 
    
>What is the best I can predict without using any features? **If your model can't beat your Baseline prediction, scrap that model!** *'If the mean or median of your target variable is a good predictor of your target variable, you don't need to model at all.' - Ryan Orsinger*


- **Residuals:** A residual is the difference between a predicted value, yhat, and the corresponding actual value, y. Every observation has its own residual. So, when you talk about the residuals of your model, you are talking about the amount of error your model has. If every residual were 0, it would mean that your model was predicting every value perfectly. (A model that predicts perfectly should be a red flag, BTW. It usually points to overfitting or to the target leaking into the features.)
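To make the terminology above concrete, here is a minimal sketch. The four home values are made up for illustration and are not taken from the Zillow data; the residuals use the same predicted-minus-actual direction as the definition above.

```python
import pandas as pd

# Hypothetical actual target values (y), for illustration only
y = pd.Series([200_000, 250_000, 300_000, 450_000], name="home_value")

# ybar: the mean of all of the actual target values
ybar = y.mean()

# Baseline prediction: predict ybar for every single observation
baseline_yhat = pd.Series(ybar, index=y.index)

# Residuals: predicted value minus actual value, one per observation
baseline_residuals = baseline_yhat - y

print(f"ybar = {ybar}")
print(f"baseline residuals = {baseline_residuals.tolist()}")
```

Residuals around the mean always sum to zero, which is why the metrics that follow square them before adding them up.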

______


>**Metrics for evaluating models:**
    
    
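- **Sum of Squared Errors (SSE):** All of the residuals squared and added together: $SSE = \sum_{i=1}^{n}(\hat{y}_i - y_i)^2$. The direction of the subtraction does not matter, because $(\hat{y} - y)^2 = (y - \hat{y})^2$. Because SSE grows with every observation you add, **this metric is not good for comparing model performance on datasets with different numbers of observations, such as train and test.**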
    
    
- **Mean Squared Error (MSE):** The SSE divided by the number of observations. **This metric can be used to compare the performance of a model on datasets with different numbers of observations** because it is the mean.
    
    
- **Root Mean Squared Error (RMSE):** The square root of the MSE makes the measurement more meaningful because **it puts the value back in the original units of the target variable.** Along with MSE, **this metric can be used to compare the performance of a model on datasets with different numbers of observations**, such as train and test.

    **<font color=purple>Evaluating whether RMSE is sufficiently small or not will depend on how accurate we need our model to be for our given application. There is no one answer for all data in general.</font>**
    
    
- **$R^2$ or Explained Variance (score usually ranges from 0 to 1):** This tells you how much of the variance in your y variable can be explained by your X variables. $R^2$ is also called the Coefficient of Determination; for a simple linear regression with one feature, it equals the square of Pearson's r. It is most straightforward to interpret for linear models. On data the model was not trained on, $R^2$ can be negative, which means the model performs worse than the Baseline. **For Example:** If $R^2$ = 0.43 for your regression equation, then it means that 43% of the variability in y is explained by the variable(s) in X.

>A high $R^2$ score means that X is a valuable predictor for your y value. *(The significance of this score depends on your p-value.)*

>A low $R^2$ means that your X is not a valuable predictor for your y value. *(The significance of this score depends on your p-value.)*
    
    
- **F-regression test (p-value):** It compares a model with no predictors to the model that you specify. The metric it returns, the p-value, tells you whether your $R^2$ score is statistically significant.

    **<font color=purple>What is the probability that our trained model would fit this set of observations this well by random chance, if the features had no real relationship with the target?</font>**
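The metrics above build on one another, which a small worked example can show. This is a sketch using made-up numbers (the `y` and `yhat` arrays are hypothetical), computing each metric by hand with NumPy rather than through a library call.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y = np.array([3.0, 5.0, 7.0, 9.0])
yhat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = yhat - y                 # [-1, 0, 1, 0]

sse = np.sum(residuals ** 2)         # Sum of Squared Errors
mse = sse / len(y)                   # Mean Squared Error: SSE / number of observations
rmse = np.sqrt(mse)                  # Root MSE: back in the units of y

# R^2 compares the model's SSE to the SSE of the Baseline (predicting ybar every time)
sse_baseline = np.sum((y - y.mean()) ** 2)
r2 = 1 - sse / sse_baseline

print(f"SSE={sse}, MSE={mse}, RMSE={rmse:.4f}, R^2={r2:.2f}")
```

Here the model's SSE is 2 and the Baseline's SSE is 20, so $R^2$ = 1 - 2/20 = 0.9: the model removes 90% of the squared error that the Baseline leaves behind.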

### **<font color=orange>So What?</font>**

>**Does the model add value?**

- Does it perform better than if I made a random guess at my target value?


- Does it perform better than if I predicted the average value of the y value every time?


- Does it perform better than any existing model I have?


- How much confidence should I have in this model?


### **<font color=green>Now What?</font>**

>**Define X and y variables**

`X = df[[independent variable(s)]]`

`y = df[[dependent variable]]`

>**Create a Baseline**

**- Use the mean or median of the target variable as every baseline_yhat value.**

`baseline_yhat = y.mean()`

OR

`baseline_yhat = y.median()`

>**Create a Model (for example - an ols model predicting home_value using bedrooms, bathrooms, and square_feet as features)**

**- Import ols**

`from statsmodels.formula.api import ols`

**- Create Model**

`ols_model = ols(formula='home_value ~ bedrooms + bathrooms + square_feet', data=train).fit()`

**- Predict on Model**

`ols_yhat = ols_model.predict(X_train)`
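The steps so far can be run end to end as a sketch. The `train` DataFrame below is synthetic, generated only so the example is self-contained; the coefficients used to build `home_value` are arbitrary assumptions and say nothing about real home prices.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic training data (an assumption for illustration, not the Zillow data)
rng = np.random.default_rng(42)
n = 200
train = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
    "square_feet": rng.normal(1800, 400, n),
})
train["home_value"] = (
    50_000
    + 10_000 * train.bedrooms
    + 15_000 * train.bathrooms
    + 150 * train.square_feet
    + rng.normal(0, 20_000, n)  # random noise the features cannot explain
)

# Define X and y
X_train = train[["bedrooms", "bathrooms", "square_feet"]]
y_train = train[["home_value"]]

# Baseline: the mean of the target
baseline_yhat = y_train.home_value.mean()

# Create the model and predict on the training features
ols_model = ols("home_value ~ bedrooms + bathrooms + square_feet", data=train).fit()
ols_yhat = ols_model.predict(X_train)

print(ols_yhat.head())
```

Note that the formula interface looks up columns by name, so `predict` needs a DataFrame whose column names match the names used in the formula.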


>**Evaluate Your Model against your Baseline Using RMSE**

**- Create a Handy DataFrame for Evaluating Your Models or Model and Baseline Value**

`ols_eval = y_train.copy()`

`ols_eval.rename(columns={'home_value': 'actual'}, inplace=True)`

**- Add Baseline Value Column**

`ols_eval['baseline_yhat'] = ols_eval['actual'].mean()`

**- Add ols Predictions Column**

`ols_eval['ols_yhat'] = ols_model.predict(X_train)`

**- Calculate and Add Residuals Column for Plotting**

`ols_eval['residuals'] = ols_eval.ols_yhat - ols_eval.actual`

**- Compute the RMSE for our ols Model and Baseline Using Your Handy DataFrame**

`from sklearn.metrics import mean_squared_error`

`from math import sqrt`

`baseline_RMSE = sqrt(mean_squared_error(ols_eval.actual, ols_eval.baseline_yhat))`

`ols_RMSE = sqrt(mean_squared_error(ols_eval.actual, ols_eval.ols_yhat))`

`print(f'My model has value: {ols_RMSE < baseline_RMSE}')`
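Here is the same comparison as a self-contained sketch. The `actual` and `ols_yhat` values are hypothetical numbers typed in by hand, standing in for real predictions, so the arithmetic is easy to follow.

```python
from math import sqrt

import pandas as pd
from sklearn.metrics import mean_squared_error

# Hypothetical actual values and model predictions, for illustration only
ols_eval = pd.DataFrame({
    "actual":   [200_000, 250_000, 300_000, 450_000],
    "ols_yhat": [210_000, 240_000, 320_000, 430_000],
})

# Baseline: predict the mean of the actual values for every row
ols_eval["baseline_yhat"] = ols_eval["actual"].mean()

# Residuals: predicted minus actual
ols_eval["residuals"] = ols_eval["ols_yhat"] - ols_eval["actual"]

# RMSE for the baseline and for the model
baseline_RMSE = sqrt(mean_squared_error(ols_eval.actual, ols_eval.baseline_yhat))
ols_RMSE = sqrt(mean_squared_error(ols_eval.actual, ols_eval.ols_yhat))

print(f"baseline RMSE: {baseline_RMSE:,.2f}")
print(f"model RMSE:    {ols_RMSE:,.2f}")
print(f"My model has value: {ols_RMSE < baseline_RMSE}")
```

With these numbers the model is off by roughly 15,800 on average, against roughly 93,500 for the baseline, both in the same units as `home_value`.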

**- Get the R-squared and the F-test p-value for the Model we created (You can also find these in the ols model summary below)**

`ols_r2 = round(ols_model.rsquared,3)`

`ols_p_value = ols_model.f_pvalue`

`print(f'My R-squared score is significant: {ols_p_value < .05}')`

**<font color=purple>If the RMSE for your ols model is smaller than the RMSE for your Baseline, and your p-value is less than your alpha, your model has value.</font>**

OR

**- Look at the R-squared and Prob (F-statistic) values in the summary table**

`ols_model.summary()`
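As a sketch of reading these values off a fitted model, the example below fits a one-feature model on synthetic data (generated here only for illustration) and pulls out `rsquared` and `f_pvalue`, the two attributes of the statsmodels results object that correspond to R-squared and Prob (F-statistic) in the summary. The `alpha` of 0.05 is a conventional choice, not a requirement.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data with a real relationship plus noise (assumed for illustration)
rng = np.random.default_rng(0)
train = pd.DataFrame({"square_feet": rng.normal(1800, 400, 100)})
train["home_value"] = 100 * train.square_feet + rng.normal(0, 30_000, 100)

ols_model = ols("home_value ~ square_feet", data=train).fit()

# R-squared: share of the variance in home_value explained by the model
ols_r2 = round(ols_model.rsquared, 3)

# p-value of the F-test against a model with no predictors
ols_p_value = ols_model.f_pvalue

alpha = 0.05
print(f"R-squared: {ols_r2}")
print(f"My R-squared score is significant: {ols_p_value < alpha}")
```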

>**Visualize Residuals**

**- Quick look at distribution of residuals**

`plt.hist(ols_eval.residuals, bins=50)`

**- Look for Patterns in Actual vs Residuals**

`plt.scatter(ols_eval.actual, ols_eval.residuals)`

**- Look at Predictions vs Residuals**

`plt.scatter(ols_eval.ols_yhat, ols_eval.residuals)`
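The three plots can be drawn side by side, as in the sketch below. The `ols_eval` DataFrame is filled with synthetic values so the example runs on its own. Residuals are plotted on their raw scale, since roughly half of them are negative and a log transform is undefined for negative numbers.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic actual values and predictions (assumed for illustration)
rng = np.random.default_rng(1)
actual = rng.normal(300_000, 80_000, 200)
ols_eval = pd.DataFrame({
    "actual": actual,
    "ols_yhat": actual + rng.normal(0, 25_000, 200),
})
ols_eval["residuals"] = ols_eval.ols_yhat - ols_eval.actual

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution of residuals: ideally centered on zero and roughly symmetric
axes[0].hist(ols_eval.residuals, bins=30)
axes[0].set(title="Distribution of residuals", xlabel="residual")

# Actual vs residuals: look for curves or funnels, which suggest a pattern the model missed
axes[1].scatter(ols_eval.actual, ols_eval.residuals, alpha=0.5)
axes[1].axhline(0, color="black", linewidth=1)
axes[1].set(title="Actual vs residuals", xlabel="actual", ylabel="residual")

# Predictions vs residuals
axes[2].scatter(ols_eval.ols_yhat, ols_eval.residuals, alpha=0.5)
axes[2].axhline(0, color="black", linewidth=1)
axes[2].set(title="Predicted vs residuals", xlabel="ols_yhat", ylabel="residual")

fig.tight_layout()
```

A shapeless cloud around the zero line is what you hope to see; a visible trend or a widening spread suggests the model is missing something.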

>**If Your First Model Beats Your Baseline, Try to Beat your First Model**

**- Use the RMSE metric to compare the performance of successive models you build.**
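One way to keep score across successive models is a small dictionary of RMSE values, sketched below. The data is synthetic, the two feature sets are arbitrary examples, and scikit-learn's `LinearRegression` is used here in place of the statsmodels `ols` model purely to keep the loop short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic training data (assumed for illustration)
rng = np.random.default_rng(7)
n = 300
X_train = pd.DataFrame({
    "square_feet": rng.normal(1800, 400, n),
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
})
y_train = (
    150 * X_train.square_feet
    + 10_000 * X_train.bedrooms
    + 15_000 * X_train.bathrooms
    + rng.normal(0, 20_000, n)
)

# Hypothetical successive models, each defined by the features it uses
feature_sets = {
    "model_1": ["square_feet"],
    "model_2": ["square_feet", "bedrooms", "bathrooms"],
}

# Start with the baseline: predict the mean for every observation
rmse_by_model = {
    "baseline": np.sqrt(mean_squared_error(y_train, np.full(n, y_train.mean())))
}

# Fit each model and record its RMSE on the training data
for name, cols in feature_sets.items():
    lm = LinearRegression().fit(X_train[cols], y_train)
    rmse_by_model[name] = np.sqrt(mean_squared_error(y_train, lm.predict(X_train[cols])))

for name, rmse in rmse_by_model.items():
    print(f"{name}: {rmse:,.0f}")
```

On training data, adding features to a linear regression never makes RMSE worse, so it is worth repeating the comparison on data the models have not seen before declaring a winner.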

## Feature Engineering, Feature Evaluation, and Feature Selection

### **<font color=red>What is Feature Engineering?</font>**

**Feature Engineering** is when you construct a new feature or column in your dataset using data from other columns in your dataset. You might decide to combine or separate the data from other columns to create new features for use in a machine learning algorithm. **Domain knowledge and qualitative research can be great guides in this area.**
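A minimal sketch of both ideas, combining columns and separating one column into parts, is shown below. The column names and values are hypothetical, chosen only to resemble a housing dataset.

```python
import pandas as pd

# Hypothetical housing data, for illustration only
df = pd.DataFrame({
    "bedrooms": [3, 4, 2],
    "bathrooms": [2.0, 3.0, 1.0],
    "square_feet": [1500, 2400, 900],
    "transaction_date": pd.to_datetime(["2017-01-15", "2017-06-30", "2017-11-02"]),
})

# Combine columns: living area per bedroom and a total room count
df["sqft_per_bedroom"] = df.square_feet / df.bedrooms
df["total_rooms"] = df.bedrooms + df.bathrooms

# Separate a column: pull the month out of the transaction date
df["sale_month"] = df.transaction_date.dt.month

print(df[["sqft_per_bedroom", "total_rooms", "sale_month"]])
```

Whether any of these new columns actually helps is a question for feature evaluation, not something the construction itself guarantees.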

**Feature Evaluation** is an algorithmic way to measure the impact of a feature on a target variable.

### **<font color=orange>So What?</font>**

<div class="alert alert-block alert-success">
"The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering." <b>- Luca Massaron</b>
</div>


You might want to engineer new features to:

- prepare the proper input dataset to be compatible with your machine learning algorithm's requirements.


- improve the performance of your machine learning model.

### **<font color=green>Now What?</font>**

**SelectKBest - Feature Selection**

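- Select K Best is a filter method that uses a statistical test (here, `f_regression`) to score each feature individually against the target variable and keeps the k features with the highest scores. Because each feature is scored on its own, it does not account for features that are highly correlated with each other, so the features it keeps may be redundant.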

>**Create the KBest Object**

`from sklearn.feature_selection import SelectKBest, f_regression`

`k = number_of_features`

`kbest = SelectKBest(f_regression, k=k)`

>**Fit the KBest Object to Your train Data Only (Never Fit on test Data)**

`kbest.fit(X_train, y_train)`

>**Use the KBest Object to Return SelectKBest Features**

`best_features = X_train.columns[kbest.get_support()]`

>**Use the KBest Object to Transform Your X_train Dataset if You Like**

`X_reduced = kbest.transform(X_train)`
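Put together, the SelectKBest steps look like the sketch below. The training data is synthetic: `square_feet` and `bedrooms` drive the target by construction, and the `noise` column is an unrelated feature added on purpose so there is something to filter out.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic training data (assumed for illustration)
rng = np.random.default_rng(3)
n = 200
X_train = pd.DataFrame({
    "square_feet": rng.normal(1800, 400, n),
    "bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(0, 1, n),  # unrelated to the target by construction
})
y_train = (
    150 * X_train.square_feet
    + 40_000 * X_train.bedrooms
    + rng.normal(0, 20_000, n)
)

# Create the KBest object and fit it on the training data only
k = 2
kbest = SelectKBest(f_regression, k=k)
kbest.fit(X_train, y_train)

# Names of the k features with the highest F scores
best_features = X_train.columns[kbest.get_support()].tolist()

# Reduce X_train to just those columns (returns a NumPy array)
X_reduced = kbest.transform(X_train)

print(best_features)
print(dict(zip(X_train.columns, kbest.scores_.round(1))))
```

Printing `kbest.scores_` alongside the column names shows how far apart the features are, which is often more informative than the yes/no mask alone.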

____

**RFE - Recursive Feature Elimination**

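- Recursive Feature Elimination is a wrapper method that takes in a ML algorithm, fits it, and removes the weakest feature (for a linear regression, the one with the smallest coefficient magnitude). It repeats this process until only the number of features you asked for remains. Because it compares coefficients, your features should be on the same scale before you use it.

`from sklearn.feature_selection import RFE`

`k = number_of_features`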

>**Create the Linear Regression Object**

`lm = LinearRegression()`

>**Initialize the RFE Object**

`rfe = RFE(lm, n_features_to_select=k)`

>**Fit the RFE Object**

`rfe.fit(X_train, y_train)`

>**Get a List of Your Best Features**

`X_train.columns[rfe.support_]`

>**Transform your X_train if You Like**

`X_rfe = rfe.transform(X_train)`

>**Get the Rankings of Your Features**

`rfe.ranking_`
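The RFE steps can be sketched end to end as follows. The data is synthetic, with a deliberately unrelated `noise` column. One step here goes beyond the list above: the features are standardized with `StandardScaler` first, because with a linear regression RFE ranks features by coefficient size, and coefficients are only comparable when the features share a scale.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumed for illustration)
rng = np.random.default_rng(3)
n = 200
X_train = pd.DataFrame({
    "square_feet": rng.normal(1800, 400, n),
    "bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(0, 1, n),  # unrelated to the target by construction
})
y_train = (
    150 * X_train.square_feet
    + 40_000 * X_train.bedrooms
    + rng.normal(0, 20_000, n)
)

# Scale the features so their coefficients are comparable
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X_train), columns=X_train.columns
)

# Create the estimator and the RFE object, then fit on the training data
k = 2
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=k)
rfe.fit(X_scaled, y_train)

# Selected features, the reduced feature matrix, and the full ranking
rfe_features = X_train.columns[rfe.support_].tolist()
X_rfe = rfe.transform(X_scaled)
rankings = dict(zip(X_train.columns, rfe.ranking_.tolist()))

print(rfe_features)
print(rankings)
```

In `ranking_`, every selected feature gets a 1 and eliminated features get higher numbers in the reverse order of their removal.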


**Check out this super cool article showing [Feature Engineering](https://medium.com/@whitcrrd/linear-regression-part-ii-eda-feature-engineering-e66ea8763538) in a Linear Regression project! Some pretty cool ideas here.**

## Modeling

### **<font color=red>What is a Model?</font>**



### **<font color=orange>So What?</font>**



### **<font color=green>Now What?</font>**