<a href="https://colab.research.google.com/github/ccwilliamsut/machine_learning/blob/master/MLAB_03_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Linear Regression Model

At this point, we have cleaned our dataset and are ready to build a machine learning model. We want to predict housing prices and we have labeled data, so we will build supervised learning models with **Linear Regression**, a **Decision Tree** and **Gradient Boosting** (which combines multiple decision trees).

Before we begin building a model, **there are a few definitions that we should cover** so that you can better understand what is happening in the code below.



---
## Key Definitions
- **[Estimator / Model](https://scikit-learn.org/stable/user_guide.html)**
  - An object that neatly packages up all of the low-level training code.
    - When fitted, it works through the training set and uses a [scoring method](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation) to measure "loss" (how far the predicted values deviate from the actual values).
    - A **linear regression estimator**, for example, will try to find the single line that deviates the least from all target values.
  - Estimators allow us to **configure** training loops rather than coding everything from the ground up.
  - [Examples](https://scikit-learn.org/stable/user_guide.html) include **linear regressors**, **decision trees**, **k-Means Clustering**, etc.
  >**Note:** scikit-learn's `LinearRegression` solves for the best-fit coefficients directly rather than using Gradient Descent to reduce the error/loss/cost incrementally over "epochs". Some other estimators (for example `SGDRegressor`) do train iteratively. See the description and calculations at the following links:
    - [Data Science Exchange: Epochs discussion](https://datascience.stackexchange.com/questions/29044/how-many-epochs-does-fit-method-run)
    - [Machine Learning Mastery: Gradient Descent](https://machinelearningmastery.com/gradient-descent-for-machine-learning/).
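
To make the note above concrete, here is a minimal sketch comparing `LinearRegression` with a direct least-squares solve in NumPy. The synthetic data and the variable names are assumptions made for illustration only; they are not part of the housing dataset used in this lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + 5 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

# scikit-learn: one call to fit() solves the least-squares problem (no epochs)
model = LinearRegression().fit(X, y)

# The same problem solved by hand: add a column of ones for the intercept,
# then ask NumPy for the least-squares solution
X_with_bias = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)

print("sklearn coefficients:", model.coef_, "intercept:", model.intercept_)
print("lstsq coefficients:  ", solution[:2], "intercept:", solution[2])
```

Both approaches should print essentially the same coefficients and intercept, because both solve the same least-squares problem in one step.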


```

```


- **[Scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation)**
  - A method that measures "loss" (a.k.a. "cost" or "error") when training an estimator
  - Can be measured in various ways such as **mean squared error**, **median absolute error**, etc.
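
As a small illustration of the scoring functions mentioned above, the sketch below computes three error measures with `sklearn.metrics`. The four "actual" and "predicted" values are made up for the example.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values (illustration only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

# The absolute errors are 10, 10, 30 and 0
mse = metrics.mean_squared_error(y_true, y_pred)          # mean of squared errors
mae = metrics.mean_absolute_error(y_true, y_pred)         # mean of absolute errors
median_ae = metrics.median_absolute_error(y_true, y_pred) # median of absolute errors

print("Mean squared error:   ", mse)        # (100 + 100 + 900 + 0) / 4 = 275.0
print("Mean absolute error:  ", mae)        # (10 + 10 + 30 + 0) / 4 = 12.5
print("Median absolute error:", median_ae)  # median of [0, 10, 10, 30] = 10.0
```

Notice how the single large error (30) dominates the mean squared error but barely affects the median absolute error.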

```

```


- **Splitting Data**
  - Refers to the act of separating data into **training** and **testing** (and possibly **validation**) sets
  - The parameter **```test_size```** dictates the proportion of data to be set aside as the test set (with the remaining being the training set)
    - Splits are typically 70/30, 80/20 or somewhere between for training/test sets
  - Accomplished with the **[train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)** method which will return 4 arrays of data (X_train, X_test, y_train, y_test)
  - These 4 arrays are then used when **fitting** the data to the estimator
  - Data can (and should be) **shuffled** with this method (to reduce chances of bias)
  > **Question**: Why do we need to split our data into training and testing sets?
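
The mechanics of splitting can be sketched on a tiny made-up array before applying them to the housing data. The array contents and the 80/20 split below are assumptions chosen so the shapes are easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 made-up samples with 2 features each; y holds each sample's row number
X = np.arange(100).reshape(50, 2)
y = X[:, 0] // 2

# Hold out 20% of the rows for testing; shuffle so the split is random,
# and fix random_state so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=21
)

print("X_train shape:", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape: ", y_test.shape)
```

Even after shuffling, each row of `X_train` still lines up with the matching entry of `y_train`; only the order of the rows changes.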

```

```


- **Hyperparameters**
  - These are values that we define **before** training a model. Regular parameters (such as a linear model's coefficients) are learned from the data during training, but hyperparameters are not learned and must be set beforehand.
  - Includes things like **```n_estimators```**, **```max_depth```**, etc.
  - We can "tune" hyperparameters in order to get better model performance.
  - We can also use **```GridSearchCV```**, which tries every combination of the hyperparameter values we supply and reports which combination performed best (shown at the end of the lesson). This reduces the manual work needed to create a "good" model.
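
A minimal sketch of a grid search is shown below. It assumes a `Ridge` regressor and a small, arbitrary grid of values on synthetic data; the estimator, the grid and the data are illustrative choices, not the configuration used later in the lesson.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=150)

# Hyperparameter values to try: 4 x 2 = 8 combinations in total
param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],
    "fit_intercept": [True, False],
}

# Each combination is scored with 5-fold cross-validation.
# The scoring is "negative" MSE, so values closer to 0 are better.
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best score (negative MSE):", search.best_score_)
```

The fitted `search` object also keeps the score of every combination in `search.cv_results_`, which is useful when you want to see how sensitive the model is to each setting.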

```

```

- **[Fitting](https://stackoverflow.com/questions/45704226/what-does-fit-method-in-scikit-learn-do) a model/estimator**
  - This refers to the act of actually **training** the estimator on the training set of data (X_train, y_train)
  - The estimator loops through the data (using the hyperparameter settings), measures loss with the scoring method and produces a **trained model**.
  - The trained model is then run against the test set to evaluate the accuracy.
  - The error rates returned can be used to determine if:
    - The model needs to be further refined
    - Data needs to be further engineered
    - The model is ready to deploy.
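
The fit-then-evaluate cycle described above can be sketched in a few lines. The data here is synthetic (a noisy straight line), which is an assumption made so that the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (illustration only): y = 4*x + 10 + noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = 4.0 * X[:, 0] + 10.0 + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=21
)

# Fit (train) the estimator on the training arrays only
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on both sets; a large gap between them can signal overfitting
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("Learned coefficient:", model.coef_[0])
print("Learned intercept:  ", model.intercept_)
print("Training score: %.2f" % train_score)
print("Test score:     %.2f" % test_score)
```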


---

## Basic steps for Building a Machine Learning Model
### 1. **Import necessary libraries**
  - In order to train a model, we must identify the library that contains it and call the import function so that we can use it in our code (```from sklearn import linear_model```). For this lesson, we also need to import some other libraries such as ```preprocessing``` and ```Counter```. 

### 2. **Identify the Target Variable**
  - We must then **identify our target variable**. Basically, we can choose any feature that we would like to predict as our target, and the remaining features will be used to try and determine an effective model that can predict any value of the target feature. 
  - For this lesson, we will use **```median_house_value```** as our target variable, meaning that we are trying to predict a house's value based upon the factors contained in the other features (```total_rooms, total_bedrooms, median_income```, etc). We can easily change this up later, however, if we want to try and predict another variable instead.

### 3. **Split the Dataset into Training and Testing sets**
  - In this step, we set our training and testing sets, determine the test set size and shuffle our data (to prevent bias).
  - **NOTE:** It's vitally important to **shuffle** the dataset at this point in order to randomize the data and prevent bias from entering our model.

### 4. **Fit (or "train") the Model**
  - At this point, we call our estimator and "fit" it to our training arrays (X_train, y_train)
    - ```estimator.fit(X_train, y_train)```
  - Once trained, the **test data** is run through the model and **error metrics** are gathered.

### 5. **Visualize the data**
- Once everything is completed, we choose an appropriate way to visualize our model's performance.



```


```



---

> Websites Referenced:
- [Towards Data Science: Introduction to Linear Regression in Python](https://towardsdatascience.com/introduction-to-linear-regression-in-python-c12a072bedf0)
- [Springboard Blog: Linear Regression in Python: A Tutorial](https://www.springboard.com/blog/linear-regression-in-python-a-tutorial/)
- [Geeks for Geeks: Linear Regression (great, simple explanation of the math)](https://www.geeksforgeeks.org/ml-linear-regression/)

## A. Setup Environment
- Import libraries
- Prepare dataset(s)
- Create functions

In [None]:
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP ENVIRONMENT <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# Import libraries
import pandas as pd
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from IPython.display import display
from sklearn import preprocessing
from sklearn.model_selection import train_test_split 
from scipy import stats
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import learning_curve
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

url = 'https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/modified/CaliforniaHousingDataModified.csv'

df = pd.read_csv(url)
#df = pd.read_csv('~/Downloads/CaliforniaHousingDataModified.csv')


# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP DATA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

# --------------------------------------------- RENAME FEATURES ---------------------------------------------
df.rename(columns = {'lattitude':'latitude', 't_rooms':'total_rooms'}, inplace=True)


# --------------------------------------------- ONE-HOT ENCODING ---------------------------------------------
df['ocean_proximity'] = pd.Categorical(df['ocean_proximity'])
df_dummies = pd.get_dummies(df['ocean_proximity'], drop_first = True)
df.drop(['ocean_proximity'], axis = 1, inplace = True)
df = pd.concat([df, df_dummies], axis=1)


# --------------------------------------------- DROP UNWANTED FEATURES ---------------------------------------------
df.drop(['population', 'households', 'proximity_to_store', 'ISLAND'], axis = 1, inplace = True)


# ------------------------------------- FIX MISSING DATA -------------------------------------
tb_med = df['total_bedrooms'].median(axis=0)
df['total_bedrooms'] = df['total_bedrooms'].fillna(value = tb_med)
df['total_bedrooms'] = df['total_bedrooms'].astype(int)
df.name = 'df'

# ------------------------------------- Z-SCORE -------------------------------------
z = np.abs(stats.zscore(df))
dfz = df[(z < 3).all(axis = 1)]
dfz.name = 'dfz'

# ------------------------------------- INTERQUARTILE RANGE -------------------------------------
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)
dfi = df[~((df < lower) | (df > upper)).any(axis = 1)]
dfi.name = 'dfi'
#dfi = dfi.drop(['NEAR BAY', 'NEAR OCEAN'], axis = 1)  # After applying IQR, the following features are now empty and can be dropped

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FUNCTIONS <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
def make_heatmap(df):
  corr = df.corr()
  plt.figure(figsize=(15 ,9))
  sns.heatmap(corr, annot=True, vmin = -1, vmax = 1, center = 0, fmt = '.1g', cmap = 'coolwarm')
  plt.show()
  plt.close()


def make_heading(heading = 'heading'):
  print('{0}:\n'.format(heading), '-' * 30)


def show_coef(df_in, model):
  coef_df = pd.DataFrame(model.coef_, df_in.columns, columns=['Coefficient'])
  make_heading('\n\nCoefficients')
  display(coef_df)
  make_heading('\n\nIntercept')
  display(model.intercept_)


def drop_features(df_in):
  df_in = df_in.drop([#'median_house_value',
                      'longitude',
                      'latitude',
                      #'housing_median_age',
                      'total_rooms',
                      'total_bedrooms',
                      #'median_income',
                      'INLAND',
                      #'NEAR BAY',
                      #'NEAR OCEAN'
                      ],
                      axis = 1
                      )
  return df_in


def plot_predictions(y_test, y_pred):
  fig, ax = plt.subplots()
  ax.scatter(y_test, y_pred, edgecolors=(0, 0, 0), alpha=0.5)
  ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'g--', lw=4)
  ax.set_xlabel('Actual')
  ax.set_ylabel('Predicted')
  ax.set_title("Ground Truth vs Predicted")
  plt.show()
  plt.close()


def run_linear_model(model, X, y, test_size = 0.3):
  # Split the dataset into training and testing arrays
  #   We take our X and y variables that we have just created and now split each one into a training set and testing
  #   set: X_train, X_test, y_train, y_test)
  X_train, X_test, y_train, y_test = train_test_split(X,
                                                      y,
                                                      test_size=test_size,    # reserve this percentage of our data for testing, use the rest for training
                                                      shuffle = True,
                                                      random_state = 21
                                                      )

  # Get the name of the model (for use in our display functions)
  model_name = type(model).__name__

  # Fit (or train) the model (have the model create coefficient multipliers which will be used to predict "y")
  model.fit(X_train, y_train)

  # Here we want to input our X_test array to make predictions based upon our trained model
  y_pred = model.predict(X_test)

  # Gather metrics on model performance
  model_test_score = model.score(X_test, y_test)  # R^2 score on the test data (for regressors, score() returns R^2 rather than classification accuracy)
  model_train_score = model.score(X_train, y_train)  # R^2 score on the training data
  mean_ae = metrics.mean_absolute_error(y_test, y_pred)  # Mean absolute error on the test set (how far predictions are from the actual values on unseen data, on average)
  median_ae = metrics.median_absolute_error(y_test, y_pred)  # Median absolute error
  rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))  # Root mean squared error
  ev = metrics.explained_variance_score(y_test, y_pred) # Explained variance score: the proportion of the variation (dispersion) in y_test that the model accounts for
  predictive_accuracy = metrics.r2_score(y_test, y_pred)  # R^2 score computed from the predictions (same value as model_test_score)

  # Display the shape of each new array:
  print('\n' * 5, '*' * 50,' ', model_name,' ', '*' * 50)
  make_heading('The shape of each of our arrays after splitting into training and testing sets for dataframe \"{0}\"'.format(dfx.name))
  print('X_train shape: ', X_train.shape)
  print('X_test shape:  ', X_test.shape)
  print('y_train shape: ', y_train.shape)
  print('y_test shape:  ', y_test.shape)
  print('y_pred shape:  ', y_pred.shape)

  # Show the coefficients
  show_coef(features_dfx, model)

  # Show metrics on model performance (training and testing)
  make_heading('\n\nAccuracy and Error for training/testing on the {0} model for dataframe \"{1}\"'.format(model_name, dfx.name))
  print("Training Accuracy                 (X_train, y_train):  %.2f" % model_train_score)
  print("Test Accuracy                     (X_test, y_test):    %.2f" % model_test_score)
  print("Predictive Accuracy (R^2)         (y_test, y_pred):    %.2f" % predictive_accuracy)
  print("Explained Variance (1 is best)    (y_test, y_pred):    %.2f" % ev)
  print("Mean Absolute Error on Test Set   (y_test, y_pred):    %.2f" % mean_ae)
  print("Median Absolute Error on Test Set (y_test, y_pred):    %.2f" % median_ae)
  print("RMSE on Test Set                  (y_test, y_pred):    %.2f" % rmse)
  print('=' * 50, '\n\n\n')

  plot_predictions(y_test, y_pred)

## B. Choose a dataset to use
- Three datasets have been created:
  1. **df** -- the **original dataset** ('NaN' values have been replaced with median, so there are no null values)
  2. **dfz** -- a dataset that has used the **z-score method** for removing outlier data
  3. **dfi** -- a dataset that has used the **IQR method** for removing outlier data
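
To see what the two outlier-removal methods do, here is a sketch on a small made-up dataframe with one planted outlier. The column names and values are assumptions for illustration; the filtering expressions follow the same pattern as the setup code above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up data (illustration only) with one extreme value planted in row 0
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rooms": rng.normal(5, 1, 200),
    "income": rng.normal(50, 10, 200),
})
toy.loc[0, "income"] = 500.0

# Z-score method: keep rows where every feature is within 3 standard deviations
z = np.abs(stats.zscore(toy))
toy_z = toy[(z < 3).all(axis=1)]

# IQR method: keep rows where every feature lies within 1.5 * IQR of the quartiles
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)
toy_iqr = toy[~((toy < lower) | (toy > upper)).any(axis=1)]

print("Original rows:     ", len(toy))
print("Rows after z-score:", len(toy_z))
print("Rows after IQR:    ", len(toy_iqr))
```

Both methods drop the planted outlier in row 0, but they will not always agree on the borderline rows, which is why the lesson keeps the two results as separate datasets.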

```

```

**NOTE:** When choosing your dataset, you will need to change two commands to reflect the dataset that you want to use. The following example changes the dataset from 'dfz' to 'dfi'.

**Original code:** --------------------------------------->  **New code** (For example, to change **from** 'dfz' **to** 'dfi' dataset):
```
dfx = dfz.copy()      -->   dfx = dfi.copy()
dfx.name = dfz.name   -->   dfx.name = dfi.name
```



```
  
  

```

In [None]:
# *************************** DETERMINE DATA FOR USE ***************************
# Choose which dataframe you would like to use (dfi: IQR, dfz: z-score, df: original (with replaced NaN values))
dfx = df.copy()
dfx.name = df.name

# Show a heatmap for the given dataframe you want to use
make_heatmap(dfx)

> Websites Referenced:
- [Scikit-Learn Documentation: Model Evaluation](https://scikit-learn.org/0.15/modules/model_evaluation.html)
- [Helpful breakdown of model metrics](https://datascience.stackexchange.com/questions/28426/train-accuracy-vs-test-accuracy-vs-confusion-matrix)
- [Ridge and Lasso Regression: Regularization of Coefficients](https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/)

## C. Step by step: Building a Linear Regression model
- In this section, we walk through building a linear regression model using the dataframe chosen in the previous step.

### The basic steps are:
  1. Define the **features** and the **target variable**.
  2. **Split** the dataset into **training** and **testing** partitions
  3. **Define the model:**
    - Determine which model to use (LinearRegression, Lasso, etc.)
    - **Train the model** - run the model against our training set
    - Use the trained model to **predict against the test set**
  4. **Capture metrics** for the model
    - Training accuracy
    - Testing accuracy
    - Prediction error
    - Error values (Mean Absolute Error, Median Absolute Error, RMSE, etc.)
  5. **Analyze metrics**
    - Look at the various accuracy and error scores to **determine how well the model fits** (overfit, underfit, good, bad, etc.)
  6. **Plot graphs** to visualize the various accuracy / error metrics
  7. **Predict new values** to see if the model performs as expected (given the accuracy and coefficient values)

In [None]:
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> STEP BY STEP LINEAR MODEL CONSTRUCTION <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


# ---------------------------- Identify Target Variable ----------------------------
# Create a copy of the dataframe with only the desired features (dropping the target feature)
features_dfx = dfx.copy()
features_dfx = features_dfx.drop(['median_house_value', # You *MUST* drop 'median_house_value', as that is our target variable; a hashtag (#) in front of a feature below means that you want to KEEP it
                                  'longitude',
                                  'latitude',
                                  #'housing_median_age',
                                  #'total_rooms',
                                  #'total_bedrooms',
                                  #'median_income',
                                  #'INLAND',
                                  'NEAR BAY',
                                  'NEAR OCEAN'
                                  ],
                                 axis = 1
                                 )

# ---------------------------- SPLIT THE DATASET ----------------------------
# We extract the values of our two datasets into "X" and "y" variables for use in later functions
# The X variables are the "explanatory" numbers (or images). They are the numbers that we use to try and predict
#   what the "y" variable is.

# "X" are the independent variables, the "features" that we will use to try and predict "y"
X = features_dfx.values

# "y" is the dependent variable (the "label"); the actual value that we are trying to learn to predict. 
#   Each time we make a prediction, it is compared to the actual values, and the "loss" (the difference between
#   the prediction and real number) is added to previous "loss". 
y = dfx['median_house_value'].values

# Split the dataset into training and testing arrays
#   We take our X and y variables that we have just created and now split each one into a training set and testing
#   set: X_train, X_test, y_train, y_test)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,    # reserve this percentage of our data for testing, use the rest for training
                                                    shuffle = True,
                                                    random_state = 21
                                                    )


# ---------------------------- CREATE THE MODEL  ----------------------------
# Here we define our model as a Linear Regression algorithm
model = linear_model.LinearRegression(n_jobs=-1)
model_name = type(model).__name__

# Fit (or train) the model (have the model create coefficient multipliers which will be used to predict "y")
model.fit(X_train, y_train)

# Here we want to input our X_test array to make predictions based upon our trained model
y_pred = model.predict(X_test)


# ----------------------------- METRICS --------------------------------------
model_test_score = model.score(X_test, y_test)  # R^2 score on the test data (for regressors, score() returns R^2 rather than classification accuracy)
model_train_score = model.score(X_train, y_train)  # R^2 score on the training data
mean_ae = metrics.mean_absolute_error(y_test, y_pred)  # Mean absolute error on the test set (how far predictions are from the actual values on unseen data, on average)
median_ae = metrics.median_absolute_error(y_test, y_pred)  # Median absolute error
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))  # Root mean squared error
ev = metrics.explained_variance_score(y_test, y_pred) # Explained variance score: the proportion of the variation (dispersion) in y_test that the model accounts for
predictive_accuracy = metrics.r2_score(y_test, y_pred)  # R^2 score computed from the predictions (same value as model_test_score)

# Display the shape of each array:
make_heading('The shape of each of our arrays after splitting into training and testing sets for dataframe \"{0}\"'.format(dfx.name))
print('X_train shape: ', X_train.shape)
print('X_test shape:  ', X_test.shape)
print('y_train shape: ', y_train.shape)
print('y_test shape:  ', y_test.shape)
print('y_pred shape:  ', y_pred.shape)

# Display the model coefficients and intercept (bias)
coef = pd.DataFrame(model.coef_, features_dfx.columns, columns=['Coefficient'])
make_heading('\n\nCoefficient(s)')
display(coef)
make_heading('\n\nIntercept')
display(model.intercept_)

# Display training set metrics
make_heading('\n\nAccuracy and Error for training/testing on the {0} model for dataframe \"{1}\"'.format(model_name, dfx.name))
print("Training Accuracy (score)                 (X_train, y_train):  %.2f" % model_train_score)
print("Test Accuracy (score)                     (X_test, y_test):    %.2f" % model_test_score)
print("Predictive Accuracy (R^2 score)           (y_test, y_pred):    %.2f" % predictive_accuracy)
print("Explained Variance (1 is best) (loss)     (y_test, y_pred):    %.2f" % ev)
print("Mean Absolute Error on Test Set (loss)    (y_test, y_pred):    %.2f" % mean_ae)
print("Median Absolute Error on Test Set (loss)  (y_test, y_pred):    %.2f" % median_ae)
print("RMSE on Test Set (loss)                   (y_test, y_pred):    %.2f" % rmse)


# ----------------------------- PLOTTING --------------------------------------
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, edgecolors=(0, 0, 0), alpha=0.5)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'g--', lw=4)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title("Ground Truth vs Predicted")
plt.show()
plt.close()

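Important Definitions:
- **Accuracy:** The number of correct classifications / the total number of classifications. This definition applies to classifiers; for regression estimators such as `LinearRegression`, the `score()` method returns the R² score (coefficient of determination) instead, and that is the value labeled "accuracy" in the output above.
- **Train accuracy:** The score of a model on the examples it was trained on.
- **Test accuracy:** The score of a model on examples it hasn't seen.
- **Explained Variance Score:** Measures the proportion of the variation (dispersion) in the target values that the model accounts for.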
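
Because this lesson uses a regression model, the value returned by `score()` is the R² score rather than a count of correct classifications. The sketch below checks this on synthetic data (an assumption for illustration) by comparing `score()` with `r2_score` and with the formula computed by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (illustration only)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 by hand: 1 - (sum of squared residuals / total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

print("model.score(X, y):  %.4f" % model.score(X, y))
print("r2_score(y, y_pred): %.4f" % r2_score(y, y_pred))
print("R^2 by hand:         %.4f" % r2_by_hand)
```

All three numbers should match, which is why the "Test Accuracy" and "Predictive Accuracy (R^2)" lines in this lesson's output report the same value.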

> Websites referenced:
- [MAE and RMSE — Which Metric is Better?](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d)
  - **Mean Absolute Error (MAE):** MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.
  - **Root mean squared error (RMSE):** RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation. **RMSE tends to punish large errors more**.
  - **NOTE:** Both values are *negatively oriented* meaning that lower is better. Also, **RMSE should be more useful when large errors are particularly undesirable.**
- [Scikit-Learn Documentation: Model Evaluation](https://scikit-learn.org/0.15/modules/model_evaluation.html)
- [**Excellent Tutorial** -- Ritchie Ng: Learning to Evaluate Linear Regression Models (single and multiple)](https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/)
  - Definitely take some time to read through this tutorial. Ritchie provides an excellent breakdown of how to construct single and multiple linear regression, the underlying math and, critically, how one can interpret the results.
- [Helpful breakdown of model metrics](https://datascience.stackexchange.com/questions/28426/train-accuracy-vs-test-accuracy-vs-confusion-matrix)
- [Machine Learning Mastery: Evaluating Model Metrics (great examples!)](https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/)
- [Effects of Alpha on Regression](https://chrisalbon.com/machine_learning/linear_regression/effect_of_alpha_on_lasso_regression/)
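
The difference between MAE and RMSE noted above can be checked with a small sketch. The two sets of predictions below are made up so that they have the same total absolute error, spread out evenly in one case and concentrated in a single large miss in the other.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([10.0, 10.0, 10.0, 10.0])
even_errors = np.array([12.0, 8.0, 12.0, 8.0])      # every prediction is off by 2
one_big_error = np.array([10.0, 10.0, 10.0, 18.0])  # one prediction is off by 8

for label, y_pred in [("even errors", even_errors), ("one big error", one_big_error)]:
    mae = metrics.mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
    print("%-14s MAE = %.2f   RMSE = %.2f" % (label, mae, rmse))

# Keep the values for comparison
mae_even = metrics.mean_absolute_error(y_true, even_errors)
mae_big = metrics.mean_absolute_error(y_true, one_big_error)
rmse_even = np.sqrt(metrics.mean_squared_error(y_true, even_errors))
rmse_big = np.sqrt(metrics.mean_squared_error(y_true, one_big_error))
```

The MAE is 2.00 in both cases, but the RMSE rises from 2.00 to 4.00 when the error is concentrated in one prediction. This is what "RMSE tends to punish large errors more" means in practice.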

## D. Run Multiple Linear Regression Models Using Functions
By moving our "step-by-step" code into a function, we can make running a model much easier and, therefore, compare multiple models against one another. 

This procedure abstracts the code in such a way as to make it reusable for different models. The function will:
  1. Take (**```model, X, y, test_size```**) as parameters
  2. Split the X and y arrays into training and testing sets according to the **```test_size```** (i.e. test_size = 0.2 means that we want 20% of the data to be set aside for testing)
  3. Train the model
  4. Gather metrics
  5. Create a heatmap for the chosen dataframe
  6. Print the results
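
The idea of wrapping the steps in a function can be sketched in a compact form. The helper `evaluate_model` below is hypothetical (it is not the lesson's `run_linear_model`), it returns its metrics instead of printing and plotting them, and it runs on synthetic data so that it is self-contained. The `Ridge` and `Lasso` settings are arbitrary example values.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split

def evaluate_model(model, X, y, test_size=0.3):
    """Hypothetical helper: split, fit, predict and return a few metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, random_state=21
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "model": type(model).__name__,
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
        "mae": metrics.mean_absolute_error(y_test, y_pred),
        "rmse": np.sqrt(metrics.mean_squared_error(y_test, y_pred)),
    }

# Synthetic data (illustration only); the third feature has no effect on y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.0]) + 3.0 + rng.normal(scale=0.5, size=200)

# Because the split uses a fixed random_state, every model sees the same data
models = [
    linear_model.LinearRegression(),
    linear_model.Ridge(alpha=1.0),
    linear_model.Lasso(alpha=0.1),
]
results = [evaluate_model(m, X, y) for m in models]

for r in results:
    print("%-18s train R^2 = %.3f  test R^2 = %.3f  MAE = %.3f  RMSE = %.3f"
          % (r["model"], r["train_r2"], r["test_r2"], r["mae"], r["rmse"]))
```

Returning a dictionary rather than printing inside the function makes it easy to collect the results of several models into a list (or a dataframe) and compare them side by side.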

In [None]:
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP ENVIRONMENT <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# Import libraries
import pandas as pd
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from IPython.display import display
from sklearn import preprocessing
from sklearn.model_selection import train_test_split 
from scipy import stats
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import learning_curve
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

url = 'https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/modified/CaliforniaHousingDataModified.csv'

df = pd.read_csv(url)
#df = pd.read_csv('~/Downloads/CaliforniaHousingDataModified.csv')


# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP DATA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

# --------------------------------------------- RENAME FEATURES ---------------------------------------------
df.rename(columns = {'lattitude':'latitude', 't_rooms':'total_rooms'}, inplace=True)


# --------------------------------------------- ONE-HOT ENCODING ---------------------------------------------
df['ocean_proximity'] = pd.Categorical(df['ocean_proximity'])
df_dummies = pd.get_dummies(df['ocean_proximity'], drop_first = True)
df.drop(['ocean_proximity'], axis = 1, inplace = True)
df = pd.concat([df, df_dummies], axis=1)


# --------------------------------------------- DROP UNWANTED FEATURES ---------------------------------------------
df.drop(['population', 'households', 'proximity_to_store', 'ISLAND'], axis = 1, inplace = True)


# ------------------------------------- FIX MISSING DATA -------------------------------------
tb_med = df['total_bedrooms'].median(axis=0)
df['total_bedrooms'] = df['total_bedrooms'].fillna(value = tb_med)
df['total_bedrooms'] = df['total_bedrooms'].astype(int)
df.name = 'df'

# ------------------------------------- Z-SCORE -------------------------------------
z = np.abs(stats.zscore(df))
dfz = df[(z < 3).all(axis = 1)]
dfz.name = 'dfz'

# ------------------------------------- INTERQUARTILE RANGE -------------------------------------
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)
dfi = df[~((df < lower) | (df > upper)).any(axis = 1)]
dfi.name = 'dfi'
#dfi = dfi.drop(['NEAR BAY', 'NEAR OCEAN'], axis = 1)  # After applying IQR, the following features are now empty and can be dropped

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FUNCTIONS <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
def make_heatmap(df):
  corr = df.corr()
  plt.figure(figsize=(15 ,9))
  sns.heatmap(corr, annot=True, vmin = -1, vmax = 1, center = 0, fmt = '.1g', cmap = 'coolwarm')
  plt.show()
  plt.close()


def make_heading(heading = 'heading'):
  print('{0}:\n'.format(heading), '-' * 30)


def show_coef(df_in, model):
  coef_df = pd.DataFrame(model.coef_, df_in.columns, columns=['Coefficient'])
  make_heading('\n\nCoefficients')
  display(coef_df)
  make_heading('\n\nIntercept')
  display(model.intercept_)


def drop_features(df_in):
  df_in = df_in.drop([#'median_house_value',
                      'longitude',
                      'latitude',
                      #'housing_median_age',
                      'total_rooms',
                      'total_bedrooms',
                      #'median_income',
                      'INLAND',
                      #'NEAR BAY',
                      #'NEAR OCEAN'
                      ],
                      axis = 1
                      )
  return df_in


def plot_predictions(y_test, y_pred):
  fig, ax = plt.subplots()
  ax.scatter(y_test, y_pred, edgecolors=(0, 0, 0), alpha=0.5)
  ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'g--', lw=4)
  ax.set_xlabel('Actual')
  ax.set_ylabel('Predicted')
  ax.set_title("Ground Truth vs Predicted")
  plt.show()
  plt.close()


def run_linear_model(model, X, y, test_size = 0.3):
  # Split the dataset into training and testing arrays
  #   We take our X and y variables that we have just created and now split each one into a training set and testing
  #   set: X_train, X_test, y_train, y_test)
  X_train, X_test, y_train, y_test = train_test_split(X,
                                                      y,
                                                      test_size=test_size,    # reserve this percentage of our data for testing, use the rest for training
                                                      shuffle = True,
                                                      random_state = 21
                                                      )

  # Get the name of the model (for use in our display functions)
  model_name = type(model).__name__

  # Fit (or train) the model (have the model create coefficient multipliers which will be used to predict "y")
  model.fit(X_train, y_train)

  # Here we want to input our X_test array to make predictions based upon our trained model
  y_pred = model.predict(X_test)

  # Display the shape of each new array:
  print('\n' * 5, '*' * 50,' ', model_name,' ', '*' * 50)
  make_heading('The shape of each of our arrays after splitting into training and testing sets for dataframe \"{0}\"'.format(dfx.name))
  print('X_train shape: ', X_train.shape)
  print('X_test shape:  ', X_test.shape)
  print('y_train shape: ', y_train.shape)
  print('y_test shape:  ', y_test.shape)
  print('y_pred shape:  ', y_pred.shape)

  # Gather metrics on model performance
  model_test_score = model.score(X_test, y_test)  # R^2 of our model on test data (for regressors, .score() returns R^2, not classification accuracy)
  model_train_score = model.score(X_train, y_train)  # R^2 of our model on training data
  mean_ae = metrics.mean_absolute_error(y_test, y_pred)  # Mean absolute error on the test set (average distance between y_test and y_pred on data the model has not seen)
  median_ae = metrics.median_absolute_error(y_test, y_pred)  # Median absolute error (less sensitive to outliers than the mean)
  rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))  # Root mean squared error
  ev = metrics.explained_variance_score(y_test, y_pred)  # Explained variance score: the proportion of the variation (dispersion) in y_test that the model accounts for
  predictive_accuracy = metrics.r2_score(y_test, y_pred)  # R^2 computed from the predictions (the same value as model_test_score above)

  # Show the coefficients
  show_coef(features_dfx, model)

  # Show metrics on model performance (training and testing)
  make_heading('\n\nAccuracy and Error for training/testing on the {0} model for dataframe \"{1}\"'.format(model_name, dfx.name))
  print("Training R^2                      (X_train, y_train):  %.2f" % model_train_score)
  print("Test R^2                          (X_test, y_test):    %.2f" % model_test_score)
  print("R^2 via r2_score                  (y_test, y_pred):    %.2f" % predictive_accuracy)
  print("Explained Variance (1 is best)    (y_test, y_pred):    %.2f" % ev)
  print("Mean Absolute Error on Test Set   (y_test, y_pred):    %.2f" % mean_ae)
  print("Median Absolute Error on Test Set (y_test, y_pred):    %.2f" % median_ae)
  print("RMSE on Test Set                  (y_test, y_pred):    %.2f" % rmse)
  print('=' * 50, '\n\n\n')

  plot_predictions(y_test, y_pred)
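
The metrics gathered in `run_linear_model` can be checked on a small example. The sketch below is a minimal illustration, assuming synthetic data rather than the housing dataset; the names `X_demo`, `y_demo`, and `demo_model` are hypothetical and not part of the notebook. It shows that, for a scikit-learn regressor, `model.score(X_test, y_test)` and `metrics.r2_score(y_test, y_pred)` report the same R^2 value.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the housing dataset)
rng = np.random.default_rng(21)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Same splitting call as in run_linear_model, with a float test_size
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=21
)

# Fit on the training split, predict on the held-out split
demo_model = linear_model.LinearRegression().fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# For regressors, .score() is R^2, so these two numbers agree
score_r2 = demo_model.score(X_te, y_te)
metric_r2 = metrics.r2_score(y_te, y_hat)

# RMSE is in the same units as the target, unlike R^2
rmse_demo = np.sqrt(metrics.mean_squared_error(y_te, y_hat))

print("R^2 from model.score:", round(score_r2, 4))
print("R^2 from r2_score:   ", round(metric_r2, 4))
print("RMSE on test split:  ", round(rmse_demo, 4))
```

Because the two R^2 values are identical, the "test score" and the `r2_score` line in the printed summary carry the same information; the error metrics (MAE, median AE, RMSE) are the ones that add a sense of scale.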

In [None]:
# ---------------------------- Identify Target Variable ----------------------------
# Create a copy of the dataframe with only the desired features (dropping the target feature)
features_dfx = dfx.copy()
features_dfx = features_dfx.drop(['median_house_value'], axis = 1)  # We must drop this feature because it is our target 

# Choose which features to drop from the model (to keep a feature in the model, comment it out by placing a '#' in front of it)
features_dfx = features_dfx.drop([#'longitude',
                                  #'latitude',
                                  #'housing_median_age',
                                  #'total_rooms',
                                  #'total_bedrooms',
                                  #'median_income',
                                  #'INLAND',
                                  'NEAR BAY',
                                  'NEAR OCEAN'
                                  ],
                                 axis = 1
                                 )

# ---------------------------- SPLIT THE DATASET ----------------------------
# We extract the values of our two datasets into "X" and "y" variables for use in later functions

# "X" holds the independent (explanatory) variables: the "features" that we use to try to predict "y"
X = features_dfx.values

# "y" is the dependent variable (the "label"); the actual value that we are trying to learn to predict. 
#   During training, each prediction is compared to the actual value, and the difference between the two
#   (the "loss") is accumulated over all of the training samples.
y = dfx['median_house_value'].values

# Define your model(s)
# There are multiple hyperparameters that we can set (change them to see effects)
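# Note: the "normalize" argument was removed in scikit-learn 1.2, so it is not passed here.
#   If you want scaled features, scale them before fitting (for example with StandardScaler).
modela = linear_model.LinearRegression(n_jobs = -1,
                                       fit_intercept=True
                                       )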

modelb = linear_model.Ridge(alpha = 0.05)  # "alpha" sets the regularization strength: larger values shrink the coefficients toward zero

modelc = linear_model.LassoCV(fit_intercept = True,
                              max_iter=1000,
                              cv = 10,
                              n_jobs=-1,
                              random_state=1,
                              #selection = 'random',  
                              )

modeld = linear_model.HuberRegressor(epsilon=1.7,
                                     alpha = 0.01,
                                     fit_intercept=True
                                     )

# Show a heatmap for reference
make_heatmap(dfx)

# Train and compare metrics on the models
test_size = 0.33
run_linear_model(modela, X, y, test_size = test_size)
run_linear_model(modelb, X, y, test_size = test_size)
run_linear_model(modelc, X, y, test_size = test_size)
run_linear_model(modeld, X, y, test_size = test_size)
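
To see what the `alpha` hyperparameter does, it can help to fit the same regularized model several times with different values. The sketch below is one possible illustration, assuming synthetic data; `X_demo`, `y_demo`, and `coef_norms` are hypothetical names, and `StandardScaler` inside a `Pipeline` is a swapped-in way to scale features, not what the notebook itself does. It fits `Ridge` at three `alpha` values and records the size of the coefficient vector each time.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (an assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([3.0, 0.2, 0.05, 0.001]) + rng.normal(size=300)

coef_norms = {}
for alpha in (0.05, 5.0, 500.0):
    # Scale first so the penalty treats every feature comparably,
    # then fit Ridge with the chosen regularization strength
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    pipe.fit(X_demo, y_demo)

    # make_pipeline names each step after its lowercased class name
    coefs = pipe.named_steps["ridge"].coef_
    coef_norms[alpha] = float(np.linalg.norm(coefs))
    print("alpha =", alpha, "| coefficient norm =", round(coef_norms[alpha], 4))
```

The coefficient norm shrinks as `alpha` grows, which is the effect the `alpha = 0.05` setting on `modelb` controls. A pipeline like this can be passed to `run_linear_model` only if the coefficient display is adapted, since a `Pipeline` does not expose `coef_` directly.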