# Lab 4 - Model Tuning, Evaluation, and Ensemble Methods


##Model Tuning, Evaluation, and Ensemble Methods
We've learned a lot of different ML models up to this point, but we haven't had a means of testing how well they actually work. Remember, when we build models we are more concerned with how they perform on unseen data. If we are only given one data set, how can we see how our model performs on new data if we don't have any? Tuning and evaluation methods give us the means of testing our models so we can adjust their parameters for optimal performance when deployed. Let's revisit some of our earlier models but this time we'll split our data into training and testing sets so we can evaluate, iterate, and optimize our models.

We also learned about Ensemble Methods. Ensembles are like getting a second opinion on our models -- or sometimes hundreds of opinions. This allows us to build several models with slightly different data, parameters, and so on. We can then combine their predictions (by averaging or voting) to get a result that is usually more reliable than any single model.

##Shortcuts
*   Use the "+ Code" button in the top left corner to add a new block for running **code**, or the "+ Text" button to add a block for writing **text**
*   I suggest looking at "Tools-->Keyboard Shortcuts..." for additional ways to run Colaboratory but here are a few useful ones:
> **Ctrl+F9** - Run all blocks   
> **Ctrl+Enter** - Run selected block   
> **Alt+Enter** - Run block and add a new block beneath   
> **Shift+Enter** - Run block and select next block   
> **Ctrl+F8** - Run all blocks before selected block   
> **Ctrl+F10** - Run selected block and all following blocks   
> **Ctrl+M+Y** - Convert selected block to a *code* block   
> **Ctrl+M+M** - Convert selected block to a *text* block

Also useful, Colaboratory supports code completion. Start typing code and press the **Tab** key. A drop down will appear with likely code based on what you typed. If only one possible command exists, it should complete it for you automatically.

Even better yet, if there is an error produced by your code, Colaboratory will provide a button at the bottom of the code output to search StackOverflow for an answer!

**Remember**: Code blocks need to be run in succession or they might produce errors!


In [0]:
#Install packages not native to Colaboratory
!apt-get -qq install -y graphviz
!pip -q install graphviz pydot
import pydot

In [0]:
#Load our packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz
from google.colab import files
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans
from sklearn import datasets
%matplotlib inline 

In [0]:
#Upload the 'HousingData.csv' file from your local computer 
files.upload()

In [0]:
#Assign data to objects
raw = pd.read_csv('HousingData.csv')

In [0]:
#Repeat the data cleaning activities from Lab 1
#Fill missing numeric values with the column mean
num_cols = ['CRIM','INDUS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
raw[num_cols] = raw[num_cols].fillna(raw[num_cols].mean())

#Fill missing values in ZN and CHAS with the most common value (assign back rather than using inplace on a column)
for col in ['ZN','CHAS']:
    raw[col] = raw[col].fillna(raw[col].mode()[0])

##Linear Regression - Test/Train

Let's rerun our code for linear regression like we did in Lab 1:

In [0]:
#Build and visualize a simple linear regression
X = raw[['RM']].values 
y = raw['MEDV'].values

lm = LinearRegression()
lm.fit(X, y)

def lin_regplot(X, y, model):
    plt.figure(figsize=(15,10))
    plt.scatter(X, y, c='navy', edgecolor='white', s=55)
    plt.plot(X, model.predict(X), color='black', lw=2.5)
    return None

lin_regplot(X, y, lm)
plt.xlabel('Avg. Rooms')
plt.ylabel('Price in $1,000')
plt.title('Home Price by Avg. Rooms')
plt.show()

Remember, the first time we executed this code we used our entire data set to build the regression model. What if we were to remove some of our data set observations at random and use those to test how well our model works? Let's try that approach and see how well our model can work at predicting prices of observations that weren't used to build it. 

Since we only have one data set, how do we do this? Well, the simplest way is to use the train/test split from our lecture. Let's give that a try and then we can measure the **Mean Squared Error** between our predictions and the actual prices. Furthermore, let's use the rest of our features instead of just the number of rooms.

In [0]:
#Train/test split, build multiple linear regression model and calculate mean squared error
X = raw.iloc[:, :-1].values 
y = raw['MEDV'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

mlm = LinearRegression()

mlm.fit(X_train, y_train)
y_train_pred = mlm.predict(X_train) 
y_test_pred = mlm.predict(X_test)

Visualizing all the predictor features against the target is difficult, so we will rely on the **MSE** as our guide to fit. We can see both the **MSE** on the training and testing data as shown:

In [0]:
#Calculate MSE
print('MSE train: %.3f, test: %.3f' % (mean_squared_error(y_train, y_train_pred), 
                                       mean_squared_error(y_test, y_test_pred)))
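If it helps to see what `mean_squared_error` is doing under the hood, here is a minimal sketch that computes the **MSE** by hand and compares it to scikit-learn's result. The numbers are made-up toy values (an assumption for illustration), not rows from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values for illustration only (not taken from the housing data)
y_true_demo = np.array([24.0, 21.6, 34.7, 33.4])
y_pred_demo = np.array([25.0, 20.6, 32.7, 33.4])

# MSE is the average of the squared differences between actual and predicted values
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print('manual: %.3f, sklearn: %.3f' % (mse_manual, mse_sklearn))
```

Because the errors are squared, one prediction that is off by 2 contributes as much as four predictions that are each off by 1.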

Let's see what the coefficients for our new model are, along with the **R-squared** statistics for our training and testing data.

In [0]:
#Print coefficients
pd.DataFrame(list(zip(raw.columns[:-1], mlm.coef_)), columns=['features','coef']) #Exclude the target column (MEDV) from the feature names

In [0]:
#Calculate R-squared
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred), 
                                       r2_score(y_test, y_test_pred)))

So remember the coefficient for **RM** in our first model was ~9.1. Now in the presence of additional features to explain median home price, the value drops to ~4.1. This is typical in linear regression because the presence (or absence) of other predictor features modifies the estimates -- we have additional information and aren't viewing things inside a "vacuum."

The **R-squared** values have also increased, meaning this model with 13 predictors explains more variation in the price than the model with just number of rooms. Don't get too excited -- **R-squared** on the training data will *never decrease* when we add features, which makes intuitive sense because we are adding additional information. Even if a new feature isn't that predictive, it is still more than we knew without it. The **Adjusted R-squared** is a statistic that penalizes additional predictors in the model and gives a better idea of fit. Investigate on your own!
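As a starting point for that investigation, here is a small sketch of the usual **Adjusted R-squared** formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper `adjusted_r2` and the synthetic data from `make_regression` are assumptions for illustration. In the lab you would pass in the R-squared, row count, and feature count from the housing model instead.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n_obs, n_features):
    # Penalize R-squared for the number of predictors used (hypothetical helper)
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Synthetic stand-in data: 200 rows, 13 features, only 5 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 n_informative=5, noise=20.0, random_state=0)

demo_lm = LinearRegression().fit(X_demo, y_demo)
r2_demo = r2_score(y_demo, demo_lm.predict(X_demo))
adj_r2_demo = adjusted_r2(r2_demo, X_demo.shape[0], X_demo.shape[1])

print('R^2: %.3f, adjusted R^2: %.3f' % (r2_demo, adj_r2_demo))
```

The adjusted value is always a little lower than plain R-squared whenever the fit isn't perfect, and the gap grows as we add predictors that don't pull their weight.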

##Logistic Regression - Test/Train
Now let's revisit our logistic regression model and again try and test how well our model works on unseen observations. Will it be able to classify them correctly as "high" and "low priced" homes? Remember we had to recode our target variable **MEDV** to take on values of 0 ("low priced") and 1 ("high priced") so let's perform that operation again before we continue:

In [0]:
#Binary coding of target  
raw['MEDV'] = np.where(raw['MEDV'] < 21.2, 0, 1)

Let's start simple like we did with our linear regression example and use **RM** to predict the probability of a low or high priced home.

In [0]:
#Build and visualize a simple logistic regression
X = raw[['RM']].values 
y = raw['MEDV'].values

lr = LogisticRegression()
lr.fit(X, y)

def log_regplot(X, y, model):
    #The fitted model isn't used here; seaborn fits its own logistic curve for the plot
    plt.figure(figsize=(15,10))
    sns.regplot(x=X.ravel(), y=y, logistic=True, color='navy')
    return None

log_regplot(X, y, lr)
plt.xlabel('Avg. Rooms')
plt.ylabel('Probability')
plt.show()

Similar to the linear regression example, let's add back in the rest of the predictor features and split our data set into training and testing sets. Since this isn't a regression problem we'll need to use a different measure to determine how well our model classifies low and high priced homes. This is where the **confusion matrix** we spoke about in our lecture becomes useful.

In [0]:
#Train/test split, build multiple logistic regression model and build confusion matrices
X = raw.iloc[:, :-1].values 
y = raw['MEDV'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

mlr = LogisticRegression()

mlr.fit(X_train, y_train)
y_train_pred = mlr.predict(X_train) 
y_test_pred = mlr.predict(X_test)

trainmtx = confusion_matrix(y_train, y_train_pred)
testmtx = confusion_matrix(y_test, y_test_pred)

In [0]:
#Training data confusion matrix
print(trainmtx)

In [0]:
#Testing data confusion matrix
print(testmtx)

Remember that the values along the diagonal indicate where there is agreement between our predictions and our actual target values of 0 and 1. While it looks like we are doing well for the most part, we could probably improve these. What steps do you think would help to do this?

Alternatively, we can also check the **accuracy** of our model using the code below:

In [0]:
#Calculate model accuracy
print('Accuracy train: %.3f, test: %.3f' % (accuracy_score(y_train, y_train_pred),
                                           (accuracy_score(y_test, y_test_pred))))

So our accuracy score, which is **(True Positives + True Negatives) / All Observations**, is really not bad in both our training and testing data sets. This is probably our best measure since **R-squared** won't work with logistic regression (although a measure called *pseudo R-squared* exists, we won't cover it). Is there anything we could do to improve accuracy?
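To tie the **confusion matrix** and the **accuracy** score together, here is a minimal sketch that pulls the four cells out of a confusion matrix and rebuilds accuracy from them. The labels below are a made-up toy example (an assumption), not our housing predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy labels for illustration only: 0 = "low priced", 1 = "high priced"
y_actual_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_guess_demo = np.array([0, 1, 0, 1, 1, 0, 1, 0])

cm_demo = confusion_matrix(y_actual_demo, y_guess_demo)

# For a binary problem, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm_demo.ravel()

# Accuracy is the diagonal (correct predictions) divided by all observations
acc_manual = (tp + tn) / cm_demo.sum()

print(cm_demo)
print('manual: %.3f, sklearn: %.3f' % (acc_manual, accuracy_score(y_actual_demo, y_guess_demo)))
```

Rows are the actual classes and columns are the predicted classes, so the off-diagonal cells tell us *which kind* of mistake the model is making.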

##Decision Trees - Test/Train
Finally, let's try our train/test split technique on decision trees.

Decision trees are also widely used as part of **Ensemble Methods** where we build several trees and average their predictions together to arrive at a quantitative or qualitative outcome that is almost always more accurate than a single tree alone (remember these methods are called **Bagging, Boosting, and Random Forest**).

Since we are already becoming comfortable with the Boston Housing Prices data set, let's continue to use that. We already converted our target variable, **MEDV**, to a binary outcome so for the sake of time we will use decision trees simply for classification.

In [0]:
#Build and visualize a decision tree classifier
X = raw.iloc[:, :-1].values
y = raw['MEDV'].values

dt = DecisionTreeClassifier()
dt.fit(X, y)

dt_viz = tree.export_graphviz(dt, 
                              out_file=None, 
                              feature_names=raw.columns[0:13],  
                              class_names=['low','high'],  
                              filled=True, 
                              rounded=True,
                              proportion=True,
                              special_characters=True)

graph = graphviz.Source(dt_viz)
graph.format = 'png'
graph

As we've done before, let's split our data -- one set for training the tree and another to test its performance:

In [0]:
#Train/test split, build decision tree, and evaluate accuracy
X = raw.iloc[:, :-1].values 
y = raw['MEDV'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

dt = DecisionTreeClassifier(max_depth=3,
                            min_samples_leaf=25,
                            random_state=0) #These can be adjusted but don't touch "random_state" as this locks in our randomization

dt.fit(X_train, y_train)
y_train_pred = dt.predict(X_train) 
y_test_pred = dt.predict(X_test)

#Calculate model accuracy
print('Accuracy train: %.3f, test: %.3f' % (accuracy_score(y_train, y_train_pred),
                                           (accuracy_score(y_test, y_test_pred))))

Play with the stopping/pruning parameters to see how they affect the train/test accuracy. Typically our training accuracy is better than our testing accuracy. Again, a large gap between the two probably indicates *overfitting*, but hopefully we can *tune* our model to achieve the best possible testing accuracy. Try a few different approaches. Does anything seem to work best?
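One way to organize that experimenting is to loop over a few values of a stopping parameter and print the train and test accuracy side by side. The sketch below does this for `max_depth` on synthetic stand-in data from `make_classification` (an assumption so the block runs on its own). In the lab you would reuse the `X_train`, `X_test`, `y_train`, and `y_test` objects from above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the housing data)
X_demo, y_demo = make_classification(n_samples=500, n_features=13,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

depth_results = []
for depth in [1, 2, 3, 5, 8, None]:  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, clf.predict(X_tr))
    test_acc = accuracy_score(y_te, clf.predict(X_te))
    depth_results.append((depth, train_acc, test_acc))
    print('max_depth=%s  train: %.3f, test: %.3f' % (depth, train_acc, test_acc))
```

Watch for the point where the training accuracy keeps climbing but the testing accuracy stalls or drops. That's usually where the tree starts memorizing the training data.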

##Your Turn!!
Now that we've gone through how to test our models, give it a shot on your own. Using the **Admission_Predict.csv** data set, prepare a model to predict our target variable (**Chance of Admit**) using one of the models above (or more if you have time). You'll have to load and clean the data like we did before, but you can reuse the same code from previous labs.

In [0]:
#Upload the 'Admission_Predict.csv' file from your local computer 
files.upload()

**What model did you choose and why? Write your code below adjusting model parameters as you see fit. What is your final model's accuracy? Are you happy with it or do you think you can do better?**

##Ensemble Methods
Ensemble methods are like getting a second opinion from the doctor, only this time the "doctor" is a machine learning model. These techniques allow us to get the most out of our training data, especially if we only have a limited amount. There are several techniques for building ensembles, but we will cover three of the most common below. While **Random Forest** ensembles are limited to decision tree algorithms, both **Bagging** and **Boosting** can be applied to other models.


###Bagging

Remember with bagging, we are simulating several data sets by sampling from our one set (with replacement) and building multiple trees with the new data sets then testing them with the observations we didn't include in each subset. Thankfully running a bagging ensemble is really straightforward in scikit-learn.
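Before handing things off to scikit-learn, here is a minimal sketch of the sampling step described above, done by hand with NumPy for a single tree. The ten row indices are a made-up example (an assumption), standing in for the rows of our data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 10
row_idx = np.arange(n_obs)

# Bootstrap sample: same size as the original data, drawn WITH replacement,
# so some rows show up more than once and others not at all
boot_idx = rng.choice(row_idx, size=n_obs, replace=True)

# Rows that were never drawn are "out-of-bag" and can be used to test this tree
oob_idx = np.setdiff1d(row_idx, boot_idx)

print('bootstrap sample:', boot_idx)
print('out-of-bag rows: ', oob_idx)
```

Each tree in the ensemble gets its own bootstrap sample, so each tree also gets its own (different) set of out-of-bag rows. With large data sets roughly a third of the rows end up out-of-bag for any given tree.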

In [0]:
#Train a decision tree with bagging and estimate the Out-of-bag Score
bag_tree = BaggingClassifier(dt, 
                             oob_score=True, 
                             random_state=0)

bag_tree.fit(X, y)
print(bag_tree.oob_score_)

So using bagging actually *lowers* our accuracy. Why do you think this is happening in this case? With OOB error, each tree we build might have a different number of observations used to test since we are only using those *not* used to build the tree. What if we just go back to using the accuracy score we've used previously?

In [0]:
#Train a decision tree with bagging and estimate the accuracy
bag_tree = BaggingClassifier(dt, random_state=0)

bag_tree.fit(X, y)
y_pred = bag_tree.predict(X) 

#Calculate model accuracy
print('Accuracy: %.3f' % (accuracy_score(y, y_pred)))

This is a *much* better outcome! While our training accuracy only improved a little, our testing accuracy has improved a lot more, meaning bagging has reduced our overfitting. Why do you think that is based on what you know about bagging?
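If you'd like to see training and testing accuracy reported separately for a single tree and its bagged version, here is one way to sketch it. The data is a synthetic stand-in from `make_classification` and the choice of `n_estimators=50` is an assumption. In the lab you would plug in the housing train/test split instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the housing data)
X_demo, y_demo = make_classification(n_samples=500, n_features=13,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

single = DecisionTreeClassifier(max_depth=3, min_samples_leaf=25, random_state=0)
bagged = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3, min_samples_leaf=25, random_state=0),
    n_estimators=50, oob_score=True, random_state=0)

bag_scores = {}
for name, model in [('single tree', single), ('bagged trees', bagged)]:
    model.fit(X_tr, y_tr)  # fit on the training rows only
    bag_scores[name] = (accuracy_score(y_tr, model.predict(X_tr)),
                        accuracy_score(y_te, model.predict(X_te)))
    print('%-12s train: %.3f, test: %.3f' % ((name,) + bag_scores[name]))

# The OOB score is estimated from training rows each tree did not see
print('OOB score: %.3f' % bagged.oob_score_)
```

Comparing the OOB score to the test accuracy is a handy sanity check, since both are estimates of how the ensemble does on observations it wasn't built from.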

###Random Forest

Random forests are similar to bagging, only now we are also taking random subsets of the *features* rather than using them all. This has the effect of *decorrelating* the trees from each other. If one predictor feature is really strong at predicting the outcome, it will be used every time in the bagging ensemble, meaning that each of the trees has some level of *sameness* to it. Let's see how random forest works with our classification problem using default settings (10 trees in older scikit-learn releases and 100 since version 0.22, no limit set through `min_samples_leaf` or `min_samples_split`, and all features eligible for random selection).

In [0]:
#Train a decision tree with random forest and estimate the accuracy
rf_tree = RandomForestClassifier(random_state=0)

rf_tree.fit(X, y)
y_pred = rf_tree.predict(X)

#Calculate model accuracy
print('Accuracy: %.3f' % (accuracy_score(y, y_pred)))

Holy smokes! That is some good classifying, and it doesn't appear to be hurting our accuracy either -- it's actually doing *better* and getting all predictions correct. As an analyst, you should be suspicious of these types of perfect outcomes, especially since the code above scores the model on the same observations it was fit on. What do you think -- should we be fine with this or investigate further?
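One way to investigate further is to hold back a test set and compare the forest's accuracy on rows it has seen against rows it hasn't. This is only a sketch on synthetic stand-in data from `make_classification` (an assumption), and `n_estimators=100` is set explicitly so the result doesn't depend on which scikit-learn version you have.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the housing data)
X_demo, y_demo = make_classification(n_samples=500, n_features=13,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_tr, y_tr)  # the forest only ever sees the training rows

rf_train_acc = accuracy_score(y_tr, rf_demo.predict(X_tr))
rf_test_acc = accuracy_score(y_te, rf_demo.predict(X_te))

print('Accuracy train: %.3f, test: %.3f' % (rf_train_acc, rf_test_acc))
```

Fully grown trees can memorize their training rows, so a near-perfect training accuracy on its own tells us very little. The testing accuracy is the number to trust.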

###Boosting

Let's try to build a boosted decision tree model now. With bagging we build all of our trees independently, each on a different subset of the data. Random forest was similar, only we randomly selected features as well. With boosting we build trees *sequentially*, meaning that we build one, see how it performs, and then make some tweaks to try and improve the misclassification rate on those observations we predicted incorrectly. Let's try it using default settings.

In [0]:
#Train a decision tree with boosting and estimate the accuracy
boo_tree = GradientBoostingClassifier(random_state=0)

boo_tree.fit(X, y)
y_pred = boo_tree.predict(X) 

#Calculate model accuracy
print('Accuracy: %.3f' % (accuracy_score(y, y_pred)))

It looks like both random forest and boosted decision tree models are performing well compared to our single decision tree and the bagged version. I encourage you to explore the settings on the scikit-learn website so you can play around with (and maybe even improve) the results.
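As one example of something to explore, the sketch below uses `staged_predict` to show how a boosted model's testing accuracy changes as trees are added one after another, which makes the *sequential* nature of boosting visible. As with the other sketches, the data is a synthetic stand-in from `make_classification` (an assumption), not the housing data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the housing data)
X_demo, y_demo = make_classification(n_samples=500, n_features=13,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=0)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1 tree, 2 trees, ... 100 trees
staged_test_acc = [accuracy_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]

for n_trees in (1, 10, 50, 100):
    print('%3d trees -> test accuracy: %.3f' % (n_trees, staged_test_acc[n_trees - 1]))
```

If the testing accuracy levels off well before the last tree, that's a hint you could get away with a smaller `n_estimators`.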
