# <font color=blue>Model Validation in Python</font> 
Machine learning models are easier to implement now than ever before. Without proper validation, however, the results of running new data through a model might not be as accurate as expected. Model validation allows analysts to confidently answer the question: how good is your model? We will answer this question for classification models using the complete set of tic-tac-toe endgame scenarios, and for regression models using FiveThirtyEight’s ultimate Halloween candy power ranking dataset. In this course, we will cover the basics of model validation, discuss various validation techniques, and begin to develop tools for creating validated, high-performing models.

## <font color=red>01 - Basic Modeling in scikit-learn </font> 
 Before we can validate models, we need an understanding of how to create and work with them. This chapter provides an introduction to running regression and classification models in scikit-learn. We will use this model building foundation throughout the remaining chapters. 

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Seen vs. unseen data</h1><div class=""><p><p>Models tend to have higher accuracy on observations they have seen before. In the candy dataset, predicting the popularity of Skittles will likely have higher accuracy than predicting the popularity of Andes Mints, because Skittles is in the dataset and Andes Mints is not. </p>
<p>You've built a model based on 50 candies using the dataset <code>X_train</code> and need to report how accurate the model is at predicting the popularity of the 50 candies the model was built on, and the 35 candies (<code>X_test</code>) it has never seen. You will use the mean absolute error, <code>mae()</code>, as the accuracy metric.</p></div></div>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
candy_df = pd.read_csv('./data/candy-data.csv')
X_candy = candy_df.drop(['competitorname', 'winpercent'], axis=1)
y_candy = candy_df.winpercent
X_train, X_test, y_train, y_test = train_test_split(X_candy,y_candy, test_size=0.25, random_state=42)

import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error as mae
model = LinearRegression()

In [3]:
# The model is fit using X_train and y_train
model.fit(X_train, y_train)

# Create vectors of predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Train/Test Errors
train_error = mae(y_true=y_train, y_pred=train_predictions)
test_error = mae(y_true=y_test, y_pred=test_predictions)

# Print the error for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.".format(test_error))

Model error on seen data: 7.93.
Model error on unseen data: 8.85.


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Set parameters and fit a model</h1><div class=""><p><p>Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a <em>continuous</em> variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. To predict this value (the win-percentage), you will use a <strong>regression</strong> model.</p>
<p>In this exercise, you will specify a few parameters using a random forest regression model <code>rfr</code>.</p></div></div>

In [4]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()

In [5]:
# Set the number of trees
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

# Fit the model
rfr.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=6,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=1111, verbose=0, warm_start=False)

You have updated parameters after the model was initialized. This approach is helpful when you need to update parameters. Before making predictions, let's see which candy characteristics were most important to the model.
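
Assigning attributes directly works, but scikit-learn estimators also provide `set_params()`, which validates parameter names. The following is a minimal sketch of that alternative, not the method used in the exercise; `rfr_demo` and the misspelled `n_tree` are illustrative names introduced here.

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative model; `rfr_demo` is a hypothetical name, not from the exercise
rfr_demo = RandomForestRegressor()

# set_params() updates several parameters at once and returns the estimator
rfr_demo.set_params(n_estimators=100, max_depth=6, random_state=1111)

# An unknown parameter name raises ValueError instead of being silently stored
try:
    rfr_demo.set_params(n_tree=100)  # deliberate typo for illustration
except ValueError:
    caught_bad_name = True
else:
    caught_bad_name = False

print(rfr_demo.get_params()["max_depth"])
print(caught_bad_name)
```

By contrast, a mistyped attribute assignment such as `rfr_demo.n_tree = 100` would simply create a new, unused attribute, which is one reason to prefer `set_params()` when updating a model after initialization.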

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Feature importances</h1><div class=""><p><p>Although some candy attributes, such as chocolate, may be extremely popular, it doesn't mean they will be <em>important</em> to model prediction. After a random forest model has been fit, you can review the model's attribute, <code>.feature_importances_</code>, to see which variables had the biggest impact. You can check how important each variable was in the model by looping over the feature importance array using <code>enumerate()</code>.</p>
<p>If you are unfamiliar with Python's <code>enumerate()</code> function, it can loop over a list while also creating an automatic counter.</p></div></div>

In [6]:
# Fit the model using X and y
rfr.fit(X_train, y_train)

# Print how important each column is to the model
for i, item in enumerate(rfr.feature_importances_):
    # Use i and item to print out the feature importance of each column
    print("{0:s}: {1:.2f}".format(X_train.columns[i], item))

chocolate: 0.40
fruity: 0.03
caramel: 0.03
peanutyalmondy: 0.07
nougat: 0.00
crispedricewafer: 0.03
hard: 0.01
bar: 0.07
pluribus: 0.02
sugarpercent: 0.16
pricepercent: 0.18


<div class="listview__content"><div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Classification predictions</h1><div class=""><p><p>In model validation, it is often important to know more about the predictions than just the final classification. When predicting who will win a game, most people are also interested in <em>how likely</em> it is a team will win. </p>
<table>
<thead>
<tr>
<th>Probability</th>
<th>Prediction</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 &lt; .50</td>
<td>0</td>
<td>Team Loses</td>
</tr>
<tr>
<td>.50 +</td>
<td>1</td>
<td>Team Wins</td>
</tr>
</tbody>
</table>
<p>In this exercise, you look at the methods, <code>.predict()</code> and <code>.predict_proba()</code> using the <code>tic_tac_toe</code> dataset. The first method will give a prediction of whether Player One will win the game, and the second method will provide the probability of Player One winning. Use <code>rfc</code> as the random forest classification model.</p></div></div></div>

In [7]:
from sklearn.preprocessing import OneHotEncoder

In [8]:
tic_tac_toe = pd.read_csv('./data/tic-tac-toe.csv')
tic_tac_toe = tic_tac_toe.replace(["positive", 'negative'], [1,0])
X = tic_tac_toe.drop('Class', axis=1)
OneCode =  OneHotEncoder() 
OneCode.fit(X)
X_tic = OneCode.transform(X).toarray()
y_tic = tic_tac_toe.Class

X_train, X_test, y_train, y_test = train_test_split(X_tic,y_tic, test_size=0.25, 
                                                    random_state=42)

In [9]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

In [10]:
# Fit the rfc model. 
rfc.fit(X_train, y_train)

# Create arrays of predictions
classification_predictions = rfc.predict(X_test)
probability_predictions = rfc.predict_proba(X_test)

# Print out count of binary predictions
print(pd.Series(classification_predictions).value_counts())

# Print the first value from probability_predictions
print('The first predicted probabilities are: {}'.format(probability_predictions[0]))

1    160
0     80
dtype: int64
The first predicted probabilities are: [0.2 0.8]


<p class="">Well done! You can see there were 160 observations where Player One was predicted to win the Tic-Tac-Toe game.  Also, note that each row of the <code>probability_predictions</code> array contains only two values because you only have two possible responses (win or lose). Remember these two methods, as you will use them a lot throughout this course.</p>
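
To make the relationship between the two methods concrete, the sketch below checks that, for a random forest classifier, the label returned by `.predict()` is the class with the highest value in `.predict_proba()`. It runs on synthetic data from `make_classification` rather than the tic-tac-toe data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for the tic-tac-toe features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=25, random_state=1111)
clf_demo.fit(X_demo, y_demo)

# One row per observation, one column per class (ordered as in clf_demo.classes_)
proba = clf_demo.predict_proba(X_demo)
labels = clf_demo.predict(X_demo)

# The predicted label is the class with the highest predicted probability
labels_from_proba = clf_demo.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels, labels_from_proba))

# Each row of probabilities sums to 1
print(np.allclose(proba.sum(axis=1), 1.0))
```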

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Reusing model parameters</h1><div class=""><p><p>Replicating model performance is vital in model validation.  Replication is also important when sharing models with co-workers, reusing models on new data or asking questions on a website such as <a href="https://stackoverflow.com/" target="_blank" rel="noopener noreferrer">Stack Overflow</a>. You might use such a site to ask other coders about model errors, output, or performance. The best way to do this is to replicate your work by reusing model parameters. </p>
<p>In this exercise, you use various methods to recall which parameters were used in a model.</p></div></div>

In [11]:
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Print the classification model
print(rfc)

# Print the classification model's random state parameter
print('The random state is: {}'.format(rfc.random_state))

# Print all parameters
print('Printing the parameters dictionary: {}'.format(rfc.get_params()))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=1111, verbose=0,
            warm_start=False)
The random state is: 1111
Printing the parameters dictionary: {'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0, 'warm_start': False}


Recalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on which parameters were used, and there is no way to replicate a model without keeping track of the parameters used!
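
One way to act on this, sketched under the assumption that you want a fresh, unfitted copy of a model with identical settings, is to feed the parameter dictionary back into the constructor or to use `sklearn.base.clone`. The names `original`, `rebuilt`, and `copied` are illustrative.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

original = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Option 1: unpack the saved parameter dictionary into a new estimator
saved_params = original.get_params()
rebuilt = RandomForestClassifier(**saved_params)

# Option 2: clone() copies the parameters but not any fitted state
copied = clone(original)

print(rebuilt.get_params() == saved_params)
print(copied.get_params() == saved_params)
```

Because the `random_state` is part of the saved parameters, fitting `rebuilt` or `copied` on the same data should reproduce the original model's results.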

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Random forest classifier</h1><div class=""><p><p>This exercise reviews the four modeling steps discussed throughout this chapter using a random forest classification model. You will:</p>
<ol>
<li>Create a random forest classification model.</li>
<li>Fit the model using the <code>tic_tac_toe</code> dataset.</li>
<li>Make predictions on whether Player One will win (1) or lose (0) the current game.</li>
<li>Finally, you will evaluate the overall accuracy of the model.</li>
</ol>
<p>Let's get started!</p></div></div>

In [12]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

In [13]:
# Fit rfc using X_train and y_train
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=1111, verbose=0,
            warm_start=False)

In [14]:
# Create predictions on X_test
predictions = rfc.predict(X_test)
print(predictions[0:5])

[1 1 1 1 0]


In [15]:
# Print model accuracy using score() and the testing data
print(rfc.score(X_test, y_test))

0.9166666666666666


## <font color=red>02 - Validation Basics </font> 
 This chapter focuses on the basics of model validation. From splitting data into training, validation, and testing datasets, to creating an understanding of the bias-variance tradeoff, we build the foundation for the techniques of K-Fold and Leave-One-Out validation practiced in chapter three. 

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Create one holdout set</h1><div class=""><p><p>Your boss has asked you to create a simple random forest model on the <code>tic_tac_toe</code> dataset. She doesn't want you to spend much time selecting parameters; rather she wants to know how well the model will perform on future data. For future Tic-Tac-Toe games, it would be nice to know if your model can predict which player will win. </p>
<p>The dataset <code>tic_tac_toe</code> has been loaded for your use.</p>
<p>Note that in Python, a trailing backslash (<code>\</code>) continues a statement on the next line, so <code>=\</code> indicates that an assignment was too long for one line and has been split across two lines.</p></div></div>

In [16]:
# Create dummy variables using pandas
X = pd.get_dummies(tic_tac_toe.iloc[:,0:9])
y = tic_tac_toe.iloc[:, 9]

# Create training and testing datasets. Use 10% for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1111)

Remember, without the holdout set, you cannot truly validate a model. Let's move on to creating two holdout sets.

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Create two holdout sets</h1><div class=""><p><p>You recently created a simple random forest model to predict Tic-Tac-Toe game wins for your boss, and at her request, you did not do any parameter tuning. Unfortunately, the overall model accuracy was too low for her standards. This time around, she has asked you to focus on model performance.</p>
<p>Before you start testing different models and parameter sets, you will need to split the data into training, validation, and testing datasets. </p>
<p>The datasets <code>X</code> and <code>y</code> have been loaded for your use.</p></div></div>

In [17]:
# Create temporary training and final testing datasets
X_temp, X_test, y_temp, y_test  =\
    train_test_split(X, y, test_size=0.2, random_state=1111)

# Create the final training and validation datasets
X_train, X_val, y_train, y_val =\
    train_test_split(X_temp, y_temp, test_size=0.25, random_state=1111)
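
The two `test_size` values above interact: 25% of the remaining 80% is 20% of the original data, so the result is a 60/20/20 split. The sketch below checks this on 1,000 placeholder rows (an assumption chosen to make the sizes easy to read; it is not the tic-tac-toe data), using illustrative variable names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 dummy rows so the resulting sizes are easy to read (assumption)
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

# First split: hold out 20% of all rows for final testing
X_temp_demo, X_test_demo, y_temp_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)

# Second split: 25% of the remaining 80% equals 20% of the original data
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_temp_demo, y_temp_demo, test_size=0.25, random_state=1111)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```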

<div class="dc-u-p-24"><h1 class="dc-h3">Why use holdout sets</h1><div class=""><p><p>It is important to understand when you would use three datasets (training, validation, and testing) instead of two (training and testing). There is no point in creating an additional dataset split if you are not going to use it. </p>
<p>When should you consider using training, validation, <em>and</em> testing datasets?</p></div></div>

- When there is a lot of data. Splitting into three sets helps speed up modeling.
- When testing parameters, tuning hyper-parameters, or anytime you are frequently evaluating model performance. *
- Only when you are running regression and not classification models.
- Only when you are running classification and not regression models.

Anytime we are evaluating model performance repeatedly we need to create training, validation, and testing datasets.
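
A rough sketch of how the three datasets might be used together: candidate settings are compared on the validation set only, and the test set is touched once after the choice is made. The data is synthetic, and the candidate `max_depth` values are arbitrary illustrations rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption); split 60/20/20 into train, validation, and test
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_temp, X_te, y_temp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111)

# Compare candidates on the validation set only
val_scores = {}
for depth in [2, 4, 8]:
    candidate = RandomForestClassifier(n_estimators=25, max_depth=depth,
                                       random_state=1111)
    candidate.fit(X_tr, y_tr)
    val_scores[depth] = candidate.score(X_va, y_va)

best_depth = max(val_scores, key=val_scores.get)

# Touch the test set once, after the choice has been made
final_model = RandomForestClassifier(n_estimators=25, max_depth=best_depth,
                                     random_state=1111)
final_model.fit(X_tr, y_tr)
test_score = final_model.score(X_te, y_te)
print(best_depth, round(test_score, 2))
```

If the test set were used inside the loop instead, the reported test score would be biased toward whichever setting happened to suit that particular sample.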

<div class="listview__content"><div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Mean absolute error</h1><div class=""><p><p>Communicating modeling results can be difficult. However, most clients understand that on average, a predictive model was off by some number. This makes explaining the mean absolute error easy. For example, when predicting the number of wins for a basketball team, if you predict 42, and they end up with 40, you can easily explain that the error was two wins.</p>
<p>In this exercise, you are interviewing for a new position and are provided with two arrays: <code>y_test</code>, the true number of wins for all 30 NBA teams in 2017, and <code>predictions</code>, which contains a prediction for each team. To test your understanding, you are asked to calculate the MAE both manually and with <code>sklearn</code>.</p></div></div></div>

In [18]:
import numpy as np
y_test= np.array([53, 51, 51, 49, 43, 42, 42, 41, 41, 37, 36, 31, 29, 28, 20, 67, 61,
       55, 51, 51, 47, 43, 41, 40, 34, 33, 32, 31, 26, 24])
predictions = np.array([60, 62, 42, 42, 30, 50, 52, 42, 44, 35, 30, 30, 35, 40, 15, 72, 58,
       60, 40, 42, 45, 46, 40, 35, 25, 40, 20, 34, 25, 24])

In [19]:
from sklearn.metrics import mean_absolute_error as mae

# Manually calculate the MAE
n = len(predictions)
mae_one = sum(abs(y_test - predictions)) / n
print('With a manual calculation, the error is {}'.format(mae_one))

# Use scikit-learn to calculate the MAE
mae_two = mae(y_test, predictions)
print('Using scikit-learn, the error is {}'.format(mae_two))

With a manual calculation, the error is 5.9
Using scikit-learn, the error is 5.9


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Mean squared error</h1><div class=""><p><p>Let's focus on the 2017 NBA predictions again. Every year, there are at least a couple of NBA teams that win <em>way</em> more games than expected. The MAE does not reflect these bad predictions as strongly as the MSE does: because the MSE squares each error, large misses from bad predictions make the reported error look much worse. </p>
<p>In this example, NBA executives want to better predict team wins. You will use the mean squared error to calculate the prediction error. The actual wins are loaded as <code>y_test</code> and the predictions as <code>predictions</code>.</p></div></div>

In [20]:
from sklearn.metrics import mean_squared_error 

n = len(predictions)
# Finish the manual calculation of the MSE
mse_one = sum(abs(y_test - predictions)**2) / n
print('With a manual calculation, the error is {}'.format(mse_one))

# Use the scikit-learn function to calculate MSE
mse_two = mean_squared_error(y_test, predictions)
print('Using scikit-learn, the error is {}'.format(mse_two))

With a manual calculation, the error is 49.1
Using scikit-learn, the error is 49.1


If you run any additional models, you will try to beat an MSE of 49.1, which is the average squared error of using your model. Although the MSE is not as interpretable as the MAE, it will help us select a model that has fewer 'large' errors.
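
To illustrate why the MSE favors models with fewer large errors, the following sketch compares two made-up sets of predictions that share the same MAE. The arrays are invented for illustration and are not taken from the NBA data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values chosen so both prediction sets have the same MAE
actual = np.array([40, 40, 40, 40])
steady_preds = np.array([42, 42, 42, 42])   # four errors of 2
one_big_miss = np.array([40, 40, 40, 48])   # three perfect, one error of 8

for name, preds in [("steady", steady_preds), ("one big miss", one_big_miss)]:
    print(name,
          "MAE:", mean_absolute_error(actual, preds),
          "MSE:", mean_squared_error(actual, preds))
```

Both sets have an MAE of 2, but the MSE is 4 for the steady predictions and 16 for the set with one large miss, so choosing by MSE would prefer the steady model.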

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Performance on data subsets</h1><div class=""><p><p>In professional basketball, there are two conferences, the East and the West. Coaches and fans often only care about how teams in their own conference will do this year. </p>
<p>You have been working on an NBA prediction model and would like to determine if the predictions were better for the East or West conference. You added a third array to your data called <code>labels</code>, which contains an "E" for the East teams, and a "W" for the West.  <code>y_test</code> and <code>predictions</code> have again been loaded for your use.</p></div></div>

In [21]:
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse

In [22]:
labels = np.array(['E', 'E', 'E', 'E', 'E', 'E', 'E', 'E', 'E', 'E', 'E', 'E', 'E',
       'E', 'E', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W',
       'W', 'W', 'W', 'W'])
west_error = 5.01

In [23]:
# Find the East conference teams
east_teams = labels == "E"

In [24]:
# Create arrays for the true and predicted values
true_east = y_test[east_teams]
preds_east = predictions[east_teams]

In [25]:
# Print the accuracy metrics
print('The MAE for East teams is {}'.format(
    mae(true_east, preds_east)))

The MAE for East teams is 6.733333333333333


In [26]:
# Print the West accuracy
print('The MAE for West conference is {}'.format(west_error))

The MAE for West conference is 5.01


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Confusion matrices</h1><div class=""><p><p>Confusion matrices are a great way to start exploring your model's accuracy. They provide the values needed to calculate a wide range of metrics, including sensitivity, specificity, and the F1-score.</p>
<p>You have built a classification model to predict if a person has a broken arm based on an X-ray image. On the testing set, you have the following confusion matrix:</p>
<table>
<thead>
<tr>
<th></th>
<th>Prediction: 0</th>
<th>Prediction: 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Actual: 0</td>
<td>324 (TN)</td>
<td>15 (FP)</td>
</tr>
<tr>
<td>Actual: 1</td>
<td>123 (FN)</td>
<td>491 (TP)</td>
</tr>
</tbody>
</table></div></div>

In [27]:
# Calculate and print the accuracy
accuracy = (324 + 491) / (953)
print("The overall accuracy is {0: 0.2f}".format(accuracy))

# Calculate and print the precision
precision = (491) / (491 + 15)
print("The precision is {0: 0.2f}".format(precision))

# Calculate and print the recall
recall = (491) / (491 + 123)
print("The recall is {0: 0.2f}".format(recall))

The overall accuracy is  0.86
The precision is  0.97
The recall is  0.80
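
As a cross-check on the hand calculations, the sketch below rebuilds label arrays that reproduce the confusion matrix and computes the same metrics with scikit-learn, along with specificity and the F1-score mentioned earlier. The reconstruction of `y_true` and `y_pred` from the four counts is an illustrative device, not part of the exercise.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

tn, fp, fn, tp = 324, 15, 123, 491

# Rebuild label arrays that reproduce the confusion matrix above
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)        # also called sensitivity
specificity = tn / (tn + fp)                 # true negative rate
f1 = f1_score(y_true, y_pred)

print("accuracy   : {:.2f}".format(accuracy))
print("precision  : {:.2f}".format(precision))
print("recall     : {:.2f}".format(recall))
print("specificity: {:.2f}".format(specificity))
print("F1-score   : {:.2f}".format(f1))
```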


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Confusion matrices, again</h1><div class=""><p><p>Creating a confusion matrix in Python is simple. The biggest challenge will be making sure you understand the orientation of the matrix. This exercise makes sure you understand the <code>sklearn</code> implementation of confusion matrices. Here, you have created a random forest model, <code>rfc</code>, using the <code>tic_tac_toe</code> dataset to predict outcomes of 0 (a loss) or 1 (a win) for Player One.</p>
<p><em>Note:</em> If you read about confusion matrices on another website or for another programming language, the values might be reversed.</p></div></div>

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X_tic,y_tic, test_size=0.25, 
                                                    random_state=42)

In [29]:
from sklearn.metrics import confusion_matrix

# Create predictions
test_predictions = rfc.predict(X_test)

# Create and print the confusion matrix
cm = confusion_matrix(y_test, test_predictions)
print(cm)

# Print the true positives (actual 1s that were predicted 1s)
print("The number of true positives is: {}".format(cm[1, 1]))

[[ 58  17]
 [  3 162]]
The number of true positives is: 162
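
The orientation can be verified on a tiny example. In scikit-learn, rows are actual classes and columns are predicted classes, so for a binary problem `ravel()` returns the counts in the order TN, FP, FN, TP. The five labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Tiny made-up example: rows are actual classes, columns are predicted classes
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# For a binary problem, ravel() flattens the matrix row by row
tn, fp, fn, tp = cm_demo.ravel()
print(tn, fp, fn, tp)
```

Here `cm_demo[1, 1]` and `tp` refer to the same cell: the two actual 1s that were also predicted as 1.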


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Precision vs. recall</h1><div class=""><p><p>The accuracy metrics you use to evaluate your model should <em>always</em> be based on the specific application. For this example, let's assume you are a really sore loser when it comes to playing Tic-Tac-Toe, but only when you are certain that you are going to win.</p>
<p>Choose the most appropriate accuracy metric, either precision or recall, to complete this example. But remember, <em>if you think you are going to win, you better win!</em></p>
<p>Use <code>rfc</code>, which is a random forest classification model built on the <code>tic_tac_toe</code> dataset.</p></div></div>

In [30]:
from sklearn.metrics import recall_score

test_predictions = rfc.predict(X_test)

# Create precision or recall score based on the metric you imported
score = recall_score(y_test, test_predictions)

# Print the final result
print("The recall value is {0:.2f}".format(score))

The recall value is 0.98


In [31]:
from sklearn.metrics import precision_score

test_predictions = rfc.predict(X_test)

# Create precision or recall score based on the metric you imported
score = precision_score(y_test, test_predictions)

# Print the final result
print("The precision value is {0:.2f}".format(score))

The precision value is 0.91


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Error due to under/over-fitting</h1><div class=""><p><p>The candy dataset is prime for overfitting. With only 85 observations, if you use 20% for the testing dataset, you are losing a lot of vital data that could be used for modeling. Imagine the scenario where most of the chocolate candies ended up in the training data and very few in the holdout sample. Our model might <em>only</em> see that chocolate is a vital factor, but fail to find that other attributes are also important. In this exercise, you'll explore how using too many features (columns) in a random forest model can lead to overfitting.</p>
<p>A <em>feature</em> represents which columns of the data are used in a decision tree. The parameter <code>max_features</code> limits the number of features available.</p></div></div>

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X_candy,y_candy, test_size=0.25, 
                                                    random_state=42)


In [33]:
# Update the rfr model
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=2)
rfr.fit(X_train, y_train)

# Print the training and testing errors
print('The training error is {0:.2f}'.format(
  mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
  mae(y_test, rfr.predict(X_test))))

The training error is 3.41
The testing error is 8.46


In [34]:
# Update the rfr model
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=11)
rfr.fit(X_train, y_train)

# Print the training and testing errors
print('The training error is {0:.2f}'.format(
  mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
  mae(y_test, rfr.predict(X_test))))

The training error is 3.41
The testing error is 8.25


In [35]:
# Update the rfr model
rfr = RandomForestRegressor(n_estimators=25,
                            random_state=1111,
                            max_features=4)
rfr.fit(X_train, y_train)

# Print the training and testing errors
print('The training error is {0:.2f}'.format(
  mae(y_train, rfr.predict(X_train))))
print('The testing error is {0:.2f}'.format(
  mae(y_test, rfr.predict(X_test))))

The training error is 3.24
The testing error is 8.75


<div class="dc-completed__message"><p class="">Great job! The chart below shows the performance at various max feature values. Sometimes, setting parameter values can make a huge difference in model performance. <br> <br> <img src="https://assets.datacamp.com/production/repositories/3981/datasets/7e30218261b88cc6e57da1e07b73c5803450ccf6/Screen%20Shot%202019-01-13%20at%205.40.29%20PM.png" width="360" height="250"></p></div>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Am I underfitting?</h1><div class=""><p><p>You are creating a random forest model to predict if you will win a future game of Tic-Tac-Toe. Using the <code>tic_tac_toe</code> dataset, you have created training and testing datasets, <code>X_train</code>, <code>X_test</code>, <code>y_train</code>, and <code>y_test</code>. </p>
<p>You have decided to create a bunch of random forest models with varying numbers of trees (1, 2, 3, 4, 5, 10, 20, and 50). The more trees you use, the longer your random forest model will take to run. However, if you don't use enough trees, you risk underfitting. You have created a for loop to test your model with each number of trees.</p></div></div>

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X_tic,y_tic, test_size=0.25, 
                                                    random_state=42)

In [37]:
from sklearn.metrics import accuracy_score

test_scores, train_scores = [], []
for i in [1, 2, 3, 4, 5, 10, 20, 50]:
    rfc = RandomForestClassifier(n_estimators=i, random_state=1111)
    rfc.fit(X_train, y_train)
    # Create predictions for the X_train and X_test datasets.
    train_predictions = rfc.predict(X_train)
    test_predictions = rfc.predict(X_test)
    # Append the accuracy score for the test and train predictions.
    train_scores.append(round(accuracy_score(y_train, train_predictions), 2))
    test_scores.append(round(accuracy_score(y_test, test_predictions), 2))
# Print the train and test scores.
print("The training scores were: {}".format(train_scores))
print("The testing scores were: {}".format(test_scores))

The training scores were: [0.95, 0.94, 0.98, 0.98, 0.99, 1.0, 1.0, 1.0]
The testing scores were: [0.84, 0.82, 0.89, 0.88, 0.9, 0.93, 0.95, 0.99]


## <font color=red>03 - Cross Validation </font> 
 Holdout sets are a great start to model validation. However, using a single train and test set is often not enough. Cross-validation is considered the gold standard when it comes to validating model performance and is almost always used when tuning model hyper-parameters. This chapter focuses on performing cross-validation to validate model performance. 

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Two samples</h1><div class=""><p><p>After building several classification models based on the <code>tic_tac_toe</code> dataset, you realize that some models do not generalize as well as others. You have created training and testing splits just as you have been taught, so you are curious why your validation process is not working. </p>
<p>After trying a different training/test split, you noticed differing accuracies for your machine learning model. Before getting too frustrated with the varying results, you have decided to see what else could be going on.</p></div></div>

In [38]:
# Create two different samples of 200 observations 
sample1 = tic_tac_toe.sample(200, random_state=1111)
sample2 = tic_tac_toe.sample(200, random_state=1171)

In [39]:
# Print the number of common observations 
print(len([index for index in sample1.index if index in sample2.index]))

40


In [40]:
# Print the number of observations in the Class column for both samples 
print(sample1['Class'].value_counts())
print(sample2['Class'].value_counts())

1    134
0     66
Name: Class, dtype: int64
1    123
0     77
Name: Class, dtype: int64


Notice that the number of positive observations differs between the two samples. Sometimes creating a single test holdout sample is not enough to achieve the high levels of model validation you want. You need to use something more robust.

<div class="dc-u-p-24"><h1 class="dc-h3">Potential problems</h1><div class=""><p><p>Which of the following statements are <strong>TRUE</strong> regarding potential problems with holdout samples:</p>
<ul>
<li><strong>A</strong>: Using different data splitting methods may lead to varying data in the final holdout samples.</li>
<li><strong>B</strong>: If you have limited data, your holdout accuracy may be misleading.</li>
<li><strong>C</strong>: There are no problems. Creating a single train and test sample is the only way to validate models.</li>
<li><strong>D</strong>: You shouldn't use holdout samples with limited data because you are limiting the potential training data.</li>
</ul></div></div>

A B

 If our models are not generalizing well or if we have limited data, we should be careful using a single training/validation split. You should use the next lesson's topic: cross-validation.

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">scikit-learn's KFold()</h1><div class=""><p><p>You just finished running a colleague's code that creates a random forest model and calculates an out-of-sample accuracy. You noticed that your colleague's code did not have a random state, and the errors you found were completely different from the errors your colleague reported. </p>
<p>To get a better estimate for how accurate this random forest model will be on new data, you have decided to generate some indices to use for KFold cross-validation.</p></div></div>

In [41]:
from sklearn.model_selection import KFold

# Use KFold
kf = KFold(n_splits=5, shuffle=True, random_state=1111)

# Create splits
splits = kf.split(X_candy)

# Print the number of indices
for train_index, val_index in splits:
    print("Number of training indices: %s" % len(train_index))
    print("Number of validation indices: %s" % len(val_index))

Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17


This dataset has 85 rows. You have created five splits - each containing 68 training and 17 validation indices. You can use these indices to complete 5-fold cross-validation.
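
To make the structure of these splits concrete, here is a minimal sketch that checks two properties of `KFold` indices: within a split, the training and validation rows never overlap, and across the five splits every row is used for validation exactly once. The array `X_demo` is a hypothetical stand-in with 85 rows, not the candy data itself.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for an 85-row feature matrix (values are arbitrary)
X_demo = np.arange(85).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1111)

val_rows = []
overlaps = []
for train_index, val_index in kf_demo.split(X_demo):
    # Count rows that appear in both parts of the same split (should be 0)
    overlaps.append(len(np.intersect1d(train_index, val_index)))
    val_rows.extend(val_index.tolist())

# Every row index 0..84 should appear once across the validation folds
covers_all_rows = sorted(val_rows) == list(range(85))
print("Overlap per split:", overlaps)
print("Each row validated exactly once:", covers_all_rows)
```

Because the folds partition the rows, averaging the five validation errors uses every observation once for evaluation, which is what makes the estimate less dependent on one lucky or unlucky split.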

<div class="listview__content"><div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Using KFold indices</h1><div class=""><p><p>You have already created <code>splits</code>, which contains indices for the candy-data dataset to complete 5-fold cross-validation. To get a better estimate for how well a colleague's random forest model will perform on new data, you want to run this model on the five different training and validation indices you just created. </p>
<p>In this exercise, you will use these indices to check the accuracy of this model using the five different splits. A for loop has been provided to assist with this process.</p></div></div></div>

In [42]:
X = X_candy
y = y_candy
X_train, X_test, y_train, y_test = train_test_split(X_candy,y_candy, test_size=0.25, 
                                                    random_state=42)

In [43]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rfc = RandomForestRegressor(n_estimators=25, random_state=1111)

# Use KFold
kf = KFold(n_splits=5, shuffle=True, random_state=1111)
# Create splits
splits = kf.split(X)

# Access the training and validation indices of splits
for train_index, val_index in splits:
    # Set up the training and validation data
    # (assumes X and y are pandas objects, so rows are selected by position with .iloc)
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_val, y_val = X.iloc[val_index], y.iloc[val_index]
    # Fit the random forest model
    rfc.fit(X_train, y_train)
    # Make predictions, and print the error for this split
    predictions = rfc.predict(X_val)
    print("Split MSE: " + str(mean_squared_error(y_val, predictions)))
    

<div class="dc-completed__message"><p class="">Nice work! <code>KFold()</code> is a great method for accessing individual indices when completing cross-validation. One drawback is needing a for loop to work through the indices though. In the next lesson, you will look at an automated method for cross-validation using <code>sklearn</code>.</p></div>
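
As a rough illustration of what the automated method replaces, the sketch below runs the same 5-fold evaluation twice: once with an explicit for loop over `KFold` indices and once with `cross_val_score()`, passing the same `KFold` object as `cv`. The data comes from `make_regression` and the names `X_demo`, `y_demo`, and `rfr_demo` are hypothetical; with a fixed `random_state` on both the splitter and the model, the two sets of scores should agree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for a small dataset
X_demo, y_demo = make_regression(n_samples=85, n_features=5,
                                 noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1111)
rfr_demo = RandomForestRegressor(n_estimators=10, random_state=1111)

# Manual version: loop over the indices and score each split
loop_scores = []
for train_index, val_index in kf_demo.split(X_demo):
    rfr_demo.fit(X_demo[train_index], y_demo[train_index])
    preds = rfr_demo.predict(X_demo[val_index])
    loop_scores.append(mean_squared_error(y_demo[val_index], preds))

# Automated version: the same splitter is passed through the cv argument
auto_scores = cross_val_score(rfr_demo, X_demo, y_demo, cv=kf_demo,
                              scoring=make_scorer(mean_squared_error))

scores_match = np.allclose(loop_scores, auto_scores)
print("Loop and cross_val_score agree:", scores_match)
```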

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">scikit-learn's methods</h1><div class=""><p><p>You have decided to build a regression model to predict the number of new employees your company will successfully hire next month. You open up a new Python script to get started, but you quickly realize that <code>sklearn</code> has <em>a lot</em> of different modules. Let's make sure you understand the names of the modules, the methods, and which module contains which method. </p>
<p>Follow the instructions below to load in all of the necessary methods for completing cross-validation using <code>sklearn</code>. You will use modules:</p>
<ul>
<li><code>metrics</code></li>
<li><code>model_selection</code></li>
<li><code>ensemble</code></li>
</ul></div></div>

In [44]:
# Instruction 1: Load the cross-validation method
from sklearn.model_selection import cross_val_score

# Instruction 2: Load the random forest regression model
from sklearn.ensemble import RandomForestRegressor

# Instruction 3: Load the mean squared error method
# Instruction 4: Load the function for creating a scorer
from sklearn.metrics import mean_squared_error, make_scorer

<div class="dc-completed__message"><p class="">Well done! It is easy to see how all of the methods can get mixed up, but it is important to know the names of the methods you need. You can always review the <a href="https://scikit-learn.org/stable/documentation.html" target="_blank" rel="noopener noreferrer">scikit-learn documentation</a> should you need any help.</p></div>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Implement cross_val_score()</h1><div class=""><p><p>Your company has created several new candies to sell, but they are not sure if they should release all five of them. To predict the popularity of these new candies, you have been asked to build a regression model using the candy dataset. Remember that the response value is a head-to-head win-percentage against other candies. </p>
<p>Before you begin trying different regression models, you have decided to run cross-validation on a simple random forest model to get a baseline error to compare with any future results.</p></div></div>

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X_candy,y_candy, test_size=0.25, 
                                                    random_state=42)

In [46]:
rfc = RandomForestRegressor(n_estimators=25, random_state=1111)
mse = make_scorer(mean_squared_error)

# Set up cross_val_score
cv = cross_val_score(estimator=rfc,
                     X=X_train,
                     y=y_train,
                     cv=10,
                     scoring=mse)

# Print the mean error
print(cv.mean())

141.06685149205262


You now have a baseline score to build on. If you decide to build additional models or try new techniques, you should try to get a mean squared error lower than 141.07, the value printed above. Lower errors indicate that your popularity predictions are improving.
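
A custom scorer built with `make_scorer()` is one way to report the mean squared error. As a hedged aside, scikit-learn also accepts the built-in scoring string `"neg_mean_squared_error"`, which returns the same errors with the sign flipped so that higher is always better. The sketch below compares the two on synthetic data; `X_demo`, `y_demo`, and `rfr_demo` are hypothetical names, not objects from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data used only for illustration
X_demo, y_demo = make_regression(n_samples=60, n_features=5,
                                 noise=10.0, random_state=0)
rfr_demo = RandomForestRegressor(n_estimators=10, random_state=1111)

# Custom scorer: returns the raw (positive) mean squared error
custom_scores = cross_val_score(rfr_demo, X_demo, y_demo, cv=5,
                                scoring=make_scorer(mean_squared_error))

# Built-in scorer: returns the negated mean squared error
builtin_scores = cross_val_score(rfr_demo, X_demo, y_demo, cv=5,
                                 scoring="neg_mean_squared_error")

same_up_to_sign = np.allclose(custom_scores, -builtin_scores)
print("Mean MSE (custom scorer):", custom_scores.mean())
print("Same values up to sign:", same_up_to_sign)
```

The sign convention matters mostly when a scorer is used to pick a model automatically; for simply reporting a baseline error, either form works.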

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">When to use LOOCV</h1><div class=""><p><p>Which of the following are reasons you might <strong>NOT</strong> run LOOCV on the provided <code>X</code> dataset? 
The <code>X</code> data has been loaded for you to explore as you see fit. </p>
<ul>
<li><strong>A</strong>: The <code>X</code> dataset has 122,624 data points, which might be computationally expensive and slow.</li>
<li><strong>B</strong>: You cannot run LOOCV on classification problems. </li>
<li><strong>C</strong>: You want to test different values for 15 different parameters</li>
</ul></div></div>

A C

This many observations will definitely slow things down and could be computationally expensive. If you don't have time to wait while your computer runs through 1,000 models, you might want to use 5 or 10-fold cross-validation.

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Leave-one-out-cross-validation</h1><div class=""><p><p>Let's assume your favorite candy is not in the candy dataset, and that you are interested in the popularity of this candy. Using 5-fold cross-validation will train on only 80% of the data at a time. The candy dataset <em>only</em> has 85 rows though, and leaving out 20% of the data could hinder our model. However, using leave-one-out-cross-validation allows us to make the most out of our limited dataset and will give you the best estimate for your favorite candy's popularity!</p>
<p>In this exercise, you will use <code>cross_val_score()</code> to perform LOOCV.</p></div></div>

In [47]:
import numpy as np
from sklearn.metrics import mean_absolute_error, make_scorer

# Create scorer
mae_scorer = make_scorer(mean_absolute_error)

rfr = RandomForestRegressor(n_estimators=15, random_state=1111)

# Implement LOOCV
scores = cross_val_score(rfr, X=X, y=y, cv=y.shape[0], scoring=mae_scorer)

# Print the mean and standard deviation
print("The mean of the errors is: %s." % np.mean(scores))
print("The standard deviation of the errors is: %s." % np.std(scores))

The mean of the errors is: 9.464989603398694.
The standard deviation of the errors is: 7.265762094853885.
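
Setting `cv` to the number of rows is one way to request LOOCV. A minimal sketch of an alternative, assuming synthetic data, is to pass scikit-learn's `LeaveOneOut` splitter explicitly; for a regressor both approaches hold out one row at a time, so the scores should match. The names `X_demo`, `y_demo`, `rfr_demo`, and `mae_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset so that fitting one model per row stays fast
X_demo, y_demo = make_regression(n_samples=20, n_features=4,
                                 noise=5.0, random_state=0)
rfr_demo = RandomForestRegressor(n_estimators=5, random_state=1111)
mae_demo = make_scorer(mean_absolute_error)

# Explicit leave-one-out splitter
loo = LeaveOneOut()
loo_scores = cross_val_score(rfr_demo, X_demo, y_demo,
                             cv=loo, scoring=mae_demo)

# Same idea expressed as "one fold per row"
n_fold_scores = cross_val_score(rfr_demo, X_demo, y_demo,
                                cv=len(y_demo), scoring=mae_demo)

n_models = loo.get_n_splits(X_demo)
print("Models fit by LOOCV:", n_models)
print("Scores match:", np.allclose(loo_scores, n_fold_scores))
```

The number of models equals the number of rows, which is why LOOCV becomes expensive on large datasets.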


You have come a long way with model validation techniques. The final chapter will wrap up model validation by discussing how to select the best model and give an introduction to parameter tuning.

## <font color=red>04 - Selecting the best model with hyperparameter tuning</font> 
 The first three chapters focused on model validation techniques. In chapter 4 we apply these techniques, specifically cross-validation, while learning about hyperparameter tuning. After all, model validation makes tuning possible and helps us select the overall best model. 

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Creating Hyperparameters</h1><div class=""><p><p>For a school assignment, your professor has asked your class to create a random forest model to predict the average test score for the final exam.</p>
<p>After developing an initial random forest model, you are unsatisfied with the overall accuracy. You realize that there are too many hyperparameters to choose from, and each one has <em>a lot</em> of possible values. You have decided to make a list of possible ranges for the hyperparameters you might use in your next model.</p>
<p>Your professor has provided de-identified data for the last ten quizzes to act as the training data. There are 30 students in your class.</p></div></div>

In [49]:
# Review the parameters of rfr
print(rfr.get_params())

{'bootstrap': True, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 15, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0, 'warm_start': False}


In [57]:
# Maximum Depth
max_depth = [4, 8, 12]

# Minimum samples for a split
min_samples_split = [2, 5, 10]

# Max features 
max_features = [4, 6, 8, 10]

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Running a model using ranges</h1><div class=""><p><p>You have just finished creating a list of hyperparameters and ranges to use when tuning a predictive model for an assignment. You have used <code>max_depth</code>, <code>min_samples_split</code>, and <code>max_features</code> as your range variable names.</p></div></div>

In [53]:
import random

In [56]:
from sklearn.ensemble import RandomForestRegressor

# Fill in rfr using your variables
rfr = RandomForestRegressor(
    n_estimators=100,
    max_depth=random.choice(max_depth),
    min_samples_split=random.choice(min_samples_split),
    max_features=random.choice(max_features))

# Print out the parameters
print(rfr.get_params())

{'bootstrap': True, 'criterion': 'mse', 'max_depth': 12, 'max_features': 8, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


<div class="dc-completed__message"><p class="">Good job! Notice that <code>min_samples_split</code> was randomly set to 10 in this run. Because <code>random.choice()</code> was called without a fixed seed and the model has no <code>random_state</code>, the selected values can change each time you run this cell.</p></div>
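
The values picked by `random.choice()` depend on Python's own random number generator, not on the estimator's `random_state`. A minimal sketch of making the selection repeatable, using a hypothetical helper `draw_params()` and a seeded `random.Random` instance:

```python
import random

# Candidate values, mirroring the kind of ranges defined above
depth_options = [4, 8, 12]
split_options = [2, 5, 10]


def draw_params(seed):
    """Draw one hyperparameter combination from a seeded generator."""
    rng = random.Random(seed)  # local generator; global state is untouched
    return {
        "max_depth": rng.choice(depth_options),
        "min_samples_split": rng.choice(split_options),
    }


first_draw = draw_params(1111)
second_draw = draw_params(1111)

# The same seed always yields the same combination
print(first_draw == second_draw)
```

Passing the drawn dictionary to the model with `RandomForestRegressor(**first_draw)` would then give the same configuration on every run.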

<div class="listview__content"><div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Preparing for RandomizedSearch</h1><div class=""><p><p>Last semester your professor challenged your class to build a predictive model to predict final exam test scores. You tried running a few different models by randomly selecting hyperparameters. However, running each model required you to code it individually. </p>
<p>After learning about <code>RandomizedSearchCV()</code>, you're revisiting your professor's challenge to build the best model. In this exercise, you will prepare the three necessary inputs for completing a random search.</p></div></div></div>

In [62]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# Finish the dictionary by adding the max_depth parameter
param_dist = {"max_depth": [2, 4, 6, 8],
              "max_features": [2, 4, 6, 8, 10],
              "min_samples_split": [2, 4, 8, 16]}

# Create a random forest regression model
rfr = RandomForestRegressor(n_estimators=10, random_state=1111)

# Create a scorer to use (use the mean squared error)
# greater_is_better=False tells the search that lower errors are better
scorer = make_scorer(mean_squared_error, greater_is_better=False)

<p class="">Well done! To use <code>RandomizedSearchCV()</code>, you need a distribution dictionary, an estimator, and a scorer—once you've got these, you can run a random search to find the best parameters for your model.</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Implementing RandomizedSearchCV</h1><div class=""><p><p>You are hoping that using a random search algorithm will help you improve predictions for a class assignment. Your professor has challenged your class to predict the overall final exam average score. </p>
<p>In preparation for completing a random search, you have created:</p>
<ul>
<li><code>param_dist</code>: the hyperparameter distributions</li>
<li><code>rfr</code>: a random forest regression model</li>
<li><code>scorer</code>: a scoring method to use</li>
</ul></div></div>

In [63]:
# Import the method for random search
from sklearn.model_selection import RandomizedSearchCV

# Build a random search using param_dist, rfr, and scorer
random_search =\
    RandomizedSearchCV(
        estimator=rfr,
        param_distributions=param_dist,
        n_iter=10,
        cv=5,
        scoring=scorer)

Although it takes a lot of steps, hyperparameter tuning with random search is well worth it and can improve the accuracy of your models. Plus, you are already using cross-validation to validate your best model.
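
To show where a configured random search leads, here is a minimal sketch that fits one on synthetic data and reads back the selected parameters. It assumes `make_regression` data in place of the exam scores, and all names ending in `_demo` are hypothetical. Note that the scorer is created with `greater_is_better=False`: the search keeps the parameter set with the highest score, so an error metric has to be negated for the lowest error to win.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data with 10 features so that max_features up to 10 is valid
X_demo, y_demo = make_regression(n_samples=100, n_features=10,
                                 noise=10.0, random_state=0)

param_dist_demo = {"max_depth": [2, 4, 6, 8],
                   "max_features": [2, 4, 6, 8, 10],
                   "min_samples_split": [2, 4, 8, 16]}

# Negated MSE: higher (closer to zero) means a smaller error
neg_mse_demo = make_scorer(mean_squared_error, greater_is_better=False)

search_demo = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=1111),
    param_distributions=param_dist_demo,
    n_iter=5,
    cv=3,
    scoring=neg_mse_demo,
    random_state=1111)
search_demo.fit(X_demo, y_demo)

# best_score_ is the negated error, so flip the sign to report the MSE
print("Best parameters:", search_demo.best_params_)
print("Best cross-validated MSE:", -search_demo.best_score_)
```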

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Best classification accuracy</h1><div class=""><p><p>You are in a competition at work to build the best model for predicting the winner of a Tic-Tac-Toe game. You already ran a random search and saved the results of the most accurate model to <code>rs</code>.</p>
<p>Which parameter set produces the best classification accuracy?</p></div></div>

`rs.best_params_`

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Selecting the best precision model</h1><div class=""><p><p>Your boss has offered to pay for you to see three sports games this year. Of the 41 home games your favorite team plays, you want to ensure you go to three home games that they will <em>definitely</em> win. You build a model to decide which games your team will win. </p>
<p>To do this, you will build a random search algorithm and focus on model precision (to ensure your team wins). You also want to keep track of your best model and best parameters, so that you can use them again next year (if the model does well, of course). You have already decided on using the random forest classification model <code>rfc</code> and generated a parameter distribution <code>param_dist</code>.</p></div></div>

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X_tic,y_tic, test_size=0.25, 
                                                    random_state=42)
X = X_tic
y = y_tic
rfc = RandomForestClassifier()

In [80]:
from sklearn.metrics import precision_score, make_scorer

# Create a precision scorer
precision = make_scorer(precision_score)
# Finalize the random search
rs = RandomizedSearchCV(
  estimator=rfc, param_distributions=param_dist,
  scoring = precision,
  cv=5, n_iter=10, random_state=1111)
rs.fit(X, y)

# Print the mean test scores (precision for each parameter set):
print('The precision for each run was: {}.'.format(rs.cv_results_['mean_test_score']))
# Print the best model score:
print('The best precision for a single model was: {}'.format(rs.best_score_))

The precision for each run was: [0.84580999 0.76196489 0.72297306 0.86447312 0.73469461 0.83893059
 0.75135581 0.87458245 0.69972202 0.94448096].
The best precision for a single model was: 0.9444809589745515


Your best model's precision was about 94%! Across the cross-validation folds, roughly 94% of the games the best model predicted as wins were actual wins. If you look at the mean test scores, you can tell some of the other parameter sets did much worse, with precision as low as 70%. Also, since you used cross-validation, you can be more confident that this score reflects performance on unseen data.
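
Precision answers the question "of the games predicted as wins, how many were actually wins?", that is, true positives divided by all predicted positives. The short sketch below computes it by hand on made-up labels and compares the result with `precision_score()`; the label lists are invented for illustration only.

```python
from sklearn.metrics import precision_score

# Made-up labels: 1 = win, 0 = loss
y_true_demo = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true_demo, y_pred_demo))

# True positives: predicted win and actually a win
tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
# False positives: predicted win but actually a loss
fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)

manual_precision = tp / (tp + fp)
sklearn_precision = precision_score(y_true_demo, y_pred_demo)

print("Manual precision:", manual_precision)
print("precision_score:", sklearn_precision)
```

The missed win in the second position does not lower precision, because precision ignores false negatives. That is acceptable here: you only need a few games you can trust, not every win.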