# Model Applications and Improvements

It’s time to dive deeper. Find out how you can use measures of model performance including precision and recall to answer real-world questions, such as evaluating ROI on ad spend. You’ll also learn ways to improve upon those evaluation metrics, such as ensemble methods and hyperparameter tuning.

## Evaluating four categories
The confusion matrix is the most straightforward tool to look at the four categories of outcomes: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In this exercise, you will use a standard decision tree classifier DecisionTreeClassifier() from sklearn on the sample click data and calculate the breakdowns of outcomes by the four categories.

In [44]:
from pandas import read_pickle

df = read_pickle('data/data_ch3.pkl')
# Get numeric (non-categorical) columns, then drop the filtered columns
num_df = df.select_dtypes(include=['int', 'float'])
filter_cols = ['banner_pos', 'hour_of_day']
new_df = num_df[num_df.columns[~num_df.columns.isin(filter_cols)]]

In [50]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix


X = new_df.loc[:, ~new_df.columns.isin(['click'])]
y = new_df.click

# Set up classifier using training data to predict test data
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size = .2, random_state = 0)
clf = DecisionTreeClassifier()
y_pred = clf.fit(X_train, y_train).predict(X_test) 

# Define confusion matrix and four categories
conf_matrix = confusion_matrix(y_test, y_pred)
tn = conf_matrix[0][0]
fp = conf_matrix[0][1]
fn = conf_matrix[1][0]
tp = conf_matrix[1][1]

print("TN: %s, FP: %s, FN: %s, TP: %s" %(tn, fp, fn, tp))

TN: 86, FP: 3, FN: 9, TP: 2


Notice that the largest category is TN (the model predicts no click, and the actual outcome was no click). This makes sense: most impressions in the sample do not lead to a click, so the model mostly predicts non-click.
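
To make the four categories concrete, here is a minimal sketch that counts them by hand for binary labels. The helper name `count_outcomes` and the toy label lists are illustrative assumptions, not part of the exercise data. The return order is meant to match `confusion_matrix(...).ravel()` for labels 0 and 1.

```python
def count_outcomes(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = click, 0 = no click."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1   # predicted click, was a click
        elif actual == 0 and predicted == 1:
            fp += 1   # predicted click, was not a click
        elif actual == 1 and predicted == 0:
            fn += 1   # predicted no click, was a click
        else:
            tn += 1   # predicted no click, was not a click
    return tn, fp, fn, tp

# Toy labels (hypothetical, for illustration only)
toy_true = [0, 0, 0, 1, 1, 0, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 0, 0, 0]
tn, fp, fn, tp = count_outcomes(toy_true, toy_pred)
print("TN: %s, FP: %s, FN: %s, TP: %s" % (tn, fp, fn, tp))
```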

## ROI on ad spend
The return on investment (ROI) for ad spend can be categorized using the four outcomes from a confusion matrix. This quantity is defined as the ratio between the total return and the total cost. If this quantity is greater than 1, it indicates the total return was greater than the total cost and vice versa. In this exercise, you will compute a sample ROI assuming a fixed r, the return on a click per number of impressions, and cost, the cost per number of impressions.

In [46]:
# Compute confusion matrix and get four categories
conf_matrix = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()

# Calculate total return, total spent, and ROI
r = 0.2
cost = 0.05
total_return = tp * r
total_cost = (tp + fp) * cost 
roi = total_return / total_cost
print("Total return: %s, Total cost: %s, ROI: %s" %(
  total_return, total_cost, roi))

Total return: 0.4, Total cost: 0.30000000000000004, ROI: 1.3333333333333333


Note that the ROI is greater than 1, so the total return exceeds the total cost. Under the assumed values of r and cost, such an ad campaign is worth it for a company.
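
Because the cost is paid on every predicted click (`tp + fp`) while only true positives earn a return, the ROI defined above can also be written as the click-class precision times `r / cost`. The sketch below checks this identity. The counts `tp = 2` and `fp = 4` are an assumption inferred from the printed totals (0.4 / 0.2 = 2 true positives, 0.3 / 0.05 = 6 predicted clicks), and the function names are illustrative.

```python
def roi_from_counts(tp, fp, r, cost):
    """ROI = total return / total cost, as defined in the exercise."""
    return (tp * r) / ((tp + fp) * cost)

def roi_from_precision(tp, fp, r, cost):
    """Same quantity, written as click-class precision times r / cost."""
    precision = tp / (tp + fp)
    return precision * r / cost

tp, fp = 2, 4          # counts consistent with the printed totals (assumed)
r, cost = 0.2, 0.05    # same values as in the exercise
roi_a = roi_from_counts(tp, fp, r, cost)
roi_b = roi_from_precision(tp, fp, r, cost)
print("ROI from counts: %s, ROI from precision: %s" % (roi_a, roi_b))

# ROI exceeds 1 exactly when click-class precision exceeds cost / r
break_even_precision = cost / r
print("Break-even precision: %s" % break_even_precision)
```

With these values the break-even precision is 0.25, so any classifier whose predicted clicks are right more than a quarter of the time would show an ROI above 1.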

## Precision and recall
Both precision and recall are related to the four outcomes discussed in the prior lesson and are important evaluation metrics for any machine learning model. An ad CTR model should ideally have high precision (high ROI on ad spend) and recall (relevant audience targeting). Although it is possible to calculate precision and recall by hand, sklearn has some handy implementations that you can easily plug into the existing workflow. In this exercise, you will set up a decision tree and calculate precision and recall.

In [47]:
from sklearn.metrics import precision_score, recall_score

# Set up training and testing split
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size = .2, random_state = 0)

# Create classifier and make predictions
clf = DecisionTreeClassifier()
y_pred = clf.fit(X_train, y_train).predict(X_test) 

# Evaluate precision and recall
prec = precision_score(y_test, y_pred, average = 'weighted')
recall = recall_score(y_test, y_pred, average = 'weighted')
print("Precision: %s, Recall: %s" %(prec, recall))

Precision: 0.8496842105263158, Recall: 0.88


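Precision is the proportion of predicted clicks that were actual clicks, and recall is the proportion of actual clicks that the model identified. Because average = 'weighted' is used, the values printed here are averages over both classes, weighted by class frequency, so they are dominated by the non-click class. They come out at roughly 85% and 88%. Now let's jump into comparisons with other classifiers.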
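
The weighted scores can be reproduced by hand from the confusion matrix printed earlier in this chapter (TN: 86, FP: 3, FN: 9, TP: 2). The following is a minimal sketch with illustrative helper names. It also prints the click-class values, which are much lower than the weighted averages.

```python
def per_class_metrics(tn, fp, fn, tp):
    """Precision, recall, and support for the non-click (0) and click (1) classes."""
    no_click = {"precision": tn / (tn + fn), "recall": tn / (tn + fp), "support": tn + fp}
    click = {"precision": tp / (tp + fp), "recall": tp / (tp + fn), "support": tp + fn}
    return no_click, click

def weighted(metric, no_click, click):
    """Support-weighted average of a per-class metric (analogous to average='weighted')."""
    total = no_click["support"] + click["support"]
    return (no_click[metric] * no_click["support"]
            + click[metric] * click["support"]) / total

# Counts taken from the confusion matrix output earlier in the chapter
no_click, click = per_class_metrics(tn=86, fp=3, fn=9, tp=2)
weighted_precision = weighted("precision", no_click, click)
weighted_recall = weighted("recall", no_click, click)

print("Click-class precision: %.3f, recall: %.3f" % (click["precision"], click["recall"]))
print("Weighted precision: %.4f, recall: %.4f" % (weighted_precision, weighted_recall))
```

For ROI purposes the click-class precision is the more relevant number, since cost is only paid on predicted clicks.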

## Baseline
Evaluating a classifier relative to an appropriate baseline is important. This is especially true for imbalanced datasets, such as ad click-through, because high accuracy can easily be achieved through always selecting the majority class. In this exercise, you will simulate a baseline classifier that always predicts the majority class (non-click) and look at its confusion matrix, as well as what its precision and recall are.

In [48]:
from numpy import asarray

# Set up baseline predictions
y_pred = asarray([0 for x in range(len(X_test))])

# Look at confusion matrix
print("Confusion matrix: ")
print(confusion_matrix(y_test, y_pred))

# Check precision and recall
prec = precision_score(y_test, y_pred, average = 'weighted')
recall = recall_score(y_test, y_pred, average = 'weighted')
print("Precision: %s, Recall: %s" %(prec, recall))

Confusion matrix: 
[[89  0]
 [11  0]]
Precision: 0.7921, Recall: 0.89




Notice that the numbers of true and false positives are both 0, which is expected by design. Also note that the weighted recall here is 89% simply because of the imbalanced nature of the dataset: the baseline gets every non-click right and every click wrong.
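
One way to see why the baseline scores so well is that support-weighted recall reduces to plain accuracy. The sketch below checks this on the baseline confusion matrix from the output above. The function names are illustrative, and treating an undefined per-class precision as 0 is an assumption chosen to be consistent with the printed value of 0.7921.

```python
def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that were correct."""
    return (tn + tp) / (tn + fp + fn + tp)

def weighted_recall(tn, fp, fn, tp):
    """Support-weighted recall over both classes."""
    support_0, support_1 = tn + fp, tp + fn
    recall_0 = tn / support_0
    recall_1 = tp / support_1
    return (recall_0 * support_0 + recall_1 * support_1) / (support_0 + support_1)

def weighted_precision(tn, fp, fn, tp):
    """Support-weighted precision; an undefined per-class precision counts as 0."""
    support_0, support_1 = tn + fp, tp + fn
    precision_0 = tn / (tn + fn) if (tn + fn) else 0.0
    precision_1 = tp / (tp + fp) if (tp + fp) else 0.0
    return (precision_0 * support_0 + precision_1 * support_1) / (support_0 + support_1)

# Baseline confusion matrix from the output above: [[89, 0], [11, 0]]
baseline = dict(tn=89, fp=0, fn=11, tp=0)
print("Accuracy: %s" % accuracy(**baseline))
print("Weighted recall: %s" % weighted_recall(**baseline))
print("Weighted precision: %s" % weighted_precision(**baseline))
```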

## Classifier comparison
The ROI framework can be run across different classifiers to see how higher precision and recall lead to higher ROI values. Note that the baseline classifier you created would have a total return and cost of 0 since both the true positives tp and false positives fp will be 0 by design. In this exercise, you will use the ROI framework to compare a logistic regression and decision tree classifier.

In [51]:
# Create and fit classifier
clf = LogisticRegression()
y_pred = clf.fit(X_train, y_train).predict(X_test) 

# Calculate total return, total spent, and ROI 
r, cost = 0.2, 0.05
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
total_return = tp * r
total_spent = (tp + fp) * cost 
roi = total_return / total_spent
print("Total return: %s, Total spent: %s, ROI: %s" %(total_return, total_spent, roi))

Total return: 0.0, Total spent: 0.0, ROI: nan




In [52]:
# Create and fit decision tree classifier
clf = DecisionTreeClassifier()
y_pred = clf.fit(X_train, y_train).predict(X_test) 

# Calculate total return, total spent, and ROI 
r, cost = 0.2, 0.05
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
total_return = tp * r
total_spent = (tp + fp) * cost 
roi = total_return / total_spent
print("Total return: %s, Total spent: %s, ROI: %s" %(total_return, total_spent, roi))

Total return: 0.4, Total spent: 0.25, ROI: 1.6


Notice that the logistic regression classifier had a total return of 0 and a total spend of 0, so its ROI is undefined (nan). This is because it predicted only 0's, similar to the baseline classifier. Using a decision tree classifier, we see that the ROI is greater than 1.
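
When a classifier predicts no clicks at all, the ROI ratio divides zero by zero. A small guard makes that case explicit instead of relying on a runtime warning. This is a sketch with a hypothetical helper name `safe_roi`. The counts are inferred from the two outputs above (0 predicted clicks for logistic regression, and 0.25 / 0.05 = 5 predicted clicks with 2 true positives for the decision tree).

```python
import math

def safe_roi(tp, fp, r=0.2, cost=0.05):
    """Return ROI, or NaN when no impressions were targeted (tp + fp == 0)."""
    total_spent = (tp + fp) * cost
    if total_spent == 0:
        return math.nan
    return (tp * r) / total_spent

# Counts implied by the two outputs above (assumed, see lead-in)
results = {
    "logistic regression": safe_roi(tp=0, fp=0),
    "decision tree": safe_roi(tp=2, fp=3),
}
for name, roi in results.items():
    label = "undefined (no predicted clicks)" if math.isnan(roi) else "%.2f" % roi
    print("%s ROI: %s" % (name, label))
```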

## Regularization
Regularization is the process of adding information or constraints to a model in order to prevent overfitting. This is important in order to boost the evaluation metrics you saw earlier in the chapter. In this exercise, you will vary the max depth parameter of a decision tree, which limits how complex the tree can become, to see how the classification results are affected.

In [53]:
# Iterate over different levels of max depth
for max_depth_val in [2, 3, 5, 10, 15, 20]:
  # Create and fit model
  clf = DecisionTreeClassifier(max_depth = max_depth_val)
  print("Evaluating tree with max_depth = %s" %(max_depth_val))
  y_pred = clf.fit(X_train, y_train).predict(X_test) 
  
  # Evaluate confusion matrix, precision, recall
  print("Confusion matrix: ")
  print(confusion_matrix(y_test, y_pred))
  prec = precision_score(y_test, y_pred, average = 'weighted')
  recall = recall_score(y_test, y_pred, average = 'weighted')
  print("Precision: %s, Recall: %s" %(prec, recall))

Evaluating tree with max_depth = 2
Confusion matrix: 
[[89  0]
 [11  0]]
Precision: 0.7921, Recall: 0.89
Evaluating tree with max_depth = 3
Confusion matrix: 
[[85  4]
 [ 9  2]]
Precision: 0.8414539007092199, Recall: 0.87
Evaluating tree with max_depth = 5
Confusion matrix: 
[[84  5]
 [ 9  2]]
Precision: 0.8352995391705069, Recall: 0.86
Evaluating tree with max_depth = 10
Confusion matrix: 
[[86  3]
 [ 9  2]]
Precision: 0.8496842105263158, Recall: 0.88
Evaluating tree with max_depth = 15
Confusion matrix: 
[[85  4]
 [ 9  2]]
Precision: 0.8414539007092199, Recall: 0.87
Evaluating tree with max_depth = 20
Confusion matrix: 
[[85  4]
 [ 9  2]]
Precision: 0.8414539007092199, Recall: 0.87




Note that the recall levels are fairly close to one another (0.86 to 0.89), and a max depth of 10 has the highest precision.
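
Small differences between depths can partly reflect randomness in how the tree breaks ties between equally good splits, so it may help to fix `random_state` when comparing depths. The sketch below assumes scikit-learn 0.22 or later (for `zero_division`) and uses synthetic data from `make_classification` as a stand-in for the click data. The helper `precision_by_depth` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the click data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

def precision_by_depth(depths, seed=0):
    """Fit one seeded tree per max_depth and return {depth: weighted precision}."""
    scores = {}
    for depth in depths:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=seed)
        y_hat = clf.fit(X_tr, y_tr).predict(X_te)
        scores[depth] = precision_score(
            y_te, y_hat, average="weighted", zero_division=0)
    return scores

# Two runs with the same seed should give identical scores
first_run = precision_by_depth([2, 3, 5, 10])
second_run = precision_by_depth([2, 3, 5, 10])
print(first_run)
```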

## Cross validation
Cross validation is a technique for checking a model's performance on held-out data. It is done to ensure that the testing performance was not an artifact of how the data happened to be split. In this exercise, you will use the KFold() class and cross_val_score() from sklearn to run a K-fold cross validation and assess precision and recall for a decision tree.

In [54]:
from sklearn.model_selection import KFold, cross_val_score

# Create model 
clf = DecisionTreeClassifier()

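# Set up k-fold (random_state is omitted: it only applies when shuffle = True,
# and recent sklearn versions raise an error if it is set without shuffling)
k_fold = KFold(n_splits = 4)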

# Evaluate precision and recall for each fold
precision = cross_val_score(
  clf, X_train, y_train, cv = k_fold, scoring = 'precision_weighted')
recall = cross_val_score(
  clf, X_train, y_train, cv = k_fold, scoring = 'recall_weighted')
print("Precision scores: %s" %(precision)) 
print("Recall scores: %s" %(recall))


Precision scores: [0.8285461  0.7303533  0.74347158 0.73416667]
Recall scores: [0.85 0.79 0.8  0.81]


The recall levels are fairly close to one another (0.79 to 0.85), suggesting that the particular split has only a modest effect on the classifier's results. In real-life settings, the practical takeaway is that cross validation is very useful, especially for large datasets: different splits will likely give different results, and relying on a single split can hide overfitting.
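
To illustrate what a K-fold split does to the row indices, here is a minimal pure-Python sketch of contiguous, unshuffled folds. The helper `k_fold_indices` is hypothetical and only mimics the general idea. sklearn's own implementation may differ in details.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs using contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=4))
for train_idx, test_idx in folds:
    print("train: %s, test: %s" % (train_idx, test_idx))
```

Each row appears in exactly one test fold, so every observation is used for evaluation once and for training in the remaining folds.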

## Model selection
Both regularization and cross validation are powerful tools in model selection. Regularization can help prevent overfitting and cross validation ensures that your models are being evaluated properly. In this exercise, you will use regularization and cross validation together and see whether or not models differ significantly. You will calculate the precision only, although the same exercise can easily be done for recall and other evaluation metrics as well.

In [55]:
# Iterate over different levels of max depth and set up k-fold
for max_depth_val in [3, 5, 10]:
  # random_state is omitted: it only applies when shuffle = True
  k_fold = KFold(n_splits = 4)
  clf = DecisionTreeClassifier(max_depth = max_depth_val)
  print("Evaluating Decision Tree for max_depth = %s" %(max_depth_val))
  y_pred = clf.fit(X_train, y_train).predict(X_test) 
  
  # Calculate precision for cross validation and test
  cv_precision = cross_val_score(
    clf, X_train, y_train, cv = k_fold, scoring = 'precision_weighted')
  precision = precision_score(y_test, y_pred, average = 'weighted')
  print("Cross validation Precision: %s" %(cv_precision))
  print("Test Precision: %s" %(precision))

Evaluating Decision Tree for max_depth = 3
Cross validation Precision: [0.7794898  0.70926316 0.71989796 0.6889    ]
Test Precision: 0.8414539007092199
Evaluating Decision Tree for max_depth = 5
Cross validation Precision: [0.7794898  0.66625    0.75473684 0.75421986]
Test Precision: 0.8304347826086956
Evaluating Decision Tree for max_depth = 10
Cross validation Precision: [0.81112135 0.7303533  0.74840426 0.73416667]
Test Precision: 0.8496842105263158




Note that the cross validation precision values are lower than the test precision for every max depth, and a max depth of 10 has the highest precision, both on the test set and on average across the folds.
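
Comparing fold-level arrays by eye is error-prone, so one option is to summarize each depth by the mean and standard deviation of its fold scores. The sketch below uses the fold-level precision values copied from the output above. The variable names are illustrative.

```python
from statistics import mean, stdev

# Fold-level precision values copied from the output above
cv_precision = {
    3: [0.7794898, 0.70926316, 0.71989796, 0.6889],
    5: [0.7794898, 0.66625, 0.75473684, 0.75421986],
    10: [0.81112135, 0.7303533, 0.74840426, 0.73416667],
}

# Mean and sample standard deviation of the fold scores for each depth
summary = {depth: (mean(scores), stdev(scores))
           for depth, scores in cv_precision.items()}
for depth, (avg, spread) in summary.items():
    print("max_depth = %s: mean = %.4f, std = %.4f" % (depth, avg, spread))

best_depth = max(summary, key=lambda depth: summary[depth][0])
print("Highest mean cross validation precision: max_depth = %s" % best_depth)
```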

## Random forests
Random forests are a classic and powerful ensemble method that combines individual decision trees via bootstrap aggregation (or bagging for short). Two main hyperparameters involved in this type of model are the number of trees and the max depth of each tree. In this exercise, you will implement and evaluate a simple random forest classifier with some fixed hyperparameter values.

In [58]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc

# Create random forest classifier with specified params
clf = RandomForestClassifier(n_estimators= 50, max_depth = 5)

# Train classifier once, then predict probability scores and labels
# (fitting twice would produce scores and labels from two different forests)
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)
y_pred = clf.predict(X_test)

# Get ROC curve metrics
fpr, tpr, thresholds = roc_curve(y_test, y_score[:, 1])
print("AUC of ROC: %s"%(auc(fpr, tpr)))

# Get precision and recall
precision = precision_score(y_test, y_pred, average = 'weighted')
recall = recall_score(y_test, y_pred, average = 'weighted')
print("Precision: %s, Recall: %s" %(precision, recall))

AUC of ROC: 0.5960163432073545
Precision: 0.8496842105263158, Recall: 0.88
Precision: 0.8496842105263158, Recall: 0.88


Note that it has better performance than logistic regression but is fairly close to that of decision trees. This is likely because the model is not very complex, given the fixed hyperparameter values chosen here. Now let's explore tuning the various hyperparameters.
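
The AUC of the ROC curve can be read as the probability that a randomly chosen click receives a higher score than a randomly chosen non-click. The sketch below computes it that way on toy data. The helper `pairwise_auc` and the toy scores are illustrative assumptions, and the pairwise loop is meant for intuition rather than for large datasets.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels and scores (hypothetical)
toy_true = [0, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_true, toy_score)
print("AUC: %s" % toy_auc)
```

An AUC of 0.5 corresponds to random ranking, which puts the value of about 0.6 reported above in context.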

## Grid search
Hyperparameter tuning can be done in sklearn by supplying candidate values for each hyperparameter. These lists of values can also be generated with numpy functions. One method of tuning is grid search, which exhaustively evaluates all combinations of the hyperparameters specified via param_grid. In this exercise, you will use grid search over the hyperparameters of a sample random forest classifier, with the AUC of the ROC curve as the scoring function.

In [59]:
from sklearn.model_selection import GridSearchCV

# Create list of hyperparameters 
n_estimators = [10, 50]
max_depth = [5, 20]
param_grid = {'n_estimators': n_estimators, 'max_depth': max_depth}

# Use Grid search CV to find best parameters 
print("starting RF grid search.. ")
rf = RandomForestClassifier()
clf = GridSearchCV(estimator = rf, param_grid = param_grid, scoring = 'roc_auc')
clf.fit(X_train, y_train)
print("Best Score: ")
print(clf.best_score_)
print("Best Estimator: ")
print(clf.best_estimator_)

starting RF grid search.. 




Best Score: 
0.6237490875262615
Best Estimator: 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)


As seen here, the best AUC is around 0.62, and it comes from the model with the larger number of trees (50) and the smaller max depth (5).
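
To see what "all combinations" means for this grid, the following sketch enumerates the candidate settings with `itertools.product`. The helper `expand_grid` is hypothetical and only illustrates the idea. It is not how sklearn exposes the grid.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]

# Same grid as in the exercise
param_grid = {'n_estimators': [10, 50], 'max_depth': [5, 20]}
candidates = expand_grid(param_grid)
for params in candidates:
    print(params)
print("Number of candidate models: %s" % len(candidates))
```

Each candidate is fit once per cross validation fold, so the total number of fits is the number of candidates times the number of folds. This is why grid search becomes expensive as the grid grows.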