# Classifier Evaluation

**Write and execute R code in the code cells per the instructions. The expected results are provided directly after each code cell.**

In [150]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)

## Business Model

Our company has 1000 prospective customers, of which we expect 50 to buy our product, but we do not know which customers those are. Meeting with a prospective customer costs \\$4500. A customer who buys our product brings in \\$100000 in revenue.

In [151]:
prospects = 1000
buyers = 50
passers = prospects - buyers
cost = 4500
revenue = 100000

profit.baseline = buyers*revenue - prospects*cost
data.frame(profit.baseline)

profit.baseline
500000
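
As a quick check on the baseline, the same numbers imply a break-even probability: a meeting pays for itself on average only when the chance that the prospect buys exceeds cost divided by revenue. The sketch below restates the business-model values and compares that threshold with the expected share of buyers. The names `breakeven` and `base_rate` are illustrative and are not defined elsewhere in this notebook.

```r
# Values restated from the business model above.
prospects = 1000
buyers    = 50
cost      = 4500
revenue   = 100000

# Expected profit of one meeting is p*revenue - cost, which is positive when p > cost/revenue.
breakeven = cost / revenue          # 0.045
base_rate = buyers / prospects      # 0.05

data.frame(breakeven, base_rate, worth_meeting_everyone = base_rate > breakeven)
```

Because the base rate sits only slightly above the break-even point, meeting everyone is profitable but leaves little margin, which is why a model that concentrates meetings on likely buyers can add value.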


## Business Decision

Which prospects should we meet so that we maximize profit?

The approach will be to build a model to predict which prospects will buy based on their market research scores, and then meet only with those prospects.

## Data

Here is some data about past customers.  Each customer is associated with two scores, x1 and x2, that were measured by a market research company.  Also, each customer is known to have either bought our company's product or passed on an opportunity to buy our company's product.

In [152]:
data = data.frame(x1=c(1,2,3,4,3,2,5,4,3,2,5,3,3,2,3,1,1,5,4,1,5,1,0,0,1,2,2,5,1,3,1,2,3,4,5,6,
                       1,3,3,6,3,2,5,4,3,4,5,3,3,2,3,1,2,5,4,1,5,1,1,1,1,2,2,5,1,3,1,2,3,4,5,6),
                  x2=c(3,2,6,5,4,5,3,8,9,0,0,9,7,4,5,5,4,5,6,3,2,4,3,5,4,6,5,1,2,3,4,5,4,3,4,8,
                       3,2,6,5,4,5,3,8,5,5,0,9,7,4,5,5,4,5,7,3,2,4,3,4,4,6,5,1,2,3,4,5,4,3,4,8),
                  class=c("buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy",
                          "pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass",
                          "buy","buy","buy","pass","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy","buy",
                          "pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass","pass"))

size(data)
data

observations,variables
72,3


x1,x2,class
1,3,buy
2,2,buy
3,6,buy
4,5,buy
3,4,buy
2,5,buy
5,3,buy
4,8,buy
3,9,buy
2,0,buy


In [153]:
sum(data$class == "pass")   # number of past customers who passed

## Problem 1

Build a naive Bayes model based on the data to predict which prospects will buy.

You may want to use these function(s):
* naiveBayes()

In [154]:
model = naiveBayes(class ~ x1+x2, data)
model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      buy      pass 
0.4861111 0.5138889 

Conditional probabilities:
      x1
Y          [,1]     [,2]
  buy  2.971429 1.271537
  pass 2.702703 1.853898

      x2
Y          [,1]     [,2]
  buy  4.685714 2.348609
  pass 4.027027 1.674755
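
To make the printout concrete, here is a rough sketch of how such a model turns the tables above into a probability. It assumes `naiveBayes()` is the e1071 implementation, which for numeric predictors reports the class-conditional mean in column `[,1]` and the standard deviation in column `[,2]` and treats each predictor as Gaussian within a class. The prospect scored here (`new.x1`, `new.x2`) is hypothetical, and the result should agree only approximately with `predict(model, ..., type="raw")`.

```r
# Class priors and per-class (mean, sd) pairs copied from the model printout above.
prior   = c(buy = 0.4861111, pass = 0.5138889)
x1.mean = c(buy = 2.971429,  pass = 2.702703)
x1.sd   = c(buy = 1.271537,  pass = 1.853898)
x2.mean = c(buy = 4.685714,  pass = 4.027027)
x2.sd   = c(buy = 2.348609,  pass = 1.674755)

# Hypothetical prospect to score (not a row taken from the data).
new.x1 = 3
new.x2 = 6

# Unnormalized posterior: prior times the Gaussian density of each score,
# with the two scores treated as independent within a class (the "naive" assumption).
joint = prior * dnorm(new.x1, x1.mean, x1.sd) * dnorm(new.x2, x2.mean, x2.sd)

# Normalize so the two class probabilities sum to 1.
posterior = joint / sum(joint)
posterior
```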


## Problem 2: In-Sample Evaluation

Make predictions on the same data used to build the model ("buy" cutoff=0.5), and show the resulting confusion matrix, accuracy, and business metric of the model in terms of profit.

You may want to use these function(s):
* colnames()
* predict()
* as.class()
* confusionMatrix()
* round()

To calculate profit, develop a formula that relies on the confusion matrix.  In the formula, round to the nearest whole number of prospects as appropriate.  

In [155]:
prob = predict(model, data, type="raw") # predict function uses only predictor variables (e.g., x1 & x2)
class.predicted = as.class(prob, class="buy", cutoff=0.5)

CM = confusionMatrix(class.predicted, data$class)$table
cm = CM / sum(CM)

fmt.cm(cm)

insample_accuracy = (cm[1,1]+cm[2,2])/sum(cm)
fmt(insample_accuracy)

Unnamed: 0,buy,pass
buy,0.25,0.0833333
pass,0.2361111,0.4305556


insample_accuracy
0.6805556
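
The cutoff and confusion-matrix steps can also be sketched in base R. This assumes that `as.class()` (a helper loaded by `setup.R`) labels an observation "buy" when its predicted probability of "buy" reaches the cutoff; that behavior is an assumption, and the six probabilities below are made up for illustration.

```r
# Hypothetical "buy" probabilities and true classes for six customers (illustration only).
prob.buy   = c(0.81, 0.62, 0.47, 0.35, 0.55, 0.20)
actual.toy = factor(c("buy", "buy", "buy", "pass", "pass", "pass"), levels = c("buy", "pass"))

# Apply the cutoff: predict "buy" when the probability reaches 0.5.
cutoff = 0.5
predicted.toy = factor(ifelse(prob.buy >= cutoff, "buy", "pass"), levels = c("buy", "pass"))

# Rows are predictions and columns are actual classes, the same layout as the table above.
CM.toy = table(predicted = predicted.toy, actual = actual.toy)
cm.toy = CM.toy / sum(CM.toy)

# Accuracy is the share of observations on the diagonal.
accuracy.toy = sum(diag(cm.toy))
accuracy.toy
```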


In [156]:
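bb = cm[1,1]   # predicted buy, actually buy
pb = cm[2,1]   # predicted pass, actually buy
buyers.met = round((bb / (bb + pb)) * buyers)   # buyers we meet (avoids overwriting buyers)
profit = buyers.met*revenue - buyers.met*cost


bp = cm[1,2]   # predicted buy, actually pass
pp = cm[2,2]   # predicted pass, actually pass

passers.met = round((bp / (bp + pp)) * passers)   # passers we meet by mistake
profit2 = 0 - passers.met*cost

data.frame(buy_buy=profit,
           pass_buy=0,
           buy_pass=profit2,
           pass_pass=0,
           insample_profit=profit+profit2)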

buy_buy,pass_buy,buy_pass,pass_pass,insample_profit
2483000,0,-693000,0,1790000
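
Since the same profit calculation is needed again for every cross-validation fold, it can be wrapped in a function. The sketch below is one possible way to do that; the function name `profit_from_cm` and its argument names are illustrative, and it expects rows to be predicted classes and columns to be actual classes, both ordered ("buy", "pass").

```r
# Profit implied by a confusion matrix (rows = predicted, columns = actual).
profit_from_cm = function(cm, buyers, passers, revenue, cost) {
  met.buyers  = round(cm[1,1] / (cm[1,1] + cm[2,1]) * buyers)    # buyers predicted "buy", so met
  met.passers = round(cm[1,2] / (cm[1,2] + cm[2,2]) * passers)   # passers predicted "buy", met by mistake
  met.buyers * (revenue - cost) - met.passers * cost
}

# In-sample confusion matrix copied from the output above, filled column by column.
cm.insample = matrix(c(0.25, 0.2361111, 0.0833333, 0.4305556), nrow = 2,
                     dimnames = list(predicted = c("buy", "pass"), actual = c("buy", "pass")))

profit_from_cm(cm.insample, buyers = 50, passers = 950, revenue = 100000, cost = 4500)
# 1790000
```

The function does not guard against a column of the confusion matrix summing to zero (for example, a test set with no actual buyers), so a real implementation would need to handle that case.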


## Problem 3: Evaluation by Cross-Validation

Partition the data into 5 test folds.

For each fold, build a naive Bayes model based on training data, make predictions based on test data ("buy" cutoff=0.5), and show the resulting confusion matrix, accuracy, & business metric in terms of profit.

Show the model, its cross-validation accuracy, & its cross-validation business value in terms of profit.

You may want to use these function(s):
* set.seed()
* createFolds()
* setdiff()
* colnames()
* naiveBayes()
* predict()
* as.class()
* confusionMatrix()
* round()

Use `set.seed(12345)` and `createFolds(..., k=5)` to do the partitioning.

In [157]:
set.seed(12345)
fold = createFolds(data$class, k=5)
fold
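
For intuition about what the partitioning produces, here is a base R sketch that deals shuffled row numbers into 5 test folds. It is not a replacement for `createFolds()`: the caret function also tries to balance the classes across folds, so the folds below will generally differ from the ones used in this notebook. The name `fold.manual` is illustrative.

```r
set.seed(12345)

n = 72   # number of observations in the data
k = 5    # number of test folds

# Shuffle the row numbers, then deal them into k groups of nearly equal size.
shuffled    = sample(1:n)
fold.manual = split(shuffled, rep(1:k, length.out = n))

# Each row lands in exactly one test fold; the training rows are all the others.
test.rows  = fold.manual[[1]]
train.rows = setdiff(1:n, test.rows)

sapply(fold.manual, length)
```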

In [158]:
accuracy = c(NA, NA, NA, NA, NA)
profits = c(NA, NA, NA, NA, NA)

data.train = data[setdiff(1:nrow(data), fold[[1]]),]
data.dev = data[fold[[1]],]

model.1 = naiveBayes(class ~ x1+x2, data.train)
prob.1 = predict(model.1, data.dev, type="raw")
class.predicted.1 = as.class(prob.1, class="buy", cutoff=0.5)
   
CM.1 = confusionMatrix(class.predicted.1, data.dev$class)$table
cm.1 = CM.1/sum(CM.1)
fmt.cm(cm.1)

accuracy_1 = cm.1[1,1]+cm.1[2,2]
accuracy[1] = accuracy_1
fmt(accuracy_1)


bb.1 = cm.1[1,1]
pb.1 = cm.1[2,1]
buyers.1 = round((bb.1 / (bb.1 + pb.1)) * 50)
profit.1 = buyers.1*revenue - buyers.1*cost


bp.1 = cm.1[1,2]
pp.1 = cm.1[2,2]

buyers2.1 = round((bp.1 / (bp.1 + pp.1)) * passers)
profit2.1 = 0 - buyers2.1*cost

profits[1] = profit.1 + profit2.1

data.frame(buy_buy=profit.1,
           pass_buy=0,
           buy_pass=profit2.1,
           pass_pass=0,
           fold_profit=profit.1+profit2.1)


Unnamed: 0,buy,pass
buy,0.5,0.0
pass,0.0,0.5


accuracy_1
1


buy_buy,pass_buy,buy_pass,pass_pass,fold_profit
4775000,0,0,0,4775000


In [159]:
data.train = data[setdiff(1:nrow(data), fold[[2]]),]
data.dev = data[fold[[2]],]

model.2 = naiveBayes(class ~ x1+x2, data.train)
prob.2 = predict(model.2, data.dev, type="raw")
class.predicted.2 = as.class(prob.2, class="buy", cutoff=0.5)
   
CM.2 = confusionMatrix(class.predicted.2, data.dev$class)$table
cm.2 = CM.2/sum(CM.2)
fmt.cm(cm.2)

accuracy_2 = cm.2[1,1]+cm.2[2,2]
accuracy[2] = accuracy_2
fmt(accuracy_2)


bb.2 = cm.2[1,1]
pb.2 = cm.2[2,1]
buyers.2 = round((bb.2 / (bb.2 + pb.2)) * 50)
profit.2 = buyers.2*revenue - buyers.2*cost


bp.2 = cm.2[1,2]
pp.2 = cm.2[2,2]

buyers2.2 = round((bp.2 / (bp.2 + pp.2)) * passers)
profit2.2 = 0 - buyers2.2*cost

profits[2] = profit.2 + profit2.2

data.frame(buy_buy=profit.2,
           pass_buy=0,
           buy_pass=profit2.2,
           pass_pass=0,
           fold_profit=profit.2+profit2.2)


Unnamed: 0,buy,pass
buy,0.2857143,0.2857143
pass,0.2142857,0.2142857


accuracy_2
0.5


buy_buy,pass_buy,buy_pass,pass_pass,fold_profit
2769500,0,-2443500,0,326000


In [160]:
data.train = data[setdiff(1:nrow(data), fold[[3]]),]
data.dev = data[fold[[3]],]

model.3 = naiveBayes(class ~ x1+x2, data.train)
prob.3 = predict(model.3, data.dev, type="raw")
class.predicted.3 = as.class(prob.3, class="buy", cutoff=0.5)
   
CM.3 = confusionMatrix(class.predicted.3, data.dev$class)$table
cm.3 = CM.3/sum(CM.3)
fmt.cm(cm.3)

accuracy_3 = cm.3[1,1]+cm.3[2,2]
accuracy[3] = accuracy_3
fmt(accuracy_3)


bb.3 = cm.3[1,1]
pb.3 = cm.3[2,1]
buyers.3 = round((bb.3 / (bb.3 + pb.3)) * 50)
profit.3 = buyers.3*revenue - buyers.3*cost


bp.3 = cm.3[1,2]
pp.3 = cm.3[2,2]

buyers2.3 = round((bp.3 / (bp.3 + pp.3)) * passers)
profit2.3 = 0 - buyers2.3*cost

profits[3] = profit.3 + profit2.3

data.frame(buy_buy=profit.3,
           pass_buy=0,
           buy_pass=profit2.3,
           pass_pass=0,
           fold_profit=profit.3+profit2.3)

Unnamed: 0,buy,pass
buy,0.2,0.1333333
pass,0.2666667,0.4


accuracy_3
0.6


buy_buy,pass_buy,buy_pass,pass_pass,fold_profit
2005500,0,-1071000,0,934500


In [161]:
data.train = data[setdiff(1:nrow(data), fold[[4]]),]
data.dev = data[fold[[4]],]

model.4 = naiveBayes(class ~ x1+x2, data.train)
prob.4 = predict(model.4, data.dev, type="raw")
class.predicted.4 = as.class(prob.4, class="buy", cutoff=0.5)
   
CM.4 = confusionMatrix(class.predicted.4, data.dev$class)$table
cm.4 = CM.4/sum(CM.4)
fmt.cm(cm.4)

accuracy_4 = cm.4[1,1]+cm.4[2,2]
accuracy[4] = accuracy_4
fmt(accuracy_4)


bb.4 = cm.4[1,1]
pb.4 = cm.4[2,1]
buyers.4 = round((bb.4 / (bb.4 + pb.4)) * 50)
profit.4 = buyers.4*revenue - buyers.4*cost


bp.4 = cm.4[1,2]
pp.4 = cm.4[2,2]

buyers2.4 = round((bp.4 / (bp.4 + pp.4)) * passers)
profit2.4 = 0 - buyers2.4*cost

profits[4] = profit.4 + profit2.4

data.frame(buy_buy=profit.4,
           pass_buy=0,
           buy_pass=profit2.4,
           pass_pass=0,
           fold_profit=profit.4+profit2.4)

Unnamed: 0,buy,pass
buy,0.0,0.0
pass,0.4666667,0.5333333


accuracy_4
0.5333333


buy_buy,pass_buy,buy_pass,pass_pass,fold_profit
0,0,0,0,0


In [162]:
data.train = data[setdiff(1:nrow(data), fold[[5]]),]
data.dev = data[fold[[5]],]

model.5 = naiveBayes(class ~ x1+x2, data.train)
prob.5 = predict(model.5, data.dev, type="raw")
class.predicted.5 = as.class(prob.5, class="buy", cutoff=0.5)
   
CM.5 = confusionMatrix(class.predicted.5, data.dev$class)$table
cm.5 = CM.5/sum(CM.5)
fmt.cm(cm.5)

accuracy_5 = cm.5[1,1]+cm.5[2,2]
accuracy[5] = accuracy_5
fmt(accuracy_5)


bb.5 = cm.5[1,1]
pb.5 = cm.5[2,1]
buyers.5 = round((bb.5 / (bb.5 + pb.5)) * 50)
profit.5 = buyers.5*revenue - buyers.5*cost


bp.5 = cm.5[1,2]
pp.5 = cm.5[2,2]

buyers2.5 = round((bp.5 / (bp.5 + pp.5)) * passers)
profit2.5 = 0 - buyers2.5*cost

profits[5] = profit.5 + profit2.5

data.frame(buy_buy=profit.5,
           pass_buy=0,
           buy_pass=profit2.5,
           pass_pass=0,
           fold_profit=profit.5+profit2.5)

Unnamed: 0,buy,pass
buy,0.1428571,0.1428571
pass,0.3571429,0.3571429


accuracy_5
0.5


buy_buy,pass_buy,buy_pass,pass_pass,fold_profit
1337000,0,-1219500,0,117500
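
The five cells above repeat one pattern: hold out a fold, fit on the rest, score on the held-out rows. As a sketch of that pattern only, the loop below takes the fitting and scoring steps as functions. Everything here is hypothetical: `cross_validate` is not part of this notebook's setup, the toy data is made up, and the stand-in "model" simply predicts the most common training class so that the example runs without extra packages.

```r
# Hypothetical helper: run cross-validation for any fit/score pair.
cross_validate = function(data, folds, fit, score) {
  sapply(folds, function(test.rows) {
    data.train = data[setdiff(1:nrow(data), test.rows), ]   # rows outside the fold
    data.test  = data[test.rows, ]                           # rows inside the fold
    score(fit(data.train), data.test)
  })
}

# Toy data and folds (illustration only).
toy       = data.frame(x1 = 1:10, class = rep(c("buy", "pass"), times = c(4, 6)))
toy.folds = split(1:10, rep(1:5, length.out = 10))

# Stand-in model: remember the most common class in the training rows.
fit.majority   = function(d) names(which.max(table(d$class)))
# Score: share of test rows whose class matches the remembered class.
score.accuracy = function(model, d) mean(d$class == model)

toy.accuracy = cross_validate(toy, toy.folds, fit.majority, score.accuracy)
mean(toy.accuracy)
```

In this notebook, the fitting function would presumably wrap `naiveBayes(class ~ x1+x2, ...)`, and the scoring function would return the accuracy or the profit computed from the fold's confusion matrix.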


In [163]:
model

cv_accuracy = mean(accuracy)
cv_profit = mean(profits)

data.frame(cv_accuracy=cv_accuracy,
           cv_profit=cv_profit)


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      buy      pass 
0.4861111 0.5138889 

Conditional probabilities:
      x1
Y          [,1]     [,2]
  buy  2.971429 1.271537
  pass 2.702703 1.853898

      x2
Y          [,1]     [,2]
  buy  4.685714 2.348609
  pass 4.027027 1.674755


cv_accuracy,cv_profit
0.6266667,1230600
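
The cross-validation estimates are plain averages of the per-fold results, and the spread across folds is worth a look. The sketch below re-enters the five fold results shown above; the names `fold.accuracy` and `fold.profits` are illustrative.

```r
# Per-fold results copied from the outputs above.
fold.accuracy = c(1, 0.5, 0.6, 0.5333333, 0.5)
fold.profits  = c(4775000, 326000, 934500, 0, 117500)

data.frame(cv_accuracy = mean(fold.accuracy),
           cv_profit   = mean(fold.profits),
           profit.min  = min(fold.profits),
           profit.max  = max(fold.profits))
```

The fold profits range from 0 to 4775000, so with test folds of only about 15 customers each, the averaged profit is a noisy estimate.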


## Problem 4: Benefit of the Model

What is the model worth to our company in terms of how much it is expected to increase profit? In other words, what is the opportunity cost of not using the model?

In [164]:
improvement = cv_profit - profit.baseline

data.frame(profit.baseline=profit.baseline,
           profit.with_model=cv_profit,
           improvement=improvement)

profit.baseline,profit.with_model,improvement
500000,1230600,730600


<font size="1">
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised December 17, 2019
</span>
</p>
</font>