# Project Part C: Classification

![](banner_project.jpg)

In [1]:
analyst = "Citlalli Villarreal" # Replace this with your name

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Construct, evaluate, and tune a classifier trained on a transformed dataset about public company fundamentals.  Later, use the classifier along with additional analysis to recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Construct a model to predict whether stock price will grow more than 30% over 12 months, given 12 months of past company fundamentals data, and using a machine learning model construction method and transformed data.

Evaluate the model's business performance based on a business model and business parameters.

Tune the model by exhaustive search for the best performing model.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \\$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio 

Fill the portfolio with companies you predict to have the highest probabilities of growing above 30%.
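
As a quick check of the formulas above, the following sketch computes profit and profit rate for a made-up portfolio. The function name `portfolio_profit` and the growth values are hypothetical, chosen only for illustration; they are not taken from the dataset.

```r
# Profit of a portfolio: value after growth minus the amount invested.
portfolio_profit <- function(growth, allocation) {
  budget <- sum(allocation)                 # budget = sum of allocations
  sum((1 + growth) * allocation) - budget   # profit
}

budget <- 1000000
portfolio_size <- 12
allocation <- rep(budget / portfolio_size, portfolio_size)

# Hypothetical growth rates: six companies gain 30%, six lose 10%.
growth <- c(rep(0.30, 6), rep(-0.10, 6))

profit <- portfolio_profit(growth, allocation)
profit_rate <- profit / budget

print(c(profit = profit, profit_rate = profit_rate))
```

With equal allocations, the profit rate reduces to the mean growth of the selected companies, here (6 * 0.30 - 6 * 0.10) / 12 = 0.10.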

In [3]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

The data is read from the "My Data.csv" file, and 'big_growth' is converted to a factor with levels "YES" and "NO". The first 6 rows of the data frame are shown below.

In [4]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)
data$big_growth = factor(data$big_growth, levels=c("YES","NO"))

# Present a few rows ...
data[1:6,]


gvkey,tic,conm,PC1,PC2,PC3,prccq,growth,big_growth
1004,AIR,AAR CORP,1.4097638,0.2124544,-0.18735809,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-2.8093139,0.2246363,1.43661206,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,1.5247216,0.4396434,-0.16785608,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,1.5736687,0.6384403,0.01227541,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,1.2812646,0.4529129,0.09293832,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3697622,-0.4860613,-0.01283639,85.2,0.0002347969,NO
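
The 'big_growth' label appears to mark companies whose 12-month 'growth' exceeds 30%, consistent with the objective above. As a sanity check under that assumption (the exact rule used when the dataset was prepared is not shown here), the sketch below rebuilds the label from the six 'growth' values displayed above and compares it with the displayed 'big_growth' column. The helper name `label_big_growth` is hypothetical.

```r
# Assumed labeling rule: "YES" when growth is strictly above the threshold.
label_big_growth <- function(growth, threshold = 0.30) {
  factor(ifelse(growth > threshold, "YES", "NO"), levels = c("YES", "NO"))
}

# Values copied from the six rows shown above.
growth <- c(0.0507455507, -0.3828560446, 0.3157894737,
            -0.2164739518, -0.1184971098, 0.0002347969)
shown <- factor(c("NO", "NO", "YES", "NO", "NO", "NO"), levels = c("YES", "NO"))

rebuilt <- label_big_growth(growth)
print(data.frame(growth, rebuilt, shown))
print(all(rebuilt == shown))
```

Listing "YES" first keeps the factor levels in the same order as the notebook's own `factor(...)` call.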


## Build Classification Model


Below, a naive Bayes model is constructed with Laplace smoothing enabled. It predicts 'big_growth' from the predictor variables 'PC1', 'PC2', and 'PC3'.

In [5]:
# Construct a naive Bayes model to predict big_growth given PC1, PC2, PC3 (use laplace=TRUE).
# Present a brief summary of the model parameters.

model = naiveBayes(big_growth ~ PC1+PC2+PC3, data, laplace=TRUE)
model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
       YES         NO 
0.08362369 0.91637631 

Conditional probabilities:
     PC1
Y           [,1]     [,2]
  YES  1.1285253 1.331656
  NO  -0.1029833 5.674085

     PC2
Y            [,1]      [,2]
  YES  0.24124482 0.8980264
  NO  -0.02201474 4.7929504

     PC3
Y             [,1]      [,2]
  YES -0.014239305 0.6770092
  NO   0.001299404 3.6532777
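
For numeric predictors, the two columns in each conditional table printed above are the per-class mean and standard deviation, and the model scores an observation by multiplying the class prior by one normal density per predictor. The sketch below reproduces that calculation by hand for the CECO ENVIRONMENTAL CORP row, using the rounded parameters printed above. It is a simplified illustration under those assumptions, not the package's `predict` code (it ignores, for instance, any special handling of very small densities), and the variable names are illustrative. Laplace smoothing adjusts counts for categorical predictors, so it is not expected to alter these Gaussian parameters.

```r
# Parameters copied (rounded) from the model summary above.
prior <- c(YES = 0.08362369, NO = 0.91637631)
mu <- rbind(YES = c(PC1 = 1.1285253,  PC2 = 0.24124482,  PC3 = -0.014239305),
            NO  = c(PC1 = -0.1029833, PC2 = -0.02201474, PC3 = 0.001299404))
sigma <- rbind(YES = c(PC1 = 1.331656, PC2 = 0.8980264, PC3 = 0.6770092),
               NO  = c(PC1 = 5.674085, PC2 = 4.7929504, PC3 = 3.6532777))

# One observation: the CECO ENVIRONMENTAL CORP row from the data preview.
x <- c(PC1 = 1.5247216, PC2 = 0.4396434, PC3 = -0.16785608)

# Unnormalized score per class: prior times the product of normal densities.
score <- sapply(c("YES", "NO"), function(cls) {
  prior[[cls]] * prod(dnorm(x, mean = mu[cls, ], sd = sigma[cls, ]))
})

# Normalize so the two class probabilities sum to 1.
posterior <- score / sum(score)
print(round(posterior, 4))
```

For this row the "YES" posterior lands above 0.5. The "NO" class has much larger standard deviations, so its density is spread thin and the narrower "YES" density dominates near the center of the data. This may help explain why the accuracy reported later at cutoff 0.5 is low even though "NO" is the majority class.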


## Evaluate Classification Model (5-fold cross-validation)

Below, 5 naive Bayes models are constructed with Laplace smoothing, one per fold, each predicting 'big_growth' from the predictor variables 'PC1', 'PC2', and 'PC3'. Each model is trained on four folds and tested on the remaining fold, so 5-fold cross-validation estimates how the naive Bayes model will behave on unseen data.

In [6]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on big_growth).
# Present the first few observation (row) numbers for each of the folds.
#
# You can use the str() function.

set.seed(0)
fold = createFolds(data$big_growth, k=5, list=TRUE)
str(fold)
data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

data.test.1 = data[fold$Fold1,]
data.test.2 = data[fold$Fold2,]
data.test.3 = data[fold$Fold3,]
data.test.4 = data[fold$Fold4,]
data.test.5 = data[fold$Fold5,]


List of 5
 $ Fold1: int [1:861] 9 13 17 19 31 42 44 54 60 66 ...
 $ Fold2: int [1:861] 1 2 6 11 16 25 32 49 55 59 ...
 $ Fold3: int [1:861] 4 8 14 22 28 34 40 45 50 52 ...
 $ Fold4: int [1:861] 3 5 15 18 21 24 26 27 30 36 ...
 $ Fold5: int [1:861] 7 10 12 20 23 29 33 35 37 46 ...
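
`createFolds` partitions the row numbers so that each fold keeps roughly the same class balance as the full data. The base-R sketch below illustrates the idea on a toy label vector. It is not caret's implementation, and `make_stratified_folds` and the toy labels are hypothetical. It also shows how one fold serves as the test set while the remaining rows form the training set.

```r
# Assign rows to k folds class by class, so every fold gets a similar
# share of each class.
make_stratified_folds <- function(y, k = 5) {
  labels <- as.character(y)
  fold_id <- integer(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    idx <- idx[sample.int(length(idx))]                # shuffle rows of this class
    fold_id[idx] <- rep_len(seq_len(k), length(idx))   # deal them out round-robin
  }
  folds <- split(seq_along(labels), fold_id)
  names(folds) <- paste0("Fold", seq_along(folds))
  folds
}

set.seed(0)
# Toy labels: 10 "YES" and 90 "NO".
y <- factor(c(rep("YES", 10), rep("NO", 90)), levels = c("YES", "NO"))
fold <- make_stratified_folds(y, k = 5)
str(fold)

# Fold 1 is the test set; every other row is used for training.
test_rows <- fold$Fold1
train_rows <- setdiff(seq_along(y), test_rows)
print(table(y[test_rows]))
```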


In [7]:
# Present the model's estimated accuracy at cutoff=0.5 and profit for each fold.
data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

data.test.1 = data[fold$Fold1,]
data.test.2 = data[fold$Fold2,]
data.test.3 = data[fold$Fold3,]
data.test.4 = data[fold$Fold4,]
data.test.5 = data[fold$Fold5,]

model.1 = naiveBayes(big_growth ~ PC1+PC2+PC3, data.train.1, laplace=TRUE)
prob.1 = predict(model.1, data.test.1, type="raw")
prediction.1 = as.class(prob.1, class="YES", cutoff=0.5)
CM.1 = confusionMatrix(prediction.1, data.test.1$big_growth)$table
cm.1 = CM.1 / sum(CM.1)

model.2 = naiveBayes(big_growth ~ PC1+PC2+PC3, data.train.2, laplace=TRUE)
prob.2 = predict(model.2, data.test.2, type="raw")
prediction.2 = as.class(prob.2, class="YES", cutoff=0.5)
CM.2 = confusionMatrix(prediction.2, data.test.2$big_growth)$table
cm.2 = CM.2 / sum(CM.2)

model.3 = naiveBayes(big_growth ~ PC1+PC2+PC3, data.train.3, laplace=TRUE)
prob.3 = predict(model.3, data.test.3, type="raw")
prediction.3 = as.class(prob.3, class="YES", cutoff=0.5)
CM.3 = confusionMatrix(prediction.3, data.test.3$big_growth)$table
cm.3 = CM.3 / sum(CM.3)

model.4 = naiveBayes(big_growth ~ PC1+PC2+PC3, data.train.4, laplace=TRUE)
prob.4 = predict(model.4, data.test.4, type="raw")
prediction.4 = as.class(prob.4, class="YES", cutoff=0.5)
CM.4 = confusionMatrix(prediction.4, data.test.4$big_growth)$table
cm.4 = CM.4 / sum(CM.4)

model.5 = naiveBayes(big_growth ~ PC1+PC2+PC3, data.train.5, laplace=TRUE)
prob.5 = predict(model.5, data.test.5, type="raw")
prediction.5 = as.class(prob.5, class="YES", cutoff=0.5)
CM.5 = confusionMatrix(prediction.5, data.test.5$big_growth)$table
cm.5 = CM.5 / sum(CM.5)

accuracy.1 = cm.1["YES","YES"]+cm.1["NO","NO"]
accuracy.2 = cm.2["YES","YES"]+cm.2["NO","NO"]
accuracy.3 = cm.3["YES","YES"]+cm.3["NO","NO"]
accuracy.4 = cm.4["YES","YES"]+cm.4["NO","NO"]
accuracy.5 = cm.5["YES","YES"]+cm.5["NO","NO"]

new.data.test.1 = cbind(prob.1, data.test.1)
new.data.test.1 = new.data.test.1[order(-new.data.test.1$YES),]
new.data.test.2 = cbind(prob.2, data.test.2)
new.data.test.2 = new.data.test.2[order(-new.data.test.2$YES),]
new.data.test.3 = cbind(prob.3, data.test.3)
new.data.test.3 = new.data.test.3[order(-new.data.test.3$YES),]
new.data.test.4 = cbind(prob.4, data.test.4)
new.data.test.4 = new.data.test.4[order(-new.data.test.4$YES),]
new.data.test.5 = cbind(prob.5, data.test.5)
new.data.test.5 = new.data.test.5[order(-new.data.test.5$YES),]

profit.1 = sum((1+new.data.test.1[1:12,]$growth) * allocation) - budget
profit.2 = sum((1+new.data.test.2[1:12,]$growth) * allocation) - budget
profit.3 = sum((1+new.data.test.3[1:12,]$growth) * allocation) - budget
profit.4 = sum((1+new.data.test.4[1:12,]$growth) * allocation) - budget
profit.5 = sum((1+new.data.test.5[1:12,]$growth) * allocation) - budget


data.frame(fold=1:5, accuracy=c(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5),
          profit=c(profit.1, profit.2, profit.3, profit.4, profit.5))

fold,accuracy,profit
1,0.232288,-144476.4
2,0.2229965,-114764.05
3,0.2334495,-22672.32
4,0.203252,4896.07
5,0.2020906,-119454.67
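
Each fold's profit comes from ranking the test companies by predicted probability of "YES", taking the top 12, and applying the business model. The sketch below wraps that step in a function and runs it on made-up values. `fold_profit` and the toy data are hypothetical, and the function assumes the allocation vector has one entry per selected company.

```r
# Profit from investing in the companies with the highest predicted
# probability of big growth, one allocation per selected company.
fold_profit <- function(prob_yes, growth, allocation) {
  n <- length(allocation)
  stopifnot(length(prob_yes) == length(growth), length(growth) >= n)
  top <- order(-prob_yes)[seq_len(n)]     # rows with the n highest probabilities
  sum((1 + growth[top]) * allocation) - sum(allocation)
}

budget <- 1000000
allocation <- rep(budget / 12, 12)

# Toy test fold of 20 companies: probabilities 1.00, 0.95, ..., 0.05.
prob_yes <- (20:1) / 20
# Hypothetical outcomes: the 12 top-ranked companies grow 20%, the rest lose 50%.
growth <- c(rep(0.20, 12), rep(-0.50, 8))

profit <- fold_profit(prob_yes, growth, allocation)
print(profit)
```

Only the ranking matters here: the profit depends on which 12 companies have the highest probabilities, not on how many of them exceed the cutoff.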


In [8]:
# Present the model's 5-fold cross-validation estimated accuracy at cutoff=0.5, profit, and profit rate.
accuracy.cv = sum(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5)/5
profit.cv =  sum(profit.1, profit.2, profit.3, profit.4, profit.5)/5
profit_rate.cv = sum(c(profit.1, profit.2, profit.3, profit.4, profit.5)/budget)/5
fmt(data.frame(accuracy.cv, profit.cv, profit_rate.cv), "5-Fold Cross-Validation Estimated Performance")

accuracy.cv,profit.cv,profit_rate.cv
0.2188153,-79294.28,-0.0792943


## Tune Classification Model

Below, 21 naive Bayes model configurations are evaluated: each of the 7 non-empty combinations of the predictor variables 'PC1', 'PC2', and 'PC3' is paired with each of 3 cutoff values (0.25, 0.33, and 0.50). Every model uses Laplace smoothing and predicts 'big_growth'. 5-fold cross-validation is used to estimate how each model will behave on unseen data. Because the portfolio is filled with the 12 companies ranked highest by predicted probability, the cutoff affects only the accuracy estimate, not the profit. The best model is taken to be the first model with the highest profit and profit rate.
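
The 21 configurations can also be enumerated programmatically rather than written out by hand. The sketch below builds the 7 predictor combinations with `combn`, turns each into a formula with `reformulate`, and crosses them with the 3 cutoffs. It is an illustrative alternative layout, not the code used in the cells that follow; the names `combos`, `formulas`, and `grid` are hypothetical.

```r
predictors <- c("PC1", "PC2", "PC3")
cutoffs <- c(0.25, 0.33, 0.50)

# All non-empty subsets of the predictors, each as a character vector.
combos <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# One formula per subset, e.g. big_growth ~ PC1 + PC3.
formulas <- lapply(combos, reformulate, response = "big_growth")

# One row per (cutoff, variables) pair.
grid <- expand.grid(cutoff = cutoffs,
                    variables = sapply(combos, paste, collapse = "+"),
                    stringsAsFactors = FALSE)

print(nrow(grid))
print(formulas[[7]])
```

Each formula could then be passed to `naiveBayes(...)` inside a loop over folds, and recording `variables` alongside `cutoff` would let each row of the results table identify its model.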

In [9]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on big_growth).

# Construct several naive Bayes models to predict big_growth (use laplace=TRUE).
# Iterate through unique combinations of predictor variables, selected from PC1, PC2, and PC3.
# Iterate through cutoff values, selected from 0.25, 0.33, and 0.50.

# Estimate each model's accuracy and profit, using 5-fold cross validation.

# Present the best model: selected variables, selected cutoff, accuracy, profit, and profit rate.
# Present all the models: selected variables, selected cutoff, accuracy, profit, and profit rate.

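# Start the results table with a placeholder row of zeros; each model's results are appended below with rbind().
models = data.frame(cutoff=0.0, accuracy.cv=0, profit.cv=0, profit_rate.cv=0.0)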

for (cutoff in c(0.25, 0.33, 0.50)) {
    set.seed(0)
    fold = createFolds(data$big_growth, k=5, list=TRUE)
    data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
    data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
    data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
    data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
    data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

    data.test.1 = data[fold$Fold1,]
    data.test.2 = data[fold$Fold2,]
    data.test.3 = data[fold$Fold3,]
    data.test.4 = data[fold$Fold4,]
    data.test.5 = data[fold$Fold5,]

    model.1 = naiveBayes(big_growth ~ PC1 , data.train.1, laplace=TRUE)
    prob.1 = predict(model.1, data.test.1, type="raw")
    prediction.1 = as.class(prob.1, class="YES", cutoff=cutoff)
    CM.1 = confusionMatrix(prediction.1, data.test.1$big_growth)$table
    cm.1 = CM.1 / sum(CM.1)

    model.2 = naiveBayes(big_growth ~ PC1 , data.train.2, laplace=TRUE)
    prob.2 = predict(model.2, data.test.2, type="raw")
    prediction.2 = as.class(prob.2, class="YES", cutoff=cutoff)
    CM.2 = confusionMatrix(prediction.2, data.test.2$big_growth)$table
    cm.2 = CM.2 / sum(CM.2)

    model.3 = naiveBayes(big_growth ~ PC1, data.train.3, laplace=TRUE)
    prob.3 = predict(model.3, data.test.3, type="raw")
    prediction.3 = as.class(prob.3, class="YES", cutoff=cutoff)
    CM.3 = confusionMatrix(prediction.3, data.test.3$big_growth)$table
    cm.3 = CM.3 / sum(CM.3)

    model.4 = naiveBayes(big_growth ~ PC1, data.train.4, laplace=TRUE)
    prob.4 = predict(model.4, data.test.4, type="raw")
    prediction.4 = as.class(prob.4, class="YES", cutoff=cutoff)
    CM.4 = confusionMatrix(prediction.4, data.test.4$big_growth)$table
    cm.4 = CM.4 / sum(CM.4)

    model.5 = naiveBayes(big_growth ~ PC1, data.train.5, laplace=TRUE)
    prob.5 = predict(model.5, data.test.5, type="raw")
    prediction.5 = as.class(prob.5, class="YES", cutoff=cutoff)
    CM.5 = confusionMatrix(prediction.5, data.test.5$big_growth)$table
    cm.5 = CM.5 / sum(CM.5)

    accuracy.1 = cm.1["YES","YES"]+cm.1["NO","NO"]
    accuracy.2 = cm.2["YES","YES"]+cm.2["NO","NO"]
    accuracy.3 = cm.3["YES","YES"]+cm.3["NO","NO"]
    accuracy.4 = cm.4["YES","YES"]+cm.4["NO","NO"]
    accuracy.5 = cm.5["YES","YES"]+cm.5["NO","NO"]

    new.data.test.1 = cbind(prob.1, data.test.1)
    new.data.test.1 = new.data.test.1[order(-new.data.test.1$YES),]
    new.data.test.2 = cbind(prob.2, data.test.2)
    new.data.test.2 = new.data.test.2[order(-new.data.test.2$YES),]
    new.data.test.3 = cbind(prob.3, data.test.3)
    new.data.test.3 = new.data.test.3[order(-new.data.test.3$YES),]
    new.data.test.4 = cbind(prob.4, data.test.4)
    new.data.test.4 = new.data.test.4[order(-new.data.test.4$YES),]
    new.data.test.5 = cbind(prob.5, data.test.5)
    new.data.test.5 = new.data.test.5[order(-new.data.test.5$YES),]

    profit.1 = sum((1+new.data.test.1[1:12,]$growth) * allocation) - budget
    profit.2 = sum((1+new.data.test.2[1:12,]$growth) * allocation) - budget
    profit.3 = sum((1+new.data.test.3[1:12,]$growth) * allocation) - budget
    profit.4 = sum((1+new.data.test.4[1:12,]$growth) * allocation) - budget
    profit.5 = sum((1+new.data.test.5[1:12,]$growth) * allocation) - budget
    
    accuracy.cv = sum(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5)/5
    profit.cv =  sum(profit.1, profit.2, profit.3, profit.4, profit.5)/5
    profit_rate.cv = sum(c(profit.1, profit.2, profit.3, profit.4, profit.5)/budget)/5
    curr.model = data.frame(cutoff, accuracy.cv, profit.cv, profit_rate.cv)
    models = rbind(models, curr.model)
  
}

for (cutoff in c(0.25, 0.33, 0.50)) {
    set.seed(0)
    fold = createFolds(data$big_growth, k=5, list=TRUE)
    data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
    data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
    data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
    data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
    data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

    data.test.1 = data[fold$Fold1,]
    data.test.2 = data[fold$Fold2,]
    data.test.3 = data[fold$Fold3,]
    data.test.4 = data[fold$Fold4,]
    data.test.5 = data[fold$Fold5,]

    model.1 = naiveBayes(big_growth ~ PC2 , data.train.1, laplace=TRUE)
    prob.1 = predict(model.1, data.test.1, type="raw")
    prediction.1 = as.class(prob.1, class="YES", cutoff=cutoff)
    CM.1 = confusionMatrix(prediction.1, data.test.1$big_growth)$table
    cm.1 = CM.1 / sum(CM.1)

    model.2 = naiveBayes(big_growth ~ PC2 , data.train.2, laplace=TRUE)
    prob.2 = predict(model.2, data.test.2, type="raw")
    prediction.2 = as.class(prob.2, class="YES", cutoff=cutoff)
    CM.2 = confusionMatrix(prediction.2, data.test.2$big_growth)$table
    cm.2 = CM.2 / sum(CM.2)

    model.3 = naiveBayes(big_growth ~ PC2, data.train.3, laplace=TRUE)
    prob.3 = predict(model.3, data.test.3, type="raw")
    prediction.3 = as.class(prob.3, class="YES", cutoff=cutoff)
    CM.3 = confusionMatrix(prediction.3, data.test.3$big_growth)$table
    cm.3 = CM.3 / sum(CM.3)

    model.4 = naiveBayes(big_growth ~ PC2, data.train.4, laplace=TRUE)
    prob.4 = predict(model.4, data.test.4, type="raw")
    prediction.4 = as.class(prob.4, class="YES", cutoff=cutoff)
    CM.4 = confusionMatrix(prediction.4, data.test.4$big_growth)$table
    cm.4 = CM.4 / sum(CM.4)

    model.5 = naiveBayes(big_growth ~ PC2, data.train.5, laplace=TRUE)
    prob.5 = predict(model.5, data.test.5, type="raw")
    prediction.5 = as.class(prob.5, class="YES", cutoff=cutoff)
    CM.5 = confusionMatrix(prediction.5, data.test.5$big_growth)$table
    cm.5 = CM.5 / sum(CM.5)

    accuracy.1 = cm.1["YES","YES"]+cm.1["NO","NO"]
    accuracy.2 = cm.2["YES","YES"]+cm.2["NO","NO"]
    accuracy.3 = cm.3["YES","YES"]+cm.3["NO","NO"]
    accuracy.4 = cm.4["YES","YES"]+cm.4["NO","NO"]
    accuracy.5 = cm.5["YES","YES"]+cm.5["NO","NO"]

    new.data.test.1 = cbind(prob.1, data.test.1)
    new.data.test.1 = new.data.test.1[order(-new.data.test.1$YES),]
    new.data.test.2 = cbind(prob.2, data.test.2)
    new.data.test.2 = new.data.test.2[order(-new.data.test.2$YES),]
    new.data.test.3 = cbind(prob.3, data.test.3)
    new.data.test.3 = new.data.test.3[order(-new.data.test.3$YES),]
    new.data.test.4 = cbind(prob.4, data.test.4)
    new.data.test.4 = new.data.test.4[order(-new.data.test.4$YES),]
    new.data.test.5 = cbind(prob.5, data.test.5)
    new.data.test.5 = new.data.test.5[order(-new.data.test.5$YES),]

    profit.1 = sum((1+new.data.test.1[1:12,]$growth) * allocation) - budget
    profit.2 = sum((1+new.data.test.2[1:12,]$growth) * allocation) - budget
    profit.3 = sum((1+new.data.test.3[1:12,]$growth) * allocation) - budget
    profit.4 = sum((1+new.data.test.4[1:12,]$growth) * allocation) - budget
    profit.5 = sum((1+new.data.test.5[1:12,]$growth) * allocation) - budget
    
    accuracy.cv = sum(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5)/5
    profit.cv =  sum(profit.1, profit.2, profit.3, profit.4, profit.5)/5
    profit_rate.cv = sum(c(profit.1, profit.2, profit.3, profit.4, profit.5)/budget)/5
    curr.model = data.frame(cutoff, accuracy.cv, profit.cv, profit_rate.cv)
    models = rbind(models, curr.model)
  
}
for (cutoff in c(0.25, 0.33, 0.50)) {
    set.seed(0)
    fold = createFolds(data$big_growth, k=5, list=TRUE)
    data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
    data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
    data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
    data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
    data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

    data.test.1 = data[fold$Fold1,]
    data.test.2 = data[fold$Fold2,]
    data.test.3 = data[fold$Fold3,]
    data.test.4 = data[fold$Fold4,]
    data.test.5 = data[fold$Fold5,]

    model.1 = naiveBayes(big_growth ~ PC3 , data.train.1, laplace=TRUE)
    prob.1 = predict(model.1, data.test.1, type="raw")
    prediction.1 = as.class(prob.1, class="YES", cutoff=cutoff)
    CM.1 = confusionMatrix(prediction.1, data.test.1$big_growth)$table
    cm.1 = CM.1 / sum(CM.1)

    model.2 = naiveBayes(big_growth ~ PC3 , data.train.2, laplace=TRUE)
    prob.2 = predict(model.2, data.test.2, type="raw")
    prediction.2 = as.class(prob.2, class="YES", cutoff=cutoff)
    CM.2 = confusionMatrix(prediction.2, data.test.2$big_growth)$table
    cm.2 = CM.2 / sum(CM.2)

    model.3 = naiveBayes(big_growth ~ PC3, data.train.3, laplace=TRUE)
    prob.3 = predict(model.3, data.test.3, type="raw")
    prediction.3 = as.class(prob.3, class="YES", cutoff=cutoff)
    CM.3 = confusionMatrix(prediction.3, data.test.3$big_growth)$table
    cm.3 = CM.3 / sum(CM.3)

    model.4 = naiveBayes(big_growth ~ PC3, data.train.4, laplace=TRUE)
    prob.4 = predict(model.4, data.test.4, type="raw")
    prediction.4 = as.class(prob.4, class="YES", cutoff=cutoff)
    CM.4 = confusionMatrix(prediction.4, data.test.4$big_growth)$table
    cm.4 = CM.4 / sum(CM.4)

    model.5 = naiveBayes(big_growth ~ PC3, data.train.5, laplace=TRUE)
    prob.5 = predict(model.5, data.test.5, type="raw")
    prediction.5 = as.class(prob.5, class="YES", cutoff=cutoff)
    CM.5 = confusionMatrix(prediction.5, data.test.5$big_growth)$table
    cm.5 = CM.5 / sum(CM.5)

    accuracy.1 = cm.1["YES","YES"]+cm.1["NO","NO"]
    accuracy.2 = cm.2["YES","YES"]+cm.2["NO","NO"]
    accuracy.3 = cm.3["YES","YES"]+cm.3["NO","NO"]
    accuracy.4 = cm.4["YES","YES"]+cm.4["NO","NO"]
    accuracy.5 = cm.5["YES","YES"]+cm.5["NO","NO"]

    new.data.test.1 = cbind(prob.1, data.test.1)
    new.data.test.1 = new.data.test.1[order(-new.data.test.1$YES),]
    new.data.test.2 = cbind(prob.2, data.test.2)
    new.data.test.2 = new.data.test.2[order(-new.data.test.2$YES),]
    new.data.test.3 = cbind(prob.3, data.test.3)
    new.data.test.3 = new.data.test.3[order(-new.data.test.3$YES),]
    new.data.test.4 = cbind(prob.4, data.test.4)
    new.data.test.4 = new.data.test.4[order(-new.data.test.4$YES),]
    new.data.test.5 = cbind(prob.5, data.test.5)
    new.data.test.5 = new.data.test.5[order(-new.data.test.5$YES),]

    profit.1 = sum((1+new.data.test.1[1:12,]$growth) * allocation) - budget
    profit.2 = sum((1+new.data.test.2[1:12,]$growth) * allocation) - budget
    profit.3 = sum((1+new.data.test.3[1:12,]$growth) * allocation) - budget
    profit.4 = sum((1+new.data.test.4[1:12,]$growth) * allocation) - budget
    profit.5 = sum((1+new.data.test.5[1:12,]$growth) * allocation) - budget
    
    accuracy.cv = sum(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5)/5
    profit.cv =  sum(profit.1, profit.2, profit.3, profit.4, profit.5)/5
    profit_rate.cv = sum(c(profit.1, profit.2, profit.3, profit.4, profit.5)/budget)/5
    curr.model = data.frame(cutoff, accuracy.cv, profit.cv, profit_rate.cv)
    models = rbind(models, curr.model)
  
}

In [10]:
for (cutoff in c(0.25, 0.33, 0.50)) {
    set.seed(0)
    fold = createFolds(data$big_growth, k=5, list=TRUE)
    data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
    data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
    data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
    data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
    data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

    data.test.1 = data[fold$Fold1,]
    data.test.2 = data[fold$Fold2,]
    data.test.3 = data[fold$Fold3,]
    data.test.4 = data[fold$Fold4,]
    data.test.5 = data[fold$Fold5,]

    model.1 = naiveBayes(big_growth ~ PC1+PC2 , data.train.1, laplace=TRUE)
    prob.1 = predict(model.1, data.test.1, type="raw")
    prediction.1 = as.class(prob.1, class="YES", cutoff=cutoff)
    CM.1 = confusionMatrix(prediction.1, data.test.1$big_growth)$table
    cm.1 = CM.1 / sum(CM.1)

    model.2 = naiveBayes(big_growth ~ PC1+PC2 , data.train.2, laplace=TRUE)
    prob.2 = predict(model.2, data.test.2, type="raw")
    prediction.2 = as.class(prob.2, class="YES", cutoff=cutoff)
    CM.2 = confusionMatrix(prediction.2, data.test.2$big_growth)$table
    cm.2 = CM.2 / sum(CM.2)

    model.3 = naiveBayes(big_growth ~ PC1+PC2, data.train.3, laplace=TRUE)
    prob.3 = predict(model.3, data.test.3, type="raw")
    prediction.3 = as.class(prob.3, class="YES", cutoff=cutoff)
    CM.3 = confusionMatrix(prediction.3, data.test.3$big_growth)$table
    cm.3 = CM.3 / sum(CM.3)

    model.4 = naiveBayes(big_growth ~ PC1+PC2, data.train.4, laplace=TRUE)
    prob.4 = predict(model.4, data.test.4, type="raw")
    prediction.4 = as.class(prob.4, class="YES", cutoff=cutoff)
    CM.4 = confusionMatrix(prediction.4, data.test.4$big_growth)$table
    cm.4 = CM.4 / sum(CM.4)

    model.5 = naiveBayes(big_growth ~ PC1+PC2, data.train.5, laplace=TRUE)
    prob.5 = predict(model.5, data.test.5, type="raw")
    prediction.5 = as.class(prob.5, class="YES", cutoff=cutoff)
    CM.5 = confusionMatrix(prediction.5, data.test.5$big_growth)$table
    cm.5 = CM.5 / sum(CM.5)

    accuracy.1 = cm.1["YES","YES"]+cm.1["NO","NO"]
    accuracy.2 = cm.2["YES","YES"]+cm.2["NO","NO"]
    accuracy.3 = cm.3["YES","YES"]+cm.3["NO","NO"]
    accuracy.4 = cm.4["YES","YES"]+cm.4["NO","NO"]
    accuracy.5 = cm.5["YES","YES"]+cm.5["NO","NO"]

    new.data.test.1 = cbind(prob.1, data.test.1)
    new.data.test.1 = new.data.test.1[order(-new.data.test.1$YES),]
    new.data.test.2 = cbind(prob.2, data.test.2)
    new.data.test.2 = new.data.test.2[order(-new.data.test.2$YES),]
    new.data.test.3 = cbind(prob.3, data.test.3)
    new.data.test.3 = new.data.test.3[order(-new.data.test.3$YES),]
    new.data.test.4 = cbind(prob.4, data.test.4)
    new.data.test.4 = new.data.test.4[order(-new.data.test.4$YES),]
    new.data.test.5 = cbind(prob.5, data.test.5)
    new.data.test.5 = new.data.test.5[order(-new.data.test.5$YES),]

    profit.1 = sum((1+new.data.test.1[1:12,]$growth) * allocation) - budget
    profit.2 = sum((1+new.data.test.2[1:12,]$growth) * allocation) - budget
    profit.3 = sum((1+new.data.test.3[1:12,]$growth) * allocation) - budget
    profit.4 = sum((1+new.data.test.4[1:12,]$growth) * allocation) - budget
    profit.5 = sum((1+new.data.test.5[1:12,]$growth) * allocation) - budget
    
    accuracy.cv = sum(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5)/5
    profit.cv =  sum(profit.1, profit.2, profit.3, profit.4, profit.5)/5
    profit_rate.cv = sum(c(profit.1, profit.2, profit.3, profit.4, profit.5)/budget)/5
    curr.model = data.frame(cutoff, accuracy.cv, profit.cv, profit_rate.cv)
    models = rbind(models, curr.model)
  
}

for (cutoff in c(0.25, 0.33, 0.50)) {
    set.seed(0)
    fold = createFolds(data$big_growth, k=5, list=TRUE)
    data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
    data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
    data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
    data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
    data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

    data.test.1 = data[fold$Fold1,]
    data.test.2 = data[fold$Fold2,]
    data.test.3 = data[fold$Fold3,]
    data.test.4 = data[fold$Fold4,]
    data.test.5 = data[fold$Fold5,]

    model.1 = naiveBayes(big_growth ~ PC1+PC3 , data.train.1, laplace=TRUE)
    prob.1 = predict(model.1, data.test.1, type="raw")
    prediction.1 = as.class(prob.1, class="YES", cutoff=cutoff)
    CM.1 = confusionMatrix(prediction.1, data.test.1$big_growth)$table
    cm.1 = CM.1 / sum(CM.1)

    model.2 = naiveBayes(big_growth ~ PC1+PC3 , data.train.2, laplace=TRUE)
    prob.2 = predict(model.2, data.test.2, type="raw")
    prediction.2 = as.class(prob.2, class="YES", cutoff=cutoff)
    CM.2 = confusionMatrix(prediction.2, data.test.2$big_growth)$table
    cm.2 = CM.2 / sum(CM.2)

    model.3 = naiveBayes(big_growth ~ PC1+PC3, data.train.3, laplace=TRUE)
    prob.3 = predict(model.3, data.test.3, type="raw")
    prediction.3 = as.class(prob.3, class="YES", cutoff=cutoff)
    CM.3 = confusionMatrix(prediction.3, data.test.3$big_growth)$table
    cm.3 = CM.3 / sum(CM.3)

    model.4 = naiveBayes(big_growth ~ PC1+PC3, data.train.4, laplace=TRUE)
    prob.4 = predict(model.4, data.test.4, type="raw")
    prediction.4 = as.class(prob.4, class="YES", cutoff=cutoff)
    CM.4 = confusionMatrix(prediction.4, data.test.4$big_growth)$table
    cm.4 = CM.4 / sum(CM.4)

    model.5 = naiveBayes(big_growth ~ PC1+PC3, data.train.5, laplace=TRUE)
    prob.5 = predict(model.5, data.test.5, type="raw")
    prediction.5 = as.class(prob.5, class="YES", cutoff=cutoff)
    CM.5 = confusionMatrix(prediction.5, data.test.5$big_growth)$table
    cm.5 = CM.5 / sum(CM.5)

    accuracy.1 = cm.1["YES","YES"]+cm.1["NO","NO"]
    accuracy.2 = cm.2["YES","YES"]+cm.2["NO","NO"]
    accuracy.3 = cm.3["YES","YES"]+cm.3["NO","NO"]
    accuracy.4 = cm.4["YES","YES"]+cm.4["NO","NO"]
    accuracy.5 = cm.5["YES","YES"]+cm.5["NO","NO"]

    new.data.test.1 = cbind(prob.1, data.test.1)
    new.data.test.1 = new.data.test.1[order(-new.data.test.1$YES),]
    new.data.test.2 = cbind(prob.2, data.test.2)
    new.data.test.2 = new.data.test.2[order(-new.data.test.2$YES),]
    new.data.test.3 = cbind(prob.3, data.test.3)
    new.data.test.3 = new.data.test.3[order(-new.data.test.3$YES),]
    new.data.test.4 = cbind(prob.4, data.test.4)
    new.data.test.4 = new.data.test.4[order(-new.data.test.4$YES),]
    new.data.test.5 = cbind(prob.5, data.test.5)
    new.data.test.5 = new.data.test.5[order(-new.data.test.5$YES),]

    profit.1 = sum((1+new.data.test.1[1:12,]$growth) * allocation) - budget
    profit.2 = sum((1+new.data.test.2[1:12,]$growth) * allocation) - budget
    profit.3 = sum((1+new.data.test.3[1:12,]$growth) * allocation) - budget
    profit.4 = sum((1+new.data.test.4[1:12,]$growth) * allocation) - budget
    profit.5 = sum((1+new.data.test.5[1:12,]$growth) * allocation) - budget
    
    accuracy.cv = sum(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5)/5
    profit.cv =  sum(profit.1, profit.2, profit.3, profit.4, profit.5)/5
    profit_rate.cv = sum(c(profit.1, profit.2, profit.3, profit.4, profit.5)/budget)/5
    curr.model = data.frame(cutoff, accuracy.cv, profit.cv, profit_rate.cv)
    models = rbind(models, curr.model)
  
}
for (cutoff in c(0.25, 0.33, 0.50)) {
    set.seed(0)
    fold = createFolds(data$big_growth, k=5, list=TRUE)
    data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
    data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
    data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
    data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
    data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

    data.test.1 = data[fold$Fold1,]
    data.test.2 = data[fold$Fold2,]
    data.test.3 = data[fold$Fold3,]
    data.test.4 = data[fold$Fold4,]
    data.test.5 = data[fold$Fold5,]

    model.1 = naiveBayes(big_growth ~ PC2+PC3 , data.train.1, laplace=TRUE)
    prob.1 = predict(model.1, data.test.1, type="raw")
    prediction.1 = as.class(prob.1, class="YES", cutoff=cutoff)
    CM.1 = confusionMatrix(prediction.1, data.test.1$big_growth)$table
    cm.1 = CM.1 / sum(CM.1)

    model.2 = naiveBayes(big_growth ~ PC2+PC3 , data.train.2, laplace=TRUE)
    prob.2 = predict(model.2, data.test.2, type="raw")
    prediction.2 = as.class(prob.2, class="YES", cutoff=cutoff)
    CM.2 = confusionMatrix(prediction.2, data.test.2$big_growth)$table
    cm.2 = CM.2 / sum(CM.2)

    model.3 = naiveBayes(big_growth ~ PC2+PC3, data.train.3, laplace=TRUE)
    prob.3 = predict(model.3, data.test.3, type="raw")
    prediction.3 = as.class(prob.3, class="YES", cutoff=cutoff)
    CM.3 = confusionMatrix(prediction.3, data.test.3$big_growth)$table
    cm.3 = CM.3 / sum(CM.3)

    model.4 = naiveBayes(big_growth ~ PC2+PC3, data.train.4, laplace=TRUE)
    prob.4 = predict(model.4, data.test.4, type="raw")
    prediction.4 = as.class(prob.4, class="YES", cutoff=cutoff)
    CM.4 = confusionMatrix(prediction.4, data.test.4$big_growth)$table
    cm.4 = CM.4 / sum(CM.4)

    model.5 = naiveBayes(big_growth ~ PC2+PC3, data.train.5, laplace=TRUE)
    prob.5 = predict(model.5, data.test.5, type="raw")
    prediction.5 = as.class(prob.5, class="YES", cutoff=cutoff)
    CM.5 = confusionMatrix(prediction.5, data.test.5$big_growth)$table
    cm.5 = CM.5 / sum(CM.5)

    accuracy.1 = cm.1["YES","YES"]+cm.1["NO","NO"]
    accuracy.2 = cm.2["YES","YES"]+cm.2["NO","NO"]
    accuracy.3 = cm.3["YES","YES"]+cm.3["NO","NO"]
    accuracy.4 = cm.4["YES","YES"]+cm.4["NO","NO"]
    accuracy.5 = cm.5["YES","YES"]+cm.5["NO","NO"]

    new.data.test.1 = cbind(prob.1, data.test.1)
    new.data.test.1 = new.data.test.1[order(-new.data.test.1$YES),]
    new.data.test.2 = cbind(prob.2, data.test.2)
    new.data.test.2 = new.data.test.2[order(-new.data.test.2$YES),]
    new.data.test.3 = cbind(prob.3, data.test.3)
    new.data.test.3 = new.data.test.3[order(-new.data.test.3$YES),]
    new.data.test.4 = cbind(prob.4, data.test.4)
    new.data.test.4 = new.data.test.4[order(-new.data.test.4$YES),]
    new.data.test.5 = cbind(prob.5, data.test.5)
    new.data.test.5 = new.data.test.5[order(-new.data.test.5$YES),]

    # Take the top portfolio_size companies by predicted probability (rows are already sorted by -YES).
    profit.1 = sum((1+new.data.test.1[1:portfolio_size,]$growth) * allocation) - budget
    profit.2 = sum((1+new.data.test.2[1:portfolio_size,]$growth) * allocation) - budget
    profit.3 = sum((1+new.data.test.3[1:portfolio_size,]$growth) * allocation) - budget
    profit.4 = sum((1+new.data.test.4[1:portfolio_size,]$growth) * allocation) - budget
    profit.5 = sum((1+new.data.test.5[1:portfolio_size,]$growth) * allocation) - budget
    
    accuracy.cv = sum(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5)/5
    profit.cv =  sum(profit.1, profit.2, profit.3, profit.4, profit.5)/5
    profit_rate.cv = sum(c(profit.1, profit.2, profit.3, profit.4, profit.5)/budget)/5
    curr.model = data.frame(cutoff, accuracy.cv, profit.cv, profit_rate.cv)
    models = rbind(models, curr.model)
  
}
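
The cells in this search differ only in the model formula and the cutoff. As a minimal sketch, the same grid could be enumerated programmatically with base R rather than written out one cell per formula. The names `predictors`, `cutoffs`, `subsets`, `formulas`, and `grid` are hypothetical and do not appear elsewhere in this notebook.

```r
# Sketch only: enumerate every predictor subset and cutoff in the search.
# All object names below are illustrative assumptions, not part of the notebook.
predictors = c("PC1", "PC2", "PC3")
cutoffs = c(0.25, 0.33, 0.50)

# All non-empty subsets of the predictors, smallest subsets first.
subsets = unlist(lapply(seq_along(predictors),
                        function(m) combn(predictors, m, simplify=FALSE)),
                 recursive=FALSE)

# One model formula per subset, e.g. big_growth ~ PC2 + PC3.
formulas = lapply(subsets, reformulate, response="big_growth")

# One row per (subset, cutoff) pair; cutoff varies fastest, so rows come out
# grouped by subset with the three cutoffs listed within each group.
grid = expand.grid(cutoff=cutoffs, subset=seq_along(subsets))
grid$variables = sapply(subsets[grid$subset], paste, collapse=", ")

print(grid[, c("variables", "cutoff")])
```

Each element of `formulas` could then be passed to `naiveBayes` in place of a hand-written formula, with one result row appended per row of `grid`.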



In [11]:
for (cutoff in c(0.25, 0.33, 0.50)) {
    set.seed(0)
    fold = createFolds(data$big_growth, k=5, list=TRUE)
    data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
    data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
    data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
    data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
    data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

    data.test.1 = data[fold$Fold1,]
    data.test.2 = data[fold$Fold2,]
    data.test.3 = data[fold$Fold3,]
    data.test.4 = data[fold$Fold4,]
    data.test.5 = data[fold$Fold5,]

    model.1 = naiveBayes(big_growth ~ PC1+PC2+PC3 , data.train.1, laplace=TRUE)
    prob.1 = predict(model.1, data.test.1, type="raw")
    prediction.1 = as.class(prob.1, class="YES", cutoff=cutoff)
    CM.1 = confusionMatrix(prediction.1, data.test.1$big_growth)$table
    cm.1 = CM.1 / sum(CM.1)

    model.2 = naiveBayes(big_growth ~ PC1+PC2+PC3  , data.train.2, laplace=TRUE)
    prob.2 = predict(model.2, data.test.2, type="raw")
    prediction.2 = as.class(prob.2, class="YES", cutoff=cutoff)
    CM.2 = confusionMatrix(prediction.2, data.test.2$big_growth)$table
    cm.2 = CM.2 / sum(CM.2)

    model.3 = naiveBayes(big_growth ~ PC1+PC2+PC3 , data.train.3, laplace=TRUE)
    prob.3 = predict(model.3, data.test.3, type="raw")
    prediction.3 = as.class(prob.3, class="YES", cutoff=cutoff)
    CM.3 = confusionMatrix(prediction.3, data.test.3$big_growth)$table
    cm.3 = CM.3 / sum(CM.3)

    model.4 = naiveBayes(big_growth ~ PC1+PC2+PC3, data.train.4, laplace=TRUE)
    prob.4 = predict(model.4, data.test.4, type="raw")
    prediction.4 = as.class(prob.4, class="YES", cutoff=cutoff)
    CM.4 = confusionMatrix(prediction.4, data.test.4$big_growth)$table
    cm.4 = CM.4 / sum(CM.4)

    model.5 = naiveBayes(big_growth ~ PC1+PC2+PC3, data.train.5, laplace=TRUE)
    prob.5 = predict(model.5, data.test.5, type="raw")
    prediction.5 = as.class(prob.5, class="YES", cutoff=cutoff)
    CM.5 = confusionMatrix(prediction.5, data.test.5$big_growth)$table
    cm.5 = CM.5 / sum(CM.5)

    accuracy.1 = cm.1["YES","YES"]+cm.1["NO","NO"]
    accuracy.2 = cm.2["YES","YES"]+cm.2["NO","NO"]
    accuracy.3 = cm.3["YES","YES"]+cm.3["NO","NO"]
    accuracy.4 = cm.4["YES","YES"]+cm.4["NO","NO"]
    accuracy.5 = cm.5["YES","YES"]+cm.5["NO","NO"]

    new.data.test.1 = cbind(prob.1, data.test.1)
    new.data.test.1 = new.data.test.1[order(-new.data.test.1$YES),]
    new.data.test.2 = cbind(prob.2, data.test.2)
    new.data.test.2 = new.data.test.2[order(-new.data.test.2$YES),]
    new.data.test.3 = cbind(prob.3, data.test.3)
    new.data.test.3 = new.data.test.3[order(-new.data.test.3$YES),]
    new.data.test.4 = cbind(prob.4, data.test.4)
    new.data.test.4 = new.data.test.4[order(-new.data.test.4$YES),]
    new.data.test.5 = cbind(prob.5, data.test.5)
    new.data.test.5 = new.data.test.5[order(-new.data.test.5$YES),]

    # Take the top portfolio_size companies by predicted probability (rows are already sorted by -YES).
    profit.1 = sum((1+new.data.test.1[1:portfolio_size,]$growth) * allocation) - budget
    profit.2 = sum((1+new.data.test.2[1:portfolio_size,]$growth) * allocation) - budget
    profit.3 = sum((1+new.data.test.3[1:portfolio_size,]$growth) * allocation) - budget
    profit.4 = sum((1+new.data.test.4[1:portfolio_size,]$growth) * allocation) - budget
    profit.5 = sum((1+new.data.test.5[1:portfolio_size,]$growth) * allocation) - budget
    
    accuracy.cv = sum(accuracy.1, accuracy.2, accuracy.3, accuracy.4, accuracy.5)/5
    profit.cv =  sum(profit.1, profit.2, profit.3, profit.4, profit.5)/5
    profit_rate.cv = sum(c(profit.1, profit.2, profit.3, profit.4, profit.5)/budget)/5
    curr.model = data.frame(cutoff, accuracy.cv, profit.cv, profit_rate.cv)
    models = rbind(models, curr.model)
  
}
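
The five per-fold blocks above repeat the same steps: split, fit, predict, score, rank, and compute profit. A minimal sketch of the same evaluation written as a loop over folds follows. It is not the notebook's implementation: `cv_evaluate`, `fit_predict`, `glm_fit_predict`, and `toy` are hypothetical names, the folds are assigned by plain random sampling rather than `createFolds` (which stratifies by class), a base-R logistic regression stands in for `naiveBayes` so the block runs without extra packages, and the `>=` comparison is an assumption about how `as.class` applies the cutoff.

```r
# Sketch only: k-fold evaluation as a loop instead of copy-pasted blocks.
cv_evaluate = function(data, k, fit_predict, cutoff, budget, portfolio_size) {
    allocation = rep(budget/portfolio_size, portfolio_size)
    # Assign each row to one of k folds at random (unstratified stand-in for createFolds).
    fold_id = sample(rep(1:k, length.out=nrow(data)))
    accuracy = numeric(k)
    profit = numeric(k)
    for (i in 1:k) {
        train = data[fold_id != i,]
        test  = data[fold_id == i,]
        prob_yes = fit_predict(train, test)              # P(big_growth == "YES") per test row
        prediction = ifelse(prob_yes >= cutoff, "YES", "NO")
        accuracy[i] = mean(prediction == test$big_growth)
        top = order(-prob_yes)[1:portfolio_size]         # highest-probability companies
        profit[i] = sum((1 + test$growth[top]) * allocation) - budget
    }
    data.frame(cutoff, accuracy.cv=mean(accuracy), profit.cv=mean(profit),
               profit_rate.cv=mean(profit)/budget)
}

# Made-up data with the same column names the notebook uses.
set.seed(0)
n = 200
toy = data.frame(PC1=rnorm(n), PC2=rnorm(n), PC3=rnorm(n))
toy$growth = 0.2*toy$PC3 + rnorm(n, sd=0.3)
toy$big_growth = factor(ifelse(toy$growth > 0.3, "YES", "NO"), levels=c("NO", "YES"))

# Stand-in classifier: logistic regression returning P(YES) for each test row.
glm_fit_predict = function(train, test) {
    model = glm(big_growth ~ PC3, data=train, family=binomial)
    predict(model, test, type="response")
}

result = cv_evaluate(toy, k=5, fit_predict=glm_fit_predict, cutoff=0.33,
                     budget=1000000, portfolio_size=12)
print(result)
```

To use naive Bayes instead, `fit_predict` could wrap `naiveBayes` and return the `YES` column of `predict(..., type="raw")`.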

In [12]:
bayes_models = cbind(method = rep("naive bayes", 22),

    # The first label ("omit") belongs to the first row of models, which is dropped below;
    # each remaining label is repeated once per cutoff (0.25, 0.33, 0.50).
    variables = c("omit", rep(c("PC1, big_growth", "PC2, big_growth", "PC3, big_growth",
                                "PC1, PC2, big_growth", "PC1, PC3, big_growth", "PC2, PC3, big_growth",
                                "PC1, PC2, PC3, big_growth"), each=3)), models)



bayes_models = bayes_models[2:22,]
best_model = bayes_models[order(-bayes_models$profit.cv),][1,]
fmt(best_model, row.names=FALSE, "best model")
fmt(bayes_models, row.names=FALSE, "search for best model")

method,variables,cutoff,accuracy.cv,profit.cv,profit_rate.cv
naive bayes,"PC3, big_growth",0.25,0.204878,47491.51,0.0474915


method,variables,cutoff,accuracy.cv,profit.cv,profit_rate.cv
naive bayes,"PC1, big_growth",0.25,0.2987224,-85297.43,-0.0852974
naive bayes,"PC1, big_growth",0.33,0.8034843,-85297.43,-0.0852974
naive bayes,"PC1, big_growth",0.5,0.9133566,-85297.43,-0.0852974
naive bayes,"PC2, big_growth",0.25,0.3554007,-146896.83,-0.1468968
naive bayes,"PC2, big_growth",0.33,0.7022067,-146896.83,-0.1468968
naive bayes,"PC2, big_growth",0.5,0.9156794,-146896.83,-0.1468968
naive bayes,"PC3, big_growth",0.25,0.204878,47491.51,0.0474915
naive bayes,"PC3, big_growth",0.33,0.7823461,47491.51,0.0474915
naive bayes,"PC3, big_growth",0.5,0.9121951,47491.51,0.0474915
naive bayes,"PC1, PC2, big_growth",0.25,0.2197445,-142450.55,-0.1424506
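
In the rows above, `profit.cv` is identical across the three cutoffs for a given variable set, while `accuracy.cv` changes. This is consistent with how the portfolio is chosen: companies are ranked by predicted probability and the top 12 are taken, so the cutoff affects only the class labels used for accuracy. A small sketch with made-up numbers illustrates the effect. All values and names here are hypothetical, and the `>=` comparison is an assumption about how `as.class` applies the cutoff.

```r
# Sketch only: ranking-based profit does not depend on the cutoff.
prob_yes = c(0.10, 0.45, 0.30, 0.28, 0.60, 0.05)    # made-up P(YES) for six companies
growth   = c(-0.20, 0.40, 0.10, 0.35, 0.50, -0.10)  # made-up 12-month growth
actual   = ifelse(growth > 0.30, "YES", "NO")

budget = 600
portfolio_size = 2
allocation = rep(budget/portfolio_size, portfolio_size)

results = do.call(rbind, lapply(c(0.25, 0.33, 0.50), function(cutoff) {
    prediction = ifelse(prob_yes >= cutoff, "YES", "NO")   # depends on cutoff
    accuracy = mean(prediction == actual)
    top = order(-prob_yes)[1:portfolio_size]               # does not depend on cutoff
    profit = sum((1 + growth[top]) * allocation) - budget
    data.frame(cutoff, accuracy, profit)
}))

print(results)
```

Under this selection rule, tuning the cutoff can change reported accuracy but cannot change which companies enter the portfolio.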


# <font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised December 28, 2020
</span>
</p>
</font>