# Project Part C: Classification

![](banner_project.jpg)

In [13]:
analyst = "Charlie Ellis" # Replace this with your name

In [14]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Construct and evaluate a classifier trained on a transformed dataset about public company fundamentals.  Later, use the classifier along with additional analysis to recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Construct a model to predict whether stock price will grow more than 30% over 12 months, given 12 months of past company fundamentals data, and using a machine learning model construction method and transformed data.

Evaluate the model's business performance based on a business model and business parameters.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $
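
As a quick check of the formulas above, here is a minimal sketch in base R using made-up growth figures for a three-company portfolio. The numbers and the `*_toy` names are illustrative assumptions, not project data. With equal allocations, $profit = budget \times mean(growth)$, so the profit rate reduces to the mean growth of the portfolio.

```r
# Toy example: hypothetical growth values, NOT taken from the project data.
budget_toy = 900000
growth_toy = c(0.10, -0.20, 0.40)   # assumed 12-month growth per company
allocation_toy = rep(budget_toy / length(growth_toy), length(growth_toy))  # equal split

# profit = sum((1 + growth_i) * allocation_i) - budget
profit_toy = sum((1 + growth_toy) * allocation_toy) - budget_toy
profit_rate_toy = profit_toy / budget_toy

# With equal allocations, the profit rate equals the mean growth.
all.equal(profit_rate_toy, mean(growth_toy))
```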

<br>

Business model parameters include ...

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \\$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio 

Fill the portfolio with companies with the lowest gvkey values from among those you predict to grow above 30%.  If you predict fewer than the portfolio size to grow above 30%, then fill the rest of the portfolio with the remaining companies with lowest gvkey values.
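
The fill rule above can be sketched as a small base R function. `pick_portfolio` is a hypothetical helper written for illustration, and it assumes a data frame with a `gvkey` column and a `predicted` column holding "YES" or "NO"; the toy rows are made up.

```r
# Sketch of the portfolio fill rule; `pick_portfolio` is a hypothetical helper.
# Assumes `d` has columns gvkey and predicted ("YES"/"NO").
pick_portfolio = function(d, size) {
    d = d[order(d$gvkey), ]                # lowest gvkey first
    yes = d[d$predicted == "YES", ]
    rest = d[d$predicted != "YES", ]
    picked = head(yes, size)               # up to `size` predicted-YES companies
    n_short = size - nrow(picked)
    if (n_short > 0) picked = rbind(picked, head(rest, n_short))  # fill the remainder
    picked
}

# Toy data (made up for illustration)
toy = data.frame(gvkey=c(1050, 1004, 1075, 1045, 1062),
                 predicted=c("YES", "NO", "NO", "YES", "NO"))
pick_portfolio(toy, size=3)$gvkey
```

Using `head()` rather than indexing with `[1:size, ]` avoids rows of NA when fewer than `size` companies are predicted YES.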

In [15]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

In this step, we load the data frame created in the last part of the previous project, where we chose the 5 predictor variables and 3 outcome variables to use for classification. Two of the predictors are the first and second principal components (PC1 and PC2) computed in that project. By construction, they capture more of the variance in the underlying data than any other pair of components, and they are the predictors used by the models below.

In [16]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)

# Present a few rows ...
data[1:6,]

gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,3.6561434,-0.7981915,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,2.9282228,-0.71042,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3488491,1.1389605,85.2,0.0002347969,NO


## Classification Model

In the model construction part of this project, we built naive Bayes classifiers that predict big_growth from PC1 and PC2, and we estimated their performance in three ways, each at cutoff=0.5:

* In-sample: train on the entire dataset, predict big_growth on that same dataset, and compare predicted to actual values in a confusion matrix. From this we calculated accuracy, plus the profit and profit rate of a 12-company portfolio.
* Out-of-sample: train on a random 75% of the dataset, predict the remaining 25%, and calculate accuracy, profit, and profit rate on that held-out portion.
* 5-fold cross-validation: divide the data into 5 folds; for each fold, train on the other four folds, predict the held-out fold, and calculate accuracy and profit; then average across folds.

The cross-validation estimate gave the highest accuracy of the three (0.418, versus 0.308 in-sample and 0.299 out-of-sample). That average is driven by fold 2, whose accuracy of 0.916 is far above the 0.27 to 0.31 of the other four folds. Estimated profit was negative under all three methods.

### Build Model

In [17]:
# Construct a naive Bayes model to predict big_growth given PC1 and PC2 (use laplace=TRUE).
# Present a brief summary of the model parameters.

model = naiveBayes(big_growth ~ PC1+PC2, data, laplace=TRUE)
model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
        NO        YES 
0.91637631 0.08362369 

Conditional probabilities:
     PC1
Y           [,1]      [,2]
  NO  -0.2239142 13.299922
  YES  2.4537263  4.550796

     PC2
Y           [,1]     [,2]
  NO   0.0424303 7.676443
  YES -0.4649654 1.453473
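
For numeric predictors, the two columns in the conditional tables printed above are the per-class mean (`[,1]`) and standard deviation (`[,2]`), and the classifier combines Gaussian densities with the a-priori probabilities. The following sketch recomputes the posterior probability of YES by hand in base R for the first two companies in the data preview. The helper name `posterior_yes` is hypothetical, and because the parameters are copied from the rounded printout, the results only approximate what `predict(model, ..., type="raw")` returns.

```r
# Parameters copied (rounded) from the model summary above.
prior = c(NO=0.91637631, YES=0.08362369)
pc1 = list(mean=c(NO=-0.2239142, YES=2.4537263), sd=c(NO=13.299922, YES=4.550796))
pc2 = list(mean=c(NO=0.0424303, YES=-0.4649654), sd=c(NO=7.676443, YES=1.453473))

# Posterior P(YES | PC1, PC2) under the naive (independence) assumption:
# prior * density(PC1) * density(PC2), normalized over both classes.
posterior_yes = function(x1, x2) {
    joint = prior * dnorm(x1, pc1$mean, pc1$sd) * dnorm(x2, pc2$mean, pc2$sd)
    unname(joint["YES"] / sum(joint))
}

p_air = posterior_yes(3.4371231, -0.2260719)    # AAR CORP
p_aal = posterior_yes(-12.0332067, 0.8045109)   # AMERICAN AIRLINES GROUP INC
c(p_air > 0.5, p_aal > 0.5)
```

With these inputs the first company lands above the 0.5 cutoff and the second below it, which agrees with the first two rows of the `predicted` column in the in-sample output.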


### In-Sample Estimated Performance

In [18]:
# Present the model's in-sample estimated accuracy, profit, and profit rate at cutoff=0.5.
prob = predict(model, data, type="raw")
class.predicted = as.class(prob, class="YES", cutoff=0.5)

# Get accuracy
CM = confusionMatrix(class.predicted, data$big_growth)$table
cm = CM / sum(CM)
fmt.cm(cm)
accuracy = (cm[1,1]+cm[2,2])/sum(cm)

# Get profits
# Note: big12 below is simply the first 12 rows of the data, whatever their
# predicted class; the next cell restricts the portfolio to predicted-YES companies.
bigs = data.frame(big_growth=data$big_growth,
                  predicted=class.predicted,
                  growth=data$growth)
big12 = bigs[1:12,]
profit = sum((big12$growth + 1) * allocation) - budget
profit_rate = profit / budget
result_df = data.frame(accuracy, profit, profit_rate)
fmt(result_df, "In-Sample Estimated Performance")

big12
# profit      = sum((1 + growth_i) * allocation_i) - budget
# profit rate = profit / budget
# budget      = sum(allocation_i)

Unnamed: 0,NO,YES
NO,0.2376307,0.0130081
YES,0.6787456,0.0706156


accuracy,profit,profit_rate
0.3082462,-69740.64,-0.0697406


big_growth,predicted,growth
NO,YES,0.0507455507
NO,NO,-0.3828560446
YES,YES,0.3157894737
NO,YES,-0.2164739518
NO,YES,-0.1184971098
NO,NO,0.0002347969
NO,YES,0.0552070263
NO,NO,0.2673909234
NO,YES,-0.9186834463
YES,YES,0.4449427005


In [19]:
# Accuracy
cutoff = 0.5
data.p = data[, colnames(data)!="big_growth"]
prob = predict(model, data, type="raw")
data.p$class.predicted = as.class(prob, class="YES", cutoff=cutoff)
CM = confusionMatrix(data.p$class.predicted, data$big_growth)$table
cm = CM / sum(CM)
accuracy = (cm[1,1]+cm[2,2])/sum(cm)

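# Profit & profit rate for the first 12 companies predicted YES
# (rows are taken in data order, which appears to be ascending gvkey)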
bigs = data.p[data.p$class.predicted == "YES",]
big12 = bigs[1:12,]
profit = sum((big12$growth + 1) * allocation) - budget
profit_rate = profit / budget
tbl = data.frame(accuracy, profit, profit_rate)
fmt(tbl, "In-Sample Estimated Performance")


accuracy,profit,profit_rate
0.3082462,-80393.21,-0.0803932
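
To put the accuracy figure in context, the sketch below re-enters the in-sample confusion matrix proportions printed earlier in this section and derives a few related measures in base R. It assumes caret's convention that rows are predictions and columns are actual values, which is consistent with the NO column summing to the a-priori probability of NO in the model summary. The variable names are illustrative.

```r
# Proportions copied from the in-sample confusion matrix above
# (rows = predicted class, columns = actual class).
cm_printed = matrix(c(0.2376307, 0.6787456,    # column "NO":  predicted NO, predicted YES
                      0.0130081, 0.0706156),   # column "YES": predicted NO, predicted YES
                    nrow=2,
                    dimnames=list(predicted=c("NO", "YES"), actual=c("NO", "YES")))

accuracy_check = sum(diag(cm_printed))                               # correct predictions
baseline_no = sum(cm_printed[, "NO"])                                # accuracy of always predicting NO
recall_yes = cm_printed["YES", "YES"] / sum(cm_printed[, "YES"])     # share of actual YES found
precision_yes = cm_printed["YES", "YES"] / sum(cm_printed["YES", ])  # share of predicted YES that are right

round(c(accuracy=accuracy_check, baseline=baseline_no,
        recall=recall_yes, precision=precision_yes), 4)
```

At this cutoff the accuracy of about 0.31 sits well below the roughly 0.92 obtained by always predicting NO, because most actual-NO companies are predicted YES.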


### Out-of-Sample Estimated Performance

In [20]:
# Partition the data into training (75%) and validation (25%)
# (use set.seed(0) and sample(...) to choose training observations).
# How many observations and variables in the training data?
# How many observations and variables in the validation data?

set.seed(0)
train.rows = sample(1:nrow(data), 0.75*nrow(data))   # 75% of rows, chosen at random, for training
dev.rows = setdiff(1:nrow(data), train.rows)         # remaining 25% for validation
data.train = data[train.rows,]
data.dev = data[dev.rows,]
layout(fmt(size(data.train)), fmt(size(data.dev)))

size(data.train)
observations,variables
3228,8

size(data.dev)
observations,variables
1077,8


In [21]:
# Present the model's out-of-sample estimated accuracy, profit, and profit rate at cutoff=0.5.

data.t = data.dev[, colnames(data.dev)!="big_growth"]
model.t = naiveBayes(big_growth ~ PC1+PC2, data.train)
prob = predict(model.t, data.dev, type="raw")
class.predicted = as.class(prob, class="YES", cutoff=0.5)
CM.t = confusionMatrix(class.predicted, data.dev$big_growth)$table
cm.t = CM.t/sum(CM.t)

accuracy.t = cm.t[1,1]+cm.t[2,2]
data.t$class.predicted = class.predicted
bigs.t = data.t[data.t$class.predicted == "YES",]
big12.t = bigs.t[1:12,]
profit.t = sum((big12.t$growth + 1) * allocation) - budget
profit_rate.t = profit.t / budget
result_df.t = data.frame(accuracy.t, profit.t, profit_rate.t)
fmt(result_df.t, "Out-of-Sample Estimated Performance")

accuracy.t,profit.t,profit_rate.t
0.2989786,-120201.9,-0.1202019


### 5-Fold Cross-Validation Estimated Performance

In [22]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...)).
# Present the first few observation (row) numbers for each of the folds.
#
# You can use the str() function.
set.seed(0)
fold = createFolds(data$big_growth, k=5)
str(fold)

List of 5
 $ Fold1: int [1:861] 9 13 17 19 31 42 44 54 60 66 ...
 $ Fold2: int [1:861] 1 2 6 11 16 25 32 49 55 59 ...
 $ Fold3: int [1:861] 4 8 14 22 28 34 40 45 50 52 ...
 $ Fold4: int [1:861] 3 5 15 18 21 24 26 27 30 36 ...
 $ Fold5: int [1:861] 7 10 12 20 23 29 33 35 37 46 ...
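
When the outcome is a factor, `createFolds` samples within each class so that every fold keeps roughly the same class balance. The sketch below imitates that idea in base R on made-up labels. `make_folds` is a hypothetical helper, not caret's implementation, and its fold assignments will not match those printed above.

```r
# Hypothetical helper: assign each row to one of k folds, stratified by class.
make_folds = function(y, k) {
    fold_id = integer(length(y))
    for (level in unique(as.character(y))) {
        idx = which(y == level)
        # Shuffle this class's rows, then deal them round-robin into the k folds.
        fold_id[idx[sample.int(length(idx))]] = rep_len(1:k, length(idx))
    }
    split(seq_along(y), fold_id)    # list of row numbers, one element per fold
}

set.seed(0)
y_toy = rep(c("NO", "YES"), times=c(90, 10))   # toy labels, 10% YES
folds_toy = make_folds(y_toy, k=5)
sapply(folds_toy, function(rows) sum(y_toy[rows] == "YES"))   # YES count per fold
```

Every row lands in exactly one fold, so training on the other four folds never sees the rows being predicted.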


In [23]:
# Present the model's estimated accuracy and profit at cutoff=0.5 for each fold.
accuracies = c(NA, NA, NA, NA, NA)
profits = c(NA, NA, NA, NA, NA)
profit_rates = c(NA, NA, NA, NA, NA)
cutoff = 0.5

for (i in 1:5)
{ 
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.dev = data[fold[[i]],]
    model = naiveBayes(big_growth ~ PC1+PC2, data.train)
    prob = predict(model, data.dev, type="raw")
    class.predicted = as.class(prob, class="YES", cutoff=cutoff)
    CM = confusionMatrix(class.predicted, data.dev$big_growth)$table
    cm = CM/sum(CM)
    accuracies[i] = cm[1,1]+cm[2,2]
    data.dev$class.predicted = class.predicted
    bigs.i = data.dev[data.dev$class.predicted == "YES",]
    big12.i = bigs.i[1:12,]
    profit.i = sum((big12.i$growth + 1) * allocation) - budget
    profits[i] = profit.i
    profit_rates[i] = profit.i / budget
}

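# Re-do fold #2: no observations in this fold are predicted YES, so big12.i is all NA and the profit is NA.
# Following the portfolio rule, fill the portfolio from the remaining (predicted NO) companies instead.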
data.train.2 = data[setdiff(1:nrow(data), fold[[2]]),]
data.dev.2 = data[fold[[2]],]
model.2 = naiveBayes(big_growth ~ PC1+PC2, data.train.2)
prob.2 = predict(model.2, data.dev.2, type="raw")
class.predicted.2 = as.class(prob.2, class="YES", cutoff=cutoff)
data.dev.2$class.predicted = class.predicted.2
bigs.2 = data.dev.2[data.dev.2$class.predicted == "NO",]
big12.2 = bigs.2[1:12,]
profit.2 = sum((big12.2$growth + 1) * allocation) - budget
profits[2] = profit.2
profit_rates[2] = profit.2 / budget


tbl2 = data.frame(fold=c(1,2,3,4,5),
                  accuracy=accuracies,
                  profit=profits)
tbl2

fold,accuracy,profit
1,0.3042973,-221281.1
2,0.9163763,-90115.24
3,0.2659698,-28710.93
4,0.2950058,-89837.99
5,0.3066202,31939.77


In [24]:
# Present the model's 5-fold cross-validation estimated accuracy, profit, and profit rate at cutoff=0.5

df = data.frame(accuracy.cv=mean(accuracies),
                profit.cv=mean(profits),
                profit_rate.cv=mean(profit_rates))
df

accuracy.cv,profit.cv,profit_rate.cv
0.4176539,-79601.1,-0.0796011
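
The cross-validation figures are plain averages of the per-fold values. The sketch below re-enters the per-fold numbers from the table above by hand to reproduce those averages and to show how much the fold 2 accuracy of about 0.92 moves the mean. The variable names are illustrative.

```r
# Per-fold values copied from the per-fold table above.
acc_by_fold = c(0.3042973, 0.9163763, 0.2659698, 0.2950058, 0.3066202)
profit_by_fold = c(-221281.1, -90115.24, -28710.93, -89837.99, 31939.77)

mean(acc_by_fold)        # reproduces accuracy.cv
mean(profit_by_fold)     # reproduces profit.cv
mean(acc_by_fold[-2])    # mean accuracy over folds 1, 3, 4, and 5 only
sd(acc_by_fold)          # spread of accuracy across folds
```

Leaving out fold 2 drops the mean accuracy to about 0.29, in line with the in-sample and out-of-sample estimates.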


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 9, 2020
</span>
</p>
</font>