# Project Part E: Tuning

![](banner_project.jpg)

In [6]:
analyst = "Charlie Ellis" # Replace this with your name

In [7]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Construct and tune a classifier and a regressor, each trained on a transformed dataset about public company fundamentals.  Later, use the best classifier or regressor along with additional analysis to recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Construct a model to predict whether stock price will grow more than 30% over 12 months, given 12 months of past company fundamentals data, and using a machine learning model construction method and transformed data.  Tune the model by systematically selecting various combinations of predictor variables and cutoffs, and identify the best business performance based on a business model and business parameters.  
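
The tuning grid described above can be sketched in base R. This is a minimal illustration, not the course's `exhaustive` or `as.class` helpers: the probabilities are made up, and treating a probability equal to the cutoff as "YES" is an assumption.

```r
# Candidate predictors and cutoffs named in the directions
predictors = c("PC1", "PC2")
cutoffs = c(0.25, 0.33, 0.50)

# All non-empty subsets of the predictors, each stored as a character vector
subsets = unlist(lapply(seq_along(predictors),
                        function(k) combn(predictors, k, simplify=FALSE)),
                 recursive=FALSE)

# One row per (subset, cutoff) pair
grid = expand.grid(subset=seq_along(subsets), cutoff=cutoffs)
grid$variables = sapply(subsets[grid$subset], paste, collapse=", ")

# A cutoff turns a predicted probability of "YES" into a class label
prob.yes = c(0.10, 0.30, 0.40, 0.60)   # hypothetical probabilities
class.predicted = ifelse(prob.yes >= 0.33, "YES", "NO")

print(grid)
print(class.predicted)
```

Two predictors give three non-empty subsets, and three cutoffs give nine configurations in total.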

Similarly, construct a model to predict how much stock price will grow over 12 months, given 12 months of past company fundamentals data, and using a machine learning model construction method and transformed data.  Tune the model by systematically selecting various combinations of predictor variables, and identify the best business performance based on a business model and business parameters.
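
As a rough sketch of how a cross-validated RMSE is estimated, the example below uses synthetic data and a simple random fold assignment. Both are assumptions for illustration; the project itself uses `createFolds(...)` and the real dataset.

```r
set.seed(0)
n = 100

# Synthetic stand-in for the transformed data (not project data)
d = data.frame(PC1=rnorm(n), PC2=rnorm(n))
d$growth = 0.1 * d$PC1 + rnorm(n, sd=0.4)

# Assign each row to one of 5 folds of equal size, in random order
fold.id = sample(rep(1:5, length.out=n))

rmses = numeric(5)
for (i in 1:5) {
  train = d[fold.id != i, ]   # fit on the other four folds
  test  = d[fold.id == i, ]   # evaluate on the held-out fold
  model = lm(growth ~ PC1 + PC2, data=train)
  rmses[i] = sqrt(mean((predict(model, test) - test$growth)^2))
}

rmse.cv = mean(rmses)         # cross-validated RMSE estimate
print(rmse.cv)
```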

## Business Model


The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio
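
To make the business model concrete, here is a small worked example of the profit formula. The growth rates are hypothetical values chosen for illustration.

```r
budget = 1000000
portfolio_size = 12
allocation = rep(budget / portfolio_size, portfolio_size)

# Hypothetical 12-month growth rates for 12 companies (illustrative values only)
growth = c(0.10, -0.05, 0.30, 0.00, 0.15, -0.20,
           0.05, 0.25, -0.10, 0.40, 0.00, 0.10)

profit = sum((1 + growth) * allocation) - budget
profit_rate = profit / budget

print(profit)
print(profit_rate)
```

Because every company receives the same allocation, the profit rate here equals the mean growth of the companies in the portfolio.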

For classifier evaluation, fill the portfolio with companies with the lowest gvkey values from among those you predict to grow above 30%.  If you predict fewer than the portfolio size to grow above 30%, then fill the rest of the portfolio with the remaining companies with lowest gvkey values.
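
One way to express this fill rule is a two-key sort: predicted class first, then gvkey. The sketch below uses made-up gvkey values and labels and a portfolio size of 3 to keep the output short.

```r
portfolio_size = 3   # small size for illustration; the project uses 12

# Hypothetical test-fold predictions (gvkey values and labels are made up)
d = data.frame(gvkey = c(1050, 1004, 1075, 1045, 1062),
               class.predicted = c("YES", "NO", "NO", "YES", "NO"))

# Predicted "YES" rows first (FALSE sorts before TRUE),
# then lowest gvkey within each group
ranked = d[order(d$class.predicted == "NO", d$gvkey), ]
portfolio = ranked[1:portfolio_size, ]

print(portfolio)
```

Only two companies are predicted "YES" here, so the third slot goes to the remaining company with the lowest gvkey.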

For regressor evaluation, fill the portfolio with companies that have the highest predicted growths.
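
The regressor rule is a single descending sort on predicted growth. A minimal sketch with hypothetical predictions:

```r
portfolio_size = 2   # small size for illustration; the project uses 12

# Hypothetical predicted growths (values are made up)
d = data.frame(gvkey = c(1004, 1045, 1050, 1062),
               pred.growth = c(0.05, -0.10, 0.30, 0.12))

# Highest predicted growth first, then keep the top portfolio_size rows
ranked = d[order(-d$pred.growth), ]
portfolio = head(ranked, portfolio_size)

print(portfolio)
```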

In [8]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data Retrieval

In this step, we load the data frame created in the last part of Project B, where we chose the 5 predictor variables and 3 outcome variables to use for classification. Loading the file takes a single call to R's `read.csv` function (we did the same in Project Parts C and D). Two of the predictor variables are the first and second principal components, obtained in an earlier project part. By construction, these two components capture more of the variance in the original variables than any other pair of components, so they serve as compact predictors.
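
One detail worth checking after loading: since R 4.0.0, `read.csv` keeps text columns as character vectors by default, while functions such as caret's `confusionMatrix` expect factors. The sketch below reads three rows copied from the data preview through the `text=` argument, so that it runs without the file, and converts `big_growth` explicitly. Whether `setup.R` already handles this conversion is not known here.

```r
# Three rows copied from the data preview, read from a string instead of a file
csv = "gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES"

data = read.csv(text=csv, header=TRUE)

# Make the outcome a factor with a fixed level order
data$big_growth = factor(data$big_growth, levels=c("NO", "YES"))

# Sanity check: big_growth should be "YES" exactly when growth exceeds 30%
consistent = all((data$growth > 0.30) == (data$big_growth == "YES"))
print(consistent)
```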

In [9]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)

# Present a few rows
data[1:6,]

gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,3.6561434,-0.7981915,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,2.9282228,-0.71042,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3488491,1.1389605,85.2,0.0002347969,NO


## Build & Tune Classification Model

In [10]:
df = data[,c("big_growth", "PC1", "PC2")]
df2 = data[,c("growth", "PC1", "PC2")]

In [11]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on big_growth).

# Construct several naive Bayes models to predict big_growth (use laplace=TRUE).
# Iterate through unique combinations of predictor variables, selected from PC1 and PC2.
# Iterate through cutoff values, selected from 0.25, 0.33, and 0.50.

# Estimate each model's accuracy and profit, using 5-fold cross validation.

# Present the best model: selected variables, selected cutoff, accuracy, and profit.
# Present all the models: selected variables, selected cutoff, accuracy, and profit.

set.seed(0)                                   # make the fold assignment reproducible
fold = createFolds(data$big_growth, k=5)
tune = data.frame()
accuracy = c()
profits = c()


for (f in exhaustive(names(df), keep="big_growth")) # try several combinations of variables
{ 
    for (cutoff in c(0.25, 0.33, 0.5)) { # try several cutoffs
        
        for (i in 1:5) { 
            data.train = data[setdiff(1:nrow(data), fold[[i]]),]
            data.test  = data[fold[[i]],]
            model = naiveBayes(big_growth ~ ., data.train[,f], laplace=TRUE)
            prob = predict(model, data.test, type="raw")
            class.predicted = as.class(prob, "YES", cutoff=cutoff)
            CM = confusionMatrix(class.predicted, data.test$big_growth)$table
            cm = CM/sum(CM)
            accuracy[i] = cm[1,1]+cm[2,2]
            data.test$class.predicted = class.predicted
            # fill the portfolio: predicted "YES" first, then lowest gvkey
            bigs.i = data.test[order(data.test$class.predicted == "NO", data.test$gvkey),]
            big12.i = bigs.i[1:portfolio_size,]
            profit.i = sum((big12.i$growth + 1) * allocation) - budget
            profits[i] = profit.i
        }
        
        accuracy.cv = mean(accuracy)
        profit.cv = mean(profits)
        tune = rbind(tune, data.frame(method="naive bayes", variables=paste(f, collapse=", "), cutoff=cutoff, accuracy.cv, profit.cv)) 
        
    }
    
}
    
tune

method,variables,cutoff,accuracy.cv,profit.cv
naive bayes,"PC1, big_growth",0.25,0.7865273,-77384.67
naive bayes,"PC1, big_growth",0.33,0.9156794,-79341.81
naive bayes,"PC1, big_growth",0.5,0.9156794,-79341.81
naive bayes,"PC2, big_growth",0.25,0.3398374,-98638.03
naive bayes,"PC2, big_growth",0.33,0.6594657,-73636.9
naive bayes,"PC2, big_growth",0.5,0.9154472,-73161.7
naive bayes,"PC1, PC2, big_growth",0.25,0.2088269,-84303.85
naive bayes,"PC1, PC2, big_growth",0.33,0.2269454,-96651.95
naive bayes,"PC1, PC2, big_growth",0.5,0.4160279,-114519.58
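
The directions also ask for the best model. A minimal sketch of that selection, re-entering the results table above by hand so the snippet runs on its own:

```r
# Results copied from the tuning table above
tune = data.frame(
  variables = rep(c("PC1, big_growth", "PC2, big_growth", "PC1, PC2, big_growth"), each=3),
  cutoff = rep(c(0.25, 0.33, 0.50), times=3),
  accuracy.cv = c(0.7865273, 0.9156794, 0.9156794,
                  0.3398374, 0.6594657, 0.9154472,
                  0.2088269, 0.2269454, 0.4160279),
  profit.cv = c(-77384.67, -79341.81, -79341.81,
                -98638.03, -73636.90, -73161.70,
                -84303.85, -96651.95, -114519.58))

# which.max returns the first row that attains the maximum
best.by.profit = tune[which.max(tune$profit.cv), ]
best.by.accuracy = tune[which.max(tune$accuracy.cv), ]

print(best.by.profit)
print(best.by.accuracy)
```

By profit, the best configuration is PC2 alone with cutoff 0.5; by accuracy, it is PC1 alone with cutoff 0.33 (tied with cutoff 0.5). Every configuration shows a negative cross-validated profit, so "best" here means the smallest loss.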


## Build & Tune Regression Model

In [12]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).

# Construct several linear regression models to predict growth.
# Iterate through unique combinations of predictor variables, selected from PC1 and PC2.

# Estimate each model's RMSE and profit, using 5-fold cross validation.

# Present the best model: selected variables, RMSE, and profit.
# Present all the models: selected variables, RMSE, and profit.

set.seed(0)                            # make the fold assignment reproducible
fold = createFolds(data$growth, k=5)
tune = data.frame()
rmses = c()
profits = c()


for (f in exhaustive(names(df2), keep="growth")) # try several combinations of variables
{ 
    for (i in 1:5) { 
        data.train = data[setdiff(1:nrow(data), fold[[i]]),]
        data.test  = data[fold[[i]],]
        model = lm(growth ~ ., data.train[,f])
        growth.predicted = predict(model, data.test)
        error = growth.predicted - data.test$growth
        square_error = error^2
        RMSE = sqrt(mean(square_error))
        rmses[i] = RMSE
        
        # fill the portfolio with the highest predicted growths
        data.temp = data.test
        data.temp$pred.growth = growth.predicted
        bigs.i = data.temp[order(-data.temp$pred.growth),]
        big12.i = bigs.i[1:portfolio_size,]
        profit.i = sum((big12.i$growth + 1) * allocation) - budget
        profits[i] = profit.i
    }
    
    rmse.cv = mean(rmses)
    profit.cv = mean(profits)
    tune = rbind(tune, data.frame(method="linear regression", variables=paste(f, collapse=", "), rmse.cv, profit.cv)) 
    
    
}
    
tune

method,variables,rmse.cv,profit.cv
linear regression,"PC1, growth",0.4659713,-260853.48
linear regression,"PC2, growth",0.466048,-77011.29
linear regression,"PC1, PC2, growth",0.4660607,-71385.76
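
As with the classifier, the best regressor can be picked from the results table. The sketch below re-enters the table above by hand and compares the two selection criteria.

```r
# Results copied from the tuning table above
tune = data.frame(
  variables = c("PC1, growth", "PC2, growth", "PC1, PC2, growth"),
  rmse.cv = c(0.4659713, 0.4660480, 0.4660607),
  profit.cv = c(-260853.48, -77011.29, -71385.76))

best.by.rmse = tune[which.min(tune$rmse.cv), ]       # lowest prediction error
best.by.profit = tune[which.max(tune$profit.cv), ]   # highest business value

print(best.by.rmse)
print(best.by.profit)
```

The two criteria disagree: PC1 alone has the lowest RMSE, but it also has by far the worst profit, while PC1 with PC2 has the best profit. The RMSE values differ only in the fourth decimal place, so profit is the more informative criterion for this business model.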


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 14, 2020
</span>
</p>
</font>