# Project Part D: Regression

![](banner_project.jpg)

In [1]:
analyst = "Citlalli Villarreal" # Replace this with your name

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Construct, evaluate, and tune a regressor trained on a transformed dataset about public company fundamentals.  Later, use the regressor along with additional analysis to recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Construct a model to predict how much stock price will grow over 12 months, given 12 months of past company fundamentals data, using a machine learning model construction method and the transformed data.

Evaluate the model's business performance based on a business model and business parameters.

Tune the model by exhaustive search for the best performing model.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \\$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \\$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio 

Fill the portfolio with companies you predict to have the highest growths.
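
As a small illustration of the business model, the sketch below applies the profit and profit rate formulas to a made-up portfolio. The growth values are hypothetical and are not taken from the project data; the business parameters are the ones stated above.

```r
# Business parameters, as stated in this notebook.
budget <- 1000000
portfolio_size <- 12
allocation <- rep(budget / portfolio_size, portfolio_size)

# Hypothetical 12-month growth rates for 12 selected companies (illustrative only).
growth <- c(0.10, -0.05, 0.20, 0.00, -0.15, 0.05,
            0.30, -0.10, 0.08, 0.02, -0.02, 0.12)

# profit = sum((1 + growth_i) * allocation_i) - budget
profit <- sum((1 + growth) * allocation) - budget
profit_rate <- profit / budget

cat("profit:", profit, "\n")
cat("profit rate:", profit_rate, "\n")
cat("mean growth:", mean(growth), "\n")
```

Because every company receives the same allocation, the profit rate equals the mean growth of the companies in the portfolio, so picking the portfolio amounts to picking the 12 companies with the highest average realized growth.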

In [3]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

_<< The data was retrieved from a CSV file named "My Data.csv". >>_

In [4]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)
data$big_growth = factor(data$big_growth, levels=c("YES","NO"))

# Present a few rows ...
data[1:6,]

| gvkey | tic | conm | PC1 | PC2 | PC3 | prccq | growth | big_growth |
|---|---|---|---|---|---|---|---|---|
| 1004 | AIR | AAR CORP | 1.4097638 | 0.2124544 | -0.18735809 | 43.69 | 0.0507455507 | NO |
| 1045 | AAL | AMERICAN AIRLINES GROUP INC | -2.8093139 | 0.2246363 | 1.43661206 | 32.11 | -0.3828560446 | NO |
| 1050 | CECE | CECO ENVIRONMENTAL CORP | 1.5247216 | 0.4396434 | -0.16785608 | 6.75 | 0.3157894737 | YES |
| 1062 | ASA | ASA GOLD AND PRECIOUS METALS | 1.5736687 | 0.6384403 | 0.01227541 | 8.66 | -0.2164739518 | NO |
| 1072 | AVX | AVX CORP | 1.2812646 | 0.4529129 | 0.09293832 | 15.25 | -0.1184971098 | NO |
| 1075 | PNW | PINNACLE WEST CAPITAL CORP | 0.3697622 | -0.4860613 | -0.01283639 | 85.2 | 0.0002347969 | NO |


## Build Regression Model

_<< Below, a linear regression model is constructed to predict growth from PC1, PC2, and PC3. >>_

In [5]:
# Construct a linear regression model to predict growth given PC1, PC2, and PC3.
# Present a brief summary of the model parameters.
model = lm(formula = growth ~ PC1 + PC2 + PC3, data = data)
model


Call:
lm(formula = growth ~ PC1 + PC2 + PC3, data = data)

Coefficients:
(Intercept)          PC1          PC2          PC3  
  -0.118589     0.001091    -0.001686    -0.001792  
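
To make the role of the coefficients concrete, here is a minimal sketch of fitting a linear regression and predicting growth for new observations. It runs on synthetic stand-in data (the `toy` data frame, its coefficients, and the `new_companies` values are all hypothetical), so the numbers it produces say nothing about the project dataset.

```r
set.seed(1)

# Synthetic stand-in for the transformed dataset (hypothetical values).
toy <- data.frame(PC1 = rnorm(100), PC2 = rnorm(100), PC3 = rnorm(100))
toy$growth <- -0.1 + 0.05 * toy$PC1 - 0.02 * toy$PC2 + rnorm(100, sd = 0.4)

toy_model <- lm(growth ~ PC1 + PC2 + PC3, data = toy)

# Predict growth for new companies described by their principal component scores.
new_companies <- data.frame(PC1 = c(1.5, -2.0), PC2 = c(0.2, 0.4), PC3 = c(-0.1, 1.0))
predicted_growth <- predict(toy_model, newdata = new_companies)

# The same prediction by hand: intercept plus the weighted sum of the predictors.
b <- coef(toy_model)
manual <- b["(Intercept)"] +
  b["PC1"] * new_companies$PC1 +
  b["PC2"] * new_companies$PC2 +
  b["PC3"] * new_companies$PC3

print(predicted_growth)
print(manual)
```

The two printed vectors agree, which shows that `predict()` on an `lm` model is the intercept plus each coefficient times the corresponding predictor value.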


## Evaluate Regression Model (5-fold cross-validation)

_<< Below, the data is partitioned into 5 folds for cross-validation, and the RMSE and profit are calculated for each fold. The cross-validation RMSE and profit are then the averages of the per-fold RMSE and profit values, respectively. >>_

In [6]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).
# Present the first few observation (row) numbers for each of the folds.
#
# You can use the str() function.

set.seed(0)
fold = createFolds(data$growth, k=5, list=TRUE)
str(fold)

List of 5
 $ Fold1: int [1:862] 8 11 16 22 30 32 38 40 41 44 ...
 $ Fold2: int [1:860] 3 9 10 23 26 27 34 39 52 64 ...
 $ Fold3: int [1:862] 2 7 19 29 35 42 53 57 61 62 ...
 $ Fold4: int [1:861] 1 4 5 6 15 17 28 33 36 43 ...
 $ Fold5: int [1:860] 12 13 14 18 20 21 24 25 31 37 ...
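
The sketch below shows, in base R, what a 5-fold partition looks like structurally: every row number lands in exactly one fold, and for a given fold the training rows are all rows outside it. It uses simple random assignment on a hypothetical 20-row dataset, which is an assumption made for illustration; `createFolds(...)` from caret also tries to balance the outcome distribution across folds, so it will not produce the same fold membership.

```r
set.seed(0)
n <- 20   # hypothetical number of observations
k <- 5    # number of folds

# Assign each row to one of k folds at random (simple random folds, an assumption).
fold_id <- sample(rep(1:k, length.out = n))
folds <- split(seq_len(n), fold_id)
str(folds)

# For fold 1, the test rows are the fold itself and the training rows are the rest.
test_rows <- folds[[1]]
train_rows <- setdiff(seq_len(n), test_rows)
```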


In [7]:
# Present the model's estimated RMSE and profit for each fold.
RMSE <- function(actual, predicted) { sqrt(mean((actual - predicted)^2)) }
profit <- function(dataset) {sum((1+dataset$growth) * allocation) - budget}

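rmse_profit <- function(train, test) {
    # Fit on the training fold, then predict growth for the held-out fold.
    model = lm(formula = growth ~ PC1 + PC2 + PC3, data = train)
    outcome.predicted = predict(model, test)

    # Rank the held-out companies by predicted growth, highest first.
    new.test = cbind(outcome.predicted, test)
    new.test = new.test[order(-new.test$outcome.predicted),]

    # RMSE uses the whole fold; profit uses only the top portfolio_size companies.
    rmse = RMSE(test$growth, outcome.predicted)
    prof = profit(new.test[1:portfolio_size,])
    return (c(rmse, prof))
}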

data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

data.test.1 = data[fold$Fold1,]
data.test.2 = data[fold$Fold2,]
data.test.3 = data[fold$Fold3,]
data.test.4 = data[fold$Fold4,]
data.test.5 = data[fold$Fold5,]

rmse_profit1 = rmse_profit(data.train.1, data.test.1)
rmse_profit2 = rmse_profit(data.train.2, data.test.2)
rmse_profit3 = rmse_profit(data.train.3, data.test.3)
rmse_profit4 = rmse_profit(data.train.4, data.test.4)
rmse_profit5 = rmse_profit(data.train.5, data.test.5)


data.frame(fold=c(1, 2, 3, 4, 5), rmse=c(rmse_profit1[1], rmse_profit2[1], rmse_profit3[1], rmse_profit4[1], rmse_profit5[1]), 
           profit=c(rmse_profit1[2], rmse_profit2[2], rmse_profit3[2], rmse_profit4[2], rmse_profit5[2]))



| fold | rmse | profit |
|---|---|---|
| 1 | 0.4444572 | -112168.03 |
| 2 | 0.4358929 | -159109.52 |
| 3 | 0.5040095 | -68570.91 |
| 4 | 0.3991218 | -81948.94 |
| 5 | 0.5459158 | -14433.04 |


In [8]:
# Present the model's 5-fold cross-validation estimated RMSE, profit, and profit rate.
rmse.cv = sum(rmse_profit1[1], rmse_profit2[1], rmse_profit3[1], rmse_profit4[1], rmse_profit5[1])/5
profit.cv =  sum(rmse_profit1[2], rmse_profit2[2], rmse_profit3[2], rmse_profit4[2], rmse_profit5[2])/5
profit_rate.cv = sum(c(rmse_profit1[2], rmse_profit2[2], rmse_profit3[2], rmse_profit4[2], rmse_profit5[2])/budget)/5
fmt(data.frame(rmse.cv, profit.cv, profit_rate.cv), "5-Fold Cross-Validation Estimated Performance")

| rmse.cv | profit.cv | profit_rate.cv |
|---|---|---|
| 0.4658794 | -87246.09 | -0.0872461 |
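
The per-fold evaluation and averaging can also be written as one function applied to every fold. This is a sketch under stated assumptions rather than the notebook's own method: it uses synthetic data (`toy`), simple random folds instead of `createFolds(...)`, and a hypothetical helper named `evaluate_fold`.

```r
set.seed(0)

# Synthetic stand-in data (hypothetical; not the project dataset).
n <- 200
toy <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n), PC3 = rnorm(n))
toy$growth <- 0.05 * toy$PC1 + rnorm(n, sd = 0.3)

budget <- 1000000
portfolio_size <- 12
allocation <- rep(budget / portfolio_size, portfolio_size)

# Simple random 5-fold partition (an assumption; caret is not used here).
folds <- split(seq_len(n), sample(rep(1:5, length.out = n)))

# Evaluate one fold: fit on the training rows, score on the held-out rows.
evaluate_fold <- function(test_rows) {
  train <- toy[-test_rows, ]
  test <- toy[test_rows, ]
  model <- lm(growth ~ PC1 + PC2 + PC3, data = train)
  predicted <- predict(model, test)
  # Keep the portfolio_size companies with the highest predicted growth.
  top <- test[order(-predicted), ][1:portfolio_size, ]
  c(rmse = sqrt(mean((test$growth - predicted)^2)),
    profit = sum((1 + top$growth) * allocation) - budget)
}

per_fold <- t(sapply(folds, evaluate_fold))   # one row per fold
cv <- colMeans(per_fold)                      # cross-validation estimates
profit_rate_cv <- cv[["profit"]] / budget

print(per_fold)
print(cv)
```

Note that RMSE is computed over every held-out company, while profit depends only on the 12 companies ranked highest in each fold, so the two measures can move quite differently from fold to fold.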


## Tune Regression Model

_<< Below, I partition the data into 5 folds and, for each unique combination of the predictor variables (PC1, PC2, PC3), construct a linear regression model on each fold's training data. I then estimate each model's cross-validation RMSE and profit. Lastly, the best model is chosen by selecting the one with the highest profit (which, when every profit is negative, is the smallest loss). >>_

In [9]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).

# Construct several linear regression models to predict growth.
# Iterate through unique combinations of predictor variables, selected from PC1, PC2, PC3.

# Estimate each model's RMSE and profit, using 5-fold cross validation.

# Present the best model: selected variables, RMSE, profit, and profit rate.
# Present all the models: selected variables, RMSE, profit, and profit rate.

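rmse_profit_modified <- function(train, test, mvar) {
    # mvar is the right-hand side of the formula, e.g. "PC1 + PC2".
    model = lm(as.formula(paste("growth ~ ", mvar)), data = train)
    outcome.predicted = predict(model, test)

    # Rank the held-out companies by predicted growth, highest first.
    new.test = cbind(outcome.predicted, test)
    new.test = new.test[order(-new.test$outcome.predicted),]

    # RMSE uses the whole fold; profit uses only the top portfolio_size companies.
    rmse = RMSE(test$growth, outcome.predicted)
    prof = profit(new.test[1:portfolio_size,])
    return (c(rmse, prof))
}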

set.seed(0)
fold = createFolds(data$growth, k=5, list=TRUE)
data.train.1 = data[setdiff(1:nrow(data), fold$Fold1),]
data.train.2 = data[setdiff(1:nrow(data), fold$Fold2),]
data.train.3 = data[setdiff(1:nrow(data), fold$Fold3),]
data.train.4 = data[setdiff(1:nrow(data), fold$Fold4),]
data.train.5 = data[setdiff(1:nrow(data), fold$Fold5),]

data.test.1 = data[fold$Fold1,]
data.test.2 = data[fold$Fold2,]
data.test.3 = data[fold$Fold3,]
data.test.4 = data[fold$Fold4,]
data.test.5 = data[fold$Fold5,]

search_for_best_model = data.frame()
unique_comb = list("PC1", "PC2", "PC3", c("PC1 + PC2"), c("PC1 + PC3"), c("PC2 + PC3"), c("PC1 + PC2 + PC3"))
for (var in unique_comb) {
    rmse_profit1 = rmse_profit_modified(data.train.1, data.test.1, var)
    rmse_profit2 = rmse_profit_modified(data.train.2, data.test.2, var)
    rmse_profit3 = rmse_profit_modified(data.train.3, data.test.3, var)
    rmse_profit4 = rmse_profit_modified(data.train.4, data.test.4, var)
    rmse_profit5 = rmse_profit_modified(data.train.5, data.test.5, var)
    rmse.cv = sum(rmse_profit1[1], rmse_profit2[1], rmse_profit3[1], rmse_profit4[1], rmse_profit5[1])/5
    profit.cv =  sum(rmse_profit1[2], rmse_profit2[2], rmse_profit3[2], rmse_profit4[2], rmse_profit5[2])/5
    profit_rate.cv = sum(c(rmse_profit1[2], rmse_profit2[2], rmse_profit3[2], rmse_profit4[2], rmse_profit5[2])/budget)/5
    search_for_best_model=rbind(search_for_best_model,data.frame(method= "linear regression", variables=paste(gsub("[+]", ",", var), ", growth"), rmse.cv, profit.cv, profit_rate.cv))
  
    }

fmt(search_for_best_model[order(-search_for_best_model$profit.cv),][1,], "best model")
fmt(search_for_best_model, "search for best model")





| method | variables | rmse.cv | profit.cv | profit_rate.cv |
|---|---|---|---|---|
| linear regression | PC1 , PC2 , growth | 0.465881 | -51469.51 | -0.0514695 |


| method | variables | rmse.cv | profit.cv | profit_rate.cv |
|---|---|---|---|---|
| linear regression | PC1 , growth | 0.4659377 | -288146.13 | -0.2881461 |
| linear regression | PC2 , growth | 0.465904 | -70483.33 | -0.0704833 |
| linear regression | PC3 , growth | 0.4659627 | -111427.63 | -0.1114276 |
| linear regression | PC1 , PC2 , growth | 0.465881 | -51469.51 | -0.0514695 |
| linear regression | PC1 , PC3 , growth | 0.4659351 | -75214.42 | -0.0752144 |
| linear regression | PC2 , PC3 , growth | 0.4659102 | -93628.28 | -0.0936283 |
| linear regression | PC1 , PC2 , PC3 , growth | 0.4658794 | -87246.09 | -0.0872461 |
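
The seven predictor combinations were listed by hand in the search above. As a possible alternative, they can be generated with base R's `combn()`, which matters more as the number of candidate predictors grows. The names `predictors`, `combos`, and `formulas` are introduced here for illustration only.

```r
predictors <- c("PC1", "PC2", "PC3")

# For each subset size k, build every k-variable combination as "A + B + ...".
combos <- unlist(lapply(seq_along(predictors), function(k) {
  combn(predictors, k, FUN = paste, collapse = " + ")
}))
print(combos)

# Each string can be turned into a model formula in the same way as in the search loop.
formulas <- lapply(combos, function(rhs) as.formula(paste("growth ~", rhs)))
print(formulas[[4]])
```

With 3 predictors this yields 2^3 - 1 = 7 non-empty subsets, in the same order as the hand-written list.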


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised April 9, 2021
</span>
</p>
</font>