# Project Part D: Regression

![](banner_project.jpg)

In [1]:
analyst = "Firstname Lastname" # Replace this with your name

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Construct, evaluate, and tune a regressor trained on a transformed dataset of public company fundamentals. Later, use the regressor along with additional analysis to recommend a portfolio of 12 company investments that maximizes the 12-month return on an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Construct a model that predicts how much a company's stock price will grow over the next 12 months, given the previous 12 months of company fundamentals data. Build it by applying a machine learning model construction method to the transformed data.

Evaluate the model's business performance based on a business model and business parameters.

Tune the model by exhaustively searching the combinations of predictor variables for the best-performing model.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio

Fill the portfolio with the companies you predict to have the highest growth.
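
The business model above can be sketched as a small function. This is a minimal illustration, not part of the project template: `portfolio_profit` is a hypothetical helper name, and the growth values and the \$1,000 budget in the toy example are made up.

```r
# Hypothetical helper: profit and profit rate for a portfolio,
# following profit = sum((1 + growth_i) * allocation_i) - budget.
portfolio_profit <- function(growth, allocation) {
  budget <- sum(allocation)                       # budget = sum of allocations
  profit <- sum((1 + growth) * allocation) - budget
  c(profit = profit, profit_rate = profit / budget)
}

# Toy example with made-up growth values for 4 companies.
growth <- c(0.10, -0.05, 0.20, 0.00)
allocation <- rep(1000 / 4, 4)                    # equal allocation
result <- portfolio_profit(growth, allocation)
result
```

With equal allocations the profit rate reduces to the mean growth of the selected companies, which is why picking the companies with the highest actual growth maximizes profit.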

In [3]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

_<< Discuss this data retrieval. >>_

In [4]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)
data$big_growth = factor(data$big_growth, levels=c("YES","NO"))

# Present a few rows ...
data[1:6,]
size(data)

gvkey,tic,conm,PC1,PC2,PC3,prccq,growth,big_growth
1004,AIR,AAR CORP,1.4097638,0.2124544,-0.18735809,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-2.8093139,0.2246363,1.43661206,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,1.5247216,0.4396434,-0.16785608,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,1.5736687,0.6384403,0.01227541,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,1.2812646,0.4529129,0.09293832,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3697622,-0.4860613,-0.01283639,85.2,0.0002347969,NO


observations,variables
4305,9


## Build Regression Model

_<< Discuss this model construction. >>_

In [5]:
# Construct a linear regression model to predict growth given PC1, PC2, and PC3.
# Present a brief summary of the model parameters.

model = lm(growth ~ PC1 + PC2 + PC3, data)
model


Call:
lm(formula = growth ~ PC1 + PC2 + PC3, data = data)

Coefficients:
(Intercept)          PC1          PC2          PC3  
  -0.118589     0.001091    -0.001686    -0.001792  
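
To see what these coefficients mean, a prediction can be reproduced by hand. The sketch below copies the rounded coefficients printed above and the PC values of the first data row (AIR) from the earlier output; because the coefficients are rounded, the result only approximates what `predict(model, data)` would return.

```r
# Rounded coefficients copied from the model summary above.
coefs <- c(intercept = -0.118589, PC1 = 0.001091, PC2 = -0.001686, PC3 = -0.001792)

# Principal component values for AIR (first row of the data shown earlier).
air <- c(PC1 = 1.4097638, PC2 = 0.2124544, PC3 = -0.18735809)

# Linear model prediction: intercept + sum of coefficient * predictor value.
predicted_growth <- coefs[["intercept"]] + sum(coefs[names(air)] * air)
predicted_growth   # roughly -0.117, versus an actual growth of about 0.051 for AIR
```

The slope coefficients are small relative to the intercept, so under this model the predictions stay close to the intercept for most companies.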


## Evaluate Regression Model (5-fold cross-validation)

_<< Discuss this model evaluation. >>_

In [6]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).
# Present the first few observation (row) numbers for each of the folds.
#
# You can use the str() function.

set.seed(0)
folds = createFolds(data$growth, k=5)
str(folds)

List of 5
 $ Fold1: int [1:862] 8 11 16 22 30 32 38 40 41 44 ...
 $ Fold2: int [1:860] 3 9 10 23 26 27 34 39 52 64 ...
 $ Fold3: int [1:862] 2 7 19 29 35 42 53 57 61 62 ...
 $ Fold4: int [1:861] 1 4 5 6 15 17 28 33 36 43 ...
 $ Fold5: int [1:860] 12 13 14 18 20 21 24 25 31 37 ...
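
Each fold serves once as the test set, so every row should appear in exactly one fold. The sketch below checks that property on a base-R stand-in for `createFolds()`; it assigns rows to folds purely at random, whereas caret's `createFolds()` additionally balances the folds on the outcome, so the fold sizes and memberships differ from the output above.

```r
set.seed(0)
n <- 4305                                     # number of observations reported above

# Base-R stand-in: give each row a random fold label 1..5, then group row numbers.
fold_id <- sample(rep(1:5, length.out = n))
folds_sketch <- split(seq_len(n), fold_id)

sizes <- lengths(folds_sketch)                # rows per fold
all_rows <- sort(unlist(folds_sketch, use.names = FALSE))

# The folds are disjoint and together cover every row exactly once.
covers_all <- identical(all_rows, seq_len(n))
sizes
covers_all
```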


In [43]:
# Present the model's estimated RMSE and profit for each fold.

set.seed(0)
fold = createFolds(data$growth, k=5)

rmse = c(); profit = c()
portion = budget/portfolio_size   # equal allocation per company

for (i in 1:5)
  { # Hold out fold i for testing; train on the remaining rows.
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.test = data[fold[[i]],]
    model = lm(growth ~ PC1+PC2+PC3, data.train)

    # Predict growth for the held-out rows and compute the fold's RMSE.
    growth.predicted = predict(model, data.test)
    error = data.test$growth - growth.predicted
    rmse[i] = sqrt(mean(error^2))

    # Invest equally in the companies with the highest predicted growth,
    # then compute profit from their actual growth.
    sorted = data.test[order(-growth.predicted),]
    profit[i] = sum(portion*(1+sorted[1:portfolio_size,]$growth)) - budget }

rmse.cv = mean(rmse)
profit.cv = mean(profit)
profit_rate.cv=profit.cv/budget

evaluation=data.frame(fold=1:5, rmse=rmse, profit=profit)
cross_validation = data.frame(rmse.cv=rmse.cv, profit.cv=profit.cv, profit_rate.cv=profit_rate.cv)
last = data.frame(method="linear regression", variables= "PC1, PC2, PC3, growth",
           rmse.cv=rmse.cv, profit.cv=profit.cv, profit_rate.cv=profit_rate.cv)
evaluation

fold,rmse,profit
1,0.4444572,-112168.03
2,0.4358929,-159109.52
3,0.5040095,-68570.91
4,0.3991218,-81948.94
5,0.5459158,-14433.04


In [37]:
# Present the model's 5-fold cross-validation estimated RMSE, profit, and profit rate.
fmt(cross_validation, "5-Fold Cross-Validation Estimated Performance")

rmse.cv,profit.cv,profit_rate.cv
0.4658794,-87246.09,-0.0872461
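
The per-fold procedure above can be wrapped in a reusable function. This is a minimal sketch under stated assumptions: `cv_evaluate` is a hypothetical helper, the `toy` data frame is randomly generated for illustration, and the folds come from `sample()` rather than caret's `createFolds()`, so its numbers are not comparable to the results above.

```r
# Hypothetical helper: k-fold cross-validation of an lm() formula.
# Returns mean RMSE, mean profit, and profit rate across the folds.
cv_evaluate <- function(formula, data, folds, budget = 1000000, portfolio_size = 12) {
  allocation <- budget / portfolio_size
  per_fold <- sapply(folds, function(test_rows) {
    train <- data[-test_rows, ]
    test  <- data[test_rows, ]
    model <- lm(formula, data = train)
    predicted <- predict(model, newdata = test)
    # Rank held-out companies by predicted growth and keep the top ones.
    top <- order(predicted, decreasing = TRUE)[seq_len(portfolio_size)]
    c(rmse = sqrt(mean((test$growth - predicted)^2)),
      profit = sum(allocation * (1 + test$growth[top])) - budget)
  })
  profit.cv <- mean(per_fold["profit", ])
  c(rmse.cv = mean(per_fold["rmse", ]),
    profit.cv = profit.cv,
    profit_rate.cv = profit.cv / budget)
}

# Toy data, randomly generated for illustration only (not the project data).
set.seed(0)
n <- 200
toy <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n), PC3 = rnorm(n))
toy$growth <- 0.05 * toy$PC1 + rnorm(n, sd = 0.4)
toy_folds <- split(seq_len(n), sample(rep(1:5, length.out = n)))

cv_result <- cv_evaluate(growth ~ PC1 + PC2 + PC3, toy, toy_folds)
cv_result
```

Note that RMSE is computed from predicted versus actual growth, while profit uses the predictions only to rank companies; the two measures can therefore disagree about which model is best.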


## Tune Regression Model

_<< Discuss this model tuning. >>_

In [47]:
#PC1, growth 

set.seed(0)
fold = createFolds(data$growth, k=5)

rmse = c(); profit = c()
portion = budget/portfolio_size   # equal allocation per company

for (i in 1:5)
  { # Hold out fold i for testing; train on the remaining rows.
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.test = data[fold[[i]],]
    model = lm(growth ~ PC1, data.train)

    # Predict growth for the held-out rows and compute the fold's RMSE.
    growth.predicted = predict(model, data.test)
    error = data.test$growth - growth.predicted
    rmse[i] = sqrt(mean(error^2))

    # Profit from investing equally in the top predicted-growth companies.
    sorted = data.test[order(-growth.predicted),]
    profit[i] = sum(portion*(1+sorted[1:portfolio_size,]$growth)) - budget }


rmse.cv = mean(rmse)
profit.cv = mean(profit)
profit_rate.cv=profit.cv/budget

start = data.frame(method="linear regression", variables= "PC1, growth",
           rmse.cv=rmse.cv, profit.cv=profit.cv, profit_rate.cv=profit_rate.cv)

In [48]:
#PC2, growth 

set.seed(0)
fold = createFolds(data$growth, k=5)

rmse = c(); profit = c()
portion = budget/portfolio_size   # equal allocation per company

for (i in 1:5)
  { # Hold out fold i for testing; train on the remaining rows.
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.test = data[fold[[i]],]
    model = lm(growth ~ PC2, data.train)

    # Predict growth for the held-out rows and compute the fold's RMSE.
    growth.predicted = predict(model, data.test)
    error = data.test$growth - growth.predicted
    rmse[i] = sqrt(mean(error^2))

    # Profit from investing equally in the top predicted-growth companies.
    sorted = data.test[order(-growth.predicted),]
    profit[i] = sum(portion*(1+sorted[1:portfolio_size,]$growth)) - budget }


rmse.cv = mean(rmse)
profit.cv = mean(profit)
profit_rate.cv=profit.cv/budget

new = data.frame(method="linear regression", variables= "PC2, growth",
           rmse.cv=rmse.cv, profit.cv=profit.cv, profit_rate.cv=profit_rate.cv)
start=rbind(start,new)

In [49]:
#PC3, growth 

set.seed(0)
fold = createFolds(data$growth, k=5)

rmse = c(); profit = c()
portion = budget/portfolio_size   # equal allocation per company

for (i in 1:5)
  { # Hold out fold i for testing; train on the remaining rows.
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.test = data[fold[[i]],]
    model = lm(growth ~ PC3, data.train)

    # Predict growth for the held-out rows and compute the fold's RMSE.
    growth.predicted = predict(model, data.test)
    error = data.test$growth - growth.predicted
    rmse[i] = sqrt(mean(error^2))

    # Profit from investing equally in the top predicted-growth companies.
    sorted = data.test[order(-growth.predicted),]
    profit[i] = sum(portion*(1+sorted[1:portfolio_size,]$growth)) - budget }


rmse.cv = mean(rmse)
profit.cv = mean(profit)
profit_rate.cv=profit.cv/budget

new = data.frame(method="linear regression", variables= "PC3, growth",
           rmse.cv=rmse.cv, profit.cv=profit.cv, profit_rate.cv=profit_rate.cv)
start=rbind(start,new)

In [50]:
#PC1, PC2, growth 

set.seed(0)
fold = createFolds(data$growth, k=5)

rmse = c(); profit = c()
portion = budget/portfolio_size   # equal allocation per company

for (i in 1:5)
  { # Hold out fold i for testing; train on the remaining rows.
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.test = data[fold[[i]],]
    model = lm(growth ~ PC1 + PC2, data.train)

    # Predict growth for the held-out rows and compute the fold's RMSE.
    growth.predicted = predict(model, data.test)
    error = data.test$growth - growth.predicted
    rmse[i] = sqrt(mean(error^2))

    # Profit from investing equally in the top predicted-growth companies.
    sorted = data.test[order(-growth.predicted),]
    profit[i] = sum(portion*(1+sorted[1:portfolio_size,]$growth)) - budget }


rmse.cv = mean(rmse)
profit.cv = mean(profit)
profit_rate.cv=profit.cv/budget

new = data.frame(method="linear regression", variables= "PC1, PC2, growth",
           rmse.cv=rmse.cv, profit.cv=profit.cv, profit_rate.cv=profit_rate.cv)
start=rbind(start,new)

In [51]:
#PC1, PC3, growth 

set.seed(0)
fold = createFolds(data$growth, k=5)

rmse = c(); profit = c()
portion = budget/portfolio_size   # equal allocation per company

for (i in 1:5)
  { # Hold out fold i for testing; train on the remaining rows.
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.test = data[fold[[i]],]
    model = lm(growth ~ PC1 + PC3, data.train)

    # Predict growth for the held-out rows and compute the fold's RMSE.
    growth.predicted = predict(model, data.test)
    error = data.test$growth - growth.predicted
    rmse[i] = sqrt(mean(error^2))

    # Profit from investing equally in the top predicted-growth companies.
    sorted = data.test[order(-growth.predicted),]
    profit[i] = sum(portion*(1+sorted[1:portfolio_size,]$growth)) - budget }


rmse.cv = mean(rmse)
profit.cv = mean(profit)
profit_rate.cv=profit.cv/budget

new = data.frame(method="linear regression", variables= "PC1, PC3, growth",
           rmse.cv=rmse.cv, profit.cv=profit.cv, profit_rate.cv=profit_rate.cv)
start=rbind(start,new)

In [52]:
#PC2, PC3, growth 

set.seed(0)
fold = createFolds(data$growth, k=5)

rmse = c(); profit = c()
portion = budget/portfolio_size   # equal allocation per company

for (i in 1:5)
  { # Hold out fold i for testing; train on the remaining rows.
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.test = data[fold[[i]],]
    model = lm(growth ~ PC2 + PC3, data.train)

    # Predict growth for the held-out rows and compute the fold's RMSE.
    growth.predicted = predict(model, data.test)
    error = data.test$growth - growth.predicted
    rmse[i] = sqrt(mean(error^2))

    # Profit from investing equally in the top predicted-growth companies.
    sorted = data.test[order(-growth.predicted),]
    profit[i] = sum(portion*(1+sorted[1:portfolio_size,]$growth)) - budget }


rmse.cv = mean(rmse)
profit.cv = mean(profit)
profit_rate.cv=profit.cv/budget

new = data.frame(method="linear regression", variables= "PC2, PC3, growth",
           rmse.cv=rmse.cv, profit.cv=profit.cv, profit_rate.cv=profit_rate.cv)
start=rbind(start,new)
final = rbind(start, last)

In [55]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).

# Construct several linear regression models to predict growth.
# Iterate through unique combinations of predictor variables, selected from PC1, PC2, PC3.

# Estimate each model's RMSE and profit, using 5-fold cross validation.

# Present the best model: selected variables, RMSE, profit, and profit rate.
# Present all the models: selected variables, RMSE, profit, and profit rate.

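# Select the best model by highest cross-validated profit rather than by a hard-coded row number.
fmt(final[which.max(final$profit.cv),], "best model")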
fmt(final, "search for best model")

method,variables,rmse.cv,profit.cv,profit_rate.cv
linear regression,"PC1, PC2, growth",0.465881,-51469.51,-0.0514695


method,variables,rmse.cv,profit.cv,profit_rate.cv
linear regression,"PC1, growth",0.4659377,-288146.13,-0.2881461
linear regression,"PC2, growth",0.465904,-70483.33,-0.0704833
linear regression,"PC3, growth",0.4659627,-111427.63,-0.1114276
linear regression,"PC1, PC2, growth",0.465881,-51469.51,-0.0514695
linear regression,"PC1, PC3, growth",0.4659351,-75214.42,-0.0752144
linear regression,"PC2, PC3, growth",0.4659102,-93628.28,-0.0936283
linear regression,"PC1, PC2, PC3, growth",0.4658794,-87246.09,-0.0872461
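
The seven hand-written tuning cells could be condensed into a single loop over predictor subsets. The sketch below is one possible way to do that, not the notebook's own method: it runs on randomly generated `toy` data, builds folds with `sample()` instead of caret's `createFolds()`, and uses variable names chosen for illustration, so its numbers will not match the table above.

```r
set.seed(0)

# Toy data, randomly generated for illustration only (not the project data).
n <- 200
toy <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n), PC3 = rnorm(n))
toy$growth <- 0.05 * toy$PC1 + rnorm(n, sd = 0.4)

budget <- 1000000
portfolio_size <- 12
allocation <- budget / portfolio_size

# Base-R stand-in for createFolds(): random, unstratified 5-fold partition.
folds <- split(seq_len(n), sample(rep(1:5, length.out = n)))

# Enumerate every non-empty subset of the predictors (2^3 - 1 = 7 subsets).
predictors <- c("PC1", "PC2", "PC3")
subsets <- unlist(lapply(seq_along(predictors),
                         function(m) combn(predictors, m, simplify = FALSE)),
                  recursive = FALSE)

# Cross-validate one lm() per subset and collect one result row per model.
results <- do.call(rbind, lapply(subsets, function(vars) {
  f <- reformulate(vars, response = "growth")     # e.g. growth ~ PC1 + PC2
  per_fold <- sapply(folds, function(test_rows) {
    model <- lm(f, data = toy[-test_rows, ])
    test <- toy[test_rows, ]
    predicted <- predict(model, newdata = test)
    top <- order(predicted, decreasing = TRUE)[seq_len(portfolio_size)]
    c(rmse = sqrt(mean((test$growth - predicted)^2)),
      profit = sum(allocation * (1 + test$growth[top])) - budget)
  })
  data.frame(variables = paste(vars, collapse = ", "),
             rmse.cv = mean(per_fold["rmse", ]),
             profit.cv = mean(per_fold["profit", ]),
             stringsAsFactors = FALSE)
}))
results$profit_rate.cv <- results$profit.cv / budget

# Best model = highest cross-validated profit.
best <- results[which.max(results$profit.cv), ]
results
best
```

Generating the formulas with `combn()` and `reformulate()` keeps the search exhaustive when predictors are added, since the number of subsets grows as 2^p - 1 and quickly becomes impractical to write out by hand.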


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised April 9, 2021
</span>
</p>
</font>