# Project Part D: Regression

![](banner_project.jpg)

In [8]:
analyst = "Charlie Ellis" # Replace this with your name

In [9]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Construct and evaluate a regressor trained on a transformed dataset about public company fundamentals.  Later, use the regressor along with additional analysis to recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset.

Construct a model that predicts how much a company's stock price will grow over 12 months, given the previous 12 months of company fundamentals data. Build the model with a machine learning model construction method applied to the transformed data.

Evaluate the model's business performance based on a business model and business parameters.

## Business Model

The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio

Fill the portfolio with companies that have the highest predicted growths.
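To make the formulas concrete, here is a minimal sketch on made-up numbers. The helper name `portfolio_profit`, the five candidate companies, and the \$100 allocations are illustrative assumptions, not part of the project code.

```r
# Hypothetical helper: profit and profit rate for a portfolio, following the
# business model above. `growth` and `allocation` are parallel vectors.
portfolio_profit = function(growth, allocation) {
    budget = sum(allocation)                           # budget = sum of allocations
    profit = sum((1 + growth) * allocation) - budget   # value after growth, minus budget
    list(profit = profit, profit_rate = profit / budget)
}

# Made-up candidates: predicted vs. realized growth for five companies.
candidates = data.frame(pred   = c(0.30, 0.05, 0.20, -0.10, 0.25),
                        growth = c(0.10, 0.50, -0.20, 0.00, 0.40))

# Fill a 3-company portfolio with the highest predicted growths.
top3 = candidates[order(-candidates$pred), ][1:3, ]

# Profit is computed from realized growth, not predicted growth.
example = portfolio_profit(top3$growth, allocation = rep(100, 3))
example$profit        # 30
example$profit_rate   # 0.1
```

With equal allocations, the profit rate is simply the mean realized growth of the selected companies, so the portfolio is only as good as the model's ranking of its top picks.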

In [10]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Data

In this step, we use R's read.csv function to load the data frame that we created in the last part of Project B, where we settled on the 5 predictor variables and 3 outcome variables to use for modeling. Two of the predictor variables are the first and second principal components (PC1 and PC2), carried over from the last project. They summarize much of the variation in the underlying fundamentals data, and they are the inputs we use for making predictions.
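Before modeling, it can help to confirm that the loaded data has the columns the regression relies on. The check below is a sketch rather than part of the original notebook: it inlines the first three rows of the data (as presented in the output of the next cell) so that it runs without "My Data.csv", and the names `csv_text`, `sample_data`, and `needed` are assumptions introduced for illustration.

```r
# First three rows of the data, inlined as CSV text so the sketch is self-contained.
csv_text = "gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES"
sample_data = read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Columns used by the regression in this part.
needed = c("PC1", "PC2", "growth")
stopifnot(all(needed %in% names(sample_data)))            # columns are present
stopifnot(all(sapply(sample_data[needed], is.numeric)))   # and numeric
stopifnot(!anyNA(sample_data[needed]))                    # with no missing values
```

The same three `stopifnot` checks can be run against the full data frame once it has been read from the file.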

In [11]:
# Retrieve "My Data.csv"
data = read.csv("My Data.csv", header=TRUE)

# Present a few rows
data[1:6,]

gvkey,tic,conm,PC1,PC2,prccq,growth,big_growth
1004,AIR,AAR CORP,3.4371231,-0.2260719,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-12.0332067,0.8045109,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,3.9532234,-0.7553386,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,3.6561434,-0.7981915,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,2.9282228,-0.71042,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3488491,1.1389605,85.2,0.0002347969,NO


## Regression Model

In this section, we construct a linear regression model that predicts a company's growth from the first two principal components. We then estimate the model's performance in three ways. In-sample: the model is trained and evaluated on the entire dataset. Out-of-sample: the model is trained on 75% of the data and evaluated on the remaining 25%. 5-fold cross-validation: the data is split into 5 folds and, for each fold, the model is trained on the other 4 folds and evaluated on the held-out fold.
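The three evaluation schemes can be sketched side by side on synthetic data. Everything here is an assumption made for illustration: the `toy` data frame, its coefficients, and the helper `rmse_of` are invented, and a plain random fold assignment in base R stands in for `createFolds(...)`, which additionally balances the folds on the outcome.

```r
# Synthetic stand-in for the project data (assumption: same column names).
set.seed(0)
n = 200
toy = data.frame(PC1 = rnorm(n), PC2 = rnorm(n))
toy$growth = 0.05 * toy$PC1 - 0.02 * toy$PC2 + rnorm(n, sd = 0.3)

# Hypothetical helper: fit on `train`, report RMSE of predictions on `dev`.
rmse_of = function(train, dev) {
    model = lm(growth ~ PC1 + PC2, train)
    sqrt(mean((predict(model, dev) - dev$growth)^2))
}

# 1. In-sample: train and evaluate on the same rows.
rmse.in = rmse_of(toy, toy)

# 2. Out-of-sample: train on 75% of the rows, evaluate on the other 25%.
train.rows = sample(1:n, 0.75 * n)
rmse.out = rmse_of(toy[train.rows, ], toy[-train.rows, ])

# 3. 5-fold cross-validation: hold out each fold once, then average the 5 RMSEs.
fold.id = sample(rep(1:5, length.out = n))   # random fold label for every row
rmse.cv = mean(sapply(1:5, function(k) rmse_of(toy[fold.id != k, ], toy[fold.id == k, ])))

round(c(in_sample = rmse.in, out_of_sample = rmse.out, cross_validation = rmse.cv), 3)
```

The in-sample figure tends to be optimistic because the model has already seen the rows it is scored on; the other two estimates score the model on rows held back from training.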

### Build Model

In [12]:
# Construct a linear regression model to predict growth given PC1 and PC2.
# Present a brief summary of the model parameters.
model = lm(growth~PC1+PC2, data)
model


Call:
lm(formula = growth ~ PC1 + PC2, data = data)

Coefficients:
(Intercept)          PC1          PC2  
 -0.1185887    0.0002455    0.0006294  


### In-Sample Estimated Performance

In [13]:
# Present the model's in-sample estimated RMSE, profit, and profit rate.
prob = predict(model, data)
error = prob-data$growth
square_error = error^2
RMSE = sqrt(mean(square_error))


# Get profits
bigs = data
bigs$pred = prob
bigss = bigs[order(-bigs$pred),]
big12 = bigss[1:12,]
profit = sum((big12$growth + 1) * allocation) - budget
profit_rate = profit / budget
result_df = data.frame(RMSE, profit, profit_rate)
fmt(result_df, "In-Sample Estimated Performance")

RMSE,profit,profit_rate
0.468815,-115641.1,-0.1156411


### Out-of-Sample Estimated Performance

In [14]:
# Partition the data into training (75%) and validation (25%)
# (use set.seed(0) and sample(...) to choose training observations).
# How many observations and variables in the training data?
# How many observations and variables in the validation data?
set.seed(0)
train.rows = sample(1:nrow(data), 0.75*nrow(data))   # row numbers of the training observations
dev.rows = setdiff(1:nrow(data), train.rows)         # remaining rows form the validation set
data.train = data[train.rows,]
data.dev = data[dev.rows,]
layout(fmt(size(data.train)), fmt(size(data.dev)))

size(data.train)
observations,variables
3228,8

size(data.dev)
observations,variables
1077,8


In [15]:
# Present the model's out-of-sample estimated RMSE, profit, and profit rate.
data.t = data.dev[, colnames(data.dev)!="big_growth"]
model.t = lm(growth~PC1+PC2, data.train)
prob.t = predict(model.t, data.dev)
error.t = prob.t-data.t$growth
square_error.t = error.t^2
RMSE.t = sqrt(mean(square_error.t))

bigs.t = data.t
bigs.t$pred = prob.t
bigss.t = bigs.t[order(-bigs.t$pred),]
big12.t = bigss.t[1:12,]
profit.t = sum((big12.t$growth + 1) * allocation) - budget
profit_rate.t = profit.t / budget
result_df.t = data.frame(RMSE.t, profit.t, profit_rate.t)
fmt(result_df.t, "Out-of-Sample Estimated Performance")

RMSE.t,profit.t,profit_rate.t
0.5081251,-40986.91,-0.0409869


### 5-Fold Cross-Validation Estimated Performance

In [16]:
# Partition the data into 5 folds (use set.seed(0) and createFolds(...) based on growth).
# Present the first few observation (row) numbers for each of the folds.
#
# You can use the str() function.
set.seed(0)
fold = createFolds(data$growth, k=5)
str(fold)

List of 5
 $ Fold1: int [1:862] 8 11 16 22 30 32 38 40 41 44 ...
 $ Fold2: int [1:860] 3 9 10 23 26 27 34 39 52 64 ...
 $ Fold3: int [1:862] 2 7 19 29 35 42 53 57 61 62 ...
 $ Fold4: int [1:861] 1 4 5 6 15 17 28 33 36 43 ...
 $ Fold5: int [1:860] 12 13 14 18 20 21 24 25 31 37 ...


In [17]:
# Present the model's estimated RMSE and profit for each fold.
rmses = c(NA, NA, NA, NA, NA)
profits = c(NA, NA, NA, NA, NA)
profit_rates = c(NA, NA, NA, NA, NA)

for (i in 1:5)
{ 
    data.train = data[setdiff(1:nrow(data), fold[[i]]),]
    data.dev = data[fold[[i]],]
    model = lm(growth~PC1+PC2, data.train)
    prob = predict(model, data.dev)
    error = prob-data.dev$growth
    square_error = error^2
    RMSE = sqrt(mean(square_error))
    rmses[i] = RMSE
    
    data.temp = data.dev
    data.temp$pred.growth = prob
    bigs.i = data.temp[order(-data.temp$pred.growth),]
    big12.i = bigs.i[1:12,]
    profit.i = sum((big12.i$growth + 1) * allocation) - budget
    
    profits[i] = profit.i
    profit_rates[i] = profit.i / budget
}


tb12 = data.frame(fold=c(1,2,3,4,5),
                  rmse=rmses,
                  profit=profits)
tb12

fold,rmse,profit
1,0.4446211,-68328.2
2,0.435679,-84359.01
3,0.5041439,-83515.95
4,0.3998234,-114715.51
5,0.546036,-6010.14


In [20]:
# Present the model's 5-fold cross-validation estimated RMSE, profit, and profit rate.
df = data.frame(rmse.cv=mean(rmses),
                profit.cv=mean(profits),
                profit_rate.cv=mean(profit_rates))
df

rmse.cv,profit.cv,profit_rate.cv
0.4660607,-71385.76,-0.07138576


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised June 14, 2020
</span>
</p>
</font>