# Project Part E: Deployment

![](banner_project.jpg)

In [1]:
analyst = "Citlalli Villarreal" # Replace this with your name

In [2]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)
options(repr.matrix.max.rows=674)
options(repr.matrix.max.cols=200)
update_geom_defaults("point", list(size=1))                                

## Directions

### Objective

Recommend a portfolio of 12 company investments that maximizes 12-month return of an overall \$1,000,000 investment.

### Approach

Retrieve a transformed dataset about public company fundamentals and use it to reproduce the construction of a selected model.

Retrieve an investment opportunities dataset, comprising fundamentals for some set of public companies over some one-year period.  Transform the representation of the investment opportunities to match the representation expected by the model, leveraging previous analysis.

Use the model to make predictions about the investment opportunities and accordingly recommend a portfolio of 12 company investments.

## Business Model


The business model is ...

$ \begin{align} profit = \left( \sum_{i \in portfolio} (1 + growth_i) \times allocation_i \right) - budget \end{align} $

<br>

$ profit\,rate = profit \div budget $


$ \begin{align} budget = \sum_{i \in portfolio} allocation_i \end{align} $

<br>

Business model parameters include ...

* Budget = \$1,000,000: total investment to allocate across the companies in the portfolio
* Portfolio Size = 12: number of companies in the portfolio
* Allocations = \$1,000,000 $\div$ 12 to each company: investments to allocate to specific companies in the portfolio

Fill the portfolio with companies you predict to have the highest growths.
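As a rough illustration of the business model above, the following sketch evaluates the profit and profit rate formulas for a made-up set of growths. The `growth` values are hypothetical and chosen only to make the arithmetic easy to follow; they are not predictions or results from this analysis.

```r
# Hypothetical 12-month growths for a 12-company portfolio (illustrative values only).
growth = c(0.10, -0.05, 0.20, 0.00, 0.15, -0.10, 0.05, 0.30, -0.20, 0.10, 0.00, 0.05)

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

# profit = sum((1 + growth_i) * allocation_i) - budget
profit = sum((1 + growth) * allocation) - budget
profit_rate = profit / budget

# With equal allocations, the profit rate equals the mean growth of the portfolio.
c(profit=profit, profit_rate=profit_rate, mean_growth=mean(growth))
```

Because every company receives the same allocation, maximizing profit reduces to choosing the 12 companies with the highest growths, which is why the portfolio is filled by predicted growth.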

In [3]:
# Set the business parameters.

budget = 1000000
portfolio_size = 12
allocation = rep(budget/portfolio_size, portfolio_size)

layout(fmt(budget), fmt(portfolio_size), fmt(allocation))

budget
1000000

portfolio_size
12

allocation
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33
83333.33


## Build Model

_<< Below, the training data is retrieved from the CSV file "My Data.csv". Then a linear regression model is constructed to predict growth given PC1, PC2, and PC3. >>_

In [4]:
# Retrieve "My Data.csv".  This is the ORIGINAL model training data.
data = read.csv("My Data.csv", header=TRUE)
data$big_growth = factor(data$big_growth, levels=c("YES","NO"))

# Present a few rows ...
data[1:6,]

gvkey,tic,conm,PC1,PC2,PC3,prccq,growth,big_growth
1004,AIR,AAR CORP,1.4097638,0.2124544,-0.18735809,43.69,0.0507455507,NO
1045,AAL,AMERICAN AIRLINES GROUP INC,-2.8093139,0.2246363,1.43661206,32.11,-0.3828560446,NO
1050,CECE,CECO ENVIRONMENTAL CORP,1.5247216,0.4396434,-0.16785608,6.75,0.3157894737,YES
1062,ASA,ASA GOLD AND PRECIOUS METALS,1.5736687,0.6384403,0.01227541,8.66,-0.2164739518,NO
1072,AVX,AVX CORP,1.2812646,0.4529129,0.09293832,15.25,-0.1184971098,NO
1075,PNW,PINNACLE WEST CAPITAL CORP,0.3697622,-0.4860613,-0.01283639,85.2,0.0002347969,NO


In [5]:
# Construct a linear regression model to predict growth given PC1, PC2 and PC3, based on the
# ORIGINAL model training data.
# Present a brief summary of the model parameters.
model = lm(formula=growth ~ PC1 + PC2 + PC3, data=data)
model


Call:
lm(formula = growth ~ PC1 + PC2 + PC3, data = data)

Coefficients:
(Intercept)          PC1          PC2          PC3  
  -0.118589     0.001091    -0.001686    -0.001792  
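To make the coefficients concrete, here is a minimal sketch that reproduces a single prediction by hand. It assumes the coefficients exactly as printed above (so they are rounded to the displayed precision) and uses the PC values of the first training observation shown earlier; `predict(model, ...)` on the full-precision model would differ slightly in the trailing digits.

```r
# Coefficients as printed in the model summary above (rounded to the displayed precision).
coefs = c(intercept=-0.118589, PC1=0.001091, PC2=-0.001686, PC3=-0.001792)

# First training observation shown above (AAR CORP).
x = c(PC1=1.4097638, PC2=0.2124544, PC3=-0.18735809)

# A linear model's prediction is the intercept plus the weighted sum of the predictors.
pcs = c("PC1", "PC2", "PC3")
growth.predicted = coefs[["intercept"]] + sum(coefs[pcs] * x[pcs])
growth.predicted
```

The slope coefficients are small relative to the intercept, so predictions from this model stay close to the intercept and the ranking of companies depends on small differences.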


## Investment Opportunities

_<< Below, the data is retrieved from the CSV file "Investment Opportunities.csv" and partitioned by calendar quarter. The quarterly datasets are then joined into one dataset with one observation per company. That dataset is filtered, imputed, and projected onto principal components using objects stored by the previous analysis (RDS files). The resulting dataset, `data.filter`, is the input the model uses to make predictions. >>_

### Retrieve Data

In [6]:
# Retrieve "Investment Opportunities.csv"
# Present the dataset size ...

datax = read.csv("Investment Opportunities.csv", header=TRUE)
size(datax)

observations,variables
918,680


### Partition Data by Calendar Quarter 

To partition the dataset by calendar quarter in which information is reported, first add a synthetic variable to indicate such.  Then partition into four new datasets, one for each quarter, and drop the quarter variables. Additionally, filter the observations to include only those with non-missing `prccq`.  Then remove any observations about companies that reported more than once per quarter.  Then change all the variable names (except for the `gvkey`, `tic`, and `conm` variables) by suffixing them with quarter information - e.g., in the Quarter 1 dataset, `prccq` becomes `prccq.q1`, etc.
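The steps just described can be sketched on a toy data frame. This is an illustration under assumptions: the `datadate` column name and all values are hypothetical, base R date handling stands in for the `quarter(mdy(...))` helpers, and columns are selected by name rather than by position. Only Quarter 1 is shown; the other quarters follow the same pattern.

```r
# Toy stand-in for the opportunities data; datadate and all values are hypothetical.
toy = data.frame(gvkey=c(1, 1, 2, 2, 3),
                 datadate=c("01/31/2018", "02/28/2018", "03/31/2018", "06/30/2018", "03/31/2018"),
                 tic=c("AAA", "AAA", "BBB", "BBB", "CCC"),
                 conm=c("A CO", "A CO", "B CO", "B CO", "C CO"),
                 prccq=c(10, 11, 20, 21, NA))

# Synthetic quarter variable: months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
month = as.integer(format(as.Date(toy$datadate, format="%m/%d/%Y"), "%m"))
toy$quarter = (month - 1) %/% 3 + 1

# Quarter 1 partition: keep non-missing prccq, then drop the quarter variable.
toy.q1 = toy[(toy$quarter == 1) & !is.na(toy$prccq), colnames(toy) != "quarter"]

# Keep only the first report per company within the quarter.
toy.q1 = toy.q1[!duplicated(toy.q1$gvkey),]

# Suffix every variable except the company identifiers.
keep = colnames(toy.q1) %in% c("gvkey", "tic", "conm")
colnames(toy.q1)[!keep] = paste0(colnames(toy.q1)[!keep], ".q1")

toy.q1
```

Note that `!duplicated(...)` keeps the first report for a company that reported more than once in a quarter and discards the later ones.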

In [7]:
# Partition the dataset as described.
datax$quarter = quarter(mdy(datax[,2]))

data.current.q1 = datax[(datax$quarter==1) & !is.na(datax$prccq), -ncol(datax)]
data.current.q2 = datax[(datax$quarter==2) & !is.na(datax$prccq), -ncol(datax)]
data.current.q3 = datax[(datax$quarter==3) & !is.na(datax$prccq), -ncol(datax)]
data.current.q4 = datax[(datax$quarter==4) & !is.na(datax$prccq), -ncol(datax)]

data.current.q1 = data.current.q1[!duplicated(data.current.q1$gvkey),]
data.current.q2 = data.current.q2[!duplicated(data.current.q2$gvkey),]
data.current.q3 = data.current.q3[!duplicated(data.current.q3$gvkey),]
data.current.q4 = data.current.q4[!duplicated(data.current.q4$gvkey),]

colnames(data.current.q1)[-c(1, 10, 12)] = paste0(colnames(data.current.q1)[-c(1, 10, 12)], ".q1")
colnames(data.current.q2)[-c(1, 10, 12)] = paste0(colnames(data.current.q2)[-c(1, 10, 12)], ".q2")
colnames(data.current.q3)[-c(1, 10, 12)] = paste0(colnames(data.current.q3)[-c(1, 10, 12)], ".q3")
colnames(data.current.q4)[-c(1, 10, 12)] = paste0(colnames(data.current.q4)[-c(1, 10, 12)], ".q4")

In [8]:
# Present the sizes of the data partitions

layout(fmt(size(data.current.q1)),
       fmt(size(data.current.q2)),
       fmt(size(data.current.q3)),
       fmt(size(data.current.q4)))

size(data.current.q1)
observations,variables
209,680

size(data.current.q2)
observations,variables
221,680

size(data.current.q3)
observations,variables
227,680

size(data.current.q4)
observations,variables
230,680


### Consolidate Data by Company

Consolidate the four quarter datasets into one dataset, with one observation per company that includes variables for all four quarters.  Remove any observations with missing `prccq.q4` values.
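A small sketch of the consolidation, using two hypothetical quarterly partitions instead of four. It shows how `merge(..., all=TRUE)` keeps companies that are missing from one quarter, and how the `prccq.q4` filter then removes those without a Quarter 4 price. All values are made up.

```r
# Toy quarterly partitions (hypothetical values): company B has no Quarter 4 report
# and company C has no Quarter 1 report.
q1 = data.frame(gvkey=c(1, 2), tic=c("AAA", "BBB"), conm=c("A CO", "B CO"), prccq.q1=c(10, 20))
q4 = data.frame(gvkey=c(1, 3), tic=c("AAA", "CCC"), conm=c("A CO", "C CO"), prccq.q4=c(12, 31))

# all=TRUE performs a full outer join: companies absent from one quarter get NA there.
toy = merge(q1, q4, by=c("gvkey", "tic", "conm"), all=TRUE)

# Keep only companies with a Quarter 4 price.
toy = toy[!is.na(toy$prccq.q4),]
toy
```

Company C survives the filter even though its Quarter 1 values are missing, which is why an imputation step is still needed afterward.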

In [9]:
# Consolidate the partitions as described.
# How many observations and variables in the resulting dataset? 

m12 = merge(data.current.q1, data.current.q2, by=c("gvkey", "tic", "conm"), all=TRUE)
m34 = merge(data.current.q3, data.current.q4, by=c("gvkey", "tic", "conm"), all=TRUE)
data.current = merge(m12, m34, by=c("gvkey", "tic", "conm"), all=TRUE, sort=TRUE)

data.current = data.current[!is.na(data.current$prccq.q4),]

size(data.current)

observations,variables
230,2711


### Transform Representation of Data

In [10]:
# Filter the data to include only those variables with at least 95% non-missing values
# in the ORIGINAL model training data.
# How many observations and variables in the resulting dataset? 
#
# You can use the readRDS() function. 

cn = readRDS("My Filter.rds")
data.current.filter=data.current[,cn]
size(data.current.filter)

observations,variables
230,239


In [11]:
# Impute the data using the same imputation values as computed for the ORIGINAL model
# training data. 
# How many observations and variables in the resulting dataset? 
#
# You can use the readRDS() and put_impute() functions.
ml = readRDS("My Imputation.rds")
data.current.imputed=put_impute(data.current.filter, ml)
size(data.current.imputed)

observations,variables
230,239
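The `put_impute()` function comes from `setup.R`, which is not shown here, so its exact behavior is an assumption. The sketch below illustrates the general idea it presumably implements: missing values in new data are replaced with values stored at training time, not with statistics recomputed from the new data. The names `impute.values` and `new` and all values are hypothetical.

```r
# Hypothetical stored imputation values, e.g. column means from the training data.
impute.values = list(a=5, b=100)

# New data with missing values.
new = data.frame(a=c(1, NA, 3), b=c(NA, 200, NA))

# Replace each column's missing values with the stored training-time value,
# rather than recomputing from the new data.
for (v in names(impute.values)) {
  new[is.na(new[[v]]), v] = impute.values[[v]]
}
new
```

Reusing the stored values keeps the new data in the same representation the model was trained on.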


In [12]:
# Compute principal components using the centroids and weight matrix from the analysis
# of the ORIGINAL model training data.  Apply to only the (numeric and integer) variables
# used in the analysis of the ORIGINAL model training data. 
# How many observations and variables in the resulting dataset? 
#
# You can use the readRDS() and predict() functions.
# You can use rownames(pc$rotation) to get the (numeric and integer) variables. 

pc = readRDS("My PC.rds")
data.pc = predict(pc, data.current.imputed)
size(data.pc)

observations,variables
230,151
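The following sketch shows what `predict()` does with a stored `prcomp` object, using the built-in `USArrests` dataset as a stand-in for the training data. Whether the stored object in `My PC.rds` was fit with scaling is not stated, so `scale.=TRUE` here is an assumption; the manual computation works the same way either way.

```r
# Fit a PCA on toy "training" data (a built-in dataset stands in for the original training data).
train = USArrests[1:40,]
new = USArrests[41:50,]
pc.toy = prcomp(train, center=TRUE, scale.=TRUE)

# predict() re-uses the stored centroids, scales, and weight matrix ...
scores.predict = predict(pc.toy, new)

# ... which is equivalent to centering/scaling the new data with the stored values
# and multiplying by the stored rotation (weight) matrix.
vars = rownames(pc.toy$rotation)
scores.manual = scale(new[,vars], center=pc.toy$center, scale=pc.toy$scale) %*% pc.toy$rotation

all.equal(unname(scores.predict), unname(scores.manual))
```

Because `predict()` selects columns by the row names of the rotation matrix, extra non-numeric columns in the new data (such as `tic` or `conm`) are ignored.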


In [13]:
# Combine and filter datasets as necessary to produce a new dataset that includes all investment
# opportunities, but includes only predictor variables stored by previous analysis. 
# How many observations and variables?
# Present the first few observations of the resulting dataset.
#
# You can use the readRDS() function.

prevars = readRDS("My Predictors.rds")
data.filter = cbind(data.current.imputed, data.pc)
data.filter = data.filter[,prevars]
size(data.filter)
data.filter[1:6,]


observations,variables
230,6


gvkey,tic,conm,PC1,PC2,PC3
1004,AIR,AAR CORP,1.419587,0.05796411,-0.2576737
1410,ABM,ABM INDUSTRIES INC,1.0563147,0.07293782,-0.160247
1562,AMSWA,AMERICAN SOFTWARE -CL A,1.6304006,0.32243636,-0.1278981
1618,AXR,AMREP CORP,0.8877064,0.14517578,-0.6410072
1632,ADI,ANALOG DEVICES,-1.6234366,-0.48540835,-0.9770837
1686,APOG,APOGEE ENTERPRISES INC,1.4219415,-0.15294429,-0.3697524


## Apply Model

_<< The linear regression model constructed above is used to predict the growth of each investment opportunity. Then a portfolio of 12 investment opportunities is presented. (Note that the opportunities in the portfolio are those with the highest predicted growth among all investment opportunities.) >>_

### Predict & Make Portfolio Recommendation

In [14]:
# Use the model to predict growths of each investment opportunity.
# Recommend a portfolio of allocations to 12 investment opportunities: gvkey, tic, conm, allocation

outcome.predicted = predict(model, data.filter)
new.data.filter = cbind(data.filter, outcome.predicted)
new.data.filter = new.data.filter[order(-new.data.filter$outcome.predicted),]
new.data.filter = new.data.filter[1:12,c('gvkey', 'tic', 'conm')]
portfolio = cbind(new.data.filter, allocation)
fmt(portfolio, 'portfolio')
    

gvkey,tic,conm,allocation
23809,AZO,AUTOZONE INC,83333.33
180711,AVGO,BROADCOM INC,83333.33
29692,WEBC,WEBCO INDUSTRIES INC,83333.33
3570,CBRL,CRACKER BARREL OLD CTRY STOR,83333.33
178704,ULTA,ULTA BEAUTY INC,83333.33
65430,PLCE,CHILDRENS PLACE INC,83333.33
63172,FDS,FACTSET RESEARCH SYSTEMS INC,83333.33
8551,PVH,PVH CORP,83333.33
1864,REX,REX AMERICAN RESOURCES CORP,83333.33
3504,COO,COOPER COS INC (THE),83333.33
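The selection step can be sketched on a toy dataset: rank by predicted growth, keep the top `k`, and split the budget equally. The tickers, predicted growths, and `k = 3` below are hypothetical and chosen only to keep the example short.

```r
# Toy opportunities with hypothetical predicted growths.
toy = data.frame(gvkey=1:5, tic=c("AAA", "BBB", "CCC", "DDD", "EEE"),
                 outcome.predicted=c(-0.12, -0.10, -0.15, -0.11, -0.13))
k = 3
budget = 1000000

# Rank by predicted growth, highest first, and keep the top k.
top = toy[order(toy$outcome.predicted, decreasing=TRUE),][1:k, c("gvkey", "tic")]

# Equal allocation across the selected companies.
top$allocation = budget / k
top
```

Even when every predicted growth is negative, as in this toy example, the ranking still identifies the companies expected to lose the least.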


### Store Portfolio Recommendation

In [15]:
# Store portfolio recommendation

write.csv(portfolio, paste0(analyst, ".csv"), row.names=FALSE)

### Confirm That Format Is Correct

In [16]:
portfolio.retrieved = read.csv(paste0(analyst, ".csv"), header=TRUE)
opportunities = unique(read.csv("Investment Opportunities.csv", header=TRUE)$gvkey)

columns = all(colnames(portfolio.retrieved) == c("gvkey", "tic", "conm", "allocation"))
companies = all(portfolio.retrieved$gvkey %in% opportunities)
allocations = round(sum(portfolio.retrieved$allocation)) == budget
                         
check = data.frame(analyst, columns, companies, allocations)
fmt(check, "Portfolio Recommendation | Format Check")

analyst,columns,companies,allocations
Citlalli Villarreal,True,True,True


<font size=1;>
<p style="text-align: left;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float: right;">
Document revised April 9, 2021
</span>
</p>
</font>