# 4329 Session 3

Principal Component Analysis and Factor Models
* Factor Models
* PCA
* Statistical Factor Analysis

Statistical learning and model selection
* Lasso
* Ridge

#### Yield data

In [11]:
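# Read the yield data and drop column 11
data = read.table('https://people.orie.cornell.edu/davidr/SDAFE/data/yields.txt', header = TRUE)[, -11]
# First differences: the change in each yield between consecutive observations
ddata = diff(as.ts(data))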

# Principal components: `prcomp`

In [17]:
# scale. = TRUE standardizes each column, so this is PCA on the correlation matrix
pca = prcomp(ddata, scale. = TRUE)

#### Variance explained

In [18]:
summary(pca)

Importance of components:
                          PC1    PC2     PC3     PC4    PC5     PC6    PC7
Standard deviation     2.9174 1.0222 0.54972 0.26048 0.1732 0.14397 0.1097
Proportion of Variance 0.8511 0.1045 0.03022 0.00678 0.0030 0.00207 0.0012
Cumulative Proportion  0.8511 0.9556 0.98584 0.99263 0.9956 0.99770 0.9989
                           PC8     PC9    PC10
Standard deviation     0.07457 0.06546 0.03388
Proportion of Variance 0.00056 0.00043 0.00011
Cumulative Proportion  0.99946 0.99989 1.00000
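
The rows of `summary(pca)` can be reproduced from `pca$sdev` alone. A minimal sketch, using the built-in `USArrests` data as a stand-in (an assumption, so that the block runs without downloading the yield file); `pca_demo`, `var_pc` and `k95` are illustrative names:

```r
# Stand-in data (assumption): built-in USArrests instead of the yield changes
pca_demo = prcomp(USArrests, scale. = TRUE)

# The variance of each component is its squared standard deviation
var_pc = pca_demo$sdev^2
prop_var = var_pc / sum(var_pc)   # "Proportion of Variance" row
cum_prop = cumsum(prop_var)       # "Cumulative Proportion" row

round(rbind(prop_var, cum_prop), 4)

# Smallest number of components that explains at least 95% of the variance
k95 = which(cum_prop >= 0.95)[1]
k95
```

With `scale. = TRUE` the component variances sum to the number of columns, which is why the proportions for the ten yield series above are the squared standard deviations divided by 10 (for PC1, 2.9174^2 / 10 ≈ 0.8511).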

#### The loadings (eigenvectors)

In [19]:
pca$rotation

,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
X1mon,0.08228573,0.93680368,0.27439245,0.1896871,-0.03161511,-0.05632103,0.001891534,-0.007397267,-0.010421,-0.004129093
X2mon,0.30560117,0.28451236,-0.54781293,-0.6239901,0.26831196,0.24354461,-0.044467623,0.02265399,0.04607092,0.01671949
X3mon,0.3297118,0.0221962,-0.44785386,0.137984,-0.34154958,-0.58588083,0.322035237,-0.157287111,-0.27265294,-0.092123746
X4mon,0.33773356,-0.02458943,-0.22350374,0.3119622,-0.32791686,0.04656225,-0.316704566,0.420704863,0.55914367,0.196019249
X5mon,0.33936688,-0.05689953,-0.06518434,0.3166588,-0.11140024,0.48511584,-0.386430969,-0.249138558,-0.41267253,-0.384221125
X5.5mon,0.33966084,-0.07661367,0.03855192,0.2903094,0.27328509,0.24243835,0.328483359,-0.287142909,-0.05233258,0.682966253
X6.5mon,0.33825157,-0.09023546,0.11911901,0.2104617,0.50156639,-0.02435377,0.398417845,0.24731123,0.24538572,-0.535429046
X7.5mon,0.33532358,-0.09867443,0.25506659,-0.1048163,0.36449861,-0.46816927,-0.506922558,0.231615528,-0.3214244,0.189220654
X8.5mon,0.33314095,-0.08662408,0.35049915,-0.2963443,-0.16050384,-0.15354271,-0.117114911,-0.618348759,0.45966918,-0.118912081
X9.5mon,0.32967388,-0.0765784,0.40773713,-0.3614447,-0.45336441,0.23371973,0.327654092,0.394987853,-0.25085202,0.045776647
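
The heading calls the loadings eigenvectors; with `scale. = TRUE` they are the eigenvectors of the correlation matrix. A short check of that claim, again on the built-in `USArrests` data as a stand-in (an assumption; `pca_demo` and `eig` are illustrative names):

```r
# Stand-in data (assumption): built-in USArrests instead of the yield changes
pca_demo = prcomp(USArrests, scale. = TRUE)
eig = eigen(cor(USArrests))

# Eigenvalues of the correlation matrix are the component variances
all.equal(eig$values, pca_demo$sdev^2)

# Eigenvectors match the loadings up to the sign of each column
all.equal(abs(eig$vectors), abs(unname(pca_demo$rotation)))

# The loading matrix is orthonormal: t(V) %*% V is the identity
round(crossprod(pca_demo$rotation), 10)
```

The sign of each column is arbitrary, so two runs or two implementations may report a component with all signs flipped.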


# Factor models

In [70]:
data = read.csv('https://people.orie.cornell.edu/davidr/SDAFE/data/Stock_FX_Bond.csv')
# Select the stock and index price columns (column names that end with "_AC")
data = data[grep('_AC$', colnames(data))]
# Convert to ts
data = as.ts(data)
# Convert prices to percent returns: 100 * (P_t / P_{t-1} - 1)
data = 100 * (data / lag(data, -1) - 1)

In [74]:
head(data)

data.GM_AC,data.F_AC,data.UTX_AC,data.CAT_AC,data.MRK_AC,data.PFE_AC,data.IBM_AC,data.MSFT_AC,data.C_AC,data.XOM_AC,data.S.P_AC
2.4454148,3.8961039,1.092896,2.9585799,1.9900498,2.994012,1.3209393,7.142857,4.0723982,2.4336283,2.3290728
0.341006,1.25,1.081081,0.0,-0.7317073,1.744186,-0.5311444,0.0,-1.3043478,-0.2159827,0.2339506
0.5097706,4.1152263,1.069519,-0.862069,0.2457002,0.5714286,0.4368932,6.666667,1.7621145,-0.2164502,1.0087823
-1.4370245,-0.7905138,1.058201,-0.2898551,1.2254902,1.1363636,-0.5316578,6.25,1.2987013,0.0,0.7637175
0.9433962,1.5936255,-1.308901,2.0348837,-0.4842615,0.0,-0.4859086,0.0,0.0,0.4338395,0.5635883
-0.1699235,1.9607843,-1.061008,0.5698006,-1.216545,0.0,-1.5625,11.764706,0.4273504,1.9438445,0.6068102
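
The return calculation relies on how `lag` aligns time series. A small sketch on made-up prices (the values in `p` are hypothetical) compares it with plain vector indexing:

```r
# Hypothetical prices, for illustration only
p = ts(c(100, 110, 99))

# lag(p, -1) shifts the series one period later, so p / lag(p, -1) is P_t / P_{t-1}
ret_ts = 100 * (p / lag(p, -1) - 1)

# The same calculation with plain vector indexing
ret_vec = 100 * (p[-1] / p[-length(p)] - 1)

as.numeric(ret_ts)   # 10 and -10 (percent), up to floating-point rounding
ret_vec
```

Either way the first observation is lost, since the first price has no predecessor.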


A simple factor model: the market (the last column of `data` is the S&P 500 index) plus two other stocks treated as factors (columns 9 and 10, `C_AC` and `XOM_AC`). The first five stocks are the responses.

In [126]:
Mkt = data[, 11]
factor1 = data[, 9]
factor2 = data[, 10]
stocks = data[, 1:5] 
fit = lm(stocks ~ Mkt + factor1 + factor2)
coef(fit)

,data.GM_AC,data.F_AC,data.UTX_AC,data.CAT_AC,data.MRK_AC
(Intercept),0.002831656,0.009497674,0.03523024,0.04029447,0.02955034
Mkt,1.102119291,1.068293224,0.8255546,0.9151073,0.8424842
factor1,-0.01976985,0.014155008,0.04246978,0.04966791,-0.011874653
factor2,-0.05736795,-0.077216084,0.01982829,-0.01319556,0.008889204


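The covariance matrix implied by the factor model, $\Sigma_R = \beta^T \Sigma_F \beta + \Sigma_\epsilon$, in the same notation as the book. $\Sigma_\epsilon$ has to be a diagonal matrix: `diag(cov(fit$residuals))` on its own returns the residual variances as a vector, and adding a vector to a matrix recycles it down the columns instead of adding it to the diagonal.

In [159]:
sigma_F = cov(cbind(Mkt, factor1, factor2))
beta = coef(fit)[-1, ]
# Inner diag() extracts the residual variances, outer diag() builds the diagonal matrix
sigma_eps = diag(diag(cov(fit$residuals)))

t(beta) %*% sigma_F %*% beta + sigma_eps

The table below was produced with `sigma_eps` as a plain vector, so only its diagonal is valid: every off-diagonal entry in row $i$ is too large by the residual variance of stock $i$.

,data.GM_AC,data.F_AC,data.UTX_AC,data.CAT_AC,data.MRK_AC
data.GM_AC,4.260596,4.255369,4.090335,4.180319,4.018864
data.F_AC,4.440178,4.439106,4.273878,4.36545,4.197752
data.UTX_AC,3.17797,3.176704,3.046767,3.122921,2.975406
data.CAT_AC,3.970169,3.97049,3.825135,3.909462,3.747528
data.MRK_AC,3.277679,3.271756,3.146585,3.216493,3.087967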


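The sample covariance of the five stocks, for comparison. The diagonals agree with the factor-model matrix because OLS with an intercept splits each sample variance exactly into fitted and residual variance.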

In [160]:
cov(stocks)

,data.GM_AC,data.F_AC,data.UTX_AC,data.CAT_AC,data.MRK_AC
data.GM_AC,4.2605957,2.6699572,1.2455745,1.4654812,0.8957578
data.F_AC,2.6699572,4.4391064,1.2740768,1.5316444,0.9780046
data.UTX_AC,1.2455745,1.2740768,3.0467665,1.3875721,0.8154347
data.CAT_AC,1.4654812,1.5316444,1.3875721,3.9094617,0.8230401
data.MRK_AC,0.8957578,0.9780046,0.8154347,0.8230401,3.0879666
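
The model-implied covariance is $\beta^T \Sigma_F \beta + \Sigma_\epsilon$ with $\Sigma_\epsilon$ diagonal. A minimal sketch on simulated data (the factors, loadings and names such as `fit_demo` are made up for illustration) shows the calculation with $\Sigma_\epsilon$ built as a diagonal matrix, and how close it comes to the sample covariance when the residuals really are uncorrelated:

```r
set.seed(1)
n = 500
# Simulated factors and returns (assumption: illustration only, not the stock data)
factors = cbind(f1 = rnorm(n), f2 = rnorm(n))
true_beta = rbind(c(1.0, 0.8, 1.2),    # loadings on f1
                  c(0.3, -0.2, 0.1))   # loadings on f2
returns = factors %*% true_beta + matrix(rnorm(n * 3, sd = 0.5), n, 3)
colnames(returns) = c('A', 'B', 'C')

fit_demo = lm(returns ~ factors)
beta = coef(fit_demo)[-1, ]                      # drop the intercept row
sigma_F = cov(factors)
sigma_eps = diag(diag(cov(resid(fit_demo))))     # diagonal MATRIX of residual variances

sigma_model = t(beta) %*% sigma_F %*% beta + sigma_eps
sigma_model
cov(returns)
```

The diagonals of the two matrices agree exactly. The off-diagonals differ by the sample covariances of the residuals, which are small here by construction but need not be for real stocks in the same industry.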


# Stepwise regression 

In [1]:
data = read.csv('http://www-bcf.usc.edu/~gareth/ISL/Credit.csv')[, -1]

Predict `Rating` with the other variables

In [2]:
head(data)

Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
104.593,7075,514,4,71,11,Male,No,No,Asian,580
148.924,9504,681,3,36,11,Female,No,No,Asian,964
55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331
80.18,8047,569,4,77,10,Male,No,No,Caucasian,1151


The formula `Rating ~ .` regresses `Rating` on every other variable

# Backward stepwise regression: `step`

In [3]:
step(lm(Rating ~ ., data), direction = 'backward')

Start:  AIC=1865.55
Rating ~ Income + Limit + Cards + Age + Education + Gender + 
    Student + Married + Ethnicity + Balance

            Df Sum of Sq    RSS    AIC
- Gender     1         5  39953 1863.6
- Age        1        18  39966 1863.7
- Ethnicity  2       247  40195 1864.0
- Student    1        53  40001 1864.1
<none>                    39948 1865.5
- Education  1       212  40160 1865.7
- Married    1       522  40470 1868.7
- Balance    1       552  40500 1869.0
- Income     1       720  40668 1870.7
- Cards      1     14230  54178 1985.4
- Limit      1    202143 242091 2584.2

Step:  AIC=1863.6
Rating ~ Income + Limit + Cards + Age + Education + Student + 
    Married + Ethnicity + Balance

            Df Sum of Sq    RSS    AIC
- Age        1        18  39971 1861.8
- Ethnicity  2       246  40199 1862.0
- Student    1        51  40004 1862.1
<none>                    39953 1863.6
- Education  1       213  40166 1863.7
- Married    1       523  40476 1866.8
- Balance    1 


Call:
lm(formula = Rating ~ Income + Limit + Cards + Education + Married + 
    Balance, data = data)

Coefficients:
(Intercept)       Income        Limit        Cards    Education   MarriedYes  
   30.57853      0.09803      0.06411      4.67871     -0.24806      2.20414  
    Balance  
    0.00859  
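
`step` ranks the candidate models by AIC, which for a linear model it obtains from `extractAIC` as $n \log(\mathrm{RSS}/n) + 2 \times$ (number of coefficients). A minimal sketch on the built-in `mtcars` data (a stand-in, so the block runs without the Credit file; `full`, `best` and `aic_by_hand` are illustrative names):

```r
# Stand-in data (assumption): built-in mtcars instead of the Credit data
full = lm(mpg ~ ., data = mtcars)

# The AIC that step() reports for a linear model: n * log(RSS / n) + 2 * (number of coefficients)
n = nrow(mtcars)
rss = sum(resid(full)^2)
aic_by_hand = n * log(rss / n) + 2 * length(coef(full))
c(by_hand = aic_by_hand, extractAIC = extractAIC(full)[2])

# trace = 0 suppresses the step-by-step tables and returns only the final model
best = step(full, direction = 'backward', trace = 0)
formula(best)
```

The same formula reproduces the starting value in the output above: the Credit data has 400 rows, the full model has 12 coefficients and RSS = 39948, so 400 log(39948 / 400) + 24 ≈ 1865.55.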


# Ridge regression: `glmnet` (package `glmnet`)

with option
* alpha = 0
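
For reference, the role of `alpha` is easiest to see in the objective that `glmnet` minimizes for a Gaussian response (here $n$ is the number of observations and $x_i$ the $i$-th row of the predictor matrix):

$$
\min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^T \beta \right)^2 + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
$$

With $\alpha = 0$ only the squared $\ell_2$ penalty remains (ridge), and with $\alpha = 1$ only the $\ell_1$ penalty remains (lasso). The intercept $\beta_0$ is not penalized.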

`glmnet` does not take a formula argument. It takes the right-hand-side variables as a matrix `x` and the left-hand-side variable as a vector `y`.

Can construct `x` with `model.matrix` (converts strings/factors to dummies)

In [4]:
# Drop missing observations first
data = data[complete.cases(data), ]
X = model.matrix(Rating ~ ., data)[, -1] # -1 to remove the intercept, glmnet automatically includes one
y = data$Rating

In [5]:
head(X)

,Income,Limit,Cards,Age,Education,GenderFemale,StudentYes,MarriedYes,EthnicityAsian,EthnicityCaucasian,Balance
1,14.891,3606,2,34,11,0,0,1,0,1,333
2,106.025,6645,3,82,15,1,1,1,1,0,903
3,104.593,7075,4,71,11,0,0,0,1,0,580
4,148.924,9504,3,36,11,1,0,0,1,0,964
5,55.882,4897,2,68,16,0,0,1,0,1,331
6,80.18,8047,4,77,10,0,0,0,0,1,1151
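
How `model.matrix` expands factors is easier to see on a tiny example. The data frame `toy` below is hypothetical and only meant as a sketch:

```r
# Hypothetical toy data frame, for illustration only
toy = data.frame(y = c(1, 2, 3, 4),
                 income = c(10, 20, 30, 40),
                 student = factor(c('No', 'Yes', 'No', 'Yes')),
                 region = factor(c('East', 'South', 'West', 'East')))

# Full design matrix: an intercept column plus one dummy per non-baseline factor level
mm = model.matrix(y ~ ., toy)
colnames(mm)

# Dropping the first column removes the intercept, as in the cell above
X_toy = mm[, -1]
X_toy
```

A factor with $k$ levels becomes $k - 1$ dummy columns; the first level (`No`, `East`) is the baseline and gets no column, which is why `X` above has `EthnicityAsian` and `EthnicityCaucasian` but no column for the third ethnicity level.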


Fit ridge with an automatic sequence of $\lambda$'s. The variables are standardized by default
(use `standardize = FALSE` to override).

In [6]:
library(glmnet)
ridge = glmnet(X, y, alpha = 0)

“package ‘glmnet’ was built under R version 3.4.2”
Loading required package: Matrix
Loading required package: foreach
“package ‘foreach’ was built under R version 3.4.3”
Loaded glmnet 2.0-13



Three of the generated $\lambda$'s: the smallest, the middle one, and the largest (`ridge$lambda` is stored in decreasing order, so the last element is the smallest):

In [7]:
ridge$lambda[ c(length(ridge$lambda), length(ridge$lambda)/2, 1) ]    

The corresponding estimated coefficients for the three $\lambda$s

In [8]:
coef(ridge,
    s = ridge$lambda[ c(length(ridge$lambda), length(ridge$lambda)/2, 1) ])

12 x 3 sparse Matrix of class "dgCMatrix"
                              1             2             3
(Intercept)         93.50983985 302.536028622  3.549400e+02
Income               1.05431123   0.267492457  3.509281e-36
Limit                0.03157811   0.005211713  6.749824e-38
Cards                2.22612480   0.466707856  6.067762e-36
Age                  0.07979794   0.066107584  9.346974e-37
Education           -0.14674871  -0.111754970 -1.507039e-36
GenderFemale         1.35391531   0.203892100  2.775319e-36
StudentYes         -41.22965958  -0.869970420 -1.054994e-36
MarriedYes           3.21444116   0.923958155  1.177489e-35
EthnicityAsian      -3.74581195  -0.977058448 -1.289218e-35
EthnicityCaucasian  -1.09045277  -0.031800936 -3.339477e-37
Balance              0.11175267   0.022597378  2.935743e-37

Fit ridge with a given $\lambda$

In [9]:
ridge2 = glmnet(X, y, alpha = 0, lambda = 15.405)

In [10]:
coef(ridge2)

12 x 1 sparse Matrix of class "dgCMatrix"
                             s0
(Intercept)         93.30529052
Income               1.04964193
Limit                0.03169607
Cards                2.23434010
Age                  0.07997622
Education           -0.14788551
GenderFemale         1.34759011
StudentYes         -41.05845028
MarriedYes           3.21555624
EthnicityAsian      -3.74218480
EthnicityCaucasian  -1.08999331
Balance              0.11141128

# Lasso regression: `glmnet` (package `glmnet`)

with option
* alpha = 1

In [11]:
lasso = glmnet(X, y, alpha = 1, lambda = 15.405)

In [12]:
coef(lasso)

12 x 1 sparse Matrix of class "dgCMatrix"
                            s0
(Intercept)        70.13693025
Income              .         
Limit               0.06014086
Cards               .         
Age                 .         
Education           .         
GenderFemale        .         
StudentYes          .         
MarriedYes          .         
EthnicityAsian      .         
EthnicityCaucasian  .         
Balance             .         
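
The contrast with ridge at the same $\lambda$ is that lasso sets coefficients exactly to zero, while ridge only shrinks them. A minimal sketch on simulated data (an assumption: `X_sim` and `y_sim` are made up so that only the first two predictors matter, and $\lambda = 0.5$ is an arbitrary choice):

```r
library(glmnet)

set.seed(1)
n = 200; p = 10
# Simulated data (assumption): only the first two predictors matter
X_sim = matrix(rnorm(n * p), n, p)
y_sim = 3 * X_sim[, 1] - 2 * X_sim[, 2] + rnorm(n)

ridge_sim = glmnet(X_sim, y_sim, alpha = 0, lambda = 0.5)
lasso_sim = glmnet(X_sim, y_sim, alpha = 1, lambda = 0.5)

# Number of nonzero slope coefficients (excluding the intercept)
c(ridge = sum(coef(ridge_sim)[-1, 1] != 0),
  lasso = sum(coef(lasso_sim)[-1, 1] != 0))
```

Ridge keeps every slope nonzero, whereas lasso drops most of the irrelevant predictors, which is what the dots in the sparse-matrix output above represent.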

# Split into train and test sample 

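To split roughly in half, use `sample` to draw each element from `c(TRUE, FALSE)`. Every row lands in the training set with probability 1/2, so the two parts are only approximately equal in size.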

In [13]:
train = sample(c(TRUE, FALSE), size = nrow(X), replace = TRUE)

In [14]:
head(train)

In [15]:
head(!train)

In [16]:
lasso_train = glmnet(X[train, ], y[train], alpha = 1, lambda = 15.405)

The MSE on the test data; `!train` selects the test set.

In [17]:
prediction = predict(lasso_train, newx = X[!train, ])
actual = y[!train]

MSE = mean((prediction - actual)^2)
MSE
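
Because the split is random, the MSE changes from run to run. A small sketch of two common refinements, fixing the seed and forcing an exact half split (`n_obs`, `train_idx` and the seed value are hypothetical choices for illustration):

```r
set.seed(42)                      # fix the random draw so the split can be reproduced
n_obs = 11                        # hypothetical number of rows

# Bernoulli split, as above: each row is TRUE with probability 1/2, sizes vary
train_random = sample(c(TRUE, FALSE), size = n_obs, replace = TRUE)

# Exact split: pick floor(n / 2) row indices without replacement
train_idx = sample(n_obs, size = floor(n_obs / 2))
train_exact = seq_len(n_obs) %in% train_idx

c(random = sum(train_random), exact = sum(train_exact))
```

`train_exact` is a logical vector like `train`, so it can be used in the same way, with `!train_exact` selecting the test rows.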

# Cross validation: `cv.glmnet` (package `glmnet`) 

Default is 10-fold cross validation

In [38]:
cvlasso = cv.glmnet(X, y, alpha = 1)

The optimal $\lambda$

(this changes slightly each time you run the line above, since `cv.glmnet` chooses
the folds at random)

In [43]:
cvlasso$lambda.min

The coefficients for the $\lambda$ with the smallest cross-validated MSE

In [40]:
coef(cvlasso, s = 'lambda.min')

12 x 1 sparse Matrix of class "dgCMatrix"
                              1
(Intercept)        29.676238566
Income              0.062108863
Limit               0.064849927
Cards               4.334895487
Age                 .          
Education          -0.061058121
GenderFemale        .          
StudentYes          .          
MarriedYes          0.938052920
EthnicityAsian     -0.323949591
EthnicityCaucasian  .          
Balance             0.005501346

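The test set MSE with this $\lambda$ (note that `cvlasso` was fit on all observations, so the test rows also influenced the choice of $\lambda$):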

In [41]:
lasso_train_cv = glmnet(X[train, ], y[train], alpha = 1, lambda = cvlasso$lambda.min)

mean((predict(lasso_train_cv, newx = X[!train, ]) - y[!train])^2)

In [42]:
coef(lasso_train_cv)

12 x 1 sparse Matrix of class "dgCMatrix"
                             s0
(Intercept)        28.681425058
Income              0.038026226
Limit               0.065812369
Cards               4.739452889
Age                -0.004087100
Education          -0.069020934
GenderFemale       -0.212908056
StudentYes          .          
MarriedYes          0.145552298
EthnicityAsian      .          
EthnicityCaucasian  .          
Balance             0.002163858
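
A cleaner estimate of out-of-sample error keeps the test rows out of the choice of $\lambda$ as well. A minimal sketch, assuming simulated data in place of `X` and `y` (`X_sim`, `y_sim`, `train_sim` and `cv_train` are illustrative names):

```r
library(glmnet)

set.seed(1)
n = 200; p = 10
# Simulated data (assumption), standing in for X and y above
X_sim = matrix(rnorm(n * p), n, p)
y_sim = 3 * X_sim[, 1] - 2 * X_sim[, 2] + rnorm(n)
train_sim = sample(c(TRUE, FALSE), size = n, replace = TRUE)

# Cross-validate on the training rows only, so the test rows play no part in choosing lambda
cv_train = cv.glmnet(X_sim[train_sim, ], y_sim[train_sim], alpha = 1)

# predict() on a cv.glmnet object accepts s = 'lambda.min' (or the more conservative 'lambda.1se')
pred_test = predict(cv_train, newx = X_sim[!train_sim, ], s = 'lambda.min')
mse_test = mean((pred_test - y_sim[!train_sim])^2)
mse_test
```

Predicting from the `cv.glmnet` object directly also avoids refitting `glmnet` with a single `lambda`.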