# Voluntary Problem Set

This problem set allows you to play around with concepts from class and to solve some smaller subproblems on your own.

Additional guidance is provided for concepts that did not make it into our class time. 

# Topic: A selection of the financial industry's factor zoo

You will learn about

- the most commonly used factor returns (market, size, value, profitability, investment, short-term reversal, and momentum factors)

- linear factor modeling (the backbone of today's asset management industry)

- principal component analysis of factor returns

- quantifying the information content of factors for daily returns on SP500 constituents

# Background Knowledge (not necessary for solving the Problem Set, yet helpful for context)
$$
\\
$$

**A quant firm's view on FF-5**

https://www.robeco.com/en/insights/2015/10/fama-french-5-factor-model-why-more-is-not-always-better.html


# Background Knowledge (not necessary to solve Problem Set):
$$
\\
$$

**If you are curious what the industry means by the term Factor Investing, feel free to watch some videos from this playlist**

https://www.youtube.com/watch?v=d1fz4LFquv4&list=PLyQSjcv8LwAHcUWCG-zRWbzuczxa0hB3n
$$
\\
$$

**Quant Blog on Fama-French 5-Factor Model**

https://blog.quantinsti.com/fama-french-five-factor-asset-pricing-model/
    

In [361]:
import pandas as pd
import numpy as np

# Tasks [Degree of Difficulty: Beginner]: 
$$
\\
$$

**Notice:** The return data for SP500 constituents is given in "r_SP500_d_cleaned_Dec2020.csv"

$$
\\
$$

## A: Getting Data

**A.1** Get daily returns of the so-called FF5 factors (MKT, size, HML, RMW, CMA) and the risk-free rate

**A.2** Include the short-term reversal and momentum factors to end up with an FF7 factor return matrix

**A.3** Ensure that the daily returns of SP500 constituents (data file is delivered) align with A.1 and A.2

$$
\\
$$
**Hint:**
The merged data for FF7, Rf and SP500 constituents are in **R_d.csv**
$$
\\
$$
 
 

Fama-French 5-factor model

$ r_i = r_f + \beta_{MKT} (r_{MKT} - r_f) + \beta_{SMB} \, r_{SMB} + \beta_{HML} \, r_{HML} + \beta_{RMW} \, r_{RMW} + \beta_{CMA} \, r_{CMA} + \varepsilon_i $
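As a rough illustration of how the loadings in a five-factor regression can be estimated, the following sketch runs an OLS time-series regression on simulated data. Everything in it is an assumption made for illustration: the factor returns, the "true" betas, the small intercept `alpha`, and the noise level are made up and do not come from the problem set data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 1000

# Hypothetical daily factor returns; columns stand in for Mkt-RF, SMB, HML, RMW, CMA.
factors = rng.normal(0.0, 0.01, size=(n_days, 5))
true_betas = np.array([1.1, 0.3, -0.2, 0.1, 0.05])  # made-up loadings

# Simulated excess return of one stock: alpha + factors @ beta + noise.
excess_ret = 0.0001 + factors @ true_betas + rng.normal(0.0, 0.005, n_days)

# OLS with an intercept column: the first coefficient is alpha, the rest are betas.
X = np.column_stack([np.ones(n_days), factors])
coef, *_ = np.linalg.lstsq(X, excess_ret, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]
print(np.round(beta_hat, 2))
```

With real data, `factors` would be replaced by the factor return columns and `excess_ret` by a stock's return minus the risk-free rate.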

In [362]:
df = pd.read_csv("./rf.csv", parse_dates=True, index_col=0)
R_d = pd.read_csv("R_d.csv", parse_dates=True, index_col=0)

## B: Factor Analysis

**B.1** How many principal components are necessary to explain at least 96\% of variations in FF7?

**B.2** How much variance does each principal component of FF7 explain?

**B.3** Does each factor of FF7 span a different risk factor? Defend your answer.  

**B.4** Which of the FF7 factors is most (least) important for explaining the first principal component of US Blue Chip returns?

**B.5** Do the FF7 factors explain variation in principal components of US Blue Chip returns other than the first?
$$
\\
$$
**Hint:**
B.3: look at correlation table; high correlations imply common information

B.4: regress PC1(r) onto FF7 and check point estimates and t-stats. The factor with the largest absolute t-stat together with the largest absolute point estimate (assuming the FF7 features share the same scale) is the most important one. The opposite holds for the least important factor.

B.5: take (say) the first 20 principal components of the SP500 panel. Regress each PC onto the FF7 and record the adj. R2. Plot the adj. R2 against the PC number.
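The B.5 hint can be sketched as follows. This is a minimal NumPy-only version on synthetic data, not the reference solution: the helper name `adj_r2_per_pc` and the simulated `returns` and `factors` arrays are hypothetical stand-ins for the SP500 panel and the FF7 matrix.

```python
import numpy as np

def adj_r2_per_pc(returns, factors, n_pcs=20):
    """Regress each of the first n_pcs principal components of `returns`
    onto `factors` (plus an intercept) and return the adjusted R^2 values."""
    # Standardize the columns, then obtain the PC scores from the SVD.
    z = (returns - returns.mean(axis=0)) / returns.std(axis=0)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_pcs] * s[:n_pcs]

    n, k = factors.shape
    X = np.column_stack([np.ones(n), factors])
    out = []
    for j in range(scores.shape[1]):
        y = scores[:, j]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        # Adjust R^2 for the number of regressors k.
        out.append(1.0 - (1.0 - r2) * (n - 1) / (n - k - 1))
    return np.array(out)

# Synthetic example: 50 "stocks" driven by 7 "factors" plus idiosyncratic noise.
rng = np.random.default_rng(1)
n_days, n_stocks, n_factors = 500, 50, 7
factors = rng.normal(size=(n_days, n_factors))
loadings = rng.normal(size=(n_factors, n_stocks))
returns = factors @ loadings + rng.normal(size=(n_days, n_stocks))

adj_r2 = adj_r2_per_pc(returns, factors, n_pcs=20)
print(np.round(adj_r2, 2))
```

Plotting the result, for example with `matplotlib.pyplot.plot(range(1, 21), adj_r2)`, gives the adj. R2 against the PC number.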



# Tasks [Degree of Difficulty: Advanced]:

$$
\\
$$

**Notice:** The return data for SP500 constituents is given in "r_SP500_d_cleaned_Dec2020.csv"

$$
\\
$$

## A: Getting Data

**A.1** Get daily returns of the so-called FF5 factors (MKT, size, HML, RMW, CMA) and the risk-free rate

**A.2** Include the short-term reversal and momentum factors to end up with an FF7 factor return matrix

**A.3** Ensure that the daily returns of SP500 constituents (data file is delivered) align with A.1 and A.2

$$
\\
$$
**Hint:**
H_A_1: Page 12 of https://buildmedia.readthedocs.org/media/pdf/pandas-datareader/stable/pandas-datareader.pdf

H_A_3: The aligned data consists of 5241 data points, starts on Jan 4th 2000 and ends October 30th 2020.
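One way to approach the alignment in A.3 is to keep only the dates that appear in both data sets. The sketch below shows the idea on toy data; the frames `factors` and `stocks`, their column names, and the date ranges are hypothetical and only stand in for the FF7 matrix and the SP500 return panel.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical stand-ins with overlapping but not identical business-day ranges.
factor_days = pd.bdate_range("2020-01-01", "2020-01-31")
stock_days = pd.bdate_range("2020-01-06", "2020-02-07")
factors = pd.DataFrame(rng.normal(size=(len(factor_days), 2)),
                       index=factor_days, columns=["f1", "f2"])
stocks = pd.DataFrame(rng.normal(size=(len(stock_days), 3)),
                      index=stock_days, columns=["A", "B", "C"])

# Keep only the dates present in both frames.
common = factors.index.intersection(stocks.index)
factors_aligned = factors.loc[common]
stocks_aligned = stocks.loc[common]

print(len(common), common[0].date(), common[-1].date())
```

Comparing the number of rows and the first and last date against H_A_3 is a quick check that the real data is aligned as intended.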

$$
\\
$$
 
 

## B: Factor Analysis

**B.1** How many principal components are necessary to explain at least 96\% of variations in FF7?

**B.2** How much variance does each principal component of FF7 explain?

**B.3** Does each factor of FF7 span a different risk factor? Defend your answer.  

**B.4** Which of the FF7 factors is most (least) important for explaining the first principal component of US Blue Chip returns?

**B.5** Do the FF7 factors explain variation in principal components of US Blue Chip returns other than the first?
$$
\\
$$
**Hint:**
B.3: look at correlation table and/or regress every factor onto all other FF7.

B.4: use regression techniques to answer this question

B.5: use regression techniques to answer this question
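For the B.3 hint, regressing every factor onto all the others yields one R2 per factor: a value near zero suggests the factor carries mostly separate information, a high value suggests it is largely spanned by the rest. The sketch below illustrates this on simulated data; the helper `r2_on_others` and the matrix `F` are hypothetical, and the fourth column is deliberately built to be partly redundant.

```python
import numpy as np

def r2_on_others(F):
    """R^2 from regressing each column of F on all other columns plus an intercept."""
    n, k = F.shape
    r2 = np.empty(k)
    for j in range(k):
        y = F[:, j]
        X = np.column_stack([np.ones(n), np.delete(F, j, axis=1)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2[j] = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return r2

# Simulated factor matrix: the last column is partly a copy of the first one.
rng = np.random.default_rng(3)
F = rng.normal(size=(2000, 4))
F[:, 3] = 0.8 * F[:, 0] + 0.6 * rng.normal(size=2000)

r2 = r2_on_others(F)
print(np.round(r2, 2))
```

The same regression output also provides point estimates and t-stats, which is the kind of evidence B.4 and B.5 ask for.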

 

# Tasks [Degree of Difficulty: Expert]:


$$
\\
$$

**Notice:** The return data for SP500 constituents is given in "r_SP500_d_cleaned_Dec2020.csv"

$$
\\
$$

## A: Getting Data

**A.1** Get daily returns of the so-called FF5 factors (MKT, size, HML, RMW, CMA) and the risk-free rate

**A.2** Include the short-term reversal and momentum factors to end up with an FF7 factor return matrix

**A.3** Ensure that the daily returns of SP500 constituents (data file is delivered) align with A.1 and A.2

$$
\\
$$
**Hint:**
H_A_3: The aligned data consists of 5241 data points, starts on Jan 4th 2000 and ends October 30th 2020.

$$
\\
$$

Fama-French 5-factor model

$ r_i = r_f + \beta_{MKT} (r_{MKT} - r_f) + \beta_{SMB} \, r_{SMB} + \beta_{HML} \, r_{HML} + \beta_{RMW} \, r_{RMW} + \beta_{CMA} \, r_{CMA} + \varepsilon_i $

In [363]:
import pandas_datareader as web
from pandas_datareader.famafrench import get_available_datasets

get_available_datasets()

['F-F_Research_Data_Factors',
 'F-F_Research_Data_Factors_weekly',
 'F-F_Research_Data_Factors_daily',
 'F-F_Research_Data_5_Factors_2x3',
 'F-F_Research_Data_5_Factors_2x3_daily',
 'Portfolios_Formed_on_ME',
 'Portfolios_Formed_on_ME_Wout_Div',
 'Portfolios_Formed_on_ME_Daily',
 'Portfolios_Formed_on_BE-ME',
 'Portfolios_Formed_on_BE-ME_Wout_Div',
 'Portfolios_Formed_on_BE-ME_Daily',
 'Portfolios_Formed_on_OP',
 'Portfolios_Formed_on_OP_Wout_Div',
 'Portfolios_Formed_on_OP_Daily',
 'Portfolios_Formed_on_INV',
 'Portfolios_Formed_on_INV_Wout_Div',
 'Portfolios_Formed_on_INV_Daily',
 '6_Portfolios_2x3',
 '6_Portfolios_2x3_Wout_Div',
 '6_Portfolios_2x3_weekly',
 '6_Portfolios_2x3_daily',
 '25_Portfolios_5x5',
 '25_Portfolios_5x5_Wout_Div',
 '25_Portfolios_5x5_Daily',
 '100_Portfolios_10x10',
 '100_Portfolios_10x10_Wout_Div',
 '100_Portfolios_10x10_Daily',
 '6_Portfolios_ME_OP_2x3',
 '6_Portfolios_ME_OP_2x3_Wout_Div',
 '6_Portfolios_ME_OP_2x3_daily',
 '25_Portfolios_ME_OP_5x5',
 '25_Portf

In [364]:
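# df still holds the contents of rf.csv loaded earlier; its first date sets the download window.
# DataReader returns a dict of tables; [0] is the table of daily returns.
ff_moment = web.DataReader('F-F_Momentum_Factor_daily', 'famafrench', start=df.index[0], end="2020-10-30")[0]
ff_factors = web.DataReader('F-F_Research_Data_5_Factors_2x3_daily', 'famafrench', start=df.index[0], end="2020-10-30")[0]
ff_reversal = web.DataReader('F-F_ST_Reversal_Factor_daily', 'famafrench', start=df.index[0], end=df.index[-1])[0]
# From here on, df holds the daily returns of the SP500 constituents.
df = pd.read_csv("./r_SP500_d_cleaned_Dec2020.csv", parse_dates=True, index_col=0)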

In [420]:
# Merge the five factors (plus RF) with momentum and short-term reversal on the date index.
ff_wm_factors = ff_factors.merge(ff_moment, left_index=True, right_index=True)
ff7 = ff_wm_factors.merge(ff_reversal, left_index=True, right_index=True)
ff7.columns = ["ff_" + col.strip() for col in ff7.columns]
ff7 = ff7 / 100  # data comes in percent -> divide by 100 to get decimal returns
ff7.loc["2020-10-30", :]  # sanity check: raises a KeyError if the last date is missing

# Keep only the dates that appear in both the factor and the stock data.
common_dates = ff7.index.intersection(df.index)
ff7 = ff7.loc[common_dates, :]
df = df.loc[common_dates, :]

from sklearn.preprocessing import StandardScaler

# Standardize each factor to zero mean and unit variance.
sc = StandardScaler()
ff7_scaled = pd.DataFrame(sc.fit_transform(ff7), index=ff7.index, columns=ff7.columns)





## B: Factor Analysis

**B.1** How many principal components are necessary to explain at least 96\% of variations in FF7?

**B.2** How much variance does each principal component of FF7 explain?

**B.3** Does each factor of FF7 span a different risk factor? Defend your answer.

**B.4** Which of the FF7 factors is most (least) important for explaining the first principal component of US Blue Chip returns?

**B.5** Do the FF7 factors explain variation in principal components of US Blue Chip returns other than the first?

In [366]:
from sklearn.decomposition import PCA

In [367]:
ff7_scaled

| Date | ff_Mkt-RF | ff_SMB | ff_HML | ff_RMW | ff_CMA | ff_RF | ff_Mom | ff_ST_Rev |
|---|---|---|---|---|---|---|---|---|
| 2000-01-04 | -3.220815 | 0.549815 | 2.946151 | 0.787943 | 3.435773 | 2.064827 | -1.802406 | 1.584941 |
| 2000-01-05 | -0.091747 | 0.485078 | 0.392327 | 0.768732 | 2.368759 | 2.064827 | -0.504221 | 1.094892 |
| 2000-01-06 | -0.596181 | -0.065188 | 1.745163 | 1.210578 | 2.795564 | 2.064827 | -1.446607 | 0.704853 |
| 2000-01-07 | 2.509243 | -1.554143 | -1.982038 | -1.709448 | -2.278682 | 2.064827 | 0.563175 | -0.715288 |
| 2000-01-10 | 1.366384 | 0.792579 | -2.106279 | -3.822626 | -0.192076 | 2.064827 | 1.909441 | -1.055321 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-10-26 | -1.478941 | -0.744928 | -0.560180 | -0.114960 | 0.092461 | -0.872220 | 1.015136 | 0.694852 |
| 2020-10-27 | -0.202092 | -0.955324 | -3.293461 | -0.556806 | -1.757030 | -0.872220 | 2.640271 | 0.154799 |
| 2020-10-28 | -2.708499 | 0.339419 | 0.751243 | -0.979442 | -0.002384 | -0.872220 | 0.149679 | -0.055222 |
| 2020-10-29 | 0.869831 | 0.209945 | 0.309500 | 0.403729 | 1.325456 | -0.872220 | -1.090809 | -0.115228 |


In [410]:
pca = PCA()

# The risk-free rate is not part of the FF7 factor matrix.
ff7_final = ff7_scaled.drop(columns="ff_RF")
pcs = pca.fit_transform(ff7_final)

cumsum = np.cumsum(pca.explained_variance_ratio_)

# First number of components whose cumulative explained variance reaches 96%.
for x in range(len(cumsum)):
    if cumsum[x] >= 0.96:
        comps = x + 1
        break
print(f"At least {comps} components needed to reach 0.96 explained variance.")

At least 7 components needed to reach 0.96 explained variance.


In [411]:
np.round(pca.explained_variance_ratio_, 2)

array([0.29, 0.23, 0.15, 0.11, 0.09, 0.08, 0.05])

In [412]:
ff7_final.corr()

| | ff_Mkt-RF | ff_SMB | ff_HML | ff_RMW | ff_CMA | ff_Mom | ff_ST_Rev |
|---|---|---|---|---|---|---|---|
| ff_Mkt-RF | 1.0 | 0.138544 | 0.098619 | -0.366388 | -0.262737 | -0.27896 | 0.368801 |
| ff_SMB | 0.138544 | 1.0 | 0.133199 | -0.256692 | 0.041356 | -0.036762 | 0.065884 |
| ff_HML | 0.098619 | 0.133199 | 1.0 | 0.104823 | 0.429976 | -0.420852 | 0.009397 |
| ff_RMW | -0.366388 | -0.256692 | 0.104823 | 1.0 | 0.281752 | 0.106432 | -0.212558 |
| ff_CMA | -0.262737 | 0.041356 | 0.429976 | 0.281752 | 1.0 | 0.061987 | -0.263248 |
| ff_Mom | -0.27896 | -0.036762 | -0.420852 | 0.106432 | 0.061987 | 1.0 | -0.124768 |
| ff_ST_Rev | 0.368801 | 0.065884 | 0.009397 | -0.212558 | -0.263248 | -0.124768 | 1.0 |


### Does each factor span a different risk factor? 

**No.** Several factor pairs are noticeably correlated (absolute values of roughly 0.3 to 0.4). The market factor correlates with RMW (-0.37), ST_Rev (0.37), Mom (-0.28) and CMA (-0.26). Furthermore, the investment factor (CMA) and the value factor (HML) have a correlation of 0.43, while HML and Mom are negatively correlated (-0.42). The factors therefore share common information rather than each spanning a separate risk factor.
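The correlation table and the PCA result are two views of the same information: for standardized data, the explained variance ratios are the eigenvalues of the correlation matrix divided by their sum. The sketch below recomputes them from the correlation values printed above (copied by hand, so small rounding differences are to be expected); the variable names are illustrative only.

```python
import numpy as np

# Correlation matrix of the FF7 factors, in the order
# Mkt-RF, SMB, HML, RMW, CMA, Mom, ST_Rev (values copied from the table above).
corr = np.array([
    [ 1.000000,  0.138544,  0.098619, -0.366388, -0.262737, -0.278960,  0.368801],
    [ 0.138544,  1.000000,  0.133199, -0.256692,  0.041356, -0.036762,  0.065884],
    [ 0.098619,  0.133199,  1.000000,  0.104823,  0.429976, -0.420852,  0.009397],
    [-0.366388, -0.256692,  0.104823,  1.000000,  0.281752,  0.106432, -0.212558],
    [-0.262737,  0.041356,  0.429976,  0.281752,  1.000000,  0.061987, -0.263248],
    [-0.278960, -0.036762, -0.420852,  0.106432,  0.061987,  1.000000, -0.124768],
    [ 0.368801,  0.065884,  0.009397, -0.212558, -0.263248, -0.124768,  1.000000],
])

# Eigenvalues of a symmetric matrix, sorted from largest to smallest.
eigvals = np.linalg.eigvalsh(corr)[::-1]
ratios = eigvals / eigvals.sum()
cum = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches 96%.
n_needed = int(np.searchsorted(cum, 0.96) + 1)
print(np.round(ratios, 2), n_needed)
```

If the factors were uncorrelated, every eigenvalue would equal one and each component would explain the same share; the spread in the ratios is a direct consequence of the off-diagonal correlations.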

In [415]:
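# Principal components of the standardized SP500 constituent returns
pca_df = PCA(4)
df_scaled = StandardScaler().fit_transform(df)
df_pcs = pca_df.fit_transform(df_scaled)

import statsmodels.api as sm

# Regress PC1 onto the standardized FF7 factors.
# Without a sigma argument, GLS is equivalent to OLS; no intercept is included
# because both sides are already centered.
df_pca1 = df_pcs[:, 0]
results = sm.GLS(df_pca1, ff7_final).fit()
results.summary()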

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared (uncentered): | 0.97 |
| Model: | GLS | Adj. R-squared (uncentered): | 0.97 |
| Method: | Least Squares | F-statistic: | 24220.0 |
| Date: | Wed, 13 Jan 2021 | Prob (F-statistic): | 0.0 |
| Time: | 10:37:50 | Log-Likelihood: | -11011.0 |
| No. Observations: | 5241 | AIC: | 22040.0 |
| Df Residuals: | 5234 | BIC: | 22080.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ff_Mkt-RF | -11.2061 | 0.032 | -345.357 | 0.000 | -11.270 | -11.143 |
| ff_SMB | -0.5997 | 0.029 | -20.805 | 0.000 | -0.656 | -0.543 |
| ff_HML | -1.0184 | 0.036 | -28.562 | 0.000 | -1.088 | -0.948 |
| ff_RMW | -0.9061 | 0.031 | -29.039 | 0.000 | -0.967 | -0.845 |
| ff_CMA | -0.8655 | 0.034 | -25.442 | 0.000 | -0.932 | -0.799 |
| ff_Mom | 0.5864 | 0.032 | 18.221 | 0.000 | 0.523 | 0.649 |
| ff_ST_Rev | -0.1168 | 0.030 | -3.888 | 0.000 | -0.176 | -0.058 |

| | | | |
|---|---|---|---|
| Omnibus: | 798.799 | Durbin-Watson: | 1.784 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 11957.263 |
| Skew: | 0.192 | Prob(JB): | 0.0 |
| Kurtosis: | 10.39 | Cond. No. | 2.41 |


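The **most important** factor is Mkt-RF: it has by far the largest coefficient in absolute value (-11.21) and a huge t-stat of about -345. The **least important** factor is ST_Rev, with the smallest coefficient in absolute value (-0.12) and the smallest absolute t-stat (about -3.9).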

In [417]:
df_pca2 = df_pcs[:,1]
results = sm.GLS(df_pca2, ff7_final).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared (uncentered): | 0.441 |
| Model: | GLS | Adj. R-squared (uncentered): | 0.44 |
| Method: | Least Squares | F-statistic: | 589.5 |
| Date: | Wed, 13 Jan 2021 | Prob (F-statistic): | 0.0 |
| Time: | 10:40:10 | Log-Likelihood: | -12801.0 |
| No. Observations: | 5241 | AIC: | 25620.0 |
| Df Residuals: | 5234 | BIC: | 25660.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ff_Mkt-RF | -0.4158 | 0.046 | -9.107 | 0.000 | -0.505 | -0.326 |
| ff_SMB | 0.6029 | 0.041 | 14.865 | 0.000 | 0.523 | 0.682 |
| ff_HML | -1.0310 | 0.050 | -20.550 | 0.000 | -1.129 | -0.933 |
| ff_RMW | -1.4699 | 0.044 | -33.481 | 0.000 | -1.556 | -1.384 |
| ff_CMA | -0.8829 | 0.048 | -18.446 | 0.000 | -0.977 | -0.789 |
| ff_Mom | -0.6021 | 0.045 | -13.298 | 0.000 | -0.691 | -0.513 |
| ff_ST_Rev | 0.1198 | 0.042 | 2.834 | 0.005 | 0.037 | 0.203 |

| | | | |
|---|---|---|---|
| Omnibus: | 444.521 | Durbin-Watson: | 1.943 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2850.658 |
| Skew: | 0.023 | Prob(JB): | 0.0 |
| Kurtosis: | 6.613 | Cond. No. | 2.41 |


In [419]:
df_pca3 = df_pcs[:,2]
results = sm.GLS(df_pca3, ff7_final).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared (uncentered): | 0.52 |
| Model: | GLS | Adj. R-squared (uncentered): | 0.52 |
| Method: | Least Squares | F-statistic: | 811.5 |
| Date: | Wed, 13 Jan 2021 | Prob (F-statistic): | 0.0 |
| Time: | 10:40:22 | Log-Likelihood: | -11757.0 |
| No. Observations: | 5241 | AIC: | 23530.0 |
| Df Residuals: | 5234 | BIC: | 23570.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ff_Mkt-RF | -1.0009 | 0.037 | -26.754 | 0.000 | -1.074 | -0.928 |
| ff_SMB | 0.5704 | 0.033 | 17.162 | 0.000 | 0.505 | 0.636 |
| ff_HML | 2.3069 | 0.041 | 56.113 | 0.000 | 2.226 | 2.387 |
| ff_RMW | 0.0738 | 0.036 | 2.052 | 0.040 | 0.003 | 0.144 |
| ff_CMA | -0.9721 | 0.039 | -24.784 | 0.000 | -1.049 | -0.895 |
| ff_Mom | -0.3973 | 0.037 | -10.707 | 0.000 | -0.470 | -0.325 |
| ff_ST_Rev | 0.0364 | 0.035 | 1.049 | 0.294 | -0.032 | 0.104 |

| | | | |
|---|---|---|---|
| Omnibus: | 800.679 | Durbin-Watson: | 2.151 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 13501.3 |
| Skew: | -0.052 | Prob(JB): | 0.0 |
| Kurtosis: | 10.862 | Cond. No. | 2.41 |


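### Do FF7 factors explain variation in principal components of US Blue Chip returns other than the first?

Yes, to some extent. The FF7 factors explain about 44% of the variation in PC2 and about 52% in PC3 (uncentered R-squared), compared with 97% for PC1. Most coefficients are statistically significant, e.g. RMW for PC2 (t-stat of about -33.5) and HML for PC3 (t-stat of about 56.1), whereas ST_Rev is not significant for PC3 (p = 0.294).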