**Fin 585**  
**Diether**  
**Problem Set**  

**Testing the CAPM using Analyst Disagreement Portfolios**

The primary purpose of this problem set is to give you a portfolio formation task that makes you go through all five steps of our portfolio formation framework including testing the CAPM as a model.

1. Data Preparation.

2. Create portfolio formation or criterion variable.

3. Bin the data based on the formation variable.

4. Portfolio creation using the bins.

5. Test the historical performance and test a model.

A secondary goal is to introduce another interesting portfolio strategy. It produces a large spread in average return. Given that, it's a good set of portfolios for testing the CAPM.

To accomplish the programming tasks, you should be able to adapt a lot of code we've used before and apply it to this situation. <br><br>

**Overview**

In this problem set you reproduce another important empirical result in academic finance. Specifically, you reproduce the **dispersion effect** (or the analyst disagreement effect) of Diether, Malloy, and Scherbina (2002). This empirical result spawned a large literature in academic finance, and certainly some quant funds have traded on this effect.

Dispersion (or analyst disagreement) portfolios are formed based on the standard deviation of analyst EPS (earnings per share) forecasts over a given period. Here the standard deviation of analyst EPS forecasts is the standard deviation across analysts for a given stock and month (most stocks have between 3 and 13 analysts covering them). Diether, Malloy, and Scherbina (DMS) don't use the raw standard deviation. Instead, they scale the standard deviation of analyst forecasts by the absolute value of the mean forecast. Therefore, for a given month $t$, dispersion for stock $i$ is defined as follows:
\begin{align*}
disp_{it} &= \frac{stdev_{it}}{|mean_{it}|}
\end{align*}
DMS form dispersion portfolios using $disp_{i,t-1}$; in other words, they lag dispersion one month. In this homework you will do the same.
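
The definition and the one-month lag can be illustrated on a small made-up panel. This is only a sketch: the `toy` dataframe, its values, and the `disp_lag` column name are hypothetical, while `cusip`, `mdt`, `meanest`, and `stdev` mirror the variable names used in this problem set. It also shows that a mean forecast of exactly zero produces an infinite dispersion, a case you need to decide how to handle.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: two stocks, three months each.
toy = pd.DataFrame({
    "cusip": ["AAA"] * 3 + ["BBB"] * 3,
    "mdt": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"] * 2),
    "meanest": [2.00, 2.10, 2.20, -0.50, -0.40, 0.00],
    "stdev": [0.10, 0.21, 0.11, 0.05, 0.10, 0.02],
})

# disp = stdev / |mean|; the absolute value keeps dispersion positive
# when the mean forecast is negative.
toy["disp"] = toy["stdev"] / toy["meanest"].abs()

# Lag within each stock so that month t uses dispersion from month t-1.
toy = toy.sort_values(["cusip", "mdt"])
toy["disp_lag"] = toy.groupby("cusip")["disp"].shift(1)

print(toy)
```

The first month of each stock has no lagged value, so `disp_lag` is missing there and those rows drop out of portfolio formation.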

There are three datasets for this problem set. The first is the CRSP data (security prices and returns) during the period from January of 1980 to September of 2024. The second is the analyst earnings per share data from IBES. It also covers the period of January of 1980 to September 2024. The frequency for both datasets is monthly. The stock level identifier in the IBES data is called a CUSIP. Consequently, I also included CUSIPs in the CRSP data. The CUSIP and the calendar month uniquely identify the analyst earnings per share observations.

You can download the CRSP data directly using the following link: [the CRSP data](https://diether.org/prephd/08-mstk_80-24.csv). There is also a link on *Learning Suite*. The data contain the following variables:

|Variable | Description                                              |
|---------|----------------------------------------------------------|
|permno   | stock identifier                                         |
|cusip    | stock identifier also in IBES data                       |
|caldt    | calendar date (the day is not truncated to 1)            |
|ret      | monthly return                                           |
|prc      | stock price (not lagged, contemporaneous with returns)   |   


You can download the IBES data directly using the following link: [the IBES data](https://diether.org/prephd/08-ibes_eps_analyst.csv). There is also a link on *Learning Suite*. The data contain the following variables:

|Variable | Description                                          |
|---------|------------------------------------------------------|
|cusip    | stock identifier also in CRSP data                   |
|caldt    | calendar date (the day is not truncated to 1)        |
|meanest  | average analyst forecast for that month/stock        |
|stdev    | standard deviation of forecasts for that month/stock |


Finally, to test the CAPM you are going to need a proxy for the market portfolio and for the riskfree rate. Data for these can be found at [Ken French's Data Library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). For your convenience I have created a csv file that contains both these variables, and it can be loaded directly into a dataframe from my website (see the code below). The dataframe contains the excess return on a proxy for the market portfolio (`exmkt`), a proxy for the riskfree rate (`rf`), and some other portfolios you can ignore. The returns from Ken French's library are in percent, that is, raw returns multiplied by 100. After forming your portfolios, multiply your portfolio returns by 100 so that they match the units of the market return and riskfree rate.<br><br>


**Tasks**

1. Form quintile-based equal-weight dispersion portfolios where dispersion is lagged one month. Report summary statistics (including a t-test of whether the average return is statistically different from zero for each portfolio). You should exclude low-price stocks from your portfolios (price below $5).

2. Test the CAPM by running a time series CAPM regression for each of the analyst dispersion portfolios:
$$
r_{pt} - r_{ft} = \alpha_p + \beta_{pM}( r_{Mt} - r_{ft}) + \epsilon_{pt}
$$
Consolidate all your regression results into one table using the `Regtable` function in the BYU Finance library: [Regtable Docs](https://fin-library.readthedocs.io/en/latest/regtables.html)

3. Interpret the regression results from question 2. What can you infer? Can you reject that the CAPM holds? Is the market portfolio the tangency portfolio? Explain your answers.

4. Create a spread portfolio that goes 100% long in portfolio 0 and 100% short in portfolio 4. Test the CAPM using this portfolio. Can you reject the CAPM? Explain your answers.

5. Estimate the security market line using the data available for this homework. Specifically, estimate the following line:
$$
E(r_p) = r_f + \beta_{p}\bigl[E(r_M) - r_f\bigr]
$$
You don't need to plot the estimated line, but report your estimates of $r_f$ and $E(r_M) - r_f$ as a line. So something like:
$$
\overline{r}_p = 4\% + \hat{\beta}_p(6\%)
$$

6. Why is the intercept in a time series CAPM regression called an *average abnormal return*? Explain.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [2]:
fac = pd.read_csv('https://diether.org/prephd/08-factors.csv',parse_dates=['caldt'])
fac

Unnamed: 0,caldt,exmkt,smb,hml,umd,rf
0,1927-01-31,-0.06,-0.37,4.54,0.36,0.25
1,1927-02-28,4.18,0.04,2.94,-2.14,0.26
2,1927-03-31,0.13,-1.65,-2.61,3.61,0.30
3,1927-04-30,0.46,0.30,0.81,4.30,0.25
4,1927-05-31,5.44,1.53,4.73,3.00,0.30
...,...,...,...,...,...,...
1168,2024-05-31,4.34,0.78,-1.67,-0.02,0.44
1169,2024-06-28,2.77,-3.06,-3.31,0.90,0.41
1170,2024-07-31,1.24,6.80,5.74,-2.42,0.45
1171,2024-08-30,1.61,-3.55,-1.13,4.79,0.48


In [3]:
stk = pd.read_csv('08-mstk_80-24.csv',parse_dates=['caldt'])
stk

Unnamed: 0,permno,caldt,cusip,ret,prc,me
0,10000,1986-01-31,68391610,,4.37500,16.1000
1,10000,1986-02-28,68391610,-0.257143,3.25000,11.9600
2,10000,1986-03-31,68391610,0.365385,4.43750,16.3300
3,10000,1986-04-30,68391610,-0.098592,4.00000,15.1720
4,10000,1986-05-30,68391610,-0.222656,3.10938,11.7939
...,...,...,...,...,...,...
2741076,93436,2024-05-31,88160R10,-0.028372,178.08000,567932.0000
2741077,93436,2024-06-28,88160R10,0.111186,197.88000,632155.0000
2741078,93436,2024-07-31,88160R10,0.172781,232.07000,741380.0000
2741079,93436,2024-08-30,88160R10,-0.077391,214.11000,684004.0000


In [4]:
ibes = pd.read_csv("08-ibes_eps_analyst.csv",parse_dates=['caldt'])
ibes

Unnamed: 0,cusip,caldt,meanest,stdev
0,00000000,2010-06-17,1.00,0.01
1,00000000,2010-07-15,0.98,0.02
2,00000000,2016-04-14,0.25,0.08
3,00000000,2016-05-19,0.31,0.01
4,00000000,2016-06-16,0.31,0.01
...,...,...,...,...
1827951,ZNPRICES,2024-07-18,1.19,0.05
1827952,ZNPRICES,2024-08-15,1.20,0.06
1827953,ZNPRICES,2024-09-19,1.20,0.06
1827954,ZNPRICES,2024-10-17,1.21,0.05


<br>

**Hint About Merging the two Datasets**

In the datasets I've included the full calendar dates of the observations. Even though the frequency for both is monthly, the timing is not the same. The CRSP data are from the last trading day in the month, and the IBES data tend to be around the middle of the month. Therefore, to merge these dataframes you need to create a new date variable that only preserves uniqueness at the year-month level. Here is a shortcut way to accomplish that:

In [5]:
stk['mdt'] = stk['caldt'].values.astype('datetime64[M]')
stk.head(5)

Unnamed: 0,permno,caldt,cusip,ret,prc,me,mdt
0,10000,1986-01-31,68391610,,4.375,16.1,1986-01-01
1,10000,1986-02-28,68391610,-0.257143,3.25,11.96,1986-02-01
2,10000,1986-03-31,68391610,0.365385,4.4375,16.33,1986-03-01
3,10000,1986-04-30,68391610,-0.098592,4.0,15.172,1986-04-01
4,10000,1986-05-30,68391610,-0.222656,3.10938,11.7939,1986-05-01


In [6]:
ibes['mdt'] = ibes['caldt'].values.astype('datetime64[M]')
ibes.head(5)

Unnamed: 0,cusip,caldt,meanest,stdev,mdt
0,0,2010-06-17,1.0,0.01,2010-06-01
1,0,2010-07-15,0.98,0.02,2010-07-01
2,0,2016-04-14,0.25,0.08,2016-04-01
3,0,2016-05-19,0.31,0.01,2016-05-01
4,0,2016-06-16,0.31,0.01,2016-06-01


What is the code above doing? By default, pandas stores dates with nanosecond precision. NumPy (the library pandas builds its date functionality on) also has datetime types at coarser precision, including monthly. The code above converts the original nanosecond datetime type to a monthly datetime type, which discards all information finer than the month. When pandas converts the result back to its own datetime type, the day is set to 1 for every observation.
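
A minimal sketch of that conversion on three made-up dates (the `dates` series is hypothetical, not the problem set data). It also shows an alternative that, as far as I can tell, gives the same month-start dates through pandas periods; either approach should work for building the merge key.

```python
import pandas as pd

# Hypothetical dates: two month-end (CRSP-style) and one mid-month (IBES-style).
dates = pd.Series(pd.to_datetime(["1986-01-31", "1986-02-28", "2010-06-17"]))

# Cast the underlying NumPy array to monthly precision; the day is discarded.
monthly = dates.values.astype("datetime64[M]")

# Back in pandas, each date is represented as the first day of its month.
month_start = pd.Series(monthly)
labels = month_start.dt.strftime("%Y-%m-%d").tolist()

# Alternative route through monthly periods.
alt_labels = (
    dates.dt.to_period("M").dt.to_timestamp().dt.strftime("%Y-%m-%d").tolist()
)

print(monthly.dtype)
print(labels)
print(alt_labels)
```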

Now you should be able to merge the two datasets.

In [7]:
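# Preparing the data:
# merge the two datasets; an inner join keeps only stock-months present in both
df = pd.merge(stk, ibes, how='inner', on=['mdt', 'cusip'])

# create the dispersion variable: stdev scaled by the absolute mean forecast
df['disp'] = df.stdev / np.abs(df.meanest)

# lag dispersion by one row within each stock
# (a one-month lag as long as the stock has no missing months)
df['disp'] = df.groupby('cusip').disp.shift(1)

# drop rows with missing lagged dispersion (disp == disp is False for NaN)
# and exclude low-price stocks (price below $5); copy to avoid chained-assignment warnings
df = df.query("disp == disp and prc >= 5").copy()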
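
The lag in the cell above is a row-wise `shift(1)`, which equals a one-month lag only when a stock has an observation in every consecutive month. As a hedged alternative (not the method used in this notebook), the sketch below lags by calendar month instead: it moves each dispersion value forward one month and merges it back. The `ibes_toy` data and the `disp_shift` and `disp_cal_lag` names are hypothetical.

```python
import pandas as pd

# Hypothetical stock with a missing month (no March observation).
ibes_toy = pd.DataFrame({
    "cusip": ["AAA", "AAA", "AAA"],
    "mdt": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01"]),
    "disp": [0.05, 0.10, 0.20],
})

# Row-wise lag: April picks up February's value, a two-month lag.
ibes_toy["disp_shift"] = ibes_toy.groupby("cusip")["disp"].shift(1)

# Calendar lag: stamp each value with the following month, then merge back.
lagged = ibes_toy[["cusip", "mdt", "disp"]].copy()
lagged["mdt"] = lagged["mdt"] + pd.DateOffset(months=1)
lagged = lagged.rename(columns={"disp": "disp_cal_lag"})

out = ibes_toy.merge(lagged, how="left", on=["cusip", "mdt"])
print(out)
```

With the calendar lag, April has no lagged dispersion because March is missing, so that stock-month would be dropped rather than sorted on stale information.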

In [8]:
# creating portfolios:
# assign each stock to a quintile (0 = lowest, 4 = highest) of lagged dispersion within each month
df['bins'] = df.groupby('mdt').disp.transform(pd.qcut,5,labels=False)

# equal-weight portfolio returns by month and bin, converted to percent
port = df.groupby(['mdt', 'bins']).ret.mean() * 100

# reshape so each bin becomes a column
port = port.unstack(level='bins')
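
To see what the binning and averaging steps do, here is a small sketch on ten made-up stocks in a single month. The `toy` data and `port_toy` name are hypothetical; the calls are the same `pd.qcut`, `groupby`, and `unstack` operations used in the cell above.

```python
import pandas as pd

# Hypothetical cross-section: ten stocks in one month.
toy = pd.DataFrame({
    "mdt": pd.to_datetime(["2020-01-01"] * 10),
    "disp": [0.01 * k for k in range(1, 11)],
    "ret": [0.010, 0.020, 0.000, 0.010, -0.010,
            0.030, 0.020, 0.000, -0.020, 0.010],
})

# Quintile label within each month: 0 = lowest dispersion, 4 = highest.
toy["bins"] = toy.groupby("mdt")["disp"].transform(
    lambda x: pd.qcut(x, 5, labels=False)
)

# Equal-weight return per month and bin, in percent, one column per bin.
port_toy = (toy.groupby(["mdt", "bins"])["ret"].mean() * 100).unstack(level="bins")

print(toy[["disp", "bins"]])
print(port_toy)
```

Each quintile holds two stocks here, and each portfolio return is the simple average of the returns in that bin, which is what equal weighting means.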


In [9]:
# summary of portfolio performance
from finance_byu.summarize import summary
summary(port).round(3)

bins,0,1,2,3,4
count,537.0,537.0,537.0,537.0,537.0
mean,1.47,1.381,1.477,1.676,1.999
std,4.697,5.052,5.533,6.152,7.122
tstat,7.253,6.332,6.184,6.314,6.505
pval,0.0,0.0,0.0,0.0,0.0
min,-25.513,-25.086,-26.839,-28.074,-31.127
25%,-1.123,-1.728,-1.971,-1.988,-2.168
50%,1.761,1.724,1.905,1.829,1.999
75%,4.267,4.592,4.932,5.237,5.896
max,14.688,16.944,22.169,24.999,30.2
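
The `tstat` row appears consistent with the usual one-sample t-statistic for testing whether the mean return is zero, that is, the mean divided by its standard error. The check below recomputes it for portfolio 0 from the rounded `mean`, `std`, and `count` values in the table. It is a sanity check under that assumption, not a statement of how `summary` is implemented.

```python
import math

# Rounded values for portfolio 0 from the summary table above.
mean, std, n = 1.47, 4.697, 537

# t-statistic for H0: mean return = 0.
std_err = std / math.sqrt(n)
tstat = mean / std_err

print(round(tstat, 2))
```

The result agrees with the reported 7.253 up to rounding of the inputs.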


In [10]:
# merging CAPM data with the portfolios
# create a month-start date (mdt) in fac so it can be merged with port
fac['mdt'] = fac['caldt'].values.astype('datetime64[M]')

# merge the portfolios with the factor data so the CAPM can be tested
CAPM_data = port.merge(fac, how='inner', on='mdt').set_index('mdt')

# drop caldt since mdt is now the date index
CAPM_data = CAPM_data.drop(columns=['caldt'])

In [11]:
# testing CAPM
# finding excess return for each portfolio
for i in range(5):
    CAPM_data[f'exc{i}'] = CAPM_data[i] - CAPM_data['rf']

models = []

# running time series CAPM regression for each dispersion portfolio:
for i in range(5):
    reg = smf.ols(f'exc{i} ~ exmkt', data=CAPM_data).fit()

    models.append(reg)

In [12]:
# getting CAPM testing results
from finance_byu.regtables import Regtable

table = Regtable(models, sig='stat')

table.render()

Unnamed: 0,exc0,exc1,exc2,exc3,exc4
Intercept,0.455,0.305,0.337,0.467,0.688
,(5.52)***,(3.64)***,(3.47)***,(3.87)***,(4.34)***
exmkt,0.952,1.036,1.124,1.221,1.363
,(52.91)***,(56.61)***,(53.02)***,(46.27)***,(39.38)***
obs,537,537,537,537,537
Rsq,0.84,0.86,0.84,0.80,0.74


3: What can you infer? Can you reject that the CAPM holds? Is the market portfolio the tangency portfolio? Explain:

Based on the regressions, all five portfolios earn positive abnormal returns (alphas between 0.305% and 0.688% per month) with t-statistics above 3.4, so each alpha is statistically different from zero at conventional levels. Under the CAPM every alpha should be zero, so we can reject that the CAPM holds. It follows that the market proxy is not the tangency portfolio: when nonzero alphas exist, combining these portfolios with the market yields a higher Sharpe ratio (expected excess return per unit of standard deviation) than the market alone, so the market portfolio is not mean-variance efficient (MVE) and therefore cannot be the tangency portfolio.

In [14]:
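# spread portfolio: 100% long portfolio 0 (lowest dispersion), 100% short portfolio 4 (highest)
# the riskfree rate cancels in the difference, so the spread is already an excess return
CAPM_data['spread'] = CAPM_data[0] - CAPM_data[4]

reg = smf.ols('spread ~ exmkt', data=CAPM_data).fit()

# append the spread regression to the five portfolio regressions
modelsAndSpread = models + [reg]

table2 = Regtable(modelsAndSpread, sig='stat')

table2.render()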


Unnamed: 0,exc0,exc1,exc2,exc3,exc4,spread
Intercept,0.455,0.305,0.337,0.467,0.688,-0.233
,(5.52)***,(3.64)***,(3.47)***,(3.87)***,(4.34)***,(-1.57)
exmkt,0.952,1.036,1.124,1.221,1.363,-0.411
,(52.91)***,(56.61)***,(53.02)***,(46.27)***,(39.38)***,(-12.63)***
obs,537,537,537,537,537,537
Rsq,0.84,0.86,0.84,0.80,0.74,0.23


With the spread portfolio we cannot reject the CAPM. Its alpha is -0.233% per month with a t-statistic of -1.57, which is not statistically different from zero at conventional significance levels, unlike the alphas of the five individual portfolios.
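
Because OLS is linear in the dependent variable and the riskfree rate cancels in the long-short return, the spread regression's coefficients are the differences between the portfolio 0 and portfolio 4 coefficients. As a check using the estimates in the table above:

$$
\begin{aligned}
\hat{\alpha}_{spread} &= \hat{\alpha}_0 - \hat{\alpha}_4 = 0.455 - 0.688 = -0.233 \\
\hat{\beta}_{spread} &= \hat{\beta}_0 - \hat{\beta}_4 = 0.952 - 1.363 = -0.411
\end{aligned}
$$

The t-statistic does not follow from the individual t-statistics in the same way, since it depends on how the residuals of the two portfolios covary.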

In [32]:
# finding E(rp) 
erMeans = CAPM_data[['exc0', 'exc1', 'exc2', 'exc3', 'exc4']].mean()

betas = [model.params['exmkt'] for model in models]

sml = pd.DataFrame({
    'returns': erMeans.values,
    'beta': betas
})

sml_model = smf.ols('returns ~ beta',data=sml).fit()

sml_model.summary()



0,1,2,3
Dep. Variable:,returns,R-squared:,0.827
Model:,OLS,Adj. R-squared:,0.769
Method:,Least Squares,F-statistic:,14.32
Date:,"Thu, 12 Feb 2026",Prob (F-statistic):,0.0323
Time:,03:15:33,Log-Likelihood:,4.8261
No. Observations:,5,AIC:,-5.652
Df Residuals:,3,BIC:,-6.433
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.3308,0.427,-0.775,0.495,-1.689,1.027
beta,1.4066,0.372,3.785,0.032,0.224,2.589

0,1,2,3
Omnibus:,,Durbin-Watson:,1.488
Prob(Omnibus):,,Jarque-Bera (JB):,0.631
Skew:,0.35,Prob(JB):,0.729
Kurtosis:,1.407,Cond. No.,16.1


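5. The returns are already in percent per month, so the estimated line is:
$$
\overline{r}_p - \overline{r}_f = -0.33\% + \hat{\beta}_p(1.41\%)
$$
The regression uses average excess returns as the dependent variable, so the left-hand side is the average excess return and the intercept measures the estimated zero-beta return in excess of the average riskfree rate rather than $r_f$ itself.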
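
The task statement writes the security market line with $r_f$ as the intercept, which suggests regressing average raw returns (rather than average excess returns) on beta. The sketch below does that with the rounded portfolio means from the summary table and the rounded betas from the regression table, so it is only an approximation of what the full data would give. Since every portfolio has the same average riskfree rate subtracted, the slope should match the regression above up to rounding, and only the intercept shifts.

```python
import numpy as np

# Rounded estimates copied from the tables above (percent per month).
betas = np.array([0.952, 1.036, 1.124, 1.221, 1.363])
mean_ret = np.array([1.470, 1.381, 1.477, 1.676, 1.999])

# Fit mean_ret = intercept + slope * beta by least squares.
slope, intercept = np.polyfit(betas, mean_ret, deg=1)

print(f"r_p = {intercept:.2f}% + beta * ({slope:.2f}%)")
```

With only five portfolios the estimates are imprecise, so the line should be read as descriptive rather than as a tight estimate of $r_f$ and the market premium.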

6. Why is the intercept in a time series CAPM regression called an average abnormal return? Explain.

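It is called the average abnormal return because it is the part of the portfolio's average excess return that the model cannot explain. The CAPM predicts an average excess return equal to $\beta_p$ times the average market excess return; the intercept is the portfolio's actual average excess return minus that prediction. It is an average because the regression residuals average to zero over the sample, so the intercept captures the mean gap between realized and predicted returns over the whole period. If the CAPM held exactly, the average abnormal return would be zero.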
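
This follows from a standard property of OLS with an intercept: the fitted line passes through the sample means. Writing sample averages with an overline (notation introduced here for illustration), the estimated intercept of the time series regression is

$$
\hat{\alpha}_p = \overline{(r_{p} - r_{f})} - \hat{\beta}_{pM}\,\overline{(r_{M} - r_{f})}
$$

so $\hat{\alpha}_p$ is the average realized excess return minus the average excess return the CAPM predicts given the portfolio's estimated beta.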