# Midterm 2

## FINM 36700 - 2023

### UChicago Financial Mathematics

* Mark Hendricks
* hendricks@uchicago.edu

# Instructions

## Please note the following:

Points
* The exam is 100 points.
* You have 120 minutes to complete the exam.
* For every minute late you submit the exam, you will lose one point.


Submission
* You will upload your solution to the `Midterm 2` assignment on Canvas, where you downloaded this. (Be sure to **submit** on Canvas, not just **save** on Canvas.)
* Your submission should be readable (the graders must be able to understand your answers), and it should **include all code used in your analysis in a file format in which the code can be executed.** 

Rules
* The exam is open-material, closed-communication.
* You do not need to cite material from the course github repo--you are welcome to use the code posted there without citation.

Advice
* If you find any question to be unclear, state your interpretation and proceed. We will only answer questions of interpretation if there is a typo, error, etc.
* The exam will be graded for partial credit.

## Data

**All data files are found in the class github repo, in the `data` folder.**

This exam makes use of the following data files:
* `midterm_2_data.xlsx`

This file has sheets for...
* `info` - names and descriptions of each factor
* `factors (excess returns)` - excess returns on several factors
* `portfolios (excess returns)` - excess returns on industry portfolios
* `risk-free rate` - risk-free rates over time

Note the data is **monthly** so any annualizations should use `12` months in a year.

## Scoring

| Problem | Points |
|---------|--------|
| 1       | 30     |
| 2       | 35     |
| 3       | 20     |
| 4       | 15     |

### Each numbered question is worth 5 points unless otherwise specified.

### Notation
(Hidden LaTeX commands)

$$\newcommand{\betamkt}{\beta^{i,\text{MKT}}}$$
$$\newcommand{\betahml}{\beta^{i,\text{HML}}}$$
$$\newcommand{\betaumd}{\beta^{i,\text{UMD}}}$$
$$\newcommand{\Eri}{E\left[\tilde{r}^{i}\right]}$$
$$\newcommand{\Emkt}{E\left[\tilde{r}^{\text{MKT}}\right]}$$
$$\newcommand{\Ehml}{E\left[\tilde{r}^{\text{HML}}\right]}$$
$$\newcommand{\Eumd}{E\left[\tilde{r}^{\text{UMD}}\right]}$$

# 1. Short Answer

#### No Data Needed

These problems do not require any data file. Rather, analyze them conceptually. 

## 1.

Suppose that we find a set of factors that perfectly hedge any asset. Will these factors work as a linear factor pricing model? 

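Not necessarily. A perfect hedge is not the same as a perfect pricing model. Hedging is about how much of an asset's return variation the factors capture (the time-series r-squared), while pricing is about whether the factor betas explain expected returns, i.e. whether the alphas are zero. 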
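
As a purely statistical illustration of this distinction, the sketch below uses simulated data (the factor, `alpha_true` and `beta_true` are made-up values, not taken from the exam data). The factor replicates all of the asset's variation, so the r-squared is 1, yet the regression still reports a nonzero alpha.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monthly excess returns on a single factor (simulated).
factor = rng.normal(0.005, 0.04, size=240)

# The asset is an exact linear function of the factor: a perfect hedge.
alpha_true, beta_true = 0.002, 1.2
asset = alpha_true + beta_true * factor

# OLS of the asset on a constant and the factor.
X = np.column_stack([np.ones_like(factor), factor])
alpha_hat, beta_hat = np.linalg.lstsq(X, asset, rcond=None)[0]

resid = asset - X @ np.array([alpha_hat, beta_hat])
r_squared = 1 - resid.var() / asset.var()

print(f"r-squared: {r_squared:.4f}")               # hedge quality
print(f"annualized alpha: {12 * alpha_hat:.4f}")   # pricing error
```

The r-squared answers the hedging question and the alpha answers the pricing question; the two statistics are computed from the same regression but measure different things.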

## 2.

If the Fama-French 3-factor model fit perfectly, would the Treynor ratio be equal for every asset?

No. The Treynor ratio measures the return in excess of the risk-free rate that an investment earns per unit of market risk (beta). Even if the Fama-French model perfectly explains expected returns, assets can still have different betas on the size and value factors. The Treynor ratio divides by the market beta only, so it would generally differ across assets.
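
A small numeric sketch of this point follows. The premia and betas are hypothetical numbers chosen for illustration, and the sketch assumes the market beta from the factor model is the one used in the Treynor denominator.

```python
# Hypothetical annualized factor premia and betas (illustrative numbers only).
premia = {"MKT": 0.08, "SMB": 0.02, "HML": 0.03}
assets = {
    "A": {"MKT": 1.0, "SMB": 0.0, "HML": 0.0},
    "B": {"MKT": 1.0, "SMB": 0.5, "HML": 0.5},
}

def expected_excess_return(betas):
    # Perfect fit means zero alpha: the premium is the beta-weighted sum of factor premia.
    return sum(betas[f] * premia[f] for f in premia)

# Treynor ratio: expected excess return per unit of market beta.
treynor = {name: expected_excess_return(b) / b["MKT"] for name, b in assets.items()}
print(treynor)
```

Both assets have the same market beta, but asset B earns extra premium from its size and value exposures, so its Treynor ratio is higher.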

## 3.

Suppose the CAPM fits perfectly. Then assets which have higher time-series r-squared metrics on the market factor will have higher Sharpe ratios.

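True, for assets with positive market beta. If the CAPM fits perfectly, $E\left[\tilde{r}^{i}\right] = \beta^{i,\text{MKT}} E\left[\tilde{r}^{\text{MKT}}\right]$ with $\beta^{i,\text{MKT}} = \rho_{i,m}\sigma_i/\sigma_m$, so the Sharpe ratio of asset $i$ equals $\rho_{i,m}$ times the market Sharpe ratio. The time-series r-squared on the market factor is $\rho_{i,m}^2$, so a higher r-squared means a higher correlation in absolute value and therefore a higher Sharpe ratio in absolute value.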

## 4.

Based on the case, what are two ways DFA hopes to generate attractive returns for investors?

Factor-Based Investing: DFA believes in strong research to best design portfolios that drive returns despite markets being 'efficient'.

Efficient Implementation: DFA also focuses on implementing these strategies efficiently via focus on profitability and cost.

## 5.

We analyzed a strategy similar to "AQR's Momentum Funds" (mutual funds.) We found this implementation had much higher returns than the momentum factor of Fama French. What was a major drawback to this construction?

The index assumes monthly rebalancing, which could generate large transaction costs.

## 6.

From our analysis, momentum has had a negative mean return since 2009. Is this evidence against momentum as a pricing factor? Explain why this is a problem or why it is not a problem.

Even though its mean return is negative after 2009, momentum is still valuable given its low correlation with other factors. This makes it a viable addition that can capture any "stray" explainability the other factors miss.

***

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import sys
import os
import warnings
warnings.filterwarnings('ignore')

#helper functions
def freq_multiplier(input_freq, output_freq):
    """
    multiplies the input frequency to the output frequency
    """
    multiplier = 1

    if input_freq == 'm':
        multiplier *= 12
    elif input_freq == 'w':
        multiplier *= 52
    elif input_freq == 'd':
        multiplier *= 252
    elif input_freq == 'a':
        multiplier *= 1
    else:
        print('invalid input frequency')
        return
    
    if output_freq == 'm':
        multiplier /= 12
    elif output_freq == 'w':
        multiplier /= 52
    elif output_freq == 'd':
        multiplier /= 252
    elif output_freq == 'a':
        pass

    return multiplier

def calc_stats(df, input_freq = 'm', output_freq = 'a', ci_level = 0.05):
    """
    Returns the performance stats for a given set of returns.
        Inputs: 
            df - DataFrame with Date index and returns for different assets/strategies.
            input_freq, output_freq - frequency codes ('d', 'w', 'm', 'a') used to scale mean and volatility.
            ci_level - tail probability for VaR and CVaR (0.05 means 5%).
        Output:
            summary_stats - DataFrame with mean return, vol, Sharpe ratio, skewness, excess kurtosis, VaR (5%),
                            CVaR (5%) and drawdown statistics.
    """
    multiplier = freq_multiplier(input_freq, output_freq)

    # calculate mean, vol, sharpe, VaR(5%), CVaR(5%) for each item in df
    summary_stats = df.mean().to_frame('Mean').apply(lambda x: x * multiplier)
    summary_stats['Volatility'] = df.std() * np.sqrt(multiplier)
    summary_stats['Sharpe Ratio'] = summary_stats['Mean'] / summary_stats['Volatility']
    summary_stats['Skewness'] = df.skew()
    summary_stats['Excess Kurtosis'] = df.kurtosis()
    summary_stats[f'VaR ({ci_level*100}%)'] = df.quantile(ci_level, axis = 0)
    summary_stats[f'CVaR ({ci_level*100}%)'] = df[df <= df.quantile(ci_level, axis = 0)].mean()

    cum_returns = (1 + df).cumprod()
    previous_peaks = cum_returns.cummax()
    drawdowns = (cum_returns - previous_peaks) / previous_peaks
    summary_stats['Max Drawdown'] = drawdowns.min()
    # find the last date of the min drawdown
    summary_stats['Peak'] = [previous_peaks[col][:drawdowns[col].idxmin()].idxmax() for col in previous_peaks.columns]
    summary_stats['Bottom'] = drawdowns.idxmin()

    # calculate recovery time
    recovery_date = []
    for col in cum_returns.columns:
        prev_max = previous_peaks[col][:drawdowns[col].idxmin()].max()
        recovery_cum = pd.DataFrame([cum_returns[col][drawdowns[col].idxmin():]]).T
        recovery_date.append(recovery_cum[recovery_cum[col] >= prev_max].index.min())
    summary_stats['Recovery'] = recovery_date   

    return summary_stats

def pre_post_calc_stats_comp(df, pre, post, input_freq = 'm', output_freq = 'a', ci_level = 0.05):
    """
    Returns the performance stats for a given set of excess returns, comparing 2 time periods.
    Uses the calc_stats function above:
        Inputs: 
            df - DataFrame with Date index and returns for different assets/strategies.
            pre, post - end of the first period and start of the second period, in string format (e.g. '2014').
        Output:
            summary_stats - DataFrame with mean return, vol, Sharpe ratio, skewness, excess kurtosis, VaR (5%),
                            CVaR (5%) and drawdown statistics, indexed by period and factor.
    """
    df.index = df.index.normalize()
    df_pre = df.loc[:pre]
    df_post = df.loc[post:]

    stats_pre = calc_stats(df_pre, input_freq, output_freq, ci_level)
    stats_post = calc_stats(df_post, input_freq, output_freq, ci_level)

    # combined summary stats for all subsamples in one dataframe with period as column
    stats_pre['Period'] = f'from {df_pre.index[0].strftime("%Y")} to {pre}'
    stats_post['Period'] = f'{post} - present'
    summary_stats = pd.concat([stats_pre, stats_post]).reset_index().rename(columns = {'index': 'Factor'}).set_index(['Period', 'Factor'])
    
    return summary_stats

def plot_time_series(df, pre = None, post = None):
    """
    Plots the factors over time. Also, includes a line demarcating the pre and post periods if pre is available.
    Will also throw the cumulative returns plot for pre and post if pre and post available.
    """
    fig, ax = plt.subplots(figsize = (10, 5))
    cum_returns = (1 + df).cumprod() - 1
    cum_returns.plot(ax = ax)
    ax.set_title('Cumulative Returns of Factors')
    ax.set_xlabel('Date')
    ax.set_ylabel('Cumulative Returns')
    
    if pre is not None and post is not None:
        ax.axvline(x=pre, color='k', linestyle='--', label='Pre/post split')
        # let matplotlib build the legend from the series names so the lines are not mislabeled
        ax.legend()
        
        # Plot cumulative returns for pre period
        fig_pre, ax_pre = plt.subplots(figsize=(10, 5))
        cum_returns_pre = (1 + df.loc[:pre]).cumprod() - 1
        cum_returns_pre.plot(ax=ax_pre)
        ax_pre.set_title('Cumulative Returns of Factors - Pre')
        ax_pre.set_xlabel('Date')
        ax_pre.set_ylabel('Cumulative Returns')
        
        # Restart compounding at the beginning of the post period
        df_post = df.loc[post:]
        cum_returns_post = (1 + df_post).cumprod() - 1
        
        fig_post, ax_post = plt.subplots(figsize=(10, 5))
        cum_returns_post.plot(ax=ax_post)
        ax_post.set_title('Cumulative Returns of Factors - Post')
        ax_post.set_xlabel('Date')
        ax_post.set_ylabel('Cumulative Returns')
        
        return fig, ax, fig_pre, ax_pre, fig_post, ax_post
    else:
        return fig, ax

def plot_corr_matrix(df):
    """
    plot correlation matrices for a set of time series data.
    """
    fig, ax = plt.subplots(figsize = (10, 5))
    sns.heatmap(df.corr(), annot = True, cmap = 'Blues', ax = ax)
    ax.set_title('Correlation Matrix')
    return fig, ax

def tangency_weights(df, input_freq = 'm', output_freq = 'a'):
    """
    Returns the weights of the tangency portfolio for given set of returns and covariance matrix.
    Also returns the annualised returns, vol and sharpe ratio of each asset in the portfolio
    """
    multipler = freq_multiplier(input_freq, output_freq)

    # calculate the mean returns and covariance matrix
    mean_returns = df.mean() * multipler
    cov_matrix = df.cov() * multipler
    cov_inv = np.linalg.inv(cov_matrix)

    # calculate the tangency portfolio weights
    tangency_weights = cov_inv.dot(mean_returns)
    tangency_weights /= tangency_weights.sum()

    #create a dataframe with weights and asset names
    tangency_weights = pd.DataFrame(tangency_weights, index = df.columns, columns = ['Weights'])

    return tangency_weights

def linearRegression(seriesY,seriesX):
    """
    Regresses seriesY on seriesX (with a constant) and returns annualized summary statistics.
    Assumes monthly data: means and alphas are scaled by 12, volatilities by sqrt(12).
    """
    mean =seriesY.mean()*12
    sharpe = mean/(seriesY.std()*(12**0.5))
    model = sm.OLS(seriesY,sm.add_constant(seriesX)).fit()
    rsq = model.rsquared
    
    beta = pd.DataFrame(index= [seriesY.name])
    
    # params.iloc[0] is the constant; the factor betas follow in column order
    for i,x in enumerate(seriesX):
         beta[x] = model.params.iloc[i+1]
    
    betaCols = [i+'Beta' for i in seriesX]
    beta = beta.rename(columns = dict(zip(beta.columns,betaCols)))
    
    # Treynor ratio uses the beta on the first regressor (MKT in this data set)
    treynor = mean/beta[beta.columns[0]]
    alpha = model.params.iloc[0]*12
    information = alpha/(model.resid.std()*np.sqrt(12))
    
    RegressionStats = pd.DataFrame({'Mean Return':mean,'Sharpe Ratio':sharpe,'R Squared':rsq,\
                         'Alpha':alpha, 'Information Ratio':information, 'Treynor':treynor},index= [seriesY.name])
    
    return pd.concat([RegressionStats,beta], axis =1)

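def calc_pricing_regression(rets, factors, intercept=True, adj=12):
    """
    Time-series regression of each asset in rets on the factors.
    Returns the betas, the annualized alpha (NaN if no intercept is included) and the r-squared per asset.
    """
    factor_names = list(factors.columns)
    X = sm.add_constant(factors) if intercept else factors
    summary = {f'{k} Beta': [] for k in factor_names}
    summary['Alpha']  = []
    summary['R^2']    = []
    for asset in rets.columns:
        model = sm.OLS(rets[asset], X).fit()
        for k in factor_names:
            summary[f'{k} Beta'].append(model.params[k])
        summary['R^2'].append(model.rsquared)
        # without an intercept there is no alpha to report
        summary['Alpha'].append(model.params['const'] * adj if intercept else np.nan)
            
    return pd.DataFrame(summary, index=rets.columns)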


def calc_cross_sectional_regression(mean_rets, betas, intercept=False):
    if intercept:
        betas = sm.add_constant(betas)
    model = sm.OLS(mean_rets, betas).fit()
    params = model.params.to_frame('Cross Sectional Regression')
    params.loc['R^2'] = model.rsquared
    params.loc['MAE'] = model.resid.abs().mean() 
    return params


In [7]:
FILEIN = 'Documents/GitHub/finm-portfolio-2023/data/midterm_2_data.xlsx'
sheet_des = 'descriptions'
sheet_factorexrets = 'factors (excess returns)'
sheet_portfolioexrets = 'portfolios (excess returns)'
sheet_rfr = 'risk-free rate'

# reference github repo finm-portfolio-2023
des = pd.read_excel(FILEIN, sheet_name = sheet_des)
factorexrets = pd.read_excel(FILEIN, sheet_name = sheet_factorexrets).set_index('Date')
portfolioexrets = pd.read_excel(FILEIN, sheet_name = sheet_portfolioexrets).set_index('Date')
rfr = pd.read_excel(FILEIN, sheet_name = sheet_rfr).set_index('Date')

# 2. Linear Factor Pricing Models (LFPMs)

This problem tests the following LFPM:

$$\begin{align}
\Eri = \betamkt \Emkt + \betahml \Ehml + \betaumd \Eumd
\end{align}$$

## 1.

### (8 pts)

Estimate the **time-series (TS)** test of this pricing model. 

For each asset, report the following statistics:
* annualized alpha
* betas
* r-squared

In [16]:
frames = []

for col in portfolioexrets:
    p = linearRegression(portfolioexrets[col],factorexrets)
    frames.append(p) 

TSRegression = pd.concat(frames)
TSRegression[['Alpha', 'MKTBeta', 'HMLBeta','UMDBeta', 'R Squared']]

| | Alpha | MKTBeta | HMLBeta | UMDBeta | R Squared |
|-------|-----------|----------|-----------|-----------|-----------|
| NoDur | 0.029253 | 0.739522 | 0.20458 | 0.049333 | 0.617919 |
| Durbl | 0.010734 | 1.271865 | 0.173595 | -0.320023 | 0.613493 |
| Manuf | -0.000996 | 1.049482 | 0.197462 | -0.036704 | 0.870268 |
| Enrgy | -0.015117 | 0.992222 | 0.637006 | 0.07517 | 0.465602 |
| HiTec | 0.028207 | 1.154959 | -0.637135 | -0.140638 | 0.829498 |
| Telcm | 0.003506 | 0.837326 | 0.094363 | -0.084518 | 0.588052 |
| Shops | 0.026739 | 0.946928 | -0.042222 | -0.015005 | 0.742161 |
| Hlth | 0.031862 | 0.757605 | -0.119928 | 0.074058 | 0.580514 |
| Utils | 0.01371 | 0.527879 | 0.353033 | 0.108622 | 0.342654 |
| Other | -0.01978 | 1.115433 | 0.426753 | -0.048678 | 0.910098 |


## 2.

### (7pts)

Estimate the **cross-sectional (CS)** test of the pricing model. 

Include an intercept in your cross-sectional test.

Report the
* annualized intercept
* annualized regression coefficients
* r-squared

In [28]:
# reporting annualized intercept, annualized regression coefficients, R^2
CSRegression = calc_cross_sectional_regression(TSRegression['Mean Return'], TSRegression[['MKTBeta', 'HMLBeta','UMDBeta']], intercept=True)
CSRegression.T

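| | const | MKTBeta | HMLBeta | UMDBeta | R^2 | MAE |
|----------------------------|----------|----------|-----------|----------|----------|----------|
| Cross Sectional Regression | 0.063716 | 0.031992 | -0.015767 | 0.030301 | 0.366198 | 0.007915 |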


## 3.

Report the annualized factor premia (expected excess returns of the three factors) as implied by each of the TS and CS estimations.

In [20]:
time_series_premia = (factorexrets.mean()*12).to_frame('Time Series Premia')
time_series_premia.index = [x+"Beta" for x in time_series_premia.index]
time_series_premia

| | Time Series Premia |
|---------|--------------------|
| MKTBeta | 0.083853 |
| HMLBeta | 0.025028 |
| UMDBeta | 0.061692 |


In [24]:
CSRegression[1:4]

| | Cross Sectional Regression |
|---------|----------------------------|
| MKTBeta | 0.031992 |
| HMLBeta | -0.015767 |
| UMDBeta | 0.030301 |


## 4.

Use the r-squared statistics from the TS and CS tests above to assess whether these factors are effective for decomposition and/or pricing.

Be specific as to how the r-squared statistics from the TS and CS tests impact your conclusions.

In [39]:
#TS and CS R^2 in a dataframe
R2 = pd.DataFrame({'Time Series': TSRegression['R Squared'].mean(), 
                'Cross Sectional': CSRegression.loc['R^2']})
R2.T.rename(columns = {'Cross Sectional Regression': 'R Squared'})

| | R Squared |
|-----------------|-----------|
| Time Series | 0.656026 |
| Cross Sectional | 0.366198 |


The average time-series r-squared across the assets is about 0.66, so the factors explain a moderate share of the variation in each asset's returns. The cross-sectional r-squared is only about 0.37, so the betas explain little of the variation in mean returns across assets.

The TS r-squared speaks to decomposition and the CS r-squared speaks to pricing. Looking at these statistics, the factors are moderate at best for decomposing these assets' returns and weak for pricing them.

## 5.

Report the annualized pricing mean absolute error (MAE) implied by each of the TS and CS estimations.

In [40]:
#MAE for TS and CS regression in a dataframe
# TS pricing errors are the alphas, so the TS MAE is the mean absolute annualized alpha
MAE = pd.DataFrame({'Time Series': TSRegression['Alpha'].abs().mean(), 
                'Cross Sectional': CSRegression.loc['MAE']})
MAE.T.rename(columns = {'Cross Sectional Regression': 'MAE'})

| | MAE |
|-----------------|----------|
| Time Series | 0.01799 |
| Cross Sectional | 0.007915 |


In [42]:
CSRegression

| | Cross Sectional Regression |
|---------|----------------------------|
| const | 0.063716 |
| MKTBeta | 0.031992 |
| HMLBeta | -0.015767 |
| UMDBeta | 0.030301 |
| R^2 | 0.366198 |
| MAE | 0.007915 |


## 6.

Which asset has the highest premium as implied by the TS estimation? And as implied by the CS estimation? (For the latter, feel free to include the cross-sectional intercept.)

In [43]:
#Which asset has the highest premium as implied by the TS estimation? And as implied by the CS estimation? (For the latter, feel free to include the cross-sectional intercept.)
# TS
print('TS', time_series_premia.loc[time_series_premia['Time Series Premia'].idxmax()])

# CS, including intercept
print('CS', CSRegression[1:4].idxmax())


TS Time Series Premia    0.083853
Name: MKTBeta, dtype: float64
CS Cross Sectional Regression    MKTBeta
dtype: object


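For both estimations, the factor with the highest premium is MKT. For an individual asset, the implied premium is its betas times the factor premia (plus the intercept under CS), so the ranking of assets also depends on their beta exposures.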
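
Since the question asks about assets rather than factors, here is a minimal sketch of the asset-level calculation. It re-enters the rounded betas, factor premia and CS intercept reported in the tables above, so results are only as precise as those rounded figures; the names `betas`, `ts_premia`, `cs_premia` and `asset_premia` are introduced here for illustration.

```python
import pandas as pd

# Betas copied from the time-series regression table in 2.1 (rounded as reported).
betas = pd.DataFrame(
    {
        "MKT": [0.739522, 1.271865, 1.049482, 0.992222, 1.154959,
                0.837326, 0.946928, 0.757605, 0.527879, 1.115433],
        "HML": [0.20458, 0.173595, 0.197462, 0.637006, -0.637135,
                0.094363, -0.042222, -0.119928, 0.353033, 0.426753],
        "UMD": [0.049333, -0.320023, -0.036704, 0.07517, -0.140638,
                -0.084518, -0.015005, 0.074058, 0.108622, -0.048678],
    },
    index=["NoDur", "Durbl", "Manuf", "Enrgy", "HiTec",
           "Telcm", "Shops", "Hlth", "Utils", "Other"],
)

# Annualized factor premia from 2.3 and the CS intercept from 2.2.
ts_premia = pd.Series({"MKT": 0.083853, "HML": 0.025028, "UMD": 0.061692})
cs_premia = pd.Series({"MKT": 0.031992, "HML": -0.015767, "UMD": 0.030301})
cs_intercept = 0.063716

# Implied premium per asset: beta-weighted sum of factor premia (plus intercept for CS).
asset_premia = pd.DataFrame({
    "TS": betas @ ts_premia,
    "CS": cs_intercept + betas @ cs_premia,
})

print(asset_premia.round(4))
print(asset_premia.idxmax())
```

With these rounded inputs, the highest implied premium belongs to `Enrgy` under the TS estimation and to `HiTec` under the CS estimation.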

***

# 3. Additional Analysis

## 1. 

Consider the three-factor pricing model above. How can we assess whether all three factors are useful in this pricing model? 

Specifically, discuss whether the previously estimated regression betas would be informative. If not, what other statistic could we calculate?

The t-statistics for these betas would indicate whether each factor is statistically significant individually. We could also use the F-test to check the joint significance of these factors combined. 

## 2.

Suppose we are testing the 3-factor model above, and now we want to allow for time-varying betas.

How could we test the model while allowing for this?

Be specific about the number of regressions we would run and the nature of these regressions.

We can perform a rolling regression, i.e.:
- Perform a series of regressions over rolling windows of time. For instance, if you have monthly data, you might use a 1-year rolling window, which means running a regression using the first 12 months of data, then moving one month forward and running the regression again, and so on.
- Each regression would yield a set of betas for that particular time period, which could then be analyzed to see how they change over time.
- The number of regressions would equal the total number of periods minus the number of periods in the window, plus one.
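
The rolling procedure above can be sketched as follows for a single asset. The data are simulated, and the 12-month window, the true betas and the name `rolling_betas` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_periods, window = 60, 12

# Simulated monthly factor returns with a constant column for the intercept.
factor_rets = rng.normal(0.005, 0.04, size=(n_periods, 3))
X = np.column_stack([np.ones(n_periods), factor_rets])

# Simulated asset returns with fixed true betas plus noise.
y = factor_rets @ np.array([1.0, 0.2, -0.1]) + rng.normal(0, 0.01, n_periods)

# One OLS regression per window; each window ends at observation `end` (exclusive).
rows = []
for end in range(window, n_periods + 1):
    coefs = np.linalg.lstsq(X[end - window:end], y[end - window:end], rcond=None)[0]
    rows.append(coefs)

rolling_betas = pd.DataFrame(rows, columns=["const", "MKT", "HML", "UMD"])
print(len(rolling_betas))  # n_periods - window + 1 regressions
```

Each row of `rolling_betas` holds the intercept and betas for one window. For a set of assets the loop would be repeated per asset, so the total count scales with the number of assets.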

## 3.

State one advantage and one disadvantage of using the CS estimation as opposed to the TS estimation in fitting the LFPM to the data.

CS advantage:
- Uses the variability across different assets at a single point in time to estimate the factor premiums. This can provide a more comprehensive view of how different assets are priced relative to each other

CS disadvantage:
- Assumes constant betas, which is a limitation if betas vary over time.

## 4.

Suppose we are investing in just the assets included in our data set. We want to implement a momentum strategy.

Relative to the momentum strategies we studied, do you expect this strategy would have higher or lower...
* mean
* volatility

Explain.

Given that the momentum factor we studied has low correlation to other factors, adding momentum should provide some diversification. 
I would expect the Sharpe ratio to improve, although both the mean and the volatility should go down.

***

# 4. Returns Over Time

## 1.

If Barnstable’s assumptions hold, (log iid returns, normally distributed,) then in what sense is an investment safer in the long-run? And in what sense is it riskier in the long-run?

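An investment is safer in the long run in the sense that the average (annualized) return becomes less variable as the horizon grows: with iid log returns, the law of large numbers implies the volatility of the average return shrinks with the square root of the horizon. It is riskier in the sense that the volatility of the cumulative return grows with the square root of the horizon, so the range of potential outcomes in dollar terms, including losses, widens. Outside these assumptions, long-term shifts in market conditions that are not captured by historical return distributions would add further risk.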
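
Under iid log returns with annual volatility $\sigma$, the $h$-year average log return has standard deviation $\sigma/\sqrt{h}$, while the $h$-year cumulative log return has standard deviation $\sigma\sqrt{h}$. The sketch below tabulates both; the value of `sigma` is an assumed number for illustration, not an estimate from the exam data.

```python
import numpy as np

sigma = 0.16  # assumed annual volatility of log returns (illustrative)
horizons = np.array([1, 5, 10, 30])

vol_of_average = sigma / np.sqrt(horizons)      # shrinks as the horizon grows
vol_of_cumulative = sigma * np.sqrt(horizons)   # grows as the horizon grows

for h, avg, cum in zip(horizons, vol_of_average, vol_of_cumulative):
    print(f"h={h:>2}  average: {avg:.3f}  cumulative: {cum:.3f}")
```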

## 2. 

### (10pts)

Data 
* Make use of the `risk-free rate` tab.
* Construct the **total** factor returns by adding the risk-free rate to the excess `MKT` and `HML` factor returns.

Assumptions
* The total returns are lognormally distributed and iid. 

Report the probability that `MKT` will outperform `HML` over the following 5 years.

In [48]:
from scipy.stats import norm
#take MKT and HML and add risk free rate to get the total return
TR = factorexrets[['MKT', 'HML']].add(rfr['RF'], axis=0)

#calculate probability MKT will outperform HML over the following 5 years (60 months)
# note: this works with level (not log) returns and treats MKT and HML as uncorrelated
mean = TR.mean()*60
std = TR.std()*np.sqrt(60)
mean_diff = mean['MKT'] - mean['HML']
std_diff = np.sqrt(std['MKT'] ** 2 + std['HML'] ** 2)
prob = 1 - norm.cdf(0, mean_diff, std_diff)
print('Probability MKT will outperform HML over the following 5 years:', prob)


Probability MKT will outperform HML over the following 5 years: 0.7530567656887349


***