# Lab

    Partially replicate the results of Avellaneda & Lee 2008

# Background - Modern Portfolio Theory (MPT)

Modern portfolio theory (MPT), laid out by Harry Markowitz, is a framework for managing a portfolio of assets (read: a group of stocks, commodities, etc. that you own).

The core of his theory is that at least some of the assets tend to move with or against a few other assets. It doesn't try to determine *why* they're correlated; it just takes the correlations as given.

This can be used to quantify both the expected return of a portfolio, as well as its exposure to risk.

Ex. A portfolio of several strongly correlated stocks can have a high expected return but also a high exposure to risk: your profit or loss depends mostly on whether the market as a whole went up or down.

In the Markowitz theory, people are rational and would like to maximize returns while minimizing risk. There is a tradeoff between the two, but people have their own preferences for expected returns or tolerable risk. 

Constructing an appropriate portfolio is then a constrained optimization problem.

Given a desired level of either minimum return or maximum risk, one can calculate the portfolio that satisfies that request while either minimizing risk or maximizing return respectively (this is the [efficient frontier](https://plotly.com/python/v3/ipython-notebooks/markowitz-portfolio-optimization/)).

<img src="EfficientFrontier.png" style="width: 50em" />
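As a rough illustration of the constrained optimization described above, here is a minimal sketch that traces a frontier for three made-up assets. The expected returns `mu`, the covariance matrix `cov`, and the helper name `min_variance_weights` are all assumptions for illustration. It also assumes short selling is allowed, so the problem reduces to solving a linear (Lagrange multiplier) system rather than calling a general optimizer.

```python
import numpy as np

def min_variance_weights(mu, cov, target_return):
    """Weights minimizing w' cov w subject to w' mu = target_return and sum(w) = 1.

    Short selling is allowed (weights may be negative), which is an assumption
    that keeps the problem a plain linear system.
    """
    n = len(mu)
    kkt = np.zeros((n + 2, n + 2))
    kkt[:n, :n] = 2.0 * cov      # gradient of the variance term
    kkt[:n, n] = mu              # multiplier for the return constraint
    kkt[:n, n + 1] = np.ones(n)  # multiplier for the budget constraint
    kkt[n, :n] = mu
    kkt[n + 1, :n] = np.ones(n)
    rhs = np.zeros(n + 2)
    rhs[n] = target_return
    rhs[n + 1] = 1.0
    return np.linalg.solve(kkt, rhs)[:n]

# Made-up annualized expected returns and covariances for three assets.
mu = np.array([0.05, 0.08, 0.12])
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])

# Sweep the target return to trace (risk, return) points on the frontier.
frontier = []
for target in np.linspace(0.05, 0.12, 8):
    w = min_variance_weights(mu, cov, target)
    frontier.append((np.sqrt(w @ cov @ w), target))
```

Plotting the `frontier` pairs with risk on the x-axis and return on the y-axis should give a curve of the same shape as the figure. With a no-short-selling constraint you would need a quadratic programming solver instead.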

One general result of MPT is that, at a fixed expected return, a diversified portfolio carries much less risk than a concentrated one: don't put all your eggs in one basket.

That said, there is still a degree of *market* risk in any portfolio because the market as a whole moves. The degree to which the market drives a stock's risk and return is captured by the $\beta$ terms in section 2 of the paper.

*Directional* trading strategies generate returns that depend on market movements ($\beta \approx 1$), like investing in an index fund.

*Market-neutral* trading strategies try to decorrelate from the market ($\beta \rightarrow 0$), so that their returns ideally come just from the stocks themselves.
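To make $\beta$ concrete, here is a small sketch on synthetic data. The function name `estimate_beta`, the "true" beta of 1.2, and the noise levels are assumptions for illustration. It estimates $\beta$ as the covariance of the stock with the market divided by the market variance (the slope of a one-variable regression), then removes the market exposure by shorting $\beta$ units of the market.

```python
import numpy as np

def estimate_beta(stock_returns, market_returns):
    """Regression slope of stock returns on market returns: cov(r, m) / var(m)."""
    cov = np.cov(stock_returns, market_returns, ddof=1)
    return cov[0, 1] / cov[1, 1]

rng = np.random.default_rng(0)

# Synthetic daily returns: the stock is 1.2x the market plus its own noise.
market = rng.normal(0.0, 0.01, size=5000)
stock = 1.2 * market + rng.normal(0.0, 0.01, size=5000)

beta = estimate_beta(stock, market)          # close to 1.2
hedged = stock - beta * market               # long the stock, short beta units of market
hedged_beta = estimate_beta(hedged, market)  # zero in-sample, by construction
```

The hedged series is market-neutral only in-sample; on new data the estimated $\beta$ will generally be slightly off.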

# Paper

Avellaneda and Lee proposed a market-neutral trading strategy utilising PCA.

The general idea is that historical price data includes movements due to noise/random processes as well as actual information.

By analyzing how groups of stocks move together, you can recover that information and find patterns where you bet on some industries and against others in order to generate a profit with lower theoretical risk (i.e. arbitrage).

Directly analyzing all of the market data, or even just the constituents of the S&P 500, is impractical.

Hence dimensionality reduction (PCA) is used to extract the relevant information from a collection of stocks, from which they can construct a trading strategy.

They find that the first eigenportfolio is associated with the market portfolio, which has the largest exposure to market risk.

All other portfolios, being orthogonal to it, are uncorrelated with it and are theoretically market-neutral. 
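The PCA step can be sketched on synthetic data as follows. The one-factor return model and all variable names are assumptions for illustration, not real market data. The sketch diagonalizes the empirical correlation matrix and, following the weighting used in the paper, divides each eigenvector entry by that stock's volatility to get eigenportfolio weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_stocks = 1000, 20

# Synthetic returns: every stock loads on one common "market" factor plus noise.
market = rng.normal(0.0, 0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
returns = np.outer(market, betas) + rng.normal(0.0, 0.01, size=(n_days, n_stocks))

sigma = returns.std(axis=0, ddof=1)        # per-stock volatility
corr = np.corrcoef(returns, rowvar=False)  # n_stocks x n_stocks correlation matrix

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvector signs are arbitrary; make the first one net-long.
eigvecs[:, 0] *= np.sign(eigvecs[:, 0].sum())

explained = eigvals / eigvals.sum()  # fraction of variance per component
weights = eigvecs / sigma[:, None]   # column j = weights of eigenportfolio j
factor_returns = returns @ weights   # daily return of each eigenportfolio
```

On this toy data the first eigenvector has all entries of the same sign, which is the "market portfolio" behavior described above; the later eigenvectors mix long and short positions.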

# Goals

    Basic: 
    
Get the price data for the S&P500 from at least 2000 onward (try `yfinance`) and repeat the PCA analysis. Specifically, remake figures 1-6.
    
Then interpret the meaning of the eigenportfolios (the first is the market portfolio, as mentioned in the paper)
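One possible starting point for the data step is sketched below. `download_close_prices` and `daily_returns` are hypothetical helper names, and the ticker list is up to you. The sketch assumes that `yfinance.download` returns a table with a `"Close"` entry holding one column per ticker, which may vary between `yfinance` versions, so check the output before relying on it.

```python
import pandas as pd

def download_close_prices(tickers, start="2000-01-01", end=None):
    """Fetch adjusted closing prices, one column per ticker (needs network access)."""
    import yfinance as yf  # imported here so daily_returns works without yfinance
    data = yf.download(tickers, start=start, end=end, auto_adjust=True, progress=False)
    return data["Close"]

def daily_returns(prices):
    """Simple daily returns, keeping only tickers with a complete price history."""
    complete = prices.dropna(axis=1, how="any")  # drop tickers with any gaps
    returns = complete / complete.shift(1) - 1.0
    return returns.iloc[1:]                      # first row has no previous price

# Tiny offline example: ticker "B" has a gap, so it is dropped.
example = pd.DataFrame({"A": [100.0, 110.0, 121.0], "B": [50.0, None, 55.0]})
example_returns = daily_returns(example)
```

Dropping every ticker with a gap is the simplest choice, not the only one; stocks that joined the index after your start date will be excluded by it.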

    Advanced: 

1.) Repeat the PCA over a time window directly following the one you used and see if the results change significantly. In particular, compare the variance explained by a fixed number of PCA eigenvectors.

2.) Explicitly check that the returns of the eigenportfolios are uncorrelated.

3.) Find where your portfolios lie relative to the efficient frontier.
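Items 1 and 2 could be checked along these lines. This is a sketch on synthetic, stationary data, so real market data will likely show larger differences between windows. The helper names `pca_corr` and `max_offdiag_corr`, the window split, and the choice of `k = 5` components are assumptions for illustration.

```python
import numpy as np

def pca_corr(returns):
    """Eigenvalues (descending), eigenvectors, and volatilities from the correlation matrix."""
    sigma = returns.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(returns, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order], sigma

def max_offdiag_corr(x):
    """Largest absolute correlation between two different columns of x."""
    c = np.corrcoef(x, rowvar=False)
    return np.abs(c - np.diag(np.diag(c))).max()

rng = np.random.default_rng(1)
n_days, n_stocks, k = 1000, 20, 5
market = rng.normal(0.0, 0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
returns = np.outer(market, betas) + rng.normal(0.0, 0.01, size=(n_days, n_stocks))

# Fit on the first window, then look at the window directly after it.
first, second = returns[:500], returns[500:]
vals1, vecs1, sigma1 = pca_corr(first)
vals2, _, _ = pca_corr(second)

# Item 1: variance explained by the top k components in each window.
explained1 = vals1[:k].sum() / vals1.sum()
explained2 = vals2[:k].sum() / vals2.sum()

# Item 2: correlation between eigenportfolio returns, in and out of sample.
weights = vecs1 / sigma1[:, None]
in_sample = max_offdiag_corr(first @ weights)
out_of_sample = max_offdiag_corr(second @ weights)
```

In-sample, the eigenportfolio returns are uncorrelated up to floating-point error by construction; the out-of-sample number is the more informative one.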

# Optional Followup

**Trading and Backtesting**

Statistical arbitrage encompasses trading strategies where returns are generated through market mispricings, determined statistically.

From the paper: "the idea is to make many bets with positive expected returns, taking advantage of diversification across stocks, to produce a low-volatility investment strategy which is uncorrelated with the market."

In the paper, the strategy they use is based on [mean-reversion](https://medium.com/auquan/mean-reversion-simple-trading-strategies-part-1-a18a87c1196a).

The price of an item can temporarily deviate from the "fair" price because of some noise or a market overreaction, but eventually the market will realize that and revert the price change.

The way the "fair" price is determined can vary, but usually it's measured by taking a moving average within a window of time. 

In the image below, it's done with the 90 day average, while the recent behavior is measured with a 30 day average, and the crossover between them is the signal to either buy or sell. 

<img src="MA.png" style="width: 50em" />
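A moving-average crossover like the one pictured can be computed with pandas rolling means. In this sketch the function name `ma_crossovers` and the sinusoidal toy price series are assumptions for illustration. The code only locates the crossings; whether a given crossing means "buy" or "sell" depends on the strategy you pick (a mean-reversion reading and a momentum reading trade them in opposite directions).

```python
import numpy as np
import pandas as pd

def ma_crossovers(prices, short_window=30, long_window=90):
    """Return both moving averages and the points where they cross.

    Crossings are +1 where the short average moves above the long one
    and -1 where it moves below.
    """
    short_ma = prices.rolling(short_window).mean()
    long_ma = prices.rolling(long_window).mean()
    # Only compare once the long window has enough data.
    above = (short_ma > long_ma)[long_ma.notna()]
    change = above.astype(int).diff().fillna(0).astype(int)
    return short_ma, long_ma, change[change != 0]

# Toy price series: a slow oscillation around 100.
t = np.arange(600)
prices = pd.Series(100.0 + 10.0 * np.sin(2.0 * np.pi * t / 200.0))

short_ma, long_ma, crossings = ma_crossovers(prices)
```

`crossings` is indexed by the position (or date, for real data) of each crossover, so it can be fed directly into a backtest loop.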

By all means, pick a trading strategy (mean reversion, momentum, Laguerre RSI, etc.), balance the portfolio with your PCA results, then backtest it to recreate the other parts of the paper.

In particular, find the Sharpe ratio, the ratio of excess return to risk (the standard deviation of that excess return), to evaluate performance.

Here, excess return is the portfolio return minus the theoretical rate of return of an investment with zero risk (like a treasury bond). 

A Sharpe ratio somewhere around 1 (annualized) is generally considered good.
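A minimal sketch of the Sharpe ratio calculation follows. It assumes daily returns, 252 trading days per year, and a constant annual risk-free rate; the function name `sharpe_ratio` and the synthetic return series are illustrative, and the paper's exact conventions may differ.

```python
import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period returns.

    risk_free_rate is an annual rate, spread evenly over the periods.
    """
    returns = np.asarray(returns, dtype=float)
    excess = returns - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

rng = np.random.default_rng(3)

# Ten years of synthetic daily strategy returns.
daily = rng.normal(0.0005, 0.01, size=2520)
sr = sharpe_ratio(daily, risk_free_rate=0.02)
```

The square-root-of-time scaling assumes returns are roughly independent from day to day, which is itself an approximation.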

**Random Matrix Theory (RMT)**

One of the problems of getting an empirical correlation matrix is noise. This is due to the limits on data sampling.

You want enough data points that the matrix is well-sampled in a statistical sense, but you can't use too long a history, or the market will have moved and your older data will no longer be representative.

There is unfortunately not much we can do regarding data frequency (yet), but the next best option is to check how contaminated our measurements are by noise to try to filter it out.

That is to say: assuming our correlation measurements are purely noise, what would we expect to get, and how different is that from what we actually have?

It turns out that sufficiently large matrices with random entries produce eigenvalues that follow specific distributions depending on symmetry and the aspect ratio of the matrix (e.g. the Marchenko-Pastur distribution).

This tells us that large deviations from the random matrix's distribution contain real information that we could potentially recover, while everything within the distribution likely contains no useful information whatsoever.
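The comparison against a random matrix can be sketched as follows. For `n_obs` observations of `n_assets` independent, unit-variance series, the Marchenko-Pastur distribution confines the correlation-matrix eigenvalues (asymptotically) to the interval $[(1-\sqrt{q})^2, (1+\sqrt{q})^2]$ with $q$ = `n_assets / n_obs`. The helper names, the matrix sizes, and the single "market" factor added to the noise are assumptions for illustration.

```python
import numpy as np

def marchenko_pastur_bounds(n_obs, n_assets):
    """Asymptotic eigenvalue range for a pure-noise correlation matrix (needs n_assets < n_obs)."""
    q = n_assets / n_obs
    return (1.0 - np.sqrt(q)) ** 2, (1.0 + np.sqrt(q)) ** 2

def corr_eigenvalues(returns):
    """Eigenvalues of the empirical correlation matrix, in ascending order."""
    return np.linalg.eigvalsh(np.corrcoef(returns, rowvar=False))

rng = np.random.default_rng(2)
n_obs, n_assets = 2000, 200
lam_minus, lam_plus = marchenko_pastur_bounds(n_obs, n_assets)

# Pure noise: nearly all eigenvalues should fall inside the bounds.
noise = rng.normal(size=(n_obs, n_assets))
noise_eigs = corr_eigenvalues(noise)
inside = np.mean((noise_eigs >= lam_minus) & (noise_eigs <= lam_plus))

# Noise plus one common factor: at least one eigenvalue escapes above lam_plus.
market = rng.normal(size=(n_obs, 1))
factor_eigs = corr_eigenvalues(0.5 * market + noise)
n_signal = int((factor_eigs > lam_plus).sum())
```

Applied to real returns, the eigenvalues above `lam_plus` are the candidates for components carrying information, while the bulk inside the bounds is indistinguishable from noise at this sample size.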

Some results in the literature: 

    1.) Eigenvalues below some threshold are there simply due to noise, and the associated portfolios should be discarded.
    2.) Risk tends to be underestimated with the empirical correlation matrix (~ 1.5x difference)