# Beta Calculation Optimizations
This notebook compares several ways of calculating beta and explains why I chose the method I eventually went with. Instead of using real stock data, we mimic it with numpy's random number generator, so we can scale up the size of our test cases quickly and measure performance.

## Generate Data
First, we generate our test data. In the example below, we generate 16 years of daily data, from 2000 through 2015 plus the end date of 2016-01-01 (5,845 calendar days), for 1,000 tickers. The total number of data points is 5,845,000, and the number of beta calculations is approximately the same (slightly fewer, because the first few days won't have enough history to calculate a beta).
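As a quick check of the sizes quoted above, the day count can be reproduced directly. This is a small sketch; the names `n_days` and `n_tickers` are only for illustration, and the dates are written in ISO form rather than the `'01-01-2000'` form used in the settings cell.

```python
import pandas as pd

# pd.date_range defaults to daily frequency, so weekends are included
idx = pd.date_range(start="2000-01-01", end="2016-01-01")

n_days = len(idx)      # both endpoints are included
n_tickers = 1000

print(n_days, n_days * n_tickers)  # 5845 5845000
```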

In [13]:
import pandas as pd
import numpy as np
import datetime

In [14]:
# source: http://stackoverflow.com/questions/2030053/random-strings-in-python
import random
import string

def randomword(length):
    return ''.join(random.choice(string.ascii_uppercase) for i in range(length))

In [15]:
# settings
start = '01-01-2000'
end = '01-01-2016'
ticker_count = 1000       # number of tickers
window = 100              # window for beta calc

In [25]:
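idx = pd.date_range(start=start, end=end)                     # one row per calendar day
tickers = [randomword(4) for i in range(ticker_count+1)]      # one extra name to act as the market
dt = {t: np.random.uniform(-.05, .05, len(idx)) for t in tickers}  # uniform random daily returns
df = pd.DataFrame(data=dt, index=idx)

mkt = tickers[0]   # treat the first random ticker as the market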

In [26]:
df.head() # dataframe created 

date,AAXJ,AAYO,ABSF,ACBP,ADIA,ADOT,ADWA,ADZF,AEHK,AGXS,...,ZSRH,ZUZA,ZVOJ,ZVUJ,ZWPK,ZXAJ,ZXDL,ZYSN,ZZFY,ZZLF
2000-01-01,0.008283,0.019936,0.034291,-0.030667,0.030982,0.028384,0.047911,-0.001368,-0.034942,0.008707,...,0.035814,-0.044858,0.029882,0.043596,-0.018689,0.015947,0.023468,-0.014212,0.016259,0.031984
2000-01-02,0.032373,-0.014669,0.007413,0.035563,-0.044525,-0.03137,0.031193,-0.028979,0.044249,-0.032793,...,0.003719,0.007442,0.026192,0.016993,0.018229,-0.032976,-0.00497,0.008953,-0.02076,0.025812
2000-01-03,-0.015042,-0.031432,-0.018058,-0.03529,-0.007694,-0.005616,-0.010107,-0.049436,0.03339,0.010262,...,-0.01464,0.033551,-0.042038,-0.036268,-0.033047,-0.042756,-0.035099,-0.022503,-0.038022,0.00746
2000-01-04,0.042674,0.027748,0.024061,0.047282,0.013558,-0.01872,0.035461,-0.018302,0.038879,0.002341,...,-0.020772,0.020344,-0.036218,-0.045622,-0.012704,0.021155,-0.004348,0.004046,-0.005093,-0.047804
2000-01-05,-0.044972,-0.030506,-0.001019,0.027927,0.001515,-0.022988,0.036734,0.035547,-0.018268,0.049465,...,0.025707,-0.028557,-0.030465,-0.000533,-0.037896,-0.037277,0.032265,0.01335,-0.036006,0.047599


In [27]:
df.shape # number of values = days x tickers

(5845, 999)
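
The shape shows 999 columns even though `ticker_count + 1 = 1001` names were drawn. A plausible explanation (an assumption, not something the notebook states) is that a couple of the random 4-letter names collided and were merged when used as dictionary keys. A rough birthday-problem estimate suggests about one collision is expected at this size:

```python
n_names = 1001          # ticker_count + 1 names are drawn
n_possible = 26 ** 4    # distinct 4-letter uppercase strings

# Birthday-problem approximation: expected number of colliding pairs
expected_collisions = n_names * (n_names - 1) / (2 * n_possible)
print(round(expected_collisions, 2))  # 1.1
```

So losing one or two columns to duplicates is unsurprising and does not affect the timing comparisons.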

## Calculation Method 1: Using Numpy

Our first iteration uses pandas and numpy. We call `.apply()` on each row of the DataFrame, and within the `historical_beta` function we extract the relevant window of data and calculate beta using numpy's covariance and variance functions.

We completed one ticker in 17.7s using this method. At that rate, all 1,000 tickers would take around 4.9 hours (17.7s × 1,000 = 17,700s). Not great, but doable as an overnight script. And since we'd likely only need to compute a single new day at a time after the initial run, it may not be terrible.
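
For reference, the beta being computed is the covariance of the stock's returns with the market's returns, divided by the variance of the market's returns. Below is a minimal sketch on synthetic data; the series `market` and `stock` and the built-in beta of 1.5 are made up for illustration and are not part of the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
market = rng.uniform(-0.05, 0.05, 500)

# Hypothetical stock constructed with a known beta of 1.5 plus a little noise
stock = 1.5 * market + rng.normal(0, 0.001, 500)

cov = np.cov(stock, market)[0][1]   # sample covariance (np.cov uses ddof=1 by default)
var = np.var(market, ddof=1)        # sample variance, using the same ddof as np.cov
beta = cov / var

print(beta)  # should land close to the built-in value of 1.5
```

Note that `np.var` defaults to `ddof=0` while `np.cov` defaults to `ddof=1`, so passing `ddof=1` keeps the ratio consistent.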

In [33]:
df_beta = df.copy()
df_beta['date'] = df_beta.index
stock = tickers[1]

In [56]:
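def historical_beta(date, mkt, stock, window):
    # look back `window` calendar days, excluding `date` itself
    start = date - datetime.timedelta(days=window)
    data = df.loc[(df.index < date) & (df.index > start), [stock, mkt]]
    if len(data) < 10:
        return np.nan  # not enough history yet
    # pass the two series separately: np.cov(data) would treat each row (day) as a variable
    cov = np.cov(data[stock], data[mkt])[0][1]
    var = np.var(data[mkt], ddof=1)  # sample variance, matching np.cov's default ddof=1
    beta = cov / var
    return beta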

In [57]:
df_beta[stock] = df_beta.apply(lambda x: historical_beta(x.date, mkt=mkt, stock=stock, window=window), axis=1)
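
One detail worth checking when handing a two-column block of returns to `np.cov`: by default it treats each row as a variable, so a days-by-2 array yields a days-by-days matrix rather than the 2-by-2 matrix we want. A small sketch with made-up data (the array `data` here is a stand-in, not the notebook's DataFrame):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(-0.05, 0.05, (50, 2))   # 50 days x 2 series

by_rows = np.cov(data)                 # default rowvar=True: each of the 50 rows is a variable
by_cols = np.cov(data, rowvar=False)   # each of the 2 columns is a variable

print(by_rows.shape, by_cols.shape)    # (50, 50) (2, 2)

# Passing the two columns separately gives the same covariance as rowvar=False
pairwise = np.cov(data[:, 0], data[:, 1])[0][1]
print(np.isclose(by_cols[0][1], pairwise))  # True
```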

## Calculation Method 2: Using `numpy.linalg.lstsq`

For this run, I compare `np.cov` against `np.linalg.lstsq` in terms of performance. I only time `cov`, rather than `cov` + `var`, because `var` only needs to be calculated once per day.

Both `cov` and `lstsq` take approximately the same amount of time (around 100 µs per call at best). For an entire run of a single ticker, this calculation would be performed around 5,000 times, meaning that only about 0.5 seconds (5,000 × 100 µs) of the 17.7 seconds are accounted for by this calculation.

Regardless, the performance difference between `cov` and `lstsq` is a matter of percentages, not orders of magnitude.
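
The two approaches are interchangeable for this purpose because the slope of a least-squares line with an intercept equals covariance over variance. A minimal sketch confirming this on random data (the arrays `x` and `y` here are fresh stand-ins for market and stock returns, and `rcond=None` is passed only to select numpy's current default behaviour):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-0.05, 0.05, 100)   # stand-in for market returns
y = rng.uniform(-0.05, 0.05, 100)   # stand-in for stock returns

# Route 1: covariance over variance (both with ddof=1)
beta_cov = np.cov(x, y)[0][1] / np.var(x, ddof=1)

# Route 2: slope of the least-squares fit y = m*x + c
A = np.vstack([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y, rcond=None)[0]

print(np.isclose(beta_cov, m))  # True
```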

In [72]:
start = datetime.datetime(2012,1,1)
end = start + datetime.timedelta(days=window)
data = df.loc[(df.index < end) & (df.index > start)][[tickers[0], tickers[1]]]

In [73]:
x = np.array(data[tickers[0]])
y = np.array(data[tickers[1]])
A = np.vstack([x, np.ones(len(x))]).T

In [77]:
%timeit cov = np.cov(x, y)[0][1]

The slowest run took 4.59 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 91.4 µs per loop


In [78]:
%timeit m, c = np.linalg.lstsq(A, y)[0]

The slowest run took 6.22 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 114 µs per loop


In [76]:
print(m, c)

-0.0801992299017 0.00310402928554


## Calculation Method 3: Using Pandas Rolling Functions

We use pandas' built-in rolling window functions, which are implemented in Cython under the hood. These highly optimized functions gave blazing fast performance: this method did all calculations (for all 1,000 tickers) in 5.81s.

`Last executed in 5.81s`

Versus calculation method 1 above, which was 17.7s for 1 ticker, this is over **3000x** faster.

In [58]:
covs = df.rolling(window=window).cov(df[mkt], pairwise=True)
var = df[mkt].rolling(window=window).var()
beta = covs.div(var,axis=0)
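
As a sanity check, the rolling approach can be compared with a direct calculation on the final window. This is a sketch on a small made-up frame; the names `small`, `win`, `MKT`, `AAA` and `BBB` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
idx = pd.date_range("2000-01-01", periods=300)
small = pd.DataFrame(rng.uniform(-0.05, 0.05, (300, 3)),
                     index=idx, columns=["MKT", "AAA", "BBB"])
win = 100

# Rolling covariance of every column against the market, then divide by market variance
covs = small.rolling(window=win).cov(small["MKT"])
var = small["MKT"].rolling(window=win).var()
beta = covs.div(var, axis=0)

# Direct calculation over the last `win` rows for one ticker
last = small.iloc[-win:]
direct = np.cov(last["AAA"], last["MKT"])[0][1] / np.var(last["MKT"], ddof=1)

print(np.isclose(beta["AAA"].iloc[-1], direct))  # True
```

The first `win - 1` rows of `beta` are NaN because a full window is not yet available, and the market's beta against itself is 1 wherever it is defined.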

In [59]:
# difference
(17.7 * 1000) / 5.81

3046.4716006884682