In [77]:
# libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os

# 2. Compute Stock Price Correlations

In this notebook we compute the correlation between all possible stock pairs over a moving time frame.

First, we create a DataFrame of log returns for all stocks (calculated on Adjusted Close):

In [78]:
# path to prices data
path = '../data/prices/'

# create empty dataframe (only business days)
data = pd.DataFrame(index=pd.date_range(start="1990-01-01", end="2021-01-01", freq='B'))

# iterate over files and add columns
files = os.listdir(path)
for file in files: 
    
    # import (parsing the index as dates so it aligns with the DatetimeIndex of `data`),
    # keep log returns and rename the column to the ticker
    price = pd.read_csv(os.path.join(path, file), index_col=0, parse_dates=True)
    price = price[['LogRet_AdjClose']]
    price.columns = [file.replace('.csv', '')]
    
    # merge on the date index
    data = pd.merge(data, price, left_index=True, right_index=True, how='outer')
    
data.head()

Unnamed: 0,CSCO,UAL,TROW,ISRG,NVR,PRGO,TPR,DVN,CE,MRO,...,CRM,PGR,WAT,IEX,BWA,LRCX,NWL,UAA,BLK,PPL
1990-01-01,,,,,,,,,,,...,,,,,,,,,,
1990-01-02,,,,,,,,,,,...,,,,,,,,,,
1990-01-03,,,0.016529,,0.0,,,0.018692,,-0.006969,...,,0.006472,,0.0,,-0.022473,-0.005222,,,0.002903
1990-01-04,,,0.024293,,0.024098,,,0.0,,0.024182,...,,-0.006472,,-0.029852,,-0.09531,-0.00525,,,-0.008734
1990-01-05,,,-0.008033,,0.02353,,,-0.018692,,-0.013746,...,,0.003242,,-0.007604,,0.0,0.0,,,-0.00881
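
The `LogRet_AdjClose` column is read from precomputed files, and this notebook does not show how it was built. As a minimal sketch, assuming the usual definition of a log return, r_t = ln(P_t / P_{t-1}), it could be derived as follows. The column name `Adj Close` and the prices are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical adjusted close prices (values are made up for illustration)
prices = pd.DataFrame(
    {"Adj Close": [100.0, 102.0, 102.0, 99.0]},
    index=pd.to_datetime(["1990-01-02", "1990-01-03", "1990-01-04", "1990-01-05"]),
)

# Log return: r_t = ln(P_t / P_{t-1}).
# The first day has no previous price, so its log return is NaN.
prices["LogRet_AdjClose"] = np.log(prices["Adj Close"] / prices["Adj Close"].shift(1))
print(prices)
```

Under this definition, an unchanged price gives a log return of exactly 0, and the first observation of each stock is always NaN.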


Then, we define a function to compute the pairwise correlation between stocks. We consider three types of correlation coefficient:
- [Pearson](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)
- [Kendall](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html)
- [Spearman](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html)
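
To illustrate how the three coefficients can differ, here is a small sketch on toy data (not taken from the dataset). Pearson measures linear association, while Kendall and Spearman are rank-based, so a monotonic but non-linear relation is scored differently.

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau, spearmanr

# Toy data: y is a monotonic but strongly non-linear function of x
x = np.arange(1.0, 11.0)
y = np.exp(x)

# Each scipy function returns (statistic, p-value); we keep the statistic
r_pearson = pearsonr(x, y)[0]
tau_kendall = kendalltau(x, y)[0]
rho_spearman = spearmanr(x, y)[0]

print(f"pearson={r_pearson:.3f}, kendall={tau_kendall:.3f}, spearman={rho_spearman:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, the two rank-based coefficients equal 1, while the Pearson coefficient is lower.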

We also have to deal with NaN values. Some stocks start trading only after the beginning of our dataset (for example, new companies), others stop trading at some point (for example, failed companies), and there can be missing data here and there for other reasons. Therefore, we define a threshold: the minimum fraction of non-NaN values required to compute the correlation (otherwise we return a NaN correlation). Since we are computing pairwise correlations, a day counts as non-NaN only if both stocks have a value on that day.
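
The pairwise rule can be sketched on two made-up series. Note that each series on its own is 80% complete, but only 70% of the days are usable as pairs.

```python
import numpy as np

# Two made-up return series with gaps on partly different days
r1 = np.array([0.01, np.nan, 0.02, -0.01, 0.00, 0.03, np.nan, 0.01, -0.02, 0.01])
r2 = np.array([0.02, 0.01, np.nan, -0.02, 0.01, 0.02, np.nan, 0.00, -0.01, 0.02])

# A day is usable only if both series have a value on it
valid = ~(np.isnan(r1) | np.isnan(r2))
valid_fraction = valid.mean()

notnan_fraction = 0.9  # same default as in the correlation function of this notebook
enough_data = valid_fraction >= notnan_fraction
print(valid.sum(), valid_fraction, enough_data)
```

With a threshold of 0.9 this pair would be rejected and its correlation reported as NaN.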

In [67]:
def correlation(data, ticker1, ticker2, start, end, coeff="pearson", notnan_fraction=0.9):
    """
    This function computes the correlation between the series of two stocks
    over a given time period:
        :param data (pandas DataFrame): DataFrame with one column per ticker (log returns in this notebook)
        :param ticker1 (string): ticker of first stock
        :param ticker2 (string): ticker of second stock
        :param start (datetime): start of time window (included)
        :param end (datetime): end of time window (excluded)
        :param coeff (string, default="pearson"): type of correlation (possible values: 'pearson', 'kendall', 'spearman')
        :param notnan_fraction (float, default=0.9): fraction of non-nan pairs required in order to compute correlation
        :return: coefficient of correlation between the two series (nan if there is not enough data)
    """
    
    from scipy.stats import pearsonr, kendalltau, spearmanr
    
    # select data between start/end and get the two price time series
    data_window = data.loc[(data.index>=start) & (data.index<end)]
    price1 = data_window[ticker1].values
    price2 = data_window[ticker2].values
    
    # control nan values: if the fraction of non-nan pairs is smaller than the parameter
    # notnan_fraction we return a nan correlation (i.e. not enough data to compute it).
    # A day is a 'nan-pair' if one or both of the two values are nan on that day
    notnans = ~np.logical_or(np.isnan(price1), np.isnan(price2)) 
    
    # if enough non-nan data compute correlation
    if (notnans).sum() / len(price1) >= notnan_fraction:
        
        # compute correlation coefficient only on non-nan pairs (otherwise scipy raises an error)
        if coeff=="pearson": 
            return pearsonr(np.compress(notnans, price1), np.compress(notnans, price2))[0]

        elif coeff=="kendall": 
            return kendalltau(np.compress(notnans, price1), np.compress(notnans, price2))[0]

        elif coeff=="spearman":
            return spearmanr(np.compress(notnans, price1), np.compress(notnans, price2))[0]

        else: 
            raise ValueError("Coefficient name not recognised. Possible values: 'pearson', 'kendall', 'spearman'")
    
    # not enough non-nan data, return nan correlation
    else: 
        return np.nan
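
As a possible cross-check, or a faster alternative for the Pearson case, pandas' `DataFrame.corr` also drops NaN values pairwise, and its `min_periods` argument plays a role similar to `notnan_fraction`. The sketch below uses made-up data and hypothetical tickers. Converting the fraction into a count with `ceil` is an assumption about how the two thresholds should be matched.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Made-up returns for three hypothetical tickers, with some missing values
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(60, 3)), columns=["AAA", "BBB", "CCC"])
toy.loc[:4, "BBB"] = np.nan    # 5 missing rows: 55 of 60 pairs with AAA remain
toy.loc[:29, "CCC"] = np.nan   # 30 missing rows: only 30 of 60 pairs remain

# Translate the required fraction into a minimum number of common observations
notnan_fraction = 0.9
min_obs = int(np.ceil(notnan_fraction * len(toy)))

# Pairwise Pearson correlation; pairs with fewer than min_obs
# common observations are returned as NaN
corr = toy.corr(method="pearson", min_periods=min_obs)

# Cross-check one entry against scipy on the jointly non-NaN rows
both = toy[["AAA", "BBB"]].dropna()
r_scipy = pearsonr(both["AAA"], both["BBB"])[0]
print(corr)
print(r_scipy)
```

Unlike the pair-by-pair loop used in this notebook, which leaves the diagonal at 0, `DataFrame.corr` also fills the diagonal, so the two results would need to be aligned before comparing full matrices.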

Finally, we compute correlation matrices (i.e. matrices of the correlations between all pairs of stocks) over a moving time frame. To keep the data comparable across time steps, the rows and columns of each correlation matrix follow the column order of the stocks DataFrame. We save this ordering to a file so that rows and columns can be matched to tickers in the next notebooks:

In [74]:
pd.DataFrame(data.columns, columns=["ticker"]).to_csv("../data/tickers_order.csv", index=False)

We define the moving time frame to be about 3 months long (90 calendar days), moving forward in steps of about 1 month (30 calendar days). Let's compute the correlation matrices:

In [None]:
start = datetime(1990, 1, 1)
end   = datetime(2021, 1, 1)
current_date = start

window = 90  # days
step   = 30  # days

tickers = data.columns.values  # list of tickers

# move the time window until end of observations
while current_date + timedelta(days=window) < end:
    
    # show advancement
    print(current_date)
    
    # initialize correlation matrix for this frame
    corr_matrix = np.zeros((tickers.shape[0], tickers.shape[0]))
    
    # iterate over ticker pairs to compute correlations; the matrix is symmetric,
    # so each pair is computed once and mirrored (the diagonal stays at 0)
    for i in range(tickers.shape[0]):
        for j in range(i + 1, tickers.shape[0]):
            corr_matrix[i, j] = correlation(data, tickers[i], tickers[j], current_date, current_date + timedelta(days=window))
            corr_matrix[j, i] = corr_matrix[i, j]
                
    # save correlation matrix
    file_name = current_date.strftime("%Y%m%d") + "_" + (current_date + timedelta(days=window)).strftime("%Y%m%d") + ".npz"
    np.savez_compressed("../data/corr_matrices/" + file_name, corr_matrix)
    
    # advance window
    current_date += timedelta(days=step)

1990-01-01 00:00:00
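
For later notebooks, a saved matrix can be read back and matched to tickers through the ordering file. The following is a minimal round-trip sketch in a temporary directory, with hypothetical tickers and made-up values. It relies on the fact that an array passed positionally to `np.savez_compressed` is stored under the key `arr_0`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical tickers and a made-up 3x3 matrix, written the same way as in this notebook
tickers = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"]})
corr_matrix = np.array([[0.0, 0.5, np.nan],
                        [0.5, 0.0, -0.2],
                        [np.nan, -0.2, 0.0]])

with tempfile.TemporaryDirectory() as tmp:
    order_path = os.path.join(tmp, "tickers_order.csv")
    matrix_path = os.path.join(tmp, "19900101_19900401.npz")
    tickers.to_csv(order_path, index=False)
    np.savez_compressed(matrix_path, corr_matrix)

    # Arrays passed positionally to savez_compressed are stored as 'arr_0', 'arr_1', ...
    with np.load(matrix_path) as npz:
        loaded = npz["arr_0"]
    order = pd.read_csv(order_path)["ticker"].tolist()

# Map each ticker to its row/column position and look up one pair
position = {ticker: k for k, ticker in enumerate(order)}
value = loaded[position["BBB"], position["CCC"]]
print(value)
```

NaN entries mark pairs without enough data in that window, so downstream code should handle them explicitly, for example with `np.isnan`.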
