## Master's Thesis - Machine Learning in Asset Pricing

### Thomas Theodor Kjølbye 

The following script handles all data used in the paper. On my computer, the entire script takes approximately 5 minutes to run. The data consist of individual firm characteristics and macroeconomic variables, generously made available by Professors Gu, Kelly, Xiu, and Goyal. 

This script is the second of two data processing scripts. The first one does not compute interaction terms or reduce dimensionality by PCA, but it does perform the rank normalization used in Gu, Kelly, and Xiu (2020). 
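
For reference, the rank normalization mentioned above can be sketched as follows. This is a minimal illustration, not the code of the first script: the function name `rank_normalize`, the column name `mom12m`, and the toy data are assumptions. It ranks each characteristic within each month and maps the ranks linearly to [-1, 1]; missing values are left untouched.

```python
import pandas as pd

def rank_normalize(df, char_cols, date_col="date"):
    """Map each characteristic to [-1, 1] by its cross-sectional rank within each date."""
    out = df.copy()
    grouped = out.groupby(date_col)[char_cols]
    ranks = grouped.rank(method="average")     # 1..n within each month, NaN preserved
    counts = grouped.transform("count")        # non-missing observations per month
    denom = (counts - 1).where(counts > 1)     # NaN instead of dividing by zero
    out[char_cols] = 2.0 * (ranks - 1.0) / denom - 1.0
    return out

# Hypothetical toy panel: three firms in the first month, two in the second
toy = pd.DataFrame({
    "date": [197501, 197501, 197501, 197502, 197502],
    "mom12m": [10.0, 30.0, 20.0, 5.0, 7.0],
})
normalized = rank_normalize(toy, ["mom12m"])
```

Within each month the smallest value maps to -1 and the largest to 1, so the transformed characteristics are insensitive to outliers and comparable across months.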

In [1]:
# Load the usual suspects
import csv
import pandas as pd
import numpy as np
import os
import Toolbox as tb
import time as time
from sklearn.decomposition import PCA

%load_ext autoreload
%autoreload 2

In [2]:
# Initialize warning log container 
log = list()

In [3]:
os.getcwd()

'C:\\Users\\thoma\\OneDrive - Københavns Universitet\\Documents\\Økonomi - Kandidat\\6. Semester\\Speciale\\Masters-Thesis'

In [4]:
# Load monthly returns data from CRSP and process data
returns = pd.read_csv(os.path.dirname(os.getcwd()) + '\\RET.txt')
returns.columns = returns.columns.str.strip()
returns.columns = returns.columns.str.lower() 
returns.rename(columns = {"col1":"date"}, inplace = True)
returns["date"] = returns["date"].floordiv(100)
returns_raw = returns[["date", "permno", "ret"]]

# Load macroeconomic predictors and clean it
macro_data_raw = pd.read_csv(os.path.dirname(os.getcwd()) + '\\macro_data.csv') 
macro_data_raw = macro_data_raw[(macro_data_raw["date"] > 195612) & (macro_data_raw["date"] < 201701)]
macro_data_raw["constant"] = 1 # Add column of ones for interaction terms later

# Load firm characteristics
firm_data_raw = pd.read_csv(os.path.dirname(os.getcwd()) + '\\datashare.zip')
firm_data_raw.columns = firm_data_raw.columns.str.lower()
firm_data_raw["date"] = firm_data_raw["date"].floordiv(100) # https://stackoverflow.com/questions/33034559/how-to-remove-last-the-two-digits-in-a-column-that-is-of-integer-type
firm_data_raw.set_index(["permno", "date"], inplace = True)
print("The firm characteristics dataset is {:1.3f} GB".format(firm_data_raw.memory_usage().sum()/(1024 ** 3)))

# I have refrained from saving the firm characteristics in my working directory (repo) because I am unable to push 3 GB of data to GitHub.

The firm characteristics dataset is 2.939 GB


In [9]:
# Downcast from 64-bit floats and ints to 32-bit
tb.downcast(firm_data_raw)
tb.downcast(returns_raw)
tb.downcast(macro_data_raw)

Before downcast: 1.481 GB and float32    95
dtype: int64
After downcast: 1.481 GB and float32    95
dtype: int64
Before downcast: 0.045 GB and int32      2
float32    1
dtype: int64
After downcast: 0.045 GB and int32      2
float32    1
dtype: int64
Before downcast: 0.000 GB and float32    8
int32      1
int8       1
dtype: int64
After downcast: 0.000 GB and float32    8
int32      1
int8       1
dtype: int64
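
`Toolbox` is not shown in this notebook, so the following is only a sketch of what a helper like `tb.downcast` might do, assuming it relies on `pd.to_numeric` with the `downcast` argument. The name `downcast_sketch` and the toy frame are hypothetical. Note that `float32` keeps roughly seven significant digits, which is the price of halving the memory footprint.

```python
import pandas as pd

def downcast_sketch(df):
    """Downcast numeric columns in place; return memory use in GB before and after."""
    before = df.memory_usage().sum() / 1024 ** 3
    for col in df.columns:
        if pd.api.types.is_float_dtype(df[col]):
            # Smallest float type pandas will pick here is float32
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif pd.api.types.is_integer_dtype(df[col]):
            # Smallest integer type that holds every value (int8, int16, ...)
            df[col] = pd.to_numeric(df[col], downcast="integer")
    after = df.memory_usage().sum() / 1024 ** 3
    return before, after

# Hypothetical toy frame mirroring the column kinds in the macro data
toy = pd.DataFrame({
    "ret": [0.5, 1.25, -2.0],
    "date": [197501, 197502, 197503],
    "constant": [1, 1, 1],
})
before_gb, after_gb = downcast_sketch(toy)
```

In this toy frame the six-digit dates need `int32`, while the column of ones fits in `int8`, which matches the mix of dtypes reported in the output above.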


In [6]:
# Filter out rows missing in returns data by merging
data = firm_data_raw.reset_index().merge(returns_raw, on = ["permno", "date"], how = "inner")

# Broadcast macroeconomic data to the firm-month panel by merging on date
data = data.merge(macro_data_raw, on = "date")
data = data.set_index(["permno", "date"]) # 3760315 x 105 (94 char, industry dummy, returns, 8 macro, ones)
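
The two merges above can be illustrated on a toy panel. This is a sketch under assumed data: the column names `bm` and `dp` and all values are hypothetical. The inner merge drops firm-months without a return, and the merge on `date` repeats each month's macro row for every firm observed in that month. The `validate` argument is optional and only guards against duplicate dates in the macro table.

```python
import pandas as pd

# Hypothetical toy inputs
firm = pd.DataFrame({"permno": [1, 1, 2],
                     "date": [197501, 197502, 197501],
                     "bm": [0.8, 0.9, 1.1]})
rets = pd.DataFrame({"permno": [1, 2],
                     "date": [197501, 197501],
                     "ret": [0.02, -0.01]})
macro = pd.DataFrame({"date": [197501, 197502],
                      "dp": [0.03, 0.04]})

# Keep only firm-months that also appear in the returns data
panel = firm.merge(rets, on=["permno", "date"], how="inner")

# Attach the month's macro values to every remaining firm-month
panel = panel.merge(macro, on="date", validate="many_to_one")
panel = panel.set_index(["permno", "date"])
```

Firm 1 in 197502 has no return and is dropped, so the resulting panel has two rows, both carrying the 197501 macro value.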

In [7]:
# Split the dataset into 1) firm chars + macro (FM), 2) returns, and 3) industry codes

# 1)
FM_todrop = ["ret", "sic2"]
FM_data = data.drop(FM_todrop, axis = 1).reset_index()

# 2) 
returns = data.ret

# 3) 
industry_code = data.sic2

In [8]:
# Save FM data
FM_data.to_csv(os.path.dirname(os.getcwd()) + '\\FM_data.csv', header = True, index = False)

In [10]:
### NOTE: The code above can be skipped once FM_data.csv has been saved to disk

# Load FM data
FM_data = pd.read_csv(os.path.dirname(os.getcwd()) + '\\FM_data.csv')
tb.downcast(FM_data)

Before downcast: 2.942 GB and float64    102
int64        3
dtype: int64
After downcast: 1.460 GB and float32    102
int32        2
int8         1
dtype: int64


In [60]:
# Dates partitioning the training, validation, and test set
tv_dates = [197501, 197601, 197701, 197801, 197901, 198001, 198101, 198201, 198301, 198401, 198501, 198601]
v_dates = [197601, 197701, 197801, 197901, 198001, 198101, 198201, 198301, 198401, 198501, 198601, 198701]
t_dates = [198701, 198701, 198701, 198701, 198701, 198701, 198701, 198701, 198701, 198701, 198701, 198701]

# Select a single split (index 1): tv_dates = 197601, v_dates = 197701, t_dates = 198701
tv_dates = tv_dates[1]
v_dates = v_dates[1]
t_dates = t_dates[1]

In [36]:
# Split and clean data according to selected dates
tfirm, vfirm, ttfirm, tmacro, vmacro, ttmacro = tb.data_processing(data = FM_data, TV_date = tv_dates, 
                                                                   V_date = v_dates, T_date = t_dates)

In [38]:
# Compute interaction terms for training and validation data
interaction_t, mean, std = tb.interaction(tfirm, tmacro)
interaction_v, _, _ = tb.interaction(vfirm, vmacro)
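
Since `tb.interaction` is defined in `Toolbox` and not shown here, the following is only a sketch of the construction it presumably performs: a row-wise Kronecker product of the macro variables (including the constant) with the firm characteristics, followed by standardization. The function names and the toy arrays are assumptions. With 94 characteristics and 8 macro variables plus a constant, this construction would give 94 x 9 = 846 columns.

```python
import numpy as np

def row_kron(firm, macro):
    """Row-wise Kronecker product: every macro column times every firm column."""
    n = firm.shape[0]
    # (n, m, 1) * (n, 1, p) -> (n, m, p), flattened to (n, m * p)
    return (macro[:, :, None] * firm[:, None, :]).reshape(n, -1)

def standardize(z, mean=None, std=None):
    """Standardize columns; reuse a supplied mean/std (e.g. from the training set)."""
    if mean is None:
        mean, std = z.mean(axis=0), z.std(axis=0)
    safe_std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (z - mean) / safe_std, mean, std

# Hypothetical toy data: two observations, two characteristics
firm = np.array([[1.0, 2.0], [3.0, 4.0]])
macro = np.array([[1.0, 10.0], [1.0, 20.0]])  # first column is the constant

z_raw = row_kron(firm, macro)
z_train, mean, std = standardize(z_raw)
z_reused, _, _ = standardize(z_raw, mean=mean, std=std)
```

Because the first macro column is the constant, the first block of interaction columns reproduces the characteristics themselves. Passing the training `mean` and `std` to later calls keeps information from the evaluation data out of the scaling.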

In [41]:
# Compute interaction terms for test data
tb.interaction_noRAM(ttfirm, ttmacro, mean = mean, std = std, filename = 'interaction_tt.csv')

In [42]:
# PCA: Training data
pca = PCA(n_components = 0.95)
pca.fit(interaction_t)
pca_data_t = pca.transform(interaction_t) 
pca_data_t = pd.DataFrame(pca_data_t)


# PCA: Validation data
pca_data_v = pca.transform(interaction_v)
pca_data_v = pd.DataFrame(pca_data_v)
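
Passing a float such as `n_components = 0.95` asks scikit-learn to keep as many components as are needed for the cumulative explained variance to exceed 95%. The NumPy sketch below mirrors that selection rule on toy data; the function name and the data are assumptions made for illustration.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio exceeds the threshold."""
    Xc = X - X.mean(axis=0)                      # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    ratio = s ** 2 / np.sum(s ** 2)              # explained-variance ratios
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, threshold, side="right") + 1)
    return k, cumulative

# Toy data: four orthogonal, mean-zero columns with variance shares 80/12/6/2 percent
signs = np.array([
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
], dtype=float).T
X = signs * np.sqrt([80.0, 12.0, 6.0, 2.0])

k, cumulative = n_components_for_variance(X)
```

Here the cumulative shares are 0.80, 0.92, 0.98, and 1.00, so three components are kept. The component count is determined on the training data only; validation and test data are projected onto the same fitted components.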

In [46]:
# PCA: Test data
tb.save_txt(name = os.path.dirname(os.getcwd()) + "\\" + 'interaction_tt.csv', newfilename = 'pca_data_tt.csv',
                    pc = pca.components_.T)
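
`tb.save_txt` is not shown, so the sketch below only illustrates the general idea of projecting a large matrix in chunks; the function name and toy data are assumptions. One point worth checking: scikit-learn's `PCA.transform` subtracts the fitted `mean_` before multiplying by `components_.T`. Multiplying by `components_.T` alone gives the same scores only if the training means are (close to) zero, which would hold if the interaction terms were standardized with the training mean and standard deviation.

```python
import numpy as np

def project_in_chunks(X, components_T, mean=None, chunk_size=1000):
    """Project rows of X onto principal components a few rows at a time."""
    pieces = []
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        if mean is not None:
            chunk = chunk - mean             # same centering PCA.transform applies
        pieces.append(chunk @ components_T)
    return np.vstack(pieces)

# Hypothetical toy data; W plays the role of pca.components_.T
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 5))
X_test = rng.normal(size=(23, 5)) + 0.5
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
W = Vt[:3].T

full = (X_test - mean) @ W
chunked = project_in_chunks(X_test, W, mean=mean, chunk_size=10)
uncentered = project_in_chunks(X_test, W, chunk_size=10)
```

The chunked and full projections agree exactly, while the uncentered scores differ from them by the constant offset `mean @ W` in every row.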

In [48]:
# Load test data post PCA
pca_data_tt = pd.read_csv(os.path.dirname(os.getcwd()) + "\\pca_data_tt.csv", header = None)
tb.downcast(pca_data_tt) # 2.753 GB before, 1.377 GB after

Before downcast: 2.746 GB and float64    147
dtype: int64
After downcast: 1.373 GB and float32    147
dtype: int64


In [49]:
# Prep industry codes for merge with remainder of predictors (covariates)
industry_dummies_t, industry_dummies_v, industry_dummies_tt = tb.dummies(data = industry_code, TV_date = tv_dates, 
                                                                         V_date = v_dates, T_date = t_dates)

# Prep returns data for merge with predictors (covariates)
returns_t = returns[returns.index.get_level_values("date") < tv_dates].reset_index()
returns_v = returns[(returns.index.get_level_values("date") >= tv_dates) & 
                                                  (returns.index.get_level_values("date") < v_dates)].reset_index()
returns_tt = returns[returns.index.get_level_values("date") >= t_dates].reset_index()


In [53]:
# Merge data and save to disk

# Training
data_t = pd.concat([pca_data_t, industry_dummies_t], axis = 1)
data_t = data_t.merge(returns_t, on = ["permno", "date"], how = "inner").set_index(["permno", "date"])
data_t.to_csv(os.path.dirname(os.getcwd()) + '\\Data' +'\\data_t_01.csv', header = False, index = True)

# Validation
data_v = pd.concat([pca_data_v, industry_dummies_v], axis = 1)
data_v = data_v.merge(returns_v, on = ["permno", "date"], how = "inner").set_index(["permno", "date"])
data_v.to_csv(os.path.dirname(os.getcwd()) + '\\Data' +'\\data_v_01.csv', header = False, index = True)

# Test
data_tt = pd.concat([pca_data_tt, industry_dummies_tt], axis = 1)
data_tt = data_tt.merge(returns_tt, on = ["permno", "date"], how = "inner").set_index(["permno", "date"])
data_tt.to_csv(os.path.dirname(os.getcwd()) + '\\Data' +'\\data_tt_01.csv', header = False, index = True)