# <span style='font-family: CMU Sans Serif, sans-serif;'> Data cleaning  </span> 

Now that we have dealt with missing data (in a rather crude way), we can move on to cleaning the data. This involves outlier detection and scaling. Following the "Factor Investment" book, we uniformize the features: at each point in time, the cross-sectional distribution of each feature is made uniform over the unit interval.

In [1]:
import pandas as pd 
import numpy as np
import json
import warnings

from sklearn.preprocessing import QuantileTransformer
from scipy.stats.mstats import winsorize
from tqdm import tqdm  

## <span style='font-family: CMU Sans Serif, sans-serif;'> Data import  </span> 

We first import the data from which the missing values have already been removed.

In [2]:
# Import primary features
with open('listPrimaryFeaturesRemaining.json', 'r') as f:
    listPrimaryFeatures = json.load(f)

# Import dataframe cleaned for missing values
dataGfdUs = pd.read_csv('usa__gfd__cleaned_v1.csv')

# Define eom as a datetime object
dataGfdUs['eom'] = pd.to_datetime(dataGfdUs['eom'])

## <span style='font-family: CMU Sans Serif, sans-serif;'> Outlier detection  </span> 

We use winsorization to deal with outliers. It is applied date-by-date and feature-by-feature: for each value of `eom` (the date column) and each feature in `listPrimaryFeatures`, the lowest 2.5% and highest 2.5% of the cross-section are replaced by the nearest retained value. This is done below.

In [3]:
def winsorize_features(features, lower = 0.025, upper = 0.025):
    return winsorize(features, limits = [lower, upper])
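
To see what winsorization does to a single cross-section, here is a small sketch on made-up numbers. The array and the 10% limits are illustrative assumptions, chosen so that exactly one value is replaced in each tail; the notebook itself uses 2.5% per tail.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical cross-section of one feature on one date (not real data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Replace the lowest 10% and highest 10% of values with the nearest retained value.
# winsorize returns a masked array; np.asarray gives back a plain ndarray.
x_w = np.asarray(winsorize(x, limits=[0.1, 0.1]))

# The 1.0 becomes 2.0 and the 100.0 becomes 9.0; the input array is left untouched
print(x_w)
```

Note that values are replaced rather than dropped, so the number of observations per date stays the same.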

In [4]:
# Create a copy of the already existing data
dataGfdUs_v1 = dataGfdUs.copy()

# Get all dates
dates = dataGfdUs_v1['eom'].unique()

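# Winsorize each feature within each date's cross-section
for date in tqdm(dates, desc="Processing dates"):
    # Row mask for the current date (dataGfdUs_v1 shares its index with dataGfdUs)
    mask = dataGfdUs_v1['eom'] == date
    for feature in listPrimaryFeatures:
        dataGfdUs_v1.loc[mask, feature] = winsorize_features(dataGfdUs.loc[mask, feature])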

Processing dates: 100%|██████████| 623/623 [02:38<00:00,  3.94it/s]
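
The double loop above rebuilds a boolean mask for every date, which is part of why it takes a few minutes. A possible alternative, sketched here under assumptions, is to let pandas do the grouping once and apply the same `winsorize` call per group. The helper name `winsorize_by_date` and the toy frame are hypothetical and only serve to illustrate the idea; this has not been run on the actual data.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

def winsorize_by_date(df, features, date_col="eom", lower=0.025, upper=0.025):
    """Winsorize each feature within each date group (hypothetical helper)."""
    out = df.copy()
    grouped = df.groupby(date_col, sort=False)
    for feature in features:
        # transform keeps the original row order and index
        out[feature] = grouped[feature].transform(
            lambda s: np.asarray(winsorize(s.to_numpy(), limits=[lower, upper]))
        )
    return out

# Toy data: two dates, ten observations each, one extreme value per date
toy = pd.DataFrame({
    "eom": pd.to_datetime(["2020-01-31"] * 10 + ["2020-02-29"] * 10),
    "feat": [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 100, -50.0, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# 10% limits are used here only so that the effect is visible on ten rows
toy_w = winsorize_by_date(toy, ["feat"], lower=0.1, upper=0.1)
print(toy_w)
```

The per-group computation is the same call as in `winsorize_features`, so the results should match the loop; only the bookkeeping differs.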


## <span style='font-family: CMU Sans Serif, sans-serif;'> Scaling data  </span> 

Now that the outliers have been winsorized, we move on to scaling. As is done in the book, we scale date-by-date and feature-by-feature. This is done below.

In [5]:
# Initialize scaler
scaler = QuantileTransformer(output_distribution = 'uniform')

# Create a copy of the already existing data
dataGfdUs_v2 = dataGfdUs_v1.copy()

# Get all dates
dates = dataGfdUs_v2['eom'].unique()

# Scale data; silence the warning raised when a date has fewer samples
# than the default n_quantiles (1000)
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message=".*n_quantiles.*greater than the total number of samples.*")
    
    for date in tqdm(dates, desc="Processing dates"):
        # Row mask for the current date
        mask = dataGfdUs_v1['eom'] == date
        for feature in listPrimaryFeatures:
            # fit_transform returns an (n, 1) array; flatten it before assigning to the column
            dataGfdUs_v2.loc[mask, feature] = scaler.fit_transform(dataGfdUs_v1.loc[mask, [feature]]).ravel()

Processing dates: 100%|██████████| 623/623 [04:30<00:00,  2.30it/s]
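
For intuition, the uniformization can also be approximated with plain cross-sectional ranks. The sketch below is a swapped-in, rank-based technique and not the `QuantileTransformer` used above: within each date, each value is mapped to `(rank - 1) / (n - 1)`, so the smallest value goes to 0 and the largest to 1. The helper name `uniformize_by_date` and the toy data are hypothetical.

```python
import pandas as pd

def uniformize_by_date(df, features, date_col="eom"):
    """Map each feature to [0, 1] via cross-sectional ranks (hypothetical helper)."""
    out = df.copy()
    grouped = df.groupby(date_col, sort=False)
    # Average ranks within each date (ties share the same rank)
    ranks = grouped[features].rank(method="average")
    # Number of non-missing observations per date, broadcast to every row
    counts = grouped[features].transform("count")
    # A date with a single observation yields NaN here (0 / 0)
    out[features] = (ranks - 1) / (counts - 1)
    return out

# Toy data: three observations on the first date, four (with a tie) on the second
toy = pd.DataFrame({
    "eom": pd.to_datetime(["2020-01-31"] * 3 + ["2020-02-29"] * 4),
    "feat": [30.0, 10.0, 20.0, 5.0, 5.0, 7.0, 9.0],
})

toy_u = uniformize_by_date(toy, ["feat"])
print(toy_u)
```

Because only ranks matter here, the result does not depend on the scale of the raw feature, which is the point of uniformizing in the first place.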


## <span style='font-family: CMU Sans Serif, sans-serif;'> Saving data  </span> 

Now that the data have been fully cleaned, we are ready to fit neural networks. We therefore export the final cleaned data as a CSV, ready for further analysis.

In [6]:
dataGfdUs_v2.to_csv("usa__gfd__cleaned_v2.csv", index = False)
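
Before relying on the exported file, it can be worth checking that every scaled feature really lies in the unit interval on every date. The following is a minimal sketch of such a check; the helper name `check_uniformized`, the tolerance, and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def check_uniformized(df, features, date_col="eom", tol=1e-9):
    """Return per-date min/max of each feature and whether all values lie in [0, 1]."""
    # Per-date minimum and maximum of each feature
    stats = df.groupby(date_col)[features].agg(["min", "max"])
    # True only if every value of every feature is within [0, 1] up to tol
    in_range = bool(((df[features] >= -tol) & (df[features] <= 1 + tol)).all().all())
    return stats, in_range

# Toy frame standing in for the scaled data
toy_scaled = pd.DataFrame({
    "eom": pd.to_datetime(["2020-01-31"] * 3 + ["2020-02-29"] * 3),
    "feat": [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
})

stats, ok = check_uniformized(toy_scaled, ["feat"])
print(stats)
print("All values in [0, 1]:", ok)
```

On the real data, the call would presumably be `check_uniformized(dataGfdUs_v2, listPrimaryFeatures)`.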