# Data Eng

The features X1 to X780 are highly collinear, so a vanilla model trained on the original data is unlikely to perform well. To improve performance, we have computed the correlation matrix of the features X1 to X780, and we will use it to generate new features for our models.

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from matplotlib import pyplot as plt

import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

In [9]:
train = pd.read_parquet('data/train.parquet')

## Data Exploration

We are going to cluster the features X1 to X780 based on their correlation distance. First, we are going to drop the order book data and the label.

In [10]:
Y = train['label']

order_features = list(train.columns[:5])

fe = order_features + ['label']

X = train.drop(columns=fe, inplace=False)

## Correlation between X features

We have computed the correlation matrix of X1 to X780 and stored it in Corr.csv.

In [3]:
C = pd.read_csv('Corr.csv', index_col=0)
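The notebook loads a precomputed matrix rather than computing it here. A minimal sketch of how such a file could be produced, assuming Pearson correlation via `DataFrame.corr` (the notebook does not state which method was used). The helper name `feature_correlation` and the toy frame are illustrative only.

```python
import numpy as np
import pandas as pd


def feature_correlation(features: pd.DataFrame) -> pd.DataFrame:
    """Return the feature-by-feature Pearson correlation matrix.

    Assumption: Pearson correlation; the original Corr.csv may have been
    built differently.
    """
    return features.corr(method="pearson")


# Toy stand-in for the real X (rows = timestamps, columns = features).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "X1": base,
    "X2": base + 0.1 * rng.normal(size=200),  # nearly collinear with X1
    "X3": rng.normal(size=200),               # unrelated feature
})

toy_corr = feature_correlation(toy)
# On the real data one could then persist it, e.g. toy_corr.to_csv("Corr.csv").
```

On the full 780-column frame the same call returns a 780 x 780 matrix, which is what the clustering step below consumes.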

## Clustering the X features

We define the pairwise distance between any two Xi and Xj by

dist(Xi, Xj) = 1 - |corr(Xi, Xj)|

Then we build an average-linkage hierarchy from this distance matrix and cut it into flat clusters of X features using scipy.cluster.hierarchy.fcluster.

In [5]:
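from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist, squareform

# Correlation distance: strongly correlated or anti-correlated features are close.
dist = 1 - C.abs()

# linkage expects the condensed (upper-triangular) form of the distance matrix.
condensed_dist = squareform(dist)

# Average-linkage hierarchical clustering.
Z = linkage(condensed_dist, method='average')

# Form flat clusters by cutting the hierarchy at a distance of 0.3.
cluster_labels = fcluster(Z, t=0.3, criterion='distance')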
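The cut-off `t=0.3` determines how many clusters come out of the hierarchy. As a rough sketch of how one might inspect that dependence, the hypothetical helper `count_clusters` below repeats the same steps (absolute-correlation distance, average linkage, distance criterion) for several thresholds. The toy data and the thresholds are assumptions for illustration, not values from this notebook.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def count_clusters(corr: pd.DataFrame, thresholds) -> dict:
    """Map each distance threshold t to the number of flat clusters it yields."""
    d = 1.0 - np.abs(corr.to_numpy(dtype=float))
    d = np.clip((d + d.T) / 2.0, 0.0, None)  # exact symmetry, no tiny negatives
    np.fill_diagonal(d, 0.0)                 # exact zero diagonal
    Z = linkage(squareform(d, checks=False), method="average")
    return {
        t: len(np.unique(fcluster(Z, t=t, criterion="distance")))
        for t in thresholds
    }


# Toy data: 3 independent latent signals, each observed through 4 noisy copies.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
noisy = np.repeat(latent, 4, axis=1) + 0.05 * rng.normal(size=(500, 12))
toy_corr = pd.DataFrame(noisy).corr()

counts = count_clusters(toy_corr, thresholds=[0.001, 0.3, 0.999])
```

Smaller thresholds give more, tighter clusters; larger thresholds merge more features together. On the toy data the three latent groups are recovered at `t=0.3`.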

We store the cluster labels in clusters.csv; there are 174 different clusters.

In [15]:
cluster_labels = pd.Series(cluster_labels, index = X.columns)

cluster_labels.to_csv('clusters.csv', index = True)

In [19]:
np.unique(cluster_labels)

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174], dtype=int32)

## Generating New Features Based on the Clusters

We take the mean of the features in each cluster as a new feature, which reduces the number of X features from 780 to 174.

In [21]:
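Xt = X.T                                      # rows = features, columns = timestamps

S = Xt.groupby(cluster_labels).agg('mean')    # mean of the features within each cluster

X_meta = S.T                                  # back to rows = timestamps, columns = cluster ids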
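To make the transpose, group, transpose pattern concrete, here is a small sketch on a toy frame. The helper name `cluster_means` and the toy values are illustrative only; the logic mirrors the cell above.

```python
import pandas as pd


def cluster_means(features: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """Average the columns of `features` that share a cluster label."""
    # Transpose so that features become rows, group those rows by cluster id,
    # then transpose back so that rows are observations again.
    return features.T.groupby(labels).mean().T


toy = pd.DataFrame({"X1": [1.0, 2.0], "X2": [3.0, 4.0], "X3": [10.0, 20.0]})
toy_labels = pd.Series({"X1": 1, "X2": 1, "X3": 2})  # X1 and X2 share cluster 1

toy_meta = cluster_means(toy, toy_labels)
# Column 1 is the row-wise mean of X1 and X2; column 2 is X3 unchanged.
```

Grouping the transposed frame avoids column-wise `groupby(axis=1)`, which, as far as I know, is deprecated in recent pandas releases.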

We add the order book data to these features and save the result in a new DataFrame called X_clustered.
We will use X_clustered to train our first model. We run the same procedure on the test data to get X_clustered_test.

In [25]:
X_clustered = pd.concat([train[order_features], X_meta], axis=1)

X_clustered

| | bid_qty | ask_qty | buy_qty | sell_qty | volume | 1 | 2 | 3 | 4 | 5 | ... | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-03-01 00:00:00 | 15.283 | 8.425 | 176.405 | 44.984 | 221.389 | 0.719700 | 0.381609 | 0.161881 | -0.566295 | -0.367657 | ... | 0.177609 | -1.180835 | -0.084300 | -0.443547 | -0.835094 | -1.152707 | -0.183498 | -0.258361 | 0.117485 | 0.052208 |
| 2023-03-01 00:01:00 | 38.590 | 2.336 | 525.846 | 321.950 | 847.796 | 0.718563 | 0.373820 | 0.161809 | -0.567102 | -0.367698 | ... | 0.176786 | -1.128351 | -0.118967 | -0.456249 | -0.835419 | -1.153453 | -0.195886 | -0.270320 | 0.067916 | 0.019157 |
| 2023-03-01 00:02:00 | 0.442 | 60.250 | 159.227 | 136.369 | 295.596 | 0.717429 | 0.366352 | 0.161815 | -0.567908 | -0.367737 | ... | 0.175965 | -1.079697 | -0.147331 | -0.468449 | -0.835742 | -1.154196 | -0.207796 | -0.281506 | 0.027359 | -0.007884 |
| 2023-03-01 00:03:00 | 4.865 | 21.016 | 335.742 | 124.963 | 460.705 | 0.716297 | 0.359185 | 0.161447 | -0.568712 | -0.367776 | ... | 0.175145 | -1.034549 | -0.170537 | -0.480169 | -0.836064 | -1.154935 | -0.219248 | -0.291971 | -0.005824 | -0.030009 |
| 2023-03-01 00:04:00 | 27.158 | 3.451 | 98.411 | 44.407 | 142.818 | 0.715167 | 0.352297 | 0.161444 | -0.569514 | -0.367813 | ... | 0.174327 | -0.992616 | -0.189525 | -0.491434 | -0.836385 | -1.155672 | -0.230265 | -0.301761 | -0.032974 | -0.048111 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-02-29 23:55:00 | 4.163 | 6.805 | 39.037 | 55.351 | 94.388 | -0.021544 | 0.019561 | -0.100629 | -0.614978 | -0.346103 | ... | -2.454413 | -1.178194 | 1.830766 | 0.857081 | 2.213995 | 0.249071 | 1.927000 | 0.505912 | 0.385324 | 0.373993 |
| 2024-02-29 23:56:00 | 2.290 | 4.058 | 110.201 | 67.171 | 177.372 | -0.021509 | 0.019299 | -0.100721 | -0.615785 | -0.346645 | ... | -2.451530 | -1.140714 | 1.447905 | 0.797025 | 2.203201 | 0.243710 | 1.860089 | 0.444646 | 0.287057 | 0.282436 |
| 2024-02-29 23:57:00 | 5.237 | 3.640 | 70.499 | 30.753 | 101.252 | -0.021474 | 0.019040 | -0.100903 | -0.616591 | -0.347171 | ... | -2.448651 | -1.105740 | 1.134656 | 0.739697 | 2.192453 | 0.238371 | 1.795440 | 0.387333 | 0.206656 | 0.207525 |
| 2024-02-29 23:58:00 | 5.731 | 4.901 | 22.365 | 52.195 | 74.560 | -0.021439 | 0.018786 | -0.100623 | -0.617394 | -0.347682 | ... | -2.445778 | -1.073065 | 0.878361 | 0.684953 | 2.181753 | 0.233055 | 1.732959 | 0.333717 | 0.140874 | 0.146235 |


In [27]:
X_clustered.to_parquet('X_clustered.parquet')
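The notebook mentions running the same procedure on the test data but does not show it. One possible sketch is below, assuming the test frame has the same order book and X columns as the training frame and that the cluster labels fitted on the training data are reused unchanged. The helper name `build_clustered_features` and the toy frame are hypothetical.

```python
import pandas as pd


def build_clustered_features(df, cluster_labels, order_features):
    """Rebuild the clustered feature table for any split (train or test).

    `cluster_labels` maps feature name -> cluster id and should come from the
    training data, so that both splits share the same feature definitions.
    """
    missing = [c for c in cluster_labels.index if c not in df.columns]
    if missing:
        raise KeyError(f"features missing from frame: {missing}")
    # Mean of the features in each cluster, one column per cluster id.
    meta = df[list(cluster_labels.index)].T.groupby(cluster_labels).mean().T
    return pd.concat([df[order_features], meta], axis=1)


# Toy stand-in for a test split (assumed layout: order book columns, then X's).
order_features = ["bid_qty", "ask_qty"]
labels = pd.Series({"X1": 1, "X2": 1, "X3": 2})
toy_test = pd.DataFrame({
    "bid_qty": [1.0, 2.0], "ask_qty": [3.0, 4.0],
    "X1": [0.0, 2.0], "X2": [2.0, 4.0], "X3": [5.0, 6.0],
})

X_clustered_test = build_clustered_features(toy_test, labels, order_features)
```

On the real data, the labels could be reloaded from clusters.csv (for example with `pd.read_csv('clusters.csv', index_col=0).squeeze('columns')`) rather than re-clustering the test features, which would risk producing a different grouping.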