# Exploratory Data Analysis

This notebook performs initial exploratory data analysis on the Hull Tactical Market Prediction dataset. We will load the data, examine its structure, and use utility functions to understand the dataset's characteristics.

## Library Imports

First, we import the necessary libraries for data manipulation and visualization:
- **Polars**: For high-performance data processing
- **Pandas**: For additional data manipulation if needed
- **Seaborn** and **Matplotlib**: For data visualization

In [20]:
from importlib import reload

import matplotlib.pyplot as plt
import pandas as pd
import polars as pl
import seaborn as sns

# Project-local helper modules
import plots as pt
import process as pp
import queries as qu
import utils as ut

# Set up plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

## Data Loading

Next, we load the training dataset from `../data/train.csv` using Polars. The dataset contains historical financial market data: a set of features plus the target variables for our machine learning models.

In [36]:
# Load the training dataset
train_df = pl.read_csv('../data/train.csv')

# Display basic information about the dataset
print(f"Dataset shape: {train_df.shape}")
print(f"Number of columns: {len(train_df.columns)}")
print(f"Column names: {train_df.columns[:10]}...")  # Show first 10 columns

Dataset shape: (8990, 98)
Number of columns: 98
Column names: ['date_id', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9']...


In [37]:
# cloning the dataframe
train_df_cloned = train_df.clone()


## Dataset Information Summary

To get a detailed overview of our dataset, we use a custom utility function that provides information about each column including:
- **Data Type**: The Polars data type of each column
- **Null Count**: Number of missing values in each column
- **Total Count**: Total number of rows in the dataset

This helps us understand the data structure and identify any data quality issues.

In [39]:
train_df_info = qu.get_dataframe_info(train_df)

train_df_info.tail(10)

Column,Data Type,Null Count,Total Count
str,str,i64,i64
"""V3""","""String""",1006,8810
"""V4""","""String""",1006,8810
"""V5""","""String""",1512,8810
"""V6""","""String""",1006,8810
"""V7""","""String""",1511,8810
"""V8""","""String""",1006,8810
"""V9""","""String""",4539,8810
"""forward_returns""","""Float64""",0,8810
"""risk_free_rate""","""Float64""",0,8810
"""market_forward_excess_returns""","""Float64""",0,8810


In [None]:
train_df = pp.create_lagged_targets(train_df)

The three target variables (`forward_returns`, `risk_free_rate`, and `market_forward_excess_returns`) have zero null values.

In [40]:
train_df_feature_type_counts = qu.get_feature_type_column_counts(train_df)
train_df_feature_type_counts.head(10)

Feature Type,Column Count,Unique Data Types
str,i64,str
"""Technical/Market Dynamics""",18,"""String"""
"""Macroeconomic""",20,"""String"""
"""Interest Rate""",9,"""String"""
"""Price/Valuation""",13,"""String"""
"""Volatility""",13,"""String"""
"""Sentiment""",12,"""String"""
"""Momentum""",0,""""""
"""Dummy/Binary""",9,"""Int64"""


In [41]:
train_df_feature_samples = qu.get_feature_type_samples(train_df)
train_df_feature_samples.head(10)

Feature Type,Sample Values
str,list[str]
"""Technical/Market Dynamics""","[""0.375"", ""-0.392556008733254"", … ""0.619847227306824""]"
"""Macroeconomic""","[""0.19973544973545"", ""0.919973544973545"", … ""0.00396825396825397""]"
"""Interest Rate""","[""-1.5366659942006"", ""-0.154043690140403"", … ""0.538359788359788""]"
"""Price/Valuation""","[""1.43614726327281"", ""0.743386243386243"", … ""0.937830687830688""]"
"""Volatility""","[""0.000661375661375661"", ""0.000661375661375661"", … ""0.000661375661375661""]"
"""Sentiment""","[""0.892195767195767"", ""0.0992063492063492"", … ""0.82473544973545""]"
"""Momentum""",[]
"""Dummy/Binary""","[""0"", ""0"", … ""0""]"


In [42]:

qu.get_feature_counts(train_df)

feature_type,Null count,Non Null count
str,i64,i64
"""Technical/Market Dynamics""",41254,117326
"""Macroeconomic""",27471,148729
"""Interest Rate""",9054,70236
"""Price/Valuation""",14888,99642
"""Volatility""",23170,91360
"""Sentiment""",21838,83882
"""Momentum""",0,0
"""Dummy/Binary""",0,79290


`eval_df` has no null values.

## Data Preprocessing

In [None]:
train_df = pp.remove_dummy_features(train_df)


In [None]:
# Cast every String-typed feature group to Float64
for feature_type in (
    ut.FeatureType.technical,
    ut.FeatureType.macroeconomic,
    ut.FeatureType.interest_rate,
    ut.FeatureType.price_valuation,
    ut.FeatureType.volatility,
    ut.FeatureType.sentiment,
):
    train_df = pp.cast_feature_type(train_df, feature_type, pl.Float64)





In [None]:
# Hold out the last 180 rows for evaluation; keep the rest for training
eval_df = train_df[-180:]
train_df = train_df[:-180]
print(train_df.shape)
print(eval_df.shape)

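Remove rows in which every non-target column is null. Note that `feature_cols` still contains `date_id`, so a row is dropped only if `date_id` is null as well; the shapes printed below sum to the original 8990 rows, so no rows were removed here.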

In [51]:
feature_cols = [col for col in train_df.columns if col not in ut.TARGET_VARIABLES]
train_df = train_df.filter(~pl.all_horizontal(pl.col(feature_cols).is_null()))
eval_df = eval_df.filter(~pl.all_horizontal(pl.col(feature_cols).is_null()))
print(train_df.shape, eval_df.shape)

(8810, 89) (180, 89)


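Fill the remaining null values in `train_df` with a backward fill, which replaces each null with the next non-null value in the same column.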

In [None]:
train_df = train_df.fill_null(strategy='backward')
qu.get_feature_counts(train_df)


In [114]:
qu.get_dataframe_info(train_df)

Column,Data Type,Null Count,Total Count
str,str,i64,i64
"""date_id""","""Int64""",0,8810
"""E1""","""Float64""",0,8810
"""E10""","""Float64""",0,8810
"""E11""","""Float64""",0,8810
"""E12""","""Float64""",0,8810
…,…,…,…
"""risk_free_rate""","""Float64""",0,8810
"""market_forward_excess_returns""","""Float64""",0,8810
"""lagged_forward_returns""","""Float64""",0,8810
"""lagged_risk_free_rate""","""Float64""",0,8810
