# Hull Tactical - Kaggle Competition

## Overview

* Wisdom from most personal finance experts would suggest that it's irresponsible to try and time the market. The Efficient Market Hypothesis (EMH) would agree: everything knowable is already priced in, so don’t bother trying.

* But in the age of machine learning, is it irresponsible to not try and time the market? Is the EMH an extreme oversimplification at best and possibly just…false?

* Most investors don’t beat the S&P 500. That failure has been cited for decades as evidence for the EMH, the idea that prices already reflect all available information and no persistent edge is possible: if even the professionals can’t win, it must be impossible. This story is tidy, but reality is less so. Markets are noisy, messy, and full of behavioral quirks that don’t vanish just because academic orthodoxy said they should.

* Data science has changed the game. With enough features, machine learning, and creativity, it’s possible to uncover repeatable edges that theory says shouldn’t exist. The real challenge isn’t whether they exist—it’s whether you can find them and combine them in a way that is robust enough to overcome frictions and implementation issues.

* Our current approach blends a handful of quantitative models to adjust market exposure at the close of each trading day. It points in the right direction, but with a blurry compass. Our model is clearly a sub-optimal way to model a complex, non-linear, adaptive system. 

## Project Objective

* To build a model that predicts excess returns and includes a betting strategy designed to outperform the S&P 500 while staying within a 120% volatility constraint.
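
To make the volatility constraint concrete, here is a minimal sketch of how one might check whether a series of daily positions stays within 120% of the market's volatility. The helper name `volatility_ratio`, the strategy-return formula, and the synthetic data are all assumptions for illustration; the competition's official metric may be defined differently.

```python
import numpy as np

def volatility_ratio(positions, forward_returns, risk_free_rate):
    """Ratio of strategy volatility to market volatility (hypothetical helper)."""
    positions = np.asarray(positions, dtype=float)
    forward_returns = np.asarray(forward_returns, dtype=float)
    risk_free_rate = np.asarray(risk_free_rate, dtype=float)
    # Assumed daily strategy return: `position` in the market, the remainder at the risk-free rate
    strategy_returns = positions * forward_returns + (1.0 - positions) * risk_free_rate
    return strategy_returns.std() / forward_returns.std()

# Synthetic daily returns, for illustration only
rng = np.random.default_rng(0)
market = rng.normal(0.0005, 0.01, size=1000)
rf = np.full(1000, 0.0001)

ratio_full = volatility_ratio(np.ones(1000), market, rf)          # always fully invested
ratio_levered = volatility_ratio(np.full(1000, 2.0), market, rf)  # always 2x levered

print(f"fully invested: {ratio_full:.2f}, 2x levered: {ratio_levered:.2f}")
print("2x levered within 120% limit:", ratio_levered <= 1.2)
```

Under these assumptions, a constant position of 1 reproduces the market's volatility, while a constant position of 2 doubles it and would breach a 120% limit.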

## Imports

In [8]:
# ===============================================================
# 1️⃣  CORE PYTHON + DATA HANDLING
# ===============================================================
import os
from dataclasses import asdict, dataclass   # used by the config dataclasses below
from pathlib import Path
import numpy as np
import pandas as pd
import polars as pl

# ===============================================================
# 2️⃣  VISUALIZATION
# ===============================================================
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8')
sns.set_theme(context='notebook', style='whitegrid')

# ===============================================================
# 3️⃣  SCIKIT-LEARN UTILITIES
# ===============================================================
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet, ElasticNetCV   # used in the model fitting section

# ===============================================================
# 4️⃣  DEEP LEARNING – PYTORCH
# ===============================================================
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Check device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ===============================================================
# 5️⃣  PORTFOLIO OPTIMIZATION – VOLATILITY CONSTRAINT
# ===============================================================
import cvxpy as cp

# ===============================================================
# 6️⃣  GENERAL HOUSEKEEPING
# ===============================================================
import warnings
warnings.filterwarnings('ignore')

# Optional: better numerical formatting
pd.set_option('display.float_format', lambda x: f'{x:,.4f}')


#=================================================================
# 7️⃣ KAGGLE EVALUATION SERVER
#=================================================================

import kaggle_evaluation.default_inference_server as default_inference_server

Using device: cpu


## Data Understanding

* The first step is to load the train and test datasets to explore features, distributions, and correlations.

* We can then also check for missing values and key predictors.

In [3]:
# Loading the datasets

train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

# Previewing the data

train_df.head()

Unnamed: 0,date_id,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,V3,V4,V5,V6,V7,V8,V9,forward_returns,risk_free_rate,market_forward_excess_returns
0,0,0,0,0,1,1,0,0,0,1,...,,,,,,,,-0.0024,0.0003,-0.003
1,1,0,0,0,1,1,0,0,0,1,...,,,,,,,,-0.0085,0.0003,-0.0091
2,2,0,0,0,1,0,0,0,0,1,...,,,,,,,,-0.0096,0.0003,-0.0102
3,3,0,0,0,1,0,0,0,0,0,...,,,,,,,,0.0047,0.0003,0.004
4,4,0,0,0,1,0,0,0,0,0,...,,,,,,,,-0.0117,0.0003,-0.0123


In [6]:
# Learning the datatypes in the columns

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8990 entries, 0 to 8989
Data columns (total 98 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   date_id                        8990 non-null   int64  
 1   D1                             8990 non-null   int64  
 2   D2                             8990 non-null   int64  
 3   D3                             8990 non-null   int64  
 4   D4                             8990 non-null   int64  
 5   D5                             8990 non-null   int64  
 6   D6                             8990 non-null   int64  
 7   D7                             8990 non-null   int64  
 8   D8                             8990 non-null   int64  
 9   D9                             8990 non-null   int64  
 10  E1                             7206 non-null   float64
 11  E10                            7984 non-null   float64
 12  E11                            7984 non-null   f

* From the competition's webpage, the column names can be broken down into:

    - `date_id` - An identifier for a single trading day.
    - `M*` - Market Dynamics/Technical features.
    - `E*` - Macro Economic features.
    - `I*` - Interest Rate features.
    - `P*` - Price/Valuation features.
    - `V*` - Volatility features.
    - `S*` - Sentiment features.
    - `MOM*` - Momentum features.
    - `D*` - Dummy/Binary features.
    - `forward_returns` - The returns from buying the S&P 500 and selling it a day later. Train set only.
    - `risk_free_rate` - The federal funds rate. Train set only.
    - `market_forward_excess_returns` - Forward returns relative to expectations. Computed by subtracting the rolling five-year mean forward returns and winsorizing the result using a median absolute deviation (MAD) with a criterion of 4. Train set only.
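
The description of `market_forward_excess_returns` above can be sketched roughly as follows. This is an illustration under stated assumptions, not the competition's actual computation: the function name `excess_returns_sketch`, the window of 1260 trading days for "five years", the use of `min_periods=1`, and the unscaled MAD are all guesses.

```python
import numpy as np
import pandas as pd

def excess_returns_sketch(
    forward_returns: pd.Series,
    window: int = 1260,          # assumption: about five years of trading days
    mad_criterion: float = 4.0,  # the "criterion of 4" from the description
) -> pd.Series:
    """Subtract a rolling mean, then winsorize with a MAD-based bound (hypothetical helper)."""
    # Assumption: early rows use whatever history is available
    rolling_mean = forward_returns.rolling(window, min_periods=1).mean()
    excess = forward_returns - rolling_mean

    # Winsorize: clip anything further than criterion * MAD from the median
    median = excess.median()
    mad = (excess - median).abs().median()
    return excess.clip(lower=median - mad_criterion * mad,
                       upper=median + mad_criterion * mad)

# Synthetic returns with one extreme outlier, for illustration only
rng = np.random.default_rng(1)
returns = pd.Series(rng.normal(0.0, 0.01, size=2000))
returns.iloc[1000] = 0.5

excess = excess_returns_sketch(returns)
print(returns.max(), excess.max())
```

The outlier of 0.5 is pulled back to the upper winsorization bound, which is why the target's extremes are much tamer than raw return extremes would be.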

In [5]:
# Analysing the distributions of the columns 

train_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
date_id,8990.0000,4494.5000,2595.3338,0.0000,2247.2500,4494.5000,6741.7500,8989.0000
D1,8990.0000,0.0316,0.1749,0.0000,0.0000,0.0000,0.0000,1.0000
D2,8990.0000,0.0316,0.1749,0.0000,0.0000,0.0000,0.0000,1.0000
D3,8990.0000,0.0478,0.2134,0.0000,0.0000,0.0000,0.0000,1.0000
D4,8990.0000,0.5752,0.4943,0.0000,0.0000,1.0000,1.0000,1.0000
...,...,...,...,...,...,...,...,...
V8,7984.0000,0.3039,0.3511,0.0007,0.0007,0.1010,0.5900,1.0000
V9,4451.0000,0.1292,1.2773,-1.4974,-0.7382,-0.1708,0.6859,12.9975
forward_returns,8990.0000,0.0005,0.0106,-0.0398,-0.0043,0.0007,0.0059,0.0407
risk_free_rate,8990.0000,0.0001,0.0001,-0.0000,0.0000,0.0001,0.0002,0.0003


In [11]:
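# Identifying the target column and the feature columns
# The target column should be the market forward excess returns

target_col = 'market_forward_excess_returns'

# Exclude the target and the train-only columns, which are unavailable at inference time
# (forward_returns would also leak the target)
non_feature_cols = {target_col, 'forward_returns', 'risk_free_rate'}
features = [col for col in train_df.columns if col not in non_feature_cols]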

In [12]:
# Checking for missing values

train_df.isna().sum().sort_values(ascending=False)

E7                               6969
V10                              6049
S3                               5733
M1                               5547
M13                              5540
                                 ... 
D6                                  0
D5                                  0
forward_returns                     0
risk_free_rate                      0
market_forward_excess_returns       0
Length: 98, dtype: int64

### Observations

* The dataset has 8990 rows and 98 columns.

* The `date_id` column identifies each trading day, so it should be set as the index rather than used as a feature, which would otherwise skew predictions in the modelling phase.

* The columns are either `float64` or `int64`. Apart from `date_id`, the `int64` columns are the `D*` columns, which are binary.

* The target column is `market_forward_excess_returns`.

* Many columns contain missing values, and different feature families call for different filling techniques. Based on market research, the strategies would be:

| Prefix/Category | Description | Recommended Imputation Strategy |
|-----------------|-------------|----------------------------------|
| M*              | Market/Technical | Fill with median — stable against outliers |
| E*              | Macro Economic   | Fill with mean (these are smoother, long-term data) |
| I*              | Interest Rates   | Fill with forward fill (these are often time-series-based) |
| P*              | Price/Valuation | Fill with median or sector median if you have that info |
| V*              | Volatility       | Fill with mean, or possibly constant (like average market vol) |
| S*              | Sentiment        | Fill with 0 (neutral sentiment baseline) |
| MOM*            | Momentum         | Fill with 0 or median (depending on how momentum is defined) |
| D*              | Dummy/Binary     | Fill with mode (most common) or 0 |

* Different features have different scales, so `StandardScaler` should be applied to the features but not to the target, since the target measures performance against the S&P 500.
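
The prefix-based strategies in the table above can be sketched as follows. This is a minimal illustration in pandas, not the pipeline used later in this notebook: the `PREFIX_STRATEGY` mapping and the `impute_by_prefix` helper are hypothetical names, and the demo frame contains made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from column prefix to fill strategy, following the table
PREFIX_STRATEGY = {
    "MOM": "zero", "M": "median", "E": "mean", "I": "ffill",
    "P": "median", "V": "mean", "S": "zero", "D": "mode",
}

def impute_by_prefix(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Check longer prefixes first so that "MOM1" is not matched by "M"
    prefixes = sorted(PREFIX_STRATEGY, key=len, reverse=True)
    for col in out.columns:
        prefix = next(
            (p for p in prefixes if col.startswith(p) and col[len(p):].isdigit()),
            None,
        )
        if prefix is None:
            continue  # e.g. date_id or the target columns
        strategy = PREFIX_STRATEGY[prefix]
        if strategy == "median":
            out[col] = out[col].fillna(out[col].median())
        elif strategy == "mean":
            out[col] = out[col].fillna(out[col].mean())
        elif strategy == "ffill":
            out[col] = out[col].ffill()
        elif strategy == "zero":
            out[col] = out[col].fillna(0.0)
        elif strategy == "mode":
            mode = out[col].mode()
            out[col] = out[col].fillna(mode.iloc[0] if not mode.empty else 0)
    return out

# Made-up data, for illustration only
demo = pd.DataFrame({
    "M1": [1.0, np.nan, 3.0, 100.0],
    "E1": [1.0, 2.0, np.nan, 3.0],
    "I1": [0.5, np.nan, np.nan, 0.7],
    "S1": [np.nan, 0.2, -0.1, np.nan],
    "MOM1": [np.nan, 0.3, 0.1, 0.2],
})
filled = impute_by_prefix(demo)
print(filled)
```

Two caveats apply to this sketch. Forward fill leaves any leading gaps unfilled, and for time-series data the means and medians would need to be computed on the training period only to avoid look-ahead.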

## Configurations

In [None]:
# ============ PATHS ============
DATA_PATH: Path = Path('/kaggle/input/hull-tactical-market-prediction/')

# ============ RETURNS TO SIGNAL CONFIGS ============
MIN_SIGNAL: float = 0.0                         # Minimum value for the daily signal 
MAX_SIGNAL: float = 2.0                         # Maximum value for the daily signal 
SIGNAL_MULTIPLIER: float = 400.0                # Multiplier applied to predicted market forward excess returns to form the signal

# ============ MODEL CONFIGS ============
CV: int = 10                                    # Number of cross validation folds in the model fitting
L1_RATIO: float = 0.5                           # ElasticNet mixing parameter
ALPHAS: np.ndarray = np.logspace(-4, 2, 100)    # Constant that multiplies the penalty terms
MAX_ITER: int = 1000000                         # The maximum number of iterations

## Dataclasses Helpers

In [None]:
@dataclass
class DatasetOutput:
    X_train : pl.DataFrame 
    X_test: pl.DataFrame
    y_train: pl.Series
    y_test: pl.Series
    scaler: StandardScaler

@dataclass 
class ElasticNetParameters:
    l1_ratio : float 
    cv: int
    alphas: np.ndarray 
    max_iter: int 
    
    def __post_init__(self): 
        if self.l1_ratio < 0 or self.l1_ratio > 1: 
            raise ValueError("Wrong initializing value for ElasticNet l1_ratio")
        
@dataclass(frozen=True)
class RetToSignalParameters:
    signal_multiplier: float 
    min_signal : float = MIN_SIGNAL
    max_signal : float = MAX_SIGNAL

## Set the Parameters

In [None]:
ret_signal_params = RetToSignalParameters(
    signal_multiplier= SIGNAL_MULTIPLIER
)

enet_params = ElasticNetParameters(
    l1_ratio = L1_RATIO, 
    cv = CV, 
    alphas = ALPHAS, 
    max_iter = MAX_ITER
)

## Dataset Loading/Creating Helper Functions

In [None]:
def load_trainset() -> pl.DataFrame:
    """
    Loads and preprocesses the training dataset.

    Returns:
        pl.DataFrame: The preprocessed training DataFrame.
    """
    return (
        pl.read_csv(DATA_PATH / "train.csv")
        .rename({'market_forward_excess_returns':'target'})
        .with_columns(
            pl.exclude('date_id').cast(pl.Float64, strict=False)
        )
        .head(-10)
    )

def load_testset() -> pl.DataFrame:
    """
    Loads and preprocesses the testing dataset.

    Returns:
        pl.DataFrame: The preprocessed testing DataFrame.
    """
    return (
        pl.read_csv(DATA_PATH / "test.csv")
        .rename({'lagged_forward_returns':'target'})
        .with_columns(
            pl.exclude('date_id').cast(pl.Float64, strict=False)
        )
    )

def create_example_dataset(df: pl.DataFrame) -> pl.DataFrame:
    """
    Creates new features and cleans a DataFrame.

    Args:
        df (pl.DataFrame): The input Polars DataFrame.

    Returns:
        pl.DataFrame: The DataFrame with new features, selected columns, and no null values.
    """
    vars_to_keep: list[str] = [
        "S2", "E2", "E3", "P9", "S1", "S5", "I2", "P8",
        "P10", "P12", "P13", "U1", "U2"
    ]

    return (
        df.with_columns(
            (pl.col("I2") - pl.col("I1")).alias("U1"),
            (pl.col("M11") / ((pl.col("I2") + pl.col("I9") + pl.col("I7")) / 3)).alias("U2")
        )
        .select(["date_id", "target"] + vars_to_keep)
        .with_columns([
            pl.col(col).fill_null(pl.col(col).ewm_mean(com=0.5))
            for col in vars_to_keep
        ])
        .drop_nulls()
    )
    
def join_train_test_dataframes(train: pl.DataFrame, test: pl.DataFrame) -> pl.DataFrame:
    """
    Joins two dataframes by common columns and concatenates them vertically.

    Args:
        train (pl.DataFrame): The training DataFrame.
        test (pl.DataFrame): The testing DataFrame.

    Returns:
        pl.DataFrame: A single DataFrame with vertically stacked data from common columns.
    """
    common_columns: list[str] = [col for col in train.columns if col in test.columns]
    
    return pl.concat([train.select(common_columns), test.select(common_columns)], how="vertical")

def split_dataset(train: pl.DataFrame, test: pl.DataFrame, features: list[str]) -> DatasetOutput: 
    """
    Splits the data into features (X) and target (y), and scales the features.

    Args:
        train (pl.DataFrame): The processed training DataFrame.
        test (pl.DataFrame): The processed testing DataFrame.
        features (list[str]): List of features to use in the model.

    Returns:
        DatasetOutput: A dataclass containing the scaled feature sets, target series, and the fitted scaler.
    """
    X_train = train.drop(['date_id','target']) 
    y_train = train.get_column('target')
    X_test = test.drop(['date_id','target']) 
    y_test = test.get_column('target')
    
    scaler = StandardScaler() 
    
    X_train_scaled_np = scaler.fit_transform(X_train)
    X_train = pl.from_numpy(X_train_scaled_np, schema=features)
    
    X_test_scaled_np = scaler.transform(X_test)
    X_test = pl.from_numpy(X_test_scaled_np, schema=features)
    
    
    return DatasetOutput(
        X_train = X_train,
        y_train = y_train, 
        X_test = X_test, 
        y_test = y_test,
        scaler = scaler
    )
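
The null-filling step in `create_example_dataset` replaces each missing value with an exponentially weighted mean of the preceding observations. The idea can be illustrated with a small sketch. Note that it uses pandas (`Series.ewm`) as a stand-in rather than the Polars `ewm_mean` call from the notebook, and the two libraries may treat missing values differently, so this shows the technique rather than the exact behaviour of the helper above.

```python
import numpy as np
import pandas as pd

# Made-up series with one gap, for illustration only
s = pd.Series([0.0, 1.0, 2.0, np.nan, 4.0])

# Exponentially weighted mean with the same centre of mass as the notebook (com=0.5).
# At the missing position, pandas carries the last computed average forward.
ewm = s.ewm(com=0.5).mean()

# Only the gap is replaced; observed values are left untouched
filled = s.fillna(ewm)
print(filled.tolist())
```

With `com=0.5` the most recent observations dominate, so the filled value sits close to the last observed value rather than the long-run mean.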

## Converting Return Prediction to Signal

Here is an example of a potential function used to convert a prediction based on the market forward excess return to a daily signal position. 

In [None]:
def convert_ret_to_signal(
    ret_arr: np.ndarray,
    params: RetToSignalParameters
) -> np.ndarray:
    """
    Converts raw model predictions (expected returns) into a trading signal.

    Args:
        ret_arr (np.ndarray): The array of predicted returns.
        params (RetToSignalParameters): Parameters for scaling and clipping the signal.

    Returns:
        np.ndarray: The resulting trading signal, clipped between min and max values.
    """
    return np.clip(
        ret_arr * params.signal_multiplier + 1, params.min_signal, params.max_signal
    )
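
To see how the mapping behaves, here is a small standalone sketch. The function `ret_to_signal` is a hypothetical copy of `convert_ret_to_signal` with the configuration values inlined so that it runs on its own; the predictions are made-up numbers.

```python
import numpy as np

def ret_to_signal(ret_arr, multiplier=400.0, min_signal=0.0, max_signal=2.0):
    # Same mapping as convert_ret_to_signal, with the parameters inlined for a standalone demo
    return np.clip(np.asarray(ret_arr) * multiplier + 1, min_signal, max_signal)

# Made-up predicted excess returns
preds = np.array([-0.005, 0.0, 0.001, 0.005])
signals = ret_to_signal(preds)
print(signals)
```

A predicted excess return of zero maps to a signal of 1.0 (fully invested), and a prediction of 0.001 maps to roughly 1.4. With a multiplier of 400, any prediction at or above 0.0025 saturates at the maximum of 2.0, and any prediction at or below -0.0025 saturates at the minimum of 0.0.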

## Looking at the Data

In [None]:
train: pl.DataFrame = load_trainset()
test: pl.DataFrame = load_testset() 
print(train.tail(3)) 
print(test.head(3))

## Generating the Train and Test

In [None]:
df: pl.DataFrame = join_train_test_dataframes(train, test)
df = create_example_dataset(df=df) 
train: pl.DataFrame = df.filter(pl.col('date_id').is_in(train.get_column('date_id')))
test: pl.DataFrame = df.filter(pl.col('date_id').is_in(test.get_column('date_id')))

FEATURES: list[str] = [col for col in test.columns if col not in ['date_id', 'target']]

dataset: DatasetOutput = split_dataset(train=train, test=test, features=FEATURES) 

X_train: pl.DataFrame = dataset.X_train
X_test: pl.DataFrame = dataset.X_test
y_train: pl.Series = dataset.y_train
y_test: pl.Series = dataset.y_test
scaler: StandardScaler = dataset.scaler 

## Fitting the Model 

In [None]:
model_cv: ElasticNetCV = ElasticNetCV(
    **asdict(enet_params)
)
model_cv.fit(X_train, y_train) 
        
# Fit the final model using the best alpha found by cross-validation
model: ElasticNet = ElasticNet(alpha=model_cv.alpha_, l1_ratio=enet_params.l1_ratio) 
model.fit(X_train, y_train)
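
The two-step fit above can be tried on synthetic data with the sketch below. One deliberate difference is labelled as an assumption: instead of an integer `cv`, it passes scikit-learn's `TimeSeriesSplit`, so that each validation fold comes after its training fold in time. That is a possible variation for time-ordered data, not what this notebook does. The data, coefficients, and the smaller `alphas` grid are made up.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.model_selection import TimeSeriesSplit

# Synthetic regression problem: only the first two features carry signal
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 0.5 * X[:, 0] - 0.25 * X[:, 1] + rng.normal(scale=0.1, size=500)

alphas = np.logspace(-4, 2, 20)

# Step 1: choose alpha by cross-validation (assumption: time-ordered folds)
cv_model = ElasticNetCV(
    l1_ratio=0.5,
    alphas=alphas,
    cv=TimeSeriesSplit(n_splits=5),
    max_iter=100_000,
)
cv_model.fit(X, y)

# Step 2: refit a plain ElasticNet with the selected alpha
final_model = ElasticNet(alpha=cv_model.alpha_, l1_ratio=0.5, max_iter=100_000)
final_model.fit(X, y)

print("selected alpha:", cv_model.alpha_)
print("coefficients:", final_model.coef_.round(3))
```

`ElasticNetCV` already refits on the full training data with the selected alpha, so the second step mainly yields a plain `ElasticNet` object with an explicit `alpha`.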

## Prediction Function via Kaggle Server

In [None]:
def predict(test: pl.DataFrame) -> float:
    """Return the daily signal for one batch served by the Kaggle gateway."""
    test = test.rename({'lagged_forward_returns':'target'})
    df: pl.DataFrame = create_example_dataset(test)
    X_test: pl.DataFrame = df.select(FEATURES)
    # Reuse the scaler fitted on the training data
    X_test_scaled_np: np.ndarray = scaler.transform(X_test)
    X_test = pl.from_numpy(X_test_scaled_np, schema=FEATURES)
    raw_pred: float = model.predict(X_test)[0]
    # Cast to a plain Python float to match the declared return type
    return float(convert_ret_to_signal(raw_pred, ret_signal_params))

## Launch Server

In [None]:
# The module was imported under the alias `default_inference_server`, so use that name here
inference_server = default_inference_server.DefaultInferenceServer(predict)

if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    inference_server.run_local_gateway(('/kaggle/input/hull-tactical-market-prediction/',))