# Hull Tactical Market Prediction - Exploratory Notebook

This notebook loads the provided `train.csv` and `test.csv` for the Hull Tactical Market Prediction-style dataset, performs EDA and feature engineering, trains a baseline XGBoost model and a small Keras neural network, and tunes the XGBoost hyperparameters. It then simulates a trading strategy that converts predictions into portfolio weights in [0, 2] and evaluates performance with a Sharpe-like metric that penalizes excess volatility.

Note: The target column is `market_forward_excess_returns` (daily excess returns versus the risk-free rate). The notebook is designed to run in a Kaggle-like environment (Python 3.10+, Pandas, NumPy, scikit-learn, XGBoost, TensorFlow).

In [None]:
# Standard imports
import os
import gc
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Print versions for reproducibility
import sklearn
import xgboost as xgb
print('pandas', pd.__version__)
print('numpy', np.__version__)
print('xgboost', xgb.__version__)
print('tensorflow', tf.__version__)
print('scikit-learn', sklearn.__version__)

In [None]:
# Load data

train = pd.read_csv('/kaggle/input/hull-tactical-market-prediction/train.csv')
print('train shape', train.shape)
display(train.head())

test = pd.read_csv('/kaggle/input/hull-tactical-market-prediction/test.csv')
print('test shape', test.shape)
display(test.head())  # display() is needed because this is not the cell's last expression

print('train.csv and test.csv loaded')

## Basic inspection and target identification
We confirm the target `market_forward_excess_returns` and inspect missing values and basic statistics.

In [None]:
target = 'market_forward_excess_returns'
print('target in columns?', target in train.columns)
display(train.dtypes.value_counts())
missing = train.isna().sum().sort_values(ascending=False)
display(missing[missing>0].head(30))
display(train[[target, 'forward_returns', 'risk_free_rate']].describe().T)

## Visualizations: target series, histogram, boxplot, correlation heatmap
We plot the target series and distributions to understand its behavior.

In [None]:
plt.figure(figsize=(12,4))
plt.plot(train['date_id'], train[target], label='market_forward_excess_returns')
plt.xlabel('date_id')
plt.ylabel('excess return')
plt.title('Target series over date_id')
plt.legend()
plt.show()
fig, axes = plt.subplots(1,2,figsize=(12,4))
sns.histplot(train[target].dropna(), bins=80, ax=axes[0], kde=True)
axes[0].set_title('Histogram of target')
sns.boxplot(x=train[target].dropna(), ax=axes[1])
axes[1].set_title('Boxplot of target')
plt.show()
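# numeric_only=True skips any non-numeric columns (pandas 2.x raises on mixed dtypes otherwise)
corr = train.corr(numeric_only=True)
corr_target = corr[target].abs().sort_values(ascending=False)
# index[0] is expected to be the target itself (|corr| = 1), so take the next 30 columns
top_feats = corr_target.index[1:31].tolist()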
plt.figure(figsize=(10,10))
sns.heatmap(train[top_feats + [target]].corr(), cmap='coolwarm', center=0, vmin=-1, vmax=1)
plt.title('Correlation matrix - top features vs target')
plt.show()

## Feature engineering
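Create lag features for `forward_returns` and the target, plus 5- and 20-row rolling means and rolling volatility of `forward_returns`. We then drop constant (single-value) columns and fill remaining missing values with 0. PCA is imported for optional dimensionality reduction but is not applied in this notebook.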

In [None]:
df = train.copy()
for lag in [1,2,3]:
    df[f'forward_returns_lag{lag}'] = df['forward_returns'].shift(lag)
    df[f'{target}_lag{lag}'] = df[target].shift(lag)
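# Shift by one row before rolling so each window ends at the previous row.
# Rolling directly over forward_returns would include the current row's
# forward return, which the target is derived from (lookahead leakage).
fr_hist = df['forward_returns'].shift(1)
df['fr_roll_mean_5'] = fr_hist.rolling(window=5, min_periods=1).mean()
df['fr_roll_std_5'] = fr_hist.rolling(window=5, min_periods=1).std().fillna(0)
df['fr_roll_mean_20'] = fr_hist.rolling(window=20, min_periods=1).mean()
df['fr_roll_std_20'] = fr_hist.rolling(window=20, min_periods=1).std().fillna(0)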
nunique = df.nunique()
constant_cols = nunique[nunique<=1].index.tolist()
print('constant or single-value columns:', len(constant_cols))
df.drop(columns=constant_cols, inplace=True, errors='ignore')
df.fillna(0, inplace=True)
print('df shape after feature engineering', df.shape)
display(df.head())
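
Whether a rolling feature leaks depends on whether its window includes the current row. The following is a minimal sketch on made-up numbers, assuming that row t's `forward_returns` is not yet known when row t is predicted. The `toy` frame and the column names `roll_mean_leaky` and `roll_mean_safe` are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the forward_returns column (values are made up).
toy = pd.DataFrame({"forward_returns": [0.01, -0.02, 0.03, 0.00, 0.02]})

# Window that includes the current row: it contains the value being predicted.
toy["roll_mean_leaky"] = toy["forward_returns"].rolling(3, min_periods=1).mean()

# Window that ends at the previous row: it uses only past information.
toy["roll_mean_safe"] = toy["forward_returns"].shift(1).rolling(3, min_periods=1).mean()

print(toy)
```

The first row of `roll_mean_safe` is NaN because no history exists yet, which is the expected cost of excluding the current row.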

## Prepare training and validation splits (time-aware)
We'll use `date_id` to split the data in time order to avoid lookahead bias. Use the last 20% of time as validation.

In [None]:
features = [c for c in df.columns if c not in ['date_id', target, 'forward_returns', 'risk_free_rate']]
print('selected feature count', len(features))
date_cut = df['date_id'].quantile(0.8)
train_idx = df['date_id'] <= date_cut
val_idx = df['date_id'] > date_cut
X_train = df.loc[train_idx, features].values
y_train = df.loc[train_idx, target].values
X_val = df.loc[val_idx, features].values
y_val = df.loc[val_idx, target].values
print('X_train', X_train.shape, 'X_val', X_val.shape)
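
The quantile-based cut can be sanity-checked on a toy frame. This sketch assumes an integer, sorted `date_id` like the one used above; the `toy`, `train_part`, and `val_part` names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with a sorted integer date_id and one made-up feature.
toy = pd.DataFrame({
    "date_id": np.arange(100),
    "x": np.random.default_rng(0).normal(size=100),
})

cut = toy["date_id"].quantile(0.8)
train_part = toy[toy["date_id"] <= cut]
val_part = toy[toy["date_id"] > cut]

# Every training date precedes every validation date, so nothing from the
# validation period is visible at fit time.
assert train_part["date_id"].max() < val_part["date_id"].min()
print(len(train_part), len(val_part))
```

The same assertion can be run on the real `df` to confirm that the split is strictly time-ordered.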

## Baseline XGBoost model
Train a baseline XGBRegressor and evaluate on the time-validation set.

In [None]:
# early_stopping_rounds goes in the constructor (XGBoost 1.6+); XGBoost 2.0+ no longer accepts it in fit()
xgb_model = XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.1, max_depth=5,
                         random_state=42, n_jobs=4, early_stopping_rounds=10)
xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
pred_val = xgb_model.predict(X_val)
print('Baseline XGBoost - MSE', mean_squared_error(y_val, pred_val))
print('Baseline XGBoost - MAE', mean_absolute_error(y_val, pred_val))
print('Baseline XGBoost - R2', r2_score(y_val, pred_val))
try:
    fi = xgb_model.get_booster().get_score(importance_type='gain')
    fi2 = {features[int(k.replace('f',''))]:v for k,v in fi.items()}
    sorted_fi = sorted(fi2.items(), key=lambda x: x[1], reverse=True)
    print('Top 15 features by gain')
    for f,v in sorted_fi[:15]:
        print(f, v)
except Exception as e:
    print('feature importance error', e)
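
Error metrics on daily returns are easier to read next to a trivial reference predictor. The sketch below uses synthetic returns (the mean and scale are arbitrary assumptions, not estimates from the dataset) to compute MSE and R2 for an "always predict 0" model, which any fitted model should at least match.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(loc=0.0003, scale=0.01, size=500)  # synthetic daily returns
y_pred = np.zeros_like(y_true)                          # trivial "always predict 0" model

mse = np.mean((y_true - y_pred) ** 2)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("zero-baseline MSE", mse)
print("zero-baseline R2", r2)  # never above 0 for a constant prediction
```

Replacing `y_true` with `y_val` gives the reference values for the validation period used in this notebook.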

## Hyperparameter tuning (full run)
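Run `GridSearchCV` with `TimeSeriesSplit` to find reasonable XGBoost parameters. The grid has 27 combinations evaluated on 3 time-ordered folds (81 fits plus a final refit), so this cell takes noticeably longer than the baseline.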

In [None]:
tscv = TimeSeriesSplit(n_splits=3)
param_grid = {
    'max_depth': [3,5,7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200]
}
gsearch = GridSearchCV(estimator=XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=4),
                   param_grid=param_grid, cv=tscv, scoring='neg_mean_squared_error', verbose=2)
gsearch.fit(X_train, y_train)
print('best params', gsearch.best_params_)
best_xgb = gsearch.best_estimator_
pred_val_g = best_xgb.predict(X_val)
print('Tuned XGBoost - MSE', mean_squared_error(y_val, pred_val_g))
print('Tuned XGBoost - MAE', mean_absolute_error(y_val, pred_val_g))
print('Tuned XGBoost - R2', r2_score(y_val, pred_val_g))
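
To see what `TimeSeriesSplit(n_splits=3)` does to the training rows, the fold indices can be printed for a small array. This is a sketch on 12 made-up rows; `X_toy` and `folds` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 time-ordered rows
folds = list(TimeSeriesSplit(n_splits=3).split(X_toy))

for i, (tr, te) in enumerate(folds):
    # Training indices always end before the test indices begin.
    assert tr.max() < te.min()
    print(f"fold {i}: train 0..{tr.max()}, test {te.min()}..{te.max()}")
```

Each fold trains on an expanding window of earlier rows and tests on the block that follows, so no fold is scored on data that precedes its training set.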

## Keras feedforward network (full train)
Build and train a dense neural network with early stopping (up to 100 epochs).

In [None]:
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
def build_ffn(input_shape):
    model = keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss='mse', metrics=['mae'])
    return model
ffn = build_ffn(X_train_s.shape[1])
es = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
history = ffn.fit(X_train_s, y_train, validation_data=(X_val_s, y_val), epochs=100, batch_size=128, callbacks=[es], verbose=2)
pred_val_nn = ffn.predict(X_val_s).ravel()
print('NN - MSE', mean_squared_error(y_val, pred_val_nn))
print('NN - MAE', mean_absolute_error(y_val, pred_val_nn))
print('NN - R2', r2_score(y_val, pred_val_nn))
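
The patience rule behind early stopping can be illustrated without Keras. The helper below, `early_stop_epoch`, is a hypothetical, simplified re-implementation for illustration (no `min_delta`, no warm-up), not the Keras callback itself.

```python
def early_stop_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 0-based, for a patience rule on val_loss."""
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience limit was reached.
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74]
result = early_stop_epoch(losses, patience=3)
print(result)  # (5, 2)
```

With `restore_best_weights=True`, the model kept after stopping corresponds to `best_epoch`, not to the epoch at which training halted.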

## Strategy Simulation & Competition-style Scoring
Convert predictions to portfolio weights in [0,2], apply a 120% volatility cap relative to market volatility, include transaction costs, and compute an adjusted Sharpe ratio similar to the competition metric.
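
The percentile mapping from predictions to weights can be tried on a toy array first. The function body repeats the notebook's `signal_to_weight` so the example runs on its own; `toy_signal` is made up.

```python
import numpy as np

def signal_to_weight(s, lower=0.0, upper=2.0):
    # Map the 5th-95th percentile range of s linearly onto [lower, upper], then clip.
    lo, hi = np.percentile(s, 5), np.percentile(s, 95)
    w = (s - lo) / (hi - lo + 1e-9) * (upper - lower) + lower
    return np.clip(w, lower, upper)

toy_signal = np.linspace(-0.01, 0.01, 21)  # made-up predictions
w = signal_to_weight(toy_signal)

print(w.min(), w.max())
```

Predictions below the 5th percentile receive weight 0, those above the 95th receive weight 2, and a prediction in the middle of the range lands near 1.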

In [None]:
# Use tuned XGBoost predictions as primary signal
signal = pred_val_g
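def signal_to_weight(s, lower=0.0, upper=2.0):
    # Map the 5th-95th percentile range of the signal linearly onto [lower, upper], then clip the tails.
    # The percentiles are taken over the whole array passed in, so this is an in-sample scaling.
    lo = np.percentile(s, 5)
    hi = np.percentile(s, 95)
    w = (s - lo) / (hi - lo + 1e-9) * (upper - lower) + lower
    return np.clip(w, lower, upper)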
weights = signal_to_weight(signal)
val_df = df.loc[val_idx].copy().reset_index(drop=True)
val_df['pred'] = signal
val_df['weight'] = weights
# transaction cost charged per unit of absolute weight change (5 bps)
tc = 0.0005
# compute turnover as abs change in weight (day-over-day) and apply cost
val_df['turnover'] = val_df['weight'].diff().abs().fillna(0)
val_df['tc_cost'] = val_df['turnover'] * tc
# strategy returns before cost (weight times raw forward_returns; the risk-free rate is not subtracted)
val_df['strategy_excess'] = val_df['weight'] * val_df['forward_returns']
# subtract transaction cost from strategy returns
val_df['strategy_excess_net'] = val_df['strategy_excess'] - val_df['tc_cost']
# Sharpe function
def sharpe_ratio(returns, periods=252):
    mean = returns.mean()
    vol = returns.std()
    if vol == 0: return np.nan
    return (mean/vol) * np.sqrt(periods)
strat_ret = val_df['strategy_excess_net']
market_ret = val_df['forward_returns']
print('Strategy net mean daily', strat_ret.mean())
print('Market mean daily', market_ret.mean())
print('Strategy net Sharpe', sharpe_ratio(strat_ret))
print('Market Sharpe', sharpe_ratio(market_ret))
# volatility cap: scale weights to keep strategy vol <= 1.2 * market vol
market_vol = market_ret.std()
strat_vol = strat_ret.std()
cap = 1.2 * market_vol
print('market vol', market_vol, 'strategy vol', strat_vol, 'cap', cap)
if strat_vol > cap and strat_vol>0:
    scale = cap / strat_vol
    val_df['weight_adj'] = val_df['weight'] * scale
    val_df['turnover_adj'] = val_df['weight_adj'].diff().abs().fillna(0)
    val_df['tc_cost_adj'] = val_df['turnover_adj'] * tc
    val_df['strategy_excess_adj'] = val_df['weight_adj'] * val_df['forward_returns'] - val_df['tc_cost_adj']
    print('Adjusted Strategy Sharpe', sharpe_ratio(val_df['strategy_excess_adj']))
else:
    val_df['weight_adj'] = val_df['weight']
    val_df['strategy_excess_adj'] = val_df['strategy_excess_net']
# cumulative returns
val_df['cum_strategy'] = (1 + val_df['strategy_excess_adj']).cumprod() - 1
val_df['cum_market'] = (1 + val_df['forward_returns']).cumprod() - 1
plt.figure(figsize=(12,6))
plt.plot(val_df['date_id'], val_df['cum_strategy'], label='Strategy (net, adj)')
plt.plot(val_df['date_id'], val_df['cum_market'], label='Market')
plt.legend()
plt.title('Cumulative returns: strategy vs market (validation)')
plt.show()
display(val_df[['date_id','pred','weight','weight_adj','forward_returns','strategy_excess_adj']].head())
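
An alternative to rescaling weights is to leave them unchanged and penalize the Sharpe ratio when strategy volatility exceeds 120% of market volatility. The sketch below is one possible form of such a penalty. The function name `vol_penalized_sharpe` and the penalty formula are assumptions for illustration and are not the official competition metric.

```python
import numpy as np

def vol_penalized_sharpe(strategy_ret, market_ret, cap_ratio=1.2, periods=252):
    """Annualized Sharpe divided by (1 + volatility ratio in excess of cap_ratio)."""
    strategy_ret = np.asarray(strategy_ret, dtype=float)
    market_ret = np.asarray(market_ret, dtype=float)
    s_vol = strategy_ret.std(ddof=1)  # ddof=1 matches the pandas default used above
    m_vol = market_ret.std(ddof=1)
    if s_vol == 0 or m_vol == 0:
        return np.nan
    sharpe = strategy_ret.mean() / s_vol * np.sqrt(periods)
    penalty = 1.0 + max(0.0, s_vol / m_vol - cap_ratio)
    return sharpe / penalty

rng = np.random.default_rng(0)
mkt = rng.normal(0.0005, 0.01, size=1000)  # synthetic market returns
base = vol_penalized_sharpe(1.0 * mkt, mkt)       # volatility ratio 1.0: no penalty
levered = vol_penalized_sharpe(2.0 * mkt, mkt)    # volatility ratio 2.0: divided by 1.8
print(base, levered)
```

Because the plain Sharpe ratio is unchanged by leverage, the 2x position scores lower only through the penalty term. Dividing by the penalty is meaningful for a positive Sharpe; a negative Sharpe would need a different treatment.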

## Discussion & Next steps
- The notebook shows a complete flow from data loading to strategy simulation.
- Next improvements: cross-sectional signals (if available), ensemble models, rolling retraining, transaction cost models, stricter backtest hygiene to avoid data leakage, and using PCA/feature selection for correlated features.
- Be mindful that beating the S&P on historical data does not guarantee future performance and may reflect data snooping.
