# Predicting closing price movements of NASDAQ stocks
Attila Jamilov, Cooper Reynolds, Yueying Du, Brandon Leong, Josephine Welin.
## Introduction
In the final minutes of the trading day, many stocks see heightened volatility and large price fluctuations. The NASDAQ stock exchange uses the NASDAQ Closing Cross auction to determine the official closing prices of the securities listed on the exchange. We evaluate several models, some covered in class and some not, on predicting this closing price movement using the dataset from the [Kaggle competition](https://www.kaggle.com/competitions/optiver-trading-at-the-close/overview), to see which models perform best and which features we can engineer to improve their performance.

For our features, we first use only the features provided in the dataset, then add our own original features, then try features that the Kaggle competitors had success with, and finally combine all of them. We then select only the most helpful features and test our best model on the test set through Kaggle.

For our models, we begin with Linear Regression (Josephine), Random Forest (Brandon), LightGBM and CNN (Yueying), and XGBoost (Cooper). Finally, we look into CatBoost (Attila), a model developed by Yandex that the winner of the Kaggle competition used in his approach.

## Dataset explanation

## Data Processing (Attila, Cooper)
First, we need to import the data: 

In [None]:
import pandas as pd

# 88 of the roughly 5 million rows have a null target; no model can train on them, so they are dropped in the next cell
df = pd.read_csv("./train.csv", index_col="row_id")

In [9]:
df.dropna(subset=["target"], inplace=True)

X = df.drop(["target", "time_id"], axis=1)
y = df["target"]

We drop `time_id` from `X` because it is an identifying feature that won't help the model. We also 'remove' another identifying feature, `row_id`, by setting it as the `index_col`, which is needed for submitting to Kaggle. Next, we split the data into training and validation subsets:

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

`train_test_split` shuffles the data by default, so we don't need to do that ourselves. Now we need to deal with features that are NaN or missing. Some orders never fill, so it makes sense that many rows have NaN values in features such as `near_price` and `far_price`. Linear regression, for example, cannot handle them:

In [None]:
from sklearn.linear_model import LinearRegression


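# dropna(inplace=True) returns None, so keep the filtered frame and align y_train with it
X_train_lr = X_train.dropna(subset=["near_price", "far_price"])

lr = LinearRegression()
lr.fit(X_train_lr, y_train.loc[X_train_lr.index])

# X_val has not been cleaned and still contains NaNs, which LinearRegression rejects
y_pred_lr = lr.predict(X_val)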

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
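
The error message above suggests an imputer in a pipeline as one way around the problem. The following is a minimal sketch of that idea on made-up data, not the approach used in this notebook: `X_toy`, `y_toy`, and `imputed_lr` are hypothetical names, and median imputation is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real features (values are made up for illustration)
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "near_price": rng.normal(1.0, 0.01, 200),
    "far_price": rng.normal(1.0, 0.01, 200),
    "wap": rng.normal(1.0, 0.01, 200),
})
y_toy = pd.Series(rng.normal(0.0, 5.0, 200))

# Knock out roughly 30% of the values to mimic missing near_price / far_price entries
X_toy.loc[rng.random(200) < 0.3, "near_price"] = np.nan
X_toy.loc[rng.random(200) < 0.3, "far_price"] = np.nan

# The medians are learned from the rows passed to fit() and reused inside predict()
imputed_lr = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
imputed_lr.fit(X_toy.iloc[:150], y_toy.iloc[:150])
toy_pred = imputed_lr.predict(X_toy.iloc[150:])
```

Because the imputer lives inside the pipeline, the same fill values are applied to the training and validation rows, and no rows have to be dropped.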

In [11]:
from catboost import CatBoostRegressor

cb = CatBoostRegressor(loss_function="MAE", random_state=42, verbose=1, task_type="GPU", thread_count=-1)

cb.fit(X_train, y_train)

y_pred = cb.predict(X_val)

Default metric period is 5 because MAE is/are not implemented for GPU


0:	learn: 6.4094544	total: 31.3ms	remaining: 31.3s
1:	total: 40.9ms	remaining: 20.4s
2:	total: 50.5ms	remaining: 16.8s
3:	total: 60.1ms	remaining: 15s
4:	total: 69.5ms	remaining: 13.8s
5:	learn: 6.4085705	total: 79.3ms	remaining: 13.1s
6:	total: 88.7ms	remaining: 12.6s
7:	total: 98.3ms	remaining: 12.2s
8:	total: 108ms	remaining: 11.9s
9:	total: 117ms	remaining: 11.6s
10:	learn: 6.4076900	total: 127ms	remaining: 11.4s
11:	total: 136ms	remaining: 11.2s
12:	total: 146ms	remaining: 11.1s
13:	total: 155ms	remaining: 10.9s
14:	total: 165ms	remaining: 10.8s
15:	learn: 6.4068380	total: 175ms	remaining: 10.7s
16:	total: 184ms	remaining: 10.6s
17:	total: 193ms	remaining: 10.6s
18:	total: 203ms	remaining: 10.5s
19:	total: 211ms	remaining: 10.3s
20:	learn: 6.4059906	total: 220ms	remaining: 10.2s
21:	total: 228ms	remaining: 10.1s
22:	total: 236ms	remaining: 10s
23:	total: 244ms	remaining: 9.94s
24:	total: 253ms	remaining: 9.86s
25:	learn: 6.4051608	total: 261ms	remaining: 9.79s
26:	total: 270ms	rem

Then we evaluate the CatBoost predictions on the validation set using mean absolute error (MAE):

In [12]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_val, y_pred)
print(mae)

6.346476399623894
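
An MAE value is easier to interpret next to a constant-prediction baseline. The sketch below uses made-up targets (`y_toy` is hypothetical, and the Laplace distribution is only an assumption for illustration) to compute MAE by hand and to compare three constant predictors: zero, the mean, and the median. Among constant predictions the median gives the lowest MAE, so it makes a natural reference point.

```python
import numpy as np

def mae_manual(y_true, y_hat):
    """Mean absolute error: the average of |y_true - y_hat|."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_hat))))

# Made-up targets for illustration only; the notebook would use y_val here
rng = np.random.default_rng(0)
y_toy = rng.laplace(loc=0.0, scale=6.0, size=10_000)

# MAE of three constant predictors
zero_mae = mae_manual(y_toy, np.zeros_like(y_toy))
mean_mae = mae_manual(y_toy, np.full_like(y_toy, np.mean(y_toy)))
median_mae = mae_manual(y_toy, np.full_like(y_toy, np.median(y_toy)))

print(f"zero: {zero_mae:.4f}  mean: {mean_mae:.4f}  median: {median_mae:.4f}")
```

Running the same comparison with `y_val` in place of `y_toy` would show how much of the model's score comes from the features rather than from predicting a constant.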


In [None]:
import numpy as np

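def _drop_null_rows(data):
    """
    Drop rows where target, near_price, or far_price is null.

    Returns a copy so that callers can add columns without triggering
    pandas' SettingWithCopyWarning or modifying the original frame.
    """
    return data.dropna(subset=['target', 'near_price', 'far_price']).copy()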

def feature_engineering_none(data):
    """
    No feature engineering, returns raw features only after dropping nulls.
    """
    return _drop_null_rows(data)

def feature_engineering_non_leaderboard(data):
    """
    Feature engineering with new features not from leaderboard posts.
    """
    data = _drop_null_rows(data)
    # Volatility features
    data['wap_volatility'] = data.groupby('stock_id')['wap'].transform(
        lambda x: x.pct_change().rolling(window=5, min_periods=1).std()
    )
    data['bid_ask_spread'] = data['ask_price'] - data['bid_price']
    data['bid_ask_volatility'] = data.groupby('stock_id')['bid_ask_spread'].transform(
        lambda x: x.rolling(window=5, min_periods=1).std()
    )

    # Momentum features
    data['wap_momentum'] = data.groupby('stock_id')['wap'].transform(
        lambda x: x.pct_change(periods=3)
    )
    data['price_momentum'] = data.groupby('stock_id')['reference_price'].transform(
        lambda x: x.pct_change(periods=3)
    )

    # Log transformations
    size_cols = ['imbalance_size', 'matched_size', 'bid_size', 'ask_size']
    for col in size_cols:
        data[f'log_{col}'] = np.log1p(data[col].clip(lower=0))

    # Time-based interactions
    data['bucket_price_interaction'] = data['seconds_in_bucket'] * data['reference_price']
    data['bucket_imbalance_interaction'] = data['seconds_in_bucket'] * data['imbalance_size']

    # Relative price features
    data['wap_to_ref_price'] = data['wap'] / (data['reference_price'] + 1e-6)
    data['bid_to_ask_price'] = data['bid_price'] / (data['ask_price'] + 1e-6)

    # Handle NaN and inf
    new_cols = [col for col in data.columns if col not in ['stock_id', 'date_id', 'target', 'time_id', 'row_id']]
    data[new_cols] = data[new_cols].replace([np.inf, -np.inf], np.nan).fillna(data[new_cols].median())

    return data

def feature_engineering_leaderboard(data):
    """
    Simplified feature engineering based on 1st, 9th, and 14th place Kaggle solutions.
    """
    data = _drop_null_rows(data)
    # Handle NaN and infinities in input columns
    input_cols = ['imbalance_size', 'matched_size', 'ask_price', 'bid_price', 'wap', 'reference_price']
    for col in input_cols:
        data[col] = data[col].replace([np.inf, -np.inf], np.nan).fillna(data[col].median())

    # 1st Place: Seconds in bucket group
    data['seconds_in_bucket_group'] = np.where(data['seconds_in_bucket'] < 300, 0,
                                              np.where(data['seconds_in_bucket'] < 480, 1, 2))

    # 9th Place: Basic features
    data['bid_ask_spread'] = data['ask_price'] - data['bid_price']
    data['imbalance_ratio'] = data['imbalance_size'] / (data['matched_size'] + 1e-6)

    # 14th Place: Mid price
    data['mid_price'] = (data['ask_price'] + data['bid_price']) / 2

    # Time in auction
    data['time_in_auction'] = data['seconds_in_bucket'] / 540

    # Handle NaN and inf in new features
    new_cols = ['seconds_in_bucket_group', 'bid_ask_spread', 'imbalance_ratio', 'mid_price', 'time_in_auction']
    data[new_cols] = data[new_cols].replace([np.inf, -np.inf], np.nan).fillna(data[new_cols].median())

    return data

def feature_engineering_combined(data):
    """
    Combine non-leaderboard and simplified leaderboard features.
    """
    # Start with non-leaderboard features
    data = feature_engineering_non_leaderboard(data.copy())
    # Add simplified leaderboard features
    input_cols = ['imbalance_size', 'matched_size', 'ask_price', 'bid_price', 'wap', 'reference_price']
    for col in input_cols:
        data[col] = data[col].replace([np.inf, -np.inf], np.nan).fillna(data[col].median())

    data['seconds_in_bucket_group'] = np.where(data['seconds_in_bucket'] < 300, 0,
                                              np.where(data['seconds_in_bucket'] < 480, 1, 2))
    data['imbalance_ratio'] = data['imbalance_size'] / (data['matched_size'] + 1e-6)
    data['mid_price'] = (data['ask_price'] + data['bid_price']) / 2
    data['time_in_auction'] = data['seconds_in_bucket'] / 540

    # Handle NaN and inf in new features
    new_cols = ['seconds_in_bucket_group', 'imbalance_ratio', 'mid_price', 'time_in_auction']
    data[new_cols] = data[new_cols].replace([np.inf, -np.inf], np.nan).fillna(data[new_cols].median())

    return data
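
To make the rolling features above more concrete, here is a small sketch that applies the same `groupby(...).transform(...)` pattern as `wap_volatility` to a made-up frame with two stocks. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Two stocks with four observations each (values are made up)
toy = pd.DataFrame({
    "stock_id": [0, 0, 0, 0, 1, 1, 1, 1],
    "wap": [1.00, 1.01, 1.02, 1.00, 2.00, 2.00, 2.00, 2.00],
})

# Same pattern as wap_volatility: rolling std of percentage changes, per stock
toy["wap_volatility"] = toy.groupby("stock_id")["wap"].transform(
    lambda x: x.pct_change().rolling(window=5, min_periods=1).std()
)

print(toy)
```

The first two rows of each stock come out as NaN: the first row has no previous value for `pct_change`, and the second has only a single change, for which a sample standard deviation is undefined. This is one reason the functions above fill remaining NaN values with the column median.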

In [None]:
# Written by Cooper Richmond
import xgboost as xgb
import pandas as pd
from sklearn.metrics import mean_absolute_error

def run_xgboost_regression(X_train, y_train, X_val, y_val, features, target='target', random_state=42):
    """
    Behavior:
    Runs XGBoost regression
    
    Parameters:
    - X_train (pd.DataFrame): training features
    - y_train (pd.Series): training target
    - X_val (pd.DataFrame): validation features 
    - y_val (pd.Series): validation target
    - features (list): names of the feature columns to train on
    - target (str): target column name (currently unused; y_train and y_val are passed directly)
    - random_state (int): random seed (default: 42)
    
    Returns:
    - model: trained XGBoost model
    - val_mae: validation MAE
    - feature_importance: dataframe with feature importance
    """
    # params
    params = {
        'objective': 'reg:squarederror',
        'eval_metric': 'mae',
        'max_depth': 6,
        'learning_rate': 0.05,
        'n_estimators': 1000,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'random_state': random_state,
        'n_jobs': -1
    }
    
    model = xgb.XGBRegressor(**params)
    
    # Evaluate on the same feature columns the model is trained on
    eval_set = [(X_train[features], y_train), (X_val[features], y_val)]
    
    # Train
    model.fit(
        X_train[features], y_train,
        eval_set=eval_set,
        verbose=100
    )
    
    # Predict on validation
    y_pred = model.predict(X_val[features])
    
    val_mae = mean_absolute_error(y_val, y_pred)
    
    # Feature importance
    feature_importance = pd.DataFrame({
        'feature': features,
        'importance': model.feature_importances_
    }).sort_values(by='importance', ascending=False)
    
    print(f"Validation MAE: {val_mae:.6f}")
    print("\nTop 5 Features:")
    print(feature_importance.head())
    
    return model, val_mae, feature_importance


from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def train_random_forest(X_train, y_train, X_test=None, y_test=None, sample_size=500000, random_state=42):
    """
    Train a Random Forest regressor on a random subset of training data.
    
    Parameters:
    - X_train (pd.DataFrame): Training features
    - y_train (pd.Series): Training target
    - X_test (pd.DataFrame): Test features (optional)
    - y_test (pd.Series): Test target (optional)
    - sample_size (int): Number of rows to sample for training (default: 500,000)
    - random_state (int): Random seed for reproducibility
    
    Returns:
    - model: Trained Random Forest model
    - mae: Mean Absolute Error on test set (if provided), else None
    """
    # Sample a reproducible subset of the training data (seeded by random_state)
    if len(X_train) > sample_size:
        X_train_subset = X_train.sample(n=sample_size, random_state=random_state)
        y_train_subset = y_train.loc[X_train_subset.index]
    else:
        X_train_subset = X_train
        y_train_subset = y_train
    
    # Train model
    rf = RandomForestRegressor(n_estimators=10, random_state=random_state, n_jobs=-1)
    rf.fit(X_train_subset, y_train_subset)
    
    # Compute metrics if test data provided
    if X_test is not None and y_test is not None:
        y_pred = rf.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
        return rf, mae
    
    return rf, None

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def train_lin_reg(X_train, y_train, X_test=None, y_test=None):
    """
    Train a Linear Regression model and return model and MAE.

    Parameters:
    - X_train (pd.DataFrame): Training features
    - y_train (pd.Series): Training target
    - X_test (pd.DataFrame): Test features (optional)
    - y_test (pd.Series): Test target (optional)

    Returns:
    - model: Trained Linear Regression model
    - mae: Mean Absolute Error on test set (if provided), else None
    """
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    
    if X_test is not None and y_test is not None:
        y_pred = lr.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
        return lr, mae
    
    return lr, None
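
The plan from the introduction, comparing models across several feature sets, can be sketched as a small loop. The helper `compare_feature_sets`, the `toy` frame, and the feature-set names below are hypothetical and use made-up data. The sketch only illustrates the comparison pattern with a linear regression and does not reproduce the notebook's actual experiments.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def compare_feature_sets(data, feature_sets, target="target", random_state=42):
    """Fit one LinearRegression per feature set and return the validation MAE of each."""
    train, val = train_test_split(data, test_size=0.33, random_state=random_state)
    scores = {}
    for name, cols in feature_sets.items():
        model = LinearRegression().fit(train[cols], train[target])
        scores[name] = mean_absolute_error(val[target], model.predict(val[cols]))
    return pd.Series(scores).sort_values()

# Made-up data in which the target depends on a ratio of two raw columns
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "imbalance_size": rng.uniform(0.0, 10.0, 2000),
    "matched_size": rng.uniform(1.0, 10.0, 2000),
})
toy["imbalance_ratio"] = toy["imbalance_size"] / (toy["matched_size"] + 1e-6)
toy["target"] = 5.0 * toy["imbalance_ratio"] + rng.normal(0.0, 0.1, 2000)

scores = compare_feature_sets(toy, {
    "raw": ["imbalance_size", "matched_size"],
    "raw+ratio": ["imbalance_size", "matched_size", "imbalance_ratio"],
})
print(scores)
```

In this toy setup the engineered ratio lowers the validation MAE because a linear model cannot represent a ratio of two raw columns on its own. Whether the same holds on the real data has to be checked empirically.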
