Gabriel Natter (e12321311)

# 194.025 EML Project Competition

## Project Description

In this project, I worked on a classification task as part of a university machine learning competition. The dataset and competition were hosted on Kaggle, where the goal was to develop machine learning models that perform well on an unseen test set. Submissions are scored on a leaderboard, and higher scores translate into more points for the course.

The main objective was to beat several baseline models provided by the course organizers, such as random guessing and majority voting, and ideally perform better than progressively stronger models for maximum points. To achieve this, I experimented with different algorithms, preprocessing strategies, and evaluation methods.

I started by exploring and understanding the dataset in detail, including data types, distributions, missing values, and class imbalance. Based on these insights, I implemented appropriate preprocessing steps to clean and prepare the data.

The core of the project was training and evaluating at least two different models. I chose to focus on Decision Trees and Random Forests, both of which are well-suited for classification tasks and provide a good balance between interpretability and performance. I used a train-validation split and cross-validation to assess model performance more reliably and to detect overfitting. To further improve the models, I performed hyperparameter tuning using grid search.

Throughout the notebook, I evaluated models using accuracy, precision, recall, F1 score, and confusion matrices to understand not just how often the models were right, but also which classes they confused.

The final part of the project was selecting one model for submission. This decision was based on both performance metrics and the model's generalization behavior on validation data. All relevant steps, including preprocessing choices, modeling decisions, and performance evaluations, are documented below.

## Table of Contents

1. [Library Imports and Setup](#1-library-imports-and-setup)
2. [Data Exploration](#2-data-exploration)
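3. [Data Preprocessing](#3-data-preprocessing)
4. [Model Selection](#4-model-selection)
5. [Predict](#5-predict)
6. [Create submission](#6-create-submission)
7. [Conclusion](#7-conclusion)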

## 1. Library Imports and Setup

In this section, I import all the necessary Python libraries used throughout the notebook. These include tools for data manipulation, model training and evaluation, as well as hyperparameter tuning and performance metrics.

In [None]:
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

import matplotlib.pyplot as plt

import seaborn as sns

## 2. Data Exploration

Before training any models, it's important to understand the structure and characteristics of the data. Below, I load the training and test datasets provided in the competition. I also perform some basic exploration to get an initial feel for the data, such as looking at the first few rows, checking data types and null values, and reviewing the label distribution.

The dataset description can be found [here](https://www.kaggle.com/competitions/194-025-eml-project-competition/data).

In [None]:
train = pd.read_csv("data/rocketskillshots_train.csv")
test = pd.read_csv("data/rocketskillshots_test.csv")

In [None]:
train.info()  # prints its summary itself and returns None, so no display() is needed
display(train.head())
display(train.describe())
display(train["label"].value_counts(normalize=True))
train.corr(numeric_only=True)
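
The checks for missing values and class balance mentioned above can be sketched as follows. This is only an illustration on a made-up toy frame (`toy` and its values are assumptions, not the competition data). Because several rows can share one `id`, the label share is counted once per `id` here, assuming the label is constant within an `id`.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "BallSpeed": [10.0, None, 7.5, 8.0, 9.0],
    "label": [0, 0, 1, 1, 0],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Class balance counted once per id rather than once per row
# (assumes every row of an id carries the same label)
label_share = toy.groupby("id")["label"].first().value_counts(normalize=True)

print(missing_share)
print(label_share)
```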

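## 3. Data Preprocessing

The raw data contains several rows per `id`, so I first aggregate them into one row per `id`. Each numeric feature is summarized by its mean, maximum, and minimum, `Time` is summed, and each `*_skew` column is carried over by taking its first value. The resulting two-level column names are then flattened into names such as `BallSpeed_mean`.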

In [None]:

def get_aggregation_rules(df):
    # Base signals recorded for every row of an id
    base_columns = [
        "BallAcceleration", "Time", "DistanceWall", "DistanceCeil",
        "DistanceBall", "PlayerSpeed", "BallSpeed", "up", "accelerate",
        "slow", "goal", "left", "boost", "camera", "down", "right",
        "slide", "jump",
    ]
    # Time is summed per id; every other signal gets mean, max and min
    aggregation_rules = {
        col: "sum" if col == "Time" else ["mean", "max", "min"]
        for col in base_columns
    }
    # The *_skew columns are carried over by taking the first value per id
    aggregation_rules.update({f"{col}_skew": "first" for col in base_columns})
    # The label only exists in the training data
    if "label" in df.columns:
        aggregation_rules["label"] = "first"
    return aggregation_rules

train = train.groupby("id").agg(get_aggregation_rules(train)).reset_index()
test = test.groupby("id").agg(get_aggregation_rules(test)).reset_index()

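# Flatten the two-level column names produced by agg(), e.g.
# ("BallSpeed", "mean") -> "BallSpeed_mean". "id" has an empty second level
# and "label" keeps its plain name so that it can be selected directly later.
train.columns = [
    f"{col[0]}_{col[1]}" if (col[1] != "" and col[0] != "label") else col[0]
    for col in train.columns
]
# The test data has no label column, so only "id" needs special handling
test.columns = [
    f"{col[0]}_{col[1]}" if col[1] != "" else col[0]
    for col in test.columns
]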
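
To make the aggregation and the column flattening easier to follow, here is a minimal sketch on a made-up three-row frame. The column names are taken from the notebook, but the values and the shortened `rules` dictionary are assumptions for illustration only.

```python
import pandas as pd

# Toy data: id 1 has two rows, id 2 has one row (values are made up)
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "BallSpeed": [10.0, 20.0, 5.0],
    "Time": [0.5, 0.5, 1.0],
    "label": [3, 3, 7],
})

# Shortened version of the aggregation rules, mixing lists and single functions
rules = {"BallSpeed": ["mean", "max"], "Time": "sum", "label": "first"}
agg = toy.groupby("id").agg(rules).reset_index()

# agg() returns two-level column names; join them the same way as the notebook
agg.columns = [
    f"{top}_{func}" if (func != "" and top != "label") else top
    for top, func in agg.columns
]

print(agg)
```

Each `id` ends up as exactly one row, which is what lets the model later predict one label per `id`.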

After aggregating the data per `id`, I separated the target column (`label`) from the feature set and dropped `id`, which is only an identifier. `X` contains all input features, and `y` contains the target variable. To evaluate model performance before making final predictions, I then split the data into training and validation sets using an 80/20 ratio; in the code, the validation part is stored in `X_test` and `y_test`. A fixed random seed ensures that the split is reproducible across runs.

In [None]:
y = train["label"]
X = train.drop(columns=["id", "label"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
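
The notebook uses a plain random split. If the classes are imbalanced, one possible variant (not used above, shown only as a sketch on made-up data) is to pass `stratify` to `train_test_split` so that both parts keep the same class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an 80/20 class imbalance (made up for illustration)
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 80 + [1] * 20)

# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))
```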

## 4. Model Selection

In this step, I compare the performance of two classification models: a Decision Tree and a Random Forest, both with default parameters. I evaluate each model using accuracy, precision, recall, and F1 score on the validation set, and also compute 5-fold cross-validation accuracy to get a more robust estimate of each model’s generalization ability.

This comparison helps identify which model is worth tuning further. Random Forest typically performs better due to ensemble averaging, but I include both for completeness.

Each model is passed through two evaluation functions:
- `evaluate_classification`: measures performance on the held-out validation set
- `evaluate_with_cv`: computes average accuracy across 5 folds of cross-validation

In [None]:
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42)
}

In [None]:
def evaluate_classification(name, model):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    
    acc = accuracy_score(y_test, preds)
    precision = precision_score(y_test, preds, average="weighted")
    recall = recall_score(y_test, preds, average="weighted")
    f1 = f1_score(y_test, preds, average="weighted")
    cm = confusion_matrix(y_test, preds)
    print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False).head(10))

    print(f"{name} Evaluation:")
    print(f"Accuracy:  {acc:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1 Score:  {f1:.4f}")

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt="d", cbar=False)
    plt.title(f"{name} Confusion Matrix")
    plt.ylabel("Actual Label")
    plt.xlabel("Predicted Label")
    plt.show()
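
The `average="weighted"` setting used above can be illustrated with a small sketch: scikit-learn computes the metric per class and then averages the per-class values, weighted by how many true samples each class has. The label arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up true and predicted labels for three classes
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Precision of each class separately
per_class = precision_score(y_true, y_pred, average=None)

# Number of true samples per class (the "support")
support = np.bincount(y_true)

# Support-weighted mean of the per-class precisions
manual = np.sum(per_class * support) / support.sum()
weighted = precision_score(y_true, y_pred, average="weighted")

print(per_class, manual, weighted)
```

With balanced classes the weighted and the unweighted (`"macro"`) average coincide; with imbalanced classes the weighted variant is dominated by the frequent classes.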

In [None]:
def evaluate_with_cv(name, model):
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name} (5-fold CV Accuracy): {scores.mean():.4f} ± {scores.std():.4f}\n")

In [None]:
for name, model in models.items():
    evaluate_classification(name, model)
    evaluate_with_cv(name, model)
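
As a rough sketch of what `cross_val_score` does with `cv=5` for a classifier: it splits the data into five stratified folds, fits a fresh copy of the model on four of them, and scores it on the remaining one. The toy data from `make_classification` is an assumption for illustration, not the competition data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
model = DecisionTreeClassifier(random_state=42)

# One call, as in evaluate_with_cv
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")

# The same procedure written out by hand
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[val_idx], y_toy[val_idx]))

print(scores.mean(), np.mean(manual_scores))
```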

After evaluating the baseline models, I decided to continue with the Random Forest classifier for hyperparameter tuning. Although the Decision Tree achieved slightly better performance on the validation set and in cross-validation, Random Forests are generally more robust and less prone to overfitting due to their ensemble nature.

To optimize its performance, I used `GridSearchCV` to tune two important hyperparameters:

- `n_estimators`: the number of trees in the forest  
- `max_depth`: the maximum depth of each tree

A 5-fold cross-validation was used during the grid search to evaluate each parameter combination. This ensures the model is optimized for generalization and not just a single train-validation split.

After training, I retrieved and evaluated the best model found during the search (`best_estimator_`) using the same classification metrics as before.

In [None]:
param_grid = {
    "n_estimators": [200, 300],
    "max_depth": [6, 8, 10, 15, 20, 25],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),  # fixed seed so the search is reproducible
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:  ", grid_search.best_score_)

model = grid_search.best_estimator_
evaluate_classification("Tuned Random Forest", model)
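
Beyond `best_params_` and `best_score_`, the full grid can be inspected through `cv_results_`. The following is a minimal sketch on synthetic data with a deliberately small grid (both are assumptions chosen to keep the example fast); it shows how close the other parameter combinations come to the best one.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small grid
X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first
columns = [
    "param_n_estimators", "param_max_depth",
    "mean_test_score", "std_test_score", "rank_test_score",
]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(results)
```

If several combinations score within one standard deviation of each other, the smaller model is usually the cheaper choice.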

## 5. Predict

After selecting the final model, I refit it on the full training data and used it to predict the labels of the test dataset. Because the test data was aggregated per `id` during preprocessing, each row corresponds to one `id` and receives exactly one prediction.

In [None]:
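# Refit the selected model on all labeled data (training and validation parts)
model.fit(X, y)
# Predict one label per aggregated id; "id" itself is not a feature
test["pred"] = model.predict(test.drop(columns=["id"]))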
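
Since the model was fit on the columns of `X`, the test features need the same columns in the same order. A small safety check could look like the sketch below; the toy frames are made up, and the check itself is a suggestion rather than part of the original pipeline.

```python
import pandas as pd

# Toy training features and a test frame with the same columns in another order
X_toy = pd.DataFrame({"BallSpeed_mean": [1.0, 2.0], "Time_sum": [0.5, 0.7]})
test_toy = pd.DataFrame({
    "id": [10, 11],
    "Time_sum": [0.6, 0.9],
    "BallSpeed_mean": [1.5, 2.5],
})

test_features = test_toy.drop(columns=["id"])

# Fail early if the two feature sets differ
missing = set(X_toy.columns) - set(test_features.columns)
extra = set(test_features.columns) - set(X_toy.columns)
if missing or extra:
    raise ValueError(f"Column mismatch: missing={missing}, extra={extra}")

# Reorder the test columns to match the training order
test_features = test_features[X_toy.columns]

print(test_features.columns.tolist())
```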

## 6. Create submission

In [None]:
test = test[["id", "pred"]].rename(columns={"id": "ID", "pred": "LABEL"})
test.to_csv("submission/submission.csv", index=False)
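
`to_csv` fails if the target folder does not exist, so it can help to create it first and to read the file back as a quick sanity check. The sketch below writes to a temporary folder and uses made-up predictions; only the `ID` and `LABEL` column names are taken from the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up predictions, one per id
preds = pd.DataFrame({"id": [10, 11, 12], "pred": [1, 0, 1]})
submission = preds.rename(columns={"id": "ID", "pred": "LABEL"})

# Create the output folder if needed (a temporary location for this sketch)
out_dir = Path(tempfile.mkdtemp()) / "submission"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "submission.csv"
submission.to_csv(out_path, index=False)

# Read the file back and verify its shape before uploading
check = pd.read_csv(out_path)
print(check.columns.tolist(), len(check), check["ID"].is_unique)
```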

## 7. Conclusion

For this project, I experimented with both a Decision Tree and a Random Forest classifier. Although the Decision Tree showed slightly better validation performance, I chose to proceed with the Random Forest due to its better generalization capabilities and robustness to overfitting. To optimize its performance, I used `GridSearchCV` with 5-fold cross-validation to tune key hyperparameters (`n_estimators` and `max_depth`).

In terms of data preprocessing, I aggregated all rows sharing the same `id` into a single row, summarizing each feature by its mean, maximum, and minimum, summing `Time`, and keeping the first value of each `*_skew` column. The model therefore produces one prediction per `id` directly, which replaces the earlier approach of predicting per row and combining the results by majority voting.

# TODO
- Finish table of contents
- Fix imports
- Possibly also include the `..._skew` columns
- Tune the feature aggregation
- Rewrite most comments, as I removed the majority voting approach