Gabriel Natter (e12321311)

# 194.025 EML Project Competition

## Project Description

This project classifies Rocket League trickshots from gameplay data. The goal is to predict which type of trickshot was performed, using features such as player and ball movements, speeds, and control inputs; the classes are split evenly in the dataset. I condensed each shot's timestamped data into per-shot features, compared a Decision Tree with a Random Forest, and tuned the more promising of the two. The final submission is a CSV file of predicted labels for the test dataset. Throughout, I aimed to balance simplicity and effectiveness.

## Library Imports

I imported pandas for data handling, matplotlib for visualizations, and scikit-learn tools like DecisionTreeClassifier, RandomForestClassifier, GridSearchCV, and metrics for model training and evaluation. This gave me a compact toolkit to manage the classification task efficiently.

In [None]:
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, f1_score, confusion_matrix, make_scorer, classification_report

## Data Loading and Initial Exploration

I loaded the training and test datasets from the provided CSV files with pandas and took a quick look at the training data's structure and quality before preprocessing. info(), head(), and describe() showed the columns, data types, and basic statistics, and the normalized label counts let me confirm that the classes were balanced. I also counted missing values per column and unique values per categorical column to spot potential issues early, such as nulls or high-cardinality features that might complicate modeling. Finally, I computed pairwise correlations between the numeric columns to see whether any features were strongly related, which could guide feature selection later.

A dataset description can be found [here](https://www.kaggle.com/competitions/194-025-eml-project-competition/data).

In [None]:
train = pd.read_csv("data/rocketskillshots_train.csv")
test = pd.read_csv("data/rocketskillshots_test.csv")

In [None]:
train.info()  # prints its summary directly and returns None, so no display() needed
display(train.head())
display(train.describe())
display(train["label"].value_counts(normalize=True))
display(train.isnull().sum())
display(train.select_dtypes(include="object").nunique())
train.corr(numeric_only=True)
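
A full correlation matrix is hard to scan when there are many columns. A minimal sketch of listing only the strongly correlated feature pairs is shown below. The helper `top_correlated_pairs`, the 0.9 threshold, and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold.

    Hypothetical helper for illustration only.
    """
    corr = df.corr(numeric_only=True).abs()
    # Keep the strict upper triangle so each pair appears once, without the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False and are filtered out here.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data standing in for the real training frame (values are made up).
rng = np.random.default_rng(0)
speed = rng.normal(size=200)
toy = pd.DataFrame({
    "PlayerSpeed": speed,
    "BallSpeed": speed * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear
    "DistanceWall": rng.normal(size=200),
})

pairs = top_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

Pairs that show up here are candidates for dropping one of the two features, although tree-based models are usually fairly tolerant of correlated inputs.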

## Data Preprocessing

I began preprocessing by listing the summary features I wanted to keep: precomputed skew statistics that capture the shape of each signal over a shot. The skew features for the up, down, left, right, and camera inputs are commented out and therefore excluded. I then split the training and test data on the "window_id" column: rows with a "window_id" hold the timestamped, play-by-play details, while rows without one hold the precomputed summary statistics. Splitting them this way let me process each part appropriately while preserving the useful information for later merging.

In [None]:
summary_features = [
    "BallAcceleration_skew",
    "Time_skew",
    "DistanceWall_skew", 
    "DistanceCeil_skew",
    "DistanceBall_skew",
    "PlayerSpeed_skew", 
    "BallSpeed_skew",
    #"up_skew",
    "accelerate_skew",
    "slow_skew", 
    "goal_skew",
    #"left_skew",
    "boost_skew",
    #"camera_skew", 
    #"down_skew",
    #"right_skew",
    "slide_skew",
    "jump_skew"
]

train_timestamp = train[train["window_id"].notna()].copy()
test_timestamp = test[test["window_id"].notna()].copy()

train_summary = train[train["window_id"].isna()][["id"] + summary_features].copy()
test_summary = test[test["window_id"].isna()][["id"] + summary_features].copy()
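
The split assumes that every shot `id` contributes several timestamped rows and exactly one summary row. A short sketch of checking that assumption on a toy frame follows; the values and the exact NaN pattern are made up for illustration and may differ from the real files.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout described above: rows with a window_id are
# timestamped samples, rows without one carry the precomputed summary stats.
# All values here are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "window_id": [0.0, 1.0, np.nan, 0.0, 1.0, np.nan],
    "BallSpeed": [10.0, 12.0, np.nan, 5.0, 7.0, np.nan],
    "BallSpeed_skew": [np.nan, np.nan, 0.3, np.nan, np.nan, -0.1],
})

is_summary = toy["window_id"].isna()
toy_timestamp = toy[~is_summary].copy()
toy_summary = toy[is_summary][["id", "BallSpeed_skew"]].copy()

# Every id should have exactly one summary row and at least one timestamped row.
summary_is_unique = toy_summary["id"].is_unique
ids_match = set(toy_summary["id"]) == set(toy_timestamp["id"])
print(summary_is_unique, ids_match)
```

If either check fails on the real data, the later merge on "id" would duplicate or drop shots, so it is worth running before aggregation.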

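I wrote a function that defines the aggregation rules for the timestamped data, which I grouped by "id" so that each trickshot becomes a single row. The rules keep the last value of the ball acceleration, the maximum of the time, the mean, minimum, and maximum of each distance, the same three statistics plus the last value for player and ball speed, and the mean and maximum of each action column (such as boost and jump). I picked these because they reflect the key moments and trends in a shot without bloating the dataset. For the training data, the label is carried along by taking its first value per shot. I applied the rules to both the training and test sets, flattened the resulting two-level column names (for example, "BallSpeed_mean"), and merged the summary features back in to enrich the data with skew statistics. This condensed the data into a format that's both compact and informative for modeling.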

In [None]:
def get_aggregation_rules(df):
    """Map each raw column to the statistics computed per shot; keep the label if present."""
    aggregation_rules = {
        "BallAcceleration": ["last"],
        "Time": ["max"],
        "DistanceWall": ["mean", "min", "max"],
        "DistanceCeil": ["mean", "min", "max"],
        "DistanceBall": ["mean", "min", "max"],
        "PlayerSpeed": ["mean", "min", "max", "last"],
        "BallSpeed": ["mean", "min", "max", "last"],
        #"up": ["mean", "max"],
        "accelerate": ["mean", "max"],
        "slow": ["mean", "max"],
        "goal": ["mean", "max"],
        #"left": ["mean", "max"],
        "boost": ["mean", "max"],
        #"camera": ["mean", "max"],
        #"down": ["mean", "max"],
        #"right": ["mean", "max"],
        "slide": ["mean", "max"],
        "jump": ["mean", "max"]
    }

    if "label" in df.columns:
        aggregation_rules["label"] = ["first"]
        
    return aggregation_rules

In [None]:
train = train_timestamp.groupby("id").agg(get_aggregation_rules(train)).reset_index()
test = test_timestamp.groupby("id").agg(get_aggregation_rules(test)).reset_index()

train.columns = [
    f"{col[0]}_{col[1]}" if (col[1] != "" and col[0] != "label") else col[0]
    for col in train.columns
]
test.columns = [
    f"{col[0]}_{col[1]}" if col[1] != "" else col[0] 
    for col in test.columns
]

train = train.merge(train_summary, on="id", how="left")
test = test.merge(test_summary, on="id", how="left")
display(train)
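
To make the aggregate, flatten, and merge steps concrete, here is a minimal sketch on a toy frame. The column subset, the values, and the use of `validate="one_to_one"` are assumptions for illustration; the original cell merges without that check.

```python
import pandas as pd

# Toy timestamped rows for two shots (values and labels are made up).
toy_timestamp = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "BallSpeed": [10.0, 30.0, 20.0, 5.0, 15.0],
    "jump": [0, 1, 0, 0, 0],
    "label": [3, 3, 3, 7, 7],
})
toy_summary = pd.DataFrame({"id": [1, 2], "BallSpeed_skew": [0.0, 0.0]})

rules = {"BallSpeed": ["mean", "last"], "jump": ["max"], "label": ["first"]}
agg = toy_timestamp.groupby("id").agg(rules).reset_index()

# groupby().agg() with lists yields two-level column names such as
# ("BallSpeed", "mean"); join them, but keep "id" and "label" unsuffixed.
agg.columns = [
    f"{name}_{stat}" if stat != "" and name != "label" else name
    for name, stat in agg.columns
]

# validate="one_to_one" raises if an id appears more than once on either side.
merged = agg.merge(toy_summary, on="id", how="left", validate="one_to_one")
print(merged)
```

Each shot ends up as one row, with the per-shot statistics first and the precomputed skew columns appended at the end.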

After aggregation, I split the training data into features and labels, dropping "id" since it's just an identifier. I used train_test_split with a 20% validation set, setting a random seed for reproducibility and stratifying to maintain the even class balance. This gave me clean training and validation sets ready for model training, ensuring the data was well-structured and representative.

In [None]:
y = train["label"]
X = train.drop(columns=["id", "label"])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
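
The effect of `stratify` can be verified by comparing class shares in the two splits. The sketch below uses synthetic labels (5 classes with 40 samples each, an assumption that does not reflect the real dataset) rather than the actual training data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 classes with 40 samples each (not the real data).
y_toy = pd.Series(np.repeat(np.arange(5), 40), name="label")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y, each class keeps its share in both splits.
train_share = y_tr.value_counts(normalize=True).sort_index()
val_share = y_va.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "val": val_share}))
```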

## Model Selection and Evaluation

I chose to compare a Decision Tree and a Random Forest for the trickshot classification task since they’re straightforward and handle multiclass problems well. I set a random seed for consistency across runs. For evaluation, I used accuracy and macro F1 score: with balanced classes accuracy is easy to interpret, while macro F1 averages the per-class F1 scores and so highlights classes that are predicted poorly. I wrote a function that runs 5-fold cross-validation on the training set to get a robust estimate of each model's performance, then fits the model on the training set and evaluates it on the held-out validation set. It also prints a classification report and plots a confusion matrix to show per-class performance and make misclassifications visible. This setup let me compare the models fairly and understand their strengths before picking one.

In [None]:
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42)
}

scorers = {
    "Accuracy": make_scorer(accuracy_score),
    "F1": make_scorer(f1_score, average="macro") # Good for balanced multiclass
}

In [None]:
def evaluate_model(model_name, model):
    """Print 5-fold CV scores on the training split, then fit and report on the validation split."""
    print(model_name, "\n")
    
    for score_name, scorer in scorers.items():
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring=scorer)
        print(f"{score_name}: {scores.mean():.4f} ± {scores.std():.4f}")
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    print("Classification Report:")
    print(classification_report(y_val, y_pred))

    cm = confusion_matrix(y_val, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap=plt.cm.Blues)
    plt.title(f"Confusion Matrix: {model_name}")
    plt.show()

In [None]:
for model_name, model in models.items():
    evaluate_model(model_name, model)
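
The function above calls `cross_val_score` once per metric, so each model is cross-validated twice. As an alternative sketch, scikit-learn's `cross_validate` can score several metrics in one pass. The synthetic data from `make_classification` and the smaller forest are assumptions to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic 3-class problem standing in for the real features and labels.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=42
)

# One cross-validation run, two metrics: the dict keys become "test_<key>".
results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_toy,
    y_toy,
    cv=5,
    scoring={"Accuracy": "accuracy", "F1": "f1_macro"},
)

for name in ("Accuracy", "F1"):
    scores = results[f"test_{name}"]
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")
```

This fits each fold once instead of once per metric, which matters more as the models get slower to train.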

## Hyperparameter Tuning

I decided to tune the Random Forest since it showed promise during initial model selection. I set up a grid search with the number of trees fixed at 200, varying the maximum depth, the minimum number of samples needed to split a node and to form a leaf, and whether to bootstrap, which gives 24 combinations. I used 5-fold cross-validation and optimized the macro F1 score so that every class counts equally, as the dataset has evenly split classes. Running the search on all available cores (n_jobs=-1) saved time. After finding the best parameters, I evaluated the tuned model using the same metrics and visuals as before to check whether it improved on the default setup. This step was about getting better accuracy and reliability from the Random Forest without overcomplicating the process.

In [None]:
param_grid = {
    "n_estimators": [200],
    "max_depth": [6, 8, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="f1_macro",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
print("Best parameters found:", grid_search.best_params_)
evaluate_model("Tuned Random Forest", best_model)
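
Beyond `best_params_`, the fitted search object keeps the score of every combination in `cv_results_`, which shows how close the runners-up were. A minimal sketch on synthetic data follows; the reduced grid, 3 folds, and the small forest are assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the real training split.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, 6], "min_samples_leaf": [1, 2]},
    cv=3,
    scoring="f1_macro",
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first.
ranking = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(ranking)
```

If the top few rows differ by less than their standard deviation, the simpler configuration (for example, a shallower tree depth) is often a reasonable choice.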

## Model Prediction and Submission

I trained the final Random Forest model on the full training data to make the most of all available information, as this usually boosts performance for predictions. I then used it to predict labels for the test set, keeping only the "id" column and predictions to match the submission format. I renamed the columns to "ID" and "LABEL" as required and saved the results to a CSV file. This step was about wrapping up the project cleanly, ensuring the output was ready for evaluation without any extra fuss.

In [None]:
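# Refit the tuned model (not the loop variable `model`, which still points at
# the untuned Random Forest) on all labelled data before predicting.
best_model.fit(X, y)
test["pred"] = best_model.predict(test[X.columns])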

In [None]:
test = test[["id", "pred"]].rename(columns={"id": "ID", "pred": "LABEL"})
test.to_csv("submission/submission.csv", index=False)
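
Before uploading, a few quick checks on the submission frame can catch formatting mistakes. The helper `check_submission` below is a hypothetical sketch: the "ID" and "LABEL" column names come from the text above, while the specific checks and the toy values are assumptions.

```python
import pandas as pd


def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none).

    Hypothetical helper for illustration only.
    """
    problems = []
    if list(submission.columns) != ["ID", "LABEL"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    if submission["LABEL"].isna().any():
        problems.append("missing predictions")
    if submission["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if set(submission["ID"]) != set(expected_ids):
        problems.append("IDs do not match the test set")
    return problems


# Toy frames with made-up IDs and labels.
good = pd.DataFrame({"ID": [10, 11, 12], "LABEL": [1, 3, 1]})
bad = pd.DataFrame({"ID": [10, 10, 12], "LABEL": [1, None, 1]})

good_problems = check_submission(good, [10, 11, 12])
bad_problems = check_submission(bad, [10, 11, 12])
print(good_problems)
print(bad_problems)
```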

## Conclusion

I picked a Random Forest for the final submission because it outperformed the Decision Tree in initial tests, showing better accuracy and F1 scores across balanced classes. I tuned it with grid search over tree depth, split and leaf sizes, and bootstrapping, with the number of trees fixed at 200, settling on the best combination for macro F1 to ensure solid per-class performance. For data processing, I split the timestamped and summary data to handle them separately, aggregating the timestamped part per shot with rules such as the mean, minimum, and maximum for distances, the last value for speeds, and means for actions to capture trickshot patterns concisely. I merged in skew features to add context, then split the data with stratification to keep the class balance. This approach kept the dataset manageable and focused on key shot dynamics, which helped the model learn effectively without unnecessary complexity. Working on this project was a great experience - I really enjoyed the challenge and had fun putting it all together.