<a href="https://colab.research.google.com/github/stepthom/869_course/blob/main/2026%20869%20Project%20Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MMAI 869 Project: Example Notebook

*Updated May 1, 2025*

This notebook serves as a template for the Team Project. Teams can use this notebook as a starting point, and update it successively with new ideas and techniques to improve their model results.

Note that using this template is not required. Teams may also alter it in any way they see fit.

# Preliminaries: Inspect and Set up environment

No action is required on your part in this section. These cells print out helpful information about the environment, just in case.

In [1]:
# 🧰 General-purpose libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import joblib


# 🧪 Scikit-learn preprocessing & pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# 🔍 Scikit-learn model selection
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    cross_validate,
    GridSearchCV,
    StratifiedKFold
)

# 🧠 Scikit-learn classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
    ExtraTreesClassifier
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# 🚀 Gradient boosting frameworks
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# 📊 Evaluation
from sklearn.metrics import accuracy_score, classification_report

# 🧪 Sample dataset (for testing/demo)
from sklearn.datasets import make_classification

import warnings
warnings.filterwarnings('ignore', category=UserWarning)


In [2]:
!python --version

Python 3.10.16


# 0: Data Loading and Inspection

In [8]:
# Load complete processed dataset
df_processed = pd.read_csv('../data/processed/train_dataset_spaceship_titanic_processed.csv')
X_train = df_processed.drop(['Transported', 'PassengerId'], axis=1, errors='ignore')
y_train = df_processed['Transported']

In [10]:
df_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Columns: 101 entries, PassengerId to CabinSide_U
dtypes: bool(15), float64(43), int64(43)
memory usage: 5.8 MB


In [13]:
df_processed.columns.to_list()

['PassengerId',
 'CryoSleep',
 'Age',
 'VIP',
 'RoomService',
 'FoodCourt',
 'ShoppingMall',
 'Spa',
 'VRDeck',
 'Transported',
 'HasAgeOutlier',
 'HasSpendingOutlier',
 'PassengerGroup',
 'GroupMember',
 'GroupSize',
 'GroupSurvivalRate',
 'IsSolo',
 'IsLargeGroup',
 'GroupTotalSpend',
 'CabinNum',
 'DeckLevel',
 'IsStarboard',
 'IsPort',
 'IsUnknownSide',
 'CabinNumQuartile',
 'IsLuxuryCabin',
 'IsStandardCabin',
 'HasCabin',
 'TotalSpend',
 'LuxurySpend',
 'FoodSpend',
 'ShoppingSpend',
 'IsSpender',
 'IsHighSpender',
 'IsLowSpender',
 'UsesLuxury',
 'TotalSpend_Log',
 'LuxurySpend_Log',
 'FoodSpend_Log',
 'TotalSpend_Sqrt',
 'SpendDiversity',
 'SpendPerGroupMember',
 'SpendRatioInGroup',
 'SpendPercentileInGroup',
 'IsChild',
 'IsTeen',
 'IsYoungAdult',
 'IsMiddleAged',
 'IsElderly',
 'Age_Squared',
 'Age_Cubed',
 'Age_Log',
 'Age_Sqrt',
 'PlanetSurvivalRate',
 'DestSurvivalRate',
 'PlanetDestSurvivalRate',
 'FamilySize',
 'FamilySurvivalRate',
 'IsLargeFamily',
 'IsMediumFamily',


## 1.2: Model creation, hyperparameter tuning, and validation

### STEP 1: Baseline Model Experimentation
Evaluate both tree-based and non-tree ML models with 5-fold cross-validation and compare macro F1, weighted F1, and accuracy. No further feature engineering is applied at this step; the models are trained as-is on the processed dataset loaded above. The scores serve as a baseline for leaderboard benchmarking.
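To make the three metrics concrete, here is a minimal sketch on hand-made toy labels (an assumption for illustration, not the project data). It shows how macro F1, weighted F1, and accuracy can diverge when classes are imbalanced. When the classes are roughly balanced, the three numbers tend to sit close together.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): 8 negatives and 2 positives, imbalanced on purpose
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# The classifier misses one of the two positives
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Macro F1: unweighted mean of the per-class F1 scores
f1_macro = f1_score(y_true, y_pred, average="macro")
# Weighted F1: per-class F1 scores weighted by class support
f1_weighted = f1_score(y_true, y_pred, average="weighted")
# Accuracy: fraction of correct predictions
acc = accuracy_score(y_true, y_pred)

print(f"F1 (Macro):    {f1_macro:.3f}")
print(f"F1 (Weighted): {f1_weighted:.3f}")
print(f"Accuracy:      {acc:.3f}")
```

Macro F1 penalizes the missed minority-class example most heavily because each class counts equally, which is why it is often the more conservative of the three on imbalanced data.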

In [11]:
%%time

# Define short descriptions
model_descriptions = {
    "Decision Tree": "A simple, interpretable tree that splits data based on feature thresholds.",
    "Random Forest": "An ensemble of decision trees that improves generalization via bagging.",
    "Gradient Boosting": "A sequential ensemble where each tree corrects errors from the last.",
    "AdaBoost": "A boosting method that emphasizes misclassified examples during training.",
    "Logistic Regression": "A linear model that predicts probabilities for classification tasks.",
    "SVM": "A margin-based classifier that finds the optimal boundary between classes using support vectors.",
    "XGBoost": "A scalable, regularized boosting method with tree-based learners.",
    "LightGBM": "A fast, efficient gradient boosting method based on histogram-based learning.",
    "CatBoost": "A gradient boosting library with native support for categorical features."
}

# Define model types
model_types = {
    "Decision Tree": "Tree-Based",
    "Random Forest": "Tree-Based",
    "Gradient Boosting": "Tree-Based",
    "AdaBoost": "Tree-Based",
    "XGBoost": "Tree-Based",
    "CatBoost": "Tree-Based",
    "Logistic Regression": "Non-Tree",
    "SVM": "Non-Tree"
}

# Define models
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=5000, random_state=0),
    "SVM": SVC(kernel="rbf", C=1.0, probability=True, random_state=0),
    "XGBoost": XGBClassifier(eval_metric='logloss', random_state=0),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=0)
}

separator = "-" * 60
results = {}

print("\n📊 Starting model evaluation with 5-fold cross-validation\n")

for name, model in models.items():
    print(separator)
    print(f"🔍 Model: {name}")
    print(f"🧠 Description: {model_descriptions.get(name, 'N/A')}")
    print(f"⚙️  Params: {model.get_params()}")

    cv_result = cross_validate(
        model, X_train, y_train,
        cv=5,
        scoring=["f1_macro", "f1_weighted", "accuracy"],
        return_train_score=True,
        n_jobs=-1
    )

    results[name] = {
        "Model Type": model_types[name],
        "Train F1 (Macro)": np.mean(cv_result["train_f1_macro"]),
        "CV F1 (Macro)": np.mean(cv_result["test_f1_macro"]),
        "CV F1 (Weighted)": np.mean(cv_result["test_f1_weighted"]),
        "CV Accuracy": np.mean(cv_result["test_accuracy"])
    }

    print(f"✅ Train F1 (Macro): {results[name]['Train F1 (Macro)']:.2f}")
    print(f"✅ CV F1 (Macro): {results[name]['CV F1 (Macro)']:.2f}")
    print(f"✅ CV F1 (Weighted): {results[name]['CV F1 (Weighted)']:.2f}")
    print(f"✅ CV Accuracy: {results[name]['CV Accuracy']:.2f}")

# Final summary
print("\n✅ All models evaluated. Summary below:\n")

summary_df = pd.DataFrame(results).T.sort_values("CV Accuracy", ascending=False)
display(summary_df)



📊 Starting model evaluation with 5-fold cross-validation

------------------------------------------------------------
🔍 Model: Decision Tree
🧠 Description: A simple, interpretable tree that splits data based on feature thresholds.
⚙️  Params: {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': 0, 'splitter': 'best'}
✅ Train F1 (Macro): 0.94
✅ CV F1 (Macro): 0.94
✅ CV F1 (Weighted): 0.94
✅ CV Accuracy: 0.94
------------------------------------------------------------
🔍 Model: Random Forest
🧠 Description: An ensemble of decision trees that improves generalization via bagging.
⚙️  Params: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_d

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

✅ Train F1 (Macro): 0.78
✅ CV F1 (Macro): 0.77
✅ CV F1 (Weighted): 0.77
✅ CV Accuracy: 0.77
------------------------------------------------------------
🔍 Model: SVM
🧠 Description: A margin-based classifier that finds the optimal boundary between classes using support vectors.
⚙️  Params: {'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': True, 'random_state': 0, 'shrinking': True, 'tol': 0.001, 'verbose': False}
✅ Train F1 (Macro): 0.61
✅ CV F1 (Macro): 0.59
✅ CV F1 (Weighted): 0.59
✅ CV Accuracy: 0.60
------------------------------------------------------------
🔍 Model: XGBoost
🧠 Description: A scalable, regularized boosting method with tree-based learners.
⚙️  Params: {'objective': 'binary:logistic', 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'de

| Model | Model Type | Train F1 (Macro) | CV F1 (Macro) | CV F1 (Weighted) | CV Accuracy |
|---|---|---|---|---|---|
| CatBoost | Tree-Based | 0.995657 | 0.958792 | 0.958788 | 0.958818 |
| Gradient Boosting | Tree-Based | 0.972046 | 0.957635 | 0.95763 | 0.957667 |
| XGBoost | Tree-Based | 1.0 | 0.957303 | 0.957301 | 0.957322 |
| Random Forest | Tree-Based | 1.0 | 0.957316 | 0.957315 | 0.957321 |
| AdaBoost | Tree-Based | 0.95824 | 0.954778 | 0.954777 | 0.95479 |
| Decision Tree | Tree-Based | 0.939229 | 0.93776 | 0.937757 | 0.937765 |
| Logistic Regression | Non-Tree | 0.778769 | 0.770847 | 0.770869 | 0.771314 |
| SVM | Non-Tree | 0.606735 | 0.594704 | 0.594638 | 0.595997 |


CPU times: user 114 ms, sys: 135 ms, total: 249 ms
Wall time: 36.9 s
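The convergence warnings in the output above recommend scaling the data. Both of the non-tree models, Logistic Regression and the RBF-kernel SVM, are sensitive to feature scale. A minimal sketch of one way to do this is to wrap each estimator in a `Pipeline` with a `StandardScaler`, so that the scaler is re-fit on the training portion of every fold. The synthetic data from `make_classification`, the inflated feature scales, and the names `X_demo`, `y_demo`, `scaled_models`, and `scaled_results` are assumptions for illustration. Whether scaling helps on the project data would need to be checked by substituting `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (assumption: not the project data)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
# Inflate a few columns to mimic features on very different scales
X_demo[:, :5] *= 1000.0

# Each pipeline standardizes features before fitting the classifier
scaled_models = {
    "Logistic Regression (scaled)": Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=5000, random_state=0)),
    ]),
    "SVM (scaled)": Pipeline([
        ("scale", StandardScaler()),
        ("clf", SVC(kernel="rbf", C=1.0, random_state=0)),
    ]),
}

scaled_results = {}
for name, pipe in scaled_models.items():
    # The scaler is fit inside each CV fold, so validation rows never
    # influence the scaling statistics
    cv_result = cross_validate(
        pipe, X_demo, y_demo,
        cv=5,
        scoring=["f1_macro", "accuracy"],
    )
    scaled_results[name] = {
        "CV F1 (Macro)": cv_result["test_f1_macro"].mean(),
        "CV Accuracy": cv_result["test_accuracy"].mean(),
    }
    print(name, scaled_results[name])
```

Putting the scaler inside the pipeline, rather than scaling the whole dataset once up front, keeps the cross-validation estimate honest because each validation fold is transformed with statistics learned only from its training folds.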
