<a href="https://colab.research.google.com/github/stepthom/869_course/blob/main/2026%20869%20Project%20Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MMAI 869 Project: Example Notebook

*Updated May 1, 2025*

This notebook serves as a template for the Team Project. Teams can use this notebook as a starting point, and update it successively with new ideas and techniques to improve their model results.

Using this template is not required, and teams may alter it in any way they see fit.

# Preliminaries: Inspect and Set up environment

No action is required on your part in this section. These cells print out helpful information about the environment, just in case.

In [1]:
# 🧰 General-purpose libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import joblib


# 🧪 Scikit-learn preprocessing & pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# 🔍 Scikit-learn model selection
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    cross_validate,
    GridSearchCV,
    StratifiedKFold
)

# 🧠 Scikit-learn classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
    ExtraTreesClassifier
)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# 🚀 Gradient boosting frameworks
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# 📊 Evaluation
from sklearn.metrics import accuracy_score, classification_report

# 🧪 Sample dataset (for testing/demo)
from sklearn.datasets import make_classification

import warnings
warnings.filterwarnings('ignore', category=UserWarning)


In [2]:
!python --version

Python 3.10.16


# 0: Data Loading and Inspection

In [3]:
# ================================================================
# LOAD TRAINING DATA WITH SELECTED FEATURES
# ================================================================

print("📥 LOADING TRAINING DATA WITH SELECTED FEATURES")
print("=" * 55)

# Load complete processed training dataset
df_processed = pd.read_csv('../data/processed/train_features_engineered.csv')
print(f"   ✅ Training dataset loaded: {df_processed.shape}")

# Load selected features from feature selection phase
best_features = pd.read_csv('../data/processed/best_features_selected.csv')['best_features'].tolist()
print(f"   ✅ Selected features loaded: {len(best_features)} features")

# Filter training data using selected features only
X_train = df_processed[best_features]
y_train = df_processed['Transported']

print(f"\n📊 FILTERED TRAINING DATA SUMMARY:")
print(f"   📊 Original features: {df_processed.shape[1] - 2}")  # Exclude PassengerId and Transported
print(f"   📊 Selected features: {X_train.shape[1]}")
print(f"   📊 Feature reduction: {((df_processed.shape[1] - 2 - X_train.shape[1]) / (df_processed.shape[1] - 2) * 100):.1f}%")
print(f"   📊 Training samples: {X_train.shape[0]}")
print(f"   📊 Target distribution: {y_train.sum()}/{len(y_train)} transported")

print(f"\n🎯 SELECTED FEATURES FOR OPTIMIZATION:")
for i, feature in enumerate(best_features, 1):
    print(f"   {i:2d}. {feature}")

print(f"\n✅ Training data ready for hyperparameter optimization!")

📥 LOADING TRAINING DATA WITH SELECTED FEATURES
   ✅ Training dataset loaded: (8693, 19)
   ✅ Selected features loaded: 6 features

📊 FILTERED TRAINING DATA SUMMARY:
   📊 Original features: 17
   📊 Selected features: 6
   📊 Feature reduction: 64.7%
   📊 Training samples: 8693
   📊 Target distribution: 4378/8693 transported

🎯 SELECTED FEATURES FOR OPTIMIZATION:
    1. HomePlanet
    2. CryoSleep
    3. RoomService
    4. TotalSpend
    5. LuxurySpend
    6. Cabin_HomePlanet

✅ Training data ready for hyperparameter optimization!


In [None]:
df_processed.head(10)

## 1.2: Model creation, hyperparameter tuning, and validation

### STEP 1: Baseline Model Experimentation
Evaluate both tree-based and non-tree ML models using 5-fold cross-validation, comparing macro F1, weighted F1, and accuracy. No hyperparameter tuning is applied, and every model is trained on the six selected features loaded above. The resulting scores serve as a baseline for leaderboard benchmarking.
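
To make the three metrics concrete, here is a minimal pure-Python sketch of how macro F1 and weighted F1 differ. The helper names and the toy labels are hypothetical and only for illustration; the notebook itself relies on scikit-learn's `f1_macro` and `f1_weighted` scorers.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using per-class true/false positives and false negatives."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def f1_weighted(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by each class's true count."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Hypothetical, deliberately imbalanced toy labels (six positives, two negatives)
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]

print(f"macro F1:    {f1_macro(y_true, y_pred):.4f}")
print(f"weighted F1: {f1_weighted(y_true, y_pred):.4f}")
```

On imbalanced labels like these, the two averages diverge because the minority class counts equally in the macro average. The training target here is nearly balanced (4378 of 8693 transported), so the two F1 variants should stay close to each other.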

In [5]:
%%time

# Define short descriptions
model_descriptions = {
    "Decision Tree": "A simple, interpretable tree that splits data based on feature thresholds.",
    "Random Forest": "An ensemble of decision trees that improves generalization via bagging.",
    "Gradient Boosting": "A sequential ensemble where each tree corrects errors from the last.",
    "AdaBoost": "A boosting method that emphasizes misclassified examples during training.",
    "Logistic Regression": "A linear model that predicts probabilities for classification tasks.",
    "SVM": "A margin-based classifier that finds the optimal boundary between classes using support vectors.",
    "XGBoost": "A scalable, regularized boosting method with tree-based learners.",
    "LightGBM": "A fast, efficient gradient boosting method based on histogram-based learning.",
    "CatBoost": "A gradient boosting library with native support for categorical features."
}

# Define model types
model_types = {
    "Decision Tree": "Tree-Based",
    "Random Forest": "Tree-Based",
    "Gradient Boosting": "Tree-Based",
    "AdaBoost": "Tree-Based",
    "XGBoost": "Tree-Based",
    "CatBoost": "Tree-Based",
    "Logistic Regression": "Non-Tree",
    "SVM": "Non-Tree"
}

# Define models
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=5000, random_state=0),
    "SVM": SVC(kernel="rbf", C=1.0, probability=True, random_state=0),
    "XGBoost": XGBClassifier(eval_metric='logloss', random_state=0),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=0)
}

separator = "-" * 60
results = {}

print("\n📊 Starting model evaluation with 5-fold cross-validation\n")

for name, model in models.items():
    print(separator)
    print(f"🔍 Model: {name}")
    print(f"🧠 Description: {model_descriptions.get(name, 'N/A')}")
    print(f"⚙️  Params: {model.get_params()}")

    cv_result = cross_validate(
        model, X_train, y_train,
        cv=5,
        scoring=["f1_macro", "f1_weighted", "accuracy"],
        return_train_score=True,
        n_jobs=-1
    )

    # Calculate means and standard deviations
    results[name] = {
        "Model Type": model_types[name],
        "Train F1 (Macro)": np.mean(cv_result["train_f1_macro"]),
        "Train F1 Std": np.std(cv_result["train_f1_macro"]),
        "CV F1 (Macro)": np.mean(cv_result["test_f1_macro"]),
        "CV F1 Std": np.std(cv_result["test_f1_macro"]),
        "CV F1 (Weighted)": np.mean(cv_result["test_f1_weighted"]),
        "CV F1 (W) Std": np.std(cv_result["test_f1_weighted"]),
        "CV Accuracy": np.mean(cv_result["test_accuracy"]),
        "CV Accuracy Std": np.std(cv_result["test_accuracy"])
    }

    # Print results with standard deviations
    print(f"✅ Train F1 (Macro): {results[name]['Train F1 (Macro)']:.4f} (±{results[name]['Train F1 Std']:.4f})")
    print(f"✅ CV F1 (Macro): {results[name]['CV F1 (Macro)']:.4f} (±{results[name]['CV F1 Std']:.4f})")
    print(f"✅ CV F1 (Weighted): {results[name]['CV F1 (Weighted)']:.4f} (±{results[name]['CV F1 (W) Std']:.4f})")
    print(f"✅ CV Accuracy: {results[name]['CV Accuracy']:.4f} (±{results[name]['CV Accuracy Std']:.4f})")

# Final summary
print("\n✅ All models evaluated. Summary below:\n")

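# from_dict(orient="index") keeps the metric columns numeric, so .round(4) below takes effect
summary_df = pd.DataFrame.from_dict(results, orient="index").sort_values("CV Accuracy", ascending=False)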
display_columns = ["Model Type", "CV Accuracy", "CV Accuracy Std", "CV F1 (Macro)", "CV F1 (Weighted)"]
display(summary_df[display_columns].round(4))



📊 Starting model evaluation with 5-fold cross-validation

------------------------------------------------------------
🔍 Model: Decision Tree
🧠 Description: A simple, interpretable tree that splits data based on feature thresholds.
⚙️  Params: {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': 0, 'splitter': 'best'}
✅ Train F1 (Macro): 0.7647 (±0.0039)
✅ CV F1 (Macro): 0.7600 (±0.0175)
✅ CV F1 (Weighted): 0.7602 (±0.0175)
✅ CV Accuracy: 0.7622 (±0.0166)
------------------------------------------------------------
🔍 Model: Random Forest
🧠 Description: An ensemble of decision trees that improves generalization via bagging.
⚙️  Params: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_no

| Model | Model Type | CV Accuracy | CV Accuracy Std | CV F1 (Macro) | CV F1 (Weighted) |
|---|---|---|---|---|---|
| CatBoost | Tree-Based | 0.800761 | 0.007552 | 0.800015 | 0.800085 |
| Gradient Boosting | Tree-Based | 0.795356 | 0.012662 | 0.794396 | 0.79449 |
| XGBoost | Tree-Based | 0.791904 | 0.009794 | 0.79118 | 0.791243 |
| SVM | Non-Tree | 0.787878 | 0.012199 | 0.785775 | 0.785923 |
| Random Forest | Tree-Based | 0.786727 | 0.009458 | 0.78613 | 0.786183 |
| Logistic Regression | Non-Tree | 0.78454 | 0.006646 | 0.784172 | 0.784225 |
| AdaBoost | Tree-Based | 0.771428 | 0.009803 | 0.771305 | 0.771296 |
| Decision Tree | Tree-Based | 0.762226 | 0.016611 | 0.760031 | 0.760193 |


CPU times: user 74.6 ms, sys: 100 ms, total: 175 ms
Wall time: 9.95 s
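
A note on the "±" values reported above: `np.std` defaults to the population standard deviation (`ddof=0`). With only five folds, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below illustrates the difference using the standard library; the fold scores are hypothetical and `summarize_folds` is an illustrative helper, not part of the notebook.

```python
import math
import statistics


def summarize_folds(scores):
    """Return (mean, population std, sample std) for a list of per-fold scores."""
    mean = statistics.fmean(scores)
    pop_std = statistics.pstdev(scores)    # same definition as np.std(scores), ddof=0
    sample_std = statistics.stdev(scores)  # same definition as np.std(scores, ddof=1)
    return mean, pop_std, sample_std


# Hypothetical accuracy scores from five folds
fold_scores = [0.79, 0.80, 0.81, 0.80, 0.80]

mean, pop_std, sample_std = summarize_folds(fold_scores)
print(f"mean={mean:.4f}  population std={pop_std:.4f}  sample std={sample_std:.4f}")

# For n folds the two differ by a factor of sqrt(n / (n - 1)), about 1.118 for n = 5
print(f"ratio={sample_std / pop_std:.3f}")
```

Either convention is fine for ranking models, as long as the same one is used for every model being compared.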


### Summary of Baseline Model Results

After engineering 17 features from the raw spaceship data and keeping the 6 selected ones, we evaluated eight machine learning algorithms, from a single shallow decision tree to boosted ensembles such as CatBoost. Each model was scored with 5-fold cross-validation.

**Key Findings:**
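- **CatBoost ranked first**, achieving 80.08% accuracy with a standard deviation of 0.76% across folds. Its lead over Gradient Boosting is 0.54 percentage points, which is smaller than the fold-to-fold spread, so the ordering of the top models is not decisive
- **Boosted tree ensembles took the top three positions**, while the remaining tree-based models were mixed: Random Forest placed fifth, and AdaBoost and the single Decision Tree placed last
- **Gradient Boosting and XGBoost followed closely** at 79.54% and 79.19% respectively
- **Non-tree models showed competitive performance**, with SVM reaching 78.79% accuracy (fourth place) and Logistic Regression reaching 78.45%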

**Performance Insights:**
- **Top 3 models all exceeded 79% accuracy**, and seven of the eight models fall within about three percentage points of each other (77.1% to 80.1%)
- **CatBoost's low variance** (0.76% std) suggests consistent performance across data splits; only Logistic Regression was more stable (0.66% std)
- **A gap of roughly four percentage points** separates the best model (CatBoost, 80.1%) from the depth-3 Decision Tree (76.2%)

**Strategic Implications:**
The baseline results are consistent with our feature engineering approach and identify **CatBoost, Gradient Boosting, and XGBoost** as the prime candidates for hyperparameter optimization. With the best baseline (CatBoost) at 80.08% accuracy, the next step is model tuning to push performance higher.
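
As a rough check on how separated the top three models really are, the sketch below compares CatBoost's lead in mean CV accuracy with the fold-to-fold standard deviation, using the numbers from the summary table. The `lead_in_stds` helper is hypothetical, and the ratio is only a heuristic, not a formal significance test.

```python
# (mean CV accuracy, CV accuracy std) copied from the summary table above
cv_results = {
    "CatBoost": (0.800761, 0.007552),
    "Gradient Boosting": (0.795356, 0.012662),
    "XGBoost": (0.791904, 0.009794),
}


def lead_in_stds(leader, challenger, table):
    """Return the leader's accuracy lead and that lead divided by the larger std."""
    lead = table[leader][0] - table[challenger][0]
    spread = max(table[leader][1], table[challenger][1])
    return lead, lead / spread


for name in ("Gradient Boosting", "XGBoost"):
    lead, ratio = lead_in_stds("CatBoost", name, cv_results)
    print(f"CatBoost vs {name}: lead={lead:.4f}, lead/std={ratio:.2f}")
```

A ratio below 1 means the lead is smaller than the variation between folds, which is a reason to carry all three candidates into tuning rather than committing to one model at this stage.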

In [10]:
# Models to test
submission_models = {
    "CatBoost": CatBoostClassifier(verbose=0, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    # Same settings as the baseline run; the deprecated use_label_encoder argument is dropped
    "XGBoost": XGBClassifier(eval_metric='logloss', random_state=0)
}

# Prepare test data
X_test_kaggle = pd.read_csv('../data/processed/test_features_engineered.csv')
passenger_ids = X_test_kaggle["PassengerId"]
X_test_kaggle = X_test_kaggle[best_features]
# X_test_kaggle = X_test_kaggle.drop("PassengerId", axis=1)

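# Make sure the output folder exists before writing any files
import os
os.makedirs("submissions", exist_ok=True)

# Generate and save predictions for each model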
for name, model in submission_models.items():
    print(f"\n🚀 Training and predicting with: {name}")
    model.fit(X_train, y_train)
    preds = model.predict(X_test_kaggle)

    submission = pd.DataFrame({
        "PassengerId": passenger_ids,
        "Transported": preds.astype(bool)
    })

    submission.to_csv(f"submissions/submission_best_feats_{name.lower()}.csv", index=False)
    print(f"✅ Submission file created: submission_best_feats_{name.lower()}.csv")


🚀 Training and predicting with: CatBoost
✅ Submission file created: submission_best_feats_catboost.csv

🚀 Training and predicting with: GradientBoosting
✅ Submission file created: submission_best_feats_gradientboosting.csv

🚀 Training and predicting with: XGBoost
✅ Submission file created: submission_best_feats_xgboost.csv
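
Before uploading, it can help to confirm that each file has the layout written by the cell above: a `PassengerId` column, a `Transported` column containing `True`/`False`, and one row per test passenger in the original order. The following is a minimal sketch using only the standard library; `check_submission` is a hypothetical helper, and the IDs in the demo are made up.

```python
import csv
import os
import tempfile


def check_submission(path, expected_ids):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["PassengerId", "Transported"]:
            return [f"unexpected columns: {reader.fieldnames}"]
        rows = list(reader)
    ids = [row["PassengerId"] for row in rows]
    if ids != list(expected_ids):
        problems.append("PassengerId values do not match the expected IDs and order")
    bad_values = {row["Transported"] for row in rows} - {"True", "False"}
    if bad_values:
        problems.append(f"unexpected Transported values: {sorted(bad_values)}")
    return problems


# Demo on a temporary file with made-up passenger IDs
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("PassengerId,Transported\nA_01,True\nB_01,False\n")
    demo_path = tmp.name

print(check_submission(demo_path, ["A_01", "B_01"]))  # no problems expected
print(check_submission(demo_path, ["B_01", "A_01"]))  # order mismatch is reported
os.remove(demo_path)
```

In the notebook, the expected IDs would come from `passenger_ids`, for example `check_submission("submissions/submission_best_feats_catboost.csv", passenger_ids.astype(str))`.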
