### A definition of the machine learning problem you are working on: the input and the target. 

This project focuses on predicting whether NBA players will be selected to an All-NBA team based on their season statistics. The dataset includes one row per player-season, with a variety of performance metrics (points, rebounds, assists, etc.), and a binary target variable `is_allnba` indicating whether the player made an All-NBA team. The goal is to develop machine learning models that can generalize well and identify future All-NBA players, even in the presence of severe class imbalance—only a small percentage of players make these teams each year.

After loading and inspecting the dataset, the target variable is created by checking if a player was assigned to any All-NBA team. Categorical features such as player position are cleaned and mapped to standardized roles (G, F, C). Irrelevant columns like player name, team, and year are removed to prevent data leakage. The remaining features are separated into numerical and categorical types to support the preprocessing pipeline.

The data is split into training and test sets with stratification to preserve the original class distribution. Preprocessing includes imputing missing values, scaling numerical features, and one-hot encoding categorical ones. Three models are trained: logistic regression, random forest, and XGBoost. Each model is placed into a pipeline that includes SMOTE, an oversampling method used to balance the classes. Hyperparameters are optimized using grid search with 3-fold stratified cross-validation, and ROC-AUC is used as the scoring metric to ensure the models remain sensitive to the minority class.

Model performance is evaluated using ROC-AUC, precision, recall, and F1-score on the held-out test set. While all metrics are considered, recall is especially important for this task. Since only a handful of players make the All-NBA team, the project prioritizes catching as many of these true positive cases as possible. High recall ensures that the model successfully identifies most deserving players, even if it makes a few incorrect guesses.

The project concludes with a comparison of the models’ performance and, optionally, a feature importance analysis using the random forest model. This helps provide insight into which stats—such as points, win shares, or efficiency ratings—are most predictive of All-NBA selection. Overall, the project reflects a practical, metrics-aware approach to dealing with imbalanced classification problems in sports analytics.


In [1]:
# === Prerequisites ===
# Make sure you have these libraries installed in your notebook environment:
# !pip install imbalanced-learn xgboost scikit-learn pandas numpy

import pandas as pd
import numpy as np

# Preprocessing & modeling imports
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, classification_report

# For handling class imbalance
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

# --- 1. Load dataset ---
print("Loading data...")
df = pd.read_csv('data_used_for_models.csv')
print(f"Loaded stats: {df.shape[0]} rows × {df.shape[1]} cols")

# --- 2. Create binary target and basic cleaning ---
# Convert team column to binary target (1 if player made All-NBA, 0 otherwise)
if 'nba_team' in df.columns:
    # For compatibility with original data
    df['is_allnba'] = df['nba_team'].apply(lambda x: 0 if pd.isna(x) or x == 'None' else 1)
elif 'all_nba_team' in df.columns:
    # For compatibility with the uploaded notebook data
    df['is_allnba'] = df['all_nba_team'].apply(lambda x: 0 if x == 0 else 1)
else:
    # Try to infer from data
    potential_columns = [col for col in df.columns if 'nba' in col.lower() or 'team' in col.lower()]
    if potential_columns:
        print(f"Inferring 'is_allnba' from column: {potential_columns[0]}")
        df['is_allnba'] = df[potential_columns[0]].apply(lambda x: 0 if pd.isna(x) or x == 0 or x == 'None' else 1)
    else:
        raise ValueError("Cannot find NBA team column to create target variable")

print("Target distribution:")
print(df['is_allnba'].value_counts())
print(f"Positive class percentage: {df['is_allnba'].mean()*100:.2f}%")

# --- 3. Basic cleaning & feature preparation ---
# Handle categorical features like Position
if 'Position' in df.columns:
    # Normalize Position to G/F/C
    pos_map = {
        'G':'G','PG':'G','SG':'G','G-F':'G',
        'F':'F','SF':'F','PF':'F','F-G':'F','F-C':'F',
        'C':'C','C-F':'C'
    }
    df['Position'] = df['Position'].map(pos_map).fillna('G')  # Default to Guard if unknown

# Drop identifiers and other non-feature columns
drop_cols = ['index', 'Year', 'Player', 'Tm', 'is_allnba']
if 'nba_team' in df.columns:
    drop_cols.append('nba_team')
if 'all_nba_team' in df.columns:
    drop_cols.append('all_nba_team')

feature_cols = [col for col in df.columns if col not in drop_cols]

# Identify numeric and categorical features
cat_cols = df[feature_cols].select_dtypes(include=['object']).columns.tolist()
num_cols = df[feature_cols].select_dtypes(include=['number']).columns.tolist()

print(f"Using {len(num_cols)} numeric features and {len(cat_cols)} categorical features")

# --- 4. Train/test split ---
X = df[num_cols + cat_cols]
y = df['is_allnba']

# Ensure enough samples in each class with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print(f"Training set class distribution:\n{y_train.value_counts()}")
print(f"Test set class distribution:\n{y_test.value_counts()}")

# --- 5. Create preprocessing pipeline ---
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')), 
        ('scaler', StandardScaler())
    ]), num_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), 
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), cat_cols)
], remainder='drop')

# --- 6. Create models with hyperparameter search ---
# Use 3-fold CV to avoid issues with class imbalance
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Define models
models = {
    'LogisticRegression': {
        'pipe': ImbPipeline([
            ('pre', preprocessor),
            ('sampler', SMOTE(random_state=42)),
            ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
        ]),
        'params': {
            'clf__C': [0.01, 0.1, 1.0, 10.0],
            'clf__solver': ['liblinear', 'saga']
        }
    },
    'RandomForest': {
        'pipe': ImbPipeline([
            ('pre', preprocessor),
            ('sampler', SMOTE(random_state=42)),
            ('clf', RandomForestClassifier(random_state=42, class_weight='balanced'))
        ]),
        'params': {
            'clf__n_estimators': [100, 200],
            'clf__max_depth': [None, 10, 20],
            'clf__min_samples_split': [2, 5]
        }
    },
    'XGBoost': {
        'pipe': ImbPipeline([
            ('pre', preprocessor),
            ('sampler', SMOTE(random_state=42)),
            ('clf', XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'))
        ]),
        'params': {
            'clf__n_estimators': [100, 200],
            'clf__max_depth': [3, 5, 7],
            'clf__learning_rate': [0.01, 0.1]
        }
    }
}

# --- 7. Train models with hyperparameter search ---
best_models = {}

for name, model_info in models.items():
    print(f"\n{'-'*50}")
    print(f"Training {name}...")
    
    grid = GridSearchCV(
        model_info['pipe'],
        model_info['params'],
        cv=cv,
        scoring='roc_auc',
        n_jobs=-1,
        verbose=1
    )
    
    grid.fit(X_train, y_train)
    
    print(f"Best parameters: {grid.best_params_}")
    print(f"Best CV score: {grid.best_score_:.4f}")
    
    best_models[name] = grid.best_estimator_

# --- 8. Evaluate models on test set ---
results = pd.DataFrame(columns=['Model', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score'])

print("\n" + "="*60)
print("TEST SET EVALUATION")
print("="*60)

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    roc_auc = roc_auc_score(y_test, y_prob)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Add to results dataframe
    results = pd.concat([results, pd.DataFrame({
        'Model': [name],
        'ROC-AUC': [roc_auc],
        'Precision': [precision],
        'Recall': [recall],
        'F1-Score': [f1]
    })], ignore_index=True)
    
    print(f"\n--- {name} Results ---")
    print(f"ROC-AUC: {roc_auc:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

# Show final comparison table
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
print(results.sort_values('ROC-AUC', ascending=False).reset_index(drop=True))

# --- 9. (Optional) Feature importance from best model ---
if 'RandomForest' in best_models:
    print("\n" + "="*60)
    print("FEATURE IMPORTANCE")
    print("="*60)
    
    rf_model = best_models['RandomForest'].named_steps['clf']
    preprocessor = best_models['RandomForest'].named_steps['pre']
    
    # Get feature names
    feature_names = preprocessor.get_feature_names_out()
    
    # Get feature importances
    importances = rf_model.feature_importances_
    
    # Sort by importance
    indices = np.argsort(importances)[::-1]
    
    print("Top 15 most important features:")
    for i in range(min(15, len(feature_names))):
        print(f"{feature_names[indices[i]]}: {importances[indices[i]]:.4f}")

Loading data...
Loaded stats: 9663 rows × 37 cols
Target distribution:
is_allnba
0    9398
1     265
Name: count, dtype: int64
Positive class percentage: 2.74%
Using 31 numeric features and 1 categorical features
Training set size: 7730 samples
Test set size: 1933 samples
Training set class distribution:
is_allnba
0    7518
1     212
Name: count, dtype: int64
Test set class distribution:
is_allnba
0    1880
1      53
Name: count, dtype: int64

--------------------------------------------------
Training LogisticRegression...
Fitting 3 folds for each of 8 candidates, totalling 24 fits


TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.


### For each algorithm explain which hyper-parameters you worked with and how you picked them. Bonus points if you apply the xgboost library.

This project applied three different algorithms—logistic regression, random forest, and XGBoost(and the library for extra points)—to predict whether an NBA player would be selected to an All-NBA team based on season statistics. All models were implemented using scikit-learn and imbalanced-learn, with XGBoost handled using its dedicated Python API.

Each model was placed into a pipeline that included preprocessing steps and a SMOTE sampler to address the significant class imbalance. For logistic regression, the hyperparameters tuned were the inverse regularization strength `C` and the solver (`liblinear` and `saga`). These were selected because they control the model’s flexibility and optimization strategy. GridSearchCV with 3-fold stratified cross-validation was used to search for the combination that gave the best ROC-AUC score.

For the random forest model, key hyperparameters included the number of estimators (`n_estimators`), maximum depth of each tree (`max_depth`), and the minimum number of samples required to split a node (`min_samples_split`). These settings influence both model complexity and overfitting. A grid search was again performed to identify the best configuration, balancing depth and generalization.

The third algorithm used was XGBoost, which was also the most flexible and powerful among the three. The main hyperparameters tuned included the number of boosting rounds (`n_estimators`), maximum tree depth (`max_depth`), and learning rate. These parameters govern how fast and deep the boosting process goes. Grid search was used here as well, and the model was evaluated using ROC-AUC to ensure good separation of the minority (All-NBA) class.

All models were evaluated on a held-out test set using metrics such as precision, recall, F1-score, and ROC-AUC, with particular attention paid to recall given the importance of correctly identifying actual All-NBA players. This setup not only satisfies the assignment requirement of applying three distinct algorithms but also includes bonus implementation of XGBoost with appropriate tuning and evaluation.


### A summary of your observations, and a separate optional section describing what you think may be a non-standard/novel thing you did in your experiments.

##### Summary of Observations:
After training and evaluating logistic regression, random forest, and XGBoost models on the All-NBA dataset, several patterns emerged. The most notable was that all models achieved very strong ROC-AUC scores (above 0.98), suggesting they were effective at distinguishing between All-NBA and non-All-NBA players. However, there were clear differences in how each model balanced precision and recall. Logistic regression had the highest recall at 94.3%, correctly identifying nearly all actual All-NBA players, but at the cost of precision—it incorrectly labeled many non-All-NBA players as positive. XGBoost offered the best balance overall, with a strong recall of 79.2% and a significantly higher precision of 55.3%, leading to the highest F1-score. Random forest trailed slightly behind XGBoost but still performed well in both recall and precision.


The class imbalance (only 2.74% positive class) posed a major challenge. This was effectively handled using SMOTE within each model’s pipeline, which helped ensure that the models could actually learn from the limited All-NBA examples. Without this, the models would have likely defaulted to always predicting the negative class. The results highlight how critical it is to tune for recall in a task like this, where failing to recognize deserving All-NBA players would undermine the usefulness of the model.


#### Optional: Non-standard/Novel Elements:
A few elements of this project go slightly beyond a standard application. Most notably, SMOTE was integrated directly within each model’s cross-validation pipeline, ensuring oversampling was applied only during training folds and not leaked into validation, which could otherwise inflate performance. Another thoughtful detail was the normalization of positional data to core roles (G/F/C), which reduced noise during one-hot encoding and likely helped model consistency. Additionally, the feature importance analysis from the random forest model provided valuable interpretability. It revealed that advanced metrics like PER, VORP, WS/48, and BPM were among the most predictive of All-NBA selection, confirming that efficiency and overall impact metrics matter more than just raw box score stats. This kind of insight, derived from the model itself, adds meaningful depth to the analysis.

