# Stage 5 – Baseline Ridge Regression

**Goal:** Train a simple, regularized linear model on historical seasons and quantify how much its error grows on more recent seasons.

**Plan:**
1. Load and clean the player-season dataset built earlier.
2. Split seasons chronologically so training stops at 2021, validation is 2022, and tests are 2023–2024.
3. Fit a Ridge regression baseline with time-aware cross-validation to choose the regularization strength.
4. Report MAE, RMSE, and R² for train, 2022 validation, and the two post-2022 seasons.
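
The split and alpha search in steps 2 and 3 can be sketched as follows. This is a minimal illustration on synthetic data, not the code in `src.data_prep` or `src.models`: the toy frame, the feature names `targets` and `carries`, and the use of `GridSearchCV` with `TimeSeriesSplit` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player-season table (hypothetical columns).
rng = np.random.default_rng(0)
seasons = np.repeat(np.arange(2015, 2025), 40)
toy = pd.DataFrame({
    "season": seasons,
    "targets": rng.normal(80, 20, seasons.size),
    "carries": rng.normal(50, 15, seasons.size),
})
toy["ppr_points"] = (
    1.5 * toy["targets"] + 0.8 * toy["carries"] + rng.normal(0, 10, seasons.size)
)

# Chronological split: train through 2021, validate on 2022, test on 2023 and 2024.
train = toy[toy["season"] <= 2021].sort_values("season", kind="stable")
val = toy[toy["season"] == 2022]
test_2023 = toy[toy["season"] == 2023]
test_2024 = toy[toy["season"] == 2024]

features = ["targets", "carries"]

# TimeSeriesSplit always scores on rows that come after the rows it trained on.
# Folds follow row order, so they only approximate season boundaries.
search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": [0.1, 0.3, 1.0, 3.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
search.fit(train[features], train["ppr_points"])
best_alpha = search.best_params_["ridge__alpha"]
print(f"Best alpha: {best_alpha}")
```

Sorting the training rows by season before fitting matters here, because `TimeSeriesSplit` relies on row order rather than on the `season` column itself.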


In [35]:
from pathlib import Path

import pandas as pd

from src.data_prep import clean_data, load_data, train_val_test_split
from src.models import train_ridge
from src.features import build_feature_matrix
from src.evaluation import evaluate_split


In [36]:
DATA_PATH = Path("../data/raw/player_season_2015_2024.csv")

if not DATA_PATH.exists():
    raise FileNotFoundError(
        "Expected the canonical player-season file at ../data/raw/player_season_2015_2024.csv. "
        "Run python -m src.build_player_season_dataset first."
    )

raw_df = load_data(DATA_PATH)
print(f"Loaded {len(raw_df):,} rows from {DATA_PATH}")

df = clean_data(raw_df)
print(f"Remaining after cleaning: {len(df):,}")

train_df, val_df, test_2023_df, test_2024_df = train_val_test_split(df)
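# Report the seasons actually present in each split instead of hardcoding them
test_seasons = sorted(
    pd.concat([test_2023_df["season"], test_2024_df["season"]]).unique().tolist()
)
print(
    f"Train seasons {train_df['season'].min()}–{train_df['season'].max()}, "
    f"validation {val_df['season'].unique().tolist()}, "
    f"tests {test_seasons}"
)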


Loaded 6,054 rows from ../data/raw/player_season_2015_2024.csv
Remaining after cleaning: 4,207
Train rows: 2880
Validation rows: 457
Test 2023 rows: 436
Test 2024 rows: 434
Train seasons 2015–2021, validation [2022], tests [2023, 2024]


In [37]:
ALPHA_GRID = [0.1, 0.3, 1.0, 3.0, 10.0]
IDENTIFIER_COLUMNS = ["player_id", "player_name", "team"]
TARGET_COLUMN = "ppr_points"
POSITION_COLUMN = "position"

# Clean target and position columns of any NaN/null/missing values before training
print("Cleaning target and position columns...")
print(f"Train: {len(train_df)} rows before cleaning")
print(f"Val: {len(val_df)} rows before cleaning")

MISSING_STRINGS = ['NaN', 'nan', 'NULL', 'null', 'None', 'N/A', 'n/a', '']

def clean_missing_values(df, columns):
    """Drop rows with missing values (NaN, None, or string placeholders) in `columns`.

    Only the listed columns are checked; NaNs in other feature columns are left as-is.
    """
    df_clean = df.copy()
    present = [col for col in columns if col in df_clean.columns]

    for col in present:
        if df_clean[col].dtype == 'object':
            # String/object columns (like position): map placeholder strings to NA
            df_clean[col] = df_clean[col].replace(MISSING_STRINGS, pd.NA)
        # Numeric columns (like the target) already store missing values as NaN,
        # so dropna below handles them without further conversion

    # Drop rows with any missing values in the specified columns that exist
    return df_clean.dropna(subset=present)

# Clean both dataframes
train_df_clean = clean_missing_values(train_df, [TARGET_COLUMN, POSITION_COLUMN])
val_df_clean = clean_missing_values(val_df, [TARGET_COLUMN, POSITION_COLUMN])

print(f"Train: {len(train_df_clean)} rows after cleaning (dropped {len(train_df) - len(train_df_clean)})")
print(f"Val: {len(val_df_clean)} rows after cleaning (dropped {len(val_df) - len(val_df_clean)})")

# Verify no missing values remain (NaN, None, or string "NaN")
for df_name, df_clean in [("train_df", train_df_clean), ("val_df", val_df_clean)]:
    # Check for NaN/NA values
    assert df_clean[TARGET_COLUMN].notna().all(), f"Target column still contains missing values in {df_name}"
    assert df_clean[POSITION_COLUMN].notna().all(), f"Position column still contains missing values in {df_name}"
    # Check that no string "NaN" values remain (should be caught by notna(), but double-check)
    if df_clean[TARGET_COLUMN].dtype == 'object':
        assert not df_clean[TARGET_COLUMN].isin(['NaN', 'nan', 'NULL', 'null', 'None', 'N/A', 'n/a', '']).any(), \
            f"Target column still contains string missing values in {df_name}"
    if df_clean[POSITION_COLUMN].dtype == 'object':
        assert not df_clean[POSITION_COLUMN].isin(['NaN', 'nan', 'NULL', 'null', 'None', 'N/A', 'n/a', '']).any(), \
            f"Position column still contains string missing values in {df_name}"

print("✓ All target and position columns are clean (no NaN, None, or string 'NaN' values)")

baseline_result = train_ridge(
    train_df=train_df_clean,
    val_df=val_df_clean,
    alpha_grid=ALPHA_GRID,
    target_column=TARGET_COLUMN,
    position_column=POSITION_COLUMN,
    drop_columns=IDENTIFIER_COLUMNS,
    scale_numeric=True,
)

print(f"Best alpha from RidgeCV: {baseline_result['alpha']:.3f}")
print(
    f"Train RMSE={baseline_result['train_metrics'].rmse:.2f}, "
    f"Val RMSE={baseline_result['val_metrics'].rmse:.2f}"
)


Cleaning target and position columns...
Train: 2880 rows before cleaning
Val: 457 rows before cleaning
Train: 2880 rows after cleaning (dropped 0)
Val: 457 rows after cleaning (dropped 0)
✓ All target and position columns are clean (no NaN, None, or string 'NaN' values)


  updated_mean = (last_sum + new_sum) / updated_sample_count
  T = new_sum / new_sample_count
  new_unnormalized_variance -= correction**2 / new_sample_count


ValueError: 
All the 25 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/adhirajsen/Documents/Documents - Adhiraj’s MacBook Pro/UTA/Semesters/2. Fall2025/CSE6363/FinalProject/cse6363_finalProj/.venv/lib/python3.13/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/adhirajsen/Documents/Documents - Adhiraj’s MacBook Pro/UTA/Semesters/2. Fall2025/CSE6363/FinalProject/cse6363_finalProj/.venv/lib/python3.13/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/adhirajsen/Documents/Documents - Adhiraj’s MacBook Pro/UTA/Semesters/2. Fall2025/CSE6363/FinalProject/cse6363_finalProj/.venv/lib/python3.13/site-packages/sklearn/linear_model/_ridge.py", line 1238, in fit
    X, y = validate_data(
           ~~~~~~~~~~~~~^
        self,
        ^^^^^
    ...<6 lines>...
        y_numeric=True,
        ^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/adhirajsen/Documents/Documents - Adhiraj’s MacBook Pro/UTA/Semesters/2. Fall2025/CSE6363/FinalProject/cse6363_finalProj/.venv/lib/python3.13/site-packages/sklearn/utils/validation.py", line 2971, in validate_data
    X, y = check_X_y(X, y, **check_params)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/adhirajsen/Documents/Documents - Adhiraj’s MacBook Pro/UTA/Semesters/2. Fall2025/CSE6363/FinalProject/cse6363_finalProj/.venv/lib/python3.13/site-packages/sklearn/utils/validation.py", line 1368, in check_X_y
    X = check_array(
        X,
    ...<12 lines>...
        input_name="X",
    )
  File "/Users/adhirajsen/Documents/Documents - Adhiraj’s MacBook Pro/UTA/Semesters/2. Fall2025/CSE6363/FinalProject/cse6363_finalProj/.venv/lib/python3.13/site-packages/sklearn/utils/validation.py", line 1105, in check_array
    _assert_all_finite(
    ~~~~~~~~~~~~~~~~~~^
        array,
        ^^^^^^
    ...<2 lines>...
        allow_nan=ensure_all_finite == "allow-nan",
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/adhirajsen/Documents/Documents - Adhiraj’s MacBook Pro/UTA/Semesters/2. Fall2025/CSE6363/FinalProject/cse6363_finalProj/.venv/lib/python3.13/site-packages/sklearn/utils/validation.py", line 120, in _assert_all_finite
    _assert_all_finite_element_wise(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        X,
        ^^
    ...<4 lines>...
        input_name=input_name,
        ^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/adhirajsen/Documents/Documents - Adhiraj’s MacBook Pro/UTA/Semesters/2. Fall2025/CSE6363/FinalProject/cse6363_finalProj/.venv/lib/python3.13/site-packages/sklearn/utils/validation.py", line 169, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
Ridge does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
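
The traceback above points at NaN in the feature matrix `X`, which the target and position cleaning does not cover. One way to locate and handle those values is sketched below. It is an illustration on a toy frame with hypothetical feature names, and it assumes median imputation is acceptable; it does not reflect how `train_ridge` builds its pipeline internally.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with gaps in the feature columns (hypothetical names and values).
toy = pd.DataFrame({
    "targets": [90.0, np.nan, 70.0, 110.0, 65.0, 80.0],
    "carries": [10.0, 12.0, np.nan, 8.0, np.nan, 15.0],
    "ppr_points": [150.0, 120.0, 100.0, 190.0, 95.0, 130.0],
})
features = ["targets", "carries"]

# Step 1: find which feature columns contain NaN, and how many.
nan_counts = toy[features].isna().sum()
nan_counts = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(nan_counts)

# Step 2: impute inside the pipeline so the medians are learned from the
# training rows only and reused unchanged on validation and test rows.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    Ridge(alpha=1.0),
)
model.fit(toy[features], toy["ppr_points"])
preds = model.predict(toy[features])
```

Running the same `isna().sum()` check on the real training frame (minus the identifier, target, and position columns) would show which features need imputing or dropping before the Ridge fit can succeed.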


In [None]:
splits = {
    "Train": train_df,
    "2022 val": val_df,
    "2023 test": test_2023_df,
    "2024 test": test_2024_df,
}

rows = []
for split_name, frame in splits.items():
    X_split, y_split, _ = build_feature_matrix(
        frame.sort_values(["season", "player_id"]).reset_index(drop=True),
        target_column=TARGET_COLUMN,
        position_column=POSITION_COLUMN,
        drop_columns=IDENTIFIER_COLUMNS,
        preprocessor=baseline_result["preprocessor"],
        fit=False,
    )
    metrics = evaluate_split(baseline_result["model"], X_split, y_split)
    rows.append(
        {
            "Split": split_name,
            "MAE": metrics.mae,
            "RMSE": metrics.rmse,
            "R²": metrics.r2,
        }
    )

metrics_df = pd.DataFrame(rows).set_index("Split")
metrics_df


NameError: name 'baseline_result' is not defined

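The Ridge baseline is meant to show how a model calibrated on older seasons loses accuracy on recent ones: the MAE/RMSE gaps between 2022 and 2024 would quantify the concept-drift baseline that adaptive models or transfer learning will try to close in later stages. In this run, however, the Ridge fit failed because the feature matrix contains NaN, so `baseline_result` and the metrics table were never produced. The feature NaNs must be imputed or dropped before this conclusion can be checked.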
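
The MAE/RMSE gap relative to 2022 can be computed directly from a metrics table. The sketch below shows the arithmetic with placeholder predictions; the numbers are made up for illustration and are not results from this project, and `regression_metrics` is a hypothetical helper rather than the `evaluate_split` function used above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for one split."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R²": r2_score(y_true, y_pred),
    }


# Placeholder targets and predictions (illustrative only, not real results).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
preds = {
    "2022 val": y_true + np.array([5.0, -5.0, 5.0, -5.0]),
    "2023 test": y_true + np.array([10.0, -10.0, 10.0, -10.0]),
    "2024 test": y_true + np.array([20.0, -20.0, 20.0, -20.0]),
}

# One row per split, one column per metric.
table = pd.DataFrame(
    {name: regression_metrics(y_true, p) for name, p in preds.items()}
).T

# Drift gap: how much each split's error exceeds the 2022 validation error.
drift = table[["MAE", "RMSE"]] - table.loc["2022 val", ["MAE", "RMSE"]]
print(drift)
```

A positive entry in `drift` means the model is less accurate on that split than on 2022, which is the quantity later stages would aim to shrink.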
