# Adding INLA (R required)
https://claude.ai/chat/2ea12d21-5d52-4dc5-ae51-e05c1edf0e90



# Use Team PIE for team win rankings (possibly by player leading up to team pie for more detailed analysis)


| Model / source                                         | Primary Y variable(s)                                        | Task type                                         | Key notes / proof                                                                                              |
| ------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| **Zhao et al. (2023) – GCN**                           | **Game outcome (win/loss)**                                  | Binary classification                             | “Predicting the outcome of games… average success rate” → binary W/L. ([PMC][1])                               |
| **He & Choi (2025) – Stacked ensemble**                | **Game outcome (win/loss)**                                  | Binary classification                             | Paper explicitly “predict the outcomes of NBA games.” ([PMC][2], [Nature][3])                                  |
| **Osken & Onay (2022) – Player clusters + ANN**        | **Winning team (win/loss)**                                  | Binary classification                             | “We achieve… accuracy… predicting the winning team.” ([PubMed][4], [ScienceDirect][5])                         |
| **KNN baselines (surveyed)**                           | **Win/loss (sometimes score)**                               | Binary classification (and some score regression) | Reviews describe supervised setups for **win/loss**; some studies also model **scores**. ([PLOS][6], [PMC][7]) |
| **Chen et al. (2021) – Hybrid ensemble**               | **Final score (points)**                                     | Regression (team points; then derive winner)      | “Predicting the final score of NBA games.” ([MDPI][8])                                                         |
| **ESPN BPI**                                           | **Win probability** (per game) and **projected margin**      | Probabilistic classification + regression         | BPI “produces a team’s percentage to win any game” and projects **margin of victory**. ([ESPN.com][9])         |
| **In-game WP models (e.g., ESPN; Inpredictable eval)** | **Instantaneous win probability** (scored vs. final outcome) | Probabilistic time-series                         | Forecasts are calibrated so that end-of-game **outcome (0/1)** is the scoring label. ([inpredictable][10])     |
Where PIE fits (as Y or as a latent target)

What PIE is: a share-of-events metric—“what % of game events a player/team contributed,” with a published formula. 
NBAstuffer

Team PIE ↔ winning: NBA’s own FAQ says team PIE correlates strongly with win% (R² ≈ 0.908) and “>50% is likely a winning team.” 
NBA

Should you make PIE the Y?

For game prediction, the literature overwhelmingly uses win/loss, win probability, margin, or score as Y (table above). PIE is not commonly used as the final target.

PIE can be powerful as a latent/intermediate target that feeds your final Y:

Two practical designs

Two-stage model (recommended):

Stage A (hierarchical): predict team PIE from player-level PIE contributions with partial pooling across players/lineups/opponents (Bayesian hierarchical regression).

Stage B (link): map predicted team PIE (or PIE differential) to win probability via a learned monotonic link (logistic or isotonic).

Why: preserves interpretability of player contributions while ending with a decision-ready probability.

Direct model with PIE as feature:

Predict win/loss or margin directly, but include current/expected team PIE (built up from player PIE priors and injury/rotation context) as a top feature.

When to favor player-to-team PIE build-up

Early season / new lineups / injuries, where you need shrinkage and principled uncertainty across players → hierarchical PIE stage stabilizes estimates before you predict W/L.

Quick guidance for your experiment

If your goal is game outcomes: set Y = win probability (or win/loss) and treat player PIE → team PIE as X (features) or as a latent stage feeding the classifier.

If your goal is explaining performance: you can set Y = team PIE and analyze how player PIE and context drive it, but you’ll still want a final mapping to W/L for evaluation against the field.




https://claude.ai/chat/2992e47f-1455-43c3-ad0c-e134826129f2


Thanks for sharing your detailed summary! I looked into the web to verify accuracy. Here's what I found—and what should *not* be included due to lack of evidence or contradictory findings.

---

### **What’s Fully Supported by Verified Sources**

* **Graph Convolutional Networks (GCNs) + Random Forest achieved \~71.54% accuracy** in NBA game outcome prediction (2012–2018 data) — supported by studies including Zhao et al. (2023) and others ([Nature][1]).

* **Osken & Onay’s ANN-based model** (using player clustering) reported \~76% accuracy across several NBA seasons ([Nature][1]).

* **Espn’s BPI calibration performance** is well documented: teams with 50–60% projected chance of winning won about 55.8% of those games, and BPI favorites have won \~75.6% of games ([ESPN.com][2]).

* **KNN performance**: around 60–61% accuracy on NBA game outcomes ([Nature][1]).

---

### **What Remains Unverified or Unsupported**

Claims such as:

* **“ESPN BPI overall accuracy = 68.9%”**
* **“BPI success rate 74.9% for teams >50% win probability in 2024–25”**
* **“Raw accuracy ranges 65–80% across 2024–25 models”**
* **“XGBoost + SHAP models reaching 78% accuracy”**
* **“Transformer‑based models, real-time calibration improvements of 15–20 percentage points,” or “Bayesian hierarchical EPAA frameworks” with superior uncertainty quantification**
* **“Commercial claims like Stats Insider or etc.”**

—have no corroborating evidence from available academic or public sources.

---

### **Clean, Accurate Table**

Below is an **easy‑to‑read table** using only verified information. I have left blank any entries where claims cannot be substantiated with citations.

| Model / Methodology                                                                                                                              | Accuracy / Performance              | Notes / Source                                                                                                |
| ------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| GCN + Random Forest (Zhao et al., 2023)                                                                                                          | **71.54 %**                         | GCN + RF fusion on NBA data (2012–2018) ([PLOS][3], [ResearchGate][4], [ESPN.com][2], [PLOS][5], [Nature][1]) |
| ANN with player clustering (Osken & Onay)                                                                                                        | **≈76 %**                           | Clustering + ANN across five NBA seasons ([ResearchGate][4], [Nature][1])                                     |
| ESPN BPI calibration (50–60% range)                                                                                                              | Teams won **55.8 %** of those games | BPI’s probability calibration ([ESPN.com][2])                                                                 |
| BPI favorites overall win rate                                                                                                                   | **≈75.6 %**                         | Historical BPI success rate ([ESPN.com][2])                                                                   |
| K-Nearest Neighbors (KNN)                                                                                                                        | **\~60.0–61 %**                     | KNN predictive accuracy in NBA outcomes ([Nature][1], [arXiv][6])                                             |
| Other claimed models (e.g., “ESPN BPI 68.9% accuracy”, “XGBoost + SHAP 78%”, “Transformer-based”, “real-timecalibration gains of 15–20pp”, etc.) | —                                   | **No supporting evidence found**                                                                              |

---

### **Summary**

* **Verified performances**: GCN + RF (\~71.5%), Osken & Onay (\~76%), ESPN BPI (calibration and win rates 55–75%), KNN (\~60%).
* **Unsubstantiated claims** must be omitted unless citations are provided.

If you'd like, I can help you draft an updated report or help locate sources for any of the unverified claims. Let me know!

[1]: https://www.nature.com/articles/s41598-025-13657-1?utm_source=chatgpt.com "Stacked ensemble model for NBA game outcome ..."
[2]: https://www.espn.com/blog/statsinfo/post/_/id/125994/bpi-and-strength-of-record-what-are-they-and-how-are-they-derived?utm_source=chatgpt.com "Analytics Basketball Power Index (BPI), Strength of Record ..."
[3]: https://journals.plos.org/plosone/article/file?id=10.1371%2Fjournal.pone.0326326&type=printable&utm_source=chatgpt.com "The application of artificial intelligence techniques in ..."
[4]: https://www.researchgate.net/publication/394518554_Stacked_ensemble_model_for_NBA_game_outcome_prediction_analysis?utm_source=chatgpt.com "(PDF) Stacked ensemble model for NBA game outcome ..."
[5]: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0307478&utm_source=chatgpt.com "Integration of machine learning XGBoost and SHAP models ..."
[6]: https://arxiv.org/html/2410.21484v1?utm_source=chatgpt.com "A Systematic Review of Machine Learning in Sports Betting"


In [None]:
%%writefile api/src/ml/config.py
# FILE: api/src/ml/config.py (CORRECTED AND COMPLETE)
from pathlib import Path
from typing import Any, Union, Optional, Literal, Tuple
from dataclasses import dataclass
import os
import json

# ── Project root discovery ─────────────────────────────────────────────────────
def find_project_root(name: str = "nba_player_valuation_system") -> Path:
    """Walk up from this file until a directory named `name` or containing .git is found."""
    try:
        p = Path(__file__).resolve()
    except NameError:
        p = Path.cwd()
    for parent in (p, *p.parents):
        if parent.name == name or (parent / ".git").is_dir():
            return parent
    return Path.cwd()

# Core static paths
PROJECT_ROOT: Path = Path("api")
DATA_DIR: Path = PROJECT_ROOT / "src" / "ml" / "data"
AIRFLOW_DATA_DIR: Path = PROJECT_ROOT / "src" / "airflow_project" / "data"
LOG_DIR: Path = PROJECT_ROOT / "src" / "logs"
ARTIFACTS_DIR: Path = DATA_DIR / "ml_artifacts"
REGISTRY_LOCAL_CACHE_DIR: Path = ARTIFACTS_DIR / "registry_cache"
FINAL_ENGINEERED_DATASET_DIR: Path = AIRFLOW_DATA_DIR / "merged_final_dataset"
FINAL_ML_DATASET_DIR: Path = DATA_DIR / "final_ml_dataset"
MODEL_STORE_DIR: Path = DATA_DIR / "model_store"
COLUMN_SCHEMA_PATH: Path = PROJECT_ROOT / "src" / "ml" / "column_schema.yaml"
FEATURE_STORE_DIR: Path = DATA_DIR / "feature_store"
FEATURE_SELECTION_DIR: Path = DATA_DIR / "feature_selection"
MAX_CONTRACT_VALUES_CSV: Path = AIRFLOW_DATA_DIR / "spotrac_contract_data" / "exported_csv" / "max_contract_values.csv"

# Environment and MLflow configuration
ML_ENV: str = os.getenv("ML_ENV", "dev")
_DEFAULT_MLFLOW_TRACKING_URI = (ARTIFACTS_DIR / "mlruns").resolve().as_uri()
_raw_mlflow_uri = os.getenv("MLFLOW_TRACKING_URI", _DEFAULT_MLFLOW_TRACKING_URI)

def _canonicalize_tracking_uri(uri: str) -> str:
    """Normalize a user-provided tracking URI into something MLflow accepts."""
    from urllib.parse import urlparse
    parsed = urlparse(uri)
    if parsed.scheme in ("http", "https"):
        return uri
    if parsed.scheme == "file":
        path_part = uri[len("file://"):]
        try:
            p = Path(path_part)
            return p.resolve().as_uri()
        except Exception:
            return uri
    try:
        p = Path(uri)
        return p.resolve().as_uri()
    except Exception:
        return uri

def _parse_families_env(raw: str) -> tuple[str, ...]:
    """Robustly parse MODEL_FAMILIES_SMOKE from either JSON or CSV."""
    s = (raw or "").strip()
    out: list[str] = []
    if not s:
        return tuple()

    if s.startswith("[") and s.endswith("]"):
        try:
            arr = json.loads(s)
            for x in arr:
                t = str(x).strip().strip("'\"")
                if t and t not in out:
                    out.append(t)
            return tuple(out)
        except Exception:
            pass

    s = s.strip("[]")
    for tok in s.split(","):
        t = tok.strip().strip("'\"[]")
        if t and t not in out:
            out.append(t)
    return tuple(out)

# Final configurations
MLFLOW_TRACKING_URI: str = _canonicalize_tracking_uri(_raw_mlflow_uri)
MLFLOW_EXPERIMENT_NAME: str = os.getenv("MLFLOW_EXPERIMENT_NAME", "nba_featurestore_smoke")
MODEL_ARTIFACTS_DIR: Path = Path(os.getenv("MODEL_ARTIFACTS_DIR", str(ARTIFACTS_DIR)))
FEATURE_STORE_DIR: Path = Path(os.getenv("FEATURE_STORE_DIR", str(FEATURE_STORE_DIR)))
FEATURESTORE_PREFERRED_STAGE: str = os.getenv("FEATURESTORE_PREFERRED_STAGE", "Production")
FEATURESTORE_AUTO_BOOTSTRAP: bool = os.getenv("FEATURESTORE_AUTO_BOOTSTRAP", "1").lower() in ("1", "true", "yes")
DEFAULT_FEATURESTORE_MODEL_FAMILY: str = os.getenv("FEATURESTORE_MODEL_FAMILY", "linear_ridge")

# Enhanced model families with stacking
DEFAULT_MODEL_FAMILIES_SMOKE: tuple[str, ...] = _parse_families_env(
    os.getenv("MODEL_FAMILIES_SMOKE", "linear_ridge,lasso,elasticnet,rf,xgb,lgbm,cat,stacking")
)

# Stacking-specific configuration
STACKING_DEFAULT_BASE_ESTIMATORS: tuple[str, ...] = _parse_families_env(
    os.getenv("STACKING_BASE_ESTIMATORS", "linear_ridge,xgb,lgbm,cat")
)
STACKING_DEFAULT_META_LEARNER: str = os.getenv("STACKING_META_LEARNER", "linear_ridge")
STACKING_DEFAULT_CV_FOLDS: int = int(os.getenv("STACKING_CV_FOLDS", "5"))
STACKING_DEFAULT_CV_STRATEGY: str = os.getenv("STACKING_CV_STRATEGY", "time_series")
STACKING_USE_PASSTHROUGH: bool = os.getenv("STACKING_USE_PASSTHROUGH", "false").lower() in ("1", "true", "yes")

# Utility functions
def feature_store_namespace(model_family: str, target: str) -> str:
    return f"{model_family}_{target}"

SELECTED_METRIC: Literal["rmse", "mae", "r2"] = os.getenv("SELECTED_METRIC", "mae").lower()

def metric_higher_is_better(metric: str) -> bool:
    return metric.lower() in {"r2"}

# Registry configuration
USE_MODEL_REGISTRY: bool = os.getenv("USE_MODEL_REGISTRY", "1").lower() in ("1","true","yes")
REGISTRY_ALIAS_DEV = os.getenv("MLFLOW_ALIAS_DEV", "dev")
REGISTRY_ALIAS_STAGE = os.getenv("MLFLOW_ALIAS_STAGE", "stage")
REGISTRY_ALIAS_PROD = os.getenv("MLFLOW_ALIAS_PROD", "prod")

def registry_alias_for_env(env: Optional[str] = None) -> str:
    e = _normalize_stage_env(env or ML_ENV)
    return {
        "dev": REGISTRY_ALIAS_DEV,
        "stage": REGISTRY_ALIAS_STAGE,
        "prod": REGISTRY_ALIAS_PROD,
    }.get(e, REGISTRY_ALIAS_DEV)

def registry_name_for_target(target: str) -> str:
    default = f"nba_{target}_regressor"
    return os.getenv("MODEL_REGISTRY_NAME", default)

# Stage profiles
_STAGE_ALIASES: dict[str, str] = {
    "staging": "stage",
    "production": "prod",
    "development": "dev",
}

def _normalize_stage_env(env: Optional[str] = None) -> str:
    e = (env or ML_ENV).lower()
    return _STAGE_ALIASES.get(e, e)

@dataclass(frozen=True)
class StageProfile:
    name: Literal["dev", "stage", "prod"]
    registry_stage: Literal["Staging", "Production"]
    selected_metric: Literal["rmse", "mae", "r2"] = SELECTED_METRIC
    min_improvement: float = 0.0

STAGE_PROFILES: dict[str, StageProfile] = {
    "dev": StageProfile("dev", "Staging", selected_metric=SELECTED_METRIC, min_improvement=0.0),
    "stage": StageProfile("stage", "Staging", selected_metric=SELECTED_METRIC, min_improvement=0.0),
    "prod": StageProfile("prod", "Production", selected_metric=SELECTED_METRIC, min_improvement=0.0),
}

# Bundle directories
BUNDLE_ROOT: Path = ARTIFACTS_DIR / "model_bundles"
FAMILY_BUNDLE_ROOT: Path = ARTIFACTS_DIR / "family_bundles"

def family_bundle_dir_for(model_family: str, target: str, env: Optional[str] = None) -> Path:
    canonical = _normalize_stage_env(env or ML_ENV)
    return FAMILY_BUNDLE_ROOT / target / model_family / canonical

def bundle_dir_for(target: str, env: Optional[str] = None) -> Path:
    canonical = _normalize_stage_env(env or ML_ENV)
    return BUNDLE_ROOT / target / canonical

# Behavior switches
AUTOCLEAN_FAMILY_ARTIFACTS: bool = os.getenv("AUTOCLEAN_FAMILY_ARTIFACTS", "1").lower() in ("1", "true", "yes")
PREDICT_USE_BUNDLE_FIRST: bool = os.getenv("PREDICT_USE_BUNDLE_FIRST", "1").lower() in ("1", "true", "yes")

def registry_stage_for_env(env: Optional[str] = None) -> str:
    canonical = _normalize_stage_env(env)
    return STAGE_PROFILES.get(canonical, STAGE_PROFILES["dev"]).registry_stage

# MISSING TRAINING CONFIG CLASS (ROOT CAUSE OF ERROR)
@dataclass
class TrainingConfig:
    """
    Training configuration for any model family (sklearn/stacking/bayes_hier).
    """
    model_family: str = "linear_ridge"
    target: str = "AAV"
    use_cap_pct_target: bool = False
    max_train_rows: Optional[int] = None
    n_splits: int = 4
    n_trials: int = 20
    random_state: int = 42
    drop_columns_exact: list[str] = None
    feature_exclude_prefixes: list[str] = None

    # --- Bayesian-only knobs (new) ---
    bayes_draws: int = int(os.getenv("BAYES_DRAWS", "1000"))
    bayes_tune: int = int(os.getenv("BAYES_TUNE", "1000"))
    bayes_target_accept: float = float(os.getenv("BAYES_TARGET_ACCEPT", "0.9"))
    bayes_chains: int = int(os.getenv("BAYES_CHAINS", "2"))
    bayes_cores: int = int(os.getenv("BAYES_CORES", "2"))
    # configurable grouping columns; default common pair
    bayes_group_cols: tuple[str, ...] = tuple(
        os.getenv("BAYES_GROUP_COLS", "position,Season").split(",")
    )

    def __post_init__(self):
        if self.drop_columns_exact is None:
            self.drop_columns_exact = []
        if self.feature_exclude_prefixes is None:
            self.feature_exclude_prefixes = []
        # basic sanity
        if self.bayes_draws < 100 or self.bayes_tune < 100:
            print("[TrainingConfig] Warning: very small draws/tune may lead to unstable posteriors.")

@dataclass
class DevTrainConfig:
    """Enhanced DevTrainConfig with stacking ensemble support."""
    stage: Literal["dev", "train", "prod"] = "dev"

    # Enhanced preprocessing
    numerical_imputation: Literal["mean", "median", "iterative"] = "median"
    add_missing_indicators: bool = True
    quantile_clipping: tuple[float, float] = (0.01, 0.99)
    max_safe_rows: int = 200_000
    apply_type_conversions: bool = True
    drop_unexpected_columns: bool = True

    # Feature scaling improvements
    enable_robust_scaling: bool = True
    enable_outlier_detection: bool = True
    outlier_contamination: float = 0.1

    # Feature selection
    perm_threshold: float = 0.001
    shap_threshold: float = 0.001
    selection_mode: Literal["intersection", "union"] = "union"
    min_features: int = 10
    max_features: Optional[int] = None
    fallback_strategy: Literal["top_permutation", "top_shap", "all"] = "top_permutation"
    perm_n_repeats: int = 10
    perm_max_samples: float | int | None = 0.5
    perm_n_jobs: int = 2
    shap_nsamples: int = 100
    max_relative_regression: float = 0.05

    # Model-specific convergence settings
    linear_max_iter: int = 50000
    linear_tol: float = 1e-6
    enable_feature_selection_for_linear: bool = True

    # Cross-validation improvements
    cv_strategy: Literal["time_series", "group", "stratified"] = "time_series"
    cv_n_splits: int = 5
    cv_test_size: float = 0.2

    # Stacking ensemble configuration
    stacking_base_estimators: tuple[str, ...] = STACKING_DEFAULT_BASE_ESTIMATORS
    stacking_meta_learner: str = STACKING_DEFAULT_META_LEARNER
    stacking_cv_folds: int = STACKING_DEFAULT_CV_FOLDS
    stacking_cv_strategy: str = STACKING_DEFAULT_CV_STRATEGY
    stacking_use_passthrough: bool = STACKING_USE_PASSTHROUGH
    stacking_enable_base_tuning: bool = True
    stacking_meta_tuning_trials: int = 20

    def make_selection_kwargs(self) -> dict:
        """Build a dict that can be unpacked into SelectionConfig-like consumers."""
        return {
            "perm_n_repeats": self.perm_n_repeats,
            "perm_max_samples": self.perm_max_samples,
            "perm_n_jobs": self.perm_n_jobs,
            "perm_threshold": self.perm_threshold,
            "shap_nsamples": self.shap_nsamples,
            "shap_threshold": self.shap_threshold,
            "mode": self.selection_mode,
            "min_features": self.min_features,
            "max_features": self.max_features,
            "fallback_strategy": self.fallback_strategy,
            "max_relative_regression": self.max_relative_regression,
        }

    def make_stacking_params(self) -> dict:
        """Build default stacking parameters from configuration."""
        return {
            "base_estimators": list(self.stacking_base_estimators),
            "meta_learner": self.stacking_meta_learner,
            "cv_folds": self.stacking_cv_folds,
            "cv_strategy": self.stacking_cv_strategy,
            "passthrough": self.stacking_use_passthrough,
            "base_params": self._get_default_base_params(),
            "meta_params": self._get_default_meta_params()
        }

    def _get_default_base_params(self) -> dict:
        """Get default hyperparameters for base estimators."""
        defaults = {
            "linear_ridge": {"alpha": 1.0},
            "lasso": {"alpha": 0.01},
            "elasticnet": {"alpha": 0.01, "l1_ratio": 0.5},
            "rf": {"n_estimators": 300, "max_depth": 10},
            "xgb": {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.1},
            "lgbm": {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.1},
            "cat": {"iterations": 300, "depth": 6, "learning_rate": 0.1}
        }
        return {family: params for family, params in defaults.items() 
                if family in self.stacking_base_estimators}

    def _get_default_meta_params(self) -> dict:
        """Get default hyperparameters for meta-learner."""
        meta_defaults = {
            "linear_ridge": {"alpha": 0.1},
            "lasso": {"alpha": 0.01, "max_iter": 10000},
            "elasticnet": {"alpha": 0.01, "l1_ratio": 0.5, "max_iter": 10000}
        }
        return meta_defaults.get(self.stacking_meta_learner, {})

    def validate(self):
        """Enhanced validation with stacking checks."""
        if not (0 <= self.quantile_clipping[0] < self.quantile_clipping[1] <= 1):
            raise ValueError("quantile_clipping must satisfy 0 <= low < high <=1")
        if self.max_safe_rows < 1_000:
            raise ValueError("max_safe_rows must be sensible (>=1000)")
        if self.min_features < 1:
            raise ValueError("min_features must be >=1")
        if self.perm_threshold < 0 or self.shap_threshold < 0:
            raise ValueError("thresholds must be non-negative")

        # Stacking validation
        if len(self.stacking_base_estimators) < 2:
            raise ValueError("stacking_base_estimators must have at least 2 estimators")
        if self.stacking_cv_folds < 2:
            raise ValueError("stacking_cv_folds must be >= 2")
        if self.stacking_cv_strategy not in ("time_series", "kfold"):
            raise ValueError("stacking_cv_strategy must be 'time_series' or 'kfold'")

@dataclass
class TuningConfig:
    """Enhanced tuning configuration with stacking ensemble support."""
    model_families: Tuple[str, ...] = DEFAULT_MODEL_FAMILIES_SMOKE
    n_trials: int = 20
    n_splits: int = 4
    use_bayesian: bool = True

    # Stacking-specific tuning configuration
    stacking_n_trials: int = 50
    stacking_enable_base_tuning: bool = False
    stacking_base_trials_per_family: int = 10
    stacking_meta_trials: int = 20
    stacking_max_base_combinations: int = 10

    def get_trials_for_family(self, model_family: str) -> int:
        """Get the appropriate number of trials for a given model family."""
        if model_family == "stacking":
            return self.stacking_n_trials
        return self.n_trials

    def should_tune_family(self, model_family: str) -> bool:
        """Determine whether to tune a specific model family."""
        if not self.use_bayesian:
            return False
        if model_family == "stacking" and not self.stacking_enable_base_tuning:
            return False
        return True

# CREATE THE MISSING DEFAULTS INSTANCE (ROOT CAUSE FIX)
DEFAULT_DEV_TRAIN_CONFIG = DevTrainConfig()
DEFAULT_TUNING_CONFIG = TuningConfig()
DEFAULTS = TrainingConfig()  # This was the missing piece causing the NameError!

# Stacking ensemble utilities
def get_stacking_default_params(target: str = "AAV") -> dict:
    """Get default stacking parameters optimized for NBA player valuation."""
    return {
        "base_estimators": list(STACKING_DEFAULT_BASE_ESTIMATORS),
        "meta_learner": STACKING_DEFAULT_META_LEARNER,
        "cv_folds": STACKING_DEFAULT_CV_FOLDS,
        "cv_strategy": STACKING_DEFAULT_CV_STRATEGY,
        "passthrough": STACKING_USE_PASSTHROUGH,
        "base_params": {
            "linear_ridge": {"alpha": 1.0},
            "xgb": {
                "n_estimators": 300,
                "max_depth": 6,
                "learning_rate": 0.1,
                "subsample": 0.8,
                "colsample_bytree": 0.8
            },
            "lgbm": {
                "n_estimators": 300,
                "max_depth": 6,
                "learning_rate": 0.1,
                "subsample": 0.8,
                "colsample_bytree": 0.8
            },
            "cat": {
                "iterations": 300,
                "depth": 6,
                "learning_rate": 0.1
            }
        },
        "meta_params": {"alpha": 0.1}
    }

def is_stacking_family(model_family: str) -> bool:
    """Check if a model family is a stacking ensemble."""
    return model_family.lower() == "stacking"

def get_stacking_dependencies(model_family: str) -> list[str]:
    """Get the base model dependencies for a stacking model."""
    if not is_stacking_family(model_family):
        return []
    return list(STACKING_DEFAULT_BASE_ESTIMATORS)

def get_training_order(model_families: list[str]) -> list[str]:
    """Return model families in the correct training order. Stacking models should be trained after their base estimators."""
    stacking_families = [f for f in model_families if is_stacking_family(f)]
    base_families = [f for f in model_families if not is_stacking_family(f)]
    return base_families + stacking_families

# MISSING HELPER FUNCTION (ANOTHER ROOT CAUSE)
def get_master_parquet_path() -> Path:
    """
    Get the path to the master dataset parquet file.
    This function was referenced but not defined, causing another potential error.
    """
    return FINAL_ENGINEERED_DATASET_DIR / "final_merged_with_all.parquet"

# Debug function to validate all required components exist
def validate_configuration():
    """Debug function to validate all configuration components are properly defined."""
    errors = []

    # Check required classes exist
    try:
        TrainingConfig()
        print("✅ TrainingConfig class defined correctly")
    except Exception as e:
        errors.append(f"❌ TrainingConfig error: {e}")

    # Check DEFAULTS exists
    try:
        assert DEFAULTS is not None
        print("✅ DEFAULTS instance defined correctly")
    except Exception as e:
        errors.append(f"❌ DEFAULTS error: {e}")

    # Check required functions exist
    try:
        path = get_master_parquet_path()
        print(f"✅ get_master_parquet_path() returns: {path}")
    except Exception as e:
        errors.append(f"❌ get_master_parquet_path() error: {e}")

    # Check stacking utilities
    try:
        params = get_stacking_default_params()
        print(f"✅ Stacking params generated: {len(params)} keys")
    except Exception as e:
        errors.append(f"❌ Stacking utilities error: {e}")

    if errors:
        print("\n=== CONFIGURATION ERRORS ===")
        for error in errors:
            print(error)
        return False
    else:
        print("\n✅ All configuration components validated successfully!")
        return True

if __name__ == "__main__":
    print("=== CONFIGURATION DEBUG VALIDATION ===")
    validate_configuration()

    print("\nTesting stacking configuration:")
    config = DevTrainConfig()
    stacking_params = config.make_stacking_params()
    print(f"Default stacking params: {stacking_params}")

    print("\nTesting training order:")
    families = ["linear_ridge", "xgb", "stacking", "lgbm", "cat"]
    training_order = get_training_order(families)
    print(f"Training order: {training_order}")


# ensemble training for example:
# api/src/ml/train.py
from __future__ import annotations
import json, time, hashlib
from pathlib import Path
from typing import Dict, List, Tuple, Optional

import joblib
import numpy as np
import pandas as pd
import mlflow, mlflow.sklearn
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Import all required components
from api.src.ml import config
from api.src.ml.config import (
    TrainingConfig, DEFAULTS, DevTrainConfig, DEFAULT_DEV_TRAIN_CONFIG,
    get_master_parquet_path, get_stacking_default_params
)
from api.src.ml.models.models import make_estimator
from api.src.ml.column_schema import load_schema_from_yaml, SchemaConfig
from api.src.ml.features.feature_engineering import engineer_features
from api.src.ml.preprocessing.preprocessor import fit_preprocessor, transform_preprocessor
from api.src.ml.preprocessing.feature_store.feature_store import FeatureStore
from api.src.ml.preprocessing.feature_store.spec_builder import FeatureSpec, select_model_features, build_feature_spec_from_schema_and_preprocessor

from api.src.ml.preprocessing.feature_selection import propose_feature_spec
from api.src.ml.ml_config import SelectionConfig
from api.src.ml.models.tune import optimize

import matplotlib.pyplot as plt 
from mlflow.models.signature import infer_signature
from mlflow import sklearn as _sklog

# Set matplotlib backend for compatibility
import matplotlib
matplotlib.use('Agg')  # Use non-interactive backend
plt.ion()  # Enable interactive mode if needed

# ─────────────────────────────────────────────────────────────────────────────
# ENHANCED MAIN TRAINING FUNCTION WITH BAYESIAN SUPPORT
# ─────────────────────────────────────────────────────────────────────────────
def train(cfg: TrainingConfig = DEFAULTS) -> Path:
    """
    Enhanced training function with proper preprocessing, feature selection, and Bayesian support.
    """
    print(f"[train] Starting training with config: {cfg}")
    
    # Step 1: Load master dataset
    try:
        p = get_master_parquet_path()
        print(f"[train] Loading data from: {p}")
        df = pd.read_parquet(p)
        print(f"[train] Loaded dataset: {df.shape}")
    except Exception as e:
        print(f"[train] Master parquet not found ({e}), trying alternative path...")
        alt_path = config.FINAL_ENGINEERED_DATASET_DIR / "final_merged_with_all.parquet"
        if alt_path.exists():
            df = pd.read_parquet(alt_path)
            print(f"[train] Loaded from alternative path: {df.shape}")
        else:
            raise FileNotFoundError(f"Could not find dataset at {p} or {alt_path}")

    # Step 2: Apply feature engineering
    try:
        df, _ = engineer_features(df)
        print("✓ Applied feature engineering")
    except Exception as e:
        print(f"[train] Feature engineering failed: {e}")
        raise

    # Step 3: Choose target
    target = "AAV_PCT_CAP" if cfg.use_cap_pct_target else cfg.target
    print(f"[train] Target: {target}")

    # Step 4: Clean and filter data
    try:
        # Drop exact columns
        if cfg.drop_columns_exact:
            df_clean = df.drop(columns=cfg.drop_columns_exact, errors="ignore")
            print(f"[train] Dropped {len(cfg.drop_columns_exact)} exact columns")
        else:
            df_clean = df.copy()

        # Filter by prefix
        if cfg.feature_exclude_prefixes:
            exclude_cols = [c for c in df_clean.columns 
                          if any(c.startswith(prefix) for prefix in cfg.feature_exclude_prefixes)]
            df_clean = df_clean.drop(columns=exclude_cols)
            print(f"[train] Dropped {len(exclude_cols)} prefix-filtered columns")

        # Limit rows if specified
        if cfg.max_train_rows and len(df_clean) > cfg.max_train_rows:
            df_clean = df_clean.head(cfg.max_train_rows)
            print(f"[train] Limited to {cfg.max_train_rows} rows")

        print(f"[train] Clean dataset: {df_clean.shape}")
    except Exception as e:
        print(f"[train] Data cleaning failed: {e}")
        raise

    # Step 5: Load schema and preprocessor
    try:
        schema = load_schema_from_yaml(config.COLUMN_SCHEMA_PATH)
        print("✓ Loaded column schema")
    except Exception as e:
        print(f"[train] Schema loading failed: {e}")
        raise

    # Step 6: Create train/test split for honest evaluation
    try:
        # Use Season-based split if available, otherwise random
        if "Season" in df_clean.columns:
            # Time series split: use latest seasons for test
            seasons = sorted(df_clean["Season"].unique())
            if len(seasons) >= 2:
                test_seasons = seasons[-1:]  # Latest season for test
                train_df = df_clean[~df_clean["Season"].isin(test_seasons)].copy()
                test_df = df_clean[df_clean["Season"].isin(test_seasons)].copy()
                print(f"[train] Time series split: train={len(train_df)}, test={len(test_df)}")
            else:
                train_df, test_df = train_test_split(df_clean, test_size=0.2, random_state=cfg.random_state)
                print(f"[train] Random split: train={len(train_df)}, test={len(test_df)}")
        else:
            train_df, test_df = train_test_split(df_clean, test_size=0.2, random_state=cfg.random_state)
            print(f"[train] Random split: train={len(train_df)}, test={len(test_df)}")
    except Exception as e:
        print(f"[train] Train/test split failed: {e}")
        raise

    # Step 7: Fit preprocessor on training data
    try:
        dev_cfg = DEFAULT_DEV_TRAIN_CONFIG
        X_train_np, y_train, preprocessor = fit_preprocessor(
            train_df,
            schema=schema,
            model_type="linear",  # For initial preprocessing
            numerical_imputation=dev_cfg.numerical_imputation,
            debug=False,
            quantiles=dev_cfg.quantile_clipping,
            max_safe_rows=200000,
            apply_type_conversions=True,
            drop_unexpected_schema_columns=True,
        )
        print(f"✓ Fitted preprocessor: {X_train_np.shape}")
    except Exception as e:
        print(f"[train] Preprocessor fitting failed: {e}")
        raise

    # Step 8: Bayesian training branch (enhanced with MLflow logging and holdout eval)
    if cfg.model_family == "bayes_hier":
        # Build spec using all features (no selection for Bayes by default)
        spec = build_feature_spec_from_schema_and_preprocessor(
            df=train_df,  # use the training portion for selecting columns
            target=target,
            schema=schema,
            preprocessor=preprocessor,
            final_features=None,  # use all features flowing through preprocessor/spec
            clip_bounds=None,
        )
        fs = FeatureStore(cfg.model_family, target)
        fs.save_spec(spec, {"rows": int(len(df_clean))})
        print("✓ Saved FeatureSpec for Bayesian model")

        # --- MLflow logging & holdout eval to match sklearn families ---
        mlflow.set_tracking_uri(config.MLFLOW_TRACKING_URI)
        mlflow.set_experiment(config.MLFLOW_EXPERIMENT_NAME)
        run_name = f"bayes_hier::{target}"

        with mlflow.start_run(run_name=run_name):
            # log hyperparams
            mlflow.log_params({
                "model_family": cfg.model_family,
                "target": target,
                "bayes_draws": cfg.bayes_draws,
                "bayes_tune": cfg.bayes_tune,
                "bayes_target_accept": cfg.bayes_target_accept,
                "bayes_chains": cfg.bayes_chains,
                "bayes_cores": cfg.bayes_cores,
                "bayes_group_cols": ",".join(cfg.bayes_group_cols or ()),
            })

            # train on the training split only (to enable honest holdout)
            from api.src.ml.bayes_hier import train_bayesian, predict_bayesian
            out_dir, idata = train_bayesian(
                df=train_df.copy(),
                spec=spec,
                draws=cfg.bayes_draws,
                tune=cfg.bayes_tune,
                target_accept=cfg.bayes_target_accept,
                chains=cfg.bayes_chains,
                cores=cfg.bayes_cores,
                group_cols=cfg.bayes_group_cols,
                random_seed=cfg.random_state,
                out_dir=(config.ARTIFACTS_DIR / f"bayes_hier_{target}")
            )
            print(f"✓ Bayesian posterior saved → {out_dir}")

            # Evaluate on holdout test_df (posterior mean)
            y_true = test_df[target].astype(float).values
            y_pred = predict_bayesian(
                df=test_df.copy(),
                spec=spec,
                artifact_dir=out_dir,
                group_cols=cfg.bayes_group_cols,
            ).values

            rmse = mean_squared_error(y_true, y_pred, squared=False)
            mae = mean_absolute_error(y_true, y_pred)
            r2 = r2_score(y_true, y_pred)
            print(f"Bayesian holdout: RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")

            mlflow.log_metrics({"rmse": rmse, "mae": mae, "r2": r2})

            # log key artifacts
            mlflow.log_artifact(str(out_dir / "posterior.nc"))
            mlflow.log_artifact(str(out_dir / "feature_names.txt"))
            if (out_dir / "config_groups.json").exists():
                mlflow.log_artifact(str(out_dir / "config_groups.json"))

            # Return the artifact path to match other families' behavior
            return out_dir / "posterior.nc"
    
    # Step 9: For sklearn models, run feature selection
    try:
        # Create selection config
        selection_cfg = SelectionConfig(
            perm_n_repeats=10,
            perm_max_samples=0.5,
            perm_n_jobs=2,
            perm_threshold=0.001,
            shap_nsamples=100,
            shap_threshold=0.001,
            mode="union",
            min_features=10,
            max_features=None,
            fallback_strategy="top_permutation",
            max_relative_regression=0.05,
        )

        # Run feature selection on training data
        spec = propose_feature_spec(
            df=train_df,
            target=target,
            schema=schema,
            preprocessor=preprocessor,
            selection_config=selection_cfg,
        )
        print(f"✓ Feature selection: {len(spec.final_features)} features selected")

        # Save spec to feature store
        fs = FeatureStore(cfg.model_family, target)
        fs.save_spec(spec, {"rows": int(len(df_clean))})
        print("✓ Saved FeatureSpec to feature store")

    except Exception as e:
        print(f"[train] Feature selection failed: {e}")
        raise

    # Step 10: Run optimization
    try:
        # Special handling for stacking models
        if config.is_stacking_family(cfg.model_family):
            print(f"[train] Training stacking ensemble: {cfg.model_family}")
            stacking_params = get_stacking_default_params(target)
            print(f"[train] Using stacking params: {list(stacking_params.keys())}")
        
        # Run optimization on full clean dataset
        pipe, best_params, rmse = optimize(
            df=df_clean,
            feature_spec=spec,
            model_family=cfg.model_family,
            n_splits=cfg.n_splits,
            n_trials=cfg.n_trials,
            random_state=cfg.random_state,
            experiment_name=config.MLFLOW_EXPERIMENT_NAME,
        )
        print(f"✓ Optimization completed: RMSE={rmse:.4f}")
        
    except Exception as e:
        print(f"[train] Optimization failed: {e}")
        raise

    # Step 11: Save model artifacts
    try:
        out_dir = config.ARTIFACTS_DIR / f"{cfg.model_family}_{target}"
        out_dir.mkdir(parents=True, exist_ok=True)
        out_file = out_dir / "model.joblib"
        joblib.dump(pipe, out_file)
        print(f"Saved model → {out_file} (CV RMSE ≈ {rmse:,.4f})")
        
        # Save additional metadata for stacking models
        if config.is_stacking_family(cfg.model_family):
            meta_file = out_dir / "stacking_meta.json"
            try:
                stacking_model = pipe.named_steps['model']
                from api.src.ml.models.models import get_stacking_feature_importance
                importance_info = get_stacking_feature_importance(stacking_model)
                
                meta_data = {
                    "base_estimators": importance_info.get("base_estimators", []),
                    "base_weights": importance_info.get("base_predictions_weight", {}),
                    "meta_learner": importance_info.get("meta_learner", ""),
                    "cv_rmse": float(rmse),
                    "n_features": len(spec.final_features),
                }
                meta_file.write_text(json.dumps(meta_data, indent=2))
                print(f"✓ Saved stacking metadata → {meta_file}")
            except Exception as e:
                print(f"[train] Stacking metadata save failed: {e}")

        return out_file

    except Exception as e:
        print(f"[train] Model saving failed: {e}")
        raise

if __name__ == "__main__":
    # Test the training function
    cfg = TrainingConfig(
        model_family="bayes_hier",
        target="AAV",
        bayes_draws=300,  # quick smoke test
        bayes_tune=300,
        bayes_group_cols=("position", "Season"),
    )
    result_path = train(cfg)
    print(f"Training completed: {result_path}")

        

# Bayesian Hierarchichal 


In [3]:
# %%writefile api/src/ml/bayes_hier.py
# FIXED api/src/ml/bayes_hier.py - Addresses all major issues

from __future__ import annotations
from pathlib import Path
from typing import Optional, Tuple, Sequence
import numpy as np
import pandas as pd
import arviz as az
import pymc as pm
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from api.src.ml import config
from api.src.ml.config import TrainingConfig, get_master_parquet_path

ARTIFACTS_DIR = config.ARTIFACTS_DIR

from api.src.ml.preprocessing.feature_store.spec_builder import FeatureSpec
from dataclasses import dataclass
import joblib
import warnings

@dataclass
class DesignSpec:
    num_imputer: SimpleImputer
    num_scaler: RobustScaler  # Changed to RobustScaler for better outlier handling
    ohe: OneHotEncoder
    num_cols: list[str]
    nom_cols: list[str]
    ord_cols: list[str]
    group_levels: dict[str, list[str]]
    y_mean: float
    y_std: float
    y_median: float  # Added for robust scaling info
    y_iqr: float     # Added for robust scaling info

    def save(self, path: Path) -> None:
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(self, path)

    @staticmethod
    def load(path: Path) -> "DesignSpec":
        return joblib.load(path)

def _resolve_group_cols_case_insensitive(df: pd.DataFrame, group_cols: Sequence[str]) -> list[str]:
    """Fix case sensitivity issues in group column resolution"""
    print(f"[bayes] Resolving group columns: {group_cols}")
    available_cols = df.columns.tolist()
    resolved = []
    
    for gcol in group_cols:
        # Try exact match first
        if gcol in available_cols:
            resolved.append(gcol)
            print(f"[bayes] ✓ Exact match: {gcol}")
            continue
            
        # Try case variations
        variations = [gcol.upper(), gcol.lower(), gcol.title()]
        found = False
        for var in variations:
            if var in available_cols:
                resolved.append(var)
                print(f"[bayes] ✓ Case-insensitive match: {gcol} -> {var}")
                found = True
                break
        
        if not found:
            # Try partial matches
            partial_matches = [c for c in available_cols if gcol.lower() in c.lower()]
            if partial_matches:
                best_match = min(partial_matches, key=len)  # Shortest match
                resolved.append(best_match)
                print(f"[bayes] ✓ Partial match: {gcol} -> {best_match}")
            else:
                print(f"[bayes] ❌ Not found: {gcol}")
    
    return resolved

def _validate_and_fix_target(df: pd.DataFrame, target: str, debug_dir: Path = None) -> tuple[pd.DataFrame, str]:
    """Validate and fix target variable issues"""
    print(f"[bayes] Validating target variable: {target}")
    
    if target not in df.columns:
        raise ValueError(f"Target column '{target}' not found in dataframe")
    
    # Basic statistics
    target_stats = df[target].describe()
    print(f"[bayes] Target statistics:\n{target_stats}")
    
    # Check for issues
    issues = []
    df_fixed = df.copy()
    
    # 1. Check for extreme outliers (likely unit errors)
    q99 = df[target].quantile(0.99)
    q01 = df[target].quantile(0.01)
    
    if df[target].max() > q99 * 20:  # Values more than 20x the 99th percentile
        extreme_outliers = (df[target] > q99 * 10).sum()
        issues.append(f"Extreme outliers detected: {extreme_outliers} values > 10x Q99")
        
        # Cap extreme outliers at 3x Q99
        cap_value = q99 * 3
        df_fixed[target] = df_fixed[target].clip(upper=cap_value)
        print(f"[bayes] Capped {extreme_outliers} extreme outliers at ${cap_value:,.0f}")
    
    # 2. Check for negative values in salary data
    if (df[target] < 0).any():
        negative_count = (df[target] < 0).sum()
        issues.append(f"Negative values detected: {negative_count}")
        df_fixed = df_fixed[df_fixed[target] >= 0]
        print(f"[bayes] Removed {negative_count} rows with negative target values")
    
    # 3. Check if values seem to be in wrong units (NBA salaries should be $1M-$50M range)
    median_val = df_fixed[target].median()
    if median_val > 50_000_000:  # More than $50M median suggests wrong units
        print(f"[bayes] Target values seem too high (median=${median_val:,.0f})")
        print("[bayes] Consider checking data source units")
    elif median_val < 100_000:  # Less than $100K median also suspicious
        print(f"[bayes] Target values seem too low (median=${median_val:,.0f})")
        print("[bayes] Consider checking data source units")
    
    # 4. Remove missing target values
    missing_target = df_fixed[target].isnull().sum()
    if missing_target > 0:
        df_fixed = df_fixed.dropna(subset=[target])
        issues.append(f"Missing target values removed: {missing_target}")
        print(f"[bayes] Removed {missing_target} rows with missing target")
    
    # Save debug info if requested
    if debug_dir:
        debug_dir.mkdir(parents=True, exist_ok=True)
        with open(debug_dir / "target_validation.txt", "w") as f:
            f.write(f"Target: {target}\n")
            f.write(f"Original shape: {df.shape}\n")
            f.write(f"Fixed shape: {df_fixed.shape}\n")
            f.write(f"Issues found: {issues}\n")
            f.write(f"Final statistics:\n{df_fixed[target].describe()}\n")
    
    return df_fixed, target

def _prepare_design(
    df: pd.DataFrame, 
    spec: FeatureSpec, 
    group_cols: Sequence[str] = (),
    verbose: bool = False,
    design: Optional[DesignSpec] = None,
    strict_design: bool = False,
    debug_dir: Path = None
) -> tuple[np.ndarray, list[str], np.ndarray, dict, DesignSpec]:
    """Enhanced design preparation with better data handling"""
    
    if verbose:
        print(f"[bayes] target='{spec.target}' rows={len(df)}  missing_target={df[spec.target].isnull().sum()}")
    
    # Validate and fix target first
    df_clean, target = _validate_and_fix_target(df, spec.target, debug_dir)
    y_raw = df_clean[target].astype(float).values
    
    if len(y_raw) == 0:
        raise ValueError("[bayes] No valid target values after cleaning")
    
    # Enhanced feature selection - be more aggressive to prevent overfitting
    all_feature_cols = [c for c in df_clean.columns if c in spec.final_features and c != target]
    
    # Filter out features with too much missing data or low variance
    good_features = []
    for col in all_feature_cols:
        missing_pct = df_clean[col].isnull().sum() / len(df_clean)
        if missing_pct > 0.5:  # Skip features with >50% missing
            if verbose:
                print(f"[bayes] Skipping {col}: {missing_pct:.1%} missing")
            continue
            
        if df_clean[col].dtype in ['object', 'category']:
            good_features.append(col)
        else:
            # Check variance for numeric features
            var = df_clean[col].var()
            if pd.notna(var) and var > 0:
                good_features.append(col)
            elif verbose:
                print(f"[bayes] Skipping {col}: zero variance")
    
    # Limit total features to prevent overfitting
    MAX_FEATURES = 50  # Conservative limit for Bayesian model
    if len(good_features) > MAX_FEATURES:
        # Select features by correlation with target
        numeric_features = [c for c in good_features if df_clean[c].dtype in ['int64', 'float64']]
        if len(numeric_features) > 0:
            correlations = df_clean[numeric_features + [target]].corr()[target].abs()
            top_features = correlations.nlargest(MAX_FEATURES).index.tolist()
            if target in top_features:
                top_features.remove(target)
            good_features = top_features[:MAX_FEATURES]
            if verbose:
                print(f"[bayes] Reduced to top {len(good_features)} features by correlation")
        else:
            good_features = good_features[:MAX_FEATURES]
    
    # Separate by type
    num_cols = [c for c in good_features if df_clean[c].dtype in ['int64', 'float64']]
    nom_cols = [c for c in good_features if df_clean[c].dtype in ['object', 'category'] and c not in num_cols]
    ord_cols = []  # Simplified: treat ordinal as numeric for now
    
    if verbose:
        print(f"[bayes] numerical({len(num_cols)}): {num_cols[:10]}")
        print(f"[bayes] nominal({len(nom_cols)}): {nom_cols}")
        print(f"[bayes] ordinal({len(ord_cols)}): {ord_cols}")
    
    # Handle training vs prediction
    if design is None:  # Training mode
        # Numeric preprocessing with robust scaling
        if num_cols:
            num_imputer = SimpleImputer(strategy='median')  # More robust than mean
            X_num = num_imputer.fit_transform(df_clean[num_cols])
            
            # Use RobustScaler instead of StandardScaler for better outlier handling
            num_scaler = RobustScaler()
            X_num = num_scaler.fit_transform(X_num)
            
            missing_before = df_clean[num_cols].isnull().sum().sum()
            if verbose and missing_before > 0:
                print(f"[bayes] numeric missing before impute: {missing_before}")
        else:
            num_imputer = SimpleImputer()
            num_scaler = RobustScaler()
            X_num = np.empty((len(df_clean), 0))
        
        # Nominal preprocessing
        if nom_cols:
            # Fill missing with 'Missing' category
            df_nom = df_clean[nom_cols].fillna('Missing').astype(str)
            ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
            X_nom = ohe.fit_transform(df_nom)
        else:
            ohe = OneHotEncoder()
            X_nom = np.empty((len(df_clean), 0))
        
        # Combine features
        X = np.concatenate([X_num, X_nom], axis=1)
        
        # Feature names
        num_names = num_cols
        nom_names = ohe.get_feature_names_out(nom_cols) if nom_cols else []
        feat_names = list(num_names) + list(nom_names)
        
        if verbose:
            print(f"[bayes] X shape after concat: {X.shape}")
        
        # Target preprocessing - robust standardization
        y_median = np.median(y_raw)
        y_q75 = np.percentile(y_raw, 75)
        y_q25 = np.percentile(y_raw, 25)
        y_iqr = y_q75 - y_q25
        
        if y_iqr == 0:
            y_mean = np.mean(y_raw)
            y_std = np.std(y_raw)
            if y_std == 0:
                raise ValueError("[bayes] Target has zero variance")
        else:
            y_mean = y_median  # Use median instead of mean for robustness
            y_std = y_iqr      # Use IQR instead of std for robustness
            
        # Create design spec
        design = DesignSpec(
            num_imputer=num_imputer,
            num_scaler=num_scaler,
            ohe=ohe,
            num_cols=num_cols,
            nom_cols=nom_cols,
            ord_cols=ord_cols,
            group_levels={},  # Will be populated below
            y_mean=y_mean,
            y_std=y_std,
            y_median=y_median,
            y_iqr=y_iqr,
        )
        
    else:  # Prediction mode - use existing design
        num_cols = design.num_cols
        nom_cols = design.nom_cols
        ord_cols = design.ord_cols
        
        if num_cols:
            X_num = design.num_imputer.transform(df_clean[num_cols])
            X_num = design.num_scaler.transform(X_num)
        else:
            X_num = np.empty((len(df_clean), 0))
            
        if nom_cols:
            df_nom = df_clean[nom_cols].fillna('Missing').astype(str)
            X_nom = design.ohe.transform(df_nom)
        else:
            X_nom = np.empty((len(df_clean), 0))
            
        X = np.concatenate([X_num, X_nom], axis=1)
        
        num_names = design.num_cols
        nom_names = design.ohe.get_feature_names_out(design.nom_cols) if design.nom_cols else []
        feat_names = list(num_names) + list(nom_names)
    
    # Handle group columns with improved resolution
    resolved_group_cols = _resolve_group_cols_case_insensitive(df_clean, group_cols)
    groups = {}
    
    for gcol in resolved_group_cols:
        if gcol in df_clean.columns:
            group_series = df_clean[gcol].astype(str).fillna("Missing")
            if design is None:  # Training
                levels = sorted(group_series.unique())
                design.group_levels[gcol] = levels
            else:  # Prediction
                levels = design.group_levels.get(gcol, [])
            
            # Map to indices, handling unseen levels
            level_to_idx = {level: i for i, level in enumerate(levels)}
            indices = group_series.map(level_to_idx).fillna(-1).astype(int).values
            
            groups[gcol] = {"levels": levels, "index": indices}
            
            if verbose:
                n_levels = len(levels)
                n_unseen = (indices == -1).sum()
                print(f"[bayes] group='{gcol}' levels={n_levels}")
                if n_unseen > 0:
                    print(f"[bayes] group='{gcol}' unseen_levels={n_unseen}")
    
    return X, feat_names, y_raw, groups, design

def train_bayesian(
    df: pd.DataFrame,
    spec: FeatureSpec,
    draws: int = 1000,
    tune: int = 1000,
    target_accept: float = 0.95,
    chains: int = 4,  # Increased from 2 for better convergence
    cores: int = 4,
    group_cols: Sequence[str] = (),
    random_seed: int = 42,
    out_dir: Optional[Path] = None,
    compute_loo: bool = False,
    strict_design: bool = False,
) -> tuple[Path, az.InferenceData]:
    """
    Enhanced Bayesian training with robust preprocessing and better error handling.
    """
    # Resolve group cols early to catch issues
    group_cols = _resolve_group_cols_case_insensitive(df, group_cols)
    if len(group_cols) == 0:
        print("[bayes][warn] no valid group columns resolved; proceeding with fixed-effects only")

    debug_dir = (out_dir or (ARTIFACTS_DIR / f"bayes_hier_{spec.target}")) / "debug"
    X, feat_names, y_raw, groups, design = _prepare_design(
        df, spec, group_cols=group_cols, verbose=True, design=None,
        strict_design=strict_design, debug_dir=debug_dir
    )

    # Enhanced target standardization using robust scaling
    y_mean, y_std = design.y_mean, design.y_std
    if y_std == 0 or not np.isfinite(y_std):
        raise ValueError(f"[bayes/train] y has zero or non-finite std ({y_std}); cannot standardize.")
    
    y = (y_raw - y_mean) / y_std
    
    # Additional validation
    if not np.isfinite(y).all():
        print(f"[bayes][warn] Non-finite values in standardized y: {(~np.isfinite(y)).sum()}")
        finite_mask = np.isfinite(y)
        X = X[finite_mask]
        y = y[finite_mask]
        for gname, gobj in groups.items():
            groups[gname]["index"] = gobj["index"][finite_mask]
    
    print(f"[bayes] Final training shape: X={X.shape}, y={y.shape}")
    
    coords = {"obs": np.arange(len(y)), "feat": np.arange(X.shape[1])}
    for gname, gobj in groups.items():
        coords[gname] = np.arange(len(gobj["levels"]))

    with pm.Model(coords=coords) as model:
        # More conservative priors to prevent divergences
        beta = pm.Normal("beta", 0.0, 0.5, dims=("feat",))  # Tighter prior
        alpha = pm.Normal("alpha", 0.0, 1.0)

        # Non-centered random intercepts with more conservative priors
        mu = alpha + pm.math.dot(X, beta)
        for gname, gobj in groups.items():
            eta = pm.Normal(f"eta_{gname}", 0.0, 1.0, dims=(gname,))
            tau = pm.HalfNormal(f"tau_{gname}", 0.5)  # More conservative prior
            a_g = pm.Deterministic(f"a_{gname}", tau * eta, dims=(gname,))
            mu = mu + a_g[gobj["index"]]

        # More robust likelihood
        sigma = pm.HalfNormal("sigma", 1.0)
        nu = pm.Exponential("nu", 1/10) + 2.1  # Slightly higher nu for stability
        pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)

        # Enhanced sampling with better initialization
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            idata = pm.sample(
                draws=draws, tune=tune, target_accept=target_accept,
                chains=chains, cores=cores, random_seed=random_seed,
                return_inferencedata=True,
                idata_kwargs={"log_likelihood": compute_loo},
                init="adapt_diag",  # Better initialization
            )

    out = out_dir or (ARTIFACTS_DIR / f"bayes_hier_{spec.target}")
    out.mkdir(parents=True, exist_ok=True)

    # Save all artifacts
    az.to_netcdf(idata, out / "posterior.nc")
    (out / "feature_names.txt").write_text("\n".join(feat_names))

    import json
    (out / "config_groups.json").write_text(json.dumps({k: v["levels"] for k, v in groups.items()}, indent=2))

    # Save the design for consistent predict
    design.save(out / "design_spec.joblib")

    # Enhanced diagnostics
    summary = az.summary(idata, var_names=[v for v in idata.posterior.data_vars], round_to=4)
    summary.to_csv(out / "sampling_summary.csv")
    
    # Check convergence
    max_rhat = float(summary["r_hat"].max())
    min_ess_bulk = float(summary["ess_bulk"].min())
    min_ess_tail = float(summary["ess_tail"].min())
    
    print(f"[bayes] Convergence diagnostics:")
    print(f"  Max R-hat: {max_rhat:.4f} (should be < 1.01)")
    print(f"  Min ESS bulk: {min_ess_bulk:.0f} (should be > 400)")
    print(f"  Min ESS tail: {min_ess_tail:.0f} (should be > 400)")
    
    if max_rhat > 1.01:
        print("[bayes][WARN] Poor convergence (R-hat > 1.01). Consider:")
        print("  - Increasing draws/tune")
        print("  - Using more conservative priors")
        print("  - Checking for numerical issues in data")
    
    if min_ess_bulk < 400 or min_ess_tail < 400:
        print("[bayes][WARN] Low effective sample size. Consider:")
        print("  - Increasing draws")
        print("  - Checking for poor mixing")

    if compute_loo:
        try:
            comp = az.loo(idata)
            comp.to_dataframe().to_csv(out / "loo_summary.csv")
            print(f"[bayes] LOO-CV: {comp.elpd_loo:.2f} ± {comp.se:.2f}")
        except Exception as e:
            print(f"[bayes][warn] LOO computation failed: {e}")

    # Save debug information
    debug_info = {
        "original_features": len(spec.final_features),
        "selected_features": len(feat_names),
        "original_rows": len(df),
        "final_rows": len(y),
        "target_stats": {
            "mean": float(y_mean),
            "std": float(y_std),
            "min": float(y_raw.min()),
            "max": float(y_raw.max()),
        },
        "convergence": {
            "max_rhat": float(max_rhat),
            "min_ess_bulk": float(min_ess_bulk),
            "min_ess_tail": float(min_ess_tail),
        }
    }
    
    with open(out / "debug_info.json", "w") as f:
        json.dump(debug_info, f, indent=2)

    return out, idata

def predict_bayesian(
    df: pd.DataFrame,
    spec: FeatureSpec,
    artifact_dir: Path,
    group_cols: Sequence[str] = (),
) -> pd.Series:
    """
    Enhanced prediction with better error handling and validation.
    """
    artifact_dir = Path(artifact_dir)
    
    # Load design and validate
    design_path = artifact_dir / "design_spec.joblib"
    if not design_path.exists():
        raise FileNotFoundError(f"Design spec not found: {design_path}")
    
    design = DesignSpec.load(design_path)

    # Use the training's canonical group columns
    X, feat_names, y_raw, groups, _ = _prepare_design(
        df, spec, group_cols=list(design.group_levels.keys()), verbose=False,
        design=design, strict_design=False, debug_dir=artifact_dir / "debug"
    )

    # Load posterior
    posterior_path = artifact_dir / "posterior.nc"
    if not posterior_path.exists():
        raise FileNotFoundError(f"Posterior not found: {posterior_path}")
    
    idata = az.from_netcdf(posterior_path)
    post = idata.posterior
    
    # Extract parameters
    beta = post["beta"].stack(sample=("chain", "draw")).values
    alpha = post["alpha"].stack(sample=("chain", "draw")).values
    
    # Validate dimensions
    if X.shape[1] != beta.shape[0]:
        raise ValueError(f"Feature dimension mismatch: X has {X.shape[1]} features, model expects {beta.shape[0]}")
    
    # Compute predictions
    mu = X @ beta + alpha

    # Add group effects with safe indexing
    for gname, gobj in groups.items():
        if f"a_{gname}" in post:
            a_g = post[f"a_{gname}"].stack(sample=("chain", "draw")).values
            idx = gobj["index"]
            
            # Safe indexing for group effects
            eff = np.zeros((len(idx), a_g.shape[1]))
            mask = (idx >= 0) & (idx < len(a_g))
            if mask.any():
                eff[mask, :] = a_g[idx[mask], :]
            mu += eff
        else:
            print(f"[bayes][warn] Group effect '{gname}' not found in posterior")

    # Convert back to original scale
    pred_std_mean = mu.mean(axis=1)
    y_mean, y_std = design.y_mean, design.y_std
    pred = pred_std_mean * y_std + y_mean
    
    # Validation
    if not np.isfinite(pred).all():
        print(f"[bayes][warn] Non-finite predictions: {(~np.isfinite(pred)).sum()}")
        pred = np.where(np.isfinite(pred), pred, y_mean)  # Replace with mean
    
    return pd.Series(pred, index=df.index, name=f"pred_{spec.target}")

# DEBUGGING AND VALIDATION FUNCTIONS

def debug_env_report():
    """Print versions & environment facts helpful for Bayesian runs."""
    import sys
    import sklearn
    print("[env] python:", sys.version.replace("\n", " "))
    print("[env] pymc:", pm.__version__)
    print("[env] arviz:", az.__version__)
    print("[env] numpy:", np.__version__)
    print("[env] sklearn:", sklearn.__version__)
    
    try:
        import shutil
        has_gpp = shutil.which("g++") is not None
        print(f"[env] g++ available: {has_gpp}")
    except Exception as e:
        print(f"[env] g++ check failed: {e} (non-fatal)")

def validate_data_before_training(df: pd.DataFrame, target: str) -> dict:
    """Comprehensive data validation before training"""
    print("=== DATA VALIDATION BEFORE TRAINING ===")
    
    issues = {}
    
    # Check target variable
    if target not in df.columns:
        issues['missing_target'] = f"Target '{target}' not in columns"
        return issues
    
    target_stats = df[target].describe()
    print(f"Target '{target}' statistics:")
    print(target_stats)
    
    # Check for concerning patterns
    if df[target].isnull().sum() > len(df) * 0.1:
        issues['high_target_missing'] = f"Target has {df[target].isnull().sum()} missing values"
    
    if df[target].std() == 0:
        issues['zero_variance_target'] = "Target has zero variance"
    
    # Check extreme outliers
    q99 = df[target].quantile(0.99)
    extreme_outliers = (df[target] > q99 * 10).sum()
    if extreme_outliers > 0:
        issues['extreme_outliers'] = f"{extreme_outliers} extreme outliers (>10x Q99)"
    
    # Check feature matrix
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 100:
        issues['too_many_features'] = f"High feature count: {len(numeric_cols)} numeric columns"
    
    # Check missing data patterns
    high_missing = (df.isnull().sum() / len(df) > 0.5).sum()
    if high_missing > 10:
        issues['high_missing_features'] = f"{high_missing} features with >50% missing data"
    
    if len(issues) == 0:
        print("✓ No major data issues detected")
    else:
        print("⚠️ Issues detected:")
        for issue, desc in issues.items():
            print(f"  {issue}: {desc}")
    
    return issues

def run_bayes_realdata_smoke() -> dict:
    """Enhanced smoke test with comprehensive validation and debugging"""
    from api.src.ml.config import validate_configuration, get_master_parquet_path
    from api.src.ml import config as _cfg

    debug_env_report()

    # 0) Validate wiring
    ok = validate_configuration()
    if not ok:
        raise RuntimeError("[smoke] Configuration validation failed.")
    print("✓ Configuration validated")

    # 1) Dataset preflight
    data_path = get_master_parquet_path()
    alt_path = _cfg.FINAL_ENGINEERED_DATASET_DIR / "final_merged_with_all.parquet"
    path_to_use = data_path if data_path.exists() else (alt_path if alt_path.exists() else None)
    if path_to_use is None:
        raise FileNotFoundError(
            f"[smoke] Real dataset not found at {data_path} or {alt_path}. "
            "Generate engineered parquet before smoke."
        )
    print(f"✓ Real dataset exists → {path_to_use}")

    # 2) Load and validate data
    import pandas as pd
    df = pd.read_parquet(path_to_use)
    print(f"[smoke] Loaded dataset: {df.shape}")

    # Validate data quality
    issues = validate_data_before_training(df, "AAV")
    if len(issues) > 3:  # Too many issues
        print("⚠️ Too many data quality issues, proceeding with fixes...")

    # 3) Build conservative Bayesian config
    from api.src.ml.config import TrainingConfig
    bayes_cfg = TrainingConfig(
        model_family="bayes_hier",
        target="AAV",
        max_train_rows=3000,  # Even smaller for smoke test
        random_state=42,
        bayes_draws=1000,     # Reduced for faster smoke
        bayes_tune=1000,
        bayes_target_accept=0.9,
        bayes_chains=2,       # Keep at 2 for speed, but will warn
        bayes_cores=2,
        bayes_group_cols=("POSITION", "SEASON", "PLAYER_ID", "TEAM_ID"),  # Fixed case
    )
    print(f"[smoke] using cfg: {bayes_cfg}")

    # 4) Run training with proper error handling
    from api.src.ml.train import train as _train

    print("[smoke] Running Bayesian training...")
    try:
        bayes_artifact = _train(bayes_cfg)
        print(f"✓ Bayesian smoke completed → {bayes_artifact}")
        
        # Validate output
        if isinstance(bayes_artifact, Path) and bayes_artifact.exists():
            print("✓ Artifact file exists")
        else:
            print(f"⚠️ Artifact validation failed: {bayes_artifact}")
            
        return {"bayes_hier_posterior": str(bayes_artifact)}
        
    except Exception as e:
        print(f"💥 Bayesian training failed: {e}")
        import traceback
        traceback.print_exc()
        raise

if __name__ == "__main__":
    print("=== ENHANCED BAYESIAN REAL-DATA SMOKE ===")
    try:
        out = run_bayes_realdata_smoke()
        print("\n=== SMOKE SUMMARY ===")
        for k, v in out.items():
            print(f"{k}: {v}")
        print("✅ Enhanced Bayesian smoke finished.")
    except Exception as e:
        import traceback
        print(f"💥 Enhanced Bayesian smoke failed: {e}")
        traceback.print_exc()
        raise SystemExit(1)

=== ENHANCED BAYESIAN REAL-DATA SMOKE ===
[env] python: 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]
[env] pymc: 5.25.1
[env] arviz: 0.22.0
[env] numpy: 1.26.4
[env] sklearn: 1.5.2
[env] g++ available: True
✅ TrainingConfig class defined correctly
✅ DEFAULTS instance defined correctly
✅ get_master_parquet_path() returns: api/src/airflow_project/data/merged_final_dataset/final_merged_with_all.parquet
✅ Stacking params generated: 7 keys

✅ All configuration components validated successfully!
✓ Configuration validated
✓ Real dataset exists → api/src/airflow_project/data/merged_final_dataset/final_merged_with_all.parquet
[smoke] Loaded dataset: (1955, 346)
=== DATA VALIDATION BEFORE TRAINING ===
Target 'AAV' statistics:
count    1.955000e+03
mean     5.838700e+06
std      8.983560e+06
min      3.197000e+03
25%      8.983100e+05
50%      2.328652e+06
75%      6.525322e+06
max      5.707873e+07
Name: AAV, dtype: float64
⚠️ Issues detected:
  too_many_features: High feature count: 296 n

Initializing NUTS using jitter+adapt_diag...
INFO:pymc.sampling.mcmc:Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
INFO:pymc.sampling.mcmc:Multiprocess sampling (2 chains in 2 jobs)
NUTS: [beta, alpha, sigma, a_POSITION, a_SEASON, a_PLAYER_ID, a_TEAM_ID]
INFO:pymc.sampling.mcmc:NUTS: [beta, alpha, sigma, a_POSITION, a_SEASON, a_PLAYER_ID, a_TEAM_ID]


Output()

Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 19 seconds.
INFO:pymc.sampling.mcmc:Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 19 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
INFO:pymc.stats.convergence:We recommend running at least 4 chains for robust computation of convergence diagnostics
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
INFO:pymc.stats.convergence:The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details


✓ Bayesian posterior saved → api/src/ml/data/ml_artifacts/bayes_hier_AAV



'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.



Bayesian holdout: RMSE=9795288.3980  MAE=5660712.3244  R2=-0.5011
✓ Bayesian smoke completed → api/src/ml/data/ml_artifacts/bayes_hier_AAV/posterior.nc
✓ Artifact file exists

=== SMOKE SUMMARY ===
bayes_hier_posterior: api/src/ml/data/ml_artifacts/bayes_hier_AAV/posterior.nc
✅ Enhanced Bayesian smoke finished.
