---
# 1 - IMPORTS

### 1.1 - SETUP PROJECT

In [1]:
# Centralized setup
import sys
from pathlib import Path

# Make sure PROJECT_PATH is on sys.path
PROJECT_ROOT = Path.cwd().resolve().parent
PROJECT_PATH = PROJECT_ROOT / "src" / "project"

if str(PROJECT_PATH) not in sys.path:
    sys.path.insert(0, str(PROJECT_PATH))

# Centralized import
from imports import *

Imports ready: pd, np, sns, plt, joblib, sklearn, etc.
PROJECT_ROOT: /Users/enricovaccari/Desktop/ENRICO/05_LEARNING/University/ToU/Phases/02_Calibration_Phase/Applied_Machine_Learning/Regression/beyond-grades-ml-project


Here I reload the X sets, which now include the newly engineered features, and the y sets, which are unchanged since the splitting stage.

---
# 2 - PREPROCESSING FUNCTIONS

Before proceeding to the modeling stage, it is crucial to ensure that all features are properly prepared.  
- **Categorical variables** (like `gender`) must be converted into numerical representations (e.g., through one-hot encoding or similar techniques).  
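- **Numerical features** (like `StudyTimeWeekly`) should generally be standardized to a common scale, since scale-sensitive models (e.g., regularized linear models) tend to perform better on standardized inputs.  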


### 2.1 - PIPELINE

> **Note:**  
Although I typically store my functions in dedicated `.py` files within the `/src` directory, the following function is essential for the next steps of this notebook. For convenience and clarity, I define it directly here, but you can also find it in `../src/data/preprocessing.py`.

In [3]:
# Explicit imports so this cell does not rely on the wildcard import above
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def create_pipeline(numeric_features, categorical_features, k_best=None, model=None):
    """
    Preprocess:
      - numeric: median impute + StandardScaler
      - categorical: most_frequent impute + OneHotEncoder(ignore unknown)
    Then (optional) SelectKBest(f_regression, k=k_best), then model.
    CV-safe: everything is refit on each fold.
    """
    if model is None:
        model = LinearRegression()

    # Define transformers for preprocessing
    numerical_transformer = StandardScaler()
    # One-hot encode categorical features; categories unseen during fit become all zeros
    categorical_transformer = OneHotEncoder(handle_unknown='ignore')

    # Define imputers
    numerical_imputer = SimpleImputer(strategy='median')  # Median, since the numeric features are not normally distributed
    categorical_imputer = SimpleImputer(strategy='most_frequent')  # Most frequent category

    # Build preprocessor
    num_pipe = Pipeline([
        ("imputer", numerical_imputer),
        ("scaler", numerical_transformer),
    ])
    cat_pipe = Pipeline([
        ("imputer", categorical_imputer),
        ("ohe", categorical_transformer),
    ])
    preprocessor = ColumnTransformer([
        ("num", num_pipe, numeric_features),
        ("cat", cat_pipe, categorical_features),
    ])

    selector = SelectKBest(score_func=f_regression, k=k_best) if k_best is not None else "passthrough"

    pipe = Pipeline([
        ("preprocessor", preprocessor),
        ("select", selector),
        ("model", model),
    ])
    return pipe

In [4]:
def get_feature_names_after_preprocess(pipe, numeric_features, categorical_features):
    """
    Column names after preprocessing (original numeric + expanded OHE categorical).
    Call AFTER pipe.fit(...).
    """
    pre = pipe.named_steps["preprocessor"]
    ohe = pre.named_transformers_["cat"].named_steps["ohe"]
    cat_names = ohe.get_feature_names_out(categorical_features)
    return list(numeric_features) + list(cat_names)

In [5]:
def get_selected_features(pipe, numeric_features, categorical_features):
    """
    Returns (all_names, scores, mask, selected_names) after fitting.
    If k_best=None, mask is all True and scores are NaN.
    """
    all_names = get_feature_names_after_preprocess(pipe, numeric_features, categorical_features)
    selector = pipe.named_steps.get("select", None)

    if selector in (None, "passthrough"):
        scores = np.full(len(all_names), np.nan)
        mask = np.ones(len(all_names), dtype=bool)
        selected_names = all_names
        return all_names, scores, mask, selected_names

    scores = selector.scores_
    mask = selector.get_support()
    selected_names = [n for n, keep in zip(all_names, mask) if keep]
    return all_names, scores, mask, selected_names

**How I Built My Preprocessing & Modeling Pipeline**

My pipeline is designed to ensure robust, reproducible, and leakage-free preprocessing and modeling. Here’s how it works and why:

**Pipeline Construction**

- **Numeric features** (`numeric_features`):  
    - Imputed with the median (to handle missing values robustly).
    - Standardized using `StandardScaler` (zero mean, unit variance).

- **Categorical features** (`categorical_features`):  
    - Imputed with the most frequent value.
    - One-hot encoded (`OneHotEncoder(handle_unknown='ignore')`).

- **Feature selection**:  
    - Optionally, I use `SelectKBest(f_regression, k=k_best)` to keep only the most predictive features.
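    - Domain-relevant features (e.g., `EngagementIndex`) can also be force-kept even when the univariate test does not select them. Note that `create_pipeline` as defined above does not do this, so it requires an extra step.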

- **Model**:  
    - The pipeline can end with any estimator (e.g., `LinearRegression`, `Ridge`, etc.).

All these steps are combined using `Pipeline` and `ColumnTransformer` for clean, modular, and scikit-learn-compatible workflows.
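
The force-keep idea from the feature-selection bullet is not part of `create_pipeline` as defined above. One possible way to get that behavior, shown here as a minimal sketch on synthetic data, is to replace the plain `SelectKBest` step with a second `ColumnTransformer` that passes the forced columns through and applies `SelectKBest` only to the remaining candidates. The names `forced_idx`, `candidate_idx`, and `select_with_forced` are hypothetical, and column 0 merely stands in for a feature such as `EngagementIndex`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: column 0 plays the role of a domain feature to force-keep,
# while only columns 1 and 2 actually drive the target
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

forced_idx = [0]                 # always kept, regardless of its F-score
candidate_idx = [1, 2, 3, 4, 5]  # compete in the univariate selection

# Selection step: pass the forced column through, run SelectKBest on the rest
select_with_forced = ColumnTransformer([
    ("forced", "passthrough", forced_idx),
    ("kbest", SelectKBest(score_func=f_regression, k=2), candidate_idx),
])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", select_with_forced),
    ("model", LinearRegression()),
])
pipe.fit(X, y)

# Mask over candidate_idx showing which candidates passed the univariate test
kbest = pipe.named_steps["select"].named_transformers_["kbest"]
kept_candidates = [i for i, keep in zip(candidate_idx, kbest.get_support()) if keep]
n_model_inputs = pipe.named_steps["model"].coef_.shape[0]

print("kept candidates:", kept_candidates)
print("model inputs (forced + selected):", n_model_inputs)
```

In the real pipeline this step would sit after `preprocessor`, where column positions depend on how many columns the one-hot encoder produces, so the integer indices would have to be derived from the preprocessed feature names rather than hard-coded as in this sketch.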

### My Strategy: Cross-Validation & Final Model

#### 🔄 What does cross-validation do?

- Splits the training set into folds (e.g., 5 parts).
- For each fold:
    - Fits preprocessing, feature selection, and model on 4/5 of the data.
    - Evaluates on the remaining 1/5.
- Averages the per-fold metric (e.g., RMSE, R²), which estimates how well the whole process generalizes.
- **Note:** Each fold may select slightly different features and fit a slightly different model.
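
The fold-by-fold behavior described above can be sketched with scikit-learn's `cross_validate`. This is a minimal illustration on synthetic data with a simplified numeric-only pipeline, not the project's actual setup; the values `k=3` and `alpha=1.0` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic regression data: 2 informative features out of 8
X = rng.normal(size=(150, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=1.0, size=150)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=3)),
    ("model", Ridge(alpha=1.0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    pipe, X, y, cv=cv,
    scoring="neg_root_mean_squared_error",
    return_estimator=True,  # keep the fitted pipeline from each fold
)

# The scorer returns negative RMSE, so flip the sign
rmse_per_fold = -results["test_score"]
print("RMSE per fold:", np.round(rmse_per_fold, 3))
print("Mean RMSE:", round(float(rmse_per_fold.mean()), 3))

# Each fold ran its own scaling and feature selection on its own training part
for i, fitted in enumerate(results["estimator"]):
    selected = np.flatnonzero(fitted.named_steps["select"].get_support())
    print(f"fold {i}: selected columns {selected.tolist()}")
```

Because the scaler and selector live inside the pipeline, each fold fits them only on its training part, which is what keeps the validation part out of the preprocessing.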

#### 🚩 Why refit on the full training set?

- Once I’ve chosen the best pipeline and parameters (e.g., `k_best=12`, `Ridge(alpha=1.0)`), I refit the entire pipeline on 100% of the training data.
- This uses all available training information, which typically gives more stable parameter estimates than any single fold's model.
- Feature selection is now based on the whole training set, giving a definitive list of features.
- This final model is what I apply to the hold-out test set (or future data).

#### ✅ Standard Workflow

1. **Cross-validation**
     - Choose hyperparameters, model, number of features.
     - Get a realistic estimate of performance.

2. **Refit on full training set**
     - Build the final, unique model with those parameters.
     - Get the definitive feature list.

3. **Final test**
     - Evaluate this model on the never-seen test set. The fitted pipeline applies the imputation, scaling, and encoding learned on the training data, so the test set is only transformed, never refit.
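
The three steps above can be sketched end to end with `GridSearchCV`, whose `refit=True` option refits the best configuration on the full training set after cross-validation. This is a minimal sketch on synthetic data with a simplified numeric-only pipeline; the parameter grid and the data are assumptions made for illustration, not the project's actual search space.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic data: 3 informative features out of 10
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 4] + 0.5 * X[:, 7] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),
])

# 1. Cross-validation: choose the number of features and the regularization
param_grid = {"select__k": [2, 3, 5], "model__alpha": [0.1, 1.0, 10.0]}
search = GridSearchCV(
    pipe, param_grid, cv=5,
    scoring="neg_root_mean_squared_error",
    refit=True,  # 2. refit the best pipeline on the full training set
)
search.fit(X_train, y_train)

final_model = search.best_estimator_
final_features = np.flatnonzero(final_model.named_steps["select"].get_support())
print("best params:", search.best_params_)
print("CV RMSE:", round(float(-search.best_score_), 3))
print("definitive feature columns:", final_features.tolist())

# 3. Final test: predict() only transforms the test set, nothing is refit
y_pred = final_model.predict(X_test)
test_rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
test_r2 = r2_score(y_test, y_pred)
print("test RMSE:", round(test_rmse, 3), "| test R2:", round(test_r2, 3))
```

The test set is touched exactly once, after every modeling choice has been made, so the reported test score is not influenced by the hyperparameter search.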

#### 🌱 In short

Refitting on the full training set:
- Maximizes use of available data → more robust model.
- Freezes the final pipeline (selected features, scaling, encoding, parameters).
- Prepares the model for production or final test evaluation.

This approach keeps the test set out of every modeling decision and leaves me with a single fitted pipeline that is ready for real-world application.