# 3. Modeling & Final Training

_In this notebook we:_  
1. Load the fully processed dataset from `data/processed/historical_proc.csv`.  
2. Verify that the target column (`target_up`) exists.  
3. Impute any remaining missing values.  
4. Define and tune our classifiers and regressors.  
5. Use TimeSeriesSplit for honest CV.  
6. Select the best classifier model.  
7. Retrain that classifier on 100% of the data.  
8. Save the final model for deployment.

### 1. Parameters & Papermill Setup
  1. Pre-declare the variables that later cells assign (done at the end of Cell 2) so VS Code/Pylance stops complaining about `… is not defined`.  
  2. Set `DATA_PROCESSED` and `MODEL_DIR` as defaults that Papermill can override.  
  3. Default `USE_GPU = False` so that no GPU code path is activated on a CPU-only CI runner.

In [None]:
# Cell 1 – Parameters for Papermill & runtime

DATA_PROCESSED = "data/processed"   # Papermill will replace if passed -p
MODEL_DIR      = "src/models"       # Papermill will replace if passed -p

# By default off in CI; set CI_USE_GPU=true in your workflow if you have a GPU runner
USE_GPU = False

### 2. Imports, Styling & GPU Flag

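- `multiprocessing.set_start_method('fork', force=True)` to avoid the `resource_tracker` warnings when using `n_jobs>1` (the `fork` start method exists only on POSIX systems, so this line fails on Windows).  
- Suppress noisy warnings whose message contains `not used` (unused-parameter notices).  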
- `%load_ext autotime` gives you per-cell timing.  
- Re-evaluate `USE_GPU` from the `CI_USE_GPU` environment variable.
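
The environment-variable check in the last bullet can be sketched as a small helper. This is only an illustration: `env_flag` is a hypothetical name, not something the notebook defines, and the set of truthy strings mirrors the one used in Cell 2.

```python
import os

def env_flag(name, default=False):
    """Return True when the environment variable holds a truthy string."""
    raw = os.environ.get(name)
    if raw is None:
        return default                      # variable unset: keep the default
    return raw.strip().lower() in ("1", "true", "yes")

# Hypothetical usage: the flag is case-insensitive and falls back to the default.
os.environ["CI_USE_GPU"] = "TRUE"
print(env_flag("CI_USE_GPU"))               # True
os.environ.pop("CI_USE_GPU")
print(env_flag("CI_USE_GPU", False))        # False
```

Keeping the parsing in one function makes it easy to reuse the same rule for other CI switches.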

In [None]:
# Cell 2 – Imports & GPU flag

import multiprocessing as mp
mp.set_start_method('fork', force=True)

import os, warnings
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns

from sklearn.model_selection import (
    RandomizedSearchCV, GridSearchCV, TimeSeriesSplit,
    ParameterGrid, cross_val_score
)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import (
    RandomForestClassifier, StackingClassifier,
    RandomForestRegressor
)
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    mean_squared_error, precision_recall_curve
)
import xgboost as xgb
from lightgbm import LGBMClassifier, LGBMRegressor
import joblib

# Suppress scikit-learn deprecation / unused-param warnings
warnings.filterwarnings("ignore", message=".*not used.*")

%matplotlib inline
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

# Autotime: prints each cell’s run time
%load_ext autotime

# Keep the Cell 1 / Papermill value of USE_GPU, and also switch it on
# if the CI runner sets CI_USE_GPU=true
USE_GPU = bool(USE_GPU) or os.environ.get("CI_USE_GPU", "false").lower() in ("1","true","yes")
print(f"→ USE_GPU = {USE_GPU}")

# ── Pre-declare so Pylance knows they exist ───────────────────────
best_lr = best_rf = best_xgb = best_lgb = None
stack_clf = None
feature_cols = []
imputer = None
X_train_imp = None
X_test_imp = None
keep_cols = []
best_regr = {}
# ────────────────────────────────────────────────────────────────

### 3. Load Features & Build Targets

- Read in the processed features CSV.  
- Sort by date, shift up to create “next-day” columns.  
- Define three targets:  
  - `target_up`: binary up/down tomorrow  
  - `target_high`, `target_low`: regression targets.  
- Drop the final row (it has no “next-day” values).
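
To see what the shift does, here is a minimal sketch on a four-row toy frame. The prices are made up for illustration and the column names follow the ones listed above.

```python
import pandas as pd

# Toy prices (made-up values, for illustration only)
toy = pd.DataFrame({
    "date":  pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "close": [10.0, 11.0, 10.5, 12.0],
    "high":  [10.5, 11.5, 11.0, 12.5],
})
toy = toy.sort_values("date").reset_index(drop=True)

# shift(-1) pulls tomorrow's value onto today's row
toy["close_next"]  = toy["close"].shift(-1)
toy["target_high"] = toy["high"].shift(-1)
toy["target_up"]   = (toy["close_next"] > toy["close"]).astype(int)

# The last row compares against NaN, which evaluates to False and is stored as 0.
last_label_before_drop = int(toy["target_up"].iloc[-1])

# Dropping rows without a next-day value removes that spurious label.
toy = toy.dropna(subset=["close_next", "target_high"])
print(toy[["date", "close", "close_next", "target_up", "target_high"]])
```

The final row would otherwise be labelled "down" without any evidence, which is why it has to be dropped rather than kept.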

In [None]:
# Cell 3 – Load data and build targets

print(f"DEBUG: {DATA_PROCESSED} contains:", os.listdir(DATA_PROCESSED))

df = pd.read_csv(
    os.path.join(DATA_PROCESSED, "historical_proc.csv"),
    parse_dates=["date"]
)
df.sort_values("date", inplace=True)

# Create next-day columns
df["close_next"] = df["close"].shift(-1)
df["high_next"]  = df["high"].shift(-1)
df["low_next"]   = df["low"].shift(-1)

# Define targets
df["target_up"]   = (df["close_next"] > df["close"]).astype(int)
df["target_high"] = df["high_next"]
df["target_low"]  = df["low_next"]

# Drop the last row (NaNs in targets)
df.dropna(subset=["target_up","target_high","target_low"], inplace=True)

### 4. Train / Test Split

- Select feature columns (`_std`, `_mm`, `PC*`, and date parts).  
- 75% of data for training, 25% for testing, preserving the time order.
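
A chronological split can be checked with a quick sketch like the one below. `chrono_split` is a hypothetical helper and the toy frame is invented; the point is only that no shuffling happens.

```python
import pandas as pd

def chrono_split(frame, train_frac=0.75):
    """Split a date-sorted frame into (train, test) without shuffling."""
    cut = int(len(frame) * train_frac)
    return frame.iloc[:cut], frame.iloc[cut:]

toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "x": range(8),
})
train, test = chrono_split(toy)

# Every training date precedes every test date, so the test set is truly "future".
assert train["date"].max() < test["date"].min()
print(len(train), len(test))   # 6 2
```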

In [None]:
# Cell 4 – Train/Test split

feature_cols = [
    c for c in df.columns
    if c.endswith(("_std","_mm"))    # scaled numeric
    or c.startswith("PC")            # PCA components
    or c in ["year","month","day","weekday"]
]

X      = df[feature_cols]
y_up   = df["target_up"]
y_high = df["target_high"]
y_low  = df["target_low"]

split = int(len(df) * 0.75)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_up_train, y_up_test       = y_up.iloc[:split], y_up.iloc[split:]
y_high_train, y_high_test   = y_high.iloc[:split], y_high.iloc[split:]
y_low_train,  y_low_test    = y_low.iloc[:split],  y_low.iloc[split:]

print(f"Train/Test size → {X_train.shape[0]} / {X_test.shape[0]}")

### 5. Impute Missing Values

- Use a `SimpleImputer(strategy="mean")`.  
- **Preserve** original DataFrame indices so later boolean masking stays aligned.
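
As a small illustration with made-up values: the imputer learns its fill value from the training rows only, and wrapping the result back into a DataFrame keeps the original index labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with non-default index labels (illustrative values only)
train = pd.DataFrame({"f": [1.0, np.nan, 3.0]}, index=[10, 11, 12])
test  = pd.DataFrame({"f": [np.nan, 5.0]},      index=[13, 14])

imp = SimpleImputer(strategy="mean")
train_imp = pd.DataFrame(imp.fit_transform(train), columns=train.columns, index=train.index)
test_imp  = pd.DataFrame(imp.transform(test),      columns=test.columns,  index=test.index)

# Row 13 is filled with 2.0, the mean of the observed training values (1.0 and 3.0).
print(test_imp)
```

Fitting on the training rows alone keeps test-set statistics out of the features.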

In [None]:
%%time
# Cell 5 – Impute missing values
imputer = SimpleImputer(strategy="mean")

X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train),
    columns=feature_cols,
    index=X_train.index
)
X_test_imp  = pd.DataFrame(
    imputer.transform(X_test),
    columns=feature_cols,
    index=X_test.index
)

### 6. Initialize classifiers & parameter grids

Here we:

1. Instantiate each model once (LR and RF always on CPU).  
2. For XGBoost, we enable GPU training only when `USE_GPU` is true; otherwise we fall back to histogram-based training (`"hist"`) on CPU.  
3. For LightGBM, we add the `device="gpu"`, `gpu_platform_id`, and `gpu_device_id` keys only when `USE_GPU` is true.  
4. Finally we define small hyperparameter grids for each model.
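
The GPU guard can be kept in two small functions, sketched below in plain Python so it runs without either library installed. The function names are hypothetical. The XGBoost keys follow the `device` keyword that Cell 11 passes to the XGBoost regressor, and the LightGBM keys are the ones named above.

```python
def xgb_device_params(use_gpu):
    """Device-related keywords for an XGBoost estimator (style used in Cell 11)."""
    return {"tree_method": "hist", "device": "cuda" if use_gpu else "cpu"}

def lgb_device_params(use_gpu):
    """Extra LightGBM keywords; empty on CPU so no GPU code path is touched."""
    if not use_gpu:
        return {}
    return {"device": "gpu", "gpu_platform_id": 0, "gpu_device_id": 0}

print(xgb_device_params(False))   # {'tree_method': 'hist', 'device': 'cpu'}
print(lgb_device_params(False))   # {}
```

Each dictionary can be merged into the model keywords, for example `dict(random_state=42, **lgb_device_params(USE_GPU))`.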

In [None]:
# Cell 6 – Initialize classifiers & parameter grids

# 0) Make sure we have our 3-fold TimeSeriesSplit object ready
tscv = TimeSeriesSplit(n_splits=3)

# 1) Logistic Regression (CPU only)
lr_clf = LogisticRegression(
    class_weight="balanced",
    max_iter=2000,
    random_state=42
)

# 2) Random Forest (CPU only)
rf_clf = RandomForestClassifier(
    class_weight="balanced",
    n_jobs=1,
    random_state=42
)

# 3) XGBoost Classifier
#   - XGBoost >= 2.0 selects the hardware with `device`; "gpu_hist" and
#     `gpu_id` are deprecated there. This matches the regressor in Cell 11.
xgb_params = {
    "tree_method": "hist",
    "device": "cuda" if USE_GPU else "cpu",
    "eval_metric": "logloss",
    "random_state": 42
}
xgb_clf = xgb.XGBClassifier(**xgb_params)

# 4) LightGBM Classifier
#   - Only pass GPU args if USE_GPU
lgb_params = dict(
    n_estimators=100,
    max_depth=7,
    learning_rate=0.05,
    random_state=42
)
if USE_GPU:
    lgb_params.update(
        device="gpu",
        gpu_platform_id=0,
        gpu_device_id=0
    )
lgb_clf = LGBMClassifier(**lgb_params)

# 5) Hyperparameter grids for tuning
param_lr  = {"C": [0.1, 1]}
param_rf  = {"n_estimators": [50, 100], "max_depth": [5, 10]}
param_xgb = {"n_estimators": [100, 200], "max_depth": [3, 5], "learning_rate": [0.05, 0.1]}
param_lgb = {"n_estimators": [100, 200], "max_depth": [5, 7], "learning_rate": [0.05, 0.1]}

print("✅ Cell 6 complete: classifiers & grids defined.")

### 7. Fast Hyperparameter Tuning

- A helper `tune_fast` that:  
  1. Sub-samples the training rows (fraction `frac`) when tuning the Random Forest  
  2. Chooses `GridSearchCV` vs. `RandomizedSearchCV`  
  3. Uses a thread pool (`parallel_backend('threading')`) to avoid resource-tracker issues
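
The grid-versus-random decision in step 2 comes down to counting the combinations in the grid. A minimal sketch, where `pick_search` is a hypothetical helper and the budget of 3 mirrors the `max_iter` default used below:

```python
from sklearn.model_selection import ParameterGrid

def pick_search(params, max_iter=3):
    """Return 'grid' when the whole grid fits in the budget, else 'random'."""
    n_combos = len(ParameterGrid(params))
    return "grid" if n_combos <= max_iter else "random"

small = {"C": [0.1, 1]}                                  # 2 combinations
large = {"n_estimators": [100, 200],
         "max_depth": [3, 5],
         "learning_rate": [0.05, 0.1]}                   # 2 * 2 * 2 = 8 combinations

print(pick_search(small))   # grid
print(pick_search(large))   # random
```

With a budget of 3, only the Logistic Regression grid is searched exhaustively; the larger grids are sampled.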

In [None]:
%%time
# Cell 7 – Fast hyperparameter tuning
from datetime import datetime
from joblib import parallel_backend

def tune_fast(model, params, X, y, name, max_iter=3, frac=0.3):
    print(f"\n▶ Starting tuning for {name} at {datetime.now().strftime('%H:%M:%S')}")
    start = datetime.now()

    # subsample for RandomForest only; sort the sampled positions so the rows
    # stay in chronological order for TimeSeriesSplit
    if isinstance(model, RandomForestClassifier):
        n = int(len(X) * frac)
        idx = np.sort(np.random.RandomState(42).choice(len(X), size=n, replace=False))
        Xt, yt = X.iloc[idx], y.iloc[idx]
    else:
        Xt, yt = X, y

    # choose Grid vs Random
    total = len(list(ParameterGrid(params)))
    if total <= max_iter:
        search = GridSearchCV(
            model, params,
            cv=tscv,
            scoring="accuracy",
            n_jobs=1,
            verbose=2
        )
    else:
        search = RandomizedSearchCV(
            model, params,
            n_iter=max_iter,
            cv=tscv,
            scoring="accuracy",
            n_jobs=1,
            random_state=42,
            verbose=2
        )

    # fit using threading to avoid ResourceTracker errors
    with parallel_backend('threading'):
        search.fit(Xt, yt)

    # report
    elapsed = (datetime.now() - start).total_seconds()
    print(f"✔ Finished {name} in {elapsed:.1f}s")
    print(f"{name} best params: {search.best_params_}")
    print(f"{name} CV acc: {search.best_score_:.4f}\n")
    return search.best_estimator_

best_lr  = tune_fast(lr_clf,  param_lr,  X_train_imp, y_up_train, "Logistic")
best_rf  = tune_fast(rf_clf,  param_rf,  X_train_imp, y_up_train, "RandomForest")
best_xgb = tune_fast(xgb_clf, param_xgb, X_train_imp, y_up_train, "XGBoost")
best_lgb = tune_fast(lgb_clf, param_lgb, X_train_imp, y_up_train, "LightGBM")

### 8. Build & Train Prefitted Stacking

- Stack the four tuned CPU/GPU-safe base models.  
- Use a `LogisticRegression` meta-learner.  
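- `cv="prefit"` (scikit-learn ≥ 1.1) because the base models are already trained, so only the meta-learner is fitted here. The meta-learner then sees base-model predictions on rows those models were trained on, which tends to be optimistic.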

In [None]:
# Cell 8 – Prefitted stacking ensemble

from datetime import datetime

print(f"[{datetime.now():%H:%M:%S}] Building stacking ensemble")
stack_clf = StackingClassifier(
    estimators=[
        ("lr",  best_lr),
        ("rf",  best_rf),
        ("xgb", best_xgb),
        ("lgb", best_lgb),
    ],
    final_estimator=LogisticRegression(max_iter=2000),
    cv="prefit",
    n_jobs=1
)
stack_clf.fit(X_train_imp, y_up_train)
print("✅ Stacking complete")

### 9. Evaluate on Test Set

- For each model, compute accuracy, print a classification report, and show a confusion matrix heatmap.

In [None]:
# Cell 9 – Test-set evaluation

for name, mdl in [
    ("Logistic", best_lr),
    ("RandomForest", best_rf),
    ("XGBoost", best_xgb),
    ("LightGBM", best_lgb),
    ("Stacking", stack_clf)
]:
    y_pred = mdl.predict(X_test_imp)
    acc    = accuracy_score(y_up_test, y_pred)
    print(f"\n► {name} Accuracy: {acc:.4f}")
    print(classification_report(y_up_test, y_pred))
    sns.heatmap(confusion_matrix(y_up_test, y_pred), annot=True, fmt="d")
    plt.title(name)
    plt.show()

### 10. Precision–Recall Threshold Optimization

- Compute the precision–recall curve on the stacking model’s probabilities.  
- Pick the threshold that maximizes F₁, then re-evaluate at that threshold. The threshold is chosen on the test set itself, so the scores at that threshold are optimistic.
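
The threshold search can be sketched on four toy scores. `best_f1_threshold` is a hypothetical helper; it trims the final precision/recall point, which has no matching threshold, and avoids dividing by zero.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return (threshold, F1) for the threshold with the highest F1."""
    prec, rec, thr = precision_recall_curve(y_true, scores)
    # prec and rec have one more entry than thr; drop the threshold-less point
    prec, rec = prec[:-1], rec[:-1]
    denom = prec + rec
    f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thr[best]), float(f1[best])

# Toy labels and scores (illustrative only)
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8])
thr_opt, f1_opt = best_f1_threshold(y_toy, s_toy)
print(thr_opt, f1_opt)
```

At a threshold of 0.35 the toy model finds both positives and raises one false alarm, giving precision 2/3, recall 1 and F₁ = 0.8.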

In [None]:
# Cell 10 – PR-curve threshold optimization

probs = stack_clf.predict_proba(X_test_imp)[:,1]
prec, rec, thr = precision_recall_curve(y_up_test, probs)

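# prec/rec have one more entry than thr, so drop the final (threshold-less) point;
# guard against 0/0 where precision and recall are both zero
denom = prec[:-1] + rec[:-1]
f1 = np.divide(2 * prec[:-1] * rec[:-1], denom, out=np.zeros_like(denom), where=denom > 0)
opt_thresh = thr[np.argmax(f1)]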
print(f"Optimal threshold: {opt_thresh:.3f}")

y_opt = (probs >= opt_thresh).astype(int)
print("Stacking@opt Threshold Accuracy:", accuracy_score(y_up_test, y_opt))
print(classification_report(y_up_test, y_opt))

### 11. Define & (Optionally) Tune Regressors

- We’ll predict tomorrow’s high/low with RF/LGBM/XGB regressors.  
- Same `USE_GPU` guard ensures no OpenCL errors in CI.

In [None]:
%%time
# Cell 11 – Optional: Tune Regressors for High/Low Price Prediction

from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection   import RandomizedSearchCV
from sklearn.ensemble          import RandomForestRegressor
from lightgbm                  import LGBMRegressor
import xgboost                 as xgb
from joblib                    import parallel_backend

# 1) Subset training data, dropping NaNs in each target
mask_h = ~y_high_train.isna()
mask_l = ~y_low_train.isna()
Xh, yh = X_train_imp.loc[mask_h], y_high_train.loc[mask_h]
Xl, yl = X_train_imp.loc[mask_l], y_low_train.loc[mask_l]

# 2) Remove zero‐variance features (they carry no information)
vt = VarianceThreshold(threshold=0.0)
vt.fit(X_train_imp)
keep_mask = vt.get_support()
keep_cols = [c for c, keep in zip(feature_cols, keep_mask) if keep]
Xh = Xh[keep_cols]
Xl = Xl[keep_cols]

# 3) Define regressors with GPU‐safe flags
rf_regr = RandomForestRegressor(n_jobs=1, random_state=42)

lgb_params = {
    "n_estimators": 100,
    "max_depth": 7,
    "min_child_samples": 20,
    "learning_rate": 0.05,
    "random_state": 42
}
if USE_GPU:
    # only enable LightGBM GPU when CI_USE_GPU is true
    lgb_params.update(device="gpu", gpu_platform_id=0, gpu_device_id=0)
lgb_regr = LGBMRegressor(**lgb_params)

xgb_params = {
    "tree_method": "hist",
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "random_state": 42,
    "device": "cuda" if USE_GPU else "cpu"
}
xgb_regr = xgb.XGBRegressor(**xgb_params)

# 4) Hyperparameter grids for each regressor
param_rf   = {"n_estimators": [50, 100],         "max_depth": [5, 10]}
param_lgb  = {"n_estimators": [100, 200],        "max_depth": [5, 7],  "learning_rate": [0.05, 0.1]}
param_xgb  = {"n_estimators": [100, 200],        "max_depth": [3, 5],  "learning_rate": [0.05, 0.1]}

models = {
    "RF High":  (rf_regr,  param_rf,  Xh, yh),
    "LGB High": (lgb_regr, param_lgb, Xh, yh),
    "XGB High": (xgb_regr, param_xgb, Xh, yh),
    "RF Low":   (rf_regr,  param_rf,  Xl, yl),
    "LGB Low":  (lgb_regr, param_lgb, Xl, yl),
    "XGB Low":  (xgb_regr, param_xgb, Xl, yl),
}

best_regr = {}
for name, (mdl, prm, Xtr, ytr) in models.items():
    print(f"\n▶ Tuning {name}")
    search = RandomizedSearchCV(
        estimator=mdl,
        param_distributions=prm,
        n_iter=3,
        cv=tscv,
        scoring="neg_root_mean_squared_error",
        random_state=42,
        verbose=2,
        n_jobs=1
    )
    # use threading backend to avoid ResourceTracker errors
    with parallel_backend('threading'):
        search.fit(Xtr, ytr)
    best_regr[name] = search.best_estimator_
    print(f"{name} best RMSE: {-search.best_score_:.3f}")

### 12. Visualize Regressor Performance

- Plot predicted vs. actual tomorrow’s high/low for each tuned regressor.  
- Compute RMSE and overlay a 1:1 line.
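
For reference, RMSE is the square root of the mean squared error. A hand-rolled sketch with made-up prices, equivalent to `np.sqrt(mean_squared_error(...))`:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 0, 0 and 2, so the mean squared error is 4/3.
print(rmse([100.0, 102.0, 104.0], [100.0, 102.0, 106.0]))
```

Because the errors are squared before averaging, one large miss moves RMSE more than several small ones.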

In [None]:
# Cell 12 – Evaluate regressors

def eval_reg(name, model, X_te, y_te):
    yp   = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, yp))
    print(f"{name} — RMSE: {rmse:.2f}")
    plt.scatter(y_te, yp, alpha=0.3)
    plt.plot([y_te.min(), y_te.max()],[y_te.min(), y_te.max()],'r--')
    plt.title(name)
    plt.show()

# Recompute the zero-variance filter (same fit as Cell 11) so this cell can run on its own
vt         = VarianceThreshold(threshold=0.0)
vt.fit(X_train_imp)
keep_cols  = [c for c, keep in zip(feature_cols, vt.get_support()) if keep]

for nm, mdl, ytest in [
    ("LGB High", best_regr["LGB High"], y_high_test),
    ("XGB High", best_regr["XGB High"], y_high_test),
    ("RF High",  best_regr["RF High"],  y_high_test),
    ("LGB Low",  best_regr["LGB Low"],  y_low_test),
    ("XGB Low",  best_regr["XGB Low"],  y_low_test),
    ("RF Low",   best_regr["RF Low"],   y_low_test)
]:
    eval_reg(nm, mdl, X_test_imp[keep_cols], ytest)

### 13. Full-Data CV, Select & Retrain, Save

1. Re-load the full dataset.  
2. Time-series CV on **all** data to pick the overall best model.  
3. Retrain that model on 100% of the data.  
4. Save to `MODEL_DIR/final_{model}.pkl` (`src/models/` by default).
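
To make the time-series CV in step 2 concrete, the sketch below prints the folds that `TimeSeriesSplit(n_splits=3)` produces on eight toy rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(8).reshape(-1, 1)          # eight rows in time order
folds = list(TimeSeriesSplit(n_splits=3).split(X_toy))

for i, (tr, te) in enumerate(folds, start=1):
    print(f"fold {i}: train={tr.tolist()} test={te.tolist()}")

# In every fold the training rows all come before the test rows.
assert all(tr.max() < te.min() for tr, te in folds)
```

The training window grows from fold to fold while the test window always lies after it, so no fold is scored on rows older than its training data.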

In [None]:
# Cell 13 – Full-data CV & final save

df_full = pd.read_csv(
    os.path.join(DATA_PROCESSED, "historical_proc.csv"),
    parse_dates=["date"]
)
df_full.sort_values("date", inplace=True)

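# Build the target if the CSV does not carry it (same rule as Cell 3)
if "target_up" not in df_full.columns:
    df_full["target_up"] = (df_full["close"].shift(-1) > df_full["close"]).astype(int)
    df_full = df_full.iloc[:-1]   # last row has no next-day close

Xf = pd.DataFrame(
    imputer.transform(df_full[feature_cols]),
    columns=feature_cols,
    index=df_full.index
)
yf = df_full["target_up"]

tscv_full = TimeSeriesSplit(n_splits=3)
# The cv="prefit" stack is left out: cross_val_score clones each estimator,
# and a cloned prefit stack has unfitted base models, so it cannot be fitted here.
candidates = {
    "Logistic":     best_lr,
    "RandomForest": best_rf,
    "XGBoost":      best_xgb,
    "LightGBM":     best_lgb
}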

scores = {}
for name, model in candidates.items():
    sc = cross_val_score(
        model, Xf, yf,
        cv=tscv_full,
        scoring="accuracy",
        n_jobs=1
    )
    scores[name] = sc.mean()
    print(f"{name} full-CV: {sc.mean():.4f} ± {sc.std():.4f}")

best_name   = max(scores, key=scores.get)
final_model = candidates[best_name]
print(f"\n▶ Best on full-CV: {best_name}")
final_model.fit(Xf, yf)

os.makedirs(MODEL_DIR, exist_ok=True)
out_path = os.path.join(MODEL_DIR, f"final_{best_name.lower()}.pkl")
joblib.dump(final_model, out_path)
print(f"✅ Saved {os.path.basename(out_path)} to {MODEL_DIR}/")