# 3. Modeling & Final Training

_In this notebook we:_  
1. Load the fully processed dataset from `data/processed/historical_proc.csv`.  
2. Build the next-day targets, including the classification target `target_up`.  
3. Impute any remaining missing values.  
4. Define and tune our classifiers and regressors.  
5. Use TimeSeriesSplit for honest CV.  
6. Select the best classifier model.  
7. Retrain that classifier on 100% of the data.  
8. Save the final model for deployment.

### 1. Parameters for Papermill & Runtime  
These control where we read/write and whether to force GPU usage.  
- `DATA_PROCESSED`: path to processed data  
- `MODEL_DIR`: where to save final models  
- `USE_GPU`: toggle GPU‐based training (CI must set this to `False`)  

In [None]:
# Parameters cell for Papermill
DATA_PROCESSED = "data/processed"
MODEL_DIR = "src/models"
USE_GPU = False

### 2. Imports & Warnings Configuration
Load modeling libraries, metrics, and GPU‐enabled boosters; suppress known warnings for clarity.

In [None]:
# Cell 2 – Imports & GPU flag

# 1) Use fork start method to avoid multiprocessing.resource_tracker errors
import multiprocessing as mp
mp.set_start_method('fork', force=True)

import os, warnings
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns

from sklearn.model_selection import (
    RandomizedSearchCV, GridSearchCV, TimeSeriesSplit,
    ParameterGrid, cross_val_score
)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, StackingClassifier,
    RandomForestRegressor
)
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    mean_squared_error, precision_recall_curve
)
import xgboost as xgb
from lightgbm import LGBMClassifier, LGBMRegressor
import joblib

# Suppress "... not used" parameter warnings (e.g. ignored device settings)
warnings.filterwarnings("ignore", message=".*not used.*")

# Matplotlib & Seaborn setup
%matplotlib inline
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

# Autotime for per-cell timing
%load_ext autotime

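# Determine at runtime whether to use GPU: either the Papermill parameter or
# the CI_USE_GPU environment variable can enable it
USE_GPU = bool(USE_GPU) or os.environ.get("CI_USE_GPU", "false").lower() in ("1","true","yes")
print(f"→ USE_GPU = {USE_GPU}")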
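
A minimal sketch of one possible precedence rule for the GPU flag, where either the notebook parameter or the `CI_USE_GPU` environment variable can switch the GPU on. The helper name `resolve_use_gpu` and its `env` argument are hypothetical and only make the rule easy to test.

```python
import os

TRUTHY = {"1", "true", "yes"}

def resolve_use_gpu(param_value, env=None, var="CI_USE_GPU"):
    """Return True if the parameter or the environment variable enables the GPU."""
    env = os.environ if env is None else env
    env_flag = env.get(var, "false").strip().lower() in TRUTHY
    return bool(param_value) or env_flag

# Neither source enables the GPU
print(resolve_use_gpu(False, env={}))
# The environment variable enables it (case-insensitive)
print(resolve_use_gpu(False, env={"CI_USE_GPU": "Yes"}))
# The parameter enables it even if the environment says no
print(resolve_use_gpu(True, env={"CI_USE_GPU": "0"}))
```

Passing a plain dictionary as `env` keeps the check independent of the real process environment.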


### 3. Load Features & Build Targets
1. Debug‐print contents of `DATA_PROCESSED`.
2. Read `historical_proc.csv`, sort chronologically.
3. Define next‐day targets: `target_up` (binary), `target_high`, `target_low`.

In [None]:
print("DEBUG: data/processed contains:", os.listdir(DATA_PROCESSED))

df = pd.read_csv(os.path.join(DATA_PROCESSED, "historical_proc.csv"),
                 parse_dates=["date"])
df.sort_values("date", inplace=True)

# create next‐day columns
df["close_next"] = df["close"].shift(-1)
df["high_next"]  = df["high"].shift(-1)
df["low_next"]   = df["low"].shift(-1)

# define targets
df["target_up"]   = (df["close_next"] > df["close"]).astype(int)
df["target_high"] = df["high_next"]
df["target_low"]  = df["low_next"]

# drop final row with NaN targets
df.dropna(subset=["target_up","target_high","target_low"], inplace=True)
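
To make the target construction concrete, here is a small worked example on a made-up five-row frame (the prices are illustrative, not taken from the dataset). It shows how `shift(-1)` aligns each row with the next day's close and why the last row has to be dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "close": [10.0, 11.0, 10.5, 10.5, 12.0],
})

# Next-day close sits on the same row as today's close
toy["close_next"] = toy["close"].shift(-1)

# 1 if tomorrow closes strictly higher than today, else 0
toy["target_up"] = (toy["close_next"] > toy["close"]).astype(int)

# The final row has no "tomorrow", so its target is undefined
toy = toy.dropna(subset=["close_next"])

print(toy[["date", "close", "close_next", "target_up"]])
```

A flat day (10.5 followed by 10.5) counts as "not up" because the comparison is strict.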

### 4. Train/Test Split
Select only feature columns, then split first 75 % of rows for training and last 25 % for testing (time‐aware).

In [None]:
feature_cols = [c for c in df.columns
                if c.endswith(("_std","_mm")) or c.startswith("PC") or c in ["year","month","day","weekday"]]

X     = df[feature_cols]
y_up  = df["target_up"]
y_high = df["target_high"]
y_low  = df["target_low"]

split = int(len(df) * 0.75)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_up_train, y_up_test   = y_up.iloc[:split], y_up.iloc[split:]
y_high_train, y_high_test = y_high.iloc[:split], y_high.iloc[split:]
y_low_train,  y_low_test  = y_low.iloc[:split],  y_low.iloc[split:]

print(f"Train/Test size → {X_train.shape[0]} / {X_test.shape[0]}")
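
The split above is positional, so it is only time-aware because the frame was sorted by date first. A small sketch on toy data illustrates the property that matters: every training row precedes every test row. The helper `chrono_split` is a hypothetical name used for illustration.

```python
import numpy as np
import pandas as pd

def chrono_split(frame, train_frac=0.75):
    """Split a date-sorted frame into leading train rows and trailing test rows."""
    cut = int(len(frame) * train_frac)
    return frame.iloc[:cut], frame.iloc[cut:]

toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "x": np.arange(8),
})
train, test = chrono_split(toy.sort_values("date"))

print(len(train), len(test))
print(train["date"].max() < test["date"].min())
```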

### 5. Impute Missing Values
We fit a `SimpleImputer` on the training features and transform both train and test sets, preserving the original row indices so that any later boolean‐masking stays aligned.

In [None]:
%%time
imputer = SimpleImputer(strategy="mean")

# Fit on training, transform both sets, and preserve indices
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train),
    columns=feature_cols,
    index=X_train.index           # <— preserve original train indices
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test),
    columns=feature_cols,
    index=X_test.index            # <— preserve original test indices
)
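
As a small illustration of why the imputer is fit on the training rows only, the toy example below (values invented for the demonstration) shows that missing test entries are filled with the training mean, so the extreme test value never leaks into the fill value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"f": [1.0, 3.0, np.nan]}, index=[10, 11, 12])
test = pd.DataFrame({"f": [np.nan, 100.0]}, index=[13, 14])

imp = SimpleImputer(strategy="mean")

# Fit on train only; wrap results back into frames with the original indices
train_imp = pd.DataFrame(imp.fit_transform(train), columns=train.columns, index=train.index)
test_imp = pd.DataFrame(imp.transform(test), columns=test.columns, index=test.index)

print(train_imp)
print(test_imp)
```

Both missing cells are filled with 2.0, the mean of the two observed training values.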

### 6. Initialize Classifiers & Parameter Grids  
- Use CPU implementations by default  
- If `USE_GPU=True`, enable GPU training for XGBoost and LightGBM  

In [None]:
# Cell 6 – Define classifiers with safe GPU logic

tscv = TimeSeriesSplit(n_splits=3)

# 1) Logistic Regression (CPU only)
lr_clf = LogisticRegression(class_weight="balanced", max_iter=2000, random_state=42)

# 2) Random Forest (CPU only)
rf_clf = RandomForestClassifier(class_weight="balanced", n_jobs=1, random_state=42)

# 3) XGBoost Classifier
xgb_params = dict(tree_method="hist", eval_metric="logloss", random_state=42)
xgb_params["device"] = "cuda" if USE_GPU else "cpu"
xgb_clf = xgb.XGBClassifier(**xgb_params)

# 4) LightGBM Classifier
lgb_params = dict(n_estimators=100, max_depth=7, learning_rate=0.05, random_state=42)
if USE_GPU:
    lgb_params.update(device="gpu", gpu_platform_id=0, gpu_device_id=0)
lgb_clf = LGBMClassifier(**lgb_params)

# 5) Hyperparameter grids
param_lr  = {"C": [0.1, 1]}
param_rf  = {"n_estimators": [50, 100], "max_depth": [5, 10]}
param_xgb = {"n_estimators": [100, 200], "max_depth": [3, 5], "learning_rate": [0.05, 0.1]}
param_lgb = {"n_estimators": [100, 200], "max_depth": [5, 7], "learning_rate": [0.05, 0.1]}
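
To see what `TimeSeriesSplit(n_splits=3)` does with the data, the sketch below lists the fold indices for a toy array of eight samples. Each fold trains on an expanding prefix and validates on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(8).reshape(-1, 1)

folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_toy)
]

for k, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {k}: train={train_idx} test={test_idx}")
```

No validation row ever precedes a training row, which is what makes the cross-validation score honest for ordered data.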


### 7. Fast Hyperparameter Tuning (3-fold time-series CV, at most 3 candidates each)

We run a hyperparameter search for each model on the 75% training split, using the 3-fold `TimeSeriesSplit` defined above. Grids with at most 3 combinations are searched exhaustively; larger grids are sampled with 3 random candidates to keep the run time reasonable. Random Forest alone is tuned on a 30% subsample of the training rows.

In [None]:
%%time
from datetime import datetime
from joblib import parallel_backend

def tune_fast(model, params, X, y, name, max_iter=3, frac=0.3):
    print(f"\n▶ Starting tuning for {name} at {datetime.now().strftime('%H:%M:%S')}")
    start = datetime.now()

    # subsample for RandomForest only; sort the sampled positions so the rows
    # stay in chronological order for TimeSeriesSplit
    if isinstance(model, RandomForestClassifier):
        n = int(len(X) * frac)
        idx = np.sort(np.random.RandomState(42).choice(len(X), size=n, replace=False))
        Xt, yt = X.iloc[idx], y.iloc[idx]
    else:
        Xt, yt = X, y

    # choose Grid vs Random
    total = len(list(ParameterGrid(params)))
    if total <= max_iter:
        search = GridSearchCV(
            model, params,
            cv=tscv,
            scoring="accuracy",
            n_jobs=1,
            verbose=2
        )
    else:
        search = RandomizedSearchCV(
            model, params,
            n_iter=max_iter,
            cv=tscv,
            scoring="accuracy",
            n_jobs=1,
            random_state=42,
            verbose=2
        )

    # fit using threading to avoid ResourceTracker errors
    with parallel_backend('threading'):
        search.fit(Xt, yt)

    # report
    elapsed = (datetime.now() - start).total_seconds()
    print(f"✔ Finished {name} in {elapsed:.1f}s")
    print(f"{name} best params: {search.best_params_}")
    print(f"{name} CV acc: {search.best_score_:.4f}\n")
    return search.best_estimator_

best_lr  = tune_fast(lr_clf,  param_lr,  X_train_imp, y_up_train, "Logistic")
best_rf  = tune_fast(rf_clf,  param_rf,  X_train_imp, y_up_train, "RandomForest")
best_xgb = tune_fast(xgb_clf, param_xgb, X_train_imp, y_up_train, "XGBoost")
best_lgb = tune_fast(lgb_clf, param_lgb, X_train_imp, y_up_train, "LightGBM")
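
The choice between exhaustive and randomized search in `tune_fast` depends only on the grid size. The sketch below repeats the four grids from Section 6 and applies the same rule; `search_kind` is a hypothetical helper that mirrors the comparison against `max_iter`.

```python
from sklearn.model_selection import ParameterGrid

grids = {
    "Logistic":     {"C": [0.1, 1]},
    "RandomForest": {"n_estimators": [50, 100], "max_depth": [5, 10]},
    "XGBoost":      {"n_estimators": [100, 200], "max_depth": [3, 5], "learning_rate": [0.05, 0.1]},
    "LightGBM":     {"n_estimators": [100, 200], "max_depth": [5, 7], "learning_rate": [0.05, 0.1]},
}

def search_kind(params, max_iter=3):
    """Return 'grid' when the full grid fits within max_iter, else 'random'."""
    return "grid" if len(ParameterGrid(params)) <= max_iter else "random"

sizes = {name: len(ParameterGrid(p)) for name, p in grids.items()}
kinds = {name: search_kind(p) for name, p in grids.items()}

for name in grids:
    print(f"{name}: {sizes[name]} combinations -> {kinds[name]} search")
```

With `max_iter=3`, only the logistic grid is searched exhaustively; the other three are sampled.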

### 8. Build Prefit Stacking Ensemble
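Combine the four tuned estimators into a `StackingClassifier` with a logistic meta‐learner on top. With `cv="prefit"` (scikit-learn 1.1 or later) the base models are reused as fitted, and the meta‐learner is trained on their predictions for the same training rows, so its inputs are in-sample rather than cross-validated.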

In [None]:
from datetime import datetime

print(f"[{datetime.now().strftime('%H:%M:%S')}] Building stacking ensemble")
stack_clf = StackingClassifier(
    estimators=[
        ("lr",  best_lr),
        ("rf",  best_rf),
        ("xgb", best_xgb),
        ("lgb", best_lgb),
    ],
    final_estimator=LogisticRegression(max_iter=2_000),
    cv="prefit",  # base models already trained
    n_jobs=1
)
stack_clf.fit(X_train_imp, y_up_train)
print("✅ Stacking complete")
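
For comparison, here is a hedged sketch of a different technique, holdout blending, in which the meta-learner is fit on predictions for rows the base models have not seen. It is not the notebook's method; the synthetic data, the two stand-in base models, and the helper `meta_features` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Ordered three-way split: base models, meta-learner, final check
X_base, y_base = X[:150], y[:150]
X_meta, y_meta = X[150:225], y[150:225]
X_hold, y_hold = X[225:], y[225:]

base_models = [
    LogisticRegression(max_iter=1000).fit(X_base, y_base),
    DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_base, y_base),
]

def meta_features(models, X_part):
    """Stack each model's positive-class probability as one column."""
    return np.column_stack([m.predict_proba(X_part)[:, 1] for m in models])

# The meta-learner only sees out-of-sample base predictions
meta = LogisticRegression(max_iter=1000).fit(meta_features(base_models, X_meta), y_meta)

hold_pred = meta.predict(meta_features(base_models, X_hold))
print("holdout accuracy:", (hold_pred == y_hold).mean())
```

The cost of this design is that the base models train on fewer rows, which is the trade-off against the prefit approach.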

### 9. Evaluate on Test Set
For each model, predict on the hold‐out set, print accuracy & classification report, and plot confusion matrix.

In [None]:
for name, mdl in [
    ("Logistic", best_lr),
    ("RandomForest", best_rf),
    ("XGBoost", best_xgb),
    ("LightGBM", best_lgb),
    ("Stacking", stack_clf)
]:
    y_pred = mdl.predict(X_test_imp)
    acc    = accuracy_score(y_up_test, y_pred)
    print(f"\n► {name} Accuracy: {acc:.4f}")
    print(classification_report(y_up_test, y_pred))
    sns.heatmap(confusion_matrix(y_up_test, y_pred), annot=True, fmt="d")
    plt.title(name); plt.show()

### 10. Precision–Recall Threshold Optimization for Stacker
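Find the probability cutoff that maximizes F1 on the test set and report accuracy at that cutoff. Because the cutoff is chosen on the same rows it is scored on, the resulting figures are optimistic.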

In [None]:
probs = stack_clf.predict_proba(X_test_imp)[:,1]
prec, rec, thr = precision_recall_curve(y_up_test, probs)
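# prec/rec have one more entry than thr; drop the final (recall = 0) point
# so that the F1 array lines up with the thresholds
denom = prec[:-1] + rec[:-1]
f1 = 2 * prec[:-1] * rec[:-1] / np.where(denom == 0, 1, denom)
opt_thresh = thr[np.argmax(f1)]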
print(f"Optimal threshold: {opt_thresh:.3f}")

y_opt = (probs >= opt_thresh).astype(int)
print("Stacking@opt Threshold Accuracy:", accuracy_score(y_up_test, y_opt))
print(classification_report(y_up_test, y_opt))
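
A small worked example of the threshold search, using four made-up scores. `precision_recall_curve` returns one more precision/recall point than thresholds, so the last point is dropped before the F1 array is indexed against `thr`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

prec, rec, thr = precision_recall_curve(y_true, scores)

# Align precision/recall with thresholds and guard against 0/0
p, r = prec[:-1], rec[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best_thr = thr[np.argmax(f1)]
print("thresholds:", thr)
print("F1 per threshold:", np.round(f1, 3))
print("best threshold:", best_thr)
```

At a cutoff of 0.35 both positives are recovered with one false positive, giving the highest F1 of the candidates.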

### 11. Tune & Evaluate Regressors for High/Low Prediction
1. Drop rows where target is NaN.
2. Remove zero‐variance features.
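3. Define RF, LGBM, and XGB regressors (GPU only when `USE_GPU=True`).
4. Run RandomizedSearchCV to minimize RMSE.

The cell below covers step 3 only. Steps 1, 2, and 4, which produce the `keep_cols` and `best_regr` objects used in Section 12, are not shown in this notebook.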

In [None]:
# Cell 11 – Define regressors with safe GPU logic

from sklearn.feature_selection import VarianceThreshold

# 1) Random Forest Regressor (CPU only)
rf_regr = RandomForestRegressor(n_jobs=1, random_state=42)

# 2) LightGBM Regressor
lgb_regr_params = dict(
    n_estimators=100,
    max_depth=7,
    min_child_samples=20,
    learning_rate=0.05,
    random_state=42
)
if USE_GPU:
    lgb_regr_params.update(device="gpu", gpu_platform_id=0, gpu_device_id=0)
lgb_regr = LGBMRegressor(**lgb_regr_params)

# 3) XGBoost Regressor
xgb_regr_params = dict(
    tree_method="hist",
    objective="reg:squarederror",
    eval_metric="rmse",
    random_state=42
)
xgb_regr_params["device"] = "cuda" if USE_GPU else "cpu"
xgb_regr = xgb.XGBRegressor(**xgb_regr_params)

# 4) You can now safely tune these regressors downstream…
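
The tuning step itself is not shown, so the following is only a sketch of how steps 2 and 4 might look, run on synthetic data with a single Random Forest. The names `keep_cols` and `best_regr` mirror those used in Section 12, but the grid values, the toy features, and the `"RF High"` key are assumptions rather than the notebook's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.RandomState(42)
X_toy = pd.DataFrame({
    "f1": rng.normal(size=120),
    "f2": rng.normal(size=120),
    "const": np.ones(120),  # zero-variance column that should be removed
})
y_toy = 3.0 * X_toy["f1"] + rng.normal(scale=0.1, size=120)

# Step 2: drop zero-variance features while keeping column names
selector = VarianceThreshold(threshold=0.0).fit(X_toy)
keep_cols = X_toy.columns[selector.get_support()].tolist()

# Step 4: randomized search scored by negative RMSE (higher is better)
search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=1, random_state=42),
    {"n_estimators": [20, 40], "max_depth": [3, 5]},
    n_iter=2,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
    random_state=42,
    n_jobs=1,
)
search.fit(X_toy[keep_cols], y_toy)

best_regr = {"RF High": search.best_estimator_}
print("kept columns:", keep_cols)
print("CV RMSE:", -search.best_score_)
```

The same pattern would be repeated per regressor and per target to fill the remaining `best_regr` keys.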

### 12. Visualize Regressor Performance
For each best regressor, compute test RMSE and scatter true vs predicted values.

In [None]:
from sklearn.metrics import mean_squared_error

def eval_reg(name, model, X_te, y_te):
    yp = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, yp))
    print(f"{name} — RMSE: {rmse:.2f}")
    plt.scatter(y_te, yp, alpha=0.3)
    plt.plot([y_te.min(), y_te.max()],[y_te.min(),y_te.max()],'r--')
    plt.title(name)
    plt.show()

for nm, mdl, ytest in [
    ("LGB High", best_regr["LGB High"], y_high_test),
    ("XGB High", best_regr["XGB High"], y_high_test),
    ("RF High",  best_regr["RF High"],  y_high_test),
    ("LGB Low",  best_regr["LGB Low"],  y_low_test),
    ("XGB Low",  best_regr["XGB Low"],  y_low_test),
    ("RF Low",   best_regr["RF Low"],   y_low_test),
]:
    eval_reg(nm, mdl, X_test_imp[keep_cols], ytest)
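
As a quick check of the metric used in `eval_reg`, RMSE is the square root of the mean squared residual. The three-point example below uses invented numbers and a hypothetical `rmse` helper written in plain NumPy.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Residuals are 0, 0, and -2, so the mean squared error is 4/3
print(round(rmse([1, 2, 3], [1, 2, 5]), 4))
```

Because the residuals are squared before averaging, a single large miss dominates the score, which is worth keeping in mind when reading the scatter plots.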

### 13. Full‐Data CV, Select & Retrain Best Model, Then Save
1. Load the full processed dataset.  
2. Perform time‐series CV on each candidate to compare mean accuracies.  
3. Retrain the winning model on all data.  
4. Write the model artifact under the parameterized `MODEL_DIR`.  

In [None]:
# 1) Load full processed data
df_full = pd.read_csv(
    os.path.join(DATA_PROCESSED, "historical_proc.csv"),
    parse_dates=["date"]
)
df_full.sort_values("date", inplace=True)

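# Rebuild the classification target as in Section 3; the last row has no
# next-day close, so it is dropped
df_full["target_up"] = (df_full["close"].shift(-1) > df_full["close"]).astype(int)
df_full = df_full.iloc[:-1]

# Note: `imputer` was fit on the 75% training split only
Xf = pd.DataFrame(imputer.transform(df_full[feature_cols]),
                  columns=feature_cols, index=df_full.index)
yf = df_full["target_up"]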

# 2) Full‐data time‐series CV
tscv_full = TimeSeriesSplit(n_splits=3)
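# A prefit stacker cannot be cross-validated or retrained: cloning discards the
# fitted base models. Rebuild it with internal CV for this comparison.
stack_full = StackingClassifier(
    estimators=[("lr", best_lr), ("rf", best_rf), ("xgb", best_xgb), ("lgb", best_lgb)],
    final_estimator=LogisticRegression(max_iter=2_000),
    cv=3,  # internal folds are stratified, not time-ordered
    n_jobs=1
)
candidates = {
    "Logistic":     best_lr,
    "RandomForest": best_rf,
    "XGBoost":      best_xgb,
    "LightGBM":     best_lgb,
    "Stacking":     stack_full
}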

scores = {}
for name, model in candidates.items():
    sc = cross_val_score(model, Xf, yf, cv=tscv_full,
                         scoring="accuracy", n_jobs=1)
    scores[name] = sc.mean()
    print(f"{name} full‐CV: {sc.mean():.4f} ± {sc.std():.4f}")

# 3) Select best and retrain on all data
best_name   = max(scores, key=scores.get)
final_model = candidates[best_name]
print(f"\n▶ Best on full‐CV: {best_name}")
final_model.fit(Xf, yf)

# 4) Save the final model artifact
os.makedirs(MODEL_DIR, exist_ok=True)
out_path = os.path.join(MODEL_DIR, f"final_{best_name.lower()}.pkl")
joblib.dump(final_model, out_path)
print(f"✅ Saved {os.path.basename(out_path)} to {MODEL_DIR}/")
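
The saved artifact contains only the classifier, so whatever loads it must apply the same imputation first. One option, sketched here on synthetic data and not part of the notebook, is to bundle the imputer and model in a scikit-learn `Pipeline` and persist that single object. The file name `final_pipeline.pkl` and the logistic stand-in model are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
X[::7, 1] = np.nan  # inject missing values into one feature

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Round-trip through disk, as a deployment step would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "final_pipeline.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same = np.array_equal(pipe.predict(X), restored.predict(X))
print("predictions identical after reload:", same)
```

The restored pipeline accepts raw features with missing values, so the caller does not need a separate imputer object.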