
# Random Walk → Next-Step Prediction (FAST Profile)

This notebook is a compact, **fast** profile of the random-walk next-step experiment that runs locally or in **Google Colab**.
It uses the utilities in `model.py` and trains a strong baseline (Histogram Gradient Boosting) **without** grid search.

**Why FAST?** You skip a slow `GridSearchCV` over a large dataset and still get clean, convincing results for your portfolio.


In [None]:

# --- Path helper: make sure `model.py` is importable in Colab or locally

import os, sys

# 1) Try import from current dir
try:
    from model import (
        WalkConfig,
        generate_random_walks_1d,
        make_windows_from_walks,
        build_pipeline,
        default_param_grid,
        group_train_test_split,
        tune_with_cv,
        evaluate,
    )
except ImportError as e:  # only an import failure should trigger the path search
    print("Initial import failed:", e)
    # 2) Try common project structures
    tried = []
    candidates = [
        os.path.abspath('.'),
        os.path.abspath('./src'),
        os.path.abspath('../src'),
        os.path.abspath('/content/drive/MyDrive/random-walk-ml/src'),
        os.path.abspath('/content/drive/MyDrive/Random_Walk_ML/src'),
    ]
    for c in candidates:
        if c not in sys.path:
            sys.path.insert(0, c)
        tried.append(c)
        try:
            from model import (
                WalkConfig,
                generate_random_walks_1d,
                make_windows_from_walks,
                build_pipeline,
                default_param_grid,
                group_train_test_split,
                tune_with_cv,
                evaluate,
            )
            print("Imported model.py from:", c)
            break
        except ImportError:
            continue  # model.py not importable from this path; try the next candidate
    else:
        raise ImportError(f"Could not import model.py. Tried: {tried}")

print("✅ model.py import OK")


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay, ConfusionMatrixDisplay

np.random.seed(42)
plt.style.use('default')



## 1) Generate data (smaller & faster)

We create a **mixed** dataset (some fair, some biased walks) to inject learnable signal, but with reduced size to keep runtime low.
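
The internals of `generate_random_walks_1d` and `make_windows_from_walks` live in `model.py` and are not shown here. The sketch below is one plausible way to build such a dataset, under stated assumptions: roughly half the walks are fair, the rest get an up-probability drawn from 0.3 to 0.7, features are the last `window` steps (±1), and the target is whether the next step is up. The names `sketch_mixed_walks` and `sketch_windows` are hypothetical and do not come from `model.py`.

```python
import numpy as np

def sketch_mixed_walks(n_walks=200, n_steps=300, seed=42):
    """Hypothetical stand-in for a 'mixed' generator (not the model.py code)."""
    rng = np.random.default_rng(seed)
    # Assumption: about half the walks are fair, the others have a random bias.
    is_fair = rng.random(n_walks) < 0.5
    p_ups = np.where(is_fair, 0.5, rng.uniform(0.3, 0.7, n_walks))
    # Each step is +1 with probability p_up, otherwise -1.
    steps = np.where(rng.random((n_walks, n_steps)) < p_ups[:, None], 1, -1)
    positions = np.cumsum(steps, axis=1)
    return positions, p_ups

def sketch_windows(positions, window=50, horizon=1):
    """Hypothetical windowing: past `window` steps -> direction of a future step."""
    X, y, groups = [], [], []
    for walk_id, pos in enumerate(positions):
        steps = np.diff(pos, prepend=0)  # recover the +1/-1 steps from positions
        for t in range(window, len(steps) - horizon + 1):
            X.append(steps[t - window:t])              # features: the last `window` steps
            y.append(int(steps[t + horizon - 1] > 0))  # label: 1 if the target step is up
            groups.append(walk_id)                     # group id = source walk
    return np.array(X), np.array(y), np.array(groups)

# Small demo: 4 walks of 20 steps with window=5 give 15 windows per walk.
positions, p_ups = sketch_mixed_walks(n_walks=4, n_steps=20, seed=0)
X_demo, y_demo, g_demo = sketch_windows(positions, window=5, horizon=1)
print(X_demo.shape)  # (60, 5)
```

Keeping a group id per window matters because windows from the same walk overlap heavily; splitting by walk keeps near-duplicate windows out of the test set.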


In [None]:

cfg = WalkConfig(n_walks=200, n_steps=300, bias_mode="mixed", seed=42)
positions, p_ups = generate_random_walks_1d(cfg)

window = 50  # larger window ⇒ fewer samples (faster)
X, y, groups = make_windows_from_walks(positions, window=window, horizon=1)

X_train, X_test, y_train, y_test, g_train, g_test = group_train_test_split(
    X, y, groups, test_size=0.2, seed=42
)

print("Shapes: X:", X.shape, "| y:", y.shape)
print("Train/Test:", X_train.shape[0], "/", X_test.shape[0])



## 2) Train a strong, fast baseline (no grid search)

We use scikit-learn's **`HistGradientBoostingClassifier`** with lightweight parameters (`max_depth=3`, `learning_rate=0.1`), which trains quickly and still performs well.


In [None]:

from time import perf_counter

pipe = build_pipeline("hgb")
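pipe.set_params(clf__max_depth=3, clf__learning_rate=0.1)  # lightweight settings

t0 = perf_counter()
pipe.fit(X_train, y_train)
t1 = perf_counter()  # stop the clock before evaluation so only fitting is timed

metrics = evaluate(pipe, X_test, y_test)

# Round numeric metrics (including NumPy floats) for display; leave anything else untouched.
print("Metrics:", {k: (round(v, 3) if isinstance(v, (int, float, np.floating)) else v) for k, v in metrics.items()})
print(f"Train time: {t1 - t0:.2f}s")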



## 3) Visualize performance
ROC, Precision–Recall, and Confusion Matrix.
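
As a reminder of what the ROC curve summarizes: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. `auc_by_pairs` is a hypothetical helper for illustration only, not part of `model.py` or scikit-learn, and it is quadratic in the number of samples, so it is only suitable for small inputs.

```python
def auc_by_pairs(y_true, scores):
    """Brute-force AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    # A pair counts 1 if the positive outscores the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this reading, a model with no real signal ranks positives above negatives about half the time, which is why AUC ≈ 0.5 is the chance baseline.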


In [None]:

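# Pass an explicit Axes to each display: from_estimator creates its own figure
# when ax is None, so a preceding plt.figure() would only leave a blank figure behind.
fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(pipe, X_test, y_test, ax=ax)
ax.set_title("ROC: HGB (FAST)")
plt.show()

fig, ax = plt.subplots()
PrecisionRecallDisplay.from_estimator(pipe, X_test, y_test, ax=ax)
ax.set_title("Precision-Recall: HGB (FAST)")
plt.show()

fig, ax = plt.subplots()
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test, ax=ax)
ax.set_title("Confusion Matrix: HGB (FAST)")
plt.show()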



## 4) Control: purely fair walks (no signal)

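In a fair walk every step is an independent coin flip, so the past carries no information about the next step. Accuracy and AUC should therefore both be ≈ 0.5; a score clearly above that would point to leakage in the windowing or the split.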
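
How close to 0.5 is close enough? Under pure chance, accuracy over `n_test` independent predictions has a standard error of about `sqrt(0.25 / n_test)`. The sketch below turns that into an approximate 95% band using the normal approximation. `chance_band` is a hypothetical helper, and the 6,000 in the usage line is an illustrative test-set size, not a number reported by this notebook.

```python
import math

def chance_band(n_test, z=1.96):
    """Approximate 95% band for accuracy when every prediction is a coin flip."""
    se = math.sqrt(0.25 / n_test)  # standard error of a proportion at p = 0.5
    return 0.5 - z * se, 0.5 + z * se

low, high = chance_band(6000)
print(f"chance band: {low:.3f} to {high:.3f}")
```

For 6,000 test windows the band is roughly 0.487 to 0.513, so a fair-walk accuracy inside that range is consistent with pure chance.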


In [None]:

cfg_fair = WalkConfig(n_walks=150, n_steps=250, bias_mode="fair", seed=7)
pos_fair, p_ups_fair = generate_random_walks_1d(cfg_fair)
Xf, yf, gf = make_windows_from_walks(pos_fair, window=window, horizon=1)

Xf_tr, Xf_te, yf_tr, yf_te, gf_tr, gf_te = group_train_test_split(Xf, yf, gf, test_size=0.2, seed=7)

pipe_fair = build_pipeline("hgb")
pipe_fair.set_params(clf__max_depth=3, clf__learning_rate=0.1)
pipe_fair.fit(Xf_tr, yf_tr)
metrics_fair = evaluate(pipe_fair, Xf_te, yf_te)

# Round numeric metrics for display; leave anything else untouched.
fair_rounded = {k: (round(v, 3) if isinstance(v, (int, float)) else v) for k, v in metrics_fair.items()}
print("Fair-walk metrics (should be ~chance):", fair_rounded)



## 5) Takeaways

- The **FAST** profile avoids heavy grid search and runs quickly in Colab.
- On **mixed** data, the model beats chance by detecting the bias in the biased walks.
- On **fair** data, performance drops to **~0.5** (chance), which indicates the pipeline is not picking up spurious signal or leakage.

> When you have time, you can re-enable hyperparameter search (RandomizedSearchCV, 3 folds) for a deeper analysis.
