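# Dataset Interview: Training and Deployment Plan (Personal Cheat Sheet)

**Goal**: Train and compare several models under one reproducible experiment pipeline, and deliver stable conclusions plus a deployable inference interface.

- All models share the same data split, feature engineering, evaluation protocol, and tuning budget
- Key checks: leakage, drift over time, stability (by time slice / by group)
- Deliverables: leaderboard, key plots, final model artifact (pipeline + metadata), inference function


## 0. Runtime Environment and Global Constraints
- Dependencies: `numpy`, `pandas`, `scikit-learn`, `matplotlib` only (no extra third-party packages)
- Time-series tasks: split strictly in time order; every `fit` (imputer/scaler/encoder) happens on the training fold only
- Reproducibility: fix random seeds; write experiment logs to disk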
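
To make the "fit on the training fold only" rule concrete, here is a minimal sketch with a made-up drifting feature. The data and variable names are illustrative assumptions, not part of the notebook's pipeline. It standardizes the validation block once with train-only statistics and once with full-data statistics, which is the leaky variant.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 1-D feature with upward drift, so later rows have a larger mean.
x = np.arange(100, dtype=float) + rng.normal(0.0, 1.0, size=100)
x_train, x_valid = x[:80], x[80:]

# Correct: statistics come from the training block only.
mu, sigma = x_train.mean(), x_train.std()
x_valid_ok = (x_valid - mu) / sigma

# Leaky: statistics computed on all rows already include the validation period.
x_valid_leaky = (x_valid - x.mean()) / x.std()

print("train-only mean:", round(float(mu), 2), "| full-data mean:", round(float(x.mean()), 2))
```

The two scaled versions differ because the full-data mean has already absorbed the later, larger values.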


In [None]:

# ===== Imports & Global Seed =====
import os
import json
import time
import math
import random
import warnings
from dataclasses import dataclass, asdict
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, roc_auc_score, log_loss
from sklearn.model_selection import ParameterGrid

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier
from sklearn.svm import SVR, SVC

warnings.filterwarnings("ignore")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

RUN_DIR = Path("runs") / time.strftime("%Y%m%d_%H%M%S")
RUN_DIR.mkdir(parents=True, exist_ok=True)

print("Run dir:", RUN_DIR.resolve())
print("Python:", os.sys.version.split()[0])
print("Numpy:", np.__version__, "Pandas:", pd.__version__)


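## 1. Data Loading and Field Definitions
Map the raw data onto three concepts:
- `time_col`: time column (sortable)
- `target_col`: label
- optional: `group_col` (ticker / id / meter_id, etc.)

All downstream modules depend only on these three concepts.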

In [None]:

# ===== Load data =====
# TODO: replace with the real loading logic
# df = pd.read_parquet("data/train.parquet")  # or read_csv / read_feather, etc.
df = None

time_col = "timestamp"     # TODO
target_col = "y"           # TODO
group_col = None           # e.g., "ticker"; None if there is no group column

# ===== Basic sanity checks =====
def assert_columns_exist(df: pd.DataFrame, cols):
    missing = [c for c in cols if c is not None and c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

# assert_columns_exist(df, [time_col, target_col, group_col])

# df[time_col] = pd.to_datetime(df[time_col])  # depends on the column dtype
# df = df.sort_values(time_col).reset_index(drop=True)

# print(df.shape)
# display(df.head(3))


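## 2. Leakage Checklist (Quick Pass)
- Do any features contain future information (e.g., future returns, future means, forward-looking windows)?
- Was any scaler / encoder / imputer fit on the full dataset?
- Do train and validation overlap in time (shuffling lets the model see across the boundary)?
- When the same entity recurs over time, does future information from the same group leak in?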
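
The time-overlap item in the checklist can be checked mechanically. Below is a small sketch; `check_time_order` is a hypothetical helper name, not something defined elsewhere in this notebook. It compares the latest training timestamp with the earliest validation timestamp.

```python
import pandas as pd

def check_time_order(train_times: pd.Series, valid_times: pd.Series) -> dict:
    """Report whether every training timestamp precedes every validation timestamp."""
    tr = pd.to_datetime(train_times)
    va = pd.to_datetime(valid_times)
    valid_min = va.min()
    return {
        "train_max": tr.max(),
        "valid_min": valid_min,
        "n_train_rows_at_or_after_valid_start": int((tr >= valid_min).sum()),
        "ok": bool(tr.max() < valid_min),
    }

times = pd.Series(pd.date_range("2024-01-01", periods=10, freq="D"))
clean = check_time_order(times.iloc[:7], times.iloc[7:])                       # contiguous split
shuffled = check_time_order(times.iloc[[0, 2, 4, 6, 8]], times.iloc[[1, 3, 5, 7, 9]])  # interleaved
print(clean["ok"], shuffled["ok"])
```

A shuffled or interleaved split fails the check because training rows sit after the validation start.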


In [None]:

def leakage_sanity_report(df: pd.DataFrame, time_col: str, target_col: str):
    # Lightweight checks only: missing-value fraction, duplicates, basic target sanity
    rep = {}
    rep["n_rows"] = int(df.shape[0])
    rep["n_cols"] = int(df.shape[1])
    rep["target_nan_frac"] = float(df[target_col].isna().mean())
    rep["time_is_monotone"] = bool(pd.Series(df[time_col]).is_monotonic_increasing)
    rep["dup_row_frac"] = float(df.duplicated().mean())
    return rep

# print(leakage_sanity_report(df, time_col, target_col))


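## 3. Splits: Train / Valid / Test + Walk-forward
Defaults for time-series tasks:
- `test`: the last block of time
- `valid`: the second-to-last block
- `train`: everything earlier

Walk-forward folds are prepared as well, for a more robust model comparison.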
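
As a worked example of the expanding-window idea, the sketch below computes only the fold boundaries for a toy dataset of 100 rows. It follows the same arithmetic as `walk_forward_folds` in the next cell. `walk_forward_bounds` is an illustrative name, and the default `min_train_size=40` is chosen just to keep the numbers small.

```python
def walk_forward_bounds(n, n_folds=5, valid_size_frac=0.1, min_train_size=40):
    """Return (train_end, valid_start, valid_end) per fold for an expanding window."""
    valid_size = max(int(round(n * valid_size_frac)), 1)
    step = max((n - min_train_size - valid_size) // max(n_folds, 1), 1)
    bounds = []
    for k in range(n_folds):
        train_end = min_train_size + k * step
        valid_end = min(train_end + valid_size, n)
        if valid_end <= train_end:
            break
        # Validation starts exactly where training ends: no overlap, no gap.
        bounds.append((train_end, train_end, valid_end))
    return bounds

for train_end, valid_start, valid_end in walk_forward_bounds(100):
    print(f"train=[0, {train_end})  valid=[{valid_start}, {valid_end})")
```

With these settings the step is `(100 - 40 - 10) // 5 = 10`, so the training window grows by 10 rows per fold.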


In [None]:

@dataclass
class SplitConfig:
    valid_frac: float = 0.15
    test_frac: float = 0.15
    min_train_size: int = 1000

split_cfg = SplitConfig()

def time_based_split(df: pd.DataFrame, time_col: str, valid_frac: float, test_frac: float):
    n = df.shape[0]
    n_test = int(round(n * test_frac))
    n_valid = int(round(n * valid_frac))
    n_train = n - n_valid - n_test
    if n_train <= 0:
        raise ValueError("Split fractions too large.")
    idx_train = np.arange(0, n_train)
    idx_valid = np.arange(n_train, n_train + n_valid)
    idx_test  = np.arange(n_train + n_valid, n)
    return idx_train, idx_valid, idx_test

def walk_forward_folds(df: pd.DataFrame, time_col: str, n_folds: int = 5, valid_size_frac: float = 0.1, min_train_size: int = 1000):
    # expanding window; each fold uses a contiguous validation block after train block
    n = df.shape[0]
    valid_size = int(round(n * valid_size_frac))
    valid_size = max(valid_size, 1)
    # compute fold endpoints
    folds = []
    # NOTE: folds are not forced to end before the final test chunk; pass a trimmed df if that is required
    step = (n - min_train_size - valid_size) // max(n_folds, 1)
    step = max(step, 1)
    for k in range(n_folds):
        train_end = min_train_size + k * step
        valid_start = train_end
        valid_end = min(valid_start + valid_size, n)
        if valid_end - valid_start < 1:
            break
        train_idx = np.arange(0, train_end)
        valid_idx = np.arange(valid_start, valid_end)
        folds.append((train_idx, valid_idx))
        if valid_end >= n:
            break
    return folds

# idx_train, idx_valid, idx_test = time_based_split(df, time_col, split_cfg.valid_frac, split_cfg.test_frac)
# folds = walk_forward_folds(df, time_col, n_folds=5, valid_size_frac=0.1, min_train_size=split_cfg.min_train_size)
# print("TVT sizes:", len(idx_train), len(idx_valid), len(idx_test))
# print("n_folds:", len(folds))


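## 4. Feature Engineering Skeleton: Put All Preprocessing in an sklearn Pipeline
Principles:
- Handle numeric and categorical columns separately
- Every downstream model reuses the same preprocessing
- Time-series features (lag/rolling) are generated before the pipeline (so `transform` never sees the future)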
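
The "strictly past" requirement for lag/rolling features is easy to get wrong with grouped data. The toy sketch below uses a made-up frame with hypothetical columns `ticker`, `t`, `x`. It shifts and rolls inside each group via `transform`, so a window never mixes rows from two groups and never includes the current row.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "ticker": ["a", "a", "a", "b", "b", "b"],
    "t": [1, 2, 3, 1, 2, 3],
    "x": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
}).sort_values(["ticker", "t"])

g = df_demo.groupby("ticker", sort=False)["x"]
lag1 = g.shift(1)
# shift(1) first, then roll inside each group: row t only sees rows < t of the same group.
rollmean2 = g.transform(lambda s: s.shift(1).rolling(2).mean())
df_demo = df_demo.assign(x_lag1=lag1, x_rollmean2=rollmean2)
print(df_demo)
```

The first two rows of each group have no complete past window, so their rolling mean is NaN rather than a value borrowed from the previous group.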


In [None]:

# ===== Feature columns (TODO: infer from df or list manually) =====
numeric_cols = []      # TODO
categorical_cols = []  # TODO

# ===== Optional: create time features (calendar) =====
def add_time_features(df: pd.DataFrame, time_col: str):
    out = df.copy()
    t = pd.to_datetime(out[time_col])
    out["hour"] = t.dt.hour
    out["dow"] = t.dt.dayofweek
    out["month"] = t.dt.month
    out["day"] = t.dt.day
    return out

# ===== Optional: lag/rolling features (strictly past) =====
def add_lag_rolling_features(df: pd.DataFrame, time_col: str, group_col: str | None, base_cols: list[str],
                             lags=(1, 2, 5), rolls=(5, 20)):
    # NOTE: the result is sorted by (group_col, time_col); re-sort by time_col before positional splits
    out = df.sort_values([c for c in [group_col, time_col] if c is not None]).copy()
    gb = out.groupby(group_col, sort=False) if group_col else [(None, out)]
    if group_col:
        # groupby object
        for col in base_cols:
            for L in lags:
                out[f"{col}_lag{L}"] = gb[col].shift(L)
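            for W in rolls:
                # shift and roll inside each group; gb[col].shift(1).rolling(W) would roll across group boundaries
                out[f"{col}_rollmean{W}"] = gb[col].transform(lambda s: s.shift(1).rolling(W).mean())
                out[f"{col}_rollstd{W}"]  = gb[col].transform(lambda s: s.shift(1).rolling(W).std())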
    else:
        for col in base_cols:
            for L in lags:
                out[f"{col}_lag{L}"] = out[col].shift(L)
            for W in rolls:
                out[f"{col}_rollmean{W}"] = out[col].shift(1).rolling(W).mean()
                out[f"{col}_rollstd{W}"]  = out[col].shift(1).rolling(W).std()
    return out

# ===== Preprocess pipeline =====
numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler(with_centering=True))  # or StandardScaler
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # sparse_output requires scikit-learn >= 1.2 (older releases use sparse=False)
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    remainder="drop",
    verbose_feature_names_out=False
)

def make_xy(df: pd.DataFrame, target_col: str):
    X = df.drop(columns=[target_col])
    y = df[target_col].values
    return X, y


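## 5. Task Type and Metrics
Regression and classification each get their own metric set. In the interview, pick one primary metric and explain it clearly; treat the rest as sanity checks.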
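
For reference, the metrics used here are simple enough to compute by hand. The sketch below uses made-up numbers and plain NumPy, independent of the scikit-learn helpers in the next cell. It shows MAE, RMSE, and binary log loss with probability clipping.

```python
import numpy as np

# Regression: one large miss out of three predictions.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
err = y_true - y_pred
mae = np.abs(err).mean()            # (0 + 0 + 2) / 3
rmse = np.sqrt((err ** 2).mean())   # sqrt((0 + 0 + 4) / 3)

# Binary classification: clip probabilities so log(0) can never occur.
labels = np.array([1, 0, 1])
p = np.clip(np.array([0.9, 0.2, 0.6]), 1e-15, 1 - 1e-15)
logloss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

print(f"mae={mae:.4f} rmse={rmse:.4f} logloss={logloss:.4f}")
```

RMSE comes out larger than MAE here because squaring gives the single large error more weight, which is one thing to keep in mind when picking the primary metric.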


In [None]:

@dataclass
class TaskConfig:
    task_type: str = "regression"  # "classification"
    main_metric: str = "mae"       # regression: "mae"/"rmse"; classification: "auc"/"logloss"

task_cfg = TaskConfig()

def eval_metrics(y_true, y_pred, task_type: str):
    out = {}
    if task_type == "regression":
        out["mae"] = float(mean_absolute_error(y_true, y_pred))
        # `squared=False` was deprecated and later removed in newer scikit-learn; take the root explicitly
        out["rmse"] = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    else:
        # y_pred: probability of the positive class; y_true must be in {0, 1}
        out["auc"] = float(roc_auc_score(y_true, y_pred)) if len(np.unique(y_true)) > 1 else float("nan")
        # log_loss accepts 1-D positive-class probabilities. The `eps` argument was removed in newer
        # scikit-learn, so clip manually; labels=[0, 1] keeps single-class slices from raising.
        try:
            p = np.clip(np.asarray(y_pred, dtype=float), 1e-15, 1 - 1e-15)
            out["logloss"] = float(log_loss(y_true, p, labels=[0, 1]))
        except Exception:
            out["logloss"] = float("nan")
    return out


## 6. Candidate Models (Model Zoo)
Only the estimator changes from model to model; all of them share the same `preprocess`.


In [None]:

def model_zoo(task_type: str):
    if task_type == "regression":
        models = {
            "baseline_ridge": Ridge(random_state=SEED),
            "lasso": Lasso(max_iter=5000, random_state=SEED),
            "elasticnet": ElasticNet(max_iter=5000, random_state=SEED),
            "rf": RandomForestRegressor(n_estimators=300, random_state=SEED, n_jobs=-1),
            "hgb": HistGradientBoostingRegressor(random_state=SEED),
            "svr": SVR(),
        }
    else:
        models = {
            "logreg": LogisticRegression(max_iter=2000, n_jobs=-1),
            "rf": RandomForestClassifier(n_estimators=400, random_state=SEED, n_jobs=-1),
            "hgb": HistGradientBoostingClassifier(random_state=SEED),
            "svc": SVC(probability=True, random_state=SEED),
        }
    return models

models = model_zoo(task_cfg.task_type)
list(models.keys())


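## 7. Tuning Budget (Same Budget for a Fair Comparison)
A small hand-written `ParameterGrid` per model: fast, controllable, and easy to debug.
Keep each model to no more than roughly 5 to 20 combinations.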
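
A quick way to keep grids within budget is to count combinations before running anything. The sketch below uses only the standard library; `grid_size`, `expand_grid`, `demo_grids`, and `BUDGET` are illustrative names, with `BUDGET = 20` taken as the upper end of the range stated above.

```python
from itertools import product

def grid_size(grid: dict) -> int:
    """Number of combinations in a {param: [values]} grid."""
    n = 1
    for values in grid.values():
        n *= len(values)
    return n

def expand_grid(grid: dict) -> list:
    """Enumerate every combination as a dict, in a deterministic key order."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

demo_grids = {
    "ridge": {"model__alpha": [0.1, 1.0, 10.0]},
    "hgb": {"model__learning_rate": [0.05, 0.1],
            "model__max_depth": [None, 6],
            "model__max_leaf_nodes": [31, 63]},
}
BUDGET = 20  # assumed cap per model
sizes = {name: grid_size(g) for name, g in demo_grids.items()}
over_budget = [name for name, s in sizes.items() if s > BUDGET]
print(sizes, "over budget:", over_budget)
```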


In [None]:

def param_grids(task_type: str):
    if task_type == "regression":
        grids = {
            "baseline_ridge": {"model__alpha": [0.1, 1.0, 10.0]},
            "lasso": {"model__alpha": [1e-4, 1e-3, 1e-2]},
            "elasticnet": {"model__alpha": [1e-4, 1e-3], "model__l1_ratio": [0.2, 0.5, 0.8]},
            "rf": {"model__max_depth": [None, 8, 16], "model__min_samples_leaf": [1, 5]},
            "hgb": {"model__learning_rate": [0.05, 0.1], "model__max_depth": [None, 6], "model__max_leaf_nodes": [31, 63]},
            "svr": {"model__C": [0.5, 1.0, 2.0], "model__gamma": ["scale", "auto"]},
        }
    else:
        grids = {
            "logreg": {"model__C": [0.5, 1.0, 2.0]},
            "rf": {"model__max_depth": [None, 8, 16], "model__min_samples_leaf": [1, 5]},
            "hgb": {"model__learning_rate": [0.05, 0.1], "model__max_depth": [None, 6], "model__max_leaf_nodes": [31, 63]},
            "svc": {"model__C": [0.5, 1.0, 2.0], "model__gamma": ["scale", "auto"]},
        }
    return grids

grids = param_grids(task_cfg.task_type)
{ k: len(list(ParameterGrid(v))) for k,v in grids.items() }


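## 8. Unified Train/Validate Loop (Walk-forward)
Outputs:
- Per-fold scores, mean, and standard deviation for every (model, params) pair
- Training time
- The best configuration, re-checked on the TVT split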
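
The aggregation step can be sketched in isolation. The fold scores below are made-up numbers, not results from any real run; the point is only to show how mean and standard deviation per model turn per-fold rows into a leaderboard.

```python
import pandas as pd

fold_scores = pd.DataFrame({
    "model": ["ridge"] * 3 + ["rf"] * 3,
    "fold":  [0, 1, 2] * 2,
    "mae":   [1.0, 1.2, 1.1, 0.9, 1.5, 0.8],   # hypothetical per-fold MAE
})

g = fold_scores.groupby("model")["mae"]
board = pd.DataFrame({
    "mae_mean": g.mean(),
    "mae_std": g.std(ddof=0),   # population std, matching ddof=0 in the search loop
    "n_folds": g.size(),
}).sort_values("mae_mean")
print(board)
```

In this toy case `rf` has the lower mean but a much larger spread across folds, which is the kind of trade-off worth raising when presenting the leaderboard.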


In [None]:

def fit_predict(est: Pipeline, X_train, y_train, X_valid):
    est.fit(X_train, y_train)
    if task_cfg.task_type == "classification":
        # always return the positive-class probability
        if hasattr(est, "predict_proba"):
            p = est.predict_proba(X_valid)[:, 1]
            return p
        # fallback
        s = est.decision_function(X_valid)
        # map scores to (0,1) via sigmoid as a rough surrogate
        p = 1.0 / (1.0 + np.exp(-s))
        return p
    else:
        return est.predict(X_valid)

def score_on_folds(df, folds, pipe_template, task_type):
    rows = []
    X_all, y_all = make_xy(df, target_col)
    for fold_id, (tr_idx, va_idx) in enumerate(folds):
        X_tr, y_tr = X_all.iloc[tr_idx], y_all[tr_idx]
        X_va, y_va = X_all.iloc[va_idx], y_all[va_idx]
        t0 = time.time()
        # fit a fresh clone per fold so no fitted state is shared between folds
        y_hat = fit_predict(clone(pipe_template), X_tr, y_tr, X_va)
        dt = time.time() - t0
        m = eval_metrics(y_va, y_hat, task_type)
        m["fold"] = fold_id
        m["train_size"] = int(len(tr_idx))
        m["valid_size"] = int(len(va_idx))
        m["fit_time_sec"] = float(dt)
        rows.append(m)
    return pd.DataFrame(rows)

import hashlib

def run_model_search(df, folds, preprocess, models, grids, task_type, main_metric):
    results = []
    for name, est in models.items():
        grid = grids.get(name, {})
        for params in ParameterGrid(grid):
            pipe = Pipeline(steps=[("preprocess", preprocess), ("model", clone(est))])
            pipe.set_params(**params)
            fold_df = score_on_folds(df, folds, pipe, task_type)
            params_json = json.dumps(params, sort_keys=True)
            results.append({
                "model": name,
                "params": params_json,
                f"{main_metric}_mean": float(fold_df[main_metric].mean()),
                f"{main_metric}_std": float(fold_df[main_metric].std(ddof=0)),
                "total_fit_time_sec": float(fold_df["fit_time_sec"].sum()),
                "n_folds": int(fold_df.shape[0]),
            })
            # Per-fold details. The built-in hash() of a str changes between interpreter runs,
            # so use a content digest to keep file names reproducible.
            tag = hashlib.md5(params_json.encode("utf-8")).hexdigest()[:10]
            fold_df.to_csv(RUN_DIR / f"folds__{name}__{tag}.csv", index=False)
    res_df = pd.DataFrame(results)
    # Lower is better for mae / rmse / logloss; higher is better for auc.
    ascending = not (task_type == "classification" and main_metric == "auc")
    return res_df.sort_values(by=f"{main_metric}_mean", ascending=ascending)

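# ===== Example run (uncomment when df is ready) =====
# Build folds on the train+valid prefix only, so the test block never influences model selection.
# idx_train, idx_valid, idx_test = time_based_split(df, time_col, split_cfg.valid_frac, split_cfg.test_frac)
# df_dev = df.iloc[:len(idx_train) + len(idx_valid)]
# folds = walk_forward_folds(df_dev, time_col, n_folds=5, valid_size_frac=0.1, min_train_size=split_cfg.min_train_size)
# leaderboard = run_model_search(df_dev, folds, preprocess, models, grids, task_cfg.task_type, task_cfg.main_metric)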
# leaderboard.to_csv(RUN_DIR / "leaderboard_cv.csv", index=False)
# display(leaderboard.head(20))


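## 9. Re-check the Best Model on TVT + Final Training
Steps:
1) Take the top-1 (or top-K) from the leaderboard
2) Fit on train, evaluate on valid (sanity check)
3) Refit on train+valid combined
4) Report the final score on test
5) Persist: pipeline + config + feature names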


In [None]:

import pickle

def refit_and_eval_tvt(df, idx_train, idx_valid, idx_test, preprocess, model_est, params: dict, task_type: str):
    # train
    X_all, y_all = make_xy(df, target_col)
    X_tr, y_tr = X_all.iloc[idx_train], y_all[idx_train]
    X_va, y_va = X_all.iloc[idx_valid], y_all[idx_valid]
    X_te, y_te = X_all.iloc[idx_test],  y_all[idx_test]

    pipe = Pipeline(steps=[("preprocess", preprocess), ("model", clone(model_est))])
    pipe.set_params(**params)

    # fit on train, eval on valid
    y_hat_va = fit_predict(pipe, X_tr, y_tr, X_va)
    va_metrics = eval_metrics(y_va, y_hat_va, task_type)

    # fit on train+valid, eval on test
    trva_idx = np.concatenate([idx_train, idx_valid])
    X_trva, y_trva = X_all.iloc[trva_idx], y_all[trva_idx]
    pipe.fit(X_trva, y_trva)
    if task_type == "classification":
        if hasattr(pipe, "predict_proba"):
            # already a probability: do not apply a sigmoid on top
            y_hat_te = pipe.predict_proba(X_te)[:, 1]
        else:
            # fallback: squash decision scores into (0, 1); a rough surrogate, not calibrated
            s = np.asarray(pipe.decision_function(X_te)).ravel()
            y_hat_te = 1.0 / (1.0 + np.exp(-s))
    else:
        y_hat_te = pipe.predict(X_te)
    te_metrics = eval_metrics(y_te, y_hat_te, task_type)

    return pipe, va_metrics, te_metrics, (y_te, y_hat_te)

# ===== Example usage =====
# idx_train, idx_valid, idx_test = time_based_split(df, time_col, split_cfg.valid_frac, split_cfg.test_frac)
# best = leaderboard.iloc[0]
# best_name = best["model"]
# best_params = json.loads(best["params"])
# pipe_final, va_m, te_m, (y_te, yhat_te) = refit_and_eval_tvt(
#     df, idx_train, idx_valid, idx_test,
#     preprocess, models[best_name], best_params, task_cfg.task_type
# )
# print("VALID:", va_m)
# print("TEST :", te_m)

# ===== Save artifact =====
# artifact = {
#     "time_col": time_col,
#     "target_col": target_col,
#     "group_col": group_col,
#     "task_cfg": asdict(task_cfg),
#     "split_cfg": asdict(split_cfg),
#     "best_model": best_name,
#     "best_params": best_params,
#     "valid_metrics": va_m,
#     "test_metrics": te_m,
#     "run_dir": str(RUN_DIR),
# }
# with open(RUN_DIR / "artifact_meta.json", "w") as f:
#     json.dump(artifact, f, indent=2)
# with open(RUN_DIR / "pipeline.pkl", "wb") as f:
#     pickle.dump(pipe_final, f)


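## 10. Stability: Evaluation by Time Slice (Core Reporting Plot)
Outputs:
- The metric for each time bucket (e.g., monthly / weekly / daily)
- A drift check: does performance degrade toward later buckets?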
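
A compact illustration of the bucketing idea, on synthetic data: three weeks of daily rows with an error that is made to grow week by week. The numbers are invented purely to show what a drifting metric looks like after grouping by period.

```python
import numpy as np
import pandas as pd

times = pd.date_range("2024-01-01", periods=21, freq="D")   # starts on a Monday: three full Mon-Sun weeks
y_true = np.zeros(21)
# Hypothetical predictions whose error grows each week, mimicking drift.
y_pred = np.repeat([0.1, 0.2, 0.4], 7)

buckets = pd.Series(times).dt.to_period("W").astype(str)
abs_err = pd.Series(np.abs(y_true - y_pred))
weekly_mae = abs_err.groupby(buckets.values).mean()
print(weekly_mae)
```

A metric that rises monotonically across buckets, as in this toy output, is the pattern to look for in the real time-slice table.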


In [None]:

def time_slice_scores(df, idx, y_true, y_pred, time_col: str, freq: str, task_type: str):
    # freq: "W" / "M" / "D" etc
    t = pd.to_datetime(df.iloc[idx][time_col]).dt.to_period(freq).astype(str)
    tmp = pd.DataFrame({"bucket": t.values, "y": y_true, "pred": y_pred})
    out_rows = []
    for b, g in tmp.groupby("bucket", sort=True):
        m = eval_metrics(g["y"].values, g["pred"].values, task_type)
        m["bucket"] = b
        m["n"] = int(g.shape[0])
        out_rows.append(m)
    out = pd.DataFrame(out_rows).sort_values("bucket")
    return out

def plot_time_metric(ts_df: pd.DataFrame, metric: str):
    plt.figure()
    plt.plot(ts_df["bucket"], ts_df[metric])
    plt.xticks(rotation=45, ha="right")
    plt.xlabel("time bucket")
    plt.ylabel(metric)
    plt.title(f"{metric} by time bucket")
    plt.tight_layout()
    plt.show()

# ===== Example =====
# ts = time_slice_scores(df, idx_test, y_te, yhat_te, time_col=time_col, freq="W", task_type=task_cfg.task_type)
# ts.to_csv(RUN_DIR / "test_time_slices.csv", index=False)
# display(ts.head())
# plot_time_metric(ts, task_cfg.main_metric)


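## 11. Error Diagnostics: Residuals / Quantiles / Tails
Regression:
- residual distribution
- pred vs true
- absolute-error quantiles

Classification:
- binned calibration (simple version)
- top-decile lift (ranked by predicted probability)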
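
The binned calibration check listed above is not implemented in the cell that follows, so here is one possible minimal version. `calibration_table` is a hypothetical helper using equal-width probability bins; other binning schemes (for example quantile bins) would work as well.

```python
import numpy as np
import pandas as pd

def calibration_table(y_true, p_pred, n_bins: int = 5) -> pd.DataFrame:
    """Equal-width probability bins: mean predicted probability vs. observed positive rate."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against interior edges yields bin ids 0..n_bins-1 (p == 1.0 lands in the last bin).
    bin_id = np.digitize(p_pred, edges[1:-1], right=False)
    tab = (pd.DataFrame({"bin": bin_id, "p": p_pred, "y": y_true})
           .groupby("bin")
           .agg(n=("y", "size"), mean_pred=("p", "mean"), pos_rate=("y", "mean"))
           .reset_index())
    tab["gap"] = tab["pos_rate"] - tab["mean_pred"]   # positive gap: model under-predicts in this bin
    return tab

p_demo = np.array([0.05, 0.15, 0.45, 0.55, 0.85, 0.95])
y_demo = np.array([0, 0, 0, 1, 1, 1])
tab_demo = calibration_table(y_demo, p_demo, n_bins=2)
print(tab_demo)
```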


In [None]:

def regression_diagnostics(y_true, y_pred):
    resid = y_true - y_pred

    plt.figure()
    plt.hist(resid, bins=50)
    plt.title("Residual histogram")
    plt.xlabel("y - pred")
    plt.ylabel("count")
    plt.tight_layout()
    plt.show()

    plt.figure()
    plt.scatter(y_pred, y_true, s=6)
    plt.title("Pred vs True")
    plt.xlabel("pred")
    plt.ylabel("true")
    plt.tight_layout()
    plt.show()

    ae = np.abs(resid)
    qs = np.quantile(ae, [0.5, 0.8, 0.9, 0.95, 0.99])
    return {"abs_err_quantiles": qs.tolist()}

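def classification_topk(y_true, p_pred, k_frac=0.1):
    # positional indexing below assumes arrays, so convert pandas inputs first
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    n = len(y_true)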
    k = max(int(n * k_frac), 1)
    order = np.argsort(-p_pred)
    top = y_true[order[:k]].mean()
    base = y_true.mean()
    lift = (top / base) if base > 0 else float("nan")
    return {"top_frac": k_frac, "top_mean": float(top), "base_mean": float(base), "lift": float(lift)}

# ===== Example =====
# if task_cfg.task_type == "regression":
#     diag = regression_diagnostics(y_te, yhat_te)
#     print(diag)
# else:
#     print(classification_topk(y_te, yhat_te, 0.1))


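## 12. Model Interpretation (Easier to Explain with Tree Models)
- RF: `feature_importances_`; HGB only if the attribute exists (scikit-learn's `HistGradientBoosting*` estimators typically do not expose it, in which case the helper below returns `None`)
- Linear models: `coef_` (longer after one-hot encoding)

Only a lightweight output here: top-N feature names + importance.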
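
For models without a built-in importance attribute, a model-agnostic alternative is permutation importance. The sketch below is a hand-rolled NumPy version on a toy problem. `permutation_importance_mae` is an illustrative name and the "model" is just a lambda; scikit-learn also ships `sklearn.inspection.permutation_importance`, which would be the natural choice for a fitted pipeline.

```python
import numpy as np

def permutation_importance_mae(predict, X, y, n_repeats=5, seed=42):
    """Mean increase in MAE when one column is shuffled; larger means more important."""
    rng = np.random.default_rng(seed)
    base = np.abs(y - predict(X)).mean()
    out = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between column j and y
            deltas.append(np.abs(y - predict(Xp)).mean() - base)
        out[j] = np.mean(deltas)
    return out

# Toy setup: y depends only on column 0, and the "model" is the true function.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0]
imp = permutation_importance_mae(lambda X: 3.0 * X[:, 0], X_demo, y_demo)
print(imp)
```

Column 1 gets exactly zero importance because the toy model ignores it, while shuffling column 0 clearly degrades the error.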


In [None]:

def get_feature_names(pipe: Pipeline):
    pre = pipe.named_steps["preprocess"]
    try:
        names = pre.get_feature_names_out()
        return list(names)
    except Exception:
        return None

def top_feature_importance(pipe: Pipeline, topn: int = 30):
    model = pipe.named_steps["model"]
    names = get_feature_names(pipe)
    if hasattr(model, "feature_importances_"):
        imp = model.feature_importances_
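    elif hasattr(model, "coef_"):
        # coef_ may be 2-D (e.g., (1, n_features) for binary LogisticRegression); average |coef| over rows
        imp = np.abs(np.atleast_2d(model.coef_)).mean(axis=0)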
    else:
        return None

    if names is None:
        names = [f"f{i}" for i in range(len(imp))]
    s = pd.DataFrame({"feature": names, "importance": imp})
    s = s.sort_values("importance", ascending=False).head(topn)
    return s

# ===== Example =====
# fi = top_feature_importance(pipe_final, topn=25)
# if fi is not None:
#     fi.to_csv(RUN_DIR / "top_features.csv", index=False)
#     display(fi)


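## 13. Deployment Interface: Offline Artifact + Online Inference Function
Deliverables:
- `pipeline.pkl`: preprocess + model
- `artifact_meta.json`: fields, version, metrics, training time window
- `predict_batch(pipe, df_new)`: takes a batch of new data, returns predictions

Hard constraints at deployment:
- Input data must contain the same column names used in training
- Time/lag feature generation must match training (reuse the same code)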
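
The first constraint can be enforced with a small guard in front of the pipeline. This is a sketch under assumptions: `validate_input` is a hypothetical helper, and the `expected` column list is assumed to be stored alongside the artifact (the metadata written earlier in this notebook does not include it yet).

```python
import pandas as pd

def validate_input(df_new: pd.DataFrame, expected_cols: list) -> pd.DataFrame:
    """Fail fast on missing columns; drop unknown extras and fix the column order."""
    missing = [c for c in expected_cols if c not in df_new.columns]
    if missing:
        raise ValueError(f"Missing columns at inference time: {missing}")
    return df_new.loc[:, expected_cols]

expected = ["timestamp", "x1", "x2"]   # hypothetical; would be saved with the artifact
good = pd.DataFrame({"x2": [1.0], "x1": [2.0], "timestamp": ["2024-01-01"], "extra": [0]})
clean = validate_input(good, expected)
print(list(clean.columns))
```

Raising early on a missing column is usually easier to debug than a shape or key error from deep inside the preprocessing step.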


In [None]:

# ===== Load & Predict (deployment stub) =====
def load_pipeline(run_dir: str | Path):
    run_dir = Path(run_dir)
    with open(run_dir / "pipeline.pkl", "rb") as f:
        pipe = pickle.load(f)
    with open(run_dir / "artifact_meta.json", "r") as f:
        meta = json.load(f)
    return pipe, meta

def predict_batch(pipe: Pipeline, df_new: pd.DataFrame):
    # df_new: feature columns only (must not include the target)
    # if training used add_time_features / add_lag_rolling_features, apply the same steps here
    if task_cfg.task_type == "classification":
        if hasattr(pipe, "predict_proba"):
            return pipe.predict_proba(df_new)[:, 1]
        s = pipe.decision_function(df_new)
        return 1.0 / (1.0 + np.exp(-s))
    else:
        return pipe.predict(df_new)

# ===== Example =====
# pipe_loaded, meta = load_pipeline(RUN_DIR)
# preds = predict_batch(pipe_loaded, df_new)


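## 14. Interview-Day Schedule (9:00 to 18:00)
### 9:00–10:00 Read the Task + Quick Data Health Check
- Pin down the label, time column, usable features, and off-limits information
- Sketch the data timeline: coverage, gaps, anomalies

### 10:00–12:00 Baseline + First Pipeline
- TVT split + the most basic features + Ridge/LogReg
- Leakage check: is every fit restricted to train?

### 12:00–14:30 More Features + Tree Models/Boosting
- Add lag/rolling features (where applicable)
- Small-grid tuning for RF/HGB

### 14:30–16:30 Stability and Diagnostics
- Scores by time slice
- Residuals / tails / subsamples

### 16:30–17:30 Final Selection + Persisting
- Model choice: score + stability + complexity
- Save `pipeline.pkl` + `artifact_meta.json`

### 17:30–18:00 Presentation Notes
- Problem definition, split, features, model comparison, diagnostics, final choice, and deployment interface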


## 15. Final Report Template (10 Pages or Fewer)
1) Problem & Target
2) Data overview + leakage constraints
3) Split scheme (time-based / walk-forward)
4) Feature set v1 (baseline) + results
5) Feature set v2 (time features / lag/rolling) + results
6) Model comparison leaderboard
7) Stability by time slices
8) Diagnostics (residual / calibration / tail)
9) Final model choice (trade-offs)
10) Deployment: artifact + predict() + monitoring
