# Dataset Interview: LSTM Training and Deployment Cheat Sheet for Time-Series Modeling (Copy-Paste Ready)

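> Goal: within limited time, close the loop of **data loading → time split → features/windows → baseline → LSTM → evaluation → result export → talking points**.  
> Style: an operations manual for personal use; run it top to bottom.  
> Constraint: strictly prevent leakage in the time series; every `fit` runs on train only; val/test are only ever passed through `transform`.  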


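## 0. Global configuration (pin these down first)

- Prediction task: start with single-step `horizon=1`; extend to multi-step later.
- Input window: grow `lookback` step by step (e.g. 30 → 60 → 120).
- Validation: time-based split (walk-forward if time allows).
- Output: a reproducible `run_id`, with the model, scaler, predictions, and plots saved.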


In [None]:
# ==== 0) Environment and random seeds ====
import os, time, json, math, random
import numpy as np
import pandas as pd

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

RUN_ID = time.strftime("%Y%m%d_%H%M%S")
ART_DIR = f"artifacts_{RUN_ID}"
os.makedirs(ART_DIR, exist_ok=True)

print("RUN_ID:", RUN_ID)
print("ART_DIR:", ART_DIR)
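
The cell above seeds Python and NumPy only. A minimal sketch of a single helper that also seeds PyTorch when it is installed; `seed_everything` is a hypothetical name, not part of the original notebook, and full GPU determinism may need further settings that are not covered here.

```python
import random

import numpy as np


def seed_everything(seed: int = 42) -> None:
    """Seed the Python, NumPy and (if installed) PyTorch random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional dependency, mirrors the TORCH_OK guard
    except Exception:
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# Re-seeding with the same value reproduces the same random draws.
seed_everything(123)
first = (random.random(), np.random.rand())
seed_everything(123)
second = (random.random(), np.random.rand())
print(first == second)
```

Calling it once at the top of the notebook, and again right before model training, keeps repeated runs of a single cell comparable.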


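## 1. Dependency imports (keep them light)

- Tabular baseline: `sklearn`
- LSTM: `torch` (if the environment has no torch, install it on the spot; if installation does not go smoothly, deliver the baseline only)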


In [None]:
# ==== 1) Dependency imports ====
from dataclasses import dataclass
from typing import Optional, Tuple, Dict

from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

# Optional: if the metric is directional/classification, switch to LogisticRegression / XGBoost (if allowed), etc.


In [None]:
# ==== 1b) Torch (optional) ====
TORCH_OK = True
try:
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    torch.manual_seed(SEED)  # seed torch as well, so LSTM runs are repeatable
    print("torch:", torch.__version__, "| cuda:", torch.cuda.is_available())
except Exception as e:
    TORCH_OK = False
    print("Torch not available:", repr(e))


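## 2. Data loading and basic checks

### The input data is assumed to contain at least:
- Time column: `timestamp` (or an equivalent name)
- Target column: `y` (or an equivalent name)
- Feature columns: if the raw data is a panel (asset/region), filter or group it before modeling

### Expected output:
- A DataFrame sorted by time: `df`
- No out-of-order rows allowed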


In [None]:
# ==== 2) Data loading (adjust the path/column names to the actual data) ====
DATA_PATH = "data.csv"  # TODO: change this

# Common formats: csv/parquet
# df = pd.read_csv(DATA_PATH)
# df = pd.read_parquet(DATA_PATH)

# Placeholder for now
df = None
print("Set DATA_PATH and load df")


In [None]:
# ==== 2b) Schema checks (uncomment after loading) ====
# assert df is not None
# print(df.shape)
# print(df.columns.tolist()[:50])
# df.head()


In [None]:
# ==== 2c) Column name alignment (edit here after loading) ====
# TIME_COL = "timestamp"
# TARGET_COL = "y"
# ID_COL = None  # for panel data set to "asset" / "region" etc.; leave None for a single series
# EXTRA_DROP_COLS = []

# df = df.drop(columns=EXTRA_DROP_COLS)
# df[TIME_COL] = pd.to_datetime(df[TIME_COL])
# df = df.sort_values([ID_COL, TIME_COL] if ID_COL else [TIME_COL]).reset_index(drop=True)

# # Missing-value overview
# print(df.isna().mean().sort_values(ascending=False).head(20))
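
Since out-of-order rows are not allowed, a quick report on the time column can catch problems before any split. This is a minimal sketch; `check_time_index` is a hypothetical helper, and using the median gap as the "typical step" is an assumption that suits roughly regular sampling.

```python
import pandas as pd


def check_time_index(df: pd.DataFrame, time_col: str) -> dict:
    """Report ordering, duplicate and gap statistics for a timestamp column."""
    ts = pd.to_datetime(df[time_col])
    deltas = ts.diff().dropna()
    step = deltas.median()  # assumed "typical" sampling step
    return {
        "is_sorted": bool(ts.is_monotonic_increasing),
        "n_duplicates": int(ts.duplicated().sum()),
        "typical_step": step,
        "n_gaps": int((deltas > step).sum()),  # intervals longer than the typical step
    }


# Toy frame: one duplicated timestamp and one 3-day gap in daily data.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03",
        "2024-01-03", "2024-01-04", "2024-01-07",
    ]),
    "y": [1.0, 2.0, 3.0, 3.5, 4.0, 5.0],
})
report = check_time_index(demo, "timestamp")
print(report)
```

Duplicates and gaps both matter later: lag and window features count rows, not elapsed time, so an unnoticed gap silently changes what "60 steps back" means.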


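## 3. Time-based split (leakage is an automatic disqualifier)

Split strategy:
- Single series: cut `train / val / test` by time
- Panel data: group by `ID_COL` first, then cut each group by time; or pick one ID for a demo and extend later

Implemented here first: the most robust split for a **single series**.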


In [None]:
# ==== 3) Single-series time split ====
@dataclass
class SplitConfig:
    train_frac: float = 0.7
    val_frac: float = 0.15
    # test_frac = 1 - train_frac - val_frac

def time_split(df: pd.DataFrame, time_col: str, cfg: SplitConfig):
    # Positional slicing is only a time split if the rows are already in time order
    assert df[time_col].is_monotonic_increasing, "sort df by time_col before splitting"
    n = len(df)
    i1 = int(n * cfg.train_frac)
    i2 = int(n * (cfg.train_frac + cfg.val_frac))
    train = df.iloc[:i1].copy()
    val = df.iloc[i1:i2].copy()
    test = df.iloc[i2:].copy()
    return train, val, test

# Usage:
# train_df, val_df, test_df = time_split(df, TIME_COL, SplitConfig())
# print(len(train_df), len(val_df), len(test_df))
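
If time allows, the single split can be replaced by walk-forward validation. The sketch below is a hand-rolled expanding-window version in plain NumPy; `walk_forward_splits` and its parameters are hypothetical, and `sklearn.model_selection.TimeSeriesSplit` (already imported above) is a library alternative.

```python
import numpy as np


def walk_forward_splits(n: int, n_folds: int = 3, val_size: int = 50, min_train: int = 100):
    """Yield (train_idx, val_idx); train expands, val always comes after train."""
    for k in range(n_folds):
        val_end = n - (n_folds - 1 - k) * val_size
        val_start = val_end - val_size
        if val_start < min_train:
            continue  # not enough history for this fold
        yield np.arange(0, val_start), np.arange(val_start, val_end)


folds = list(walk_forward_splits(n=400, n_folds=3, val_size=50, min_train=100))
for train_idx, val_idx in folds:
    print(f"train=[0, {train_idx[-1]}]  val=[{val_idx[0]}, {val_idx[-1]}]")
```

Each fold refits the scaler and the model on its own train indices only; averaging the per-fold validation metrics gives a steadier estimate than one cut.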


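## 4. Basic cleaning and missing values (get a runnable version first)

Principles:
- Missing target: drop the row (otherwise the supervision signal is dirty)
- Missing features: `ffill/bfill` first, then `fillna(0)` for whatever remains; optionally keep missing-indicator columns. Note that `bfill` copies later values backward, so it looks ahead within the split.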


In [None]:
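# ==== 4) Missing-value handling (adjust the rules to the actual data) ====
def basic_clean(df: pd.DataFrame, target_col: str, time_col: str):
    df = df.copy()
    # Drop rows with a missing target
    df = df.dropna(subset=[target_col])
    # Feature column set
    feat_cols = [c for c in df.columns if c not in [target_col, time_col]]
    # Missing-indicator columns (optional)
    for c in feat_cols:
        if df[c].isna().any():
            df[f"{c}__isna"] = df[c].isna().astype(np.int8)
    # Simple imputation; bfill only touches leading gaps, but it fills them
    # with later values, so drop it if strict causality is required
    df[feat_cols] = df[feat_cols].ffill().bfill()
    df[feat_cols] = df[feat_cols].fillna(0.0)
    return df

# Usage: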
# train_df = basic_clean(train_df, TARGET_COL, TIME_COL)
# val_df   = basic_clean(val_df, TARGET_COL, TIME_COL)
# test_df  = basic_clean(test_df, TARGET_COL, TIME_COL)
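
To make the look-ahead in `bfill` concrete, and to show one stricter alternative, here is a small sketch. `causal_fill` is a hypothetical helper, and falling back to train medians is an assumption rather than the notebook's rule.

```python
import numpy as np
import pandas as pd


def causal_fill(df: pd.DataFrame, cols, fill_values) -> pd.DataFrame:
    """Forward-fill only, then fall back to constants supplied by the caller."""
    out = df.copy()
    out[cols] = out[cols].ffill()               # uses past values only
    out[cols] = out[cols].fillna(fill_values)   # e.g. medians computed on train
    return out


# bfill gives the leading gaps a value observed two steps later.
s = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0])
print(s.ffill().bfill().tolist())   # [3.0, 3.0, 3.0, 3.0, 5.0]

train = pd.DataFrame({"x": [np.nan, 1.0, np.nan, 3.0]})
val = pd.DataFrame({"x": [np.nan, 4.0, np.nan]})
medians = train[["x"]].median()                 # fitted on train only
val_filled = causal_fill(val, ["x"], medians)
print(val_filled["x"].tolist())     # [2.0, 4.0, 4.0]
```

The fallback constants are computed once on train and reused for val/test, which matches the fit-on-train-only rule at the top of this document.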


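## 5. Feature engineering (for the tabular baseline)

Minimum set:
- lag: `y_{t-1}, y_{t-2}, ...`, or lags of key features
- rolling: mean/variance/extremes (apply `shift(1)` before rolling)
- time features: hour/day-of-week/month (if there is clear seasonality)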


In [None]:
# ==== 5) lag/rolling features (single-series version) ====
def add_lag_rolling_features(df: pd.DataFrame, time_col: str, target_col: str,
                            lags=(1,2,3,5,10),
                            windows=(5,10,20)):
    df = df.copy()
    df = df.sort_values(time_col).reset_index(drop=True)

    # target lag
    for k in lags:
        df[f"{target_col}_lag{k}"] = df[target_col].shift(k)

    # rolling on shifted target (prevents leakage)
    shifted = df[target_col].shift(1)
    for w in windows:
        df[f"{target_col}_roll_mean{w}"] = shifted.rolling(w).mean()
        df[f"{target_col}_roll_std{w}"]  = shifted.rolling(w).std()
        df[f"{target_col}_roll_min{w}"]  = shifted.rolling(w).min()
        df[f"{target_col}_roll_max{w}"]  = shifted.rolling(w).max()

    # Time features (only if time_col is a datetime column; this check also accepts tz-aware dtypes)
    if pd.api.types.is_datetime64_any_dtype(df[time_col]):
        df["hour"] = df[time_col].dt.hour.astype(np.int16)
        df["dow"]  = df[time_col].dt.dayofweek.astype(np.int16)
        df["month"]= df[time_col].dt.month.astype(np.int16)

    return df

# Usage:
# train_df = add_lag_rolling_features(train_df, TIME_COL, TARGET_COL)
# val_df   = add_lag_rolling_features(val_df, TIME_COL, TARGET_COL)
# test_df  = add_lag_rolling_features(test_df, TIME_COL, TARGET_COL)
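
One way to check that a feature builder respects the `shift(1)` rule is to perturb the target from row `t` onward and confirm that features at rows up to `t` do not move. This is a minimal sketch with two simplified, hypothetical builders (one causal, one deliberately leaky); it is not the full `add_lag_rolling_features`.

```python
import numpy as np
import pandas as pd


def causal_features(y: pd.Series) -> pd.DataFrame:
    """Lag-1 value and a 3-step rolling mean computed on the shifted series."""
    shifted = y.shift(1)
    return pd.DataFrame({"lag1": shifted, "roll_mean3": shifted.rolling(3).mean()})


def leaky_features(y: pd.Series) -> pd.DataFrame:
    """Same rolling mean WITHOUT the shift: row t sees y[t] itself."""
    return pd.DataFrame({"roll_mean3": y.rolling(3).mean()})


def is_causal(build, y: pd.Series, t: int) -> bool:
    """True if features at rows <= t are unchanged when y[t:] is perturbed."""
    y2 = y.copy()
    y2.iloc[t:] += 1000.0
    a = build(y).iloc[: t + 1].to_numpy()
    b = build(y2).iloc[: t + 1].to_numpy()
    return bool(np.allclose(a, b, equal_nan=True))


y = pd.Series(np.arange(20, dtype=float))
print(is_causal(causal_features, y, t=10))  # True
print(is_causal(leaky_features, y, t=10))   # False
```

The same check can wrap any feature function that takes the target series, which makes it a cheap regression test whenever a new rolling feature is added.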


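## 6. Tabular baseline training (score points first)

Workflow:
1) Select feature columns
2) Scaler: fit on train only
3) Model: run Ridge first; then try RandomForest (tune only if time allows)
4) Report val metrics and plot the prediction curve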


In [None]:
# ==== 6) Select features + align columns ====
def get_feature_columns(df: pd.DataFrame, time_col: str, target_col: str):
    drop = {time_col, target_col}
    return [c for c in df.columns if c not in drop]

def align_columns(train_df, val_df, test_df, feat_cols):
    # Fill missing columns with 0; ignore extra columns
    def _align(d):
        d = d.copy()
        for c in feat_cols:
            if c not in d.columns:
                d[c] = 0.0
        return d[feat_cols]
    return _align(train_df), _align(val_df), _align(test_df)

def drop_na_rows(df: pd.DataFrame, cols):
    return df.dropna(subset=cols)

# Usage:
# feat_cols = get_feature_columns(train_df, TIME_COL, TARGET_COL)
# train_df2 = drop_na_rows(train_df, feat_cols + [TARGET_COL])
# val_df2   = drop_na_rows(val_df, feat_cols + [TARGET_COL])
# test_df2  = drop_na_rows(test_df, feat_cols + [TARGET_COL])

# Xtr, Xva, Xte = align_columns(train_df2, val_df2, test_df2, feat_cols)
# ytr, yva, yte = train_df2[TARGET_COL].values, val_df2[TARGET_COL].values, test_df2[TARGET_COL].values


In [None]:
# ==== 6b) Baseline model training and evaluation ====
def eval_regression(y_true, y_pred, prefix=""):
    mse = mean_squared_error(y_true, y_pred)
    rmse = math.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    out = {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}
    print(prefix, json.dumps(out, ensure_ascii=False))
    return out

# Usage:
# scaler = RobustScaler()  # heavy tails are common, so start with robust
# Xtr_s = scaler.fit_transform(Xtr)
# Xva_s = scaler.transform(Xva)
# Xte_s = scaler.transform(Xte)

# ridge = Ridge(alpha=1.0, random_state=SEED)
# ridge.fit(Xtr_s, ytr)
# pva = ridge.predict(Xva_s)
# pte = ridge.predict(Xte_s)

# val_metrics = eval_regression(yva, pva, "Ridge val:")
# test_metrics= eval_regression(yte, pte, "Ridge test:")

# # Save
# import joblib
# joblib.dump(scaler, f"{ART_DIR}/scaler.joblib")
# joblib.dump(ridge,  f"{ART_DIR}/ridge.joblib")
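
Before reading much into Ridge or LSTM numbers, it can help to compare them against a naive persistence forecast (predict each value with the previous one). This is an added sanity baseline, not one of the notebook's models; `persistence_forecast` and `rmse` are hypothetical helper names.

```python
import numpy as np


def persistence_forecast(y, horizon: int = 1):
    """Predict y[t] with y[t - horizon]; returns aligned (y_true, y_pred)."""
    y = np.asarray(y, dtype=float)
    return y[horizon:], y[:-horizon]


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


y_demo = np.array([1.0, 2.0, 4.0, 7.0])
y_true, y_pred = persistence_forecast(y_demo)
print(y_true, y_pred)                 # [2. 4. 7.] [1. 2. 4.]
print(round(rmse(y_true, y_pred), 4)) # 2.1602
```

If a trained model cannot beat this number on the same validation rows, the features or the alignment deserve a second look before any tuning.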


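## 7. LSTM input window construction (the core)

Definition: the window covering rows `t-lookback … t-1` is used to predict the target at row `t+horizon-1`.  
Input shape: `(N, lookback, n_features)`  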


In [None]:
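# ==== 7) make windows ====
def make_windows(X: np.ndarray, y: np.ndarray, lookback: int = 60, horizon: int = 1, stride: int = 1):
    X = np.asarray(X)
    y = np.asarray(y)
    T = len(X)
    assert len(y) == T, "X and y must have the same number of rows"
    end = T - horizon + 1
    if end <= lookback:
        # np.stack fails on an empty list, so raise a clearer error first
        raise ValueError(f"need more than {lookback + horizon - 1} rows, got {T}")
    Xw, yw = [], []
    for t in range(lookback, end, stride):
        Xw.append(X[t - lookback:t])      # input rows t-lookback .. t-1
        yw.append(y[t + horizon - 1])     # target: horizon steps after the last input row
    return np.stack(Xw), np.asarray(yw)

# Usage: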
# LOOKBACK = 60
# HORIZON = 1
# Xtr_w, ytr_w = make_windows(Xtr_s, ytr, lookback=LOOKBACK, horizon=HORIZON)
# Xva_w, yva_w = make_windows(Xva_s, yva, lookback=LOOKBACK, horizon=HORIZON)
# Xte_w, yte_w = make_windows(Xte_s, yte, lookback=LOOKBACK, horizon=HORIZON)
# print(Xtr_w.shape, ytr_w.shape)
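
A toy run makes the index alignment easy to verify by eye. The sketch repeats the windowing loop from the cell above so it runs on its own; the data is synthetic, with each feature value equal to its row index.

```python
import numpy as np


def make_windows(X, y, lookback=60, horizon=1, stride=1):
    # Same loop as the make_windows cell above, repeated so this snippet runs alone.
    X, y = np.asarray(X), np.asarray(y)
    Xw, yw = [], []
    for t in range(lookback, len(X) - horizon + 1, stride):
        Xw.append(X[t - lookback:t])
        yw.append(y[t + horizon - 1])
    return np.stack(Xw), np.asarray(yw)


X = np.arange(10, dtype=float).reshape(10, 1)   # feature value == row index
y = np.arange(10, dtype=float) * 10             # target value == 10 * row index

Xw, yw = make_windows(X, y, lookback=3, horizon=1)
print(Xw.shape, yw.shape)      # (7, 3, 1) (7,)
print(Xw[0].ravel(), yw[0])    # [0. 1. 2.] 30.0

Xw2, yw2 = make_windows(X, y, lookback=3, horizon=2)
print(Xw2.shape, yw2[0])       # (6, 3, 1) 40.0
```

With `horizon=1` the first window holds rows 0 to 2 and targets row 3, so the target row never appears inside its own input window.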


## 8. LSTM model and training loop (minimal PyTorch implementation)

- Architecture: LSTM → last-step hidden state → linear head
- Training: Adam + MSE + early stopping + gradient clipping
- Saving: best state_dict


In [None]:
# ==== 8) LSTM regression model ====
if TORCH_OK:
    class LSTMReg(nn.Module):
        def __init__(self, n_features: int, hidden: int = 64, num_layers: int = 1, dropout: float = 0.0):
            super().__init__()
            self.lstm = nn.LSTM(
                input_size=n_features,
                hidden_size=hidden,
                num_layers=num_layers,
                batch_first=True,
                dropout=dropout if num_layers > 1 else 0.0
            )
            self.head = nn.Linear(hidden, 1)

        def forward(self, x):
            out, _ = self.lstm(x)      # (B, L, H)
            last = out[:, -1, :]       # (B, H)
            yhat = self.head(last).squeeze(-1)  # (B,)
            return yhat


In [None]:
# ==== 8b) Trainer ====
if TORCH_OK:
    def train_lstm(Xtr_w, ytr_w, Xva_w, yva_w,
                   hidden=64, num_layers=1, dropout=0.0,
                   lr=1e-3, batch=256, epochs=30, patience=3,
                   device=None):
        device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        Xtr_t = torch.tensor(Xtr_w, dtype=torch.float32)
        ytr_t = torch.tensor(ytr_w, dtype=torch.float32)
        Xva_t = torch.tensor(Xva_w, dtype=torch.float32)
        yva_t = torch.tensor(yva_w, dtype=torch.float32)

        dl = DataLoader(TensorDataset(Xtr_t, ytr_t), batch_size=batch, shuffle=False)

        model = LSTMReg(n_features=Xtr_w.shape[-1], hidden=hidden, num_layers=num_layers, dropout=dropout).to(device)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()

        best_loss = float("inf")
        best_state = None
        bad = 0

        for ep in range(1, epochs + 1):
            model.train()
            for xb, yb in dl:
                xb = xb.to(device); yb = yb.to(device)
                opt.zero_grad()
                pred = model(xb)
                loss = loss_fn(pred, yb)
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                opt.step()

            model.eval()
            with torch.no_grad():
                vpred = model(Xva_t.to(device))
                vloss = loss_fn(vpred, yva_t.to(device)).item()

            print(f"ep={ep:02d} val_mse={vloss:.6f}")

            if vloss < best_loss - 1e-6:
                best_loss = vloss
                best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
                bad = 0
            else:
                bad += 1
                if bad >= patience:
                    break

        if best_state is not None:
            model.load_state_dict(best_state)
        return model, best_loss
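
The patience logic inside `train_lstm` can also be factored into a small framework-free helper, which makes it easy to test without a model. This is a sketch; `EarlyStopping` is a hypothetical class that mirrors the `best_loss` / `bad` counters above, and the loss values are made up to exercise the logic.

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience: int = 3, min_delta: float = 1e-6):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = 0
        self.bad = 0

    def step(self, val_loss: float, epoch: int) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.best_epoch, self.bad = val_loss, epoch, 0
        else:
            self.bad += 1
        return self.bad >= self.patience


# Made-up validation losses, only to exercise the counter.
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.85, 0.79, 0.80, 0.81, 0.82, 0.5]
stopped_at = None
for ep, loss in enumerate(losses, start=1):
    if stopper.step(loss, ep):
        stopped_at = ep
        break
print(stopped_at, stopper.best, stopper.best_epoch)  # 7 0.79 4
```

The run stops at epoch 7 and never sees the 0.5 at epoch 8, which is the usual trade-off of a short patience: fast, but it can quit before a late improvement.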


In [None]:
# ==== 8c) Inference and saving ====
if TORCH_OK:
    def predict_lstm(model, Xw, device=None, batch=512):
        device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        model.eval()
        X_t = torch.tensor(Xw, dtype=torch.float32)
        dl = DataLoader(TensorDataset(X_t), batch_size=batch, shuffle=False)
        out = []
        with torch.no_grad():
            for (xb,) in dl:
                xb = xb.to(device)
                pred = model(xb).detach().cpu().numpy()
                out.append(pred)
        return np.concatenate(out, axis=0)

    # Usage:
    # model, best = train_lstm(Xtr_w, ytr_w, Xva_w, yva_w)
    # pva = predict_lstm(model, Xva_w)
    # pte = predict_lstm(model, Xte_w)
    # lstm_val_metrics  = eval_regression(yva_w, pva, "LSTM val:")   # reused by save_summary in section 10
    # lstm_test_metrics = eval_regression(yte_w, pte, "LSTM test:")
    #
    # torch.save(model.state_dict(), f"{ART_DIR}/lstm_state.pt")


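## 9. Visualization (if matplotlib is allowed)

Produce two plots:
1) True vs. predicted on val/test (aligned on the time axis)
2) Residual distribution / absolute error over time (optional)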


In [None]:
# ==== 9) Plotting (optional) ====
PLOT_OK = True
try:
    import matplotlib.pyplot as plt
except Exception as e:
    PLOT_OK = False
    print("matplotlib not available:", repr(e))

def plot_series(y_true, y_pred, title, path):
    if not PLOT_OK:
        return
    plt.figure()
    plt.plot(y_true)
    plt.plot(y_pred)
    plt.title(title)
    plt.legend(["true", "pred"])
    plt.tight_layout()
    plt.savefig(path, dpi=150)
    plt.close()

# Usage:
# plot_series(yva_w, pva, "LSTM val true vs pred", f"{ART_DIR}/lstm_val.png")
# plot_series(yte_w, pte, "LSTM test true vs pred", f"{ART_DIR}/lstm_test.png")
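
`plot_series` plots against the sample position, so windowed predictions are offset from the original rows by `lookback`. To put them on the real time axis, the target row of each window can be recovered as below. This is a sketch; `window_target_index` is a hypothetical helper that assumes the same loop bounds as `make_windows`.

```python
import numpy as np
import pandas as pd


def window_target_index(n_rows: int, lookback: int, horizon: int = 1, stride: int = 1) -> np.ndarray:
    """Row index of the target for each window produced by make_windows."""
    return np.arange(lookback, n_rows - horizon + 1, stride) + horizon - 1


# Ten daily timestamps; with lookback=3 the first target is row 3.
ts = pd.date_range("2024-01-01", periods=10, freq="D")
idx = window_target_index(len(ts), lookback=3, horizon=1)
target_ts = ts[idx]
print(idx.tolist())          # [3, 4, 5, 6, 7, 8, 9]
print(target_ts[0].date())   # 2024-01-04
```

Passing `target_ts` as the x-values (for example `plt.plot(target_ts, y_true)`) lines the curves up with calendar dates, which makes the error-analysis talking point easier to support.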


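## 10. Report template (talking points to read out as-is)

- Data: time range, missing ratio, whether there is a panel dimension
- Split: time boundaries of train/val/test (concrete dates/indices); stress that there is no leakage
- Baseline: Ridge / RF metrics (val/test); explain the strengths: fast, stable, interpretable
- LSTM: input window lookback, feature dimension, training epochs, early stopping; metrics compared against the baseline
- Error analysis: which periods show large deviations (high volatility / extreme values / clustered missing data)
- Next steps: multi-step forecasting, walk-forward CV, a different loss (Huber/Quantile), exogenous variables / holiday features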


In [None]:
# ==== 10) Auto-generate a report summary (write the key numbers to json) ====
def save_summary(path, **kwargs):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(kwargs, f, ensure_ascii=False, indent=2)
    print("Saved:", path)

# Usage:
# save_summary(
#     f"{ART_DIR}/summary.json",
#     run_id=RUN_ID,
#     lookback=LOOKBACK,
#     horizon=HORIZON,
#     baseline_val=val_metrics,
#     baseline_test=test_metrics,
#     lstm_val=lstm_val_metrics,
#     lstm_test=lstm_test_metrics,
# )
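
For the spoken report, a side-by-side table is often easier to read than nested JSON. Here is a minimal sketch that turns metric dicts shaped like the output of `eval_regression` into a table; `metrics_table` is a hypothetical helper and the numbers are placeholders for illustration only, not results from any run.

```python
import pandas as pd


def metrics_table(results: dict) -> pd.DataFrame:
    """Turn {model_name: {metric: value}} into a table sorted by rmse."""
    return pd.DataFrame(results).T.sort_values("rmse")


# Placeholder numbers for illustration only; replace with real metric dicts.
demo = {
    "ridge": {"rmse": 1.20, "mae": 0.90, "r2": 0.41},
    "lstm": {"rmse": 1.10, "mae": 0.85, "r2": 0.47},
}
table = metrics_table(demo)
print(table.round(3))
```

Saving the table with `table.to_csv(...)` next to `summary.json` keeps the comparison alongside the other run artifacts.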


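## 11. Quick troubleshooting (conclusions only)

- Training does not converge: reduce lookback, reduce hidden, lower lr to 3e-4, check that the scaler and windows are aligned
- Val is good but test is bad: non-stationarity / distribution drift is present; switch to robust features, use a drift-aware split, reduce overfitting
- LSTM underperforms the baseline: the features already carry enough signal; LSTM only wins with clear nonlinearity / long-range dependence; treat it as a bonus
- Out of time: deliver only the baseline + a complete evaluation + a clear reporting narrative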
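
For the "val good, test bad" case, a quick first check is how far each feature's mean has moved between splits, measured in train standard deviations. This is a rough sketch on synthetic data; `mean_shift_report` is a hypothetical helper, and a mean shift is only one of several possible drift signals.

```python
import numpy as np
import pandas as pd


def mean_shift_report(train: pd.DataFrame, other: pd.DataFrame, cols) -> pd.Series:
    """Absolute shift of each feature's mean, in units of the train std."""
    mu = train[cols].mean()
    sd = train[cols].std().replace(0.0, np.nan)   # avoid division by zero
    return ((other[cols].mean() - mu) / sd).abs().sort_values(ascending=False)


# Synthetic example: feature "b" is shifted by about 3 std in the later split.
rng = np.random.default_rng(0)
train = pd.DataFrame({"a": rng.normal(0, 1, 500), "b": rng.normal(0, 1, 500)})
test = pd.DataFrame({"a": rng.normal(0, 1, 500), "b": rng.normal(3, 1, 500)})
report = mean_shift_report(train, test, ["a", "b"])
print(report.round(2))
```

Features at the top of the report are the first candidates for differencing, robust scaling, or removal before retraining.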
