# Chronos-2 SFT — Evaluation dumps for statistical testing (Global + Industry)

This notebook **does not perform training** and does not alter the SFT training notebook.
It assumes that the SFT checkpoints have already been produced by the training notebook and are available under:
`outputs/chronos2_sft/<group>/finetuned-ckpt` (relative to the `notebooks/` directory).

The goal is to **re-run the same evaluation procedure** used in the project codebase while additionally **saving per-ticker, per-window results** to disk.
A separate notebook then uses these dumps to compute paired statistical tests (e.g., Wilcoxon signed-rank tests and bootstrap confidence intervals).
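
As a rough illustration of how per-row dumps can feed a paired analysis, the sketch below computes a percentile-bootstrap confidence interval for the mean paired difference of a metric. It is a minimal sketch and not the downstream notebook's implementation: the helper name `paired_bootstrap_ci`, the i.i.d. resampling over pairs, and the synthetic numbers are all assumptions made for illustration.

```python
import numpy as np


def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(a - b) over paired observations.

    Hypothetical helper for illustration. `a` and `b` must be aligned so that
    position i refers to the same (window, ticker) pair in both arrays.
    """
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Resample pairs with replacement and record the mean difference each time
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)


# Synthetic stand-ins for two aligned "mql" columns (NOT real results)
rng = np.random.default_rng(1)
mql_model_a = rng.uniform(0.004, 0.010, size=500)
mql_model_b = mql_model_a - rng.normal(0.001, 0.0005, size=500)

mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(mql_model_a, mql_model_b)
print(f"mean diff={mean_diff:.5f}, 95% CI=({ci_lo:.5f}, {ci_hi:.5f})")
```

Rows within the same window are unlikely to be independent, so resampling whole windows instead of individual pairs may be the safer choice in practice.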


In [1]:
# Project imports and environment setup
#
# Expected usage: run this notebook from the repository's `notebooks/` directory.
# The training notebook writes checkpoints under `outputs/chronos2_sft/...` relative to that directory.

import os
import sys
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm



current_dir = os.getcwd()
# REPO_ROOT is the parent of `notebooks/`; it is kept as a plain string so later cells work unchanged.
REPO_ROOT = os.path.dirname(current_dir)
if REPO_ROOT not in sys.path:
    sys.path.append(REPO_ROOT)


from tiingo_data.download_data import get_daily_returns_data_cached
from core.data import prepare_data_for_chronos, GICS_LEVEL_1
from utils import get_device
from chronos import Chronos2Pipeline

DEVICE = get_device()
print("DEVICE:", DEVICE)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0), "capability:", torch.cuda.get_device_capability(0))
print("torch:", torch.__version__)


DEVICE: cuda
GPU: NVIDIA GeForce RTX 4060 Laptop GPU capability: (8, 9)
torch: 2.9.1


In [2]:
# Load daily returns and build the exact train/eval split used for training/evaluation
df_all = get_daily_returns_data_cached()
df_train_clean, df_eval_clean = prepare_data_for_chronos(df_all, test_size=1200)

print("df_train_clean:", df_train_clean.shape, "df_eval_clean:", df_eval_clean.shape)
print("eval date range:", df_eval_clean.index[0], "→", df_eval_clean.index[-1])


df_train_clean: (2797, 114) df_eval_clean: (1200, 114)
eval date range: 2021-02-12 00:00:00+00:00 → 2025-11-20 00:00:00+00:00


In [3]:
# Load the best hyperparameters found during the Optuna tuning stage.
# The CSV is produced by the training notebook and stored under outputs/.
CSV_CANDIDATES = [
    Path(REPO_ROOT) / "notebooks" / "outputs" / "tuning_results" / "tuning_best_results.csv",
    Path(REPO_ROOT) / "outputs" / "tuning_results" / "tuning_best_results.csv"
]
best_csv_path = next((p for p in CSV_CANDIDATES if p.exists()), None)
if best_csv_path is None:
    raise FileNotFoundError(
        "tuning_best_results.csv not found. Searched:\n" + "\n".join(map(str, CSV_CANDIDATES))
    )

df_best = pd.read_csv(best_csv_path)
df_best["group"] = df_best["group"].astype(str).str.strip()
hp_by_group = df_best.set_index("group").to_dict(orient="index")

# Forecast horizon used throughout the project (one-step ahead)
PREDICTION_LENGTH = 1


def get_hparams(group: str):
    """Return the tuned hyperparameters for a given evaluation group."""
    g = str(group).strip()
    if g not in hp_by_group:
        raise KeyError(f"Group '{g}' not present in the tuning CSV. Available: {sorted(hp_by_group.keys())}")
    row = hp_by_group[g]
    return dict(
        prediction_length=PREDICTION_LENGTH,
        context_length=int(row["context_length"]),
        num_steps=int(row["num_steps"]),
        batch_size=int(row["batch_size"]),
        learning_rate=float(row["learning_rate"]),
        # The defaults below apply only when the column is absent from the CSV
        stride=int(row.get("stride", 50)),
        n_eval_samples=int(row.get("n_eval_samples", 100)),
    )


global_hp = get_hparams("global")
print("GLOBAL HP:", global_hp)


GLOBAL HP: {'prediction_length': 1, 'context_length': 128, 'num_steps': 1500, 'batch_size': 48, 'learning_rate': 3.113813151474403e-06, 'stride': 100, 'n_eval_samples': 100}


In [4]:
# Utility: normalize group names to directory-friendly slugs (matches the training notebook)
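def slugify(category: str) -> str:
    """Lowercase a group name and replace '&', '/', and spaces so it can be used as a directory name."""
    return (
        category.lower()
        .replace("&", "and")
        .replace("/", "_")
        .replace(" ", "_")
    )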

# Checkpoint/output base directory.
# The training notebook uses output_dir="outputs/..."; when run from `notebooks/`, this resolves to `notebooks/outputs/...`.
OUTPUTS_BASE = Path("outputs")

# Expected location of the global fine-tuned checkpoint
general_ckpt = OUTPUTS_BASE / "chronos2_sft" / "general" / "finetuned-ckpt"
if not general_ckpt.exists():
    print("WARNING: global SFT checkpoint not found at:", general_ckpt)
    print("Make sure you have run the training notebook and that outputs/ is in the expected location.")
else:
    print("OK: global SFT checkpoint found:", general_ckpt)


def category_ckpt(category: str) -> Path:
    """Return the fine-tuned checkpoint path for an industry/category group."""
    return OUTPUTS_BASE / "chronos2_sft" / slugify(category) / "finetuned-ckpt"


OK: global SFT checkpoint found: outputs\chronos2_sft\general\finetuned-ckpt
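
To make the directory naming concrete, the snippet below repeats the `slugify` helper from this cell and applies it to a few group names. The first two names appear in this project's outputs; `"Media & Entertainment"` is a hypothetical name used only to exercise the `&` rule.

```python
def slugify(category: str) -> str:
    # Same transformation as the helper defined in the cell above
    return (
        category.lower()
        .replace("&", "and")
        .replace("/", "_")
        .replace(" ", "_")
    )


print(slugify("Health Care"))             # health_care
print(slugify("Consumer Discretionary"))  # consumer_discretionary
print(slugify("Media & Entertainment"))   # hypothetical name -> media_and_entertainment
```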


In [5]:
# Metric helpers
#
# We compute per-ticker MAE/MSE using the median forecast and per-ticker MQL using the pinball loss
# averaged over the quantiles 0.1..0.9.

QUANTILES = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], dtype=np.float32)


def per_ticker_mql(y_true: np.ndarray, y_pred_quantiles: np.ndarray) -> np.ndarray:
    """Mean Quantile Loss (pinball loss), returned per ticker.

    Parameters
    ----------
    y_true: (N,) array
        True 1-step-ahead value for each ticker.
    y_pred_quantiles: (N, Q) array
        Predicted quantiles for each ticker. If Q > 9, we keep the first 9 columns
        corresponding to the 0.1..0.9 grid used in the project.
    """
    y_pred_9 = y_pred_quantiles[:, : len(QUANTILES)]  # (N, 9)
    errors = y_true[:, None] - y_pred_9              # (N, 9)
    q = QUANTILES[None, :]                           # (1, 9)
    pin = np.where(errors >= 0, q * errors, (q - 1.0) * errors)
    return pin.mean(axis=1)


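def sample_start_indices(T: int, context_length: int, n_samples: int, seed: int = 42) -> np.ndarray:
    """Sample evaluation window start indices.

    This mirrors the evaluation procedure used in the project code (`core/eval.py`):
    we set the NumPy RNG seed and sample start indices without replacement.
    Note that `np.random.seed` reseeds NumPy's global RNG as a side effect.
    """
    # Starts are drawn from 0..max_start-1, so the target row start + context_length always exists.
    max_start = T - context_length - 1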
    if max_start <= 0:
        raise ValueError(f"Series too short: T={T}, context_length={context_length}")
    np.random.seed(seed)
    n = min(int(n_samples), int(max_start))
    return np.random.choice(max_start, size=n, replace=False)
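
A small worked example may help when reading the MQL column later. The sketch below repeats `per_ticker_mql` as defined in this cell and evaluates it on three toy "tickers" whose forecasts are 0.0 at every quantile; the toy inputs are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

QUANTILES = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], dtype=np.float32)


def per_ticker_mql(y_true, y_pred_quantiles):
    # Same computation as the helper defined in the cell above
    y_pred_9 = y_pred_quantiles[:, : len(QUANTILES)]
    errors = y_true[:, None] - y_pred_9
    q = QUANTILES[None, :]
    pin = np.where(errors >= 0, q * errors, (q - 1.0) * errors)
    return pin.mean(axis=1)


# Three toy tickers, each with a flat forecast of 0.0 at all nine quantiles
y_pred = np.zeros((3, 9), dtype=np.float32)
y_true = np.array([0.0, 1.0, -1.0], dtype=np.float32)

mql = per_ticker_mql(y_true, y_pred)
print(mql)  # approximately [0.0, 0.5, 0.5]
```

With a realized value of 1.0 every quantile under-predicts by 1, so the loss at level q is q and the average over 0.1..0.9 is 0.5. With a realized value of -1.0 every quantile over-predicts by 1, the loss at level q is 1 - q, and the average is again 0.5.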


In [6]:
# Core routine: evaluate a model on sampled windows and dump per-ticker results to disk
def eval_dump_windows(
    pipeline: Chronos2Pipeline,
    df_test: pd.DataFrame,
    context_length: int,
    start_indices: np.ndarray,
    model_name: str,
    group_name: str,
    out_path: Path,
) -> pd.DataFrame:
    """Run 1-step-ahead evaluation on a set of sampled windows and save results.

    For each sampled start index we build a multivariate context window of shape (N, context_length),
    predict the next step, and store per-ticker metrics (MAE/MSE/MQL) plus the 0.1..0.9 quantile forecasts.
    """
    data = df_test.values.astype(np.float32)  # (T, N)
    T, N = data.shape
    tickers = list(df_test.columns)
    dates = df_test.index

    rows = []
    for window_id, start in enumerate(tqdm(start_indices, desc=f"{model_name} | {group_name}")):
        # Build context and target for a 1-step-ahead forecast
        ctx = data[start : start + context_length, :].T  # (N, context_length)
        y_true = data[start + context_length, :]         # (N,)
        date = dates[start + context_length]

        # Predict quantiles for the next step
        forecast = pipeline.predict([{"target": ctx}], prediction_length=1)
        y_pred_q = forecast[0][:, :, 0].detach().cpu().numpy().astype(np.float32)  # (N, Q)
        y_med = y_pred_q[:, 4]  # median index for the 0.1..0.9 grid

        # Per-ticker point metrics
        mae_t = np.abs(y_true - y_med)
        mse_t = (y_true - y_med) ** 2

        # Per-ticker quantile metric
        mql_t = per_ticker_mql(y_true, y_pred_q)

        # Emit one row per ticker for this window
        for i, tkr in enumerate(tickers):
            rows.append(
                {
                    "window_id": int(window_id),
                    "start_idx": int(start),
                    "date": date,
                    "ticker": tkr,
                    "group": group_name,
                    "model": model_name,
                    "context_length": int(context_length),
                    "y_true": float(y_true[i]),
                    "y_pred_q10": float(y_pred_q[i, 0]),
                    "y_pred_q20": float(y_pred_q[i, 1]),
                    "y_pred_q30": float(y_pred_q[i, 2]),
                    "y_pred_q40": float(y_pred_q[i, 3]),
                    "y_pred_q50": float(y_pred_q[i, 4]),
                    "y_pred_q60": float(y_pred_q[i, 5]),
                    "y_pred_q70": float(y_pred_q[i, 6]),
                    "y_pred_q80": float(y_pred_q[i, 7]),
                    "y_pred_q90": float(y_pred_q[i, 8]),
                    "mae": float(mae_t[i]),
                    "mse": float(mse_t[i]),
                    "mql": float(mql_t[i]),
                }
            )

    out_path.parent.mkdir(parents=True, exist_ok=True)
    df_out = pd.DataFrame(rows)
    df_out.to_parquet(out_path, index=False)
    print(f"Saved: {out_path} | rows={len(df_out)}")
    return df_out
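
The window slicing used in `eval_dump_windows` can be checked on a toy array. The sketch below uses a made-up `(T, N)` matrix (an assumption for illustration, not project data) to show which rows form the context and which row is the 1-step-ahead target.

```python
import numpy as np

T, N, context_length = 8, 2, 3
# Toy (T, N) matrix: column 0 holds 0..7, column 1 holds 100..107
data = np.stack([np.arange(T), 100 + np.arange(T)], axis=1).astype(np.float32)

start = 2
ctx = data[start : start + context_length, :].T  # (N, context_length): rows 2, 3, 4
y_true = data[start + context_length, :]         # (N,): row 5

print(ctx.shape)  # (2, 3)
print(ctx)        # [[  2.   3.   4.] [102. 103. 104.]]
print(y_true)     # [  5. 105.]
```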


In [7]:
# Load the models to evaluate
#
# 1) Baseline: Chronos-2 zero-shot
baseline = Chronos2Pipeline.from_pretrained(
    "amazon/chronos-2",
    device_map=DEVICE,
    dtype=torch.float32,
)

# 2) Global SFT model: checkpoint produced by the training notebook
if not general_ckpt.exists():
    raise FileNotFoundError(f"Global SFT checkpoint not found: {general_ckpt}")

sft_general = Chronos2Pipeline.from_pretrained(
    str(general_ckpt),
    device_map=DEVICE,
    dtype=torch.float32,
)

print("Loaded models: baseline + global SFT")


Loaded models: baseline + global SFT


In [8]:
# GLOBAL evaluation dumps (baseline vs global SFT)
DUMPS_DIR = OUTPUTS_BASE / "eval_dumps" / "sft"
DUMPS_DIR.mkdir(parents=True, exist_ok=True)

ctx_g = global_hp["context_length"]
n_g = global_hp["n_eval_samples"]
starts_g = sample_start_indices(len(df_eval_clean), ctx_g, n_g, seed=42)

global_baseline_path = DUMPS_DIR / "global__baseline.parquet"
global_sftgen_path = DUMPS_DIR / "global__sft_general.parquet"

_ = eval_dump_windows(baseline, df_eval_clean, ctx_g, starts_g, "baseline", "global", global_baseline_path)
_ = eval_dump_windows(sft_general, df_eval_clean, ctx_g, starts_g, "sft_general", "global", global_sftgen_path)


baseline | global: 100%|██████████| 100/100 [00:05<00:00, 16.86it/s]


Saved: outputs\eval_dumps\sft\global__baseline.parquet | rows=11400


sft_general | global: 100%|██████████| 100/100 [00:05<00:00, 18.69it/s]

Saved: outputs\eval_dumps\sft\global__sft_general.parquet | rows=11400
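
Both dumps above reuse the same `starts_g`, so rows that share a `(window_id, ticker)` pair refer to the same forecast target in the two files. The sketch below copies the sampler from this notebook and checks the properties this pairing relies on; the values `T=1200` and `ctx=128` are taken from the outputs above, and the checks themselves are an illustrative addition.

```python
import numpy as np


def sample_start_indices(T, context_length, n_samples, seed=42):
    # Copy of the sampler defined earlier in this notebook
    max_start = T - context_length - 1
    if max_start <= 0:
        raise ValueError(f"Series too short: T={T}, context_length={context_length}")
    np.random.seed(seed)  # reseeds NumPy's global RNG
    n = min(int(n_samples), int(max_start))
    return np.random.choice(max_start, size=n, replace=False)


T, ctx = 1200, 128  # eval length and global context length reported above
a = sample_start_indices(T, ctx, 100)
b = sample_start_indices(T, ctx, 100)

print(np.array_equal(a, b))         # True: same seed, same windows
print(len(np.unique(a)) == len(a))  # True: sampled without replacement
print(int(a.max()) + ctx < T)       # True: every target row exists
```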


In [9]:
# Optional sanity check
#
# Compare aggregated metrics computed from the dump files with the metrics returned by
# `core.eval.evaluate_model_on_test`, to verify that the dump procedure matches the project's evaluation.

from core.eval import evaluate_model_on_test


def agg_dump_metrics(parquet_path: Path) -> tuple:
    """Return (mean MQL, mean MAE) averaged over all rows of a dump file."""
    df = pd.read_parquet(parquet_path)
    mean_mae = df["mae"].mean()
    mean_mql = df["mql"].mean()
    return mean_mql, mean_mae


print("--- Sanity check: GLOBAL ---")
res_base = evaluate_model_on_test(baseline, df_eval_clean, context_length=ctx_g, n_samples=n_g)
res_sft  = evaluate_model_on_test(sft_general, df_eval_clean, context_length=ctx_g, n_samples=n_g)

dump_base = agg_dump_metrics(global_baseline_path)
dump_sft  = agg_dump_metrics(global_sftgen_path)

print("baseline (core): MQL=", res_base["mean_quantile_loss"], " MAE=", res_base["mean_mae"])
print("baseline (dump): MQL=", dump_base[0], " MAE=", dump_base[1])
print("sft_general (core): MQL=", res_sft["mean_quantile_loss"], " MAE=", res_sft["mean_mae"])
print("sft_general (dump): MQL=", dump_sft[0], " MAE=", dump_sft[1])

print("Abs. diff baseline (dump-core):", abs(dump_base[0] - res_base["mean_quantile_loss"]), abs(dump_base[1] - res_base["mean_mae"]))
print("Abs. diff sft_general (dump-core):", abs(dump_sft[0] - res_sft["mean_quantile_loss"]), abs(dump_sft[1] - res_sft["mean_mae"]))


--- Sanity check: GLOBAL ---
baseline (core): MQL= 0.0073292092  MAE= 0.016542953
baseline (dump): MQL= 0.007329209622847358  MAE= 0.01654295417169985
sft_general (core): MQL= 0.0058949403  MAE= 0.013767662
sft_general (dump): MQL= 0.005894940915168263  MAE= 0.013767661353317395
Abs. diff baseline (dump-core): 3.8083128057336824e-10 1.663965552151092e-09
Abs. diff sft_general (dump-core): 5.686888468817153e-10 1.1048043816602737e-09
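
The differences above are around 1e-9, which is on the order of float32 rounding for values of this size. That would be consistent with the core routine aggregating in float32 while pandas averages in float64, although that cause is an assumption. A sketch of turning the visual comparison into an explicit check, with a hypothetical tolerance `ABS_TOL`:

```python
import math

# (core, dump) pairs copied from the sanity-check output above
pairs = {
    "baseline MQL": (0.0073292092, 0.007329209622847358),
    "baseline MAE": (0.016542953, 0.01654295417169985),
    "sft_general MQL": (0.0058949403, 0.005894940915168263),
    "sft_general MAE": (0.013767662, 0.013767661353317395),
}

ABS_TOL = 1e-7  # hypothetical tolerance, chosen for illustration

results = {}
for name, (core_val, dump_val) in pairs.items():
    results[name] = math.isclose(core_val, dump_val, rel_tol=0.0, abs_tol=ABS_TOL)
    print(f"{name}: |diff|={abs(core_val - dump_val):.2e} ok={results[name]}")
```

The printed core values are rounded, so the differences computed here need not match the logged ones digit for digit.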


In [10]:
# Industry/category evaluation dumps
#
# For each GICS Level-1 industry group we create four dump files:
#   1) baseline evaluated with the category context length (ctx=cat)
#   2) global SFT evaluated with the *global* context length (ctx=global)  [matches the original report setting]
#   3) global SFT evaluated with the category context length (ctx=cat)     [fair paired comparison vs category model]
#   4) category SFT evaluated with the category context length (ctx=cat)

categories = list(GICS_LEVEL_1.keys())
print("n_categories:", len(categories))

for cat in categories:
    tickers = [t for t in GICS_LEVEL_1[cat] if t in df_eval_clean.columns]
    if len(tickers) == 0:
        print(f"[{cat}] skip: no tickers available in eval split")
        continue

    cat_hp = get_hparams(cat) if cat in hp_by_group else None
    if cat_hp is None:
        print(f"[{cat}] skip: no tuned hyperparameters found for this group")
        continue

    df_cat = df_eval_clean[tickers]
    ctx_cat = cat_hp["context_length"]
    n_cat = cat_hp["n_eval_samples"]
    starts_cat = sample_start_indices(len(df_cat), ctx_cat, n_cat, seed=42)

    # Load category checkpoint
    ckpt = category_ckpt(cat)
    if not ckpt.exists():
        print(f"[{cat}] checkpoint not found: {ckpt} (skipping this category)")
        continue

    sft_cat = Chronos2Pipeline.from_pretrained(
        str(ckpt),
        device_map=DEVICE,
        dtype=torch.float32,
    )

    cat_slug = slugify(cat)
    out_base = DUMPS_DIR / f"{cat_slug}__baseline.parquet"
    out_gen_report = DUMPS_DIR / f"{cat_slug}__sft_general_ctx_global.parquet"
    out_gen_fair = DUMPS_DIR / f"{cat_slug}__sft_general_ctx_cat.parquet"
    out_cat = DUMPS_DIR / f"{cat_slug}__sft_category.parquet"

    # 1) Baseline (ctx=cat)
    _ = eval_dump_windows(baseline, df_cat, ctx_cat, starts_cat, "baseline", cat, out_base)

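    # 2) Global SFT, global context length (ctx=global)
    # Start indices are re-sampled here because the valid range depends on the context length,
    # so these windows (and their target dates) need not coincide with those from `starts_cat`
    # when the two context lengths differ.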
    ctx_global = global_hp["context_length"]
    starts_global_ctx = sample_start_indices(len(df_cat), ctx_global, n_cat, seed=42)
    _ = eval_dump_windows(sft_general, df_cat, ctx_global, starts_global_ctx, "sft_general", cat, out_gen_report)

    # 3) Global SFT, category context length (ctx=cat)
    _ = eval_dump_windows(sft_general, df_cat, ctx_cat, starts_cat, "sft_general", cat, out_gen_fair)

    # 4) Category SFT (ctx=cat)
    _ = eval_dump_windows(sft_cat, df_cat, ctx_cat, starts_cat, "sft_category", cat, out_cat)

    # Free GPU memory between categories
    del sft_cat
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print("Done. DUMPS_DIR:", DUMPS_DIR)


n_categories: 11


baseline | Information Technology: 100%|██████████| 100/100 [00:02<00:00, 44.49it/s]


Saved: outputs\eval_dumps\sft\information_technology__baseline.parquet | rows=1700


sft_general | Information Technology: 100%|██████████| 100/100 [00:02<00:00, 42.33it/s]


Saved: outputs\eval_dumps\sft\information_technology__sft_general_ctx_global.parquet | rows=1700


sft_general | Information Technology: 100%|██████████| 100/100 [00:02<00:00, 43.70it/s]


Saved: outputs\eval_dumps\sft\information_technology__sft_general_ctx_cat.parquet | rows=1700


sft_category | Information Technology: 100%|██████████| 100/100 [00:02<00:00, 46.75it/s]


Saved: outputs\eval_dumps\sft\information_technology__sft_category.parquet | rows=1700


baseline | Health Care: 100%|██████████| 50/50 [00:01<00:00, 44.49it/s]


Saved: outputs\eval_dumps\sft\health_care__baseline.parquet | rows=900


sft_general | Health Care: 100%|██████████| 50/50 [00:01<00:00, 44.12it/s]


Saved: outputs\eval_dumps\sft\health_care__sft_general_ctx_global.parquet | rows=900


sft_general | Health Care: 100%|██████████| 50/50 [00:01<00:00, 44.31it/s]


Saved: outputs\eval_dumps\sft\health_care__sft_general_ctx_cat.parquet | rows=900


sft_category | Health Care: 100%|██████████| 50/50 [00:01<00:00, 40.08it/s]


Saved: outputs\eval_dumps\sft\health_care__sft_category.parquet | rows=900


baseline | Financials: 100%|██████████| 100/100 [00:02<00:00, 44.82it/s]


Saved: outputs\eval_dumps\sft\financials__baseline.parquet | rows=1700


sft_general | Financials: 100%|██████████| 100/100 [00:02<00:00, 44.21it/s]


Saved: outputs\eval_dumps\sft\financials__sft_general_ctx_global.parquet | rows=1700


sft_general | Financials: 100%|██████████| 100/100 [00:02<00:00, 45.56it/s]


Saved: outputs\eval_dumps\sft\financials__sft_general_ctx_cat.parquet | rows=1700


sft_category | Financials: 100%|██████████| 100/100 [00:02<00:00, 41.65it/s]


Saved: outputs\eval_dumps\sft\financials__sft_category.parquet | rows=1700


baseline | Consumer Discretionary: 100%|██████████| 200/200 [00:04<00:00, 42.72it/s]


Saved: outputs\eval_dumps\sft\consumer_discretionary__baseline.parquet | rows=2000


sft_general | Consumer Discretionary: 100%|██████████| 200/200 [00:04<00:00, 41.11it/s]


Saved: outputs\eval_dumps\sft\consumer_discretionary__sft_general_ctx_global.parquet | rows=2000


sft_general | Consumer Discretionary: 100%|██████████| 200/200 [00:04<00:00, 43.30it/s]


Saved: outputs\eval_dumps\sft\consumer_discretionary__sft_general_ctx_cat.parquet | rows=2000


sft_category | Consumer Discretionary: 100%|██████████| 200/200 [00:04<00:00, 45.47it/s]


Saved: outputs\eval_dumps\sft\consumer_discretionary__sft_category.parquet | rows=2000


baseline | Consumer Staples: 100%|██████████| 100/100 [00:02<00:00, 40.89it/s]


Saved: outputs\eval_dumps\sft\consumer_staples__baseline.parquet | rows=1100


sft_general | Consumer Staples: 100%|██████████| 100/100 [00:02<00:00, 41.54it/s]


Saved: outputs\eval_dumps\sft\consumer_staples__sft_general_ctx_global.parquet | rows=1100


sft_general | Consumer Staples: 100%|██████████| 100/100 [00:02<00:00, 43.51it/s]


Saved: outputs\eval_dumps\sft\consumer_staples__sft_general_ctx_cat.parquet | rows=1100


sft_category | Consumer Staples: 100%|██████████| 100/100 [00:02<00:00, 40.44it/s]


Saved: outputs\eval_dumps\sft\consumer_staples__sft_category.parquet | rows=1100


baseline | Industrials: 100%|██████████| 200/200 [00:04<00:00, 42.67it/s]


Saved: outputs\eval_dumps\sft\industrials__baseline.parquet | rows=3600


sft_general | Industrials: 100%|██████████| 200/200 [00:04<00:00, 43.11it/s]


Saved: outputs\eval_dumps\sft\industrials__sft_general_ctx_global.parquet | rows=3600


sft_general | Industrials: 100%|██████████| 200/200 [00:04<00:00, 43.98it/s]


Saved: outputs\eval_dumps\sft\industrials__sft_general_ctx_cat.parquet | rows=3600


sft_category | Industrials: 100%|██████████| 200/200 [00:05<00:00, 39.92it/s]


Saved: outputs\eval_dumps\sft\industrials__sft_category.parquet | rows=3600


baseline | Energy: 100%|██████████| 50/50 [00:01<00:00, 38.83it/s]


Saved: outputs\eval_dumps\sft\energy__baseline.parquet | rows=300


sft_general | Energy: 100%|██████████| 50/50 [00:01<00:00, 40.04it/s]


Saved: outputs\eval_dumps\sft\energy__sft_general_ctx_global.parquet | rows=300


sft_general | Energy: 100%|██████████| 50/50 [00:01<00:00, 43.06it/s]


Saved: outputs\eval_dumps\sft\energy__sft_general_ctx_cat.parquet | rows=300


sft_category | Energy: 100%|██████████| 50/50 [00:01<00:00, 44.60it/s]


Saved: outputs\eval_dumps\sft\energy__sft_category.parquet | rows=300


baseline | Communication Services: 100%|██████████| 50/50 [00:01<00:00, 41.40it/s]


Saved: outputs\eval_dumps\sft\communication_services__baseline.parquet | rows=300


sft_general | Communication Services: 100%|██████████| 50/50 [00:01<00:00, 42.85it/s]


Saved: outputs\eval_dumps\sft\communication_services__sft_general_ctx_global.parquet | rows=300


sft_general | Communication Services: 100%|██████████| 50/50 [00:01<00:00, 42.35it/s]


Saved: outputs\eval_dumps\sft\communication_services__sft_general_ctx_cat.parquet | rows=300


sft_category | Communication Services: 100%|██████████| 50/50 [00:01<00:00, 42.66it/s]


Saved: outputs\eval_dumps\sft\communication_services__sft_category.parquet | rows=300


baseline | Materials: 100%|██████████| 200/200 [00:04<00:00, 41.71it/s]


Saved: outputs\eval_dumps\sft\materials__baseline.parquet | rows=400


sft_general | Materials: 100%|██████████| 200/200 [00:04<00:00, 40.17it/s]


Saved: outputs\eval_dumps\sft\materials__sft_general_ctx_global.parquet | rows=400


sft_general | Materials: 100%|██████████| 200/200 [00:04<00:00, 44.39it/s]


Saved: outputs\eval_dumps\sft\materials__sft_general_ctx_cat.parquet | rows=400


sft_category | Materials: 100%|██████████| 200/200 [00:04<00:00, 44.74it/s]


Saved: outputs\eval_dumps\sft\materials__sft_category.parquet | rows=400


baseline | Real Estate: 100%|██████████| 150/150 [00:03<00:00, 42.72it/s]


Saved: outputs\eval_dumps\sft\real_estate__baseline.parquet | rows=300


sft_general | Real Estate: 100%|██████████| 150/150 [00:03<00:00, 41.85it/s]


Saved: outputs\eval_dumps\sft\real_estate__sft_general_ctx_global.parquet | rows=300


sft_general | Real Estate: 100%|██████████| 150/150 [00:03<00:00, 42.99it/s]


Saved: outputs\eval_dumps\sft\real_estate__sft_general_ctx_cat.parquet | rows=300


sft_category | Real Estate: 100%|██████████| 150/150 [00:03<00:00, 43.06it/s]

Saved: outputs\eval_dumps\sft\real_estate__sft_category.parquet | rows=300
[Utilities] skip: no tickers available in eval split
Done. DUMPS_DIR: outputs\eval_dumps\sft
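
Each dump should contain one row per ticker per window, so the logged row counts give a quick consistency check and also reveal how many tickers each group contributes. The sketch below works only from the numbers printed above; the dictionary `logged` is a hand-copied transcription and not something the notebook produces.

```python
# (n_windows, rows) pairs copied from the progress bars and "Saved" lines above
logged = {
    "Information Technology": (100, 1700),
    "Health Care": (50, 900),
    "Financials": (100, 1700),
    "Consumer Discretionary": (200, 2000),
    "Consumer Staples": (100, 1100),
    "Industrials": (200, 3600),
    "Energy": (50, 300),
    "Communication Services": (50, 300),
    "Materials": (200, 400),
    "Real Estate": (150, 300),
}

tickers_per_group = {}
for cat, (n_windows, rows) in logged.items():
    n_tickers, remainder = divmod(rows, n_windows)
    assert remainder == 0, f"{cat}: rows is not a multiple of n_windows"
    tickers_per_group[cat] = n_tickers
    print(f"{cat}: {n_tickers} tickers x {n_windows} windows = {rows} rows")
```

By this arithmetic Materials and Real Estate contribute only two tickers each, which is worth keeping in mind when interpreting per-group tests.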



