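# 3.5 Z-Score Standardization (Historical Baseline)

Apply **Z-score standardization** to the selected features, computing the mean and variance from a **historical baseline**. This avoids look-ahead bias and stays consistent with the time-series split used later.

**Important**: the mean and variance must come from **historical data** (via shift); values from the current or any future period must not be used.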
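
The difference between a full-sample baseline and a shifted historical baseline can be checked numerically. The sketch below is illustrative only: the helper names `full_sample_zscore` and `historical_zscore` and the toy series are assumptions, not part of this notebook. It perturbs the last observation and checks whether earlier z-scores move.

```python
import numpy as np
import pandas as pd


def full_sample_zscore(x: pd.Series) -> pd.Series:
    """Leaky baseline: mean/std use every observation, including future ones."""
    return (x - x.mean()) / x.std(ddof=1)


def historical_zscore(x: pd.Series, window: int = 4, min_periods: int = 3) -> pd.Series:
    """z-score of x_t against the mean/std of up to `window` values before t."""
    hist = x.shift(1)  # row t now holds x_{t-1}, so the window below excludes x_t
    roll = hist.rolling(window, min_periods=min_periods)
    return (x - roll.mean()) / roll.std(ddof=1)


x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x_changed = x.copy()
x_changed.iloc[-1] = 100.0  # perturb only the most recent observation

full_before, full_after = full_sample_zscore(x), full_sample_zscore(x_changed)
z_before, z_after = historical_zscore(x), historical_zscore(x_changed)

# Earlier rows of the historical version are untouched by the later change;
# the full-sample version shifts everywhere.
unchanged = np.allclose(z_before.iloc[:-1], z_after.iloc[:-1], equal_nan=True)
leaky = not np.allclose(full_before.iloc[:-1], full_after.iloc[:-1])
print("historical z-scores unchanged:", unchanged)
print("full-sample z-scores changed:", leaky)
```

If the first flag were ever `False`, information from a later period would be reaching an earlier z-score.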

---

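## 1. Methodology

- **Grouping**: group by `ticker`
- **Ordering**: sort ascending by the `quarter` column. `quarter` is the **fiscal quarter**, formatted `YYYY-Qn` (e.g. `2024-Q1`, `2021-Q2`). **Sorting uses the fiscal quarter rather than the release date**, so that calls for the same quarter held on different dates cannot scramble the order.
- **Historical window**: for period $t$, use $\mu_{t-1}$ and $\sigma_{t-1}$, i.e. **shift(1)**, so only data before $t$ enters the statistics
- **Rolling window**: $\mu_{t-1} = \text{mean}(x_{t-4:t-1})$, $\sigma_{t-1} = \text{std}(x_{t-4:t-1})$ (the 4 most recent historical quarters). The code below pools all of the ticker's segment rows (Prepared Remarks and Q&A) from those quarters into one window.
- **Standard deviation**: use the **sample standard deviation** (ddof=1)
- **min_periods=3**: with fewer than 3 historical quarters, the z-score is set to NaN
- **ε**: $\epsilon = 10^{-8}$, guards against division by zero when $\sigma=0$
- **Clipping**: $z = (x - \mu) / (\sigma + \epsilon)$, then clip to $[-3, 3]$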
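
The rules above can be sketched in vectorized form. This is a minimal illustration under one simplifying assumption: there is exactly one row per ticker and quarter, already sorted by quarter. The toy panel, the `feature` column, and the helper `rolling_hist_zscore` are hypothetical. The notebook's own code pools several segment rows per quarter, so a plain `shift(1)` on segment rows would not reproduce it.

```python
import numpy as np
import pandas as pd

ROLLING_WINDOW, MIN_PERIODS, Z_CLIP, EPSILON = 4, 3, 3.0, 1e-8


def rolling_hist_zscore(x: pd.Series) -> pd.Series:
    """Historical z-score for one ticker's series (one value per quarter)."""
    hist = x.shift(1)  # exclude the current period from its own baseline
    roll = hist.rolling(ROLLING_WINDOW, min_periods=MIN_PERIODS)
    z = (x - roll.mean()) / (roll.std(ddof=1) + EPSILON)
    return z.clip(-Z_CLIP, Z_CLIP)


# Hypothetical toy panel, sorted by quarter within each ticker.
quarters = ["2020-Q1", "2020-Q2", "2020-Q3", "2020-Q4", "2021-Q1", "2021-Q2"]
toy = pd.DataFrame({
    "ticker": ["AAA"] * 6 + ["BBB"] * 6,
    "quarter": quarters * 2,
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 60.0,
                10.0, 10.5, 9.5, 10.0, 10.2, 9.8],
})

# transform applies the function per ticker, so no history crosses tickers.
toy["feature_zscore"] = toy.groupby("ticker")["feature"].transform(rolling_hist_zscore)
print(toy)
```

The first three quarters of each ticker come out as NaN because fewer than 3 historical periods exist, and the jump to 60.0 for `AAA` is clipped at the upper bound.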

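**Edge cases**:
- If $\sigma_{t-1} = 0$ (the feature is constant across the historical window), the denominator reduces to $\epsilon$. Then $z = 0$ only when $x_t = \mu_{t-1}$, which reads as "no anomaly"; any other $x_t$ produces a very large ratio that clipping maps to $\pm 3$. The implementation in this notebook sidesteps the case by skipping windows with $\sigma_{t-1} \le 0$, which leaves the z-score as NaN.
- For NaN z-scores: at the modeling stage (e.g. 3.6), consistently either dropna or add a missingness flag, so that training and test data are treated the same way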
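
A small numerical check of the $\sigma = 0$ case and of the missingness-flag option. The helper `z_with_epsilon`, the toy values, and the `_missing` / `_filled` column names are assumptions for illustration. Filling NaN with 0 next to the flag is one possible convention, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd

EPSILON, Z_CLIP = 1e-8, 3.0


def z_with_epsilon(x, hist):
    """Apply z = (x - mu) / (sigma + eps) followed by clipping to [-Z_CLIP, Z_CLIP]."""
    mu = np.mean(hist)
    sigma = np.std(hist, ddof=1)  # sample standard deviation
    return float(np.clip((x - mu) / (sigma + EPSILON), -Z_CLIP, Z_CLIP))


flat_history = [0.25, 0.25, 0.25, 0.25]       # sigma is exactly 0
z_same = z_with_epsilon(0.25, flat_history)   # x equals the historical mean
z_moved = z_with_epsilon(0.26, flat_history)  # x deviates slightly from the mean
print(z_same, z_moved)

# Missingness flag: keep the information that a z-score was unavailable.
features = pd.DataFrame({"lm_uncertainty_zscore": [np.nan, 0.4, np.nan, -1.2]})
features["lm_uncertainty_zscore_missing"] = features["lm_uncertainty_zscore"].isna().astype(int)
# Assumed convention: fill with 0 (the historical mean on the z scale).
features["lm_uncertainty_zscore_filled"] = features["lm_uncertainty_zscore"].fillna(0.0)
print(features)
```

With ε alone, a flat history therefore produces either 0 or a clipped extreme, which is one reason to treat such windows explicitly.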

---

## 2. Features to Standardize (9)

| Feature | Column | Source |
|------|------|------|
| Average sentence length (words per sentence) | `words_per_sentence` | 3.3 |
| Plural pronoun ratio | `pronoun_plural_ratio` | 3.3 |
| Adverb ratio | `adverb_ratio` | 3.3 |
| LM net sentiment | `lm_net_sentiment` | 3.4 |
| LM uncertainty | `lm_uncertainty` | 3.4 |
| LM litigious | `lm_litigious` | 3.4 |
| LM subjectivity | `lm_subjectivity` | 3.4 |
| LM modal strong | `lm_modal_strong` | 3.4 |
| LM modal weak | `lm_modal_weak` | 3.4 |

**Excluded**: `lm_polarity` (already dropped)

**Note on words_per_sentence**: it reflects syntactic complexity; once standardized, it can capture abnormal changes in disclosure complexity.

---

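## 3. Writing Back to the Database

Read the full `segments_features` table, **append the 9 `*_zscore` columns**, and write it back with `if_exists="replace"`. `df_out` holds every original column plus the 9 new ones, so **the existing feature columns are not affected**; rows stay in one-to-one correspondence via the primary key `id` (or `(ticker, quarter, section)`).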
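
The read, append, and replace cycle can be tried on a throwaway database first. The sketch below uses an in-memory SQLite connection and a three-row toy table; the values, including the placeholder z-scores, are made up for illustration.

```python
import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database file

original = pd.DataFrame({
    "id": [1, 2, 3],
    "ticker": ["AAA", "AAA", "BBB"],
    "adverb_ratio": [0.010, 0.020, 0.015],
})
original.to_sql("segments_features", conn, index=False)

# Read everything, append one new column, and overwrite the table.
df_out = pd.read_sql_query("SELECT * FROM segments_features", conn)
df_out["adverb_ratio_zscore"] = [np.nan, 0.5, np.nan]  # placeholders, not real z-scores
df_out.to_sql("segments_features", conn, if_exists="replace", index=False)

check = pd.read_sql_query("SELECT * FROM segments_features ORDER BY id", conn)
conn.close()

print(check.columns.tolist())
print(len(check), "rows")
```

Because `if_exists="replace"` drops the table and recreates it from the DataFrame, any primary-key constraint or index defined on the old table would have to be recreated afterwards.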

---

**Dependencies**: run 3.2, 3.3, and 3.4 first to generate `segments_features`.
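
One way to verify this dependency without loading the whole table is to ask SQLite for the table's column list. The helper `missing_columns` and the toy schema are hypothetical. The table name is interpolated into the statement, so it should only ever come from trusted code.

```python
import sqlite3


def missing_columns(conn, table, required):
    """Return the required columns that `table` lacks (all of them if the table is absent)."""
    # PRAGMA table_info yields one row per column; position 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return [c for c in required if c not in existing]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments_features (id INTEGER, ticker TEXT, adverb_ratio REAL)")
absent = missing_columns(conn, "segments_features", ["adverb_ratio", "lm_uncertainty"])
conn.close()

print("missing:", absent)
```

An empty list would mean every required column is already present.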

In [1]:
# ========== Configuration ==========

import numpy as np
import sqlite3
from pathlib import Path

import pandas as pd

PROJECT_ROOT = Path("..").resolve()
OUTPUT_DB = PROJECT_ROOT / "data" / "earnings_calls_features.db"

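ROLLING_WINDOW = 4   # historical rolling window (number of periods = quarters)
MIN_PERIODS = 3     # minimum number of historical periods for a valid z-score
Z_CLIP = 3          # clip to ±3 when |z| > 3
EPSILON = 1e-8      # ε = 10^{-8}, guards against division by zero when σ=0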

FEATURE_COLS = [
    "words_per_sentence",
    "pronoun_plural_ratio",
    "adverb_ratio",
    "lm_net_sentiment",
    "lm_uncertainty",
    "lm_litigious",
    "lm_subjectivity",
    "lm_modal_strong",
    "lm_modal_weak",
]

print("OUTPUT_DB:", OUTPUT_DB)
print("FEATURE_COLS:", FEATURE_COLS)
print("ROLLING_WINDOW=", ROLLING_WINDOW, ", MIN_PERIODS=", MIN_PERIODS)
print("EPSILON (ε) =", EPSILON)

OUTPUT_DB: /Users/xinyuewang/Desktop/1.27/data/earnings_calls_features.db
FEATURE_COLS: ['words_per_sentence', 'pronoun_plural_ratio', 'adverb_ratio', 'lm_net_sentiment', 'lm_uncertainty', 'lm_litigious', 'lm_subjectivity', 'lm_modal_strong', 'lm_modal_weak']
ROLLING_WINDOW= 4 , MIN_PERIODS= 3
EPSILON (ε) = 1e-08


In [2]:
# ========== 1. Parse quarter into a sortable key ==========

def quarter_to_sort_key(q: str):
    """Convert '2024-Q1' into a (year, quarter) tuple for sorting; missing or unparsable values map to (0, 0)."""
    if pd.isna(q) or not q:
        return (0, 0)
    q = str(q).strip()
    try:
        parts = q.split("-")
        year = int(parts[0])
        qnum = int(parts[1].replace("Q", ""))
        return (year, qnum)
    except (IndexError, ValueError):
        return (0, 0)

# Tests
assert quarter_to_sort_key("2024-Q1") == (2024, 1)
assert quarter_to_sort_key("2013-Q4") == (2013, 4)
print("quarter parsing OK")

quarter parsing OK


In [3]:
# ========== 2. Read segments_features ==========

conn = sqlite3.connect(OUTPUT_DB)
df = pd.read_sql_query("SELECT * FROM segments_features", conn)
conn.close()

# Check that the required columns exist
missing = [c for c in FEATURE_COLS if c not in df.columns]
if missing:
    raise ValueError(f"Missing columns; run 3.3 and 3.4 first: {missing}")

df["_qkey"] = df["quarter"].apply(quarter_to_sort_key)
df = df.sort_values(["ticker", "_qkey", "section"]).reset_index(drop=True)

print(f"Read {len(df)} rows")
print(f"Number of tickers: {df['ticker'].nunique()}")

Read 2374 rows
Number of tickers: 28


In [4]:
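# ========== 3. Group by ticker; compute mean/std over the historical window (shift, no look-ahead) ==========

def compute_historical_zscore_per_ticker(group: pd.DataFrame) -> pd.DataFrame:
    """
    For one ticker's group, compute a z-score per row against its historical window.
    Historical window: the 4 most recent quarters; shift(1) means only t-1 and earlier, excluding the current quarter.
    Returns NaN when fewer than min_periods=3 historical quarters are available.
    """
    # A quarter may have several rows (Prepared + Q&A); rows are processed one by one
    # History is partitioned by quarter: the Prepared and Q&A rows of one quarter share the same historical window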
    quarters_sorted = group.drop_duplicates("quarter").sort_values("_qkey")["quarter"].tolist()
    quarter_to_idx = {q: i for i, q in enumerate(quarters_sorted)}
    
    result = group.copy()
    for col in FEATURE_COLS:
        z_col = f"{col}_zscore"
        z_vals = np.full(len(group), np.nan)
        
        for i, (_, row) in enumerate(group.iterrows()):
            q = row["quarter"]
            qidx = quarter_to_idx.get(q, -1)
            if qidx < 0:
                continue
            # Historical window: quarters [qidx-4 : qidx), i.e. t-4 through t-1
            start = max(0, qidx - ROLLING_WINDOW)
            end = qidx
            hist_quarters = quarters_sorted[start:end]
            
            if len(hist_quarters) < MIN_PERIODS:
                continue
            
            hist_mask = group["quarter"].isin(hist_quarters)
            hist_vals = group.loc[hist_mask, col].dropna()
            if len(hist_vals) == 0:
                continue
            
            mean_hist = hist_vals.mean()
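            std_hist = hist_vals.std(ddof=1)  # sample standard deviation
            # std is NaN (never None) when only one value is available; skip that case and zero spread alike
            if pd.isna(std_hist) or std_hist <= 0:
                continue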
            
            x = row[col]
            if pd.isna(x):
                continue
            z = (x - mean_hist) / (std_hist + EPSILON)
            z = np.clip(z, -Z_CLIP, Z_CLIP)
            z_vals[i] = z
        
        result[z_col] = z_vals
    
    return result


dfs = []
for ticker, grp in df.groupby("ticker"):
    dfs.append(compute_historical_zscore_per_ticker(grp))

df_out = pd.concat(dfs, ignore_index=True)
df_out = df_out.drop(columns=["_qkey"])

z_cols = [f"{c}_zscore" for c in FEATURE_COLS]
n_valid = df_out[z_cols].notna().any(axis=1).sum()
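print(f"Z-score computation done, {len(z_cols)} new columns")
print(f"Rows with a valid z-score (at least one non-NaN column): {n_valid} / {len(df_out)}")

Z-score computation done, 9 new columns
Rows with a valid z-score (at least one non-NaN column): 2202 / 2374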


In [5]:
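# ========== 4. Write back to earnings_calls_features.db ==========
# Append the 9 _zscore columns; df_out holds all original columns plus the new ones, so replacing the whole table leaves the existing feature columns intact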

conn = sqlite3.connect(OUTPUT_DB)
df_out.to_sql("segments_features", conn, if_exists="replace", index=False)
conn.close()

print(f"Updated {OUTPUT_DB}")
print(f"Table segments_features: {len(df_out)} rows, {len(df_out.columns)} columns")

Updated /Users/xinyuewang/Desktop/1.27/data/earnings_calls_features.db
Table segments_features: 2374 rows, 40 columns


In [6]:
# ========== 5. Preview ==========

preview_cols = ["id", "ticker", "quarter", "section"] + FEATURE_COLS[:3] + [f"{c}_zscore" for c in FEATURE_COLS[:3]]
conn = sqlite3.connect(OUTPUT_DB)
preview = pd.read_sql_query(f"SELECT {', '.join(preview_cols)} FROM segments_features WHERE {FEATURE_COLS[0]}_zscore IS NOT NULL LIMIT 8", conn)
conn.close()
preview

Unnamed: 0,id,ticker,quarter,section,words_per_sentence,pronoun_plural_ratio,adverb_ratio,words_per_sentence_zscore,pronoun_plural_ratio_zscore,adverb_ratio_zscore
0,71,AAPL,2016-Q1,Prepared Remarks,16.524038,0.886364,0.014548,-0.215828,1.086728,0.068355
1,72,AAPL,2016-Q1,Q&A,15.16129,0.38814,0.01383,-1.513809,-1.091732,-0.133683
2,23,AAPL,2016-Q2,Prepared Remarks,15.937759,0.877637,0.009893,-0.59998,1.027994,-1.398365
3,24,AAPL,2016-Q2,Q&A,15.337243,0.408983,0.020076,-1.170718,-0.980001,1.846627
4,9,AAPL,2016-Q3,Prepared Remarks,14.836207,0.827411,0.007263,-1.152338,0.759422,-2.001235
5,10,AAPL,2016-Q3,Q&A,14.292479,0.408602,0.017346,-1.614102,-0.874352,0.712955
6,15,AAPL,2016-Q4,Prepared Remarks,15.397906,0.832215,0.010541,-0.418873,0.797384,-0.870571
7,16,AAPL,2016-Q4,Q&A,14.143921,0.425968,0.015088,-1.402337,-0.813975,0.162349


In [7]:
# ========== 6. Sanity Check ==========

# 1) min/max of every z-score column should lie within [-3, 3]
print("[1] z-score range check (expected within [-3, 3])")
for c in z_cols:
    lo, hi = df_out[c].min(), df_out[c].max()
    ok = -3.001 <= lo and hi <= 3.001 if pd.notna(lo) and pd.notna(hi) else True
    print(f"  {c}: min={lo:.4f}, max={hi:.4f} {'✓' if ok else '⚠'}")

# 2) NaN share
print("\n[2] NaN share (%)")
nan_pct = df_out[z_cols].isna().mean() * 100
print(nan_pct.round(2).to_string())

# 3) Inspect one ticker's first periods against min_periods=3 (the earliest periods should be NaN)
print("\n[3] First periods of one sample ticker (rows with <3 historical quarters should be NaN)")
sample_ticker = df_out["ticker"].iloc[0]
sample = df_out[df_out["ticker"] == sample_ticker].sort_values("quarter").head(8)
display_cols = ["ticker", "quarter", "section"] + z_cols[:2]
print(sample[display_cols].to_string())

[1] z-score range check (expected within [-3, 3])
  words_per_sentence_zscore: min=-3.0000, max=3.0000 ✓
  pronoun_plural_ratio_zscore: min=-3.0000, max=3.0000 ✓
  adverb_ratio_zscore: min=-3.0000, max=3.0000 ✓
  lm_net_sentiment_zscore: min=-3.0000, max=3.0000 ✓
  lm_uncertainty_zscore: min=-3.0000, max=3.0000 ✓
  lm_litigious_zscore: min=-3.0000, max=3.0000 ✓
  lm_subjectivity_zscore: min=-3.0000, max=3.0000 ✓
  lm_modal_strong_zscore: min=-3.0000, max=3.0000 ✓
  lm_modal_weak_zscore: min=-3.0000, max=3.0000 ✓

[2] NaN share (%)
words_per_sentence_zscore      7.25
pronoun_plural_ratio_zscore    7.25
adverb_ratio_zscore            7.25
lm_net_sentiment_zscore        7.25
lm_uncertainty_zscore          7.25
lm_litigious_zscore            7.25
lm_subjectivity_zscore         7.25
lm_modal_strong_zscore         7.25
lm_modal_weak_zscore           7.25

[3] First periods of one sample ticker (rows with <3 historical quarters should be NaN)
  ticker  quarter           section  words_per_sentence_zscore  pronoun_plural_ratio_zscore
0   AAPL  2015-Q1  Prepared Remarks  