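# 3.2 Readability Scores

Read the `segments` table from `earnings_calls.db`, compute readability features with **textstat**, and write them to `earnings_calls_features.db`.

**Single-table design**: the `segments_features` table omits `content`, so that sections 3.3 and 3.4 can later append further features such as Document Attributes and Sentiment.

**Readability features**: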
- automated_readability (ARI)
- coleman_liau
- dale_chall
- flesch_ease
- flesch_kincaid
- gunning_fog
- smog_index
- overall (text_standard)
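
To make one of these metrics concrete: the Automated Readability Index is defined as 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The sketch below applies that formula with a deliberately simple regex tokenizer. The function name `ari_sketch` and the tokenization rules are assumptions made for illustration, so its values will not exactly match what textstat returns.

```python
import re


def ari_sketch(text: str) -> float:
    """Approximate ARI with naive tokenization (an assumption, not textstat's rules)."""
    # Words: runs of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Sentences: non-empty chunks between ., ! or ? characters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


score = ari_sketch("The cat sat. The dog ran.")
print(round(score, 2))
```

Short words and short sentences push the score down, which is why very simple text can come out below zero.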

In [1]:
# ========== Configuration ==========

import sqlite3
from pathlib import Path

import pandas as pd
from textstat import (
    automated_readability_index,
    coleman_liau_index,
    dale_chall_readability_score,
    flesch_reading_ease,
    flesch_kincaid_grade,
    gunning_fog,
    smog_index,
    text_standard,
)

PROJECT_ROOT = Path("..").resolve()
SOURCE_DB = PROJECT_ROOT / "data" / "earnings_calls.db"
OUTPUT_DB = PROJECT_ROOT / "data" / "earnings_calls_features.db"

print("SOURCE_DB:", SOURCE_DB)
print("OUTPUT_DB:", OUTPUT_DB)

SOURCE_DB: /Users/xinyuewang/Desktop/1.27/data/earnings_calls.db
OUTPUT_DB: /Users/xinyuewang/Desktop/1.27/data/earnings_calls_features.db


In [2]:
# ========== 1. Readability feature function ==========

def compute_readability_features(text: str) -> dict:
    """Compute 8 readability metrics for a text; a metric that raises is set to None."""
    text = text or ""
    result = {}
    funcs = [
        ("automated_readability", automated_readability_index),
        ("coleman_liau", coleman_liau_index),
        ("dale_chall", dale_chall_readability_score),
        ("flesch_ease", flesch_reading_ease),
        ("flesch_kincaid", flesch_kincaid_grade),
        ("gunning_fog", gunning_fog),
        ("smog_index", smog_index),
    ]
    for name, func in funcs:
        try:
            result[name] = func(text)
        except Exception:
            result[name] = None
    try:
        result["overall"] = text_standard(text)
    except Exception:
        result["overall"] = None
    return result
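
The per-metric `try`/`except` means one failing metric does not discard the others for that segment. The sketch below isolates that pattern with two hypothetical stand-in metrics (`word_count` and `always_fails` are not part of textstat). The helper name `safe_apply` is likewise an assumption for illustration.

```python
def safe_apply(funcs, text):
    """Apply each (name, func) pair to text; a func that raises yields None."""
    result = {}
    for name, func in funcs:
        try:
            result[name] = func(text)
        except Exception:
            result[name] = None
    return result


# Hypothetical stand-ins for real metrics, used only for illustration.
def word_count(text):
    return len(text.split())


def always_fails(text):
    raise ValueError("simulated metric failure")


row = safe_apply(
    [("word_count", word_count), ("broken", always_fails)],
    "Revenue grew this quarter",
)
print(row)
```

A `None` value should end up as a missing value in the resulting DataFrame, so failed metrics can be counted afterwards rather than silently dropped.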

In [3]:
# ========== 2. Read segments from earnings_calls.db ==========

conn_src = sqlite3.connect(SOURCE_DB)
df = pd.read_sql_query(
    "SELECT id, ticker, quarter, section, timestamp, url, source_file, content FROM segments",
    conn_src
)
conn_src.close()

print(f"Read {len(df)} segment records")

Read 2374 segment records
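
Readability formulas are not meaningful on empty text, so it can be worth counting NULL or blank `content` rows before computing features. The sketch below runs such a check against an in-memory database. The four rows are made-up illustrative data, not rows from `earnings_calls.db`.

```python
import sqlite3

# In-memory stand-in for the source database (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO segments (id, content) VALUES (?, ?)",
    [(1, "Revenue grew."), (2, ""), (3, None), (4, "   ")],
)

# Count rows whose content is NULL, empty, or whitespace only.
empty_count = conn.execute(
    "SELECT COUNT(*) FROM segments WHERE content IS NULL OR TRIM(content) = ''"
).fetchone()[0]
conn.close()

print(empty_count)
```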


In [4]:
# ========== 3. Compute readability features and merge ==========

readability_rows = []
# Count rows with enumerate so progress does not depend on the DataFrame index labels.
for n, content in enumerate(df["content"], start=1):
    readability_rows.append(compute_readability_features(content))
    if n % 200 == 0:
        print(f"Computed {n} / {len(df)}")

df_feat = pd.DataFrame(readability_rows)
df_out = df[["id", "ticker", "quarter", "section", "timestamp", "url", "source_file"]].copy()
# Both frames carry the default RangeIndex, so the column-wise concat lines up row by row.
df_out = pd.concat([df_out, df_feat], axis=1)
print(f"Merge complete: {len(df_out)} rows")

Computed 200 / 2374
Computed 400 / 2374
Computed 600 / 2374
Computed 800 / 2374
Computed 1000 / 2374
Computed 1200 / 2374
Computed 1400 / 2374
Computed 1600 / 2374
Computed 1800 / 2374
Computed 2000 / 2374
Computed 2200 / 2374
Merge complete: 2374 rows


In [5]:
# ========== 4. Create earnings_calls_features.db and write ==========

conn_out = sqlite3.connect(OUTPUT_DB)
cur = conn_out.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS segments_features (
        id INTEGER PRIMARY KEY,
        ticker TEXT,
        quarter TEXT,
        section TEXT,
        timestamp TEXT,
        url TEXT,
        source_file TEXT,
        automated_readability REAL,
        coleman_liau REAL,
        dale_chall REAL,
        flesch_ease REAL,
        flesch_kincaid REAL,
        gunning_fog REAL,
        smog_index REAL,
        overall TEXT
    )
""")

cur.execute("DELETE FROM segments_features")
df_out.to_sql("segments_features", conn_out, if_exists="append", index=False)
conn_out.commit()
conn_out.close()

print(f"Wrote {OUTPUT_DB}")
print(f"Table segments_features: {len(df_out)} rows")

Wrote /Users/xinyuewang/Desktop/1.27/data/earnings_calls_features.db
Table segments_features: 2374 rows
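
Because the single-table design expects later sections to add columns to `segments_features`, one possible way to do that is `ALTER TABLE ... ADD COLUMN` followed by an `UPDATE` keyed on `id`. The sketch below shows this on an in-memory table. The column `word_count` and all values are hypothetical, and sections 3.3 and 3.4 may well take a different approach.

```python
import sqlite3

# In-memory stand-in for earnings_calls_features.db (illustrative values only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments_features (id INTEGER PRIMARY KEY, flesch_ease REAL)"
)
conn.executemany(
    "INSERT INTO segments_features VALUES (?, ?)", [(1, 52.6), (2, 68.4)]
)

# `word_count` is a hypothetical later feature, not one defined in this notebook.
conn.execute("ALTER TABLE segments_features ADD COLUMN word_count INTEGER")
conn.executemany(
    "UPDATE segments_features SET word_count = ? WHERE id = ?",
    [(4200, 1), (3100, 2)],
)
conn.commit()

rows = conn.execute(
    "SELECT id, flesch_ease, word_count FROM segments_features ORDER BY id"
).fetchall()
conn.close()
print(rows)
```

One thing to keep in mind with this design: the cell above clears the table with `DELETE FROM segments_features` before re-inserting, so re-running it after later columns have been added would leave those columns empty until they are recomputed.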


In [6]:
# ========== 5. Preview ==========

conn = sqlite3.connect(OUTPUT_DB)
preview = pd.read_sql_query("SELECT * FROM segments_features LIMIT 5", conn)
conn.close()
preview

| | id | ticker | quarter | section | timestamp | url | source_file | automated_readability | coleman_liau | dale_chall | flesch_ease | flesch_kincaid | gunning_fog | smog_index | overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AAPL | 2017-Q1 | Prepared Remarks | 2017-01-31 17:00 | https://seekingalpha.com/article/4041266-apple... | transcripts/AAPL/Apple (AAPL) Q1 2017 Results ... | 10.325452 | 10.734523 | 10.553103 | 52.621558 | 9.855627 | 11.654556 | 12.079253 | 9th and 10th grade |
| 1 | 2 | AAPL | 2017-Q1 | Q&A | 2017-01-31 17:00 | https://seekingalpha.com/article/4041266-apple... | transcripts/AAPL/Apple (AAPL) Q1 2017 Results ... | 7.573979 | 7.775031 | 8.696559 | 68.41436 | 7.530903 | 9.706788 | 10.222267 | 7th and 8th grade |
| 2 | 3 | AAPL | 2017-Q2 | Prepared Remarks | 2017-05-02 17:00 | https://seekingalpha.com/article/4068153-apple... | transcripts/AAPL/Apple (AAPL) Q2 2017 Results ... | 10.557117 | 10.675564 | 10.438998 | 53.160792 | 9.944696 | 11.940555 | 12.119037 | 9th and 10th grade |
| 3 | 4 | AAPL | 2017-Q2 | Q&A | 2017-05-02 17:00 | https://seekingalpha.com/article/4068153-apple... | transcripts/AAPL/Apple (AAPL) Q2 2017 Results ... | 7.741924 | 7.856303 | 8.508782 | 68.80583 | 7.568579 | 9.727035 | 10.104827 | 7th and 8th grade |
| 4 | 5 | AAPL | 2018-Q2 | Prepared Remarks | 2018-05-01 17:00 | https://seekingalpha.com/article/4168271-apple... | transcripts/AAPL/Apple (AAPL) Q2 2018 Results ... | 10.918473 | 10.822932 | 10.518837 | 51.48382 | 10.390191 | 12.236356 | 12.303721 | 10th and 11th grade |
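
In the five preview rows, the Q&A segments have higher `flesch_ease` values (easier text) than the Prepared Remarks segments. A `GROUP BY section` query is one way to check whether that holds across the whole table. The sketch below runs such a query on an in-memory table that holds only the five `flesch_ease` values shown in the preview, so it illustrates the query and says nothing about all 2374 rows.

```python
import sqlite3

# Only the five preview rows; the real table lives in earnings_calls_features.db.
preview_rows = [
    (1, "Prepared Remarks", 52.621558),
    (2, "Q&A", 68.41436),
    (3, "Prepared Remarks", 53.160792),
    (4, "Q&A", 68.80583),
    (5, "Prepared Remarks", 51.48382),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments_features (id INTEGER PRIMARY KEY, section TEXT, flesch_ease REAL)"
)
conn.executemany("INSERT INTO segments_features VALUES (?, ?, ?)", preview_rows)

# Mean Flesch reading ease per section.
averages = dict(
    conn.execute(
        "SELECT section, AVG(flesch_ease) FROM segments_features GROUP BY section"
    ).fetchall()
)
conn.close()

for section, mean_ease in sorted(averages.items()):
    print(f"{section}: {mean_ease:.2f}")
```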
