# HeidelTime Standalone + TreeTagger — Complete Ablation Notebook

This notebook runs **HeidelTime Standalone** on your OCR `.txt` files in two scenarios:

1. **Without POS tagger** (`-pos NO`) — no TreeTagger required
2. **With TreeTagger POS** (`-pos TREETAGGER`) — requires TreeTagger + `english.par`

It then **compares** the results and saves CSVs for your ablation study.

✅ Robust parsing: HeidelTime output sometimes includes logs mixed with XML; we extract only the `<TimeML>...</TimeML>` block before parsing.

Expected repo layout:
- `heideltime-standalone/de.unihd.dbs.heideltime.standalone.jar`
- `heideltime-standalone/config.props`
- `OCR_output/<STATE>/*.txt`
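
A quick way to confirm this layout before running anything is sketched below. `check_layout` is a hypothetical helper (not part of HeidelTime or of this notebook); it only tests that the paths listed above exist and counts the `.txt` files in each state folder.

```python
from pathlib import Path

def check_layout(root: Path) -> dict:
    """Return existence flags for the expected files plus a per-state .txt count."""
    ht_dir = root / "heideltime-standalone"
    ocr_dir = root / "OCR_output"
    report = {
        "jar": (ht_dir / "de.unihd.dbs.heideltime.standalone.jar").is_file(),
        "config": (ht_dir / "config.props").is_file(),
        "states": {},
    }
    # Each sub-folder of OCR_output is treated as one state.
    if ocr_dir.is_dir():
        for state_dir in sorted(p for p in ocr_dir.iterdir() if p.is_dir()):
            report["states"][state_dir.name] = len(list(state_dir.glob("*.txt")))
    return report

# Assumes the notebook is started from the repo root.
print(check_layout(Path.cwd()))
```

A `False` flag or an empty `states` dictionary usually means the notebook was started from the wrong working directory.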


## 0) Prerequisites

### Java
HeidelTime is a Java program. `java -version` must work.

### TreeTagger (only for POS scenario)
You need:
- `D:\TreeTagger\bin\tree-tagger.exe`
- `D:\TreeTagger\lib\english.par` (download the **English parameter file (PENN tagset)** and extract `english.par.gz`)

If you don't have `english.par`, TreeTagger mode will fail.


In [13]:
import os, subprocess, shutil, re
from pathlib import Path
import datetime
import pandas as pd
import xml.etree.ElementTree as ET
from tqdm import tqdm

# =====================
# 1) Paths (edit only if needed)
# =====================
# Recommended: run this notebook from the REPO ROOT that contains `heideltime-standalone/` and `OCR_output/`
PROJECT_ROOT = Path.cwd()

HT_DIR  = PROJECT_ROOT / "heideltime-standalone"
HT_JAR  = HT_DIR / "de.unihd.dbs.heideltime.standalone.jar"
HT_CONF = HT_DIR / "config.props"

OCR_DIR = PROJECT_ROOT / "OCR_output"

# =====================
# 2) Choose what to run
# =====================
STATE = "California"  # <- change if needed
STATE_DIR = OCR_DIR / STATE

# =====================
# 3) HeidelTime parameters
# =====================
LANG    = "ENGLISH"
DOCTYPE = "NARRATIVES"     # agreements are usually fine as NARRATIVES
OUTPUT  = "TIMEML"         # we parse TIMEML
DCT     = datetime.date.today().strftime("%Y-%m-%d")  # used only for DOCTYPE NEWS/COLLOQUIAL

# =====================
# 4) TreeTagger location (set this!)
# =====================
# Set to your TreeTagger root folder (the one containing bin/ and lib/).
# Section 2 raises an error if this is None, because the full run requires TreeTagger.
TREETAGGER_HOME = Path(r"D:\NLP_Project_tasks_6_8_11\TreeTagger")

print("PROJECT_ROOT:", PROJECT_ROOT)
print("HT_DIR      :", HT_DIR)
print("HT_JAR      :", HT_JAR, "exists:", HT_JAR.exists())
print("HT_CONF     :", HT_CONF, "exists:", HT_CONF.exists())
print("OCR_DIR     :", OCR_DIR, "exists:", OCR_DIR.exists())
print("STATE_DIR   :", STATE_DIR, "exists:", STATE_DIR.exists())
print("TREETAGGER_HOME:", TREETAGGER_HOME)


PROJECT_ROOT: d:\NLP_Project_tasks_6_8_11
HT_DIR      : d:\NLP_Project_tasks_6_8_11\heideltime-standalone
HT_JAR      : d:\NLP_Project_tasks_6_8_11\heideltime-standalone\de.unihd.dbs.heideltime.standalone.jar exists: True
HT_CONF     : d:\NLP_Project_tasks_6_8_11\heideltime-standalone\config.props exists: True
OCR_DIR     : d:\NLP_Project_tasks_6_8_11\OCR_output exists: True
STATE_DIR   : d:\NLP_Project_tasks_6_8_11\OCR_output\California exists: True
TREETAGGER_HOME: D:\NLP_Project_tasks_6_8_11\TreeTagger


## 1) Environment checks

In [14]:
def check(cmd):
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True)
        return True, out.strip()
    except Exception as e:
        return False, str(e)

ok_java, java_out = check(["java", "-version"])
print("Java OK:", ok_java)
print(java_out[:800])

if not ok_java:
    raise RuntimeError(
        "Java not found. Install JDK (e.g., Temurin 17) and ensure `java -version` works in a NEW terminal / after restarting VS Code."
    )

if not HT_JAR.exists():
    raise FileNotFoundError(f"Missing HeidelTime JAR at: {HT_JAR}")
if not HT_CONF.exists():
    raise FileNotFoundError(f"Missing HeidelTime config.props at: {HT_CONF}")
if not STATE_DIR.exists():
    raise FileNotFoundError(f"Missing OCR state folder at: {STATE_DIR}")

txt_files = sorted(STATE_DIR.glob("*.txt"))
print("TXT files found:", len(txt_files))
print("Example:", txt_files[:3])


Java OK: True
openjdk version "25.0.1" 2025-10-21 LTS
OpenJDK Runtime Environment Temurin-25.0.1+8 (build 25.0.1+8-LTS)
OpenJDK 64-Bit Server VM Temurin-25.0.1+8 (build 25.0.1+8-LTS, mixed mode, sharing)
TXT files found: 116
Example: [WindowsPath('d:/NLP_Project_tasks_6_8_11/OCR_output/California/1 (5).txt'), WindowsPath('d:/NLP_Project_tasks_6_8_11/OCR_output/California/1 (CA - Chille).txt'), WindowsPath('d:/NLP_Project_tasks_6_8_11/OCR_output/California/110242023.txt')]


In [15]:
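# NOTE: this cell checks a hard-coded install at D:/TreeTagger, which is a different
# folder from TREETAGGER_HOME set in the first cell. The HeidelTime runs below use
# TREETAGGER_HOME (via the patched config), so point `tt` at TREETAGGER_HOME if that
# is the install you want to verify.
tt = Path("D:/TreeTagger")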
checks = {
    "tree-tagger.exe": tt/"bin"/"tree-tagger.exe",
    "english.par": tt/"lib"/"english.par",
    "english-abbreviations": tt/"lib"/"english-abbreviations",
    "utf8-tokenize.perl": tt/"cmd"/"utf8-tokenize.perl",
}

for k,v in checks.items():
    print(f"{k:22} -> {v}  exists={v.exists()}")


tree-tagger.exe        -> D:\TreeTagger\bin\tree-tagger.exe  exists=True
english.par            -> D:\TreeTagger\lib\english.par  exists=True
english-abbreviations  -> D:\TreeTagger\lib\english-abbreviations  exists=True
utf8-tokenize.perl     -> D:\TreeTagger\cmd\utf8-tokenize.perl  exists=True


## 2) TreeTagger validation + patching config.props
HeidelTime reads the TreeTagger location from `config.props` (commonly `treeTaggerHome=...`).
We write a **patched copy** of the config (`config.props.patched`) for the TreeTagger run, leaving your original untouched.
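
The reason the patched value needs doubled backslashes can be sketched as follows. Java's `Properties` loader treats a backslash as an escape character and silently drops it before a character that is not a recognised escape. `to_properties_value` and `java_unescape` are hypothetical helper names; the second is only a simplified model of the loader (it ignores `\uXXXX` escapes and line continuations), and the path is just an example.

```python
from pathlib import PureWindowsPath

def to_properties_value(path: PureWindowsPath) -> str:
    """Double every backslash so the path survives a Java .properties load."""
    return str(path).replace("\\", "\\\\")

def java_unescape(value: str) -> str:
    """Simplified model of how java.util.Properties interprets backslashes."""
    specials = {"t": "\t", "n": "\n", "r": "\r", "f": "\f"}
    out, i = [], 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            # Known escapes are translated; for anything else the backslash is dropped.
            out.append(specials.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

raw = r"D:\TreeTagger"
escaped = to_properties_value(PureWindowsPath(raw))

print(java_unescape(raw))      # D:TreeTagger  (backslash lost)
print(java_unescape(escaped))  # D:\TreeTagger (path preserved)
```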

In [16]:
def validate_treetagger_install(treetagger_home: Path) -> None:
    if treetagger_home is None:
        return
    if not treetagger_home.exists():
        raise FileNotFoundError(f"TREETAGGER_HOME does not exist: {treetagger_home}")

    bin_dir = treetagger_home / "bin"
    lib_dir = treetagger_home / "lib"

    exe1 = bin_dir / "tree-tagger.exe"
    exe2 = bin_dir / "tree-tagger"
    if not (exe1.exists() or exe2.exists()):
        raise FileNotFoundError(f"TreeTagger binary not found in: {bin_dir}")

    eng_par = lib_dir / "english.par"
    if not eng_par.exists():
        raise FileNotFoundError(
            f"Missing TreeTagger English parameter file: {eng_par}\n"
            "Download 'English parameter file (PENN tagset)' (english.par.gz), extract to english.par, and place it in lib/."
        )

    print("TreeTagger OK:")
    print("  bin:", bin_dir)
    print("  lib:", lib_dir)
    print("  english.par:", eng_par)
    
def patch_config_for_treetagger(conf_path: Path, treetagger_home: Path, key: str = "treeTaggerHome") -> Path:
    r"""
    Write a patched copy of config.props with `key` (default: treeTaggerHome) set.

    IMPORTANT (Windows): Java .properties files treat the backslash as an escape character.
    If we write `D:\TreeTagger`, Java reads it as `D:TreeTagger`.
    So we must escape backslashes -> `D:\\TreeTagger` (double backslash in the file).

    This docstring is a raw string so the backslashes above are not Python escapes.
    """
    text = conf_path.read_text(encoding="utf-8", errors="ignore")
    lines = text.splitlines()

    # Escape backslashes for Java properties parsing
    tt_home = str(treetagger_home).replace('\\', '\\\\')

    found = False
    new_lines = []
    for line in lines:
        if line.strip().startswith(f"{key}="):
            new_lines.append(f"{key}={tt_home}")
            found = True
        else:
            new_lines.append(line)

    if not found:
        new_lines.append(f"{key}={tt_home}")

    patched = conf_path.with_suffix(conf_path.suffix + ".patched")
    patched.write_text('\n'.join(new_lines), encoding="utf-8")
    return patched


# ---- RUN THE PATCH (this creates `patched_conf`) ----

patched_conf = None

if TREETAGGER_HOME is None:
    # TREETAGGER_HOME is a Python variable in this notebook, not an environment variable.
    raise RuntimeError(
        "TREETAGGER_HOME is not set.\n"
        "Set the TREETAGGER_HOME variable in the first code cell to your TreeTagger folder "
        "(the one that contains bin/ and lib/), re-run that cell, then re-run Section 2.\n"
        "Example (Linux/macOS): TREETAGGER_HOME = Path.home() / 'TreeTagger'\n"
        "Example (Windows): TREETAGGER_HOME = Path(r'C:\\TreeTagger')\n"
    )

validate_treetagger_install(TREETAGGER_HOME)
patched_conf = patch_config_for_treetagger(
    HT_CONF,
    TREETAGGER_HOME,
    key="treeTaggerHome"
)
print("Patched config written to:", patched_conf)
print("treeTaggerHome line:", [l for l in patched_conf.read_text(encoding="utf-8", errors="ignore").splitlines() if l.startswith("treeTaggerHome=")][0])


TreeTagger OK:
  bin: D:\NLP_Project_tasks_6_8_11\TreeTagger\bin
  lib: D:\NLP_Project_tasks_6_8_11\TreeTagger\lib
  english.par: D:\NLP_Project_tasks_6_8_11\TreeTagger\lib\english.par
Patched config written to: d:\NLP_Project_tasks_6_8_11\heideltime-standalone\config.props.patched
treeTaggerHome line: treeTaggerHome=D:\\NLP_Project_tasks_6_8_11\\TreeTagger


## 3) Run HeidelTime on one file (smoke test)
Includes robust `<TimeML>` extraction to avoid XML parse errors from mixed stdout/logs.
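
To make the extraction step concrete, here is a minimal, self-contained sketch. The `noisy_output` string is invented for illustration (it is not real HeidelTime output), and `extract_block` is a stand-in name for the helper defined in the next cell.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: log lines wrapped around a small TimeML document.
noisy_output = (
    "INFO: some log line printed before the XML\n"
    '<?xml version="1.0"?>\n'
    "<TimeML>\n"
    'Valid until <TIMEX3 tid="t1" type="DATE" value="1961-06-30">June 30, 1961</TIMEX3>.\n'
    "</TimeML>\n"
    "INFO: another log line after the XML\n"
)

def extract_block(text: str) -> str:
    """Return only the <TimeML>...</TimeML> span, dropping anything around it."""
    match = re.search(r"<TimeML\b.*?</TimeML\s*>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <TimeML> block found")
    return match.group(0)

root = ET.fromstring(extract_block(noisy_output))
timexes = [
    (el.get("tid"), el.get("type"), el.get("value"), "".join(el.itertext()))
    for el in root.iter("TIMEX3")
]
print(timexes)  # [('t1', 'DATE', '1961-06-30', 'June 30, 1961')]
```

Passing `noisy_output` straight to `ET.fromstring` would raise a `ParseError`, because the log line in front of the XML is not well-formed markup.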

In [17]:
def run_heideltime(txt_path: Path, pos_mode: str, conf_path: Path) -> str:
    cmd = [
        "java", "-Dfile.encoding=UTF-8",
        "-jar", str(HT_JAR),
        str(txt_path),
        "-c", str(conf_path),
        "-l", LANG,
        "-t", DOCTYPE,
        "-o", OUTPUT,
        "-pos", pos_mode,
        "-e", "UTF-8",
    ]
    if DOCTYPE.upper() in {"NEWS", "COLLOQUIAL"}:
        cmd += ["-dct", DCT]

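    # Decode the output as UTF-8 explicitly (assuming HeidelTime writes UTF-8 to stdout).
    # With text=True alone, Python uses the locale encoding (e.g. cp1252 on Windows),
    # which can garble or reject non-ASCII characters.
    proc = subprocess.run(cmd, cwd=str(HT_DIR), capture_output=True,
                          text=True, encoding="utf-8", errors="replace")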
    if proc.returncode != 0:
        raise RuntimeError(
            "HeidelTime failed.\n\nCMD:\n" + " ".join(cmd) +
            "\n\nSTDOUT:\n" + (proc.stdout[:2000] or "<empty>") +
            "\n\nSTDERR:\n" + (proc.stderr[:2000] or "<empty>")
        )
    return proc.stdout

def extract_timeml_block(text: str) -> str:
    m = re.search(r"<TimeML[\s\S]*?</TimeML>", text)
    if not m:
        preview = text[:500].replace("\n", "\\n")
        raise ValueError(f"No <TimeML> block found in HeidelTime output. Preview: {preview}")
    return m.group(0)

def parse_timeml_timex3(heideltime_output: str) -> pd.DataFrame:
    timeml_xml = extract_timeml_block(heideltime_output)
    root = ET.fromstring(timeml_xml)

    rows = []
    for timex in root.iter():
        if timex.tag.lower().endswith("timex3"):
            rows.append({
                "tid": timex.attrib.get("tid"),
                "type": timex.attrib.get("type"),
                "value": timex.attrib.get("value"),
                "mod": timex.attrib.get("mod"),
                "quant": timex.attrib.get("quant"),
                "freq": timex.attrib.get("freq"),
                "beginPoint": timex.attrib.get("beginPoint"),
                "endPoint": timex.attrib.get("endPoint"),
                "text": "".join(timex.itertext()).strip(),
            })
    return pd.DataFrame(rows)

example = txt_files[0]
print("Example file:", example)
out_no = run_heideltime(example, pos_mode="NO", conf_path=HT_CONF)
df_preview = parse_timeml_timex3(out_no)
df_preview.head(20)


Example file: d:\NLP_Project_tasks_6_8_11\OCR_output\California\1 (5).txt


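|   | tid | type | value | mod | quant | freq | beginPoint | endPoint | text |
|---|-----|------|-------|-----|-------|------|------------|----------|------|
| 0 | t3 | DATE | 19 | | | | | | twentieth\ncentury |
| 1 | t4 | DATE | 1961 | | | | | | 1961 |
| 2 | t17 | DURATION | P1Y | | | | | | twelve months |
| 3 | t16 | DATE | 1961-06-30 | | | | | | June 30 |
| 4 | t28 | DATE | PRESENT_REF | | | | | | current |
| 5 | t29 | DATE | 1957 | | | | | | 1957 |
| 6 | t30 | DATE | 1956 | | | | | | 1956 |
| 7 | t32 | DURATION | P30D | | | | | | 30 days |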


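## 4) Full dataset (all states): single run with TreeTagger
- Processes every `OCR_output/<State>/*.txt` file exactly once
- Extracts the validity fields (effective date, end date, duration) plus a short evidence snippet
- Records wall-clock time and memory per document (memory is measured for the notebook's Python process only; the HeidelTime JVM runs as a separate process and is not included)
- Builds per-state time/memory aggregates
- Saves the results as CSV files and as one multi-sheet Excel workbook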
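
The per-document time and memory bookkeeping can be sketched as follows. `profile_call` is a hypothetical helper, not a function used by this notebook, and the list allocation stands in for real work. `tracemalloc` only sees allocations made by the Python interpreter, so for work done in a child process the reported peak reflects only the Python-side handling of its output.

```python
import time
import tracemalloc

def profile_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_python_mb)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        # Always stop tracing, even if fn raises.
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak / (1024 ** 2)

# Stand-in workload: allocate a list of one million integers.
result, seconds, peak_mb = profile_call(lambda n: [0] * n, 1_000_000)
print(f"{seconds:.4f}s, peak {peak_mb:.2f} MB")
```

Wrapping the measurement in `try`/`finally` keeps tracing from leaking into the next document when one call fails.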

In [18]:
# ============================================================
# FULL DATASET (ALL STATES) 
# ============================================================

import os, re, time, gc, json, datetime
from pathlib import Path
import pandas as pd
import tracemalloc
import xml.etree.ElementTree as ET

USE_CACHE = False   # keep False for the timed run: a cache hit skips HeidelTime, so the timings would be meaningless


# Optional RSS memory (nice-to-have)
try:
    import psutil
    _HAS_PSUTIL = True
    _PROC = psutil.Process(os.getpid())
except Exception:
    _HAS_PSUTIL = False
    _PROC = None

def _rss_mb():
    if not _HAS_PSUTIL:
        return None
    return _PROC.memory_info().rss / (1024 ** 2)

# -----------------------
# 0) Inputs / outputs
# -----------------------
OCR_ROOT = Path(OCR_DIR)  # from earlier cells
if not OCR_ROOT.exists():
    raise RuntimeError(f"OCR_DIR not found: {OCR_ROOT.resolve()}")

OUT_DIR = Path("tables")
OUT_DIR.mkdir(parents=True, exist_ok=True)

CACHE_DIR = Path("heideltime_cache")
CACHE_DIR.mkdir(exist_ok=True)

EVIDENCE_MAX_CHARS = 120

# We require TreeTagger in this notebook.
if patched_conf is None:
    raise RuntimeError("patched_conf is None — TreeTagger config not patched. Run Section 2.")

# -----------------------
# 1) Helpers (robust TimeML parsing)
# -----------------------
_ILLEGAL_XML_1_0_RE = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F]"  # illegal control chars
    r"|[\uD800-\uDFFF]"               # surrogate blocks
    r"|\uFFFE|\uFFFF"                 # non-characters
)
_AMP_NOT_ENTITY_RE = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")

def sanitize_xml_text(s: str) -> str:
    if not s:
        return s
    s = _ILLEGAL_XML_1_0_RE.sub("", s)
    s = _AMP_NOT_ENTITY_RE.sub("&amp;", s)
    return s

def extract_timeml_block(heideltime_output: str) -> str:
    if not heideltime_output:
        return ""
    m = re.search(r"(<TimeML\b.*?</TimeML\s*>)", heideltime_output, flags=re.DOTALL)
    if m:
        return m.group(1)
    m = re.search(r"(<TIMEML\b.*?</TIMEML\s*>)", heideltime_output, flags=re.DOTALL | re.IGNORECASE)
    if m:
        return m.group(1)
    return heideltime_output

def parse_timeml_timex3(heideltime_output: str) -> pd.DataFrame:
    timeml_xml = sanitize_xml_text(extract_timeml_block(heideltime_output))
    if "<" in timeml_xml:
        timeml_xml = timeml_xml[timeml_xml.find("<"):]
    try:
        root = ET.fromstring(timeml_xml)
    except ET.ParseError:
        root = ET.fromstring(sanitize_xml_text(f"<ROOT>{timeml_xml}</ROOT>"))

    rows = []
    for el in root.iter():
        tag = str(el.tag)
        if tag == "TIMEX3" or tag.endswith("TIMEX3"):
            rows.append({
                "tid": el.attrib.get("tid"),
                "type": el.attrib.get("type"),
                "value": el.attrib.get("value"),
                "mod": el.attrib.get("mod"),
                "text": "".join(el.itertext()).strip(),
            })
    return pd.DataFrame(rows)

# -----------------------
# 2) Validity extraction (same logic as your old final cell)
# -----------------------
ANCHOR_START = re.compile(
    r"\beffective\b|\benter\s+into\s+force\b|\benter\s+into\s+effect\b|\bas\s+of\b|\bcommenc",
    re.I
)
ANCHOR_END = re.compile(
    r"\buntil\b|\bexpire\b|\bexpiration\b|\bterminate\b|\btermination\b|\bend\s+date\b",
    re.I
)

def split_sentences(text: str):
    text = re.sub(r"\s+", " ", (text or "")).strip()
    if not text:
        return []
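    # Whitespace (including newlines) was collapsed above, so splitting on sentence
    # punctuation followed by whitespace is sufficient.
    return re.split(r"(?<=[.!?])\s+", text)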

def normalize_timex_date(value: str):
    if not value:
        return None
    v = str(value).strip()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
        return v
    if re.fullmatch(r"\d{4}-\d{2}", v):
        return f"{v}-01"
    if re.fullmatch(r"\d{4}", v):
        return f"{v}-01-01"
    return None

def aggregate_validity_from_heideltime(doc_text: str, timex_df: pd.DataFrame):
    sents = split_sentences(doc_text)

    candidates = []
    for _, r in timex_df.iterrows():
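        # Collapse whitespace so multi-line TIMEX text (e.g. "twentieth\ncentury") can
        # match the whitespace-normalised sentences returned by split_sentences().
        timex_text = re.sub(r"\s+", " ", str(r.get("text") or "")).strip()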
        timex_type = str(r.get("type") or "").strip().upper()
        timex_val  = str(r.get("value") or "").strip()
        if not timex_text:
            continue

        sent_hit = None
        for s in sents:
            if timex_text and timex_text in s:
                sent_hit = s
                break

        candidates.append({
            "timex_text": timex_text,
            "timex_type": timex_type,
            "timex_value": timex_val,
            "sentence": sent_hit,
            "start_anchor": bool(sent_hit and ANCHOR_START.search(sent_hit)),
            "end_anchor": bool(sent_hit and ANCHOR_END.search(sent_hit)),
        })

    duration = None
    for c in candidates:
        if c["timex_type"] == "DURATION":
            duration = c["timex_text"]
            break

    date_cands = []
    for c in candidates:
        if c["timex_type"] == "DATE":
            d = normalize_timex_date(c["timex_value"])
            if d:
                date_cands.append((d, c))
    date_cands.sort(key=lambda x: x[0])

    effective_date = None
    end_date = None
    end_date_source = None

    for d, c in date_cands:
        if effective_date is None and c["start_anchor"]:
            effective_date = d

    for d, c in reversed(date_cands):
        if end_date is None and c["end_anchor"]:
            end_date = d
            end_date_source = "explicit"

    if effective_date is None and date_cands:
        effective_date = date_cands[0][0]
    if end_date is None and len(date_cands) >= 2:
        end_date = date_cands[-1][0]

    has_any = bool(effective_date or end_date or duration)
    status = "found" if has_any else ("uncertain" if candidates else "absent")

    evidence = []
    for c in candidates:
        if c.get("sentence"):
            evidence.append({
                "text": c["sentence"],
                "timex_text": c["timex_text"],
                "timex_type": c["timex_type"],
                "timex_value": c["timex_value"],
            })
        if len(evidence) >= 5:
            break

    return {
        "effective_date": effective_date,
        "end_date": end_date,
        "duration": duration,
        "end_date_source": end_date_source,
        "validity_status": status,
        "validity_evidence": evidence,
    }

def evidence_short(evidence_list, max_chars=EVIDENCE_MAX_CHARS):
    if not evidence_list:
        return None
    s = str(evidence_list[0].get("text") or "").strip()
    s = re.sub(r"\s+", " ", s)
    return s[:max_chars] + ("…" if len(s) > max_chars else "")

# -----------------------
# 3) HeidelTime cache (TreeTagger mode)
# -----------------------

def get_heideltime_output(txt_path: Path):
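    # Include the state folder in the cache key: file names are not guaranteed
    # to be unique across states.
    cache_file = CACHE_DIR / f"{txt_path.parent.name}__{txt_path.name}.TREETAGGER.timeml.txt"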

    if USE_CACHE and cache_file.exists():
        return cache_file.read_text(encoding="utf-8", errors="ignore")

    out = run_heideltime(txt_path, pos_mode="TREETAGGER", conf_path=patched_conf)

    if USE_CACHE:
        cache_file.write_text(out or "", encoding="utf-8", errors="ignore")

    return out

# -----------------------
# 4) Collect ALL documents across states
# -----------------------
all_txt = sorted(OCR_ROOT.glob("*/*.txt"))
if not all_txt:
    raise RuntimeError(f"No .txt found under: {OCR_ROOT.resolve()} (expected OCR_output/<State>/*.txt)")

print(f"[INFO] Total documents found: {len(all_txt)}")

# -----------------------
# 5) SINGLE RUN over whole dataset (time + memory)
# -----------------------
rows = []
prof = []

t0_all = time.perf_counter()

for i, p in enumerate(all_txt, start=1):
    state = p.parent.name
    doc_id = f"{state}/{p.stem}"

    # Read text once
    doc_text = p.read_text(encoding="utf-8", errors="ignore")

    rss_before = _rss_mb()
    tracemalloc.start()
    t0 = time.perf_counter()

    err = None
    out = None
    try:
        out = get_heideltime_output(p)
        timex_df = parse_timeml_timex3(out)
        validity = aggregate_validity_from_heideltime(doc_text, timex_df)
    except Exception as e:
        validity = {
            "effective_date": None,
            "end_date": None,
            "duration": None,
            "end_date_source": None,
            "validity_status": "error",
            "validity_evidence": [{"text": f"ERROR: {e}"}],
        }
        err = str(e)

    dt = time.perf_counter() - t0
    cur, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    rss_after = _rss_mb()

    ev_short = evidence_short(validity.get("validity_evidence"))

    rows.append({
        "doc_id": doc_id,
        "state": state,
        "source_path": str(p),
        "effective_date": validity.get("effective_date"),
        "end_date": validity.get("end_date"),
        "duration": validity.get("duration"),
        "end_date_source": validity.get("end_date_source"),
        "validity_status": validity.get("validity_status"),
        # store full evidence as JSON string (Excel/CSV friendly)
        "validity_evidence": json.dumps(validity.get("validity_evidence") or [], ensure_ascii=False),
        "evidence_short": ev_short,
        "error": err,
    })

    prof.append({
        "doc_id": doc_id,
        "state": state,
        "source_path": str(p),
        "time_sec": dt,
        "py_peak_mem_mb": peak / (1024 ** 2),
        "rss_before_mb": rss_before,
        "rss_after_mb": rss_after,
        "rss_delta_mb": (rss_after - rss_before) if (rss_after is not None and rss_before is not None) else None,
        "had_error": bool(err),
    })

    if i % 50 == 0:
        gc.collect()
        print(f"[INFO] Processed {i}/{len(all_txt)}")

total_time = time.perf_counter() - t0_all

results_df = pd.DataFrame(rows)
prof_df = pd.DataFrame(prof)

# -----------------------
# 6) Per-state matrices (aggregates)
# -----------------------
state_agg = (
    prof_df.groupby("state", dropna=False)
    .agg(
        n_docs=("doc_id", "count"),
        n_errors=("had_error", "sum"),
        total_time_sec=("time_sec", "sum"),
        avg_time_sec=("time_sec", "mean"),
        median_time_sec=("time_sec", "median"),
        p95_time_sec=("time_sec", lambda s: s.quantile(0.95) if s.notna().any() else None),
        avg_py_peak_mem_mb=("py_peak_mem_mb", "mean"),
        max_py_peak_mem_mb=("py_peak_mem_mb", "max"),
        avg_rss_delta_mb=("rss_delta_mb", "mean"),
        max_rss_delta_mb=("rss_delta_mb", "max"),
    )
    .reset_index()
    .sort_values(["state"])
)

summary = pd.DataFrame([{
    "n_docs": len(results_df),
    "n_errors": int(prof_df["had_error"].sum()),
    "total_time_sec": total_time,
    "avg_time_per_doc_sec": float(prof_df["time_sec"].mean()),
    "median_time_per_doc_sec": float(prof_df["time_sec"].median()),
    "p95_time_per_doc_sec": float(prof_df["time_sec"].quantile(0.95)),
    "max_py_peak_mem_mb": float(prof_df["py_peak_mem_mb"].max()),
}])

# -----------------------
# 7) Save CSV + Excel
# -----------------------
stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
base = "full_dataset_treetagger"

csv_results = OUT_DIR / f"{base}_results.csv"
csv_prof    = OUT_DIR / f"{base}_time_memory_per_doc.csv"
csv_state   = OUT_DIR / f"{base}_time_memory_by_state.csv"
csv_summary = OUT_DIR / f"{base}_time_memory_summary.csv"
xlsx_path   = OUT_DIR / f"{base}_ALL.xlsx"

results_df.to_csv(csv_results, index=False, encoding="utf-8")
prof_df.to_csv(csv_prof, index=False, encoding="utf-8")
state_agg.to_csv(csv_state, index=False, encoding="utf-8")
summary.to_csv(csv_summary, index=False, encoding="utf-8")

with pd.ExcelWriter(xlsx_path, engine="openpyxl") as w:
    results_df.to_excel(w, index=False, sheet_name="results")
    prof_df.to_excel(w, index=False, sheet_name="time_memory_per_doc")
    state_agg.to_excel(w, index=False, sheet_name="time_memory_by_state")
    summary.to_excel(w, index=False, sheet_name="summary")

print("\n[OK] Saved:")
print(" -", csv_results)
print(" -", csv_prof)
print(" -", csv_state)
print(" -", csv_summary)
print(" -", xlsx_path)

display(summary)
display(state_agg.head(10))
results_df.head(5)


[INFO] Total documents found: 298
[INFO] Processed 50/298
[INFO] Processed 100/298
[INFO] Processed 150/298
[INFO] Processed 200/298
[INFO] Processed 250/298

[OK] Saved:
 - tables\full_dataset_treetagger_results.csv
 - tables\full_dataset_treetagger_time_memory_per_doc.csv
 - tables\full_dataset_treetagger_time_memory_by_state.csv
 - tables\full_dataset_treetagger_time_memory_summary.csv
 - tables\full_dataset_treetagger_ALL.xlsx


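|   | n_docs | n_errors | total_time_sec | avg_time_per_doc_sec | median_time_per_doc_sec | p95_time_per_doc_sec | max_py_peak_mem_mb |
|---|--------|----------|----------------|----------------------|-------------------------|----------------------|--------------------|
| 0 | 298 | 3 | 895.962781 | 3.003188 | 2.752497 | 4.43746 | 0.551585 |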


|   | state | n_docs | n_errors | total_time_sec | avg_time_sec | median_time_sec | p95_time_sec | avg_py_peak_mem_mb | max_py_peak_mem_mb | avg_rss_delta_mb | max_rss_delta_mb |
|---|-------|--------|----------|----------------|--------------|-----------------|--------------|--------------------|--------------------|------------------|------------------|
| 0 | Alabama | 10 | 0 | 28.044413 | 2.804441 | 2.633528 | 3.57537 | 0.062143 | 0.21539 | 0.080859 | 0.445312 |
| 1 | Alaska | 12 | 0 | 38.110694 | 3.175891 | 2.838027 | 4.758265 | 0.133502 | 0.551585 | 0.083333 | 0.429688 |
| 2 | Arizona | 17 | 1 | 48.411515 | 2.847736 | 2.755438 | 3.558179 | 0.095198 | 0.236548 | 0.009651 | 0.027344 |
| 3 | Arkansas | 10 | 0 | 27.291306 | 2.729131 | 2.58873 | 3.676403 | 0.076703 | 0.294326 | 0.007422 | 0.015625 |
| 4 | California | 116 | 0 | 340.659986 | 2.936724 | 2.863382 | 3.716802 | 0.102575 | 0.454807 | 0.003098 | 0.199219 |
| 5 | Connecticut | 2 | 0 | 6.072587 | 3.036293 | 3.036293 | 3.16333 | 0.141356 | 0.180077 | 0.001953 | 0.003906 |
| 6 | Hawaii | 4 | 0 | 9.093191 | 2.273298 | 2.268951 | 2.350889 | 0.040287 | 0.040615 | 0.007812 | 0.03125 |
| 7 | Idaho | 1 | 0 | 2.934099 | 2.934099 | 2.934099 | 2.934099 | 0.039525 | 0.039525 | 0.0 | 0.0 |
| 8 | Illinois | 1 | 0 | 2.518243 | 2.518243 | 2.518243 | 2.518243 | 0.039759 | 0.039759 | 0.0 | 0.0 |
| 9 | Indiana | 3 | 0 | 8.360735 | 2.786912 | 2.724178 | 3.154888 | 0.047063 | 0.061037 | 0.001302 | 0.003906 |


|   | doc_id | state | source_path | effective_date | end_date | duration | end_date_source | validity_status | validity_evidence | evidence_short | error |
|---|--------|-------|-------------|----------------|----------|----------|-----------------|-----------------|-------------------|----------------|-------|
| 0 | Alabama/Alabama_1 | Alabama | d:\NLP_Project_tasks_6_8_11\OCR_output\Alabama... | | | | | uncertain | [{"text": "Agreement Between The State of Alab... | Agreement Between The State of Alabama, Alabam... | |
| 1 | Alabama/Alabama_10 | Alabama | d:\NLP_Project_tasks_6_8_11\OCR_output\Alabama... | 2007-11-16 | 2032-01-01 | three years | | found | [{"text": "Southeastern United States - Canadi... | Southeastern United States - Canadian Province... | |
| 2 | Alabama/Alabama_2 | Alabama | d:\NLP_Project_tasks_6_8_11\OCR_output\Alabama... | 2007-11-01 | | | | found | [{"text": "It is intended that the SEUS-Canadi... | It is intended that the SEUS-Canadian Province... | |
| 3 | Alabama/Alabama_3 | Alabama | d:\NLP_Project_tasks_6_8_11\OCR_output\Alabama... | | | three years | | found | [{"text": "This Memorandum of Intent shall ent... | This Memorandum of Intent shall enter into eff... | |
| 4 | Alabama/Alabama_4 | Alabama | d:\NLP_Project_tasks_6_8_11\OCR_output\Alabama... | 1994-01-01 | 1994-06-01 | | | found | [{"text": "MEMORANDUM OF UNDERSTANDING AND COO... | MEMORANDUM OF UNDERSTANDING AND COOPERATION BE... | |
