<a href="https://colab.research.google.com/github/elijahManPerson/Flappy-Bird/blob/master/X_Mechanical_Criteria_Pipeline_GPT_2510__Corrected_only.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data access and library setup

Mounting the Google Drive to access files and save them.

# Step 1: Mount Google Drive
## Purpose:
To access and manipulate files stored in your Google Drive from the Colab environment.

## What each part does

* drive.mount('/content/drive') starts the Google auth flow so Colab can access your Drive.
* root_mydrive is the single place that points at the Drive folder being checked; change it if your working folder lives elsewhere.
* The helper status() prints clear pass or fail messages.
* The read test lists a few entries to confirm you can read.
* The write test creates and deletes a tiny file to confirm you can write.

## Actions:

**Import and Mount:** Uses google.colab.drive to mount the drive.

**Verification:** Checks that the drive is mounted by verifying that the /content/drive/MyDrive directory exists.

## Outcome:
Access to files within Google Drive is established, allowing the script to read from and write to specific directories.
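
The read and write checks described above can also be wrapped in one reusable function. The sketch below is only an illustration: `check_folder` is a hypothetical helper name that does not appear in the notebook, and it is demonstrated on a temporary directory so that it also runs outside Colab.

```python
import os
import tempfile
import uuid

def check_folder(path):
    """Return a dict of booleans describing whether `path` is usable."""
    result = {"exists": os.path.isdir(path), "readable": False, "writable": False}
    if not result["exists"]:
        return result
    try:
        os.listdir(path)                  # read test
        result["readable"] = True
    except OSError:
        pass
    probe = os.path.join(path, f"_probe_{uuid.uuid4().hex}.txt")
    try:
        with open(probe, "w", encoding="utf-8") as f:
            f.write("probe\n")            # write test
        os.remove(probe)                  # clean up the probe file
        result["writable"] = True
    except OSError:
        pass
    return result

# Demonstration on a throwaway directory and on a path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    demo_ok = check_folder(tmp)
demo_missing = check_folder(
    os.path.join(tempfile.gettempdir(), "no_such_folder_" + uuid.uuid4().hex)
)
print(demo_ok)
print(demo_missing)
```

In Colab the same function could be pointed at /content/drive/MyDrive or at a subfolder of it once the mount has succeeded.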

In [None]:
# ===============================================
# Step 1: Mount Google Drive
# ===============================================
# ==== Drive mount + verification ====
from google.colab import drive
drive.mount('/content/drive')  # pass force_remount=True to force a fresh mount and re-prompt

import os, time

def status(msg, ok):
    print(("✅ " if ok else "❌ ") + msg)

root_mount = '/content/drive'
root_mydrive = '/content/drive/MyDrive'

# 1) Basic mount checks
status("Drive mount detected at /content/drive", os.path.ismount(root_mount))
status("MyDrive folder present", os.path.isdir(root_mydrive))

# 2) Read test: list a few entries in MyDrive
try:
    entries = os.listdir(root_mydrive)[:5]
    status("Read test passed (listed MyDrive)", True)
    print("   • Sample entries:", entries if entries else "(empty)")
except Exception as e:
    status(f"Read test failed: {e}", False)

# 3) Write test: create and remove a tiny file
probe_path = os.path.join(root_mydrive, "_colab_mount_check.txt")
try:
    with open(probe_path, "w", encoding="utf-8") as f:
        f.write(f"colab mount check {time.time()}\n")
    status("Write test passed (created file)", True)
    os.remove(probe_path)
    status("Cleanup passed (deleted file)", True)
except Exception as e:
    status(f"Write test failed: {e}", False)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Drive mount detected at /content/drive
✅ MyDrive folder present
✅ Read test passed (listed MyDrive)
   • Sample entries: ['Colab Notebooks', 'Untitled.gdoc', 'Copy of Untitled.gdoc', 'Russelline Doxology.gdoc', 'Application Letter to St. John Bosco.gdoc']
✅ Write test passed (created file)
✅ Cleanup passed (deleted file)


#Step 2.0 Data upload guide (optional)
###Strict loader for files where:
- first column = ID
- last column  = Raw text
- everything else optional

###Instructions for preparing your CSV

Put your unique identifier in the first column. Name it whatever you like, but “ID” is unambiguous.

Put the original writing text in the last column. Name it “Raw text” for readability.

Any other fields can live between first and last. They’ll be preserved, but not required.

Save as CSV with UTF-8 encoding. If you’re unsure, Excel and Google Sheets default exports are fine; our loader tolerates BOM too.
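
Before uploading, the layout rules above can be checked with a few lines of standard-library Python. This is a rough sketch under stated assumptions: `check_layout` is a hypothetical helper, not the notebook's loader, and the sample rows are made up for illustration.

```python
import csv
import io

def check_layout(csv_bytes):
    """Report the first and last header, row count, and any ragged rows.

    Decoding with 'utf-8-sig' strips a leading BOM if one is present.
    """
    text = csv_bytes.decode("utf-8-sig")
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    rows = list(reader)
    return {
        "first_column": header[0],
        "last_column": header[-1],
        "n_rows": len(rows),
        # indices of data rows whose column count differs from the header
        "ragged_rows": [i for i, r in enumerate(rows) if len(r) != len(header)],
    }

# Made-up sample: ID first, Raw text last, one optional column in between.
sample = (
    "ID,Prompt Name,Raw text\r\n"
    "A1,Lost ring,\"Emmy was having breakfast, then the ring vanished.\"\r\n"
    "A2,Submarine,I had always wanted to go on a submarine\r\n"
)
report = check_layout(b"\xef\xbb\xbf" + sample.encode("utf-8"))
print(report)
```

Note that text containing commas must be quoted, as in the first sample row; spreadsheet exports normally do this automatically.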


In [None]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from google.colab import files

# Build blank DataFrames
df_min = pd.DataFrame(columns=["ID", "Raw text"])
df_ext = pd.DataFrame(columns=[
    "ID", "AU", "TS", "CS/PD", "Voc", "Coh", "Pa", "SS", "Pun", "Spell",
    "Total", "WordCount", "yrlev", "Prompt ID", "Prompt Name", "Raw text"
])

# Save to local workspace
df_min.to_csv("/content/texts_template_min.csv", index=False, encoding="utf-8")
df_ext.to_csv("/content/texts_template_extended.csv", index=False, encoding="utf-8")

# Button callbacks
def on_download_min(_):
    files.download("/content/texts_template_min.csv")

def on_download_ext(_):
    files.download("/content/texts_template_extended.csv")

# UI
display(widgets.HTML("<h4>Download a CSV Template</h4>"))
display(widgets.HTML(
    "Use the first column as <b>ID</b> and the last as <b>Raw text</b>.<br>"
    "Everything in between is optional."
))
btn_min = widgets.Button(description="⬇️ Download Minimal Template (ID + Raw text)", button_style="primary")
btn_ext = widgets.Button(description="⬇️ Download Extended Template (Full NAPLAN-style)", button_style="info")

btn_min.on_click(on_download_min)
btn_ext.on_click(on_download_ext)
display(widgets.HBox([btn_min, btn_ext]))




# Step 2 — Load and Preprocess Data

###Purpose
- Bring a CSV file from Google Drive into Python as a Pandas DataFrame, then ensure the key text column is present, consistently named, and safe to use.
- The rest of the pipeline expects a column called "Raw text". This step prepares that column so later steps do not break.

###What you may need to change
- DATA_PATH: set this to the exact path of your CSV inside Google Drive. If you moved the file or renamed folders, update it here.

###Inputs
- A CSV file located at DATA_PATH. The file should contain one column that holds the original student text or source text. It might be called "Raw text", "Raw Text", "raw_text", or something similar.

###Outputs
- df_preprocessed: a Pandas DataFrame that definitely has a column named "Raw text". If your file used a variant name, the code renames it to "Raw text" so the rest of the notebook can rely on one standard.
- Printed checks that confirm: the path is valid, the file loaded, how many rows are present, how many are non-empty in "Raw text", and a small sample of the first few rows.

###Key ideas
- CSVs created on different systems sometimes include a special marker at the start of the file called a BOM. Using UTF-8 with BOM support avoids header glitches.
- Real-world files often vary in how they label the same concept. We accept common header variants for the text column, then rename to "Raw text" so every downstream function can assume one consistent name.
- It is better to stop early if the text column is missing or empty rather than let subtle errors propagate. This step fails fast with a clear message if something essential is wrong.
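
The BOM and header-variant points can be seen in a small standalone example. It is a sketch only: `normalise_header` and the alias set here are illustrative stand-ins, not the notebook's own definitions.

```python
import re

# Aliases are compared after normalisation (whitespace removed, lowercased).
RAW_TEXT_ALIASES = {"rawtext", "raw_text"}

def normalise_header(name):
    """Drop a stray BOM, remove whitespace, and lowercase the header."""
    return re.sub(r"\s+", "", name.replace("\ufeff", "")).lower()

# A tiny CSV that starts with the UTF-8 BOM bytes EF BB BF.
csv_bytes = b"\xef\xbb\xbfID,Raw Text\nA1,hello\n"

# Plain UTF-8 decoding keeps the BOM glued to the first header ...
header_plain = csv_bytes.decode("utf-8").splitlines()[0].split(",")
# ... while 'utf-8-sig' strips it.
header_sig = csv_bytes.decode("utf-8-sig").splitlines()[0].split(",")

print(repr(header_plain[0]))   # first header with the BOM still attached
print(repr(header_sig[0]))     # first header without the BOM
print(normalise_header(header_sig[-1]) in RAW_TEXT_ALIASES)
```

With the plain decode, a lookup for a column called "ID" would fail because the real header starts with an invisible character; this is the header glitch that BOM-aware reading avoids.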

###Actions performed in this code
1) Validate your CSV path.  
2) Load the CSV with safe defaults and BOM handling.  
3) Find and standardise the "Raw text" column (case and spacing tolerant).  
4) Fill missing values and report how many rows are usable.  
5) Preview the first few rows.  

###Verification prints
- "CSV path exists" confirms the notebook can see the file. If you get a cross here, fix DATA_PATH.
- "Data loaded. Rows: X, Columns: Y" confirms the file parsed as a table.
- "Non-empty 'Raw text' rows: A of B" shows how many rows actually contain usable text after trimming blanks.
- A preview of the first five rows lets you eyeball whether the data looks right before moving on.

###Common pitfalls
- Wrong path: the file was moved, renamed, or the folder hierarchy changed. Update DATA_PATH.
- Unusual delimiter or BOM: some CSVs use semicolons or tabs, or include a BOM. The loader handles most cases, but if columns look fused together, specify a delimiter explicitly.
- Header named slightly differently: if your text column label is unexpected, the auto-detection usually finds it. If not, either rename the column in the CSV or add your variant to the accepted names in the code.
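
If columns look fused together, one way to find the right delimiter is to try each candidate and keep the one that gives a consistent column count greater than one. The sketch below illustrates that idea with the standard library; `guess_delimiter` is a hypothetical helper and is not how pandas sniffs delimiters internally.

```python
import csv
import io

def guess_delimiter(text, candidates=(",", ";", "\t", "|"), n_lines=5):
    """Pick the candidate that splits the first lines into the most columns,
    requiring the same column count on every sampled line."""
    best, best_cols = None, 1
    for delim in candidates:
        rows = []
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delim)
        for i, row in enumerate(reader):
            if i >= n_lines:
                break
            rows.append(row)
        widths = {len(r) for r in rows}
        if len(widths) == 1:              # every sampled line agrees
            n_cols = widths.pop()
            if n_cols > best_cols:
                best, best_cols = delim, n_cols
    return best

# Made-up semicolon-separated sample with a comma inside the text column.
sample = (
    "ID;Year;Raw text\n"
    "A1;5;once upon a time, a cat met a robot\n"
    "A2;7;the end\n"
)
print(repr(guess_delimiter(sample)))
```

Once the delimiter is known it can be passed to pandas explicitly, for example `pd.read_csv(DATA_PATH, sep=";", encoding="utf-8-sig")`.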


### NOTE:
####DATA_PATH = "/content/drive/MyDrive/JM/Sandbox/1.Training Data/Data for Testing avg short.csv"
#### RAW_TEXT_ALIASES = {"raw text", "raw_text", "rawtext"}


In [None]:
# ===============================================
# Step 2: Load and Preprocess Data  + optional download
# ===============================================


import os, io, csv
import pandas as pd

def status(msg, ok=True):
    print(("✅ " if ok else "❌ ") + msg)

# ======= EDIT HERE IF NEEDED =======
DATA_PATH = "/content/drive/MyDrive/JM/Sandbox/1.Training Data/Data for Testing avg short.csv"
RAW_TEXT_ALIASES = {"raw text", "raw_text", "rawtext"}
# ===================================

# If your CSV path doesn't exist, create a tiny sample so Run all won't fail
if not os.path.exists(DATA_PATH):
    status(f"File not found: {DATA_PATH}", ok=False)
    os.makedirs(os.path.dirname(DATA_PATH), exist_ok=True)
    sample = pd.DataFrame({
        "ID": ["A1","A2","A3","A4"],
        "Raw text": [
            "once upon a time a cat met a robot",
            "yesterday we visit the museum it were fun",
            "“stop!” he said i will go now",
            "the end"
        ]
    })
    sample.to_csv(DATA_PATH, index=False, encoding="utf-8")
    status("Created a small sample CSV so the pipeline can run.", ok=True)
else:
    status("CSV path exists")

if not os.path.exists(DATA_PATH):
    status(f"File not found: {DATA_PATH}", ok=False)
    raise FileNotFoundError(DATA_PATH)
status("CSV path exists")

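def try_read(path, sep, engine=None):
    """Attempt one read strategy; return a DataFrame, or None on failure."""
    kwargs = dict(encoding="utf-8-sig", on_bad_lines="skip")
    if sep is None:
        # sep=None asks pandas to sniff the delimiter, which requires the
        # python engine; low_memory is not supported by that engine
        kwargs["sep"] = None
        kwargs["engine"] = "python"
    else:
        kwargs["sep"] = sep
        if engine:
            kwargs["engine"] = engine
        if engine in (None, "c"):
            kwargs["low_memory"] = False  # C-engine-only option
    try:
        return pd.read_csv(path, **kwargs)
    except Exception:
        return None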

# Try several parsers; keep the best
candidates = [
    ("auto-sniff", None, "python"),
    ("comma", ",", None),
    ("semicolon", ";", None),
    ("tab", "\t", None),
    ("pipe", "|", None),
]

best = None
best_score = (-1, -1)  # (has_raw_text_like, n_cols)

def score_df(df):
    if df is None or df.empty:
        return (-1, -1)
    cols = [c.strip().lower() for c in df.columns]
    has_raw_like = int(any(c in RAW_TEXT_ALIASES for c in cols) or ("raw" in cols and "text" in cols))
    return (has_raw_like, len(cols))

parsed_by = None
for name, sep, eng in candidates:
    df = try_read(DATA_PATH, sep, eng)
    s = score_df(df)
    if s > best_score:
        best_score, best, parsed_by = s, df, name

if best is None or best.empty:
    status("Failed to read CSV with all strategies", ok=False)
    raise ValueError("Could not parse CSV")

status(f"Parsed using: {parsed_by}. Columns: {len(best.columns)}")

df_preprocessed = best.copy()

# --- Standardise/repair the Raw text column ---
cols_norm = {c: c.strip().lower() for c in df_preprocessed.columns}

raw_col = None
for c, norm in cols_norm.items():
    if norm in RAW_TEXT_ALIASES:
        raw_col = c
        break

# If we didn't find it, handle the split-header case: "Raw" and "text" as separate columns
if raw_col is None and "raw" in cols_norm.values() and "text" in cols_norm.values():
    # Find the actual column names that normalise to 'raw' and 'text'
    raw_name = next(k for k, v in cols_norm.items() if v == "raw")
    text_name = next(k for k, v in cols_norm.items() if v == "text")

    # Merge them into one string column. Fill NaN *before* casting to str so
    # missing cells become "" rather than the literal string "nan".
    raw_part = df_preprocessed[raw_name].fillna("").astype(str).str.strip()
    text_part = df_preprocessed[text_name].fillna("").astype(str).str.strip()

    # Join with a single space; the outer strip removes the separator when
    # one side is empty, so content is never duplicated
    df_preprocessed["Raw text"] = (raw_part + " " + text_part).str.strip()

    status(f"Merged split columns '{raw_name}' + '{text_name}' into 'Raw text'")
else:
    if raw_col is None:
        status("Raw text column not found after parsing", ok=False)
        print("Columns present:", list(df_preprocessed.columns))
        raise KeyError("'Raw text' column is missing")
    if raw_col != "Raw text":
        df_preprocessed.rename(columns={raw_col: "Raw text"}, inplace=True)
        status(f"Renamed '{raw_col}' to 'Raw text'")

# Clean and verify
df_preprocessed["Raw text"] = df_preprocessed["Raw text"].fillna("").astype(str)
total = len(df_preprocessed)
empty = df_preprocessed["Raw text"].str.strip().eq("").sum()
usable = total - empty
status(f"Non-empty 'Raw text' rows: {usable} of {total}")

if usable == 0:
    status("All 'Raw text' entries are empty after cleaning", ok=False)
    raise ValueError("No usable text in 'Raw text'")

# Peek and a quick stat
print("\nFirst 5 rows of 'Raw text':")
print(df_preprocessed[["Raw text"]].head(5))

avg_len = df_preprocessed["Raw text"].str.len().mean()
if pd.isna(avg_len):
    avg_len = 0.0
status(f"Average character length across 'Raw text': {avg_len:.1f}")


✅ CSV path exists
✅ CSV path exists
✅ Parsed using: comma. Columns: 15
✅ Non-empty 'Raw text' rows: 21 of 21

First 5 rows of 'Raw text':
                                            Raw text
0  There once was a girl called lilly she had pet...
1  wrire a narrative story abouta search for some...
2  The Failed Submarine I had always wanted go on...
3  The diamond ring Emmy was just having breakfas...
4  If you are locking for a dimiens go to most di...
✅ Average character length across 'Raw text': 1504.6


In [None]:
# ===============================================
# 2C — Safer 'Raw text' detection/standardization (NEW)
# Run right after Step 2 has created df_preprocessed
# ===============================================
import re
import pandas as pd

assert isinstance(df_preprocessed, pd.DataFrame), "df_preprocessed not defined yet."

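# Ordered by priority: a tuple (not a set) keeps the match deterministic
# when more than one candidate column is present
RAW_TEXT_ALIASES = (
    "raw text", "raw_text", "rawtext", "raw-text",
    "text", "writing", "essay", "response",
)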

def _norm(h):
    return re.sub(r"\s+", "", str(h or "")).strip().lower()

cols = list(df_preprocessed.columns)
norm_map = {_norm(c): c for c in cols}

raw_col = None

# 1) Exact/alias match
for alias in RAW_TEXT_ALIASES:
    if alias in norm_map:
        raw_col = norm_map[alias]
        break

# 2) Handle split columns literally named 'Raw' and 'Text'
if raw_col is None and "raw" in norm_map and "text" in norm_map:
    raw_name  = norm_map["raw"]
    text_name = norm_map["text"]
    # fillna must come before astype(str); otherwise NaN becomes the string "nan"
    df_preprocessed["Raw text"] = (
        df_preprocessed[raw_name].fillna("").astype(str) +
        df_preprocessed[text_name].fillna("").astype(str)
    ).str.strip()
    print(f"✅ Merged '{raw_name}' + '{text_name}' → 'Raw text'")
else:
    # 3) Heuristic: choose the non-ID-like column with the longest average string length
    if raw_col is None:
        non_id_cols = [c for c in cols if not re.search(r"\b(id|identifier|research id)\b", str(c), flags=re.I)]
        if not non_id_cols:
            non_id_cols = cols[-1:]
        avg_len = {c: df_preprocessed[c].astype(str).str.len().mean() for c in non_id_cols}
        raw_col = max(avg_len, key=avg_len.get)
        print(f"⚠️ Heuristic used: '{raw_col}' chosen as 'Raw text' (longest strings).")

    if raw_col != "Raw text":
        df_preprocessed.rename(columns={raw_col: "Raw text"}, inplace=True)
        print(f"✅ Renamed '{raw_col}' → 'Raw text'")

# Final tidy
df_preprocessed["Raw text"] = df_preprocessed["Raw text"].fillna("").astype(str)
usable = (~df_preprocessed["Raw text"].str.strip().eq("")).sum()
print(f"✅ Non-empty 'Raw text' rows: {usable}/{len(df_preprocessed)}")


✅ Non-empty 'Raw text' rows: 21/21


In [None]:
# ============================
# ID helpers — single source of truth
# ============================
import re
import pandas as pd

def _norm_header(h: str) -> str:
    """Lowercase the header and remove the BOM and all whitespace."""
    return re.sub(r"\s+", "", str(h or "").replace("\ufeff","")).lower()

def _normalize_id_series(s: pd.Series) -> pd.Series:
    """Stringify IDs; fix '123.0'→'123'; keep alphanumerics untouched."""
    s = s.astype(str).str.strip().str.replace(r"\.0$", "", regex=True)
    def _fix(x):
        if any(c.isalpha() for c in x):
            return x
        try:
            if "." in x or "e" in x.lower():
                f = float(x)
                if f.is_integer():
                    return str(int(f))
        except Exception:
            pass
        return x
    return s.map(_fix).fillna("").astype(str)

def ensure_canonical_id(df: pd.DataFrame,
                        *,
                        canon="ID",
                        prefer=("Research ID","ResearchID","research id")) -> tuple[pd.DataFrame, str]:
    """
    Return (df_copy_with_ID, source_used)
      • Prefer Research ID-like headers if present
      • If an 'ID' exists already but is NOT the chosen source, preserve it as 'IdeaID'
      • Else synthesize IDs from the index
    """
    if df is None or not isinstance(df, pd.DataFrame) or df.empty:
        raise ValueError("ensure_canonical_id: input df is missing or empty")

    cols = list(df.columns)
    nmap = {_norm_header(c): c for c in cols}

    # 1) Find preferred source
    src = None
    for name in prefer:
        key = _norm_header(name)
        if key in nmap:
            src = nmap[key]
            break

    # 2) If no preferred, consider any existing 'ID'
    if src is None and "id" in nmap:
        src = nmap["id"]

    out = df.copy()

    # 3) Preserve any existing idea-level 'ID'
    had_id = "ID" in out.columns
    if had_id and (src is not None) and (src != "ID"):
        if "IdeaID" not in out.columns:
            out.rename(columns={"ID": "IdeaID"}, inplace=True)
        else:
            # leave both; prefer explicit canonical later
            pass

    # 4) Materialize canonical ID
    if src is not None:
        out["ID"] = _normalize_id_series(out[src])
        used = src
        if used != "ID":
            print(f"ℹ️  Using '{used}' as canonical 'ID'.")
    else:
        out["ID"] = out.index.astype(str)
        used = "synthetic_index"
        print("ℹ️  No Research ID/ID found; created synthetic IDs 0..N-1.")

    # 5) Final tidy + peek
    out["ID"] = _normalize_id_series(out["ID"])
    print("🔎 ID check →", out["ID"].head(5).tolist())

    return out, used




In [None]:
# ===============================================
# Step 2A — Canonicalize IDs (prefer Research ID)
# ===============================================
assert isinstance(df_preprocessed, pd.DataFrame), "df_preprocessed not defined."

df_preprocessed, _ID_SOURCE = ensure_canonical_id(
    df_preprocessed,
    canon="ID",
    prefer=("Research ID","ResearchID","research id")
)

# Put 'ID' first for readability
cols = ["ID"] + [c for c in df_preprocessed.columns if c != "ID"]
df_preprocessed = df_preprocessed[cols]

# Quick preview of how things landed
show_cols = [c for c in ("Research ID","ResearchID","ID","IdeaID") if c in df_preprocessed.columns]
print("Preview of ID-related columns:")
print(df_preprocessed[show_cols].head(10).to_string(index=False))


ℹ️  Using 'Research ID' as canonical 'ID'.
🔎 ID check → ['BBCMHJPT', 'BBKBYNDW', 'BBRWTLYV', 'BCDVWQDF', 'BCXSFTWC']
Preview of ID-related columns:
Research ID       ID  IdeaID
   BBCMHJPT BBCMHJPT    1.78
   BBKBYNDW BBKBYNDW    0.00
   BBRWTLYV BBRWTLYV    3.44
   BCDVWQDF BCDVWQDF    3.00
   BCXSFTWC BCXSFTWC    1.40
   BGRRHYPQ BGRRHYPQ    3.00
   BGZZHTXS BGZZHTXS    3.70
   BHLQHBRW BHLQHBRW    3.70
   BPHVBHZV BPHVBHZV    2.13
   BQTNFJFX BQTNFJFX    0.89


#NOTATION: Step 2.1 — Word and Token Stats

###Purpose
- Add quick length metrics to each row of text so you can sanity check size and plan token budgets.
- Two columns are added to df_preprocessed: WordCount and TokenCount.
- Estimate how much it would cost to send this dataset to the API, using your per-row TokenCount
  and a configurable assumption for output size.

###What you may need to change
- Nothing for most cases. If you target a specific OpenAI model later, we can switch to its matching tokenizer.
- You may need to update the API pricing used for the cost estimate (or could this be updated automatically?).

###Inputs
- df_preprocessed['Raw text'] produced in Step 2.

###Outputs
- df_preprocessed with two new numeric columns:
    - WordCount: count of whitespace-separated words per row
    - TokenCount: approximate token count per row using tiktoken
- Printed checks that show which tokenizer is used, averages, a small distribution summary, and a short preview.
- Printed summary: total input tokens, estimated output tokens, and costs for 5 models.
- Optional: a small widget to download the DataFrame now as CSV or Excel for manual checks and token and word estimates.

###Key ideas
- Word counts are simple readability and length signals.
- Token counts are model dependent. We try o200k_base first and fall back to cl100k_base, which keeps estimates close to how current OpenAI chat models tokenize text.
- API pricing bills both input and output tokens.
- We use your TokenCount as “input tokens” and estimate “output tokens” with a single ratio so you can quickly forecast spend.
- Keep stats light and fast so they scale to larger datasets.
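
The cost forecast described above comes down to simple arithmetic: input tokens are priced at one rate, and output tokens, estimated as a ratio of the input, are priced at another. The sketch below shows that calculation only. `estimate_cost` is a hypothetical helper, and the prices are placeholders chosen for easy arithmetic, not real pricing for any model.

```python
def estimate_cost(input_tokens, output_ratio, usd_per_m_input, usd_per_m_output):
    """Estimate spend from input tokens and an assumed output/input ratio.

    Prices are expressed in USD per one million tokens.
    """
    output_tokens = input_tokens * output_ratio
    input_cost = input_tokens / 1_000_000 * usd_per_m_input
    output_cost = output_tokens / 1_000_000 * usd_per_m_output
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_usd": input_cost + output_cost,
    }

# PLACEHOLDER prices (input, output) per million tokens; replace with
# current published rates before relying on the numbers.
PLACEHOLDER_PRICES = {
    "model-a": (1.00, 4.00),
    "model-b": (0.10, 0.40),
}

total_input = 2_000_000      # e.g. df_preprocessed["TokenCount"].sum()
for model, (p_in, p_out) in PLACEHOLDER_PRICES.items():
    est = estimate_cost(total_input, output_ratio=0.5,
                        usd_per_m_input=p_in, usd_per_m_output=p_out)
    print(f"{model}: ${est['total_usd']:.2f}")
```

A single output ratio is a coarse assumption; if responses have a fixed size regardless of input length, a constant per-row output estimate may forecast spend more accurately.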

###Verification prints
- "Using tokenizer: ..." confirms the encoding choice.
- "Average words per Raw text: ..." and "Average tokens per Raw text (approx.): ..." confirm central tendencies.
- A min, median, max snapshot for both metrics.
- A short preview of the DataFrame with counts.
- Shows total rows, total/avg tokens, the output ratio used, and a cost table for:
  gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-3.5-turbo (Standard tier).

###Common pitfalls
- Running this before Step 2 or without a 'Raw text' column.
- Non-string entries in 'Raw text': Step 2 already coerces to strings, so you should be safe.

###Actions performed in this code
1) Choose a tokenizer and report which one is used.
2) Compute WordCount and TokenCount for each row.
3) Print summary statistics and show a small preview.
4) Offer an optional Download DataFrame button with a format picker.




In [None]:
# ===============================================
# Step 2.1: Word and token stats
# ===============================================

import math
import tiktoken
import pandas as pd
from IPython.display import display, HTML

# Optional download widget support
try:
    import ipywidgets as widgets
    WIDGETS_OK = True
except Exception:
    WIDGETS_OK = False

from google.colab import files

def status(msg, ok=True):
    print(("✅ " if ok else "❌ ") + msg)

# 1) Choose tokenizer
try:
    _ENC = tiktoken.get_encoding("o200k_base")
    enc_name = "o200k_base"
except Exception:
    _ENC = tiktoken.get_encoding("cl100k_base")
    enc_name = "cl100k_base"
status(f"Using tokenizer: {enc_name}")

def count_tokens(text: str) -> int:
    try:
        return len(_ENC.encode(text or ""))
    except Exception:
        return 0

def count_words(text: str) -> int:
    if not isinstance(text, str):
        return 0
    # Simple whitespace split
    return len([w for w in text.split() if w.strip()])

# 2) Compute counts
if "Raw text" not in df_preprocessed.columns:
    status("Missing 'Raw text' column. Run Step 2 first.", ok=False)
    raise KeyError("'Raw text' column is missing")

df_preprocessed["WordCount"] = df_preprocessed["Raw text"].apply(count_words)
df_preprocessed["TokenCount"] = df_preprocessed["Raw text"].apply(count_tokens)

# 3) Summary statistics
avg_words = df_preprocessed["WordCount"].mean()
avg_tokens = df_preprocessed["TokenCount"].mean()
median_words = df_preprocessed["WordCount"].median()
median_tokens = df_preprocessed["TokenCount"].median()
min_words = df_preprocessed["WordCount"].min()
max_words = df_preprocessed["WordCount"].max()
min_tokens = df_preprocessed["TokenCount"].min()
max_tokens = df_preprocessed["TokenCount"].max()

status(f"Average words per Raw text: {avg_words:.1f}")
status(f"Average tokens per Raw text (approx.): {avg_tokens:.1f}")

print("\nQuick distribution check:")
print(f"  Words  → min {min_words}, median {median_words:.0f}, max {max_words}")
print(f"  Tokens → min {min_tokens}, median {median_tokens:.0f}, max {max_tokens}")

print("\nPreview with counts:")
display(df_preprocessed[["Raw text", "WordCount", "TokenCount"]].head(5))

# 4) Optional: Download DataFrame for checks
def _download_df(df, filename="df_preprocessed_counts.csv", file_format="csv"):
    local_path = f"/content/{filename}"
    if file_format.lower() == "csv":
        df.to_csv(local_path, index=False, encoding="utf-8")
    elif file_format.lower() in {"xlsx", "excel"}:
        df.to_excel(local_path, index=False)
    else:
        raise ValueError("Use 'csv' or 'xlsx'")
    files.download(local_path)

if WIDGETS_OK:
    fmt_dd = widgets.Dropdown(
        options=[("CSV", "csv"), ("Excel (.xlsx)", "xlsx")],
        value="csv",
        description="Format:",
        layout=widgets.Layout(width="240px")
    )
    dl_btn = widgets.Button(
        description="Download DataFrame now",
        button_style="primary",
        tooltip="Click to download the DataFrame with WordCount and TokenCount"
    )
    out = widgets.Output()

    def on_click_download(_):
        with out:
            out.clear_output()
            try:
                _download_df(df_preprocessed, filename="df_preprocessed_counts." + fmt_dd.value, file_format=fmt_dd.value)
                print(f"Started download as df_preprocessed_counts.{fmt_dd.value}")
            except Exception as e:
                print("Download failed:", e)

    dl_btn.on_click(on_click_download)
    display(HTML("<b>Do you want to download the DataFrame now for checks?</b>"))
    display(widgets.HBox([fmt_dd, dl_btn]), out)
else:
    print("\nWidgets not available. To download manually, run:")
    print("  df_preprocessed.to_csv('/content/df_preprocessed_counts.csv', index=False, encoding='utf-8')")
    print("  from google.colab import files; files.download('/content/df_preprocessed_counts.csv')")

✅ Using tokenizer: o200k_base
✅ Average words per Raw text: 284.1
✅ Average tokens per Raw text (approx.): 347.1

Quick distribution check:
  Words  → min 5, median 243, max 706
  Tokens → min 6, median 275, max 822

Preview with counts:


                                            Raw text  WordCount  TokenCount
0  There once was a girl called lilly she had pet...         52          61
1  wrire a narrative story abouta search for some...         43          53
2  The Failed Submarine I had always wanted go on...        494         578
3  The diamond ring Emmy was just having breakfas...        243         275
4  If you are locking for a dimiens go to most di...         57          71



In [None]:
# ===== Normalizer upgrade (place once, e.g., after Step 2) =====
import re

def normalize_mojibake(s: str) -> str:
    """Lightweight repairs for common mojibake + whitespace cleanup."""
    if s is None:
        return ""
    s = str(s)

    # Classic UTF-8-as-Win1252 sequences
    fixes = {
        "â€”": "—",   # em dash
        "â€“": "–",   # en dash
        "â€˜": "‘", "â€™": "’",  # single quotes
        "â€œ": "“", "â€\x9d": "”", "â€\x9c": "“",  # double quotes variants
        "â€¦": "…",   # ellipsis
        "â€¢": "•",   # bullet
        "â€": "”",    # stray
    }
    for bad, good in fixes.items():
        s = s.replace(bad, good)

    # Collapse weird spaces and trim
    s = re.sub(r"[ \t\u00A0\u2007\u202F]+", " ", s).strip()
    return s
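
To see where these sequences come from, it helps to reproduce the damage: mojibake of this kind appears when UTF-8 bytes are decoded as Windows-1252. The sketch below shows that round trip and a reverse-decoding repair. It is an alternative technique to the lookup table in normalize_mojibake, not a replacement for it, and `make_mojibake` and `undo_mojibake` are hypothetical names.

```python
def make_mojibake(s):
    """Simulate the damage: UTF-8 bytes misread as Windows-1252."""
    return s.encode("utf-8").decode("cp1252")

def undo_mojibake(s):
    """Reverse the misread; return the input unchanged if it does not
    round-trip cleanly (i.e. it was probably not mojibake)."""
    try:
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

original = "it’s done…"
broken = make_mojibake(original)
repaired = undo_mojibake(broken)

print(broken)
print(repaired)
print(undo_mojibake("café"))   # valid text that is not mojibake stays as-is
```

The reverse decode only works when every damaged character maps back to a Windows-1252 byte. A few byte values (0x9D among them) are undefined in Python's cp1252 codec, so some characters, such as the right double quote, cannot be round-tripped this way; a lookup table can still handle those cases explicitly.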


# Step 3: Install and Import Required Libraries

###Purpose
- Ensure the exact Python libraries and language models you need are installed and working.
- Verify imports and show clear version numbers so you can troubleshoot quickly.

###What you may need to change
- Library versions, if you want to pin different ones for compatibility.
- You can remove the openai uninstall if you are sure the environment is clean.

###Inputs
- Internet access in the Colab runtime to download packages and the spaCy model.

###Outputs
- Installed packages: openai, tqdm, nltk, tiktoken, spacy, pandas, ipywidgets, openpyxl, jsonschema.
- Downloaded spaCy model: en_core_web_sm.
- Verified imports with printed version numbers.
- NLTK tokenizers available: punkt and punkt_tab.

###Key ideas
- Pin versions when you want reproducibility. The defaults below are stable and pair well with Colab.
- spaCy needs a separate model download. We fetch en_core_web_sm and then test a tiny parse.
- NLTK recently split tokenizers, so we check both punkt and punkt_tab.

###Verification prints
- Package versions for openai, pandas, spacy, nltk, tiktoken, tqdm.
- Confirmation that spaCy model loads and can process a sentence.
- Confirmation that NLTK tokenizers are present.
- A tiny tiktoken encode test to confirm the tokenizer is usable.

###Common pitfalls
- Conflicting preinstalled openai versions. We uninstall first to avoid API mismatch.
- Missing spaCy model. Installing the library is not enough, you must download a model.
- NLTK data not present. We fetch tokenizers to avoid runtime errors later.

###Actions performed in this code
1) Uninstall any preinstalled openai to avoid conflicts.
2) Install required libraries, pinning openai to 1.x and pandas to 2.2.2.
3) Download the spaCy English model en_core_web_sm.
4) Import libraries and print versions.
5) Verify spaCy model load.
6) Verify NLTK tokenizers punkt and punkt_tab.
7) Verify tiktoken by encoding a short string.
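
Version checks can also be done from package metadata rather than from each module's `__version__` attribute, which not every package defines. The following is a minimal sketch of that approach, assuming Python 3.8 or later; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Report on the distributions this step expects; missing ones are flagged
# instead of raising, so the whole list is always printed.
for name in ("openai", "pandas", "spacy", "nltk", "tiktoken", "tqdm"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Because this reads metadata without importing the package, it reports what is installed on disk; after a reinstall, a runtime restart is still needed before already-imported modules pick up the new version.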


In [None]:
# -----------------------
#3.1 — Install / Upgrade (run once)
# -----------------------
# Uninstall preinstalled openai versions to avoid conflicts, then install v1 + dependencies
!pip -q uninstall -y openai
!pip -q install --upgrade "openai==1.*" tqdm nltk tiktoken spacy pandas==2.2.2 ipywidgets openpyxl jsonschema

# Download spaCy small model
!python -m spacy download en_core_web_sm -q



print("⬇️ Install step finished. IMPORTANT: Restart the runtime (Runtime > Restart runtime) and then run the verification cell.")


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to rest

In [None]:
# -----------------------
#3.2 — Imports, version checks, and tokenizer / model verification
# -----------------------
import importlib, sys, os, logging
from getpass import getpass

# Helper to report versions safely
def v(name):
    try:
        m = importlib.import_module(name)
        return getattr(m, "__version__", "unknown")
    except Exception as e:
        return f"import failed: {e}"

modules = ["openai","pandas","spacy","nltk","tiktoken","tqdm","ipywidgets","openpyxl","jsonschema"]
print("Versions:")
for name in modules:
    print(f"  {name}: {v(name)}")

# spaCy load check
import spacy
try:
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("A tiny sanity check.")
    print("✅ spaCy loaded and tokenized sample:", [t.text for t in doc])
except Exception as e:
    print("❌ spaCy failed to load:", e)
    raise

# NLTK tokenizers
import nltk
NLTK_DIR = "/content/nltk_data"
os.makedirs(NLTK_DIR, exist_ok=True)
if NLTK_DIR not in nltk.data.path:
    nltk.data.path.insert(0, NLTK_DIR)

try:
    nltk.data.find("tokenizers/punkt")
    print("✅ NLTK 'punkt' tokenizer is present.")
except LookupError:
    print("⬇️ Downloading NLTK 'punkt' tokenizer...")
    nltk.download("punkt", download_dir=NLTK_DIR, quiet=False)
    try:
        nltk.data.find("tokenizers/punkt")
        print("✅ 'punkt' downloaded.")
    except LookupError:
        print("❌ Failed to download 'punkt'.")

# tiktoken check
import tiktoken
try:
    enc = None
    try:
        enc = tiktoken.get_encoding("o200k_base")
        enc_name = "o200k_base"
    except Exception:
        enc = tiktoken.get_encoding("cl100k_base")
        enc_name = "cl100k_base"
    n_tokens = len(enc.encode("Tokenization sanity check."))
    print(f"✅ tiktoken ready with {enc_name}. Sample tokens: {n_tokens}")
except Exception as e:
    print("❌ tiktoken failed:", e)
    raise




Versions:
  openai: 1.109.1
  pandas: 2.2.2
  spacy: 3.8.7
  nltk: 3.9.2
  tiktoken: 0.12.0
  tqdm: 4.67.1
  ipywidgets: 7.7.1
  openpyxl: 3.1.5
  jsonschema: 4.25.1




✅ spaCy loaded and tokenized sample: ['A', 'tiny', 'sanity', 'check', '.']
✅ NLTK 'punkt' tokenizer is present.
✅ tiktoken ready with o200k_base. Sample tokens: 5


# Step 4.1: Optional Preempt (install common extras up front)
###Purpose
- Pre-emptively install or verify the common libraries and language data that later steps are likely to need,
  so downstream steps don't break mid-run.

###What you may need to change
- Set WANT_SPACY_MD to True if you want the larger 'en_core_web_md' spaCy model.
- Adjust the version pins in the ensure_pkg(...) calls if you prefer different versions. A pin is only used
  when the package is missing; a package that already imports is left at its installed version.

###Installs/Verifies
- Libraries: ipywidgets, openpyxl (Excel export), matplotlib (plots), chardet (encoding detection),
  pyarrow (faster IO). xlsxwriter is an optional alternative Excel engine.
- NLTK data: punkt, punkt_tab (if available), stopwords, wordnet, omw-1.4.
- spaCy model: ensures 'en_core_web_sm' is present; optionally downloads 'en_core_web_md'.

###Outputs
- Clear prints of what was installed or already present, plus quick sanity checks.
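
Since the install helper in this step only installs a package when the import fails, an older preinstalled version can slip past a pin. The sketch below shows one way to compare an installed version against a minimum. It assumes simple dotted numeric versions, and the helper names `version_tuple`, `installed_version`, and `meets_minimum` are hypothetical, not part of this notebook.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '3.10.0' into (3, 10, 0); a non-numeric suffix such as 'rc1' is ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(installed, minimum):
    """True when `installed` is present and numerically >= `minimum`."""
    return installed is not None and version_tuple(installed) >= version_tuple(minimum)


# Numeric comparison matters: as plain strings, "3.10.0" sorts before "3.8.0".
print(meets_minimum("3.10.0", "3.8.0"))
print(meets_minimum("7.7.1", "8.1.1"))
```

A check like this could run before deciding whether to call pip, so that a stale preinstalled package is upgraded rather than silently accepted.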

In [None]:


# ----------------- toggles -----------------
WANT_SPACY_MD = False   # True to also download/load en_core_web_md
# ------------------------------------------

import sys, os, shutil, subprocess, importlib

# ---------- helpers ----------
def _pip_install(spec):
    print(f"⬇️  Installing {spec} ...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", spec])

def ensure_pkg(mod_name, spec=None):
    try:
        m = importlib.import_module(mod_name)
        print(f"✅ {mod_name} already available")
    except Exception:
        _pip_install(spec or mod_name)
        m = importlib.import_module(mod_name)
        print(f"✅ {mod_name} installed")
    return m

def print_version(mod_name):
    try:
        m = importlib.import_module(mod_name)
        v = getattr(m, "__version__", "unknown")
        print(f"   • {mod_name} {v}")
    except Exception as e:
        print(f"   • {mod_name} version check failed: {e}")

# ---------- core convenience libraries ----------
ipywidgets = ensure_pkg("ipywidgets", "ipywidgets==8.1.1")
openpyxl   = ensure_pkg("openpyxl",   "openpyxl>=3.1.2")
matplotlib = ensure_pkg("matplotlib", "matplotlib>=3.8.0")
chardet    = ensure_pkg("chardet",    "chardet>=5.2.0")
pyarrow    = ensure_pkg("pyarrow",    "pyarrow>=16.1.0")
# Optional alternative Excel writer:
# xlsxwriter = ensure_pkg("xlsxwriter", "XlsxWriter>=3.2.0")

print("\nVersions:")
for name in ["ipywidgets", "openpyxl", "matplotlib", "chardet", "pyarrow"]:
    print_version(name)

# ---------- NLTK: one directory, robust downloads ----------
nltk = ensure_pkg("nltk", "nltk>=3.8.1")

NLTK_DIR = "/content/nltk_data"
os.makedirs(NLTK_DIR, exist_ok=True)
os.environ["NLTK_DATA"] = NLTK_DIR
# put our folder at the front of the search path
if NLTK_DIR in nltk.data.path:
    nltk.data.path.remove(NLTK_DIR)
nltk.data.path.insert(0, NLTK_DIR)

print("\nNLTK search paths (in order):")
for p in nltk.data.path:
    print("  -", p)

def ensure_nltk_resource(resource, kind="corpora", retries=2, clean_if_stuck=True):
    """
    Ensure NLTK resource (e.g. 'wordnet') exists under NLTK_DIR/kind.
    Retries and optionally removes a partial folder if verification fails.
    """
    resource_path = f"{kind}/{resource}"
    target_folder = os.path.join(NLTK_DIR, kind, resource)

    def _verified():
        try:
            nltk.data.find(resource_path)
            return True
        except LookupError:
            return False

    if _verified():
        print(f"✅ NLTK {kind} '{resource}' available")
        return

    if clean_if_stuck and os.path.isdir(target_folder):
        print(f"🧹 Removing partial folder: {target_folder}")
        shutil.rmtree(target_folder, ignore_errors=True)

    for attempt in range(1, retries + 1):
        print(f"⬇️  Downloading NLTK {kind} '{resource}' (attempt {attempt}/{retries}) ...")
        ok = nltk.download(resource, download_dir=NLTK_DIR, quiet=False)
        exists_flag = os.path.isdir(target_folder)
        verified = _verified()
        print(f"   ↳ folder exists: {exists_flag}; verified: {verified}; downloader_returned: {ok}")
        if exists_flag and verified:
            print(f"✅ NLTK {kind} '{resource}' ready at {target_folder}")
            return

    # Last resort: show directory state and fail
    parent = os.path.join(NLTK_DIR, kind)
    print(f"❌ Could not verify NLTK {kind} '{resource}' after {retries} attempts.")
    print("   Contents of", parent, ":", os.listdir(parent) if os.path.isdir(parent) else "(missing)")
    raise LookupError(f"Failed to ensure NLTK {resource_path}")

# Tokenizers
ensure_nltk_resource("punkt", kind="tokenizers")



✅ ipywidgets already available
✅ openpyxl already available
✅ matplotlib already available
✅ chardet already available
✅ pyarrow already available

Versions:
   • ipywidgets 7.7.1
   • openpyxl 3.1.5
   • matplotlib 3.10.0
   • chardet 5.2.0
   • pyarrow 18.1.0
✅ nltk already available

NLTK search paths (in order):
  - /content/nltk_data
  - /root/nltk_data
  - /usr/nltk_data
  - /usr/share/nltk_data
  - /usr/lib/nltk_data
  - /usr/share/nltk_data
  - /usr/local/share/nltk_data
  - /usr/lib/nltk_data
  - /usr/local/lib/nltk_data
✅ NLTK tokenizers 'punkt' available


# Step 4: Import Required Libraries
###Purpose
- Import the libraries used for data processing, NLP, tokenisation, logging, and progress bars.
- Load the spaCy English model prepared in Step 3.
- Run quick sanity checks so you know everything is working before you proceed.

###What you may need to change
- Nothing in most cases. If you installed a different spaCy model name, update MODEL_NAME below.

###Inputs
- Installed packages from Step 3. SpaCy model en_core_web_sm should already be present.

###Outputs
- Imported modules in memory.
- tqdm progress bars enabled for pandas operations.
- A loaded spaCy pipeline in the variable `nlp`.
- Short verification prints from spaCy, NLTK, and tiktoken.

###Key ideas
- Keep imports in one place so the rest of the notebook can assume they exist.
- Fail early with clear messages if a critical import or model is missing.

###Verification prints
- Confirms tqdm was enabled.
- Confirms spaCy loaded and tokenised a tiny sentence.
- Confirms NLTK tokeniser works.
- Confirms tiktoken can encode a short string.
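
The "fail early" idea above can be sketched as a single check that reports every missing import at once instead of stopping at the first. This is an illustration only: `require_modules` is a hypothetical helper, and the usage line relies on standard-library modules so that it runs anywhere.

```python
import importlib


def require_modules(names):
    """Import every module in `names`; raise one RuntimeError listing all that fail."""
    loaded, missing = {}, []
    for name in names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as exc:
            missing.append(f"{name} ({exc})")
    if missing:
        raise RuntimeError("Missing required modules: " + "; ".join(missing))
    return loaded


# Standard-library names are used here purely for demonstration.
mods = require_modules(["json", "re"])
print(sorted(mods))
```

In this notebook the list would presumably hold names such as pandas, spacy, nltk, and tiktoken, so a broken environment is reported in one message before any processing starts.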

In [None]:
# ===============================================
# Step 4: Import Required Libraries
# ===============================================

# ================================
import os
import re
import json
import time
import string
import random
import threading
import logging
import difflib
from functools import lru_cache

# Set up logging for helpful debug output
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Ensure deterministic-ish behaviour for debugging
random.seed(1)

# Pandas + tqdm
import pandas as pd
try:
    # prefer notebook tqdm if available
    from tqdm.notebook import tqdm
except Exception:
    from tqdm import tqdm
tqdm.pandas()
logging.info("✅ tqdm progress bars enabled for pandas")

# NLTK setup
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
NLTK_DIR = "/content/nltk_data" if os.path.exists("/content") else os.path.join(os.getcwd(), "nltk_data")
os.makedirs(NLTK_DIR, exist_ok=True)
if NLTK_DIR not in nltk.data.path:
    nltk.data.path.insert(0, NLTK_DIR)

# Download 'punkt' if missing, but keep output quiet unless it fails
try:
    nltk.data.find("tokenizers/punkt")
    logging.info("✅ NLTK 'punkt' tokenizer present")
except LookupError:
    logging.info("⬇️ Downloading NLTK 'punkt' tokenizer...")
    nltk.download("punkt", download_dir=NLTK_DIR, quiet=True)
    try:
        nltk.data.find("tokenizers/punkt")
        logging.info("✅ NLTK 'punkt' downloaded")
    except LookupError:
        raise RuntimeError("NLTK 'punkt' tokenizer download failed. Check network or permissions.")

# spaCy + model load (single load only)
import spacy
MODEL_NAME = "en_core_web_sm"
try:
    nlp = spacy.load(MODEL_NAME)
    _doc = nlp("Quick spaCy check.")
    logging.info("✅ spaCy model loaded (%s). Tokens: %s", MODEL_NAME, [t.text for t in _doc])
except Exception as e:
    logging.error("❌ Could not load spaCy model '%s': %s", MODEL_NAME, e)
    logging.info("Tip: re-run your install cell to fetch the model, then restart the runtime.")
    raise

# tiktoken robust selection
import tiktoken
def get_token_encoder(preferred=("o200k_base", "cl100k_base")):
    names = []
    try:
        names = tiktoken.list_encoding_names()
    except Exception:
        # older tiktoken versions may not expose list_encoding_names
        pass
    for enc in preferred:
        try:
            if names and enc not in names:
                continue
            return tiktoken.get_encoding(enc)
        except Exception:
            continue
    # final fallback to a known encoding name if available
    try:
        return tiktoken.get_encoding("cl100k_base")
    except Exception as e:
        logging.error("❌ tiktoken encoders unavailable: %s", e)
        raise

ENC = get_token_encoder()
_enc_name = getattr(ENC, "name", "encoding")  # tiktoken Encoding objects expose .name, not __name__
_sample_len = len(ENC.encode("Tiny tiktoken check."))
logging.info("✅ tiktoken ready (%s). Sample length: %d", _enc_name, _sample_len)

print("🎉 Step 4 imports and model load verified.")



🎉 Step 4 imports and model load verified.


# Step 5 — Configure Logging

###Purpose
- Capture info, warnings, and errors to both the Colab console and a log file for later debugging.
- Adjust logging settings from a small UI: filenames, console/file levels, rotation size/backups.
- Make log levels easy to change for noisy vs quiet runs.
- Apply the config, run a tiny self-test, optionally preview the log tail, and download the log.


###What you can change via UI
- LOG_FILE: filename of the rotating log
- CONSOLE_LEVEL: how noisy the notebook output is
- FILE_LEVEL: how detailed the on-disk log is
- ROTATE_MAX_MB: size per log file before rotation
- ROTATE_BACKUPS: how many rotated files to keep

###Outputs
- Reconfigured root logger with a StreamHandler (console) and RotatingFileHandler (file)
- Self-test entries written to the log
- Optional log tail preview

###Key ideas
- Set the root logger to the most permissive level (DEBUG) so that each handler can apply its own filter.
- Handlers have their own levels, so the console can be quieter than the file.
- Rotating logs prevent a single giant file.

###Verification prints
- A quick self test logs INFO, WARNING, ERROR, and an example exception.
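
The interaction between the logger level and the handler levels can be seen in a small standalone sketch. It uses in-memory buffers in place of the console and the log file, and a separate named logger (`logging_levels_demo`, a made-up name) so that it does not disturb the notebook's root logger.

```python
import io
import logging


def build_demo_logger(console_level, file_level):
    """Return (logger, console_buffer, file_buffer) with two independently filtered handlers."""
    logger = logging.getLogger("logging_levels_demo")  # hypothetical name
    logger.setLevel(logging.DEBUG)   # let everything through; handlers decide what to keep
    logger.propagate = False         # keep demo records away from the root logger
    for handler in list(logger.handlers):  # avoid duplicate handlers on re-run
        logger.removeHandler(handler)

    console_buf, file_buf = io.StringIO(), io.StringIO()
    for buf, level in ((console_buf, console_level), (file_buf, file_level)):
        handler = logging.StreamHandler(buf)
        handler.setLevel(level)
        handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
        logger.addHandler(handler)
    return logger, console_buf, file_buf


demo_logger, console_buf, file_buf = build_demo_logger(logging.WARNING, logging.DEBUG)
demo_logger.debug("detail for the file only")
demo_logger.warning("shown in both places")
print("console:", console_buf.getvalue().splitlines())
print("file:   ", file_buf.getvalue().splitlines())
```

If the logger itself were set to WARNING, the DEBUG record would be dropped before either handler saw it, which is why the logger is kept at DEBUG.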


In [None]:


import os, io, logging, traceback
from logging.handlers import RotatingFileHandler
from IPython.display import display, HTML
try:
    import ipywidgets as widgets
    WIDGETS_OK = True
except Exception:
    WIDGETS_OK = False

# ---- helper: make or update logging based on UI values ----
def configure_logging(log_file: str,
                      console_level: int,
                      file_level: int,
                      rotate_max_mb: int,
                      rotate_backups: int) -> logging.Logger:
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)  # always let handlers filter

    # Remove old handlers (avoids duplicates on re-run)
    for h in list(logger.handlers):
        logger.removeHandler(h)

    fmt = logging.Formatter("%(asctime)s - %(levelname)s - %(name)s - %(message)s")

    # Console
    ch = logging.StreamHandler()
    ch.setLevel(console_level)
    ch.setFormatter(fmt)
    logger.addHandler(ch)

    # Rotating file
    fh = RotatingFileHandler(
        log_file,
        maxBytes=rotate_max_mb * 1024 * 1024,
        backupCount=rotate_backups,
        encoding="utf-8"
    )
    fh.setLevel(file_level)
    fh.setFormatter(fmt)
    logger.addHandler(fh)

    # Chattery libs can be quieted if you like
    logging.getLogger("urllib3").setLevel(logging.WARNING)
    logging.getLogger("tqdm").setLevel(logging.WARNING)

    # Self-test
    logger.info("Logging control panel: INFO test")
    logger.warning("Logging control panel: WARNING test")
    try:
        1/0
    except ZeroDivisionError:
        logger.error("Logging control panel: ERROR test with traceback")
        logger.error(traceback.format_exc())

    return logger

# ---- level maps for dropdowns ----
LEVELS = {
    "DEBUG (most verbose)": logging.DEBUG,
    "INFO (standard)": logging.INFO,
    "WARNING (only important)": logging.WARNING,
    "ERROR (failures only)": logging.ERROR,
    "CRITICAL": logging.CRITICAL
}

# ---- defaults (you can change here if you want different initial values) ----
DEFAULT_LOG_FILE = "text_correction.log"
DEFAULT_CONSOLE = "INFO (standard)"
DEFAULT_FILE = "DEBUG (most verbose)"
DEFAULT_ROTATE_MB = 5
DEFAULT_BACKUPS = 3

if not WIDGETS_OK:
    print("ipywidgets not available. Install it first or run the preempt cell. "
          "Meanwhile, you can configure logging programmatically by calling configure_logging(...).")
else:
    # ---- UI controls ----
    log_file_text = widgets.Text(
        value=DEFAULT_LOG_FILE,
        description="LOG_FILE:",
        layout=widgets.Layout(width="420px")
    )
    console_level_dd = widgets.Dropdown(
        options=list(LEVELS.keys()),
        value=DEFAULT_CONSOLE,
        description="Console:",
        layout=widgets.Layout(width="420px")
    )
    file_level_dd = widgets.Dropdown(
        options=list(LEVELS.keys()),
        value=DEFAULT_FILE,
        description="File:",
        layout=widgets.Layout(width="420px")
    )
    rotate_mb_slider = widgets.IntSlider(
        value=DEFAULT_ROTATE_MB, min=1, max=50, step=1,
        description="Rotate MB:",
        readout=True, continuous_update=False
    )
    backups_slider = widgets.IntSlider(
        value=DEFAULT_BACKUPS, min=0, max=20, step=1,
        description="Backups:",
        readout=True, continuous_update=False
    )

    apply_btn = widgets.Button(
        description="Apply logging settings",
        button_style="primary",
        tooltip="Configure handlers and run a quick self-test"
    )
    show_tail_btn = widgets.Button(
        description="Show log tail",
        tooltip="Display the last lines of the current log file"
    )
    download_btn = widgets.Button(
        description="Download log",
        tooltip="Download the current log file"
    )
    out = widgets.Output()

    # ---- callbacks ----
    def on_apply_clicked(_):
        with out:
            out.clear_output()
            log_file = log_file_text.value.strip() or DEFAULT_LOG_FILE
            console_level = LEVELS[console_level_dd.value]
            file_level = LEVELS[file_level_dd.value]
            rotate_mb = int(rotate_mb_slider.value)
            backups = int(backups_slider.value)

            logger = configure_logging(
                log_file=log_file,
                console_level=console_level,
                file_level=file_level,
                rotate_max_mb=rotate_mb,
                rotate_backups=backups
            )
            print("✅ Applied logging settings")
            print(f"   LOG_FILE: {log_file}")
            print(f"   CONSOLE_LEVEL: {console_level_dd.value}")
            print(f"   FILE_LEVEL: {file_level_dd.value}")
            print(f"   ROTATE_MAX_MB: {rotate_mb}")
            print(f"   ROTATE_BACKUPS: {backups}")
            print("\nWrote self-test lines. Use 'Show log tail' to preview.")

    def on_show_tail_clicked(_):
        with out:
            log_file = log_file_text.value.strip() or DEFAULT_LOG_FILE
            out.clear_output()
            if not os.path.exists(log_file):
                print(f"❌ Log file not found yet: {log_file}")
                print("   Click 'Apply logging settings' first.")
                return
            try:
                # read last ~100 lines
                with open(log_file, "r", encoding="utf-8", errors="replace") as f:
                    lines = f.readlines()[-100:]
                print(f"--- tail of {log_file} (last {len(lines)} lines) ---")
                for line in lines:
                    print(line.rstrip())
            except Exception as e:
                print("❌ Failed to read log:", e)

    def on_download_clicked(_):
        from google.colab import files
        with out:
            out.clear_output()
            log_file = log_file_text.value.strip() or DEFAULT_LOG_FILE
            if not os.path.exists(log_file):
                print(f"❌ Log file not found: {log_file}")
                print("   Click 'Apply logging settings' first.")
                return
            try:
                files.download(log_file)
                print(f"Started download: {log_file}")
            except Exception as e:
                print("❌ Download failed:", e)

    apply_btn.on_click(on_apply_clicked)
    show_tail_btn.on_click(on_show_tail_clicked)
    download_btn.on_click(on_download_clicked)

    # ---- layout ----
    display(HTML("<h4>Logging Control Panel</h4>"))
    display(
        widgets.VBox([
            log_file_text,
            widgets.HBox([console_level_dd, file_level_dd]),
            widgets.HBox([rotate_mb_slider, backups_slider]),
            widgets.HBox([apply_btn, show_tail_btn, download_btn]),
            out
        ])
    )




# Step 6: Securely Prompt for OpenAI API Key and test OpenAI API key

###Purpose
- Obtain the OpenAI API key securely (without exposing it in code cells).
- Verify that the key works by attempting a harmless API call (listing models).

###What you may need to change
- Nothing. Just run this cell; it will prompt you for the key if it’s not already in memory.

###Inputs
- User-entered API key (via getpass prompt).

###Outputs
- Environment variable OPENAI_API_KEY set for this session.
- Verification message confirming that the key works, or an error if it doesn’t.

###Key ideas
- Never hard-code your API key into notebooks.
- The key is stored only in the temporary runtime environment variable, not in the notebook file.
- Verification uses a lightweight call (model listing).

###Verification prints
- “✅ OpenAI API key is valid.” if the call succeeds.
- A clear “❌ Invalid or failed verification” message if not.
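
The key-handling ideas above can be sketched without any network call. The sketch assumes the environment is passed in as a mapping and the prompt as a callable, so it can be exercised with stand-ins. `resolve_api_key` and `mask_secret` are hypothetical helpers, and the key shown is a dummy value.

```python
def mask_secret(value, visible=4):
    """Return a display-safe form of a secret: fixed-width mask plus the last few characters."""
    if not value:
        return "(not set)"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * 8 + value[-visible:]


def resolve_api_key(env, prompt_fn, name="OPENAI_API_KEY"):
    """Reuse the key already in `env`; otherwise ask once via `prompt_fn` and store the answer."""
    key = (env.get(name) or "").strip()
    if not key:
        key = (prompt_fn() or "").strip()
        if key:
            env[name] = key  # lives only in the session environment, never in the notebook file
    return key or None


# Stand-ins for os.environ and getpass, with a dummy key.
fake_env = {}
key = resolve_api_key(fake_env, lambda: "sk-dummy-key-1234")
print("key in use:", mask_secret(key))
```

In the notebook, `env` would be `os.environ` and `prompt_fn` a `getpass` call. Printing only the masked form helps keep the key out of saved cell outputs.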

In [None]:
import importlib
try:
    import openai
    print("openai version:", getattr(openai, "__version__", "unknown"))
except Exception as e:
    print("openai import failed:", e)



openai version: 1.109.1


In [None]:
# ===============================================
# Step 6 — Secure OpenAI key + global toggle (REPLACE)
# ===============================================
USE_OPENAI = True  # ← set to False for dry-run/mock mode (no paid calls)

import os
from getpass import getpass
from openai import OpenAI

def _ensure_openai_key():
    """Prompt once (hidden) and store only in env for this session."""
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    print("Enter your OpenAI API key (input hidden). Press Enter to skip real calls.")
    key = getpass("API key: ").strip()
    if key:
        os.environ["OPENAI_API_KEY"] = key
        return key
    return None

OPENAI_CLIENT = None
if USE_OPENAI:
    if _ensure_openai_key():
        try:
            OPENAI_CLIENT = OpenAI()                # reads key from env
            _ = OPENAI_CLIENT.models.list().data[:1]  # light probe
            print("✅ OpenAI client ready.")
        except Exception as e:
            OPENAI_CLIENT = None
            USE_OPENAI = False
            print("❌ OpenAI init failed, falling back to mock mode:", type(e).__name__, e)
    else:
        USE_OPENAI = False
        print("ℹ️ No key provided — using mock mode.")
else:
    print("🔒 USE_OPENAI is False — using mock mode (offline).")






Enter your OpenAI API key (input hidden). Press Enter to skip real calls.
API key: ··········
✅ OpenAI client ready.




In [None]:
# ===== Step 6.1 clean-up (run once right now) =====
import os
for k in list(os.environ.keys()):
    if k.upper() in {"OPENAI_API_KEY", "OPENAI_ORG_ID"}:
        os.environ.pop(k, None)
print("✅ Cleared any OPENAI_* env vars from this session. Use Step 6 getpass() to re-auth.")


✅ Cleared any OPENAI_* env vars from this session. Use Step 6 getpass() to re-auth.




# Step 6.2: Ensure NLTK Data is Available


###Purpose
- Confirm that the required NLTK datasets, particularly the 'punkt' tokenizer,
  are available for sentence and word tokenization.
- Automatically download them into a local or Colab-safe directory if missing.

###What you may need to change
- nltk_data_path: set this to a writable directory if running outside Colab
  (e.g., './nltk_data' for local use).

###Inputs
- None (downloads handled internally if needed).

###Outputs
- Verified or newly downloaded 'punkt' tokenizer package.

###Key ideas
- NLTK looks for data in a set of known paths; adding a custom directory avoids permission issues.
- Downloading into /content/nltk_data keeps notebooks portable and clean.
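
The search-path idea can be pictured with a simplified model. This is not NLTK's actual lookup code, only a sketch of "first matching directory wins"; `prepend_unique` and `find_first` are hypothetical names, and temporary directories stand in for real data folders.

```python
import os
import tempfile


def prepend_unique(paths, new_path):
    """Return a new list with `new_path` first and no later duplicate of it."""
    return [new_path] + [p for p in paths if p != new_path]


def find_first(paths, relative):
    """Return the first existing `base/relative` among `paths`, or None."""
    for base in paths:
        candidate = os.path.join(base, relative)
        if os.path.exists(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as custom_dir, tempfile.TemporaryDirectory() as system_dir:
    # Only the "system" directory holds the resource in this demo.
    os.makedirs(os.path.join(system_dir, "tokenizers", "punkt"))
    search = prepend_unique([system_dir, custom_dir], custom_dir)
    found = find_first(search, os.path.join("tokenizers", "punkt"))
    print("custom dir searched first:", search[0] == custom_dir)
    print("resource found in system dir:", found is not None and found.startswith(system_dir))
```

Putting the writable directory first means a fresh download there takes precedence over any older copy further down the list.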

In [None]:

# ===============================================
# Step 6.2: Ensure NLTK Data is Available
# ===============================================
import nltk, os

# Specify a safe directory for NLTK data (works in Colab or local)
nltk_data_path = "/content/nltk_data" if os.path.exists("/content") else "./nltk_data"

# Create the directory if it doesn't exist
os.makedirs(nltk_data_path, exist_ok=True)

# Ensure our path is part of NLTK’s search list
if nltk_data_path not in nltk.data.path:
    nltk.data.path.append(nltk_data_path)

# Try locating the punkt tokenizer
try:
    nltk.data.find("tokenizers/punkt")
    print("✅ NLTK 'punkt' tokenizer is already available.")
except LookupError:
    print("⚙️ Downloading NLTK 'punkt' tokenizer...")
    nltk.download("punkt", download_dir=nltk_data_path)
    try:
        nltk.data.find("tokenizers/punkt")
        print("✅ 'punkt' tokenizer downloaded successfully.")
    except LookupError:
        print("❌ Failed to download 'punkt' tokenizer. Check network or permissions.")

# Optional: table-based 'punkt_tab' data, which newer NLTK releases look up instead of 'punkt'
try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    try:
        # nltk.download returns False on failure instead of raising
        if nltk.download("punkt_tab", download_dir=nltk_data_path):
            print("✅ 'punkt_tab' also downloaded.")
    except Exception:
        pass



✅ NLTK 'punkt' tokenizer is already available.


# Step 7: Define Utility Functions for ChatGPT API Interaction
## Purpose:
To create reusable functions that handle interactions with the OpenAI API, including making requests with retry mechanisms to handle potential API errors.

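## Input: Function parameters (things you pass in)

- prompt: the text you want the model to read and reply to. Required.

- model="gpt-4o": which model to call if you don't give one. For cost and usage limits, see https://platform.openai.com/settings/organization/limits

- max_retries=5: how many times to attempt the call before giving up when the API fails temporarily.

- backoff_factor=2.0: controls how long to wait between retries. The wait is backoff_factor ** attempt seconds plus up to one second of random jitter, capped at 60 seconds.

- temperature=0.0: controls randomness. Zero aims for consistent, repeatable answers.

- max_tokens=None: the maximum number of tokens the model may output. If left as None, it is derived from the prompt length when a token_encoder is supplied (between 256 and 4096); otherwise it defaults to 1024.

- client=None: an existing OpenAI client. A new one is created from the environment if omitted.

- token_encoder=None: a tiktoken encoder (for example ENC from Step 4) used to count prompt tokens.

- model_context_limit=131072: the context window assumed when deriving max_tokens.

- request_timeout=30: per-request timeout in seconds, so a stalled call does not hang the notebook.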

## Output: Local variables inside the function

attempt: which attempt you are on in the retry loop (starting at 1).

resp: the raw object the API returns.

content: the actual text reply you extract from the response.

usage: token usage information (helps with billing and diagnostics).

e: the caught exception when something goes wrong.

wait: how many seconds the code sleeps before retrying.

## Actions:

* Define the call_chatgpt_v1 function:
  Parameters: accepts prompt, model, max_retries, backoff_factor, temperature, max_tokens, client, token_encoder, model_context_limit, and request_timeout.
* Functionality: Calls the OpenAI chat completions API, retrying with exponential backoff after transient failures such as rate limits or timeouts.
* Error Handling: Catches specific OpenAI errors and retries the request after waiting for a calculated duration.
* Returns: A (content, usage) tuple, or (None, {}) if all retries fail.
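
The retry schedule described above can be tried in isolation, without calling any API. The sketch mirrors the same delay formula (factor raised to the attempt number, plus jitter, capped), but `backoff_delay` and `retry` are hypothetical helpers. Built-in exception types stand in for the OpenAI error classes, and the sleep function is injectable so that nothing actually waits.

```python
import random
import time


def backoff_delay(attempt, backoff_factor=2.0, cap=60.0, jitter=1.0, rng=random.random):
    """Seconds to wait after a failed attempt: factor ** attempt plus jitter, capped at `cap`."""
    return min(cap, backoff_factor ** attempt + jitter * rng())


def retry(fn, max_retries=5, retry_on=(ConnectionError, TimeoutError),
          sleep=time.sleep, **delay_kwargs):
    """Call `fn` until it succeeds or `max_retries` attempts are used; return None on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                break  # no point waiting after the final attempt
            sleep(backoff_delay(attempt, **delay_kwargs))
    return None


# Demo: a function that fails twice, then succeeds. Waits are recorded, not slept.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

waits = []
result = retry(flaky, sleep=waits.append, jitter=0.0)
print(result, waits)
```

The jitter term spreads out retries from parallel workers so they do not all hit the API at the same instant. The cap keeps late attempts from waiting for minutes.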

In [None]:
# ===============================================
# Step 7: Define Utility Functions for ChatGPT API Interaction
# ===============================================
# ---- v1-safe, timeout + retries, correct exception classes ----
import time, random, logging
from openai import OpenAI
from openai import (
    APIError, APIConnectionError, RateLimitError, APITimeoutError,
    AuthenticationError
)

logger = logging.getLogger(__name__)

def _token_len(text, encoder):
    try:
        return len(encoder.encode(text or ""))
    except Exception:
        return 0

def call_chatgpt_v1(
    prompt,
    model="gpt-4o",
    max_retries=5,
    backoff_factor=2.0,
    temperature=0.0,
    max_tokens=None,
    client: OpenAI = None,
    token_encoder=None,
    model_context_limit=131072,
    request_timeout=30  # seconds
):
    client = client or OpenAI()
    # per-request timeout (prevents hanging)
    if hasattr(client, "with_options"):
        client = client.with_options(timeout=request_timeout)

    if max_tokens is None and token_encoder is not None:
        prompt_toks = _token_len(prompt, token_encoder)
        max_tokens = max(256, min(4096, model_context_limit - prompt_toks - 1024))
    elif max_tokens is None:
        max_tokens = 1024

    for attempt in range(1, max_retries + 1):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,
                max_tokens=max_tokens,
            )
            content = resp.choices[0].message.content
            usage = getattr(resp, "usage", {}) or {}
            return content, usage

        except AuthenticationError as e:
            # Checked first: AuthenticationError is a subclass of APIError, so it
            # would otherwise be caught by the retry branch below and retried.
            logger.error("Authentication failed: %s", e)
            raise

        except (RateLimitError, APIConnectionError, APITimeoutError, APIError) as e:
            wait = min(60, (backoff_factor ** attempt) + random.uniform(0, 1))
            logger.warning(
                "API transient error (attempt %d/%d): %s. Retrying in %.1fs",
                attempt, max_retries, str(e), wait
            )
            time.sleep(wait)
            continue

        except KeyboardInterrupt:
            logger.error("Interrupted by user. Aborting cleanly.")
            raise

        except Exception as e:
            logger.exception("Unexpected error on attempt %d: %s", attempt, e)
            break

    logger.error("Max retries exceeded. Returning (None, {}).")
    return None, {}



# Step 8: Correct Text by Sentence
## Purpose:
To process raw text by correcting punctuation, grammar, and spelling, and to store the result in a new field. It also maps the corrected text back to the raw text.




## Actions:
* Import all helpers and loggers.
* Define the correct_and_map function:
  * Parameters: Accepts text (the raw input text).
  * Prompt Creation: Constructs a detailed prompt instructing the model to correct punctuation, grammar, and spelling, and to provide benchmarks for mapping the corrected text to the word-segmented text.
  * API Call: Uses the call_chatgpt_v1 function defined in Step 7 to send the prompt to the OpenAI API.
  * Response Handling: Extracts JSON from the API response and parses it into a Python dictionary.
  * Error Handling: Catches JSON decoding errors and returns an empty dictionary if parsing fails.
* Mock Text for Testing: At the end, a mock version of correct_and_map is created to simulate API behavior for testing purposes.
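
The response-handling and error-handling steps can be sketched as a small parser that tolerates extra text around the JSON object. This is an illustration under assumptions, not the notebook's own implementation: `extract_json` is a hypothetical helper, and it assumes the reply contains at most one top-level JSON object.

```python
import json


def extract_json(reply):
    """Return the JSON object embedded in a model reply as a dict, or {} if none parses."""
    if not reply:
        return {}
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return {}
    # Only a JSON object is useful downstream; anything else counts as a failure.
    return parsed if isinstance(parsed, dict) else {}


sample_reply = 'Here is the result:\n{"corrected": "The cat sat on the mat."}\nLet me know if you need more.'
print(extract_json(sample_reply))
print(extract_json("Sorry, I could not process that."))
```

Returning an empty dictionary rather than raising keeps a batch run going. The caller can then mark the affected row as failed and continue with the next one.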

In [None]:
import pandas as pd

assert isinstance(df_preprocessed, pd.DataFrame), "df_preprocessed is not a DataFrame"
assert "ID" in df_preprocessed.columns, "Missing 'ID' column in df_preprocessed"
assert "Raw text" in df_preprocessed.columns, "Missing 'Raw text' column in df_preprocessed"

print("First IDs:", df_preprocessed["ID"].head().tolist())
print("Columns:", list(df_preprocessed.columns))


First IDs: ['BBCMHJPT', 'BBKBYNDW', 'BBRWTLYV', 'BCDVWQDF', 'BCXSFTWC']
Columns: ['ID', 'Research ID', 'AU', 'TS', 'IdeaID', 'CS/PD', 'Voc', 'Coh', 'Pa', 'SS', 'Pun', 'Spell', 'Total', 'WordCount', 'yrlev', 'Raw text', 'TokenCount']


In [None]:
# ===============================================
# Step 8 — enforce canonical ID at the top
# ===============================================
def run_correct_only(
    df_in: pd.DataFrame,
    text_col="Raw text",
    id_col="ID",
    client=None,
    model="gpt-4o",
    use_mock=False,
    out_col="Corrected text (8)"
) -> pd.DataFrame:
    if text_col not in df_in.columns:
        raise KeyError(f"Missing required column: '{text_col}'")
    df, _src = ensure_canonical_id(df_in, canon=id_col)
    # proceed as before (unchanged body below) …
    corrected, tags_json, dlg_json, sources = [], [], [], []
    for raw in df[text_col].astype(str).tolist():
        c, tags, spans, src = correct_with_tags(raw, client=client, model=model, use_mock=use_mock)
        corrected.append(c)
        tags_json.append(json.dumps(tags, ensure_ascii=False))
        dlg_json.append(json.dumps(spans, ensure_ascii=False))
        sources.append(src)
    df[out_col] = corrected
    df["NarrativeTagsJSON"] = tags_json
    df["DialogueSpansJSON"]  = dlg_json
    df["CorrectedBy"]        = sources
    return df

def run_step8(df_preprocessed: pd.DataFrame,
              raw_col="Raw text",
              id_col="ID",
              client=None,
              model="gpt-4o",
              use_mock=False):
    # Normalize ID once at the very top
    df_pre, id_src = ensure_canonical_id(df_preprocessed, canon=id_col)
    print(f"▶️  Step 8 using ID source: {id_src}")

    # 8A
    df_corr = run_correct_only(
        df_pre, text_col=raw_col, id_col=id_col,
        client=None if use_mock else client, model=model, use_mock=use_mock,
        out_col="Corrected text (8)"
    )
    df_corr["Corrected text (8)"] = df_corr["Corrected text (8)"].map(normalize_mojibake)

    # 8B
    df_map, df_texts = run_mapping_only(df_corr, id_col=id_col, raw_col=raw_col, corr_col="Corrected text (8)")

    # 8R → 8C → 8D → 8E (unchanged body)
    df_map = smart_repair_alignment(df_map, window=6, sim_word=0.60, sim_join=0.72)
    df_map = assign_corr_sentence_ids(df_map)
    df_texts["ID"] = df_texts["ID"].astype(str)
    df_map = mark_title_tokens(df_map, df_texts_with_tags=df_texts)
    df_map = add_sentence_boundary_flags(df_map)
    return df_texts, df_map



In [None]:
# ===============================
# 8A — Correct text + detect tags (LLM or mock) — Doc ID = `ID`, Idea key optional as `IdeaID`
# What this does:
# • Uses the canonical document ID in column `ID` (already set from "Research ID" in Step 2A).
# • Leaves any existing idea-level key in `IdeaID` untouched.
# • For each "Raw text", produces:
#     - "Corrected text (8)"
#     - "NarrativeTagsJSON" (title / temporal / closure with char spans on corrected text)
#     - "DialogueSpansJSON" (quoted speech spans on corrected text)
#     - "CorrectedBy" (mock, model name, 'fallback' when the reply had no usable JSON, or 'error')
# ===============================

import re
import json
import logging
import pandas as pd

logger = logging.getLogger(__name__)

# --- helpers ---

def _normalize_id_series(s: pd.Series) -> pd.Series:
    """Return all IDs as clean strings (no .0, no float drift)."""
    s = s.astype(str).str.replace(r"\.0$", "", regex=True)
    def _fix(x):
        if any(c.isalpha() for c in x):
            return x
        try:
            if "." in x or "e" in x.lower():
                f = float(x)
                if f.is_integer():
                    return str(int(f))
        except Exception:
            pass
        return x
    return s.map(_fix)

# Basic text normalizer (safe to re-declare if Step 2 cell didn't run here)
try:
    normalize_mojibake
except NameError:
    def normalize_mojibake(s: str) -> str:
        if s is None:
            return ""
        s = str(s)
        fixes = {
            "â€”": "—", "â€“": "–",
            "â€˜": "‘", "â€™": "’",
            "â€œ": "“", "â€\x9d": "”", "â€\x9c": "“",
            "â€¦": "…", "â€¢": "•", "â€": "”",
        }
        for bad, good in fixes.items():
            s = s.replace(bad, good)
        return re.sub(r"[ \t\u00A0\u2007\u202F]+", " ", s).strip()

def _extract_first_json_object(txt: str):
    """Find the first {...} JSON object in a string."""
    if not txt:
        return None
    start = txt.find("{")
    if start < 0:
        return None
    depth, in_str, esc = 0, False, False
    for i in range(start, len(txt)):
        ch = txt[i]
        if in_str:
            if esc: esc = False
            elif ch == "\\": esc = True
            elif ch == '"': in_str = False
        else:
            if ch == '"': in_str = True
            elif ch == "{": depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    frag = txt[start:i+1]
                    try:
                        return json.loads(frag)
                    except Exception:
                        return None
    return None

def correct_with_tags(raw: str, client=None, model="gpt-4o", use_mock=False, max_tokens=1500):
    """
    Return: corrected_text, narrative_tags(list), dialogue_spans(list), source_label.
    If use_mock=True or client is None, returns a simple deterministic correction.
    """
    s = normalize_mojibake(str(raw or "")).strip()

    if use_mock or client is None:
        # Mock: capitalise first letter; ensure terminal punctuation
        t = s
        m = re.search(r"[A-Za-z]", t)
        if m:
            i = m.start()
            t = t[:i] + t[i].upper() + t[i+1:]
        if t and not re.search(r"[.!?…]\s*$", t):
            t += "."
        return t, [], [], "mock"

    prompt = f"""
You are a careful copy-editor. Fix punctuation, grammar, and spelling.
Keep meaning and paragraphing. Use standard English punctuation.

Also detect:
- Titles at the very start: type="title"
- Temporal transitions: type="temporal" (e.g., "The next day", dates)
- Closures: type="closure" (e.g., "THE END")
Provide character indices [start,end) on the corrected text.

Also return dialogue spans (quoted direct speech) as {{start,end}}.

Return ONLY JSON:
{{
  "corrected_text": "...",
  "narrative_tags": [{{"type":"title|temporal|closure","text":"...","start":int,"end":int}}, ...],
  "dialogue_spans": [{{"start":int,"end":int}}, ...]
}}

Text:
<<<BEGIN>>>
{s}
<<<END>>>
""".strip()

    try:
        from openai import OpenAI
        _client = client or OpenAI()
        resp = _client.chat.completions.create(
            model=model,
            messages=[{"role":"user","content":prompt}],
            temperature=0.0,
            max_tokens=max_tokens
        )
        out = (resp.choices[0].message.content or "").strip()
        js = _extract_first_json_object(out)
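        if not isinstance(js, dict):
            return s, [], [], "fallback"

        corrected = normalize_mojibake(js.get("corrected_text","") or "").strip()
        if not corrected and s:
            # JSON parsed but carried no corrected text: keep the normalized raw text
            return s, [], [], "fallback"
        N = len(corrected)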

        def _clean_tags(tags):
            clean = []
            for t in tags or []:
                try:
                    tt = str(t.get("type","")).lower()
                    st = max(0, int(t.get("start",0)))
                    en = max(st, int(t.get("end",0)))
                    st = min(st, N); en = min(en, N)
                    if tt in {"title","temporal","closure"} and en > st:
                        clean.append({"type":tt,"text":corrected[st:en],"start":st,"end":en})
                except Exception:
                    pass
            return clean

        def _clean_spans(spans):
            clean = []
            for d in spans or []:
                try:
                    st = max(0, int(d.get("start",0)))
                    en = max(st, int(d.get("end",0)))
                    st = min(st, N); en = min(en, N)
                    if en > st:
                        clean.append({"start":st,"end":en})
                except Exception:
                    pass
            return clean

        return corrected, _clean_tags(js.get("narrative_tags")), _clean_spans(js.get("dialogue_spans")), model
    except Exception as e:
        logger.warning("LLM correction failed. Using normalized raw. %s", e)
        return s, [], [], "error"

def run_correct_only(
    df_in: pd.DataFrame,
    text_col="Raw text",
    id_col="ID",           # canonical DOC ID (from Research ID in Step 2A)
    client=None,
    model="gpt-4o",
    use_mock=False,
    out_col="Corrected text (8)"
) -> pd.DataFrame:
    """
    Apply correction + tagging row by row and return a new dataframe.
    Assumes df_in[id_col] already holds canonical document IDs.
    Preserves IdeaID if present; does not rename ID columns here.
    """
    need = {text_col, id_col}
    miss = need - set(df_in.columns)
    if miss:
        raise KeyError(f"Missing required columns for 8A: {sorted(miss)}")

    df = df_in.copy()
    # Normalise the canonical doc ID column to stable string form
    df[id_col] = _normalize_id_series(df[id_col])

    corrected, tags_json, dlg_json, sources = [], [], [], []

    # Iterate texts deterministically in current order (missing texts become "", not "nan")
    for raw in df[text_col].fillna("").astype(str).tolist():
        c, tags, spans, src = correct_with_tags(
            raw, client=client, model=model, use_mock=use_mock
        )
        corrected.append(c)
        tags_json.append(json.dumps(tags, ensure_ascii=False))
        dlg_json.append(json.dumps(spans, ensure_ascii=False))
        sources.append(src)

    df[out_col] = corrected
    df["NarrativeTagsJSON"] = tags_json
    df["DialogueSpansJSON"]  = dlg_json
    df["CorrectedBy"]        = sources

    # No column shuffling or renaming here—ID/IdeaID remain as passed in.
    return df
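
The JSON columns written by 8A can be read back with `json.loads`. The cell below is a minimal sketch, assuming the column names used above; the one-row frame and the helper name `decode_tags` are invented for illustration and are not part of the pipeline.

```python
import json
import pandas as pd

# Hypothetical one-row frame shaped like the 8A output (values are invented)
df_demo = pd.DataFrame({
    "ID": ["DOC1"],
    "Corrected text (8)": ["The Cave. The next day we left."],
    "NarrativeTagsJSON": [json.dumps([
        {"type": "title", "text": "The Cave", "start": 0, "end": 8},
        {"type": "temporal", "text": "The next day", "start": 10, "end": 22},
    ])],
})

def decode_tags(row):
    """Return (type, text sliced from the corrected text) for each tag in a row."""
    text = row["Corrected text (8)"]
    tags = json.loads(row["NarrativeTagsJSON"])
    # Spans are half-open [start, end) character offsets into the corrected text
    return [(t["type"], text[t["start"]:t["end"]]) for t in tags]

decoded = decode_tags(df_demo.iloc[0])
print(decoded)
```

Slicing the corrected text with the stored offsets, rather than trusting the stored `text` field, is the same check that `_clean_tags` applies before a tag is kept.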



In [None]:
# ===============================
# 8B — Tokenize, align, and build a word-level map
# What this does:
# • Splits raw and corrected into tokens.
# • Rebuilds character offsets for each token.
# • Diffs tokens to label equal, insert, delete, replace.
# • Produces a long table (one row per token alignment).
# Inputs:  df with ["ID","Raw text","Corrected text (8)"]
# Outputs: map_df (long table), texts_out (pass through of the texts)
# ===============================

import re
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

_WORD_RX = re.compile(r"\w", flags=re.UNICODE)

def _split_merged_word(tok: str):
    # Split once when ALLCAPS is followed by lowercase: YAAYwe -> YAAY + we
    if not tok or not tok.isalpha():
        return [tok]
    m = re.match(r"^([A-Z]{2,})([a-z].*)$", tok)
    if m:
        left, right = m.group(1), m.group(2)
        return [left, right]
    return [tok]

def _tokenize_with_split(s: str):
    base = re.findall(r"\w+|[^\w\s]", s or "", flags=re.UNICODE)
    out = []
    for t in base:
        if re.fullmatch(r"\w+", t):
            out.extend(_split_merged_word(t))
        else:
            out.append(t)
    return out

def _rebuild_offsets_with_splitting(text, tokens):
    """Find each token's start and end index in the original string, with safe clamping."""
    text = text or ""
    spans, i, n = [], 0, len(text)
    for tok in tokens:
        if tok == "" or tok is None:
            spans.append((i, i))
            continue
        pos = text.find(tok, i)
        if pos >= 0:
            start, end = pos, pos + len(tok)
        else:
            j = i
            while j < n and text[j].isspace():
                j += 1
            start = j
            end = start + len(tok)
        start = max(0, min(start, n))
        end   = max(start, min(end, n))
        spans.append((start, end))
        i = end
    return spans

def _is_word(tok: str) -> bool:
    return bool(tok) and bool(_WORD_RX.search(tok))

def _canon(tok: str) -> str:
    """Canonical form for diff: uppercase and collapse repeated letters."""
    if tok is None:
        return ""
    u = str(tok).upper()
    return re.sub(r"(.)\1+", r"\1", u)

def build_word_map(raw_text, corr_text):
    raw_text  = str(raw_text or "")
    corr_text = str(corr_text or "")

    raw_tokens  = _tokenize_with_split(raw_text)
    corr_tokens = _tokenize_with_split(corr_text)

    raw_spans  = _rebuild_offsets_with_splitting(raw_text, raw_tokens)
    corr_spans = _rebuild_offsets_with_splitting(corr_text, corr_tokens)

    sm = SequenceMatcher(a=[_canon(t) for t in raw_tokens],
                         b=[_canon(t) for t in corr_tokens],
                         autojunk=False)

    rows = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
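        if tag == "equal":
            # Tokens here match after _canon() (case and repeated letters ignored);
            # equal_ci is True only when raw and corrected tokens are identical as written.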
            for k in range(i2 - i1):
                ri = i1 + k; ci = j1 + k
                r_tok, c_tok = raw_tokens[ri], corr_tokens[ci]
                r_start, r_end = raw_spans[ri]
                c_start, c_end = corr_spans[ci]
                rows.append({
                    "raw_index": ri, "raw_token": r_tok, "raw_start": r_start, "raw_end": r_end,
                    "corr_index": ci, "corr_token": c_tok, "corr_start": c_start, "corr_end": c_end,
                    "op": "equal", "equal_ci": (r_tok == c_tok), "error_type": "Equal"
                })

        elif tag == "replace":
            m = min(i2 - i1, j2 - j1)
            for k in range(m):
                ri = i1 + k; ci = j1 + k
                r_tok, c_tok = raw_tokens[ri], corr_tokens[ci]
                r_start, r_end = raw_spans[ri]
                c_start, c_end = corr_spans[ci]
                err = "Spelling" if (str(r_tok).isalpha() and str(c_tok).isalpha() and _canon(r_tok) == _canon(c_tok)) else "Replacement"
                rows.append({
                    "raw_index": ri, "raw_token": r_tok, "raw_start": r_start, "raw_end": r_end,
                    "corr_index": ci, "corr_token": c_tok, "corr_start": c_start, "corr_end": c_end,
                    "op": "replace", "equal_ci": (_canon(r_tok) == _canon(c_tok)), "error_type": err
                })
            for ri in range(i1 + m, i2):
                r_tok = raw_tokens[ri]; r_start, r_end = raw_spans[ri]
                rows.append({
                    "raw_index": ri, "raw_token": r_tok, "raw_start": r_start, "raw_end": r_end,
                    "corr_index": None, "corr_token": None, "corr_start": None, "corr_end": None,
                    "op": "delete", "equal_ci": False,
                    "error_type": "PunctuationDeletion" if not _is_word(r_tok) else "Deletion"
                })
            for ci in range(j1 + m, j2):
                c_tok = corr_tokens[ci]; c_start, c_end = corr_spans[ci]
                rows.append({
                    "raw_index": None, "raw_token": None, "raw_start": None, "raw_end": None,
                    "corr_index": ci, "corr_token": c_tok, "corr_start": c_start, "corr_end": c_end,
                    "op": "insert", "equal_ci": False,
                    "error_type": "PunctuationInsertion" if not _is_word(c_tok) else "Insertion"
                })

        elif tag == "delete":
            for ri in range(i1, i2):
                r_tok = raw_tokens[ri]; r_start, r_end = raw_spans[ri]
                rows.append({
                    "raw_index": ri, "raw_token": r_tok, "raw_start": r_start, "raw_end": r_end,
                    "corr_index": None, "corr_token": None, "corr_start": None, "corr_end": None,
                    "op": "delete", "equal_ci": False,
                    "error_type": "PunctuationDeletion" if not _is_word(r_tok) else "Deletion"
                })

        elif tag == "insert":
            for ci in range(j1, j2):
                c_tok = corr_tokens[ci]; c_start, c_end = corr_spans[ci]
                rows.append({
                    "raw_index": None, "raw_token": None, "raw_start": None, "raw_end": None,
                    "corr_index": ci, "corr_token": c_tok, "corr_start": c_start, "corr_end": c_end,
                    "op": "insert", "equal_ci": False,
                    "error_type": "PunctuationInsertion" if not _is_word(c_tok) else "Insertion"
                })
    return rows

def run_mapping_only(df_with_corr,
                     id_col="ID",
                     raw_col="Raw text",
                     corr_col="Corrected text (8)"):
    """Build the long alignment table."""
    need = {raw_col, corr_col}
    if not need.issubset(df_with_corr.columns):
        raise KeyError(f"Missing columns: {need - set(df_with_corr.columns)}")
    df = df_with_corr.copy()
    if id_col not in df.columns:
        df[id_col] = pd.RangeIndex(len(df)).astype(str)
    df[id_col] = df[id_col].astype(str)

    all_rows = []
    for order, (rid, raw, cor) in enumerate(zip(
        df[id_col].tolist(),
        df[raw_col].fillna("").astype(str).tolist(),
        df[corr_col].fillna("").astype(str).tolist()
    )):
        rows = build_word_map(raw, cor)
        if not rows:
            rows = [{
                "raw_index": np.nan, "raw_token": None, "raw_start": np.nan, "raw_end": np.nan,
                "corr_index": np.nan, "corr_token": None, "corr_start": np.nan, "corr_end": np.nan,
                "op": "empty", "equal_ci": False, "error_type": "EmptyText"
            }]
        for r in rows:
            rec = {"RowID": rid, "DocOrder": order, **r}
            rec["Changed"] = (r.get("op") != "equal")
            all_rows.append(rec)
    map_df = pd.DataFrame(all_rows)
    texts_out = df.copy()
    return map_df, texts_out
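
The alignment in 8B rests on `difflib.SequenceMatcher.get_opcodes()`. The cell below is a small sketch of that idea on a single sentence. It is not the pipeline's `build_word_map`: the helper `simple_tokens` is a simplified stand-in that skips the ALLCAPS split, and the comparison only ignores case.

```python
import re
from difflib import SequenceMatcher

def simple_tokens(s):
    # Words or single punctuation marks (simplified stand-in for the 8B tokenizer)
    return re.findall(r"\w+|[^\w\s]", s)

raw_tokens = simple_tokens("i went home")
corr_tokens = simple_tokens("I went home.")

# Compare case-insensitively so "i" and "I" align as equal
sm = SequenceMatcher(a=[t.upper() for t in raw_tokens],
                     b=[t.upper() for t in corr_tokens],
                     autojunk=False)
opcodes = sm.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, raw_tokens[i1:i2], "->", corr_tokens[j1:j2])
```

Each opcode covers a slice of raw tokens and a slice of corrected tokens. 8B expands those slices into one row per token, which is why an inserted full stop becomes its own `insert` row.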


In [None]:
# ===============================
# 8R — Smart alignment repair to fix replace+insert anomalies
# What this does:
# • Collapses bad patterns like:
#     "lest" -> "Let" + inserted "."  into a single word replacement where appropriate
#     "im"   -> "I" + "'" + "m" into "I'm"
# • Reduces false "Incorrect Beginning/Ending" flags that come from stray punctuation inserts
# Inputs:  df_map from 8B
# Output:  repaired df_map
# ===============================

import re
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

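def _jw(a, b):
    """Similarity in [0, 1] from difflib's SequenceMatcher ratio (not Jaro-Winkler, despite the name)."""
    return SequenceMatcher(None, str(a), str(b)).ratio()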

def _letters_only(s):
    return re.sub(r"[^A-Za-z]+", "", str(s or ""))

def _is_wordish(tok):
    t = str(tok or "")
    return bool(re.search(r"\w", t)) and not re.fullmatch(r"[\W_]+", t)

def smart_repair_alignment(df_map: pd.DataFrame,
                           window=6,
                           sim_word=0.60,
                           sim_join=0.72) -> pd.DataFrame:
    """
    Two repairs:
      1) punctuation-word swap:
         replace(raw_word -> punct) + insert(correct_word) => single replace(raw_word -> correct_word)
      2) split-to-multiple inserts:
         im -> I ' m  or  tack -> t a k e (rare) => join inserts to "I'm" or "take"
    """
    if df_map.empty:
        return df_map.copy()

    df = df_map.copy()
    if "ID" not in df.columns:
        df["ID"] = df["RowID"].astype(str)

    df["_rowpos"] = np.arange(len(df))
    if "corr_index" in df.columns:
        df["_sort"] = pd.to_numeric(df["corr_index"], errors="coerce").fillna(1e12) + df["_rowpos"]*1e-9
    else:
        df["_sort"] = df["_rowpos"]

    chunks = []
    for gid, g in df.sort_values(["ID","_sort"], kind="mergesort").groupby("ID", sort=False):
        g = g.copy()
        idxs = list(g.index)

        # pass 1: punctuation-word swap
        to_drop = set()
        for k, i in enumerate(idxs):
            if i in to_drop:
                continue
            if g.at[i, "op"] != "replace":
                continue
            r_tok = g.at[i, "raw_token"]
            c_tok = g.at[i, "corr_token"]
            if _is_wordish(r_tok) and not _is_wordish(c_tok):
                for j in idxs[k+1 : k+1+window]:
                    if g.at[j, "op"] != "insert":
                        continue
                    ins_tok = g.at[j, "corr_token"]
                    if not _is_wordish(ins_tok):
                        continue
                    sim = _jw(_letters_only(r_tok).lower(), _letters_only(ins_tok).lower())
                    if sim >= sim_word:
                        g.at[i, "corr_token"] = ins_tok
                        g.at[i, "corr_index"] = g.at[j, "corr_index"]
                        g.at[i, "corr_start"] = g.at[j, "corr_start"]
                        g.at[i, "corr_end"]   = g.at[j, "corr_end"]
                        g.at[i, "error_type"] = "Replacement"
                        to_drop.add(j)
                        break

        if to_drop:
            g = g.drop(index=list(to_drop))
            idxs = list(g.index)

        # pass 2: merge multiple inserts into one replacement when they match the raw word
        to_drop = set()
        for k, i in enumerate(idxs):
            if i in to_drop:
                continue
            if g.at[i, "op"] not in ("replace", "delete"):
                continue

            r_tok = str(g.at[i, "raw_token"] or "")
            if not _is_wordish(r_tok):
                continue

            run = []
            for j in idxs[k+1 : k+1+window]:
                if g.at[j, "op"] != "insert":
                    break
                run.append(j)
            if not run:
                continue

            inserted = [str(g.at[j, "corr_token"] or "") for j in run]
            joined_letters = _letters_only("".join(inserted))
            raw_letters = _letters_only(r_tok)
            if not raw_letters:
                continue

            sim = _jw(raw_letters.lower(), joined_letters.lower())
            if sim >= sim_join:
                display_join = re.sub(r"\s+", "", "".join(inserted))
                g.at[i, "op"] = "replace"
                g.at[i, "corr_token"] = display_join
                g.at[i, "error_type"] = "Replacement"
                g.at[i, "equal_ci"] = (raw_letters.lower() == joined_letters.lower())

                first = run[0]
                last  = run[-1]
                g.at[i, "corr_index"] = g.at[first, "corr_index"]
                g.at[i, "corr_start"] = g.at[first, "corr_start"]
                g.at[i, "corr_end"]   = g.at[last,  "corr_end"]

                to_drop.update(run)

        if to_drop:
            g = g.drop(index=list(to_drop))

        chunks.append(g)

    out = pd.concat(chunks, axis=0).sort_values(["ID","_sort"], kind="mergesort")
    return out.drop(columns=["_rowpos","_sort"], errors="ignore")
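
To get a feel for the `sim_word` and `sim_join` thresholds, it helps to compute the similarity for the word pairs named in the comments above. This is an illustrative sketch only: the helper name `ratio` and the `pairs` dictionary are made up here, and the score is difflib's ratio, 2 × matches / (len(a) + len(b)).

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Same measure as _jw above: difflib's matching-characters ratio
    return SequenceMatcher(None, a, b).ratio()

# Word pair -> threshold it is compared against in smart_repair_alignment
pairs = {
    ("lest", "let"): 0.60,   # pass 1 (sim_word)
    ("tack", "take"): 0.72,  # pass 2 (sim_join)
    ("im", "im"): 0.72,      # pass 2: letters of "I'm", lowercased
}
scores = {pair: ratio(*pair) for pair in pairs}
for pair, threshold in pairs.items():
    print(pair, round(scores[pair], 3), scores[pair] >= threshold)
```

All three pairs clear their thresholds, but "tack" and "take" do so only narrowly (0.75 against 0.72), so raising `sim_join` much further would stop that repair from firing.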


In [None]:
# ===============================
# 8C — Assign sentence IDs on corrected text
# What this does:
# • Walks corrected tokens and decides sentence boundaries.
# • Handles ellipses and quotes carefully.
# • Produces CorrSentenceID per token and a "SentenceRef" later.
# Inputs:  df_map from 8B/8R
# Output:  df_map with CorrSentenceID
# ===============================

import re
import numpy as np
import pandas as pd

ABBREV = {
    "mr.","mrs.","ms.","dr.","prof.","sr.","jr.","st.","vs.","etc.",
    "e.g.","i.e.","cf.","fig.","ex.","no.","approx.","circa.","ca.",
    "dept.","est.","misc.","rev.","jan.","feb.","mar.","apr.","jun.",
    "jul.","aug.","sep.","sept.","oct.","nov.","dec."
}
TERMINALS = {".", "!", "?", "…", "...", "?!", "!?"}
CLOSERS   = {")", "]", "}", "”", "’", "»"}
OPENERS   = {"(", "[", "{", "“", "‘", "«"}
RE_INITIAL       = re.compile(r"^[A-Z]\.$")
RE_INITIAL_PAIR  = re.compile(r"^[A-Z]\.[A-Z]\.$")
RE_NUM_WITH_DOT  = re.compile(r"^\d+\.$")
RE_SECTION_NUM   = re.compile(r"^\d+(?:\.\d+){1,3}$")
RE_DOT_TAIL      = re.compile(r"^\.\d+$")
RE_ELLIPSIS      = re.compile(r"^\.\.\.$")
RE_ALPHA_PAREN   = re.compile(r"^[A-Za-z]\)$")

def _tok(x):
    if pd.isna(x) or x is None: return ""
    return str(x)

def _is_ellipsis_triplet(i, toks):
    return (i+2 < len(toks) and toks[i] == "." and toks[i+1] == "." and toks[i+2] == ".")

def _is_terminal_token(tok: str, prev_tok: str, next_tok: str) -> bool:
    t = tok.strip()
    if not t:
        return False
    if t in {"…","..."} or RE_ELLIPSIS.fullmatch(t):
        return True
    if t in {"?!","!?"}:
        return True
    if t in {"!","?"}:
        return True
    if t == ".":
        p = (prev_tok or "").strip()
        n = (next_tok or "").strip()
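        low_prev = p.lower()
        # The tokenizer emits "Mr" and "." as separate tokens, so also test the re-joined form "mr."
        if low_prev in ABBREV or (low_prev + ".") in ABBREV: return False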
        if RE_INITIAL.fullmatch(p) or RE_INITIAL_PAIR.fullmatch(p): return False
        if RE_SECTION_NUM.fullmatch(p): return False
        if RE_NUM_WITH_DOT.fullmatch(p) and (n and re.match(r"[A-Za-z(“\"'\[]", n)): return False
        if RE_DOT_TAIL.fullmatch(n): return False
        if n.isdigit(): return False
        return True
    if t == ")" and RE_ALPHA_PAREN.fullmatch(prev_tok):
        return False
    return False

def _likely_ascii_opening(prev_tok: str, next_tok: str) -> bool:
    prev = (prev_tok or "").strip()
    nxt  = (next_tok or "").strip()
    if prev == "" or prev in TERMINALS or prev in OPENERS:
        return True
    if nxt and nxt not in TERMINALS and nxt not in CLOSERS:
        return True
    return False

def assign_corr_sentence_ids(df_map: pd.DataFrame) -> pd.DataFrame:
    df = df_map.copy()
    if "RowID" in df.columns and "ID" not in df.columns:
        df["ID"] = df["RowID"].astype(str)

    has_ci = "corr_index" in df.columns
    if has_ci and "corr_index_orig" not in df.columns:
        df["corr_index_orig"] = df["corr_index"]

    def _stable_sort_key(g: pd.DataFrame) -> pd.Series:
        pos = pd.Series(np.arange(len(g)), index=g.index, dtype=float)
        if has_ci:
            ci = pd.to_numeric(g["corr_index"], errors="coerce")
            nan_mask = ci.isna()
            bump = (pos - pos.min()) / max((pos.max() - pos.min()), 1) * 1e-6
            return ci.where(~nan_mask, 1e9) + bump
        return pos

    df["_sort_key"] = df.groupby("ID", group_keys=False).apply(_stable_sort_key, include_groups=False)

    def _assign(g: pd.DataFrame) -> pd.Series:
        g = g.sort_values("_sort_key", kind="mergesort")
        toks = (g["corr_token"] if "corr_token" in g.columns else g["raw_token"]).map(_tok).tolist()

        sids = []
        sent_id = 0
        pending_end = False
        i = 0
        while i < len(toks):
            tok = toks[i].strip()
            prev_tok = toks[i-1].strip() if i > 0 else ""
            next_tok = toks[i+1].strip() if i+1 < len(toks) else ""

            if _is_ellipsis_triplet(i, toks):
                # Keep all three dots of "..." in the current sentence, then mark it as ended
                sids.extend([sent_id] * 3)
                pending_end = True
                i += 3
                continue

            if pending_end:
                # Closers and repeated terminals (e.g. "?!" tokenized as "?" and "!") stay in the sentence
                if tok in CLOSERS or tok in TERMINALS or (tok == '"' and not _likely_ascii_opening(prev_tok, next_tok)):
                    sids.append(sent_id); i += 1; continue
                if tok in OPENERS or (tok == '"' and _likely_ascii_opening(prev_tok, next_tok)):
                    sent_id += 1; pending_end = False; sids.append(sent_id); i += 1; continue
                sent_id += 1; pending_end = False; sids.append(sent_id); i += 1; continue
            else:
                sids.append(sent_id); i += 1

            if _is_terminal_token(tok, prev_tok, next_tok):
                pending_end = True

        return pd.Series(sids, index=g.index).reindex(g.index)

    df["CorrSentenceID"] = (
        df.groupby("ID", group_keys=False).apply(_assign, include_groups=False).astype("Int64")
    )
    df.drop(columns=["_sort_key"], inplace=True, errors="ignore")
    return df


In [None]:
# ===============================
# 8D — Mark titles by exact spans and set SentenceRef
# What this does:
# • Uses NarrativeTagsJSON to mark exact title tokens as TITLE=True.
# • Renumbers sentences so:
#     title tokens keep their sentence id (0 for a leading title); all other tokens shift up by 1
#     if no title exists, every sentence id shifts up by 1 so s000 is unused
# • Builds SentenceRef = ID_sNNN
# Inputs:  df_map with CorrSentenceID, df_texts_with_tags with tags JSON
# Output:  updated df_map
# ===============================

import json
import pandas as pd

def _parse_json_list(s):
    try:
        v = json.loads(s) if isinstance(s, str) else (s or [])
        return v if isinstance(v, list) else []
    except Exception:
        return []

def mark_title_tokens(df_map: pd.DataFrame, df_texts_with_tags: pd.DataFrame) -> pd.DataFrame:
    df = df_map.copy()

    if "Sentence Boundaries" not in df.columns:
        df["Sentence Boundaries"] = ""
    if "TITLE" not in df.columns:
        df["TITLE"] = False

    # Collect title spans by ID
    title_spans = {}
    for _id, corr_text, tags_s in zip(df_texts_with_tags["ID"].astype(str),
                                      df_texts_with_tags["Corrected text (8)"].astype(str),
                                      df_texts_with_tags["NarrativeTagsJSON"].astype(str)):
        tags = _parse_json_list(tags_s)
        spans = [(t.get("start", -1), t.get("end", -1))
                 for t in tags if isinstance(t, dict) and t.get("type") == "title"
                 and isinstance(t.get("start"), int) and isinstance(t.get("end"), int)]
        if spans:
            # Only consider titles near the beginning
            spans = [sp for sp in spans if 0 <= sp[0] < max(60, len(corr_text)//3)]
        title_spans[str(_id)] = spans

    # Mark tokens that are fully inside any title span.
    # Groups are iterated explicitly because apply(..., include_groups=False)
    # hides the "ID" column from the applied function.
    for gid, g in df.groupby("ID", sort=False):
        for (s0, s1) in title_spans.get(str(gid), []):
            idx = g.index[(g["corr_start"] >= s0) & (g["corr_end"] <= s1)]
            if len(idx):
                df.loc[idx, "TITLE"] = True
                df.loc[idx, "Sentence Boundaries"] = "Title"

    # Renumber sentence ids: title tokens keep their id (0 for a leading title),
    # every other token moves up by 1. Without a title no token is marked, so all
    # ids shift and s000 stays unused.
    sids = df["CorrSentenceID"].astype("Int64")
    bump_mask = sids.notna() & ~df["TITLE"].astype(bool)
    df["CorrSentenceID"] = sids.where(~bump_mask, sids + 1)
    df = df.reset_index(drop=True)

    # Build SentenceRef = ID_sNNN
    def _sid3(x):
        try:
            return f"{int(x):03d}"
        except Exception:
            return "000"
    df["SentenceRef"] = df["ID"].astype(str) + "_s" + df["CorrSentenceID"].map(_sid3)

    return df
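
The `SentenceRef` key is the document ID, the literal `_s`, and the sentence number zero-padded to three digits. The cell below is a small sketch of that formatting on invented IDs; the frame `demo` and the helper `sid3` are illustrative names that mirror `_sid3` above.

```python
import pandas as pd

# Invented IDs and sentence numbers; the missing value stands for a token without a sentence id
demo = pd.DataFrame({
    "ID": ["DOC1", "DOC1", "DOC2"],
    "CorrSentenceID": pd.array([0, 12, None], dtype="Int64"),
})

def sid3(x):
    # Zero-pad to three digits; missing ids fall back to "000"
    try:
        return f"{int(x):03d}"
    except (TypeError, ValueError):
        return "000"

demo["SentenceRef"] = demo["ID"] + "_s" + demo["CorrSentenceID"].map(sid3)
refs = demo["SentenceRef"].tolist()
print(refs)
```

Because missing ids fall back to `000`, a token without a sentence id gets the same reference as a title sentence, so it is worth checking `CorrSentenceID` for missing values before grouping on `SentenceRef`.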


In [None]:
# ===============================
# 8E — Add explicit boundary flags and checks
# What this does:
# • Marks the first content token in each sentence as "Sentence Beginning".
# • Marks the last terminal token as "Sentence Ending".
# • Adds "BoundaryCheck" labels (Correct/Incorrect/Unknown).
# Inputs:  df_map from 8D
# Output:  updated df_map
# ===============================

import numpy as np
import pandas as pd
import re

TERMINALS_HARD = {".","!","?","…","...","?!","!?"}
OPENING_PUNCT  = {'"', "“", "‘", "«", "(", "[", "{"}

def _first_alpha_case(s: str):
    m = re.search(r"[A-Za-z]", s or "")
    if not m:
        return None
    return s[m.start()].isupper()

def _is_wordish(tok: str) -> bool:
    return bool(tok) and bool(re.search(r"\w", str(tok)))

def add_sentence_boundary_flags(df_map: pd.DataFrame) -> pd.DataFrame:
    df = df_map.copy()
    for col in ("Sentence Boundaries", "BoundaryCheck"):
        if col not in df.columns:
            df[col] = ""

    if "corr_index" not in df.columns:
        df["corr_index"] = np.nan

    # Stable sort within sentence
    df["_rowpos"] = np.arange(len(df))
    df["_sort_corr"] = pd.to_numeric(df["corr_index"], errors="coerce").fillna(1e12) + (df["_rowpos"]*1e-9)
    df = df.sort_values(["ID","CorrSentenceID","_sort_corr"], kind="mergesort")

    def _first_content_row(g: pd.DataFrame):
        ops = g["op"] if "op" in g.columns else pd.Series(["equal"]*len(g), index=g.index)
        def _corr(idx):
            # Missing corrected tokens (delete rows) become "" rather than the string "None"
            v = g.at[idx, "corr_token"]
            return "" if pd.isna(v) else str(v)

        # Prefer first non-insert word (skip opening quotes/brackets)
        for idx in g.index:
            tok = _corr(idx)
            if tok in OPENING_PUNCT:
                continue
            if not _is_wordish(tok):
                continue
            if ops.at[idx] == "insert":
                continue
            return idx
        # Fallback: first wordish token
        for idx in g.index:
            tok = _corr(idx)
            if tok in OPENING_PUNCT:
                continue
            if _is_wordish(tok):
                return idx
        return None

    def _last_terminal_row(g: pd.DataFrame):
        toks = g["corr_token"].astype(str).tolist()
        for pos in range(len(toks)-1, -1, -1):
            if toks[pos] in TERMINALS_HARD:
                return g.index[pos]
        return None

    for (id_, sid), g in df.groupby(["ID","CorrSentenceID"], sort=False):
        # Title sentences already handled; keep label and skip checks
        if "TITLE" in g.columns and g["TITLE"].any():
            df.loc[g.index, "Sentence Boundaries"] = "Title"
            continue

        g = g.sort_values("_sort_corr", kind="mergesort")
        b = _first_content_row(g)
        e = _last_terminal_row(g)

        if b is not None:
            prev = df.at[b, "Sentence Boundaries"]
            if prev.strip() != "Title":
                df.at[b, "Sentence Boundaries"] = prev + (" | " if prev else "") + "Sentence Beginning"
                raw_tok = str(df.at[b, "raw_token"] or "") if "raw_token" in df.columns else ""
                corr_tok = str(df.at[b, "corr_token"] or "")
                ra = _first_alpha_case(raw_tok)
                ca = _first_alpha_case(corr_tok)
                tag = "Unknown Beginning" if (ra is None or ca is None) else ("Correct Beginning" if (ra == ca) else "Incorrect Beginning")
                prev = df.at[b, "BoundaryCheck"]
                df.at[b, "BoundaryCheck"] = prev + (" | " if prev else "") + tag

        if e is not None:
            prev = df.at[e, "Sentence Boundaries"]
            if prev.strip() != "Title":
                df.at[e, "Sentence Boundaries"] = prev + (" | " if prev else "") + "Sentence Ending"
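                ce_tok = str(df.at[e, "corr_token"] or "")
                # Note: row e was selected by _last_terminal_row because its corr_token is in
                # TERMINALS_HARD, so as written this test always yields "Correct Ending".
                tag = "Correct Ending" if (ce_tok in TERMINALS_HARD) else "Incorrect Ending"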
                prev = df.at[e, "BoundaryCheck"]
                df.at[e, "BoundaryCheck"] = prev + (" | " if prev else "") + tag

    # Ensure SentenceRef exists/updated
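    def _sid3(x):
        # Zero-pad the sentence ID to three digits; fall back to "000" when it is not numeric
        try:
            return f"{int(x):03d}"
        except (TypeError, ValueError, OverflowError):
            return "000"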
    df["SentenceRef"] = df["ID"].astype(str) + "_s" + df["CorrSentenceID"].map(_sid3)

    df.drop(columns=["_rowpos","_sort_corr"], inplace=True, errors="ignore")
    return df
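
The beginning check above compares the case of the first alphabetic character in the raw token with that of the corrected token. The sketch below illustrates that rule in isolation. `first_alpha_case` is a hypothetical stand-in for the notebook's `_first_alpha_case` (defined in an earlier cell and not shown here), so its exact return values are an assumption.

```python
from typing import Optional


def first_alpha_case(token: str) -> Optional[str]:
    """Hypothetical stand-in: case of the first alphabetic character, or None."""
    for ch in token:
        if ch.isalpha():
            return "upper" if ch.isupper() else "lower"
    return None


def beginning_tag(raw_token: str, corr_token: str) -> str:
    """Mirror the tagging rule: Unknown if either side has no letter, else compare case."""
    ra = first_alpha_case(raw_token)
    ca = first_alpha_case(corr_token)
    if ra is None or ca is None:
        return "Unknown Beginning"
    return "Correct Beginning" if ra == ca else "Incorrect Beginning"


print(beginning_tag("the", "The"))     # writer did not capitalise
print(beginning_tag('"Once', "Once"))  # leading quote is skipped
print(beginning_tag("", "The"))        # no raw token to compare against
```

Under this rule, a sentence whose first word was inserted by the correction step has no raw letter to compare and is tagged "Unknown Beginning" rather than "Incorrect Beginning".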


In [None]:
# ===============================
# 8Z — Step 8 orchestrator
# What this does:
# • Runs the full Step 8 pipeline in order:
#     correction + tags  -> mapping  -> smart alignment repair
#     -> sentence IDs -> title marking -> boundary flags
# Inputs:  df_preprocessed with ["ID","Raw text"]
# Outputs: df_texts (per row, corrected text + tags), df_map (long alignment table)
# ===============================

def run_step8(df_preprocessed: pd.DataFrame,
              raw_col="Raw text",
              id_col="ID",
              client=None,
              model="gpt-4o",
              use_mock=False):
    # 8A: correction + tags
    df_corr = run_correct_only(
        df_preprocessed,
        text_col=raw_col,
        id_col=id_col,
        client=None if use_mock else client,
        model=model,
        use_mock=use_mock,
        out_col="Corrected text (8)"
    )

    # Hardening against mojibake, just in case
    df_corr["Corrected text (8)"] = df_corr["Corrected text (8)"].map(normalize_mojibake)

    # 8B: token map
    df_map, df_texts = run_mapping_only(
        df_corr, id_col=id_col, raw_col=raw_col, corr_col="Corrected text (8)"
    )

    # 8R: alignment repair to fold split inserts into clean replacements
    df_map = smart_repair_alignment(df_map, window=6, sim_word=0.60, sim_join=0.72)

    # 8C: sentence IDs
    df_map = assign_corr_sentence_ids(df_map)

    # 8D: title tokens and renumbering
    df_texts["ID"] = df_texts["ID"].astype(str)
    df_map = mark_title_tokens(df_map, df_texts_with_tags=df_texts)

    # 8E: boundary flags
    df_map = add_sentence_boundary_flags(df_map)

    return df_texts, df_map


In [None]:
# ===============================
# 9 — Build sentence table from token map
# What this does:
# • Aggregates tokens into sentences (excluding title tokens).
# • Creates corrected and raw sentence strings.
# • Counts edits per sentence and marks begin/end checks.
# • Projects dialogue, temporal, and closure flags onto sentences.
# Inputs:  df_map (from Step 8), df_texts (from Step 8)
# Output:  sent_df (one row per sentence with stats and flags)
# ===============================

import numpy as np
import pandas as pd
import re
import json

def _detok(tokens):
    NO_SPACE_BEFORE = set(list(".,;:!?)]}\"'»”’…"))
    NO_SPACE_AFTER  = set(list("([{\"'«“‘"))
    out = []
    for t in tokens:
        if t is None or (isinstance(t, float) and np.isnan(t)):
            continue
        t = str(t)
        if not out:
            out.append(t); continue
        prev = out[-1]
        if t in NO_SPACE_BEFORE or re.fullmatch(r"[.]{3}", t):
            out[-1] = prev + t
        elif prev in NO_SPACE_AFTER:
            out[-1] = prev + t
        else:
            out.append(" " + t)
    s = "".join(out)
    s = re.sub(r"\.\s*\.\s*\.", "...", s)
    return s.strip()

def _parse_json_list(s):
    try:
        v = json.loads(s) if isinstance(s, str) else (s or [])
        return v if isinstance(v, list) else []
    except Exception:
        return []

def _overlap(a0, a1, b0, b1):
    # True when the two spans share a region of positive length (touching endpoints do not count)
    return min(a1, b1) > max(a0, b0)

def run_step9(df_map: pd.DataFrame, df_texts_with_tags: pd.DataFrame) -> pd.DataFrame:
    # corr_index is required too: it is used as a sort key just below
    need = {"ID","CorrSentenceID","corr_index","corr_token","Sentence Boundaries","BoundaryCheck","SentenceRef","TITLE","corr_start","corr_end"}
    miss = need - set(df_map.columns)
    if miss:
        raise KeyError(f"df_map missing columns needed for Step 9: {miss}")

    wm = df_map.sort_values(["ID","CorrSentenceID","corr_index"], kind="mergesort").copy()

    # Exclude titles from aggregation
    core = wm[~wm["TITLE"].astype(bool)].copy()

    # sentence spans in corrected text
    sent_spans = (
        core.groupby(["ID","CorrSentenceID"], as_index=False, sort=False)
            .agg(CorrStartMin=("corr_start","min"),
                 CorrEndMax=("corr_end","max"))
    )

    # Summarise each sentence
    def _summarize_sentence(g: pd.DataFrame) -> pd.Series:
        corr_tokens = g["corr_token"].tolist()
        raw_tokens  = [x for x in g["raw_token"].tolist() if not pd.isna(x)] if "raw_token" in g.columns else []
        corr_text   = _detok(corr_tokens)
        raw_text    = _detok(raw_tokens) if raw_tokens else ""

        b_rows = g[g["Sentence Boundaries"].str.contains("Sentence Beginning", na=False)]
        e_rows = g[g["Sentence Boundaries"].str.contains("Sentence Ending",   na=False)]

        begin_ok = np.nan
        end_ok   = np.nan
        if not b_rows.empty:
            chk = " | ".join(b_rows["BoundaryCheck"].dropna().astype(str))
            begin_ok = 1 if "Correct Beginning" in chk else (0 if "Incorrect Beginning" in chk else np.nan)
        if not e_rows.empty:
            chk = " | ".join(e_rows["BoundaryCheck"].dropna().astype(str))
            end_ok = 1 if "Correct Ending" in chk else (0 if "Incorrect Ending" in chk else np.nan)

        ops = g["op"] if "op" in g.columns else pd.Series([], dtype=object)
        return pd.Series({
            "SentenceRef": g["SentenceRef"].iloc[0],
            "CorrectedSentence": corr_text,
            "RawSentence": raw_text,
            "TokensInSentence": int(len(g)),
            "EditsInSentence": int((ops != "equal").sum()) if not ops.empty else np.nan,
            "EqualsInSentence": int((ops == "equal").sum()) if not ops.empty else np.nan,
            "Insertions": int((ops == "insert").sum()) if not ops.empty else np.nan,
            "Deletions": int((ops == "delete").sum()) if not ops.empty else np.nan,
            "Replacements": int((ops == "replace").sum()) if not ops.empty else np.nan,
            "BeginBoundaryRow": (b_rows.index[0] if not b_rows.empty else np.nan),
            "EndBoundaryRow":   (e_rows.index[0] if not e_rows.empty else np.nan),
            "CorrectBeginning": begin_ok,
            "CorrectEnding":    end_ok,
        })

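    # Group keys come back as index levels here, so reset_index() restores "ID" and
    # "CorrSentenceID" as columns for the merges below. include_groups needs pandas >= 2.2.
    sent_df = (
        core.groupby(["ID","CorrSentenceID"], sort=False)
            .apply(_summarize_sentence, include_groups=False)
            .reset_index()
            .sort_values(["ID","SentenceRef"], kind="mergesort")
    )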

    # Project dialogue, temporal, closure flags
    df_texts = df_texts_with_tags[["ID","Corrected text (8)","NarrativeTagsJSON","DialogueSpansJSON"]].copy()
    df_texts["ID"] = df_texts["ID"].astype(str)

    sent_df = sent_df.merge(sent_spans, on=["ID","CorrSentenceID"], how="left")
    sent_df = sent_df.merge(df_texts, on="ID", how="left")

    def _flags(row):
        s0, s1 = row["CorrStartMin"], row["CorrEndMax"]

        # dialogue
        has_dialogue = False
        dlg = _parse_json_list(row.get("DialogueSpansJSON","[]"))
        for d in dlg:
            try:
                ds, de = int(d.get("start",-1)), int(d.get("end",-1))
                if ds >= 0 and de >= 0 and _overlap(s0, s1, ds, de):
                    has_dialogue = True
                    break
            except Exception:
                pass

        # temporal and closure
        has_temporal = False
        has_closure  = False
        tags = _parse_json_list(row.get("NarrativeTagsJSON","[]"))
        for t in tags:
            try:
                tt = t.get("type","")
                ts, te = int(t.get("start",-1)), int(t.get("end",-1))
                if ts >= 0 and te >= 0 and _overlap(s0, s1, ts, te):
                    if tt == "temporal":
                        has_temporal = True
                    elif tt == "closure":
                        has_closure = True
            except Exception:
                pass

        return pd.Series({"IsDialogue": has_dialogue, "HasTemporal": has_temporal, "HasClosure": has_closure})

    sent_df = pd.concat([sent_df, sent_df.apply(_flags, axis=1)], axis=1)

    # Clean helper columns
    sent_df = sent_df.drop(
        columns=["CorrStartMin","CorrEndMax","Corrected text (8)","NarrativeTagsJSON","DialogueSpansJSON"],
        errors="ignore",
    )

    return sent_df
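
To make the Step 9 logic easier to follow, here is a minimal sketch on a toy token map: tokens are aggregated into one row per sentence, then a tag span is projected onto sentences by character overlap. All data values, the `spans_overlap` helper, and the `temporal_tags` list are invented for illustration. Named aggregation is used instead of `apply` so the sketch does not depend on a recent pandas version.

```python
import pandas as pd

# Toy token map: one document, two sentences (all values are made up).
toy_map = pd.DataFrame({
    "ID":             ["A", "A", "A", "A", "A"],
    "CorrSentenceID": [1, 1, 1, 2, 2],
    "corr_token":     ["Then", "we", "left", "Bye", "."],
    "op":             ["equal", "replace", "equal", "insert", "equal"],
    "corr_start":     [0, 5, 8, 13, 16],
    "corr_end":       [4, 7, 12, 16, 17],
})

# One row per sentence: token count, edit count, and character span.
sentences = (
    toy_map.groupby(["ID", "CorrSentenceID"], sort=False)
           .agg(Tokens=("corr_token", "size"),
                Edits=("op", lambda s: int((s != "equal").sum())),
                Start=("corr_start", "min"),
                End=("corr_end", "max"))
           .reset_index()
)


def spans_overlap(a0, a1, b0, b1) -> bool:
    """True when the two spans share a region of positive length."""
    return min(a1, b1) > max(a0, b0)


# Hypothetical tag list in the same shape the flags step reads from JSON.
temporal_tags = [{"type": "temporal", "start": 0, "end": 4}]

sentences["HasTemporal"] = [
    any(spans_overlap(s0, s1, t["start"], t["end"]) for t in temporal_tags)
    for s0, s1 in zip(sentences["Start"], sentences["End"])
]
print(sentences)
```

Only the first sentence (characters 0 to 12) overlaps the tag at 0 to 4, so it alone is flagged.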


In [None]:
# ===============================================
# 9.5 — Run Step 8 + Step 9, save, and (optionally) download
#        (ID-hardened + progress bar)
# ===============================================
import os, time, zipfile
from typing import Dict
try:
    from tqdm.auto import tqdm
    _HAVE_TQDM = True
except Exception:
    _HAVE_TQDM = False

try:
    from google.colab import files as _colab_files
    _IN_COLAB = True
except Exception:
    _colab_files = None
    _IN_COLAB = False

def _ensure_dir(path: str) -> str:
    os.makedirs(path, exist_ok=True)
    return path

def _safe_download(path: str):
    if _IN_COLAB and _colab_files:
        try:
            _colab_files.download(path)
        except Exception as e:
            print(f"   ↳ Download hint: {e}")

def save_and_download_step8_9(
    df_preprocessed,
    *,
    raw_col: str = "Raw text",
    id_col: str = "ID",
    client=None,
    model: str = "gpt-4o",
    use_mock: bool = False,
    out_dir: str = "/content/drive/MyDrive/JM/Outputs",
    prefix: str = "step"
) -> Dict[str, object]:
    if df_preprocessed is None or len(df_preprocessed) == 0:
        raise ValueError("df_preprocessed is empty or not defined.")
    if raw_col not in df_preprocessed.columns:
        raise KeyError(f"'{raw_col}' not found in df_preprocessed.")

    # Normalize/guarantee ID once for the whole run
    df_work, id_source = ensure_canonical_id(df_preprocessed, canon=id_col)

    ts = time.strftime("%Y%m%d_%H%M%S")
    _ensure_dir(out_dir)

    bar = tqdm(total=8, desc="Pipeline 8→9", unit="step") if _HAVE_TQDM else None
    def _tick(msg):
        if bar:
            bar.set_postfix_str(msg[:60]); bar.update(1)

    print(f"▶️  Running Step 8 (correction, mapping, boundaries)...  [ID source: {id_source}]")
    df_texts_8, df_map_8 = run_step8(
        df_work, raw_col=raw_col, id_col=id_col,
        client=None if use_mock else client, model=model, use_mock=use_mock
    )
    _tick("Step 8 complete")
    print(f"   ✓ Step 8 complete — rows: {len(df_texts_8):,}, map rows: {len(df_map_8):,}")

    print("▶️  Running Step 9 (sentence table)...")
    sent_df = run_step9(df_map_8, df_texts_with_tags=df_texts_8)
    _tick("Step 9 complete")
    print(f"   ✓ Step 9 complete — sentences: {len(sent_df):,}")

    base = f"{prefix}_8_9_{ts}"
    p_texts = os.path.join(out_dir, f"{base}_texts.csv")
    p_map   = os.path.join(out_dir, f"{base}_wordmap_checked.csv")
    p_sent  = os.path.join(out_dir, f"{base}_sentence_mapping_with_boundaries.csv")
    p_zip   = os.path.join(out_dir, f"{base}.zip")

    print("💾 Saving CSVs...")
    df_texts_8.to_csv(p_texts, index=False, encoding="utf-8"); _tick("Saved texts CSV")
    df_map_8.to_csv(p_map,   index=False, encoding="utf-8");   _tick("Saved wordmap CSV")
    sent_df.to_csv(p_sent,   index=False, encoding="utf-8");   _tick("Saved sentences CSV")
    print("   •", p_texts); print("   •", p_map); print("   •", p_sent)

    print("🗜️  Bundling ZIP...")
    with zipfile.ZipFile(p_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(p_texts, arcname=os.path.basename(p_texts))
        zf.write(p_map,   arcname=os.path.basename(p_map))
        zf.write(p_sent,  arcname=os.path.basename(p_sent))
    _tick("ZIP complete")
    print("   •", p_zip)

    if _IN_COLAB:
        print("⬇️  Initiating downloads (Colab)...")
        for p in (p_texts, p_map, p_sent, p_zip):
            _safe_download(p)
        _tick("Downloads triggered")
    else:
        print("ℹ️  Not running in Colab — files saved to disk only.")
        _tick("Skipped downloads")

    try:
        print("\n— Per document (first 10) —")
        preview = (
            df_map_8.groupby("ID")["op"]
                    .apply(lambda s: (s != "equal").sum())
                    .rename("Edits").reset_index()
                    .sort_values("ID").head(10)
        )
        print(preview.to_string(index=False))
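    except Exception as e:
        # The preview is optional, but report why it was skipped instead of failing silently
        print(f"   ↳ Per-document preview skipped: {e}")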

    try:
        print("\n— First 10 sentences with boundary checks —")
        cols = ["SentenceRef","CorrectedSentence","CorrectBeginning","CorrectEnding",
                "EditsInSentence","IsDialogue","HasTemporal","HasClosure"]
        print(sent_df[cols].head(10).to_string(index=False))
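    except Exception as e:
        # The preview is optional, but report why it was skipped instead of failing silently
        print(f"   ↳ Sentence preview skipped: {e}")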

    if bar:
        bar.set_postfix_str("Done"); bar.update(1); bar.close()

    return dict(
        step8_texts_path=p_texts,
        step8_map_path=p_map,
        step9_sentences_path=p_sent,
        zip_path=p_zip,
        df_texts_8=df_texts_8,
        df_map_8=df_map_8,
        sent_df=sent_df
    )
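
The naming and bundling convention used above (`<prefix>_8_9_<timestamp>_<name>.csv` plus one ZIP per run) can be tried without Drive or the pipeline. The sketch below writes to a temporary directory. `bundle_outputs` and its `tables` argument are hypothetical names for this example only, and the CSV payloads are placeholders.

```python
import os
import tempfile
import time
import zipfile


def bundle_outputs(out_dir: str, prefix: str, tables: dict) -> str:
    """Write each CSV payload under a shared timestamped stem, then zip them."""
    ts = time.strftime("%Y%m%d_%H%M%S")
    base = f"{prefix}_8_9_{ts}"
    os.makedirs(out_dir, exist_ok=True)

    paths = []
    for name, csv_text in tables.items():
        path = os.path.join(out_dir, f"{base}_{name}.csv")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(csv_text)
        paths.append(path)

    zip_path = os.path.join(out_dir, f"{base}.zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # arcname keeps only the file name, so the archive has no directory tree
            zf.write(path, arcname=os.path.basename(path))
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    zip_path = bundle_outputs(
        tmp, "step",
        {"texts": "ID,text\nA,hi\n", "sentences": "ID,n\nA,1\n"},
    )
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())
print(names)
```

Because every file in a run shares the same stem, the retrieval cell can rebuild the sibling paths from any one of them.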


In [None]:
# -- Sanity + Retrieve Outputs (run this) --
import os, glob, pandas as pd
from datetime import datetime

OUT_DIR = "/content/drive/MyDrive/JM/Outputs"
print("OUT_DIR exists:", os.path.isdir(OUT_DIR), "→", OUT_DIR)

# Find the newest run
candidates = sorted(glob.glob(os.path.join(OUT_DIR, "step_8_9_*_texts.csv")))
print("Found runs:", len(candidates))
if candidates:
    latest_texts = candidates[-1]
    # Strip only the trailing suffix; str.replace would also hit earlier occurrences in the path
    stem = latest_texts[:-len("_texts.csv")]
    latest_map  = stem + "_wordmap_checked.csv"
    latest_sent = stem + "_sentence_mapping_with_boundaries.csv"
    latest_zip  = stem + ".zip"

    def _size(p):
        # Report file size in KB, or "missing" if the file cannot be read
        try:
            return f"{os.path.getsize(p)/1024:.1f} KB"
        except OSError:
            return "missing"

    print("\nLatest run:")
    print("  Texts CSV    :", latest_texts, "(", _size(latest_texts), ")")
    print("  Word-map CSV :", latest_map,   "(", _size(latest_map),   ")")
    print("  Sentences CSV:", latest_sent,  "(", _size(latest_sent),  ")")
    print("  ZIP bundle   :", latest_zip,   "(", _size(latest_zip),   ")")

    # Peek contents so you can see it's real
    try:
        df_texts = pd.read_csv(latest_texts, nrows=5)
        print("\nTexts head:")
        display(df_texts)
    except Exception as e:
        print("Couldn't preview texts:", e)

    try:
        df_sent = pd.read_csv(latest_sent, nrows=5)
        print("\nSentences head:")
        display(df_sent[["SentenceRef","CorrectedSentence"]])
    except Exception as e:
        print("Couldn't preview sentences:", e)

    # Re-trigger downloads (one by one; pop-ups may be blocked)
    try:
        from google.colab import files
        print("\nTriggering a single download: ZIP (easiest).")
        files.download(latest_zip)
        # If you want all of them uncomment below:
        # files.download(latest_texts)
        # files.download(latest_map)
        # files.download(latest_sent)
    except Exception as e:
        print("Download hint:", e, "— if pop-ups are blocked, allow them and run again.")
else:
    print("No outputs found in OUT_DIR. Re-run the Patch+Final cell, then rerun this.")


OUT_DIR exists: True → /content/drive/MyDrive/JM/Outputs
Found runs: 0
No outputs found in OUT_DIR. Re-run the Patch+Final cell, then rerun this.
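
The retrieval cell picks the newest run by sorting file names. That works because the `%Y%m%d_%H%M%S` timestamp sorts lexicographically in chronological order, provided every candidate shares the same directory and prefix. A small sketch with invented file names (`newest_run` is a hypothetical helper):

```python
def newest_run(paths):
    """Return the lexicographically greatest path, or None for an empty list."""
    return max(paths) if paths else None


# Invented file names following the step_8_9_<timestamp>_texts.csv pattern.
runs = [
    "step_8_9_20250101_090000_texts.csv",
    "step_8_9_20251231_235959_texts.csv",
    "step_8_9_20250615_120000_texts.csv",
]

latest = newest_run(runs)
# Drop only the trailing suffix to recover the shared stem of the run.
stem = latest[:-len("_texts.csv")]
print(latest)
print(stem + "_wordmap_checked.csv")
```

If runs with different `prefix` values were saved to the same folder, the sort would group by prefix first, so the glob pattern should stay specific to one prefix.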


In [None]:
# === Hotfix: ID Guard (drop this once, above your launcher) ===
import pandas as pd

def _guarantee_id(df: pd.DataFrame) -> pd.DataFrame:
    """Always return a copy with a solid 'ID' column (string), pulled from best available."""
    if df is None or not isinstance(df, pd.DataFrame) or df.empty:
        raise ValueError("ID Guard: input df missing or empty")
    prefer = ["ID", "Research ID", "ResearchID", "IdeaID", "RowID"]
    out = df.copy()
    src = next((c for c in prefer if c in out.columns), None)
    if src is None:
        out["ID"] = out.index.astype(str)
    else:
        out["ID"] = out[src].astype(str)
    # tidy numeric-looking ids like '123.0' → '123'
    out["ID"] = (out["ID"].astype(str).str.strip()
                 .str.replace(r"\.0$", "", regex=True))
    return out

# --- Wrap run_correct_only safely (no recursion) ---
try:
    _orig_run_correct_only = run_correct_only
except NameError:
    _orig_run_correct_only = None

def run_correct_only(
    df_in: pd.DataFrame,
    text_col="Raw text",
    id_col="ID",
    client=None,
    model="gpt-4o",
    use_mock=False,
    out_col="Corrected text (8)"
) -> pd.DataFrame:
    if text_col not in df_in.columns:
        raise KeyError(f"Missing required column: '{text_col}'")
    bas


In [None]:
# --- Fix & harden run_correct_only (kills the 'bas' NameError) ---

import pandas as pd, json, re

def _guarantee_id(df: pd.DataFrame) -> pd.DataFrame:
    prefer = ["ID", "Research ID", "ResearchID", "IdeaID", "RowID"]
    out = df.copy()
    src = next((c for c in prefer if c in out.columns), None)
    if src is None:
        out["ID"] = out.index.astype(str)
    else:
        out["ID"] = out[src].astype(str)
    out["ID"] = out["ID"].astype(str).str.strip().str.replace(r"\.0$", "", regex=True)
    return out

# Keep any original for delegation (avoids recursion)
try:
    _orig_run_correct_only
except NameError:
    _orig_run_correct_only = None

def run_correct_only(
    df_in: pd.DataFrame,
    text_col="Raw text",
    id_col="ID",
    client=None,
    model="gpt-4o",
    use_mock=False,
    out_col="Corrected text (8)"
) -> pd.DataFrame:
    if text_col not in df_in.columns:
        raise KeyError(f"Missing required column: '{text_col}'")

    base = _guarantee_id(df_in)  # ← this was 'bas' before

    # If there’s an earlier implementation, delegate safely
    if _orig_run_correct_only and _orig_run_correct_only is not run_correct_only:
        tmp = base.copy()
        tmp[id_col] = tmp["ID"]
        out = _orig_run_correct_only(
            tmp, text_col=text_col, id_col=id_col,
            client=client, model=model, use_mock=use_mock, out_col=out_col
        )
        return _guarantee_id(out)

    # Otherwise, do the full correction here using your correct_with_tags()
    corrected, tags_json, dlg_json, sources = [], [], [], []
    for raw in base[text_col].astype(str).tolist():
        c, tags, spans, src = correct_with_tags(
            raw, client=client, model=model, use_mock=use_mock
        )
        corrected.append(c)
        tags_json.append(json.dumps(tags, ensure_ascii=False))
        dlg_json.append(json.dumps(spans, ensure_ascii=False))
        sources.append(src)

    out = base.copy()
    out[out_col] = corrected
    out["NarrativeTagsJSON"] = tags_json
    out["DialogueSpansJSON"]  = dlg_json
    out["CorrectedBy"]        = sources
    return _guarantee_id(out)

print("✅ Patched run_correct_only: typo fixed, ID guaranteed, real/Mock path intact.")


✅ Patched run_correct_only: typo fixed, ID guaranteed, real/Mock path intact.
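
The ID guard's behaviour can be checked on a tiny frame. The sketch below is a simplified, hypothetical re-statement of the same idea (`normalise_ids` is not a function from this notebook), with a shortened preference list. It shows the `'123.0'` to `'123'` tidy-up, the index fallback, and one caveat: a missing ID becomes the literal string `"nan"`.

```python
import pandas as pd


def normalise_ids(df: pd.DataFrame, prefer=("ID", "Research ID")) -> pd.DataFrame:
    """Return a copy with a string 'ID' column taken from the first preferred column present."""
    out = df.copy()
    src = next((c for c in prefer if c in out.columns), None)
    # Fall back to the row index when no preferred column exists.
    ids = out[src] if src is not None else pd.Series(out.index, index=out.index)
    out["ID"] = ids.astype(str).str.strip().str.replace(r"\.0$", "", regex=True)
    return out


# Float-typed IDs (as produced by a CSV column containing a blank) plus one missing value.
demo = pd.DataFrame({"Research ID": [101.0, 102.0, None], "Raw text": ["a", "b", "c"]})
print(normalise_ids(demo)["ID"].tolist())

# No ID-like column at all: the index is used instead.
no_ids = pd.DataFrame({"Raw text": ["x", "y"]})
print(normalise_ids(no_ids)["ID"].tolist())
```

If blank IDs are possible in the input, it may be worth checking for `"nan"` after normalisation, since such rows would otherwise be grouped together as one document downstream.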


In [None]:
# === REAL RUN LAUNCHER (gpt-4o) ===
# Requirements:
#  - You already ran your function/defs cells (so save_and_download_step8_9 exists)
#  - You already mounted Drive and loaded df_preprocessed with a "Raw text" column

USE_OPENAI = True
MODEL = "gpt-4o"
OUT_DIR = "/content/drive/MyDrive/JM/Outputs"

import os, time
from getpass import getpass

# 1) Secure key entry (hidden) — NEVER paste keys into cells.
os.environ.pop("OPENAI_API_KEY", None)
print("Enter your OpenAI API key (input hidden):")
os.environ["OPENAI_API_KEY"] = getpass("API key: ").strip()

# 2) Client probe (verifies the key works)
from openai import OpenAI
try:
    OPENAI_CLIENT = OpenAI()
    _ = OPENAI_CLIENT.models.list().data[:1]
    print("✅ OpenAI client ready.")
except Exception as e:
    # Chain the original exception so the full traceback is preserved
    raise RuntimeError(f"OpenAI init failed: {type(e).__name__} - {e}") from e

# 3) Sanity checks for data + functions
import pandas as pd, os

if 'df_preprocessed' not in globals() or not isinstance(df_preprocessed, pd.DataFrame):
    raise RuntimeError("df_preprocessed is not defined. Run your Step 2 Load/Preprocess cell first.")
if "Raw text" not in df_preprocessed.columns:
    raise KeyError("Missing 'Raw text' column in df_preprocessed.")
if 'save_and_download_step8_9' not in globals():
    raise RuntimeError("save_and_download_step8_9 is not defined. Run the big definitions cell first.")

os.makedirs(OUT_DIR, exist_ok=True)

# 4) Fire the pipeline (REAL model, no mock)
print("▶️  Running Step 8→9 with REAL model (this does paid API calls).")
results = save_and_download_step8_9(
    df_preprocessed,
    raw_col="Raw text",
    id_col="ID",            # will prefer 'Research ID' if present (via ensure_canonical_id)
    client=OPENAI_CLIENT,
    model=MODEL,
    use_mock=False,         # <-- real calls
    out_dir=OUT_DIR,
    prefix="step"
)

# 5) Summary of output paths
print("\n🎯 Done. Key output files:")
print("  • Texts CSV:     ", results.get("step8_texts_path"))
print("  • Word-map CSV:  ", results.get("step8_map_path"))
print("  • Sentences CSV: ", results.get("step9_sentences_path"))
print("  • ZIP bundle:    ", results.get("zip_path"))



Enter your OpenAI API key (input hidden):
API key: ··········
✅ OpenAI client ready.
▶️  Running Step 8→9 with REAL model (this does paid API calls).
ℹ️  Using 'Research ID' as canonical 'ID'.
🔎 ID check → ['BBCMHJPT', 'BBKBYNDW', 'BBRWTLYV', 'BCDVWQDF', 'BCXSFTWC']


Pipeline 8→9:   0%|          | 0/8 [00:00<?, ?step/s]

▶️  Running Step 8 (correction, mapping, boundaries)...  [ID source: Research ID]


KeyError: 'ID'


# Below is archived - please ignore


All the code that will introduce the functions