# German Verb Master · Example Migration Notebook

This notebook replaces the previous TypeScript migration script. It fetches candidate rows from the `words` table, detects the language of legacy example sentences, and helps you curate updates before writing them back to PostgreSQL.

👉 **Recommended workflow**

1. Install Python dependencies (see below).
2. Configure the database connection string (read from the `.env` file by default).
3. Run the detection cells, review the output, and adjust the resulting DataFrame.
4. When satisfied, run the final cell to apply the updates to the database.

> Always create a fresh duplicate table (e.g. `words_duplicate_21_10_2025`) before applying changes.
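
The duplicate table recommended above could be created with a statement like the one this sketch builds. The helper names `backup_table_name` and `backup_statement` are hypothetical, and the `DD_MM_YYYY` suffix is inferred from the example name. Note that `CREATE TABLE ... AS TABLE ...` copies rows only, not indexes or constraints.

```python
from datetime import date


def backup_table_name(source: str, on: date) -> str:
    # Assumption: the suffix follows the day_month_year pattern of the example name.
    return f"{source}_duplicate_{on:%d_%m_%Y}"


def backup_statement(source: str, on: date) -> str:
    # Builds the SQL text only; executing it is left to the reader.
    return f"CREATE TABLE {backup_table_name(source, on)} AS TABLE {source}"


print(backup_statement("words", date(2025, 10, 21)))
# CREATE TABLE words_duplicate_21_10_2025 AS TABLE words
```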

## 1. Environment setup

Create/activate a Python environment and install the tooling:

```bash
pip install pandas sqlalchemy python-dotenv langdetect langid tqdm
```

The notebook expects a `DATABASE_URL` environment variable (the same format used by Prisma/Drizzle), e.g.:

```
postgresql://postgres:postgres@localhost:5432/german_verbs
```

Place it in `.env` at the repo root or export it before launching Jupyter.
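
Before launching Jupyter, it can help to confirm that the value has the expected shape. The following is a minimal sketch using only the standard library; `describe_database_url` is a hypothetical helper and not part of the notebook.

```python
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    parts = urlsplit(url)
    # Accept "postgresql" with an optional driver suffix such as "postgresql+psycopg2".
    if parts.scheme.split("+")[0] != "postgresql":
        raise ValueError(f"Unexpected scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_database_url("postgresql://postgres:postgres@localhost:5432/german_verbs")
print(info)
```

The password is deliberately left out of the returned dictionary so that the summary is safe to print.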

In [3]:
# %pip install pandas sqlalchemy python-dotenv langdetect langid tqdm

In [3]:
%load_ext autoreload
%autoreload 2

import json
import math
import os
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import langid
import pandas as pd
from dotenv import load_dotenv
from langdetect import DetectorFactory, detect_langs
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine
from tqdm.auto import tqdm

DetectorFactory.seed = 42
langid.set_languages(['de', 'en'])

load_dotenv()

DATABASE_URL = os.getenv("DATABASE_URL")
if not DATABASE_URL:
    raise RuntimeError("DATABASE_URL is not set. Update your .env or export it before running this notebook.")

engine: Engine = create_engine(DATABASE_URL)
print("Connected to", engine.url.render_as_string(hide_password=True))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Connected to postgresql://postgres:***@db.kagsgjzijfgvtvkylczl.supabase.co:5432/postgres


In [4]:
QUERY = text(
    """
    SELECT id,
           lemma,
           pos,
           example_en,
           examples::text AS examples_json
    FROM words
    WHERE example_en IS NOT NULL
      AND TRIM(example_en) <> ''
    ORDER BY id
    """
)

df = pd.read_sql_query(QUERY, engine)
print(f"Loaded {len(df)} candidate rows with legacy example_en content.")
df.head()

Loaded 3080 candidate rows with legacy example_en content.


|   | id | lemma | pos | example_en | examples_json |
|---|----|-------|-----|------------|---------------|
| 0 | 1 | arbeiten | V | Er hat als Koch gearbeitet. | `[{"sentence": "Sie arbeitete in einer Bank.", ...` |
| 1 | 2 | essen | V | Wir haben zu Mittag gegessen. | `[{"sentence": "Er aß ein Sandwich.", "translat...` |
| 2 | 3 | gehen | V | Wir sind nach Hause gegangen. | `[{"sentence": "Er ging zur Schule.", "translat...` |
| 3 | 4 | haben | V | Er hat viel Geld gehabt. | `[{"sentence": "Ich hatte keine Zeit.", "transl...` |
| 4 | 5 | heißen | V | Sie hat anders geheißen. | `[{"sentence": "Er hieß Peter.", "translations"...` |


In [5]:
@dataclass
class Detection:
    language: str
    confidence: float
    source: str


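def detect_language(text: str) -> Detection:
    """Detect the language of `text`, preferring langdetect and falling back to langid."""
    cleaned = (text or "").strip()
    if not cleaned:
        return Detection(language="und", confidence=0.0, source="none")

    # langid returns an unnormalized log-score, so exp() of it is effectively 0.0
    # for typical sentences; it is recorded in `source` for reference only.
    langid_code, langid_log_prob = langid.classify(cleaned)
    try:
        langid_conf = math.exp(langid_log_prob)
    except OverflowError:
        langid_conf = 0.0

    try:
        # langdetect is not restricted to de/en, so it can return other codes (e.g. "af").
        candidates = detect_langs(cleaned)
        best = max(candidates, key=lambda entry: entry.prob)
        return Detection(
            language=best.lang,
            confidence=float(best.prob),
            source=f"langdetect:{langid_code}:{langid_conf:.6f}:{langid_log_prob:.3f}",
        )
    except Exception:
        # langdetect failed on this text; use langid's result instead.
        return Detection(language=langid_code, confidence=float(langid_conf), source=f"langid:{langid_log_prob:.3f}")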


def parse_examples(raw: Optional[str]) -> List[Dict[str, Any]]:
    if not raw:
        return []
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, list):
            return parsed
    except json.JSONDecodeError:
        pass
    return []


df["examples"] = df["examples_json"].apply(parse_examples)

tqdm.pandas(desc="Detecting languages")
detections = df["example_en"].progress_apply(detect_language)

df["detected_language"] = [d.language for d in detections]
df["language_confidence"] = [d.confidence for d in detections]
df["detection_source"] = [d.source for d in detections]

df.head()


|   | id | lemma | pos | example_en | examples_json | examples | detected_language | language_confidence | detection_source |
|---|----|-------|-----|------------|---------------|----------|-------------------|---------------------|------------------|
| 0 | 1 | arbeiten | V | Er hat als Koch gearbeitet. | `[{"sentence": "Sie arbeitete in einer Bank.", ...` | `[{'sentence': 'Sie arbeitete in einer Bank.', ...` | de | 0.999997 | langdetect:de:-101.985 |
| 1 | 2 | essen | V | Wir haben zu Mittag gegessen. | `[{"sentence": "Er aß ein Sandwich.", "translat...` | `[{'sentence': 'Er aß ein Sandwich.', 'translat...` | de | 0.999997 | langdetect:de:-180.532 |
| 2 | 3 | gehen | V | Wir sind nach Hause gegangen. | `[{"sentence": "Er ging zur Schule.", "translat...` | `[{'sentence': 'Er ging zur Schule.', 'translat...` | de | 0.999997 | langdetect:de:-118.167 |
| 3 | 4 | haben | V | Er hat viel Geld gehabt. | `[{"sentence": "Ich hatte keine Zeit.", "transl...` | `[{'sentence': 'Ich hatte keine Zeit.', 'transl...` | de | 0.999996 | langdetect:de:-52.700 |
| 4 | 5 | heißen | V | Sie hat anders geheißen. | `[{"sentence": "Er hieß Peter.", "translations"...` | `[{'sentence': 'Er hieß Peter.', 'translations'...` | de | 0.999995 | langdetect:de:-92.654 |


In [23]:
suspected_german = df[df["detected_language"] == "de"]
print(f"Potential German example_en sentences: {len(suspected_german)}")
suspected_german[["id", "lemma", "pos", "example_en", "language_confidence", "detection_source"]]

Potential German example_en sentences: 48


|    | id | lemma | pos | example_en | language_confidence | detection_source |
|----|----|-------|-----|------------|---------------------|------------------|
| 0 | 1 | arbeiten | V | Er hat als Koch gearbeitet. | 0.999997 | langdetect:de:-101.985 |
| 1 | 2 | essen | V | Wir haben zu Mittag gegessen. | 0.999997 | langdetect:de:-180.532 |
| 2 | 3 | gehen | V | Wir sind nach Hause gegangen. | 0.999997 | langdetect:de:-118.167 |
| 3 | 4 | haben | V | Er hat viel Geld gehabt. | 0.999996 | langdetect:de:-52.700 |
| 4 | 5 | heißen | V | Sie hat anders geheißen. | 0.999995 | langdetect:de:-92.654 |
| 5 | 6 | hören | V | Ich habe ein Geräusch gehört. | 0.999997 | langdetect:de:-144.378 |
| 6 | 7 | kaufen | V | Er hat ein Auto gekauft. | 0.999998 | langdetect:de:-92.219 |
| 7 | 8 | kommen | V | Der Bus ist pünktlich gekommen. | 0.999996 | langdetect:de:-126.572 |
| 8 | 9 | können | V | Das hat er immer gut gekonnt. | 0.999995 | langdetect:de:-126.957 |
| 10 | 11 | lernen | V | Er hat viel gelernt. | 0.85714 | langdetect:de:-70.770 |


In [17]:
detect_langs('Sie hat lange in Paris gelebt.')

[af:0.7142819547522308, de:0.2857165832059127]

In [25]:
langid.classify('Sie hat lange in Paris gelebt.')

('de', -89.425368309021)
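
The two cells above disagree on the same sentence: `langdetect` prefers `af`, while `langid`, restricted to `de` and `en`, returns `de`. One possible way to reconcile them is to discard `langdetect` candidates outside the allowed set and fall back to the `langid` code when nothing remains. The sketch below works on plain Python values so it runs without either library; the function name `reconcile` and the `ALLOWED` set are assumptions for illustration, not part of the notebook.

```python
from typing import List, Tuple

ALLOWED = {"de", "en"}  # assumption: mirrors langid.set_languages(['de', 'en'])


def reconcile(candidates: List[Tuple[str, float]], langid_code: str) -> Tuple[str, float, str]:
    """Return (language, confidence, source) after filtering to ALLOWED languages."""
    allowed = [(lang, prob) for lang, prob in candidates if lang in ALLOWED]
    if allowed:
        lang, prob = max(allowed, key=lambda item: item[1])
        return lang, prob, "langdetect"
    # No allowed candidate survived: trust langid's code, with no usable confidence.
    return langid_code, 0.0, "langid"


# Candidate values copied from the langdetect cell above.
result = reconcile([("af", 0.7142819547522308), ("de", 0.2857165832059127)], "de")
print(result)
```

The surviving confidence for this sentence is low (about 0.29), so a row like this still deserves manual review.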

In [24]:
langid.classify('Hi, how are you doing today?')

('en', -38.4182243347168)
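
The second element returned by `langid.classify` is an unnormalized log-score, so exponentiating it directly, as `detect_language` does with `math.exp`, yields a value indistinguishable from zero. If scores for every candidate language are available, a softmax over them gives comparable probabilities. The sketch below applies the log-sum-exp shift to illustrative scores: the `en` value is invented, and `normalize_log_scores` is a hypothetical helper. The `langid` README also describes a `norm_probs=True` option on `LanguageIdentifier.from_modelstring` intended to normalize scores inside the library; check your installed version before relying on it.

```python
import math
from typing import Dict


def normalize_log_scores(log_scores: Dict[str, float]) -> Dict[str, float]:
    """Softmax over log-scores, shifted by the maximum to avoid underflow."""
    peak = max(log_scores.values())
    weights = {lang: math.exp(score - peak) for lang, score in log_scores.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# "de" score taken from the langid.classify output for the Paris sentence;
# the "en" score is invented for illustration.
scores = {"de": -89.425368309021, "en": -120.0}
print(math.exp(scores["de"]))        # vanishingly small, not a usable confidence
print(normalize_log_scores(scores))  # probabilities that sum to 1
```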

In [16]:
df[["example_en", "detected_language", "language_confidence"]].values.tolist()

[['Er hat als Koch gearbeitet.', 'de', 0.9999969911349207],
 ['Wir haben zu Mittag gegessen.', 'de', 0.9999971147687953],
 ['Wir sind nach Hause gegangen.', 'de', 0.9999972605618549],
 ['Er hat viel Geld gehabt.', 'de', 0.9999955560983653],
 ['Sie hat anders geheißen.', 'de', 0.9999948145603174],
 ['Ich habe ein Geräusch gehört.', 'de', 0.9999971686799207],
 ['Er hat ein Auto gekauft.', 'de', 0.9999975043905702],
 ['Der Bus ist pünktlich gekommen.', 'de', 0.9999964553791492],
 ['Das hat er immer gut gekonnt.', 'de', 0.9999952233953684],
 ['Sie hat lange in Paris gelebt.', 'af', 0.7142819547522308],
 ['Er hat viel gelernt.', 'de', 0.8571404242961682],
 ['Er hat die Zeitung gelesen.', 'de', 0.9999954704665728],
 ['Sie hat das Essen gemacht.', 'de', 0.9999980983059539],
 ['Er hat arbeiten gemusst.', 'de', 0.9999970253497498],
 ['Ich habe gut geschlafen.', 'de', 0.9999977429370752],
 ['Sie hat eine E-Mail geschrieben.', 'de', 0.9999979270940208],
 ['Wir haben uns gestern gesehen.', 'de', 0

In [None]:
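def build_updated_examples(row: pd.Series) -> List[Dict[str, Any]]:
    # Copy each entry so the edits below do not mutate df["examples"] in place.
    examples = [dict(entry) if isinstance(entry, dict) else entry for entry in row.examples]
    sentence = (row.example_en or "").strip()
    if not sentence:
        return examples

    # Look for an existing example with the same sentence (case-insensitive).
    sentence_lower = sentence.lower()
    target = None
    for entry in examples:
        if not isinstance(entry, dict):
            continue
        existing_sentence = (entry.get("sentence") or "").strip()
        if existing_sentence and existing_sentence.lower() == sentence_lower:
            target = entry
            break

    if target is None:
        target = {"sentence": sentence, "translations": None}
        examples.append(target)

    # NOTE: `sentence` is the legacy example_en text, which for these rows was
    # detected as German. Replace translations["en"] during manual review if a
    # real English translation is needed.
    translations = dict(target.get("translations") or {})
    translations["en"] = sentence
    target["translations"] = translations
    return examples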


updates: List[Dict[str, Any]] = []
for _, row in suspected_german.iterrows():
    updated_examples = build_updated_examples(row)
    updates.append(
        {
            "id": int(row.id),
            "lemma": row.lemma,
            "pos": row.pos,
            "example_en": row.example_en,
            "examples": updated_examples,
        }
    )

print(f"Prepared {len(updates)} candidate updates.")
updates[:3]
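
Before toggling `APPLY_UPDATES`, a quick structural check of each payload can catch malformed entries. This sketch assumes the payload shape built above (an integer `id` plus an `examples` list of dicts with a `sentence` key); `validate_payload` is a hypothetical helper, not part of the notebook.

```python
import json
from typing import Any, Dict, List


def validate_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the payload looks fine."""
    problems: List[str] = []
    if not isinstance(payload.get("id"), int):
        problems.append("id is not an integer")
    examples = payload.get("examples")
    if not isinstance(examples, list):
        problems.append("examples is not a list")
        return problems
    for index, entry in enumerate(examples):
        if not isinstance(entry, dict) or not str(entry.get("sentence") or "").strip():
            problems.append(f"examples[{index}] has no sentence")
    try:
        json.dumps(examples)  # the apply cell serializes this value the same way
    except (TypeError, ValueError) as exc:
        problems.append(f"examples is not JSON-serializable: {exc}")
    return problems


good = {"id": 1, "examples": [{"sentence": "Er hat als Koch gearbeitet.", "translations": None}]}
bad = {"id": "1", "examples": [{}]}
print(validate_payload(good))
print(validate_payload(bad))
```

In the notebook, something like `[p["id"] for p in updates if validate_payload(p)]` would list the ids that need attention.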

In [None]:
APPLY_UPDATES = False  # set to True after manual review

if APPLY_UPDATES:
    with engine.begin() as conn:
        for payload in tqdm(updates, desc="Applying updates"):
            conn.execute(
                text(
                    """
                    UPDATE words
                    SET examples = CAST(:examples AS jsonb),
                        example_en = NULL,
                        updated_at = NOW()
                    WHERE id = :word_id
                    """
                ),
                {
                    "examples": json.dumps(payload["examples"]),
                    "word_id": payload["id"],
                },
            )
    print("Updates applied. Review the database before exporting snapshots again.")
else:
    print("APPLY_UPDATES is False. Review the `updates` list and toggle the flag when ready.")

## Next steps

* Inspect `suspected_german` to confirm detections.
* Edit the `updates` list manually if you need custom translations or to drop false positives.
* Set `APPLY_UPDATES = True` and rerun the final cell to write changes back.
* Re-run `scripts/enrichment/restore-words-from-duplicate.ts` if you need to revert.
* Once satisfied, export snapshots and upload to Supabase as usual.
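
Dropping false positives from `updates`, as suggested in the list above, can be done with a plain filter on `id`. The ids below are placeholders, and `drop_ids` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List


def drop_ids(updates: List[Dict[str, Any]], rejected: Iterable[int]) -> List[Dict[str, Any]]:
    """Return a new list without the payloads whose id was rejected during review."""
    rejected_set = set(rejected)
    return [payload for payload in updates if payload["id"] not in rejected_set]


demo = [{"id": 1}, {"id": 2}, {"id": 3}]  # placeholder payloads
kept = drop_ids(demo, [2])
print([payload["id"] for payload in kept])  # [1, 3]
```

Because the function returns a new list, the original `updates` stays available for comparison until you reassign it.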