***Hohenheimer Verständlichkeitsindex (HIX)*** is a German text comprehensibility/readability score designed to objectively compare how easy or hard a text is to understand.

**How it works (high level)**
	•	It combines four readability formulas (validated for German): Amstad, 1st new Vienna factual-text formula, SMOG (German), and LIX.
	•	It also adds extra text features like average sentence length, average clause/segment length, average word length, and shares of very long words / clauses / sentences.
	•	Those results are scaled and combined into a single score from 0 to 20 (higher = easier).

**How to interpret the score (examples used in the method)**
	•	Roughly 0–5: comparable to a political science dissertation (expert-level).
	•	Roughly 15–20: comparable to a tabloid-style newspaper article (very easy).

**Practical note**
	•	HIX is useful for screening and tracking improvements, but it mainly measures surface features (length, structure, etc.), not whether the content is truly “Easy Language” for a specific audience.


**Turn the HIX approach into a practical, repeatable model-selection ruleset.**

In [None]:
# Setup
import json
from pathlib import Path

# Uncomment for training
# import torch
# from transformers import (
#     AutoTokenizer,
#     AutoModelForSeq2SeqLM,
#     Seq2SeqTrainer,
#     Seq2SeqTrainingArguments,
#     DataCollatorForSeq2Seq,
# )
# from datasets import Dataset

PROJECT_ROOT = Path("..").resolve()
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
MODELS_DIR = PROJECT_ROOT / "models"

print(f"Data: {DATA_PROCESSED}")
print(f"Models will be saved to: {MODELS_DIR}")


## 1. Configuration


In [None]:
# Training configuration
CONFIG = {
    # Model
    "base_model": "google/mt5-small",  # or mt5-base, flan-t5-base
    "max_source_length": 512,
    "max_target_length": 256,
    
    # Training
    "batch_size": 8,
    "learning_rate": 5e-5,
    "num_epochs": 3,
    "warmup_steps": 500,
    "weight_decay": 0.01,
    
    # Outputs
    "output_dir": str(MODELS_DIR / "klartext-mt5-small"),
    "logging_steps": 100,
    "save_steps": 500,
    "eval_steps": 500,
}

print("Training config:")
for k, v in CONFIG.items():
    print(f"  {k}: {v}")


## HIX setup 



This section describes a concrete way to build your Negative (“hard”) and Target (“easy”) sets from publicly available sources, together with Klartext guardrails and a HIX-style normalization scheme you can apply consistently across LLM outputs.

**1. Robust setup with real, existing texts**


Negative set (“hard”): bureaucratic / academic German

Use sources that are intentionally dense and publicly accessible.

Bureaucratic / official
	•	Federal laws & ordinances (consolidated text, free): “Gesetze im Internet” provides nearly all current German federal law for reuse (HTML/PDF/XML).
	Example = /Users/simonvoegely/Desktop/Klartext/klartext/data/samples/federal law text.txt

•	Good because: long sentences, nominal style, references, legal definitions.

Academic / dissertation-style
	•	Humboldt University edoc repository (Open Access PDFs): example dissertation pages with downloadable PDF files.
	•	FU Berlin Refubium (Open Access dissertations, PDF download): example dissertation with PDF file listed.  
	Example = /Users/simonvoegely/Desktop/Klartext/klartext/data/samples/federal law text.txt

Practical collection rule for the negative set:
	•	Sample across topics (law, health, politics, education) and formats (HTML pages + PDFs).
	•	Keep the set “realistic”: use texts your users would actually face (letters, rules, laws, reports).

**2. Target set (“easy”): human-written “Einfache Sprache” samples**

/Users/simonvoegely/Desktop/Klartext/klartext/data/samples/Nachrichten leicht.txt

Curated directories of easy-language news sources
	•	bpb “einfach POLITIK” list of easy/light news sources (helps you expand your target set without hunting blindly).
	•	ARD Digital overview of “Einfache und Leichte Sprache” offers (useful to add regional broadcasters as more target data).

Research-grade parallel corpus (best for evaluation)
	•	DEplain (parallel corpus: standard German ↔ plain German / “Einfache Sprache”): professionally written and aligned sentence/document pairs; available on Hugging Face + described in ACL paper.

	[DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification](https://aclanthology.org/2023.acl-long.908/) (Stodden et al., ACL 2023)

Minimum viable dataset recommendation (for stable model comparison):
	•	Negative: 50–200 documents (laws + dissertations)
	•	Target: 50–200 documents (nachrichtenleicht + tagesschau easy + DEplain plain side)

**3. Ready-to-use guardrails aligned with Uni Hohenheim “Klartext”**

Uni Hohenheim’s Klartext materials give you concrete, enforceable checks:

Hard rule
	•	Split sentences longer than 20 words (track “% sentences > 20 words”).

Style guardrails
	•	Avoid passive and nominal style (prefer strong verbs).
	•	Avoid very long/rare words; if unavoidable, explain briefly or use hyphens to show word parts.
	•	Keep consistent wording and clear references (avoid switching terms).

You can treat these as hard constraints in your evaluation: a model can have a good HIX-like score but still fail if it violates these rules too often.

---

## 5. HIX-Style Normalization

The HIX method uses a 3-step process: **benchmark each metric, scale it to 0-10, then combine the scaled scores into a 0-20 index**

---

### Step 5.1: Define Benchmarks from Your Corpora

For each metric `m`, define two reference values:

| Benchmark | Definition | Recommended Statistic |
|-----------|------------|----------------------|
| `target_m` | "Good/Easy" reference | Median (P50) or P25 for strict |
| `neg_m` | "Bad/Hard" reference | Median (P50) or P75 for strict |

**Why use corpus statistics?**
- Avoids hand-picked arbitrary numbers
- Keeps benchmarks stable across domains
- Reproducible and defensible

**Policy Options:**

| Policy | Target | Negative | Use Case |
|--------|--------|----------|----------|
| **A (Balanced)** | P50(target) | P50(negative) | General comparison |
| **B (Strict)** | P25(target) | P75(negative) | High-quality filtering |
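
The two policies can be sketched with the standard library. This is a minimal illustration, not a reference implementation: the helper names `percentile` and `benchmarks` and the sample values are hypothetical, and the percentile choices assume a lower-is-better metric (for a higher-is-better metric the strict percentiles would plausibly swap).

```python
from statistics import quantiles


def percentile(values, p):
    """Return the p-th percentile (p in 1..99) with inclusive interpolation."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[p - 1]


def benchmarks(target_values, negative_values, policy="A"):
    """Return (target_m, neg_m) for one lower-is-better metric."""
    if policy == "A":  # balanced: medians of both corpora
        return percentile(target_values, 50), percentile(negative_values, 50)
    if policy == "B":  # strict: P25 of target, P75 of negative
        return percentile(target_values, 25), percentile(negative_values, 75)
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical per-document average sentence lengths (words per sentence).
target_sent_len = [8.0, 9.0, 10.0, 11.0, 12.0]
negative_sent_len = [22.0, 25.0, 28.0, 31.0, 34.0]

print(benchmarks(target_sent_len, negative_sent_len, "A"))
print(benchmarks(target_sent_len, negative_sent_len, "B"))
```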

---

### Step 5.2: Scale Each Metric to 0-10

Apply clamped linear scaling based on direction:

**If LOWER is better** (sentence length, % long sentences, word length, clause length):

`score_m = clamp( 10 * (neg_m - x) / (neg_m - target_m), 0, 10 )`

**If HIGHER is better** (rare here; the Amstad index is one case, since higher values mean easier text):

`score_m = clamp( 10 * (x - neg_m) / (target_m - neg_m), 0, 10 )`

**Interpretation:**
- `x = target_m` -> score = 10 (perfect)
- `x = neg_m` -> score = 0 (worst)
- Values outside range -> clamped to 0 or 10
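
A small sketch of the clamped scaling follows. The function names are hypothetical. Note that the two formulas above are algebraically the same expression, so a single function covers both directions as long as `target_m` and `neg_m` come from the right corpora; the guard against equal benchmarks is an added assumption, since the formula is undefined in that case.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def scale_metric(x, target_m, neg_m):
    """Scale a raw metric value to 0-10 (10 = at or beyond the target benchmark)."""
    if target_m == neg_m:
        # Assumption: equal benchmarks carry no information, so fail loudly.
        raise ValueError("target_m and neg_m must differ")
    return clamp(10 * (neg_m - x) / (neg_m - target_m), 0.0, 10.0)


# Lower is better (average sentence length): target 10, negative 28.
print(scale_metric(19, target_m=10, neg_m=28))
# Higher is better (Amstad-style index): target 80, negative 40.
print(scale_metric(60, target_m=80, neg_m=40))
```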

---

### Step 5.3: Combine to 0-20 HIX-like Index

HIX combines two groups of scores:

**Group 1: Readability Formulas (4 scores)**

`S_formulas = mean(score_Amstad, score_Wiener, score_SMOG_DE, score_LIX)`

**Group 2: Text Parameters (6 scores)**

`S_params = mean(score_sentLen, score_clauseLen, score_wordLen, score_%words>6, score_%clauses>12, score_%sents>20)`

**Final HIX-like Score:**

`HIX_like = S_formulas + S_params`

| Score Range | Interpretation |
|-------------|----------------|
| 0-5 | Expert-level (dissertation complexity) |
| >5-10 | Moderate difficulty |
| >10-15 | Accessible to general audience |
| >15-20 | Very easy (tabloid-style) |
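
The combination step can be sketched as below, assuming each of the ten metrics has already been scaled to 0-10. The dictionary keys are hypothetical names chosen for this illustration.

```python
from statistics import mean

# Hypothetical key names for the ten scaled (0-10) scores.
FORMULA_KEYS = ("amstad", "wiener", "smog_de", "lix")
PARAM_KEYS = (
    "sent_len", "clause_len", "word_len",
    "pct_words_gt6", "pct_clauses_gt12", "pct_sents_gt20",
)


def hix_like(scores):
    """Sum the mean formula score and the mean parameter score (0-20)."""
    s_formulas = mean(scores[k] for k in FORMULA_KEYS)
    s_params = mean(scores[k] for k in PARAM_KEYS)
    return s_formulas + s_params


example = {k: 8.0 for k in FORMULA_KEYS}
example.update({k: 5.0 for k in PARAM_KEYS})
print(hix_like(example))  # 8.0 + 5.0 = 13.0
```

Because each group is averaged before summing, the four formulas together carry the same weight as the six text parameters.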

---

### Step 5.4: Add Strictness Gates (Guardrails)

**Independently of the HIX_like score, enforce hard constraints:**

| Gate | Threshold | Action if Failed |
|------|-----------|------------------|
| % sentences > 20 words | < 10% (from Target corpus) | REJECT model output |
| Passive voice rate | < 15% | Flag for review |
| Negation rate | < 10% | Flag for review |
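
One possible way to encode the gate table is shown below. The function name and metric keys are hypothetical, and all rates are assumed to be percentages on a 0-100 scale.

```python
def check_gates(metrics):
    """Return (accepted, flags) for one model output.

    accepted is False when the hard gate fails; flags lists soft
    violations that should be reviewed by a human.
    """
    accepted = metrics["pct_sents_gt20"] < 10.0  # hard gate: reject on failure
    flags = []
    if metrics["pct_passive"] >= 15.0:  # soft gate
        flags.append("passive")
    if metrics["pct_negation"] >= 10.0:  # soft gate
        flags.append("negation")
    return accepted, flags


print(check_gates({"pct_sents_gt20": 4.0, "pct_passive": 20.0, "pct_negation": 3.0}))
```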

---

### Summary: Complete Scoring Pipeline

1. **COLLECT** - Freeze negative (hard) + target (easy) corpora
2. **PREPROCESS** - Extract text, segment sentences/clauses
3. **BENCHMARK** - Compute P50/P25/P75 for each metric on both corpora
4. **SCALE** - For each model output, scale metrics to 0-10
5. **COMBINE** - S_formulas + S_params = HIX_like (0-20)
6. **GATE** - Check guardrails (reject if failed)
7. **RANK** - Compare models by HIX_like + guardrail pass rate
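
The final ranking step might look like the sketch below. The ordering rule is an assumption made for illustration: models are sorted by guardrail pass rate first and by median HIX_like second, and the text above does not prescribe how the two are weighed. `rank_models` and the sample data are hypothetical.

```python
from statistics import median


def rank_models(results):
    """Rank models from per-output results.

    results maps a model name to a list of (hix_like, passed_gate) tuples.
    Returns (name, pass_rate, median_hix) rows, best model first.
    """
    rows = []
    for name, outputs in results.items():
        pass_rate = sum(1 for _, ok in outputs if ok) / len(outputs)
        median_hix = median(score for score, _ in outputs)
        rows.append((name, pass_rate, median_hix))
    # Assumed priority: pass rate, then median score.
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


sample = {
    "model_a": [(15.0, True), (16.0, True)],
    "model_b": [(18.0, True), (19.0, False)],
}
for row in rank_models(sample):
    print(row)
```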


---

## 6. Implementation Roadmap

A step-by-step guide to implement the HIX-based evaluation pipeline rigorously.

---

### Phase 1: Dataset Tasks

#### Step 1: Collect & Freeze Corpora

| Corpus | Sources | Action |
|--------|---------|--------|
| **Negative (hard)** | Federal laws, dissertations, academic papers | Select URLs/PDFs and store snapshots locally |
| **Target (easy)** | nachrichtenleicht, tagesschau easy, DEplain plain side | Download and version-control |

**Key principle:** Freeze the corpus versions to ensure reproducible benchmarks.
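
One way to make "frozen" checkable is a hash manifest, sketched below with the standard library. The function names and the manifest layout are assumptions, not part of the method.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(corpus_dir):
    """Map each file's path (relative to corpus_dir) to its SHA-256 digest."""
    root = Path(corpus_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def save_manifest(manifest, manifest_path):
    """Write the manifest as JSON; keep it outside the corpus directory."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")


def verify_manifest(corpus_dir, manifest):
    """Return True if the corpus still matches the stored manifest."""
    return build_manifest(corpus_dir) == manifest
```

Running `verify_manifest` before each benchmark computation would catch added, removed, or edited documents.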

#### Step 2: Licensing & Reuse Check

Create a metadata table for each source:

| Source | License | Reuse Allowed | Notes |
|--------|---------|---------------|-------|
| Gesetze im Internet | Official works, no copyright (§ 5 UrhG) | ✅ Yes | German federal law |
| DEplain | Open/Permission | ✅ Yes | ACL paper dataset |
| nachrichtenleicht | Check ToS | ⚠️ Verify | News content |

---

### Phase 2: Preprocessing Tasks

#### Step 3: Text Extraction

| Format | Tool/Method |
|--------|-------------|
| HTML | Boilerplate removal (trafilatura, BeautifulSoup) |
| PDF | PyMuPDF, pdfplumber, or OCR if needed |
| Plain text | Direct loading |

#### Step 4: Segmentation

Two levels of segmentation required:

1. **Sentence splitting**
   - Handle German punctuation rules
   - Handle abbreviations (z.B., d.h., usw.)
   - Tools: spaCy German, SoMaJo, or custom rules

2. **Clause splitting**
   - Split on subordinate clause markers (weil, dass, obwohl, wenn...)
   - Required for computing "Satzteillänge" (clause length) like HIX
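
A rule-based sketch of both segmentation levels is shown below. It is a rough stand-in for a proper tool such as spaCy or SoMaJo: the abbreviation and marker lists only contain the examples named above, and the rule "split at a comma followed by a marker" is an assumption about how clauses are delimited.

```python
import re

ABBREVIATIONS = ("z.B.", "d.h.", "usw.")
CLAUSE_MARKERS = ("weil", "dass", "obwohl", "wenn")

_PLACEHOLDER = "\u2024"  # one-dot leader, stands in for protected periods
_SENT_RE = re.compile(r'(?<=[.!?])\s+(?=[A-ZÄÖÜ„"])')
_CLAUSE_RE = re.compile(r",\s+(?=(?:%s)\b)" % "|".join(CLAUSE_MARKERS))


def split_sentences(text):
    """Split at ., ! or ? followed by whitespace and an uppercase letter."""
    for abbr in ABBREVIATIONS:
        # Hide abbreviation periods so they cannot trigger a split.
        text = text.replace(abbr, abbr.replace(".", _PLACEHOLDER))
    parts = _SENT_RE.split(text.strip())
    return [p.replace(_PLACEHOLDER, ".") for p in parts if p]


def split_clauses(sentence):
    """Split a sentence at commas that introduce a subordinate clause."""
    return [c.strip() for c in _CLAUSE_RE.split(sentence) if c.strip()]


text = "Das gilt z.B. für Anträge. Wir helfen, wenn Sie Fragen haben."
for sentence in split_sentences(text):
    print(split_clauses(sentence))
```

A known limitation of this sketch: an abbreviation that ends a sentence (for example a trailing "usw.") suppresses the split after it.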

---

### Phase 3: Metrics & Guardrail Tasks

#### Step 5: Compute HIX Components

**4 Readability Formulas:**

| Formula | Description |
|---------|-------------|
| Amstad | German adaptation of Flesch Reading Ease |
| 1st new Wiener Sachtextformel | Vienna factual-text formula; estimates a school grade level |
| SMOG (German) | Simple Measure of Gobbledygook |
| LIX | Läsbarhetsindex (Scandinavian origin) |
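
Three of the four formulas can be sketched as follows, using the coefficients as they are commonly cited; they should be checked against the primary sources before use. The syllable counter is a crude assumption (it counts groups of adjacent vowels), and SMOG is left out because its German variant is not specified here. Note the directions: a higher Amstad value means easier text, while higher LIX and Wiener Sachtextformel values mean harder text.

```python
import re

_WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")
_VOWEL_GROUP_RE = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)


def count_syllables(word):
    """Approximate syllables as the number of vowel groups (at least 1)."""
    return max(1, len(_VOWEL_GROUP_RE.findall(word)))


def readability(sentences):
    """Compute Amstad, LIX and the 1st new Wiener Sachtextformel.

    sentences is a non-empty list of sentence strings.
    """
    words = [w for s in sentences for w in _WORD_RE.findall(s)]
    syllables = [count_syllables(w) for w in words]
    n_words = len(words)
    asl = n_words / len(sentences)                              # words per sentence
    asw = sum(syllables) / n_words                              # syllables per word
    pct_long = 100 * sum(len(w) > 6 for w in words) / n_words   # words > 6 letters
    pct_3syl = 100 * sum(s >= 3 for s in syllables) / n_words   # words with 3+ syllables
    pct_1syl = 100 * sum(s == 1 for s in syllables) / n_words   # one-syllable words
    return {
        "amstad": 180 - asl - 58.5 * asw,
        "lix": asl + pct_long,
        "wstf1": (0.1935 * pct_3syl + 0.1672 * asl + 0.1297 * pct_long
                  - 0.0327 * pct_1syl - 0.875),
    }


easy = readability(["Das ist ein Haus.", "Wir gehen heute."])
hard = readability(
    ["Die Verwaltungsvorschriften regeln die Zuständigkeitsverteilung zwischen Bundesbehörden."]
)
print(easy)
print(hard)
```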

**6 Text Parameters:**

| Parameter | What it measures |
|-----------|------------------|
| Avg sentence length | Words per sentence |
| Avg clause length | Words per clause/segment |
| Avg word length | Characters per word |
| % words > 6 chars | Share of long words |
| % clauses > 12 words | Share of long clauses |
| % sentences > 20 words | Share of very long sentences |
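
The six parameters can be computed from already segmented text, as in this sketch. It assumes whitespace tokenization and non-empty input; the function and key names are hypothetical.

```python
_PUNCT = '.,;:!?"„“”()'


def share(flags):
    """Percentage of True values in a non-empty sequence."""
    flags = list(flags)
    return 100 * sum(flags) / len(flags)


def text_parameters(sentences, clauses):
    """Return the six surface parameters for one text.

    sentences and clauses are non-empty lists of strings.
    """
    sent_lens = [len(s.split()) for s in sentences]
    clause_lens = [len(c.split()) for c in clauses]
    words = [w.strip(_PUNCT) for s in sentences for w in s.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return {
        "sent_len": sum(sent_lens) / len(sent_lens),
        "clause_len": sum(clause_lens) / len(clause_lens),
        "word_len": sum(len(w) for w in words) / len(words),
        "pct_words_gt6": share(len(w) > 6 for w in words),
        "pct_clauses_gt12": share(n > 12 for n in clause_lens),
        "pct_sents_gt20": share(n > 20 for n in sent_lens),
    }


params = text_parameters(
    ["Wir helfen, wenn Sie Fragen haben.", "Das ist gut."],
    ["Wir helfen", "wenn Sie Fragen haben.", "Das ist gut."],
)
print(params)
```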

#### Step 6: Compute Klartext Guardrail Metrics

| Metric | Target | How to compute |
|--------|--------|----------------|
| % sentences > 20 words | < 10% | Count and divide |
| Passive voice rate | < 15% | POS tagging + pattern matching |
| Negation rate | < 10% | Count "nicht", "kein", "nie", etc. |
| Terminology consistency | > 90% | Glossary hits / total terms |
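
The negation rate is the simplest of these to sketch. The table does not fix the denominator, so this example assumes "share of sentences containing at least one negation word"; the word list is a small illustrative selection. Passive detection needs a POS tagger and is not shown.

```python
import re

# Illustrative list; "kein\w*" also covers keine, keinen, keinesfalls, ...
NEGATION_RE = re.compile(
    r"\b(nicht|kein\w*|nie|niemals|nichts|niemand\w*)\b", re.IGNORECASE
)


def negation_rate(sentences):
    """Percentage of sentences that contain at least one negation word."""
    hits = sum(1 for s in sentences if NEGATION_RE.search(s))
    return 100 * hits / len(sentences)


sample = [
    "Das ist nicht erlaubt.",
    "Sie brauchen keine Unterlagen.",
    "Das Knie tut weh.",  # "nie" inside "Knie" must not count
    "Wir helfen gern.",
]
print(negation_rate(sample))  # 2 of 4 sentences -> 50.0
```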

#### Step 7: Meaning Preservation Check

Critical to avoid "easy but wrong" outputs:

- **Human QA sampling:** Random sample review
- **Rule-based checks:**
  - Dates preserved correctly
  - Numbers unchanged
  - Obligations (must/may/shall) not altered
  - Named entities intact
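
As an example of a rule-based check, the sketch below reports numbers and numeric dates that appear in the source but not in the simplified output. The function name is hypothetical, and the regular expression only covers digit sequences with `.` or `,` separators; numbers written out as words are not detected.

```python
import re
from collections import Counter

_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")  # 25 | 25,50 | 31.12.2024


def missing_numbers(source, output):
    """Return numeric tokens from source that are absent from output."""
    src = Counter(_NUMBER_RE.findall(source))
    out = Counter(_NUMBER_RE.findall(output))
    return sorted((src - out).elements())


source = "Die Frist endet am 31.12.2024. Die Gebühr beträgt 25,50 Euro."
output = "Sie müssen bis zum 31.12.2024 zahlen. Es kostet 25 Euro."
print(missing_numbers(source, output))  # ['25,50']
```

A non-empty result does not prove an error (a date may have been rephrased), so it is better treated as a trigger for human review than as an automatic rejection.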

---

### Phase 4: Decision & Reporting Tasks

#### Step 8: Model Evaluation Harness

Standardize evaluation conditions:

```
┌─────────────────────────────────────────┐
│ EVALUATION HARNESS REQUIREMENTS         │
├─────────────────────────────────────────┤
│ ✓ Same system prompt for all models     │
│ ✓ Same decoding settings (temp, top_p)  │
│ ✓ Same test set (frozen)                │
│ ✓ Versioned outputs (git/timestamps)    │
│ ✓ Reproducible random seeds             │
└─────────────────────────────────────────┘
```
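
The requirements above can be pinned down in code by keeping all shared settings in one immutable object and storing its fingerprint with every output. This is a sketch; the field names and the 12-character fingerprint length are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalSettings:
    """Settings that must be identical for every model in a comparison."""
    system_prompt: str
    temperature: float
    top_p: float
    seed: int
    test_set_version: str

    def fingerprint(self):
        """Short, stable hash to store alongside each generated output."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


settings = EvalSettings(
    system_prompt="Schreibe den Text in Einfacher Sprache.",
    temperature=0.0,
    top_p=1.0,
    seed=42,
    test_set_version="v1",
)
print(settings.fingerprint())
```

Two runs are comparable only if their fingerprints match.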

#### Step 9: Score Dashboard

Create a summary view with:

| Metric | Display |
|--------|---------|
| Per-model scores | Median, IQR, min, max |
| Guardrail pass/fail | Binary + violation count |
| Error rate | % of failed generations |
| HIX-like index | 0-20 scale |
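
The per-model summary statistics can be computed with the standard library, as in this sketch. `summarize` is a hypothetical helper and the scores are made-up sample values.

```python
from statistics import median, quantiles


def summarize(scores):
    """Return median, interquartile range, min and max of HIX-like scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "median": median(scores),
        "iqr": q3 - q1,
        "min": min(scores),
        "max": max(scores),
    }


print(summarize([10.0, 12.0, 14.0, 16.0, 18.0]))
```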
