# ÁSZF Comprehensibility Prediction Model

The goal of this project is to build an NLP model that rates the comprehensibility of passages from legal texts (ÁSZF, i.e. general terms and conditions) on a scale from 1 to 5.

## Main steps:
1. **Data loading**: Read the labeled JSON data.
2. **Feature engineering**: Derive text features (readability indices, legal terms, etc.).
3. **Baseline model**: Train an ordinal logistic regression on the extracted features.
4. **Transformer model**: Fine-tune a transformer model (e.g. `SZTAKI/HuBERT`; the configuration below uses `xlm-roberta-base`) with a custom CORAL head.
5. **Evaluation**: Compare the models (MAE, QWK).
6. **Error analysis**: Examine the incorrect predictions.
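
The evaluation step names QWK (quadratic weighted kappa) next to MAE. As a minimal sketch of what that metric measures, the pure-Python helper below computes it for labels in 1..5. The function name `quadratic_weighted_kappa` is an illustrative assumption, not part of this notebook; scikit-learn's `cohen_kappa_score(y_true, y_pred, weights='quadratic')` should give the same quantity.

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """QWK for integer labels in 1..num_classes (hypothetical helper)."""
    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # confusion-matrix counts
    hist_true = Counter(y_true)
    hist_pred = Counter(y_pred)
    num = 0.0  # weighted disagreement actually observed
    den = 0.0  # weighted disagreement expected by chance
    for i in range(1, num_classes + 1):
        for j in range(1, num_classes + 1):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            num += w * observed[(i, j)]
            den += w * hist_true[i] * hist_pred[j] / n
    # Undefined when both label vectors are constant and identical
    return float('nan') if den == 0 else 1.0 - num / den

def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 2, 3, 4, 5]
y_pred = [1, 2, 3, 4, 4]  # one prediction is off by one class
print(mean_abs_error(y_true, y_pred), quadratic_weighted_kappa(y_true, y_pred))
```

Unlike MAE, QWK penalizes errors by the squared distance between classes and corrects for chance agreement, so the two metrics can rank models differently.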

In [3]:
# 1. Imports and configuration
import os, re, json, math, random
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from mord import LogisticAT
import spacy
from tqdm import tqdm

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

MODEL_NAME = 'xlm-roberta-base'
MAX_LEN = 256
BATCH_SIZE = 8
EPOCHS = 8
LR = 2e-5
VAL_SIZE = 0.2
ANNOTATION_FILE = 'granit_bank_cimkezes.json'
TEXT_FILE = 'granit_bank-penzforgalmi_szolgaltatasok_aszf.txt'
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load spaCy with a robust fallback (no network access)
try:
    nlp = spacy.load('xx_ent_wiki_sm')
except Exception:
    print('Fallback: blank multilingual spaCy model (no entities).')
    nlp = spacy.blank('xx')
    if 'sentencizer' not in nlp.pipe_names:
        nlp.add_pipe('sentencizer')

print('Konfiguráció betöltve. Modell:', MODEL_NAME, '| Eszköz:', DEVICE)

Fallback: blank multilingual spaCy model (no entities).
Konfiguráció betöltve. Modell: xlm-roberta-base | Eszköz: cpu


# 2. Data Loading and Preprocessing

Load the labeled data from the JSON file. The `load_annotation_json` function extracts each paragraph's text and the numeric label assigned to it.
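
To make the expected input concrete, here is a small sketch of one Label Studio style task and the label parsing applied to it. The task is a hand-built, hypothetical example that follows the field layout described in the loader's docstring; the id, text prefix and label string are taken from the sample output of this notebook.

```python
import json
import re

# Hypothetical single-task export (layout assumed from the loader's docstring)
raw = json.dumps([{
    "id": 204,
    "data": {"text": "2.1.1. A Pénzforgalmi Keretszerződés alapján a..."},
    "annotations": [{"result": [{"value": {"choices": ["2-Nehezen érthető"]}}]}],
}], ensure_ascii=False)

tasks = json.loads(raw)
task = tasks[0]
choice = task["annotations"][0]["result"][0]["value"]["choices"][0]

# The numeric label is the leading digit of the choice text
match = re.match(r"(\d)", choice)
label_int = int(match.group(1)) if match else None
print(task["id"], label_int, choice)
```

Only the first annotation and the first result of each task are read, so tasks labeled by several annotators would need an explicit aggregation rule.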

In [6]:
# Data loading + advanced feature engineering (supports the Label Studio format)

def load_annotation_json(path: str) -> pd.DataFrame:
    """
    Load the annotation data from a JSON file and convert it to a DataFrame.
    Expected structure (Label Studio export): a list of task objects.
    - task['data']['text'] holds the paragraph text
    - task['annotations'][0]['result'][0]['value']['choices'][0] holds the label text (e.g. '3 - Közepes')
    The label is taken from the leading digit of the label text.
    Returns an empty DataFrame if the file is missing, empty, or malformed.
    """
    p = Path(path)
    if not p.exists():
        print(f"Figyelem: A fájl nem található: {path}")
        return pd.DataFrame(columns=['task_id','paragraph_text','label_int','label_text'])
    raw = p.read_text(encoding='utf-8').strip()
    if not raw:
        print("Figyelem: A JSON fájl üres.")
        return pd.DataFrame(columns=['task_id','paragraph_text','label_int','label_text'])
    if not raw.startswith('['):  # in case the export was not saved as a list
        raw = f'[{raw}]'
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        print('JSON decode hiba:', e)
        return pd.DataFrame(columns=['task_id','paragraph_text','label_int','label_text'])
    rows = []
    for task in data:
        text = task.get('data', {}).get('text', '').strip()
        ann_list = task.get('annotations', [])
        if not ann_list:
            continue
        ann = ann_list[0]
        result = ann.get('result', [])
        if not result:
            continue
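        # `or [None]` also covers an empty `choices` list, which would otherwise raise IndexError
        choices = result[0].get('value', {}).get('choices') or [None]
        choice = choices[0]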
        if not choice:
            continue
        m = re.match(r'(\d)', str(choice))
        if not m:
            continue
        label_int = int(m.group(1))
        rows.append({
            'task_id': task.get('id'),
            'paragraph_text': text,
            'label_int': label_int,
            'label_text': choice
        })
    return pd.DataFrame(rows)

# List of legal terms (extendable)
LEGAL_TERMS = ["szerződés","feltétel","jog","kötelezettség","felelősség","kártérítés","hatály","rendelkezés","törvény","rendelet","bíróság","per","felmondás","biztosítás","ügyfél"]
VOWELS = "aáeéiíoóöőuúüű"

def count_syllables_hu(word):
    count = 0; in_grp = False
    for ch in word.lower():
        if ch in VOWELS:
            if not in_grp:
                count += 1; in_grp = True
        else:
            in_grp = False
    return max(1, count)

def extract_features(text):
    words = re.findall(r'\w+', text.lower())
    word_count = len(words)
    char_count = len(text)
    sentence_count = max(1, len(re.findall(r'[.!?]', text)))
    syllable_count = sum(count_syllables_hu(w) for w in words) if words else 0
    avg_words_per_sentence = word_count / sentence_count if sentence_count else 0
    avg_syllables_per_word = syllable_count / word_count if word_count else 0
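    # Note: these are the coefficients of the English Flesch Reading Ease formula, not a Hungarian recalibration
    flesch_score_hu = 206.835 - 1.015 * avg_words_per_sentence - 84.6 * avg_syllables_per_word if word_count else 0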
    legal_term_ratio = sum(1 for w in words if w in LEGAL_TERMS) / word_count if word_count else 0
    long_word_ratio = sum(1 for w in words if len(w) > 12) / word_count if word_count else 0
    # spaCy analysis (with the blank model the entity count stays 0; without a tagger or parser the POS ratios and dependency depth are 0 as well)
    doc = nlp(text)
    num_entities = len(doc.ents) if doc.ents else 0
    pos_counts = doc.count_by(spacy.attrs.POS) if hasattr(spacy.attrs,'POS') else {}
    num_nouns = pos_counts.get(spacy.symbols.NOUN, 0) if hasattr(spacy.symbols,'NOUN') else 0
    num_verbs = pos_counts.get(spacy.symbols.VERB, 0) if hasattr(spacy.symbols,'VERB') else 0
    num_adjs  = pos_counts.get(spacy.symbols.ADJ, 0) if hasattr(spacy.symbols,'ADJ') else 0
    pos_noun_ratio = num_nouns / word_count if word_count else 0
    pos_verb_ratio = num_verbs / word_count if word_count else 0
    pos_adj_ratio  = num_adjs  / word_count if word_count else 0
    depths = []
    for token in doc:
        d=0; cur=token
        while cur.head != cur and d < 100:
            d += 1; cur = cur.head
        depths.append(d)
    avg_dep_depth = float(np.mean(depths)) if depths else 0
    return {
        'char_count': char_count,
        'word_count': word_count,
        'sentence_count': sentence_count,
        'syllable_count': syllable_count,
        'flesch_score_hu': flesch_score_hu,
        'legal_term_ratio': legal_term_ratio,
        'long_word_ratio': long_word_ratio,
        'num_entities': num_entities,
        'pos_noun_ratio': pos_noun_ratio,
        'pos_verb_ratio': pos_verb_ratio,
        'pos_adj_ratio': pos_adj_ratio,
        'avg_dep_depth': avg_dep_depth
    }

# Load the annotations with the new function
df_labels = load_annotation_json(ANNOTATION_FILE)
if not df_labels.empty:
    print(f"Betöltött címkék száma: {len(df_labels)}")
    print("Címkék eloszlása:")
    print(df_labels['label_int'].value_counts().sort_index())
    print("\nAdatminta:")
    print(df_labels.head())
else:
    print("Nem sikerült adatokat betölteni. A további feature / modell lépések kihagyva amíg nincs adat.")

# Extract features only if data is available
if not df_labels.empty:
    feature_rows = [extract_features(t) for t in tqdm(df_labels['paragraph_text'], desc='Feature extract')] 
    df_feat = pd.DataFrame(feature_rows)
    df_processed = pd.concat([df_labels.reset_index(drop=True), df_feat], axis=1)
    print('Feature oszlopok:', list(df_feat.columns)[:8], '...')
    print(df_processed.head(3))

Betöltött címkék száma: 119
Címkék eloszlása:
label_int
1    14
2    12
3    28
4    43
5    22
Name: count, dtype: int64

Adatminta:
   task_id                                     paragraph_text  label_int  \
0      204  2.1.1. A Pénzforgalmi Keretszerződés alapján a...          2   
1      205  2.1.2. A Korlátozottan cselekvőképes kiskorú T...          2   
2      206  2.1.3. A Bank által a Bankszámla megnyitásának...          3   
3      207  2.1.4. Ha a gazdálkodó szervezet / egyéb szerv...          1   
4      208  2.1.5. Ha a gazdálkodó szervezet / egyéb szerv...          3   

                 label_text  
0         2-Nehezen érthető  
1         2-Nehezen érthető  
2  3-Többé/kevésbé megértem  
3  1-Nagyon nehezen érthető  
4  3-Többé/kevésbé megértem  


Feature extract: 100%|██████████| 119/119 [00:00<00:00, 1797.24it/s]

Feature oszlopok: ['char_count', 'word_count', 'sentence_count', 'syllable_count', 'flesch_score_hu', 'legal_term_ratio', 'long_word_ratio', 'num_entities'] ...
   task_id                                     paragraph_text  label_int  \
0      204  2.1.1. A Pénzforgalmi Keretszerződés alapján a...          2   
1      205  2.1.2. A Korlátozottan cselekvőképes kiskorú T...          2   
2      206  2.1.3. A Bank által a Bankszámla megnyitásának...          3   

                 label_text  char_count  word_count  sentence_count  \
0         2-Nehezen érthető        1158         135               9   
1         2-Nehezen érthető        1534         183              11   
2  3-Többé/kevésbé megértem        1211         144               5   

   syllable_count  flesch_score_hu  legal_term_ratio  long_word_ratio  \
0             391       -53.416667          0.000000         0.155556   
1             534       -56.916483          0.005464         0.120219   
2             424       -71.49
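
As a worked check of the `flesch_score_hu` column, the sketch below recomputes the score for the first two rows from the `word_count`, `sentence_count` and `syllable_count` values printed above. The helper name `flesch_reading_ease` is an assumption made for illustration; the arithmetic is the same as in `extract_features`.

```python
def flesch_reading_ease(word_count, sentence_count, syllable_count):
    """Same arithmetic as flesch_score_hu in extract_features (hypothetical helper)."""
    avg_words_per_sentence = word_count / sentence_count
    avg_syllables_per_word = syllable_count / word_count
    return 206.835 - 1.015 * avg_words_per_sentence - 84.6 * avg_syllables_per_word

# Counts copied from rows 0 and 1 of the printed DataFrame
row0 = flesch_reading_ease(word_count=135, sentence_count=9, syllable_count=391)
row1 = flesch_reading_ease(word_count=183, sentence_count=11, syllable_count=534)
print(round(row0, 6), round(row1, 6))
```

The strongly negative values come from the high syllables-per-word ratio (close to 3) combined with coefficients tuned for English, so the score is probably best read as a relative feature rather than on the usual 0 to 100 scale.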




# 4. Baseline Model - Ordinal Regression

We train a simple baseline, an all-threshold ordinal logistic regression (`LogisticAT`), on the extracted features using the `mord` package.
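
The idea behind ordinal logistic regression can be sketched in a few lines: a single linear score is compared against K-1 ordered thresholds, and the cumulative probabilities P(y <= k) are differenced into class probabilities. The sketch below uses made-up thresholds and a made-up score; it illustrates the cumulative-link model in general, and `mord`'s internal parameterisation and prediction rule may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_class_probs(score, thresholds):
    """P(y = k) for k = 1..K from a scalar score and K-1 increasing thresholds."""
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]  # P(y <= k)
    probs, prev = [], 0.0
    for c in cumulative:
        probs.append(c - prev)
        prev = c
    return probs

# Illustrative numbers only, not fitted values from this notebook
thresholds = [-2.0, -1.0, 0.5, 2.0]
probs = ordinal_class_probs(score=0.3, thresholds=thresholds)
predicted = max(range(len(probs)), key=probs.__getitem__) + 1
print([round(p, 3) for p in probs], predicted)
```

Because all classes share one score, the model needs only one weight per feature plus K-1 thresholds, which suits a dataset of 119 labeled paragraphs.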

In [None]:
# 5. Baseline ordinal regression (LogisticAT) with the advanced features
if 'df_processed' in globals() and not df_processed.empty:
    baseline_feature_cols = [
        'char_count','word_count','sentence_count','syllable_count','flesch_score_hu',
        'legal_term_ratio','long_word_ratio','num_entities','pos_noun_ratio','pos_verb_ratio',
        'pos_adj_ratio','avg_dep_depth'
    ]
    missing = [c for c in baseline_feature_cols if c not in df_processed.columns]
    if missing:
        print('Hiányzó feature oszlopok, baseline kihagyva:', missing)
    else:
        Xb = df_processed[baseline_feature_cols].values
        yb = df_processed['label_int'].values
        X_train_b, X_val_b, y_train_b, y_val_b = train_test_split(Xb, yb, test_size=VAL_SIZE, random_state=SEED, stratify=yb)
        scaler_b = StandardScaler()
        X_train_b = scaler_b.fit_transform(X_train_b)
        X_val_b = scaler_b.transform(X_val_b)
        baseline_model = LogisticAT(alpha=0.5)
        baseline_model.fit(X_train_b, y_train_b)
        preds_b = baseline_model.predict(X_val_b)
        mae_b = mean_absolute_error(y_val_b, preds_b)
        print(f'Baseline LogisticAT MAE: {mae_b:.4f}')
else:
    print('Baseline nem futtatható: df_processed hiányzik vagy üres.')

Baseline LogisticAT MAE: 0.9583 (cél < 0.9167)
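
To put this MAE in context, it helps to know what a constant prediction would score. The sketch below computes that from the label counts in the "Címkék eloszlása" output above. It is calculated over all 119 labels rather than over the held-out 20% split, so it is only roughly comparable to the validation MAE.

```python
# Label counts copied from the label distribution printed earlier
label_counts = {1: 14, 2: 12, 3: 28, 4: 43, 5: 22}
n = sum(label_counts.values())

def constant_mae(guess):
    """MAE obtained by predicting the same label for every paragraph."""
    return sum(abs(label - guess) * count for label, count in label_counts.items()) / n

for guess in range(1, 6):
    print(guess, round(constant_mae(guess), 4))

best_guess = min(range(1, 6), key=constant_mae)
```

Always predicting label 4 gives an MAE of about 0.975 on the full set, so a model needs to land clearly below that value to show that the features carry signal.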


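# 5. Transformer Model (CORAL)

We define a neural model that uses a pretrained transformer (e.g. huBERT) and adds a CORAL (Consistent Rank Logits) output layer for ordinal classification. The head predicts the K-1 cumulative probabilities P(y > k), from which a class label is decoded. Note that the head below uses an independent weight vector per threshold (`nn.Linear(in_dim, num_classes - 1)`), whereas the original CORAL formulation shares one weight vector across thresholds and learns only separate biases.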
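
The ordinal encoding and decoding used by such a head can be sketched without any deep learning library. The example below shows how a label becomes K-1 binary targets and how the predicted cumulative probabilities are turned back into a label. The probability vector is made up for illustration, and `decode_by_threshold` is a commonly used rank-style alternative that this notebook itself does not use (it decodes by argmax or expected value).

```python
def ordinal_targets(label, num_classes=5):
    """Binary targets [y > 1, y > 2, ..., y > K-1] for a label in 1..K."""
    return [1.0 if label > k else 0.0 for k in range(1, num_classes)]

def cumulative_to_exact(p_greater):
    """Turn P(y > k), k = 1..K-1, into P(y = k), k = 1..K."""
    padded = [1.0] + list(p_greater) + [0.0]
    return [padded[i] - padded[i + 1] for i in range(len(padded) - 1)]

def decode_by_threshold(p_greater, cutoff=0.5):
    """Rank-style decoding: 1 + number of thresholds that are passed."""
    return 1 + sum(p > cutoff for p in p_greater)

p_greater = [0.95, 0.80, 0.40, 0.10]  # illustrative head output, not real model output
p_exact = cumulative_to_exact(p_greater)

label_argmax = max(range(len(p_exact)), key=p_exact.__getitem__) + 1
label_expected = round(sum((k + 1) * p for k, p in enumerate(p_exact)))
print(ordinal_targets(3), [round(p, 2) for p in p_exact], label_argmax, label_expected)
```

If the predicted cumulative probabilities are not monotonically decreasing, the differences can become negative, which is one reason the rank-consistency property of CORAL matters.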

In [14]:
# --- CORAL model definition (feature integration + expected-value support) ---
NUM_CLASSES = 5

class CoralHead(nn.Module):
    def __init__(self, hidden_size, num_classes, extra_feat_dim=0, dropout=0.3):
        super().__init__()
        self.use_extra = extra_feat_dim > 0
        in_dim = hidden_size + extra_feat_dim if self.use_extra else hidden_size
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(in_dim, num_classes - 1)
    def forward(self, cls_hidden, extra_feats=None):
        if self.use_extra and extra_feats is not None:
            x = torch.cat([cls_hidden, extra_feats], dim=1)
        else:
            x = cls_hidden
        x = self.dropout(x)
        logits = self.linear(x)
        probs = torch.sigmoid(logits)
        return probs

def coral_probs_to_label_argmax(probs):
    batch_size = probs.size(0)
    ones = torch.ones(batch_size, 1, device=probs.device)
    zeros = torch.zeros(batch_size, 1, device=probs.device)
    p_greater_than = torch.cat([ones, probs, zeros], dim=1)
    p_exact = p_greater_than[:, :-1] - p_greater_than[:, 1:]
    return torch.argmax(p_exact, dim=1) + 1

def coral_probs_to_label_expected(probs):
    batch_size = probs.size(0)
    ones = torch.ones(batch_size, 1, device=probs.device)
    zeros = torch.zeros(batch_size, 1, device=probs.device)
    p_greater_than = torch.cat([ones, probs, zeros], dim=1)
    p_exact = p_greater_than[:, :-1] - p_greater_than[:, 1:]
    labels = torch.arange(1, probs.size(1)+2, device=probs.device).float()  # 1..K
    exp_val = torch.sum(p_exact * labels, dim=1)
    return torch.clamp(torch.round(exp_val), 1, probs.size(1)+1).long()

class CoralModel(nn.Module):
    def __init__(self, model_name, num_classes=5, extra_feat_dim=0):
        super().__init__()
        self.base = AutoModel.from_pretrained(model_name)
        hidden_size = self.base.config.hidden_size
        self.head = CoralHead(hidden_size, num_classes, extra_feat_dim=extra_feat_dim, dropout=0.3)
    def forward(self, input_ids, attention_mask, extra_feats=None):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0]
        probs = self.head(cls_output, extra_feats)
        return probs

def build_pos_weights(labels, num_classes):
    # labels 1..K; for each threshold k (1..K-1) compute pos = labels>k
    weights = []
    labels_t = torch.tensor(labels)
    for k in range(1, num_classes):
        pos = (labels_t > k).sum().item()
        neg = (labels_t <= k).sum().item()
        if pos == 0: pos = 1
        w_pos = neg / pos  # emphasize minority positives
        weights.append(w_pos)
    return torch.tensor(weights, dtype=torch.float)

def coral_loss_weighted(probs, labels, pos_weights):
    # probs shape (B, K-1); labels 1..K
    K = probs.size(1) + 1
    targets = []
    for k in range(1, K):
        target_k = (labels > k).float().unsqueeze(1)
        targets.append(target_k)
    target_tensor = torch.cat(targets, dim=1)  # (B, K-1)
    # Weighted BCE manually
    # pos_weight for positive target; negative weight = 1
    pw = pos_weights.to(probs.device).unsqueeze(0)  # (1,K-1)
    loss_pos = -pw * target_tensor * torch.log(probs + 1e-8)
    loss_neg = -(1 - target_tensor) * torch.log(1 - probs + 1e-8)
    loss = (loss_pos + loss_neg).mean()
    return loss

# Instantiate the tokenizer and model based on the extra feature dimension
extra_dim = len(['char_count','word_count','sentence_count','long_word_ratio','syllable_count','flesch_score_hu','legal_term_ratio','num_entities','pos_noun_ratio','pos_verb_ratio','pos_adj_ratio','avg_dep_depth'])
if 'tokenizer' not in globals():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
coral_model = CoralModel(MODEL_NAME, num_classes=NUM_CLASSES, extra_feat_dim=extra_dim)
print('Transformer + CORAL modell betöltve. Extra feature dim:', extra_dim)

Transformer + CORAL modell betöltve. Extra feature dim: 12
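
For intuition about `build_pos_weights`, the sketch below applies the same neg/pos ratio to the label counts printed earlier in this notebook. It works from the counts of all 119 labels for illustration; in training, the weights would presumably be computed from the training split only.

```python
# Label counts copied from the label distribution printed earlier
label_counts = {1: 14, 2: 12, 3: 28, 4: 43, 5: 22}
num_classes = 5

pos_weights = []
for k in range(1, num_classes):
    pos = sum(c for label, c in label_counts.items() if label > k)   # targets equal to 1
    neg = sum(c for label, c in label_counts.items() if label <= k)  # targets equal to 0
    pos_weights.append(neg / max(pos, 1))  # same guard against pos == 0 as build_pos_weights

print([round(w, 4) for w in pos_weights])
```

The first thresholds get weights well below 1 because most paragraphs have labels above 1 or 2, while the last threshold (y > 4) is up-weighted by a factor of about 4.4.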


In [16]:
# 7. 5-Fold Stratified Cross-Validation Ensemble
from sklearn.model_selection import StratifiedKFold

KFOLDS = 5
EPOCHS_KF = 6
PATIENCE_KF = 2
LEARNING_RATE_KF = 2e-5

if 'df_processed' in globals() and not df_processed.empty:
    skf = StratifiedKFold(n_splits=KFOLDS, shuffle=True, random_state=SEED)
    labels_all = df_processed['label_int'].values
    fold_results = []
    ensemble_exact_probs = []  # store per-fold val exact class prob for ensemble

    # Helper: simple CORAL loss
    def coral_loss_simple(probs, labels):
        K = probs.size(1) + 1
        targets = []
        for k in range(1, K):
            target_k = (labels > k).float().unsqueeze(1)
            targets.append(target_k)
        target_tensor = torch.cat(targets, dim=1)
        return nn.BCELoss()(probs, target_tensor)

    # Function to get exact class probabilities from cumulative probs
    def cumulative_to_exact(probs):
        batch_size = probs.size(0)
        ones = torch.ones(batch_size, 1, device=probs.device)
        zeros = torch.zeros(batch_size, 1, device=probs.device)
        p_greater_than = torch.cat([ones, probs, zeros], dim=1)
        p_exact = p_greater_than[:, :-1] - p_greater_than[:, 1:]
        return p_exact  # shape (B, K)

    for fold, (train_idx, val_idx) in enumerate(skf.split(df_processed, labels_all), start=1):
        print(f'\n[Fold {fold}/{KFOLDS}] Train size={len(train_idx)} Val size={len(val_idx)}')
        df_train_f = df_processed.iloc[train_idx].reset_index(drop=True)
        df_val_f = df_processed.iloc[val_idx].reset_index(drop=True)
        # Stats per fold
        feature_stats_f = {c: (df_train_f[c].mean(), df_train_f[c].std() if df_train_f[c].std() > 0 else 1.0) for c in advanced_feature_cols}

        train_loader_f = create_data_loader(df_train_f, tokenizer, MAX_LEN, BATCH_SIZE, advanced_feature_cols, feature_stats_f)
        val_loader_f = create_data_loader(df_val_f, tokenizer, MAX_LEN, BATCH_SIZE, advanced_feature_cols, feature_stats_f)

        # Fresh model per fold
        model_f = CoralModel(MODEL_NAME, num_classes=NUM_CLASSES, extra_feat_dim=len(advanced_feature_cols)).to(DEVICE)
        optimizer_f = torch.optim.AdamW(model_f.parameters(), lr=LEARNING_RATE_KF)
        total_steps_f = len(train_loader_f) * EPOCHS_KF
        from transformers import get_linear_schedule_with_warmup
        scheduler_f = get_linear_schedule_with_warmup(optimizer_f, num_warmup_steps=0, num_training_steps=total_steps_f)

        best_mae_f = float('inf')
        epochs_no_improve = 0

        for epoch in range(EPOCHS_KF):
            model_f.train(); train_losses=[]
            for batch in train_loader_f:
                ids = batch['input_ids'].to(DEVICE)
                mask = batch['attention_mask'].to(DEVICE)
                feats = batch['features'].to(DEVICE) if batch['features'] is not None else None
                labels = batch['labels'].to(DEVICE)
                probs = model_f(ids, mask, extra_feats=feats)
                loss = coral_loss_simple(probs, labels)
                train_losses.append(loss.item())
                loss.backward(); nn.utils.clip_grad_norm_(model_f.parameters(),1.0)
                optimizer_f.step(); scheduler_f.step(); optimizer_f.zero_grad()
            # Validation
            model_f.eval(); all_lab=[]; all_pred=[]; val_losses=[]; val_exact_prob_collect=[]
            with torch.no_grad():
                for batch in val_loader_f:
                    ids = batch['input_ids'].to(DEVICE)
                    mask = batch['attention_mask'].to(DEVICE)
                    feats = batch['features'].to(DEVICE) if batch['features'] is not None else None
                    labels = batch['labels'].to(DEVICE)
                    probs = model_f(ids, mask, extra_feats=feats)
                    loss = coral_loss_simple(probs, labels)
                    val_losses.append(loss.item())
                    p_exact = cumulative_to_exact(probs)
                    preds = torch.argmax(p_exact, dim=1) + 1
                    all_lab.extend(labels.cpu().numpy())
                    all_pred.extend(preds.cpu().numpy())
                    val_exact_prob_collect.append(p_exact.cpu())
            mae_f = mean_absolute_error(all_lab, all_pred)
            print(f'Fold {fold} Epoch {epoch+1}: TrainLoss={np.mean(train_losses):.4f} ValLoss={np.mean(val_losses):.4f} ValMAE={mae_f:.4f}')
            if mae_f < best_mae_f:
                best_mae_f = mae_f
                epochs_no_improve = 0
                torch.save(model_f.state_dict(), f'coral_fold{fold}.bin')
                # Keep the validation probabilities of the best epoch, not of the last one
                best_val_probs_f = torch.cat(val_exact_prob_collect, dim=0)
            else:
                epochs_no_improve += 1
                if epochs_no_improve >= PATIENCE_KF:
                    print('Early stopping fold', fold)
                    break
        print(f'Fold {fold} best MAE: {best_mae_f:.4f}')
        fold_results.append(best_mae_f)
        # Store the best epoch's exact class probabilities for this fold
        ensemble_exact_probs.append(best_val_probs_f)
    mean_mae = np.mean(fold_results)
    print('\nK-Fold MAE-k:', fold_results)
    print('Átlagos MAE (fold átlag):', mean_mae)
    # Ensemble across folds on their own validation splits is not directly comparable; we can just report mean.
    # (Optionally could retrain on full data with averaged thresholds.)
else:
    print('Nincs adat a K-fold futtatáshoz.')


[Fold 1/5] Train size=95 Val size=24
Fold 1 Epoch 1: TrainLoss=0.5994 ValLoss=0.5349 ValMAE=1.0417
Fold 1 Epoch 2: TrainLoss=0.5477 ValLoss=0.5370 ValMAE=0.9583
Fold 1 Epoch 3: TrainLoss=0.5174 ValLoss=0.5291 ValMAE=1.0000
Fold 1 Epoch 4: TrainLoss=0.5233 ValLoss=0.5330 ValMAE=1.0417
Early stopping fold 1
Fold 1 best MAE: 0.9583

[Fold 2/5] Train size=95 Val size=24
Fold 2 Epoch 1: TrainLoss=0.6358 ValLoss=0.5700 ValMAE=1.5417
Fold 2 Epoch 2: TrainLoss=0.5311 ValLoss=0.5091 ValMAE=1.0000
Fold 2 Epoch 3: TrainLoss=0.5441 ValLoss=0.5116 Va