## **4. Model fine-tuning**

### **Note!**
Because the data is confidential, the raw reports and expert annotations are not included in this repository.

### **Problem:**
Multi-label classification of long, unstructured Polish-language documents under severe class imbalance.

### **Encoder model:** sdadas/polish-longformer-base-4096 (https://huggingface.co/sdadas/polish-longformer-base-4096)

#### **Library imports**

In [1]:
import json
from pathlib import Path
from typing import Dict

import mlflow
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from databricks.sdk import WorkspaceClient
from datasets import load_from_disk, Dataset
from sklearn.metrics import f1_score, accuracy_score
from transformers import (
    LongformerForSequenceClassification,
    LongformerTokenizerFast,
    TrainingArguments,
    Trainer,
    LongformerConfig,
    default_data_collator,
    EvalPrediction,
    EarlyStoppingCallback
)

#### **Training parameters and constants.**

In [2]:
TOKENIZED_DATA_PATH = "data/data_tokenized"
MODEL_NAME = "sdadas/polish-longformer-base-4096"
MODEL_OUTPUT_BASE_PATH = "models"
MLFLOW_EXPERIMENT_NAME = "ESGAnalyzeModel-Training"

CRITERIA_NAMES = [
    'c1_transition_plan',
    'c2_risk_management',
    'c4_boundaries',
    'c6_historical_data',
    'c7_intensity_metrics',
    'c8_targets_credibility',
]

NUM_LABELS = len(CRITERIA_NAMES)

TRAINING_ARGS_DICT = {
    "learning_rate": 2.21e-05,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 10,
    "weight_decay": 0.159,
    "max_grad_norm": 1.0,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "fp16": True,
    "gradient_checkpointing": True,
    "eval_strategy": "steps",
    "eval_steps": 100,
    "save_strategy": "steps",
    "save_steps": 100,
    "logging_strategy": "steps",
    "logging_steps": 25,
    "metric_for_best_model": "f1_macro",
    "greater_is_better": True,
    "save_total_limit": 3,
    "load_best_model_at_end": True,
    "report_to": "mlflow",
    "seed": 42
}
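
The later cells sum `labels` over axis 0 and index per-label scores with `CRITERIA_NAMES[i]`, which suggests each example carries a multi-hot vector in the order of `CRITERIA_NAMES`. A minimal sketch of that encoding follows; the helper `to_multi_hot` is hypothetical and not part of the notebook, and the ordering is an assumption inferred from the code rather than confirmed by the data.

```python
from typing import List, Set

CRITERIA_NAMES = [
    "c1_transition_plan",
    "c2_risk_management",
    "c4_boundaries",
    "c6_historical_data",
    "c7_intensity_metrics",
    "c8_targets_credibility",
]


def to_multi_hot(positive_criteria: Set[str]) -> List[float]:
    """Hypothetical helper: encode the satisfied criteria as a multi-hot vector."""
    unknown = positive_criteria - set(CRITERIA_NAMES)
    if unknown:
        raise ValueError(f"Unknown criteria: {sorted(unknown)}")
    # Position i of the vector corresponds to CRITERIA_NAMES[i].
    return [1.0 if name in positive_criteria else 0.0 for name in CRITERIA_NAMES]


example = to_multi_hot({"c1_transition_plan", "c7_intensity_metrics"})
print(example)
```

Floats are used because the loss in this notebook casts the labels with `labels.float()` before comparing them to the logits.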

#### **Loading the tokenized dataset.**

In [3]:
tokenized_datasets = load_from_disk(TOKENIZED_DATA_PATH)
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['labels', 'doc_id', 'input_ids', 'attention_mask'],
        num_rows: 5070
    })
    validation: Dataset({
        features: ['labels', 'doc_id', 'input_ids', 'attention_mask'],
        num_rows: 1062
    })
    test: Dataset({
        features: ['labels', 'doc_id', 'input_ids', 'attention_mask'],
        num_rows: 929
    })
})
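
With 5,070 training rows, the values in `TRAINING_ARGS_DICT` translate into a rough step budget. The sketch below is back-of-the-envelope arithmetic only: it assumes a single device, and the exact number of optimizer steps that `Trainer` schedules may differ slightly between library versions.

```python
import math

# Values copied from TRAINING_ARGS_DICT and the dataset printout above.
train_rows = 5070
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 10
warmup_ratio = 0.1
eval_steps = 100

num_devices = 1  # assumption: single-GPU training

# Number of examples that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate optimizer steps per epoch and over the full schedule.
steps_per_epoch = train_rows // effective_batch_size
max_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(max_steps * warmup_ratio)

# How many evaluations fit in one epoch at eval_steps=100.
evals_per_epoch = steps_per_epoch / eval_steps

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch (approx.): {steps_per_epoch}")
print(f"scheduled steps (approx.): {max_steps}")
print(f"warmup steps (approx.): {warmup_steps}")
print(f"evaluations per epoch (approx.): {evals_per_epoch:.1f}")
```

Under these assumptions an evaluation every 100 steps means the model is checked several times per epoch, which is what lets early stopping react well before the 10 configured epochs are used up.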


#### **MLflow configuration.**

In [4]:
mlflow.set_tracking_uri("databricks")
w = WorkspaceClient()
user_email = w.current_user.me().user_name
experiment_path = f"/Users/{user_email}/{MLFLOW_EXPERIMENT_NAME}"
mlflow.set_experiment(experiment_path)

print("Experiment set successfully")

Experiment set successfully


## **Training section**

#### **Preparing the training and evaluation components.**

Function that computes base class weights from how often each label is positive

In [5]:
def calculate_class_weights(dataset: Dataset) -> torch.Tensor:
    labels = np.array(dataset['labels'])
    pos_counts = np.sum(labels, axis=0)
    total_samples = len(labels)
    weights = [total_samples / (2 * count + 1e-6) if count > 0 else 1.0 for count in pos_counts]
    return torch.tensor(weights, dtype=torch.float)
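
To make the weighting formula concrete, here is a dependency-free sketch of the same arithmetic on a toy label matrix. The function name `class_weights` and the toy data are illustrative only; the formula mirrors `calculate_class_weights` above.

```python
from typing import List


def class_weights(labels: List[List[int]]) -> List[float]:
    """Same formula as calculate_class_weights, written with plain lists."""
    total_samples = len(labels)
    # Column-wise sum gives the number of positive examples per label.
    pos_counts = [sum(column) for column in zip(*labels)]
    return [
        total_samples / (2 * count + 1e-6) if count > 0 else 1.0
        for count in pos_counts
    ]


# Toy data: 4 samples, 3 labels (frequent, rare, never positive).
toy_labels = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
]
weights = class_weights(toy_labels)
print([round(w, 3) for w in weights])
```

A label that is positive in more than half of the samples gets a weight below 1, a rare label gets a weight above 1, and a label with no positives falls back to 1.0.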

Focal Loss implementation

In [6]:
class FocalLoss(nn.Module):
    def __init__(self, alpha, gamma, pos_weight=None):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.pos_weight = pos_weight
    
    def forward(self, inputs, targets):
        bce_loss = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, reduction='none', pos_weight=self.pos_weight
        )
        pt = torch.exp(-bce_loss)
        focal_loss = self.alpha * (1 - pt)**self.gamma * bce_loss
        return focal_loss.mean()
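
The effect of the `(1 - pt)**gamma` factor is easiest to see on single logits. The following is a scalar, pure-Python sketch of the same computation as `FocalLoss.forward`; the function names are illustrative, and `alpha=0.5`, `gamma=2.0` are the defaults used by `ESGTrainer` below.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bce_with_logits(logit: float, target: float, pos_weight: float = 1.0) -> float:
    """Scalar binary cross-entropy on a logit, with an optional positive-class weight."""
    # -log(sigmoid(x)) == softplus(-x) and -log(1 - sigmoid(x)) == softplus(x)
    return pos_weight * target * softplus(-logit) + (1.0 - target) * softplus(logit)


def focal_term(logit: float, target: float, alpha: float = 0.5,
               gamma: float = 2.0, pos_weight: float = 1.0) -> float:
    """Per-element focal loss, mirroring FocalLoss.forward before the mean."""
    bce = bce_with_logits(logit, target, pos_weight)
    pt = math.exp(-bce)  # probability of the true class when pos_weight == 1
    return alpha * (1.0 - pt) ** gamma * bce


# A confidently correct positive versus a misclassified positive.
easy = focal_term(logit=4.0, target=1.0)
hard = focal_term(logit=-1.0, target=1.0)
print(f"easy example: {easy:.6f}")
print(f"hard example: {hard:.6f}")
```

One caveat worth keeping in mind: once `pos_weight` differs from 1, `exp(-bce_loss)` is no longer exactly the predicted probability of the true class, so the modulating factor acts on a re-scaled quantity for positive targets.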

Custom Trainer subclass that applies FocalLoss with class weighting

In [7]:
class ESGTrainer(Trainer):
    def __init__(self, *args, focal_loss_alpha=0.5, focal_loss_gamma=2.0, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights.to(self.args.device) if class_weights is not None else None
        self.loss_fct = FocalLoss(alpha=focal_loss_alpha, gamma=focal_loss_gamma, pos_weight=self.class_weights)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss = self.loss_fct(logits, labels.float())
        return (loss, outputs) if return_outputs else loss

Function that computes the evaluation metrics

In [8]:
def compute_metrics(p: EvalPrediction) -> Dict[str, float]:
    logits, labels = p
    preds = (1 / (1 + np.exp(-logits)) > 0.5).astype(int)
    return {
        'f1_macro': f1_score(labels, preds, average='macro', zero_division=0),
        'exact_match_ratio': accuracy_score(labels, preds)
    }
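
The two metrics behave quite differently on multi-label output: macro F1 averages per-label scores, while the exact match ratio only counts rows where every label is right. A small pure-Python sketch on toy predictions (the helper names and the data are illustrative, not taken from the notebook):

```python
from typing import List, Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one label; returns 0.0 when undefined (like zero_division=0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = [f1_binary(t, p) for t, p in zip(zip(*y_true), zip(*y_pred))]
    return sum(scores) / len(scores)


def exact_match_ratio(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Share of rows whose whole label vector is predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 0], [0, 0], [1, 1]]

print(f"macro F1: {macro_f1(y_true, y_pred):.4f}")
print(f"exact match ratio: {exact_match_ratio(y_true, y_pred):.4f}")
```

A single wrong label in a row zeroes that row for the exact match ratio, which is why this metric is typically far below macro F1 when there are several labels per example.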

Function that tunes the per-label classification thresholds on the validation set

In [9]:
def optimize_thresholds(trainer: Trainer, eval_dataset: Dataset) -> np.ndarray:
    preds = trainer.predict(eval_dataset)
    logits, y_true = preds.predictions, preds.label_ids
    y_probs = 1 / (1 + np.exp(-logits))
    
    optimal_thresholds = []
    for i in range(y_true.shape[1]):
        best_f1, best_thresh = 0, 0.5
        for thresh in np.arange(0.1, 0.91, 0.01):
            f1 = f1_score(y_true[:, i], (y_probs[:, i] >= thresh).astype(int), zero_division=0)
            if f1 > best_f1:
                best_f1, best_thresh = f1, thresh
        optimal_thresholds.append(best_thresh)
    return np.array(optimal_thresholds)
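
The per-label sweep can be illustrated on a handful of toy probabilities. This is a simplified, dependency-free sketch of the same idea for a single label; the helper names and data are illustrative. It builds the grid from integers (`i / 100`) rather than a floating-point `arange`, which is a small deviation from the function above made to avoid accumulated rounding in the threshold values.

```python
from typing import List, Tuple


def f1_binary(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for one label; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true: List[int], y_prob: List[float]) -> Tuple[float, float]:
    """Sweep thresholds 0.10..0.90 and keep the first one with the highest F1."""
    best_f1, best_thresh = 0.0, 0.5
    for i in range(10, 91):
        thresh = i / 100
        preds = [1 if p >= thresh else 0 for p in y_prob]
        f1 = f1_binary(y_true, preds)
        if f1 > best_f1:  # strict comparison keeps the lowest winning threshold
            best_f1, best_thresh = f1, thresh
    return best_thresh, best_f1


y_true = [0, 1, 1, 1]
y_prob = [0.20, 0.35, 0.40, 0.80]
thresh, f1 = best_threshold(y_true, y_prob)
print(f"best threshold: {thresh:.2f}, F1: {f1:.3f}")
```

In this toy case the default cut-off of 0.5 would miss two of the three positives, while any threshold just above 0.20 separates the classes perfectly. Because the thresholds are tuned on the validation set, the scores they produce there are optimistic; the notebook reports final results on the test split.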

Function for document-level evaluation (chunk probabilities aggregated per document)

In [10]:
def evaluate_document_level(trainer: Trainer, dataset: Dataset, thresholds: np.ndarray) -> Dict[str, float]:
    preds = trainer.predict(dataset)
    chunk_probs = 1 / (1 + np.exp(-preds.predictions))
    
    df = pd.DataFrame({'doc_id': dataset['doc_id'], 'probs': list(chunk_probs), 'labels': list(preds.label_ids)})
    doc_results_df = df.groupby('doc_id').agg({
        'probs': lambda x: np.percentile(np.array(x.tolist()), 75, axis=0), 'labels': 'first'
    }).reset_index()

    doc_probs = np.array(doc_results_df['probs'].tolist())
    doc_labels = np.array(doc_results_df['labels'].tolist())
    doc_preds = (doc_probs >= thresholds).astype(int)
    
    results = {'doc_f1_macro': f1_score(doc_labels, doc_preds, average='macro', zero_division=0),
                'num_documents': len(doc_results_df)}
    f1_per_label = f1_score(doc_labels, doc_preds, average=None, zero_division=0)
    for i, f1 in enumerate(f1_per_label): results[f'doc_f1_{CRITERIA_NAMES[i]}'] = f1
    return results
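
The aggregation step takes the 75th percentile of the chunk probabilities for each label, which sits between the mean and the maximum. Below is a pure-Python sketch for a single document with four chunks and two labels. `percentile_linear` is an illustrative helper that follows the linear-interpolation rule NumPy uses by default, and the probabilities and thresholds are made up.

```python
import math
from typing import List


def percentile_linear(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Toy chunk-level probabilities for one document: rows are chunks, columns are labels.
chunk_probs = [
    [0.1, 0.7],
    [0.2, 0.6],
    [0.9, 0.8],
    [0.3, 0.1],
]
thresholds = [0.5, 0.5]  # hypothetical per-label thresholds

# Aggregate each label column across the document's chunks.
doc_probs = [percentile_linear(list(col), 75) for col in zip(*chunk_probs)]
doc_preds = [int(p >= t) for p, t in zip(doc_probs, thresholds)]

print([round(p, 3) for p in doc_probs])
print(doc_preds)
```

For the first label, one confident chunk (0.9) is not enough to flip the document, whereas taking the maximum would have; for the second label, one weak chunk (0.1) does not pull the document below the threshold, as a minimum would.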

#### **Starting the training run.**

In [None]:
try:
    with mlflow.start_run() as run:
        run_id = run.info.run_id
        
        model_config = LongformerConfig.from_pretrained(
            MODEL_NAME, num_labels=NUM_LABELS, problem_type="multi_label_classification"
        )
        model = LongformerForSequenceClassification.from_pretrained(MODEL_NAME, config=model_config)
        tokenizer = LongformerTokenizerFast.from_pretrained(MODEL_NAME)

        output_dir = Path(MODEL_OUTPUT_BASE_PATH) / f"run-{run_id}"
        
        training_args = TrainingArguments(output_dir=str(output_dir), **TRAINING_ARGS_DICT)
        
        mlflow.log_params(training_args.to_dict())
        mlflow.log_param("model_name", MODEL_NAME)

        class_weights = calculate_class_weights(tokenized_datasets['train'])

        trainer = ESGTrainer(
            model=model, args=training_args,
            train_dataset=tokenized_datasets["train"], eval_dataset=tokenized_datasets["validation"],
            data_collator=default_data_collator, compute_metrics=compute_metrics,
            tokenizer=tokenizer, class_weights=class_weights,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
        )
        
        print("Starting model training...")
        trainer.train()
        print("Training finished.")

        optimal_thresholds = optimize_thresholds(trainer, tokenized_datasets['validation'])
        final_doc_results = evaluate_document_level(trainer, tokenized_datasets['test'], optimal_thresholds)

        mlflow.log_metrics({f"optimal_threshold_{k}": v for k, v in zip(CRITERIA_NAMES, optimal_thresholds)})
        mlflow.log_metrics(final_doc_results)
        mlflow.log_param("aggregation_strategy", "75th percentile")
        mlflow.log_param("loss function", "FocalLoss")

        final_model_dir = trainer.args.output_dir
        trainer.save_model(final_model_dir)
        tokenizer.save_pretrained(final_model_dir)

        with open(Path(final_model_dir) / "optimal_thresholds.json", "w") as f:
            json.dump({name: float(thresh) for name, thresh in zip(CRITERIA_NAMES, optimal_thresholds)}, f, indent=4)
        with open(Path(final_model_dir) / "final_metrics.json", "w") as f:
            json.dump(final_doc_results, f, indent=4)

        mlflow.log_artifacts(final_model_dir, artifact_path="model")

except Exception as e:
    print(f"\n❌ An unexpected error occurred: {e}")
    raise
finally:
    if mlflow.active_run():
        mlflow.end_run()

Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at sdadas/polish-longformer-base-4096 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)
Initializing global attention on CLS token...


Starting model training...


| Step | Training Loss | Validation Loss | F1 Macro | Exact Match Ratio |
|------|---------------|-----------------|----------|-------------------|
| 100 | 0.0916 | 0.090634 | 0.182103 | 0.016008 |
| 200 | 0.0912 | 0.090735 | 0.513753 | 0.009416 |
| 300 | 0.0809 | 0.09286 | 0.611876 | 0.035782 |
| 400 | 0.0684 | 0.088601 | 0.659493 | 0.141243 |
| 500 | 0.0552 | 0.090005 | 0.658953 | 0.144068 |
| 600 | 0.0471 | 0.091131 | 0.690315 | 0.143126 |
| 700 | 0.0391 | 0.083919 | 0.728494 | 0.23258 |
| 800 | 0.0292 | 0.086807 | 0.751934 | 0.285311 |
| 900 | 0.0318 | 0.096516 | 0.742708 | 0.269303 |
| 1000 | 0.0218 | 0.098177 | 0.743463 | 0.264595 |
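
The F1 Macro column can be replayed against a patience counter to see how close the run is to stopping at each evaluation. This is a simplified sketch of patience-based early stopping, not the library's actual `EarlyStoppingCallback` code; it uses `early_stopping_patience=3` from the training cell and the logged values above.

```python
# (step, f1_macro) pairs copied from the evaluation log above.
history = [
    (100, 0.182103),
    (200, 0.513753),
    (300, 0.611876),
    (400, 0.659493),
    (500, 0.658953),
    (600, 0.690315),
    (700, 0.728494),
    (800, 0.751934),
    (900, 0.742708),
    (1000, 0.743463),
]
patience = 3

best_step, best_f1 = None, float("-inf")
counter = 0          # consecutive evaluations without improvement
stopped_at = None

for step, f1 in history:
    if f1 > best_f1:
        best_step, best_f1, counter = step, f1, 0
    else:
        counter += 1
    if counter >= patience:
        stopped_at = step
        break

print(f"best step: {best_step} (F1 macro {best_f1:.6f})")
print(f"evaluations without improvement after the last row: {counter}")
print(f"stopped within the logged rows: {stopped_at is not None}")
```

On the logged rows the best score is at step 800 and two evaluations have passed without improvement, so one more non-improving evaluation would exhaust the patience. With `load_best_model_at_end` enabled, the weights restored at the end are those of the best checkpoint rather than the last one.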


Training finished.


🏃 View run respected-cub-733 at: https://dbc-26ad907d-404c.cloud.databricks.com/ml/experiments/1527560425643185/runs/21b6e434b35f4028a784f1c0711662a1
🧪 View experiment at: https://dbc-26ad907d-404c.cloud.databricks.com/ml/experiments/1527560425643185
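
The training cell writes `optimal_thresholds.json` next to the model as a mapping from criterion name to threshold. A minimal sketch of how that file could be read back and applied to document-level probabilities at inference time follows. The helpers `load_thresholds` and `apply_thresholds` are hypothetical, and the threshold and probability values are made up for the demo; they are not results from this run.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

CRITERIA_NAMES = [
    "c1_transition_plan",
    "c2_risk_management",
    "c4_boundaries",
    "c6_historical_data",
    "c7_intensity_metrics",
    "c8_targets_credibility",
]


def load_thresholds(path: Path) -> List[float]:
    """Hypothetical helper: read thresholds and return them in CRITERIA_NAMES order."""
    with open(path) as f:
        by_name = json.load(f)
    return [float(by_name[name]) for name in CRITERIA_NAMES]


def apply_thresholds(doc_probs: List[float], thresholds: List[float]) -> Dict[str, bool]:
    """Hypothetical helper: turn per-label probabilities into named decisions."""
    return {
        name: prob >= thresh
        for name, prob, thresh in zip(CRITERIA_NAMES, doc_probs, thresholds)
    }


# Made-up values, used only to exercise the helpers.
demo_thresholds = dict(zip(CRITERIA_NAMES, [0.30, 0.50, 0.45, 0.60, 0.25, 0.70]))
demo_doc_probs = [0.35, 0.40, 0.50, 0.10, 0.20, 0.90]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "optimal_thresholds.json"
    path.write_text(json.dumps(demo_thresholds, indent=4))
    decisions = apply_thresholds(demo_doc_probs, load_thresholds(path))

for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Looking thresholds up by name rather than by position keeps the mapping stable even if the JSON key order changes.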
