# Classifying Texts with Transformers

*Transformers* have been on everyone's lips since ChatGPT. The "little siblings" of GPT are very good at classifying texts and solving other NLP tasks.
Incidentally, most of the following walkthrough was written by ChatGPT; I only changed it in a few places (so that you still have something to do yourself).

Yes, I can help you. To classify Twitter messages with a transformer model, follow these steps:

1. Install the required libraries:


In [None]:
!pip install transformers tqdm scikit-learn --upgrade

2. Import the required libraries:


In [4]:
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import pandas as pd
import numpy as np
from tqdm import tqdm

3. Implement a custom Dataset class:

**Task 1: Add code to clean the tweets.**

In [5]:
class GermEvalDataset(Dataset):
    def __init__(self, tokenizer, data_path, max_len):
        self.tokenizer = tokenizer
        self.data = pd.read_csv(data_path, sep='\t', header=None, names=['text', 'label', 'fine'])
        
        ### YOUR CODE HERE
        #   Clean Tweets
        ###
        
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        text = self.data.loc[index, 'text']
        label = self.data.loc[index, 'label']
        
        inputs = self.tokenizer.encode_plus(
            text,
            max_length=self.max_len,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )
        
        input_ids = inputs['input_ids'][0]
        attention_mask = inputs['attention_mask'][0]
        
        if label == "OTHER":
            label_tensor = torch.tensor(0)
        elif label == "OFFENSE":
            label_tensor = torch.tensor(1)
        else:
            raise ValueError(f"Invalid label: {label} for {text} at {index}")
            
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": label_tensor}
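
One possible starting point for Task 1 is a small regex-based cleaning function. The sketch below is an assumption about what "cleaning" should mean here, not a reference solution: it replaces URLs and user mentions with placeholder tokens, turns the `|LBR|` line-break marker into a space (assuming the corpus uses that marker), and collapses whitespace. The name `clean_tweet` and the placeholder tokens are hypothetical.

```python
import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Normalize a raw tweet (hypothetical helper, not part of the original notebook)."""
    text = text.replace("|LBR|", " ")      # assumed line-break marker in the corpus
    text = URL_RE.sub("[URL]", text)       # replace links with a placeholder
    text = MENTION_RE.sub("@USER", text)   # anonymize user mentions
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_tweet("@Max_Muster  Schau mal: https://example.org/x |LBR| Danke!"))
```

Inside `__init__`, such a function could be applied to the whole column, for example with `self.data['text'] = self.data['text'].apply(clean_tweet)`. Whether placeholders help or hurt depends on the model, so it is worth comparing results with and without cleaning.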

4. Load the pre-trained model and the tokenizer:

**ChatGPT suggests the model `"deepset/gbert-large"` here, a good choice for German-language tweets.
Look up a few alternatives in the [Hugging Face Model Hub](https://huggingface.co/models) and compare the results.**

In [6]:
model_name = "deepset/gbert-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-large and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


5. Create the DataLoaders for training and validation:

In [8]:
MAX_LEN = 512
BATCH_SIZE = 16
train_data_path = "../data/GermEval-2018/germeval2018.training.txt"
val_data_path = "../data/GermEval-2018/germeval2018.test.txt"

train_dataset = GermEvalDataset(tokenizer, train_data_path, MAX_LEN)
val_dataset = GermEvalDataset(tokenizer, val_data_path, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
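
With `padding='max_length'` and `MAX_LEN = 512`, every tweet is padded to 512 tokens, although tweets are usually much shorter. The following sketch estimates how much of the input would then consist of padding. The token lengths are made-up illustrative values, not measurements from the GermEval data, and `padding_fraction` is a hypothetical helper.

```python
def padding_fraction(token_lengths, max_len):
    """Share of positions that are padding when every sequence is padded to max_len."""
    real = sum(min(n, max_len) for n in token_lengths)  # longer inputs are truncated
    total = max_len * len(token_lengths)
    return 1 - real / total


lengths = [18, 25, 31, 40, 52, 64]  # illustrative token counts, not real data
for max_len in (512, 128, 64):
    print(max_len, round(padding_fraction(lengths, max_len), 3))
```

A smaller `MAX_LEN` reduces memory use and training time, at the price of truncating any tweet that is longer than the limit.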

6. Create the training arguments:

**ChatGPT's defaults are fine, but see what happens when you tweak the parameters `BATCH_SIZE` and `learning_rate`.**

In [9]:
training_args = TrainingArguments(
    output_dir="./results",
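    # The string "none" disables W&B and other loggers; in transformers 4.x,
    # passing None falls back to all installed integrations.
    report_to="none",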
    learning_rate=1e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=5,
    logging_steps=100,
    save_steps=1000,
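    # Evaluates every `logging_steps` steps because `eval_steps` is not set.
    # Newer transformers releases call this argument `eval_strategy`.
    evaluation_strategy="steps",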
    seed=42,
)
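
To see how `BATCH_SIZE` interacts with the number of optimizer steps, the arithmetic can be sketched as follows. The training-set size of 5,009 tweets is an assumption about the GermEval 2018 training file, and the sketch assumes a single device without gradient accumulation; `optimizer_steps` is a hypothetical helper. Under these assumptions, the value for a batch size of 16 matches the `global_step=1570` reported by the training run in step 8.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Total optimizer steps for a run (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be smaller
    return steps_per_epoch * epochs


for batch_size in (8, 16, 32):
    print(batch_size, optimizer_steps(5009, batch_size, 5))
```

Halving the batch size roughly doubles the number of weight updates per epoch, which is one reason why `BATCH_SIZE` and `learning_rate` are usually tuned together.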

7. Define the metric function:

**Here I cheated and rebuilt the metrics from GermEval 2018.**

In [10]:
def compute_metrics(p):
    # Some models return a tuple; the logits are its first element.
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    labels = p.label_ids
    accuracy = (preds == labels).astype(np.float32).mean().item()
    metrics = {"accuracy": accuracy}
    for val, key in enumerate(['OTHER', 'OFFENSE']):
        # Count per class explicitly, so the logic does not rely on there being exactly two classes.
        tp = ((preds == val) & (labels == val)).sum().item()
        fp = ((preds == val) & (labels != val)).sum().item()
        fn = ((preds != val) & (labels == val)).sum().item()

        # Guard against division by zero if a class is never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[f"precision_{key}"] = precision
        metrics[f"recall_{key}"] = recall
        metrics[f"f1_{key}"] = f1

    # Macro average: unweighted mean over both classes.
    for name in ("precision", "recall", "f1"):
        metrics[f"{name}_AVERAGE"] = 0.5 * (metrics[f"{name}_OTHER"] + metrics[f"{name}_OFFENSE"])
    return metrics
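
To make the metric definitions concrete, here is a small worked example in plain Python (no NumPy), using five made-up labels and predictions. `per_class_scores` is a hypothetical helper that mirrors the per-class logic of `compute_metrics`; it is a sketch, not the official GermEval evaluation script.

```python
def per_class_scores(labels, preds, cls):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if p == cls and y == cls)
    fp = sum(1 for y, p in zip(labels, preds) if p == cls and y != cls)
    fn = sum(1 for y, p in zip(labels, preds) if p != cls and y == cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = [0, 0, 0, 1, 1]  # 0 = OTHER, 1 = OFFENSE (toy data)
preds = [0, 0, 1, 1, 0]

f1_other = per_class_scores(labels, preds, 0)[2]
f1_offense = per_class_scores(labels, preds, 1)[2]
macro_f1 = 0.5 * (f1_other + f1_offense)
print(f1_other, f1_offense, macro_f1)
```

For OTHER there are two true positives, one false positive and one false negative, so precision, recall and F1 are all 2/3. For OFFENSE each count is one, giving 1/2. The macro-averaged F1 is the mean of the two, 7/12. Because both classes count equally, the rarer OFFENSE class has as much influence on the average as the frequent OTHER class.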

8. Create a Trainer and train the model:


In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()


| Step | Training Loss | Validation Loss | Accuracy | Precision Other | Recall Other | F1 Other | Precision Offense | Recall Offense | F1 Offense | Precision Average | Recall Average | F1 Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.4876 | 0.478014 | 0.81548 | 0.811611 | 0.939057 | 0.870695 | 0.828105 | 0.573913 | 0.677966 | 0.819858 | 0.756485 | 0.774331 |
| 200 | 0.4006 | 0.477481 | 0.811948 | 0.797192 | 0.959964 | 0.871039 | 0.869754 | 0.522609 | 0.652906 | 0.833473 | 0.741287 | 0.761973 |
| 300 | 0.3558 | 0.420322 | 0.829606 | 0.829972 | 0.933719 | 0.878794 | 0.828539 | 0.626087 | 0.713224 | 0.829255 | 0.779903 | 0.796009 |
| 400 | 0.2448 | 0.562521 | 0.829606 | 0.819365 | 0.952402 | 0.880889 | 0.863694 | 0.589565 | 0.700775 | 0.841529 | 0.770984 | 0.790832 |
| 500 | 0.2788 | 0.444283 | 0.834903 | 0.859395 | 0.897242 | 0.877911 | 0.780209 | 0.713043 | 0.745116 | 0.819802 | 0.805143 | 0.811513 |
| 600 | 0.2123 | 0.536638 | 0.829311 | 0.830166 | 0.932829 | 0.878509 | 0.826835 | 0.626957 | 0.713155 | 0.828501 | 0.779893 | 0.795832 |
| 700 | 0.1365 | 0.697891 | 0.837551 | 0.858714 | 0.903025 | 0.880312 | 0.789168 | 0.709565 | 0.747253 | 0.823941 | 0.806295 | 0.813782 |
| 800 | 0.1083 | 0.8199 | 0.839612 | 0.852296 | 0.91637 | 0.883173 | 0.808359 | 0.689565 | 0.744252 | 0.830328 | 0.802968 | 0.813712 |
| 900 | 0.1545 | 0.861723 | 0.819305 | 0.806912 | 0.955516 | 0.874949 | 0.86413 | 0.553043 | 0.674443 | 0.835521 | 0.75428 | 0.774696 |
| 1000 | 0.0841 | 1.071701 | 0.82166 | 0.809578 | 0.955071 | 0.876327 | 0.864611 | 0.56087 | 0.68038 | 0.837094 | 0.75797 | 0.778353 |


TrainOutput(global_step=1570, training_loss=0.17342701322713475, metrics={'train_runtime': 2858.4847, 'train_samples_per_second': 8.762, 'train_steps_per_second': 0.549, 'total_flos': 2.334022099454976e+16, 'train_loss': 0.17342701322713475, 'epoch': 5.0})
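
In the training log, the training loss falls steadily while the validation loss tends to rise after step 300, which is a typical sign of overfitting. The sketch below picks the best evaluation step from the logged values under two different criteria. The numbers are copied from the log above; only the validation loss and the F1 average are used.

```python
# (step, validation loss, macro-averaged F1), copied from the training log
log = [
    (100, 0.478014, 0.774331),
    (200, 0.477481, 0.761973),
    (300, 0.420322, 0.796009),
    (400, 0.562521, 0.790832),
    (500, 0.444283, 0.811513),
    (600, 0.536638, 0.795832),
    (700, 0.697891, 0.813782),
    (800, 0.819900, 0.813712),
    (900, 0.861723, 0.774696),
    (1000, 1.071701, 0.778353),
]

best_by_loss = min(log, key=lambda row: row[1])
best_by_f1 = max(log, key=lambda row: row[2])
print("lowest validation loss:", best_by_loss)
print("highest F1 average:", best_by_f1)
```

The two criteria disagree (step 300 versus step 700), so the choice of selection metric matters. `TrainingArguments` provides the options `load_best_model_at_end` and `metric_for_best_model` for automating this; their exact requirements are best checked in the documentation of the installed transformers version.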

In [12]:
# The wandb CLI has no `logout` command (the original call failed with "No such command 'logout'").
# `wandb disabled` switches W&B logging off instead.
!wandb disabled


9. Optional: Evaluate the model after training:

```python
trainer.evaluate()
```

The trained model can now be used to classify Twitter messages.
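
At inference time, the model returns one pair of logits per tweet. Below is a minimal sketch of the post-processing, assuming the label order used in the dataset class (index 0 = OTHER, index 1 = OFFENSE). The logit values are made up, and `logits_to_prediction` is a hypothetical helper written in plain Python so that the softmax step is visible.

```python
import math

LABELS = ["OTHER", "OFFENSE"]  # same order as the label encoding in GermEvalDataset


def logits_to_prediction(logits):
    """Convert the raw logits for one tweet into (label, probability) via a softmax."""
    shift = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


print(logits_to_prediction([2.0, -1.0]))   # made-up logits
print(logits_to_prediction([-0.5, 1.5]))   # made-up logits
```

In practice the logits would come from the fine-tuned model applied to tokenized (and identically cleaned) tweets; the probability can serve as a rough confidence score, for example to flag borderline cases for manual review.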