# Classifying Texts with Transformers

Yes, I can help. To classify Twitter messages with a transformer model, follow these steps:

1. Install the required libraries:

```bash
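# pandas and numpy are imported in step 2; recent transformers releases need accelerate for Trainer
pip install torch transformers accelerate tqdm scikit-learn pandas numpy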
```

2. Import the required libraries:

```python
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import pandas as pd
import numpy as np
from tqdm import tqdm
```

3. Implement a custom Dataset class:

```python
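class GermEvalDataset(Dataset):
    """Tweets with binary OTHER/OFFENSE labels, read from a header-less TSV file."""

    # Map the string labels to the integer class ids the model expects.
    LABEL2ID = {"OTHER": 0, "OFFENSE": 1}

    def __init__(self, tokenizer, data_path, max_len):
        self.tokenizer = tokenizer
        # Assumes three tab-separated columns without a header row: id, text, label.
        self.data = pd.read_csv(data_path, sep='\t', names=['id', 'text', 'label'])
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        text = str(self.data.loc[index, 'text'])
        label = self.data.loc[index, 'label']

        # Pad short tweets and truncate long ones so every sample has max_len tokens
        # and can be stacked into a batch. This replaces the deprecated
        # encode_plus(..., pad_to_max_length=True) call, which did not truncate.
        inputs = self.tokenizer(
            text,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # return_tensors='pt' adds a batch dimension; take the single row.
        input_ids = inputs['input_ids'][0]
        attention_mask = inputs['attention_mask'][0]

        if label not in self.LABEL2ID:
            raise ValueError(f"Invalid label: {label!r}")
        label_tensor = torch.tensor(self.LABEL2ID[label])

        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": label_tensor}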
```
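Because `__getitem__` raises `ValueError` only when an offending row is actually fetched, it can help to scan the file once before training. The following is a minimal sketch using only the standard library. The helper name `count_labels` is hypothetical, and the column layout (id, text, label, tab-separated, no header) is an assumption taken from the `names=[...]` argument above.

```python
import csv
from collections import Counter

VALID_LABELS = {"OTHER", "OFFENSE"}


def count_labels(tsv_path):
    """Return (Counter of valid labels, list of (row_number, label) for bad rows)."""
    counts = Counter()
    invalid = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside tweets from being read as field quoting.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row_number, row in enumerate(reader, start=1):
            # Assumed layout: the label is the third column; shorter rows are invalid.
            label = row[2] if len(row) >= 3 else None
            if label in VALID_LABELS:
                counts[label] += 1
            else:
                invalid.append((row_number, label))
    return counts, invalid
```

Calling `count_labels("path_to_train_data.tsv")` returns the class counts, which also show how imbalanced the two classes are, plus the row numbers of any lines that would later fail in the Dataset.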

4. Load the pre-trained model and the tokenizer:

```python
model_name = "deepset/gbert-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```

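5. Create the datasets and DataLoaders for training and validation. The `Trainer` used in step 8 builds its own loaders from the datasets, so the explicit loaders are only needed if you write a custom training loop: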

```python
MAX_LEN = 128
BATCH_SIZE = 16
train_data_path = "path_to_train_data.tsv"
val_data_path = "path_to_val_data.tsv"

train_dataset = GermEvalDataset(tokenizer, train_data_path, MAX_LEN)
val_dataset = GermEvalDataset(tokenizer, val_data_path, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
```
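`MAX_LEN` fixes every sample to the same number of tokens, which is what allows samples to be stacked into batches of `BATCH_SIZE`. The sketch below illustrates the idea on plain Python lists. It is a simplification, not the tokenizer's actual implementation: the function name `pad_or_truncate`, the pad ID `0`, and the token IDs are all made up for illustration, and a real tokenizer additionally keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]       # truncation: drop tokens beyond max_len
    n_pad = max_len - len(kept)      # number of padding positions for short inputs
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token IDs: one input shorter and one longer than max_len.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_len=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_len=6)

print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4, 5] [1, 1, 1, 1, 1, 1]
```

The attention mask tells the model which positions to ignore, so padding does not influence the prediction.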

6. Create the training arguments:

```python
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=3,
    logging_steps=100,
    save_steps=1000,
    eval_strategy="steps",  # older transformers releases call this evaluation_strategy
    seed=42,
)
```
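The step-based settings only make sense relative to the total number of optimizer steps, which depends on the dataset size. The sketch below works through that arithmetic. The dataset size of 5,000 tweets is a hypothetical figure, the function name `training_schedule` is made up, and the calculation assumes a single device without gradient accumulation. It also assumes evaluation runs every 100 steps, since an unset `eval_steps` falls back to `logging_steps`.

```python
import math


def training_schedule(num_samples, batch_size, epochs, eval_every, save_every):
    """Estimate how often step-based evaluation and checkpointing trigger."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # last partial batch counts
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "interval_evaluations": total_steps // eval_every,
        "interval_checkpoints": total_steps // save_every,
    }


# Hypothetical dataset size; the other values mirror the arguments above.
schedule = training_schedule(
    num_samples=5000, batch_size=16, epochs=3, eval_every=100, save_every=1000
)
print(schedule)
# {'steps_per_epoch': 313, 'total_steps': 939, 'interval_evaluations': 9, 'interval_checkpoints': 0}
```

Under these assumptions the run ends before step 1000, so `save_steps=1000` would never trigger an interval checkpoint. For a dataset of this size, a smaller `save_steps` may be the better choice.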

7. Define the metric function:

```python
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    targets = p.label_ids
    precision = precision_score(targets, preds)
    recall = recall_score(targets, preds)
    f1 = f1_score(targets, preds)
    acc = accuracy_score(targets, preds)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": acc}
```
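With their default settings, the scikit-learn precision, recall, and F1 functions treat label `1` as the positive class, so the scores above describe how well the model finds `OFFENSE` tweets. To make the numbers easier to interpret, here is a plain-Python sketch of the same calculations on a small, made-up set of labels. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(targets, preds, positive=1):
    """Compute precision, recall, F1, and accuracy for one positive class."""
    pairs = list(zip(targets, preds))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Fall back to 0.0 when a denominator is zero (e.g. no positive predictions).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 1 = OFFENSE, 0 = OTHER.
targets = [1, 0, 1, 1, 0, 0, 0, 1]
preds = [1, 0, 0, 1, 1, 0, 0, 0]
metrics = binary_metrics(targets, preds)
print(metrics)
```

In this example there are 2 true positives, 1 false positive, and 2 false negatives, which gives a precision of 2/3, a recall of 1/2, an F1 of 4/7, and an accuracy of 5/8.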

8. Create a Trainer and train the model:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
```

9. Optional: evaluate the model after training:

```python
trainer.evaluate()
```

The trained model can now be used to classify Twitter messages.
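At inference time the model returns one row of raw logits per tweet, which still has to be mapped back to the label names used in the Dataset class. A minimal sketch of that last step, assuming the same label order (0 = `OTHER`, 1 = `OFFENSE`): the function name `logits_to_prediction` is hypothetical and the logit values are made up. In practice the logits would come from the model output, for example the `logits` attribute returned by a sequence classification model.

```python
import math

ID2LABEL = {0: "OTHER", 1: "OFFENSE"}


def logits_to_prediction(logits):
    """Convert one row of raw logits into (label, probability) via a softmax."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    return ID2LABEL[best], probs[best]


# Made-up logits for a single tweet.
label, prob = logits_to_prediction([-1.2, 2.3])
print(label, round(prob, 3))  # OFFENSE 0.971
```

The probability is the softmax score of the winning class and can be compared against a threshold if only confident `OFFENSE` predictions should be acted on.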