# Classifying Text with Transformers

To classify Twitter messages with a transformer model, follow these steps:

1. Install the required libraries:

```bash
pip install torch transformers tqdm scikit-learn pandas numpy
```

2. Import the required libraries:

```python
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import pandas as pd
import numpy as np
from tqdm import tqdm
```

3. Implement a custom Dataset class:

```python
class GermEvalDataset(Dataset):
    """Tab-separated file with three columns: text, coarse label, fine label."""

    LABEL_TO_ID = {"OTHER": 0, "OFFENSE": 1}

    def __init__(self, tokenizer, data_path, max_len):
        self.tokenizer = tokenizer
        # header=None: the first row is data, not column names.
        self.data = pd.read_csv(data_path, sep='\t', header=None, names=['text', 'label', 'fine'])
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        text = self.data.loc[index, 'text']
        label = self.data.loc[index, 'label']

        # Pad and truncate every example to exactly max_len tokens so that
        # examples of different lengths can be stacked into one batch.
        inputs = self.tokenizer(
            text,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # return_tensors='pt' adds a batch dimension of size 1; drop it.
        input_ids = inputs['input_ids'][0]
        attention_mask = inputs['attention_mask'][0]

        if label not in self.LABEL_TO_ID:
            raise ValueError(f"Invalid label: {label} for {text} at {index}")
        label_tensor = torch.tensor(self.LABEL_TO_ID[label])

        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": label_tensor}
```
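
Batching stacks the per-example tensors, which only works when every example has the same length. As a rough illustration of what fixed-length padding and truncation produce, here is a plain-Python sketch. The helper `pad_and_mask`, the token ids, and `pad_id=0` are illustrative assumptions; a real BERT tokenizer additionally keeps its special tokens in place when truncating.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids)[:max_len]   # truncate sequences that are too long
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_len - len(ids)        # 0 if the sequence was truncated
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Hypothetical token ids, not produced by a real tokenizer.
short_ids, short_mask = pad_and_mask([101, 7, 8, 102], max_len=6)
long_ids, long_mask = pad_and_mask([101, 1, 2, 3, 4, 5, 6, 102], max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have length 6, so they could be stacked into one batch; the attention mask tells the model which positions are padding.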

4. Load the pre-trained model and tokenizer:

```python
model_name = "deepset/gbert-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```

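5. Create the datasets and DataLoaders for training and validation (the `Trainer` in step 8 builds its own loaders from the datasets, so `train_loader` and `val_loader` are only needed for a manual training loop):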

```python
MAX_LEN = 128
BATCH_SIZE = 16
train_data_path = "path_to_train_data.tsv"
val_data_path = "path_to_val_data.tsv"

train_dataset = GermEvalDataset(tokenizer, train_data_path, MAX_LEN)
val_dataset = GermEvalDataset(tokenizer, val_data_path, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
```

In the notebook run, the placeholder paths were set to `../data/GermEval-2018/germeval2018.training.txt` for training and `../data/GermEval-2018/germeval2018.test.txt` for validation.
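
Before training, it can help to check that every row parses into exactly three tab-separated fields with a known coarse label, because the `ValueError` in the dataset class only surfaces a bad row when that row is accessed. The sketch below uses the standard-library `csv` module rather than pandas. The helper `check_rows`, the sample rows, and the choice of `csv.QUOTE_NONE` (so that quote characters inside tweets are treated as ordinary text) are assumptions for illustration, not part of the original notebook.

```python
import csv
import io

LABEL_TO_ID = {"OTHER": 0, "OFFENSE": 1}


def check_rows(handle):
    """Split a tab-separated stream into valid examples and problem rows."""
    good, bad = [], []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) == 3 and row[1] in LABEL_TO_ID:
            good.append((row[0], LABEL_TO_ID[row[1]]))
        else:
            bad.append((line_no, row))
    return good, bad


# Made-up sample rows, not taken from the dataset.
sample = io.StringIO(
    'Ein "harmloser" Tweet\tOTHER\tOTHER\n'
    'Ein beleidigender Tweet\tOFFENSE\tINSULT\n'
    'Zeile ohne Label\n'
)
good, bad = check_rows(sample)
print(len(good), "valid rows; problem rows:", bad)
```

For a real file, the same function could be called on `open(path, encoding="utf-8", newline="")`.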

6. Create the training arguments:

```python
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=3,
    logging_steps=100,
    save_steps=1000,
    evaluation_strategy="steps",  # named `eval_strategy` in recent transformers releases
    seed=42,
)
```
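
To see how `logging_steps`, `save_steps`, and step-based evaluation interact, it helps to count optimizer steps. The sketch below assumes a single device, no gradient accumulation, and a hypothetical training set of 5,000 examples; `schedule_summary` is an illustrative helper, not a `transformers` function. When `eval_steps` is not set, the `Trainer` evaluates every `logging_steps` steps.

```python
import math


def schedule_summary(num_examples, batch_size, epochs, logging_steps, save_steps):
    """Count optimizer steps and how often step-based events fire."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // logging_steps,  # eval_steps falls back to logging_steps
        "checkpoints": total_steps // save_steps,
    }


# 5000 is a hypothetical dataset size; the other values match the arguments above.
summary = schedule_summary(
    num_examples=5000, batch_size=16, epochs=3, logging_steps=100, save_steps=1000
)
print(summary)
```

Under these assumptions the run has 313 steps per epoch and 939 steps in total, so it is evaluated 9 times and never reaches step 1000. No checkpoint would be triggered by `save_steps` alone; a smaller `save_steps` would change that.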

7. Define the metric function:

```python
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    targets = p.label_ids
    precision = precision_score(targets, preds)
    recall = recall_score(targets, preds)
    f1 = f1_score(targets, preds)
    acc = accuracy_score(targets, preds)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": acc}
```
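
With their default settings, the scikit-learn scores used here treat label 1 (`OFFENSE`) as the positive class. The following plain-Python sketch computes the same quantities by hand to make the definitions concrete; `binary_metrics` and the sample logits and targets are made up for illustration.

```python
def binary_metrics(logits, targets):
    """Precision, recall, F1 (positive class = 1) and accuracy from raw logits."""
    # argmax over each row of logits gives the predicted class id
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, targets))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, targets))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, targets))
    correct = sum(p == t for p, t in zip(preds, targets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": correct / len(targets),
    }


# Hypothetical model outputs for five examples.
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3], [0.2, 0.4]]
targets = [0, 1, 0, 1, 1]
metrics = binary_metrics(logits, targets)
print(metrics)
```

In this sample there are two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3 while accuracy is 3/5.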

8. Create a Trainer and train the model:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
```

9. Optional: evaluate the model after training:

```python
trainer.evaluate()
```

The trained model can now be used to classify Twitter messages.
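
For inference, the logits returned by the fine-tuned model (for example `model(**inputs).logits`) still need to be mapped back to the label names from step 3. Below is a minimal post-processing sketch in plain Python, with hypothetical logits standing in for real model output; `logits_to_label` is an illustrative helper.

```python
import math

ID_TO_LABEL = {0: "OTHER", 1: "OFFENSE"}


def logits_to_label(logits):
    """Softmax over one example's logits, then pick the most likely label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID_TO_LABEL[best], probs[best]


# Hypothetical logits for a single tweet.
label, prob = logits_to_label([-1.2, 2.3])
print(label, round(prob, 3))
```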