# Protest event detection with FLAN-T5

FLAN-T5 is a recent Google model that is finetuned on a variety of tasks, comparable to InstructGPT (the instruction-tuned variant that makes GPT-3 so pleasant to use). It is trained on multiple languages, including German. I hope that it may outperform the German Electra model, with or even without task-specific finetuning.

Since this is a sequence-to-sequence model, I have to change the setup slightly, but basically this is a copy of [`01-27.ipynb`](01-27.ipynb).

This notebook is to be run on Google Colab with GPU support. Some of the original outputs are missing, but I copied the results as text.

In [None]:
!pip install transformers datasets evaluate

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={
        "train": "drive/MyDrive/0 Protest Impact/Protest Event Replication/data/glpn_train.csv",
        "dev": "drive/MyDrive/0 Protest Impact/Protest Event Replication/data/glpn_dev.csv",
        "test": "drive/MyDrive/0 Protest Impact/Protest Event Replication/data/glpn_test.csv",
        "test.time": "drive/MyDrive/0 Protest Impact/Protest Event Replication/data/glpn_test-time.csv",
        "test.loc": "drive/MyDrive/0 Protest Impact/Protest Event Replication/data/glpn_test-loc.csv",
    },
)
dataset

In [None]:
dataset["train"]["labels"][:10]

['relevant',
 'relevant',
 'irrelevant',
 'irrelevant',
 'irrelevant',
 'relevant',
 'irrelevant',
 'relevant',
 'irrelevant',
 'relevant']

In [None]:
dataset = dataset.map(
    lambda x: {"response": "Ja" if x["labels"] == "relevant" else "Nein"}
)

In [None]:
dataset["train"]["response"][:10]

['Ja', 'Ja', 'Nein', 'Nein', 'Nein', 'Ja', 'Nein', 'Ja', 'Nein', 'Ja']

In [None]:
from datasets import ClassLabel

dataset = dataset.cast_column("labels", ClassLabel(names=["irrelevant", "relevant"]))
dataset = dataset.rename_column("labels", "label")
dataset["train"]["label"][:10]



[1, 1, 0, 0, 0, 1, 0, 1, 0, 1]

In [None]:
prompt = """

Zu Protesteregnissen zählen vielfältige Protestformen, wie Demonstrationen, Streiks, Blockaden, Unterschriftensammlungen, Besetzungen, Boykotte, etc. Beschreibt dieser Zeitungsartikel ein Protestereignis? Antworte mit "Ja" oder "Nein".

Antwort: """
dataset = dataset.map(lambda x: {"prompt": x["excerpt"] + prompt})
print(dataset["train"]["prompt"][84])



Stuttgarter Zeitung 2010-06-19 100 000 Unterschriften gegen tödliche Waffen Von Kathrin Wesely  Gemeinsam mit der Initiative "Keine Mordwaffen als Sportwaffen!" ist das Aktionsbündnis Amoklauf Winnenden (Rems-Murr-Kreis) gestern bei Katrin Göring-Eckardt in Berlin vorstellig geworden. Die Delegation hat der grünen Bundestagsvizepräsidentin mehr als 100 000 Unterschriften gegen tödliche Sportwaffen überreicht. In seinem Appell an den Bundestag fordert das Aktionsbündnis das Verbot großkalibriger Handfeuerwaffen in Privathaushalten und die getrennte Aufbewahrung von Waffen und Munition. Außerdem hat das Aktionsbündnis der Vizepräsidentin 85 000 Unterschriften für ein Verbot von Killerspielen übergeben. Vertreter aller Bundestagsfraktionen begrüßten bei der Übergabe der Unterschriften auf den Stufen des Reichstags das Engagement der betroffenen Eltern aus Winnenden. "In der anschließenden Bundestagsdebatte waren die politischen Lager aber wieder gespalten", bedauert Carlos Bolesch, einer 

In [None]:
from transformers import AutoTokenizer, T5ForConditionalGeneration

# model_name = "google/flan-t5-base"
model_name = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = T5ForConditionalGeneration.from_pretrained(model_name)

model_name = model_name.split("/")[1]

In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples["prompt"],
        text_target=examples["response"],
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )


tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset

In [None]:
import evaluate
import numpy as np

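# F1 on the binary labels: the evaluation cells below call metric.compute()
# with 0/1 predictions and report {'f1': ...}, which BLEU would not return.
metric = evaluate.load("f1")


def compute_metrics(eval_pred):
    # NOTE: written for a classification head. With a seq2seq model the
    # predictions are per-token vocabulary logits, so this does not yield a
    # meaningful F1 during training (see the remark at the end).
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)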

In [None]:
import torch

responses = {}
for split in ["test", "test.time", "test.loc"]:
    print(split)
    responses[split] = []
    input_ids = tokenizer(
        dataset[split]["prompt"],
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )  # .to("cuda")
    output = model.generate(**input_ids)
    responses_ = tokenizer.batch_decode(output, skip_special_tokens=True)
    for response in responses_:
        if "ja" in response.lower():
            label = 1
        elif "nein" in response.lower():
            label = 0
        else:
            print(response)
            label = 0
            # label = None
        responses[split].append(label)

test




erschwert, dass die Fraktion der islamkritischen Legida-Fraktion
Leipzig steht erneut vor einem Ausnahmezustand: Nachdem das islamkritische Bündnis
che
Konferenz erteilt.
Legida-Fraktion für eine Bürgerbegehren tragen.
ffentlichkeitsmitglied, der ffentlichkeitsmitglied hat seinen ffentlichkeit
hnliches hnliches hnliches hnliche
Die Bürger von Untertürkheim empfanden den Burschenschaftstag und die prot
Im Prozess wegen der Ermordung einer Eislinger Familie haben sich die Anwält
Ein Ehepaar aus einer kleinen Gemeinde im Landkreis Böblingen muss seine drei Kinder zur
Der Anlass für die blautige Schlägerei, bei der am 1. August 2008
Die ffentlichkeit hat sich während der ffentlichkeitsauftritte in Ge
hnliches hnliches hnliches
Dietlichkeitsgrund, sondern auf der politischen Grundlage.
nderungsantrag nderungsantrag nderungsantrag 
Kinderstätten zu reservieren, um die Kinder zu erlernen.
Der Widerstand gegen die milliardenteure Tieferlegung des Bahnhofs, gegen die 
Der sanierte Landesstraße 

In many cases the model does not answer with "Ja" or "Nein" but produces unrelated text, often a garbled continuation of the article :/
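
The label parsing in the cell above uses a substring test, so a rambling answer that merely contains "ja" (for example inside "Januar") would be counted as a positive. A stricter parser might look like the sketch below. `parse_response` is a hypothetical helper, not part of the original notebook; it returns `None` for unusable answers so that they can be counted separately instead of being silently mapped to 0.

```python
import re


def parse_response(response):
    """Map a generated answer to 1 ("Ja"), 0 ("Nein"), or None (neither)."""
    # Only accept "ja" / "nein" as a whole word at the start of the answer.
    match = re.match(r"\W*(ja|nein)\b", response.strip().lower())
    if match is None:
        return None
    return 1 if match.group(1) == "ja" else 0


examples = ["Ja", "Nein.", "Im Januar demonstrierten viele", "Jahrelang"]
parsed = [parse_response(text) for text in examples]
print(parsed)  # [1, 0, None, None]
```

Whether unparseable answers should be scored as 0 or dropped is a design choice; the notebook scores them as 0.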

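In part this may be because the prompt is truncated when the article is very long: the question is appended after the excerpt, so cutting the input at 512 tokens removes the instruction itself.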
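
One way to keep the question intact would be to shorten only the article so that article plus question fit into the token budget. The sketch below is an assumption about how this could be done, not what the notebook does. `truncate_article` is a hypothetical helper that takes `encode` and `decode` callables, so it can be tried without downloading a model. Decoding and re-encoding does not always round-trip exactly, so the resulting length is approximate.

```python
def truncate_article(excerpt, suffix, encode, decode, max_length=512, reserve=1):
    """Shorten `excerpt` so that `excerpt + suffix` fits into `max_length` tokens."""
    suffix_ids = encode(suffix)
    # `reserve` leaves room for special tokens such as the end-of-sequence token.
    budget = max_length - len(suffix_ids) - reserve
    if budget <= 0:
        raise ValueError("The suffix alone does not fit into max_length.")
    excerpt_ids = encode(excerpt)[:budget]
    return decode(excerpt_ids) + suffix


# Toy whitespace "tokenizer" for illustration only.
toy_encode = str.split
toy_decode = " ".join

short = truncate_article("a b c d e f", " Q ?", toy_encode, toy_decode, max_length=5)
print(short)  # a b Q ?
```

With the Hugging Face tokenizer from this notebook, one would presumably pass `lambda t: tokenizer.encode(t, add_special_tokens=False)` as `encode` and `tokenizer.decode` as `decode`.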

In [None]:
responses["test"][:10]

In [None]:
eval_results = {}
for split in ["test", "test.time", "test.loc"]:
    eval_results[split] = metric.compute(
        predictions=responses[split], references=list(dataset[split]["label"])
    )
eval_results

**flan-t5-base** (zero-shot, before finetuning):
{'test': {'f1': 0.6657183499288764},
 'test.time': {'f1': 0.6810035842293908},
 'test.loc': {'f1': 0.26666666666666666}}

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=f"drive/MyDrive/0 Protest Impact/Protest Event Replication/model/{model_name}/checkpoints",
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    learning_rate=5e-6,
    weight_decay=0.2,
    num_train_epochs=6,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["dev"],
    compute_metrics=compute_metrics,
)

In [None]:
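from pathlib import Path

from transformers import T5ForConditionalGeneration

# `model_location` is not defined anywhere else in this notebook. The log
# output below shows a checkpoint directory being loaded, so a path under
# `output_dir` is assumed here; adjust it to the checkpoint you want.
model_location = f"{training_args.output_dir}/checkpoint-2000"

# Reuse a saved model if one exists; otherwise finetune and save it.
if Path(model_location).exists():
    model = T5ForConditionalGeneration.from_pretrained(model_location)
else:
    trainer.train()
    trainer.save_model(model_location)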

loading configuration file drive/MyDrive/0 Protest Impact/Protest Event Replication/model/flan-t5-base/checkpoints/checkpoint-2000/config.json
Model config T5Config {
  "_name_or_path": "google/flan-t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,

In [None]:
from collections import Counter

Counter(dataset["test"]["label"])

Counter({1: 330, 0: 217})

In [None]:
import torch

responses = {}
for split in ["test", "test.time", "test.loc"]:
    print(split)
    responses[split] = []
    input_ids = tokenizer(
        dataset[split]["prompt"],
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )  # .to("cuda")
    output = model.generate(**input_ids)
    responses_ = tokenizer.batch_decode(output, skip_special_tokens=True)
    for response in responses_:
        if "ja" in response.lower():
            label = 1
        elif "nein" in response.lower():
            label = 0
        else:
            print(response)
            label = 0
            # label = None
        responses[split].append(label)

test


Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.0"
}

Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.0"
}



test.time


Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.0"
}



Mit dem Namen des islamischen Botschaften in Leipzig.
test.loc


In [None]:
eval_results = {}
for split in ["test", "test.time", "test.loc"]:
    eval_results[split] = metric.compute(
        predictions=responses[split], references=list(dataset[split]["label"])
    )
eval_results

{'test': {'f1': 0.7525655644241733},
 'test.time': {'f1': 0.8304821150855365},
 'test.loc': {'f1': 0.31010452961672474}}

Above are the main results for the finetuned flan-t5-base checkpoint. They are worse than those of the German Electra model, especially on test.loc, possibly because of prompt truncation.
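
As a sanity check on the `test` score, the F1 of a trivial classifier that answers "Ja" for every article can be computed from the label counts shown earlier (330 relevant, 217 irrelevant). The helper `f1_from_counts` is only illustrative and not part of the original notebook. The baseline comes out at about 0.7526, the same value as the finetuned model's `test` F1, which suggests that the model may be labelling (almost) every test article as relevant.

```python
def f1_from_counts(tp, fp, fn):
    """Binary F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Always answering "Ja" on the test split:
# all 330 relevant articles are hits, all 217 irrelevant ones are false positives.
baseline_f1 = f1_from_counts(tp=330, fp=217, fn=0)
print(round(baseline_f1, 4))  # 0.7526
```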

Evaluation of **flan-t5-base** during training (reconstructed afterwards from restored checkpoints, as I was too lazy to convert the F1 metric to work with the Hugging Face Seq2Seq Trainer):

- after 500 batches: 
  - 'test': {'f1': 0.0},
  - 'test.time': {'f1': 0.00373},
  - 'test.loc': {'f1': 0.0}
- after 2000 batches:
  - 'test': {'f1': 0.752},
  - 'test.time': {'f1': 0.830},
  - 'test.loc': {'f1': 0.310}
- after 2500 batches:
  - 'test': {'f1': 0.752},
  - 'test.time': {'f1': 0.829},
  - 'test.loc': {'f1': 0.304}
- after complete training:
  - 'test': {'f1': 0.752},
  - 'test.time': {'f1': 0.829},
  - 'test.loc': {'f1': 0.304}
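
For reference, converting the F1 metric for seq2seq evaluation could look roughly like the sketch below. It assumes that the predictions arrive as generated token ids (for example from `Seq2SeqTrainer` with `predict_with_generate=True`) and that ignored positions are marked with -100. `make_compute_metrics` and its `batch_decode` argument are hypothetical names introduced here, and F1 is computed by hand to keep the example free of extra dependencies.

```python
import numpy as np


def make_compute_metrics(batch_decode, pad_token_id=0):
    """Build a compute_metrics function that scores decoded "Ja"/"Nein" answers."""

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        # Positions marked with -100 cannot be decoded; map them back to padding.
        predictions = np.where(predictions != -100, predictions, pad_token_id)
        labels = np.where(labels != -100, labels, pad_token_id)
        pred = [text.strip().lower() == "ja" for text in batch_decode(predictions)]
        gold = [text.strip().lower() == "ja" for text in batch_decode(labels)]
        tp = sum(p and g for p, g in zip(pred, gold))
        fp = sum(p and not g for p, g in zip(pred, gold))
        fn = sum(g and not p for p, g in zip(pred, gold))
        denominator = 2 * tp + fp + fn
        return {"f1": 2 * tp / denominator if denominator else 0.0}

    return compute_metrics


# Toy decoder for illustration: id 0 is padding, 1 is "Ja", 2 is "Nein".
toy_vocab = {0: "", 1: "Ja", 2: "Nein"}


def toy_batch_decode(rows):
    return ["".join(toy_vocab[int(i)] for i in row) for row in rows]


toy_metrics = make_compute_metrics(toy_batch_decode)
toy_predictions = np.array([[1, 0], [2, 0], [1, 0], [2, 0]])
toy_labels = np.array([[1, -100], [1, -100], [2, -100], [2, -100]])
toy_result = toy_metrics((toy_predictions, toy_labels))
print(toy_result)  # {'f1': 0.5}
```

In this notebook, `batch_decode` would presumably be `lambda ids: tokenizer.batch_decode(ids, skip_special_tokens=True)` with `pad_token_id=tokenizer.pad_token_id`.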