## Danish Multilingual Analysis

In this notebook we look at the errors our model makes in zero-shot mode on the Danish test set.

We use a BERT model trained on OLID, the English Offensive Language Identification Dataset.

In [1]:
%load_ext autoreload
%autoreload 2
import os
from datetime import datetime
import fire
import torch
import pandas as pd
from torchtext import data
import torch.nn as nn
from transformers import (
    AdamW, BertForSequenceClassification, BertTokenizer,
    get_constant_schedule_with_warmup
)

from offenseval.nn import (
    Tokenizer,
    train, evaluate, train_cycle, save_model, load_model, evaluate_dataset
)
from offenseval.datasets import datasets

pd.options.display.max_rows = 200
pd.options.display.max_colwidth = 300

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model, TEXT = load_model("../../models/bert_cased.olid.pt", device)


In [2]:

loss, acc, f1, pos_f1, neg_f1 = evaluate_dataset(
    model, TEXT, datasets["danish"]["test"], batch_size=64
)
print(f'Test Loss: {loss:.3f}  Acc: {acc*100:.2f}% Macro F1: {f1:.3f} Pos F1 {pos_f1:.3f} Neg F1 {neg_f1:.3f}')


Loading dataset...
Building iterators




Test Loss: 0.713  Acc: 44.93% Macro F1: 0.415 Pos F1 0.272 Neg F1 0.557


Create fields and some other boilerplate

In [3]:
from offenseval.datasets import datasets, build_dataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Loading dataset...")
ID = data.Field(sequential=False, use_vocab=False)
SUBTASK_A = data.LabelField()


fields = {
    "id": ('id', ID),
    "text": ('text', TEXT),
    "subtask_a": ("subtask_a", SUBTASK_A)
}


test_dataset = build_dataset(datasets["danish"]["test"], fields)

SUBTASK_A.build_vocab(test_dataset)

assert SUBTASK_A.vocab.itos == ["NOT", "OFF"]


Loading dataset...


Get the predictions

In [4]:
from offenseval.nn.evaluation import get_outputs
from tqdm.auto import tqdm

# Do not shuffle or sort: predictions must stay in the same order as the dataset rows
test_it = data.Iterator(
    test_dataset, batch_size=1, device=device,
    shuffle=False, sort=False,
)


pred_probas, labels = get_outputs(model, tqdm(test_it))





Compute the metrics, then build a dataframe for easier inspection

In [6]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

true = labels
pred = (pred_probas > 0.5).float()

acc = accuracy_score(true, pred)
pos_f1 = f1_score(true, pred)
neg_f1 = f1_score(1-true, 1-pred)
avg_f1 = (pos_f1 + neg_f1) / 2
roc = roc_auc_score(true, pred_probas)

print(f'Acc: {acc*100:.2f}% Macro F1: {avg_f1:.3f} Pos F1 {pos_f1:.3f} Neg F1 {neg_f1:.3f} ROC {roc:.3f}')


Acc: 79.22% Macro F1: 0.622 Pos F1 0.369 Neg F1 0.876 ROC 0.744
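The macro F1 printed above is the unweighted mean of the F1 for the offensive class and the F1 for the non-offensive class. The following is a minimal pure-Python sketch of that calculation; the helper names `f1_for_class` and `macro_f1` and the toy labels are illustrative assumptions, not part of `offenseval` or of the test set.

```python
def f1_for_class(true, pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    tp = sum(1 for t, p in zip(true, pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(true, pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(true, pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: F1 is 0 by convention
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(true, pred):
    """Unweighted mean of the F1 of class 1 (OFF) and class 0 (NOT)."""
    return (f1_for_class(true, pred, 1) + f1_for_class(true, pred, 0)) / 2


# Toy labels for illustration only (not taken from the Danish test set)
toy_true = [1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 0, 0]

print(f"Pos F1 {f1_for_class(toy_true, toy_pred, 1):.3f}")
print(f"Neg F1 {f1_for_class(toy_true, toy_pred, 0):.3f}")
print(f"Macro F1 {macro_f1(toy_true, toy_pred):.3f}")
```

Because the negative class dominates this dataset, accuracy and negative F1 can look healthy while positive F1 stays low, which is why all three numbers are reported side by side.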


In [7]:
import pandas as pd


df_da = pd.read_table(datasets["danish"]["test"], index_col=0)

df_da["label"] = df_da["subtask_a"] == 'OFF' 
df_da["prob"] = pred_probas.view(-1) 
df_da["pred"] = df_da["prob"] > 0.5

df_da

id,tweet,subtask_a,label,prob,pred
2308,"##15,29 CM REN DANSK MAND.",NOT,False,0.380701,False
2083,Er der vods et sted?,NOT,False,0.107194,False
263,Og??????,NOT,False,0.159163,False
3266,Meget overraskende at Japan ikke er på listen!,NOT,False,0.165946,False
1871,jeg overvejer næsten at invitere dig på aftensmad.,NOT,False,0.163459,False
...,...,...,...,...,...
347,"Jeg studerer selv i England, i en af de dyreste byer i landet - i forhold til Danmark så ligner det LORT. Hjemløse og folk tabt af systemet over alt, skrald over det hele, alle mine venner er ved at drukne i studie lån og meget misundelige på min SU, biler på den forkerte side a vejen... Jeg kun...",OFF,True,0.512812,True
1395,bare hun huske at betale skat og ik kommer hjem når hun bliver syg og nasser på der danske sundhedsvæsen,NOT,False,0.405745,False
1217,"nogen folk har bare ondt i røven.. men jeg forstår godt du gør det, kunne jeg måske også finde på ud i fremtiden",NOT,False,0.547293,True
258,"Han er sku da den eneste som gør noget for veteranerne i USA! Skriv lige lidt om det også, så jeres brugere ikke forledes til at tro at Trump ikke anerkender soldaters ofre!",NOT,False,0.365415,False


## Get the translations

Now let's check what our model predicts on the English translations of these same texts.

First, append the translations to the dataframe.

In [8]:
translations = pd.read_json("../../data/translations/Danish/offenseval-da-training-v1-en.json")
translations.set_index("id", inplace=True)

df_da = pd.merge(df_da, translations, how="left", left_index=True, right_index=True)

df_da[:10]

id,tweet,subtask_a,label,prob,pred,en
2308,"##15,29 CM REN DANSK MAND.",NOT,False,0.380701,False,## 15.29 CM PURE DANISH MAN.
2083,Er der vods et sted?,NOT,False,0.107194,False,Is there a place somewhere?
263,Og??????,NOT,False,0.159163,False,And??????
3266,Meget overraskende at Japan ikke er på listen!,NOT,False,0.165946,False,Very surprising that Japan is not on the list!
1871,jeg overvejer næsten at invitere dig på aftensmad.,NOT,False,0.163459,False,I&#39;m almost considering inviting you to dinner.
181,Brænd hele lortet ned og start forfra.,OFF,True,0.277059,False,Burn the whole shit down and start over.
1,Satans danskjävlar!,OFF,True,0.932986,True,Satan&#39;s devil!
2840,Thank you :),NOT,False,0.060507,False,Thank you :)
1190,Dansk propaganda hade funkar bättre om ni hade skrivit lite bättre fejk-svenska. Protip: URL,NOT,False,0.256336,False,Danish propaganda would have worked better if you had written a little better fake-Swedish. Protip: URL
2689,Vi mangler lige Blond skiltet der blev smadret,NOT,False,0.829687,True,We&#39;re just missing the Blond sign that was smashed
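The `en` column above still contains HTML entities such as `&#39;` and `&quot;`. If the model should see plain apostrophes and quotes instead, the entities could be decoded before prediction. A minimal sketch using the standard library follows; the helper name `clean_translation` is an assumption, and whether decoding changes the predictions has not been checked here.

```python
import html


def clean_translation(text):
    """Decode HTML entities (e.g. &#39; and &quot;) left in translated text."""
    return html.unescape(text)


# Two translations copied from the table above
samples = [
    "I&#39;m almost considering inviting you to dinner.",
    "Satan&#39;s devil!",
]
for s in samples:
    print(clean_translation(s))
```

In the notebook this could be applied to the whole column, for example with `df_da["en"].map(clean_translation)`, before building the translated examples.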



Now predict. We need to build the examples differently so that the model's text field is filled from the "en" column (the translation) instead of the original tweet.

In [26]:
"""
We need different fields!
"""
translated_fields = {
    "id": fields["id"],
    "en": fields["text"],
    "subtask_a": fields["subtask_a"]
}

examples = [data.Example.fromdict({
    **{"id": id}, 
    **t
}, fields=translated_fields) for id, t in df_da.iterrows()]

translated_dataset = data.Dataset(examples, fields.values())

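# NOTE: unlike the iterator in In [4], this one does not pass shuffle=False, sort=False.
# torchtext iterators may reorder examples by default, so these outputs are not
# guaranteed to line up with the rows of df_da.
test_it = data.Iterator(
    translated_dataset, batch_size=1, device=device,
)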

translated_pred_probas, _ = get_outputs(model, tqdm(test_it))

df_da["prob_en"] = translated_pred_probas.view(-1)
df_da["pred_en"] = df_da["prob_en"] > 0.5





Let's check the results in English

In [27]:
true = labels
pred = (df_da["prob_en"] > 0.5)

acc = accuracy_score(true, pred)
pos_f1 = f1_score(true, pred)
neg_f1 = f1_score(1-true, 1-pred)
avg_f1 = (pos_f1 + neg_f1) / 2

print(f'Acc: {acc*100:.2f}% Macro F1: {avg_f1:.3f} Pos F1 {pos_f1:.3f} Neg F1 {neg_f1:.3f}')


Acc: 67.23% Macro F1: 0.470 Pos F1 0.142 Neg F1 0.797


## Error Analysis

Let's check out the errors

In [33]:
errors = df_da[df_da["label"] != df_da["pred"]]

false_neg = errors[errors["label"]]
false_pos = errors[~errors["label"]]


print(f"There are {len(errors)} errors (out of {len(df_da)} instances)")
print(f"{len(false_neg)} are false negatives and {len(false_pos)} are false positives")

There are 123 errors (out of 592 instances)
41 are false negatives and 82 are false positives


In [34]:
false_pos.sort_values("prob", ascending=False)

id,tweet,subtask_a,label,prob,pred,en,prob_en,pred_en
2597,"Det er fandme noget **Fallout: Nakskov** type shit, det der",NOT,False,0.972302,True,"It&#39;s fuckin &#39;something ** Fallout: Nakskov ** type of shit, that thing",0.973622,True
196,"så længe der er mænd som villigt betaler 5000 for et knald, det er ikke pigerne som er dumme ?",NOT,False,0.93552,True,"as long as there are men willing to pay 5000 for a bang, it&#39;s not the girls who are stupid?",0.965079,True
2303,SKÅNE ER DANSK,NOT,False,0.900015,True,Skåne is Danish,0.93436,True
3198,Fantastisk. Hvis ikke jeg havde været fattig havde du fået guld.,NOT,False,0.898958,True,"Fantastic. If I had not been poor, you would have received gold.",0.216345,False
3362,"Det her kommer måske til at være dumt, men det er jo 100% photoshoppet - så er jeg out of the loop? Hvad er joken",NOT,False,0.898711,True,"This may sound silly, but it&#39;s 100% photoshopped - so am I out of the loop? What is the joke",0.076652,False
3489,Kugle for panden!,NOT,False,0.896978,True,Bullet to the forehead!,0.168349,False
3342,"Kan vi ikke spare de gifteksperter vi har siddende i døgnvagt, og bruge pengene på bedre ting? ""Det skulle man tro, Hr. Løkke. Men danskerne er nogle idioter, der spiser deres stueplanter.""",NOT,False,0.885027,True,"&quot;Can&#39;t we save the marriage experts we have sitting around the clock and spend the money on better things?&quot; &quot;You should think so, Mr. Loke. But the Danes are some idiots who eat their houseplants.&quot;",0.151516,False
2574,LUK BROEN!,NOT,False,0.873045,True,CLOSE BROEN!,0.099021,False
1340,Satan er det dig? &#x200B;,NOT,False,0.862102,True,Satan is that you?,0.125945,False
1927,"DET DER, MINE KJÆRE DANSKER, SER UTROLIG GODT UT!",NOT,False,0.849911,True,"THAT, MY DEAR DANS, LOOKS EXCELLENTLY GOOD!",0.16714,False


## False Negatives

In [35]:
false_neg.sort_values("prob", ascending=True)

id,tweet,subtask_a,label,prob,pred,en,prob_en,pred_en
2065,URL,OFF,True,0.089447,False,URL,0.087948,False
1017,"@USER gå i seng, du er syg",OFF,True,0.130515,False,"@USER go to bed, you&#39;re sick",0.752331,True
2415,*Danskjävel,OFF,True,0.13443,False,* Danskjävel,0.082972,False
2704,Danske javle!!,OFF,True,0.14345,False,Danish javelin !!,0.275274,False
3224,Hun er nok elendigt opdraget,OFF,True,0.148922,False,She&#39;s probably miserably brought up,0.04722,False
1860,Jeg ved sgu ikke om jeg havde lyst til at få at vide Malmö nu var vores ansvar.,OFF,True,0.151056,False,I don&#39;t even know if I wanted to know Malmö was our responsibility.,0.050296,False
1541,DAN man har så mange fugtige lortepostideer men ingen photoshop eller gif skills :(,OFF,True,0.1609,False,THEN you have so many moist shit post ideas but no photoshop or gif skills :(,0.970509,True
1322,Der er tale om tre unge mænd af udenlandsk afstamning Surprise surprise,OFF,True,0.179219,False,&quot;These are three young men of foreign descent&quot; Surprise surprise,0.078217,False
1174,"Her havde jeg scannet lortet ind på pc, markeret og taget hendes underskrift over i et nyt dokument og skrevet det på korrekt engelsk. Gad vide om hun fik jobbet som vennetjeneste.",OFF,True,0.182341,False,"Here I had scanned the shit on PC, marked and taken her signature into a new document and written it in correct English. Wondered if she got a job as a friend.",0.842876,True
3438,Det er fandme et stærkt album! Tak for grinet :),OFF,True,0.182544,False,It&#39;s a strong album! Thanks for the laugh :),0.974295,True
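Since each row carries both `pred` (Danish input) and `pred_en` (translated input), the rows can be grouped by whether translation fixes or breaks the prediction. The sketch below shows one way to do this in plain Python; the function name `translation_effect` and the example rows are hypothetical and only mimic the shape of the `label`, `pred` and `pred_en` columns.

```python
from collections import Counter


def translation_effect(label, pred, pred_en):
    """Classify a row by which of the two predictions matches the gold label."""
    da_ok = pred == label
    en_ok = pred_en == label
    if da_ok and en_ok:
        return "both correct"
    if en_ok:
        return "fixed by translation"
    if da_ok:
        return "broken by translation"
    return "both wrong"


# Hypothetical (label, pred, pred_en) rows, for illustration only
rows = [
    (False, True, True),
    (False, True, False),
    (True, False, True),
    (True, False, False),
    (False, False, True),
    (True, True, True),
]

counts = Counter(translation_effect(*r) for r in rows)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

A breakdown like this is only meaningful if `prob_en` is aligned row by row with the dataframe, so it is worth verifying that alignment first.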


In [64]:
bert_tokenizer = TEXT.tokenize.__self__.bert_tokenizer

def predict_sentence(sentence):
    model.eval()
    inp = torch.tensor(bert_tokenizer.encode(sentence)).view(1, -1).to(device)

    return torch.sigmoid(model(inp))

predict_sentence("Brat!")

tensor([[0.1475]], device='cuda:0', grad_fn=<SigmoidBackward>)

In [65]:
predict_sentence("Analgafler")

tensor([[0.3034]], device='cuda:0', grad_fn=<SigmoidBackward>)

In [66]:
predict_sentence("Hold dig væk fra vores ø, du dansker lort.")

tensor([[0.3348]], device='cuda:0', grad_fn=<SigmoidBackward>)

In [67]:
predict_sentence("Stay away from our island, you danish scum")

tensor([[0.9648]], device='cuda:0', grad_fn=<SigmoidBackward>)

In [68]:
predict_sentence("Dungarn lyder bare som et lorte sted på Nørrebro...")

tensor([[0.3415]], device='cuda:0', grad_fn=<SigmoidBackward>)

In [69]:
predict_sentence("Dungarn just sounds like a shit place on Nørrebro ...")

tensor([[0.9735]], device='cuda:0', grad_fn=<SigmoidBackward>)

In [70]:
predict_sentence("This is really shit. 0/10.")

tensor([[0.9733]], device='cuda:0', grad_fn=<SigmoidBackward>)

Why is this different from the earlier prediction?

In [72]:
df_da.loc[858].tweet

'Det her er vitterligt lort. 0/10.'

In [73]:
predict_sentence(df_da.loc[858].tweet)

tensor([[0.3436]], device='cuda:0', grad_fn=<SigmoidBackward>)

In [74]:
predict_sentence(df_da.loc[858].en)

tensor([[0.9733]], device='cuda:0', grad_fn=<SigmoidBackward>)

OK, something is wrong here. This needs investigating.
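One way to investigate is to re-score a handful of rows one at a time and compare the results with the stored column. If many rows disagree, the batch outputs were probably not in dataframe order, which can happen when an iterator shuffles or sorts its examples. The sketch below illustrates the comparison; the helper name `misaligned_ids` and all numbers are hypothetical.

```python
def misaligned_ids(stored, rescored, tol=1e-3):
    """Return ids whose stored probability differs from a fresh
    single-row prediction by more than `tol`."""
    return [
        row_id
        for row_id, prob in rescored.items()
        if abs(stored[row_id] - prob) > tol
    ]


# Hypothetical values for illustration only
stored = {10: 0.91, 11: 0.12, 12: 0.55}    # e.g. the prob_en column, keyed by id
rescored = {10: 0.12, 11: 0.91, 12: 0.55}  # e.g. fresh single-sentence predictions

print(misaligned_ids(stored, rescored))
```

In the notebook, `stored` could be built from `df_da["prob_en"]` and `rescored` from calling `predict_sentence` on the `en` text of a small sample of rows. Small differences may also come from tokenization or truncation, so the tolerance is a judgment call.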