<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M3_Finetuning_DK_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Danish Sentiment Analysis Model
**By fine-tuning the `Maltehb/danish-bert-botxo` BERT model**

BERT is arguably the model that marked the breakthrough for transformers.
We will use the `simpletransformers` library to fine-tune a Danish BERT model on an auto-translated [go-emotions](https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html) corpus.
We will monitor the training with [Weights & Biases](http://wandb.ai). Finally, we will push the finished model to the 🤗 HF Hub.

In [None]:
# Installs
!pip install wandb simpletransformers sacremoses -q
!pip install -U transformers huggingface_hub -q

![](https://camo.githubusercontent.com/76a007a89ca0ad97ae1da9a08c7ead72ad94966e61b18b17b635b2a17cc76f23/68747470733a2f2f692e696d6775722e636f6d2f54485958424e302e706e67)

Log in to W&B

In [None]:
!wandb login

In [None]:
# Imports
import pandas as pd
import numpy as np
import wandb
import logging
from tqdm import tqdm
from sklearn.model_selection import train_test_split

from simpletransformers.classification import MultiLabelClassificationModel, MultiLabelClassificationArgs


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [None]:
# Initialise project with wandb
wandb.init(project="M3-W2-dk-sentiment", entity="rjurow")

In [None]:
# Open Data
df = pd.read_json('https://github.com/aaubs/ds-master/raw/main/data/dk-go-emotions-10k.json.gz')

In [None]:
# Define labels (from column names)
label_cols = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
len(label_cols)

In [None]:
dk_labels = ['beundring', 'fornøjelse', 'vrede', 'irritation', 'medhold', 'omsorg', 'forvirring', 'nysgerrighed', 'begær', 'skuffelse', 'misbilligelse', 'afsky', 'forlegenhed', 'spænding', 'frygt', 'taknemmelighed', 'sorg', 'glæde', 'kærlighed', 'nervøsitet', 'optimisme', 'stolthed', 'indsigt', 'lettelse', 'fortrydelse', 'tristhed', 'overraskelse', 'neutral']

We need dictionaries that map between label ids and label names
for both languages (only the Danish ones are actually used here)

In [None]:
# english labels
id2label = {str(i):label for i, label in enumerate(label_cols)}
label2id = {label:str(i) for i, label in enumerate(label_cols)}

In [None]:
# danish labels
id2label_dk = {i:label for i, label in enumerate(dk_labels)}
label2id_dk = {label:i for i, label in enumerate(dk_labels)}
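
Both label lists are ordered the same way, so a position in one list identifies the translation in the other. Below is a minimal sketch of that idea on a three-label subset of the lists above. The names `en_subset`, `dk_subset`, `id2label_demo`, `label2id_demo` and `en2dk` are illustrative only and do not appear elsewhere in the notebook.

```python
# Shortened, illustrative subsets of label_cols and dk_labels (same order in both).
en_subset = ["admiration", "amusement", "anger"]
dk_subset = ["beundring", "fornøjelse", "vrede"]

# id -> label and label -> id lookups, built the same way as in the cells above.
id2label_demo = {i: label for i, label in enumerate(dk_subset)}
label2id_demo = {label: i for i, label in enumerate(dk_subset)}

# Because the lists are aligned by position, zip gives an English -> Danish map.
en2dk = dict(zip(en_subset, dk_subset))

# Round trip: id -> label -> id returns the original id.
assert all(label2id_demo[id2label_demo[i]] == i for i in id2label_demo)
print(en2dk["anger"])  # vrede
```

A quick `len(label_cols) == len(dk_labels)` check on the full lists is a cheap way to catch a missing translation before training.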

In [None]:
# build the label matrix: one list of label values per row
df["labels"] = df[label_cols].values.tolist()
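
To make the shape of the new `labels` column concrete, here is a small plain-Python sketch. It uses dicts instead of a DataFrame, and the rows and the three-label `toy_label_cols` list are made up for illustration, so treat it as a picture of the result rather than of the pandas internals.

```python
# Two toy rows with one 0/1 column per emotion (hypothetical data, three labels only).
toy_label_cols = ["admiration", "anger", "neutral"]
toy_rows = [
    {"text_dk": "Tak, det var flot!", "admiration": 1, "anger": 0, "neutral": 0},
    {"text_dk": "Det ved jeg ikke.", "admiration": 0, "anger": 0, "neutral": 1},
]

# Collect the label columns of each row into one list, in label order.
# This mirrors what df[label_cols].values.tolist() yields for each row.
for row in toy_rows:
    row["labels"] = [row[col] for col in toy_label_cols]

print(toy_rows[0]["labels"])  # [1, 0, 0]
print(toy_rows[1]["labels"])  # [0, 0, 1]
```

Each row ends up with a list whose length equals the number of labels, which is the form a multi-label classifier expects as its target.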

In [None]:
# split data
train_df, eval_df = train_test_split(df, test_size=0.1)
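
The split above is random, so the rows in `train_df` and `eval_df` change from run to run. `train_test_split` accepts a `random_state` argument to pin the split. The sketch below illustrates the underlying idea with the standard library only; `seeded_split` is a hypothetical helper, not scikit-learn's implementation.

```python
import random

def seeded_split(items, test_size=0.1, seed=42):
    """Shuffle a copy of items with a fixed seed and split off a test share."""
    rng = random.Random(seed)          # private generator, so the global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Two calls with the same seed give the same split.
train_a, test_a = seeded_split(range(100))
train_b, test_b = seeded_split(range(100))
print(test_a == test_b)            # True
print(len(train_a), len(test_a))   # 90 10
```

With a fixed seed, evaluation numbers from different runs refer to the same held-out rows and are easier to compare.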

In [None]:
# model training args
model_args = MultiLabelClassificationArgs(num_train_epochs=3,
                                          learning_rate= 3e-5,
                                          overwrite_output_dir= True,
                                          reprocess_input_data = True,
                                          multiprocessing_chunksize = 30,
                                          save_eval_checkpoints = False,
                                          do_lower_case = True,
                                          best_model_dir = '/content/dk-go-emotions/model',
                                         # train_batch_size = 8,
                                          wandb_project = "M3-W2-dk-sentiment")
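
These arguments determine how many optimisation steps the run takes, which is what the W&B charts show on the x-axis. A back-of-the-envelope sketch follows; the row count (10,000, suggested by the data file name), the batch size (8, as in the commented-out argument) and the absence of gradient accumulation are all assumptions.

```python
import math

# Assumed numbers, not read from the actual data or model configuration.
n_rows = 10_000                    # suggested by the file name "dk-go-emotions-10k"
n_train = n_rows - n_rows // 10    # 90% of the rows remain after test_size=0.1
batch_size = 8                     # value shown in the commented-out train_batch_size
epochs = 3                         # num_train_epochs from the arguments above

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1125 3375
```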

In [None]:
# initialise the model
model_mlc = MultiLabelClassificationModel('bert', 
                                          'Maltehb/danish-bert-botxo', 
                                          num_labels=28,
                                          args=model_args)

In [None]:
# getting train and eval data into right shape
train_df_t = train_df[['text_dk','labels']]
eval_df_t = eval_df[['text_dk','labels']]

In [None]:
# rename columns to the names simpletransformers expects
train_df_t.columns = ['text','labels']
eval_df_t.columns = ['text','labels']

In [None]:
# make a directory for data and model
!mkdir -p /content/dk-go-emotions

In [None]:
# save data
train_df_t.to_json('/content/dk-go-emotions/train_df_s.json.gz')
eval_df_t.to_json('/content/dk-go-emotions/eval_df_s.json.gz')

In [None]:
# train the model
model_mlc.train_model(train_df_t)

In [None]:
# save model
model_mlc.save_model("/content/dk-go-emotions/model", model=model_mlc.model)

In [None]:
# try out model directly
p, r = model_mlc.predict(['Jeg elsker dig!'])

In [None]:
p

In [None]:
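# p holds thresholded 0/1 predictions; use the raw scores in r to find the strongest label
top_idx = int(np.argmax(r[0]))
print(top_idx)
id2label_dk[top_idx]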
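
In a multi-label setting, more than one emotion can apply to a sentence, so it helps to look at every label above a threshold and not only at the single best one. The sketch below assumes the second value returned by `predict` contains one score between 0 and 1 per label. `decode_multilabel`, the four-label list and the scores are made up for illustration.

```python
def decode_multilabel(scores, labels, threshold=0.5):
    """Return (top_label, labels_at_or_above_threshold) for one row of per-label scores."""
    top_index = max(range(len(scores)), key=lambda i: scores[i])
    active = [labels[i] for i, score in enumerate(scores) if score >= threshold]
    return labels[top_index], active

# Hypothetical label subset and made-up scores for a single sentence.
demo_labels = ["glæde", "kærlighed", "vrede", "neutral"]
demo_scores = [0.62, 0.91, 0.03, 0.10]

top, active = decode_multilabel(demo_scores, demo_labels)
print(top)     # kærlighed
print(active)  # ['glæde', 'kærlighed']
```

With the real model, the same function could be called as `decode_multilabel(list(r[0]), dk_labels)`, provided `r` has the assumed shape.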

## Let's use it with the 🤗 `transformers` package

In [None]:
# load the library and pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model = AutoModelForSequenceClassification.from_pretrained('/content/dk-go-emotions/model')
tokenizer = AutoTokenizer.from_pretrained('/content/dk-go-emotions/model')

# sentiment-analysis pipeline is optimised for that use-case (there are many others)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

classifier('Jeg elsker dig')

For some reason, simpletransformers models do not automatically store the label names in the config.

In [None]:
# we can do that manually
model.config.id2label = id2label_dk
model.config.label2id = label2id_dk

In [None]:
classifier('Jeg elsker dig')

In [None]:
classifier('Du er bare en stor idiot!')
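
One thing to keep in mind when reading these scores: the model was trained as a multi-label classifier, while a text-classification pipeline typically normalises scores with a softmax unless the model config marks the problem as multi-label. That is an assumption about the installed `transformers` version and worth checking. The sketch below, with made-up logits, shows how differently the two functions treat a sentence that fits two labels equally well.

```python
import math

def softmax(logits):
    """Scores compete with each other and sum to 1 (single-label view)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Each score is independent of the others (multi-label view)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Made-up logits: two labels fit equally well, the third does not.
demo_logits = [2.0, 2.0, -3.0]

soft = softmax(demo_logits)
sig = sigmoid(demo_logits)
print([round(s, 3) for s in soft])
print([round(s, 3) for s in sig])
```

Under softmax the two matching labels share the probability mass and both fall below 0.5, whereas under sigmoid both stay well above it.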

HF makes it easy to publish models on its Hub.

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()

In [None]:
model.push_to_hub('M3-W2-dk-sentiment')

In [None]:
tokenizer.push_to_hub('M3-W2-dk-sentiment')