# The model

## Choosing the model

After some research, multilingual BERT ([mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased), [GitHub](https://github.com/google-research/bert/blob/master/multilingual.md)) seems like a logical first approach for the project.

We will fine-tune mBERT on our own dataset, using our own sentiment labels.

## Fine-tuning strategy

Our goal is to gauge citizens' opinions on the topics that the articles revolve around.

That's why we'll perform sentiment analysis on the **comments** left on each article.

This will give us insight into how citizens feel about the different topics covered in the press.

1. First, we'll want to continue pre-training mBERT on a corpus of Catalan text. Since the model is multilingual, this should give it more insight into the nuances of the Catalan language.

2. Then, we'll have to label our dataset with the relevant sentiment labels. We'll need to:

    - Choose the label set
    - Label each comment

    If the final dataset is fairly large, manually labeling each comment can be an arduous task.

    That is why we will follow a semi-supervised learning approach, performing **pseudo-labeling**:

    - We'll manually label a subset of the original dataset: 200-300 comments.
    - We'll fine-tune mBERT on this labeled subset and use it to predict the labels for the unlabeled data.
    - We will retain the most confident predictions (those with high probability scores), and then treat them as additional labeled data.
    - Finally, we'll retrain the model based on both the original labeled data and the new pseudo-labeled data.

3. Once we have a labeled dataset, we can use it to train mBERT.

4. Model evaluation

    We'll use basic evaluation metrics such as accuracy and the F1-score for each label class.

5. Finally, we can deploy the model to perform analysis on new data.
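The confidence filter in the pseudo-labeling step can be sketched as follows. This is a minimal illustration, not the final implementation: the function name `select_confident`, the `0.9` threshold, and the example probabilities are all assumptions made for the example. In practice the probabilities would come from applying a softmax to the fine-tuned model's logits.

```python
from typing import List, Tuple


def select_confident(
    probabilities: List[List[float]],
    threshold: float = 0.9,  # assumed cut-off; to be tuned on real data
) -> List[Tuple[int, int, float]]:
    """Return (row_index, predicted_class, confidence) for confident rows.

    `probabilities` holds one list of class probabilities per unlabeled comment.
    """
    selected = []
    for row_index, row in enumerate(probabilities):
        confidence = max(row)
        predicted_class = row.index(confidence)
        # Keep only predictions the model is sufficiently sure about.
        if confidence >= threshold:
            selected.append((row_index, predicted_class, confidence))
    return selected


# Hypothetical softmax outputs for three unlabeled comments and four classes.
example_probs = [
    [0.02, 0.03, 0.05, 0.90],
    [0.30, 0.30, 0.20, 0.20],
    [0.01, 0.01, 0.95, 0.03],
]
pseudo_labeled = select_confident(example_probs, threshold=0.9)
```

Here the second comment is discarded because no class reaches the threshold. The surviving rows would be added to the manually labeled set before retraining.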

### Choosing a set of labels

For the comments we'll label manually, we have to choose the set of labels that we later want our model to be able to predict.

We want the model to be able to understand the context of each comment. If a comment replies to somebody else, the agreement or disagreement it expresses may be directed at that reply, and not necessarily at the article.

Having scraped many articles and read the Andorran press for some years, I anticipate that most of these comments will have negative connotations. People tend to leave comments to express disagreement or anger, especially on the topics we have chosen, which are currently generating a lot of debate.

That is why we want to have nuance in our labels.

Our first approach to the label set is:

- Positive
- Neutral
- Negative
- Very Negative
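As a rough sketch, the label set above can be encoded as integer ids, and the per-class F1-score mentioned in the evaluation step can be computed over those ids. The id ordering, the helper `per_class_f1`, and the toy `y_true`/`y_pred` lists are assumptions made for illustration. Since comments are expected to skew negative, looking at F1 per class is likely to be more informative than accuracy alone.

```python
from typing import Dict, List

# Label set from the list above; the id ordering is an arbitrary assumption.
LABELS = ["Positive", "Neutral", "Negative", "Very Negative"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def per_class_f1(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute the F1-score of each class from true and predicted label ids."""
    scores = {}
    for label, class_id in label2id.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == class_id and p == class_id)
        fp = sum(1 for t, p in pairs if t != class_id and p == class_id)
        fn = sum(1 for t, p in pairs if t == class_id and p != class_id)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


# Toy example: six comments with made-up true and predicted ids.
y_true = [0, 1, 2, 2, 3, 3]
y_pred = [0, 2, 2, 2, 3, 2]
f1_scores = per_class_f1(y_true, y_pred)
```

Mappings of this shape can be passed as `id2label` and `label2id` when loading a sequence classification model with `from_pretrained`, and `sklearn.metrics.f1_score` with `average=None` provides the same per-class scores on real data.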

## Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np

In [7]:
from sklearn.model_selection import train_test_split

In [5]:
from datasets import Dataset

In [9]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch