# NLP: Sentiment Analysis
Choosing and training a model to perform sentiment analysis on Catalan text.

In [None]:
import pandas as pd
import numpy as np

## Choosing a model

After some research, multilingual BERT ([mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased), [GitHub](https://github.com/google-research/bert/blob/master/multilingual.md)) seems like a logical first approach for the project.

We will use mBERT and fine-tune it, training it with our own dataset and labels (sentiments).

## Fine-tuning strategy

Since our dataset is fairly large, manually labeling every text would be arduous. We will therefore follow a semi-supervised learning approach:

1. We have manually labeled a small subset of the data.
2. We'll fine-tune mBERT on this labeled subset and use it to predict labels for the unlabeled data.
3. We will retain the most confident predictions (those with high probability scores), and then treat them as additional labeled data.
4. Finally, we'll retrain the model based on both the original labeled data and the new pseudo-labeled data.
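Step 3 above can be sketched as follows. This is a minimal illustration rather than the project's actual selection code: the helper names (`softmax`, `select_confident`), the example logits, and the `0.9` threshold are all assumptions chosen for the example, and the right threshold would have to be tuned on the labeled subset.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw classifier logits (n_samples, n_classes) into probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def select_confident(logits, threshold=0.9):
    """Return the indices and predicted labels of rows whose top
    class probability is at least `threshold` (hypothetical default)."""
    probs = softmax(np.asarray(logits, dtype=float))
    confidence = probs.max(axis=1)   # probability of the predicted class
    labels = probs.argmax(axis=1)    # predicted class id
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]


# Made-up logits for three unlabeled comments and three classes
example_logits = np.array([
    [4.0, 0.0, 0.0],  # confident prediction of class 0
    [0.2, 0.1, 0.0],  # uncertain: will be discarded
    [0.0, 0.0, 5.0],  # confident prediction of class 2
])
kept_idx, pseudo_labels = select_confident(example_logits, threshold=0.9)
```

Only the rows in `kept_idx` would be added to the training set, paired with `pseudo_labels`; the remaining rows stay unlabeled.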

We want to run sentiment analysis on the **comments** on each article. This will give us insight into how citizens feel about the different topics covered in the press.

## Data pre-processing

The data was obtained using a web crawler (`main_crawler.py`). This program takes a list of terms as input and uses them to search the different publications of the Andorran press for articles and comments.

See `data-preprocessing.ipynb` for details on how this initial dataset was chosen and how the subset was labeled.

In [1]:
# import subset dataset -> labeled
# import large dataset -> unlabeled

### The manually assigned labels

In [1]:
# Get a list of manually assigned labels

In [None]:
num_custom_labels = 3

## Fine-tuning

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

# Load mBERT model and tokenizer
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_custom_labels)
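Since `Trainer` is imported above, one piece that is commonly passed to it is a `compute_metrics` callback, which receives the evaluation logits and true labels. The sketch below is an assumption about how evaluation might be set up here, not part of the original notebook; it reports plain accuracy, and the demo arrays are made up for illustration.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy metric in the shape expected by Trainer(compute_metrics=...).

    `eval_pred` unpacks into (logits, labels): logits has shape
    (n_samples, n_classes) and labels has shape (n_samples,).
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy}


# Quick check with made-up logits for four comments and three classes
demo_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.1, 0.2, 3.0],
    [1.0, 0.9, 0.8],
    [0.0, 2.0, 1.0],
])
demo_labels = np.array([0, 2, 1, 1])
demo_result = compute_metrics((demo_logits, demo_labels))
```

With a small labeled subset and possibly imbalanced sentiments, a per-class metric such as macro F1 may be more informative than accuracy alone.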