# Phase 2: Model Training

This notebook fine-tunes a pretrained BERT model for sentiment analysis. It uses Hugging Face's `BertForSequenceClassification` to build a classifier that distinguishes among three sentiment classes.

## Model Architecture

- **BertForSequenceClassification**
  - Uses the pretrained `bert-base-uncased` checkpoint from Hugging Face.
  - Adds a dropout layer and a dense (linear) layer on top of the pooled output of the BERT encoder.
  - The dense layer produces three outputs corresponding to:
    - Positive sentiment
    - Negative sentiment
    - Neutral sentiment
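
As a rough illustration of the head described above, the sketch below applies a single linear projection to a pooled encoder vector and returns three logits. It is plain Python, not the library's implementation. The hidden size of 768 matches `bert-base`; the random weights, the `LABELS` order, and the helper name `linear_head` are assumptions made for illustration only, since the notebook does not state which output index maps to which class.

```python
import random

HIDDEN_SIZE = 768  # hidden size of bert-base models
LABELS = ["negative", "neutral", "positive"]  # hypothetical order, not taken from the notebook


def linear_head(pooled, weight, bias):
    """Project one pooled vector to len(bias) logits: logits = W @ pooled + b."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]


rng = random.Random(0)
# Randomly initialised head, standing in for the untrained classification layer.
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in LABELS]
bias = [0.0] * len(LABELS)

# Stand-in for the pooled encoder output of one input sentence.
pooled = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]
logits = linear_head(pooled, weight, bias)

# The predicted class is the index of the largest logit.
predicted = LABELS[max(range(len(logits)), key=logits.__getitem__)]
print(logits, predicted)
```

Only this small projection is new when fine-tuning starts; the encoder below it already carries pretrained weights.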

## Steps

1. **Data Preparation**
   - Load the preprocessed dataset from the previous notebook.

2. **Model Fine-Tuning**
   - Configure the model to use a dense classification layer with three output nodes (`num_labels = 3`).
   - Set up the training parameters. With the `Trainer` defaults the optimizer is AdamW, and `BertForSequenceClassification` computes a cross-entropy loss for single-label classification.
   - Use the **DataCollatorWithPadding** class for dynamic padding: each batch is padded only to the length of its longest sequence, which avoids unnecessary padding.
   - Fine-tune the model on the sentiment analysis task using the training dataset.
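
To make the loss mentioned in step 2 concrete, here is a minimal sketch of cross-entropy for a single example with three logits. The function name `cross_entropy` and the example logits are illustrative assumptions; the model computes the same quantity internally over whole batches.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: log-sum-exp of the logits minus the true-class logit."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


confident = cross_entropy([2.0, 0.5, -1.0], label=0)  # true class has the largest logit
uniform = cross_entropy([0.0, 0.0, 0.0], label=1)     # no preference among the three classes
print(round(confident, 4), round(uniform, 4))
```

A model with no preference among three classes has a loss of ln 3 ≈ 1.0986, which is a useful reference point when reading the training-loss log.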

---

In [5]:
from transformers import AutoTokenizer
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

import torch

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### Load the model and the tokenizer

In [27]:
# define the model that will be used
model_name = 'bert-base-uncased'

In [None]:
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model.to(device)

In [30]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

### Load the dataset

In [None]:
import pickle
with open('data/dataset_sentiment_analysis.pkl', 'rb') as file:
    dataset = pickle.load(file)

In [33]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 61692
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 828
    })
})

### Create the data collator, set the training parameters and create the trainer

In [34]:
# Define the data collator to handle padding dynamically.
data_collator = DataCollatorWithPadding(tokenizer)
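
The idea behind dynamic padding can be sketched in a few lines of plain Python. This is an illustration of the technique, not the collator's actual code: `pad_batch` is a hypothetical helper, the token ids are made up apart from 101 and 102 (`[CLS]` and `[SEP]` in `bert-base-uncased`), and the pad id of 0 is that tokenizer's `[PAD]` id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids; each batch is padded independently.
short_batch = pad_batch([[101, 2307, 102], [101, 2025, 2204, 102]])
long_batch = pad_batch([[101, 2307, 102], [101] + [2023] * 20 + [102]])

print(len(short_batch["input_ids"][0]))  # padded to 4, the longest in this batch
print(len(long_batch["input_ids"][0]))   # padded to 22, the longest in this batch
```

Because the padded length depends on each batch, a batch of short sentences costs far less compute than it would if every example were padded to a fixed maximum length.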

In [None]:
training_args = TrainingArguments(
    output_dir='data/training/',
    report_to="none",  # Disable all reporting integrations (e.g. W&B)
    num_train_epochs=2,
    per_device_train_batch_size=64,
    weight_decay=0.01,
    save_steps=100,
    logging_steps=100,
    learning_rate=3e-5
)
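
The arguments above set only the peak learning rate. Assuming the `Trainer` defaults for everything not listed (a linear schedule with no warmup), the rate decays linearly from `3e-5` to zero over the run. The sketch below reproduces that shape in plain Python; `linear_lr` is a hypothetical helper, and the total of 1928 steps is taken from the training output recorded in this notebook.

```python
def linear_lr(step, base_lr=3e-5, total_steps=1928, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, partway through, at the epoch boundary, and at the end.
for step in (0, 500, 964, 1928):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions the second epoch trains with a smaller rate throughout than the first.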

In [36]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    data_collator=data_collator
)

In [37]:
# Start training
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 0.8832 |
| 200  | 0.7167 |
| 300  | 0.6693 |
| 400  | 0.6127 |
| 500  | 0.5706 |
| 600  | 0.5519 |
| 700  | 0.507  |
| 800  | 0.4817 |
| 900  | 0.4501 |
| 1000 | 0.3775 |


TrainOutput(global_step=1928, training_loss=0.4189438364812447, metrics={'train_runtime': 1927.8391, 'train_samples_per_second': 64.001, 'train_steps_per_second': 1.0, 'total_flos': 5889064127221800.0, 'train_loss': 0.4189438364812447, 'epoch': 2.0})
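
The `global_step` value can be checked against the dataset size and the training arguments. The short calculation below is a sketch that assumes a single device and no gradient accumulation, which is consistent with the reported total.

```python
import math

train_rows = 61692  # from the DatasetDict summary
batch_size = 64     # per_device_train_batch_size
epochs = 2          # num_train_epochs

# Assumes one device and no gradient accumulation; the last batch of each epoch is partial.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 964 1928
```

With `train_steps_per_second` at about 1.0, those 1928 steps account for the roughly 1928-second runtime.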

In [None]:
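# Save the fine-tuned model and the tokenizer to the same directory
# so both can be reloaded together with from_pretrained().
model_path = "data/model"

trainer.model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)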