# Phase 3: Model Evaluation

This notebook evaluates the performance of the fine-tuned BERT model on the test dataset and compares it with the non-fine-tuned version. Additionally, it demonstrates the model's prediction capabilities on custom sentences.

## Steps:
1. **Evaluation on the Test Set**

- The model is run on the test set to assess its predictive accuracy.
- The accuracy is computed to quantify the model's performance on unseen data.

2. **Fine-Tuned vs. Non-Fine-Tuned Comparison**

- A direct comparison is made between the fine-tuned model and the non-fine-tuned (pretrained) model.
- The fine-tuned model reaches 96.26% accuracy on the test set, against 29.35% for the non-fine-tuned model, confirming that training substantially improved its ability to predict sentiment.

3. **Custom Sentence Predictions**

- The notebook includes experiments with three custom sentences.
- The predictions for these sentences showcase the model's accuracy in identifying the correct sentiment (positive, negative, or neutral).


---

In [None]:
import torch

from transformers import BertForSequenceClassification, AutoTokenizer

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

### Load the models

In [5]:
# define the model that will be used
model_name = 'bert-base-uncased'

In [None]:
# load the base (pretrained, not fine-tuned) model with a 3-class classification head
base_model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3)
base_model.to(device)

In [None]:
# load the fine-tuned model
model_path = 'data/model/'
fine_tuned_model = BertForSequenceClassification.from_pretrained(model_path)
fine_tuned_model.to(device)

### Load the dataset

In [None]:
import pickle
with open('data/dataset_sentiment_analysis.pkl', 'rb') as file:
    dataset = pickle.load(file)

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 61692
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 828
    })
})

### Run the model on the whole test set


In [62]:
def predict_sentiment(model, input_ids):
  # prepare the list of input_ids to be fed to the model
  input_ids = torch.tensor(input_ids).unsqueeze(0)
  input_ids = input_ids.to(device)

  # run the inference
  model.eval()
  with torch.no_grad():
    outputs = model(input_ids=input_ids)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1)

  return predicted_class.item()
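
Because softmax is strictly increasing, the class with the largest logit is also the class with the largest probability, so the softmax step in `predict_sentiment` changes the reported probabilities but not the predicted class. A minimal pure-Python sketch of this property follows; the logits and the `softmax` and `argmax` helpers are illustrative assumptions, not values or functions from the notebook.

```python
import math

def softmax(logits):
    # subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # index of the largest element
    return max(range(len(values)), key=lambda i: values[i])

# hypothetical logits for the three sentiment classes
example_logits = [-1.2, 0.3, 2.5]
probabilities = softmax(example_logits)

# the predicted class is the same with or without softmax
assert argmax(example_logits) == argmax(probabilities)
```

Keeping the softmax is still useful when the class probabilities themselves are of interest, for example to inspect how confident a prediction is.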

In [51]:
from tqdm import tqdm

# predict on the test set using the fine-tuned model
predictions_fine_tuned_model = []

for example in tqdm(dataset['test']):
  predicted_sentiment = predict_sentiment(fine_tuned_model, example['input_ids'])
  predictions_fine_tuned_model.append(predicted_sentiment)


100%|██████████| 828/828 [03:31<00:00,  3.92it/s]


In [53]:
# now predict on the test set using the base model
predictions_base_model = []

for example in tqdm(dataset['test']):
  predicted_sentiment = predict_sentiment(base_model, example['input_ids'])
  predictions_base_model.append(predicted_sentiment)

100%|██████████| 828/828 [03:28<00:00,  3.98it/s]
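
Both loops above run one example per forward pass. Batching several examples per pass is a common way to reduce inference time, but sequences of different lengths must first be padded to a common length and paired with an attention mask so the model ignores the padding positions. A minimal sketch of that padding step, assuming the `[PAD]` token has id 0 (as in the `bert-base-uncased` vocabulary); `pad_batch` is a hypothetical helper, not part of the notebook, and the token ids are illustrative.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the same length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # real tokens first, padding at the end
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# two illustrative sequences of different lengths
batch_ids, batch_mask = pad_batch([
    [101, 2023, 102],
    [101, 2023, 2003, 2307, 102],
])
```

The two lists could then be converted with `torch.tensor` and passed to the model as `input_ids` and `attention_mask`, with `torch.argmax(outputs.logits, dim=-1)` yielding one prediction per row.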


### Compare the performance

In [56]:
from sklearn.metrics import accuracy_score

# computation of the accuracy
accuracy_base_model = accuracy_score(dataset['test']['labels'], predictions_base_model)
accuracy_fine_tuned_model = accuracy_score(dataset['test']['labels'], predictions_fine_tuned_model)

print(f'Accuracy of the base model: {accuracy_base_model*100:.2f}%')
print(f'Accuracy of the fine-tuned model: {accuracy_fine_tuned_model*100:.2f}%')


Accuracy of the base model: 29.35%
Accuracy of the fine-tuned model: 96.26%
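
Accuracy is the fraction of predictions that match the label. With three classes, uniform random guessing would score about 33% in expectation, so the base model's 29.35% is close to chance; this is plausible given that its classification head is newly initialized when the model is loaded with `num_labels=3`. A single accuracy figure also hides which classes are confused with each other. A minimal sketch of computing accuracy and per-pair confusion counts in plain Python, using toy labels rather than the notebook's data (`accuracy` and `confusion_counts` are hypothetical helpers):

```python
from collections import Counter

def accuracy(labels, predictions):
    # fraction of positions where the prediction equals the label
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def confusion_counts(labels, predictions):
    # maps (true_label, predicted_label) to the number of occurrences
    return Counter(zip(labels, predictions))

# toy data for illustration only
toy_labels = [0, 1, 2, 2, 1, 0]
toy_predictions = [0, 1, 2, 1, 1, 0]

toy_accuracy = accuracy(toy_labels, toy_predictions)
toy_confusions = confusion_counts(toy_labels, toy_predictions)
print(f"Accuracy: {toy_accuracy * 100:.2f}%")
print(toy_confusions)
```

Applied to the notebook's data, the same functions would take `dataset['test']['labels']` and one of the prediction lists as arguments.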


### Run the model on three custom sentences

The model has been trained on tweets, but let's check how it behaves on custom sentences.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# create a dictionary to decode the prediction
decode = {0: 'negative', 1: 'positive', 2: 'neutral'}

In [None]:
sentence_1 = "I am so excited about this new project"
sentence_2 = "Maybe I'll pass by later this week, I don't know yet"
sentence_3 = "There's no way I am going there tonight"

inputs = [sentence_1, sentence_2, sentence_3]

In [None]:
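for sentence in inputs:
  # tokenize the sentence (kept in a separate variable so the list being iterated is not overwritten)
  encoded = tokenizer(sentence)
  prediction = predict_sentiment(fine_tuned_model, encoded['input_ids'])
  print(f"Sentence: {sentence}")
  print(f"Predicted sentiment: {decode[prediction]}\n")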

Sentence: I am so excited about this new project
Predicted sentiment: positive

Sentence: Maybe I'll pass by later this week, I don't know yet
Predicted sentiment: neutral

Sentence: There's no way I am going there tonight
Predicted sentiment: negative

