# Assignment 3
Submission deadline: 24.11.2024 23:59.

* For tasks that require a text answer, use *Insert* > *Text cell* and write your answer in that cell. Text cells support Markdown.
* Submit your work to the submission box on MyCourses. You should submit only the **.ipynb file** with your code.
* To download the file from Google Colab use *File* > *Download* > *Download .ipynb*.
* Submit the file with the name: A3_NAME_SURNAME_STUDENT NUMBER.ipynb

**NB! Before editing the file, save a local copy to your Google Drive; otherwise your progress will be lost.**

The assignment contains 2 main tasks:
* **A** – Implement the Naive Bayes Classifier **(5 points)**
* **B** – Implement the Transformer **(15 points)**

Further descriptions and specific instructions are provided throughout the assignment. Places where you need to write your code are marked with comments in capital letters, e.g. `# YOUR CODE HERE`.

In the assignment you will be classifying hotel reviews from the Tripadvisor dataset.

In [None]:
# init deterministic seed

seed = # YOUR CODE HERE # INPUT YOUR STUDENT NUMBER HERE, omit the letters
assert type(seed) is int, "Exclude letters, leave only numbers"


In [None]:
# import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format='retina' # high-resolution plots

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# url to our dataset
url = 'https://raw.githubusercontent.com/dsfb2/dsfb2-2024/main/assignment_3/data/tripadvisor_hotel_reviews.csv'

# fix seed
np.random.seed(seed)

**TASK A** – Implement the Naive Bayes Classifier **(5 points)**

- **A1** - explore the dataset and consider what changes you should make. Write a short note on how you should convert the ratings (Tip: computers start counting from 0) (**2 points**)
- **A2** - build the Naive Bayes Classifier and report the results. Please include at least the accuracy score, the classification report and the confusion matrix. Write a short report (roughly 10 sentences) about the results you get. (**3 points**)

No code snippets are provided for this task; you can reuse code parts from the tutorial.
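
As a reminder of what the requested metrics measure, here is a minimal plain-Python sketch on toy data. The 1 to 5 rating scale, the example ratings and the predictions are assumptions made up for illustration, not values from the dataset. The `confusion_matrix` and `accuracy_score` functions imported above compute the same quantities, with rows for true classes and columns for predicted classes.

```python
# Hypothetical toy ratings on an assumed 1-5 scale (not taken from the dataset)
ratings = [5, 4, 1, 3, 5, 2]

# shift to zero-based class labels: 1..5 -> 0..4
labels = [r - 1 for r in ratings]

# hypothetical predictions, only used to illustrate the metrics
predicted = [4, 3, 0, 3, 4, 0]

n_classes = 5
# matrix[i][j] counts samples of true class i that were predicted as class j
matrix = [[0] * n_classes for _ in range(n_classes)]
for true, pred in zip(labels, predicted):
    matrix[true][pred] += 1

# accuracy is the share of samples on the diagonal of the confusion matrix
accuracy = sum(matrix[i][i] for i in range(n_classes)) / len(labels)

print(labels)
print(round(accuracy, 3))
```

Off-diagonal cells show which classes are confused with each other, which is often more informative for the written report than the accuracy alone.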

In [None]:
# add as many cells as you need

**TASK B** – Implement the Transformer **(15 points)**

- **B1** - create the TravelDataset class (**5 points**)
- **B2** - train the Transformer (**5 points**)
- **B3** - evaluate the performance on the epoch with the lowest error and report the results. Please include relevant binary classification metrics, the classification report and the confusion matrix. Write a short report on the results and compare them with the Naive Bayes Classifier (roughly 15 sentences). (**5 points**)

Most of the code skeleton is provided; look for `# YOUR CODE HERE` to complete the functions. To obtain the results, you should write your own code.

Prepare all the code on Google Colab, make sure it runs and then move to Azure Lab.

In [None]:
# install the library that contains checkpoints of models and tokenizers
!pip install transformers

In [None]:
from transformers import BertTokenizerFast

# Defining some key variables that will be used later on in the training, you can select your own parameters

MAX_LEN = # YOUR CODE HERE                # max length of sequence. we will use all 512 as our text articles are long.
TRAIN_BATCH_SIZE = # YOUR CODE HERE       # how many sequences are included in the training batch
VALID_BATCH_SIZE = # YOUR CODE HERE       # how many sequences are included in the validation batch
EPOCHS = # YOUR CODE HERE                 # how many epochs we will use during the training process
LEARNING_RATE =  # YOUR CODE HERE         # our learning rate
TOKENIZER = BertTokenizerFast.from_pretrained('bert-base-uncased', do_lower_case=True) # our tokenizer
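
To see what `MAX_LEN` controls, the sketch below shows in plain Python (not with the real tokenizer) how a token sequence is cut or padded to a fixed length and how the attention mask marks real tokens. The helper `pad_or_truncate`, the example ids and the pad id of 0 are assumptions for illustration only. In the notebook the tokenizer can do this itself through its `padding`, `truncation` and `max_length` arguments, and it also takes care of special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]            # cut sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_len - len(ids)         # number of pad positions to add
    return ids + [pad_id] * padding, mask + [0] * padding

# hypothetical token ids for one short review; max_len=6 only for the demo
input_ids, attention_mask = pad_or_truncate([101, 2307, 3309, 102], max_len=6)

print(input_ids)
print(attention_mask)
```

A larger `MAX_LEN` keeps more of each long review but increases memory use per batch, so it usually has to be balanced against the batch size.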

In [None]:
import torch
class TravelDataset(torch.utils.data.Dataset):

    def __init__(self, df):

        self.labels = # YOUR CODE HERE # create labels for each review
        self.texts = # YOUR CODE HERE # create tokens for each review

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Fetch a batch of labels
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        # Fetch a batch of inputs
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y

In [None]:
df_train, df_val = # YOUR CODE HERE

print(f'Training set length is {len(df_train)} and validation set length {len(df_val)}')
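
One possible way to think about a reproducible split is sketched below in plain Python: shuffle the row indices with a fixed seed and cut off a validation share. The helper `split_indices`, the 20% validation share and the demo seed are assumptions, not requirements of the assignment; with a DataFrame, such index lists could be passed to `df.iloc[...]`.

```python
import random

def split_indices(n_rows, val_fraction, seed):
    """Shuffle row indices reproducibly and cut off a validation share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # same seed -> same order
    n_val = int(n_rows * val_fraction)       # size of the validation part
    return indices[n_val:], indices[:n_val]

# hypothetical values: 10 rows, 20% validation, made-up seed
train_idx, val_idx = split_indices(n_rows=10, val_fraction=0.2, seed=123456)

print(len(train_idx), len(val_idx))
```

Using the same seed every time means the validation set stays fixed between runs, so results from different experiments remain comparable.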

In [None]:
from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):

    def __init__(self, dropout=0.3):

        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased') # pre-trained transformer
        self.dropout = nn.Dropout(dropout)                         # dropout applied to the pooled output
        self.linear = # YOUR CODE HERE                            # classification fully-connected layer, you can add additional layers if you want
        self.relu = nn.ReLU()                                      # ReLU activation function

    def forward(self, input_id, mask):

        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output)

        return final_layer

In [None]:
import sys
import os
import getpass

# before we define our training loop, we create the folder, where we will save our model checkpoints
# check if running on Google Colab
if 'google.colab' in str(get_ipython()):
    
    print('Running on Colab')
    
    # import the Google Colab GDrive connector
    from google.colab import drive

    # mount GDrive inside the Colab notebook
    drive.mount('/content/drive')
    
    # name Colab Notebooks directory
    CHECKPOINT_DIRECTORY = '/content/drive/MyDrive/Colab Notebooks/dsfb2/a3'
    
else:
    # check if running on MacOS
    if sys.platform == 'darwin':
        print('Running on MacOS')

        # get the username
        user_name = getpass.getuser()

        # name main directory
        CHECKPOINT_DIRECTORY = f"/Users/{user_name}/dsfb2/a3"

    # check if running on Windows
    elif sys.platform == 'win32':
        print('Running on Windows')

        # get the username
        user_name = getpass.getuser()

        # name main directory
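        CHECKPOINT_DIRECTORY = f"C:/Users/{user_name}/dsfb2/a3"

    # otherwise (e.g. Linux) fall back to a folder in the user's home directory
    else:
        print('Running on another platform')
        CHECKPOINT_DIRECTORY = os.path.join(os.path.expanduser('~'), 'dsfb2', 'a3')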

# create the main directory
os.makedirs(CHECKPOINT_DIRECTORY, exist_ok=True)


In [None]:
from torch.optim import Adam
from tqdm import tqdm
import os

# function for training and validation
def train_validate(model, train_data, val_data, learning_rate, epochs):

    # create tokenized datasets for training and validation
    train, val = # YOUR CODE HERE

    # create loaders for tensors
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=VALID_BATCH_SIZE)

    # activate GPU computing
    device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu').type
    print('[LOG] notebook with {} computation enabled'.format(str(device)))

    # initialize loss function
    criterion = # YOUR CODE HERE

    # initialize optimizer
    optimizer = # YOUR CODE HERE

    # send model and loss function to computational device
    model = # YOUR CODE HERE
    criterion = # YOUR CODE HERE

    # initialize empty lists for storing data
    history_val_loss = []   # average validation loss for epoch
    history_train_loss = [] # average training loss for epoch
    history_val_acc = []    # validation accuracy for epoch
    history_train_acc = []  # training accuracy for epoch

    # training and validation cycle
    for epoch in range(epochs):

        # set the model to the training mode (gradients are updated)
        # YOUR CODE HERE

        # initialize list for storing loss for each propagation
        loss_train = []

        # initialize lists for storing actual and predicted labels
        train_label_list = []
        train_output_list = []

################## TRAINING ##################

        # get our train input and label tensors from the loader, tqdm is just a nice progress bar
        for train_input, train_label in tqdm(train_dataloader):

            # send training label, attention mask and id to device
            train_label = # YOUR CODE HERE
            mask =  # YOUR CODE HERE
            input_id = # YOUR CODE HERE

            # receive predicted label
            output = # YOUR CODE HERE

            # calculate the loss value between actual and predicted label
            batch_loss = # YOUR CODE HERE

            # store the loss
            loss_train.append(batch_loss.item())

            # save actual and predicted values
            train_label_list.extend(train_label.cpu().detach().numpy().tolist())
            train_output_list.extend(torch.sigmoid(output).cpu().detach().numpy().tolist())

            # reset graph gradients
            # YOUR CODE HERE

            # run backward pass to compute the gradients
            # YOUR CODE HERE

            # update network parameters
            # YOUR CODE HERE

        # calculate average training loss
        total_loss_train = np.mean(loss_train)
        # append average training loss
        history_train_loss.append(total_loss_train)
        # calculate training accuracy
        acc_train = accuracy_score(np.array(train_label_list).astype(int), np.argmax(np.array(train_output_list), axis=1))
        # append training accuracy
        history_train_acc.append(acc_train)

################## VALIDATION ##################

        # initialize list for storing loss for each propagation
        loss_val = []

        # initialize lists for storing actual and predicted labels
        val_label_list = []
        val_output_list = []

        # set the model to the evaluation mode (dropout is disabled)
        model.eval()

        with torch.no_grad():

            for val_input, val_label in val_dataloader:

                # send validation label, attention mask and id to device
                val_label = # YOUR CODE HERE
                mask =  # YOUR CODE HERE
                input_id =  # YOUR CODE HERE

                # receive predicted label
                output =  # YOUR CODE HERE

                # calculate the loss value between actual and predicted label
                batch_loss =  # YOUR CODE HERE

                # store the loss
                loss_val.append(batch_loss.item())

                # save actual and predicted values
                val_label_list.extend(val_label.cpu().detach().numpy().tolist())
                val_output_list.extend(torch.sigmoid(output).cpu().detach().numpy().tolist())

        # calculate average validation loss
        total_loss_val = np.mean(loss_val)
        # append average validation loss
        history_val_loss.append(total_loss_val)
        # calculate validation accuracy
        acc_val = accuracy_score(np.array(val_label_list).astype(int), np.argmax(np.array(val_output_list), axis=1))
        # append validation accuracy
        history_val_acc.append(acc_val)

        print(f'Epochs: {epoch} | Train Loss: {total_loss_train: .3f} | Train Accuracy: {acc_train: .3f} | Val Loss: {total_loss_val: .3f} | Val Accuracy: {acc_val: .3f}')
        model_name = f'{epoch}_tripadvisor_classifier.pth'
        model_path = os.path.join(CHECKPOINT_DIRECTORY, model_name)
        torch.save(model.state_dict(), model_path)

    return history_train_loss, history_val_loss, history_train_acc, history_val_acc
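
Task B3 asks for the epoch with the lowest error. Assuming this means the lowest validation loss, the sketch below picks that epoch from the returned history and rebuilds the checkpoint path using the file naming from the loop above. The helper `best_checkpoint`, the loss values and the `'checkpoints'` folder are hypothetical and only serve the illustration.

```python
import os

def best_checkpoint(history_val_loss, checkpoint_directory):
    """Return (epoch, path) for the epoch with the lowest validation loss."""
    # index of the smallest loss; epochs are numbered from 0 as in the loop
    best_epoch = min(range(len(history_val_loss)), key=history_val_loss.__getitem__)
    model_name = f'{best_epoch}_tripadvisor_classifier.pth'
    return best_epoch, os.path.join(checkpoint_directory, model_name)

# hypothetical per-epoch validation losses, not real results
epoch, path = best_checkpoint([0.91, 0.74, 0.79], 'checkpoints')

print(epoch, path)
```

In the notebook, `CHECKPOINT_DIRECTORY` would take the place of the hypothetical folder name, and the resulting path is the file to load before evaluation.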


In [None]:
# initialize our model
# YOUR CODE HERE

Before training the model, move the code to Azure Lab.

In [None]:
# check that gpu is activated

device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu').type

if device == 'cuda':
    !nvidia-smi
elif device == 'mps':
    print('Using Apple M-series SoC GPU accelerator')
else:
    print('Using CPU')

In [None]:
# train the model
# YOUR CODE HERE

In [None]:
# plotting the loss
# YOUR CODE HERE

In [None]:
# plotting the accuracy
# YOUR CODE HERE

In [None]:
def evaluate(model, test_data):

    # create tokenized dataset
    test = # YOUR CODE HERE

    # create loaders for tensors
    val_dataloader = torch.utils.data.DataLoader(test, batch_size=VALID_BATCH_SIZE)

    # activate GPU computing
    device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu').type
    print('[LOG] notebook with {} computation enabled'.format(str(device)))

    # initialize loss function
    criterion =  # YOUR CODE HERE

    model =  # YOUR CODE HERE
    criterion =  # YOUR CODE HERE

    loss_val = []

    val_label_list = []
    val_output_list = []

    # set the model to the evaluation mode (dropout is disabled)
    model.eval()
    with torch.no_grad():

        for val_input, val_label in val_dataloader:

            # send validation label, attention mask and id to device
            val_label =  # YOUR CODE HERE
            mask =  # YOUR CODE HERE
            input_id =  # YOUR CODE HERE

            # receive predicted label
            output =  # YOUR CODE HERE

            # calculate the loss value between actual and predicted label
            batch_loss =  # YOUR CODE HERE

            # store the loss
            loss_val.append(batch_loss.item())

            # save actual and predicted values
            val_label_list.extend(val_label.cpu().detach().numpy().tolist())
            val_output_list.extend(torch.sigmoid(output).cpu().detach().numpy().tolist())

        # calculate average loss and accuracy
        total_loss_val = np.mean(loss_val)
        acc_val = accuracy_score(np.array(val_label_list).astype(int), np.argmax(np.array(val_output_list), axis=1))

    print(f'Test Accuracy: {acc_val: .3f}')

    # return actual and predicted values
    return np.array(val_label_list).astype(int), np.argmax(np.array(val_output_list), axis=1)

In [None]:
# load saved state
# YOUR CODE HERE

In [None]:
# evaluate model
# YOUR CODE HERE

In [None]:
# create classification report
# YOUR CODE HERE

In [None]:
# create confusion matrix
# YOUR CODE HERE