# NLP and Transformers

This tutorial covers natural language processing (NLP) and Transformers. Its aim is to introduce text classification techniques using the classic Naive Bayes classifier and Transformers. For the Transformer part, we will use the pre-trained [BERT Base](https://huggingface.co/blog/bert-101) model from Google.

The tutorial is structured as follows:


1.   How to make a computer understand words?
2.   Naive Bayes classifier
3.   Introduction to Transformers
4.   Training BERT to suit our needs
5.   Validation and graphical representation of results



# 1. How to make a computer understand words?

One of the questions that arises when working with text data is:
> *How are we going to use the raw text data to train the model? The raw data is just a collection of strings!*

From previous experience, we know that ML methods usually require numerical input. Moreover, these numerical feature vectors should have a fixed size, which is hard to achieve with text data: each raw text *document* has its own length.

Therefore, some techniques are required to extract numerical features from the text content. The two most basic ones are **tokenizing** and **counting**.

 - **Tokenization** refers to the process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words.
 - **Counting** is the calculation of how many times each token appears in a *document*.

In this setting, the occurrence frequency of each individual token is treated as a **feature**. The vector of all token frequencies for a given *document* is considered a multivariate sample.
Below is a counting example.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# example text
corpus_example = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [None]:
# initialize the CountVectorizer() object
vectorizer_example = CountVectorizer()

# create the feature vector from the given corpus
X_example = vectorizer_example.fit_transform(corpus_example)

In [None]:
# print out the features array
print(vectorizer_example.get_feature_names_out())

In the toy example above, we have a collection of strings stored in the variable `corpus_example`. Using `CountVectorizer()`, we extract the set of unique tokens (the vocabulary) in our data.
It can be inspected by printing `vectorizer_example.get_feature_names_out()`. We observe 9 unique words.

In [None]:
# print out the frequency matrix
print(X_example.toarray())

Next, we print the transformed data `X_example.toarray()` and observe the following:
- `X_example` has 4 rows, one per text string.
- All samples have the same number of columns (features/variables) in the numerical representation (`X_example`). This was not the case before: the individual strings had different lengths.
- The values 0, 1, 2 encode how often a word appears in the corresponding text.
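
To make the tokenizing and counting steps concrete, here is a minimal pure-Python sketch that rebuilds the same vocabulary and count matrix by hand. It assumes the rule that, to our understanding, `CountVectorizer` applies by default: lowercase the text and keep runs of two or more word characters. The helper name `tokenize` is our own.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# vocabulary: sorted unique tokens across the whole corpus
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# one fixed-length count vector per document
count_matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))  # missing tokens count as 0
    count_matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Every row has one entry per vocabulary word, which is how documents of different lengths end up as vectors of the same size.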

However, this approach puts a lot of weight on common words that carry little meaning (such as *is*, *the*, *this*). Thus, it is beneficial to use a method that compensates for this problem. This method is called **TF-IDF**, Term Frequency - Inverse Document Frequency.

---
## TF-IDF
### Term Frequency (TF)

Suppose we have a set of English text documents and wish to rank them by relevance to the query "Data Science is awesome!". A simple way to start is to eliminate documents that do not contain all four words "Data", "Science", "is", and "awesome", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its **term frequency**. The weight of a term in a document is simply proportional to its term frequency.

$$ tf(t, d)=\frac{\text{count of } t \text{ in } d}{\text{number of words in } d} $$

### Document Frequency (DF)

*DF* measures how widespread a term is across the whole corpus and is very similar to *TF*. The only difference is that *TF* counts the occurrences of a term $t$ in a single document $d$, whereas *DF* counts the documents, out of the $N$ documents in the corpus, in which the term $t$ is present. A document counts once if it contains the term at least once; we do not need to know how many times the term appears in it.

$$df(t) = \text{number of documents (out of } N \text{) that contain } t$$

### Inverse Document Frequency (IDF)

While computing *TF*, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. Thus we need to weigh down the frequent terms and scale up the rare ones. This is done with the *inverse document frequency* (*IDF*) factor, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
*IDF* is the inverse of the document frequency and measures the informativeness of term $t$. It is very low for the most frequent words, which finally gives what we want: a relative weight.

$$idf(t) = \frac{N}{df(t)}$$

There are a few problems with this *IDF*. For a large corpus, say $N = 100\,000\,000$ documents, the *IDF* of a rare term becomes very large. To dampen this effect we take the logarithm of the ratio.
At query time, a word that is not in the vocabulary has a *df* of 0. As we cannot divide by 0, we smooth the value by adding 1 to the denominator. The final formula is then:

$$idf(t) = \log\frac{N}{df(t)+1}$$

Multiplying the two factors gives us the **tf-idf** formula:

$$tf\text{-}idf(t,d) = tf(t,d) \cdot \log\frac{N}{df(t)+1}$$
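
As a sanity check, here is a small pure-Python sketch of tf-idf computed as the product of the term frequency and the smoothed logarithmic idf defined above. The documents are the toy corpus from the counting example, already tokenized by hand; the helper names `tf`, `df`, `idf` and `tf_idf` are our own.

```python
import math

# toy corpus, already tokenized and lowercased
docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]
N = len(docs)

def tf(term, doc):
    # share of the document's words that are equal to `term`
    return doc.count(term) / len(doc)

def df(term):
    # number of documents that contain `term` at least once
    return sum(1 for doc in docs if term in doc)

def idf(term):
    # smoothed logarithmic inverse document frequency
    return math.log(N / (df(term) + 1))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

for term in ['the', 'document', 'second']:
    print(term, df(term), round(idf(term), 3), round(tf_idf(term, docs[1]), 3))
```

The rare word *second* receives the largest weight. Note a side effect of the "+1" smoothing: a word that appears in every document, such as *the*, gets a slightly negative idf, and *document* (3 of 4 documents) gets exactly 0.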

With scikit-learn, it looks as follows:


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize the TfidfVectorizer() object
tf_idf_example = TfidfVectorizer()

# create the feature vector from the given corpus
X_tf_idf = tf_idf_example.fit_transform(corpus_example)

# print out the features array
print(tf_idf_example.get_feature_names_out())


We obtain the same 9 unique words as in the counting example above.

In [None]:
# print out the weight matrix
print(X_tf_idf.toarray())

Now, instead of frequencies, we get the tf-idf weights.
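
The printed weights will not match the textbook formula from the previous section number for number. To our understanding, `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm='l2'`) uses the raw count as tf, the smoothed idf $\ln\frac{1+N}{1+df(t)} + 1$, and then scales every row to unit Euclidean length. The sketch below reproduces that variant in pure Python under these assumptions; the helper names are our own.

```python
import math

# toy corpus, already tokenized and lowercased
docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]
N = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

def smooth_idf(term):
    # assumed scikit-learn default: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + N) / (1 + df)) + 1

def tfidf_row(doc):
    # raw count times idf, then L2 normalisation of the whole row
    raw = [doc.count(term) * smooth_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(weight * weight for weight in raw))
    return [weight / norm for weight in raw]

row = tfidf_row(docs[0])
print([round(weight, 4) for weight in row])
```

With this variant no weight is negative, and words that occur in every document (*is*, *the*, *this*) keep a small positive weight instead of dropping to zero.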

This is how we can transform text into a numerical representation. Now let's have a look at text classification.

# 2. Naive Bayes classifier

Naive Bayes classifiers are a collection of classification algorithms based on **Bayes' Theorem**. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.

The text dataset we would like to classify is divided into two parts: the **feature matrix** and the **response/target vector**.
- The **feature matrix ($X$)** contains all the vectors (rows) of the dataset, where each vector holds the feature values of one sample. The number of features is $d$, i.e. a sample is $X = (x_1,x_2,x_3,...,x_d)$, where $x_i$ is the numerical value (for example the count or tf-idf weight) of the $i$-th word in the vocabulary. All vectors share the same length $d$.
- The **response/target vector** ($y$) contains the value of the class/group variable for each row of the feature matrix.

Naive Bayes assumes that, within a class, each feature/variable makes an independent and equal contribution to the outcome. These assumptions are generally not correct in real-world situations. In fact, the independence assumption is often not met, which is why the method is called "naive": it assumes something that might not be true.

Given a sample $X$ and a class $y$, Bayes' Theorem lets us state our problem as:
$$P(y|X)=\frac{P(y)\,P(X|y)}{P(X)}$$
Here $P(y|X)$ is the probability of observing the class $y$ given the sample $X$.

Now the *naive* conditional independence assumption comes into play. Let's assume that all features in $X$ are mutually independent, conditional on the category $y$:
$$P(y|X)=\frac{P(y)\prod_{i=1}^{d}P(x_i|y)}{P(x_1, x_2, \ldots, x_d)}$$

The denominator does not depend on $y$, so it is constant for a given input and we can drop it:
$$P(y|X) \propto P(y)\prod_{i=1}^{d}P(x_i|y)$$

Finally, to classify a sample, we evaluate this expression for all possible values of the class variable $y$ and pick the class with the maximum probability:
$$\hat{y}= \arg\max_y P(y)\prod_{i=1}^{d}P(x_i|y)$$
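
The decision rule can be sketched in a few lines of pure Python. This is only an illustration under our own assumptions: the four training documents are invented, probabilities are estimated from word counts with add-one (Laplace) smoothing so that unseen word and class combinations do not produce a zero probability, and the product is computed as a sum of logarithms to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

# hypothetical, already tokenized training documents with their classes
train = [
    (['goal', 'match', 'team'], 'sport'),
    (['team', 'win', 'match'], 'sport'),
    (['market', 'shares', 'profit'], 'business'),
    (['profit', 'bank', 'market'], 'business'),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocabulary = {token for tokens, _ in train for token in tokens}

def log_posterior(tokens, label, alpha=1.0):
    # log P(y) + sum_i log P(x_i | y), up to the constant denominator
    score = math.log(class_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for token in tokens:
        if token not in vocabulary:
            continue  # words never seen in training are ignored
        likelihood = (word_counts[label][token] + alpha) / (total + alpha * len(vocabulary))
        score += math.log(likelihood)
    return score

def predict(tokens):
    # arg max over all classes
    return max(class_counts, key=lambda label: log_posterior(tokens, label))

print(predict(['team', 'goal']))
print(predict(['bank', 'profit']))
```

Because the logarithm is monotonic, taking the arg max of the log score selects the same class as the arg max of the product in the formula.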

Let's see how we can do this in Python:

In [None]:
# import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format='retina' # high-resolution plots

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# url to our dataset
url = 'https://raw.githubusercontent.com/dsfb2/dsfb2-2023/main/assignment_3/data/bbc-text.csv'

In this tutorial we will use the [BBC News](https://www.kaggle.com/c/learn-ai-bbc) dataset from a Kaggle competition. The dataset comprises 2225 articles, each labeled with one of 5 categories: business, entertainment, politics, sport or tech.

In [None]:
# load data
df = pd.read_csv(url)

print(f'[LOG] Dataset contains: {len(df)} entries')

In [None]:
# explore data
df.head()

In [None]:
# explore the text
print(f'Text: {df.text[0]}')
print(f'Text length: {len(df.text[0])}')

We can see that the text is clean and lowercased, so we can use it to build a classifier without any further preparation. The articles are long, so we can expect good performance even from a Naive Bayes classifier.

In [None]:
# check the unique categories
print(f'[LOG] Unique news categories: {df["category"].unique()}')

In [None]:
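# explore the distribution of categories
# take names and heights from the same Series so that bars and labels stay aligned
category_counts = df["category"].value_counts()
plt.bar(category_counts.index, category_counts.values)
plt.show()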

As our categories are strings, we will convert them into numerical classes. Here we use a simple integer encoding (one integer per category); one-hot encoding is a possible alternative.

In [None]:
# create the encoding for classes: one integer per category, in order of first appearance
mapping = {cat: idx for idx, cat in enumerate(df["category"].unique())}

print(mapping)

In [None]:
# encode news classes creating a new column
df["label"] = df["category"].map(mapping)
df.head()
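
For comparison, the following sketch shows integer encoding next to one-hot encoding on a small, hypothetical list of categories. It also builds the inverse mapping, which is handy for turning predicted class numbers back into names. All variable names here are illustrative.

```python
# hypothetical sample of category names
categories = ['tech', 'business', 'sport', 'business', 'politics']

# integer (label) encoding: one integer per class, in order of first appearance
mapping = {cat: idx for idx, cat in enumerate(dict.fromkeys(categories))}
labels = [mapping[cat] for cat in categories]

# inverse mapping, to translate predicted class numbers back into names
inverse_mapping = {idx: cat for cat, idx in mapping.items()}

# one-hot encoding: one binary vector per sample, with a single 1 at the class position
one_hot = [[int(idx == label) for idx in range(len(mapping))] for label in labels]

print(mapping)
print(labels)
print(one_hot[0])
print(inverse_mapping[labels[3]])
```

Integer labels are sufficient for this tutorial, because both scikit-learn classifiers and PyTorch's `nn.CrossEntropyLoss` accept class indices directly.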

We will now split our dataset into training and test parts to run the Naive Bayes classifier.

In [None]:
# create feature and label vectors
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.3, random_state=42)

Here a pipeline is built to simplify the process. Simply put, a pipeline defines the order in which the data is transformed, and scikit-learn lets us build one with `make_pipeline`. First, the TF-IDF transformation is applied. Then a Multinomial Naive Bayes classifier predicts the classes.

In [None]:
# build, train and predict using the Naive Bayes model
model_NB = make_pipeline(TfidfVectorizer(), MultinomialNB())
model_NB.fit(X_train, y_train)
y_hat = model_NB.predict(X_test)

In [None]:
# let's look into the model performance:
print(f'Overall accuracy: {accuracy_score(y_test, y_hat):.3f}')
print('Classification report:')
print(classification_report(y_test, y_hat))

We can see that the Naive Bayes classifier performs very well. It is rather rare for a baseline model to reach such a high accuracy. In our case this can be attributed to the high quality of the data and the length of the individual pieces of text.

In [None]:
con_mat = confusion_matrix(y_test, y_hat)
sns.heatmap(con_mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('True labels')
plt.ylabel('Predicted labels')
plt.show()
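
To read the heatmap, it helps to see how per-class precision and recall follow from the confusion matrix. The sketch below uses invented labels and our own helper names. It follows the convention of `confusion_matrix`, where rows are true classes and columns are predicted classes; the heatmap above plots the transpose, so there the columns are the true classes.

```python
def confusion(y_true, y_pred, n_classes):
    # rows are true classes, columns are predicted classes
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

def precision_recall(matrix, k):
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    precision = true_positives / predicted_as_k if predicted_as_k else 0.0
    recall = true_positives / actually_k if actually_k else 0.0
    return precision, recall

# hypothetical true and predicted labels for three classes
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

cm = confusion(y_true, y_pred, 3)
for row in cm:
    print(row)
for k in range(3):
    print(k, precision_recall(cm, k))
```

Off-diagonal cells show which pairs of classes the model confuses, which a single accuracy number cannot reveal.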

# 3. Introduction to Transformers

## Who (what) is BERT?
Introduced by Google in 2018, [BERT](https://blog.research.google/2018/11/open-sourcing-bert-state-of-art-pre.html) is an acronym for Bidirectional Encoder Representations from Transformers. The name itself gives us several clues to what BERT is all about.
The BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder contains two sub-layers: a self-attention layer and a feed-forward layer.
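
To give a feeling for what a self-attention layer computes, here is a heavily simplified, dependency-free sketch of scaled dot-product attention. It is an illustration under our own assumptions, not BERT's implementation: real BERT first maps each token to queries, keys and values with learned linear projections and runs several attention heads in parallel, whereas here the invented 2-dimensional token vectors are used directly in all three roles.

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of numbers
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def self_attention(queries, keys, values):
    d_k = len(keys[0])
    outputs, weights = [], []
    for query in queries:
        # similarity of this token to every token in the sequence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
        row = softmax(scores)
        weights.append(row)
        # new representation: attention-weighted average of the value vectors
        outputs.append([sum(w * value[c] for w, value in zip(row, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# hypothetical embeddings for a three-token sequence
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for row in weights:
    print([round(weight, 3) for weight in row])
```

Each row of `weights` sums to 1 and assigns a non-zero weight to every position, both earlier and later in the sequence.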

There are two different BERT models:
- BERT base, which consists of 12 Transformer encoder layers, 12 attention heads, **hidden size 768**, and 110M parameters - **we will use this one to speed up computations**
- BERT large, which consists of 24 Transformer encoder layers, 16 attention heads, hidden size 1024, and 340M parameters.
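
The parameter counts can be roughly reproduced from the architecture. The sketch below is a back-of-the-envelope calculation with assumptions that are not stated in this tutorial: a vocabulary of 30522 tokens, 512 positions and 2 segment types (the values we take from the published `bert-base-uncased` configuration), and a feed-forward layer 4 times wider than the hidden size. It counts the encoder plus the pooling layer and ignores any task-specific head.

```python
def bert_encoder_params(layers, hidden, vocab=30522, max_positions=512, type_vocab=2):
    intermediate = 4 * hidden   # assumed width of the feed-forward layer
    layer_norm = 2 * hidden     # scale and shift vectors
    # token, position and segment embeddings, followed by a layer norm
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # query, key, value and output projections (weights + biases), plus a layer norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # two dense layers (weights + biases), plus a layer norm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    # dense layer applied to the [CLS] vector
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

print(f'BERT base:  {bert_encoder_params(12, 768):,}')
print(f'BERT large: {bert_encoder_params(24, 1024):,}')
```

Under these assumptions the totals come out at roughly 110M and 335M parameters, in line with the rounded figures quoted above. The number of attention heads does not appear in the formula, because the heads split the hidden size between them rather than adding parameters.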


There are at least two reasons why BERT is a powerful language model:
- It is pre-trained on unlabeled data extracted from BooksCorpus, which has 800M words, and from Wikipedia, which has 2,500M words.
- As the name suggests, it is pre-trained by utilizing the bidirectional nature of the encoder stacks. This means that BERT learns information from a sequence of words not only from left to right, but also from right to left.

Thankfully, we are not going to train the transformer ourselves. It took Google 4 days and a huge amount of computational power (4 Cloud TPUs) to train BERT, not to mention all the preliminary attempts. Instead, we will take a model checkpoint, load it and build a classifier on top of it.

In [None]:
# install the library that contains checkpoints of models and tokenizers
!pip install transformers

The BERT model expects a sequence of tokens (words or word pieces) as input. Each sequence contains two special tokens that BERT expects:
- [CLS]: This is the first token of every sequence, which stands for classification token.
- [SEP]: This is the token that makes BERT know which token belongs to which sequence. This special token is mainly important for a next sentence prediction task or question-answering task. If we only have one sequence, then this token will be appended to the end of the sequence.

It is also important to note that the maximum number of tokens that can be fed into the BERT model is 512. If a sequence has fewer than 512 tokens, we can use padding to fill the unused token slots with the [PAD] token. If a sequence has more than 512 tokens, we need to truncate it.
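
The padding and truncation logic can be sketched without any library. This is a simplified illustration with a hypothetical helper `pad_or_truncate`: it splits on whitespace, whereas the real BERT tokenizer splits text into WordPiece sub-word units and works with token ids.

```python
def pad_or_truncate(tokens, max_length):
    # reserve two slots for the special [CLS] and [SEP] tokens
    body = tokens[: max_length - 2]
    sequence = ['[CLS]'] + body + ['[SEP]']
    attention_mask = [1] * len(sequence)   # 1 marks real tokens
    padding = max_length - len(sequence)
    sequence += ['[PAD]'] * padding
    attention_mask += [0] * padding        # 0 marks padding
    return sequence, attention_mask

tokens = 'i will try to code a good transformer tonight'.split()

padded, padded_mask = pad_or_truncate(tokens, 15)   # shorter than the limit: padded
print(padded)
print(padded_mask)

cut, cut_mask = pad_or_truncate(tokens, 6)          # longer than the limit: truncated
print(cut)
print(cut_mask)
```

Either way, every sequence ends up with exactly `max_length` entries, and the attention mask tells the model which positions to ignore.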


In [None]:
# import the tokenizer, which will create tokens in the correct way
from transformers import BertTokenizerFast

# we will use an uncased version, because our data contains only lower-case text
# do_lower_case is the tokenizer option that controls lowercasing of the input
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', do_lower_case=True) # for languages other than english you can use bert-base-multilingual-cased or language-specific versions of bert

example_text = 'i will try to code a good transformer tonight'


bert_input = tokenizer(
    example_text,           # text we want to tokenize
    max_length = 15,        # maximum length of text sequence we accept (limited to 512)
    padding='max_length',   # how we should behave if text length < max length (to pad each sequence to the maximum length that you specify.)
    truncation=True,        # how we should behave if text length > max length (if True, then the tokens in each sequence that exceed the maximum length will be truncated.)
    return_tensors="pt")    # return Pytorch type of tensors, tf for TensorFlow


print(bert_input['input_ids'])  # the id representation of each token
print(bert_input['token_type_ids']) # a binary mask that identifies in which sequence a token belongs. If we only have a single sequence, then all of the token type ids will be 0. For a text classification task, token_type_ids is an optional input for our BERT model.
print(bert_input['attention_mask']) # a binary mask that identifies whether a token is a real word or just padding. If the token contains [CLS], [SEP], or any real word, then the mask would be 1. Meanwhile, if the token is just padding or [PAD], then the mask would be 0.

# decode the ids back into text to see the added [CLS], [SEP] and [PAD] tokens
decoded_text = tokenizer.decode(bert_input.input_ids[0])

print(decoded_text)

The BERT model then outputs an embedding vector of size 768 for each token. We can use these vectors as input for different kinds of NLP applications, such as text classification, next sentence prediction, Named-Entity Recognition (NER), or question answering.

For a text classification task, we focus on the embedding vector output for the special [CLS] token. This means that we use the size-768 embedding vector of the [CLS] token as input for our classifier, which then outputs a vector whose size equals the number of classes in our classification task.

Before going further, let's specify some parameters that we will use later.

In [None]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 512             # max length of sequence. we will use all 512 as our text articles are long.
TRAIN_BATCH_SIZE = 4      # how many sequences are included in the training batch
VALID_BATCH_SIZE = 4      # how many sequences are included in the validation batch
EPOCHS = 5                # how many epochs we will use during the training process
LEARNING_RATE = 1e-05     # our learning rate
TOKENIZER = BertTokenizerFast.from_pretrained('bert-base-uncased', do_lower_case=True) # our tokenizer

Now that we know what kind of output we get from `BertTokenizerFast`, let's build a `NewsDataset` class that serves our news data to the model.

In [None]:
import torch
class NewsDataset(torch.utils.data.Dataset):

    def __init__(self, df):

        self.labels = [label for label in df['label']] # create labels for each article
        self.texts = [TOKENIZER(text,
                                padding='max_length',
                                max_length = MAX_LEN,
                                truncation=True,
                                return_tensors="pt") for text in df['text']] # create tokens for each article

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Fetch the label(s) at position idx
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        # Fetch the tokenized input(s) at position idx
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y

After defining the dataset class, let's split our dataframe into training and validation sets in an 80:20 proportion.

In [None]:
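seed = 42
np.random.seed(seed)

# shuffle the rows, then cut at 80% of the data
df_shuffled = df.sample(frac=1, random_state=seed)
split_idx = int(.8*len(df))
df_train, df_val = df_shuffled.iloc[:split_idx], df_shuffled.iloc[split_idx:]

print(f'Training set length is {len(df_train)} and validation set length {len(df_val)}')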

# 4. Training BERT to suit our needs

The BERT model outputs two variables:
- The first variable, which we name `_`, contains the embedding vectors of all of the tokens in a sequence.
- The second variable, which we name `pooled_output`, contains the embedding vector of the [CLS] token after BERT's pooling layer (a linear layer with tanh activation). For a text classification task, it is enough to use this embedding as an input for our classifier.

We then pass the `pooled_output` variable through dropout and into a linear layer followed by a ReLU activation function. The result is a vector of size 5, where each element corresponds to one category of our labels (tech, business, sport, entertainment, politics).
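
The size-5 output vector holds raw class scores, not probabilities. The sketch below, with invented scores and our own helper names, shows how such scores turn into probabilities (softmax), into a predicted class (arg max) and into the cross-entropy loss used for training. In PyTorch, `nn.CrossEntropyLoss` applies the log-softmax internally, so the model itself can return raw scores.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of class scores
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def cross_entropy(logits, target):
    # negative log-probability assigned to the true class
    return -math.log(softmax(logits)[target])

# hypothetical scores for the 5 news categories
logits = [0.2, 2.5, 0.1, 0.0, 1.0]

probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

loss_if_correct = cross_entropy(logits, 1)   # true class equals the predicted class
loss_if_wrong = cross_entropy(logits, 3)     # true class received the lowest score

print([round(p, 3) for p in probs])
print(predicted, round(loss_if_correct, 3), round(loss_if_wrong, 3))
```

The loss is small when the true class receives the highest score and grows as the model puts more probability on other classes.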

In [None]:
from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):

    def __init__(self, dropout=0.3):

        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased') # pre-trained transformer
        self.dropout = nn.Dropout(dropout)                         # dropout for regularization
        self.linear = nn.Linear(768, 5)                            # classification fully-connected layer
        self.relu = nn.ReLU()                                      # ReLU activation function

    def forward(self, input_id, mask):

        _, pooled_output = self.bert(input_ids= input_id, attention_mask=mask,return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output)

        return final_layer

In [None]:
from torch.optim import Adam
from tqdm import tqdm
import os

# function for training and validation
def train_validate(model, train_data, val_data, learning_rate, epochs):

    # create tokenized datasets for training and validation
    train, val = NewsDataset(train_data), NewsDataset(val_data)

    # create loaders for tensors
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=VALID_BATCH_SIZE)

    # activate GPU computing
    device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu').type
    print('[LOG] notebook with {} computation enabled'.format(str(device)))

    # initialize loss function
    criterion = nn.CrossEntropyLoss()

    # initialize optimizer
    optimizer = Adam(model.parameters(), lr=learning_rate)

    # send model and loss function to computational device
    model = model.to(device)
    criterion = criterion.to(device)

    # initialize empty lists for storing data
    history_val_loss=[]     # average validation loss for epoch
    history_train_loss = [] # average training loss for epoch
    history_val_acc=[]      # validation accuracy for epoch
    history_train_acc = []  # training accuracy for epoch

    # training and validation cycle
    for epoch in range(epochs):

        # set the model to training mode (enables dropout)
        model.train()

        # initialize list for storing loss for each propagation
        loss_train = []

        # initialize lists for storing actual and predicted labels
        train_label_list = []
        train_output_list = []

################## TRAINING ##################

        # get our train input and label tensors from the loader, tqdm is just a nice progress bar
        for train_input, train_label in tqdm(train_dataloader):

            # send training label, attention mask and id to device
            # (each tokenized item has shape [1, MAX_LEN], so we drop the extra dimension)
            train_label = train_label.to(device)
            mask = train_input['attention_mask'].squeeze(1).to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)

            # receive predicted label
            output = model(input_id, mask)

            # calculate the loss value between actual and predicted label
            batch_loss = criterion(output, train_label.long())

            # store the loss
            loss_train.append(batch_loss.item())

            # save actual labels and predicted class probabilities
            train_label_list.extend(train_label.cpu().detach().numpy().tolist())
            train_output_list.extend(torch.softmax(output, dim=1).cpu().detach().numpy().tolist())

            # reset the gradients accumulated in the previous step
            optimizer.zero_grad()

            # run backward pass to compute the gradients
            batch_loss.backward()

            # update network parameters
            optimizer.step()

        # calculate average training loss
        total_loss_train = np.mean(loss_train)
        # append average training loss
        history_train_loss.append(total_loss_train)
        # calculate training accuracy
        acc_train = accuracy_score(np.array(train_label_list).astype(int), np.argmax(np.array(train_output_list), axis=1))
        # append training accuracy
        history_train_acc.append(acc_train)

################## VALIDATION ##################

        # initialize list for storing loss for each propagation
        loss_val = []

        # initialize lists for storing actual and predicted labels
        val_label_list = []
        val_output_list = []

        # set the model to evaluation mode (disables dropout); torch.no_grad() switches off gradient tracking
        model.eval()
        with torch.no_grad():

            for val_input, val_label in val_dataloader:

                # send validation label, attention mask and id to device
                val_label = val_label.to(device)
                mask = val_input['attention_mask'].squeeze(1).to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)

                # receive predicted label
                output = model(input_id, mask)

                # calculate the loss value between actual and predicted label
                batch_loss = criterion(output, val_label.long())

                # store the loss
                loss_val.append(batch_loss.item())

                # save actual labels and predicted class probabilities
                val_label_list.extend(val_label.cpu().detach().numpy().tolist())
                val_output_list.extend(torch.softmax(output, dim=1).cpu().detach().numpy().tolist())

        # calculate average validation loss
        total_loss_val = np.mean(loss_val)
        # append average validation loss
        history_val_loss.append(total_loss_val)
        # calculate validation accuracy
        acc_val = accuracy_score(np.array(val_label_list).astype(int), np.argmax(np.array(val_output_list), axis=1))
        # append validation accuracy
        history_val_acc.append(acc_val)

        print(f'Epochs: {epoch} | Train Loss: {total_loss_train: .3f} | Train Accuracy: {acc_train: .3f} | Val Loss: {total_loss_val: .3f} | Val Accuracy: {acc_val: .3f}')
        model_name = f'{epoch}_news_classifier.pth'
        torch.save(model.state_dict(), model_name)

    return history_train_loss, history_val_loss, history_train_acc, history_val_acc


In [None]:
# initialize our model
model = BertClassifier()

Before training the model, make sure the GPU is activated. The expected runtime for one epoch is about 3 minutes.

In [None]:
# train the model
train_loss, val_loss, train_acc, val_acc = train_validate(model, df_train, df_val, LEARNING_RATE, EPOCHS)

# 5. Validation and graphical representation of results

We can plot the loss and accuracy to examine the training cycle.

In [None]:
# plotting the loss
plt.plot(np.array(range(1, len(train_loss)+1)), train_loss, c='b', label='training loss')
plt.plot(np.array(range(1, len(val_loss)+1)), val_loss, c='r', label='validation loss')
plt.xlabel("[training epoch $e_i$]", fontsize=10)
plt.ylabel("[Cross-Entropy Loss]", fontsize=10)
plt.legend()
plt.show()

In [None]:
# plotting the accuracy
plt.plot(np.array(range(1, len(train_acc)+1)), train_acc, c='b', label='training accuracy')
plt.plot(np.array(range(1, len(val_acc)+1)), val_acc, c='r', label='validation accuracy')
plt.xlabel("[training epoch $e_i$]", fontsize=10)
plt.ylabel("[Accuracy]", fontsize=10)
plt.legend()
plt.show()

We will use a separate evaluation function to load a saved state and make a confusion matrix.

In [None]:
def evaluate(model, test_data):

    # create tokenized dataset
    test = NewsDataset(test_data)

    # create loaders for tensors
    val_dataloader = torch.utils.data.DataLoader(test, batch_size=VALID_BATCH_SIZE)

    # activate GPU computing
    device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu').type
    print('[LOG] notebook with {} computation enabled'.format(str(device)))

    # initialize loss function
    criterion = nn.CrossEntropyLoss()

    model = model.to(device)
    criterion = criterion.to(device)

    loss_val = []

    val_label_list = []
    val_output_list = []

    # set the model to the validation mode (gradients are not updated)
    model.eval()
    with torch.no_grad():

        for val_input, val_label in val_dataloader:

            # send validation label, attention mask and id to device
            val_label = val_label.to(device)
            mask = val_input['attention_mask'].squeeze(1).to(device)
            input_id = val_input['input_ids'].squeeze(1).to(device)

            # receive predicted label
            output = model(input_id, mask)

            # calculate the loss value between actual and predicted label and store it
            batch_loss = criterion(output, val_label.long())
            loss_val.append(batch_loss.item())

            # save actual labels and predicted class probabilities
            val_label_list.extend(val_label.cpu().detach().numpy().tolist())
            val_output_list.extend(torch.softmax(output, dim=1).cpu().detach().numpy().tolist())

        # calculate average loss and accuracy
        total_loss_val = np.mean(loss_val)
        acc_val = accuracy_score(np.array(val_label_list).astype(int), np.argmax(np.array(val_output_list), axis=1))

    print(f'Test Accuracy: {acc_val: .3f}')

    # return actual and predicted values
    return np.array(val_label_list).astype(int), np.argmax(np.array(val_output_list), axis=1)

In [None]:
# load saved state (checkpoints are named by 0-based epoch index, so 1 is the second epoch)
epoch_num = 1
model_name = f'{epoch_num}_news_classifier.pth'

model = BertClassifier()
model.load_state_dict(torch.load(model_name, map_location=torch.device('cpu')))

In [None]:
# evaluate model
label, prediction = evaluate(model, df_val)

In [None]:
# create classification report
print(classification_report(label, prediction))

In [None]:
# create confusion matrix
con_mat = confusion_matrix(label, prediction)
sns.heatmap(con_mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('True labels')
plt.ylabel('Predicted labels')
plt.show()