# Advanced Text Analytics Lab 2

This notebook is the second of two lab notebooks that you will submit as part of your assessment for the Advanced Data Analytics unit. The notebook contains three required sections plus an optional section:

4. **Introducing Transformers:** This section introduces the Transformers library from HuggingFace, showing you how to use it to obtain contextualised embeddings from pretrained transformer models.

5. **Question Answering with Pretrained Transformers:** Learn about how to use a pretrained model to perform automatic question answering. 

6. **Transformers for Text Classification:** Here we show you how to construct a classifier using Transformers.

7. **OPTIONAL: More on Transformers:** Some pointers to other materials if you want to learn more about transformers, e.g., if using them in your summer project. 

Example code for all the tasks has been tested on a three-year-old MacBook Pro, and the longest training process took under 10 minutes. If you find that the code takes too long to run on your own machine, you can try [Google Colab](https://colab.research.google.com/), Amazon Sagemaker Studio, or use lab machines on campus provided by the school.

## Learning Outcomes

These sections will contain tutorial-like instructions, as you have seen in previous text analytics labs. On completing these sections, the intended learning outcomes are that you will be able to...
1. Use pretrained transformers to obtain contextualised word and sentence embeddings.
1. Apply a pretrained QA model to a new dataset. 
1. Construct classifiers with pretrained transformers. 
1. Find documentation on pretrained models in the Transformers library.

## Your Tasks

Inside each of these sections there are several **'To-do's**, which you must complete for your summative assessment. Your marks will be based on your answers to these to-dos. Please make sure to:
1. Include the output of your code in the saved notebook. Plots and printed output should be visible without re-running the code. 
1. Include all code needed to generate your answers.
1. Provide sufficient comments to understand how your method works.
1. Write text in a cell in markdown format where a written answer is required. You can convert a cell to markdown format by pressing Escape-M. 

There are also some unmarked 'to-do's that are part of the tutorial to help you learn how to implement and use the methods studied here.

## Good Academic Practice

Please follow [the guidance on academic integrity provided by the university](http://www.bristol.ac.uk/students/support/academic-advice/academic-integrity/).
You are required to write your own answers -- do not share your notebooks or copy someone else's writing. Do not copy text or long blocks of code directly into the notebook from online sources -- always rewrite in your own way. Breaking the rules can lead to strong penalties. 

## Marking Criteria

1. The coursework (both notebooks) is worth 30% of the unit in total. 
1. There is a total of 100 marks available for both lab notebooks. 
1. This notebook is worth 50 of those marks.
1. The number of marks for each to-do out of 100 is shown alongside each to-do.
1. For to-dos that require you to write code, a good solution would meet the following criteria (in order of importance):
   1. Solves the task or answers the question asked in the to-do. This means, if the code cells in the notebook are executed in order, we will get the output shown in your notebook.
   1. The code is easy to follow and does not contain unnecessary steps.
   1. The comments show that you understand how your solution works.
   1. A very good answer will also provide code that is computationally efficient but easy to read.
1. You can use any suitable publicly available libraries. Unless the task explicitly asks you to implement something from scratch, there is no penalty for using libraries to implement some steps.

## Support

The main source of support will be during the remaining lab sessions (Fridays 3-6pm) for this unit. 

The TAs and lecturer will help you with questions about the lectures, the code provided for you in this notebook, and general questions about the topics we cover. For the marked 'to-dos', they can only answer clarifying questions about what you have to do. 

Office hours: You can book office hours with Edwin on Mondays 3pm-5pm by sending him an email (edwin.simpson@bristol.ac.uk). If those times are not possible for you, please contact him by email to request an alternative. 

## Deadline

The notebook must be submitted along with the first notebook on Blackboard before **Wednesday 24th May at 13.00**.

## Submission

You will need to zip up this notebook with the previous notebook into a single .zip file, which you will submit to Blackboard through the 'assessment, submission and feedback' link on the left sidebar. 

Please name your files like this:
   * Name this notebook ADA2_<student_number>.ipynb
   * Name the zip file <student_number>.zip
   * Please don't use your name anywhere as we want to mark anonymously. 

In [1]:
import numpy as np
import torch 
from datasets import load_dataset

cache_dir = "./data_cache"

# 4. Pretrained Transformers (max. 15 marks)

HuggingFace is a company that has developed an open source library for loading pretrained transformer models. They also distribute many models that have been pretrained using language modelling tasks, or fine-tuned to specific downstream NLP tasks. It is currently one of the most widely used libraries for building NLP models on top of large, deep neural networks. This is especially useful for tasks where simpler, feature-based methods or smaller LSTM models do not perform well enough, for example, when complex processing of syntax and semantics is required (natural language 'understanding').

Let's start by looking at two key types of object in the transformers library: models and tokenizers.

## 4.1. Models

The neural network models available in the Transformers library are accessed through wrapper classes such as `AutoModel`. If we want to load a pretrained model, we can simply pass its name to the `from_pretrained` function, and the pretrained model weights will be downloaded from HuggingFace and a neural network model will be created with those weights. For example:

In [2]:
from transformers import AutoModel # For BERTs

model = AutoModel.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertModel: ['fit_denses.2.weight', 'fit_denses.4.bias', 'fit_denses.0.weight', 'fit_denses.0.bias', 'cls.seq_relationship.weight', 'fit_denses.1.bias', 'fit_denses.2.bias', 'cls.predictions.transform.LayerNorm.bias', 'fit_denses.3.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'fit_denses.3.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'fit_denses.1.weight', 'fit_denses.4.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing

This code loads the TinyBERT model, which is a compressed version of BERT. It has about 14.5 million parameters, compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While TinyBERT will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D).


The same functions can be used to load other models from HuggingFace's repository simply by changing the model's name. Take a look at [the Models page](https://huggingface.co/models) to see what there is on offer. Do you recognise any of the models' names?

## 4.2. Tokenizers

Before we can apply a model to some text, we need to create a Tokenizer object. In Transformers, Tokenizer objects convert raw text to a sequence of numbers. First, the tokenizer performs tokenization, then it maps each token to its numerical ID. There are lots of different tokenizers that we can use to preprocess text. If we are loading a pretrained model, we will need to choose the tokenizer that corresponds to that model.

We can load the right tokenizer as follows, in the same way we loaded the model itself:

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

Let's see what the TinyBERT tokenizer does to an example sentence:

In [4]:
sentence = "The transformer architecture has transformed the field of NLP."

tokens = tokenizer.tokenize(sentence)
print(tokens)

['the', 'transform', '##er', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'nl', '##p', '.']


Let's compare with the NLTK tokenizer we have seen before:

In [5]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)

['The', 'transformer', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'NLP', '.']


While NLTK keeps whole words as tokens, the BERT tokenizer splits some words into sub-words and inserts some special characters into the tokens. Splitting is applied to words with low frequency in the training set, such as 'transformer'. 
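
To make the sub-word splitting concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy used by WordPiece tokenizers such as BERT's. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration only. The real TinyBERT vocabulary is far larger, and the library's tokenizer handles many more details (casing, punctuation, very long words).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to mirror the example sentence above.
toy_vocab = {"the", "transform", "##er", "transformed", "architecture", "nl", "##p"}

for word in ["transformer", "transformed", "nlp", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, 'transformer' is not stored as a whole word, so it is split into 'transform' and '##er', while 'transformed' is kept intact because it is in the vocabulary.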

**TO-DO 4a:** What is the benefit of splitting rare words into sub-word tokens? **(2 marks)**

WRITE YOUR ANSWER HERE:

<span style="color:yellow">
1. By splitting rare words into sub-word tokens, the algorithm represents them as combinations of pieces that occur more frequently. This improves the coverage of the vocabulary on which the algorithm trains and allows the model to better understand the meaning of these words.<br>
2. It also generalizes rare words to similar sub-words, which improves the model's ability to handle new and unseen words and reduces overfitting to specific words.<br>
3. It can handle words that are not in the vocabulary and make better predictions.<br>
4. Splitting words into sub-words helps reduce the size of the vocabulary, which in turn can reduce the computation cost.<br></span>

---

It is important to use the right tokenizer with a pretrained model as each model was trained with text tokenized in a particular way. After tokenization, the Tokenizer object can also map the tokens to their IDs (indexes in the vocabulary):

In [6]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1996, 10938, 2121, 4294, 2038, 8590, 1996, 2492, 1997, 17953, 2361, 1012]


## 4.3. Contextualised Embeddings

Now that we have a sequence of tokens, we are almost ready to process the sequence using the pretrained model. 

Our model takes as input a PyTorch `tensor` object. In PyTorch, a `tensor` is a multi-dimensional array. Here, we need a two-dimensional tensor, where each row is a sequence of input tokens corresponding to a single sentence or document. Let's convert our list of IDs to a 2-D tensor with a single row:

In [7]:
ids_tensor = torch.tensor([ids])

print(ids_tensor)

tensor([[ 1996, 10938,  2121,  4294,  2038,  8590,  1996,  2492,  1997, 17953,
          2361,  1012]])


Now we can process the sequence using our model. The model maps the sequence of input IDs to a sequence of output vectors, which are contextualised word embeddings. The hidden state values produced in the last hidden layer of the model are used as the contextualised embeddings:

In [8]:
model_outputs = model(ids_tensor)
print('The complete model outputs: ')
print(model_outputs)

print()
print('The last hidden states for all tokens in the first sequence (one embedding per token): ')
embeddings = model_outputs['last_hidden_state'][0]
print(embeddings)

The complete model outputs: 
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3608,  0.2862, -0.1549,  ..., -0.2064,  0.2663, -0.0109],
         [ 0.0149,  0.7223, -0.0508,  ..., -0.5505,  0.2355, -0.2962],
         [ 0.1531,  0.5903, -0.1244,  ..., -0.4263,  0.0417, -0.1839],
         ...,
         [ 0.1742, -0.1091, -0.1963,  ..., -0.6736,  0.0472, -0.1840],
         [ 0.2434,  0.1021, -0.2241,  ..., -0.5400, -0.1691, -0.1314],
         [ 0.0854,  0.3272, -0.3016,  ..., -0.2154, -0.5632, -0.1921]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-1.1380e-02, -6.3005e-03,  1.8521e-02,  7.1139e-03, -3.1795e-02,
          1.3882e-02, -1.5459e-02, -1.0610e-03, -1.8263e-02, -3.6515e-02,
         -2.1257e-02, -1.5479e-02, -2.8092e-04, -4.1093e-02, -2.5315e-02,
         -4.3338e-02, -1.1616e-03, -1.3931e-02,  6.0733e-03,  4.3790e-03,
          2.7087e-04, -2.1810e-02, -4.8026e-02,  2.5493e-02, -1.6502e-02,
         -1.2034e-03,  4.2757e-02,  3.

We can retrieve the embedding vector for "transform" like this ("transform" is the second token in the sequence):

In [9]:
emb = embeddings[1]  # get second embedding in the sequence

# convert it to a numpy array so we can perform various operations on it later on
emb = emb.detach().numpy()

print(emb)
print(f'The TinyBERT embeddings have {emb.shape[0]} dimensions.')

[ 1.49148628e-02  7.22317815e-01 -5.07864691e-02 -2.74205178e-01
 -1.38932303e-01  1.00099528e+00  7.11445557e-03  2.71392733e-01
 -3.92815247e-02  6.04100265e-02  1.25740394e-01  4.60631937e-01
  6.25271583e-03  1.61929712e-01  1.23912573e-01 -4.08096224e-01
  1.24867305e-01 -4.71536011e-01  2.24768803e-01  6.35189936e-02
  8.56180787e-02 -1.88044921e-01  1.77257597e-01  3.40050578e-01
 -1.95545912e-01  1.58554330e-01  9.62863788e-02  1.12649135e-01
  2.21045569e-01 -9.56113040e-01 -3.85948658e-01  1.39220789e-01
  5.90012312e-01 -8.06728959e-01 -1.34287402e-01  2.35692337e-01
 -1.02274396e-01  2.78303742e-01  7.94321716e-01 -2.49363303e-01
  1.72772437e-01 -2.07582265e-01  3.00156802e-01 -8.59342813e-02
 -2.25284323e-01 -9.75407436e-02 -3.52349937e-01  3.81161809e-01
 -3.87681931e-01 -1.77613631e-01 -4.13685769e-01  1.38046086e-01
  1.29870605e-02  6.52684093e-01  1.16502956e-01 -5.10778844e-01
 -8.30415785e-02 -2.67046876e-02  3.12862754e-01 -2.62848020e-01
 -1.43284783e-01  1.10270

TO-DO 4b: Retrieve the embedding for "architecture" (this to-do will not be marked).

In [10]:
# WRITE YOUR ANSWER HERE
emb_arch = embeddings[3]
emb_arch = emb_arch.detach().numpy()

print(emb_arch)
print(f'The TinyBERT embeddings have {emb_arch.shape[0]} dimensions.')

[ 2.71386981e-01  7.74582207e-01 -3.24257016e-01 -7.14325309e-02
 -4.95268905e-04  9.37310040e-01 -4.40247403e-03 -4.26919833e-02
  1.27396500e-02  1.89266987e-02  1.02528624e-01  4.54656661e-01
  2.70435810e-01  2.30988905e-01  4.03637812e-03 -1.08995184e-01
 -4.59914133e-02 -3.51154119e-01 -1.34710416e-01  8.29389989e-02
  1.86496824e-01  5.00281751e-02  7.21666515e-02  2.28657067e-01
 -2.19697207e-01  9.40193161e-02  1.65540054e-01  1.85794502e-01
  3.17783207e-01 -5.09367347e-01 -5.00949025e-01  1.52487710e-01
  4.57998663e-01 -8.51876497e-01 -1.58632323e-01  1.58965424e-01
  4.16196063e-02  2.30998382e-01  8.78503203e-01 -6.23163469e-02
  1.87220082e-01 -1.23367840e-02  2.10083455e-01  3.48072536e-02
 -2.51239896e-01 -1.37914613e-01 -3.88697267e-01  2.98189938e-01
 -2.92033464e-01 -3.19503814e-01 -1.98434874e-01  1.32031843e-01
 -6.46375418e-02  7.43182957e-01  7.14239776e-02 -3.02118033e-01
  3.49781364e-01 -5.81789352e-02  2.85069138e-01 -4.09580946e-01
 -1.03296600e-01  1.03768

Sentences and documents usually have varying lengths. So, to put multiple sentences into a single tensor, we need to pad the sequences up to a maximum length. Luckily, the tokenizer class takes care of this for us. When we pass in a list of sentences, the tokenizer creates a matrix, where each row is a sequence of the same length:

In [11]:
sentences = [
    "They received a loan from the bank.",
    "It was not good for either his bank balance or his blood pressure.",
    "She walked along the bank of the river towards the city.",
    "They bank their cheques on Thursdays.",
    "She walked along the embankment towards the city."
]

model_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  

print(model_inputs)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': tensor([[  101,  2027,  2363,  1037,  5414,  2013,  1996,  2924,  1012,   102,
             0,     0,     0,     0,     0,     0],
        [  101,  2009,  2001,  2025,  2204,  2005,  2593,  2010,  2924,  5703,
          2030,  2010,  2668,  3778,  1012,   102],
        [  101,  2016,  2939,  2247,  1996,  2924,  1997,  1996,  2314,  2875,
          1996,  2103,  1012,   102,     0,     0],
        [  101,  2027,  2924,  2037, 18178, 10997,  2006,  9432,  2015,  1012,
           102,     0,     0,     0,     0,     0],
        [  101,  2016,  2939,  2247,  1996, 22756,  2875,  1996,  2103,  1012,
           102,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': t

`model_inputs` is a dictionary containing three objects:
 * The `input_ids` are the list of token IDs in the input sequences. 
 * The `attention_mask` records which tokens are special padding tokens and which are real tokens. Tokens with a 0 in the attention mask will be ignored.
 * `token_type_ids` is needed when two sequences are passed together as input to the model for tasks such as next sentence prediction that involve comparing two sentences. Here, each input is a single sentence, so we have only one type of token in the output above. 
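
To see how `input_ids` and `attention_mask` relate, here is a minimal sketch of padding in plain Python. The helper `pad_batch` is hypothetical and only mimics the idea; the library's tokenizer does this for us when `padding=True`. The two ID sequences are copied from the printed output above, and a padding ID of 0 is assumed because that is the value appended to the shorter rows in that output.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 2027, 2363, 1037, 5414, 2013, 1996, 2924, 1012, 102],
    [101, 2027, 2924, 2037, 18178, 10997, 2006, 9432, 2015, 1012, 102],
])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids)
    print(mask)
```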
 
TO-DO 4c: What value do the special padding tokens have? (this to-do is unmarked)

ANSWER: <span style='color:yellow'>The padding token is `[PAD]`, which has input ID 0 in the `input_ids` printed above.</span>

---

Notice that the input_ids all start with the same token ID, 101, even though they have different first words. They also have token ID 102 before the padding tokens. This is because the tokenizer inserts two special tokens, which are used in some applications of BERT. 101 is the '[CLS]' token, which is a dummy token whose embedding can be trained to represent the whole sequence. The [CLS] token's embedding can then be used as input to a text classifier to classify a sentence or document. Token 102 is '[SEP]', which can be used to separate multiple input sequences in a single example. This is needed in tasks where multiple pieces of text are provided as input, e.g., to build a classifier that can determine whether two sentences contradict each other.
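
As an illustration of how the special tokens frame a two-sequence input, the sketch below assembles `[CLS] A [SEP] B [SEP]` by hand, together with the matching `token_type_ids`. The helper `build_pair_input` is hypothetical; in practice we simply pass two strings to the tokenizer. The IDs 101 and 102 and the word IDs are taken from the printed `input_ids` above.

```python
CLS_ID, SEP_ID = 101, 102  # special token IDs seen in the printed input_ids


def build_pair_input(ids_a, ids_b):
    """Assemble a BERT-style two-sequence input from two lists of token IDs."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


# 'they bank' and 'the city', using IDs from the output above.
pair_ids, pair_types = build_pair_input([2027, 2924], [1996, 2103])
print(pair_ids)
print(pair_types)
```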

We can now pass all of the model inputs to the model to produce a set of contextualised embeddings:

In [12]:
# model_inputs is a dictionary, so to provide the arguments to model(), 
# we use the double star to unpack the dictionary so that each key in the dictionary is
# an argument to model() and each value is the value of the argument. 
model_outputs = model(**model_inputs) 

**TO-DO 4d:** The first four example sentences above all contain the word "bank", and the last example contains "embankment". Obtain a list of contextualised word embeddings for 'bank' and 'embankment' in the example sentences using our model. **(4 marks)**

Hint: you may need to convert tensors to numpy arrays.

In [13]:
#WRITE YOUR OWN CODE HERE
bank_indices = [0, 1, 2, 3]
embankment_index = 4

embeddings = model_outputs.last_hidden_state.detach().numpy()

bank_embeddings = embeddings[bank_indices]
embankment_embedding = embeddings[embankment_index]

print(f"Bank embeddings: {[i for i in bank_embeddings]}")
print(f"Embankment embedding: {embankment_embedding}")


Bank embeddings: [array([[-0.17210938,  0.3554762 , -0.01530575, ...,  0.09638997,
         0.03411598,  0.6412669 ],
       [ 0.06130864,  0.3978629 ,  0.20649993, ..., -0.10949574,
         1.1185691 ,  0.2539656 ],
       [ 0.52980375,  0.59411454,  0.13647686, ..., -0.34088627,
         0.0416989 ,  0.7493994 ],
       ...,
       [-0.04268991,  0.0422273 ,  0.29759872, ..., -0.10009946,
         0.22506376,  0.24222071],
       [-0.04637844,  0.14576931,  0.23094745, ..., -0.04211048,
         0.29990548,  0.17476645],
       [ 0.05196349,  0.25869167,  0.2816572 , ..., -0.00587232,
         0.580475  ,  0.18867968]], dtype=float32), array([[-0.348832  ,  0.2761361 ,  0.205415  , ...,  0.26526845,
         0.3573863 ,  0.3961001 ],
       [ 0.20104799,  0.1711525 ,  0.31701708, ...,  0.24112579,
         0.1232257 ,  0.29956692],
       [ 0.44385263,  0.3780262 ,  0.34030116, ...,  0.29888162,
         0.5167462 , -0.022316  ],
       ...,
       [-0.03760721,  0.04998438,  0.0992
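
Note that indexing `last_hidden_state` by sentence number returns one vector for every token in that sentence. If we want a single vector for the word itself, one option is to look up the position of the word's token ID in each row of `input_ids` and take the hidden state at that position. The sketch below illustrates this under stated assumptions: the IDs 2924 ('bank') and 22756 ('embankment') are inferred from the printed `input_ids` above, the helper `select_token_embedding` is hypothetical, and the hidden states are made-up placeholder values rather than real model outputs.

```python
BANK_ID = 2924          # inferred ID of 'bank' in the printed input_ids
EMBANKMENT_ID = 22756   # inferred ID of 'embankment' in the printed input_ids

# Two rows copied from the printed input_ids (padding removed for brevity).
batch_ids = [
    [101, 2027, 2363, 1037, 5414, 2013, 1996, 2924, 1012, 102],
    [101, 2016, 2939, 2247, 1996, 22756, 2875, 1996, 2103, 1012, 102],
]

# Placeholder hidden states: one made-up 2-dimensional vector per token.
# A real model output has shape (batch, sequence length, hidden size).
batch_hidden = [
    [[float(pos), float(pos) * 0.5] for pos in range(len(ids))]
    for ids in batch_ids
]


def select_token_embedding(hidden_states, ids, target_id):
    """Return the vector at the first position where target_id occurs."""
    position = ids.index(target_id)  # raises ValueError if the token is absent
    return hidden_states[position]


bank_vector = select_token_embedding(batch_hidden[0], batch_ids[0], BANK_ID)
embankment_vector = select_token_embedding(batch_hidden[1], batch_ids[1], EMBANKMENT_ID)
print(bank_vector)
print(embankment_vector)
```

With real model outputs, the same lookup would be applied to each row of `model_inputs['input_ids']` and the corresponding row of the hidden states.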

**TO-DO 4e:** Compute the similarities between these embeddings in the cell below, and show the results. Which embeddings are most similar to one another and why? **(6 marks)**

WRITE YOUR ANSWER HERE:
<br>
<span style="color:yellow">
The embedding for sentence 3 is the most similar to the embankment embedding, with a cosine similarity of `0.691002`. Sentences 4, 1 and 2 are less similar, with scores of `0.5265813`, `0.4176671` and `0.24995577` respectively. <br>
The reason behind these scores is that sentence 3 uses "bank" in the sense of a river bank, which is close in meaning to "embankment", whereas sentences 1, 2 and 4 use "bank" in a financial sense. The different contexts in these sentences reduce the similarity scores. </span>

In [14]:
# WRITE YOUR OWN CODE HERE
from sklearn.metrics.pairwise import cosine_similarity

bank_embeddings_2d = np.reshape(bank_embeddings, (len(bank_embeddings), -1))
similarities = cosine_similarity(bank_embeddings_2d, embankment_embedding.reshape(1, -1))
print(similarities)


[[0.4176671 ]
 [0.24995577]
 [0.691002  ]
 [0.5265813 ]]
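
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the angle between the vectors and not on their magnitudes. The following is a minimal plain-Python sketch of that definition; `cosine` is a hypothetical helper written for illustration (it is not the scikit-learn function), and the vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
print(same_direction, orthogonal)
```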


**TO-DO 4f:** Use the [CLS] token's embedding to find the most similar **sentence** to "She walked along the embankment towards the city." from the first four sentences. Print the similarities and the selected sentence. **(3 marks)**

In [15]:
# WRITE YOUR OWN CODE HERE
import torch.nn.functional as F

embankment_cls_embedding = model_outputs.last_hidden_state[-1, 0, :]
cls_embeddings = model_outputs.last_hidden_state[:, 0, :]

similarities = F.cosine_similarity(cls_embeddings[:-1], embankment_cls_embedding.unsqueeze(0))
most_similar_index = similarities.argmax().item()

print(f"Similarities: {similarities.tolist()}")
print(f"Most similar sentence: {sentences[most_similar_index]}")

Similarities: [0.9093039631843567, 0.7936347126960754, 0.9947957396507263, 0.8988916277885437]
Most similar sentence: She walked along the bank of the river towards the city.


# 5. Question Answering with Pretrained Transformers (max. 11 marks)

The previous section showed us how to obtain a sequence of contextualised word embeddings using a pretrained transformer. How are these embeddings used to extract answers from documents to a given question?

First, let's load up the [Tweet QA](https://huggingface.co/datasets/tweet_qa) dataset, which we will use to test a pretrained question answering (QA) model. This dataset contains tweets along with questions about the information in the tweets, and a list of correct answers. As we are not going to train our own QA model (it requires a lot of compute time), we will only need the validation set:

In [16]:
from sklearn.metrics import f1_score

val_dataset = load_dataset(
    "tweet_qa",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

Found cached dataset tweet_qa (/mnt/d/Data Science MSc/Advanced DA/Week 22/advanced-labs-public/data_cache/tweet_qa/default/1.0.0/7d588f7f477946b10f60c035ca55175737315ac446102b015218af38d2638777)


Validation dataset with 1086 instances loaded


We are now working with a complete dataset using the HuggingFace datasets library. In the next cell, we create a tokenizer to tokenize the examples in the dataset. We need to choose the right tokenizer for the QA model we want to use, so let's use `"distilbert-base-cased-distilled-squad"` as our pretrained model. This is based on a smaller version of BERT, called DistilBERT, which was fine-tuned on the SQuAD question answering dataset.

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad") 

def tokenize_function(dataset):
    # Pass two strings to the tokenizer -- it will concatenate them with a [SEP] special token between them. 
    model_inputs = tokenizer(dataset['Question'], dataset['Tweet'], padding="max_length", max_length=200, truncation='only_second')
    return model_inputs

Again, we can use the `map()` method to apply the tokenizer to each example in the dataset. 

In [18]:
val_dataset = val_dataset.map(tokenize_function, batched=True) 

Loading cached processed dataset at /mnt/d/Data Science MSc/Advanced DA/Week 22/advanced-labs-public/data_cache/tweet_qa/default/1.0.0/7d588f7f477946b10f60c035ca55175737315ac446102b015218af38d2638777/cache-da98121affcce4b7.arrow


The type of QA model we are going to work with is _extractive_, meaning that the model will extract the answer from the 'context' (also known as the 'passage' or 'source document'). It does this by identifying the index of the start and end tokens of the answer span within the context, or returning `(0, 0)` (the index 0 for both the start and end token) if the context does not contain an answer to the given question. 
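
To illustrate what 'extracting' an answer means, the sketch below turns a pair of start and end indexes into answer text, treating `(0, 0)` as 'no answer'. The helper `extract_answer` and the word-level tokens are made up for illustration; a real tokenizer would produce sub-word tokens, and we would use `tokenizer.decode()` on the input IDs instead of joining strings.

```python
def extract_answer(tokens, start_idx, end_idx):
    """Return the text of the span, or None for the 'no answer' span (0, 0)."""
    if start_idx == 0 and end_idx == 0:
        return None
    # The end index is inclusive, hence the +1 in the slice.
    return " ".join(tokens[start_idx:end_idx + 1])


# Made-up input in the form "[CLS] question [SEP] context [SEP]".
tokens = ["[CLS]", "where", "is", "the", "bank", "?", "[SEP]",
          "the", "bank", "is", "by", "the", "river", "[SEP]"]

print(extract_answer(tokens, 10, 12))
print(extract_answer(tokens, 0, 0))
```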

As explained in the lectures, BERT forms the basis of the QA model, and maps each token to a contextualised embedding. The QA model then maps each token's contextualised embedding to the probability that the token is the start of the answer span, and to the probability that the token is the end of the answer span. The layers that map the embeddings to the start and end probabilities are known as the 'head' of the model. [The original BERT paper](https://arxiv.org/pdf/1810.04805.pdf) depicts the QA model like this (Devlin et al., 2018):

<img src="bert_qa.png" alt="BERT QA diagram from the slides in week 10 showing the embedding of each token connected to the start and end output layers" width="400px"/>

We can see a similar structure in most neural network models. Our original text classifier from the first notebook used a fully-connected layer to produce a hidden representation of the whole sentence (rather than using BERT to produce a sequence of embeddings). This hidden representation was then fed to an output layer to produce a probability distribution over class labels (rather than the start and end probabilities):

<img src="neural_text_classifier_smaller.png" alt="Neural text classifier diagram from the slides in lecture 8.1" width="400px"/>


<!--With transformers, 
we can do something very similar, by connecting the transfomer's output to a fully-connected layer. However, with BERT, we do not need to pass the embedding of each individual word to the fully-connected layer because there is a special [CLS] token that represents the whole sentence:

The code below shows how to access a tensor containing the [CLS] embeddings:-->

Now that we have the dataset in the right format, let's see how to load a pretrained QA model based on a pretrained transformer. The QA model was trained by taking a pretrained BERT model (pretrained on masked language modelling with unlabelled text), adding the QA head, then further training the complete model on a QA dataset.

The transformers library provides some useful wrapper classes for loading pretrained models for various NLP tasks, such as QA or text classification. These 'auto' classes are documented here: https://huggingface.co/docs/transformers/model_doc/auto . Let's use an auto class to load the `"distilbert-base-cased-distilled-squad"` pretrained QA model (this code will try to reload the model from a cache or download the model from HuggingFace):

In [19]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

As our model was pretrained, we can use it directly on our Tweet_QA dataset (you may see a message to this effect when you run the cell above the first time). 

So, how do we get a prediction from the model? Let's take a single example from Tweet_QA and obtain the start and end probabilities for all tokens in the 'context':

In [20]:
def predict_nn(qa_model, dataset):
    
    # Switch off dropout
    qa_model.eval()

    # Pass the required inputs from the dataset to the model    
    output = qa_model(attention_mask=torch.tensor(dataset["attention_mask"]), input_ids=torch.tensor(dataset["input_ids"]))
        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    probs_start = torch.nn.Softmax(dim=1)(output["start_logits"]).detach().numpy()
    probs_end = torch.nn.Softmax(dim=1)(output["end_logits"]).detach().numpy()
        
    return probs_start, probs_end

# Run the prediction function to get the results for the first 20 examples:
probs_start, probs_end = predict_nn(model, val_dataset[0:20])
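
The softmax used in `predict_nn` converts the logits for one example into probabilities that sum to one across the token positions. A minimal plain-Python sketch of the same computation for a single sequence is shown below; the function `softmax` and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Map unnormalised scores to probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_start_logits = [-1.0, 0.5, 3.0, 0.0]  # one made-up score per token position
toy_start_probs = softmax(toy_start_logits)
print(toy_start_probs)
```

The position with the largest logit also receives the largest probability, so the ranking of positions is unchanged.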

Now that we have the probabilities that each token is a start or end token, we combine these probabilities to estimate the probability of each possible answer span. This will allow us to choose the answer span with highest probability. 
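
One common way to combine the two sets of probabilities, assuming that the start and end positions are chosen independently, is to multiply them: P(span) = P(start) × P(end). The sketch below applies this to made-up probabilities for a three-token input. The helper `best_span` is hypothetical and deliberately applies no validity checks, so it can still return spans that make no sense, such as an end position before the start.

```python
def best_span(probs_start, probs_end):
    """Score every (start, end) pair by the product of its probabilities."""
    best_pair, best_prob = (0, 0), -1.0
    for start_idx, p_start in enumerate(probs_start):
        for end_idx, p_end in enumerate(probs_end):
            span_prob = p_start * p_end  # independence assumption
            if span_prob > best_prob:
                best_pair, best_prob = (start_idx, end_idx), span_prob
    return best_pair, best_prob


toy_probs_start = [0.1, 0.7, 0.2]  # made-up start probabilities
toy_probs_end = [0.1, 0.2, 0.7]    # made-up end probabilities
span, prob = best_span(toy_probs_start, toy_probs_end)
print(span, prob)
```

Because each factor is at most 1, a span score computed this way also stays between 0 and 1, which makes it easy to interpret as a probability.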

In the next cell is our first attempt, which you will need to improve to get valid answers. This code loops through each possible combination of start and end tokens, obtains the start and end probabilities, and extracts the answer text for the corresponding span.

**TO-DO 5a:** Use the start and end probabilities to compute the answer span probability at the place marked inside the predict_answer() function below. **2 marks**

In [21]:
# our example:
example_index = 3

example = val_dataset[example_index]
print(f'CONTEXT = {example["Tweet"]}')
print(f'QUESTION = {example["Question"]}')
print(f'LIST OF POSSIBLE ANSWERS = {example["Answer"]}')

CONTEXT = The #endangeredriver would be a sexy bastard in this channel if it had water. Quick turns. Narrow. (I'm losing it) John D. Sutter (@jdsutter) June 21, 2014
QUESTION = what hashtag was used?
LIST OF POSSIBLE ANSWERS = ['#endangeredriver', '#endangereddriver']


In [22]:
def predict_answer(probs_start, probs_end, input_ids, tokenizer):
    
    input_length = len(input_ids)  # length of the input sequence, in the form "[CLS] question [SEP] context"

    SEP_SPECIAL_TOKEN = 102  # the input id for the [SEP] special token that separates the question from the context. The context starts after this token. 
    PAD_SPECIAL_TOKEN = 0  # the input id for padding tokens added to the end of the context
    
    span_probabilities = []  # save the probabilities here
    spans = []  # save the possible answer spans here
    
    for start_idx in range(0, input_length):
        for end_idx in range(0, input_length):
            
            start_prob = probs_start[start_idx]
            end_prob = probs_end[end_idx]
            
            ### WRITE YOUR ANSWER HERE
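            # NOTE: PAD_SPECIAL_TOKEN (0) is used here as a position index, so this reads the scores at position 0 ([CLS]).
            # The value below is a sum-based ranking score that can exceed 1, not a probability.
            pad_prob = probs_start[PAD_SPECIAL_TOKEN] + probs_end[PAD_SPECIAL_TOKEN]
            span_prob = start_prob + end_prob - pad_prob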

            span_probabilities.append(span_prob)            
            ###
            
            span = tokenizer.decode(input_ids[start_idx:end_idx+1])
            spans.append(span)

    # sort the spans according to probability:
    sorted_span_index = np.argsort(span_probabilities)
    
    # print the top 20 answers:
    for i in range(20):
        print(f'Span prob = {span_probabilities[sorted_span_index[-i-1]]}, answer = {spans[sorted_span_index[-i-1]]}')
            
predict_answer(probs_start[example_index], probs_end[example_index], example['input_ids'], tokenizer)

Span prob = 1.53287672996521, answer = endangeredriver
Span prob = 1.0910239219665527, answer = The # endangeredriver
Span prob = 1.0644721984863281, answer = # endangeredriver
Span prob = 0.9733114242553711, answer = 
Span prob = 0.9421271681785583, answer = 
Span prob = 0.9348013997077942, answer = 
Span prob = 0.9336816668510437, answer = ##river
Span prob = 0.9328382611274719, answer = 
Span prob = 0.9279173016548157, answer = 
Span prob = 0.9275789856910706, answer = 
Span prob = 0.9272849559783936, answer = 
Span prob = 0.927193820476532, answer = 
Span prob = 0.9270961880683899, answer = 
Span prob = 0.9269319772720337, answer = 
Span prob = 0.92681485414505, answer = 
Span prob = 0.926754355430603, answer = 
Span prob = 0.9267515540122986, answer = 
Span prob = 0.9266906976699829, answer = 
Span prob = 0.926660418510437, answer = 
Span prob = 0.9266597032546997, answer = 


Are all of the top 20 valid and unique answers? If not, what do you think is going wrong? 

**TO-DO 5b:** Use the cell below to define a new and improved version of `predict_answer()` that only includes valid answers. Summarise in a couple of sentences what kind of invalid answers your code removes. **4 marks**

WRITE YOUR ANSWER HERE:

<span style="color:yellow">
My code removes the empty answers. However, it still predicts answers that are long sentences containing the hashtag.
</span>


In [23]:
### WRITE YOUR OWN CODE HERE
def predict_answer(probs_start, probs_end, input_ids, tokenizer):
    
    input_length = len(input_ids)  # length of the input sequence, in the form "[CLS] question [SEP] context"

    SEP_SPECIAL_TOKEN = 102  # the input id for the [SEP] special token that separates the question from the context. The context starts after this token. 
    PAD_SPECIAL_TOKEN = 0  # the input id for padding tokens added to the end of the context
    
    span_probabilities = []  # save the probabilities here
    spans = []  # save the possible answer spans here
    
    for start_idx in range(0, input_length):
        for end_idx in range(0, input_length):
            
            start_prob = probs_start[start_idx]
            end_prob = probs_end[end_idx]
            
            ### WRITE YOUR ANSWER HERE
            # if start_idx > end_idx or input_ids[start_idx] != SEP_SPECIAL_TOKEN or input_ids[end_idx] != SEP_SPECIAL_TOKEN:
            #     span_probabilities.append(0)    
            # elif input_ids[start_idx] == SEP_SPECIAL_TOKEN and input_ids[end_idx] == SEP_SPECIAL_TOKEN:
            #     span_prob = start_prob + end_prob
            #     span_probabilities.append(span_prob)      
            # if end_idx < start_idx:
            #     continue
            # if input_ids[start_idx] == SEP_SPECIAL_TOKEN:
            #     continue
            # if input_ids[start_idx] == PAD_SPECIAL_TOKEN:
            #     continue
            # if input_ids[end_idx] == SEP_SPECIAL_TOKEN:
            #     continue
            # if input_ids[end_idx] == PAD_SPECIAL_TOKEN:
            #     continue
            # if end_idx - start_idx > 20:
            #     continue

            if start_prob + end_prob > 1.5:  # if the sum of the probabilities is greater than 1.5, then include the answer
                ### WRITE YOUR ANSWER HERE
                pad_prob = probs_start[PAD_SPECIAL_TOKEN] + probs_end[PAD_SPECIAL_TOKEN]
                span_prob = start_prob + end_prob - pad_prob

                span_probabilities.append(span_prob)            
                ###
                span = tokenizer.decode(input_ids[start_idx:end_idx+1])
                spans.append(span)

            # span_prob = start_prob + end_prob
            # span_probabilities.append(span_prob)
            ###
            
            # span = tokenizer.decode(input_ids[start_idx:end_idx+1])
            # spans.append(span)

    # sort the spans according to probability:
    sorted_span_index = np.argsort(span_probabilities)

    # valid_spans = []
    # for idx in sorted_span_index[::-1]:
    #     if span_probabilities[idx] > 0:
    #         if spans[idx][0] == " ":
    #             valid_spans.append(spans[idx][1:])
    #         else:
    #             valid_spans.append(spans[idx])
    #     else:
    #         break
    # return valid_spans
    
    # print the top 20 answers:
    try:
        for i in range(20):
            print(f'Span prob = {span_probabilities[sorted_span_index[-i-1]]}, answer = {spans[sorted_span_index[-i-1]]}')
    except IndexError:
        pass
    
            
predict_answer(probs_start[example_index], probs_end[example_index], example['input_ids'], tokenizer)

Span prob = 1.53287672996521, answer = endangeredriver
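
The commented-out checks in the cell above (end before start, special tokens, overly long spans) can also be expressed as one boolean mask over all (start, end) pairs. This is a minimal sketch, not the expected solution: `valid_span_mask` and `max_answer_len` are hypothetical names, and the toy `input_ids` are illustrative values laid out as "[CLS] question [SEP] context [SEP] padding".

```python
import numpy as np

SEP_ID = 102  # input id of [SEP], as in the cell above
PAD_ID = 0    # input id of the padding token, as in the cell above

def valid_span_mask(input_ids, max_answer_len=20):
    """Return a matrix whose [start, end] entry is True for valid answer spans."""
    ids = np.asarray(input_ids)
    positions = np.arange(len(ids))

    # The context begins after the first [SEP]; later [SEP] and padding are excluded.
    first_sep = int(np.argmax(ids == SEP_ID))
    in_context = (positions > first_sep) & (ids != SEP_ID) & (ids != PAD_ID)

    start, end = np.meshgrid(positions, positions, indexing="ij")
    mask = in_context[start] & in_context[end]  # both ends lie inside the context
    mask &= end >= start                        # the span cannot end before it starts
    mask &= (end - start) < max_answer_len      # at most max_answer_len tokens
    return mask

toy_ids = [101, 2054, 102, 1996, 2314, 102, 0, 0]
mask = valid_span_mask(toy_ids)
print(np.argwhere(mask).tolist())  # [[3, 3], [3, 4], [4, 4]]
```

Multiplying such a mask element-wise with a matrix of span scores sets every invalid span to zero before ranking, which also avoids decoding spans that will be discarded.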


You can try out the pretrained QA model on a few examples and try to identify its common mistakes.

**TO-DO 5c:** State one way that we could improve the performance of our extractive QA model on the Tweet QA dataset.  **2 marks**

WRITE YOUR ANSWER HERE
<br>
<span style="color:yellow">**1. Fine-tune the model on a tweet-specific dataset:** We currently use the distilbert-base-cased-distilled-squad model, which was pretrained on a large corpus of text and is not specifically trained on tweets. Fine-tuning the model on a tweet-specific dataset could further improve its performance.
<br><br>
**2. Use different model architectures:** We have only considered the distilbert-base-cased-distilled-squad model, which is one of many pretrained models available for NLP tasks. We could also experiment with different architectures.
<br><br>
**3. Use data augmentation techniques:** The Tweet QA dataset is relatively small, which makes it difficult for the model to learn generalizable patterns. Data augmentation techniques, such as adding noise to the input or generating synthetic data, may help the model to learn more robust features.</span>

--- 

As well as answering ad-hoc queries, question answering models can help us to extract structured information about entities of interest from a large set of documents. Suppose that we want to automatically collect information on tech companies, such as Apple and OpenAI. We want to extract information about each company's activities from social media, including the names and release dates of new products and services, the company's earnings in a specific year, and who its CEO is.  

**TO-DO 5d:** Given a list of tech company names, how could we use question answering to extract the required information for each company from a set of tweets?  **(3 marks)** 
WRITE YOUR ANSWER HERE
<br>
_We could extract the required information with question answering using the following steps:_
<br><br>
<span style="color:red">
_1. First, we collect a set of tweets that mention the tech companies of interest, in our case Apple and OpenAI. We could do this by searching for tweets that include the specific keywords._
<br><br>
_2. We pre-process the tweets to clean them and remove retweets._ <br><br>
_3. We use a pretrained question answering model to answer specific questions about the companies, such as 'Who is the CEO of the company?' or 'What new products has Apple launched this year?'_<br><br>
_4. To extract structured information about each company, we ask the same set of standard questions for each company, such as:_<br>
i. _Which products were launched, and when?_<br>
ii. _What were the company's earnings in a given financial year?_<br>
iii. _Who is the company's current CEO?_<br><br>
_5. After running the question answering model on the set of tweets for each company and extracting the relevant information, we could organize the information into a structured format, such as a table or spreadsheet, for further analysis and visualization._
</span>
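
The steps above can be sketched as a small loop that asks the same templated questions about every company and keeps the highest-scoring answer. This is an illustration under assumptions: `extract_company_facts`, `question_templates`, `min_score` and `stub_qa` are hypothetical names, and `stub_qa` is a stand-in so that the example runs without downloading a model. In practice `qa_fn` could be a Transformers `pipeline("question-answering")` object, which accepts `question` and `context` arguments and returns a dictionary containing `answer` and `score`.

```python
def extract_company_facts(companies, tweets, qa_fn, min_score=0.5):
    """Build a {company: {field: answer}} table by asking fixed questions per company."""
    question_templates = {
        "ceo": "Who is the CEO of {company}?",
        "product": "What product did {company} release?",
        "earnings": "How much did {company} earn?",
    }
    table = {}
    for company in companies:
        # Step 1: keep only the tweets that mention the company.
        relevant = [t for t in tweets if company.lower() in t.lower()]
        row = {}
        for field, template in question_templates.items():
            question = template.format(company=company)
            best = None
            # Steps 3-4: ask the question of every relevant tweet, keep the best answer.
            for tweet in relevant:
                result = qa_fn(question=question, context=tweet)
                if result["score"] >= min_score and (best is None or result["score"] > best["score"]):
                    best = result
            row[field] = best["answer"] if best else None
        table[company] = row  # Step 5: one structured row per company
    return table

def stub_qa(question, context):
    """Hypothetical stand-in for a real QA model, for illustration only."""
    if "CEO" in question and "CEO" in context:
        return {"answer": context.split(" is ")[0], "score": 0.9}
    return {"answer": "", "score": 0.0}

tweets = ["Tim Cook is the CEO of Apple", "Apple shares rose today"]
facts = extract_company_facts(["Apple", "OpenAI"], tweets, stub_qa)
print(facts)
```

The score threshold matters here because an extractive QA model returns some span for every tweet, even when the tweet does not contain the answer.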

# 6. Transformer-based Text Classifiers (max. 24 marks)

The previous section showed us how to use a pretrained QA model based on a pretrained transformer. In this section, you will learn how to construct and train a text classifier on top of a pretrained transformer. 

We will use the [Poem Sentiment](https://huggingface.co/datasets/poem_sentiment) dataset to train and test a classifier. The task is to classify lines from poems into one of four classes: 0: negative, 1: positive, 2: no impact, or 3: mixed sentiment. For more information, see [Sheng and Uthus, 2020](https://arxiv.org/pdf/2011.02686.pdf). 

To begin you will need to instantiate a suitable model.

**TO-DO 6a:** Find an AutoModel class that constructs a text classifier from the pretrained TinyBERT model, "huawei-noah/TinyBERT_General_4L_312D". Create the `model` object in the cell below using this class. Refer to the [Hugging Face documentation for auto models](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) as needed. **(2 marks)**

In [24]:
### WRITE YOUR ANSWER TO 6a HERE ###
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=4)

Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertForSequenceClassification: ['fit_denses.2.weight', 'fit_denses.4.bias', 'fit_denses.0.weight', 'fit_denses.0.bias', 'cls.seq_relationship.weight', 'fit_denses.1.bias', 'fit_denses.2.bias', 'cls.predictions.transform.LayerNorm.bias', 'fit_denses.3.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'fit_denses.3.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'fit_denses.1.weight', 'fit_denses.4.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a

**TO-DO 6b:** Provide a link to the documentation for your chosen auto model for text classification. Briefly describe how the text classifier `model` it creates differs from the QA model created by `AutoModelForQuestionAnswering`. Note: A useful reference may be the original BERT paper (https://arxiv.org/pdf/1810.04805.pdf), which includes diagrams (Figure 4) showing how BERT can be adapted to different tasks. **(2 marks)** 

WRITE YOUR ANSWER TO 6b HERE:


_The auto model I chose is `AutoModelForSequenceClassification`. The documentation could be found [here](https://huggingface.co/docs/transformers/model_doc/auto#automodelforsequenceclassification)._

_`AutoModelForSequenceClassification` differs from `AutoModelForQuestionAnswering` in several ways. Firstly, `AutoModelForSequenceClassification` is designed for sequence classification tasks, whereas `AutoModelForQuestionAnswering` is designed for question answering tasks._

_Additionally, `AutoModelForSequenceClassification` adds a final classification layer on top of the transformer's output. This layer maps the transformer's representation of the whole sequence to one score per class, whereas the output layer of the QA model produces a start score and an end score for every token._
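
The difference between the two output layers is easiest to see in the shapes they produce. The sketch below uses random NumPy arrays in place of real transformer outputs, so it shows shapes only. The names `token_states`, `W_cls` and `W_qa` are hypothetical, the hidden size of 312 is taken from the TinyBERT model name, and using the first token's vector directly is a simplification of what the library does.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size, num_labels = 6, 312, 4

# Stand-in for the transformer's final hidden states: one vector per token.
token_states = rng.normal(size=(seq_len, hidden_size))

# Sequence classification head: one vector for the whole sequence
# (here simply the first token, [CLS]) is mapped to one score per class.
W_cls = rng.normal(size=(hidden_size, num_labels))
class_logits = token_states[0] @ W_cls

# Question answering head: every token vector is mapped to two scores,
# one for "answer starts here" and one for "answer ends here".
W_qa = rng.normal(size=(hidden_size, 2))
qa_logits = token_states @ W_qa
start_logits, end_logits = qa_logits[:, 0], qa_logits[:, 1]

print(class_logits.shape, start_logits.shape, end_logits.shape)  # (4,) (6,) (6,)
```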


---

For the QA task, the complete model was pretrained and we could apply it to a dataset without further training. However, for our poem sentiment classification task,
we will need to train our model before we can use it (you may see a message in the output of the last cell telling you this). 

**TO-DO 6c:** The sentiment classifier is built on top of a pretrained TinyBERT model, so why do we need to train it before we can use it? **(2 marks)**

WRITE YOUR ANSWER TO 6C HERE:

_While the TinyBERT model has been pretrained on a large corpus of text data, that pretraining did not include poem sentiment classification. The classification layer added on top of TinyBERT is newly initialised, so the model must still learn how to classify lines from poems into one of the four categories: negative, positive, no impact, or mixed sentiment._

_During the training process, the weights of the model are updated based on the errors made during classification. These updated weights enable the model to improve its performance on the task. Therefore, we need to fine-tune the pretrained model on the specific task of poem sentiment classification to achieve the best performance._



---

Next, let's learn how to train our model. For some tasks it is not necessary to update the weights in the BERT model itself, so we can freeze them to save a lot of computation time. We can do this as follows. Since our pretrained model is based on BERT, we can access the weights inside BERT through the variable `model.bert`.

In [25]:
for param in model.bert.parameters():
    param.requires_grad = False

To train our model, we can make use of the Trainer class, which encapsulates a lot of the complex training steps and avoids the need to define our own training function, as we did in the previous notebook (we don't need to write our own `train_nn`).

First, define some settings for the training process. This is where we can set training hyperparameters:

In [26]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # specify the directory where model weights will be saved at certain points during training (checkpoints)
    num_train_epochs=10, # A sensible and sufficient number to use for the to-dos below
    per_device_train_batch_size=8,  # you can decrease this if memory usage is too high while training
    logging_steps=50,  # how often to print progress during training
)

Next, create a trainer object. Note that the next cell will currently fail with an error, because the variables `poem_train_dataset` and `poem_val_dataset` do not exist yet! Don't worry, we'll fix this later. 

In [27]:
from transformers import Trainer
from torch import nn

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=poem_train_dataset,
    eval_dataset=poem_val_dataset,
)

NameError: name 'poem_train_dataset' is not defined

To train the model, you will need to call `trainer.train()`.

Once the model is trained, we can obtain predictions using the function below. Notice that it is simpler than obtaining the spans for QA -- we simply get the logits for each text in the test set, then apply argmax over the classes to find the most probable class for each text:

In [28]:
def predict_nn(trained_model, test_dataset):

    # Switch off dropout
    trained_model.eval()
    
    # Pass the required items from the dataset to the model    
    output = trained_model(attention_mask=torch.tensor(test_dataset["attention_mask"]), input_ids=torch.tensor(test_dataset["input_ids"]))
                        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    pred_labs = np.argmax(output["logits"].detach().numpy(), axis=1)

    return pred_labs
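
To see how logits relate to predicted classes and probabilities, the sketch below applies a softmax and an argmax to made-up logits for two texts. The `softmax` helper, `LABELS` list and `toy_logits` values are illustrative assumptions, with class names ordered as in the Poem Sentiment labels.

```python
import numpy as np

LABELS = ["negative", "positive", "no_impact", "mixed"]

def softmax(logits):
    """Convert unnormalised scores into probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for two texts (one row per text, one column per class).
toy_logits = np.array([[2.0, 0.5, 0.1, -1.0],
                       [0.0, 3.0, 0.0, 0.0]])

probs = softmax(toy_logits)
pred = probs.argmax(axis=1)

for row, label_id in zip(probs, pred):
    print(LABELS[label_id], row.round(3))
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits gives the same class as taking the argmax of the probabilities. The softmax is only needed when the probabilities themselves are of interest.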

You should now have all the bits and pieces needed to build and train a text classifier. Let's put them all together...

**TO-DO 6d:** Implement and test a classifier for the [Poem Sentiment](https://huggingface.co/datasets/poem_sentiment) dataset using a pretrained transformer. Evaluate the classifier with both frozen and unfrozen (i.e., fine-tuned) parameters in the pretrained transformer. Choose a suitable evaluation metric and provide a comparison of the results below, including a brief explanation  (1-2 sentences) for any differences you observe between the frozen and unfrozen variants. Make sure to comment your code.  **(10 marks)**

Notes: 
 * Strong classifier performance is not required to achieve good marks -- rather, we award marks for implementing and testing a transformer-based classifier correctly.
 * You may implement any suitable kind of classifier you like, as long as you are using a pretrained transformer model.
 * 'tiny' BERT variants such as TinyBERT and roberta-tiny are recommended because they are small enough to fine-tune with a typical laptop CPU. We recommend sticking with these smaller pretrained models unless you have access to a GPU, e.g., via Google Colab. 
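
When choosing an evaluation metric, it can help to compare accuracy with macro-averaged F1, since the two can disagree sharply if one class dominates (whether that applies to this dataset is worth checking from its label counts). The sketch below uses made-up labels and hypothetical helper names `accuracy` and `macro_f1`. A class with no true and no predicted examples is scored as 0 here, which is a simplifying assumption.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1_scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1_scores))

# Hypothetical labels: a classifier that always predicts class 2.
y_true = [2, 2, 2, 2, 0, 1, 3]
y_pred = [2, 2, 2, 2, 2, 2, 2]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred, num_classes=4))
```

In this toy case the accuracy looks moderate even though three of the four classes are never predicted, whereas macro F1 penalises the missing classes.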

WRITE YOUR ANSWER HERE (DESCRIPTION OF RESULTS FOR 6d):


In [29]:
### WRITE YOUR ANSWER HERE (Code for 6d; feel free to use multiple cells and copy code from above) ###
# Load the Poem Sentiment dataset and keep its splits and labels
from datasets import load_dataset
from sklearn.metrics import accuracy_score

dataset = load_dataset("poem_sentiment")

poem_train_dataset = dataset["train"]
poem_val_dataset = dataset["validation"]
poem_test_dataset = dataset["test"]
poem_train_labels = poem_train_dataset["label"]
poem_val_labels = poem_val_dataset["label"]
poem_test_labels = poem_test_dataset["label"]


Found cached dataset poem_sentiment (/home/vishal/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)


  0%|          | 0/3 [00:00<?, ?it/s]

In [32]:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from torch.optim import AdamW

# load dataset
dataset = load_dataset("poem_sentiment")
train_dataset = dataset["train"]
val_dataset = dataset["validation"]
test_dataset = dataset["test"]

# load tokenizer and encode dataset
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
train_encodings = tokenizer(train_dataset["verse_text"], truncation=True, padding=True)
val_encodings = tokenizer(val_dataset["verse_text"], truncation=True, padding=True)
test_encodings = tokenizer(test_dataset["verse_text"], truncation=True, padding=True)

# create PyTorch dataset
class PoemSentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
    
# compute accuracy from the predictions and labels that the Trainer passes in
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = (preds == labels).mean()
    return {"accuracy": acc}

train_dataset = PoemSentimentDataset(train_encodings, train_dataset["label"])
val_dataset = PoemSentimentDataset(val_encodings, val_dataset["label"])
test_dataset = PoemSentimentDataset(test_encodings, test_dataset["label"])

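# load pre-trained model and freeze the BERT encoder parameters
# (note: this loads bert-base-uncased, while the tokenizer above comes from TinyBERT)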
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
for param in model.bert.parameters():
    param.requires_grad = False

# set up training arguments
training_args = TrainingArguments(
    output_dir="./transformer_checkpoints",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    logging_steps=50,
)

# set up trainer and train the classification layer (the BERT encoder is frozen)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()

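# wrap the trained model in a new Trainer that reports accuracy;
# Trainer.predict prefixes metric names with "test_" by default, even for the validation set
trainer = Trainer(model=model, compute_metrics=compute_metrics)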
val_results = trainer.predict(val_dataset)
print("Accuracy on validation set (frozen parameters):", val_results.metrics["test_accuracy"])



# unfreeze parameters and fine-tune model
for param in model.bert.parameters():
    param.requires_grad = True

training_args2 = TrainingArguments(
    output_dir="./transformer_checkpoints",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    logging_steps=50,
)

trainer2 = Trainer(
    model=model,
    args=training_args2,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer2.train()
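# wrap the fine-tuned model in a new Trainer that reports accuracy
trainer2 = Trainer(model=model, compute_metrics=compute_metrics)
val_results2 = trainer2.predict(val_dataset)

# NOTE: val_results2 holds predictions on the validation split, although the label printed below says "test set"
print("Accuracy on test set (unfrozen parameters):", val_results2.metrics["test_accuracy"])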


Found cached dataset poem_sentiment (/home/vishal/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)


  0%|          | 0/3 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model

Step,Training Loss
50,1.2597
100,1.0591
150,1.0283
200,1.0598
250,1.0733
300,1.0912
350,1.0252
400,1.0602
450,1.0309
500,1.0743


Accuracy on validation set (frozen parameters): 0.6571428571428571




Step,Training Loss
50,0.9418
100,0.7984
150,0.4557
200,0.3658
250,0.3382
300,0.2112
350,0.2102
400,0.1195


Accuracy on test set (unfrozen parameters): 0.8666666666666667


**TO-DO 6e:** Did your sentiment classifier make use of any kind of model transfer or transfer learning? If so, what kinds of transfer were used and what benefit do they provide? **(4 marks)**

WRITE YOUR ANSWER HERE:
<br>
<span style="color:red">
Yes, the sentiment classifier used transfer learning by fine-tuning a pretrained transformer model. The code loads the pretrained `bert-base-uncased` model through `AutoModelForSequenceClassification`, and encodes the poem dataset with the `TinyBERT_General_4L_312D` tokenizer. 
<br>
By starting from a pretrained transformer and fine-tuning it on a smaller dataset, my code benefits from transfer learning: the model can reuse the knowledge it learned from a large corpus of text and adapt it to the specific task of sentiment analysis of poems.
<br>
The choice of pretrained model also affects the accuracy of the classifier.
</span>

---

**TO-DO 6f:** Use your model to compute the probability of sentiment for a sentence of your choosing. Comment your code and print the sentence with its probability distribution. Label the values so that we know which class they refer to. **(4 marks)**

Hint: you could use a poem generator, such as [this one](https://www.poemofquotes.com/tools/poetry-generator/ai-poem-generator), to generate a test sentence. 

In [39]:
# WRITE YOUR ANSWER HERE   
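# run on the GPU if one is available, otherwise on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# the test sentence (the backslashes join the lines into a single string)
sentence = "Springs magic \
Flowers blooming \
Birds singing \
Nature coming alive \
All living things growing \
Living life \
Growing stronger \
Spring, I love you!"

# tokenize the sentence, then move the inputs and the model to the same device
inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors='pt')
inputs.to(device)
model.to(device)
outputs = model(**inputs)

# softmax turns the logits into a probability distribution over the four classes
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

# print the sentence, then each class name with its probability
print(f"Sentence: {sentence}")
print(f"Probabilities Distribution: ")
for i, label in enumerate(dataset["train"].features["label"].names):
    print(f"{label}: {probs[0, i].item()}")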


Sentence: Springs magic Flowers blooming Birds singing Nature coming alive All living things growing Living life Growing stronger Spring, I love you!
Probabilities Distribution: 
negative: 0.00013046922686044127
positive: 0.9994756579399109
no_impact: 0.00011731424456229433
mixed: 0.00027657923055812716


# 7. OPTIONAL: More on Transformers

There are many great resources out there to show you how to use this kind of model in practice:
* An extensive online course is provided by HuggingFace: https://huggingface.co/course/chapter1/1. The pages linked from the HuggingFace course website have an 'open in Colab' button on the top right. You can open the notebook and run it on a Google server there to access GPUs.
* Chapters that may be particularly useful: 
   * Transformers, what can they do? https://huggingface.co/course/chapter1/3?fw=pt
   * Using Transformers: https://huggingface.co/course/chapter2/2?fw=pt
* HuggingFace provides information on fine-tuning transformer models here: https://huggingface.co/docs/transformers/training. Fine-tuning updates the weights inside the pretrained network and, for larger models, typically requires GPU or TPU computing. 
* Text Generation: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb. This topic goes well beyond the data analytics covered in this unit and shows you another powerful feature of pretrained transformers.


