# UofT FASE ML Bootcamp
#### Friday June 14, 2024
#### Sentiment Classification and Prompting with Transformers - Lab 1, Day 5
#### Teaching team: Eldan Cohen, Alex Olson, Nakul Upadhya, Hriday Chheda
##### Lab author: Kyle E. C. Booth, kbooth@mie.utoronto.ca, edited by Jake Mosseri

In this lab, you will focus on different approaches for developing a classifier using pre-trained Transformer models.

In particular, you will focus on:
1. Extracting pre-trained text embeddings and then training a separate classifier to predict sentiment
2. In-context zero-shot and few-shot learning in LLMs

In this lab, you will be using the popular [Hugging Face Transformers library](https://huggingface.co/docs/transformers/en/index).

---

We start by installing and importing the required libraries:

In [None]:
! pip install -q datasets
! pip install -q transformers[torch]
! pip install -q evaluate


In [None]:
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import re
from tqdm import tqdm
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Sentiment Classification using Feature Extraction from BERT

We will be using a training and evaluation dataset containing financial tweets and a label indicating whether they are Bearish (believing prices will drop), Bullish (believing prices will rise), or Neutral.
In this task, you will use a pre-trained BERT model without fine-tuning and use the representation obtained from this pre-trained BERT as features for a separate classifier.

First, load the dataset by running the following code:

In [None]:
ds = load_dataset("zeroshot/twitter-financial-news-sentiment", split="validation")
print("Example tweet:", ds[0])

#Clean the tweets to remove URLs
ds = ds.map(lambda x: {"text": re.sub(r'http\S+', '', x["text"]).strip(), "label": x["label"]})
print("Example tweet:", ds[0])


Example tweet: {'text': '$ALLY - Ally Financial pulls outlook https://t.co/G9Zdi1boy5', 'label': 0}


Example tweet: {'text': '$ALLY - Ally Financial pulls outlook', 'label': 0}
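As a quick check of the cleaning step, the sketch below applies the same regular expression to the example tweet outside of `ds.map`. The helper name `strip_urls` is hypothetical and used only for illustration.

```python
import re

def strip_urls(text):
    """Remove every token starting with 'http' and trim surrounding whitespace."""
    return re.sub(r"http\S+", "", text).strip()

# The raw example tweet printed above
raw_tweet = "$ALLY - Ally Financial pulls outlook https://t.co/G9Zdi1boy5"
clean_tweet = strip_urls(raw_tweet)
print(clean_tweet)
```

Note that the pattern `http\S+` removes everything from `http` up to the next whitespace character, so it also covers `https` links.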


Each tweet has a label:
- 0: Bearish (believing prices will drop)
- 1: Bullish (believing prices will rise)
- 2: Neutral
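When inspecting examples or predictions, it can help to turn the integer labels back into names. The sketch below encodes the mapping listed above; the names `ID2LABEL` and `label_name` are illustrative and are not part of the dataset or the lab code.

```python
# Mapping copied from the label list above (names are illustrative)
ID2LABEL = {0: "Bearish", 1: "Bullish", 2: "Neutral"}

def label_name(label_id):
    """Return the human-readable name for an integer label."""
    return ID2LABEL[label_id]

example = {"text": "$ALLY - Ally Financial pulls outlook", "label": 0}
print(example["text"], "->", label_name(example["label"]))
```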

In [None]:
ds[0] # 'label' field

{'text': '$ALLY - Ally Financial pulls outlook', 'label': 0}

Splitting to training and evaluation sets:

In [None]:
split = ds.train_test_split(test_size=0.1)
training_set = split["train"]
eval_set = split["test"]

Now that the data is set up, we focus on extracting representations from a pre-trained language model. Here we choose BERT (specifically, "bert-base-uncased"). Our goal is to create a classifier that predicts the label of a given tweet.
The idea is to first extract the representations of the tweets from the pre-trained BERT model, and then use these representations as features, together with the given labels, to train a support vector classifier (SVC) that predicts the label of a tweet from its BERT representation.

In [None]:
# Load the pre-trained BERT model
model = AutoModel.from_pretrained("bert-base-uncased")

# Remember, we need to tokenize the tweets to input them to an LLM
# Here we load the appropriate tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


Let's take one example from our training set and tokenize it:

In [None]:
example_tweet = training_set[0]["text"]
print(example_tweet)
# Tokenize the text and print the ID of each token:
tokenized_example = tokenizer(example_tweet)
print(tokenized_example["input_ids"])

PLUG Stock Remains Tantalizing and Vexing
[101, 13354, 4518, 3464, 9092, 9080, 6026, 1998, 2310, 19612, 102]


Let's use the tokenizer to translate back each token ID to the vocabulary word:

In [None]:
print([tokenizer.decode(token_id) for token_id in tokenized_example["input_ids"]])

['[CLS]', 'plug', 'stock', 'remains', 'tan', '##tal', '##izing', 'and', 've', '##xing', '[SEP]']
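In the output above, BERT's WordPiece tokenizer has split the rarer words into sub-word pieces, with `##` marking a piece that continues the previous one. A minimal sketch of how these pieces map back to whole words is shown below; `merge_wordpieces` is a hypothetical helper written for this illustration, not a function of the tokenizer.

```python
def merge_wordpieces(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Join '##' continuation pieces onto the preceding piece and drop special tokens."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # append the piece without its '##' prefix
        else:
            words.append(token)
    return words

# Tokens copied from the output above
tokens = ['[CLS]', 'plug', 'stock', 'remains', 'tan', '##tal', '##izing',
          'and', 've', '##xing', '[SEP]']
words = merge_wordpieces(tokens)
print(words)
```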


Now, we will set up a feature extraction pipeline that will take a text and generate the features for this text using the BERT model and tokenizer we loaded.

In [None]:
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

In [None]:
# Run the pipeline to obtain the representation of the example tweet:
example_result = feature_extractor(example_tweet, return_tensors = "pt")
print(example_result.shape)

torch.Size([1, 11, 768])


For each tweet we obtain a tensor of shape [1, LENGTH, 768], where LENGTH is the number of tokens and 768 is the hidden (latent) dimension. We can summarize the features for each data point by mean pooling the contextual token representations into a fixed-size representation (768 dimensions for this BERT model):

In [None]:
#This takes the mean of the feature across all the tokens
example_result[0].numpy().mean(axis=0).shape

(768,)
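To make the pooling step concrete, here is a small sketch on a toy array instead of real BERT output. The values and the hidden size of 3 are made up so that the result is easy to verify by hand; with BERT the last dimension would be 768.

```python
import numpy as np

# Toy stand-in for the pipeline output: 1 text, 4 tokens, hidden size 3
token_embeddings = np.array([[[1.0, 2.0, 3.0],
                              [3.0, 2.0, 1.0],
                              [0.0, 0.0, 0.0],
                              [4.0, 4.0, 4.0]]])

# Drop the leading batch dimension, then average over the token axis
pooled = token_embeddings[0].mean(axis=0)
print(pooled.shape)
print(pooled)
```

Averaging over `axis=0` collapses the token dimension, so tweets of different lengths all end up with a vector of the same size.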

## 1.1
Using the example from above, extract the fixed-size contextual embeddings of both the training set and the evaluation set using BERT.



In [None]:
# Extract the training set embeddings

# Initialize a NumPy array of zeros to store the extracted embeddings for the training set
training_feature_set = np.zeros((training_set.num_rows, 768))

# Iterate over training tweets, extract and store the embeddings in training_feature_set
for training_example_id in tqdm(range(training_set.num_rows)):
  example_tweet = # TODO: Select the tweet corresponding to training_example_id
  training_example_embedding = # TODO: Extract the embedding of example_tweet
  training_example_summarized = # TODO: Average the embedding across all the tokens
  training_feature_set[training_example_id] = training_example_summarized

100%|██████████| 2149/2149 [07:00<00:00,  5.11it/s]


In [None]:
# Extract the evaluation set embeddings

# Initialize a NumPy array of zeros to store the extracted embeddings for the evaluation set
eval_feature_set = np.zeros((eval_set.num_rows, 768))

# Iterate over evaluation tweets, extract and store the embeddings in eval_feature_set
for eval_example_id in tqdm(range(eval_set.num_rows)):
  example_tweet = # TODO: Select the tweet corresponding to eval_example_id
  eval_example_embedding = # TODO: Extract the embedding of example_tweet
  eval_example_summarized = # TODO: Average the embedding across all the tokens
  eval_feature_set[eval_example_id] = eval_example_summarized

## 1.2

Now we can use the extracted embeddings to train a classifier model. Here we choose the SVC classifier. Note: typically we would do hyper-parameter tuning for the classifier using a held-out validation set. For simplicity, just use the default hyper-parameters.
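For reference, the sketch below shows what the hyper-parameter tuning mentioned in the note could look like. It is not required for this lab and rests on several assumptions: the features are synthetic stand-ins rather than BERT embeddings, the parameter grid is arbitrary, and scikit-learn's `GridSearchCV` (cross-validation) is used in place of a single held-out validation set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 3 classes, 40 points each, 8 dimensions
centers = np.array([[-3.0] * 8, [0.0] * 8, [3.0] * 8])
features = np.vstack([center + rng.normal(size=(40, 8)) for center in centers])
labels = np.repeat([0, 1, 2], 40)

# Arbitrary illustrative grid over the regularization strength and kernel
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(features, labels)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is an `SVC` refit on all the provided data with the best parameter combination.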

In [None]:
# Create the train and evaluation labels (or targets) as lists
train_labels = # TODO
eval_labels = # TODO

In [None]:
# Initialize the SVC model
svc_model = SVC()

# Train the SVC model
svc_model.fit( ) # TODO: Input parameters to fit the SVC model - train features and training labels

In [None]:
# Evaluate the SVC model on the evaluation set
eval_predictions = # TODO: Call the predict method of the svc_model on the eval features

eval_accuracy = accuracy_score(eval_labels, eval_predictions)  # signature is (y_true, y_pred)
print(f"Evaluation accuracy is: {eval_accuracy}")

# 2. In-context Zero-shot Learning

Next, we investigate in-context zero-shot learning using an instruction-tuned LLM. Specifically, we use [Flan T5](https://huggingface.co/google/flan-t5-base) because it is relatively small and does not require significant resources to run.

In-context zero-shot learning allows a model to perform a task without task-specific training, relying only on the instructions in the prompt and the general knowledge acquired during pre-training. For instance, even if the model has never seen the phrase "The movie was a rollercoaster of emotions," it can still infer that the sentiment is positive from how the words "rollercoaster" and "emotions" are typically used in movie reviews. This lets the model assess the sentiment of novel sentences without any labeled examples.

In this section, you will create prompts for in-context zero-shot learning to classify the sentiment of movie reviews.

In [None]:
# Import necessary libraries
from transformers import T5Tokenizer, T5ForConditionalGeneration

Loading the Flan T5 (base) model and tokenizer:

In [None]:
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

Let's first look at a simple zero-shot machine translation task:

In [None]:
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

<pad> Wie old sind Sie?</s>


We are going to classify the sentiment of movie reviews. Here is a simple prompting template for classifying movie reviews:

In [None]:
prompt_template = "\"{review_text}\". \nIs it good?"
print(prompt_template)

"{review_text}". 
Is it good?


Here is how we can prompt the model using this prompt template and an example review:

In [None]:
example_review = "This movie was so enjoyable and I recommend it to everyone."

input_text = prompt_template.format(review_text=example_review)
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))

<pad> yes</s>


In [None]:
example_review = "This movie is very well made. Still, I did not enjoy it and cannot recommend it to anyone."

input_text = prompt_template.format(review_text=example_review)
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_length=10)
print(tokenizer.decode(outputs[0]))

<pad> no</s>


Now let's load the IMDB movie reviews dataset:

In [None]:
ds = load_dataset("imdb", split="train")

split = ds.train_test_split(test_size=0.002, seed=42)
training_set = split["train"]
eval_set = split["test"]

Each record includes a review text and a label: 0 for negative, 1 for positive. For example:

In [None]:
eval_set[0]

{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...',
 'label': 1}

We will use the small evaluation set of 50 reviews to evaluate the model with the prompt above:

In [None]:
texts = eval_set["text"]
labels = eval_set["label"]

In [None]:
predictions = []
for idx, text in tqdm(enumerate(texts)):
  input_text = prompt_template.format(review_text=text)
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids
  outputs = model.generate(input_ids, max_length=10)
  output_text = tokenizer.decode(outputs[0])
  if "no" in output_text:
    predictions.append(0)
  else:
    predictions.append(1)

print("\npredictions:", predictions)

25it [01:03,  2.55s/it]


predictions: [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0]
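The loop above treats any output containing the substring "no" as negative, which would also match outputs such as "not sure" or "unknown", and it maps every other output to positive. A stricter parser is sketched below under the assumption that the model answers with "yes" or "no" as its first word; `parse_yes_no` is a hypothetical helper, and returning `None` for anything else is a design choice rather than part of the lab.

```python
import re

def parse_yes_no(output_text):
    """Map a decoded model output to 1 (yes), 0 (no), or None (unrecognized)."""
    # Remove special tokens such as <pad> and </s>, then normalize case
    cleaned = re.sub(r"<[^>]+>", " ", output_text).strip().lower()
    words = re.findall(r"[a-z]+", cleaned)
    if not words:
        return None
    if words[0] == "yes":
        return 1
    if words[0] == "no":
        return 0
    return None

print(parse_yes_no("<pad> yes</s>"))
print(parse_yes_no("<pad> no</s>"))
print(parse_yes_no("<pad> not sure</s>"))
```

Unrecognized outputs can then be counted separately instead of silently becoming positive predictions.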





Measure accuracy:

In [None]:
accuracy_score(labels, predictions)

0.68
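Accuracy is simply the fraction of predictions that match the labels, and the same comparison yields the positions of the misclassified reviews. The sketch below uses short made-up lists rather than the real predictions; the function names are illustrative.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)

def misclassified_indices(predictions, labels):
    """Indices of the examples whose prediction differs from the label."""
    return [i for i, (p, y) in enumerate(zip(predictions, labels)) if p != y]

# Made-up toy values for illustration only
toy_predictions = [1, 1, 0, 1, 0]
toy_labels = [1, 0, 0, 1, 1]

print(accuracy(toy_predictions, toy_labels))
print(misclassified_indices(toy_predictions, toy_labels))
```

With the real data, the returned indices can be used to look up the corresponding entries of `texts` for manual inspection.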

Investigate the reviews for which the model did not make correct predictions according to the provided labels. What is the reason? Did the model make a mistake? (it is also possible that you agree with the model and the label, in fact, does not seem to be correct - explain if this is the case)

Answer: TODO

While the performance using this prompt is better than random guessing, can you try other prompts to improve the accuracy? For example, you could ask specifically whether the review is positive or negative. Be as creative as you wish and report results using 2 different prompts.
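One possible way to organize such an experiment is sketched below. The two templates are hypothetical starting points that have not been evaluated here, and the parser assumes the model replies with the word "positive" or "negative"; adapt both to the prompts you actually try.

```python
# Hypothetical candidate templates; their accuracy has not been measured
candidate_templates = {
    "positive_or_negative": 'Review: "{review_text}"\nIs this review positive or negative?',
    "sentiment_question": 'What is the sentiment of this movie review?\n"{review_text}"',
}

def build_prompts(review_text):
    """Fill every candidate template with the same review text."""
    return {name: template.format(review_text=review_text)
            for name, template in candidate_templates.items()}

def parse_sentiment(output_text):
    """Map a decoded output to 1 (positive), 0 (negative), or None (unrecognized)."""
    text = output_text.lower()
    if "negative" in text:
        return 0
    if "positive" in text:
        return 1
    return None

prompts = build_prompts("This movie was so enjoyable and I recommend it to everyone.")
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```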

In [None]:
#TODO: Try different prompts to classify the reviews

# 3. Few Shot Prompting

Few-shot prompting enables large language models to perform better on complex tasks by providing demonstrations. While zero-shot capabilities have shown remarkable results, few-shot prompting has emerged as a more effective way to tackle complex tasks by utilizing different numbers of demonstrations, such as 1-shot, 3-shot, 5-shot, and so on.

We present some examples that use few-shot prompting. We use GPT-Neo 1.3B, a 1.3-billion-parameter model trained by EleutherAI that replicates the GPT-3 architecture and is free to use. You can read more about this model [here](https://huggingface.co/EleutherAI/gpt-neo-1.3B).

In [None]:
# Import required libraries
from transformers import pipeline

In [None]:
# Load the model pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')

## Example 1: Translation
To demonstrate few-shot prompting, consider the following example, in which the task is to translate "Thank you" into French. The expected answer is "Merci". First, we try prompting the model with direct zero-shot prompts, as below:

In [None]:
input_text = """Translate "Thank you" to French"""
output = generator(input_text, max_new_tokens=10, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

Translate "Thank you" to French

Monday, August 30, 2012




In [None]:
input_text = """What is "thank you" in French?"""
output = generator(input_text, max_new_tokens=20, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

What is "thank you" in French? Many French speakers are accustomed to the "d' " or "de " used to thank someone for


As we can see, these direct prompts do not produce the desired result. Let's try giving the model one example, i.e., a one-shot prompt, as below:


In [None]:
input_text = """
Example:
English: "Good morning."
French: "Bonjour."

Now you try:
English: "Thank you."
French:"""
output = generator(input_text, max_new_tokens=4, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)


Example:
English: "Good morning."
French: "Bonjour."

Now you try:
English: "Thank you."
French: "Merci."
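The one-shot prompt above follows a regular pattern, so it can be generated for any number of demonstrations. The helper below is a sketch of that idea: `build_few_shot_prompt` is a hypothetical function, and the second demonstration pair is an added illustration that does not appear in the lab.

```python
def build_few_shot_prompt(demonstrations, query, source="English", target="French"):
    """Format (source_text, target_text) demonstrations followed by an open query."""
    blocks = []
    for source_text, target_text in demonstrations:
        blocks.append(f'Example:\n{source}: "{source_text}"\n{target}: "{target_text}"\n')
    # Leave the target side of the query empty so the model completes it
    blocks.append(f'Now you try:\n{source}: "{query}"\n{target}:')
    return "\n".join(blocks)

demos = [("Good morning.", "Bonjour."), ("Good night.", "Bonne nuit.")]
prompt = build_few_shot_prompt(demos, "Thank you.")
print(prompt)
```

Passing a single demonstration gives a one-shot prompt in the same style as the one above, and an empty list gives a zero-shot prompt.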


## Example 2: Named Entity Recognition
Here the task is to identify and label the named entities in the input sentence.

For example, Sentence: "Barack Obama was born in Hawaii."

Answer: Barack Obama is a person and Hawaii is a location.

In [None]:
input_text = """Task: Identify and label the named entities in the following sentence.
Barack Obama was born in Hawaii.
"""
output = generator(input_text, max_new_tokens=20, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

Task: Identify and label the named entities in the following sentence.
Barack Obama was born in Hawaii.
Barack Obama is from Hawaii.
Barack Obama is male.
Barack Obama is female


In [None]:
input_text = """Example:
Sentence: "Barack Obama was born in Hawaii."
Entities: [Barack Obama: PERSON, Hawaii: LOCATION]

Now you try:
Sentence: "Justin Trudeau is the prime minister of Canada."
Entities:"""
output = generator(input_text, max_new_tokens=11, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

Example:
Sentence: "Barack Obama was born in Hawaii."
Entities: [Barack Obama: PERSON, Hawaii: LOCATION]

Now you try:
Sentence: "Justin Trudeau is the prime minister of Canada."
Entities: [Justin Trudeau: PERSON, Canada: LOCATION
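Because the one-shot prompt fixes the output format, the generated continuation can be turned into a structured result. The sketch below parses the text after `Entities:` into a dictionary; `parse_entities` is a hypothetical helper, and the regular expression assumes upper-case labels as in the example. It tolerates the missing closing bracket that results from the `max_new_tokens` limit.

```python
import re

def parse_entities(generated):
    """Extract 'name: LABEL' pairs from a bracketed entity list."""
    # A name is any run of characters without brackets, commas, or colons;
    # a label is a run of upper-case letters or underscores.
    pairs = re.findall(r"([^\[\],:]+):\s*([A-Z_]+)", generated)
    return {name.strip(): label for name, label in pairs}

# Continuation copied from the output above (note the missing closing bracket)
continuation = "[Justin Trudeau: PERSON, Canada: LOCATION"
print(parse_entities(continuation))
```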


## Example 3: Question Answering (TODO)

The task in this example is to test the comprehension abilities of the model: given an input passage, is the model able to correctly answer a question based on that passage?

For example a passage could be, "Marie Curie was a physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize."
A suitable question based on the passage: "What did Marie Curie win?"

Use a few-shot prompt to answer the following questions:

1. Passage: "Thomas Edison was an American inventor who developed many devices including the phonograph and the electric light bulb."
   Question: "What did Thomas Edison develop?"

2. Passage: "Mount Everest is the highest mountain in the world, located in the Himalayas on the border between Nepal and China."
   Question: "Where is Mount Everest located?"


In [None]:
# TODO: Fill in your one-shot prompt for passage 1 in the input_text below
input_text = """

"""
output = generator(input_text, max_new_tokens=10, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

In [None]:
# TODO: Fill in your one-shot prompt for passage 2 in the input_text below
input_text = """

"""
output = generator(input_text, max_new_tokens=10, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

## Limitations of few shot prompting

Consider the task of asking a large language model to solve a math word problem, for example:
"Sarah has 12 apples. She gives 3 apples to each of her 2 friends. How many apples does she have left?"
Of course, we know the answer is 6 (12 - 3 * 2). Is the LLM able to get the answer?


In [None]:
input_text = """Sarah has 12 apples. She gives 3 apples to each of her 2 friends.
How many apples does she have left? (output a number)
"""
output = generator(input_text, max_new_tokens=10, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

Sarah has 12 apples. She gives 3 apples to each of her 2 friends. 
How many apples does she have left? (output a number)
    |     1 


We don't expect the model to answer this question correctly, but what if we use one-shot prompting in this case?

In [None]:
input_text = """Task: Solve the following math word problem step-by-step.

Example 1:
Problem: Sarah has 12 apples. She gives 3 apples to each of her 2 friends. How many apples does she have left?
Answer: 6

Now you try:
Example 2:
Problem: John has 5 packs of crayons. Each pack contains 8 crayons. He gives 15 crayons to his friends. How many crayons does he have now?
Answer:"""
output = generator(input_text, max_new_tokens=2, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

Task: Solve the following math word problem step-by-step.

Example 1:
Problem: Sarah has 12 apples. She gives 3 apples to each of her 2 friends. How many apples does she have left?
Answer: 6

Now you try:
Example 2:
Problem: John has 5 packs of crayons. Each pack contains 8 crayons. He gives 15 crayons to his friends. How many crayons does he have now?
Answer: 5



While the model does output an answer, it is clearly not the correct one (5 * 8 - 15 = 25). Few-shot prompting provides a model with a limited number of examples, which may not be sufficient for understanding the nuances of complex tasks. The model might struggle to generalize from these examples, especially when the tasks involve multiple steps or intricate logic.
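To check an answer like this programmatically, one can separate the model's continuation from the prompt and compare it with the value computed from the problem statement. The sketch below does this for the crayon problem without calling the model: the string `generated_text` is reconstructed from the output shown above, and the variable names are illustrative.

```python
prompt = (
    "Problem: John has 5 packs of crayons. Each pack contains 8 crayons. "
    "He gives 15 crayons to his friends. How many crayons does he have now?\n"
    "Answer:"
)
# The text-generation output shown above is the prompt followed by the continuation
generated_text = prompt + " 5"

# Keep only what the model added after the prompt
model_answer = generated_text[len(prompt):].strip()

# 5 packs of 8 crayons, minus the 15 crayons given away
expected_answer = 5 * 8 - 15

print("model:", model_answer)
print("expected:", expected_answer)
print("correct:", model_answer == str(expected_answer))
```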