[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eldanc/mlbootcamp2025/blob/main/lab_5_2_llms.ipynb)

# UofT FASE ML Bootcamp
#### Friday June 13, 2025
#### Sentiment Classification and Prompting with Transformers - Lab 2, Day 5
#### Teaching team: Eldan Cohen, Alex Olson, Nakul Upadhya
##### Lab author: Hriday Chheda, Eldan Cohen, edited by Nakul Upadhya

In this lab, you will focus on different approaches for developing a classifier using pre-trained Transformer models.

In particular, you will focus on:
1. Extracting pre-trained text embeddings and then training a separate classifier to predict sentiment
2. In-context zero-shot and few-shot learning in LLMs

In this lab, you will be using the popular [Hugging Face Transformers library](https://huggingface.co/docs/transformers/en/index).

---

We start by installing and importing the required libraries:

In [None]:
! pip install -U datasets
! pip install -q transformers[torch]
! pip install -q evaluate

In [None]:
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import re
from tqdm import tqdm
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Sentiment Classification using Feature Extraction from BERT

We will be using a training and evaluation dataset containing financial tweets and a label indicating whether they are Bearish (believing prices will drop), Bullish (believing prices will rise), or Neutral.
In this task, you will use a pre-trained BERT model without fine-tuning and use the representation obtained from this pre-trained BERT as features for a separate classifier.

First, load the dataset by running the following code cell:

In [None]:
ds = load_dataset("zeroshot/twitter-financial-news-sentiment", split="validation")
print("Example tweet:", ds[0])

#Clean the tweets to remove URLs
ds = ds.map(lambda x: {"text": re.sub(r'http\S+', '', x["text"]).strip(), "label": x["label"]})
print("Example tweet:", ds[0])
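
To see what the cleaning step does in isolation, here is a minimal sketch that applies the same regular expression to a made-up tweet. The sample text is hypothetical and not taken from the dataset.

```python
import re

# Hypothetical tweet, invented for illustration only.
sample = "$AAPL beats earnings expectations https://t.co/abc123"

# Same pattern as in the cell above: "http" followed by any run of non-whitespace.
cleaned = re.sub(r"http\S+", "", sample).strip()
print(cleaned)
```

The pattern removes everything from `http` up to the next whitespace character, and `.strip()` removes the space left behind at the end.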

Each tweet has a label:
- 0: Bearish (believing prices will drop)
- 1: Bullish (believing prices will rise)
- 2: Neutral

In [None]:
ds[0] # 'label' field

Splitting to training and evaluation sets:

In [None]:
split = ds.train_test_split(test_size=0.1)
training_set = split["train"]
eval_set = split["test"]

Now that the data is set up, we focus on extracting representations from a pre-trained language model. Here we choose BERT (specifically, "bert-base-uncased"). Our goal is to create a classifier that predicts the label of a given tweet.
The idea is to first extract a representation of each tweet from the pre-trained BERT model, and then use these representations as features, together with the given labels, to train an SVC (support vector classifier) that predicts a tweet's label from its BERT representation.

In [None]:
# Load the pre-trained BERT model
model = AutoModel.from_pretrained("bert-base-uncased")

# Remember, we need to tokenize the tweets before passing them to an LLM
# Here we load the appropriate tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Let's take one example from our training set and tokenize it:

In [None]:
example_tweet = training_set[0]["text"]
print(example_tweet)
# Tokenize the text and print the ID of each token:
tokenized_example = tokenizer(example_tweet)
print(tokenized_example["input_ids"])

Let's use the tokenizer to translate back each token ID to the vocabulary word:

In [None]:
print([tokenizer.decode(token_id) for token_id in tokenized_example["input_ids"]])
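
Any `##` prefixes in the decoded tokens come from WordPiece, the subword scheme used by BERT's tokenizer: a word missing from the vocabulary is split greedily into the longest known pieces, and every piece after the first is marked with `##`. The sketch below illustrates that greedy longest-match idea on a tiny, hypothetical vocabulary. The function name `wordpiece` and the vocabulary are assumptions made for illustration; this is not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (the real bert-base-uncased vocabulary has about 30,000 entries).
toy_vocab = {"stock", "##s", "sell", "##ing", "bull", "##ish"}

print(wordpiece("stocks", toy_vocab))
print(wordpiece("bullish", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```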

Now, we will set up a feature extraction pipeline that will take a text and generate the features for this text using the BERT model and tokenizer we loaded.

In [None]:
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

In [None]:
# Run the pipeline to obtain the representation of the example tweet:
example_result = feature_extractor(example_tweet, return_tensors = "pt")
print(example_result.shape)

For each tweet, the pipeline returns a feature tensor of shape [1, LENGTH, 768], where LENGTH is the number of tokens and 768 is the hidden dimension of this BERT model. We can summarize the features of each data point by mean pooling the contextual token representations into a fixed-size vector (768 dimensions):

In [None]:
#This takes the mean of the feature across all the tokens
example_result[0].numpy().mean(axis=0).shape
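
To make the pooling arithmetic concrete, here is a minimal, dependency-free sketch on a toy matrix of 3 tokens with 4 dimensions each (values invented for illustration). The hypothetical helper `mean_pool` does the same thing as `.mean(axis=0)` above: it averages each dimension across the tokens.

```python
# Toy token embeddings: 3 tokens, 4 dimensions each (BERT would give LENGTH x 768).
token_embeddings = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]

def mean_pool(rows):
    """Average each dimension across all tokens, giving one fixed-size vector."""
    n_tokens = len(rows)
    return [sum(column) / n_tokens for column in zip(*rows)]

pooled = mean_pool(token_embeddings)
print(pooled)  # [2.0, 2.0, 2.0, 2.0]
```

The output has one value per dimension regardless of how many tokens the input had, which is what lets tweets of different lengths share a single feature matrix.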

## 1.1

---

**Your Turn**

Using the example above, complete the function below, which extracts fixed-size contextual embeddings from a dataset using BERT. It will be applied to both the training set and the evaluation set.



In [None]:
# Extract mean-pooled BERT embeddings and labels for every example in a dataset

def extract_embeddings(dataset):
  # One 768-dimensional feature vector and one label per tweet
  X = np.zeros((dataset.num_rows, 768))
  y = np.zeros(dataset.num_rows)
  for i in tqdm(range(dataset.num_rows)):
    example = dataset[i]
    # Token-level features with shape [1, LENGTH, 768]
    example_result = feature_extractor(example["text"], return_tensors="pt")
    # Mean-pool across tokens to get a fixed-size vector
    X[i] = example_result[0].numpy().mean(axis=0)
    y[i] = example["label"]
  return X, y


---

We will then use the function you filled out to populate X and y datasets.

In [None]:
X_train, y_train = extract_embeddings(training_set)
X_eval, y_eval = extract_embeddings(eval_set)

## 1.2

Now we can use the extracted embeddings to train a classifier model. Here we choose the SVC classifier. Note: typically we would do hyper-parameter tuning for the classifier using a held-out validation set. For simplicity, just use the default hyper-parameters.

In [None]:
# Initialize the SVC model
svc_model = SVC()

---
**Your Turn**

Train the SVC on X_train and make predictions on both the training and evaluation sets.


In [None]:
# Fit the SVC
# TODO: Call the fit method of the svc_model on the training features and labels


# Evaluate the results
train_predictions = None  # TODO: Replace None with a call to the predict method of the svc_model on the training features
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Training accuracy is: {train_accuracy}")


eval_predictions = None  # TODO: Replace None with a call to the predict method of the svc_model on the eval features
eval_accuracy = accuracy_score(y_eval, eval_predictions)
print(f"Evaluation accuracy is: {eval_accuracy}")

---

# 2. In-context Zero-shot Learning

Next, we investigate in-context zero-shot learning using an instruction-tuned LLM. Specifically, we use [Flan T5](https://huggingface.co/google/flan-t5-base) as it is relatively small and does not require significant resources to run.

In-context zero-shot learning lets a model perform a task it was not specifically trained for, relying only on the instructions in the prompt and the general knowledge acquired during pre-training. For instance, even if the model has never seen the sentence "The movie was a rollercoaster of emotions," it can still judge the sentiment as positive by interpreting "rollercoaster" and "emotions" in the context of typical movie reviews. This allows the model to assess the sentiment of novel sentences without any labeled examples.

In this section, you will create prompts for in-context zero-shot learning to classify the sentiment of movie reviews.

In [None]:
# Import necessary libraries
from transformers import T5Tokenizer, T5ForConditionalGeneration

Loading the Flan T5 (base) model and tokenizer:

In [None]:
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

Let's first look at a simple zero-shot machine translation task:

In [None]:
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

We are going to classify the sentiment of movie reviews. Here is a simple prompting template for classifying movie reviews:

In [None]:
prompt_template = "\"{review_text}\". \nIs it good?"
print(prompt_template)

Here is how we can prompt the model using this prompt template and an example review:

In [None]:
example_review = "This movie was so enjoyable and I recommend it to everyone."

input_text = prompt_template.format(review_text=example_review)
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))

In [None]:
example_review = "This movie is very well made. Still, I did not enjoy it and cannot recommend it to anyone."

input_text = prompt_template.format(review_text=example_review)
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))

Now let's load the IMDB movie reviews dataset:

In [None]:
ds = load_dataset("imdb", split="train")

split = ds.train_test_split(test_size=0.002, seed=42)
training_set = split["train"]
eval_set = split["test"]

Each record includes a review text and a label: 0 for negative, 1 for positive. For example:

In [None]:
eval_set[0]

We will use this small evaluation set of 50 reviews to evaluate the model with the prompt above:

In [None]:
texts = eval_set["text"]
labels = eval_set["label"]

In [None]:
predictions = []
for text in tqdm(texts):
  input_text = prompt_template.format(review_text=text)
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids
  outputs = model.generate(input_ids, max_length=10)
  # Drop special tokens such as <pad> and </s> before inspecting the answer
  output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  # Treat any answer containing "no" as negative (0) and everything else as positive (1)
  if "no" in output_text.lower():
    predictions.append(0)
  else:
    predictions.append(1)

print("\npredictions:", predictions)
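
A plain substring test for "no" is fragile: it also matches words such as "not" or "know", and decoded T5 output can contain special tokens such as `<pad>` and `</s>` unless they are skipped during decoding. As a sketch of a stricter alternative, the hypothetical helper `parse_yes_no` below strips such tokens and looks only at the first word. The helper name and the choice of falling back to the positive class are assumptions for illustration, not part of the lab.

```python
import re

def parse_yes_no(output_text, default=1):
    """Map a generated answer to 0 (negative) or 1 (positive) using its first word."""
    # Remove special tokens such as <pad> and </s>.
    text = re.sub(r"<[^>]+>", " ", output_text)
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return default
    if words[0] == "no":
        return 0
    if words[0] == "yes":
        return 1
    return default  # assumption: unclear answers fall back to the default class

print(parse_yes_no("<pad> no</s>"))
print(parse_yes_no("<pad> yes</s>"))
print(parse_yes_no("<pad> I do not know</s>"))
```

The third call shows the difference: the answer contains "not" and "know", so a substring test would label it negative, while the first-word rule falls back to the default.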

Measure accuracy:

In [None]:
accuracy_score(labels, predictions)
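
To inspect the errors, it helps to know which reviews were misclassified. The sketch below uses a hypothetical helper, `misclassified`, on toy values that stand in for the notebook's `predictions` and `labels`. The toy numbers are invented for illustration.

```python
def misclassified(predictions, labels):
    """Return the positions where the prediction disagrees with the label."""
    return [i for i, (p, y) in enumerate(zip(predictions, labels)) if p != y]

# Hypothetical toy values standing in for the notebook's `predictions` and `labels`.
toy_predictions = [1, 0, 1, 1, 0]
toy_labels = [1, 0, 0, 1, 1]

wrong = misclassified(toy_predictions, toy_labels)
accuracy = (len(toy_labels) - len(wrong)) / len(toy_labels)
print(wrong)     # [2, 4]
print(accuracy)  # 0.6
```

In the notebook, printing `texts[i]` for each index returned by `misclassified(predictions, labels)` shows the reviews worth reading.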

---

**Your Turn**

Investigate the reviews for which the model's prediction does not match the provided label. What is the reason? Did the model make a mistake? (It is also possible that you agree with the model and the label itself seems incorrect; explain if this is the case.)

Answer: __________

While the performance using this prompt is better than random guessing, can you try other prompts to improve the accuracy? (For example, specifically ask whether the review is positive or negative.) Be as creative as you wish and report results using 2 different prompts.

In [None]:
#TODO: Try different prompts to classify the reviews

---

# 3. Few-Shot Prompting

Few-shot prompting improves the performance of large language models on complex tasks by including demonstrations of the task in the prompt. While zero-shot prompting has shown remarkable results, it often falls short on harder tasks; adding demonstrations (1-shot, 3-shot, 5-shot, and so on) is usually more effective.

We present some examples of few-shot prompting using GPT-Neo 1.3B, a 1.3-billion-parameter LLM trained by EleutherAI that replicates the GPT-3 architecture and is free to use. You can read more about this model [here](https://huggingface.co/EleutherAI/gpt-neo-1.3B).

In [None]:
# Import required libraries
from transformers import pipeline

In [None]:
# Load the model pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')

## Example 1: Translation
To demonstrate few-shot prompting, consider the task of translating "Thank you" into French. The expected answer is "Merci". First, we try prompting the model with direct zero-shot prompts:

In [None]:
input_text = """Translate "Thank you" to French"""
output = generator(input_text, max_new_tokens=10, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

In [None]:
input_text = """What is "thank you" in French?"""
output = generator(input_text, max_new_tokens=20, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

As we can see, these direct prompts do not produce the desired result. Let's give the model one example and try a one-shot prompt:


In [None]:
input_text = """
Example:
English: "Good morning."
French: "Bonjour."

Now you try:
English: "Thank you."
French:"""
output = generator(input_text, max_new_tokens=4, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)
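
The one-shot prompt above follows a regular pattern: each demonstration is an input line followed by an output line, and the final query leaves the output open. A minimal sketch of a builder that generalizes this to any number of demonstrations is shown below. The function name `build_few_shot_prompt` and the second demonstration are assumptions made for illustration.

```python
def build_few_shot_prompt(demonstrations, query, source="English", target="French"):
    """Format k demonstrations followed by an unanswered query."""
    blocks = []
    for src_text, tgt_text in demonstrations:
        blocks.append(f'{source}: "{src_text}"\n{target}: "{tgt_text}"')
    # The final block leaves the target side open for the model to complete.
    blocks.append(f'{source}: "{query}"\n{target}:')
    return "\n\n".join(blocks)

# Two demonstrations make this a 2-shot prompt.
demos = [("Good morning.", "Bonjour."), ("Good night.", "Bonne nuit.")]
prompt = build_few_shot_prompt(demos, "Thank you.")
print(prompt)
```

The resulting string can be passed to `generator` in place of `input_text`; changing the length of `demos` switches between 1-shot, 2-shot, and larger prompts.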

## Example 2: Named Entity Recognition
Here the task is to identify and label the named entities in the input sentence.

For example, Sentence: "Barack Obama was born in Hawaii."

Answer: Barack Obama is a person and Hawaii is a location.

In [None]:
input_text = """Task: Identify and label the named entities in the following sentence.
Barack Obama was born in Hawaii.
"""
output = generator(input_text, max_new_tokens=20, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

In [None]:
input_text = """Example:
Sentence: "Barack Obama was born in Hawaii."
Entities: [Barack Obama: PERSON, Hawaii: LOCATION]

Now you try:
Sentence: "Justin Trudeau is the prime minister of Canada."
Entities:"""
output = generator(input_text, max_new_tokens=11, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)
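
A benefit of showing the model a fixed output format such as `Entities: [name: LABEL, ...]` is that the completion can be parsed programmatically. The sketch below reads the text after the last `Entities:` marker and turns it into a dictionary. The helper name `parse_entities` and the sample output string are hypothetical; the real model output may differ or be cut off by the token limit.

```python
def parse_entities(generated_text):
    """Extract {entity: label} pairs from the text after the last 'Entities:' marker."""
    # Keep only what follows the last marker, i.e. the model's completion.
    _, marker, tail = generated_text.rpartition("Entities:")
    if not marker or not tail.strip():
        return {}
    first_line = tail.strip().splitlines()[0]
    # Drop the opening bracket and anything from the closing bracket onward.
    inner = first_line.strip().lstrip("[").split("]")[0]
    entities = {}
    for item in inner.split(","):
        name, colon, label = item.rpartition(":")
        if colon and name.strip() and label.strip():
            entities[name.strip()] = label.strip()
    return entities

# Hypothetical model output, written by hand for illustration.
sample_output = '''Sentence: "Barack Obama was born in Hawaii."
Entities: [Barack Obama: PERSON, Hawaii: LOCATION]

Sentence: "Justin Trudeau is the prime minister of Canada."
Entities: [Justin Trudeau: PERSON, Canada: LOCATION]'''

print(parse_entities(sample_output))
```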

## Example 3: Question Answering (Your Turn)

The task in this example is to test the comprehension abilities of the model: given an input passage, can the model correctly answer a question based on it?

For example, a passage could be: "Marie Curie was a physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize."
A suitable question based on the passage is: "What did Marie Curie win?"

Use a few-shot prompt to answer the question about each of the following passages:

1. Passage: "Thomas Edison was an American inventor who developed many devices including the phonograph and the electric light bulb."
   Question: "What did Thomas Edison develop?"

2. Passage: "Mount Everest is the highest mountain in the world, located in the Himalayas on the border between Nepal and China."
   Question: "Where is Mount Everest located?"


In [None]:
# TODO: Fill in your one-shot prompt for passage 1 in the input_text below
input_text = """

"""
output = generator(input_text, max_new_tokens=10, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

In [None]:
# TODO: Fill in your one-shot prompt for passage 2 in the input_text below
input_text = """

"""
output = generator(input_text, max_new_tokens=10, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

## Limitations of few-shot prompting

Consider asking a large language model to solve a math word problem, for example:
Sarah has 12 apples. She gives 3 apples to each of her 2 friends. How many apples does she have left?
We know the answer is 6. Can the LLM get it?


In [None]:
input_text = """Sarah has 12 apples. She gives 3 apples to each of her 2 friends.
How many apples does she have left? (output a number)
"""
output = generator(input_text, max_new_tokens=10, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

We don't expect the model to answer this question correctly, but what if we use one-shot prompting?

In [None]:
input_text = """Task: Solve the following math word problem step-by-step.

Example 1:
Problem: Sarah has 12 apples. She gives 3 apples to each of her 2 friends. How many apples does she have left?
Answer: 6

Now you try:
Example 2:
Problem: John has 5 packs of crayons. Each pack contains 8 crayons. He gives 15 crayons to his friends. How many crayons does he have now?
Answer:"""
output = generator(input_text, max_new_tokens=2, pad_token_id=generator.tokenizer.eos_token_id)[0]['generated_text']
print(output)

While the model is able to output an answer, it is not the correct one (the correct answer is 25). Few-shot prompting provides a model with a limited number of examples, which may not be sufficient for understanding the nuances of complex tasks. The model might struggle to generalize from these examples, especially when the tasks involve multiple steps or intricate logic.
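
For reference, both word problems require two steps, a multiplication followed by a subtraction, which the model has to carry out implicitly when it is shown only the final answer. A short sketch of the arithmetic, useful for checking the model's output:

```python
# Problem 1: Sarah starts with 12 apples and gives 3 apples to each of 2 friends.
sarah_left = 12 - 3 * 2

# Problem 2: John has 5 packs of 8 crayons and gives away 15 crayons.
john_left = 5 * 8 - 15

print(sarah_left)  # 6
print(john_left)   # 25
```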