
# Assignment 10: Prompt Engineering with LLaMA 3

### Objective
In this assignment, you will use a Large Language Model (LLM), specifically Meta's Llama 3.2 3B Instruct model, for sentiment classification. The model will classify movie reviews as either positive or negative and explain its decision. You will then examine how different prompt engineering strategies affect the model's predictions and explanations.

### What You’ll Learn
- How to prompt an LLM for classification tasks.
- How to interpret and evaluate LLM explanations.
- The effect of prompt design on prediction quality and reliability.

### Tools
- Dataset: [IMDb Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/)
- Model: `meta-llama/Llama-3.2-3B-Instruct` from Hugging Face
- Libraries: `transformers`, `datasets`, `torch`, `polars`, `scikit-learn`, `matplotlib`

In [None]:
# Install dependencies
!pip install -q transformers datasets accelerate bitsandbytes

In [None]:
# Log in to Hugging Face to access gated models like LLaMA. You can also store your token as a Colab secret.
from huggingface_hub import login

# This will prompt you to paste your token securely
login()

In [None]:
# Import necessary libraries
import re
import torch
import random
import polars as pl
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

### Question 1: Load and Explore the IMDb Dataset

You'll be using the IMDb dataset, which contains 50,000 movie reviews labeled as positive or negative.
1. Use the [`datasets`](https://huggingface.co/docs/datasets/create_dataset) library to load the IMDb dataset. Read the documentation to understand how to do this.
2. Extract the `train` and `test` splits.
3. Print:
   - The number of training and test samples.
   - Three random training samples with both the text and their sentiment labels.
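
The steps above could be sketched roughly as follows. This is a minimal sketch, not the expected solution: the helper names `load_imdb_splits`, `sample_examples`, and `describe` are hypothetical, and the label mapping (0 = negative, 1 = positive) assumes the `imdb` dataset as published on the Hugging Face Hub.

```python
import random


def load_imdb_splits():
    """Load the IMDb train and test splits (downloads the data on first call)."""
    # Imported lazily so the helpers below work without the library installed.
    from datasets import load_dataset

    dataset = load_dataset("imdb")
    return dataset["train"], dataset["test"]


def sample_examples(split, k=3, seed=0):
    """Return k distinct random rows from an indexable split of dict rows."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(split)), k)
    return [split[i] for i in indices]


def describe(row, max_chars=200):
    """Format one row as '[Sentiment] text', truncating long reviews."""
    # Assumption: label 1 means positive and label 0 means negative.
    sentiment = "Positive" if row["label"] == 1 else "Negative"
    return f"[{sentiment}] {row['text'][:max_chars]}"
```

In the notebook, you might call `train, test = load_imdb_splits()`, print `len(train)` and `len(test)`, and then print `describe(row)` for each row in `sample_examples(train)`.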

In [None]:
# Load the dataset







# Show three random examples








### Question 2: Define a Prompting Function for LLaMA

You'll write a function that sends a movie review to the LLaMA model and returns a sentiment classification along with an explanation.
1. Use the `meta-llama/Llama-3.2-3B-Instruct` model from HuggingFace.
2. Format the prompt like this:
   ```
   You are a helpful AI assistant.
   Given the following movie review, classify it as Positive or Negative and explain why.

   Review: "<REVIEW TEXT>"

   Sentiment:
   ```
3. Use Hugging Face’s `generate()` method with sampling enabled (`do_sample=True`) and decoding parameters like:
   - `temperature = 0.7`
   - `top_p = 0.95`
   - `max_new_tokens = 200`
4. Make sure to decode the result and return the LLM’s classification and explanation.
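
One possible shape for this function is sketched below, under stated assumptions. The names `build_prompt`, `parse_sentiment`, and `classify_review` are hypothetical. Taking the first occurrence of "positive" or "negative" in the output as the label is a simple heuristic, not a guaranteed-correct parser. `classify_review` expects a tokenizer and model that you have already loaded.

```python
import re

PROMPT_TEMPLATE = (
    "You are a helpful AI assistant.\n"
    "Given the following movie review, classify it as Positive or Negative "
    "and explain why.\n\n"
    'Review: "{review}"\n\n'
    "Sentiment:"
)


def build_prompt(review):
    """Fill the assignment's prompt template with one review."""
    return PROMPT_TEMPLATE.format(review=review.strip())


def parse_sentiment(completion):
    """Split model output into (label, explanation).

    Heuristic: the first 'positive' or 'negative' found is taken as the label;
    if neither appears, the label is 'Unknown'.
    """
    match = re.search(r"\b(positive|negative)\b", completion, flags=re.IGNORECASE)
    if match is None:
        return "Unknown", completion.strip()
    label = match.group(1).capitalize()
    # Drop punctuation and markdown markers that directly follow the label.
    explanation = completion[match.end():].lstrip(" \n\t.:-*").rstrip()
    return label, explanation


def classify_review(review, tokenizer, model):
    """Generate a completion for one review and parse it."""
    inputs = tokenizer(build_prompt(review), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,  # temperature and top_p only apply when sampling
        temperature=0.7,
        top_p=0.95,
        max_new_tokens=200,
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    completion = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return parse_sentiment(completion)
```

The tokenizer and model could be loaded with `AutoTokenizer.from_pretrained(...)` and `AutoModelForCausalLM.from_pretrained(...)` using the model name from step 1. Options such as `device_map="auto"` or a half-precision dtype are a choice that depends on your hardware.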

In [None]:
# Load tokenizer and model







# Function to get prediction from LLM








### Question 3: Run the LLaMA Model on Real Reviews

Now that your model function is ready, apply it to real data.
1. Randomly select 50 test reviews from the IMDb dataset.
2. For each review, run the model and store the predicted sentiment and explanation in a Polars DataFrame.
3. Print:
   - The number of positive and negative predicted sentiments.
   - The explanation returned by the model for the first three reviews in your sample.
4. Compare the 50 predictions with the true labels using a confusion matrix.
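
To illustrate what the counting and confusion-matrix steps compute, here is a small standard-library sketch. The helper names `count_predictions` and `confusion_counts` are hypothetical, and it assumes predictions and true labels are both strings such as `"Positive"` and `"Negative"`. The matrix uses rows for true labels and columns for predictions, which is the same layout that scikit-learn's `confusion_matrix` returns.

```python
from collections import Counter

LABELS = ["Negative", "Positive"]


def count_predictions(predictions):
    """Count how often each predicted sentiment occurs."""
    return Counter(predictions)


def confusion_counts(true_labels, predictions, labels=LABELS):
    """Return a matrix with rows = true labels and columns = predictions.

    Pairs containing a label outside `labels` (e.g. 'Unknown') are skipped.
    """
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(true_labels, predictions):
        if truth in index and pred in index:
            matrix[index[truth]][index[pred]] += 1
    return matrix
```

In the notebook itself, the already imported `confusion_matrix(y_true, y_pred, labels=...)` and `ConfusionMatrixDisplay(cm, display_labels=...).plot()` would produce the same counts together with a plot.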

In [None]:
# Sample 50 test reviews







# Run the model and collect predictions







# Convert results to Polars DataFrame







# Print number of each predicted sentiment







# Show explanations for the first 3 predictions







# Generate and display confusion matrix









### Question 4: Reflect on the Model’s Reasoning

Now that you've seen how the LLM classifies and explains sentiment, reflect on its behavior in cases where it gets the label wrong.
1. Identify an example from Question 3 where the model's prediction disagreed with the ground truth label.
2. For the example:
   - Reread the review and the LLM’s explanation.
   - Try to understand why the model may have made a mistake.
   - Reflect on the following questions:
     - Was the review actually ambiguous or tricky?
     - Did the model misinterpret sarcasm, slang, or mixed sentiment?
     - Was the explanation reasonable, even if the label was wrong?

**Written Answer:**


### Question 5: Compare Prompt Engineering Strategies

Prompt engineering plays a big role in how LLMs respond. In this question, you’ll try two different prompts and compare their effects on predictions and explanations.
1. Use Prompt A (simple) and Prompt B (instructional) below.
   - Prompt A:  
     `Given the following movie review, classify it as Positive or Negative and explain why.`
   - Prompt B:  
     `You are a helpful AI assistant trained in sentiment analysis. Your task is to determine whether a movie review is Positive or Negative, and clearly explain your reasoning.`
2. Apply both prompts to 3 reviews.
3. For each review:
   - Record the model's predicted sentiment and explanation for both prompts.
   - Note whether each prediction is correct or not.
4. Summarize your observations.
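
The comparison loop might be organized along these lines. This is a sketch under assumptions: `make_prompt` and `compare_prompts` are hypothetical helpers, and `classify_fn` stands in for whatever function you wrote to send a prompt to the model and return a `(label, explanation)` pair. The `Review:` / `Sentiment:` layout is carried over from the Question 2 template.

```python
PROMPT_A = (
    "Given the following movie review, classify it as Positive or Negative "
    "and explain why."
)
PROMPT_B = (
    "You are a helpful AI assistant trained in sentiment analysis. "
    "Your task is to determine whether a movie review is Positive or Negative, "
    "and clearly explain your reasoning."
)


def make_prompt(instruction, review):
    """Combine an instruction with a review, reusing the Question 2 layout."""
    return f'{instruction}\n\nReview: "{review}"\n\nSentiment:'


def compare_prompts(reviews, true_labels, classify_fn):
    """Run both prompts on every review and collect one result row per run.

    classify_fn is a hypothetical callable: prompt -> (label, explanation).
    """
    rows = []
    for review, truth in zip(reviews, true_labels):
        for name, instruction in (("A", PROMPT_A), ("B", PROMPT_B)):
            label, explanation = classify_fn(make_prompt(instruction, review))
            rows.append(
                {
                    "review": review,
                    "prompt": name,
                    "predicted": label,
                    "true": truth,
                    "correct": label == truth,
                    "explanation": explanation,
                }
            )
    return rows
```

The returned list of dictionaries can be passed to `pl.DataFrame(rows)` to view the two prompts side by side.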

In [None]:









    # Prompt A







    # Prompt B





