# Data Processing in Python + Intro to LLMs
In this notebook, you will start with raw, messy text data from the Yelp Reviews Dataset, clean it using Python, and then use a pre-trained Large Language Model from Hugging Face to perform an analysis task. You will then explore the tradeoffs between different LLMs.

## Part 1: Data Processing [15 points]
The first step in any NLP pipeline is to get and clean your data. Real-world text is messy, and preparing it properly is crucial for getting good results from any model. We will use [this dataset of unprocessed restaurant reviews](https://huggingface.co/datasets/shreyahavaldar/restaurant_reviews_unprocessed).

First, let's load the needed libraries.

In [None]:
!pip install datasets --quiet

import pandas as pd
from datasets import load_dataset

## 1.1 Loading a dataset from HuggingFace

Read the documentation below and load the training set of the restaurant reviews dataset using Hugging Face's `load_dataset` function. Save your data as a pandas DataFrame called `reviews_df`.

Documentation: [Loading Datasets from HuggingFace](https://huggingface.co/docs/datasets/loading)


In [None]:
# Load the dataset
"""TODO: Load the dataset"""

# Save as a pandas dataframe called reviews_df
reviews_df = pd.DataFrame()
reviews_df.head()

## 1.2: Data Cleaning

Write a Python function called `clean_text` that performs a series of cleaning operations. Then, apply this function to the `text` column of `reviews_df` to create a new column called `text_clean`.

Your `clean_text` function must perform the following steps in order:

1. Replace URLs: Find all URL patterns (starting with `https` or `www`) and replace each one with `[url]`.

2. Replace Emails & Phones: Find email addresses and common US phone number formats. Replace each email with `[email]` and each phone number with `[phone]`.

3. Remove HTML: Remove any simple HTML tags like `<br>` or `<div>`.

4. Normalize Whitespace: Replace multiple whitespace characters (spaces, tabs, newlines) with a single space and trim leading/trailing whitespace.
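
To make the mechanics of these steps concrete, here is a minimal sketch of ordered `re.sub`-style substitutions. It covers only the URL, HTML, and whitespace steps. The pattern names, the deliberately simple patterns, and the `demo_clean` helper are all illustrative assumptions, not the required solution; real reviews will likely need broader patterns.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")


def demo_clean(s: str) -> str:
    """Apply the substitutions in a fixed order and return the result."""
    # Replace URLs first, so later steps never see partial URLs.
    s = URL_PATTERN.sub("[url]", s)
    # Assumption: tags are swapped for a space so neighboring words do not merge.
    s = TAG_PATTERN.sub(" ", s)
    # Collapse runs of spaces, tabs, and newlines, then trim both ends.
    s = WHITESPACE_PATTERN.sub(" ", s).strip()
    return s


sample = "Great  food!<br>See https://example.com/menu\n\tfor details. "
print(demo_clean(sample))
```

Running whitespace normalization last matters: earlier substitutions can leave behind extra spaces, and the final pass cleans them up.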

In [None]:
import re

def clean_text(s: str) -> str:
    """
    TODO:
    1) replace URLs with [url]
    2) replace emails with [email] and phones with [phone]
    3) remove simple HTML tags
    4) normalize whitespace
    """

# Apply to reviews_df to create a column called text_clean

## Part 2: Generating Text with an LLM [45 points]

Now that we have clean data, we can start working with a real Language Model! First, let's install the needed libraries.

In [None]:
!pip install transformers torch sentencepiece accelerate --quiet

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

## 2.1: Loading a Local Language Model

We'll use the `Qwen2-1.5B-Instruct` model. It's powerful enough to follow instructions but small enough to run quickly in Colab.

**To speed things up, connect to a GPU**
1. Click on Runtime -> Change runtime type.
2. Under "Hardware accelerator", select T4 GPU.
3. Click Save. The runtime will restart, and you will be connected to a GPU.

First, load the `Qwen/Qwen2-1.5B-Instruct` model as an `AutoModelForCausalLM`. Set `device_map="auto"` to automatically use the GPU. Then, load the `Qwen/Qwen2-1.5B-Instruct` tokenizer as an `AutoTokenizer`.

Here is some documentation to help you:
- [Loading a model from HuggingFace](https://huggingface.co/docs/transformers/en/models)
- [Using a Tokenizer](https://huggingface.co/docs/transformers/en/fast_tokenizers)

In [None]:
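# Fail early with a clear message if the runtime has no GPU attached.
assert torch.cuda.is_available(), "No GPU detected: switch the runtime to a T4 GPU."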

"""
TODO:
1. Load the tokenizer, which prepares text for the model
2. Load the model. Set device_map="auto" to automatically use the GPU.
"""


## 2.2: Using a Tokenizer & Chat Template

Chat models are fine-tuned on structured conversations rather than on plain text alone. To get the best performance, we must format our prompts to match this structure. The `tokenizer.apply_chat_template()` method handles this for us automatically.

For the Qwen2 model, the template looks like this:

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    Hello, how are you?<|im_end|>
    <|im_start|>assistant

Key Components:

- Roles: Each part of the conversation is assigned a role (system, user, or assistant).
  - system: Sets the high-level instruction or persona for the model (e.g., "You are a helpful assistant").
  - user: Represents the instruction or question you are asking.
  - assistant: This is where the model's response begins. We end our prompt here to signal to the model that it's its turn to "speak."
- Special Tokens: The `<|im_start|>` and `<|im_end|>` tokens are special markers that the model uses to understand where each role's message begins and ends.

When we create a `messages` list in Python and pass it to `tokenizer.apply_chat_template()`, the tokenizer builds this formatted string for us. This is more reliable than assembling the string by hand, because the tokenizer ships with the template defined for that checkpoint.
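
To see what the template does to a `messages` list, the sketch below rebuilds the string by hand. `render_chatml` is a hypothetical helper written only for illustration, and it assumes the ChatML-style layout in which `<|im_end|>` directly follows each message's content. It is not the tokenizer's actual implementation.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Hand-rolled illustration of a ChatML-style prompt (assumed layout)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers, with the role on its own line.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Ending on an open assistant turn signals that the model should respond.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
prompt = render_chatml(messages)
print(prompt)
```

In the notebook itself you would not build the string manually. Passing the same `messages` list to `tokenizer.apply_chat_template()` with `add_generation_prompt=True` appends the open assistant turn for you.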

**Your Task:**
Write two Python functions called `tokenize_prompt` and `decode_tokens`.
1. `tokenize_prompt` takes a **list of chat messages** as input. The function should use the global `tokenizer` and return the resulting `input_ids` as a PyTorch tensor that has been moved to the correct device (e.g., the GPU).
2. `decode_tokens` takes a PyTorch tensor of `input_ids` as input and returns the decoded text.

Here is some documentation to help you:
- [Using a chat template](https://huggingface.co/docs/transformers/main/en/chat_templating#using-applychattemplate)

In [None]:
def tokenize_prompt(messages: list) -> torch.Tensor:
    """
    Applies the model's chat template to a list of messages and tokenizes it.

    Args:
      messages: A list of dictionaries, e.g., [{"role": "user", "content": "Hello!"}]

    TODO: implement this function to return a PyTorch tensor containing the formatted and tokenized input_ids. Ensure the tensor is on the same device as the model.
    """


def decode_tokens(token_ids: torch.Tensor) -> str:
    """
    TODO: implement this function to decode a tensor of token IDs back into a string.
    """


# --- Test your functions ---
test_messages = [
    {"role": "system", "content": "You are a friendly food critic."},
    {"role": "user", "content": "Tell me about the best restaurant in Philadelphia."}]
token_ids = tokenize_prompt(test_messages)

print(f"Input Messages: {test_messages}")
print(f"\nOutput Tensor (Token IDs):\n{token_ids}")
print(f"\nTensor is on device: {token_ids.device}")

decoded_text = decode_tokens(token_ids)
print(f"\nDecoded Text:\n'{decoded_text}'")


## 2.3: Generate Text & Classify the Review

**Your Task:**
Create a function `classify_review` that takes a review text as input. Inside the function, you will build a **chat prompt** (using a system and a user message) and use the Qwen2 model to classify the restaurant as either "**Recommend**" or "**Avoid**". Your function should return only the model's final, single-word classification.

Steps:
1. Create the chat messages inside the function. This should be a list containing two dictionaries:
    - A system message that sets the model's persona.
    - A user message that contains clear, direct instructions along with the `review_text` itself.
2. Tokenize the prompt by passing your list of messages to the `tokenize_prompt` function you wrote in the previous step.
3. Generate a response from the model using the `model.generate()` method.
4. Decode the output using the `decode_tokens` function you wrote to get the final string. Remember, the `model.generate()` method returns the entire prompt plus the new tokens!
5. Return only the final, one-word classification (e.g., "Recommend").

Here is some documentation to help you:
- [Using a chat template](https://huggingface.co/docs/transformers/main/en/chat_templating#using-applychattemplate)
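
Steps 4 and 5 come down to two small operations: dropping the prompt tokens from the generated sequence and reducing the remaining text to a single label. The sketch below shows both on plain Python lists and strings so it runs without a model. The token IDs are made up, and `strip_prompt`, `normalize_label`, and the `"Unknown"` fallback are hypothetical names chosen for illustration.

```python
import re


def strip_prompt(output_ids: list[int], prompt_length: int) -> list[int]:
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


def normalize_label(generated_text: str) -> str:
    """Map free-form model output onto one of the two expected labels."""
    words = generated_text.strip().split()
    if not words:
        return "Unknown"  # hypothetical fallback for empty output
    # Look only at the first word, ignoring case and punctuation.
    first_word = re.sub(r"[^a-z]", "", words[0].lower())
    if first_word.startswith("recommend"):
        return "Recommend"
    if first_word.startswith("avoid"):
        return "Avoid"
    return "Unknown"  # hypothetical fallback for unexpected output


prompt_ids = [101, 7, 42, 9]         # made-up IDs standing in for the prompt
output_ids = prompt_ids + [55, 13]   # generation echoes the prompt, then adds tokens
new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
print(normalize_label(" Recommend.\n"))
```

With a real PyTorch tensor the same idea applies: slice along the sequence dimension using the prompt's length, then decode only the new tokens.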

In [None]:
def classify_review(review_text: str) -> str:
    """
    TODO: Use the Qwen2 model to classify a review as "Recommend" or "Avoid".
    """



# --- Classify the first 3 cleaned reviews ---
for i in range(3):
    review = reviews_df['text_clean'].iloc[i]
    classification = classify_review(review)
    print(f"REVIEW {i+1}: {review}")
    print(f"Classification: {classification}\n")

## 2.4: Generating a News Headline with Temperature

Temperature has little effect on a simple classification task, but it is one of the main parameters controlling how varied (or "creative") generated text is. Next, you will create a function that generates a news headline for a restaurant, allowing you to see how temperature influences the model's output.

**Your Task:**
Create a function `generate_news_headline` that accepts a `review_text` and a `temperature`. You will use the Qwen2 model to generate a catchy news headline based on the review!

Steps:
1. Create the chat messages inside the function.
    - A system message that sets the model's persona.
    - A user message that contains clear, direct instructions along with the `review_text` itself.
2. Tokenize the prompt using your `tokenize_prompt` function.
3. Generate a response using `model.generate()` and include the `temperature` parameter. You must also include `do_sample=True` in this call, as temperature has no effect without it.
4. Decode, process, and return the final headline string.
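
To build intuition for what the `temperature` parameter does during sampling, here is a minimal sketch of temperature-scaled softmax over a toy set of logits. The logits and the helper name `softmax_with_temperature` are illustrative assumptions. The model applies the same idea over its whole vocabulary at every generation step.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, so the sampled output is more predictable. Higher temperatures flatten the distribution, so less likely tokens are sampled more often.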

In [None]:
def generate_news_headline(review_text: str, temperature: float) -> str:
    """
    TODO: Generate a catchy news headline based on a review, using a specific
    temperature to control creativity.
    """


## Part 3: Choosing the Right LLM [20 points]

In Part 2 we interacted with a single checkpoint. Here we extend that code to compare multiple Hugging Face models on the same downstream task and explore which one suits it best.



### 3.1 Build an evaluation framework [10 points]

We will evaluate on five short reviews. Implement helper functions that:

1. Load a tokenizer + model given a HuggingFace model name.
2. Use the loaded tokenizer + model to classify a cleaned review as either "Recommend" or "Avoid"

💡 *Hint:* Copy/paste your helpers from Part 2. The only new code you need is a thin wrapper that swaps in whatever tokenizer/model `evaluate_model` passes you. 

We've already implemented the `evaluate_model` helper that loops over the eval set, times each prediction, and builds a `DataFrame` for evaluation metrics.



In [None]:
import time

EVAL_REVIEWS = [
    {"review_id": "r1", "text_clean": "The tacos were incredible—handmade tortillas, bright salsa, and attentive staff.", "label": "Recommend"},
    {"review_id": "r2", "text_clean": "We waited almost two hours, the pasta arrived cold, and no one apologized.", "label": "Avoid"},
    {"review_id": "r3", "text_clean": "Light, fluffy pancakes with real maple syrup made this brunch worth the trip.", "label": "Recommend"},
    {"review_id": "r4", "text_clean": "Greasy tables, rude service, and a stomach ache afterward—never again.", "label": "Avoid"},
    {"review_id": "r5", "text_clean": "Cozy ramen bar with rich broth, fresh noodles, and fair prices.", "label": "Recommend"},
]


def load_model_components(model_name: str):
    """TODO: Load and return a tokenizer + model for the provided checkpoint."""


def classify_with_model(model, tokenizer, review_text: str) -> str:
    """TODO: Format a prompt, run generation, and return either 'Recommend' or 'Avoid'."""


def evaluate_model(model_name: str, max_samples: int = len(EVAL_REVIEWS)) -> pd.DataFrame:
    """Runs the eval set through the given model and returns a DataFrame of metrics."""
    tokenizer, model = load_model_components(model_name)
    rows = []
    for sample in EVAL_REVIEWS[:max_samples]:
        start = time.perf_counter()
        prediction = classify_with_model(model, tokenizer, sample["text_clean"])
        latency = time.perf_counter() - start
        rows.append({
            "review_id": sample["review_id"],
            "model": model_name,
            "prediction": prediction,
            "gold_label": sample["label"],
            "correct": prediction == sample["label"],
            "latency_sec": latency,
        })
    return pd.DataFrame(rows)


### 3.2 Compare three chat-tuned models [10 points]

Now, let's compare the three different chat/instruction-tuned checkpoints listed below. For each model:

1. Run `evaluate_model(model_name)`.
2. Display the resulting DataFrame.
3. Print accuracy and average latency.
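
As a sketch of step 3, the snippet below computes accuracy and average latency from rows shaped like the ones `evaluate_model` builds. It uses plain dictionaries so it runs on its own. The `summarize` helper and the example values are made up for illustration and are not real model results.

```python
def summarize(rows: list[dict]) -> dict:
    """Compute accuracy and mean latency from per-review result rows."""
    n = len(rows)
    # True counts as 1 and False as 0, so the sum is the number of correct rows.
    accuracy = sum(row["correct"] for row in rows) / n
    avg_latency = sum(row["latency_sec"] for row in rows) / n
    return {"accuracy": accuracy, "avg_latency_sec": avg_latency}


# Made-up rows for illustration only; these are not measured results.
example_rows = [
    {"correct": True, "latency_sec": 0.5},
    {"correct": False, "latency_sec": 1.0},
    {"correct": True, "latency_sec": 1.5},
    {"correct": True, "latency_sec": 1.0},
]
summary = summarize(example_rows)
print(f"Accuracy: {summary['accuracy']:.2f}")
print(f"Average latency: {summary['avg_latency_sec']:.2f} s")
```

On the DataFrame returned by `evaluate_model`, the same numbers should be available by taking the mean of the `correct` and `latency_sec` columns.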


In [None]:
candidate_models = [
    "Qwen/Qwen2-1.5B-Instruct",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "Qwen/Qwen1.5-0.5B-Chat",
]

for model_name in candidate_models:
    """TODO: Call evaluate_model for this checkpoint and display the results"""