[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/10.llms/Prompting%20Local%20LLMs.ipynb)

In this notebook, we'll explore few-shot learning with [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B); this model can fit within the memory and processing constraints of a T4 GPU on Google Colab while also being openly available.

We will then use quantization to fit a larger model ([Qwen3-14B]()) on the T4 GPU by storing the model weights in 4 bits instead of the full 16 bits.

Can you create a new classification task and design prompts to differentiate between the classes within it?  

In [None]:
from textwrap import dedent

In [None]:
import torch
from torch.nn import functional as F

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# check that the GPU is available
torch.cuda.is_available()

## Qwen3-4B


In [None]:
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="cuda", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

In [None]:
def classify_with_prompt(labels, shots, target_x, thinking=False):
    system_prompt = dedent(f"""
        You're a helpful assistant for text classification. You'll be given an input text and need to output a single choice from the following set of categories:
        {', '.join(labels)}
        Pick one of those labels and do not generate any other text.
    """).strip()
    
    # build the conversation: the system prompt, one user/assistant turn per shot,
    # and finally the target text as the last user turn (works for any number of shots)
    messages = [{"role": "system", "content": system_prompt}]
    for shot in shots:
        messages.append({"role": "user", "content": shot["X"]})
        messages.append({"role": "assistant", "content": shot["y"]})
    messages.append({"role": "user", "content": target_x})
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking # Switches between thinking and non-thinking modes. Default is True.
    )
    
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # conduct text completion
    generated = model.generate(
        **model_inputs,
        max_new_tokens=32768
    )

    # generate() returns the prompt tokens followed by the newly generated tokens, so:
    #   generated[0]                       takes the only sequence in the batch (our batch size is 1)
    #   [len(model_inputs.input_ids[0]):]  skips past our original input
    output_ids = generated[0][len(model_inputs.input_ids[0]):].tolist()

    # decode the token IDs back into text
    print(tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n"))
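
The system prompt asks for a single label, but a model can still add punctuation, change capitalization, or (in thinking mode) emit its reasoning first. Below is a minimal sketch of a post-processing step that maps raw output text onto the label set. `parse_label` is a hypothetical helper, not part of the notebook or of `transformers`, and it assumes that any reasoning text ends with a `</think>` marker in the decoded string.

```python
def parse_label(generated_text, labels):
    """Map raw model output to one of `labels`, or None if nothing matches."""
    # Assumption: in thinking mode the reasoning ends with "</think>";
    # keep only the text that follows the last such marker.
    answer = generated_text.split("</think>")[-1]
    # Normalize: trim whitespace and surrounding punctuation, ignore case.
    cleaned = answer.strip().strip(".!\"'").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    # The output was not exactly one of the allowed labels.
    return None


labels = ["positive", "negative"]
print(parse_label("Positive.", labels))
print(parse_label("I think it's good", labels))
```

Returning `None` for anything off-label makes it easy to spot cases where the model ignored the instruction rather than silently counting them as a class.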

In [None]:
shots = [
    {"X":"I love this movie", "y": "positive"},
    {"X":"I hate this movie", "y": "negative"},
    {"X":"I kind of like the movie", "y": "positive"}
]

target_x = "This is one of the best movies I've ever seen"

classify_with_prompt(["positive", "negative"], shots, target_x)

In [None]:
shots = [
    {"X":"Vampires take over the planet during an eclipse", "y": "Horror"},
    {"X":"A court sentences George to be Jerry's butler", "y": "Comedy"},
    {"X":"John turns into a werewolf during a full moon", "y": "Horror"}
]

target_x = "John is a werewolf who plays basketball"

classify_with_prompt(["Horror", "Comedy"], shots, target_x)

In [None]:
shots = [
    {"X":"This is a text", "y": "English"},
    {"X":"Nel mezzo del cammin' di nostra vita", "y": "Italian"},
    {"X":"Je ne sais pas", "y": "French"},
]

target_x = "Siempre imaginé que el Paraíso sería algún tipo de biblioteca"

classify_with_prompt(["English", "Italian", "French", "Spanish", "Japanese"], shots, target_x)

Construct a new classification task; try to find one that the 4B model fails on.

In [None]:
shots = [
    # FILL ME IN
]

target_x = ""

classify_with_prompt([
    # FILL ME IN
], shots, target_x)
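
When hunting for a task the model fails on, it can help to check several labeled targets at once instead of one `target_x` at a time. The sketch below is one possible harness, under two assumptions: `find_failures` is a hypothetical helper, and the classifier passed to it returns its label as a string (`classify_with_prompt` as written prints its answer, so it would need a variant that returns it). A keyword-based stand-in classifier is used here so the example runs without a GPU.

```python
def find_failures(classify_fn, examples):
    """Return (text, gold, predicted) for every example classify_fn gets wrong."""
    failures = []
    for text, gold in examples:
        predicted = classify_fn(text)
        if predicted != gold:
            failures.append((text, gold, predicted))
    return failures


# Stand-in classifier (hypothetical); swap in a function that prompts the
# model and returns its label.
def keyword_classifier(text):
    return "negative" if "hate" in text.lower() else "positive"


examples = [
    ("I love this movie", "positive"),
    ("I hate this movie", "negative"),
    ("Not exactly a masterpiece", "negative"),
]

failures = find_failures(keyword_classifier, examples)
accuracy = 1 - len(failures) / len(examples)
print(f"accuracy: {accuracy:.2f}")
for text, gold, predicted in failures:
    print(f"{text!r}: expected {gold}, got {predicted}")
```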

## Qwen3-14B with Quantization

Now let's try a bigger model. A general rule of thumb is to multiply the model size (in billions of parameters) by 4 to estimate how many gigabytes of GPU memory you will need for inference. For example, without quantization, a 14-billion-parameter model would require roughly 56 GB of memory for inference.
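
To see where quantization helps, here is a rough back-of-the-envelope calculation. The factor of 4 in the rule of thumb matches 4 bytes (32 bits) per parameter; the sketch below applies the same arithmetic at other bit widths. `weight_memory_gb` is a hypothetical helper, and it counts the weights only: activations, the KV cache, and quantization metadata add to these figures, so treat them as lower bounds.

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (n_params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return n_params_billion * bytes_per_param


for bits in (32, 16, 4):
    print(f"14B parameters at {bits}-bit: ~{weight_memory_gb(14, bits):.0f} GB")
```

By this estimate, only the 4-bit weights come in under the 16 GB of memory on a T4.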

In [None]:
# first, delete the previous model to free up memory
import gc

del model
del tokenizer
gc.collect()  # collect unreachable objects so their GPU memory can be released
torch.cuda.empty_cache()

In [None]:
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",
    device_map="cuda",
    dtype="auto",
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")

Rerun the prompting tasks from above. Are any of the outputs different?