In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
)

device = "cuda"  # the device the tokenized inputs are moved to (the model itself is placed by device_map="auto")

# Configure 4-bit quantization (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load the weights in 4-bit precision (for 8-bit, use load_in_8bit=True instead)
    bnb_4bit_use_double_quant=True,  # Optional: also quantize the quantization constants to save more memory
    bnb_4bit_quant_type="nf4",  # Optional: 'nf4' (NormalFloat4) or 'fp4'; 'nf4' is usually the more accurate choice
    bnb_4bit_compute_dtype=torch.float16  # Optional: dtype used for computation; float16 is faster than the float32 default
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config  # Pass the quantization configuration
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

## Classification with Large Language Models

So far, we have looked at using LLMs for tasks such as *open-ended question answering* and *machine translation*, both of which are what we call **generative tasks**. In simplified terms, this means that we expect the model to generate outputs that may differ greatly depending on the input context we provide. The answer to *"Give me a short introduction to large language models."* will be structurally and semantically very different from *the German translation of an input sentence*.

Another type of task we can tackle with large language models is the **discriminative (classification) task**.  
Given an input text, the goal in a classification task is to assign it one\* of a small set of possible labels. One example of a classification task, which we will look at here, is **sentiment classification**.

\* For simplicity, we are only dealing with single-label classification.

In sentiment classification, the goal is to identify the emotional polarity expressed in a text. Typically, a text is labeled as having **positive**, **negative**, or **neutral** sentiment.

Our approach to the discriminative task is very similar to the previous (generative) tasks and reuses much of the code from the notebook *llm-machine-translation*. **Importantly, however, the answer we expect from the model is significantly less free-form than before**. To this end, we are going to specify the valid answer options for the LLM in its prompt, asking it to respond with either "Positive", "Negative", or "Neutral".

In [None]:
prompt = ("Classify the sentiment of the following text as 'Positive', 'Negative', or 'Neutral'.\n\n"
          "Today is a good day.")
messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Constraining the output structure through the prompt already works quite well, guiding the model to answer with one of the valid answers.
By providing constraints in the prompt, we are attempting to limit the possible next words generated by the LLM to only "Positive", "Negative", and "Neutral". In technical terms, we are trying to achieve probability zero for all other words.

In practice, however, other words retain a small but nonzero probability, so the LLM may occasionally answer with something other than the three labels, such as a different word or an added explanation.
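Because free-form generation can drift away from the three labels, it may help to validate each response before using it. The following is a minimal sketch of such a check. The helper name `normalize_label` and its matching rule (a case-insensitive exact match after stripping whitespace, quotes, and trailing punctuation) are assumptions made for illustration and are not part of the original notebook.

```python
VALID_LABELS = ("Positive", "Negative", "Neutral")


def normalize_label(response):
    """Map a raw model response to one of the valid labels.

    Returns None when the response is not exactly one of the labels
    (ignoring case, surrounding whitespace, quotes, and punctuation).
    """
    cleaned = response.strip().strip("'\".!").lower()
    for label in VALID_LABELS:
        if cleaned == label.lower():
            return label
    return None


# Hypothetical responses, as they might come back from the model
for raw in [" positive.\n", "'Neutral'", "The sentiment is positive"]:
    print(repr(raw), "->", normalize_label(raw))
```

Responses that map to `None` can then be counted as errors or sent back to the model for another attempt.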

A more robust way to constrain the set of possible answers is by using **guided decoding**. In simple terms, the idea of guided decoding is to set the probability of all invalid (unspecified) answers to 0. This means that only the valid answers may be generated by the LLM.
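To make the idea concrete, here is a toy sketch of what happens at a single decoding step. The vocabulary and the scores in `logits` are invented for illustration, and whole words stand in for tokens. A real model works on subword tokens, so a label may span several steps. Setting the score of every invalid word to negative infinity before the softmax gives it probability exactly 0, and the remaining probability mass is redistributed over the valid labels.

```python
import math


def softmax(scores):
    """Turn a dict of raw scores into a dict of probabilities."""
    highest = max(scores.values())
    exps = {tok: math.exp(score - highest) for tok, score in scores.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def mask_invalid(scores, valid):
    """Replace the score of every invalid token with -inf."""
    return {tok: (score if tok in valid else float("-inf"))
            for tok, score in scores.items()}


# Hypothetical next-word scores (not taken from a real model)
logits = {"Positive": 2.0, "Negative": 0.5, "Neutral": 1.0, "Happy": 1.5, "The": 0.8}
valid = {"Positive", "Negative", "Neutral"}

unconstrained = softmax(logits)
constrained = softmax(mask_invalid(logits, valid))

for tok in logits:
    print(f"{tok:>8}  unconstrained={unconstrained[tok]:.3f}  constrained={constrained[tok]:.3f}")
```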

In [None]:
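# Restrict decoding to the valid labels with `prefix_allowed_tokens_fn`,
# which returns the token ids that may be generated at each step.
choices = ["Positive", "Negative", "Neutral"]
# Token ids of each valid answer, terminated by the end-of-sequence token
choice_ids = [
    tokenizer.encode(choice, add_special_tokens=False) + [tokenizer.eos_token_id]
    for choice in choices
]
prompt_length = model_inputs.input_ids.shape[1]

def allowed_tokens(batch_id, input_ids):
    # Tokens generated so far (everything after the prompt)
    generated = input_ids[prompt_length:].tolist()
    # Allow only tokens that continue one of the valid answers
    allowed = {
        ids[len(generated)] for ids in choice_ids
        if len(ids) > len(generated) and ids[:len(generated)] == generated
    }
    return list(allowed) or [tokenizer.eos_token_id]

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=max(len(ids) for ids in choice_ids),
    do_sample=True,
    temperature=0.6,
    prefix_allowed_tokens_fn=allowed_tokens,
)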
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

## Exercises
1. How well does the model work using the simple prompt? Gather a small sample of positive, negative, and neutral texts. Classify the texts with an LLM, and compare the predicted answers to the correct ones.
2. Try to improve the classification accuracy by modifying the prompt. Two things you can try are:  
- including a more detailed definition of the sentiment classification task;
- including demonstrations of input texts and their sentiment.
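
As a possible starting point for both exercises, the sketch below shows one way to build a prompt with demonstrations and to score predictions against gold labels. The helper names `build_prompt` and `accuracy`, the `Text:` / `Sentiment:` layout, and the example texts are assumptions made for illustration. The predictions are typed in by hand here and would come from the model in practice.

```python
def build_prompt(text, demonstrations=()):
    """Build a classification prompt, optionally with (text, label) demonstrations."""
    lines = [
        "Classify the sentiment of the following text as "
        "'Positive', 'Negative', or 'Neutral'.",
        "",
    ]
    for demo_text, demo_label in demonstrations:
        lines += [f"Text: {demo_text}", f"Sentiment: {demo_label}", ""]
    lines += [f"Text: {text}", "Sentiment:"]
    return "\n".join(lines)


def accuracy(predictions, gold_labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(pred == gold for pred, gold in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Hypothetical demonstrations and hand-written predictions
demos = [("I love this.", "Positive"), ("The package arrived on Monday.", "Neutral")]
few_shot_prompt = build_prompt("Today is a good day.", demos)
print(few_shot_prompt)

gold = ["Positive", "Negative", "Positive", "Positive"]
predicted = ["Positive", "Negative", "Neutral", "Positive"]
score = accuracy(predicted, gold)
print(f"accuracy = {score:.2f}")
```

The string returned by `build_prompt` can be used as the `content` of the user message in the classification cell above.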