# Understanding Large Language Models - Lab 1: Setting Up Your Ecosystem

## Introduction
### This notebook will guide you through setting up your environment for the course.
### We will install the necessary dependencies, download a pre-trained T5 model, and run a simple text-to-text prediction.

In [9]:
!pip install "transformers[torch]" torch sentencepiece datasets pandas

Note: you may need to restart the kernel to use updated packages.


In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

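def generate_text(input_text, max_length=50):
    # Tokenize the prompt into a tensor of token IDs with shape [1, sequence_length]
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    print(input_ids)

    # Generate the output token IDs, capped at max_length tokens
    output_ids = model.generate(input_ids, max_length=max_length)
    print(output_ids)

    # Convert the generated IDs back to text, dropping special tokens such as </s>
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)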


example_input = "translate English to French: How are you?"
output_text = generate_text(example_input)

print("Input:", example_input)
print("Output:", output_text)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


tensor([[13959,  1566,    12,  2379,    10,   571,    33,    25,    58,     1]])
tensor([[   0, 5257,    3, 6738,   18, 3249,   58,    1]])
Input: translate English to French: How are you?
Output: Comment êtes-vous?
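
The two tensors printed above are the token IDs going into and coming out of the model. As a rough illustration of what `skip_special_tokens=True` does during decoding, the sketch below copies those IDs into plain Python lists and drops T5's padding ID (0, which also serves as the decoder start token) and end-of-sequence ID (1). The helper `strip_special` is hypothetical, and the real tokenizer skips other special tokens as well, so treat this as a simplification rather than the library's implementation.

```python
# Token IDs copied from the printed tensors above
input_ids = [13959, 1566, 12, 2379, 10, 571, 33, 25, 58, 1]
output_ids = [0, 5257, 3, 6738, 18, 3249, 58, 1]

PAD_ID = 0  # T5 pad token, also used as the decoder start token
EOS_ID = 1  # T5 end-of-sequence token </s>

def strip_special(ids, special=(PAD_ID, EOS_ID)):
    """Hypothetical helper: drop special-token IDs before decoding."""
    return [token_id for token_id in ids if token_id not in special]

content_ids = strip_special(output_ids)
print(content_ids)
```

Only the remaining IDs contribute characters to the decoded string; the leading 0 and trailing 1 carry no text.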


In [8]:
### Try at least 5 other example inputs
### Example 1: "How are you?"
### Example 2: "What is your name?"
### Example 3: "Where is the nearest restaurant?"
### Example 4: "I love learning new things."
### Example 5: "This is a beautiful day."

# T5 and the Prefix + Input Structure

T5 (Text-to-Text Transfer Transformer) treats every NLP task as text in, text out. It was trained on inputs in a **prefix + input** format, where the prefix tells the model which task to perform.

## Why Prefixes Matter
T5 was trained using structured prompts like:
- `translate English to French: How are you?` → `Comment allez-vous?`
- `summarize: The Eiffel Tower is in Paris.` → `Eiffel Tower is in Paris.`
- `question: Who discovered gravity? context: Isaac Newton discovered gravity.` → `Isaac Newton`
- `sst2 sentence: I love this movie!` → `positive`

## Without a Prefix?
❌ `How are you?` → (Unpredictable output)  
✅ `translate English to French: How are you?` → `Comment allez-vous?`
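
To make the prefix + input pattern concrete, here is a minimal sketch that builds prefixed prompts for the five practice sentences listed earlier in this lab. The helper `with_prefix` is a hypothetical name introduced for illustration; it only formats strings and does not call the model.

```python
def with_prefix(prefix, text):
    """Hypothetical helper: join a task prefix and raw input as 'prefix: input'."""
    return f"{prefix.strip().rstrip(':')}: {text.strip()}"

sentences = [
    "How are you?",
    "What is your name?",
    "Where is the nearest restaurant?",
    "I love learning new things.",
    "This is a beautiful day.",
]

prompts = [with_prefix("translate English to French", s) for s in sentences]

for prompt in prompts:
    print(prompt)
```

Each resulting string can be passed to `generate_text` from the first code cell.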

## Custom Prefixes
You can also fine-tune T5 to respond to prefixes of your own, for example:
- `explain: What is...`
- `medical diagnosis: Patient has high fever...`
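
The fine-tuning cell in this lab reads `explain_dataset.csv` and expects `Input` and `Response` columns. The file itself is not shown, so the sketch below only illustrates what rows in that layout might look like. The rows are hypothetical examples, and the CSV is written to an in-memory buffer so that no real file is overwritten.

```python
import csv
import io

# Hypothetical example rows; the actual contents of explain_dataset.csv are not shown.
rows = [
    {
        "Input": "explain: What is machine learning?",
        "Response": "Machine learning is a subset of AI that enables systems "
                    "to learn from data and improve without explicit programming.",
    },
    {
        "Input": "explain: What is a token?",
        "Response": "A token is a unit of text, such as a word or word piece, "
                    "that a model processes.",
    },
]

def rows_to_csv(rows):
    """Serialize rows with the Input/Response header used by the fine-tuning cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Input", "Response"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```

Every `Input` value carries the custom `explain:` prefix, which is what lets the fine-tuned model associate that prefix with the new task.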

In [6]:
import torch
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import Dataset

csv_filename = "explain_dataset.csv"

df = pd.read_csv(csv_filename)
dataset = Dataset.from_pandas(df)

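def preprocess_function(examples):
    inputs = examples["Input"]
    targets = examples["Response"]

    # Tokenize the prompts, padding or truncating every example to 64 tokens
    model_inputs = tokenizer(inputs, max_length=64, truncation=True, padding="max_length")

    # Tokenize the target responses the same way
    labels = tokenizer(targets, max_length=64, truncation=True, padding="max_length").input_ids

    # Replace padding IDs with -100 so the loss ignores padded label positions
    labels = [
        [token_id if token_id != tokenizer.pad_token_id else -100 for token_id in seq]
        for seq in labels
    ]

    model_inputs["labels"] = labels
    return model_inputs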

tokenized_dataset = dataset.map(preprocess_function, batched=True)

dataset_split = tokenized_dataset.train_test_split(test_size=0.2)

train_dataset = dataset_split["train"]
eval_dataset = dataset_split["test"]

training_args = TrainingArguments(
    output_dir="./t5-fine-tuned",
    evaluation_strategy="epoch",  # renamed to eval_strategy in recent transformers releases
    learning_rate=3e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    save_strategy="epoch",
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

# Save the fine-tuned model
model.save_pretrained("./t5-custom-response")
tokenizer.save_pretrained("./t5-custom-response")

print("Fine-tuning complete! Model saved to ./t5-custom-response")

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.00429         |
| 2     | No log        | 0.000478        |
| 3     | 0.245200      | 0.000297        |
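
One likely reason the first two epochs show "No log" for training loss is that `Trainer` only records training loss every `logging_steps` optimizer steps. The arithmetic below is a sketch under assumptions that the lab does not state explicitly: a 1000-row dataset, a single device with no gradient accumulation, and `logging_steps` left at 500.

```python
import math

# Assumptions (not stated in the lab): 1000 rows in the CSV, one device,
# no gradient accumulation, and logging_steps left at 500.
num_rows = 1000
test_size = 0.2
per_device_train_batch_size = 4
num_train_epochs = 3
logging_steps = 500

# Rows left for training after the 80/20 split
train_rows = num_rows - int(num_rows * test_size)

# Optimizer steps per epoch and in total
steps_per_epoch = math.ceil(train_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which a training loss would be recorded, and the epoch of the first one
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))
first_logged_epoch = math.ceil(logged_steps[0] / steps_per_epoch)

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("logged at steps:", logged_steps)
print("first training loss appears in epoch:", first_logged_epoch)
```

Under these assumptions the first logged training loss falls in epoch 3, which matches the table. Setting a smaller `logging_steps` in `TrainingArguments` would surface a training loss in earlier epochs.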


Fine-tuning complete! Model saved to ./t5-custom-response


In [7]:
original_model = T5ForConditionalGeneration.from_pretrained("t5-small")
fine_tuned_model = T5ForConditionalGeneration.from_pretrained("./t5-custom-response")

def generate_response(model, input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

test_question = "explain: What is machine learning?"

original_output = generate_response(original_model, test_question)
fine_tuned_output = generate_response(fine_tuned_model, test_question)

print("Original Model Output:")
print(original_output)
print("\n\n\nFine-Tuned Model Output:")
print(fine_tuned_output)

Original Model Output:
Warum ist die Frage, wie machmach learning?

Fine-Tuned Model Output:
Machine learning is a subset of AI that enables systems to learn from data and improve without explicit programming.
