In [13]:
!pip install -q transformers peft datasets accelerate bitsandbytes hf_xet

!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

!pip install -q evaluate

!pip install -q sentence-transformers


[notice] A new release of pip available: 22.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Looking in indexes: https://download.pytorch.org/whl/cu121





In [14]:
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")

# from google.colab import userdata
# from huggingface_hub import login

# login(userdata.get('huggingface'))


CUDA available: True
GPU: NVIDIA GeForce RTX 2060


# Datasets

Importing, splitting, and preprocessing the dataset. Tokenization follows in the Training Model section.

In [15]:
from datasets import load_dataset, DatasetDict

# Load the prompt-enhancer dataset from the Hugging Face Hub
dataset = load_dataset("gokaygokay/prompt-enhancer-dataset")

# Keep the original 'test' split as the final test set
test_dataset = dataset["test"]

# Split the original 'train' split into new train and validation sets
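# Hold out 10% of the original train data for validation.
# The fixed seed is an addition for reproducibility; the recorded run did not set one.
train_validation_split = dataset["train"].train_test_split(test_size=0.1, seed=42)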

# Create a new DatasetDict with the splits
dataset_splits = DatasetDict({
    'train': train_validation_split['train'],
    'validation': train_validation_split['test'], # The 'test' from the split is our new validation set
    'test': test_dataset                     # The original 'test' is our final test set
})

# Display the new dataset structure
print(dataset_splits)

DatasetDict({
    train: Dataset({
        features: ['short_prompt', 'long_prompt'],
        num_rows: 14499
    })
    validation: Dataset({
        features: ['short_prompt', 'long_prompt'],
        num_rows: 1611
    })
    test: Dataset({
        features: ['short_prompt', 'long_prompt'],
        num_rows: 1790
    })
})


In [16]:
# Subsample each split: the full dataset is too big to run on Colab with limited compute
# and takes too long to run locally on my GPU

train_subset_size = 100
validation_subset_size = 10
test_subset_size = 10

dataset_splits['train'] = dataset_splits['train'].select(range(train_subset_size))
dataset_splits['validation'] = dataset_splits['validation'].select(range(validation_subset_size))
dataset_splits['test'] = dataset_splits['test'].select(range(test_subset_size))
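
`select(range(n))` keeps the first n rows of each split. The train and validation splits were already shuffled by `train_test_split`, but the test split keeps its original order, so its subset is simply the first 10 rows. A minimal sketch of drawing a seeded random subset instead is shown here; `sample_indices` and the seed value are hypothetical, chosen only for illustration, and the resulting index list could be passed to `Dataset.select`.

```python
import random

def sample_indices(total_rows, subset_size, seed=0):
    """Return a reproducible, sorted list of distinct row indices."""
    if subset_size > total_rows:
        raise ValueError("subset_size cannot exceed total_rows")
    rng = random.Random(seed)  # local generator, global random state is untouched
    return sorted(rng.sample(range(total_rows), subset_size))

# Row count taken from the split printed earlier; the seed is an arbitrary assumption.
test_indices = sample_indices(total_rows=1790, subset_size=10, seed=0)
print(len(test_indices))
# Possible usage: dataset_splits["test"].select(test_indices)
```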

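Prompt engineering: a fixed instruction is prepended to each short prompt so that the instruction-tuned model is already steered toward the task "out of the box", which should reduce how much it has to learn from fine-tuning.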

In [17]:
def preprocess(example):
    instruction = "Expand this into a detailed, cinematic prompt:\n"
    return {
        "input_text": instruction + example["short_prompt"],
        "output_text": example["long_prompt"]
    }

processed_dataset_splits = dataset_splits.map(preprocess)

Map: 100%|██████████| 100/100 [00:00<00:00, 1919.73 examples/s]
Map: 100%|██████████| 10/10 [00:00<00:00, 750.83 examples/s]


# Loading model

This code block sets up the pre-trained language model (`meta-llama/Llama-3.2-3B-Instruct`) and its tokenizer. It loads the model with 4-bit quantized weights, the base for QLoRA fine-tuning, to reduce memory usage and make fine-tuning feasible on less powerful hardware. The tokenizer is loaded as well, with the EOS token reused as the padding token so that sequences can be padded to a common length.

In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the pad token; important for batching

# 4-bit quantization via bitsandbytes; replaces the deprecated load_in_4bit=True argument
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,  # 4-bit base weights for QLoRA
    device_map="auto"  # places the model on the available GPU, so no .to("cuda") is needed
)

model  # display the architecture

Loading checkpoint shards: 100%|██████████| 2/2 [02:57<00:00, 88.66s/it] 


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((3072,)
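
To make the memory saving concrete, the following is a rough back-of-the-envelope sketch based only on the layer shapes in the printout above. It counts the weights held in `Linear4bit` layers and compares 4-bit storage with 16-bit storage. It is an estimate under stated assumptions: it ignores the quantization constants bitsandbytes stores alongside the weights, the unquantized embedding and norm layers, activations, and optimizer state. The helper names are hypothetical.

```python
# (in_features, out_features) of each Linear4bit layer in one decoder layer,
# copied from the printed architecture.
LINEAR_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # "(0-27): 28 x LlamaDecoderLayer"

def quantized_weight_count(shapes=LINEAR_SHAPES, num_layers=NUM_LAYERS):
    """Number of weights stored in quantized linear layers across all decoder layers."""
    per_layer = sum(n_in * n_out for n_in, n_out in shapes.values())
    return per_layer * num_layers

def to_gib(num_bytes):
    return num_bytes / 2**30

n_weights = quantized_weight_count()
print(n_weights)                                        # 2818572288
print(f"4-bit:  {to_gib(n_weights // 2):.2f} GiB")      # half a byte per weight
print(f"16-bit: {to_gib(n_weights * 2):.2f} GiB")       # two bytes per weight
```

Under these assumptions the quantized linear weights take roughly a quarter of the space they would need in 16-bit precision.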

Configuring the LoRA hyperparameters for fine-tuning

In [7]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # standard for LLaMA
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
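
As a sanity check on what this configuration trains, here is a small sketch that counts the LoRA parameters implied by `r=16` and the two target modules, using the `q_proj` and `v_proj` shapes from the architecture printout. It assumes the standard LoRA layout of one `r x in_features` matrix and one `out_features x r` matrix per adapted layer; the function name is hypothetical. The result could be cross-checked against PEFT's `model.print_trainable_parameters()`.

```python
def lora_param_count(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) to one linear layer."""
    return r * (in_features + out_features)

R = 16           # rank from lora_config
NUM_LAYERS = 28  # decoder layers in the printed architecture
TARGETS = {
    "q_proj": (3072, 3072),  # (in_features, out_features)
    "v_proj": (3072, 1024),
}

per_layer = sum(lora_param_count(n_in, n_out, R) for n_in, n_out in TARGETS.values())
trainable = per_layer * NUM_LAYERS
print(per_layer)   # 163840
print(trainable)   # 4587520
```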

# Training Model

This code block defines a function to tokenize the dataset, converting the text prompts into numerical token sequences that the model can process. It also prepares the data with appropriate padding and truncation, and sets up the labels for training.

In [8]:
def tokenize(batch):
    inputs = tokenizer(batch["input_text"], truncation=True, padding="max_length", max_length=512)
    outputs = tokenizer(batch["output_text"], truncation=True, padding="max_length", max_length=512)
    inputs["labels"] = outputs["input_ids"]
    return inputs

# Apply the tokenize function to the processed dataset splits
tokenized_dataset_splits = processed_dataset_splits.map(tokenize, batched=True, remove_columns=processed_dataset_splits["train"].column_names)

# Display the structure of the tokenized dataset splits
print(tokenized_dataset_splits)


Map: 100%|██████████| 100/100 [00:00<00:00, 1121.13 examples/s]
Map: 100%|██████████| 10/10 [00:00<00:00, 312.42 examples/s]


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10
    })
})
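
Note that `tokenize` above pads the prompt and the target separately and uses the target's token IDs as labels, so the labels are not a continuation of the input sequence. A common alternative for causal language models is to concatenate prompt and response into one sequence and set the labels for prompt and padding positions to `-100`, which the loss ignores. The sketch below illustrates that layout on plain lists of token IDs. It is not the method used for the results in this notebook, and `build_causal_lm_example` and the toy IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def build_causal_lm_example(prompt_ids, response_ids, eos_id, pad_id, max_length):
    """Join prompt and response into one sequence and mask non-response labels."""
    input_ids = (list(prompt_ids) + list(response_ids) + [eos_id])[:max_length]
    # Only the response tokens and the closing EOS contribute to the loss.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id])[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad to a fixed length; padding is excluded from attention and loss.
    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad
    attention_mask += [0] * pad
    labels += [IGNORE_INDEX] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy token IDs, chosen only for illustration.
example = build_causal_lm_example([11, 12, 13], [21, 22], eos_id=2, pad_id=0, max_length=8)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 2, 0, 0]
print(example["labels"])     # [-100, -100, -100, 21, 22, 2, -100, -100]
```

In a real pipeline the ID lists would come from the tokenizer, for example by tokenizing `input_text` and `output_text` without padding before combining them.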


In [9]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-prompt-enhancer",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs"
)

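This code block defines the metric function passed to the Trainer. It converts the model's logits to token IDs with an argmax, decodes them and the reference prompts to text, and computes the average cosine similarity between their embeddings using a pre-trained sentence embedding model. Because the predictions come from the evaluation forward pass rather than from `model.generate`, the score reflects per-position next-token predictions, not free-running generations.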

In [10]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import torch

# Load a pre-trained sentence embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', device="cpu")

def compute_metrics_vector(eval_preds):
    # Hugging Face Trainer passes an EvalPrediction object
    preds, labels = eval_preds.predictions, eval_preds.label_ids

    # Convert logits to token IDs if predictions are not already token IDs
    if preds.ndim == 3:  # shape: (batch, seq_len, vocab_size)
        preds = np.argmax(preds, axis=-1)

    # Replace ignored index (-100) with the pad_token_id
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    metrics_batch_size = 32  # for memory management
    total_similarity = 0
    num_samples = len(preds)

    for i in range(0, num_samples, metrics_batch_size):
        pred_batch = preds[i : i + metrics_batch_size]
        label_batch = labels[i : i + metrics_batch_size]

        # Decode tokens to strings
        decoded_preds = tokenizer.batch_decode(pred_batch, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(label_batch, skip_special_tokens=True)

        # Generate embeddings
        pred_embeddings = embedding_model.encode(decoded_preds, convert_to_tensor=True)
        label_embeddings = embedding_model.encode(decoded_labels, convert_to_tensor=True)

        # Compute cosine similarity (diagonal = matching pairs)
        cosine_scores = util.cos_sim(pred_embeddings, label_embeddings)
        total_similarity += torch.sum(torch.diag(cosine_scores))

        # Free memory
        del pred_embeddings, label_embeddings, cosine_scores
        torch.cuda.empty_cache()

    avg_similarity = (total_similarity / num_samples).item()
    return {"cosine_similarity": avg_similarity}
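
The diagonal of the `cos_sim` matrix holds the similarity of each prediction with its own reference, so the metric is the mean of paired cosine similarities. A dependency-free sketch of that calculation on toy vectors follows; the function names and vectors are hypothetical and stand in for the sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_paired_cosine(pred_vectors, ref_vectors):
    """Average similarity of each prediction with the reference at the same index."""
    if len(pred_vectors) != len(ref_vectors):
        raise ValueError("predictions and references must have the same length")
    scores = [cosine(p, r) for p, r in zip(pred_vectors, ref_vectors)]
    return sum(scores) / len(scores)

# Toy embeddings: the first pair points the same way, the second pair is orthogonal.
preds = [[1.0, 0.0], [0.0, 1.0]]
refs = [[2.0, 0.0], [3.0, 0.0]]
print(mean_paired_cosine(preds, refs))  # 0.5
```

Computing only the matching pairs avoids building the full n x n similarity matrix when just its diagonal is needed.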


In [12]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_splits["train"],
    eval_dataset=tokenized_dataset_splits["validation"],
    compute_metrics=compute_metrics_vector # Pass the compute_metrics_vector function here
)

trainer.train()  # Long-running: about 2.6 hours for this subset on the RTX 2060 (see train_runtime in the output)



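| Epoch | Training Loss | Validation Loss | Cosine Similarity |
|-------|---------------|-----------------|-------------------|
| 1     | No log        | 1.406978        | 0.301668          |
| 2     | No log        | 1.357004        | 0.132412          |
| 3     | No log        | 1.338057        | 0.208756          |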


TrainOutput(global_step=150, training_loss=2.224530029296875, metrics={'train_runtime': 9515.9008, 'train_samples_per_second': 0.032, 'train_steps_per_second': 0.016, 'total_flos': 2601985454899200.0, 'train_loss': 2.224530029296875, 'epoch': 3.0})
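
The `global_step=150` in the output follows from the training arguments: 100 training examples at a batch size of 1 give 100 batches per epoch, gradient accumulation over 2 batches gives 50 optimizer steps per epoch, and 3 epochs give 150 steps. The sketch below reproduces that arithmetic for a single device; it is a simplified approximation of how the Trainer counts steps, and the function name is hypothetical. If the default `logging_steps` of 500 was in effect, a 150-step run never reaches a logging step, which would be consistent with the "No log" entries in the Training Loss column.

```python
def optimizer_steps(num_examples, batch_size, grad_accum_steps, epochs):
    """Approximate number of optimizer updates for a single-device run."""
    batches_per_epoch = -(-num_examples // batch_size)  # ceiling division
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs

steps = optimizer_steps(num_examples=100, batch_size=1, grad_accum_steps=2, epochs=3)
print(steps)  # 150

# Average wall-clock time per optimizer step, from the reported train_runtime.
seconds_per_step = 9515.9008 / steps
print(round(seconds_per_step, 1))  # 63.4
```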

# Evaluation

Evaluating the fine-tuned model on the held-out test subset (10 examples), reporting the loss and the embedding cosine similarity.

In [None]:
trainer.evaluate(eval_dataset=tokenized_dataset_splits["test"])

{'eval_loss': 1.8596646785736084,
 'eval_cosine_similarity': 0.20774349570274353,
 'eval_runtime': 109.1436,
 'eval_samples_per_second': 0.092,
 'eval_steps_per_second': 0.092,
 'epoch': 3.0}

# Testing

This code block demonstrates how to use the fine-tuned model to generate an enhanced prompt for a given input text. It tokenizes the input, uses the `model.generate` method to produce an output sequence of tokens, and then decodes and prints the generated text.

In [13]:
input_text = "Expand this into a detailed, cinematic prompt:\nLion"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# do_sample=True makes the sampling settings (temperature, top_p) explicit
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.9, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Expand this into a detailed, cinematic prompt:
Lion large image left green the a with to down and that the and into bottom the line is and through and and a the. the left the, and down. The of bottom with a left and down. the a a, top in of a and is and through the. the the and through a of down and of that the and to. is of that and through bottom a the the through and that, a the the down, the and black with. down the a and the of and. the top a the left the bottom and. are a of left the the a.


# Evaluation of Non-Finetuned Model

Evaluate the base model's performance on the test dataset using the same metric.

In [11]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, Trainer

# Load the original, non-fine-tuned model with the same 4-bit quantization
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # replaces deprecated load_in_4bit=True
    device_map="auto"  # already places the model on the GPU, so no .to("cuda") is needed
)

# Wrap the base model in freshly initialized LoRA adapters. Trainer refuses purely
# quantized models, and new adapters start as a no-op (LoRA's B matrices are
# zero-initialized), so this still measures the base model's behavior.
base_model = get_peft_model(base_model, lora_config)

# Create a Trainer for evaluation only (same training args, but without training)
base_trainer = Trainer(
    model=base_model,
    args=training_args, # Reuse the same training arguments for consistency
    eval_dataset=tokenized_dataset_splits["test"],
    compute_metrics=compute_metrics_vector
)

# Evaluate the base model
base_trainer.evaluate(eval_dataset=tokenized_dataset_splits["test"])

Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00,  4.87s/it]


{'eval_loss': 9.697511672973633,
 'eval_model_preparation_time': 0.0159,
 'eval_cosine_similarity': 0.27906620502471924,
 'eval_runtime': 91.6682,
 'eval_samples_per_second': 0.109,
 'eval_steps_per_second': 0.109}

In [19]:
input_text = "Expand this into a detailed, cinematic prompt:\nLion"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# do_sample=True makes the sampling settings (temperature, top_p) explicit
outputs = base_model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.9, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Expand this into a detailed, cinematic prompt:
Lion's Pride: The Lion's Den

In a scorching savannah, a lioness named Kibo stands tall, her golden coat glistening in the blistering sun. She surveys her kingdom, her piercing eyes scanning the horizon for any sign of danger. But amidst the peace and tranquility, a sense of unease stirs within her.

A stranger has appeared on the outskirts of the pride, a tawny creature with a scar above his left eyebrow. Kibo's instincts scream at her to attack, to protect her family and her land from this unknown interloper. Yet, as she watches the stranger, she


# Discussion

On the 10-example test subset, the fine-tuned model reached a cosine similarity of 0.208, compared with 0.279 for the base model, even though its evaluation loss was much lower (1.86 versus 9.70). By the similarity metric, fine-tuning did not help here. One reason could be the small size of the dataset, which was intentionally reduced to 100 training examples to save training time on limited compute. With only 10 test examples, the gap between the two scores should also be read with caution. Another possible factor is the label construction in `tokenize`: the prompt and the target are tokenized and padded separately, so the labels are not a continuation of the input sequence and the padding positions count toward the loss.

The sample generations point the same way. The fine-tuned model's output is significantly worse than the base model's: a stream of common words that do not form sentences. This may be a sign of overfitting to a small dataset.

Overall, this project demonstrates the process of fine-tuning a large language model with LoRA adapters on a 4-bit quantized base (QLoRA), but also highlights the challenges of achieving meaningful improvements with limited data and compute resources.