# Adaptation Prompt Tuning using Llama-2-7b
* Notebook by Adam Lang
* Date: 12/25/2024

# Overview

* In this notebook, we demonstrate how to perform **Adaptation Prompt Tuning** (exposed in PEFT as `AdaptionPromptConfig`) to generate short text ads using llama-2-7b.
* **In addition, at the end of the notebook I show 4 example outputs in English, Spanish, and French, which demonstrate the multilingual capabilities of Llama models.**

Load the required libraries and the config parameters

In [1]:
import os
os.environ["WANDB_PROJECT"]="prompt_learning_methods"
from enum import Enum
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, set_seed
from peft import get_peft_config, get_peft_model, AdaptionPromptConfig, TaskType, PeftType
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

## set seed for reproducibility
seed = 42
set_seed(seed)

## set device
device = "cuda"

## load hugging face model checkpoints for llama-2-7b
model_name_or_path = "meta-llama/Llama-2-7b-hf"
tokenizer_name_or_path = "meta-llama/Llama-2-7b-hf"

## Dataset Preparation

### Load the dataset
* Dataset from hugging face: https://huggingface.co/datasets/jaykin01/advertisement-copy

In [16]:
from datasets import load_dataset

## HF dataset checkpoint
dataset_name = "jaykin01/advertisement-copy"

## system prompt for the task we are fine-tuning on
## (the tokenizer and chat template are set up once, in steps 2 and 3 below)
system_prompt = """Create a text ad given the following product and description."""


# 1. Define ChatmlSpecialTokens -- we are going to be adding special tokens
class ChatmlSpecialTokens(str, Enum):
    user = "<|im_start|>user"
    assistant = "<|im_start|>assistant"
    system = "<|im_start|>system"
    eos_token = "<|im_end|>"
    bos_token = "<s>"
    pad_token = "<pad>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]

# 2. Initialize tokenizer with special tokens
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    pad_token=ChatmlSpecialTokens.pad_token.value,
    bos_token=ChatmlSpecialTokens.bos_token.value,
    eos_token=ChatmlSpecialTokens.eos_token.value,
    additional_special_tokens=ChatmlSpecialTokens.list(),
    trust_remote_code=True
)

# 3. Set up chat template
template = """{% for message in messages %}\n{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% if loop.last and add_generation_prompt %}{{'<|im_start|>assistant\n' }}{% endif %}{% endfor %}"""
tokenizer.chat_template = template

# 4. Create data collator
response_template = "<|im_start|>assistant\n"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# 5. Define preprocess function
def preprocess(samples):
    batch = []
    for product, desc, ad_copy in zip(samples["product"], samples["description"], samples["ad"]):
        conversation = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Product: {product}\nDescription: {desc}"},
            {"role": "assistant", "content": f"Ad: {ad_copy}"},
        ]
        batch.append(conversation)
    
    # Apply the chat template and tokenize
    texts = tokenizer.apply_chat_template(batch, tokenize=False)
    tokenized_inputs = tokenizer(
        texts, 
        truncation=True, 
        padding="max_length", 
        max_length=512, 
        return_tensors="pt"
    )

    return tokenized_inputs

# 6. Load and preprocess the dataset
dataset = load_dataset(dataset_name)
dataset = dataset.map(
    preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names
)
## view first index of train dataset
dataset["train"][0]

Map:   0%|          | 0/1141 [00:00<?, ? examples/s]

{'input_ids': [1, 32004, 29871, 13, 4391, 263, 1426, 594, 2183, 278, 1494, 3234, 322, 6139, 29889,
  32000, 29871, 13, 32002, 29871, 13, 7566, 29901, 29871, 379, 598, 29885, 282, 1934, 13,
  9868, 29901, 29871, 319, 3114, 310, 282, 1934, 411, 263, 13700, 274, 5450, 305, 29892,
  23819, 29899, 29888, 5367, 21152, 29892, 322, 263, 22229, 11324, 391, 4980, 363, 263, 5412,
  29892, 1045, 8008, 713, 1106, 29889, 32000, 29871, 13, 32003, 29871, 13, 3253, 29901, 8565,
  957, 379, 598, 29885, 349, 1934, 29991, 853, 1387, 29892, 15877, 1674, 1045, 8008, 713,
  325, 747, 267, 411, 263, 13700, 274, 5450, 305, 669, 23819, 21152, 29889, 422, 29888,
  29891, 28103, 521, 293, 448, 11858, 403, 596, 281, 538, 307, 915, 29889, 28873, 10961,
  448, 18296, 1286, 29991, 32000,
  ...

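Summary
* The output above shows the token IDs of the first training sample. IDs of 32000 and above are the added ChatML special tokens, since the base Llama 2 vocabulary has 32,000 entries.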
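To make the prompt format concrete, here is a minimal pure-Python sketch of what the chat template from step 3 renders. It assumes that Transformers compiles chat templates with block trimming, so the newline right after `{% for ... %}` is dropped; the token IDs above are consistent with that, since only one newline token (13) sits between turns. `render_chatml` and the example product are hypothetical and used only for illustration.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Pure-Python mirror of the notebook's Jinja chat template (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
        )
    if messages and add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "Create a text ad given the following product and description."},
    # Hypothetical product, not taken from the dataset.
    {"role": "user", "content": "Product: Example product\nDescription: Example description."},
]

prompt = render_chatml(conversation, add_generation_prompt=True)
print(prompt)
```

The trailing `<|im_start|>assistant\n` is the same string used as `response_template` in step 4, which is how the collator finds where the assistant response begins.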

In [17]:
## Create train/test split (hold out 10% of the rows for evaluation)
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1026
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 115
    })
})

## Create the PEFT model

### Adaptation Prompt Tuning config
* `adapter_len=32` gives each adapted attention layer 32 learnable soft prompt tokens. It plays the same role as `num_virtual_tokens` in prompt tuning.
* `adapter_layers=30` inserts adapters into 30 of the model's 32 decoder (attention) layers, counted from the top of the stack.
* The task type is `CAUSAL_LM`, i.e. text generation.

In [18]:
## create Adaption Prompt config
peft_config = AdaptionPromptConfig(adapter_len=32, ## 32 soft prompt tokens
                                   adapter_layers=30, 
                                   task_type=TaskType.CAUSAL_LM)

In [19]:
# creating model
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
## resize token embeddings to account for special tokens added
model.resize_token_embeddings(len(tokenizer), mean_resizing=False)

## init model and PEFT config
model = get_peft_model(model, peft_config)

## print number of trainable parameters 
model.print_trainable_parameters()

## to reduce GPU memory use, cast the frozen (non-trainable) params to fp16 (half precision)
## only the trainable adapter params stay in full precision
for p in model.parameters():
    if not p.requires_grad:
        p.data = p.to(torch.float16)



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

trainable params: 3,932,190 || all params: 6,742,388,766 || trainable%: 0.0583


Summary:
* We can see that only 0.0583% of the model's parameters are trainable during fine-tuning.
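The numbers printed by `print_trainable_parameters()` can be reproduced by hand. This is a minimal sketch, assuming each adapted layer holds one `adapter_len x hidden_size` prompt plus a single scalar gate, and using the public Llama-2-7b dimensions (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32000). The count of 5 added tokens is my inference from the special-token enum, since `<s>` already exists in the base vocabulary.

```python
# Llama-2-7b architecture constants (public model config).
hidden_size = 4096
intermediate_size = 11008
num_layers = 32
base_vocab = 32000

# Values from this notebook.
adapter_len = 32
adapter_layers = 30
added_tokens = 5  # assumption: <|im_end|>, <pad>, and the three <|im_start|>{role} tokens


def llama_param_count(vocab_size):
    """Parameter count of a Llama-style decoder with untied input/output embeddings."""
    embed = vocab_size * hidden_size
    lm_head = vocab_size * hidden_size
    attention = 4 * hidden_size * hidden_size      # q, k, v, o projections
    mlp = 3 * hidden_size * intermediate_size      # gate, up, down projections
    norms = 2 * hidden_size                        # two RMSNorm weights per layer
    final_norm = hidden_size
    return embed + lm_head + num_layers * (attention + mlp + norms) + final_norm


# One soft prompt plus one gate per adapted layer.
trainable = adapter_layers * (adapter_len * hidden_size + 1)
total = llama_param_count(base_vocab + added_tokens) + trainable

print(f"trainable params: {trainable:,} || all params: {total:,} || "
      f"trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions the arithmetic matches the printed output exactly, which also confirms that only the soft prompts and gates are trained while the resized embeddings stay frozen.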

## Training

In [20]:
output_dir = "llama_adcopy" ## create output directory
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
gradient_accumulation_steps = 1 #not using gradient accumulation
logging_steps = 5
learning_rate = 5e-4
max_grad_norm = 1.0
max_steps = 250
num_train_epochs=10
warmup_ratio = 0.1
lr_scheduler_type = "cosine"
max_seq_length = 512

## create training arguments
config = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_strategy="epoch",
    eval_strategy="epoch",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    weight_decay=0.1,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    fp16=True,
    report_to=["tensorboard", "wandb"],
    hub_private_repo=True,
    push_to_hub=True, #push to hf hub
    num_train_epochs=num_train_epochs,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
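As a sanity check on the schedule, the settings above can be turned into step counts. This is a rough sketch, assuming a single GPU and no dropped last batch. Note that `max_steps = 250` is defined in the cell but never passed to `SFTConfig`, so the run length is governed by `num_train_epochs`.

```python
import math

train_rows = 1026                  # from the train/test split output
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10
warmup_ratio = 0.1
num_devices = 1                    # assumption: single GPU

# Batches the dataloader yields per epoch (the last partial batch is kept).
batches_per_epoch = math.ceil(train_rows / (per_device_train_batch_size * num_devices))
# With accumulation of 1, every batch is one optimizer step.
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = round(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```

So the learning rate warms up for roughly the first epoch and then follows the cosine decay for the remaining nine.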

In [21]:
## create SFTTrainer instance
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    #packing=False, ## using pad tokens instead of packing multiple samples together
    #dataset_text_field="content", #text field we are using
    #max_seq_length=max_seq_length,
    data_collator=collator,
    args=config,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
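The `collator` passed to the trainer restricts the loss to the assistant response. The idea can be sketched in plain Python: every label up to and including the response template is set to -100, which the cross-entropy loss ignores, and padding is ignored as well. This is a simplified single-turn illustration, not the TRL implementation. `mask_prompt_tokens` is a hypothetical helper, and the template IDs `[32003, 29871, 13]` and pad ID `32001` are my inference from the tokenized sample shown earlier.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default


def mask_prompt_tokens(input_ids, response_template_ids, pad_token_id):
    """Return labels in which only the tokens after the response template are scored."""
    labels = list(input_ids)
    n = len(response_template_ids)
    # Locate the first occurrence of the response template in the sequence.
    start = next(
        (i for i in range(len(input_ids) - n + 1)
         if input_ids[i:i + n] == response_template_ids),
        None,
    )
    if start is None:
        # No template found: the whole example is ignored.
        return [IGNORE_INDEX] * len(input_ids)
    # Mask the prompt and the template itself.
    for i in range(start + n):
        labels[i] = IGNORE_INDEX
    # Padding never contributes to the loss either.
    return [IGNORE_INDEX if tok == pad_token_id else lab
            for tok, lab in zip(input_ids, labels)]


# Shortened toy sequence: system turn, assistant template, reply, <|im_end|>, padding.
toy_ids = [1, 32004, 29871, 13, 4391, 32000, 29871, 13,
           32003, 29871, 13, 3253, 29901, 32000, 32001, 32001]
labels = mask_prompt_tokens(toy_ids, [32003, 29871, 13], pad_token_id=32001)
print(labels)
```

Only the reply tokens and the closing `<|im_end|>` keep their labels, so the model is trained to write the ad and then stop.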


In [22]:
trainer.train()
trainer.save_model()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33madam-m-lang[0m. Use [1m`wandb login --relogin`[0m to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.5195 | 1.426548 |
| 2 | 0.8863 | 0.912984 |
| 3 | 0.8006 | 0.818507 |
| 4 | 0.6856 | 0.783184 |
| 5 | 0.7386 | 0.761889 |
| 6 | 0.621 | 0.750131 |
| 7 | 0.6238 | 0.742738 |
| 8 | 0.5892 | 0.74076 |
| 9 | 0.5734 | 0.738982 |
| 10 | 0.7311 | 0.739009 |





# Weights and Biases Loss Tracking
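* Training and validation loss were logged to Weights and Biases under the `prompt_learning_methods` project.
* The validation loss reported per epoch above flattens out at about 0.739 from epoch 9 onward.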
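To put the loss values in more intuitive terms, the validation loss can be converted to perplexity. This is a small sketch, assuming the reported loss is the mean cross-entropy in nats over the scored (assistant) tokens; the values are copied from the per-epoch results above.

```python
import math

# Validation loss per epoch, copied from the training output.
val_loss = {
    1: 1.426548, 2: 0.912984, 3: 0.818507, 4: 0.783184, 5: 0.761889,
    6: 0.750131, 7: 0.742738, 8: 0.74076, 9: 0.738982, 10: 0.739009,
}

# Epoch with the lowest validation loss.
best_epoch = min(val_loss, key=val_loss.get)
# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(val_loss[best_epoch])

print(f"best epoch: {best_epoch}, validation perplexity: {perplexity:.2f}")
```

The difference between epochs 9 and 10 is tiny, so either checkpoint should behave about the same.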

In [23]:
!nvidia-smi

Thu Dec 26 01:50:24 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A6000               On  |   00000000:2A:00.0 Off |                  Off |
| 59%   73C    P2            101W /  300W |   20642MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Loading the trained model and getting the predictions of the trained model

In [24]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch

## load fine tuned model from personal HF repo
peft_model_id = "adamNLP/llama_adcopy"
device = "cuda"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, peft_model_id)



The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


adapter_model.safetensors:   0%|          | 0.00/540M [00:00<?, ?B/s]

# Testing the Adaption Fine Tuned Model

## Test Example 1

In [25]:
model.to(torch.float16)
model.cuda()
model.eval()
messages = [
    {"role": "system", "content": "Create a text ad given the following product and description."},
    {"role": "user", "content": "Product: Sony PS5 PlayStation Console\nDescription: The PS5™ console unleashes new gaming possibilities that you never anticipated."},
]
## apply tokenizer to chat template
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
## inputs to tokenizer
inputs = tokenizer(text, return_tensors="pt")#, add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}
outputs = model.generate(**inputs, 
                         max_new_tokens=128, 
                         do_sample=True, 
                         top_p=0.95, 
                         temperature=0.2, 
                         repetition_penalty=1.1, 
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<s><|im_start|>system 
Create a text ad given the following product and description.<|im_end|> 
<|im_start|>user 
Product: Sony PS5 PlayStation Console
Description: The PS5™ console unleashes new gaming possibilities that you never anticipated.<|im_end|> 
<|im_start|>assistant 
Ad: Unlock endless gaming adventures with the PS5! 🎮🌟 Experience next-gen graphics and immersive gameplay. Perfect for gamers and exploring the world of gaming. Limited stock - immerse yourself in gaming bliss! 🌟🕹️🏆<|im_end|>
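The decoded output above contains the whole prompt as well as the generated ad. If only the ad is needed, it can be cut out of the decoded string. This is a minimal sketch; `extract_reply` is a hypothetical helper, and the `decoded` string is an abridged copy of the output above.

```python
def extract_reply(decoded, start_tag="<|im_start|>assistant", end_tag="<|im_end|>"):
    """Return the text of the last assistant turn in a decoded ChatML string."""
    _, found, tail = decoded.rpartition(start_tag)
    if not found:
        # No assistant turn in the text.
        return ""
    # Keep everything up to the end-of-turn marker, if present.
    reply = tail.split(end_tag, 1)[0]
    return reply.strip()


# Abridged copy of the decoded output from Test Example 1.
decoded = (
    "<s><|im_start|>system \n"
    "Create a text ad given the following product and description.<|im_end|> \n"
    "<|im_start|>user \n"
    "Product: Sony PS5 PlayStation Console\n"
    "Description: The PS5™ console unleashes new gaming possibilities "
    "that you never anticipated.<|im_end|> \n"
    "<|im_start|>assistant \n"
    "Ad: Unlock endless gaming adventures with the PS5!<|im_end|>"
)

print(extract_reply(decoded))
```

An alternative is to slice the prompt tokens off `outputs[0]` before decoding, which avoids string matching altogether.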


In [26]:
!nvidia-smi

Thu Dec 26 01:58:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A6000               On  |   00000000:2A:00.0 Off |                  Off |
| 30%   44C    P2             78W /  300W |   26382MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
+-----------------------------------------------------------------------------------------+

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Test Example 2
* We will try a Keurig coffee maker from Target. Source: https://www.target.com/p/keurig-k-cafe-special-edition-single-serve-k-cup-pod-coffee-latte-and-cappuccino-maker-nickel/-/A-53536794#lnk=sametab

In [28]:
model.to(torch.float16)
model.cuda()
model.eval()
messages = [
    {"role": "system", "content": "Create a text ad given the following product and description."},
    {"role": "user", "content": "Product: Keurig K-Cafe Special Edition Single-Serve K-Cup Pod Coffee, Latte and Cappuccino Maker\nDescription: Enjoy the rich, full-flavored coffee you love or delicious coffeehouse beverages from the new Keurig K-Café Special Edition single serve coffee, latte, and cappuccino maker."},
]
## apply tokenizer to chat template
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
## inputs to tokenizer
inputs = tokenizer(text, return_tensors="pt")#, add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}
outputs = model.generate(**inputs, 
                         max_new_tokens=400, 
                         do_sample=True, 
                         top_p=0.95, 
                         temperature=0.2, 
                         repetition_penalty=1.1, 
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<s><|im_start|>system 
Create a text ad given the following product and description.<|im_end|> 
<|im_start|>user 
Product: Keurig K-Cafe Special Edition Single-Serve K-Cup Pod Coffee, Latte and Cappuccino Maker
Description: Enjoy the rich, full-flavored coffee you love or delicious coffeehouse beverages from the new Keurig K-Café Special Edition single serve coffee, latte, and cappuccino maker.<|im_end|> 
<|im_start|>assistant 
Ad: Wake up to your favorite drinks with a touch of coffee magic! 🍳🌄🥃<|im_end|>


## Test Example 3 - multilingual?
* Llama 2 supports some multilingual use cases, but only for a limited number of languages. Let's see whether the model can generate an ad in Spanish for the product from the first example.

In [33]:
model.to(torch.float16)
model.cuda()
model.eval()
messages = [
    {"role": "system", "content": "Cree un anuncio de texto con el siguiente producto y descripción."},
    {"role": "user", "content": "Producto: Consola Sony PS5 PlayStation\nDescripción: La consola PS5™ libera nuevas posibilidades de juego que nunca anticipaste."},
]
## apply tokenizer to chat template
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
## inputs to tokenizer
inputs = tokenizer(text, return_tensors="pt")#, add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}
outputs = model.generate(**inputs, 
                         max_new_tokens=400, 
                         do_sample=True, 
                         top_p=0.95, 
                         temperature=0.75, 
                         repetition_penalty=1.1, 
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<s><|im_start|>system 
Cree un anuncio de texto con el siguiente producto y descripción.<|im_end|> 
<|im_start|>user 
Producto: Consola Sony PS5 PlayStation
Descripción: La consola PS5™ libera nuevas posibilidades de juego que nunca anticipaste.<|im_end|> 
<|im_start|>assistant 
Fuente: Experiencia eléctrica y la conectividad de la era digital. Perfecta para jugadores y desarrolladores.<|im_end|>


Summary:
* Interesting: the model did respond in Spanish. I raised the temperature to 0.75 so that sampling is more varied.
* Translated from Spanish to English, the output reads: Source: Electric experience and the connectivity of the digital age. Perfect for gamers and developers.
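To illustrate what changing the temperature does, here is a toy sketch of temperature-scaled softmax. The logits are made up for illustration and do not come from the model. Dividing the logits by a larger temperature flattens the distribution, so lower-ranked tokens get sampled more often.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores

# The three temperatures used in the test examples of this notebook.
for temperature in (0.2, 0.75, 0.9):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At 0.2 nearly all of the probability mass sits on the top token, which is why the English examples are close to deterministic, while 0.75 and 0.9 leave more room for alternative wordings.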

## Test Example 4 - multilingual?
* Let's try example 2, the Keurig, but now in French, which is another language supported by Llama 2.

In [37]:
model.to(torch.float16)
model.cuda()
model.eval()
messages = [
    {"role": "system", "content": "Créez une annonce textuelle en fonction du produit et de la description suivants."},
    {"role": "user", "content": "Produit : Machine à café, latte et cappuccino à dosettes K-Cup à portion individuelle Keurig K-Cafe Special Edition\nDescription : Profitez du café riche et savoureux que vous aimez ou des délicieuses boissons de café du nouveau café à portion individuelle Keurig K-Café Special Edition , latte et cappuccino."},
]
## apply tokenizer to chat template
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
## inputs to tokenizer
inputs = tokenizer(text, return_tensors="pt")#, add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}
outputs = model.generate(**inputs, 
                         max_new_tokens=400, 
                         do_sample=True, 
                         top_p=0.95, 
                         temperature=0.9, 
                         repetition_penalty=1.1, 
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<s><|im_start|>system 
Créez une annonce textuelle en fonction du produit et de la description suivants.<|im_end|> 
<|im_start|>user 
Produit : Machine à café, latte et cappuccino à dosettes K-Cup à portion individuelle Keurig K-Cafe Special Edition
Description : Profitez du café riche et savoureux que vous aimez ou des délicieuses boissons de café du nouveau café à portion individuelle Keurig K-Café Special Edition , latte et cappuccino.<|im_end|> 
<|im_start|>assistant 
Publicité : Prenez le plaisir dans votre matin au quotidien avec ce cafetier K-Café spécial K-Café, il s’agisse d'un café riche et savoureux au fil des années ou encore d'une latte et cappuccino délicatesse !<|im_end|>


Summary:
* The generated ad, translated from French, is: Advertising: Take pleasure into your morning every day with this special K-Café coffee maker, it is a rich and tasty coffee over the years or even a delicate latte and cappuccino!
* This shows that llama-2-7b does have some multilingual capability. The output is not ideal in every language, but we could use adaptive fine-tuning on data in those languages to improve it.