<a href="https://colab.research.google.com/github/Yogesh914/dpo_and_sd/blob/main/dpo_with_sd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applying DPO To Improve Speculative Decoding 🏃💨

## Set-Up Environment

In [2]:
!pip install torch transformers accelerate bitsandbytes trl

Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
Collecting trl
  Downloading trl-0.8.1-py3-none-any.whl (225 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12

In [16]:
!pip install --upgrade transformers accelerate bitsandbytes

Collecting transformers
  Downloading transformers-4.39.3-py3-none-any.whl (8.8 MB)
Collecting accelerate
  Using cached accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting bitsandbytes
  Using cached bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9

## Baseline Implementation

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os

# Cache downloaded model files in a local directory.
os.environ["TRANSFORMERS_CACHE"] = "./.cache"


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from google.colab import userdata
import torch
import time
from trl import DPOTrainer
from datasets import Dataset

access_token = userdata.get('HF_TOKEN')

In [2]:
# Load the target model in 4-bit precision and run the compute in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

In [4]:
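def generate_with_time(model, inputs):
    """Run plain generation and return (outputs, elapsed seconds)."""
    start_time = time.time()
    # do_sample=False makes greedy decoding explicit; no assistant model is used here.
    outputs = model.generate(**inputs, assistant_model=None, do_sample=False, max_new_tokens=500)
    generation_time = time.time() - start_time
    return outputs, generation_time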

In [9]:
model_name = "google/gemma-7b-it"
prompt = "Tell me about gravity"

# Load the 4-bit quantized target model. do_sample is a generation option,
# so it is not passed to from_pretrained().
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config, token=access_token)
model.config.use_cache = True
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output, gen_time = generate_with_time(model, model_inputs)

print(gen_time)
print(tokenizer.decode(output[0], skip_special_tokens=True))

`low_cpu_mem_usage` was None, now set to True since model is quantized.



16.73765206336975
Tell me about gravity.

Gravity is a fundamental force of nature that acts between objects with mass. It is the force that pulls objects towards each other. The greater the mass of an object, the greater its gravitational pull.

**Here are some key points about gravity:**

* **Force:** Gravity is a force, which means it can be measured in units such as newtons (N).
* **Mass:** Gravity is directly related to an object's mass. The greater the mass, the greater the gravitational force.
* **Attraction:** Gravity causes objects to attract each other.
* **Direction:** Gravity pulls objects towards each other in a straight line.
* **Acceleration:** Gravity can cause objects to accelerate towards each other.

**Here are some examples of gravity in action:**

* The Earth's gravity pulls objects towards its surface.
* The force of gravity between the Earth and the Moon keeps the Moon in orbit.
* Gravity is what causes objects to fall when you drop them.

**Here are some interes

## Testing Speculative Decoding

In [3]:
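def assisted_generate_with_time(model, inputs, assistant_model):
    """Run assisted (speculative) generation and return (outputs, elapsed seconds)."""
    start_time = time.time()
    # Passing prompt_lookup_num_tokens together with assistant_model (as in the run whose
    # output is recorded below) appears, in transformers 4.39, to select prompt-lookup
    # decoding and bypass assistant_model, so it is omitted here.
    outputs = model.generate(**inputs, assistant_model=assistant_model, num_assistant_tokens=8, do_sample=False, max_new_tokens=500)
    generation_time = time.time() - start_time
    return outputs, generation_time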
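
As a point of reference for the assisted call above, here is a minimal pure-Python sketch of greedy speculative decoding. It is a toy under stated assumptions, not the `transformers` implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the 7B and 2B models, and the target is called once per position rather than scoring all drafted tokens in one forward pass. It illustrates two properties: with greedy decoding the assisted output should match the plain target output token for token, and the benefit depends on how often the draft's proposals are accepted.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(target(seq))
    return seq


def speculative_greedy_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new_tokens: int, k: int = 4
) -> Tuple[List[int], float]:
    """Toy greedy speculative decoding; returns (tokens, draft acceptance rate)."""
    seq = list(prompt)
    limit = len(prompt) + max_new_tokens
    proposed = accepted = 0
    while len(seq) < limit:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        candidate = list(seq)
        for _ in range(min(k, limit - len(seq))):
            candidate.append(draft(candidate))
        drafted = candidate[len(seq):]
        proposed += len(drafted)
        # 2) The target checks the proposals in order. A real implementation scores
        #    all of them in one forward pass; here it is called once per position.
        for token in drafted:
            expected = target(seq)
            if token != expected:
                seq.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            seq.append(token)
            accepted += 1
        else:
            # Every proposal matched, so the target's own next token is appended too.
            if len(seq) < limit:
                seq.append(target(seq))
    return seq, (accepted / proposed if proposed else 0.0)


# Hypothetical toy "models" over token ids 0..6.
def toy_target(seq: List[int]) -> int:
    return (3 * seq[-1] + 1) % 7


def toy_draft(seq: List[int]) -> int:
    guess = toy_target(seq)
    return guess if seq[-1] % 2 == 0 else (guess + 1) % 7  # deliberately wrong after odd tokens


baseline = greedy_decode(toy_target, [2], max_new_tokens=20)
assisted, acceptance = speculative_greedy_decode(toy_target, toy_draft, [2], max_new_tokens=20)
print("same tokens as plain greedy:", assisted == baseline)
print("draft acceptance rate:", round(acceptance, 2))
```

In this sketch every token that ends up in the output was either confirmed or produced by the target, so an assisted run whose text diverges from the baseline points to a problem in the setup rather than to an expected side effect of the draft model.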

In [4]:
assistant_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", token=access_token).to("cuda")

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.



In [7]:
prompt = "Tell me about gravity"
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", quantization_config=quantization_config, token=access_token)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", token=access_token)
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output, gen_time = assisted_generate_with_time(model, model_inputs, assistant_model)

print(gen_time)
print(tokenizer.decode(output[0], skip_special_tokens=True))

`low_cpu_mem_usage` was None, now set to True since model is quantized.



8.8930823802948
Tell me about gravity.

Gravity is a fundamental force of nature that acts between objects with mass. It is the force that pulls objects towards each other. The greater the mass of an object, the greater its gravitational pull.

**Key Key Points:**

  dises dises   dises
-like, and attract attract attract other objects to Earth Earth Earth. 
- Gravity. 
- Gravity is a universal 
- Gravity is a fundamental for all objects with mass.
- The force of gravity.

**

**

**Here are 
- Gravity is a force of attraction attraction attraction between objects with mass.
- The greater mass mass.
- The greater mass.

Gravity is a force of attraction between objects with mass. It is the force that pulls objects towards each other.
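
The two timings recorded so far can be compared with a short calculation. The snippet below is a sketch that uses only the wall-clock numbers printed in this notebook; `tokens_per_second` is a hypothetical helper, not something the notebook defines. Because the assisted run produced different (and degraded) text, the two runs may not have generated the same number of tokens, so the raw ratio is only a rough indication. Throughput per generated token would be a fairer comparison.

```python
baseline_seconds = 16.73765206336975  # plain generation, printed by the baseline cell
assisted_seconds = 8.8930823802948    # assisted generation, printed by the cell above

speedup = baseline_seconds / assisted_seconds
print(f"wall-clock speedup: {speedup:.2f}x")


def tokens_per_second(output_length: int, prompt_length: int, seconds: float) -> float:
    """Throughput over newly generated tokens only (hypothetical helper)."""
    return (output_length - prompt_length) / seconds
```

With the notebook's variables, the helper could be called as `tokens_per_second(output.shape[1], model_inputs["input_ids"].shape[1], gen_time)` for each run.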


## DPO Applied

In [3]:
assistant_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", token=access_token).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it", token=access_token)

dataset = {
    "prompt": [
        "hello",
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
        "Python",
        "Python",
        "Java",
    ],
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",
        "Javascript",
        "C++",
        "C++",
    ],
}

dataset = Dataset.from_dict(dataset)

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
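
The dictionary above uses the `prompt` / `chosen` / `rejected` column layout that `DPOTrainer` reads. A small check like the following can catch malformed preference data before training starts. It is a sketch: `check_preference_data` is a hypothetical helper, and `example` copies only three rows of the notebook's data for illustration.

```python
from collections import Counter
from typing import Dict, List


def check_preference_data(data: Dict[str, List[str]]) -> List[str]:
    """Return human-readable problems found in a prompt/chosen/rejected dict."""
    columns = ("prompt", "chosen", "rejected")
    missing = [c for c in columns if c not in data]
    if missing:
        return [f"missing columns: {missing}"]
    lengths = {c: len(data[c]) for c in columns}
    if len(set(lengths.values())) != 1:
        return [f"column lengths differ: {lengths}"]
    problems = []
    rows = list(zip(data["prompt"], data["chosen"], data["rejected"]))
    for i, (_, chosen, rejected) in enumerate(rows):
        if chosen == rejected:
            problems.append(f"row {i}: chosen equals rejected")
    # Exact duplicate rows add no new preference information.
    for row, count in Counter(rows).items():
        if count > 1:
            problems.append(f"duplicate row x{count}: {row}")
    return problems


example = {
    "prompt": ["hello", "how are you", "What is your name?"],
    "chosen": ["hi nice to meet you", "I am fine", "My name is Mary"],
    "rejected": ["leave me alone", "I am not fine", "Whats it to you?"],
}
print(check_preference_data(example))
```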



In [7]:
!pip install -q datasets peft sentencepiece wandb

In [16]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.39.3-py3-none-any.whl (8.8 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed transformers-4.39.3


In [10]:
import os
import gc
import torch

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer
import bitsandbytes as bnb
from google.colab import userdata
import wandb

In [14]:
new_model = "dpo_gemma"

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

assistant_model.config.use_cache = False

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=50,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=10,
    bf16=True,
    report_to="wandb",
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    assistant_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    # max_length bounds prompt plus completion, so max_prompt_length must be smaller.
    max_prompt_length=512,
    max_length=1024,
)

# Fine-tune model with DPO
dpo_trainer.train()




Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 1 | 0.3466 |
| 2 | 0.3466 |
| 3 | 0.3442 |
| 4 | 0.3435 |
| 5 | 0.3317 |
| 6 | 0.3107 |
| 7 | 0.2755 |
| 8 | 0.2331 |
| 9 | 0.1804 |
| 10 | 0.1312 |


TrainOutput(global_step=50, training_loss=0.06298887740122154, metrics={'train_runtime': 27.2695, 'train_samples_per_second': 29.337, 'train_steps_per_second': 1.834, 'total_flos': 0.0, 'train_loss': 0.06298887740122154, 'epoch': 50.0})
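
To help read the loss values above, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, written in plain Python. It assumes the sequence log-probabilities are already available as numbers; `dpo_loss` and its argument names are hypothetical and are not the `trl` API. When the policy equals the reference model, the margin is zero and the loss is ln 2 ≈ 0.6931. The first logged values (0.3466) are about half of that; one plausible explanation, which the notebook does not confirm, is that the trainer scales the logged loss under gradient accumulation when an epoch contains fewer batches than `gradient_accumulation_steps`.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # same beta as passed to DPOTrainer in this notebook
) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == softplus(-margin), computed in a numerically stable way.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss starts at ln 2.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy raises the chosen answer and lowers the rejected one: the loss drops.
print(dpo_loss(-4.0, -9.0, -5.0, -7.0))
```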

In [15]:
dpo_trainer.model.save_pretrained("final_checkpoint")
tokenizer.save_pretrained("final_checkpoint")

# Flush memory
del dpo_trainer, assistant_model
gc.collect()
torch.cuda.empty_cache()

assistant_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", token=access_token, return_dict=True).to("cuda")

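# Pass the access token here as well, as in the other from_pretrained calls for the gated Gemma repos.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", token=access_token)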

# Merge base model with the adapter
model = PeftModel.from_pretrained(assistant_model, "final_checkpoint")
model = model.merge_and_unload()

# Save model and tokenizer
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)


('dpo_gemma/tokenizer_config.json',
 'dpo_gemma/special_tokens_map.json',
 'dpo_gemma/tokenizer.model',
 'dpo_gemma/added_tokens.json',
 'dpo_gemma/tokenizer.json')

In [23]:
# Quick check: generate from the merged model for one of the training prompts
prompt = "how are you"
tokenizer = AutoTokenizer.from_pretrained(new_model)
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**model_inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

how are you doing?

I am doing well, thank you
