<a href="https://colab.research.google.com/github/arubisov/gmail-llm-ghostwriter/blob/main/Gmail_Finetune_LLaMa_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning LLaMa-2-7B on Gmail Data

This notebook walks through how to create an LLM-powered Chrome plug-in that drafts e-mail responses that sound like you. The key ingredient is fine-tuning the LLaMa-2-7B model on your own Gmail data.

The high-level sequence here will be:
1. Export your Gmail data into Google Drive
1. Wrangle the data into a format usable for fine-tuning: received-response pairs
1. Fine-tune LLaMa-2-7B
1. Publish the model endpoint to HuggingFace
1. Create a Chrome plug-in that calls your model endpoint

In the previous notebook, we walked through the first two steps. This resulted in a finetuning dataset that was saved to either your Drive or localhost.

Following the [llama-recipes quickstart guide](https://github.com/facebookresearch/llama-recipes/blob/main/examples/quickstart.ipynb), this notebook will train LLaMa-2 on a single Colab T4 GPU using 4-bit quantization and LoRA.

In [1]:
import sys
import os
from pathlib import Path

if 'google.colab' in sys.modules:
    from google.colab import drive
    DRIVE=True
    drive.mount('/content/drive')
    path = Path("/content/drive/My Drive/Takeout/")
    print('Running on Colab')
else:
    DRIVE=False
    path = Path("./")
    print('Running on localhost')

Running on localhost


Log in to the HuggingFace Hub, which is required to download the LLaMa-2 model. If you're doing this for the first time, you'll need to request access first [from Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and then [from HuggingFace](https://huggingface.co/meta-llama/Llama-2-7b-hf). When I did this, both requests were approved within 15 minutes.

In [2]:
from huggingface_hub import notebook_login

notebook_login()

## Load the LLaMa-2 tokenizer and model


In [2]:
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)

use_4bit = True                         # Activate 4-bit precision base model loading
bnb_4bit_compute_dtype = torch.float16  # Compute dtype for 4-bit base models
bnb_4bit_quant_type = "nf4"             # Quantization type (fp4 or nf4)
use_nested_quant = False                # Activate nested quantization for 4-bit base models (double quantization)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

model = LlamaForCausalLM.from_pretrained(model_id,
                                         quantization_config=bnb_config,
                                         device_map='auto',
                                         torch_dtype=torch.float16)
model.config.use_cache = False

### Test the pretrained model on the task

In [None]:
eval_prompt = """Draft a response to the following e-mail:
### From:
Ted Herman <hermanted@gmail.com>
### Message:
buddy, can I get a sitrep?
### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=500)[0], skip_special_tokens=False))

So the pretrained model is obviously generating coherent English, but the response isn't very good. It's repetitive, and it sounds nothing like me. Let's run our dataset through the supervised fine-tuning process.

### Preprocess our dataset for finetuning

We'll load the raw json dataset as a HuggingFace dataset, which will enable batched mapping.

In [5]:
from datasets import load_dataset

dataset = load_dataset("json", data_files=str(path/'gmail-finetune-dataset.json'), split="train")
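
Before loading, it can help to sanity-check the records. The sketch below assumes each record carries the `from`, `message` and `response` string fields that the prompt template in this notebook reads; the helper name `validate_records` and the sample records are hypothetical, for illustration only.

```python
# Field names taken from how this notebook's prompt template reads each sample.
REQUIRED_FIELDS = ("from", "message", "response")

def validate_records(records):
    """Return the indices of records with a missing or empty required field."""
    bad = []
    for i, rec in enumerate(records):
        ok = all(isinstance(rec.get(k), str) and rec.get(k).strip() for k in REQUIRED_FIELDS)
        if not ok:
            bad.append(i)
    return bad

# Hypothetical records, not taken from any real mailbox.
records = [
    {"from": "A <a@example.com>", "message": "Lunch tomorrow?", "response": "Sure, noon works."},
    {"from": "B <b@example.com>", "message": "Ping", "response": ""},
]
bad_indices = validate_records(records)
print(bad_indices)
```

Records flagged this way (here, the one with an empty response) are probably worth dropping before fine-tuning, since they would teach the model to reply with nothing.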

### Tokenizer special tokens

After running this the first time, I found that the model never generated an EOS token. And indeed, [base LLaMa-2 will rarely ever generate it](https://github.com/huggingface/transformers/issues/24994). My belief is that setting:

```python
tokenizer.pad_token = tokenizer.eos_token
```

results in the EOS token never being learned: the loss is not computed on pad tokens, so when the pad and EOS tokens share an id the EOS positions are masked out too, and the model never learns to predict EOS. The solution here is to declare a new pad token, since neither the LLaMa tokenizer nor SentencePiece, which it is built on, has one.
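
To make the hypothesis concrete, here is a minimal sketch in plain Python. It assumes the data collator builds labels by copying the input ids and masking every position whose id equals the pad id with `-100` (the value PyTorch's cross-entropy loss ignores by default). The helper `make_labels` and the token ids in `sequence` are hypothetical; the ids 2 and 32000 match the EOS and new pad ids printed later in this notebook.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def make_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, masking every position equal to pad_token_id."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

EOS_ID = 2          # </s>
NEW_PAD_ID = 32000  # dedicated [PAD] token

# Hypothetical token ids for one short example that ends with EOS.
sequence = [1, 15043, 3186, EOS_ID]

# Case 1: pad token is the EOS token, so padding reuses id 2.
labels_shared = make_labels(sequence + [EOS_ID, EOS_ID], pad_token_id=EOS_ID)

# Case 2: a dedicated pad token is used for padding.
labels_separate = make_labels(sequence + [NEW_PAD_ID, NEW_PAD_ID], pad_token_id=NEW_PAD_ID)

print(labels_shared)    # the real EOS is masked along with the padding
print(labels_separate)  # the real EOS survives as a training target
```

Under that assumption, sharing the id removes the only EOS target in the example, while a dedicated pad token leaves it in place.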

Per this [HF issue](https://github.com/huggingface/transformers/issues/8039), the randomly initialized weight of a newly added token might unilaterally assign that token a very high likelihood. We can get around this by initializing its weight to that of the `<unk>` token.

In [3]:
print(f"pad: {tokenizer.pad_token_id} {tokenizer.pad_token}")
print(f"vocab length={len(tokenizer.get_vocab())}")

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

print(f"unk: {tokenizer.unk_token_id} {tokenizer.unk_token}")
print(f"bos: {tokenizer.bos_token_id} {tokenizer.bos_token}")
print(f"eos: {tokenizer.eos_token_id} {tokenizer.eos_token}")
print(f"pad: {tokenizer.pad_token_id} {tokenizer.pad_token}")
print(f"vocab length={len(tokenizer.get_vocab())}")

model.resize_token_embeddings(len(tokenizer))
model.model.embed_tokens.weight.data[-1, :] = model.model.embed_tokens.weight.data[tokenizer.unk_token_id, :]

Using pad_token, but it is not set yet.


pad: None None
vocab length=32000
unk: 0 <unk>
bos: 1 <s>
eos: 2 </s>
pad: 32000 [PAD]
vocab length=32001


In [6]:
# per the llama docs for a custom dataset, should instead define a method with the below signature.
# def preprocess_gmail_data(dataset_class, tokenizer, split)
# here we're just running as a script.

# Plain template string; the placeholders are filled in by str.format below.
prompt = (
    "Draft a response to the following e-mail:\n### From:\n{sender}\n### Message:\n{message}\n### Response:\n{response}{eos_token}"
)

def apply_prompt_template(sample):
    return {
        "text": prompt.format(
            sender=sample["from"],
            message=sample["message"],
            response=sample["response"],
            eos_token=tokenizer.eos_token,
        )
    }

dataset = dataset.map(apply_prompt_template, remove_columns=list(dataset.features))

# dataset = dataset.map(
#     lambda sample: tokenizer(sample["text"]),
#     batched=True,
#     remove_columns=list(dataset.features),
# ) # .map(Concatenator(), batched=True)  # if I need this later, refer to https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/datasets/utils.py
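
As a quick illustration of what the template produces, the sketch below formats one sample for training and one for inference. The template text is the same as above; the sample record is hypothetical, and `</s>` is hard-coded as the EOS token so the snippet runs without the tokenizer.

```python
PROMPT = (
    "Draft a response to the following e-mail:\n"
    "### From:\n{sender}\n"
    "### Message:\n{message}\n"
    "### Response:\n{response}{eos_token}"
)

# Hypothetical sample, for illustration only.
sample = {
    "from": "Ted Herman <hermanted@gmail.com>",
    "message": "buddy, can I get a sitrep?",
    "response": "On it, will send an update tonight.",
}

# Training text: the response is followed by EOS so the model learns where to stop.
training_text = PROMPT.format(
    sender=sample["from"],
    message=sample["message"],
    response=sample["response"],
    eos_token="</s>",
)

# Inference prompt: same template, but response and EOS are left empty for the model to fill in.
inference_prompt = PROMPT.format(
    sender=sample["from"], message=sample["message"], response="", eos_token=""
)

print(training_text)
```

The inference prompt built this way is a strict prefix of the training text, which is what lets the fine-tuned model continue from `### Response:`.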

### Configure model for training

We'll put the model in `train` mode and set up the PEFT config. [PEFT](https://huggingface.co/blog/peft) is the Parameter-Efficient Fine-Tuning library from HuggingFace. Here it gives us access to [LoRA](https://arxiv.org/pdf/2106.09685.pdf), or Low-Rank Adaptation, which "freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks."

For deciding which layers to train and which hyperparameters to choose, I consulted the LoRA paper itself. Section 7.1 suggests that training only W_q, W_v with r=4 is sufficiently performant!

From [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/15fhf33/why_does_the_model_refuse_to_predict_eos/), LoRA does not train the token embeddings, so they need to be added explicitly to the LoRA config using:
```python
modules_to_save = ["embed_tokens", "lm_head"]
```

In [7]:
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
    prepare_model_for_kbit_training,
)

lora_r = 4              # LoRA attention dimension
lora_alpha = 4          # Alpha parameter for LoRA scaling
lora_dropout = 0.05     # Dropout probability for LoRA layers

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"]     # added to train the new PAD token embedding in addition to LoRA layers
)

# prepare the 4-bit quantized model for training
model.train()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 526,401,536 || all params: 7,002,673,152 || trainable%: 7.5171513017090765
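
That trainable count is much larger than LoRA alone would give, so here is a back-of-the-envelope check. It assumes the LLaMa-2-7B shapes as I understand them (hidden size 4096, 32 layers, square `q_proj` and `v_proj`) and the resized vocabulary of 32001; treat those constants as assumptions.

```python
HIDDEN = 4096   # assumed hidden size of LLaMa-2-7B
LAYERS = 32     # assumed number of decoder layers
VOCAB = 32001   # vocabulary size after adding [PAD]
R = 4           # LoRA rank used in this notebook

# Each adapted projection gets A (r x d_in) and B (d_out x r).
lora_per_module = R * (HIDDEN + HIDDEN)
lora_total = lora_per_module * 2 * LAYERS  # q_proj and v_proj in every layer

# embed_tokens and lm_head are each a VOCAB x HIDDEN matrix.
vocab_matrix = VOCAB * HIDDEN

reported_trainable = 526_401_536
remainder = reported_trainable - lora_total
copies = remainder / vocab_matrix

print(f"LoRA adapters:       {lora_total:,}")
print(f"one vocab matrix:    {vocab_matrix:,}")
print(f"non-LoRA trainable:  {remainder:,} = {copies:g} vocab-sized matrices")
```

So the LoRA adapters account for only about 2.1M parameters; the rest is exactly four vocabulary-sized matrices, which comes from `modules_to_save`. Why four rather than two are counted is not something I have confirmed; it may depend on how the PEFT version counts the saved modules.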


From the LLaMa-2 paper:

> Fine-Tuning Details. For supervised fine-tuning, we use a cosine learning rate schedule with an initial learning rate of 2e−5 , a weight decay of 0.1, a batch size of 64, and a sequence length of 4096 tokens. For the fine-tuning process, each sample consists of a prompt and an answer. To ensure the model sequence length is properly filled, we concatenate all the prompts and answers from the training set. A special token is utilized to separate the prompt and answer segments. We utilize an autoregressive objective and zero-out the loss on tokens from the user prompt, so as a result, we backpropagate only on answer tokens. Finally, we fine-tune the model for 2 epochs.

Next we'll set up the training arguments.

In [9]:
from transformers import TrainingArguments
from transformers.utils.import_utils import is_torch_bf16_gpu_available

output_dir = "./results"
per_device_train_batch_size = 4   # Batch size per GPU for training
per_device_eval_batch_size = 4    # Batch size per GPU for evaluation
gradient_accumulation_steps = 4   # Number of update steps to accumulate the gradients for. try 1?
optim = "paged_adamw_32bit"       # Optimizer to use. maybe adamw_torch_fused?
num_train_epochs = 1              # Number of training epochs
save_steps = 100                  # Save checkpoint every X update steps
logging_steps = 10                # Log every X update steps
learning_rate = 1e-4              # Initial learning rate (AdamW optimizer)
lr_scheduler_type = "cosine"      # Learning rate schedule
weight_decay = 0.01               # Weight decay to apply to all layers except bias/LayerNorm weights
max_grad_norm = 1                 # Maximum gradient norm (gradient clipping). try 1?
max_steps = -1                    # Number of training steps (overrides num_train_epochs)
warmup_ratio = 0.03               # Ratio of steps for a linear warmup (from 0 to learning rate)
gradient_checkpointing = False    # Enable gradient checkpointing
group_by_length = True            # Group sequences into batches with same length. Saves memory and speeds up training considerably

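# Enable mixed-precision training: bf16 where the GPU supports it (e.g. an A100), otherwise fp16 (e.g. a T4).
# TrainingArguments raises an error if fp16 and bf16 are both True, so enable exactly one.
bf16 = is_torch_bf16_gpu_available()
fp16 = not bf16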


training_arguments = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    lr_scheduler_type=lr_scheduler_type,
    fp16=fp16,
    bf16=bf16,
    weight_decay=weight_decay,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    gradient_checkpointing=gradient_checkpointing,
    group_by_length=group_by_length,
)
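
To get a feel for these settings, here is a small sketch of the effective batch size and of a cosine schedule with linear warmup. It is my own approximation of the schedule's shape, not the library's implementation, and `TOTAL_STEPS` is a hypothetical number since the real value depends on the dataset size.

```python
import math

def warmup_steps_for(total_steps, warmup_ratio=0.03):
    """Number of linear warmup steps for a given total step count."""
    return math.ceil(total_steps * warmup_ratio)

def lr_at(step, total_steps, base_lr=1e-4, warmup_ratio=0.03):
    """Learning rate at `step`: linear warmup from 0, then cosine decay to 0."""
    warmup = warmup_steps_for(total_steps, warmup_ratio)
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Sequences contributing to each optimizer update on a single GPU.
effective_batch_size = 4 * 4  # per_device_train_batch_size * gradient_accumulation_steps

TOTAL_STEPS = 1000  # hypothetical
for step in (0, warmup_steps_for(TOTAL_STEPS), TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

With a batch size of 4 and 4 accumulation steps, each update sees 16 sequences, compared with the batch size of 64 quoted from the LLaMa-2 paper above.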

Lastly we set up the `SFTTrainer`, or Supervised Fine Tuning trainer.

In [10]:
from trl import SFTTrainer

max_seq_length = 1024     # Maximum sequence length to use - defaults to tokenizer max, or 2048
packing = False           # Pack multiple short examples in the same input sequence to increase efficiency

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=packing,          # pass the setting defined above instead of leaving it unused
    tokenizer=tokenizer,
    args=training_arguments,
)

### Run the trainer

Now we fine-tune the model and save the trained adapter to disk.

In [None]:
# Fine-tuned model name
new_model = "llama-2-7b-anton-gmail-v2"

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: arubisov. Use `wandb login --relogin` to force relogin


| Step | Training Loss |
|------|---------------|
| 10   | 2.6112 |
| 20   | 3.1825 |
| 30   | 2.5039 |
| 40   | 2.2513 |
| 50   | 1.8502 |
| 60   | 2.5524 |
| 70   | 2.1912 |
| 80   | 2.123  |
| 90   | 1.9381 |
| 100  | 1.6506 |


In [19]:
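# Empty VRAM: drop the Python references, collect garbage, then release PyTorch's cached GPU memory
import gc

del model
del tokenizer
del trainer
gc.collect()
torch.cuda.empty_cache()
gc.collect()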

0

# Restart kernel, reload

Once the training run is done and the model is saved to disk, restart the kernel and reload. This involves re-running the initial cells above, up to and including the addition of the padding token and the resizing of the model embeddings. Continuing...
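
The cell below loads from a directory I renamed by hand (`checkpoint-bestsofar`). If you would rather pick up the most recent checkpoint automatically, here is a minimal sketch, assuming the trainer writes its checkpoints as `checkpoint-<step>` folders under the output directory. The helper `latest_checkpoint` is hypothetical, and the demo runs against a temporary directory rather than real training output.

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in Path(output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]

# Demo on a throwaway directory; manually renamed folders are ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-20", "checkpoint-100", "checkpoint-bestsofar"):
        (Path(tmp) / name).mkdir()
    found_name = latest_checkpoint(tmp).name

print(found_name)
```

Note that steps are compared as integers, so `checkpoint-100` correctly sorts after `checkpoint-20`.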

In [7]:
from peft import PeftModel

model = PeftModel.from_pretrained(model, "./results/checkpoint-bestsofar")
# model = model.merge_and_unload()

# Reload tokenizer to save it
# tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "right"

In [None]:
eval_prompt = """Draft a response to the following e-mail:
### From:
Ted Herman <hermanted@gmail.com>
### Message:
buddy, can I get a sitrep?
### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=False))

<s> Draft a response to the following e-mail:
### From:
Ted Herman <hermanted@gmail.com>
### Message:
buddy, can I get a sitrep?
### Response:
oh, i've been meaning to tell you about the guy i met at the airport.

i'm still in the airport, but i'm sitting on the edge of the couch in the lounge area, and this guy just came in. he's got a military haircut, and is wearing a military jacket and pants. he's got a big beard, and a huge belly. he's sitting


Much better. :)

In [None]:
!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

/bin/bash: huggingface-cli: command not found


adapter_model.bin:   0%|          | 0.00/537M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/rubi242/llama-2-7b-anton-gmail-v2/commit/49119f16316a143f76bf35e91b3b04d5eeeb9f55', commit_message='Upload tokenizer', commit_description='', oid='49119f16316a143f76bf35e91b3b04d5eeeb9f55', pr_url=None, pr_revision=None, pr_num=None)

In [4]:
from peft import PeftModel

base_model = LlamaForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map='auto',
)

base_model.resize_token_embeddings(len(tokenizer))
base_model.model.embed_tokens.weight.data[-1, :] = base_model.model.embed_tokens.weight.data[tokenizer.unk_token_id, :]
model = PeftModel.from_pretrained(base_model, "./results/checkpoint-bestsofar")
model = model.merge_and_unload(progressbar=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32001. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB (GPU 0; 14.56 GiB total capacity; 13.61 GiB already allocated; 480.44 MiB free; 13.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
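
A rough estimate suggests why this runs out of memory on a T4. The sketch below counts the parameters of the base model from its layer shapes and converts that to fp16 storage. The architecture constants (hidden size 4096, MLP size 11008, 32 layers) are my understanding of the LLaMa-2-7B config and should be treated as assumptions, and the estimate ignores activations and allocator overhead.

```python
HIDDEN = 4096         # assumed hidden size
INTERMEDIATE = 11008  # assumed MLP intermediate size
LAYERS = 32           # assumed number of decoder layers
VOCAB = 32001         # vocabulary size after adding [PAD]

attention = 4 * HIDDEN * HIDDEN   # q, k, v and o projections
mlp = 3 * HIDDEN * INTERMEDIATE   # gate, up and down projections
norms = 2 * HIDDEN                # two RMSNorm weights per layer
per_layer = attention + mlp + norms

# All layers + final norm + embed_tokens + lm_head.
total_params = LAYERS * per_layer + HIDDEN + 2 * VOCAB * HIDDEN

GIB = 1024 ** 3
MIB = 1024 ** 2
fp16_weights_gib = total_params * 2 / GIB        # 2 bytes per fp16 parameter
fp32_vocab_matrix_mib = VOCAB * HIDDEN * 4 / MIB  # one vocab-sized matrix in fp32

print(f"parameters:            {total_params:,}")
print(f"fp16 weights:          {fp16_weights_gib:.2f} GiB")
print(f"fp32 vocab-sized copy: {fp32_vocab_matrix_mib:.1f} MiB")
```

The fp16 weights alone take roughly 12.5 GiB of the 14.56 GiB available, and a single fp32 vocabulary-sized matrix is about 500 MiB, which is close to the 502 MiB allocation that failed. That last link is a guess on my part. The practical consequence is that merging the adapter into the unquantized base model probably needs a GPU with more memory, or has to be done on the CPU.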