<a href="https://colab.research.google.com/github/arubisov/gmail-llm-ghostwriter/blob/main/Gmail_Finetune_LLaMa_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning LLaMa-2-7B on Gmail Data

This notebook walks through how to create an LLM-powered Chrome plug-in that drafts e-mail responses that sound like you. The key ingredient is fine-tuning the LLaMa-2-7B model on your own Gmail data.

The high-level sequence here will be:
1. Export your Gmail data into Google Drive
1. Wrangle the data into a format usable for fine-tuning: received-response pairs
1. Fine-tune LLaMa-2-7B
1. Publish the model endpoint to HuggingFace
1. Create a Chrome plug-in that calls your model endpoint

In the previous notebook, we walked through the first two steps. That produced a fine-tuning dataset saved to either your Drive or localhost.

Following the [llama-recipes quickstart guide](https://github.com/facebookresearch/llama-recipes/blob/main/examples/quickstart.ipynb), this notebook will train LLaMa-2 on a single Colab T4 GPU using LoRA on top of a quantized base model. The code below loads the base model in 4-bit (NF4) precision.
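
As a rough back-of-the-envelope check on why quantization matters on a T4 (16 GB of GPU memory), here is a minimal sketch that estimates the memory needed just to hold the weights at different precisions. The parameter count is rounded to 7 billion, and the helper `weight_memory_gb` is a hypothetical name used only for this illustration. Activations, optimizer state, and quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) to store the weights alone."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # assumption: LLaMa-2-7B rounded to 7 billion parameters

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

At fp16 the weights alone nearly fill the card, which leaves little room for gradients and activations. Quantizing the frozen base model is what frees up that headroom.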

In [1]:
!pip install -q trl transformers datasets accelerate peft sentencepiece
!pip install -q bitsandbytes wandb

# pip install transformers datasets accelerate sentencepiece protobuf==3.20 py7zr scipy peft bitsandbytes fire tor

In [2]:
import sys
import os
from pathlib import Path

if 'google.colab' in sys.modules:
    from google.colab import drive
    DRIVE=True
    drive.mount('/content/drive')
    path = Path("/content/drive/My Drive/Takeout/")
    print('Running on Colab')
else:
    DRIVE=False
    path = Path("./")
    print('Running on localhost')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Running on Colab


Log in to the HuggingFace Hub, which is required to download the LLaMa-2 model. If you're doing this for the first time, you'll need to request access first [from Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and then [from HuggingFace](https://huggingface.co/meta-llama/Llama-2-7b-hf). When I did this, both requests were approved within 15 minutes.

In [3]:
from huggingface_hub import notebook_login

notebook_login()


## Load the LLaMa-2 tokenizer and model


In [4]:
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)

use_4bit = True                         # Activate 4-bit precision base model loading
bnb_4bit_compute_dtype = torch.float16  # Compute dtype for 4-bit base models
bnb_4bit_quant_type = "nf4"             # Quantization type (fp4 or nf4)
use_nested_quant = False                # Activate nested quantization for 4-bit base models (double quantization)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

model = LlamaForCausalLM.from_pretrained(model_id,
                                         quantization_config=bnb_config,
                                         device_map='auto',
                                         torch_dtype=torch.float16)
model.config.use_cache = False
model.config.pretraining_tp = 1


### Test the pretrained model on the task

In [6]:
eval_prompt = """
Draft a response to the following e-mail:
### From:
Ted Herman <hermanted@gmail.com>
### Message:
Hey buddy, how's the job hunt going?
### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


Draft a response to the following e-mail:
### From:
Ted Herman <hermanted@gmail.com>
### Message:
Hey buddy, how's the job hunt going?
### Response:
My job hunt is going great. I've got a few interviews lined up and I'm really excited about the prospects.
### Response:
That's great news.
### Response:
Thanks.
### Response:
Have you had any luck?
### Response:
I haven't had any luck yet.
### Response:
Have you tried any of the local temp agencies?
### Response:



The pretrained model generates coherent English, but the response isn't very good. Instead of stopping after one reply, it keeps emitting new `### Response:` turns until it runs out of tokens, and it sounds nothing like me. Let's run our dataset through the supervised fine-tuning process.

### Preprocess our dataset for finetuning

We'll load the raw json dataset as a HuggingFace dataset, which will enable batched mapping.

In [7]:
from datasets import load_dataset

dataset = load_dataset("json", data_files=str(path/'gmail-finetune-dataset.json'), split="train")

In [8]:
# per the llama docs for a custom dataset, should instead define a method with the below signature.
# here we're just running as a script.
# def preprocess_gmail_data(dataset_class, tokenizer, split)

# Plain (non-f) string: the {placeholders} are filled in later by str.format.
prompt = (
    "Draft a response to the following e-mail:\n### From:\n{sender}\n### Message:\n{message}\n### Response:\n{response}{eos_token}"
)

def apply_prompt_template(sample):
    return {
        "text": prompt.format(
            sender=sample["from"],
            message=sample["message"],
            response=sample["response"],
            eos_token=tokenizer.eos_token,
        )
    }

dataset = dataset.map(apply_prompt_template, remove_columns=list(dataset.features))

# dataset = dataset.map(
#     lambda sample: tokenizer(sample["text"]),
#     batched=True,
#     remove_columns=list(dataset.features),
# ) # .map(Concatenator(), batched=True)  # if I need this later, refer to https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/datasets/utils.py
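
To make the template concrete, here is a small standalone sketch that renders one record without needing the tokenizer or the dataset. The record is hypothetical (the sender and message are borrowed from the evaluation prompt above, the response is made up), and `EOS_TOKEN` is hard-coded to `"</s>"` on the assumption that this matches `tokenizer.eos_token` for LLaMa-2.

```python
# Same template as above, written as a plain string so str.format fills the placeholders.
PROMPT = (
    "Draft a response to the following e-mail:\n### From:\n{sender}\n"
    "### Message:\n{message}\n### Response:\n{response}{eos_token}"
)

EOS_TOKEN = "</s>"  # assumption: the LLaMa-2 tokenizer's end-of-sequence token

# Hypothetical record with the three keys the dataset is expected to have.
sample = {
    "from": "Ted Herman <hermanted@gmail.com>",
    "message": "Hey buddy, how's the job hunt going?",
    "response": "Going well, thanks for asking!",  # made-up reply for illustration
}

def render(record: dict, eos_token: str = EOS_TOKEN) -> str:
    """Fill the prompt template from one received/response pair."""
    return PROMPT.format(
        sender=record["from"],
        message=record["message"],
        response=record["response"],
        eos_token=eos_token,
    )

text = render(sample)
print(text)
```

Ending every training example with the EOS token is what gives the model a chance to learn to stop after a single reply.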

### Configure model for training

We'll put the model in `train` mode and set up the PEFT config. [PEFT](https://huggingface.co/blog/peft) is HuggingFace's Parameter-Efficient Fine-Tuning library. Here it gives us access to [LoRA](https://arxiv.org/pdf/2106.09685.pdf) (Low-Rank Adaptation), which "freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks."

In [9]:
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
    prepare_model_for_kbit_training,
)

lora_r = 64             # LoRA attention dimension
lora_alpha = 16         # Alpha parameter for LoRA scaling
lora_dropout = 0.1      # Dropout probability for LoRA layers

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules = ["q_proj", "v_proj"]
)

# prepare the 4-bit quantized model for training
model.train()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 33,554,432 || all params: 6,771,970,048 || trainable%: 0.49548996469513035
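
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the LLaMa-2-7B architecture has 32 decoder layers and a hidden size of 4096, so that `q_proj` and `v_proj` are both 4096×4096. Those two numbers come from the model architecture, not from this notebook. Each adapted projection gets two low-rank matrices, A (r × d_in) and B (d_out × r).

```python
HIDDEN_SIZE = 4096   # assumption: LLaMa-2-7B hidden size
NUM_LAYERS = 32      # assumption: LLaMa-2-7B decoder layer count
LORA_R = 64
LORA_ALPHA = 16
TARGET_MODULES = ["q_proj", "v_proj"]

def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the A (r x d_in) and B (d_out x r) matrices of one adapter."""
    return r * d_in + d_out * r

per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
trainable = per_module * len(TARGET_MODULES) * NUM_LAYERS

all_params = 6_771_970_048           # "all params" figure from the output above
scaling = LORA_ALPHA / LORA_R        # factor applied to the low-rank update

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

With `lora_alpha = 16` and `lora_r = 64`, the low-rank update is scaled by 0.25 before being added to the frozen projection.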


From the LLaMa-2 paper:

> Fine-Tuning Details. For supervised fine-tuning, we use a cosine learning rate schedule with an initial learning rate of 2e−5 , a weight decay of 0.1, a batch size of 64, and a sequence length of 4096 tokens. For the fine-tuning process, each sample consists of a prompt and an answer. To ensure the model sequence length is properly filled, we concatenate all the prompts and answers from the training set. A special token is utilized to separate the prompt and answer segments. We utilize an autoregressive objective and zero-out the loss on tokens from the user prompt, so as a result, we backpropagate only on answer tokens. Finally, we fine-tune the model for 2 epochs.
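
The "zero-out the loss on tokens from the user prompt" step in the quote is commonly implemented by setting the labels of prompt tokens to `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss. Here is a minimal sketch with made-up token ids. The helper name `mask_prompt_labels` is hypothetical, and as far as I can tell the `SFTTrainer` setup used later in this notebook does not apply this masking, so treat it as an illustration of the paper's recipe and not of what this notebook does.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_prompt_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer; only answer tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

# Toy token ids, purely for illustration.
prompt_ids = [1, 501, 502, 503]
answer_ids = [901, 902, 2]

input_ids, labels = mask_prompt_labels(prompt_ids, answer_ids)
print(input_ids)
print(labels)
```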

Next we'll set up the training arguments.

In [10]:
from transformers import TrainingArguments
from transformers.utils.import_utils import is_torch_bf16_gpu_available

output_dir = "./results"
per_device_train_batch_size = 4   # Batch size per GPU for training
per_device_eval_batch_size = 4    # Batch size per GPU for evaluation
gradient_accumulation_steps = 4   # Number of update steps to accumulate the gradients for. try 1?
optim = "paged_adamw_32bit"       # Optimizer to use. maybe adamw_torch_fused?
num_train_epochs = 1              # Number of training epochs
save_steps = 25                   # Save checkpoint every X update steps
logging_steps = 25                # Log every X update steps
learning_rate = 2e-4              # Initial learning rate (AdamW optimizer)
lr_scheduler_type = "cosine"      # Learning rate schedule
weight_decay = 0.001              # Weight decay to apply to all layers except bias/LayerNorm weights
max_grad_norm = 0.3               # Maximum gradient norm (gradient clipping). try 1?
max_steps = -1                    # Number of training steps (overrides num_train_epochs)
warmup_ratio = 0.03               # Ratio of steps for a linear warmup (from 0 to learning rate)
gradient_checkpointing = False    # Enable gradient checkpointing
group_by_length = True            # Group sequences into batches with same length. Saves memory and speeds up training considerably

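# Enable mixed-precision training. TrainingArguments rejects fp16 and bf16 both being True,
# so use bf16 where the GPU supports it (e.g. A100) and fall back to fp16 otherwise (e.g. T4).
bf16 = is_torch_bf16_gpu_available()
fp16 = not bf16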


training_arguments = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    lr_scheduler_type=lr_scheduler_type,
    fp16=fp16,
    bf16=bf16,
    weight_decay=weight_decay,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    gradient_checkpointing=gradient_checkpointing,
    group_by_length=group_by_length,
)
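
Two consequences of these settings are easy to work out by hand: the effective batch size, and the shape of the learning-rate curve. The sketch below is a pure-Python approximation of a linear warmup followed by cosine decay. It is meant to mirror what `lr_scheduler_type="cosine"` with `warmup_ratio=0.03` does, but it is not the library's code. `total_steps = 100` is a hypothetical value, since the real number depends on the dataset size.

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay back towards 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Sequences consumed per optimizer update on a single GPU.
effective_batch_size = 4 * 4  # per_device_train_batch_size * gradient_accumulation_steps

total_steps = 100  # hypothetical; depends on how many examples you have
for step in (0, 1, 3, 25, 50, 75, 100):
    print(f"step {step:>3}: lr = {lr_at(step, total_steps):.2e}")
print("effective batch size:", effective_batch_size)
```

Note that the learning rate here (2e-4) is ten times the 2e-5 quoted from the paper, and the effective batch size is 16 where the paper used 64. The paper's figures describe full fine-tuning, whereas this notebook trains only LoRA adapters.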

Lastly we set up the `SFTTrainer`, or Supervised Fine Tuning trainer.

In [11]:
from trl import SFTTrainer

max_seq_length = None     # Maximum sequence length to use - when None, SFTTrainer falls back to min(tokenizer max, 1024)
packing = False           # Pack multiple short examples in the same input sequence to increase efficiency

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=packing,
    tokenizer=tokenizer,
    args=training_arguments,
)



### Run the trainer

Now we fine-tune the model, logging the training loss to Weights & Biases along the way.

In [12]:
import wandb
wandb.login(key="")  # supply your own W&B API key here; it is left blank in this notebook

[34m[1mwandb[0m: Currently logged in as: [33marubisov[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# Fine-tuned model name
new_model = "llama-2-7b-anton-gmail"

# Train model
# The LLaMa-2 tokenizer ships without a pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

[34m[1mwandb[0m: Tracking run with wandb version 0.15.10
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/content/wandb/run-20230917_051911-kc33vq9k[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mblooming-cosmos-4[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/arubisov/huggingface[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/arubisov/huggingface/runs/kc33vq9k[0m


Step,Training Loss
25,2.411
50,1.925


In [None]:
model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))
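
The decoded string includes the prompt itself, and (as the pretrained model's output showed) may contain extra `### Response:` turns. For the Chrome plug-in we only want the first reply, so here is a minimal sketch of a post-processing helper. The name `extract_first_response` is hypothetical, and the sample text is a shortened version of the earlier pretrained output.

```python
RESPONSE_MARKER = "### Response:"

def extract_first_response(decoded: str) -> str:
    """Return the text between the first response marker and the next one (if any)."""
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    if not found:
        # No marker at all: fall back to the whole text.
        return decoded.strip()
    first, _, _ = tail.partition(RESPONSE_MARKER)
    return first.strip()

decoded = (
    "\nDraft a response to the following e-mail:\n"
    "### From:\nTed Herman <hermanted@gmail.com>\n"
    "### Message:\nHey buddy, how's the job hunt going?\n"
    "### Response:\nMy job hunt is going great.\n"
    "### Response:\nThat's great news.\n"
)

print(extract_first_response(decoded))
```

If fine-tuning works as hoped, the model should emit the EOS token after one reply and the trimming becomes a safety net.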