# Transformers 8bit + LoRA Finetune
Adapted from https://colab.research.google.com/drive/1ft6wQU0BhqG5PRlwgaZJv2VukKKjU4E by AICrumb and updated to use current methods (`load_in_8bit`). This notebook finetunes low-rank adapters for GPT2-XL (1.5B) on the CodeParrot dataset from the Transformers Book ([transformersbook/codeparrot-train](https://huggingface.co/datasets/transformersbook/codeparrot-train)).

With Low-Rank Adapters (LoRA), we train only an update ΔW to each targeted weight matrix rather than the full W, as ordinary finetuning would. ΔW is parameterized by trainable rank-decomposition matrices, while the pretrained weights stay frozen.
> "LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency."

\- https://arxiv.org/abs/2106.09685
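
To make the ΔW idea concrete, here is a minimal sketch of a LoRA-wrapped linear layer in plain PyTorch. The class name `LoRALinear`, the `alpha` scaling argument, and the initialization constants are illustrative assumptions; this is not the implementation used by the `lora` package in this notebook, and GPT-2's attention layers are not `nn.Linear` modules. The 1600 below is GPT2-XL's hidden size.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha / r) * B A x, with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weight W stays fixed
        # Low-rank factors: A is (r x in), B is (out x r), so B @ A has W's shape.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # B starts at zero, so delta W = B @ A is zero before any training.
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (x @ self.lora_A.T) @ self.lora_B.T
        return self.base(x) + self.scale * delta


layer = LoRALinear(nn.Linear(1600, 1600), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} of {total}")
```

With r=16, the two factors hold 2 × 16 × 1600 parameters, roughly 2% of this one layer, which is why the saved adapter file ends up so much smaller than the base model.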

In [1]:
#@markdown Install required libraries
!pip install transformers ftfy sentencepiece bitsandbytes accelerate datasets -qq
!pip install git+https://github.com/aicrumb/lora-transformers -qq

ERROR: Command errored out with exit status 128: git clone -q https://github.com/aicrumb/lora-transformers /tmp/pip-req-build-0jyvm7mt Check the logs for full command output.


In [2]:
from lora import replace_all_matching_layers, save_lora_layers

In [3]:
# load the base model
import os
import transformers
import torch
from tqdm.auto import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if not os.path.exists("offload_gpt"):
    os.mkdir("offload_gpt")

tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-xl")
gpt = transformers.AutoModelForCausalLM.from_pretrained(
    "gpt2-xl", 
    load_in_8bit=True,
    device_map="auto", 
    offload_folder="offload_gpt", 
    low_cpu_mem_usage=True,
)

# test generation
prompt = tokenizer("A cat sat on a mat", return_tensors='pt')
prompt = {key: value.to(device) for key, value in prompt.items()}
with torch.no_grad():
    out = gpt.generate(**prompt, min_length=64, max_length=64, do_sample=True)
tokenizer.decode(out[0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"A cat sat on a mat covered in feces. It had been left there to defecate, and he had walked on it several times before. The white cat's stomach was as big as a person's thigh, and its paws were as fat as its head, hersing blood intermittently. In the middle of"

In [4]:
# load dataset and optimizer, add LoRAs
from datasets import load_dataset
from bitsandbytes.optim import Adam8bit

matches = ["attn"] # all the types of modules to add adapters to

replace_all_matching_layers(gpt, r=16, matches=matches)

for name, param in gpt.named_parameters():
    # train only parameters whose name contains one of the matched substrings
    param.requires_grad_(any(match in name for match in matches))

codeparrot = load_dataset("transformersbook/codeparrot-train", streaming=True)
optimizer = Adam8bit(gpt.parameters(), lr=1e-5)
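
A quick way to check what the freezing loop above leaves trainable is to count parameters by their `requires_grad` flag. The sketch below uses a toy `nn.ModuleDict` as a stand-in for `gpt` so that it runs without downloading anything; `count_parameters` is a hypothetical helper, not part of any library used here.

```python
import torch.nn as nn


def count_parameters(model: nn.Module):
    """Return (trainable, total) parameter counts for a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


# Toy stand-in for the real model: parameter names containing "attn" stay trainable.
toy = nn.ModuleDict({"attn": nn.Linear(8, 8), "mlp": nn.Linear(8, 32)})
matches = ["attn"]
for name, param in toy.named_parameters():
    param.requires_grad_(any(match in name for match in matches))

trainable, total = count_parameters(toy)
print(f"{trainable} trainable of {total} parameters")
```

Calling `count_parameters(gpt)` after the cell above should work the same way, under the assumption that the adapter weights are registered as ordinary parameters of the model.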



This training loop is just a proof of concept: it shows that even in the heaviest case, training still fits on a GPU.
Depending on your finetuning task, you'll need to add or remove some parts.
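
The two pieces of the loop most likely to carry over to other tasks are the shifted next-token loss and gradient accumulation. The sketch below isolates both on a toy model, so it runs on CPU in well under a second. The toy model, the sizes, and the `next_token_loss` helper are illustrative assumptions, not part of the notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len, accumulate = 11, 6, 4

# Toy "language model": an embedding followed by a projection back to the vocabulary.
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 8), torch.nn.Linear(8, vocab))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def next_token_loss(logits, input_ids):
    # Position t predicts token t+1: drop the last logit and the first label.
    return F.cross_entropy(logits[:, :-1, :].flatten(0, -2), input_ids[:, 1:].flatten())


steps_taken = 0
for i in range(8):
    input_ids = torch.randint(0, vocab, (1, seq_len))
    # Dividing by `accumulate` makes the summed gradient an average over micro-batches.
    loss = next_token_loss(model(input_ids), input_ids) / accumulate
    loss.backward()  # gradients add up across micro-batches
    if (i + 1) % accumulate == 0:
        optimizer.step()
        optimizer.zero_grad()
        steps_taken += 1

print("optimizer steps:", steps_taken)
```

Accumulating over 8 single-sequence micro-batches, as the notebook does, gives the gradient of an effective batch of 8 while holding only one sequence's activations in memory at a time.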

In [None]:
# train!
import torch.nn.functional as F

accumulate = 8
max_samples = 4096
ctx_length = 640

print("Training on", ctx_length*max_samples, "new tokens...")
print(max_samples//accumulate, "steps in total will be taken.")

losses = [] # for plotting later
with torch.cuda.amp.autocast():
    for i, row in enumerate(tqdm(codeparrot["train"].take(max_samples), total=max_samples)):
        if len(row["content"]) <= 1:
            continue

        batch = tokenizer(row["content"], truncation=True, max_length=ctx_length, return_tensors='pt')
        batch = {k: v.to(device) for k, v in batch.items()}

        out = gpt(**batch)

        loss = F.cross_entropy(out.logits[:, :-1, :].flatten(0, -2), batch['input_ids'][:, 1:].flatten(),
                               reduction='mean') / accumulate
        loss.backward()
        losses.append(loss.item() * accumulate)
        if (i+1) % accumulate == 0:
            optimizer.step()
            optimizer.zero_grad()
            print(f"Loss at step {(i+1)//accumulate}:", sum(losses[-accumulate:]) / accumulate)

Training on 2621440 new tokens...
512 steps in total will be taken.


  0%|          | 0/4096 [00:00<?, ?it/s]

Loss at step 1: 1.8202431052923203
Loss at step 2: 1.594724103808403
Loss at step 3: 1.7100998312234879
Loss at step 4: 1.5239116176962852
Loss at step 5: 2.1622454226017
Loss at step 6: 1.395967721939087
Loss at step 7: 1.869606301188469
Loss at step 8: 1.5405991077423096
Loss at step 9: 1.4592493548989296
Loss at step 10: 1.5880023539066315
Loss at step 11: 1.3451108038425446
Loss at step 12: 1.569413274526596
Loss at step 13: 1.414364330470562
Loss at step 14: 1.5175216495990753
Loss at step 15: 1.469494417309761
Loss at step 16: 1.65050208568573
Loss at step 17: 1.4970856755971909
Loss at step 18: 1.4708647578954697
Loss at step 19: 1.3546412885189056
Loss at step 20: 1.5448588281869888
Loss at step 21: 1.4967026561498642
Loss at step 22: 1.3583641350269318
Loss at step 23: 1.5146870911121368
Loss at step 24: 1.8288955390453339
Loss at step 25: 1.3248411118984222
Loss at step 26: 1.7258609533309937
Loss at step 27: 1.2672379463911057
Loss at step 28: 1.4730567187070847
Loss at step

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(losses)
plt.plot(
    df.ewm(alpha=0.01).mean()
)

In [None]:
# save the trained weights (very small compared to the model)
# you can share these on the Hugging Face Hub, along with the arguments to your replace_all_matching_layers call, so people can load the model properly
save_lora_layers(gpt, "lora-gpt2-xl-attn-codeparrot.pt")

In [None]:
# test generation
prompt = tokenizer("def fibonacci(digits=10):", return_tensors='pt')
prompt = {key: value.to(device) for key, value in prompt.items()}
with torch.no_grad():
    out = gpt.generate(**prompt, min_length=64, max_length=64, do_sample=True)
print(tokenizer.decode(out[0]))

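Here's how to load the saved parameters into a model once you've trained:
```python
import transformers
from lora import replace_all_matching_layers, load_lora_layers  # load_lora_layers is assumed to be exported by the same package

# any compatible model works; here, the same base model the adapters were trained on
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2-xl", load_in_8bit=True, device_map="auto")
matches = ["attn"]  # must match the values used at training time
replace_all_matching_layers(model, r=16, matches=matches)
load_lora_layers(model, "lora-gpt2-xl-attn-codeparrot.pt")
```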