Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

## Quick Start Notebook

This notebook shows how to fine-tune a Llama 2 model on a single GPU (e.g. an A10 with 24 GB) using int8 quantization and LoRA.

### Step 0: Install prerequisites and convert the checkpoint

The example uses the Hugging Face trainer and model which means that the checkpoint has to be converted from its original format into the dedicated Hugging Face format.
The conversion can be done by running the `convert_llama_weights_to_hf.py` script provided with the `transformers` package.
Given that the original checkpoint resides under `models/7B` we can install all requirements and convert the checkpoint with:

In [1]:
# %%bash
# pip install transformers datasets accelerate sentencepiece protobuf==3.20 py7zr scipy peft bitsandbytes fire torch_tb_profiler ipywidgets
# TRANSFORM=`python -c "import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')"`
# python ${TRANSFORM} --input_dir models --model_size 7B --output_dir models_hf/7B
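
The `TRANSFORM` one-liner above derives the location of the conversion script from the installed `transformers` package. The same path logic can be sketched with `pathlib`. `conversion_script_path` is a hypothetical helper introduced only for illustration; it takes the package's `__file__` as a plain string so the sketch runs without `transformers` installed. Whether the script actually exists at that location depends on the installed version.

```python
from pathlib import PurePosixPath


def conversion_script_path(package_file: str) -> str:
    """Return the expected path of convert_llama_weights_to_hf.py inside the package."""
    package_dir = PurePosixPath(package_file).parent  # drop the trailing __init__.py
    return str(package_dir / "models" / "llama" / "convert_llama_weights_to_hf.py")


# Made-up install location; in practice pass transformers.__file__
example = conversion_script_path("/site-packages/transformers/__init__.py")
print(example)
```
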

### Step 1: Load the model

Point `model_id` to the folder that contains the converted Hugging Face weights:

In [2]:
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id = "../llama/hugging_face_weights/base/7B"

tokenizer = LlamaTokenizer.from_pretrained(model_id)

model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto', torch_dtype=torch.float16)


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda113.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda113.so...


  warn(msg)
  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
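
As a rough check of why int8 loading matters on a 24 GB card, the weight memory can be estimated from the parameter count that `print_trainable_parameters()` reports in Step 4 (6,742,609,920). This is only a back-of-the-envelope sketch: it assumes every parameter is stored at the given width and ignores activations, gradients, optimizer state, and any layers that stay in higher precision.

```python
PARAMS = 6_742_609_920  # "all params" value printed in Step 4


def weight_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Approximate weight storage in GiB for a given parameter width."""
    return num_params * bytes_per_param / 2**30


int8_gib = weight_memory_gib(PARAMS, 1)  # 1 byte per parameter
fp16_gib = weight_memory_gib(PARAMS, 2)  # 2 bytes per parameter
print(f"int8 weights: {int8_gib:.2f} GiB, fp16 weights: {fp16_gib:.2f} GiB")
```
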

### Step 2: Load the preprocessed dataset

We load and preprocess the receipt dataset, which pairs OCR text extracted from invoice PDFs with the expected JSON output. The `samsum_dataset` config is imported as well but is not used here:

In [2]:
from pathlib import Path
import os
import sys
from utils.dataset_utils import get_preprocessed_dataset
from configs.datasets import samsum_dataset, receipt_dataset

train_dataset = get_preprocessed_dataset(tokenizer, receipt_dataset, 'train')


running on partition train; there are 71 samples


In [3]:
len(train_dataset)

71

### Step 3: Check base model

Run the base model on an example input:

In [3]:
eval_prompt = """
以下のテキスト一覧は、pdfの請求書ドキュメントからOCRをした結果を左上から順番に並べたものです。テキストから次の項目一覧の値をJson形式で出力してください。存在しない項目に関しては出力しないでください。
### 項目一覧
[請求時分],[消費税額(8%)],[消費税額(10%)],[ページ番号],[支払者名],[請求者FAX],[支払通貨],[請求年月],[合計請求額(税抜)],[口座名義],[口座番号],[口座の種類],[銀行支店名],[銀行名],[支払期日],[消費税額],[合計請求額(税込)],[支払者会社名],[請求者電話番号],[請求者住所],[請求者会社名],[タイトル],[請求番号],[請求日付],[請求額(8%税込)],[請求額(10%税込)],[登録番号]
### テキスト一覧
2023年 6月23日\t令和3年8月分\t前田道路株式会社 御中
下記の通り請求致します
請 求\t書\t(材料その他用)
(業 者 控)
適格請求書株式会社
住\t所
名
発行者登録番号\tT1231231231235
※支払期限\t2022/7/31
みずほ\t銀行 東京\t支店\t普通\t当座\t1234567
⑪
(取引先コード欄)
金額
¥70,200
月日\t品\t名\t納入場所\t工事 №.\t数量\t単位\t単価\t金\t額\t担当
6/23\t*品名1\t三田倉庫\t001\t1.0\t個\t65,000\t65,000
小\t計\t¥65,000
10%
消費税計\t消費税率\t8%\t¥5,200
請求\t金\t額\t¥70,200
(注)1.毎月末日締切で、翌月2日迄に必着するよう提出して下さい。
2.提出用のシートを2枚印刷して、提出してください。
3.取引先コード欄に貴社コードのゴム印を押印または、貴社コードを入力してください。

### Json Output:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=1000)[0], skip_special_tokens=True))


以下のテキスト一覧は、pdfの請求書ドキュメントからOCRをした結果を左上から順番に並べたものです。テキストから次の項目一覧の値をJson形式で出力してください。存在しない項目に関しては出力しないでください。
### 項目一覧
### テキスト一覧
2023年 6月23日	令和3年8月分	前田道路株式会社 御中
下記の通り請求致します
請 求	書	(材料その他用)
(業 者 控)
適格請求書株式会社
住	所
名
発行者登録番号	T1231231231235
※支払期限	2022/7/31
みずほ	銀行 東京	支店	普通	当座	1234567
⑪
(取引先コード欄)
金額
¥70,200
月日	品	名	納入場所	工事 №.	数量	単位	単価	金	額	担当
6/23	*品名1	三田倉庫	001	1.0	個	65,000	65,000
小	計	¥65,000
10%
消費税計	消費税率	8%	¥5,200
請求	金	額	¥70,200
(注)1.毎月末日締切で、翌月2日迄に必着するよう提出して下さい。
2.提出用のシートを2枚印刷して、提出してください。
3.取引先コード欄に貴社コードのゴム印を押印または、貴社コードを入力してください。

### Json Output:
[
  {
    "請求時分": "2023年 6月23日",
    "消費税額(8%)": "5200",
    "消費税額(10%)": "5200",
    "ページ番号": "1",
    "支払者名": "前田道路株式会社",
    "請求者FAX": "03-3333-3333",
    "支払通貨": "日本円",
    "請求年月": "2023年 6月",
    "合計請求額(税抜)": "70200",
    "口座名義": "前田道路株式会社",
    "口座番号": "1234567",
    "口座の種類": "普通",
    "銀行支店名": "みずほ銀行 東京支店",
    "銀行名": "みずほ銀行",
    "支払期日": "2022/7/31",
    "消費税額": "5200",
    "合計請求額(税込)": "70200",
    "支払者会社名": "前田道路株式会社",
    "請求者電話番号": "03-3333-

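We can see that the base model echoes the prompt and then emits JSON that is not reliable: it contains values that do not appear in the text (for example, the FAX number `03-3333-3333`), and the output is cut off mid-string.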
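
Because the decoded text starts with the prompt, only the part after the `### Json Output:` marker is a candidate for parsing. The sketch below shows one way to isolate and parse it; `extract_generated_json` is a hypothetical helper, not part of the notebook's utilities, and the sample strings are shortened stand-ins for real model output.

```python
import json

MARKER = "### Json Output:"


def extract_generated_json(decoded: str, marker: str = MARKER):
    """Return the parsed JSON found after the last marker, or None if it does not parse."""
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return None  # marker missing, nothing to parse
    try:
        return json.loads(tail.strip())
    except json.JSONDecodeError:
        return None  # e.g. generation was cut off mid-string


complete = 'prompt text\n### Json Output:\n{"銀行名": "みずほ銀行"}'
truncated = 'prompt text\n### Json Output:\n[{"請求者電話番号": "03-3333-'
print(extract_generated_json(complete))
print(extract_generated_json(truncated))
```
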

### Step 4: Prepare model for PEFT

Let's prepare the model for Parameter Efficient Fine Tuning (PEFT):

In [4]:
model.train()

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_int8_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules = ["q_proj", "v_proj"]
    )

    # prepare int-8 model for training
    model = prepare_model_for_int8_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)





trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
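
The trainable-parameter count printed above can be reproduced by hand. The sketch assumes the Llama 2 7B shapes (32 decoder layers, hidden size 4096, square `q_proj` and `v_proj` projections); these shapes are not stated in the notebook. Each adapted projection adds two low-rank matrices of rank `r`, and the adapter update is scaled by `lora_alpha / r`.

```python
def lora_trainable_params(hidden_size: int, num_layers: int, r: int, modules_per_layer: int) -> int:
    """Count LoRA parameters when every adapted module is a square hidden x hidden projection."""
    per_module = r * hidden_size + hidden_size * r  # A is (r x hidden), B is (hidden x r)
    return per_module * modules_per_layer * num_layers


# Assumed Llama 2 7B shapes; modules_per_layer=2 corresponds to q_proj and v_proj
trainable = lora_trainable_params(hidden_size=4096, num_layers=32, r=8, modules_per_layer=2)
all_params = 6_742_609_920  # taken from the printed output
scaling = 32 / 8            # lora_alpha / r

print(trainable, 100 * trainable / all_params, scaling)
```
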


In [None]:
from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_int8_training,
    )

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules = ["q_proj", "v_proj"]
)

# prepare int-8 model for training
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()


### Step 5: Define an optional profiler

In [5]:
from transformers import TrainerCallback
from contextlib import nullcontext
enable_profiler = False
output_dir = "tmp/llama-output-7b-r-8-useless"

config = {
    'lora_config': lora_config,
    'learning_rate': 1e-4,
    'num_train_epochs': 1,
    'gradient_accumulation_steps': 2,
    'per_device_train_batch_size': 4,
    'gradient_checkpointing': False,
}

# Set up profiler
if enable_profiler:
    wait, warmup, active, repeat = 1, 1, 2, 1
    total_steps = (wait + warmup + active) * (1 + repeat)
    schedule =  torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=repeat)
    profiler = torch.profiler.profile(
        schedule=schedule,
        on_trace_ready=torch.profiler.tensorboard_trace_handler(f"{output_dir}/logs/tensorboard"),
        record_shapes=True,
        profile_memory=True,
        with_stack=True)
    
    class ProfilerCallback(TrainerCallback):
        def __init__(self, profiler):
            self.profiler = profiler
            
        def on_step_end(self, *args, **kwargs):
            self.profiler.step()

    profiler_callback = ProfilerCallback(profiler)
else:
    profiler = nullcontext()
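
To see what the `wait`, `warmup`, `active`, and `repeat` values imply, the following pure-Python sketch labels each training step with its profiler phase. It is a simplified model of a wait/warmup/active schedule (no `skip_first`, and the last active step of a cycle is not distinguished from the others); `profiler_phase` is a hypothetical helper, not a `torch.profiler` API.

```python
def profiler_phase(step: int, wait: int, warmup: int, active: int, repeat: int) -> str:
    """Label a 0-based step under a wait/warmup/active schedule that runs `repeat` cycles."""
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "idle"  # all requested cycles are finished
    pos = step % cycle
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"


wait, warmup, active, repeat = 1, 1, 2, 1
total_steps = (wait + warmup + active) * (1 + repeat)  # same formula as in the cell above
phases = [profiler_phase(s, wait, warmup, active, repeat) for s in range(total_steps)]
print(total_steps, phases)
```

Under these assumptions, only the first cycle is profiled and the remaining steps run unprofiled, so `total_steps` leaves headroom after the single recorded cycle.
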

### Step 6: Fine tune the model

Here, we fine-tune the model for a single epoch, which takes a bit more than an hour on an A100.

In [6]:
from transformers import default_data_collator, Trainer, TrainingArguments



# Define training args
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    bf16=True,  # Train in bfloat16; requires hardware support (set to False otherwise)
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
    optim="adamw_torch_fused",
    max_steps=total_steps if enable_profiler else -1,
    **{k:v for k,v in config.items() if k != 'lora_config'}
)

with profiler:
    # Create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=default_data_collator,
        callbacks=[profiler_callback] if enable_profiler else [],
    )

    # Start training
    trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
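
For orientation, the batch settings in `config` combine with the 71 training samples as sketched below. The single-GPU setting is an assumption, and how the Trainer handles the final partial accumulation group depends on the `transformers` version, so only the effective batch size and the number of mini-batches are computed here.

```python
import math

num_samples = 71                  # size of the train partition printed in Step 2
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_devices = 1                   # assumption: a single GPU

# Samples that contribute to one optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Mini-batches per epoch when the last, smaller batch is kept
batches_per_epoch = math.ceil(num_samples / (per_device_train_batch_size * num_devices))

print(effective_batch, batches_per_epoch)
```
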


### Step 7: Save the model checkpoint

In [7]:
model.save_pretrained(output_dir)

## Load Dual Model

In [4]:
import torch
from peft import PeftModel    
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

import time
model_name = "../llama/hugging_face_weights/base/7B"
adapters_name_1 = "tmp/llama-output-7b-r-8"
adapters_name_2 = "tmp/llama-output-7b-r-8-useless"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)
model = PeftModel.from_pretrained(model, adapters_name_1, adapter_name="main")
model.load_adapter(adapters_name_2, adapter_name="sub")

# model = model.merge_and_unload()
tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1

stop_token_ids = [0]

model.eval()
with torch.no_grad():
    start_time = time.time()
    for _ in range(5):
        tokenizer.decode(model.generate(**model_input, max_new_tokens=1000)[0], skip_special_tokens=True)
    end_time = time.time()
    print(f"Inference time: {end_time - start_time} seconds")

# Switch to the second adapter ("sub") and time inference again

start = time.time()
model.set_adapter("sub")
print("Change Adapter: ", time.time() - start)
model.eval()
with torch.no_grad():
    start_time = time.time()
    for _ in range(5):
        tokenizer.decode(model.generate(**model_input, max_new_tokens=1000)[0], skip_special_tokens=True)
    end_time = time.time()
    print(f"Inference time: {end_time - start_time} seconds")

# Merge the active adapter into the base weights and drop the PEFT wrapper
start = time.time()
model = model.merge_and_unload()
print("Merge and Unload: ", time.time() - start)
model.eval()
with torch.no_grad():
    start_time = time.time()
    for _ in range(5):
        tokenizer.decode(model.generate(**model_input, max_new_tokens=1000)[0], skip_special_tokens=True)
    end_time = time.time()
    print(f"Inference time: {end_time - start_time} seconds")



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Inference time: 114.71072840690613 seconds
Change Adapter:  0.003210306167602539
Inference time: 134.60648274421692 seconds
Merge and Unload:  6.13152813911438
Inference time: 126.51704168319702 seconds
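
Each timing above covers five `generate()` calls, so dividing by five gives a per-call figure that is easier to compare. The sketch only restates the printed numbers; it is not a controlled benchmark, since the number of generated tokens may differ between configurations.

```python
RUNS = 5  # each timed loop above calls generate() five times

totals = {
    "adapter main": 114.71072840690613,
    "adapter sub": 134.60648274421692,
    "merged": 126.51704168319702,
}

per_call = {name: total / RUNS for name, total in totals.items()}
for name, seconds in per_call.items():
    print(f"{name}: {seconds:.2f} s per generate() call")
```
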


### Step 8: Try the fine-tuned model

Run the fine-tuned model on the same example again to see the learning progress:

In [7]:
# Rebuild the input in this session; using `model_input` before it is defined raises a NameError.
# Assumes `eval_prompt` from Step 3 is still defined.
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=1000)[0], skip_special_tokens=True))


## Load fine-tuned model

In [15]:
import torch
from peft import PeftModel    
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

setting = 1
if setting == 0:
    model_name = "../llama/hugging_face_weights/base/13B"
    adapters_name = "tmp/llama-output-13b-r-16"
else:
    model_name = "../llama/hugging_face_weights/base/7B"
    adapters_name = "tmp/llama-output-7b-r-8"
print(f"Starting to load the model {model_name} into memory")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)
model = PeftModel.from_pretrained(model, adapters_name)
model = model.merge_and_unload()
tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1

stop_token_ids = [0]

print(f"Successfully loaded the model {model_name} into memory")


Starting to load the model ../llama/hugging_face_weights/base/7B into memory


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Successfully loaded the model ../llama/hugging_face_weights/base/7B into memory
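
Conceptually, merging a LoRA adapter folds the low-rank update into the base weight, W_merged = W + (lora_alpha / r) * B A, so the merged layer gives the same output as the base layer plus the adapter path. The toy example below checks this with plain Python lists; the matrices and sizes are made up for illustration, and with real bf16 weights small numerical differences are to be expected.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[s * x for x in row] for row in a]


# Toy sizes: d_out = 2, d_in = 3, rank r = 1 (all values are illustrative)
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen base weight, shape (2, 3)
A = [[0.5, -1.0, 2.0]]                  # LoRA down-projection, shape (r, d_in)
B = [[1.0], [3.0]]                      # LoRA up-projection, shape (d_out, r)
r, lora_alpha = 1, 4
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0]]               # input as a column vector

# Adapter kept separate: W x + scaling * B (A x)
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged into the weight: (W + scaling * B A) x
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged, merged)
```
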


In [None]:
from configs.datasets import receipt_dataset

In [None]:
model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=1000)[0], skip_special_tokens=True))


In [4]:
import json
import time

import pandas as pd
import torch

from utils.dataset_utils import get_preprocessed_dataset
from configs.datasets import receipt_dataset

# Prompt template for the receipt dataset; note that this name shadows the built-in format()
from ft_datasets.receipt_dataset import format


def evaluate(model, tokenizer, dataset_config):
    """Generate on the 'val' split and compare each predicted JSON field with the ground truth."""
    df_columns = ["file_name", "key", "gt", "prediction", "correct"]
    rows = []       # one dict per (sample, key) pair
    time_list = []  # per-sample generation time; kept outside the loop so it accumulates
    dataset = get_preprocessed_dataset(tokenizer, dataset_config, 'val')
    model.eval()
    print("inside evaluate")
    print("dataset length: ", len(dataset))
    for i in range(len(dataset)):
        print(i)
        ann = dataset.ann[i]
        correct = ann["output"]
        name = ann["fn"]
        eval_prompt = format.format_map(ann)

        with torch.no_grad():
            model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
            # measure how long the prediction takes
            start = time.time()
            out = tokenizer.decode(model.generate(**model_input, max_new_tokens=1000)[0], skip_special_tokens=True)
            end = time.time()
            time_list.append(end - start)

        # The decoded text starts with the prompt, so parse only what follows the output marker
        answer = out.split("### Json Output:")[-1].strip()
        try:
            json_output = json.loads(answer)
            # json.loads (not json.load) parses a string; dicts are passed through unchanged
            json_correct = json.loads(correct) if isinstance(correct, str) else correct
        except json.JSONDecodeError:
            print("failed to load json on sample: ", name)
            print(out)
            continue
        if not (isinstance(json_output, dict) and isinstance(json_correct, dict)):
            print("unexpected json structure on sample: ", name)
            continue

        for key, prediction in json_output.items():
            gt = json_correct.get(key, None)
            rows.append({"file_name": name, "key": key, "gt": gt,
                         "prediction": prediction, "correct": prediction == gt})

    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    eval_df = pd.DataFrame(rows, columns=df_columns)

    mean_time = sum(time_list) / len(time_list)
    print("Average prediction time: ", mean_time)
    print("Variance of prediction time: ", sum((x - mean_time) ** 2 for x in time_list) / len(time_list))

    return eval_df
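
The per-field comparison at the heart of `evaluate` can be tried in isolation, without a model or GPU. The sketch below uses hypothetical helpers (`compare_fields`, `accuracy`) and made-up field values. Like the notebook code, it iterates over the predicted keys only, so fields that exist in the ground truth but are missing from the prediction are not penalized.

```python
def compare_fields(file_name, predicted, ground_truth):
    """Build one result row per predicted key, marking exact matches as correct."""
    rows = []
    for key, prediction in predicted.items():
        gt = ground_truth.get(key)
        rows.append({"file_name": file_name, "key": key, "gt": gt,
                     "prediction": prediction, "correct": prediction == gt})
    return rows


def accuracy(rows):
    """Fraction of rows whose prediction matched the ground truth."""
    return sum(r["correct"] for r in rows) / len(rows) if rows else 0.0


rows = compare_fields(
    "example.pdf",  # made-up file name
    predicted={"請求番号": "0000000001", "支払通貨": "¥"},
    ground_truth={"請求番号": "0000000001", "支払通貨": "JPY"},
)
print(accuracy(rows))
```
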

In [5]:
eval_df = evaluate(model, tokenizer, receipt_dataset) 

running on partition val; there are 40 samples
inside evaluate
dataset length:  40
0
failed to load json on sample:  202307_nttd_請求書_テンプレート-003_郵送.pdf-1
以下のテキスト一覧は、pdfの請求書ドキュメントからOCRをした結果を左上から順番に並べたものです。テキストから次の項目一覧の値をJson形式で出力してください。存在しない項目に関しては出力しないでください。### テキスト一覧
〒	650-0031
兵庫県神戸市中央区東町○○○-〇
請求先株式会社	御中
御請求書
請求番号:	0000000001
請求日:	2021年12月20日
利用月:	株式会社請求元	2021年12月分
登録番号: T1223334444555
件名:テスト請求書
TEL
FAX
Mail
担当者
合計金額	¥1,000-	(税込)
日付	項目名	数量	単価	税率	金額
12月10日	小麦粉	1,000	8%	1,000
小計	¥1,000
消費税(8%)	¥80
消費税(10%)	¥0
合計	¥1,000
税率別内訳 税抜金額
10%対象	¥0
軽減8%対象	¥1,000
備考:
お支払い条件:
第1振込先
振込先	銀行名	みずほ銀行(0001)	支店名	東大阪支店	(484)
科目	普通	口座番号	1234567
口座名義	テキカクショ(カ
第2振込先
振込先	銀行名	支店名
科目	口座番号
口座名義
支払期限:2024/1/15


 ### Json Output: 
 登録番号 T1223334444555
請求額(8%税込) 1,000
請求額(10%税込) 0
タイトル請求書[登録番号 T1223334444555]
請求番号 0000000001
請求日付 2021年12月20日
請求者会社名:株式会社請求元
請求者住所:〒650-0031 兵庫県神戸市中央区東町○○○-〇
請求者電話番号:XXXX-XXXX-XXXX
支払者会社名:請求先株式会社
請求年月:2021年12月分
合計請求額(税込) 1,000
支払通貨:¥
消費税額(10%) 0
消費税額(8%) 80
合計請求額(税抜) 1,000
消費税額(10%) 0

In [7]:
model, tokenizer, dataset_config = model, tokenizer, receipt_dataset
df_columns = ["file_name", "key", "gt", "prediction", "correct"]
eval_df = pd.DataFrame(columns=df_columns)
dataset = get_preprocessed_dataset(tokenizer, dataset_config, 'val')
model.eval()
print("inside evaluate")
print("dataset length: ", len(dataset))

running on partition val; there are 40 samples
inside evaluate
dataset length:  40


In [8]:
i = 0 
print(i)
sample_df = pd.DataFrame(columns=df_columns)
ann = dataset.ann[i]
correct = ann["output"]
name = ann["fn"]

sample_df["file_name"] = name
eval_prompt = format.format_map(ann)
        

0


In [12]:
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
model.eval()
with torch.no_grad():
    # measure how long prediction is taking
    start = time.time()
    out = tokenizer.decode(model.generate(**model_input, max_new_tokens=1000)[0], skip_special_tokens=True)
    end = time.time()


In [13]:
out

'以下のテキスト一覧は、pdfの請求書ドキュメントからOCRをした結果を左上から順番に並べたものです。テキストから次の項目一覧の値をJson形式で出力してください。存在しない項目に関しては出力しないでください。### テキスト一覧\n〒\t650-0031\n兵庫県神戸市中央区東町○○○-〇\n請求先株式会社\t御中\n御請求書\n請求番号:\t0000000001\n請求日:\t2021年12月20日\n利用月:\t株式会社請求元\t2021年12月分\n登録番号: T1223334444555\n件名:テスト請求書\nTEL\nFAX\nMail\n担当者\n合計金額\t¥1,000-\t(税込)\n日付\t項目名\t数量\t単価\t税率\t金額\n12月10日\t小麦粉\t1,000\t8%\t1,000\n小計\t¥1,000\n消費税(8%)\t¥80\n消費税(10%)\t¥0\n合計\t¥1,000\n税率別内訳 税抜金額\n10%対象\t¥0\n軽減8%対象\t¥1,000\n備考:\nお支払い条件:\n第1振込先\n振込先\t銀行名\tみずほ銀行(0001)\t支店名\t東大阪支店\t(484)\n科目\t普通\t口座番号\t1234567\n口座名義\tテキカクショ(カ\n第2振込先\n振込先\t銀行名\t支店名\n科目\t口座番号\n口座名義\n支払期限:2024/1/15\n\n\n ### Json Output: \n 登録番号 T1223334444555\n請求額(8%税込) 1,000\n請求額(10%税込) 0\nタイトル請求書[登録番号 T1223334444555]\n請求番号 0000000001\n請求日付 2021年12月20日\n請求者会社名:株式会社請求元\n請求者住所:〒650-0031 兵庫県神戸市中央区東町○○○-〇\n請求者電話番号:XXXX-XXXX-XXXX\n支払者会社名:請求先株式会社\n請求年月:2021年12月分\n合計請求額(税込) 1,000\n支払通貨:¥\n消費税額(10%) 0\n消費税額(8%) 80\n合計請求額(税抜) 1,000\n消費税額(10%) 0\n消費税額(8%) 1,000\n支払期日 2024/1/15\n口座の種類支払者名義カカ(カ\n口座番号 1234567\n
