# Project: Fine-tuning a Llama 2 model

This notebook fine-tunes a Llama 2 model from the Hugging Face Hub, using the Christian New Testament as the training dataset.

First, create the environment.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/projects/llama/

Mounted at /content/drive
/content/drive/MyDrive/projects/llama


In [None]:
!pip install -Uqqq pip --progress-bar off
!pip install -qqq torch==2.1.0 --progress-bar off
!pip install -qqq transformers==4.32.1 --progress-bar off
!pip install -qqq datasets==2.14.4 --progress-bar off
!pip install -qqq peft==0.5.0 --progress-bar off
!pip install -qqq bitsandbytes==0.41.1 --progress-bar off
!pip install -qqq trl==0.7.1 --progress-bar off
!pip install -qqq xmltodict --progress-bar off

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
SEED_VALUE = 42
PROJECT_NAME = "RickMartel/Llama-2-7b-hf-FT-NT"

In [None]:
from huggingface_hub import notebook_login
notebook_login()


### Load the source corpus (in this case the New Testament) and create the training dataset.

The corpus is stored in an XML file. The code below loads it and converts it to a JSON document.

In [None]:
import json
import xmltodict

with open("./Bible_English_TNIV.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

json_data = json.dumps(data_dict)

with open("Bible_NIV.json", "w") as json_file:
  json_file.write(json_data)

with open("Bible_NIV.json", "r") as f:
    bible = json.load(f)

Below are examples of navigating the JSON document.

In [None]:
print( type( bible['XMLBIBLE']['BIBLEBOOK'] ) )
print( len( bible['XMLBIBLE']['BIBLEBOOK'] ) )
print(  bible['XMLBIBLE']['BIBLEBOOK'][0].keys()  )
print(  bible['XMLBIBLE']['BIBLEBOOK'][0]['CHAPTER'][0].keys()  )
print(  bible['XMLBIBLE']['BIBLEBOOK'][0]['CHAPTER'][0]['VERS'][0].keys()  )
print(  bible['XMLBIBLE']['BIBLEBOOK'][0]['CHAPTER'][0]['VERS'][0]['#text'] )

<class 'list'>
66
dict_keys(['@bnumber', '@bname', 'CHAPTER'])
dict_keys(['@cnumber', 'VERS'])
dict_keys(['@vnumber', '#text'])
In the beginning God created the heavens and the earth.


The code below extracts all verses from the corpus for the New Testament.

In [None]:
start_book = 'Matthew'
one_chpt_books = ['Philemon','2 John', '3 John', 'Jude']
start = False
chapter_lines = []

for book in bible['XMLBIBLE']['BIBLEBOOK']:
  if book['@bname'].strip() == start_book: start = True
  if not start: continue
  # One-chapter books parse to a single dict rather than a list of chapters.
  # Wrap that dict in a list so the chapter is visited exactly once
  # (iterating the dict directly would loop over its keys and repeat the verses).
  if book['@bname'].strip() in one_chpt_books:
    chapters = [ book['CHAPTER'] ]
  else:
    chapters = book['CHAPTER']
  for chapter in chapters:
    for verse in chapter['VERS']:
      chapter_lines.append( verse['#text'].strip() )
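xmltodict returns a single dict, rather than a list, when an element has only one child of a given name, which is why one-chapter books need separate handling. Below is a minimal sketch of a helper that normalizes both cases. The name `as_list` and the toy `multi` and `single` books are hypothetical and only mimic the structure printed earlier.

```python
def as_list(node):
    """Wrap a lone dict in a list so callers can always iterate."""
    return node if isinstance(node, list) else [node]

# Toy books mimicking the parsed structure (hypothetical data).
multi = {'@bname': 'Matthew', 'CHAPTER': [
    {'@cnumber': '1', 'VERS': [{'@vnumber': '1', '#text': 'a '}]},
    {'@cnumber': '2', 'VERS': [{'@vnumber': '1', '#text': 'b'}]},
]}
single = {'@bname': 'Jude', 'CHAPTER':
    {'@cnumber': '1', 'VERS': [{'@vnumber': '1', '#text': 'c'}]}}

def book_verses(book):
    """Collect the stripped text of every verse in a book."""
    return [verse['#text'].strip()
            for chapter in as_list(book['CHAPTER'])
            for verse in as_list(chapter['VERS'])]

print(book_verses(multi))
print(book_verses(single))
```

With a helper like this, the list of one-chapter book names would not need to be maintained by hand.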

Let's look at the first line of the training data.

In [None]:
chapter_lines[0]

'This is the genealogy of Jesus the Messiah the son of David, the son of Abraham:'

Below are some statistics on the verses in the training data.

In [None]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

print( f'# of lines: {len( chapter_lines )}' )
print( f'Max line length: {max( [len( word_tokenize(line) ) for line in chapter_lines] )}' )
print( f'# of words {sum( [len( word_tokenize(line) ) for line in chapter_lines] )}' )

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# of lines: 8033
Max line length: 75
# of words 207171


Let's count the characters that appear at the end of each verse.

In [None]:
def get_line_ends(_lines):
  end_cnts = {}
  for idx, line in enumerate( _lines ):
    ch = line[-1]
    end_cnts[ ch ] = end_cnts.get(ch, 0) + 1
  for key, val in end_cnts.items():
    print(f'{key}={val}|', end="")
get_line_ends(chapter_lines)

:=74|,=535|.=5137|"=1391|m=10|t=13|a=1|?=262|�=74|'=196|;=82|!=87|s=24|h=6|d=17|]=39|)=34|l=5|b=3|D=2|r=8|n=5|e=17|g=1|y=5|c=1|u=3|p=1|
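The same tally can be written more compactly with `collections.Counter`. This is only an alternative sketch: `line_end_counts` and `toy_lines` are made-up names and data, not part of the notebook.

```python
from collections import Counter

def line_end_counts(lines):
    """Count the final character of each non-empty line."""
    return Counter(line[-1] for line in lines if line)

toy_lines = ["In the beginning.", "He said,", "Who is it?", "It is done."]
counts = line_end_counts(toy_lines)

# most_common() lists the endings from most to least frequent.
print(counts.most_common())
```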

It's odd that some of these verses end in an alphabetic character or a comma. Let's combine verses so that each line ends in sentence-final punctuation. After the concatenation, the counts of line-ending characters are computed again.

In [None]:
bad_end = [',',':', "'", ";"]

def append_line(_line):
  ch = _line[-1]
  return ch in bad_end or ch.isalpha()

chapter_lines_v2 = []
new_line = ""

for line in chapter_lines:
  if append_line(line):
    new_line += ' ' + line
  else:
    new_line = new_line + ' ' + line
    chapter_lines_v2.append(new_line.strip())
    new_line = ""

print( f'# of lines: {len( chapter_lines_v2 )}' )
print( f'Max line length: {max( [len( word_tokenize(line) ) for line in chapter_lines_v2] )}' )
print( f'# of words {sum( [len( word_tokenize(line) ) for line in chapter_lines_v2] )}' )
get_line_ends(chapter_lines_v2)

# of lines: 7024
Max line length: 402
# of words 207171
.=5137|"=1391|?=262|�=74|!=87|]=39|)=34|
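To see the merge rule in isolation, here is a small sketch that applies it to a few invented verses. The function name `merge_verses` and the toy input are assumptions. Unlike the loop above, this version also keeps a trailing fragment if the last verse does not end a sentence.

```python
def merge_verses(verses, bad_end=(',', ':', "'", ';')):
    """Join consecutive verses until one ends in sentence-final punctuation."""
    merged, buffer = [], []
    for verse in verses:
        buffer.append(verse)
        last = verse[-1]
        # A verse "continues" if it ends in a letter or one of the bad_end marks.
        if not (last in bad_end or last.isalpha()):
            merged.append(' '.join(buffer))
            buffer = []
    if buffer:  # keep a trailing fragment instead of dropping it
        merged.append(' '.join(buffer))
    return merged

toy = ["This is the genealogy:",
       "Abraham was the father of Isaac,",
       "and Isaac of Jacob.",
       "The end"]
print(merge_verses(toy))
```

Merging only changes where the line breaks fall, which is why the word count stays at 207171 while the line count drops from 8033 to 7024.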

This looks much better. Presumably, complete sentences provide more context than individual verses.

In [None]:
sentences = chapter_lines_v2

In [None]:
sentences[0]

"This is the genealogy of Jesus the Messiah the son of David, the son of Abraham: Abraham was the father of Isaac, Isaac the father of Jacob, Jacob the father of Judah and his brothers, Judah the father of Perez and Zerah, whose mother was Tamar, Perez the father of Hezron, Hezron the father of Ram, Ram the father of Amminadab, Amminadab the father of Nahshon, Nahshon the father of Salmon, Salmon the father of Boaz, whose mother was Rahab, Boaz the father of Obed, whose mother was Ruth, Obed the father of Jesse, and Jesse the father of King David. David was the father of Solomon, whose mother had been Uriah's wife, Solomon the father of Rehoboam, Rehoboam the father of Abijah, Abijah the father of Asa, Asa the father of Jehoshaphat, Jehoshaphat the father of Jehoram, Jehoram the father of Uzziah, Uzziah the father of Jotham, Jotham the father of Ahaz, Ahaz the father of Hezekiah, Hezekiah the father of Manasseh, Manasseh the father of Amon, Amon the father of Josiah, and Josiah the fat

Let's take a look at the alphabet of the corpus.

In [None]:
def get_char_dict(_list):
  char_dict = {}
  for line in _list:
    for ch in line:
      char_dict[ch] = char_dict.get(ch, 0) + 1
  return char_dict

char_dict = get_char_dict(sentences)
print(f'Length: {len(char_dict.keys())}')
char_dict.keys()

Length: 74


dict_keys(['T', 'h', 'i', 's', ' ', 't', 'e', 'g', 'n', 'a', 'l', 'o', 'y', 'f', 'J', 'u', 'M', 'D', 'v', 'd', ',', 'A', 'b', 'r', 'm', ':', 'w', 'I', 'c', 'P', 'z', 'Z', 'H', 'R', 'N', 'S', 'B', 'O', 'K', '.', 'U', "'", 'j', 'p', 'k', 'x', 'E', 'q', 'L', '"', '(', 'G', ')', 'W', '?', ';', 'C', 'Y', '!', 'F', '�', '-', 'Q', '[', ']', 'V', '1', '5', '3', '4', '0', '2', '7', '6'])

Now let's combine all the sentences into one document and tokenize it. This document will be used to create training data for the Llama 2 model.

First get the tokenizer for the model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained( MODEL_NAME )
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Let's now add beginning-of-sequence and end-of-sequence tokens to each sentence.

In [None]:
print( tokenizer.bos_token )
print( tokenizer.eos_token )
print( tokenizer.padding_side )
word_cnts = [len( word_tokenize(line) ) for line in sentences]
print(f'Number of words: {sum( word_cnts )}')
print(f'Longest sentence: {max( word_cnts )}')
sentences_delimited = [tokenizer.bos_token + line +  tokenizer.eos_token + ' ' for line in sentences]
word_cnts = [len( word_tokenize(line) ) for line in sentences_delimited]
print(f'Number of words: {sum( word_cnts )}')
print(f'Longest sentence: {max( word_cnts )}')

Number of words: 207171
Longest sentence: 402
Number of words: 244198
Longest sentence: 407


Let's take a look at one of the examples.

In [None]:
sentences_delimited[1]

'<s>After the exile to Babylon: Jeconiah was the father of Shealtiel, Shealtiel the father of Zerubbabel, Zerubbabel the father of Abiud, Abiud the father of Eliakim, Eliakim the father of Azor, Azor the father of Zadok, Zadok the father of Akim, Akim the father of Eliud, Eliud the father of Eleazar, Eleazar the father of Matthan, Matthan the father of Jacob, and Jacob the father of Joseph, the husband of Mary, and Mary was the mother of Jesus who is called the Messiah.</s> '

Now let's concatenate all the sentences into one document.

In [None]:
doc = "".join(sentences_delimited)
print(f'Words in doc: {len( word_tokenize(doc) )}')
doc[2000:2400]

Words in doc: 244198


' public disgrace, he had in mind to divorce her quietly.</s> <s>But after he had considered this, an angel of the Lord appeared to him in a dream and said, "Joseph son of David, do not be afraid to take Mary home as your wife, because what is conceived in her is from the Holy Spirit.</s> <s>She will give birth to a son, and you are to give him the name Jesus, because he will save his people from t'

The code below tokenizes the document.

In [None]:
%%time
outputs = tokenizer(
    [doc],
    truncation=False,
)
print(outputs.keys())
print(f"Token count: {len( outputs['input_ids'][0] )}")
doc_tokens = outputs['input_ids'][0]  # plain list of token ids, sliced below

dict_keys(['input_ids', 'attention_mask'])
Token count: 251981
CPU times: user 699 ms, sys: 81.5 ms, total: 780 ms
Wall time: 799 ms


In [None]:
doc_decode = tokenizer.decode(doc_tokens)
doc_decode[20000:20400]

'aluable than they?</s>  <s> Can any one of you by worrying add a single hour to your life ?</s>  <s> "And why do you worry about clothes? See how the flowers of the field grow. They do not labor or spin.</s>  <s> Yet I tell you that not even Solomon in all his splendor was dressed like one of these.</s>  <s> If that is how God clothes the grass of the field, which is here today and tomorrow is thr'

Create training data by chunking the document into a configurable number of chunks of a configurable size. The code below creates 50,000 training examples of length 500 by randomly taking blocks of tokens from the tokenized document.

In [None]:
import random

random.seed(SEED_VALUE)

block_size = 500

def get_sample(_data):
    idx = random.randint(0, len(_data) - block_size)
    sample = _data[idx:idx+block_size]
    return sample

input_ids = [ get_sample(doc_tokens) for _ in range(50000) ]

print(f'Samples: {len(input_ids) }')

Samples: 50000
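The sampling step can be sketched on toy data as follows. The helper name `sample_windows` is hypothetical, and it uses a local `random.Random` instance instead of the global seed used above. The second half plugs in the token and sample counts reported in this notebook to estimate how heavily the windows overlap.

```python
import random

def sample_windows(tokens, block_size, n_samples, seed=42):
    """Draw n_samples contiguous windows of block_size tokens, with replacement."""
    rng = random.Random(seed)
    last_start = len(tokens) - block_size  # randint includes this upper bound
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [tokens[s:s + block_size] for s in starts]

toy_tokens = list(range(20))
windows = sample_windows(toy_tokens, block_size=5, n_samples=3)
print(windows)

# Coverage estimate from the counts reported in this notebook.
doc_len, block, n = 251_981, 500, 50_000
print(f"Distinct start positions: {doc_len - block + 1}")
print(f"Average samples covering each token: {n * block / doc_len:.1f}")
```

Because each token ends up in roughly 99 samples on average, the examples overlap heavily, so one pass over the dataset revisits the same text many times.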


Let's decode one of the examples.

In [None]:
tokenizer.decode( input_ids[100] )

'strong in spirit ; and he lived in the wilderness until he appeared publicly to Israel.</s>  <s> In those days Caesar Augustus issued a decree that a census should be taken of the entire Roman world.</s>  <s> (This was the first census that took place while Quirinius was governor of Syria.)</s>  <s> And everyone went to their own town to register.</s>  <s> So Joseph also went up from the town of Nazareth in Galilee to Judea, to Bethlehem the town of David, because he belonged to the house and line of David.</s>  <s> He went there to register with Mary, who was pledged to be married to him and was expecting a child.</s>  <s> While they were there, the time came for the baby to be born, and she gave birth to her firstborn, a son. She wrapped him in cloths and placed him in a manger, because there was no guest room available for them.</s>  <s> And there were shepherds living out in the fields nearby, keeping watch over their flocks at night.</s>  <s> An angel of the Lord appeared to them

Now let's create the training data by decoding the tokenized samples back into text.

In [None]:
%%time
input_text = [ tokenizer.decode(input_id) for input_id in input_ids ]

CPU times: user 1min 26s, sys: 67.9 ms, total: 1min 26s
Wall time: 1min 26s


In [None]:
input_text[0]

'\'t fall!</s>  <s> No temptation has overtaken you except what is common to us all. And God is faithful; he will not let you be tempted beyond what you can bear. But when you are tempted, he will also provide a way out so that you can endure it.</s>  <s> Therefore, my dear friends, flee from idolatry.</s>  <s> I speak to sensible people; judge for yourselves what I say.</s>  <s> Is not the cup of thanksgiving for which we give thanks a participation in the blood of Christ? And is not the bread that we break a participation in the body of Christ?</s>  <s> Because there is one loaf, we, who are many, are one body, for we all partake of the one loaf.</s>  <s> Consider the people of Israel: Do not those who eat the sacrifices participate in the altar?</s>  <s> Do I mean then that food sacrificed to an idol is anything, or that an idol is anything?</s>  <s> No, but the sacrifices of pagans are offered to demons, not to God, and I do not want you to be participants with demons.</s>  <s> You

The code below creates a Hugging Face Dataset dictionary that will be used for training and testing.

In [None]:
from datasets import Dataset, DatasetDict

dataset = Dataset.from_dict({"text": input_text})
datasets = DatasetDict({ "train": dataset}) #.shuffle().select(range(10000)) })

datasets = datasets["train"].train_test_split(train_size=0.95, shuffle=True, seed=SEED_VALUE)
datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 47500
    })
    test: Dataset({
        features: ['text'],
        num_rows: 2500
    })
})

### Llama 2 model training

The code below trains a quantized Llama 2 model using LoRA (low-rank adaptation), a PEFT (parameter-efficient fine-tuning) method.

The bitsandbytes library is used for 4-bit quantization.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
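A back-of-the-envelope estimate shows why 4-bit loading matters on a single Colab GPU. This sketch assumes Llama-2-7b has roughly 6.74 billion parameters (an assumption, not something measured here) and ignores the small overhead of quantization constants. The helper `weight_gb` is a made-up name.

```python
# Assumption: approximate parameter count of Llama-2-7b.
n_params = 6.74e9

def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weight_gb(n_params, 16)
nf4_gb = weight_gb(n_params, 4)
print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"4-bit weights:  {nf4_gb:.2f} GB")
```

The 4-bit figure is a quarter of the 16-bit one, which leaves room on the GPU for activations, LoRA weights and optimizer state.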


Get the Hugging Face hub Llama 2 model.

In [None]:
%%time
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME,
                                             use_safetensors=True,
                                             quantization_config=bnb_config,
                                             trust_remote_code=True,
                                             device_map="auto")

CPU times: user 16.9 s, sys: 17.4 s, total: 34.3 s
Wall time: 50 s


In [None]:
model.config.use_cache = False

Configure the LoRA layers.

In [None]:
lora_alpha = 32
lora_dropout = 0.05
lora_r = 16
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM")
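To get a feel for how small the trainable part is, the sketch below counts LoRA parameters for this configuration. It assumes a hidden size of 4096, 32 decoder layers, and adapters on the `q_proj` and `v_proj` projections only. These are assumptions about the model and about PEFT's default target modules, since `target_modules` is not set above.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * d_in + d_out * r

# Assumed model shape and adapter placement (not read from the model here).
hidden, n_layers, r, alpha = 4096, 32, 16, 32
targets_per_layer = 2  # q_proj and v_proj

trainable = n_layers * targets_per_layer * lora_param_count(hidden, hidden, r)
scaling = alpha / r  # the low-rank update is multiplied by alpha / r

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapters hold about 8.4 million parameters, a small fraction of the billions of frozen base weights.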


Create a data collator that pads each batch and builds the `labels` for causal language modeling from the input ids.

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

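# The collator expects tokenized examples, so tokenize the raw text first.
samples = [tokenizer(datasets["train"][i]["text"]) for i in range(5)]
out = data_collator(samples)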
print( out.keys() )
for key in out:
    print(f"{key} shape: {out[key].shape}")

Fine-tune the model.

In [None]:
batch_size=30

train_args = TrainingArguments(
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=batch_size,
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,
    output_dir=MODEL_NAME,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=SEED_VALUE,
    push_to_hub=False,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=train_args,
    data_collator=data_collator, #NEW
)

trainer.train()




`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 11 | 1.1061 | 1.157853 |
| 22 | 0.6232 | 0.640427 |
| 33 | 0.5858 | 0.574807 |
| 44 | 0.57 | 0.542559 |


TrainOutput(global_step=52, training_loss=0.9147453159093857, metrics={'train_runtime': 10366.5307, 'train_samples_per_second': 4.582, 'train_steps_per_second': 0.005, 'total_flos': 4.884608316513485e+17, 'train_loss': 0.9147453159093857, 'epoch': 0.98})
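The step counts in this output follow from the batch settings. Here is a rough sketch of the arithmetic, assuming a single GPU, and assuming the trainer rounds the number of optimizer steps per epoch down and rounds a fractional `eval_steps` up. Both rounding rules are assumptions that happen to match the numbers logged above.

```python
import math

train_rows = 47_500
per_device_batch = 30
grad_accum = 30
eval_ratio = 0.2  # eval_steps given as a fraction of total steps

batches_per_epoch = math.ceil(train_rows / per_device_batch)
optimizer_steps = batches_per_epoch // grad_accum
eval_every = math.ceil(optimizer_steps * eval_ratio)

print(f"Sequences per optimizer step: {per_device_batch * grad_accum}")
print(f"Batches per epoch: {batches_per_epoch}")
print(f"Optimizer steps: {optimizer_steps}")
print(f"Evaluate every {eval_every} steps")
```

Setting `gradient_accumulation_steps` equal to the batch size gives 900 sequences per update, which is why a full epoch over 47,500 examples takes only 52 optimizer steps.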

Calculate the perplexity of the model. This captures how certain the model is when predicting the next token. Lower values are better.
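Perplexity is the exponential of the mean cross-entropy loss, that is, of the average negative log-probability the model assigns to each correct next token. The sketch below works this out for made-up probabilities. The function name `perplexity` and the inputs are illustrative only.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that gives the right token probability 0.5, then 0.25.
print(perplexity([0.5, 0.25]))

# A model that is always certain and always right has perplexity 1.
print(perplexity([1.0, 1.0]))
```

A perplexity of 1 is the floor, so values close to 1 mean the model is rarely surprised by the evaluation text.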


In [None]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 1.71


### Generating text from the model

The code below generates text from the fine-tuned model by sampling with top-k and top-p filtering.

In [None]:
import torch

model.eval()

prompt = tokenizer.bos_token + "Jesus said to them"
generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
generated = generated.to(DEVICE)

print(generated)

sample_outputs = model.generate(
                                generated,
                                do_sample=True,
                                top_k=10,
                                top_p=0.95,
                                max_length=100,
                                num_return_sequences=1,
                                pad_token_id=tokenizer.eos_token_id,
                                )

for i, sample_output in enumerate(sample_outputs):
  print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

tensor([[    1,     1, 13825,  1497,   304,   963]], device='cuda:0')
0: Jesus said to them, "If you were Abraham's children, you would do what Abraham did, but now you are trying to kill me, a man who has told you the truth that I heard from God. everybody who does evil hates the light, and will not come into its light for fear that their deeds will be exposed. But whoever lives by the truth comes to the light, so that it may be seen plainly that what they have done has been done
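The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. Below is a simplified pure-Python sketch of the idea on an invented next-token distribution. It is not the library's implementation, and `top_k_top_p` and `toy` are hypothetical names.

```python
def top_k_top_p(probs, top_k, top_p):
    """Return the renormalized distribution left after top-k, then top-p filtering."""
    # Rank tokens by probability, highest first, and keep the top_k best.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

toy = {"the": 0.5, "a": 0.25, "his": 0.15, "that": 0.07, "zebra": 0.03}
filtered = top_k_top_p(toy, top_k=4, top_p=0.9)
print(filtered)
```

In this toy case the unlikely tokens are removed before sampling, which keeps the generated text on topic while still allowing some variety.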




The generated text reads like the New Testament, as expected.

The model is saved to the hub under my account.

In [None]:
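# The first argument is the commit message; the repo name is derived from output_dir.
trainer.push_to_hub(commit_message='RickMartel/llama2')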

CommitInfo(commit_url='https://huggingface.co/RickMartel/Llama-2-7b-hf/commit/0c8f63ae8f8d92b94bf7bd45d5c6459349773b65', commit_message='RickMartel/llama2', commit_description='', oid='0c8f63ae8f8d92b94bf7bd45d5c6459349773b65', pr_url=None, pr_revision=None, pr_num=None)

The code below is used to load the model from the hub.

In [None]:
model = AutoModelForCausalLM.from_pretrained('RickMartel/Llama-2-7b-hf',
                                             use_safetensors=True,
                                             quantization_config=bnb_config,
                                             trust_remote_code=True,
                                             device_map="auto")

No errors occurred.

# This is the completion of this notebook.