
# Project: Train GPT2 LLM using Hugging Face models and APIs.

This notebook fine-tunes the pre-trained Hugging Face GPT-2 model `distilgpt2` using the Bible as the corpus (only the books from Matthew onward are used). The goal is to improve the model's completions for prompts such as "Jesus was born in".

For training, the corpus is split into examples of 500 tokens each.

The training was done using Colab on an A100. GPU RAM usage peaked at around 28 GB.

### Set up the environment.

Mount the project directory stored in Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/projects/LLM/HF_GPT2_Training/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/projects/LLM/HF_GPT2_Training


Install the packages.

In [2]:
%pip -qqq install transformers[torch] datasets huggingface-hub --progress-bar off
%pip -qqq install langchain --progress-bar off
%pip -qqq install unstructured --progress-bar off
%pip -qqq install xmltodict --progress-bar off
%pip -qqq install accelerate --progress-bar off

  Preparing metadata (setup.py) ... done
  Building wheel for langdetect (setup.py) ... done


Set some constants used in the notebook.

In [2]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
GPT2_MODEL = 'distilgpt2'
PROJECT_MODEL = 'efarish/GPT2_FT_By_NT_RAND_v11'
SEED_VALUE = 42

### Loading the corpus and pre-processing

The code below converts an XML representation of the Bible into a JSON document, then loads that document back into a Python dictionary.

In [3]:
import json
import xmltodict

with open("./Bible_English_TNIV.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

json_data = json.dumps(data_dict)

with open("Bible_NIV.json", "w") as json_file:
  json_file.write(json_data)

with open("Bible_NIV.json", "r") as f:
    bible = json.load(f)

Below, the structure of the JSON document is inspected.

In [4]:
print( type( bible['XMLBIBLE']['BIBLEBOOK'] ) )
print( len( bible['XMLBIBLE']['BIBLEBOOK'] ) )
print(  bible['XMLBIBLE']['BIBLEBOOK'][0].keys()  )
print(  bible['XMLBIBLE']['BIBLEBOOK'][0]['CHAPTER'][0].keys()  )
print(  bible['XMLBIBLE']['BIBLEBOOK'][0]['CHAPTER'][0]['VERS'][0].keys()  )
print(  bible['XMLBIBLE']['BIBLEBOOK'][0]['CHAPTER'][0]['VERS'][0]['#text'] )

<class 'list'>
66
dict_keys(['@bnumber', '@bname', 'CHAPTER'])
dict_keys(['@cnumber', 'VERS'])
dict_keys(['@vnumber', '#text'])
In the beginning God created the heavens and the earth.


The code below extracts each verse from the JSON document, starting at the book of Matthew.

In [5]:
start_book = 'Matthew'
one_chpt_books = ['Philemon','2 John', '3 John', 'Jude']
start = False
chapter_lines = []

for book in bible['XMLBIBLE']['BIBLEBOOK']:
  if book['@bname'].strip() == start_book: start = True
  if not start: continue
  for chapter in book['CHAPTER']:
    if book['@bname'] in one_chpt_books:
      for verse in book['CHAPTER']['VERS']:
        chapter_lines.append( verse['#text'].strip() )
    else:
      for verse in chapter['VERS']:
        chapter_lines.append( verse['#text'].strip() )
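`xmltodict` returns a dict rather than a list when an element has only one child of a given name, which appears to be why the one-chapter books are special-cased above. Note that iterating over a dict yields its keys, so the loop above may visit a one-chapter book's verses once per key. Below is a minimal sketch of an alternative that normalizes every node to a list before iterating. The names `as_list` and `extract_verses` and the toy data are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
def as_list(node):
    """Return node unchanged if it is a list, otherwise wrap it in a list."""
    return node if isinstance(node, list) else [node]

def extract_verses(books, start_book):
    """Collect verse text from start_book onward, one entry per verse."""
    lines, started = [], False
    for book in books:
        if book['@bname'].strip() == start_book:
            started = True
        if not started:
            continue
        for chapter in as_list(book['CHAPTER']):
            for verse in as_list(chapter['VERS']):
                lines.append(verse['#text'].strip())
    return lines

# Hypothetical miniature of the parsed structure (not the real data).
toy_books = [
    {'@bname': 'Malachi', 'CHAPTER': [{'VERS': [{'#text': 'skipped'}]}]},
    {'@bname': 'Matthew', 'CHAPTER': [{'VERS': [{'#text': ' a '}, {'#text': 'b'}]}]},
    # A one-chapter book: CHAPTER is a dict, not a list.
    {'@bname': 'Jude', 'CHAPTER': {'@cnumber': '1',
                                   'VERS': [{'#text': 'c'}, {'#text': 'd'}]}},
]
toy_lines = extract_verses(toy_books, 'Matthew')
print(toy_lines)
```

With this normalization, the list of one-chapter book names is no longer needed, and each verse is collected exactly once.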

Let's take a look at the first line of the dataset.

In [6]:
chapter_lines[0]

'This is the genealogy of Jesus the Messiah the son of David, the son of Abraham:'

Let's get some statistics on the content size: the number of verses, the length of the longest verse in words, and the total word count.

In [7]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

print( len( chapter_lines ) )
print( max( [len( word_tokenize(line) ) for line in chapter_lines] ) )
print( sum( [len( word_tokenize(line) ) for line in chapter_lines] ) )

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


8033
75
207171


The code below examines the last character of each verse. The output is a count of each character that terminates a verse.

In [8]:
def get_line_ends(_lines):
  end_cnts = {}
  for idx, line in enumerate( _lines ):
    ch = line[-1]
    end_cnts[ ch ] = end_cnts.get(ch, 0) + 1
  for key, val in end_cnts.items():
    print(f'{key}={val}|', end="")
get_line_ends(chapter_lines)

:=74|,=535|.=5137|"=1391|m=10|t=13|a=1|?=262|�=74|'=196|;=82|!=87|s=24|h=6|d=17|]=39|)=34|l=5|b=3|D=2|r=8|n=5|e=17|g=1|y=5|c=1|u=3|p=1|

Many of the verses end with characters that indicate an incomplete sentence. To deal with this, the code below concatenates a verse with the verses that follow it whenever it ends in a comma, colon, single quote, semicolon, or letter. This gives each example more context.

In [9]:
bad_end = [',',':', "'", ";"]

def append_line(_line):
  ch = _line[-1]
  return ch in bad_end or ch.isalpha()

chapter_lines_v2 = []
new_line = ""

for line in chapter_lines:
  if append_line(line):
    new_line += ' ' + line
  else:
    new_line = new_line + ' ' + line
    chapter_lines_v2.append(new_line.strip())
    new_line = ""
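The merging step can also be written as a small function, which makes it easier to test on toy input. This is a sketch of the same rule, not a replacement for the cell above; `needs_continuation`, `merge_verses`, and the toy verses are hypothetical names and data. Unlike the loop above, it also keeps a trailing fragment if the last verse ends with a continuation character.

```python
BAD_ENDINGS = {',', ':', "'", ';'}

def needs_continuation(line):
    """True if the line ends in a character that suggests an unfinished sentence."""
    last = line[-1]
    return last in BAD_ENDINGS or last.isalpha()

def merge_verses(lines):
    """Join each incomplete verse with the verses that follow it."""
    merged, buffer = [], []
    for line in lines:
        buffer.append(line)
        if not needs_continuation(line):
            merged.append(' '.join(buffer))
            buffer = []
    if buffer:
        # Keep a trailing fragment instead of silently dropping it.
        merged.append(' '.join(buffer))
    return merged

# Hypothetical toy verses.
toy = ['In the beginning,', 'there was a word.', 'It ended here', 'and then']
toy_merged = merge_verses(toy)
print(toy_merged)
```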

Let's review the statistics again, along with the characters that now terminate the verses.

In [10]:
print( len( chapter_lines_v2 ) )
print( max( [len( word_tokenize(line) ) for line in chapter_lines_v2] ) )
print( sum( [len( word_tokenize(line) ) for line in chapter_lines_v2] ) )
get_line_ends(chapter_lines_v2)

7024
402
207171
.=5137|"=1391|?=262|�=74|!=87|]=39|)=34|

This is more reasonable. Let's rename the dataset and look at the first example.

In [11]:
sentences = chapter_lines_v2

In [12]:
sentences[0]

"This is the genealogy of Jesus the Messiah the son of David, the son of Abraham: Abraham was the father of Isaac, Isaac the father of Jacob, Jacob the father of Judah and his brothers, Judah the father of Perez and Zerah, whose mother was Tamar, Perez the father of Hezron, Hezron the father of Ram, Ram the father of Amminadab, Amminadab the father of Nahshon, Nahshon the father of Salmon, Salmon the father of Boaz, whose mother was Rahab, Boaz the father of Obed, whose mother was Ruth, Obed the father of Jesse, and Jesse the father of King David. David was the father of Solomon, whose mother had been Uriah's wife, Solomon the father of Rehoboam, Rehoboam the father of Abijah, Abijah the father of Asa, Asa the father of Jehoshaphat, Jehoshaphat the father of Jehoram, Jehoram the father of Uzziah, Uzziah the father of Jotham, Jotham the father of Ahaz, Ahaz the father of Hezekiah, Hezekiah the father of Manasseh, Manasseh the father of Amon, Amon the father of Josiah, and Josiah the fat

Let's take a look at the unique characters that appear in the text.

In [13]:
def get_char_dict(_list):
  char_dict = {}
  for line in _list:
    for ch in line:
      char_dict[ch] = char_dict.get(ch, 0) + 1
  return char_dict

char_dict = get_char_dict(sentences)
print(f'Length: {len(char_dict.keys())}')
char_dict.keys()

Length: 74


dict_keys(['T', 'h', 'i', 's', ' ', 't', 'e', 'g', 'n', 'a', 'l', 'o', 'y', 'f', 'J', 'u', 'M', 'D', 'v', 'd', ',', 'A', 'b', 'r', 'm', ':', 'w', 'I', 'c', 'P', 'z', 'Z', 'H', 'R', 'N', 'S', 'B', 'O', 'K', '.', 'U', "'", 'j', 'p', 'k', 'x', 'E', 'q', 'L', '"', '(', 'G', ')', 'W', '?', ';', 'C', 'Y', '!', 'F', '�', '-', 'Q', '[', ']', 'V', '1', '5', '3', '4', '0', '2', '7', '6'])

The \uFFFD character is the Unicode replacement character. It indicates that a character in the source file could not be decoded. I couldn't figure out how to fix this, and I decided not to remove it.

In [14]:
print('\uFFFD')
cnt = char_dict['\uFFFD']
print(f"\\uFFFD count: {cnt}")

�
\uFFFD count: 438
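One common way replacement characters arise is an encoding mismatch: bytes written in a legacy encoding are decoded as UTF-8. The sketch below illustrates the mechanism with a made-up string. Whether the Bible XML file is actually encoded this way is an assumption that has not been verified here.

```python
# Hypothetical sample: an em dash written with the Windows-1252 encoding
# becomes the single byte 0x97.
raw = 'journey\u2014no staff'.encode('cp1252')

# Decoding those bytes as UTF-8 fails on 0x97, so it is replaced with U+FFFD.
wrong = raw.decode('utf-8', errors='replace')

# Decoding with the encoding that was used for writing recovers the dash.
right = raw.decode('cp1252')

print(repr(wrong))
print(repr(right))
```

If the source file did turn out to use such an encoding, opening it with an explicit `encoding=` argument to `open` might recover the original characters.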


As a final check, let's see if there are any zero-length verses.

In [15]:
print(f'Number of sentences: {len(sentences)}')
sentences_gt = [line for line in sentences if len(line.strip()) > 0]
print(f'Number of sentences: {len(sentences_gt)}')

Number of sentences: 7024
Number of sentences: 7024


There are none.

#### Creating a document that can be converted to tokens

The code below concatenates all the sentences from the corpus and converts that document into tokens. The dataset will consist of 50,000 examples, each a randomly selected block of 500 tokens.

First, let's get the tokenizer.

Let's log in to Hugging Face. Use your own credentials.

In [17]:
from huggingface_hub import notebook_login
notebook_login()


In [18]:
# Load the GPT tokenizer.
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(GPT2_MODEL)
BOS_TOKEN=tokenizer.bos_token
EOS_TOKEN=tokenizer.eos_token
#PAD_TOKEN='<|pad|>'
print(f'{BOS_TOKEN} {EOS_TOKEN}')

<|endoftext|> <|endoftext|>


Let's print some information about the tokenizer.

In [19]:
print("The max model length is {} for this model.".format(tokenizer.model_max_length))
print("The beginning of sequence token {} token has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))

The max model length is 1024 for this model.
The beginning of sequence token <|endoftext|> token has the id 50256
The end of sequence token <|endoftext|> has the id 50256


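Below, the EOS token is appended to each sentence. For GPT-2 the BOS and EOS tokens are the same (`<|endoftext|>`), so it also marks the start of the next sentence.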

In [20]:
word_cnts = [len( word_tokenize(line) ) for line in sentences]
print(f'Number of words: {sum( word_cnts )}')
print(f'Longest sentence: {max( word_cnts )}')
sentences_delimited = [line + EOS_TOKEN + ' ' for line in sentences]
word_cnts = [len( word_tokenize(line) ) for line in sentences_delimited]
print(f'Number of words: {sum( word_cnts )}')
print(f'Longest sentence: {max( word_cnts )}')

Number of words: 207171
Longest sentence: 402
Number of words: 223126
Longest sentence: 404


Let's look at one of the examples.

In [21]:
sentences_delimited[1]

'After the exile to Babylon: Jeconiah was the father of Shealtiel, Shealtiel the father of Zerubbabel, Zerubbabel the father of Abiud, Abiud the father of Eliakim, Eliakim the father of Azor, Azor the father of Zadok, Zadok the father of Akim, Akim the father of Eliud, Eliud the father of Eleazar, Eleazar the father of Matthan, Matthan the father of Jacob, and Jacob the father of Joseph, the husband of Mary, and Mary was the mother of Jesus who is called the Messiah.<|endoftext|> '

Now, let's create a document by concatenating all the sentences.

In [22]:
doc = "".join(sentences_delimited)
print(f'Words in doc: {len( word_tokenize(doc) )}')
doc[2000:2400]

Words in doc: 223126


'want to expose her to public disgrace, he had in mind to divorce her quietly.<|endoftext|> But after he had considered this, an angel of the Lord appeared to him in a dream and said, "Joseph son of David, do not be afraid to take Mary home as your wife, because what is conceived in her is from the Holy Spirit.<|endoftext|> She will give birth to a son, and you are to give him the name Jesus, becau'

Now let's tokenize the document.

In [23]:
import torch

outputs = tokenizer(
    [doc],
    truncation=False,
)
print(outputs.keys())
print(f"Token count: {len( outputs['input_ids'][0] )}")
# Keep the token ids as a plain Python list so they can be sliced into blocks.
doc_tokens = outputs['input_ids'][0]

Token indices sequence length is longer than the specified maximum sequence length for this model (220219 > 1024). Running this sequence through the model will result in indexing errors


dict_keys(['input_ids', 'attention_mask'])
Token count: 220219


Let's verify the tokenization was performed correctly by decoding the tokenized document and looking at a slice of the result.

In [24]:
doc_decode = tokenizer.decode(doc_tokens)

In [25]:
doc_decode[20000:20400]

'e and love the other, or you will be devoted to the one and despise the other. You cannot serve both God and Money.<|endoftext|> "Therefore I tell you, do not worry about your life, what you will eat or drink; or about your body, what you will wear. Is not life more important than food, and the body more important than clothes?<|endoftext|> Look at the birds of the air; they do not sow or reap or '

This looks correct.

Now, let's create the dataset from the document of tokens by randomly taking chunks of 500 tokens. These are the examples used to train the GPT2 base model.

In [26]:
import random

random.seed(SEED_VALUE)

block_size = 500

def get_sample(_data):
    idx = random.randint(0, len(_data) - block_size)
    sample = _data[idx:idx+block_size]
    return sample

input_ids = [ get_sample(doc_tokens) for _ in range(50000) ]

print(f'Samples: {len(input_ids) }')

Samples: 50000
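A quick back-of-the-envelope calculation shows how heavily the random blocks overlap. It uses only the numbers reported above (220,219 tokens, 50,000 samples, 500 tokens per block); the variable names are introduced here for illustration.

```python
doc_len = 220219      # token count reported by the tokenizer cell
block_size = 500      # tokens per example
n_samples = 50000     # number of random blocks drawn

# Number of distinct start positions a block can have.
n_start_positions = doc_len - block_size + 1

# Average number of sampled blocks that contain any given token.
avg_visits_per_token = n_samples * block_size / doc_len

print(n_start_positions)
print(round(avg_visits_per_token, 1))
```

On average each token of the document appears in over a hundred examples, so any two examples are likely to share text with many others.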


Let's decode one of the examples.

In [27]:
tokenizer.decode( input_ids[100] )

' purple robe and put his own clothes on him. Then they led him out to crucify him.<|endoftext|> A certain man from Cyrene, Simon, the father of Alexander and Rufus, was passing by on his way in from the country, and they forced him to carry the cross.<|endoftext|> They brought Jesus to the place called Golgotha (which means "the place of the skull").<|endoftext|> Then they offered him wine mixed with myrrh, but he did not take it.<|endoftext|> And they crucified him. Dividing up his clothes, they cast lots to see what each would get.<|endoftext|> It was nine in the morning when they crucified him.<|endoftext|> The written notice of the charge against him read: THE KING OF THE JEWS.<|endoftext|> They crucified two rebels with him, one on his right and one on his left.<|endoftext|> [omitted]<|endoftext|> Those who passed by hurled insults at him, shaking their heads and saying, "So! You who are going to destroy the temple and build it in three days, come down from the cross and save you

This looks correct.

Now let's create the Hugging Face dataset. 95% of the examples are used for training and the remaining 5% are held out as a test set for evaluation.

In [28]:
from datasets import Dataset, DatasetDict

dataset = Dataset.from_dict({"input_ids": input_ids})
tokenized_datasets = DatasetDict({"train": dataset})
tokenized_datasets = tokenized_datasets["train"].train_test_split(train_size=0.95, shuffle=True, seed=SEED_VALUE)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 47500
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 2500
    })
})
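Because the examples are random blocks drawn from one token document, a random split of the examples leaves test blocks that overlap training blocks. If a test set without such overlap is wanted, one option is to split the token document by position first and sample blocks from each part separately. The sketch below shows that idea on toy data; `sample_blocks`, `split_then_sample`, and their parameters are hypothetical names, and this is not what the notebook does.

```python
import random

def sample_blocks(tokens, block_size, n_samples, rng):
    """Draw n_samples random contiguous blocks of block_size tokens."""
    max_start = len(tokens) - block_size
    starts = [rng.randint(0, max_start) for _ in range(n_samples)]
    return [tokens[s:s + block_size] for s in starts]

def split_then_sample(tokens, block_size, n_train, n_test,
                      train_percent=95, seed=42):
    """Split the token list by position, then sample blocks from each part."""
    rng = random.Random(seed)
    cut = len(tokens) * train_percent // 100
    train_tokens, test_tokens = tokens[:cut], tokens[cut:]
    train = sample_blocks(train_tokens, block_size, n_train, rng)
    test = sample_blocks(test_tokens, block_size, n_test, rng)
    return train, test

# Toy "document": token ids 0..9999, so overlap is easy to check.
toy_tokens = list(range(10_000))
train_blocks, test_blocks = split_then_sample(
    toy_tokens, block_size=50, n_train=200, n_test=20
)
print(len(train_blocks), len(test_blocks))
```

With this approach no test block shares a token position with any training block, at the cost of the model never training on the last part of the document.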

Let's confirm the longest example has 500 tokens.

In [29]:
max( [ len( ids ) for ids in tokenized_datasets['train']['input_ids'] ] )

500

Let's decode a couple more of the examples.

In [30]:
print(tokenizer.decode(tokenized_datasets['train'][1]['input_ids']))
print(tokenizer.decode(tokenized_datasets['train'][2]['input_ids']))

 some elders of the Jews to him, asking him to come and heal his servant.<|endoftext|> When they came to Jesus, they pleaded earnestly with him, "This man deserves to have you do this, because he loves our nation and has built our synagogue."<|endoftext|> So Jesus went with them. He was not far from the house when the centurion sent friends to say to him: "Lord, don't trouble yourself, for I do not deserve to have you come under my roof.<|endoftext|> That is why I did not even consider myself worthy to come to you. But say the word, and my servant will be healed.<|endoftext|> For I myself am a man under authority, with soldiers under me. I tell this one, 'Go,' and he goes; and that one, 'Come,' and he comes. I say to my servant, 'Do this,' and he does it."<|endoftext|> When Jesus heard this, he was amazed at him, and turning to the crowd following him, he said, "I tell you, I have not found such great faith even in Israel."<|endoftext|> Then the men who had been sent returned to the ho

This looks correct.

### Creating and Fine-Tuning a GPT2 model

Let's create the GPT2 model.

In [31]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig, GPT2Config
import torch, random
import numpy as np

configuration = GPT2Config.from_pretrained(
    GPT2_MODEL,
    vocab_size=len(tokenizer),
    n_ctx=block_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# instantiate the model
model = GPT2LMHeadModel.from_pretrained(GPT2_MODEL, config=configuration)

# Resize the embeddings to match the tokenizer. No tokens were added above
# (the pad token line is commented out), so this should leave the size unchanged.
model.resize_token_embeddings(len(tokenizer))

# Tell pytorch to run this model on the GPU.
model.to( device )

random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
torch.manual_seed(SEED_VALUE)
torch.cuda.manual_seed_all(SEED_VALUE)

Print the number of parameters in the model.

In [32]:
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 81.9M parameters


Create a data collator. During training it batches the examples and adds an attention mask and `labels` to each batch.

In [33]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

Test it.

In [34]:
out = data_collator([tokenized_datasets["train"][i] for i in range(5)])
print( out.keys() )
for key in out:
    print(f"{key} shape: {out[key].shape}")

dict_keys(['input_ids', 'attention_mask', 'labels'])
input_ids shape: torch.Size([5, 500])
attention_mask shape: torch.Size([5, 500])
labels shape: torch.Size([5, 500])
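For causal language modeling the `labels` are a copy of `input_ids`; GPT-2 style models shift them internally so that each position is trained to predict the token that follows it. The sketch below illustrates that pairing in plain Python. `next_token_pairs` and the token ids are hypothetical and only show the idea, not the library's implementation.

```python
def next_token_pairs(input_ids):
    """Pair each token with the token the model should predict after it."""
    return list(zip(input_ids[:-1], input_ids[1:]))

toy_ids = [50256, 101, 102, 103]   # hypothetical token ids
pairs = next_token_pairs(toy_ids)
print(pairs)
```

An example of 500 tokens therefore yields 499 prediction targets.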


Now we're ready to train the model.

In [35]:
%%time
from transformers import Trainer, TrainingArguments

batch_size = 32

args = TrainingArguments(
    output_dir=PROJECT_MODEL,
    per_device_train_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    fp16=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss
500,1.372
1000,0.1569
1500,0.0883
2000,0.0664
2500,0.0573
3000,0.0501
3500,0.0415
4000,0.0387
4500,0.0353
5000,0.03


Checkpoint destination directory efarish/GPT2_FT_By_NT_RAND_v11/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


CPU times: user 30min 10s, sys: 15.2 s, total: 30min 25s
Wall time: 30min 55s


TrainOutput(global_step=7425, training_loss=0.13872838685006805, metrics={'train_runtime': 1855.2567, 'train_samples_per_second': 128.015, 'train_steps_per_second': 4.002, 'total_flos': 3.03017472e+16, 'train_loss': 0.13872838685006805, 'epoch': 5.0})
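The reported `global_step` can be checked with a little arithmetic. The sketch assumes a single GPU, no gradient accumulation, and that the final partial batch of each epoch is kept; the variable names are introduced here for illustration.

```python
import math

train_examples = 47500   # rows in the training split
batch_size = 32          # per-device batch size used above
epochs = 5

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the total matches the 7425 steps in the training output.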

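The trainer's `evaluate` method returns the loss on the test set. Perplexity is the exponential of this loss and indicates how certain the model is when predicting the next token; lower is better, and 1 is the minimum.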

In [36]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 1.03
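To make the number concrete, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it for a toy sequence of token probabilities and works out the loss implied by a perplexity of 1.03. The `perplexity` helper and the probabilities are hypothetical.

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the correct token probability 1/4 is as
# uncertain as a uniform choice among 4 tokens.
uniform = perplexity([0.25] * 8)

# The evaluation loss implied by the perplexity reported above.
implied_loss = math.log(1.03)

print(uniform)
print(implied_loss)
```

A perplexity of 1.03 corresponds to an average loss of roughly 0.03 per token, in line with the final training losses logged above.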


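This looks pretty good: a perplexity close to 1 means the model is nearly certain of the next token. Keep in mind that the test examples are random blocks drawn from the same token document as the training examples, so they overlap heavily with the training data and this score does not measure performance on unseen text.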

Let's generate some text. A couple of different approaches are used below.

Let's start with the model's `generate` method.

In [37]:
import torch

model.eval()

prompt = tokenizer.bos_token + "Jesus said to them"
generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
generated = generated.to(device)

print(generated)

sample_outputs = model.generate(
                                generated,
                                do_sample=True,
                                top_k=10,
                                top_p=0.95,
                                max_length=100,
                                num_return_sequences=1,
                                pad_token_id=tokenizer.eos_token_id,
                                )

for i, sample_output in enumerate(sample_outputs):
  print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

tensor([[50256, 28219,   531,   284,   606]], device='cuda:0')
0: Jesus said to them, "I have compassion for these people; they have already been with me three days and have nothing to eat. I do not want to send them away hungry, or they may collapse on the way." When he had called the Twelve together, he gave them power and authority to drive out all demons and to cure diseases, and he sent them out to proclaim the kingdom of God and to heal the sick. He told them: "Take nothing for the journey�no staff
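The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The sketch below is a simplified plain-Python illustration of that idea on a toy distribution: keep the `top_k` most likely tokens, then keep the smallest prefix of them whose renormalized probability reaches `top_p`. It is not the library's implementation, and `top_k_top_p_filter` and the toy probabilities are hypothetical.

```python
import random

def top_k_top_p_filter(probs, top_k, top_p):
    """Return a renormalized {token: prob} dict after top-k then top-p filtering."""
    # Keep the top_k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)

    # Keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / k_total
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

toy_probs = {'a': 0.5, 'b': 0.3, 'c': 0.15, 'd': 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)

# Sample the next token from the filtered distribution.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

Lower values of `top_k` or `top_p` make the output more conservative, while higher values allow more varied text.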




This looks correct and is an example of how an LLM acts as a document completion model.

Let's save the model to the Hugging Face hub.

In [38]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/efarish/GPT2_FT_By_NT_RAND_v11/commit/93f77fb6ea2030001afea0a32f239d8deb2cee9b', commit_message='End of training', commit_description='', oid='93f77fb6ea2030001afea0a32f239d8deb2cee9b', pr_url=None, pr_revision=None, pr_num=None)

Now let's use a Hugging Face pipeline. A configuration is passed to the pipeline to control how the text is generated.

In [39]:
import torch
from transformers import pipeline, PretrainedConfig

pc = PretrainedConfig(
    max_new_tokens=200,
    early_stopping=True,
    no_repeat_ngram_size=2,
    do_sample=True,
)

pipe = pipeline(
    "text-generation", model=PROJECT_MODEL, device=device, config=pc
)

Let's give the pipeline a try.

In [40]:
txt = "Jesus asked what do you think about the Christ?"
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
txt = "Jesus comes from"
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Jesus asked what do you think about the Christ? Is it possible that he comes from God?" They all condemned him as worthy of death. Then some began to spit at him; they blindfolded him, struck him with their fists, and said
Jesus comes from you and will not be forgiven!' " 'Who is this I hear such things about,' he said. "He was teaching in their synagogues, and you did not believe him, but the tax collectors and the prostitutes did.


This looks reasonable.

Now let's try loading the fine-tuned model from the hub and using it.

In [41]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(PROJECT_MODEL,low_cpu_mem_usage=True)
model = model.to( device )

In [42]:
# Load the GPT tokenizer.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(PROJECT_MODEL)

It looks like the model and tokenizer were successfully created.

Before testing the model again, let's create a stopping criterion so that generation stops once the text contains two periods. This is an attempt to avoid dangling sentences.

In [43]:
from transformers import StoppingCriteria, StoppingCriteriaList
import torch

class StoppingCriteriaSub(StoppingCriteria):
    def __init__(self, stops=None, encounters=1):
        super().__init__()
        # Avoid a mutable default argument; move the stop token ids to the model's device.
        self.stops = [stop.to( device ) for stop in (stops or [])]
        self.encounters = encounters

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs):
        # Stop once a stop token has appeared `encounters` times in the first sequence.
        for stop in self.stops:
            founds = (input_ids[0] == stop).sum().item()
            if founds >= self.encounters and founds % self.encounters == 0: return True
        return False

stop_words = ['.']
stop_words_ids = [tokenizer(stop_word, return_tensors='pt', add_special_tokens=False)['input_ids'].squeeze() for stop_word in stop_words]
stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids,
                                                              encounters=2)])
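The stopping rule itself is easy to check in isolation. The sketch below mirrors the counting logic in plain Python, using a made-up token id for the period; `should_stop` and `STOP_ID` are hypothetical names. Note that the count runs over the whole sequence, so periods in the prompt count as well, and only the standalone stop token is counted, not other tokens that happen to contain a period.

```python
STOP_ID = 99  # hypothetical id standing in for the '.' token

def should_stop(token_ids, stop_id, encounters):
    """True when stop_id has appeared a positive multiple of `encounters` times."""
    found = sum(1 for t in token_ids if t == stop_id)
    return found >= encounters and found % encounters == 0

print(should_stop([1, STOP_ID, 2], STOP_ID, encounters=2))
print(should_stop([1, STOP_ID, 2, STOP_ID], STOP_ID, encounters=2))
```

With `encounters=2`, generation continues past the first period and stops at the second.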

Let's use the `generate` method again, but this time with the stopping criteria.

In [44]:
import torch
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
#nltk.download('punkt')

def get_model_input(_input: str):
  prompt = tokenizer.bos_token + _input
  generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
  device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
  generated = generated.to(device)
  return generated

generated = get_model_input("Jesus said")

model.eval()

output = model.generate(
    generated,
    stopping_criteria=stopping_criteria,
    do_sample=True,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)
print("Output:\n" + 100 * '-')
out = tokenizer.decode(output[0], skip_special_tokens=True)
#print( f'{len(word_tokenize( out ))}: {out}'  )
from IPython.display import display, Markdown
display(Markdown(out))

Output:
----------------------------------------------------------------------------------------------------


Jesus said to his disciples, "The teachers of the law and the Pharisees sit in Moses' seat. So you must be careful to do everything they tell you.

This looks pretty good. A separate notebook in this directory called Inference.ipynb will explore this further.

### This completes the notebook.