# Training a (Toy) Language Model with Hugging Face

The following notebook was developed as part of the online course  
[**Train, Fine-Tune, and Deploy LLMs**](https://learn.theaiedge.io/p/the-large-language-bootcamp).

The goal of this notebook is to show how to (pre-)train a (toy) causal language model.

<p float="center">
  <img src="../../assets/llmpretraining.png" width="750" height="300"> 
</p>

In [1]:
# enabling autoreloading of modules. This ensures any local edits/changes 
# to python modules are automatically reloaded without restarting the notebook
%load_ext autoreload
%autoreload 2

In [2]:
import os
from dotenv import load_dotenv

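# load environment variables from .env file
load_dotenv()

# retrieve the Hugging Face API token from the environment (the token should be set in the .env file 
# as HF_TOKEN=<huggingface_token>)
hf_token = os.getenv('HF_TOKEN')

# make sure that transformers, datasets and other libraries can access the HF token;
# os.environ only accepts strings, so fail early with a clear message if the token is missing
if hf_token is None:
    raise RuntimeError('HF_TOKEN is not set; add HF_TOKEN=<huggingface_token> to the .env file')
os.environ['HF_TOKEN'] = hf_token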

In [3]:
# select the device used later for the model (prefer CUDA, then MPS on Apple Silicon, else CPU)
import torch
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'

In [4]:
from datasets import load_dataset

# load a small subset of the wikipedia dataset (english language)
wikids = load_dataset(
    path="wikimedia/wikipedia",     # dataset namespace and name
    name="20231101.en",             # snapshot date (November 1st, 2023) + language code
    split="train[0:1000]",          # use only the first 1000 samples of the train split
    cache_dir=None,                 # None uses the default cache; pass a path to use a custom cache directory
    keep_in_memory=False,           # False (default) keeps the data on disk; True copies the dataset into RAM, which can speed up dataset operations
)


In [5]:
wikids

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 1000
})

In [6]:
# quick sanity check:
# a) inspect first sample
display(wikids[0])

# b) preview first 500 characters of the first document
wikids['text'][0][:500]

{'id': '12',
 'url': 'https://en.wikipedia.org/wiki/Anarchism',
 'title': 'Anarchism',
 'text': 'Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies and voluntary free associations. As a historically left-wing movement, this reading of anarchism is placed on the farthest left of the political spectrum, usually described as the libertarian wing of the socialist movement (libertarian socialism).\n\nHumans have lived in societies without formal hierarchies long before the establishment of states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist ideas are found all throughout history, modern anarchism emerged from the Enlightenment. Dur

'Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies and voluntary free associations. As a historically left-wing movement, this reading of anarchism is placed on the farthest left of the political spectrum, usually described as'

In [7]:
# split dataset into train and test sets
wikids = wikids.train_test_split(test_size=0.1)
wikids

DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 900
    })
    test: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 100
    })
})
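
The 900/100 split above can be illustrated with a small pure-Python sketch: shuffle the row indices, then hold out 10% of them. This is only an illustration of the idea, not the `datasets` implementation; `split_indices` and its arguments are hypothetical names introduced here. In the real API, `train_test_split` also accepts a `seed` argument, which makes the split reproducible across notebook runs.

```python
import random

def split_indices(n_rows, test_size=0.1, seed=0):
    """Shuffle row indices and hold out a fraction of them as a test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded shuffle -> reproducible split
    n_test = round(n_rows * test_size)        # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1000, test_size=0.1, seed=0)
print(len(train_idx), len(test_idx))
```

Because no seed is passed in the cell above, rerunning it may place different articles in the train and test sets.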

In [8]:
from transformers import AutoTokenizer

# load pretrained tokenizer from hugging face hub
# https://huggingface.co/mistralai/Mistral-7B-v0.3
model_id = 'mistralai/Mistral-7B-v0.3'

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_id
)

In [9]:
# inspect tokenizer config and default tokens
display(tokenizer.special_tokens_map)

print(tokenizer.pad_token, tokenizer.eos_token, tokenizer.bos_token, tokenizer.unk_token)

{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}

None </s> <s> <unk>


In [10]:
# if the tokenizer has no padding token defined (as with the Mistral 7B model), set one explicitly.
# here we use '<|pad|>' as the padding token string; the next cell reports id 0 for it,
# which is also the id of '<unk>' in this vocabulary
tokenizer.pad_token = '<|pad|>'

In [11]:
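# sanity check for the padding token using a large max_length:
# if the sample has fewer than max_length tokens, the attention mask is 0 at the padded positions.
# note: no truncation is requested here, so a sample with more than max_length tokens is returned
# unpadded (this is the case in the output below, where the mask contains only 1s)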
out0 = tokenizer(wikids['train']['text'][0], max_length=10_000, padding='max_length', padding_side='left')

print('last 10 and first 10 INPUT_IDS of the first training sample:')
display((out0['input_ids'][-10:], out0['input_ids'][:10]))

print('last 10 and first 10 ATTENTION_MASK of the first training sample:')
display((out0['attention_mask'][-10:], out0['attention_mask'][:10]))

print('padded token id is:', tokenizer.pad_token_id)

last 10 and first 10 INPUT_IDS of the first training sample:


([781, 29548, 1617, 13957, 12611, 13270, 11205, 5081, 1066, 18592],
 [1, 3252, 26150, 1117, 1040, 2618, 3003, 1070, 1032, 19742])

last 10 and first 10 ATTENTION_MASK of the first training sample:


([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

padded token id is: 0
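
The attention masks above contain only 1s and the ids start with `1` (`<s>`), which suggests that the first training sample already has at least 10,000 tokens, so nothing was padded. A minimal pure-Python sketch of what left padding does to a shorter sequence is given below; `pad_left` is a hypothetical helper written for illustration and is not part of the tokenizer API.

```python
def pad_left(input_ids, max_length, pad_token_id=0):
    """Left-pad a list of token ids and build the matching attention mask."""
    n_pad = max(0, max_length - len(input_ids))   # no padding if already long enough
    return {
        "input_ids": [pad_token_id] * n_pad + list(input_ids),
        "attention_mask": [0] * n_pad + [1] * len(input_ids),  # 0 marks padding
    }

short = pad_left([1, 3252, 26150], max_length=6)      # shorter than max_length
long = pad_left(list(range(1, 9)), max_length=6)      # longer than max_length
print(short)
print(long)
```

For the short sequence, three pad ids are prepended and the mask starts with three 0s; the long sequence is returned unchanged with a mask of all 1s, mirroring the output above.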


In [12]:
wikids['train'][1]

{'id': '808',
 'url': 'https://en.wikipedia.org/wiki/Alfred%20Hitchcock',
 'title': 'Alfred Hitchcock',
 'text': 'Sir Alfred Joseph Hitchcock  (13 August 1899\xa0– 29 April 1980) was an English film director, screenwriter, producer and editor. He is widely regarded as one of the most influential figures in the history of cinema. In a career spanning six decades, he directed over 50 feature films, many of which are still widely watched and studied today. Known as the "Master of Suspense", he became as well known as any of his actors thanks to his many interviews, his cameo roles in most of his films, and his hosting and producing the television anthology Alfred Hitchcock Presents (1955–65). His films garnered 46 Academy Award nominations, including six wins, although he never won the award for Best Director, despite five nominations.\n\nHitchcock initially trained as a technical clerk and copywriter before entering the film industry in 1919 as a title card designer. His directorial debu

In [13]:
from typing import Dict, List, Any

# define max token length for inputs
max_length = 512

def tokenize_function(sample: Dict[str, List[str]]) -> Dict[str, Any]:
    """
    Tokenize a batch of text samples for causal language modeling.

    Args:
        sample (Dict[str, List[str]]): a dictionary of dataset fields, where each field maps to a list of values.
        - Expected key: "text" (List[str]), a batch of raw text strings
        - Example: {
            ..., 
            'text': [
                'Abimelech (also spelled Abimelek ...',
                'Athene or Athena is the shrewd ...',
                ...
            ]
        }

    Returns:
        Dict[str, Any]: a dictionary with tokenized outputs, where each key corresponds to a model input field:
        - 'input_ids' (List[List[int]]): encoded token ids
        - 'attention_mask' (List[List[int]]): 1 for tokens to attend to, and 0 for padding
        - 'token_type_ids' (Optional[List[List[int]]]): only for tokenizers that use them
    """

    return tokenizer(
        sample['text'],             # batch of raw text strings
        padding='max_length',       # pad to a fixed length for batching
        truncation=True,            # truncate texts that have more than max_length tokens
        return_tensors='pt',        # return as pytorch tensors
        add_special_tokens=True,    # add the tokenizer's special tokens (for this tokenizer '<s>' is prepended; no '</s>' is appended by default)
        max_length=max_length,      # hard cap on sequence length
    )


# apply tokenizer across dataset
tokenized = wikids.map(
    tokenize_function, 
    batched=True, 
    remove_columns=wikids['train'].column_names     # in case of wikids, columns are ['id', 'url', 'title', 'text']
)


In [14]:
# in comparison to wikids, the features ['id', 'url', 'title', 'text'] were removed 
# and ['input_ids', 'attention_mask'] were added.
tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 900
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 100
    })
})

**Check out the text generation of the PRETRAINED Mistral-7B model**

In [15]:
from transformers import MistralForCausalLM, MistralConfig

# load the original pretrained Mistral 7B model and check its output;
# we want to compare the results with the smaller model trained below
model_org = MistralForCausalLM.from_pretrained(model_id, device_map=device)

# generate output using the PRETRAINED MISTRAL MODEL 
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model_org.device)
outputs = model_org.generate(**inputs, max_length=100)
tokenizer.decode(outputs[0], skip_special_tokens=False)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'<s> Once upon a time, there was a little girl who loved to read. She loved to read so much that she would read books over and over again. She would read books about princesses and fairies and dragons and wizards. She would read books about pirates and mermaids and unicorns and witches. She would read books about knights and castles and kings and queens. She would read books about magic and adventure and mystery and romance. She'

**Define a smaller Mistral-like config for experimentation**

In [16]:
MistralConfig()

MistralConfig {
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "transformers_version": "4.48.1",
  "use_cache": true,
  "vocab_size": 32000
}

In [17]:
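# consider a smaller model that fits on a laptop
# note: vocab_size is not set, so the MistralConfig default (32000, see above) is used. The Mistral-7B-v0.3
# tokenizer has a larger vocabulary (32768 tokens); passing vocab_size=len(tokenizer) would guarantee that
# every token id has an embedding row (this would change the parameter count reported below)
config = MistralConfig(
    hidden_size=1024,                       # reduced hidden size (default 4096 in Mistral-7B model)
    sliding_window=1024,                    # reduced sliding window (default 4096 in Mistral-7B model)
    intermediate_size=3077*2,               # feed-forward network size (default 14336 in Mistral-7B model)
    max_position_embeddings=max_length,     # reduced max_position_embeddings (default 131072 in Mistral-7B model)
    num_attention_heads=16,                 # reduced num_attention_heads (default 32 in Mistral-7B model)
    num_hidden_layers=8                     # number of decoder blocks (default 32 in Mistral-7B model)
)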


# instantiate small Mistral model from customized config file
model = MistralForCausalLM(config)
model = model.to(device)

In [18]:
# check if model parameters are on 'mps'
print(next(model.parameters()).device)

mps:0


In [19]:
# inspect model size
model_param_size = sum([param.numel() for param in model.parameters()])
print(f'Model parameter count: {model_param_size:,}')

Model parameter count: 241,959,936
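
The reported parameter count can be reproduced by hand from the config values, which is a useful check that the smaller config does what we expect. The sketch below assumes the usual Mistral layout: bias-free linear layers, grouped-query attention with `head_dim = hidden_size // num_attention_heads`, a gated MLP with three projections, two RMSNorm weights per block, and untied input/output embeddings (`tie_word_embeddings` is false in the default config shown above). `mistral_param_count` is a hypothetical helper, not a `transformers` function.

```python
def mistral_param_count(vocab_size, hidden_size, intermediate_size,
                        num_hidden_layers, num_attention_heads, num_key_value_heads):
    """Count parameters of a Mistral-style decoder under the assumptions stated above."""
    head_dim = hidden_size // num_attention_heads
    q_proj = hidden_size * num_attention_heads * head_dim
    kv_proj = 2 * hidden_size * num_key_value_heads * head_dim   # k_proj + v_proj
    o_proj = num_attention_heads * head_dim * hidden_size
    mlp = 3 * hidden_size * intermediate_size                    # gate, up and down projections
    norms = 2 * hidden_size                                      # two RMSNorm weights per block
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    embeddings = 2 * vocab_size * hidden_size                    # input embedding + untied lm_head
    final_norm = hidden_size
    return num_hidden_layers * per_layer + embeddings + final_norm

n_params = mistral_param_count(
    vocab_size=32000,            # MistralConfig default
    hidden_size=1024,
    intermediate_size=3077 * 2,
    num_hidden_layers=8,
    num_attention_heads=16,
    num_key_value_heads=8,       # MistralConfig default
)
print(f"{n_params:,}")
```

Under these assumptions the result matches the count printed by the notebook. About 65.5M of the parameters sit in the two embedding matrices, so the vocabulary size has a large effect on the size of such a small model.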


In [20]:
# in a causal language model, the labels are the same as the inputs
from transformers import DataCollatorForLanguageModeling

# the data collator prepares batches of tokenized text for training.
# in causal LM training, labels are a copy of input_ids, with padding positions set to -100
# so that they are ignored by the loss (the model shifts the labels by one position internally during training)
data_colator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False, # no masked language modeling -> causal LM
    return_tensors='pt'
)

In [21]:
# test batch collation

# note that tokenized['train'][i] is a dictionary with keys: input_ids, attention_mask, etc.,
# whereas tokenized['train'][0:1] returns a single dictionary of lists (one list per column), not a list of dicts.
data_colator(
    [tokenized['train'][i] for i in range(2)]   # a list of dictionaries
)   # creates a batch of 2 samples

{'input_ids': tensor([[    1,  3252, 26150,  ...,  1040,  1766,  1194],
        [    1,  9119, 17072,  ...,  1623,  1032,  1310]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[    1,  3252, 26150,  ...,  1040,  1766,  1194],
        [    1,  9119, 17072,  ...,  1623,  1032,  1310]])}
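
In the batch above, `labels` is a copy of `input_ids`. The next-token objective only appears once the labels are shifted by one position, so that position `i` is scored against the token at position `i + 1`. The pure-Python sketch below illustrates this bookkeeping on a toy sequence; `make_labels` and `shifted_pairs` are hypothetical helpers, and the example assumes right padding with pad id 0 purely for illustration.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def make_labels(input_ids, pad_token_id):
    """Copy the inputs and mask padding positions, as the collator does for mlm=False."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def shifted_pairs(input_ids, labels):
    """Pair each position with the next token as its target, dropping ignored targets."""
    pairs = []
    for pos in range(len(input_ids) - 1):
        target = labels[pos + 1]
        if target != IGNORE_INDEX:
            # the model sees all tokens up to `pos`; only the last one is shown here
            pairs.append((input_ids[pos], target))
    return pairs

input_ids = [1, 3252, 26150, 0, 0]            # toy sequence, right-padded with pad id 0
labels = make_labels(input_ids, pad_token_id=0)
pairs = shifted_pairs(input_ids, labels)
print(labels)
print(pairs)
```

One consequence of masking by id: any real token that shares its id with the padding token is also excluded from the loss.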

In [22]:
from transformers import Trainer, TrainingArguments

# define training arguments (adjust based on available resources)
trainings_args = TrainingArguments(
    output_dir="mistral7B/pretrained",  # where to save the model
    overwrite_output_dir=True,          # overwrite the content of the output directory
    num_train_epochs=5,                 # number of training epochs
    per_device_train_batch_size=4,      # batch size for training
    per_device_eval_batch_size=4,       # batch size for evaluation
    eval_strategy="epoch",              # evaluate each `epoch`
    report_to='none',                   # disable wandb as we are not using it here
    bf16=False,                         # set to True to use bf16 if your hardware supports it
    push_to_hub=True,                   # whether to push the model to the Hub or not (default False)
    hub_model_id='bachephysicdun/dummy-pretrained-mistral7b',  # set the repo name
    hub_private_repo=False              # set True for private repo
)

# Trainer handles training loop, evaluation, and logging
trainer = Trainer(
    model=model,
    args=trainings_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    data_collator=data_colator,
    processing_class=tokenizer
)

# train (small) model on a small dataset (for experimentation purpose)
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 6.50676 |
| 2 | No log | 6.143845 |
| 3 | 6.532600 | 6.02787 |
| 4 | 6.532600 | 6.014984 |
| 5 | 5.175400 | 6.045206 |


TrainOutput(global_step=1125, training_loss=5.730242078993055, metrics={'train_runtime': 553.3155, 'train_samples_per_second': 8.133, 'train_steps_per_second': 2.033, 'total_flos': 2891869323264000.0, 'train_loss': 5.730242078993055, 'epoch': 5.0})
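
The numbers in the training output can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the default `logging_steps=500` of `TrainingArguments` (none of these are set explicitly in the cell above). Under these assumptions it reproduces the 1125 global steps and shows why the training loss reads "No log" for the first two epochs: the first loss value is only logged at step 500, which falls in epoch 3.

```python
import math

train_samples, batch_size, epochs = 900, 4, 5
train_runtime = 553.3155                      # seconds, taken from TrainOutput above

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

logging_steps = 500                           # assumed default of TrainingArguments
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
logged_in_epoch = [math.ceil(step / steps_per_epoch) for step in logged_at]

samples_per_second = train_samples * epochs / train_runtime

print(steps_per_epoch, total_steps)           # steps per epoch, global steps
print(logged_at, logged_in_epoch)             # logging steps and the epochs they fall in
print(round(samples_per_second, 3))           # throughput
```

Setting a smaller `logging_steps` would give a training loss for every epoch on a dataset of this size.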

To measure the quality of language model training, a common metric is perplexity, which is defined as follows:
> $\text{Perplexity} = 2^{-\frac{1}{T} \sum_{i=1}^{T} \log_2 p(t_i)} = e^{-\frac{1}{T} \sum_{i=1}^{T} \ln p(t_i)}$

where $T$ is the total number of tokens in the test set and $p(t_i)$ is the probability the model assigns to the $i$-th token of the test set given the preceding tokens (the identity $\log_2 x = \frac{\ln x}{\ln 2}$ is used to rewrite perplexity from the base-2 form to the base-$e$ form). If the model assigns probability 1 to every correct next token, then $\text{Perplexity}=1$, which is the lowest possible value.

> A perplexity of about 420 (as we get from the simple model training here) means that, on average, predicting the next token is as uncertain as choosing uniformly from about 420 equally probable options.
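
The definition can be checked numerically with a few lines of pure Python. The sketch below computes perplexity from a list of per-token probabilities in both the base-$e$ and base-2 forms; the function names and the toy probability lists are made up for illustration. It also converts the final validation loss from the table above (a mean negative log-likelihood in nats) into a perplexity.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative natural-log probability."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def perplexity_base2(token_probs):
    """2 to the power of the mean negative base-2 log probability."""
    avg_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_bits

uniform_400 = [1 / 400] * 10       # every token predicted with probability 1/400
certain = [1.0] * 5                # every token predicted with probability 1
mixed = [0.5, 0.25, 0.125, 0.125]  # toy probabilities

print(perplexity(uniform_400))                      # close to 400
print(perplexity(certain))                          # 1.0
print(perplexity(mixed), perplexity_base2(mixed))   # both forms agree
print(math.exp(6.045206))                           # perplexity for the final validation loss
```

The last line is the same conversion the next cell applies to `eval_loss`.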

In [23]:
# evaluate perplexity metric
import math

eval_results = trainer.evaluate()
if eval_results.get('eval_loss') is not None:
    print("Perplexity:", math.exp(eval_results["eval_loss"]))

Perplexity: 422.0847274986255


In [24]:
# quick text generation (to compare with the output of the pretrained Mistral-7B model)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
tokenizer.decode(outputs[0], skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'Once upon a time (; 1985 – 1901) was a French known as the first-in-in-S. He was the first-S. He was the first of the 190s. He was the 190s in 1900, and the 1900s.\n\nEarly life\n\nEarly life\n\nEarly life\n\nB was born in 19'

In [25]:
# push the model to huggingface hub (we want to use it later for fine tuning)
trainer.push_to_hub()

model.safetensors:   0%|          | 0.00/968M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/bachephysicdun/dummy-pretrained-mistral7b/commit/62c2fd361aac7f8382866a61e1ae3b9ee612c9b3', commit_message='End of training', commit_description='', oid='62c2fd361aac7f8382866a61e1ae3b9ee612c9b3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/bachephysicdun/dummy-pretrained-mistral7b', endpoint='https://huggingface.co', repo_type='model', repo_id='bachephysicdun/dummy-pretrained-mistral7b'), pr_revision=None, pr_num=None)

In [None]:
# if removing the model from hugging face is desired
from huggingface_hub import delete_repo

# delete a model repo
delete_repo(
    repo_id="username/my_repo_id",
    repo_type="model"                # type of repo to delete; "model" is the default when omitted
)