<a href="https://colab.research.google.com/github/apa017/hugging-face-learn/blob/main/05_LLM_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Model Training

In this Notebook, we will be training a large language model from scratch.

<br>

### WARNING

Online tools like [Google Colab](https://colab.research.google.com/) allow for use of GPU over CPU.

Running a fine-tuning locally (i.e. on CPU) requires lot of time and is computationally intensive.

For this reason it is recommended to execute this notebook on Cloud or having provided GPU.

<hr>

## Notebook Setup

In [4]:
# install the required modules
!pip install transformers datasets torch evaluate accelerate

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m471.6/471.6 kB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m


In [5]:
import warnings
warnings.filterwarnings('ignore')

## Load a dataset

In this notebook we will use a custom Hugging Face dataset

In [6]:
from datasets import load_dataset

dataset = load_dataset("Kain17/reuters_articles")
dataset

README.md:   0%|          | 0.00/512 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/150k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/30.4k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/39.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/462 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/58 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/58 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'body'],
        num_rows: 462
    })
    validation: Dataset({
        features: ['title', 'body'],
        num_rows: 58
    })
    test: Dataset({
        features: ['title', 'body'],
        num_rows: 58
    })
})

## Prepare data

The dataset contains old Reuters article.

Data is split into:
- `title`: the title of the article
- `body` : the content of the article

We will create a new column `full_article` out of existing columns `title` and `body`.

In [7]:
# helper function
def create_fullArticle(example):
  return {
      'full_article': f"TITLE:{example['title']}\n\nBODY: {example['body']}"
  }

# create new column
dataset = dataset.map(create_fullArticle)

Map:   0%|          | 0/462 [00:00<?, ? examples/s]

Map:   0%|          | 0/58 [00:00<?, ? examples/s]

Map:   0%|          | 0/58 [00:00<?, ? examples/s]

In [8]:
# check results
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 462
    })
    validation: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 58
    })
    test: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 58
    })
})

In [9]:
# test
print(dataset['train'][400]['full_article'])

TITLE:CARSON PIRIE <CRN> TO START PROXY MAILING

BODY: Carson Pirie Scott and Co said it plans
to start mailing proxy materials to stockholders in connection
to a November 16 special meeting at which holders will be asked
to consider a previously announced agreement with Greyhound
Corp <G>.
    Under the agreement, Greyhound will acquire, in a merger,
three of the company's foodservice operations - Dobbs
International Services, Dobbs Houses and Carson international.
    If the transaction is approved, Carsons said its
stockholders will receive 30 dlrs cash and one share of common
in the new Carson Pirie Scott and Co for each share held.
 Reuter



## Import a Tokenizer

We load a tokenizer.

We could use either a Hugging Face available tokenizer or our custom tokenizers.

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ingeniumacademy/gpt2-reuters-tokenizer")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/440 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/819k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/465k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.17M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

We use a function that tokenize the `full_article` column.

- We want the function to truncate every article that goes beyond a certain length, so we will set a `contextLength` parameter to a preferred length.

- For the truncated part, we want to make sure it isn't used. We set the parameter `return_overflowing_tokens` to **False** to do this.

In [11]:
# Helper function
contextLength = 512

def tokenize(element):

  # Create Output
  outputs = tokenizer(
      element["full_article"],
      truncation=True,
      max_length=contextLength,
      return_overflowing_tokens=False
  )

  return outputs


  ## Execute the function to create tokenized datasets
tokenized_datasets = dataset.map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
    )


Map:   0%|          | 0/462 [00:00<?, ? examples/s]

Map:   0%|          | 0/58 [00:00<?, ? examples/s]

Map:   0%|          | 0/58 [00:00<?, ? examples/s]

The new datasets will have a diffeent structure.

In [12]:
# Test
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 462
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 58
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 58
    })
})

## Preparing the model for training

To train a model, we need to create a **configuration** that is passed into the `transformers` class.

In [13]:
from transformers import GPT2LMHeadModel, AutoConfig

In [14]:
# create the configuration specifying the model

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=contextLength,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token
)

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

### Explanation

- **vocab_size**: the vocabulary size must match the size of the tokenizer. This allows for the model to handle all tokens in the vocabulary.

- **n_ctx**: the <i>maximum context length</i> (i.e. number of tokens) that the model can process at once. We set this to be the same value as the `contextLength` parameter we used to prepare the data.

- **bos_token_id**: id of <i>beginning-of-sequence</i> token (special token added at the beginning to mark the start). We obtain it from the tokenizer.

- **eos_token_id**: id of <i>end-of-sequence</i> token (works ike BOS token). We obtain it from the tokenizer.

In [15]:
# check configuration
config

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 0,
  "embd_pdrop": 0.1,
  "eos_token_id": "<|endoftext|>",
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 512,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.44.2",
  "use_cache": true,
  "vocab_size": 52000
}

<br>

Next, we configure our model.

In [16]:
# provide config to the model
model = GPT2LMHeadModel(config)

# provide "size" of model (i.e. number of parameters)
## in our case we go for ALL parameters
model_size = sum(t.numel() for t in model.parameters())

print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters.")

GPT-2 size: 125.8M parameters.


## Initializing a Data Collator

A **data collator** is a component that acts as a bridge between the raw dataset and the configured model.

It efficiently prepares text data for LLM by handling at once batching, padding, and attention masks.

This allows you to streamline the data preparation process with correctly formatted data.


In [17]:
from transformers import DataCollatorForLanguageModeling

In [18]:
# set pad token
tokenizer.pad_token = tokenizer.eos_token

# initialize the collator by passing the tokenizer as argument
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

The parameter `mlm` (masked laguage modelling) is set to **false** because we are not using a masked language approach.

Some models like BERT use mlm as common approach for training, while some like GPT-2 do not require that.

<br>

## Model Training

In [19]:
# Access the HF Hub
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [23]:
from transformers import Trainer, TrainingArguments

# setting the training arguments

args = TrainingArguments(
    output_dir="./reuters-gpt2-textgen",
    hub_model_id="Kain17/reuters-gpt2-textgen",
    evaluation_strategy="epoch",
    auto_find_batch_size=True,
    num_train_epochs=5,                       # more epochs == more time!
    gradient_accumulation_steps=8,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    fp16=True,                                # less precise == faster!
    push_to_hub=True,
    logging_steps=10
)

### Description of some arguments

Most arguments were covered in the previous notebooks.

Here a few more are explored.

- **auto_find_batch_size**: automatically find the largest batch size that fits in the available GPU memory, thus speeding up training

- **num_train_epochs**: number of training iterations over the dataset

- **gradient_accumulation_steps**: refers to accumulation of gradient-of-loss function. When batch size is too large to fit in memory, gradient accumulation calculates the gradient over several mini-batches, summing them. <br>The number hereby given (e.g. =8) helps simulate a large batch size. It is useful for memory constrains.

- **weight_decay**: L2 regularization to prevent overfitting. The bigger value, the stronger regularization.

- **lr_scheduler_type='cosine'**: uses a cosine function to gradually decrease learning rate over the training process, leading to better convergence.

- **fp16**: enables mixing precision training. It uses half-precision (16, not 32) floating point number to reduce memory usage.

- **logging_steps**: logs loss metrics every 10 steps, helps monitoring.


<br><br>


In [24]:
import time

In [25]:
# train the model
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"]
)

tStart = time.time()
print('Beginning the training...\n')

trainer.train()

tEnd = time.time()
print('Training complete. \n')

trTime = tEnd - tStart

if trTime >= 60:
  print(f"Training time: {trTime/60:.2f} minutes.")
else:
  print(f"Training time: {trTime:.2f} seconds.")

Beginning the training...



Epoch,Training Loss,Validation Loss
0,No log,7.340776
1,6.926100,7.071184
2,6.188900,6.957234
4,5.893800,6.925149


Training complete. 

Training time: 1.66 minutes.


In [26]:
# Push to hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/Kain17/reuters-gpt2-textgen/commit/c1115cea1af436f557087413aceb6ff69bcb2bd7', commit_message='End of training', commit_description='', oid='c1115cea1af436f557087413aceb6ff69bcb2bd7', pr_url=None, pr_revision=None, pr_num=None)

## Using the model in pipeline

In [27]:
import torch
from transformers import pipeline

# initialize a pipeline
pipeline = pipeline(
    "text-generation",
    model="Kain17/reuters-gpt2-textgen",
)

config.json:   0%|          | 0.00/912 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/503M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/819k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/465k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.17M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/470 [00:00<?, ?B/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [29]:
# Prepare a sample
sample = dataset['test'][2]
sample

{'title': 'ANIMAL FEED SHIP ON FIRE AGAIN AT CHINESE PORT',
 'body': 'The Cyprus vessel Fearless, 31,841 tonnes\ndw, which was on fire, grounded then towed to Yantai, China, in\nAugust, had all its cargo reloaded but the cargo in the no. 3\nhold caught fire on October 15.\n    The fire was put out with salt water and water from the\nno.4 hold has spread over most of the cargo. Some water is also\nin the no.5 hold. Bottom patching was reported complete but\nonly the no.4 starboard wing tank has been pumped out and\nremains dry. The engine room is flooded to about three metres.\n    The ship was originally loaded with 10,000 tonnes of animal\nfeed.\n REUTER\n\x03',
 'full_article': 'TITLE:ANIMAL FEED SHIP ON FIRE AGAIN AT CHINESE PORT\n\nBODY: The Cyprus vessel Fearless, 31,841 tonnes\ndw, which was on fire, grounded then towed to Yantai, China, in\nAugust, had all its cargo reloaded but the cargo in the no. 3\nhold caught fire on October 15.\n    The fire was put out with salt water and

In [30]:
# Prepare a prompt (completion)
prompt = f"""TITLE:{sample['title']}\n\nBODY:"""


In [31]:
# Run the pipeline
pipeline(prompt, max_new_tokens=128)

TypeError: new(): invalid data type 'str'

<hr>

###### End of Notebook