# Preparing Data for Training

After cleaning the data, the next step is to package it for training. In this section, we will learn how to format the training data for use with Hugging Face.

Cleaning alone is not enough: before training, the text still has to be converted into fixed-length sequences of numbers.

Key steps include:
1. **Tokenization**: Breaking text into smaller units called tokens and mapping each token to an integer ID. An LLM does not operate on raw text, only on numbers, so tokenization converts human-readable text into a numeric representation the model can process.
2. **Packing**: Concatenating the token IDs and splitting them into sequences of a fixed maximum length to improve training efficiency.
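
To make the idea of tokenization concrete before using a real tokenizer, here is a minimal sketch with a whitespace splitter and a made-up vocabulary. `toy_vocab`, `toy_tokenize`, and `toy_encode` are illustrative names, not part of any library, and real tokenizers such as the one loaded later in this section split text into subword units instead of whole words.

```python
# Toy illustration only: a whitespace "tokenizer" with a made-up vocabulary.
toy_vocab = {"<unk>": 0, "i": 1, "love": 2, "building": 3, "apps": 4}

def toy_tokenize(text):
    """Split text into lowercase, whitespace-separated tokens."""
    return text.lower().split()

def toy_encode(text):
    """Map each token to its integer id, falling back to the <unk> id."""
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in toy_tokenize(text)]

tokens = toy_tokenize("I love building apps")
ids = toy_encode("I love building apps")
print(tokens)  # ['i', 'love', 'building', 'apps']
print(ids)     # [1, 2, 3, 4]
```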

In [1]:
# load the dataset we just stored
import datasets

dataset = datasets.load_dataset(
    "parquet", 
    data_files="./data/preprocessed_dataset.parquet", 
    split="train"
)
print(dataset)

Dataset({
    features: ['text', 'meta'],
    num_rows: 40473
})


In [2]:
# since we do not have enough memory to process the entire dataset, we will
# use the shard method from the Hugging Face Dataset object to split
# the dataset into 10 smaller pieces, or shards
dataset = dataset.shard(num_shards=10, index=0)
print(dataset)

Dataset({
    features: ['text', 'meta'],
    num_rows: 4048
})
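
The row count of the shard follows from simple arithmetic, which the sketch below reproduces without the library. Which rows end up in shard 0 depends on the installed `datasets` version and on the `contiguous` argument, so both common layouts are shown here as assumptions. `strided_shard_size` and `contiguous_shard_size` are hypothetical helper names.

```python
def strided_shard_size(num_rows, num_shards, index):
    """Size of a shard that takes rows index, index + num_shards, ..."""
    return len(range(index, num_rows, num_shards))

def contiguous_shard_size(num_rows, num_shards, index):
    """Size of a shard made of consecutive rows.

    The first (num_rows % num_shards) shards receive one extra row.
    """
    base, extra = divmod(num_rows, num_shards)
    return base + (1 if index < extra else 0)

num_rows, num_shards = 40473, 10
print(strided_shard_size(num_rows, num_shards, 0))     # 4048
print(contiguous_shard_size(num_rows, num_shards, 0))  # 4048
```

Either way, shard 0 holds 4048 of the 40473 rows, which matches the printed `num_rows`.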


In [3]:
# load the tokenizer; here we take it from a text-generation pipeline
from transformers import pipeline
pipe = pipeline("text-generation", model="facebook/opt-125m")
tokenizer = pipe.tokenizer

Device set to use mps:0


In [6]:
# this is what the tokenized text looks like
tokenizer.tokenize("Hi! I love building application with Generative AI.")

['Hi',
 '!',
 'ĠI',
 'Ġlove',
 'Ġbuilding',
 'Ġapplication',
 'Ġwith',
 'ĠGener',
 'ative',
 'ĠAI',
 '.']

In [7]:
# the following are the token ids generated by the tokenizer
tokenizer.encode("Hi! I love building application with Generative AI.")

[2, 30086, 328, 38, 657, 745, 2502, 19, 15745, 3693, 4687, 4]
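
The two outputs are closely related, and the sketch below compares them using only the printed values, so no model download is needed. In this byte-level BPE vocabulary, `Ġ` marks a token that begins with a space. `encode` returned one more ID than `tokenize` returned tokens because it prepended a special token with ID 2. This appears to be the beginning-of-sequence token, since the tokenized examples later in this section also start with 2.

```python
# Values copied from the two cell outputs.
tokens = ['Hi', '!', 'ĠI', 'Ġlove', 'Ġbuilding', 'Ġapplication',
          'Ġwith', 'ĠGener', 'ative', 'ĠAI', '.']
ids = [2, 30086, 328, 38, 657, 745, 2502, 19, 15745, 3693, 4687, 4]

# Joining the tokens and turning each 'Ġ' back into a space recovers the text.
text = "".join(tokens).replace("Ġ", " ")
print(text)

# encode() produced one extra id at the front; the rest line up one-to-one.
print(len(tokens), len(ids))
for tok, tok_id in zip(tokens, ids[1:]):
    print(f"{tok!r:>16} -> {tok_id}")
```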

In [8]:
# the following helper function will be used to tokenize all the examples
def tokenization(example):
    # Tokenize
    tokens = tokenizer.tokenize(example["text"])

    # Convert tokens to ids
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Add the BOS and EOS token ids to the front and back of token_ids
    # (BOS: beginning of sequence, EOS: end of sequence)
    token_ids = [tokenizer.bos_token_id] + token_ids + [tokenizer.eos_token_id]
    example["input_ids"] = token_ids

    # We will be using this column to count the total number of tokens 
    # in the final dataset
    example["num_tokens"] = len(token_ids)
    return example
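
For larger shards, a batched variant of this helper can reduce per-example overhead. The following is a sketch, not the notebook's method. It assumes the tokenizer can be called on a list of strings with `add_special_tokens=False` and returns a mapping with one `input_ids` list per text, as Hugging Face tokenizers do. `tokenize_batch` and `StandInTokenizer` are hypothetical names. The stand-in exists only so the example runs without downloading a model, and its token IDs are made up.

```python
class StandInTokenizer:
    """Hypothetical stand-in mimicking the small part of the tokenizer API used here."""
    bos_token_id = 2  # illustrative values, not read from a real model
    eos_token_id = 2

    def __call__(self, texts, add_special_tokens=True):
        # Fake ids: one id per whitespace-separated word (word length + 10).
        return {"input_ids": [[len(w) + 10 for w in t.split()] for t in texts]}

def tokenize_batch(batch, tokenizer):
    """Tokenize a batch of texts and wrap each sequence in BOS/EOS ids."""
    encoded = tokenizer(batch["text"], add_special_tokens=False)
    input_ids = [
        [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
        for ids in encoded["input_ids"]
    ]
    return {"input_ids": input_ids, "num_tokens": [len(ids) for ids in input_ids]}

out = tokenize_batch({"text": ["hello world", "hi"]}, StandInTokenizer())
print(out)
```

With the real tokenizer, a function like this could be applied with `dataset.map(tokenize_batch, batched=True, fn_kwargs={"tokenizer": tokenizer})`.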

In [9]:
# let's tokenize all the examples in the pretraining dataset
dataset = dataset.map(tokenization, load_from_cache_file=False)
print(dataset)

Dataset({
    features: ['text', 'meta', 'input_ids', 'num_tokens'],
    num_rows: 4048
})


In [11]:
# we can see that two new columns, input_ids and num_tokens, have been created
# let's take an example and see what it looks like
sample = dataset[3]

print("text:", sample["text"][:30])
print("\ninput_ids:", sample["input_ids"][:30])
print("\nnum_tokens:", sample["num_tokens"])

text: The Colorado Climate Center pr

input_ids: [2, 133, 3004, 11001, 824, 1639, 8979, 8, 414, 15, 7635, 6, 7749, 6, 24400, 810, 6, 9007, 24263, 6, 12530, 6, 8, 70, 97, 2147, 5129, 2476, 4, 50118]

num_tokens: 488


In [14]:
# let's check the total number of tokens in the dataset
# even this single shard contains about 4.6 million tokens
# LLMs are trained on huge amounts of text, which easily amount to billions of tokens
import numpy as np
np.sum(dataset["num_tokens"])

4624174

In [15]:
# now, we will be packing the data
# let's concatenate input_ids for all examples into a single list
input_ids = np.concatenate(dataset["input_ids"])
print(len(input_ids))

4624174


In [17]:
# let's use a maximum sequence length of 32 for packing
# total_length is the largest multiple of max_seq_length that fits in the token count
max_seq_length = 32
total_length = len(input_ids) - len(input_ids) % max_seq_length
print(total_length)

4624160
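
The arithmetic behind `total_length` can be checked by hand with the token count printed earlier. This is a plain-Python sketch; `leftover` and `num_rows` are names introduced here for illustration.

```python
num_tokens = 4624174      # total tokens in the shard, from the earlier output
max_seq_length = 32

leftover = num_tokens % max_seq_length     # tokens that do not fill a full row
total_length = num_tokens - leftover       # largest multiple of 32 that fits
num_rows = total_length // max_seq_length  # number of packed sequences

print(leftover, total_length, num_rows)  # 14 4624160 144505
```

Only 14 tokens are dropped, which is negligible next to the 4.6 million that remain.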


In [18]:
# discard the extra tokens at the end so the number of tokens is exactly divisible by max_seq_length
input_ids = input_ids[:total_length]
print(input_ids.shape)

(4624160,)


In [19]:
# now create a new array of shape (num_examples, max_seq_length)
# and fill it with the input_ids
input_ids_reshaped = input_ids.reshape(-1, max_seq_length).astype(np.int32)
input_ids_reshaped.shape 

(144505, 32)
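
The concatenate, truncate, and reshape steps can be seen end to end on a toy input. This sketch uses plain Python lists instead of NumPy so that it runs without extra dependencies. `pack` is a hypothetical helper name and the token IDs are made up.

```python
def pack(sequences, max_seq_length):
    """Concatenate token-id lists and split them into fixed-length rows.

    Trailing ids that do not fill a complete row are dropped.
    """
    flat = [tok for seq in sequences for tok in seq]
    usable = len(flat) - len(flat) % max_seq_length
    return [flat[i:i + max_seq_length] for i in range(0, usable, max_seq_length)]

# Three made-up "documents", each wrapped in BOS/EOS id 2.
docs = [[2, 11, 12, 2], [2, 21, 22, 23, 2], [2, 31, 2]]
for row in pack(docs, max_seq_length=4):
    print(row)
# [2, 11, 12, 2]
# [2, 21, 22, 23]
# [2, 2, 31, 2]
```

Note that a packed row can span document boundaries: the last row holds the end of the second document and all of the third. The BOS/EOS IDs are what mark those boundaries inside a row.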

In [20]:
type(input_ids_reshaped)

numpy.ndarray

In [21]:
# now we will convert the packed array into a Hugging Face Dataset object
input_ids_list = input_ids_reshaped.tolist()
packaged_pretrain_dataset = datasets.Dataset.from_dict(
    {"input_ids": input_ids_list}
)
print(packaged_pretrain_dataset)

Dataset({
    features: ['input_ids'],
    num_rows: 144505
})


In [22]:
# Save the packed dataset to disk
packaged_pretrain_dataset.to_parquet("./data/packaged_pretrain_dataset.parquet")

19074660