# Data Preprocessing

Over the course of experimenting with fine-tuning TinyLlama, I realized that a lot of the complexity comes from the data pre-processing step. This notebook breaks down the pre-processing pipeline and calls attention to different possible considerations and approaches.

In pre-processing the data, it is important to understand (at least) the following:
- How is the raw data structured?
- What does a single example from the raw data look like?
- How do we need to transform the data for training?
- What does a single example in the transformed data look like?
- How do we need to tokenize the data for training?
- How do we batch the tokenized data for training?

## Setup

We're going to load the tokenizer and the dataset, but not the model. We won't need it in this notebook.

In [1]:
%pip install --upgrade -r ./tinyllama_requirements.txt


Note: you may need to restart the kernel to use updated packages.


In [2]:
CACHE_DIR = "../cache/TinyLlama/"  # path to the cache directory, where cache files will be saved

## Load the Tokenizer

In [3]:
from transformers import AutoTokenizer

# Load the tokenizer corresponding to the model checkpoint
model_ckpt = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt,
)

The tokenizer does not define a pad token. If we want to use padding (more on this later), we need to set one. Here we reuse the end-of-sequence token:

In [4]:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token

'</s>'

## Load the Dataset

In [5]:
from datasets import load_dataset
from pathlib import Path

slimorca = load_dataset("Open-Orca/SlimOrca", cache_dir=str(Path(CACHE_DIR) / "data"))

### How is the raw data structured?

What have we actually loaded here? We used the Hugging Face `datasets` library to load the dataset, and the resulting object is a `DatasetDict`. A `DatasetDict` is a dictionary-like object for managing datasets, where the keys are the names of the datasets and the values are the actual datasets. Usually the datasets are splits such as "train", "valid", and "test". In this case, we only have a "train" split, with only one feature, "conversations".

`DatasetDict`s enable us to map various pre-processing operations to the datasets they contain concisely and efficiently.

Let's take a look.

In [6]:
slimorca

DatasetDict({
    train: Dataset({
        features: ['conversations'],
        num_rows: 517982
    })
})

The `train` split within the `DatasetDict` is a `Dataset`:

In [7]:
slimorca["train"]

Dataset({
    features: ['conversations'],
    num_rows: 517982
})

### How do we need to transform the data for training?

Hugging Face `Dataset`s provide [various methods for querying, subsetting, and processing data](https://huggingface.co/docs/datasets/process), generally quite efficiently because they use the [Apache Arrow format](https://huggingface.co/docs/datasets/about_arrow). For example, if we want a validation set, we can split our training set into separate train/valid sets. The data are shuffled by default, so we don't need to shuffle as a separate step. Note that we apply this operation to the `Dataset`, not the `DatasetDict`.

The resulting object is a new `DatasetDict` with two keys: "train" and "test".

In [8]:
slimorca_split = slimorca["train"].train_test_split(test_size=0.1, seed=42)
slimorca_split

DatasetDict({
    train: Dataset({
        features: ['conversations'],
        num_rows: 466183
    })
    test: Dataset({
        features: ['conversations'],
        num_rows: 51799
    })
})
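
The row counts above can be checked by hand. This is a minimal sketch, assuming the test split is the requested fraction of rows rounded up and the train split takes the remainder. That rounding rule is an inference from the printed counts, not a statement about the library's internals.

```python
import math

n_rows = 517_982      # rows in the original "train" split (from the output above)
test_fraction = 0.1   # the test_size passed to train_test_split

# Assumption: the test split is rounded up, and train takes what is left.
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Under that assumption the two numbers match the `num_rows` values reported for the `train` and `test` splits.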

### What does a single example from the raw data look like?

Now let's look at some of the actual data examples. What do the `conversations` look like?

In [9]:
slimorca_split["train"][42]

{'conversations': [{'from': 'system',
   'value': 'You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.',
   'weight': None},
  {'from': 'human',
   'value': 'Q:Read the article and select the best answer. Article: Tina was not like many of her classmates. She didn\'t listen to popular music or watch many movies, and she wasn\'t interested in nice clothes. When she got together with her friends, they wanted to listen to rock and pop music. When Tina asked  if  they would  like  to  try classical    music, they all looked at her strangely."Classical music  is  for old people, " one of  her friends said. Tina was worried that something was wrong with her. She decided to talk to her father. As she entered his study  , her father could feel something was wrong. "Dad, am I strange?" she asked her father."Of course not, " he answered. "Why do you ask that?" "Because I don\'t like the same things as my classmates do. They want to li

The objects are dictionaries with information about the roles ("system", "human", or "gpt") and values making up the exchange. Our task is to map this to a format on which we can train the model.

The Hugging Face Transformers library provides convenient [chat model templates](https://huggingface.co/docs/transformers/main/en/chat_templating#templates-for-chat-models). The Hugging Face docs recommend applying the chat templates as a preprocessing step.

We won't go into too much detail about the concept of chat templates—you can read more [here](https://huggingface.co/blog/chat-templates). For now, just know that they provide a means of clearly indicating which part of a string came from the user and which part is the LLM's expected response. Models are trained on specific chat formats, and using different formats will generally result in bad responses at inference time.

Here's an example of a chat dictionary translated to a string with the chat template.

In [10]:
chat = [
    {
        "role": "system",
        "content": "You are a helpful assistant and an expert at making coffee.",
    },
    {"role": "user", "content": "How do I make coffee with a Chemex coffee maker?"},
    {
        "role": "assistant",
        "content": "To make coffee with a Chemex:\n1. Boil water to about 200°F (93°C).\n2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.\n3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.\n4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.\n5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.\n6. Once brewing is complete, remove the filter and enjoy.",
    },
]

print(tokenizer.apply_chat_template(chat, tokenize=False))


No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.



<s>[INST] <<SYS>>
You are a helpful assistant and an expert at making coffee.
<</SYS>>

How do I make coffee with a Chemex coffee maker? [/INST] To make coffee with a Chemex:
1. Boil water to about 200°F (93°C).
2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.
3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.
4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.
5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.
6. Once brewing is complete, remove the filter and enjoy. </s>


Notice that the chat template adds markers indicating the beginning/end of the chat (`<s>` and `</s>`), the beginning/end of the instruction (`[INST]` and `[/INST]`), and the beginning/end of the system prompt (`<<SYS>>` and `<</SYS>>`). We will add the instruction and system markers to our tokenizer as special tokens later.
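
To make the layout of the templated string concrete, here is a rough pure-Python sketch that mimics the output printed above. `render_llama_style_chat` is a hypothetical helper written for illustration; it is not part of Transformers, it only handles an optional system message followed by user/assistant turns, and the real template may differ in edge cases.

```python
BOS, EOS = "<s>", "</s>"


def render_llama_style_chat(chat):
    """Hypothetical helper: mimic the templated string shown above."""
    messages = list(chat)
    system = None
    if messages and messages[0]["role"] == "system":
        system = messages[0]["content"]
        messages = messages[1:]

    parts = []
    for i, message in enumerate(messages):
        content = message["content"].strip()
        if message["role"] == "user":
            # The system prompt is folded into the first user turn.
            if i == 0 and system is not None:
                content = f"<<SYS>>\n{system}\n<</SYS>>\n\n{content}"
            parts.append(f"{BOS}[INST] {content} [/INST]")
        elif message["role"] == "assistant":
            parts.append(f" {content} {EOS}")
        else:
            raise ValueError(f"unexpected role: {message['role']}")
    return "".join(parts)


toy_chat = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello."},
]
rendered = render_llama_style_chat(toy_chat)
print(rendered)
```

The sketch is only meant to show where each marker lands. For actual pre-processing, `tokenizer.apply_chat_template` remains the thing to call.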

Let's map the `conversations` to the expected input for the chat template. The Hugging Face chat model templates expect a list of dictionaries similar to `conversations` but with some notable differences. We need to replace `from` with `role`, `value` with `content`, `gpt` with `assistant`, and `human` with `user`. We will hold off on applying the chat template itself until later, as there are other considerations we still need to address.

In [11]:
def format_chat(ex):
    role_mapping = {"gpt": "assistant", "system": "system", "human": "user"}
    chat = [
        {"role": role_mapping[message["from"]], "content": message["value"]}
        for message in ex["conversations"]
    ]

    return {"chat": chat}


slimorca_split_formatted_chat = slimorca_split.map(format_chat, num_proc=32)
slimorca_split_formatted_chat

DatasetDict({
    train: Dataset({
        features: ['conversations', 'chat'],
        num_rows: 466183
    })
    test: Dataset({
        features: ['conversations', 'chat'],
        num_rows: 51799
    })
})

### What does a single example of the transformed data look like?

Now we have added a 'chat' key to each example in `slimorca_split_formatted_chat`. Note how we applied this transformation. We wrote a function that processed a single example and then used the `map` method to apply it to all examples in the `DatasetDict`. The `num_proc` parameter specifies how many processes to use for parallel processing, dramatically increasing the speed of the process.
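
Conceptually, `map` calls the function on each example and merges the returned dictionary into that example, which is why `chat` appears alongside `conversations` rather than replacing it. Here is a rough pure-Python sketch of that behavior on a toy row. The row contents are invented for illustration, and the real method also handles caching, batching, and parallelism.

```python
def format_chat(ex):
    # Same role/key mapping as the cell above.
    role_mapping = {"gpt": "assistant", "system": "system", "human": "user"}
    return {
        "chat": [
            {"role": role_mapping[m["from"]], "content": m["value"]}
            for m in ex["conversations"]
        ]
    }


# Toy stand-in for a dataset split: a list of row dictionaries.
toy_rows = [
    {
        "conversations": [
            {"from": "human", "value": "What is 2 + 2?", "weight": None},
            {"from": "gpt", "value": "4", "weight": None},
        ]
    }
]

# Approximation of map: keep the existing columns and add the returned ones.
mapped_rows = [{**row, **format_chat(row)} for row in toy_rows]

print(mapped_rows[0]["chat"])
```

Note that the `weight` key is not carried over into `chat`, since the chat template only needs `role` and `content`.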

Let's compare the original and formatted data.

In [12]:
print(slimorca_split_formatted_chat["train"][42]["conversations"])
print(slimorca_split_formatted_chat["train"][42]["chat"])

[{'from': 'system', 'value': 'You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.', 'weight': None}, {'from': 'human', 'value': 'Q:Read the article and select the best answer. Article: Tina was not like many of her classmates. She didn\'t listen to popular music or watch many movies, and she wasn\'t interested in nice clothes. When she got together with her friends, they wanted to listen to rock and pop music. When Tina asked  if  they would  like  to  try classical    music, they all looked at her strangely."Classical music  is  for old people, " one of  her friends said. Tina was worried that something was wrong with her. She decided to talk to her father. As she entered his study  , her father could feel something was wrong. "Dad, am I strange?" she asked her father."Of course not, " he answered. "Why do you ask that?" "Because I don\'t like the same things as my classmates do. They want to listen to Mariah Carey\'s music

## Tokenize the Data

As mentioned above, our instruction formatting includes some special tokens we would like to add to the tokenizer's vocabulary. We can do that as follows:

In [13]:
# Add the instruction and system markers to the tokenizer as special tokens.
# The call returns the number of tokens that were added.
special_tokens = ["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

4

A few questions might occur to you at this point:
> Why don't we add the `<s>` and `</s>` tokens to the tokenizer? Those were used in the chat formatting too!

We don't need to add those tokens because they are already the tokenizer's `bos` (beginning of sequence) and `eos` (end of sequence) tokens, as shown below.

> What exactly does it mean to "add tokens to the tokenizer"?

Adding tokens to the tokenizer expands the tokenizer's *vocabulary*, the list of possible tokens that the model can use. Tokens added this way won't be split into smaller units. Before adding `[/INST]` to the vocabulary, for example, it could only be formed through the following combination of tokens: `['[', '/', 'INST', ']']`. Adding it allows the model to treat `[/INST]` as a single, distinct semantic unit: it only needs to learn one token to recognize the end of an instruction, not a sequence of four tokens that might also be used in other contexts. Furthermore, since all of the sequences we want to train on will include the special tokens above, there are some efficiency gains: each of these tokens will take up only one unit of context length rather than multiple.

In [14]:
# the <s> and </s> tokens are already part of the vocabulary
tokenizer.bos_token, tokenizer.eos_token

('<s>', '</s>')
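
To make the vocabulary point concrete, here is a toy longest-match tokenizer in pure Python. It is a sketch for illustration only: `greedy_tokenize` and `toy_vocab` are hypothetical, and this is not the algorithm the real tokenizer uses. It only shows how adding a string to the vocabulary turns several pieces into one token.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer; not the real tokenizer's algorithm."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


toy_vocab = {"[", "/", "INST", "]"}
before = greedy_tokenize("[/INST]", toy_vocab)

toy_vocab.add("[/INST]")  # analogous to adding a special token
after = greedy_tokenize("[/INST]", toy_vocab)

print(before, after)
```

Before the addition the marker costs four tokens; afterwards it costs one.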

### Structure the Tokenized Data

Before we actually tokenize the chat data in our `DatasetDict`, we need to make some decisions about how to structure the tokenized data. So far, we've been thinking in terms of individual *examples* from the training data. This isn't necessarily the most relevant perspective for the model. We should think in terms of *sequences* of tokens and *batches* of sequences.
- A *sequence* is a single list of tokens from the training data, usually limited to some uniform length. Sequences of the desired length can be formed by starting with a single example and *padding* it with the tokenizer's padding token to make it the desired length if it is too short, or by *truncating* it if it is too long. Another option is sequence *packing*, wherein multiple shorter sequences are packed into a single longer sequence.
- A *batch* is a list of sequences. During training, the model is fed a batch of sequences at a time.
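
The three ways of forming sequences described above can be sketched on plain lists of token ids. This is a minimal illustration, not the approach the notebook settles on: the ids, `MAX_LEN`, and the helper names `pad_or_truncate` and `pack` are all hypothetical, and the pad and end-of-sequence ids are kept distinct here only to make the output easier to read.

```python
PAD_ID, EOS_ID = 0, 2   # hypothetical ids, for illustration only
MAX_LEN = 6


def pad_or_truncate(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = ids[:max_len]                      # truncate if too long
    n_pad = max_len - len(ids)               # pad if too short
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


def pack(examples, max_len=MAX_LEN, eos_id=EOS_ID):
    """Concatenate examples (separated by eos_id) and cut into full sequences."""
    stream = []
    for ids in examples:
        stream.extend(ids + [eos_id])
    # Drop the final remainder that does not fill a whole sequence.
    n_full = len(stream) // max_len
    return [stream[k * max_len:(k + 1) * max_len] for k in range(n_full)]


examples = [[5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15], [16, 17]]

batch = [pad_or_truncate(ids) for ids in examples]  # one batch of three sequences
packed = pack(examples)

print(batch)
print(packed)
```

Padding and truncation keep one example per sequence at the cost of wasted positions or lost tokens, while packing fills every position but lets one sequence span several examples.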



## Collate

## Decisions and Questions