<a href="https://colab.research.google.com/github/datacraft-paris/2311-Cerisara-LLM/blob/main/Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. [Introduction](https://www.google.com/)
2. [A Brief Overview of LLMs](https://colab.research.google.com/github/datacraft-paris/2311-Cerisara-LLM/blob/main/LLMs.ipynb)
3. Preparing the data (This notebook)
    1. [Setup](#setup)
    2. [Downloading the data](#download)
    3. [Preprocess the data](#preprocess)
      1. [Tokenization](#tokenize)
      2. [Group the texts](#group)
    4. [Run the preparation](#run)

# Setup <a name="setup"></a>

In [None]:
!pip install datasets transformers # We use Hugging Face datasets (easy to use, and the Hub hosts many large text corpora for training LLMs) and transformers for the tokenizer.

In [None]:
import os
from typing import Iterator, Dict, List, Tuple
from itertools import chain
from tqdm.notebook import tqdm
from pathlib import Path
from datasets import load_dataset, load_from_disk, concatenate_datasets, Features, Dataset, DatasetDict, features
from transformers import AutoTokenizer

# Download the data <a name="download"></a>

Training LLMs involves huge corpora (more than 1 trillion tokens). For these experiments, however, we only need about 100 million tokens.

We will use RedPajama, a dataset for training LLMs that contains more than 1 trillion tokens. The full corpus is far too large to download and process here, so we only extract a subset of it.

The following cells extract a subset of RedPajama in an efficient way.

In [None]:
import ast

# Stream the data so we don't need to download the whole corpus (1T tokens).
def stream_data(corpus: str) -> Iterator[str]:
    """Streams the texts of one RedPajama corpus from the Hugging Face Hub."""
    dataset = load_dataset("togethercomputer/RedPajama-Data-1T",
                            corpus,
                            streaming=True)
    for item in dataset["train"]:
        # "meta" is a string; literal_eval parses it without executing arbitrary code (unlike eval).
        metadata = ast.literal_eval(item["meta"])
        if "language" not in metadata:
            raise ValueError(f"The data '{corpus}' doesn't contain any information about languages.")
        if corpus != "github" and metadata["language"] != "en": # keep only English texts (the filter is skipped for 'github')
            continue
        yield item["text"]

# only get a subset of items.
def subset_dataset(subset: int,
                   corpus: str
                   ) -> Iterator[Dict[str, str]]:
    """Extract only a subset of the whole dataset."""
    for idx, item in tqdm(enumerate(stream_data(corpus), 1),
                          total=subset,
                          desc=corpus):
        yield {
            "text": item,
            "corpus": corpus
        }
        if idx == subset:
            break

# Create the huggingface dataset
def dataset_from_generator(subset_mapping: Dict[str, int]) -> DatasetDict:
    """Creates a Hugging Face DatasetDict with one split per corpus."""
    data_features = Features({
        "text": features.Value("string"),
        "corpus": features.Value("string")
    })

    dataset = DatasetDict({corpus: Dataset.from_generator(
                                        subset_dataset,
                                        features=data_features,
                                        gen_kwargs={"subset": subset_mapping[corpus], "corpus": corpus}
                                        )
                                    for corpus in subset_mapping})

    return dataset

In [None]:
# We use a different subset size for each corpus because some of them have longer documents.
# This means that, for the same number of documents, they don't contain the same number of tokens.
# For example, 'arxiv' tends to have longer documents compared to 'c4' or 'stackexchange'.
dataset = dataset_from_generator({"c4": 256_000,
                                  "arxiv": 16_000,
                                  "stackexchange": 256_000,
                                  "github": 48_000})
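
The streaming pattern used in this section can be illustrated without any download. The sketch below is only an illustration: `FAKE_STREAM` and `english_texts` are hypothetical stand-ins for the streamed dataset and for `stream_data`, and the metadata strings are invented. It parses the metadata with `ast.literal_eval`, which only accepts Python literals, and stops after a fixed number of items with `itertools.islice`.

```python
import ast
from itertools import islice
from typing import Dict, Iterable, Iterator

# Hypothetical stand-in for a streamed corpus: each record has a "text" field
# and a "meta" field that holds a Python-literal string.
FAKE_STREAM = [
    {"text": "Hello world", "meta": "{'language': 'en'}"},
    {"text": "Bonjour le monde", "meta": "{'language': 'fr'}"},
    {"text": "Streaming is lazy", "meta": "{'language': 'en'}"},
    {"text": "Never reached", "meta": "{'language': 'en'}"},
]

def english_texts(records: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Lazily yield the text of the records whose metadata says 'en'."""
    for record in records:
        metadata = ast.literal_eval(record["meta"])  # parses literals only, runs no code
        if metadata.get("language") == "en":
            yield record["text"]

# islice stops pulling from the generator after two items,
# so the fourth record is never parsed.
subset = list(islice(english_texts(FAKE_STREAM), 2))
print(subset)  # ['Hello world', 'Streaming is lazy']
```

Because generators are lazy, only the records needed to produce the requested number of items are ever read, which is what makes extracting a small subset of a 1T-token corpus practical.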

# Preprocess the data <a name="preprocess"></a>

## Tokenization <a name="tokenize"></a>

The tokenizer is one of the most important parts of an LLM, because it splits the text into the meaningful pieces of words that the model receives as input.

Tokenizing big corpora can be quite slow, so we use multiprocessing to parallelize the process.

In [None]:
print("Number of CPUs:", os.cpu_count())

In [None]:
def tokenize_dataset(tokenizer,
                     dataset: Dataset,
                     target_column: str="text",
                     return_attention_mask: bool=False,
                     batched: bool=True,
                     batch_size: int=64
                     ) -> Dataset:
    """Tokenize the dataset."""
    tokenized_dataset = dataset.map(
        # The tokenizer receives a batch (a list) of texts and returns a dict of new columns.
        lambda examples: tokenizer(examples[target_column],
                                   return_attention_mask=return_attention_mask),
        num_proc=os.cpu_count(),  # one worker process per CPU
        batched=batched,
        batch_size=batch_size,
        desc="Running tokenizer on dataset",
    )
    return tokenized_dataset
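
To make the batched contract more concrete, here is a minimal pure-Python sketch. Everything in it is hypothetical: `toy_tokenizer` is a whitespace tokenizer that grows its vocabulary on the fly (a real tokenizer has a fixed subword vocabulary), and `batched_map` only imitates the idea of calling a function on slices of the columns. It is not how `datasets` is implemented.

```python
from typing import Callable, Dict, List

VOCAB: Dict[str, int] = {}  # hypothetical vocabulary, filled on the fly

def toy_tokenizer(texts: List[str]) -> Dict[str, List[List[int]]]:
    """Map every lowercase word of every text to an integer id."""
    return {
        "input_ids": [
            [VOCAB.setdefault(word, len(VOCAB)) for word in text.lower().split()]
            for text in texts
        ]
    }

def batched_map(columns: Dict[str, list],
                fn: Callable[[Dict[str, list]], Dict[str, list]],
                batch_size: int) -> Dict[str, list]:
    """Call fn on slices of batch_size rows and collect the new columns."""
    num_rows = len(next(iter(columns.values())))
    new_columns: Dict[str, list] = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}

data = {"text": ["the cat sat", "the dog sat", "a cat"]}
tokenized = batched_map(data, lambda batch: toy_tokenizer(batch["text"]), batch_size=2)
print(tokenized["input_ids"])  # [[0, 1, 2], [0, 3, 2], [4, 1]]
```

The function passed to the map step sees a dict of columns (lists), not a single row, and returns a dict of new columns with one entry per input row.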

## Group the texts <a name="group"></a>

Batching the forward pass through the LLM is necessary to use the GPU efficiently. The goal is to fill the GPU memory as much as possible with real tokens only (no padding tokens). A common way to do that is to group the texts into chunks of _n_ tokens.

In [None]:
def group_texts(dataset: Dataset,
                max_length: int=512,
                batch_size: int=128,
                return_labels: bool=False
                ) -> Dataset:
    """Grouping texts to max_length."""
    def group(examples):
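        # Concatenate every column of the batch into one flat list (the "corpus" column is left as is).
        concatenated_examples = {k: list(chain(*examples[k])) if k != "corpus" else examples[k]
                                 for k in examples.keys()}
        total_length = len(concatenated_examples["input_ids"])
        # Drop the trailing remainder so that every chunk has exactly max_length tokens.
        total_length = (total_length // max_length) * max_length
        # Split into chunks of max_length.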
        result = {
            k: [t[i : i + max_length] for i in range(0, total_length, max_length)]
            for k, t in concatenated_examples.items()
        }
        if return_labels:
            result["labels"] = result["input_ids"].copy()
        return result
    dataset = dataset.map(
        group,
        batched=True,
        batch_size=batch_size,
        num_proc=os.cpu_count(),
        desc=f"Grouping texts in chunks of {max_length}",
        )
    return dataset
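
The grouping step can be sketched on plain lists of token ids. This is a simplified illustration, not the notebook's function: `group_token_ids` is a hypothetical helper that handles a single list of sequences, whereas `group_texts` applies the same idea to each batch of a dataset.

```python
from itertools import chain
from typing import List

def group_token_ids(sequences: List[List[int]], max_length: int) -> List[List[int]]:
    """Concatenate token sequences and cut them into chunks of exactly max_length."""
    flat = list(chain.from_iterable(sequences))
    usable = (len(flat) // max_length) * max_length  # drop the trailing remainder
    return [flat[i:i + max_length] for i in range(0, usable, max_length)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]  # three "documents" of different lengths
chunks = group_token_ids(docs, max_length=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Every chunk has the same length, so no padding is needed, but the last tokens (9 and 10 here) are discarded. In `group_texts` the remainder is dropped once per batch of `batch_size` documents, so a small `batch_size` discards a larger share of the tokens.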

# Run the preparation <a name="run"></a>

In [None]:
def prepare_data(dataset: DatasetDict,
                 subset: int=48_000,
                 max_length: int=512,  # must match the max_length used in group_texts
                 ) -> Tuple[Dataset, Dataset]:
    """Keep at most `subset` chunks of `max_length` tokens per corpus, then split into train/test."""
    for corpus in dataset:
        dataset[corpus] = dataset[corpus].remove_columns("corpus").add_column("corpus", [corpus] * len(dataset[corpus]))
        if subset is None or len(dataset[corpus]) <= subset:
            continue
        dataset[corpus] = dataset[corpus].select(range(subset))
    dataset = dataset.shuffle()
    train = {}
    test = {}
    for corpus in dataset:
        train_test = dataset[corpus].train_test_split(0.2)
        train[corpus] = train_test["train"]
        test[corpus] = train_test["test"]
    train = DatasetDict(train)
    test = DatasetDict(test)
    # concatenate_datasets expects a list of datasets
    train = concatenate_datasets(list(train.values()))
    test = concatenate_datasets(list(test.values()))
    train_tokens = (len(train) * max_length)
    test_tokens = (len(test) * max_length)
    print(f"Number of training tokens: {train_tokens:,}. Number of testing tokens: {test_tokens:,}")
    return train, test
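
The logic of this function (cap each corpus, shuffle, split, count tokens) can be sketched with plain lists. The helper `cap_and_split` and the toy data are hypothetical, and the rounding of the test size is a simplification that may differ from the library's.

```python
import random
from typing import Dict, List, Tuple

Chunk = List[int]

def cap_and_split(chunks_per_corpus: Dict[str, List[Chunk]],
                  cap: int,
                  test_fraction: float = 0.2,
                  seed: int = 0,
                  ) -> Tuple[List[Chunk], List[Chunk]]:
    """Keep at most `cap` chunks per corpus, shuffle them, then split each corpus."""
    rng = random.Random(seed)
    train: List[Chunk] = []
    test: List[Chunk] = []
    for chunks in chunks_per_corpus.values():
        capped = chunks[:cap]  # slicing copies, so the input is left untouched
        rng.shuffle(capped)
        n_test = int(len(capped) * test_fraction)  # rounds down (a simplification)
        test.extend(capped[:n_test])
        train.extend(capped[n_test:])
    return train, test

max_length = 4  # hypothetical chunk size
toy = {
    "c4": [[i] * max_length for i in range(10)],    # 10 chunks
    "arxiv": [[i] * max_length for i in range(5)],  # 5 chunks
}
train, test = cap_and_split(toy, cap=5)
print(len(train) * max_length, len(test) * max_length)  # 32 8
```

Since every chunk holds exactly `max_length` tokens, the number of tokens is simply the number of chunks times `max_length`. This only holds if the value used for counting is the one that was used for grouping.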

Now that we have defined all the required functions, we can run the data preparation.

In your opinion, why do we group the sequences?

In [None]:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
tokenized_dataset = tokenize_dataset(tokenizer=tokenizer,
                                     dataset=dataset,
                                     batch_size=2)
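max_length = 512  # chunk size: use the same value for grouping and for the token count
grouped_dataset = group_texts(dataset=tokenized_dataset, max_length=max_length, batch_size=2)
train, test = prepare_data(dataset=grouped_dataset, max_length=max_length)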
# TODO: save the train and test dataset in two different folders.