# 00 - Generate the training dataset for MaximusLLM


## a) Install the Bonito package


This package is required to process the documents and create the training dataset.
Make sure to clone the `v0.0.1` tag, as in the command below.


In [2]:
!cd ../modules/ && git clone https://github.com/BatsResearch/bonito.git --branch v0.0.1

fatal: destination path 'bonito' already exists and is not an empty directory.


Before installing, edit the `setup.py` file in the cloned repository and pin its requirements as follows, to avoid conflicts with existing dependencies:

```
requirements = [
    "transformers == 4.42.0",
    "datasets == 2.20.0",
    "vllm == 0.5.1",
]
```


In [3]:
!pip install -U ../modules/bonito/

Processing /home/franck/Sandbox/03 - Awels Engineering/MaximusLLM/modules/bonito
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bonito
  Building wheel for bonito (setup.py) ... done
  Created wheel for bonito: filename=bonito-0.0.1-py3-none-any.whl size=4100 sha256=1a61831186ba7dd78e7be1a1d56fd57bec323f4f54b318f87b862dd4d3363418
  Stored in directory: /tmp/pip-ephem-wheel-cache-k6z7ormx/wheels/70/79/bf/6c2ac6529bb8bd8e02588d1c16da544cc459259bda1f12e0f8
Successfully built bonito
Installing collected packages: bonito
  Attempting uninstall: bonito
    Found existing installation: bonito 0.0.1
    Uninstalling bonito-0.0.1:
      Successfully uninstalled bonito-0.0.1
Successfully installed bonito-0.0.1
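
Because the pins in `setup.py` are edited by hand, it can help to confirm which versions are actually installed. The following is a minimal sketch and not part of the original notebook: `PINNED` and `find_mismatches` are names introduced here for illustration, and the check relies only on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the edited setup.py shown earlier
PINNED = {"transformers": "4.42.0", "datasets": "2.20.0", "vllm": "0.5.1"}


def find_mismatches(pinned, get_version=version):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = installed
    return mismatches


# An empty dictionary means every pinned package is installed at the pinned version
print(find_mismatches(PINNED))
```

The `get_version` parameter exists only so that the lookup can be replaced in a test; in normal use the default is enough.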


## b) Import all the required libraries


The following libraries are required:

- **os** standard library module for interacting with the operating system
- **torch** (PyTorch) library for deep learning, used to retrain the Phi 3 model
- **bonito** library for processing the documents and creating the training dataset
- **fitz** (PyMuPDF) library for PDF document handling
- **datasets** library for handling and processing datasets
- **spaCy** library for natural language processing
- **pandas** library for data manipulation and analysis
- **huggingface_hub** library for accessing models and datasets on the Hugging Face Hub


In [4]:
import os
import uuid
import hashlib
import torch
import bonito
from bonito import Bonito, SamplingParams
import fitz
import dotenv
import datasets as ds
from datasets import Dataset, DatasetDict, Value
from datasets import load_dataset
import pandas as pd
import huggingface_hub as hf
from huggingface_hub import notebook_login, login
from huggingface_hub import create_repo
from huggingface_hub import Repository
from sklearn.model_selection import train_test_split

Download the `en_core_web_sm` English model and import the spaCy library for natural language processing.


In [5]:
!python -m spacy download en_core_web_sm
import spacy

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 11.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Now load the environment variables containing the various tokens (such as `HF_TOKEN`, used later to log into the Hugging Face Hub).


In [6]:
dotenv.load_dotenv()

True
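
The `True` above does not by itself show that a particular token is present. As a minimal sketch, a small helper can fail early with a readable message when a variable is missing. `require_env` is a hypothetical name introduced here; `HF_TOKEN` is the variable this notebook reads later.

```python
import os


def require_env(name, environ=os.environ):
    """Return the stripped value of an environment variable, or raise if it is missing or empty."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; define it in the .env file or export it first")
    return value


# Demonstration on a plain dictionary, so the block runs without a real token
print(require_env("HF_TOKEN", {"HF_TOKEN": " example-value "}))
```

In the notebook itself, `require_env("HF_TOKEN")` could be called right after `dotenv.load_dotenv()`.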

## c) Declare the functions to be used in the processing and creation of the training dataset


This function opens a PDF file located at the specified path, reads the text content from each page, and concatenates all the text into a single string.


In [7]:
def extract_text_from_pdf(pdf_path):
    """
    Extract text from each page of a PDF file.

    Parameters:
    pdf_path (str): The file path to the PDF file.

    Returns:
    str: The concatenated text extracted from all pages of the PDF.
    """
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text
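
One possible variation, sketched under assumptions, is to collect the page texts and join them once, which also makes it easy to choose a separator between pages. `StubPage` is a hypothetical stand-in so that the block runs without a PDF file; it models only the `get_text()` method that the function above calls on each page.

```python
class StubPage:
    """Hypothetical stand-in for a PDF page; only get_text() is modelled."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


def join_page_texts(pages, separator=""):
    """Concatenate the text of every page, inserting `separator` between pages."""
    return separator.join(page.get_text() for page in pages)


pages = [StubPage("First page.\n"), StubPage("Second page.\n")]
print(repr(join_page_texts(pages)))
```

With PyMuPDF, the document returned by `fitz.open(pdf_path)` would be passed in place of the stub list.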

This function splits a text into sentences with spaCy. It assumes that a global `nlp` object (typically a language pipeline from the spaCy library) is already defined and has been loaded with the appropriate language model (e.g., `nlp = spacy.load('en_core_web_sm')`). The `nlp` object must have sentence segmentation enabled.


In [8]:
def split_into_sentences(text):
    """
    Splits a given text into individual sentences using natural language processing.

    Args:
        text (str): The input text to be split into sentences.

    Returns:
        list: A list of strings, where each string is a sentence from the input text.
    
    Example:
        >>> text = "Hello world. This is a test."
        >>> split_into_sentences(text)
        ['Hello world.', 'This is a test.']
    """
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    return sentences

This function generates a unique identifier (ID) using the UUID (Universally Unique Identifier) library in Python.


In [9]:
def generate_uuid():
    """
    This function generates a unique identifier (ID) using the UUID (Universally Unique Identifier) library in Python.

    Steps:
    1. Creates a new UUID using the `uuid.uuid4()` function.
    2. Converts the UUID to a string.
    3. Encodes the string to bytes using the UTF-8 encoding.
    4. Creates a SHA-256 hash object using the bytes from the previous step.
    5. Obtains the hexadecimal representation of the hash using the `hexdigest()` method.
    6. Takes the first 64 characters of the hexadecimal hash (the whole digest, since a SHA-256 hex digest is exactly 64 characters long) and returns it as the unique ID.

    The function is useful for generating unique IDs for various purposes, such as identifying data records, transactions, or prompts in a system.
    Uniqueness comes from the random UUID; hashing it with SHA-256 turns it into a fixed-length hexadecimal string, so the generated IDs remain highly unlikely to collide,
    even if multiple IDs are generated simultaneously or over a long period of time.
    """
    uuid_obj = uuid.uuid4()
    uuid_str = str(uuid_obj)
    uuid_bytes = uuid_str.encode('utf-8')
    hash_obj = hashlib.sha256(uuid_bytes)
    hash_str = hash_obj.hexdigest()
    id_unique = hash_str[:64]
    return id_unique
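
The properties described in the docstring can be checked with a few lines. This is a minimal sketch that condenses the same steps into one expression; the sample size of 1000 is an arbitrary choice for illustration.

```python
import hashlib
import uuid


def generate_uuid():
    """Same steps as the function above, condensed into one expression."""
    return hashlib.sha256(str(uuid.uuid4()).encode("utf-8")).hexdigest()[:64]


# A SHA-256 hex digest is 64 characters long, so the [:64] slice keeps all of it
digest = hashlib.sha256(b"example").hexdigest()
print(len(digest), digest[:64] == digest)

# Every generated ID has the same length, and a batch of IDs contains no duplicates
ids = [generate_uuid() for _ in range(1000)]
print(len(ids[0]), len(set(ids)) == len(ids))
```

If a hashed value is not specifically needed, `uuid.uuid4().hex` already gives a random 32-character hexadecimal identifier.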

This function is designed to add new columns to a given dictionary (example) based on the index (idx) and two lists (prompts and prompt_ids).


In [10]:
def add_columns(example, idx, prompts, prompt_ids):
    """
    Here's a breakdown of what it does:
        1. It adds a new key-value pair to the dictionary where the key is 'prompt' and the value is the element from the `prompts` list at the index `idx`.
        2. It adds another key-value pair to the dictionary where the key is 'prompt_id' and the value is the element from the `prompt_ids` list at the index `idx`.
        3. It creates a new key-value pair in the dictionary where the key is 'messages' and the value is a list of two dictionaries. The first dictionary has 'role' as 'assistant' and 'content' as the value of 'input' from the original dictionary. The second dictionary has 'role' as 'user' and 'content' as the value of 'output' from the original dictionary.
        4. It then removes the 'input' and 'output' keys from the dictionary.
        5. Finally, it returns the modified dictionary.
    """
    example['prompt'] = prompts[idx]
    example['prompt_id'] = prompt_ids[idx]
    example['messages'] = [
        {'role': 'assistant', 'content': example['input']},
        {'role': 'user', 'content': example['output']}
    ]
    # Remove the input and output columns
    del example['input']
    del example['output']
    return example
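
To see the effect on a single record, the function can be applied to a plain dictionary. This is a minimal sketch: the function body is repeated so that the block runs on its own, and the row, prompt, and ID values are toy data invented for the illustration.

```python
def add_columns(example, idx, prompts, prompt_ids):
    """Same logic as the function above, repeated so the block runs on its own."""
    example['prompt'] = prompts[idx]
    example['prompt_id'] = prompt_ids[idx]
    example['messages'] = [
        {'role': 'assistant', 'content': example['input']},
        {'role': 'user', 'content': example['output']},
    ]
    del example['input']
    del example['output']
    return example


# Toy values, not taken from the real dataset
row = {'input': 'Write a question about Maximo Manage.', 'output': 'Question: What is Maximo Manage?'}
prompts = ['You are a helpful assistant.']
prompt_ids = ['id-0']

# Pass a copy, because the function modifies the dictionary it receives
result = add_columns(dict(row), 0, prompts, prompt_ids)
print(list(result))
print(result['messages'])
```

As written, the instruction text from `input` lands in the `assistant` turn and the generated text from `output` in the `user` turn. Whether that orientation is the intended one for fine-tuning is worth double-checking; swapping the two role strings would reverse it.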

This function, generate_prompts_and_ids, is designed to create a list of prompts and a corresponding list of unique identifiers (UUIDs) for a given dataset length.


In [11]:
def generate_prompts_and_ids(dataset_length):
    """
    Here's a breakdown of what the function does:
        1. It takes an argument dataset_length, which is the number of prompts and UUIDs to generate.
        2. It creates a list of prompts where each prompt is a string that says "You are the Maximo Application Suite helpful assistant and you must answer the user questions." This list is repeated dataset_length times.
        3. It creates a list of UUIDs (unique identifiers) by calling the generate_uuid() function dataset_length times. Each UUID is a unique string that can be used to identify a specific prompt.
        4. Finally, it returns a tuple containing the list of prompts and the list of UUIDs.
    """
    prompts = ["You are the Maximo Application Suite helpful assistant and you must answer the user questions."] * dataset_length
    prompt_ids = [generate_uuid() for _ in range(dataset_length)]
    return prompts, prompt_ids
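
A short run shows the shape of the two returned lists. This is a minimal sketch: `SYSTEM_PROMPT` is a name introduced here for readability, and `generate_uuid` is condensed from the function defined earlier so that the block is self-contained.

```python
import hashlib
import uuid

SYSTEM_PROMPT = "You are the Maximo Application Suite helpful assistant and you must answer the user questions."


def generate_uuid():
    """Condensed version of the ID helper defined earlier."""
    return hashlib.sha256(str(uuid.uuid4()).encode("utf-8")).hexdigest()[:64]


def generate_prompts_and_ids(dataset_length):
    """Return the same prompt repeated, plus one fresh ID per row."""
    prompts = [SYSTEM_PROMPT] * dataset_length
    prompt_ids = [generate_uuid() for _ in range(dataset_length)]
    return prompts, prompt_ids


prompts, prompt_ids = generate_prompts_and_ids(3)
for prompt, prompt_id in zip(prompts, prompt_ids):
    # Show a short prefix of each ID next to the start of its prompt
    print(prompt_id[:8], prompt[:40])
```

Every row gets the same prompt text, so only the IDs distinguish one row from another.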

This function splits a given dataset into three parts: a training dataset, an evaluation dataset, and a test dataset. It takes three parameters: the dataset to be split, `test_size` (default 0.2), and `eval_size` (default 0.1). Note how the two sizes are applied: `test_size` is the fraction of the whole dataset held out by the first split, and that held-out part is then divided again, with a fraction `eval_size / (eval_size + train_size)` of it going to the evaluation dataset and the rest to the test dataset. With the defaults this gives roughly 80% of the rows for training, 2.5% for evaluation, and 17.5% for testing, rather than 70%, 10%, and 20%.


In [12]:
def split_dataset(dataset, test_size=0.2, eval_size=0.1):
    train_size = 1 - test_size - eval_size

    # Split the dataset into train and test sets
    train_test_dataset = dataset.train_test_split(test_size=test_size)

    # Keep the larger part as the train set
    train_dataset = train_test_dataset['train']
    # Split the held-out part only once, so that eval and test cannot share rows
    eval_test_dataset = train_test_dataset['test'].train_test_split(test_size=eval_size / (eval_size + train_size))
    eval_dataset = eval_test_dataset['test']
    test_dataset = eval_test_dataset['train']

    result = {
        'train': train_dataset,
        'eval': eval_dataset,
        'test': test_dataset
    }

    #return Dataset.from_dict(train_dataset), Dataset.from_dict(eval_dataset), Dataset.from_dict(test_dataset)
    return result
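
The arithmetic behind the split sizes can be worked through without the `datasets` library. The following is a minimal sketch, assuming that a fractional `test_size` is rounded up to a whole number of rows, which reproduces the row counts printed later in this notebook. Both function names are introduced here for illustration; the second one shows one way to obtain sizes that are fractions of the whole dataset.

```python
import math


def sizes_as_written(n_rows, test_size=0.2, eval_size=0.1):
    """Row counts (train, eval, test) produced by the nested split in split_dataset."""
    train_size = 1 - test_size - eval_size
    held_out = math.ceil(test_size * n_rows)  # first split: rows taken out of train
    train = n_rows - held_out
    # second split: a fraction of the held-out rows becomes the eval set
    eval_rows = math.ceil(eval_size / (eval_size + train_size) * held_out)
    return train, eval_rows, held_out - eval_rows


def sizes_for_fractions(n_rows, test_size=0.2, eval_size=0.1):
    """Row counts when test_size and eval_size are fractions of the whole dataset."""
    test = math.ceil(test_size * n_rows)
    remaining = n_rows - test
    # eval_size is rescaled because it is taken from the remaining rows only
    eval_rows = math.ceil(eval_size / (1 - test_size) * remaining)
    return remaining - eval_rows, eval_rows, test


print(sizes_as_written(15859))     # (12687, 397, 2775)
print(sizes_for_fractions(15859))  # (11101, 1586, 3172)
```

In the second variant the evaluation rows are carved out of the training side of the first split, not out of the held-out test rows.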

## d) Prepare the dataset based on the PDF document


In this section, we will load the PDF file and create the dataset with the Bonito LLM. Multiple-choice questions will be generated.


In [13]:
# Load the small English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")

# Increase the limit to 2,000,000 characters
nlp.max_length = 2000000 

In [14]:
# Extract all the text from the PDF document
PDF_FILE = '../data/raw/master-map.pdf'
text = extract_text_from_pdf(PDF_FILE)

In [15]:
# Split the document into sentences using the split_into_sentences function
sentences = split_into_sentences(text)

Check the number of sentences that were created and print the sentence at index 500.


In [16]:
print(len(sentences))

15864


In [17]:
print(sentences[500])

Enhancement to delete users
Starting in Maximo Application Suite 8.11, the system retains the details of the deleted users.


## e) Generate the synthetic dataset in (input/ output) format


We now transform the sentences into a format suitable for the Hugging Face datasets library.


In [18]:
# Assuming sentences is a list of strings, where each string is a sentence
data = {"sentence": sentences}
dataset = Dataset.from_dict(data)

print(dataset)

Dataset({
    features: ['sentence'],
    num_rows: 15864
})


Now that the sentences have been extracted from the PDF document, we can proceed to generate the synthetic dataset using the Bonito LLM.
We first initialize the Bonito model.


In [19]:
bonito = Bonito("BatsResearch/bonito-v1", max_model_len=16384)

INFO 07-08 19:49:35 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='BatsResearch/bonito-v1', speculative_config=None, tokenizer='BatsResearch/bonito-v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=BatsResearch/bonito-v1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-08 19:49:35 selector.py:172] Cannot use FlashAttention-2 backend due to sliding window.
INFO 07-08 19:49:35 selector.py:53] Using XForm

Now we initialize the sampling parameters used for generation: at most 256 new tokens per prompt, nucleus sampling with `top_p=0.95`, a temperature of 0.5, and one completion per prompt (`n=1`).


In [20]:
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
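
To give an intuition for `temperature` and `top_p`, here is a conceptual sketch in plain Python. It is not how vLLM implements sampling; the function name and the toy logits are invented for the illustration. The logits are divided by the temperature, turned into probabilities with a softmax, and the most probable tokens are kept until their cumulative probability reaches `top_p`.

```python
import math


def nucleus_candidates(logits, temperature=0.5, top_p=0.95):
    """Return the token indices that stay eligible for sampling, most probable first."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [value / total for value in exps]

    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(nucleus_candidates(toy_logits, temperature=0.5))  # [0, 1]
print(nucleus_candidates(toy_logits, temperature=1.0))  # [0, 1, 2]
```

A lower temperature sharpens the distribution, so fewer tokens are needed to reach the `top_p` mass and the output becomes less varied.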

Finally, we create the synthetic dataset with the Bonito model by setting the `task_type` parameter to 'qg', which stands for question generation.


In [21]:
synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qg", #qg : question generation
    sampling_params=sampling_params
)

Map:   0%|          | 0/15864 [00:00<?, ? examples/s]

Processed prompts: 100%|██████████| 15864/15864 [05:14<00:00, 50.39it/s, est. speed input: 2828.05 toks/s, output: 2584.31 toks/s] 


Filter:   0%|          | 0/15864 [00:00<?, ? examples/s]

Map:   0%|          | 0/15859 [00:00<?, ? examples/s]

We now check the content of the synthetic dataset. The filter step dropped 5 of the 15864 rows, leaving 15859.


In [22]:
# First describe the synthetic dataset
print(synthetic_dataset)


Dataset({
    features: ['input', 'output'],
    num_rows: 15859
})


In [23]:
# Print the first 5 rows of the datasets
df = pd.DataFrame(synthetic_dataset)
print(df.head(5)) 


                                               input  \
0  Write a multi-choice question for the followin...   
1  Write a multi-choice question for the followin...   
2  Write a multi-choice question for the followin...   
3  Write a multi-choice question for the followin...   
4  Write a multi-choice question for the followin...   

                                              output  
0  Question: \nWhich is the best title of this pa...  
1  Question: \nWhich of the following is NOT true...  
2  Question: \nWhat is the main idea of the passa...  
3  If you want to know the new features of 8.9, y...  
4  Question: \nWhat is the passage mainly about?\...  
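
The preview shows that not every output starts with `Question:`, so it can be useful to check how many rows follow the expected layout. The following is a minimal sketch of such a check: `parse_mcq` and `MCQ_PATTERN` are names introduced here, the Question/Options/Answer layout is an assumption based on the records shown in this notebook, and the sample string is copied from one of those records.

```python
import re

# Assumed layout: question text, "Options:", lettered options, "Answer:", one letter
MCQ_PATTERN = re.compile(
    r"Options:\n(?P<options>.*?)\nAnswer:\n(?P<answer>[A-D])\s*$",
    flags=re.DOTALL,
)


def parse_mcq(output):
    """Return the question, options and answer, or None if the layout differs."""
    match = MCQ_PATTERN.search(output)
    if match is None:
        return None
    question = output[:match.start()].strip()
    if question.startswith("Question:"):
        question = question[len("Question:"):].strip()
    # Each option line looks like "A some text"
    options = dict(re.findall(r"^([A-D]) (.+)$", match.group("options"), flags=re.MULTILINE))
    return {"question": question, "options": options, "answer": match.group("answer")}


sample = (
    "Question: \nWhat does the text tell us to do?\nOptions:\n"
    "A To upgrade to Maximo Manage.\n"
    "B To set the mxe.int.globaldir property.\n"
    "C To upgrade from Maximo Manage to Maximo Asset Management.\n"
    "D To upgrade from Maximo Asset Management to Maximo Manage.\n"
    "Answer:\nD"
)
parsed = parse_mcq(sample)
print(parsed["question"])
print(parsed["answer"], "->", parsed["options"][parsed["answer"]])
```

Rows for which `parse_mcq` returns `None` could then be counted or inspected before the dataset is used for fine-tuning.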


## f) Now transform this dataset into one which can be used to fine tune Phi


We now transform the synthetic dataset into a format suitable for fine-tuning the Phi model. This involves splitting it into train, eval, and test sets, then replacing the `input` and `output` columns with `prompt`, `prompt_id`, and `messages` columns.


In [24]:
splitted_dataset = split_dataset(synthetic_dataset)

In [25]:
splitted_dataset

{'train': Dataset({
     features: ['input', 'output'],
     num_rows: 12687
 }),
 'eval': Dataset({
     features: ['input', 'output'],
     num_rows: 397
 }),
 'test': Dataset({
     features: ['input', 'output'],
     num_rows: 2775
 })}

In [26]:
for split in splitted_dataset.keys():
    print(split)
    dataset_length = len(splitted_dataset[split])
    prompts, prompt_ids = generate_prompts_and_ids(dataset_length)
    splitted_dataset[split] = splitted_dataset[split].map(
        lambda example, idx: add_columns(example, idx, prompts, prompt_ids),
        with_indices=True,
        remove_columns=['input', 'output']
    )

train


Map:   0%|          | 0/12687 [00:00<?, ? examples/s]

eval


Map:   0%|          | 0/397 [00:00<?, ? examples/s]

test


Map:   0%|          | 0/2775 [00:00<?, ? examples/s]

In [39]:
# Verify the new columns in the train split
print(splitted_dataset['train'][0])

{'prompt': 'You are the Maximo Application Suite helpful assistant and you must answer the user questions.', 'prompt_id': '6f709f1098fe8a9c8a07f885410e1375ac92a33c3ecf8c84ffa8811e7ffe0f86', 'messages': [{'content': 'Write a multi-choice question for the following article:\nArticle: Global property values\nSet the mxe.int.globaldir and other properties when you upgrade from Maximo Asset Management\nto Maximo Manage.', 'role': 'assistant'}, {'content': 'Question: \nWhat does the text tell us to do?\nOptions:\nA To upgrade to Maximo Manage.\nB To set the mxe.int.globaldir property.\nC To upgrade from Maximo Manage to Maximo Asset Management.\nD To upgrade from Maximo Asset Management to Maximo Manage.\nAnswer:\nD', 'role': 'user'}]}


## g) Upload the new dataset to 🤗 Hub

First log into the Hugging Face Hub

In [28]:

login(token=os.environ['HF_TOKEN'])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/franck/.cache/huggingface/token
Login successful


In [29]:
repo_name = "awels/maximo_admin_dataset"  # Choose a name for your dataset repository
repo_url = create_repo(repo_name, repo_type="dataset")
print("Repository URL:", repo_url)

Repository URL: https://huggingface.co/datasets/awels/maximo_admin_dataset


Now push the dataset to the HF hub

In [44]:

splitted_dataset['train'].push_to_hub(repo_name, split="train")
splitted_dataset['eval'].push_to_hub(repo_name, split="eval")
splitted_dataset['test'].push_to_hub(repo_name, split="test")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/13 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/518 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/518 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/518 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/awels/maximo_admin_dataset/commit/91b3b084749da7d748f7750b5d350c582dd7ac65', commit_message='Upload dataset', commit_description='', oid='91b3b084749da7d748f7750b5d350c582dd7ac65', pr_url=None, pr_revision=None, pr_num=None)