# 01 - Generate the training dataset for MaximusLLM

## a) Install the Bonito package

This package is required to process the documents and create the training dataset.
Make sure to check out the `v0.0.1` tag.

In [3]:
!cd ../modules/ && git clone https://github.com/BatsResearch/bonito.git --branch v0.0.1

Cloning into 'bonito'...
remote: Enumerating objects: 122, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 122 (delta 51), reused 51 (delta 36), pack-reused 45
Receiving objects: 100% (122/122), 807.32 KiB | 8.77 MiB/s, done.
Resolving deltas: 100% (53/53), done.
Note: switching to '76899e31da2714499cadda5924475220d8cb0d8f'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice

Make sure to edit the `requirements` list in `setup.py` as follows to avoid conflicts with existing dependencies:
```
requirements = [
    "transformers == 4.42.0",
    "datasets == 2.20.0",
    "vllm == 0.5.1",
]
```
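
Before editing the pins, it can help to see which versions are already present in the environment. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper introduced here for illustration and is not part of the notebook or of Bonito.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in package_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


# The three packages pinned in setup.py.
for name, installed in installed_versions(["transformers", "datasets", "vllm"]).items():
    print(f"{name}: {installed or 'not installed'}")
```

Any version printed here that differs from the pin in `setup.py` will be replaced when the package is installed.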

In [4]:
!pip install -U ../modules/bonito/

Processing /home/franck/Sandbox/03-Awels Engineering/MaximusLLM/modules/bonito
  Preparing metadata (setup.py) ... done
Collecting transformers (from bonito==0.0.1)
  Downloading transformers-4.42.3-py3-none-any.whl.metadata (43 kB)
Collecting datasets (from bonito==0.0.1)
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting vllm (from bonito==0.0.1)
  Downloading vllm-0.5.1-cp311-cp311-manylinux1_x86_64.whl.metadata (8.2 kB)
Collecting filelock (from datasets->bonito==0.0.1)
  Downloading filelock-3.15.4-py3-none-any.whl.metadata (2.9 kB)
Collecting numpy>=1.17 (from datasets->bonito==0.0.1)
  Downloading numpy-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting p

## b) Import all the required libraries

The following libraries are required:
- **os**: standard-library module for interacting with the operating system
- **torch**: PyTorch, the deep learning library used to retrain the Phi 3 model
- **bonito**: library for processing the documents and creating the training dataset
- **fitz**: the import name of PyMuPDF, used for PDF document handling
- **datasets**: Hugging Face library for handling and processing datasets
- **spacy**: library for natural language processing (imported in a separate cell)
- **pandas**: library for data manipulation and analysis
- **huggingface_hub**: client library for accessing models and datasets on the Hugging Face Hub


In [8]:
import os
import torch
import bonito
import fitz
import datasets as ds
import pandas as pd
import huggingface_hub as hf

Download the `en_core_web_sm` English pipeline, then import the spaCy library for natural language processing.

In [11]:
!python -m spacy download en_core_web_sm
import spacy

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## c) Declare the functions used to process the documents and create the training dataset

This function opens a PDF file located at the specified path, reads the text content from each page, and concatenates all the text into a single string.

In [13]:
def extract_text_from_pdf(pdf_path):
    """
    Extract text from each page of a PDF file.

    Parameters:
    pdf_path (str): The file path to the PDF file.

    Returns:
    str: The concatenated text extracted from all pages of the PDF.
    """
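    # Open the document in a context manager so the file handle is released.
    with fitz.open(pdf_path) as doc:
        # Collect the text of every page and join it into a single string.
        return "".join(page.get_text() for page in doc)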

This function splits text into sentences with spaCy. It assumes that the `nlp` object (a loaded spaCy pipeline) is already defined, e.g. `nlp = spacy.load('en_core_web_sm')`, and that the pipeline has sentence segmentation enabled.

In [17]:
def split_into_sentences(text):
    """
    Splits a given text into individual sentences using natural language processing.

    Args:
        text (str): The input text to be split into sentences.

    Returns:
        list: A list of strings, where each string is a sentence from the input text.
    
    Example:
        >>> text = "Hello world. This is a test."
        >>> split_into_sentences(text)
        ['Hello world.', 'This is a test.']
    """
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    return sentences
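
None of the cells shown so far creates `nlp`, so it has to be defined before the function is called. A minimal sketch, assuming spaCy is installed: it loads `en_core_web_sm` and, if that model is missing, falls back to a blank English pipeline with spaCy's rule-based `sentencizer` component. The fallback and the `load_sentence_pipeline` helper are illustrative assumptions, not part of the original notebook.

```python
import spacy


def load_sentence_pipeline(model_name="en_core_web_sm"):
    """Return a spaCy pipeline that can segment sentences."""
    try:
        # The trained pipeline sets sentence boundaries via its parser.
        return spacy.load(model_name)
    except OSError:
        # Model not installed: use a blank pipeline with rule-based splitting.
        pipeline = spacy.blank("en")
        pipeline.add_pipe("sentencizer")
        return pipeline


nlp = load_sentence_pipeline()


def split_into_sentences(text):
    """Same logic as the function above, repeated so this cell runs alone."""
    return [sent.text.strip() for sent in nlp(text).sents]


sentences = split_into_sentences("Hello world. This is a test.")
print(sentences)
```

For long documents, note that spaCy raises an error when the input exceeds `nlp.max_length` (1,000,000 characters by default), so very large PDFs may need to be processed in smaller pieces.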

## d) Prepare the dataset based on the PDF document
