<a href="https://colab.research.google.com/github/abebual/LLMs-for-Flight-Safety/blob/main/tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Byte-Level BPE Tokenization

https://pypi.org/project/openai/

OpenAI's GPT-2 and GPT-3 models use a "byte pair encoding" (BPE) tokenizer. BPE is a subword tokenization method: it starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair into a new vocabulary entry. Frequent words end up as single tokens, while rare words are split into smaller subword units. Because any word can be decomposed into units that are already in the vocabulary, BPE works across a wide range of languages and avoids out-of-vocabulary failures.
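The merge procedure described above can be sketched in a few lines of plain Python. This is a toy illustration, not the implementation used by OpenAI or by the `tokenizers` library; the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the sample corpus are made up for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_bpe("low low low lower", num_merges=2)
print(merges)
print(words)
```

On this tiny corpus, two merges are enough for `low` to become a single symbol, while `lower` is represented as `low` followed by the leftover characters `e` and `r`.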

GPT-3 reuses the GPT-2 tokenizer, often called "GPT-2-style BPE": a byte-level BPE that operates on UTF-8 bytes rather than Unicode characters, with a fixed vocabulary of 50,257 tokens and its own token encoding scheme.
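To make the byte-level idea concrete, here is a small sketch modeled on the byte-to-character table in the published GPT-2 encoder. It only shows how text becomes byte symbols before any merges are applied; the helper name `to_byte_symbols` is an assumption for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (control characters, space, ...) are shifted past U+00FF.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_symbols(text):
    """Encode text as UTF-8 and render each byte as its printable stand-in."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_symbols(" the"))  # Ġthe  (the leading space becomes Ġ)
print(to_byte_symbols("é"))     # Ã©   (two UTF-8 bytes, two symbols)
```

Because every possible input reduces to these 256 base symbols, a byte-level BPE never needs an unknown token.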

When working with OpenAI's pre-trained models, use the same tokenizer that was used during pre-training, which is the GPT-2-style BPE tokenizer. A different tokenizer maps the same text to different token IDs, and the model's embeddings were not trained on those IDs. Pre-trained models are typically distributed with their corresponding tokenizers, either as part of the model package or through OpenAI's API.

In [1]:
! pip install tokenizers

Collecting tokenizers
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting huggingface_hub<0.18,>=0.16.4 (from tokenizers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Installing collected packages: huggingface_hub, tokenizers
Successfully installed huggingface_hub-0.17.3 tokenizers-0.14.1


In [7]:
! pip install PyPDF2



In [5]:
! pip install transformers

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: safetensors, transformers
Successfully installed safetensors-0.4.0 transformers-4.35.0


In [31]:
import os

import PyPDF2

input_dir = '/content/drive/MyDrive/LLMs for Flight Safety/FAA Directives'
# Plain-text corpus file that the tokenizer is trained on later
output_file = '/content/drive/MyDrive/LLMs for Flight Safety/vectorized_pdfs/text_corpus.txt'
# Make sure the output directory exists before opening the file
os.makedirs(os.path.dirname(output_file), exist_ok=True)

merged_text = ""
with open(output_file, "w", encoding="utf-8") as output:
    for filename in os.listdir(input_dir):
        if filename.endswith(".pdf"):
            with open(os.path.join(input_dir, filename), "rb") as file:
                pdf = PyPDF2.PdfReader(file)
                for page in pdf.pages:
                    # Fall back to "" in case a page yields no extractable text
                    text = page.extract_text() or ""
                    # Write each page to the corpus file so training has data to read
                    output.write(text + "\n")
                    merged_text += text + "\n"

In [17]:
import tensorflow as tf
tf.test.gpu_device_name()


'/device:GPU:0'

In [32]:
merged_text



In [37]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Initialize a BPE tokenizer; symbols not in the vocabulary map to [UNK].
# Note: this is plain BPE with a whitespace pre-tokenizer, not byte-level BPE.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Split the text on whitespace and punctuation before BPE merges are applied
tokenizer.pre_tokenizer = Whitespace()

# Train the tokenizer on the extracted text corpus
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
input_file = '/content/drive/MyDrive/LLMs for Flight Safety/vectorized_pdfs/text_corpus.txt'
tokenizer.train(files=[input_file], trainer=trainer)

# Save the whole tokenizer (vocabulary, merges, settings) as a single JSON file
tokenizer.save("FlightSafety_tokenizer_model")

In [38]:
output = tokenizer.encode("The Instructions for Continued Air-worthiness must be in the form of a manual or manuals as appropriate for the quantity of data to be provided.")
print(output.tokens)

['The', 'Instructions', 'for', 'Continued', 'Air', '-', 'worthiness', 'must', 'be', 'in', 'the', 'form', 'of', 'a', 'manual', 'or', 'manuals', 'as', 'appropriate', 'for', 'the', 'quantity', 'of', 'data', 'to', 'be', 'provided', '.']
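The split of "Air-worthiness" into `'Air'`, `'-'`, `'worthiness'` comes from the `Whitespace` pre-tokenizer, which is documented as splitting on the regex `\w+|[^\w\s]+`. The sketch below reproduces that step alone with Python's `re` module; `whitespace_pre_tokenize` is a hypothetical helper, not part of the `tokenizers` API.

```python
import re

# Runs of word characters, or runs of characters that are neither word
# characters nor whitespace (i.e. punctuation).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Approximate the Whitespace pre-tokenizer's split with a plain regex."""
    return WHITESPACE_PATTERN.findall(text)


sentence = (
    "The Instructions for Continued Air-worthiness must be in the form of "
    "a manual or manuals as appropriate for the quantity of data to be provided."
)
print(whitespace_pre_tokenize(sentence))
```

The pre-tokenized pieces match the tokens printed above, which suggests that every word in this sentence was learned as a single vocabulary entry during training, so BPE had nothing left to split.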
