# [Training your own Chatbot using GPT](https://www.youtube.com/watch?v=DxygPxcfW_I)

# [Transformers](https://pypi.org/project/transformers/)
**Transformers** provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.

These models can be applied to:

📝 **Text**, for tasks like text classification, information extraction, question answering, summarization, translation, and text generation, in over 100 languages.

🖼️ **Images**, for tasks like image classification, object detection, and segmentation.

🗣️ **Audio**, for tasks like speech recognition and audio classification.

**Note:** Using the `Trainer` with PyTorch requires `accelerate>=0.21.0`. Without it, an `ImportError` is raised that asks you to run `pip install transformers[torch]` or `pip install accelerate -U`.

In [1]:
!pip install transformers
!pip install transformers[torch]

Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->transformers[torch])
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->transformers[torch])
  Using cached nvidia_cublas_cu

# [torch 2.3.0](https://pypi.org/project/torch/)
**PyTorch** is a Python package that provides two high-level features:

* Tensor computation (like NumPy) with strong GPU acceleration
* Deep neural networks built on a tape-based autograd system

In [2]:
!pip install torch



# [python-docx](https://pypi.org/project/python-docx/)
**python-docx** is a Python library for reading, creating, and updating Microsoft Word 2007+ (.docx) files.

In [3]:
!pip install python-docx

Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2


# [PyPDF2](https://pypi.org/project/PyPDF2/)
**PyPDF2** is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF2 can retrieve text and metadata from PDFs as well.

In [4]:
!pip install -U PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


### Import general required libraries

In [5]:
import os
import re
from PyPDF2 import PdfReader
import docx
import torch

# All the below classes are provided by the "Hugging Face" Transformers library.

## [class transformers.GPT2Tokenizer](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2Tokenizer)
* The `GPT2Tokenizer` class is used for tokenizing text data in a way that is compatible with the GPT-2 model.
* Tokenization involves converting a string of text into a list of tokens (words or subwords) that can be processed by the model.
* This step is crucial for preparing text data for tasks such as text generation, text completion, or any other NLP application involving GPT-2.
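
As a rough illustration of what "converting text into tokens" means, the toy sketch below splits words into subword pieces with a greedy longest-match rule over a tiny, made-up vocabulary. It is not GPT-2's byte-level BPE algorithm; `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names used only for this example.

```python
# Toy subword tokenizer: greedy longest-match over a tiny, invented vocabulary.
# This only illustrates the idea of subword tokens and ids; GPT-2 itself uses
# byte-level BPE with a learned vocabulary of about 50k entries.
TOY_VOCAB = {"<unk>": 0, "chat": 1, "bot": 2, "train": 3, "ing": 4, "s": 5}


def toy_tokenize(word):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in TOY_VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: emit an unknown token for one character.
            pieces.append("<unk>")
            start += 1
    return pieces


def toy_encode(text):
    """Map text to a flat list of token ids."""
    return [TOY_VOCAB[p] for w in text.lower().split() for p in toy_tokenize(w)]


print(toy_tokenize("chatbots"))          # ['chat', 'bot', 's']
print(toy_encode("training chatbots"))   # [3, 4, 1, 2, 5]
```

The model never sees the strings themselves, only the list of ids.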

## [class transformers.GPT2LMHeadModel](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2LMHeadModel)

The `LMHead` part of `GPT2LMHeadModel` stands for "Language Modeling Head." The class adds a language modeling head on top of the GPT-2 architecture, which makes it suited to generating text. It predicts the next token in a sequence, so it can be used for tasks like text completion, text generation, and other sequence prediction tasks.
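
Next-token prediction can be sketched without any model at all. The example below assumes a handful of invented scores (logits) for three candidate tokens, turns them into probabilities with a softmax, and picks the most probable one. The tokens and numbers are made up for illustration and do not come from GPT-2.

```python
import math

# Hypothetical scores (logits) a language-modeling head might assign to each
# candidate next token after the prompt "The cat sat on the".
logits = {"mat": 3.0, "roof": 2.0, "moon": 0.5}


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


probs = softmax(logits)
next_token = max(probs, key=probs.get)  # greedy choice: most probable token
print(next_token)
```

Generation repeats this step: the chosen token is appended to the input and the model is asked for the next one.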

## TextDataset
The `TextDataset` class loads a plain-text file and converts it into fixed-length examples suitable for training and evaluating transformer models.

1. **Tokenization**: Tokenizes the input text using the specified tokenizer.
2. **Encoding**: Converts the tokens into numerical ids that models can process.
3. **Chunking**: Splits the resulting ids into blocks of `block_size` tokens, each of which becomes one example.

Note that recent Transformers releases mark `TextDataset` as deprecated in favour of the 🤗 Datasets library.
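
The way a tokenized file becomes fixed-length examples can be sketched as follows. This is a simplified illustration, not the library's code: `split_into_blocks` is a hypothetical helper, and it assumes that a trailing partial block is dropped rather than padded.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a flat list of token ids into consecutive blocks of block_size.

    A trailing partial block is dropped here, so every example has equal length.
    """
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


token_ids = list(range(10))  # stand-in for a tokenized document
blocks = split_into_blocks(token_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under this assumption, a text with fewer tokens than `block_size` produces no examples at all, which is worth checking when the training corpus is small.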


## [class transformers.DataCollatorForLanguageModeling](https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling)

This collator handles the batching and padding of sequences to ensure that all input sequences in a batch have the same length, which is necessary for efficient GPU utilization.

It can be used for both <u>causal language modeling</u> (**CLM**) and <u>masked language modeling</u> (**MLM**) tasks, depending on the configuration.
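
For the causal case, the batching step can be sketched in plain Python. The helper below is a simplified stand-in, not the collator's actual implementation: `collate_for_clm` is a hypothetical name, and it assumes labels are a copy of the inputs with padded positions set to `-100` so that they are ignored by the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_for_clm(sequences, pad_token_id):
    """Pad sequences to a common length and build causal-LM labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


batch = collate_for_clm([[5, 6, 7], [8]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 0, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, -100, -100]]
```

No shifting of labels is shown here because, for causal language modeling, the model itself compares each position with the token that follows it.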

In [6]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling

## [class transformers.Trainer](https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/trainer#transformers.Trainer)

`Trainer` is a high-level API that simplifies the process of training and evaluating transformer models. It provides a flexible and easy-to-use interface for fine-tuning and training models on custom datasets.

A few key features:
1. **Fine-Tuning Pre-Trained Models**:
   - **Quick Fine-Tuning**: Easily fine-tune pre-trained models (like BERT, GPT-2, RoBERTa) on your custom datasets for specific tasks such as text classification, named entity recognition, or text generation.

2. **Evaluation and Metrics**:
   - **Built-in Evaluation**: Runs evaluation during and after training and reports the evaluation loss by default; metrics such as accuracy, precision, recall, and F1 score can be added through a `compute_metrics` function.
   - **Custom Metrics**: Allows users to define and use custom evaluation metrics.

3. **Distributed Training**:
   - **Multi-GPU Training**: Supports training across multiple GPUs to accelerate the process.
   - **Distributed Training**: Can handle training on multiple nodes in a distributed setting, making it suitable for large-scale training.
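
For language-model fine-tuning, the number that evaluation reports is usually the loss, and a common way to read it is as perplexity. The sketch below assumes the loss is a mean cross-entropy in nats; the value `3.0` is a made-up example, and `perplexity_from_loss` is a hypothetical helper.

```python
import math


def perplexity_from_loss(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)


# Hypothetical evaluation loss, e.g. the "eval_loss" value in an evaluation report.
print(round(perplexity_from_loss(3.0), 2))  # 20.09
```

Lower perplexity means the model is less "surprised" by the validation text.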


## [class transformers.TrainingArguments](https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/trainer#transformers.TrainingArguments)

The `TrainingArguments` class is used to define the training configuration and hyperparameters for training Transformer-based models.

By importing `TrainingArguments`, you gain access to a class that lets you specify various parameters such as the number of epochs, learning rate, batch size, evaluation strategy, logging options, and more, which are crucial for training machine learning models effectively.
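
How batch size, epochs, and a step-based save interval interact can be worked out with simple arithmetic. The sketch below is an estimate that assumes no gradient accumulation; the dataset size of 1,000 examples is invented, and `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs, num_devices=1):
    """Estimate optimizer steps, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    return steps_per_epoch * num_epochs


# Hypothetical dataset of 1,000 examples with batch size 8 and 50 epochs.
steps = total_training_steps(num_examples=1000, batch_size=8, num_epochs=50)
print(steps)            # 6250
print(steps >= 10_000)  # False: a save interval of 10,000 steps would never be reached
```

With numbers like these, a step-based checkpoint interval larger than the total step count produces no intermediate checkpoints, so the only saved model is the one written explicitly after training.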


In [7]:
from transformers import Trainer, TrainingArguments

# <font color='red'>Caution: Use only a single GPT model at a time.</font>
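
One plausible reason for this caution is memory. The rough estimate below assumes weights are stored as 32-bit floats and counts the weights only (no gradients, optimizer state, or activations). The parameter counts are the ones quoted in the model descriptions in this notebook; `weight_size_gb` is a hypothetical helper.

```python
BYTES_PER_FP32_PARAM = 4  # assumption: weights stored as 32-bit floats

# Parameter counts as stated in the model descriptions in this notebook.
MODEL_PARAMS = {
    "gpt2": 124e6,
    "gpt2-medium": 355e6,
    "gpt2-large": 774e6,
    "gpt2-xl": 1.5e9,
}


def weight_size_gb(num_params, bytes_per_param=BYTES_PER_FP32_PARAM):
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in MODEL_PARAMS.items():
    print(f"{name}: ~{weight_size_gb(params):.1f} GB")
```

Training needs several times this amount, so loading more than one of the larger variants in the same session can exhaust GPU memory.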

# [GPT-2](https://huggingface.co/openai-community/gpt2)
Pretrained model on English language using a <u>causal language modeling (CLM)</u> objective.

## Model description
**GPT-2** is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.

This is the <u>**smallest** version of GPT-2, with **124M**</u> parameters.

In [8]:
'''
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2")
'''


In [9]:
'''
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
'''


# [GPT-2 Medium](https://huggingface.co/openai-community/gpt2-medium)
## Model Description
**GPT-2 Medium** is the **355M** parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is a pretrained model on English language using a <u>causal language modeling (CLM)</u> objective.

In [10]:
'''
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2-medium")
'''


In [11]:
'''
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-medium")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-medium")
'''


# [GPT-2 Large](https://huggingface.co/openai-community/gpt2-large)
## Model Description
**GPT-2 Large** is the **774M** parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is a pretrained model on English language using a <u>causal language modeling (CLM)</u> objective.

In [12]:
'''
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2-large")
'''


In [13]:
'''
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-large")
'''


# [GPT-2 XL](https://huggingface.co/openai-community/gpt2-xl)
## Model Description
**GPT-2 XL** is the **1.5B** parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is a pretrained model on English language using a <u>causal language modeling (CLM)</u> objective.

In [14]:
'''
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2-xl')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
'''


# Functions required to read different types of files.
## These files can be a mix of `*.pdf`, `*.docx`, or `*.txt`

In [15]:
# read a single PDF file and return its text as one string
def read_pdf(file_path):
  with open(file_path, 'rb') as file:
    pdf_reader = PdfReader(file)
    text = ""
    for page in pdf_reader.pages:
      # guard against pages that yield no text (e.g. scanned images)
      text += page.extract_text() or ""
  return text

In [16]:
# read a single Word (.docx) file and return its paragraphs as one string
def read_word(file_path):
  doc = docx.Document(file_path)
  text = ""
  for paragraph in doc.paragraphs:
    text += paragraph.text + "\n"
  return text

In [17]:
# read a single plain-text file and return its content as a string
def read_text(file_path):
  with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()
  return text

In [18]:
# read every .pdf, .docx, and .txt file in the directory and return the combined text
def read_documents_from_directory(directory):
  combined_text = ""
  for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)

    if filename.endswith(".pdf"):
      combined_text += read_pdf(file_path)
    elif filename.endswith(".docx"):
      combined_text += read_word(file_path)
    elif filename.endswith(".txt"):
      combined_text += read_text(file_path)
  return combined_text
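
`os.listdir` returns names in arbitrary order and the extension checks above are case-sensitive, so the combined text can differ between runs and a file named `REPORT.PDF` would be skipped. A minimal sketch of one way to make the file selection deterministic is shown below; `list_supported_files` is a hypothetical helper, not part of the notebook, and the demo files are throwaway placeholders.

```python
import os
import tempfile


def list_supported_files(directory, extensions=(".pdf", ".docx", ".txt")):
    """Return supported file names in sorted order, ignoring extension case."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(extensions)
    )


# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.PDF", "notes.md"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("placeholder")
    found = list_supported_files(tmp)

print(found)  # ['a.PDF', 'b.txt']
```

Sorting matters here because the later train/validation split is positional: a different file order puts different documents into the validation part.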

## Train a chatbot
### The `train_chatbot` function fine-tunes a pretrained GPT-2 model on the **combined_text** data using the training arguments defined inside the function. The fine-tuned model and tokenizer are then saved to the output directory passed as `model_output_path`.

In [19]:
def train_chatbot(directory, model_output_path, train_fraction=0.8):
  # read documents from the directory
  combined_text = read_documents_from_directory(directory)
  # removing excess new line characters
  combined_text = re.sub(r'\n+', '\n', combined_text).strip()

  # split the text into training and validation sets
  split_index = int(train_fraction * len(combined_text))
  train_text = combined_text[:split_index]
  test_text = combined_text[split_index:]

  # save training and validation data as text files
  with open("train.txt", "w", encoding="utf-8") as f:
    f.write(train_text)
  with open("test.txt", "w", encoding="utf-8") as f:
    f.write(test_text)

  # set up the tokenizer and model
  # alternatives: "gpt2", "gpt2-medium", "gpt2-xl" (use the same name for both)
  tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
  model = GPT2LMHeadModel.from_pretrained("gpt2-large")

  # prepare the dataset
  train_dataset = TextDataset(tokenizer=tokenizer, file_path='train.txt', block_size=128)
  test_dataset = TextDataset(tokenizer=tokenizer, file_path='test.txt', block_size=128)
  data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

  # set up the training arguments
  training_args = TrainingArguments(
      output_dir=model_output_path,
      overwrite_output_dir=True,
      per_device_train_batch_size=8,
      per_device_eval_batch_size=8,
      num_train_epochs=50,
      save_steps=10_000,
      save_total_limit=2,
      logging_dir='./logs',
  )

  # train the model
  trainer = Trainer(
      model=model,
      args=training_args,
      data_collator=data_collator,
      train_dataset=train_dataset,
      eval_dataset=test_dataset,
  )

  trainer.train()
  trainer.save_model(model_output_path)

  # save the tokenizer
  tokenizer.save_pretrained(model_output_path)
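
The split in `train_chatbot` cuts the text at a character index, which can land in the middle of a word or sentence. A minimal sketch of a variant that moves the cut back to the previous line break is shown below; `clean_and_split` is a hypothetical helper and the sample text is invented.

```python
import re


def clean_and_split(text, train_fraction=0.8):
    """Collapse blank lines, then split near train_fraction at a line boundary."""
    text = re.sub(r"\n+", "\n", text).strip()
    split_index = int(train_fraction * len(text))
    # Move the cut back to the previous newline so no line is cut in half.
    newline_index = text.rfind("\n", 0, split_index)
    if newline_index != -1:
        split_index = newline_index + 1
    return text[:split_index], text[split_index:]


sample = "line one\n\n\nline two\nline three\nline four\nline five"
train_text, val_text = clean_and_split(sample)
print(repr(train_text))
print(repr(val_text))
```

If the text contains no line break before the cut, the function falls back to the plain character split.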

## Use the implemented model and try to generate the text.

### The "generate_response" function takes a trained model, tokenizer, and a prompt string as input and generates a response using the GPT-2 model.

### **<font color='red'>Output Length Limitation:</font>**
#### GPT-2 has a maximum context length of 1024 tokens. If `max_length` (prompt plus generated tokens) exceeds this limit, Transformers warns that the generation call will exceed the model's predefined maximum length. Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
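
One way to stay inside the limit is to cap the number of generated tokens by the space left after the prompt. The sketch below is plain arithmetic, not a Transformers API; `safe_max_new_tokens` is a hypothetical helper and the prompt lengths are made-up examples.

```python
GPT2_CONTEXT_LENGTH = 1024  # maximum number of positions GPT-2 was trained with


def safe_max_new_tokens(prompt_length, requested_new_tokens,
                        context_length=GPT2_CONTEXT_LENGTH):
    """Limit generated tokens so prompt + output fits in the context window."""
    if prompt_length >= context_length:
        raise ValueError("Prompt alone fills the context window; shorten it.")
    return min(requested_new_tokens, context_length - prompt_length)


print(safe_max_new_tokens(prompt_length=20, requested_new_tokens=250))   # 250
print(safe_max_new_tokens(prompt_length=900, requested_new_tokens=250))  # 124
```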


## [generate](https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/text_generation#transformers.GenerationMixin.generate)

#### Generates sequences of token ids for models with a language modeling head.

To learn more about decoding strategies refer to the [text generation strategies guide.](https://huggingface.co/docs/transformers/v4.41.0/en/generation_strategies)

In [20]:
def generate_response(model, tokenizer, prompt, max_length=250):
  input_ids = tokenizer.encode(prompt, return_tensors='pt')

  # the prompt has no padding, so every position is attended to;
  # GPT-2 has no pad token, so the end-of-sequence token is reused for padding
  attention_mask = torch.ones_like(input_ids)
  pad_token_id = tokenizer.eos_token_id

  output = model.generate(
      input_ids,
      max_length=max_length,        # total length in tokens (prompt + generated text)
      num_return_sequences=4,       # number of sequences returned; must not exceed num_beams
      attention_mask=attention_mask,
      pad_token_id=pad_token_id,    # the id of the padding token
      temperature=1.0,              # modulates next-token probabilities; only relevant when sampling
      no_repeat_ngram_size=4,       # no 4-gram may appear twice in the output
      num_beams=4,                  # beam search with 4 beams
  )

  # only the first of the returned sequences is decoded
  return tokenizer.decode(output[0], skip_special_tokens=True)
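
The effect of `no_repeat_ngram_size` can be illustrated with a small stand-alone function. This is a sketch of the idea, not the library's implementation: given the tokens generated so far, it lists the tokens that would complete an n-gram already present. `banned_next_tokens` is a hypothetical name and the token ids are invented.

```python
def banned_next_tokens(generated, ngram_size):
    """Return tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i : i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned


# The sequence ends in (1, 2), which was earlier followed by 3,
# so emitting 3 next would repeat the 3-gram (1, 2, 3).
print(banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3))  # {3}
```

During decoding, the scores of such banned tokens are suppressed so they cannot be chosen at that step.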

In [21]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## The main function
### The "main" function is the entry point for the program.
#### It specifies the path to the directory containing the training data and the path to the output directory for the trained model and tokenizer.
#### It then trains the chatbot using the "train_chatbot" function and generates a response to a specified prompt using the "generate_response" function.


In [22]:
# the main function: sets the data and output paths, trains the chatbot, and tests it
def main():
  directory = "/content/drive/MyDrive/Colab Notebooks/Project/"
  # note: the tokenizer saves merges.txt into the output directory; because it is
  # the same as the data directory here, a re-run would read that file as training text
  model_output_path = "/content/drive/MyDrive/Colab Notebooks/Project/"

  # Check if the directory exists
  if not os.path.exists(directory):
    raise FileNotFoundError(f"Directory '{directory}' not found.")

  # Check if the user has access to the directory
  if not os.access(directory, os.R_OK):
    raise PermissionError(f"User does not have read access to '{directory}'.")

  # train the chatbot
  train_chatbot(directory, model_output_path)

  # Load the fine-tuned model and tokenizer
  model = GPT2LMHeadModel.from_pretrained(model_output_path)
  tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)

  # Test the chatbot
  prompt = "What are Jacquard machines made of?"  # Replace with your desired prompt
  response = generate_response(model, tokenizer, prompt)
  print("Generated response:", response)

In [None]:
if __name__ == "__main__":
  main()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Now, let us test the model.

Use the following code if you are only performing inference (generating text). This can be placed in a separate notebook.

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel