<a href="https://colab.research.google.com/github/akudnaver/AI-Machine-Learning-Deep-Learning-Projects/blob/master/chatbot_pdf_using_gpt2_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [19]:
# The purpose of this project is to use the Transformer-based pre-trained model GPT-2 on an input PDF document and answer queries related to that document.
# You can try this with any document and check the precision of the answers. Feel free to experiment with this code and customize it to your requirements. I have also used Gradio
# to create the web interface, which is a quick way of interacting with your AI chatbot model.

# Okay, let's see all that in action now. Happy coding!

In [20]:
# Set the path of your input PDF file, which will be processed with a pre-trained model. More on that in the subsequent sections.

import os
# file_path = "/sample_data/input/files/"
filename = '/content/eSRVCC.pdf'
print(filename)

/content/eSRVCC.pdf


In [21]:
# You must install the libraries needed to run this project. You can also create a virtual environment and install all the libraries below in it, which
# would be a one-time task and avoids re-installation, especially if you are running in a Kaggle or Colab environment.


!pip install sentence_transformers
!pip install pdfminer.six
!pip install text_generation
!pip install einops --no-deps
!pip install --upgrade torch transformers
!pip install flask
!pip install torchvision
!pip install torchaudio
!pip install --upgrade gradio
!pip install semantic_version
!pip install ffmpy --no-deps
!pip install gradio_client
!pip install multipart
!pip install python-multipart
!pip install tensorflow
# !pip install gradio



In [22]:
# If you have a dedicated GPU (for example an NVIDIA GPU with CUDA) available on your system, in Google Colab or in a Kaggle Notebook, use it: large LLM workloads
# run much faster on a GPU. If not, stick to the CPU. As this project uses the pre-trained GPT-2 model, you can easily run it on a CPU, though you may experience
# slower performance in the chatbot user interaction.

import tensorflow as tf
# List GPUs available
gpus = tf.config.list_physical_devices('GPU')
print("GPUs:", gpus)

# Check if GPUs are available
if gpus:
    print("GPU is available.")
else:
    print("GPU is not available.")

GPUs: []
GPU is not available.
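
The check above uses TensorFlow, while the models in the following cells run on PyTorch. A minimal sketch of the equivalent PyTorch check is shown here. The helper name `pick_device` is hypothetical, and the import guard is only there so that the snippet also runs where PyTorch is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    # Hypothetical helper, not part of the original notebook.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so fall back to the CPU
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

The returned string can be passed as the `device` argument of `SentenceTransformer` and `CrossEncoder`, as the `embed` function does later in the notebook.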


In [23]:
# It's time to import all the libraries you need to run this project. Make sure they are installed first.

import argparse
import json
import torch
from pdfminer.high_level import extract_text
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from text_generation import Client
from flask import Flask, request, jsonify
import re

In [24]:
# Now comes the most important task, which is to create a transformer pipeline. We will later pass the prompt through this model.
# Here we load the tokenizer, create an instance of the model and finally build the text-generation pipeline stored in the generator variable, which
# is the standard approach described in the Hugging Face GPT-2 documentation.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

PARAMETERS = {
    "temperature": 0.9,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "top_k": 50,
    "truncate": 1000,
    "max_new_tokens": 1024,
    "seed": 42,
    "stop_sequences": ["<|endoftext|>", "</s>"],
}

# Initialize Hugging Face pipeline
# Read the access token from the environment instead of hard-coding a secret in the notebook.
# GPT-2 is a public model, so the token may be left unset.
import os
api_token = os.environ.get("HF_TOKEN")
model_name = "openai-community/gpt2"

# `token` replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, token=api_token)
model = AutoModelForCausalLM.from_pretrained(model_name, token=api_token)
# Run on the first GPU if one is available, otherwise on the CPU
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)



In [25]:
# Define the preprompt and prompt
PREPROMPT = "Below is a conversation with an AI assistant. The assistant is helpful, polite, and knowledgeable.\n\n"
PROMPT = "User: {query}\nAI:"

def generate_response(query, results):
    # Retrieved passages take precedence over generated text, so return them
    # directly and skip the slow generation step when they are available
    if results:
        return {"response": "\n".join(results)}

    # Format the prompt with the user's query
    full_prompt = PREPROMPT + PROMPT.format(query=query)

    # Generate response. GPT-2 has a 1024-token context window, so the value is passed as
    # max_length (prompt plus generated tokens) rather than as max_new_tokens
    response = generator(
        full_prompt,
        max_length=PARAMETERS["max_new_tokens"],
        do_sample=True,
        temperature=PARAMETERS["temperature"],
        top_p=PARAMETERS["top_p"],
        repetition_penalty=PARAMETERS["repetition_penalty"],
        top_k=PARAMETERS["top_k"],
        truncation=True,
    )

    # Extract the generated text
    generated_text = response[0]['generated_text']

    # Extract the part of the response that is the assistant's reply
    ai_response = generated_text.split("AI:")[1].strip() if "AI:" in generated_text else generated_text

    # Create a JSON response
    json_response = {"response": ai_response}

    return json_response
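
The prompt handling in generate_response can be tried without loading GPT-2. The following sketch reuses the PREPROMPT and PROMPT templates from this cell; the helpers `build_prompt` and `extract_reply` are hypothetical names, and `fake_output` is a hand-written string standing in for real model output.

```python
PREPROMPT = "Below is a conversation with an AI assistant. The assistant is helpful, polite, and knowledgeable.\n\n"
PROMPT = "User: {query}\nAI:"


def build_prompt(query: str) -> str:
    """Combine the fixed preamble with the user's query."""
    return PREPROMPT + PROMPT.format(query=query)


def extract_reply(generated_text: str) -> str:
    """Return the text after the first "AI:" marker, or the whole text if it is absent."""
    if "AI:" in generated_text:
        return generated_text.split("AI:")[1].strip()
    return generated_text


# The pipeline returns the prompt followed by the continuation, which is imitated here.
fake_output = build_prompt("What is eSRVCC?") + " It is described in the document."
print(extract_reply(fake_output))
```

Because the text-generation pipeline echoes the prompt in `generated_text`, splitting on the marker is what separates the reply from the prompt.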


In [26]:
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--fname", type=str, required=True)
    parser.add_argument("--top_k", type=int, default=32)
    parser.add_argument("--window_size", type=int, default=128)
    parser.add_argument("--step_size", type=int, default=100)

    # Simulate command line arguments
    args = parser.parse_args([
        '--fname', filename,
        '--top_k', '32',
        '--window_size', '128',
        '--step_size', '100'
    ])

    return args
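
Passing an explicit list to `parse_args()` is what makes argparse usable inside a notebook, where `sys.argv` holds the kernel's own arguments rather than yours. A minimal sketch with the same four options follows; the helper `build_parser` and the file name `example.pdf` are placeholders for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with the same options as parse_args() above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fname", type=str, required=True)
    parser.add_argument("--top_k", type=int, default=32)
    parser.add_argument("--window_size", type=int, default=128)
    parser.add_argument("--step_size", type=int, default=100)
    return parser


# Only --fname is required; options that are not listed fall back to their defaults.
args = build_parser().parse_args(["--fname", "example.pdf", "--step_size", "64"])
print(args.fname, args.top_k, args.window_size, args.step_size)
```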

In [27]:
# In this step we extract the contents of the input PDF file, clean the text and split it into overlapping windows of words. Each window is then converted
# to an embedding vector with a pre-trained sentence-transformer model, and these vectors are used later for the semantic search.

def embed(fname, window_size, step_size):
    text = extract_text(fname)
    text = re.sub(r'[^\x20-\x7E]+', ' ', text)
    text = re.sub(r'\s', ' ', text)
    text = " ".join(text.split())
    text_tokens = text.split()


    sentences = []
    for i in range(0, len(text_tokens), step_size):
        window = text_tokens[i : i + window_size]
        # Stop at the first window that is shorter than window_size
        if len(window) < window_size:
            break
        sentences.append(window)

    paragraphs = [" ".join(s) for s in sentences]
#     print(paragraphs)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device=device)

    model.max_seq_length = 512
    # cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device=device)

    embeddings = model.encode(paragraphs, show_progress_bar=True, convert_to_tensor=True)

    return model, cross_encoder, embeddings, paragraphs
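
The windowing loop inside embed can be studied on its own. The sketch below isolates it in a hypothetical helper, `sliding_windows`, and runs it on ten dummy words with a small window so that the overlap is easy to see.

```python
def sliding_windows(tokens, window_size, step_size):
    """Split tokens into overlapping windows, dropping a short final window."""
    windows = []
    for start in range(0, len(tokens), step_size):
        window = tokens[start : start + window_size]
        if len(window) < window_size:
            break  # same rule as embed(): incomplete windows are discarded
        windows.append(" ".join(window))
    return windows


words = [f"w{i}" for i in range(10)]
chunks = sliding_windows(words, window_size=4, step_size=3)
for chunk in chunks:
    print(chunk)
```

Consecutive windows share `window_size - step_size` words, so with the notebook's defaults of 128 and 100 each paragraph overlaps the next by 28 words. Note that a document shorter than `window_size` yields no windows at all.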

In [28]:
# This function uses the embedding model to perform a semantic search for the user's query: the query is embedded, compared with the paragraph embeddings, and the best hits are re-ranked with the cross-encoder.

def search(query, model, cross_encoder, embeddings, paragraphs, top_k):
    query_embeddings = model.encode(query, convert_to_tensor=True)
    query_embeddings = query_embeddings.cuda() if torch.cuda.is_available() else query_embeddings
    hits = util.semantic_search(
        query_embeddings,
        embeddings,
        top_k=top_k,
    )[0]

    cross_inputs = [[query, paragraphs[hit["corpus_id"]]] for hit in hits]
#     print("Cross Inputs:", cross_inputs)  # Add this line to inspect 'cross_inputs'

    if not cross_inputs:
        return []

    cross_scores = cross_encoder.predict(cross_inputs)
    for idx in range(len(cross_scores)):
        hits[idx]["cross_score"] = cross_scores[idx]

    results = []
    hits = sorted(hits, key=lambda x: x["cross_score"], reverse=True)
    for hit in hits[:5]:
        results.append(paragraphs[hit["corpus_id"]].replace("\n", " "))
    return results
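
The two-stage pattern used by search (a fast similarity search followed by a slower re-ranking of the survivors) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: two-dimensional toy vectors replace the sentence embeddings, and a dictionary of made-up scores replaces the cross-encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve_and_rerank(query_vec, corpus_vecs, rerank_score, top_k, final_k):
    # Stage 1: rank every corpus entry by cosine similarity and keep the top_k
    hits = [{"corpus_id": i, "score": cosine(query_vec, v)} for i, v in enumerate(corpus_vecs)]
    hits = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
    # Stage 2: re-score only the survivors with the more expensive scorer
    for hit in hits:
        hit["cross_score"] = rerank_score(hit["corpus_id"])
    hits = sorted(hits, key=lambda h: h["cross_score"], reverse=True)
    return [hit["corpus_id"] for hit in hits[:final_k]]


corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
made_up_scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.4}  # stand-in for cross-encoder output
result = retrieve_and_rerank(query, corpus, made_up_scores.get, top_k=3, final_k=2)
print(result)
```

Entry 2 has the second-highest made-up score but never reaches the second stage because it is dropped in the first, which is why `top_k` in search (32) is set well above the five results that are finally returned.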

In [29]:
# Now it's time to create a web interface and call our chatbot model with a user query. Let's see how it is done!
# After running this code you should be able to interact with your model: just click on the link provided by Gradio and you are ready to talk to
# an intelligent chatbot. At least it will know what you mean, so let's start with that, shall we :)

# import gradio as gr

# def chatbot_interface(query):
#     # Simulate the embedding process and search
#     args = parse_args()
#     model, cross_encoder, embeddings, paragraphs = embed(
#         args.fname,
#         args.window_size,
#         args.step_size,
#     )

#     # Get search results
#     search_results = search(
#         query,
#         model,
#         cross_encoder,
#         embeddings,
#         paragraphs,
#         top_k=args.top_k,
#     )
#     ai_response = generate_response(query, search_results)
#     # print("AI Chabot:", ai_response)
#     return ai_response['response']
# #     print (jsonify({"AI response": ai_response }))

# gr.set_static_paths(paths=["/content/sample_data/static/images"])
# image_path = "/content/sample_data/static/images/chatbot_image.jpg"
# gr.HTML(f"""<img src="/file={image_path}" width="100" height="100">""")

# # Create Gradio interface
# iface = gr.Interface(
#     fn=chatbot_interface,
#     # inputs='text',
#     inputs=gr.Textbox(placeholder="Enter your query and get a response from the AI chatbot."),
#     outputs='text',
#     title="AI Chatbot",
#     # description="Enter your query and get a response from the AI chatbot.",
#     theme=gr.themes.Monochrome(),
#     # gr.Image(value="/content/sample_data/static/images/chatbot_image.jpg"),
#     # gr.Image()
# )

# if __name__ == "__main__":
#     iface.launch(share=True, debug=True,allowed_paths=[image_path])

In [30]:
import gradio as gr

def chatbot_interface(query):
    # Embed the PDF and build the search index (this runs again on every query)
    args = parse_args()
    model, cross_encoder, embeddings, paragraphs = embed(
        args.fname,
        args.window_size,
        args.step_size,
    )

    # Get search results
    search_results = search(
        query,
        model,
        cross_encoder,
        embeddings,
        paragraphs,
        top_k=args.top_k,
    )
    ai_response = generate_response(query, search_results)
    # print("AI Chabot:", ai_response)
    return ai_response['response']


with gr.Blocks(theme=gr.themes.Monochrome()) as demo:
    gr.set_static_paths(paths=["/content/chatbot_image.png"])
    gr.Markdown("<img src='/file=/content/chatbot_image.png' alt='Chatbot Logo' height='100' width='400'>")

    name_input = gr.Textbox(label="Enter your query and get a response from the AI chatbot")
    submit_button = gr.Button("Submit")
    output = gr.Textbox(label="Response")

    submit_button.click(fn=chatbot_interface, inputs=name_input, outputs=output)

if __name__ == "__main__":
    demo.launch(allowed_paths=["/content/chatbot_image.png"], debug=True)

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://765bc3348618a3bbcb.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://765bc3348618a3bbcb.gradio.live
