
In [20]:
! pip install faiss-cpu



In [21]:
! pip install pdfplumber



In [59]:
import pdfplumber
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
    T5ForConditionalGeneration,
    T5Tokenizer
)
import os
import faiss
import pickle
import numpy as np

In [60]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [61]:
!ls /content/drive/My\ Drive/NLP

CustomNER  Road.docx  Road.pdf	Train


In [83]:
# Step 1: Read PDF with pdfplumber
def read_pdf(pdf_path):
    document = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                document += text + " "
    return document

In [84]:
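# Step 2: Chunk text into overlapping word windows
def chunk_text(text, chunk_size=300, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    # Consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this window already reaches the last word
    return chunks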
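
To make the overlapping window concrete, here is a minimal sketch on a toy input. `sliding_word_chunks` is a hypothetical stand-alone variant of the windowing idea in Step 2, with small sizes chosen so the overlap is easy to see; it stops once a window reaches the last word.

```python
def sliding_word_chunks(text, chunk_size=5, overlap=2):
    """Split text into word windows that share `overlap` words with their neighbour."""
    words = text.split()
    step = chunk_size - overlap  # each window starts `step` words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the current window already covers the final word
    return chunks


toy_chunks = sliding_word_chunks("a b c d e f g h i j")
for chunk in toy_chunks:
    print(chunk)
```

Each printed window repeats the last two words of the one before it, which is how a sentence that straddles a boundary can still appear whole in at least one chunk.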

In [85]:
# Step 3: Generate embeddings for chunks using DPR
def create_embeddings(chunks, context_encoder, context_tokenizer, max_length=512):
    embeddings = []
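    for chunk in chunks:
        # Chunks longer than max_length tokens are truncated before encoding
        inputs = context_tokenizer(chunk, return_tensors="pt", truncation=True, max_length=max_length)
        # pooler_output has shape (1, embedding_dim); detach() drops the autograd graph
        embedding = context_encoder(**inputs).pooler_output.detach().numpy()
        embeddings.append(embedding)
    return np.vstack(embeddings)  # shape: (num_chunks, embedding_dim)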

In [86]:
# Step 4: Store embeddings in FAISS
def build_faiss_index(embeddings):
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)
    index.add(embeddings)
    return index

In [87]:
# Step 5: Query FAISS and retrieve top-k chunks
def query_faiss(index, query, question_encoder, question_tokenizer, chunks, top_k=5, max_length=512):
    inputs = question_tokenizer(query, return_tensors="pt", truncation=True, max_length=max_length)
    query_embedding = question_encoder(**inputs).pooler_output.detach().numpy()
    # IndexFlatIP returns inner-product scores (higher is better), best match first
    scores, indices = index.search(query_embedding, top_k)
    # FAISS pads with -1 when fewer than top_k vectors are indexed; skip those
    return [chunks[i] for i in indices[0] if i >= 0]
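
`faiss.IndexFlatIP` performs an exhaustive inner-product search over the stored vectors. The sketch below reproduces that ranking idea with plain Python lists so it runs without FAISS or NumPy; `top_k_inner_product` and the toy vectors are hypothetical and only illustrate the scoring, not how FAISS is implemented.

```python
def top_k_inner_product(vectors, query, top_k=2):
    """Rank stored vectors by their inner product with the query, best first."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(vector, query)) for vector in vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return order, [scores[i] for i in order]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # stand-ins for chunk embeddings
toy_query = [1.0, 0.5]                              # stand-in for a question embedding
ids, scores = top_k_inner_product(toy_vectors, toy_query)
print(ids, scores)
```

The inner product is not normalised, so the third vector ranks first partly because of its larger magnitude. That is the same behaviour the notebook gets from `IndexFlatIP` on raw DPR embeddings.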


In [88]:
# Step 6: Use T5 to extract answers
def extract_answer(query, retrieved_chunks, t5_model, t5_tokenizer, max_length=512):
    context = " ".join(retrieved_chunks)
    input_text = f"question: {query} context: {context}"
    inputs = t5_tokenizer(input_text, return_tensors="pt", truncation=True, max_length=max_length)
    outputs = t5_model.generate(**inputs)
    return t5_tokenizer.decode(outputs[0], skip_special_tokens=True)
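
`extract_answer` joins every retrieved chunk and then truncates the prompt to `max_length` tokens. Five 300-word chunks are well over 512 tokens, so most of the retrieved text never reaches T5. One possible mitigation is to pack chunks in rank order until a budget is reached. The sketch below uses a word count as a rough stand-in for the token count (an assumption; exact counts come from the tokenizer), and `pack_context` is a hypothetical helper that is not part of the notebook.

```python
def pack_context(query, ranked_chunks, word_budget=350):
    """Build a T5-style prompt from as many top-ranked chunks as fit in `word_budget` words."""
    prefix = f"question: {query} context:"
    used = len(prefix.split())
    kept = []
    for chunk in ranked_chunks:  # assumed to be sorted best match first
        n_words = len(chunk.split())
        if used + n_words > word_budget:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += n_words
    return " ".join([prefix] + kept), len(kept)


toy_ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
prompt, n_kept = pack_context("who?", toy_ranked, word_budget=9)
print(n_kept, prompt)
```

Words are only a rough proxy: T5's tokenizer typically produces more tokens than words, so the budget should sit well below `max_length`. If even the first chunk exceeds the budget, this sketch returns a prompt with no context, which a real pipeline would need to handle.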

In [89]:
# Step 7: Initialize pipeline
def initialize_pipeline(pdf_path=None, save_dir="./faiss_data"):
    # Load Models and Tokenizers
    context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    t5_model = T5ForConditionalGeneration.from_pretrained("t5-base")
    t5_tokenizer = T5Tokenizer.from_pretrained("t5-base")

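    # Check if precomputed data exists (an empty or partial save_dir does not count)
    saved_files = ("faiss_index", "embeddings.npy", "chunks.pkl")
    if all(os.path.exists(os.path.join(save_dir, name)) for name in saved_files):
        print("Loading precomputed data...")
        index, embeddings, chunks = load_faiss_index(save_dir)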
    elif pdf_path is not None:
        print("Processing PDF and creating embeddings...")
        # Read and Process PDF
        document = read_pdf(pdf_path)
        chunks = chunk_text(document)

        # Create Embeddings and Build FAISS Index
        embeddings = create_embeddings(chunks, context_encoder, context_tokenizer, max_length=512)
        index = build_faiss_index(embeddings)

        # Save the FAISS index and chunks
        save_faiss_index(index, embeddings, chunks, save_dir)
    else:
        raise ValueError("Either 'pdf_path' must be provided or precomputed data must exist in 'save_dir'.")

    return {
        "context_encoder": context_encoder,
        "context_tokenizer": context_tokenizer,
        "question_encoder": question_encoder,
        "question_tokenizer": question_tokenizer,
        "t5_model": t5_model,
        "t5_tokenizer": t5_tokenizer,
        "index": index,
        "chunks": chunks,
    }


In [90]:
# Step 8: Save FAISS index, embeddings, and chunks
def save_faiss_index(index, embeddings, chunks, save_dir):
    os.makedirs(save_dir, exist_ok=True)  # Ensure the folder exists
    # Save the FAISS index
    faiss.write_index(index, os.path.join(save_dir, "faiss_index"))

    # Save embeddings
    np.save(os.path.join(save_dir, "embeddings.npy"), embeddings)

    # Save chunks
    with open(os.path.join(save_dir, "chunks.pkl"), "wb") as f:
        pickle.dump(chunks, f)

    print(f"Data saved successfully in {save_dir}")

In [91]:
# Step 9: Load FAISS index, embeddings, and chunks
def load_faiss_index(save_dir):
    # Load FAISS index
    index = faiss.read_index(os.path.join(save_dir, "faiss_index"))

    # Load embeddings
    embeddings = np.load(os.path.join(save_dir, "embeddings.npy"))

    # Load chunks (pickle can execute arbitrary code; only load files you created)
    with open(os.path.join(save_dir, "chunks.pkl"), "rb") as f:
        chunks = pickle.load(f)

    return index, embeddings, chunks
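
`load_faiss_index` expects all three files written by `save_faiss_index` to be present. A minimal sketch of a pre-flight check follows. `missing_cache_files` and `REQUIRED_FILES` are hypothetical names introduced here; the file names themselves are the ones used in Steps 8 and 9.

```python
import os
import tempfile

REQUIRED_FILES = ("faiss_index", "embeddings.npy", "chunks.pkl")


def missing_cache_files(save_dir):
    """Return the expected cache files that are absent from `save_dir`."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(save_dir, name))]


# Demonstrate on a throwaway directory: first empty, then with only the index file.
with tempfile.TemporaryDirectory() as tmp_dir:
    missing_before = missing_cache_files(tmp_dir)
    open(os.path.join(tmp_dir, "faiss_index"), "wb").close()
    missing_after = missing_cache_files(tmp_dir)

print(missing_before)
print(missing_after)
```

An empty list means the cache is complete. Anything else points to an interrupted save, in which case rebuilding from the PDF is likely safer than loading.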

In [100]:
# Step 10: Query pipeline
def query_pipeline(query, pipeline_data, top_k=5, max_length=512):
    # Unpack preloaded data
    question_encoder = pipeline_data["question_encoder"]
    question_tokenizer = pipeline_data["question_tokenizer"]
    t5_model = pipeline_data["t5_model"]
    t5_tokenizer = pipeline_data["t5_tokenizer"]
    index = pipeline_data["index"]
    chunks = pipeline_data["chunks"]

    # Retrieve Top-K Chunks
    retrieved_chunks = query_faiss(
        index,
        query,
        question_encoder,
        question_tokenizer,
        chunks,
        top_k=top_k,
        max_length=max_length,
    )

    # Generate Answer
    answer = extract_answer(query, retrieved_chunks, t5_model, t5_tokenizer, max_length=max_length)
    return answer, retrieved_chunks

In [97]:
# Step 11: Example usage
save_dir = "/content/drive/My Drive/NLP/Embeddings/"  # Folder in Google Drive
pdf_path = "/content/drive/My Drive/NLP/Road.pdf"  # Path to your PDF
pipeline_data = initialize_pipeline(pdf_path=pdf_path, save_dir=save_dir)


Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokeniz

Loading precomputed data...


In [102]:
# Query the pipeline (reuse preloaded data for every question)
query_1 = "what are the 10 countries have successed in reducing roadtraffic death?"
answer_1, retrieved_chunks = query_pipeline(query_1, pipeline_data)
print("Question 1", query_1)
print("Answer:", answer_1)
print("retrieved_chunks", retrieved_chunks)

# query_2 = "How can I apply for a refund?"
# answer_2, _ = query_pipeline(query_2, pipeline_data)  # returns (answer, retrieved_chunks)
# print("Question 2", query_2)
# print("Answer:", answer_2)



Question 1 what are the 10 countries have successed in reducing roadtraffic death?
Answer: 20
retrieved_chunks ["action and UHC, engaging relevant stakeholders and empowering local communities to strengthen Primary Health care, considering it the first line in reducing the consequences of road traffic injuries. Additionally, concerted efforts are needed to address the economic, social, and environmental determinants of health that impact road safety. This can be achieved by adopting a Health in All Policies approach and reducing risk factors. To achieve the vision of zero road traffic injuries, it is crucial to involve a wide range of stakeholders to reach Health for All,ensuringthatno one is left behind. It is equally importanttoaddressandmanageconflictsofinterest,promotetransparency,and implementparticipatorygovernancestrategies[74] 10.RoadSafetyandHealthinPost-PandemicRecovery In 2020, road deaths dropped by 20.2% on average across 19 countries comparedto2017-19duetoReduced traffic 