<a href="https://colab.research.google.com/github/akshithaa1/chat_with_pdf/blob/main/chat_with_pdfs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
%pip install numpy faiss-cpu sentence-transformers transformers PyMuPDF streamlit



In [4]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """
    Extract text from the given PDF file.
    :param pdf_path: Path to the PDF file.
    :return: Combined text from all pages of the PDF.
    """
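    # Open the document in a context manager so the file handle is closed afterwards
    with fitz.open(pdf_path) as doc:
        # Collect the plain text of every page
        text = [page.get_text() for page in doc]
    return " ".join(text)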

# Extract text from the example PDF
pdf_path = "/content/POR answer.pdf"  # Replace with the path to your PDF
pdf_text = extract_text_from_pdf(pdf_path)
print(pdf_text[:1000])  # Display the first 1000 characters


Tables, Charts, and 
Graphs 
with Examples from History, Economics, 
Education, Psychology, Urban Affairs and 
Everyday Life
REVISED: MICHAEL LOLKUS 2018
  Tables, Charts, and 
Graphs Basics
 We use charts and graphs to visualize data.  
This data can either be generated data, data gathered from 
an experiment, or data collected from some source.
A picture tells a thousand words so it is not a surprise that 
many people use charts and graphs when explaining data.
 Types of Visual 
Representations of Data
 Table of Yearly U.S. GDP by 
Industry (in millions of dollars)
Year
2010
2011
2012
2013
2014
2015
All Industries
26093515
27535971
28663246
29601191
30895407
31397023
Manufacturing
4992521
5581942
5841608
5953299
6047477
5829554
Finance,
Insurance, Real 
Estate, Rental, 
Leasing
4522451
4618678
4797313
5031881
5339678
5597018
Arts, 
Entertainment, 
Recreation, 
Accommodation,
and Food Service
964032
1015238
1076249
1120496
1189646
1283813
Other
15614511
16320113
16948076
17495515
1

In [5]:
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss


def chunk_text(text, chunk_size=500):
    """
    Split text into smaller pieces for embedding.
    :param text: Full text string.
    :param chunk_size: Approximate word count per chunk.
    :return: List of text chunks.
    """
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks


def embed_and_store(chunks, model_name="all-MiniLM-L6-v2", index_path="pdf_index.faiss"):
    """
    Generate embeddings for chunks and store them in a FAISS index.
    :param chunks: List of text chunks.
    :param model_name: Pre-trained embedding model.
    :param index_path: Path to save the FAISS index.
    """
    # Load embedding model
    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks)

    # Create FAISS index
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)

    # Save index and chunks
    np.save("chunks.npy", np.array(chunks, dtype=object))
    faiss.write_index(index, index_path)
    print(f"Index saved to {index_path}")


if __name__ == "__main__":
    # Reuse extract_text_from_pdf, which was defined in the previous cell
    pdf_text = extract_text_from_pdf("/content/POR answer.pdf")
    chunks = chunk_text(pdf_text)
    embed_and_store(chunks)


Index saved to pdf_index.faiss
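
`chunk_text` above cuts the text at fixed word counts, so a sentence or table row that straddles a boundary ends up split across two chunks. One common mitigation is to let consecutive chunks share a few words. The following is a minimal sketch of that idea, not part of the original notebook; the function name `chunk_text_with_overlap` and the `overlap` parameter are hypothetical.

```python
def chunk_text_with_overlap(text, chunk_size=500, overlap=50):
    """
    Split text into word chunks where consecutive chunks share `overlap` words.
    (Hypothetical helper; `overlap` is an illustrative parameter.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
print(chunk_text_with_overlap(demo, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

With `overlap=0` this behaves like the original `chunk_text`. A larger overlap produces more chunks and therefore more embeddings to store.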


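`faiss.IndexFlatL2` performs an exact, exhaustive search: it compares the query with every stored vector and returns the `k` smallest squared Euclidean distances with their positions. The pure-Python sketch below only illustrates that logic on toy data. `l2_search` is a hypothetical name, and the real index is far faster than this loop.

```python
def l2_search(query, vectors, k=3):
    """
    Return (squared_distances, indices) of the k vectors closest to `query`.
    Illustrative stand-in for an exhaustive L2 search; not the FAISS API.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and this vector
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = l2_search([0.9, 0.1], vectors, k=2)
print(indices)  # [1, 0]
```

The returned positions play the same role as `indices[0]` in `query_index`: they are used to look up the matching text chunks.
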
In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load Hugging Face model and tokenizer
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


def query_index(query, model_name="all-MiniLM-L6-v2", index_path="pdf_index.faiss"):
    """
    Query the FAISS index to retrieve relevant text chunks.
    :param query: User's question.
    :param model_name: Embedding model.
    :param index_path: Path to the FAISS index.
    :return: Retrieved chunks of text.
    """
    # Load FAISS index and chunks
    index = faiss.read_index(index_path)
    chunks = np.load("chunks.npy", allow_pickle=True).tolist()

    # Encode the query
    model = SentenceTransformer(model_name)
    query_embedding = model.encode([query])

    # Retrieve the most relevant chunks
    k = 3  # Number of top results
    distances, indices = index.search(query_embedding, k)
    # FAISS pads missing results with -1 when the index holds fewer than k vectors
    relevant_chunks = [chunks[i] for i in indices[0] if i != -1]
    return relevant_chunks


def generate_response_with_huggingface(query, relevant_chunks):
    """
    Generate a response using Hugging Face Transformers.
    :param query: User's question.
    :param relevant_chunks: Retrieved chunks of text.
    :return: Generated response.
    """
    # Combine retrieved chunks as context
    context = "\n".join(relevant_chunks)
    input_text = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"

    # Tokenize the input
    inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)

    # Generate response, passing the attention mask along with the input IDs
    outputs = model.generate(**inputs, max_length=900, num_beams=5, early_stopping=True)

    # Decode and return the response
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    query = "Benefits of Robotic Process Automation (RPA):"
    relevant_chunks = query_index(query)
    response = generate_response_with_huggingface(query, relevant_chunks)
    print("Generated Response:")
    print(response)


Generated Response:
Table of Yearly U.S. GDP by Industry (in millions of dollars) Year 2010 2011 2012 2013 2014 2015 All Industries
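
The response above repeats table text instead of addressing the question. One possible cause is the prompt layout: the context comes first and the tokenizer truncates from the end, so a long context can push the question and the `Answer:` cue past the 1024-token limit. The sketch below shows one way to guard against that by trimming only the context. It assumes word count is an acceptable rough proxy for token count; `build_prompt` and `max_context_words` are hypothetical names, and 300 is an arbitrary illustrative budget.

```python
def build_prompt(query, relevant_chunks, max_context_words=300):
    """
    Assemble a prompt that always keeps the question and trims only the context.
    (Hypothetical helper; word counts only approximate token counts.)
    """
    context_words = []
    for chunk in relevant_chunks:
        remaining = max_context_words - len(context_words)
        if remaining <= 0:
            break  # context budget is used up
        context_words.extend(chunk.split()[:remaining])
    context = " ".join(context_words)
    # The question comes first, so it survives even if the tokenizer truncates
    return f"Question: {query}\n\nContext: {context}\n\nAnswer:"


prompt = build_prompt("What is shown?", ["a b c", "d e f"], max_context_words=4)
print(prompt)
```

The resulting string could be passed to `tokenizer(...)` in place of `input_text`. Whether this improves the answers for a given PDF would need to be checked by rerunning the cell.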
