<h1>Overview</h1>

This script creates a web application called "Document Parser" using Streamlit. Users upload PDF documents, enter a Google API key, and ask questions about the content of the documents. The application extracts the text, splits it into overlapping chunks, converts the chunks into embeddings stored in a FAISS index, and answers each question by retrieving the most relevant chunks and passing them to a Google Gemini chat model.

<h1>Code Breakdown</h1>

Install the necessary libraries before importing them.

1. **Importing Libraries**:

    streamlit to create the web interface.
    os to interact with the operating system.
    langchain.text_splitter to split large text into smaller chunks.
    langchain_community.vectorstores (FAISS) to store text chunks in a vector database.
    langchain.chains.question_answering to create a question-answering chain.
    langchain.prompts to define prompt templates.
    PyPDF2 to read PDF files.
    langchain_google_genai for embeddings and chat models using Google Generative AI.
    google.api_core.retry, time, and logging to retry and log failed embedding requests.

2. **Streamlit Page Configuration**:

    Sets the page title and layout.
    Displays instructions on how to use the application.

3. **API Key Input**:

    A password-masked input field for the user to enter their Google API key.

4. **Extracting Text from PDFs**:

    get_pdf_text function reads text from uploaded PDF files.

5. **Splitting Text into Chunks**:

    get_text_chunks function splits the extracted text into smaller chunks for easier processing.

6. **Creating a Vector Store**:

    get_vector_store function converts text chunks into embeddings and stores them in a FAISS index saved locally as "faiss_index", retrying up to three times if embedding fails.

7. **Creating a Conversational Chain**:

    get_conversational_chain function builds a question-answering chain from a prompt template and the Gemini chat model, instructing the model to answer in detail using only the context retrieved from the documents.

8. **Handling User Input**:

    user_input function processes the user's question, searches for relevant document chunks, and generates a response using the conversational chain.

9. **Main Function**:

    Sets up the Streamlit interface, including a header, input fields for user questions, and a sidebar menu for uploading PDFs.
    Processes the documents and prepares them for question-answering when the user clicks the "Submit & Process" button.
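
To make step 5 concrete, the sketch below shows what chunk_size=1000 and chunk_overlap=200 mean in practice. It is a simplified fixed-width sliding window, not the algorithm used by RecursiveCharacterTextSplitter (which also tries to break on separators such as paragraphs and sentences). The function name split_with_overlap is hypothetical and does not appear in app.py.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-width sliding window: a simplified stand-in for a real text splitter."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts 800 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# 2500 characters of sample text (digits repeating 0-9)
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

The last 200 characters of each chunk repeat at the start of the next one, so a sentence that straddles a chunk boundary is still seen whole in at least one chunk.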

<h1>How it Works</h1>

User Interaction:

    User enters their Google API key.
    User uploads PDF documents.
    User asks a question about the content of the documents.

Document Processing:

    Text is extracted from the PDFs.
    The text is split into smaller chunks.
    These chunks are converted into embeddings and stored in a vector database.
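
The idea of storing chunks as vectors and then finding the nearest ones can be illustrated without any API key. In the sketch below, word counts stand in for the real embedding model and a plain Python list stands in for FAISS. All names (toy_embed, cosine, ToyVectorStore) are hypothetical, and real embeddings capture meaning rather than exact word overlap.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database: keeps (text, vector) pairs."""

    def __init__(self, texts):
        self.entries = [(t, toy_embed(t)) for t in texts]

    def similarity_search(self, query, k=4):
        # Rank stored chunks by similarity to the query and keep the top k
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "Invoices are due within 30 days of receipt.",
    "The warranty covers parts and labour for two years.",
    "Refunds are processed within 10 business days.",
])
top = store.similarity_search("how long is the warranty", k=1)
print(top)
```

In app.py the same two roles are played by GoogleGenerativeAIEmbeddings (the embedding step) and FAISS.from_texts (the storage step).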

Question Answering:

    When the user asks a question, the application searches the vector database for relevant text chunks.
    The conversational chain model generates a detailed answer based on the context provided by these chunks.
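
A "stuff" chain, as requested in app.py with chain_type="stuff", places all retrieved chunks into a single prompt. The sketch below shows roughly what that prompt looks like before it reaches the model. The helper build_stuff_prompt and the sample chunks are hypothetical, and joining chunks with blank lines is an assumption made for illustration.

```python
# Template modeled on the one in app.py; wording simplified for the example
PROMPT_TEMPLATE = (
    "Answer the question as precisely as possible from the provided context.\n"
    'If the answer is not available, say "Answer not available in context".\n\n'
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:"
)


def build_stuff_prompt(chunks, question):
    """Join the retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


retrieved = [
    "The warranty covers parts and labour for two years.",
    "Claims require proof of purchase.",
]
prompt = build_stuff_prompt(retrieved, "How long is the warranty?")
print(prompt)
```

Because every retrieved chunk is placed in one prompt, the number and size of chunks must fit within the model's context window. This is one reason the text is split into chunks of limited size.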

In [1]:
#!pip install streamlit
#!pip install langchain
#!pip install langchain_google_genai
#!pip install google-generativeai
#!pip install -U langchain-community
#!pip install faiss-cpu
#!pip install PyPDF2

In [2]:
%%writefile app.py
import streamlit as st  # Used to create interactive web applications using Python
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Splits text into smaller chunks
from langchain_community.vectorstores import FAISS  # Vector database
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from PyPDF2 import PdfReader
from langchain_google_genai import GoogleGenerativeAIEmbeddings  # Converts chunks into vector embeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from google.api_core import retry
import google.generativeai as genai
import time
import logging

logging.basicConfig(level=logging.ERROR)

st.set_page_config(page_title="Personal Document Chatbot Tool", layout="wide")

st.markdown("""
## Document Chatbot: Get instant insights from your document
## How it starts:

Follow the instructions below:

1. **Enter API Key** - You will need a Google API key so the chatbot can access Google's generative AI models. Obtain your API key here: https://aistudio.google.com/app/apikey
2. **Upload your documents** - The system accepts multiple PDF files at once and analyzes their content to provide comprehensive insights.
3. **Ask a question** - After the "Processing Done!" message appears, ask any question related to the content of your uploaded documents.
""")

api_key = st.text_input("Enter your Google API Key:", type="password", key="api_key_input")

# This is a function to extract text from uploaded PDFs
def get_pdf_text(pdf_docs):
    text = ""

    # Read each PDF independently so one unreadable file does not abort the rest
    for pdf in pdf_docs:
        try:
            pdf_reader = PdfReader(pdf)
            for page in pdf_reader.pages:
                # Guard against pages that yield no text (e.g. scanned images)
                text += page.extract_text() or ""
        except Exception as e:
            st.error(f"Error reading {pdf.name}: {e}")
            continue
    if text:
        st.info("Text extraction complete.")
    else:
        st.warning("No text extracted. Please check your PDF files.")
    return text

# This is a function to split pdf text to text chunks
def get_text_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_text(text)
    return chunks

# This is a function to convert text chunks into vector embeddings and store them in a local FAISS index
def get_vector_store(text_chunks, api_key):

    # A retry policy is passed to the embedding client to handle timeouts while creating embeddings
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=api_key, request_options={'retry':retry.Retry()})
    vector_store = None
    retry_attempts = 3
    for attempt in range(retry_attempts):
        try:
            vector_store = FAISS.from_texts(text_chunks, embeddings)
            vector_store.save_local("faiss_index")
            st.success("Embeddings created and saved successfully.")
            break
        except Exception as e:
            logging.error(f"Error embedding content: {e}", exc_info=True)
            st.error(f"Error embedding content: {e}. Retrying... ({attempt + 1}/{retry_attempts})")
            time.sleep(2 * (attempt + 1))
            continue
    if vector_store is None:
        st.error("Failed to create embeddings after multiple attempts.")

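# This is a function to reload the FAISS index saved by get_vector_store
def load_vector_store(api_key):
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=api_key)
    # Loading unpickles data from disk; this is acceptable only because the index was created locally by this app
    vector_store = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
    return vector_store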

# This function builds the question-answering chain: a prompt template combined with the Gemini chat model
def get_conversational_chain(api_key):

    # This is where prompt engineering happens. It is crucial to get this right, as the length and articulation of the prompt affect response quality.
    prompt_template = """
    Answer the question as precisely as possible from the provided context.
    Make sure to provide all the details. If the answer is not available, don't hallucinate; say "Answer not available in context".

    Context: \n {context}\n
    Question: \n{question}\n

    Answer:
    """

    model = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.3, google_api_key=api_key)
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
    chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
    return chain

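# This function retrieves the chunks most similar to the question and passes them to the QA chain
def user_input(user_question, api_key):
    # The index only exists after documents have been processed at least once
    if not os.path.exists("faiss_index"):
        st.warning("Please upload and process your documents before asking a question.")
        return
    vector_store = load_vector_store(api_key)
    docs = vector_store.similarity_search(user_question)
    chain = get_conversational_chain(api_key)
    response = chain.invoke({"input_documents": docs, "question": user_question}, return_only_outputs=True)
    st.write("Reply: ", response["output_text"])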

# This snippet will be responsible for setting up the Streamlit page
def main():
    st.header("Document Chatbot Panel:")

    user_question = st.text_input("Ask a question from uploaded PDF files", key="user_question")

    if user_question and api_key:
        user_input(user_question, api_key)

    with st.sidebar:
        st.title("Menu:")
        pdf_docs = st.file_uploader("Upload your documents", type="pdf", accept_multiple_files=True, key="pdf_uploader")
        if st.button("Submit & Process", key="process_button") and api_key:
            with st.spinner("Processing..."):
                try:
                    raw_text = get_pdf_text(pdf_docs)
                    # Skip embedding when nothing was extracted; an index cannot be built from zero chunks
                    if raw_text:
                        text_chunks = get_text_chunks(raw_text)
                        get_vector_store(text_chunks, api_key)
                        st.success("Processing Done!")
                except Exception as e:
                    st.error(f"An error occurred during processing: {e}")

if __name__ == "__main__":
    main()


Overwriting app.py


After executing the cell above, an app.py file should be generated in your project folder. In a terminal, navigate to the folder containing app.py using cd "{folder full path}" and then run the command "streamlit run app.py".