Welcome to DPR RAG, a cutting-edge web-based application designed to answer questions based on the content of uploaded PDF documents. Leveraging Dense Passage Retrieval (DPR) for Retrieval-Augmented Generation (RAG), this project combines semantic similarity with advanced retrieval techniques to deliver precise and contextually relevant responses. Built with modern technologies such as Streamlit, LangChain, Chroma, and Ollama, DPR RAG is an ideal tool for researchers, students, or professionals seeking to extract insights from documents efficiently.
The application offers a user-friendly interface where users can upload PDFs, ask questions in a conversational format, and receive detailed, real-time responses with citations to source documents, complete with confidence scores for transparency and verifiability. It seems likely that this tool is particularly valuable for tasks like academic research, legal document analysis, or technical documentation review, where quick and accurate information retrieval is essential.
Feature | Description |
---|---|
PDF Upload & Processing | Upload multiple PDF files, which are automatically split into chunks (1000 characters, 200-character overlap) and indexed for querying. |
Dense Passage Retrieval | Employs DPR to retrieve up to 8 relevant document passages based on semantic similarity, selecting the top 3 for response generation. |
Conversational Interface | Engage in a chat-like interface to ask questions and receive detailed answers based on document content. |
Real-Time Responses | Answers are streamed in real-time, ensuring a seamless user experience. |
Error Handling | Robust mechanisms handle errors and allow cleanup of uploaded files and vector stores. |
Source Citations | Responses include references to source documents with normalized confidence scores, enhancing trust and traceability. |
The DPR RAG system operates through a sophisticated pipeline that processes documents, retrieves relevant passages, and generates answers. Below is a detailed breakdown of the process:
- PDF Loading: PDFs are loaded using
PyPDFLoader
from LangChain, which extracts text from uploaded files. - Text Splitting: Documents are divided into chunks of 1000 characters with a 200-character overlap using
RecursiveCharacterTextSplitter
to maintain context. - Metadata Tagging: Each chunk is tagged with metadata, such as the source file name, to ensure traceability in responses.
The retrieval process utilizes Dense Passage Retrieval (DPR) to identify the most relevant document chunks:
- Embedding Generation: Both documents and queries are embedded into a high-dimensional vector space using
OllamaEmbeddings
(model:nomic-embed-text:latest
) with GPU support (num_gpu=1
). - Vector Storage: Embeddings are stored in a Chroma vector store, powered by ChromaDB, for efficient similarity searches.
- DPRRetriever: Retrieves up to
retrieve_k=8
documents based on cosine similarity between query and document embeddings. The topreturn_k=3
documents are selected after scoring.
To ensure comparability and interpretability, similarity scores are normalized using the softmax function, which transforms raw scores into probabilities that sum to 1. The formula is:
Where:
-
$\text{all\_scores}$ are$(1.0 - \text{distance})$ the initial similarity scores, calculated as$(1.0 - \text{distance})$ , where$(\text{distance})$ is the cosine distance between query and document embeddings (ranging from 0 for identical to 1 for dissimilar). - The softmax function exponentiates each score (
$(e^{\text{all\_scores}})$ ) and divides by the sum of all exponentiated scores ($(\sum e^{\text{all\_scores}})$ ), resulting in normalized scores that represent confidence levels for each document’s relevance.
This normalization ensures that the system prioritizes the most relevant documents, making the retrieval process reliable and transparent.
- Prompt Construction: Retrieved documents are formatted into a context string, including their content and normalized confidence scores. The prompt instructs the language model to summarize, synthesize, reference standards, and suggest solutions.
- LLM Generation: An
OllamaLLM
(model:qwen2.5:latest
, temperature: 0.3) generates a detailed response based on the context and query, streamed in real-time to the user interface.
flowchart LR
A(["User Query"])
A --> B["DPRRetriever"]
B --> C["Query Embedding"]
B --> D["Vector Store (Chroma)"]
C --> E["Similarity Search"]
D --> E
E --> F["Top-k Documents"]
F --> G["Prompt Construction"]
G --> H["LLM (OllamaLLM)"]
H --> I["Response Generation"]
I --> J["Final Answer"]
This diagram can be rendered in GitHub to visualize the pipeline from query to response.
To run DPR RAG, you’ll need to set up the following:
Requirement | Description |
---|---|
Python 3.8 or later | Ensure Python is installed on your system. Download from python.org. |
Ollama | A tool for running large language models locally. Install it based on your operating system: - Windows: Download the installer from Ollama Download and run it. - macOS: Download the installer from Ollama Download, unzip it, and drag the Ollama.app to your Applications folder. - Linux: Run the installation script as per the Ollama GitHub repository. Note: For optimal performance, configure Ollama to use a GPU if available, as the system uses num_gpu=1 for embeddings. |
Python Libraries | Install the required dependencies listed in requirements.txt , including streamlit , langchain , chromadb , langchain-ollama , and numpy . |
Follow these steps to set up the project:
- Clone the Repository:
git clone https://github.com/armanjscript/DPR-RAG.git
- Navigate to the Project Directory:
cd DPR-RAG
- Install Dependencies:
The
pip install -r requirements.txt
requirements.txt
file includes dependencies likestreamlit
,langchain
,chromadb
,langchain-ollama
, andnumpy
. - Start Ollama: Ensure the Ollama service is running on your system. Follow the instructions from the Ollama GitHub repository to start the service.
- Run the Streamlit App:
This will launch the app in your default web browser.
streamlit run dpr_rag.py
- Upload PDFs:
- In the sidebar, use the file uploader to select one or more PDF files.
- Files are saved locally in the
uploaded_pdfs
directory and indexed in a Chroma database (chroma_db
).
- Process Documents:
- Click the "Process Documents" button to split, embed, and store the PDFs in the vector store.
- Ask Questions:
- Enter your query in the chat input field in the main interface.
- The chatbot retrieves up to 8 relevant document chunks, selects the top 3 based on normalized scores, generates a response, and streams it in real-time.
- Responses include citations to source documents with confidence scores for reference.
- Clear Documents:
- Use the "Clear All Documents" button in the sidebar to reset the application by deleting uploaded files and the vector store, with robust error handling for reliability.
The system uses the following fixed parameters for optimal performance:
- Temperature: Set to 0.3 for controlled randomness in the language model’s responses, ensuring focused and relevant answers.
- Retrieval Parameters: Retrieves up to
retrieve_k=8
documents and returns the topreturn_k=3
based on normalized similarity scores.
To customize these parameters, edit the dpr_rag.py
file to adjust the OllamaLLM
temperature or the DPRRetriever
settings (e.g., retrieve_k
or return_k
).
We welcome contributions to enhance DPR RAG! To contribute:
- Fork the repository.
- Make your changes in a new branch.
- Submit a pull request with a clear description of your changes.
For detailed guidelines, refer to the CONTRIBUTING.md file.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions, feedback, or collaboration opportunities, please reach out:
- Email: [armannew73@gmail.com]
- GitHub Issues: Open an issue on this repository for bug reports or feature requests.
This project builds on the following open-source technologies:
- Streamlit for the web interface
- LangChain for document processing and RAG pipeline
- Chroma for vector storage
- Ollama for local language models and embeddings
- NumPy for numerical operations, including score normalization
Thank you for exploring DPR RAG! We hope it simplifies your document analysis tasks and inspires further innovation in AI-driven Q&A systems.
#AI #RAG #DPR #PDFQ&A #Streamlit #LangChain #Chroma #Ollama #MachineLearning #NaturalLanguageProcessing