An implementation of a Retrieval-Augmented Generation (RAG) question answering application. The app allows users to upload documents and ask questions about them, returning context-aware answers grounded in the uploaded content.
- Technologies
- RAG Application Workflow
- Project Architecture
- Requirements
- Installation
- Running the Application
- Optional Setup
Traditional large language models (LLMs) generate answers based only on the knowledge they were trained on. This can lead to outdated or hallucinated responses, especially when dealing with domain-specific or frequently changing information.
Retrieval-Augmented Generation (RAG)
The RAG architecture solves this by combining information retrieval with language generation. Instead of relying solely on the model's internal knowledge, a RAG pipeline retrieves relevant documents from an external knowledge base and uses them to ground the model's answers.
In a typical RAG pipeline:
- The user uploads one or more documents (PDF, TXT, etc.) to be used as the knowledge base.
This step involves:
- Extracting text from the uploaded files.
- Chunking the extracted text into smaller pieces (documents/passages).
- Parsing and cleaning the data as needed (a minimal sketch follows this list).
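A minimal sketch of this ingestion step, assuming PyMuPDF for text extraction and LangChain's `RecursiveCharacterTextSplitter` for chunking; the file path and chunk sizes are illustrative, not the project's actual settings:

```python
import fitz  # PyMuPDF
from langchain_text_splitters import RecursiveCharacterTextSplitter


def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in the uploaded PDF."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def chunk_text(text: str) -> list[str]:
    """Split raw text into overlapping chunks ready for embedding."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    return splitter.split_text(text)


chunks = chunk_text(extract_text("example.pdf"))
```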
Indexing
- The chunks are converted into embeddings using an embedding model.
- The embeddings are stored in a vector store for efficient similarity search (see the sketch after this list).
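A rough sketch of the indexing step, assuming Cohere embeddings and a Qdrant collection as described in the architecture section; the collection name, model name, and vector size below are illustrative assumptions:

```python
import cohere
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

co = cohere.Client("YOUR_COHERE_API_KEY")           # placeholder credential
qdrant = QdrantClient(url="http://localhost:6333")  # assumed local Qdrant instance


def index_chunks(chunks: list[str], collection: str = "documents") -> None:
    # Embed the chunks; embed-english-v3.0 produces 1024-dimensional vectors.
    response = co.embed(
        texts=chunks, model="embed-english-v3.0", input_type="search_document"
    )
    # (Re)create the collection and upsert one point per chunk, keeping the
    # original text in the payload so it can be returned at query time.
    qdrant.recreate_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )
    qdrant.upsert(
        collection_name=collection,
        points=[
            PointStruct(id=i, vector=vector, payload={"text": chunk})
            for i, (vector, chunk) in enumerate(zip(response.embeddings, chunks))
        ],
    )
```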
When a query is submitted:
- The query is converted into an embedding.
- A similarity search is performed against the vector store to retrieve the most relevant chunks.
- A prompt is constructed using the user query and retrieved documents.
- The prompt is passed to an LLM (Large Language Model).
- The LLM returns a response grounded in the retrieved information (see the sketch below).
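A matching retrieval-and-generation sketch, again assuming Cohere and Qdrant; the prompt template, chat model name, and `top_k` value are illustrative, not the project's actual configuration:

```python
import cohere
from qdrant_client import QdrantClient

co = cohere.Client("YOUR_COHERE_API_KEY")
qdrant = QdrantClient(url="http://localhost:6333")


def answer(query: str, collection: str = "documents", top_k: int = 5) -> str:
    # Embed the query with the matching "search_query" input type.
    query_vector = co.embed(
        texts=[query], model="embed-english-v3.0", input_type="search_query"
    ).embeddings[0]
    # Retrieve the most similar chunks from the vector store.
    hits = qdrant.search(
        collection_name=collection, query_vector=query_vector, limit=top_k
    )
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    # Ground the LLM's answer in the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return co.chat(model="command-r", message=prompt).text
```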
The API is built with FastAPI and provides routes for document ingestion, semantic indexing, monitoring, and health checks. Documents can be uploaded, processed into chunks, stored in a vector database, and then queried for search or Q&A with an LLM. Metrics endpoints support observability, while base endpoints expose system and health information.
This architecture enables users to upload documents and query them through a FastAPI backend. Document processing runs in the background via Celery, where text is extracted (using PyMuPDF and Tesseract), converted into vector embeddings with Cohere, and stored in Qdrant. When a question is asked, the system retrieves the most relevant document chunks and leverages LangChain to generate a context-aware response. The entire solution is containerized with Docker, deployed on DigitalOcean, and monitored using Prometheus and Grafana.
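As an illustration of that flow, a stripped-down upload route that hands processing off to a Celery worker might look like the sketch below; the route path, task body, and Redis broker URL are assumptions rather than the project's actual configuration:

```python
from celery import Celery
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
celery_app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker


@celery_app.task
def process_document(path: str) -> None:
    # Extract, chunk, embed, and index the document in the background
    # (see the ingestion and indexing sketches above).
    ...


@app.post("/documents/upload")
async def upload_document(file: UploadFile = File(...)):
    # Persist the upload, then queue asynchronous processing.
    path = f"/tmp/{file.filename}"
    with open(path, "wb") as out:
        out.write(await file.read())
    process_document.delay(path)
    return {"status": "processing", "filename": file.filename}
```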
- Python 3.12
To run this project on Windows, follow these steps:
- Install WSL and Ubuntu for Windows (Tutorial)
- Open your Ubuntu terminal and update the package lists:
$ sudo apt update
- Download and install Miniconda on Ubuntu from here
- Create a new environment using the following command:
$ conda create -n <env_name> python=3.12
- Activate the environment:
$ conda activate <env_name>
Before installing the Python dependencies, ensure that the required system packages are installed to avoid compilation or runtime errors:
$ sudo apt update
$ sudo apt install libpq-dev gcc python3-dev
Once the system packages are in place, install the Python dependencies from the requirements file:
$ pip install -r requirements.txt
Set your environment variables in the .env file:
$ cp .env.example .env
You can run the FastAPI server using Uvicorn.
- Run the server locally:
$ uvicorn main:app --reload
- Run the server with custom host and port:
$ uvicorn main:app --reload --host 0.0.0.0 --port 5000
Copy .env.example to .env and update it with your credentials:
$ cd docker
$ cp .env.example .env
Set up your command line interface for better readability:
export PS1="\[\033[01;32m\]\u@\h:\w\n\[\033[00m\]\$ "