DocuMind AI is a PDF question-answering system that uses Retrieval-Augmented Generation (RAG) to let users query PDF documents in natural language. It turns static documents into an interactive knowledge base: uploaded PDFs are processed in real time, and each answer cites the document sections it draws on.
- Dynamic PDF Upload: Real-time PDF processing with drag-and-drop interface
- Semantic Search: Vector-based similarity search using OpenAI embeddings and Pinecone
- AI-Powered Q&A: Natural language responses powered by GPT-3.5-Turbo
- Source Attribution: Transparent citations showing which document sections informed each answer
- Interactive Interface: Modern Streamlit web interface with real-time status monitoring
- Modular Architecture: Clean, maintainable code structure for easy customization
- Index Management: Real-time vector count display and index clearing capabilities
| Component | Technology | Purpose |
|---|---|---|
| Framework | LangChain | AI application orchestration |
| LLM | OpenAI GPT-3.5-Turbo | Natural language generation |
| Vector DB | Pinecone v6.0.0 | Scalable serverless similarity search |
| Embeddings | OpenAI text-embedding-ada-002 | Text vectorization |
| Frontend | Streamlit | Interactive web interface |
| Document Processing | PyPDF2 | PDF text extraction |
- Python 3.9+
- OpenAI API key
- Pinecone account and API key
- Clone the repository:

      git clone https://github.com/ashok49473/DocuMind-AI.git
      cd DocuMind-AI

- Install dependencies:

      pip install -r requirements.txt

- Set up environment variables:

      # Create .env file
      cp .env.example .env

      # Add your API keys to .env
      OPENAI_API_KEY=sk-your-openai-key-here
      PINECONE_API_KEY=your-pinecone-api-key
      PINECONE_INDEX_NAME=documind-ai

- Run the application:

      streamlit run app.py

- Open your browser and navigate to http://localhost:8501
- Open the Streamlit interface
- Use the sidebar file uploader to select a PDF
- Click "Process PDF" to analyze the document
- Wait for processing confirmation
- Enter your question in the main interface text input
- Click "Ask Question"
- Review the AI-generated response with source citations
- Expand "Source Documents" to see referenced text sections
- Monitor system status in the right panel
- View real-time vector count statistics
- Clear the vector store to reset the system
- Process new documents to update the knowledge base
DocuMind-AI/
├── src/
├── app.py                # Main Streamlit application
├── config.py             # Configuration management
├── pdf_processor.py      # PDF processing and chunking
├── vector_store.py       # Pinecone v6.0.0 integration
├── qa_chain.py           # RAG implementation
├── requirements.txt      # Python dependencies
├── .env.example          # Environment variables template
└── README.md             # This documentation
| Variable | Description | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key for GPT-3.5-Turbo and embeddings | Yes |
| PINECONE_API_KEY | Pinecone API key for vector storage | Yes |
| PINECONE_INDEX_NAME | Name for your Pinecone index | No (default: documind-ai) |
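
The variables in the table above could be read and checked at startup roughly as follows. This is a minimal sketch, not the project's actual `config.py`: the function name `load_settings` and the dictionary it returns are assumptions made for illustration. It treats the two API keys as required and falls back to the documented default for the index name.

```python
import os

# Names and default taken from the table above.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY")
DEFAULT_INDEX_NAME = "documind-ai"


def load_settings(env=None):
    """Return a dict of settings; raise ValueError if a required key is missing.

    `env` defaults to os.environ. Passing a plain dict makes the function
    easy to exercise without touching the real environment.
    (Hypothetical helper, not part of the repository.)
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name] for name in REQUIRED_VARS}
    # An unset or empty index name falls back to the documented default.
    settings["PINECONE_INDEX_NAME"] = (
        env.get("PINECONE_INDEX_NAME") or DEFAULT_INDEX_NAME
    )
    return settings
```

Failing fast here means a missing key surfaces as one clear error at launch rather than as an authentication failure on the first API call.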
# In config.py
CHUNK_SIZE = 1000 # Document chunk size
CHUNK_OVERLAP = 200 # Overlap between chunks
LLM_MODEL = "gpt-3.5-turbo" # OpenAI model
EMBEDDING_MODEL = "text-embedding-ada-002" # Embedding model
PINECONE_DIMENSION = 1536 # Embedding dimension
PINECONE_METRIC = "cosine" # Similarity metric
PINECONE_CLOUD = "aws" # Cloud provider
PINECONE_REGION = "us-east-1" # Region
- Extracts text from uploaded PDF files
- Splits text into manageable chunks with overlap
- Creates LangChain Document objects with metadata
- Manages Pinecone serverless index operations
- Handles document embedding and storage
- Performs similarity searches with configurable parameters
- Provides index statistics and management
- Implements retrieval-augmented generation
- Uses custom prompts for context-aware responses
- Returns answers with source document attribution
- Handles error cases gracefully
- Centralized configuration management
- Environment variable validation
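
The overlap-based splitting described for the PDF processor can be illustrated with a plain-Python stand-in. This sketch assumes that `CHUNK_SIZE` and `CHUNK_OVERLAP` are measured in characters, which the source does not state, and the function name `split_with_overlap` is hypothetical. The project may well use a LangChain text splitter instead; the sliding-window idea is the same.

```python
CHUNK_SIZE = 1000     # value from config.py; assumed to be characters
CHUNK_OVERLAP = 200   # characters shared between consecutive chunks


def split_with_overlap(text, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_with_overlap("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With these settings each window advances by 800 characters, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.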
- Document Ingestion: PDF text is extracted and split into overlapping chunks
- Embedding Generation: OpenAI creates vector representations of text chunks
- Vector Storage: Embeddings are stored in a Pinecone serverless index
- Query Processing: User questions are embedded and matched against stored vectors
- Context Retrieval: Most relevant document chunks are retrieved
- Answer Generation: GPT-3.5-Turbo generates responses using retrieved context
- Source Attribution: Original document sections are provided for transparency
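
The query, retrieval, and prompt-assembly steps above can be sketched in memory. In the real application the embeddings come from OpenAI and the nearest-neighbour search runs in Pinecone; here, toy 3-dimensional vectors and a linear scan stand in for both. The function names and the prompt wording are assumptions for illustration, not the project's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(question, hits):
    """Number the retrieved chunks so the answer can cite its sources."""
    context = "\n\n".join(
        f"[{i + 1}] {text}" for i, (text, _score) in enumerate(hits)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy vectors stand in for the 1536-dimensional embeddings.
store = [
    ("Invoices are due within 30 days.", [1.0, 0.0, 0.0]),
    ("The warranty lasts two years.", [0.0, 1.0, 0.0]),
    ("Late payments incur a 2% fee.", [0.9, 0.1, 0.0]),
]
hits = retrieve([1.0, 0.0, 0.0], store, k=2)
print(build_prompt("When are invoices due?", hits))
```

The two chunks about payment rank above the warranty chunk, and the numbered context is what would be passed to the language model along with the question. Keeping the retrieved texts alongside their scores is also what makes source attribution possible afterwards.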
- Requires OpenAI and Pinecone API credits
- Processing time depends on document size
- Accuracy depends on document quality and structure
- Best results with well-structured, text-based PDFs
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain team for the incredible RAG framework
- OpenAI for powerful language models and embeddings
- Pinecone for scalable serverless vector search
- Streamlit for the intuitive web framework
- PyPDF2 for reliable PDF processing
Ashok Kumar - ashokpalivela123@gmail.com
Project Link: https://github.com/ashok49473/DocuMind-AI
Portfolio: https://ashok49473.github.io
- Multi-document conversation support
- Advanced filtering and search options
- Integration with cloud storage services
- Batch document processing
- Custom embedding model support

