DocuMind AI 🧠📚

Transform Your PDFs into Conversational Knowledge

Python Streamlit LangChain OpenAI Pinecone


🚀 Overview

DocuMind AI is an intelligent PDF question-answering system that leverages Retrieval-Augmented Generation (RAG) to help users extract insights from PDF documents through natural language conversations. It transforms static documents into an interactive knowledge base with real-time document processing and intelligent responses.

✨ Key Features

  • 📄 Dynamic PDF Upload: Real-time PDF processing with drag-and-drop interface
  • 🧠 Semantic Search: Vector-based similarity search using OpenAI embeddings and Pinecone
  • 🤖 AI-Powered Q&A: Natural language responses powered by GPT-3.5-Turbo
  • 📊 Source Attribution: Transparent citations showing which document sections informed each answer
  • 💬 Interactive Interface: Modern Streamlit web interface with real-time status monitoring
  • 🔄 Modular Architecture: Clean, maintainable code structure for easy customization
  • 📈 Index Management: Real-time vector count display and index clearing capabilities



🛠️ Technology Stack

| Component | Technology | Purpose |
|---|---|---|
| Framework | LangChain | AI application orchestration |
| LLM | OpenAI GPT-3.5-Turbo | Natural language generation |
| Vector DB | Pinecone v6.0.0 | Scalable serverless similarity search |
| Embeddings | OpenAI text-embedding-ada-002 | Text vectorization |
| Frontend | Streamlit | Interactive web interface |
| Document Processing | PyPDF2 | PDF text extraction |

🚀 Quick Start

Prerequisites

  • Python 3.9+
  • OpenAI API key
  • Pinecone account and API key

Installation

  1. Clone the repository

    git clone https://github.com/ashok49473/DocuMind-AI.git
    cd DocuMind-AI

  2. Install dependencies

    pip install -r requirements.txt

  3. Set up environment variables

    # Create .env file
    cp .env.example .env

    # Add your API keys to .env
    OPENAI_API_KEY=sk-your-openai-key-here
    PINECONE_API_KEY=your-pinecone-api-key
    PINECONE_INDEX_NAME=documind-ai

  4. Run the application

    streamlit run app.py

  5. Open your browser

    Navigate to: http://localhost:8501

📋 Usage Guide

Step 1: Upload PDF Document

  1. Open the Streamlit interface
  2. Use the sidebar file uploader to select a PDF
  3. Click "Process PDF" to analyze the document
  4. Wait for processing confirmation

Step 2: Ask Questions

  1. Enter your question in the main interface text input
  2. Click "Ask Question"
  3. Review the AI-generated response with source citations
  4. Expand "Source Documents" to see referenced text sections

Step 3: Manage Your Knowledge Base

  • Monitor system status in the right panel
  • View real-time vector count statistics
  • Clear the vector store to reset the system
  • Process new documents to update the knowledge base

📁 Project Structure

DocuMind-AI/
├── src/
│   ├── 📄 app.py                    # Main Streamlit application
│   ├── 📄 config.py                 # Configuration management
│   ├── 📄 pdf_processor.py          # PDF processing and chunking
│   ├── 📄 vector_store.py           # Pinecone v6.0.0 integration
│   └── 📄 qa_chain.py               # RAG implementation
├── 📄 requirements.txt          # Python dependencies
├── 📄 .env.example             # Environment variables template
└── 📄 README.md                # This documentation

⚙️ Configuration

Environment Variables

| Variable | Description | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key for GPT-3.5-Turbo and embeddings | ✅ |
| PINECONE_API_KEY | Pinecone API key for vector storage | ✅ |
| PINECONE_INDEX_NAME | Name for your Pinecone index | ❌ (default: documind-ai) |

Customizable Parameters

    # In config.py
    CHUNK_SIZE = 1000              # Document chunk size
    CHUNK_OVERLAP = 200            # Overlap between chunks
    LLM_MODEL = "gpt-3.5-turbo"    # OpenAI model
    EMBEDDING_MODEL = "text-embedding-ada-002"  # Embedding model
    PINECONE_DIMENSION = 1536      # Embedding dimension
    PINECONE_METRIC = "cosine"     # Similarity metric
    PINECONE_CLOUD = "aws"         # Cloud provider
    PINECONE_REGION = "us-east-1"  # Region
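
To see how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a minimal sketch of fixed-size chunking with overlap. It assumes both values are measured in characters and uses a hypothetical helper, `split_text`; the project itself may rely on a LangChain text splitter with different boundary rules.

```python
CHUNK_SIZE = 1000     # characters per chunk (value from config.py)
CHUNK_OVERLAP = 200   # characters shared by neighbouring chunks


def split_text(text: str, chunk_size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not taken from pdf_processor.py.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = split_text("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With the default values each chunk starts 800 characters after the previous one, so a 2,500-character document yields three chunks. A larger overlap keeps more shared context between neighbours at the cost of more embeddings to compute and store.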

🔧 Modular Architecture

Core Components

PDFProcessor

  • Extracts text from uploaded PDF files
  • Splits text into manageable chunks with overlap
  • Creates LangChain Document objects with metadata

VectorStoreManager

  • Manages Pinecone serverless index operations
  • Handles document embedding and storage
  • Performs similarity searches with configurable parameters
  • Provides index statistics and management
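
The responsibilities listed for VectorStoreManager can be illustrated with an in-memory stand-in. This is only a sketch: the class `InMemoryVectorStore` and its method names are hypothetical, it does not call the Pinecone client, and the two-dimensional vectors stand in for real 1536-dimensional embeddings.

```python
import math


class InMemoryVectorStore:
    """Toy stand-in for a vector index; not the Pinecone client API."""

    def __init__(self) -> None:
        # Maps an item id to its (vector, original text) pair.
        self._items: dict[str, tuple[list[float], str]] = {}

    def upsert(self, item_id: str, vector: list[float], text: str) -> None:
        """Insert a vector, or overwrite the one stored under the same id."""
        self._items[item_id] = (vector, text)

    def count(self) -> int:
        """Number of stored vectors (the 'vector count' statistic)."""
        return len(self._items)

    def clear(self) -> None:
        """Remove every stored vector."""
        self._items.clear()

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        """Return the top_k (id, cosine similarity) pairs, best match first."""
        scored = [(item_id, self._cosine(vector, stored))
                  for item_id, (stored, _text) in self._items.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.upsert("chunk-0", [1.0, 0.0], "about cats")
store.upsert("chunk-1", [0.0, 1.0], "about dogs")
store.upsert("chunk-2", [0.7, 0.7], "about pets")
print(store.count())  # 3
print([item_id for item_id, _score in store.query([1.0, 0.1], top_k=2)])
```

The query vector points almost entirely along the first axis, so "chunk-0" ranks first and "chunk-2" second. A managed index performs the same ranking with approximate search so that it scales beyond what a linear scan can handle.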

QAChain

  • Implements retrieval-augmented generation
  • Uses custom prompts for context-aware responses
  • Returns answers with source document attribution
  • Handles error cases gracefully
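
One way to combine a custom prompt with source attribution is to number each retrieved chunk before passing it to the model. The sketch below is an assumption about how this could look: the function `build_prompt`, the prompt wording, and the `page` metadata key are hypothetical and are not taken from qa_chain.py.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the answer can cite it as [n]."""
    if not chunks:
        # Handle the empty-retrieval case explicitly instead of sending a blank context.
        context = "(no relevant passages were retrieved)"
    else:
        context = "\n\n".join(
            f"[{i}] (page {c['page']}) {c['text']}"
            for i, c in enumerate(chunks, start=1)
        )
    return (
        "Answer the question using only the context below. "
        "Cite passages as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    {"text": "Embeddings are stored in a Pinecone index.", "page": 3},
    {"text": "Neighbouring chunks overlap.", "page": 5},
]
print(build_prompt("Where are embeddings stored?", retrieved))
```

Because the numbering is assigned when the prompt is built, the application can map a citation such as [1] back to the chunk it displayed under "Source Documents".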

Config

  • Centralized configuration management
  • Environment variable validation
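
Environment variable validation could look roughly like the sketch below. It assumes the variable names and the default index name from the Environment Variables table; the function `load_settings` and the returned dictionary keys are hypothetical, not the contents of config.py.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY")
DEFAULT_INDEX_NAME = "documind-ai"


def load_settings(env=None) -> dict:
    """Read settings from a mapping (os.environ by default) and fail fast if keys are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "pinecone_api_key": env["PINECONE_API_KEY"],
        # Optional variable: fall back to the documented default.
        "pinecone_index_name": env.get("PINECONE_INDEX_NAME") or DEFAULT_INDEX_NAME,
    }


settings = load_settings({"OPENAI_API_KEY": "sk-test", "PINECONE_API_KEY": "pc-test"})
print(settings["pinecone_index_name"])  # documind-ai
```

Accepting the mapping as a parameter keeps the check easy to test without touching the real environment, and reporting every missing name at once avoids fixing them one run at a time.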


🔍 How It Works

  1. Document Ingestion: PDF text is extracted and split into overlapping chunks
  2. Embedding Generation: The OpenAI embedding model (text-embedding-ada-002) converts each chunk into a vector
  3. Vector Storage: Embeddings are stored in a Pinecone serverless index
  4. Query Processing: User questions are embedded and matched against the stored vectors
  5. Context Retrieval: The most relevant document chunks are retrieved
  6. Answer Generation: GPT-3.5-Turbo generates a response using the retrieved context
  7. Source Attribution: The original document sections are shown alongside the answer for transparency
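
The order of the stages above can be traced in a toy, fully local sketch. Everything here is an assumption made for illustration: `embed` counts words instead of calling an embedding model, the "index" is a plain list instead of Pinecone, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Steps 1-3: ingest chunks, embed them, and store (vector, text) pairs.
chunks = [
    "Invoices are due within 30 days of receipt.",
    "Refunds are processed within 5 business days.",
    "Support is available by email on weekdays.",
]
index = [(embed(c), c) for c in chunks]


# Steps 4-5: embed the question and retrieve the closest chunks.
def retrieve(question: str, top_k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _vec, text in ranked[:top_k]]


# Steps 6-7: the retrieved text would be sent to the LLM as context
# and shown to the user as the source of the answer.
sources = retrieve("How long do refunds take?")
print(sources)  # ['Refunds are processed within 5 business days.']
```

Word counts only match exact terms, which is why real embeddings matter: a learned model can also match a question phrased with different vocabulary than the document.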

Considerations

  • Requires OpenAI and Pinecone API credits
  • Processing time depends on document size
  • Accuracy depends on document quality and structure
  • Best results with well-structured, text-based PDFs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • LangChain team for the incredible RAG framework
  • OpenAI for powerful language models and embeddings
  • Pinecone for scalable serverless vector search
  • Streamlit for the intuitive web framework
  • PyPDF2 for reliable PDF processing

📞 Contact

Ashok Kumar - ashokpalivela123@gmail.com

Project Link: https://github.com/ashok49473/DocuMind-AI

Portfolio: https://ashok49473.github.io


🚀 Future Enhancements

  • Multi-document conversation support
  • Advanced filtering and search options
  • Integration with cloud storage services
  • Batch document processing
  • Custom embedding model support

⭐ Star this project if you found it helpful!

Built with ❤️ using LangChain, OpenAI, and Pinecone
