<a href="https://www.kaggle.com/code/adityabhaskar12/rag-originbluy?scriptVersionId=249940418" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Package Installation
Installs the Python dependencies for the RAG system:
- **OCR/Image Processing**: opencv-python, pdf2image, Pillow, EasyOCR (installed from GitHub)
- **Document Parsing**: PyMuPDF (PDFs), python-docx (Word), mammoth
- **ML/NLP**: transformers, sentence-transformers, torch
- **Vector DB**: chromadb
- **Utils**: numpy

The LangChain packages (`langchain-community`, `langchain-huggingface`) are installed in the cells that follow.

In [None]:
!pip install opencv-python
!pip install PyMuPDF
!pip install transformers
!pip install torch
!pip install numpy
!pip install sentence-transformers
!pip install pdf2image
!pip install python-docx
!pip install mammoth
!pip install chromadb
!pip install Pillow
!pip install git+https://github.com/JaidedAI/EasyOCR.git
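
Several of these packages are imported under a different name than the one passed to `pip`. The following is a minimal sketch, not part of the original notebook, for checking that the installs succeeded before any model is loaded. The mapping and the helper name `missing_packages` are illustrative assumptions.

```python
from importlib.util import find_spec

# pip name -> import name (they differ for several packages used here)
PIP_TO_IMPORT = {
    "opencv-python": "cv2",
    "PyMuPDF": "fitz",
    "python-docx": "docx",
    "Pillow": "PIL",
    "pdf2image": "pdf2image",
    "chromadb": "chromadb",
    "easyocr": "easyocr",
}


def missing_packages(mapping):
    """Return the pip names whose import module cannot be located."""
    return sorted(pip for pip, mod in mapping.items() if find_spec(mod) is None)


print("Missing:", missing_packages(PIP_TO_IMPORT) or "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.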

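## System Dependencies
Installs the system libraries the Python packages rely on:
- `poppler-utils`: PDF rendering backend used by `pdf2image`
- `tesseract-ocr`: OCR engine (not called by the code below, which uses EasyOCR)
- `libgl1`: OpenGL runtime that `opencv-python` needs in headless environments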

In [None]:
!sudo apt update
!sudo apt install -y poppler-utils tesseract-ocr libgl1
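
A quick way to confirm the system packages landed on the `PATH` is to look up their executables. This is a sketch under the assumption that `poppler-utils` provides `pdftoppm` and `pdfinfo` and that `tesseract-ocr` provides `tesseract`; the helper name `find_binaries` is hypothetical.

```python
import shutil


def find_binaries(names):
    """Map each executable name to its resolved path, or None if absent."""
    return {name: shutil.which(name) for name in names}


status = find_binaries(["pdftoppm", "pdfinfo", "tesseract"])
for name, path in status.items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

If `pdftoppm` is reported as missing, `convert_from_path` is likely to fail, and the PDF branch would then fall back to plain PyMuPDF text extraction.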

## 🔄 LangChain Update
Installs or upgrades `langchain-community`, which provides the `Chroma` vector store integration imported below.

In [None]:
!pip install -U langchain-community

## 🤗 HuggingFace Setup
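Installs the `langchain-huggingface` integration and upgrades `transformers`. The integration provides:
- `HuggingFaceEmbeddings`, a wrapper around sentence-transformers models
- `HuggingFacePipeline`, which exposes a local `transformers` pipeline as a LangChain LLM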

In [None]:
!pip install -U langchain-huggingface transformers

##  Import Dependencies
Key imports organized by functionality:

### **Document Processing**
- `fitz` (PyMuPDF), `Document` (docx), `pdf2image`
- `easyocr` for OCR

### **NLP/ML**
- HuggingFace `transformers`, `pipeline`
- `TableTransformerForObjectDetection` (table extraction)

### **Vector DB**
- `Chroma` vector store
- `HuggingFaceEmbeddings`

### **Utils**
- `RecursiveCharacterTextSplitter` for chunking
- `PromptTemplate` for LLM instructions

In [None]:
import os
import json
import numpy as np
import cv2
import fitz 
from docx import Document
from typing import List, Dict, Any, Optional
from pdf2image import convert_from_path
import easyocr
import torch
from langchain_huggingface import HuggingFaceEmbeddings 
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document as LangchainDocument
from langchain.prompts import PromptTemplate
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    TableTransformerForObjectDetection,
    DetrImageProcessor
)

## System Configuration
Centralized settings via the `Config` class:

    CHROMA_DIR = "chroma_db"  # Vector store directory
    MISTRAL_PATH = "/kaggle/input/mistral/..."  # Local LLM path
    EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Embedding model
    SUPPORTED_EXTS = {'.pdf', '.docx', '.txt'}  # Accepted file types
    MAX_DOC_SIZE = 100 MB  # Per-file size limit, stored in bytes
    MAX_DOCUMENTS = 10  # Maximum number of files processed per run
    DEVICE = "cuda" if available else "cpu"  # Uses an NVIDIA GPU when present, otherwise the CPU

In [None]:
class Config:
    CHROMA_DIR = "chroma_db"
    MISTRAL_PATH = "/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"
    EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
    SUPPORTED_EXTS = {'.pdf', '.docx', '.txt'}
    MAX_DOC_SIZE = 100 * 1024 * 1024  
    MAX_DOCUMENTS = 10
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

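## Document Processor
Multi-format parser that returns one record per page, each holding a list of `text` and `table` blocks:

### Key Features:
1. **PDF Processing**
   - OCR with EasyOCR
   - Table detection with Table Transformer (runs only when a GPU is available)
   - Fallback to PyMuPDF text extraction if the OCR path fails

2. **DOCX Support**
   - Paragraph extraction
   - Table content preservation

3. **TXT Support**
   - Reads the whole file as a single UTF-8 text block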

In [None]:
class DocumentProcessor:
    def __init__(self):
        try:
            self.ocr_reader = easyocr.Reader(['en'])
            self.table_processor = DetrImageProcessor.from_pretrained(
               "microsoft/table-transformer-detection",
                size={"longest_edge": 1000}, 
                use_fast=True
            )
            self.table_detector = TableTransformerForObjectDetection.from_pretrained(
                "microsoft/table-transformer-detection"
            ).to(Config.DEVICE)
        except Exception as e:
            print(f"Initialization error: {str(e)}")
            raise

    def process_file(self, file_path: str) -> List[Dict[str, Any]]:
        """Main method to process any supported file type"""
        ext = os.path.splitext(file_path)[1].lower()
        try:
            if ext == '.pdf':
                return self._process_pdf(file_path)
            elif ext == '.docx':
                return self._process_docx(file_path)
            elif ext == '.txt':
                return self._process_txt(file_path)
            # Unsupported extension: return an empty list rather than None
            print(f"Unsupported file type: {file_path}")
            return []
        except Exception as e:
            print(f"Error processing {file_path}: {str(e)}")
            return []

    def _process_pdf(self, pdf_path: str) -> List[Dict[str, Any]]:
        """Process PDF with OCR and table detection, with robust fallback"""
        try:
            advanced_result = self._process_pdf_with_ocr_and_tables(pdf_path)
            if advanced_result:
                return advanced_result
        except Exception as e:
            print(f"Advanced processing failed: {str(e)}")
        try:
            doc = fitz.open(pdf_path)
            return [{
                "page_number": i + 1,
                "content": [{"type": "text", "content": page.get_text()}],
                "source": os.path.basename(pdf_path)
            } for i, page in enumerate(doc)]
        except Exception as e:
            print(f"Simple PDF processing failed: {str(e)}")
            return []

    def _process_pdf_with_ocr_and_tables(self, pdf_path: str) -> List[Dict[str, Any]]:
        """Advanced PDF processing with OCR and table detection"""
        try:
            images = convert_from_path(pdf_path)
            doc = fitz.open(pdf_path)
            extracted_data = []

            for page_num, (image, page) in enumerate(zip(images, doc), 1):
                page_content = []
                cv_image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)

                # TABLE DETECTION (only if GPU available)
                table_boxes = []
                if torch.cuda.is_available():
                    try:
                        inputs = self.table_processor(images=image, return_tensors="pt").to(Config.DEVICE)
                        with torch.no_grad():
                            outputs = self.table_detector(**inputs)

                        target_sizes = torch.tensor([image.size[::-1]])
                        results = self.table_processor.post_process_object_detection(
                            outputs, threshold=0.9, target_sizes=target_sizes
                        )[0]

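                        for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
                            # `label` is a class-id tensor, so map it to its name before comparing
                            label_name = self.table_detector.config.id2label[label.item()]
                            # Accept every class whose name mentions "table"
                            if score > 0.9 and "table" in label_name:
                                box = [int(i) for i in box.tolist()]
                                # Boxes arrive as (x0, y0, x1, y1); store them as (y0, x0, y1, x1)
                                table_boxes.append((box[1], box[0], box[3], box[2]))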
                    except Exception as e:
                        print(f"Table detection failed, continuing without tables: {str(e)}")
                try:
                    if table_boxes:
                        mask = np.ones(cv_image.shape[:2], dtype=np.uint8) * 255
                        for y0, x0, y1, x1 in table_boxes:
                            cv2.rectangle(mask, (x0, y0), (x1, y1), 0, -1)
                        masked_image = cv2.bitwise_and(cv_image, cv_image, mask=mask)
                    else:
                        masked_image = cv_image
                    ocr_results = self.ocr_reader.readtext(masked_image, paragraph=True)
                    for result in ocr_results:
                        try:
                            if len(result) >= 2:
                                text = result[1]
                                bbox = result[0]
                                page_content.append({
                                    "type": "text",
                                    "content": text,
                                    "position": [int(coord) for point in bbox for coord in point] if bbox else []
                                })
                        except Exception as e:
                            print(f"Error processing OCR result: {str(e)}")

                except Exception as e:
                    print(f"OCR failed for page {page_num}, falling back to simple text: {str(e)}")
                    page_content.append({
                        "type": "text",
                        "content": page.get_text(),
                        "position": []
                    })
                for y0, x0, y1, x1 in table_boxes:
                    try:
                        table_img = cv_image[y0:y1, x0:x1]
                        table_results = self.ocr_reader.readtext(table_img)
                        table_content = [res[1] for res in table_results if len(res) >= 2]

                        page_content.append({
                            "type": "table",
                            "content": table_content,
                            "position": [x0, y0, x1, y1]
                        })
                    except Exception as e:
                        print(f"Table processing failed: {str(e)}")

                extracted_data.append({
                    "page_number": page_num,
                    "content": page_content,
                    "source": os.path.basename(pdf_path)
                })

            return extracted_data

        except Exception as e:
            raise Exception(f"PDF processing failed: {str(e)}")

    def _process_docx(self, docx_path: str) -> List[Dict[str, Any]]:
        """Process DOCX files with table support"""
        try:
            doc = Document(docx_path)
            content = []
            for para in doc.paragraphs:
                if para.text.strip():
                    content.append({
                        "type": "text",
                        "content": para.text
                    })
            for table in doc.tables:
                table_data = []
                for row in table.rows:
                    row_data = [cell.text for cell in row.cells]
                    table_data.append(row_data)
                content.append({
                    "type": "table",
                    "content": table_data
                })

            return [{
                "page_number": 1,
                "content": content,
                "source": os.path.basename(docx_path)
            }]
        except Exception as e:
            print(f"DOCX processing error: {str(e)}")
            return []

    def _process_txt(self, txt_path: str) -> List[Dict[str, Any]]:
        """Process plain text files"""
        try:
            with open(txt_path, 'r', encoding='utf-8') as f:
                return [{
                    "page_number": 1,
                    "content": [{"type": "text", "content": f.read()}],
                    "source": os.path.basename(txt_path)
                }]
        except Exception as e:
            print(f"TXT processing error: {str(e)}")
            return []
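
Every branch of `DocumentProcessor` returns the same record shape: a list of pages, each with `page_number`, `source`, and a `content` list of typed blocks. The sketch below uses hand-written records (not real parser output) to show how downstream code can walk that structure. The helper name `count_blocks` is hypothetical.

```python
from collections import Counter

# Hand-written records in the shape DocumentProcessor returns
pages = [
    {
        "page_number": 1,
        "source": "report.pdf",
        "content": [
            {"type": "text", "content": "Quarterly summary", "position": []},
            {"type": "table", "content": ["Q1", "120"], "position": [0, 0, 10, 10]},
        ],
    },
    {
        "page_number": 1,
        "source": "notes.txt",
        "content": [{"type": "text", "content": "Plain text body"}],
    },
]


def count_blocks(pages):
    """Count blocks per (source, block type) pair."""
    counts = Counter()
    for page in pages:
        for block in page["content"]:
            counts[(page["source"], block["type"])] += 1
    return dict(counts)


print(count_blocks(pages))
```

Note that PDF tables hold a flat list of OCR strings, while DOCX tables hold a list of rows, so consumers should handle both.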

##  Vector Database Setup
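Splits extracted text into 1,000-character chunks with 200 characters of overlap, embeds each chunk with the configured sentence-transformers model, and stores the vectors in a persistent Chroma collection. Tables are flattened into a single ` | `-separated string before embedding.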


In [None]:
class VectorStoreManager:
    def __init__(self):
        self.embedding_model = HuggingFaceEmbeddings(
            model_name=Config.EMBEDDING_MODEL,
            model_kwargs={'device': Config.DEVICE}
        )
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )

    def create_store(self, documents: List[Dict]) -> Optional[Chroma]:
        if not documents:
            return None

        lc_docs = []
        for doc in documents:
            # Carry provenance through so answers can cite their source
            metadata = {
                "source": doc.get("source", "unknown"),
                "page_number": doc.get("page_number", 0)
            }
            for content in doc['content']:
                if content['type'] == 'text':
                    for chunk in self.text_splitter.split_text(content['content']):
                        lc_docs.append(LangchainDocument(
                            page_content=chunk,
                            metadata=dict(metadata)
                        ))
                elif content['type'] == 'table':
                    # DOCX rows are lists of cells; PDF tables are flat lists of OCR strings
                    table_text = " | ".join(
                        " | ".join(row) if isinstance(row, (list, tuple)) else str(row)
                        for row in content['content']
                    )
                    lc_docs.append(LangchainDocument(
                        page_content=f"TABLE: {table_text}",
                        metadata=dict(metadata)
                    ))

        return Chroma.from_documents(
            documents=lc_docs,
            embedding=self.embedding_model,
            persist_directory=Config.CHROMA_DIR
        ) if lc_docs else None
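
To make the `chunk_size=1000` / `chunk_overlap=200` settings concrete, here is a simplified fixed-window splitter. It is an illustration only: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraphs and sentences, which this sketch does not attempt. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.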

## LLM Initialization
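Mistral-7B LLM wrapper: loads the model in half precision (`float16`) and exposes it to LangChain through a `transformers` text-generation pipeline.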

In [None]:
class MistralLLM:
    def __init__(self):
        try:
            # langchain_huggingface is the maintained home of this class
            from langchain_huggingface import HuggingFacePipeline
            
            self.tokenizer = AutoTokenizer.from_pretrained(Config.MISTRAL_PATH)
            self.model = AutoModelForCausalLM.from_pretrained(
                Config.MISTRAL_PATH,
                torch_dtype=torch.float16,
                device_map="auto"
            )
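            self.pipe = pipeline(
                "text-generation",
                model=self.model,
                tokenizer=self.tokenizer,
                max_new_tokens=512,
                do_sample=True,  # temperature only takes effect when sampling is enabled
                temperature=0.1
                # Device placement is inherited from the already-loaded model
            )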
            self.llm = HuggingFacePipeline(pipeline=self.pipe)
        except Exception as e:
            print(f"LLM initialization failed: {str(e)}")
            self.llm = None

    def query(self, vectordb: Chroma, query: str, k: int = 3) -> str:
        if not self.llm:
            return "LLM not initialized"
            
        retriever = vectordb.as_retriever(search_kwargs={"k": k})
        qa = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=retriever,
            return_source_documents=True
        )
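        result = qa.invoke({"query": query})
        # Use .get() so chunks stored without a "source" key do not raise KeyError
        sources = sorted({d.metadata.get("source", "unknown") for d in result["source_documents"]})
        return f"Answer: {result['result']}\nSources: {sources}"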

## Prompt Template
Structured prompt that instructs the model to:
- Synthesize information across the retrieved sources
- Present both sides when sources conflict
- Use bullet points for complex answers

In [None]:
RAG_PROMPT_TEMPLATE = """As an expert document analyst with access to multiple sources, your task is to provide the most accurate, well-structured answer to the user's question:

1. Context Analysis:
{context}

2. Question:
{question}

3. Answer Generation Rules:
- Synthesize information from all relevant context
- Acknowledge when information is incomplete
- Break down complex answers into bullet points when helpful
- Highlight key statistics, names, and dates
- If conflicting information exists, present both sides with sources
- For technical queries, provide detailed explanations
- For summary requests, include all major points

4. Required Answer Format:
[Start of Answer]
### Comprehensive Response:
[Your synthesized answer here]
only include relevant sources
[End of Answer]"""

# Wrap the template so LangChain can fill in {context} and {question}
rag_prompt = PromptTemplate(
    template=RAG_PROMPT_TEMPLATE,
    input_variables=["context", "question"]
)


def create_qa_chain(llm, vectordb, k=3):
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(search_kwargs={"k": k}),
        return_source_documents=True,  
        chain_type_kwargs={
            "prompt": rag_prompt, 
            "document_prompt": PromptTemplate(
                input_variables=["page_content"],
                template="{page_content}" 
            )
        }
    )
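
With `chain_type="stuff"`, the retrieved chunks are concatenated and substituted into `{context}` in a single prompt. The sketch below mimics that substitution with plain string formatting. The shortened template, the `"\n\n"` separator, and the helper name `build_prompt` are assumptions made for illustration, not LangChain internals.

```python
# Shortened stand-in for RAG_PROMPT_TEMPLATE (assumption, for readability)
TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}"


def build_prompt(chunks, question, separator="\n\n"):
    """Join retrieved chunks and fill the two template slots."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt = build_prompt(["Chunk A.", "Chunk B."], "What is covered?")
print(prompt)
```

Because all `k` chunks land in one prompt, `k` multiplied by the chunk size has to fit within the model's context window alongside the instructions and the generated answer.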

## Enhanced Query Handling
Extends base LLM with:
- Custom retrieval-augmented generation
- Error handling
- Simplified response format

In [None]:
class ExtendedMistralLLM(MistralLLM):
    def query(self, vectordb: Chroma, query: str, k: int = 3) -> str:
        if not self.llm:
            return "LLM not initialized"
            
        qa_chain = create_qa_chain(self.llm, vectordb, k)
        try:
            result = qa_chain.invoke({"query": query})
            return result["result"]
        except Exception as e:
            return f"Error processing query: {str(e)}"
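
The prompt asks the model to wrap its reply in `[Start of Answer]` and `[End of Answer]` markers. A small post-processing step could strip them before display. This is a hypothetical helper, not part of the original notebook, and it assumes the model follows the requested format at least loosely.

```python
START = "[Start of Answer]"
END = "[End of Answer]"
HEADER = "### Comprehensive Response:"


def extract_answer(raw: str) -> str:
    """Return the text between the answer markers, or the raw text if absent."""
    # Use the last start marker in case the prompt itself is echoed back
    start = raw.rfind(START)
    body = raw[start + len(START):] if start != -1 else raw
    end = body.find(END)
    if end != -1:
        body = body[:end]
    return body.replace(HEADER, "", 1).strip()


raw_output = (
    "noise [Start of Answer]\n### Comprehensive Response:\n"
    "Paris.\n[End of Answer] trailing"
)
print(extract_answer(raw_output))
```

The helper degrades gracefully: output without markers is returned unchanged apart from whitespace trimming, and a missing end marker (for example when generation hits `max_new_tokens`) keeps everything after the start marker.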

## Execution Pipeline
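End-to-end document processing flow:

1. **Input:** List of document paths (capped at `Config.MAX_DOCUMENTS`)
2. **Processing:**
   - Per-file size validation against `Config.MAX_DOC_SIZE`
   - Sequential parsing of each file
   - Vector store creation from the parsed pages
3. **Query Loop:**
   - Interactive question answering until the user types `exit`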

In [None]:
def main():
    processor = DocumentProcessor()
    vs_manager = VectorStoreManager()
    llm = ExtendedMistralLLM()

    document_paths = [
        "/kaggle/input/daafile/End Sem DAA.pdf",
        "/kaggle/input/txtfile/SampleTextFile_1000kb.txt",
        "/kaggle/input/womensafetycasestudy/Women Safety_v3.pdf",
        "/kaggle/input/wordfile/uhvprojectcopy.docx",
    ][:Config.MAX_DOCUMENTS]

    all_docs = []
    for path in document_paths:
        if not os.path.exists(path):
            print(f"Skipping {path} - file not found")
            continue
        if os.path.getsize(path) > Config.MAX_DOC_SIZE:
            print(f"Skipping {path} - exceeds size limit")
            continue
            
        docs = processor.process_file(path)
        if docs:
            all_docs.extend(docs)
            print(f"Processed {path}")
        else:
            print(f"Failed to process {path}")

    if all_docs:
        vectordb = vs_manager.create_store(all_docs)
        if vectordb:
            print("Vector store created")
            while True:
                query = input("\nEnter question (or 'exit'): ")
                if query.strip().lower() == 'exit':
                    break
                print(llm.query(vectordb, query))
        else:
            print("Failed to create vector store")
    else:
        print("No documents processed")
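
The validation step can also be factored into a standalone pre-flight check that reports why each file was skipped. The sketch below is one possible arrangement, with constants mirroring `Config`. Unlike `main()`, it applies the document cap to accepted files rather than slicing the input list first. The helper name `select_documents` is hypothetical.

```python
import os
import tempfile

SUPPORTED_EXTS = {".pdf", ".docx", ".txt"}
MAX_DOC_SIZE = 100 * 1024 * 1024  # 100 MB in bytes
MAX_DOCUMENTS = 10


def select_documents(paths, max_size=MAX_DOC_SIZE, max_docs=MAX_DOCUMENTS):
    """Split paths into accepted files and a {path: reason} map of rejects."""
    accepted, rejected = [], {}
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        if not os.path.isfile(path):
            rejected[path] = "missing"
        elif ext not in SUPPORTED_EXTS:
            rejected[path] = "unsupported type"
        elif os.path.getsize(path) > max_size:
            rejected[path] = "too large"
        elif len(accepted) >= max_docs:
            rejected[path] = "over document limit"
        else:
            accepted.append(path)
    return accepted, rejected


# Demo on throwaway files, with a tiny size limit to trigger a rejection
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "note.txt")
    big = os.path.join(tmp, "big.txt")
    with open(small, "w", encoding="utf-8") as f:
        f.write("hello")
    with open(big, "w", encoding="utf-8") as f:
        f.write("x" * 64)
    accepted, rejected = select_documents(
        [small, big, os.path.join(tmp, "gone.pdf")], max_size=16
    )
    accepted_names = [os.path.basename(p) for p in accepted]
    reasons = {os.path.basename(p): r for p, r in rejected.items()}

print(accepted_names, reasons)
```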

## Memory Optimization
Frees memory between runs: releases the unused CUDA cache, runs Python garbage collection, and reports current GPU usage.

In [None]:
import torch
import gc
from IPython.display import clear_output

def clear_memory():
    """Comprehensive memory cleanup for Kaggle notebooks"""
    try:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
            print("Cleared CUDA cache")
        gc.collect()
        print("Ran garbage collection")
        clear_output(wait=True)
        if torch.cuda.is_available():
            print(f"Current GPU memory usage: {torch.cuda.memory_allocated()/1024**2:.2f}MB / {torch.cuda.memory_reserved()/1024**2:.2f}MB")
        else:
            print("Memory cleared (CPU-only mode)")
            
    except Exception as e:
        print(f"Error clearing memory: {str(e)}")
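
Cleanup can only reclaim objects that nothing references any more, so a model bound to a live variable is likely to stay in memory even after `clear_memory()` runs. The sketch below illustrates this with a plain Python stand-in. `FakeModel` is hypothetical and holds no real weights.

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a large model object."""

    def __init__(self):
        # Self-reference creates a cycle, so only the garbage collector can free it
        self.me = self


model = FakeModel()
probe = weakref.ref(model)  # lets us observe the object without keeping it alive

gc.collect()
still_alive_before = probe() is not None  # the name `model` still references it

del model
gc.collect()
freed_after = probe() is None  # the collector has reclaimed the cycle

print(still_alive_before, freed_after)
```

In practice this suggests dropping references to the model and pipeline (for example with `del`) before calling `clear_memory()`.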

## Run the Pipeline
Runs `main()` after suppressing noisy warnings from:
- Model loading
- CUDA operations

In [None]:
if __name__ == "__main__":
    import warnings
    warnings.filterwarnings("ignore", category=UserWarning, module="torch.nn.modules.module")
    warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")
    main()

Final cleanup after use:

In [None]:
if __name__ == "__main__":
    clear_memory()