<a href="https://colab.research.google.com/github/Vishnupriya-Selvraj/Agentic_AI_Workshop/blob/main/Day-3/RAG%20System%20AI%20Research%20Papers%20Question%20Answering/RAG_System_AI_Research_Papers_Q%26A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cell 1: Installation and Setup**

**Installs core dependencies:**

In [1]:
!pip install langchain langchain-community langchain-huggingface
!pip install sentence-transformers faiss-cpu
!pip install pypdf2 PyPDF2
!pip install transformers torch
!pip install gradio
!pip install accelerate

print("✅ All packages installed successfully!")

✅ All packages installed successfully!


**Key Packages:**

*  langchain - Framework for building RAG
systems

*   sentence-transformers - For generating text embeddings

*   faiss-cpu - Efficient vector similarity search

*   PyPDF2 - PDF text extraction

*   transformers - Language models for answer generation

*   gradio - Web interface builder

# **Cell 2: Import Libraries**
**Loads all required modules:**

In [2]:
import os
import warnings
warnings.filterwarnings('ignore')

import torch
from typing import List, Dict, Any
import gradio as gr
from pathlib import Path

# LangChain imports
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document

# Transformers
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

print("✅ All libraries imported successfully!")
print(f"🔥 GPU Available: {torch.cuda.is_available()}")

✅ All libraries imported successfully!
🔥 GPU Available: True


**Critical Components:**

*   PDF processing tools (PyPDFLoader)

*   Embedding models (HuggingFaceEmbeddings)

*   Vector database (FAISS)

*   GPU check (torch.cuda.is_available())

# **Cell 3: Document Processor**
**Handles PDF loading and preprocessing:**

In [3]:
class DocumentProcessor:
    """Handles loading and processing of PDF documents with better chunking"""

    def __init__(self, chunk_size=500, chunk_overlap=100):  # Smaller chunks for better precision
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", ". ", " ", ""]  # Better separators
        )

    def load_pdfs(self, pdf_paths: List[str]) -> List[Document]:
        """Load multiple PDF files and return documents with better preprocessing"""
        all_documents = []

        for pdf_path in pdf_paths:
            try:
                print(f"📖 Loading: {os.path.basename(pdf_path)}")
                loader = PyPDFLoader(pdf_path)
                documents = loader.load()

                # Clean and preprocess documents
                for i, doc in enumerate(documents):
                    # Clean the text content
                    cleaned_content = self._clean_text(doc.page_content)
                    doc.page_content = cleaned_content

                    # Add comprehensive metadata
                    doc.metadata.update({
                        'source_file': os.path.basename(pdf_path),
                        'page_number': i + 1,
                        'total_pages': len(documents),
                        'char_count': len(cleaned_content)
                    })

                all_documents.extend(documents)
                print(f"✅ Loaded {len(documents)} pages from {os.path.basename(pdf_path)}")

            except Exception as e:
                print(f"❌ Error loading {pdf_path}: {str(e)}")

        return all_documents

    def _clean_text(self, text: str) -> str:
        """Clean and normalize text content"""
        import re

        # Remove excessive whitespace
        text = re.sub(r'\s+', ' ', text)

        # Remove page numbers and headers/footers (simple heuristic)
        lines = text.split('\n')
        cleaned_lines = []

        for line in lines:
            line = line.strip()
            # Skip very short lines that might be headers/footers
            if len(line) > 10 and not re.match(r'^\d+$', line):
                cleaned_lines.append(line)

        return ' '.join(cleaned_lines)

    def create_chunks(self, documents: List[Document]) -> List[Document]:
        """Split documents into chunks with better metadata"""
        print("🔪 Creating text chunks...")

        chunks = self.text_splitter.split_documents(documents)

        # Add enhanced chunk metadata
        for i, chunk in enumerate(chunks):
            chunk.metadata.update({
                'chunk_id': i,
                'chunk_size': len(chunk.page_content),
                'word_count': len(chunk.page_content.split()),
                'processing_timestamp': str(pd.Timestamp.now())
            })

        print(f"✅ Created {len(chunks)} chunks")
        return chunks


**Key Features:**

*   Chunk size customization (default: 1000 chars)

*   Metadata preservation (source file, page numbers)

*   Smart text splitting at natural boundaries

# **Cell 4: Vector Store Manager**
**Manages document embeddings:**

In [4]:
class VectorStoreManager:
    """Manages document embeddings with better similarity search"""

    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.embeddings = HuggingFaceEmbeddings(
            model_name=model_name,
            model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'},
            encode_kwargs={
                'normalize_embeddings': True,
                'batch_size': 16  # Better batch size for processing
            }
        )
        self.vector_store = None

    def create_vector_store(self, chunks: List[Document]) -> bool:
        """Create FAISS vector store with better configuration"""
        try:
            print("⚡ Creating vector embeddings...")
            print(f"📊 Processing {len(chunks)} chunks...")

            # Filter out very short chunks that might not be meaningful
            filtered_chunks = [chunk for chunk in chunks if len(chunk.page_content.strip()) > 20]

            print(f"📊 Using {len(filtered_chunks)} meaningful chunks...")

            # Create embeddings and vector store
            self.vector_store = FAISS.from_documents(
                documents=filtered_chunks,
                embedding=self.embeddings
            )

            print(f"✅ Vector store created successfully!")
            print(f"📈 Vector dimension: {self.vector_store.index.d}")
            print(f"📚 Total vectors: {self.vector_store.index.ntotal}")

            return True

        except Exception as e:
            print(f"❌ Error creating vector store: {str(e)}")
            return False

    def search_similar(self, query: str, k: int = 5) -> List[tuple]:
        """Search for similar documents with better scoring"""
        if not self.vector_store:
            return []

        try:
            # Use MMR (Maximum Marginal Relevance) for better diversity
            try:
                results = self.vector_store.max_marginal_relevance_search_with_score(
                    query, k=k, fetch_k=k*2
                )
            except:
                # Fallback to regular similarity search
                results = self.vector_store.similarity_search_with_score(query, k=k)

            print(f"🔍 Found {len(results)} relevant chunks for query")

            # Filter results by minimum similarity threshold
            filtered_results = []
            for doc, score in results:
                # Convert distance to similarity and filter low scores
                similarity = 1 - score
                if similarity > 0.1:  # Minimum threshold
                    filtered_results.append((doc, score))

            return filtered_results[:k]

        except Exception as e:
            print(f"❌ Search error: {str(e)}")
            return []

    def get_stats(self) -> Dict[str, Any]:
        """Get vector store statistics"""
        if not self.vector_store:
            return {"status": "not_created"}

        return {
            "status": "ready",
            "total_vectors": self.vector_store.index.ntotal,
            "vector_dimension": self.vector_store.index.d,
            "model_name": self.model_name
        }

**Technical Details:**

*   Uses **`all-MiniLM-L6-v2`** embedding model

*   Automatic GPU/CPU switching

*   Returns similarity scores with results

# **Cell 5: Answer Generator**
**Produces human-readable answers:**

In [14]:
class AnswerGenerator:
    """Generates answers with significantly improved extraction logic"""

    def __init__(self):
        self.generator = None
        self.model_name = "microsoft/DialoGPT-medium"
        self.setup_model()

    def setup_model(self):
        """Initialize the text generation model"""
        try:
            print("🤖 Loading language model...")
            from transformers import pipeline
            self.generator = pipeline(
                "text-generation",
                model="microsoft/DialoGPT-medium",
                tokenizer="microsoft/DialoGPT-medium",
                max_new_tokens=100,
                temperature=0.1,
                do_sample=True,
                pad_token_id=50256,
                device=0 if torch.cuda.is_available() else -1,
                truncation=True
            )
            print("✅ Language model loaded!")
        except Exception as e:
            print(f"❌ Error loading model: {str(e)}")
            self.generator = None

    def generate_answer(self, question: str, context: str, max_length: int = 200) -> str:
        """Generate answer with advanced extraction techniques"""

        # First, try definition extraction for "what is" questions
        if self._is_definition_question(question):
            definition_answer = self._extract_definition(question, context)
            if definition_answer and len(definition_answer) > 20:
                return definition_answer

        # Try advanced extractive method
        extractive_answer = self._advanced_extractive_answer(question, context)
        if extractive_answer and len(extractive_answer) > 20:
            return extractive_answer

        # Try generative approach as fallback
        if self.generator:
            generative_answer = self._generate_with_model(question, context)
            if generative_answer and len(generative_answer) > 20:
                return generative_answer

        # Final fallback
        return self._simple_context_answer(context)

    def _is_definition_question(self, question: str) -> bool:
        """Check if this is a definition question"""
        definition_indicators = ['what is', 'define', 'definition of', 'what are', 'meaning of']
        question_lower = question.lower()
        return any(indicator in question_lower for indicator in definition_indicators)

    def _extract_definition(self, question: str, context: str) -> str:
        """Extract definition-style answers"""
        # Get the main term being defined
        question_lower = question.lower()

        # Extract the key term
        if 'what is' in question_lower:
            term = question_lower.split('what is')[-1].strip('? .')
        elif 'define' in question_lower:
            term = question_lower.split('define')[-1].strip('? .')
        else:
            term = question_lower.replace('what are', '').replace('definition of', '').strip('? .')

        # Clean up the term
        term = term.split()[0] if term.split() else 'agent'

        # Split context into sentences
        sentences = [s.strip() for s in context.replace('\n', ' ').split('.') if len(s.strip()) > 15]

        # Look for definition patterns
        definition_patterns = [
            f"{term} is",
            f"{term} are",
            f"an {term} is",
            f"the {term} is",
            f"{term} can be defined",
            f"{term} refers to",
            f"{term} means",
            "agent is",
            "agents are",
            "an agent is",
            "the agent is"
        ]

        best_sentences = []

        for sentence in sentences:
            sentence_lower = sentence.lower()

            # Check for definition patterns
            for pattern in definition_patterns:
                if pattern in sentence_lower:
                    # This looks like a definition
                    best_sentences.append((10, sentence))  # High priority
                    break
            else:
                # Check for descriptive content about the term
                if term in sentence_lower and len(sentence) > 30:
                    # Count relevant keywords
                    relevance_keywords = ['intelligence', 'autonomous', 'system', 'behavior', 'environment', 'action', 'goal', 'artificial']
                    keyword_count = sum(1 for kw in relevance_keywords if kw in sentence_lower)
                    if keyword_count > 0:
                        best_sentences.append((keyword_count, sentence))

        # Sort by relevance score
        best_sentences.sort(reverse=True, key=lambda x: x[0])

        if best_sentences:
            # Take the best 2-3 sentences
            selected_sentences = [sent[1] for sent in best_sentences[:3]]
            answer = '. '.join(selected_sentences)
            return self._clean_answer(answer)

        return ""

    def _advanced_extractive_answer(self, question: str, context: str) -> str:
        """Advanced extractive answer with better sentence selection"""

        # Split into sentences and filter
        sentences = []
        for sent in context.replace('\n', ' ').split('.'):
            sent = sent.strip()
            if len(sent) > 20 and not self._is_reference_sentence(sent):
                sentences.append(sent)

        if not sentences:
            return ""

        # Get question keywords
        question_words = set(question.lower().split())
        stop_words = {'what', 'is', 'are', 'how', 'why', 'when', 'where', 'who', 'which', 'the', 'a', 'an', 'of', 'in', 'to', 'for', 'and', 'or'}
        key_words = question_words - stop_words

        # Score sentences
        scored_sentences = []

        for i, sentence in enumerate(sentences):
            sentence_lower = sentence.lower()
            sentence_words = set(sentence_lower.split())

            # Calculate various scores
            keyword_overlap = len(key_words.intersection(sentence_words))
            position_score = 1.0 / (i + 1)  # Earlier sentences preferred
            length_score = min(len(sentence) / 100, 1.0)  # Moderate length preferred

            # Bonus for informative content
            info_bonus = 0
            info_words = ['intelligence', 'system', 'autonomous', 'behavior', 'environment', 'action', 'goal', 'artificial', 'capability', 'function', 'perform', 'task']
            info_bonus = sum(0.5 for word in info_words if word in sentence_lower)

            # Penalty for references and citations
            citation_penalty = 0
            if any(indicator in sentence_lower for indicator in ['et al', 'pp.', 'vol.', '[', ']', 'ibid', 'op. cit']):
                citation_penalty = -2

            total_score = keyword_overlap * 3 + position_score + length_score + info_bonus + citation_penalty

            if keyword_overlap > 0 or info_bonus > 0:  # Must have some relevance
                scored_sentences.append((total_score, sentence, i))

        # Sort by score
        scored_sentences.sort(reverse=True, key=lambda x: x[0])

        if scored_sentences:
            # Take top sentences
            top_sentences = [sent[1] for sent in scored_sentences[:2]]
            answer = '. '.join(top_sentences)
            return self._clean_answer(answer)

        return ""

    def _is_reference_sentence(self, sentence: str) -> bool:
        """Check if sentence is likely a reference or citation"""
        ref_indicators = [
            'et al', 'pp.', 'vol.', 'ibid', 'op. cit',
            '].', '[1', '[2', '[3', '[4', '[5',
            'springer', 'elsevier', 'acm', 'ieee',
            'proceedings', 'conference', 'journal'
        ]
        sentence_lower = sentence.lower()
        return any(indicator in sentence_lower for indicator in ref_indicators)

    def _generate_with_model(self, question: str, context: str) -> str:
        """Generate answer using the language model"""
        try:
            prompt = f"""Answer the question based on the context provided. Be concise and accurate.

Context: {context[:800]}

Question: {question}

Answer:"""

            response = self.generator(
                prompt,
                max_new_tokens=80,
                num_return_sequences=1,
                temperature=0.1,
                do_sample=True,
                truncation=True
            )

            full_response = response[0]['generated_text']
            if "Answer:" in full_response:
                answer = full_response.split("Answer:")[-1].strip()
                return self._clean_answer(answer)

        except Exception as e:
            print(f"⚠️ Generation error: {str(e)}")

        return ""

    def _simple_context_answer(self, context: str) -> str:
        """Simple fallback to return most relevant context"""
        sentences = [s.strip() for s in context.split('.') if len(s.strip()) > 30]

        # Filter out reference sentences
        content_sentences = [s for s in sentences if not self._is_reference_sentence(s)]

        if content_sentences:
            return content_sentences[0] + "."
        elif sentences:
            return sentences[0] + "."
        else:
            return "The relevant information is present in the document but requires more specific context to provide a clear answer."

    def _clean_answer(self, answer: str) -> str:
        """Clean and format the answer"""
        if not answer:
            return "No specific answer found in the provided context."

        # Remove extra whitespace
        answer = ' '.join(answer.split())

        # Remove common artifacts
        answer = answer.replace('\\n', ' ')

        # Ensure proper ending
        if answer and not answer.endswith(('.', '!', '?')):
            answer += '.'

        # Limit length but try to end at sentence boundary
        if len(answer) > 400:
            sentences = answer.split('.')
            if len(sentences) > 1:
                # Take sentences that fit within limit
                result = ""
                for sent in sentences:
                    if len(result + sent + ".") <= 400:
                        result += sent + "."
                    else:
                        break
                answer = result if result else answer[:397] + "..."
            else:
                answer = answer[:397] + "..."

        return answer

**Generation Pipeline:**

*   Creates LLM prompt with question + context

*   Uses DialoGPT-medium for generation

*   Fallback to extractive QA if generation fails

# **Cell 6: RAG System**
**Main orchestrator class:**

In [15]:
class RAGSystem:
    """Enhanced RAG system with multi-strategy approach"""

    def __init__(self):
        self.doc_processor = DocumentProcessor(chunk_size=400, chunk_overlap=50)  # Smaller chunks
        self.vector_manager = VectorStoreManager()
        self.answer_generator = AnswerGenerator()
        self.is_ready = False
        self.documents_info = {}

    def setup(self, pdf_paths: List[str]) -> bool:
        """Setup with enhanced processing"""
        print("🚀 Setting up Enhanced RAG system...")
        print("="*50)

        try:
            print("Step 1: Loading PDFs...")
            documents = self.doc_processor.load_pdfs(pdf_paths)

            if not documents:
                print("❌ No documents loaded!")
                return False

            print("\nStep 2: Creating text chunks...")
            chunks = self.doc_processor.create_chunks(documents)

            if not chunks:
                print("❌ No chunks created!")
                return False

            print("\nStep 3: Creating vector embeddings...")
            if not self.vector_manager.create_vector_store(chunks):
                return False

            self.documents_info = {
                'total_documents': len(documents),
                'total_chunks': len(chunks),
                'pdf_files': [os.path.basename(path) for path in pdf_paths],
                'setup_complete': True
            }

            self.is_ready = True
            print("\n" + "="*50)
            print("🎉 Enhanced RAG System Ready!")
            print(f"📚 Loaded: {len(documents)} pages from {len(pdf_paths)} PDFs")
            print(f"🔪 Created: {len(chunks)} text chunks")
            print(f"⚡ Vector store ready with {self.vector_manager.vector_store.index.ntotal} embeddings")
            print("="*50)

            return True

        except Exception as e:
            print(f"❌ Setup failed: {str(e)}")
            return False

    def ask_question(self, question: str, num_sources: int = 6) -> Dict[str, Any]:
        """Enhanced question answering with multiple strategies"""

        if not self.is_ready:
            return {
                'answer': "❌ System not ready. Please upload and process documents first.",
                'sources': [],
                'confidence': 0.0
            }

        if not question.strip():
            return {
                'answer': "❌ Please provide a valid question.",
                'sources': [],
                'confidence': 0.0
            }

        try:
            print(f"🔍 Processing question: '{question}'")

            # Use multiple search strategies
            search_results = self._multi_strategy_search(question, num_sources)

            if not search_results:
                return {
                    'answer': "❌ No relevant information found. Try rephrasing your question or check if the topic is covered in the uploaded documents.",
                    'sources': [],
                    'confidence': 0.0
                }

            # Process results
            context_parts = []
            sources = []
            similarity_scores = []

            for doc, score in search_results:
                similarity = max(0, 1 - score)  # Ensure non-negative
                similarity_scores.append(similarity)
                context_parts.append(doc.page_content)

                source_info = {
                    'file': doc.metadata.get('source_file', 'Unknown'),
                    'page': doc.metadata.get('page_number', 'Unknown'),
                    'chunk_id': doc.metadata.get('chunk_id', 'Unknown'),
                    'similarity_score': round(similarity, 3),
                    'content_preview': doc.page_content[:200] + "..." if len(doc.page_content) > 200 else doc.page_content
                }
                sources.append(source_info)

            # Combine context more intelligently
            combined_context = self._smart_context_combination(context_parts, question)

            print("🤖 Generating answer...")
            answer = self.answer_generator.generate_answer(question, combined_context)

            # Better confidence calculation
            if similarity_scores:
                # Use weighted average with higher weight for top results
                weights = [1.0 / (i + 1) for i in range(len(similarity_scores))]
                weighted_sum = sum(score * weight for score, weight in zip(similarity_scores, weights))
                total_weight = sum(weights)
                confidence = weighted_sum / total_weight
                confidence = max(0.0, min(1.0, confidence))
            else:
                confidence = 0.0

            result = {
                'answer': answer,
                'sources': sources,
                'confidence': round(confidence, 3),
                'context_length': len(combined_context),
                'num_sources_used': len(sources)
            }

            print(f"✅ Answer generated (confidence: {result['confidence']})")
            return result

        except Exception as e:
            print(f"❌ Error processing question: {str(e)}")
            return {
                'answer': f"❌ Error processing your question: {str(e)}",
                'sources': [],
                'confidence': 0.0
            }

    def _multi_strategy_search(self, question: str, k: int) -> List[tuple]:
        """Use multiple search strategies to find relevant content"""

        all_results = []

        # Strategy 1: Direct question search
        results1 = self.vector_manager.search_similar(question, k=k//2)
        all_results.extend(results1)

        # Strategy 2: Enhanced query search
        enhanced_query = self._enhance_query_advanced(question)
        if enhanced_query != question:
            results2 = self.vector_manager.search_similar(enhanced_query, k=k//2)
            all_results.extend(results2)

        # Strategy 3: Keyword-based search for definition questions
        if any(word in question.lower() for word in ['what is', 'define', 'definition']):
            keyword_query = self._extract_definition_keywords(question)
            results3 = self.vector_manager.search_similar(keyword_query, k=k//3)
            all_results.extend(results3)

        # Remove duplicates and sort by score
        seen_chunks = set()
        unique_results = []

        for doc, score in all_results:
            chunk_id = doc.metadata.get('chunk_id', id(doc.page_content))
            if chunk_id not in seen_chunks:
                seen_chunks.add(chunk_id)
                unique_results.append((doc, score))

        # Sort by similarity score (lower distance = higher similarity)
        unique_results.sort(key=lambda x: x[1])

        return unique_results[:k]

    def _enhance_query_advanced(self, question: str) -> str:
        """Advanced query enhancement"""
        enhanced = question.lower()

        # Add context-specific terms
        if 'agent' in enhanced:
            enhanced += " artificial intelligence autonomous system behavior environment goal"
        elif any(word in enhanced for word in ['definition', 'define', 'what is']):
            enhanced += " definition concept meaning explanation"
        elif 'intelligence' in enhanced:
            enhanced += " AI artificial cognitive reasoning learning"

        return enhanced

    def _extract_definition_keywords(self, question: str) -> str:
        """Extract key terms for definition searches"""
        question_lower = question.lower()

        # Extract the main term being defined
        if 'what is' in question_lower:
            term = question_lower.split('what is')[-1].strip('? .')
        elif 'define' in question_lower:
            term = question_lower.split('define')[-1].strip('? .')
        else:
            words = question.split()
            term = ' '.join([w for w in words if w.lower() not in ['what', 'is', 'an', 'a', 'the']])

        # Add related terms
        if 'agent' in term:
            return f"{term} artificial intelligence autonomous system rational actor"
        else:
            return term

    def _smart_context_combination(self, context_parts: List[str], question: str) -> str:
        """Intelligently combine context parts"""
        if not context_parts:
            return ""

        # Score context parts by relevance to question
        question_words = set(question.lower().split())
        scored_contexts = []

        for context in context_parts:
            context_words = set(context.lower().split())
            overlap = len(question_words.intersection(context_words))

            # Bonus for definition-like content
            definition_bonus = 0
            if any(phrase in context.lower() for phrase in ['is a', 'are a', 'refers to', 'defined as', 'means']):
                definition_bonus = 2

            score = overlap + definition_bonus
            scored_contexts.append((score, context))

        # Sort by relevance
        scored_contexts.sort(reverse=True, key=lambda x: x[0])

        # Combine top contexts
        combined = "\n\n".join([context for _, context in scored_contexts])

        # Limit total length
        if len(combined) > 1500:
            combined = combined[:1500] + "..."

        return combined

    def get_system_status(self) -> Dict[str, Any]:
        """Get system status"""
        base_status = {
            'is_ready': self.is_ready,
            'documents_info': self.documents_info,
            'vector_store_stats': self.vector_manager.get_stats()
        }

        if self.is_ready:
            base_status['model_info'] = {
                'embedding_model': self.vector_manager.model_name,
                'generation_model': self.answer_generator.model_name,
                'gpu_available': torch.cuda.is_available()
            }

        return base_status


**Workflow:**
PDFs → Chunks → Embeddings → Vector Store → Query → Answer

# **Cell 7: File Helpers**
**Colab-specific utilities:**

In [16]:
import pandas as pd
from google.colab import files
import shutil

def upload_pdf_files():
    """Helper function to upload PDF files in Colab"""
    print("📤 Please upload your PDF files...")
    uploaded = files.upload()

    pdf_paths = []
    for filename, data in uploaded.items():
        if filename.endswith('.pdf'):
            # Save file to current directory
            with open(filename, 'wb') as f:
                f.write(data)
            pdf_paths.append(filename)
            print(f"✅ Saved: {filename}")
        else:
            print(f"⚠️ Skipped non-PDF file: {filename}")

    return pdf_paths

def setup_with_sample_data():
    """Setup system with sample data for demonstration"""
    print("🎯 Setting up with sample data...")
    print("Note: In real usage, upload your own PDF files using upload_pdf_files()")

    # Create a sample document for demonstration
    sample_content = """
    This is a sample AI research paper about Natural Language Processing.

    Abstract: This paper presents a novel approach to question answering systems
    using retrieval-augmented generation. Our method combines semantic search
    with large language models to provide accurate and contextual answers.

    Introduction: Question answering has been a fundamental challenge in NLP.
    Recent advances in transformer models have shown promising results.

    Methodology: We propose a hybrid approach that uses vector embeddings
    for document retrieval and generative models for answer synthesis.

    Results: Our system achieves 85% accuracy on benchmark datasets,
    outperforming traditional keyword-based approaches by 20%.

    Conclusion: The combination of retrieval and generation provides
    superior performance for domain-specific question answering tasks.
    """

    # Create sample PDF content (simplified)
    from langchain.schema import Document

    sample_doc = Document(
        page_content=sample_content,
        metadata={
            'source_file': 'sample_paper.pdf',
            'page_number': 1,
            'total_pages': 1
        }
    )

    # Process sample document
    chunks = rag_system.doc_processor.create_chunks([sample_doc])

    # Create vector store
    if rag_system.vector_manager.create_vector_store(chunks):
        rag_system.is_ready = True
        rag_system.documents_info = {
            'total_documents': 1,
            'total_chunks': len(chunks),
            'pdf_files': ['sample_paper.pdf'],
            'setup_complete': True
        }
        print("✅ Sample system ready for testing!")
        return True

    return False


**Usage Tips:**

*   Drag-and-drop PDF upload support

*   Sample data for quick testing

# **Cell 8: Gradio UI**
**Builds interactive interface:**

In [17]:
def create_gradio_interface():
    """Create an interactive web interface"""

    def handle_file_upload(files):
        if not files:
            return "❌ Please upload at least one PDF file."

        try:
            pdf_paths = []
            for file in files:
                if file.name.endswith('.pdf'):
                    pdf_paths.append(file.name)

            if not pdf_paths:
                return "❌ No valid PDF files found."

            # Setup the RAG system
            success = rag_system.setup(pdf_paths)

            if success:
                status = rag_system.get_system_status()
                return f"""✅ Setup Complete!

📊 System Status:
• Documents loaded: {status['documents_info']['total_documents']}
• Text chunks created: {status['documents_info']['total_chunks']}
• Files processed: {', '.join(status['documents_info']['pdf_files'])}
• Vector embeddings: {status['vector_store_stats']['total_vectors']}

🎯 Ready to answer questions!"""
            else:
                return "❌ Setup failed. Please check your PDF files and try again."

        except Exception as e:
            return f"❌ Error during setup: {str(e)}"

    def handle_question(question):
        if not question.strip():
            return "❌ Please enter a question.", ""

        result = rag_system.ask_question(question)

        # Format answer
        answer_text = f"**Answer:** {result['answer']}\n\n"
        answer_text += f"**Confidence:** {result['confidence']}/1.0\n\n"

        # Format sources
        sources_text = "**Sources:**\n\n"
        for i, source in enumerate(result['sources'], 1):
            sources_text += f"**{i}. {source['file']}** (Page {source['page']})\n"
            sources_text += f"   • Similarity: {source['similarity_score']}\n"
            sources_text += f"   • Preview: {source['content_preview']}\n\n"

        if not result['sources']:
            sources_text = "No sources found."

        return answer_text, sources_text

    def show_system_info():
        status = rag_system.get_system_status()

        if not status['is_ready']:
            return "System not ready. Please upload documents first."

        info_text = f"""**System Information:**

**Status:** {'✅ Ready' if status['is_ready'] else '❌ Not Ready'}

**Documents:**
• Total documents: {status['documents_info']['total_documents']}
• Total chunks: {status['documents_info']['total_chunks']}
• Files: {', '.join(status['documents_info']['pdf_files'])}

**Vector Store:**
• Total vectors: {status['vector_store_stats']['total_vectors']}
• Vector dimension: {status['vector_store_stats']['vector_dimension']}
• Embedding model: {status['vector_store_stats']['model_name']}

**Models:**
• Generation model: {status.get('model_info', {}).get('generation_model', 'N/A')}
• GPU available: {status.get('model_info', {}).get('gpu_available', False)}
"""
        return info_text

    # Create Gradio interface
    with gr.Blocks(
        title="RAG System - AI Research Papers QA",
        theme=gr.themes.Soft()
    ) as interface:

        gr.Markdown("""
        # 🤖 RAG System: AI Research Papers Q&A

        **Upload your AI research papers and ask questions about them!**

        This system uses Retrieval-Augmented Generation to:
        - 📖 Process and understand your research papers
        - 🔍 Find relevant information for your questions
        - 🤖 Generate accurate answers with source citations
        """)

        with gr.Tab("📤 Upload & Setup"):
            gr.Markdown("### Step 1: Upload Your PDF Research Papers")

            file_upload = gr.File(
                label="Select PDF Files",
                file_count="multiple",
                file_types=[".pdf"],
                height=100
            )

            setup_btn = gr.Button("🚀 Process Documents", variant="primary", size="lg")
            setup_output = gr.Textbox(
                label="Setup Status",
                lines=10,
                interactive=False,
                placeholder="Upload PDFs and click 'Process Documents' to begin..."
            )

            # Sample data button for demo
            gr.Markdown("---")
            gr.Markdown("### Or Try with Sample Data")
            sample_btn = gr.Button("🎯 Use Sample Data (Demo)", variant="secondary")

            def setup_sample():
                success = setup_with_sample_data()
                if success:
                    status = rag_system.get_system_status()
                    return f"""✅ Sample System Ready!

This is a demo with sample AI research paper content.
You can now ask questions like:
• "What is the main contribution of this paper?"
• "What methodology was used?"
• "What were the results?"

📊 System loaded with {status['documents_info']['total_chunks']} text chunks."""
                else:
                    return "❌ Failed to setup sample data."

            sample_btn.click(setup_sample, outputs=[setup_output])
            setup_btn.click(handle_file_upload, inputs=[file_upload], outputs=[setup_output])

        with gr.Tab("❓ Ask Questions"):
            gr.Markdown("### Ask Questions About Your Research Papers")

            with gr.Row():
                with gr.Column(scale=4):
                    question_input = gr.Textbox(
                        label="Your Question",
                        placeholder="e.g., What are the main contributions of this research?",
                        lines=3
                    )
                with gr.Column(scale=1):
                    ask_btn = gr.Button("🔍 Get Answer", variant="primary", size="lg")

            with gr.Row():
                with gr.Column():
                    answer_output = gr.Textbox(
                        label="Answer",
                        lines=8,
                        interactive=False
                    )
                with gr.Column():
                    sources_output = gr.Textbox(
                        label="Sources & Citations",
                        lines=8,
                        interactive=False
                    )

            # Sample questions
            gr.Markdown("### 💡 Sample Questions to Try:")
            sample_questions = [
                "What are the main contributions of this research?",
                "What methodology was used in this study?",
                "What are the key findings and results?",
                "What are the limitations mentioned?",
                "What future work is suggested?"
            ]

            for q in sample_questions:
                sample_q_btn = gr.Button(f"📝 {q}", variant="secondary", size="sm")
                sample_q_btn.click(lambda x=q: x, outputs=[question_input])

            ask_btn.click(
                handle_question,
                inputs=[question_input],
                outputs=[answer_output, sources_output]
            )

        with gr.Tab("ℹ️ System Info"):
            gr.Markdown("### System Status and Information")

            info_btn = gr.Button("🔄 Refresh System Info", variant="primary")
            info_output = gr.Textbox(
                label="System Information",
                lines=15,
                interactive=False
            )

            info_btn.click(show_system_info, outputs=[info_output])

            # Auto-load info on tab open
            interface.load(show_system_info, outputs=[info_output])

    return interface

# Create the RAG system instance globally
rag_system = RAGSystem()

# Create the interface
print("\n🎨 Creating Gradio Interface...")
interface = create_gradio_interface()
print("✅ Interface ready!")

🤖 Loading language model...


Device set to use cuda:0


✅ Language model loaded!

🎨 Creating Gradio Interface...
✅ Interface ready!


**Interface Features:**

*   Document upload/status panel

*   Q&A with source citations

*   System diagnostics view

# **Cell 9: Runtime Controls**
**Launch commands:**

In [18]:
def run_system():
    """Launch the complete system"""
    print("🚀 Launching RAG System...")
    print("="*60)
    print("📋 Instructions:")
    print("1. Run this cell to launch the web interface")
    print("2. Upload your PDF research papers in the 'Upload & Setup' tab")
    print("3. Wait for processing to complete")
    print("4. Go to 'Ask Questions' tab and start asking!")
    print("="*60)

    # Launch interface
    interface.launch(
        share=True,  # Creates a public shareable link
        debug=True,
        height=800
    )

# Quick test function
def quick_test():
    """Quick test with sample data"""
    print("🧪 Quick Test Mode")
    print("-" * 30)

    # Setup with sample data
    if setup_with_sample_data():
        print("\n✅ Sample system ready!")

        # Test questions
        test_questions = [
            "What is this paper about?",
            "What methodology was used?",
            "What were the main results?"
        ]

        for question in test_questions:
            print(f"\n❓ Question: {question}")
            result = rag_system.ask_question(question, num_sources=2)
            print(f"✅ Answer: {result['answer'][:200]}...")
            print(f"📚 Sources: {len(result['sources'])} found")
            print(f"🎯 Confidence: {result['confidence']}")
    else:
        print("❌ Test setup failed")

**Two Usage Modes:**

*   Full web interface (run_system())

*   CLI-style testing (quick_test())

In [19]:
run_system()

🚀 Launching RAG System...
📋 Instructions:
1. Run this cell to launch the web interface
2. Upload your PDF research papers in the 'Upload & Setup' tab
3. Wait for processing to complete
4. Go to 'Ask Questions' tab and start asking!
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://62dd8a6aa95ffa2578.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


🚀 Setting up Enhanced RAG system...
Step 1: Loading PDFs...
📖 Loading: agentic ai RD paper.pdf
✅ Loaded 26 pages from agentic ai RD paper.pdf

Step 2: Creating text chunks...
🔪 Creating text chunks...
✅ Created 286 chunks

Step 3: Creating vector embeddings...
⚡ Creating vector embeddings...
📊 Processing 286 chunks...
📊 Using 285 meaningful chunks...
✅ Vector store created successfully!
📈 Vector dimension: 384
📚 Total vectors: 285

🎉 Enhanced RAG System Ready!
📚 Loaded: 26 pages from 1 PDFs
🔪 Created: 286 text chunks
⚡ Vector store ready with 285 embeddings
🔍 Processing question: 'What is an agent?'
🔍 Found 3 relevant chunks for query
🔍 Found 3 relevant chunks for query
🔍 Found 2 relevant chunks for query
🤖 Generating answer...
✅ Answer generated (confidence: 0.3050000071525574)
🔍 Processing question: 'What are the characteristics of an agent?'
🔍 Found 3 relevant chunks for query
🔍 Found 3 relevant chunks for query
🤖 Generating answer...
✅ Answer generated (confidence: 0.31700000166893