# Confluence Semantic Search System

A production-ready implementation for semantic search through Confluence page content using embeddings and vector databases.

## Features
- Confluence API integration with pagination and rate limiting
- Multiple embedding model support (OpenAI, Sentence Transformers)
- Vector storage with FAISS, ChromaDB, or Pinecone
- Incremental indexing and caching
- Advanced search with metadata filtering
- Performance monitoring and observability

## 1. Installation and Setup

In [None]:
# Install required packages
!pip install atlassian-python-api
!pip install openai
!pip install sentence-transformers
!pip install faiss-cpu  # or faiss-gpu for GPU support
!pip install chromadb
!pip install pinecone-client
!pip install beautifulsoup4
!pip install lxml
!pip install python-dotenv
!pip install tqdm
!pip install pandas
!pip install plotly
!pip install tenacity
!pip install langchain langchain-community
!pip install tiktoken
!pip install fastapi uvicorn pydantic  # used by the API wrapper in section 10

## 2. Configuration and Imports

In [None]:
import os
import json
import pickle
import hashlib
from datetime import datetime, timedelta
from typing import List, Dict, Optional, Tuple, Any
from dataclasses import dataclass, asdict
from enum import Enum
import logging
from pathlib import Path

import numpy as np
import pandas as pd
from tqdm import tqdm
import plotly.graph_objects as go
import plotly.express as px

from atlassian import Confluence
from bs4 import BeautifulSoup
import openai
from sentence_transformers import SentenceTransformer
import faiss
import chromadb
from chromadb.config import Settings
import pinecone
from tenacity import retry, stop_after_attempt, wait_exponential
import tiktoken

from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

In [None]:
# Configuration class
@dataclass
class Config:
    # Confluence settings
    confluence_url: str = os.getenv('CONFLUENCE_URL', 'https://your-domain.atlassian.net')
    confluence_username: str = os.getenv('CONFLUENCE_USERNAME', '')
    confluence_api_token: str = os.getenv('CONFLUENCE_API_TOKEN', '')
    
    # Embedding settings
    embedding_model: str = 'openai'  # 'openai' or 'sentence-transformers'
    openai_api_key: str = os.getenv('OPENAI_API_KEY', '')
    openai_model: str = 'text-embedding-3-small'
    sentence_transformer_model: str = 'all-MiniLM-L6-v2'
    
    # Vector store settings
    vector_store: str = 'faiss'  # 'faiss', 'chroma', or 'pinecone'
    pinecone_api_key: str = os.getenv('PINECONE_API_KEY', '')
    pinecone_environment: str = os.getenv('PINECONE_ENVIRONMENT', '')
    pinecone_index_name: str = 'confluence-search'
    
    # Processing settings
    chunk_size: int = 1000  # Characters per chunk
    chunk_overlap: int = 200
    max_pages_per_batch: int = 50
    cache_dir: str = './confluence_cache'
    
    # Search settings
    top_k: int = 10
    min_similarity: float = 0.7

config = Config()
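
A few of these settings constrain each other: the chunking step is `chunk_size - chunk_overlap`, so an overlap equal to or larger than the chunk size leaves no valid step. The following is a minimal sketch of a sanity check that could run before indexing. The helper `validate_settings` is hypothetical (it is not part of the notebook), and the accepted ranges are assumptions.

```python
from typing import List


def validate_settings(chunk_size: int, chunk_overlap: int,
                      top_k: int, min_similarity: float) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if chunk_size <= 0:
        problems.append("chunk_size must be positive")
    # The chunking step is chunk_size - chunk_overlap and must stay positive
    if not 0 <= chunk_overlap < chunk_size:
        problems.append("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    if top_k <= 0:
        problems.append("top_k must be positive")
    # Assumption: similarity scores are mapped into [0, 1]
    if not 0.0 <= min_similarity <= 1.0:
        problems.append("min_similarity must be within [0, 1]")
    return problems


print(validate_settings(1000, 200, 10, 0.7))  # the defaults above: no problems
print(validate_settings(200, 200, 10, 0.7))   # overlap equal to chunk size
```

Returning a list instead of raising on the first problem lets a caller report every misconfiguration at once.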

## 3. Confluence Content Extraction

In [None]:
class ConfluenceExtractor:
    """Extract and process content from Confluence pages."""
    
    def __init__(self, config: Config):
        self.config = config
        self.confluence = Confluence(
            url=config.confluence_url,
            username=config.confluence_username,
            password=config.confluence_api_token,
            cloud=True
        )
        self.cache_dir = Path(config.cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
    def get_all_pages(self, space_key: Optional[str] = None, limit: int = 1000) -> List[Dict]:
        """Fetch all pages from Confluence with pagination."""
        logger.info(f"Fetching pages from Confluence (space: {space_key or 'all'})")
        
        all_pages = []
        start = 0
        batch_size = 50
        
        while start < limit:
            try:
                if space_key:
                    results = self.confluence.get_all_pages_from_space(
                        space=space_key,
                        start=start,
                        limit=min(batch_size, limit - start),
                        # 'space' must be expanded for the space key to appear in the metadata
                        expand='body.storage,version,history,space'
                    )
                else:
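                    # Get pages from all spaces
                    # NOTE: an empty label is unlikely to match every page, and this call
                    # does not request `body.storage`, so returned pages may lack content.
                    # Consider iterating over spaces with get_all_pages_from_space instead.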
                    results = self.confluence.get_all_pages_by_label(
                        label='',
                        start=start,
                        limit=min(batch_size, limit - start)
                    )
                
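                if not results:
                    break

                all_pages.extend(results)
                # Advance by the number of pages actually returned
                start += len(results)

                logger.info(f"Fetched {len(all_pages)} pages so far...")

            except Exception as e:
                logger.error(f"Error fetching pages: {e}")
                # Re-raise so the @retry decorator can retry; `break` would swallow the error
                raise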
                
        return all_pages
    
    def extract_page_content(self, page: Dict) -> Optional[Dict]:
        """Extract and clean content from a Confluence page."""
        try:
            # Get HTML content
            html_content = page.get('body', {}).get('storage', {}).get('value', '')
            
            # Parse HTML and extract text
            soup = BeautifulSoup(html_content, 'lxml')
            
            # Remove script and style elements
            for script in soup(['script', 'style']):
                script.decompose()
                
            # Get text
            text = soup.get_text(separator=' ', strip=True)
            
            # Clean up whitespace
            text = ' '.join(text.split())
            
            # Extract metadata
            metadata = {
                'page_id': page.get('id'),
                'title': page.get('title', ''),
                'space': page.get('space', {}).get('key', ''),
                'url': f"{self.config.confluence_url}/wiki{page.get('_links', {}).get('webui', '')}",
                'version': page.get('version', {}).get('number', 0),
                'created_by': page.get('history', {}).get('createdBy', {}).get('displayName', ''),
                'created_date': page.get('history', {}).get('createdDate', ''),
                'last_updated': page.get('version', {}).get('when', ''),
                'last_updated_by': page.get('version', {}).get('by', {}).get('displayName', '')
            }
            
            return {
                'content': text,
                'metadata': metadata,
                'html': html_content
            }
            
        except Exception as e:
            logger.error(f"Error extracting content from page {page.get('id')}: {e}")
            return None
    
    def chunk_content(self, text: str, metadata: Dict) -> List[Dict]:
        """Split content into chunks for embedding."""
        chunks = []
        
        # Simple character-based chunking with overlap
        step = self.config.chunk_size - self.config.chunk_overlap
        if step <= 0:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        for i in range(0, len(text), step):
            chunk_text = text[i:i + self.config.chunk_size]
            
            # Create chunk with metadata
            chunk = {
                'text': chunk_text,
                'metadata': {
                    **metadata,
                    'chunk_index': len(chunks),
                    'chunk_start': i,
                    'chunk_end': min(i + self.config.chunk_size, len(text))
                }
            }
            chunks.append(chunk)
            
        return chunks
    
    def process_pages(self, pages: List[Dict]) -> List[Dict]:
        """Process all pages and create chunks."""
        all_chunks = []
        
        for page in tqdm(pages, desc="Processing pages"):
            content_data = self.extract_page_content(page)
            
            if content_data and content_data['content']:
                chunks = self.chunk_content(
                    content_data['content'],
                    content_data['metadata']
                )
                all_chunks.extend(chunks)
                
        logger.info(f"Created {len(all_chunks)} chunks from {len(pages)} pages")
        return all_chunks

# Initialize extractor
extractor = ConfluenceExtractor(config)
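
To see what the sliding window in `chunk_content` produces, the sketch below computes only the `(chunk_start, chunk_end)` offsets for a given text length. `chunk_spans` is a hypothetical standalone helper that mirrors the loop above; it is an illustration, not a replacement for the method.

```python
from typing import List, Tuple


def chunk_spans(text_length: int, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for overlapping chunks."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, step)
    ]


spans = chunk_spans(text_length=2500, chunk_size=1000, chunk_overlap=200)
for span in spans:
    print(span)
# (0, 1000)
# (800, 1800)
# (1600, 2500)
# (2400, 2500)
```

The last span is a 100-character tail that is already fully contained in the previous chunk. Whether to drop such tails is a design choice; the notebook's loop keeps them.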

## 4. Embedding Generation

In [None]:
class EmbeddingGenerator:
    """Generate embeddings using various models."""
    
    def __init__(self, config: Config):
        self.config = config
        
        if config.embedding_model == 'openai':
            openai.api_key = config.openai_api_key
            self.encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
        elif config.embedding_model == 'sentence-transformers':
            self.model = SentenceTransformer(config.sentence_transformer_model)
            
    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        if self.config.embedding_model == 'openai':
            return len(self.encoder.encode(text))
        else:
            # Approximate for sentence transformers (roughly 1.3 tokens per word)
            return int(len(text.split()) * 1.3)
    
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
    def generate_embeddings(self, texts: List[str], batch_size: int = 100) -> np.ndarray:
        """Generate embeddings for a list of texts."""
        embeddings = []
        
        for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings"):
            batch_texts = texts[i:i + batch_size]
            
            if self.config.embedding_model == 'openai':
                response = openai.embeddings.create(
                    model=self.config.openai_model,
                    input=batch_texts
                )
                batch_embeddings = [item.embedding for item in response.data]
                
            elif self.config.embedding_model == 'sentence-transformers':
                batch_embeddings = self.model.encode(
                    batch_texts,
                    show_progress_bar=False,
                    convert_to_numpy=True
                )
                
            embeddings.extend(batch_embeddings)
            
        return np.array(embeddings, dtype=np.float32)
    
    def generate_query_embedding(self, query: str) -> np.ndarray:
        """Generate embedding for a single query."""
        if self.config.embedding_model == 'openai':
            response = openai.embeddings.create(
                model=self.config.openai_model,
                input=query
            )
            return np.array(response.data[0].embedding, dtype=np.float32)
            
        elif self.config.embedding_model == 'sentence-transformers':
            return self.model.encode(query, convert_to_numpy=True)

# Initialize embedding generator
embedder = EmbeddingGenerator(config)
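
The feature list mentions caching, and embedding calls are usually the most expensive step when re-indexing. One possible approach, sketched below, is to key cached vectors by a hash of the chunk text so that unchanged chunks are not embedded again. `EmbeddingCache` and `fake_embed` are hypothetical names; `fake_embed` is a stand-in for a real model call, and the cache lives in memory only.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory cache that embeds each distinct text at most once."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of texts that had to be embedded

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Collect texts that are not cached yet, de-duplicated, in first-seen order
        missing: Dict[str, str] = {}
        for text in texts:
            key = self._key(text)
            if key not in self._store and key not in missing:
                missing[key] = text
        if missing:
            vectors = self._embed_fn(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._store[key] = vector
            self.misses += len(missing)
        return [self._store[self._key(text)] for text in texts]


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Stand-in for a real model: [text length, vowel count]
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


cache = EmbeddingCache(fake_embed)
first = cache.embed(["deploy guide", "api auth", "deploy guide"])
second = cache.embed(["api auth", "release notes"])
print(cache.misses)  # 3 distinct texts were embedded across both calls
```

A persistent variant could store the same hash-to-vector mapping under `cache_dir`, which is left out here to keep the sketch short.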

## 5. Vector Store Implementation

In [None]:
class VectorStore:
    """Abstract base class for vector stores."""
    
    def add_embeddings(self, embeddings: np.ndarray, metadata: List[Dict]):
        raise NotImplementedError
        
    def search(self, query_embedding: np.ndarray, k: int = 10) -> Tuple[List[Dict], List[float]]:
        raise NotImplementedError
        
    def save(self, path: str):
        raise NotImplementedError
        
    def load(self, path: str):
        raise NotImplementedError


class FAISSVectorStore(VectorStore):
    """FAISS-based vector store."""
    
    def __init__(self, dimension: int):
        self.dimension = dimension
        self.index = faiss.IndexFlatL2(dimension)
        self.metadata = []
        
    def add_embeddings(self, embeddings: np.ndarray, metadata: List[Dict]):
        """Add embeddings to the index."""
        self.index.add(embeddings)
        self.metadata.extend(metadata)
        logger.info(f"Added {len(embeddings)} embeddings to FAISS index")
        
    def search(self, query_embedding: np.ndarray, k: int = 10) -> Tuple[List[Dict], List[float]]:
        """Search for similar embeddings."""
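        query_embedding = np.ascontiguousarray(query_embedding, dtype=np.float32).reshape(1, -1)
        distances, indices = self.index.search(query_embedding, k)

        results = []
        similarities = []
        for idx, distance in zip(indices[0], distances[0]):
            # FAISS pads with -1 when fewer than k vectors are indexed
            if 0 <= idx < len(self.metadata):
                results.append(self.metadata[idx])
                # IndexFlatL2 returns squared L2 distances; map them into (0, 1]
                similarities.append(float(1 / (1 + distance)))

        return results, similarities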
    
    def save(self, path: str):
        """Save index and metadata to disk."""
        faiss.write_index(self.index, f"{path}.index")
        with open(f"{path}.metadata", 'wb') as f:
            pickle.dump(self.metadata, f)
        logger.info(f"Saved FAISS index to {path}")
            
    def load(self, path: str):
        """Load index and metadata from disk."""
        self.index = faiss.read_index(f"{path}.index")
        self.dimension = self.index.d  # restore the dimension stored in the index
        with open(f"{path}.metadata", 'rb') as f:
            self.metadata = pickle.load(f)
        logger.info(f"Loaded FAISS index from {path}")


class ChromaVectorStore(VectorStore):
    """ChromaDB-based vector store."""
    
    def __init__(self, collection_name: str = "confluence_docs"):
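        # chromadb >= 0.4 replaced the legacy Settings(chroma_db_impl=...) setup with PersistentClient
        self.client = chromadb.PersistentClient(path="./chroma_db")
        # Use cosine distance so that `1 - distance` in search() is a cosine similarity
        self.collection = self.client.get_or_create_collection(
            collection_name,
            metadata={"hnsw:space": "cosine"}
        )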
        
    def add_embeddings(self, embeddings: np.ndarray, metadata: List[Dict]):
        """Add embeddings to ChromaDB."""
        ids = [f"doc_{i}_{datetime.now().timestamp()}" for i in range(len(embeddings))]
        
        # Each item is a chunk: {'text': ..., 'metadata': {...}}
        documents = [m.get('text', '') for m in metadata]

        # ChromaDB metadata values must be scalars, so flatten the nested page metadata
        clean_metadata = []
        for m in metadata:
            clean_m = {k: str(v) for k, v in m.get('metadata', {}).items()}
            clean_metadata.append(clean_m)

        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=documents,
            metadatas=clean_metadata,
            ids=ids
        )
        logger.info(f"Added {len(embeddings)} embeddings to ChromaDB")

    def search(self, query_embedding: np.ndarray, k: int = 10) -> Tuple[List[Dict], List[float]]:
        """Search for similar documents in ChromaDB."""
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=k
        )

        # Rebuild the chunk shape used elsewhere: {'text': ..., 'metadata': {...}}
        metadata_results = []
        for i in range(len(results['ids'][0])):
            metadata_results.append({
                'text': results['documents'][0][i],
                'metadata': dict(results['metadatas'][0][i] or {})
            })

        # `1 - distance` is a cosine similarity only when the collection uses cosine distance
        similarities = [1 - d for d in results['distances'][0]]

        return metadata_results, similarities


class PineconeVectorStore(VectorStore):
    """Pinecone-based vector store."""
    
    def __init__(self, config: Config, dimension: int):
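        # pinecone-client >= 3 removed pinecone.init(); use a client object instead
        self.client = pinecone.Pinecone(api_key=config.pinecone_api_key)

        # Create or get index
        if config.pinecone_index_name not in self.client.list_indexes().names():
            self.client.create_index(
                name=config.pinecone_index_name,
                dimension=dimension,
                metric='cosine',
                # Assumption: a pod-based index in the configured environment;
                # use pinecone.ServerlessSpec(cloud=..., region=...) for serverless
                spec=pinecone.PodSpec(environment=config.pinecone_environment)
            )

        self.index = self.client.Index(config.pinecone_index_name)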
        self.metadata_store = {}
        
    def add_embeddings(self, embeddings: np.ndarray, metadata: List[Dict]):
        """Add embeddings to Pinecone."""
        batch_size = 100
        
        for i in range(0, len(embeddings), batch_size):
            batch_embeddings = embeddings[i:i + batch_size]
            batch_metadata = metadata[i:i + batch_size]
            
            # Create unique IDs
            ids = [f"doc_{i}_{j}" for j in range(len(batch_embeddings))]
            
            # Prepare vectors for upsert
            vectors = []
            for j, (emb, meta) in enumerate(zip(batch_embeddings, batch_metadata)):
                vec_id = ids[j]
                self.metadata_store[vec_id] = meta
                
                # Pinecone metadata must be a flat dict with string/number values;
                # the page fields live under the chunk's nested 'metadata' dict
                pinecone_meta = {
                    k: v for k, v in meta.get('metadata', {}).items()
                    if k in ['title', 'space', 'page_id'] and isinstance(v, (str, int, float))
                }
                
                vectors.append((vec_id, emb.tolist(), pinecone_meta))
                
            self.index.upsert(vectors=vectors)
            
        logger.info(f"Added {len(embeddings)} embeddings to Pinecone")
        
    def search(self, query_embedding: np.ndarray, k: int = 10) -> Tuple[List[Dict], List[float]]:
        """Search for similar vectors in Pinecone."""
        results = self.index.query(
            vector=query_embedding.tolist(),
            top_k=k,
            include_metadata=True
        )
        
        metadata_results = []
        similarities = []
        
        for match in results.matches:
            if match.id in self.metadata_store:
                metadata_results.append(self.metadata_store[match.id])
            else:
                metadata_results.append(match.metadata)
            similarities.append(match.score)
            
        return metadata_results, similarities
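
The three stores report scores on different scales, which matters for `min_similarity`. FAISS `IndexFlatL2` returns squared L2 distances, and for unit-length vectors these relate to cosine similarity by d² = 2 − 2·cos. The sketch below checks that identity in plain Python and shows which cosine value the `1 / (1 + d²)` mapping implies for a given threshold. It assumes the embeddings are L2-normalized; the helper names are hypothetical.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    # For unit vectors the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))


def min_cosine_for_threshold(min_similarity: float) -> float:
    """Cosine similarity implied by a threshold on 1 / (1 + d^2), for unit vectors."""
    # 1 / (1 + d2) >= s  <=>  d2 <= 1/s - 1, and cos = 1 - d2 / 2
    max_d2 = 1.0 / min_similarity - 1.0
    return 1.0 - max_d2 / 2.0


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
d2 = squared_l2(a, b)
cos = cosine(a, b)
print(abs(d2 - (2 - 2 * cos)) < 1e-12)          # True: d^2 = 2 - 2*cos
print(round(min_cosine_for_threshold(0.7), 3))  # 0.786
```

Under these assumptions, the default `min_similarity = 0.7` on the FAISS store corresponds to a cosine similarity of roughly 0.79, which is stricter than applying 0.7 directly to Pinecone's cosine score.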

## 6. Semantic Search Engine

In [None]:
class ConfluenceSemanticSearch:
    """Main semantic search engine for Confluence."""
    
    def __init__(self, config: Config):
        self.config = config
        self.extractor = ConfluenceExtractor(config)
        self.embedder = EmbeddingGenerator(config)
        self.vector_store = None
        self.chunks = []
        
    def initialize_vector_store(self, dimension: Optional[int] = None):
        """Initialize the appropriate vector store."""
        if dimension is None:
            # Get dimension from a test embedding
            test_embedding = self.embedder.generate_query_embedding("test")
            dimension = len(test_embedding)
            
        if self.config.vector_store == 'faiss':
            self.vector_store = FAISSVectorStore(dimension)
        elif self.config.vector_store == 'chroma':
            self.vector_store = ChromaVectorStore()
        elif self.config.vector_store == 'pinecone':
            self.vector_store = PineconeVectorStore(self.config, dimension)
            
        logger.info(f"Initialized {self.config.vector_store} vector store")
        
    def index_confluence_content(self, space_key: Optional[str] = None, limit: int = 1000):
        """Index Confluence content into the vector store."""
        # Fetch pages
        pages = self.extractor.get_all_pages(space_key, limit)
        
        if not pages:
            logger.warning("No pages found to index")
            return
            
        # Process pages into chunks
        self.chunks = self.extractor.process_pages(pages)
        
        # Generate embeddings
        texts = [chunk['text'] for chunk in self.chunks]
        embeddings = self.embedder.generate_embeddings(texts)
        
        # Initialize vector store if not already done
        if self.vector_store is None:
            self.initialize_vector_store(embeddings.shape[1])
            
        # Add to vector store
        self.vector_store.add_embeddings(embeddings, self.chunks)
        
        logger.info(f"Successfully indexed {len(self.chunks)} chunks from {len(pages)} pages")
        
    def search(self, query: str, k: Optional[int] = None, filter_space: Optional[str] = None,
               filter_metadata: Optional[Dict] = None) -> List[Dict]:
        """Search for relevant content."""
        if self.vector_store is None:
            raise ValueError("Vector store not initialized. Please index content first.")
            
        k = k or self.config.top_k
        
        # Generate query embedding
        query_embedding = self.embedder.generate_query_embedding(query)
        
        # Search vector store
        results, similarities = self.vector_store.search(query_embedding, k * 2)  # Get extra for filtering
        
        # Apply filters
        filtered_results = []
        for result, similarity in zip(results, similarities):
            # Apply similarity threshold
            if similarity < self.config.min_similarity:
                continue
                
            # Apply space filter
            if filter_space and result['metadata'].get('space') != filter_space:
                continue
                
            # Apply custom metadata filters
            if filter_metadata:
                match = True
                for key, value in filter_metadata.items():
                    if result['metadata'].get(key) != value:
                        match = False
                        break
                if not match:
                    continue
                    
            # Add the similarity score to a copy so the stored chunk is not mutated
            filtered_results.append({**result, 'similarity': similarity})
            
            if len(filtered_results) >= k:
                break
                
        return filtered_results
    
    def rerank_results(self, query: str, results: List[Dict], use_cross_encoder: bool = True) -> List[Dict]:
        """Rerank search results using cross-encoder or other methods."""
        if use_cross_encoder and results:
            from sentence_transformers import CrossEncoder
            
            # Initialize cross-encoder for reranking
            cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
            
            # Prepare pairs for cross-encoder
            pairs = [[query, result['text']] for result in results]
            
            # Get scores
            scores = cross_encoder.predict(pairs)
            
            # Add rerank scores and sort
            for result, score in zip(results, scores):
                result['rerank_score'] = float(score)
                
            results.sort(key=lambda x: x['rerank_score'], reverse=True)
            
        return results
    
    def save_index(self, path: str):
        """Save the index to disk."""
        if self.vector_store and hasattr(self.vector_store, 'save'):
            self.vector_store.save(path)
            
        # Save chunks separately
        with open(f"{path}.chunks", 'wb') as f:
            pickle.dump(self.chunks, f)
            
        logger.info(f"Saved index to {path}")
        
    def load_index(self, path: str):
        """Load the index from disk."""
        # Load chunks
        with open(f"{path}.chunks", 'rb') as f:
            self.chunks = pickle.load(f)
            
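        # Initialize and load vector store
        if self.config.vector_store == 'faiss':
            self.vector_store = FAISSVectorStore(1)  # Dimension will be loaded
            self.vector_store.load(path)
        else:
            # Chroma and Pinecone persist on their own; re-attach to the existing collection/index
            self.initialize_vector_store()

        logger.info(f"Loaded index from {path}")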

# Initialize search engine
search_engine = ConfluenceSemanticSearch(config)
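
The filtering inside `search` runs after the vector lookup, which is why it over-fetches `k * 2` candidates. The sketch below isolates that post-filtering step on hand-written data so the behavior is easy to inspect. `post_filter` is a hypothetical helper and the candidate scores are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def post_filter(candidates: List[Tuple[Dict, float]], k: int, min_similarity: float,
                required: Optional[Dict[str, str]] = None) -> List[Dict]:
    """Keep up to k candidates that pass the score threshold and metadata filters."""
    kept = []
    for chunk, score in candidates:
        if score < min_similarity:
            continue
        meta = chunk.get("metadata", {})
        if required and any(meta.get(key) != value for key, value in required.items()):
            continue
        # Copy the chunk so the indexed data is left untouched
        kept.append({**chunk, "similarity": score})
        if len(kept) >= k:
            break
    return kept


candidates = [
    ({"text": "Deploy with Helm", "metadata": {"space": "OPS"}}, 0.91),
    ({"text": "Vacation policy", "metadata": {"space": "HR"}}, 0.88),
    ({"text": "Rollback steps", "metadata": {"space": "OPS"}}, 0.75),
    ({"text": "Old runbook", "metadata": {"space": "OPS"}}, 0.52),
]
hits = post_filter(candidates, k=2, min_similarity=0.7, required={"space": "OPS"})
print([h["text"] for h in hits])  # ['Deploy with Helm', 'Rollback steps']
```

With selective filters, a fixed `k * 2` over-fetch can still return fewer than `k` results. Stores such as ChromaDB and Pinecone also support filtering inside the query itself, which avoids that shortfall.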

## 7. Usage Examples

In [None]:
# Example 1: Index all content from a specific space
def index_space_example():
    """Index content from a specific Confluence space."""
    
    # Initialize search engine
    search_engine = ConfluenceSemanticSearch(config)
    
    # Index content from a specific space
    search_engine.index_confluence_content(
        space_key='DOCS',  # Replace with your space key
        limit=100  # Limit number of pages for testing
    )
    
    # Save index for later use
    search_engine.save_index('./confluence_index')
    
    return search_engine

# Run indexing
# search_engine = index_space_example()

In [None]:
# Example 2: Perform semantic search
def search_example(search_engine, query: str):
    """Perform a semantic search and display results."""
    
    # Search for relevant content
    results = search_engine.search(
        query=query,
        k=5,
        filter_space=None  # Optional: filter by space
    )
    
    # Display results
    print(f"\nSearch Query: {query}")
    print("=" * 80)
    
    for i, result in enumerate(results, 1):
        print(f"\nResult {i}:")
        print(f"Title: {result['metadata']['title']}")
        print(f"Space: {result['metadata']['space']}")
        print(f"URL: {result['metadata']['url']}")
        print(f"Similarity: {result['similarity']:.3f}")
        print(f"Content: {result['text'][:200]}...")
        print("-" * 40)
        
    return results

# Example search
# results = search_example(search_engine, "How to deploy to production?")

In [None]:
# Example 3: Advanced search with reranking
def advanced_search_example(search_engine, query: str):
    """Perform advanced search with reranking."""
    
    # Initial search
    results = search_engine.search(query=query, k=10)

    # Record the initial order before reranking, which sorts the list in place
    initial_rank = {id(r): i + 1 for i, r in enumerate(results)}

    # Rerank results
    reranked_results = search_engine.rerank_results(query, results)

    # Create comparison dataframe
    df_results = pd.DataFrame([
        {
            'Title': r['metadata']['title'],
            'Initial Rank': initial_rank[id(r)],
            'Similarity Score': r['similarity'],
            'Rerank Score': r.get('rerank_score', 0)
        }
        for r in reranked_results
    ])
    
    # Sort by rerank score
    df_results = df_results.sort_values('Rerank Score', ascending=False).reset_index(drop=True)
    df_results['Final Rank'] = range(1, len(df_results) + 1)
    
    print(f"\nSearch Query: {query}")
    print("=" * 80)
    print(df_results[['Final Rank', 'Initial Rank', 'Title', 'Similarity Score', 'Rerank Score']])
    
    return reranked_results

# Example advanced search
# reranked_results = advanced_search_example(search_engine, "API authentication best practices")

## 8. Monitoring and Analytics

In [None]:
class SearchAnalytics:
    """Analytics and monitoring for search system."""
    
    def __init__(self):
        self.search_logs = []
        
    def log_search(self, query: str, results: List[Dict], response_time: float):
        """Log search query and results."""
        log_entry = {
            'timestamp': datetime.now(),
            'query': query,
            'num_results': len(results),
            'top_result': results[0]['metadata']['title'] if results else None,
            'avg_similarity': np.mean([r['similarity'] for r in results]) if results else 0,
            'response_time': response_time
        }
        self.search_logs.append(log_entry)
        
    def get_analytics_dashboard(self) -> Dict:
        """Generate analytics dashboard data."""
        if not self.search_logs:
            return {}
            
        df = pd.DataFrame(self.search_logs)
        
        analytics = {
            'total_searches': len(df),
            'avg_response_time': df['response_time'].mean(),
            'avg_results_per_search': df['num_results'].mean(),
            'avg_similarity_score': df['avg_similarity'].mean(),
            'searches_with_no_results': (df['num_results'] == 0).sum(),
            'unique_queries': df['query'].nunique(),
            'top_queries': df['query'].value_counts().head(10).to_dict()
        }
        
        return analytics
    
    def plot_search_metrics(self):
        """Create visualization of search metrics."""
        if not self.search_logs:
            print("No search logs available")
            return
            
        df = pd.DataFrame(self.search_logs)
        
        # Create subplots
        from plotly.subplots import make_subplots
        
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=('Searches Over Time', 'Response Time Distribution',
                          'Results per Search', 'Similarity Score Distribution')
        )
        
        # Searches over time
        df['hour'] = df['timestamp'].dt.floor('h')  # 'H' is deprecated in pandas >= 2.2
        searches_by_hour = df.groupby('hour').size().reset_index(name='count')
        fig.add_trace(
            go.Scatter(x=searches_by_hour['hour'], y=searches_by_hour['count'],
                      mode='lines+markers', name='Searches'),
            row=1, col=1
        )
        
        # Response time distribution
        fig.add_trace(
            go.Histogram(x=df['response_time'], name='Response Time', nbinsx=20),
            row=1, col=2
        )
        
        # Results per search
        fig.add_trace(
            go.Histogram(x=df['num_results'], name='Results Count', nbinsx=15),
            row=2, col=1
        )
        
        # Similarity score distribution
        fig.add_trace(
            go.Histogram(x=df['avg_similarity'], name='Similarity', nbinsx=20),
            row=2, col=2
        )
        
        fig.update_layout(height=600, showlegend=False,
                         title_text="Search System Analytics Dashboard")
        fig.show()

# Initialize analytics
analytics = SearchAnalytics()
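
The dashboard reports the mean response time, which a few slow queries can skew. A small sketch of percentile-based latency reporting with only the standard library follows. `latency_summary` is a hypothetical helper and the sample timings are invented.

```python
import statistics
from typing import Dict, List


def latency_summary(response_times: List[float]) -> Dict[str, float]:
    """Summarize response times (seconds) as p50, p95, and max."""
    if len(response_times) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = statistics.quantiles(response_times, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(response_times)}


times = [0.12, 0.15, 0.11, 0.30, 0.14, 0.13, 1.20, 0.16, 0.12, 0.18]
summary = latency_summary(times)
print(summary)
```

Here a single 1.2 s outlier pulls the mean well above the median, while p50 stays close to typical latency and p95 exposes the tail.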

## 9. Production Deployment Considerations

In [None]:
class ProductionSearchSystem:
    """Production-ready search system with caching, incremental updates, and monitoring."""
    
    def __init__(self, config: Config):
        self.config = config
        self.search_engine = ConfluenceSemanticSearch(config)
        self.analytics = SearchAnalytics()
        self.cache = {}
        self.last_update = None
        
    def initialize_or_load(self, index_path: str):
        """Initialize system by loading existing index or creating new one."""
        if os.path.exists(f"{index_path}.chunks"):
            logger.info("Loading existing index...")
            self.search_engine.load_index(index_path)
            self.last_update = datetime.now()
        else:
            logger.info("Creating new index...")
            self.search_engine.index_confluence_content(limit=1000)
            self.search_engine.save_index(index_path)
            self.last_update = datetime.now()
            
    def incremental_update(self, hours_back: int = 24):
        """Perform incremental update for recently modified pages."""
        logger.info(f"Performing incremental update for last {hours_back} hours...")
        
        # This would require Confluence API support for querying by modification date
        # Implementation would fetch only recently modified pages and update the index
        
        # Placeholder for incremental update logic
        # recent_pages = self.search_engine.extractor.get_recently_modified_pages(hours_back)
        # ...
        
        self.last_update = datetime.now()
        
    def search_with_cache(self, query: str, **kwargs) -> List[Dict]:
        """Search with caching for repeated queries."""
        import time
        
        # Create cache key (sort kwargs so argument order does not change the key)
        cache_key = hashlib.md5(f"{query}{sorted(kwargs.items())}".encode()).hexdigest()
        
        # Check cache
        if cache_key in self.cache:
            cache_entry = self.cache[cache_key]
            if datetime.now() - cache_entry['timestamp'] < timedelta(minutes=30):
                logger.info(f"Returning cached results for query: {query}")
                return cache_entry['results']
                
        # Perform search
        start_time = time.time()
        results = self.search_engine.search(query, **kwargs)
        response_time = time.time() - start_time
        
        # Cache results
        self.cache[cache_key] = {
            'timestamp': datetime.now(),
            'results': results
        }
        
        # Log analytics
        self.analytics.log_search(query, results, response_time)
        
        return results
    
    def get_system_health(self) -> Dict:
        """Get system health metrics."""
        health = {
            'status': 'healthy',
            'last_update': self.last_update.isoformat() if self.last_update else None,
            'index_size': len(self.search_engine.chunks) if self.search_engine.chunks else 0,
            'cache_size': len(self.cache),
            'analytics': self.analytics.get_analytics_dashboard()
        }
        
        return health
    
    def export_search_logs(self, filepath: str):
        """Export search logs for analysis."""
        if self.analytics.search_logs:
            df = pd.DataFrame(self.analytics.search_logs)
            df.to_csv(filepath, index=False)
            logger.info(f"Exported {len(df)} search logs to {filepath}")
        else:
            logger.warning("No search logs to export")

# Initialize production system
prod_system = ProductionSearchSystem(config)
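
The `incremental_update` method above leaves the fetch step as a placeholder. One possible way to fill it in is a CQL query on the `lastmodified` field, sent through the `cql` method of `atlassian-python-api`. The sketch below is an assumption-laden illustration rather than part of the system: `fetch_recently_modified_page_ids` is a hypothetical helper, the relative-date syntax `now("-24h")` should be checked against your Confluence version, and the response shape (each hit nesting the page under a `content` key) is assumed.

```python
from typing import Any, Dict, List


def fetch_recently_modified_page_ids(
    confluence: Any,
    hours_back: int = 24,
    space_key: str = "",
    page_size: int = 50,
) -> List[str]:
    """Return IDs of pages modified in the last `hours_back` hours (hypothetical helper)."""
    # Assumption: CQL accepts a relative date such as now("-24h") for `lastmodified`.
    cql = f'type = page AND lastmodified >= now("-{hours_back}h")'
    if space_key:
        cql += f' AND space = "{space_key}"'

    page_ids: List[str] = []
    start = 0
    while True:
        response: Dict[str, Any] = confluence.cql(cql, start=start, limit=page_size)
        results = response.get("results", [])
        for item in results:
            # Assumption: each search hit nests the page under a "content" key.
            content = item.get("content") or {}
            if "id" in content:
                page_ids.append(content["id"])
        # A short page means there is nothing left to fetch.
        if len(results) < page_size:
            break
        start += page_size
    return page_ids
```

The returned IDs could then be passed to whatever re-indexing routine the extractor exposes; pages deleted since the last run would still need separate handling.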

## 10. API Wrapper for Integration

In [None]:
from typing import Optional
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
import uvicorn

# API models
class SearchRequest(BaseModel):
    query: str
    k: Optional[int] = 10
    filter_space: Optional[str] = None
    rerank: Optional[bool] = False

class SearchResponse(BaseModel):
    query: str
    results: List[Dict]
    response_time: float
    timestamp: str

# Create FastAPI app
app = FastAPI(title="Confluence Semantic Search API")

# Global production system instance
prod_system = None

@app.on_event("startup")
async def startup_event():
    """Initialize the search system on startup."""
    global prod_system
    config = Config()
    prod_system = ProductionSearchSystem(config)
    prod_system.initialize_or_load('./confluence_index')

@app.post("/search", response_model=SearchResponse)
async def search(request: SearchRequest):
    """Perform semantic search on Confluence content."""
    import time
    
    if not prod_system:
        raise HTTPException(status_code=503, detail="Search system not initialized")
    
    try:
        start_time = time.time()
        
        # Perform search
        results = prod_system.search_with_cache(
            query=request.query,
            k=request.k,
            filter_space=request.filter_space
        )
        
        # Optionally rerank
        if request.rerank:
            results = prod_system.search_engine.rerank_results(request.query, results)
        
        response_time = time.time() - start_time
        
        return SearchResponse(
            query=request.query,
            results=results,
            response_time=response_time,
            timestamp=datetime.now().isoformat()
        )
        
    except Exception as e:
        # logger.exception records the full traceback alongside the message
        logger.exception(f"Search error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    """Get system health status."""
    if not prod_system:
        return {"status": "initializing"}
    
    return prod_system.get_system_health()

@app.post("/update")
async def update(background_tasks: BackgroundTasks):
    """Trigger incremental index update."""
    if not prod_system:
        raise HTTPException(status_code=503, detail="Search system not initialized")
    
    background_tasks.add_task(prod_system.incremental_update)
    return {"message": "Update task scheduled"}

@app.get("/analytics")
async def analytics():
    """Get search analytics."""
    if not prod_system:
        raise HTTPException(status_code=503, detail="Search system not initialized")
    
    return prod_system.analytics.get_analytics_dashboard()

# To run the API:
# if __name__ == "__main__":
#     uvicorn.run(app, host="0.0.0.0", port=8000)
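
Once the API is running, a client needs to send a JSON body that matches the `SearchRequest` model. The sketch below shows one way to build such a request with only the standard library. `build_search_request` is a hypothetical helper, and the base URL assumes the host and port from the commented-out `uvicorn.run` call above.

```python
import json
import urllib.request
from typing import Optional


def build_search_request(
    query: str,
    k: int = 10,
    filter_space: Optional[str] = None,
    rerank: bool = False,
    base_url: str = "http://localhost:8000",  # assumed local deployment
) -> urllib.request.Request:
    """Build a POST request whose JSON body mirrors the SearchRequest model."""
    payload = {"query": query, "k": k, "filter_space": filter_space, "rerank": rerank}
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("How to deploy to production?", k=3)

# Sending the request requires the API to be up:
# with urllib.request.urlopen(request, timeout=30) as response:
#     body = json.loads(response.read())
#     print(body["response_time"], len(body["results"]))
```

Building the request separately from sending it keeps the payload easy to inspect and test without a live server.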

## 11. Complete Example Workflow

In [None]:
def complete_workflow_example():
    """
    Complete example workflow demonstrating:
    1. Configuration setup
    2. Content indexing
    3. Search operations
    4. Analytics
    """
    
    # Step 1: Configure the system
    config = Config(
        confluence_url=os.getenv('CONFLUENCE_URL'),
        confluence_username=os.getenv('CONFLUENCE_USERNAME'),
        confluence_api_token=os.getenv('CONFLUENCE_API_TOKEN'),
        embedding_model='sentence-transformers',  # Use free local model
        vector_store='faiss',  # Use local vector store
        chunk_size=500,
        chunk_overlap=100
    )
    
    # Step 2: Initialize production system
    print("Initializing production search system...")
    prod_system = ProductionSearchSystem(config)
    
    # Step 3: Index content (or load existing index)
    index_path = './confluence_search_index'
    prod_system.initialize_or_load(index_path)
    
    # Step 4: Perform various searches
    test_queries = [
        "How to deploy to production?",
        "API authentication methods",
        "Database backup procedures",
        "Team onboarding process",
        "Security best practices"
    ]
    
    print("\n" + "="*80)
    print("PERFORMING TEST SEARCHES")
    print("="*80)
    
    for query in test_queries:
        print(f"\nSearching: {query}")
        results = prod_system.search_with_cache(query, k=3)
        
        if results:
            print(f"Found {len(results)} results:")
            for i, result in enumerate(results, 1):
                print(f"  {i}. {result['metadata']['title']} (similarity: {result['similarity']:.3f})")
        else:
            print("  No results found")
    
    # Step 5: Display analytics
    print("\n" + "="*80)
    print("SEARCH ANALYTICS")
    print("="*80)
    
    analytics_data = prod_system.analytics.get_analytics_dashboard()
    for key, value in analytics_data.items():
        if key != 'top_queries':
            print(f"{key}: {value}")
    
    # Step 6: Visualize metrics
    prod_system.analytics.plot_search_metrics()
    
    # Step 7: Export logs
    prod_system.export_search_logs('./search_logs.csv')
    
    # Step 8: System health check
    print("\n" + "="*80)
    print("SYSTEM HEALTH")
    print("="*80)
    
    health = prod_system.get_system_health()
    print(f"Status: {health['status']}")
    print(f"Index Size: {health['index_size']} chunks")
    print(f"Cache Size: {health['cache_size']} entries")
    print(f"Last Update: {health['last_update']}")
    
    return prod_system

# Run the complete workflow
# prod_system = complete_workflow_example()

## 12. Advanced Features and Optimizations

In [None]:
class AdvancedSearchFeatures:
    """Advanced features for the search system."""
    
    @staticmethod
    def hybrid_search(search_engine, query: str, alpha: float = 0.5) -> List[Dict]:
        """
        Hybrid search combining semantic and keyword search.
        alpha: weight for the semantic score (0-1); the keyword score gets 1 - alpha.
        Assumes metadata['chunk_index'] is the chunk's position in search_engine.chunks.
        """
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity
        
        # Semantic search
        semantic_results = search_engine.search(query, k=20)
        
        # Keyword search: TF-IDF cosine similarity (a simple stand-in for BM25)
        texts = [chunk['text'] for chunk in search_engine.chunks]
        vectorizer = TfidfVectorizer(max_features=5000)
        tfidf_matrix = vectorizer.fit_transform(texts)
        query_vector = vectorizer.transform([query])
        keyword_similarities = cosine_similarity(query_vector, tfidf_matrix)[0]
        
        # Get top keyword results
        top_keyword_indices = keyword_similarities.argsort()[-20:][::-1]
        
        # Combine scores, starting with the semantic candidates
        combined_scores = {}
        for result in semantic_results:
            chunk_id = result['metadata'].get('chunk_index', -1)
            if 0 <= chunk_id < len(keyword_similarities):
                combined_scores[chunk_id] = (
                    alpha * result['similarity']
                    + (1 - alpha) * keyword_similarities[chunk_id]
                )
        
        # Add keyword-only candidates, which have no semantic score
        for chunk_id in top_keyword_indices:
            chunk_id = int(chunk_id)
            if chunk_id not in combined_scores and keyword_similarities[chunk_id] > 0:
                combined_scores[chunk_id] = (1 - alpha) * keyword_similarities[chunk_id]
        
        # Sort by combined score
        sorted_chunks = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        
        # Return top results
        results = []
        for chunk_id, score in sorted_chunks[:10]:
            result = search_engine.chunks[chunk_id].copy()
            result['hybrid_score'] = float(score)
            results.append(result)
        
        return results
    
    @staticmethod
    def query_expansion(query: str, model='sentence-transformers') -> List[str]:
        """
        Expand query with synonyms and related terms.
        Uses a fixed synonym table; the `model` argument is currently unused.
        """
        import re
        
        expanded_queries = [query]
        
        # Simple expansion with common patterns
        expansions = {
            'deploy': ['deployment', 'release', 'push to production'],
            'auth': ['authentication', 'authorization', 'login'],
            'api': ['API', 'endpoint', 'REST', 'service'],
            'database': ['DB', 'SQL', 'data store', 'persistence'],
            'error': ['bug', 'issue', 'problem', 'exception'],
            'config': ['configuration', 'settings', 'parameters']
        }
        
        # Replace whole-word, case-insensitive matches so that 'auth' does not
        # match inside 'authentication' and 'API' is treated like 'api'
        for term, synonyms in expansions.items():
            pattern = re.compile(rf'\b{re.escape(term)}\b', re.IGNORECASE)
            if pattern.search(query):
                for synonym in synonyms:
                    variant = pattern.sub(synonym, query)
                    if variant not in expanded_queries:
                        expanded_queries.append(variant)
        
        return expanded_queries[:5]  # Limit to 5 variations (including the original)
    
    @staticmethod
    def clustering_results(results: List[Dict], n_clusters: int = 3):
        """
        Cluster search results to identify themes.
        """
        if len(results) < n_clusters:
            return results
        
        from sklearn.cluster import KMeans
        from sentence_transformers import SentenceTransformer
        
        # Generate embeddings for clustering
        model = SentenceTransformer('all-MiniLM-L6-v2')
        texts = [r['text'] for r in results]
        embeddings = model.encode(texts)
        
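        # Perform clustering (n_init is set explicitly so results do not
        # depend on the scikit-learn version's default)
        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        clusters = kmeans.fit_predict(embeddings)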
        
        # Add cluster labels to results
        for result, cluster in zip(results, clusters):
            result['cluster'] = int(cluster)
        
        return results

# Example usage of advanced features
advanced = AdvancedSearchFeatures()
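
`hybrid_search` combines the two signals as `alpha * semantic + (1 - alpha) * keyword`. Embedding similarities and TF-IDF similarities often fall in different ranges, so one common refinement is to rescale each set of scores before mixing them. The sketch below illustrates that idea with min-max normalization. This step is an addition for illustration and is not part of `hybrid_search` above; the helper names and the toy scores are made up.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 0.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def fuse_scores(
    semantic: Dict[int, float],
    keyword: Dict[int, float],
    alpha: float = 0.5,
) -> Dict[int, float]:
    """Weighted sum of normalized scores; a missing score counts as 0."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    semantic_n = min_max_normalize(semantic)
    keyword_n = min_max_normalize(keyword)
    return {
        chunk_id: alpha * semantic_n.get(chunk_id, 0.0)
        + (1 - alpha) * keyword_n.get(chunk_id, 0.0)
        for chunk_id in set(semantic_n) | set(keyword_n)
    }


# Toy scores keyed by chunk index (illustrative values only)
semantic_scores = {0: 1.0, 1: 0.5, 2: 0.75}
keyword_scores = {1: 0.5, 2: 0.25, 3: 0.0}

fused = fuse_scores(semantic_scores, keyword_scores, alpha=0.75)
ranking: List[int] = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

With `alpha=0.75` the semantic signal dominates, so chunk 0 ranks first even though it has no keyword score at all.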

## Summary

This notebook implements semantic search for Confluence content, with the following features:

### Core Capabilities
- **Confluence Integration**: Full API integration with pagination and error handling
- **Multiple Embedding Models**: Support for OpenAI and Sentence Transformers
- **Vector Store Options**: FAISS, ChromaDB, and Pinecone implementations
- **Advanced Search**: Semantic search with metadata filtering and reranking

### Production Features
- **Caching**: Query result caching for improved performance
- **Incremental Updates**: An `incremental_update` hook for re-indexing only changed content (the fetch logic is a placeholder to be implemented)
- **Monitoring**: Comprehensive analytics and health monitoring
- **API Wrapper**: FastAPI integration for REST API access
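
The query cache in `search_with_cache` is a plain dictionary with a 30-minute freshness check, so it grows without bound in a long-running process. A minimal sketch of a bounded alternative is shown below, assuming a size cap with least-recently-used eviction is acceptable. `TTLCache` is a hypothetical class and is not used elsewhere in this notebook.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLCache:
    """Small in-memory cache with per-entry expiry and a size cap (illustrative only)."""

    def __init__(
        self,
        ttl_seconds: float = 1800.0,
        max_entries: int = 1000,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._clock = clock
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired: drop it instead of keeping stale data
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return value

    def set(self, key: str, value: Any) -> None:
        """Store a value and evict the least recently used entries if over the cap."""
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def __len__(self) -> int:
        return len(self._entries)
```

Because `get` returns `None` for a miss, this sketch cannot distinguish a cached `None` from an absent key; search results are lists, so that limitation does not matter here.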

### Advanced Capabilities
- **Hybrid Search**: Combining semantic and keyword-based search
- **Query Expansion**: Automatic query enhancement with synonyms
- **Result Clustering**: Grouping similar results
- **Cross-encoder Reranking**: Improved relevance scoring

### Next Steps
1. Set up environment variables for Confluence and embedding providers
2. Choose appropriate vector store based on scale requirements
3. Run initial indexing of Confluence content
4. Deploy API for integration with other systems
5. Monitor analytics and optimize based on usage patterns