# Agentic Workflow for Semantic Search through Confluence Pages

This notebook implements an agentic workflow for semantic search over Confluence pages, using:
- **Confluence API** for page crawling and content extraction
- **OpenAI/Cohere embeddings** for semantic representation
- **ChromaDB/Pinecone** for vector storage
- **LangGraph** for agentic workflow orchestration
- **Braintrust** for observability and monitoring

## Architecture Overview

```
Confluence Root URL → Crawler → Content Processor → Embeddings → Vector DB
                                                                      ↓
User Query → Agent → Tools → Semantic Search → Response Generation
```

## 1. Installation and Dependencies

In [None]:
# Install required packages
!pip install -q atlassian-python-api \
              langchain \
              langchain-openai \
              langchain-cohere \
              langgraph \
              chromadb \
              pinecone-client \
              beautifulsoup4 \
              lxml \
              tiktoken \
              python-dotenv \
              braintrust \
              numpy \
              pandas \
              tqdm \
              tenacity

## 2. Configuration and Setup

In [None]:
import os
import json
import hashlib
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass, field
from datetime import datetime
import logging
from urllib.parse import urlparse, urljoin
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
import warnings
warnings.filterwarnings('ignore')

# External imports
import pandas as pd
import numpy as np
from tqdm import tqdm
from tenacity import retry, stop_after_attempt, wait_exponential
from bs4 import BeautifulSoup
from atlassian import Confluence
from dotenv import load_dotenv

# LangChain imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_cohere import CohereEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Chroma, Pinecone
from langchain.tools import Tool
from langchain.agents import AgentExecutor
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.schema.runnable import RunnablePassthrough

# LangGraph imports
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.graph.message import add_messages

# Observability
import braintrust

# Load environment variables
load_dotenv()

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

In [None]:
# Configuration
@dataclass
class Config:
    """Configuration for Confluence semantic search agent"""
    
    # Confluence settings
    confluence_url: str = os.getenv('CONFLUENCE_URL', 'https://your-domain.atlassian.net')
    confluence_username: str = os.getenv('CONFLUENCE_USERNAME', '')
    confluence_api_token: str = os.getenv('CONFLUENCE_API_TOKEN', '')
    
    # Embedding settings
    embedding_provider: str = 'openai'  # 'openai' or 'cohere'
    openai_api_key: str = os.getenv('OPENAI_API_KEY', '')
    cohere_api_key: str = os.getenv('COHERE_API_KEY', '')
    embedding_model: str = 'text-embedding-3-small'
    
    # Vector DB settings
    vector_db_provider: str = 'chroma'  # 'chroma' or 'pinecone'
    pinecone_api_key: str = os.getenv('PINECONE_API_KEY', '')
    pinecone_environment: str = os.getenv('PINECONE_ENVIRONMENT', '')
    pinecone_index: str = 'confluence-search'
    chroma_persist_dir: str = './chroma_db'
    
    # LLM settings
    llm_model: str = 'gpt-4o-mini'
    llm_temperature: float = 0.0
    
    # Crawling settings
    max_depth: int = 3
    max_pages: int = 100
    batch_size: int = 10
    max_workers: int = 5
    
    # Text processing
    chunk_size: int = 1000
    chunk_overlap: int = 200
    
    # Search settings
    top_k: int = 5
    score_threshold: float = 0.7
    
    # Observability
    braintrust_api_key: str = os.getenv('BRAINTRUST_API_KEY', '')
    enable_monitoring: bool = True

config = Config()

# Validate configuration
if not config.confluence_username or not config.confluence_api_token:
    logger.warning("Confluence credentials not configured. Please set CONFLUENCE_USERNAME and CONFLUENCE_API_TOKEN")
if not config.openai_api_key and config.embedding_provider == 'openai':
    logger.warning("OpenAI API key not configured. Please set OPENAI_API_KEY")

## 3. Confluence Content Crawler

In [None]:
@dataclass
class ConfluencePage:
    """Represents a Confluence page with metadata"""
    id: str
    title: str
    url: str
    content: str
    space_key: str
    created_by: str
    created_date: str
    last_modified: str
    parent_id: Optional[str] = None
    labels: List[str] = field(default_factory=list)
    attachments: List[Dict] = field(default_factory=list)
    child_pages: List[str] = field(default_factory=list)

class ConfluenceCrawler:
    """Crawls Confluence pages starting from a root URL"""
    
    def __init__(self, config: Config):
        self.config = config
        self.confluence = Confluence(
            url=config.confluence_url,
            username=config.confluence_username,
            password=config.confluence_api_token,
            cloud=True
        )
        self.visited_pages = set()
        self.pages = []
        
    def extract_page_id_from_url(self, url: str) -> Optional[str]:
        """Extract page ID from Confluence URL"""
        patterns = [
            r'/pages/(\d+)',  # no trailing slash required, so URLs built in fetch_page also match
            r'pageId=(\d+)',
            r'/pages/viewpage.action\?pageId=(\d+)'
        ]
        
        for pattern in patterns:
            match = re.search(pattern, url)
            if match:
                return match.group(1)
        return None
    
    def clean_html_content(self, html_content: str) -> str:
        """Clean HTML content and extract text"""
        soup = BeautifulSoup(html_content, 'lxml')
        
        # Remove script and style elements
        for element in soup(['script', 'style', 'meta', 'link']):
            element.decompose()
        
        # Extract text
        text = soup.get_text(separator=' ', strip=True)
        
        # Clean up whitespace
        text = re.sub(r'\s+', ' ', text)
        
        return text.strip()
    
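    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
    def _get_page_with_retry(self, page_id: str) -> Dict[str, Any]:
        """Call the Confluence API, retrying transient failures"""
        # 'space' is expanded explicitly because fetch_page reads page['space']['key']
        return self.confluence.get_page_by_id(
            page_id,
            expand='body.storage,metadata.labels,children.page,history,version,space'
        )
    
    def fetch_page(self, page_id: str) -> Optional[ConfluencePage]:
        """Fetch a single Confluence page; returns None if every retry fails"""
        try:
            # Get page details (retries happen inside the helper; catching the
            # exception in this method would otherwise hide it from @retry)
            page = self._get_page_with_retry(page_id)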
            
            # Extract content
            content_html = page.get('body', {}).get('storage', {}).get('value', '')
            content_text = self.clean_html_content(content_html)
            
            # Extract metadata
            labels = [label['name'] for label in page.get('metadata', {}).get('labels', {}).get('results', [])]
            
            # Get child pages
            child_pages = []
            if 'children' in page and 'page' in page['children']:
                child_pages = [child['id'] for child in page['children']['page'].get('results', [])]
            
            # Create page object
            confluence_page = ConfluencePage(
                id=page['id'],
                title=page['title'],
                url=f"{self.config.confluence_url}/wiki/spaces/{page['space']['key']}/pages/{page['id']}",
                content=content_text,
                space_key=page['space']['key'],
                created_by=page['history']['createdBy']['displayName'],
                created_date=page['history']['createdDate'],
                last_modified=page['version']['when'],
                labels=labels,
                child_pages=child_pages
            )
            
            return confluence_page
            
        except Exception as e:
            logger.error(f"Error fetching page {page_id}: {str(e)}")
            return None
    
    def crawl_from_url(self, root_url: str, max_depth: Optional[int] = None) -> List[ConfluencePage]:
        """Crawl Confluence pages starting from a root URL"""
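        # Compare against None so that an explicit max_depth=0 (root page only) is honored
        max_depth = self.config.max_depth if max_depth is None else max_depth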
        
        # Extract root page ID
        root_page_id = self.extract_page_id_from_url(root_url)
        if not root_page_id:
            raise ValueError(f"Could not extract page ID from URL: {root_url}")
        
        logger.info(f"Starting crawl from page ID: {root_page_id}")
        
        # BFS crawling
        queue = [(root_page_id, 0)]
        self.visited_pages = set()
        self.pages = []
        
        with tqdm(total=self.config.max_pages, desc="Crawling pages") as pbar:
            while queue and len(self.pages) < self.config.max_pages:
                page_id, depth = queue.pop(0)
                
                if page_id in self.visited_pages or depth > max_depth:
                    continue
                
                self.visited_pages.add(page_id)
                
                # Fetch page
                page = self.fetch_page(page_id)
                if page:
                    self.pages.append(page)
                    pbar.update(1)
                    
                    # Add child pages to queue
                    for child_id in page.child_pages:
                        if child_id not in self.visited_pages:
                            queue.append((child_id, depth + 1))
        
        logger.info(f"Crawled {len(self.pages)} pages")
        return self.pages
    
    def crawl_space(self, space_key: str, limit: int = 100) -> List[ConfluencePage]:
        """Crawl all pages in a Confluence space"""
        logger.info(f"Crawling space: {space_key}")
        
        self.pages = []
        start = 0
        
        with tqdm(total=limit, desc="Crawling space pages") as pbar:
            while len(self.pages) < limit:
                # Get pages in space
                results = self.confluence.get_all_pages_from_space(
                    space_key,
                    start=start,
                    limit=min(25, limit - len(self.pages))
                )
                
                if not results:
                    break
                
                # Fetch each page
                for page_summary in results:
                    page = self.fetch_page(page_summary['id'])
                    if page:
                        self.pages.append(page)
                        pbar.update(1)
                
                start += 25
        
        logger.info(f"Crawled {len(self.pages)} pages from space {space_key}")
        return self.pages
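
To see how the depth and page limits in `crawl_from_url` interact without a Confluence instance, the following minimal sketch runs the same breadth-first traversal over an in-memory page tree. The `CHILDREN` mapping and the `bfs_crawl` helper are hypothetical stand-ins for the Confluence API and are not part of the notebook. The sketch also uses `collections.deque`, which pops from the left in constant time, whereas `list.pop(0)` shifts every remaining element.

```python
from collections import deque
from typing import Dict, List

# Hypothetical page tree standing in for Confluence: page ID -> child page IDs
CHILDREN: Dict[str, List[str]] = {
    "1": ["2", "3"],
    "2": ["4"],
    "3": ["4", "5"],  # page 4 is reachable twice on purpose
    "4": ["6"],
    "5": [],
    "6": [],
}

def bfs_crawl(root_id: str, max_depth: int, max_pages: int) -> List[str]:
    """Return page IDs in breadth-first order, honoring both limits."""
    queue = deque([(root_id, 0)])
    visited = set()
    crawled: List[str] = []

    while queue and len(crawled) < max_pages:
        page_id, depth = queue.popleft()
        if page_id in visited or depth > max_depth:
            continue
        visited.add(page_id)
        crawled.append(page_id)  # a real crawler would fetch the page here
        for child_id in CHILDREN.get(page_id, []):
            if child_id not in visited:
                queue.append((child_id, depth + 1))
    return crawled

print(bfs_crawl("1", max_depth=1, max_pages=100))  # root plus its direct children
print(bfs_crawl("1", max_depth=3, max_pages=4))    # stops once four pages are collected
```

The `visited` set is what keeps page 4 from being crawled twice even though two parents link to it.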

## 4. Document Processing and Embedding

In [None]:
class DocumentProcessor:
    """Process Confluence pages into embedded documents"""
    
    def __init__(self, config: Config):
        self.config = config
        
        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=config.chunk_size,
            chunk_overlap=config.chunk_overlap,
            separators=["\n\n", "\n", ". ", " ", ""]
        )
        
        # Initialize embeddings
        if config.embedding_provider == 'openai':
            self.embeddings = OpenAIEmbeddings(
                model=config.embedding_model,
                openai_api_key=config.openai_api_key
            )
        elif config.embedding_provider == 'cohere':
            self.embeddings = CohereEmbeddings(
                model="embed-english-v3.0",
                cohere_api_key=config.cohere_api_key
            )
        else:
            raise ValueError(f"Unsupported embedding provider: {config.embedding_provider}")
    
    def create_document_id(self, page: ConfluencePage, chunk_index: int) -> str:
        """Create unique document ID"""
        content = f"{page.id}_{chunk_index}"
        return hashlib.md5(content.encode()).hexdigest()
    
    def process_pages(self, pages: List[ConfluencePage]) -> List[Document]:
        """Process Confluence pages into LangChain documents"""
        documents = []
        
        for page in tqdm(pages, desc="Processing pages"):
            # Skip pages with minimal content
            if len(page.content) < 50:
                continue
            
            # Split content into chunks
            texts = self.text_splitter.split_text(page.content)
            
            # Create documents with metadata
            for i, text in enumerate(texts):
                metadata = {
                    'page_id': page.id,
                    'page_title': page.title,
                    'page_url': page.url,
                    'space_key': page.space_key,
                    'created_by': page.created_by,
                    'created_date': page.created_date,
                    'last_modified': page.last_modified,
                    'labels': ', '.join(page.labels),
                    'chunk_index': i,
                    'total_chunks': len(texts),
                    'doc_id': self.create_document_id(page, i)
                }
                
                doc = Document(
                    page_content=text,
                    metadata=metadata
                )
                documents.append(doc)
        
        logger.info(f"Created {len(documents)} document chunks from {len(pages)} pages")
        return documents
    
    def batch_embed_documents(self, documents: List[Document], batch_size: int = 100) -> List[Tuple[Document, List[float]]]:
        """Embed documents in batches"""
        embedded_docs = []
        
        for i in tqdm(range(0, len(documents), batch_size), desc="Embedding documents"):
            batch = documents[i:i + batch_size]
            texts = [doc.page_content for doc in batch]
            
            try:
                embeddings = self.embeddings.embed_documents(texts)
                for doc, embedding in zip(batch, embeddings):
                    embedded_docs.append((doc, embedding))
            except Exception as e:
                logger.error(f"Error embedding batch {i}: {str(e)}")
                continue
        
        return embedded_docs
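
To make `chunk_size`, `chunk_overlap`, and the hashed chunk IDs concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a deliberate simplification and not what `RecursiveCharacterTextSplitter` does (that splitter prefers to break on the configured separators before falling back to raw characters). The helper names `chunk_text` and `chunk_id` are hypothetical.

```python
import hashlib
from typing import List

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

def chunk_id(page_id: str, chunk_index: int) -> str:
    """Deterministic ID, same scheme as create_document_id: MD5 of '<page_id>_<index>'."""
    return hashlib.md5(f"{page_id}_{chunk_index}".encode()).hexdigest()

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
print(chunks)
print(chunk_id("12345", 0) == chunk_id("12345", 0))  # same inputs give the same ID
```

Because the ID depends only on the page ID and the chunk index, re-processing the same page yields the same IDs.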

## 5. Vector Database Setup

In [None]:
class VectorStore:
    """Manages vector storage and retrieval"""
    
    def __init__(self, config: Config, embeddings):
        self.config = config
        self.embeddings = embeddings
        self.vectorstore = None
        
        # Initialize vector store
        if config.vector_db_provider == 'chroma':
            self._init_chroma()
        elif config.vector_db_provider == 'pinecone':
            self._init_pinecone()
        else:
            raise ValueError(f"Unsupported vector DB provider: {config.vector_db_provider}")
    
    def _init_chroma(self):
        """Initialize ChromaDB"""
        import chromadb
        from chromadb.config import Settings
        
        # Create or load collection
        self.vectorstore = Chroma(
            collection_name="confluence_pages",
            embedding_function=self.embeddings,
            persist_directory=self.config.chroma_persist_dir,
            collection_metadata={"hnsw:space": "cosine"}
        )
        logger.info("Initialized ChromaDB vector store")
    
    def _init_pinecone(self):
        """Initialize Pinecone"""
        import pinecone
        
        # Initialize Pinecone
        pinecone.init(
            api_key=self.config.pinecone_api_key,
            environment=self.config.pinecone_environment
        )
        
        # Create index if it doesn't exist
        if self.config.pinecone_index not in pinecone.list_indexes():
            pinecone.create_index(
                name=self.config.pinecone_index,
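                # text-embedding-3-small produces 1536-dim vectors; embed-english-v3.0 produces 1024
                dimension=1536 if self.config.embedding_provider == 'openai' else 1024,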
                metric='cosine'
            )
        
        # Initialize vector store
        from langchain.vectorstores import Pinecone as PineconeVectorStore
        self.vectorstore = PineconeVectorStore(
            index_name=self.config.pinecone_index,
            embedding=self.embeddings
        )
        logger.info("Initialized Pinecone vector store")
    
    def add_documents(self, documents: List[Document], batch_size: int = 100):
        """Add documents to vector store"""
        logger.info(f"Adding {len(documents)} documents to vector store")
        
        for i in tqdm(range(0, len(documents), batch_size), desc="Adding to vector store"):
            batch = documents[i:i + batch_size]
            try:
                self.vectorstore.add_documents(batch)
            except Exception as e:
                logger.error(f"Error adding batch {i}: {str(e)}")
                continue
        
        # Persist if using Chroma
        if self.config.vector_db_provider == 'chroma':
            self.vectorstore.persist()
        
        logger.info("Documents added to vector store")
    
    def similarity_search(self, query: str, k: int = None, score_threshold: float = None) -> List[Document]:
        """Perform similarity search"""
        k = k or self.config.top_k
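        if score_threshold is None:
            score_threshold = self.config.score_threshold
        
        # Use relevance scores (normalized to [0, 1], higher is better).
        # similarity_search_with_score returns raw distances for Chroma
        # (lower is better), so a ">= threshold" filter on it would be inverted.
        results_with_scores = self.vectorstore.similarity_search_with_relevance_scores(query, k=k)
        
        # Filter by score threshold
        filtered_results = [
            doc for doc, score in results_with_scores
            if score >= score_threshold
        ]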
        
        return filtered_results
    
    def mmr_search(self, query: str, k: int = None, fetch_k: int = 20, lambda_mult: float = 0.5) -> List[Document]:
        """Perform MMR (Maximal Marginal Relevance) search for diversity"""
        k = k or self.config.top_k
        
        return self.vectorstore.max_marginal_relevance_search(
            query,
            k=k,
            fetch_k=fetch_k,
            lambda_mult=lambda_mult
        )
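
The trade-off that `lambda_mult` controls in `mmr_search` can be seen in a small self-contained sketch of Maximal Marginal Relevance. This is an illustrative pure-Python version of the usual MMR formulation, not LangChain's implementation; the `cosine` and `mmr` helpers and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query: Sequence[float], docs: List[Sequence[float]], k: int, lambda_mult: float) -> List[int]:
    """Greedily pick k document indices, each maximizing
    lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, already selected)."""
    selected: List[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [
    [1.0, 0.1],    # 0: very relevant
    [1.0, 0.12],   # 1: near-duplicate of document 0
    [0.8, -0.6],   # 2: less relevant, but points in a different direction
]
print(mmr(query, docs, k=2, lambda_mult=1.0))  # relevance only: keeps the near-duplicate
print(mmr(query, docs, k=2, lambda_mult=0.5))  # penalizes redundancy: picks document 2 instead
```

With `lambda_mult=1.0` the redundancy term vanishes and the result matches a plain similarity ranking; lower values increasingly favor documents that differ from those already chosen.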

## 6. Agentic Tools Definition

In [None]:
class ConfluenceSearchTools:
    """Tools for Confluence semantic search"""
    
    def __init__(self, vector_store: VectorStore, config: Config):
        self.vector_store = vector_store
        self.config = config
        
    def semantic_search(self, query: str, top_k: int = 5) -> str:
        """Search for relevant Confluence pages using semantic similarity"""
        try:
            results = self.vector_store.similarity_search(query, k=top_k)
            
            if not results:
                return "No relevant pages found for your query."
            
            # Format results
            formatted_results = []
            for i, doc in enumerate(results, 1):
                formatted_results.append(
                    f"**Result {i}:**\n"
                    f"Title: {doc.metadata.get('page_title', 'Unknown')}\n"
                    f"URL: {doc.metadata.get('page_url', 'N/A')}\n"
                    f"Space: {doc.metadata.get('space_key', 'Unknown')}\n"
                    f"Last Modified: {doc.metadata.get('last_modified', 'Unknown')}\n"
                    f"Content: {doc.page_content[:500]}...\n"
                )
            
            return "\n\n".join(formatted_results)
            
        except Exception as e:
            logger.error(f"Error in semantic search: {str(e)}")
            return f"Error performing search: {str(e)}"
    
    def diverse_search(self, query: str, top_k: int = 5) -> str:
        """Search with diversity using MMR algorithm"""
        try:
            results = self.vector_store.mmr_search(query, k=top_k)
            
            if not results:
                return "No relevant pages found for your query."
            
            # Format results
            formatted_results = []
            for i, doc in enumerate(results, 1):
                formatted_results.append(
                    f"**Result {i}:**\n"
                    f"Title: {doc.metadata.get('page_title', 'Unknown')}\n"
                    f"Space: {doc.metadata.get('space_key', 'Unknown')}\n"
                    f"Content Preview: {doc.page_content[:300]}...\n"
                )
            
            return "\n\n".join(formatted_results)
            
        except Exception as e:
            logger.error(f"Error in diverse search: {str(e)}")
            return f"Error performing search: {str(e)}"
    
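    def search_by_metadata(self, filter_query: str) -> str:
        """Search pages by metadata filters given as 'space_key=SPACE author=Name'"""
        try:
            # Tool passes its input as a single string, so parse key=value pairs
            # (values may contain spaces, e.g. 'author=Jane Doe')
            params = dict(re.findall(r'(\w+)=(.*?)(?=\s+\w+=|$)', filter_query.strip()))
            
            # Build filter (labels are stored as one comma-joined string, so
            # exact-match filtering on them is not attempted here)
            filter_dict = {}
            if params.get('space_key'):
                filter_dict['space_key'] = params['space_key']
            if params.get('author'):
                filter_dict['created_by'] = params['author']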
            
            # Note: This is a simplified version. Real implementation would need
            # proper metadata filtering support in your vector store
            results = self.vector_store.vectorstore.similarity_search(
                "",  # Empty query for metadata-only search
                k=10,
                filter=filter_dict if filter_dict else None
            )
            
            if not results:
                return "No pages found matching the metadata criteria."
            
            # Format results
            formatted_results = []
            for doc in results[:5]:
                formatted_results.append(
                    f"- {doc.metadata.get('page_title', 'Unknown')} "
                    f"(Space: {doc.metadata.get('space_key', 'Unknown')}, "
                    f"Author: {doc.metadata.get('created_by', 'Unknown')})"
                )
            
            return "\n".join(formatted_results)
            
        except Exception as e:
            logger.error(f"Error in metadata search: {str(e)}")
            return f"Error performing metadata search: {str(e)}"
    
    def get_page_summary(self, page_title: str) -> str:
        """Get a summary of a specific Confluence page"""
        try:
            # Search for the specific page
            results = self.vector_store.similarity_search(f"title: {page_title}", k=3)
            
            # Find the most relevant page
            target_page = None
            for doc in results:
                if page_title.lower() in doc.metadata.get('page_title', '').lower():
                    target_page = doc
                    break
            
            if not target_page:
                return f"No page found with title: {page_title}"
            
            # Create summary
            summary = (
                f"**Page: {target_page.metadata.get('page_title', 'Unknown')}**\n\n"
                f"URL: {target_page.metadata.get('page_url', 'N/A')}\n"
                f"Space: {target_page.metadata.get('space_key', 'Unknown')}\n"
                f"Author: {target_page.metadata.get('created_by', 'Unknown')}\n"
                f"Last Modified: {target_page.metadata.get('last_modified', 'Unknown')}\n"
                f"Labels: {target_page.metadata.get('labels', 'None')}\n\n"
                f"Content Preview:\n{target_page.page_content[:1000]}..."
            )
            
            return summary
            
        except Exception as e:
            logger.error(f"Error getting page summary: {str(e)}")
            return f"Error getting page summary: {str(e)}"
    
    def get_tools(self) -> List[Tool]:
        """Get LangChain tools"""
        return [
            Tool(
                name="semantic_search",
                func=self.semantic_search,
                description="Search Confluence pages using semantic similarity. Input should be a search query."
            ),
            Tool(
                name="diverse_search",
                func=self.diverse_search,
                description="Search Confluence with diversity to get varied results. Input should be a search query."
            ),
            Tool(
                name="search_by_metadata",
                func=self.search_by_metadata,
                description="Search pages by metadata like space_key, author, or labels. Input format: 'space_key=SPACE author=Name'."
            ),
            Tool(
                name="get_page_summary",
                func=self.get_page_summary,
                description="Get a detailed summary of a specific Confluence page. Input should be the page title."
            )
        ]

## 7. LangGraph Agent Implementation

In [None]:
from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage, ToolMessage
from langgraph.checkpoint.memory import MemorySaver

class AgentState(TypedDict):
    """Agent state for LangGraph"""
    messages: Annotated[Sequence[BaseMessage], add_messages]
    
class ConfluenceSearchAgent:
    """LangGraph-based agent for Confluence search"""
    
    def __init__(self, tools: List[Tool], config: Config):
        self.tools = tools
        self.config = config
        
        # Initialize LLM
        self.llm = ChatOpenAI(
            model=config.llm_model,
            temperature=config.llm_temperature,
            openai_api_key=config.openai_api_key
        )
        
        # Bind tools to LLM
        self.llm_with_tools = self.llm.bind_tools(tools)
        
        # Create graph
        self.graph = self._build_graph()
        
        # Memory for conversation history
        self.memory = MemorySaver()
        
        # Compile graph
        self.app = self.graph.compile(checkpointer=self.memory)
        
    def _build_graph(self) -> StateGraph:
        """Build the agent graph"""
        
        # Define the graph
        workflow = StateGraph(AgentState)
        
        # Define nodes
        def agent(state: AgentState) -> dict:
            """Agent node that decides on actions"""
            messages = state["messages"]
            
            # Prepend the system prompt on every call. It is not stored in the
            # graph state, so checking len(messages) == 1 would drop it after
            # the first tool call and on later turns of the same thread.
            if not messages or not isinstance(messages[0], SystemMessage):
                system_prompt = (
                    "You are a helpful assistant specialized in searching and retrieving information "
                    "from Confluence pages. You have access to semantic search tools that can find "
                    "relevant documentation based on queries. Always provide clear, concise answers "
                    "and include links to relevant pages when available. If you're not sure about "
                    "something, use the search tools to find the information."
                )
                messages = [SystemMessage(content=system_prompt)] + list(messages)
            
            # Get response from LLM
            response = self.llm_with_tools.invoke(messages)
            
            return {"messages": [response]}
        
        # Add nodes
        workflow.add_node("agent", agent)
        workflow.add_node("tools", ToolNode(self.tools))
        
        # Set entry point
        workflow.set_entry_point("agent")
        
        # Add conditional edges
        workflow.add_conditional_edges(
            "agent",
            tools_condition,
            {
                "tools": "tools",
                "__end__": END
            }
        )
        
        # Add edge from tools back to agent
        workflow.add_edge("tools", "agent")
        
        return workflow
    
    def search(self, query: str, thread_id: str = "default") -> str:
        """Execute a search query"""
        
        # Create initial message
        initial_state = {
            "messages": [HumanMessage(content=query)]
        }
        
        # Run the agent
        config = {"configurable": {"thread_id": thread_id}}
        
        try:
            result = self.app.invoke(initial_state, config)
            
            # Extract the final response
            messages = result["messages"]
            
            # Find the last AI message
            for message in reversed(messages):
                if isinstance(message, AIMessage) and not message.tool_calls:
                    return message.content
            
            return "No response generated."
            
        except Exception as e:
            logger.error(f"Error in agent search: {str(e)}")
            return f"Error processing query: {str(e)}"
    
    def stream_search(self, query: str, thread_id: str = "default"):
        """Stream search results"""
        
        initial_state = {
            "messages": [HumanMessage(content=query)]
        }
        
        config = {"configurable": {"thread_id": thread_id}}
        
        try:
            for event in self.app.stream(initial_state, config):
                for value in event.values():
                    if "messages" in value:
                        for message in value["messages"]:
                            if isinstance(message, AIMessage):
                                if message.content:
                                    yield message.content
                                if message.tool_calls:
                                    yield f"\n🔧 Using tool: {message.tool_calls[0]['name']}\n"
                            elif isinstance(message, ToolMessage):
                                yield "\n📊 Tool result received\n"
        except Exception as e:
            logger.error(f"Error in stream search: {str(e)}")
            yield f"Error: {str(e)}"
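
The control flow of the graph above (the agent node, the conditional edge, and the edge from tools back to the agent) can be sketched without LangGraph or an LLM. Everything in this sketch is a hypothetical stand-in: `Msg` is a toy message type, `fake_llm` is scripted rather than a model, and `route` only mimics the decision that `tools_condition` makes, namely to run tools when the latest AI message requests a tool call and to finish otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Msg:
    role: str                      # "human", "ai", or "tool"
    content: str = ""
    tool_calls: List[dict] = field(default_factory=list)

def fake_llm(messages: List[Msg]) -> Msg:
    """Scripted stand-in for the LLM: request one search, then answer from the tool result."""
    if messages[-1].role == "tool":
        return Msg("ai", content=f"Answer based on: {messages[-1].content}")
    return Msg("ai", tool_calls=[{"name": "semantic_search", "args": messages[-1].content}])

def route(messages: List[Msg]) -> str:
    """Mimics the conditional edge: go to tools if the last AI message asked for one."""
    return "tools" if messages[-1].tool_calls else "__end__"

def run_agent(query: str, tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> List[Msg]:
    messages = [Msg("human", content=query)]
    for _ in range(max_steps):                 # guard against endless tool loops
        messages.append(fake_llm(messages))    # "agent" node
        if route(messages) == "__end__":
            break
        for call in messages[-1].tool_calls:   # "tools" node
            result = tools[call["name"]](call["args"])
            messages.append(Msg("tool", content=result))
    return messages

tools = {"semantic_search": lambda q: f"2 pages matching '{q}'"}
history = run_agent("deployment guide", tools)
print([m.role for m in history])
print(history[-1].content)
```

The final answer is the last AI message without tool calls, which is the same message that `search` looks for when it walks the history in reverse.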

## 8. Observability with Braintrust

In [None]:
class ObservabilityManager:
    """Manage observability and monitoring with Braintrust"""
    
    def __init__(self, config: Config):
        self.config = config
        # bool() keeps self.enabled a real flag instead of the API key string
        self.enabled = bool(config.enable_monitoring and config.braintrust_api_key)
        
        if self.enabled:
            try:
                braintrust.init(api_key=config.braintrust_api_key)
                self.project = braintrust.init_project("confluence-search")
                logger.info("Braintrust observability initialized")
            except Exception as e:
                logger.error(f"Failed to initialize Braintrust: {str(e)}")
                self.enabled = False
    
    def log_search(self, query: str, results: str, latency: float, metadata: Dict = None):
        """Log a search interaction"""
        if not self.enabled:
            return
        
        try:
            experiment = braintrust.init_experiment(
                project="confluence-search",
                name=f"search-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
            )
            
            experiment.log(
                input=query,
                output=results,
                metadata={
                    "latency_ms": latency * 1000,
                    "timestamp": datetime.now().isoformat(),
                    **(metadata or {})
                }
            )
            
            experiment.close()
            
        except Exception as e:
            logger.error(f"Failed to log to Braintrust: {str(e)}")
    
    def log_crawl_metrics(self, pages_crawled: int, duration: float, errors: int = 0):
        """Log crawling metrics"""
        if not self.enabled:
            return
        
        try:
            experiment = braintrust.init_experiment(
                project="confluence-search",
                name=f"crawl-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
            )
            
            experiment.log(
                input="crawl_operation",
                output=f"Crawled {pages_crawled} pages",
                metadata={
                    "pages_crawled": pages_crawled,
                    "duration_seconds": duration,
                    "errors": errors,
                    "pages_per_second": pages_crawled / duration if duration > 0 else 0
                },
                scores={
                    "success_rate": (pages_crawled - errors) / pages_crawled if pages_crawled > 0 else 0
                }
            )
            
            experiment.close()
            
        except Exception as e:
            logger.error(f"Failed to log crawl metrics: {str(e)}")

## 9. Main Pipeline Orchestration

In [None]:
class ConfluenceSearchPipeline:
    """Main pipeline for Confluence semantic search"""
    
    def __init__(self, config: Optional[Config] = None):
        self.config = config or Config()
        self.crawler = None
        self.processor = None
        self.vector_store = None
        self.agent = None
        self.observability = ObservabilityManager(self.config)
        
    def initialize(self):
        """Initialize all components"""
        logger.info("Initializing Confluence Search Pipeline")
        
        # Initialize components
        self.crawler = ConfluenceCrawler(self.config)
        self.processor = DocumentProcessor(self.config)
        self.vector_store = VectorStore(self.config, self.processor.embeddings)
        
        # Initialize tools and agent
        tools_manager = ConfluenceSearchTools(self.vector_store, self.config)
        tools = tools_manager.get_tools()
        self.agent = ConfluenceSearchAgent(tools, self.config)
        
        logger.info("Pipeline initialized successfully")
    
    def index_confluence_url(self, root_url: str, max_depth: Optional[int] = None) -> Dict[str, Any]:
        """Index Confluence pages starting from a URL"""
        import time
        start_time = time.time()
        
        try:
            # Crawl pages
            logger.info(f"Starting crawl from: {root_url}")
            pages = self.crawler.crawl_from_url(root_url, max_depth)
            
            if not pages:
                return {
                    "status": "error",
                    "message": "No pages found to index"
                }
            
            # Process pages
            logger.info(f"Processing {len(pages)} pages")
            documents = self.processor.process_pages(pages)
            
            # Add to vector store
            logger.info(f"Adding {len(documents)} documents to vector store")
            self.vector_store.add_documents(documents)
            
            # Calculate metrics
            duration = time.time() - start_time
            
            # Log metrics
            self.observability.log_crawl_metrics(
                pages_crawled=len(pages),
                duration=duration
            )
            
            return {
                "status": "success",
                "pages_indexed": len(pages),
                "documents_created": len(documents),
                "duration_seconds": duration,
                "pages": [{
                    "title": p.title,
                    "url": p.url,
                    "space": p.space_key
                } for p in pages[:10]]  # First 10 pages
            }
            
        except Exception as e:
            logger.error(f"Error indexing Confluence: {str(e)}")
            return {
                "status": "error",
                "message": str(e)
            }
    
    def search(self, query: str, use_agent: bool = True) -> str:
        """Search for information"""
        import time
        start_time = time.time()
        
        try:
            if use_agent:
                # Use agent for intelligent search
                result = self.agent.search(query)
            else:
                # Direct vector search
                docs = self.vector_store.similarity_search(query)
                if docs:
                    result = "\n\n".join([
                        f"**{doc.metadata.get('page_title', 'Unknown')}**\n{doc.page_content[:500]}..."
                        for doc in docs
                    ])
                else:
                    result = "No relevant documents found."
            
            # Log search
            latency = time.time() - start_time
            self.observability.log_search(
                query=query,
                results=result,
                latency=latency,
                metadata={"use_agent": use_agent}
            )
            
            return result
            
        except Exception as e:
            logger.error(f"Error during search: {str(e)}")
            return f"Error: {str(e)}"
    
    def interactive_search(self):
        """Interactive search interface"""
        print("\n🔍 Confluence Semantic Search Agent")
        print("="*50)
        print("Type 'exit' to quit, 'help' for commands\n")
        
        while True:
            try:
                query = input("\n💬 Your query: ").strip()
            except (EOFError, KeyboardInterrupt):
                # Treat Ctrl-D / Ctrl-C as a request to exit
                break
            
            if query.lower() == 'exit':
                break
            elif query.lower() == 'help':
                print("\nAvailable commands:")
                print("  - Any search query: Search Confluence pages")
                print("  - 'exit': Quit the application")
                print("  - 'help': Show this help message")
                continue
            elif not query:
                continue
            
            print("\n🤔 Thinking...")
            
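            # Stream results; report a failed query without ending the session
            try:
                for chunk in self.agent.stream_search(query):
                    print(chunk, end='', flush=True)
                print()
            except Exception as e:
                logger.error(f"Error during streaming search: {str(e)}")
                print(f"\nError: {str(e)}")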

## 10. Usage Examples

In [None]:
# Example 1: Basic Setup and Indexing
def example_basic_indexing():
    """Example of basic indexing workflow"""
    
    # Create configuration
    config = Config(
        confluence_url="https://your-domain.atlassian.net",
        confluence_username="your-email@example.com",
        confluence_api_token="your-api-token",
        openai_api_key="your-openai-key"
    )
    
    # Initialize pipeline
    pipeline = ConfluenceSearchPipeline(config)
    pipeline.initialize()
    
    # Index Confluence pages
    root_url = "https://your-domain.atlassian.net/wiki/spaces/DOCS/pages/123456/Documentation"
    result = pipeline.index_confluence_url(root_url, max_depth=2)
    
    print(f"Indexing result: {json.dumps(result, indent=2)}")
    
    # Perform search
    query = "How to set up authentication?"
    answer = pipeline.search(query)
    print(f"\nSearch result:\n{answer}")
    
    return pipeline

# Uncomment to run:
# pipeline = example_basic_indexing()

In [None]:
# Example 2: Advanced Search with Multiple Tools
def example_advanced_search():
    """Example of using different search strategies"""
    
    # Create and initialize a pipeline; the queries below assume that pages
    # have already been indexed (for example via example_basic_indexing)
    pipeline = ConfluenceSearchPipeline()
    pipeline.initialize()
    
    # Different search examples
    queries = [
        "Find all pages about API documentation",
        "What are the deployment procedures?",
        "Show me pages created by John Doe",
        "Get a summary of the Security Guidelines page"
    ]
    
    for query in queries:
        print(f"\n{'='*60}")
        print(f"Query: {query}")
        print(f"{'='*60}")
        
        result = pipeline.search(query, use_agent=True)
        print(result)

# Uncomment to run:
# example_advanced_search()

In [None]:
# Example 3: Interactive Mode
def run_interactive_mode():
    """Run the interactive search interface"""
    
    print("\n" + "="*60)
    print("Setting up Confluence Semantic Search Agent...")
    print("="*60)
    
    # Initialize with environment variables
    pipeline = ConfluenceSearchPipeline()
    pipeline.initialize()
    
    # Check if we need to index first
    choice = input("\nDo you want to index Confluence pages first? (y/n): ")
    
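    if choice.strip().lower() == 'y':
        root_url = input("Enter Confluence page URL to start indexing: ").strip()
        depth_input = input("Enter max crawl depth (default 3): ").strip()
        # Fall back to the default when the input is empty or not a number
        max_depth = int(depth_input) if depth_input.isdigit() else 3
        
        print("\n📚 Indexing Confluence pages...")
        result = pipeline.index_confluence_url(root_url, max_depth)
        if result.get('status') == 'success':
            print(f"\n✅ Indexed {result.get('pages_indexed', 0)} pages")
        else:
            print(f"\n❌ Indexing failed: {result.get('message', 'unknown error')}")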
    
    # Start interactive search
    pipeline.interactive_search()
    
    print("\nThank you for using Confluence Search Agent!")

# Uncomment to run:
# run_interactive_mode()

## 11. Quick Start Guide

In [None]:
# Quick start - Run this cell to get started quickly
def quick_start():
    """
    Quick start guide for the Confluence Semantic Search Agent.
    
    Before running, make sure to:
    1. Set up your .env file with:
       - CONFLUENCE_URL
       - CONFLUENCE_USERNAME
       - CONFLUENCE_API_TOKEN
       - OPENAI_API_KEY
       - (Optional) BRAINTRUST_API_KEY for monitoring
    
    2. Install required packages (run the first cell)
    """
    
    print("🚀 Confluence Semantic Search Agent - Quick Start")
    print("="*60)
    
    # Check environment variables
    required_vars = ['CONFLUENCE_URL', 'CONFLUENCE_USERNAME', 'CONFLUENCE_API_TOKEN', 'OPENAI_API_KEY']
    missing_vars = [var for var in required_vars if not os.getenv(var)]
    
    if missing_vars:
        print("⚠️  Missing environment variables:")
        for var in missing_vars:
            print(f"   - {var}")
        print("\nPlease set these in your .env file or environment.")
        return
    
    # Initialize pipeline
    print("\n📦 Initializing pipeline...")
    pipeline = ConfluenceSearchPipeline()
    pipeline.initialize()
    print("✅ Pipeline initialized")
    
    # Get user input for indexing
    print("\n📚 Let's index some Confluence pages!")
    root_url = input("Enter a Confluence page URL to start indexing: ")
    
    if root_url:
        print("\n🔄 Indexing pages (this may take a few minutes)...")
        result = pipeline.index_confluence_url(root_url, max_depth=2)
        
        if result['status'] == 'success':
            print(f"\n✅ Successfully indexed {result['pages_indexed']} pages!")
            print(f"   Created {result['documents_created']} searchable chunks")
            print(f"   Time taken: {result['duration_seconds']:.2f} seconds")
        else:
            print(f"\n❌ Indexing failed: {result['message']}")
            return
    
    # Demo search
    print("\n🔍 Let's try a search!")
    demo_query = input("Enter your search query: ")
    
    if demo_query:
        print("\n🤔 Searching...\n")
        result = pipeline.search(demo_query)
        print(result)
    
    # Offer interactive mode
    print("\n" + "="*60)
    choice = input("\nWould you like to continue with interactive search? (y/n): ")
    
    if choice.lower() == 'y':
        pipeline.interactive_search()
    
    print("\n👋 Thank you for using Confluence Semantic Search Agent!")
    return pipeline

# Uncomment to run the quick start:
# pipeline = quick_start()

## 12. Troubleshooting and Best Practices

### Common Issues and Solutions

1. **Authentication Errors**
   - Ensure your Confluence API token is valid
   - Check that you're using the correct username (email for cloud instances)
   - Verify the Confluence URL format

2. **Slow Indexing**
   - Reduce `max_depth` to limit crawling depth
   - Decrease `max_pages` to index fewer pages
   - Tune the `batch_size` parameter used for embedding operations

3. **Memory Issues**
   - Use smaller chunk sizes
   - Paginate through large spaces instead of loading every page at once
   - Consider Pinecone (serverless) so vectors are not held in local memory

4. **Search Quality**
   - Adjust `chunk_size` and `chunk_overlap` for better context
   - Use MMR search for more diverse results
   - Fine-tune the system prompt for the agent
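
The MMR suggestion above trades relevance against redundancy when picking results. A minimal sketch of that selection rule follows, assuming embeddings are plain NumPy vectors compared by cosine similarity. The names `mmr_select` and `lambda_mult` are illustrative and not part of this notebook's `VectorStore`.

```python
import numpy as np


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance.

    Illustrative helper (an assumption, not this notebook's API).
    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)

    def cosine(vec, matrix):
        # Cosine similarity between one vector and each row of a matrix
        norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec)
        return (matrix @ vec) / (norms + 1e-12)

    relevance = cosine(query_vec, doc_vecs)
    selected = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            # Redundancy is the highest similarity to anything already picked
            if selected:
                redundancy = max(cosine(doc_vecs[i], doc_vecs[selected]))
            else:
                redundancy = 0.0
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)

    return selected
```

With `lambda_mult=1.0` this reduces to plain top-k by relevance. Lower values penalize chunks that closely duplicate one already selected, which tends to help when many chunks come from the same page.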

### Best Practices

1. **Incremental Indexing**: Index frequently accessed spaces first
2. **Regular Updates**: Schedule periodic re-indexing for updated content
3. **Monitor Performance**: Use Braintrust to track search quality
4. **Access Control**: Implement proper authentication for sensitive content
5. **Caching**: Cache frequently accessed pages to reduce API calls
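
Periodic re-indexing does not have to re-embed every page. One possible approach is to keep a content hash per page and re-process only the pages whose hash has changed. The sketch below assumes page text is available as a string keyed by page ID. The helper names and the in-memory `seen` dictionary are hypothetical, and a real deployment would persist the fingerprints between runs.

```python
import hashlib
from typing import Dict, List


def content_fingerprint(text: str) -> str:
    """Return a stable SHA-256 fingerprint of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_needing_reindex(pages: Dict[str, str], seen: Dict[str, str]) -> List[str]:
    """Return IDs of pages that are new or whose content changed.

    pages: page_id -> current page text
    seen:  page_id -> fingerprint recorded on the previous run (updated in place)
    """
    changed = []
    for page_id, text in pages.items():
        fingerprint = content_fingerprint(text)
        if seen.get(page_id) != fingerprint:
            changed.append(page_id)
            seen[page_id] = fingerprint
    return changed
```

Only the returned page IDs would then be passed on for chunking and embedding. Unchanged pages keep their existing vectors.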

### Production Deployment

For production use:
- Use environment-specific configuration
- Implement proper error handling and retries
- Set up monitoring and alerting
- Use a persistent vector database
- Implement rate limiting for API calls
- Add user authentication and authorization
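
Two of the items above, rate limiting and retries, can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `RateLimiter` and `call_with_retries` are hypothetical names, and the injectable `clock` and `sleep` arguments exist only to make the behavior easy to test. The notebook already imports `tenacity`, which offers a fuller retry mechanism.

```python
import time


class RateLimiter:
    """Space calls so that at most `rate` calls happen per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next call is allowed; return the delay applied."""
        now = self._clock()
        delay = max(self._next_allowed - now, 0.0)
        if delay > 0:
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.min_interval
        return delay


def call_with_retries(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A caller would invoke `limiter.wait()` before each Confluence or embedding API request and wrap the request itself in `call_with_retries`. In practice the retry should be narrowed to transient errors such as timeouts or HTTP 429 responses rather than every exception.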