# Agentic Workflow for Confluence Search using Google AI Development Kit

This notebook implements a production-ready agentic workflow for semantic search through Confluence pages using Google's AI ecosystem:
- **Google Gemini API** for embeddings and LLM capabilities
- **Vertex AI Vector Search** for scalable similarity search
- **Document AI** for advanced document processing
- **AlloyDB** for vector storage with pgvector
- **LangGraph + Gemini** for agentic orchestration
- **Google Cloud Logging** for observability

## Architecture Overview

```
Confluence Root URL → Crawler → Document AI Processor → Gemini Embeddings
                                        ↓                       ↓
                              Structured Extraction      Vector Search (Vertex AI)
                                        ↓                       ↓
User Query → Gemini Agent → Tool Selection → Semantic Search → Response
```

## 1. Installation and Dependencies

In [None]:
# Install required packages
!pip install -q google-generativeai \
              google-cloud-aiplatform \
              google-cloud-documentai \
              google-cloud-logging \
              google-cloud-alloydb-connector \
              atlassian-python-api \
              langchain-google-genai \
              langchain-google-vertexai \
              langchain \
              langchain-community \
              langgraph \
              chromadb \
              pgvector \
              asyncpg \
              beautifulsoup4 \
              lxml \
              python-dotenv \
              numpy \
              pandas \
              tqdm \
              tenacity \
              pytz \
              aiohttp

## 2. Configuration and Setup

In [None]:
import os
import json
import hashlib
import asyncio
from typing import List, Dict, Any, Optional, Tuple, TypedDict, Annotated, Sequence
from dataclasses import dataclass, field, asdict
from datetime import datetime
import logging
from urllib.parse import urlparse, urljoin
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
import warnings
warnings.filterwarnings('ignore')

# External imports
import pandas as pd
import numpy as np
from tqdm import tqdm
from tenacity import retry, stop_after_attempt, wait_exponential
from bs4 import BeautifulSoup
from atlassian import Confluence
from dotenv import load_dotenv
import aiohttp

# Google imports
import google.generativeai as genai
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud import logging as cloud_logging
from google.oauth2 import service_account

# LangChain + Google imports
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.tools import Tool, StructuredTool
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage, ToolMessage

# LangGraph imports
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import MemorySaver

# Load environment variables
load_dotenv()

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

In [None]:
# Configuration
@dataclass
class GoogleAIConfig:
    """Configuration for Google AI-based Confluence search"""
    
    # Google Cloud settings
    project_id: str = os.getenv('GOOGLE_CLOUD_PROJECT', '')
    location: str = os.getenv('GOOGLE_CLOUD_LOCATION', 'us-central1')
    credentials_path: str = os.getenv('GOOGLE_APPLICATION_CREDENTIALS', '')
    
    # Gemini API settings
    gemini_api_key: str = os.getenv('GEMINI_API_KEY', '')
    gemini_model: str = 'gemini-1.5-pro'
    embedding_model: str = 'text-embedding-004'  # Latest Gemini embedding model
    
    # Confluence settings
    confluence_url: str = os.getenv('CONFLUENCE_URL', 'https://your-domain.atlassian.net')
    confluence_username: str = os.getenv('CONFLUENCE_USERNAME', '')
    confluence_api_token: str = os.getenv('CONFLUENCE_API_TOKEN', '')
    
    # Vector Search settings
    vector_search_index: str = 'confluence-search-index'
    use_vertex_ai_search: bool = True  # Use Vertex AI Vector Search
    use_alloydb: bool = False  # Alternative: Use AlloyDB with pgvector
    
    # AlloyDB settings (if used)
    alloydb_instance: str = os.getenv('ALLOYDB_INSTANCE', '')
    alloydb_database: str = os.getenv('ALLOYDB_DATABASE', 'confluence_search')
    alloydb_user: str = os.getenv('ALLOYDB_USER', 'postgres')
    alloydb_password: str = os.getenv('ALLOYDB_PASSWORD', '')
    
    # Document AI settings
    use_document_ai: bool = True
    document_processor_id: str = os.getenv('DOCUMENT_PROCESSOR_ID', '')
    
    # Crawling settings
    max_depth: int = 3
    max_pages: int = 100
    batch_size: int = 10
    max_workers: int = 5
    
    # Text processing
    chunk_size: int = 1000
    chunk_overlap: int = 200
    
    # Search settings
    top_k: int = 5
    score_threshold: float = 0.7
    
    # Agent settings
    agent_temperature: float = 0.1
    max_iterations: int = 10
    
    # File Search (Gemini's RAG feature)
    use_gemini_file_search: bool = True
    file_search_corpus: str = 'confluence-corpus'

config = GoogleAIConfig()

# Initialize Google Cloud
if config.project_id:
    aiplatform.init(project=config.project_id, location=config.location)
    
# Initialize Gemini
if config.gemini_api_key:
    genai.configure(api_key=config.gemini_api_key)
else:
    logger.warning("Gemini API key not configured. Please set GEMINI_API_KEY")
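
One design point in the configuration above: a plain dataclass default such as `os.getenv(...)` is evaluated once, when the class body runs, so environment variables set afterwards are not picked up. The sketch below shows one possible alternative using `field(default_factory=...)`. `EnvConfig` and its `missing()` helper are hypothetical stand-ins for illustration, not part of the notebook's `GoogleAIConfig`.

```python
import os
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnvConfig:
    """Hypothetical two-field stand-in for GoogleAIConfig."""
    # default_factory runs on every instantiation, so the environment is
    # read when the object is created, not when the class is defined.
    project_id: str = field(default_factory=lambda: os.getenv('GOOGLE_CLOUD_PROJECT', ''))
    location: str = field(default_factory=lambda: os.getenv('GOOGLE_CLOUD_LOCATION', 'us-central1'))

    def missing(self) -> List[str]:
        """Return the names of required settings that are still empty."""
        return [name for name in ('project_id',) if not getattr(self, name)]


os.environ['GOOGLE_CLOUD_PROJECT'] = 'demo-project'  # placeholder value
cfg = EnvConfig()
print(cfg.project_id, cfg.missing())
```

Checking `missing()` before initializing any client gives an early, explicit failure instead of an authentication error deep inside the pipeline.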

## 3. Google Cloud Logging Setup

In [None]:
class GoogleCloudObservability:
    """Observability using Google Cloud Logging and Monitoring"""
    
    def __init__(self, config: GoogleAIConfig):
        self.config = config
        self.client = None
        
        if config.project_id:
            try:
                self.client = cloud_logging.Client(project=config.project_id)
                self.client.setup_logging()
                self.logger = self.client.logger('confluence-search')
                logger.info("Google Cloud Logging initialized")
            except Exception as e:
                logger.error(f"Failed to initialize Cloud Logging: {e}")
    
    def log_search(self, query: str, results: List[Dict], latency: float, metadata: Dict = None):
        """Log search interaction to Cloud Logging"""
        if not self.client:
            return
        
        log_entry = {
            'type': 'search',
            'query': query,
            'results_count': len(results),
            'latency_ms': latency * 1000,
            'timestamp': datetime.utcnow().isoformat(),
            **(metadata or {})
        }
        
        self.logger.log_struct(log_entry, severity='INFO')
    
    def log_crawl(self, pages_count: int, duration: float, errors: int = 0):
        """Log crawling metrics"""
        if not self.client:
            return
        
        log_entry = {
            'type': 'crawl',
            'pages_crawled': pages_count,
            'duration_seconds': duration,
            'errors': errors,
            'pages_per_second': pages_count / duration if duration > 0 else 0,
            'timestamp': datetime.utcnow().isoformat()
        }
        
        self.logger.log_struct(log_entry, severity='INFO')
    
    def log_error(self, error_type: str, error_message: str, context: Dict = None):
        """Log errors"""
        if not self.client:
            return
        
        log_entry = {
            'type': 'error',
            'error_type': error_type,
            'error_message': error_message,
            'context': context or {},
            'timestamp': datetime.utcnow().isoformat()
        }
        
        self.logger.log_struct(log_entry, severity='ERROR')
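
To make the shape of a structured log entry concrete, here is a minimal sketch that builds the same payload as `log_search` without touching Cloud Logging. The function name `build_search_log_entry` is an assumption introduced for illustration. It also uses a timezone-aware timestamp, since `datetime.utcnow()` returns a naive value and is deprecated in recent Python versions.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def build_search_log_entry(
    query: str,
    results: List[Dict[str, Any]],
    latency_s: float,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the dict that would be passed to log_struct (hypothetical helper)."""
    entry: Dict[str, Any] = {
        'type': 'search',
        'query': query,
        'results_count': len(results),
        'latency_ms': round(latency_s * 1000, 3),
        # Timezone-aware UTC timestamp in ISO 8601 form
        'timestamp': datetime.now(timezone.utc).isoformat(),
    }
    # Caller-supplied metadata is merged last, so it can override the defaults
    entry.update(metadata or {})
    return entry


entry = build_search_log_entry('vpn setup', [{'id': '1'}, {'id': '2'}], 0.25, {'user': 'demo'})
print(entry)
```

Keeping payload construction separate from the logging call makes the entry easy to unit test without a Google Cloud project.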

## 4. Confluence Crawler with Async Support

In [None]:
@dataclass
class ConfluencePage:
    """Represents a Confluence page with metadata"""
    id: str
    title: str
    url: str
    content: str
    space_key: str
    created_by: str
    created_date: str
    last_modified: str
    parent_id: Optional[str] = None
    labels: List[str] = field(default_factory=list)
    attachments: List[Dict] = field(default_factory=list)
    child_pages: List[str] = field(default_factory=list)
    html_content: Optional[str] = None  # Keep HTML for Document AI

class EnhancedConfluenceCrawler:
    """Enhanced Confluence crawler with async support"""
    
    def __init__(self, config: GoogleAIConfig, observability: GoogleCloudObservability):
        self.config = config
        self.observability = observability
        self.confluence = Confluence(
            url=config.confluence_url,
            username=config.confluence_username,
            password=config.confluence_api_token,
            cloud=True
        )
        self.visited_pages = set()
        self.pages = []
    
    def extract_page_id_from_url(self, url: str) -> Optional[str]:
        """Extract page ID from Confluence URL"""
        patterns = [
            # .../pages/12345, .../pages/12345/ or .../pages/12345/Page+Title
            r'/pages/(\d+)(?:[/?#]|$)',
            # .../pages/viewpage.action?pageId=12345
            r'pageId=(\d+)'
        ]
        
        for pattern in patterns:
            match = re.search(pattern, url)
            if match:
                return match.group(1)
        return None
    
    def clean_html_content(self, html_content: str) -> str:
        """Clean HTML content and extract text"""
        soup = BeautifulSoup(html_content, 'lxml')
        
        # Remove script and style elements
        for element in soup(['script', 'style', 'meta', 'link']):
            element.decompose()
        
        # Extract text with better formatting
        text = soup.get_text(separator=' ', strip=True)
        
        # Clean up whitespace
        text = re.sub(r'\s+', ' ', text)
        
        return text.strip()
    
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
    async def fetch_page_async(self, session: aiohttp.ClientSession, page_id: str) -> Optional[ConfluencePage]:
        """Fetch page asynchronously"""
        try:
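            # Get page details. The atlassian client is blocking, so run it in a
            # worker thread to keep the event loop free for the other fetches.
            # 'space' is expanded because the URL and space_key below rely on it.
            page = await asyncio.to_thread(
                self.confluence.get_page_by_id,
                page_id,
                expand='body.storage,metadata.labels,children.page,history,version,ancestors,space'
            )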
            
            # Extract content
            content_html = page.get('body', {}).get('storage', {}).get('value', '')
            content_text = self.clean_html_content(content_html)
            
            # Extract metadata
            labels = [label['name'] for label in page.get('metadata', {}).get('labels', {}).get('results', [])]
            
            # Get child pages
            child_pages = []
            if 'children' in page and 'page' in page['children']:
                child_pages = [child['id'] for child in page['children']['page'].get('results', [])]
            
            # Create page object
            confluence_page = ConfluencePage(
                id=page['id'],
                title=page['title'],
                url=f"{self.config.confluence_url}/wiki/spaces/{page['space']['key']}/pages/{page['id']}",
                content=content_text,
                html_content=content_html,  # Keep for Document AI
                space_key=page['space']['key'],
                created_by=page['history']['createdBy']['displayName'],
                created_date=page['history']['createdDate'],
                last_modified=page['version']['when'],
                labels=labels,
                child_pages=child_pages
            )
            
            return confluence_page
            
        except Exception as e:
            logger.error(f"Error fetching page {page_id}: {str(e)}")
            self.observability.log_error('page_fetch_error', str(e), {'page_id': page_id})
            return None
    
    async def crawl_from_url_async(self, root_url: str, max_depth: Optional[int] = None) -> List[ConfluencePage]:
        """Crawl Confluence pages asynchronously"""
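        # Compare against None so that an explicit max_depth=0 (root page only) is respected
        max_depth = self.config.max_depth if max_depth is None else max_depth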
        
        # Extract root page ID
        root_page_id = self.extract_page_id_from_url(root_url)
        if not root_page_id:
            raise ValueError(f"Could not extract page ID from URL: {root_url}")
        
        logger.info(f"Starting async crawl from page ID: {root_page_id}")
        
        # BFS crawling with async
        queue = [(root_page_id, 0)]
        self.visited_pages = set()
        self.pages = []
        
        async with aiohttp.ClientSession() as session:
            with tqdm(total=self.config.max_pages, desc="Crawling pages") as pbar:
                while queue and len(self.pages) < self.config.max_pages:
                    # Process batch of pages
                    batch = []
                    for _ in range(min(self.config.batch_size, len(queue))):
                        if queue:
                            page_id, depth = queue.pop(0)
                            if page_id not in self.visited_pages and depth <= max_depth:
                                batch.append((page_id, depth))
                                self.visited_pages.add(page_id)
                    
                    # Fetch pages in parallel
                    tasks = [self.fetch_page_async(session, page_id) for page_id, _ in batch]
                    results = await asyncio.gather(*tasks)
                    
                    # Process results
                    for (page_id, depth), page in zip(batch, results):
                        if page:
                            self.pages.append(page)
                            pbar.update(1)
                            
                            # Add child pages to queue
                            for child_id in page.child_pages:
                                if child_id not in self.visited_pages:
                                    queue.append((child_id, depth + 1))
        
        logger.info(f"Crawled {len(self.pages)} pages")
        return self.pages
    
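    def crawl_from_url(self, root_url: str, max_depth: Optional[int] = None) -> List[ConfluencePage]:
        """Synchronous wrapper for async crawl.
        
        asyncio.run() raises RuntimeError when an event loop is already running,
        as it is in a notebook; there, await crawl_from_url_async() directly.
        """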
        return asyncio.run(self.crawl_from_url_async(root_url, max_depth))
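
The crawl order can be easier to follow on a toy page tree. The sketch below is a simplified, synchronous version of the breadth-first traversal, assuming an in-memory `children` mapping in place of Confluence API calls. `bfs_crawl` and the sample tree are hypothetical. Unlike the crawler above, it marks pages as visited when they are enqueued and uses a `deque`, which avoids the linear cost of `list.pop(0)`.

```python
from collections import deque
from typing import Dict, List


def bfs_crawl(children: Dict[str, List[str]], root: str,
              max_depth: int, max_pages: int) -> List[str]:
    """Return page IDs in breadth-first order, bounded by depth and page count."""
    visited = {root}
    queue = deque([(root, 0)])
    order: List[str] = []

    while queue and len(order) < max_pages:
        page_id, depth = queue.popleft()
        order.append(page_id)
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in children.get(page_id, []):
            if child not in visited:
                visited.add(child)
                queue.append((child, depth + 1))
    return order


# Hypothetical page tree: 'deep' sits at depth 3 and is never reached
tree = {'root': ['a', 'b'], 'a': ['a1', 'a2'], 'b': ['b1'], 'a1': ['deep']}
print(bfs_crawl(tree, 'root', max_depth=2, max_pages=10))
```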

## 5. Document AI Processing

In [None]:
class DocumentAIProcessor:
    """Process documents using Google Document AI for better extraction"""
    
    def __init__(self, config: GoogleAIConfig):
        self.config = config
        self.client = None
        
        if config.use_document_ai and config.project_id:
            try:
                self.client = documentai.DocumentProcessorServiceClient()
                self.processor_name = f"projects/{config.project_id}/locations/{config.location}/processors/{config.document_processor_id}"
                logger.info("Document AI initialized")
            except Exception as e:
                logger.error(f"Failed to initialize Document AI: {e}")
                self.client = None
    
    def process_html(self, html_content: str, page_title: str) -> Dict[str, Any]:
        """Process HTML content with Document AI"""
        if not self.client:
            return {'text': self.basic_html_extraction(html_content)}
        
        try:
            # Convert HTML to bytes
            raw_document = documentai.RawDocument(
                content=html_content.encode('utf-8'),
                mime_type='text/html'
            )
            
            # Process document
            request = documentai.ProcessRequest(
                name=self.processor_name,
                raw_document=raw_document
            )
            
            result = self.client.process_document(request=request)
            document = result.document
            
            # Extract structured information
            extracted_data = {
                'text': document.text,
                'entities': [],
                'tables': [],
                'sections': []
            }
            
            # Extract entities
            for entity in document.entities:
                extracted_data['entities'].append({
                    'type': entity.type_,
                    'mention': entity.mention_text,
                    'confidence': entity.confidence
                })
            
            # Extract tables if present
            for page in document.pages:
                for table in page.tables:
                    table_data = self._extract_table(table, document.text)
                    if table_data:
                        extracted_data['tables'].append(table_data)
            
            return extracted_data
            
        except Exception as e:
            logger.error(f"Document AI processing failed: {e}")
            return {'text': self.basic_html_extraction(html_content)}
    
    def basic_html_extraction(self, html_content: str) -> str:
        """Basic HTML text extraction as fallback"""
        soup = BeautifulSoup(html_content, 'lxml')
        for element in soup(['script', 'style']):
            element.decompose()
        return soup.get_text(separator=' ', strip=True)
    
    def _extract_table(self, table, document_text: str) -> Optional[Dict]:
        """Extract table data from a Document AI table object"""
        def cell_text(cell) -> str:
            # Text segments carry start/end offsets into the full document text,
            # not the text itself, so each cell is resolved by slicing.
            return ''.join(
                document_text[int(seg.start_index):int(seg.end_index)]
                for seg in cell.layout.text_anchor.text_segments
            ).strip()
        
        try:
            rows = [[cell_text(cell) for cell in row.cells] for row in table.body_rows]
            headers = [cell_text(cell) for cell in table.header_rows[0].cells] if table.header_rows else []
            return {'headers': headers, 'rows': rows}
        except Exception as e:
            logger.warning(f"Table extraction failed: {e}")
            return None
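
In the Document AI `Document` schema, a layout's text anchor lists segments with `start_index` and `end_index` offsets into `document.text`. The sketch below shows that lookup on plain Python objects. `Segment` is a hypothetical stand-in for the real segment type, and the sample text and offsets are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Stand-in for a text segment: offsets into the full document text."""
    start_index: int
    end_index: int


def anchor_text(document_text: str, segments: List[Segment]) -> str:
    """Join the slices a text anchor points at and trim surrounding whitespace."""
    return ''.join(document_text[s.start_index:s.end_index] for s in segments).strip()


# Made-up document text holding a two-column table
document_text = 'Name Owner\nVPN Alice\n'
header_cells = [[Segment(0, 5)], [Segment(5, 11)]]
row_cells = [[Segment(11, 15)], [Segment(15, 21)]]

headers = [anchor_text(document_text, cell) for cell in header_cells]
row = [anchor_text(document_text, cell) for cell in row_cells]
print(headers, row)
```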

## 6. Gemini Embeddings and Processing

In [None]:
class GeminiDocumentProcessor:
    """Process documents using Gemini for embeddings and understanding"""
    
    def __init__(self, config: GoogleAIConfig, doc_ai_processor: DocumentAIProcessor = None):
        self.config = config
        self.doc_ai = doc_ai_processor
        
        # Initialize Gemini embeddings
        self.embeddings = GoogleGenerativeAIEmbeddings(
            model=f"models/{config.embedding_model}",
            google_api_key=config.gemini_api_key
        )
        
        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=config.chunk_size,
            chunk_overlap=config.chunk_overlap,
            separators=["\n\n", "\n", ". ", " ", ""]
        )
        
        # Initialize Gemini for content analysis
        self.gemini_model = genai.GenerativeModel(config.gemini_model)
    
    async def process_page_with_gemini(self, page: ConfluencePage) -> Dict[str, Any]:
        """Process page content with Gemini for enhanced understanding"""
        try:
            # First, process with Document AI if available
            if self.doc_ai and page.html_content:
                doc_ai_result = self.doc_ai.process_html(page.html_content, page.title)
                content = doc_ai_result.get('text', page.content)
                entities = doc_ai_result.get('entities', [])
                tables = doc_ai_result.get('tables', [])
            else:
                content = page.content
                entities = []
                tables = []
            
            # Use Gemini to generate summary and extract key points.
            # Content is truncated to 3000 characters to limit token usage.
            prompt = f"""
            Analyze this Confluence page and provide:
            1. A concise summary (2-3 sentences)
            2. Key topics covered (bullet points)
            3. Any technical terms or acronyms with definitions
            4. Main actionable items if any
            
            Page Title: {page.title}
            Content: {content[:3000]}
            """
            
            response = await self.gemini_model.generate_content_async(prompt)
            analysis = response.text
            
            return {
                'page': page,
                'content': content,
                'summary': self._extract_summary(analysis),
                'key_topics': self._extract_topics(analysis),
                'entities': entities,
                'tables': tables,
                'gemini_analysis': analysis
            }
            
        except Exception as e:
            logger.error(f"Error processing page with Gemini: {e}")
            return {
                'page': page,
                'content': page.content,
                'summary': '',
                'key_topics': [],
                'entities': [],
                'tables': []
            }
    
    def _extract_summary(self, analysis: str) -> str:
        """Extract summary from Gemini analysis"""
        # Simple extraction - in production, use regex or structured output
        lines = analysis.split('\n')
        for i, line in enumerate(lines):
            if 'summary' in line.lower():
                return ' '.join(lines[i+1:i+4])
        return analysis[:200] if analysis else ''
    
    def _extract_topics(self, analysis: str) -> List[str]:
        """Extract topics from Gemini analysis"""
        topics = []
        lines = analysis.split('\n')
        in_topics = False
        
        for line in lines:
            if 'topics' in line.lower() or 'key points' in line.lower():
                in_topics = True
                continue
            if in_topics and line.strip().startswith(('-', '*', '•')):
                topics.append(line.strip().lstrip('-*• '))
            elif in_topics and not line.strip():
                break
        
        return topics
    
    def create_documents(self, processed_pages: List[Dict]) -> List[Document]:
        """Create LangChain documents from processed pages"""
        documents = []
        
        for page_data in tqdm(processed_pages, desc="Creating documents"):
            page = page_data['page']
            content = page_data['content']
            
            # Skip minimal content
            if len(content) < 50:
                continue
            
            # Split content into chunks
            texts = self.text_splitter.split_text(content)
            
            # Create documents with rich metadata
            for i, text in enumerate(texts):
                metadata = {
                    'page_id': page.id,
                    'page_title': page.title,
                    'page_url': page.url,
                    'space_key': page.space_key,
                    'created_by': page.created_by,
                    'created_date': page.created_date,
                    'last_modified': page.last_modified,
                    'labels': ', '.join(page.labels),
                    'summary': page_data.get('summary', ''),
                    'key_topics': ', '.join(page_data.get('key_topics', [])),
                    'chunk_index': i,
                    'total_chunks': len(texts)
                }
                
                doc = Document(
                    page_content=text,
                    metadata=metadata
                )
                documents.append(doc)
        
        logger.info(f"Created {len(documents)} document chunks")
        return documents
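
The interaction between `chunk_size` and `chunk_overlap` can be seen with a plain sliding window. This is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on the configured separators, while the hypothetical `sliding_chunks` below cuts at fixed character offsets.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = ''.join(str(i % 10) for i in range(25))  # '0123456789012345678901234'
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With the notebook's settings (`chunk_size=1000`, `chunk_overlap=200`) each window advances by 800 characters, so every chunk repeats the last 200 characters of the one before it.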

## 7. Vertex AI Vector Search

In [None]:
class VertexAIVectorStore:
    """Manages vector storage using Vertex AI Vector Search"""
    
    def __init__(self, config: GoogleAIConfig, embeddings):
        self.config = config
        self.embeddings = embeddings
        self.index = None
        self.index_endpoint = None
        
        if config.use_vertex_ai_search:
            self._init_vertex_search()
        else:
            self._init_local_store()
    
    def _init_vertex_search(self):
        """Initialize Vertex AI Vector Search"""
        try:
            from google.cloud import aiplatform_v1
            
            # Initialize the index
            index_client = aiplatform_v1.IndexServiceClient(
                client_options={"api_endpoint": f"{self.config.location}-aiplatform.googleapis.com"}
            )
            
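            # Create or get existing index.
            # NOTE: Vertex AI index resource names end in the index ID assigned at creation,
            # so vector_search_index should hold that ID rather than a display name.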
            index_name = f"projects/{self.config.project_id}/locations/{self.config.location}/indexes/{self.config.vector_search_index}"
            
            try:
                self.index = index_client.get_index(name=index_name)
                logger.info(f"Using existing Vertex AI index: {index_name}")
            except Exception:
                # Create new index
                logger.info("Creating new Vertex AI index...")
                self._create_vertex_index()
                
        except Exception as e:
            logger.error(f"Failed to initialize Vertex AI Vector Search: {e}")
            logger.info("Falling back to local vector store")
            self._init_local_store()
    
    def _create_vertex_index(self):
        """Create a new Vertex AI Vector Search index"""
        from google.cloud import aiplatform
        
        # Create index
        index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
            display_name="confluence-search-index",
            dimensions=768,  # Gemini embedding dimension
            approximate_neighbors_count=10,
            distance_measure_type="COSINE_DISTANCE",
            leaf_node_embedding_count=500,
            leaf_nodes_to_search_percent=80,
            description="Confluence semantic search index",
            location=self.config.location,
            project=self.config.project_id
        )
        
        self.index = index
        
        # Create index endpoint
        index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
            display_name="confluence-search-endpoint",
            description="Endpoint for Confluence search",
            location=self.config.location,
            project=self.config.project_id
        )
        
        # Deploy index to endpoint
        index_endpoint.deploy_index(
            index=index,
            deployed_index_id="confluence_deployed_index",
            machine_type="e2-standard-2",
            min_replica_count=1,
            max_replica_count=2
        )
        
        self.index_endpoint = index_endpoint
    
    def _init_local_store(self):
        """Initialize local ChromaDB as fallback"""
        import chromadb
        from langchain.vectorstores import Chroma
        
        self.vectorstore = Chroma(
            collection_name="confluence_gemini",
            embedding_function=self.embeddings,
            persist_directory="./chroma_gemini_db",
            collection_metadata={"hnsw:space": "cosine"}
        )
        logger.info("Using local ChromaDB for vector storage")
    
    async def add_documents_batch(self, documents: List[Document], batch_size: int = 100):
        """Add documents to vector store in batches"""
        logger.info(f"Adding {len(documents)} documents to vector store")
        
        if hasattr(self, 'vectorstore'):
            # Using local store
            for i in tqdm(range(0, len(documents), batch_size), desc="Adding to vector store"):
                batch = documents[i:i + batch_size]
                self.vectorstore.add_documents(batch)
            self.vectorstore.persist()
        else:
            # Using Vertex AI
            await self._add_to_vertex_search(documents, batch_size)
    
    async def _add_to_vertex_search(self, documents: List[Document], batch_size: int):
        """Add documents to Vertex AI Vector Search"""
        from google.cloud import aiplatform
        
        # Generate embeddings
        texts = [doc.page_content for doc in documents]
        embeddings = await self._batch_embed(texts, batch_size)
        
        # Prepare data for upload
        data_points = []
        for doc, embedding in zip(documents, embeddings):
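            # One ID per chunk: chunks of the same page must not overwrite each other.
            # hashlib gives a stable fallback, unlike the per-process salted hash().
            page_id = doc.metadata.get('page_id') or hashlib.sha256(
                doc.page_content.encode('utf-8')
            ).hexdigest()[:16]
            data_point = {
                "id": f"{page_id}-{doc.metadata.get('chunk_index', 0)}",
                "embedding": embedding,
                "metadata": doc.metadata
            }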
            data_points.append(data_point)
        
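        # Upload to index.
        # NOTE: `upsert` is a placeholder. In the Vertex AI SDK, streaming writes go through
        # `MatchingEngineIndex.upsert_datapoints`, which takes `IndexDatapoint` objects
        # rather than plain dicts.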
        self.index.upsert(data_points)
        logger.info(f"Added {len(data_points)} vectors to Vertex AI index")
    
    async def _batch_embed(self, texts: List[str], batch_size: int) -> List[List[float]]:
        """Generate embeddings in batches"""
        all_embeddings = []
        
        for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings"):
            batch = texts[i:i + batch_size]
            embeddings = await self.embeddings.aembed_documents(batch)
            all_embeddings.extend(embeddings)
        
        return all_embeddings
    
    def search(self, query: str, k: int = 5, filter_dict: Dict = None) -> List[Document]:
        """Perform similarity search"""
        if hasattr(self, 'vectorstore'):
            # Local search
            if filter_dict:
                return self.vectorstore.similarity_search(query, k=k, filter=filter_dict)
            return self.vectorstore.similarity_search(query, k=k)
        else:
            # Vertex AI search
            return self._vertex_search(query, k, filter_dict)
    
    def _vertex_search(self, query: str, k: int, filter_dict: Dict = None) -> List[Document]:
        """Search using Vertex AI Vector Search"""
        # Generate query embedding
        query_embedding = self.embeddings.embed_query(query)
        
        # Perform search
        results = self.index_endpoint.find_neighbors(
            deployed_index_id="confluence_deployed_index",
            queries=[query_embedding],
            num_neighbors=k
        )
        
        # Convert results to documents
        documents = []
        for result in results[0]:
            doc = Document(
                page_content=result.metadata.get('content', ''),
                metadata=result.metadata
            )
            documents.append(doc)
        
        return documents
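
What a similarity search computes can be shown with a brute-force version in pure Python. The sketch below ranks stored vectors by cosine similarity and applies a `top_k` and `score_threshold` cut like the values in the configuration. It is an illustration only: Vertex AI Vector Search uses approximate nearest-neighbor indexes rather than a full scan, and the three-dimensional vectors and document IDs here are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], index: Dict[str, List[float]],
          k: int, score_threshold: float) -> List[Tuple[str, float]]:
    """Score every vector, drop those below the threshold, return the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    kept = [item for item in scored if item[1] >= score_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:k]


# Made-up embeddings for three chunks
index = {
    'vpn-setup': [1.0, 0.0, 0.0],
    'vpn-faq': [0.8, 0.6, 0.0],
    'lunch-menu': [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.0, 0.0], index, k=2, score_threshold=0.7)
print(hits)
```

Applying the threshold before truncating to `k` means a query with no close matches returns fewer than `k` results instead of padding the list with unrelated chunks.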

## 8. Gemini File Search Integration

In [None]:
class GeminiFileSearch:
    """Integrate with Gemini's File Search (managed RAG) capability"""
    
    def __init__(self, config: GoogleAIConfig):
        self.config = config
        self.corpus = None
        
        if config.use_gemini_file_search:
            self._init_corpus()
    
    def _init_corpus(self):
        """Initialize or get existing corpus for File Search"""
        try:
            # Create or get corpus
            corpus_name = f"corpora/{self.config.file_search_corpus}"
            
            # Try to get existing corpus
            try:
                self.corpus = genai.get_corpus(corpus_name)
                logger.info(f"Using existing corpus: {corpus_name}")
            except Exception:
                # Create new corpus
                self.corpus = genai.create_corpus(
                    name=corpus_name,
                    display_name="Confluence Search Corpus",
                    description="Corpus for Confluence page semantic search"
                )
                logger.info(f"Created new corpus: {corpus_name}")
                
        except Exception as e:
            logger.error(f"Failed to initialize Gemini File Search: {e}")
    
    async def upload_documents(self, pages: List[ConfluencePage]):
        """Upload Confluence pages to Gemini corpus"""
        if not self.corpus:
            logger.error("Corpus not initialized")
            return
        
        documents = []
        
        for page in tqdm(pages, desc="Preparing documents for File Search"):
            # Create document for corpus
            doc = genai.Document(
                name=f"documents/{page.id}",
                display_name=page.title,
                metadata={
                    "page_id": page.id,
                    "space_key": page.space_key,
                    "url": page.url,
                    "created_by": page.created_by,
                    "labels": ",".join(page.labels)
                },
                chunks=[
                    genai.Chunk(
                        data=genai.ChunkData(string_value=page.content),
                        custom_metadata={
                            "title": page.title,
                            "last_modified": page.last_modified
                        }
                    )
                ]
            )
            documents.append(doc)
        
        # Upload documents to corpus
        try:
            response = await genai.batch_create_chunk_async(
                corpus=self.corpus.name,
                documents=documents
            )
            logger.info(f"Uploaded {len(documents)} documents to Gemini corpus")
        except Exception as e:
            logger.error(f"Failed to upload documents: {e}")
    
    def search(self, query: str, k: int = 5) -> List[Dict]:
        """Search using Gemini File Search"""
        if not self.corpus:
            return []
        
        try:
            # Perform semantic search
            response = genai.query_corpus(
                corpus=self.corpus.name,
                query=query,
                results_count=k,
                metadata_filters=[]  # Add filters if needed
            )
            
            # Format results
            results = []
            for chunk in response.relevant_chunks:
                results.append({
                    'content': chunk.chunk.data.string_value,
                    'metadata': chunk.chunk.custom_metadata,
                    'score': chunk.chunk_relevance_score
                })
            
            return results
            
        except Exception as e:
            logger.error(f"File Search failed: {e}")
            return []

## 9. Agent Tools Definition

In [None]:
class ConfluenceSearchTools:
    """Tools for the Confluence search agent"""
    
    def __init__(self, vector_store: VertexAIVectorStore, file_search: GeminiFileSearch, config: GoogleAIConfig):
        self.vector_store = vector_store
        self.file_search = file_search
        self.config = config
        self.gemini = genai.GenerativeModel(config.gemini_model)
    
    def semantic_search(self, query: str, top_k: int = 5) -> str:
        """Search Confluence pages using semantic similarity"""
        try:
            # Try Gemini File Search first
            if self.file_search and self.config.use_gemini_file_search:
                results = self.file_search.search(query, k=top_k)
                if results:
                    return self._format_file_search_results(results)
            
            # Fall back to vector store
            docs = self.vector_store.search(query, k=top_k)
            return self._format_vector_results(docs)
            
        except Exception as e:
            logger.error(f"Semantic search error: {e}")
            return f"Search failed: {str(e)}"
    
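    def search_by_space(self, space_key: str, query: str = "") -> str:
        """Search within a specific Confluence space.

        Accepts separate arguments or the single-string tool input 'space_key query'.
        """
        try:
            # LangChain's Tool passes one string, so split off the query if needed
            if not query and " " in space_key.strip():
                space_key, _, query = space_key.strip().partition(" ")
            filter_dict = {'space_key': space_key.strip()}
            docs = self.vector_store.search(query.strip() or "*", k=10, filter_dict=filter_dict)
            return self._format_vector_results(docs[:5])
        except Exception as e:
            logger.error(f"Space search error: {e}")
            return f"Space search failed: {str(e)}"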
    
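    def search_by_labels(self, labels: str, query: str = "") -> str:
        """Search pages by labels.

        Accepts separate arguments or the single-string tool input
        'label1,label2 optional_query'.
        """
        try:
            if not query:
                # LangChain's Tool passes one string. Confluence labels contain no
                # spaces, so any text after the comma-separated list is the query.
                match = re.match(r'\s*([^,\s]+(?:\s*,\s*[^,\s]+)*)\s*(.*)', labels, re.DOTALL)
                if match:
                    labels, query = match.group(1), match.group(2).strip()
            
            # Parse labels, dropping empty entries
            label_list = [label.strip() for label in labels.split(',') if label.strip()]
            filter_dict = {'labels': {'$in': label_list}}
            
            docs = self.vector_store.search(query or "*", k=10, filter_dict=filter_dict)
            return self._format_vector_results(docs[:5])
        except Exception as e:
            return f"Label search failed: {str(e)}"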
    
    def analyze_page(self, page_title: str) -> str:
        """Provide detailed analysis of a specific page using Gemini"""
        try:
            # Search for the page
            docs = self.vector_store.search(f"title: {page_title}", k=3)
            
            if not docs:
                return f"No page found with title: {page_title}"
            
            # Get the most relevant document
            doc = docs[0]
            
            # Use Gemini to analyze
            prompt = f"""
            Provide a comprehensive analysis of this Confluence page:
            
            Title: {doc.metadata.get('page_title', 'Unknown')}
            Content: {doc.page_content}
            
            Include:
            1. Main purpose and audience
            2. Key information conveyed
            3. Action items or decisions documented
            4. Related topics that might need attention
            5. Quality assessment (completeness, clarity)
            """
            
            response = self.gemini.generate_content(prompt)
            
            return f"**Page Analysis: {doc.metadata.get('page_title')}**\n\n{response.text}"
            
        except Exception as e:
            return f"Analysis failed: {str(e)}"
    
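    def compare_pages(self, page1_title: str, page2_title: Optional[str] = None) -> str:
        """Compare two Confluence pages.

        Accepts two titles, or the single-string tool input 'page1_title | page2_title'.
        """
        try:
            # LangChain's Tool passes one string, so split it on '|'
            if page2_title is None:
                page1_title, _, page2_title = page1_title.partition('|')
            page1_title, page2_title = page1_title.strip(), page2_title.strip()
            if not page1_title or not page2_title:
                return "Expected input of the form 'page1_title | page2_title'"
            
            # Get both pages
            docs1 = self.vector_store.search(f"title: {page1_title}", k=1)
            docs2 = self.vector_store.search(f"title: {page2_title}", k=1)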
            
            if not docs1 or not docs2:
                return "Could not find both pages for comparison"
            
            # Use Gemini to compare
            prompt = f"""
            Compare these two Confluence pages:
            
            Page 1: {docs1[0].metadata.get('page_title')}
            Content: {docs1[0].page_content[:1500]}
            
            Page 2: {docs2[0].metadata.get('page_title')}
            Content: {docs2[0].page_content[:1500]}
            
            Provide:
            1. Key similarities
            2. Main differences
            3. Potential overlaps or redundancies
            4. Recommendations for consolidation if applicable
            """
            
            response = self.gemini.generate_content(prompt)
            return response.text
            
        except Exception as e:
            return f"Comparison failed: {str(e)}"
    
    def _format_file_search_results(self, results: List[Dict]) -> str:
        """Format Gemini File Search results"""
        formatted = []
        for i, result in enumerate(results, 1):
            formatted.append(
                f"**Result {i}:**\n"
                f"Title: {result['metadata'].get('title', 'Unknown')}\n"
                f"Score: {result['score']:.2f}\n"
                f"Content: {result['content'][:500]}...\n"
            )
        return "\n\n".join(formatted) if formatted else "No results found"
    
    def _format_vector_results(self, docs: List[Document]) -> str:
        """Format vector search results"""
        formatted = []
        for i, doc in enumerate(docs, 1):
            formatted.append(
                f"**Result {i}:**\n"
                f"Title: {doc.metadata.get('page_title', 'Unknown')}\n"
                f"URL: {doc.metadata.get('page_url', 'N/A')}\n"
                f"Space: {doc.metadata.get('space_key', 'Unknown')}\n"
                f"Summary: {doc.metadata.get('summary', 'N/A')}\n"
                f"Content: {doc.page_content[:400]}...\n"
            )
        return "\n\n".join(formatted) if formatted else "No results found"
    
    def get_tools(self) -> List[Tool]:
        """Get LangChain tools"""
        return [
            Tool(
                name="semantic_search",
                func=self.semantic_search,
                description="Search Confluence pages using semantic similarity. Input: search query"
            ),
            Tool(
                name="search_by_space",
                func=self.search_by_space,
                description="Search within a specific Confluence space. Input: 'space_key query'"
            ),
            Tool(
                name="search_by_labels",
                func=self.search_by_labels,
                description="Search pages by labels. Input: 'label1,label2 optional_query'"
            ),
            Tool(
                name="analyze_page",
                func=self.analyze_page,
                description="Get detailed AI analysis of a specific page. Input: page title"
            ),
            Tool(
                name="compare_pages",
                func=self.compare_pages,
                description="Compare two Confluence pages. Input: 'page1_title | page2_title'"
            )
        ]
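
LangChain's `Tool` passes a single string to `func`, so the multi-part input formats named in the descriptions above have to be split by the tool itself. A minimal sketch of that parsing follows, assuming exactly the formats shown in the descriptions; the helper names `parse_space_input`, `parse_label_input`, and `parse_compare_input` are hypothetical and not part of the notebook.

```python
from typing import List, Tuple


def parse_space_input(raw: str) -> Tuple[str, str]:
    """Split 'space_key query' into (space_key, query); the query may be empty."""
    space_key, _, query = raw.strip().partition(" ")
    return space_key, query.strip()


def parse_label_input(raw: str) -> Tuple[List[str], str]:
    """Split 'label1,label2 optional_query' into (labels, query).

    Assumes no spaces inside the comma-separated label list.
    """
    label_part, _, query = raw.strip().partition(" ")
    labels = [label.strip() for label in label_part.split(",") if label.strip()]
    return labels, query.strip()


def parse_compare_input(raw: str) -> Tuple[str, str]:
    """Split 'page1_title | page2_title' into two non-empty titles."""
    first, sep, second = raw.partition("|")
    if not sep or not first.strip() or not second.strip():
        raise ValueError("expected input of the form 'page1_title | page2_title'")
    return first.strip(), second.strip()


print(parse_space_input("TECH deployment checklist"))
print(parse_label_input("adr,architecture caching strategy"))
print(parse_compare_input("API Guide | REST Standards"))
```

Raising on malformed comparison input lets the tool return a clear message to the agent instead of searching for a garbled title.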

## 10. LangGraph Agent with Gemini

In [None]:
class AgentState(TypedDict):
    """State for the LangGraph agent"""
    messages: Annotated[Sequence[BaseMessage], add_messages]
    
class GeminiConfluenceAgent:
    """LangGraph agent powered by Gemini for Confluence search"""
    
    def __init__(self, tools: List[Tool], config: GoogleAIConfig):
        self.tools = tools
        self.config = config
        
        # Initialize Gemini LLM
        self.llm = ChatGoogleGenerativeAI(
            model=config.gemini_model,
            temperature=config.agent_temperature,
            google_api_key=config.gemini_api_key,
            convert_system_message_to_human=True
        )
        
        # Bind tools
        self.llm_with_tools = self.llm.bind_tools(tools)
        
        # Build graph
        self.graph = self._build_graph()
        
        # Memory
        self.memory = MemorySaver()
        
        # Compile
        self.app = self.graph.compile(checkpointer=self.memory)
    
    def _build_graph(self) -> StateGraph:
        """Build the agent graph"""
        workflow = StateGraph(AgentState)
        
        def agent(state: AgentState) -> dict:
            """Agent node"""
            messages = state["messages"]
            
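            # Prepend the system prompt on every call: it is not written back to
            # state, so a first-interaction check would drop it on later turns
            if not any(isinstance(m, SystemMessage) for m in messages):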
                system_prompt = (
                    "You are an expert assistant for searching and analyzing Confluence documentation. "
                    "You have access to advanced semantic search tools powered by Google's Gemini AI. "
                    "You can search by content, space, labels, analyze pages, and compare documents. "
                    "Always provide clear, actionable answers and include relevant page links when available. "
                    "Use multiple tools when needed to provide comprehensive answers. "
                    "If unsure, use the search tools to find accurate information."
                )
                messages = [SystemMessage(content=system_prompt)] + messages
            
            response = self.llm_with_tools.invoke(messages)
            return {"messages": [response]}
        
        # Add nodes
        workflow.add_node("agent", agent)
        workflow.add_node("tools", ToolNode(self.tools))
        
        # Set entry point
        workflow.set_entry_point("agent")
        
        # Add edges
        workflow.add_conditional_edges(
            "agent",
            tools_condition,
            {
                "tools": "tools",
                "__end__": END
            }
        )
        
        workflow.add_edge("tools", "agent")
        
        return workflow
    
    async def search_async(self, query: str, thread_id: str = "default") -> str:
        """Execute search asynchronously"""
        initial_state = {"messages": [HumanMessage(content=query)]}
        config = {"configurable": {"thread_id": thread_id}}
        
        try:
            result = await self.app.ainvoke(initial_state, config)
            
            # Extract final response
            for message in reversed(result["messages"]):
                if isinstance(message, AIMessage) and not message.tool_calls:
                    return message.content
            
            return "No response generated"
            
        except Exception as e:
            logger.error(f"Agent error: {e}")
            return f"Error: {str(e)}"
    
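    def search(self, query: str, thread_id: str = "default") -> str:
        """Synchronous search wrapper.

        `asyncio.run()` raises RuntimeError when an event loop is already
        running (as in a Jupyter cell); call `await search_async(...)` there.
        """
        return asyncio.run(self.search_async(query, thread_id))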

## 11. Main Pipeline

In [None]:
class GoogleAIConfluencePipeline:
    """Main pipeline for Google AI-powered Confluence search"""
    
    def __init__(self, config: Optional[GoogleAIConfig] = None):
        self.config = config or GoogleAIConfig()
        self.observability = GoogleCloudObservability(self.config)
        self.crawler = None
        self.doc_ai = None
        self.processor = None
        self.vector_store = None
        self.file_search = None
        self.agent = None
    
    async def initialize_async(self):
        """Initialize all components asynchronously"""
        logger.info("Initializing Google AI Confluence Pipeline")
        
        # Initialize Document AI
        self.doc_ai = DocumentAIProcessor(self.config)
        
        # Initialize crawler
        self.crawler = EnhancedConfluenceCrawler(self.config, self.observability)
        
        # Initialize processor
        self.processor = GeminiDocumentProcessor(self.config, self.doc_ai)
        
        # Initialize vector store
        self.vector_store = VertexAIVectorStore(self.config, self.processor.embeddings)
        
        # Initialize File Search
        self.file_search = GeminiFileSearch(self.config)
        
        # Initialize tools and agent
        tools_manager = ConfluenceSearchTools(self.vector_store, self.file_search, self.config)
        tools = tools_manager.get_tools()
        self.agent = GeminiConfluenceAgent(tools, self.config)
        
        logger.info("Pipeline initialized successfully")
    
    def initialize(self):
        """Synchronous initialization wrapper (in a notebook, use `await initialize_async()`)"""
        asyncio.run(self.initialize_async())
    
    async def index_confluence_async(self, root_url: str, max_depth: Optional[int] = None) -> Dict[str, Any]:
        """Index Confluence pages asynchronously"""
        import time
        start_time = time.time()
        
        try:
            # Crawl pages
            logger.info(f"Starting crawl from: {root_url}")
            pages = await self.crawler.crawl_from_url_async(root_url, max_depth)
            
            if not pages:
                return {"status": "error", "message": "No pages found"}
            
            # Process pages with Gemini
            logger.info(f"Processing {len(pages)} pages with Gemini")
            processed_pages = []
            for page in tqdm(pages, desc="Processing with Gemini"):
                processed = await self.processor.process_page_with_gemini(page)
                processed_pages.append(processed)
            
            # Create documents
            documents = self.processor.create_documents(processed_pages)
            
            # Add to vector store
            await self.vector_store.add_documents_batch(documents)
            
            # Upload to Gemini File Search if enabled
            if self.config.use_gemini_file_search:
                await self.file_search.upload_documents(pages)
            
            # Calculate metrics
            duration = time.time() - start_time
            
            # Log metrics
            self.observability.log_crawl(len(pages), duration)
            
            return {
                "status": "success",
                "pages_indexed": len(pages),
                "documents_created": len(documents),
                "duration_seconds": duration,
                "gemini_file_search": self.config.use_gemini_file_search,
                "vertex_ai_search": self.config.use_vertex_ai_search
            }
            
        except Exception as e:
            logger.error(f"Indexing error: {e}")
            self.observability.log_error('indexing_error', str(e))
            return {"status": "error", "message": str(e)}
    
    def index_confluence(self, root_url: str, max_depth: Optional[int] = None) -> Dict[str, Any]:
        """Synchronous indexing wrapper (in a notebook, use `await index_confluence_async(...)`)"""
        return asyncio.run(self.index_confluence_async(root_url, max_depth))
    
    def search(self, query: str) -> str:
        """Search using the agent"""
        import time
        start_time = time.time()
        
        result = self.agent.search(query)
        
        # Log search
        latency = time.time() - start_time
        self.observability.log_search(query, [], latency)
        
        return result
    
    def interactive_search(self):
        """Interactive search interface"""
        print("\n🚀 Google AI Confluence Search Agent")
        print("Powered by Gemini 1.5 Pro")
        print("="*50)
        print("Type 'exit' to quit, 'help' for commands\n")
        
        while True:
            query = input("\n💬 Your query: ").strip()
            
            if query.lower() == 'exit':
                break
            elif query.lower() == 'help':
                print("\nAvailable commands:")
                print("  - Any search query")
                print("  - 'analyze [page_title]' - Analyze a specific page")
                print("  - 'compare [page1] | [page2]' - Compare two pages")
                print("  - 'space:[key] [query]' - Search within a space")
                print("  - 'labels:[label1,label2]' - Search by labels")
                continue
            elif not query:
                continue
            
            print("\n🤔 Thinking with Gemini...")
            response = self.search(query)
            print(f"\n{response}")

## 12. Quick Start

In [None]:
def quick_start():
    """
    Quick start guide for Google AI Confluence Search.
    
    Prerequisites:
    1. Set up Google Cloud Project with enabled APIs:
       - Vertex AI API
       - Document AI API
       - Cloud Logging API
    
    2. Set environment variables:
       - GOOGLE_CLOUD_PROJECT
       - GEMINI_API_KEY
       - CONFLUENCE_URL
       - CONFLUENCE_USERNAME
       - CONFLUENCE_API_TOKEN
    """
    
    print("🚀 Google AI Confluence Search - Quick Start")
    print("="*60)
    
    # Check configuration
    required_vars = ['GOOGLE_CLOUD_PROJECT', 'GEMINI_API_KEY', 'CONFLUENCE_URL', 'CONFLUENCE_USERNAME', 'CONFLUENCE_API_TOKEN']
    missing_vars = [var for var in required_vars if not os.getenv(var)]
    
    if missing_vars:
        print("⚠️  Missing environment variables:")
        for var in missing_vars:
            print(f"   - {var}")
        print("\nPlease set these in your .env file.")
        return
    
    # Initialize pipeline
    print("\n📦 Initializing Google AI pipeline...")
    pipeline = GoogleAIConfluencePipeline()
    pipeline.initialize()
    print("✅ Pipeline initialized with Gemini 1.5 Pro")
    
    # Index pages
    print("\n📚 Let's index some Confluence pages!")
    root_url = input("Enter Confluence page URL: ").strip()
    
    if root_url:
        print("\n🔄 Indexing with Gemini (this may take a few minutes)...")
        result = pipeline.index_confluence(root_url, max_depth=2)
        
        if result['status'] == 'success':
            print("\n✅ Success!")
            print(f"   Pages indexed: {result['pages_indexed']}")
            print(f"   Documents created: {result['documents_created']}")
            print(f"   Time: {result['duration_seconds']:.2f}s")
            print(f"   Using Gemini File Search: {result.get('gemini_file_search', False)}")
            print(f"   Using Vertex AI Search: {result.get('vertex_ai_search', False)}")
        else:
            # Surface the failure instead of continuing silently
            print(f"\n⚠️  Indexing failed: {result.get('message', 'unknown error')}")
    
    # Demo search
    print("\n🔍 Let's try a search!")
    query = input("Enter your query: ")
    
    if query:
        print("\n🤖 Gemini is thinking...\n")
        result = pipeline.search(query)
        print(result)
    
    # Interactive mode
    choice = input("\nContinue with interactive search? (y/n): ")
    if choice.lower() == 'y':
        pipeline.interactive_search()
    
    print("\n👋 Thank you for using Google AI Confluence Search!")
    return pipeline

# Run quick start
# pipeline = quick_start()

## 13. Advanced Usage Examples

In [None]:
# Example: Using Gemini's advanced capabilities
async def advanced_gemini_example():
    """Demonstrate advanced Gemini features"""
    
    config = GoogleAIConfig()
    pipeline = GoogleAIConfluencePipeline(config)
    await pipeline.initialize_async()
    
    # Example queries showcasing Gemini's capabilities
    queries = [
        "Find all architecture decision records and summarize the key patterns",
        "Compare our API documentation with REST best practices",
        "Identify gaps in our deployment documentation",
        "What are the most frequently updated pages in the TECH space?",
        "Generate a knowledge graph of our microservices documentation"
    ]
    
    for query in queries:
        print(f"\n{'='*60}")
        print(f"Query: {query}")
        print(f"{'='*60}")
        
        result = await pipeline.agent.search_async(query)
        print(result)

# Run example (in a notebook cell, use `await advanced_gemini_example()` instead)
# asyncio.run(advanced_gemini_example())

In [None]:
# Example: Batch indexing of multiple spaces
async def batch_indexing_example():
    """Index multiple Confluence spaces concurrently"""
    
    spaces = [
        "https://your-domain.atlassian.net/wiki/spaces/TECH/overview",
        "https://your-domain.atlassian.net/wiki/spaces/DOCS/overview",
        "https://your-domain.atlassian.net/wiki/spaces/API/overview"
    ]
    
    pipeline = GoogleAIConfluencePipeline()
    await pipeline.initialize_async()
    
    # Index spaces in parallel
    tasks = [pipeline.index_confluence_async(url, max_depth=2) for url in spaces]
    results = await asyncio.gather(*tasks)
    
    # Summary
    total_pages = sum(r.get('pages_indexed', 0) for r in results if r['status'] == 'success')
    print(f"\nIndexed {total_pages} total pages across {len(spaces)} spaces")
    
    return results

# Run example (in a notebook cell, use `results = await batch_indexing_example()` instead)
# results = asyncio.run(batch_indexing_example())

## 14. Best Practices and Tips

### Google AI Optimization

1. **Gemini Model Selection**
   - Use `gemini-1.5-pro` for complex analysis
   - Use `gemini-1.5-flash` for faster, simpler queries
   - Gemini 1.5 models accept image input directly, so pages with diagrams need no separate vision model

2. **File Search vs Vector Search**
   - File Search: Managed, no infrastructure, automatic updates
   - Vector Search: More control, custom filtering, hybrid search

3. **Document AI Integration**
   - Use for complex layouts, tables, forms
   - Configure processors for specific document types
   - Consider OCR processor for image-heavy pages

4. **Cost Optimization**
   - Batch API calls when possible
   - Use caching for frequently accessed pages
   - Configure appropriate Vertex AI machine types
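
The caching and batching points under Cost Optimization can be sketched as a content-hash cache in front of the embedding call, so unchanged text is not re-embedded when pages are re-indexed. `EmbeddingCache` and `embed_fn` are hypothetical names; `embed_fn` is a placeholder for whatever embedding client is in use, and the in-memory dict is an assumption that would need a shared store to work across processes.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Cache embeddings by SHA-256 of the text so repeated content is embedded once."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of texts actually sent to embed_fn

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        keys = [self._key(text) for text in texts]
        # Collect each uncached text once, in order, for a single batched call.
        pending: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._store and key not in pending:
                pending[key] = text
        if pending:
            vectors = self._embed_fn(list(pending.values()))
            self.misses += len(pending)
            for key, vector in zip(pending, vectors):
                self._store[key] = vector
        return [self._store[key] for key in keys]


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Placeholder embedding (text length only), for demonstration.
    return [[float(len(text))] for text in texts]


cache = EmbeddingCache(fake_embed)
cache.embed(["alpha", "beta", "alpha"])
cache.embed(["beta", "gamma"])
print(cache.misses)
```

Keying on the text hash rather than the page ID means an edited page is re-embedded automatically, while identical chunks shared across pages cost one call.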

### Performance Tips

1. **Async Processing**: Use async methods for better throughput
2. **Batching**: Process documents in batches of 100-500
3. **Caching**: Implement Redis caching for embeddings
4. **Monitoring**: Use Cloud Monitoring for performance metrics
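
The batching tip can be sketched as a small chunking helper that feeds documents to an upload or embedding call in fixed-size groups. `batched` is a hypothetical helper, and the batch size of 100 is taken from the lower end of the range above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


docs = [f"doc-{i}" for i in range(250)]
sizes = [len(batch) for batch in batched(docs, 100)]
print(sizes)
```

Each yielded batch can then be passed to a call such as `add_documents_batch`, which keeps individual requests bounded and makes a failed batch cheap to retry.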

### Security

1. **Service Accounts**: Use proper IAM roles
2. **VPC-SC**: Configure VPC Service Controls
3. **Data Residency**: Choose appropriate regions
4. **Encryption**: Enable CMEK for sensitive data