# Azure AI Agent with File Search - Competitive Intelligence

This notebook demonstrates how to use **Azure AI Agents with File Search** to extract product information from PDF documents.

## What You'll Learn

1. ✅ Upload PDF files to Azure AI Agent service
2. ✅ Create vector stores for file search
3. ✅ Use `HostedFileSearchTool` with Azure AI agents
4. ✅ Extract structured product data from documents
5. ✅ Clean up Azure resources properly

## Prerequisites

- Azure AI Foundry project created
- `AZURE_AI_PROJECT_ENDPOINT` set in `.env` file
- Azure CLI authentication (`az login`)
- Azure Search permissions assigned (Search Service Contributor + Search Index Data Contributor)
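Two of the prerequisites above can be checked locally before running any cell. The following is a minimal sketch, not part of the official sample: `check_prerequisites` is a hypothetical helper name, and it only verifies that the environment variable is present and that the `az` executable is on `PATH`. It cannot confirm that `az login` has been run or that the role assignments exist.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def check_prerequisites(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means both checks passed."""
    env = os.environ if env is None else env
    problems = []

    # The notebook reads this variable after load_dotenv() has run.
    if not env.get("AZURE_AI_PROJECT_ENDPOINT"):
        problems.append("AZURE_AI_PROJECT_ENDPOINT is not set")

    # AzureCliCredential shells out to the Azure CLI, so `az` must be on PATH.
    if which("az") is None:
        problems.append("Azure CLI (`az`) not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"Missing prerequisite: {problem}")
```

Call `load_dotenv()` first if the endpoint lives only in the `.env` file; otherwise the check reports it as missing.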

## Based On

This follows the official Microsoft pattern from:
https://github.com/microsoft/agent-framework/blob/main/python/samples/getting_started/agents/azure_ai/azure_ai_with_file_search.py

## Step 1: Install Dependencies

In [1]:
import sys
import subprocess

# Map each pip package name to the module name it is imported under.
# (python-dotenv is imported as `dotenv`, not `python_dotenv`.)
packages = {
    'python-dotenv': 'dotenv',
}

for package, module_name in packages.items():
    try:
        __import__(module_name)
        print(f"✅ {package} already installed")
    except ImportError:
        print(f"📦 Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package, "-q"])
        print(f"✅ {package} installed")

print("\n🎉 All dependencies ready!")

📦 Installing python-dotenv...
✅ python-dotenv installed

🎉 All dependencies ready!


## Step 2: Import Libraries

In [6]:
import asyncio
import json
import os
from pathlib import Path
from typing import Optional

# Agent framework
from agent_framework import ChatAgent, HostedFileSearchTool, HostedVectorStoreContent
from agent_framework_azure_ai import AzureAIAgentClient
from azure.ai.agents.models import FileInfo, VectorStore
from azure.identity.aio import AzureCliCredential
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

print("✅ Imports successful!")
print("📊 Ready to use Azure AI Agent with File Search!")

✅ Imports successful!
📊 Ready to use Azure AI Agent with File Search!


## Step 3: Setup Folders

In [7]:
# Create folder structure
folders = {
    'input': './competitive_analysis/input',
    'output': './competitive_analysis/output',
    'data': './competitive_analysis/data',
}

for folder_name, folder_path in folders.items():
    Path(folder_path).mkdir(parents=True, exist_ok=True)
    print(f"✅ {folder_name}: {folder_path}")

print("\n✅ Folder structure ready!")

✅ input: ./competitive_analysis/input
✅ output: ./competitive_analysis/output
✅ data: ./competitive_analysis/data

✅ Folder structure ready!


## Step 4: Verify Configuration

In [8]:
# Verify Azure AI Project endpoint is set
project_endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT")
if not project_endpoint:
    raise ValueError("❌ AZURE_AI_PROJECT_ENDPOINT not set in .env file")

print(f"✅ Azure AI Project Endpoint: {project_endpoint}")
print("✅ Authentication: Azure CLI Credential")
print("\n💡 Configuration verified!")

✅ Azure AI Project Endpoint: https://gk-agent-framework-project.services.ai.azure.com/api/projects/agentframworkProject
✅ Authentication: Azure CLI Credential

💡 Configuration verified!
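
The check above only confirms that the variable is non-empty. A slightly stricter sketch is shown here, assuming the endpoint always has the shape seen in the sample output (`https://<host>/api/projects/<project-name>`). That shape is inferred from this notebook's output rather than from a specification, and `looks_like_project_endpoint` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def looks_like_project_endpoint(endpoint: str) -> bool:
    """Heuristic: an HTTPS URL with a host and a path of the form /api/projects/<name>."""
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or not parsed.netloc:
        return False
    # Assumed path shape, taken from the sample output: /api/projects/<project-name>
    parts = [part for part in parsed.path.split("/") if part]
    return len(parts) == 3 and parts[:2] == ["api", "projects"]
```

A `False` result is a hint to double-check the value copied from the Azure AI Foundry portal, not proof that the endpoint is invalid.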


## Step 5: Test Azure AI Agent File Search

This example follows the **official Microsoft pattern** for file search with Azure AI Agents.

In [9]:
async def test_azure_file_search():
    """
    Test Azure AI Agent with File Search using official Microsoft pattern.
    Based on: https://github.com/microsoft/agent-framework/blob/main/python/samples/getting_started/agents/azure_ai/azure_ai_with_file_search.py
    """
    print("\n" + "="*70)
    print("🚀 AZURE AI AGENT FILE SEARCH TEST")
    print("="*70)
    
    # Get project endpoint from environment
    project_endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT")
    if not project_endpoint:
        raise ValueError("AZURE_AI_PROJECT_ENDPOINT not set in .env file")
    
    # Initialize Azure AI Agent Client with project endpoint
    client = AzureAIAgentClient(
        endpoint=project_endpoint,
        async_credential=AzureCliCredential()
    )
    
    file: Optional[FileInfo] = None
    vector_store: Optional[VectorStore] = None
    
    try:
        # 1. Check for PDF files
        input_path = Path(folders['input'])
        pdf_files = list(input_path.glob('*.pdf'))
        
        if not pdf_files:
            print("\n⚠️  No PDF files found in input folder!")
            print("\n📋 To test file search:")
            print(f"   1. Add PDF files to: {folders['input']}")
            print("   2. Re-run this cell")
            return
        
        # Use the first PDF file for testing
        pdf_file_path = pdf_files[0]
        print(f"\n📄 Using file: {pdf_file_path.name}")
        print(f"   Size: {pdf_file_path.stat().st_size / 1024 / 1024:.2f} MB")

        # 2. Upload file to Azure AI Agent service
        print("\n⬆️  Uploading file to Azure AI...")
        file = await client.project_client.agents.files.upload_and_poll(
            file_path=str(pdf_file_path),
            purpose="assistants"
        )
        print(f"✅ File uploaded, ID: {file.id}")

        # 3. Create vector store with the uploaded file
        print("\n🗂️  Creating vector store...")
        vector_store = await client.project_client.agents.vector_stores.create_and_poll(
            file_ids=[file.id],
            name="furniture_catalog_vectorstore"
        )
        print(f"✅ Vector store created, ID: {vector_store.id}")
        
        # 4. Create file search tool with vector store reference
        file_search_tool = HostedFileSearchTool(
            inputs=[HostedVectorStoreContent(vector_store_id=vector_store.id)]
        )
        
        # 5. Create agent with file search capabilities
        print("\n🤖 Creating agent with file search tool...")
        async with ChatAgent(
            chat_client=client,
            name="ProductAnalyzer",
            instructions="""
            You are a product data extraction specialist.
            Extract product information from uploaded documents and return it as structured JSON.
            
            For each product found, include:
            - product_name: Official product name
            - sku: Product SKU or model number
            - price: Price (with currency)
            - description: Brief product description
            - features: List of key features
            - category: Product category
            
            Use the file search tool to find information in the uploaded document.
            """,
            tools=file_search_tool,
        ) as agent:
            print(f"✅ Agent created: {agent.name}")
            
            # 6. Query the agent to extract product data
            queries = [
                "What products are in this catalog? List the first 3 products with their names, SKUs, and prices.",
                "What is the most expensive product?",
                "Summarize the product categories available."
            ]
            
            for i, query in enumerate(queries, 1):
                print("\n" + "="*70)
                print(f"❓ Query {i}: {query}")
                print("="*70)

                response = await agent.run(query)
                print(f"\n💡 Response:\n{response.text}")

        print("\n" + "="*70)
        print("✅ FILE SEARCH TEST COMPLETED SUCCESSFULLY!")
        print("="*70)
        
    except Exception as e:
        print(f"\n❌ Error: {e}")
        import traceback
        traceback.print_exc()
        
    finally:
        # 7. Cleanup: Delete vector store and file
        print("\n🧹 Cleaning up resources...")
        try:
            if vector_store is not None:
                await client.project_client.agents.vector_stores.delete(vector_store.id)
                print(f"✅ Deleted vector store: {vector_store.id}")
            if file is not None:
                await client.project_client.agents.files.delete(file.id)
                print(f"✅ Deleted file: {file.id}")
        except Exception as cleanup_error:
            print(f"⚠️  Cleanup warning: {cleanup_error}")
        finally:
            await client.close()
            print("✅ Client closed")

# Run the test
await test_azure_file_search()


🚀 AZURE AI AGENT FILE SEARCH TEST

📄 Using file: knoll-ReffProfilesVolTwo.pdf
   Size: 15.99 MB

⬆️  Uploading file to Azure AI...
✅ File uploaded, ID: assistant-UPkMbpK5Gh9Pf6yY4XWW28

🗂️  Creating vector store...
✅ Vector store created, ID: vs_CEKkUx3ymmPMjyaE5vjsa2Io

🤖 Creating agent with file search tool...
✅ Agent created: ProductAnalyzer

❓ Query 1: What products are in this catalog? List the first 3 products with their names, SKUs, and prices.

💡 Response:
Here are the first three products found in the catalog, including their official product names, SKUs, and prices:

1. 
- product_name: Easy-Off Professional Ove

## Step 6: Extract Structured Product Data

This cell extracts product data from ALL PDF files and saves as JSON.
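
The extraction cell has to cope with replies that wrap the JSON in a Markdown code fence, or that return a single object instead of an array. The parsing step can be sketched as a standalone helper. This is an illustrative variant, not the notebook's own code: `parse_json_reply` and `FENCE` are hypothetical names, and wrapping a lone object in a list is a design choice made here so that callers can always iterate.

```python
import json
import re
from typing import Any

# Three backticks, built programmatically so this snippet can itself sit inside a fenced block.
FENCE = "`" * 3

# Matches the first fenced block, with or without a "json" language tag.
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def parse_json_reply(text: str) -> list[Any]:
    """Parse a model reply into a list, unwrapping a Markdown code fence if one is present."""
    match = _FENCE_RE.search(text)
    payload = match.group(1) if match else text
    data = json.loads(payload.strip())  # raises json.JSONDecodeError on malformed input
    # Normalize a single object to a one-element list.
    return data if isinstance(data, list) else [data]
```

The helper still raises `json.JSONDecodeError` on malformed input, so a caller can fall back to saving the raw response text.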

In [10]:
async def extract_all_products():
    """
    Extract product data from all PDF files and save as JSON.
    """
    print("\n" + "="*70)
    print("📊 EXTRACTING PRODUCTS FROM ALL PDFs")
    print("="*70)
    
    # Get project endpoint
    project_endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT")
    if not project_endpoint:
        raise ValueError("AZURE_AI_PROJECT_ENDPOINT not set in .env file")
    
    # Initialize client
    client = AzureAIAgentClient(
        endpoint=project_endpoint,
        async_credential=AzureCliCredential()
    )
    
    files: list[FileInfo] = []
    vector_store: Optional[VectorStore] = None
    all_products = []
    
    try:
        # 1. Find all PDF files
        input_path = Path(folders['input'])
        pdf_files = list(input_path.glob('*.pdf'))
        
        if not pdf_files:
            print("\n⚠️  No PDF files found!")
            return

        print(f"\n📄 Found {len(pdf_files)} PDF file(s)")

        # 2. Upload all files
        print("\n⬆️  Uploading files to Azure AI...")
        file_ids = []
        for pdf_file in pdf_files:
            print(f"   Uploading: {pdf_file.name} ({pdf_file.stat().st_size / 1024 / 1024:.2f} MB)")
            file = await client.project_client.agents.files.upload_and_poll(
                file_path=str(pdf_file),
                purpose="assistants"
            )
            files.append(file)
            file_ids.append(file.id)
            print(f"   ✅ Uploaded: {file.id}")

        # 3. Create vector store with all files
        print("\n🗂️  Creating vector store with all files...")
        vector_store = await client.project_client.agents.vector_stores.create_and_poll(
            file_ids=file_ids,
            name="all_furniture_catalogs"
        )
        print(f"✅ Vector store created: {vector_store.id}")
        
        # 4. Create file search tool
        file_search_tool = HostedFileSearchTool(
            inputs=[HostedVectorStoreContent(vector_store_id=vector_store.id)]
        )
        
        # 5. Create agent
        print("\n🤖 Creating product extraction agent...")
        async with ChatAgent(
            chat_client=client,
            name="ProductExtractor",
            instructions="""
            You are a product catalog data extraction specialist.
            Extract ALL products from the uploaded documents and return as a JSON array.
            
            For each product, extract:
            {
                "product_name": "Official product name",
                "sku": "Product SKU/model number",
                "price": "Price with currency (e.g., $1,234.00)",
                "description": "Brief description",
                "features": ["feature1", "feature2"],
                "category": "Product category",
                "dimensions": "Dimensions if available",
                "source_file": "Which catalog this came from"
            }
            
            Return ONLY valid JSON array, no additional text.
            """,
            tools=file_search_tool,
        ) as agent:
            # 6. Extract products
            query = "Extract ALL products from all uploaded catalogs. Return as JSON array with product_name, sku, price, description, features, category, dimensions, and source_file for each product."
            print("\n🔍 Extracting products...")

            response = await agent.run(query)
            print("\n📦 Extraction complete!")
            
            # 7. Parse and save JSON
            try:
                # Try to extract JSON from response
                response_text = response.text
                
                # Remove markdown code blocks if present
                if "```json" in response_text:
                    response_text = response_text.split("```json")[1].split("```")[0].strip()
                elif "```" in response_text:
                    response_text = response_text.split("```")[1].split("```")[0].strip()
                
                products = json.loads(response_text)
                
                # Save to JSON file
                output_file = Path(folders['output']) / "extracted_products.json"
                with open(output_file, 'w', encoding='utf-8') as f:
                    json.dump(products, f, indent=2, ensure_ascii=False)
                
                print(f"\n✅ Saved {len(products)} products to: {output_file}")
                print("\n📊 Sample of extracted products:")
                for i, product in enumerate(products[:3], 1):
                    print(f"\n{i}. {product.get('product_name', 'Unknown')}")
                    print(f"   SKU: {product.get('sku', 'N/A')}")
                    print(f"   Price: {product.get('price', 'N/A')}")
                
            except json.JSONDecodeError as e:
                print(f"\n⚠️  Could not parse as JSON: {e}")
                print(f"\n📄 Raw response:\n{response.text}")

                # Save raw response
                output_file = Path(folders['output']) / "extraction_response.txt"
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(response.text)
                print(f"\n✅ Saved raw response to: {output_file}")
        
        print("\n" + "="*70)
        print("✅ EXTRACTION COMPLETED!")
        print("="*70)

    except Exception as e:
        print(f"\n❌ Error: {e}")
        import traceback
        traceback.print_exc()
        
    finally:
        # Cleanup
        print("\n🧹 Cleaning up resources...")
        try:
            if vector_store is not None:
                await client.project_client.agents.vector_stores.delete(vector_store.id)
                print("✅ Deleted vector store")
            for file in files:
                await client.project_client.agents.files.delete(file.id)
            if files:
                print(f"✅ Deleted {len(files)} file(s)")
        except Exception as cleanup_error:
            print(f"⚠️  Cleanup warning: {cleanup_error}")
        finally:
            await client.close()
            print("✅ Client closed")

# Run extraction
await extract_all_products()


📊 EXTRACTING PRODUCTS FROM ALL PDFs

📄 Found 3 PDF file(s)

⬆️  Uploading files to Azure AI...
   Uploading: knoll-ReffProfilesVolTwo.pdf (15.99 MB)
   ✅ Uploaded: assistant-3xKNjbP1qn3e2NnyEBTicp
   Uploading: haworth-tables-fixed-height_gsa-price-list-part-2.pdf (9.02 MB)
   ✅ Uploaded: assistant-CsbfYvVB4HowTqt61DnLYj
   Uploading: haworth-tables-fixed-height_gsa-price-list-part-1.pdf (11.54 MB)
   ✅ Uploaded: assistant-RijTVrwzexCa6MuKEppCYj

🗂️  Creating vector store with all files...
✅ Vector store created: vs_0MCEd4Hfu3u6ZoXXBYHh0rIV

🤖 Creating product extraction agent...

🔍 Extracting products...


## Summary

### What We Accomplished

✅ **Proper Azure AI Agent Configuration**
- Used `AZURE_AI_PROJECT_ENDPOINT` from environment
- Created `AzureAIAgentClient` with the project endpoint passed as `endpoint`
- Authenticated with Azure CLI credentials

✅ **Official File Search Pattern**
- Upload files with `client.project_client.agents.files.upload_and_poll()`
- Create vector stores with `client.project_client.agents.vector_stores.create_and_poll()`
- Use `HostedFileSearchTool` with `HostedVectorStoreContent`
- Query documents with natural language

✅ **Best Practices**
- Async context managers for cleanup
- Proper error handling
- Resource cleanup (delete vector stores and files)
- Structured data extraction as JSON

### Next Steps

1. Add your PDF files to `./competitive_analysis/input/`
2. Run the file search test (Step 5)
3. Run full extraction (Step 6)
4. Review extracted data in `./competitive_analysis/output/`
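
For the review in step 4, a short script can summarize the extracted file. This is a sketch that assumes `extracted_products.json` holds a JSON array of objects with an optional `category` key, as requested in the Step 6 agent instructions. `load_products` and `count_by_category` are hypothetical helper names.

```python
import json
from collections import Counter
from pathlib import Path


def load_products(path: Path) -> list[dict]:
    """Read the JSON array written by the extraction step."""
    return json.loads(path.read_text(encoding="utf-8"))


def count_by_category(products: list[dict]) -> Counter:
    """Count products per category; missing or empty categories are grouped as 'Unknown'."""
    return Counter(product.get("category") or "Unknown" for product in products)


if __name__ == "__main__":
    output_file = Path("./competitive_analysis/output/extracted_products.json")
    if output_file.exists():
        for category, count in count_by_category(load_products(output_file)).most_common():
            print(f"{category}: {count}")
    else:
        print(f"No extraction output found at {output_file}")
```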

### Resources

- [Microsoft Agent Framework - File Search Example](https://github.com/microsoft/agent-framework/blob/main/python/samples/getting_started/agents/azure_ai/azure_ai_with_file_search.py)
- [Azure AI Foundry Documentation](https://learn.microsoft.com/azure/ai-foundry/)
- [Agent Framework Documentation](https://github.com/microsoft/agent-framework)