# Synapse Pipeline Configuration and Execution

This notebook covers configuring and running the Azure Synapse pipeline for PDF document processing.

## Pipeline Overview

The `ProcessPDFsWithDocIntelligence` pipeline:
1. Lists PDF files in a blob storage container
2. Iterates through each PDF file
3. Calls the Azure Function to process each document
4. Stores results in Cosmos DB

## Configuration

In [1]:
# ============================================
# CONFIGURATION - UPDATE THESE VALUES
# ============================================

$SUBSCRIPTION_ID = "363ef5d1-0e77-4594-a530-f51af23dbf8c"
$SYNAPSE_WORKSPACE = "synapse-sandbox-east2-dlz"
$RESOURCE_GROUP = "rg-sandbox-demo-east2"

# Storage account where PDFs are stored
$STORAGE_ACCOUNT = "aimldatastore"
# ============================================

az account set --subscription $SUBSCRIPTION_ID

Write-Host "Configuration set" -ForegroundColor Green
Write-Host "  Synapse: $SYNAPSE_WORKSPACE"
Write-Host "  Storage: $STORAGE_ACCOUNT"

Configuration set
  Synapse: synapse-sandbox-east2-dlz
  Storage: aimldatastore


## Cosmos DB Setup (Required)

Before running the pipeline, ensure the Cosmos DB database and container exist. The Function App stores extracted document data in Cosmos DB.

**If you get the error** `Owner resource does not exist`, run the cells below to create the database and container.

In [2]:
# ============================================
# COSMOS DB CONFIGURATION - UPDATE THESE VALUES
# ============================================

$COSMOS_ACCOUNT = "cosmosdb-dlz-east2-sandbox"      # Your Cosmos DB account name
$COSMOS_RG = "rg-dlz-cosmosdb-east2-sandbox"             # Resource group containing Cosmos DB
$COSMOS_DATABASE = "DocumentsDB"              # Database name (default)
$COSMOS_CONTAINER = "ExtractedDocuments"      # Container name (default)

# ============================================

Write-Host "Cosmos DB Configuration:" -ForegroundColor Cyan
Write-Host "  Account: $COSMOS_ACCOUNT"
Write-Host "  Database: $COSMOS_DATABASE"
Write-Host "  Container: $COSMOS_CONTAINER"

Cosmos DB Configuration:
  Account: cosmosdb-dlz-east2-sandbox
  Database: DocumentsDB
  Container: ExtractedDocuments


In [None]:
# Check if Cosmos DB database exists
Write-Host "Checking Cosmos DB database..." -ForegroundColor Cyan

$dbExists = az cosmosdb sql database show `
    --account-name $COSMOS_ACCOUNT `
    --resource-group $COSMOS_RG `
    --name $COSMOS_DATABASE `
    --query "name" -o tsv 2>$null

if ($dbExists) {
    Write-Host "  Database '$COSMOS_DATABASE' exists" -ForegroundColor Green
} else {
    Write-Host "  Database '$COSMOS_DATABASE' does not exist - will create" -ForegroundColor Yellow
    
    az cosmosdb sql database create `
        --account-name $COSMOS_ACCOUNT `
        --resource-group $COSMOS_RG `
        --name $COSMOS_DATABASE `
        --output none
    
    Write-Host "  Database '$COSMOS_DATABASE' created!" -ForegroundColor Green
}

In [None]:
# Check if Cosmos DB container exists
Write-Host "Checking Cosmos DB container..." -ForegroundColor Cyan

$containerExists = az cosmosdb sql container show `
    --account-name $COSMOS_ACCOUNT `
    --resource-group $COSMOS_RG `
    --database-name $COSMOS_DATABASE `
    --name $COSMOS_CONTAINER `
    --query "name" -o tsv 2>$null

if ($containerExists) {
    Write-Host "  Container '$COSMOS_CONTAINER' exists" -ForegroundColor Green
} else {
    Write-Host "  Container '$COSMOS_CONTAINER' does not exist - will create" -ForegroundColor Yellow
    
    # Create container with partition key /sourceFile (required by the Function App)
    az cosmosdb sql container create `
        --account-name $COSMOS_ACCOUNT `
        --resource-group $COSMOS_RG `
        --database-name $COSMOS_DATABASE `
        --name $COSMOS_CONTAINER `
        --partition-key-path "/sourceFile" `
        --output none
    
    Write-Host "  Container '$COSMOS_CONTAINER' created with partition key '/sourceFile'!" -ForegroundColor Green
}

Write-Host "`nCosmos DB setup complete!" -ForegroundColor Green

## Function App Setup (Required)

The Function App requires proper environment variables and managed identity permissions to:
1. Read PDFs from Blob Storage (via SAS tokens)
2. Process documents with Document Intelligence
3. Write results to Cosmos DB

**Run these cells to verify and configure the Function App.**
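
When printing app settings for verification, values such as connection strings should not be echoed to the notebook output. The following is a minimal sketch of a masking helper, not the notebook's own code: the function name `Get-SettingDisplayValue` and the name pattern are assumptions chosen for illustration.

```powershell
# Hypothetical helper: returns a placeholder for settings whose names look sensitive.
function Get-SettingDisplayValue {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [string] $Value
    )
    # Assumed pattern: names containing any of these words are treated as sensitive.
    # -match is case-insensitive, so "AzureWebJobsStorage" matches "STORAGE".
    if ($Name -match "KEY|SECRET|CONNECTION|STORAGE") {
        return "***configured***"
    }
    return $Value
}

# Non-sensitive values pass through; sensitive ones are masked
Get-SettingDisplayValue -Name "COSMOS_DATABASE" -Value "DocumentsDB"
Get-SettingDisplayValue -Name "AzureWebJobsStorage" -Value "AccountKey=placeholder"
```

Matching on the setting name rather than the value keeps the check simple, at the cost of needing the pattern updated when a new sensitive setting is introduced.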

In [3]:
# ============================================
# FUNCTION APP & DOCUMENT INTELLIGENCE CONFIGURATION
# UPDATE THESE VALUES
# ============================================

$FUNC_APP_NAME = "docproc-func-dev"           # Your Function App name
$FUNC_RG = "rg-docprocessing-functions-dev"                # Resource group containing Function App
$DOC_INTEL_NAME = "docservendpointdev"          # Document Intelligence account name
$DOC_INTEL_RG = "rg-dlz-aiml-stack-dev"              # Resource group containing Doc Intel
$KEY_VAULT_NAME = "aiml-stack-keyvault-dev"              # Key Vault for secrets
$KEY_VAULT_RG = "rg-dlz-aiml-stack-dev"               # Resource group containing Key Vault

# ============================================

Write-Host "Function App Configuration:" -ForegroundColor Cyan
Write-Host "  Function App: $FUNC_APP_NAME"
Write-Host "  Function RG: $FUNC_RG"
Write-Host "  Doc Intel: $DOC_INTEL_NAME"
Write-Host "  Key Vault: $KEY_VAULT_NAME"

Function App Configuration:
  Function App: docproc-func-dev
  Function RG: rg-docprocessing-functions-dev
  Doc Intel: docservendpointdev
  Key Vault: aiml-stack-keyvault-dev


In [None]:
# Check current Function App environment variables
Write-Host "Checking Function App environment variables..." -ForegroundColor Cyan

$requiredVars = @(
    "DOC_INTEL_ENDPOINT",
    "COSMOS_ENDPOINT",
    "COSMOS_DATABASE",
    "COSMOS_CONTAINER",
    "STORAGE_CONNECTION_STRING",
    "AzureWebJobsStorage"
)

$currentSettings = az functionapp config appsettings list `
    --name $FUNC_APP_NAME `
    --resource-group $FUNC_RG `
    --output json 2>$null | ConvertFrom-Json

if (-not $currentSettings) {
    Write-Host "  ERROR: Could not retrieve Function App settings" -ForegroundColor Red
    Write-Host "  Check that FUNC_APP_NAME and FUNC_RG are correct" -ForegroundColor Yellow
} else {
    $settingsHash = @{}
    $currentSettings | ForEach-Object { $settingsHash[$_.name] = $_.value }
    
    Write-Host "`n  Required Environment Variables:" -ForegroundColor Yellow
    $allConfigured = $true
    
    foreach ($var in $requiredVars) {
        if ($settingsHash.ContainsKey($var) -and $settingsHash[$var]) {
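            # Include STORAGE so AzureWebJobsStorage (a connection string with an account key) is masked too
            $displayValue = if ($var -match "KEY|SECRET|CONNECTION|STORAGE") { 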
                "***configured***" 
            } else { 
                $settingsHash[$var] 
            }
            Write-Host "    [OK] $var = $displayValue" -ForegroundColor Green
        } else {
            Write-Host "    [MISSING] $var" -ForegroundColor Red
            $allConfigured = $false
        }
    }
    
    if ($allConfigured) {
        Write-Host "`n  All required variables are configured!" -ForegroundColor Green
    } else {
        Write-Host "`n  Some variables are missing - run the next cell to configure them" -ForegroundColor Yellow
    }
}

In [None]:
# Configure Function App environment variables
# Run this cell to set all required environment variables

$STORAGE_RESOURCE_GROUP = "rg-dlz-aiml-stack-dev"  # Resource group for the storage account

Write-Host "Configuring Function App environment variables..." -ForegroundColor Cyan

# Get resource endpoints
$docIntelEndpoint = az cognitiveservices account show `
    --name $DOC_INTEL_NAME `
    --resource-group $DOC_INTEL_RG `
    --query "properties.endpoint" -o tsv

$cosmosEndpoint = "https://$COSMOS_ACCOUNT.documents.azure.com:443/"

$storageConnStr = az storage account show-connection-string `
    --name $STORAGE_ACCOUNT `
    --resource-group $STORAGE_RESOURCE_GROUP `
    --query connectionString -o tsv

Write-Host "  Doc Intel Endpoint: $docIntelEndpoint" -ForegroundColor Gray
Write-Host "  Cosmos Endpoint: $cosmosEndpoint" -ForegroundColor Gray

# Set all required environment variables
Write-Host "`nSetting environment variables..." -ForegroundColor Yellow

az functionapp config appsettings set `
    --name $FUNC_APP_NAME `
    --resource-group $FUNC_RG `
    --settings `
        "DOC_INTEL_ENDPOINT=$docIntelEndpoint" `
        "COSMOS_ENDPOINT=$cosmosEndpoint" `
        "COSMOS_DATABASE=$COSMOS_DATABASE" `
        "COSMOS_CONTAINER=$COSMOS_CONTAINER" `
        "STORAGE_CONNECTION_STRING=$storageConnStr" `
    --output none

Write-Host "`nEnvironment variables configured!" -ForegroundColor Green
Write-Host "  DOC_INTEL_ENDPOINT = $docIntelEndpoint"
Write-Host "  COSMOS_ENDPOINT = $cosmosEndpoint"
Write-Host "  COSMOS_DATABASE = $COSMOS_DATABASE"
Write-Host "  COSMOS_CONTAINER = $COSMOS_CONTAINER"
Write-Host "  STORAGE_CONNECTION_STRING = ***configured***"

### Managed Identity Permissions

The Function App uses managed identity to access Azure resources. Required role assignments:

| Resource | Role | Purpose |
|----------|------|---------|
| Cosmos DB | Cosmos DB Built-in Data Contributor | Read/write documents |
| Document Intelligence | Cognitive Services User | Process documents |
| Storage Account | Storage Blob Data Reader | Read PDFs (for SAS generation) |
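
The scope strings used when checking and assigning Azure RBAC roles follow the standard Azure resource ID layout. Here is a minimal sketch of a helper that builds them, assuming a hypothetical function name `Get-ResourceScope` (it is not part of this notebook or the Azure CLI) and placeholder values in the example call:

```powershell
# Hypothetical helper: builds an Azure resource ID to use as a role assignment scope.
function Get-ResourceScope {
    param(
        [Parameter(Mandatory)] [string] $SubscriptionId,
        [Parameter(Mandatory)] [string] $ResourceGroup,
        [Parameter(Mandatory)] [string] $Provider,      # e.g. Microsoft.Storage
        [Parameter(Mandatory)] [string] $ResourceType,  # e.g. storageAccounts
        [Parameter(Mandatory)] [string] $Name
    )
    return "/subscriptions/$SubscriptionId/resourceGroups/$ResourceGroup/providers/$Provider/$ResourceType/$Name"
}

# Example with placeholder values
$scope = Get-ResourceScope -SubscriptionId "00000000-0000-0000-0000-000000000000" `
    -ResourceGroup "my-rg" -Provider "Microsoft.Storage" `
    -ResourceType "storageAccounts" -Name "mystorage"
Write-Host $scope
```

This applies to the Document Intelligence and Storage rows. The Cosmos DB data plane role is assigned with `az cosmosdb sql role assignment create`, which takes a scope relative to the account (`/` in this notebook) rather than a full resource ID.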

In [None]:
# Get your current Azure CLI user's object ID
$currentUser = az ad signed-in-user show --query id -o tsv

Write-Host "Your Azure AD Object ID: $currentUser" -ForegroundColor Cyan

# Assign Cosmos DB Built-in Data Reader role (for querying)
az cosmosdb sql role assignment create `
    --account-name $COSMOS_ACCOUNT `
    --resource-group $COSMOS_RG `
    --role-definition-name "Cosmos DB Built-in Data Reader" `
    --principal-id $currentUser `
    --scope "/" `
    --output none

Write-Host "Cosmos DB Data Reader role assigned!" -ForegroundColor Green
Write-Host "Wait 1-2 minutes for propagation, then retry the query." -ForegroundColor Yellow

In [None]:
# Check current managed identity role assignments
Write-Host "Checking Function App managed identity permissions..." -ForegroundColor Cyan

# Get Function App managed identity
$funcIdentity = az functionapp identity show `
    --name $FUNC_APP_NAME `
    --resource-group $FUNC_RG `
    --query principalId -o tsv

if (-not $funcIdentity) {
    Write-Host "  ERROR: Function App has no managed identity enabled!" -ForegroundColor Red
    Write-Host "  Run the next cell to enable it" -ForegroundColor Yellow
} else {
    Write-Host "  Function App Identity: $funcIdentity" -ForegroundColor Gray
    
    # Check Cosmos DB permissions
    # Cosmos DB data plane roles are not Azure RBAC roles, so `az role assignment list`
    # does not return them; list the account's SQL role assignments instead.
    Write-Host "`n  Checking Cosmos DB permissions..." -ForegroundColor Yellow
    $cosmosRoleIds = az cosmosdb sql role assignment list `
        --account-name $COSMOS_ACCOUNT `
        --resource-group $COSMOS_RG `
        --query "[?principalId=='$funcIdentity'].roleDefinitionId" -o tsv 2>$null
    
    # Built-in Data Contributor has the fixed role definition ID ending in ...0002
    if ($cosmosRoleIds -match '00000000-0000-0000-0000-000000000002$') {
        Write-Host "    [OK] Cosmos DB: Has write permissions" -ForegroundColor Green
    } else {
        Write-Host "    [MISSING] Cosmos DB: No write permissions" -ForegroundColor Red
    }
    
    # Check Doc Intel permissions  
    Write-Host "  Checking Document Intelligence permissions..." -ForegroundColor Yellow
    $docIntelScope = "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$DOC_INTEL_RG/providers/Microsoft.CognitiveServices/accounts/$DOC_INTEL_NAME"
    $docIntelRoles = az role assignment list --assignee $funcIdentity --scope $docIntelScope --query "[].roleDefinitionName" -o json 2>$null | ConvertFrom-Json
    
    if ($docIntelRoles -contains "Cognitive Services User") {
        Write-Host "    [OK] Document Intelligence: Has access" -ForegroundColor Green
    } else {
        Write-Host "    [MISSING] Document Intelligence: No access" -ForegroundColor Red
    }
    
    # Check Storage permissions
    Write-Host "  Checking Storage permissions..." -ForegroundColor Yellow
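    # The storage account may live in a different resource group than Synapse:
    # use $STORAGE_RESOURCE_GROUP when it has been set, otherwise fall back to $RESOURCE_GROUP
    $storageRg = if ($STORAGE_RESOURCE_GROUP) { $STORAGE_RESOURCE_GROUP } else { $RESOURCE_GROUP }
    $storageScope = "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$storageRg/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"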
    $storageRoles = az role assignment list --assignee $funcIdentity --scope $storageScope --query "[].roleDefinitionName" -o json 2>$null | ConvertFrom-Json
    
    if ($storageRoles -contains "Storage Blob Data Reader" -or $storageRoles -contains "Storage Blob Data Contributor") {
        Write-Host "    [OK] Storage: Has read permissions" -ForegroundColor Green
    } else {
        Write-Host "    [MISSING] Storage: No read permissions" -ForegroundColor Red
    }
}

In [None]:
# Configure managed identity and assign required permissions
# Run this cell to enable managed identity and assign all required roles
Write-Host "Configuring Function App managed identity and permissions..." -ForegroundColor Cyan

# Step 1: Enable system-assigned managed identity if not already enabled
Write-Host "`n1. Enabling managed identity..." -ForegroundColor Yellow
$identity = az functionapp identity assign `
    --name $FUNC_APP_NAME `
    --resource-group $FUNC_RG `
    --query principalId -o tsv

Write-Host "   Identity: $identity" -ForegroundColor Gray

# Step 2: Assign Cosmos DB permissions (Built-in Data Contributor)
Write-Host "`n2. Assigning Cosmos DB permissions..." -ForegroundColor Yellow
$cosmosScope = "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$COSMOS_RG/providers/Microsoft.DocumentDB/databaseAccounts/$COSMOS_ACCOUNT"

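# Use the built-in Cosmos DB data plane role
az cosmosdb sql role assignment create `
    --account-name $COSMOS_ACCOUNT `
    --resource-group $COSMOS_RG `
    --role-definition-name "Cosmos DB Built-in Data Contributor" `
    --principal-id $identity `
    --scope "/" `
    --output none 2>$null

if ($LASTEXITCODE -eq 0) {
    Write-Host "   Cosmos DB: Assigned (data plane role)" -ForegroundColor Green
} else {
    # Fall back to the control plane role only if the data plane assignment failed.
    # This role manages the account; it does not grant data plane RBAC access.
    az role assignment create `
        --assignee $identity `
        --role "DocumentDB Account Contributor" `
        --scope $cosmosScope `
        --output none 2>$null

    Write-Host "   Cosmos DB: Data plane assignment failed; control plane role assigned instead" -ForegroundColor Yellow
}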

# Step 3: Assign Document Intelligence permissions
Write-Host "`n3. Assigning Document Intelligence permissions..." -ForegroundColor Yellow
$docIntelScope = "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$DOC_INTEL_RG/providers/Microsoft.CognitiveServices/accounts/$DOC_INTEL_NAME"

az role assignment create `
    --assignee $identity `
    --role "Cognitive Services User" `
    --scope $docIntelScope `
    --output none 2>$null

Write-Host "   Document Intelligence: Assigned" -ForegroundColor Green

# Step 4: Assign Storage permissions
Write-Host "`n4. Assigning Storage permissions..." -ForegroundColor Yellow
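# The storage account may live in a different resource group than Synapse:
# use $STORAGE_RESOURCE_GROUP when it has been set, otherwise fall back to $RESOURCE_GROUP
$storageRg = if ($STORAGE_RESOURCE_GROUP) { $STORAGE_RESOURCE_GROUP } else { $RESOURCE_GROUP }
$storageScope = "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$storageRg/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"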

az role assignment create `
    --assignee $identity `
    --role "Storage Blob Data Reader" `
    --scope $storageScope `
    --output none 2>$null

Write-Host "   Storage: Assigned" -ForegroundColor Green

Write-Host "`nAll permissions configured!" -ForegroundColor Green
Write-Host "Note: Role assignments may take 1-2 minutes to propagate" -ForegroundColor Yellow

## Pipeline Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|  
| `containerName` | Blob container name | `blobstore` |
| `sourceFolderPath` | Folder path within container | `pdfs` |
| `storageAccountUrl` | Storage account blob endpoint | `https://aimldatastore.blob.core.windows.net` |
| `modelId` | Document Intelligence model ID | `demousda-cropform` |

**Note:** The pipeline automatically handles PDF filenames with spaces and special characters.
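
How the pipeline escapes filenames is not shown in this notebook. To illustrate what handling spaces and special characters involves, here is a minimal sketch that builds a blob URL from the parameters above by percent-encoding each path segment with .NET's `[System.Uri]::EscapeDataString`. The function name `Get-EncodedBlobUrl` and the sample filename are hypothetical.

```powershell
# Hypothetical helper: builds a blob URL with each path segment percent-encoded.
function Get-EncodedBlobUrl {
    param(
        [Parameter(Mandatory)] [string] $StorageAccountUrl,
        [Parameter(Mandatory)] [string] $ContainerName,
        [Parameter(Mandatory)] [string] $BlobPath
    )
    # Encode each segment separately so the "/" separators are preserved
    $encodedSegments = $BlobPath.Split('/') | ForEach-Object { [System.Uri]::EscapeDataString($_) }
    $encodedPath = $encodedSegments -join '/'
    return "$($StorageAccountUrl.TrimEnd('/'))/$ContainerName/$encodedPath"
}

# The space becomes %20 and "#" becomes %23
Get-EncodedBlobUrl -StorageAccountUrl "https://aimldatastore.blob.core.windows.net" `
    -ContainerName "blobstore" -BlobPath "pdfs/Crop Form #12.pdf"
```

Without encoding, a `#` in the filename would be read as the start of a URL fragment and the rest of the name would be dropped from the request path.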

## Check Pipeline Status

In [None]:
# List pipelines in workspace
Write-Host "Pipelines in workspace:" -ForegroundColor Cyan

az synapse pipeline list `
    --workspace-name $SYNAPSE_WORKSPACE `
    --query "[].{Name:name, Type:type}" `
    --output table

In [None]:
# Check linked services
Write-Host "Linked services:" -ForegroundColor Cyan

az synapse linked-service list `
    --workspace-name $SYNAPSE_WORKSPACE `
    --query "[].{Name:name, Type:properties.type}" `
    --output table

## Trigger Pipeline Run

In [None]:
# ============================================
# PIPELINE PARAMETERS
# ============================================

$CONTAINER_NAME = "blobstore"
$SOURCE_FOLDER = "pdfs"
$MODEL_ID = "demousda-cropform"  # or your custom model ID

$STORAGE_URL = "https://$STORAGE_ACCOUNT.blob.core.windows.net"

# ============================================

Write-Host "Pipeline parameters:" -ForegroundColor Yellow
Write-Host "  Container: $CONTAINER_NAME"
Write-Host "  Source folder: $SOURCE_FOLDER"
Write-Host "  Model: $MODEL_ID"
Write-Host "  Storage URL: $STORAGE_URL"

In [None]:
# Trigger pipeline run using temp file (Windows-compatible)
Write-Host "Triggering pipeline run..." -ForegroundColor Cyan

# Build parameters as JSON object
$paramsObject = @{
    containerName = $CONTAINER_NAME
    sourceFolderPath = $SOURCE_FOLDER
    storageAccountUrl = $STORAGE_URL
    modelId = $MODEL_ID
}

# Write JSON to temp file (most reliable method for Windows PowerShell + Azure CLI)
$tempFile = [System.IO.Path]::GetTempFileName()
$paramsObject | ConvertTo-Json -Compress | Set-Content -Path $tempFile -Encoding UTF8

Write-Host "Parameters: $($paramsObject | ConvertTo-Json -Compress)" -ForegroundColor Gray

try {
    # Use @filename syntax to pass JSON from file
    $runResponse = az synapse pipeline create-run `
        --workspace-name $SYNAPSE_WORKSPACE `
        --name ProcessPDFsWithDocIntelligence `
        --parameters "@$tempFile" `
        --output json | ConvertFrom-Json
    
    $RUN_ID = $runResponse.runId
    
    if ($RUN_ID) {
        Write-Host "`nPipeline triggered successfully!" -ForegroundColor Green
        Write-Host "Run ID: $RUN_ID" -ForegroundColor Yellow
    } else {
        Write-Host "`nWarning: Pipeline may have triggered but no Run ID returned" -ForegroundColor Yellow
    }
} catch {
    Write-Host "`nError triggering pipeline: $_" -ForegroundColor Red
} finally {
    # Clean up temp file
    Remove-Item -Path $tempFile -Force -ErrorAction SilentlyContinue
}

## Monitor Pipeline Run

In [None]:
# Check pipeline run status
Write-Host "Checking pipeline status..." -ForegroundColor Cyan

if (-not $RUN_ID) {
    Write-Host "No run ID. Trigger a pipeline first." -ForegroundColor Yellow
} else {
    $status = az synapse pipeline-run show `
        --workspace-name $SYNAPSE_WORKSPACE `
        --run-id $RUN_ID `
        --output json | ConvertFrom-Json
    
    Write-Host "`nPipeline Run Status:" -ForegroundColor Green
    Write-Host "  Run ID: $($status.runId)"
    Write-Host "  Pipeline: $($status.pipelineName)"
    Write-Host "  Status: $($status.status)" -ForegroundColor $(if ($status.status -eq 'Succeeded') { 'Green' } elseif ($status.status -eq 'Failed') { 'Red' } else { 'Yellow' })
    Write-Host "  Start: $($status.runStart)"
    Write-Host "  End: $($status.runEnd)"
    Write-Host "  Duration: $($status.durationInMs) ms"
}

In [None]:
# Poll until completion
Write-Host "Monitoring pipeline run (polling every 10 seconds)..." -ForegroundColor Cyan
Write-Host "Press Ctrl+C to stop monitoring" -ForegroundColor Yellow

if ($RUN_ID) {
    do {
        Start-Sleep -Seconds 10
        
        $status = az synapse pipeline-run show `
            --workspace-name $SYNAPSE_WORKSPACE `
            --run-id $RUN_ID `
            --query status -o tsv
        
        $timestamp = Get-Date -Format "HH:mm:ss"
        Write-Host "[$timestamp] Status: $status"
        
    } while ($status -in @("Queued", "InProgress", "Canceling"))
    
    Write-Host "`nPipeline completed with status: $status" -ForegroundColor $(if ($status -eq 'Succeeded') { 'Green' } else { 'Red' })
}

## View Recent Pipeline Runs

In [None]:
# Query recent pipeline runs
Write-Host "Recent pipeline runs (last 7 days):" -ForegroundColor Cyan

$lastWeek = (Get-Date).ToUniversalTime().AddDays(-7).ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ")
$now = (Get-Date).ToUniversalTime().AddMinutes(5).ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ")

Write-Host "  Querying from $lastWeek to $now" -ForegroundColor Gray

$runs = az synapse pipeline-run query-by-workspace `
    --workspace-name $SYNAPSE_WORKSPACE `
    --last-updated-after $lastWeek `
    --last-updated-before $now `
    --output json 2>$null | ConvertFrom-Json

if ($runs.value -and $runs.value.Count -gt 0) {
    Write-Host "`n  Found $($runs.value.Count) pipeline run(s):" -ForegroundColor Green
    $runs.value | ForEach-Object {
        $statusColor = switch ($_.status) {
            'Succeeded' { 'Green' }
            'Failed' { 'Red' }
            'InProgress' { 'Yellow' }
            default { 'White' }
        }
        Write-Host "    [$($_.status)] $($_.pipelineName) - $($_.runId.Substring(0,8))... ($($_.runStart))" -ForegroundColor $statusColor
    }
} else {
    Write-Host "`n  No pipeline runs found in the last 7 days" -ForegroundColor Yellow
    Write-Host "  Trigger a pipeline using the cells above" -ForegroundColor Gray
}

## Debugging: Check What Was Processed

If the pipeline succeeds but no documents appear in Cosmos DB, use these cells to diagnose the issue.
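
The checks narrow the problem down in order: source PDFs, ForEach iterations, then Cosmos DB documents. Here is a minimal sketch of that decision logic, assuming the three counts have already been collected. The function name `Get-LikelyCause` and its parameters are hypothetical, and the messages paraphrase the hints printed by the diagnostic cells rather than quoting any tool output.

```powershell
# Hypothetical helper: maps the three diagnostic counts to the most likely cause.
function Get-LikelyCause {
    param(
        [int] $PdfCount,         # PDFs found in the source folder
        [int] $IterationCount,   # ForEach iterations reported by the activity run
        [int] $DocumentCount     # documents found in the Cosmos DB container
    )
    if ($PdfCount -eq 0) {
        return "No PDFs in the source folder: the pipeline had nothing to process."
    }
    if ($IterationCount -eq 0) {
        return "PDFs exist but the ForEach activity ran zero iterations: check the container and folder parameters."
    }
    if ($DocumentCount -eq 0) {
        return "PDFs were iterated but nothing was written: check the Function App logs and its Cosmos DB role."
    }
    return "Documents are present: no problem detected by these checks."
}

# Example: three PDFs were iterated but no documents were stored
Get-LikelyCause -PdfCount 3 -IterationCount 3 -DocumentCount 0
```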

In [None]:
# Step 1: Check if there are PDF files in the source folder
Write-Host "Checking for PDF files in source folder..." -ForegroundColor Cyan
Write-Host "  Container: $CONTAINER_NAME" -ForegroundColor Gray
Write-Host "  Folder: $SOURCE_FOLDER" -ForegroundColor Gray

$blobs = az storage blob list `
    --account-name $STORAGE_ACCOUNT `
    --container-name $CONTAINER_NAME `
    --prefix "$SOURCE_FOLDER/" `
    --auth-mode login `
    --query "[?ends_with(name, '.pdf')].{Name:name, Size:properties.contentLength}" `
    --output json | ConvertFrom-Json

if ($blobs.Count -eq 0) {
    Write-Host "`n  NO PDF FILES FOUND!" -ForegroundColor Red
    Write-Host "  The pipeline succeeded but had nothing to process." -ForegroundColor Yellow
    Write-Host "  Upload PDF files to: $CONTAINER_NAME/$SOURCE_FOLDER/" -ForegroundColor Yellow
} else {
    Write-Host "`n  Found $($blobs.Count) PDF file(s):" -ForegroundColor Green
    $blobs | ForEach-Object { 
        $sizeKB = [math]::Round($_.Size / 1024, 2)
        Write-Host "    - $($_.Name) ($sizeKB KB)" 
    }
}

In [None]:
# Step 2: Get detailed activity run info (check if ForEach processed anything)
Write-Host "Getting detailed activity run information..." -ForegroundColor Cyan

if ($RUN_ID) {
    # Get pipeline run info to determine time window
    $pipelineRun = az synapse pipeline-run show `
        --workspace-name $SYNAPSE_WORKSPACE `
        --run-id $RUN_ID `
        --output json | ConvertFrom-Json
    
    # Use UTC time with proper format for Synapse API
    $runStartUtc = [DateTime]::Parse($pipelineRun.runStart).ToUniversalTime()
    $startTime = $runStartUtc.AddMinutes(-10).ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ")
    $endTime = (Get-Date).ToUniversalTime().AddMinutes(10).ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ")
    
    Write-Host "  Pipeline started: $($pipelineRun.runStart)" -ForegroundColor Gray
    Write-Host "  Querying activities from $startTime to $endTime" -ForegroundColor Gray
    
    $activitiesJson = az synapse activity-run query-by-pipeline-run `
        --workspace-name $SYNAPSE_WORKSPACE `
        --name ProcessPDFsWithDocIntelligence `
        --run-id $RUN_ID `
        --last-updated-after $startTime `
        --last-updated-before $endTime `
        --output json 2>&1
    
    try {
        $activities = $activitiesJson | ConvertFrom-Json
        
        if (-not $activities.value -or $activities.value.Count -eq 0) {
            Write-Host "`n  No activity runs found" -ForegroundColor Yellow
            Write-Host "  This can happen if:" -ForegroundColor Gray
            Write-Host "    - The pipeline just finished (activity data may take a moment to appear)" -ForegroundColor Gray
            Write-Host "    - The pipeline had no activities to execute" -ForegroundColor Gray
            Write-Host "`n  Try checking in Azure Synapse Studio for more details" -ForegroundColor Cyan
        } else {
            Write-Host "`n  Found $($activities.value.Count) activity run(s):" -ForegroundColor Green
            
            foreach ($activity in $activities.value) {
                $statusColor = if ($activity.status -eq 'Succeeded') { 'Green' } else { 'Red' }
                Write-Host "`n  Activity: $($activity.activityName)" -ForegroundColor Yellow
                Write-Host "    Type: $($activity.activityType)"
                Write-Host "    Status: $($activity.status)" -ForegroundColor $statusColor
                Write-Host "    Duration: $($activity.durationInMs) ms"
                
                # Check ForEach iteration count
                if ($activity.activityType -eq "ForEach" -and $activity.output) {
                    if ($null -ne $activity.output.iterationCount) {
                        $iterationCount = $activity.output.iterationCount
                        $iterColor = if ($iterationCount -gt 0) { 'Green' } else { 'Red' }
                        Write-Host "    Iterations: $iterationCount" -ForegroundColor $iterColor
                        if ($iterationCount -eq 0) {
                            Write-Host "    WARNING: No PDFs were processed!" -ForegroundColor Red
                        }
                    }
                }
                
                # Show errors if any
                if ($activity.error) {
                    Write-Host "    Error: $($activity.error.message)" -ForegroundColor Red
                }
            }
        }
    } catch {
        Write-Host "`n  Could not parse activity results" -ForegroundColor Yellow
        Write-Host "  Raw output: $activitiesJson" -ForegroundColor Gray
    }
} else {
    Write-Host "No run ID set. Trigger a pipeline first." -ForegroundColor Yellow
}

In [None]:
# Step 3: Check documents in Cosmos DB
Write-Host "Checking for documents in Cosmos DB..." -ForegroundColor Cyan

# First verify database and container exist
$dbCheck = az cosmosdb sql database show `
    --account-name $COSMOS_ACCOUNT `
    --resource-group $COSMOS_RG `
    --name $COSMOS_DATABASE `
    --query "name" -o tsv 2>$null

$containerCheck = az cosmosdb sql container show `
    --account-name $COSMOS_ACCOUNT `
    --resource-group $COSMOS_RG `
    --database-name $COSMOS_DATABASE `
    --name $COSMOS_CONTAINER `
    --query "name" -o tsv 2>$null

if (-not $dbCheck) {
    Write-Host "  Database '$COSMOS_DATABASE' does not exist!" -ForegroundColor Red
} elseif (-not $containerCheck) {
    Write-Host "  Container '$COSMOS_CONTAINER' does not exist!" -ForegroundColor Red
} else {
    Write-Host "  Database: $dbCheck" -ForegroundColor Green
    Write-Host "  Container: $containerCheck" -ForegroundColor Green
    
    # Get account key for querying
    Write-Host "`n  Querying documents..." -ForegroundColor Gray
    $accountKey = az cosmosdb keys list `
        --name $COSMOS_ACCOUNT `
        --resource-group $COSMOS_RG `
        --query "primaryMasterKey" -o tsv 2>$null
    
    if ($accountKey) {
        # Use REST API to read documents (ReadFeed - simpler than query)
        $cosmosEndpoint = "https://$COSMOS_ACCOUNT.documents.azure.com:443/"
        
        # Generate authorization header for ReadFeed
        $date = [DateTime]::UtcNow.ToString("r")
        $resourceType = "docs"
        $resourceLink = "dbs/$COSMOS_DATABASE/colls/$COSMOS_CONTAINER"
        
        # Create signature for GET (ReadFeed)
        $keyBytes = [System.Convert]::FromBase64String($accountKey)
        $text = "get`n$resourceType`n$resourceLink`n$($date.ToLower())`n`n"
        $hmac = New-Object System.Security.Cryptography.HMACSHA256
        $hmac.Key = $keyBytes
        $hash = $hmac.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($text))
        $signature = [System.Convert]::ToBase64String($hash)
        # EscapeDataString is available without loading the System.Web assembly
        $authHeader = [System.Uri]::EscapeDataString("type=master&ver=1.0&sig=$signature")
        
        $headers = @{
            "Authorization" = $authHeader
            "x-ms-date" = $date
            "x-ms-version" = "2018-12-31"
            "x-ms-max-item-count" = "10"
        }
        
        try {
            $uri = "$cosmosEndpoint$resourceLink/docs"
            $response = Invoke-RestMethod -Uri $uri -Method Get -Headers $headers
            
            $count = $response.Documents.Count
            # _count is the number of documents in this response page, not the container total
            $pageCount = $response.'_count'
            
            if ($count -gt 0) {
                Write-Host "`n  Found documents in Cosmos DB (showing up to 10):" -ForegroundColor Green
                
                $response.Documents | ForEach-Object {
                    $status = if ($_.status) { $_.status } else { "unknown" }
                    $sourceFile = if ($_.sourceFile) { $_.sourceFile } else { $_.id }
                    Write-Host "    - $sourceFile [$status]" -ForegroundColor Cyan
                }
                
                # A full page suggests the container may hold more documents than shown
                if ($pageCount -ge 10) {
                    Write-Host "`n  ... the container may contain more documents" -ForegroundColor Gray
                }
            } else {
                Write-Host "`n  Container exists but NO DOCUMENTS found!" -ForegroundColor Yellow
                Write-Host "`n  Possible causes:" -ForegroundColor Yellow
                Write-Host "    1. Pipeline ran but Function App failed to write to Cosmos DB" -ForegroundColor Gray
                Write-Host "    2. Function App missing 'Cosmos DB Built-in Data Contributor' role" -ForegroundColor Gray
                Write-Host "    3. Check Function App logs in Azure Portal for errors" -ForegroundColor Gray
            }
        } catch {
            Write-Host "`n  REST API read failed: $($_.Exception.Message)" -ForegroundColor Yellow
            Write-Host "`n  Verify documents in Azure Portal instead:" -ForegroundColor Cyan
            Write-Host "    1. Go to Cosmos DB account: $COSMOS_ACCOUNT" -ForegroundColor Gray
            Write-Host "    2. Open Data Explorer" -ForegroundColor Gray
            Write-Host "    3. Navigate to $COSMOS_DATABASE > $COSMOS_CONTAINER" -ForegroundColor Gray
        }
    } else {
        Write-Host "`n  Could not retrieve Cosmos DB key" -ForegroundColor Yellow
        Write-Host "  Verify documents in Azure Portal:" -ForegroundColor Cyan
        Write-Host "    1. Go to Cosmos DB account: $COSMOS_ACCOUNT" -ForegroundColor Gray
        Write-Host "    2. Open Data Explorer" -ForegroundColor Gray
        Write-Host "    3. Navigate to $COSMOS_DATABASE > $COSMOS_CONTAINER" -ForegroundColor Gray
    }
}

### DELETE: Clean Up Cosmos DB Documents

In [4]:
# ============================================
# DELETE DOCUMENTS FROM COSMOS DB
# Use this to clear documents and reprocess PDFs
# ============================================

# Options:
#   1. Delete ALL documents in container
#   2. Delete specific document by sourceFile path

$DELETE_MODE = "all"  # "all" or "specific"
$SPECIFIC_SOURCE_FILE = "pdfs/AgYieldTrain.pdf"  # Only used if DELETE_MODE = "specific"

Write-Host "Cosmos DB Document Cleanup" -ForegroundColor Cyan
Write-Host "  Mode: $DELETE_MODE" -ForegroundColor Gray

# Get account key
$accountKey = az cosmosdb keys list `
    --name $COSMOS_ACCOUNT `
    --resource-group $COSMOS_RG `
    --query "primaryMasterKey" -o tsv 2>$null

if (-not $accountKey) {
    Write-Host "  ERROR: Could not retrieve Cosmos DB key" -ForegroundColor Red
    return
}

$cosmosEndpoint = "https://$COSMOS_ACCOUNT.documents.azure.com:443/"

function Get-CosmosAuthHeader {
    param($verb, $resourceType, $resourceLink, $date, $key)
    
    $keyBytes = [System.Convert]::FromBase64String($key)
    $text = "$($verb.ToLower())`n$($resourceType.ToLower())`n$resourceLink`n$($date.ToLower())`n`n"
    $hmac = New-Object System.Security.Cryptography.HMACSHA256
    $hmac.Key = $keyBytes
    $hash = $hmac.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($text))
    $signature = [System.Convert]::ToBase64String($hash)
    # EscapeDataString needs no System.Web assembly, so this also runs on Windows PowerShell 5.1
    return [System.Uri]::EscapeDataString("type=master&ver=1.0&sig=$signature")
}

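# Read the first page of documents (at most 100, per x-ms-max-item-count below).
# Larger containers need continuation-token paging to reach every document.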
Write-Host "`n  Reading documents..." -ForegroundColor Yellow
$date = [DateTime]::UtcNow.ToString("r")
$resourceLink = "dbs/$COSMOS_DATABASE/colls/$COSMOS_CONTAINER"

$headers = @{
    "Authorization" = Get-CosmosAuthHeader "get" "docs" $resourceLink $date $accountKey
    "x-ms-date" = $date
    "x-ms-version" = "2018-12-31"
    "x-ms-max-item-count" = "100"
}

try {
    $uri = "$cosmosEndpoint$resourceLink/docs"
    $response = Invoke-RestMethod -Uri $uri -Method Get -Headers $headers
    $documents = $response.Documents
    
    if ($documents.Count -eq 0) {
        Write-Host "`n  No documents found - nothing to delete" -ForegroundColor Green
        return
    }
    
    Write-Host "  Found $($documents.Count) document(s)" -ForegroundColor Gray
    
    # Filter documents if specific mode
    if ($DELETE_MODE -eq "specific") {
        # Wrap in @() so .Count is reliable when zero or one document matches
        $documents = @($documents | Where-Object { $_.sourceFile -eq $SPECIFIC_SOURCE_FILE })
        if ($documents.Count -eq 0) {
            Write-Host "  No documents found for: $SPECIFIC_SOURCE_FILE" -ForegroundColor Yellow
            return
        }
        Write-Host "  Filtering to: $SPECIFIC_SOURCE_FILE" -ForegroundColor Gray
    }
    
    Write-Host "`n  Deleting $($documents.Count) document(s)..." -ForegroundColor Yellow
    
    $deletedCount = 0
    $errorCount = 0
    
    foreach ($doc in $documents) {
        $docId = $doc.id
        $partitionKey = $doc.sourceFile
        
        # Delete document
        $date = [DateTime]::UtcNow.ToString("r")
        $docResourceLink = "dbs/$COSMOS_DATABASE/colls/$COSMOS_CONTAINER/docs/$docId"
        
        $deleteHeaders = @{
            "Authorization" = Get-CosmosAuthHeader "delete" "docs" $docResourceLink $date $accountKey
            "x-ms-date" = $date
            "x-ms-version" = "2018-12-31"
            "x-ms-documentdb-partitionkey" = "[`"$partitionKey`"]"
        }
        
        try {
            $deleteUri = "$cosmosEndpoint$docResourceLink"
            Invoke-RestMethod -Uri $deleteUri -Method Delete -Headers $deleteHeaders | Out-Null
            $deletedCount++
            Write-Host "    Deleted: $($doc.sourceFile)" -ForegroundColor Green
        } catch {
            $errorCount++
            Write-Host "    Failed to delete $docId : $($_.Exception.Message)" -ForegroundColor Red
        }
    }
    
    Write-Host "`n  Summary:" -ForegroundColor Cyan
    Write-Host "    Deleted: $deletedCount" -ForegroundColor Green
    if ($errorCount -gt 0) {
        Write-Host "    Errors: $errorCount" -ForegroundColor Red
    }
    Write-Host "`n  You can now re-run the pipeline to reprocess the PDFs" -ForegroundColor Yellow
    
} catch {
    Write-Host "`n  ERROR reading documents: $($_.Exception.Message)" -ForegroundColor Red
}

Cosmos DB Document Cleanup
  Mode: all

  Reading documents...
  Found 2 document(s)

  Deleting 2 document(s)...
    Deleted: pdfs/AgYieldTrain.pdf
    Deleted: pdfs/Ag Yield Test.pdf

  Summary:
    Deleted: 2

  You can now re-run the pipeline to reprocess the PDFs
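
The cleanup cell above requests at most 100 documents (`x-ms-max-item-count`), so a larger container would be only partly cleared in one run. The following is a minimal sketch of continuation-token paging, not part of the original notebook. The helper name `Get-AllCosmosPages` and the `FetchPage` contract are hypothetical: the fetcher is assumed to return an object with `Documents` and `Continuation` properties, where `Continuation` is copied from the `x-ms-continuation` response header. The example fetcher assumes PowerShell 7+ (for `-ResponseHeadersVariable`) and reuses variables and `Get-CosmosAuthHeader` from the cleanup cell.

```powershell
# Hypothetical helper: collects documents across pages.
# $FetchPage receives the previous continuation token (or $null) and must
# return an object with .Documents (array) and .Continuation (string or $null).
function Get-AllCosmosPages {
    param(
        [Parameter(Mandatory)] [scriptblock] $FetchPage,
        [int] $MaxPages = 1000   # safety cap against a fetcher that never finishes
    )

    $all = [System.Collections.Generic.List[object]]::new()
    $token = $null
    $pages = 0

    do {
        $page = & $FetchPage $token
        foreach ($doc in @($page.Documents)) {
            if ($null -ne $doc) { $all.Add($doc) }
        }
        $token = $page.Continuation
        $pages++
    } while ($token -and $pages -lt $MaxPages)

    # Leading comma keeps the result an array even for zero or one item
    return ,$all.ToArray()
}

# Example fetcher for the cleanup cell (defined here, not invoked).
# Assumes $cosmosEndpoint, $resourceLink, $accountKey and Get-CosmosAuthHeader exist.
$fetchFromCosmos = {
    param($continuation)
    $date = [DateTime]::UtcNow.ToString("r")
    $h = @{
        "Authorization"       = Get-CosmosAuthHeader "get" "docs" $resourceLink $date $accountKey
        "x-ms-date"           = $date
        "x-ms-version"        = "2018-12-31"
        "x-ms-max-item-count" = "100"
    }
    if ($continuation) { $h["x-ms-continuation"] = $continuation }

    $resp = Invoke-RestMethod -Uri "$cosmosEndpoint$resourceLink/docs" -Method Get `
        -Headers $h -ResponseHeadersVariable respHeaders

    [pscustomobject]@{
        Documents    = $resp.Documents
        Continuation = ($respHeaders["x-ms-continuation"] | Select-Object -First 1)
    }
}

# $documents = Get-AllCosmosPages -FetchPage $fetchFromCosmos
```

Passing the fetcher as a script block keeps the paging loop separate from the REST call, so the loop can be exercised with a stub before pointing it at a real account.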


In [None]:
# Get activity runs for a pipeline run (table format)
if ($RUN_ID) {
    Write-Host "Activity runs for pipeline $RUN_ID :" -ForegroundColor Cyan
    
    # Get pipeline run info to determine time window
    $pipelineRun = az synapse pipeline-run show `
        --workspace-name $SYNAPSE_WORKSPACE `
        --run-id $RUN_ID `
        --output json | ConvertFrom-Json
    
    # Use UTC time with proper format
    $runStartUtc = [DateTime]::Parse($pipelineRun.runStart).ToUniversalTime()
    $startTime = $runStartUtc.AddMinutes(-10).ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ")
    $endTime = (Get-Date).ToUniversalTime().AddMinutes(10).ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ")
    
    Write-Host "  Time window: $startTime to $endTime" -ForegroundColor Gray
    
    $activitiesJson = az synapse activity-run query-by-pipeline-run `
        --workspace-name $SYNAPSE_WORKSPACE `
        --name ProcessPDFsWithDocIntelligence `
        --run-id $RUN_ID `
        --last-updated-after $startTime `
        --last-updated-before $endTime `
        --output json 2>&1
    
    try {
        $activities = $activitiesJson | ConvertFrom-Json
        
        if ($activities.value -and $activities.value.Count -gt 0) {
            Write-Host ""
            # Create a simple table format
            Write-Host ("  {0,-25} {1,-15} {2,-12} {3,-12}" -f "Activity", "Type", "Status", "Duration(ms)") -ForegroundColor Yellow
            Write-Host ("  {0,-25} {1,-15} {2,-12} {3,-12}" -f "--------", "----", "------", "-----------") -ForegroundColor Gray
            
            foreach ($activity in $activities.value) {
                $statusColor = if ($activity.status -eq 'Succeeded') { 'Green' } elseif ($activity.status -eq 'Failed') { 'Red' } else { 'Yellow' }
                $name = if ($activity.activityName.Length -gt 24) { $activity.activityName.Substring(0,21) + "..." } else { $activity.activityName }
                $type = if ($activity.activityType.Length -gt 14) { $activity.activityType.Substring(0,11) + "..." } else { $activity.activityType }
                
                Write-Host ("  {0,-25} {1,-15} " -f $name, $type) -NoNewline
                Write-Host ("{0,-12}" -f $activity.status) -NoNewline -ForegroundColor $statusColor
                Write-Host (" {0,-12}" -f $activity.durationInMs)
            }
        } else {
            Write-Host "`n  No activity runs found for this pipeline run" -ForegroundColor Yellow
            Write-Host "  Check Azure Synapse Studio > Monitor > Pipeline runs for details" -ForegroundColor Gray
        }
    } catch {
        Write-Host "`n  Could not parse activity data" -ForegroundColor Yellow
    }
} else {
    Write-Host "No run ID set. Trigger a pipeline first." -ForegroundColor Yellow
}
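
When the table above shows a `Failed` status, the error text is usually the next thing to look at. The following is a minimal sketch for pulling it out. It assumes each activity run object exposes an `error.message` field, as Data Factory-style activity run responses generally do; treat the field names as an assumption and check them against your own JSON. The function name `Get-FailedActivitySummary` is hypothetical.

```powershell
# Hypothetical helper: lists failed activities with their error text.
function Get-FailedActivitySummary {
    param([object[]] $ActivityRuns)

    $ActivityRuns |
        Where-Object { $_.status -eq 'Failed' } |
        ForEach-Object {
            [pscustomobject]@{
                Activity = $_.activityName
                Type     = $_.activityType
                # Assumed field; falls back to a placeholder when absent
                Error    = if ($_.error -and $_.error.message) { $_.error.message } else { '(no error message)' }
            }
        }
}

# Usage with the $activities variable from the cell above:
# Get-FailedActivitySummary -ActivityRuns $activities.value | Format-List
```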

## Authentication Setup

If you're getting authentication errors, ensure the following setup is complete:

In [None]:
# Step 1: Get the Function App master (admin-level host) key and store it in Key Vault
Write-Host "Getting Function App host key..." -ForegroundColor Cyan

$FUNCTION_KEY = az functionapp keys list `
    --name $FUNC_APP_NAME `
    --resource-group $FUNC_RG `
    --query "masterKey" -o tsv

if ($FUNCTION_KEY) {
    # Do not echo any part of the key: notebook output is saved with the file
    Write-Host "Function key retrieved" -ForegroundColor Green
    
    # Store in Key Vault
    Write-Host "`nStoring function key in Key Vault..." -ForegroundColor Cyan
    
    az keyvault secret set `
        --vault-name $KEY_VAULT_NAME `
        --name "FunctionAppHostKey" `
        --value $FUNCTION_KEY `
        --output none 2>$null
    
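    # Report success only if the CLI call actually succeeded
    if ($LASTEXITCODE -eq 0) {
        Write-Host "Function key stored as 'FunctionAppHostKey'" -ForegroundColor Green
    } else {
        Write-Host "Failed to store function key in Key Vault ($KEY_VAULT_NAME)" -ForegroundColor Red
    }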
} else {
    Write-Host "Failed to get function key" -ForegroundColor Red
    Write-Host "  Check that FUNC_APP_NAME ($FUNC_APP_NAME) and FUNC_RG ($FUNC_RG) are correct" -ForegroundColor Yellow
}

In [None]:
# Step 2: Verify function key is stored in Key Vault
Write-Host "Verifying function key in Key Vault..." -ForegroundColor Cyan

$secretExists = az keyvault secret show `
    --vault-name $KEY_VAULT_NAME `
    --name "FunctionAppHostKey" `
    --query "name" -o tsv 2>$null

if ($secretExists) {
    Write-Host "  FunctionAppHostKey secret exists in $KEY_VAULT_NAME" -ForegroundColor Green
} else {
    Write-Host "  FunctionAppHostKey secret NOT FOUND" -ForegroundColor Red
    Write-Host "  Run the previous cell to store the function key" -ForegroundColor Yellow
}

In [None]:
# Step 3: Grant Synapse access to Key Vault
Write-Host "Granting Synapse access to Key Vault..." -ForegroundColor Cyan

$SYNAPSE_IDENTITY = az synapse workspace show `
    --name $SYNAPSE_WORKSPACE `
    --resource-group $RESOURCE_GROUP `
    --query identity.principalId -o tsv

if ($SYNAPSE_IDENTITY) {
    Write-Host "  Synapse Identity: $SYNAPSE_IDENTITY" -ForegroundColor Gray

    az keyvault set-policy `
        --name $KEY_VAULT_NAME `
        --object-id $SYNAPSE_IDENTITY `
        --secret-permissions get list `
        --output none 2>$null

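    # Report success only if the CLI call actually succeeded
    if ($LASTEXITCODE -eq 0) {
        Write-Host "  Key Vault access granted to Synapse" -ForegroundColor Green
    } else {
        Write-Host "  Failed to set access policy (a vault using Azure RBAC needs a role assignment instead)" -ForegroundColor Red
    }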
} else {
    Write-Host "  Could not get Synapse identity" -ForegroundColor Red
    Write-Host "  Check SYNAPSE_WORKSPACE ($SYNAPSE_WORKSPACE) and RESOURCE_GROUP ($RESOURCE_GROUP)" -ForegroundColor Yellow
}

## Troubleshooting

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `Owner resource does not exist` | Cosmos DB database or container is missing | Run the Cosmos DB Setup cells above |
| `Unauthorized` (401) | Function key not configured | Store the function key in Key Vault (Authentication Setup, Step 1) |
| `Forbidden` (403) on Key Vault | Synapse can't access Key Vault | Grant the Synapse identity access (Authentication Setup, Step 3) |
| `SecretNotFound` | Wrong secret name | Ensure the secret is named `FunctionAppHostKey` |
| `Could not download file` | Missing SAS token | No action needed: the Function generates SAS tokens automatically |
| `Invalid URL` for files with spaces | URL encoding issue | No action needed: the pipeline handles this automatically |
| `Failed to parse string as JSON` | Pipeline parameters passed in the wrong format | Pass parameters as a JSON object (already done in this notebook) |
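
The `Invalid URL` row concerns blob names that contain spaces, such as `pdfs/Ag Yield Test.pdf` from the cleanup output earlier. As an illustration of what that encoding involves (not the pipeline's actual implementation), the sketch below percent-encodes each path segment while leaving the `/` separators intact. The function name `ConvertTo-EncodedBlobPath` is hypothetical.

```powershell
# Hypothetical helper: percent-encodes a blob path one segment at a time.
function ConvertTo-EncodedBlobPath {
    param([Parameter(Mandatory)] [string] $BlobPath)

    # Encode each segment on its own so the "/" separators survive
    $segments = $BlobPath -split '/' | ForEach-Object {
        [System.Uri]::EscapeDataString($_)
    }
    return ($segments -join '/')
}

ConvertTo-EncodedBlobPath "pdfs/Ag Yield Test.pdf"
# Output: pdfs/Ag%20Yield%20Test.pdf
```

Encoding the whole path in one call would also escape the `/` characters as `%2F`, which is why the sketch splits on the separator first.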

In [None]:
# Verify Key Vault secret exists
Write-Host "Checking Key Vault secrets in $KEY_VAULT_NAME..." -ForegroundColor Cyan

$secrets = az keyvault secret list `
    --vault-name $KEY_VAULT_NAME `
    --query "[?contains(id, 'FunctionAppHostKey')].{Name:name, Enabled:attributes.enabled}" `
    --output json 2>$null | ConvertFrom-Json

if ($secrets -and $secrets.Count -gt 0) {
    Write-Host "`n  Found FunctionAppHostKey:" -ForegroundColor Green
    $secrets | ForEach-Object {
        Write-Host "    Name: $($_.Name), Enabled: $($_.Enabled)"
    }
} else {
    Write-Host "`n  FunctionAppHostKey NOT FOUND in Key Vault" -ForegroundColor Red
    Write-Host "  Run Step 1 under Authentication Setup to store the function key" -ForegroundColor Yellow
}

In [None]:
# Test Function App directly (with authentication)
$FUNC_URL = "https://$FUNC_APP_NAME.azurewebsites.net"

Write-Host "Testing Function App..." -ForegroundColor Cyan

# Get the function key for authentication
$functionKey = az functionapp keys list `
    --name $FUNC_APP_NAME `
    --resource-group $FUNC_RG `
    --query "functionKeys.default" -o tsv 2>$null

if (-not $functionKey) {
    # Fall back to the master key if no default key is available
    $functionKey = az functionapp keys list `
        --name $FUNC_APP_NAME `
        --resource-group $FUNC_RG `
        --query "masterKey" -o tsv 2>$null
}

if ($functionKey) {
    Write-Host "  Function key retrieved" -ForegroundColor Green
    
    # Test health endpoint with key
    $headers = @{
        "x-functions-key" = $functionKey
    }
    
    try {
        # Try health endpoint first
        $response = Invoke-RestMethod -Uri "$FUNC_URL/api/health" -Method Get -Headers $headers -ErrorAction Stop
        Write-Host "`n  Health check passed!" -ForegroundColor Green
        $response | ConvertTo-Json
    } catch {
        # The request failed: the health endpoint may not exist, or the key may have been rejected
        Write-Host "  Health check failed: $($_.Exception.Message)" -ForegroundColor Yellow
        Write-Host "  (This is OK if the app has no /api/health route)" -ForegroundColor Gray
        
        # Check if the function app is running
        $appState = az functionapp show `
            --name $FUNC_APP_NAME `
            --resource-group $FUNC_RG `
            --query "state" -o tsv
        
        Write-Host "  Function App state: $appState" -ForegroundColor $(if ($appState -eq 'Running') { 'Green' } else { 'Red' })
        
        # List available functions
        Write-Host "`n  Available functions:" -ForegroundColor Cyan
        az functionapp function list `
            --name $FUNC_APP_NAME `
            --resource-group $FUNC_RG `
            --query "[].{Name:name, Trigger:config.bindings[0].type}" `
            --output table
    }
} else {
    Write-Host "  Could not retrieve function key" -ForegroundColor Red
    Write-Host "  Checking Function App state instead..." -ForegroundColor Yellow
    
    $appState = az functionapp show `
        --name $FUNC_APP_NAME `
        --resource-group $FUNC_RG `
        --query "state" -o tsv
    
    Write-Host "  Function App state: $appState" -ForegroundColor $(if ($appState -eq 'Running') { 'Green' } else { 'Red' })
}

## Next Steps

- **View processed documents** - Check Cosmos DB for results
- **Set up analytics** - See `06-Analytics-SynapseLink.ipynb`
- **Troubleshoot issues** - See `07-Monitoring-Troubleshooting.ipynb`