# Strong Mayor Powers Classification Tool - Demo

This notebook demonstrates how to use the Strong Mayor Powers classification tool to analyze PDF documents and detect references to Strong Mayor Powers in public comments.

## What is this tool?

The Strong Mayor Powers Detection Tool is a command-line utility that uses the Google Gemini API to classify each comment in a set of public consultation documents as containing or not containing a reference to "Strong Mayor Powers".

**Strong Mayor Powers** refer to enhanced mayoral authorities introduced in Ontario municipalities, including:
- Authority to override certain council decisions with a simple majority vote
- Enhanced control over municipal planning and development processes
- Greater influence over municipal budget priorities
- Streamlined decision-making capabilities for municipal governance
- Powers to hire and dismiss certain municipal staff directly

## What we'll demonstrate

In this notebook, we'll:
1. Set up the required dependencies
2. Create and examine a sample PDF document
3. Run the tool in dry-run mode to estimate costs
4. Process the document and analyze the results
5. Interpret the output

**Note**: The full classification step needs a Google API key for the Gemini service. This notebook looks for it in the `GOOGLE_API_KEY` environment variable.

## Step 1: Setup and Dependencies

First, let's import the required libraries and check that our classification script is available:

In [None]:
import os
import subprocess
import pandas as pd
import json
from pathlib import Path

# Check if classify.py exists
if os.path.exists('classify.py'):
    print("✓ classify.py found")
else:
    print("✗ classify.py not found - make sure you're running this notebook in the correct directory")

# Check if sample PDF exists
if os.path.exists('sample_document.pdf'):
    print("✓ sample_document.pdf found")
else:
    print("✗ sample_document.pdf not found - we'll create it")

## Step 2: Display Help Information

Let's check the available options for the classification script:

In [None]:
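# Display help information for the classify script
import sys

# sys.executable runs the script with the same interpreter (and installed packages) as this notebook
result = subprocess.run([sys.executable, 'classify.py', '--help'], 
                       capture_output=True, text=True)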
print(result.stdout)
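
Several cells in this notebook call a script with `subprocess.run` and print whatever comes back. The sketch below wraps that pattern in a small helper that also checks the exit code. The name `run_cli` and its `timeout` default are our own choices for illustration and are not part of the tool.

```python
import subprocess
import sys


def run_cli(args, timeout=120):
    """Run a Python script with the current interpreter and return its stdout.

    Raises RuntimeError (with stderr attached) if the process exits with a
    non-zero code, so a failed run is not mistaken for empty output.
    """
    completed = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"Command failed with exit code {completed.returncode}:\n{completed.stderr}"
        )
    return completed.stdout
```

With this helper, the help text could be shown with `print(run_cli(['classify.py', '--help']))`.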

## Step 3: Create and Examine Sample Document

Let's create our sample PDF document if it doesn't exist and examine its content:

In [None]:
# Create sample PDF if it doesn't exist
import sys

if not os.path.exists('sample_document.pdf'):
    if os.path.exists('create_sample_pdf.py'):
        # Use the notebook's own interpreter so the script sees the same packages
        result = subprocess.run([sys.executable, 'create_sample_pdf.py'], 
                               capture_output=True, text=True)
        print(result.stdout)
        if result.stderr:
            print("Errors:", result.stderr)
    else:
        print("create_sample_pdf.py not found - please create sample_document.pdf manually")
else:
    print("✓ sample_document.pdf already exists")

# Show file info
if os.path.exists('sample_document.pdf'):
    file_size = os.path.getsize('sample_document.pdf')
    print(f"Sample document size: {file_size} bytes")

### Extract and Display PDF Content

Let's extract the text from our sample PDF to see what content we're working with:

In [None]:
# Extract text from the PDF using the same method as the classify script
import pdfplumber

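def extract_pdf_text(pdf_path):
    """Extract text from a PDF file; return an empty string if extraction fails."""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            # extract_text() can return None or "" for pages without a text layer
            page_texts = [page.extract_text() or "" for page in pdf.pages]
        return "\n".join(text for text in page_texts if text).strip()
    except Exception as e:
        # Report the failure rather than returning the error message as if it were document text
        print(f"Error extracting text: {e}")
        return ""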

if os.path.exists('sample_document.pdf'):
    pdf_text = extract_pdf_text('sample_document.pdf')
    print("PDF Content:")
    print("=" * 50)
    print(pdf_text)
    print("=" * 50)
    print(f"\nText length: {len(pdf_text)} characters")
else:
    print("Sample PDF not found")

## Step 4: Run Dry-Run Mode

Before making actual API calls, let's run the tool in dry-run mode to estimate token usage and costs:

In [None]:
# Create a directory for our sample document to demonstrate directory processing
import shutil

# Create sample directory
sample_dir = "sample_documents"
if os.path.exists(sample_dir):
    shutil.rmtree(sample_dir)
os.makedirs(sample_dir)

# Copy our sample PDF to the directory
if os.path.exists('sample_document.pdf'):
    shutil.copy('sample_document.pdf', os.path.join(sample_dir, 'comment_001.pdf'))
    print(f"✓ Created {sample_dir}/comment_001.pdf")
else:
    print("Sample PDF not available for copying")

# List files in sample directory
print("\nFiles in sample_documents directory:")
for file in os.listdir(sample_dir):
    print(f"  - {file}")

In [None]:
# Run dry-run mode to estimate token usage
import sys

print("Running dry-run mode to estimate token usage...\n")

# sys.executable runs the script with the same interpreter as this notebook
result = subprocess.run([sys.executable, 'classify.py', sample_dir, '--dry-run'], 
                       capture_output=True, text=True)

print("Dry-run output:")
print(result.stdout)

if result.stderr:
    print("\nErrors/Warnings:")
    print(result.stderr)
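
As a rough cross-check on the dry-run figure, we can approximate a token count from the character count. The sketch below assumes about four characters per token, a commonly cited rule of thumb for English text. It is not a tokenizer and is not how the tool itself counts tokens, so treat the result as an order-of-magnitude check only.

```python
import math


def rough_token_estimate(text, chars_per_token=4.0):
    """Approximate a token count from the character count (heuristic only)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


# Example with made-up text: 400 characters is roughly 100 tokens under this assumption
print(rough_token_estimate("a" * 400))
```

If the dry-run number is several times larger than this estimate for the extracted text alone, most of the difference is likely the instruction prompt that accompanies each comment.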

## Step 5: Understanding Token Estimation

The dry-run mode shows us how many tokens would be used for the API call. Let's break this down:

In [None]:
# Extract token information from dry-run output
import re

# Parse the dry-run output from the previous cell to extract the token count
output_lines = result.stdout.split('\n')
token_line = [line for line in output_lines if 'tokens required' in line]

if token_line:
    # Take the first number on the line, allowing thousands separators such as "1,234"
    token_match = re.search(r'\d[\d,]*', token_line[0])
    if token_match:
        estimated_tokens = int(token_match.group(0).replace(',', ''))
        print(f"Estimated tokens for our sample document: {estimated_tokens:,}")
        
        # Provide cost estimation (approximate)
        # Note: This is a placeholder rate for illustration - check current Gemini pricing
        cost_per_1k_tokens = 0.00025  # USD per 1,000 tokens (example value only)
        estimated_cost = (estimated_tokens / 1000) * cost_per_1k_tokens
        print(f"Estimated cost (approximate): ${estimated_cost:.6f}")
        
        print("\nNote: This is just an estimate. Actual costs may vary.")
        print("Check current Google Gemini API pricing for accurate rates.")
    else:
        print("Could not extract token count from output")
else:
    print("Token information not found in dry-run output")
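
The parsing and cost arithmetic above can also be packaged as two small functions. This is a minimal sketch under stated assumptions: the function names are ours, the regular expression assumes the number appears directly before the words "tokens required" (the real dry-run wording may differ), and the price is a placeholder that you pass in yourself.

```python
import re


def parse_token_count(stdout):
    """Return the integer directly before 'tokens required', or None if not found."""
    match = re.search(r"(\d[\d,]*)\s+tokens required", stdout)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def estimate_cost(tokens, usd_per_million_tokens):
    """Convert a token count into an approximate cost in USD."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical output line, used only to exercise the parser
sample_output = "Estimated 1,234 tokens required for 1 comments"
tokens = parse_token_count(sample_output)
print(tokens, estimate_cost(tokens, usd_per_million_tokens=0.25))
```

Returning `None` instead of raising keeps the notebook usable when the output format changes: the caller can print a message and move on.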

## Step 6: Analyzing the Classification Process

Let's look at what the tool actually analyzes by examining the prompt structure:

In [None]:
# Let's recreate the prompt structure to understand what gets sent to the API
def show_prompt_structure(comment_text):
    """Show what the actual prompt looks like."""
    
    system_prompt = """
You are analyzing public comments submitted to the Environmental Registry of Ontario (ERO) and other public consultation platforms. Your task is to determine whether each comment contains references to "Strong Mayor Powers" and, if present, explain how they are referenced.

**Background on Strong Mayor Powers:**
Strong Mayor Powers refer to enhanced mayoral authorities introduced in Ontario municipalities. These powers typically include:
- Authority to override certain council decisions with a simple majority vote
- Enhanced control over municipal planning and development processes  
- Greater influence over municipal budget priorities
- Streamlined decision-making capabilities for municipal governance
- Powers to hire and dismiss certain municipal staff directly

Strong Mayor Powers became relevant in Ontario as a way to expedite municipal decision-making, particularly for housing development and infrastructure projects, and were implemented in various Ontario municipalities starting in 2022.

**Your Classification Task:**
Determine if the comment contains any reference to Strong Mayor Powers and classify as one of:

1. "present" - The comment explicitly mentions Strong Mayor Powers, mayoral authorities, enhanced mayoral powers, mayor override powers, or clearly discusses the concept even if not using the exact term.
2. "absent" - The comment contains no reference to Strong Mayor Powers or related mayoral authority concepts.

**Guidelines**:
- Look for direct mentions of "Strong Mayor Powers," "mayoral powers," "mayor override," or similar terminology
- Also identify indirect references that clearly discuss enhanced mayoral authorities or municipal governance changes involving mayors
- Return exactly one word: "present" or "absent"
- Do not include explanations in your response
"""

    user_prompt = f"""
Comment:
{comment_text}

Based on the above comment, determine whether it contains any reference to Strong Mayor Powers (enhanced mayoral authorities in Ontario municipalities). Your response must be exactly one of the following:

1. "present" - The comment mentions Strong Mayor Powers, mayoral authorities, enhanced mayoral powers, or clearly discusses the concept.
2. "absent" - The comment contains no reference to Strong Mayor Powers or related mayoral authority concepts.

Return exactly one word: "present" or "absent". Do not include any additional explanation or text in your response.
"""
    
    full_prompt = system_prompt.strip() + "\n" + user_prompt.strip()
    return full_prompt

# Show the prompt for our sample document
if 'pdf_text' in locals():
    full_prompt = show_prompt_structure(pdf_text)
    print("Full prompt that would be sent to Gemini API:")
    print("=" * 80)
    print(full_prompt)
    print("=" * 80)
    print(f"\nPrompt length: {len(full_prompt)} characters")
else:
    print("PDF text not available to show prompt")
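
Because the prompt asks for exactly one word, it can help to normalize the model's reply before recording it, since replies sometimes arrive with extra whitespace, quotes, or capitalization. The helper below is a hypothetical sketch (`normalize_label` is our own name, and we do not know how the tool itself post-processes replies).

```python
VALID_LABELS = {"present", "absent"}


def normalize_label(raw_response):
    """Map a raw model reply to 'present' or 'absent'; return None if it is neither."""
    # Trim whitespace first, then surrounding quotes or a trailing period
    cleaned = raw_response.strip().strip("\"'.").lower()
    return cleaned if cleaned in VALID_LABELS else None


print(normalize_label(' "Present".\n'))
print(normalize_label("I think it is present"))
```

A reply that does not reduce to one of the two labels comes back as `None`, which makes it easy to flag that comment for manual review instead of guessing.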

## Step 7: Manual Analysis

Based on the content we extracted, let's manually analyze what we expect the tool to find:

In [None]:
# Let's manually identify Strong Mayor Powers references in our text
if 'pdf_text' in locals():
    print("Manual analysis of Strong Mayor Powers references:")
    print("=" * 60)
    
    # Key phrases to look for
    strong_mayor_phrases = [
        "Strong Mayor Powers",
        "enhanced mayoral authorities", 
        "override certain council decisions",
        "enhanced control over municipal planning",
        "greater influence over municipal budget",
        "powers to hire and dismiss",
        "streamlined decision-making capabilities"
    ]
    
    found_phrases = []
    text_lower = pdf_text.lower()
    for phrase in strong_mayor_phrases:
        # Locate the first case-insensitive occurrence of the phrase
        index = text_lower.find(phrase.lower())
        if index != -1:
            found_phrases.append(phrase)
            # Get 100 characters before and after the phrase
            start = max(0, index - 100)
            end = min(len(pdf_text), index + len(phrase) + 100)
            context = pdf_text[start:end].replace('\n', ' ')
            print(f"\n✓ Found: '{phrase}'")
            print(f"   Context: ...{context}...")
    
    print(f"\n\nSummary: Found {len(found_phrases)} of {len(strong_mayor_phrases)} key phrases")
    print(f"Expected classification: {'PRESENT' if found_phrases else 'ABSENT'}")
    
    if found_phrases:
        print("\nThis document contains at least one key phrase related to Strong Mayor Powers")
        print("The tool should classify this as 'present'")
    else:
        print("\nNone of the key phrases appear in this document")
        print("The tool will likely classify this as 'absent', unless the text discusses the concept in other words")
else:
    print("PDF text not available for manual analysis")
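
The cell above reports only the first occurrence of each phrase. If we want every occurrence, a regular-expression search is one option. The sketch below is an illustrative alternative with a function name of our own choosing, and it is not part of the classification tool.

```python
import re


def find_phrase_contexts(text, phrases, window=100):
    """Return (phrase, context) pairs for every case-insensitive occurrence."""
    hits = []
    for phrase in phrases:
        # re.escape treats the phrase as literal text, not as a pattern
        for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            hits.append((phrase, text[start:end].replace("\n", " ")))
    return hits


# Made-up comment text for demonstration
demo_text = "Strong Mayor Powers worry me. I oppose strong mayor powers."
for phrase, context in find_phrase_contexts(demo_text, ["Strong Mayor Powers"], window=10):
    print(f"{phrase!r}: ...{context}...")
```

Keyword matching like this is only a sanity check. The prompt also asks the model to catch indirect references, which no fixed phrase list will find.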

## Step 8: Running with API Key (Optional)

If you have a Google API key, you can run the actual classification. Otherwise, we'll show what the expected output would look like.

In [None]:
# Check if API key is available
import sys

api_key = os.environ.get('GOOGLE_API_KEY')

if api_key:
    print("✓ Google API key found in environment variables")
    print("Running actual classification...\n")
    
    # Run the actual classification with the same interpreter as this notebook
    result = subprocess.run([sys.executable, 'classify.py', sample_dir, 
                           '--output-csv', 'demo_results.csv'], 
                           capture_output=True, text=True)
    
    print("Classification output:")
    print(result.stdout)
    
    if result.stderr:
        print("\nErrors/Warnings:")
        print(result.stderr)
    
    # Check the exit code as well as the file: a CSV left over from an earlier run would also exist
    if result.returncode == 0 and os.path.exists('demo_results.csv'):
        print("\n✓ Results file created: demo_results.csv")
    else:
        print(f"\n✗ Classification failed or results file not created (exit code {result.returncode})")
        
else:
    print("✗ No Google API key found in GOOGLE_API_KEY environment variable")
    print("\nTo run the actual classification, you would need to:")
    print("1. Get a Google API key for Gemini from Google AI Studio")
    print("2. Set it as an environment variable: export GOOGLE_API_KEY='your-key-here'")
    print("3. Re-run this cell")
    print("\nFor demonstration purposes, we'll show expected output format below.")

## Step 9: Examining Results

Let's look at the results file format and interpret the output:

In [None]:
# Check if we have actual results or create expected results for demo
results_file = 'demo_results.csv'

if os.path.exists(results_file):
    # Read actual results
    print("Reading actual classification results:")
    df = pd.read_csv(results_file)
    print(df.to_string(index=False))
    
    # Analyze results
    print("\nResults Analysis:")
    print(f"Total documents processed: {len(df)}")
    
    if 'Strong Mayor Powers' in df.columns:
        value_counts = df['Strong Mayor Powers'].value_counts()
        print(f"Classification results:")
        for classification, count in value_counts.items():
            print(f"  - {classification}: {count}")
            
else:
    # Create expected results for demonstration
    print("Expected results format (since no API key was provided):")
    
    expected_results = pd.DataFrame({
        'Comment ID': ['comment_001'],
        'Strong Mayor Powers': ['present']
    })
    
    print(expected_results.to_string(index=False))
    
    print("\nExpected Analysis:")
    print("Total documents processed: 1")
    print("Classification results:")
    print("  - present: 1")
    print("\nThis matches our manual analysis - the document contains multiple")
    print("references to Strong Mayor Powers and should be classified as 'present'.")
    
    # Save expected results for demo purposes
    expected_results.to_csv('expected_results.csv', index=False)
    print("\n✓ Expected results saved to expected_results.csv")
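
For a quick tally that does not depend on pandas, the results can also be counted with the standard library. The sketch below assumes the two column names used in this notebook (`Comment ID` and `Strong Mayor Powers`). The rows are made-up examples, not real classifications.

```python
import csv
import io
from collections import Counter


def summarize_labels(csv_file, label_column="Strong Mayor Powers"):
    """Count each label in an open CSV file, ignoring case and stray whitespace."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


# Illustrative rows only; with real output, pass open('demo_results.csv', newline='')
sample_csv = io.StringIO(
    "Comment ID,Strong Mayor Powers\n"
    "comment_001,present\n"
    "comment_002,absent\n"
    "comment_003,Present\n"
)
print(summarize_labels(sample_csv))
```

Any label other than `present` or `absent` shows up as its own key in the counter, which is a simple way to spot malformed responses.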

## Step 10: Visualizing Results (if we have them)

Let's create a simple visualization of the classification results:

In [None]:
import matplotlib.pyplot as plt

# Use actual or expected results
if os.path.exists('demo_results.csv'):
    df = pd.read_csv('demo_results.csv')
    title_suffix = "(Actual Results)"
elif os.path.exists('expected_results.csv'):
    df = pd.read_csv('expected_results.csv')
    title_suffix = "(Expected Results)"
else:
    # Fallback
    df = pd.DataFrame({
        'Comment ID': ['comment_001'],
        'Strong Mayor Powers': ['present']
    })
    title_suffix = "(Demo)"

# Plot the classifications as a pie chart and a bar chart
if 'Strong Mayor Powers' in df.columns:
    plt.figure(figsize=(10, 6))
    
    # Subplot 1: Pie chart
    plt.subplot(1, 2, 1)
    value_counts = df['Strong Mayor Powers'].value_counts()
    colors = ['#ff7f7f' if x == 'present' else '#7f7fff' for x in value_counts.index]
    plt.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', colors=colors)
    plt.title(f'Strong Mayor Powers Classifications {title_suffix}')
    
    # Subplot 2: Bar chart
    plt.subplot(1, 2, 2)
    bars = plt.bar(value_counts.index, value_counts.values, color=colors)
    plt.title('Classification Counts')
    plt.ylabel('Number of Documents')
    plt.xlabel('Classification')
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{int(height)}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    print("\nVisualization shows the distribution of classifications.")
    print("Red/pink = 'present' (contains Strong Mayor Powers references)")
    print("Blue = 'absent' (no Strong Mayor Powers references)")
else:
    print("No classification data available for visualization")

## Step 11: Understanding Different Input Formats

The tool supports various input formats. Let's demonstrate how it works with different file types:

In [None]:
# Create samples of different file types
print("Creating sample files in different formats...\n")

# Text file
with open(os.path.join(sample_dir, 'comment_002.txt'), 'w', encoding='utf-8') as f:
    f.write("This is a simple text comment about municipal governance. "
            "I think the new Strong Mayor Powers are concerning.")
print("✓ Created comment_002.txt")

# HTML file
html_content = """
<!DOCTYPE html>
<html>
<head><title>Public Comment</title></head>
<body>
<h1>Municipal Governance Comment</h1>
<p>I support the implementation of <strong>enhanced mayoral authorities</strong> 
as they will help streamline decision-making processes in our municipality.</p>
</body>
</html>
"""
with open(os.path.join(sample_dir, 'comment_003.html'), 'w', encoding='utf-8') as f:
    f.write(html_content)
print("✓ Created comment_003.html")

# List all files in the sample directory
print("\nAll sample files created:")
for file in sorted(os.listdir(sample_dir)):
    file_path = os.path.join(sample_dir, file)
    file_size = os.path.getsize(file_path)
    print(f"  - {file} ({file_size} bytes)")
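
For HTML input, the markup has to be reduced to plain text before classification. We have not inspected how the tool does this, so the sketch below is only one possible approach, using the standard library's `html.parser`. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        # This can leave a space before punctuation that follows an inline tag.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<h1>Comment</h1><p>I support <strong>enhanced mayoral authorities</strong> here</p>"))
```

Applied to a file like `comment_003.html`, this would keep the title, heading, and paragraph text while dropping the tags.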

## Step 12: Running Dry-Run on Multiple Files

Let's see how the tool handles multiple files of different types:

In [None]:
# Run dry-run on the expanded sample directory
import sys

print("Running dry-run on multiple file types...\n")

# sys.executable runs the script with the same interpreter as this notebook
result = subprocess.run([sys.executable, 'classify.py', sample_dir, '--dry-run'], 
                       capture_output=True, text=True)

print("Dry-run output for multiple files:")
print(result.stdout)

if result.stderr:
    print("\nErrors/Warnings:")
    print(result.stderr)

# Extract and analyze token information
output_lines = result.stdout.split('\n')

# Find the line with total comments
total_comments_line = [line for line in output_lines if 'total of' in line and 'comments' in line]
if total_comments_line:
    print(f"\nProcessed files: {total_comments_line[0]}")

# Find token usage
token_line = [line for line in output_lines if 'tokens required' in line]
if token_line:
    print(f"Token usage: {token_line[0]}")
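
When a directory mixes file types, each file needs a comment ID and a format. The sketch below derives both from the filename, assuming the ID is the filename without its extension (as with `comment_001` in this demo) and using the extensions listed in the summary at the end of this notebook. The mapping, the `.json` extension, and the function name are our assumptions, not the tool's actual logic.

```python
from pathlib import Path

# Extension-to-format mapping (assumed from the notebook's supported-format list)
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".txt": "text",
    ".text": "text",
    ".html": "html",
    ".htm": "html",
    ".docx": "docx",
    ".rtf": "rtf",
    ".json": "json",
}


def describe_input(path):
    """Return (comment_id, format_name); format_name is None for unknown extensions."""
    p = Path(path)
    return p.stem, EXTENSION_FORMATS.get(p.suffix.lower())


for name in ["comment_001.pdf", "comment_002.txt", "comment_003.html", "notes.xyz"]:
    print(name, "->", describe_input(name))
```

Checking for a `None` format before a run is a cheap way to list files that would probably be skipped.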

## Step 13: Summary and Next Steps

Let's summarize what we've learned and provide guidance for real-world usage:

In [None]:
print("=" * 80)
print("DEMO SUMMARY: Strong Mayor Powers Classification Tool")
print("=" * 80)

print("\n✓ What we demonstrated:")
print("  1. Tool setup and dependency checking")
print("  2. Sample document creation and text extraction")
print("  3. Dry-run mode for cost estimation")
print("  4. Understanding the classification prompt")
print("  5. Manual analysis of Strong Mayor Powers references")
print("  6. Multiple file format support (PDF, TXT, HTML)")
print("  7. Results interpretation and visualization")

print("\n📊 Key findings from our sample:")
print("  - Sample document contains multiple Strong Mayor Powers references")
print("  - Expected classification: PRESENT")
print("  - Tool can process various file formats in batch")
print("  - Dry-run mode helps estimate costs before processing")

print("\n🔧 For real-world usage:")
print("  1. Obtain Google Gemini API key from Google AI Studio")
print("  2. Set environment variable: export GOOGLE_API_KEY='your-key'")
print("  3. Prepare your documents in a directory or JSON format")
print("  4. Run dry-run first to estimate costs")
print("  5. Process documents and analyze results")

print("\n📁 Supported file formats:")
print("  - PDF (.pdf) - using pdfplumber")
print("  - Text (.txt, .text) - plain text")
print("  - HTML (.html, .htm) - with tag removal")
print("  - Word (.docx) - modern Word documents")
print("  - RTF (.rtf) - rich text format")
print("  - JSON - structured comment data")

print("\n💡 Tips for best results:")
print("  - Use descriptive filenames (they become comment IDs)")
print("  - Consider chunking for very large documents")
print("  - Review classification results for accuracy")
print("  - Use validation tools for quality assurance")

print("\n" + "=" * 80)
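
The chunking tip above can be sketched as a simple character-based splitter with overlap, so that a phrase falling on a boundary still appears whole in at least one chunk. The function name and default sizes are arbitrary choices for illustration, and we have not confirmed whether the tool offers chunking of its own.

```python
def chunk_text(text, max_chars=8000, overlap=200):
    """Split text into chunks of at most max_chars that overlap by `overlap` characters."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk reached the end of the text
        start += max_chars - overlap
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))
```

One reasonable way to combine the per-chunk labels is to treat a document as "present" if any chunk is, since a single reference anywhere in the document is enough.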

## Cleanup (Optional)

If you want to clean up the demo files, run the cell below:

In [None]:
# Cleanup demo files (uncomment to run)
# import shutil
# 
# cleanup_files = [
#     'sample_documents',
#     'demo_results.csv', 
#     'expected_results.csv',
#     'sample_document.pdf'
# ]
# 
# print("Cleaning up demo files...")
# for item in cleanup_files:
#     if os.path.exists(item):
#         if os.path.isdir(item):
#             shutil.rmtree(item)
#             print(f"✓ Removed directory: {item}")
#         else:
#             os.remove(item)
#             print(f"✓ Removed file: {item}")
#     else:
#         print(f"  - {item} (not found)")
# 
# print("\nCleanup complete!")

print("Cleanup cell is commented out. Uncomment and run to clean up demo files.")