# Strong Mayor Powers Classification Tool - Demo

This notebook demonstrates how to use the Strong Mayor Powers classification tool to analyze documents and detect references to Strong Mayor Powers in public comments.

## What is this tool?

The Strong Mayor Powers Classification Tool is a command-line utility that uses the Google Gemini API to classify public consultation documents by whether or not they reference "Strong Mayor Powers".

**Strong Mayor Powers** refer to enhanced mayoral authorities introduced in Ontario municipalities, including:
- Authority to veto certain council by-laws, which council can override only with a two-thirds vote
- Enhanced control over municipal planning and development processes
- Greater influence over municipal budget priorities
- Streamlined decision-making capabilities for municipal governance
- Powers to hire and dismiss certain municipal staff directly

## What we'll demonstrate

In this notebook, we'll:
1. Set up the required dependencies
2. Create and examine sample documents
3. Run the tool in dry-run mode to estimate costs
4. Process documents using direct upload (no text extraction)
5. Interpret the output

**Note**: The full classification requires a Gemini API key. The tool uses the new Google GenAI API and uploads documents directly, with no local text extraction step.

## Step 1: Setup and Dependencies

First, let's import the required libraries and check that our classification script is available:

In [None]:
import os
import subprocess
import pandas as pd
import json
from pathlib import Path

# Check if classify.py exists
if os.path.exists('classify.py'):
    print("✓ classify.py found")
else:
    print("✗ classify.py not found - make sure you're running this notebook in the correct directory")

# Check if sample PDF exists
if os.path.exists('sample_document.pdf'):
    print("✓ sample_document.pdf found")
else:
    print("✗ sample_document.pdf not found - the demo will use the text and HTML samples created in Step 3")
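
The cell above imports `Path` but performs its checks with `os.path.exists`. As a small sketch, the same check can be collected into one `pathlib` helper; `missing_paths` is a hypothetical name used only for illustration and is not part of the tool.

```python
from pathlib import Path


def missing_paths(names, base="."):
    """Return the entries of `names` that do not exist under `base`."""
    base = Path(base)
    return [name for name in names if not (base / name).exists()]


# classify.py is required; the sample PDF is optional for this demo.
missing = missing_paths(["classify.py"])
if missing:
    print(f"Missing required files: {', '.join(missing)}")
else:
    print("All required files found")
```

Returning the list of missing names, rather than printing inside the helper, makes it easy to stop the notebook early when something required is absent.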

## Step 2: Display Help Information

Let's check the available options for the classification script:

In [None]:
# Display help information for the classify script
result = subprocess.run(['python3', 'classify.py', '--help'], 
                       capture_output=True, text=True)
print(result.stdout)
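
The call above prints only stdout, so a failed run (for example, a missing script) would produce no visible output. One possible wrapper, sketched below, launches the script with the notebook's own interpreter via `sys.executable` and surfaces stderr on a non-zero exit code. `run_cli` is a hypothetical helper, not something `classify.py` provides.

```python
import subprocess
import sys


def run_cli(args, script="classify.py"):
    """Run `script` with the current interpreter and return the result.

    Stderr is printed when the exit code is non-zero so failures are visible.
    """
    result = subprocess.run(
        [sys.executable, script, *args],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"{script} exited with code {result.returncode}")
        print(result.stderr)
    return result
```

With this helper, the help text could be shown with `print(run_cli(["--help"]).stdout)`.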

## Step 3: Create and Examine Sample Documents

Let's create sample documents if they don't exist. The new API works with direct document upload, so no text extraction is needed!

In [None]:
# Create sample documents if they don't exist
sample_dir = "sample_documents"
if not os.path.exists(sample_dir):
    os.makedirs(sample_dir)
    print(f"Created {sample_dir} directory")
else:
    print(f"✓ {sample_dir} directory already exists")

# Create sample text file with Strong Mayor Powers reference
sample_txt = os.path.join(sample_dir, 'comment_001.txt')
if not os.path.exists(sample_txt):
    with open(sample_txt, 'w') as f:
        f.write("This is a comment about Strong Mayor Powers in Ontario. "
                "I believe these enhanced mayoral authorities will help streamline "
                "municipal decision-making processes.")
    print(f"✓ Created {sample_txt}")
else:
    print(f"✓ {sample_txt} already exists")

# Create sample HTML file
sample_html = os.path.join(sample_dir, 'comment_002.html')
if not os.path.exists(sample_html):
    with open(sample_html, 'w') as f:
        f.write("""<!DOCTYPE html>
<html><head><title>Public Comment</title></head>
<body>
<h1>Municipal Governance Comment</h1>
<p>I have concerns about the new <strong>mayor override powers</strong> 
that were recently implemented. These authorities seem too broad.</p>
</body></html>""")
    print(f"✓ Created {sample_html}")
else:
    print(f"✓ {sample_html} already exists")

# Show files in sample directory
print("\nFiles in sample_documents directory:")
for file in os.listdir(sample_dir):
    file_path = os.path.join(sample_dir, file)
    file_size = os.path.getsize(file_path)
    print(f"  - {file} ({file_size} bytes)")

### Sample Document Content

Let's examine the content of our sample documents. The cell below reads the files only to display them; with the new API, the tool uploads each document directly without text extraction.

In [None]:
# Display content of our sample documents
sample_dir = "sample_documents"

if os.path.exists(sample_dir):
    print("Sample Document Contents:")
    print("=" * 50)
    
    for file in os.listdir(sample_dir):
        file_path = os.path.join(sample_dir, file)
        print(f"\n--- {file} ---")
        
        if file.endswith('.txt'):
            with open(file_path, 'r') as f:
                content = f.read()
                print(content)
        elif file.endswith('.html'):
            with open(file_path, 'r') as f:
                content = f.read()
                print(content[:300] + "..." if len(content) > 300 else content)
        else:
            print(f"Binary file ({os.path.getsize(file_path)} bytes)")
    
    print("\n" + "=" * 50)
    print("\nNote: With the new API, these documents are sent directly to Gemini")
    print("without any text extraction preprocessing!")
else:
    print("Sample directory not found")

## Step 4: Run Dry-Run Mode

Before making actual API calls, let's run the tool in dry-run mode to estimate costs using the new API:

In [None]:
# Create a directory for our sample document to demonstrate directory processing
# (We already created this above, but let's make sure it exists)
sample_dir = "sample_documents"
if not os.path.exists(sample_dir):
    os.makedirs(sample_dir)
    print(f"Created {sample_dir} directory")

# List files in sample directory
print("Files in sample_documents directory:")
for file in os.listdir(sample_dir):
    print(f"  - {file}")
    
print("\nNote: The new API processes documents directly from the directory!")
print("JSON input files are no longer supported.")

In [None]:
# Run dry-run mode to estimate token usage
print("Running dry-run mode to estimate token usage...\n")

result = subprocess.run(['python3', 'classify.py', sample_dir, '--dry-run'], 
                       capture_output=True, text=True)

print("Dry-run output:")
print(result.stdout)

if result.stderr:
    print("\nErrors/Warnings:")
    print(result.stderr)
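
As a sanity check on the dry-run figure, a rough local estimate can be computed from character counts. The sketch below uses the rule of thumb of about 4 characters per token that Google's documentation cites for Gemini models. It covers only the plain-text samples and ignores the prompt, so the tool's own count remains the one to rely on. Both function names are hypothetical.

```python
import math
from pathlib import Path

CHARS_PER_TOKEN = 4  # rule of thumb, not an exact tokenizer


def rough_token_estimate(text):
    """Approximate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def rough_directory_estimate(directory, suffixes=(".txt", ".html")):
    """Sum rough estimates over the text-like files in `directory`."""
    total = 0
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            total += rough_token_estimate(text)
    return total
```

If the dry-run number differs from `rough_directory_estimate("sample_documents")` by a large factor, that is a hint to inspect what is actually being uploaded.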

## Step 5: Understanding Token Estimation

The dry-run mode shows us how many tokens would be used for the API call. Let's break this down:

In [None]:
# Extract token information from dry-run output
import re

# Parse the output to extract token count
output_lines = result.stdout.split('\n')
token_line = [line for line in output_lines if 'tokens required' in line]

if token_line:
    # Extract the token count, allowing thousands separators (e.g. "12,345")
    token_match = re.search(r'(\d[\d,]*)', token_line[0])
    if token_match:
        estimated_tokens = int(token_match.group(1).replace(',', ''))
        print(f"Estimated tokens for our sample documents: {estimated_tokens:,}")
        
        # Provide cost estimation (approximate)
        # Note: This is a placeholder rate for illustration - check current Gemini pricing
        cost_per_1k_tokens = 0.00025  # Placeholder rate, not tied to a specific model
        estimated_cost = (estimated_tokens / 1000) * cost_per_1k_tokens
        print(f"Estimated cost (approximate): ${estimated_cost:.6f}")
        
        print("\nNote: This is just an estimate. Actual costs may vary.")
        print("Check current Google Gemini API pricing for accurate rates.")
    else:
        print("Could not extract token count from output")
else:
    print("Token information not found in dry-run output")
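
Parsing and pricing can also be separated into two small functions, which makes each easier to check. This sketch assumes the dry-run prints a line such as `... 12,345 tokens required ...`; the exact wording produced by `classify.py` is not shown in this notebook, and the rate is a placeholder. The function names are hypothetical.

```python
import re


def parse_token_count(line):
    """Return the first integer in `line`, accepting thousands separators."""
    match = re.search(r"\d[\d,]*", line)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))


def estimate_cost(tokens, usd_per_million_tokens):
    """Convert a token count to dollars at a caller-supplied rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical dry-run line and placeholder rate, for illustration only.
tokens = parse_token_count("A total of 12,345 tokens required")
print(f"{tokens:,} tokens -> ${estimate_cost(tokens, 0.25):.6f}")
```

Passing the rate in as an argument keeps the pricing assumption visible at the call site instead of buried in the function.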

## Step 6: Understanding the New API Approach

The new Google GenAI API uses direct document upload instead of text extraction. Let's understand what this means:

In [None]:
# Let's show how the new API works conceptually
print("New Google GenAI API Approach:")
print("=" * 50)
print("1. Documents are uploaded directly as bytes (no text extraction)")
print("2. Uses types.Part.from_bytes() to create document parts")
print("3. API handles text extraction internally")
print("4. Structured responses using enums")
print("5. Model: gemini-2.5-flash (1M token context window)")
print("6. Environment variable: GEMINI_API_KEY")
print("")
print("Key Benefits:")
print("- No dependency on text extraction libraries")
print("- Better handling of complex document formats")
print("- More reliable text extraction by Google's models")
print("- Structured responses with enum validation")
print("")
print("Example API call structure:")
print("")
print("```python")
print("from google import genai")
print("from google.genai import types")
print("")
print("client = genai.Client()  # Uses GEMINI_API_KEY env var")
print("")
print("document_part = types.Part.from_bytes(")
print("    data=file_bytes,")
print("    mime_type='application/pdf'")
print(")")
print("")
print("response = client.models.generate_content(")
print("    model='gemini-2.5-flash',")
print("    contents=[document_part, prompt],")
print("    config={")
print("        'response_mime_type': 'text/x.enum',")
print("        'response_schema': ClassificationEnum,")
print("    }")
print(")")
print("```")
print("=" * 50)
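
The printed example refers to `ClassificationEnum` without defining it. Below is a sketch of what such an enum could look like, and of how a plain-text enum reply could be mapped back onto it. The member names and values are assumptions based on the `present` and `absent` labels used elsewhere in this notebook; `classify.py` may define them differently. `parse_label` is a hypothetical helper.

```python
import enum


class ClassificationEnum(enum.Enum):
    """Assumed label set; classify.py may define its own."""

    PRESENT = "present"
    ABSENT = "absent"


def parse_label(response_text):
    """Map the model's plain-text reply onto the enum, or None if unexpected."""
    try:
        return ClassificationEnum(response_text.strip().lower())
    except ValueError:
        return None
```

Returning `None` for an unexpected reply lets the caller decide whether to retry or to record the document as unclassified.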

## Step 7: Manual Analysis

Based on the content we have, let's manually analyze what we expect the tool to find:

In [None]:
# Let's manually identify Strong Mayor Powers references in our sample documents
sample_dir = "sample_documents"

if os.path.exists(sample_dir):
    print("Manual analysis of Strong Mayor Powers references:")
    print("=" * 60)
    
    # Key phrases to look for
    strong_mayor_phrases = [
        "Strong Mayor Powers",
        "enhanced mayoral authorities", 
        "override certain council decisions",
        "enhanced control over municipal planning",
        "mayor override powers",
        "mayoral authorities",
        "streamlined decision-making capabilities"
    ]
    
    for file in os.listdir(sample_dir):
        file_path = os.path.join(sample_dir, file)
        print(f"\n--- Analyzing {file} ---")
        
        if file.endswith('.txt'):
            with open(file_path, 'r') as f:
                content = f.read()
        elif file.endswith('.html'):
            with open(file_path, 'r') as f:
                content = f.read()
        else:
            print("Binary file - would be processed directly by API")
            continue
            
        found_phrases = []
        for phrase in strong_mayor_phrases:
            if phrase.lower() in content.lower():
                found_phrases.append(phrase)
                # Find the context around this phrase
                content_lower = content.lower()
                phrase_lower = phrase.lower()
                index = content_lower.find(phrase_lower)
                if index != -1:
                    # Get 50 characters before and after the phrase
                    start = max(0, index - 50)
                    end = min(len(content), index + len(phrase) + 50)
                    context = content[start:end].replace('\n', ' ')
                    print(f"  ✓ Found: '{phrase}'")
                    print(f"    Context: ...{context}...")
        
        print(f"\n  Summary for {file}: Found {len(found_phrases)} matching phrases")
        print(f"  Expected classification: {'present' if found_phrases else 'absent'}")
    
    print("\n" + "=" * 60)
    print("\nOverall Analysis:")
    print("Our sample documents contain clear references to Strong Mayor Powers")
    print("The tool should classify both as 'present'")
else:
    print("Sample directory not available for manual analysis")
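
The keyword search above runs on raw file contents, so in an HTML file a phrase that is split by tags or line breaks would be missed. As a sketch, the markup can be reduced to its text first using the standard library's `html.parser`. The helper names are hypothetical, and this remains a simple baseline, not a description of how the model classifies documents.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    """Strip tags and collapse whitespace so phrases match across line breaks."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


def find_phrases(text, phrases):
    """Return the phrases that occur in `text`, ignoring case."""
    lowered = text.lower()
    return [p for p in phrases if p.lower() in lowered]
```

For example, `find_phrases(visible_text(content), strong_mayor_phrases)` could replace the raw substring test for the `.html` samples.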

## Step 8: Running with API Key (Optional)

If you have a Gemini API key, you can run the actual classification using the new API. Otherwise, we'll show what the expected output would look like.

In [None]:
# Check if API key is available
api_key = os.environ.get('GEMINI_API_KEY')

if api_key:
    print("✓ Gemini API key found in environment variables")
    print("Running actual classification with new Google GenAI API...\n")
    
    # Run the actual classification
    result = subprocess.run(['python3', 'classify.py', sample_dir, 
                           '--output-csv', 'demo_results.csv'], 
                           capture_output=True, text=True)
    
    print("Classification output:")
    print(result.stdout)
    
    if result.stderr:
        print("\nErrors/Warnings:")
        print(result.stderr)
    
    # Check if results file was created
    if os.path.exists('demo_results.csv'):
        print("\n✓ Results file created: demo_results.csv")
    else:
        print("\n✗ Results file not created")
        
else:
    print("✗ No Gemini API key found in GEMINI_API_KEY environment variable")
    print("\nTo run the actual classification, you would need to:")
    print("1. Get a Gemini API key from Google AI Studio")
    print("2. Set it as an environment variable: export GEMINI_API_KEY='your-key-here'")
    print("3. Re-run this cell")
    print("\nNote: The new API uses GEMINI_API_KEY (not GOOGLE_API_KEY)")
    print("For demonstration purposes, we'll show expected output format below.")
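
Because the variable name changed from `GOOGLE_API_KEY` to `GEMINI_API_KEY`, a leftover legacy setting is easy to overlook. The sketch below follows this notebook's migration note and reports that case explicitly. `resolve_api_key` is a hypothetical helper, not part of the tool.

```python
import os


def resolve_api_key(env=None):
    """Return (key, note). Prefers GEMINI_API_KEY; flags the legacy variable."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY")
    if key:
        return key, None
    if env.get("GOOGLE_API_KEY"):
        return None, "GOOGLE_API_KEY is set, but this tool reads GEMINI_API_KEY"
    return None, "GEMINI_API_KEY is not set"
```

Accepting the environment as a parameter keeps the function testable without modifying `os.environ`.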

## Step 9: Examining Results

Let's look at the results file format and interpret the output:

In [None]:
# Check if we have actual results or create expected results for demo
results_file = 'demo_results.csv'

if os.path.exists(results_file):
    # Read actual results
    print("Reading actual classification results:")
    df = pd.read_csv(results_file)
    print(df.to_string(index=False))
    
    # Analyze results
    print("\nResults Analysis:")
    print(f"Total documents processed: {len(df)}")
    
    if 'Strong Mayor Powers' in df.columns:
        value_counts = df['Strong Mayor Powers'].value_counts()
        print("Classification results:")
        for classification, count in value_counts.items():
            print(f"  - {classification}: {count}")
            
else:
    # Create expected results for demonstration
    print("Expected results format (since no results file was found):")
    
    # One row per sample document created in Step 3
    expected_results = pd.DataFrame({
        'Comment ID': ['comment_001', 'comment_002'],
        'Strong Mayor Powers': ['present', 'present']
    })
    
    print(expected_results.to_string(index=False))
    
    print("\nExpected Analysis:")
    print(f"Total documents processed: {len(expected_results)}")
    print("Classification results:")
    for classification, count in expected_results['Strong Mayor Powers'].value_counts().items():
        print(f"  - {classification}: {count}")
    print("\nThis matches our manual analysis - both sample documents reference")
    print("Strong Mayor Powers and should be classified as 'present'.")
    
    # Save expected results for demo purposes
    expected_results.to_csv('expected_results.csv', index=False)
    print("\n✓ Expected results saved to expected_results.csv")
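
Once a results file exists, it is useful to compare it against the manual expectations instead of only counting labels. Here is a minimal sketch using the standard `csv` module; the column names come from this notebook, while the function names and the sample data are hypothetical and are not real tool output.

```python
import csv
import io
from collections import Counter


def tally_labels(csv_file, column="Strong Mayor Powers"):
    """Count each label in `column` of an open CSV file."""
    return Counter(row[column] for row in csv.DictReader(csv_file))


def disagreements(csv_file, expected, id_column="Comment ID",
                  label_column="Strong Mayor Powers"):
    """List (id, expected, actual) for rows whose label differs from `expected`."""
    return [
        (row[id_column], expected[row[id_column]], row[label_column])
        for row in csv.DictReader(csv_file)
        if row[id_column] in expected
        and expected[row[id_column]] != row[label_column]
    ]


# Illustrative data only; not real tool output.
sample = "Comment ID,Strong Mayor Powers\ncomment_001,present\ncomment_002,absent\n"
print(tally_labels(io.StringIO(sample)))
print(disagreements(io.StringIO(sample),
                    {"comment_001": "present", "comment_002": "present"}))
```

Any row returned by `disagreements` is worth reading by hand, since either the manual expectation or the model's label is wrong.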

## Step 10: Visualizing Results (if we have them)

Let's create a simple visualization of the classification results:

In [None]:
import matplotlib.pyplot as plt

# Use actual or expected results
if os.path.exists('demo_results.csv'):
    df = pd.read_csv('demo_results.csv')
    title_suffix = "(Actual Results)"
elif os.path.exists('expected_results.csv'):
    df = pd.read_csv('expected_results.csv')
    title_suffix = "(Expected Results)"
else:
    # Fallback
    df = pd.DataFrame({
        'Comment ID': ['comment_001', 'comment_002'],
        'Strong Mayor Powers': ['present', 'present']
    })
    title_suffix = "(Demo)"

# Create a pie chart of classifications
if 'Strong Mayor Powers' in df.columns:
    plt.figure(figsize=(10, 6))
    
    # Subplot 1: Pie chart
    plt.subplot(1, 2, 1)
    value_counts = df['Strong Mayor Powers'].value_counts()
    colors = ['#ff7f7f' if x == 'present' else '#7f7fff' for x in value_counts.index]
    plt.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', colors=colors)
    plt.title(f'Strong Mayor Powers Classifications {title_suffix}')
    
    # Subplot 2: Bar chart
    plt.subplot(1, 2, 2)
    bars = plt.bar(value_counts.index, value_counts.values, color=colors)
    plt.title('Classification Counts')
    plt.ylabel('Number of Documents')
    plt.xlabel('Classification')
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{int(height)}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    print("\nVisualization shows the distribution of classifications.")
    print("Red/pink = 'present' (contains Strong Mayor Powers references)")
    print("Blue = 'absent' (no Strong Mayor Powers references)")
else:
    print("No classification data available for visualization")
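
Where matplotlib is unavailable (for example, in a terminal session), the same distribution can be shown as text. This is only a sketch of a fallback, with a hypothetical helper name and illustrative labels.

```python
from collections import Counter


def text_bar_chart(labels, width=30):
    """Render label counts as fixed-width text bars, largest first."""
    counts = Counter(labels)
    if not counts:
        return "(no data)"
    longest = max(len(label) for label in counts)
    peak = max(counts.values())
    lines = []
    for label, count in counts.most_common():
        # Scale each bar relative to the most frequent label.
        bar = "#" * max(1, round(count / peak * width))
        lines.append(f"{label.ljust(longest)} | {bar} {count}")
    return "\n".join(lines)


# Illustrative labels only.
print(text_bar_chart(["present", "present", "absent"]))
```

With a DataFrame loaded, the call would be `text_bar_chart(df['Strong Mayor Powers'])`.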

## Step 11: Understanding Different Input Formats

The tool supports various input formats. Let's demonstrate how it works with different file types:

In [None]:
# Create samples of different file types
print("Creating sample files in different formats...\n")

# Text file
with open(os.path.join(sample_dir, 'comment_002.txt'), 'w') as f:
    f.write("This is a simple text comment about municipal governance. "
            "I think the new Strong Mayor Powers are concerning.")
print("✓ Created comment_002.txt")

# HTML file
html_content = """
<!DOCTYPE html>
<html>
<head><title>Public Comment</title></head>
<body>
<h1>Municipal Governance Comment</h1>
<p>I support the implementation of <strong>enhanced mayoral authorities</strong> 
as they will help streamline decision-making processes in our municipality.</p>
</body>
</html>
"""
with open(os.path.join(sample_dir, 'comment_003.html'), 'w') as f:
    f.write(html_content)
print("✓ Created comment_003.html")

# List all files in the sample directory
print("\nAll sample files created:")
for file in sorted(os.listdir(sample_dir)):
    file_path = os.path.join(sample_dir, file)
    file_size = os.path.getsize(file_path)
    print(f"  - {file} ({file_size} bytes)")
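
Direct upload means each file is sent as raw bytes together with a MIME type. The sketch below maps the file extensions named in this notebook's summary to MIME types. The MIME strings are the commonly registered ones; whether `classify.py` sends exactly these, or whether the API accepts every one of them, is an assumption not verified here. The helper names are hypothetical.

```python
from pathlib import Path

# Extensions come from this notebook's supported-format list; the MIME
# strings are an assumption about what would be sent with each upload.
MIME_BY_SUFFIX = {
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".text": "text/plain",
    ".html": "text/html",
    ".htm": "text/html",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".rtf": "application/rtf",
}


def mime_type_for(path):
    """Return the MIME type for a supported file, or None if unsupported."""
    return MIME_BY_SUFFIX.get(Path(path).suffix.lower())


def load_upload(path):
    """Read a file as raw bytes and pair it with its MIME type."""
    mime = mime_type_for(path)
    if mime is None:
        raise ValueError(f"Unsupported file type: {path}")
    return Path(path).read_bytes(), mime
```

The `(bytes, mime_type)` pair returned by `load_upload` corresponds to the `data` and `mime_type` arguments of `types.Part.from_bytes()` shown in Step 6.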

## Step 12: Running Dry-Run on Multiple Files

Let's see how the tool handles multiple files of different types:

In [None]:
# Run dry-run on the expanded sample directory
print("Running dry-run on multiple file types...\n")

result = subprocess.run(['python3', 'classify.py', sample_dir, '--dry-run'], 
                       capture_output=True, text=True)

print("Dry-run output for multiple files:")
print(result.stdout)

if result.stderr:
    print("\nErrors/Warnings:")
    print(result.stderr)

# Extract and analyze token information
output_lines = result.stdout.split('\n')

# Find the line with total comments
total_comments_line = [line for line in output_lines if 'total of' in line and 'comments' in line]
if total_comments_line:
    print(f"\nProcessed files: {total_comments_line[0]}")

# Find token usage
token_line = [line for line in output_lines if 'tokens required' in line]
if token_line:
    print(f"Token usage: {token_line[0]}")

## Step 13: Summary and Next Steps

Let's summarize what we've learned and provide guidance for real-world usage:

In [None]:
print("=" * 80)
print("DEMO SUMMARY: Strong Mayor Powers Classification Tool (New Google GenAI API)")
print("=" * 80)

print("\n✓ What we demonstrated:")
print("  1. Tool setup with new Google GenAI API")
print("  2. Sample document creation (multiple formats)")
print("  3. Direct document upload (no text extraction)")
print("  4. Dry-run mode for cost estimation")
print("  5. Understanding the new API approach")
print("  6. Manual analysis of Strong Mayor Powers references")
print("  7. Structured responses using enums")
print("  8. Results interpretation")

print("\n📊 Key findings from our samples:")
print("  - All sample documents contain Strong Mayor Powers references")
print("  - Expected classification: PRESENT for every sample document")
print("  - Tool processes multiple file formats directly")
print("  - No more text extraction dependencies")

print("\n🔧 For real-world usage:")
print("  1. Obtain Gemini API key from Google AI Studio")
print("  2. Set environment variable: export GEMINI_API_KEY='your-key'")
print("  3. Prepare documents in a directory (no JSON files)")
print("  4. Run dry-run first to estimate costs")
print("  5. Process documents and analyze results")

print("\n📁 Supported file formats:")
print("  - PDF (.pdf) - direct upload, no pdfplumber needed")
print("  - Text (.txt, .text) - direct upload")
print("  - HTML (.html, .htm) - direct upload, no BeautifulSoup needed")
print("  - Word (.docx) - direct upload, no python-docx needed")
print("  - RTF (.rtf) - direct upload, no striprtf needed")
print("  - JSON input is NO LONGER SUPPORTED")

print("\n🚀 New API Benefits:")
print("  - Simpler dependency management")
print("  - Better document processing by Google's models")
print("  - Structured responses with enum validation")
print("  - Updated model: gemini-2.5-flash (1M token context)")
print("  - More reliable text extraction")

print("\n💡 Migration notes:")
print("  - Environment variable changed: GOOGLE_API_KEY → GEMINI_API_KEY")
print("  - Argument changed: --google-api-key → --gemini-api-key")
print("  - JSON input removed, directory input only")
print("  - No more text extraction libraries needed")
print("  - Model updated to gemini-2.5-flash")

print("\n" + "=" * 80)

## Cleanup (Optional)

If you want to clean up the demo files, run the cell below:

In [None]:
# Cleanup demo files (uncomment to run)
# import shutil
# 
# cleanup_files = [
#     'sample_documents',
#     'demo_results.csv', 
#     'expected_results.csv',
#     'sample_document.pdf'
# ]
# 
# print("Cleaning up demo files...")
# for item in cleanup_files:
#     if os.path.exists(item):
#         if os.path.isdir(item):
#             shutil.rmtree(item)
#             print(f"✓ Removed directory: {item}")
#         else:
#             os.remove(item)
#             print(f"✓ Removed file: {item}")
#     else:
#         print(f"  - {item} (not found)")
# 
# print("\nCleanup complete!")

print("Cleanup cell is commented out. Uncomment and run to clean up demo files.")
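
Instead of uncommenting the cell by hand, cleanup can be wrapped in a function with a dry-run switch, so you can see what would be deleted first. A minimal sketch, with a hypothetical helper name:

```python
import shutil
from pathlib import Path


def cleanup(paths, dry_run=True):
    """Remove files/directories in `paths`; with dry_run, only report them.

    Returns the paths that existed (and were removed unless dry_run is set).
    """
    found = []
    for raw in paths:
        path = Path(raw)
        if not path.exists():
            continue
        found.append(str(path))
        if dry_run:
            print(f"Would remove: {path}")
        elif path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
    return found
```

Calling `cleanup(['sample_documents', 'demo_results.csv', 'expected_results.csv'])` lists the targets; passing `dry_run=False` actually removes them.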