# 🏟️ Venue Capacity Finder - Interactive Notebook

Welcome to the **Venue Capacity Finder**, a web scraping notebook that looks up venue capacities from several sources, tried in this order:
- 📚 **Wikipedia** (most reliable structured data)
- 🎵 **Specialized Sources** (Songkick, Bandsintown, AllMusic)
- 🔍 **Google Search** (broadest coverage + venue websites)

## Features:
- ✅ Multi-source capacity discovery
- 🔄 Smart caching system
- 📊 Statistical analysis and visualizations
- 💾 Database integration
- 🛡️ Respectful rate limiting
- 📈 Batch processing capabilities

---

## 📦 1. Setup and Configuration

Import all required libraries and configure settings for database connections and web scraping.

In [None]:
# Import Required Libraries
import json
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import time
from datetime import datetime
import sys
from sqlalchemy import create_engine, text
from urllib.parse import quote_plus
import urllib.parse
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
warnings.filterwarnings('ignore')

# Matplotlib settings
plt.style.use('default')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🌐 Requests version: {requests.__version__}")
print(f"🍲 BeautifulSoup4 available")
print(f"📈 Matplotlib & Seaborn ready")

## 🗄️ 2. Database Connection Class

Create the main VenueCapacityFinder class with database connectivity and session management.

In [None]:
class VenueCapacityFinder:
    """
    Advanced venue capacity finder with multi-source web scraping capabilities
    """
    def __init__(self, db_config_path='db.json'):
        """Initialize the Venue Capacity Finder with database configuration."""
        self.db_config = self.load_db_config(db_config_path)
        self.engine = None
        self.capacity_cache = {}
        self.session = requests.Session()  # Reuse connections for better performance
        
        # Send browser-like headers with every request made through this session
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
        })
        print("🚀 VenueCapacityFinder initialized successfully!")
    
    def load_db_config(self, config_path):
        """Load database configuration from JSON file."""
        try:
            with open(config_path, 'r') as f:
                config = json.load(f)
            print(f"✅ Database configuration loaded from {config_path}")
            return config
        except FileNotFoundError:
            print(f"❌ Configuration file {config_path} not found!")
            return None
        except Exception as e:
            print(f"❌ Error loading configuration: {e}")
            return None
    
    def create_connection(self):
        """Create SQLAlchemy database connection."""
        if not self.db_config:
            return False
            
        try:
            # Create SQLAlchemy engine
            user = self.db_config['user']
            password = self.db_config['password']
            host = self.db_config['host']
            database = self.db_config['database']
            
            # URL-encode the password to handle special characters
            encoded_password = quote_plus(password)
            connection_string = f"mysql+mysqlconnector://{user}:{encoded_password}@{host}/{database}"
            
            self.engine = create_engine(connection_string)
            
            # Test the connection
            with self.engine.connect() as conn:
                result = conn.execute(text("SELECT DATABASE()"))
                db_name = result.fetchone()[0]
                print(f"✅ Connected to database: {db_name}")
            
            return True
                
        except Exception as e:
            print(f"❌ Database connection failed: {e}")
            return False
    
    def close_connection(self):
        """Close database connection and session properly."""
        try:
            if self.engine:
                self.engine.dispose()
                print("SQLAlchemy engine disposed.")
            if self.session:
                self.session.close()
                print("Requests session closed.")
        except Exception as e:
            print(f"❌ Error closing connection: {e}")

# Test the class initialization
finder = VenueCapacityFinder()
print(f"📊 Cache initialized: {len(finder.capacity_cache)} items")
print(f"🌐 Session headers configured: {len(finder.session.headers)} headers")
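
`create_connection` reads four keys from the loaded configuration: `user`, `password`, `host`, and `database`. The sketch below shows a `db.json` payload with that shape and the connection URL it would produce. The credential values and the helper name `build_connection_url` are placeholders made up for illustration; only the key names and the URL format come from the class above.

```python
import json
from urllib.parse import quote_plus

# Hypothetical credentials; only the four key names come from create_connection().
sample_config = {
    "user": "venue_user",
    "password": "p@ss:word/1",
    "host": "localhost",
    "database": "concerts",
}

def build_connection_url(config):
    """Build the same URL shape that create_connection() passes to create_engine()."""
    encoded_password = quote_plus(config["password"])
    return (
        f"mysql+mysqlconnector://{config['user']}:{encoded_password}"
        f"@{config['host']}/{config['database']}"
    )

config_text = json.dumps(sample_config, indent=2)  # what db.json would contain
connection_url = build_connection_url(json.loads(config_text))
print(connection_url)
```

Because the password goes through `quote_plus`, characters such as `@`, `:` and `/` cannot be mistaken for URL delimiters.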

## 🌐 3. Web Scraping Methods

Implement safe web request functions and specialized scrapers for different sources.

In [None]:
# Add web scraping methods to the VenueCapacityFinder class

def safe_web_request(self, url, timeout=10):
    """Make a GET request with error handling; return the response, or None on any failure or non-200 status. Pacing between requests is left to the caller."""
    try:
        response = self.session.get(url, timeout=timeout)
        if response.status_code == 200:
            return response
        else:
            print(f"   HTTP {response.status_code} for {url[:50]}...")
            return None
    except requests.Timeout:
        print(f"   Timeout accessing {url[:50]}...")
        return None
    except requests.RequestException as e:
        print(f"   Request error for {url[:50]}...: {str(e)[:50]}")
        return None
    except Exception as e:
        print(f"   Unexpected error for {url[:50]}...: {str(e)[:50]}")
        return None

def search_venue_capacity_wikipedia(self, venue_name, city=None):
    """Search for venue capacity on Wikipedia."""
    try:
        # Clean venue name for search
        search_term = venue_name.replace(" ", "_")
        if city:
            search_urls = [
                f"https://en.wikipedia.org/wiki/{search_term}",
                f"https://en.wikipedia.org/wiki/{search_term}_{city.replace(' ', '_')}",
                f"https://en.wikipedia.org/wiki/{search_term}_({city.replace(' ', '_')})"
            ]
        else:
            search_urls = [f"https://en.wikipedia.org/wiki/{search_term}"]
        
        for url in search_urls:
            response = self.safe_web_request(url, timeout=8)
            if response:
                soup = BeautifulSoup(response.content, 'html.parser')
                
                # Look for capacity in infobox
                infobox = soup.find('table', class_='infobox')
                if infobox:
                    rows = infobox.find_all('tr')
                    for row in rows:
                        # Named row_text so it does not shadow sqlalchemy's text()
                        row_text = row.get_text().lower()
                        if any(keyword in row_text for keyword in ['capacity', 'seating', 'seats']):
                            # Extract numbers from the row
                            numbers = re.findall(r'[\d,]+', row.get_text())
                            if numbers:
                                # Take the largest number as likely capacity
                                capacities = [int(num.replace(',', '')) for num in numbers if num.replace(',', '').isdigit()]
                                if capacities:
                                    max_capacity = max(capacities)
                                    if max_capacity > 100:  # Reasonable venue capacity
                                        return max_capacity, "Wikipedia"
                
                # Look for capacity in page text
                page_text = soup.get_text().lower()
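                # The captured group must start with a digit; a bare [\d,]+ can match
                # a lone comma right after the keyword and miss the actual number
                capacity_patterns = [
                    r'capacity[^\d]*?(\d[\d,]*)',
                    r'seats[^\d]*?(\d[\d,]*)',
                    r'seating[^\d]*?(\d[\d,]*)',
                    r'holds[^\d]*?(\d[\d,]*)'
                ]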
                
                for pattern in capacity_patterns:
                    matches = re.findall(pattern, page_text)
                    if matches:
                        capacities = [int(match.replace(',', '')) for match in matches if match.replace(',', '').isdigit()]
                        reasonable_capacities = [c for c in capacities if 100 <= c <= 200000]
                        if reasonable_capacities:
                            return max(reasonable_capacities), "Wikipedia"
                
            time.sleep(1)  # Be respectful to Wikipedia: pause after every attempt, including failed ones
        
        return None, None
        
    except Exception as e:
        print(f"⚠️  Wikipedia search error for {venue_name}: {e}")
        return None, None

# Add methods to the class
VenueCapacityFinder.safe_web_request = safe_web_request
VenueCapacityFinder.search_venue_capacity_wikipedia = search_venue_capacity_wikipedia

print("✅ Wikipedia scraping methods added to VenueCapacityFinder class")
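
The page-text fallback in `search_venue_capacity_wikipedia` amounts to: find a keyword, then take the next number. A minimal offline sketch of that idea, using only `re` on a made-up sentence, might look like this. The helper name `extract_capacities`, the `NUMBER` pattern, and the sample text are assumptions for illustration, not part of the class; the 100 to 200,000 plausibility range is the one used in the notebook.

```python
import re

# Comma-grouped ("19,812") or plain ("5000") integers; alternatives are tried in that order.
NUMBER = r"(\d{1,3}(?:,\d{3})+|\d+)"
KEYWORDS = ("capacity", "seats", "seating", "holds")

def extract_capacities(page_text, low=100, high=200_000):
    """Return plausible capacities that follow one of the keywords."""
    found = []
    lowered = page_text.lower()
    for keyword in KEYWORDS:
        # [^\d]*? skips lazily over non-digits between the keyword and the number.
        for match in re.findall(rf"{keyword}[^\d]*?{NUMBER}", lowered):
            value = int(match.replace(",", ""))
            if low <= value <= high:
                found.append(value)
    return found

sample = "The arena has a capacity of 19,812 for basketball. The theatre next door seats 5000."
capacities = extract_capacities(sample)
print(capacities)
```

Running the extraction on inline text like this makes it possible to check the patterns without sending any requests.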

In [None]:
# Add specialized sources and Google search methods

def search_venue_capacity_additional_sources(self, venue_name, city=None):
    """Search for venue capacity from additional specialized sources."""
    try:
        # Comma-grouped ("12,500") or plain ("12500") integers. A bare
        # \d{1,3}(?:,\d{3})* would capture only "125" from "12500".
        number = r'(\d{1,3}(?:,\d{3})+|\d+)'

        # List of specialized venue/concert websites that might have capacity info
        search_sources = [
            {
                'name': 'Songkick',
                'url_template': 'https://www.songkick.com/search?query={venue}+{city}',
                'patterns': [rf'capacity[^\d]*?{number}', rf'seats[^\d]*?{number}']
            },
            {
                'name': 'Bandsintown',
                'url_template': 'https://www.bandsintown.com/v/{venue}',
                'patterns': [rf'capacity[^\d]*?{number}', rf'seating[^\d]*?{number}']
            },
            {
                'name': 'AllMusic',
                'url_template': 'https://www.allmusic.com/search/venue/{venue}',
                'patterns': [rf'capacity[^\d]*?{number}', rf'holds[^\d]*?{number}']
            }
        ]
        
        for source in search_sources:
            # Format the URL with venue name and city (URL-encoded; spaces become '+')
            venue_clean = quote_plus(venue_name.lower().replace('&', 'and'))
            city_clean = quote_plus(city.lower()) if city else ''
            
            # Always pass both values: str.format ignores unused keyword arguments,
            # whereas formatting a '{city}' template without city raises KeyError
            search_url = source['url_template'].format(venue=venue_clean, city=city_clean)
            
            response = self.safe_web_request(search_url, timeout=8)
            if response:
                soup = BeautifulSoup(response.content, 'html.parser')
                page_text = soup.get_text().lower()
                
                # Search for capacity using source-specific patterns
                for pattern in source['patterns']:
                    matches = re.findall(pattern, page_text)
                    if matches:
                        capacities = []
                        for match in matches:
                            try:
                                capacity = int(match.replace(',', ''))
                                if 100 <= capacity <= 200000:
                                    capacities.append(capacity)
                            except ValueError:
                                continue
                        
                        if capacities:
                            return max(capacities), source['name']
            
            time.sleep(1)  # Be respectful between requests
        
        return None, None
        
    except Exception as e:
        print(f"⚠️  Additional sources search error for {venue_name}: {e}")
        return None, None

def search_venue_capacity_google(self, venue_name, city=None):
    """Search for venue capacity using Google search with advanced web scraping."""
    try:
        # Create multiple search queries for better coverage
        search_queries = []
        if city:
            search_queries = [
                f'"{venue_name}" {city} capacity',
                f'"{venue_name}" {city} seating capacity',
                f'"{venue_name}" {city} seats',
                f'{venue_name} {city} "capacity"',
                f'{venue_name} {city} venue information'
            ]
        else:
            search_queries = [
                f'"{venue_name}" capacity',
                f'"{venue_name}" seating capacity',
                f'"{venue_name}" seats',
                f'{venue_name} "capacity"',
                f'{venue_name} venue information'
            ]
        
        for query in search_queries:
            # URL encode the search query
            encoded_query = urllib.parse.quote_plus(query)
            search_url = f"https://www.google.com/search?q={encoded_query}"
            
            response = self.safe_web_request(search_url, timeout=10)
            if response:
                soup = BeautifulSoup(response.content, 'html.parser')
                
                # Extract text content from search results
                search_text = soup.get_text().lower()
                
                # Look for capacity information in Google search results
                name = re.escape(venue_name.lower())
                # Comma-grouped ("12,500") or plain ("12500") integers. A bare
                # \d{1,3}(?:,\d{3})* would capture only "125" from "12500".
                number = r'(\d{1,3}(?:,\d{3})+|\d+)'
                capacity_patterns = [
                    rf'{name}[^.]*?capacity[^.]*?{number}',
                    rf'capacity[^.]*?{number}[^.]*?{name}',
                    rf'{name}[^.]*?seats[^.]*?{number}',
                    rf'seats[^.]*?{number}[^.]*?{name}',
                    rf'{name}[^.]*?seating[^.]*?{number}',
                    rf'seating[^.]*?{number}[^.]*?{name}'
                ]
                
                for pattern in capacity_patterns:
                    matches = re.findall(pattern, search_text, re.IGNORECASE | re.DOTALL)
                    if matches:
                        # Process all found capacities
                        capacities = []
                        for match in matches:
                            try:
                                capacity = int(match.replace(',', ''))
                                # Filter for reasonable venue capacities
                                if 100 <= capacity <= 200000:
                                    capacities.append(capacity)
                            except ValueError:
                                continue
                        
                        if capacities:
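                            # Return the most common capacity (mode), or the largest if tied.
                            # Counter.most_common() breaks ties by first appearance, so use an explicit key.
                            capacity_counts = Counter(capacities)
                            most_common_capacity = max(
                                capacity_counts, key=lambda c: (capacity_counts[c], c)
                            )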
                            return most_common_capacity, "Google Search"
            
            # Be respectful with Google requests
            time.sleep(2)
        
        return None, None
        
    except Exception as e:
        print(f"⚠️  Google search error for {venue_name}: {e}")
        return None, None

# Add methods to the class
VenueCapacityFinder.search_venue_capacity_additional_sources = search_venue_capacity_additional_sources
VenueCapacityFinder.search_venue_capacity_google = search_venue_capacity_google

print("✅ Additional sources and Google search methods added")
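
The Google search method has to pick one value from all the candidate numbers it finds. A small sketch of a "most common value, largest on ties" rule follows; the helper name `pick_capacity` and the sample numbers are hypothetical. `Counter.most_common` orders equal counts by first appearance, so the sketch uses an explicit sort key to make the tie-break deterministic.

```python
from collections import Counter

def pick_capacity(candidates):
    """Return the most frequent value; on a tie in frequency, return the larger value."""
    if not candidates:
        return None
    counts = Counter(candidates)
    # The key is (frequency, value), so equal frequencies fall back to the value itself.
    return max(counts, key=lambda value: (counts[value], value))

print(pick_capacity([18000, 20789, 20789, 19500]))  # one value repeats
print(pick_capacity([18000, 20789]))                # every value appears once
```

Preferring the repeated value guards against a single stray number (a year, a price) outvoting a figure that several search snippets agree on.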

## 🔍 4. Venue Search Functions

Functions to search for venues in the database and retrieve venue information.

In [None]:
# Add venue search and database functions

def search_venues_in_database(self, search_term=None):
    """Search for venues in the database."""
    print("🔍 Searching venues in database...")
    
    if not self.create_connection():
        return pd.DataFrame()
    
    try:
        if search_term:
            # Search for specific venue
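            # Column names follow the VENUE_NAME / CITY / COUNTRY schema used by the
            # other queries in this notebook; the value is bound as a named parameter
            venue_query = text("""
                SELECT DISTINCT VENUE_NAME AS name, CITY AS city, COUNTRY AS country
                FROM VENUES
                WHERE VENUE_NAME LIKE :pattern
                   OR CITY LIKE :pattern
                ORDER BY VENUE_NAME
            """)
            search_pattern = f"%{search_term}%"
            venues_df = pd.read_sql(venue_query, self.engine, params={"pattern": search_pattern})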
            print(f"✅ Found {len(venues_df)} venues matching '{search_term}'")
        else:
            # Get all venues
            venue_query = """
                SELECT DISTINCT VENUE_NAME AS name, CITY AS city, COUNTRY AS country
                FROM VENUES
                ORDER BY VENUE_NAME
            """
            venues_df = pd.read_sql(venue_query, self.engine)
            print(f"✅ Retrieved {len(venues_df)} total venues from database")
        
        return venues_df
        
    except Exception as e:
        print(f"❌ Error searching venues: {e}")
        return pd.DataFrame()
    finally:
        self.close_connection()

def get_venues_without_capacity(self):
    """Get venues that don't have capacity information yet."""
    print("🔍 Searching for venues without capacity data...")
    
    if not self.create_connection():
        return pd.DataFrame()
    
    try:
        venue_query = """
            SELECT DISTINCT VENUE_NAME as name, CITY as city, COUNTRY as country 
            FROM VENUES 
            WHERE capacity IS NULL OR capacity = 0
            ORDER BY VENUE_NAME
            LIMIT 100
        """
        venues_df = pd.read_sql(venue_query, self.engine)
        print(f"✅ Found {len(venues_df)} venues without capacity data")
        return venues_df
        
    except Exception as e:
        print(f"❌ Error searching venues: {e}")
        return pd.DataFrame()
    finally:
        self.close_connection()

# Add methods to the class
VenueCapacityFinder.search_venues_in_database = search_venues_in_database
VenueCapacityFinder.get_venues_without_capacity = get_venues_without_capacity

print("✅ Venue search functions added")
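
The venue search relies on the driver binding the `LIKE` pattern as a parameter, so the search term is never spliced into the SQL text. A self-contained sketch of that pattern is shown here, using an in-memory SQLite database as a stand-in for the MySQL server. The table layout mirrors the `VENUE_NAME`, `CITY`, and `COUNTRY` columns used in `get_venues_without_capacity`; the helper name `search_venues` and the third row are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE VENUES (VENUE_NAME TEXT, CITY TEXT, COUNTRY TEXT)")
conn.executemany(
    "INSERT INTO VENUES VALUES (?, ?, ?)",
    [
        ("Madison Square Garden", "New York", "USA"),
        ("Rogers Centre", "Toronto", "Canada"),
        ("Garden Hall", "Boston", "USA"),  # made-up row
    ],
)

def search_venues(connection, search_term):
    """Return (name, city, country) rows whose name or city contains search_term."""
    pattern = f"%{search_term}%"
    query = """
        SELECT DISTINCT VENUE_NAME, CITY, COUNTRY
        FROM VENUES
        WHERE VENUE_NAME LIKE :pattern OR CITY LIKE :pattern
        ORDER BY VENUE_NAME
    """
    # The driver binds the value; the search term never becomes part of the SQL text.
    return connection.execute(query, {"pattern": pattern}).fetchall()

matches = search_venues(conn, "Garden")
city_matches = search_venues(conn, "Toronto")
print(matches)
print(city_matches)
conn.close()
```

The placeholder syntax differs between drivers (`:name` here, `%s` for some MySQL drivers), but the principle of passing values separately from the query text is the same.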

## 🎯 5. Capacity Finding Implementation

Core logic for finding venue capacities using multiple sources with smart caching.

In [None]:
# Core capacity finding implementation

def get_venue_capacity(self, venue_name, city=None, country=None, verbose=True):
    """Get venue capacity from multiple sources with smart caching."""
    
    # Check cache first
    cache_key = f"{venue_name}_{city}_{country}".lower()
    if cache_key in self.capacity_cache:
        if verbose:
            capacity, source = self.capacity_cache[cache_key]
            if capacity:
                print(f"🔄 Found in cache: {venue_name} = {capacity:,} ({source})")
            else:
                print(f"🔄 Found in cache: {venue_name} = No capacity found")
        return self.capacity_cache[cache_key]
    
    if verbose:
        print(f"🔍 Searching capacity for: {venue_name}")
        if city:
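            # Skip the country when it is missing instead of printing "None"
            location = ", ".join(part for part in (city, country) if part)
            print(f"   Location: {location}")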
    
    capacity = None
    source = None
    
    # Try Wikipedia first (most reliable)
    if verbose:
        print("   📚 Searching Wikipedia...")
    capacity, source = self.search_venue_capacity_wikipedia(venue_name, city)
    
    # If Wikipedia didn't find anything, try specialized venue sources
    if not capacity:
        if verbose:
            print("   🎵 Searching specialized venue databases...")
        capacity, source = self.search_venue_capacity_additional_sources(venue_name, city)
    
    # If still no luck, try Google search as last resort
    if not capacity:
        if verbose:
            print("   🔍 Searching Google...")
        capacity, source = self.search_venue_capacity_google(venue_name, city)
    
    if capacity and verbose:
        print(f"✅ Found capacity: {capacity:,} ({source})")
    elif verbose:
        print(f"❌ No capacity found for {venue_name}")
    
    # Cache the result
    result = (capacity, source)
    self.capacity_cache[cache_key] = result
    
    return result

def test_single_venue(self, venue_name, city=None, country=None):
    """Test capacity finding for a single venue with detailed output."""
    print(f"🧪 Testing: {venue_name}")
    print("=" * 50)
    
    # Test each method individually
    methods = [
        ("Wikipedia", self.search_venue_capacity_wikipedia),
        ("Additional Sources", self.search_venue_capacity_additional_sources),
        ("Google Search", self.search_venue_capacity_google)
    ]
    
    results = {}
    for method_name, method_func in methods:
        print(f"\n🔍 Testing {method_name}:")
        capacity, source = method_func(venue_name, city)
        if capacity:
            print(f"   ✅ Found: {capacity:,} ({source})")
            results[method_name] = (capacity, source)
        else:
            print(f"   ❌ No result")
            results[method_name] = (None, None)
    
    # Test combined method
    print(f"\n🎯 Combined Search:")
    final_capacity, final_source = self.get_venue_capacity(venue_name, city, country)
    if final_capacity:
        print(f"   🏆 Final Result: {final_capacity:,} ({final_source})")
    else:
        print(f"   ❌ No capacity found")
    
    return results

# Add methods to the class
VenueCapacityFinder.get_venue_capacity = get_venue_capacity
VenueCapacityFinder.test_single_venue = test_single_venue

print("✅ Capacity finding implementation added")
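
The control flow of `get_venue_capacity` (check the cache, try each source in order, cache whatever comes back) can be exercised without any network access. The sketch below does so with stub search functions. The names `make_cached_lookup`, `stub_wikipedia`, and `stub_google` are hypothetical, and the stubs return a bare capacity rather than the `(capacity, source)` tuple the real methods return.

```python
def make_cached_lookup(sources):
    """Wrap an ordered list of (label, search_function) pairs with a result cache."""
    cache = {}

    def lookup(venue_name, city=None, country=None):
        key = f"{venue_name}_{city}_{country}".lower()
        if key in cache:
            return cache[key]
        result = (None, None)
        for label, search in sources:
            capacity = search(venue_name, city)
            if capacity:
                result = (capacity, label)
                break          # first source with an answer wins
        cache[key] = result    # misses are cached too, so they are not retried
        return result

    return lookup

calls = []

def stub_wikipedia(venue_name, city):
    calls.append("wikipedia")
    return None                # pretend Wikipedia has nothing for this venue

def stub_google(venue_name, city):
    calls.append("google")
    return 2500                # made-up capacity

lookup = make_cached_lookup([("Wikipedia", stub_wikipedia), ("Google Search", stub_google)])
first = lookup("Example Hall", "Springfield", "USA")
second = lookup("example hall", "springfield", "usa")  # same key after lower()
print(first, second, calls)
```

Caching negative results keeps a batch run from repeating slow lookups for the same venue, at the cost that a temporary failure is remembered for the lifetime of the cache.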

In [None]:
# 🧪 Test the capacity finding on famous venues

print("Testing capacity finding on well-known venues...")
print("=" * 60)

# Test venues with known capacities
test_venues = [
    ("Madison Square Garden", "New York", "USA"),
    ("Staples Center", "Los Angeles", "USA"),
    ("Rogers Centre", "Toronto", "Canada")
]

for venue_name, city, country in test_venues:
    print(f"\n🏟️  Testing: {venue_name} ({city})")
    print("-" * 40)
    
    # Test the combined search
    capacity, source = finder.get_venue_capacity(venue_name, city, country)
    
    if capacity:
        print(f"🎯 Result: {capacity:,} seats ({source})")
    else:
        print("❌ No capacity found")
    
    print(f"Cache size: {len(finder.capacity_cache)} venues")
    time.sleep(1)  # Small delay between tests

## 📊 6. Batch Processing Functions

Process many venues in one run, with progress tracking, a configurable delay between venues, and an optional cap (`batch_size`) on how many venues are processed.

In [None]:
# Batch processing implementation

def batch_find_capacities(self, venues_df, batch_size=None, delay_between_venues=2):
    """Find capacities for multiple venues with progress tracking."""
    total_venues = len(venues_df)
    print(f"🚀 Starting batch capacity search for {total_venues} venues...")
    
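    if batch_size and batch_size < total_venues:
        # Only the first batch_size rows are processed; the rest are skipped in this run
        print(f"📦 Limiting this run to the first {batch_size} of {total_venues} venues")
        venues_df = venues_df.head(batch_size)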
    
    start_time = datetime.now()
    results = []
    
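    # enumerate() yields a 0-based position, so the progress counter and the final-row
    # check stay correct even when the DataFrame index is not 0..n-1 (e.g. after filtering)
    for index, (_, row) in enumerate(venues_df.iterrows()):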
        venue_name = row['name'] if 'name' in row else row.get('VENUE_NAME', 'Unknown')
        city = row.get('city') or row.get('CITY')
        country = row.get('country') or row.get('COUNTRY', 'USA')
        
        # Search for capacity (with verbose=False for batch processing)
        capacity, source = self.get_venue_capacity(venue_name, city, country, verbose=False)
        
        result = {
            'venue_name': venue_name,
            'city': city,
            'country': country,
            'capacity': capacity,
            'source': source,
            'search_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        }
        results.append(result)
        
        # Show individual results
        if capacity:
            print(f"✅ {venue_name} ({city}): {capacity:,} ({source})")
        else:
            print(f"❌ {venue_name} ({city}): No capacity found")
        
        # Progress update every 5 venues
        if (index + 1) % 5 == 0:
            print(f"📊 Progress: {index + 1}/{len(venues_df)} venues processed")
        
        # Be respectful with requests
        if index < len(venues_df) - 1:  # Don't sleep after the last venue
            time.sleep(delay_between_venues)
    
    end_time = datetime.now()
    duration = end_time - start_time
    
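    # Create results DataFrame
    results_df = pd.DataFrame(results)
    if results_df.empty:
        # Nothing was processed; skip the summary to avoid a KeyError and division by zero
        print("⚠️  No venues to process.")
        return results_df
    
    # Summary statistics
    found_capacities = results_df['capacity'].notna().sum()
    success_rate = (found_capacities / len(results_df)) * 100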
    
    print(f"\n📊 Batch Search Summary:")
    print(f"   Total venues searched: {len(results_df)}")
    print(f"   Capacities found: {found_capacities}")
    print(f"   Success rate: {success_rate:.1f}%")
    print(f"   Total time: {duration}")
    print(f"   Average time per venue: {duration.total_seconds() / len(results_df):.1f}s")
    
    return results_df

def save_capacity_results(self, results_df, filename=None):
    """Save capacity results to CSV file."""
    if filename is None:
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"venue_capacities_{timestamp}.csv"
    
    try:
        results_df.to_csv(filename, index=False)
        print(f"✅ Results saved to {filename}")
        return filename
    except Exception as e:
        print(f"❌ Error saving results: {e}")
        return None

# Add methods to the class
VenueCapacityFinder.batch_find_capacities = batch_find_capacities
VenueCapacityFinder.save_capacity_results = save_capacity_results

print("✅ Batch processing functions added")
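
The batch loop combines three small ideas: an optional cap on the number of venues, a positional progress counter, and a success rate computed at the end. A dependency-free sketch of those pieces is given below, using plain dictionaries instead of a DataFrame. The names `run_batch` and `fake_lookup` and the venue list are invented for illustration, and the delay between requests is left out.

```python
def run_batch(venues, find_capacity, limit=None, progress_every=5):
    """Look up each venue dict and return (results, success_rate_percent)."""
    selected = venues[:limit] if limit else venues
    results = []
    for position, venue in enumerate(selected, start=1):
        capacity = find_capacity(venue["name"], venue.get("city"))
        results.append({**venue, "capacity": capacity})
        if position % progress_every == 0:
            print(f"Progress: {position}/{len(selected)} venues processed")
    found = sum(1 for row in results if row["capacity"] is not None)
    # Guard the division so an empty input reports 0% instead of raising.
    success_rate = (found / len(results)) * 100 if results else 0.0
    return results, success_rate

def fake_lookup(name, city):
    """Stand-in for get_venue_capacity: only names starting with 'A' are 'found'."""
    return 1200 if name.startswith("A") else None

venues = [{"name": f"{prefix} Hall", "city": "Springfield"} for prefix in "ABABAB"]
results, success_rate = run_batch(venues, fake_lookup, limit=5)
print(len(results), f"{success_rate:.1f}%")
```

Counting positions with `enumerate` rather than relying on row labels keeps the progress output correct for any input order or subset.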

## 📈 7. Data Analysis and Visualization

Analyze capacity results, generate statistics, and create visualizations.

In [None]:
# Analysis and visualization functions

def show_capacity_statistics(self, results_df, show_plots=True):
    """Show comprehensive statistics about found capacities with visualizations."""
    print("\n📊 Capacity Statistics")
    print("=" * 40)
    
    # Filter venues with capacity
    with_capacity = results_df[results_df['capacity'].notna()]
    
    if len(with_capacity) > 0:
        print(f"Venues with capacity found: {len(with_capacity)}")
        print(f"Average capacity: {with_capacity['capacity'].mean():,.0f}")
        print(f"Median capacity: {with_capacity['capacity'].median():,.0f}")
        print(f"Largest venue: {with_capacity['capacity'].max():,.0f}")
        print(f"Smallest venue: {with_capacity['capacity'].min():,.0f}")
        print(f"Standard deviation: {with_capacity['capacity'].std():,.0f}")
        
        # Top 10 largest venues
        print(f"\n🏟️  Top 10 Largest Venues:")
        top_venues = with_capacity.nlargest(10, 'capacity')
        for i, (_, row) in enumerate(top_venues.iterrows(), 1):
            # The capacity column is float when some lookups failed; cast so it prints without ".0"
            print(f"   {i:2d}. {row['venue_name']} ({row['city']}): {int(row['capacity']):,}")
        
        # Sources breakdown
        print(f"\n📚 Sources Breakdown:")
        source_counts = with_capacity['source'].value_counts()
        for source, count in source_counts.items():
            percentage = (count / len(with_capacity)) * 100
            print(f"   • {source}: {count} venues ({percentage:.1f}%)")
        
        # Create visualizations if requested
        if show_plots and len(with_capacity) > 1:
            fig, axes = plt.subplots(2, 2, figsize=(15, 12))
            
            # 1. Capacity distribution histogram
            axes[0, 0].hist(with_capacity['capacity'], bins=20, edgecolor='black', alpha=0.7)
            axes[0, 0].set_title('Venue Capacity Distribution')
            axes[0, 0].set_xlabel('Capacity')
            axes[0, 0].set_ylabel('Number of Venues')
            
            # 2. Sources pie chart
            source_counts.plot(kind='pie', ax=axes[0, 1], autopct='%1.1f%%')
            axes[0, 1].set_title('Data Sources Distribution')
            axes[0, 1].set_ylabel('')
            
            # 3. Top 15 venues bar chart
            top_15 = with_capacity.nlargest(15, 'capacity')
            y_pos = np.arange(len(top_15))
            axes[1, 0].barh(y_pos, top_15['capacity'])
            axes[1, 0].set_yticks(y_pos)
            axes[1, 0].set_yticklabels([f"{row['venue_name'][:20]}..." if len(row['venue_name']) > 20 
                                       else row['venue_name'] for _, row in top_15.iterrows()])
            axes[1, 0].set_title('Top 15 Venues by Capacity')
            axes[1, 0].set_xlabel('Capacity')
            
            # 4. Capacity ranges
            capacity_ranges = pd.cut(with_capacity['capacity'], 
                                   bins=[0, 1000, 5000, 10000, 20000, 50000, 100000, float('inf')],
                                   labels=['<1K', '1K-5K', '5K-10K', '10K-20K', '20K-50K', '50K-100K', '>100K'])
            range_counts = capacity_ranges.value_counts().sort_index()
            range_counts.plot(kind='bar', ax=axes[1, 1], rot=45)
            axes[1, 1].set_title('Venue Capacity Ranges')
            axes[1, 1].set_xlabel('Capacity Range')
            axes[1, 1].set_ylabel('Number of Venues')
            
            plt.tight_layout()
            plt.show()
    else:
        print("No capacity information found.")

def compare_search_methods(self, venues_sample, max_venues=5):
    """Compare the effectiveness of different search methods."""
    print(f"\n🔬 Comparing Search Methods ({min(max_venues, len(venues_sample))} venues)")
    print("=" * 60)
    
    methods = {
        'Wikipedia': [],
        'Additional Sources': [],
        'Google Search': [],
        'Combined': []
    }
    
    venues_tested = venues_sample.head(max_venues)
    
    for _, venue in venues_tested.iterrows():
        venue_name = venue['name'] if 'name' in venue else venue.get('VENUE_NAME', 'Unknown')
        city = venue.get('city') or venue.get('CITY')
        
        print(f"\n🏟️  Testing: {venue_name}")
        
        # Test individual methods
        wiki_result = self.search_venue_capacity_wikipedia(venue_name, city)
        additional_result = self.search_venue_capacity_additional_sources(venue_name, city)
        google_result = self.search_venue_capacity_google(venue_name, city)
        combined_result = self.get_venue_capacity(venue_name, city, verbose=False)
        
        methods['Wikipedia'].append(wiki_result[0] is not None)
        methods['Additional Sources'].append(additional_result[0] is not None)
        methods['Google Search'].append(google_result[0] is not None)
        methods['Combined'].append(combined_result[0] is not None)
        
        # Show results
        for method_name, result in [('Wikipedia', wiki_result), ('Additional', additional_result), 
                                  ('Google', google_result), ('Combined', combined_result)]:
            if result[0]:
                print(f"   ✅ {method_name}: {result[0]:,}")
            else:
                print(f"   ❌ {method_name}: No result")
    
    # Calculate success rates
    print(f"\n📊 Success Rate Summary:")
    print("-" * 30)
    for method, results in methods.items():
        success_rate = (sum(results) / len(results)) * 100 if results else 0
        successful = sum(results)
        total = len(results)
        print(f"{method:18}: {successful}/{total} ({success_rate:.1f}%)")

# Add methods to the class
VenueCapacityFinder.show_capacity_statistics = show_capacity_statistics
VenueCapacityFinder.compare_search_methods = compare_search_methods

print("✅ Analysis and visualization functions added")
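
The capacity-range chart groups venues with `pd.cut`, whose default intervals are closed on the right, so a venue with exactly 1,000 seats lands in `<1K`. The same binning can be sketched with the standard library to see where boundary values fall. The helper name `capacity_range` and the sample capacities are made up; the edges and labels are the ones used in `show_capacity_statistics`.

```python
from bisect import bisect_left
from collections import Counter

EDGES = [1000, 5000, 10000, 20000, 50000, 100000]
LABELS = ["<1K", "1K-5K", "5K-10K", "10K-20K", "20K-50K", "50K-100K", ">100K"]

def capacity_range(capacity):
    """Map a positive capacity to its label, using right-closed intervals."""
    # bisect_left returns the index of the first edge >= capacity,
    # so a value equal to an edge stays in the lower bin.
    return LABELS[bisect_left(EDGES, capacity)]

sample_capacities = [800, 1000, 1001, 19812, 50000, 150000]  # made-up values
range_counts = Counter(capacity_range(c) for c in sample_capacities)
for label in LABELS:
    print(f"{label:>9}: {range_counts.get(label, 0)}")
```

If boundary values should instead fall into the upper bin, `pd.cut` accepts `right=False`, and the sketch would use `bisect_right`.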

## 💾 8. Database Update Operations

Save capacity results back to the database with proper schema management.

In [None]:
# Database update operations

def update_database_with_capacities(self, results_df):
    """Update VENUES table with capacity information."""
    print("🔄 Updating database with capacity information...")
    
    if not self.create_connection():
        return False
    
    try:
        with self.engine.connect() as conn:
            # Add any missing capacity columns; each column is checked separately
            # so a partially migrated table does not cause a duplicate-column error
            for column, ddl in [('capacity', 'INT NULL'),
                                ('capacity_source', 'VARCHAR(100) NULL'),
                                ('capacity_updated', 'DATETIME NULL')]:
                result = conn.execute(text(f"SHOW COLUMNS FROM VENUES LIKE '{column}'"))
                if not result.fetchone():
                    conn.execute(text(f"ALTER TABLE VENUES ADD COLUMN {column} {ddl}"))
                    print(f"✅ Added {column} column to VENUES table")
            conn.commit()
            
            # Update venues with capacity information
            updates_made = 0
            errors = 0
            
            for _, row in results_df.iterrows():
                if pd.notna(row['capacity']):
                    try:
                        result = conn.execute(text("""
                            UPDATE VENUES 
                            SET capacity = :capacity, capacity_source = :source, capacity_updated = :updated
                            WHERE VENUE_NAME = :name AND CITY = :city
                        """), {
                            'capacity': int(row['capacity']),
                            'source': row['source'],
                            'updated': row['search_date'],
                            'name': row['venue_name'],
                            'city': row['city']
                        })
                        
                        if result.rowcount > 0:
                            updates_made += 1
                    except Exception as e:
                        errors += 1
                        print(f"⚠️  Error updating {row['venue_name']}: {str(e)[:50]}")
            
            conn.commit()
            print(f"✅ Updated {updates_made} venues with capacity information")
            if errors > 0:
                print(f"⚠️  {errors} errors occurred during updates")
            
        return True
        
    except Exception as e:
        print(f"❌ Error updating database: {e}")
        return False
    finally:
        self.close_connection()

def get_database_capacity_stats(self):
    """Get statistics about capacities currently in the database."""
    print("📊 Getting database capacity statistics...")
    
    if not self.create_connection():
        return None
    
    try:
        with self.engine.connect() as conn:
            # Check if capacity column exists
            result = conn.execute(text("SHOW COLUMNS FROM VENUES LIKE 'capacity'"))
            if not result.fetchone():
                print("❌ Capacity column doesn't exist in database yet")
                return None
            
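            # Get capacity statistics. total_venues counts every row; the other
            # aggregates only consider rows with a positive capacity.
            stats_query = """
                SELECT 
                    COUNT(*) as total_venues,
                    COUNT(CASE WHEN capacity > 0 THEN 1 END) as venues_with_capacity,
                    AVG(CASE WHEN capacity > 0 THEN capacity END) as avg_capacity,
                    MIN(CASE WHEN capacity > 0 THEN capacity END) as min_capacity,
                    MAX(CASE WHEN capacity > 0 THEN capacity END) as max_capacity,
                    STD(CASE WHEN capacity > 0 THEN capacity END) as std_capacity
                FROM VENUES
            """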
            
            # Reuse the connection opened above instead of checking out a new one
            stats_df = pd.read_sql(text(stats_query), conn)
            
            # Get top venues
            top_venues_query = """
                SELECT VENUE_NAME, CITY, capacity, capacity_source, capacity_updated
                FROM VENUES 
                WHERE capacity IS NOT NULL AND capacity > 0
                ORDER BY capacity DESC 
                LIMIT 10
            """
            
            # Reuse the connection opened above instead of checking out a new one
            top_venues_df = pd.read_sql(text(top_venues_query), conn)
            
            # Display results
            if not stats_df.empty and stats_df.iloc[0]['venues_with_capacity'] > 0:
                stats = stats_df.iloc[0]
                print(f"✅ Database Capacity Statistics:")
                print(f"   Total venues with capacity: {stats['venues_with_capacity']:,.0f}")
                print(f"   Average capacity: {stats['avg_capacity']:,.0f}")
                print(f"   Largest venue: {stats['max_capacity']:,.0f}")
                print(f"   Smallest venue: {stats['min_capacity']:,.0f}")
                print(f"   Standard deviation: {stats['std_capacity']:,.0f}")
                
                print(f"\n🏟️  Top 10 Venues in Database:")
                for i, (_, row) in enumerate(top_venues_df.iterrows(), 1):
                    print(f"   {i:2d}. {row['VENUE_NAME']} ({row['CITY']}): {row['capacity']:,} ({row['capacity_source']})")
            else:
                print("❌ No capacity data found in database")
            
            return stats_df
        
    except Exception as e:
        print(f"❌ Error getting database stats: {e}")
        return None
    finally:
        self.close_connection()

# Add methods to the class
VenueCapacityFinder.update_database_with_capacities = update_database_with_capacities
VenueCapacityFinder.get_database_capacity_stats = get_database_capacity_stats

print("✅ Database update operations added")

## 🚀 Interactive Examples

Now let's put it all together and run some practical examples!

In [None]:
# 🔍 Example 1: Search venues from database

print("🏟️  Searching for venues in database...")
print("=" * 50)

# Search for venues (try different search terms)
search_term = "Center"  # Change this to search for specific venues

venues_df = finder.search_venues_in_database(search_term)

if not venues_df.empty:
    print(f"\n📋 Found {len(venues_df)} venues matching '{search_term}':")
    print(venues_df.head(10).to_string(index=False))
    
    # Show total venues in database
    all_venues = finder.search_venues_in_database()
    print(f"\n📊 Total venues in database: {len(all_venues)}")
else:
    print("❌ No venues found matching your search")
    
    # Try to get all venues
    print("\n🔄 Trying to get all venues...")
    all_venues = finder.search_venues_in_database()
    if not all_venues.empty:
        print(f"✅ Total venues in database: {len(all_venues)}")
        print("\n📋 Sample venues:")
        print(all_venues.head(10).to_string(index=False))

In [None]:
# 📊 Example 2: Batch capacity search

print("🚀 Running batch capacity search...")
print("=" * 50)

# Get venues without capacity (or a sample of all venues)
try:
    venues_without_capacity = finder.get_venues_without_capacity()
    
    if venues_without_capacity.empty:
        print("✅ All venues already have capacity data!")
        # Use a sample of all venues for demonstration
        all_venues = finder.search_venues_in_database()
        if not all_venues.empty:
            venues_to_search = all_venues.head(3)  # Just 3 for demonstration
            print(f"🔄 Using sample of {len(venues_to_search)} venues for demonstration")
        else:
            print("❌ No venues found in database")
            venues_to_search = pd.DataFrame()
    else:
        # Use first 5 venues without capacity
        venues_to_search = venues_without_capacity.head(5)
        print(f"📋 Found {len(venues_without_capacity)} venues without capacity")
        print(f"🔄 Processing first {len(venues_to_search)} venues...")
    
    if not venues_to_search.empty:
        # Run batch search
        results_df = finder.batch_find_capacities(venues_to_search, delay_between_venues=1)
        
        # Show results
        print(f"\n📋 Capacity Search Results:")
        print(results_df[['venue_name', 'city', 'capacity', 'source']].to_string(index=False))
        
        # Show statistics
        finder.show_capacity_statistics(results_df, show_plots=True)
        
        # Option to save results
        save_results = True  # Change to False if you don't want to save
        if save_results:
            filename = finder.save_capacity_results(results_df)
            print(f"💾 Results saved to: {filename}")
    
except Exception as e:
    print(f"❌ Error in batch search: {e}")
    print("This might be due to database connection issues or a missing VENUES table")

In [None]:
# 🔬 Example 3: Compare search methods

print("🧪 Comparing search method effectiveness...")
print("=" * 50)

# Create sample venues for testing (famous venues with known capacities)
sample_venues = pd.DataFrame([
    {'name': 'Madison Square Garden', 'city': 'New York'},
    {'name': 'Staples Center', 'city': 'Los Angeles'},  # renamed Crypto.com Arena in 2021
    {'name': 'Rogers Centre', 'city': 'Toronto'},
    {'name': 'Wembley Stadium', 'city': 'London'},
    {'name': 'Bell Centre', 'city': 'Montreal'}
])

print(f"Testing with {len(sample_venues)} famous venues:")
for _, venue in sample_venues.iterrows():
    print(f"   • {venue['name']} ({venue['city']})")

# Compare methods
finder.compare_search_methods(sample_venues, max_venues=3)  # Test first 3 to save time

print(f"\n📊 Cache Status: {len(finder.capacity_cache)} venues cached")

In [None]:
# 💾 Example 4: Database operations

print("📊 Database operations example...")
print("=" * 50)

try:
    # Get current database stats
    print("Getting current database capacity statistics...")
    db_stats = finder.get_database_capacity_stats()
    
    # If we have results from previous batch search, we can update the database
    if 'results_df' in locals() and not results_df.empty:
        print(f"\n🔄 Would you like to update the database with {len(results_df)} capacity results?")
        print("(This is just an example - set update_database = True to actually update)")
        
        update_database = False  # Set to True if you want to actually update
        
        if update_database:
            success = finder.update_database_with_capacities(results_df)
            if success:
                print("✅ Database updated successfully!")
                # Get updated stats
                print("\n📊 Updated database statistics:")
                finder.get_database_capacity_stats()
            else:
                print("❌ Database update failed")
        else:
            print("ℹ️  Database update skipped (set update_database = True to enable)")
    else:
        print("ℹ️  No search results available to update database")
        print("   Run the batch search example first to get results")

except Exception as e:
    print(f"❌ Error in database operations: {e}")
    print("This might be due to database connection issues")

## 🎉 Summary and Next Steps

Congratulations! You now have a fully functional venue capacity finder with:

### ✅ Features Implemented:
- **Multi-source web scraping** (Wikipedia, specialized sources, Google search)
- **Smart caching system** to avoid duplicate requests
- **Batch processing** with progress tracking
- **Data analysis and visualization** 
- **Database integration** with automatic schema updates
- **Respectful rate limiting** to avoid overwhelming servers

### 🚀 Next Steps:
1. **Customize search sources**: Add more specialized venue databases
2. **Improve accuracy**: Add more sophisticated pattern matching
3. **Add monitoring**: Track success rates over time
4. **Scale up**: Use multi-threading for faster batch processing
5. **Add validation**: Cross-reference results from multiple sources
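
As a rough illustration of step 5, the sketch below cross-references the capacities returned by several sources: it takes the median as the consensus value and reports which sources fall within a tolerance of it. The function name `consensus_capacity`, the 10% tolerance, and the sample numbers are all assumptions made for this example, not part of the notebook.

```python
from statistics import median

def consensus_capacity(candidates, tolerance=0.10):
    """Return (capacity, agreeing_sources), or (None, []) when nothing was found.

    `candidates` maps a source label to a capacity (or None if that source
    returned nothing).
    """
    found = {src: cap for src, cap in candidates.items() if cap is not None}
    if not found:
        return None, []
    mid = median(found.values())
    # A source "agrees" when it is within `tolerance` (relative) of the median.
    agreeing = [src for src, cap in found.items()
                if abs(cap - mid) <= tolerance * mid]
    return int(mid), agreeing

# Made-up numbers: two sources roughly agree, one looks like a parsing error.
capacity, sources = consensus_capacity(
    {"Wikipedia": 20000, "Additional": 19500, "Google": 2000}
)
```

A venue where fewer than two sources agree could then be flagged for manual review instead of being written to the database.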

### 💡 Usage Tips:
- Always test with a small batch first
- Monitor success rates and adjust search patterns as needed
- Cached results persist only for the current session
- Be respectful of rate limits when scraping
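
Because the cache lives only in memory, one option is to write it to disk between sessions. The sketch below assumes the cache is a plain dict mapping a string key to a `(capacity, source)` tuple, which is how the final cell of this notebook reads it; `save_cache`, `load_cache`, the file name, and the sample entries are hypothetical.

```python
import json
import os
import tempfile

def save_cache(cache, path):
    """Write a {key: (capacity, source)} cache to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # JSON has no tuple type, so store each value as a two-element list.
        json.dump({k: list(v) for k, v in cache.items()}, f, indent=2)

def load_cache(path):
    """Read the cache back, restoring tuples; return {} if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return {k: tuple(v) for k, v in json.load(f).items()}

# Round trip with made-up entries (the key format is an assumption).
demo_cache = {"Example Hall_Springfield": (1200, "Wikipedia"),
              "Unknown Club_Nowhere": (None, None)}
cache_path = os.path.join(tempfile.mkdtemp(), "capacity_cache.json")
save_cache(demo_cache, cache_path)
restored = load_cache(cache_path)
```

At the start of a new session, something like `finder.capacity_cache.update(load_cache("capacity_cache.json"))` would restore earlier results, provided the cache really has this shape.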

### 🔧 Customization Ideas:
- Add venue-specific scrapers for major venues
- Implement fuzzy matching for venue names
- Add capacity validation against known ranges
- Create automated reports and dashboards
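
For the fuzzy-matching idea, the standard library's `difflib` is often enough as a first pass. The following is a minimal sketch, not the notebook's implementation: `normalize`, `best_venue_match`, and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches

def normalize(name):
    """Lowercase and drop a leading 'the ' so trivial differences don't matter."""
    name = name.lower().strip()
    return name[4:] if name.startswith("the ") else name

def best_venue_match(query, known_names, cutoff=0.8):
    """Return the closest known venue name, or None if nothing clears the cutoff."""
    # Map normalized names back to their original spelling.
    lookup = {normalize(n): n for n in known_names}
    hits = get_close_matches(normalize(query), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

known = ["Madison Square Garden", "Rogers Centre", "Bell Centre"]
match = best_venue_match("Madison Sq. Garden", known)  # abbreviated spelling
miss = best_venue_match("Wembley Stadium", known)      # not in the known list
```

A lower cutoff catches more misspellings but also raises the risk of matching two different venues, so it is worth tuning against a sample of real venue names.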

In [None]:
# 🧹 Cleanup and Final Status

print("🏟️  Venue Capacity Finder - Session Summary")
print("=" * 50)

print(f"📊 Cache Status: {len(finder.capacity_cache)} venues cached")
print(f"🌐 Session active: {finder.session is not None}")

# Display cache contents if any
if finder.capacity_cache:
    print(f"\n📋 Cached Results:")
    for i, (venue_key, (capacity, source)) in enumerate(list(finder.capacity_cache.items())[:5], 1):
        # str.split always returns at least one element, so fall back on an empty name instead
        venue_name = venue_key.split('_')[0] or 'Unknown'
        if capacity:
            print(f"   {i}. {venue_name}: {capacity:,} ({source})")
        else:
            print(f"   {i}. {venue_name}: No capacity found")
    
    if len(finder.capacity_cache) > 5:
        print(f"   ... and {len(finder.capacity_cache) - 5} more venues")

print(f"\n🚀 System ready for additional searches!")
print(f"💡 Tip: Results are cached for the session to improve performance")

# Optionally close connections
# finder.close_connection()  # Uncomment to close all connections