# Event and Venue Data Extractor

This notebook extracts unique event and venue data from the CONCERT_SEATS table and inserts it into the EVENTS and VENUES tables.

## Features:
- ✅ Extract unique venues with city and country data
- ✅ Extract unique events with all related information
- ✅ Duplicate prevention using INSERT IGNORE
- ✅ Data quality checks and validation
- ✅ Batch processing for large datasets
- ✅ Progress tracking and statistics

## Table of Contents:
1. [Setup & Database Connection](#setup--database-connection)
2. [Venue Data Extraction](#venue-data-extraction)
3. [Event Data Extraction](#event-data-extraction)
4. [Complete Extraction Process](#complete-extraction-process)
5. [Statistics & Verification](#statistics--verification)
6. [Database Cleanup](#database-cleanup)
7. [Summary](#summary)

## Setup & Database Connection

### Import Required Libraries and Load Database Configuration

In [1]:
# Import required libraries
import json
import mysql.connector
from mysql.connector import Error
import pandas as pd
from datetime import datetime
import time
import sys

print("✅ All libraries imported successfully!")
print(f"📅 Current date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ All libraries imported successfully!
📅 Current date: 2025-07-20 15:08:46


In [2]:
# Load database configuration
try:
    with open('db.json', 'r') as f:
        db_config = json.load(f)
    
    print("✅ Database configuration loaded:")
    print(f"   Host: {db_config['host']}")
    print(f"   User: {db_config['user']}")
    print(f"   Database: {db_config['database']}")
    print("   Password: [HIDDEN]")
except FileNotFoundError:
    print("❌ db.json file not found. Please ensure it exists in the current directory.")
    db_config = None
except Exception as e:
    print(f"❌ Error loading database configuration: {e}")
    db_config = None

✅ Database configuration loaded:
   Host: 192.168.68.74
   User: root
   Database: concert
   Password: [HIDDEN]
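
Before connecting, it can help to confirm that the loaded configuration contains every key the connection needs. The following is a minimal sketch: `missing_config_keys` is a hypothetical helper, and the required key names are an assumption based on the fields printed above plus a `password` entry.

```python
# Hypothetical helper: report which expected keys are absent from db.json.
# The key names are an assumption; adjust them to match your own file.
REQUIRED_KEYS = ("host", "user", "password", "database")

def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are missing or empty in `config`."""
    if not isinstance(config, dict):
        return list(required)
    return [key for key in required if not config.get(key)]

# Illustrative configuration with the password entry left out.
example_config = {"host": "localhost", "user": "reader", "database": "concert"}
print(missing_config_keys(example_config))  # ['password']
```

An empty result means the configuration can be passed on to the connection step; a non-empty one names what to fix first.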


In [3]:
# Establish database connection
connection = None
cursor = None

try:
    connection = mysql.connector.connect(**db_config)

    if connection.is_connected():
        # The `server_info` property replaces the deprecated get_server_info() method
        db_info = connection.server_info
        print(f"✅ Connected to MySQL Server version {db_info}")

        cursor = connection.cursor()

        # Get database name
        cursor.execute("SELECT DATABASE();")
        record = cursor.fetchone()
        print(f"✅ Connected to database: {record[0]}")

except Error as e:
    print(f"❌ Database connection failed: {e}")
    connection = None
    cursor = None

✅ Connected to MySQL Server version 5.5.5-10.6.21-MariaDB-ubu2004-log
✅ Connected to database: concert


### Database Utility Functions

In [4]:
def close_connection():
    """Close database connection properly."""
    global connection, cursor

    try:
        if cursor:
            cursor.close()
            print("✓ Cursor closed.")

        if connection and connection.is_connected():
            connection.close()
            print("✓ MySQL connection closed.")
    except Error as e:
        print(f"❌ Error closing connection: {e}")

def reconnect_if_needed():
    """Reconnect to database if connection is lost."""
    global connection, cursor

    if not connection or not connection.is_connected():
        print("🔄 Reconnecting to database...")
        try:
            connection = mysql.connector.connect(**db_config)
            cursor = connection.cursor()
            print("✅ Reconnected successfully!")
            return True
        except Error as e:
            print(f"❌ Reconnection failed: {e}")
            return False
    return True

print("✅ Database utility functions defined!")

✅ Database utility functions defined!
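
`reconnect_if_needed()` makes a single reconnection attempt. Where transient network failures are expected, a retry with exponential backoff is one possible extension. The sketch below is not part of the notebook: `call_with_retries` and `flaky_connect` are hypothetical names, and the flaky function stands in for a real connect call.

```python
import time

def call_with_retries(func, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func(); on failure wait base_delay * 2**n seconds, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for a flaky connect call: fails twice, then succeeds.
calls = {"count": 0}

def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "connected"

result = call_with_retries(flaky_connect, attempts=3, base_delay=0.0)
print(result, calls["count"])  # connected 3
```

In the notebook, the callable could wrap `mysql.connector.connect(**db_config)` and `exceptions` could be narrowed to `(Error,)`.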


## Venue Data Extraction

### Extract Unique Venues from CONCERT_SEATS Table

In [5]:
# Extract unique venue data from CONCERT_SEATS table
print("🏟️ Extracting venue data from CONCERT_SEATS table...")

venue_query = """
    SELECT DISTINCT 
        venue as name,
        city,
        countryName as country
    FROM CONCERT_SEATS 
    WHERE venue IS NOT NULL 
        AND venue != '' 
        AND venue != 'None'
        AND venue != 'NULL'
    ORDER BY venue
"""

try:
    reconnect_if_needed()
    venues_df = pd.read_sql(venue_query, connection)
    print(f"✅ Extracted {len(venues_df)} unique venues")

    # Display sample venues
    print("\n📋 Sample venues:")
    display(venues_df.head(10))

except Exception as e:
    # pandas wraps driver errors in its own DatabaseError, so catch broadly here
    print(f"❌ Error extracting venue data: {e}")
    venues_df = pd.DataFrame()

🏟️ Extracting venue data from CONCERT_SEATS table...
✅ Extracted 2017 unique venues

📋 Sample venues:

| | name | city | country |
|---|---|---|---|
| 0 | 10 Mile Music Hall | Frisco | USA |
| 1 | 1015 Folsom | San Francisco | USA |
| 2 | 11:11 EPTX | El Paso | USA |
| 3 | 1614 Drinks-Music-Billiards / AfterHours Tavern | High Point | USA |
| 4 | 1902 Nightclub | San Antonio | USA |
| 5 | 191 Toole | Tucson | USA |
| 6 | 3 Dollar Bill | Brooklyn | USA |
| 7 | 3rd and Lindsley Bar & Grill | Nashville | USA |
| 8 | 45 East | Portland | USA |
| 9 | 4th Street Live | Louisville | USA |
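
The `WHERE` clause in the venue query drops NULLs and the placeholder strings `''`, `'None'`, and `'NULL'`. The same rule can be applied on the Python side, for example to other columns after extraction. This is a small sketch with a hypothetical `clean_text` helper; unlike the SQL filter as written, it also strips surrounding whitespace and compares case-insensitively, which is an assumption about what counts as a placeholder.

```python
# Placeholder strings treated as "no value" (compared in lower case).
PLACEHOLDERS = {"", "none", "null"}

def clean_text(value):
    """Return a stripped string, or None if the value is missing or a placeholder."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in PLACEHOLDERS else text

samples = ["1015 Folsom", "  ", "None", "NULL", None, " 45 East "]
cleaned = [clean_text(s) for s in samples]
print(cleaned)  # ['1015 Folsom', None, None, None, None, '45 East']
```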


In [6]:
# Check for data quality issues in venues
if not venues_df.empty:
    print("🔍 Checking venue data quality...")

    # Check for venues with multiple cities (potential data quality issues)
    venue_cities = venues_df.groupby('name').agg({
        'city': lambda x: list(set(x.dropna())),
        'country': lambda x: list(set(x.dropna()))
    }).reset_index()

    duplicates_with_diff_cities = venue_cities[
        venue_cities['city'].apply(len) > 1
    ]

    if len(duplicates_with_diff_cities) > 0:
        print(f"⚠️  Found {len(duplicates_with_diff_cities)} venues with multiple cities:")
        for _, row in duplicates_with_diff_cities.head().iterrows():
            print(f"   • {row['name']}: {row['city']}")
    else:
        print("✅ No venues with conflicting city data found")

    # Show venue statistics
    print(f"\n📊 Venue Statistics:")
    print(f"   Total unique venues: {len(venues_df)}")
    print(f"   Countries represented: {venues_df['country'].nunique()}")
    print(f"   Cities represented: {venues_df['city'].nunique()}")

    # Top countries by venue count
    top_countries = venues_df['country'].value_counts().head(5)
    print(f"\n🌍 Top countries by venue count:")
    for country, count in top_countries.items():
        print(f"   • {country}: {count} venues")

🔍 Checking venue data quality...
⚠️  Found 2 venues with multiple cities:
   • Budweiser Stage at Ontario Place - Complex: ['London', 'Toronto']
   • The Dome: ['Halifax', 'Virginia Beach']

📊 Venue Statistics:
   Total unique venues: 2017
   Countries represented: 11
   Cities represented: 694

🌍 Top countries by venue count:
   • USA: 1743 venues
   • Canada: 261 venues
   • France: 2 venues
   • Mexico: 2 venues
   • UK: 2 venues


In [7]:
# Insert venue data into VENUES table
print("🔄 Inserting venue data into VENUES table...")

if not venues_df.empty:
    try:
        reconnect_if_needed()

        # Prepare venue data for insertion
        venue_data = []
        for _, row in venues_df.iterrows():
            venue_data.append((
                row['name'] if pd.notna(row['name']) else None,
                row['city'] if pd.notna(row['city']) else None,
                row['country'] if pd.notna(row['country']) else None
            ))

        # Insert venues using INSERT IGNORE to avoid duplicates
        venue_insert_query = """
            INSERT IGNORE INTO VENUES (name, city, country) 
            VALUES (%s, %s, %s)
        """

        batch_size = 1000
        total_inserted = 0

        for i in range(0, len(venue_data), batch_size):
            batch = venue_data[i:i + batch_size]
            cursor.executemany(venue_insert_query, batch)
            connection.commit()
            total_inserted += cursor.rowcount
            print(f"   Processed {min(i + batch_size, len(venue_data))}/{len(venue_data)} venues...")

        print(f"✅ Successfully inserted {total_inserted} new venues into VENUES table")

        # Verify insertion
        cursor.execute("SELECT COUNT(*) FROM VENUES")
        total_venues = cursor.fetchone()[0]
        print(f"📊 Total venues in VENUES table: {total_venues}")

    except Error as e:
        print(f"❌ Error inserting venue data: {e}")
        if connection:
            connection.rollback()
else:
    print("❌ No venue data to insert")

🔄 Inserting venue data into VENUES table...
   Processed 1000/2017 venues...
   Processed 2000/2017 venues...
   Processed 2017/2017 venues...
✅ Successfully inserted 245 new venues into VENUES table
📊 Total venues in VENUES table: 2015
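
The run above extracted 2017 rows but left 2015 rows in `VENUES`. That is consistent with (though it does not prove) a unique key on the venue name alone, since two venue names appeared with two different cities. The sketch below reproduces the effect with SQLite from the standard library. The schema and the rows are made up for illustration, and SQLite spells the statement `INSERT OR IGNORE` rather than MySQL's `INSERT IGNORE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the venue name alone is the unique key.
conn.execute("CREATE TABLE VENUES (name TEXT PRIMARY KEY, city TEXT, country TEXT)")

# Made-up rows: the first two share a name but differ in city.
rows = [
    ("Example Hall", "City A", "Country X"),
    ("Example Hall", "City B", "Country X"),
    ("Other Club", "City C", "Country X"),
]
# Conflicting rows are skipped silently instead of raising an error.
conn.executemany(
    "INSERT OR IGNORE INTO VENUES (name, city, country) VALUES (?, ?, ?)", rows
)
conn.commit()

stored = conn.execute("SELECT name, city FROM VENUES ORDER BY name").fetchall()
print(len(rows), "submitted,", len(stored), "stored")  # 3 submitted, 2 stored
print(stored)
conn.close()
```

If venues that share a name across cities should be kept apart, a composite unique key such as `(name, city)` is one option to consider.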


## Event Data Extraction

### Extract Unique Events from CONCERT_SEATS Table

In [8]:
# Extract unique event data from CONCERT_SEATS table
print("🎪 Extracting event data from CONCERT_SEATS table...")

# Note: `venue_id` here holds the venue name from CONCERT_SEATS, not a numeric key
event_query = """
    SELECT DISTINCT 
        eventId as event_id,
        venue as venue_id,
        artist,
        event_name,
        event_date,
        event_type,
        performer_type,
        performer
    FROM CONCERT_SEATS 
    WHERE eventId IS NOT NULL 
        AND eventId != 0
        AND venue IS NOT NULL 
        AND venue != '' 
        AND venue != 'None'
        AND venue != 'NULL'
        AND artist IS NOT NULL 
        AND artist != '' 
        AND artist != 'None'
        AND artist != 'NULL'
    ORDER BY eventId
"""

try:
    reconnect_if_needed()
    events_df = pd.read_sql(event_query, connection)
    print(f"✅ Extracted {len(events_df)} unique events")

    # Display sample events
    print("\n📋 Sample events:")
    display(events_df.head(10))

except Exception as e:
    # pandas wraps driver errors in its own DatabaseError, so catch broadly here
    print(f"❌ Error extracting event data: {e}")
    events_df = pd.DataFrame()

🎪 Extracting event data from CONCERT_SEATS table...
✅ Extracted 6772 unique events

📋 Sample events:

| | event_id | venue_id | artist | event_name | event_date | event_type | performer_type | performer |
|---|---|---|---|---|---|---|---|---|
| 0 | 152891943.0 | Greek Theatre Los Angeles | Omd | OMD | 2025-06-21 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
| 1 | 153021829.0 | Greek Theatre Los Angeles | Omd | OMD | 2025-06-20 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
| 2 | 153112606.0 | The Paramount Huntington | Omd | OMD | 2025-07-06 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
| 3 | 153112667.0 | Keswick Theatre | Omd | OMD | 2025-05-24 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
| 4 | 153112669.0 | Terminal 5 | Omd | OMD | 2025-07-08 | | | |
| 5 | 153112669.0 | Terminal 5 | Omd | OMD | 2025-07-08 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
| 6 | 153112768.0 | Riviera Theatre Chicago | Omd | OMD | 2025-06-28 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
| 7 | 153113241.0 | Balboa Theatre | Omd | OMD | 2025-06-17 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
| 8 | 153113243.0 | Lincoln Theatre DC | Omd | OMD | 2025-05-22 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
| 9 | 153114494.0 | Citizens House of Blues Boston | Omd | OMD | 2025-07-07 | MusicEvent | MusicGroup | Orchestral Manoeuvres In The Dark |
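
In the sample above, event `153112669` appears twice: once with `event_type`, `performer_type`, and `performer` empty and once with them filled in. `SELECT DISTINCT` keeps both rows because they differ in those columns. One way to collapse such pairs before insertion is to keep the most complete row per `event_id`. The helper names below are hypothetical, and the rows are abbreviated from the sample output.

```python
def completeness(row):
    """Count the fields in a row that are not None."""
    return sum(value is not None for value in row.values())

def dedupe_by_event_id(rows):
    """Keep one row per event_id, preferring the row with the most filled fields."""
    best = {}
    for row in rows:
        key = row["event_id"]
        if key not in best or completeness(row) > completeness(best[key]):
            best[key] = row
    return list(best.values())

rows = [
    {"event_id": 153112669, "venue": "Terminal 5", "event_type": None, "performer": None},
    {"event_id": 153112669, "venue": "Terminal 5", "event_type": "MusicEvent",
     "performer": "Orchestral Manoeuvres In The Dark"},
    {"event_id": 153112768, "venue": "Riviera Theatre Chicago", "event_type": "MusicEvent",
     "performer": "Orchestral Manoeuvres In The Dark"},
]
unique_rows = dedupe_by_event_id(rows)
print(len(unique_rows), unique_rows[0]["event_type"])  # 2 MusicEvent
```

Without a step like this, `INSERT IGNORE` on a table keyed by `event_id` would keep whichever of the two rows arrives first.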


In [9]:
# Check event data quality and statistics
if not events_df.empty:
    print("🔍 Analyzing event data quality...")

    # Check for events with missing data
    missing_data_count = events_df.isnull().sum()
    if missing_data_count.any():
        print("\n⚠️  Missing data summary:")
        for col, count in missing_data_count.items():
            if count > 0:
                print(f"   • {col}: {count} missing values ({count/len(events_df)*100:.1f}%)")
    else:
        print("✅ No missing data found in events")

    # Show event statistics
    print(f"\n📊 Event Statistics:")
    print(f"   Total unique events: {len(events_df)}")
    print(f"   Unique artists: {events_df['artist'].nunique()}")
    print(f"   Unique venues: {events_df['venue_id'].nunique()}")
    print(f"   Date range: {events_df['event_date'].min()} to {events_df['event_date'].max()}")

    # Top artists by event count
    top_artists = events_df['artist'].value_counts().head(5)
    print(f"\n🎤 Top artists by event count:")
    for artist, count in top_artists.items():
        print(f"   • {artist}: {count} events")

    # Top venues by event count
    top_venues = events_df['venue_id'].value_counts().head(5)
    print(f"\n🏟️  Top venues by event count:")
    for venue, count in top_venues.items():
        print(f"   • {venue}: {count} events")

🔍 Analyzing event data quality...

⚠️  Missing data summary:
   • event_type: 86 missing values (1.3%)
   • performer_type: 86 missing values (1.3%)
   • performer: 86 missing values (1.3%)

📊 Event Statistics:
   Total unique events: 6772
   Unique artists: 487
   Unique venues: 2014
   Date range: 2025-02-20 to 2026-09-10

🎤 Top artists by event count:
   • Under Oath: 50 events
   • Nao: 49 events
   • Shawn Desman: 45 events
   • Subtronics: 44 events
   • Westend: 44 events

🏟️  Top venues by event count:
   • History: 98 events
   • The Brooklyn Mirage at Avant Gardner - Complex: 61 events
   • Budweiser Stage at Ontario Place - Complex: 58 events
   • Rogers Centre: 56 events
   • Scotiabank Arena: 47 events


In [10]:
# Insert event data into EVENTS table
print("🔄 Inserting event data into EVENTS table...")

if not events_df.empty:
    try:
        reconnect_if_needed()

        # Prepare event data for insertion
        event_data = []
        for _, row in events_df.iterrows():
            event_data.append((
                int(row['event_id']) if pd.notna(row['event_id']) else None,
                row['venue_id'] if pd.notna(row['venue_id']) else None,
                row['artist'] if pd.notna(row['artist']) else None,
                row['event_name'] if pd.notna(row['event_name']) else None,
                row['event_date'] if pd.notna(row['event_date']) else None,
                row['event_type'] if pd.notna(row['event_type']) else None,
                row['performer_type'] if pd.notna(row['performer_type']) else None,
                row['performer'] if pd.notna(row['performer']) else None
            ))

        # Insert events using INSERT IGNORE to avoid duplicates
        event_insert_query = """
            INSERT IGNORE INTO EVENTS (
                event_id, venue_id, artist, event_name, event_date,
                event_type, performer_type, performer
            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """

        batch_size = 1000
        total_inserted = 0

        for i in range(0, len(event_data), batch_size):
            batch = event_data[i:i + batch_size]
            cursor.executemany(event_insert_query, batch)
            connection.commit()
            total_inserted += cursor.rowcount
            print(f"   Processed {min(i + batch_size, len(event_data))}/{len(event_data)} events...")

        print(f"✅ Successfully inserted {total_inserted} new events into EVENTS table")

        # Verify insertion
        cursor.execute("SELECT COUNT(*) FROM EVENTS")
        total_events = cursor.fetchone()[0]
        print(f"📊 Total events in EVENTS table: {total_events}")

    except Error as e:
        print(f"❌ Error inserting event data: {e}")
        if connection:
            connection.rollback()
else:
    print("❌ No event data to insert")

🔄 Inserting event data into EVENTS table...
❌ Error inserting event data: 1054 (42S22): Unknown column 'venue_id' in 'INSERT INTO'
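
Error 1054 means the `EVENTS` table has no column named `venue_id`, so the insert statement and the table schema disagree. Comparing the two before inserting makes this kind of mismatch visible early. The sketch below uses a hypothetical `missing_columns` helper. The `actual_columns` list is made up for illustration, because the real `EVENTS` schema is not shown in this notebook.

```python
def missing_columns(expected, actual):
    """Return the expected column names absent from `actual` (case-insensitive)."""
    actual_lower = {name.lower() for name in actual}
    return [name for name in expected if name.lower() not in actual_lower]

insert_columns = ["event_id", "venue_id", "artist", "event_name", "event_date",
                  "event_type", "performer_type", "performer"]

# Against MySQL, the actual names could be read with, for example:
#   cursor.execute("SHOW COLUMNS FROM EVENTS")
#   actual_columns = [row[0] for row in cursor.fetchall()]
# The list below is invented for this example only.
actual_columns = ["event_id", "venue", "artist", "event_name", "event_date",
                  "event_type", "performer_type", "performer"]

print(missing_columns(insert_columns, actual_columns))  # ['venue_id']
```

Once the real column name is known, the insert statement (or the table) can be adjusted so that both agree.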


## Complete Extraction Process

### Run Full Extraction with Timing

In [11]:
# Complete extraction process with timing
def run_complete_extraction():
    """Run the complete extraction and insertion process with timing."""

    start_time = datetime.now()
    print("🚀 Starting Complete Event and Venue Data Extraction Process")
    print(f"📅 Started at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print("="*60)

    extraction_stats = {
        'venues_extracted': 0,
        'venues_inserted': 0,
        'events_extracted': 0,
        'events_inserted': 0,
        'success': True
    }

    try:
        # Ensure connection is active
        if not reconnect_if_needed():
            return False

        # Extract and insert venue data
        print("\n" + "="*25 + " VENUES " + "="*25)
        venue_start = time.time()

        # Re-extract venues (in case data changed)
        venues_df_new = pd.read_sql(venue_query, connection)
        extraction_stats['venues_extracted'] = len(venues_df_new)
        print(f"✅ Extracted {len(venues_df_new)} venues")

        if not venues_df_new.empty:
            # Insert venues
            venue_data = [(row['name'], row['city'], row['country']) 
                         for _, row in venues_df_new.iterrows()]

            cursor.executemany(venue_insert_query, venue_data)
            connection.commit()
            extraction_stats['venues_inserted'] = cursor.rowcount

        venue_time = time.time() - venue_start
        print(f"⏱️  Venue processing completed in {venue_time:.2f} seconds")

        # Extract and insert event data
        print("\n" + "="*25 + " EVENTS " + "="*25)
        event_start = time.time()

        # Re-extract events (in case data changed)
        events_df_new = pd.read_sql(event_query, connection)
        extraction_stats['events_extracted'] = len(events_df_new)
        print(f"✅ Extracted {len(events_df_new)} events")

        if not events_df_new.empty:
            # Insert events
            event_data = [
                (int(row['event_id']), row['venue_id'], row['artist'], 
                 row['event_name'], row['event_date'], row['event_type'],
                 row['performer_type'], row['performer'])
                for _, row in events_df_new.iterrows()
            ]

            cursor.executemany(event_insert_query, event_data)
            connection.commit()
            extraction_stats['events_inserted'] = cursor.rowcount

        event_time = time.time() - event_start
        print(f"⏱️  Event processing completed in {event_time:.2f} seconds")

        # Final summary
        end_time = datetime.now()
        total_duration = end_time - start_time

        print("\n" + "="*60)
        print("🎯 COMPLETE EXTRACTION PROCESS FINISHED!")
        print("✅ All operations completed successfully!")
        print(f"\n📊 Extraction Summary:")
        print(f"   🏟️  Venues: {extraction_stats['venues_extracted']} extracted, {extraction_stats['venues_inserted']} new inserted")
        print(f"   🎪 Events: {extraction_stats['events_extracted']} extracted, {extraction_stats['events_inserted']} new inserted")
        print(f"\n⏱️  Total execution time: {total_duration}")
        print(f"📅 Completed at: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
        print("="*60)

        return True

    except Exception as e:
        print(f"❌ Unexpected error during extraction: {e}")
        extraction_stats['success'] = False
        # Discard any uncommitted work, matching the earlier insert cells
        if connection and connection.is_connected():
            connection.rollback()
        return False

# Run the complete extraction
run_complete_extraction()

🚀 Starting Complete Event and Venue Data Extraction Process
📅 Started at: 2025-07-20 15:10:20

✅ Extracted 2017 venues
⏱️  Venue processing completed in 28.29 seconds

✅ Extracted 6772 events
❌ Unexpected error during extraction: 1054 (42S22): Unknown column 'venue_id' in 'INSERT INTO'

False
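
Unlike the earlier insert cells, `run_complete_extraction()` passes every row to a single `executemany` call. The batching used earlier can be factored into a small generator so that both code paths share it. This is a sketch; `chunked` is a hypothetical helper name, and the integer list stands in for the prepared row tuples.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

rows = list(range(2017))  # stand-in for 2017 prepared venue tuples
batch_sizes = [len(batch) for batch in chunked(rows, 1000)]
print(batch_sizes)  # [1000, 1000, 17]
```

Inside the function, each batch could then be passed to `cursor.executemany(...)` and committed in turn, as in the standalone insert cells.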

## Statistics & Verification

### Show Final Database Statistics

In [12]:
# Show comprehensive database statistics
def show_comprehensive_stats():
    """Show detailed statistics for EVENTS and VENUES tables."""

    try:
        reconnect_if_needed()

        print("📊 Comprehensive Database Statistics")
        print("="*50)

        # VENUES detailed stats
        cursor.execute("SELECT COUNT(*) FROM VENUES")
        venue_count = cursor.fetchone()[0]
        print(f"\n🏟️  VENUES TABLE: {venue_count} records")

        if venue_count > 0:
            # Top countries
            cursor.execute("""
                SELECT country, COUNT(*) as count 
                FROM VENUES 
                WHERE country IS NOT NULL 
                GROUP BY country 
                ORDER BY count DESC 
                LIMIT 10
            """)
            countries = cursor.fetchall()
            print("   📍 Countries with most venues:")
            for country, count in countries:
                print(f"      • {country}: {count} venues")

            # Sample venues
            cursor.execute("SELECT name, city, country FROM VENUES ORDER BY name LIMIT 5")
            sample_venues = cursor.fetchall()
            print("   \n🏟️  Sample venues:")
            for name, city, country in sample_venues:
                print(f"      • {name} ({city}, {country})")

        # EVENTS detailed stats  
        cursor.execute("SELECT COUNT(*) FROM EVENTS")
        event_count = cursor.fetchone()[0]
        print(f"\n🎪 EVENTS TABLE: {event_count} records")

        if event_count > 0:
            # Top artists
            cursor.execute("""
                SELECT artist, COUNT(*) as count 
                FROM EVENTS 
                WHERE artist IS NOT NULL 
                GROUP BY artist 
                ORDER BY count DESC 
                LIMIT 10
            """)
            artists = cursor.fetchall()
            print("   🎤 Artists with most events:")
            for artist, count in artists:
                print(f"      • {artist}: {count} events")

            # Date range
            cursor.execute("""
                SELECT MIN(event_date) as earliest, MAX(event_date) as latest 
                FROM EVENTS 
                WHERE event_date IS NOT NULL
            """)
            date_range = cursor.fetchone()
            if date_range[0] and date_range[1]:
                print(f"   📅 Date range: {date_range[0]} to {date_range[1]}")

            # Event types
            cursor.execute("""
                SELECT event_type, COUNT(*) as count 
                FROM EVENTS 
                WHERE event_type IS NOT NULL 
                GROUP BY event_type 
                ORDER BY count DESC 
                LIMIT 5
            """)
            event_types = cursor.fetchall()
            print("   🎭 Event types:")
            for event_type, count in event_types:
                print(f"      • {event_type}: {count} events")

        # Cross-table statistics
        cursor.execute("""
            SELECT COUNT(DISTINCT e.venue_id) as venues_with_events
            FROM EVENTS e
            WHERE e.venue_id IS NOT NULL
        """)
        venues_with_events = cursor.fetchone()[0]

        print(f"\n🔗 Cross-table Statistics:")
        print(f"   📊 Venues with events: {venues_with_events}")

        if venue_count > 0:
            utilization = (venues_with_events / venue_count) * 100
            print(f"   📈 Venue utilization: {utilization:.1f}%")

    except Error as e:
        print(f"❌ Error retrieving comprehensive statistics: {e}")

# Show the comprehensive statistics
show_comprehensive_stats()

📊 Comprehensive Database Statistics

🏟️  VENUES TABLE: 2015 records
   📍 Countries with most venues:
      • USA: 1742 venues
      • Canada: 260 venues
      • France: 2 venues
      • UK: 2 venues
      • Germany: 2 venues
      • Mexico: 2 venues
      • Brazil: 1 venues
      • Netherlands: 1 venues
      • Switzerland: 1 venues
      • Norway: 1 venues
   
🏟️  Sample venues:
      • 10 Mile Music Hall (Frisco, USA)
      • 1015 Folsom (San Francisco, USA)
      • 11:11 EPTX (El Paso, USA)
      • 1614 Drinks-Music-Billiards / AfterHours Tavern (High Point, USA)
      • 1902 Nightclub (San Antonio, USA)

🎪 EVENTS TABLE: 4778 records
   🎤 Artists with most events:
      • 49Th & Main: 39 events
      • Shawn Desman: 37 events
      • Bunt: 36 events
      • Crankdat: 36 events
      • Nao: 36 events
      • Knock2 : 33 events
      • Goo Goo: 32 events
      • Westend: 32 events
      • Shaq: 31 events
      • U

In [13]:
# Verification queries - check data integrity
print("🔍 Data Integrity Verification")
print("="*40)

try:
    reconnect_if_needed()

    # Check for orphaned events (events with venues not in VENUES table)
    cursor.execute("""
        SELECT COUNT(*) 
        FROM EVENTS e 
        LEFT JOIN VENUES v ON e.venue_id = v.name 
        WHERE v.name IS NULL AND e.venue_id IS NOT NULL
    """)
    orphaned_events = cursor.fetchone()[0]

    if orphaned_events > 0:
        print(f"⚠️  Found {orphaned_events} events with venues not in VENUES table")

        # Show sample orphaned events
        cursor.execute("""
            SELECT e.event_id, e.venue_id, e.artist, e.event_name 
            FROM EVENTS e 
            LEFT JOIN VENUES v ON e.venue_id = v.name 
            WHERE v.name IS NULL AND e.venue_id IS NOT NULL 
            LIMIT 5
        """)
        orphaned_sample = cursor.fetchall()
        print("   Sample orphaned events:")
        for event_id, venue, artist, event_name in orphaned_sample:
            print(f"      • Event {event_id}: {artist} at {venue}")
    else:
        print("✅ All events have corresponding venues in VENUES table")

    # Check for duplicate event IDs
    cursor.execute("""
        SELECT event_id, COUNT(*) as count 
        FROM EVENTS 
        GROUP BY event_id 
        HAVING COUNT(*) > 1
    """)
    duplicate_events = cursor.fetchall()

    if duplicate_events:
        print(f"⚠️  Found {len(duplicate_events)} duplicate event IDs:")
        for event_id, count in duplicate_events[:5]:
            print(f"      • Event ID {event_id}: {count} duplicates")
    else:
        print("✅ No duplicate event IDs found")

    # Check for venues without events
    cursor.execute("""
        SELECT COUNT(*) 
        FROM VENUES v 
        LEFT JOIN EVENTS e ON v.name = e.venue_id 
        WHERE e.venue_id IS NULL
    """)
    venues_without_events = cursor.fetchone()[0]

    print(f"📊 Found {venues_without_events} venues without events (this is normal)")

except Error as e:
    print(f"❌ Error during verification: {e}")

🔍 Data Integrity Verification
❌ Error during verification: 1054 (42S22): Unknown column 'e.venue_id' in 'WHERE'
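
The orphan check above is an anti-join: a `LEFT JOIN` followed by a filter that keeps only rows with no match on the right side. The sketch below shows the pattern on a toy SQLite database from the standard library. The `venue_name` column and all rows are hypothetical; they do not reflect the real `EVENTS` layout, which is not shown in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema: `venue_name` is a hypothetical column used only for this example.
conn.executescript("""
    CREATE TABLE VENUES (name TEXT PRIMARY KEY);
    CREATE TABLE EVENTS (event_id INTEGER PRIMARY KEY, venue_name TEXT);
    INSERT INTO VENUES (name) VALUES ('Terminal 5'), ('Keswick Theatre');
    INSERT INTO EVENTS (event_id, venue_name) VALUES
        (1, 'Terminal 5'), (2, 'Unknown Hall'), (3, NULL);
""")

# Events whose venue name has no matching row in VENUES.
orphaned = conn.execute("""
    SELECT e.event_id, e.venue_name
    FROM EVENTS e
    LEFT JOIN VENUES v ON e.venue_name = v.name
    WHERE v.name IS NULL AND e.venue_name IS NOT NULL
""").fetchall()
print(orphaned)  # [(2, 'Unknown Hall')]
conn.close()
```

Event 3 is excluded because its venue is NULL rather than unmatched, which mirrors the `IS NOT NULL` condition in the notebook's query.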


## Database Cleanup

### Close Database Connection

In [14]:
# Final cleanup and summary
print("🧹 Final Cleanup and Summary")
print("="*40)

# Show final counts
try:
    if connection and connection.is_connected():
        cursor.execute("SELECT COUNT(*) FROM VENUES")
        final_venue_count = cursor.fetchone()[0]

        cursor.execute("SELECT COUNT(*) FROM EVENTS") 
        final_event_count = cursor.fetchone()[0]

        cursor.execute("SELECT COUNT(*) FROM CONCERT_SEATS")
        seat_count = cursor.fetchone()[0]

        print(f"📊 Final Database Summary:")
        print(f"   🏟️  VENUES table: {final_venue_count} records")
        print(f"   🎪 EVENTS table: {final_event_count} records")
        print(f"   🎫 CONCERT_SEATS table: {seat_count} records")
        print(f"\n✅ Data extraction and normalization completed successfully!")

except Error as e:
    print(f"❌ Error getting final counts: {e}")

# Close the database connection
close_connection()
print("\n🎯 Notebook execution completed!")

🧹 Final Cleanup and Summary
📊 Final Database Summary:
   🏟️  VENUES table: 2015 records
   🎪 EVENTS table: 4778 records
   🎫 CONCERT_SEATS table: 192126 records

✅ Data extraction and normalization completed successfully!
✓ Cursor closed.
✓ MySQL connection closed.

🎯 Notebook execution completed!
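
The cleanup cell prints its success message unconditionally, even though the event insert and the integrity checks reported error 1054 earlier in this run. One way to make the final message reflect what happened is to record each step's outcome and summarize at the end. The sketch below uses a hypothetical `summarize_steps` helper; the example results are taken from the outputs recorded above.

```python
def summarize_steps(step_results):
    """Return (all_ok, lines) for a mapping of step name -> error message or None."""
    lines = []
    for step, error in step_results.items():
        status = "ok" if error is None else f"FAILED: {error}"
        lines.append(f"{step}: {status}")
    all_ok = all(error is None for error in step_results.values())
    return all_ok, lines

# Outcomes as recorded in this run's outputs.
results = {
    "venue insert": None,
    "event insert": "1054 (42S22): Unknown column 'venue_id' in 'INSERT INTO'",
}
all_ok, lines = summarize_steps(results)
for line in lines:
    print(line)
print("Completed successfully" if all_ok else "Completed with errors")
```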


---

## Summary

This notebook provides a complete workflow for extracting and normalizing event and venue data:

### ✅ **Completed Tasks:**
1. **Database Connection** - Established connection with error handling
2. **Venue Extraction** - Extracted unique venues with location data
3. **Event Extraction** - Extracted unique events with full details
4. **Data Quality Checks** - Validated data integrity and identified issues
5. **Batch Insertion** - Inserted data in batches of 1,000 rows with duplicate prevention; in the recorded run the venue insert succeeded, while the event insert failed with error 1054 (unknown column `venue_id`)
6. **Statistics & Verification** - Comprehensive analysis and validation; the integrity checks that reference `venue_id` failed with the same error

### 🎯 **Key Features:**
- **Smart Duplicate Handling** - Uses INSERT IGNORE to prevent duplicates
- **Data Quality Validation** - Identifies and reports data inconsistencies
- **Progress Tracking** - Shows detailed progress and timing information
- **Error Handling** - Comprehensive error handling with rollback capability
- **Interactive Analysis** - Step-by-step execution with detailed statistics

### 📊 **Data Flow:**
```
CONCERT_SEATS → Extract Unique Data → VENUES + EVENTS Tables
```

### 🚀 **Next Steps:**
- Run cells sequentially from top to bottom
- Monitor progress and statistics in each section
- Use the verification section to ensure data integrity
- Customize queries as needed for your specific requirements