# Week 6 - SQL and Python Integration Part 1: Database Connections

## Learning Objectives
By the end of this lesson, you will be able to:
1. Establish PostgreSQL database connections from Python using SQLAlchemy
2. Connect to cloud databases (Supabase) for real-world data analysis
3. Execute SQL queries from Python notebooks using real e-commerce data
4. Understand the relationship between SQL databases and Python DataFrames
5. Implement proper connection management and error handling
6. Compare SQL and Pandas approaches for business analytics

## Business Context: Bridging SQL and Python

In modern business environments, data often lives in **cloud databases** while analysis happens in **Python**. The ability to seamlessly bridge these two worlds is essential for:

- **Real-time Data Access** - Connect directly to live business systems
- **Scalability** - Handle enterprise-scale datasets
- **Collaboration** - Multiple analysts accessing the same data source
- **Performance** - Leverage database engines for heavy computation
- **Integration** - Combine SQL's querying power with Python's analytical capabilities

Today we'll master connecting Python to **PostgreSQL databases** using **Supabase** (a cloud database platform) and work with real Olist e-commerce data that's already stored in the cloud.

In [ ]:
# Import required libraries for PostgreSQL database connectivity
import pandas as pd
import numpy as np
import sqlalchemy
from sqlalchemy import create_engine, text, inspect
from datetime import datetime, timedelta
import warnings
import os
from dotenv import load_dotenv

warnings.filterwarnings('ignore')

# Load environment variables from .env file
load_dotenv()

# Supabase PostgreSQL Database Configuration from environment variables
DATABASE_CONFIG = {
    'host': os.getenv('POSTGRES_HOST'),
    'port': int(os.getenv('POSTGRES_PORT', 5432)),
    'database': os.getenv('POSTGRES_DATABASE'),
    'user': os.getenv('POSTGRES_USER'),
    'password': os.getenv('POSTGRES_PASSWORD'),
    'connection_timeout': 30,
    'echo': False  # Set to True to see SQL queries
}

# Verify that the required environment variables were loaded
required_keys = ['host', 'database', 'user', 'password']
missing_keys = [key for key in required_keys if not DATABASE_CONFIG[key]]
if missing_keys:
    raise ValueError(f"Missing required database settings: {missing_keys}. Please check your .env file.")

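# PostgreSQL connection URL. URL.create() (SQLAlchemy 1.4+) escapes special
# characters such as '@' or '/' in the password, which a plain f-string would not.
from sqlalchemy.engine import URL

POSTGRES_URL = URL.create(
    "postgresql",
    username=DATABASE_CONFIG['user'],
    password=DATABASE_CONFIG['password'],
    host=DATABASE_CONFIG['host'],
    port=DATABASE_CONFIG['port'],
    database=DATABASE_CONFIG['database'],
)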

print("🐘 PostgreSQL-Python Integration Environment Ready!")
print(f"SQLAlchemy version: {sqlalchemy.__version__}")
print(f"Pandas version: {pd.__version__}")
print("✅ Connecting to Supabase PostgreSQL Database...")
print("🗄️ Real Olist E-commerce & Marketing data awaits!")
print("🔒 Database credentials loaded securely from .env file")

## 1. SQLAlchemy Basics and PostgreSQL Connection

**SQLAlchemy** is one of the most widely used database toolkits for Python. It provides:
- **Connection Management**: Handle database connections efficiently
- **SQL Query Execution**: Run SQL directly from Python
- **ORM (Object-Relational Mapping)**: Map Python objects to database tables
- **Database Abstraction**: Work with different databases using the same API

**PostgreSQL** is an enterprise-grade database that excels at:
- **Complex Queries**: Advanced SQL features like window functions, CTEs
- **Scalability**: Handle millions of rows efficiently
- **Data Integrity**: ACID compliance for business-critical data
- **JSON Support**: Store and query semi-structured data
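
A PostgreSQL connection URL has the form `postgresql://user:password@host:port/database`. When such a URL is assembled by hand as a string, characters like `@`, `/`, or `:` inside the password can break parsing unless they are percent-encoded. The sketch below shows one way to do that with the standard library. The helper name `build_postgres_url` and the credentials are hypothetical, chosen only for illustration.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Assemble a postgresql:// URL, percent-encoding the user and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, for illustration only
url = build_postgres_url("analyst", "p@ss/word:1", "db.example.com", 5432, "postgres")
print(url)
```

The encoded password no longer contains a raw `@`, so the host portion of the URL stays unambiguous.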

In [None]:
class PostgreSQLManager:
    """
    PostgreSQL connection manager built on a SQLAlchemy engine.
    Handles connection pooling, error reporting, and engine disposal.
    """
    
    def __init__(self, connection_url=None):
        self.connection_url = connection_url or POSTGRES_URL
        self.engine = None
        self._connect()
    
    def _connect(self):
        """
        Establish PostgreSQL connection with optimal configuration.
        """
        try:
            # Create engine with connection pooling and timeout settings
            self.engine = create_engine(
                self.connection_url,
                echo=DATABASE_CONFIG['echo'],
                pool_size=5,                    # Connection pool size
                max_overflow=10,                # Additional connections if needed
                pool_timeout=DATABASE_CONFIG['connection_timeout'],
                pool_recycle=3600,              # Recycle connections every hour
                connect_args={
                    "connect_timeout": DATABASE_CONFIG['connection_timeout'],
                    "application_name": "Python_Data_Analysis_Course"
                }
            )
            
            # Test connection
            with self.engine.connect() as conn:
                result = conn.execute(text("SELECT version()"))
                version = result.scalar()
                print("✅ PostgreSQL connection established successfully")
                print(f"🐘 Database version: {version[:50]}...")
            
        except Exception as e:
            print(f"❌ PostgreSQL connection failed: {e}")
            print("🔧 Troubleshooting tips:")
            print("  • Check your internet connection")
            print("  • Verify database credentials")
            print("  • Ensure Supabase database is running")
            raise
    
    def get_table_info(self):
        """
        Get comprehensive information about all tables in the database.
        """
        inspector = inspect(self.engine)
        tables = inspector.get_table_names()
        
        table_info = {}
        
        print("📋 Discovering database schema...")
        
        for table in tables:
            try:
                with self.engine.connect() as conn:
                    # Get row count
                    result = conn.execute(text(f'SELECT COUNT(*) FROM "{table}"'))
                    row_count = result.scalar()
                    
                    # Get column information
                    columns = inspector.get_columns(table)
                    
                    table_info[table] = {
                        'rows': row_count,
                        'columns': [col['name'] for col in columns],
                        'column_types': {col['name']: str(col['type']) for col in columns}
                    }
                    
                    print(f"  📊 {table}: {row_count:,} rows, {len(columns)} columns")
                    
            except Exception as e:
                print(f"  ⚠️ Could not access {table}: {e}")
                continue
        
        return table_info
    
    def execute_query(self, query, params=None):
        """
        Execute a SQL query with proper error handling and return results as DataFrame.
        """
        try:
            with self.engine.connect() as conn:
                # read_sql accepts params=None, so no branching is needed
                return pd.read_sql(text(query), conn, params=params)
                
        except Exception as e:
            print(f"❌ Query execution failed: {e}")
            print(f"📝 Query: {query[:100]}...")
            raise
    
    def get_sample_data(self, table_name, limit=5):
        """
        Get sample data from a table for exploration.
        """
        # Cast limit to int so only a number can be interpolated into the SQL
        query = f'SELECT * FROM "{table_name}" LIMIT {int(limit)}'
        return self.execute_query(query)
    
    def get_table_schema(self, table_name):
        """
        Get detailed schema information for a specific table.
        """
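        inspector = inspect(self.engine)
        columns = inspector.get_columns(table_name)
        # get_columns() does not report primary keys for PostgreSQL,
        # so read them from the primary key constraint instead
        pk_constraint = inspector.get_pk_constraint(table_name) or {}
        pk_columns = set(pk_constraint.get('constrained_columns') or [])
        
        schema_df = pd.DataFrame([
            {
                'column_name': col['name'],
                'data_type': str(col['type']),
                'nullable': col['nullable'],
                'default': col.get('default'),
                'primary_key': col['name'] in pk_columns
            }
            for col in columns
        ])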
        
        return schema_df
    
    def close(self):
        """
        Properly close database connections.
        """
        if self.engine:
            self.engine.dispose()
            print("🔒 PostgreSQL connections closed")

# Create database manager instance and connect to Supabase
print("🚀 Connecting to Supabase PostgreSQL Database...")
db = PostgreSQLManager()

# Display database information
print("\n📊 Olist E-commerce Database Overview:")
db_info = db.get_table_info()

print(f"\n🗃️ Total tables discovered: {len(db_info)}")
total_rows = sum(info['rows'] for info in db_info.values())
print(f"📏 Total rows across all tables: {total_rows:,}")

## 2. Exploring the Database Schema

Let's explore the structure of our Olist e-commerce database to understand the business data model.
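
Schema discovery boils down to two steps: list the tables, then collect a row count and column names for each. The sketch below mirrors that loop with the standard-library `sqlite3` module so it runs without database credentials. It is a stand-in, not the Supabase setup used in this lesson: the tables, columns, and the helper `describe_tables` are hypothetical, and `sqlite_master` and `PRAGMA table_info` are SQLite-specific.

```python
import sqlite3

# In-memory SQLite database with two small, hypothetical tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_state TEXT)")
conn.execute("CREATE TABLE leads (mql_id TEXT PRIMARY KEY, origin TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("o1", "SP"), ("o2", "RJ"), ("o3", "SP")],
)
conn.execute("INSERT INTO leads VALUES ('m1', 'organic_search')")


def describe_tables(connection):
    """Return {table: {'rows': int, 'columns': [names]}} for every table."""
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    info = {}
    for table in tables:
        # Table names come from the catalog itself, so quoting them is safe here
        rows = connection.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        # Each PRAGMA table_info row is (cid, name, type, notnull, default, pk)
        columns = [col[1] for col in connection.execute(f'PRAGMA table_info("{table}")')]
        info[table] = {"rows": rows, "columns": columns}
    return info


schema = describe_tables(conn)
for name, details in schema.items():
    print(f"{name}: {details['rows']} rows, columns={details['columns']}")
```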

In [None]:
# Let's examine the structure of key business tables
print("🔍 Database Schema Exploration")
print("\n" + "="*60)

# Identify the main datasets
sales_tables = [table for table in db_info.keys() if 'olist_sales_data_set' in table]
marketing_tables = [table for table in db_info.keys() if 'olist_marketing_data_set' in table]

print(f"\n📊 OLIST SALES DATASET Tables:")
for table in sales_tables:
    if table in db_info:
        info = db_info[table]
        print(f"  • {table}: {info['rows']:,} rows, {len(info['columns'])} columns")

print(f"\n📈 OLIST MARKETING DATASET Tables:")
for table in marketing_tables:
    if table in db_info:
        info = db_info[table]
        print(f"  • {table}: {info['rows']:,} rows, {len(info['columns'])} columns")

# Let's explore the main sales tables if they exist
if sales_tables:
    main_sales_table = sales_tables[0]  # Assume first table is main one
    print(f"\n📋 {main_sales_table.upper()} Schema:")
    sales_schema = db.get_table_schema(main_sales_table)
    display(sales_schema)
    
    print(f"\n📦 Sample data from {main_sales_table}:")
    sales_sample = db.get_sample_data(main_sales_table, 3)
    display(sales_sample)

# Let's explore the main marketing tables if they exist
if marketing_tables:
    main_marketing_table = marketing_tables[0]  # Assume first table is main one
    print(f"\n📋 {main_marketing_table.upper()} Schema:")
    marketing_schema = db.get_table_schema(main_marketing_table)
    display(marketing_schema)
    
    print(f"\n📦 Sample data from {main_marketing_table}:")
    marketing_sample = db.get_sample_data(main_marketing_table, 3)
    display(marketing_sample)

print("\n💡 Business Data Model Understanding:")
print("  🔗 Sales dataset contains: customer orders, products, payments, reviews")
print("  🔗 Marketing dataset contains: lead generation, conversions, channel data")
print("  🔗 These datasets can be joined to analyze customer acquisition and behavior")

## 3. Running SQL Queries from Python

Now let's execute SQL queries directly from Python and see how they work with real cloud data.

In [None]:
# Let's start with basic data exploration using SQL
print("🔍 SQL Query Execution Examples")
print("\n" + "="*60)

# First, let's check what columns are available in our main datasets
def explore_dataset_structure():
    """Explore the structure of our main datasets"""
    
    # Check sales dataset structure
    if sales_tables:
        print(f"\n📊 SALES DATASET - {sales_tables[0]} columns:")
        sales_cols = db_info[sales_tables[0]]['columns']
        print(f"  {sales_cols[:10]}...")  # Show first 10 columns
    
    # Check marketing dataset structure
    if marketing_tables:
        print(f"\n📈 MARKETING DATASET - {marketing_tables[0]} columns:")
        marketing_cols = db_info[marketing_tables[0]]['columns']
        print(f"  {marketing_cols[:10]}...")  # Show first 10 columns

explore_dataset_structure()

# Example 1: Basic data exploration
print("\n📋 Example 1: Basic Data Exploration")

if sales_tables:
    # Get basic statistics from sales data
    basic_stats_query = f"""
    SELECT 
        COUNT(*) as total_records,
        COUNT(DISTINCT "customer_id") as unique_customers,  -- COUNT(DISTINCT ...) already ignores NULLs
        MIN("order_purchase_timestamp") as earliest_order,
        MAX("order_purchase_timestamp") as latest_order
    FROM "{sales_tables[0]}"
    WHERE "order_purchase_timestamp" IS NOT NULL
    """
    
    try:
        basic_stats = db.execute_query(basic_stats_query)
        print("✅ Basic Sales Statistics:")
        display(basic_stats)
    except Exception as e:
        print(f"⚠️ Could not execute basic stats query: {e}")
        print("Trying alternative approach...")
        
        # Fallback: just count total records
        simple_query = f'SELECT COUNT(*) as total_records FROM "{sales_tables[0]}"'
        try:
            simple_stats = db.execute_query(simple_query)
            print("✅ Simple Record Count:")
            display(simple_stats)
        except Exception as e2:
            print(f"❌ Could not execute any query: {e2}")

print("\n" + "-"*60)

In [None]:
# Example 2: Customer analysis (adapting to actual schema)
print("\n💼 Example 2: Customer Analysis")
print("Business Question: What can we learn about our customer base?")

if sales_tables:
    # Let's examine the actual structure first
    sample_data = db.get_sample_data(sales_tables[0], 1)
    print(f"\nActual columns in {sales_tables[0]}:")
    print(list(sample_data.columns))
    
    # Adapt query based on available columns
    available_columns = db_info[sales_tables[0]]['columns']
    
    # Look for customer-related columns
    customer_cols = [col for col in available_columns if 'customer' in col.lower()]
    state_cols = [col for col in available_columns if 'state' in col.lower()]
    
    print(f"\nCustomer-related columns: {customer_cols}")
    print(f"State-related columns: {state_cols}")
    
    if customer_cols and state_cols:
        # Build a customer analysis query with available columns
        customer_analysis_query = f"""
        SELECT 
            "{state_cols[0]}" as customer_state,
            COUNT(*) as order_count,
            COUNT(DISTINCT "{customer_cols[0]}") as unique_customers
        FROM "{sales_tables[0]}"
        WHERE "{state_cols[0]}" IS NOT NULL
        GROUP BY "{state_cols[0]}"
        ORDER BY order_count DESC
        LIMIT 10
        """
        
        try:
            customer_analysis = db.execute_query(customer_analysis_query)
            print("\n✅ Customer Analysis by State:")
            display(customer_analysis)
            
            if len(customer_analysis) > 0:
                top_state = customer_analysis.iloc[0]
                print(f"\n💡 Key Insights:")
                print(f"  • Top state: {top_state['customer_state']} ({top_state['order_count']:,} orders)")
                print(f"  • Total states analyzed: {len(customer_analysis)}")
        except Exception as e:
            print(f"❌ Customer analysis failed: {e}")
    
print("\n" + "-"*60)

In [None]:
# Example 3: Marketing funnel analysis
print("\n📈 Example 3: Marketing Analysis")
print("Business Question: How effective are our marketing channels?")

if marketing_tables:
    # Examine marketing table structure
    marketing_sample = db.get_sample_data(marketing_tables[0], 1)
    print(f"\nActual columns in {marketing_tables[0]}:")
    print(list(marketing_sample.columns))
    
    marketing_columns = db_info[marketing_tables[0]]['columns']
    
    # Look for relevant marketing columns
    channel_cols = [col for col in marketing_columns if any(keyword in col.lower() for keyword in ['origin', 'source', 'channel', 'medium'])]
    lead_cols = [col for col in marketing_columns if any(keyword in col.lower() for keyword in ['lead', 'mql', 'conversion'])]
    
    print(f"\nChannel-related columns: {channel_cols}")
    print(f"Lead-related columns: {lead_cols}")
    
    if channel_cols:
        # Build marketing analysis query
        marketing_query = f"""
        SELECT 
            "{channel_cols[0]}" as marketing_channel,
            COUNT(*) as total_leads,
            ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
        FROM "{marketing_tables[0]}"
        WHERE "{channel_cols[0]}" IS NOT NULL
        GROUP BY "{channel_cols[0]}"
        ORDER BY total_leads DESC
        LIMIT 10
        """
        
        try:
            marketing_analysis = db.execute_query(marketing_query)
            print("\n✅ Marketing Channel Performance:")
            display(marketing_analysis)
            
            if len(marketing_analysis) > 0:
                top_channel = marketing_analysis.iloc[0]
                print(f"\n💡 Marketing Insights:")
                print(f"  • Top channel: {top_channel['marketing_channel']} ({top_channel['percentage']}% of leads)")
                print(f"  • Total channels: {len(marketing_analysis)}")
        except Exception as e:
            print(f"❌ Marketing analysis failed: {e}")
    else:
        print("⚠️ No obvious channel columns found, showing sample data:")
        display(marketing_sample)

print("\n" + "-"*60)

## 4. Advanced SQL Features

Let's explore more sophisticated SQL queries that are common in business intelligence scenarios.
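
The trend query in this section uses `LAG(...)` to fetch the previous month's count and `NULLIF(..., 0)` to avoid dividing by zero. The arithmetic behind the resulting growth column can be sketched in plain Python. The function name `month_over_month_growth` and the sample counts are hypothetical, and `None` plays the role of SQL `NULL`.

```python
def month_over_month_growth(monthly_counts):
    """Percent change versus the previous month, or None when it is undefined."""
    growth = []
    previous = None
    for count in monthly_counts:
        if previous is None or previous == 0:
            # First month has no predecessor (LAG returns NULL);
            # a zero predecessor is turned into NULL by NULLIF
            growth.append(None)
        else:
            growth.append(round((count - previous) * 100.0 / previous, 2))
        previous = count
    return growth


# Hypothetical monthly order counts
print(month_over_month_growth([100, 120, 90, 0, 50]))
# → [None, 20.0, -25.0, -100.0, None]
```

The first and last entries are `None` for different reasons: the first month has nothing to compare against, while the last follows a month with zero records.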

In [None]:
# Advanced SQL Example: Time-based analysis with window functions
print("🧠 Advanced SQL Analysis")
print("\n" + "="*60)

print("\n📅 Example: Time-Based Trend Analysis")
print("Business Question: How have our key metrics evolved over time?")

if sales_tables:
    # Look for date columns
    available_columns = db_info[sales_tables[0]]['columns']
    date_cols = [col for col in available_columns if any(keyword in col.lower() for keyword in ['date', 'timestamp', 'time'])]
    
    print(f"\nDate-related columns found: {date_cols}")
    
    if date_cols:
        # Build time-series analysis query
        time_analysis_query = f"""
        SELECT 
            DATE_TRUNC('month', "{date_cols[0]}") as month,
            COUNT(*) as monthly_records,
            LAG(COUNT(*), 1) OVER (ORDER BY DATE_TRUNC('month', "{date_cols[0]}")) as prev_month_records,
            ROUND(
                (COUNT(*) - LAG(COUNT(*), 1) OVER (ORDER BY DATE_TRUNC('month', "{date_cols[0]}"))) * 100.0 / 
                NULLIF(LAG(COUNT(*), 1) OVER (ORDER BY DATE_TRUNC('month', "{date_cols[0]}")), 0), 
                2
            ) as month_over_month_growth
        FROM "{sales_tables[0]}"
        WHERE "{date_cols[0]}" IS NOT NULL
        GROUP BY DATE_TRUNC('month', "{date_cols[0]}")
        ORDER BY month
        LIMIT 12
        """
        
        try:
            time_analysis = db.execute_query(time_analysis_query)
            print("\n✅ Monthly Trend Analysis with Growth Rates:")
            display(time_analysis)
            
            if len(time_analysis) > 1:
                avg_growth = time_analysis['month_over_month_growth'].dropna().mean()
                print(f"\n📈 Trend Insights:")
                print(f"  • Average monthly growth: {avg_growth:.2f}%")
                print(f"  • Analysis period: {time_analysis['month'].min()} to {time_analysis['month'].max()}")
        except Exception as e:
            print(f"❌ Time analysis failed: {e}")
            print("Trying simpler date-based query...")
            
            # Fallback to simpler query
            simple_date_query = f"""
            SELECT 
                DATE_TRUNC('month', "{date_cols[0]}") as month,
                COUNT(*) as monthly_records
            FROM "{sales_tables[0]}"
            WHERE "{date_cols[0]}" IS NOT NULL
            GROUP BY DATE_TRUNC('month', "{date_cols[0]}")
            ORDER BY month
            LIMIT 12
            """
            
            try:
                simple_time_analysis = db.execute_query(simple_date_query)
                print("\n✅ Simple Monthly Trends:")
                display(simple_time_analysis)
            except Exception as e2:
                print(f"❌ Simple time analysis also failed: {e2}")

print("\n" + "-"*60)

In [None]:
# Advanced SQL Example: Common Table Expressions (CTEs)
print("\n💼 Advanced SQL: CTEs for Complex Business Logic")
print("Business Question: Can we segment our data for deeper insights?")

if sales_tables:
    # Build a more complex query with CTEs
    available_columns = db_info[sales_tables[0]]['columns']
    
    # Look for numeric columns that might represent values
    numeric_cols = [col for col in available_columns if any(keyword in col.lower() for keyword in ['price', 'value', 'amount', 'cost'])]
    
    print(f"\nNumeric value columns found: {numeric_cols}")
    
    if numeric_cols and state_cols:
        # Build CTE query for business segmentation
        cte_query = f"""
        WITH regional_stats AS (
            SELECT 
                "{state_cols[0]}" as region,
                COUNT(*) as total_records,
                AVG("{numeric_cols[0]}") as avg_value,
                STDDEV("{numeric_cols[0]}") as value_stddev
            FROM "{sales_tables[0]}"
            WHERE "{state_cols[0]}" IS NOT NULL 
                AND "{numeric_cols[0]}" IS NOT NULL
                AND "{numeric_cols[0]}" > 0
            GROUP BY "{state_cols[0]}"
        ),
        regional_segments AS (
            SELECT 
                region,
                total_records,
                ROUND(avg_value, 2) as avg_value,
                CASE 
                    WHEN avg_value > (SELECT AVG(avg_value) FROM regional_stats) THEN 'High Value'
                    WHEN total_records > (SELECT AVG(total_records) FROM regional_stats) THEN 'High Volume'
                    ELSE 'Standard'
                END as segment
            FROM regional_stats
        )
        SELECT 
            segment,
            COUNT(*) as region_count,
            SUM(total_records) as total_records,
            ROUND(AVG(avg_value), 2) as segment_avg_value
        FROM regional_segments
        GROUP BY segment
        ORDER BY segment_avg_value DESC
        """
        
        try:
            cte_analysis = db.execute_query(cte_query)
            print("\n✅ Regional Segmentation Analysis (using CTEs):")
            display(cte_analysis)
            
            print(f"\n🎯 Segmentation Insights:")
            for _, row in cte_analysis.iterrows():
                print(f"  • {row['segment']}: {row['region_count']} regions, avg value: {row['segment_avg_value']}")
        except Exception as e:
            print(f"❌ CTE analysis failed: {e}")
    else:
        print("⚠️ Insufficient columns for segmentation analysis")

print("\n💡 Advanced SQL Features Demonstrated:")
print("  • Window functions (LAG, OVER) for time-series analysis")
print("  • Date functions (DATE_TRUNC) for temporal grouping")
print("  • CTEs for complex multi-step business logic")
print("  • CASE statements for business rule implementation")
print("  • Subqueries for dynamic threshold calculations")

## 5. SQL vs Pandas: When to Use Each Approach

Let's compare the strengths of SQL versus pandas for different types of data operations.
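
A common split is to let the database group and count, then post-process the small aggregated result in Python. The sketch below illustrates that division of labour with stand-ins chosen so it runs anywhere: `sqlite3` replaces PostgreSQL, and the `statistics` module replaces the pandas calls (`sum`, `mean`, `std`). The table and its rows are hypothetical.

```python
import sqlite3
from statistics import mean, stdev

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_state TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "SP"), (2, "SP"), (3, "SP"), (4, "RJ"), (5, "RJ"), (6, "MG")],
)

# Step 1: the database does the grouping and counting
rows = conn.execute(
    """
    SELECT customer_state, COUNT(*) AS record_count
    FROM orders
    GROUP BY customer_state
    ORDER BY record_count DESC
    """
).fetchall()

# Step 2: Python summarizes the already-aggregated result
counts = [count for _, count in rows]
summary = {
    "total": sum(counts),
    "mean": mean(counts),
    "stdev": stdev(counts),  # sample standard deviation, like pandas' default .std()
}
print(rows)
print(summary)
```

Only three aggregated rows cross the connection here, instead of all six source rows. With millions of rows, that difference is what makes the hybrid approach attractive.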

In [None]:
def compare_sql_vs_pandas_approaches():
    """
    Compare SQL and pandas approaches for different types of analysis.
    """
    print("⚡ SQL vs Pandas: Strategic Comparison")
    print("\n" + "="*60)
    
    # Example 1: Simple aggregation comparison
    print("\n📊 Example 1: Simple Aggregation")
    print("Task: Count records by category")
    
    if sales_tables and state_cols:
        print("\n🗄️ SQL Approach:")
        sql_agg_query = f"""
        SELECT 
            "{state_cols[0]}" as category,
            COUNT(*) as record_count,
            ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
        FROM "{sales_tables[0]}"
        WHERE "{state_cols[0]}" IS NOT NULL
        GROUP BY "{state_cols[0]}"
        ORDER BY record_count DESC
        LIMIT 5
        """
        
        try:
            sql_result = db.execute_query(sql_agg_query)
            print("✅ SQL Result:")
            display(sql_result)
            
            print("\n🐼 Pandas Equivalent (conceptual):")
            print("""
            # If we had the data in a pandas DataFrame:
            pandas_result = (
                df.groupby('category')['record_id']
                .count()
                .sort_values(ascending=False)
                .head(5)
            )
            """)
            
            # Now let's actually demonstrate with the SQL result
            if len(sql_result) > 0:
                print("\n🔄 Converting SQL result to pandas for further analysis:")
                # Calculate additional statistics using pandas
                total_records = sql_result['record_count'].sum()
                avg_records = sql_result['record_count'].mean()
                std_records = sql_result['record_count'].std()
                
                print(f"  • Total records: {total_records:,}")
                print(f"  • Average per category: {avg_records:.1f}")
                print(f"  • Standard deviation: {std_records:.1f}")
                
        except Exception as e:
            print(f"❌ SQL aggregation failed: {e}")
    
    # Example 2: When SQL excels
    print("\n" + "-"*40)
    print("\n📅 Example 2: When SQL Excels - Date Operations")
    
    # date_cols is only defined when sales_tables is non-empty, so check that first
    if sales_tables and date_cols:
        print("\n🗄️ SQL Approach (Superior for date functions):")
        sql_date_query = f"""
        SELECT 
            EXTRACT(YEAR FROM "{date_cols[0]}") as year,
            EXTRACT(QUARTER FROM "{date_cols[0]}") as quarter,
            COUNT(*) as quarterly_records
        FROM "{sales_tables[0]}"
        WHERE "{date_cols[0]}" IS NOT NULL
        GROUP BY EXTRACT(YEAR FROM "{date_cols[0]}"), EXTRACT(QUARTER FROM "{date_cols[0]}")
        ORDER BY year, quarter
        LIMIT 8
        """
        
        try:
            sql_date_result = db.execute_query(sql_date_query)
            print("✅ SQL Date Analysis:")
            display(sql_date_result)
            
            print("\n💡 SQL Advantage: Date extraction and grouping in one step")
        except Exception as e:
            print(f"❌ SQL date analysis failed: {e}")
    
    # Analysis summary
    print("\n" + "="*60)
    print("\n🎯 When to Use SQL vs Pandas:")
    
    print("\n🗄️ Use SQL when:")
    print("  • Working with large datasets (millions of rows)")
    print("  • Need complex JOINs across multiple tables")
    print("  • Performing set operations (UNION, INTERSECT, EXCEPT)")
    print("  • Using window functions for analytics")
    print("  • Implementing business logic with CASE statements")
    print("  • Need database-level performance optimization")
    
    print("\n🐼 Use Pandas when:")
    print("  • Dataset fits comfortably in memory")
    print("  • Need statistical analysis (correlation, regression)")
    print("  • Data cleaning and transformation tasks")
    print("  • Creating visualizations")
    print("  • Machine learning feature engineering")
    print("  • Iterative data exploration and experimentation")
    
    print("\n🔄 Best Practice: Hybrid Approach")
    print("  1. Use SQL for data extraction and initial processing")
    print("  2. Use pandas for analysis, statistics, and visualization")
    print("  3. Leverage each tool's strengths for optimal performance")
    
    return "SQL excels at data processing, pandas excels at analysis"

# Run the comparison
comparison_insights = compare_sql_vs_pandas_approaches()

## 6. Error Handling and Best Practices

Production database applications require robust error handling and connection management.
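
Two habits cover most injection risks: pass values as bound parameters, and check identifiers (table and column names, which cannot be parameterized) against a known list. The sketch below demonstrates both with the standard-library `sqlite3` module so it runs without credentials. Note that `sqlite3` uses `?` placeholders, whereas SQLAlchemy's `text()` uses `:name`. The whitelist `ALLOWED_COLUMNS`, the helper `count_matching`, and the sample table are hypothetical.

```python
import sqlite3
from contextlib import closing

# Hypothetical whitelist of columns that callers may filter on
ALLOWED_COLUMNS = {"customer_state", "order_status"}


def count_matching(connection, column, value):
    """Count rows where `column` equals `value`, passing the value as a bound parameter."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Column {column!r} is not allowed")
    # The column name is validated above; the value never enters the SQL text
    query = f'SELECT COUNT(*) FROM orders WHERE "{column}" = ?'
    with closing(connection.cursor()) as cursor:  # cursor is closed even on error
        cursor.execute(query, (value,))
        return cursor.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_state TEXT, order_status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("SP", "delivered"), ("SP", "shipped"), ("RJ", "delivered")],
)

normal = count_matching(conn, "customer_state", "SP")
injected = count_matching(conn, "customer_state", "SP' OR '1'='1")
print(normal, injected)
# → 2 0
```

The injection attempt returns 0 because the whole string is compared as a literal value rather than being interpreted as SQL.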

In [None]:
def demonstrate_error_handling():
    """
    Demonstrate proper error handling techniques for database operations.
    """
    print("🛡️ Database Error Handling and Best Practices")
    print("\n" + "="*60)
    
    # Example 1: Handling SQL syntax errors
    print("\n❌ Example 1: SQL Syntax Error Handling")
    try:
        # Intentional syntax error
        result = db.execute_query("""
            SELCT * FROM "non_existent_table"  -- Missing 'E' in SELECT
            WHERE some_column = 'value'
            LIMIT 5
        """)
    except Exception as e:
        print(f"✅ Caught SQL syntax error: {type(e).__name__}")
        print(f"   Error message: {str(e)[:100]}...")
    
    # Example 2: Handling table/column not found
    print("\n🔍 Example 2: Table/Column Not Found Error")
    try:
        if sales_tables:
            result = db.execute_query(f"""
                SELECT customer_id, nonexistent_column 
                FROM "{sales_tables[0]}"
                LIMIT 5
            """)
    except Exception as e:
        print(f"✅ Caught column error: {type(e).__name__}")
        print(f"   Error message: {str(e)[:100]}...")
    
    # Example 3: Parameterized queries (SQL injection prevention)
    print("\n🔒 Example 3: Safe Parameterized Queries")
    
    def safe_data_lookup(table_name, column_name, value):
        """
        Safely query data using parameterized queries.
        Note: Table and column names can't be parameterized, so validate them first.
        """
        try:
            # Validate table exists
            if table_name not in db_info:
                raise ValueError(f"Table {table_name} not found")
            
            # Validate column exists
            if column_name not in db_info[table_name]['columns']:
                raise ValueError(f"Column {column_name} not found in {table_name}")
            
            # Use a bound parameter for the value; text() expects :name placeholders
            query = f"""
                SELECT COUNT(*) as record_count
                FROM "{table_name}" 
                WHERE "{column_name}" = :search_value
            """
            result = db.execute_query(query, params={'search_value': value})
            return result
        except Exception as e:
            print(f"❌ Query failed: {e}")
            return pd.DataFrame()
    
    # Test safe query
    if sales_tables and state_cols:
        # Get a real state value first
        sample_query = f'SELECT DISTINCT "{state_cols[0]}" FROM "{sales_tables[0]}" WHERE "{state_cols[0]}" IS NOT NULL LIMIT 1'
        try:
            sample_state = db.execute_query(sample_query)
            if len(sample_state) > 0:
                test_value = sample_state.iloc[0, 0]
                safe_result = safe_data_lookup(sales_tables[0], state_cols[0], test_value)
                if len(safe_result) > 0:
                    print(f"✅ Safe query returned {safe_result.iloc[0, 0]} records for '{test_value}'")
        except Exception as e:
            print(f"⚠️ Could not test safe query: {e}")
    
    # Example 4: Connection management with context managers
    print("\n🔌 Example 4: Proper Connection Management")
    
    class SafeDatabaseQuery:
        """
        Context manager for safe database operations.
        """
        def __init__(self, engine):
            self.engine = engine
            self.connection = None
        
        def __enter__(self):
            self.connection = self.engine.connect()
            return self.connection
        
        def __exit__(self, exc_type, exc_val, exc_tb):
            if self.connection:
                self.connection.close()
            if exc_type:
                print(f"❌ Database error occurred: {exc_type.__name__}: {exc_val}")
            return False  # Don't suppress exceptions
    
    # Use context manager for safe operations
    try:
        with SafeDatabaseQuery(db.engine) as conn:
            result = pd.read_sql(
                text("SELECT 'Connection test successful' as message"), 
                conn
            )
            print(f"✅ Context manager query successful: {result.iloc[0, 0]}")
    except Exception as e:
        print(f"❌ Context manager caught error: {e}")
    
    # Example 5: Data validation
    print("\n✅ Example 5: Data Validation Best Practices")
    
    def validate_query_result(df, expected_columns=None, min_rows=0):
        """
        Validate query results meet business requirements.
        """
        validations = []
        
        # Check if DataFrame is empty
        if df.empty:
            validations.append("❌ Query returned no data")
        else:
            validations.append(f"✅ Query returned {len(df):,} rows")
        
        # Check minimum row count
        if len(df) < min_rows:
            validations.append(f"⚠️ Row count ({len(df)}) below minimum ({min_rows})")
        
        # Check expected columns
        if expected_columns:
            missing_cols = set(expected_columns) - set(df.columns)
            if missing_cols:
                validations.append(f"❌ Missing columns: {missing_cols}")
            else:
                validations.append("✅ All expected columns present")
        
        # Check for null values in any column
        if not df.empty:
            null_counts = df.isnull().sum()
            if null_counts.any():
                # .to_dict() yields plain Python ints, which print cleanly
                validations.append(f"⚠️ Null values found: {null_counts[null_counts > 0].to_dict()}")
            else:
                validations.append("✅ No null values detected")
        
        return validations
    
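    # Test validation (initialize first so the final return is always defined,
    # even when no sales table exists or the test query fails)
    validations = []
    if sales_tables: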
        test_query = f"""
            SELECT *
            FROM "{sales_tables[0]}"
            LIMIT 10
        """
        
        try:
            test_data = db.execute_query(test_query)
            validations = validate_query_result(
                test_data, 
                expected_columns=list(test_data.columns)[:3],  # Check first 3 columns
                min_rows=5
            )
            
            print("Query validation results:")
            for validation in validations:
                print(f"  {validation}")
        except Exception as e:
            print(f"❌ Validation test failed: {e}")
    
    return validations

# Run error handling demonstration
error_handling_results = demonstrate_error_handling()

print("\n📚 Database Best Practices Summary:")
print("  🔒 Always use parameterized queries to prevent SQL injection")
print("  🛡️ Implement comprehensive error handling for all database operations")
print("  🔌 Use connection context managers to ensure proper resource cleanup")
print("  ✅ Validate query results before processing in business logic")
print("  📊 Log query performance for optimization opportunities")
print("  🔄 Implement retry logic for transient connection issues")
print("  📝 Document query patterns and business logic for team maintenance")
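
The summary above recommends retry logic for transient connection issues. A minimal sketch of that idea follows, using only the standard library. The helper name `run_with_retry` is hypothetical, and the sketch assumes transient failures surface as `ConnectionError`; with SQLAlchemy you would more likely pass `sqlalchemy.exc.OperationalError` in `retry_on`, which is an assumption about your driver's behavior rather than something this lesson verifies.

```python
import time


def run_with_retry(operation, max_attempts=3, base_delay=0.5,
                   retry_on=(ConnectionError,), sleep=time.sleep):
    """Call operation(), retrying on the listed exceptions with exponential backoff.

    Hypothetical helper for illustration; not part of SQLAlchemy or pandas.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the real error
            delay = base_delay * 2 ** (attempt - 1)  # doubles after each failure
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            sleep(delay)


# Simulated flaky operation: fails twice, then succeeds
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection reset")
    return "ok"


result = run_with_retry(flaky_query, base_delay=0.01)
print(result)
```

In this lesson's setting, `operation` could be something like `lambda: db.execute_query(query)`. Only exceptions listed in `retry_on` are retried, so a genuine SQL syntax error still fails immediately instead of being retried pointlessly.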

## 7. Key Takeaways and Next Steps

### What We've Accomplished:

1. **PostgreSQL Database Connection**
   - Connected to Supabase cloud PostgreSQL database
   - Established professional connection patterns with SQLAlchemy
   - Implemented proper resource management and error handling

2. **SQL Query Execution from Python**
   - Basic data exploration and filtering
   - Complex business intelligence with JOINs
   - Advanced analytics with window functions and CTEs

3. **Real-World Data Integration**
   - Worked with actual Olist e-commerce and marketing datasets
   - Adapted queries to real schema structures
   - Handled data quality issues and missing values

4. **Production-Ready Practices**
   - Error handling and validation
   - Parameterized queries for security
   - Connection pooling and resource management
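
To make the parameterized-query practice concrete, here is a minimal sketch. It assumes an in-memory SQLite database (standard-library `sqlite3`) as a stand-in for the Supabase PostgreSQL instance so it runs without credentials; the `orders` table and its rows are invented for illustration. The named `:state` placeholder has the same form as the bind parameters accepted by SQLAlchemy's `text()`.

```python
import sqlite3

# In-memory SQLite stands in for the cloud database (assumption for a runnable demo)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_state TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "SP"), (2, "RJ"), (3, "SP")],  # made-up rows
)


def count_orders_by_state(connection, state):
    """Count orders for one state, passing the value as a bound parameter."""
    query = "SELECT COUNT(*) FROM orders WHERE customer_state = :state"
    return connection.execute(query, {"state": state}).fetchone()[0]


sp_orders = count_orders_by_state(conn, "SP")

# A classic injection payload is treated as a plain string, so it matches nothing
malicious = "SP' OR '1'='1"
injected_orders = count_orders_by_state(conn, malicious)

print(sp_orders, injected_orders)
conn.close()
```

Because the value travels separately from the SQL text, the database never parses user input as SQL. Had the payload been spliced into the query with an f-string, the `OR '1'='1'` clause would have matched every row.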

### Business Value:

- **Real-time Analysis**: Connect directly to live business systems
- **Scalability**: Handle enterprise-scale datasets efficiently
- **Performance**: Leverage database engines for heavy computation
- **Security**: Proper authentication and parameterized queries
- **Collaboration**: Multiple analysts accessing the same cloud data source

### When to Use SQL vs Pandas:

**Use SQL for:**
- Data extraction from large datasets
- Complex joins across multiple tables
- Window functions and analytical queries
- Business logic implementation with CASE statements
- Database-level performance optimization

**Use Pandas for:**
- Statistical analysis and modeling
- Data cleaning and transformation
- Visualization preparation
- Machine learning feature engineering
- Iterative data exploration
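
The trade-off is easier to see with the same aggregation written both ways. The sketch below assumes an in-memory SQLite table with made-up rows standing in for the Olist data; the column names `customer_state` and `price` are illustrative, not taken from the real schema.

```python
import sqlite3

import pandas as pd

# In-memory SQLite with made-up rows stands in for the Olist tables (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_state TEXT, price REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("SP", 10.0), ("SP", 30.0), ("RJ", 50.0)],
)

# SQL approach: the database aggregates and returns one row per state
sql_result = pd.read_sql_query(
    """
    SELECT customer_state, SUM(price) AS revenue
    FROM orders
    GROUP BY customer_state
    ORDER BY customer_state
    """,
    conn,
)

# Pandas approach: pull the raw rows, then aggregate in memory
raw = pd.read_sql_query("SELECT customer_state, price FROM orders", conn)
pandas_result = (
    raw.groupby("customer_state", as_index=False)["price"]
    .sum()
    .rename(columns={"price": "revenue"})
    .sort_values("customer_state")
    .reset_index(drop=True)
)
conn.close()

print(sql_result)
print(pandas_result)
```

Both approaches produce the same totals. The difference is where the work happens: the SQL version transfers only the aggregated rows, which matters when the underlying table is large, while the pandas version keeps the raw rows available for further exploration.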

### Next Session Preview:
In upcoming sessions, we'll explore:
- Advanced SQL patterns for business intelligence
- Real-time data pipeline automation
- Combining SQL analytics with interactive visualizations
- Building automated reporting systems

**🎉 You now have the fundamental skills to connect Python to cloud databases and perform enterprise-level data analysis!**

## 8. Practice Exercise

**Your Challenge! 🚀**

**Business Scenario**: The Olist analytics team wants to understand the relationship between their marketing efforts and customer behavior. Your task is to create an analysis that bridges the marketing and sales datasets.

**Your Task**: Combine both datasets in a single analysis that answers the business questions listed below.

**Requirements**:
1. **Data Exploration**: Explore both `olist_sales_data_set` and `olist_marketing_data_set`
2. **Schema Analysis**: Document the structure and relationships between datasets
3. **Business Intelligence**: Create queries that provide actionable insights
4. **Error Handling**: Implement proper error handling for your queries
5. **Best Practices**: Use parameterized queries and validation

**Specific Questions to Answer**:
- What is the structure of each dataset?
- How can these datasets be connected?
- What insights can we derive about customer acquisition and behavior?
- Which marketing channels or strategies show the most promise?

**Deliverable**: A comprehensive analysis with SQL queries, data validation, and business insights.
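
As a starting hint for the question of how the datasets can be connected, one simple approach is to compare column names across two tables. The sketch below uses a hypothetical helper, `find_join_candidates`, and made-up column lists; in practice the lists could come from SQLAlchemy's `inspect(engine).get_columns(table_name)`, and the real schemas may differ from what is shown here.

```python
def find_join_candidates(columns_a, columns_b):
    """Return column names present in both tables, sorted for stable output."""
    return sorted(set(columns_a) & set(columns_b))


# Made-up column lists for illustration; the real schemas may differ
sales_columns = ["order_id", "seller_id", "customer_id", "price"]
marketing_columns = ["mql_id", "seller_id", "origin", "won_date"]

candidates = find_join_candidates(sales_columns, marketing_columns)
print(candidates)
```

A shared column name is only a candidate key, not proof of a relationship. Before joining, check that the values actually overlap between the two tables.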

In [ ]:
# Your practice exercise solution here

def comprehensive_business_analysis():
    """
    Your challenge: Create a comprehensive analysis bridging marketing and sales data.
    
    Business Goal: Understand the relationship between marketing efforts and customer behavior.
    
    Implementation steps:
    1. Explore both datasets thoroughly
    2. Identify connection points between datasets
    3. Create business intelligence queries
    4. Validate results and handle errors
    5. Generate actionable insights
    """
    
    print("🎯 Comprehensive Business Analysis Challenge")
    print("📊 Goal: Bridge marketing and sales data for business insights")
    print("\n" + "="*60)
    
    # Step 1: Dataset Exploration
    print("\n📋 Step 1: Dataset Structure Analysis")
    
    # TODO: Explore olist_sales_data_set structure
    # Think about: What columns are available? What do they represent?
    
    # TODO: Explore olist_marketing_data_set structure  
    # Think about: How does this relate to sales data?
    
    # Step 2: Connection Analysis
    print("\n🔗 Step 2: Identify Dataset Relationships")
    
    # TODO: Find common columns or keys between datasets
    # Think about: How can we join these datasets?
    
    # Step 3: Business Intelligence Queries
    print("\n💼 Step 3: Business Intelligence Analysis")
    
    # TODO: Create queries that answer business questions:
    # - Which marketing channels are most effective?
    # - What's the customer journey from lead to purchase?
    # - How do marketing efforts correlate with sales performance?
    
    # Step 4: Advanced Analytics
    print("\n📈 Step 4: Advanced Business Insights")
    
    # TODO: Use advanced SQL features:
    # - Window functions for trend analysis
    # - CTEs for complex business logic
    # - Statistical functions for performance metrics
    
    # Step 5: Validation and Error Handling
    print("\n✅ Step 5: Data Validation and Quality Checks")
    
    # TODO: Implement proper error handling and data validation
    
    # Step 6: Business Recommendations
    print("\n🎯 Step 6: Strategic Business Recommendations")
    
    # TODO: Synthesize findings into actionable business insights
    
    return None

print("💡 Hints for Your Analysis:")
print("  • Start by examining the schema of both datasets")
print("  • Look for common identifiers (seller_id, customer_id, etc.)")
print("  • Use SQL JOINs to combine datasets where appropriate")
print("  • Focus on metrics that matter to business decision-makers")
print("  • Always validate your results and handle potential errors")

print("\n🔍 Analysis Framework:")
print("  1. Data Discovery: Understand what data is available")
print("  2. Relationship Mapping: How datasets connect")
print("  3. Business Metrics: What KPIs can we calculate?")
print("  4. Trend Analysis: How do metrics change over time?")
print("  5. Insights Generation: What actions should the business take?")

print("\n📊 Expected Deliverables:")
print("  • Dataset structure documentation")
print("  • Relationship mapping between datasets")
print("  • Business intelligence SQL queries")
print("  • Data quality assessment")
print("  • Strategic recommendations based on findings")

# Uncomment to run your solution:
# comprehensive_analysis_results = comprehensive_business_analysis()

# Remember to clean up database connection when done
# db.close()

In [ ]:
# Clean up database connection
print("🔒 Closing database connection...")
db.close()
print("✅ Session complete!")