# Week 7: EDA Techniques - Part 1: Structured EDA Approach and Framework

## Learning Objectives
By the end of this session, you will be able to:
- Understand the systematic approach to Exploratory Data Analysis
- Apply structured EDA frameworks to real business datasets
- Create reproducible EDA workflows
- Connect to live databases for real-time analysis

## Business Context
Today we're working with **live Olist Brazilian e-commerce data** from our Supabase database. This represents real customer transactions, product catalogs, and business operations from one of Brazil's largest marketplaces.

**Key Business Questions:**
- What patterns exist in our customer behavior?
- How do different product categories perform?
- What geographic trends can we identify?
- How can we structure our analysis for maximum business impact?

## 1. Environment Setup and Data Connection

In [None]:
# Standard imports for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Database connection
import psycopg2
from sqlalchemy import create_engine

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Plotting style
plt.style.use('default')
sns.set_palette("husl")

print("✅ Environment setup complete!")

In [None]:
# Supabase connection details
DATABASE_URL = "postgresql://postgres.pzykoxdiwsyclwfqfiii:L3tMeQuery123!@aws-0-us-east-1.pooler.supabase.com:6543/postgres"

# Create database engine
engine = create_engine(DATABASE_URL)

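# Test connection
# SQLAlchemy 2.x no longer accepts raw SQL strings in execute(); wrap them in text()
from sqlalchemy import text

try:
    with engine.connect() as conn:
        result = conn.execute(text("SELECT 1 AS test"))
        print("✅ Database connection successful!")
except Exception as e:
    print(f"❌ Connection failed: {e}")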
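Connection strings contain a password, so in shared notebooks it is common to keep them out of the code. The sketch below shows one possible pattern, assuming the URL is stored in an environment variable named `DATABASE_URL`. The helper names `get_database_url` and `redact_password` and the demo credentials are hypothetical, not part of the course setup.

```python
import os
from urllib.parse import urlsplit


def get_database_url(env_var="DATABASE_URL"):
    """Return the connection string stored in an environment variable."""
    url = os.environ.get(env_var)
    if not url:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return url


def redact_password(url):
    """Mask the password so the URL is safe to print or log."""
    password = urlsplit(url).password
    if password is None:
        return url
    return url.replace(f":{password}@", ":***@", 1)


# Demo with a made-up URL (hypothetical credentials, not the course database)
os.environ["DATABASE_URL"] = "postgresql://demo_user:demo_pass@localhost:5432/postgres"
print(redact_password(get_database_url()))
# prints postgresql://demo_user:***@localhost:5432/postgres
```

The value returned by `get_database_url()` can be passed to `create_engine` in the same way as a hardcoded string.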

## 2. The Structured EDA Framework

### The 5-Step EDA Process

1. **Data Discovery** - What data do we have?
2. **Data Quality Assessment** - How clean is our data?
3. **Initial Exploration** - What patterns are immediately visible?
4. **Deep Dive Analysis** - What insights can we extract?
5. **Business Insights** - What actions can we recommend?

### Why This Matters for Business
- **Systematic approach** ensures nothing is missed
- **Reproducible process** can be applied to any dataset
- **Business-focused** analysis drives actionable insights
- **Time-efficient** method prevents analysis paralysis
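One way to make the 5-step process reproducible is to express each step as a function and run them in a fixed order. The sketch below is a minimal illustration on a toy DataFrame. The function names and the `EDA_STEPS` list are assumptions made for this example, and only the first three steps are covered.

```python
import pandas as pd


def discover(df):
    """Step 1: what data do we have?"""
    return {"rows": len(df), "columns": list(df.columns)}


def assess_quality(df):
    """Step 2: how clean is it? Here: share of missing values per column."""
    return df.isnull().mean().round(3).to_dict()


def explore(df):
    """Step 3: first patterns. Here: the most frequent value per column."""
    result = {}
    for col in df.columns:
        modes = df[col].mode()  # mode() ignores missing values by default
        result[col] = modes.iloc[0] if not modes.empty else None
    return result


# Steps 4 and 5 (deep dive, business insights) are left out of this sketch.
EDA_STEPS = [
    ("discovery", discover),
    ("quality", assess_quality),
    ("exploration", explore),
]


def run_eda(df, steps=EDA_STEPS):
    """Run every step in order and collect the results by step name."""
    return {name: step(df) for name, step in steps}


# Toy data for illustration only
toy = pd.DataFrame({"order_status": ["delivered", "delivered", "shipped", None]})
report = run_eda(toy)
print(report["discovery"])
print(report["quality"])
print(report["exploration"])
```

Because the steps live in one list, the same sequence can be rerun unchanged on any other DataFrame.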

## Step 1: Data Discovery

Let's start by understanding what data we have available in our Olist database.

In [None]:
# Query to see all available tables
tables_query = """
SELECT 
    schemaname,
    tablename,
    tableowner
FROM pg_tables 
WHERE schemaname IN ('olist_sales_data_set', 'olist_marketing_data_set')
ORDER BY schemaname, tablename;
"""

available_tables = pd.read_sql(tables_query, engine)
print("📊 Available Tables in Our Database:")
print("=" * 50)
display(available_tables)

In [None]:
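# Load our main datasets for analysis
# Note: LIMIT without ORDER BY returns an arbitrary subset, and each table is
# sampled independently, so not every sampled order has its customer loaded.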
print("🔄 Loading main datasets...")

# Orders data - our primary transaction data
orders_query = "SELECT * FROM olist_sales_data_set.olist_orders_dataset LIMIT 5000"
orders_df = pd.read_sql(orders_query, engine)

# Customers data
customers_query = "SELECT * FROM olist_sales_data_set.olist_customers_dataset LIMIT 5000"
customers_df = pd.read_sql(customers_query, engine)

# Products data
products_query = "SELECT * FROM olist_sales_data_set.olist_products_dataset"
products_df = pd.read_sql(products_query, engine)

# Order items data
order_items_query = "SELECT * FROM olist_sales_data_set.olist_order_items_dataset LIMIT 10000"
order_items_df = pd.read_sql(order_items_query, engine)

print("✅ Data loaded successfully!")
print(f"📈 Orders: {len(orders_df):,} records")
print(f"👥 Customers: {len(customers_df):,} records")
print(f"📦 Products: {len(products_df):,} records")
print(f"🛒 Order Items: {len(order_items_df):,} records")
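When tables are sampled with `LIMIT`, two details affect reproducibility: the row order, and whether related tables are sampled consistently. The sketch below illustrates both on a tiny in-memory SQLite database. The toy tables and values are made up, and only the column names `order_id`, `customer_id`, and `customer_state` mirror the Olist schema. `ORDER BY` makes the sample deterministic, and the subquery keeps only the customers that belong to the sampled orders.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the real database (illustrative rows only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, customer_id TEXT);
CREATE TABLE customers (customer_id TEXT, customer_state TEXT);
INSERT INTO orders VALUES ('o1', 'c3'), ('o2', 'c1'), ('o3', 'c4'), ('o4', 'c2');
INSERT INTO customers VALUES ('c1', 'SP'), ('c2', 'RJ'), ('c3', 'MG'), ('c4', 'SP');
""")

# Deterministic sample: the same two orders are returned on every run
orders_sample = pd.read_sql(
    "SELECT * FROM orders ORDER BY order_id LIMIT 2", conn
)

# Matching sample: only customers referenced by the sampled orders
customers_sample = pd.read_sql(
    """
    SELECT * FROM customers
    WHERE customer_id IN (
        SELECT customer_id FROM orders ORDER BY order_id LIMIT 2
    )
    ORDER BY customer_id
    """,
    conn,
)

print(orders_sample)
print(customers_sample)
```

The same query shape should carry over to PostgreSQL, although on large tables a join may perform better than an `IN` subquery.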

## Step 2: Data Quality Assessment

Before diving into analysis, we need to understand the quality and structure of our data.

In [None]:
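def data_overview(df, name):
    """
    Print shape, memory usage, dtypes, missing values, and a 3-row sample of df.

    Returns a DataFrame listing only the columns that contain missing values,
    sorted by missing percentage (descending).
    """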
    print(f"\n📊 {name} Dataset Overview")
    print("=" * 50)
    
    # Basic info
    print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    
    # Data types
    print("\n📋 Data Types:")
    dtype_counts = df.dtypes.value_counts()
    for dtype, count in dtype_counts.items():
        print(f"  {dtype}: {count} columns")
    
    # Missing values
    print("\n❓ Missing Values:")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_df = pd.DataFrame({
        'Column': missing.index,
        'Missing Count': missing.values,
        'Missing %': missing_pct.values
    })
    missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing %', ascending=False)
    
    if len(missing_df) > 0:
        display(missing_df.head(10))
    else:
        print("  ✅ No missing values found!")
    
    # Sample data
    print("\n👀 Sample Data:")
    display(df.head(3))
    
    return missing_df

In [None]:
# Analyze orders dataset
orders_missing = data_overview(orders_df, "Orders")

In [None]:
# Analyze customers dataset
customers_missing = data_overview(customers_df, "Customers")

In [None]:
# Analyze products dataset
products_missing = data_overview(products_df, "Products")
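Missing-value counts do not reveal values that are present but malformed. A possible complementary check for timestamp columns is sketched below: it counts entries that are not null yet fail to parse as dates. The helper name `check_datetime_columns` and the toy data are assumptions for illustration.

```python
import pandas as pd


def check_datetime_columns(df, columns):
    """Count values that are present but cannot be parsed as datetimes."""
    rows = []
    for col in columns:
        # errors="coerce" turns unparseable entries into NaT instead of raising
        parsed = pd.to_datetime(df[col], errors="coerce")
        unparseable = int((parsed.isna() & df[col].notna()).sum())
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "unparseable": unparseable,
        })
    return pd.DataFrame(rows)


# Toy data: one valid timestamp, one malformed value, one genuinely missing value
toy_orders = pd.DataFrame({
    "order_purchase_timestamp": ["2018-01-05 10:30:00", "not a date", None],
})
result = check_datetime_columns(toy_orders, ["order_purchase_timestamp"])
print(result)
```

Only the malformed string is counted as unparseable. The `None` entry is an ordinary missing value and is already covered by the missing-value report.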

## Step 3: Initial Exploration

Now let's start exploring patterns in our data using systematic approaches.

In [None]:
# Order Status Analysis - Understanding our business pipeline
print("📊 Order Status Distribution")
print("=" * 30)

status_counts = orders_df['order_status'].value_counts()
status_pct = (status_counts / len(orders_df)) * 100

status_summary = pd.DataFrame({
    'Count': status_counts,
    'Percentage': status_pct.round(2)
})

display(status_summary)

# Visualize order status
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
status_counts.plot(kind='bar', color='skyblue')
plt.title('Order Status Distribution (Count)')
plt.xlabel('Order Status')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
plt.pie(status_counts.values, labels=status_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Order Status Distribution (%)')

plt.tight_layout()
plt.show()

# Business insight
delivered_pct = status_pct.get('delivered', 0)
print(f"\n💡 Business Insight: {delivered_pct:.1f}% of orders are successfully delivered")
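When one status dominates, the remaining pie slices can become too thin to read. One option is to merge small categories into a single bucket before plotting. This is a sketch with made-up counts. The helper `lump_rare` and the 2% threshold are assumptions for illustration, not part of the course code.

```python
import pandas as pd


def lump_rare(counts, min_share=0.02, other_label="other"):
    """Merge categories whose share is below min_share into a single bucket."""
    shares = counts / counts.sum()
    keep = counts[shares >= min_share]
    rare_total = counts[shares < min_share].sum()
    if rare_total > 0:
        keep = pd.concat([keep, pd.Series({other_label: rare_total})])
    return keep


# Made-up counts for illustration only
toy_counts = pd.Series(
    {"delivered": 970, "shipped": 15, "canceled": 10, "unavailable": 5}
)
lumped = lump_rare(toy_counts)
print(lumped)
```

The result can be passed to `plt.pie` in place of the raw counts, while the full breakdown stays available in the summary table.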

In [None]:
# Temporal Analysis - Understanding order patterns over time
print("📅 Temporal Analysis of Orders")
print("=" * 30)

# Convert timestamp to datetime
orders_df['order_purchase_timestamp'] = pd.to_datetime(orders_df['order_purchase_timestamp'])

# Extract temporal features
orders_df['order_year'] = orders_df['order_purchase_timestamp'].dt.year
orders_df['order_month'] = orders_df['order_purchase_timestamp'].dt.month
orders_df['order_day_of_week'] = orders_df['order_purchase_timestamp'].dt.dayofweek
orders_df['order_hour'] = orders_df['order_purchase_timestamp'].dt.hour

# Orders by year
yearly_orders = orders_df['order_year'].value_counts().sort_index()
print("\n📈 Orders by Year:")
for year, count in yearly_orders.items():
    print(f"  {year}: {count:,} orders")

# Visualize temporal patterns
plt.figure(figsize=(15, 10))

# Orders by month
plt.subplot(2, 2, 1)
monthly_orders = orders_df['order_month'].value_counts().sort_index()
monthly_orders.plot(kind='bar', color='lightcoral')
plt.title('Orders by Month')
plt.xlabel('Month')
plt.ylabel('Number of Orders')
plt.xticks(rotation=0)

# Orders by day of week
plt.subplot(2, 2, 2)
dow_orders = orders_df['order_day_of_week'].value_counts().sort_index()
dow_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
dow_orders.index = [dow_labels[i] for i in dow_orders.index]
dow_orders.plot(kind='bar', color='lightgreen')
plt.title('Orders by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Number of Orders')
plt.xticks(rotation=45)

# Orders by hour
plt.subplot(2, 2, 3)
hourly_orders = orders_df['order_hour'].value_counts().sort_index()
hourly_orders.plot(kind='line', marker='o', color='gold')
plt.title('Orders by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Number of Orders')
plt.grid(True, alpha=0.3)

# Time series of orders
plt.subplot(2, 2, 4)
orders_df.set_index('order_purchase_timestamp').resample('D').size().plot(color='purple', alpha=0.7)
plt.title('Daily Orders Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Orders')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# Business insights
peak_hour = hourly_orders.idxmax()
peak_day = dow_orders.idxmax()  # index was relabeled to day names above, so idxmax() already returns e.g. 'Mon'
print(f"\n💡 Business Insights:")
print(f"   • Peak ordering hour: {peak_hour}:00")
print(f"   • Peak ordering day: {peak_day}")
print(f"   • Most active month: {monthly_orders.idxmax()}")
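Counting by calendar month pools the same month from different years, which can hide or exaggerate seasonality when the data covers partial years. A small sketch on toy dates shows the difference between `dt.month` and year-month periods from `dt.to_period("M")`. The dates are made up for illustration.

```python
import pandas as pd

# Toy timestamps spanning two years (illustrative only)
toy = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime([
        "2017-11-24", "2017-11-25", "2017-12-01", "2018-11-23",
    ])
})

# Calendar month only: November 2017 and November 2018 are merged
by_month = toy["order_purchase_timestamp"].dt.month.value_counts().sort_index()

# Year-month periods keep the two Novembers apart
by_period = (
    toy["order_purchase_timestamp"].dt.to_period("M").value_counts().sort_index()
)

print(by_month)
print(by_period)
```

Here `by_month` reports three orders for month 11, while `by_period` splits them into two for 2017-11 and one for 2018-11.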

### Geographic Analysis - Customer Distribution

In [None]:
# Geographic distribution of customers
print("🗺️ Geographic Analysis")
print("=" * 25)

# Top states by customer count
state_counts = customers_df['customer_state'].value_counts().head(10)
print("\n🏆 Top 10 States by Customer Count:")
for state, count in state_counts.items():
    pct = (count / len(customers_df)) * 100
    print(f"   {state}: {count:,} customers ({pct:.1f}%)")

# Top cities by customer count
city_counts = customers_df['customer_city'].value_counts().head(10)
print("\n🏙️ Top 10 Cities by Customer Count:")
for city, count in city_counts.items():
    pct = (count / len(customers_df)) * 100
    print(f"   {city}: {count:,} customers ({pct:.1f}%)")

# Visualize geographic distribution
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
state_counts.plot(kind='bar', color='lightblue')
plt.title('Top 10 States by Customer Count')
plt.xlabel('State')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
city_counts.plot(kind='bar', color='orange')
plt.title('Top 10 Cities by Customer Count')
plt.xlabel('City')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# Geographic concentration analysis
total_customers = len(customers_df)
top_5_states_pct = (state_counts.head(5).sum() / total_customers) * 100
top_5_cities_pct = (city_counts.head(5).sum() / total_customers) * 100

print(f"\n💡 Geographic Concentration:")
print(f"   • Top 5 states account for {top_5_states_pct:.1f}% of customers")
print(f"   • Top 5 cities account for {top_5_cities_pct:.1f}% of customers")
print(f"   • Total unique states: {customers_df['customer_state'].nunique()}")
print(f"   • Total unique cities: {customers_df['customer_city'].nunique()}")
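A complementary way to describe concentration is to ask how many categories are needed to cover a given share of customers. The sketch below uses cumulative shares on toy data. The helper name `categories_to_cover` and the state counts are assumptions for illustration.

```python
import pandas as pd


def categories_to_cover(values, target=0.80):
    """Return how many of the largest categories are needed to reach the target share."""
    shares = values.value_counts(normalize=True)  # sorted largest first
    cumulative = shares.cumsum()
    # Categories whose running share is still below the target, plus the one that crosses it
    needed = int((cumulative < target).sum()) + 1
    # Guard against floating-point drift when the target is at or near 1.0
    return min(needed, len(shares))


# Toy data: 10 customers across 4 states (illustrative only)
toy_states = pd.Series(["SP"] * 5 + ["RJ"] * 3 + ["MG"] * 1 + ["BA"] * 1)
print(categories_to_cover(toy_states, target=0.75))
# prints 2
```

On the real data, the same function could be applied to `customers_df['customer_state']` or `customers_df['customer_city']`.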

## EDA Framework Summary

### What We've Accomplished

1. **✅ Data Discovery**: Listed the available tables and loaded samples of orders, customers, products, and order items
2. **✅ Quality Assessment**: Reviewed shape, memory usage, data types, and missing values for each dataset
3. **✅ Initial Exploration**: Uncovered key patterns in:
   - Order processing pipeline
   - Temporal ordering patterns
   - Geographic customer distribution

### Key Business Insights So Far

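**Order Processing:**
- Most sampled orders are successfully delivered (see the delivered percentage printed in the order status cell)
- The remaining `order_status` values show where orders sit between purchase and delivery

**Customer Behavior:**
- Peak ordering hour, weekday, and month identified from the sampled orders
- Strong geographic concentration, quantified by the share of customers in the top 5 states and cities
- Monthly counts hint at seasonal patterns, but they pool all years and come from a 5,000-row sample, so treat them as provisional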

### Next Steps
In the next parts, we'll dive deeper into:
- Descriptive statistics and summary insights
- Distribution analysis and correlation exploration
- Advanced customer segmentation techniques

## 🎯 Practice Exercises

Try these exercises to reinforce your understanding:

1. **Data Quality Check**: Create a function that checks for duplicate records in the orders dataset

2. **Temporal Analysis**: Analyze the time difference between order purchase and delivery dates

3. **Geographic Insights**: Find the average number of customers per city for each state

4. **Data Validation**: Check if all customer_ids in orders exist in the customers table

In [None]:
# Exercise space - try the exercises above!

# Exercise 1: Check for duplicates
def check_duplicates(df, column_name):
    """
    Check for duplicate records in a specific column
    """
    # Your code here
    pass

# Exercise 2: Delivery time analysis
# Your code here

# Exercise 3: Geographic analysis
# Your code here

# Exercise 4: Data validation
# Your code here