# Introduction to the Olist Dataset - Part 2: Database Schema and Relationships

## Week 4, Day 2 (Thursday) - May 1st, 2025

### Overview
In Part 1, we explored the business context and overall structure of the Olist dataset. Now we'll dive deep into the technical details: examining each table's schema, understanding data types, and mapping the relationships that make complex analysis possible.

### Learning Objectives
By the end of this session, you will be able to:
- Identify and explain the purpose of each field in all 9 Olist tables
- Understand the data types and constraints for each column
- Map primary and foreign key relationships between tables
- Design multi-table queries based on these relationships
- Anticipate data quality issues based on field definitions
- Create an entity relationship diagram for the dataset

### Prerequisites
- Part 1: Olist Dataset Overview (previous session)
- SQL database concepts (primary keys, foreign keys, relationships)
- Pandas DataFrame fundamentals
- Data merging and joining concepts

In [None]:
# Setup for the session
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 10)

print("🗄️ Database Schema Analysis Setup Complete")
print("Ready to examine the Olist database structure in detail!")

## 1. Database Schema Overview

The Olist dataset is distributed as CSV files, but it follows **relational database principles** with:
- **Normalized structure** to minimize redundancy
- **Primary keys** for unique record identification
- **Foreign keys** to establish relationships
- **Referential integrity** between related tables (implied by the data rather than enforced by the files, so it is worth verifying)

### Table Categories

#### 🏗️ **Core Transaction Tables**
- **Orders**: Main transaction records
- **Order Items**: Product details within orders
- **Payments**: Payment method and installment information

#### 👥 **Entity Tables**
- **Customers**: Customer information and location
- **Sellers**: Seller information and location
- **Products**: Product characteristics and categorization

#### 📊 **Supplementary Tables**
- **Reviews**: Customer feedback and ratings
- **Geolocation**: Coordinate and location data
- **Category Translation**: Portuguese-English mapping
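
The nine tables correspond to nine CSV files. A minimal sketch for loading them into a dictionary of DataFrames is shown here, assuming the files sit in a local `data/` directory. The directory name, the short table labels, and the `load_olist` helper are illustrative choices, not part of the dataset.

```python
from pathlib import Path
import pandas as pd

# File names as used in this lesson; the short keys are our own labels.
OLIST_FILES = {
    'orders': 'olist_orders_dataset.csv',
    'order_items': 'olist_order_items_dataset.csv',
    'payments': 'olist_order_payments_dataset.csv',
    'customers': 'olist_customers_dataset.csv',
    'sellers': 'olist_sellers_dataset.csv',
    'products': 'olist_products_dataset.csv',
    'reviews': 'olist_order_reviews_dataset.csv',
    'geolocation': 'olist_geolocation_dataset.csv',
    'categories': 'product_category_name_translation.csv',
}

def load_olist(data_dir='data'):
    """Read every CSV found under data_dir into a dict of DataFrames."""
    tables = {}
    for name, filename in OLIST_FILES.items():
        path = Path(data_dir) / filename
        if path.exists():
            tables[name] = pd.read_csv(path)
        else:
            print(f"Missing file, skipped: {path}")
    return tables

# tables = load_olist('data')  # uncomment once the CSVs are downloaded
```

Keeping the tables in one dictionary makes it easy to loop over them later, for example to print each table's shape or null counts.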

## 2. Core Transaction Tables

### 🛒 Table 1: Orders (`olist_orders_dataset.csv`)

**Purpose**: Central table containing all order information and status tracking.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `order_id` | STRING | Unique order identifier | PRIMARY KEY, NOT NULL |
| `customer_id` | STRING | Customer identifier | FOREIGN KEY → customers.customer_id |
| `order_status` | STRING | Current order status | NOT NULL, Limited values |
| `order_purchase_timestamp` | DATETIME | When order was placed | NOT NULL |
| `order_approved_at` | DATETIME | When payment was approved | CAN BE NULL |
| `order_delivered_carrier_date` | DATETIME | When shipped to carrier | CAN BE NULL |
| `order_delivered_customer_date` | DATETIME | When delivered to customer | CAN BE NULL |
| `order_estimated_delivery_date` | DATETIME | Estimated delivery date given at purchase | Expected NOT NULL (verify on load) |

#### Order Status Values
- `'delivered'` - Successfully completed orders
- `'shipped'` - In transit to customer
- `'processing'` - Being prepared for shipment
- `'invoiced'` - Payment confirmed, preparing order
- `'canceled'` - Order canceled
- `'unavailable'` - Product unavailable
- `'approved'` - Payment approved, starting fulfillment
- `'created'` - Order created but not yet processed

#### Business Logic
- **Order Lifecycle**: created → approved → invoiced → processing → shipped → delivered
- **Timestamps**: Enable calculation of processing times and delivery performance
- **Status Tracking**: Allows analysis of order fulfillment efficiency
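
The status values and the nullable timestamps interact: a missing delivery date is normal for a canceled order but suspicious for a delivered one. The sketch below shows one way to check this. The rows are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical rows for illustration only, not taken from the real dataset
orders = pd.DataFrame({
    'order_id': ['a1', 'a2', 'a3', 'a4'],
    'order_status': ['delivered', 'shipped', 'canceled', 'delivered'],
    'order_purchase_timestamp': ['2018-01-01 10:00:00', '2018-01-02 09:30:00',
                                 '2018-01-03 14:15:00', '2018-01-04 08:00:00'],
    'order_delivered_customer_date': ['2018-01-08 16:00:00', None, None, None],
})
for col in ['order_purchase_timestamp', 'order_delivered_customer_date']:
    orders[col] = pd.to_datetime(orders[col])

# Statuses listed in the schema above
VALID_STATUSES = {'created', 'approved', 'invoiced', 'processing',
                  'shipped', 'delivered', 'canceled', 'unavailable'}
unexpected = set(orders['order_status']) - VALID_STATUSES
print(f"Unexpected statuses: {unexpected or 'none'}")

# Missing delivery dates are expected for orders that were not delivered
missing_by_status = (orders['order_delivered_customer_date'].isna()
                     .groupby(orders['order_status']).sum())
print(missing_by_status)

# A 'delivered' order without a delivery date deserves a closer look
suspicious = orders[(orders['order_status'] == 'delivered')
                    & orders['order_delivered_customer_date'].isna()]
print(f"Delivered orders missing a delivery date: {len(suspicious)}")
```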

In [None]:
# Create a sample orders table structure for demonstration
sample_orders = pd.DataFrame({
    'order_id': ['e481f51cbdc54678b7cc49136f2d6af7', '53cdb2fc8bc7dce0b6741e2150273451', 
                 '47770eb9100c2d0c44946d9cf07ec65d'],
    'customer_id': ['9ef432eb6251297304e76186b10a928d', 'b0830fb4747a6c6d20dea0b8c802d7ef',
                    '41ce2a54c0b03bf3443c3d931a367089'],
    'order_status': ['delivered', 'delivered', 'delivered'],
    'order_purchase_timestamp': ['2017-10-02 10:56:33', '2018-07-24 20:41:37', '2018-08-08 08:38:49'],
    'order_approved_at': ['2017-10-02 11:07:15', '2018-07-26 03:24:27', '2018-08-08 08:55:23'],
    'order_delivered_carrier_date': ['2017-10-04 19:55:00', '2018-07-26 14:31:00', '2018-08-08 13:50:00'],
    'order_delivered_customer_date': ['2017-10-10 21:25:13', '2018-08-07 15:27:45', '2018-08-17 18:03:12'],
    'order_estimated_delivery_date': ['2017-10-18 00:00:00', '2018-08-13 00:00:00', '2018-09-04 00:00:00']
})

# Convert to datetime
datetime_cols = ['order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date',
                'order_delivered_customer_date', 'order_estimated_delivery_date']

for col in datetime_cols:
    sample_orders[col] = pd.to_datetime(sample_orders[col])

print("📋 Sample Orders Table Structure:")
print(sample_orders.info())
print("\n📊 Sample Data:")
print(sample_orders.head())

# Calculate some derived metrics
sample_orders['approval_time_hours'] = (
    sample_orders['order_approved_at'] - sample_orders['order_purchase_timestamp']
).dt.total_seconds() / 3600

sample_orders['delivery_time_days'] = (
    sample_orders['order_delivered_customer_date'] - sample_orders['order_purchase_timestamp']
).dt.days

print("\n⏱️ Derived Metrics:")
print(f"Average approval time: {sample_orders['approval_time_hours'].mean():.1f} hours")
print(f"Average delivery time: {sample_orders['delivery_time_days'].mean():.1f} days")

### 📦 Table 2: Order Items (`olist_order_items_dataset.csv`)

**Purpose**: Details of individual products within each order, including pricing and seller information.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `order_id` | STRING | Order identifier | FOREIGN KEY → orders.order_id |
| `order_item_id` | INTEGER | Item sequence within order | NOT NULL, starts at 1 |
| `product_id` | STRING | Product identifier | FOREIGN KEY → products.product_id |
| `seller_id` | STRING | Seller identifier | FOREIGN KEY → sellers.seller_id |
| `shipping_limit_date` | DATETIME | Latest shipping date promised | NOT NULL |
| `price` | DECIMAL | Item price (before shipping) | NOT NULL, > 0 |
| `freight_value` | DECIMAL | Shipping cost for this item | NOT NULL, >= 0 |

#### Key Relationships
- **Composite Primary Key**: `order_id` + `order_item_id`
- **One-to-Many**: One order can have multiple items
- **Many-to-One**: Multiple items can reference same product/seller

#### Business Logic
- **Item Sequencing**: `order_item_id` numbers the items within an order sequentially (1, 2, 3, ...)
- **Pricing**: Separate tracking of product price vs. shipping costs
- **Seller Attribution**: Each item tracks which seller fulfilled it
- **Shipping Promises**: `shipping_limit_date` for customer expectations
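
Two checks follow directly from the composite key: the pair (`order_id`, `order_item_id`) should never repeat, and line items can be rolled up to one row per order. A minimal sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical line items for illustration only
order_items = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2', 'o3'],
    'order_item_id': [1, 2, 1, 1],
    'price': [50.0, 20.0, 199.0, 129.9],
    'freight_value': [10.0, 5.0, 17.87, 14.78],
})

# The composite key should never repeat
key_cols = ['order_id', 'order_item_id']
n_dupes = order_items.duplicated(subset=key_cols).sum()
print(f"Duplicate (order_id, order_item_id) pairs: {n_dupes}")

# Roll line items up to one row per order
order_totals = order_items.groupby('order_id').agg(
    n_items=('order_item_id', 'count'),
    items_value=('price', 'sum'),
    freight_value=('freight_value', 'sum'),
)
order_totals['order_value'] = order_totals['items_value'] + order_totals['freight_value']
print(order_totals)
```

Aggregating to order level first is what makes it safe to join item totals to the orders table without duplicating order rows.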

In [None]:
# Create sample order items table
sample_order_items = pd.DataFrame({
    'order_id': ['e481f51cbdc54678b7cc49136f2d6af7', 'e481f51cbdc54678b7cc49136f2d6af7',
                 '53cdb2fc8bc7dce0b6741e2150273451', '47770eb9100c2d0c44946d9cf07ec65d'],
    'order_item_id': [1, 2, 1, 1],
    'product_id': ['4244733e06e7ecb4970a6e2683c13e61', 'e5f2d52b802189ee658865ca93d83a8f',
                   'c777355d18b72b67abbeef9df44fd0fd', '7634da152a4610f1595efa32f14722fc'],
    'seller_id': ['48436dade18ac8b2bce089ec2a041202', '48436dade18ac8b2bce089ec2a041202',
                  'dd7ddc04e1b6c2c614352b383efe2d36', '1f50f920176fa81dab994f9023523100'],
    'shipping_limit_date': ['2017-10-09 10:56:33', '2017-10-09 10:56:33',
                           '2018-07-31 20:41:37', '2018-08-15 08:38:49'],
    'price': [58.90, 239.90, 199.00, 129.90],
    'freight_value': [13.29, 19.93, 17.87, 14.78]
})

sample_order_items['shipping_limit_date'] = pd.to_datetime(sample_order_items['shipping_limit_date'])

print("📦 Sample Order Items Table:")
print(sample_order_items)

# Demonstrate relationship analysis
print("\n🔍 Relationship Analysis:")
print(f"Total items: {len(sample_order_items)}")
print(f"Unique orders: {sample_order_items['order_id'].nunique()}")
print(f"Items per order: {len(sample_order_items) / sample_order_items['order_id'].nunique():.1f} average")

# Show multi-item order
multi_item_order = sample_order_items[sample_order_items['order_id'] == 'e481f51cbdc54678b7cc49136f2d6af7']
print("\n🛒 Multi-item order example:")
print(multi_item_order[['order_item_id', 'price', 'freight_value']])
print(f"Total order value: R${multi_item_order['price'].sum():.2f} + R${multi_item_order['freight_value'].sum():.2f} shipping")

### 💳 Table 3: Payments (`olist_order_payments_dataset.csv`)

**Purpose**: Payment method details and installment information for orders.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `order_id` | STRING | Order identifier | FOREIGN KEY → orders.order_id |
| `payment_sequential` | INTEGER | Payment sequence number | NOT NULL, starts at 1 |
| `payment_type` | STRING | Payment method used | NOT NULL, Limited values |
| `payment_installments` | INTEGER | Number of installments | NOT NULL, >= 1 |
| `payment_value` | DECIMAL | Payment amount | NOT NULL, > 0 |

#### Payment Types
- `'credit_card'` - Credit card payments (most common)
- `'boleto'` - Brazilian bank slip payment
- `'voucher'` - Gift cards or vouchers
- `'debit_card'` - Debit card payments
- `'not_defined'` - Unknown payment method

#### Business Logic
- **Multiple Payments**: One order can have multiple payment methods
- **Installments**: Very common in Brazilian e-commerce
- **Sequential Tracking**: `payment_sequential` for multiple payments
- **Cultural Context**: Boleto is a distinctly Brazilian payment method
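
Because an order can have several payment rows and several item rows, both sides should be aggregated to order level before they are compared. The sketch below reconciles what was paid against what was billed (price plus freight). The data is hypothetical, and the 0.01 tolerance is an arbitrary choice for rounding noise.

```python
import pandas as pd

# Hypothetical data for illustration only
order_items = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2'],
    'price': [50.0, 20.0, 129.90],
    'freight_value': [10.0, 5.0, 14.78],
})
payments = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o2'],
    'payment_sequential': [1, 1, 2],
    'payment_type': ['credit_card', 'credit_card', 'voucher'],
    'payment_value': [85.0, 119.90, 20.00],
})

# Aggregate both sides to one row per order before combining them;
# joining the raw tables would repeat amounts for multi-row orders
billed = (order_items.assign(billed=order_items['price'] + order_items['freight_value'])
          .groupby('order_id')['billed'].sum())
paid = payments.groupby('order_id')['payment_value'].sum().rename('paid')

reconciliation = pd.concat([billed, paid], axis=1)
reconciliation['difference'] = (reconciliation['paid'] - reconciliation['billed']).round(2)
reconciliation['matches'] = reconciliation['difference'].abs() < 0.01
print(reconciliation)
```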

In [None]:
# Create sample payments table
sample_payments = pd.DataFrame({
    'order_id': ['e481f51cbdc54678b7cc49136f2d6af7', '53cdb2fc8bc7dce0b6741e2150273451',
                 '47770eb9100c2d0c44946d9cf07ec65d', '47770eb9100c2d0c44946d9cf07ec65d'],
    'payment_sequential': [1, 1, 1, 2],
    'payment_type': ['credit_card', 'credit_card', 'credit_card', 'voucher'],
    'payment_installments': [8, 1, 4, 1],
    'payment_value': [298.83, 216.87, 119.90, 24.78]
})

print("💳 Sample Payments Table:")
print(sample_payments)

# Analyze payment patterns
print("\n📊 Payment Analysis:")
print("Payment type distribution:")
print(sample_payments['payment_type'].value_counts())

print("\nInstallment patterns:")
installment_analysis = sample_payments.groupby('payment_installments').agg({
    'payment_value': ['count', 'mean'],
    'order_id': 'nunique'
})
print(installment_analysis)

# Show mixed payment order
mixed_payment_order = sample_payments[sample_payments['order_id'] == '47770eb9100c2d0c44946d9cf07ec65d']
print("\n🔄 Mixed payment method example:")
print(mixed_payment_order)
print(f"Total payment: R${mixed_payment_order['payment_value'].sum():.2f}")

## 3. Entity Tables

### 👥 Table 4: Customers (`olist_customers_dataset.csv`)

**Purpose**: Customer information and geographic location data.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `customer_id` | STRING | Unique customer identifier | PRIMARY KEY, NOT NULL |
| `customer_unique_id` | STRING | Real customer identifier (anonymized) | NOT NULL |
| `customer_zip_code_prefix` | INTEGER | First 5 digits of ZIP code | NOT NULL |
| `customer_city` | STRING | Customer city name | NOT NULL |
| `customer_state` | STRING | Brazilian state abbreviation | NOT NULL, 2 characters |

#### Geographic Context
- **Brazilian States**: 27 federative units (26 states + 1 federal district)
- **ZIP Code System**: Brazilian postal codes (CEP) have 8 digits; the dataset keeps only the first 5, which identify the region
- **Privacy Protection**: Real customer identity anonymized
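
One practical consequence of typing the ZIP prefix as an integer: prefixes that begin with zero lose that digit (for example, `01151` becomes `1151`). This is a sketch of one way to restore the 5-character form, assuming the source file stores the prefix with leading zeros.

```python
import pandas as pd

# Prefixes as they appear after being read as integers (leading zeros lost)
customers = pd.DataFrame({'customer_zip_code_prefix': [14409, 9790, 1151, 8775]})

# Restore the 5-character form so prefixes compare and join as text
customers['zip_prefix_str'] = (customers['customer_zip_code_prefix']
                               .astype(str).str.zfill(5))
print(customers)

# Alternative when loading from CSV: keep the column as text from the start
# customers = pd.read_csv('olist_customers_dataset.csv',
#                         dtype={'customer_zip_code_prefix': str})
```

Whichever representation you choose, use the same one in the customers, sellers, and geolocation tables so that joins on the prefix still match.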

#### Key Insights
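- **Customer Uniqueness**: `customer_id` is the key that links to the orders table, while `customer_unique_id` identifies the person behind it; use `customer_unique_id` to count distinct or repeat customers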
- **Geographic Analysis**: Enables location-based customer segmentation
- **Market Reach**: Shows Olist's geographic penetration
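
The distinction between the two identifiers matters most when counting repeat customers. The sketch below assumes, as the dataset documentation describes, that one person can appear under several `customer_id` values but keeps a single `customer_unique_id`. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical data: person 'u1' placed two orders and so has two customer_id values
customers = pd.DataFrame({
    'customer_id': ['c1', 'c2', 'c3', 'c4'],
    'customer_unique_id': ['u1', 'u2', 'u1', 'u3'],
})
orders = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4'],
    'customer_id': ['c1', 'c2', 'c3', 'c4'],
})

# Count orders per real customer, not per customer_id
orders_per_person = (orders.merge(customers, on='customer_id', how='left')
                     .groupby('customer_unique_id')['order_id'].nunique())
repeat_rate = (orders_per_person > 1).mean()

print(orders_per_person)
print(f"Repeat customers: {repeat_rate:.1%}")
```

Grouping by `customer_id` instead would report every customer as a one-time buyer in this example.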

In [None]:
# Create sample customers table
sample_customers = pd.DataFrame({
    'customer_id': ['9ef432eb6251297304e76186b10a928d', 'b0830fb4747a6c6d20dea0b8c802d7ef',
                    '41ce2a54c0b03bf3443c3d931a367089', '8d50f5eadf9050cf04b64645b8f2b075'],
    'customer_unique_id': ['861eff4711a542e4b93843c6dd7febb0', '290c77bc529b7ac935b93aa66c333dc3',
                          '060e732b5b29e8181a18229c7b0b2b5e', '854daefd58154449b40e90b8faf8b916'],
    'customer_zip_code_prefix': [14409, 9790, 1151, 8775],
    'customer_city': ['franca', 'sao bernardo do campo', 'sao paulo', 'mogi das cruzes'],
    'customer_state': ['SP', 'SP', 'SP', 'SP']
})

print("👥 Sample Customers Table:")
print(sample_customers)

# Geographic analysis
print("\n🗺️ Geographic Distribution:")
state_dist = sample_customers['customer_state'].value_counts()
print("Customers per state:")
print(state_dist)

print("\n📍 ZIP Code Analysis:")
print(f"ZIP code range: {sample_customers['customer_zip_code_prefix'].min()} - {sample_customers['customer_zip_code_prefix'].max()}")

# Brazilian state context
print("\n🇧🇷 Brazilian State Context:")
print("SP = São Paulo (Brazil's most populous state)")
print("Major economic center with highest e-commerce activity")

# Explain the two customer identifiers
print("\n🔒 Customer Identifiers:")
print("customer_id: key used to join customers to orders")
print("customer_unique_id: anonymized ID of the person, shared across that person's customer_id values")

### 🏪 Table 5: Sellers (`olist_sellers_dataset.csv`)

**Purpose**: Seller information and geographic distribution.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `seller_id` | STRING | Unique seller identifier | PRIMARY KEY, NOT NULL |
| `seller_zip_code_prefix` | INTEGER | First 5 digits of ZIP code | NOT NULL |
| `seller_city` | STRING | Seller city name | NOT NULL |
| `seller_state` | STRING | Brazilian state abbreviation | NOT NULL, 2 characters |

#### Business Context
- **Seller Network**: Distributed across Brazil
- **Geographic Strategy**: Logistics and shipping optimization
- **Market Reach**: Local sellers serving broader markets

#### Analysis Opportunities
- **Seller Performance by Location**: Regional success patterns
- **Logistics Analysis**: Distance between sellers and customers
- **Market Penetration**: Coverage across Brazilian regions
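
As an illustration of the first opportunity, the sketch below joins line items to sellers and summarizes revenue by seller state. The rows are hypothetical. The `validate='many_to_one'` argument makes pandas raise an error if `seller_id` turns out not to be unique in the sellers table.

```python
import pandas as pd

# Hypothetical data for illustration only
sellers = pd.DataFrame({
    'seller_id': ['s1', 's2', 's3'],
    'seller_state': ['SP', 'SP', 'MG'],
})
order_items = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2', 'o3'],
    'seller_id': ['s1', 's2', 's3', 's1'],
    'price': [58.90, 239.90, 199.00, 129.90],
})

# Many items point at one seller; validate guards that assumption
items_with_state = order_items.merge(sellers, on='seller_id', how='left',
                                     validate='many_to_one')

revenue_by_state = (items_with_state.groupby('seller_state')
                    .agg(revenue=('price', 'sum'),
                         active_sellers=('seller_id', 'nunique'))
                    .sort_values('revenue', ascending=False))
print(revenue_by_state)
```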

### 📋 Table 6: Products (`olist_products_dataset.csv`)

**Purpose**: Product characteristics and categorization information.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `product_id` | STRING | Unique product identifier | PRIMARY KEY, NOT NULL |
| `product_category_name` | STRING | Category name (Portuguese) | CAN BE NULL |
| `product_name_lenght` | INTEGER | Length of product name | CAN BE NULL, >= 0 |
| `product_description_lenght` | INTEGER | Length of description | CAN BE NULL, >= 0 |
| `product_photos_qty` | INTEGER | Number of product photos | CAN BE NULL, >= 0 |
| `product_weight_g` | INTEGER | Product weight in grams | CAN BE NULL, > 0 |
| `product_length_cm` | INTEGER | Product length in cm | CAN BE NULL, > 0 |
| `product_height_cm` | INTEGER | Product height in cm | CAN BE NULL, > 0 |
| `product_width_cm` | INTEGER | Product width in cm | CAN BE NULL, > 0 |

#### Key Insights
- **Physical Characteristics**: Enable shipping cost calculations
- **Content Quality**: Name and description length serve as rough proxies for listing quality
- **Visual Appeal**: Photo quantity may influence conversion, which can be tested against sales data
- **Portuguese Categories**: Require translation for international analysis

#### Data Quality Notes
- **Missing Values**: Possible in every column except `product_id`, including category, content fields, and physical dimensions
- **Spelling Error**: "lenght" instead of "length" in column names
- **Measurement Units**: Consistent metric system usage
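
A minimal sketch of handling two of these notes: renaming the misspelled columns and measuring how much data is missing. The rows are hypothetical, and `'unknown'` is our own placeholder label.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with deliberately missing values
products = pd.DataFrame({
    'product_id': ['p1', 'p2', 'p3'],
    'product_category_name': ['beleza_saude', None, 'esporte_lazer'],
    'product_name_lenght': [58, np.nan, 35],
    'product_description_lenght': [315, np.nan, 542],
    'product_weight_g': [650, 350, np.nan],
})

# Optional: correct the misspelled column names for readability
products = products.rename(columns={
    'product_name_lenght': 'product_name_length',
    'product_description_lenght': 'product_description_length',
})

# Share of missing values per column
missing_share = products.isna().mean().sort_values(ascending=False)
print(missing_share)

# Label uncategorized products rather than dropping them
products['product_category_name'] = products['product_category_name'].fillna('unknown')
```

The rest of this notebook keeps the original `lenght` spelling, so apply the rename only if you update later cells to match.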

In [None]:
# Create sample products table
sample_products = pd.DataFrame({
    'product_id': ['4244733e06e7ecb4970a6e2683c13e61', 'e5f2d52b802189ee658865ca93d83a8f',
                   'c777355d18b72b67abbeef9df44fd0fd', '7634da152a4610f1595efa32f14722fc'],
    'product_category_name': ['beleza_saude', 'perfumaria', 'esporte_lazer', 'informatica_acessorios'],
    'product_name_lenght': [58, 42, 35, 67],
    'product_description_lenght': [315, 287, 542, 128],
    'product_photos_qty': [4, 3, 7, 2],
    'product_weight_g': [650, 350, 1200, 89],
    'product_length_cm': [18, 12, 35, 15],
    'product_height_cm': [6, 8, 15, 2],
    'product_width_cm': [12, 8, 25, 10]
})

print("📋 Sample Products Table:")
print(sample_products)

# Analyze product characteristics
print("\n📊 Product Analysis:")
print("Category distribution:")
print(sample_products['product_category_name'].value_counts())

print("\n📸 Content Quality Metrics:")
content_metrics = sample_products[['product_name_lenght', 'product_description_lenght', 'product_photos_qty']].describe()
print(content_metrics)

# Calculate shipping volume (for logistics)
sample_products['volume_cm3'] = (sample_products['product_length_cm'] * 
                                sample_products['product_height_cm'] * 
                                sample_products['product_width_cm'])

print("\n📦 Shipping Characteristics:")
shipping_analysis = sample_products[['product_weight_g', 'volume_cm3']].describe()
print(shipping_analysis)

# Category translation examples
print("\n🔤 Category Translation Examples:")
category_translations = {
    'beleza_saude': 'health_beauty',
    'perfumaria': 'perfumery',
    'esporte_lazer': 'sports_leisure',
    'informatica_acessorios': 'computers_accessories'
}

for pt, en in category_translations.items():
    print(f"{pt} → {en}")

## 4. Supplementary Tables

### ⭐ Table 7: Reviews (`olist_order_reviews_dataset.csv`)

**Purpose**: Customer feedback, ratings, and review text for orders.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `review_id` | STRING | Unique review identifier | PRIMARY KEY, NOT NULL |
| `order_id` | STRING | Order being reviewed | FOREIGN KEY → orders.order_id |
| `review_score` | INTEGER | Rating from 1-5 stars | NOT NULL, 1 ≤ value ≤ 5 |
| `review_comment_title` | STRING | Review title/summary | CAN BE NULL |
| `review_comment_message` | STRING | Full review text | CAN BE NULL |
| `review_creation_date` | DATETIME | When the satisfaction survey was sent to the customer | NOT NULL |
| `review_answer_timestamp` | DATETIME | When the customer answered the survey | CAN BE NULL |

#### Rating Scale
- **5 stars**: Excellent experience
- **4 stars**: Good experience
- **3 stars**: Average experience
- **2 stars**: Poor experience
- **1 star**: Very poor experience

#### Analysis Applications
- **Sentiment Analysis**: Text mining of review content
- **Quality Metrics**: Correlation with delivery performance
- **Product Success**: Rating impact on sales
- **Response Analysis**: Time between survey creation and the recorded answer
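
The link between ratings and delivery performance can be explored by joining reviews to orders on `order_id`. The sketch below compares average scores for late and on-time deliveries, using hypothetical rows.

```python
import pandas as pd

# Hypothetical data for illustration only
orders = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4'],
    'order_delivered_customer_date': pd.to_datetime(
        ['2018-01-05', '2018-01-20', '2018-02-01', '2018-02-15']),
    'order_estimated_delivery_date': pd.to_datetime(
        ['2018-01-10', '2018-01-15', '2018-02-10', '2018-02-10']),
})
reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3', 'r4'],
    'order_id': ['o1', 'o2', 'o3', 'o4'],
    'review_score': [5, 2, 4, 1],
})

# Flag orders that arrived after the estimated date.
# On real data, filter to delivered orders first: a missing delivery date
# compares as False and would be labeled 'on_time' here.
is_late = orders['order_delivered_customer_date'] > orders['order_estimated_delivery_date']
orders['delivery'] = is_late.map({True: 'late', False: 'on_time'})

score_by_delivery = (reviews.merge(orders[['order_id', 'delivery']], on='order_id')
                     .groupby('delivery')['review_score'].agg(['mean', 'count']))
print(score_by_delivery)
```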

### 🌍 Table 8: Geolocation (`olist_geolocation_dataset.csv`)

**Purpose**: Geographic coordinates and location data for Brazilian ZIP codes.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `geolocation_zip_code_prefix` | INTEGER | ZIP code prefix | NOT NULL |
| `geolocation_lat` | DECIMAL | Latitude coordinate | NOT NULL |
| `geolocation_lng` | DECIMAL | Longitude coordinate | NOT NULL |
| `geolocation_city` | STRING | City name | NOT NULL |
| `geolocation_state` | STRING | State abbreviation | NOT NULL |

#### Geographic Analysis Applications
- **Distance Calculations**: Between customers and sellers
- **Logistics Optimization**: Delivery route planning
- **Market Analysis**: Regional performance mapping
- **Heat Maps**: Geographic visualization of business metrics

#### Data Characteristics
- **High Volume**: ~1 million records for detailed coverage
- **Coordinate Precision**: Decimal degrees for accuracy
- **Duplicate Handling**: Multiple entries per ZIP code possible
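
Because a ZIP prefix can appear many times, joining customers or sellers directly to this table would multiply rows. One common approach, sketched here with hypothetical coordinates, is to collapse the table to one row per prefix and then compute great-circle distances with the haversine formula. The `haversine_km` helper is our own illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates; prefix 1151 appears twice, as can happen in the real table
geolocation = pd.DataFrame({
    'geolocation_zip_code_prefix': [1151, 1151, 14409],
    'geolocation_lat': [-23.52, -23.54, -20.50],
    'geolocation_lng': [-46.66, -46.64, -47.40],
})

# Collapse to one row per prefix (mean is one option; median resists outliers better)
geo_unique = (geolocation.groupby('geolocation_zip_code_prefix')
              [['geolocation_lat', 'geolocation_lng']].mean())

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, [lat1, lng1, lat2, lng2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

customer_loc = geo_unique.loc[1151]
seller_loc = geo_unique.loc[14409]
distance = haversine_km(customer_loc['geolocation_lat'], customer_loc['geolocation_lng'],
                        seller_loc['geolocation_lat'], seller_loc['geolocation_lng'])
print(f"Approximate distance: {distance:.0f} km")
```

After deduplication the prefix behaves like a key, so customers and sellers can be joined to `geo_unique` many-to-one.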

### 🏷️ Table 9: Category Translation (`product_category_name_translation.csv`)

**Purpose**: Mapping between Portuguese and English product category names.

#### Schema Definition

| Column Name | Data Type | Description | Constraints |
|-------------|-----------|-------------|-------------|
| `product_category_name` | STRING | Portuguese category name | PRIMARY KEY, NOT NULL |
| `product_category_name_english` | STRING | English translation | NOT NULL |

#### Key Categories (Examples)

| Portuguese | English | Business Context |
|------------|---------|-------------------|
| `beleza_saude` | `health_beauty` | Personal care products |
| `informatica_acessorios` | `computers_accessories` | Technology products |
| `moveis_decoracao` | `furniture_decor` | Home improvement |
| `esporte_lazer` | `sports_leisure` | Recreation products |
| `casa_construcao` | `home_construction` | Building materials |

#### Usage Importance
- **International Analysis**: Makes data accessible globally
- **Standardization**: Consistent category naming
- **Business Intelligence**: Category performance comparison
- **Cultural Bridge**: Understanding Brazilian market categories
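
A left join against the translation table leaves `NaN` wherever a product's category is missing or has no translation row. This sketch shows one way to fall back gracefully. The category `categoria_sem_traducao` is a made-up name standing in for an untranslated category, and `'unknown'` is our own placeholder.

```python
import pandas as pd

# Hypothetical data for illustration only
products = pd.DataFrame({
    'product_id': ['p1', 'p2', 'p3'],
    'product_category_name': ['beleza_saude', 'categoria_sem_traducao', None],
})
translation = pd.DataFrame({
    'product_category_name': ['beleza_saude'],
    'product_category_name_english': ['health_beauty'],
})

merged = products.merge(translation, on='product_category_name', how='left')

# Fall back to the Portuguese name, then to a placeholder, when no translation exists
merged['category'] = (merged['product_category_name_english']
                      .fillna(merged['product_category_name'])
                      .fillna('unknown'))
print(merged[['product_id', 'category']])
```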

In [None]:
# Create sample category translation table
sample_categories = pd.DataFrame({
    'product_category_name': [
        'beleza_saude', 'informatica_acessorios', 'moveis_decoracao',
        'esporte_lazer', 'casa_construcao', 'eletronicos', 'telefonia',
        'automotivo', 'livros_tecnicos', 'fashion_bolsas_e_acessorios'
    ],
    'product_category_name_english': [
        'health_beauty', 'computers_accessories', 'furniture_decor',
        'sports_leisure', 'home_construction', 'electronics', 'telephony',
        'automotive', 'books_technical', 'fashion_bags_accessories'
    ]
})

print("🏷️ Sample Category Translation Table:")
print(sample_categories)

# Show translation usage (categories missing from the translation table come back as NaN)
print("\n🔄 Translation Usage Example:")
print("Products with English category names added:")
products_with_translation = sample_products.merge(
    sample_categories, 
    on='product_category_name', 
    how='left'
)
print(products_with_translation[['product_id', 'product_category_name', 'product_category_name_english']].head())

print("\n📊 Category Analysis Benefits:")
print("✅ Enables international business analysis")
print("✅ Facilitates category comparison studies")
print("✅ Supports multi-language reporting")
print("✅ Bridges cultural understanding gaps")

## 5. Entity Relationship Diagram (ERD)

Understanding the relationships between tables is crucial for effective analysis. Let's visualize the complete schema structure.

In [None]:
# Create a comprehensive Entity Relationship Diagram
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

# Create large figure for detailed ERD
fig, ax = plt.subplots(1, 1, figsize=(16, 12))
ax.set_xlim(0, 16)
ax.set_ylim(0, 12)
ax.axis('off')

# Define table positions with detailed schema information
tables_erd = {
    'Orders': {
        'pos': (8, 9),
        'color': '#FF6B6B',
        'pk': 'order_id',
        'fk': ['customer_id'],
        'size': (2.5, 1.5),
        'fields': ['order_id (PK)', 'customer_id (FK)', 'order_status', 'purchase_timestamp']
    },
    'Order Items': {
        'pos': (8, 6.5),
        'color': '#4ECDC4',
        'pk': 'order_id + order_item_id',
        'fk': ['order_id', 'product_id', 'seller_id'],
        'size': (2.5, 1.5),
        'fields': ['order_id (FK)', 'order_item_id', 'product_id (FK)', 'seller_id (FK)', 'price']
    },
    'Customers': {
        'pos': (4, 9),
        'color': '#45B7D1',
        'pk': 'customer_id',
        'fk': [],
        'size': (2.5, 1.5),
        'fields': ['customer_id (PK)', 'customer_unique_id', 'zip_code_prefix', 'city', 'state']
    },
    'Products': {
        'pos': (12, 6.5),
        'color': '#FFEAA7',
        'pk': 'product_id',
        'fk': [],
        'size': (2.5, 1.5),
        'fields': ['product_id (PK)', 'category_name', 'weight_g', 'dimensions']
    },
    'Sellers': {
        'pos': (4, 6.5),
        'color': '#96CEB4',
        'pk': 'seller_id',
        'fk': [],
        'size': (2.5, 1.5),
        'fields': ['seller_id (PK)', 'zip_code_prefix', 'city', 'state']
    },
    'Payments': {
        'pos': (12, 9),
        'color': '#F7DC6F',
        'pk': 'order_id + payment_sequential',
        'fk': ['order_id'],
        'size': (2.5, 1.5),
        'fields': ['order_id (FK)', 'payment_sequential', 'payment_type', 'installments']
    },
    'Reviews': {
        'pos': (8, 4),
        'color': '#DDA0DD',
        'pk': 'review_id',
        'fk': ['order_id'],
        'size': (2.5, 1.5),
        'fields': ['review_id (PK)', 'order_id (FK)', 'review_score', 'comment_title']
    },
    'Geolocation': {
        'pos': (2, 3),
        'color': '#AED6F1',
        'pk': 'zip_code_prefix',
        'fk': [],
        'size': (2.5, 1.5),
        'fields': ['zip_code_prefix', 'latitude', 'longitude', 'city', 'state']
    },
    'Categories': {
        'pos': (14, 3),
        'color': '#F8C471',
        'pk': 'product_category_name',
        'fk': [],
        'size': (2.5, 1.5),
        'fields': ['category_name (PK)', 'category_name_english']
    }
}

# Draw tables with detailed information
for table_name, info in tables_erd.items():
    x, y = info['pos']
    width, height = info['size']
    
    # Create table box
    box = FancyBboxPatch(
        (x - width/2, y - height/2),
        width, height,
        boxstyle="round,pad=0.05",
        facecolor=info['color'],
        edgecolor='black',
        linewidth=2,
        alpha=0.8
    )
    ax.add_patch(box)
    
    # Add table name (header)
    ax.text(x, y + height/2 - 0.2, table_name, 
            ha='center', va='center', fontsize=11, fontweight='bold')
    
    # Add fields
    field_y_start = y + 0.1
    for i, field in enumerate(info['fields'][:4]):  # Show first 4 fields
        field_y = field_y_start - (i * 0.2)
        ax.text(x, field_y, field, ha='center', va='center', 
                fontsize=8, family='monospace')

# Define relationships with cardinality
relationships = [
    ('Orders', 'Order Items', '1:M', 'order_id'),
    ('Orders', 'Customers', 'M:1', 'customer_id'),
    ('Orders', 'Payments', '1:M', 'order_id'),
    ('Orders', 'Reviews', '1:1', 'order_id'),
    ('Order Items', 'Products', 'M:1', 'product_id'),
    ('Order Items', 'Sellers', 'M:1', 'seller_id'),
    ('Products', 'Categories', 'M:1', 'category_name'),
    ('Customers', 'Geolocation', 'M:1', 'zip_code_prefix'),
    ('Sellers', 'Geolocation', 'M:1', 'zip_code_prefix')
]

# Draw relationships
for start_table, end_table, cardinality, join_field in relationships:
    start_pos = tables_erd[start_table]['pos']
    end_pos = tables_erd[end_table]['pos']
    
    # Calculate arrow positions
    dx = end_pos[0] - start_pos[0]
    dy = end_pos[1] - start_pos[1]
    
    # Adjust start and end points to table edges
    start_size = tables_erd[start_table]['size']
    end_size = tables_erd[end_table]['size']
    
    # Calculate edge points
    if abs(dx) > abs(dy):  # Horizontal connection
        if dx > 0:  # Right connection
            start_point = (start_pos[0] + start_size[0]/2, start_pos[1])
            end_point = (end_pos[0] - end_size[0]/2, end_pos[1])
        else:  # Left connection
            start_point = (start_pos[0] - start_size[0]/2, start_pos[1])
            end_point = (end_pos[0] + end_size[0]/2, end_pos[1])
    else:  # Vertical connection
        if dy > 0:  # Up connection
            start_point = (start_pos[0], start_pos[1] + start_size[1]/2)
            end_point = (end_pos[0], end_pos[1] - end_size[1]/2)
        else:  # Down connection
            start_point = (start_pos[0], start_pos[1] - start_size[1]/2)
            end_point = (end_pos[0], end_pos[1] + end_size[1]/2)
    
    # Draw arrow
    ax.annotate('', xy=end_point, xytext=start_point,
                arrowprops=dict(arrowstyle='->', lw=1.5, color='gray', alpha=0.8))
    
    # Add cardinality label
    mid_x = (start_point[0] + end_point[0]) / 2
    mid_y = (start_point[1] + end_point[1]) / 2
    ax.text(mid_x, mid_y, cardinality, ha='center', va='center',
            fontsize=8, bbox=dict(boxstyle='round,pad=0.2', facecolor='white', alpha=0.8))

# Add title and legend
ax.text(8, 11.5, 'Olist Dataset - Entity Relationship Diagram', 
        ha='center', va='center', fontsize=16, fontweight='bold')

# Add legend
legend_x = 1
legend_y = 10.5
ax.text(legend_x, legend_y, 'Legend:', fontsize=12, fontweight='bold')
ax.text(legend_x, legend_y - 0.3, 'PK = Primary Key', fontsize=9)
ax.text(legend_x, legend_y - 0.6, 'FK = Foreign Key', fontsize=9)
ax.text(legend_x, legend_y - 0.9, '1:M = One to Many', fontsize=9)
ax.text(legend_x, legend_y - 1.2, 'M:1 = Many to One', fontsize=9)
ax.text(legend_x, legend_y - 1.5, '1:1 = One to One', fontsize=9)

# Add business context notes
notes_x = 1
notes_y = 7.5
ax.text(notes_x, notes_y, 'Business Context:', fontsize=10, fontweight='bold')
ax.text(notes_x, notes_y - 0.3, '• Orders are the central entity', fontsize=8)
ax.text(notes_x, notes_y - 0.5, '• Each order can have multiple items', fontsize=8)
ax.text(notes_x, notes_y - 0.7, '• Items connect to products & sellers', fontsize=8)
ax.text(notes_x, notes_y - 0.9, '• Geographic data supports logistics', fontsize=8)
ax.text(notes_x, notes_y - 1.1, '• Reviews provide quality feedback', fontsize=8)

plt.tight_layout()
plt.show()

print("🗂️ Complete Entity Relationship Diagram")
print("\nKey Relationship Patterns:")
print("1. Orders → Order Items (1:Many) - Each order can have one or more line items")
print("2. Order Items → Products (Many:1) - Many items reference same product")
print("3. Order Items → Sellers (Many:1) - Multiple items from same seller")
print("4. Orders → Customers (Many:1) - Customer can have multiple orders")
print("5. Geographic tables support location-based analysis")

## 6. Primary and Foreign Key Relationships

### 🔑 Key Relationship Summary

Understanding the key relationships is crucial for joining tables correctly:

#### Primary Keys (Unique Identifiers)
| Table | Primary Key | Description |
|-------|-------------|-------------|
| Orders | `order_id` | Unique order identifier |
| Order Items | `order_id` + `order_item_id` | Composite key for line items |
| Customers | `customer_id` | Unique customer identifier |
| Sellers | `seller_id` | Unique seller identifier |
| Products | `product_id` | Unique product identifier |
| Reviews | `review_id` | Unique review identifier |
| Payments | `order_id` + `payment_sequential` | Composite key for payments |
| Geolocation | None (`geolocation_zip_code_prefix` repeats) | Join key only; deduplicate before merging |
| Categories | `product_category_name` | Category name in Portuguese |

#### Foreign Key Relationships
| Child Table | Foreign Key | Parent Table | Parent Key | Relationship Type |
|-------------|-------------|--------------|------------|-------------------|
| Orders | `customer_id` | Customers | `customer_id` | Many-to-One |
| Order Items | `order_id` | Orders | `order_id` | Many-to-One |
| Order Items | `product_id` | Products | `product_id` | Many-to-One |
| Order Items | `seller_id` | Sellers | `seller_id` | Many-to-One |
| Payments | `order_id` | Orders | `order_id` | Many-to-One |
| Reviews | `order_id` | Orders | `order_id` | One-to-One* |

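*Note: Reviews to Orders is typically One-to-One, but some orders may not have reviews. Join from orders with a left join to keep them, and confirm that `order_id` is unique in the reviews table before relying on the One-to-One assumption.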
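
The relationship types in the table above are assumptions until they are checked against the data. A minimal sketch of three such checks, using hypothetical rows: key uniqueness, actual cardinality, and unmatched rows on either side of a join.

```python
import pandas as pd

# Hypothetical data: order 'o3' has no review, and review 'r9' points at a missing order
orders = pd.DataFrame({'order_id': ['o1', 'o2', 'o3']})
reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r9'],
    'order_id': ['o1', 'o2', 'o404'],
})

# 1. Is the presumed key unique on the parent side?
print(f"order_id unique in orders: {orders['order_id'].is_unique}")

# 2. Is the relationship really one-to-one? More than one review per order breaks that.
reviews_per_order = reviews.groupby('order_id').size()
print(f"Max reviews per order: {reviews_per_order.max()}")

# 3. Which rows fail to match? indicator=True adds a '_merge' column.
check = orders.merge(reviews, on='order_id', how='outer', indicator=True)
orders_without_review = check.loc[check['_merge'] == 'left_only', 'order_id'].tolist()
orphan_reviews = check.loc[check['_merge'] == 'right_only', 'review_id'].tolist()
print(f"Orders without a review: {orders_without_review}")
print(f"Reviews without an order: {orphan_reviews}")
```

The same pattern applies to every row of the foreign key table: swap in the child table, the parent table, and the key column.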

In [None]:
# Demonstrate key relationship concepts with examples
print("🔗 Key Relationship Examples")
print("=" * 50)

# Example 1: One-to-Many (Orders to Order Items)
print("\n1️⃣ One-to-Many: Orders → Order Items")
print("One order can contain multiple products:")
print("")
print("Order ORD001:")
print("  ├── Item 1: Laptop (Seller A)")
print("  ├── Item 2: Mouse (Seller B)")
print("  └── Item 3: Keyboard (Seller A)")
print("")

# Example 2: Many-to-One (Order Items to Products)
print("2️⃣ Many-to-One: Order Items → Products")
print("Multiple order items can reference the same product:")
print("")
print("Product PROD123 (iPhone):")
print("  ← Order ORD001, Item 1 (Customer A)")
print("  ← Order ORD002, Item 1 (Customer B)")
print("  ← Order ORD003, Item 2 (Customer C)")
print("")

# Example 3: Composite Keys
print("3️⃣ Composite Keys: Order Items")
print("Order items identified by order_id + order_item_id:")
print("")
print("Primary Key Examples:")
print("  • (ORD001, 1) → First item in order ORD001")
print("  • (ORD001, 2) → Second item in order ORD001")
print("  • (ORD002, 1) → First item in order ORD002")
print("")

# SQL to Pandas translation examples
print("4️⃣ SQL to Pandas Translation:")
print("")
print("SQL JOIN:")
print("SELECT o.order_id, c.customer_city")
print("FROM orders o")
print("JOIN customers c ON o.customer_id = c.customer_id")
print("")
print("Pandas equivalent:")
print("pd.merge(orders, customers, on='customer_id')[['order_id', 'customer_city']]")
print("")

# Common join patterns
print("5️⃣ Common Join Patterns:")
print("")
join_patterns = [
    "orders + customers → Customer order analysis",
    "orders + order_items → Order detail analysis",
    "order_items + products → Product sales analysis",
    "order_items + sellers → Seller performance analysis",
    "orders + reviews → Customer satisfaction analysis",
    "orders + payments → Payment method analysis"
]

for pattern in join_patterns:
    print(f"  • {pattern}")

## 7. Data Quality Considerations

Understanding the schema helps anticipate data quality issues you'll encounter:

### 🚨 Common Data Quality Issues

#### **Missing Values (NULL)**
- **Product dimensions**: Not all products have complete physical measurements
- **Review comments**: Many reviews have ratings but no text
- **Delivery dates**: Canceled orders won't have delivery timestamps
- **Geographic coordinates**: Some ZIP codes may lack precise coordinates

#### **Data Type Consistency**
- **Date formats**: Timestamps arrive as text in the CSV files and should be parsed to a single datetime type
- **Numeric precision**: Price values may have different decimal places
- **String formatting**: City names may have inconsistent capitalization

#### **Referential Integrity**
- **Orphaned records**: Order items without corresponding orders
- **Missing references**: Products referenced but not in products table
- **Cascade effects**: Deleted orders affecting related reviews/payments

#### **Business Logic Violations**
- **Negative values**: Prices or quantities that shouldn't be negative
- **Date inconsistencies**: Delivery before order date
- **Status mismatches**: Delivered orders without delivery dates
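
Several of the issues above translate into short boolean filters. The sketch below runs four of them on toy tables with invented values. The column names follow the public Olist files as I understand them (`order_delivered_customer_date` in particular is an assumption here), so adjust them if your files differ.

```python
import pandas as pd

# Toy tables (invented values) that mimic the Olist column names
orders = pd.DataFrame({
    "order_id": ["ORD001", "ORD002", "ORD003", "ORD004"],
    "order_status": ["delivered", "delivered", "canceled", "delivered"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-10", "2018-01-12", "2018-01-15", "2018-01-20"]),
    "order_delivered_customer_date": pd.to_datetime(
        ["2018-01-17", "2018-01-11", None, None]),
})
order_items = pd.DataFrame({
    "order_id": ["ORD001", "ORD002", "ORD999"],
    "price": [120.0, -5.0, 30.0],
})

# Status mismatch: delivered orders with no delivery timestamp
delivered_without_date = orders[
    (orders["order_status"] == "delivered")
    & orders["order_delivered_customer_date"].isna()
]

# Date inconsistency: delivery recorded before purchase (missing dates compare as False)
delivered_before_purchase = orders[
    orders["order_delivered_customer_date"] < orders["order_purchase_timestamp"]
]

# Orphaned records: order items whose order_id is missing from orders
orphaned_items = order_items[~order_items["order_id"].isin(orders["order_id"])]

# Negative values: prices below zero
negative_prices = order_items[order_items["price"] < 0]

issue_counts = {
    "delivered_without_date": len(delivered_without_date),
    "delivered_before_purchase": len(delivered_before_purchase),
    "orphaned_items": len(orphaned_items),
    "negative_prices": len(negative_prices),
}
print(issue_counts)
```

Each filter returns the offending rows, not just a count, which makes it easier to inspect and document what was found.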

In [None]:
# Create examples of data quality checks you'll need to perform
print("🔍 Data Quality Checklist for Olist Dataset")
print("=" * 55)

quality_checks = {
    "🏗️ Structural Checks": [
        "Verify all expected columns are present",
        "Check data types match schema definitions",
        "Confirm primary key uniqueness",
        "Validate foreign key references exist"
    ],
    
    "📊 Completeness Checks": [
        "Identify missing values in required fields",
        "Calculate completion rates for optional fields",
        "Check for empty strings vs. NULL values",
        "Assess overall dataset coverage"
    ],
    
    "✅ Validity Checks": [
        "Verify order status values are valid",
        "Check review scores are between 1-5",
        "Validate Brazilian state codes (27 federative units: 26 states + Federal District)",
        "Confirm ZIP code prefix format (5 digits, leading zeros preserved)"
    ],
    
    "📅 Temporal Checks": [
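        "Verify order_purchase_timestamp ≤ order_approved_at",
        "Check order_approved_at ≤ order_delivered_carrier_date",
        "Validate order_delivered_carrier_date ≤ order_delivered_customer_date",
        "Compare order_delivered_customer_date with order_estimated_delivery_date (late deliveries are valid data: flag, don't drop)"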
    ],
    
    "💰 Business Logic Checks": [
        "Confirm prices are positive values",
        "Check freight costs are non-negative",
        "Verify payment amounts match order totals",
        "Validate installment counts are reasonable"
    ]
}

for category, checks in quality_checks.items():
    print(f"\n{category}:")
    for check in checks:
        print(f"  ✓ {check}")

print("\n" + "="*55)
print("💡 Pro Tips for Data Quality:")
print("  • Always profile data before analysis")
print("  • Document assumptions about missing data")
print("  • Create data quality reports for stakeholders")
print("  • Set up automated quality checks for ongoing analysis")

## 8. SQL to Pandas Query Planning

### 🗺️ Query Planning Framework

Before writing code, plan your multi-table queries:

#### Step 1: Define Business Question
- What specific insight are you seeking?
- What metrics do you need to calculate?
- What dimensions do you need to group by?

#### Step 2: Identify Required Tables
- Which entities contain your needed data?
- What are the joining paths between tables?
- Are there multiple ways to get the same data?

#### Step 3: Map Relationships
- What are the foreign key connections?
- What type of joins do you need (inner, left, etc.)?
- What's the expected result size?

#### Step 4: Plan Join Sequence
- Start with the main entity table
- Add related tables one by one
- Consider performance implications
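
The four steps can be sketched on toy data as follows. The helper `merge_and_report` is hypothetical, and the tables hold invented values. The point is to print the row count after each join so that an unexpected jump (a sign of a one-to-many relationship you did not plan for) is visible right away.

```python
import pandas as pd

# Toy tables with invented values
orders = pd.DataFrame({"order_id": ["ORD001", "ORD002"], "customer_id": ["C1", "C2"]})
customers = pd.DataFrame({"customer_id": ["C1", "C2"], "customer_state": ["SP", "RJ"]})
order_items = pd.DataFrame({
    "order_id": ["ORD001", "ORD001", "ORD002"],
    "price": [100.0, 50.0, 80.0],
})


def merge_and_report(left, right, on, how="inner"):
    """Merge two tables and print the row count before and after."""
    merged = left.merge(right, on=on, how=how)
    print(f"join on {on}: {len(left)} rows -> {len(merged)} rows")
    return merged


# Start with the main entity, then add related tables one by one
step1 = merge_and_report(orders, customers, on="customer_id")  # many-to-one: row count unchanged
step2 = merge_and_report(step1, order_items, on="order_id")    # one-to-many: row count grows

# Metric and dimension from Step 1 of the plan
revenue_by_state = step2.groupby("customer_state")["price"].sum()
print(revenue_by_state)
```

Here the growth at the second join is expected because orders can hold several items; the same growth after a lookup-table join would signal duplicate keys.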

In [None]:
# Create a query planning template
print("📋 Multi-Table Query Planning Template")
print("=" * 45)

# Example business questions with their table requirements
query_examples = [
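    {
        'question': 'What is the average order value by customer state?',
        'tables': ['orders', 'customers', 'payments'],
        'join_keys': ['customer_id', 'order_id'],
        'metrics': ['AVG(order_total)'],
        'group_by': ['customer_state'],
        'sql': '''-- orders has no total column, so sum payments per order first
SELECT c.customer_state, AVG(t.order_total) AS avg_order_value
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN (SELECT order_id, SUM(payment_value) AS order_total
      FROM payments GROUP BY order_id) t ON o.order_id = t.order_id
GROUP BY c.customer_state''',
        'pandas': '''# Sum payments per order first (orders has no total column)
order_totals = (payments.groupby('order_id', as_index=False)['payment_value']
                        .sum()
                        .rename(columns={'payment_value': 'order_total'}))
orders_customers = (orders.merge(customers, on='customer_id')
                          .merge(order_totals, on='order_id'))
result = orders_customers.groupby('customer_state')['order_total'].mean()'''
    },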
    {
        'question': 'Which product category has the highest customer satisfaction?',
        'tables': ['orders', 'order_items', 'products', 'reviews', 'categories'],
        'join_keys': ['order_id', 'product_id', 'product_category_name'],
        'metrics': ['AVG(review_score)'],
        'group_by': ['product_category_name_english'],
        'sql': '''SELECT cat.product_category_name_english, AVG(r.review_score) as avg_rating
FROM reviews r
JOIN orders o ON r.order_id = o.order_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories cat ON p.product_category_name = cat.product_category_name
GROUP BY cat.product_category_name_english''',
        'pandas': '''# Step-by-step approach (a review covers the whole order, so its score repeats per item)
step1 = pd.merge(reviews, orders, on='order_id')
step2 = pd.merge(step1, order_items, on='order_id')
step3 = pd.merge(step2, products, on='product_id')
result = pd.merge(step3, categories, on='product_category_name')
final = result.groupby('product_category_name_english')['review_score'].mean()'''
    }
]

for i, example in enumerate(query_examples, 1):
    print(f"\n🎯 Example {i}: {example['question']}")
    print(f"📊 Tables needed: {', '.join(example['tables'])}")
    print(f"🔗 Join keys: {', '.join(example['join_keys'])}")
    print(f"📈 Metrics: {', '.join(example['metrics'])}")
    print(f"📋 Group by: {', '.join(example['group_by'])}")
    
    print("\n💾 SQL Version:")
    for line in example['sql'].split('\n'):
        print(f"    {line}")
    
    print("\n🐍 Pandas Version:")
    for line in example['pandas'].split('\n'):
        print(f"    {line}")
    
    print("-" * 45)

print("\n💡 Query Planning Best Practices:")
print("  • Start simple, add complexity gradually")
print("  • Verify join results at each step")
print("  • Consider performance for large datasets")
print("  • Document your join logic for others")

## 9. Performance Considerations

### 📊 Dataset Size Implications

Understanding the relative sizes of tables helps plan efficient queries:

| Table | Approximate Rows | Memory Impact | Join Considerations |
|-------|------------------|---------------|--------------------|
| **Orders** | 99K | Medium | Central table - use as starting point |
| **Order Items** | 112K | Medium | Larger than orders (multi-item orders) |
| **Customers** | 99K | Small | Light table - safe to join early |
| **Sellers** | 3K | Very Small | Tiny table - negligible impact |
| **Products** | 32K | Small | Moderate size - consider selectivity |
| **Reviews** | 100K | Medium | Similar to orders - 1:1 relationship |
| **Payments** | 103K | Medium | Slightly larger than orders |
| **Geolocation** | 1M | Large | Biggest table - join carefully |
| **Categories** | 71 | Tiny | Lookup table - no performance impact |

### ⚡ Performance Optimization Tips

#### **Join Order Strategy**
1. **Start with filtered main table** (e.g., orders for specific date range)
2. **Add small lookup tables first** (categories, sellers)
3. **Add larger tables incrementally** (customers, products)
4. **Save geolocation joins for last** (if needed)

#### **Memory Management**
- **Select only needed columns** before joining
- **Filter data early** in the process
- **Use categorical data types** for repeated strings
- **Consider chunking** for very large analyses

#### **Query Optimization**
- **Avoid Cartesian products** (unexpected row multiplication)
- **Use appropriate join types** (inner vs. left)
- **Aggregate before joining** when possible
- **Cache intermediate results** for repeated use
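
As a rough illustration of the categorical tip, the sketch below compares the memory footprint of a repeated-string column before and after conversion. It uses 100,000 synthetic rows with four state codes, not the real customers table, and the exact byte counts depend on your pandas version.

```python
import pandas as pd

# 100,000 synthetic rows with only four distinct state codes
customers = pd.DataFrame({"customer_state": ["SP", "RJ", "MG", "RS"] * 25_000})

state_as_text = customers["customer_state"]
state_as_category = state_as_text.astype("category")

# deep=True measures the actual string storage, not just the pointers
bytes_as_text = state_as_text.memory_usage(deep=True)
bytes_as_category = state_as_category.memory_usage(deep=True)

print(f"text column:        {bytes_as_text:,} bytes")
print(f"categorical column: {bytes_as_category:,} bytes")
print(f"distinct categories: {state_as_category.cat.categories.tolist()}")
```

The saving comes from storing each distinct string once and keeping a small integer code per row, so it is largest for columns with few distinct values, such as state codes or order status.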

In [None]:
# Performance optimization examples
print("⚡ Performance Optimization Examples")
print("=" * 40)

print("\n❌ INEFFICIENT APPROACH:")
print("# Loading everything first, then filtering")
inefficient_code = '''# Don't do this!
all_data = (
    orders.merge(customers, on='customer_id')
          .merge(order_items, on='order_id')
          .merge(products, on='product_id')
          .merge(sellers, on='seller_id')
          .merge(geolocation, left_on='customer_zip_code_prefix',
                 right_on='geolocation_zip_code_prefix')
)

# Then filter (too late!)
result = all_data[all_data['order_status'] == 'delivered']
result = result[result['customer_state'] == 'SP']'''

for line in inefficient_code.split('\n'):
    print(f"    {line}")

print("\n✅ EFFICIENT APPROACH:")
print("# Filter first, then join incrementally")
efficient_code = '''# Do this instead!
# 1. Filter main table first
delivered_orders = orders[orders['order_status'] == 'delivered']

# 2. Add customer info and filter by state
orders_customers = delivered_orders.merge(customers, on='customer_id')
sp_orders = orders_customers[orders_customers['customer_state'] == 'SP']

# 3. Add other tables incrementally
with_items = sp_orders.merge(order_items, on='order_id')
with_products = with_items.merge(products, on='product_id')

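# 4. Only add geolocation if actually needed
#    (keep one row per ZIP prefix first to avoid row multiplication)
if need_coordinates:
    geo_unique = geolocation.drop_duplicates('geolocation_zip_code_prefix')
    final = with_products.merge(geo_unique,
                               left_on='customer_zip_code_prefix',
                               right_on='geolocation_zip_code_prefix')'''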

for line in efficient_code.split('\n'):
    print(f"    {line}")

print("\n📊 MEMORY-EFFICIENT COLUMN SELECTION:")
memory_code = '''# Select only needed columns
orders_slim = orders[['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp']]
customers_slim = customers[['customer_id', 'customer_state', 'customer_city']]

# Join with reduced datasets
result = orders_slim.merge(customers_slim, on='customer_id')'''

for line in memory_code.split('\n'):
    print(f"    {line}")

print("\n🎯 AGGREGATION OPTIMIZATION:")
agg_code = '''# Aggregate before joining when possible
# Instead of joining all order items then aggregating:
order_totals = order_items.groupby('order_id').agg({
    'price': 'sum',
    'freight_value': 'sum',
    'order_item_id': 'count'
}).rename(columns={'order_item_id': 'item_count'})

# Then join the aggregated results
orders_with_totals = orders.merge(order_totals, on='order_id')'''

for line in agg_code.split('\n'):
    print(f"    {line}")

print("\n💡 Performance Tips Summary:")
tips = [
    "Filter early and often",
    "Select only needed columns",
    "Join small tables first",
    "Aggregate before joining when possible",
    "Use categorical dtypes for repeated strings",
    "Monitor memory usage with df.info(memory_usage='deep')",
    "Cache intermediate results for reuse"
]

for tip in tips:
    print(f"  ✓ {tip}")

## 10. Preparation for Part 3: Loading Data

### 🗂️ File Organization

Before loading the actual Olist data, you should understand how the files are organized:

#### **Expected File Structure**
```
olist_dataset/
├── olist_orders_dataset.csv
├── olist_order_items_dataset.csv
├── olist_customers_dataset.csv
├── olist_sellers_dataset.csv
├── olist_products_dataset.csv
├── olist_order_reviews_dataset.csv
├── olist_order_payments_dataset.csv
├── olist_geolocation_dataset.csv
└── product_category_name_translation.csv
```

### 📋 Loading Checklist

When you load the data in Part 3, you'll need to:

#### **1. File Validation**
- ✅ Confirm all 9 files are present
- ✅ Check file sizes are reasonable
- ✅ Verify CSV format and encoding

#### **2. Schema Validation**
- ✅ Confirm column names match expected schema
- ✅ Check data types are appropriate
- ✅ Validate primary key uniqueness

#### **3. Data Quality Assessment**
- ✅ Check for missing values
- ✅ Identify outliers and anomalies
- ✅ Validate referential integrity

#### **4. Initial Exploration**
- ✅ Calculate basic statistics
- ✅ Test join operations
- ✅ Create sample analyses
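
The file validation step of the checklist could look something like the sketch below. The file names are the nine listed in the expected file structure; the helper `find_missing_files` is hypothetical, and a temporary directory with placeholder files stands in for the real download so the example runs on its own.

```python
from pathlib import Path
import tempfile

# The nine files from the expected file structure
EXPECTED_FILES = [
    "olist_orders_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_customers_dataset.csv",
    "olist_sellers_dataset.csv",
    "olist_products_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_geolocation_dataset.csv",
    "product_category_name_translation.csv",
]


def find_missing_files(data_dir):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


# Demo: create only the first seven files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_FILES[:7]:
        (Path(tmp) / name).write_text("placeholder\n")
    missing = find_missing_files(tmp)

print(f"Missing files: {missing}")
```

With the real data, point `find_missing_files` at the folder where you extracted the CSV files and expect an empty list.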

### 🎯 Learning Objectives for Part 3

In the next session, you will:
- Load all 9 Olist dataset files efficiently
- Perform comprehensive data quality assessment
- Execute your first multi-table joins
- Create initial business insights
- Prepare data for ongoing analysis

In [None]:
# Create a data loading preparation checklist
print("📋 Data Loading Preparation Checklist")
print("=" * 45)

checklist_items = {
    "🗂️ File Preparation": [
        "Download Olist dataset from Kaggle",
        "Extract all CSV files to working directory",
        "Verify file names match expected schema",
        "Check file sizes are reasonable (not corrupted)"
    ],
    
    "💻 Environment Setup": [
        "Import necessary libraries (pandas, numpy, matplotlib)",
        "Set pandas display options for exploration",
        "Configure memory usage monitoring",
        "Prepare data directory paths"
    ],
    
    "🔍 Schema Validation Plan": [
        "Load each file with proper data types",
        "Check column names against schema documentation",
        "Validate primary key uniqueness",
        "Test foreign key relationships"
    ],
    
    "📊 Initial Analysis Plan": [
        "Calculate basic statistics for each table",
        "Identify missing value patterns",
        "Test simple join operations",
        "Create first business insights"
    ]
}

for category, items in checklist_items.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  ☐ {item}")

print("\n" + "=" * 45)
print("🎯 Ready for Part 3: Loading Data from Multiple Tables")
print("\nNext session preview:")
print("  → Load all 9 Olist dataset files")
print("  → Perform data quality assessment")
print("  → Execute multi-table joins")
print("  → Create first business analysis")
print("  → Prepare for Major Group Assignment")

## 11. Key Takeaways

### 🎯 **Schema Understanding is Foundation**
- **Table Structure**: Each table has a specific purpose and relationship pattern
- **Data Types**: Understanding constraints helps predict data quality issues
- **Business Logic**: Schema reflects real e-commerce operations and constraints

### 🔗 **Relationships Enable Complex Analysis**
- **Primary Keys**: Ensure unique identification of records
- **Foreign Keys**: Enable joining related data across tables
- **Cardinality**: Understanding 1:1, 1:M, M:1 relationships prevents join errors

### 🗂️ **Multi-Table Strategy**
- **Central Entity**: Orders table serves as the hub for most analyses
- **Join Patterns**: Common patterns emerge for different business questions
- **Performance**: Table sizes and join order affect query performance

### 🔍 **Data Quality Awareness**
- **Missing Values**: Expected in certain fields, problematic in others
- **Referential Integrity**: Foreign key relationships must be validated
- **Business Rules**: Schema constraints reflect real-world business logic

### 📊 **SQL Knowledge Transfers**
- **JOIN Operations**: Directly translate to pandas merge operations
- **Query Planning**: Same logical approach applies to both SQL and Pandas
- **Performance Concepts**: Filtering, indexing, and optimization principles apply

## Next Steps

You now have the technical foundation to work with the Olist dataset effectively. In **Part 3**, you'll put this knowledge into practice by loading the actual data files and performing your first multi-table analyses.

**Coming up in Part 3: Loading Data from Multiple Tables**
- Efficient data loading techniques
- Data quality assessment in practice
- Your first multi-table business analysis
- Preparation for the Major Group Assignment

### 🚀 **You're Ready!**
With this deep understanding of the database schema and relationships, you're prepared to unlock the full analytical potential of the Olist dataset. The complexity you see here is what makes real-world data analysis both challenging and rewarding!