# Introduction to the Olist Dataset - Part 1: Overview

## Week 4, Day 2 (Thursday) - May 1st, 2025

### Overview
Welcome to your introduction to the Olist Brazilian E-commerce Public Dataset! This dataset will be your companion for the remainder of the course, providing real-world e-commerce data to practice and master data analysis techniques.

### Learning Objectives
By the end of this session, you will be able to:
- Understand the business context and scope of the Olist dataset
- Identify key business questions that can be answered with this data
- Recognize the multi-table structure and its benefits for analysis
- Connect e-commerce concepts to data analysis opportunities
- Appreciate the real-world complexity and richness of the dataset

### Prerequisites
- Data manipulation with Pandas (Weeks 2-3)
- Data reshaping and merging (Week 4, Day 1)
- SQL knowledge (especially JOIN operations)
- Basic understanding of e-commerce business models

## 1. What is Olist?

### Company Background
**Olist** is a Brazilian e-commerce platform that connects small and medium-sized businesses to major marketplaces. Think of it as a bridge between local sellers and large e-commerce platforms like Mercado Livre (Latin America's largest e-commerce platform).

### Business Model
- **For Sellers**: Provides access to major marketplaces without technical complexity
- **For Marketplaces**: Increases product variety and seller base
- **For Olist**: Earns commission on successful transactions

### Why This Dataset Matters
This dataset represents **real e-commerce transactions** from 2016-2018, providing:
- Authentic business scenarios and challenges
- Complex multi-table relationships (like real databases)
- Rich data for multiple types of analysis
- Cultural and geographic insights from the Brazilian market

## 2. Dataset Scope and Scale

### Time Period
- **Date Range**: September 2016 to October 2018
- **Duration**: ~2 years of e-commerce operations
- **Business Context**: Captures a period of rapid e-commerce growth in Brazil

### Scale Overview
- **~100,000 orders** from real customers
- **~33,000 products** across diverse categories
- **~3,000 sellers** from across Brazil
- **Multiple cities and states** represented
- **~100,000 customer reviews** with ratings and comments

### Geographic Coverage
- Covers all **27 Brazilian federative units** (26 states plus the Federal District)
- Major cities: São Paulo, Rio de Janeiro, Belo Horizonte, Brasília
- Rural and urban areas represented
- Logistics challenges across different regions

```python
# Let's start by importing our essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
# Suppress library warnings to keep notebook output readable
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 10)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
```

## 3. Business Context: E-commerce Ecosystem

### Key Players in the Ecosystem

1. **Customers** 🛒
   - Individual consumers making purchases
   - Located across Brazil with varying preferences

2. **Sellers** 🏪
   - Small to medium businesses
   - Various product categories and specializations
   - Different locations and operational capabilities

3. **Products** 📦
   - Diverse categories from electronics to home goods
   - Various price points and characteristics
   - Product information and categorization

4. **Orders** 📋
   - Complete purchase transactions
   - Include multiple products and payment methods
   - Status tracking through fulfillment

5. **Reviews** ⭐
   - Customer feedback on purchases
   - Ratings and detailed comments
   - Quality indicators for products and sellers

## 4. Types of Business Questions This Dataset Can Answer

### 📊 Customer Analytics
- **Customer Segmentation**: Who are our most valuable customers?
- **Geographic Patterns**: Which regions have the highest demand?
- **Purchase Behavior**: What do customers buy together?
- **Customer Lifetime Value**: How much is a customer worth over time?

### 🏪 Seller Performance
- **Top Performers**: Which sellers are most successful?
- **Geographic Distribution**: Where are the best sellers located?
- **Product Strategy**: What products drive seller success?
- **Review Impact**: How do reviews affect seller performance?

### 📦 Product Analysis
- **Category Performance**: Which product categories sell best?
- **Pricing Strategy**: How does price affect demand?
- **Seasonal Trends**: What are the seasonal buying patterns?
- **Product Success Factors**: What makes a product successful?

### 🚚 Operations & Logistics
- **Delivery Performance**: How long do deliveries take?
- **Geographic Challenges**: Which routes are most difficult?
- **Order Fulfillment**: What affects delivery success?
- **Payment Methods**: How do customers prefer to pay?

### ⭐ Quality & Satisfaction
- **Review Analysis**: What drives positive reviews?
- **Satisfaction Drivers**: What factors influence customer happiness?
- **Problem Areas**: Where are the biggest challenges?
- **Improvement Opportunities**: How can operations be optimized?
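
To make one of these questions concrete, here is a minimal sketch of how delivery performance could be related to customer satisfaction. The rows are invented toy data, not real Olist records. The column names (`order_purchase_timestamp`, `order_delivered_customer_date`, `order_estimated_delivery_date`, `review_score`) are assumed to match the public CSV files, so check them against your own copy.

```python
import pandas as pd

# Toy rows invented for illustration; real data comes from the Olist CSV files.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"]),
    "order_delivered_customer_date": pd.to_datetime(
        ["2018-01-08", "2018-01-20", "2018-01-10", "2018-01-30"]),
    "order_estimated_delivery_date": pd.to_datetime(
        ["2018-01-15", "2018-01-15", "2018-01-15", "2018-01-20"]),
})
reviews = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "review_score": [5, 2, 4, 1],
})

# Delivery time in whole days, plus a flag for orders that arrived after the estimate.
orders["delivery_days"] = (
    orders["order_delivered_customer_date"] - orders["order_purchase_timestamp"]
).dt.days
orders["is_late"] = (
    orders["order_delivered_customer_date"] > orders["order_estimated_delivery_date"]
)

# Join reviews to orders and compare average scores for late vs. on-time orders.
merged = orders.merge(reviews, on="order_id", how="inner")
score_by_lateness = merged.groupby("is_late")["review_score"].mean()
print(score_by_lateness)
```

With real data you would also need to decide how to treat orders that were never delivered, since their delivery date is missing.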

## 5. Dataset Structure Overview

The Olist dataset is organized into **9 interconnected tables**, similar to a well-designed relational database. This structure reflects real-world business data organization.

### Core Tables (5)

1. **🛒 Orders** (`olist_orders_dataset.csv`)
   - Main transaction records
   - Order status and timestamps
   - Links customers to their purchases

2. **📦 Order Items** (`olist_order_items_dataset.csv`)
   - Individual products within each order
   - Price, freight value, and shipping deadline for each item (one row per unit; there is no separate quantity column)
   - Links orders to specific products and sellers

3. **👥 Customers** (`olist_customers_dataset.csv`)
   - Customer information and location
   - Customer identifiers (`customer_id` is assigned per order; `customer_unique_id` identifies repeat customers)
   - Geographic distribution data

4. **🏪 Sellers** (`olist_sellers_dataset.csv`)
   - Seller information and location
   - ZIP code prefix, city, and state
   - Geographic distribution of sellers

5. **📋 Products** (`olist_products_dataset.csv`)
   - Product details and categorization
   - Physical characteristics (dimensions, weight)
   - Category information for analysis

### Additional Tables (4)

6. **⭐ Reviews** (`olist_order_reviews_dataset.csv`)
   - Customer feedback and ratings
   - Review text and timestamps
   - Quality indicators

7. **💳 Payments** (`olist_order_payments_dataset.csv`)
   - Payment method details
   - Installment information
   - Financial transaction data

8. **🌍 Geolocation** (`olist_geolocation_dataset.csv`)
   - Geographic coordinates
   - City and state information
   - Enables location-based analysis

9. **🏷️ Category Translation** (`product_category_name_translation.csv`)
   - Portuguese to English category names
   - Enables international analysis
   - Cultural context for products
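
As a preview of what loading these files might look like, the sketch below reads whichever of the nine CSVs it finds in a folder into a dictionary of DataFrames. The function name `load_olist_tables`, the short table labels, and the `"data"` folder are all assumptions made for this example; Part 3 covers loading in detail.

```python
from pathlib import Path
import pandas as pd

# File names as listed above; the short keys are labels chosen for this sketch.
OLIST_FILES = {
    "orders": "olist_orders_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "customers": "olist_customers_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "products": "olist_products_dataset.csv",
    "reviews": "olist_order_reviews_dataset.csv",
    "payments": "olist_order_payments_dataset.csv",
    "geolocation": "olist_geolocation_dataset.csv",
    "category_translation": "product_category_name_translation.csv",
}

def load_olist_tables(data_dir):
    """Read every Olist CSV found in data_dir into a dict of DataFrames."""
    data_dir = Path(data_dir)
    tables = {}
    for name, filename in OLIST_FILES.items():
        path = data_dir / filename
        if path.exists():  # skip files that have not been downloaded yet
            tables[name] = pd.read_csv(path)
    return tables

# "data" is a hypothetical folder name; point this at your own download location.
tables = load_olist_tables("data")
for name, df in tables.items():
    print(f"{name}: {df.shape[0]:,} rows x {df.shape[1]} columns")
```

Keeping the tables in a dictionary makes it easy to loop over all nine for profiling, while still letting you pull out a single table by name.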

```python
# Let's create a visual representation of the data structure
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

# Create a figure to show the table relationships
fig, ax = plt.subplots(1, 1, figsize=(14, 10))
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')

# Define table positions and information
tables = {
    'Orders': {'pos': (2, 6), 'color': '#FF6B6B', 'size': '~100K rows'},
    'Order Items': {'pos': (5, 6), 'color': '#4ECDC4', 'size': '~112K rows'},
    'Customers': {'pos': (0.5, 4), 'color': '#45B7D1', 'size': '~99K rows'},
    'Sellers': {'pos': (8, 4), 'color': '#96CEB4', 'size': '~3K rows'},
    'Products': {'pos': (5, 4), 'color': '#FFEAA7', 'size': '~32K rows'},
    'Reviews': {'pos': (2, 2), 'color': '#DDA0DD', 'size': '~100K rows'},
    'Payments': {'pos': (5, 2), 'color': '#F7DC6F', 'size': '~103K rows'},
    'Geolocation': {'pos': (8, 2), 'color': '#AED6F1', 'size': '~1M rows'},
    'Categories': {'pos': (2, 0.5), 'color': '#F8C471', 'size': '71 rows'}
}

# Draw tables
for table, info in tables.items():
    # Create fancy box
    box = FancyBboxPatch(
        (info['pos'][0]-0.4, info['pos'][1]-0.3),
        0.8, 0.6,
        boxstyle="round,pad=0.05",
        facecolor=info['color'],
        edgecolor='black',
        linewidth=1.5,
        alpha=0.8
    )
    ax.add_patch(box)

    # Add table name
    ax.text(info['pos'][0], info['pos'][1]+0.1, table,
            ha='center', va='center', fontsize=10, fontweight='bold')

    # Add size information
    ax.text(info['pos'][0], info['pos'][1]-0.1, info['size'],
            ha='center', va='center', fontsize=8, style='italic')

# Draw connections (simplified key relationships)
connections = [
    ('Orders', 'Order Items'),
    ('Orders', 'Customers'),
    ('Order Items', 'Sellers'),
    ('Order Items', 'Products'),
    ('Orders', 'Reviews'),
    ('Orders', 'Payments'),
    ('Products', 'Categories')
]

for start, end in connections:
    start_pos = tables[start]['pos']
    end_pos = tables[end]['pos']

    ax.annotate('', xy=end_pos, xytext=start_pos,
                arrowprops=dict(arrowstyle='->', lw=1.5, color='gray', alpha=0.7))

# Add title
ax.text(5, 7.5, 'Olist Dataset Structure Overview',
        ha='center', va='center', fontsize=16, fontweight='bold')

# Add legend
ax.text(0.5, 7, 'Core Tables:', fontsize=12, fontweight='bold')
ax.text(0.5, 6.7, '• Orders, Order Items, Customers', fontsize=10)
ax.text(0.5, 6.4, '• Sellers, Products', fontsize=10)

ax.text(0.5, 5.8, 'Supporting Tables:', fontsize=12, fontweight='bold')
ax.text(0.5, 5.5, '• Reviews, Payments', fontsize=10)
ax.text(0.5, 5.2, '• Geolocation, Categories', fontsize=10)

plt.tight_layout()
plt.show()

print("📊 Dataset Structure Visualization")
print("\nKey Relationships:")
print("→ Orders connect to Customers (who placed them)")
print("→ Orders connect to Order Items (what was ordered)")
print("→ Order Items connect to Products (what items are)")
print("→ Order Items connect to Sellers (who sells them)")
print("→ Orders connect to Reviews & Payments (transaction details)")
```

## 6. Why Multi-Table Structure Matters

### Database Normalization Benefits
The multi-table structure follows **database normalization principles**, which provide several advantages:

#### 1. **Reduced Data Redundancy**
- Customer information stored once, referenced by multiple orders
- Product details stored once, referenced by multiple order items
- Saves storage space and maintains consistency

#### 2. **Data Integrity**
- A customer's address is stored in one place, so a correction does not have to be repeated on every order row
- Product category changes propagate consistently
- Reduces errors and inconsistencies

#### 3. **Analytical Flexibility**
- Can analyze at different granular levels (customer, order, item)
- Easy to aggregate data in multiple ways
- Supports complex business questions

#### 4. **Real-World Representation**
- Mirrors how actual e-commerce databases are structured
- Prepares you for working with production systems
- Reflects business entity relationships
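
The "analytical flexibility" point can be sketched with a few invented item rows: the same table is aggregated once per order and once per seller. The values are toy data, and the `price` and `freight_value` column names are assumed to match the Order Items CSV.

```python
import pandas as pd

# Toy order items (invented values); one row per item sold.
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "order_item_id": [1, 2, 1, 1],
    "seller_id": ["s1", "s2", "s1", "s2"],
    "price": [50.0, 30.0, 20.0, 100.0],
    "freight_value": [10.0, 5.0, 8.0, 15.0],
})

# Order level: number of items and total value per order.
order_totals = order_items.groupby("order_id").agg(
    n_items=("order_item_id", "count"),
    items_value=("price", "sum"),
    freight=("freight_value", "sum"),
)

# Seller level: revenue per seller, computed from the same rows.
seller_revenue = order_items.groupby("seller_id")["price"].sum()

print(order_totals)
print(seller_revenue)
```

Because the item rows carry only keys and measures, either view is a single `groupby` away, with no duplicated customer or product details to work around.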

## 7. SQL Knowledge Connection

Since you have SQL experience, let's connect the dataset structure to familiar SQL concepts:

### Primary and Foreign Keys

| Table | Primary Key | Foreign Key(s) | Example Query |
|-------|-------------|----------------|---------------|
| **Orders** | `order_id` | `customer_id` | `SELECT * FROM orders` |
| **Order Items** | `order_id` + `order_item_id` | `order_id`, `product_id`, `seller_id` | `SELECT * FROM order_items` |
| **Customers** | `customer_id` | None | `SELECT * FROM customers` |
| **Sellers** | `seller_id` | None | `SELECT * FROM sellers` |
| **Products** | `product_id` | `product_category_name` (to Category Translation) | `SELECT * FROM products` |
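
CSV files do not enforce keys the way a database does, so it can be worth checking them yourself. The sketch below is one possible approach on invented toy tables; `is_unique_key` is a hypothetical helper written for this example, not part of pandas.

```python
import pandas as pd

# Toy tables invented for illustration.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c9"],  # "c9" has no matching customer row
})
customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"]})
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "order_item_id": [1, 2, 1],
})

def is_unique_key(df, columns):
    """Return True if the given column(s) identify each row uniquely."""
    return not df.duplicated(subset=columns).any()

# Primary key checks: order_id alone is not unique in order_items,
# but the composite key (order_id, order_item_id) is.
print(is_unique_key(orders, ["order_id"]))
print(is_unique_key(order_items, ["order_id"]))
print(is_unique_key(order_items, ["order_id", "order_item_id"]))

# Foreign key check: orders whose customer_id is missing from customers.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(orphans)
```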

### Common SQL Patterns You'll Use

```sql
-- Get order details with customer information
SELECT o.*, c.customer_city, c.customer_state
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```

**Pandas Equivalent:**
```python
pd.merge(orders, customers, on='customer_id')
```

```sql
-- Get order items with product details
SELECT oi.*, p.product_category_name, p.product_name_lenght  -- spelled "lenght" in the source CSV
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id;
```

**Pandas Equivalent:**
```python
pd.merge(order_items, products, on='product_id')
```

## 8. Brazilian E-commerce Context

Understanding the Brazilian market context enhances your analysis:

### Geographic Considerations
- **Vast Country**: Brazil is the 5th largest country by area
- **Population Centers**: São Paulo and Rio de Janeiro are major metropolitan areas
- **Regional Differences**: Different economic conditions across regions
- **Logistics Challenges**: Delivery times vary significantly by location

### Economic Context (2016-2018)
- **Economic Recovery**: Brazil was emerging from recession
- **E-commerce Growth**: Online shopping was rapidly expanding
- **Mobile Commerce**: Smartphone adoption driving mobile purchases
- **Payment Innovation**: Installment payments very popular

### Cultural Factors
- **Payment Preferences**: Credit cards with installments common
- **Review Culture**: Brazilian consumers actively leave reviews
- **Social Commerce**: Word-of-mouth and social recommendations important
- **Seasonal Patterns**: Southern Hemisphere seasons are reversed relative to North America/Europe, and local dates such as Dia dos Namorados (June 12) shape demand
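
The point about installment payments can be checked directly once the Payments table is loaded. Here is a minimal sketch on invented rows, assuming the `payment_type` and `payment_installments` column names used in the public CSV:

```python
import pandas as pd

# Toy payments (invented values), not real Olist records.
payments = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "payment_type": ["credit_card", "credit_card", "boleto", "credit_card", "voucher"],
    "payment_installments": [1, 6, 1, 10, 1],
    "payment_value": [80.0, 300.0, 45.0, 900.0, 20.0],
})

# Share of payments by method.
type_share = payments["payment_type"].value_counts(normalize=True)

# Among credit card payments, the fraction split into more than one installment.
cards = payments[payments["payment_type"] == "credit_card"]
installment_share = (cards["payment_installments"] > 1).mean()

print(type_share)
print(f"Credit card payments paid in installments: {installment_share:.0%}")
```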

```python
# Let's create a timeline of Brazilian e-commerce context
import matplotlib.pyplot as plt
import pandas as pd

# Create timeline data
timeline_events = {
    'Date': [
        '2016-09', '2016-12', '2017-03', '2017-06', '2017-09',
        '2017-12', '2018-03', '2018-06', '2018-09', '2018-10'
    ],
    'Event': [
        'Dataset Begins', 'Holiday Season', 'Economic Recovery',
        'Mid-Year Sales', 'Spring Shopping', 'Black Friday/Christmas',
        'Autumn Sales', 'Winter Promotions', 'Spring Return', 'Dataset Ends'
    ],
    'Context': [
        'Start of data collection', 'Peak shopping season',
        'Brazil emerging from recession', 'Traditional sales period',
        'Spring shopping season', 'Major retail events',
        'Seasonal promotions', 'Winter product focus',
        'Return to spring patterns', 'End of data collection'
    ]
}

timeline_df = pd.DataFrame(timeline_events)
timeline_df['Date'] = pd.to_datetime(timeline_df['Date'])

# Create timeline visualization
fig, ax = plt.subplots(1, 1, figsize=(15, 6))

# Plot timeline
y_pos = 1
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7',
          '#DDA0DD', '#F7DC6F', '#AED6F1', '#F8C471', '#FFB347']

for i, (date, event, context) in enumerate(zip(timeline_df['Date'],
                                               timeline_df['Event'],
                                               timeline_df['Context'])):
    # Plot point
    ax.scatter(date, y_pos, s=200, color=colors[i], alpha=0.8, zorder=3)

    # Add event label (alternate above and below the line)
    y_text = y_pos + 0.15 if i % 2 == 0 else y_pos - 0.15
    ax.text(date, y_text, event, ha='center', va='center',
            fontsize=9, fontweight='bold', rotation=45)

    # Add context (smaller text)
    y_context = y_pos + 0.25 if i % 2 == 0 else y_pos - 0.25
    ax.text(date, y_context, context, ha='center', va='center',
            fontsize=7, style='italic', rotation=45)

# Draw timeline line
ax.plot(timeline_df['Date'], [y_pos]*len(timeline_df), 'k-', alpha=0.3, linewidth=2)

# Formatting
ax.set_ylim(0.4, 1.6)
ax.set_xlabel('Time Period', fontsize=12)
ax.set_title('Olist Dataset Timeline: Brazilian E-commerce Context (2016-2018)',
             fontsize=14, fontweight='bold', pad=20)

# Remove y-axis
ax.set_yticks([])
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# Format x-axis
ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("📅 Dataset Timeline Context")
print("\nKey Insights:")
print("• Data spans 2+ years of Brazilian e-commerce")
print("• Includes multiple seasonal cycles")
print("• Captures economic recovery period")
print("• Shows established e-commerce patterns")
```

## 9. Data Quality and Authenticity

### What Makes This Dataset Special

#### ✅ **Real Business Data**
- Actual transactions from an operating e-commerce platform
- Authentic customer behavior patterns
- Real business challenges and complexities
- Genuine market dynamics

#### ✅ **Privacy Protected**
- All personal information anonymized
- Customers and sellers cannot be identified
- Geographic data aggregated appropriately
- Meets privacy and ethical standards

#### ✅ **Complete Business Cycle**
- Full customer journey from order to review
- Multiple payment methods and terms
- Complete logistics chain information
- Comprehensive feedback loop

#### ✅ **Research Quality**
- Publicly available for academic research
- Well-documented structure and relationships
- Suitable for teaching and learning
- Enables reproducible analysis

### Potential Data Challenges

#### ⚠️ **Real-World Messiness**
- Missing values in some records
- Inconsistent data entry formats
- Outliers and edge cases
- Natural business data variations
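
A quick way to see how much "messiness" a table contains is a per-column missing-value report. The sketch below runs on invented rows; `missing_report` is a hypothetical helper written for this example, and the `order_status` values shown are assumed to match those in the Orders CSV.

```python
import pandas as pd

# Toy orders (invented values); undelivered orders have no delivery date.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_status": ["delivered", "shipped", "delivered", "canceled"],
    "order_delivered_customer_date": ["2018-01-08", None, "2018-01-10", None],
})

def missing_report(df):
    """Count and percentage of missing values per column, largest first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    return report.sort_values("n_missing", ascending=False)

report = missing_report(orders)
print(report)

# Missing values often have a business reason: here, orders not yet delivered.
missing_by_status = (
    orders["order_delivered_customer_date"].isna()
    .groupby(orders["order_status"])
    .sum()
)
print(missing_by_status)
```

Breaking missing values down by a status column like this helps separate genuine data problems from gaps that are expected.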

#### ⚠️ **Brazilian Context**
- Portuguese language in some fields
- Local business practices and customs
- Different geographic and economic patterns
- Cultural context may affect interpretations
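
For the Portuguese category names, the Category Translation table can be joined onto Products. The sketch below uses invented rows and assumes the translation file's columns are `product_category_name` and `product_category_name_english`; the fallback label `"unknown"` is a choice made for this example.

```python
import pandas as pd

# Toy tables invented for illustration; p3 has no category recorded.
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["beleza_saude", "informatica_acessorios", None],
})
translation = pd.DataFrame({
    "product_category_name": ["beleza_saude", "informatica_acessorios"],
    "product_category_name_english": ["health_beauty", "computers_accessories"],
})

# A left join keeps products whose category is missing or untranslated.
products_en = products.merge(translation, on="product_category_name", how="left")

# Fall back to the Portuguese name, then to a placeholder label.
products_en["category"] = (
    products_en["product_category_name_english"]
    .fillna(products_en["product_category_name"])
    .fillna("unknown")
)
print(products_en[["product_id", "category"]])
```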

#### ⚠️ **Time Period Specifics**
- 2016-2018 economic and technology context
- May not reflect current patterns
- Historical perspective on e-commerce evolution
- Different from post-pandemic patterns

## 10. Learning Path with Olist Dataset

### This Course Journey

#### **Month 1** (Current)
- **Week 4**: Dataset introduction and basic exploration
- Learn to load and understand the structure
- Practice joining multiple tables
- Basic data quality assessment

#### **Month 2**
- **Weeks 5-6**: Data visualization with Olist data
- **Week 7**: Comprehensive exploratory data analysis
- **Week 8**: Statistical analysis and hypothesis testing

#### **Month 3**
- **Weeks 9-10**: Machine learning applications
- **Week 11**: Advanced e-commerce analytics
- **Week 12**: Project planning using Olist data

#### **Months 4-5**
- **Capstone Projects**: In-depth analysis of specific business questions
- Choose from 5 project categories (all using Olist data)
- Apply all learned techniques to real business problems

### Skills You'll Develop

#### **Technical Skills**
- Complex data manipulation and joining
- Advanced visualization techniques
- Statistical analysis and hypothesis testing
- Machine learning for business applications
- Time series analysis for seasonal patterns

#### **Business Skills**
- E-commerce analytics and KPIs
- Customer segmentation and analysis
- Seller and product performance evaluation
- Logistics and operations analysis
- Review and sentiment analysis

#### **Data Science Skills**
- End-to-end analysis projects
- Data storytelling and presentation
- Business recommendation development
- Performance monitoring and optimization
- Cross-functional collaboration

## 11. Preview: What's Coming Next

### Today's Remaining Sessions

#### **Part 2: Database Schema and Relationships**
- Detailed examination of each table structure
- Understanding primary and foreign key relationships
- Data types and field meanings
- Connection patterns between tables

#### **Part 3: Loading Data from Multiple Tables**
- Practical data loading techniques
- Handling CSV files efficiently
- Initial data quality checks
- Creating your first multi-table joins

### This Week's Assignment Preview
**Major Group Assignment**: Initial exploration of the Olist dataset
- Load all 9 dataset files
- Perform basic data profiling
- Create simple visualizations
- Answer fundamental business questions
- Present findings to the class

## 12. Reflection Questions

Take a moment to consider these questions about what you've learned:

### Business Understanding
1. **What type of business questions are you most excited to explore with this dataset?**
2. **How does the multi-table structure benefit business analysis compared to a single flat file?**
3. **What Brazilian market factors might affect your analysis and interpretations?**

### Technical Connections
4. **How does this dataset structure compare to databases you've worked with in SQL?**
5. **What data quality challenges do you anticipate with real-world e-commerce data?**
6. **Which table relationships seem most important for business analysis?**

### Learning Expectations
7. **What specific skills do you hope to develop through working with this dataset?**
8. **Which potential project areas (customer analysis, seller optimization, etc.) interest you most?**
9. **How might these skills apply to your career goals or current work?**

```python
# Interactive element: Create your first Olist analysis question
print("🎯 Your First Olist Analysis Challenge")
print("="*50)
print()
print("Based on what you've learned about the Olist dataset, think about:")
print()
print("1. ONE specific business question you'd like to answer")
print("2. Which tables you'd need to use")
print("3. What type of analysis or visualization would help")
print()
print("Example:")
print("❓ Question: 'Which product categories have the highest customer satisfaction?'")
print("📊 Tables needed: Products, Order Items, Reviews")
print("📈 Analysis: Average review scores by product category")
print()
print("Write your ideas below and discuss with your group!")
print()
print("Your Analysis Question:")
print("_" * 60)
print()
print("Tables Needed:")
print("_" * 60)
print()
print("Analysis Approach:")
print("_" * 60)
```
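
The example question in the cell above could be approached roughly as follows. This is only a sketch on invented rows, with column names assumed to match the public CSVs; it is not the assignment solution.

```python
import pandas as pd

# Toy tables invented for illustration.
products = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_category_name": ["beleza_saude", "informatica_acessorios"],
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "product_id": ["p1", "p1", "p2"],
})
reviews = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "review_score": [5, 4, 2],
})

# Order Items links products to orders; Reviews attach to orders.
category_scores = (
    order_items
    .merge(products, on="product_id", how="left")
    .merge(reviews, on="order_id", how="inner")
    .groupby("product_category_name")["review_score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(category_scores)
```

Reviews are recorded per order rather than per product, so in an order with several items the same score is attributed to every item. Keep that in mind when interpreting category averages.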

## 13. Key Takeaways

### 🎯 **Business Context is Crucial**
- Olist represents real e-commerce complexity
- Brazilian market context affects analysis interpretation
- Multi-stakeholder ecosystem creates rich analysis opportunities

### 🗄️ **Multi-Table Structure Enables Deep Analysis**
- Normalized database design supports flexible queries
- Different granularity levels (customer, order, item) provide multiple perspectives
- Real-world data complexity prepares you for professional environments

### 🔄 **SQL Knowledge Transfers Directly**
- Table relationships and keys work the same way
- JOIN patterns translate to pandas merge operations
- Database thinking applies to data analysis workflows

### 📈 **Rich Analysis Opportunities**
- Customer behavior and segmentation
- Seller performance optimization
- Product and category analysis
- Operations and logistics insights
- Review and satisfaction analysis

### 🎓 **Learning Progression**
- Gradual complexity increase throughout course
- Practical skills development with real data
- Capstone projects for portfolio development
- Business-focused analysis techniques

## Next Steps

In **Part 2**, we'll dive deep into the database schema and examine each table's structure in detail. You'll learn exactly what each field means and how the tables connect to enable powerful analysis.

In **Part 3**, you'll get your hands dirty loading the actual data and performing your first multi-table joins with the Olist dataset.

Get ready to move from data manipulation to business insights! 🚀