# Week 7 Major Group Assignment: Comprehensive Business Intelligence EDA Project

**Due Date:** Thursday, May 22, 2025  
**Points:** 100 points  
**Group Assignment:** Teams of 3-4 students  
**Submission:** Upload the completed notebook and supporting files (see Submission Requirements) to the course platform

## Assignment Overview

As a data analysis consulting team, you've been hired by Olist's executive leadership to conduct a comprehensive business intelligence analysis. Your goal is to provide strategic insights that will guide key business decisions for the next quarter.

## Business Context

Olist connects small businesses to major marketplaces in Brazil. The company needs to understand:
- Which customer segments drive the most value
- How product categories perform across different regions
- What seasonal patterns exist in their business
- Where operational improvements can increase customer satisfaction

## Requirements

Your analysis must integrate four major components:
1. **Customer Intelligence** (35 points)
2. **Product & Market Analysis** (35 points)
3. **Time Series & Forecasting** (20 points)
4. **Strategic Recommendations** (10 points)

## Deliverables

1. **Executive Summary** (300-500 words)
2. **Technical Analysis** (Code + Visualizations)
3. **Strategic Recommendations** (Bullet points with supporting data)
4. **Appendix** (Additional charts and technical details)


## Setup and Data Connection

**Security Note:** Always use environment variables for database credentials. Never hardcode sensitive information.

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Database connection
import psycopg2
import os
from sqlalchemy import create_engine

# Statistical analysis
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Time series
from datetime import datetime, timedelta
import calendar

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [None]:
# TODO: Set up secure database connection using environment variables
# IMPORTANT: Never hardcode credentials in your notebooks!

# Set environment variables if not already set (for Colab)
if 'SUPABASE_DB_HOST' not in os.environ:
    # Ask your instructor for these credentials, and reset them to placeholders
    # before sharing or submitting the notebook
    os.environ['SUPABASE_DB_HOST'] = 'your-host-here'
    os.environ['SUPABASE_DB_NAME'] = 'your-database-name'
    os.environ['SUPABASE_DB_USER'] = 'your-username'
    os.environ['SUPABASE_DB_PASSWORD'] = 'your-password'
    os.environ['SUPABASE_DB_PORT'] = '5432'

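# Create database connection
# URL.create escapes special characters in the password, which an f-string URL would not
from sqlalchemy import text
from sqlalchemy.engine import URL

DATABASE_URL = URL.create(
    "postgresql+psycopg2",
    username=os.environ['SUPABASE_DB_USER'],
    password=os.environ['SUPABASE_DB_PASSWORD'],
    host=os.environ['SUPABASE_DB_HOST'],
    port=int(os.environ['SUPABASE_DB_PORT']),
    database=os.environ['SUPABASE_DB_NAME'],
)
engine = create_engine(DATABASE_URL)

# create_engine() is lazy, so run a trivial query to confirm the connection actually works
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

print("✅ Database connection established successfully!")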

## Part 1: Customer Intelligence Analysis (35 points)

### 1.1 RFM Analysis and Customer Segmentation (15 points)

Perform a comprehensive RFM (Recency, Frequency, Monetary) analysis to identify customer segments.

In [None]:
# TODO: Load customer order data with payments and calculate RFM metrics
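# Hint: Join orders, order_items, order_payments, and customers tables
# Caution: an order can have several items and several payments, so joining both tables
# directly multiplies rows; aggregate payments (or items) per order before joining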

customer_rfm_query = """
-- TODO: Write SQL query to calculate RFM metrics for each customer
-- Requirements:
-- 1. Recency: Days since last purchase (use '2018-10-17' as analysis date)
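-- 2. Frequency: Total number of orders per customer
--    (if the database follows the public Olist schema, customer_unique_id identifies
--    a returning customer, while customer_id is unique per order)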
-- 3. Monetary: Total amount spent per customer
-- 4. Include customer_state for geographic insights
-- 5. Only include delivered orders
"""

# rfm_data = pd.read_sql_query(customer_rfm_query, engine)
# print(f"Loaded {len(rfm_data):,} customers for RFM analysis")
# rfm_data.head()
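
If you prefer to compute the RFM metrics in pandas rather than in SQL, the requirements above can be sketched as follows. This is a minimal illustration on a toy table, not the expected solution. The DataFrame `orders` and its column names (`customer_unique_id`, `order_purchase_timestamp`, `payment_value`) are assumptions based on the public Olist dataset and may differ in the course database.

```python
import pandas as pd

# Toy stand-in for the joined query result: one row per delivered order.
# Column names are illustrative assumptions, not the course schema.
orders = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b", "c", "c", "c"],
    "order_id": ["o1", "o2", "o3", "o4", "o5", "o6"],
    "order_purchase_timestamp": pd.to_datetime([
        "2018-01-10", "2018-06-01", "2017-12-25",
        "2018-03-03", "2018-08-15", "2018-10-01",
    ]),
    "payment_value": [100.0, 50.0, 300.0, 20.0, 35.0, 45.0],
})

analysis_date = pd.Timestamp("2018-10-17")

rfm_example = (
    orders.groupby("customer_unique_id")
    .agg(
        last_purchase=("order_purchase_timestamp", "max"),
        Frequency=("order_id", "nunique"),
        Monetary=("payment_value", "sum"),
    )
    .reset_index()
)
# Recency: whole days between the last purchase and the analysis date
rfm_example["Recency"] = (analysis_date - rfm_example["last_purchase"]).dt.days

print(rfm_example[["customer_unique_id", "Recency", "Frequency", "Monetary"]])
```

Counting distinct `order_id` values rather than rows keeps Frequency correct even if the input has more than one row per order.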

In [None]:
# TODO: Create RFM scores (1-5 quintiles) and customer segments
# Hint: Use pd.qcut() for scoring, then create segment labels

# Calculate RFM scores
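# rfm_data['R_Score'] = pd.qcut(rfm_data['Recency'], 5, labels=[5,4,3,2,1]).astype(int)  # Lower recency = higher score
# rfm_data['F_Score'] = pd.qcut(rfm_data['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5]).astype(int)  # rank() breaks ties so the bin edges stay unique
# rfm_data['M_Score'] = pd.qcut(rfm_data['Monetary'], 5, labels=[1,2,3,4,5]).astype(int)  # astype(int) turns the categorical labels into numbers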

# Create segment labels
# def get_rfm_segment(row):
#     if row['R_Score'] >= 4 and row['F_Score'] >= 4 and row['M_Score'] >= 4:
#         return 'Champions'
#     elif row['R_Score'] >= 3 and row['F_Score'] >= 3 and row['M_Score'] >= 3:
#         return 'Loyal Customers'
#     # TODO: Add more segment logic
#     else:
#         return 'Others'

# rfm_data['Segment'] = rfm_data.apply(get_rfm_segment, axis=1)
# print("\nCustomer Segments:")
# print(rfm_data['Segment'].value_counts())

In [None]:
# TODO: Create comprehensive visualizations of RFM analysis
# Requirements:
# 1. RFM distribution plots
# 2. Customer segment analysis
# 3. Geographic distribution of segments
# 4. Segment performance metrics

# Create subplot layout
# fig = make_subplots(
#     rows=2, cols=2,
#     subplot_titles=['RFM Distributions', 'Customer Segments', 'Geographic Analysis', 'Segment Performance'],
#     specs=[[{"secondary_y": False}, {"secondary_y": False}],
#            [{"secondary_y": False}, {"secondary_y": False}]]
# )

# TODO: Add your visualization code here

print("📊 RFM Analysis Visualization")
print("TODO: Create comprehensive RFM visualizations")
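
Before plotting, it usually helps to build the tables that the charts will show. The sketch below covers requirements 3 and 4 above (geographic distribution and segment performance) on a hypothetical scored table. `scored` stands in for `rfm_data` after segmentation, and the derived names `segment_summary` and `state_mix` are illustrative.

```python
import pandas as pd

# Hypothetical scored RFM table; in the notebook this would be rfm_data
scored = pd.DataFrame({
    "Segment": ["Champions", "Champions", "Loyal Customers", "Others", "Others"],
    "customer_state": ["SP", "RJ", "SP", "MG", "SP"],
    "Recency": [10, 20, 60, 200, 300],
    "Frequency": [5, 4, 3, 1, 1],
    "Monetary": [900.0, 700.0, 400.0, 50.0, 150.0],
})

# Segment performance: size, average behaviour, and revenue contribution
segment_summary = scored.groupby("Segment").agg(
    customers=("Monetary", "size"),
    avg_recency=("Recency", "mean"),
    avg_frequency=("Frequency", "mean"),
    total_revenue=("Monetary", "sum"),
)
segment_summary["revenue_share"] = (
    segment_summary["total_revenue"] / segment_summary["total_revenue"].sum()
)

# Segment mix per state as row proportions, a convenient input for a heatmap
state_mix = pd.crosstab(scored["customer_state"], scored["Segment"], normalize="index")

print(segment_summary)
print(state_mix)
```

Either table can be passed to a plotting call such as a bar chart or heatmap in the library of your choice.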

### 1.2 Advanced Customer Clustering (10 points)

Apply machine learning clustering to identify customer behavioral patterns beyond RFM.

In [None]:
# TODO: Extend customer analysis with additional behavioral metrics
# Requirements:
# 1. Average order value
# 2. Purchase frequency patterns
# 3. Product category preferences
# 4. Time between orders
# 5. Customer lifetime value

advanced_customer_query = """
-- TODO: Create query with advanced customer behavioral metrics
-- Include product category diversity, payment preferences, etc.
"""

# advanced_customer_data = pd.read_sql_query(advanced_customer_query, engine)
print("TODO: Load advanced customer behavioral data")
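
Several of the behavioral metrics listed above can be derived in pandas from an order-item extract. The following is a rough sketch on toy data. The table `items` and its columns are assumptions for illustration, and customer lifetime value is left out here because section 1.3 covers it.

```python
import pandas as pd

# Hypothetical order-item level extract; column names are assumptions
items = pd.DataFrame({
    "customer_unique_id": ["a", "a", "a", "b", "b"],
    "order_id": ["o1", "o2", "o2", "o3", "o4"],
    "order_date": pd.to_datetime(
        ["2018-01-01", "2018-01-31", "2018-01-31", "2018-02-01", "2018-02-11"]
    ),
    "product_category": ["toys", "toys", "books", "garden", "garden"],
    "price": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Collapse to one row per order so order-level metrics are not inflated by item rows
order_level = (
    items.groupby(["customer_unique_id", "order_id"], as_index=False)
    .agg(order_date=("order_date", "first"), order_value=("price", "sum"))
    .sort_values(["customer_unique_id", "order_date"])
)
# Days since the same customer's previous order (NaN for the first order)
order_level["days_since_prev"] = (
    order_level.groupby("customer_unique_id")["order_date"].diff().dt.days
)

behavior = order_level.groupby("customer_unique_id").agg(
    n_orders=("order_id", "nunique"),
    avg_order_value=("order_value", "mean"),
    avg_days_between_orders=("days_since_prev", "mean"),
)
# Number of distinct categories each customer has bought from
behavior["category_diversity"] = (
    items.groupby("customer_unique_id")["product_category"].nunique()
)

print(behavior)
```

Customers with a single order get `NaN` for `avg_days_between_orders`; decide explicitly how to treat them before clustering.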

In [None]:
# TODO: Apply K-means clustering with optimal cluster selection
# Requirements:
# 1. Use elbow method to determine optimal clusters
# 2. Standardize features before clustering
# 3. Apply PCA for dimensionality reduction
# 4. Interpret cluster characteristics

# Example structure:
# scaler = StandardScaler()
# features_scaled = scaler.fit_transform(features)
# 
# # Elbow method
# inertias = []
# for k in range(2, 11):
#     kmeans = KMeans(n_clusters=k, random_state=42)
#     kmeans.fit(features_scaled)
#     inertias.append(kmeans.inertia_)

print("TODO: Implement advanced customer clustering")
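
The four clustering requirements above can be tried end to end on synthetic data before applying them to real customers. The sketch below is one possible approach, not a prescribed one. The "largest second difference" rule for picking `best_k` is a crude heuristic chosen for illustration, and the synthetic `features` array is made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features: three well-separated groups in four dimensions
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0, 0], [8, 8, 0, 0], [0, 8, 8, 8]], dtype=float)
features = np.vstack([c + rng.normal(scale=0.5, size=(50, 4)) for c in centers])

# Standardize so that no single feature dominates the distance computation
features_scaled = StandardScaler().fit_transform(features)

# Elbow method: record inertia for each candidate k
k_values = list(range(2, 8))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(features_scaled)
    inertias.append(km.inertia_)

# Crude elbow heuristic (an assumption): the k where the inertia curve bends most
second_diff = np.diff(inertias, n=2)
best_k = k_values[int(np.argmax(second_diff)) + 1]

labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(features_scaled)

# PCA down to two components, used here only to plot the clusters
coords = PCA(n_components=2).fit_transform(features_scaled)

print(best_k, coords.shape)
```

On real data the elbow is rarely this clear, so it is worth plotting `inertias` against `k_values` and checking the choice against a second criterion such as the silhouette score.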

### 1.3 Customer Lifetime Value Analysis (10 points)

Calculate and analyze customer lifetime value across different segments.

In [None]:
# TODO: Calculate Customer Lifetime Value (CLV)
# Requirements:
# 1. Historical CLV calculation
# 2. Predictive CLV modeling
# 3. CLV by customer segment
# 4. Geographic CLV analysis
# 5. Retention rate analysis

clv_query = """
-- TODO: Create query to calculate CLV components
-- Include customer acquisition dates, order history, churn indicators
"""

print("TODO: Implement CLV analysis")
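
As a starting point for the CLV requirements above, the sketch below computes historical CLV, a repeat-purchase rate, and a simple forward-looking estimate on toy data. The formula "average order value x orders per customer x assumed lifespan" is a common textbook heuristic and not a substitute for a proper predictive model. `ASSUMED_LIFESPAN_PERIODS` is a placeholder value, and all table and column names are illustrative.

```python
import pandas as pd

# Hypothetical order-level table with a segment label already attached
orders = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b", "c", "c", "c"],
    "segment": ["Champions", "Champions", "Others", "Loyal", "Loyal", "Loyal"],
    "order_value": [100.0, 50.0, 300.0, 20.0, 35.0, 45.0],
})

# Historical CLV: total observed spend per customer
per_customer = orders.groupby(["customer_unique_id", "segment"], as_index=False).agg(
    historical_clv=("order_value", "sum"),
    n_orders=("order_value", "size"),
)

# Share of customers with more than one order, a rough retention proxy
repeat_rate = (per_customer["n_orders"] > 1).mean()

# Heuristic forward-looking CLV (assumption: behaviour stays constant)
avg_order_value = orders["order_value"].mean()
orders_per_customer = per_customer["n_orders"].mean()
ASSUMED_LIFESPAN_PERIODS = 2  # placeholder, replace with a justified estimate
simple_predictive_clv = avg_order_value * orders_per_customer * ASSUMED_LIFESPAN_PERIODS

clv_by_segment = per_customer.groupby("segment")["historical_clv"].mean()

print(per_customer)
print(f"Repeat rate: {repeat_rate:.2f}, simple predictive CLV: {simple_predictive_clv:.2f}")
```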

## Part 2: Product & Market Analysis (35 points)

### 2.1 Product Portfolio Analysis (15 points)

Analyze product performance using portfolio management frameworks.

In [None]:
# TODO: Load product performance data
# Requirements:
# 1. Revenue by product category
# 2. Growth rates
# 3. Market share analysis
# 4. Profitability metrics
# 5. Customer satisfaction by category

product_portfolio_query = """
-- TODO: Create comprehensive product performance query
-- Include revenue, growth, market share, satisfaction metrics
"""

# product_data = pd.read_sql_query(product_portfolio_query, engine)
print("TODO: Load product portfolio data")

In [None]:
# TODO: Create BCG-style portfolio matrix
# Requirements:
# 1. Market growth vs market share analysis
# 2. Classify products as Stars, Cash Cows, Question Marks, Dogs
# 3. Revenue bubble chart visualization
# 4. Strategic recommendations for each quadrant

# Example BCG matrix structure:
# fig = px.scatter(product_data, 
#                  x='market_share', 
#                  y='growth_rate',
#                  size='revenue',
#                  color='bcg_category',
#                  hover_data=['product_category'],
#                  title='Product Portfolio Matrix (BCG Style)')

print("TODO: Create BCG portfolio matrix")
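
The `bcg_category` column used in the scatter example above has to be derived first. One way to do it is sketched below on made-up numbers. Splitting the quadrants at the median share and median growth is an assumption made for illustration; a fixed threshold or relative share against the largest category are equally defensible. Note also that "market share" here is only a category's share of platform revenue, since external market data is not available.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue per category for two consecutive periods
product_example = pd.DataFrame({
    "product_category": ["bed_bath", "toys", "electronics", "books"],
    "revenue_prev": [100.0, 50.0, 20.0, 40.0],
    "revenue": [110.0, 90.0, 40.0, 30.0],
})
product_example["market_share"] = (
    product_example["revenue"] / product_example["revenue"].sum()
)
product_example["growth_rate"] = (
    product_example["revenue"] / product_example["revenue_prev"] - 1
)

# Quadrant cut-offs (assumption: split at the median of each axis)
high_share = product_example["market_share"] > product_example["market_share"].median()
high_growth = product_example["growth_rate"] > product_example["growth_rate"].median()

product_example["bcg_category"] = np.select(
    [high_share & high_growth, high_share & ~high_growth, ~high_share & high_growth],
    ["Star", "Cash Cow", "Question Mark"],
    default="Dog",
)

print(product_example[["product_category", "market_share", "growth_rate", "bcg_category"]])
```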

### 2.2 Market Basket Analysis (10 points)

Identify product associations and cross-selling opportunities.

In [None]:
# TODO: Implement market basket analysis
# Requirements:
# 1. Calculate support, confidence, and lift for product associations
# 2. Identify frequently bought together products
# 3. Network analysis of product relationships
# 4. Cross-selling recommendations

# Market basket query
basket_query = """
-- TODO: Create query to identify products bought together
-- Group by order_id and collect product categories
"""

print("TODO: Implement market basket analysis")
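
Support, confidence, and lift for pairs of categories can be computed with the standard library alone, which makes the definitions easy to check by hand. The sketch below uses a handful of invented baskets. In the notebook, `baskets` would be built from the query result by collecting the categories of each `order_id`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: the set of product categories in each order
baskets = [
    {"bed_bath", "furniture"},
    {"bed_bath", "furniture", "toys"},
    {"bed_bath"},
    {"toys"},
    {"furniture", "toys"},
]
n = len(baskets)

# How many baskets contain each item, and each unordered pair of items
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

rules = []
for (a, b), count in pair_counts.items():
    support = count / n  # P(A and B)
    for antecedent, consequent in ((a, b), (b, a)):
        confidence = count / item_counts[antecedent]  # P(B | A)
        lift = confidence / (item_counts[consequent] / n)  # P(B | A) / P(B)
        rules.append((antecedent, consequent, support, confidence, lift))

# Strongest associations first
rules.sort(key=lambda r: r[4], reverse=True)
for antecedent, consequent, support, confidence, lift in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two categories appear together more often than independence would predict. Most orders in marketplace data contain a single item, so expect low support values and filter out rules backed by very few baskets.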

### 2.3 Geographic Market Analysis (10 points)

Analyze product performance across different regions and states.

In [None]:
# TODO: Geographic product performance analysis
# Requirements:
# 1. Revenue by state and region
# 2. Product category preferences by geography
# 3. Market penetration analysis
# 4. Logistics and delivery performance
# 5. Regional growth opportunities

geographic_query = """
-- TODO: Create geographic analysis query
-- Include state-level performance, delivery metrics
"""

print("TODO: Implement geographic market analysis")
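
State-level revenue and delivery performance (requirements 1 and 4 above) can be summarized with a single grouped aggregation. The sketch below uses invented rows. The column names follow the public Olist CSV files and are an assumption about the course database.

```python
import pandas as pd

# Hypothetical order-level extract; column names are assumed
deliveries = pd.DataFrame({
    "customer_state": ["SP", "SP", "RJ", "RJ", "BA"],
    "price": [100.0, 50.0, 80.0, 20.0, 60.0],
    "order_purchase_timestamp": pd.to_datetime(["2018-01-01"] * 5),
    "order_delivered_customer_date": pd.to_datetime(
        ["2018-01-06", "2018-01-08", "2018-01-15", "2018-01-11", "2018-01-21"]
    ),
    "order_estimated_delivery_date": pd.to_datetime(
        ["2018-01-10", "2018-01-10", "2018-01-12", "2018-01-12", "2018-01-20"]
    ),
})

# Days from purchase to delivery, and whether delivery missed the estimate
deliveries["delivery_days"] = (
    deliveries["order_delivered_customer_date"] - deliveries["order_purchase_timestamp"]
).dt.days
deliveries["is_late"] = (
    deliveries["order_delivered_customer_date"]
    > deliveries["order_estimated_delivery_date"]
)

state_summary = (
    deliveries.groupby("customer_state")
    .agg(
        revenue=("price", "sum"),
        orders=("price", "size"),
        avg_delivery_days=("delivery_days", "mean"),
        late_rate=("is_late", "mean"),
    )
    .sort_values("revenue", ascending=False)
)

print(state_summary)
```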

## Part 3: Time Series & Forecasting Analysis (20 points)

### 3.1 Seasonal Pattern Analysis (10 points)

Identify and analyze seasonal patterns in sales data.

In [None]:
# TODO: Time series analysis of sales patterns
# Requirements:
# 1. Daily, weekly, monthly sales trends
# 2. Seasonal decomposition
# 3. Holiday and special event analysis
# 4. Category-specific seasonality
# 5. Regional seasonal differences

time_series_query = """
-- TODO: Create time series query
-- Include daily sales, product categories, regions
"""

# time_series_data = pd.read_sql_query(time_series_query, engine)
print("TODO: Load time series data")

In [None]:
# TODO: Seasonal decomposition analysis
# Requirements:
# 1. Trend, seasonal, and residual components
# 2. Multiple seasonality detection (weekly, monthly, yearly)
# 3. Seasonal strength metrics
# 4. Anomaly detection in seasonal patterns

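# Example seasonal decomposition for a daily series:
# from statsmodels.tsa.seasonal import seasonal_decompose
# `period` is the cycle length in observations: 7 captures weekly seasonality in daily data
# decomposition = seasonal_decompose(time_series_data['sales'], period=7)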

print("TODO: Implement seasonal decomposition")
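
To see what the trend, seasonal, and residual components mean, it can help to build a classical additive decomposition by hand once. The sketch below does this with pandas on a synthetic daily series that has a known weekly pattern. It is a simplified illustration of the idea, not a replacement for a library implementation, and the series is invented.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales: linear trend plus a fixed weekday effect (Mon..Sun)
dates = pd.date_range("2018-01-01", periods=84, freq="D")  # 12 full weeks
weekday_effect = np.array([0, 0, 0, 0, 5, 20, 10], dtype=float)
sales = pd.Series(
    100 + 0.5 * np.arange(84) + weekday_effect[dates.dayofweek], index=dates
)

# Trend: centered 7-day moving average, which averages out the weekly cycle
trend = sales.rolling(window=7, center=True).mean()

# Seasonal component: average detrended value for each day of the week
detrended = sales - trend
seasonal_index = detrended.groupby(detrended.index.dayofweek).mean()
seasonal = pd.Series(seasonal_index.reindex(dates.dayofweek).to_numpy(), index=dates)

# Residual: whatever trend and seasonality do not explain
residual = sales - trend - seasonal

print(seasonal_index)
```

The seasonal index here is relative to the weekly average, so the strongest day shows up as the largest positive value. Large residuals on real data are candidates for the anomaly check in requirement 4.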

### 3.2 Forecasting and Business Planning (10 points)

Create forecasts for business planning purposes.

In [None]:
# TODO: Implement forecasting models
# Requirements:
# 1. Multiple forecasting approaches (moving average, exponential smoothing, ARIMA)
# 2. Forecast accuracy metrics
# 3. Confidence intervals
# 4. Scenario planning (best/worst/most likely)
# 5. Business impact projections

# Example forecasting structure:
# from statsmodels.tsa.arima.model import ARIMA
# from statsmodels.tsa.holtwinters import ExponentialSmoothing

print("TODO: Implement forecasting models")
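
Whichever models you choose, compare them on a holdout period rather than on the data they were fitted to. The sketch below shows that workflow with two deliberately simple baselines written in NumPy: a moving average and simple exponential smoothing. The series, the window of 3, and the smoothing factor of 0.5 are arbitrary illustrative choices.

```python
import numpy as np

def moving_average_forecast(history, window, horizon):
    """Flat forecast equal to the mean of the last `window` observations."""
    return np.full(horizon, np.mean(history[-window:]))

def simple_exp_smoothing_forecast(history, alpha, horizon):
    """Flat forecast from simple exponential smoothing with factor `alpha`."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return np.full(horizon, level)

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

# Invented monthly sales; the last three points are held out for evaluation
series = np.array(
    [100, 102, 101, 105, 107, 110, 108, 112, 115, 117, 116, 120], dtype=float
)
train, test = series[:-3], series[-3:]

forecasts = {
    "moving_average_3": moving_average_forecast(train, window=3, horizon=3),
    "exp_smoothing_0.5": simple_exp_smoothing_forecast(train, alpha=0.5, horizon=3),
}
scores = {name: mae(test, fc) for name, fc in forecasts.items()}

for name, score in scores.items():
    print(f"{name}: MAE = {score:.2f}")
```

Both baselines produce flat forecasts and therefore lag behind a trending series, which makes them a useful reference point when judging whether ARIMA or Holt-Winters adds value.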

## Part 4: Strategic Recommendations (10 points)

### 4.1 Executive Summary

**TODO: Write your executive summary here (300-500 words)**

Your executive summary should include:
- Key findings from customer intelligence analysis
- Critical insights from product/market analysis
- Important patterns from time series analysis
- Top 3 strategic recommendations
- Expected business impact

*Replace this text with your executive summary*

### 4.2 Strategic Recommendations

**TODO: Provide specific, actionable recommendations based on your analysis**

#### Customer Strategy
- **Recommendation 1:** [Based on RFM/clustering analysis]
- **Supporting Data:** [Specific metrics and visualizations]
- **Expected Impact:** [Quantified business impact]

#### Product Strategy
- **Recommendation 2:** [Based on portfolio/market analysis]
- **Supporting Data:** [Specific metrics and visualizations]
- **Expected Impact:** [Quantified business impact]

#### Operational Strategy
- **Recommendation 3:** [Based on time series/forecasting]
- **Supporting Data:** [Specific metrics and visualizations]
- **Expected Impact:** [Quantified business impact]

*Replace this template with your specific recommendations*

### 4.3 Implementation Roadmap

**TODO: Create a timeline for implementing your recommendations**

#### Phase 1 (Immediate - 0-3 months)
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

#### Phase 2 (Short-term - 3-6 months)
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

#### Phase 3 (Medium-term - 6-12 months)
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

*Replace this template with your implementation roadmap*

## Appendix: Additional Analysis

### A.1 Detailed Statistical Analysis

In [None]:
# TODO: Include additional statistical tests and analysis
# Examples:
# - Correlation analysis between key metrics
# - Statistical significance tests
# - Confidence intervals for key estimates
# - Sensitivity analysis

print("TODO: Add detailed statistical analysis")
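
For the confidence-interval item above, a percentile bootstrap is one option that makes few distributional assumptions, which suits skewed quantities such as order value. The following is a minimal sketch. The function name `bootstrap_mean_ci` is hypothetical and the lognormal sample merely imitates skewed order values.

```python
import numpy as np

def bootstrap_mean_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row holds the indices of one resample drawn with replacement
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    lower = np.percentile(means, 100 * (1 - confidence) / 2)
    upper = np.percentile(means, 100 * (1 + confidence) / 2)
    return float(lower), float(upper)

# Invented, right-skewed "order values"
rng = np.random.default_rng(1)
order_values = rng.lognormal(mean=4.5, sigma=0.6, size=200)

low, high = bootstrap_mean_ci(order_values)
print(f"Mean order value: {order_values.mean():.2f} (95% CI {low:.2f} to {high:.2f})")
```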

### A.2 Technical Implementation Details

In [None]:
# TODO: Document your technical approach and methodology
# Include:
# - Data quality checks performed
# - Assumptions made in analysis
# - Limitations of the analysis
# - Validation steps taken

print("TODO: Document technical implementation")
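
The data quality checks mentioned above are easier to document if they are produced by code rather than described from memory. The sketch below shows one small helper. `data_quality_report` is a hypothetical function written for this illustration, and the sample table is invented.

```python
import pandas as pd

def data_quality_report(df, key_columns):
    """Summarize row count, duplicate keys, and missing values for one table."""
    missing = df.isna().sum()
    return {
        "rows": len(df),
        "duplicate_keys": int(df.duplicated(subset=key_columns).sum()),
        "missing_by_column": {col: int(n) for col, n in missing.items() if n > 0},
    }

# Invented sample with one duplicated key and a few missing values
sample = pd.DataFrame({
    "order_id": ["o1", "o2", "o2", "o3"],
    "order_delivered_customer_date": ["2018-01-06", None, None, "2018-01-09"],
    "price": [10.0, 20.0, 20.0, None],
})

report = data_quality_report(sample, key_columns=["order_id"])
print(report)
```

Running the helper on each table you load, and pasting the output into this section, gives graders a concrete record of what was checked.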

## Grading Rubric (100 points total)

### Customer Intelligence Analysis (35 points)
- **RFM Analysis & Segmentation (15 points)**
  - Excellent (13-15): Comprehensive RFM with clear segmentation and actionable insights
  - Good (10-12): Solid RFM analysis with adequate segmentation
  - Satisfactory (7-9): Basic RFM analysis with some segmentation issues
  - Needs Improvement (0-6): Incomplete or incorrect RFM analysis

- **Advanced Clustering (10 points)**
  - Excellent (9-10): Sophisticated clustering with optimal selection and interpretation
  - Good (7-8): Good clustering approach with reasonable interpretation
  - Satisfactory (5-6): Basic clustering with limited interpretation
  - Needs Improvement (0-4): Poor or missing clustering analysis

- **CLV Analysis (10 points)**
  - Excellent (9-10): Comprehensive CLV with predictive modeling and strategic insights
  - Good (7-8): Solid CLV calculation with good analysis
  - Satisfactory (5-6): Basic CLV calculation with limited insights
  - Needs Improvement (0-4): Incomplete or incorrect CLV analysis

### Product & Market Analysis (35 points)
- **Portfolio Analysis (15 points)**
  - Excellent (13-15): Sophisticated portfolio analysis with strategic framework
  - Good (10-12): Good portfolio analysis with clear insights
  - Satisfactory (7-9): Basic portfolio analysis
  - Needs Improvement (0-6): Incomplete portfolio analysis

- **Market Basket Analysis (10 points)**
  - Excellent (9-10): Comprehensive association analysis with actionable recommendations
  - Good (7-8): Good market basket analysis with insights
  - Satisfactory (5-6): Basic association analysis
  - Needs Improvement (0-4): Poor or missing market basket analysis

- **Geographic Analysis (10 points)**
  - Excellent (9-10): Thorough geographic analysis with regional insights
  - Good (7-8): Good geographic analysis
  - Satisfactory (5-6): Basic geographic analysis
  - Needs Improvement (0-4): Incomplete geographic analysis

### Time Series & Forecasting (20 points)
- **Seasonal Analysis (10 points)**
  - Excellent (9-10): Comprehensive seasonal decomposition with business insights
  - Good (7-8): Good seasonal analysis
  - Satisfactory (5-6): Basic seasonal patterns identified
  - Needs Improvement (0-4): Poor seasonal analysis

- **Forecasting (10 points)**
  - Excellent (9-10): Multiple forecasting methods with accuracy assessment
  - Good (7-8): Solid forecasting approach
  - Satisfactory (5-6): Basic forecasting
  - Needs Improvement (0-4): Poor or missing forecasting

### Strategic Recommendations (10 points)
- **Executive Summary & Recommendations (10 points)**
  - Excellent (9-10): Clear, actionable recommendations with strong business justification
  - Good (7-8): Good recommendations with adequate support
  - Satisfactory (5-6): Basic recommendations
  - Needs Improvement (0-4): Weak or missing recommendations

### Additional Considerations
- **Code Quality**: Clean, well-commented code with proper error handling
- **Visualizations**: Professional, clear charts that support the analysis
- **Business Understanding**: Demonstrates understanding of business context
- **Team Collaboration**: Evidence of effective teamwork and contribution balance

### Submission Requirements
1. **Completed Jupyter Notebook** with all code cells executed
2. **Team Member Contributions** document (who did what)
3. **Presentation Slides** (10-15 slides) summarizing key findings
4. **Data Files** (if you created any derived datasets)

### Late Submission Policy
- 10% penalty per day late
- No submissions accepted after 3 days late
- Contact instructor immediately if you anticipate difficulties

---

**Good luck with your analysis! Remember to think like business consultants and provide actionable insights that Olist's leadership can implement.**