# Week 7 Minor Assignment: Customer Satisfaction EDA by Product Category

## Assignment Overview
**Due Date**: End of Wednesday Session  
**Type**: Individual Assignment  
**Points**: 25 points  
**Estimated Time**: 60-90 minutes

## Learning Objectives
By completing this assignment, you will demonstrate your ability to:
- Apply structured EDA techniques to analyze customer satisfaction patterns
- Use descriptive statistics to generate business insights
- Create meaningful visualizations for categorical and numerical data analysis
- Identify actionable recommendations based on data analysis

## Business Context
As a data analyst for Olist, you've been tasked with investigating **customer satisfaction patterns across different product categories**. The business team wants to understand:
- Which product categories have the highest/lowest customer satisfaction?
- What factors might be driving satisfaction differences?
- How can the company improve customer experience based on these insights?

Your analysis will directly inform product management and customer experience strategies.

## Assignment Requirements

### Part 1: Data Loading and Initial Exploration (5 points)
- [ ] Load data from the Supabase database using **environment variables** for credentials
- [ ] Perform initial data quality assessment
- [ ] Display basic dataset information (shape, columns, data types)
- [ ] Show sample data and summary statistics

### Part 2: Customer Satisfaction Analysis by Category (10 points)
- [ ] Calculate satisfaction metrics by product category
- [ ] Create appropriate visualizations (bar charts, box plots, etc.)
- [ ] Identify top 5 and bottom 5 categories by satisfaction
- [ ] Analyze satisfaction distribution patterns

### Part 3: Advanced Analysis and Insights (7 points)
- [ ] Investigate factors that might influence satisfaction (price, delivery time, etc.)
- [ ] Perform correlation analysis between satisfaction and other variables
- [ ] Create at least one advanced visualization (heatmap, scatter plot with categories, etc.)

### Part 4: Business Recommendations (3 points)
- [ ] Provide 3-5 specific, actionable business recommendations
- [ ] Support recommendations with data evidence
- [ ] Consider feasibility and potential impact

## Submission Guidelines
- Complete all code cells with proper comments
- Ensure all visualizations have titles and axis labels, plus a legend wherever more than one series is plotted
- Write clear markdown explanations for your findings
- Test your code to ensure it runs without errors
- Submit the completed notebook file

## Starter Code and Setup

In [None]:
# Required imports - DO NOT MODIFY
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Setup complete! Ready to start your analysis.")

## Part 1: Data Loading and Initial Exploration (5 points)

**Instructions**: 
1. Set up secure database connection using environment variables
2. Load customer satisfaction data with product categories
3. Perform initial data exploration

In [None]:
# TODO: Set up secure database connection
# Use environment variables for database credentials
# HINT: Follow the pattern from the lecture notebooks

# Set environment variables (for educational purposes)
if 'SUPABASE_DB_HOST' not in os.environ:
    # TODO: Set the environment variables properly
    # Remember: Never hardcode credentials in production!
    pass

# TODO: Create database engine using environment variables
# DATABASE_URL = f"postgresql://..."
# engine = create_engine(DATABASE_URL)

# TODO: Test the connection
print("Database connection setup complete!")
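
One way to assemble the connection string is sketched below. Only `SUPABASE_DB_HOST` appears in the starter cell; the other variable names (`SUPABASE_DB_USER`, `SUPABASE_DB_PASSWORD`, `SUPABASE_DB_PORT`, `SUPABASE_DB_NAME`) and the helper `build_database_url` are hypothetical, so adapt them to the pattern used in the lecture notebooks.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables.

    Variable names other than SUPABASE_DB_HOST are assumptions for this sketch.
    """
    required = [
        "SUPABASE_DB_USER",
        "SUPABASE_DB_PASSWORD",
        "SUPABASE_DB_HOST",
        "SUPABASE_DB_NAME",
    ]
    # Fail early with a clear message instead of building a broken URL
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    user = quote_plus(env["SUPABASE_DB_USER"])
    # Percent-encode the password so characters such as '@' or '/' do not break the URL
    password = quote_plus(env["SUPABASE_DB_PASSWORD"])
    host = env["SUPABASE_DB_HOST"]
    port = env.get("SUPABASE_DB_PORT", "5432")  # default PostgreSQL port
    name = env["SUPABASE_DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

Passing the returned string to `create_engine` and running a trivial query such as `SELECT 1` is one way to test the connection before loading any data.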

In [None]:
# TODO: Write SQL query to load customer satisfaction data
# Include: product categories, review scores, prices, delivery info
# HINT: Join orders, order_items, products, reviews, and translation tables

satisfaction_query = """
-- TODO: Write your SQL query here
-- Include:
-- - Product categories (with English translations)
-- - Review scores 
-- - Order values and delivery information
-- - Any other relevant satisfaction factors
"""

# TODO: Load data using pandas
# df = pd.read_sql(satisfaction_query, engine)

# TODO: Display basic information about the dataset
print("Dataset loaded successfully!")
# Add your analysis here
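
The join described in the hint could look roughly like the sketch below. The table and column names follow the public Olist dataset and are assumptions here; the course database may use different names or a schema prefix. The query is run against a tiny in-memory SQLite database with made-up rows so the join logic can be checked without credentials.

```python
import sqlite3

# Table and column names follow the public Olist dataset and are assumptions;
# check them against the course database schema before reusing the query.
JOIN_QUERY = """
SELECT
    t.product_category_name_english AS category_english,
    r.review_score,
    oi.price,
    oi.freight_value,
    o.order_purchase_timestamp,
    o.order_delivered_customer_date
FROM order_items AS oi
JOIN orders AS o ON o.order_id = oi.order_id
JOIN products AS p ON p.product_id = oi.product_id
JOIN order_reviews AS r ON r.order_id = o.order_id
LEFT JOIN product_category_name_translation AS t
    ON t.product_category_name = p.product_category_name
"""

# Tiny in-memory database with made-up rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, order_purchase_timestamp TEXT, order_delivered_customer_date TEXT);
CREATE TABLE order_items (order_id TEXT, product_id TEXT, price REAL, freight_value REAL);
CREATE TABLE products (product_id TEXT, product_category_name TEXT);
CREATE TABLE order_reviews (order_id TEXT, review_score INTEGER);
CREATE TABLE product_category_name_translation (product_category_name TEXT, product_category_name_english TEXT);
INSERT INTO orders VALUES ('o1', '2018-01-01', '2018-01-08'), ('o2', '2018-01-02', '2018-01-05');
INSERT INTO order_items VALUES ('o1', 'p1', 100.0, 10.0), ('o2', 'p2', 50.0, 5.0);
INSERT INTO products VALUES ('p1', 'brinquedos'), ('p2', 'categoria_sem_traducao');
INSERT INTO order_reviews VALUES ('o1', 5), ('o2', 2);
INSERT INTO product_category_name_translation VALUES ('brinquedos', 'toys');
""")
rows = conn.execute(JOIN_QUERY).fetchall()
conn.close()

for row in rows:
    print(row)
```

The `LEFT JOIN` on the translation table keeps products whose category has no English translation (their `category_english` is `NULL`). Note that this query returns one row per order item, so an order with several items repeats its review score; keep that in mind when counting reviews.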

In [None]:
# TODO: Perform initial data exploration

# 1. Display dataset shape and basic info
# print(f"Dataset shape: {df.shape}")

# 2. Show data types and missing values
# df.info()

# 3. Display first few rows
# display(df.head())

# 4. Show summary statistics
# display(df.describe())

# 5. List the columns that contain missing values, with their counts
# missing_data = df.isnull().sum()
# print("Missing values:")
# print(missing_data[missing_data > 0])

print("Initial exploration complete!")

## Part 2: Customer Satisfaction Analysis by Category (10 points)

**Instructions**:
1. Calculate satisfaction metrics by product category
2. Create visualizations to show satisfaction patterns
3. Identify best and worst performing categories
4. Analyze satisfaction distributions

In [None]:
# TODO: Calculate satisfaction metrics by category

# 1. Group by product category and calculate satisfaction metrics
# category_satisfaction = df.groupby('category_english').agg({
#     'review_score': ['mean', 'median', 'std', 'count'],
#     # Add other relevant metrics
# }).round(2)

# 2. Clean up column names (extend this list if you add metrics in step 1)
# category_satisfaction.columns = ['avg_rating', 'median_rating', 'rating_std', 'review_count']

# 3. Sort by average satisfaction
# category_satisfaction = category_satisfaction.sort_values('avg_rating', ascending=False)

# 4. Display results
# print("Satisfaction by Product Category:")
# display(category_satisfaction.head(10))

print("Category satisfaction analysis complete!")
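
As an alternative to renaming columns after a dictionary-style `agg`, pandas named aggregation produces flat column names in one step. The sketch below uses made-up toy data in place of the real query result; the extra `avg_price` metric is only an illustration.

```python
import pandas as pd

# Toy data standing in for the real query result (values are made up)
toy = pd.DataFrame({
    "category_english": ["toys", "toys", "toys", "books", "books"],
    "review_score": [5, 4, 3, 2, 4],
    "price": [20.0, 35.0, 50.0, 15.0, 25.0],
})

# Named aggregation: each keyword becomes an output column
toy_summary = (
    toy.groupby("category_english")
    .agg(
        avg_rating=("review_score", "mean"),
        median_rating=("review_score", "median"),
        rating_std=("review_score", "std"),
        review_count=("review_score", "count"),
        avg_price=("price", "mean"),
    )
    .sort_values("avg_rating", ascending=False)
)
print(toy_summary.round(2))
```

Because every output column is named where it is defined, adding or removing a metric does not require updating a separate list of column names.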

In [None]:
# TODO: Create visualizations for satisfaction analysis

# 1. Bar chart of average satisfaction by category (top 15)
# plt.figure(figsize=(14, 6))
# top_categories = category_satisfaction.head(15)
# plt.bar(range(len(top_categories)), top_categories['avg_rating'])
# plt.title('Average Customer Satisfaction by Product Category (Top 15)')
# plt.xlabel('Product Category')
# plt.ylabel('Average Review Score')
# plt.xticks(range(len(top_categories)), top_categories.index, rotation=45, ha='right')
# plt.grid(True, alpha=0.3)
# plt.tight_layout()
# plt.show()

# 2. Box plot showing satisfaction distribution for top categories
# TODO: Create box plot

# 3. Satisfaction distribution histogram
# TODO: Create histogram of review scores

print("Satisfaction visualizations complete!")
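
The box plot and histogram TODOs could be approached along the lines of the sketch below. The helper `plot_score_distributions` and its parameters are hypothetical, and the toy data is made up; swap in your own DataFrame and column names.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_score_distributions(data, category_col="category_english",
                             score_col="review_score", top_n=5):
    """Box plot of scores for the most-reviewed categories, plus an overall histogram."""
    top = list(data[category_col].value_counts().head(top_n).index)
    groups = [data.loc[data[category_col] == c, score_col].dropna() for c in top]

    fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(14, 5))

    # One box per category; tick labels are set by hand after plotting
    ax_box.boxplot(groups)
    ax_box.set_xticks(list(range(1, len(top) + 1)))
    ax_box.set_xticklabels(top, rotation=45, ha="right")
    ax_box.set_title("Review Score Distribution by Category")
    ax_box.set_xlabel("Product Category")
    ax_box.set_ylabel("Review Score")

    # Bin edges centred on the integer scores 1 to 5
    ax_hist.hist(data[score_col].dropna(),
                 bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5], edgecolor="black")
    ax_hist.set_title("Distribution of Review Scores")
    ax_hist.set_xlabel("Review Score")
    ax_hist.set_ylabel("Number of Reviews")

    fig.tight_layout()
    return fig, (ax_box, ax_hist)


# Made-up toy data
toy = pd.DataFrame({
    "category_english": ["toys", "toys", "toys", "books", "books"],
    "review_score": [5, 4, 3, 2, 4],
})
fig, (ax_box, ax_hist) = plot_score_distributions(toy, top_n=2)
```

In a notebook, finish the cell with `plt.show()` to render the figure. Categories are chosen by review volume here rather than by average rating, which keeps each box backed by a reasonable number of reviews.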

In [None]:
# TODO: Identify top 5 and bottom 5 categories by satisfaction

# 1. Filter categories with sufficient reviews (e.g., > 50 reviews)
# significant_categories = category_satisfaction[category_satisfaction['review_count'] > 50]

# 2. Get top 5 categories
# top_5 = significant_categories.head(5)
# print("🏆 TOP 5 CATEGORIES BY CUSTOMER SATISFACTION:")
# for i, (category, metrics) in enumerate(top_5.iterrows(), 1):
#     print(f"{i}. {category}: {metrics['avg_rating']:.2f} avg rating ({int(metrics['review_count']):,} reviews)")

# 3. Get bottom 5 categories (reversed so the lowest-rated category is listed first)
# bottom_5 = significant_categories.tail(5).iloc[::-1]
# print("\n⚠️ BOTTOM 5 CATEGORIES BY CUSTOMER SATISFACTION:")
# for i, (category, metrics) in enumerate(bottom_5.iterrows(), 1):
#     print(f"{i}. {category}: {metrics['avg_rating']:.2f} avg rating ({int(metrics['review_count']):,} reviews)")

print("Top/bottom category analysis complete!")

## Part 3: Advanced Analysis and Insights (7 points)

**Instructions**:
1. Investigate factors that might influence satisfaction
2. Perform correlation analysis
3. Create advanced visualizations
4. Generate insights about satisfaction drivers

In [None]:
# TODO: Investigate factors that influence satisfaction

# 1. Analyze relationship between price and satisfaction
# print("Price vs Satisfaction Analysis:")
# price_satisfaction_corr = df['price'].corr(df['review_score'])
# print(f"Correlation between price and satisfaction: {price_satisfaction_corr:.3f}")

# 2. Analyze delivery time impact (if available)
# TODO: Calculate delivery time and analyze its impact on satisfaction

# 3. Analyze freight cost impact
# TODO: Investigate how freight costs relate to satisfaction

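# 4. Create price bins and analyze satisfaction by price range
# NOTE: pd.qcut builds quantile bins with similar row counts; equal-width pd.cut bins
# would place most orders in the lowest bin when prices are right-skewed
# df['price_bin'] = pd.qcut(df['price'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
# price_satisfaction = df.groupby('price_bin', observed=True)['review_score'].mean()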
# print("\nSatisfaction by Price Range:")
# print(price_satisfaction)

print("Satisfaction factors analysis complete!")
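
For the delivery time TODO, a minimal sketch is shown below. It assumes your query returns purchase and delivery timestamps under the names `order_purchase_timestamp` and `order_delivered_customer_date`; those names, the bin edges, and the toy values are all assumptions for illustration.

```python
import pandas as pd

# Made-up toy rows; the third order was never delivered
toy = pd.DataFrame({
    "order_purchase_timestamp": [
        "2018-01-01 10:00:00", "2018-01-02 09:00:00",
        "2018-01-03 12:00:00", "2018-01-04 08:00:00",
    ],
    "order_delivered_customer_date": [
        "2018-01-04 10:00:00", "2018-01-12 09:00:00",
        None, "2018-01-24 08:00:00",
    ],
    "review_score": [5, 3, 1, 1],
})

purchased = pd.to_datetime(toy["order_purchase_timestamp"])
delivered = pd.to_datetime(toy["order_delivered_customer_date"])

# Undelivered orders have no delivery date, so their delivery_days stays NaN
toy["delivery_days"] = (delivered - purchased).dt.total_seconds() / 86400

# Series.corr ignores pairs where either value is missing
delivery_corr = toy["delivery_days"].corr(toy["review_score"])
print(f"Correlation between delivery days and review score: {delivery_corr:.3f}")

# Explicit bin edges keep the ranges interpretable; these cut points are arbitrary
toy["delivery_bin"] = pd.cut(
    toy["delivery_days"],
    bins=[0, 7, 14, float("inf")],
    labels=["up to 7 days", "7 to 14 days", "over 14 days"],
)
score_by_delivery = toy.groupby("delivery_bin", observed=True)["review_score"].mean()
print(score_by_delivery)
```

Orders without a delivery date drop out of both the correlation and the binned averages, so it is worth reporting how many rows were excluded when you interpret the result.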

In [None]:
# TODO: Perform correlation analysis

# 1. Select numerical variables for correlation
# numerical_vars = ['review_score', 'price', 'freight_value', 'total_order_value']
# Add delivery days if calculated

# 2. Calculate correlation matrix
# correlation_matrix = df[numerical_vars].corr()

# 3. Create correlation heatmap
# plt.figure(figsize=(10, 8))
# sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0, 
#             square=True, fmt='.3f')
# plt.title('Correlation Matrix: Satisfaction and Order Factors')
# plt.tight_layout()
# plt.show()

# 4. Identify strongest correlations with satisfaction
# satisfaction_corrs = correlation_matrix['review_score'].sort_values(key=abs, ascending=False)
# print("\nCorrelations with Customer Satisfaction:")
# for var, corr in satisfaction_corrs.items():
#     if var != 'review_score':
#         print(f"{var}: {corr:.3f}")

print("Correlation analysis complete!")

In [None]:
# TODO: Create at least one advanced visualization

# Option 1: Scatter plot of price vs satisfaction colored by category
# plt.figure(figsize=(12, 8))
# categories_to_plot = df['category_english'].value_counts().head(8).index
# df_subset = df[df['category_english'].isin(categories_to_plot)]
# 
# for category in categories_to_plot:
#     cat_data = df_subset[df_subset['category_english'] == category]
#     plt.scatter(cat_data['price'], cat_data['review_score'], 
#                label=category, alpha=0.6, s=30)
# 
# plt.xlabel('Price (R$)')
# plt.ylabel('Review Score')
# plt.title('Price vs Customer Satisfaction by Product Category')
# plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# plt.grid(True, alpha=0.3)
# plt.tight_layout()
# plt.show()

# Option 2: Box plot comparing satisfaction across price ranges
# TODO: Create box plot

# Option 3: Heatmap of satisfaction by category and price range
# TODO: Create heatmap

print("Advanced visualization complete!")
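
For Option 3, the heatmap needs a table with categories as rows and price ranges as columns. The sketch below builds that table with `pivot_table` on made-up toy data; the three quantile-based price ranges and the column name `price_range` are assumptions for illustration.

```python
import pandas as pd

# Made-up toy data
toy = pd.DataFrame({
    "category_english": ["toys", "toys", "toys", "books", "books", "books"],
    "price": [10.0, 60.0, 200.0, 12.0, 55.0, 180.0],
    "review_score": [5, 4, 2, 4, 4, 3],
})

# Quantile bins put roughly equal numbers of rows in each price range
toy["price_range"] = pd.qcut(toy["price"], q=3, labels=["Low", "Medium", "High"])

# Rows: categories, columns: price ranges, cells: mean review score
heatmap_data = toy.pivot_table(
    index="category_english",
    columns="price_range",
    values="review_score",
    aggfunc="mean",
    observed=False,  # keep every price range as a column, even if it has no rows
)
print(heatmap_data)
```

The resulting table can be passed to `sns.heatmap(heatmap_data, annot=True, vmin=1, vmax=5)`. On the real data, restrict the rows to categories with enough reviews first, since a cell averaged over a handful of orders can look misleadingly extreme.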

## Part 4: Business Recommendations (3 points)

**Instructions**:
Based on your analysis, provide specific, actionable business recommendations.
Support each recommendation with data evidence from your analysis.

### Business Recommendations

**TODO: Write your recommendations here based on your analysis**

#### Recommendation 1: [Title]
**Finding**: [What did you discover in your analysis?]  
**Recommendation**: [What specific action should the business take?]  
**Evidence**: [What data supports this recommendation?]  
**Expected Impact**: [What business outcome do you expect?]

#### Recommendation 2: [Title]
**Finding**: [What did you discover in your analysis?]  
**Recommendation**: [What specific action should the business take?]  
**Evidence**: [What data supports this recommendation?]  
**Expected Impact**: [What business outcome do you expect?]

#### Recommendation 3: [Title]
**Finding**: [What did you discover in your analysis?]  
**Recommendation**: [What specific action should the business take?]  
**Evidence**: [What data supports this recommendation?]  
**Expected Impact**: [What business outcome do you expect?]

#### Additional Recommendations (Optional)
[Add more recommendations if you discovered additional insights]

#### Implementation Priority
**High Priority**: [Which recommendations should be implemented first?]  
**Medium Priority**: [Which can wait but are still important?]  
**Long-term**: [Which are strategic, longer-term initiatives?]

## Reflection Questions (Bonus - Optional)

**TODO: Answer these questions to demonstrate deeper thinking**

1. **Data Limitations**: What limitations did you encounter in the data that might affect your conclusions?

2. **Additional Analysis**: What additional data or analysis would help you provide better recommendations?

3. **Business Impact**: How would you measure the success of your recommendations if they were implemented?

4. **Methodology**: What did you learn about the EDA process through this assignment?

---

## Submission Checklist

Before submitting, ensure you have completed:

- [ ] All code cells run without errors
- [ ] Environment variables used for database connection
- [ ] Data loading and exploration completed
- [ ] Satisfaction analysis by category with visualizations
- [ ] Advanced analysis with correlation and factors
- [ ] At least 3 specific business recommendations
- [ ] All visualizations have proper titles and labels
- [ ] Code is well-commented and organized
- [ ] Markdown cells explain your findings clearly

**Good luck with your analysis! 🚀**