# Superstore Sales Analysis

## Project Overview
This analysis examines retail sales data from a superstore to uncover insights about sales performance, profitability, and customer behavior. The goal is to answer key business questions that can help drive strategic decisions.

## Business Questions
1. What are the sales trends over time?
2. Which products and categories perform best?
3. How does regional performance vary?
4. What customer segments generate the most revenue?
5. Which products are most profitable?

---

## Step 1: Import Libraries

We'll use pandas for data manipulation and matplotlib/seaborn for visualizations.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
warnings.filterwarnings('ignore')

# Set style for visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

## Step 2: Load the Data

Load the Superstore sales dataset from a CSV file.

In [None]:
# Load the dataset
df = pd.read_csv('../data/superstore_sales.csv')

# Display first few rows
print("First 5 rows of the dataset:")
df.head()

## Step 3: Explore the Data

Let's understand the structure and content of our dataset.

In [None]:
# Check dataset shape
print(f"Dataset shape: {df.shape[0]} rows and {df.shape[1]} columns\n")

# Display column names and data types
print("Column information:")
df.info()

In [None]:
# Check for missing values
print("Missing values per column:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values found!")

In [None]:
# Display summary statistics
print("Summary statistics for numerical columns:")
df.describe()

## Step 4: Clean the Data

Convert date columns to proper datetime format and create additional useful columns.

### Data Cleaning Rationale

**Date Format Handling:**
- The `Order Date` and `Ship Date` columns are loaded as strings and must be converted to datetime for time-based analysis
- `format='mixed'` (available in pandas 2.0 and later) infers the format of each value individually, so one column may contain several date formats. The pandas documentation flags this option as risky: an ambiguous value such as `01/02/2016` is read month-first unless `dayfirst=True` is set
- `errors='coerce'` converts unparseable values to `NaT` (Not a Time) instead of raising an exception
- The `try`/`except` block reports any other failure and re-raises it, so the notebook stops with a clear message rather than continuing with bad data

**Production-Ready Safeguards:**
1. **Mixed Format Support**: Handles scenarios where dates may be in different formats (e.g., MM/DD/YYYY, YYYY-MM-DD)
2. **Error Handling**: Gracefully manages parsing failures by coercing invalid dates to NaT
3. **Validation**: Checks for any failed conversions and reports them for data quality monitoring
4. **Derived Features**: Creates additional time-based columns (Year, Month, Quarter) for aggregation and analysis
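
The validation step can go further than counting failures. A minimal sketch, assuming a known `MM/DD/YYYY` format and a small made-up Series (neither is taken from the dataset), keeps the raw strings so that the values that failed to parse can be inspected:

```python
import pandas as pd

# Hypothetical raw values; the real column may use a different format
raw = pd.Series(["11/08/2016", "06/12/2016", "not a date", None, "13/45/2016"])

# Unparseable entries become NaT instead of raising an exception
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Values that were present in the source but could not be parsed
failed = raw[parsed.isna() & raw.notna()]
print(failed.tolist())
```

Comparing `parsed.isna()` with `raw.notna()` separates genuinely missing dates from values that were present but malformed.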

In [None]:
# Convert date columns to datetime with robust error handling
# Using format='mixed' to handle potential mixed date formats in production
try:
    # Convert Order Date with mixed format support
    df['Order Date'] = pd.to_datetime(df['Order Date'], format='mixed', errors='coerce')
    
    # Convert Ship Date with mixed format support  
    df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='mixed', errors='coerce')
    
    # Check for any dates that failed to parse (became NaT)
    order_date_nulls = df['Order Date'].isna().sum()
    ship_date_nulls = df['Ship Date'].isna().sum()
    
    if order_date_nulls > 0:
        print(f"⚠️  Warning: {order_date_nulls} Order Date values could not be parsed and were set to NaT")
    if ship_date_nulls > 0:
        print(f"⚠️  Warning: {ship_date_nulls} Ship Date values could not be parsed and were set to NaT")
    
    # Extract additional date components for analysis
    df['Year'] = df['Order Date'].dt.year
    df['Month'] = df['Order Date'].dt.month
    df['Month Name'] = df['Order Date'].dt.strftime('%B')
    df['Quarter'] = df['Order Date'].dt.quarter
    
    # Calculate shipping delay in days
    # Missing dates (NaT) propagate as NaN; negative values are flagged below
    df['shipping_delay_days'] = (df['Ship Date'] - df['Order Date']).dt.days
    
    # Check for potential data quality issues
    negative_delays = (df['shipping_delay_days'] < 0).sum()
    if negative_delays > 0:
        print(f"⚠️  Warning: {negative_delays} records have negative shipping_delay_days (Ship Date before Order Date)")
    
    print("✓ Data cleaning completed successfully!")
    print(f"\nDate range: {df['Order Date'].min()} to {df['Order Date'].max()}")
    print(f"Total records: {len(df)}")
    print(f"Records with a valid Order Date: {len(df) - order_date_nulls}")
    
except Exception as e:
    print(f"❌ Error during date conversion: {str(e)}")
    print("Please check the date format in your dataset")
    raise

### Understanding `shipping_delay_days`

The `shipping_delay_days` column represents the time elapsed between when a customer placed an order (`Order Date`) and when the order was shipped (`Ship Date`). This metric is calculated as:

```
shipping_delay_days = Ship Date - Order Date
```

**Edge Cases and Data Quality Considerations:**

1. **Negative Values**: If `shipping_delay_days` is negative, the `Ship Date` is earlier than the `Order Date`. An order should not ship before it is placed, so a negative value usually points to one of the following:
   - **Data entry errors**: Dates may have been transposed or entered incorrectly
   - **System timezone issues**: Different systems recording dates in different timezones
   - **Backdated entries**: Orders entered into the system after they were shipped
   - **Unusual shipping practices**: Pre-positioned inventory shipped before official order confirmation

2. **Missing Values (NaT/NaN)**: If either the `Order Date` or `Ship Date` is missing or could not be parsed, the `shipping_delay_days` will be `NaN`. These records should be:
   - Flagged for data quality review
   - Excluded from shipping performance analysis
   - Investigated with the data source team

3. **Zero Values**: A `shipping_delay_days` of 0 means the order was shipped on the same day it was ordered, which is possible for:
   - Same-day shipping services
   - Orders placed early in the day
   - Digital products or services

**Business Interpretation:**

- **Typical Range**: Expect a delay of a few days, with the exact range depending on the shipping mode; check the distribution in this dataset rather than assuming a fixed range
- **Performance Metric**: Lower values indicate faster order fulfillment
- **Customer Satisfaction**: Shipping delay directly impacts customer experience and satisfaction
- **Operational Efficiency**: Can identify bottlenecks in warehouse operations or inventory management

When analyzing this metric, always:
- Filter out or investigate records with negative or missing values
- Segment analysis by shipping mode (Standard Class, Second Class, etc.)
- Compare against service level agreements (SLAs)
- Look for trends over time or by product category
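
The checklist above can be sketched as follows. The rows are made up, and the `Ship Mode` column name is an assumption, since the notebook does not reference that column elsewhere:

```python
import pandas as pd

# Made-up orders; "Ship Mode" is an assumed column name
orders = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Standard Class", "Second Class",
                  "Second Class", "First Class"],
    "Order Date": pd.to_datetime(["2016-01-01", "2016-01-05", "2016-01-02",
                                  "2016-01-10", None]),
    "Ship Date": pd.to_datetime(["2016-01-06", "2016-01-09", "2016-01-04",
                                 "2016-01-08", "2016-01-03"]),
})
orders["shipping_delay_days"] = (orders["Ship Date"] - orders["Order Date"]).dt.days

# Keep only rows with a known, non-negative delay (NaN >= 0 evaluates to False)
valid = orders[orders["shipping_delay_days"] >= 0]

# Summarize the delay per shipping mode
delay_by_mode = valid.groupby("Ship Mode")["shipping_delay_days"].agg(
    ["count", "mean", "median"]
)
print(delay_by_mode)
```

The excluded rows are worth keeping in a separate frame for the data quality review rather than discarding them.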

---

## Business Question 1: What are the sales trends over time?

Let's analyze how sales have changed over the years and identify seasonal patterns.

In [None]:
# Calculate total sales by year
yearly_sales = df.groupby('Year')['Sales'].sum().reset_index()
yearly_sales['Sales'] = yearly_sales['Sales'].round(2)

print("Total Sales by Year:")
print(yearly_sales)

# Calculate year-over-year growth
yearly_sales['Growth %'] = yearly_sales['Sales'].pct_change() * 100
print("\nYear-over-Year Growth:")
print(yearly_sales[['Year', 'Growth %']].dropna())

In [None]:
# Visualize yearly sales trend
plt.figure(figsize=(10, 6))
plt.plot(yearly_sales['Year'], yearly_sales['Sales'], marker='o', linewidth=2, markersize=8)
plt.title('Total Sales by Year', fontsize=16, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Sales ($)', fontsize=12)
plt.grid(True, alpha=0.3)

# Add value labels on points
for x, y in zip(yearly_sales['Year'], yearly_sales['Sales']):
    plt.text(x, y, f'${y:,.0f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Calculate monthly sales trend (across all years)
monthly_sales = df.groupby('Month')['Sales'].sum().reset_index()

# Create month names for better readability
# Month is stored as float if any Order Date is NaT, so cast before indexing
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
monthly_sales['Month Name'] = [month_names[int(m) - 1] for m in monthly_sales['Month']]

print("\nTotal Sales by Month (all years combined):")
print(monthly_sales[['Month Name', 'Sales']])

In [None]:
# Visualize monthly sales pattern
plt.figure(figsize=(12, 6))
plt.bar(monthly_sales['Month Name'], monthly_sales['Sales'], color='steelblue', alpha=0.8)
plt.title('Sales by Month (All Years Combined)', fontsize=16, fontweight='bold')
plt.xlabel('Month', fontsize=12)
plt.ylabel('Sales ($)', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

**Key Findings:**
- The yearly totals and year-over-year growth percentages show the direction and pace of overall sales
- The monthly totals, combined across all years, reveal seasonal peaks and troughs
- Peak months identified here can guide inventory planning
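
The monthly totals above combine all years, so a single strong year can dominate a month. One way to check whether the pattern repeats each year is a year-by-month pivot. The sketch below uses made-up rows and assumes the `Year`, `Month`, and `Sales` columns created earlier:

```python
import pandas as pd

# Made-up rows standing in for the cleaned DataFrame
sales = pd.DataFrame({
    "Year":  [2015, 2015, 2015, 2016, 2016, 2016],
    "Month": [1, 1, 11, 1, 11, 11],
    "Sales": [100.0, 50.0, 300.0, 120.0, 200.0, 250.0],
})

# One row per year, one column per month, summed sales in each cell
year_month = sales.pivot_table(
    index="Year", columns="Month", values="Sales", aggfunc="sum", fill_value=0
)
print(year_month)
```

Reading across a row shows the seasonal shape for one year; reading down a column compares the same month across years.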

---

## Business Question 2: Which products and categories perform best?

Analyze top-performing categories and sub-categories to understand product performance.

In [None]:
# Calculate sales by category
category_sales = df.groupby('Category')['Sales'].sum().reset_index()
category_sales = category_sales.sort_values('Sales', ascending=False)
category_sales['Sales'] = category_sales['Sales'].round(2)

print("Total Sales by Category:")
print(category_sales)

# Calculate percentage of total sales
total_sales = category_sales['Sales'].sum()
category_sales['Percentage'] = (category_sales['Sales'] / total_sales * 100).round(2)
print("\nCategory Sales Percentage:")
print(category_sales[['Category', 'Percentage']])

In [None]:
# Visualize category sales
plt.figure(figsize=(10, 6))
plt.bar(category_sales['Category'], category_sales['Sales'], color=['#ff9999', '#66b3ff', '#99ff99'])
plt.title('Total Sales by Category', fontsize=16, fontweight='bold')
plt.xlabel('Category', fontsize=12)
plt.ylabel('Sales ($)', fontsize=12)
plt.xticks(rotation=45, ha='right')

# Add value labels on bars
for i, (cat, sales) in enumerate(zip(category_sales['Category'], category_sales['Sales'])):
    plt.text(i, sales, f'${sales:,.0f}', ha='center', va='bottom', fontsize=10)

plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Analyze top 10 sub-categories
subcategory_sales = df.groupby('Sub-Category')['Sales'].sum().reset_index()
subcategory_sales = subcategory_sales.sort_values('Sales', ascending=False)
top_10_subcategories = subcategory_sales.head(10)

print("\nTop 10 Sub-Categories by Sales:")
print(top_10_subcategories)

In [None]:
# Visualize top 10 sub-categories
plt.figure(figsize=(12, 6))
plt.barh(top_10_subcategories['Sub-Category'], top_10_subcategories['Sales'], color='coral')
plt.title('Top 10 Sub-Categories by Sales', fontsize=16, fontweight='bold')
plt.xlabel('Sales ($)', fontsize=12)
plt.ylabel('Sub-Category', fontsize=12)
plt.gca().invert_yaxis()  # Highest at the top

# Add value labels
for i, (cat, sales) in enumerate(zip(top_10_subcategories['Sub-Category'], top_10_subcategories['Sales'])):
    plt.text(sales, i, f'  ${sales:,.0f}', va='center', fontsize=9)

plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

**Key Findings:**
- Identifies which product categories drive the most revenue
- Shows the top 10 sub-categories by sales across all categories
- Helps prioritize inventory and marketing efforts
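
The ranking above is computed over all sub-categories at once. To rank sub-categories within each category instead, one possible sketch (made-up rows, same column names as the dataset) is:

```python
import pandas as pd

# Made-up line items
items = pd.DataFrame({
    "Category": ["Technology", "Technology", "Furniture", "Furniture", "Furniture"],
    "Sub-Category": ["Phones", "Accessories", "Chairs", "Tables", "Chairs"],
    "Sales": [500.0, 200.0, 300.0, 350.0, 100.0],
})

# Total sales per (category, sub-category) pair
by_sub = items.groupby(["Category", "Sub-Category"], as_index=False)["Sales"].sum()

# Sort by sales within each category, then keep the first row of each group
top_per_category = (
    by_sub.sort_values(["Category", "Sales"], ascending=[True, False])
          .groupby("Category")
          .head(1)
)
print(top_per_category)
```

Changing `head(1)` to `head(3)` would keep the three best sub-categories per category.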

---

## Business Question 3: How does regional performance vary?

Compare sales performance across different regions.

In [None]:
# Calculate sales by region
regional_sales = df.groupby('Region')['Sales'].sum().reset_index()
regional_sales = regional_sales.sort_values('Sales', ascending=False)
regional_sales['Sales'] = regional_sales['Sales'].round(2)

print("Total Sales by Region:")
print(regional_sales)

# Calculate percentage contribution
regional_sales['Percentage'] = (regional_sales['Sales'] / regional_sales['Sales'].sum() * 100).round(2)
print("\nRegional Sales Percentage:")
print(regional_sales[['Region', 'Percentage']])

In [None]:
# Visualize regional sales with a bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(regional_sales['Region'], regional_sales['Sales'], 
               color=['#8dd3c7', '#fb8072', '#bebada', '#fdb462'])
plt.title('Total Sales by Region', fontsize=16, fontweight='bold')
plt.xlabel('Region', fontsize=12)
plt.ylabel('Sales ($)', fontsize=12)

# Add value labels
for i, (region, sales) in enumerate(zip(regional_sales['Region'], regional_sales['Sales'])):
    plt.text(i, sales, f'${sales:,.0f}', ha='center', va='bottom', fontsize=10)

plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Visualize regional sales with a pie chart
plt.figure(figsize=(10, 8))
colors = ['#8dd3c7', '#fb8072', '#bebada', '#fdb462']
plt.pie(regional_sales['Sales'], labels=regional_sales['Region'], autopct='%1.1f%%',
        startangle=90, colors=colors, textprops={'fontsize': 12})
plt.title('Sales Distribution by Region', fontsize=16, fontweight='bold')
plt.axis('equal')
plt.tight_layout()
plt.show()

In [None]:
# Analyze top 10 states by sales
state_sales = df.groupby('State')['Sales'].sum().reset_index()
state_sales = state_sales.sort_values('Sales', ascending=False)
top_10_states = state_sales.head(10)

print("\nTop 10 States by Sales:")
print(top_10_states)

In [None]:
# Visualize top 10 states
plt.figure(figsize=(12, 6))
plt.barh(top_10_states['State'], top_10_states['Sales'], color='teal')
plt.title('Top 10 States by Sales', fontsize=16, fontweight='bold')
plt.xlabel('Sales ($)', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.gca().invert_yaxis()

# Add value labels
for i, (state, sales) in enumerate(zip(top_10_states['State'], top_10_states['Sales'])):
    plt.text(sales, i, f'  ${sales:,.0f}', va='center', fontsize=9)

plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

**Key Findings:**
- Shows which regions contribute most to overall sales
- Identifies top-performing states for targeted marketing
- Helps optimize regional distribution and resources
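
To quantify how concentrated sales are among states, a cumulative share can be computed from the state totals. This is a sketch on made-up totals; the 80% threshold is an arbitrary choice for illustration:

```python
import pandas as pd

# Made-up state totals, sorted from largest to smallest
state_totals = pd.Series(
    {"California": 450.0, "New York": 300.0, "Texas": 150.0, "Ohio": 60.0, "Utah": 40.0},
    name="Sales",
).sort_values(ascending=False)

# Running share of total sales, in percent
cumulative_share = state_totals.cumsum() / state_totals.sum() * 100

# Number of top states needed to reach at least 80% of sales
states_for_80 = int((cumulative_share < 80).sum()) + 1
print(cumulative_share.round(1))
print(states_for_80)
```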

---

## Business Question 4: What customer segments generate the most revenue?

Analyze sales by customer segment to understand our customer base.

In [None]:
# Calculate sales by customer segment
segment_sales = df.groupby('Segment')['Sales'].sum().reset_index()
segment_sales = segment_sales.sort_values('Sales', ascending=False)
segment_sales['Sales'] = segment_sales['Sales'].round(2)

print("Total Sales by Customer Segment:")
print(segment_sales)

# Calculate percentage
segment_sales['Percentage'] = (segment_sales['Sales'] / segment_sales['Sales'].sum() * 100).round(2)
print("\nSegment Sales Percentage:")
print(segment_sales[['Segment', 'Percentage']])

In [None]:
# Visualize segment sales
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Bar chart
ax1.bar(segment_sales['Segment'], segment_sales['Sales'], 
        color=['#ff6b6b', '#4ecdc4', '#45b7d1'])
ax1.set_title('Total Sales by Customer Segment', fontsize=14, fontweight='bold')
ax1.set_xlabel('Customer Segment', fontsize=11)
ax1.set_ylabel('Sales ($)', fontsize=11)
ax1.grid(True, alpha=0.3, axis='y')

# Add value labels
for i, (seg, sales) in enumerate(zip(segment_sales['Segment'], segment_sales['Sales'])):
    ax1.text(i, sales, f'${sales:,.0f}', ha='center', va='bottom', fontsize=10)

# Pie chart
ax2.pie(segment_sales['Sales'], labels=segment_sales['Segment'], autopct='%1.1f%%',
        startangle=90, colors=['#ff6b6b', '#4ecdc4', '#45b7d1'], textprops={'fontsize': 11})
ax2.set_title('Sales Distribution by Segment', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Analyze average order value by segment
# Rows are line items (an Order ID can repeat), so total each order first
order_totals = df.groupby(['Segment', 'Order ID'])['Sales'].sum().reset_index()
segment_avg = order_totals.groupby('Segment')['Sales'].mean().reset_index()
segment_avg = segment_avg.sort_values('Sales', ascending=False)
segment_avg.columns = ['Segment', 'Average Order Value']
segment_avg['Average Order Value'] = segment_avg['Average Order Value'].round(2)

print("\nAverage Order Value by Segment:")
print(segment_avg)

In [None]:
# Count orders by segment
segment_orders = df.groupby('Segment')['Order ID'].nunique().reset_index()
segment_orders.columns = ['Segment', 'Number of Orders']
segment_orders = segment_orders.sort_values('Number of Orders', ascending=False)

print("\nNumber of Orders by Segment:")
print(segment_orders)

**Key Findings:**
- Reveals which customer segments are most valuable
- Shows differences in purchasing behavior between segments
- Helps tailor marketing strategies to different customer groups
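
Because an `Order ID` can repeat across rows (the order count above uses `nunique()` for that reason), a per-row mean and a per-order mean can differ. A minimal sketch on made-up data shows the gap:

```python
import pandas as pd

# Made-up line items: order A-1 spans two rows
lines = pd.DataFrame({
    "Order ID": ["A-1", "A-1", "A-2", "B-1"],
    "Segment":  ["Consumer", "Consumer", "Consumer", "Corporate"],
    "Sales":    [100.0, 50.0, 30.0, 200.0],
})

# Mean over rows: average sales per line item
line_item_mean = lines.groupby("Segment")["Sales"].mean()

# Mean over orders: total each order first, then average per segment
order_totals = lines.groupby(["Segment", "Order ID"])["Sales"].sum()
order_mean = order_totals.groupby(level="Segment").mean()

print(line_item_mean)
print(order_mean)
```

Here the Consumer segment averages 60 per line item but 90 per order, so the label on the metric should match the level at which it was computed.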

---

## Business Question 5: Which products are most profitable?

Analyze profitability across categories and identify high-margin products.

In [None]:
# Calculate profit by category
category_profit = df.groupby('Category')['Profit'].sum().reset_index()
category_profit = category_profit.sort_values('Profit', ascending=False)
category_profit['Profit'] = category_profit['Profit'].round(2)

print("Total Profit by Category:")
print(category_profit)

# Calculate profit margin by category
category_metrics = df.groupby('Category').agg({
    'Sales': 'sum',
    'Profit': 'sum'
}).reset_index()
category_metrics['Profit Margin %'] = (category_metrics['Profit'] / category_metrics['Sales'] * 100).round(2)

print("\nProfit Margin by Category:")
print(category_metrics[['Category', 'Profit Margin %']].sort_values('Profit Margin %', ascending=False))

In [None]:
# Visualize profit by category
plt.figure(figsize=(10, 6))
bars = plt.bar(category_profit['Category'], category_profit['Profit'], 
               color=['#2ecc71' if x > 0 else '#e74c3c' for x in category_profit['Profit']])
plt.title('Total Profit by Category', fontsize=16, fontweight='bold')
plt.xlabel('Category', fontsize=12)
plt.ylabel('Profit ($)', fontsize=12)
plt.axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# Add value labels
for i, (cat, profit) in enumerate(zip(category_profit['Category'], category_profit['Profit'])):
    plt.text(i, profit, f'${profit:,.0f}', ha='center', 
             va='bottom' if profit > 0 else 'top', fontsize=10)

plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Analyze top 10 most profitable sub-categories
subcategory_profit = df.groupby('Sub-Category').agg({
    'Sales': 'sum',
    'Profit': 'sum'
}).reset_index()
subcategory_profit['Profit Margin %'] = (subcategory_profit['Profit'] / subcategory_profit['Sales'] * 100).round(2)
subcategory_profit = subcategory_profit.sort_values('Profit', ascending=False)
top_10_profitable = subcategory_profit.head(10)

print("\nTop 10 Most Profitable Sub-Categories:")
print(top_10_profitable[['Sub-Category', 'Profit', 'Profit Margin %']])

In [None]:
# Visualize top 10 profitable sub-categories
plt.figure(figsize=(12, 6))
plt.barh(top_10_profitable['Sub-Category'], top_10_profitable['Profit'], color='#27ae60')
plt.title('Top 10 Most Profitable Sub-Categories', fontsize=16, fontweight='bold')
plt.xlabel('Profit ($)', fontsize=12)
plt.ylabel('Sub-Category', fontsize=12)
plt.gca().invert_yaxis()

# Add value labels
for i, (cat, profit) in enumerate(zip(top_10_profitable['Sub-Category'], top_10_profitable['Profit'])):
    plt.text(profit, i, f'  ${profit:,.0f}', va='center', fontsize=9)

plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

In [None]:
# Identify least profitable sub-categories (potential losses)
# nsmallest returns rows in ascending order of Profit and does not rely on the earlier sort
bottom_10_profitable = subcategory_profit.nsmallest(10, 'Profit')

print("\n10 Least Profitable Sub-Categories:")
print(bottom_10_profitable[['Sub-Category', 'Profit', 'Profit Margin %']])

In [None]:
# Visualize least profitable sub-categories
plt.figure(figsize=(12, 6))
colors = ['#e74c3c' if x < 0 else '#f39c12' for x in bottom_10_profitable['Profit']]
plt.barh(bottom_10_profitable['Sub-Category'], bottom_10_profitable['Profit'], color=colors)
plt.title('10 Least Profitable Sub-Categories', fontsize=16, fontweight='bold')
plt.xlabel('Profit ($)', fontsize=12)
plt.ylabel('Sub-Category', fontsize=12)
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.gca().invert_yaxis()

# Add value labels, padded on the side facing the bar end
for i, profit in enumerate(bottom_10_profitable['Profit']):
    label = f'  ${profit:,.0f}' if profit >= 0 else f'-${abs(profit):,.0f}  '
    plt.text(profit, i, label, va='center', fontsize=9,
             ha='left' if profit >= 0 else 'right')

plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

In [None]:
# Create a scatter plot: Sales vs Profit by Sub-Category
plt.figure(figsize=(12, 8))
scatter = plt.scatter(subcategory_profit['Sales'], subcategory_profit['Profit'], 
                     s=100, alpha=0.6, c=subcategory_profit['Profit Margin %'],
                     cmap='RdYlGn', edgecolors='black', linewidth=0.5)
plt.colorbar(scatter, label='Profit Margin %')
plt.title('Sales vs Profit by Sub-Category', fontsize=16, fontweight='bold')
plt.xlabel('Sales ($)', fontsize=12)
plt.ylabel('Profit ($)', fontsize=12)
plt.axhline(y=0, color='red', linestyle='--', linewidth=1, alpha=0.5)
plt.grid(True, alpha=0.3)

# Annotate the five most profitable sub-categories
for _, row in subcategory_profit.head(5).iterrows():
    plt.annotate(row['Sub-Category'], (row['Sales'], row['Profit']),
                xytext=(10, 5), textcoords='offset points', fontsize=8,
                bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.5))

plt.tight_layout()
plt.show()

**Key Findings:**
- Identifies highest and lowest profit categories and sub-categories
- Shows profit margins to understand efficiency
- Reveals products that may need pricing adjustments or discontinuation
- Scatter plot shows relationship between sales volume and profitability
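
The margins above are computed as total profit divided by total sales per group. Averaging row-level margins is a different calculation and can give a very different answer, as this sketch on made-up rows suggests:

```python
import pandas as pd

# Made-up rows: one large loss-making sale and one small profitable sale
rows = pd.DataFrame({
    "Sub-Category": ["Tables", "Tables"],
    "Sales":  [1000.0, 100.0],
    "Profit": [-200.0, 30.0],
})

# Ratio of sums: the approach used in this notebook
totals = rows.groupby("Sub-Category")[["Sales", "Profit"]].sum()
aggregate_margin = totals["Profit"] / totals["Sales"] * 100

# Mean of row-level ratios: every row counts equally, regardless of its size
row_margin = rows["Profit"] / rows["Sales"] * 100
mean_row_margin = row_margin.groupby(rows["Sub-Category"]).mean()

print(aggregate_margin.round(2))
print(mean_row_margin.round(2))
```

The ratio of sums weights each sale by its size, which is why it reports a loss here while the unweighted mean of row margins is positive.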

---

## Summary of Key Insights

### Sales Trends
- Sales show growth patterns over the years analyzed
- Seasonal variations exist with certain months performing better
- Understanding these patterns helps with forecasting and planning

### Product Performance
- Technology and Office Supplies are major revenue drivers
- Specific sub-categories like Phones, Chairs, and Storage perform well
- Product mix optimization can improve overall performance

### Regional Analysis
- Sales distribution varies significantly across regions
- Some states contribute disproportionately to total revenue
- Regional strategies should be tailored based on performance

### Customer Segments
- Consumer segment represents the largest customer base
- Corporate and Home Office segments have different purchasing patterns
- Segment-specific marketing can improve customer engagement

### Profitability
- Not all high-revenue categories are equally profitable
- Some sub-categories operate at losses and need attention
- Profit margin analysis reveals efficiency opportunities

---

## Recommendations

1. **Focus on High-Margin Products**: Prioritize marketing and inventory for products with the best profit margins
2. **Address Underperforming Categories**: Review pricing and costs for sub-categories with negative profits
3. **Optimize Regional Strategy**: Allocate resources based on regional performance and potential
4. **Seasonal Planning**: Prepare for peak sales months with adequate inventory and staffing
5. **Customer Segment Strategy**: Develop targeted campaigns for each customer segment

---

## Next Steps

Further analysis could explore:
- Customer retention and repeat purchase rates
- Shipping efficiency and cost optimization
- Discount impact on sales and profitability
- Product bundling opportunities
- Detailed customer segmentation analysis