# Product and Sub-Category Profitability Analysis

## Objective
Identify high-revenue, low-profit products and sub-categories to understand which items may need pricing adjustments, cost reductions, or strategic review.

## Business Goal
High sales volume doesn't always translate to strong profitability. This analysis helps us:
- Find products generating significant revenue but minimal or negative profit
- Highlight sub-categories that may need pricing or cost structure review
- Support data-driven decisions about product portfolio optimization

## Analysis Steps
1. Load the sales data from the `data/` directory
2. Clean and prepare the data (handle missing values, check data types)
3. Group data by Product Name and, separately, by Sub-Category
4. Calculate total sales and total profit for each group
5. Calculate profit margins to identify concerning patterns
6. Flag items with high sales but low or negative profits
7. Visualize the findings for clear interpretation

## Key Metrics
- **Total Sales**: Sum of revenue generated by each product/sub-category
- **Total Profit**: Sum of profit for each product/sub-category
- **Profit Margin**: (Total Profit / Total Sales) × 100%
- **High Revenue Threshold**: Products in the top quartile of sales
- **Low Profit Threshold**: Profit margin below 10%, which includes negative margins
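
As a minimal sketch of the profit margin formula above, the snippet below computes margins on a tiny invented table. The product names and figures are made up for illustration, and the zero-sales guard is an assumption about how one might want to treat such rows rather than part of the original metric definition.

```python
import numpy as np
import pandas as pd

# Toy data, invented purely for illustration
toy = pd.DataFrame({
    "Product Name": ["Desk", "Binder", "Sample Pack"],
    "Total Sales": [1000.0, 500.0, 0.0],
    "Total Profit": [50.0, -25.0, 0.0],
})

# Replace zero sales with NaN so the division yields NaN instead of inf
safe_sales = toy["Total Sales"].replace(0, np.nan)

# Profit Margin = (Total Profit / Total Sales) x 100
toy["Profit Margin %"] = toy["Total Profit"] / safe_sales * 100

print(toy)
```

Rows with a NaN margin drop out of any `< 10` comparison, so they would need to be reviewed separately if zero-sales products matter.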

## 1. Import Libraries and Load Data

We'll use pandas for data manipulation, numpy for calculations, and matplotlib/seaborn for visualizations.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Load the sales data
# Using relative path from notebooks/ directory to data/ directory
df = pd.read_csv('../data/superstore_sales.csv')

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nTotal records: {len(df):,}")
print("\nColumns in dataset:")
print(df.columns.tolist())

In [None]:
# Preview the first few rows
df.head()

## 2. Data Cleaning and Preparation

Before analysis, we need to check for missing values and ensure data types are correct for our calculations.
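
If the type check in this section reports that `Sales` or `Profit` is not numeric (for example, because values were exported as text), one possible repair is to coerce the columns. The sketch below uses invented values and assumes the only formatting problems are thousands separators and placeholder strings; real data may need other handling.

```python
import pandas as pd

# Toy data, invented for illustration: numbers stored as text
raw = pd.DataFrame({
    "Sales": ["261.96", "1,731.94", "n/a"],
    "Profit": ["41.91", "-383.03", "2.52"],
})

for col in ["Sales", "Profit"]:
    # Strip thousands separators, then coerce; unparseable entries become NaN
    cleaned = raw[col].str.replace(",", "", regex=False)
    raw[col] = pd.to_numeric(cleaned, errors="coerce")

print(raw.dtypes)
print(raw)
```

Values that cannot be parsed become NaN, so they are then picked up by the same missing-value handling as any other gap.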

In [None]:
# Check for missing values in key columns
key_columns = ['Product Name', 'Sub-Category', 'Category', 'Sales', 'Profit']
missing_values = df[key_columns].isnull().sum()

print("Missing values in key columns:")
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

In [None]:
# Check data types of key columns
print("Data types:")
print(df[key_columns].dtypes)

# Verify that Sales and Profit are numeric
print(f"\nSales column is numeric: {pd.api.types.is_numeric_dtype(df['Sales'])}")
print(f"Profit column is numeric: {pd.api.types.is_numeric_dtype(df['Profit'])}")

In [None]:
# Handle any missing values if present
# Drop rows missing any field the analysis depends on: the two grouping keys
# (Product Name, Sub-Category) and the two measures (Sales, Profit)
df_clean = df.dropna(subset=['Product Name', 'Sub-Category', 'Sales', 'Profit'])

print(f"Original records: {len(df):,}")
print(f"Records after cleaning: {len(df_clean):,}")
print(f"Records removed: {len(df) - len(df_clean):,}")

## 3. Product-Level Analysis

Group by individual product names to calculate total sales and profit for each product.

In [None]:
# Group by Product Name and calculate totals
product_summary = df_clean.groupby('Product Name').agg({
    'Sales': 'sum',
    'Profit': 'sum',
    'Order ID': 'nunique'  # Count distinct orders that include each product
}).reset_index()

# Rename columns for clarity
product_summary.columns = ['Product Name', 'Total Sales', 'Total Profit', 'Order Count']

# Calculate profit margin percentage (zero sales become NaN to avoid division by zero)
product_summary['Profit Margin %'] = (product_summary['Total Profit'] / product_summary['Total Sales'].replace(0, np.nan) * 100)

# Sort by Total Sales descending to see top revenue products
product_summary = product_summary.sort_values('Total Sales', ascending=False)

print(f"Total unique products: {len(product_summary):,}")
print("\nTop 10 products by total sales:")
product_summary.head(10)

## 4. Identify High-Revenue, Low-Profit Products

We'll flag products that are in the top quartile of sales but have low or negative profit margins.
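
To make the flagging rule concrete, here is a minimal sketch on five invented products. The names and numbers are hypothetical; only the rule itself (sales at or above the 75th percentile, margin below 10%) comes from this notebook.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Product Name": ["A", "B", "C", "D", "E"],
    "Total Sales": [100.0, 200.0, 300.0, 400.0, 500.0],
    "Profit Margin %": [25.0, 4.0, 12.0, 8.0, -3.0],
})

# 75th percentile of sales (pandas interpolates linearly by default)
threshold = toy["Total Sales"].quantile(0.75)

# A product is flagged only if BOTH conditions hold
high_revenue = toy["Total Sales"] >= threshold
low_margin = toy["Profit Margin %"] < 10
flagged = toy[high_revenue & low_margin]

print(f"Sales threshold: {threshold}")
print(flagged["Product Name"].tolist())
```

Product B has a low margin but is not flagged, because its sales fall below the threshold. The rule deliberately targets only items where a thin margin is attached to substantial revenue.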

In [None]:
# Calculate the 75th percentile (top quartile) of sales
sales_threshold = product_summary['Total Sales'].quantile(0.75)

print(f"High revenue threshold (75th percentile): ${sales_threshold:,.2f}")

# Define low profit as a margin below 10% (negative margins included)
low_profit_threshold = 10

# Filter for high-revenue products
high_revenue_products = product_summary[product_summary['Total Sales'] >= sales_threshold].copy()

# Among high-revenue products, find those with low profit margins
problematic_products = high_revenue_products[
    high_revenue_products['Profit Margin %'] < low_profit_threshold
].copy()

# Sort by profit margin to see worst performers first
problematic_products = problematic_products.sort_values('Profit Margin %')

print(f"\nHigh-revenue products: {len(high_revenue_products):,}")
print(f"High-revenue, low-profit products: {len(problematic_products):,}")
print(f"\nPercentage of high-revenue products with low profitability: {len(problematic_products) / len(high_revenue_products) * 100:.1f}%")

In [None]:
# Display the problematic products
print("High-Revenue, Low-Profit Products (Profit Margin < 10%):")
print("\nThese products generate significant sales but have concerning profit margins:")
problematic_products

## 5. Sub-Category Level Analysis

Analyze at the sub-category level to identify broader patterns across product groups.
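
When a name column is aggregated per sub-category, `'count'` tallies rows (line items) while `'nunique'` tallies distinct names, so the two differ whenever a product is sold more than once. A minimal sketch on invented rows shows the difference:

```python
import pandas as pd

# Toy data, invented for illustration: "Chair X" appears on two order lines
rows = pd.DataFrame({
    "Sub-Category": ["Chairs", "Chairs", "Chairs", "Paper"],
    "Product Name": ["Chair X", "Chair X", "Chair Y", "Paper Z"],
})

# Compare row count with distinct-product count per sub-category
counts = rows.groupby("Sub-Category")["Product Name"].agg(["count", "nunique"])
print(counts)
```

For a column meant to report how many products a sub-category contains, the distinct count is usually the intended figure.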

In [None]:
# Group by Sub-Category and calculate totals
subcategory_summary = df_clean.groupby('Sub-Category').agg({
    'Sales': 'sum',
    'Profit': 'sum',
    'Product Name': 'nunique'  # Count distinct products in each sub-category
}).reset_index()

# Rename columns
subcategory_summary.columns = ['Sub-Category', 'Total Sales', 'Total Profit', 'Product Count']

# Calculate profit margin
subcategory_summary['Profit Margin %'] = (
    subcategory_summary['Total Profit'] / subcategory_summary['Total Sales'] * 100
)

# Sort by Total Sales descending
subcategory_summary = subcategory_summary.sort_values('Total Sales', ascending=False)

print(f"Total sub-categories: {len(subcategory_summary)}")
print("\nSub-category performance summary:")
subcategory_summary

In [None]:
# Identify sub-categories with low or negative profit margins
low_margin_subcategories = subcategory_summary[
    subcategory_summary['Profit Margin %'] < 10
].sort_values('Profit Margin %')

print("Sub-Categories with Low Profit Margins (< 10%):")
print("\nThese sub-categories may need strategic review:")
low_margin_subcategories

## 6. Visualizations

Create visualizations to better understand the relationship between sales and profitability.

In [None]:
# Visualization 1: Scatter plot of Sales vs Profit by Sub-Category
plt.figure(figsize=(12, 7))

# Create scatter plot
scatter = plt.scatter(
    subcategory_summary['Total Sales'],
    subcategory_summary['Total Profit'],
    s=200,
    alpha=0.6,
    c=subcategory_summary['Profit Margin %'],
    cmap='RdYlGn',
    edgecolors='black',
    linewidth=1
)

# Add colorbar to show profit margin
cbar = plt.colorbar(scatter)
cbar.set_label('Profit Margin %', fontsize=12)

# Add labels for each point
for idx, row in subcategory_summary.iterrows():
    plt.annotate(
        row['Sub-Category'],
        (row['Total Sales'], row['Total Profit']),
        fontsize=9,
        alpha=0.8,
        ha='center'
    )

# Add reference line (break-even)
plt.axhline(y=0, color='red', linestyle='--', linewidth=1, alpha=0.5, label='Break-even')

plt.xlabel('Total Sales ($)', fontsize=12)
plt.ylabel('Total Profit ($)', fontsize=12)
plt.title('Sub-Category Performance: Sales vs Profit', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Visualization 2: Profit Margin by Sub-Category (Bar Chart)
plt.figure(figsize=(12, 8))

# Sort by profit margin for better visualization
subcategory_sorted = subcategory_summary.sort_values('Profit Margin %')

# Create color map (red for negative, orange for low positive, green for good)
colors = ['red' if x < 0 else 'orange' if x < 10 else 'green' 
          for x in subcategory_sorted['Profit Margin %']]

# Create horizontal bar chart
bars = plt.barh(subcategory_sorted['Sub-Category'], subcategory_sorted['Profit Margin %'], color=colors, alpha=0.7)

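# Add value labels just outside the end of each bar (left of negative bars)
for i, value in enumerate(subcategory_sorted['Profit Margin %']):
    offset = 1 if value >= 0 else -1
    plt.text(value + offset, i, f'{value:.1f}%', va='center',
             ha='left' if value >= 0 else 'right', fontsize=9)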

# Add reference line at 10% threshold
plt.axvline(x=10, color='blue', linestyle='--', linewidth=1, alpha=0.5, label='10% Threshold')
plt.axvline(x=0, color='black', linewidth=1)

plt.xlabel('Profit Margin (%)', fontsize=12)
plt.ylabel('Sub-Category', fontsize=12)
plt.title('Profit Margin by Sub-Category', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

In [None]:
# Visualization 3: Top 15 Products by Sales with Profit Margin Color-Coding
plt.figure(figsize=(12, 8))

# Get top 15 products by sales
top_products = product_summary.head(15).copy()

# Create color map based on profit margin
colors = ['red' if x < 0 else 'orange' if x < 10 else 'green' 
          for x in top_products['Profit Margin %']]

# Create horizontal bar chart
bars = plt.barh(range(len(top_products)), top_products['Total Sales'], color=colors, alpha=0.7)

# Truncate long product names for better display
product_labels = [name[:40] + '...' if len(name) > 40 else name 
                  for name in top_products['Product Name']]

plt.yticks(range(len(top_products)), product_labels, fontsize=9)

# Add value labels
for i, (sales, margin) in enumerate(zip(top_products['Total Sales'], top_products['Profit Margin %'])):
    plt.text(sales + 500, i, f'${sales:,.0f} ({margin:.1f}%)', va='center', fontsize=8)

plt.xlabel('Total Sales ($)', fontsize=12)
plt.ylabel('Product Name', fontsize=12)
plt.title('Top 15 Products by Revenue (Color = Profit Margin)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()  # Highest sales at top

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='green', alpha=0.7, label='Healthy (≥10%)'),
    Patch(facecolor='orange', alpha=0.7, label='Low (0-10%)'),
    Patch(facecolor='red', alpha=0.7, label='Negative (<0%)')
]
plt.legend(handles=legend_elements, loc='lower right')

plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

In [None]:
# Visualization 4: Sales vs Profit by Sub-Category, grouped by Category
# Attach each sub-category's Category so related sub-categories sit side by side
category_mapping = df_clean[['Sub-Category', 'Category']].drop_duplicates()
subcategory_with_cat = (
    subcategory_summary.merge(category_mapping, on='Sub-Category')
    .sort_values(['Category', 'Total Sales'], ascending=[True, False])
)

plt.figure(figsize=(14, 6))

# Create grouped bar chart
x = np.arange(len(subcategory_with_cat))
width = 0.35

# Scale values to thousands of dollars so the axis labels stay readable
sales_values = subcategory_with_cat['Total Sales'] / 1000
profit_values = subcategory_with_cat['Total Profit'] / 1000

bars1 = plt.bar(x - width/2, sales_values, width, label='Sales ($K)', alpha=0.8, color='skyblue')
bars2 = plt.bar(x + width/2, profit_values, width, label='Profit ($K)', alpha=0.8, color='lightcoral')

plt.xlabel('Sub-Category', fontsize=12)
plt.ylabel('Amount (Thousands $)', fontsize=12)
plt.title('Sales and Profit by Sub-Category', fontsize=14, fontweight='bold')
plt.xticks(x, subcategory_with_cat['Sub-Category'], rotation=45, ha='right')
plt.legend()
plt.axhline(y=0, color='black', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

## 7. Key Findings Summary

Generate a summary of the most important insights from this analysis.

In [None]:
# Calculate summary statistics
total_sales_all = df_clean['Sales'].sum()
total_profit_all = df_clean['Profit'].sum()
overall_margin = (total_profit_all / total_sales_all) * 100

# Count products with negative profit
negative_profit_products = product_summary[product_summary['Total Profit'] < 0]
negative_profit_sales = negative_profit_products['Total Sales'].sum()

# Count sub-categories with negative profit
negative_subcats = subcategory_summary[subcategory_summary['Total Profit'] < 0]

print("="*70)
print("KEY FINDINGS: PRODUCT PROFITABILITY ANALYSIS")
print("="*70)
print()
print("Overall Business Metrics:")
print(f"  • Total Sales: ${total_sales_all:,.2f}")
print(f"  • Total Profit: ${total_profit_all:,.2f}")
print(f"  • Overall Profit Margin: {overall_margin:.2f}%")
print()
print("Product-Level Insights:")
print(f"  • Total unique products analyzed: {len(product_summary):,}")
print(f"  • Products with negative profit: {len(negative_profit_products):,} ({len(negative_profit_products)/len(product_summary)*100:.1f}%)")
print(f"  • Sales from unprofitable products: ${negative_profit_sales:,.2f}")
print(f"  • High-revenue, low-profit products: {len(problematic_products):,}")
print()
print("Sub-Category Insights:")
print(f"  • Total sub-categories: {len(subcategory_summary)}")
print(f"  • Sub-categories with negative profit: {len(negative_subcats)}")
print(f"  • Sub-categories with margin < 10%: {len(low_margin_subcategories)}")
print()
print("Top 3 Most Profitable Sub-Categories:")
top_profit_subcat = subcategory_summary.nlargest(3, 'Total Profit')[['Sub-Category', 'Total Profit', 'Profit Margin %']]
for idx, row in top_profit_subcat.iterrows():
    print(f"  • {row['Sub-Category']}: ${row['Total Profit']:,.2f} ({row['Profit Margin %']:.1f}% margin)")
print()
print("Sub-Categories Needing Attention (Lowest Margins):")
worst_subcat = subcategory_summary.nsmallest(3, 'Profit Margin %')[['Sub-Category', 'Total Sales', 'Profit Margin %']]
for idx, row in worst_subcat.iterrows():
    print(f"  • {row['Sub-Category']}: ${row['Total Sales']:,.2f} sales, {row['Profit Margin %']:.1f}% margin")
print()
print("="*70)

## 8. Business Recommendations

Based on the analysis, here are actionable recommendations:

### Immediate Actions
1. **Review pricing for low-margin products**: Products with high sales but low margins may benefit from strategic price increases or cost negotiations with suppliers.

2. **Investigate negative-profit items**: Products with a negative total profit should be evaluated for discontinuation or restructuring, after checking whether the losses persist over time.

3. **Focus on high-margin sub-categories**: Increase marketing and inventory investment in sub-categories showing strong profitability.

### Strategic Considerations
1. **Cost structure analysis**: For low-margin sub-categories, conduct a detailed cost analysis to identify opportunities for operational efficiency.

2. **Bundle opportunities**: Consider bundling low-margin products with high-margin items to improve overall profitability.

3. **Customer segmentation**: Analyze whether certain customer segments purchase more profitable product mixes.
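
As a rough sketch of the customer segmentation idea, the snippet below compares profit margin across segments. It assumes the data has a `Segment` column, which this notebook has not confirmed, and the rows are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration; the "Segment" column is an assumption
orders = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Corporate", "Corporate"],
    "Sales": [200.0, 300.0, 400.0, 100.0],
    "Profit": [10.0, 15.0, 80.0, 20.0],
})

# Total sales and profit per segment, then margin from the totals
segment_summary = orders.groupby("Segment")[["Sales", "Profit"]].sum()
segment_summary["Profit Margin %"] = (
    segment_summary["Profit"] / segment_summary["Sales"] * 100
)

print(segment_summary.sort_values("Profit Margin %", ascending=False))
```

Here both segments bring in the same revenue but at very different margins, which is the kind of pattern this analysis would look for.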

### Monitoring
- Track profit margins on a monthly basis to identify trends
- Set alerts for products falling below acceptable profit thresholds
- Review product portfolio quarterly to ensure alignment with profitability goals
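
The monthly tracking and alerting described above could be sketched as follows. This assumes the data includes an `Order Date` column, which the notebook has not confirmed. The rows and the `ALERT_THRESHOLD` name are invented for illustration, with the 10% value reused from the low-profit threshold.

```python
import pandas as pd

# Toy data, invented for illustration; "Order Date" is an assumed column
orders = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]
    ),
    "Sub-Category": ["Tables"] * 4,
    "Sales": [400.0, 600.0, 500.0, 500.0],
    "Profit": [60.0, 90.0, 20.0, -70.0],
})

ALERT_THRESHOLD = 10  # hypothetical alert level, in percent

# Aggregate sales and profit per sub-category per calendar month
monthly = (
    orders
    .assign(Month=orders["Order Date"].dt.to_period("M"))
    .groupby(["Sub-Category", "Month"])[["Sales", "Profit"]]
    .sum()
    .reset_index()
)
monthly["Profit Margin %"] = monthly["Profit"] / monthly["Sales"] * 100

# Rows below the threshold are the ones that would trigger an alert
alerts = monthly[monthly["Profit Margin %"] < ALERT_THRESHOLD]
print(alerts)
```

Computing the margin from monthly totals, rather than averaging per-order margins, keeps the figure consistent with the Profit Margin metric used in the rest of the analysis.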

## Conclusion

This analysis flags the products and sub-categories that generate significant revenue but deliver minimal or negative profit. The scatter and bar charts relate sales volume to profitability and point to where pricing or cost decisions can improve financial performance.

Key takeaway: **Not all sales are created equal.** Understanding which products drive profit, not just revenue, is essential for sustainable business growth.