# High Revenue, Low Profit Analysis (Very Strong Signal)

## Business Goal
This analysis identifies products and sub-categories that generate high revenue but deliver low or negative profit. These items represent critical areas for review because they consume resources, generate volume, and may even lose money despite appearing successful from a sales perspective.

## Analysis Steps
1. **Load sales data** from the data/ directory
2. **Prepare and validate** columns for Product Name, Sub-Category, Sales, and Profit
3. **Aggregate metrics** by Product and Sub-Category (total sales, total profit, order count)
4. **Flag high-revenue, low-profit items** using configurable thresholds
5. **Visualize** the relationship between sales and profit with flagged items highlighted
6. **Export** flagged product and sub-category lists for action
7. **Recommend** next steps for investigation and improvement

This analysis helps prioritize which products or categories may need pricing adjustments, cost reductions, discount policy reviews, or removal from inventory.

## Step 1: Import Libraries and Configure Settings

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os

# Configure pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.2f}'.format)

# Configure visualization settings
# Use matplotlib inline for Jupyter notebooks
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("✓ Libraries imported and settings configured")

## Step 2: Load Data from CSV

We'll automatically detect and load the first CSV file found in the `../data/` directory (one level above this notebook).

In [None]:
# Auto-detect the first CSV file in the data/ directory
csv_files = glob.glob('../data/*.csv')

if not csv_files:
    raise FileNotFoundError(
        "No CSV files found in the data/ directory. "
        "Please place your sales data CSV file in the data/ directory and try again."
    )

# Load the first CSV file found
data_file = csv_files[0]
print(f"Loading data from: {data_file}")

df = pd.read_csv(data_file)

# Display basic information about the dataset
print(f"\n✓ Dataset loaded successfully")
print(f"  Total records: {len(df):,}")
print(f"  Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()
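
`glob.glob` returns matches in arbitrary, OS-dependent order, so "the first CSV" can differ between machines when the directory holds several files. Below is a minimal sketch of a deterministic alternative. The helper `pick_csv` is hypothetical (it is not part of this notebook) and simply sorts the matches by name before taking the first one:

```python
import tempfile
from pathlib import Path


def pick_csv(data_dir):
    """Return the alphabetically first CSV in data_dir (hypothetical helper)."""
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    return csv_paths[0]


# Demonstrate on a throwaway directory so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sales_2024.csv", "sales_2023.csv", "notes.txt"):
        (Path(tmp) / name).write_text("Sales,Profit\n100,10\n")
    chosen_name = pick_csv(tmp).name
    print(chosen_name)
```

Sorting by name is only one possible rule. Picking the most recently modified file would be another reasonable choice, depending on how the data files are named.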

## Step 3: Column Guidance and Flexible Mapping

### Expected Columns
This analysis requires the following columns in your dataset:
- **Product or Product Name**: The name or identifier of the product
- **Sub-Category**: The sub-category classification of the product
- **Sales**: The sales amount/revenue for each transaction
- **Profit**: The profit amount for each transaction

The code below will automatically detect these columns using case-insensitive matching of common column name variations.

In [None]:
# Define function to detect column names (case-insensitive)
def find_column(df, possible_names):
    """
    Find a column in the dataframe by checking multiple possible names (case-insensitive).
    Returns the actual column name if found, None otherwise.
    """
    df_columns_lower = {col.lower(): col for col in df.columns}
    for name in possible_names:
        if name.lower() in df_columns_lower:
            return df_columns_lower[name.lower()]
    return None

# Map common column name variations to our expected column names
product_col = find_column(df, ['Product Name', 'Product', 'ProductName', 'product_name'])
subcat_col = find_column(df, ['Sub-Category', 'SubCategory', 'Sub Category', 'subcategory', 'sub_category'])
sales_col = find_column(df, ['Sales', 'Revenue', 'sales', 'revenue'])
profit_col = find_column(df, ['Profit', 'profit'])

# Report what was found
print("Column Mapping Results:")
print(f"  Product column: {product_col if product_col else 'NOT FOUND'}")
print(f"  Sub-Category column: {subcat_col if subcat_col else 'NOT FOUND'}")
print(f"  Sales column: {sales_col if sales_col else 'NOT FOUND'}")
print(f"  Profit column: {profit_col if profit_col else 'NOT FOUND'}")

# Check if required columns are present
if not sales_col or not profit_col:
    raise ValueError(
        f"Required columns not found. Available columns: {list(df.columns)}\n"
        "Please ensure your CSV has 'Sales' and 'Profit' columns."
    )

print("\n✓ Required columns found")
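
The matching above ignores case but still needs one of the listed spellings exactly, so a header with stray whitespace or a different separator (for example ` product_name ` or `SUB CATEGORY`) would be reported as NOT FOUND. One possible extension, sketched here with a hypothetical helper `find_column_normalized`, compares names after removing everything except letters and digits:

```python
import re

import pandas as pd


def _normalize(name):
    """Lowercase a column name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", str(name).lower())


def find_column_normalized(df, possible_names):
    """Hypothetical variant of find_column that ignores case, spaces, and separators."""
    normalized = {_normalize(col): col for col in df.columns}
    for name in possible_names:
        match = normalized.get(_normalize(name))
        if match is not None:
            return match
    return None


# Toy frame with messy headers (illustrative data only)
toy = pd.DataFrame(columns=[" product_name ", "SUB CATEGORY", "Sales", "Profit"])

product_match = find_column_normalized(toy, ["Product Name", "Product"])
subcat_match = find_column_normalized(toy, ["Sub-Category"])
missing_match = find_column_normalized(toy, ["Revenue"])
print(repr(product_match), repr(subcat_match), repr(missing_match))
```

If two headers normalize to the same key, the later column wins in this sketch, so it is worth checking the printed mapping before continuing.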

In [None]:
# Convert Sales and Profit to numeric, coerce errors to NaN
df[sales_col] = pd.to_numeric(df[sales_col], errors='coerce')
df[profit_col] = pd.to_numeric(df[profit_col], errors='coerce')

# Check for and report missing values
print("Data Quality Check:")
print(f"  Missing Sales values: {df[sales_col].isna().sum()}")
print(f"  Missing Profit values: {df[profit_col].isna().sum()}")

# Drop rows with missing Sales or Profit values
original_count = len(df)
df = df.dropna(subset=[sales_col, profit_col])
dropped_count = original_count - len(df)

if dropped_count > 0:
    print(f"  Dropped {dropped_count} rows with missing Sales or Profit values")

print(f"\n✓ Data cleaned - {len(df):,} records ready for analysis")
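
Note that `pd.to_numeric(..., errors='coerce')` turns every value it cannot parse into NaN, so amounts stored as currency strings would be dropped by the cleaning step rather than converted. A minimal sketch of pre-cleaning such values follows. The helper `to_number` is hypothetical and assumes US-style formatting (`$`, thousands commas, parentheses for negatives):

```python
import pandas as pd


def to_number(series):
    """Hypothetical helper: parse US-style currency strings into floats."""
    text = series.astype(str).str.strip()
    # Accounting notation: "(45.10)" means -45.10
    text = text.str.replace(r"^\((.*)\)$", r"-\1", regex=True)
    # Remove currency symbols and thousands separators
    text = text.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(text, errors="coerce")


raw = pd.Series(["$1,200.50", "300", "(45.10)", "n/a"])

naive = pd.to_numeric(raw, errors="coerce")
cleaned = to_number(raw)

n_lost_naive = int(naive.isna().sum())
print(n_lost_naive)        # values lost without pre-cleaning
print(cleaned.tolist())
```

If the "Missing Sales values" count printed by the data quality check is unexpectedly high, formatting like this is one likely cause worth ruling out.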

## Step 4: Aggregate by Product

We'll calculate total sales, total profit, and order count (the number of transaction rows) for each product, then sort by sales to identify top revenue generators.

In [None]:
# Check if product column is available
if product_col:
    # Aggregate by Product
    product_agg = df.groupby(product_col).agg({
        sales_col: ['sum', 'count'],
        profit_col: 'sum'
    }).reset_index()
    
    # Flatten column names
    product_agg.columns = ['product', 'total_sales', 'n_orders', 'total_profit']
    
    # Sort by total sales descending
    product_agg = product_agg.sort_values('total_sales', ascending=False).reset_index(drop=True)
    
    # Add profit margin percentage
    product_agg['profit_margin_pct'] = (product_agg['total_profit'] / product_agg['total_sales'] * 100).round(2)
    
    print(f"✓ Product aggregation complete: {len(product_agg)} unique products")
    print(f"\nTop 10 Products by Total Sales:")
    display(product_agg.head(10))
else:
    print("⚠ Product column not found - skipping product-level analysis")
    product_agg = None
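
The aggregation above renames the flattened columns by position and divides by `total_sales`, which produces `inf` or `NaN` when a product's sales sum to zero. As a sketch of an alternative, pandas named aggregation assigns the output names directly, and zero sales can be mapped to an undefined margin. The toy data and the column names `Product Name`, `Sales`, and `Profit` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Product Name": ["Desk", "Desk", "Lamp", "Sample"],
    "Sales": [500.0, 300.0, 200.0, 0.0],
    "Profit": [-50.0, 10.0, 60.0, -5.0],
})

agg = (
    toy.groupby("Product Name", as_index=False)
    .agg(
        total_sales=("Sales", "sum"),
        n_orders=("Sales", "count"),   # counts rows, not distinct order IDs
        total_profit=("Profit", "sum"),
    )
    .sort_values("total_sales", ascending=False)
    .reset_index(drop=True)
)

# Treat zero sales as "margin undefined" instead of producing inf
safe_sales = agg["total_sales"].replace(0, np.nan)
agg["profit_margin_pct"] = (agg["total_profit"] / safe_sales * 100).round(2)

print(agg)
```

Here the zero-sales "Sample" row keeps its negative profit but gets a NaN margin, so it cannot distort any later sorting or averaging of margins.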

## Step 5: Aggregate by Sub-Category

Similarly, we'll aggregate metrics by sub-category to identify category-level patterns.

In [None]:
# Check if sub-category column is available
if subcat_col:
    # Aggregate by Sub-Category
    subcat_agg = df.groupby(subcat_col).agg({
        sales_col: ['sum', 'count'],
        profit_col: 'sum'
    }).reset_index()
    
    # Flatten column names
    subcat_agg.columns = ['sub_category', 'total_sales', 'n_orders', 'total_profit']
    
    # Sort by total sales descending
    subcat_agg = subcat_agg.sort_values('total_sales', ascending=False).reset_index(drop=True)
    
    # Add profit margin percentage
    subcat_agg['profit_margin_pct'] = (subcat_agg['total_profit'] / subcat_agg['total_sales'] * 100).round(2)
    
    print(f"✓ Sub-Category aggregation complete: {len(subcat_agg)} unique sub-categories")
    print(f"\nAll Sub-Categories by Total Sales:")
    display(subcat_agg)
else:
    print("⚠ Sub-Category column not found - skipping sub-category analysis")
    subcat_agg = None

## Step 6: Flagging Logic - Identify High Revenue, Low Profit Items

### Flagging Approach
We'll flag items as "High Revenue, Low Profit" using the following criteria:

1. **High Sales Threshold**: Top 20% of items by total sales (configurable)
   - This identifies items that generate significant revenue volume

2. **Low Profit Criteria** (items meeting EITHER condition):
   - **Negative or Zero Profit**: total_profit <= 0
   - **Bottom 20% Profit**: Among profitable items, those in the bottom 20% by total profit

3. **Combined Flag**: Items that are BOTH high sales AND low profit

These thresholds can be adjusted based on your business context. You may want to use more aggressive thresholds (e.g., top 10% sales, bottom 10% profit) for very large datasets.

In [None]:
# Define configurable thresholds
HIGH_SALES_PERCENTILE = 80  # Top 20% (items above 80th percentile)
LOW_PROFIT_PERCENTILE = 20  # Bottom 20% (items below 20th percentile)

print(f"Using thresholds:")
print(f"  High Sales: Top {100 - HIGH_SALES_PERCENTILE}% (above {HIGH_SALES_PERCENTILE}th percentile)")
print(f"  Low Profit: Bottom {LOW_PROFIT_PERCENTILE}% or negative/zero profit")
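
To make the percentile settings concrete, here is a small worked example on five invented items. `Series.quantile` interpolates linearly by default, so a threshold does not have to coincide with an observed value:

```python
import pandas as pd

sales = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])
profit = pd.Series([-20.0, 5.0, 40.0, 10.0, 80.0])

# 80th percentile of sales: position 0.8 * (5 - 1) = 3.2, between 400 and 500
sales_threshold = sales.quantile(0.80)

# 20th percentile of profit among profitable items only: [5, 10, 40, 80]
profitable = profit[profit > 0]
profit_threshold = profitable.quantile(0.20)

print(sales_threshold)
print(profit_threshold)
```

The sales threshold comes out at about 420 and the profit threshold at about 8. Only the item with sales of 500 clears the sales threshold, and its profit of 80 is well above 8, so nothing would be flagged in this toy set. The loss-making item (profit of -20) has sales of only 100 and is therefore not "high revenue".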

### Flag Products

In [None]:
if product_agg is not None:
    # Calculate thresholds for products
    sales_threshold_product = product_agg['total_sales'].quantile(HIGH_SALES_PERCENTILE / 100)
    profit_threshold_product = product_agg[product_agg['total_profit'] > 0]['total_profit'].quantile(LOW_PROFIT_PERCENTILE / 100)
    
    print(f"\nProduct Thresholds:")
    print(f"  High Sales threshold: ${sales_threshold_product:,.2f}")
    print(f"  Low Profit threshold: ${profit_threshold_product:,.2f}")
    
    # Create flags
    product_agg['high_sales_flag'] = product_agg['total_sales'] >= sales_threshold_product
    product_agg['low_or_negative_profit_flag'] = product_agg['total_profit'] <= 0
    product_agg['low_profit_bottom_pct_flag'] = (
        (product_agg['total_profit'] > 0) & 
        (product_agg['total_profit'] <= profit_threshold_product)
    )
    
    # Combined flag: high sales AND (low or negative profit)
    product_agg['flagged_strong'] = (
        product_agg['high_sales_flag'] & 
        (product_agg['low_or_negative_profit_flag'] | product_agg['low_profit_bottom_pct_flag'])
    )
    
    # Show flagged products
    flagged_products = product_agg[product_agg['flagged_strong']].copy()
    flagged_products = flagged_products.sort_values('total_sales', ascending=False)
    
    print(f"\n🚨 Found {len(flagged_products)} HIGH REVENUE, LOW PROFIT products:")
    if len(flagged_products) > 0:
        display(flagged_products[['product', 'total_sales', 'total_profit', 'profit_margin_pct', 'n_orders']])
    else:
        print("  No products flagged with current thresholds")
else:
    print("⚠ Skipping product flagging (no product data)")
    flagged_products = None

### Flag Sub-Categories

In [None]:
if subcat_agg is not None:
    # Calculate thresholds for sub-categories
    sales_threshold_subcat = subcat_agg['total_sales'].quantile(HIGH_SALES_PERCENTILE / 100)
    profit_threshold_subcat = subcat_agg[subcat_agg['total_profit'] > 0]['total_profit'].quantile(LOW_PROFIT_PERCENTILE / 100)
    
    print(f"\nSub-Category Thresholds:")
    print(f"  High Sales threshold: ${sales_threshold_subcat:,.2f}")
    print(f"  Low Profit threshold: ${profit_threshold_subcat:,.2f}")
    
    # Create flags
    subcat_agg['high_sales_flag'] = subcat_agg['total_sales'] >= sales_threshold_subcat
    subcat_agg['low_or_negative_profit_flag'] = subcat_agg['total_profit'] <= 0
    subcat_agg['low_profit_bottom_pct_flag'] = (
        (subcat_agg['total_profit'] > 0) & 
        (subcat_agg['total_profit'] <= profit_threshold_subcat)
    )
    
    # Combined flag: high sales AND (low or negative profit)
    subcat_agg['flagged_strong'] = (
        subcat_agg['high_sales_flag'] & 
        (subcat_agg['low_or_negative_profit_flag'] | subcat_agg['low_profit_bottom_pct_flag'])
    )
    
    # Show flagged sub-categories
    flagged_subcats = subcat_agg[subcat_agg['flagged_strong']].copy()
    flagged_subcats = flagged_subcats.sort_values('total_sales', ascending=False)
    
    print(f"\n🚨 Found {len(flagged_subcats)} HIGH REVENUE, LOW PROFIT sub-categories:")
    if len(flagged_subcats) > 0:
        display(flagged_subcats[['sub_category', 'total_sales', 'total_profit', 'profit_margin_pct', 'n_orders']])
    else:
        print("  No sub-categories flagged with current thresholds")
else:
    print("⚠ Skipping sub-category flagging (no sub-category data)")
    flagged_subcats = None
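
The product and sub-category cells apply the same rules, so the logic could be factored into one function. The sketch below is one way to do that. The name `flag_high_revenue_low_profit` and its parameters are hypothetical, and the input is assumed to carry `total_sales` and `total_profit` columns as produced in Steps 4 and 5:

```python
import pandas as pd


def flag_high_revenue_low_profit(agg, high_sales_pct=80, low_profit_pct=20):
    """Return a copy of agg with the same flag columns the notebook builds."""
    out = agg.copy()
    sales_cut = out["total_sales"].quantile(high_sales_pct / 100)
    profitable = out.loc[out["total_profit"] > 0, "total_profit"]
    # With no profitable rows the quantile is NaN and the <= comparison is all False
    profit_cut = profitable.quantile(low_profit_pct / 100)

    out["high_sales_flag"] = out["total_sales"] >= sales_cut
    out["low_or_negative_profit_flag"] = out["total_profit"] <= 0
    out["low_profit_bottom_pct_flag"] = (out["total_profit"] > 0) & (
        out["total_profit"] <= profit_cut
    )
    out["flagged_strong"] = out["high_sales_flag"] & (
        out["low_or_negative_profit_flag"] | out["low_profit_bottom_pct_flag"]
    )
    return out


toy = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "total_sales": [900.0, 800.0, 300.0, 200.0, 100.0],
    "total_profit": [-120.0, 150.0, 40.0, 25.0, 10.0],
})

flagged = flag_high_revenue_low_profit(toy)
flagged_names = flagged.loc[flagged["flagged_strong"], "product"].tolist()
print(flagged_names)
```

Because the function works on a copy, the input frame is left unchanged, which makes it easy to rerun with different percentiles and compare the results.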

## Step 7: Visualizations

### Scatter Plot: Product Sales vs Profit

In [None]:
if product_agg is not None and len(product_agg) > 0:
    plt.figure(figsize=(14, 8))
    
    # Plot non-flagged products in gray
    non_flagged = product_agg[~product_agg['flagged_strong']]
    plt.scatter(non_flagged['total_sales'], non_flagged['total_profit'], 
                alpha=0.5, s=50, c='gray', label='Normal Products')
    
    # Plot flagged products in red
    if flagged_products is not None and len(flagged_products) > 0:
        plt.scatter(flagged_products['total_sales'], flagged_products['total_profit'], 
                    alpha=0.8, s=100, c='red', marker='D', label='High Revenue, Low Profit', edgecolors='darkred', linewidth=1.5)
        
        # Annotate top flagged products (up to 5)
        top_flagged = flagged_products.nlargest(5, 'total_sales')
        for idx, row in top_flagged.iterrows():
            # Cast to str so numeric product identifiers do not break len() or slicing
            label = str(row['product'])
            plt.annotate(label[:30] + '...' if len(label) > 30 else label,
                        xy=(row['total_sales'], row['total_profit']),
                        xytext=(10, 10), textcoords='offset points',
                        fontsize=9, alpha=0.8,
                        bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.3),
                        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0', color='red', lw=1))
    
    # Add reference line at profit = 0
    plt.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5, label='Break-even')
    
    plt.title('Product Analysis: Total Sales vs Total Profit', fontsize=14, fontweight='bold', pad=20)
    plt.xlabel('Total Sales ($)', fontsize=12)
    plt.ylabel('Total Profit ($)', fontsize=12)
    plt.legend(loc='best', fontsize=10)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("\nChart Interpretation:")
    print("  • Red diamonds = High Revenue, Low Profit products requiring attention")
    print("  • Products below the black line have negative profit (losses)")
    print("  • Top flagged products are labeled for easy identification")
else:
    print("⚠ Skipping product scatter plot (no product data)")

### Scatter Plot: Sub-Category Sales vs Profit

In [None]:
if subcat_agg is not None and len(subcat_agg) > 0:
    plt.figure(figsize=(14, 8))
    
    # Plot non-flagged sub-categories in gray
    non_flagged = subcat_agg[~subcat_agg['flagged_strong']]
    plt.scatter(non_flagged['total_sales'], non_flagged['total_profit'], 
                alpha=0.6, s=100, c='gray', label='Normal Sub-Categories')
    
    # Plot flagged sub-categories in red
    if flagged_subcats is not None and len(flagged_subcats) > 0:
        plt.scatter(flagged_subcats['total_sales'], flagged_subcats['total_profit'], 
                    alpha=0.9, s=200, c='red', marker='D', label='High Revenue, Low Profit', edgecolors='darkred', linewidth=2)
        
        # Annotate all flagged sub-categories (usually not too many)
        for idx, row in flagged_subcats.iterrows():
            plt.annotate(row['sub_category'], 
                        xy=(row['total_sales'], row['total_profit']),
                        xytext=(10, 10), textcoords='offset points',
                        fontsize=10, alpha=0.9, fontweight='bold',
                        bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.4),
                        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0', color='red', lw=1.5))
    
    # Add reference line at profit = 0
    plt.axhline(y=0, color='black', linestyle='--', linewidth=1.5, alpha=0.6, label='Break-even')
    
    plt.title('Sub-Category Analysis: Total Sales vs Total Profit', fontsize=14, fontweight='bold', pad=20)
    plt.xlabel('Total Sales ($)', fontsize=12)
    plt.ylabel('Total Profit ($)', fontsize=12)
    plt.legend(loc='best', fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("\nChart Interpretation:")
    print("  • Red diamonds = High Revenue, Low Profit sub-categories requiring attention")
    print("  • Sub-categories below the black line have negative profit (losses)")
    print("  • All flagged sub-categories are labeled")
else:
    print("⚠ Skipping sub-category scatter plot (no sub-category data)")

### Bar Chart: Top Flagged Products by Sales

In [None]:
if flagged_products is not None and len(flagged_products) > 0:
    # Show top 15 flagged products by sales
    top_n = min(15, len(flagged_products))
    top_flagged = flagged_products.nlargest(top_n, 'total_sales')
    
    # Create figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Sales bar chart
    colors_sales = ['red' if p < 0 else 'orange' for p in top_flagged['total_profit']]
    ax1.barh(range(top_n), top_flagged['total_sales'], color=colors_sales, alpha=0.7)
    ax1.set_yticks(range(top_n))
    ax1.set_yticklabels([p[:40] + '...' if len(p) > 40 else p for p in top_flagged['product'].astype(str)], fontsize=9)
    ax1.set_xlabel('Total Sales ($)', fontsize=11)
    ax1.set_title(f'Top {top_n} Flagged Products by Sales', fontsize=12, fontweight='bold')
    ax1.invert_yaxis()
    ax1.grid(axis='x', alpha=0.3)
    
    # Profit bar chart
    colors_profit = ['darkred' if p < 0 else 'orange' for p in top_flagged['total_profit']]
    ax2.barh(range(top_n), top_flagged['total_profit'], color=colors_profit, alpha=0.7)
    ax2.set_yticks(range(top_n))
    ax2.set_yticklabels([p[:40] + '...' if len(p) > 40 else p for p in top_flagged['product'].astype(str)], fontsize=9)
    ax2.set_xlabel('Total Profit ($)', fontsize=11)
    ax2.set_title(f'Total Profit for Top {top_n} Flagged Products', fontsize=12, fontweight='bold')
    ax2.axvline(x=0, color='black', linestyle='--', linewidth=1)
    ax2.invert_yaxis()
    ax2.grid(axis='x', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\nChart Note:")
    print("  • Red (sales chart) and dark red (profit chart) bars indicate negative profit (losses)")
    print("  • Orange bars indicate zero or low positive profit")
else:
    print("⚠ No flagged products to display in bar chart")

## Step 8: Export Flagged Lists

Save the flagged products and sub-categories to CSV files for further action and tracking.

In [None]:
# Create output directory if it doesn't exist
output_dir = 'outputs'
os.makedirs(output_dir, exist_ok=True)

# Export flagged products
if flagged_products is not None and len(flagged_products) > 0:
    output_file = os.path.join(output_dir, 'product_agg_flagged.csv')
    flagged_products.to_csv(output_file, index=False)
    print(f"✓ Exported {len(flagged_products)} flagged products to: {output_file}")
else:
    print("ℹ No flagged products to export")

# Export flagged sub-categories
if flagged_subcats is not None and len(flagged_subcats) > 0:
    output_file = os.path.join(output_dir, 'subcat_agg_flagged.csv')
    flagged_subcats.to_csv(output_file, index=False)
    print(f"✓ Exported {len(flagged_subcats)} flagged sub-categories to: {output_file}")
else:
    print("ℹ No flagged sub-categories to export")

print(f"\n✓ Export complete - check the {output_dir}/ directory for CSV files")

## Step 9: Actionable Recommendations and Next Steps

### What to Investigate for Flagged Items

For each product or sub-category flagged as "High Revenue, Low Profit," investigate the following:

#### 1. **Discount Analysis**
   - Are these items frequently discounted?
   - Is the discount percentage eroding profit margins?
   - **Action**: Review discount policy, consider reducing discount depth or frequency

#### 2. **Returns and Refunds**
   - Do these items have high return rates?
   - Are returns eating into profitability?
   - **Action**: Investigate quality issues, improve product descriptions, or discontinue problematic items

#### 3. **Cost of Goods Sold (COGS)**
   - Is the cost per unit too high relative to selling price?
   - Can you negotiate better pricing with suppliers?
   - **Action**: Review supplier contracts, consider alternative suppliers, or adjust pricing

#### 4. **Pricing Strategy**
   - Is the product priced too low?
   - Can the market support a price increase?
   - **Action**: Test gradual price increases, consider value-based pricing

#### 5. **Shipping and Fulfillment Costs**
   - Are shipping costs disproportionately high for these items?
   - Are there hidden fulfillment costs?
   - **Action**: Optimize shipping methods, bundle items, or pass costs to customer

#### 6. **Product Mix and Bundling**
   - Can low-profit items be bundled with high-profit items?
   - Are these items "loss leaders" that drive other purchases?
   - **Action**: Analyze purchase patterns, create strategic bundles

### Suggested Next Steps

#### Short-term Actions (This Week)
1. **Review the exported CSV files** with your merchandising and pricing teams
2. **Pull detailed transaction data** for the top 5-10 flagged items
3. **Calculate average discount percentages** for flagged vs. non-flagged items
4. **Identify quick wins**: Items where a small price adjustment could significantly improve profitability
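
The discount comparison in action 3 could be sketched as follows. This assumes the transaction data has a `Discount` column holding fractions between 0 and 1, which this notebook neither requires nor checks for. All values below are invented:

```python
import pandas as pd

transactions = pd.DataFrame({
    "Product Name": ["Desk", "Desk", "Lamp", "Lamp", "Chair"],
    "Discount": [0.40, 0.30, 0.00, 0.10, 0.20],
})
# Names that the flagging step marked as high revenue, low profit (assumed)
flagged_names = {"Desk"}

# Label each transaction by whether its product was flagged
transactions["group"] = (
    transactions["Product Name"]
    .isin(flagged_names)
    .map({True: "flagged", False: "not flagged"})
)
avg_discount = transactions.groupby("group")["Discount"].mean()

print(avg_discount)
```

A clearly higher average discount in the flagged group would point toward discount policy as a driver of low profit, though it does not rule out cost or pricing causes.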

#### Medium-term Analysis (This Month)
1. **Time Window Analysis**: Run this analysis for different time periods (last 30 days, last 90 days, last year) to see if these patterns are consistent or seasonal
2. **Customer Segment Analysis**: Determine if certain customer segments are driving the low profitability
3. **Calculate True Profit Margins**: If possible, add detailed cost data (COGS, shipping, handling) to get more accurate profit margins
4. **Supplier Negotiations**: For flagged items with high COGS, open discussions with suppliers about volume discounts or alternative products
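
The time window comparison in item 1 might look like the sketch below. It assumes the data includes an `Order Date` column (not among this notebook's required columns) and uses a hypothetical helper `profit_by_window`. The figures are invented:

```python
import pandas as pd


def profit_by_window(df, as_of, windows=(30, 90, 365)):
    """Total sales and profit per product for each trailing window, in days."""
    frames = []
    for days in windows:
        start = as_of - pd.Timedelta(days=days)
        recent = df[(df["Order Date"] > start) & (df["Order Date"] <= as_of)]
        summary = recent.groupby("Product Name", as_index=False)[["Sales", "Profit"]].sum()
        summary["window_days"] = days
        frames.append(summary)
    return pd.concat(frames, ignore_index=True)


toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2024-06-20", "2024-05-01", "2023-09-15"]),
    "Product Name": ["Desk", "Desk", "Desk"],
    "Sales": [500.0, 400.0, 300.0],
    "Profit": [-60.0, 20.0, 45.0],
})

result = profit_by_window(toy, as_of=pd.Timestamp("2024-06-30"))
print(result)
```

In this toy data the product is loss-making over the last 30 and 90 days but slightly profitable over the year, the kind of pattern that suggests a recent change rather than a long-standing problem.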

#### Long-term Strategy (This Quarter)
1. **Set Up Monitoring Dashboard**: Create a Tableau or Power BI dashboard to track these metrics automatically and alert when new items are flagged
2. **Establish Profit Margin Targets**: Define minimum acceptable profit margins by category
3. **Review Product Portfolio**: Consider phasing out persistently unprofitable items that don't serve a strategic purpose
4. **Pricing Optimization Project**: Implement dynamic pricing or regular pricing reviews for flagged categories
5. **Track Impact**: Measure the financial impact of changes made to flagged items

### Success Metrics to Track
- Number of flagged items reduced month-over-month
- Average profit margin improvement for previously flagged items
- Revenue maintained or grown while improving profitability
- Total profit dollars recovered from corrective actions
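
The first metric can be computed by comparing the flagged lists from two runs. The sketch below assumes that an earlier export has been kept and loaded alongside the current one. Since the export step writes to fixed file names and overwrites them on each run, retaining earlier files is an extra step you would need to arrange. The product names are invented:

```python
import pandas as pd

# Stand-ins for two exported flagged-product lists (earlier run and latest run)
previous = pd.DataFrame({"product": ["Desk", "Lamp", "Chair"]})
current = pd.DataFrame({"product": ["Desk", "Printer"]})

prev_set = set(previous["product"])
curr_set = set(current["product"])

resolved = sorted(prev_set - curr_set)        # no longer flagged
newly_flagged = sorted(curr_set - prev_set)   # flagged for the first time
still_flagged = sorted(prev_set & curr_set)   # flagged in both runs
net_change = len(curr_set) - len(prev_set)

print(resolved, newly_flagged, still_flagged, net_change)
```

Tracking the three lists separately is more informative than the net count alone, because a flat count can hide items leaving and entering the flagged set.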

### Important Considerations
- **Don't eliminate items hastily**: Some low-profit items may serve strategic purposes (customer acquisition, complementary products, competitive positioning)
- **Test changes carefully**: Make pricing or policy changes incrementally and monitor impact
- **Consider customer lifetime value**: An item with low immediate profit might attract customers who make profitable future purchases
- **Review regularly**: Run this analysis monthly or quarterly to catch new problem items early

---

## How to Run This Notebook

### Prerequisites
- Python 3.8 or higher
- Required libraries: pandas, numpy, matplotlib, seaborn (install via `pip install -r requirements.txt`)

### Instructions
1. **Place your sales data CSV file** in the `data/` directory (one level up from this notebook)
2. **Ensure your CSV has the required columns**: Sales, Profit, and optionally Product Name and Sub-Category
3. **Run all cells** in order from top to bottom (Kernel → Restart & Run All)
4. **Review the flagged items** in the output and exported CSV files
5. **Adjust thresholds** if needed (change `HIGH_SALES_PERCENTILE` and `LOW_PROFIT_PERCENTILE` values)
6. **Find the exported results** in the `outputs/` subdirectory, which is created in the notebook's working directory

### Troubleshooting
- **Error: No CSV files found**: Ensure your CSV is in the `data/` directory
- **Error: Required columns not found**: Check that your CSV has 'Sales' and 'Profit' columns (case-insensitive)
- **No items flagged**: Try adjusting thresholds to be less restrictive (e.g., 70th percentile for sales)

---

**Analysis Complete** ✓

This notebook has identified high-revenue, low-profit products and sub-categories that warrant further investigation. Use the exported CSV files and recommendations above to drive profitable business decisions.