# High Revenue, Low Profit Analysis (Very Strong Signal)

Goal
- Identify products and sub-categories that generate high total sales but have low or negative profit.
- These are strong signals for pricing, discounting, or cost issues.

What this notebook does
- Load sales data from the `data/` directory.
- Clean and prepare the data.
- Group by Product and Sub-Category to compute total Sales and Profit.
- Flag items with high sales but low/negative profits.
- Visualize and export flagged lists.

In [None]:
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# %matplotlib inline  # optional in Jupyter
sns.set(style="whitegrid")
pd.set_option("display.max_columns", 80)
pd.set_option("display.width", 120)

In [None]:
data_dir = "data"
# Sort so the "first" CSV is deterministic; glob returns files in arbitrary order.
csv_files = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in '{data_dir}'. Place your sales CSV in that folder.")

data_path = csv_files[0]
print("Using data file:", data_path)

df = pd.read_csv(data_path, low_memory=False)
print("Rows, columns:", df.shape)
df.head()

Notes on columns
- This notebook expects the dataset to contain at least columns similar to: `Product Name` (or `Product`), `Sub-Category` (or `SubCategory`), `Sales`, and `Profit`.
- If your column names differ, update the mapping in the next cell.

In [None]:
cols = {c.lower(): c for c in df.columns}

def find_col(possible_names):
    for name in possible_names:
        if name.lower() in cols:
            return cols[name.lower()]
    return None

product_col = find_col(["Product Name", "Product"])
subcat_col = find_col(["Sub-Category", "SubCategory", "Sub Category"])
sales_col = find_col(["Sales"])
profit_col = find_col(["Profit"])

print("Detected columns -> Product:", product_col, "Sub-Category:", subcat_col, "Sales:", sales_col, "Profit:", profit_col)

if sales_col is None or profit_col is None:
    raise ValueError("Sales and/or Profit column not found. Rename columns or update mapping.")

# Convert to numeric
df[sales_col] = pd.to_numeric(df[sales_col], errors="coerce")
df[profit_col] = pd.to_numeric(df[profit_col], errors="coerce")

# Drop rows without sales or profit info
df_clean = df.dropna(subset=[sales_col, profit_col]).copy()
print("After dropping NA sales/profit:", df_clean.shape)

df_clean[[c for c in [product_col, subcat_col, sales_col, profit_col] if c is not None]].head()

Aggregation
- We compute total Sales and total Profit for each Product (if a Product column exists) and for each Sub-Category.
- "High sales" is defined by a percentile threshold (default: top 20% by total sales). "Low profit" is defined as either total profit <= 0 or the bottom 20% by total profit. Both flags are computed so you can choose whichever is more appropriate.

In [None]:
if product_col:
    prod_agg = (
        df_clean.groupby(product_col)
        # n_orders counts data rows (line items), not distinct order IDs.
        .agg(total_sales=(sales_col, "sum"), total_profit=(profit_col, "sum"), n_orders=(sales_col, "count"))
        .reset_index()
        .sort_values("total_sales", ascending=False)
    )
else:
    prod_agg = pd.DataFrame()
    print("Product column not available; skipping product-level aggregation.")

# Shown at top level: Jupyter only renders a cell's last top-level expression.
prod_agg.head(10)

In [None]:
if subcat_col:
    subcat_agg = (
        df_clean.groupby(subcat_col)
        # n_orders counts data rows (line items), not distinct order IDs.
        .agg(total_sales=(sales_col, "sum"), total_profit=(profit_col, "sum"), n_orders=(sales_col, "count"))
        .reset_index()
        .sort_values("total_sales", ascending=False)
    )
else:
    subcat_agg = pd.DataFrame()
    print("Sub-Category column not available; skipping sub-category aggregation.")

# Shown at top level: Jupyter only renders a cell's last top-level expression.
subcat_agg.head(10)

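Flagging logic (configurable)
- High sales threshold: products/sub-categories in the top 20% by total sales (`high_sales_pct`).
- Low profit threshold: (A) total_profit <= 0 (strict), and/or (B) bottom 20% by total profit (`low_profit_pct`).
- The flagging cell creates four flags:
  - high_sales_flag: total_sales at or above the sales threshold
  - low_or_negative_profit_flag: total_profit <= 0
  - low_profit_bottom_pct_flag: total_profit at or below the bottom-percentile threshold
  - flagged_strong: high_sales_flag & (low_or_negative_profit_flag | low_profit_bottom_pct_flag)

You can adjust the percentiles (e.g., top 10% instead of top 20%). To apply the strict rule only, drop low_profit_bottom_pct_flag from the flagged_strong expression.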
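
As a rough illustration of the rules above, the sketch below applies them to a small invented table. The item names and numbers are made up for this example and are not taken from any dataset. It computes the strict variant (profit <= 0) and the broader variant (profit <= 0 or bottom 20% by profit) side by side, so the difference is visible.

```python
import pandas as pd

# Hypothetical aggregated totals, invented for illustration only.
toy = pd.DataFrame({
    "item": list("ABCDEFGHIJ"),
    "total_sales": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0, 1000.0],
    "total_profit": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, -40.0, 5.0],
})

sales_th = toy["total_sales"].quantile(0.80)        # cut-off for the top 20% by sales
profit_low_th = toy["total_profit"].quantile(0.20)  # cut-off for the bottom 20% by profit

high_sales = toy["total_sales"] >= sales_th
non_positive = toy["total_profit"] <= 0
bottom_pct = toy["total_profit"] <= profit_low_th

# Strict rule: high sales and profit <= 0.
strict_items = toy.loc[high_sales & non_positive, "item"].tolist()
# Broader rule: high sales and (profit <= 0 or bottom 20% by profit).
broad_items = toy.loc[high_sales & (non_positive | bottom_pct), "item"].tolist()

print("strict:", strict_items)
print("broad:", broad_items)
```

In this toy table, item I sells well at a loss and is caught by both rules, while item J sells well at a small positive profit and is caught only by the broader rule.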

In [None]:
high_sales_pct = 0.80  # top 20% considered high sales
low_profit_pct = 0.20  # bottom 20% considered low profit

def flag_agg(df_agg):
    if df_agg.empty:
        return df_agg, None, None
    sales_th = df_agg["total_sales"].quantile(high_sales_pct)
    profit_low_th = df_agg["total_profit"].quantile(low_profit_pct)
    df_agg = df_agg.copy()
    df_agg["high_sales_flag"] = df_agg["total_sales"] >= sales_th
    df_agg["low_or_negative_profit_flag"] = df_agg["total_profit"] <= 0
    df_agg["low_profit_bottom_pct_flag"] = df_agg["total_profit"] <= profit_low_th
    df_agg["flagged_strong"] = df_agg["high_sales_flag"] & (
        df_agg["low_or_negative_profit_flag"] | df_agg["low_profit_bottom_pct_flag"]
    )
    return df_agg, sales_th, profit_low_th

# flag_agg already returns (df_agg, None, None) for an empty frame, so no extra guard is needed.
prod_agg_flagged, prod_sales_th, prod_profit_low_th = flag_agg(prod_agg)
subcat_agg_flagged, subcat_sales_th, subcat_profit_low_th = flag_agg(subcat_agg)

print("Product sales threshold:", prod_sales_th, "profit low threshold:", prod_profit_low_th)
print("Sub-category sales threshold:", subcat_sales_th, "profit low threshold:", subcat_profit_low_th)

# Bare expressions inside an if-block are not rendered by Jupyter, so call display() explicitly.
from IPython.display import display

if not prod_agg_flagged.empty:
    print("Flagged products (strong):")
    display(prod_agg_flagged[prod_agg_flagged["flagged_strong"]].sort_values("total_sales", ascending=False).head(20))

if not subcat_agg_flagged.empty:
    print("Flagged sub-categories (strong):")
    display(subcat_agg_flagged[subcat_agg_flagged["flagged_strong"]].sort_values("total_sales", ascending=False))

In [None]:
# Scatter plot of total sales vs. total profit; flagged items are drawn in red and the largest are labeled.

def plot_scatter(df_agg_flagged, label_col, title):
    plt.figure(figsize=(10,6))
    ax = sns.scatterplot(data=df_agg_flagged, x="total_sales", y="total_profit",
                         hue="flagged_strong", palette={True: "red", False: "gray"}, alpha=0.8, s=80)
    plt.axhline(0, color="black", linewidth=0.7, linestyle="--")
    plt.xscale("symlog")
    plt.xlabel("Total Sales")
    plt.ylabel("Total Profit")
    plt.title(title)
    plt.legend(title="Flagged (strong)")
    if "flagged_strong" in df_agg_flagged.columns:
        flagged = df_agg_flagged[df_agg_flagged["flagged_strong"]]
        for _, row in flagged.nlargest(8, "total_sales").iterrows():
            plt.annotate(str(row[label_col]), (row["total_sales"], row["total_profit"]),
                         textcoords="offset points", xytext=(5,5), ha="left")
    plt.tight_layout()
    plt.show()

if not subcat_agg_flagged.empty:
    plot_scatter(subcat_agg_flagged, subcat_col, "Sub-Category: Total Sales vs Total Profit")

if not prod_agg_flagged.empty:
    plot_scatter(prod_agg_flagged, product_col, "Product: Total Sales vs Total Profit")

In [None]:
output_dir = "notebooks/outputs"
os.makedirs(output_dir, exist_ok=True)

if not prod_agg_flagged.empty:
    prod_agg_flagged.to_csv(os.path.join(output_dir, "product_agg_flagged.csv"), index=False)
    print("Saved:", os.path.join(output_dir, "product_agg_flagged.csv"))

if not subcat_agg_flagged.empty:
    subcat_agg_flagged.to_csv(os.path.join(output_dir, "subcat_agg_flagged.csv"), index=False)
    print("Saved:", os.path.join(output_dir, "subcat_agg_flagged.csv"))

Actionable next steps for flagged items
- Investigate reasons for low/negative profit despite high sales:
  - Check discounts applied, promotions, returns, or refunds.
  - Review cost of goods sold (COGS) or supplier costs for those products.
  - Re-evaluate pricing, bundling, or minimum advertised price.
  - Consider targeted margin improvement (reduce discounts, renegotiate purchase price).
- For sub-categories, consider category-level assortment optimization.
- Track these flagged items over time (trend): are they recent or persistent problems?
- If many flagged products share suppliers or attributes, investigate supplier-level issues.
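
One possible way to check whether a flagged item is a recent or a persistent problem is to sum its profit by month. The sketch below is only an illustration: the helper name `monthly_profit_trend` is hypothetical, and the column names `Order Date`, `Product Name`, and `Profit` are assumptions that may not match your data.

```python
import pandas as pd

def monthly_profit_trend(df, flagged_items, item_col="Product Name",
                         date_col="Order Date", profit_col="Profit"):
    """Return a month-by-item table of summed profit for the given items.

    Hypothetical helper; column names are assumptions, adjust as needed.
    """
    sub = df[df[item_col].isin(flagged_items)].copy()
    sub[date_col] = pd.to_datetime(sub[date_col], errors="coerce")
    sub = sub.dropna(subset=[date_col])           # drop rows with unparseable dates
    sub["month"] = sub[date_col].dt.to_period("M")
    return sub.pivot_table(index="month", columns=item_col, values=profit_col,
                           aggfunc="sum", fill_value=0)

# Invented rows for illustration only.
orders = pd.DataFrame({
    "Order Date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-10", "2024-03-01"],
    "Product Name": ["Desk", "Desk", "Desk", "Lamp", "Desk"],
    "Profit": [-10.0, -5.0, 4.0, 2.0, -8.0],
})

trend = monthly_profit_trend(orders, ["Desk"])
print(trend)
```

With real data, `flagged_items` could be the product names where `flagged_strong` is true in `prod_agg_flagged`. Several consecutive negative months suggest a persistent problem rather than a one-off.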

Customizations and enhancements
- Use a rolling timeframe (last 12 months) rather than all-time totals: filter dataset by 'Order Date' if available.
- Add grouping: Category, Region, Customer Segment.
- Compute profit margin = total_profit / total_sales to identify low-margin high-sales items.
- Add automated alerts (e.g., when a product is flagged several consecutive months).
- Integrate with BI tools or dashboards for ongoing monitoring.
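
The first and third ideas above (a rolling timeframe and a profit margin column) could be combined roughly as follows. This is a sketch under stated assumptions, not part of the notebook: the function name `recent_margin_table` is hypothetical, the column names `Order Date`, `Sales`, and `Profit` are assumed, and the window is measured back from the latest date in the data rather than from today.

```python
import pandas as pd

def recent_margin_table(df, group_col, months=12, date_col="Order Date",
                        sales_col="Sales", profit_col="Profit"):
    """Aggregate the trailing `months` of rows and add a profit_margin column.

    Hypothetical helper; column names are assumptions, adjust as needed.
    """
    data = df.copy()
    data[date_col] = pd.to_datetime(data[date_col], errors="coerce")
    data = data.dropna(subset=[date_col])
    # Window is anchored on the most recent date present in the data.
    cutoff = data[date_col].max() - pd.DateOffset(months=months)
    recent = data[data[date_col] > cutoff]
    agg = (
        recent.groupby(group_col)
        .agg(total_sales=(sales_col, "sum"), total_profit=(profit_col, "sum"))
        .reset_index()
    )
    # Margin is undefined when total sales are zero, so those rows become NaN.
    nonzero_sales = agg["total_sales"].where(agg["total_sales"] != 0)
    agg["profit_margin"] = agg["total_profit"] / nonzero_sales
    return agg.sort_values("profit_margin").reset_index(drop=True)

# Invented rows for illustration only; the 2022 row falls outside the 12-month window.
orders = pd.DataFrame({
    "Order Date": ["2022-01-15", "2024-03-01", "2024-06-01", "2024-06-15"],
    "Sub-Category": ["Tables", "Tables", "Chairs", "Tables"],
    "Sales": [999.0, 100.0, 200.0, 300.0],
    "Profit": [500.0, -20.0, 50.0, -20.0],
})

table = recent_margin_table(orders, "Sub-Category")
print(table)
```

Sorting by margin puts the weakest groups first. The resulting table has the same `total_sales` and `total_profit` columns as the notebook's aggregates, so it could in principle be passed to `flag_agg` as well.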

Run instructions
- Place your sales CSV in the repository's `data/` directory (the notebook auto-detects the first CSV it finds there) and run all cells.