# Initial Exploration of Item Catalog

This notebook explores the item catalog, which contains information about the menu items offered at Belly Rubb.

In [None]:
import pandas as pd
import numpy as np

# Data visualization
import missingno as msno
import matplotlib.pyplot as plt

from belly_rubb.config import RAW_DATA_DIR

In [None]:
catalog_df = pd.read_csv(RAW_DATA_DIR / 'MLW4W4RYAASNM_catalog-2025-08-26-2046.csv')
catalog_df.head()

# Data Overview

In [None]:
catalog_df.info()

In [None]:
catalog_df.shape

At a cursory glance:
* The dataset contains **75** features and **66** rows.
* `SKU`, `GTIN`, `Social Media Link Title`, `Social Media Link Description`, `Online Sale Price`, `Sellable`, `Stockable`, and `New Quantity ...` are fully null.

# Data Quality Checks

- [x] Investigate missing values
- [x] Identify static columns
- [x] Check for duplicates
- [x] Validate data types

## Missing Values

In [None]:
# Get columns with no information

null_cols = catalog_df.columns[catalog_df.isna().all()]
null_cols

In [None]:
np.round(catalog_df.isna().sum() / len(catalog_df) * 100, 2)

In [None]:
# Plot missingness heatmap

msno.heatmap(catalog_df, figsize=(12, 6))

plt.show()

* Relevant columns with high missingness correlation:
    * `Reporting Category` and `Categories`
    * Delivery-related columns such as `Shipping Enabled` and SEO related columns such as `SEO Title`
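
As a rough illustration of what "missingness correlation" means here, the sketch below correlates the boolean null indicators of each column. It uses a small made-up frame (the values are hypothetical, only the column names come from the catalog) and is assumed to approximate what the heatmap visualizes rather than reproduce it exactly.

```python
import pandas as pd

# Toy frame with hypothetical values: the first two columns are missing
# together, the third is missing independently of them.
toy = pd.DataFrame({
    "Reporting Category": ["RIBS", None, "SIDES", None],
    "Categories": ["RIBS", None, "SIDES", None],
    "Description": ["a", "b", None, None],
})

# Turn each column into a 0/1 "is missing" indicator, then correlate them.
nullity_corr = toy.isna().astype(int).corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing on the same rows; a value near 0 means their missingness is unrelated.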

### `Description`

In [None]:
# Filter rows with missing Description

catalog_df[catalog_df['Description'].isna()]

In [None]:
# Investigate BABY BACK PORK RIBS

catalog_df[catalog_df['Item Name'] == 'BABY BACK PORK RIBS']

In [None]:
# Investigate ARTISAN MAC AND CHEESE

catalog_df[catalog_df['Item Name'] == 'ARTISAN MAC AND CHEESE']

In [None]:
# Calculate percentage of missing values in Description column

print(f"Percentage of missing values in 'Description': {np.round(catalog_df['Description'].isna().sum() / len(catalog_df) * 100, 2)}%")

In [None]:
# Compare the number of descriptions and variations of item

catalog_df.groupby(by='Item Name').agg({
    'Description': 'count',
    'Token': 'count'
}).sort_values(by='Token', ascending=False)

* The column `Description` is missing **43.94%** of its values.
* The feature stores a description of the item.
    * Each item has *at most* **one description** entry. This implies that the description is shared between variations of the item.
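
If the description really is shared between variations, one possible way to handle the gaps is to broadcast the single non-null description to every row of the same item. The sketch below shows this on a made-up frame; the item names, description text, and the `Description Filled` column are all hypothetical.

```python
import pandas as pd

# Toy frame with hypothetical values: ITEM A has one described variation,
# ITEM B has no description at all.
toy = pd.DataFrame({
    "Item Name": ["ITEM A", "ITEM A", "ITEM B"],
    "Variation Name": ["Full Rack", "Half Rack", "Regular"],
    "Description": ["Example description", None, None],
})

# "first" returns the first non-null value per group; transform broadcasts
# it back to every row of that group.
toy["Description Filled"] = (
    toy.groupby("Item Name")["Description"].transform("first")
)
print(toy)
```

Items without any description stay missing, so this only fills gaps that are recoverable from a sibling variation.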

### `Categories` and `Reporting Category`

In [None]:
# Investigate rows with missing Categories

catalog_df[catalog_df['Categories'].isna()]

In [None]:
# List categories

catalog_df['Categories'].unique()

In [None]:
# Check to see if there is another entry for Pulled Beef Sliders

catalog_df[catalog_df['Item Name'].str.contains('Pulled Beef Sliders')]

Categories are missing for:
* Family Size `Mac'n'Cheese` orders
* `Pear+Gorgonzola Salad (Full Pan)`
* `Pulled Beef Sliders (12pcs)`
* `Utensil Kit`

Based on the available categories, there are no appropriate ones to fit these into.


### `Pickup Enabled`

In [None]:
# Count values in Pickup Enabled

catalog_df['Pickup Enabled'].value_counts(dropna=False)

In [None]:
# View single item with pickup disabled

catalog_df[catalog_df['Pickup Enabled'] == 'N']

In [None]:
# View items with missing pickup enabled entry

catalog_df[catalog_df['Pickup Enabled'].isna()]

In [None]:
# Convert Pickup Enabled to numeric (both 'N' and missing values map to 0)

catalog_df['Pickup Enabled Numeric'] = np.where(catalog_df['Pickup Enabled'] == 'Y', 1, 0)
catalog_df.head()

In [None]:
# Compare number of items with and without pickup enabled

catalog_df.groupby(by='Item Name').agg({
    'Pickup Enabled Numeric': 'sum',
    'Token': 'count'
}).sort_values(by='Token', ascending=False)

Similar to `Description`, `Pickup Enabled` is present for *at most* **one entry** per `Item Name`.

## Static Columns

In [None]:
static_cols = catalog_df.columns[catalog_df.nunique() == 1]
print(static_cols)

* **Static Columns**:
    * `Self-serve Ordering Enabled`
    * `Archived`
    * `Contains Alcohol`
    * `Stock Alert Enabled BELLY RUBB - BBQ Ribs To Go & Catering`
    * `Stock Alert Count BELLY RUBB - BBQ Ribs To Go & Catering`
    * `Modifier Set - Are you OK with spicy food?`
    * `Modifier Set - Choose the glaze for rack 4`
    * `Modifier Set - Extra Toppings`
    * `Modifier Set - Glazed Belly Bites`
    * `Modifier Set - Half Rack Glaze`
    * `Modifier Set - How about Pork Rib Bites (off the bone rib meat)?`
    * `Modifier Set - How about some cheese on top?`
    * `Modifier Set - Spicy sauce?`

## Duplicates

In [None]:
duplicates = catalog_df.duplicated()
print(f"Duplicate rows in catalog_df:\n{catalog_df[duplicates]}")

There are **no** duplicate rows.

## Data Types

In [None]:
catalog_df.dtypes

In [None]:
modifier_cols = catalog_df.columns[catalog_df.columns.str.contains("Modifier")]
print(modifier_cols)

In [None]:
catalog_df[modifier_cols].head()

* `Modifier Set` columns can be converted to binary since they only contain `Y` for Yes and `N` for No.
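
A minimal sketch of that conversion, assuming the columns hold only `Y`, `N`, or missing values. The frame below is made up for illustration; only the two column names are taken from the catalog.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "Item Name": ["ITEM A", "ITEM B"],
    "Modifier Set - Extra Toppings": ["Y", "N"],
    "Modifier Set - Spicy sauce?": ["N", "N"],
})

modifier_cols = toy.columns[toy.columns.str.contains("Modifier Set")]

# Map Y/N to 1/0. Any other value (including missing) becomes <NA>,
# which the nullable Int64 dtype can hold.
binary = (
    toy[modifier_cols]
    .apply(lambda col: col.map({"Y": 1, "N": 0}))
    .astype("Int64")
)
print(binary)
```

Using a nullable integer type keeps missing entries distinguishable from an explicit `N`.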

# Category Insights

- [x] Calculate number of items in each category.
- [x] List items in categories.
- [x] Rank categories by aggregated `Price`.
- [x] Item category vs Variation type.

In [None]:
catalog_df['Reporting Category']

In [None]:
# Create category column

catalog_df['category'] = catalog_df['Reporting Category'].str.split(r" \(", regex=True, expand=True)[0]
catalog_df.head()

In [None]:
# List categories

catalog_df['category'].unique()

In [None]:
# Inspect items with missing category

catalog_df[catalog_df['category'].isna()]

## Number of items in categories

In [None]:
category_counts = catalog_df.groupby(by='category')['Item Name'].nunique().reset_index(name='count')
category_counts = category_counts.sort_values(by='count', ascending=False)
category_counts

**Top 3 Categories by Number of Items**
1. `DIPS`: **14**
2. `COMBOS`: **8**
3. `RIBS`: **7**

**Lowest 3 Categories by Number of Items**
1. `PARTY PACKAGE`: **3**
2. `SANDWICHES`: **3**
3. `DESSERTS`: **2**

## Items in Categories

In [None]:
# List all items by category

catalog_df.groupby(by='category')['Item Name'].unique().explode()

In [None]:
catalog_df[catalog_df['Item Name'] == 'BABY BACK PORK RIBS']

In [None]:
catalog_df[catalog_df['Item Name'] == 'BEEF SHORT RIB']

In [None]:
catalog_df[catalog_df['Item Name'] == 'BEEF BACK RIBS (Full Rack)']

* `BABY BACK PORK RIBS` constitutes one entry and has two variations: `Full Rack` and `Half Rack`.
* Beef ribs, including `BEEF BACK RIBS` and `BEEF SHORT RIB`, list the variation in the `Item Name`.
    * The `Variation Name` for these entries is `Regular`.
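
To compare these items on equal footing, the variation embedded in `Item Name` could be pulled out with a regular expression. The sketch below is one possible approach on a made-up frame; the `variation` and `base_item` columns are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy frame: two items carry the variation in the name, one does not.
toy = pd.DataFrame({
    "Item Name": [
        "BEEF BACK RIBS (Full Rack)",
        "BABY BACK PORK RIBS",
        "DINO RIBS (FULL RACK)",
    ],
    "Variation Name": ["Regular", "Full Rack", "Regular"],
})

# Capture a trailing parenthetical such as "(Full Rack)"; NaN when absent.
embedded = toy["Item Name"].str.extract(r"\(([^)]*)\)\s*$", expand=False)

# Prefer the embedded variation, otherwise fall back to Variation Name.
toy["variation"] = embedded.fillna(toy["Variation Name"])

# Strip the parenthetical to get a comparable base item name.
toy["base_item"] = toy["Item Name"].str.replace(
    r"\s*\([^)]*\)\s*$", "", regex=True
)
print(toy)
```

The extracted text keeps its original casing, so a further normalization step (for example `.str.title()`) would be needed before grouping on it.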

## Categories by Aggregated Price

In [None]:
cat_revenue = catalog_df.groupby(by='category').agg(
    total_price=('Price', 'sum'),
    average_price=('Price', 'mean'),
    median_price=('Price', 'median'),
    min_price=('Price', 'min'),
    max_price=('Price', 'max')
).reset_index().sort_values(by='average_price', ascending=False)
cat_revenue['range'] = cat_revenue['max_price'] - cat_revenue['min_price']
cat_revenue

* The *range* of the `PARTY PACKAGE` category is the highest at **$187**.
    * Second highest *range* is the `RIBS` category at **$91.01**.
* *Average* and *median* prices within all categories other than `DESSERTS` each category are 

## Variations by Category

In [None]:
# Validate Item variations per category

catalog_df.groupby(by='category')['Variation Name'].unique().explode()

* Every category includes at least one entry whose `Variation Name` is `Regular`.
* `RIBS` category is the only one with variations `Half rack` and `Full rack`.
* `SIDES` category is the only one with variations `Side` and `Full`.
* Variations specifying the number of pieces are limited to `BITES` category.

# Data Integrity

- [x] Item prices for different variations.
- [x] Prices not negative
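
A minimal sketch of the non-negative price check from the list above, run on a made-up frame. It assumes `Price` may contain non-numeric text, which is an assumption about the export rather than something confirmed here.

```python
import pandas as pd

# Toy frame with hypothetical values, including a text entry and a negative.
toy = pd.DataFrame({
    "Item Name": ["ITEM A", "ITEM B", "ITEM C"],
    "Price": ["12.50", "variable", "-1"],
})

# Coerce to numeric so non-numeric entries become NaN instead of raising.
price = pd.to_numeric(toy["Price"], errors="coerce")

# Rows that violate the check, and rows that could not be parsed at all.
negative_rows = toy[price < 0]
unparsed_rows = toy[price.isna() & toy["Price"].notna()]

print(f"Negative prices: {len(negative_rows)}")
print(f"Unparseable prices: {len(unparsed_rows)}")
```

Reporting unparseable rows separately avoids silently treating them as valid, since a comparison against NaN is always `False`.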

## Variation Prices

### Ribs

In [None]:
# Filter for Ribs category

ribs_df = catalog_df[catalog_df['category'] == 'RIBS']
ribs_df

In [None]:
ribs_df['Variation Name'].value_counts()

In [None]:
# Compare full rack price of pork ribs to half rack price

full_rack_price = ribs_df[ribs_df['Variation Name'] == 'Full Rack']['Price'].values
half_rack_price = ribs_df[ribs_df['Variation Name'] == 'Half rack']['Price'].values

print(f"Full rack price greater than half rack price: {(full_rack_price > half_rack_price)[0]}")

In [None]:
# Compare full rack price of beef short ribs to regular rack price

short_rib = ribs_df[ribs_df['Item Name'] == 'BEEF SHORT RIB']['Price'].values
full_short_rib = ribs_df[ribs_df['Item Name'] == 'BEEF SHORT RIBS (FULL RACK)']['Price'].values

print(f"Full rack of Beef Short Ribs greater than regular: {(full_short_rib > short_rib)[0]}")

In [None]:
# Compare dino dinner price to full dino ribs price

dino_dinner = ribs_df[ribs_df['Item Name'] == 'DINO DINNER']['Price'].values
dino_ribs = ribs_df[ribs_df['Item Name'] == 'DINO RIBS (FULL RACK)']['Price'].values

print(f"Full rack of dino ribs greater than dino dinner: {(dino_ribs > dino_dinner)[0]}")

Full rack variations are appropriately priced **greater** than half rack variations.

### Sides

In [None]:
# Sides dataframe

sides_df = catalog_df[catalog_df['category'] == 'SIDES']
sides_df.head()

In [None]:
# Compare side and full prices for mac and cheese

mac_and_cheese = sides_df[sides_df['Item Name'] == 'ARTISAN MAC AND CHEESE']

side_price = mac_and_cheese[mac_and_cheese['Variation Name'] == 'Side']['Price'].values
full_price = mac_and_cheese[mac_and_cheese['Variation Name'] == 'Full']['Price'].values

print(f"Full mac and cheese price greater than side: {(full_price > side_price)[0]}")

In [None]:
# Compare side and full prices for crispy fries

fries = sides_df[sides_df['Item Name'] == 'CRISPY  FRIES']

side_price = fries[fries['Variation Name'] == 'Side']['Price'].values
full_price = fries[fries['Variation Name'] == 'Full']['Price'].values

print(f"Full crispy fries price greater than side: {(full_price > side_price)[0]}")

In [None]:
# Verify only one variation for baked baby potatoes

catalog_df[catalog_df['Item Name'] == 'BAKED BABY POTATOES']

* *Full* variations for sides are appropriately priced **greater than** *side* variations.
* Within `SIDES`, only `BAKED BABY POTATOES` has the `Regular` variation.
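
The per-item comparisons above could also be run for all sides at once by pivoting variations into columns. This is a sketch on a made-up frame with hypothetical names and prices; it assumes each item has at most one row per variation.

```python
import pandas as pd

# Toy frame with hypothetical values: two items with Side/Full, one Regular.
toy = pd.DataFrame({
    "Item Name": ["ITEM A", "ITEM A", "ITEM B", "ITEM B", "ITEM C"],
    "Variation Name": ["Side", "Full", "Side", "Full", "Regular"],
    "Price": [5.0, 12.0, 4.0, 10.0, 6.0],
})

# One row per item, one column per variation.
wide = toy.pivot(index="Item Name", columns="Variation Name", values="Price")

# Keep only items that offer both sizes, then compare them.
both = wide.dropna(subset=["Side", "Full"])
full_gt_side = both["Full"] > both["Side"]
print(full_gt_side)
```

Dropping items that lack either size first prevents a missing price from showing up as a failed comparison.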

### Bites

In [None]:
bites_df = catalog_df[catalog_df['category'] == 'BITES']
bites_df

In [None]:
# Check prices for fried chicken wings

wings = bites_df[bites_df['Item Name'] == 'FRIED CHICKEN WINGS']

six_price = wings[wings['Variation Name'] == '6 pcs']['Price'].values
eight_price = wings[wings['Variation Name'] == '8 pcs']['Price'].values
twelve_price = wings[wings['Variation Name'] == '12 pcs']['Price'].values

print(f"Chicken Wings variations priced accordingly: {((twelve_price > eight_price) & (eight_price > six_price))[0]}")

In [None]:
# Check prices for mozzarella sticks

sticks = bites_df[bites_df['Item Name'] == 'MOZZARELLA STICKS']

three_price = sticks[sticks['Variation Name'] == '3 pcs']['Price'].values
six_price = sticks[sticks['Variation Name'] == '6 pcs']['Price'].values
nine_price = sticks[sticks['Variation Name'] == '9 pcs']['Price'].values

print(f"Mozzarella Sticks variations priced accordingly: {((nine_price > six_price) & (six_price > three_price))[0]}")

* Wings and Mozzarella Sticks variations are **priced appropriately**.
* `LOADED FRIES`, `GRILLED SWEET CORN`, and `PORK BELLY BITES` have only the `Regular` variation.
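
The piece-count checks above could be generalized so that every item with `N pcs` variations is tested in one pass. The sketch below uses a made-up frame with hypothetical names and prices, and the `pieces` column is introduced only for illustration; one item is deliberately mispriced to show a failing result.

```python
import pandas as pd

# Toy frame with hypothetical values. STICKS is mispriced on purpose.
toy = pd.DataFrame({
    "Item Name": ["WINGS", "WINGS", "WINGS", "STICKS", "STICKS"],
    "Variation Name": ["12 pcs", "6 pcs", "8 pcs", "3 pcs", "6 pcs"],
    "Price": [20.0, 11.0, 14.0, 9.0, 8.0],
})

# Parse the leading piece count from names such as "6 pcs".
toy["pieces"] = (
    toy["Variation Name"]
    .str.extract(r"^(\d+)\s*pcs", expand=False)
    .astype(float)
)

# Order each item's variations by size, then require strictly rising prices.
sized = toy.dropna(subset=["pieces"]).sort_values(["Item Name", "pieces"])
priced_ok = sized.groupby("Item Name")["Price"].apply(
    lambda s: s.is_monotonic_increasing and s.is_unique
)
print(priced_ok)
```

Items whose variations carry no piece count (for example `Regular`) are dropped before the comparison, so they neither pass nor fail.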