# Orders Cleaning

This notebook performs basic cleaning of the `Orders` dataset to prepare it for dashboarding and storytelling.

Steps include:
* Removing columns that are empty or contain a single value.
* Removing columns with no valuable information.
* Standardizing item names.

In [None]:
import re
import pandas as pd
import numpy as np

from belly_rubb.config import RAW_DATA_DIR, INTERIM_DATA_DIR
from belly_rubb.utils import load_config_file

In [None]:
# Load orders

orders_df = pd.read_csv(INTERIM_DATA_DIR / 'orders.csv')
orders_df.head()

In [None]:
orders_df.shape

# Standardizing Item Names

In [None]:
# Load menu data

catalog_df = pd.read_csv(RAW_DATA_DIR / 'MLW4W4RYAASNM_catalog-2025-08-26-2046.csv')
catalog_df.head()

## Normalize `Item Name` in orders

- Lowercase
- Trim

In [None]:
def normalize(item: str) -> str:
    """
    Normalize item names.

    Change string to lowercase and trim whitespace surrounding the text.

    Params:
        item (str): Name of item.

    Returns:
        trimmed (str): Normalized text.
    """
    lowercase = item.lower() # Lowercase name
    trimmed = lowercase.strip() # Strip surrounding whitespace

    return trimmed

In [None]:
# Normalize Item Name

normalized_df = orders_df.dropna(subset='Item Name').copy() # Copy so the assignment below does not write to a view of orders_df
normalized_df['Item Name'] = normalized_df['Item Name'].apply(normalize)
normalized_df['Item Name'].value_counts()
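The lowercase-and-trim normalization can also be written with pandas' vectorized string methods instead of a row-wise `apply`. The sketch below is only an illustration: the sample values and the names `sample` and `normalized_sample` are hypothetical and are not taken from `orders.csv`.

```python
import pandas as pd

# Hypothetical sample; the real `Item Name` values come from orders.csv.
sample = pd.DataFrame({"Item Name": ["  BELLY SLIDERS ", "Mozzarella Sticks", None]})

# .copy() makes the filtered frame independent of `sample`, so the
# assignment below does not trigger pandas' SettingWithCopyWarning.
normalized_sample = sample.dropna(subset=["Item Name"]).copy()

# Vectorized equivalent of normalize(): lowercase, then strip whitespace.
normalized_sample["Item Name"] = (
    normalized_sample["Item Name"].str.lower().str.strip()
)

print(normalized_sample["Item Name"].tolist())
```

Both approaches should give the same result on string data; the vectorized form avoids a Python-level function call per row.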

## Standardization

### Load and generate mapping data

In [None]:
# Load data dictionary

item_dict = load_config_file(file_path='../references/item_synonyms.json')
print(item_dict is not None) # Confirm successfully loaded

In [None]:
# Generate mapping dictionary for item name and variation

alias_to_base = {}
var_map = {}

# Loop through menu item dicts
for base_name, cfg in item_dict.items():
    # Loop through item name aliases
    for alias in cfg['aliases']:
        alias_to_base[alias.lower()] = base_name # {lowercase_alias: item_name}

    # Loop through variations and their corresponding aliases
    for var_base_name, var_alias_list in cfg['variations'].items():
        # Loop through variation aliases
        for var in var_alias_list:
            var_map[var.lower()] = var_base_name # {lowercase_variation_alias: variation_base_name}

alias_to_base

In [None]:
var_map
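To make the shape of the two lookup tables concrete, here is a small sketch that runs the same loop over a made-up dictionary. The structure (an `aliases` list and a `variations` dict of alias lists per item) is inferred from the loop above; the entries themselves and the `_example` names are hypothetical, since the real contents of `item_synonyms.json` are not shown in this notebook.

```python
# Hypothetical stand-in for item_synonyms.json (illustrative entries only).
item_dict_example = {
    "mozzarella sticks": {
        "aliases": ["Mozzarella Sticks", "MOZZARELLA WEDGES"],
        "variations": {"6 pieces": ["6 pc", "6 Pieces"]},
    },
}

alias_to_base_example = {}
var_map_example = {}

for base_name, cfg in item_dict_example.items():
    # Every alias points back to the item's base name.
    for alias in cfg["aliases"]:
        alias_to_base_example[alias.lower()] = base_name
    # Every variation alias points back to the variation's base name.
    for var_base_name, var_aliases in cfg["variations"].items():
        for var_alias in var_aliases:
            var_map_example[var_alias.lower()] = var_base_name

print(alias_to_base_example)
print(var_map_example)
```

Because the variation map is a single flat dictionary shared by all items, a variation alias that appears under two different items keeps only the last base name assigned to it.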

### Standardizing with `.map()`

In [None]:
# Map item name dictionary to Item Name

normalized_df['std_names'] = normalized_df['Item Name'].map(alias_to_base)

# View transformations
normalized_df[['Item Name', 'std_names']].value_counts()

In [None]:
# Check unique standardized names

normalized_df['std_names'].unique()

In [None]:
# Investigate Item Names of items with 'nan' standardized name

normalized_df[normalized_df['std_names'].isna()]['Item Name'].unique()

In [None]:
# Investigate rows with null standardized name

normalized_df[normalized_df['std_names'].isna()]

Simply mapping aliases to base names does not capture items where the variation is included in `Item Name`. A more manual approach is required.
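The limitation comes from `.map()` doing exact key lookups. A minimal sketch, using a hypothetical alias table and hypothetical item names:

```python
import pandas as pd

# Hypothetical alias table and item names, for illustration only.
alias_example = {"baby back ribs": "baby back ribs"}
names = pd.Series(["baby back ribs", "baby back ribs (full rack)"])

# .map() looks up each value as an exact dictionary key, so a name that
# carries its variation in parentheses has no match and becomes NaN.
mapped = names.map(alias_example)

print(mapped.isna().tolist())
```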

### Standardizing with `.apply()`

In [None]:
from typing import Optional

def paren_text(text: str, start_str: str, end_str: str) -> Optional[str]:
    """
    Extract the text between two strings, excluding the strings themselves.

    Uses the indices of start_str and end_str to slice out the text between them.

    Params:
        text (str): Text to search within.
        start_str (str): Starting string.
        end_str (str): Ending string.

    Returns:
        Optional[str]: Text between start_str and end_str, or None if either is not found.
    """
    # Find index of start_str
    start_idx = text.find(start_str)
    if start_idx == -1: # Check if found before applying any offset
        return None
    start_idx += len(start_str) # Move past start_str

    # Find index of end_str, searching after start_str
    end_idx = text.find(end_str, start_idx)
    if end_idx == -1: # Check if found
        return None

    return text[start_idx:end_idx]
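One detail matters when slicing by index: `str.find()` returns `-1` when the substring is absent, so the not-found check has to run on the raw result before any offset is added. As an alternative sketch, a single regular expression can do the extraction; `between_parens` is a hypothetical helper name and the sample strings are made up.

```python
import re
from typing import Optional


def between_parens(text: str) -> Optional[str]:
    """Return the text inside the first pair of parentheses, or None."""
    match = re.search(r"\((.*?)\)", text)
    return match.group(1) if match else None


# str.find() signals "not found" with -1; adding 1 first would turn
# that into index 0 and hide the miss.
missing_idx = "belly sliders".find("(")

print(missing_idx)
print(between_parens("ribs (full rack)"))
print(between_parens("belly sliders"))
```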

In [None]:
def standardize_name_vars(row: pd.Series) -> pd.Series:
    """
    Standardize the item name and variation of a single order row.

    Adds std_name and std_variation. When the item name is not a known alias,
    a variation embedded in parentheses (if any) is split off before the lookup.
    Values without a mapping fall back to their lowercased originals.
    """
    item_name = row['Item Name'].lower()
    item_variation = row['Item Variation'].lower()

    # Check if item name is in aliases
    if item_name in alias_to_base:
        row['std_name'] = alias_to_base[item_name] # Save standardized name to std_name
        row['std_variation'] = var_map.get(item_variation, item_variation) # Save standardized variation to std_variation
    else:
        # Check if a variation is included in the item name in parentheses
        var = paren_text(text=item_name, start_str='(', end_str=')')

        # If variation found between parentheses
        if var:
            # Remove the parenthesized variation from the item name
            item_name = re.sub(r"\((.*?)\)", "", string=item_name).strip()

            # Save standardized values, falling back to the unmapped values
            row['std_name'] = alias_to_base.get(item_name, item_name)
            row['std_variation'] = var_map.get(var, item_variation)
        else:
            # Item name doesn't include variation name in parentheses
            row['std_name'] = alias_to_base.get(item_name, item_name)
            row['std_variation'] = var_map.get(item_variation, item_variation) # Look up the variation itself (var is None here)

    return row

In [None]:
# Drop null Item name values and apply standardize_name_vars

no_null_df = orders_df.dropna(subset='Item Name')
standardized_df = no_null_df.apply(standardize_name_vars, axis=1)

standardized_df.head()

In [None]:
# Check rows with na standardized variation

null_std_var = standardized_df[standardized_df['std_variation'] == 'na']
null_std_var[['Item Name', 'Item Variation']].value_counts()

In [None]:
null_std_var['Item Name'].unique()

There appear to be `Item Name` and `Item Variation` entries that were not included in the data dictionary. These are explored below.

#### Item Reviews

##### Belly Sliders

In [None]:
belly_slider_orders = standardized_df[standardized_df['Item Name'] == 'BELLY SLIDERS']
belly_slider_orders.describe()

In [None]:
belly_slider_orders['std_name'].unique()

In [None]:
belly_slider_orders['Item Variation'].unique()

In [None]:
belly_slider_orders['Order Date'].agg(['max', 'min'])

###### Conclusion

* `BELLY SLIDERS` orders were missing from the data dictionary.
* They do not exist in the current menu.

**Resolution:** Added to data dictionary.

##### Slow-Cooked Asparagus

In [None]:
asparagus_orders = standardized_df[standardized_df['Item Name'] == 'SLOW-COOKED ASPARAGUS']
asparagus_orders.describe()

In [None]:
asparagus_orders['std_name'].value_counts()

In [None]:
asparagus_orders['Order Date'].agg(['min', 'max'])

In [None]:
asparagus_orders['Item Variation'].unique()

In [None]:
asparagus_orders['std_variation'].unique()

###### Conclusion

* `SLOW-COOKED ASPARAGUS` is not part of the data dictionary.
* It is not in the current menu version.

**Resolution:** Added to data dictionary.

##### Marsh'n'Cookie

In [None]:
cookie_df = standardized_df[standardized_df['Item Name'] == 'MARSH’n’COOKIE']
cookie_df.describe()

In [None]:
cookie_df['std_name'].value_counts()

In [None]:
cookie_df['Item Name'].unique()

In [None]:
cookie_df['Item Variation'].unique()

In [None]:
cookie_df[['std_name', 'std_variation']].value_counts()

###### Conclusions

* Entry in data dictionary was misspelled.

**Resolution:** Spelling corrected.

##### Mozzarella Wedges

In [None]:
mozz_wedge = standardized_df[standardized_df['Item Name'] == 'MOZZARELLA WEDGES']
mozz_wedge.describe()

In [None]:
mozz_wedge['Order Date'].agg(['min', 'max'])

In [None]:
mozz_wedge['Item Variation'].value_counts()

###### Conclusions
* `MOZZARELLA WEDGES` were not in the data dictionary.
* Since there are only **8** orders and their `Item Variation` values match the variations for orders of *Mozzarella sticks*, this item will be added as an alias for **mozzarella sticks**.

**Resolution:** Added to list of *aliases* for **mozzarella sticks**.

##### Pork Noodle Soup

In [None]:
soup_df = standardized_df[standardized_df['Item Name'] == 'PORK NOODLE SOUP']
soup_df.describe()

In [None]:
soup_df['Order Date'].agg(['min', 'max'])

###### Conclusions

* There is only **1** order of pork noodle soup.
* Since it is also not in the current menu, it will be dropped.

**Next steps:** Add a cleaning step to drop items that are not in the data dictionary and have only 1 order.

##### Chop Chop

In [None]:
chop_df = standardized_df[standardized_df['Item Name'] == 'CHOP-CHOP']
chop_df.describe()

In [None]:
chop_df

###### Conclusions
* Similar to *Pork Noodle Soup*, there is only **1** order of `CHOP-CHOP` and it is not part of the menu catalog.

##### Local Delivery Service

In [None]:
delivery_df = standardized_df[standardized_df['Item Name'] == 'Local Delivery Service']
delivery_df.describe()

In [None]:
delivery_df

###### Conclusion
* Similar to *Pork Noodle Soup* and *Chop-Chop*.
* It will be dropped.

##### Party Package !BEEF RIBS ONLY! (4-6 ppl)

In [None]:
prty_df = standardized_df[standardized_df['Item Name'] == 'Party Package !BEEF RIBS ONLY! (4-6 ppl)']
prty_df.describe()

In [None]:
prty_df[['std_name', 'std_variation']]

###### Conclusion

* Only **1** order, therefore it will be dropped.

#### Dropping Single Orders

In [None]:
# Get names of items with only 1 order

item_name_counts = standardized_df['Item Name'].value_counts(ascending=True).reset_index()
single_item_names = item_name_counts[item_name_counts['count'] == 1]['Item Name'].values

single_item_names

In [None]:
# Drop orders with single-order Item Name

single_orders = standardized_df[standardized_df['Item Name'].isin(single_item_names)]
no_single_orders_df = standardized_df.drop(index=single_orders.index)

any(no_single_orders_df['Item Name'].isin(single_item_names)) # Validate resulting dataframe
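The same filter can be sketched as a single boolean mask by attaching each name's row count to its rows. This is only an alternative illustration: the frame `orders_example` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical item names, for illustration only.
orders_example = pd.DataFrame(
    {"Item Name": ["BELLY SLIDERS", "BELLY SLIDERS", "CHOP-CHOP", "PORK NOODLE SOUP"]}
)

# value_counts() gives rows per name; map() aligns that count to every row.
rows_per_name = orders_example["Item Name"].map(
    orders_example["Item Name"].value_counts()
)

# Keep only names that appear on more than one row.
kept = orders_example[rows_per_name > 1]

print(kept["Item Name"].tolist())
```

Note that this counts rows per `Item Name`, which is also what `value_counts()` measures in the cell above.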

In [None]:
# List remaining unstandardized Item Names

no_single_orders_df[no_single_orders_df['std_name'] == 'na']['Item Name'].unique()

In [None]:
no_single_orders_df[no_single_orders_df['std_variation'] == 'na']

### Validation

This section validates the results of the standardization methods applied above.

In [None]:
# View unique combinations of Item name and standardized name

no_single_orders_df[['Item Name', 'std_name']].value_counts()

In [None]:
# View unique combinations of item variation and standardized variation

no_single_orders_df[['Item Variation', 'std_variation']].value_counts()

In [None]:
# Use fuzzy matching to evaluate std_name

from rapidfuzz.fuzz import token_set_ratio

def name_is_fuzzy_match(row: pd.Series) -> pd.Series:
    """
    Check if Item name and std_name are a fuzzy match.

    Uses threshold of 80 to evaluate match. Adds column name_fuzzy_match with True for fuzzy matches
    and False otherwise.

    Params:
        row (Series): Row from data.

    Returns:
        row (Series): Row with added column name_fuzzy_match.
    """
    row['name_fuzzy_match'] = token_set_ratio(row['Item Name'].lower(), row['std_name'].lower()) > 80

    return row

In [None]:
# Use fuzzy matching to evaluate std_variation

def variation_is_fuzzy_match(row: pd.Series) -> pd.Series:
    """
    Check if Item variation and std_variation are a fuzzy match.

    Uses threshold of 80 to evaluate match. Adds column variation_fuzzy_match with True for fuzzy matches
    and False otherwise.

    Params:
        row (Series): Row from data.

    Returns:
        row (Series): Row with added column variation_fuzzy_match.
    """
    row['variation_fuzzy_match'] = token_set_ratio(row['Item Variation'].lower(), row['std_variation'].lower()) > 80

    return row

In [None]:
# Calculate fuzzy match results for std_name

name_fuzzy_df = no_single_orders_df.apply(name_is_fuzzy_match, axis=1)
false_match = name_fuzzy_df[name_fuzzy_df['name_fuzzy_match'] == False]

In [None]:
# View unique combinations where not a fuzzy match

false_match[['Item Name', 'std_name']].value_counts()

Although these instances are not a fuzzy match, they are still **valid**. They correspond to items whose names changed drastically when the menu catalog was updated.
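To illustrate why a drastically different name falls below a similarity cutoff, here is a sketch that uses the standard library's `difflib.SequenceMatcher` as a stand-in. It is not the `token_set_ratio` scorer used in this notebook, and its scores will differ; the helper `similarity`, the constant `THRESHOLD`, and the pairing of the two unrelated names are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in, not token_set_ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


THRESHOLD = 0.8  # mirrors the cutoff of 80 on rapidfuzz's 0-100 scale

# Same name with different casing versus two unrelated names.
same_name = similarity("Belly Sliders", "belly sliders")
unrelated = similarity("chop-chop", "pork noodle soup")

print(same_name > THRESHOLD, unrelated > THRESHOLD)
```

A failed match therefore flags a pair for manual review rather than proving the mapping is wrong.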

In [None]:
# Evaluate standardized variations

variation_fuzzy_df = name_fuzzy_df.apply(variation_is_fuzzy_match, axis=1)
variation_fuzzy_df

In [None]:
# Inspect unique combinations where not a fuzzy match

false_match = variation_fuzzy_df[variation_fuzzy_df['variation_fuzzy_match'] == False]
false_match[['Item Variation', 'std_variation']].value_counts()

In [None]:
# Further inspect items with Regular item variation and full rack standardized

regular_full = false_match[(false_match['Item Variation'] == 'Regular') & (false_match['std_variation'] == 'full rack')]
regular_full['Item Name'].unique()

* Instances where *Regular* turned into *full rack* were orders where the item variation was a *part of* the item name.
* Other failed fuzzy matches are **valid** and signify major updates.

# Data Quality
These cleaning steps will include:
* [x] Dropping static columns.
* [x] Dropping fully empty features.
* [ ] Dropping insignificant columns:
    * `Fulfillment Notes`
    * `Recipient Country`

In [None]:
# Replace Item Name and Item Variation columns with standardized values
no_single_orders_df['Item Name'] = no_single_orders_df['std_name']
no_single_orders_df['Item Variation'] = no_single_orders_df['std_variation']

# Drop std cols
no_single_orders_df = no_single_orders_df.drop(labels=['std_name', 'std_variation'], axis=1)
no_single_orders_df.head()

## Static Columns & Empty Features

In [None]:
# Drop static columns

clean_df = no_single_orders_df.drop(no_single_orders_df.columns[no_single_orders_df.nunique() < 2].values, axis=1)
clean_df.columns

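This step also drops empty features: `nunique()` ignores missing values, so a fully empty column has zero unique values and is removed by the same filter.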
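A small sketch of how the `nunique() < 2` filter treats static and empty columns. The column names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, for illustration only.
example = pd.DataFrame(
    {
        "Item Name": ["a", "b", "a"],
        "Currency": ["USD", "USD", "USD"],  # static: one unique value
        "Empty": [np.nan, np.nan, np.nan],  # fully empty: zero unique values
    }
)

# nunique() excludes NaN by default (dropna=True).
unique_counts = example.nunique()

# Keep only columns with at least two distinct non-null values.
kept_cols = example.loc[:, unique_counts >= 2]

print(unique_counts.to_dict())
print(kept_cols.columns.tolist())
```

One consequence worth keeping in mind: a column with a single distinct value plus some missing entries also counts as static under this rule and is dropped.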

## Empty Features

In [None]:
# Confirm no empty features

clean_df.columns[clean_df.isna().all()]

## Insignificant Columns

In [None]:
cols = ['Fulfillment Notes', 'Recipient Country']

clean_df = clean_df.drop(labels=cols, axis=1)
clean_df.columns

# Conclusion

The preceding cleaning steps are intended to prepare the data for EDA and visualizations. The following limitations remain:
* Missing values have not been imputed.
* Features with high levels of missing values remain.
* Rows have not been aggregated on the order level.
* Features have not been engineered or transformed.
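As a possible follow-up for the missing-value points above, the share of missing entries per column can be summarized in one line. This is a sketch on a hypothetical frame, not a step performed in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, for illustration only.
example = pd.DataFrame(
    {
        "Item Name": ["a", "b", "c", "d"],
        "Notes": [np.nan, "x", np.nan, np.nan],
    }
)

# isna() marks missing cells; the column mean is the missing share.
missing_share = example.isna().mean().sort_values(ascending=False)

print(missing_share)
```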

In [None]:
# Save df

clean_df.to_csv(INTERIM_DATA_DIR / 'orders_eda_cleaned.csv', index=False)