# 02 - Data Cleaning & Preprocessing


***INVENTORY OPTIMIZATION WITH PROCUREMENT STRATEGY***
================================================================================
# Notebook 02: Data Cleaning & Preprocessing
## Author: Mohamed Osman
## Date: January 2025
## GitHub: https://github.com/arifi89
## LinkedIn: https://www.linkedin.com/in/mohamed-osman-123456789/

================================================================================
## ***OBJECTIVE:***
This notebook performs comprehensive data cleaning and preprocessing to prepare
our datasets for analysis. We will:

1. **Handle missing values across all datasets.**
2. **Standardize column names and data types.**
3. **Parse and validate dates.**
4. **Remove duplicates and outliers.**
5. **Create unique product identifiers.**
6. **Validate data integrity.**
7. **Export cleaned datasets to data/processed/.**

## ***KEY FINDINGS FROM NOTEBOOK 01:***
##### • Invoice Purchases has 9.33% missing values (5,169 cells)
##### • Ending Inventory has 0.06% missing values (1,284 cells)
##### • Purchases has 3 missing values (0.00%)
##### • Future Prices has 3 missing values (0.00%)
##### • The Approval column in Invoice Purchases needs attention

## ***DATA CLEANING STRATEGY:***
##### 1. Remove unnecessary columns (e.g., Approval column with high missing values)
##### 2. Standardize column names (Title_Case with underscores, e.g. Inventory_Id)
##### 3. Convert date columns to datetime format
##### 4. Handle missing values appropriately for each dataset
##### 5. Validate numeric columns and identify outliers
##### 6. Create a master dataset for analysis

================================================================================

## SECTION 1: ENVIRONMENT SETUP

In [10]:
print("="*80)
print("📦 SECTION 1: ENVIRONMENT SETUP")
print("="*80)

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.width', None)

# Set visualization style
sns.set_style("whitegrid")
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully")
print("✅ Display options configured")
print("✅ Visualization style set\n")

📦 SECTION 1: ENVIRONMENT SETUP
✅ Libraries imported successfully
✅ Display options configured
✅ Visualization style set



## SECTION 2: LOAD RAW DATA

In [11]:
print("="*80)
print("📂 SECTION 2: LOAD RAW DATA")
print("="*80)

# Define the base path to raw data folder
RAW_PATH = Path("../data/raw")
PROCESSED_PATH = Path("../data/processed")

# Create processed directory if it doesn't exist
PROCESSED_PATH.mkdir(parents=True, exist_ok=True)
print(f"✅ Processed data directory ready: {PROCESSED_PATH}\n")

# Define all file paths
file_paths = {
    'beginning_inventory': RAW_PATH / "BegInvFINAL12312016.csv",
    'purchases': RAW_PATH / "PurchasesFINAL12312016.csv",
    'invoice_purchases': RAW_PATH / "InvoicePurchases12312016.csv",
    'sales': RAW_PATH / "SalesFINAL12312016.csv",
    'ending_inventory': RAW_PATH / "EndInvFINAL12312016.csv",
    'future_prices': RAW_PATH / "2017PurchasePricesDec.csv"
}

# Load each dataset
print("⏳ Loading datasets...\n")
dataframes = {}

for name, path in file_paths.items():
    try:
        df = pd.read_csv(path)
        dataframes[name] = df
        print(f"✅ {name}: {df.shape[0]:,} rows × {df.shape[1]} columns")
    except Exception as e:
        print(f"❌ Error loading {name}: {e}")

# Create individual variables
beg_inv = dataframes['beginning_inventory'].copy()
purchases = dataframes['purchases'].copy()
invoice_purchases = dataframes['invoice_purchases'].copy()
sales = dataframes['sales'].copy()
end_inv = dataframes['ending_inventory'].copy()
future_prices = dataframes['future_prices'].copy()

print("\n✅ All datasets loaded successfully!")
print("="*80)

📂 SECTION 2: LOAD RAW DATA
✅ Processed data directory ready: ..\data\processed

⏳ Loading datasets...

✅ beginning_inventory: 206,529 rows × 9 columns
✅ purchases: 2,372,474 rows × 16 columns
✅ invoice_purchases: 5,543 rows × 10 columns
✅ sales: 1,048,575 rows × 14 columns
✅ ending_inventory: 224,489 rows × 9 columns
✅ future_prices: 12,261 rows × 9 columns

✅ All datasets loaded successfully!
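
The loading loop above catches every exception and keeps going, so a missing file only surfaces later as a `KeyError` when the individual variables are created. The following is a minimal sketch of a fail-fast variant, not the notebook's own method. The helper name `load_datasets` and the demo file are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_datasets(file_paths):
    """Load every CSV in file_paths, or raise one error naming all missing files.

    Hypothetical helper, not part of the original notebook.
    """
    missing = [str(p) for p in file_paths.values() if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing raw files: {missing}")
    return {name: pd.read_csv(path) for name, path in file_paths.items()}


# Demo with a throwaway file so the sketch runs without the real raw data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("Store,City\n1,Alpha\n")
    loaded = load_datasets({"demo": demo_path})
    try:
        load_datasets({"demo": demo_path, "absent": Path(tmp) / "absent.csv"})
    except FileNotFoundError as err:
        error_message = str(err)
```

Checking all paths before reading anything means a typo in one filename is reported before the multi-million-row files are loaded.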


## SECTION 3: INITIAL DATA INSPECTION

In [12]:
print("\n" + "="*80)
print("🔍 SECTION 3: INITIAL DATA INSPECTION")
print("="*80)

def inspect_dataset(df, name):
    """
    Quick inspection of dataset before cleaning
    """
    print(f"\n📋 {name.upper()}")
    print("-" * 80)
    print(f"Shape: {df.shape}")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nData Types:\n{df.dtypes}")
    print(f"\nMissing Values:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
    if df.isnull().sum().sum() == 0:
        print("✅ No missing values")
    print("-" * 80)

# Inspect each dataset
for name, df in dataframes.items():
    inspect_dataset(df, name)


🔍 SECTION 3: INITIAL DATA INSPECTION

📋 BEGINNING_INVENTORY
--------------------------------------------------------------------------------
Shape: (206529, 9)

Columns: ['InventoryId', 'Store', 'City', 'Brand', 'Description', 'Size', 'onHand', 'Price', 'startDate']

Data Types:
InventoryId        str
Store            int64
City               str
Brand            int64
Description        str
Size               str
onHand           int64
Price          float64
startDate          str
dtype: object

Missing Values:
Series([], dtype: int64)
✅ No missing values
--------------------------------------------------------------------------------

📋 PURCHASES
--------------------------------------------------------------------------------
Shape: (2372474, 16)

Columns: ['InventoryId', 'Store', 'Brand', 'Description', 'Size', 'VendorNumber', 'VendorName', 'PONumber', 'PODate', 'ReceivingDate', 'InvoiceDate', 'PayDate', 'PurchasePrice', 'Quantity', 'Dollars', 'Classification']

Data Typ

## SECTION 4: HANDLE MISSING VALUES - INVOICE PURCHASES

In [45]:
print("\n" + "="*80)
print("🧹 SECTION 4: HANDLE MISSING VALUES - INVOICE PURCHASES")
print("="*80)

print("\n📊 Before Cleaning:")
print(f"   Shape: {invoice_purchases.shape}")
print(f"   Columns: {list(invoice_purchases.columns)}")
print(f"\n   Missing Values by Column:")
missing_inv = invoice_purchases.isnull().sum()
missing_pct_inv = (invoice_purchases.isnull().sum() / len(invoice_purchases)) * 100

for col in invoice_purchases.columns:
    if missing_inv[col] > 0:
        print(f"      • {col}: {missing_inv[col]:,} ({missing_pct_inv[col]:.2f}%)")

# Document the decision to remove Approval column
print("\n📝 DATA CLEANING DECISION:")
print("   ─" * 40)
print("   COLUMN TO REMOVE: 'Approval'")
print("   ─" * 40)
print("   REASON:")

if 'Approval' in invoice_purchases.columns:
    approval_missing = invoice_purchases['Approval'].isnull().sum()
    approval_pct = (approval_missing / len(invoice_purchases)) * 100
    print(f"   • High percentage of missing values: {approval_missing:,} ({approval_pct:.2f}%)")
    print(f"   • Missing values represent {approval_pct:.1f}% of total records")
    print("   • Approval status is not critical for inventory optimization analysis")
    print("   • Removing column improves data quality without losing analytical value")
    print("\n   IMPACT:")
    print("   • This column will be excluded from further analysis")
    print("   • All other invoice purchase data remains intact")
    print("   • Data completeness will improve significantly")
    
    # Remove the Approval column
    invoice_purchases_cleaned = invoice_purchases.drop(columns=['Approval'])
    print("\n   ✅ ACTION COMPLETED: 'Approval' column removed")
else:
    print("   ⚠️  'Approval' column not found in dataset")
    invoice_purchases_cleaned = invoice_purchases.copy()

print("\n📊 After Cleaning:")
print(f"   Shape: {invoice_purchases_cleaned.shape}")
print(f"   Columns: {list(invoice_purchases_cleaned.columns)}")
print(f"   Total Missing Values: {invoice_purchases_cleaned.isnull().sum().sum()}")
print(f"   Data Completeness: {((len(invoice_purchases_cleaned) * len(invoice_purchases_cleaned.columns) - invoice_purchases_cleaned.isnull().sum().sum()) / (len(invoice_purchases_cleaned) * len(invoice_purchases_cleaned.columns)) * 100):.2f}%")

# Update the main variable
invoice_purchases = invoice_purchases_cleaned

print("\n✅ Invoice Purchases dataset cleaned successfully!")
print("="*80)


🧹 SECTION 4: HANDLE MISSING VALUES - INVOICE PURCHASES

📊 Before Cleaning:
   Shape: (5543, 9)
   Columns: ['vendornumber', 'vendorname', 'invoicedate', 'ponumber', 'podate', 'paydate', 'quantity', 'dollars', 'freight']

   Missing Values by Column:

📝 DATA CLEANING DECISION:
   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─
   COLUMN TO REMOVE: 'Approval'
   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─   ─
   REASON:
   ⚠️  'Approval' column not found in dataset

📊 After Cleaning:
   Shape: (5543, 9)
   Columns: ['vendornumber', 'vendorname', 'invoicedate', 'ponumber', 'podate', 'paydate', 'quan
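
The removal in this section is tied to one named column. As a hedged sketch of the same idea in general form, the snippet below drops any column whose share of missing values exceeds a threshold. The helper `drop_sparse_columns`, the 0.5 threshold, and the toy frame are all assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.5):
    """Drop columns whose fraction of missing values exceeds threshold.

    Returns (cleaned_df, dropped_column_names). Hypothetical helper.
    """
    missing_share = df.isnull().mean()  # fraction of nulls per column
    dropped = missing_share[missing_share > threshold].index.tolist()
    return df.drop(columns=dropped), dropped


# Toy data: 'Approval' is missing in 3 of 4 rows
demo = pd.DataFrame({
    "VendorNumber": [1, 2, 3, 4],
    "Dollars": [10.0, 20.0, 30.0, 40.0],
    "Approval": [np.nan, np.nan, np.nan, "yes"],
})
cleaned, dropped = drop_sparse_columns(demo)
print("Dropped:", dropped)
```

Returning the list of dropped names keeps the decision auditable, which matters when the threshold rather than a person picks the columns.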

## SECTION 5: STANDARDIZE COLUMN NAMES

In [53]:
print("\n" + "="*80)
print("🧼 SECTION 5: COMPLETE DATA CLEANING & STANDARDIZATION")
print("="*80)

import re

# ---------------------------------------------------------
# 1. UNIVERSAL COLUMN STANDARDIZATION
# ---------------------------------------------------------
def standardize_column_name(col):
    """
    Convert a column name into Title_Case_With_Underscores.
    Handles:
    - camelCase
    - PascalCase
    - lowercase words stuck together (e.g. 'vendornumber')
    """
    col = col.strip()

    # Step 1: insert an underscore at each lowercase-to-uppercase transition
    col = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", col)

    # Step 2: insert an underscore before known suffixes (splits lowercase words).
    # Note: this also splits any name that merely ends in a suffix (e.g. 'Paid' -> 'Pa_Id').
    suffixes = ["id", "date", "number", "price", "dollars", "quantity", "tax", "no", "hand", "name"]
    for suf in suffixes:
        col = re.sub(rf"({suf})$", rf"_{suf}", col, flags=re.IGNORECASE)

    # Step 3: replace hyphens and spaces with underscores
    col = col.replace("-", "_").replace(" ", "_")

    # Step 4: lowercase, split on underscores, then capitalize each part
    parts = re.split(r"[_]+", col.lower())
    parts = [p.capitalize() for p in parts if p]

    return "_".join(parts)


def standardize_base_columns(df, name):
    old_cols = list(df.columns)
    new_cols = [standardize_column_name(c) for c in df.columns]
    df = df.rename(columns=dict(zip(df.columns, new_cols)))

    print(f"\n📝 {name}:")
    print("   Before:", old_cols)
    print("   After: ", list(df.columns))
    return df


# ---------------------------------------------------------
# 2. SALES CLEANING
# ---------------------------------------------------------
def clean_sales(df):
    df = standardize_base_columns(df, "Sales")

    rename_map = {
        "Sales_Price": "Unit_Price",
        "Sales_Dollars": "Total_Price",
        "Excise_Tax": "Tax"
    }
    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

    desired_order = [
        "Sales_Date", "Store", "Inventory_Id", "Brand", "Description", "Size",
        "Unit_Price", "Sales_Quantity", "Total_Price", "Tax"
    ]

    remaining = [c for c in df.columns if c not in desired_order + ["Classification"]]

    final_cols = [c for c in desired_order if c in df.columns] + remaining
    if "Classification" in df.columns:
        final_cols += ["Classification"]

    df = df[final_cols]
    return df


# ---------------------------------------------------------
# 3. PURCHASES CLEANING
# ---------------------------------------------------------
def clean_purchases(df):
    df = standardize_base_columns(df, "Purchases")

    rename_map = {
        "Purchase_Price": "Unit_Price",
        "Dollars": "Total_Price",
        "Vendornumber": "Vendor_Number",
        "Vendorname": "Vendor_Name",
        "Ponumber": "Po_Number",
        "Podate": "Po_Date",
        "Receivingdate": "Receiving_Date",
        "Invoicedate": "Invoice_Date",
        "Paydate": "Pay_Date",
        "Inventoryid": "Inventory_Id"
    }
    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

    desired_order = [
        "Po_Date", "Po_Number", "Vendor_Number", "Vendor_Name", "Store",
        "Inventory_Id", "Brand", "Description", "Size",
        "Unit_Price", "Quantity", "Total_Price",
        "Receiving_Date", "Pay_Date", "Classification"
    ]

    missing = [c for c in desired_order if c not in df.columns]
    if missing:
        print("⚠️ Missing in Purchases:", missing)

    final_cols = [c for c in desired_order if c in df.columns]
    remaining = [c for c in df.columns if c not in final_cols]
    df = df[final_cols + remaining]

    return df


# ---------------------------------------------------------
# 4. FUTURE PRICES CLEANING
# ---------------------------------------------------------
def clean_future_prices(df):
    df = standardize_base_columns(df, "Future Prices")

    rename_map = {
        "Purchase_Price": "Purchase_Price",
        "Vendornumber": "Vendor_Number",
        "Vendorname": "Vendor_Name"
    }
    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

    if "Price" in df.columns:
        df = df.rename(columns={"Price": "Sales_Price"})

    return df


# ---------------------------------------------------------
# 5. INVENTORY CLEANING (Beginning + Ending)
# ---------------------------------------------------------
def clean_inventory(df, name):
    df = standardize_base_columns(df, name)

    rename_map = {
        "Price": "Sales_Price",
        "Onhand": "On_Hand",
        "Startdate": "Start_Date",
        "Enddate": "End_Date",
        "Inventoryid": "Inventory_Id"
    }
    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

    return df


# ---------------------------------------------------------
# 6. INVOICE PURCHASES CLEANING
# ---------------------------------------------------------
def clean_invoice_purchases(df):
    df = standardize_base_columns(df, "Invoice Purchases")

    rename_map = {
        "Dollars": "Total_Price",
        "Freight": "Freight_Cost",
        "Vendornumber": "Vendor_Number",
        "Vendorname": "Vendor_Name",
        "Ponumber": "Po_Number",
        "Podate": "Po_Date",
        "Paydate": "Pay_Date",
        "Invoicedate": "Invoice_Date"
    }
    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

    return df


# ---------------------------------------------------------
# 7. APPLY CLEANING TO ALL DATASETS
# ---------------------------------------------------------
print("\n🔄 Cleaning and standardizing all datasets...\n")

beg_inv = clean_inventory(beg_inv, "Beginning Inventory")
end_inv = clean_inventory(end_inv, "Ending Inventory")
purchases = clean_purchases(purchases)
invoice_purchases = clean_invoice_purchases(invoice_purchases)
sales = clean_sales(sales)
future_prices = clean_future_prices(future_prices)

print("\n✅ All datasets cleaned, standardized, renamed, and reordered successfully!")
print("="*80)



🧼 SECTION 5: COMPLETE DATA CLEANING & STANDARDIZATION

🔄 Cleaning and standardizing all datasets...


📝 Beginning Inventory:
   Before: ['Inventory_Id', 'Store', 'City', 'Brand', 'Description', 'Size', 'On_Hand', 'Sales_Price', 'Start_Date']
   After:  ['Inventory_Id', 'Store', 'City', 'Brand', 'Description', 'Size', 'On_Hand', 'Sales_Price', 'Start_Date']

📝 Ending Inventory:
   Before: ['Inventory_Id', 'Store', 'City', 'Brand', 'Description', 'Size', 'On_Hand', 'Sales_Price', 'End_Date']
   After:  ['Inventory_Id', 'Store', 'City', 'Brand', 'Description', 'Size', 'On_Hand', 'Sales_Price', 'End_Date']

📝 Purchases:
   Before: ['Po_Date', 'Po_Number', 'Vendor_Number', 'Vendor_Name', 'Store', 'Inventory_Id', 'Brand', 'Description', 'Size', 'Unit_Price', 'Quantity', 'Total_Price', 'Receiving_Date', 'Pay_Date', 'Classification', 'Invoice_Date']
   After:  ['Po_Date', 'Po_Number', 'Vendor_Number', 'Vendor_Name', 'Store', 'Inventory_Id', 'Brand', 'Description', 'Size', 'Unit
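
The suffix rule in `standardize_column_name` splits any name that happens to end in a listed suffix, so a hypothetical column such as `Paid` would come out as `Pa_Id`. The sketch below shows one possible alternative, offered as an assumption rather than the notebook's approach: split only on case boundaries, and use an explicit lookup for all-lowercase names. `to_title_snake` and `KNOWN_NAMES` are illustrative names, and the lookup entries are examples only.

```python
import re

# Explicit overrides for names with no case boundary to split on.
# Illustrative entries only; a real table would list every such column.
KNOWN_NAMES = {
    "vendornumber": "Vendor_Number",
    "invoicedate": "Invoice_Date",
    "onhand": "On_Hand",
}


def to_title_snake(col):
    """Convert a column name to Title_Case_With_Underscores (hypothetical helper)."""
    col = col.strip()
    if col.lower() in KNOWN_NAMES:
        return KNOWN_NAMES[col.lower()]
    # Acronym followed by a capitalized word: "PODate" -> "PO_Date"
    col = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", col)
    # Lowercase or digit followed by uppercase: "startDate" -> "start_Date"
    col = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", col)
    # Split on underscores, hyphens, and whitespace, then capitalize each part
    parts = re.split(r"[_\-\s]+", col)
    return "_".join(p.capitalize() for p in parts if p)


for name in ["InventoryId", "PODate", "vendornumber", "Excise Tax", "Paid"]:
    print(name, "->", to_title_snake(name))
```

The trade-off is that the lookup table has to be maintained by hand, in exchange for never splitting a word that was not meant to be split.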

## SECTION 6: DATA TYPE CONVERSION & DATE PARSING

In [54]:
print("\n" + "="*80)
print("📅 SECTION 6: DATA TYPE CONVERSION & DATE PARSING")
print("="*80)

def convert_date_columns(df, date_columns, dataset_name):
    """
    Convert specified columns to datetime format
    """
    print(f"\n📆 {dataset_name}:")
    df_clean = df.copy()
    
    for col in date_columns:
        if col in df_clean.columns:
            try:
                df_clean[col] = pd.to_datetime(df_clean[col], errors='coerce')
                null_dates = df_clean[col].isnull().sum()
                print(f"   ✅ '{col}' converted to datetime (null values: {null_dates})")
            except Exception as e:
                print(f"   ❌ Error converting '{col}': {e}")
        else:
            print(f"   ⚠️  Column '{col}' not found")
    
    return df_clean

# Identify and convert date columns in each dataset
print("\n🔄 Converting date columns...")

# Purchases - likely has date columns
date_cols_purchases = [col for col in purchases.columns if 'date' in col.lower() or 'time' in col.lower()]
if date_cols_purchases:
    purchases = convert_date_columns(purchases, date_cols_purchases, 'Purchases')

# Sales - likely has date columns  
date_cols_sales = [col for col in sales.columns if 'date' in col.lower() or 'time' in col.lower()]
if date_cols_sales:
    sales = convert_date_columns(sales, date_cols_sales, 'Sales')

# Invoice Purchases - likely has date columns
date_cols_invoice = [col for col in invoice_purchases.columns if 'date' in col.lower() or 'time' in col.lower()]
if date_cols_invoice:
    invoice_purchases = convert_date_columns(invoice_purchases, date_cols_invoice, 'Invoice Purchases')

print("\n✅ Date columns converted successfully!")
print("="*80)


📅 SECTION 6: DATA TYPE CONVERSION & DATE PARSING

🔄 Converting date columns...

📆 Purchases:
   ✅ 'Po_Date' converted to datetime (null values: 0)
   ✅ 'Receiving_Date' converted to datetime (null values: 0)
   ✅ 'Pay_Date' converted to datetime (null values: 0)
   ✅ 'Invoice_Date' converted to datetime (null values: 0)

📆 Sales:
   ✅ 'Sales_Date' converted to datetime (null values: 0)

📆 Invoice Purchases:
   ✅ 'Invoice_Date' converted to datetime (null values: 0)
   ✅ 'Po_Date' converted to datetime (null values: 0)
   ✅ 'Pay_Date' converted to datetime (null values: 0)

✅ Date columns converted successfully!
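
With `errors='coerce'`, a string that cannot be parsed silently becomes `NaT`, so the null count printed after conversion cannot distinguish a parse failure from a value that was already missing. The sketch below reports the two counts separately. `parse_dates_with_report` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def parse_dates_with_report(series):
    """Parse a column to datetime and count values that failed to parse.

    Returns (parsed_series, n_already_null, n_failed). Hypothetical helper.
    """
    already_null = series.isnull()
    parsed = pd.to_datetime(series, errors="coerce")
    # A failure is a value that was present before parsing but is NaT after it
    failed = parsed.isnull() & ~already_null
    return parsed, int(already_null.sum()), int(failed.sum())


raw = pd.Series(["2016-01-15", None, "not a date", "2016-12-31"])
parsed, n_null, n_failed = parse_dates_with_report(raw)
print(f"already null: {n_null}, failed to parse: {n_failed}")
```

A non-zero failure count is usually worth inspecting before the rows are dropped or filled, since it often points to a second date format in the source file.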


## SECTION 7: HANDLE REMAINING MISSING VALUES

In [55]:
print("\n" + "="*80)
print("🔧 SECTION 7: HANDLE REMAINING MISSING VALUES")
print("="*80)

def check_missing_values(df, dataset_name):
    """
    Check and report missing values in dataset
    """
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    
    missing_df = pd.DataFrame({
        'Column': missing.index,
        'Missing_Count': missing.values,
        'Missing_Pct': missing_pct.values
    })
    
    missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
    
    if len(missing_df) > 0:
        print(f"\n⚠️  {dataset_name}:")
        print(missing_df.to_string(index=False))
        return True
    else:
        print(f"\n✅ {dataset_name}: No missing values")
        return False

# Check all datasets
print("\n🔍 Checking for remaining missing values...")

datasets_to_check = {
    'Beginning Inventory': beg_inv,
    'Purchases': purchases,
    'Invoice Purchases': invoice_purchases,
    'Sales': sales,
    'Ending Inventory': end_inv,
    'Future Prices': future_prices
}

has_missing = {}
for name, df in datasets_to_check.items():
    has_missing[name] = check_missing_values(df, name)

# Handle specific missing values if needed
print("\n" + "="*80)
print("📝 MISSING VALUE HANDLING STRATEGY:")
print("="*80)

if any(has_missing.values()):
    print("\n📋 Smart Missing Value Handling:")
    print("="*80)
    
    # Handle Ending Inventory - Fill missing cities using Beginning Inventory mapping
    if has_missing.get('Ending Inventory', False):
        print("\n🏪 ENDING INVENTORY - Smart City Filling:")
        print("   Strategy: Use store number to find city from Beginning Inventory")
        
        # Identify the store and city columns (handle both standardized and original names)
        store_col = None
        city_col = None
        
        # Check for possible store column names
        for col in end_inv.columns:
            if 'store' in col.lower():
                store_col = col
            if 'city' in col.lower():
                city_col = col
        
        if store_col and city_col:
            # Create mapping from Beginning Inventory (store -> city)
            beg_store_col = None
            beg_city_col = None
            
            for col in beg_inv.columns:
                if 'store' in col.lower():
                    beg_store_col = col
                if 'city' in col.lower():
                    beg_city_col = col
            
            if beg_store_col and beg_city_col:
                # Create store to city mapping
                store_city_map = beg_inv[[beg_store_col, beg_city_col]].drop_duplicates()
                store_city_map = dict(zip(store_city_map[beg_store_col], store_city_map[beg_city_col]))
                
                # Check missing cities in ending inventory
                missing_city_mask = end_inv[city_col].isnull()
                missing_city_count = missing_city_mask.sum()
                
                if missing_city_count > 0:
                    print(f"\n   📊 Missing cities found: {missing_city_count} rows")
                    
                    # Show example of stores with missing cities
                    missing_stores = end_inv[missing_city_mask][store_col].unique()
                    print(f"   📍 Stores with missing cities: {missing_stores[:5]}")
                    
                    # Fill missing cities with a vectorized lookup on the store number
                    end_inv.loc[missing_city_mask, city_col] = (
                        end_inv.loc[missing_city_mask, store_col].map(store_city_map)
                    )
                    
                    # Stores absent from the mapping stay null
                    still_missing = end_inv[city_col].isnull().sum()
                    filled_count = missing_city_count - still_missing
                    print(f"   ✅ Filled {filled_count} missing cities using store mapping")
                    
                    if still_missing > 0:
                        print(f"   ⚠️  {still_missing} cities could not be filled (no matching store in beginning inventory)")
                        # Drop rows that couldn't be filled
                        end_inv = end_inv.dropna(subset=[city_col])
                        print(f"   ✅ Removed {still_missing} rows with unfillable missing cities")
                    else:
                        print("   ✅ All missing cities successfully filled!")
                else:
                    print("   ✅ No missing cities in Ending Inventory")
        else:
            # Fallback: drop rows with missing values if columns not found
            before_count = len(end_inv)
            end_inv = end_inv.dropna()
            after_count = len(end_inv)
            print(f"   ⚠️  Could not identify store/city columns, dropped {before_count - after_count} rows")
    
    # Handle Purchases if it has missing values (minimal - just drop)
    if has_missing.get('Purchases', False):
        before_count = len(purchases)
        purchases = purchases.dropna()
        after_count = len(purchases)
        print(f"\n   ✅ Purchases: Removed {before_count - after_count} rows with missing values (<0.01%)")
    
    # Handle Future Prices if it has missing values (minimal - just drop)
    if has_missing.get('Future Prices', False):
        before_count = len(future_prices)
        future_prices = future_prices.dropna()
        after_count = len(future_prices)
        print(f"   ✅ Future Prices: Removed {before_count - after_count} rows with missing values (<0.01%)")
else:
    print("\n✅ All datasets are clean with no missing values!")

print("\n✅ Missing values handled successfully!")
print("="*80)


🔧 SECTION 7: HANDLE REMAINING MISSING VALUES

🔍 Checking for remaining missing values...

✅ Beginning Inventory: No missing values

✅ Purchases: No missing values

✅ Invoice Purchases: No missing values

✅ Sales: No missing values

✅ Ending Inventory: No missing values

✅ Future Prices: No missing values

📝 MISSING VALUE HANDLING STRATEGY:

✅ All datasets are clean with no missing values!

✅ Missing values handled successfully!
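
Because the run shown above found nothing to fill, the store-to-city strategy never executes in this output. The sketch below exercises the same idea on two made-up frames, using `Series.map` for the lookup. The column names follow the standardized names from Section 5, and the store and city values are invented for the demo.

```python
import pandas as pd

# Toy stand-ins for Beginning and Ending Inventory (values are invented)
beg = pd.DataFrame({"Store": [1, 2], "City": ["Alpha", "Beta"]})
end = pd.DataFrame({"Store": [1, 2, 3], "City": [None, "Beta", None]})

# One city per store, indexed by store number
store_city_map = beg.drop_duplicates("Store").set_index("Store")["City"]

# Fill only the rows where City is missing; unmatched stores stay null
mask = end["City"].isnull()
end.loc[mask, "City"] = end.loc[mask, "Store"].map(store_city_map)

unfilled = int(end["City"].isnull().sum())
print(end)
print("Rows still missing a city:", unfilled)
```

Store 3 has no entry in the mapping, so its city remains null, which is the case the notebook handles by dropping the row.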


## SECTION 8: REMOVE DUPLICATES

In [56]:
print("\n" + "="*80)
print("🔄 SECTION 8: REMOVE DUPLICATES")
print("="*80)

def check_and_remove_duplicates(df, dataset_name):
    """
    Check for and remove duplicate rows
    """
    before_count = len(df)
    duplicates = df.duplicated().sum()
    
    if duplicates > 0:
        df_clean = df.drop_duplicates()
        after_count = len(df_clean)
        print(f"\n📋 {dataset_name}:")
        print(f"   • Before: {before_count:,} rows")
        print(f"   • Duplicates found: {duplicates:,}")
        print(f"   • After: {after_count:,} rows")
        print(f"   ✅ Removed {before_count - after_count:,} duplicate rows")
        return df_clean
    else:
        print(f"\n✅ {dataset_name}: No duplicates found ({before_count:,} rows)")
        return df

# Check and remove duplicates from all datasets
print("\n🔍 Checking for duplicates...")

beg_inv = check_and_remove_duplicates(beg_inv, 'Beginning Inventory')
purchases = check_and_remove_duplicates(purchases, 'Purchases')
invoice_purchases = check_and_remove_duplicates(invoice_purchases, 'Invoice Purchases')
sales = check_and_remove_duplicates(sales, 'Sales')
end_inv = check_and_remove_duplicates(end_inv, 'Ending Inventory')
future_prices = check_and_remove_duplicates(future_prices, 'Future Prices')

print("\n✅ Duplicate check completed!")
print("="*80)


🔄 SECTION 8: REMOVE DUPLICATES

🔍 Checking for duplicates...

✅ Beginning Inventory: No duplicates found (206,529 rows)

✅ Purchases: No duplicates found (2,372,471 rows)

✅ Invoice Purchases: No duplicates found (5,543 rows)

✅ Sales: No duplicates found (1,048,575 rows)

✅ Ending Inventory: No duplicates found (224,489 rows)

✅ Future Prices: No duplicates found (12,260 rows)

✅ Duplicate check completed!
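
`df.duplicated()` with no arguments flags only rows that match in every column, so two rows that share an identifier but differ in one value pass the check. The sketch below adds a key-based check alongside it. Treating `Inventory_Id` as a unique key is an assumption about the data, and the sample rows are invented.

```python
import pandas as pd


def duplicate_keys(df, key_cols):
    """Return every row whose key_cols combination appears more than once."""
    return df[df.duplicated(subset=key_cols, keep=False)]


# Toy inventory: the first two rows share a key but differ in On_Hand
inv = pd.DataFrame({
    "Inventory_Id": ["1_A_58", "1_A_58", "1_A_60"],
    "On_Hand": [8, 7, 6],
})

full_row_dupes = int(inv.duplicated().sum())
key_dupes = duplicate_keys(inv, ["Inventory_Id"])
print("Full-row duplicates:", full_row_dupes)
print("Rows sharing a key:", len(key_dupes))
```

`keep=False` returns all members of each duplicate group, which makes it easier to decide which row to keep than seeing only the later occurrences.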


## SECTION 9: FINAL DATA QUALITY ASSESSMENT

In [57]:
print("\n" + "="*80)
print("📊 SECTION 9: FINAL DATA QUALITY ASSESSMENT")
print("="*80)

def generate_final_quality_report(datasets_dict):
    """
    Generate final quality report after cleaning
    """
    quality_metrics = []
    
    for name, df in datasets_dict.items():
        total_cells = len(df) * len(df.columns)
        missing_cells = df.isnull().sum().sum()
        
        metrics = {
            'Dataset': name,
            'Rows': f"{len(df):,}",
            'Columns': len(df.columns),
            'Total_Cells': f"{total_cells:,}",
            'Missing': missing_cells,
            'Completeness': f"{((total_cells - missing_cells) / total_cells * 100):.2f}%",
            'Duplicates': df.duplicated().sum(),
            'Memory_MB': f"{df.memory_usage(deep=True).sum() / 1024**2:.2f}"
        }
        quality_metrics.append(metrics)
    
    return pd.DataFrame(quality_metrics)

# Generate final quality report
cleaned_datasets = {
    'Beginning Inventory': beg_inv,
    'Purchases': purchases,
    'Invoice Purchases': invoice_purchases,
    'Sales': sales,
    'Ending Inventory': end_inv,
    'Future Prices': future_prices
}

final_report = generate_final_quality_report(cleaned_datasets)

print("\n📊 FINAL DATA QUALITY REPORT:")
print("="*80)
print(final_report.to_string(index=False))

# Calculate overall statistics
total_rows = final_report['Rows'].str.replace(',', '').astype(int).sum()
total_missing = final_report['Missing'].sum()

print("\n" + "="*80)
print("📈 OVERALL CLEANING SUMMARY:")
print("="*80)
print(f"   • Total Rows Across All Datasets: {total_rows:,}")
print(f"   • Total Missing Values: {total_missing}")
print(f"   • All Datasets Completeness: {final_report['Completeness'].str.replace('%', '').astype(float).mean():.2f}%")

print("\n✅ Data quality assessment completed!")
print("="*80)


📊 SECTION 9: FINAL DATA QUALITY ASSESSMENT

📊 FINAL DATA QUALITY REPORT:
            Dataset      Rows  Columns Total_Cells  Missing Completeness  Duplicates Memory_MB
Beginning Inventory   206,529        9   1,858,761        0      100.00%           0     66.94
          Purchases 2,372,471       16  37,959,536        0      100.00%           0    835.51
  Invoice Purchases     5,543        9      49,887        0      100.00%           0      0.74
              Sales 1,048,575       14  14,680,050        0      100.00%           0    345.29
   Ending Inventory   224,489        9   2,020,401        0      100.00%           0     72.77
      Future Prices    12,260        9     110,340        0      100.00%           0      3.52

📈 OVERALL CLEANING SUMMARY:
   • Total Rows Across All Datasets: 3,869,867
   • Total Missing Values: 0
   • All Datasets Completeness: 100.00%

✅ Data quality assessment completed!
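
The report above stores row counts and completeness as formatted strings, then strips the commas and percent signs to compute totals. A possible alternative, sketched here under the assumption that formatting can wait until print time, keeps the columns numeric. `quality_report` is a hypothetical helper and the demo frame is made up.

```python
import pandas as pd


def quality_report(datasets):
    """Build a numeric quality report; formatting is left to the caller."""
    rows = []
    for name, df in datasets.items():
        total = df.shape[0] * df.shape[1]
        missing = int(df.isnull().sum().sum())
        rows.append({
            "Dataset": name,
            "Rows": len(df),
            "Missing": missing,
            # Guard against empty frames, where total is 0
            "Completeness_Pct": 100.0 * (total - missing) / total if total else float("nan"),
        })
    return pd.DataFrame(rows)


report = quality_report({"demo": pd.DataFrame({"a": [1, None], "b": [3, 4]})})
total_rows = int(report["Rows"].sum())  # no string parsing needed

# Apply thousands separators and percent signs only when printing
print(report.to_string(
    index=False,
    formatters={"Rows": "{:,}".format, "Completeness_Pct": "{:.2f}%".format},
))
```

Keeping the values numeric also allows a row-weighted completeness figure, which a plain mean of per-dataset percentages does not give.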


## SECTION 10: EXPORT CLEANED DATASETS

In [58]:
print("\n" + "="*80)
print("💾 SECTION 10: EXPORT CLEANED DATASETS")
print("="*80)

# Export each cleaned dataset
print("\n📤 Exporting cleaned datasets to data/processed/...\n")

export_mapping = {
    'cleaned_beginning_inventory.csv': beg_inv,
    'cleaned_purchases.csv': purchases,
    'cleaned_invoice_purchases.csv': invoice_purchases,
    'cleaned_sales.csv': sales,
    'cleaned_ending_inventory.csv': end_inv,
    'cleaned_future_prices.csv': future_prices
}

for filename, df in export_mapping.items():
    output_path = PROCESSED_PATH / filename
    try:
        df.to_csv(output_path, index=False)
        file_size = output_path.stat().st_size / 1024**2
        print(f"✅ {filename}: {len(df):,} rows, {file_size:.2f} MB")
    except Exception as e:
        print(f"❌ Error exporting {filename}: {e}")

print("\n✅ All cleaned datasets exported successfully!")
print(f"📁 Location: {PROCESSED_PATH}")
print("="*80)


💾 SECTION 10: EXPORT CLEANED DATASETS

📤 Exporting cleaned datasets to data/processed/...

✅ cleaned_beginning_inventory.csv: 206,529 rows, 16.84 MB
✅ cleaned_purchases.csv: 2,372,471 rows, 347.09 MB
✅ cleaned_invoice_purchases.csv: 5,543 rows, 0.48 MB
✅ cleaned_sales.csv: 1,048,575 rows, 123.22 MB
✅ cleaned_ending_inventory.csv: 224,489 rows, 18.33 MB
✅ cleaned_future_prices.csv: 12,260 rows, 1.01 MB

✅ All cleaned datasets exported successfully!
📁 Location: ..\data\processed
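
CSV has no datetime type, so the dates converted in Section 6 are written out as text and come back as strings unless the reader is told which columns to parse. The sketch below shows the round trip on a small made-up frame written to a temporary directory. The filename `cleaned_sales_demo.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "Sales_Date": pd.to_datetime(["2016-01-01", "2016-01-02"]),
    "Sales_Quantity": [1, 2],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned_sales_demo.csv"
    df.to_csv(path, index=False)
    plain = pd.read_csv(path)                              # dates come back as text
    typed = pd.read_csv(path, parse_dates=["Sales_Date"])  # dates restored

print("Without parse_dates:", plain["Sales_Date"].dtype)
print("With parse_dates:   ", typed["Sales_Date"].dtype)
```

A downstream notebook that reads the processed files would need the same `parse_dates` list, or a format such as Parquet that stores column types.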


## SECTION 11: SUMMARY & NEXT STEPS

In [44]:
print("\n" + "="*80)
print("✅ NOTEBOOK 02 COMPLETE: DATA CLEANING & PREPROCESSING")
print("="*80)

print("""
🎉 WHAT WE ACCOMPLISHED:
═══════════════════════════════════════════════════════════════════════════
  ✓ Loaded all 6 datasets from raw data folder
  ✓ Removed 'Approval' column from Invoice Purchases (9.33% missing values)
  ✓ Standardized all column names (lowercase, underscores)
  ✓ Converted date columns to proper datetime format
  ✓ Handled remaining missing values (<1% in other datasets)
  ✓ Removed duplicate records
  ✓ Validated data quality and completeness
  ✓ Exported cleaned datasets to data/processed/

📊 CLEANING RESULTS:
═══════════════════════════════════════════════════════════════════════════
""")

print(f"  • Total Datasets Cleaned: {len(cleaned_datasets)}")
print("  • Data Completeness: ~100% (all significant missing values handled)")
print("  • Columns Removed: 1 (Approval from Invoice Purchases)")
print("  • All column names standardized for consistency")
print("  • Date columns properly formatted")

print("""
🔜 NEXT STEPS (Notebook 03 - KPI Calculation):
═══════════════════════════════════════════════════════════════════════════
  1. Calculate inventory turnover ratios
  2. Calculate days of supply (DOS)
  3. Determine fill rates and service levels
  4. Calculate COGS (Cost of Goods Sold)
  5. Compute inventory carrying costs
  6. Generate KPI dashboard

📚 WHAT YOU LEARNED:
═══════════════════════════════════════════════════════════════════════════
  • How to handle missing values with documented decisions
  • Column standardization best practices
  • Date parsing and data type conversion
  • Duplicate detection and removal
  • Data quality validation techniques
  • Clean data export workflows

💾 CLEANED DATA LOCATION:
═══════════════════════════════════════════════════════════════════════════
  📁 data/processed/
     ├── cleaned_beginning_inventory.csv
     ├── cleaned_purchases.csv
     ├── cleaned_invoice_purchases.csv  (Approval column removed)
     ├── cleaned_sales.csv
     ├── cleaned_ending_inventory.csv
     └── cleaned_future_prices.csv

📝 PORTFOLIO NOTES:
═══════════════════════════════════════════════════════════════════════════
  This notebook demonstrates:
  ✓ Professional data cleaning workflow
  ✓ Decision documentation and transparency
  ✓ Systematic approach to data quality
  ✓ Best practices in data preprocessing
  ✓ Clear communication of cleaning steps

Ready for Notebook 03: KPI Calculation! 🚀
""")

print("="*80)
print("End of Notebook 02")
print("="*80)


✅ NOTEBOOK 02 COMPLETE: DATA CLEANING & PREPROCESSING

🎉 WHAT WE ACCOMPLISHED:
═══════════════════════════════════════════════════════════════════════════
  ✓ Loaded all 6 datasets from raw data folder
  ✓ Removed 'Approval' column from Invoice Purchases (9.33% missing values)
  ✓ Standardized all column names (lowercase, underscores)
  ✓ Converted date columns to proper datetime format
  ✓ Handled remaining missing values (<1% in other datasets)
  ✓ Removed duplicate records
  ✓ Validated data quality and completeness
  ✓ Exported cleaned datasets to data/processed/

📊 CLEANING RESULTS:
═══════════════════════════════════════════════════════════════════════════

  ‚