
# 01_data_cleaning.ipynb
**Omnichannel FMCG: Data Cleaning & Preprocessing**  
Author: **Derrick Wong**

This notebook loads the mock dataset, profiles it, fixes types, handles missing values, normalizes categories, removes duplicates, caps outliers, and exports a **cleaned CSV** plus summary stats.  
Use this as the foundation for analysis, modeling, and dashboards.


## 1) Setup

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from pathlib import Path

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 140)
print('✅ Libraries ready.')


✅ Libraries ready.


## 2) Load mock data (CSV/XLSX)

In [2]:
from pathlib import Path

DATA_PATH = Path(r"C:\Users\inchr\Downloads\Capstone Associate Data Analyst\omnichannel-growth-engine\data\fmcg_omnichannel_sales.csv") # or .xlsx
CUSTOMERS_PATH = Path(r"C:\Users\inchr\Downloads\Capstone Associate Data Analyst\omnichannel-growth-engine\data\fmcg_customers.csv")     # optional
SKU_REF_PATH = Path(r"C:\Users\inchr\Downloads\Capstone Associate Data Analyst\omnichannel-growth-engine\data\fmcg_sku_reference.csv")   # optional

def load_table(path: Path):
    if not path.exists():
        print(f'⚠️ File not found (skipping): {path}')
        return None
    if path.suffix.lower() == '.csv':
        return pd.read_csv(path)
    if path.suffix.lower() in ['.xlsx', '.xls']:
        return pd.read_excel(path)
    raise ValueError(f'Unsupported file type: {path}')

df = load_table(DATA_PATH)
cust = load_table(CUSTOMERS_PATH)
sku  = load_table(SKU_REF_PATH)

assert df is not None, f"Main dataset not found at {DATA_PATH}. Check DATA_PATH above."
print('✅ Main dataset loaded. Shape:', df.shape)
df.head()


⚠️ File not found (skipping): C:\Users\inchr\Downloads\Capstone Associate Data Analyst\omnichannel-growth-engine\data\fmcg_customers.csv
✅ Main dataset loaded. Shape: (122280, 12)


|  | Order_ID | Week_Start | Customer_ID | SKU_ID | Category | Territory | Channel | Promo_Flag | Unit_Price | Units | Revenue | Orders_6m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O4997839985 | 2025-04-28 | C23349035 | INS01 | Instant Noodles | North | Retail | 0 | 1.18 | 3 | 3.54 | 66 |
| 1 | O6872379697 | 2025-04-28 | C49284335 | INS01 | Instant Noodles | North | Retail | 0 | 1.18 | 6 | 7.08 | 36 |
| 2 | O7033690015 | 2025-04-28 | C72946384 | INS01 | Instant Noodles | North | Retail | 0 | 1.18 | 4 | 4.72 | 47 |
| 3 | O8606233571 | 2025-04-28 | C81020779 | INS01 | Instant Noodles | North | Retail | 0 | 1.18 | 3 | 3.54 | 45 |
| 4 | O9316119385 | 2025-04-28 | C65978939 | INS01 | Instant Noodles | North | Retail | 0 | 1.18 | 8 | 9.44 | 37 |
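
The absolute paths in the cell above only resolve on one machine. A minimal sketch of one alternative, assuming the notebook sits somewhere inside the project folder and that the project root is the nearest ancestor containing a `data` directory. The helper name `find_project_root` and that layout rule are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path
import tempfile

def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a folder containing `marker` is found (hypothetical helper)."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains a '{marker}' directory")

# Demo in a throwaway tree: <tmp>/project/data and <tmp>/project/notebooks
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "data").mkdir(parents=True)
    (project / "notebooks").mkdir()

    root = find_project_root(project / "notebooks")
    data_path = root / "data" / "fmcg_omnichannel_sales.csv"

    found_root = (root == project.resolve())
    relative_data_path = data_path.relative_to(root).as_posix()

print(found_root, relative_data_path)
```

In the notebook itself, `find_project_root(Path.cwd())` could replace the hard-coded prefix, provided the working directory is inside the project.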


## 3) Profile & preview

In [3]:

print('🧾 .info():')
df.info()  # info() prints its report and returns None, so wrapping it in display() would also show "None"
print('\n📊 .describe():')
display(df.describe(include='all').T)
print('\n🔍 Missing values by column:')
missing = df.isna().sum().sort_values(ascending=False)
display(missing.to_frame('missing_count'))


🧾 .info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122280 entries, 0 to 122279
Data columns (total 12 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Order_ID     122280 non-null  object 
 1   Week_Start   122280 non-null  object 
 2   Customer_ID  122280 non-null  object 
 3   SKU_ID       122280 non-null  object 
 4   Category     122280 non-null  object 
 5   Territory    122280 non-null  object 
 6   Channel      122280 non-null  object 
 7   Promo_Flag   122280 non-null  int64  
 8   Unit_Price   122280 non-null  float64
 9   Units        122280 non-null  int64  
 10  Revenue      122280 non-null  float64
 11  Orders_6m    122280 non-null  int64  
dtypes: float64(2), int64(3), object(7)
memory usage: 11.2+ MB


📊 .describe():


|  | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Order_ID | 122280.0 | 122279.0 | O4196622835 | 2.0 |  |  |  |  |  |  |  |
| Week_Start | 122280.0 | 26.0 | 2025-07-28 | 5316.0 |  |  |  |  |  |  |  |
| Customer_ID | 122280.0 | 2200.0 | C96692453 | 125.0 |  |  |  |  |  |  |  |
| SKU_ID | 122280.0 | 18.0 | RTD04 | 7585.0 |  |  |  |  |  |  |  |
| Category | 122280.0 | 4.0 | Instant Noodles | 35687.0 |  |  |  |  |  |  |  |
| Territory | 122280.0 | 4.0 | Central | 31563.0 |  |  |  |  |  |  |  |
| Channel | 122280.0 | 4.0 | Retail | 38896.0 |  |  |  |  |  |  |  |
| Promo_Flag | 122280.0 |  |  |  | 0.247048 | 0.431297 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Unit_Price | 122280.0 |  |  |  | 1.619534 | 0.604422 | 0.7 | 1.19 | 1.38 | 2.04 | 3.3 |
| Units | 122280.0 |  |  |  | 6.150343 | 6.549518 | 1.0 | 2.0 | 4.0 | 8.0 | 125.0 |



🔍 Missing values by column:


|  | missing_count |
|---|---|
| Order_ID | 0 |
| Week_Start | 0 |
| Customer_ID | 0 |
| SKU_ID | 0 |
| Category | 0 |
| Territory | 0 |
| Channel | 0 |
| Promo_Flag | 0 |
| Unit_Price | 0 |
| Units | 0 |
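
The three profiling views above can also be condensed into a single per-column table. The following is a small sketch on toy data. The function name `profile` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count and share, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_count": df.isna().sum(),
        "missing_pct": df.isna().mean().round(3),
        "n_unique": df.nunique(dropna=True),
    })

# Toy frame with one missing value per column
toy = pd.DataFrame({
    "Channel": ["Retail", "Shopee", None, "Retail"],
    "Units": [3, 6, 4, np.nan],
})
report = profile(toy)
print(report)
```

Comparing `n_unique` against the row count is a quick way to spot columns that are expected to be unique keys but are not.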


## 4) Helper: audit logger for shapes & issues

In [4]:

audit = []
def log_step(name, df_before, df_after, notes=''):
    audit.append({
        'step': name,
        'rows_before': int(df_before.shape[0]),
        'rows_after': int(df_after.shape[0]),
        'cols_before': int(df_before.shape[1]),
        'cols_after': int(df_after.shape[1]),
        'notes': notes
    })
def show_audit():
    """Return the audit log as a DataFrame, one row per cleaning step."""
    return pd.DataFrame(audit)
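
Each later cell repeats the same pattern: copy the frame, transform it, then call `log_step`. One possible way to fold that pattern into a single call is sketched below. The names `run_step` and `audit_log` are hypothetical and kept separate from the notebook's own `audit` list.

```python
import pandas as pd

audit_log = []  # separate list so this sketch does not touch the notebook's `audit`

def run_step(name, df, func, notes=""):
    """Apply `func` to a copy of `df`, record before/after shapes, and return the result."""
    before_shape = df.shape
    result = func(df.copy())
    audit_log.append({
        "step": name,
        "rows_before": before_shape[0], "rows_after": result.shape[0],
        "cols_before": before_shape[1], "cols_after": result.shape[1],
        "notes": notes,
    })
    return result

# Toy frame with one exact duplicate row
toy = pd.DataFrame({"Order_ID": ["O1", "O1", "O2"], "Units": [3, 3, 5]})
deduped = run_step("deduplicate", toy, lambda d: d.drop_duplicates(), "toy example")

audit_table = pd.DataFrame(audit_log)
print(audit_table)
```

Recording only the shape tuple before the transformation avoids keeping a second full copy of the data just for logging.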


## 5) Handle missing values (drop or fill)

In [5]:

df_before = df.copy()
missing_threshold = 0.5
drop_cols = [c for c in df.columns if df[c].isna().mean() > missing_threshold]
df = df.drop(columns=drop_cols) if drop_cols else df
notes = f'Dropped cols (>{missing_threshold:.0%} missing): {drop_cols}' if drop_cols else 'No columns dropped for missingness.'
for col in df.columns:
    if df[col].dtype == 'O':
        if df[col].isna().any():
            mode_val = df[col].mode(dropna=True)
            if len(mode_val):
                df[col] = df[col].fillna(mode_val.iloc[0])
    else:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].median())
log_step('missing_values', df_before, df, notes)
print('✅ Missing values handled.')
df.isna().sum().sum()


✅ Missing values handled.


np.int64(0)
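
Because the mock dataset has no missing values, the fill logic above never actually runs. The sketch below applies the same rule (mode for text columns, median for numeric columns) to a toy frame so the effect is visible. The function name `impute_simple` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their median and all other columns with their mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if len(modes):  # an all-missing column has no mode
                out[col] = out[col].fillna(modes.iloc[0])
    return out

toy = pd.DataFrame({
    "Channel": ["Retail", "Retail", None, "Shopee"],
    "Units": [2.0, np.nan, 4.0, 10.0],
})
filled = impute_simple(toy)
print(filled)
```

Here the missing `Channel` becomes `Retail` (the most frequent value) and the missing `Units` becomes `4.0` (the median of 2, 4 and 10).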

## 6) Fix data types (dates, categorical, numeric)

In [6]:

df_before = df.copy()
date_like = [c for c in df.columns if c.lower() in ('date','week_start','order_date')]
for c in date_like:
    df[c] = pd.to_datetime(df[c], errors='coerce')
cat_candidates = ['Category','Channel','Territory','SKU_ID','Customer_ID','Promo_Flag']
for c in cat_candidates:
    if c in df.columns:
        if c == 'Promo_Flag':
            df[c] = pd.to_numeric(df[c], errors='coerce').fillna(0).astype(int).astype('category')
        else:
            df[c] = df[c].astype(str).str.strip().astype('category')
num_candidates = ['Units','Unit_Price','Revenue']
for c in num_candidates:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')
log_step('fix_types', df_before, df, f'Dates: {date_like} | Categorical: {[c for c in cat_candidates if c in df.columns]} | Numeric: {[c for c in num_candidates if c in df.columns]}')
print('✅ Types standardized.')
df.info()


✅ Types standardized.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122280 entries, 0 to 122279
Data columns (total 12 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Order_ID     122280 non-null  object        
 1   Week_Start   122280 non-null  datetime64[ns]
 2   Customer_ID  122280 non-null  category      
 3   SKU_ID       122280 non-null  category      
 4   Category     122280 non-null  category      
 5   Territory    122280 non-null  category      
 6   Channel      122280 non-null  category      
 7   Promo_Flag   122280 non-null  category      
 8   Unit_Price   122280 non-null  float64       
 9   Units        122280 non-null  int64         
 10  Revenue      122280 non-null  float64       
 11  Orders_6m    122280 non-null  int64         
dtypes: category(6), datetime64[ns](1), float64(2), int64(2), object(1)
memory usage: 6.5+ MB
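
With `errors='coerce'`, values that cannot be parsed silently become `NaT` or `NaN`. A short sketch of how the number of coerced values could be counted, using toy strings rather than the real `Week_Start` column:

```python
import pandas as pd

raw = pd.Series(["2025-04-28", "2025-05-05", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the raw data but could not be parsed
newly_missing = int((parsed.isna() & raw.notna()).sum())
print(f"Unparseable date strings: {newly_missing}")
```

In the notebook output above, `Week_Start` still shows 122280 non-null values after conversion, so no dates were coerced to `NaT` in this dataset.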


## 7) Normalize inconsistent names (channel/territory/category)

In [7]:

df_before = df.copy()
def normalize_text(s: pd.Series) -> pd.Series:
    return (s.astype(str).str.strip().str.replace(r'\s+', ' ', regex=True).str.title())
for c in ['Channel','Territory','Category']:
    if c in df.columns:
        df[c] = normalize_text(df[c])
repl_channel = {'Shopee Sg':'Shopee','Lazada Sg':'Lazada','Retail Store':'Retail','Direct':'D2C'}
repl_territory = {'Central Region':'Central','W.':'West','E.':'East','N.':'North'}
if 'Channel' in df.columns:   df['Channel'] = df['Channel'].replace(repl_channel)
if 'Territory' in df.columns: df['Territory'] = df['Territory'].replace(repl_territory)
# normalize_text returns plain strings, so restore the category dtype set in step 6
for c in ['Channel','Territory','Category']:
    if c in df.columns:
        df[c] = df[c].astype('category')
log_step('normalize_text', df_before, df, 'Applied strip/title + manual mappings.')
print('✅ Normalization done.')
df.head(3)


✅ Normalization done.


|  | Order_ID | Week_Start | Customer_ID | SKU_ID | Category | Territory | Channel | Promo_Flag | Unit_Price | Units | Revenue | Orders_6m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O4997839985 | 2025-04-28 | C23349035 | INS01 | Instant Noodles | North | Retail | 0 | 1.18 | 3 | 3.54 | 66 |
| 1 | O6872379697 | 2025-04-28 | C49284335 | INS01 | Instant Noodles | North | Retail | 0 | 1.18 | 6 | 7.08 | 36 |
| 2 | O7033690015 | 2025-04-28 | C72946384 | INS01 | Instant Noodles | North | Retail | 0 | 1.18 | 4 | 4.72 | 47 |
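
The mock data is already clean, so the mappings have no visible effect here. The sketch below runs the same two-stage approach (strip, collapse whitespace, title-case, then apply the manual mapping) on deliberately messy toy labels.

```python
import pandas as pd

def normalize_text(s: pd.Series) -> pd.Series:
    """Strip, collapse repeated whitespace, and title-case each value."""
    return s.astype(str).str.strip().str.replace(r"\s+", " ", regex=True).str.title()

# Toy labels with stray spaces and inconsistent casing
raw = pd.Series(["  shopee   sg", "RETAIL STORE", "direct", "Lazada"])
repl_channel = {"Shopee Sg": "Shopee", "Lazada Sg": "Lazada", "Retail Store": "Retail", "Direct": "D2C"}

clean = normalize_text(raw).replace(repl_channel)
print(clean.tolist())
```

Because the mapping is applied after title-casing, its keys need to be written in title case (`'Shopee Sg'`, not `'Shopee SG'`), otherwise they will never match.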


## 8) Identify & remove duplicates

In [8]:

df_before = df.copy()
dup_count = int(df.duplicated().sum())
df = df.drop_duplicates().copy()
log_step('deduplicate', df_before, df, f'Removed {dup_count} duplicate rows.')
print(f'✅ Duplicates removed: {dup_count}')
df.shape


✅ Duplicates removed: 0


(122280, 12)
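
`df.duplicated()` compares entire rows, so it reports zero duplicates even though the `.describe()` output in section 3 shows 122279 unique `Order_ID` values across 122280 rows (one ID appears twice). A sketch of a key-based check on toy data follows. Whether a repeated `Order_ID` is an error or a legitimate multi-line order is not stated in the notebook, so this only surfaces the rows for inspection.

```python
import pandas as pd

toy = pd.DataFrame({
    "Order_ID": ["O1", "O2", "O2", "O3"],
    "SKU_ID":   ["INS01", "INS01", "RTD04", "INS01"],
    "Units":    [3, 6, 2, 4],
})

# Full-row comparison: the two O2 rows differ in SKU_ID and Units, so none are flagged
full_row_dups = int(toy.duplicated().sum())

# keep=False marks every row that shares an Order_ID, not only the later occurrences
repeated_ids = toy[toy.duplicated(subset=["Order_ID"], keep=False)]

print(full_row_dups)
print(repeated_ids)
```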

## 9) Outlier detection & treatment (IQR winsorization)

In [9]:

df_before = df.copy()
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
def iqr_cap(s: pd.Series, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k*iqr, q3 + k*iqr
    return np.clip(s, lower, upper), (float(lower), float(upper))
bounds = {}
for col in num_cols:
    capped, (lo, hi) = iqr_cap(df[col].astype(float), k=1.5)
    bounds[col] = (lo, hi)
    df[col] = capped
log_step('outliers_iqr', df_before, df, f'IQR caps applied to: {num_cols}')
print('✅ Outliers capped using IQR.')
bounds


{'Unit_Price': (-0.08500000000000019, 3.3150000000000004),
 'Units': (-7.0, 17.0),
 'Revenue': (-12.575000000000001, 27.705),
 'Orders_6m': (5.5, 113.5)}
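
To make the capping rule concrete, the sketch below computes the same bounds, Q1 - k*IQR and Q3 + k*IQR, on a five-value toy series and counts how many values fall outside them. The helper name `iqr_bounds` is an assumption for illustration.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) capping bounds based on the interquartile range."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

units = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_bounds(units)  # q1 = 2, q3 = 4, iqr = 2

# Count values outside the bounds before changing anything
n_capped = int(((units < lower) | (units > upper)).sum())
capped = units.clip(lower, upper)

print(lower, upper, n_capped, capped.tolist())
```

In the notebook's own `bounds` output, the lower limits for `Unit_Price`, `Units` and `Revenue` are negative, so only the upper caps can bind for those columns. The `Unit_Price` upper limit (about 3.315) is also above the maximum of 3.3 reported by `.describe()`, which means no price is changed.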

## 10) Recompute dependent fields (Revenue)

In [10]:

df_before = df.copy()
if set(['Units','Unit_Price','Revenue']).issubset(df.columns):
    df['Revenue'] = df['Units'] * df['Unit_Price']
    note = 'Revenue recomputed from Units × Unit_Price.'
else:
    note = 'Revenue columns not all present; skipped.'
log_step('recompute_revenue', df_before, df, note)
print('✅', note)
df[['Units','Unit_Price','Revenue']].head() if set(['Units','Unit_Price','Revenue']).issubset(df.columns) else df.head(2)


✅ Revenue recomputed from Units × Unit_Price.


|  | Units | Unit_Price | Revenue |
|---|---|---|---|
| 0 | 3.0 | 1.18 | 3.54 |
| 1 | 6.0 | 1.18 | 7.08 |
| 2 | 4.0 | 1.18 | 4.72 |
| 3 | 3.0 | 1.18 | 3.54 |
| 4 | 8.0 | 1.18 | 9.44 |
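
Overwriting `Revenue` hides any rows where the stored value disagreed with `Units` × `Unit_Price`. A sketch of a consistency check that could run before the overwrite is shown below, on toy data. The one-cent tolerance is an assumption chosen to absorb floating-point rounding.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Units":      [3, 6, 4],
    "Unit_Price": [1.18, 1.18, 1.18],
    "Revenue":    [3.54, 7.08, 9.99],  # last row is deliberately inconsistent
})

expected = toy["Units"] * toy["Unit_Price"]

# Flag rows where stored revenue differs from the product by more than about one cent
mismatch = ~np.isclose(toy["Revenue"], expected, atol=0.01)
n_mismatch = int(mismatch.sum())

print(toy[mismatch])
```

Logging `n_mismatch` in the audit notes would record how many rows the recomputation actually changed.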


## 11) Audit trail: steps & shapes

In [11]:

show_audit()


|  | step | rows_before | rows_after | cols_before | cols_after | notes |
|---|---|---|---|---|---|---|
| 0 | missing_values | 122280 | 122280 | 12 | 12 | No columns dropped for missingness. |
| 1 | fix_types | 122280 | 122280 | 12 | 12 | Dates: ['Week_Start'] \| Categorical: ['Categor... |
| 2 | normalize_text | 122280 | 122280 | 12 | 12 | Applied strip/title + manual mappings. |
| 3 | deduplicate | 122280 | 122280 | 12 | 12 | Removed 0 duplicate rows. |
| 4 | outliers_iqr | 122280 | 122280 | 12 | 12 | IQR caps applied to: ['Unit_Price', 'Units', '... |
| 5 | recompute_revenue | 122280 | 122280 | 12 | 12 | Revenue recomputed from Units × Unit_Price. |


## 12) Summary statistics & export cleaned data

In [12]:
from pathlib import Path

# Force base directory to the correct project root
BASE_DIR = Path(r"C:\Users\inchr\Downloads\Capstone Associate Data Analyst\omnichannel-growth-engine")
CLEAN_DIR = BASE_DIR / "clean"
CLEAN_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

out_path = CLEAN_DIR / "cleaned_fmcg_omnichannel_sales.csv"
df.to_csv(out_path, index=False)
print("✅ Cleaned dataset saved to:", out_path.resolve())

✅ Cleaned dataset saved to: C:\Users\inchr\Downloads\Capstone Associate Data Analyst\omnichannel-growth-engine\clean\cleaned_fmcg_omnichannel_sales.csv
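
The section heading mentions summary statistics, but the cell above only writes the cleaned CSV. One way a per-channel summary could be computed and exported alongside it is sketched below on toy data. The file name `summary_by_channel.csv` and the choice of grouping column are assumptions, and a temporary directory stands in for `CLEAN_DIR`.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "Channel": ["Retail", "Retail", "Shopee"],
    "Units":   [3, 6, 4],
    "Revenue": [3.54, 7.08, 4.72],
})

# One row per channel: number of order lines, total units, total revenue
summary = (
    toy.groupby("Channel", as_index=False)
       .agg(orders=("Units", "size"), units=("Units", "sum"), revenue=("Revenue", "sum"))
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "clean"
    out_dir.mkdir(parents=True, exist_ok=True)
    summary_path = out_dir / "summary_by_channel.csv"  # hypothetical file name
    summary.to_csv(summary_path, index=False)
    reloaded = pd.read_csv(summary_path)  # read back to confirm the export round-trips

print(reloaded)
```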
