# Online Retail (UCI) – Data Cleaning by Aidan

Continuation of the final project. This notebook builds on the initial exploratory data analysis and focuses on **data cleaning**.

## Load Dataset
We'll load the same dataset used in the previous stage. This can be from the UCI URL or from a saved local copy (e.g., the working CSV created earlier).

In [1]:
import pandas as pd
import numpy as np

# Option A: Directly from UCI URL
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
df = pd.read_excel(url, engine='openpyxl')

# Option B: From saved CSV
# df = pd.read_csv('online_retail_working_copy.csv')

print('Initial shape:', df.shape)
df.head()

Initial shape: (541909, 8)


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


## Step 1: Handle Missing Values
Count the missing values in each column, then decide whether to fill them or drop the affected rows.

In [2]:
missing_counts = df.isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

| Column | Missing values |
|---|---|
| CustomerID | 135080 |
| Description | 1454 |
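
Raw counts are easier to judge as a share of all rows. The sketch below is illustrative only: it uses a small made-up frame named `toy` (not the real dataset) to show how `isna().mean()` gives the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the retail data (values are made up for illustration)
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
    "Description": ["LANTERN", None, "HOLDER", "MUG"],
    "Quantity": [6, 1, 8, 2],
})

# isna().mean() is the fraction of missing values per column; scale to percent
missing_share = toy.isna().mean().mul(100).round(1)

# Keep only columns that actually have gaps, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Applied to the counts reported above, 135080 of 541909 rows is roughly 24.9% for `CustomerID`, and 1454 rows is roughly 0.27% for `Description`.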


In [3]:
# Example cleaning: Drop rows missing CustomerID (common in this dataset for incomplete transactions)
if 'CustomerID' in df.columns:
    df = df.dropna(subset=['CustomerID'])
print('After dropping missing CustomerID:', df.shape)

After dropping missing CustomerID: (406829, 8)


In [4]:
# Replace any remaining missing descriptions with a fixed placeholder (optional strategy).
# Note: this does not look descriptions up by StockCode; every gap gets the same label.
if 'Description' in df.columns:
    df['Description'] = df['Description'].fillna('No description')
df['Description'].isna().sum()

np.int64(0)
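
The fill above uses one fixed label for every gap. If the goal is to recover a description from `StockCode` where possible, one option is sketched below. It is an alternative under assumptions, not what this notebook does: the `toy` frame and the `lookup` name are made up, and the most frequent known description per stock code is assumed to be the right one.

```python
import pandas as pd

# Toy frame (made-up rows): one known product with a gap, one unknown product
toy = pd.DataFrame({
    "StockCode": ["71053", "71053", "71053", "99999"],
    "Description": ["WHITE METAL LANTERN", None, "WHITE METAL LANTERN", None],
})

# Build a StockCode -> description lookup from rows where the description is known
known = toy.dropna(subset=["Description"])
lookup = known.groupby("StockCode")["Description"].agg(lambda s: s.mode().iloc[0])

# Fill from the lookup first, then fall back to the placeholder
toy["Description"] = (
    toy["Description"]
    .fillna(toy["StockCode"].map(lookup))
    .fillna("No description")
)
print(toy)
```

Stock codes with no known description anywhere still end up with the placeholder, so the final missing count is zero either way.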

## Step 2: Remove Duplicates
Exact duplicate rows would be counted twice in any totals. Count them first, then drop them.

In [5]:
dup_count = df.duplicated().sum()
print('Duplicate rows:', dup_count)
df = df.drop_duplicates()
print('After removing duplicates:', df.shape)

Duplicate rows: 5225
After removing duplicates: (401604, 8)
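
Before dropping duplicates it can help to look at them. The sketch below uses a made-up `toy` frame to show the difference between `duplicated()` with the default `keep='first'`, which counts only the extra copies, and `keep=False`, which marks every row in a duplicate group.

```python
import pandas as pd

# Toy frame (made-up rows): rows 0 and 1 are identical
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["71053", "71053", "85123A", "71053"],
    "Quantity": [6, 6, 6, 6],
})

# keep=False flags all members of a duplicate group, useful for inspection
all_copies = toy[toy.duplicated(keep=False)]

# The default keep='first' flags only the repeats, which is what drop_duplicates removes
extra_copies = int(toy.duplicated().sum())

deduped = toy.drop_duplicates()
print(len(all_copies), extra_copies, len(deduped))
```

The number of rows removed by `drop_duplicates()` equals the default `duplicated().sum()`, which matches the drop from 406829 to 401604 rows reported above.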


## Step 3: Handle Outliers
Outliers in Quantity and UnitPrice may be due to returns (negative quantities) or data entry errors.

In [6]:
# Check basic stats
df[['Quantity', 'UnitPrice']].describe()

| | Quantity | UnitPrice |
|---|---|---|
| count | 401604.0 | 401604.0 |
| mean | 12.183273 | 3.474064 |
| std | 250.283037 | 69.764035 |
| min | -80995.0 | 0.0 |
| 25% | 2.0 | 1.25 |
| 50% | 5.0 | 1.95 |
| 75% | 12.0 | 3.75 |
| max | 80995.0 | 38970.0 |


In [7]:
# Remove rows with non-positive Quantity or UnitPrice (common cleaning choice for sales analysis)
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
print('After removing non-positive Quantity/UnitPrice:', df.shape)

After removing non-positive Quantity/UnitPrice: (392692, 8)
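
The filter above removes only non-positive values, so very large positive values remain. One common way to flag them is Tukey's 1.5 × IQR rule, sketched below. This is a swapped-in technique that the notebook does not apply, and the `quantity` series is made-up data; whether flagged rows should be dropped or kept depends on the analysis.

```python
import pandas as pd

# Made-up quantities with one extreme value
quantity = pd.Series([1, 2, 2, 5, 5, 6, 12, 12, 24, 80995], name="Quantity")

# Interquartile range and the upper Tukey fence
q1, q3 = quantity.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Flag rather than delete, so the decision stays reversible
is_extreme = quantity > upper_fence
print(quantity[is_extreme])
```

Flagging instead of filtering keeps large but legitimate bulk orders available for review.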


## Step 4: Standardize Column Names
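Lowercase the column names and replace any spaces with underscores for easier referencing in code. The original names contain no spaces, so the result is all lowercase with no underscores (for example, `InvoiceNo` becomes `invoiceno`).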

In [8]:
df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
df.columns

Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate',
       'unitprice', 'customerid', 'country'],
      dtype='object')
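
The resulting names contain no underscores because the original CamelCase names have no spaces to replace. If snake_case names such as `invoice_no` are wanted, a regex-based conversion is one option. The helper `to_snake_case` below is hypothetical and not part of this notebook.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a CamelCase column name such as 'InvoiceNo' to 'invoice_no'."""
    name = name.strip().replace(" ", "_")
    # Separate an acronym from a following capitalized word, e.g. 'IDNumber' -> 'ID_Number'
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return name.lower()

# Original column names as shown in the first output of this notebook
original = ["InvoiceNo", "StockCode", "Description", "Quantity",
            "InvoiceDate", "UnitPrice", "CustomerID", "Country"]
renamed = [to_snake_case(c) for c in original]
print(renamed)
```

With a DataFrame, the same helper could be applied as `df.columns = [to_snake_case(c) for c in df.columns]`.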

## Step 5: Save Cleaned Data
Export the cleaned dataset for later analysis stages.

In [9]:
cleaned_path = 'online_retail_cleaned.csv'
df.to_csv(cleaned_path, index=False)
cleaned_path

'online_retail_cleaned.csv'
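
CSV stores no type information, so dates come back as strings and identifiers may be re-inferred as numbers when the file is read again. The sketch below shows one way to restore types on load. It uses a made-up `toy` frame and an in-memory buffer in place of the real file, and the dtype choices are assumptions rather than part of this notebook.

```python
import io
import pandas as pd

# Toy frame (made-up rows) using the cleaned column names
toy = pd.DataFrame({
    "invoiceno": ["536365", "536366"],
    "invoicedate": pd.to_datetime(["2010-12-01 08:26:00", "2010-12-01 08:28:00"]),
    "customerid": [17850.0, 13047.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore types explicitly: keep invoice numbers as text and parse the dates
restored = pd.read_csv(
    buffer,
    dtype={"invoiceno": "string"},
    parse_dates=["invoicedate"],
)

# Customer IDs were floats only because the column once held NaN; cast to nullable integer
restored["customerid"] = restored["customerid"].astype("Int64")
print(restored.dtypes)
```

For the real file, the same `dtype` and `parse_dates` arguments could be passed to `pd.read_csv('online_retail_cleaned.csv', ...)`.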

---
## Summary of Cleaning Steps
1. Dropped rows with missing `CustomerID`.
2. Filled missing `Description` values with 'No description'.
3. Removed exact duplicate rows.
4. Removed rows with non-positive `Quantity` or `UnitPrice`.
5. Standardized column names to lowercase (any spaces would become underscores).
6. Saved cleaned dataset to CSV for further analysis.