# Online Retail (UCI) – Initial EDA by Aidan

This notebook begins the data analysis for the **Online Retail** dataset from the UCI Machine Learning Repository.

**Dataset:** https://archive.ics.uci.edu/ml/datasets/Online+Retail  
**Direct download (Excel):** https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx


## Project kickoff: scope & questions
**Five questions to explore (later in the project):**
1. Which products generate the most revenue (top-10 SKUs)?  
2. How do sales trend over time (monthly/seasonal)?  
3. How do customers segment under RFM (Recency, Frequency, Monetary)?  
4. Which countries (outside the UK) contribute most to international revenue?  
5. What’s the relationship between unit price and quantity sold?
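
As a rough preview of question 3, the three RFM measures could be computed per customer along the following lines. This is a minimal sketch on made-up toy rows that reuse the dataset's column names; the `Revenue` column, the `snapshot` date (one day after the last invoice), and the output column names are assumptions for illustration, not decisions made in this notebook.

```python
import pandas as pd

# Toy transactions using the dataset's column names (values are made up).
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "CustomerID": [1.0, 1.0, 2.0, 1.0],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-02-10", "2011-03-01"]
    ),
    "Quantity": [2, 1, 5, 3],
    "UnitPrice": [10.0, 4.0, 2.0, 1.0],
})
toy["Revenue"] = toy["Quantity"] * toy["UnitPrice"]  # assumed revenue definition

# Assumption: recency is measured from the day after the last invoice in the data.
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    recency_days=("InvoiceDate", lambda s: (snapshot - s.max()).days),  # days since last purchase
    frequency=("InvoiceNo", "nunique"),                                 # distinct invoices
    monetary=("Revenue", "sum"),                                        # total spend
)
print(rfm)
```

On the real data, rows without a `CustomerID` would have to be handled before grouping, since they cannot be attributed to a customer.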

This notebook focuses on **initial exploratory analysis** to understand the dataset’s structure and quality.


## Setup
If you're in Google Colab, run the next cell to install `openpyxl`, the engine pandas uses to read `.xlsx` files:


In [1]:
# If needed (e.g., in Google Colab):
!pip install openpyxl



In [2]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option('display.max_columns', None)  # None = no limit, so all columns are shown
pd.set_option('display.width', 120)

## Load data
We'll load directly from the UCI URL. If you already downloaded the file locally, you can point to that path instead.


In [3]:
# Option A: Load directly from URL
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
df = pd.read_excel(url, engine='openpyxl')  # requires openpyxl

# Option B: Local path (uncomment and set your path)
# local_path = 'Online Retail.xlsx'
# df = pd.read_excel(local_path, engine='openpyxl')

print('Rows, Columns:', df.shape)
df.head()

Rows, Columns: (541909, 8)


index,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
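
Options A and B can be combined so that a local copy is preferred when it exists and the URL is used otherwise. The helper below is a hypothetical sketch (`resolve_source` is not part of pandas or of this notebook); it only picks the source string, and the actual `read_excel` call is left commented out so the example runs without network access.

```python
from pathlib import Path

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"


def resolve_source(local_path: str, url: str = URL) -> str:
    """Return local_path if that file exists, otherwise fall back to the URL."""
    path = Path(local_path)
    return str(path) if path.is_file() else url


source = resolve_source("Online Retail.xlsx")
print(source)

# With pandas and openpyxl available, the load itself would then be:
# df = pd.read_excel(source, engine="openpyxl")
```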


## Basic properties
Use core inspection methods to understand the dataset:
- `describe()`  
- `columns`  
- `shape`  
- `dtypes`  
- `head()`, `tail()`, `sample()`  
- `info()`


In [4]:
df.columns.tolist()

['InvoiceNo',
 'StockCode',
 'Description',
 'Quantity',
 'InvoiceDate',
 'UnitPrice',
 'CustomerID',
 'Country']

In [5]:
df.shape

(541909, 8)

In [6]:
df.dtypes

column,dtype
InvoiceNo,object
StockCode,object
Description,object
Quantity,int64
InvoiceDate,datetime64[ns]
UnitPrice,float64
CustomerID,float64
Country,object


In [7]:
df.head(10)

index,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,2010-12-01 08:26:00,7.65,17850.0,United Kingdom
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,2010-12-01 08:26:00,4.25,17850.0,United Kingdom
7,536366,22633,HAND WARMER UNION JACK,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
8,536366,22632,HAND WARMER RED POLKA DOT,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
9,536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,2010-12-01 08:34:00,1.69,13047.0,United Kingdom


In [8]:
df.tail(10)

index,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541899,581587,22726,ALARM CLOCK BAKELIKE GREEN,4,2011-12-09 12:50:00,3.75,12680.0,France
541900,581587,22730,ALARM CLOCK BAKELIKE IVORY,4,2011-12-09 12:50:00,3.75,12680.0,France
541901,581587,22367,CHILDRENS APRON SPACEBOY DESIGN,8,2011-12-09 12:50:00,1.95,12680.0,France
541902,581587,22629,SPACEBOY LUNCH BOX,12,2011-12-09 12:50:00,1.95,12680.0,France
541903,581587,23256,CHILDRENS CUTLERY SPACEBOY,4,2011-12-09 12:50:00,4.15,12680.0,France
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [9]:
df.sample(10, random_state=7)

index,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
343961,566976,23207,LUNCH BAG ALPHABET DESIGN,20,2011-09-16 09:27:00,1.65,15382.0,United Kingdom
482283,577469,22364,GLASS JAR DIGESTIVE BISCUITS,1,2011-11-20 11:25:00,2.95,15009.0,United Kingdom
333437,566195,22200,FRYING PAN PINK POLKADOT,24,2011-09-09 13:44:00,3.75,12433.0,Norway
226664,556812,20677,PINK POLKADOT BOWL,30,2011-06-14 17:25:00,2.46,,United Kingdom
185080,552730,84821,DANISH ROSE DELUXE COASTER,12,2011-05-11 10:42:00,0.85,16837.0,United Kingdom
126806,547101,22207,FRYING PAN UNION FLAG,24,2011-03-21 10:34:00,3.75,16029.0,United Kingdom
409626,572066,22737,RIBBON REEL CHRISTMAS PRESENT,10,2011-10-20 13:07:00,1.65,15159.0,United Kingdom
196954,553879,21937,STRAWBERRY PICNIC BAG,10,2011-05-19 15:15:00,2.95,15791.0,United Kingdom
525800,580638,21976,PACK OF 60 MUSHROOM CAKE CASES,24,2011-12-05 12:44:00,0.55,12381.0,Norway
457289,575739,21945,STRAWBERRIES DESIGN FLANNEL,1,2011-11-11 09:05:00,1.63,,United Kingdom


In [10]:
df.describe(include='all').T

column,count,unique,top,freq,mean,min,25%,50%,75%,max,std
InvoiceNo,541909.0,25900.0,573585.0,1114.0,,,,,,,
StockCode,541909.0,4070.0,85123A,2313.0,,,,,,,
Description,540455.0,4223.0,WHITE HANGING HEART T-LIGHT HOLDER,2369.0,,,,,,,
Quantity,541909.0,,,,9.55225,-80995.0,1.0,3.0,10.0,80995.0,218.081158
InvoiceDate,541909.0,,,,2011-07-04 13:34:57.156386048,2010-12-01 08:26:00,2011-03-28 11:34:00,2011-07-19 17:17:00,2011-10-19 11:27:00,2011-12-09 12:50:00,
UnitPrice,541909.0,,,,4.611114,-11062.06,1.25,2.08,4.13,38970.0,96.759853
CustomerID,406829.0,,,,15287.69057,12346.0,13953.0,15152.0,16791.0,18287.0,1713.600303
Country,541909.0,38.0,United Kingdom,495478.0,,,,,,,
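
The summary above shows a minimum `Quantity` of -80995 and a minimum `UnitPrice` of -11062.06, so a follow-up check on negative values seems worthwhile. The sketch below counts such rows on made-up toy data. It also counts invoices whose number starts with `C`, which the UCI dataset description identifies as cancellations; whether that explains the negative quantities in the full data is something to verify, not something this sketch establishes.

```python
import pandas as pd

# Toy rows with the dataset's column names (values are made up).
toy = pd.DataFrame({
    "InvoiceNo": [1001, "C1002", 1003, 1004],  # mixed int/str, like an object column
    "Quantity": [6, -1, 6, 1],
    "UnitPrice": [2.55, 27.50, 1.85, -5.00],
})

neg_qty = toy["Quantity"] < 0
neg_price = toy["UnitPrice"] < 0
# Cast to str first because the column mixes integers and strings.
is_cancel = toy["InvoiceNo"].astype(str).str.startswith("C")

summary = pd.Series({
    "negative_quantity": int(neg_qty.sum()),
    "negative_unit_price": int(neg_price.sum()),
    "cancellation_invoices": int(is_cancel.sum()),
})
print(summary)
```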


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
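
`CustomerID` shows up as `float64` because it contains missing values (406829 non-null out of 541909), and pandas falls back to float for integer columns with NaN. If integer IDs are preferred, one option is the nullable `Int64` dtype. The sketch below uses made-up toy values and assumes every non-missing ID is a whole number.

```python
import numpy as np
import pandas as pd

# Toy column mimicking CustomerID: whole-number floats plus a missing value.
toy = pd.DataFrame({"CustomerID": [17850.0, np.nan, 13047.0]})

# Nullable integer dtype: missing entries become <NA> instead of forcing float.
toy["CustomerID"] = toy["CustomerID"].astype("Int64")

print(toy["CustomerID"].dtype)
print(toy["CustomerID"].isna().sum())
```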


## Quick data hygiene checks
Parse dates and look at missing values and duplicates.


In [12]:
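# read_excel already returned InvoiceDate as datetime64[ns] (see dtypes above),
# so this re-parse is only a safeguard, e.g. if the data is reloaded from CSV
if 'InvoiceDate' in df.columns:
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], errors='coerce')  # invalid values become NaT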
df['InvoiceDate'].head() if 'InvoiceDate' in df.columns else 'Column InvoiceDate not present'

index,InvoiceDate
0,2010-12-01 08:26:00
1,2010-12-01 08:26:00
2,2010-12-01 08:26:00
3,2010-12-01 08:26:00
4,2010-12-01 08:26:00


In [13]:
# Missing values overview
missing = df.isna().sum().sort_values(ascending=False)
missing[missing>0].to_frame('missing_count')

column,missing_count
CustomerID,135080
Description,1454
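
Raw counts are easier to judge next to their share of all rows: 135080 of 541909 rows is roughly a quarter of the data without a `CustomerID`, while the missing descriptions are well under 1%. A minimal sketch of a combined count-and-percentage report, on made-up toy data (the `missing_pct` column name is an assumption):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps (values are made up).
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "BOWL"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Country": ["United Kingdom"] * 4,
})

missing_report = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),  # share of rows, in percent
})
# Keep only columns that actually have gaps, largest first.
missing_report = missing_report[missing_report["missing_count"] > 0]
missing_report = missing_report.sort_values("missing_count", ascending=False)
print(missing_report)
```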


In [14]:
# Duplicate rows count
dup_count = df.duplicated().sum()
dup_count

np.int64(5268)
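
`df.duplicated()` flags only the second and later copies of a row, so the count above is the number of rows that `drop_duplicates()` would remove. To look at the duplicate groups in full before deciding, `keep=False` marks every member. A small sketch on made-up toy rows; whether identical lines are data-entry repeats or genuine repeated purchases is a judgment call this sketch does not settle.

```python
import pandas as pd

# Toy rows: the first two are exact duplicates (values are made up).
toy = pd.DataFrame({
    "InvoiceNo": [1001, 1001, 1001, 1002],
    "StockCode": ["21866", "21866", "22111", "22327"],
    "Quantity": [1, 1, 1, 1],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dup_groups = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each fully identical row.
deduped = toy.drop_duplicates()

print(len(dup_groups), len(toy) - len(deduped))
```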

## Quick peeks
Some fast frequency tables and sanity checks.


In [15]:
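# Top countries by row count (quick proxy for activity)
# Jupyter only displays the last top-level expression of a cell, so a bare
# name inside an if/else block prints nothing; build the result first.
country_counts = (
    df['Country'].value_counts().head(10)
    if 'Country' in df.columns
    else 'Column Country not present'
)
country_counts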
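
Row counts measure how many invoice lines a country has, not how much it spends, so the two rankings can differ. The sketch below contrasts them on made-up toy data; `Revenue` as `Quantity * UnitPrice` is an assumed definition that this notebook has not yet adopted.

```python
import pandas as pd

# Toy rows: many small UK lines, one large French line (values are made up).
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France", "Norway"],
    "Quantity": [1, 2, 1, 50, 10],
    "UnitPrice": [2.0, 1.0, 3.0, 4.0, 5.0],
})

# Ranking by number of rows.
row_counts = toy["Country"].value_counts()

# Ranking by summed line revenue (assumed: Quantity * UnitPrice).
revenue = (
    toy.assign(Revenue=toy["Quantity"] * toy["UnitPrice"])
       .groupby("Country")["Revenue"]
       .sum()
       .sort_values(ascending=False)
)

print(row_counts.head(10))
print(revenue.head(10))
```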

In [16]:
# Unique SKUs / products if present
sku_cols = [c for c in df.columns if c.lower() in ('stockcode','stock_code','sku','product_code','productid','product_id')]
desc_cols = [c for c in df.columns if c.lower() in ('description','product','item_description','product_name')]
{
    'sku_columns_found': sku_cols,
    'description_columns_found': desc_cols,
    'n_unique_sku': df[sku_cols[0]].nunique() if sku_cols else None
}

{'sku_columns_found': ['StockCode'],
 'description_columns_found': ['Description'],
 'n_unique_sku': 4070}

## Save a working copy (optional)
Save a CSV to include with your submission or for faster reloads next time.


In [17]:
out_csv = 'online_retail_working_copy.csv'
df.to_csv(out_csv, index=False)
Path(out_csv).resolve()

PosixPath('/content/online_retail_working_copy.csv')
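
CSV does not store dtypes, so a plain `pd.read_csv` on the working copy would bring `InvoiceDate` back as text and could turn purely numeric codes into integers. A minimal sketch of a reload that restores them, using a made-up toy frame and a temporary file (the file name is hypothetical):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with the dataset's column names (values are made up).
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "C1002"],
    "StockCode": ["85123A", "22752"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26:00", "2010-12-01 09:41:00"]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy_working_copy.csv"  # hypothetical file name
    toy.to_csv(csv_path, index=False)

    reloaded = pd.read_csv(
        csv_path,
        dtype={"InvoiceNo": str, "StockCode": str},  # keep codes as text
        parse_dates=["InvoiceDate"],                 # restore datetimes
    )

print(reloaded.dtypes)
```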

---
## Submission checklist (for this stage)
- Your **Jupyter Notebook** (this file) with markdown documentation and initial EDA cells executed.  
- Your **dataset file** (Excel) or the saved CSV copy.
