Online Retail UK - Data Quality & Exploratory Analysis

Project Overview

This project performs comprehensive data quality assessment and exploratory data analysis (EDA) on the Online Retail dataset from a UK-based non-store online retail company. The dataset contains transactional data from December 2010 to December 2011, including information about invoices, products, quantities, prices, customers, and countries.

The analysis focuses on:

Data quality assessment and validation
Outlier detection and treatment
Missing value handling
Data standardization and optimization
Exploratory data analysis to uncover business insights

Dataset Information

Original Dataset:

Records: 541,909 transactions
Features: 9 columns (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country, Revenue)
Date Range: December 1, 2010 - December 9, 2011
Source: UK-based online retail company

Key Columns:

InvoiceNo: Invoice number (categorical)
StockCode: Product code (categorical)
Description: Product name (categorical)
Quantity: Quantity of items purchased (numerical)
InvoiceDate: Date and time of transaction (datetime)
UnitPrice: Price per unit (numerical)
CustomerID: Customer identifier (categorical)
Country: Customer's country (categorical)
Revenue: Total revenue (Quantity × UnitPrice) (numerical)

Key Analysis Steps

1. Data Quality Assessment

Outlier Detection:

Used both IQR (Interquartile Range) and Z-score methods
IQR method identified: 58,619 outliers in Quantity, 39,627 in UnitPrice, 44,997 in Revenue
Z-score method (threshold > 3) identified: 346 outliers in Quantity, 374 in UnitPrice, 403 in Revenue
Visualized outliers using boxplots and scatter plots

Duplicate Records:

Detected and removed 5,268 duplicate rows

Missing Values:

Original dataset: 1,454 missing values in Description, 135,080 in CustomerID
After cleaning: All missing values handled

2. Outlier Handling

Treatment Strategy:

Removed extreme outliers based on Z-score > 3 threshold
Analyzed outlier patterns by:
- Product descriptions (e.g., "DOTCOM POSTAGE", "Manual")
- Customer IDs (identified high-frequency outlier customers)
- Countries (United Kingdom had the most outlier transactions)
Rows reduced from 541,909 to 540,997 after outlier removal

3. Missing Value Imputation

Imputation Strategy:

Description column: Imputed with mode ("WHITE HANGING HEART T-LIGHT HOLDER")
CustomerID column: Imputed with placeholder value "Unknown" for non-identifiable transactions

4. Data Standardization

Text Standardization:

Converted Description, Country, and CustomerID to lowercase
Stripped whitespace from text fields

Numerical Standardization:

Applied StandardScaler to Quantity, UnitPrice, and Revenue
Ensures all numerical features have mean ≈ 0 and standard deviation ≈ 1

Data Type Optimization:

Converted InvoiceDate to datetime objects
Converted categorical columns to category dtype for memory efficiency
Properly typed numerical columns

5. Exploratory Data Analysis (EDA)

Univariate Analysis:

Distribution analysis of numerical variables (Quantity, UnitPrice, Revenue)
Frequency analysis of categorical variables
Top 10 countries: United Kingdom dominates with 489,486 transactions
Top products: "WHITE HANGING HEART T-LIGHT HOLDER" (3,785 occurrences)
Top customers identified by transaction frequency

Bivariate Analysis:

Correlation analysis between numerical variables:
- Strong positive correlation (0.646) between Quantity and Revenue
- Weak positive correlation (0.144) between UnitPrice and Revenue
- Weak negative correlation (-0.080) between Quantity and UnitPrice
Visualized relationships using scatter plots and heatmaps
Box plots comparing distributions across top 10 countries

Major Findings

Data Quality Insights

Outliers: Both detection methods revealed significant outliers, with IQR being more sensitive. The Z-score method was chosen for removal to preserve more data while eliminating extreme values.
Missing Customer IDs: 24.9% of transactions had missing CustomerID values, suggesting anonymous purchases or data collection issues.
Duplicates: 0.97% of records were exact duplicates, indicating potential data entry issues.
Data Reduction: Final cleaned dataset contains 535,731 records (98.9% of original data retained).

Business Insights

Geographic Concentration:
- United Kingdom accounts for ~91% of all transactions
- Other significant markets: Germany (1.8%), France (1.6%), EIRE (1.5%)
Product Popularity:
- Home décor and gift items dominate (t-light holders, cake stands, retrospot bags)
- "WHITE HANGING HEART T-LIGHT HOLDER" is the most frequently purchased item
Customer Behavior:
- Top customer (17841.0) made 7,812 transactions
- Significant variance in customer purchase frequency
Revenue Patterns:
- Revenue strongly correlates with quantity purchased
- Unit price has minimal impact on total revenue correlation

How to Run the Notebook

Prerequisites

pip install pandas numpy matplotlib seaborn plotly scikit-learn scipy

Running in Google Colab

Upload the Dataset:
- Open the notebook in Google Colab
- Upload your data.csv file when prompted by the file upload widget
- Ensure the file is saved as /content/data.csv
Run the Cells:
- Execute cells sequentially from top to bottom
- The notebook is organized into sections:
  - Setup and Data Cleaning
  - Data Overview
  - Data Quality Assessment & Outlier Detection
  - Outlier Investigation
  - Outlier Treatment Decisions
  - Missing Data Handling
  - Data Type Optimization & Standardization
  - Cleaned Data Validation
  - Exploratory Data Analysis (EDA)
Export Cleaned Data (Optional):
- Uncomment the export line in the final cell:
```
df_cleaned.to_csv('cleaned_data.csv', index=False)
```
- Download the cleaned dataset from Colab

Key Libraries Used

pandas: Data manipulation and analysis
numpy: Numerical computing
matplotlib & seaborn: Data visualization
plotly: Interactive visualizations
scikit-learn: StandardScaler for data normalization
scipy: Statistical functions (Z-score calculation)

Project Structure

Online-Retail-UK/
│
├── README.md                          # This file
├── Online_retail_UK.ipynb             # Main Jupyter notebook with analysis
├── data.csv                           # Original dataset (not included)
└── cleaned_data.csv                   # Cleaned dataset (generated after running notebook)

Results Summary

Before Cleaning:

541,909 total records
1,454 missing descriptions
135,080 missing customer IDs
5,268 duplicate rows
Significant outliers in all numerical columns

After Cleaning:

535,731 total records (98.9% retention)
0 missing values
0 duplicate rows
Outliers removed (Z-score > 3)
Standardized text and numerical data

Key Metrics:

Data quality improved significantly
All columns properly typed and optimized
Ready for downstream analysis and modeling

Future Work

Customer Segmentation: Apply clustering algorithms (K-means, DBSCAN) to identify customer segments
Time Series Analysis: Analyze seasonal trends and patterns in purchasing behavior
Market Basket Analysis: Identify frequently co-purchased products using association rules
Predictive Modeling: Build models to predict customer churn, revenue, or product demand
Cohort Analysis: Study customer retention and behavior over time

License

This project is open source and available under the MIT License.

Contact

For questions or feedback, please contact: tranquangduc242@gmail.com

Acknowledgments

Dataset source: UCI Machine Learning Repository - Online Retail Dataset
Analysis performed as part of data quality and EDA practice project

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Online_retail_UK.ipynb		Online_retail_UK.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Online Retail UK - Data Quality & Exploratory Analysis

Project Overview

Dataset Information

Key Analysis Steps

1. Data Quality Assessment

2. Outlier Handling

3. Missing Value Imputation

4. Data Standardization

5. Exploratory Data Analysis (EDA)

Major Findings

Data Quality Insights

Business Insights

How to Run the Notebook

Prerequisites

Running in Google Colab

Key Libraries Used

Project Structure

Results Summary

Future Work

License

Contact

Acknowledgments

About

Uh oh!

Releases

Packages

Languages

databypeter/Python

Folders and files

Latest commit

History

Repository files navigation

Online Retail UK - Data Quality & Exploratory Analysis

Project Overview

Dataset Information

Key Analysis Steps

1. Data Quality Assessment

2. Outlier Handling

3. Missing Value Imputation

4. Data Standardization

5. Exploratory Data Analysis (EDA)

Major Findings

Data Quality Insights

Business Insights

How to Run the Notebook

Prerequisites

Running in Google Colab

Key Libraries Used

Project Structure

Results Summary

Future Work

License

Contact

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages