### Data Exploration and Cleaning

Run this notebook once to read the raw Excel file, convert it to a `.csv` file, inspect the data, and clean it. The data paths are loaded from the `.env` file.

In [12]:
# Run once to load environment variables into the notebook environment
# If .env file is updated (rare), restart the kernel and rerun

%load_ext dotenv
%dotenv

# Install Excel engine dependency
%pip install openpyxl

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl

Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5
Note: you may need to restart the kernel to use updated packages.
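
Since the rest of the notebook depends on values from `.env`, checking early that the expected variables are set can catch a misconfiguration before any file is read. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` is a hypothetical helper, and only `RAW_DATA_FILE` is treated as required because the path variables have defaults in the next cell.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    # Default to the process environment, which %dotenv has populated
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumption: RAW_DATA_FILE has no default, so it must come from .env
REQUIRED = ["RAW_DATA_FILE"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing in .env:", ", ".join(missing))
```

An empty value counts as missing here, since an empty file name would be as unusable as an unset one.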


In [13]:
# Import relevant libraries

import os
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

# Read paths from .env, ensure "processed" directory exists

RAW_DATA_PATH = os.getenv("RAW_DATA_PATH", "./data/raw")
PROCESSED_DATA_PATH = os.getenv("PROCESSED_DATA_PATH", "./processed")

raw_dir  = Path(RAW_DATA_PATH).expanduser().resolve()
proc_dir = Path(PROCESSED_DATA_PATH).expanduser().resolve()
proc_dir.mkdir(parents=True, exist_ok=True)

# Load .xlsx file

raw_file = os.getenv("RAW_DATA_FILE")
assert raw_file, "RAW_DATA_FILE is not set; add it to the .env file"

xlsx_path = (raw_dir / raw_file).resolve()
assert xlsx_path.exists(), f"Expected: {xlsx_path}; run the fetch notebook or fix RAW_DATA_FILE"
assert xlsx_path.suffix.lower() == ".xlsx", f"Expected a .xlsx, got {xlsx_path.suffix}"

print("Using Excel:", xlsx_path)

Using Excel: C:\Users\jjsos\Documents\DSI_7\team_project\ds08_online-retail\data\raw\Online Retail.xlsx
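
The path checks in the cell above use `assert`, which Python skips when it runs with the `-O` flag. If this logic is ever moved out of the notebook into a script, the same validation could be written with explicit exceptions. A minimal sketch under that assumption; `resolve_raw_file` is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path

def resolve_raw_file(raw_dir, filename, suffix=".xlsx"):
    """Return the resolved path to the raw file, or raise a descriptive error."""
    if not filename:
        raise ValueError("RAW_DATA_FILE is not set; check the .env file")
    path = (Path(raw_dir).expanduser() / filename).resolve()
    if path.suffix.lower() != suffix:
        raise ValueError(f"Expected a {suffix} file, got {path.suffix!r}")
    if not path.exists():
        raise FileNotFoundError(f"Expected: {path}")
    return path

# Usage, mirroring the cell above:
# xlsx_path = resolve_raw_file(raw_dir, os.getenv("RAW_DATA_FILE"))
```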


In [14]:
# Read .xlsx file and write a pre-processing .csv

# Sheet to read: RAW_DATA_SHEET from .env (e.g. "Online Retail"), else the first sheet (index 0)

SHEET = os.getenv("RAW_DATA_SHEET") or 0

# Set column dtypes up front (schema only, no cleaning yet)

dtypes = {
    "InvoiceNo": "string",
    "StockCode": "string",
    "Description": "string",
    "Quantity": "Int64",
    "UnitPrice": "float",
    "CustomerID": "string",
    "Country": "string",
}

# Read the .xlsx file once (note: this dataset uses a dd-mm-yyyy HH:MM date format)

df_raw = pd.read_excel(
    xlsx_path,
    sheet_name=SHEET,
    dtype=dtypes,
    engine="openpyxl"
)

# Write a pre-processing .csv file

csv_path = proc_dir / "online_retail_raw.csv"
df_raw.to_csv(
    csv_path,
    index=False,
    encoding="utf-8-sig",
    na_rep=""
)

print(f"CSV saved to: {csv_path}")
print(f"Rows: {len(df_raw):,} | Columns: {len(df_raw.columns)}")
display(df_raw.head()) 

CSV saved to: C:\Users\jjsos\Documents\DSI_7\team_project\ds08_online-retail\processed\online_retail_raw.csv
Rows: 541,909 | Columns: 8


|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
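
A CSV file does not store dtypes, so a later read of `online_retail_raw.csv` will infer them again unless the schema is passed explicitly. The sketch below shows one way to read the file back with the same column types as the Excel read. It is an illustration under assumptions: `read_raw_csv` is a hypothetical helper, parsing `InvoiceDate` with `parse_dates` is a choice made here rather than something the notebook does, and a one-row in-memory sample (the first row of the output above) stands in for the real file.

```python
import io

import pandas as pd

# String columns from the schema above; reading them as strings keeps
# alphanumeric codes such as "85123A" intact.
STRING_COLS = ["InvoiceNo", "StockCode", "Description", "CustomerID", "Country"]

def read_raw_csv(source):
    """Read the pre-processing CSV with the schema used for the Excel read."""
    dtypes = {col: "string" for col in STRING_COLS}
    dtypes.update({"Quantity": "Int64", "UnitPrice": "float"})
    return pd.read_csv(
        source,
        dtype=dtypes,
        parse_dates=["InvoiceDate"],  # assumption: dates are parsed on read
        encoding="utf-8-sig",         # matches the encoding used by to_csv
    )

# In-memory stand-in for the real file, encoded the same way as the export
sample_text = (
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom\n"
)
sample = io.BytesIO(sample_text.encode("utf-8-sig"))

df_check = read_raw_csv(sample)
print(df_check.dtypes)
```

With the real file, the call would be `read_raw_csv(csv_path)`. Fields written as empty strings by `na_rep=""` come back as missing values.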
