# **Data Preparation**


In [18]:
import numpy as np
import pandas as pd


In [2]:
# File paths
file_path = "../data/raw/online retail.xlsx"
csv_output_path = "../data/processed/online_retail.csv"

# Read the Excel file
df_raw = pd.read_excel(file_path, engine="openpyxl")

# Save as CSV
df_raw.to_csv(csv_output_path, index=False)

# **Data Collection**


In [3]:
# Read CSV file
df = pd.read_csv('../data/processed/online_retail.csv')

df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [4]:
#Check the data shape
df.shape

(541909, 8)

This means our dataset contains:

- 541,909 rows → Each row represents a transaction record.
- 8 columns → These are the different attributes describing each transaction.

This gives us an initial understanding of the dataset's size before diving deeper.

# **Data Cleaning**


##### **Checking Datatype**

In [5]:
#Check the Basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [39]:
# Convert CustomerID to object, then re-check the dtypes
df['CustomerID'] = df['CustomerID'].astype('object')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  object 
 7   Country      541909 non-null  object 
 8   Is_Canceled  541909 non-null  bool   
dtypes: bool(1), float64(1), int64(1), object(6)
memory usage: 33.6+ MB


`CustomerID` is loaded as `float64` but is converted to `object` because it is a categorical identifier, not a quantity to do arithmetic on. The output above also lists a ninth column, `Is_Canceled`, which is not created in this section; the cell appears to have been re-run after that column was added.

As loaded, the dataset contains **541,909 rows and 8 columns**. Five columns (`InvoiceNo`, `StockCode`, `Description`, `InvoiceDate`, `Country`) are `object` types, `Quantity` is an integer, and `UnitPrice` and `CustomerID` are floats. Missing values appear only in `Description` and `CustomerID`, which likely reflects incomplete product details and guest transactions.
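One caveat with `astype('object')` is that the stored values remain floats, so an ID still displays as `17850.0`. Below is a minimal sketch of an alternative, assuming the IDs are whole numbers and a string identifier is wanted: convert through pandas' nullable `Int64` dtype first, then to `string`. The toy frame and the name `clean` are illustrative and not part of the notebook. `InvoiceDate` is parsed in the same pass, since it is also loaded as `object`.

```python
import numpy as np
import pandas as pd

# Toy data mimicking the two columns; the values are illustrative only.
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 13047.0],
    "InvoiceDate": [
        "2010-12-01 08:26:00",
        "2010-12-01 08:28:00",
        "2010-12-01 08:34:00",
    ],
})

clean = toy.copy()

# float -> nullable integer -> string: 17850.0 becomes "17850", NaN stays missing
clean["CustomerID"] = clean["CustomerID"].astype("Int64").astype("string")

# Parse timestamps so that date arithmetic and grouping by period work later
clean["InvoiceDate"] = pd.to_datetime(
    clean["InvoiceDate"], format="%Y-%m-%d %H:%M:%S"
)

print(clean.dtypes)
print(clean)
```

Whether to keep the IDs as strings or as nullable integers is a design choice; either avoids the trailing `.0` while still marking missing customers explicitly.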



##### **Checking Missing Values**
In this step, we count the missing values in each column with `.isnull().sum()`.

In [38]:
# Count missing values per column
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
Is_Canceled         0
dtype: int64

- **CustomerID** has by far the most missing values: **135,080** entries.
- **Description** has **1,454** missing entries.
- All other columns (**InvoiceNo, StockCode, Quantity, InvoiceDate, UnitPrice, Country, and Is_Canceled**) have no missing values.
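Counts are easier to judge next to percentages. The following is a minimal sketch of a helper that reports both for every column with at least one missing value. The function name `missing_summary` and the toy frame are hypothetical and are only used for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only affected columns, largest gap first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Quantity": [6, 1, 2, 8],
})

summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.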



In [12]:
missing_percentage = df['CustomerID'].isnull().mean() * 100
print(f'CustomerID Missing Value: {df["CustomerID"].isnull().sum()}')
print(f'Missing Value Percentage: {missing_percentage:.2f} %')

CustomerID Missing Value: 135080
Missing Value Percentage: 24.93 %


In [51]:
# Share of rows without a CustomerID, per country
missing_by_country = (
    df.groupby('Country')['CustomerID']
      .apply(lambda x: x.isna().mean() * 100)
      .sort_values(ascending=False)
)
missing_by_country = missing_by_country[missing_by_country > 0].round(2).astype(str) + '%'
print(missing_by_country)

Country
Hong Kong         100.0%
Unspecified       45.29%
United Kingdom    26.96%
Israel            15.82%
Bahrain           10.53%
EIRE               8.67%
Switzerland        6.24%
Portugal           2.57%
France             0.77%
Name: CustomerID, dtype: object


In [25]:
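# missing_by_country holds strings such as '26.96%'; convert back to numbers
missing_rates = missing_by_country.str.rstrip('%').astype(float)
hongkong_rate = missing_rates['Hong Kong']
total_rate = missing_rates.sum()

# This compares per-country rates, not row counts, so it is not a share of missing rows
rate_ratio = (hongkong_rate / total_rate) * 100
print(f"Hong Kong's missing rate is {rate_ratio:.2f}% of the summed per-country missing rates.")


Hong Kong's missing rate is 46.11% of the summed per-country missing rates.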
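A country's share of the missing rows has to come from row counts. Dividing one country's missing rate by the sum of all countries' rates weights a very small country the same as a very large one. The sketch below shows the count-based calculation on toy data; the frame and the name `share_of_missing` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: Hong Kong has 2 rows (all missing), United Kingdom has 8 rows
# (6 missing), France has 2 rows (none missing). Values are illustrative only.
toy = pd.DataFrame({
    "Country": ["Hong Kong"] * 2 + ["United Kingdom"] * 8 + ["France"] * 2,
    "CustomerID": [np.nan] * 2 + [np.nan] * 6 + [17850.0] * 2 + [12583.0] * 2,
})

# Keep only rows without a CustomerID, then see how they split by country
missing_rows = toy[toy["CustomerID"].isna()]
share_of_missing = missing_rows["Country"].value_counts(normalize=True) * 100

print(share_of_missing.round(2))
```

In this toy frame Hong Kong has a 100% missing rate but accounts for only a quarter of the missing rows, because it has far fewer rows than the United Kingdom.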


In [41]:
# Number of distinct invoices that have at least one row without a CustomerID
num_missing_invoices = df[df['CustomerID'].isna()]['InvoiceNo'].nunique()
print(f'Invoices with missing CustomerID: {num_missing_invoices}')


Invoices with missing CustomerID: 3710
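A natural follow-up is whether `CustomerID` is missing for whole invoices or only for some lines of an invoice. Here is a minimal sketch of that check on toy data; the names `missing_rate`, `fully_missing`, and `partially_missing` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data; invoice numbers and IDs are illustrative only.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536414", "536544", "536544"],
    "CustomerID": [17850.0, 17850.0, np.nan, np.nan, np.nan],
})

# Fraction of rows per invoice without a CustomerID
# (0.0 = none missing, 1.0 = every line missing)
missing_rate = toy["CustomerID"].isna().groupby(toy["InvoiceNo"]).mean()

fully_missing = int((missing_rate == 1.0).sum())
partially_missing = int(((missing_rate > 0) & (missing_rate < 1)).sum())

print(f"Invoices with no CustomerID on any line: {fully_missing}")
print(f"Invoices with CustomerID on only some lines: {partially_missing}")
```

If the second count is zero on the real data, the missing IDs could be treated at the invoice level rather than row by row.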


In [37]:
# Negative Quantity is used here as a proxy for canceled or returned items
missing_returns = df[df['CustomerID'].isna() & (df['Quantity'] < 0)]
print(f"Total missing CustomerIDs in canceled transactions: {missing_returns.shape[0]}")


Total missing CustomerIDs in canceled transactions: 1719
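Negative `Quantity` is the marker for canceled transactions in the cell above. The dataset's documentation identifies cancellations by an `InvoiceNo` that starts with `C`, and the two rules may not select the same rows. A minimal sketch comparing them on toy data follows. `is_cancel_invoice` and `is_negative_qty` are illustrative names, and how the notebook's own `Is_Canceled` column is built is not shown in this section.

```python
import numpy as np
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536589", "C536383"],
    "Quantity": [6, -1, -10, -1],
    "CustomerID": [17850.0, 14527.0, np.nan, np.nan],
})

is_cancel_invoice = toy["InvoiceNo"].astype(str).str.startswith("C")
is_negative_qty = toy["Quantity"] < 0
no_customer = toy["CustomerID"].isna()

# Cross-tabulate the two rules to see where they disagree
comparison = pd.crosstab(
    is_cancel_invoice,
    is_negative_qty,
    rownames=["invoice starts with C"],
    colnames=["Quantity < 0"],
)
print(comparison)

print("Missing CustomerID, C-prefixed invoice:",
      int((no_customer & is_cancel_invoice).sum()))
print("Missing CustomerID, negative quantity:",
      int((no_customer & is_negative_qty).sum()))
```

Rows that are negative but not `C`-prefixed are worth inspecting separately before they are labeled as cancellations.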


In [20]:
# Summary statistics for numerical columns
df.describe()

| | Quantity | UnitPrice |
|---|---|---|
| count | 541909.0 | 541909.0 |
| mean | 9.55225 | 4.611114 |
| std | 218.081158 | 96.759853 |
| min | -80995.0 | -11062.06 |
| 25% | 1.0 | 1.25 |
| 50% | 3.0 | 2.08 |
| 75% | 10.0 | 4.13 |
| max | 80995.0 | 38970.0 |
