# **Data Preparation**


In [2]:
import numpy as np
import pandas as pd


In [3]:
#File Path
file_path = "../data/raw/online retail.xlsx"
csv_output_path = "../data/processed/online_retail.csv"

#Read Excel File
df_raw = pd.read_excel(file_path, engine="openpyxl")

# Save as CSV
df_raw.to_csv(csv_output_path, index=False)

# **Data Collection**


In [4]:
# Read CSV file
df = pd.read_csv('../data/processed/online_retail.csv')

df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [5]:
#Check the data shape
df.shape

(541909, 8)

This means our dataset contains:
- 541,909 rows → Each row represents a transaction record.
- 8 columns → These are the different attributes describing each transaction.
This gives us an initial understanding of the dataset's size before diving deeper.

# **Data Cleaning**


##### **Checking Datatype**

In [6]:
#Check the Basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [7]:
#Converting CustomerID Data Type
df['CustomerID'] = df['CustomerID'].astype('object')

In the dataset, `CustomerID` was initially stored as a numeric type (`float64`). It is converted to `object` because it is a categorical identifier, not a quantity to compute on.

As read from the CSV, the dataset contains **541,909 rows and 8 columns**. Five columns (`InvoiceNo`, `StockCode`, `Description`, `InvoiceDate`, `Country`) are `object` types, `Quantity` is an integer, and `UnitPrice` and `CustomerID` are floats (before the conversion above). Missing values appear in `Description` and `CustomerID`, possibly indicating incomplete product details and guest transactions.
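As a possible refinement of the type handling above, the sketch below parses `InvoiceDate` into a datetime and stores `CustomerID` as a nullable string, so an ID reads as `17850` rather than `17850.0`. The helper name `tidy_dtypes` and the two-row sample frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def tidy_dtypes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with InvoiceDate as datetime and CustomerID as nullable string."""
    out = frame.copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    # float -> nullable integer -> string drops the trailing ".0"
    # and keeps missing IDs as <NA> instead of the text "nan"
    out["CustomerID"] = out["CustomerID"].astype("Int64").astype("string")
    return out


# Hypothetical sample mirroring the column layout of the dataset
sample = pd.DataFrame({
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 08:28:00"],
    "CustomerID": [17850.0, np.nan],
})
tidy = tidy_dtypes(sample)
print(tidy.dtypes)
```

Keeping missing IDs as `<NA>` means `.isna()` checks later in the notebook would still work on the converted column.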



##### **Checking Missing Value**
In this step, we checked for missing values in the dataset using the `.isnull().sum()` method. The results show the number of missing values for each column in the dataset.

In [8]:
# Count missing values per column
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

- **CustomerID** has a significant number of missing values, with **135,080** missing entries.
- **Description** also has some missing values, totaling **1,454** entries.
- Other columns, such as **InvoiceNo, StockCode, Quantity, InvoiceDate, UnitPrice, and Country**, have no missing values.



In [9]:
missing_percentage = df['CustomerID'].isnull().mean() * 100
print(f'CustomerID Missing Value: {df["CustomerID"].isnull().sum()}')
print(f'Missing Value Percentage: {missing_percentage:.2f} %')

CustomerID Missing Value: 135080
Missing Value Percentage: 24.93 %
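The count and percentage checks above can be combined into one table covering every column. This is a minimal sketch; the function name `missing_summary` and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Missing count and percentage for each column that has missing values."""
    summary = pd.DataFrame({
        "missing_count": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Hypothetical sample, not taken from the dataset
sample = pd.DataFrame({
    "Description": ["A", None, "C", "D"],
    "CustomerID": [1.0, np.nan, np.nan, 4.0],
    "Quantity": [1, 2, 3, 4],
})
summary = missing_summary(sample)
print(summary)
```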


In [10]:
#Calculate Missing Value Proportion by country
missing_by_country = df.groupby('Country')['CustomerID'].apply(lambda x: x.isna().mean() * 100).sort_values(ascending=False)
missing_by_country = missing_by_country[missing_by_country > 0].round(2)

# Convert to DataFrame and rename column
missing_by_country = missing_by_country.to_frame(name="Missing Value Proportion")

# Print properly formatted output
print(missing_by_country)

                Missing Value Proportion
Country                                 
Hong Kong                         100.00
Unspecified                        45.29
United Kingdom                     26.96
Israel                             15.82
Bahrain                            10.53
EIRE                                8.67
Switzerland                         6.24
Portugal                            2.57
France                              0.77


Hong Kong has the highest proportion of missing CustomerID values at 100%, suggesting either a different data recording system or unregistered customers. The United Kingdom (26.96%) and Unspecified (45.29%) categories also show significant missing values, which might be due to guest checkouts or incomplete records, while other countries like Israel (15.82%), Bahrain (10.53%), and EIRE (8.67%) have moderate levels of missing data. To address this, further investigation is needed, particularly for Hong Kong and the Unspecified category, to determine whether the missing values should be removed, imputed, or retained based on business relevance.

In [11]:
#Hong Kong's share of the summed per-country missing proportions
#Note: this divides one percentage by a sum of percentages, so it is not a share of missing rows
hongkong_missing = missing_by_country.loc["Hong Kong", "Missing Value Proportion"]

total_missing = missing_by_country["Missing Value Proportion"].sum()


percentage_hongkong = (hongkong_missing / total_missing) * 100
print(f"Hong Kong accounts for {percentage_hongkong:.2f}% of the summed per-country missing proportions.")


Hong Kong accounts for 46.11% of the summed per-country missing proportions.
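The calculation above works with per-country percentages, so its result depends on how many countries have missing values rather than on how many rows each country contributes. A row-count version, which measures each country's share of all rows with a missing `CustomerID`, might look like the sketch below. The function name `missing_share_by_country` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_share_by_country(frame: pd.DataFrame,
                             id_col: str = "CustomerID",
                             country_col: str = "Country") -> pd.Series:
    """Share (%) of all rows with a missing ID that each country contributes."""
    missing_rows = frame[frame[id_col].isna()]
    counts = missing_rows[country_col].value_counts()
    return (counts / counts.sum() * 100).round(2)


# Hypothetical sample: 3 rows are missing an ID, 2 in the UK and 1 in Hong Kong
sample = pd.DataFrame({
    "Country": ["United Kingdom"] * 4 + ["Hong Kong"] + ["France"] * 2,
    "CustomerID": [1.0, 2.0, np.nan, np.nan, np.nan, 5.0, 6.0],
})
share = missing_share_by_country(sample)
print(share)
```

In this sample Hong Kong is 100% missing within its own rows, yet contributes only a third of all missing rows, which is the distinction the two calculations draw.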


In [12]:
#CustomerID missing value by InvoiceNo
num_missing_invoices = df[df['CustomerID'].isna()]['InvoiceNo'].nunique()
print(f'Invoices with missing CustomerID: {num_missing_invoices}')
InvoiceNo_unique = df['InvoiceNo'].nunique()

print(f'Unique InvoiceNo: {InvoiceNo_unique}')

Invoices with missing CustomerID: 3710
Unique InvoiceNo: 25900


Of the 25,900 unique invoices, 3,710 (about 14.3%) contain at least one row without a `CustomerID`. These transactions were recorded without a customer identifier, which limits customer-level analysis.
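To see whether an invoice lacks `CustomerID` on every row or only on some, the missing flag can be averaged per invoice. The sketch below assumes the same column names as the dataset; the function name `invoice_missing_profile` and the sample data are illustrative.

```python
import numpy as np
import pandas as pd


def invoice_missing_profile(frame: pd.DataFrame) -> dict:
    """Count invoices whose rows are fully or partially missing a CustomerID."""
    # Fraction of rows per invoice with a missing CustomerID (0.0 to 1.0)
    share = frame["CustomerID"].isna().groupby(frame["InvoiceNo"]).mean()
    fully = int((share == 1).sum())
    partially = int(((share > 0) & (share < 1)).sum())
    total = int(len(share))
    return {
        "total_invoices": total,
        "fully_missing": fully,
        "partially_missing": partially,
        "pct_with_missing": round((fully + partially) / total * 100, 2),
    }


# Hypothetical sample: A2 is fully missing, A3 is partially missing
sample = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A2", "A3", "A3"],
    "CustomerID": [1.0, 1.0, np.nan, np.nan, 3.0, np.nan],
})
profile = invoice_missing_profile(sample)
print(profile)
```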


In [13]:
#CustomerID missing value among negative-quantity rows (used as a proxy for canceled transactions)
missing_returns = df[df['CustomerID'].isna() & (df['Quantity'] < 0)]
print(f"Total missing CustomerIDs in canceled transactions: {missing_returns.shape[0]}")


Total missing CustomerIDs in canceled transactions: 1719


**1,719 rows with a missing CustomerID have a negative Quantity**, about 1.3% of the 135,080 rows that lack a CustomerID. Note that this filter uses negative `Quantity` as a proxy for cancellation and does not check the invoice number format. The count alone does not show that guest checkouts, system errors, or unregistered customers are more prone to cancellation; that would require comparing cancellation rates between rows with and without a CustomerID. The pattern is still worth tracking as a potential data quality issue.
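Whether rows without a `CustomerID` are cancelled more often is a question about rates rather than counts. One way to compare the two groups is sketched below, again treating negative `Quantity` as the proxy for cancellation. The function name `negative_quantity_rate`, the group labels, and the sample data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def negative_quantity_rate(frame: pd.DataFrame) -> pd.Series:
    """Percentage of rows with negative Quantity, split by CustomerID presence."""
    group = frame["CustomerID"].notna().map(
        {True: "with CustomerID", False: "without CustomerID"}
    )
    is_negative = frame["Quantity"] < 0
    return is_negative.groupby(group).mean().mul(100).round(2)


# Hypothetical sample: 1 of 4 identified rows and 1 of 2 unidentified rows are negative
sample = pd.DataFrame({
    "Quantity": [5, -1, 3, -2, 4, 6],
    "CustomerID": [1.0, 2.0, np.nan, np.nan, 3.0, 4.0],
})
rates = negative_quantity_rate(sample)
print(rates)
```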

Due to significant missing values in CustomerID, customer-based analysis may lead to biased insights. Thus, we decided to exclude it and focus on more reliable features.

##### **Checking Duplication**

In [14]:
# Count duplicate rows (excluding the first occurrence)
df.duplicated().sum()

5268

In [15]:
# Remove duplicate rows, keeping only the first occurrence
df.drop_duplicates(keep='first', inplace=True)

In [16]:
# Count the number of duplicated rows again to verify removal
df.duplicated().sum()

0
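Before dropping duplicates it can help to look at every copy side by side, and afterwards to reset the index so it runs from 0 without gaps. The sketch below shows both steps on a small hypothetical frame; it is an optional variation, not the approach used in the cells above, which drop in place and keep the original index.

```python
import pandas as pd

# Hypothetical sample with one fully duplicated row
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "StockCode": ["85123A", "85123A", "71053", "84406B"],
    "Quantity": [6, 6, 6, 8],
})

# keep=False flags every copy of a duplicated row, including the first occurrence
dupes = sample[sample.duplicated(keep=False)].sort_values(list(sample.columns))
print(dupes)

# Drop repeats, keep the first occurrence, and rebuild a gap-free index
deduped = sample.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped.shape)
```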

##### **Checking Irrelevant Value**

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 536641 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    536641 non-null  object 
 1   StockCode    536641 non-null  object 
 2   Description  535187 non-null  object 
 3   Quantity     536641 non-null  int64  
 4   InvoiceDate  536641 non-null  object 
 5   UnitPrice    536641 non-null  float64
 6   CustomerID   401604 non-null  object 
 7   Country      536641 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 36.8+ MB


##### **Checking Irrelevant Value : InvoiceNo**

Since `InvoiceNo` is expected to have a numeric format, we need to check if there are any values with a different format. The code below identifies non-numeric `InvoiceNo`, which may indicate anomalies or specific cases that require further investigation. 

In [23]:
#Checking InvoiceNo format
invoice_different_format =  df[~df['InvoiceNo'].str.isnumeric()]

invoice_different_format['InvoiceNo']

141       C536379
154       C536383
235       C536391
236       C536391
237       C536391
           ...   
540449    C581490
541541    C581499
541715    C581568
541716    C581569
541717    C581569
Name: InvoiceNo, Length: 9254, dtype: object

In [66]:
# Counts Unique non-numeric InvoiceNo
invoice_different_format['InvoiceNo'].nunique()

3839

In [79]:
# Select the InvoiceNo and Quantity columns for the non-numeric invoices
invoice_summary = invoice_different_format[['InvoiceNo', 'Quantity']]
print(invoice_summary)

       InvoiceNo  Quantity
141      C536379        -1
154      C536383        -1
235      C536391       -12
236      C536391       -24
237      C536391       -24
...          ...       ...
540449   C581490       -11
541541   C581499        -1
541715   C581568        -5
541716   C581569        -1
541717   C581569        -5

[9254 rows x 2 columns]


In [91]:
# Filter invoices where Quantity is less than 0
negative_invoice_summary = invoice_different_format[invoice_different_format['Quantity'] < 0][['InvoiceNo', 'Quantity']]

# Display the result
print(negative_invoice_summary)


       InvoiceNo  Quantity
141      C536379        -1
154      C536383        -1
235      C536391       -12
236      C536391       -24
237      C536391       -24
...          ...       ...
540449   C581490       -11
541541   C581499        -1
541715   C581568        -5
541716   C581569        -1
541717   C581569        -5

[9251 rows x 2 columns]


In [95]:
# Filter invoices where Quantity is greater than or equal to 0
positive_invoice_summary = invoice_different_format[invoice_different_format['Quantity'] >= 0][['InvoiceNo', 'Quantity']]

# Display the result
print(positive_invoice_summary)


       InvoiceNo  Quantity
299982   A563185         1
299983   A563186         1
299984   A563187         1


Non-numeric `InvoiceNo` values (9,254 rows across 3,839 unique invoices) start with either 'C' or 'A'. Invoices starting with 'C' carry negative `Quantity` values, totaling 9,251 rows, which suggests they represent returns or adjustments. Invoices starting with 'A' have positive `Quantity` values but account for only 3 rows, a rare but consistent pattern. This structured distinction implies an intentional data classification rather than an anomaly or human error. Further clarification from the team is needed to understand the context behind these invoice formats and ensure accurate data handling.
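The prefix and quantity-sign checks above can be summarized in a single cross-tabulation. The sketch below extracts any leading letters from `InvoiceNo` and counts rows by prefix and sign of `Quantity`. The function name `invoice_prefix_table`, the label `"none"` for purely numeric invoices, and the sample rows are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def invoice_prefix_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Row counts by InvoiceNo letter prefix and sign of Quantity."""
    # Leading letters of the invoice number; purely numeric invoices get "none"
    prefix = (
        frame["InvoiceNo"].astype(str)
        .str.extract(r"^([A-Za-z]+)", expand=False)
        .fillna("none")
    )
    sign = pd.Series(
        np.where(frame["Quantity"] < 0, "negative", "non-negative"),
        index=frame.index,
    )
    return pd.crosstab(prefix, sign, rownames=["prefix"], colnames=["quantity_sign"])


# Hypothetical sample in the same format as the dataset
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "C536383", "A563185"],
    "Quantity": [6, -1, -1, 1],
})
table = invoice_prefix_table(sample)
print(table)
```

A table like this would also surface any prefix other than 'C' or 'A', and any 'C' invoice with a non-negative quantity, without separate filters for each case.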