About Dataset
Context
Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".

Content
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

In [5]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

file = pd.read_csv('/kaggle/input/ecommerce-data/data.csv',encoding='latin1')


**Step 1 : DATA EXPLORATION**

In [7]:
# Let's preview the first 100 rows to get a sense of the data's shape
file.head(100)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
95,536378,22352,LUNCH BOX WITH CUTLERY RETROSPOT,6,12/1/2010 9:37,2.55,14688.0,United Kingdom
96,536378,21212,PACK OF 72 RETROSPOT CAKE CASES,120,12/1/2010 9:37,0.42,14688.0,United Kingdom
97,536378,21975,PACK OF 60 DINOSAUR CAKE CASES,24,12/1/2010 9:37,0.55,14688.0,United Kingdom
98,536378,21977,PACK OF 60 PINK PAISLEY CAKE CASES,24,12/1/2010 9:37,0.55,14688.0,United Kingdom


#  **Now that we have the shape of the data, let's check whether we have missing values or incorrect data**

In [8]:
# Checking data information to see missing values

file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
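The summary above reports InvoiceDate as `object` and CustomerID as `float64`. A minimal sketch of one way to give both columns more suitable types is shown here. The three-row `sample` frame and the `fix_dtypes` helper are hypothetical, and the month-first date format is an assumption inferred from the preview rows (`12/1/2010` for a dataset that starts on 1 December 2010).

```python
import pandas as pd

# Hypothetical sample mirroring two of the columns listed by file.info().
sample = pd.DataFrame({
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 9:37", "12/9/2011 12:50"],
    "CustomerID": [17850.0, 14688.0, None],
})

def fix_dtypes(df):
    """Return a copy with InvoiceDate parsed as datetime and CustomerID as nullable integer."""
    out = df.copy()
    # Assumption: dates are month/day/year with a 24-hour time.
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"], format="%m/%d/%Y %H:%M")
    # "Int64" (capital I) is the pandas nullable integer dtype, so missing IDs stay missing.
    out["CustomerID"] = out["CustomerID"].astype("Int64")
    return out

typed = fix_dtypes(sample)
print(typed.dtypes)
```

On the real data the same helper could be called as `fix_dtypes(file)`, provided every InvoiceDate value follows the assumed format.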


In [14]:
file.isna().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
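The counts above cover missing values only. A rough sketch of how the "incorrect data" part of the check might be approached follows. What counts as incorrect is an assumption here (non-positive quantities, non-positive unit prices, exact duplicate rows), the `sample` rows are made up for illustration, and `count_suspect_rows` is a hypothetical helper. Some of these rows may be legitimate, for example returns, so the counts are a prompt for inspection rather than a list of rows to delete.

```python
import pandas as pd

# Made-up rows for illustration; rows 0 and 1 are deliberate exact duplicates.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536379", "536380"],
    "StockCode": ["85123A", "85123A", "22961", "21212"],
    "Quantity": [6, 6, -1, 24],
    "UnitPrice": [2.55, 2.55, 27.50, 0.0],
})

def count_suspect_rows(df):
    """Count rows that match simple, assumed definitions of suspicious data."""
    return pd.Series({
        "non_positive_quantity": int((df["Quantity"] <= 0).sum()),
        "non_positive_unit_price": int((df["UnitPrice"] <= 0).sum()),
        # duplicated() flags every repeat of an earlier identical row.
        "exact_duplicates": int(df.duplicated().sum()),
    })

suspect = count_suspect_rows(sample)
print(suspect)
```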

# We can see that we have missing values in the CustomerID column (135,080 rows) and, to a lesser extent, in the Description column (1,454 rows).
# This means that some invoice lines were recorded without a customer ID.
# Step 2 : Cleaning the dataset by removing missing values

In [18]:
# Let's calculate the percentage of records affected by missing values

CustomerID_missing = file['CustomerID'].isna().sum()
total_records = len(file)
missing_percentage = (CustomerID_missing / total_records) * 100

print(f'Missing Customer IDs: {CustomerID_missing}')
print(f'Percentage of Missing Customer IDs: {missing_percentage:.2f}%')


Missing Customer IDs: 135080
Percentage of Missing Customer IDs: 24.93%


# Insight : Almost 25% of the records have no CustomerID and cannot be used for customer-level analysis. Because CustomerID is directly linked to InvoiceNo, and considering the KPIs to be evaluated for this dataset, such as sales performance (total sales revenue, Average Order Value (AOV), conversion rate), customer metrics (CLV, etc.) and product performance (top-selling products, etc.), removing the 25% of records with a missing CustomerID might distort the analysis and bias the conclusions.
# 
# Hence my recommendation is to source other data so that the analysis can be conducted with quality data.
# 
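Before deciding whether to drop these rows, it may help to measure how much of the data they carry. The sketch below compares rows with and without a CustomerID by row count, distinct invoices and revenue share. It assumes revenue can be approximated as Quantity times UnitPrice. The `sample` frame, its invoice labels and the `summarize_by_customer_id` helper are hypothetical and stand in for the real `file` DataFrame.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A4"],
    "Quantity": [6, 2, 10, 4, 5],
    "UnitPrice": [2.50, 5.00, 1.00, 2.50, 4.00],
    "CustomerID": [17850.0, 17850.0, None, 14688.0, None],
})

def summarize_by_customer_id(df):
    """Compare rows that have a CustomerID against rows that do not."""
    work = df.assign(
        # Assumption: line revenue is Quantity * UnitPrice.
        Revenue=df["Quantity"] * df["UnitPrice"],
        CustomerIDStatus=df["CustomerID"].notna().map({True: "present", False: "missing"}),
    )
    summary = work.groupby("CustomerIDStatus").agg(
        rows=("InvoiceNo", "size"),
        invoices=("InvoiceNo", "nunique"),
        revenue=("Revenue", "sum"),
    )
    summary["revenue_share_pct"] = 100 * summary["revenue"] / summary["revenue"].sum()
    return summary

summary = summarize_by_customer_id(sample)
print(summary)
```

If the "missing" group turns out to hold a large revenue share on the real data, dropping it would affect sales and product KPIs as well as customer metrics. One option is to keep those rows for revenue and product analysis and exclude them only from customer-level metrics.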