# Customer segmentation for online retail shop

## Dataset description

**Dataset citation / source**: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17): [Online Retail Data Set](http://archive.ics.uci.edu/ml/datasets/Online+Retail).

**Data Set Information:** This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

**Attribute Information:**

- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantity of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
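
The cancellation rule for InvoiceNo can be checked in code. The following is a minimal sketch on a toy DataFrame with made-up values; only the column names come from the dataset. Matching the prefix case-insensitively is an assumption, since the description mentions the letter 'c' without saying which case the file actually uses.

```python
import pandas as pd

# Toy data (hypothetical values); InvoiceNo mixes integers and strings
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", "c536383", 536390],
    "Quantity": [6, -1, -2, 12],
})

# Cast to str first so that integer invoice numbers can be matched too,
# then test for the cancellation prefix regardless of case (assumption)
is_cancelled = toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")

cancellations = toy[is_cancelled]   # rows flagged as cancellations
purchases = toy[~is_cancelled]      # all remaining rows

print(cancellations)
print(purchases)
```

Whether cancellations should be dropped or netted against purchases depends on the analysis, so the sketch only separates the two groups.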

## Project goals

This project is an RFM (Recency, Frequency, Monetary) analysis that groups customers on the basis of their previous purchase transactions. It sorts customers into segments in order to understand who the biggest spenders are, how recently they purchased, and what kinds of products they prefer.
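
As a rough illustration of the three RFM measures, the sketch below computes them on a toy DataFrame. The values, the snapshot date and the function name `rfm_table` are hypothetical. The definitions used here (days since the last invoice, number of distinct invoices, sum of Quantity times UnitPrice) are one common convention and not necessarily the one this analysis settles on.

```python
import pandas as pd

# Toy transactions (hypothetical values) using the dataset's column names
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003", "1004"],
    "CustomerID": [1, 1, 1, 2, 2],
    "Quantity": [2, 1, 5, 10, 1],
    "UnitPrice": [3.0, 4.0, 1.0, 2.0, 5.0],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-01", "2011-12-08", "2011-11-09", "2011-11-29",
    ]),
})

def rfm_table(df, snapshot_date):
    """Return one row per customer with Recency, Frequency and Monetary."""
    # Line total for every transaction row
    df = df.assign(TotalPrice=df["Quantity"] * df["UnitPrice"])
    grouped = df.groupby("CustomerID")
    return pd.DataFrame({
        # Days between the snapshot date and the customer's latest invoice
        "Recency": (snapshot_date - grouped["InvoiceDate"].max()).dt.days,
        # Number of distinct invoices per customer
        "Frequency": grouped["InvoiceNo"].nunique(),
        # Total amount spent per customer
        "Monetary": grouped["TotalPrice"].sum(),
    })

rfm = rfm_table(toy, snapshot_date=pd.Timestamp("2011-12-10"))
print(rfm)
```

A low Recency together with a high Frequency and Monetary value would point to an active, valuable customer under these definitions.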






# Identifying potential customers

## Clean and validate the data

Before starting the analysis, I need to clean and validate the data, beginning with a look at its shape and summary statistics.

In [2]:
#import modules
import pandas as pd # for dataframes
import matplotlib.pyplot as plt # for plotting graphs
import seaborn as sns # for plotting graphs
import datetime as dt

In [10]:
# read the data from excel
train = pd.read_excel("Online Retail.xlsx")

In [11]:
# Display the first head_rows and the last tail_rows rows of a DataFrame
def show_head_tail(data, head_rows, tail_rows):
    # DataFrame.append was removed in pandas 2.0, so combine the two slices with pd.concat
    display(pd.concat([data.head(head_rows), data.tail(tail_rows)]))

show_head_tail(train, head_rows=3, tail_rows=2)


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France |
| 541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 4.95 | 12680.0 | France |


In [14]:
# describe data
def describe_data(data):
    display(data.shape)
    display(data.describe())
    print()
    display(data.info())

describe_data(train)


(541909, 8)

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


None

Next, I double-check data integrity and look for missing values:

In [15]:
# check if any data is missing
train.isnull()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | False | False | False | False | False | False | False | False |
| 541905 | False | False | False | False | False | False | False | False |
| 541906 | False | False | False | False | False | False | False | False |
| 541907 | False | False | False | False | False | False | False | False |
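
A full boolean frame is hard to read at this size: the preview shows only the first and last rows, all False, even though the `info()` output reports 540455 non-null Description values and 406829 non-null CustomerID values out of 541909 rows. A per-column summary makes such gaps visible. Below is a minimal sketch on a toy DataFrame with made-up values; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with a few missing entries
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "COAT HANGER", "RULER"],
    "CustomerID": [17850.0, np.nan, np.nan, 12680.0],
    "Quantity": [6, 1, 8, 4],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()
# Fraction of missing values per column (mean of the boolean mask)
missing_share = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary)
```

Applied to the real data, the same two lines would be `train.isnull().sum()` and `train.isnull().mean()`.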


Check for duplicates among the features that should jointly be unique:

In [30]:
print(train[train.duplicated(['InvoiceNo', 'StockCode', 'InvoiceDate'])])

       InvoiceNo StockCode                      Description  Quantity  \
125       536381     71270                  PHOTO CLIP LINE         3   
498       536409    90199C  5 STRAND GLASS NECKLACE CRYSTAL         1   
502       536409     85116  BLACK CANDELABRA T-LIGHT HOLDER         5   
517       536409     21866      UNION JACK FLAG LUGGAGE TAG         1   
525       536409    90199C  5 STRAND GLASS NECKLACE CRYSTAL         2   
...          ...       ...                              ...       ...   
541692    581538     22992           REVOLVER WOODEN RULER          1   
541697    581538     21194        PINK  HONEYCOMB PAPER FAN         1   
541698    581538    35004B      SET OF 3 BLACK FLYING DUCKS         1   
541699    581538     22694                     WICKER STAR          1   
541701    581538     23343     JUMBO BAG VINTAGE CHRISTMAS          1   

               InvoiceDate  UnitPrice  CustomerID         Country  
125    2010-12-01 09:41:00       1.25     15311.0  Unit

Then drop the duplicates from the dataframe:

In [34]:
# drop rows that are exact duplicates across all columns
ndp_train = train.drop_duplicates()
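
Note that the duplicate check compares three key columns, while `drop_duplicates()` without arguments compares every column, so the two can flag different rows. In the printed output, for example, row 525 shares its InvoiceNo and StockCode with row 498 but has a different Quantity. The sketch below shows the difference on a toy DataFrame with made-up values. Which variant is appropriate depends on whether rows that share a key but differ elsewhere are genuine line items, which the data alone does not settle.

```python
import pandas as pd

# Toy data (hypothetical values): rows 0-2 share the same key,
# but only rows 0 and 2 are identical in every column
toy = pd.DataFrame({
    "InvoiceNo": ["536409", "536409", "536409", "536412"],
    "StockCode": ["90199C", "90199C", "90199C", "22327"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 11:45"] * 3 + ["2010-12-01 11:49"]),
    "Quantity": [1, 2, 1, 3],
})

key = ["InvoiceNo", "StockCode", "InvoiceDate"]

n_key_dupes = toy.duplicated(key).sum()   # rows repeating an earlier key
n_full_dupes = toy.duplicated().sum()     # rows repeating an earlier row exactly

full_dedup = toy.drop_duplicates()            # keeps the row with Quantity 2
key_dedup = toy.drop_duplicates(subset=key)   # keeps only the first row per key

print(n_key_dupes, n_full_dupes, len(full_dedup), len(key_dedup))
```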

In [37]:
# check how the data looks now
describe_data(ndp_train)

(536641, 8)

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 536641.0 | 536641.0 | 401604.0 |
| mean | 9.620029 | 4.632656 | 15281.160818 |
| std | 219.130156 | 97.233118 | 1714.006089 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13939.0 |
| 50% | 3.0 | 2.08 | 15145.0 |
| 75% | 10.0 | 4.13 | 16784.0 |
| max | 80995.0 | 38970.0 | 18287.0 |



<class 'pandas.core.frame.DataFrame'>
Int64Index: 536641 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    536641 non-null  object        
 1   StockCode    536641 non-null  object        
 2   Description  535187 non-null  object        
 3   Quantity     536641 non-null  int64         
 4   InvoiceDate  536641 non-null  datetime64[ns]
 5   UnitPrice    536641 non-null  float64       
 6   CustomerID   401604 non-null  float64       
 7   Country      536641 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 36.8+ MB


None