<a href="https://colab.research.google.com/github/anupstar100/UML-Capston_Project-Customer_Segmentation/blob/main/UML_Capston_Project_Online_Retail_Customer_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Online Retail Customer Segmentation </u></b>

## <b> Problem Description </b>

### In this project, your task is to identify major customer segments in a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

## <b> Data Description </b>

### <b>Attribute Information: </b>

* ### InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
* ### StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* ### Description: Product (item) name. Nominal.
* ### Quantity: The quantities of each product (item) per transaction. Numeric.
* ### InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
* ### UnitPrice: Unit price. Numeric, Product price per unit in sterling.
* ### CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* ### Country: Country name. Nominal, the name of the country where each customer resides.
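
The cancellation rule for `InvoiceNo` can be turned into a boolean flag. The sketch below is only an illustration under stated assumptions: the rows are made up, the `is_cancelled` column name is hypothetical, and the prefix is matched case-insensitively in case a file stores it as 'C' rather than 'c'.

```python
import pandas as pd

# Toy data for illustration only; these rows are not taken from the real dataset.
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", "c536383", 536367],
    "Quantity": [6, -1, -2, 8],
})

# InvoiceNo can mix integers and strings, so cast to str before testing the prefix.
# The check is case-insensitive: both 'c' and 'C' count as a cancellation marker.
toy["is_cancelled"] = toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")

print(toy)
```

Keeping the flag as its own column, rather than filtering right away, leaves the choice of whether to drop cancellations to a later step.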

In [1]:
# mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# importing the required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from sklearn.cluster import KMeans

In [4]:
# loading our data
df = pd.read_excel('/content/drive/MyDrive/Capston Project/Online Retail Customer Segmentation/Online Retail.xlsx')

### Glimpses of our data

In [5]:
# first five rows
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [6]:
# last five rows
df.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [7]:
# random five rows
df.sample(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
523118,580502,23506,MINI PLAYING CARDS SPACEBOY,3,2011-12-04 13:15:00,0.42,16931.0,United Kingdom
283003,561685,22384,LUNCH BAG PINK POLKADOT,10,2011-07-28 19:46:00,1.65,14315.0,United Kingdom
276272,561037,22423,REGENCY CAKESTAND 3 TIER,2,2011-07-24 11:55:00,12.75,12472.0,Germany
145687,548894,84510A,SET OF 4 ENGLISH ROSE COASTERS,1,2011-04-04 16:01:00,2.46,,United Kingdom
226311,556784,82552,WASHROOM METAL SIGN,4,2011-06-14 13:15:00,1.45,14461.0,United Kingdom


# Data Information

In [9]:
df.shape

(541909, 8)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


**Observation:**
1. There are 8 columns and 541,909 rows of data.
2. There are 4 categorical (object) columns, 3 numerical columns and 1 datetime column.

In [10]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


# Data Cleaning

In [13]:
# checking for duplicate values
df.duplicated().sum()

5268

In [14]:
# dropping the duplicates
df.drop_duplicates(inplace = True)

In [15]:
df.shape

(536641, 8)

**Observations:**
1. There are 5,268 duplicate rows.
2. Shape of data before dropping the duplicates ---> (541909, 8)
3. Shape of data after dropping the duplicates ---> (536641, 8)

In [19]:
# unique values
for cols in df.columns:
  print(f'{cols}:  ', df[cols].nunique())

InvoiceNo:   25900
StockCode:   4070
Description:   4223
Quantity:   722
InvoiceDate:   23260
UnitPrice:   1630
CustomerID:   4372
Country:   38


In [24]:
# checking for null values: count and percentage per column
pd.DataFrame({'Columns' : df.columns,
              'Total no. of null values' : df.isna().sum(),
              '% of null values' : round(df.isna().mean() * 100, 2)}).reset_index(drop = True)

Unnamed: 0,Columns,Total no. of null values,% of null values
0,InvoiceNo,0,0.0
1,StockCode,0,0.0
2,Description,1454,0.27
3,Quantity,0,0.0
4,InvoiceDate,0,0.0
5,UnitPrice,0,0.0
6,CustomerID,135037,25.16
7,Country,0,0.0


**Observations:**
1. The `Description` column has 1,454 null values (0.27%).
2. The `CustomerID` column has 135,037 null values (25.16%).

In [27]:
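# Filter rows whose CustomerID equals the string 'NaN'.
# Note: missing IDs are stored as float NaN, so this comparison cannot match them.
df[df['CustomerID'] == 'NaN']

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


**Observation:**
* An empty dataframe is returned, so no `CustomerID` holds the literal string 'NaN'. This does not test for missing values, which need `isna()`; the invoice-level check is done in the next cells.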
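
A note on the filter used here: comparing a float column against the string 'NaN' gives an empty result whether or not values are missing. The short sketch below, on a made-up toy column, shows how the three ways of testing behave.

```python
import numpy as np
import pandas as pd

# Toy column for illustration only; the values are made up.
ids = pd.Series([17850.0, np.nan, 12680.0, np.nan], name="CustomerID")

matches_string = (ids == "NaN").sum()   # string comparison never matches a float NaN
matches_nan = (ids == np.nan).sum()     # NaN is not equal to anything, including itself
matches_isna = ids.isna().sum()         # the reliable way to count missing values

print(matches_string, matches_nan, matches_isna)
```

Only `isna()` (or its inverse `notna()`) picks up the missing entries.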

In [34]:
# CREATING A LIST OF UNIQUE INVOICES WHERE CUSTOMER ID IS NULL
null_id_invoices = df[df.CustomerID.isna()]['InvoiceNo'].drop_duplicates().tolist()
print('Invoices count with null Customer ID:  ', len(null_id_invoices))

Invoices count with null Customer ID:   3710


In [35]:
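# check whether any invoice number with a null CustomerID also appears with a non-null CustomerID
# (both conditions are combined in one mask to avoid reindexing a misaligned boolean Series)
df[df['CustomerID'].notna() & df['InvoiceNo'].isin(null_id_invoices)]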

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


No invoice with a missing customer ID also appears with a known customer ID, so the missing values cannot be filled in from other rows of the same invoice. Since these IDs are missing, I assume the orders were not placed by customers already in the data set, because those customers already have IDs. Assigning the orders to existing customers would also distort the insights drawn from the data. Instead of dropping the rows with a null CustomerID, which amount to ~25% of the data, let's assign them a unique customer ID per order based on InvoiceNo, so that each such order is treated as coming from a new customer.

In [41]:
# check that each invoice number maps to at most one customer ID, so that
# each invoice with a null CustomerID can be assigned a single new customer
df.groupby(['InvoiceNo'])['CustomerID'].nunique().reset_index(name = 'nunique').sort_values(['nunique'], ascending = False).head(10)

Unnamed: 0,InvoiceNo,nunique
0,536365,1
16915,571200,1
16924,571215,1
16923,571214,1
16922,571213,1
16921,571212,1
16920,571205,1
16919,571204,1
16918,571203,1
16917,571202,1


**Observation:**
* Sorted in descending order, the counts top out at 1, so each invoice is related to at most one customer.
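
The assignment described earlier, one stand-in customer per invoice with a missing `CustomerID`, could be sketched as follows. This is a minimal illustration on made-up rows, not necessarily how the rest of the notebook implements it; the `CustomerKey` column name and the `INV_` prefix are hypothetical choices made here to keep stand-in IDs from colliding with real ones.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the rows are made up.
toy = pd.DataFrame({
    "InvoiceNo": [536365, 536365, 536414, 536544, 536544],
    "CustomerID": [17850.0, 17850.0, np.nan, np.nan, np.nan],
})

missing = toy["CustomerID"].isna()

# Assumption: the stand-in ID is the invoice number with an 'INV_' prefix.
# Known IDs are written as integer strings so the new column has one consistent type.
toy["CustomerKey"] = np.where(
    missing,
    "INV_" + toy["InvoiceNo"].astype(str),
    toy["CustomerID"].fillna(0).astype(int).astype(str),
)

print(toy)
```

Writing the result to a new column keeps the original `CustomerID` intact, so rows with stand-in IDs can still be told apart from rows with real ones.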