<a href="https://colab.research.google.com/github/anupstar100/UML-Capston_Project-Customer_Segmentation/blob/main/UML_Capston_Project_Online_Retail_Customer_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Online Retail Customer Segmentation </u></b>

## <b> Problem Description </b>

### In this project, your task is to identify major customer segments in a transnational data set that contains all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

## <b> Data Description </b>

### <b>Attribute Information: </b>

* ### InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
* ### StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* ### Description: Product (item) name. Nominal.
* ### Quantity: The quantities of each product (item) per transaction. Numeric.
* ### InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
* ### UnitPrice: Unit price. Numeric, Product price per unit in sterling.
* ### CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* ### Country: Country name. Nominal, the name of the country where each customer resides.

In [90]:
# mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [91]:
# importing the required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from sklearn.cluster import KMeans

In [92]:
# loading our data
df = pd.read_excel('/content/drive/MyDrive/Capston Project/Online Retail Customer Segmentation/Online Retail.xlsx')

### Glimpses of our data

In [93]:
# first five rows
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [94]:
# last five rows
df.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [95]:
# random five rows
df.sample(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
444133,574721,23511,EMBROIDERED RIBBON REEL EMILY,1,2011-11-06 14:43:00,2.08,17920.0,United Kingdom
458196,575760,22096,PINK PAISLEY SQUARE TISSUE BOX,2,2011-11-11 10:50:00,0.39,15965.0,United Kingdom
202558,554483,22090,PAPER BUNTING RETROSPOT,12,2011-05-24 12:57:00,2.95,13771.0,United Kingdom
816,536464,21814,HEART T-LIGHT HOLDER,2,2010-12-01 12:23:00,1.45,17968.0,United Kingdom
284068,561820,84692,BOX OF 24 COCKTAIL PARASOLS,1,2011-07-29 16:00:00,0.83,,United Kingdom


# Data Information

In [96]:
df.shape

(541909, 8)

In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


**Observation:**
1. There are 8 columns and 541,909 rows of data.
2. There are 4 categorical (object) columns, 3 numerical columns and 1 datetime column.

In [98]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


# Data Cleaning

In [99]:
# checking for duplicate values
df.duplicated().sum()

5268

In [100]:
# dropping the duplicates
df.drop_duplicates(inplace = True)

In [101]:
df.shape

(536641, 8)

**Observations:**
1. There are 5,268 duplicate rows.
2. Shape of data before dropping the duplicates ---> (541909, 8)
3. Shape of data after dropping the duplicates ---> (536641, 8)
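The row counts above are consistent with each other: 541909 - 5268 = 536641. As a minimal sketch of the same relationship on a toy frame (the values below are made up for illustration), `duplicated()` flags every repeat of an earlier row and `drop_duplicates()` removes exactly those rows:

```python
import pandas as pd

# Toy frame (made-up values): the second row repeats the first one exactly.
toy = pd.DataFrame({
    "InvoiceNo": [536365, 536365, 536366],
    "StockCode": ["85123A", "85123A", "71053"],
    "Quantity": [6, 6, 8],
})

# duplicated() marks a row as True when an identical row appeared earlier.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print(len(toy), n_dupes, len(deduped))
```

The number of rows after dropping should always equal the original row count minus the number of flagged duplicates.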

In [102]:
# NUMBER OF UNIQUE VALUES IN EACH COLUMN
df.nunique()

InvoiceNo      25900
StockCode       4070
Description     4223
Quantity         722
InvoiceDate    23260
UnitPrice       1630
CustomerID      4372
Country           38
dtype: int64

In [103]:
# CHECKING FOR NULL VALUES
pd.DataFrame({'Columns' : df.columns,
              'Total No. of Null Values' : df.isna().sum(),
              '% of null values' : round(df.isna().mean() * 100,2)}).reset_index().drop(['index'], axis = 1)

Unnamed: 0,Columns,Total No. of Null Values,% of null values
0,InvoiceNo,0,0.0
1,StockCode,0,0.0
2,Description,1454,0.27
3,Quantity,0,0.0
4,InvoiceDate,0,0.0
5,UnitPrice,0,0.0
6,CustomerID,135037,25.16
7,Country,0,0.0


**Observations:**
1. There are 1,454 null values (0.27%) in the `Description` column.
2. There are 135,037 null values (25.16%) in the `CustomerID` column.

In [104]:
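# Check whether any CustomerID is stored as the literal string 'NaN'.
# This comparison does not detect missing values: CustomerID is a float column,
# so nulls are float NaN and must be tested with isna(), as in the cells that follow.
df[df['CustomerID'] == 'NaN']

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


**Observation:**
* An empty dataframe is returned, so no CustomerID holds the literal string 'NaN'. The missing values themselves are examined with `isna()` in the next cells.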
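A missing CustomerID is stored as a floating-point NaN rather than as the string 'NaN', and NaN never compares equal to anything, including itself. The sketch below uses a toy frame with made-up values to show why an equality test returns nothing while `isna()` finds the missing row:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing CustomerID.
toy = pd.DataFrame({
    "InvoiceNo": [536365, 536366, 536367],
    "CustomerID": [17850.0, np.nan, 13047.0],
})

# Comparing a float column with the string 'NaN' matches nothing.
by_string = toy[toy["CustomerID"] == "NaN"]

# Comparing with np.nan also matches nothing, because NaN != NaN.
by_nan = toy[toy["CustomerID"] == np.nan]

# isna() is the supported way to select missing values.
by_isna = toy[toy["CustomerID"].isna()]

print(len(by_string), len(by_nan), len(by_isna))
```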

In [105]:
# CREATING A LIST OF UNIQUE INVOICES WHERE CUSTOMER ID IS NULL
null_id_invoices = df[df.CustomerID.isna()]['InvoiceNo'].drop_duplicates().tolist()
print('Invoices count with null Customer ID:  ', len(null_id_invoices))

Invoices count with null Customer ID:   3710


In [106]:
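# CHECK IF ANY INVOICE NUMBER FROM THE NULL CUSTOMER ID ROWS ALSO APPEARS IN THE NON-NULL CUSTOMER ID ROWS
# (both conditions are combined into one mask so the boolean index stays aligned with df)
df[df['CustomerID'].notna() & df['InvoiceNo'].isin(null_id_invoices)]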

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


Because these customer IDs are missing, I assume the orders were not placed by customers already in the data set, since those customers already have IDs. Assigning the orders to existing customers would distort the insights drawn from the data. Instead of dropping the rows with a null CustomerID, which make up ~25% of the data, let's give each of them a unique customer ID per order based on InvoiceNo, so that every such order is treated as coming from a new customer.

In [107]:
# CHECK IF EACH INVOICE NUMBER MAPS TO A SINGLE CUSTOMER ID SO THAT
# EACH INVOICE NUMBER WITH A NULL CUSTOMER ID CAN BE ASSIGNED A NEW CUSTOMER.
df.groupby(['InvoiceNo'])['CustomerID'].nunique().reset_index(name = 'nunique').sort_values(['nunique'], ascending = False).head(10)

Unnamed: 0,InvoiceNo,nunique
0,536365,1
16915,571200,1
16924,571215,1
16923,571214,1
16922,571213,1
16921,571212,1
16920,571205,1
16919,571204,1
16918,571203,1
16917,571202,1


**Observation:**
* Sorted in descending order, the data shows that each invoice relates to at most one customer.

In [108]:
# CREATING NewID COLUMN AND ASSIGNING InvoiceNo TO IT WHERE CustomerID IS NULL
df['NewID'] = df['CustomerID']
df.loc[df['CustomerID'].isna(), ['NewID']] = df['InvoiceNo']

In [109]:
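# REMOVE ALL NON-DIGIT CHARACTERS FROM THE NewID COLUMN
# SINCE AN INVOICE CAN CONTAIN 'C' REFERRING TO CANCELLATIONS
# (raw string plus regex=True so the pattern is treated as a regex in every pandas version)
df['NewID'] = df['NewID'].astype(str).str.replace(r'\D+', '', regex=True)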

In [110]:
# CONVERTING TO INTEGER
df['NewID'] = pd.to_numeric(df['NewID'])
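One detail worth checking in the cells above: known customers still hold float values such as 17850.0 when `astype(str)` is applied, so stripping non-digits appears to turn `'17850.0'` into `'178500'`. The IDs stay unique, but they no longer match the original CustomerID. The sketch below, on a toy frame with made-up rows, is one possible way to build the column without that trailing zero; it is an assumption about how the step could be written, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows): two known customers, two orders without a CustomerID.
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", "A563185", 536544],
    "CustomerID": [17850.0, 14527.0, np.nan, np.nan],
})

# What happens when a float ID is stringified before stripping non-digits.
legacy = pd.Series([17850.0]).astype(str).str.replace(r"\D+", "", regex=True)

# Digits of the invoice number, e.g. 'C536379' -> 536379.
invoice_digits = pd.to_numeric(
    toy["InvoiceNo"].astype(str).str.replace(r"\D+", "", regex=True)
).astype("Int64")

# Known customers keep their ID as a nullable integer (17850.0 -> 17850);
# rows without a CustomerID fall back to the invoice digits.
toy["NewID"] = toy["CustomerID"].astype("Int64").fillna(invoice_digits)

print(legacy.iloc[0])
print(toy["NewID"].tolist())
```

Because customer IDs in this data set have 5 digits and invoice numbers have 6, the two ranges should not collide when the IDs are kept as integers.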

In [111]:
# CHECK WHETHER EXISTING CustomerIDs AND NewIDs SHARE ANY VALUES, SINCE AN OVERLAP WOULD DISTORT ACTUAL CUSTOMER INSIGHTS
customer = df['CustomerID'].nunique()
null_invoices = df[df.CustomerID.isnull()]['InvoiceNo'].nunique()
new_ids = df['NewID'].nunique()
print("Number of Customers:", customer)
print("Number of Orders where CustomerID is Null:", null_invoices)
print("Number of Customers + Number of Orders where CustomerID is Null:", customer + null_invoices)
print("Number of New ID's:", new_ids)

Number of Customers: 4372
Number of Orders where CustomerID is Null: 3710
Number of Customers + Number of Orders where CustomerID is Null: 8082
Number of New ID's: 8082


* Since both values are equal (4372 + 3710 = 8082), every order without a customer ID was assigned its own unique NewID, and none of the new IDs collides with an existing customer ID.
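The count comparison above is an indirect test. A more direct version, sketched here on a toy frame with made-up values, intersects the set of IDs belonging to known customers with the set of IDs derived from invoices; an empty intersection means no order was merged into an existing customer.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one known customer with two rows,
# plus two orders whose NewID was derived from the invoice number.
toy = pd.DataFrame({
    "InvoiceNo": [536365, 536365, 536414, 536544],
    "CustomerID": [17850.0, 17850.0, np.nan, np.nan],
    "NewID": [17850, 17850, 536414, 536544],
})

missing = toy["CustomerID"].isna()

# IDs that belong to known customers versus IDs created from invoices.
known_ids = set(toy.loc[~missing, "NewID"])
surrogate_ids = set(toy.loc[missing, "NewID"])

# Any value in both sets would mean an order was attached to a real customer.
overlap = known_ids & surrogate_ids

print(len(known_ids), len(surrogate_ids), len(overlap))
```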

In [112]:
# RANGE OF InvoiceDate COLUMN
print('Maximum Invoice Date: ', max(df['InvoiceDate']))
print('Minimum Invoice Date: ', min(df['InvoiceDate']))

Maximum Invoice Date:  2011-12-09 12:50:00
Minimum Invoice Date:  2010-12-01 08:26:00


In [113]:
# ADDING cancellations COLUMN BASED ON THE DEFINITION THAT InvoiceNo STARTS WITH 'C'
# NOTE: InvoiceNo mixes integers and strings, so .str.startswith returns NaN for the
# integer invoices and np.where treats NaN as true; almost every row is flagged here.
df["cancellations"] = np.where(df["InvoiceNo"].str.startswith('C'), 1,0)
total_data = df["InvoiceNo"].shape[0]
cancelled_data = df[df.cancellations == 1].shape[0]
print("Number of cancelled products data", cancelled_data, cancelled_data*100/total_data, "\n")

print(df[df.cancellations == 1]["Quantity"].describe())

#### Removing cancellations since they have negative quantities and make up only ~2% of the data
data = df[df.cancellations == 0]

Number of cancelled products data 536638 99.99944096705246 

count    536638.000000
mean          9.620077
std         219.130768
min      -80995.000000
25%           1.000000
50%           3.000000
75%          10.000000
max       80995.000000
Name: Quantity, dtype: float64


In [114]:
df[df.cancellations == 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,NewID,cancellations
299982,A563185,B,Adjust bad debt,1,2011-08-12 14:50:00,11062.06,,United Kingdom,563185,0
299983,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00,-11062.06,,United Kingdom,563186,0
299984,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00,-11062.06,,United Kingdom,563187,0


In [115]:
df['InvoiceNo'][1]

536365
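The last two outputs point at the cause of the 99.99% figure: only the invoices starting with 'A' were left unflagged, and `df['InvoiceNo'][1]` is an integer rather than a string. On a mixed column, `.str.startswith` yields NaN for non-string entries, and `np.where` counts NaN as true. A minimal sketch on a toy series (made-up values), with casting to `str` first as one possible fix:

```python
import numpy as np
import pandas as pd

# Toy series (made-up values) mixing an integer invoice with string invoices.
invoices = pd.Series([536365, "C536379", "A563185"], dtype=object)

# .str.startswith returns NaN for the integer entry.
raw = invoices.str.startswith("C")

# np.where treats NaN as true, so the integer invoice is flagged as cancelled.
naive = np.where(raw, 1, 0)

# Casting every entry to str first gives a clean boolean mask.
fixed = np.where(invoices.astype(str).str.startswith("C"), 1, 0)

print(naive.tolist(), fixed.tolist())
```

With the cast in place, only invoices whose number really begins with 'C' are marked as cancellations.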