### Table of Contents

* [Introduction](#Intro)
    * [Problem Statement](#Problem)
* [Dataset Preparation](#DatasetPrep)
    * [Exploratory Data Analysis](#EDA)
* [Customer Categorisation with K-means Clustering](#Clustering)
* [Fine tuning the algorithm](#Tuning)
* [Visualising the results](#DataViz)
* [Interpreting the results](#Results)
* [Conclusions](#Conclusions)

### Introduction

#### Problem Statement
Business case

### Dataset Preparation

#### Dataset

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go 
from plotly.subplots import make_subplots

from sklearn.cluster import KMeans

%matplotlib inline
%config InlineBackend.figure_format='retina'

pd.options.mode.chained_assignment = None

In [3]:
Order_Records = pd.read_csv("Orders - Analysis Task.csv")

In [4]:
Order_Records.head()

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
0,DPR,DPR,100,AD-982-708-895-F-6C894FB,52039657,1312378,83290718932496,04/12/2018,2,200.0,-200.0,0.0,0.0,0.0,0.0,0,2
1,RJF,Product P,28 / A / MTM,83-490-E49-8C8-8-3B100BC,56914686,3715657,36253792848113,01/04/2019,2,190.0,-190.0,0.0,0.0,0.0,0.0,0,2
2,CLH,Product B,32 / B / FtO,68-ECA-BC7-3B2-A-E73DE1B,24064862,9533448,73094559597229,05/11/2018,0,164.8,-156.56,-8.24,0.0,0.0,0.0,-2,2
3,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,53616575668264,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
4,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,29263220319421,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1


In [5]:
Order_Records.shape
print(f"The raw data has {Order_Records.shape[0]} order records with {Order_Records.shape[1]} variables describing each order record")

The raw data has 70052 order records with 17 variables describing each order record


In [6]:
Order_Records.info() #Overview of the customer order records

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70052 entries, 0 to 70051
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   product_title           70052 non-null  object 
 1   product_type            70052 non-null  object 
 2   variant_title           70052 non-null  object 
 3   variant_sku             70052 non-null  object 
 4   variant_id              70052 non-null  int64  
 5   customer_id             70052 non-null  int64  
 6   order_id                70052 non-null  int64  
 7   day                     70052 non-null  object 
 8   net_quantity            70052 non-null  int64  
 9   gross_sales             70052 non-null  float64
 10  discounts               70052 non-null  float64
 11  returns                 70052 non-null  float64
 12  net_sales               70052 non-null  float64
 13  taxes                   70052 non-null  float64
 14  total_sales             70052 non-null

#### Exploratory Data Analysis

In [7]:
# Check for missing (NA) values in each column
Order_Records.isna().sum()

In [8]:
Order_Records.describe() #Summary of each independent variable

Unnamed: 0,variant_id,customer_id,order_id,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
count,70052.0,70052.0,70052.0,70052.0,70052.0,70052.0,70052.0,70052.0,70052.0,70052.0,70052.0,70052.0
mean,244232000000.0,601309100000.0,55060750000000.0,0.701179,61.776302,-4.949904,-10.246051,46.580348,9.123636,55.703982,-0.156098,0.857277
std,4255079000000.0,6223201000000.0,25876400000000.0,0.739497,31.800689,7.769972,25.154677,51.80269,10.305236,61.920557,0.36919,0.38082
min,10014470.0,1000661.0,10006570000000.0,-3.0,0.0,-200.0,-237.5,-237.5,-47.5,-285.0,-3.0,0.0
25%,26922230.0,3295695.0,32703170000000.0,1.0,51.67,-8.34,0.0,47.08,8.375,56.2275,0.0,1.0
50%,44945140.0,5566107.0,55222070000000.0,1.0,74.17,0.0,0.0,63.33,12.66,76.0,0.0,1.0
75%,77431060.0,7815352.0,77368760000000.0,1.0,79.17,0.0,0.0,74.17,14.84,89.0,0.0,1.0
max,84222120000000.0,99774090000000.0,99995540000000.0,6.0,445.0,0.0,0.0,445.0,63.34,445.0,0.0,6.0


From the summary of the numeric variables, a few observations stand out:
1. returned_item_quantity is zero or negative for every record (its maximum is 0).
2. net_quantity is zero or negative for a significant number of records (its minimum is -3).
3. ordered_item_quantity is zero for some records (its minimum is 0).

TODO: tabulate the data and draw some graphs here.
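The three observations can be tabulated in one pass by counting negative, zero and positive values per quantity column. The following is a minimal sketch, not part of the original notebook: the helper name `sign_counts` is hypothetical, and the small `sample` frame only copies a few quantity values from the rows displayed above so that the block runs on its own. In the notebook itself, `Order_Records` would be passed instead.

```python
import pandas as pd


def sign_counts(frame, columns):
    """Count negative, zero and positive values for each given column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rows = {}
    for col in columns:
        rows[col] = {
            "negative": int((frame[col] < 0).sum()),
            "zero": int((frame[col] == 0).sum()),
            "positive": int((frame[col] > 0).sum()),
        }
    # One row per column, one column per sign bucket
    return pd.DataFrame.from_dict(rows, orient="index")


# Illustrative sample: quantity values copied from a few displayed rows
sample = pd.DataFrame({
    "net_quantity": [2, 2, 0, 1, -1],
    "returned_item_quantity": [0, 0, -2, 0, -1],
    "ordered_item_quantity": [2, 2, 2, 1, 0],
})

summary = sign_counts(sample, list(sample.columns))
print(summary)
```

On the full data, a call such as `sign_counts(Order_Records, ["net_quantity", "returned_item_quantity", "ordered_item_quantity"])` should give the same counts that the cells below print one at a time.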

In [9]:
Cust_NQZ = Order_Records.loc[Order_Records['net_quantity'] == 0] #order records with net quantity zero
Cust_NQLZ = Order_Records.loc[Order_Records['net_quantity'] < 0]

print(f"There are {Cust_NQZ.shape[0]} order records with net_quantity as zero")
print(f"There are {Cust_NQLZ.shape[0]} order records with net_quantity less than zero")

There are 68 order records with net_quantity as zero
There are 10715 order records with net_quantity less than zero


In [10]:
Cust_RIQZ = Order_Records.loc[Order_Records['returned_item_quantity'] == 0] #order records with returned_item_quantity zero
Cust_RIQLZ = Order_Records.loc[Order_Records['returned_item_quantity'] < 0]

print(f"There are {Cust_RIQZ.shape[0]} order records with returned_item_quantity as zero")
print(f"There are {Cust_RIQLZ.shape[0]} order records with returned_item_quantity less than zero")

There are 59269 order records with returned_item_quantity as zero
There are 10783 order records with returned_item_quantity less than zero


In [11]:
Cust_OIQZ = Order_Records.loc[Order_Records['ordered_item_quantity'] == 0] #order records with ordered_item_quantity zero
Cust_OIQLZ = Order_Records.loc[Order_Records['ordered_item_quantity'] < 0]

print(f"There are {Cust_OIQZ.shape[0]} order records with ordered_item_quantity as zero")
print(f"There are {Cust_OIQLZ.shape[0]} order records with ordered_item_quantity less than zero")

There are 10715 order records with ordered_item_quantity as zero
There are 0 order records with ordered_item_quantity less than zero


For a significant number of rows (10715), ordered_item_quantity is zero; we need to investigate why these orders have nothing ordered.
For the same number of rows, net_quantity is less than zero. This may indicate erroneous data, so it needs further investigation.

In [12]:
Cust_NQLZ.head()

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
59295,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,13666410519728,01/03/2019,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0
59300,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,53616575668264,23/02/2019,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0
59305,RJF,Product T,28 / A / 9,4D-D1F-A14-8D9-0-FD0E84A,31355561,3715657,93146430228825,04/12/2018,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0
59314,OTH,Product F,40 / B / FtO,53-5CA-7CF-8F5-9-28CB78B,43823868,4121004,53616575668264,23/02/2019,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0
59328,YQX,Product H,40 / B / FtO,F2-055-4C3-8C3-0-7070F1D,25826279,4121004,13666410519728,01/03/2019,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0


In [13]:
Cust_OIQZ.head()

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
59295,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,13666410519728,01/03/2019,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0
59300,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,53616575668264,23/02/2019,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0
59305,RJF,Product T,28 / A / 9,4D-D1F-A14-8D9-0-FD0E84A,31355561,3715657,93146430228825,04/12/2018,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0
59314,OTH,Product F,40 / B / FtO,53-5CA-7CF-8F5-9-28CB78B,43823868,4121004,53616575668264,23/02/2019,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0
59328,YQX,Product H,40 / B / FtO,F2-055-4C3-8C3-0-7070F1D,25826279,4121004,13666410519728,01/03/2019,-1,0.0,0.0,0.0,0.0,0.0,0.0,-1,0


It looks like net_quantity is less than zero when returned_item_quantity is negative and ordered_item_quantity is zero. To determine whether the negative returned_item_quantity values are erroneous, we need to look further into the data.
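One way to test this relationship is to check whether net_quantity equals ordered_item_quantity plus returned_item_quantity on every row. That identity is an assumption inferred from the rows displayed so far, not something the dataset documents. The sketch below uses a small hand-copied `sample` so it runs standalone; in the notebook the same expressions would be applied to `Order_Records`.

```python
import pandas as pd

# Quantity values copied from rows 0, 2, 3 and 59295 shown above
sample = pd.DataFrame(
    {
        "net_quantity": [2, 0, 1, -1],
        "returned_item_quantity": [0, -2, 0, -1],
        "ordered_item_quantity": [2, 2, 1, 0],
    },
    index=[0, 2, 3, 59295],
)

# Assumed identity: net = ordered + returned (returns are stored as negatives)
expected_net = sample["ordered_item_quantity"] + sample["returned_item_quantity"]
mismatches = sample.loc[sample["net_quantity"] != expected_net]
print(f"{len(mismatches)} rows violate net = ordered + returned")

# Do all rows with negative net_quantity have a return and nothing ordered?
negative_net = sample.loc[sample["net_quantity"] < 0]
all_pure_returns = bool(
    (
        (negative_net["returned_item_quantity"] < 0)
        & (negative_net["ordered_item_quantity"] == 0)
    ).all()
)
print(f"All negative-net rows are pure returns: {all_pure_returns}")
```

If `mismatches` is empty on the full data, the negative values are more likely a sign convention for returns than data-entry errors.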

In [14]:
# Let's take record 59300 (customer_id 4121004) and inspect that customer's order history
customerx = Order_Records.loc[Order_Records['customer_id']==4121004]
customerx

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
3,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,53616575668264,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
4,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,29263220319421,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
5,OTH,Product F,40 / B / FtO,53-5CA-7CF-8F5-9-28CB78B,43823868,4121004,53616575668264,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
6,OTH,Product F,40 / B / FtO,53-5CA-7CF-8F5-9-28CB78B,43823868,4121004,29263220319421,19/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
7,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,13666410519728,20/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
8,OTH,Product F,40 / C / FtO,8B-2C5-548-6C6-E-B5EECBC,43823868,4121004,80657249973427,22/02/2019,1,119.0,-119.0,0.0,0.0,0.0,0.0,0,1
16,WHX,Product P,40 / C / FtO,44-893-E04-6EF-F-E418295,14526828,4121004,80657249973427,22/02/2019,1,95.0,-95.0,0.0,0.0,0.0,0.0,0,1
19,WHX,Product P,40 / B / FtO,AC-93B-065-BD2-A-5D62CD8,14526828,4121004,29263220319421,19/02/2019,1,95.0,-95.0,0.0,0.0,0.0,0.0,0,1
24,YQX,Product H,40 / B / FtO,F2-055-4C3-8C3-0-7070F1D,25826279,4121004,13666410519728,20/02/2019,1,89.0,-89.0,0.0,0.0,0.0,0.0,0,1
27,YQX,Product H,40 / B / FtO,F2-055-4C3-8C3-0-7070F1D,25826279,4121004,29263220319421,19/02/2019,1,89.0,-89.0,0.0,0.0,0.0,0.0,0,1


Compare record 59300 with record 4: record 4 (19/02/2019) predates record 59300 (23/02/2019), and both refer to the same customer and variant_sku. This implies that returned items are marked with a negative sign and are not erroneous records. For these records, net_quantity is less than zero because the record carries only a return and no ordered items.
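The same comparison can be made programmatically by parsing `day` as a date and joining each return line to a purchase line. The sketch below assumes that a return reuses the `order_id` of the original purchase. That holds for record 59300 and record 3 in the output above (record 4 belongs to a different `order_id`), but it is not confirmed for the whole dataset. The `sample` frame copies only those three rows so the block runs on its own.

```python
import pandas as pd

# Rows 3, 4 and 59300 for customer 4121004, copied from the outputs above
sample = pd.DataFrame(
    {
        "customer_id": [4121004, 4121004, 4121004],
        "order_id": [53616575668264, 29263220319421, 53616575668264],
        "variant_sku": ["6C-1F1-226-1B3-2-3542B41"] * 3,
        "day": ["19/02/2019", "19/02/2019", "23/02/2019"],
        "ordered_item_quantity": [1, 1, 0],
        "returned_item_quantity": [0, 0, -1],
    },
    index=[3, 4, 59300],
)

# Dates are day-first (dd/mm/yyyy), so give the format explicitly
sample["day"] = pd.to_datetime(sample["day"], format="%d/%m/%Y")

purchases = sample.loc[sample["ordered_item_quantity"] > 0]
returns = sample.loc[sample["returned_item_quantity"] < 0]

# Assumption: a return line shares customer, order and SKU with its purchase
keys = ["customer_id", "order_id", "variant_sku"]
matched = returns.reset_index().merge(
    purchases.reset_index(), on=keys, suffixes=("_return", "_purchase")
)
matched["purchase_precedes_return"] = matched["day_purchase"] <= matched["day_return"]

print(matched[["index_return", "index_purchase", "day_purchase",
               "day_return", "purchase_precedes_return"]])
```

Returns left unmatched by such a join on the full data would be the ones worth inspecting by hand.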

In [15]:
Cust_NQZ.head() #orders where net quantity is zero

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
2,CLH,Product B,32 / B / FtO,68-ECA-BC7-3B2-A-E73DE1B,24064862,9533448,73094559597229,05/11/2018,0,164.8,-156.56,-8.24,0.0,0.0,0.0,-2,2
22,KNB,Product H,28 / B / 29,BA-184-06C-4E3-6-1DC738F,10434338,1481447,82857371444896,09/11/2018,0,89.0,-89.0,0.0,0.0,0.0,0.0,-1,1
23,EYV,Product H,31 / A / FtO,E5-666-054-F18-A-90F6B20,22559066,3619130,88025805105285,13/11/2018,0,89.0,-89.0,0.0,0.0,0.0,0.0,-1,1
30,WHX,Product P,32 / B / FtO,85-2EB-163-D62-5-FC50316,26246865,9533448,73094559597229,05/11/2018,0,74.2,-70.49,-3.71,0.0,0.0,0.0,-1,1
31,KNB,Product P,32 / B / FtO,C5-B40-3CE-CB1-9-672218E,30277881,9533448,12837914491890,05/11/2018,0,74.2,-70.49,-3.71,0.0,0.0,0.0,-1,1


These are the records where the ordered quantity and the returned quantity are the same, i.e. the order was effectively cancelled, possibly before checkout was completed. These records can be removed as they are not useful for this analysis.
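Before dropping these rows, it may be worth confirming that every zero-net record really is a fully returned order, meaning ordered_item_quantity equals the negated returned_item_quantity. The sketch below is an illustrative check on a hand-copied `sample` (rows 0, 2, 22 and 3 from the outputs above), not a step from the original notebook.

```python
import pandas as pd

# Quantity values copied from rows 0, 2, 22 and 3 shown above
sample = pd.DataFrame(
    {
        "net_quantity": [2, 0, 0, 1],
        "ordered_item_quantity": [2, 2, 1, 1],
        "returned_item_quantity": [0, -2, -1, 0],
    },
    index=[0, 2, 22, 3],
)

zero_net = sample.loc[sample["net_quantity"] == 0]

# A fully returned order has ordered == -returned (returns are negative)
fully_returned = zero_net["ordered_item_quantity"] == -zero_net["returned_item_quantity"]
print(f"{int(fully_returned.sum())} of {len(zero_net)} zero-net rows are fully returned")

# Keep only rows with a non-zero net quantity
cleaned = sample.loc[sample["net_quantity"] != 0]
print(f"Rows before: {len(sample)}, after: {len(cleaned)}")
```

The before and after row counts should differ by exactly the number of zero-net records.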

In [16]:

print(f"There are {Cust_NQZ.shape[0]} order records with net_quantity as zero")

There are 68 order records with net_quantity as zero


In [17]:
Order_Records = Order_Records.loc[Order_Records['net_quantity'] != 0]  # Remove records with net_quantity == 0

In [18]:
Order_Records.shape 

(69984, 17)