# Customer Segmentation 

Customer segmentation is an extremely popular technique in customer analytics. It aims to divide customers into groups with distinct behaviours so that each group can be treated as a class. This is a powerful concept.

Treating customers by segment can lead to better customer engagement, a better customer experience and lower costs for the business.

In [1]:
import logging
from datetime import datetime
import pandas as pd
import numpy as np
import tensorflow as tf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from scipy import stats

init_notebook_mode(connected=True)
np.set_printoptions(suppress=True)

logger = logging.getLogger('tensorflow')
logger.setLevel(logging.DEBUG)

In [2]:
city = pd.read_csv("data/city.csv", names = ['CityID', 'CityName', 'Zipcode', 'CountryID'], header = 0)
country = pd.read_csv("data/country.csv" , names = ['CountryID', 'CountryName', 'CountryCode'], header = 0)
customer = pd.read_csv("data/customer.csv", names = ['CustomerID', 'FirstName', 'MiddleInitial', 'LastName',
       'CityID', 'Address'], header = 0)
product = pd.read_csv("data/product.csv", names = ['ProductID', 'ProductName', 'Price', 'CategoryID', 'Class',
       'ModifyDate', 'Resistant', 'IsAllergic', 'VitalityDays'], header = 0)
product_category = pd.read_csv("data/product_category.csv", names = ['CategoryID', 'CategoryName'], header = 0)
staff = pd.read_csv("data/staff.csv", names = ['EmployeeID', 'FirstName', 'MiddleInitial', 'LastName',
       'BirthDate', 'Gender', 'CityID', 'HireDate'], header = 0)
transaction = pd.read_csv("data/transaction.csv", names = ['SalesID', 'SalesPersonID', 'CustomerID', 'ProductID',
       'Quantity', 'Discount', 'TotalPrice', 'SalesDate',
       'TransactionNumber'], header = 0)

In [3]:
city.head()

Unnamed: 0,CityID,CityName,Zipcode,CountryID
0,1,Dayton,80563,32
1,2,Buffalo,17420,32
2,3,Chicago,44751,32
3,4,Fremont,20641,32
4,5,Virginia Beach,62389,32


In [4]:
country.head()

Unnamed: 0,CountryID,CountryName,CountryCode
0,1,Armenia,AN
1,2,Canada,FO
2,3,Belize,MK
3,4,Uganda,LV
4,5,Thailand,VI


In [5]:
customer.head()

Unnamed: 0,CustomerID,FirstName,MiddleInitial,LastName,CityID,Address
0,1,Stefanie,Y,Frye,79,97 Oak Avenue
1,2,Sandy,T,Kirby,96,52 White First Freeway
2,3,Lee,T,Zhang,55,921 White Fabien Avenue
3,4,Regina,S,Avery,40,75 Old Avenue
4,5,Daniel,S,Mccann,2,283 South Green Hague Avenue


In [6]:
product.head()

Unnamed: 0,ProductID,ProductName,Price,CategoryID,Class,ModifyDate,Resistant,IsAllergic,VitalityDays
0,1,Flour - Whole Wheat,742988,3,Medium,2018-02-16 08:21:49.190,Durable,,
1,2,Cookie Chocolate Chip With,912329,3,Medium,2017-02-12 11:39:10.970,,,
2,3,Onions - Cippolini,91379,9,Medium,2018-03-15 08:11:51.560,Weak,False,111.0
3,4,"Sauce - Gravy, Au Jus, Mix",543055,9,Medium,2017-07-16 00:46:28.880,Durable,,
4,5,Artichokes - Jerusalem,654771,2,Low,2017-08-16 14:13:35.430,Durable,True,27.0


In [7]:
product_category.head()

Unnamed: 0,CategoryID,CategoryName
0,1,Confections
1,2,Shell fish
2,3,Cereals
3,4,Dairy
4,5,Beverages


In [8]:
transaction.head()

Unnamed: 0,SalesID,SalesPersonID,CustomerID,ProductID,Quantity,Discount,TotalPrice,SalesDate,TransactionNumber
0,1,6,27039,381,7,,0,2018-02-05 07:38:25.430,FQL4S94E4ME1EZFTG42G
1,2,16,25011,61,7,,0,2018-02-02 16:03:31.150,12UGLX40DJ1A5DTFBHB8
2,3,13,94024,23,24,,0,2018-05-03 19:31:56.880,5DT8RCPL87KI5EORO7B0
3,4,8,73966,176,19,0.2,0,2018-04-07 14:43:55.420,R3DR9MLD5NR76VO17ULE
4,5,10,32653,310,9,,0,2018-02-12 15:37:03.940,4BGS0Z5OMAZ8NDAFHHP3


In [9]:
staff.head()

Unnamed: 0,EmployeeID,FirstName,MiddleInitial,LastName,BirthDate,Gender,CityID,HireDate
0,1,Nicole,T,Fuller,1981-03-07 00:00:00.000,F,80,2011-06-20 07:15:36.920
1,2,Christine,W,Palmer,1968-01-25 00:00:00.000,F,4,2011-04-27 04:07:56.930
2,3,Pablo,Y,Cline,1963-02-09 00:00:00.000,M,70,2012-03-30 18:55:23.270
3,4,Darnell,O,Nielsen,1989-02-06 00:00:00.000,M,39,2014-03-06 06:55:02.780
4,5,Desiree,L,Stuart,1963-05-03 00:00:00.000,F,23,2014-11-16 22:59:54.720


## Building the feature vector

In order to categorise the customers, we need to build a semantic feature vector. In other words, we need to decide which context the clusters should be built over: geography, purchase behaviour, etc.

Given that this is a made-up example, let's build a feature vector that involves all the information we have. This may not be the right thing to do, and design decisions such as these should be discussed in detail with the team who's interested in building these clusters.

We start from the transaction dataset, whose features will hopefully correspond to purchasing behaviour. Then we can join the customer data to the transactions.

In [10]:
transaction.dtypes

SalesID                int64
SalesPersonID          int64
CustomerID             int64
ProductID              int64
Quantity               int64
Discount             float64
TotalPrice            object
SalesDate             object
TransactionNumber     object
dtype: object

In [11]:
fv = transaction.merge(customer,on='CustomerID', sort = True)
fv.head()

Unnamed: 0,SalesID,SalesPersonID,CustomerID,ProductID,Quantity,Discount,TotalPrice,SalesDate,TransactionNumber,FirstName,MiddleInitial,LastName,CityID,Address
0,167492,12,1,125,1,,0,2018-01-06 14:36:14.790,26SHESF8FGH4REW2DKOD,Stefanie,Y,Frye,79,97 Oak Avenue
1,265787,14,1,278,1,,0,2018-04-06 04:10:50.820,ISUU3VFV7CPUPPZOZ7QL,Stefanie,Y,Frye,79,97 Oak Avenue
2,328672,10,1,413,1,0.2,0,2018-04-03 23:21:47.970,TI5RNCT9I5S3L1HWT5WU,Stefanie,Y,Frye,79,97 Oak Avenue
3,413293,9,1,415,1,,0,2018-04-08 01:34:02.080,2XQMF8227X1B22YPMCN3,Stefanie,Y,Frye,79,97 Oak Avenue
4,452499,9,1,214,1,,0,2018-01-03 05:24:59.690,17URA7QKGLD0BBKENLWZ,Stefanie,Y,Frye,79,97 Oak Avenue
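
The merge above relies on pandas' default inner join, so any transaction whose `CustomerID` is missing from `customer` would be dropped without notice. A minimal sketch of one way to check for that, using toy frames rather than the real data (the names `tx`, `cust`, `checked` and `unmatched` are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the transaction and customer tables (illustrative only).
tx = pd.DataFrame({"SalesID": [1, 2, 3], "CustomerID": [1, 1, 9]})
cust = pd.DataFrame({"CustomerID": [1, 2], "CityID": [79, 96]})

# A left join with indicator=True keeps every transaction and records
# whether a matching customer row was found; validate="many_to_one"
# raises if CustomerID is not unique in the customer table.
checked = tx.merge(cust, on="CustomerID", how="left",
                   indicator=True, validate="many_to_one")

# Transactions with no matching customer are marked "left_only".
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "transaction(s) without a customer record")
```

If the count is zero on the real data, the inner join and the left join give the same rows and nothing is lost.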


We obviously don't want to overfit to things such as IDs; they are generated values and carry no signal about the underlying distribution. Getting rid of names and personal information also helps with information security in the data science value chain.

We can also drop the sales date, as we are looking at a small window of sales. A segmentation with a temporal input is also much more complicated and isn't necessary here.

In [12]:
fv = fv.drop(['SalesID','SalesPersonID','SalesDate', 'TransactionNumber','FirstName',
                       'MiddleInitial','LastName','Address'], axis = 1)
fv.head()

Unnamed: 0,CustomerID,ProductID,Quantity,Discount,TotalPrice,CityID
0,1,125,1,,0,79
1,1,278,1,,0,79
2,1,413,1,0.2,0,79
3,1,415,1,,0,79
4,1,214,1,,0,79


Our feature vector looks pretty good. We have the basic semantics around products, discounts, expenditure and, indirectly through CityID, geography. But a customer who made multiple transactions appears in multiple rows, so we need to pivot the product ID out into columns to get one row per customer.

There is a chance that the same transaction covered several product IDs, but we are only interested in the product semantics, so we won't worry about that.

Another thing we need to handle is quantity. A raw quantity on its own doesn't make much sense, since some items, such as toilet paper, generally get bought in bulk, while others, such as an electric toothbrush, are probably bought one at a time. So let's scale each quantity by the average quantity for that product.

In [13]:
prd_avg = transaction.groupby(['ProductID'])['Quantity'].mean().reset_index()
prd_avg

Unnamed: 0,ProductID,Quantity
0,1,12.823286
1,2,13.000933
2,3,13.121500
3,4,12.939189
4,5,12.933273
5,6,13.414512
6,7,13.104599
7,8,13.013680
8,9,12.990385
9,10,12.979647


In [14]:
fv = fv.merge(prd_avg,on=['ProductID'],sort = True)
fv['QuantityScaled'] = fv.Quantity_x / fv.Quantity_y
fv = fv.drop(['Quantity_x','Quantity_y'], axis = 1)
fv.head()

Unnamed: 0,CustomerID,ProductID,Discount,TotalPrice,CityID,QuantityScaled
0,111,1,,0,55,0.077983
1,143,1,0.1,0,57,0.077983
2,153,1,,0,48,0.077983
3,198,1,,0,79,0.077983
4,204,1,,0,9,0.077983
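
The same per-product scaling can also be written without the intermediate merge. The sketch below uses a toy frame (the data and the name `product_mean` are made up) and computes the mean over the frame itself, which may differ slightly from averaging over the full `transaction` table if the two do not hold the same rows:

```python
import pandas as pd

# Toy transactions (illustrative only): product 1 sells in bulk, product 2 singly.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "ProductID":  [1, 1, 2, 2],
    "Quantity":   [10, 30, 1, 3],
})

# transform("mean") returns the per-product mean aligned to each row,
# so no merge and no Quantity_x / Quantity_y suffixes are needed.
product_mean = toy.groupby("ProductID")["Quantity"].transform("mean")
toy["QuantityScaled"] = toy["Quantity"] / product_mean
print(toy)
```

A scaled value of 1.0 then means "the typical amount for this product", regardless of whether the product is usually bought in bulk.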


Another thing we need to handle is the discount. There are a lot of NaN values. Let's just set these to zero, assuming these transactions didn't get a discount. Of course this is an assumption, but for the sake of this exercise let's let that be the case.

In [15]:
# Fill missing discounts with zero; assigning the result of fillna back to the
# column avoids chained indexing and the SettingWithCopyWarning it triggers.
fv['Discount'] = fv['Discount'].fillna(0.0)
fv.head()

Unnamed: 0,CustomerID,ProductID,Discount,TotalPrice,CityID,QuantityScaled
0,111,1,0.0,0,55,0.077983
1,143,1,0.1,0,57,0.077983
2,153,1,0.0,0,48,0.077983
3,198,1,0.0,0,79,0.077983
4,204,1,0.0,0,9,0.077983
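
Before imputing, it can help to measure how much of the column is actually missing, since "no discount recorded" is only an assumption about what NaN means. A minimal sketch on a toy column (the data and the name `missing_share` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy discount column (illustrative only).
toy = pd.DataFrame({"Discount": [np.nan, 0.1, np.nan, 0.2]})

# Share of rows with no recorded discount, worth checking before imputing.
missing_share = toy["Discount"].isna().mean()

# fillna returns a new Series; assigning it back to the column avoids
# chained indexing of the form df.col[mask] = value.
toy["Discount"] = toy["Discount"].fillna(0.0)
print(missing_share, toy["Discount"].tolist())
```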


Total price here is a string and zero (good one, data warehousing team), but we are not going to use it anyway. Here's the intuition: each product has its own associated cost, and total price is a function of quantity. So we already capture the semantics of per-product price and total expenditure.
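
The claim that a column carries no information can be checked rather than eyeballed. A rough sketch, assuming the column holds numeric strings as the dtypes suggest (the toy series and the name `carries_signal` are made up):

```python
import pandas as pd

# Toy column mimicking a TotalPrice stored as strings (illustrative only).
total_price = pd.Series(["0", "0", "0"], name="TotalPrice")

# to_numeric converts the strings; errors="coerce" turns anything
# unparseable into NaN instead of raising.
as_number = pd.to_numeric(total_price, errors="coerce")

# If every parsed value is the same, the column has no variance and
# cannot help separate customers.
carries_signal = as_number.nunique(dropna=True) > 1
print(carries_signal)
```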

In [16]:
fv = fv.drop(['TotalPrice','Discount'], axis = 1)

Now we have to pivot out the product, since treating the product ID as an ordinal makes no sense. Each product becomes its own column, holding the scaled quantity as its value.

ValueError: Index contains duplicate entries, cannot reshape
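
The `ValueError` above is what pandas raises when `DataFrame.pivot` meets more than one row for the same index/column pair, which happens here as soon as a customer buys the same product twice. The cell that produced the error is not shown, so the following is only a sketch of one way around it, assuming that summing `QuantityScaled` per customer and product is an acceptable aggregation. The toy frame and the name `customer_matrix` are made up for illustration:

```python
import pandas as pd

# Toy feature rows (illustrative only): customer 1 bought product 1 twice.
toy = pd.DataFrame({
    "CustomerID":     [1, 1, 1, 2],
    "ProductID":      [1, 1, 2, 2],
    "QuantityScaled": [0.5, 1.0, 2.0, 0.25],
})

# pivot_table aggregates duplicate (CustomerID, ProductID) pairs instead of
# failing; fill_value=0 marks products a customer never bought.
customer_matrix = toy.pivot_table(
    index="CustomerID",
    columns="ProductID",
    values="QuantityScaled",
    aggfunc="sum",
    fill_value=0,
)
print(customer_matrix)
```

This yields one row per customer and one column per product. Note that `CityID` is not carried through this pivot and would have to be joined back on `CustomerID` afterwards if geography should stay in the feature vector.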