# Cluster analysis

This chapter uses the Python library scikit-learn to solve the store segmentation problem for “Dominick’s Finer Foods”.

Data: Dominick.csv

Variables

GROCERY	Total grocery sales in USD
CUSTCOUNT	Total customers visiting the store
AGE9	Population under age 9
NOCAR	% with no vehicles
HSIZE2	% of households with 2 persons
DENSITY	Trading area in sq miles per capita
SINGLE	% of singles
WRKWNCH	% of working women with no children
TELPHN	% of households with telephones
NWHITE	% of population that is non-white
SHOPCONS	% of constrained shoppers
SHOPHURR	% of hurried shoppers

Class Work

- Develop a cluster model appropriate for the given objectives of Dominick’s using Python libraries.
- Choose the appropriate number of clusters (segments).
- Profile each cluster.
- Suggest promotional tactics for each segment.


In [None]:
#Importing pandas, numpy and scikit libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
#import yellowbrick

In [None]:
#Reading data, and pre-processing

data=pd.read_csv("/Users/sajimathew/Documents/DoMS/Teaching/EMBA/DMBI/DMBI-2020/Data/Dominick.csv")
#check if there are empty cells (total count of missing values in the table)
Nandata=data.isnull().sum().sum()
print("Missing values:", Nandata)
#standardizing the data using z scores (assumes every column is numeric)
scaler = StandardScaler()
#X: [n_samples, n_features]
X=scaler.fit_transform(data)
print(type(X), X.shape) # X is a NumPy array of shape (n_samples, n_features)
data.describe()
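
The standardization step can be checked by hand on a small example. The sketch below uses a hypothetical toy table (the values are made up and only borrow two column names from the variable list) and assumes that all columns are numeric. It shows that `StandardScaler` produces the same result as subtracting the column mean and dividing by the population standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values standing in for two columns of Dominick.csv.
toy = pd.DataFrame({
    "GROCERY": [100.0, 200.0, 300.0, 400.0],
    "NOCAR": [0.10, 0.20, 0.30, 0.40],
})

# Missing values per column (all zero for this toy table).
missing_per_column = toy.isnull().sum()

# StandardScaler uses the population standard deviation (ddof=0).
X_toy = StandardScaler().fit_transform(toy)
manual = (toy - toy.mean()) / toy.std(ddof=0)

same = np.allclose(X_toy, manual.to_numpy())
print(missing_per_column)
print("Matches manual z-scores:", same)
```

After scaling, every column has mean 0 and standard deviation 1, so variables measured in USD and variables measured in percentages contribute on the same scale to the distance calculations in k-means.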

In [None]:
#DETERMINATION OF K BY ELBOW RULE
distortions = [] #distortion is the total within-cluster sum of squared distances
for i in range(1, 12):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X)
    distortions.append(km.inertia_)
#inertia_ (Sum of squared distances of samples to their closest cluster center.)
plt.plot(range(1, 12), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
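
The elbow in the plot is sometimes hard to read by eye. One possible cross-check, which is not part of the original notebook, is the average silhouette score from `sklearn.metrics`. The sketch below runs on synthetic data from `make_blobs` as a stand-in for the standardized matrix `X`; the names `X_demo`, `scores` and `best_k` are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X: three well-separated groups in two dimensions.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[-6, -6], [0, 0], [6, 6]],
    cluster_std=0.5,
    random_state=0,
)

# The silhouette score needs at least two clusters, so k starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Higher is better; the score lies between -1 and 1.
best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

On real store data the two criteria may disagree, in which case the choice of k should also consider whether the resulting segments are interpretable for promotion planning.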

In [None]:
#BUILDING CLUSTER
#random_state is fixed so that the cluster numbering is reproducible between runs
km = KMeans(
    n_clusters=3, init='random',
    n_init=10, max_iter=300,
    tol=1e-04, random_state=0)
km.fit(X)

In [None]:
#VISUALIZING CLUSTER
from yellowbrick.cluster import InterclusterDistance
visualizer = InterclusterDistance(km)
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

In [None]:
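#CLUSTER PROFILING
#Grouped bar chart of the standardized cluster centres, one group of bars per variable

centres=km.cluster_centers_
#labels come from the data so that there is one per feature, in the order used for scaling
Xlab=list(data.columns)
pos=np.arange(len(Xlab))
width = 0.25
plt.figure(figsize=[12, 4.8])
#offset each cluster's bars so that they sit side by side instead of overlapping
for k, colour in enumerate(['r', 'b', 'g']):
    plt.bar(pos + (k - 1) * width, centres[k,:], width, color=colour, label='Cluster ' + str(k + 1))
plt.xticks(pos, Xlab, rotation=45)
plt.ylabel('Cluster centre (z score)')
plt.legend()
plt.show()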
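
The cluster centres are expressed as z scores, which is convenient for comparing variables but harder to read when describing a segment. A possible complement is to convert the centres back to the original units with `inverse_transform`. The sketch below uses hypothetical generated data; `demo`, `scaler_demo` and `km_demo` are illustrative names and not objects from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical data: two groups of 30 stores.
demo = pd.DataFrame({
    "GROCERY": np.concatenate([rng.normal(100, 5, 30), rng.normal(300, 5, 30)]),
    "NOCAR": np.concatenate([rng.normal(0.40, 0.02, 30), rng.normal(0.10, 0.02, 30)]),
})

scaler_demo = StandardScaler().fit(demo)
X_demo = scaler_demo.transform(demo)
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# Map the centres from z scores back to the original units.
profile = pd.DataFrame(
    scaler_demo.inverse_transform(km_demo.cluster_centers_),
    columns=demo.columns,
)

# Per-cluster means of the raw data, for comparison with the table above.
group_means = demo.groupby(km_demo.labels_).mean()
print(profile.round(2))
print(group_means.round(2))
```

When k-means has converged, each row of `profile` equals the mean of the raw variables for the stores in that cluster, so the table can be read directly as "average grocery sales" or "average share of households without a car" per segment.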
