
# K Means Shopping

CMPE 256

Cory Randolph

11/4/2021



# Prompt

Develop K Means clustering for the following dataset: This data set is to be grouped into two clusters.

Please develop Python code to Cluster K = 2, K = 3 & K = 4

# Summary of Analysis

K-means is a relatively simple clustering algorithm to code manually (without scikit-learn's `KMeans` estimator). As part of the process I worked out a good way to graph the data using Plotly, and I reused that plotting code for the simpler scikit-learn versions.

The overall process is to randomly choose the starting centers, assign each point to its closest center, recalculate each center as the mean of its assigned points, and repeat until the centers stop changing.
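The loop described above can be sketched in plain NumPy. This is a minimal illustration, not the implementation used later in this notebook: the function name `kmeans_sketch`, the `max_iter` cap, the tolerance-based convergence test, and the choice to keep an old center when a cluster ends up empty are all assumptions of the sketch.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, seed=3, max_iter=100):
    """Plain-NumPy K-means sketch; returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from n_clusters randomly chosen rows of X (distinct row indices)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every center: shape (n_points, n_clusters)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # New center = mean of the points assigned to it.
        # Assumption of this sketch: an empty cluster keeps its old center.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        # Stop once the centers no longer move (within floating-point tolerance)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Spending and income indexes of the 17 shoppers from the Data section
X = np.array([[3, 5], [3, 4], [5, 6], [2, 6], [4, 5], [6, 8], [6, 2], [6, 3], [5, 6],
              [6, 7], [7, 2], [8, 5], [9, 1], [8, 2], [9, 6], [9, 1], [8, 3]])
labels, centers = kmeans_sketch(X, n_clusters=2)
print(centers)
```

Because the starting centers are random, a different `seed` may converge to a different set of clusters.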


# Imports

In [47]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from matplotlib import pyplot as plt
# Needed by the manual K-means loop below
from sklearn.metrics import pairwise_distances_argmin

# Data

Input the data as a list of rows, each holding the shopper ID, Spending Index, and Income Index.

In [3]:
data = [
  [1,3,5],
  [2,3,4],
  [3,5,6],
  [4,2,6],
  [5,4,5],
  [6,6,8],
  [7,6,2],
  [8,6,3],
  [9,5,6],
  [10,6,7],
  [11,7,2],
  [12,8,5],
  [13,9,1],
  [14,8,2],
  [15,9,6],
  [16,9,1],
  [17,8,3],
]

columns = ['shopper', 'spending_index', 'income_index']

Convert the data into a pandas DataFrame

In [4]:
df = pd.DataFrame(data = data, columns = columns)

# Set the index
df.set_index('shopper',inplace = True)

# Display the first few rows
df.head(3)

| shopper | spending_index | income_index |
|---------|----------------|--------------|
| 1       | 3              | 5            |
| 2       | 3              | 4            |
| 3       | 5              | 6            |


Plot the data for easy visualization

In [5]:
fig = px.scatter(df,
                 x = 'spending_index', 
                 y = 'income_index',
                 labels={
                     'spending_index' : 'Spending Index',
                     'income_index' : 'Income Index'},
                 title = {'text': 'Shopper Data',
                          'x': 0.5},
                 )
fig.show()

# Manual K Means Cluster

Create clusters from the data manually (i.e. without using a clustering package)

Convert the DataFrame into a NumPy array to simplify the math

In [36]:
X = df.values

Set the number of clusters

In [35]:
n_clusters = 2

1) Randomly choose starting cluster centers

In [37]:
# Seed the random number generator for repeatability
rng = np.random.RandomState(3)
# Pick n_clusters distinct rows of X as the starting centers
i = rng.permutation(X.shape[0])[:n_clusters]
centers = X[i]
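As a small illustration of why `permutation` is used here: it shuffles the row indices, so taking the first `n_clusters` entries gives distinct rows, and reusing the seed reproduces the same picks. The names `start_idx` and `repeat_idx` are hypothetical and used only in this sketch.

```python
import numpy as np

n_points, n_clusters = 17, 2  # 17 shoppers and 2 clusters, as in this notebook

rng = np.random.RandomState(3)
# permutation() shuffles 0..n_points-1, so the first n_clusters entries never repeat
start_idx = rng.permutation(n_points)[:n_clusters]
print(start_idx)

# A fresh generator with the same seed yields the same starting rows
repeat_idx = np.random.RandomState(3).permutation(n_points)[:n_clusters]
print((start_idx == repeat_idx).all())
```

Distinct row indices do not guarantee distinct coordinates: this data set contains repeated points, such as shoppers 3 and 9.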

2) Assign cluster labels to each point, recalculate and repeat

In [40]:
current_iteration = 1
while True:
    # 2a. Assign labels based on closest center
    labels = pairwise_distances_argmin(X, centers)
    
    # 2b. Find new centers from means of points
    new_centers = np.array([X[labels == i].mean(0)
                            for i in range(n_clusters)])
    
    # Print out the number of iterations for reference
    print(f'Currently on iteration #{current_iteration}')
    current_iteration += 1
    
    # 2c. Check for convergence
    if np.all(centers == new_centers):
        break
    centers = new_centers

    

Currently on iteration #1
Currently on iteration #2
Currently on iteration #3
Currently on iteration #4
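One way to sanity-check a converged clustering is to total the squared distances from each point to its assigned center, which is the quantity K-means tries to minimize. The helper name `within_cluster_ss` and the toy arrays below are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def within_cluster_ss(X, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # centers[labels] lines up each point with the center it was assigned to
    diffs = X - centers[labels]
    return float((diffs ** 2).sum())

# Toy example: two tight pairs of points, each 1 unit from its center
toy_X = np.array([[0, 0], [0, 2], [10, 0], [10, 2]])
toy_labels = np.array([0, 0, 1, 1])
toy_centers = np.array([[0, 1], [10, 1]])
print(within_cluster_ss(toy_X, toy_labels, toy_centers))  # 4.0
```

The same function could be called with `X`, `labels`, and `centers` from the loop above, since `labels` holds zero-based cluster indices at that point.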


Now that the new cluster centers have been found and labels assigned, let's plot the data.

In [41]:
# Create a new Dataframe for the cluster labels
df_manual_cluster = df.copy()
df_manual_cluster['cluster_label'] = labels + 1
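# Convert labels to strings so Plotly treats them as discrete categories
# (np.str was removed from recent NumPy releases, so use the built-in str)
df_manual_cluster['cluster_label'] = df_manual_cluster['cluster_label'].astype(str)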

Plot the cluster and overlay the cluster centers

In [78]:
# Plot the initial data points labeled/colored by cluster
fig = px.scatter(df_manual_cluster,
                 x = 'spending_index', 
                 y = 'income_index',
                 color = 'cluster_label',
                 labels={
                     'spending_index' : 'Spending Index',
                     'income_index' : 'Income Index'},
                 title = {'text': 'Shopper Data',
                          'x': 0.5},
                 )

# Add in the centers to the plot
fig.add_trace(go.Scatter(x = centers[:,0], y = centers[:,1], 
                         name = 'Centers', 
                         line= {'width':0}, 
                         marker={'size':10, 'symbol':'x', 'color':'rgb(0, 0, 0)'},
                         )
             )

fig.show()

# K Means Clusters

Create K-Means clusters with scikit-learn for k = 2, 3, and 4

In [81]:
from sklearn.cluster import KMeans

## K = 2

Fit the kmeans clustering

In [89]:
kmeans = KMeans(n_clusters=2, random_state=3).fit(X)

Plot the clusters

In [90]:
# Create a new Dataframe for the cluster labels
df_manual_cluster = df.copy()
df_manual_cluster['cluster_label'] = kmeans.labels_ + 1
# Convert labels to strings so Plotly treats them as discrete categories
df_manual_cluster['cluster_label'] = df_manual_cluster['cluster_label'].astype(str)

In [91]:
# Plot the initial data points labeled/colored by cluster
fig = px.scatter(df_manual_cluster,
                 x = 'spending_index', 
                 y = 'income_index',
                 color = 'cluster_label',
                 labels={
                     'spending_index' : 'Spending Index',
                     'income_index' : 'Income Index'},
                 title = {'text': 'Shopper Data',
                          'x': 0.5},
                 )

# Add in the centers to the plot
fig.add_trace(go.Scatter(x = kmeans.cluster_centers_[:,0], y = kmeans.cluster_centers_[:,1], 
                         name = 'Centers', 
                         line= {'width':0}, 
                         marker={'size':10, 'symbol':'x', 'color':'rgb(0, 0, 0)'},
                         )
             )

fig.show()

## K = 3

Fit the kmeans clustering

In [92]:
kmeans = KMeans(n_clusters=3, random_state=3).fit(X)

Plot the clusters

In [93]:
# Create a new Dataframe for the cluster labels
df_manual_cluster = df.copy()
df_manual_cluster['cluster_label'] = kmeans.labels_ + 1
# Convert labels to strings so Plotly treats them as discrete categories
df_manual_cluster['cluster_label'] = df_manual_cluster['cluster_label'].astype(str)

In [94]:
# Plot the initial data points labeled/colored by cluster
fig = px.scatter(df_manual_cluster,
                 x = 'spending_index', 
                 y = 'income_index',
                 color = 'cluster_label',
                 labels={
                     'spending_index' : 'Spending Index',
                     'income_index' : 'Income Index'},
                 title = {'text': 'Shopper Data',
                          'x': 0.5},
                 )

# Add in the centers to the plot
fig.add_trace(go.Scatter(x = kmeans.cluster_centers_[:,0], y = kmeans.cluster_centers_[:,1], 
                         name = 'Centers', 
                         line= {'width':0}, 
                         marker={'size':10, 'symbol':'x', 'color':'rgb(0, 0, 0)'},
                         )
             )

fig.show()

## K = 4

Fit the kmeans clustering

In [98]:
kmeans = KMeans(n_clusters=4, random_state=3).fit(X)

Plot the clusters

In [99]:
# Create a new Dataframe for the cluster labels
df_manual_cluster = df.copy()
df_manual_cluster['cluster_label'] = kmeans.labels_ + 1
# Convert labels to strings so Plotly treats them as discrete categories
df_manual_cluster['cluster_label'] = df_manual_cluster['cluster_label'].astype(str)

In [100]:
# Plot the initial data points labeled/colored by cluster
fig = px.scatter(df_manual_cluster,
                 x = 'spending_index', 
                 y = 'income_index',
                 color = 'cluster_label',
                 labels={
                     'spending_index' : 'Spending Index',
                     'income_index' : 'Income Index'},
                 title = {'text': 'Shopper Data',
                          'x': 0.5},
                 )

# Add in the centers to the plot
fig.add_trace(go.Scatter(x = kmeans.cluster_centers_[:,0], y = kmeans.cluster_centers_[:,1], 
                         name = 'Centers', 
                         line= {'width':0}, 
                         marker={'size':10, 'symbol':'x', 'color':'rgb(0, 0, 0)'},
                         )
             )

fig.show()
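The three fits above differ only in `n_clusters`, so they could also be run in one loop for a side-by-side comparison. The sketch below is one possible way to do that; the `results` dictionary and the explicit `n_init=10` are choices made here, not part of the original cells. `inertia_` is scikit-learn's name for the sum of squared distances from each point to its nearest center.

```python
import numpy as np
from sklearn.cluster import KMeans

# Spending and income indexes of the 17 shoppers from the Data section
X = np.array([[3, 5], [3, 4], [5, 6], [2, 6], [4, 5], [6, 8], [6, 2], [6, 3], [5, 6],
              [6, 7], [7, 2], [8, 5], [9, 1], [8, 2], [9, 6], [9, 1], [8, 3]])

results = {}
for k in (2, 3, 4):
    # n_init is set explicitly so the run does not depend on the installed version's default
    km = KMeans(n_clusters=k, n_init=10, random_state=3).fit(X)
    sizes = np.bincount(km.labels_, minlength=k)  # number of shoppers per cluster
    results[k] = (km.inertia_, sizes)
    print(f'K = {k}: inertia = {km.inertia_:.2f}, cluster sizes = {sizes.tolist()}')
```

Inertia generally shrinks as K grows, so it is more useful for comparing runs with the same K or for spotting where extra clusters stop helping much.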

# Reference

Example of manual clustering in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html)

Example of Kmeans in [Scikit Learn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html?highlight=kmeans#sklearn.cluster.KMeans)

[Plotly Scatter Plot](https://plotly.com/python/line-and-scatter/) examples