<a href="https://colab.research.google.com/github/codewithkate/Defense_Policy_Analysis/blob/main/kmeans.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Perform K-Means clustering

Modified code from: [Dhiraj Kumar at NeptuneAI](https://neptune.ai/blog/customer-segmentation-using-machine-learning)

In [1]:
# Import required libraries

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

## Upload dataset from your local file system
Download the dataset used in this analysis from [GitHub](https://github.com/codewithkate/Defense_Policy_Analysis).

In [2]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving eda_data.csv to eda_data.csv
User uploaded file "eda_data.csv" with length 806992 bytes


In [3]:
import io
df = pd.read_csv(io.BytesIO(uploaded['eda_data.csv']))
df.shape

(4113, 18)

## Inspect the data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4113 entries, 0 to 4112
Data columns (total 18 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Item Description-original          4113 non-null   object 
 1   Item Description - cleaned         4113 non-null   object 
 2   Transfer Authority                 4113 non-null   object 
 3   Fiscal Year of Request             4113 non-null   object 
 4   Implementing Agency                4113 non-null   object 
 5   Country (Transfer to)              4113 non-null   object 
 6   Status                             4113 non-null   object 
 7   Status Date                        4113 non-null   object 
 8   Qty Requested                      4113 non-null   int64  
 9   Qty Allocated                      4113 non-null   object 
 10  Qty Accepted                       4113 non-null   int64  
 11  Qty Rejected                       4113 non-null   int64

In [5]:
df.describe()

| | Qty Requested | Qty Accepted | Qty Rejected | Total Delivered Qty | Total Acquisition Value | Total Current Value | Request to Status Duration (Days) | Grant Pricing |
|---|---|---|---|---|---|---|---|---|
| count | 4113.0 | 4113.0 | 4113.0 | 4113.0 | 4112.0 | 4111.0 | 4113.0 | 4113.0 |
| mean | 52360.23 | 7975.12 | 397.891077 | 17299.81 | 6449672.0 | 1027025.0 | 462.840749 | 258532.5 |
| std | 752113.4 | 353135.8 | 18274.542193 | 491423.9 | 35281740.0 | 6311961.0 | 367.668584 | 1412200.0 |
| min | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 25% | 1.0 | 0.0 | 0.0 | 0.0 | 37477.25 | 9068.25 | 217.0 | 1716.922 |
| 50% | 3.0 | 0.0 | 0.0 | 0.0 | 190074.0 | 49138.2 | 351.0 | 8477.302 |
| 75% | 34.0 | 1.0 | 0.0 | 1.0 | 1154636.0 | 253130.0 | 637.0 | 51000.0 |
| max | 23425820.0 | 21425580.0 | 1000000.0 | 21425580.0 | 957375600.0 | 180998100.0 | 2644.0 | 36571750.0 |


## K-Means

In [6]:
# Define K-means model (n_clusters is left at scikit-learn's default of 8)
kmeans_model = KMeans(init='k-means++', max_iter=400, random_state=42)

In [7]:
# Train the model
kmeans_model.fit(df[['Request to Status Duration (Days)','Qty Requested',
'Grant Pricing']])



### Elbow Method
Identify the best value of k by measuring the within-cluster sum of squared distances (inertia) for a range of k values.
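
For reference, the quantity that scikit-learn reports as `inertia_` can be written as follows. The symbols $C_j$ (the set of points assigned to cluster $j$) and $\mu_j$ (that cluster's centroid) are notation introduced here for illustration, not taken from the notebook.

$$
W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

$W(k)$ can only shrink or stay flat as k grows, which is why the elbow method looks for the point where the decrease levels off rather than for a minimum.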


In [8]:
# Create the K means model for different values of K
def try_different_clusters(K, data):

    cluster_values = list(range(1, K+1))
    inertias=[]

    for c in cluster_values:
        model = KMeans(n_clusters = c,init='k-means++',max_iter=400,random_state=42)
        model.fit(data)
        inertias.append(model.inertia_)

    return inertias

In [9]:
# Find output for k values between 1 to 12
outputs = try_different_clusters(12, df[['Request to Status Duration (Days)','Qty Requested',
'Grant Pricing']])
distances = pd.DataFrame({"clusters": list(range(1, 13)),"sum of squared distances": outputs})



In [10]:
# Finding optimal number of clusters k
figure = go.Figure()
figure.add_trace(go.Scatter(x=distances["clusters"], y=distances["sum of squared distances"]))

figure.update_layout(xaxis = dict(tick0 = 1,dtick = 1,tickmode = 'linear'),
                  xaxis_title="Number of clusters",
                  yaxis_title="Sum of squared distances",
                  title_text="Finding optimal number of clusters using elbow method")
figure.show()

**Optimal value = 5**

The plot above shows how the sum of squared distances from each point to its cluster center (centroid) falls as k increases. The optimal value is where the curve bends, like an elbow. Beyond this point the reduction in distance slows down, so adding more clusters makes the model more complex without a meaningful improvement in fit.
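
Reading the elbow by eye is subjective. The `silhouette_score` function imported at the top of this notebook offers a complementary check. The sketch below is one possible way to use it, not part of the original analysis: the helper name `silhouette_by_k` is hypothetical, `n_init=10` is an assumed setting, and the demo runs on synthetic blobs rather than on `eda_data.csv`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(data, k_values, random_state=42):
    """Return {k: mean silhouette score} for each candidate k (k must be >= 2)."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                       max_iter=400, random_state=random_state)
        labels = model.fit_predict(data)
        # Scores near 1 indicate compact, well-separated clusters
        scores[k] = silhouette_score(data, labels)
    return scores


# Synthetic stand-in data: three well-separated 2-D blobs (not the notebook's dataset)
rng = np.random.default_rng(0)
demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

demo_scores = silhouette_by_k(demo, range(2, 7))
best_k = max(demo_scores, key=demo_scores.get)
print(demo_scores)
print("k with the highest silhouette score:", best_k)
```

To apply the same check to this dataset, pass the three feature columns used for the elbow plot together with `range(2, 13)`. The silhouette score is undefined for a single cluster, so the range starts at 2.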

In [11]:
# Re-Train K means model with k=5
kmeans_model_new = KMeans(n_clusters = 5,init='k-means++',max_iter=400,random_state=42)

kmeans_model_new.fit_predict(df[['Request to Status Duration (Days)','Qty Requested',
'Grant Pricing']])





array([0, 0, 0, ..., 0, 0, 4], dtype=int32)

In [12]:
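# Create data arrays
cluster_centers = kmeans_model_new.cluster_centers_
# NOTE: np.expm1 inverts a log1p transform. The features were never log-transformed
# in this notebook, so this step overflows (see the warning and inf values below).
# Only the last three columns of `points` (the raw centroids) are meaningful.
data = np.expm1(cluster_centers)
points = np.append(data, cluster_centers, axis=1)
points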


overflow encountered in expm1



array([[2.74715975e+198,             inf,             inf,
        4.56922416e+002, 1.96160715e+004, 7.93710120e+004],
       [            inf,             inf,             inf,
        8.78333333e+002, 8.40000000e+002, 3.30462466e+007],
       [            inf, 5.22632645e+193,             inf,
        8.26052632e+002, 4.46052632e+002, 1.26086850e+007],
       [3.39914613e+223,             inf,             inf,
        5.14700000e+002, 1.37685045e+007, 1.77764403e+005],
       [4.77346614e+252, 7.33453646e+142,             inf,
        5.81814516e+002, 3.28959677e+002, 3.29670577e+006]])

In [13]:
# Append the cluster index to each centroid row and add cluster labels to the dataset
points = np.append(points, [[0], [1], [2], [3], [4]], axis=1)
df["clusters"] = kmeans_model_new.labels_

In [14]:
df[['clusters']].value_counts()

clusters
0           3957
4            124
2             19
3             10
1              3
dtype: int64

In [16]:
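# Visualize clusters
# Cast the integer labels to strings so Plotly treats them as discrete categories
# (category_orders below only matches string values)
figure = px.scatter_3d(df.assign(clusters=df['clusters'].astype(str)),
                    color='clusters',
                    x='Request to Status Duration (Days)',
                    y='Qty Requested',
                    z='Grant Pricing',
                    category_orders = {"clusters": ["0", "1", "2", "3", "4"]}
                    )
figure.show()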

## Recommendations

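*   Standardize the quantitative features before clustering, so that features with very large ranges do not dominate the distance calculation
*   Perform the analysis on a smaller, random sample
*   Include related business factors
*   Try other clustering methods, such as PAM (Partitioning Around Medoids), DBSCAN, or FCM (Fuzzy C-Means)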
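
As a starting point for the first recommendation, the sketch below uses the `StandardScaler` already imported at the top of the notebook. It rests on several assumptions that are not confirmed by the source: that `-999` is a missing-value sentinel, that a `log1p` transform suits these heavily skewed columns, and that k=5 is still appropriate after rescaling. The helper name `cluster_standardized` is hypothetical, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["Request to Status Duration (Days)", "Qty Requested", "Grant Pricing"]


def cluster_standardized(frame, features=FEATURES, n_clusters=5, sentinel=-999):
    """Log-transform, standardize, and cluster; return labels and centroids in original units."""
    subset = frame[features].dropna()
    # Assumption: rows containing the sentinel carry no usable value, so drop them
    subset = subset[(subset != sentinel).all(axis=1)]
    # log1p compresses the long right tail; clip at 0 so the log is always defined
    logged = np.log1p(subset.clip(lower=0))
    scaler = StandardScaler()
    scaled = scaler.fit_transform(logged)
    model = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                   max_iter=400, random_state=42)
    labels = pd.Series(model.fit_predict(scaled), index=subset.index, name="clusters")
    # Undo the scaling first, then undo log1p with expm1
    centers = np.expm1(scaler.inverse_transform(model.cluster_centers_))
    return labels, pd.DataFrame(centers, columns=features)


# Synthetic stand-in frame with skewed columns (not the notebook's dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    FEATURES[0]: rng.integers(2, 2645, size=200),
    FEATURES[1]: rng.lognormal(mean=2.0, sigma=2.0, size=200).round(),
    FEATURES[2]: rng.lognormal(mean=9.0, sigma=2.0, size=200),
})
demo.loc[0, FEATURES[1]] = -999  # one sentinel row, excluded from clustering

labels, centers = cluster_standardized(demo)
print(labels.value_counts())
print(centers)
```

Here `np.expm1` is applied only after a matching `np.log1p`, so the recovered centroids stay finite and are expressed in the original units. Because the feature space changes, the elbow plot should be regenerated on the transformed data before settling on a value of k.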



