# Exercises 5.2: k-means clustering

### PA Women in Tech: Intro to Data Science with Python

In these exercises we will look at **k-means clustering**. We will:
- find the best value of k
- apply k-means clustering
- interpret the results

First, we'll import everything we need:

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Loading and exploring the data

In these exercises we'll look at a dataset containing a store's customer credit card transactions. Clustering can be applied to this dataset for customer segmentation, which is the process of separating customers into groups based on traits they share. The store's marketing strategy could then be informed by these groupings, for example through targeted marketing.

To save time for this exercise, the loading, exploring and pre-processing steps have all been filled out below.

Loading the `Credit_Card_Data.csv` dataset into a dataframe called `df_cc`:

In [3]:
df_cc = pd.read_csv("Credit_Card_Data.csv")

Displaying the first ten rows of the dataframe:

In [4]:
df_cc.head(10)

Unnamed: 0,CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,C10001,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,C10002,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,C10003,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,C10004,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,,0.0,12
4,C10005,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12
5,C10006,1809.828751,1.0,1333.28,0.0,1333.28,0.0,0.666667,0.0,0.583333,0.0,0,8,1800.0,1400.05777,2407.246035,0.0,12
6,C10007,627.260806,1.0,7091.01,6402.63,688.38,0.0,1.0,1.0,1.0,0.0,0,64,13500.0,6354.314328,198.065894,1.0,12
7,C10008,1823.652743,1.0,436.2,0.0,436.2,0.0,1.0,0.0,1.0,0.0,0,12,2300.0,679.065082,532.03399,0.0,12
8,C10009,1014.926473,1.0,861.49,661.49,200.0,0.0,0.333333,0.083333,0.25,0.0,0,5,7000.0,688.278568,311.963409,0.0,12
9,C10010,152.225975,0.545455,1281.6,1281.6,0.0,0.0,0.166667,0.166667,0.0,0.0,0,3,11000.0,1164.770591,100.302262,0.0,12


All the 'frequency' columns hold score-type values between 0 and 1, where 1 means 'frequently' and 0 means 'not frequently'.

Displaying the number of rows and columns in the dataframe:

In [5]:
df_cc.shape

(8950, 18)

So we have 18 columns and 8950 rows.

## Pre-processing the data

Checking for missing values in the dataframe:

In [6]:
df_cc.isna().sum()

CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64

Filling missing values for CREDIT_LIMIT and MINIMUM_PAYMENTS with the mean:

In [7]:
mean_credit_limit = df_cc["CREDIT_LIMIT"].dropna().mean()
df_cc["CREDIT_LIMIT"] = df_cc["CREDIT_LIMIT"].fillna(mean_credit_limit)

mean_minimum_payments = df_cc["MINIMUM_PAYMENTS"].dropna().mean()
df_cc["MINIMUM_PAYMENTS"] = df_cc["MINIMUM_PAYMENTS"].fillna(mean_minimum_payments)
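
As a minimal sketch of what mean imputation does, the toy example below fills one missing value in a made-up column (the numbers are hypothetical and not taken from the dataset). It also shows that `Series.mean()` skips missing values by default, so calling `.dropna()` first is optional.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data, for illustration only)
toy = pd.DataFrame({"MINIMUM_PAYMENTS": [100.0, 200.0, np.nan, 300.0]})

# Series.mean() ignores NaN by default, so this averages 100, 200 and 300
column_mean = toy["MINIMUM_PAYMENTS"].mean()

# Replace the missing entry with the column mean
toy["MINIMUM_PAYMENTS"] = toy["MINIMUM_PAYMENTS"].fillna(column_mean)
print(toy)
```

Mean imputation keeps every row, at the cost of slightly shrinking the column's spread, which is usually acceptable when only a small share of values is missing.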

One column which doesn't pertain to customer behaviour is the CUST_ID column, as this is an arbitrary string value used to identify the customer. We could bring this data back in later if we needed, but we don't want to include it in our clustering model.

Dropping the CUST_ID column:

In [8]:
df_cc = df_cc.drop(columns="CUST_ID")

Performing some quick statistical analysis on the dataframe:

In [9]:
df_cc.describe()

Unnamed: 0,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
count,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0,8950.0
mean,1564.474828,0.877271,1003.204834,592.437371,411.067645,978.871112,0.490351,0.202458,0.364437,0.135144,3.248827,14.709832,4494.44945,1733.143852,864.206542,0.153715,11.517318
std,2081.531879,0.236904,2136.634782,1659.887917,904.338115,2097.163877,0.401371,0.298336,0.397448,0.200121,6.824647,24.857649,3638.612411,2895.063757,2330.588021,0.292499,1.338331
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,0.0,0.019163,0.0,6.0
25%,128.281915,0.888889,39.635,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,1.0,1600.0,383.276166,170.857654,0.0,12.0
50%,873.385231,1.0,361.28,38.0,89.0,0.0,0.5,0.083333,0.166667,0.0,0.0,7.0,3000.0,856.901546,335.628312,0.0,12.0
75%,2054.140036,1.0,1110.13,577.405,468.6375,1113.821139,0.916667,0.3,0.75,0.222222,4.0,17.0,6500.0,1901.134317,864.206542,0.142857,12.0
max,19043.13856,1.0,49039.57,40761.25,22500.0,47137.21176,1.0,1.0,1.0,1.5,123.0,358.0,30000.0,50721.48336,76406.20752,1.0,12.0


We can see that we have various units and ranges of values across the columns: there are several 'frequency' columns whose values lie between 0 and 1 (CASH_ADVANCE_FREQUENCY is the exception, with a maximum of 1.5), and other columns, such as CREDIT_LIMIT, whose values are much larger. A distance-based algorithm such as k-means will effectively treat CREDIT_LIMIT as more important than PURCHASES_FREQUENCY, only because the values for CREDIT_LIMIT are larger and vary more from person to person.

Distance-based algorithms need to consider all features on a level playing field. That means the values for all features must be transformed to the same scale.

The process of transforming numerical features to use the same scale is known as feature scaling. It’s an important data preprocessing step for most distance-based machine learning algorithms because it can have a significant impact on the performance of your algorithm.

There are several approaches to implementing feature scaling. A great way to determine which technique is appropriate for your dataset is to read scikit-learn’s preprocessing documentation.

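In this case, we will standardise our data with `StandardScaler`, which rescales each column to zero mean and unit variance (the code calls the result `normalised_data`):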

In [12]:
normalised_data = StandardScaler().fit_transform(df_cc)
df_cc = pd.DataFrame(normalised_data, columns=df_cc.columns)
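
To make the effect of `StandardScaler` concrete, here is a small sketch on a made-up two-column frame (the values are illustrative, not real customers). It compares the scaler's output with the same calculation done by hand: subtract each column's mean and divide by its population standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (hypothetical values)
toy = pd.DataFrame({
    "CREDIT_LIMIT": [1000.0, 7000.0, 7500.0, 1200.0],
    "PURCHASES_FREQUENCY": [0.17, 0.0, 1.0, 0.08],
})

scaled = StandardScaler().fit_transform(toy)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.mean(axis=0))  # each column mean is approximately 0
print(scaled.std(axis=0))   # each column standard deviation is approximately 1
```

After scaling, both columns contribute on a comparable scale to the Euclidean distances that k-means relies on.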

## Finding k

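Now that we have pre-processed our dataset, we want to find the best number of clusters, k. We'll do this using the elbow method: fit k-means for a range of k values, record the distortion (the within-cluster sum of squared distances) for each, and look for the 'elbow' where adding more clusters stops reducing the distortion substantially.

1a. Find the cluster distortion for `KMeans` for each value of k from 1 to 10, storing the values in a list

Hint: you can use the Python `range()` function to generate a sequence of numbers, which you can iterate through with a for loop. A fitted `KMeans` model exposes its distortion through the `inertia_` attribute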

In [None]:
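distortion = []

cluster_range = range(1, 11)  # k = 1 to 10; the end of range() is exclusive
for k in cluster_range:
    # TODO: fit KMeans with n_clusters=k and append its distortion instead of k
    distortion.append(k)
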

1b. Plot the distortion values you found against the cluster k value

What do you think is a good number of clusters?
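
If you'd like to check your approach, the elbow method described above could be sketched as follows. This is only an illustration on synthetic data from `make_blobs` (an assumption made so the example runs on its own), not the credit card dataset, and the variable names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 4 groups (an assumption, for illustration only)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

distortions = []
cluster_range = range(1, 11)  # k = 1, 2, ..., 10
for k in cluster_range:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    # inertia_ is the sum of squared distances from each point to its nearest centre
    distortions.append(model.inertia_)

# Print k next to its distortion; plotting these pairs gives the elbow curve
for k, d in zip(cluster_range, distortions):
    print(k, round(d, 1))
```

The distortion tends to fall as k grows, so the useful signal is the point where the drop flattens out rather than the smallest value.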

## K-means clustering

2. Perform k-means clustering on the data with your chosen number of clusters

To see the resulting cluster for each row, you can use the model's `labels_` attribute and attach it to the original dataframe using `pd.concat()`. The cell assumes your fitted model is stored in a variable named `kmeans`:

In [13]:
clusters = pd.concat([df_cc, pd.DataFrame({'cluster': kmeans.labels_})], axis=1)
clusters.head()


## Interpreting the clusters

We can plot a seaborn `FacetGrid` split on the 'cluster' column to show a histogram of each variable for each cluster.

This helps us to interpret the characteristics that may define each cluster.

In [None]:
# One row of histograms per feature, with one panel per cluster
for c in clusters.columns.drop('cluster'):
    grid = sns.FacetGrid(clusters, col='cluster')
    grid.map(plt.hist, c)
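
Alongside the histograms, a numeric summary per cluster can make the differences easier to describe. The sketch below uses synthetic data and hypothetical names (`toy`, `feature_a`, `feature_b`, `kmeans_toy`) so that it runs on its own; the same `groupby` pattern should apply to a dataframe with a 'cluster' column.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled customer features (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
toy = pd.DataFrame(X, columns=["feature_a", "feature_b"])

kmeans_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)
toy_clusters = toy.assign(cluster=kmeans_toy.labels_)

# How many rows fall into each cluster
sizes = toy_clusters["cluster"].value_counts().sort_index()

# Mean of every feature within each cluster
profile = toy_clusters.groupby("cluster").mean()

print(sizes)
print(profile)
```

Because the features were standardised, a cluster mean well above 0 for a column suggests that cluster sits above the overall average for that feature, and a mean well below 0 suggests the opposite.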

We can also use PCA (principal component analysis) to reduce our data to two dimensions so that we can visualise the clusters in 2D.

In [None]:
# Project the scaled features onto their first two principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_cc)
principal_df = pd.DataFrame(data=principal_components, columns=['principal component 1', 'principal component 2'])

# Colour each point by its k-means cluster label
sns.scatterplot(data=principal_df, x="principal component 1", y="principal component 2", hue=kmeans.labels_)
plt.show()
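
A 2D projection necessarily discards some information, so it can be worth checking how much variance the two components retain. The following is a minimal sketch on synthetic data (an assumption so that it runs standalone); with the notebook's own `pca` object, the same `explained_variance_ratio_` attribute would be available after fitting.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 6-feature data standing in for the scaled dataframe (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

pca_toy = PCA(n_components=2)
components = pca_toy.fit_transform(X_scaled)

# Fraction of the total variance captured by each of the two components
retained = pca_toy.explained_variance_ratio_.sum()
print(pca_toy.explained_variance_ratio_)
print(f"Variance retained by 2 components: {retained:.1%}")
```

If the retained share is low, clusters that overlap in the scatter plot may still be well separated in the full feature space.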

Extensions:
* Try to interpret the FacetGrid plot above and write a short description for each cluster; add this as a legend to the 2D cluster plot
* Retry clustering using different values of k. How do the results differ?
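
One possible way to compare different values of k numerically is the silhouette score, which the notebook imports as `silhouette_score`. The sketch below is a hedged illustration on synthetic data, not the credit card dataset; scores closer to 1 indicate tighter, better separated clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption, for illustration only)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette score
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("Best k by silhouette score:", best_k)
```

The silhouette score and the elbow method do not always agree, so it is reasonable to treat both as guides and also consider which clustering is easiest to interpret.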