## Based on this resource

https://realpython.com/k-means-clustering-python/

In [3]:
# Let's start by installing and then importing some packages...

!pip install matplotlib
!pip install kneed
!pip install scikit-learn

import matplotlib.pyplot as plt
from kneed import KneeLocator
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.1.2[0m[39;49m -> [0m[32;49m22.2.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.1.2[0m[39;49m -> [0m[32;49m22.2.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.1.2[0m[39;49m -> [0m[32;49m22.2.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Step 1 - Let's generate some clusters

First, we will use the scikit-learn function 'make_blobs', which generates synthetic clusters. We will use the following parameters:

* n_samples - the total number of samples to generate
* centers - the number of centroids (cluster centres) to generate
* random_state - if we set this to 'None', a different random seed is used each time we call the function. If we set it to an integer, e.g. 32, the output is reproducible each time we call the function.
* cluster_std - the standard deviation of the clusters... but we won't worry about this too much for now
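
To make the effect of random_state concrete, here is a minimal sketch that calls 'make_blobs' twice with the same seed and once with a different one. The second seed (7) and the variable names are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same integer seed: the two calls return identical samples and labels
X_a, y_a = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=32)
X_b, y_b = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=32)

# A different seed (arbitrary choice): same shape, different values
X_c, y_c = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=7)

same_seed_matches = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
different_seed_matches = np.array_equal(X_a, X_c)
print(same_seed_matches, different_seed_matches)
```

Fixing the seed is what lets the outputs in this notebook be reproduced from one run to the next.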

In [48]:
features, true_labels = make_blobs(
    n_samples=200,
    centers=3,
    cluster_std=2.75,
    random_state=32
)

Let's take a closer look at the first 5 samples, starting with the features.

In [53]:
features[:5]

array([[ 5.98141618, 10.51611954],
       [ 4.44331872,  8.91046702],
       [ 4.15174699,  5.22640696],
       [ 2.24146988,  3.16161526],
       [ 7.18004878, -1.57894707]])

In [50]:
true_labels[:5]

array([2, 1, 1, 2, 0])

As we can see, the features array contains two numerical values for each sample, and true_labels holds the cluster that each sample was generated from.

When it comes to performing clustering on a given dataset, you must think carefully about your feature variables.

Perhaps your dataset contains information on bank loans and customer data. It could contain a variable 'annual income' which ranges from £19,000 to £1,000,000, and another variable 'monthly debt' which ranges from £0 to £400,000. Because these variables span very different ranges, we must rescale the numeric columns so that they use a common scale, i.e., we standardise/normalise the data. In machine learning, this is referred to as 'feature scaling', and it is especially important for k-means clustering given that it is a distance-based algorithm.

Because k-means uses the Euclidean distance between data points and the centroids, we need to ensure that the distance measure gives equal weight to each variable. Without scaling, variables with higher variance would dominate the distance.
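
As a rough illustration of why this matters: the average squared Euclidean distance between two points is proportional to the sum of the per-feature variances, so each feature's share of the total variance indicates how much it drives the distance. The sketch below uses made-up data drawn uniformly from the income and debt ranges mentioned above. The data, the helper 'variance_share' and the sample size are assumptions for illustration, not part of any real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data drawn uniformly from the ranges mentioned in the text
annual_income = rng.uniform(19_000, 1_000_000, size=500)
monthly_debt = rng.uniform(0, 400_000, size=500)
X = np.column_stack([annual_income, monthly_debt])


def variance_share(data):
    """Return the fraction of the total variance contributed by each column."""
    col_var = data.var(axis=0)
    return col_var / col_var.sum()


share_raw = variance_share(X)
share_scaled = variance_share(StandardScaler().fit_transform(X))

print("raw:   ", share_raw.round(3))
print("scaled:", share_scaled.round(3))
```

On the raw data the income column accounts for most of the variance, while after standardisation both columns contribute equally.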



## Step 2 - Standardise the data

To do this, we will use the StandardScaler class from scikit-learn's preprocessing module, which is a quick way to perform feature scaling. It rescales each column to have a mean of 0 and a standard deviation of 1.

In [115]:
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

In [116]:
scaled_features[:5]

array([[ 0.41715892,  1.11834775],
       [ 0.01079153,  0.83303204],
       [-0.0662421 ,  0.17839464],
       [-0.57093984, -0.1885076 ],
       [ 0.73383924, -1.0308797 ]])
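
For intuition, the same result can be reproduced by hand: subtract each column's mean and divide by its standard deviation. The sketch below regenerates the blobs with the parameters used earlier and compares the manual calculation with StandardScaler. The variable names 'manual' and 'matches' are only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=32)

# z = (x - column mean) / column standard deviation (population std, ddof=0)
manual = (features - features.mean(axis=0)) / features.std(axis=0)

scaled = StandardScaler().fit_transform(features)

# The hand-rolled version should agree with StandardScaler up to rounding
matches = np.allclose(manual, scaled)
print(matches)
```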

Let's look at the variance for each matrix column BEFORE and AFTER feature scaling...

In [128]:
# Variance for the first column before scaling...

features[:,0].var()

14.326167418288941

In [129]:
# Variance for the second column before scaling...

features[:,1].var()

31.67027058298565

In [132]:
# Variance for first column after scaling...

scaled_features[:,0].var()

0.9999999999999993

In [131]:
# Variance for second column after scaling...

scaled_features[:,1].var()

1.0000000000000009

## Step 3 - Clustering

Nice. We can see that the variances of the two columns are now both close to 1.0. Now that we have finished the preprocessing phase, we can start to cluster our data!

To do this we can use the KMeans class which comes with the scikit-learn package. It has the following parameters:

* init - this is the method for initialisation. The standard version of the k-means algorithm is implemented by setting init to "random".

* n_clusters - this is the number of clusters that you want the algorithm to form, as well as the number of centroids to generate.

* n_init - this is the number of initialisations, i.e., the number of times that the k-means algorithm will be run with different starting centroids. This is important because two runs can converge on different cluster assignments. In the scikit-learn version used here, the default behaviour is to perform ten k-means runs and then return the results of the one with the lowest sum of squared errors (SSE).

* max_iter - this is the maximum number of iterations of the algorithm for a single run.
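
To show how these parameters fit together, here is a toy NumPy sketch of k-means with random restarts. It is not scikit-learn's implementation. The function name 'kmeans_sketch', its 'seed' argument and its convergence check are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs


def kmeans_sketch(X, n_clusters, n_init=10, max_iter=300, seed=0):
    """Toy k-means: run n_init random restarts and keep the lowest-SSE run."""
    rng = np.random.default_rng(seed)
    best = None
    run_sses = []
    for _run in range(n_init):
        # "random" initialisation: pick n_clusters distinct samples as centroids
        centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
        for _step in range(max_iter):
            # Assignment step: each sample goes to its nearest centroid
            sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = sq_dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its samples
            # (an empty cluster keeps its previous centroid)
            new_centroids = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                for k in range(n_clusters)
            ])
            if np.allclose(new_centroids, centroids):
                break  # centroids stopped moving, so this run has converged
            centroids = new_centroids
        # Final labels and SSE for this run
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        sse = sq_dists[np.arange(len(X)), labels].sum()
        run_sses.append(sse)
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best, run_sses


X, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=32)
(best_sse, best_labels, best_centroids), run_sses = kmeans_sketch(X, n_clusters=3)
print(best_sse, sorted(run_sses))
```

Printing the per-run SSE values makes it easy to see whether different starting centroids led to different solutions, which is the reason for running several initialisations.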

In [134]:
kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=32
)

Now that we have our k-means algorithm prepped and ready to go, let's fit it to the data in scaled_features.

In [135]:
kmeans.fit(scaled_features)

KMeans(init='random', n_clusters=3, random_state=32)

After fitting the algorithm to the data, we can access the lowest SSE value found across the runs through the inertia_ attribute.

In [136]:
kmeans.inertia_

124.31264130265883
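
As a sanity check on what inertia_ measures, the sketch below recomputes the SSE by hand as the sum of each sample's squared distance to its nearest centroid, using the fitted cluster_centers_ attribute. It rebuilds the data and model with the same parameters as above. The names 'sq_dists' and 'manual_sse' are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=32)
scaled_features = StandardScaler().fit_transform(features)

kmeans = KMeans(init="random", n_clusters=3, n_init=10, max_iter=300, random_state=32)
kmeans.fit(scaled_features)

# Squared distance from every sample to every centroid: shape (200, 3)
diffs = scaled_features[:, None, :] - kmeans.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# SSE: each sample contributes its squared distance to the nearest centroid
manual_sse = sq_dists.min(axis=1).sum()
print(manual_sse, kmeans.inertia_)
```

The two printed values should agree up to floating-point rounding.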

In [55]:
import pandas as pd

In [58]:
credit = pd.read_csv("archive/credit_train.csv")

In [59]:
credit.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100514 entries, 0 to 100513
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Loan ID                       100000 non-null  object 
 1   Customer ID                   100000 non-null  object 
 2   Loan Status                   100000 non-null  object 
 3   Current Loan Amount           100000 non-null  float64
 4   Term                          100000 non-null  object 
 5   Credit Score                  80846 non-null   float64
 6   Annual Income                 80846 non-null   float64
 7   Years in current job          95778 non-null   object 
 8   Home Ownership                100000 non-null  object 
 9   Purpose                       100000 non-null  object 
 10  Monthly Debt                  100000 non-null  float64
 11  Years of Credit History       100000 non-null  float64
 12  Months since last delinquent  46859 non-null

In [60]:
credit.columns = (credit.columns
                  .str.replace(' ', '_')
                  .str.lower())

credit.head(10)
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100514 entries, 0 to 100513
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   loan_id                       100000 non-null  object 
 1   customer_id                   100000 non-null  object 
 2   loan_status                   100000 non-null  object 
 3   current_loan_amount           100000 non-null  float64
 4   term                          100000 non-null  object 
 5   credit_score                  80846 non-null   float64
 6   annual_income                 80846 non-null   float64
 7   years_in_current_job          95778 non-null   object 
 8   home_ownership                100000 non-null  object 
 9   purpose                       100000 non-null  object 
 10  monthly_debt                  100000 non-null  float64
 11  years_of_credit_history       100000 non-null  float64
 12  months_since_last_delinquent  46859 non-null

In [67]:
credit_data = credit[["current_loan_amount", "credit_score", 
                      "annual_income", "monthly_debt", 
                      "years_of_credit_history", "number_of_open_accounts"]]

credit_data.head(10)

|   | current_loan_amount | credit_score | annual_income | monthly_debt | years_of_credit_history | number_of_open_accounts |
|---|---|---|---|---|---|---|
| 0 | 445412.0 | 709.0 | 1167493.0 | 5214.74 | 17.2 | 6.0 |
| 1 | 262328.0 | NaN | NaN | 33295.98 | 21.1 | 35.0 |
| 2 | 99999999.0 | 741.0 | 2231892.0 | 29200.53 | 14.9 | 18.0 |
| 3 | 347666.0 | 721.0 | 806949.0 | 8741.9 | 12.0 | 9.0 |
| 4 | 176220.0 | NaN | NaN | 20639.7 | 6.1 | 15.0 |
| 5 | 206602.0 | 7290.0 | 896857.0 | 16367.74 | 17.3 | 6.0 |
| 6 | 217646.0 | 730.0 | 1184194.0 | 10855.08 | 19.6 | 13.0 |
| 7 | 648714.0 | NaN | NaN | 14806.13 | 8.2 | 15.0 |
| 8 | 548746.0 | 678.0 | 2559110.0 | 18660.28 | 22.6 | 4.0 |
| 9 | 215952.0 | 739.0 | 1454735.0 | 39277.75 | 13.9 | 20.0 |


In [62]:
credit_data['current_loan_amount'].value_counts()


99999999.0    11484
223102.0         27
223322.0         27
216194.0         27
223652.0         27
              ...  
72050.0           1
712228.0          1
125752.0          1
594902.0          1
274076.0          1
Name: current_loan_amount, Length: 22004, dtype: int64

In [69]:
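# Fails: .loc with a single label addresses a row, not a column, so the 99999999 placeholder is not replaced
credit_data.loc['current_loan_amount'] = np.where(credit_data['current_loan_amount']==99999999, np.nan, credit_data['current_loan_amount'])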

ValueError: cannot set a row with mismatched columns
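
One way the replacement could be made to work is to assign to the column rather than to a row label. The sketch below demonstrates this on a small stand-in frame ('toy', a hypothetical name) built from three of the rows shown earlier, so it does not change the credit_data outputs that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for credit_data; 99999999 marks a placeholder loan amount
toy = pd.DataFrame({
    "current_loan_amount": [445412.0, 99999999.0, 347666.0],
    "monthly_debt": [5214.74, 29200.53, 8741.90],
})

# Assign to the column (not a row label) and swap the placeholder for NaN
toy["current_loan_amount"] = toy["current_loan_amount"].replace(99999999, np.nan)
print(toy)
```

The same pattern with np.where would also work, provided the assignment targets the column.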

In [81]:
# Some scores carry an extra trailing digit (e.g. 7290.0), so keep only the first three digits.
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning.
credit_data = credit_data.copy()
credit_data['credit_score'] = (credit_data['credit_score']
                                .astype(str)
                                .str[:3]
                                .astype(float))

In [84]:
credit_data.credit_score.max()

751.0


In [88]:
pd.set_option('float_format', '{:f}'.format)

In [89]:
credit_data.describe()

|   | current_loan_amount | credit_score | annual_income | monthly_debt | years_of_credit_history | number_of_open_accounts |
|---|---|---|---|---|---|---|
| count | 100000.0 | 80846.0 | 80846.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 11760447.38946 | 716.293447 | 1378276.559842 | 18472.412336 | 18.199141 | 11.12853 |
| std | 31783942.546071 | 28.297164 | 1081360.195662 | 12174.992609 | 7.015324 | 5.00987 |
| min | 10802.0 | 585.0 | 76627.0 | 0.0 | 3.6 | 0.0 |
| 25% | 179652.0 | 703.0 | 848844.0 | 10214.1625 | 13.5 | 8.0 |
| 50% | 312246.0 | 722.0 | 1174162.0 | 16220.3 | 16.9 | 10.0 |
| 75% | 524942.0 | 738.0 | 1650663.0 | 24012.0575 | 21.7 | 14.0 |
| max | 99999999.0 | 751.0 | 165557393.0 | 435843.28 | 70.5 | 76.0 |


In [90]:
credit_data.annual_income.min()

76627.0

In [91]:
credit_data.annual_income.max()

165557393.0

In [92]:
credit_data.years_of_credit_history.min()

3.6

In [93]:
credit_data.years_of_credit_history.max()

70.5

In [94]:
credit_data.monthly_debt.min()

0.0

In [95]:
credit_data.monthly_debt.max()

435843.28