awais546 edited this page Oct 3, 2020 · 1 revision

Python and Deep Learning

Introduction

In this ICP we studied another type of machine learning: clustering. We focused on one specific clustering algorithm, K-Means. We also studied how to measure the performance of K-Means and how to improve it.

Tasks

The tasks are as follows.

  • Apply K-Means clustering to the data. (https://umkc.box.com/s/a9lzu9qoqfkbhjwk5nz9m6dyybhl1wqy)
  • Replace null values with the column mean
  • Use the elbow method to find a good number of clusters with the KMeans algorithm
  • Calculate the silhouette score for the above clustering
  • Try feature scaling to see if it will improve the Silhouette score
  • Apply PCA on the same dataset

### Apply K-Means

Before K-Means clustering can be applied to the dataset, some pre-processing is required. First, the CUST_ID column is removed. Next, the null values in each column are replaced with that column's mean. Finally, the data is split into a features dataframe and a result dataframe.
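The pre-processing steps can be sketched as follows. This is a minimal illustration rather than the code used in the ICP: the small dataframe is synthetic, the `BALANCE` and `PAYMENTS` column names are hypothetical stand-ins, and treating the last column (`TENURE`) as the result column is an assumption based on the later sections.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real dataset. Only CUST_ID and TENURE are
# names taken from the write-up; the other columns are hypothetical.
df = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "BALANCE": [40.9, 3202.5, np.nan, 1666.7],
    "PAYMENTS": [201.8, np.nan, 622.1, 0.0],
    "TENURE": [12, 12, 6, 12],
})

# Drop the identifier column: it carries no information for clustering.
df = df.drop(columns=["CUST_ID"])

# Replace each null value with the mean of its own column.
df = df.fillna(df.mean())

# Split into features (all but the last column) and the result column.
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

print(x)
print(y)
```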

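Once the data is pre-processed, we have to choose the number of clusters, k. A common way to choose it is the elbow method: K-Means is fitted for a range of k values, and we look for the point where the within-cluster sum of squares stops dropping sharply. The graph shown below indicates that the best number of clusters is 3.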
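A minimal sketch of the elbow method is given here. It assumes scikit-learn and uses synthetic blobs with three known centres instead of the ICP dataset, so the numbers it prints are illustrative only. The within-cluster sum of squares is read from the `inertia_` attribute of the fitted `KMeans` model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the ICP dataset).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Fit K-Means for k = 1..10 and record the within-cluster sum of squares.
k_values = list(range(1, 11))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

# The "elbow" is the k after which the curve flattens out.
for k, value in zip(k_values, wcss):
    print(f"k={k:2d}  WCSS={value:10.1f}")
```

Plotting `wcss` against `k_values` gives the usual elbow curve; for this synthetic data the sharp bend falls at k = 3.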

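Using 3 as the number of clusters, we can compute the silhouette score. The score ranges from -1 to 1: the closer it is to 1, the better separated the clusters are, while values near 0 indicate overlapping clusters.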
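The silhouette computation can be sketched with scikit-learn's `silhouette_score`. The data below is synthetic, so the printed score says nothing about the score obtained on the ICP dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the ICP dataset).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Cluster with k = 3 and score the resulting assignment.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)

# Values near 1 mean compact, well-separated clusters.
print(f"Silhouette score for k=3: {score:.3f}")
```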

### Feature scaling

Feature scaling is applied using standardization, followed by PCA. The screenshot below shows the code used to scale the data. After PCA reduces the data to 2 columns, the dataframe is rebuilt with the first column named Feature1, the second named Feature2, and TENURE as the final column.
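Since the screenshot is not reproduced here, the following is a rough sketch of what standardization followed by PCA might look like with scikit-learn. The input data is random, the `F_A` to `F_D` column names are hypothetical, and naming the second component `Feature2` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in features on very different scales (hypothetical columns).
rng = np.random.default_rng(0)
x = pd.DataFrame(
    rng.normal(size=(50, 4)) * [1, 10, 100, 1000],
    columns=["F_A", "F_B", "F_C", "F_D"],
)
y = pd.Series(rng.integers(6, 13, size=50), name="TENURE")

# Standardize every feature to zero mean and unit variance.
x_scaled = StandardScaler().fit_transform(x)

# Project the standardized features onto the first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(x_scaled)

# Rebuild a dataframe with the two components plus the TENURE column.
pca_df = pd.DataFrame(components, columns=["Feature1", "Feature2"])
pca_df["TENURE"] = y.values

print(pca_df.head())
```

Standardizing first matters because PCA is sensitive to scale: without it, the column with the largest numeric range would dominate the principal components.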

The dataframe made is shown below.

### Bonus Question

After applying PCA we have two features, and we can apply K-Means to this new dataframe. Running the elbow method again gives a new number of clusters, four, as shown below.

We can see that the number of clusters is now 4. Using this value, the silhouette score has improved.
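As an illustration of this step, the sketch below runs K-Means with k = 4 on PCA-reduced features and computes the silhouette score. It uses synthetic six-dimensional blobs rather than the ICP data, so it only demonstrates the workflow; it does not reproduce the improvement reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 6-dimensional data with four groups (not the ICP dataset).
X, _ = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)

# Standardize, then reduce to two principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Cluster the reduced data with k = 4 and score the assignment.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_pca)
score_pca = silhouette_score(X_pca, labels)

print(f"Silhouette score on PCA features (k=4): {score_pca:.3f}")
```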

We can visualize the clustering as follows.
