### Clustering

1)	Clustering groups objects by similarity: objects in the same group share characteristics that also distinguish them from objects in other groups. These groups are known as "clusters".<br>
2)	Clustering is a form of unsupervised learning: we have only unlabeled input data and must extract structure from it without knowing in advance what the output will be.<br>
3)<b>	There is no need to split the data into training and testing datasets.</b>


#### Clustering algorithms fall into the following categories
1) Centroid based - KMeans<br>
2) Density Based - DBSCAN (Density-based spatial clustering of applications with noise)<br>
3) Hierarchical - Agglomerative Clustering
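
As a rough illustration, the three families map onto the scikit-learn estimators named above. The toy data from `make_blobs` and the parameter values (`eps`, `min_samples`, `n_clusters`) are assumptions picked for this demo, not recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Toy data: 3 well-separated blobs (assumed purely for illustration)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

models = {
    'Centroid based': KMeans(n_clusters=3, n_init=10, random_state=0),
    'Density based': DBSCAN(eps=1.5, min_samples=5),
    'Hierarchical': AgglomerativeClustering(n_clusters=3),
}

labels = {}
for name, model in models.items():
    labels[name] = model.fit_predict(X)   # one cluster label per row
    print(name, '->', len(set(labels[name])), 'distinct labels')
```

Note that DBSCAN is not given a number of clusters; it infers them from density and labels any noise points as `-1`.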

In [None]:
# customer => age, income, expense, saving, etc

# G1(0) : high income,low expense 
# G2(1) : high income,high expense
# G3(2) : low income,high expense
# G4(3) : low income,low expense


### KMeans

Where K = number of clusters

1)	K-means is an iterative algorithm that tries to partition the dataset into<b> K pre-defined, non-overlapping subgroups (clusters)</b>, where each data point belongs to exactly one group.<br>
2)	It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the<b> sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum.</b> The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.<br>
3)	Clustering algorithms such as KMeans use distance-based measurements to determine the similarity between data points, so it is recommended to standardize or scale the data: features in a dataset almost always have different units of measurement, for instance age vs. income.<br>
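
A minimal sketch of the scaling step, assuming scikit-learn's `StandardScaler` and a small made-up `customers` frame (the values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: Age in years, Income in currency units
customers = pd.DataFrame({'Age': [25, 40, 60],
                          'Income': [20000, 50000, 80000]})

scaler = StandardScaler()
scaled = scaler.fit_transform(customers)   # NumPy array, columns in the same order

# After scaling, each column has mean 0 and unit variance,
# so neither feature dominates a Euclidean distance.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Without this step, a difference of a few thousand in income would outweigh any difference in age when distances are computed.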


#### Distance Metrics
<pre>
Distance between A(x1,y1) and B(x2,y2) is computed as 
a) Minkowski Distance  = ((|x2-x1|)^(p) + (|y2-y1|)^(p))^(1/p), where p >= 1
b) Euclidean Distance = ((x2-x1)^(2) + (y2-y1)^(2))^(1/2)
c) Manhattan Distance = (|x2-x1| + |y2-y1|)
Put p=2 in Minkowski Distance => Euclidean Distance
Put p=1 in Minkowski Distance => Manhattan Distance
</pre>
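
The relationship between the three metrics can be checked with a few lines of plain Python. The helper `minkowski` and the points `A` and `B` are illustrative names and values, not part of the notebook.

```python
def minkowski(a, b, p):
    """Minkowski distance between two points given as equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A, B = (1, 2), (4, 6)
manhattan = minkowski(A, B, p=1)   # |4-1| + |6-2|
euclidean = minkowski(A, B, p=2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```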

### K-Means Algorithm

1) Specify the number of clusters K.<br>
2) Initialize the centroids by shuffling the dataset and then randomly selecting K data points as centroids, without replacement.<br>
3) Compute the distance between each data point and every centroid.<br>
4) Assign each data point to the closest cluster (centroid).<br>
5) Recompute each centroid as the average of all the data points that belong to its cluster.<br>
6) Repeat steps 3, 4 and 5 until the centroids no longer change.
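
The loop described above can be sketched in plain NumPy. This is an illustrative implementation under simple assumptions (Euclidean distance, random initial points, a hypothetical `kmeans` helper), not how scikit-learn implements it.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain NumPy K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise: pick K distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two obvious groups, around (1, 1) and (8, 8)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, the cluster numbering (which group is 0 and which is 1) can change with the seed.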

<img src="kmeans1.png">

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
df = pd.DataFrame({'Age':np.random.randint(20,66,200),
                  'Expesne': np.random.randint(5000,25000,200)})
df.shape

(200, 2)

In [11]:
df.head()

Unnamed: 0,Age,Expesne,Dist_k1,Dist_k2,Dist_k3
0,26,17441,10986.000728,1320.136735,6549.110245
1,53,17820,11365.042279,1699.018835,6170.009806
2,39,19278,12823.011269,3157.005702,4712.06632
3,64,17990,11535.076463,1869.096573,6000.0
4,40,21307,14852.010908,5186.00241,2683.10734


### Assume K = 3 (the model selects initial centroids that are as far apart as possible)

In [9]:
K = 3
# initial_centroid = df.sample(K)
# initial_centroid
k1 = [22,6455]
k2 = [45,16121]
k3 = [64,23990]

In [8]:
df.sample(20)

Unnamed: 0,Age,Expesne
191,23,23684
17,38,10190
98,58,5529
107,61,9201
41,49,18411
54,45,19571
67,63,19349
32,47,5591
2,39,19278
192,54,23722


#### Distance between all data points and Cluster Centroids

In [10]:
df['Dist_k1'] = np.sqrt((df['Age'] - k1[0])**2 + (df['Expesne'] - k1[1])**2)
df['Dist_k2'] = np.sqrt((df['Age'] - k2[0])**2 + (df['Expesne'] - k2[1])**2)
df['Dist_k3'] = np.sqrt((df['Age'] - k3[0])**2 + (df['Expesne'] - k3[1])**2)
df.head(10)

Unnamed: 0,Age,Expesne,Dist_k1,Dist_k2,Dist_k3
0,26,17441,10986.000728,1320.136735,6549.110245
1,53,17820,11365.042279,1699.018835,6170.009806
2,39,19278,12823.011269,3157.005702,4712.06632
3,64,17990,11535.076463,1869.096573,6000.0
4,40,21307,14852.010908,5186.00241,2683.10734
5,63,11026,4571.183873,5095.031796,12964.000039
6,41,17977,11522.015666,1856.00431,6013.043988
7,64,7777,1322.667003,8344.021632,16213.0
8,55,13699,7244.075165,2422.020644,10291.003935
9,51,23796,17341.024249,7675.002345,194.435079


In [14]:
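# Index (0, 1 or 2) of the nearest centroid for each row; naming the distance
# columns keeps the result correct if the cell is re-run after 'Cluster' exists
df['Cluster'] = df[['Dist_k1','Dist_k2','Dist_k3']].to_numpy().argmin(axis=1)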

In [15]:
df['Cluster'].value_counts()

Cluster
1    86
0    62
2    52
Name: count, dtype: int64

In [17]:
c0 = df[df['Cluster'] == 0]
c1 = df[df['Cluster'] == 1]
c2 = df[df['Cluster'] == 2]
print(c0.shape,c1.shape,c2.shape)

(62, 6) (86, 6) (52, 6)


In [18]:
k1_new = [c0['Age'].mean() , c0['Expesne'].mean()]
k2_new = [c1['Age'].mean() , c1['Expesne'].mean()]
k3_new = [c2['Age'].mean() , c2['Expesne'].mean()]
print(k1_new)
print(k2_new)
print(k3_new)

[np.float64(40.91935483870968), np.float64(8081.354838709677)]
[np.float64(43.651162790697676), np.float64(15671.523255813954)]
[np.float64(39.73076923076923), np.float64(22965.76923076923)]
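
The per-cluster means can also be computed in one step with `groupby`. The sketch below uses a small stand-in frame named `demo` with made-up values and the notebook's column names, so it runs on its own.

```python
import pandas as pd

# Small stand-in for df with an already-assigned Cluster column (values are made up)
demo = pd.DataFrame({'Age': [22, 30, 45, 50, 64, 60],
                     'Expesne': [6000, 7000, 16000, 17000, 24000, 23000],
                     'Cluster': [0, 0, 1, 1, 2, 2]})

# One row per cluster; each row is that cluster's updated centroid
new_centroids = demo.groupby('Cluster')[['Age', 'Expesne']].mean()
print(new_centroids)
```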


In [19]:
from sklearn.cluster import KMeans

In [20]:
model = KMeans(n_clusters=4,init='k-means++')
# model.fit(x)

In [19]:
for i,j in df.iloc[:2,:].iterrows():
    print('i',i) # i = Index
#     print('j',j,type(j)) # j = Series
    print(j['Age'])

i 0
60.0
i 1
56.0


In [23]:
clusters =  []
for i,j in df.iterrows():
    x = j[['Dist_k1','Dist_k2','Dist_k3']].min()
    if x == j['Dist_k1']:
        clusters.append('C1')
    elif x == j['Dist_k2']:
        clusters.append('C2')
    else:
        clusters.append('C3')

#### Iteration - 2

In [29]:
k1 = k1_new
k2 = k2_new
k3 = k3_new

#### KMeans using Library

In [30]:
from sklearn.cluster import KMeans

In [None]:
km = KMeans(n_clusters=5,init='k-means++')
# km.fit(x)

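# k-means++ chooses initial centroids that tend to be well separated: each new
# centroid is sampled with probability proportional to its squared distance
# from the nearest centroid already chosen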
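
The cell above leaves `fit` commented out because no feature matrix `x` is defined yet. A minimal sketch of the full call follows, assuming a synthetic frame shaped like `df`, scaled features, and `n_clusters=3` to match the manual walk-through (the original cell uses 5); `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the notebook's df (values are assumptions)
data = pd.DataFrame({'Age': rng.integers(20, 66, 200),
                     'Expesne': rng.integers(5000, 25000, 200)})

x = StandardScaler().fit_transform(data)   # scale so Age is not swamped by Expesne
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(x)                 # cluster index for each row

print(km.cluster_centers_)   # centroids, in scaled units
print(km.inertia_)           # WCSS of the fitted model
```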

### How to determine the optimal number of clusters?

#### 1) Elbow Method<br>
a) It is a plot of WCSS (Within-Cluster Sum of Squares) on the y-axis against the number of clusters on the x-axis. The "elbow", the point after which WCSS decreases only slowly, suggests a suitable K.<br>
b) WCSS is the sum, over all data points, of the squared distance between each data point and its closest cluster centroid.
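
A short sketch of collecting WCSS for a range of K values. scikit-learn exposes WCSS as the fitted model's `inertia_` attribute; the `make_blobs` data and the range 1 to 8 are assumptions for the demo.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Toy data with 4 generated groups (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 9)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares for this K

for k, w in zip(ks, wcss):
    print(k, round(w, 1))
```

Plotting `wcss` against `ks`, for example with `plt.plot(list(ks), wcss, marker='o')`, gives the elbow curve.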


#### 2) Silhouette Score
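
For each point, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The score is the average over all points; it lies between -1 and 1, and higher is better. A minimal sketch with `sklearn.metrics.silhouette_score` follows, on assumed toy data.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 8):   # the score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # K with the highest score
print(scores)
print('best K by silhouette:', best_k)
```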

In [None]:
# Davies-Bouldin Score
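
The Davies-Bouldin score compares within-cluster spread with the separation between cluster centroids. Its minimum is 0 and, unlike the silhouette score, lower values indicate better clustering. A hedged sketch using `sklearn.metrics.davies_bouldin_score` on assumed toy data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Toy data (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

db_scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db_scores[k] = davies_bouldin_score(X, labels)

best_k_db = min(db_scores, key=db_scores.get)   # lowest score wins
print(db_scores)
print('best K by Davies-Bouldin:', best_k_db)
```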