# K Means Clustering

- Clustering is an unsupervised learning technique: it groups similar data points together without needing labelled data

### Process

1. Select k, the number of clusters we want to divide the data into
2. Randomly select k distinct data points to serve as the initial cluster centres
3. Compute the distance (Euclidean is the usual choice, but other distance measures can be used) from every data point to each of the k centres
4. Assign each data point to the cluster whose centre is closest to it
5. After all the points have been assigned for the first time, compute the mean point of each cluster
6. Treat these mean points as the new cluster centres and repeat steps 3-5
7. Repeat until two consecutive rounds produce exactly the same clusters (i.e. every point is in the same cluster as in the previous round)
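
The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `k_means`, the `seed` and `max_rounds` parameters, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
import random


def k_means(points, k, seed=0, max_rounds=100):
    """Cluster `points` (a list of coordinate tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centres.
    centres = rng.sample(points, k)
    labels = None
    for _ in range(max_rounds):
        # Steps 3-4: assign every point to its nearest centre (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        # Step 7: stop once no point changed cluster since the previous round.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: move each centre to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centres


points = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
labels, centres = k_means(points, k=2)
print(labels)   # cluster index for each point
print(centres)  # final mean point of each cluster
```

Which grouping comes out depends on the randomly chosen starting points, so a different `seed` can produce a different result.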

### Problem:


These steps look reasonable, but the final clusters depend heavily on the initial points we choose, so a single run may settle on a poor grouping
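
To see this dependence on the starting points, the sketch below runs scikit-learn's `KMeans` once from each of two hand-picked sets of initial centres. The two starting sets and the variable names are assumptions chosen for illustration; `init` accepts an array of centres, and `n_init=1` limits the fit to a single run.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# Two hand-picked sets of starting centres (illustrative choices).
spread_start = np.array([[1.0, 2.0], [10.0, 2.0]])  # one centre in each column of points
close_start = np.array([[1.0, 2.0], [1.0, 4.0]])    # both centres in the left column

# n_init=1 runs the algorithm exactly once from the given centres.
spread = KMeans(n_clusters=2, init=spread_start, n_init=1).fit(X)
close = KMeans(n_clusters=2, init=close_start, n_init=1).fit(X)

# inertia_ is the sum of squared distances from each point to its assigned centre.
print(spread.labels_, spread.inertia_)
print(close.labels_, close.inertia_)
```

On this data the first start should separate the two columns of points, while the second should settle on a split by row with a much larger `inertia_`.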

### Solution to the problem:

- Once the clusters are final, compute the variance within each cluster and make a note of it
- Repeat the entire process multiple times with different starting points and compare the variance results
- Finally, pick the run with the lowest overall variance, where no single cluster has a very high variance
- The number of times the process is repeated is a setting we choose
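
A rough sketch of this repeat-and-compare idea with scikit-learn follows. It compares runs by `inertia_` (the total within-cluster sum of squared distances), which is the criterion scikit-learn itself uses and stands in here for the per-cluster variance comparison described above. The run count of 10 and the variable names are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

n_runs = 10  # how many times to repeat the whole process (an arbitrary choice)
runs = []
for seed in range(n_runs):
    # init="random" with n_init=1: a single run from randomly chosen data points.
    km = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
    runs.append(km)

# inertia_ is the total squared distance of points to their cluster centre.
inertias = [km.inertia_ for km in runs]
best = min(runs, key=lambda km: km.inertia_)
print(inertias)
print(best.labels_, best.inertia_)
```

The manual loop is only there to make the idea visible. `KMeans` has an `n_init` parameter that repeats the fit with different starting centres and keeps the run with the lowest inertia.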

### How to pick a value for K?

- Start with k = 1, which puts every point in a single cluster. In this case the total variation is very high
- Increase k one step at a time and note the percentage change in the total variation
- Plotting k against the reduction in variation gives an elbow plot

![elbow-plot.JPG](attachment:elbow-plot.JPG)

- The reduction in variation grows rapidly up to a certain value of k and then slows down. We pick this tipping point (the elbow) as the value of k
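
The numbers behind an elbow plot can be computed along these lines. This is a sketch under stated assumptions: the toy data (three synthetic blobs), the range of k values, and the use of `inertia_` as the measure of total variation are all choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_centres = [(0, 0), (10, 0), (0, 10)]
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in blob_centres])

ks = range(1, 7)
# inertia_ is the total squared distance of points to their cluster centre.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
]

# Reduction in variation relative to k = 1, as a percentage.
reduction = [100 * (1 - value / inertias[0]) for value in inertias]
for k, value, pct in zip(ks, inertias, reduction):
    print(f"k={k}: inertia={value:.1f}, reduction={pct:.1f}%")
```

Plotting `ks` against `reduction` (for example with matplotlib) gives the elbow curve. For data like this, most of the reduction should already be reached at k = 3, the number of blobs that were generated.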

## Source:
- https://www.youtube.com/watch?v=4b5d3muPQmA

## Example

In [1]:
from sklearn.cluster import KMeans
import numpy as np

In [2]:
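# Six 2-D points: three with x = 1 and three with x = 10
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# Fit k-means with two clusters; random_state fixes the random initialisation
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)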

In [3]:
# Cluster index assigned to each row of X
kmeans.labels_

array([1, 1, 1, 0, 0, 0])

In [4]:
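# Assign new points to the nearest learned centre (this result is not displayed,
# because a notebook cell only shows the value of its last expression)
kmeans.predict([[0, 0], [12, 3]])
# Coordinates of the two cluster centres
kmeans.cluster_centers_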

array([[10.,  2.],
       [ 1.,  2.]])