<h1>K-Means - Unsupervised Learning with Python</h1>

<h2>References:</h2>

https://towardsdatascience.com/unsupervised-learning-with-python-173c51dc7f03

https://towardsdatascience.com/k-means-clustering-from-a-to-z-f6242a314e9a

https://www.coursera.org/lecture/machine-learning/k-means-algorithm-93VPG

https://towardsdatascience.com/want-clusters-how-many-will-you-have-8737f4ba9bf2
<br>A practice example in JavaScript:<br>
https://towardsdatascience.com/extracting-colours-from-an-image-using-k-means-clustering-9616348712be

<h4><b>Important Terminology</b></h4><br>
<b>Feature:</b> An input variable used in making predictions.<br>
<b>Predictions:</b> A model’s output when provided with an input example.


<h3>The Concept:</h3>

K-Means is the most commonly used clustering method, largely because of its simplicity.<br>
The K in K-Means denotes the number of clusters. <br>
The algorithm is guaranteed to converge after a finite number of iterations, although possibly to a local optimum rather than the best possible clustering. It has 4 basic steps:
1. Initialize the cluster centroids (choose 3 books to start with).
2. Assign datapoints to clusters (place the remaining books one by one)<br>
(according to Euclidean distance).
3. Update the cluster centroids (start over with 3 different books).
4. Repeat steps 2–3 until the stopping condition is met.

picture:
<img src="https://cdn-images-1.medium.com/max/1600/1*xkuet4YVglp8KWsK90bfRw.gif"></img>
https://towardsdatascience.com/k-means-clustering-from-a-to-z-f6242a314e9a

<b>Centroid:</b><br>
As a starting point, you tell the model how many clusters it should make. The model first picks K datapoints (let K = 3) from the dataset. These datapoints are the initial cluster centroids.<br>
- The updated cluster centroid is the mean value of all the datapoints within that cluster.
- Variants of the algorithm take the median (K-Medians) or the mode (K-Modes) instead of the mean.
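To make the difference between the update rules concrete, here is a minimal sketch on a made-up one-dimensional cluster (the numbers are hypothetical and not taken from the Iris data). It only illustrates how the mean is pulled towards an extreme value while the median is not.

```python
import numpy as np

# Hypothetical cluster of 1-D points; 50.0 is an artificial outlier
cluster_points = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

mean_centroid = cluster_points.mean()        # K-Means style update
median_centroid = np.median(cluster_points)  # K-Medians style update

print(mean_centroid)    # 12.0
print(median_centroid)  # 3.0
```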


Since steps 2 and 3 are performed iteratively, the loop would keep running if we don’t set a stopping criterion. <br>
Note that a stopping criterion does not guarantee THE BEST clusters. We need one to make sure the algorithm returns reasonably good clusters and, more importantly, that it returns some clusters at all.<br>
The algorithm stops when one of the following happens:<br>
1. The datapoints assigned to each cluster remain the same (can take too much time).
2. The centroids remain the same (time consuming).
3. The distance of the datapoints from their centroid falls below a threshold you have set.
4. A fixed number of iterations has been reached (insufficient iterations → poor results, so choose the maximum number of iterations wisely).
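The four steps and the stopping criteria can be sketched in plain NumPy as follows. This is an illustrative sketch, not the scikit-learn implementation: the function name `k_means_sketch` and the parameters `max_iter`, `tol` and `seed` are assumptions made for this example. It checks criteria 1, 2 and 4 from the list above, and the distance threshold of criterion 3 is left out for brevity.

```python
import numpy as np

def k_means_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-Means: returns (labels, centroids, n_iterations)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct datapoints as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for iteration in range(1, max_iter + 1):  # criterion 4: iteration cap
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[new_labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)

        assignments_stable = np.array_equal(new_labels, labels)  # criterion 1
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        labels, centroids = new_labels, new_centroids
        if assignments_stable or shift <= tol:  # criteria 1 and 2
            break

    return labels, centroids, iteration

# Usage on two synthetic, well-separated blobs (made-up data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
labels, centroids, n_iter = k_means_sketch(X, k=2)
print(centroids)
print("iterations:", n_iter)
```

Checking the centroid shift against a small tolerance instead of exact equality is a common way to avoid extra iterations that barely change the result.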

<h3>Data:</h3>

In this article we use the Iris dataset to make our very first predictions. The dataset contains 150 records with 5 attributes<br>
(Petal Length, Petal Width, Sepal Length, Sepal Width and Class; Iris Setosa, Iris Virginica and Iris Versicolor are the three classes).<br>
For our unsupervised algorithm we supply the four features of each Iris flower, without the class, and predict which cluster it belongs to.

In [None]:
from sklearn import datasets
import matplotlib.pyplot as plt

In [None]:
# Loading dataset
iris_df = datasets.load_iris()

# Available attributes of the dataset object
print(dir(iris_df))

In [None]:
# Features
print(iris_df.feature_names)

# Targets
print(iris_df.target)

# Target Names
print(iris_df.target_names)
label = {0: 'red', 1: 'blue', 2: 'green'}

# Dataset Slicing
x_axis = iris_df.data[:, 0]  # Sepal Length
y_axis = iris_df.data[:, 2]  # Petal Length (column 2; Sepal Width is column 1)

# Plotting
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()

<h2>Clustering</h2>

<h4>K-Means Clustering in Python:</h4>

K-Means is an iterative clustering algorithm that converges to a local optimum: each iteration reduces (or leaves unchanged) the total squared distance between the datapoints and their cluster centroids.<br>
First the desired number of clusters is chosen. Since we know that there are 3 classes involved, <br>
we program the algorithm to group the data into 3 clusters by passing the parameter “n_clusters” to our KMeans model. <br>
Three points are then chosen as the initial centroids of the three clusters. Each remaining input is assigned to the cluster whose centroid is closest to it. <br>
Finally, the centroids of all the clusters are re-computed, and the assignment and update steps are repeated.

In [None]:
# Importing Modules
from sklearn import datasets
from sklearn.cluster import KMeans

In [None]:
# Loading dataset
iris_df = datasets.load_iris()

In [None]:
# Declaring Model
model = KMeans(n_clusters=3)

# Fitting Model
model.fit(iris_df.data)

In [None]:
# Predicting a single input
predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])

In [None]:
# Prediction on the entire data
all_predictions = model.predict(iris_df.data)

# Printing Predictions
print(predicted_label)
print(all_predictions)
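Because K-Means never sees the class labels, the cluster numbers it prints are arbitrary: cluster 0 does not have to correspond to class 0. One way to see how clusters and species line up is a small contingency table. The sketch below is an illustration only. `n_init` and `random_state` are set just to make the run repeatable, and the variable names are chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
# n_init and random_state are fixed here only so that reruns give the same result
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Rows: true species, columns: cluster id assigned by K-Means
table = np.zeros((3, 3), dtype=int)
for true_class, cluster_id in zip(iris.target, model.labels_):
    table[true_class, cluster_id] += 1

for name, row in zip(iris.target_names, table):
    print(f"{name:>10}: {row}")
```

Each row sums to 50, the number of flowers per species, so a row concentrated in a single column means that species was recovered as one cluster.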

<h2>Evaluating the cluster quality</h2>

A clustering is of good quality when the datapoints within a cluster are close together and far from the other clusters.<br>
The two methods to measure the cluster quality are described below:
1. <b>Inertia:</b><br> Intuitively, inertia tells how far apart the points within a cluster are: it is the sum of squared distances from each datapoint to its cluster centroid. Therefore, a small value of inertia is aimed for. Inertia starts at zero and has no upper bound.
2. <b>Silhouette score:</b><br>
The silhouette score compares how close each datapoint is to the points in its own cluster with how close it is to the points in the nearest other cluster. It ranges from -1 to 1, and a score closer to 1 is better.
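Both measures are available in scikit-learn: a fitted `KMeans` model exposes `inertia_`, and `sklearn.metrics.silhouette_score` computes the mean silhouette over all samples. The sketch below, on the Iris data, also recomputes inertia by hand to show what the number means. The fixed `n_init` and `random_state` are choices made for repeatability.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia: sum of squared distances from each point to its assigned centroid
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

score = silhouette_score(X, model.labels_)

print("inertia (scikit-learn):", model.inertia_)
print("inertia (manual):      ", manual_inertia)
print("silhouette score:      ", score)
```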

<h3>K-means optimization objective:</h3>
https://www.coursera.org/lecture/machine-learning/optimization-objective-G6QWt

c(i) = index of the cluster (1, 2, ..., K) to which example x(i) is currently assigned.<br>
u(k) = cluster centroid k.<br>
u(c,i) = cluster centroid of the cluster to which example x(i) has been assigned.
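With this notation, the objective that K-Means minimizes is usually written as the average squared distance between each example and the centroid of its assigned cluster. In the formula below, c(i) is typeset as $c^{(i)}$, u(k) as $\mu_k$ and u(c,i) as $\mu_{c^{(i)}}$. The symbol $m$, the number of examples, is an assumption that the notes above do not define.

$$
J\left(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K\right) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2
$$

The assignment step minimizes $J$ over the $c^{(i)}$ with the centroids held fixed, and the update step minimizes $J$ over the $\mu_k$ with the assignments held fixed. Inertia, as described earlier, is the same sum without the factor $1/m$.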

https://towardsdatascience.com/whos-talking-using-k-means-clustering-to-sort-neural-events-in-python-e7a8a76f316
<br>
Another, more objective way to choose the number of clusters is the Elbow method. For this we run the K-means function several times on our data and increase the number of clusters with every run. For each run we calculate the average distance of each data point to its cluster center. As the plot below shows, this average distance decreases as the number of clusters increases.

In [None]:
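# NOTE: k_means() and pca_result are defined in the article linked above, not in this notebook
import numpy as np
import matplotlib.pyplot as plt

# Define the maximum number of clusters to test
max_num_clusters = 15
# Run K-means with increasing number of clusters (20 times each)
average_distance = []
for run in range(20):
    tmp_average_distance = []
    for num_clus in range(1, max_num_clusters + 1):
        cluster, centers, distance = k_means(pca_result, num_clus)
        # Mean distance of the points in each cluster to their own center, averaged over clusters
        per_cluster = [np.mean(distance[x][cluster == x]) for x in range(num_clus)]
        tmp_average_distance.append(np.mean(per_cluster, axis=0))
    average_distance.append(tmp_average_distance)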
    
    
# Plot the result -> Elbow point
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(range(1, max_num_clusters +1), np.mean(average_distance, axis=0))
ax.set_xlim([1, max_num_clusters])
ax.set_xlabel('number of clusters', fontsize=20)
ax.set_ylabel('average inter cluster distance', fontsize=20)
ax.set_title('Elbow point', fontsize=23)
plt.show()

<img src="https://cdn-images-1.medium.com/max/1600/1*McSBHbOKIutNNhNbtWjQzQ.png"></img>

We can also see that once we reach six clusters, the average distance to the cluster center no longer changes much. This is called the Elbow point, and it gives us a recommendation of how many clusters to use.

https://towardsdatascience.com/clustering-electricity-profiles-with-k-means-42d6d0644d00
<br>
In the linked article, the silhouette is averaged across all load profiles in order to get a global view of how the algorithm is performing.
The author experiments with a range of cluster numbers (from 2 to 30). It is important to scale each period to the same range so that the magnitude of the energy load does not interfere with the selection of the cluster.

In [None]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score

sillhoute_scores = []
n_cluster_list = np.arange(2,31).astype(int)

# NOTE: df_uci_pivot is the pivoted load-profile DataFrame built in the linked article
X = df_uci_pivot.values.copy()
    
# Very important to scale!
sc = MinMaxScaler()
X = sc.fit_transform(X)

for n_cluster in n_cluster_list:
    
    kmeans = KMeans(n_clusters=n_cluster)
    cluster_found = kmeans.fit_predict(X)
    sillhoute_scores.append(silhouette_score(X, kmeans.labels_))

<img src="https://cdn-images-1.medium.com/max/1600/1*9QNrTFc93ZXAHdcH2bX8MA.png"></img>

The maximum average silhouette occurs when there are only 2 clusters.
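The same silhouette sweep can be tried on data that is available in this notebook. The sketch below applies it to the Iris features instead of the electricity profiles. The range of 2 to 10 clusters and the fixed `n_init` and `random_state` are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Scale every feature to [0, 1] so that no single feature dominates the distance
X = MinMaxScaler().fit_transform(datasets.load_iris().data)

k_values = list(range(2, 11))  # the silhouette score needs at least 2 clusters
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append(silhouette_score(X, labels))

best_k = k_values[int(np.argmax(scores))]
for k, s in zip(k_values, scores):
    print(f"k={k:2d}  silhouette={s:.3f}")
print("highest average silhouette at k =", best_k)
```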

<h2>How many clusters?</h2>

There are a few methods available to choose the optimal value of K. <br>

The direct method is to just plot the datapoints and see if that gives you a hint.<br>
Another method is to use the value of inertia. The idea behind good clustering is to have a small value of inertia and a small number of clusters.<br>
The value of inertia decreases as the number of clusters increases, so there is a trade-off. Rule of thumb: the elbow point in the inertia graph is a good choice, because beyond it the change in the value of inertia isn’t significant.
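The rule of thumb can be tried on the Iris data with a short loop over K. This is a sketch under a few assumptions: the upper bound of 10 clusters is arbitrary, and `n_init` and `random_state` are fixed only for repeatability. It prints the values rather than plotting them, and passing `k_values` and `inertias` to `plt.plot` gives the usual elbow graph.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

k_values = list(range(1, 11))  # upper bound of 10 is an arbitrary choice
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Look for the K after which the drop in inertia becomes small
for k, inertia in zip(k_values, inertias):
    print(f"k={k:2d}  inertia={inertia:8.2f}")
```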

<img src="https://cdn-images-1.medium.com/max/1600/1*xOGY4uu6ng7E8lPLP-onWw.png"></img>

<h3>Final Note</h3>

It’s important to preprocess your data before performing K-Means. You have to convert your dataset into numerical values if it is not numerical already.<br>
Also, applying feature reduction techniques can speed up the process and may also improve the results.<br>
<b>These steps are important to follow because K-Means is sensitive to outliers,</b><br>

<font color='red'>just like every other algorithm that uses average/mean values. Following these steps alleviates these issues.</font>
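One possible way to chain such preprocessing steps in scikit-learn is a pipeline. The sketch below standardizes the Iris features, reduces them to two principal components and then clusters them. `StandardScaler`, `PCA` with two components and the fixed `random_state` are illustrative choices here, not steps prescribed by the referenced articles.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data

pipeline = make_pipeline(
    StandardScaler(),                                 # put all features on a comparable scale
    PCA(n_components=2),                              # feature reduction (2 is an arbitrary choice)
    KMeans(n_clusters=3, n_init=10, random_state=0),  # cluster the reduced data
)
labels = pipeline.fit_predict(X)
print(labels[:10])
```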

<h1>Practice sample for training:</h1>

<b>Clustering electricity usage profiles with K-means</b><br>
https://towardsdatascience.com/clustering-electricity-profiles-with-k-means-42d6d0644d00
