In [None]:
!pip install scikit-learn

In [None]:
from sklearn.cluster import KMeans 
import matplotlib.pyplot as plt 
from sklearn.preprocessing import scale 
import random
import numpy as np 
import pandas as pd

In [None]:
# Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    pointsPerCluster = N // k
    X = []
    for i in range(k):
        # random.uniform draws a float uniformly between the two bounds
        incomeCentroid = random.uniform(20000.0, 200000.0)
        ageCentroid = random.uniform(20.0, 70.0)
        # The centroid itself is the first point of the cluster
        X.append([incomeCentroid, ageCentroid])

        for j in range(pointsPerCluster - 1):
            # np.random.normal draws samples from a normal (Gaussian) distribution.
            # loc: mean ("centre") of the distribution.
            # scale: standard deviation (spread or "width"); must be non-negative.
            X.append([np.random.normal(loc=incomeCentroid, scale=20000.0),
                      np.random.normal(loc=ageCentroid, scale=2.0)])
    X = np.array(X)
    return X


**Why should we scale data?**


Scaling data is important in many machine learning algorithms because it ensures that each feature contributes equally to the analysis. Here are the key reasons for scaling data:

**Equalizing Feature Influence:** In many datasets, different features are measured in different units and scales (e.g., kilograms, dollars, years). If one feature has a much larger scale than the others, it can dominate the algorithm's behavior. Scaling puts the features on a comparable footing, so no single feature unduly influences the model's outcome.

**Improving Algorithm Performance:** Many machine learning algorithms, like KMeans clustering, use distance calculations. If features are on different scales, distance metrics are skewed towards the features with larger scales. Scaling ensures that the distance measure reflects differences in all dimensions.

**Speeding Up Convergence:** Models trained with gradient descent (such as neural networks, or linear regression fitted this way) usually converge faster when data is scaled. Scaling gives each feature a more balanced contribution to the loss gradient, which avoids erratic changes in the optimization path.

**Needed by Scale-Sensitive Models:** Certain algorithms, such as Support Vector Machines and Principal Component Analysis, are sensitive to the scale of the input data. They still run on unscaled data, but features with large numeric ranges dominate the result, so they often perform poorly without scaling.
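
To make the distance argument concrete, here is a minimal sketch using three made-up customers (the values and the `dist` helper are illustrative assumptions, not part of the notebook's data). It compares Euclidean distances before and after standardizing with `scale`.

```python
import numpy as np
from sklearn.preprocessing import scale

# Three hypothetical customers: [income in dollars, age in years]
people = np.array([
    [50000.0, 25.0],   # A
    [50500.0, 65.0],   # B: income close to A, age very different
    [58000.0, 26.0],   # C: age close to A, income moderately different
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the income axis dominates, so the 40-year age gap barely registers
raw_ab = dist(people[0], people[1])
raw_ac = dist(people[0], people[2])

# Standardized units: each column has zero mean and unit variance
scaled = scale(people)
scaled_ab = dist(scaled[0], scaled[1])
scaled_ac = dist(scaled[0], scaled[2])

print("raw    A-B:", raw_ab, " A-C:", raw_ac)
print("scaled A-B:", scaled_ab, " A-C:", scaled_ac)
```

On the raw values, C looks many times farther from A than B does, purely because income is measured in larger numbers. After scaling, the two distances come out roughly comparable, so the age difference counts as well.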

In [None]:
data = createClusteredData(100, 5)


In [None]:
scaled_data = scale(data)

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].scatter(data[:, 0], data[:, 1])
ax[0].set(xlabel='Income', ylabel='Age', title="Bank Data")

ax[1].scatter(scaled_data[:, 0], scaled_data[:, 1])
ax[1].set(xlabel='Income', ylabel='Age', title="Scaled Bank Data")
plt.show()


### What is Inertia?

Inertia, as reported by KMeans implementations such as scikit-learn's (`model.inertia_`), is the sum of squared distances from each data point to the centroid of its assigned cluster.

It is the value of the objective function that KMeans minimizes.
For a fixed number of clusters, we would like this number to be as small as possible.

Inertia tends to shrink as K grows, though: with K equal to the number of samples, every point is its own centroid and we get inertia=0. On its own, a small inertia therefore cannot tell us which K to choose.
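
As a quick check of this definition, the sketch below recomputes inertia by hand on a small synthetic dataset and compares it with `inertia_`. The `points` array, the seed, and the variable names are assumptions made for illustration. It also tries the K = number of samples case by explicitly seeding one centroid per point.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two small synthetic blobs of 20 points each (illustrative data only)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = float(np.sum((points - assigned_centroids) ** 2))
print("manual:", manual_inertia, " sklearn:", km.inertia_)

# One cluster per sample: pass the points themselves as the initial centroids
km_all = KMeans(n_clusters=len(points), init=points, n_init=1).fit(points)
print("inertia with k = n_samples:", km_all.inertia_)
```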



In [None]:
SSE = []
for cluster in range(1,20):
    model = KMeans(n_clusters = cluster, init='k-means++', n_init= 'auto') 
    model.fit(scaled_data)
    SSE.append(model.inertia_)

#### SSE stands for Sum of Squared Errors
It measures the spread of the points within the clusters formed by the KMeans algorithm; in this code it is simply the model's inertia.

For each iteration of the loop, a KMeans model is created with the following parameters:

**n_clusters = cluster:** The number of clusters to form, which changes with each iteration of the loop.

**init='k-means++':** This parameter specifies the initialization method for the centroids. 'k-means++' is a seeding scheme, not a separate version of the algorithm: it picks initial cluster centers that tend to be far apart, which usually speeds up convergence.

**n_init='auto':** The number of times the algorithm is run with different centroid seeds; the final result is the best of these runs in terms of inertia. The string 'auto' is accepted from scikit-learn 1.2 onward and picks the count based on init: 1 run for 'k-means++' and 10 runs for 'random'. On older versions, pass an integer such as 10 instead.

**model.fit(scaled_data):** This fits the KMeans model to scaled_data, the standardized version of the dataset produced by scale(data), so that both features contribute comparably to the distances.
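
The idea behind 'k-means++' seeding can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation (which differs in details such as trying several candidates per step); the function name `kmeans_pp_seed` and the demo data are hypothetical.

```python
import numpy as np

def kmeans_pp_seed(X, k, rng):
    """Pick k initial centroids from the rows of X, k-means++ style (simplified sketch)."""
    n_samples = X.shape[0]
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(n_samples)]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Next centroid: sampled with probability proportional to that squared distance,
        # so points far from the existing centroids are favored
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n_samples, p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(42)
# Three well-separated synthetic blobs (illustrative data only)
X_demo = np.vstack([
    rng.normal(loc=-5.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 2)),
])
seeds = kmeans_pp_seed(X_demo, 3, rng)
print(seeds)
```

Because already-covered regions have small squared distances, later seeds are likely (though not guaranteed) to land in blobs that have no centroid yet.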

In [None]:
frame = pd.DataFrame({'Cluster':range(1,20), 'SSE' :SSE})

In [None]:
plt.figure(figsize=(12, 6))
# Create the figure and draw on it in the same cell; otherwise the notebook shows an empty figure
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In [None]:
model = KMeans(n_clusters=4)
# Note: the data is standardized before fitting, which is important for good results
model = model.fit(scale(data))

In [None]:
print(model.labels_)



In [None]:
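plt.figure(figsize=(8, 6))
# Color each point by its assigned cluster; the axes show the original (unscaled) units
plt.scatter(data[:, 0], data[:, 1], c=model.labels_.astype(np.int64))
plt.xlabel('Income')
plt.ylabel('Age')
plt.show()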
