### **K-Means Clustering with Income Dataset**

This notebook demonstrates the application of the K-Means clustering algorithm on a dataset containing information about individuals' age and income. We will preprocess the data, perform clustering, visualize the results, and analyze the clusters.

#### **1. Import Libraries**

Import the necessary libraries for data manipulation, clustering, and visualization.

In [None]:
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt

#### **2. Load the Dataset**

Load the `income.csv` dataset, which contains information about individuals' `Name`, `Age`, and `Income`.

In [None]:
df = pd.read_csv('income.csv')
df.head()

#### **3. Visualize the Data**

Before applying clustering, we visualize the data to understand its distribution.

In [None]:
plt.scatter(df.Age, df['Income'])
plt.xlabel('Age')
plt.ylabel('Income')

#### **4. Apply K-Means Clustering**

Apply K-Means clustering with three clusters (`k=3`) and add the cluster assignments to the dataset.

In [None]:
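# Fix n_init and random_state so the cluster labels are reproducible between runs
km = KMeans(n_clusters=3, n_init=10, random_state=42)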
y_predicted = km.fit_predict(df[['Age', 'Income']])
df['cluster'] = y_predicted
df.head()

#### **5. Analyze Cluster Centers**

Each row of `km.cluster_centers_` is one centroid: the mean `Age` and mean `Income` of the points assigned to that cluster, in the same column order used for fitting. These centers help interpret the clustering results.

In [None]:
km.cluster_centers_
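
As a quick check of that interpretation, the sketch below compares the fitted centroids with per-cluster means computed by hand. The small `demo` DataFrame is hypothetical (it is not taken from `income.csv`), and the names `km_demo`, `group_means`, and `centers_match` are introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data, not from income.csv: three well-separated groups
demo = pd.DataFrame({
    'Age':    [25, 27, 29, 40, 42, 44, 58, 60, 62],
    'Income': [30, 32, 31, 80, 82, 81, 150, 152, 151],
})

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
demo['cluster'] = km_demo.fit_predict(demo[['Age', 'Income']])

# Mean of each feature per cluster, ordered by cluster label (0, 1, 2)
group_means = demo.groupby('cluster')[['Age', 'Income']].mean().sort_index()

# Row i of cluster_centers_ belongs to label i, so the arrays should line up
centers_match = np.allclose(group_means.to_numpy(), km_demo.cluster_centers_)
print(centers_match)
```

The same comparison could be run on `df` and `km` from the cells above, provided the fit has converged.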

#### **6. Visualize Clusters**

Visualize the clusters along with their centroids to understand the grouping.

In [None]:
df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]

plt.scatter(df1.Age, df1['Income'], color='green')
plt.scatter(df2.Age, df2['Income'], color='red')
plt.scatter(df3.Age, df3['Income'], color='black')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color='purple', marker='*', label='centroid')
plt.xlabel('Age')
plt.ylabel('Income')
plt.legend()

#### **7. Normalize the Data**

K-Means assigns each point to its nearest centroid by Euclidean distance, so it is sensitive to the scale of the features. If one feature (e.g., `Income`) has a much larger range than another (e.g., `Age`), the distances, and therefore the cluster assignments, are dominated by the larger-range feature.

We rescale `Age` and `Income` to the range 0 to 1 so that both features contribute comparably to the distance.

In [None]:
scaler = MinMaxScaler()

# MinMaxScaler scales each column independently to [0, 1], so one call on
# both columns gives the same result as fitting each column separately.
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
df.head()
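
To make the effect of scale on distances concrete, here is a minimal sketch with three hypothetical people (the values are made up, not read from `income.csv`). It compares Euclidean distances before and after min-max scaling; `dist` and the `*_gap` names are illustrative helpers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample: columns are Age (years) and Income (currency units)
X = np.array([
    [25.0, 50_000.0],   # person A
    [60.0, 51_000.0],   # person B: very different age, similar income
    [26.0, 90_000.0],   # person C: similar age, very different income
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the income difference swamps the 35-year age difference
raw_age_gap = dist(X[0], X[1])
raw_income_gap = dist(X[0], X[2])

# After scaling each column to [0, 1], both gaps are of similar size
X_scaled = MinMaxScaler().fit_transform(X)
scaled_age_gap = dist(X_scaled[0], X_scaled[1])
scaled_income_gap = dist(X_scaled[0], X_scaled[2])

print(raw_age_gap, raw_income_gap)
print(scaled_age_gap, scaled_income_gap)
```

In raw units, A appears far closer to B than to C purely because income is measured in larger numbers; after scaling, the two gaps are nearly equal.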

#### **8. Revisualize Normalized Data**

After normalization, visualize the data again to confirm the changes.

In [None]:
plt.scatter(df.Age, df['Income'])

#### **9. Reapply K-Means Clustering**

Reapply K-Means clustering on the normalized data to get updated cluster assignments.

In [None]:
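# Same settings as before; fixed random_state keeps the labels reproducible
km = KMeans(n_clusters=3, n_init=10, random_state=42)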
y_predicted = km.fit_predict(df[['Age', 'Income']])
df['cluster'] = y_predicted
df.head()

#### **10. Revisualize Clusters with Normalized Data**

Visualize the clusters again after normalization to see the updated centroids.

In [None]:
df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]

plt.scatter(df1.Age, df1['Income'], color='green')
plt.scatter(df2.Age, df2['Income'], color='red')
plt.scatter(df3.Age, df3['Income'], color='black')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color='purple', marker='*', label='centroid')
plt.xlabel('Age (scaled)')
plt.ylabel('Income (scaled)')
plt.legend()
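
Centroids fitted on normalized data are expressed on the 0 to 1 scale. One way to read them in the original units is `MinMaxScaler.inverse_transform`, sketched below on hypothetical data. This assumes a single scaler fitted on both columns together; `X_raw`, `demo_scaler`, and `km_demo` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data (Age, Income); not taken from income.csv
X_raw = np.array([
    [25, 30_000], [27, 32_000], [29, 31_000],
    [40, 80_000], [42, 82_000], [44, 81_000],
    [58, 150_000], [60, 152_000], [62, 151_000],
], dtype=float)

# One scaler fitted on both columns, so it can undo the scaling later
demo_scaler = MinMaxScaler()
X_scaled = demo_scaler.fit_transform(X_raw)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Map the centroids from the 0-1 scale back to years and currency units
centers_original = demo_scaler.inverse_transform(km_demo.cluster_centers_)
print(centers_original)
```

Because min-max scaling is a linear rescaling of each column, the back-transformed centroids coincide with the per-cluster means of the raw data.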

#### **11. Elbow Method**

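The Elbow Method helps choose the number of clusters by plotting the sum of squared errors (SSE, available in scikit-learn as `inertia_`) for different values of `k`. SSE generally decreases as `k` increases, so the value to look for is the "elbow": the point after which adding more clusters yields only small further reductions in SSE.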

In [None]:
sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(df[['Age', 'Income']])
    sse.append(km.inertia_)

plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng, sse)
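
For reference, the SSE plotted above is the sum of squared distances from each point to its assigned centroid. The sketch below recomputes it by hand on hypothetical random data and compares it with `inertia_`; `X`, `km_demo`, and `manual_sse` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data already in the 0-1 range, standing in for scaled Age/Income
rng = np.random.default_rng(0)
X = rng.random((60, 2))

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]
manual_sse = float(((X - assigned_centers) ** 2).sum())

print(manual_sse, km_demo.inertia_)
```

The two printed values should agree up to floating-point error.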