# Problem 1: K-means clustering algorithm

Implement this algorithm and use it to cluster the observations in the data set USArrests.csv into K = 4 clusters. This data set contains 50 observations (one for each US state). The variables are murder (per 100K), assault (per 100K), percent urban population, and rape (per 100K). After you obtain the clustering solution, provide a qualitative description of each cluster.

If you use Euclidean distance or squared Euclidean distance as a dissimilarity metric, then you should scale the data so that each variable has mean zero and standard deviation one.
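
Scaling matters here because Assault is recorded on a much larger numeric scale than Murder, so unscaled Euclidean distances would be dominated by Assault. As a minimal sketch of what the standardization does, the block below z-scores a toy sample by hand. The sample is the five rows shown in the `head()` output of this notebook; the names `X_toy`, `col_mean`, `col_std`, and `X_toy_scaled` are illustrative and not part of the original solution.

```python
import numpy as np

# Toy sample: Murder, Assault, UrbanPop, Rape for the first five states.
X_toy = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
    [8.8, 190, 50, 19.5],
    [9.0, 276, 91, 40.6],
])

# Z-score each column: subtract the column mean, divide by the column
# standard deviation. ddof=0 (population standard deviation) is the
# convention StandardScaler uses.
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0, ddof=0)
X_toy_scaled = (X_toy - col_mean) / col_std

print(X_toy_scaled.mean(axis=0).round(6))  # approximately 0 in every column
print(X_toy_scaled.std(axis=0).round(6))   # 1 in every column
```

After this transformation every variable contributes on a comparable scale to the distance computation.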

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [2]:
usa_arrests_csv = "../data/USArrests.csv"

# read_csv accepts a path directly, so no explicit file handle is needed
usa_arrests_df = pd.read_csv(usa_arrests_csv)

In [3]:
usa_arrests_df.head()

Unnamed: 0,State,Murder,Assault,UrbanPop,Rape
0,Alabama,13.2,236,58,21.2
1,Alaska,10.0,263,48,44.5
2,Arizona,8.1,294,80,31.0
3,Arkansas,8.8,190,50,19.5
4,California,9.0,276,91,40.6


## Variables for clustering

The variables for clustering are:

- `Murder`
- `Assault`
- `UrbanPop`
- `Rape`

In [4]:
# Select the variables for clustering
X = usa_arrests_df[['Murder', 'Assault', 'UrbanPop', 'Rape']]

In [7]:
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [8]:
# Perform K-means clustering with K=4
kmeans = KMeans(n_clusters=4, random_state=42)
usa_arrests_df['Cluster'] = kmeans.fit_predict(X_scaled)
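
The cell above relies on scikit-learn's `KMeans`. Since the problem asks for the algorithm itself, here is a rough sketch of the standard iteration (often called Lloyd's algorithm) in NumPy. It is not the code used to produce the results in this notebook: the function name `kmeans_lloyd`, its parameters, the random initialization from observed points, and the toy data `X_demo` are all assumptions made for illustration, and `KMeans` differs in details such as its initialization and multiple restarts.

```python
import numpy as np

def assign_labels(X, centers):
    """Assign each row of X to its nearest center (squared Euclidean distance)."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Illustrative sketch of Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers as k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest center for every observation.
        labels = assign_labels(X, centers)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return assign_labels(X, centers), centers

# Toy data: two well-separated groups of three points each.
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
])
labels, centers = kmeans_lloyd(X_demo, k=2)
print(labels)
print(centers)
```

On the real data, the same function could in principle be called as `kmeans_lloyd(X_scaled, k=4)`, though the resulting label numbering and possibly the partition may differ from the `KMeans` output because the result depends on the initialization.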

In [9]:
usa_arrests_df.head()

Unnamed: 0,State,Murder,Assault,UrbanPop,Rape,Cluster
0,Alabama,13.2,236,58,21.2,1
1,Alaska,10.0,263,48,44.5,2
2,Arizona,8.1,294,80,31.0,2
3,Arkansas,8.8,190,50,19.5,1
4,California,9.0,276,91,40.6,2


## Qualitative description of each cluster

In [10]:
# Convert the centroids from standardized units back to the original units
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
# Printed headings are 1-based; the 'Cluster' column in the data frame is 0-based
cluster_labels = ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4']
for i, center in enumerate(cluster_centers):
    print(f"{cluster_labels[i]}:")
    print(f"  - Murder: {center[0]:.2f}")
    print(f"  - Assault: {center[1]:.2f}")
    print(f"  - UrbanPop: {center[2]:.2f}")
    print(f"  - Rape: {center[3]:.2f}")
    print()

Cluster 1:
  - Murder: 3.60
  - Assault: 78.54
  - UrbanPop: 52.08
  - Rape: 12.18

Cluster 2:
  - Murder: 13.94
  - Assault: 243.62
  - UrbanPop: 53.75
  - Rape: 21.41

Cluster 3:
  - Murder: 10.97
  - Assault: 264.00
  - UrbanPop: 76.50
  - Rape: 33.61

Cluster 4:
  - Murder: 5.85
  - Assault: 141.18
  - UrbanPop: 73.65
  - Rape: 19.34



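These numbers are the cluster centroids converted back to the original units, that is, the mean values of the variables (Murder, Assault, UrbanPop, and Rape) for each cluster. The printed headings are 1-based, whereas the `Cluster` column in the data frame is 0-based, so "Cluster 2" here corresponds to `Cluster == 1` (for example, Alabama and Arkansas):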

- **Cluster 1**: This cluster has the lowest values for Murder, Assault, and Rape, indicating states with the lowest rates of violent crime. Its UrbanPop value (about 52%) is also the lowest of the four clusters, so these states are comparatively less urbanized.

- **Cluster 2**: This cluster has the highest Murder rate (13.94) and the second-highest Assault rate (243.62), while its Rape rate (21.41) is only moderate, well below that of Cluster 3. Its UrbanPop value (about 54%) is close to that of Cluster 1, so these are comparatively less urbanized states with high rates of murder and assault.

- **Cluster 3**: This cluster has the highest Assault (264.00) and Rape (33.61) rates and the second-highest Murder rate (10.97). It also has the highest UrbanPop value (76.50), so these are highly urbanized states with high rates of violent crime overall.

- **Cluster 4**: This cluster has moderate values for Murder, Assault, and Rape, in each case above Cluster 1 but below Clusters 2 and 3. Its UrbanPop value (73.65) is high, close to that of Cluster 3.

In summary, Cluster 1 represents states with the lowest crime rates and comparatively low urbanization, Cluster 2 represents less urbanized states with the highest murder rate and high assault rates, Cluster 3 represents highly urbanized states with the highest assault and rape rates, and Cluster 4 represents highly urbanized states with moderate crime rates.
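
One way to ground these descriptions is to list which states fall in each cluster and to compute per-cluster means directly from the labeled data frame. The sketch below does this on a small stand-in frame built from the five labeled rows shown in the second `head()` output. The names `sample_df`, `cluster_means`, and `states_by_cluster` are illustrative; on the full data the same `groupby` calls could be applied to `usa_arrests_df`.

```python
import pandas as pd

# Stand-in for usa_arrests_df: the five labeled rows from the head() output.
sample_df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "Murder": [13.2, 10.0, 8.1, 8.8, 9.0],
    "Assault": [236, 263, 294, 190, 276],
    "UrbanPop": [58, 48, 80, 50, 91],
    "Rape": [21.2, 44.5, 31.0, 19.5, 40.6],
    "Cluster": [1, 2, 2, 1, 2],  # 0-based labels as stored in the data frame
})

variables = ["Murder", "Assault", "UrbanPop", "Rape"]

# Mean of each variable within each cluster, in the original units.
cluster_means = sample_df.groupby("Cluster")[variables].mean()
print(cluster_means.round(2))

# Member states of each cluster.
states_by_cluster = sample_df.groupby("Cluster")["State"].apply(list)
print(states_by_cluster)
```

With only five rows the means here will not match the centroids reported above; the point is the pattern, which on the full data should reproduce the centroid values up to the algorithm's convergence tolerance.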