# ML - Clustering

## Motivation

![Developed Countries](../images/developed_countries.png)

World map showing country classifications as per the IMF (International Monetary Fund) and the UN (United Nations) (last updated 2022).

- Blue: Developed countries
- Orange: Developing countries
- Red: Least developed countries
- Gray: Data unavailable

The most common criteria for evaluating the degree of economic development are gross domestic product (GDP), gross national product (GNP), per capita income, level of industrialization, extent of infrastructure, and general standard of living.

Question: Can we categorize countries based on these features without having labels from the beginning? Why are there only three categories? 

Answer: Clustering! This is how we categorize elements without previous labels.

## K-means

### Lloyd's Algorithm

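Starting from an initial assignment of elements to K clusters, repeat the following two steps until the labels stop changing:

1. Compute the centroid of each cluster by averaging the positions of the elements currently in that cluster.
2. Update the cluster label of each element to that of its closest centroid.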
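
The two steps of Lloyd's algorithm can be sketched in plain NumPy. This is a minimal illustration, not how scikit-learn implements `KMeans`: the function name `lloyd_kmeans`, its parameters, and the toy data are hypothetical. It also assumes the iteration starts from K randomly chosen data points as centroids, so the labelling step runs first.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Hypothetical helper: alternate labelling and centroid updates."""
    rng = np.random.default_rng(seed)
    # Assumption: use k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Label each element with the index of its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Labels did not change, so the centroids will not either
        labels = new_labels
        # Recompute each centroid as the mean of the elements in its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # Keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Toy data: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
])
centroids, labels = lloyd_kmeans(X, k=2)
print(centroids)
```

On data this well separated, each blob should end up with a single label. On harder data the result can depend on the starting centroids.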


In this case, one video is worth more than a thousand pictures.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("5I3Ei69I40s")

In [None]:
import numpy as np
import pandas as pd

from pathlib import Path

In [None]:
filepath = "https://raw.githubusercontent.com/aoguedao/gmu_casbbi_data_science/main/data/gapminder.csv"
# filepath = Path().resolve().parent / "data" / "gapminder.csv"  # If you are running locally
data = pd.read_csv(filepath, usecols=[1, 5, 6])
data.head()

In [None]:
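from sklearn.cluster import KMeans

K = 3  # Number of clusters, matching the three categories in the map above
# Fix random_state so the cluster labels are reproducible between runs
kmeans = KMeans(n_clusters=K, n_init=10, random_state=42)
kmeans.fit(data.drop(columns="country"))  # Cluster on the numeric features only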

In [None]:
data["label"] = kmeans.labels_
data.head()

In [None]:
data.query("label == 0")

In [None]:
data.query("label == 1")

In [None]:
data.query("label == 2")
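
The motivation asked why there are only three categories. K-means does not answer that by itself, because K is chosen by the user. One common heuristic is to compare the inertia (the sum of squared distances from each element to its closest centroid) across several values of K and look for the point where it stops dropping sharply. The sketch below runs on synthetic blobs from `make_blobs`, not on the gapminder table, and the range of K values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption): three blobs in two dimensions
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

k_values = range(1, 7)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias.append(model.inertia_)  # Sum of squared distances to closest centroid

for k, inertia in zip(k_values, inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

The same loop could be run on `data.drop(columns="country")` to see whether three clusters is a reasonable choice for the country data.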

### Another Example

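We can also use clustering to compress images: K-means replaces every pixel color with the nearest of K representative colors, so fewer bytes are needed to store the image.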
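
To see where the savings come from, here is a rough back-of-the-envelope estimate. It assumes the compressed image is stored as one palette index per pixel plus a table of K colors, and it ignores any file-format overhead. The function name `estimate_storage_bytes` and the 640x480 image size are hypothetical.

```python
import math


def estimate_storage_bytes(width, height, n_colors, channels=3):
    """Hypothetical helper: raw bytes vs. palette-indexed bytes for an image."""
    n_pixels = width * height
    raw = n_pixels * channels  # One byte per channel per pixel
    bits_per_index = max(1, math.ceil(math.log2(n_colors)))
    palette = n_colors * channels  # One RGB entry per cluster centroid
    indexed = math.ceil(n_pixels * bits_per_index / 8) + palette
    return raw, indexed


raw, indexed = estimate_storage_bytes(640, 480, n_colors=8)
print(raw, indexed)  # 921600 115224
```

With 8 colors, each pixel needs 3 bits instead of 24, so the estimate is roughly one eighth of the raw size.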

In [None]:
from PIL import Image
import requests

url = "https://raw.githubusercontent.com/aoguedao/gmu_casbbi_data_science/main/images/coyoya.jpg"
im_filepath = requests.get(url, stream=True).raw
# im_filepath = Path().resolve().parent / "images" / "coyoya.jpg"  # If you are running locally
im = Image.open(im_filepath)
im

In [None]:
K = 8  # Number of clusters, i.e. colors kept in the compressed image
X = np.array(im.getdata())  # One row per pixel, one column per color channel
kmeans = KMeans(n_clusters=K)
kmeans.fit(X)
compressed_array = kmeans.cluster_centers_[kmeans.predict(X)]  # Replace each pixel with its cluster centroid
im_compressed = compressed_array.astype(np.uint8).reshape(im.size[1], im.size[0], 3)  # Back to (height, width, 3)
Image.fromarray(im_compressed, mode="RGB")