<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

In [None]:
from nbta.utils import download_data
download_data(id='1EO5lq7oX6HzxNyYIfQ4n7snCsDoleVV3')

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Image Compression with KMeans

In this exercise and the next one, we will use **images** as data. You will use **K-means** for image compression by reducing the colors in an image to a small set of representative colors.

## The image we will use today

### Opening the image

Today, you will work with an image of my research group. This is a bit old now, as it was taken in 2019. Hopefully, you can at least still recognize one of us!

Do the following:

* Use `open-cv` (`cv2`) `imread` to open the image as a `numpy array`.
* If the colors seem off, it is because OpenCV loads images in BGR order. Convert the image to RGB with the `cvtColor` function of `cv2`, passing your image as the first argument and `cv2.COLOR_BGR2RGB` as the second
* Use the `imshow` function in `matplotlib` to display the image

Make sure to call your `np.array` version of the image `img`.

In [None]:
import cv2

img = cv2.imread('raw_data/john_group_2020.jpg')
# OpenCV loads images as BGR; convert to RGB so matplotlib shows the right colors
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img);

❓ What is the shape of this image?  Save this into a variable named `img_shape`

In [None]:
# Get the shape of image
img_shape = img.shape
img_shape

### What does the shape mean?

For a color image, the shape is the size of the image in pixels (height x width), followed by **3 channels for red, green and blue (RGB)**. Each channel value ranges from 0 to 255 (256 possible values per channel).

A grayscale image (and a black and white image) has only 1 channel: it contains `int`s from 0 to 255 for a grayscale image, or `int`s of either 0 or 1 for a black and white image.
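
To make the channel layout concrete, here is a minimal sketch that builds small synthetic arrays instead of loading the group photo. The names `color`, `gray` and `bw` are made up for illustration.

```python
import numpy as np

# A hypothetical 4 x 6 color image: height, width, then 3 channels.
color = np.zeros((4, 6, 3), dtype=np.uint8)
color[..., 0] = 255  # fill the first channel with its maximum value

# A grayscale image of the same size has no channel axis.
gray = np.arange(24, dtype=np.uint8).reshape(4, 6)

# A black and white image only holds 0s and 1s.
bw = (gray > 11).astype(np.uint8)

print(color.shape)    # (4, 6, 3)
print(gray.shape)     # (4, 6)
print(np.unique(bw))  # [0 1]
```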

### Number of colors

So what are the **samples** and **features** of this dataset? Here, our goal will be to reduce the dimensionality of the image through `k-means`. Specifically, we will reduce the number of `colors` used to represent this image.

In this context, we can deduce that:

- Each **pixel** is a **sample** or an **observation**
- Each **color value (Red, Green, Blue)** is a **feature**

In our image, we have **138,000 samples** (400 * 345 pixels) and **3 features** per pixel (RGB).

Now, **reshape the image**:  
- From its current size of `height * width * 3`
- To a matrix of size `N * 3` where `N = height * width`  

Assign the reshaped image to `X`.

In [None]:
# Reshape
X = img.reshape(img_shape[0] * img_shape[1], img_shape[2])
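
As a quick sanity check on what reshaping does to the pixel order, here is a small sketch on a made-up 2 x 3 image (the array `tiny` is hypothetical, not part of the exercise). It suggests that the flattening keeps pixels in row-major order and can be undone exactly.

```python
import numpy as np

# A hypothetical 2 x 3 image with 3 channels, filled with 0..17.
tiny = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)

# -1 lets NumPy infer the number of rows (height * width).
flat = tiny.reshape(-1, 3)
print(flat.shape)  # (6, 3)

# Row i of `flat` is the pixel at (i // width, i % width).
print(np.array_equal(flat[4], tiny[1, 1]))  # True

# Reshaping back recovers the original image exactly.
restored = flat.reshape(tiny.shape)
print(np.array_equal(restored, tiny))  # True
```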

A color image may contain up to 16 million potential colors: **3** channels with **256** possible values each (from 0 to 255) yield a potential maximum of **16,777,216** *($256^3$)* colors.  

Since our image has only 138,000 pixels, it uses at most 138,000 colors, as each pixel contains a single color defined by the values of its three channels.
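
The arithmetic behind these two bounds can be checked in a few lines. The `pixels` array below is a made-up example, used only to show how distinct colors are counted row by row.

```python
import numpy as np

n_channels, n_levels = 3, 256
max_colors = n_levels ** n_channels
print(max_colors)  # 16777216

n_pixels = 400 * 345
print(n_pixels)  # 138000

# An image cannot use more colors than it has pixels.
print(min(n_pixels, max_colors))  # 138000

# Counting distinct colors on a tiny, made-up pixel matrix:
pixels = np.array([[255, 0, 0], [0, 0, 255], [255, 0, 0]], dtype=np.uint8)
print(len(np.unique(pixels, axis=0)))  # 2
```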

Store the number of unique colors in this picture in `color_count`

In [None]:
color_count = len(np.unique(X, axis=0))
color_count

In summary, we have: 
- 138,000 samples, each observation is a pixel  
- 3 features (Red, Green & Blue values) for each observation
- An unknown number of clusters of similar color values

**Let's use K-means to reduce the number of colors** 🎨

### 🧪 Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('image_analysis',
                         img_shape = img_shape,
                         color_count = color_count)
result.write()
print(result.check())

## Compression with K-means

We want to reduce the **54,095** colors to **K** colors.  

Using a `KMeans` algorithm over the pixels, we can assign each pixel to one of **K** clusters.  The center of each cluster is going to be the average color of the pixels that belong to it. 

We can then use this "mean cluster color" as the RGB values for each pixel in the cluster.  
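
To see the idea on a small scale, here is a hand-written sketch of a single assignment and update step in plain NumPy. It is not scikit-learn's implementation, which repeats these steps until convergence; the six pixels and the two starting centers are made up for illustration.

```python
import numpy as np

# Six made-up pixels (RGB) forming two obvious groups: dark and bright.
pixels = np.array([
    [10, 10, 10], [20, 20, 20], [30, 30, 30],
    [200, 200, 200], [210, 210, 210], [220, 220, 220],
], dtype=float)

# Two hypothetical starting centers.
centers = np.array([[0, 0, 0], [255, 255, 255]], dtype=float)

# Assignment step: each pixel goes to its nearest center.
distances = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
labels = distances.argmin(axis=1)
print(labels)  # [0 0 0 1 1 1]

# Update step: each center becomes the mean color of its pixels.
centers = np.array([pixels[labels == k].mean(axis=0) for k in range(2)])
print(centers)

# Compression: replace every pixel by the center of its cluster.
compressed = centers[labels]
```

After this step the two centers are the average dark color and the average bright color, and `compressed` contains only those two colors.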

Our objective is to use only **32** colors.

❓ Fit a K-means with `n_clusters=32` and `n_init=10` on your ML-ready image `X`, and assign it to `kmeans`

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=32, n_init=10)
kmeans.fit(X)

👉 Check the `labels_` of your `kmeans`, then check their `shape` and `number of unique values`

In [None]:
kmeans.labels_ # each label represents a cluster

In [None]:
kmeans.labels_.shape # One label per observation

In [None]:
np.unique(kmeans.labels_).shape # 32 unique clusters

☝️ What the above gives us:

- Each label is a cluster
- There is one label assigned to each observation
- There are a total of 32 different labels, one for each cluster

❓ Check the `cluster_centers_` of your `KMeans`, shape and first element

In [None]:
kmeans.cluster_centers_.shape # One center per cluster

In [None]:
kmeans.cluster_centers_[0] # Each center is a vector of mean RGB values

☝️ Each cluster center is a vector of RGB values that represents the mean color of the cluster.

❓ Store in `X_compressed` an array with the mean colors from the clusters centers for each pixel.

<details span="markdown">
    <summary>💡 Help</summary>

You can achieve this by using only `kmeans.cluster_centers_` and `kmeans.labels_`

</details>

In [None]:
X_compressed = kmeans.cluster_centers_[kmeans.labels_]
X_compressed
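
The indexing trick above relies on NumPy's integer-array indexing: indexing a 2D array with an array of row numbers returns one row per index. A minimal sketch with a made-up 3-color `palette` and four hypothetical labels:

```python
import numpy as np

# A hypothetical palette of 3 colors (one row per cluster center).
palette = np.array([
    [0.0, 0.0, 0.0],
    [127.5, 127.5, 127.5],
    [255.0, 255.0, 255.0],
])

# One made-up cluster label per pixel.
labels = np.array([2, 0, 0, 1])

# Integer-array indexing picks one palette row per label.
recolored = palette[labels]
print(recolored.shape)  # (4, 3)
print(recolored[0])     # [255. 255. 255.]
```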

RGB values must be integers.  

Our colors in `X_compressed` are in `float64`.  

❓ Convert `X_compressed` to `uint8`, the unsigned 8-bit integer type whose values range from 0 to 255.

In [None]:
X_compressed = X_compressed.astype('uint8')
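
Note that `astype('uint8')` drops the fractional part rather than rounding. The sketch below, on made-up values, compares this with rounding first via `np.rint`. This is an optional variation, not something the exercise requires.

```python
import numpy as np

values = np.array([12.9, 200.2, 254.7])

# astype truncates the fractional part.
truncated = values.astype('uint8')
print(truncated)  # [ 12 200 254]

# Rounding first keeps each value closest to the original.
rounded = np.rint(values).astype('uint8')
print(rounded)  # [ 13 200 255]
```

Since cluster centers are averages of values between 0 and 255, they always stay within the `uint8` range, so no clipping is needed here. In principle, converting to integers could merge two very close centers into the same color.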

❓ Verify that the number of unique colors used is indeed 32.

In [None]:
len(np.unique(X_compressed, axis=0))

In [None]:
X_compressed.shape

### 🧪 Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('image_shape',
                         image = X_compressed)
result.write()
print(result.check())

## Plot compressed image

Our `X_compressed` is a flat matrix with one row per pixel (shape `(N, 3)`), so we can't display it as an image yet.

❓ Reshape your compressed image and plot it side by side with your original image.  

<details>
    <summary>💡 Hint</summary>

You need to reshape your flattened `X_compressed` into the right dimensions for an image.  
Your original image's shape (`img_shape`) gives you those dimensions.

---

</details>

In [None]:
img_compressed = X_compressed.reshape(*img_shape)

In [None]:
# Plot the original and the compressed image.
fig, ax = plt.subplots(1, 2, figsize = (7, 7))
ax[0].imshow(img)
ax[0].set_title('Original Image')

ax[1].imshow(img_compressed)
ax[1].set_title('Compressed Image')

# Remove the axes (use a new loop variable so `ax` is not overwritten)
for axis in fig.axes:
    axis.axis('off')

# Nice padding adjustments
plt.tight_layout()

Not bad!

Some colors are lost, but you can easily recognize the original image.

## Saving the compressed image back to a 'jpg'

Now, use `cv2.imwrite` to save your compressed image back to disk (remember: `cv2` expects BGR, so you will need to use the `cv2.cvtColor` conversion function again if you want to save the current colors). Compare the size of the two images: you should have saved about 20% space with minimal loss in image quality.

In [None]:
cv2.imwrite('compressed_group.jpg', cv2.cvtColor(img_compressed, cv2.COLOR_RGB2BGR))
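
One way to compare the two file sizes is with `os.path.getsize`. The helper `size_reduction` below is a hypothetical name introduced for this sketch, and it is demonstrated on two throwaway files of known sizes rather than on the actual images.

```python
import os
import tempfile

def size_reduction(original_path, compressed_path):
    """Return the fraction of disk space saved (0.2 means 20% smaller)."""
    original = os.path.getsize(original_path)
    compressed = os.path.getsize(compressed_path)
    return 1 - compressed / original

# Demonstration with two throwaway files of known sizes.
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, 'big.bin')
    small = os.path.join(tmp, 'small.bin')
    with open(big, 'wb') as f:
        f.write(b'\0' * 1000)
    with open(small, 'wb') as f:
        f.write(b'\0' * 800)
    saved = size_reduction(big, small)
    print(f'{saved:.0%}')  # 20%
```

In the notebook, you could call it as `size_reduction('raw_data/john_group_2020.jpg', 'compressed_group.jpg')`.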

## Finding the best `k`

You can use the Elbow method to find a good trade-off between the number of colors and the color information lost.   

Try plotting the `inertia` for `n_clusters` in the list [5, 10, 20, 30, 50, 70, 100], for instance. Use `max_iter=10` and `n_init=10`.

⚠️ You might wait several minutes

In [None]:
# Apply the elbow method to find the optimal number of clusters.
wcss = []
for i in [5, 10, 20, 30, 50, 70, 100]:
    print('working with ' + str(i) + ' clusters...', flush=True)
    kmeans = KMeans(n_clusters = i, max_iter=10, n_init=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

In [None]:
plt.plot([5, 10, 20, 30, 50, 70, 100], wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters/colors')
plt.ylabel('Within-Cluster Sum of Squares (inertia)')
plt.show()
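
For reference, scikit-learn documents `inertia_` as the sum of squared distances of samples to their closest cluster center. The sketch below computes that quantity by hand on four made-up one-dimensional points; the function name `inertia` and the two candidate clusterings are assumptions for illustration.

```python
import numpy as np

def inertia(X, centers, labels):
    """Sum of squared distances from each sample to its assigned center."""
    return float(((X - centers[labels]) ** 2).sum())

# Four made-up 1-D "pixels".
X = np.array([[0.0], [2.0], [10.0], [12.0]])

# k = 1: a single center at the global mean (6).
one = inertia(X, np.array([[6.0]]), np.zeros(4, dtype=int))

# k = 2: centers at 1 and 11.
two = inertia(X, np.array([[1.0], [11.0]]), np.array([0, 0, 1, 1]))

print(one, two)  # 104.0 4.0
```

Adding clusters can only lower the inertia, which is why the Elbow method looks for the point where the curve flattens rather than for its minimum.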

<details><summary><strong>Conclusions</strong></summary><br>
    We can see that our choice of 32 colors is pretty close to the optimal number: with 20 colors the inertia is significantly higher, and with fewer colors it increases sharply.</details>

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.