In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab1.ipynb")

![](img/563_lab_banner.png)

# Lab 1: Clustering 

## Imports <a name="im"></a>

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline
pd.set_option("display.max_colwidth", 0)

<br><br><br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## Submission instructions <a name="si"></a>
rubric={mechanics}

You will receive marks for correctly submitting this assignment by following the instructions below:
    
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- [Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.
- Make at least three commits in your lab's GitHub repository.    
- Push the final .ipynb file with your solutions to your GitHub repository for this lab.        
- Before submitting your lab, run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).     
- Make sure to enroll in Gradescope via [Canvas](https://canvas.ubc.ca/courses/130310).
- Upload the .ipynb file to Gradescope.
- Make sure that your plots/output are rendered properly in Gradescope.    
- If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a PDF (preferably WebPDF) or HTML export of the .ipynb file with your solutions so that the TAs can view your submission on Gradescope. 
- The data you download for this lab <b>SHOULD NOT BE PUSHED TO YOUR REPOSITORY</b> (there is also a `.gitignore` in the repo to prevent this).
- Include a clickable link to your GitHub repo for the lab just below this cell.
</div>    

_Points:_ 3

YOUR REPO LINK GOES HERE

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 1: Document clustering warm-up
<hr>

In the lectures, we explored image clustering. In this exercise, you will explore another popular application of clustering, [**document clustering**](https://en.wikipedia.org/wiki/Document_clustering). 

A large amount of unlabeled text data is available out there (e.g., news, recipes, online Q&A, and tweets). Clustering is a commonly used technique to organize this data in a meaningful way. 

As a warm up, in this exercise, you will cluster sentences from a toy corpus. 

**Your tasks:**

Run the code below which 
- extracts content of Wikipedia articles on a set of queries and stores the first sentence in each article as a document representing that topic,
- tokenizes the text (i.e., separates "words" in the sentence), and 
- carries out preliminary preprocessing (e.g., removes punctuation marks and converts the text to lower case).

Some notes: 
- Text preprocessing (normalization) before document clustering is typically an elaborate process. In this lab, however, we'll carry out only minimal preprocessing, because we are dealing with fairly short documents that are more or less clean.
  
- If you have created a `conda` environment using [the course `yml` file](https://github.ubc.ca/mds-2023-24/DSCI_563_unsup-learn_students/blob/master/env-dsci-563.yml), you should already have the packages used below. If not, you may have to install the appropriate packages in our course's environment.
  
- Feel free to experiment with Wikipedia queries of your choice. But stick to the provided list for the final submission so that it's easier for the TAs to grade your submission.


For tokenization we are using the `nltk` package. Even if you have the package installed via the course `conda` environment, you might have to download `nltk` pre-trained models, which can be done with the code below.

In [None]:
import nltk

nltk.download("punkt")

In [None]:
import wikipedia
import string 

from nltk.tokenize import sent_tokenize, word_tokenize

queries = [
    "Artificial Intelligence", "Deep learning", "Unsupervised learning", "Quantum Computing", 
    "Environmental protection", "Climate Change", "Renewable Energy", "Biodiversity",
    "French Cuisine", "Bread food", "Dumpling food", "Pizza"
]

wiki_dict = {"wiki query": [], "text": [], "n_words": []}
remove_tokens = list(string.punctuation) + ['``', '’', '`', 'br', '"', "”", "''", "'s", "(", ")", "[", "]"]

# Running this code might take some time.
for query in queries:
    try:
        # Attempt to fetch the page content
        page_content = wikipedia.page(query).content
    except wikipedia.exceptions.PageError:
        print(f"Page not found for query: {query}. Skipping...")
        continue
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Query: {query} led to a disambiguation page. Choosing the first option: {e.options[0]}")
        page_content = wikipedia.page(e.options[0]).content

    text = sent_tokenize(page_content)[0]
    tokenized = word_tokenize(text)
    text_pp = [token.lower() for token in tokenized if token.lower() not in remove_tokens]
    wiki_dict["n_words"].append(len(text_pp))
    wiki_dict["text"].append(" ".join(text_pp))
    wiki_dict["wiki query"].append(query)

wiki_df = pd.DataFrame(wiki_dict)
wiki_df

Our toy corpus has twelve toy documents (`text` column in `wiki_df`) extracted from twelve Wikipedia queries (`wiki query` column in `wiki_df`), assuming none of the queries was skipped. 

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many clusters? 
rubric={reasoning}


**Your tasks:**

- If tasked with manually clustering documents from this toy corpus, how many clusters would you identify, and what labels would you assign to each cluster?

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 1

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 K-Means with bag-of-words representation 
rubric={accuracy}

In the lecture, we saw that data representation plays a crucial role in clustering. Switching from a flattened representation of images to feature vectors extracted from pre-trained models greatly improved the quality of clustering. 

What kind of representation is suitable for text data? In previous machine learning courses, we have used bag-of-words representation to numerically encode text data, where each document is represented with a vector of word frequencies. 
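
To make the idea concrete, here is a minimal plain-Python sketch of a bag-of-words encoding for two made-up sentences. The sentences and the helper name `bag_of_words` are illustrative assumptions and are not part of the lab data. The sketch splits on whitespace only, so its output will not match `CountVectorizer` exactly (for instance, `CountVectorizer`'s default token pattern ignores single-character tokens such as "a").

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of whitespace-tokenized documents."""
    # A sorted vocabulary gives every word a fixed column index
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Counter returns 0 for words that do not occur in this document
        rows.append([counts[word] for word in vocab])
    return vocab, rows


# Hypothetical toy documents, already lower-cased
toy_docs = ["pizza is a dish", "bread is a staple food"]
vocab, X = bag_of_words(toy_docs)
print(vocab)
for row in X:
    print(row)
```

Each row has one entry per vocabulary word, so two documents that share many words end up with similar count vectors, regardless of word order.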

Let's try clustering documents with this simplistic representation.  

**Your tasks:**

1. Create bag-of-words representation using [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) with default arguments for the `text` column in `wiki_df` above.
2. Cluster the encoded documents with [`KMeans` clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). Use `random_state=42` (for reproducibility) and set `n_clusters` to the number you identified in the previous exercise. 

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 3

In [None]:
...

In [None]:
kmeans_bow_labels = ...

In [None]:
wiki_df["bow_kmeans"] = kmeans_bow_labels
wiki_df

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 K-Means with sentence embedding representation
rubric={accuracy}

The bag-of-words representation, while useful, has limitations due to its inability to consider word order and context. There are other richer and more expressive representations of text which can be extracted using transfer learning. Similar to how pre-trained CNN models are employed for image data, there are pre-trained models for text data as well. In this lab, we will use a pre-trained model called 'all-MiniLM-L6-v2', accessible via the [sentence transformer](https://www.sbert.net/index.html) package. This deep learning model produces dense, fixed-length vector representations of sentences. For a comprehensive list of all available pre-trained models through this package, refer [here](https://www.sbert.net/docs/pretrained_models.html). These models are designed to capture the context and semantic meaning of sentences, making them particularly effective when we want to capture semantic similarity between texts. We may delve deeper into these representations in DSCI 575.

**Your tasks:**

1. Run the code below to create a sentence embedding representation of the documents in our toy corpus. 
2. Cluster the documents in our toy corpus encoded with this representation (`emb_sents`) using `KMeans` with the following arguments: 
    - `random_state=42` (for reproducibility)
    - `n_clusters`=the number of clusters you identified in 1.1

Note
- The code below might throw a warning. You may ignore it for the purpose of this lab. 

In [None]:
from sentence_transformers import SentenceTransformer

# embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
emb_sents = embedder.encode(wiki_df["text"].tolist())
emb_sent_df = pd.DataFrame(emb_sents, index=wiki_df.index)
emb_sent_df

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 2

In [None]:
...

In [None]:
kmeans_emb_labels = ...

In [None]:
wiki_df["emb_kmeans"] = kmeans_emb_labels
wiki_df

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 DBSCAN with sentence embedding representation and cosine distance  
rubric={accuracy}

Now try [`DBSCAN`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) on our toy dataset.
K-Means clustering is inherently linked to Euclidean distance due to its reliance on the concept of means. 
With `DBSCAN` we have the flexibility to experiment with different distance metrics. For text data, [cosine similarities](https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) or cosine distances are often effective. The **cosine distance** between two vectors $u$ and $v$ is defined as: 

$$\text{distance}_{\text{cosine}}(u, v) = 1 - \frac{u \cdot v}{\left\lVert u\right\rVert_2 \left\lVert v\right\rVert_2}$$
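
As a quick sanity check of this definition, the following plain-Python sketch evaluates the cosine distance for a few hand-picked vectors. The function name `cosine_distance` and the example vectors are assumptions made for illustration; in practice you would rely on a library implementation.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - (u . v) / (||u||_2 * ||v||_2) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical 2-D vectors chosen so the answers are easy to check by hand
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
opposite = cosine_distance([1.0, 0.0], [-2.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

The three values are approximately 0, 1, and 2: cosine distance depends only on the angle between the vectors, not on their lengths, and ranges from 0 (same direction) to 2 (opposite directions).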


**Your tasks:**

- Cluster the documents in our toy corpus encoded with the sentence embedding representation (`emb_sents`) using [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html?highlight=dbscan#sklearn.cluster.DBSCAN) with `metric='cosine'`. You will have to tune the hyperparameters `eps` and `min_samples` to get meaningful clusters, as the default values of these hyperparameters are unlikely to work well on this toy dataset.

<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 4

In [None]:
...

In [None]:
...

In [None]:
dbscan_emb_labels = ...

In [None]:
wiki_df["emb_dbscan"] = dbscan_emb_labels
wiki_df

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Hierarchical clustering with sentence embedding representation
rubric={accuracy}

**Your tasks:**

Try hierarchical clustering on `emb_sents`. In particular
1. Create and show a dendrogram with `complete` linkage and `metric='cosine'` on this toy dataset.
2. Create flat clusters using `fcluster` with appropriate hyperparameters and store the cluster labels in the `hier_emb_labels` variable below.

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 3

In [None]:
...

In [None]:
hier_emb_labels = ...

In [None]:
wiki_df["emb_hierarchical"] = hier_emb_labels
wiki_df

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.6 Discussion
rubric={reasoning}

**Your tasks:**

- Reflect on and discuss the clustering results of the methods you explored in the previous exercises, focusing on the following points:    
    - effect of input representation on clustering results
    - whether the clustering results match with your intuitions and the challenges associated with getting the desired clustering results with each method

<div class="alert alert-warning">

Solution_1.6
    
</div>

_Points:_ 4

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Gaussian Mixture Models (GMMs) 

In this exercise, you'll investigate Gaussian Mixture Models (GMMs) using a toy dataset. Take a look at the following dataset:

How many clusters can you identify? While there's no definitive answer, it appears there could be 5 clusters. Three of these clusters seem to be blobs of points with roughly similar spreads. The other two clusters are elongated in shape and oriented in different directions. Moreover, they intersect, which could lead to ambiguous cluster assignments.

In [None]:
df = pd.read_csv('data/gmm-data.csv')

In [None]:
plt.scatter(df['X1'], df['X2'], alpha=0.6, edgecolors='k');

<br><br>

<!-- BEGIN QUESTION -->

### 2.1 How would K-Means behave?
rubric={viz,accuracy}

**Your tasks:**

- Apply the K-Means clustering algorithm to the dataset with $k=3$ and $k=5$. Ensure you scale the data beforehand using [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
- Visualize the data points, colouring them according to the cluster assignments. Display the plots for different $k$ values side by side.

_You may use the visualization library of your choice._

<div class="alert alert-warning">

Solution_2.1
    
</div>

_Points:_ 4

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Clustering with Gaussian Mixture Model (GMM)
rubric={viz,reasoning}

**Your tasks:**
- Fit a Gaussian Mixture Model (GMM) to the dataset using 5 components, experimenting with different values for the `covariance_type` argument. Use `random_state=42`, `max_iter=1000`, and `n_init=100`. 
- Visualize the data points by colouring them based on the cluster assignments from the GMM. In particular, create a 2-by-2 grid of plots, each coloured according to the cluster assignments, to illustrate the effect of each `covariance_type`. 
- Briefly describe how the different covariance types impact the shapes and orientations of the clusters.
- Compare and contrast the best results with those you obtained using the K-Means algorithm in the previous exercise.

_You may use the visualization library of your choice._
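
As a rough aid for thinking about `covariance_type`, the sketch below counts the free covariance parameters of a GMM under each of the four options accepted by scikit-learn's `GaussianMixture` (`'spherical'`, `'diag'`, `'tied'`, `'full'`). The helper name `n_covariance_parameters` is hypothetical and the function is plain Python, not part of scikit-learn. It counts covariance parameters only and ignores the means and mixture weights.

```python
def n_covariance_parameters(covariance_type, n_components, n_features):
    """Number of free covariance parameters in a GMM for each covariance structure."""
    full_one = n_features * (n_features + 1) // 2  # entries of one symmetric d x d matrix
    if covariance_type == "full":
        return n_components * full_one    # one full matrix per component
    if covariance_type == "tied":
        return full_one                   # one full matrix shared by all components
    if covariance_type == "diag":
        return n_components * n_features  # one variance per feature per component
    if covariance_type == "spherical":
        return n_components               # one variance per component
    raise ValueError(f"unknown covariance_type: {covariance_type!r}")


# 5 components and 2 features, as in this exercise
counts = {
    ct: n_covariance_parameters(ct, n_components=5, n_features=2)
    for ct in ["spherical", "diag", "tied", "full"]
}
print(counts)
```

The more parameters a setting has, the more freedom each component has to adapt to the data, at the cost of a harder estimation problem.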

<div class="alert alert-warning">

Solution_2.2
    
</div>

_Points:_ 5

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 3: Clustering Flickr8k images and captions

Now that you have experience with text and image clustering separately, this exercise introduces multimodal clustering. This approach integrates different types of data, in this case visual and textual, on the idea that the complementary information provided by each modality can improve the clustering results.

In the lectures, we discussed clustering images using pre-trained models as feature extractors, where these feature vectors represent the images. In Exercise 1 of this lab, you tackled document clustering using sentence embeddings derived from pre-trained language models. Expanding on these concepts, the current exercise focuses on creating a composite representation by merging image features extracted through pre-trained CNNs with caption features extracted from pre-trained language models.

For this exercise, you'll use a subset of the [Flickr8k dataset](https://www.kaggle.com/datasets/adityajn105/flickr8k). The original dataset comprises 8,000 images. Each image is accompanied by five distinct captions that provide clear descriptions of the significant entities and events in the images. In the subset selected for this exercise, we will be working with approximately 400 images, and each image will be accompanied by one associated caption. 

Download this subset from [here](https://github.ubc.ca/mds-2021-22/datasets/blob/master/data/sampled_Flickr8k.zip), unzip it, and put the `sampled_Flickr8k` folder under the data directory in lab1. 

Run the code below, which reads images and corresponding captions into a list called `img_captions` and displays some sample images and corresponding captions from this list. 

In [None]:
import torch
from torch import nn
from torchvision.models import resnet50
from torchvision import transforms, models, datasets
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import random
from PIL import Image

In [None]:
# Set the device appropriately: prefer CUDA, then Apple MPS, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
device

In [None]:
# Configuration
data_dir = os.path.join("data", "sampled_Flickr8k")  # Directory containing sampled images and captions 
sampled_img_path = os.path.join(data_dir, 'images')  # Directory containing sampled images
sampled_captions_file = os.path.join(data_dir, 'sampled_captions.txt')  # Sampled captions file path

# Function to load images and captions
def load_images_and_captions(captions_file):    
    data = {}
    with open(captions_file, 'r') as f:
        lines = f.readlines()
    for line in lines:
        img_id, caption = line.strip().split('%')
        path = os.path.join(sampled_img_path, img_id)
        data[path] = caption
            
    return list(data.items())
    
img_captions = load_images_and_captions(sampled_captions_file)

The code snippet above reads sampled images along with their corresponding captions and stores them in `img_captions` as a list of (image_path, caption) tuples, as illustrated below.

In [None]:
img_captions[:5]

Now let's write code to display some sample images along with their captions. 

In [None]:
def wrap_caption(caption, max_length=40):
    """A simple function to wrap text based on a maximum line length."""
    words = caption.split()
    wrapped_caption = ""
    current_line = ""
    for word in words:
        if len(current_line) + len(word) + 1 <= max_length:
            current_line += word + " "
        else:
            wrapped_caption += current_line.strip() + "\n"
            current_line = word + " "
    wrapped_caption += current_line.strip()  # Add the last line
    return wrapped_caption


def display_samples(image_caption_pairs, n_samples = 5):
    """
    Displays a random selection of image-caption pairs.

    This function randomly selects a specified number of image-caption pairs
    from a given list, resizes each image to a uniform size, and displays them
    alongside their captions in a single row.

    Parameters:
    - image_caption_pairs (list of tuples): A list where each tuple contains
      the path to an image (str) and its corresponding caption (str).
    - n_samples (int, optional): The number of image-caption pairs to display.
      Defaults to 5.

    Returns:
    - None. The function directly displays the images and captions using matplotlib.
    """    
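    # Never request more samples than are available (e.g., for a small cluster)
    n_samples = min(n_samples, len(image_caption_pairs))
    if n_samples == 0:
        print("No image-caption pairs to display.")
        return
    sampled_items = random.sample(image_caption_pairs, n_samples)

    desired_size = (200, 200)
    
    fig, axes = plt.subplots(1, n_samples, figsize=(20, 10), subplot_kw={"xticks": [], "yticks": []})
    axes = np.atleast_1d(axes)  # plt.subplots returns a single Axes (not an array) when n_samples == 1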

    for i, (img_path, caption) in enumerate(sampled_items):
        ax = axes[i]
        img = Image.open(img_path).convert('RGB')  # Open and convert to RGB
        img = img.resize(desired_size)
        ax.imshow(img)
        ax.set_title(wrap_caption(caption), fontsize=12)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()


In [None]:
display_samples(img_captions, n_samples = 5) 

<br><br><br><br>

<!-- BEGIN QUESTION -->

### 3.1 Clustering with image features
rubric={accuracy}

In this exercise, you will perform clustering on image features using the pre-trained DenseNet CNN model. The code below 
- Loads the DenseNet model, assuming that you have defined the appropriate device.
- Extracts image features for all the images using the loaded DenseNet model, creating a 1024-dimensional feature vector for each image in our dataset.   

**Your tasks:**

- Experiment with K-Means with different values for `n_clusters` on `image_features` extracted below.
- Store the cluster labels of your best model in `cluster_labels_img_feats` and show sample images from each cluster using the provided code. 

In [None]:
densenet = models.densenet121(weights='DenseNet121_Weights.DEFAULT')
densenet.classifier = nn.Identity()  # remove that last "classification" layer
densenet = densenet.to(device)

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

In [None]:
# Running this code might take some time.
def extract_image_features(model, img_captions):
    model.eval()  # Ensure the model is in evaluation mode
    features = []
    with torch.no_grad():
        for img_path, _ in img_captions:
            image = Image.open(img_path).convert('RGB')
            image = transform(image).unsqueeze(0).to(device)
            feat_vec = model(image)
            feat_vec = feat_vec.squeeze().cpu().numpy()  # Move to CPU and convert to NumPy
            features.append(feat_vec)
    return np.array(features)
    
image_features = extract_image_features(densenet, img_captions)

In [None]:
image_features.shape

<div class="alert alert-warning">

Solution_3.1
    
</div>

_Points:_ 2

In [None]:
n_clusters = ...

In [None]:
...

In [None]:
cluster_labels_img_feats = ...

In [None]:
def organize_and_print_clusters(img_captions, cluster_labels, n_clusters=5, n_samples=5):
    """
    Organizes images and captions into specified clusters and displays samples from each cluster.

    This function takes a list of image-caption pairs and their corresponding cluster labels,
    organizes them into clusters, and displays a specified number of samples from each cluster
    along with their captions.

    Parameters:
    - img_captions (list of tuples): A list where each tuple contains the path to an image (str) 
      and its corresponding caption (str).
    - cluster_labels (list of int): A list of integer labels indicating the cluster assignment 
      for each image-caption pair in img_captions.
    - n_clusters (int, optional): The number of clusters. Defaults to 5.
    - n_samples (int, optional): The number of image-caption pairs to display from each cluster. 
      Defaults to 5.

    Returns:
    - None. This function prints the cluster ID and displays the images with captions for each cluster.

    Notes:
    - This function assumes that the `display_samples` function is defined and capable of displaying 
      image-caption pairs.
    """
    
    # Organize images and captions into clusters based on the cluster IDs 
    # in cluster_labels
    clustered = {i: [] for i in range(n_clusters)}
    for item, label in zip(img_captions, cluster_labels):
        clustered[label].append((item[0], item[1]))
        
    # Print cluster IDs and display n_samples sample images with captions 
    # from each cluster
    for cluster in clustered: 
        print(f'\n\n\n Cluster {cluster}')
        display_samples(clustered[cluster], n_samples=n_samples)

In [None]:
organize_and_print_clusters(img_captions, cluster_labels_img_feats, n_clusters=n_clusters, n_samples=5)

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 3.2 Clustering with caption features
rubric={accuracy}

In this exercise, you will perform clustering on captions using the pre-trained `all-MiniLM-L6-v2` language model. The provided code loads the `all-MiniLM-L6-v2` model. 

**Your tasks:**
- Use the pre-trained `all-MiniLM-L6-v2` model to encode captions.
- Try different values for `n_clusters` in `K-Means` clustering with the encoded captions.
- Save the cluster labels in `cluster_labels_text_feats`. Then, execute the provided code to display sample images and captions from each cluster.

In [None]:
# Load sentence-transformer model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

<div class="alert alert-warning">

Solution_3.2
    
</div>

_Points:_ 5

In [None]:
...

In [None]:
n_clusters = ...

In [None]:
...

In [None]:
cluster_labels_text_feats = ...

In [None]:
organize_and_print_clusters(img_captions, cluster_labels_text_feats, n_clusters=n_clusters)

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 3.3 Integrating image and text features for multi-modal clustering
rubric={accuracy}

You may have observed that clustering based on image features highlights visual similarities, while clustering based on caption features groups examples with semantic resemblance in captions.

In this exercise, you will combine:

- Image features extracted from the pre-trained `DenseNet` CNN model.
- Caption features extracted from the pre-trained `all-MiniLM-L6-v2` language model.

You will then cluster examples using these combined representations.

**Your tasks:**

- Combine the image and caption features obtained in the prior exercises so that each image caption pair is represented with a 1024 + 384 dimensional feature vector. 
- Normalize the combined features using `StandardScaler` to ensure both text and image features are on a comparable scale.
- Experiment with different clustering methods, pick a clustering method of your choice, and apply it to the combined features. Experiment with different hyperparameter values, but only report your optimal results.
- Store the cluster labels in `cluster_labels_combined`. Afterwards, run the provided code snippet to visualize sample images and captions from each cluster.

<div class="alert alert-warning">

Solution_3.3
    
</div>

_Points:_ 8

In [None]:
n_clusters = ...

In [None]:
...

In [None]:
cluster_labels_combined = ...

In [None]:
organize_and_print_clusters(img_captions, cluster_labels_combined, n_clusters=n_clusters)

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 3.4 Discussion
rubric={reasoning}

- Compare and contrast your clustering results from 3.1, 3.2, and 3.3.

<div class="alert alert-warning">

Solution_3.4
    
</div>

_Points:_ 3

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 4: Food for thought
<hr>

Similar to the previous courses, each lab will have a few challenging questions. In some of the labs, I will include challenging questions that lead into the material of the upcoming week. These are usually low-risk questions and will contribute at most 5% of the lab grade. The main purpose here is to challenge yourself or dig deeper in a particular area. When you start working on labs, attempt all other questions before moving to these challenging questions. If you are running out of time, please skip the challenging questions. 

We will be more strict with the marking of these questions. There might not be model answers. If you want to get full points in these questions, your answers need to
- be thorough, thoughtful, and well-written
- provide convincing justification and appropriate evidence for the claims you make 
- impress the reader of your lab with your understanding of the material, your analytical and critical reasoning skills, and your ability to think on your own


![](img/eva-game-on.png)

<br><br>

<!-- BEGIN QUESTION -->

### (Challenging) 4.1: Similarity measure for mixed datasets
rubric={reasoning}

Clustering is based on finding similar examples, so using an appropriate similarity metric is crucial for finding meaningful clusters. When the data contains only numeric features, you can apply an appropriate transformation (e.g., scaling) and find similarity between examples based on Euclidean distances between points. In document clustering with the sentence embedding representation, we used cosine similarity. But what if the dataset contains different types of features such as numeric, categorical (e.g., postal code), or multi-valued categorical features (e.g., movie genres)? How would you calculate similarity between examples as a single numeric value and apply clustering methods? Suggest some ideas.  

> As a concrete example, you may explore clustering of movies from [the IMDB Movies Dataset](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows). 

<div class="alert alert-warning">

Solution_4.1
    
</div>

_Points:_ 1

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### (Challenging) 4.2: Vector Quantization
rubric={reasoning}

One more application of clustering is _vector quantization_, where we find a prototype point for each cluster and replace points in the cluster by their prototype. If our inputs are images, vector quantization gives us a rudimentary image compression algorithm.

We will implement image quantization by filling in the `quantize` and `dequantize` functions below. The `quantize` function should take in an image and, using the pixels as examples and the 3 colour channels as features, run `KMeans` clustering on the data with $2^b$ clusters for some hyperparameter $b$. The code should store the cluster means and return the cluster assignments. The `dequantize` function should return a version of the image (the same size as the original) where each pixel's original colour is replaced with the nearest prototype colour.

To understand why this is compression, consider the original image space. Say the image can take on the values $0,1,\ldots,255$ in each colour channel. Since $2^8=256$ this means we need 8 bits to represent each colour channel, for a total of 24 bits per pixel. Using our method, we are restricting each pixel to only take on one of $2^b$ colour values. In other words, we are compressing each pixel from a 24-bit colour representation to a $b$-bit colour representation by picking the $2^b$ prototype colours that are "most representative" given the content of the image. 
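
The arithmetic above can be sketched in a few lines of Python. The function name `compression_summary` and the 100 x 100 image size are assumptions for illustration. The sketch also counts the bits needed to store the $2^b$ prototype colours themselves, which the paragraph above does not spell out.

```python
def compression_summary(height, width, b):
    """Compare raw 24-bit storage with b-bit cluster indices plus the colour palette."""
    n_pixels = height * width
    raw_bits = n_pixels * 24      # 3 channels x 8 bits per pixel
    index_bits = n_pixels * b     # one b-bit cluster index per pixel
    palette_bits = (2 ** b) * 24  # 2^b prototype colours, 24 bits each
    quantized_bits = index_bits + palette_bits
    return raw_bits, quantized_bits, raw_bits / quantized_bits


# Hypothetical 100 x 100 image quantized with b = 3 (8 prototype colours)
raw_bits, quantized_bits, ratio = compression_summary(100, 100, b=3)
print(raw_bits, quantized_bits, round(ratio, 2))
```

For all but tiny images the palette cost is negligible, so the compression ratio is close to $24 / b$.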

**Your tasks:**

1. Complete the `quantize` and `dequantize` functions below.
2. Run the code on an image of your choosing. Display the results for a few different values of $b$.

Notes: 
- If you actually try saving this as a file, you won't see the file size being what you expected, because Python won't know to allocate exactly $b$ bits per element of `quantized_img`, and will instead probably store them as 32-bit integers if you don't specify otherwise. But if one wanted to work harder, it would be theoretically possible to store the elements with $b$ bits per pixel. Also, all this is before any additional lossless compression.

- This is not how image compression systems like JPEG actually work. They use something similar to the Fourier transform, followed by a (simpler) quantization, followed by lossless compression.

<div class="alert alert-warning">

Solution_4.2
    
</div>

_Points:_ 2

In [None]:
import os

from matplotlib.pyplot import imread, imshow
from sklearn.cluster import KMeans

In [None]:
img = imread(os.path.join("img", "eva-happy-saturday.jpg"))
imshow(img)
plt.title("original image")
plt.show()

In [None]:
def quantize(img, b):
    """
    Quantizes an image into 2^b clusters

    Parameters
    ----------
    img : a (H,W,3) numpy array
      the image to be processed
    b   : int
      the desired number of bits

    Returns
    -------
    quantized_img : a (H,W) numpy array containing cluster indices
    colours       : a (2^b, 3) numpy array, each row is a colour
    """

    H, W, _ = img.shape
    model = KMeans(n_clusters=2 ** b)

    ### YOUR CODE
    ...

    return quantized_img, model.cluster_centers_.astype("uint8")


def dequantize(quantized_img, colours):
    H, W = quantized_img.shape
    img = np.zeros((H, W, 3), dtype="uint8")

    # YOUR CODE: fill in the values of `img` here

    ...

    return img

In [None]:
img.shape

In [None]:
img.reshape(-1, 3).shape  # flatten (H, W, 3) into (H * W, 3): one row per pixel

In [None]:
b = 3
compressed, colours = quantize(img, b=b)
recon = dequantize(compressed, colours)
fig, ax = plt.subplots(1, 2, figsize=(10, 10))
ax[0].imshow(img)
ax[0].set_title("original image")
ax[1].imshow(recon)
ax[1].set_title(f"reconstructed image (b = {b})");

In [None]:
plt.imshow(colours[None])
plt.title("colours learned")
plt.xticks([])
plt.yticks([])
plt.show()

<!-- END QUESTION -->

<br><br><br><br>

Before submitting your assignment, please make sure you have followed all the instructions in the Submission Instructions section at the top. 

Well done!! Have a great weekend and happy reading week! 

In [None]:
from IPython.display import Image

Image("img/eva-well-done.png")