# Exercises

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams.update({"figure.figsize": (10, 8), "font.size": 18})
```

## Exercise 1

How would you define clustering? Can you name a few clustering algorithms?

---

Clustering is an unsupervised learning task that involves grouping 'similar' instances together in clusters.

Some clustering algorithms are:
- K-means
- DBSCAN
- Gaussian mixtures

## Exercise 2

What are some of the main applications of clustering algorithms?

---

Some applications of clustering are:
- Segmentation - e.g. grouping together similar customers or image segmentation
- Anomaly/novelty detection - e.g. identifying spam emails or faulty products
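- Dimensionality reduction - mapping each instance to its vector of distances to the cluster centres gives a $k$-dimensional representation, which reduces the dimensionality whenever there are fewer clusters than original features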
- Semi-supervised learning - propagating labels to instances in the same cluster

## Exercise 3

Describe two techniques to select the right number of clusters when using K-means.

---

1. Choose the number of clusters that maximises the mean silhouette score
2. Use K-means as the first step in a pipeline and treat the number of clusters as a hyperparameter to be optimised (e.g. through cross-validation on the downstream task)
3. Plot inertia against the number of clusters and look for the 'elbow', i.e. the point beyond which inertia falls much more slowly
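As a rough illustration of the first and third techniques, the sketch below fits K-means for a range of cluster counts and records both the inertia and the silhouette score. The synthetic blob data, the candidate range and the variable names are assumptions made for the example, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated blobs (an assumption for illustration).
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=0.6, random_state=42)

candidate_ks = list(range(2, 9))
inertias, silhouettes = [], []
for k in candidate_ks:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(kmeans.inertia_)  # plot these against k to look for the elbow
    silhouettes.append(silhouette_score(X, kmeans.labels_))

# Pick the k with the highest mean silhouette score.
best_k = candidate_ks[int(np.argmax(silhouettes))]
print("best k by silhouette score:", best_k)
```

Inertia keeps falling as clusters are added, so it cannot simply be minimised; the silhouette score, by contrast, has a genuine maximum that can be selected automatically.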

## Exercise 4

What is label propagation? Why would you implement it, and how?

---

Label propagation is the practice of copying labels from labelled instances to similar unlabelled instances. You might implement it because labelling instances 'properly' is costly or time-consuming, and training on the enlarged labelled set can improve model performance.

To implement label propagation for K-means you might first cluster the data, then select a representative for each cluster (the instances closest to the centroids are a good choice), then assign the representative instances' labels to other instances in their cluster.
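The procedure above might look roughly like the following sketch. It assumes scikit-learn's bundled digits dataset and 50 clusters, and it uses the known labels of the representative instances to stand in for manual labelling; all of these choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

k = 50  # number of clusters (an arbitrary choice for this sketch)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X)  # distance from each instance to each centroid

# Representative instance per cluster: the one closest to the centroid.
representative_idx = np.argmin(X_dist, axis=0)

# Pretend only these k instances were labelled by hand.
representative_labels = y[representative_idx]

# Propagate each representative's label to every instance in its cluster.
y_propagated = representative_labels[kmeans.labels_]

# Because the true labels are known here, the propagation can be checked.
accuracy = np.mean(y_propagated == y)
print(f"propagated label accuracy: {accuracy:.3f}")
```

In practice you might propagate only to the instances nearest each centroid, since instances near cluster boundaries are the most likely to receive a wrong label.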

## Exercise 5

Can you name two clustering algorithms that can scale to large datasets? And two that look for regions of high density?

---

K-means and BIRCH both scale well to large datasets.

DBSCAN and mean shift both look for regions of high density.
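For reference, here is a minimal sketch of one algorithm from each group in scikit-learn: `Birch` for scalability and `DBSCAN` for density. The two-moons data and the `threshold`, `eps` and `min_samples` values are assumptions chosen for illustration and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import Birch, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

# BIRCH builds a compact tree summary of the data, which helps it scale.
birch_labels = Birch(n_clusters=2, threshold=0.2).fit_predict(X)

# DBSCAN grows clusters from dense regions; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print("BIRCH clusters:", len(np.unique(birch_labels)))
print("DBSCAN clusters:", n_dbscan_clusters)
```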

## Exercise 6

Can you think of a use case where active learning would be useful? How would you implement it?

---

Image recognition is one use case: humans manually select the images that contain a particular object (e.g. an image-identification CAPTCHA), and their answers serve as labels for the model.

To implement it you might:
- Train a binary classifier for recognising an object, e.g. cars, on a labelled training set of photos
- Have the classifier predict probabilities for containing a car on an unlabelled set
- Select images where the probabilities are close to 50% and have humans label them
- Retrain the algorithm (or train on-the-fly if possible) with the expanded training set

This is uncertainty sampling.
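A minimal sketch of this loop is shown below. It assumes synthetic tabular data in place of photos, a logistic regression in place of an image classifier, and the known labels `y` standing in for the human annotators; the pool sizes and number of rounds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Start with a small labelled pool; everything else is treated as unlabelled.
rng = np.random.default_rng(42)
labelled = rng.choice(len(X), size=20, replace=False)
unlabelled = np.setdiff1d(np.arange(len(X)), labelled)

clf = LogisticRegression(max_iter=1000)
for _ in range(5):
    clf.fit(X[labelled], y[labelled])
    proba = clf.predict_proba(X[unlabelled])[:, 1]
    # Query the 10 instances whose predicted probability is closest to 50%.
    query = unlabelled[np.argsort(np.abs(proba - 0.5))[:10]]
    # "Human labelling" is simulated by revealing the true labels y[query].
    labelled = np.concatenate([labelled, query])
    unlabelled = np.setdiff1d(unlabelled, query)

clf.fit(X[labelled], y[labelled])  # final retrain on the expanded set
print("labelled instances:", len(labelled))
```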

## Exercise 7

What is the difference between anomaly detection and novelty detection?

---

Novelty detection assumes that there are no outliers in your training set; anomaly detection assumes the training set is mixed.
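The distinction shows up directly in some APIs. As a sketch, scikit-learn's `LocalOutlierFactor` can be used in either mode via its `novelty` parameter; the synthetic data and `n_neighbors=20` below are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(200, 2))                 # "normal" instances
X_outliers = rng.uniform(low=6, high=8, size=(5, 2))  # far from the clean data

# Anomaly (outlier) detection: the training set itself is mixed,
# and the outliers are looked for inside it.
X_mixed = np.vstack([X_clean, X_outliers])
mixed_pred = LocalOutlierFactor(n_neighbors=20).fit_predict(X_mixed)

# Novelty detection: fit on a clean training set, then score unseen instances.
novelty_lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_clean)
new_pred = novelty_lof.predict(X_outliers)

# In both cases -1 means outlier/novelty and 1 means inlier.
print("flagged in mixed training set:", int(np.sum(mixed_pred == -1)))
print("new instances flagged as novel:", int(np.sum(new_pred == -1)))
```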

## Exercise 8

What is a Gaussian mixture? What tasks can you use it for?

---

A Gaussian mixture is a model where new instances are sampled from a random one of $k$ fixed Gaussian distributions (the different distributions are not necessarily equally likely to be selected - the probabilities are determined by weights).

You can use Gaussian mixtures for:
- Clustering
- Anomaly/novelty detection
- Data augmentation (since the model is generative)
- Density estimation
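Each of these uses maps onto a method of scikit-learn's `GaussianMixture`, as the sketch below suggests. The three-blob data and the 2% density threshold for flagging anomalies are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

centers = [[-5, 0], [0, 5], [5, 0]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=42)

gm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# Clustering: hard assignments and soft (per-component) probabilities.
labels = gm.predict(X)
probs = gm.predict_proba(X)

# Density estimation: log of the estimated density at each instance.
log_density = gm.score_samples(X)

# Anomaly detection: flag instances in the lowest-density 2% (arbitrary cut-off).
threshold = np.percentile(log_density, 2)
anomalies = X[log_density < threshold]

# Generation: sample new instances (and their component indices) from the model.
X_new, y_new = gm.sample(100)

print("component weights:", gm.weights_.round(2))
```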

## Exercise 9

Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?

---

1. You could minimise a theoretical information criterion such as AIC or BIC
2. You could apply a Bayesian Gaussian mixture model with a large number of clusters - these tend to set weights for unnecessary clusters equal to, or close to, zero (you may want to use this as a guide to the optimal number of clusters and retrain a standard Gaussian mixture model).
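Both techniques can be sketched with scikit-learn as follows. The synthetic three-blob data, the candidate range, the upper bound of 10 components and the 0.01 weight cut-off are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

centers = [[-6, 0], [0, 6], [6, 0]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=42)

# Technique 1: fit one model per candidate k and keep the lowest BIC.
ks = list(range(1, 8))
bics = [
    GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X).bic(X)
    for k in ks
]
best_k_bic = ks[int(np.argmin(bics))]

# Technique 2: give a Bayesian mixture more components than needed and
# count those that keep a non-negligible weight (0.01 is an arbitrary cut-off).
bgm = BayesianGaussianMixture(
    n_components=10, n_init=5, max_iter=500, random_state=42
).fit(X)
n_active = int(np.sum(bgm.weights_ > 0.01))

print("best k by BIC:", best_k_bic)
print("Bayesian GMM weights:", bgm.weights_.round(2))
```

Swapping `.bic(X)` for `.aic(X)` gives the AIC variant; BIC penalises extra parameters more heavily, so it tends to choose simpler models.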