# Lab: Clustering using BFR
Data Mining 2021/2022  
Jordi Smit and Gosia Migut  
Revised by Bianca Cosma

**WHAT** This _optional_ lab consists of several programming exercises and insight questions. These exercises are meant to let you practice with the theory covered in: [Chapter 7][1] from "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, J. D. Ullman.

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam. 

**HOW** Follow the exercises in this notebook either on your own or with a friend. Use [StackOverflow][2] to discuss the questions with your peers. For additional questions and feedback please consult the TAs during the assigned lab session. The answers to these exercises will not be provided.
 
[1]: http://infolab.stanford.edu/~ullman/mmds/ch7.pdf
[2]: https://stackoverflow.com/c/tud-cs/questions

#### Summary
In this exercise you will implement the BFR algorithm, a clustering algorithm designed for very large datasets that don't fit into memory. We will simulate the lack of memory by dividing the data into a list of lists, where each sub-list is a separate batch that has 'supposedly' been read from disk or some other storage server.
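
As a minimal sketch of what such a list of lists looks like, the snippet below splits a small array into batches. The function name `split_into_chunks` is hypothetical and only meant for illustration; the lab itself uses the provided `load_data` helper, whose implementation may differ.

```python
import numpy as np


def split_into_chunks(points, chunk_size):
    """Split a sequence of points into consecutive batches of at most chunk_size items."""
    return [points[i:i + chunk_size] for i in range(0, len(points), chunk_size)]


# Ten 2-dimensional points, processed three at a time.
points = np.arange(20, dtype=float).reshape(10, 2)
chunks = split_into_chunks(points, chunk_size=3)

print(len(chunks))                # 4
print([len(c) for c in chunks])   # [3, 3, 3, 1]
```

The last batch is simply shorter when the number of points is not a multiple of the chunk size.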

In [None]:
from uuid import UUID
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import sys
import uuid

## Exercise: The BFR algorithm
K-means and Hierarchical Clustering are two very well-known clustering algorithms. However, both work only if the entire data set fits in main memory, which puts an upper limit on the amount of data they can cluster. To go beyond this limit we need an algorithm that doesn't need the entire data set to be in main memory. In this exercise we will look at the approach of the BFR algorithm.

BFR works by summarizing the data in each cluster with a few statistics: the number of data points, their sum and their sum of squares. The algorithm uses three sets that contain cluster summaries:
- **Discard Set**:
Contains the summaries of the data points that are *close enough* (we'll define this later on) to one of the main clusters.
- **Compressed Set** (also known as the set of *miniclusters*):
Contains the summaries of the data points that are not *close enough* to one of the main clusters, but form *miniclusters* with other points that are not *close enough* to one of the main clusters.
- **Retained Set**: 
Contains data points that are not *close enough* to one of the main clusters and not *close enough* to one of the *miniclusters* (these are not summaries, but individual data points).
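
These summaries are sufficient because they are additive: the summary of a merged cluster is the element-wise sum of the summaries being merged, so individual points never need to be revisited. A small sketch with plain NumPy follows; the `summarize` and `merge` helpers are hypothetical and not part of `bfr_helper.py`.

```python
import numpy as np


def summarize(points):
    """Return (N, SUM, SUMSQ) for an array of shape (n_points, n_dims)."""
    return len(points), points.sum(axis=0), (points ** 2).sum(axis=0)


def merge(a, b):
    """Merge two summaries by adding their components."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


group_a = np.array([[1.0, 2.0], [3.0, 4.0]])
group_b = np.array([[5.0, 6.0]])

# Merging the two summaries gives the same result as summarizing all points at once.
merged = merge(summarize(group_a), summarize(group_b))
direct = summarize(np.vstack([group_a, group_b]))

print(merged[0], merged[1], merged[2])  # 3 [ 9. 12.] [35. 56.]
```

Adding a single data point to a cluster is the special case where one of the two summaries describes just that point.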

**BFR steps:** (as outlined in this exercise)
1. BFR uses the first chunk of data to find the $k$ main clusters and summarizes them in **Discard Set**. Then it loops through the remaining chunks of data. 
2. For each data point in the remaining chunks, it checks whether the data point is *close enough* to a cluster summary in the **Discard Set**. If it is, the point is added to that cluster summary. If not, it is added to the **Retained Set**. 
3. After we have gone through all the data points in a chunk, we check whether we can find any new *miniclusters* by combining the points in the **Retained Set**, using a traditional clustering method. All the new non-singleton clusters are summarized and added to the **Compressed Set**, while all the singleton clusters stay in the **Retained Set**. 
4. Before we continue to the next chunk, we have to check that we don't have too many *miniclusters* in the **Compressed Set**. We can reduce the number of *miniclusters* by combining them through clustering. 
5. After we have gone through all the data, we end up with $k$ main clusters, $m$ *miniclusters* and $n$ retained data points. Because we only want $k$ clusters, we need to combine all of them, which can also be done using traditional clustering.


After we have done all this, we end up with $k$ cluster summaries, which can be used to assign future data to the closest clusters.
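
As an illustration of that last point, a new data point can be assigned using only the summaries, since each centroid is the sum divided by the number of points. The sketch below uses Euclidean distance to the centroid as one possible choice; the summaries and the `closest_cluster` helper are made up for the example.

```python
import numpy as np

# Hypothetical summaries (N, SUM) of k = 3 clusters in 2 dimensions.
summaries = [
    (4, np.array([8.0, 8.0])),     # centroid (2, 2)
    (5, np.array([-10.0, 5.0])),   # centroid (-2, 1)
    (2, np.array([-4.0, -4.0])),   # centroid (-2, -2)
]


def closest_cluster(point, summaries):
    """Return the index of the cluster whose centroid is nearest to point."""
    centroids = np.array([s / n for n, s in summaries])
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(distances))


print(closest_cluster(np.array([1.5, 2.5]), summaries))    # 0
print(closest_cluster(np.array([-1.0, -3.0]), summaries))  # 2
```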

If you are looking for a more detailed explanation, see [this online video lecture](https://www.youtube.com/watch?v=NP1Zk8MY08k) from the authors of the book or read the corresponding section of the book.

### Step 1: Setup
Let's get started by creating the data structures for this problem. First of all, we need to create a class for a `DataPoint`. This class stores the location of a data point and the ID of the cluster to which the point has been assigned. We also define a function which can convert this data point to a singleton `BFRCluster`.

In [1]:
class DataPoint(object):
    """
    A data point that can be clustered.
    """
    
    def __init__(self, vector):
        self.vector = vector
        self.cluster_id = None

    def to_singleton_cluster(self):
        """
        Returns:
        Cluster: A cluster with a single data point.
        """
        sum_v = self.vector
        squared_sum = sum_v ** 2
        n_data_points = 1
        self.cluster_id = uuid.uuid4()
        return BFRCluster(sum_v, squared_sum, n_data_points, set([self.cluster_id]))

    def __repr__(self):
        return f"DataPoint(vector: {self.vector}, cluster_id: {self.cluster_id})"

In the next cell we import some helper functions we have already created for you:
 - `load_data`;
 - `hierarchical_clustering`.

You can read their documentation using Python's `help` function, as shown below, or look at their implementation in `bfr_helper.py`.

In [2]:
from bfr_helper import hierarchical_clustering
from bfr_helper import load_data

# help(hierarchical_clustering)
# help(load_data)

### Step 2: Create BFR clusters

Next let's create a class for the BFR cluster. This class must store both the statistical summaries of the data and be usable with hierarchical clustering. All the hierarchical clustering related logic has already been implemented in its parent class `Cluster`. You can read its documentation using `help(Cluster)` or see its implementation in `bfr_helper.py`.

However, the statistical summary and BFR related logic must still be implemented. **Now it is your job to**:
 - Define the `mean` attribute;
 - Define the `variance` attribute;
 - Define the `std` attribute;
 - Finish the `is_data_point_sufficiently_close` method, used to determine if a `DataPoint` is close enough to be added to the discard set;
 - Finish the `mahalanobis_distance` method, the distance measure used by the `is_data_point_sufficiently_close` function.

We consider a `DataPoint` close enough if $MD < 3 \cdot std_i$ for at least one $i$, where $i$ is the axis index, $MD$ is the *Mahalanobis distance* and $std_i$ is the standard deviation along axis $i$.
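
Because the condition only has to hold for at least one axis, it is equivalent to comparing $MD$ against three times the largest per-axis standard deviation. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
import numpy as np


def is_close_enough(md, std):
    """True if md < 3 * std_i holds for at least one axis i."""
    return bool(np.any(md < 3 * std))


std = np.array([1.0, 2.0])

print(is_close_enough(5.0, std))  # True: 5 < 3 * 2 on the second axis
print(is_close_enough(6.5, std))  # False: 6.5 >= 3 and 6.5 >= 6
# Equivalent formulation using the largest standard deviation.
print(5.0 < 3 * std.max())        # True
```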

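**Hint:** You may find the following formulas useful, where $N$ is the number of data points in the cluster, $d$ is the number of dimensions, and $SUM_i$ and $SUMSQ_i$ are the sum and the sum of squares along axis $i$:
 $$\bar{x_i} = \frac{SUM_i}{N}$$
 
 $${\sigma_i}^2 = \frac{SUMSQ_i}{N} - \bar{x_i}^2$$
 
 $$MD = \sum_{i=1}^{d} {\left(\frac{x_i - \bar{x_i}}{\sigma_i}\right)^2}$$

The tests in this lab expect this sum as is, without taking a square root.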
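
To see how the formulas fit together, here is a small worked example on plain NumPy arrays rather than on the `BFRCluster` class. The four points are made up so that the numbers come out round; the sum in $MD$ runs over the axes.

```python
import numpy as np

points = np.array([[1.0, 0.0], [3.0, 4.0], [1.0, 0.0], [3.0, 4.0]])

n = len(points)                       # N = 4
sum_v = points.sum(axis=0)            # SUM = [8, 8]
sumsq = (points ** 2).sum(axis=0)     # SUMSQ = [20, 32]

mean = sum_v / n                      # [2, 2]
variance = sumsq / n - mean ** 2      # [1, 4]
std = np.sqrt(variance)               # [1, 2]

# Distance of a new point to the centroid, normalized per axis.
x = np.array([3.0, 6.0])
md = np.sum(((x - mean) / std) ** 2)  # 1 + 4 = 5

print(mean, variance, std, md)
```

Note that an axis with zero variance would lead to a division by zero here; this sketch does not handle that case.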

In [None]:
from bfr_helper import Cluster
# Uncomment the line below if you want to read the documentation
# help(Cluster)

In [None]:
class BFRCluster(Cluster):
    """
    A summary of multiple data points.
    """
    def __init__(self, sum_v, squared_sum, n_data_points, cluster_ids):
        # START ANSWER
        # END ANSWER
        
        super().__init__(sum_v, squared_sum, n_data_points, cluster_ids, mean, variance, std)
        
    def is_singleton(self):
        """
        Returns:
        bool: True if the cluster only has a single data point, false otherwise.
        """
        return self.n_data_points == 1

    def mahalanobis_distance(self, dp):
        """
        Parameters:
        dp: DataPoint: The DataPoint we are interested in.

        Returns:
        float: The mahalanobis distance between the centroid of this cluster and the given data point.
        """
        # START ANSWER
        # END ANSWER
    
    def is_data_point_sufficiently_close(self, dp):
        """
        Parameters:
        dp: DataPoint: The DataPoint we are interested in.

        Returns:
        bool: True if the mahalanobis distance is less than 3 times the std on at least one axis, false otherwise.
        """
        # START ANSWER
        # END ANSWER
        return False

Run the code below to verify that the functions were implemented correctly:

In [None]:
np.random.seed(42)
# Initialize 3 random data points in a 2-dimensional space.
v = np.random.rand(3,2)
cluster = BFRCluster(np.sum(v, axis=0, keepdims=True), np.sum(v ** 2, axis=0, keepdims=True), len(v), [uuid.uuid4()])

# Check that the mean is implemented correctly.
assert cluster.mean.shape == (1,2)
assert np.all(np.isclose(cluster.mean[0], [0.4208509, 0.56845577], atol=0.0001))

# Check that the variance is implemented correctly.
assert cluster.variance.shape == (1,2)
assert np.all(np.isclose(cluster.variance[0], [0.0563636, 0.10571936], atol=0.0001))

# Check that the std is implemented correctly.
assert cluster.std.shape == (1,2)
assert np.all(np.isclose(cluster.std[0], [0.2374102, 0.32514513], atol=0.0001))

# Check that mahalanobis_distance is implemented correctly.
dp = DataPoint(np.random.rand(1,2))
assert np.isclose(cluster.mahalanobis_distance(dp), 3.1732638628025542, atol=0.0001)

inpoint = DataPoint(cluster.mean)
outpoint = DataPoint(2 * cluster.mean)

# Check that is_data_point_sufficiently_close is implemented correctly.
assert cluster.is_data_point_sufficiently_close(inpoint)
assert not cluster.is_data_point_sufficiently_close(outpoint)

### Step 3: Implement the BFR algorithm

In this section we'll use the previously defined data structures and functions to create the BFR algorithm. Let's get started by defining the `find_index_sufficiently_close_cluster` function. This function needs to return the index of the **first** cluster in a list that is found to be sufficiently close. If no cluster is close enough, it should return `None`. 

We will later use this function when iterating over the chunks of data points, to check if a data point, `dp`, is sufficiently close to one of the $k$ cluster summaries in the discard set, `k_clusters`.<br>
**Hint:** We have already defined a function which checks if the point is close enough to some cluster.
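
The "first match or `None`" pattern is common in Python. The sketch below shows it with a generic predicate (`is_even` is just a stand-in, and `first_index_where` is a hypothetical name) so as not to give away the exercise; the same shape applies when the predicate is a closeness check.

```python
def first_index_where(items, predicate):
    """Return the index of the first item satisfying predicate, or None."""
    for index, item in enumerate(items):
        if predicate(item):
            return index
    return None


def is_even(number):
    return number % 2 == 0


print(first_index_where([3, 5, 8, 10], is_even))  # 2
print(first_index_where([3, 5, 7], is_even))      # None
```

A valid result can be index `0`, which is falsy, so callers should compare the result against `None` explicitly rather than relying on truthiness.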

In [None]:
def find_index_sufficiently_close_cluster(k_clusters, dp):
    """
    Finds the index of the first sufficiently close cluster from the given list of k clusters.

    Parameters:
    k_clusters: List[Cluster]: A list of k clusters.
    dp: DataPoint: The data point we are interested in.

    Returns:
    Optional[int]: The index of the first sufficiently close cluster in the list. 
                   Returns None if no cluster is sufficiently close.

    """
    # START ANSWER
    # END ANSWER
    return None

These are the hyperparameters of the algorithm:

 - `chunk_size`: the number of data points we can hold in memory at once (the size of one chunk);
 - `k`: the final number of clusters we want;
 - `num_discard`: the number of discard clusters we'll have in the algorithm;
 - `num_new_mini`: the number of new *miniclusters* we can add during one run (i.e., how many clusters we want to get after clustering the points in the retained set);
 - `num_mini`: the number of *miniclusters* we keep between memory scans (i.e., the size of the compressed set between runs).

In [None]:
# This path might be different on your local machine.
file_path = "data/cluster.txt"

# Algorithm hyperparameters.
chunk_size = 35
k = 3
num_discard = 3
num_new_mini = 25
num_mini = 25

data = load_data(file_path, chunk_size, create_data_point_func=DataPoint)

In the cell below we'll implement the BFR algorithm.

- For the first chunk:
    - Fill the discard set with `num_discard` clusters, using traditional clustering of the data points in the first chunk. Note that the hierarchical clustering method defined in `bfr_helper.py` takes a list of clusters as input, so we will first have to transform the data points into singleton clusters.
- For each of the remaining chunks:
    - For each data point in a chunk:
        - If the data point is sufficiently close to a cluster in the discard set, then add it to the summary of that cluster;
        - If the data point is not sufficiently close to any cluster in the discard set, then add it to the retained set as a singleton BFR cluster.
    - Combine each singleton cluster in the retained set with the singleton clusters that are closest to it, using a traditional clustering method. Add the new non-singleton *miniclusters* to the compressed set. Keep the remaining singleton clusters in the retained set.
    - If the compressed set is too large, apply traditional clustering on the summaries in the set until you get `num_mini` clusters.
- After iterating through all chunks:
    - Combine the discard, compressed and retained sets into the desired number of clusters, `k`.

**Hints:**
 - You can combine the clusters that are closest to each other using `hierarchical_clustering`;
 - Carefully look at the functions we have defined in the previous part. Most of the logic is already defined there.
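
One recurring step in the outline above is splitting a list of clusters into singletons (which stay in the retained set) and non-singletons (which move to the compressed set). A minimal sketch of that split follows, using a hypothetical `ToyCluster` stand-in that only mimics the `n_data_points` attribute and `is_singleton` method of `BFRCluster`; the helper name `split_by_singleton` is also made up.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Stand-in for a cluster summary; only tracks the number of points."""
    n_data_points: int

    def is_singleton(self):
        return self.n_data_points == 1


def split_by_singleton(clusters):
    """Return (singletons, non_singletons), preserving the original order."""
    singletons = [c for c in clusters if c.is_singleton()]
    non_singletons = [c for c in clusters if not c.is_singleton()]
    return singletons, non_singletons


# Pretend this list is the output of a clustering step on the retained set.
clustered = [ToyCluster(1), ToyCluster(4), ToyCluster(1), ToyCluster(2)]
retained, new_miniclusters = split_by_singleton(clustered)

print(len(retained), len(new_miniclusters))  # 2 2
```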


In [None]:
discard = []
compressed = []
retained = []

for dp in data[0]:
    # Transform the data points in the first chunk into singleton clusters.
    # START ANSWER
    # END ANSWER
    
# Fill the discard set with num_discard clusters, using the singleton clusters determined before.
# START ANSWER
# END ANSWER

# Iterate over the remaining chunks.
for chunk in data[1:]:
    for dp in chunk:
        index_sufficiently_close_cluster = find_index_sufficiently_close_cluster(discard, dp)
        if index_sufficiently_close_cluster is not None:
            # Replace the sufficiently close cluster with the new cluster, formed by adding dp.
            # START ANSWER
            # END ANSWER
        else:
            # Transform the data point into a singleton cluster and add it to the retained set
            # START ANSWER
            # END ANSWER
    
    new_miniclusters = None
    
    # Find the new_miniclusters by clustering the singleton clusters in the retained set.
    # You can use hierarchical clustering to form num_new_mini clusters.
    # You should leave the remaining singleton clusters in the retained set, 
    # and add the non-singleton clusters to the compressed set.
    # START ANSWER
    # END ANSWER
    
    # Perform hierarchical clustering on the newly modified compressed set, 
    # so the number of miniclusters in the compressed set for the next iteration is num_mini.
    # START ANSWER
    # END ANSWER

# Combine the three sets.
combined_summaries = discard + compressed + retained

resulting_k_clusters = None
# Further combine the summaries until there are only k, using a traditional clustering method.
# START ANSWER 
# END ANSWER

# Check that the number of resulting clusters is correct.
assert len(resulting_k_clusters) == 3
# Check that the cluster means are determined correctly.
assert np.all(np.isclose(sorted(list(map(lambda cluster : cluster.mean[0], resulting_k_clusters)), key = lambda x : x[0]), 
                         [[-1.96876465, 1.4193697], [-1.9142869, -2.34254524], [2.00728041, 2.03337113]], 
                         atol=0.0001))

### Step 4: Apply the algorithm
And we are done! The only thing left to do is to look at the final result. Run the cell below to visualize the resulting clusters. The small dots are the data points, while the diamonds are the centroids of the clusters.

In [None]:
marker_dp = "."
marker_cluster = "D"
k = len(resulting_k_clusters)
colors = cm.rainbow(np.linspace(0,1,k))

# Plot the centroids of the clusters.
for i, cluster in enumerate(resulting_k_clusters):
    x = cluster.mean[:, 0]
    y = cluster.mean[:, 1]
    plt.scatter(x, y, marker=marker_cluster,  edgecolors='k', c=[colors[i]])

# Plot the assigned data.
for chunk in data:
    for dp in chunk:
        x = dp.vector[:, 0]
        y = dp.vector[:, 1]
        color = None
        for i, cluster in enumerate(resulting_k_clusters):
            if cluster.contains(dp):
                color = colors[i]
                break
        assert color is not None
        plt.scatter(x, y, marker=marker_dp, c=[color])

plt.show()

**Question 1**: This algorithm works under one major assumption. What is this assumption?

**Question 2**: What is the major disadvantage of this assumption?

**Question 3**: How many secondary memory passes does this algorithm have to make?

**Question 4**: Let's say we have a dataset with 3 clusters `A`, `B`, and `C`. What happens if the first chunk only has data from cluster `A`?