<h1> kMeans and k Nearest Neighbours</h1>

This week, we are going to learn about and implement two loosely related machine-learning algorithms. In this lab, we'll help you to implement the *k-Means* algorithm, and then you'll work in pairs to implement *k-Nearest-Neighbours* (kNN). In both cases, you should try to not only write code that is correct, but also ensure your code is readable. The best way to do this is to stick to the unit style guide (available on Blackboard).


<h2> kMeans </h2>

Before we dive into code, we need to introduce some background ideas so we can understand what kMeans is trying to do and why we'd want to use it.

kMeans is an *unsupervised* *clustering* algorithm. A *clustering* algorithm tries to organise some data points into groups (or *clusters*). 

If we're given some 2-dimensional data points (i.e. each data point consists of two values, so we could write $d_i = (value_{1, i}, value_{2, i})$), then we can plot each data point on a plane. In the left-hand image below, we've done this for some sample data points. Hopefully it's clear that this data consists of three separate "groups", where each member of a group is much closer to the other members of that group than to the remaining points.

The right-hand plot shows the output when we use kMeans (with k=3) to assign each data point to a cluster.

<img src="https://github.com/engmaths/SEMT10002_2024/blob/main/img/cluster_example.png?raw=true" width="80%">



<h3> How it works </h3>

The basic idea behind kMeans is to assign each data point, $d_i$, to the cluster, $C_j$, which it is *closest* to. It does this by calculating the mean, $\mu_j$ (hence kMeans), of all data points assigned to each cluster. Once the means have been calculated, the algorithm loops through each data point, calculating the distance to each cluster mean and re-assigning the point to the cluster whose mean it is closest to.

However, to calculate the mean of a cluster, we need to have already assigned each of our data points to a cluster. But we can't assign a point to a cluster if we don't know where the mean is, so how do we get started?

Well, it turns out that we can start with *random* means for our clusters and then repeat a process of first assigning points to the (random) clusters and then re-calculating the mean points. Over time, the number of points which change cluster usually falls, until eventually the algorithm converges (i.e. we do an update and no points are re-assigned to different clusters). Convergence can take many iterations, and the clusters we end up with depend on the random starting means, so it's a good idea to set an upper limit on the number of iterations we want to run.

As a sequence of steps, we can write this down as:

1. Decide how many clusters we want (i.e. what value of k should we use). 
2. Randomly initialise K cluster means
3. Repeat until convergence
   
    a. Assign each data point to the cluster it is closest to
   
    b. Re-calculate the mean position of each cluster.

To calculate distance, kMeans uses the *Euclidean distance* metric. If you haven't heard of this before, the idea is that if we have two points we can draw a right-angled triangle between them, formed by the hypotenuse (i.e. the distance between the points), the horizontal separation, and the vertical separation. The horizontal ($x_2 - x_1$) and vertical ($y_2 - y_1$) separations can be calculated directly from our data points. The distance can then be calculated using Pythagoras' theorem: $d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}$.

<img src="https://github.com/engmaths/SEMT10002_2024/blob/main/img/euclidean_distance.png?raw=true" width="30%">

This is the situation when we have 2D data. If our data were 3D, then we would simply modify the equation to include an additional *separation* term (i.e. $(z_2-z_1)^2$). In fact, the same equation works for any number of dimensions. This is good, because data points in machine learning often have a very large number of dimensions!
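
As a rough illustration of how the same formula extends to any number of dimensions, here is a minimal sketch. The function name `euclidean_distance` is our own choice for this example, and it assumes each point is a tuple or list of numbers.

```python
import math

def euclidean_distance(point_a, point_b):
    """Return the Euclidean distance between two points of equal dimension."""
    if len(point_a) != len(point_b):
        raise ValueError("points must have the same number of dimensions")
    # Sum one squared separation term per dimension, then take the square root.
    squared_separations = [(b - a) ** 2 for a, b in zip(point_a, point_b)]
    return math.sqrt(sum(squared_separations))
```

For example, `euclidean_distance((0, 0), (3, 4))` gives the hypotenuse of a 3-4-5 triangle, and passing two 3-tuples adds the $(z_2-z_1)^2$ term without any change to the code.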

Let's see what happens when we run kMeans on the data plotted above. We start by choosing 3 random data points to be our initial cluster means (shown by the '+' in our plot).

<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/step1.png?raw=true" width="75%">

Next, we assign each data point to the cluster it is closest to.

<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/step2.png?raw=true" width="75%">

Next, we update our cluster means.

<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/step3.png?raw=true" width="75%">

Then we re-assign each data point to the closest cluster

<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/step4.png?raw=true" width="75%">

and again update our cluster means

<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/step5.png?raw=true" width="75%">

After two more iterations, we converge on a solution.

<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/step6.png?raw=true" width="75%">
<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/step7.png?raw=true" width="75%">
<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/step8.png?raw=true" width="75%">



<h3> Implementing kMeans</h3>

Our first task for today is to implement kMeans. You should use the file "kMeans.py" to write your code (available on Blackboard).

Before we can start implementing kMeans, we need a data set to apply it to. The file "heart_data.csv" contains data from 297 patients who suffered from heart failure. We'd like to run kMeans on this data to see how many different underlying causes led to heart failure. Our data set contains 13 pieces of information (or *features*, as we normally call them in machine learning) about each patient. We'll begin by just looking at two: the patient's age and the results of a test for the presence of creatinine phosphokinase.



<h4> Task 1 </h4>


Download the file "heart_data.csv" and write some code for reading the contents of the file. Then create a scatter plot of age against creatinine phosphokinase. Your plot should look like the image below.

<img src="https://github.com/engmaths/SEMT10002_2024/blob/main/img/heart_data_plot.png?raw=true" width="40%">
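
One possible way to approach Task 1 is sketched below. It assumes the CSV file has a header row and that the two columns are named `age` and `creatinine_phosphokinase`; check the file itself, as these names are a guess. The helper names `load_features` and `plot_features` are hypothetical.

```python
import csv

def load_features(path, feature_names):
    """Read the named columns from a CSV file as a list of float tuples."""
    with open(path, newline="") as csv_file:
        reader = csv.DictReader(csv_file)  # uses the header row for column names
        return [tuple(float(row[name]) for name in feature_names) for row in reader]

def plot_features(data, x_label, y_label):
    """Scatter-plot the first feature of each point against the second."""
    import matplotlib.pyplot as plt  # imported here so loading works without matplotlib
    xs = [point[0] for point in data]
    ys = [point[1] for point in data]
    plt.scatter(xs, ys)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.show()
```

You could then call `data = load_features("heart_data.csv", ["age", "creatinine_phosphokinase"])` followed by `plot_features(data, "age", "creatinine_phosphokinase")`. Swap the order of the features if your axes come out the other way round from the reference image.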

Now that we've got some data to process, we can start working on our function for implementing kMeans. At the highest level, the code needs to perform the following steps:

1. Initialise cluster means randomly
2. For some number of iterations, repeat:
    + Assign data to clusters
    + Update cluster means
3. Plot output

These steps map fairly cleanly into a set of functions, as we've shown below.

In [None]:
def kMeans(k, data, max_iterations = 100):

    # We need to randomly initialise the means (or centroids) before we can start.
    means = initialise_means(k, data)

    # kMeans can take a long time to settle, so we set a maximum number of iterations to stop our code from running for too long.
    for iteration in range(max_iterations):
        # We start by assigning data to clusters.
        clusters = assign_data_to_clusters(means, data)
        # Next we update our centroids.
        means = calculate_cluster_means(clusters)
        # Sometimes it's helpful to visualise each step of the algorithm: uncomment the line below to see that.
        # plot_clusters(clusters)

    # Once finished, let's plot the clusters.
    plot_clusters(clusters)
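
The skeleton above always runs for `max_iterations`. If you would like to stop early once the algorithm has converged, one option is to compare the means before and after each update. The helper below is a hypothetical sketch, not part of the provided file; it assumes the means are stored in the same order on every iteration and needs Python 3.8 or later for `math.dist`.

```python
import math

def has_converged(old_means, new_means, tolerance=1e-9):
    """Return True if no cluster mean moved by more than `tolerance`."""
    return all(
        math.dist(old, new) <= tolerance
        for old, new in zip(old_means, new_means)
    )
```

Inside the loop you could keep the result of `calculate_cluster_means(clusters)` in a separate variable, `break` if `has_converged(means, new_means)` is true, and otherwise copy it into `means`. If the means have not moved, the next assignment step would put every point in the same cluster as before.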


Now, we just need to define these functions. Let's implement each in turn.


<h4>Task 2: Initialising centroids</h4>

We need to randomly choose some starting cluster means (or centroids). A simple way to do this is to randomly select K data points and use their positions as the cluster means. We can do this by randomly generating K integers between 0 and one less than the size of our data set and using these to index into the data array.
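
A minimal sketch of this step is shown below, assuming `data` is a list of points (tuples). Note one deliberate difference from the description above: rather than generating K independent random integers, it uses `random.sample`, which draws indices without replacement so that the same data point is never picked twice.

```python
import random

def initialise_means(k, data):
    """Pick k distinct data points at random to act as the starting means."""
    if k > len(data):
        raise ValueError("k cannot be larger than the number of data points")
    # random.sample draws k indices without replacement, so no index repeats.
    indices = random.sample(range(len(data)), k)
    return [data[index] for index in indices]
```

If the data set contains duplicate points, two starting means could still coincide, so this is a convenience rather than a guarantee.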

<h4> Task 3: Assigning Data to Clusters</h4>


To assign a data point to a cluster, we need to find which cluster mean it is closest to. The obvious thing to do here is to loop through the data, and for each point calculate the distance to each centroid. We then add the data point to the cluster it is closest to.
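
The loop described above might look something like the following sketch. It assumes that clusters are represented as a list of lists, with one inner list of points per mean; this representation is our assumption, and you are free to choose another. `math.dist` (Python 3.8 or later) computes the Euclidean distance between two points.

```python
import math

def assign_data_to_clusters(means, data):
    """Group data points by the index of their nearest mean."""
    clusters = [[] for _ in means]  # one (initially empty) list per cluster
    for point in data:
        distances = [math.dist(point, mean) for mean in means]
        nearest = distances.index(min(distances))  # ties go to the lowest index
        clusters[nearest].append(point)
    return clusters
```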

<h4> Task 4: Calculating cluster means </h4>

To calculate the mean position of each cluster, we need to calculate the mean of each feature for every point in the cluster. In other words, the mean position of cluster 1 is given by (mean_of_cluster_1_first_feature, mean_of_cluster_1_second_feature, ...). We can do this by creating a slice for each feature of the data in cluster 1 and calculating the mean.
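
Here is one way this could be written, assuming each cluster is a list of equal-length tuples. Instead of explicit slices, it uses `zip(*cluster)` to regroup the points by feature. How to handle a cluster that ends up with no points is a design decision; this sketch simply drops it, which is an assumption on our part and reduces the number of clusters.

```python
def calculate_cluster_means(clusters):
    """Return the mean position of each non-empty cluster."""
    means = []
    for cluster in clusters:
        if not cluster:
            continue  # assumption: clusters with no points are simply dropped
        # zip(*cluster) regroups the points so each tuple holds one feature.
        means.append(tuple(sum(values) / len(values) for values in zip(*cluster)))
    return means
```

Because the mean is computed per feature, the same function works unchanged for 2D, 3D, or higher-dimensional points.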

<h4> Task 5: Plotting</h4>

To plot the data, we can use a matplotlib scatter plot. The only complication is that we would like to plot each cluster in a different color. You can do this by using the 'color' keyword when you call the scatter function. To plot each cluster, we need to loop over all clusters, for each one plotting the data points it contains with a different color.
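
A possible sketch of the plotting step follows. It assumes 2D points and a list-of-lists cluster representation; the helper `cluster_columns` and the particular colour names are our own choices for illustration.

```python
def cluster_columns(cluster):
    """Split a cluster of 2D points into a list of x values and a list of y values."""
    xs = [point[0] for point in cluster]
    ys = [point[1] for point in cluster]
    return xs, ys

def plot_clusters(clusters, colours=("red", "blue", "green", "orange", "purple")):
    """Draw each cluster in its own colour on one scatter plot."""
    import matplotlib.pyplot as plt  # imported here so the helper above needs no extras
    for index, cluster in enumerate(clusters):
        xs, ys = cluster_columns(cluster)
        # Cycle through the colour list if there are more clusters than colours.
        plt.scatter(xs, ys, color=colours[index % len(colours)])
    plt.show()
```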

Once you've implemented all the functions, your kMeans implementation is finished. Try running it on the data set and see what happens. Note that because of the random initialisation, you may get slightly different results each time. What happens as you vary the value of K?

Below are the plots I get when I run with K=2 and K=3. Your output may be slightly different.

 <img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/kmeans_2.png?raw=true" width="35%">
<img src="https://github.com/engmaths/SEMT10002_2025/blob/main/media/week_13/kmeans_3.png?raw=true" width="35%">

<h4> Bonus exercises </h4>

These exercises are included as optional extras- if you've finished both k_means and k_nn, *and* you're loving CPA so much you want to do some extra work, then please attempt these two exercises.

**Exercise 1** 

kMeans is a commonly used machine learning algorithm. Although we've implemented it ourselves here, there are many libraries which contain their own (optimised) implementations, and conventionally we'd use one of those libraries rather than writing it ourselves. One such library is scikit-learn, a popular machine learning library for Python. The documentation for their implementation of kMeans is <a href=https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html>here</a>. Try installing scikit-learn (with pip3) and running their version of kMeans. How do the results compare to our implementation?

**Exercise 2**

Applying kMeans to 2D data is slightly overkill, as we can often immediately 'see' where the cluster boundaries should lie when we plot the data. With 3D data, this is much harder (but still possible). Beyond 3D, however, it's essentially impossible. Can you edit your code to use all features in the data set, rather than just 2? Try to write code that is "abstracted", i.e. rather than hardcoding the number of features, can you write code that works for any number of features?

<h2> k-Nearest-Neighbours</h2>

k-Nearest-Neighbours (kNN) is a *supervised* *classification* algorithm. A classification algorithm tries to identify which of N classes a data point belongs to. For example, given some data on a student's performance in their year 1 assessments, we might want to classify them into those likely to drop out and those likely to complete the course. 

kNN is a *supervised* algorithm, so has access to some data (typically called *training data*) which contains both our *features* and the *true classes*. Given a new data point (typically called *test data*), it then predicts a class for the new point based on the new data point's features. kNN uses a simple heuristic to do this: it predicts the modal (most common) class among the k closest points in the training data.

<img src="https://github.com/engmaths/SEMT10002_2024/blob/main/img/KnnClassification.svg?raw=true" width="40%">

In the image above, the green circle represents a test point. If K=3, then the neighbours would be two red triangles and a blue square, so we would predict that the test point is a red triangle. If, however, we used K=5, then we'd have two additional blue squares, so we'd predict that the test point is a blue square.


<h3> Implementing kNN</h3>

The algorithm for implementing kNN is relatively short and sweet. Given a test point, we calculate the Euclidean distance to every point in our training data. We then find the K closest points, and return the modal class of those points.
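
The description above can be sketched roughly as follows. The function name `knn_predict` and its argument layout (features and labels held in two parallel lists) are assumptions made for this example. If two classes tie for the most common, this version returns whichever of them appears first among the nearest neighbours.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, test_point, k):
    """Predict a class for test_point from its k nearest training points."""
    # Pair each training point's distance to the test point with its label.
    distances = [
        (math.dist(features, test_point), label)
        for features, label in zip(train_features, train_labels)
    ]
    # Sort by distance only, then keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The prediction is the most common (modal) label among those neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]
```

With two "red" points at distance 1 from the test point and three "blue" points further away, K=3 predicts "red" and K=5 predicts "blue", mirroring the picture above.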

The file heart_data.csv (i.e. the same data as we used for kMeans) also contains a feature that records whether a patient survived or not ("DEATH_EVENT" in the file). For this exercise, you need to implement kNN and use it to predict whether a patient is likely to survive or not. You should base your decision on three features of the data set- *age*, *creatinine_phosphokinase*, and *ejection_fraction*. The data file contains 297 patient records- you should use the first 250 records as your training data and the final 47 as the *test* data. 
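
Splitting the records into training and test sets can be done with list slicing. A small sketch, assuming `records` is a list with one entry per patient in file order (the name `split_train_test` is hypothetical):

```python
def split_train_test(records, n_train):
    """Use the first n_train records for training and the rest for testing."""
    return records[:n_train], records[n_train:]
```

Calling `split_train_test(records, 250)` on 297 records leaves 47 for testing.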

Once you have implemented kNN, I'd like you to explore which value of k produces the best predictions. To do this, you should call your function with values of K in the range 1-10, calculating the mean squared error for each value of K. As a reminder, the mean squared error is defined by the formula:

$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_{predicted, i} - Y_{actual, i})^2$

where $Y_{predicted, i}$ is your prediction for the ith test point and $Y_{actual, i}$ is the actual value for the ith test point (i.e. what the file heart_data.csv actually contains). 
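
The formula translates fairly directly into code. A minimal sketch, assuming the predictions and actual values are two lists of numbers of the same length (for example 0s and 1s for "DEATH_EVENT"):

```python
def mean_squared_error(predicted, actual):
    """Return the mean of the squared differences between two equal-length lists."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)
```

When both lists contain only 0s and 1s, each squared error is either 0 or 1, so the result is simply the fraction of test points that were predicted incorrectly.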