# K-Means Clustering

## Chapter 6

K-Means clustering is a form of unsupervised learning in which the goal is to discover clusters (groups) within the data that are not known ahead of time. There is no target variable; instead, we form clusters and then analyze them for whatever meaning we can derive.

In [1]:
from __future__ import annotations
from typing import TypeVar, Generic, List, Sequence
from copy import deepcopy
from functools import partial
from random import uniform
from statistics import mean, pstdev
from dataclasses import dataclass

## Function to Calculate Z-Scores

In [2]:
def zscores(original: Sequence[float]) -> List[float]:
    """
    Calculate the z-scores of a sequence.
    """
    avg: float = mean(original)
    std: float = pstdev(original)

    # Cannot divide by 0: with no variation, every z-score is 0
    if std == 0:
        return [0.0] * len(original)

    # Apply the z-score formula to normalize
    return [(x - avg) / std for x in original]

## Define a DataPoint Class

In [3]:
from math import sqrt
from typing import Iterable, Tuple

class DataPoint:
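    def __init__(self, initial: Iterable[float]) -> None:
        # Materialize once so a one-shot iterator is not consumed twice
        self._originals: Tuple[float, ...] = tuple(initial)
        # Values used for distance and equality; initially the same as the originals
        self.dimensions: Tuple[float, ...] = self._originals
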
    @property
    def num_dimensions(self) -> int:
        return len(self.dimensions)
    
    def distance(self, other: DataPoint) -> float:
        """
        Distance between one datapoint and another. Defined as the Euclidean distance.
        """
        combined: Iterable[Tuple[float, float]] = zip(self.dimensions, other.dimensions)
        differences: List[float] = [(x - y) ** 2 for x, y in combined]
        return sqrt(sum(differences))
    
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, DataPoint):
            return NotImplemented
        return self.dimensions == other.dimensions
    
    def __repr__(self) -> str:
        return repr(self._originals)

In [4]:
dp = DataPoint(initial=(9.8, 5.6, 11.2))
dp2 = DataPoint(initial=(11.3, 4.5, 5.4))

print(dp)

(9.8, 5.6, 11.2)


In [5]:
print(dp==dp2)

False


In [6]:
dp.distance(dp2)

6.090976933136423

In [7]:
dp2.distance(dp)

6.090976933136423

In [9]:
print(zscores(dp._originals))
print(zscores(dp2._originals))

[0.39223227027636814, -1.3728129459672884, 0.9805806756909196]

[1.4036791557351331, -0.8510495668630337, -0.5526295888720998]

## Normalization

K-Means is a distance-based algorithm, so every dimension of the data needs to be on a comparable scale; otherwise the dimension with the largest range dominates the distance. One way to achieve this is to replace each value with its z-score (also called the standard score), which subtracts the mean and then divides by the standard deviation.
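
As a rough illustration of why scale matters, the sketch below z-scores each dimension (column) across a small made-up data set and compares Euclidean distances before and after. The helper names `zscore_columns` and `euclidean` and the sample values are assumptions for illustration, not part of the notebook's own code.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import List, Sequence


def zscore_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Z-score each dimension (column) across all rows (hypothetical helper)."""
    normalized_columns: List[List[float]] = []
    for column in zip(*rows):
        avg = mean(column)
        std = pstdev(column)
        if std == 0:
            # No variation in this dimension: every value maps to 0
            normalized_columns.append([0.0] * len(column))
        else:
            normalized_columns.append([(x - avg) / std for x in column])
    # Transpose back so each inner list is one data point again
    return [list(row) for row in zip(*normalized_columns)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up data: (age in years, income in dollars)
people = [(25.0, 40_000.0), (60.0, 41_000.0), (26.0, 90_000.0)]
scaled = zscore_columns(people)

# Before scaling, income dominates: person 0 looks far closer to person 1
print(euclidean(people[0], people[1]), euclidean(people[0], people[2]))
# After scaling, the age gap counts about as much as the income gap
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

After scaling, each column has a mean of 0 and a population standard deviation of 1, so no single dimension decides which points count as "close".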

## K-Means Algorithm

1. Initialize "k" empty clusters.
2. Normalize all data points.
3. Create a random centroid for each cluster.
4. Assign each data point to the cluster of the centroid to which it is closest.
5. Recalculate each centroid as the mean of the data points assigned to it.
6. Repeat steps 4 and 5 until no data points are reassigned or until the maximum number of iterations is reached.
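
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: the function name `kmeans_sketch`, its parameters, and the sample data are hypothetical, points are plain tuples assumed to be normalized already (step 2), and a cluster that ends up empty simply keeps its previous centroid.

```python
from math import sqrt
from random import Random
from statistics import mean
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_sketch(points: Sequence[Point], k: int, max_iterations: int = 100,
                  seed: int = 0) -> Tuple[List[Point], List[int]]:
    """Return (centroids, assignments) for already-normalized points."""
    rng = Random(seed)  # seeded so the sketch is repeatable
    columns = list(zip(*points))
    # Step 3: one random centroid per cluster, drawn within each dimension's range
    centroids: List[Point] = [
        tuple(rng.uniform(min(c), max(c)) for c in columns) for _ in range(k)
    ]
    # Step 1: clusters start empty, so no point has a cluster index yet
    assignments: List[int] = [-1] * len(points)
    for _ in range(max_iterations):
        # Step 4: assign each point to the index of its nearest centroid
        new_assignments = [
            min(range(k), key=lambda i: distance(p, centroids[i])) for p in points
        ]
        # Step 6: stop once no point changes cluster
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 5: move each centroid to the mean of its members
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = tuple(mean(dim) for dim in zip(*members))
    return centroids, assignments


# Made-up, roughly normalized sample data
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans_sketch(data, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can lead to different final clusters; running the algorithm several times and comparing the results is a common precaution.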