# k Means Clustering
## DS-3001: Machine Learning 1

Content adapted from Terence Johnson (UVA)

**Notebook Summary**: In this notebook, we introduce the concept of clustering as an unsupervised learning task. In clustering, we are interested in discovering some underlying groups (clusters) in our data that we do not have information about beforehand. In other words, we don't know how many groups we have or necessarily what the groups represent. In this notebook we specifically look at KMeans clustering and how we can use it to find underlying structure in Virginia Electricity Sales data. We discuss how the KMeans algorithm operates, how changing the value of k changes our results, and how we can identify the best value of k for our data.

#### Setting Up Our Environment

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os # For changing directory

# To mount your google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path_to_DS_3001_folder = '/content/drive/MyDrive/DS-3001/02_Intro_to_ML_Algorithms'
# path_to_DS_3001_folder = ''

# Update the path to your folder for the class
# Where you stored the data from the previous notebook
os.chdir(path_to_DS_3001_folder)

## Clustering

- Last week, we discussed the k Nearest Neighbor method for *classification* (predicting categorical values for new data) and *regression* (predicting numeric values for new data).
- The cases of classification and regression are examples of **supervised learning**. We know what success looks like because we have the true answers for the training data.
- Today, we're looking at an **unsupervised learning** algorithm where we don't necessarily know what success looks like. Instead, we're looking for general patterns in the data without defining what success looks like in advance.
- The main algorithm we're looking at for today is *k-means clustering*.

## Unsupervised Learning

- In **unsupervised learning**, we do not have an outcome variable $y$ for each observation. There is not a single outcome that we are trying to predict, but rather we are looking for meaningful patterns within the data.
- The idea is that there is a *latent structure* with a discrete characteristic in the data we are trying to recover.

## Examples of Data Where Unsupervised Learning is Applicable

- When we want to detect anomalies in some financial patterns.
  - Ex. Identifying collusive bidders vs competitive bidders in an auction.
  - Ex. Identifying fraud in banking activity.

- Identifying different communities from interaction data.

- Identifying different strains of disease based on patient outcomes.

## Data for Today: VA Electricity Sales

- Today we are going to look at data that describes the electricity consumption in Virginia.
- We are interested in finding some underlying structure in the data given two variables:
  - **price:** The price of electricity. The units are cents per kilowatthour.
  - **sales:** The amount of electricity sold. The units are million kilowatthours.

In [None]:
el_df = pd.read_csv('./data/electricity_data_validation.csv')
el_df.head()

In [None]:
# Clean the price and sales variables


In [None]:
# Look at a comparison between price and sales
# Does there seem to be an underlying relationship?

sns.scatterplot(
  x = el_df['price'],
  y = el_df['sales']
)
plt.title('Comparison Between Price and Sales')
plt.xlabel('Price (cents / kWh)')
plt.ylabel('Sales (million kWh)')
plt.show()

**Question:** Looking at the plot above, are there clear clusters in the data? How many clusters do you think there are?

## Unsupervised Learning Continued

- We are interested in the following question: *can we identify some label / group for each observation without initially knowing which group each point / observation belongs to?*
- For this problem, we use the same idea we used for KNN.
  - If two points were created by the same data generating process/cluster, then their values are probably close together.
  - If there are a discrete number of distinct data generating processes, we should be able to recover their values by looking at separation between the groups.

In [None]:
# Quick numeric example with data generation processes

# Define different means and standard deviations
# for the different generation processes
mean1 = 0
mean2 = 10
std1 = 1
std2 = 3

n_per_k = 100

# Generate data using 2 different data generation processes
np.random.seed(2302)
x1 = np.random.normal(mean1, std1, n_per_k)
y1 = np.random.normal(mean1, std1, n_per_k)
x2 = np.random.normal(mean2, std2, n_per_k)
y2 = np.random.normal(mean2, std2, n_per_k)

# Combine data into a single data set
x = np.vstack([x1, x2]).flatten()
y = np.vstack([y1, y2]).flatten()

# Plot the data now without labels
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Comparison of Data with\nDifferent Data Generation Processes')
plt.show()

# Plot the example with true labels
plt.scatter(x[:n_per_k], y[:n_per_k], color = 'dodgerblue', label = 'Data Generation Process 1')
plt.scatter(x[n_per_k:], y[n_per_k:], color = 'firebrick', label = 'Data Generation Process 2')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Comparison of Data with\nDifferent Data Generation Processes')
plt.legend()
plt.show()

## k-Means Clustering Algorithm (k-MC)

- The k-Means Clustering algorithm is defined as follows:

  0. (Initialization) Randomly select $k$ points to be the *centroids*, $\{c_1, c_2, ..., c_k\}$
  1. Find the distance of each observation $x_i$ to each centroid $c_j$
  2. Assign each point $i$ to the closest centroid $j$
  3. Compute the new value of each centroid $j$ as the average of all of the observations $i$ assigned to it
  4. (Convergence) Repeat steps 1-3 until the observations are assigned to the same centroids twice (no data point switches centroid assignment), or a maximum number of iterations is reached.

- k-MC is one of many clustering algorithms. What is nice about it is that, despite it being computationally intensive to calculate all these distances, it does scale to very large datasets. Likewise, a lot of the steps in using k-MC appear in other, similar algorithms (e.g. scree plot for spectral clustering)
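
The steps above can be sketched directly in NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans_sketch`, the toy data, and the choice to draw the starting centroids from the observations themselves are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Step 0: pick k distinct observations as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Step 1: distance from every observation to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2: assign each observation to its closest centroid
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no observation changes its assignment
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned observations
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data (assumed for illustration): two well-separated groups
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
centroids, labels = kmeans_sketch(X_toy, k=2)
print(centroids)
```

On data this cleanly separated, the two centroids should land near the two group means. Real data is rarely this tidy.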


#### Distance and Scaling

* Similarly to KNN, KMC uses the Euclidean distance. Here it measures the distance between a point $\hat{x}$ (for KMC, a centroid) and each observation $x_i$, summing over the $N$ variables indexed by $\ell$.

\begin{gather}
  d(\hat{x},x_i) = \sqrt{ \sum_{\ell=1}^N (\hat{x}_\ell - x_{i\ell})^2}
\end{gather}

* Again, we need to scale the variables so that no one variable dominates the distance calculation between two observations.

\begin{gather}
   u_i = \dfrac{x_i - \min(x)}{\max(x)-\min(x)}
\end{gather}
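
To see why scaling matters, here is a small sketch with three made-up observations (the numbers are hypothetical, chosen only so that the sales column is much larger in magnitude than the price column). It computes the Euclidean distance before and after min-max scaling.

```python
import numpy as np

# Hypothetical observations: column 0 is price (cents / kWh), column 1 is sales (million kWh)
data = np.array([
    [ 6.0, 1000.0],
    [12.0, 1050.0],   # very different price, similar sales
    [ 7.0, 3000.0],   # similar price, very different sales
])

def euclidean(a, b):
    # Square root of the summed squared differences across variables
    return np.sqrt(np.sum((a - b) ** 2))

# Unscaled: the sales column dominates because its values are much larger
print(euclidean(data[0], data[1]), euclidean(data[0], data[2]))

# Min-max scale each column to the range [0, 1]
scaled = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# Scaled: both variables now contribute on a comparable footing
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

Before scaling, the second pair looks far more distant than the first purely because of the units. After scaling, the two distances are of similar size.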

#### SciKit Learn Implementation

* scikit-learn has a class for k Means that we can use.
  - `from sklearn.cluster import KMeans`
* You can use the following arguments when creating the model instance:
  - `n_clusters = k`: The number of clusters / centroids to use.
  - `n_init = 10`: The number of times to run the algorithm, each with a different set of initial centroids. The model keeps the "best" clustering result (the run with the lowest SSE).
  - `max_iter = 300`: The maximum number of iterations to run for the KMC algorithm.
  - `random_state = None`: An initial state for the random generation so that the results are replicable.
* You can create the KMC model in the same way we did with KNN:
  1. Use `model = KMeans(n_clusters, n_init)` to create the model.
  2. Use the `.fit(X)` method to fit the model to the data $X$.
  3. Use the `.predict(X_hat)` method to predict cluster values for new cases `X_hat`.
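
As a quick illustration of these three steps, the sketch below fits a model on toy data and predicts clusters for two new cases. The arrays `X_demo` and `X_hat` are made up for this example and are not the electricity data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumed): two tight groups already on a 0-1 scale
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.2, 0.05, (30, 2)),
    rng.normal(0.8, 0.05, (30, 2)),
])

# 1. Create the model
model = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)

# 2. Fit the model to the data
model = model.fit(X_demo)

# 3. Predict cluster values for new cases
X_hat = np.array([[0.25, 0.15], [0.75, 0.85]])
new_labels = model.predict(X_hat)

print(model.cluster_centers_)  # one row per centroid
print(new_labels)              # cluster index for each new case
```

Note that the cluster indices (0, 1, ...) are arbitrary. Only which points share an index is meaningful.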


In [None]:
# First, import the KMeans class from sklearn
from sklearn.cluster import KMeans

# Define a minmax scaler as we did for KNN
def MinMaxScaler(x):

  # Pre-compute the min and max of the variable
  min_x = np.min(x)
  max_x = np.max(x)

  # Calculate the newly scaled version of the variable
  u = (x - min_x) / (max_x - min_x)

  # Return the scaled version of the value
  return u

In [None]:
# Create new columns for our scaled values
el_df['price_scaled'] = MinMaxScaler(el_df['price'])
el_df['sales_scaled'] = MinMaxScaler(el_df['sales'])

# Pull out the minimum and maximum for each variable
# need these to transform our values back to the unscaled version
min_price = np.min(el_df['price'])
max_price = np.max(el_df['price'])

min_sales = np.min(el_df['sales'])
max_sales = np.max(el_df['sales'])

# Create an X value for sklearn
X = el_df.loc[:, ['price_scaled', 'sales_scaled']]
X.head()

In [None]:
# Plotting scaled values
sns.scatterplot(
    data = el_df,
    x = 'price_scaled',
    y = 'sales_scaled'
)
plt.show()

#### Setting Up k Means Clustering Algorithm

##### Iteration 1

In [None]:
# 1. Set a seed for reproducibility
random_state_value = 12
np.random.seed(random_state_value)
k = 5 # Fill this in with your guess from above
max_iter = 1 # Start with only one iteration

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title('Centroids and Group Predictions After 1 Iteration')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

#### What if we use a different initialization?

**Change the `random_state_value` to get different initializations**. Notice how the centroid placement changes.

In [None]:
# 1. Set a seed for reproducibility
random_state_value = 10923
np.random.seed(random_state_value)
k = 5 # Fill this in with your guess from above
max_iter = 1 # Start with only one iteration

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title('Centroids and Group Predictions After 1 Iteration')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

##### Iteration 2

Now, we'll go back to our first initial centroid placement and view how the algorithm progresses with different iterations.

In [None]:
# 1. Set a seed for reproducibility
random_state_value = 12 # using our original seed again
np.random.seed(random_state_value)
k = 5 # Fill this in with your guess from above
max_iter = 2 # Change the number of iterations to 2

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroids and Group Predictions After {max_iter} Iterations')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

##### Iteration 3

In [None]:
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 5 # Fill this in with your guess from above
max_iter = 3 # Change the iterations to 3

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroids and Group Predictions After {max_iter} Iterations')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

##### Iteration 4

In [None]:
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 5 # Fill this in with your guess from above
max_iter = 4 # Change the iterations to 4

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroids and Group Predictions After {max_iter} Iterations')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

##### Iteration 5

In [None]:
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 5 # Fill this in with your guess from above
max_iter = 5 # Change the iterations to 5

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroids and Group Predictions After {max_iter} Iterations')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

##### Example with 100 Iterations

In [None]:
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 5 # Fill this in with your guess from above
max_iter = 100 # Change the iterations to 100

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroids and Group Predictions After {max_iter} Iterations')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

## Greedy Algorithms

- KMeans Clustering is an example of a *greedy algorithm*.
- Each time we assign observations to a cluster or centroid, we ignore the consequences of changing the centroid averages for overall optimality.
- Because we ignore the consequences of changing the centroid averages, the **optimal assignment might change**.
- **KMeans May Stop at a Poor Solution:** No iteration increases the SSE, so the assignments eventually stop changing. Where the algorithm stops depends on the initial centroids, though. It can settle on a local optimum rather than the best possible clustering, and on large data sets it may reach `max_iter` before the assignments stabilize. There is no guarantee of a globally optimal solution.
- **Short-sighted:** The algorithm is short-sighted about the consequences of its actions, and need not converge to an optimal outcome.
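
One way to see how much the outcome depends on the starting centroids is to run KMeans several times with a single random initialization each and compare the resulting SSE. The sketch below does this on toy data. The data, the seeds, and the choice of `init = 'random'` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumed): four loosely separated groups
rng = np.random.default_rng(3)
centers_true = np.array([[0, 0], [0, 5], [5, 0], [5, 5]])
X_demo = np.vstack([rng.normal(c, 1.0, (40, 2)) for c in centers_true])

# One random initialization per run; only the seed changes between runs
inertias = []
for seed in range(10):
    model = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed)
    model = model.fit(X_demo)
    inertias.append(model.inertia_)
    print(f'seed = {seed}: iterations = {model.n_iter_}, SSE = {model.inertia_:.2f}')

# Letting sklearn try several initializations and keep the best one
best = KMeans(n_clusters=4, init='random', n_init=10, random_state=0).fit(X_demo)
print(f'Best of 10 initializations: SSE = {best.inertia_:.2f}')
```

If the printed SSE values differ across seeds, some runs stopped at a worse clustering than others. This is the practical reason for setting `n_init` greater than 1.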

## Changing the value of $k$

- We want to find an optimal value of k such that we do not have too many clusters.
- Too many clusters means that our initial guesses give similar results in terms of minimizing error, but the clusters change a lot across initial guesses. This means that our assignments are somewhat arbitrary, defeating the point of the algorithm.

- **Let's look at the results we get for different values of k**

#### Example with k = 2

In [None]:
# Let's try
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 2 # Example with 2 centroids
max_iter = 300 # Change the iterations to 300

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroid Placements After {max_iter} Iterations\nk = {k}')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

#### Example with k = 3

In [None]:
# Let's try
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 3 # Example with 3 centroids
max_iter = 300 # Change the iterations to 300

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroid Placements After {max_iter} Iterations\nk = {k}')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

#### Example with k = 4

In [None]:
# Let's try
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 4 # Example with 4 centroids
max_iter = 300 # Change the iterations to 300

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroid Placements After {max_iter} Iterations\nk = {k}')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

#### Example with k = 5

In [None]:
# Let's try
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 5 # Example with 5 centroids
max_iter = 300 # Change the iterations to 300

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroid Placements After {max_iter} Iterations\nk = {k}')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

#### Example with k = 6

In [None]:
# Let's try
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 6 # Example with 6 centroids
max_iter = 300 # Change the iterations to 300

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroid Placements After {max_iter} Iterations\nk = {k}')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    label = 'centroids'
)

# Add a legend and show
plt.legend()
plt.show()

#### Example with k = 20

In [None]:
# Let's try
# 1. Set a seed for reproducibility
np.random.seed(random_state_value)
k = 20 # Example with 20 centroids
max_iter = 300 # Change the iterations to 300

# 2. Create some initial centroids that are randomly defined
initial_centroids = np.random.rand(k, 2)

# 3. Create a model instance for the KMeans
model = KMeans(
  n_clusters = k, # The number of centroids / groups
  max_iter = max_iter, # The number of iterations to run
  init = initial_centroids, # Pass in our initial centroids
  random_state = random_state_value # The random state for reproducibility
)

# 4. Fit the model to our data
model = model.fit(X)

# 5. Gather predictions
el_df['g_hat'] = model.predict(X) # Predicted group

# 6. Re-normalize the centers so that
# they're on the same scale as our original data
centers = model.cluster_centers_ # The computed centers
centroid_price = centers[:, 0] * (max_price - min_price) + min_price
centroid_sales = centers[:, 1] * (max_sales - min_sales) + min_sales

# 7. Plot the data and their centers
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'g_hat', # Color points based on predicted group
    style = 'g_hat'
)
plt.title(f'Centroid Placements After {max_iter} Iterations\nk = {k}')

## Plotting the centroids
plt.scatter(
    centroid_price,
    centroid_sales,
    color = 'red',
    # label = 'centroids'
)

# Add a legend and show
plt.legend(bbox_to_anchor=(1.05, 1.25))
#plt.tight_layout()
plt.show()

### Sum of Squared Error

- We are interested in understanding how well our identified clusters fit the data. To do this, we can look at the *sum of squared error*.

- When calculating the sum of squared error, we fix the value for k and focus on its solution to the clustering problem.

- Each cluster $j$ has a centroid $c_j$. We can define the distance from each observation $x_i$ to its assigned centroid as $d(x_i, c_j)$.

- The **within cluster squared error** is calculated as:

\begin{gather}
  W_j = \sum_{\text{All observations } i \text{ in cluster } j} d(x_i,c_j)^2
\end{gather}

* In words, the **within cluster squared error** is just the sum of squared distances of points in a single cluster to its centroid.

* The **sum of squared error (SSE)** is

\begin{gather}
  \sum_{\text{All clusters j}} W_j = \sum_{\text{All clusters j}} \quad \sum_{\text{All observations } i \text{ in cluster } j} d(x_i,c_j)^2
\end{gather}

* In words, the SSE is the sum of the within cluster squared errors.

* In principle, KMC is trying to minimize the SSE, but it is a greedy algorithm, so it is not guaranteed to reach the global minimum SSE.

* In scikit-learn, the SSE is calculated and stored in the fitted model. It is stored as an attribute called `.inertia_`.
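
The formulas above can be checked by hand. This sketch fits a model on toy data (assumed for illustration), computes each $W_j$ from the fitted labels and centroids, and compares the total with `.inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumed): two groups
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# Sum the within cluster squared errors W_j over all clusters
sse = 0.0
for j in range(model.n_clusters):
    members = X_demo[model.labels_ == j]       # observations in cluster j
    c_j = model.cluster_centers_[j]            # centroid of cluster j
    W_j = np.sum((members - c_j) ** 2)         # squared distances to the centroid
    sse += W_j

print(sse, model.inertia_)  # the two values should agree up to rounding
```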



## Identifying the best value of k using the Scree Plot

* We can plot the SSE for each value of k, $SSE(k)$, against the number of clusters k.
* This creates a **scree plot** which allows us to decide what the best value of k is for our data.

In [None]:
# Select a list of k's that we want to investigate
max_k = 10
k_grid = np.arange(1, max_k + 1)

# Create a Numpy array to store our values of SSE
SSE = np.zeros(max_k)

# Loop over the indices for each value of k
for j in range(max_k):

  # index out the value of k
  k = k_grid[j]

  # Create the model instance using our value of k
  model = KMeans(
      n_clusters = k,
      max_iter = 300, # Define the number of iterations to run
      n_init = 10 # Define the number of initializations
      # sklearn will return the best model in terms of SSE
  )

  # Fit the model to our data
  model = model.fit(X)

  # Isolate the inertia (SSE) for this value of k and save it to our numpy array
  SSE[j] = model.inertia_

# Create the scree plot using Seaborn
sns.lineplot(
  x = k_grid,
  y = SSE
)
plt.title('Scree Plot')
plt.xlabel('k')
plt.ylabel('SSE')
plt.xticks(np.arange(1, max_k + 1))
plt.show()

### Deciding the Best Value of k From the Scree Plot

- The optimal $k^*$ is decided by looking for large reductions in the SSE by going from $k-1$ to $k$ as compared to going from $k$ to $k+1$.

- In other words, we want the value of k that we choose to significantly decrease the value of the SSE and for adding additional clusters after that value of k to only marginally decrease the SSE.

- This value $k^*$ is considered the **elbow** in the Scree plot because there's a large drop in the SSE followed by a marginal decrease in SSE as k increases.

- **What does it mean if there's no elbow in the scree plot?**
  1. There may not be discrete clusters in your data. Instead, there's a continuous trend.
  2. It's possible there are discrete clusters, but you need to do more feature engineering (e.g. Principal Component Analysis), or use a different algorithm or error metric to identify them.

- The scree plot is pretty subjective, and more quantitative approaches exist such as the [silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).
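
As a rough sketch of the silhouette approach, the code below scores several values of k on toy data with three well-separated groups and picks the k with the highest average silhouette score. The toy data and the range of k are assumptions for illustration. The silhouette score needs at least two clusters, so the loop starts at k = 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (assumed): three well-separated groups
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 10], [-10, 10]])
X_demo = np.vstack([rng.normal(c, 1.0, (60, 2)) for c in blob_centers])

# Average silhouette score for each candidate k (higher is better, range -1 to 1)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('k with the highest silhouette score:', best_k)
```

Unlike the elbow in a scree plot, this gives a single number per k to compare, although it is still a heuristic rather than a definitive answer.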

**Question:** From the plot above, what do you think the best value of $k$ is for our electricity data?

## Identifying the Truth in Our Data Example

* In this example, we assumed we didn't know the correct answer beforehand for demonstration purposes.
* The attribute of our data we were actually estimating with our identified groups was the variable `sectorName`. This is what sector the observation fell into.
* Let's plot our data to see what the true split should have been:

In [None]:
# Plotting the true values
sns.scatterplot(
    data = el_df,
    x = 'price',
    y = 'sales',
    hue = 'sectorName' # Color points based on True sectors
)
plt.title('True Clustering of Our Electricity Example')
plt.xlabel('Price (cents / kWh)')
plt.ylabel('Sales (million kWh)')
plt.show()

**Question:** How was your guess for the value of k?
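
When true labels happen to be available, one way to go beyond comparing plots by eye is to cross-tabulate the predicted groups against the true groups. The sketch below uses a made-up data frame with a hypothetical `true_group` column standing in for a label like `sectorName`. The adjusted Rand index from scikit-learn summarizes the agreement in one number.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Toy data (assumed): two groups with known, hypothetical labels
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'x': np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]),
    'y': np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]),
    'true_group': ['A'] * 50 + ['B'] * 50,
})

# Predicted groups from KMeans
model = KMeans(n_clusters=2, n_init=10, random_state=0)
demo_df['g_hat'] = model.fit_predict(demo_df[['x', 'y']])

# Rows are true groups, columns are predicted groups
table = pd.crosstab(demo_df['true_group'], demo_df['g_hat'])
print(table)

# 1.0 means the two groupings match exactly (up to relabeling)
ari = adjusted_rand_score(demo_df['true_group'], demo_df['g_hat'])
print(ari)
```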

## Conclusions

* Clustering in high dimensions is risky. There could be redundancy in the information between different variables and the values get further apart in higher dimensions.
  - When dealing with higher dimensions, you can use dimensionality reduction techniques to represent your higher dimensional data within a lower dimension. The most common technique is **Principal Component Analysis**. It also removes redundancy between variables.
* There are other versions of clustering that fix some shortcomings of KMeans clustering (spectral, dbscan, OPTICS, Mahalanobis). These techniques are more complex and computationally expensive.
* Clustering can be used to identify a latent structure in your data and then you can use that structure for other analyses.
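
As a minimal sketch of the dimensionality reduction idea mentioned above, the code below builds hypothetical data with six correlated variables, reduces it to two principal components with scikit-learn's `PCA`, and then clusters the reduced data. The data and the choice of two components are assumptions for illustration. With real data you would typically scale the variables first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical data: 2 underlying dimensions spread across 6 correlated variables
rng = np.random.default_rng(0)
latent = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
mixing = rng.normal(size=(2, 6))
X_high = latent @ mixing + rng.normal(0, 0.1, (100, 6))

# Reduce the 6 variables to 2 principal components
pca = PCA(n_components=2)
X_low = pca.fit_transform(X_high)
print(pca.explained_variance_ratio_)  # share of variance kept by each component

# Cluster in the lower dimensional space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
print(np.bincount(labels))  # number of observations in each cluster
```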