# Applying the *k*-Means Algorithm

## Review the brief

To familiarize ourselves with carrying out cluster analysis using Python, we'll work through an example of applying the k-means algorithm to a 2-dimensional dataset on countries of the world.

Our task is to find and group similar countries together on the basis of two features, fertility rate and female labor force participation rate. 

We'll see how Python can be used to perform a full cluster analysis. As a reminder, the steps to a cluster analysis are: 
1. Determine if clustering is appropriate for the task 
2. Preprocess the data
3. Carry out the algorithm 
4. Evaluate the results
5. Interpret the results

As you work through this notebook, you'll complete 12 tasks. If you get stuck at any point, you can refer to the completed notebook `applying_kmeans_completed.ipynb`, which provides example answers.

## Import necessary libraries

First, we need to import some important libraries for our analysis. We'll mostly be working with the `sklearn` library. 

### *Task 1 - import libraries (run cell)*
- Run the cell below to import the libraries that we'll be working with. 
- Take a look at some of the `sklearn` documentation on its clustering functions by clicking [this link](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster).

In [None]:
# import libraries 

# for loading and manipulating dataframes
import pandas as pd
import numpy as np

# for various plotting functions
import matplotlib.pyplot as plt
import seaborn as sns

# preprocessing functions to scale our data
from sklearn.preprocessing import scale, StandardScaler, normalize

# function to carry out k-means
from sklearn.cluster import KMeans

## Load in and explore the data

To begin, let's load in and explore our dataset. The data is from the 2018 World Bank Development Indicators and features 2 variables - female labor force participation rate and fertility rate. You can look at the data on the World Bank site by clicking [this link](https://databank.worldbank.org/source/world-development-indicators#).

### *Task 2 - load data*
Read in the `countries.csv` data file that you downloaded into a data frame named `data`. 
Print out the first 10 rows of the dataset using `.head()`.

In [None]:
# load in the dataset

In [None]:
# inspect the first 10 rows

Next, let's inspect the values in our data set. First, let's make sure we have no missing values and that our cluster variables are stored as numeric values. 

### *Task 3 - data overview*
Apply the `.info()` method to our data frame to check our values.

In [None]:
# inspect the dataframe

We can see the data contains 187 rows representing 187 countries. The data is already clean, since the two variables we are using for clustering are numerical, and there are no missing values. 
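
The same two checks can also be made programmatically. The cell below is only a sketch on a small made-up data frame named `toy` (a hypothetical stand-in for `data`, including an assumed `country` column), so the values are not taken from the World Bank file.

```python
import pandas as pd

# hypothetical stand-in for `data`; all values below are made up for illustration
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "female_labor": [45.2, 30.1, 51.7],
    "fertility_rate": [1.8, 4.3, 2.1],
})

cluster_vars = ["female_labor", "fertility_rate"]

# count missing values in each clustering variable
missing_counts = toy[cluster_vars].isna().sum()

# confirm that each clustering variable is stored with a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(toy[col]) for col in cluster_vars)

print(missing_counts)
print("All clustering variables numeric:", all_numeric)
```

If either check fails on real data, the missing or non-numeric values would need handling before clustering.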

### *Task 4 - summary statistics*
Look at some summary statistics for our variables using the `.describe()` method.

In [None]:
# check summary statistics

We can see that the two variables have different scales: female labor force rate ranges from about 8 to 56, whereas fertility rate ranges from around 1 to 7. This is important to note because *k*-means groups points by distance, so we will scale our data before applying the algorithm. 

Before we do so, let's quickly look at a scatter plot of the two variables to see what our data looks like visually. 

### *Task 5 - scatter plot*
Create a scatter plot of the two variables with the variable `female_labor` on the x-axis and `fertility_rate` on the y-axis using the `sns.scatterplot()` function. 

In [None]:
# create a scatter plot of the 2 variables

From this scatter plot, there are a few things for us to note: 
1. There are a few values for each variable that may be considered outliers, but, for now, we're going to keep all the observations for our analysis.
2. We can already see a few spots where countries appear to group together. Although we can't say with certainty how many clusters will be best for our analysis, it looks like there might be three groupings: 
    * Countries with high female labor force participation rates and low fertility rates
    * Countries with high female labor force rates and high fertility rates
    * Countries with lower female labor rates and a mix of fertility rates

---

# Perform a Cluster Analysis
Now that we've looked over our dataset, it's time to begin our analysis. 
## 1. Determine if clustering is appropriate for the task

First, we must decide whether or not clustering is suited for our problem. This involves domain knowledge and determining clustering tendency. 

Before applying any analysis, you should determine that finding similar groups in your data will help answer your questions. In this example, we already know from the brief that our goal is to find similar groups of countries, so clustering will be useful. 

Additionally, you should check for cluster tendency. As mentioned, there are several metrics to do so, but for this example, we will simply verify this visually. Take a look at our scatter plot of variables again from '*Task 5*' above.

As we saw in *Task 5*, there is evidence of grouping in this plot. This helps validate that our data has clustering tendency. However, groupings are not always visible, even in two dimensions. Furthermore, in higher dimensions, cluster tendency can't be validated visually and requires domain knowledge and various metrics.

## 2. Preprocess the data

Now that we have inspected our data, ensured it is clean, and validated that clustering can be used for our problem, it's time to preprocess the data before applying our algorithm. 

We will **scale** our data so that all variables are of similar magnitude, and, therefore, they will contribute similarly to the clustering. Scaling is an important aspect of many machine learning problems.


To scale our data, we will use the `StandardScaler()` from `sklearn`. You can read more about it [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

### *Task 6 - scale data (run cell)*
Run the cell below to scale the data. We will store the scaled variables in `data_scaled`.

In [None]:
# create scaling object
scaler = StandardScaler()

# scale our variables 
data_scaled = scaler.fit_transform(data[['female_labor', 'fertility_rate']])
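
To see what this scaling step does to each column, here is a minimal sketch on a tiny made-up array (not our dataset). It compares the output of `StandardScaler` with the standardization done by hand: subtract the column mean and divide by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up values: column 0 mimics a labor force rate, column 1 a fertility rate
X = np.array([[10.0, 1.5],
              [30.0, 2.5],
              [50.0, 6.5]])

# scale with sklearn
X_scaled = StandardScaler().fit_transform(X)

# the same calculation by hand, using the population standard deviation
# (ddof=0, which is numpy's default and what StandardScaler uses)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# both approaches should agree
print(np.allclose(X_scaled, X_manual))

# after scaling, each column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, both columns are expressed in standard deviations from their own mean, so neither dominates purely because of its units.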

## 3. Carry out the algorithm
Our data is ready for clustering. The next step is to run the *k*-means algorithm on our scaled data using the `KMeans()` function from `sklearn`. Make sure to read about the function and its parameters from the documentation by clicking [this link](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

There are two steps to carry out k-means using `KMeans()`: 
1. Instantiate a model with parameter values.
    * The number of clusters *k* is given by the argument `n_clusters`. In this analysis, we've chosen to identify 3 clusters, since from exploring the scatter plot of the data we estimated there was a potential for 3 groupings. 
    * In addition, we often set the parameter for the random state, `random_state`. This ensures that each time we run our code, the algorithm uses the same initial conditions. For k-means, this means that when we use the same `n_clusters`, the data points are assigned to the same final cluster each time. 
    
    In this analysis, we set the random state to 123, although this could be any integer. Using the same value on every run (whether 123 or another number) ensures that our analysis uses the same initial conditions.  

  
2. Fit the model to the data using `.fit()` method. This action computes the k-means clustering.
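
The effect of `random_state` described in step 1 can be checked with a small sketch on made-up random data (not our dataset). Setting `n_init=10` explicitly is our own choice here, made so that the behavior does not depend on the installed `sklearn` version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up data: 60 random points in two dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))

# two separate models, instantiated with the same parameter values
labels_a = KMeans(n_clusters=3, random_state=123, n_init=10).fit(X).labels_
labels_b = KMeans(n_clusters=3, random_state=123, n_init=10).fit(X).labels_

# with the same random_state, both runs assign every point to the same cluster
print(np.array_equal(labels_a, labels_b))
```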

### *Task 7 - apply k-means (run cells)*
Run the four cells below to apply k-means to our scaled data, assign the cluster labels to our dataframe, and visualize our clusters. 

In [None]:
# instantiate a model with parameter values 
model = KMeans(n_clusters=3, random_state=123)

# run .fit() to fit the model to the (scaled) data
model.fit(data_scaled)

After running our model, we can access the cluster labels that the algorithm assigned to each data point through the model's `.labels_` attribute.

In [None]:
# access the clustering results via .labels_ after running .fit()
model.labels_

Now let's add the cluster labels to the original data frame to analyze our results.

In [None]:
# add the cluster labels to the original dataframe
data['cluster'] = model.labels_

# inspect the new cluster column
data.head()

Now that we have our cluster labels, we can graph our scatter plot again, but now with the data points colored by their cluster membership.

In [None]:
# visualise the clusters by replotting our scatter plot 
# but now with cluster label as color
sns.scatterplot(x=data['female_labor'],
                y=data['fertility_rate'],
                hue=data['cluster'],
                palette='Dark2_r',
                legend = 'full')

It looks like we found fairly sensible clusters, similar to the groupings we identified visually earlier. We can observe: 

- Cluster 0 has strong cluster cohesion and contains countries with high female labor force participation rates and low fertility rates.
- Cluster 1 is more dispersed, but generally includes countries with high female labor force rates and high fertility rates.
- Finally, Cluster 2 is the most sparse cluster and contains countries with lower female labor force rates and generally low fertility rates, with the exception of a few countries with higher fertility rates.
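
Descriptions like these can be backed up with numbers by looking at the cluster centers. Because we clustered scaled data, the fitted centers are in scaled units. The sketch below uses made-up data and hypothetical names (`toy_scaler`, `toy_model`) to show one way to convert the centers back to the original units with `inverse_transform()`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# made-up data: two tight, well-separated groups on very different scales
rng = np.random.default_rng(0)
group_a = np.column_stack([rng.normal(50, 1, 20), rng.normal(2, 0.1, 20)])
group_b = np.column_stack([rng.normal(20, 1, 20), rng.normal(5, 0.1, 20)])
X = np.vstack([group_a, group_b])

# scale, then cluster (n_init=10 is set explicitly as our own choice)
toy_scaler = StandardScaler()
X_scaled = toy_scaler.fit_transform(X)
toy_model = KMeans(n_clusters=2, random_state=123, n_init=10).fit(X_scaled)

# the centers are in scaled units; convert them back to the original units
centers_original = toy_scaler.inverse_transform(toy_model.cluster_centers_)

# for comparison: the mean of the original values within each cluster
cluster_means = np.array([X[toy_model.labels_ == k].mean(axis=0) for k in range(2)])

print(centers_original.round(2))
print(cluster_means.round(2))
```

On this toy data, the converted centers match the per-cluster means of the original values, which gives a compact numeric summary of each cluster.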

Take a moment to think about what this means about the relationship between these two variables and among the clusters. Were any results surprising?

Before we interpret further and take a closer look at the countries within each cluster, let's first:
- Understand the need for scaling
- Evaluate our clusters and validate our choice for *k*.

### Note: What if we hadn't scaled our data?

Remember that if we don't scale our data before clustering, variables with larger scales can have a disproportionate effect on the clusters. We can illustrate the effect of scaling by copying the code we used to cluster above, but this time performing it on the unscaled data.

### *Task 8 - non-scaled data (run cell)*
Run the cell below to apply k-means on our unscaled data.

In [None]:
# subset the non-scaled variables 
data_not_scaled = data[['female_labor', 'fertility_rate']]

# instantiate a model with the same parameter values 
model2 = KMeans(n_clusters=3, random_state=123)

# run .fit() to fit the model to the (non-scaled) data
model2.fit(data_not_scaled)

# add the cluster labels to the data frame 
data['cluster_not_scaled'] = model2.labels_

# visualise the clusters by replotting our scatter plot 
# with cluster label as color
sns.scatterplot(x=data['female_labor'],
                y=data['fertility_rate'],
                hue=data['cluster_not_scaled'],
                palette='Dark2_r',
                legend = 'full')

Notice how, when we apply clustering to non-scaled data, the clusters form vertical stripes along the x-axis rather than the intuitive groupings we found above. This is because the variable with the larger scale, female labor force rate, dominates how the clusters are determined. It overpowers any influence of the other variable, fertility rate, simply because its values are on a larger scale.
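
A quick calculation shows why the larger-scale variable dominates. The sketch below takes two made-up countries (hypothetical values, not from our dataset) and splits the squared Euclidean distance between them into the part contributed by each variable.

```python
import numpy as np

# two made-up countries: (female labor force rate, fertility rate)
a = np.array([20.0, 1.5])
b = np.array([50.0, 6.5])

# squared difference contributed by each variable to the squared distance
sq_diff = (a - b) ** 2

# share of the total squared distance that comes from each variable
share = sq_diff / sq_diff.sum()

print(sq_diff)
print(share.round(3))
```

Here the fertility rates differ by 5, which is most of that variable's observed range, yet this difference accounts for less than 3% of the squared distance. The labor force difference supplies the rest.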

---

<h1><center>BREAK</center></h1>
<center> The remainder of this notebook covers how we evaluate and interpret the results of our analysis. </center>
<center>Return to the introduction and complete the relevant sections before completing the rest of the notebook. If you close the notebook beforehand, remember to run all cells above before working in the cells below. </center>

---

## 4. Evaluate the results
Now that we have produced our clustering, remember the next step is to ask questions such as: are our results "good"? How can we quantify this? Did we pick the best value of k? 

We can use **inertia** to help us evaluate cluster cohesion, which is one aspect of a "good" clustering.
### *Task 9 - inspect inertia (run cell)*
We can see the calculated inertia score using the `.inertia_` attribute.  
Run the cell below to inspect the inertia score from the analysis above.  

In [None]:
# look at inertia score 
model.inertia_

From this, we know that inertia, also called the within-cluster sum of squares (WCSS), is around 94. Inertia is the sum of the squared distances from each point to its assigned cluster center. On its own, we can't draw many conclusions from the score. 
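
To make the definition of inertia concrete, the sketch below recomputes it by hand on made-up random data (not our dataset) and compares the result with the model's `.inertia_` attribute. The name `km` and the explicit `n_init=10` are our own choices for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up data: 60 random points in two dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))

km = KMeans(n_clusters=3, random_state=123, n_init=10).fit(X)

# look up the center of the cluster that each point was assigned to
assigned_centers = km.cluster_centers_[km.labels_]

# inertia: sum of squared distances from each point to its assigned center
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two printed values should agree up to floating-point rounding.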

However, we can run our algorithm on the dataset multiple times at different parameter values, such as the number of clusters *k*, to evaluate the impact of these parameter choices.

Let's use inertia to help us validate our choice of *k*. To do so, we will create our own **elbow plot**, which looks at inertia against different values of *k*. Remember that the "elbow" of the plot indicates a good choice of *k*, where choosing any higher number of clusters only results in a marginal improvement in inertia. 

### *Task 10 - Evaluate k-means (run cells)*
Run the four cells below to work through how we use inertia to evaluate k-means and select *k*.  

First we will create a list of different values of *k* to inspect in our elbow plot.

In [None]:
# create a list of different values of k to test
num_clusters = list(range(1,16))

Next, for each of the 15 values of *k*, we will run the *k*-means algorithm again, compute the inertia, and store the score in a list.

This is broken down in the `for` loop below.

In [None]:
# create list to store the inertia scores 
inertias = []

# iterate over each value of k
for i in num_clusters:
    
    # instantiate a model with k clusters
    kmeans = KMeans(n_clusters=i, random_state=123)
    
    # fit the model to the scaled data
    kmeans.fit(data_scaled)
    
    # get the inertia score and append the value to the list
    inertias.append(kmeans.inertia_)

We can see we now have our list of 15 inertia scores, which we will then plot against our values for *k*. 

In [None]:
# view all inertia scores
inertias

In [None]:
# elbow plot: plot inertia against k values
# and look for the kink or "elbow" in the plot 
# to determine the optimal value of k

plt.figure(figsize=(8, 5))

sns.lineplot(x=num_clusters, y=inertias)

plt.xticks(num_clusters)
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow test to find k")

plt.show()

We can see the elbow in the plot is located at k=3, meaning that there is not much decrease in inertia for each increase in the number of clusters beyond 3. Adding more clusters would simply add complexity without much improvement in performance. This helps validate our choice of *k*=3 earlier. 
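
Besides reading the plot, we can look at how much inertia falls with each additional cluster. The sketch below does this on made-up data containing three well-separated groups (not our dataset); names such as `toy_inertias` are hypothetical and chosen so they don't overwrite the notebook's variables.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up data: three well-separated groups of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# fit k-means for k = 1..6 and record the inertia each time
ks = list(range(1, 7))
toy_inertias = [
    KMeans(n_clusters=k, random_state=123, n_init=10).fit(X).inertia_
    for k in ks
]

# drop in inertia gained by moving from k-1 to k clusters
drops = -np.diff(toy_inertias)

for k, drop in zip(ks[1:], drops):
    print(f"k={k}: inertia falls by {drop:.1f}")
```

On this toy data, the drop from 2 to 3 clusters is far larger than the drop from 3 to 4, which is the numeric counterpart of the elbow at 3. Real data is rarely this clear-cut, so the plot and some judgment are still needed.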

### **Note: When do we use the elbow plot?**

In this analysis, we first chose a number for *k*, and then we applied the elbow method to validate our choice. In practice, it is more helpful to perform the elbow method earlier in the analysis to first determine a potential choice of *k* before applying your algorithm and visualizing the results.

Inertia is not the only metric we can use to evaluate our clusters. We will be exploring another internal metric, the **silhouette score**, later on in this module.

## 5. Interpret the results

### Explore the clusters

The final part of the analysis involves taking a closer look at each of our clusters to see how the variables differ between them. This step is key to cluster analysis, since it is the data analyst, not the computer, who determines what the groupings mean. 

This requires a degree of domain knowledge and experimentation to find results that are useful and meaningful. 

### *Task 11 - inspect cluster membership*
- Take a look at the countries within each of your final clusters. 
- *Hint:* One approach you might use is applying the `print()` function inside a `for` loop.

In [None]:
# look at the countries within each cluster
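
One possible pattern for this task is sketched below on a small made-up data frame named `toy`, with an assumed `country` column and made-up cluster labels. It is an illustration of the `for` loop idea from the hint, not the answer from the completed notebook.

```python
import pandas as pd

# hypothetical stand-in for `data`; the country names and labels are made up
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "cluster": [0, 1, 0, 2, 1],
})

# collect the country names that belong to each cluster label
members = {
    label: group["country"].tolist()
    for label, group in toy.groupby("cluster")
}

# print one line per cluster
for label, names in members.items():
    print(f"Cluster {label}: {', '.join(names)}")
```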

### Interpret the clusterings

Now that we know which countries are in each of our clusters, an important final step to the analysis is interpreting the clusters and assigning meaning to our groupings. 

### *Task 12 - interpretation*
With some research, or using previous knowledge, how can our clusters help inform our understanding of similarities and differences among countries?

*Your answer here:*

You might find that, while these two variables give some insight into what countries are similar on the basis of economic development, public health, and gender equality, there are certainly a number of other features we might want to explore. Later on in the module, we will have the opportunity to look at how adding more features to our dataset and clustering on more variables can allow us a potentially richer understanding of our country clusters.

---

# Next steps
When we move beyond clustering on 2 or 3 dimensions, the data becomes more difficult to visualize. Although clustering is trickier in multiple dimensions, this is where the technique excels. We no longer rely on our intuitions as much, and we let the algorithms reveal answers using data. In our workshop and online practice, we'll gain experience in how to approach clustering when it's no longer possible to visualize our data all at once.