# Tutorial 10 - Clustering

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
* Explain the K-means clustering algorithm.
* Interpret the output of a K-means cluster analysis.
* Perform K-means clustering in R.
* Visualize the output of K-means clustering in R using a coloured scatter plot.
* Identify when it is necessary to scale variables before clustering, and do this using R.
* Use the elbow method to choose the number of clusters for K-means.
* Describe the advantages, limitations and assumptions of the K-means clustering algorithm.

This worksheet covers parts of [the Clustering chapter](https://datasciencebook.ca/clustering.html) of the online textbook. You should read this chapter before attempting the worksheet.

In [None]:
### Run this cell before continuing.
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(repr)
library(GGally)
options(repr.matrix.max.rows = 6)
source('tests.R')
source("cleanup.R")

# 1. Pokemon

We will be working with the Pokemon dataset from Kaggle, which can be found [here](https://www.kaggle.com/abcsds/pokemon).
This dataset compiles statistics on 721 Pokemon, including each Pokemon's name, type, health points, attack strength, defensive strength, and speed points. These values describe a Pokemon's abilities (higher values are better). We are interested in whether there are any sub-groups/clusters of Pokemon based on these statistics and, if so, how many there are.

![](https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif)

Source: https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif


**Question 1.0**
<br> {points: 1}

Use `read_csv` to load `pokemon.csv` from the `data/` folder. 

*Assign your answer to an object called `pokemon_full`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pokemon_full

In [None]:
test_1.0()

**Question 1.1**
<br> {points: 1}

To start exploring the Pokemon data, create a scatter plot matrix (or pairplot) using `ggpairs`. The plot should only contain the columns `Total` to `Speed` from `pokemon_full`. You can check the data wrangling chapter in the textbook to recall how to select a range of columns using `select` with `:`.

*Assign your answer to an object called `pokemon_pairs`. Make sure to set a suitable size for the plot.*
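
As a reminder of how `select` with `:` behaves, here is a minimal sketch on a toy tibble. The tibble and its column names (`id`, `a`, `b`, `c`, `note`) are made up for illustration and are not part of the Pokemon data.

```r
library(dplyr)

# Hypothetical toy tibble; these column names are illustrative only.
toy <- tibble(
    id = 1:3,
    a = c(1.5, 2.0, 0.5),
    b = c(10, 20, 30),
    c = c(7, 8, 9),
    note = c("x", "y", "z")
)

# `a:c` keeps every column positioned from `a` through `c`, inclusive.
toy_range <- toy |> select(a:c)
names(toy_range)
```

The range depends on the order of the columns in the data frame, so it is worth checking that order before selecting.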

In [None]:
# options(...)
#
# ... <- pokemon_full |> ... |>
#     ggpairs(aes(alpha = 0.05)) +
#     theme(text = element_text(size = 20))

# your code here
fail() # No Answer - remove if you provide an answer
pokemon_pairs

In [None]:
test_1.1()

**Question 1.2** 
<br> {points: 1}

From the pairplot above, it does not look like the Pokemon are separated into clear groups in any of the pairwise scatter plots. Here, we will continue exploring the relationship between `Speed` and `Defense` and see what happens if we try to cluster the data points on these two variables, even though there are no visually discernible clusters in the chart.

First, select the columns `Speed` and `Defense`, creating a new dataframe with only those columns.

*Assign your answer to an object named `pokemon`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pokemon

In [None]:
test_1.2()

**Question 1.3**
<br> {points: 1}

Next, create a scatter plot of only these two variables so that we can look closely at their relationship. Put the `Speed` variable on the x-axis, and the `Defense` variable on the y-axis.

*Assign your plot to an object called `pokemon_scatter`. Don't forget to do everything needed to make an effective visualization, including setting an appropriate `alpha` value for the points.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pokemon_scatter

In [None]:
test_1.3()

**Question 1.4.1** 
<br> {points: 3}

The chart above confirms what we saw in the pairplot: there don't seem to be any visually distinct clusters of points in these two dimensions. Could it still be informative to run clustering with this data? Let's find out by using K-means to cluster the Pokemon based on their `Speed` and `Defense`.

So far when using K-means, we have scaled our input features. Will it matter much for our clustering if we scale our variables for the Pokemon data? Is there any argument against scaling here?

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.4.2**
<br> {points: 1}

Now, let's use K-means to cluster the Pokemon based on their `Speed` and `Defense` variables.
- Create a recipe named `pokemon_recipe` that standardizes the data
- Create a model specification named `pokemon_spec` for K-means clustering with 4 clusters. 
- Fit the model using a `tidymodels` workflow; call the output of the `fit()` function `pokemon_clustering`.

*Assign your answers to objects called `pokemon_recipe`, `pokemon_spec`, and `pokemon_clustering`.*

**Note:** We set the random seed here because K-means initializes observations to random clusters.
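
To make "standardizes the data" concrete, here is a minimal base R sketch of the transformation on a toy data frame. It uses `scale()` rather than a `tidymodels` recipe, and the values are made up. We assume that a recipe step that standardizes applies a comparable per-column centring and scaling.

```r
# Hypothetical toy data; the two columns are on very different scales.
toy <- data.frame(
    x = c(1, 2, 3, 4, 5),
    y = c(100, 300, 200, 500, 400)
)

# Standardize each column: subtract its mean, then divide by its standard deviation.
toy_scaled <- as.data.frame(scale(toy))

# After standardizing, each column has mean 0 and standard deviation 1.
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

Because K-means relies on distances between points, putting the columns on a common scale stops the column with the larger raw range from dominating those distances.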

In [None]:
#DON'T CHANGE THE SEED VALUE BELOW!
set.seed(2019)

# your code here
fail() # No Answer - remove if you provide an answer
pokemon_clustering

In [None]:
test_1.4.2()

**Question 1.5**
<br> {points: 1}

Let's visualize the clusters we built in `pokemon_clustering`. Use the `augment` function and create a coloured scatter plot of `Speed` (x-axis) vs `Defense` (y-axis) with the points coloured by their cluster assignment. 

Name this plot `pokemon_clustering_plot`.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pokemon_clustering_plot

In [None]:
test_1.5()

**Question 1.6**
<br> {points: 3}

Below you can see multiple initializations of K-means with different seeds for `K = 4`. Can you explain what is happening and how we can mitigate this in the `kmeans` function?

![](imgs/multiple_initializations.png)

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.7**
<br> {points: 1}

We know that comparing how the WSSD varies for multiple values of $K$ is an important step of selecting a suitable clustering model. That's what we will do next!

For this exercise, you will calculate the total within-cluster sum-of-squared distances for $K$ = 1 to $K$ = 10.

1. Create a tibble with the desired values of $K$.
2. Create a new model specification that sets `nstart` to 10 and tells `k_means` you want to tune the number of clusters.
3. Create a new workflow that uses `tune_cluster` to tune the number of clusters.
4. Use the `collect_metrics` function to collect the results.
5. Use `filter`, `select`, and `mutate` functions to construct a tibble with two columns named `num_clusters` and `total_WSSD`. Store that tibble in an object named `elbow_stats`.


*Assign your answer to a tibble object named `elbow_stats`. It should have the columns `num_clusters` and `total_WSSD`.*
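
For intuition about the quantity being collected, the total WSSD for a clustering with clusters $C_1, \dots, C_K$ and cluster centres $\mu_1, \dots, \mu_K$ is $\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$. The sketch below computes it on made-up toy data using base R's `kmeans` directly, not the `tune_cluster` workflow this question asks for. The data, the seed, and the names `toy`, `toy_elbow`, and `km3` are all illustrative assumptions.

```r
set.seed(1)  # arbitrary seed for this toy example only

# Hypothetical toy data: three loose blobs in two dimensions.
toy <- rbind(
    matrix(rnorm(40, mean = 0), ncol = 2),
    matrix(rnorm(40, mean = 5), ncol = 2),
    matrix(rnorm(40, mean = 10), ncol = 2)
)

# Total WSSD for K = 1, ..., 6, using 10 random restarts for each K.
ks <- 1:6
total_wssd <- sapply(ks, function(k) {
    kmeans(toy, centers = k, nstart = 10)$tot.withinss
})
toy_elbow <- data.frame(k = ks, total_wssd = total_wssd)
toy_elbow

# Check the definition by hand for K = 3: sum the squared distances
# from each point to the centre of the cluster it was assigned to.
km3 <- kmeans(toy, centers = 3, nstart = 10)
manual_wssd <- sum((toy - km3$centers[km3$cluster, ])^2)
all.equal(manual_wssd, km3$tot.withinss)
```

The manual sum matches `tot.withinss`, which is the same quantity that an elbow plot tracks as $K$ varies.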

In [None]:
set.seed(2020) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
elbow_stats

In [None]:
test_1.7()

**Question 1.8**
<br> {points: 1}

Let's visualize how WSSD changes as we vary the value of $K$. To do this, create the elbow plot. Put the within-cluster sum of squares on the y-axis, and the number of clusters on the x-axis.

*Assign your plot to an object called `elbow_plot`*.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
elbow_plot

In [None]:
test_1.8()

**Question 1.9** 
<br> {points: 3}

Based on the elbow plot above, what value of $K$ would you choose? Explain why.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.10**
<br> {points: 3}

Using the value that you chose for $K$, run the K-means algorithm with `nstart = 10` and assign your answer to an object called `pokemon_final_kmeans`. 

Augment the data with the final cluster labels and assign your answer to an object called `pokemon_final_clusters`. 

Finally, create a plot called `pokemon_final_clusters_plot` to visualize the clusters. Include a title, colour the points by the cluster and make sure your axes are human-readable.

In [None]:
set.seed(2019) # DO NOT REMOVE
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.10()

**Question 1.11** 
<br> {points: 3}

This looks perhaps a bit better than when we used $K=4$ clusters originally, but is it really a lot better? Use this plot and the elbow plot from Question 1.8 to reason about what might be going on here.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

# 2. Tourism Reviews

![](https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif)
Source: https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif

The Ministry of Land, Infrastructure, Transport and Tourism of Japan is interested in knowing the types of tourists that visit East Asia. They know the [majority of their visitors come from this region](https://statistics.jnto.go.jp/en/graph/) and would like to stay competitive in the region to keep growing the tourism industry. For this, they have hired us to perform segmentation of the tourists. A [dataset from TripAdvisor](https://archive.ics.uci.edu/ml/datasets/Travel+Reviews) has been scraped and is provided to you.

This dataset contains the following variables:

- User ID : Unique user id 
- Category 1 : Average user feedback on art galleries 
- Category 2 : Average user feedback on dance clubs 
- Category 3 : Average user feedback on juice bars 
- Category 4 : Average user feedback on restaurants 
- Category 5 : Average user feedback on museums 
- Category 6 : Average user feedback on resorts 
- Category 7 : Average user feedback on parks/picnic spots 
- Category 8 : Average user feedback on beaches 
- Category 9 : Average user feedback on theaters 
- Category 10 : Average user feedback on religious institutions

**Question 2.0**
<br> {points: 3}

Load the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv and clean it so that only the Category # columns are in the data frame (i.e., remove the `User ID` column). 

Assign your answer to an object called `clean_reviews`.
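
As a reminder of how to drop a single column, here is a minimal sketch on a toy tibble. The tibble and its values are made up, and its column names are assumptions chosen to resemble the variable list above. Check the actual names in the file after loading it.

```r
library(dplyr)

# Hypothetical toy tibble; names and values are illustrative only.
toy_reviews <- tibble(
    `User ID` = c("User 1", "User 2"),
    `Category 1` = c(0.93, 1.02),
    `Category 2` = c(1.8, 2.2)
)

# Backticks let you refer to a column whose name contains a space;
# the minus sign drops that column and keeps all the others.
toy_clean <- toy_reviews |> select(-`User ID`)
names(toy_clean)
```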

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Did not create an object called clean_reviews', {
    expect_true(exists("clean_reviews"))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.


**Question 2.1**
<br> {points: 3}

Perform K-means and vary $K$ from 1 to 10 to identify the optimal number of clusters. Use `nstart = 100`. Assign your answer to a tibble object called `tourism_elbow_stats` that has the columns `num_clusters` and `total_WSSD`.

Afterwards, create an elbow plot to help you choose $K$. Assign your answer to an object called `tourism_elbow_plot`.

In [None]:
#DON'T CHANGE THIS SEED VALUE
set.seed(2019)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Did not create an object called tourism_elbow_stats', {
    expect_true(exists('tourism_elbow_stats'))
})
test_that('Did not create a plot called tourism_elbow_plot', {
    expect_true(exists('tourism_elbow_plot'))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.

**Question 2.2** 
<br> {points: 3}

From the elbow plot above, which $K$ should you choose? Explain why you chose that $K$.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.3**
<br> {points: 3}

Run K-means again, with the optimal $K$, and assign your answer to an object called `reviews_clusters`. Use `nstart = 100`. Then, use the `augment` function to get the cluster assignments for each point. Name the data frame `cluster_assignments`.

In [None]:
#DON'T CHANGE THIS SEED VALUE
set.seed(2019)

# your code here
fail() # No Answer - remove if you provide an answer
cluster_assignments

For the next two questions, use the plot below as a reference.

> The visualization below is a density plot, which you can think of as a smoothed version of a histogram. Density plots are more effective for comparing multiple distributions. What we are looking for in these visualizations is which variables have different distributions between the different clusters.

In [None]:
options(repr.plot.height = 8, repr.plot.width = 15)
cluster_assignments |>
    pivot_longer(cols = -.pred_cluster, names_to = 'category', values_to = 'value')  |> 
    ggplot(aes(value, fill = .pred_cluster)) +
        geom_density(alpha = 0.4, colour = 'white') +
        # We are setting the x-scale to "free" since we standardized the rating values before clustering them,
        # which means that their original range (which is what we show here) does not matter
        facet_wrap(facets = vars(category), scales = 'free') +
        theme_minimal() +
        theme(text = element_text(size = 20))

**Question 2.4** Multiple Choice:
<br> {points: 1}

From the plots above, which categories might we hypothesize are driving the clustering (i.e., are useful for distinguishing between the types of tourists)? The categories are listed below.

- Category 1 : Average user feedback on art galleries 
- Category 2 : Average user feedback on dance clubs 
- Category 3 : Average user feedback on juice bars 
- Category 4 : Average user feedback on restaurants 
- Category 5 : Average user feedback on museums 
- Category 6 : Average user feedback on resorts 
- Category 7 : Average user feedback on parks/picnic spots 
- Category 8 : Average user feedback on beaches 
- Category 9 : Average user feedback on theaters 
- Category 10 : Average user feedback on religious institutions

A. 10, 3, 5, 6, 7

B. 10, 3, 5, 6, 1

C. 10, 3, 4, 6, 7

D. 10, 2, 5, 6, 7

*Assign your answer to an object called `answer2.4`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
answer2.4

**Question 2.5** 
<br> {points: 3}

Discuss one disadvantage of not being able to visualize the clusters when dealing with multidimensional data.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

In [None]:
source("cleanup.R")