# Tutorial: Clustering

This worksheet covers the [Clustering](https://datasciencebook.ca/clustering.html) chapter of the online textbook, which also lists the learning objectives for this worksheet. You should read the textbook chapter before attempting this worksheet. 

In [None]:
### Run this cell before continuing.
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(repr)
library(GGally)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

# 1. Pokemon

We will be working with the Pokemon dataset from Kaggle, which can be found [here](https://www.kaggle.com/abcsds/pokemon).
This dataset compiles the statistics on 721 Pokemon. It includes each Pokemon's name, type, health points, attack strength, defensive strength, speed points, and so on. The numeric values describe a Pokemon's abilities (higher values are better). We are interested in seeing whether there are any sub-groups/clusters of Pokemon based on these statistics and, if so, how many there are.

![](https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif)

Source: https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif


**Question 1.0**
<br> {points: 1}

Use `read_csv` to load `pokemon.csv` from the `data/` folder. 

*Assign your answer to an object called `pokemon_full`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pokemon_full

In [None]:
library(digest)
stopifnot("pokemon_full should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(pokemon_full)), "1aa94")), "d8722c332261a7eb8df34a29070f0b96"))
stopifnot("dimensions of pokemon_full are not correct"= setequal(digest(paste(toString(dim(pokemon_full)), "1aa94")), "cb9950d5f186426c1413a04dd82c705c"))
stopifnot("column names of pokemon_full are not correct"= setequal(digest(paste(toString(sort(colnames(pokemon_full))), "1aa94")), "27e0d34718e8e114842877550ebbf53c"))
stopifnot("types of columns in pokemon_full are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(pokemon_full, class)))), "1aa94")), "c9687dc54b80fe8ee894aa3fce4ed9b8"))
stopifnot("values in one or more numerical columns in pokemon_full are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_full, is.numeric))) sort(round(sapply(pokemon_full[, sapply(pokemon_full, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "1aa94")), "52e4958b6e65eef8878476becd605fcc"))
stopifnot("values in one or more character columns in pokemon_full are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_full, is.character))) sum(sapply(pokemon_full[sapply(pokemon_full, is.character)], function(x) length(unique(x)))) else 0), "1aa94")), "79ea32a0828eb09672d7a4cd9c5fc1ae"))
stopifnot("values in one or more factor columns in pokemon_full are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_full, is.factor))) sum(sapply(pokemon_full[, sapply(pokemon_full, is.factor)], function(col) length(unique(col)))) else 0), "1aa94")), "1b3d9adf33ab1ab2b6e027384c274af6"))

print('Success!')

**Question 1.1**
<br> {points: 1}

To start exploring the Pokemon data, create a scatter plot matrix (or pairplot) using `ggpairs`. The plot should only contain the columns `Total` to `Speed` from `pokemon_full`. You can check the data wrangling chapter in the textbook to recall how to select a range of columns using `select` with `:`.

*Assign your answer to an object called `pokemon_pairs`. Make sure to set a suitable size for the plot.*
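
As a reminder of how `select` with `:` picks a contiguous range of columns, here is a minimal sketch on a made-up tibble. The object names (`toy`, `toy_range`) and column names are hypothetical and are not part of the Pokemon data.

```r
library(dplyr)

# Hypothetical toy data, only for illustrating range selection.
toy <- tibble(
  id     = 1:3,
  height = c(1.0, 2.0, 3.0),
  width  = c(4.0, 5.0, 6.0),
  depth  = c(7.0, 8.0, 9.0),
  note   = c("x", "y", "z")
)

# Keep every column from `height` through `depth`, in their original order.
toy_range <- toy |>
  select(height:depth)

colnames(toy_range)
```

The range is inclusive at both ends, so the first and last named columns are kept along with everything between them.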

In [None]:
# options(...)
#
# ... <- pokemon_full |> ... |>
#     ggpairs(aes(alpha = 0.05)) +
#     theme(text = element_text(size = 20))

# your code here
fail() # No Answer - remove if you provide an answer
pokemon_pairs

In [None]:
library(digest)
stopifnot("type of 'ggmatrix' %in% c(class(pokemon_pairs)) is not logical"= setequal(digest(paste(toString(class('ggmatrix' %in% c(class(pokemon_pairs)))), "27255")), "91690584d223b3c9e219098e717348b5"))
stopifnot("logical value of 'ggmatrix' %in% c(class(pokemon_pairs)) is not correct"= setequal(digest(paste(toString('ggmatrix' %in% c(class(pokemon_pairs))), "27255")), "fb42f48882e4435d9cd7cfa2928340f9"))

stopifnot("type of sort(pokemon_pairs$xAxisLabels) is not character"= setequal(digest(paste(toString(class(sort(pokemon_pairs$xAxisLabels))), "27256")), "02ccace9144085f83a1ac71af90d900b"))
stopifnot("length of sort(pokemon_pairs$xAxisLabels) is not correct"= setequal(digest(paste(toString(length(sort(pokemon_pairs$xAxisLabels))), "27256")), "16a38c5ae066350e3b2a8772d0fed2b0"))
stopifnot("value of sort(pokemon_pairs$xAxisLabels) is not correct"= setequal(digest(paste(toString(tolower(sort(pokemon_pairs$xAxisLabels))), "27256")), "370d89318f12662579d68a5ff348e6a2"))
stopifnot("letters in string value of sort(pokemon_pairs$xAxisLabels) are correct but case is not correct"= setequal(digest(paste(toString(sort(pokemon_pairs$xAxisLabels)), "27256")), "320e229d2d0a24ca2d29c09804e4c329"))

stopifnot("type of nrow(pokemon_pairs$data) is not integer"= setequal(digest(paste(toString(class(nrow(pokemon_pairs$data))), "27257")), "6afeb13a322705823ae0360b6e32b92f"))
stopifnot("length of nrow(pokemon_pairs$data) is not correct"= setequal(digest(paste(toString(length(nrow(pokemon_pairs$data))), "27257")), "1a06bfef1da6c8667c903e6fa84df486"))
stopifnot("values of nrow(pokemon_pairs$data) are not correct"= setequal(digest(paste(toString(sort(nrow(pokemon_pairs$data))), "27257")), "70a9e4b52d5491d4a73d8bb1dee1a716"))

stopifnot("type of ncol(pokemon_pairs$data) is not integer"= setequal(digest(paste(toString(class(ncol(pokemon_pairs$data))), "27258")), "7f02144cb9ae54825300767439bfcbf2"))
stopifnot("length of ncol(pokemon_pairs$data) is not correct"= setequal(digest(paste(toString(length(ncol(pokemon_pairs$data))), "27258")), "5b9a3f16e2c1fcb1f9a247bcc2e26aba"))
stopifnot("values of ncol(pokemon_pairs$data) are not correct"= setequal(digest(paste(toString(sort(ncol(pokemon_pairs$data))), "27258")), "196bdfa7194cc08033edb3bd513ad6fc"))

print('Success!')

**Question 1.2** 
<br> {points: 1}

From the pairplot above, it does not look like the Pokemon are separated into clear groups in any of the pairwise variable scatterplots. Here, we will continue exploring the relationship between `Speed` and `Defense` and see what happens if we try to cluster the data points on these two variables, even though there are no visually discernible clusters in the chart.

First, select the columns `Speed` and `Defense`, creating a new dataframe with only those columns.

*Assign your answer to an object named `pokemon`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pokemon

In [None]:
library(digest)
stopifnot("pokemon should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(pokemon)), "708e0")), "3c3509ecd49a0a5b15ba077575895d0e"))
stopifnot("dimensions of pokemon are not correct"= setequal(digest(paste(toString(dim(pokemon)), "708e0")), "bb82512a869ae129e3610da253ba2446"))
stopifnot("column names of pokemon are not correct"= setequal(digest(paste(toString(sort(colnames(pokemon))), "708e0")), "1253d89295645c4b605c4b472082858e"))
stopifnot("types of columns in pokemon are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(pokemon, class)))), "708e0")), "f92d756b33856a87b9598c385b0a0b53"))
stopifnot("values in one or more numerical columns in pokemon are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon, is.numeric))) sort(round(sapply(pokemon[, sapply(pokemon, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "708e0")), "8d13a28eead10298f1a2b08a17fdd0f4"))
stopifnot("values in one or more character columns in pokemon are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon, is.character))) sum(sapply(pokemon[sapply(pokemon, is.character)], function(x) length(unique(x)))) else 0), "708e0")), "2e8362aa0b66a4c7d8cc327ff394bacd"))
stopifnot("values in one or more factor columns in pokemon are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon, is.factor))) sum(sapply(pokemon[, sapply(pokemon, is.factor)], function(col) length(unique(col)))) else 0), "708e0")), "2e8362aa0b66a4c7d8cc327ff394bacd"))

print('Success!')

**Question 1.3**
<br> {points: 1}

Next, create a scatter plot of only these two variables so that we can look closely at their relationship. Put the `Speed` variable on the x-axis, and the `Defense` variable on the y-axis.

*Assign your plot to an object called `pokemon_scatter`. Don't forget to do everything needed to make an effective visualization, including setting an appropriate `alpha` value for the points.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pokemon_scatter

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(pokemon_scatter$layers)), function(i) {c(class(pokemon_scatter$layers[[i]]$geom))[1]})), "cf66a")), "ef19a90ef15880e21918b3e3b72c5c1c"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(pokemon_scatter$layers)), function(i) {rlang::get_expr(c(pokemon_scatter$layers[[i]]$mapping, pokemon_scatter$mapping)$x)}), as.character))), "cf66a")), "79f75baeacdfe656140fe1afef3aaa16"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(pokemon_scatter$layers)), function(i) {rlang::get_expr(c(pokemon_scatter$layers[[i]]$mapping, pokemon_scatter$mapping)$y)}), as.character))), "cf66a")), "d4c63b39022174ec6bc1061d2f168eac"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_scatter$layers[[1]]$mapping, pokemon_scatter$mapping)$x)!= pokemon_scatter$labels$x), "cf66a")), "2b4b02f943dbceec362dc3394c74d5ea"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_scatter$layers[[1]]$mapping, pokemon_scatter$mapping)$y)!= pokemon_scatter$labels$y), "cf66a")), "2b4b02f943dbceec362dc3394c74d5ea"))
stopifnot("incorrect colour variable in pokemon_scatter, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_scatter$layers[[1]]$mapping, pokemon_scatter$mapping)$colour)), "cf66a")), "0400c036fa8d4de2dd6e8ee92ece682d"))
stopifnot("incorrect shape variable in pokemon_scatter, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_scatter$layers[[1]]$mapping, pokemon_scatter$mapping)$shape)), "cf66a")), "0400c036fa8d4de2dd6e8ee92ece682d"))
stopifnot("the colour label in pokemon_scatter is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_scatter$layers[[1]]$mapping, pokemon_scatter$mapping)$colour) != pokemon_scatter$labels$colour), "cf66a")), "0400c036fa8d4de2dd6e8ee92ece682d"))
stopifnot("the shape label in pokemon_scatter is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_scatter$layers[[1]]$mapping, pokemon_scatter$mapping)$colour) != pokemon_scatter$labels$shape), "cf66a")), "0400c036fa8d4de2dd6e8ee92ece682d"))
stopifnot("fill variable in pokemon_scatter is not correct"= setequal(digest(paste(toString(quo_name(pokemon_scatter$mapping$fill)), "cf66a")), "ea60d61b6604c7ee845e572d18f38bbe"))
stopifnot("fill label in pokemon_scatter is not informative"= setequal(digest(paste(toString((quo_name(pokemon_scatter$mapping$fill) != pokemon_scatter$labels$fill)), "cf66a")), "0400c036fa8d4de2dd6e8ee92ece682d"))
stopifnot("position argument in pokemon_scatter is not correct"= setequal(digest(paste(toString(class(pokemon_scatter$layers[[1]]$position)[1]), "cf66a")), "b2dd569853eaf1b0c70f612a50dfc565"))

stopifnot("pokemon_scatter$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(pokemon_scatter$data)), "cf66b")), "c51723a5c89616ef4c70c6f1ee0fe4d4"))
stopifnot("dimensions of pokemon_scatter$data are not correct"= setequal(digest(paste(toString(dim(pokemon_scatter$data)), "cf66b")), "6f0e7394ce5d36ba354306be4f7417a6"))
stopifnot("column names of pokemon_scatter$data are not correct"= setequal(digest(paste(toString(sort(colnames(pokemon_scatter$data))), "cf66b")), "980dfa932c3dfeaa173c437736fd6316"))
stopifnot("types of columns in pokemon_scatter$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(pokemon_scatter$data, class)))), "cf66b")), "65dfe487d25a56c0e18aa3d556df25a1"))
stopifnot("values in one or more numerical columns in pokemon_scatter$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_scatter$data, is.numeric))) sort(round(sapply(pokemon_scatter$data[, sapply(pokemon_scatter$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "cf66b")), "b89544bcf7d86f48f70e6011bf8ebdb5"))
stopifnot("values in one or more character columns in pokemon_scatter$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_scatter$data, is.character))) sum(sapply(pokemon_scatter$data[sapply(pokemon_scatter$data, is.character)], function(x) length(unique(x)))) else 0), "cf66b")), "58722f8302fb2e519f7c6b05801d8d7e"))
stopifnot("values in one or more factor columns in pokemon_scatter$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_scatter$data, is.factor))) sum(sapply(pokemon_scatter$data[, sapply(pokemon_scatter$data, is.factor)], function(col) length(unique(col)))) else 0), "cf66b")), "58722f8302fb2e519f7c6b05801d8d7e"))

print('Success!')

**Question 1.4.1** 
<br> {points: 3}

The chart above confirms what we saw in the pairplot; there don't seem to be any visually distinct clusters of points in these two dimensions. Could it still be informative to run clustering with this data? Let's find out by using K-means to cluster the Pokemon based on their `Speed` and `Defense`.

So far when using K-means, we have scaled our input features. Will it matter much for our clustering if we scale our variables for the Pokemon data? Is there any argument against scaling here?
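
To recall what standardization does mechanically, the sketch below uses base R's `scale` on made-up measurements; the data frame `toy` and its columns are hypothetical. It only shows the transformation itself and does not settle whether scaling is appropriate for the Pokemon data.

```r
# Hypothetical toy data: two variables recorded in very different units.
toy <- data.frame(
  length_mm = c(10, 20, 30, 40, 50),
  mass_kg   = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# scale() subtracts each column's mean and divides by its standard deviation.
toy_scaled <- as.data.frame(scale(toy))

# After standardization, every column has mean (approximately) 0 and
# standard deviation 1, so no variable dominates Euclidean distances
# purely because of its units.
colMeans(toy_scaled)
apply(toy_scaled, 2, sd)
```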

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.4.2**
<br> {points: 1}

Now, let's use K-means to cluster the Pokemon based on their `Speed` and `Defense` variables.
- Create a recipe named `pokemon_recipe` that standardizes the data.
- Create a model specification named `pokemon_spec` for K-means clustering with 4 clusters.
- Fit the model using a `tidymodels` workflow; call the output of the `fit()` function `pokemon_clustering`.

*Assign your answers to objects called `pokemon_recipe`, `pokemon_spec`, and `pokemon_clustering`.*

**Note:** We set the random seed here because K-means initializes observations to random clusters.
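
The effect of fixing the seed can be checked with a small sketch that uses base R's `kmeans` rather than the `tidyclust` workflow asked for in this question. The data frame `toy`, the seed value 42, and the choice of 3 centers are all arbitrary assumptions for illustration.

```r
# Hypothetical toy data with eight points in two dimensions.
toy <- data.frame(
  x = c(1, 2, 1.5, 8, 9, 8.5, 4, 5),
  y = c(1, 1.5, 2, 8, 8.5, 9, 5, 4)
)

# Run K-means twice, resetting the random number generator to the same
# state before each run.
set.seed(42)
fit_a <- kmeans(toy, centers = 3)

set.seed(42)
fit_b <- kmeans(toy, centers = 3)

# Same seed, same random initialization, hence the same cluster labels.
identical(fit_a$cluster, fit_b$cluster)  # TRUE
```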

In [None]:
#DON'T CHANGE THE SEED VALUE BELOW!
set.seed(2019)

# your code here
fail() # No Answer - remove if you provide an answer
pokemon_clustering

In [None]:
library(digest)
stopifnot("pokemon_recipe should be a recipe"= setequal(digest(paste(toString('recipe' %in% class(pokemon_recipe)), "58547")), "f5cff1c5bee02672dbf67dc61fd5df8d"))
stopifnot("response variable of pokemon_recipe is not correct"= setequal(digest(paste(toString(sort(filter(pokemon_recipe$var_info, role == 'outcome')$variable)), "58547")), "cd72c1f6151f9aec69357e0046b410d3"))
stopifnot("predictor variable(s) of pokemon_recipe are not correct"= setequal(digest(paste(toString(sort(filter(pokemon_recipe$var_info, role == 'predictor')$variable)), "58547")), "a289dc410476c2ed7231ad6b90ec5916"))
stopifnot("pokemon_recipe does not contain the correct data, might need to be standardized"= setequal(digest(paste(toString(round(sum(bake(prep(pokemon_recipe), pokemon_recipe$template) %>% select_if(is.numeric), na.rm = TRUE), 2)), "58547")), "12a809b2b1ccecb2ac476154c263b1bb"))

stopifnot("pokemon_spec should be a k_means model specification"= setequal(digest(paste(toString('k_means' %in% class(pokemon_spec)), "58548")), "6ca191120dfabdddb1465d1807b793de"))
stopifnot("pokemon_spec did not specify to use the correct number of centers"= setequal(digest(paste(toString(quo_name(rlang::get_expr(pokemon_spec$args$num_clusters))), "58548")), "1809f6b4eecd1fa6762d00fcd7d5ce85"))
stopifnot("the engine specified in pokemon_spec is not correct"= setequal(digest(paste(toString(pokemon_spec$engine), "58548")), "bb9602e7df33ac86dc1aa990e8603af8"))
stopifnot("the nstart argument is not correct"= setequal(digest(paste(toString(rlang::get_expr(pokemon_spec$eng_args$nstart)), "58548")), "7179329549a814d76d76e36276e32e40"))

stopifnot("pokemon_clustering should be a workflow"= setequal(digest(paste(toString('workflow' %in% class(pokemon_clustering)), "58549")), "656910c0f5d5900eeb4e4c390f56de5b"))
stopifnot("computational engine used in pokemon_clustering is not correct"= setequal(digest(paste(toString(pokemon_clustering$fit$actions$model$spec$engine), "58549")), "4348d7dd6e9f9059f6d125da8b5bb4aa"))
stopifnot("model specification used in pokemon_clustering is not correct"= setequal(digest(paste(toString(pokemon_clustering$fit$actions$model$spec$mode), "58549")), "b24d804a4ad943e33d78d4c0b15aec0a"))
stopifnot("pokemon_clustering must be a trained workflow, make sure to call the fit() function"= setequal(digest(paste(toString(pokemon_clustering$trained), "58549")), "656910c0f5d5900eeb4e4c390f56de5b"))
stopifnot("predictor variable(s) of pokemon_clustering are not correct"= setequal(digest(paste(toString(sort(filter(pokemon_clustering$pre$actions$recipe$recipe$var_info, role == 'predictor')$variable)), "58549")), "22b8040f44974f1908251c2fa9186695"))
stopifnot("pokemon_clustering does not contain the correct data"= setequal(digest(paste(toString(sort(vapply(pokemon_clustering$pre$mold$predictors[, sapply(pokemon_clustering$pre$mold$predictors, is.numeric)], function(col) if(!is.null(col)) round(sum(col), 2) else NA_real_, numeric(1)), na.last = NA)), "58549")), "69b1b77de79aeceeed45bdb82ace807b"))
stopifnot("did not fit pokemon_clustering on the training dataset"= setequal(digest(paste(toString(nrow(pokemon_clustering$pre$mold$outcomes)), "58549")), "2844fc65413544e49f4dfd00469a65b9"))
stopifnot("for classification/regression models, weight function is not correct"= setequal(digest(paste(toString(quo_name(pokemon_clustering$fit$actions$model$spec$args$weight_func)), "58549")), "37a2b0c2bc2f660adfd7348864f3888d"))
stopifnot("for classification/regression models, response variable of pokemon_clustering is not correct"= setequal(digest(paste(toString(sort(filter(pokemon_clustering$pre$actions$recipe$recipe$var_info, role == 'outcome')$variable)), "58549")), "768199d025ec8a0887c54dc347aff6af"))
stopifnot("for KNN models, number of neighbours is not correct"= setequal(digest(paste(toString(quo_name(pokemon_clustering$fit$actions$model$spec$args$neighbors)), "58549")), "37a2b0c2bc2f660adfd7348864f3888d"))
stopifnot("for clustering models, the clustering is not correct"= setequal(digest(paste(toString(pokemon_clustering$fit$fit$fit$cluster), "58549")), "1d2d3e92f6b7d8ceb859ac52a924bace"))
stopifnot("for clustering models, the total within-cluster sum-of-squared distances is not correct"= setequal(digest(paste(toString(if (!is.null(pokemon_clustering$fit$fit$fit$tot.withinss)) round(pokemon_clustering$fit$fit$fit$tot.withinss, 2) else NULL), "58549")), "ffc2d844f28a1ec04d92a77d0ecf19ed"))

print('Success!')

**Question 1.5**
<br> {points: 1}

Let's visualize the clusters we built in `pokemon_clustering`. Use the `augment` function to create a dataframe called `clustered_pokemon`, then create a coloured scatter plot of `Speed` (x-axis) vs `Defense` (y-axis), with the points coloured by their cluster assignment. 

Name this plot `pokemon_clustering_plot`.
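
For reference, here is a sketch of the fit, augment, and plot pattern from the textbook chapter, applied to made-up data. All object names (`toy`, `toy_fit`, `toy_clustered`, `toy_plot`), the column names, and the choice of 2 clusters are hypothetical; adapt the idea rather than the specifics.

```r
library(tidyverse)
library(tidymodels)
library(tidyclust)

set.seed(1)

# Hypothetical toy data with two well-separated groups.
toy <- tibble(
  height = c(1.0, 1.2, 0.9, 5.0, 5.3, 4.8),
  weight = c(10, 12, 11, 50, 52, 49)
)

# Standardize all variables, then cluster with K-means.
toy_recipe <- recipe(~ ., data = toy) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

toy_spec <- k_means(num_clusters = 2) |>
  set_engine("stats")

toy_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_spec) |>
  fit(data = toy)

# augment() returns the original data plus a `.pred_cluster` column.
toy_clustered <- toy_fit |>
  augment(toy)

# Colour each point by its assigned cluster.
toy_plot <- ggplot(toy_clustered,
                   aes(x = height, y = weight, colour = .pred_cluster)) +
  geom_point() +
  labs(x = "Height", y = "Weight", colour = "Cluster")
```

Note that the plot uses the original, unscaled values on its axes; the standardization happens inside the workflow.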

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pokemon_clustering_plot

In [None]:
library(digest)
stopifnot("clustered_pokemon should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(clustered_pokemon)), "5af6f")), "d69c810441f5d1dc01756420773306a5"))
stopifnot("dimensions of clustered_pokemon are not correct"= setequal(digest(paste(toString(dim(clustered_pokemon)), "5af6f")), "b2cac8f60431ce5cfe1921307dadbdd2"))
stopifnot("column names of clustered_pokemon are not correct"= setequal(digest(paste(toString(sort(colnames(clustered_pokemon))), "5af6f")), "efe3857cd7aa9d070c0c2f016749607a"))
stopifnot("types of columns in clustered_pokemon are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(clustered_pokemon, class)))), "5af6f")), "e7c03bb6e0a99f09ac1a786377e69ee4"))
stopifnot("values in one or more numerical columns in clustered_pokemon are not correct"= setequal(digest(paste(toString(if (any(sapply(clustered_pokemon, is.numeric))) sort(round(sapply(clustered_pokemon[, sapply(clustered_pokemon, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "5af6f")), "565487bb5c95bf8b32750dec495ba407"))
stopifnot("values in one or more character columns in clustered_pokemon are not correct"= setequal(digest(paste(toString(if (any(sapply(clustered_pokemon, is.character))) sum(sapply(clustered_pokemon[sapply(clustered_pokemon, is.character)], function(x) length(unique(x)))) else 0), "5af6f")), "940dd52de52d4e5f2b3c7386b579140e"))
stopifnot("values in one or more factor columns in clustered_pokemon are not correct"= setequal(digest(paste(toString(if (any(sapply(clustered_pokemon, is.factor))) sum(sapply(clustered_pokemon[, sapply(clustered_pokemon, is.factor)], function(col) length(unique(col)))) else 0), "5af6f")), "b1929cbe0ddf12deca48095b2720bf6c"))

stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(pokemon_clustering_plot$layers)), function(i) {c(class(pokemon_clustering_plot$layers[[i]]$geom))[1]})), "5af70")), "f8ca3113e784f61eea2d0c8de9a217cb"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(pokemon_clustering_plot$layers)), function(i) {rlang::get_expr(c(pokemon_clustering_plot$layers[[i]]$mapping, pokemon_clustering_plot$mapping)$x)}), as.character))), "5af70")), "a81226e5e07f71f2fc93f81f8a36d820"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(pokemon_clustering_plot$layers)), function(i) {rlang::get_expr(c(pokemon_clustering_plot$layers[[i]]$mapping, pokemon_clustering_plot$mapping)$y)}), as.character))), "5af70")), "3cceaa6886acaecba19fcf6915094075"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_clustering_plot$layers[[1]]$mapping, pokemon_clustering_plot$mapping)$x)!= pokemon_clustering_plot$labels$x), "5af70")), "f9a369b926cd34504e492a3313537f6f"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_clustering_plot$layers[[1]]$mapping, pokemon_clustering_plot$mapping)$y)!= pokemon_clustering_plot$labels$y), "5af70")), "f9a369b926cd34504e492a3313537f6f"))
stopifnot("incorrect colour variable in pokemon_clustering_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_clustering_plot$layers[[1]]$mapping, pokemon_clustering_plot$mapping)$colour)), "5af70")), "05e4fbeef61020b0e35c7a01fa06eedf"))
stopifnot("incorrect shape variable in pokemon_clustering_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_clustering_plot$layers[[1]]$mapping, pokemon_clustering_plot$mapping)$shape)), "5af70")), "a0b2a9bc799ed9c809316afb42dddebb"))
stopifnot("the colour label in pokemon_clustering_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_clustering_plot$layers[[1]]$mapping, pokemon_clustering_plot$mapping)$colour) != pokemon_clustering_plot$labels$colour), "5af70")), "f9a369b926cd34504e492a3313537f6f"))
stopifnot("the shape label in pokemon_clustering_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_clustering_plot$layers[[1]]$mapping, pokemon_clustering_plot$mapping)$colour) != pokemon_clustering_plot$labels$shape), "5af70")), "a0b2a9bc799ed9c809316afb42dddebb"))
stopifnot("fill variable in pokemon_clustering_plot is not correct"= setequal(digest(paste(toString(quo_name(pokemon_clustering_plot$mapping$fill)), "5af70")), "beffe3d3484f52140f90e7121421dd32"))
stopifnot("fill label in pokemon_clustering_plot is not informative"= setequal(digest(paste(toString((quo_name(pokemon_clustering_plot$mapping$fill) != pokemon_clustering_plot$labels$fill)), "5af70")), "a0b2a9bc799ed9c809316afb42dddebb"))
stopifnot("position argument in pokemon_clustering_plot is not correct"= setequal(digest(paste(toString(class(pokemon_clustering_plot$layers[[1]]$position)[1]), "5af70")), "196097e37d62448428f0841ddf703738"))

stopifnot("pokemon_clustering_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(pokemon_clustering_plot$data)), "5af71")), "6ab9c39a08eadd3141a5ecbbdb40f048"))
stopifnot("dimensions of pokemon_clustering_plot$data are not correct"= setequal(digest(paste(toString(dim(pokemon_clustering_plot$data)), "5af71")), "445708a80ec7161bf61ae86ba852c253"))
stopifnot("column names of pokemon_clustering_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(pokemon_clustering_plot$data))), "5af71")), "34f96f4675f02313692669613f3a8a81"))
stopifnot("types of columns in pokemon_clustering_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(pokemon_clustering_plot$data, class)))), "5af71")), "2c7ec46df65cfbb940dd853fe19236b9"))
stopifnot("values in one or more numerical columns in pokemon_clustering_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_clustering_plot$data, is.numeric))) sort(round(sapply(pokemon_clustering_plot$data[, sapply(pokemon_clustering_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "5af71")), "ea7f8a697dc8e0c345d04184e320d845"))
stopifnot("values in one or more character columns in pokemon_clustering_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_clustering_plot$data, is.character))) sum(sapply(pokemon_clustering_plot$data[sapply(pokemon_clustering_plot$data, is.character)], function(x) length(unique(x)))) else 0), "5af71")), "902099d6ca07b8f640e270b70a6a8bdc"))
stopifnot("values in one or more factor columns in pokemon_clustering_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_clustering_plot$data, is.factor))) sum(sapply(pokemon_clustering_plot$data[, sapply(pokemon_clustering_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "5af71")), "751f24917d4014708c1bd1a763d9c565"))

print('Success!')

**Question 1.6**
<br> {points: 3}

Below you can see multiple initializations of K-means with different seeds for `K = 4`. Can you explain what is happening and how we can mitigate this in the `k_means` function?

![](imgs/multiple_initializations.png)

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.7**
<br> {points: 1}

We know that comparing how the WSSD varies for multiple values of $K$ is an important step in selecting a suitable clustering model. That's what we will do next!

For this exercise, you will calculate the total within-cluster sum-of-squared distances for $K$ = 1 to $K$ = 10.

1. Create a tibble with the desired values of $K$.
2. Create a new model specification that sets `nstart` to 10 and tells `k_means` you want to tune the number of clusters.
3. Create a new workflow that uses `tune_cluster` to tune the number of clusters.
4. Use the `collect_metrics` function to collect the results.
5. Use `filter`, `select`, and `mutate` functions to construct a tibble with two columns named `num_clusters` and `total_WSSD`. Store that tibble in an object named `elbow_stats`.


*Assign your answer to a tibble object named `elbow_stats`. It should have the columns `num_clusters` and `total_WSSD`.*
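
The five steps above can be sketched on simulated data as follows. This mirrors the pattern shown in the textbook chapter; the data, the object names (`toy`, `toy_ks`, `toy_elbow`), and the range $K$ = 1 to 5 are assumptions made for this illustration only.

```r
library(tidyverse)
library(tidymodels)
library(tidyclust)

set.seed(1)

# Hypothetical simulated data: three loose groups of ten points each.
toy <- tibble(
  x = c(rnorm(10, mean = 0), rnorm(10, mean = 5), rnorm(10, mean = 10)),
  y = c(rnorm(10, mean = 0), rnorm(10, mean = 5), rnorm(10, mean = 0))
)

# Step 1: the values of K to try.
toy_ks <- tibble(num_clusters = 1:5)

toy_recipe <- recipe(~ ., data = toy) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Step 2: mark num_clusters for tuning and use several random restarts.
toy_tune_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats", nstart = 10)

# Steps 3 to 5: tune over the grid, collect the metrics, and keep only
# the total within-cluster sum-of-squared distances.
toy_elbow <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_tune_spec) |>
  tune_cluster(resamples = apparent(toy), grid = toy_ks) |>
  collect_metrics() |>
  filter(.metric == "sse_within_total") |>
  mutate(total_WSSD = mean) |>
  select(num_clusters, total_WSSD)
```

Here `apparent(toy)` tells `tune_cluster` to fit on the whole data set rather than on resampled subsets.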

In [None]:
set.seed(2020) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
elbow_stats

In [None]:
library(digest)
stopifnot("mutate_all(elbow_stats, as.double) should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(mutate_all(elbow_stats, as.double))), "668f4")), "b09f8b084dc3207747aa50950f15cedf"))
stopifnot("dimensions of mutate_all(elbow_stats, as.double) are not correct"= setequal(digest(paste(toString(dim(mutate_all(elbow_stats, as.double))), "668f4")), "c48731b11615da6bbc0b5288f42af1e3"))
stopifnot("column names of mutate_all(elbow_stats, as.double) are not correct"= setequal(digest(paste(toString(sort(colnames(mutate_all(elbow_stats, as.double)))), "668f4")), "f0ce6c086e80ab8410df2feeca4069fb"))
stopifnot("types of columns in mutate_all(elbow_stats, as.double) are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(mutate_all(elbow_stats, as.double), class)))), "668f4")), "98124f8dfa0d800b51e1a1626937b866"))
stopifnot("values in one or more numerical columns in mutate_all(elbow_stats, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_all(elbow_stats, as.double), is.numeric))) sort(round(sapply(mutate_all(elbow_stats, as.double)[, sapply(mutate_all(elbow_stats, as.double), is.numeric)], sum, na.rm = TRUE), 2)) else 0), "668f4")), "53611d7fa2dbb7ef608fc384b9bb6381"))
stopifnot("values in one or more character columns in mutate_all(elbow_stats, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_all(elbow_stats, as.double), is.character))) sum(sapply(mutate_all(elbow_stats, as.double)[sapply(mutate_all(elbow_stats, as.double), is.character)], function(x) length(unique(x)))) else 0), "668f4")), "56955cbeb9e8458fcd459bdfccf2e2af"))
stopifnot("values in one or more factor columns in mutate_all(elbow_stats, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_all(elbow_stats, as.double), is.factor))) sum(sapply(mutate_all(elbow_stats, as.double)[, sapply(mutate_all(elbow_stats, as.double), is.factor)], function(col) length(unique(col)))) else 0), "668f4")), "56955cbeb9e8458fcd459bdfccf2e2af"))

print('Success!')

**Question 1.8**
<br> {points: 1}

Let's visualize how WSSD changes as we vary the value of $K$. To do this, create the elbow plot. Put the within-cluster sum of squares on the y-axis, and the number of clusters on the x-axis.

*Assign your plot to an object called `elbow_plot`*.
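
An elbow plot is a line-and-point chart of WSSD against $K$. The sketch below draws one from invented numbers; the values in `toy_elbow` are made up purely to show the shape of the code and are not results from any real clustering.

```r
library(ggplot2)

# Made-up WSSD values for illustration only (not computed from real data).
toy_elbow <- data.frame(
  num_clusters = 1:6,
  total_WSSD   = c(100, 40, 15, 12, 10, 9)
)

toy_elbow_plot <- ggplot(toy_elbow, aes(x = num_clusters, y = total_WSSD)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters (K)",
       y = "Total within-cluster sum of squares") +
  scale_x_continuous(breaks = 1:6)
```

In a chart like this, the "elbow" is the value of $K$ after which adding more clusters yields only small further decreases in WSSD.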

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
elbow_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(elbow_plot$layers)), function(i) {c(class(elbow_plot$layers[[i]]$geom))[1]})), "63832")), "44d7933f4bf2a2b6f8dbd23c90b15a98"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(elbow_plot$layers)), function(i) {rlang::get_expr(c(elbow_plot$layers[[i]]$mapping, elbow_plot$mapping)$x)}), as.character))), "63832")), "7b2bbc0d6afbdb44f809a2390da0ea47"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(elbow_plot$layers)), function(i) {rlang::get_expr(c(elbow_plot$layers[[i]]$mapping, elbow_plot$mapping)$y)}), as.character))), "63832")), "d4051ecaa85e72a5faeb391fba6b2ece"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(elbow_plot$layers[[1]]$mapping, elbow_plot$mapping)$x)!= elbow_plot$labels$x), "63832")), "bc6c92903aa39fb264cdc4ca6af936cc"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(elbow_plot$layers[[1]]$mapping, elbow_plot$mapping)$y)!= elbow_plot$labels$y), "63832")), "bc6c92903aa39fb264cdc4ca6af936cc"))
stopifnot("incorrect colour variable in elbow_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(elbow_plot$layers[[1]]$mapping, elbow_plot$mapping)$colour)), "63832")), "1edbde18e42d7fc56caa41fdd81fbe20"))
stopifnot("incorrect shape variable in elbow_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(elbow_plot$layers[[1]]$mapping, elbow_plot$mapping)$shape)), "63832")), "1edbde18e42d7fc56caa41fdd81fbe20"))
stopifnot("the colour label in elbow_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(elbow_plot$layers[[1]]$mapping, elbow_plot$mapping)$colour) != elbow_plot$labels$colour), "63832")), "1edbde18e42d7fc56caa41fdd81fbe20"))
stopifnot("the shape label in elbow_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(elbow_plot$layers[[1]]$mapping, elbow_plot$mapping)$colour) != elbow_plot$labels$shape), "63832")), "1edbde18e42d7fc56caa41fdd81fbe20"))
stopifnot("fill variable in elbow_plot is not correct"= setequal(digest(paste(toString(quo_name(elbow_plot$mapping$fill)), "63832")), "b119e23edd0e72d5d99436c1ee8c14e5"))
stopifnot("fill label in elbow_plot is not informative"= setequal(digest(paste(toString((quo_name(elbow_plot$mapping$fill) != elbow_plot$labels$fill)), "63832")), "1edbde18e42d7fc56caa41fdd81fbe20"))
stopifnot("position argument in elbow_plot is not correct"= setequal(digest(paste(toString(class(elbow_plot$layers[[1]]$position)[1]), "63832")), "e6d7752dc11e07e8517ade3fae591668"))

stopifnot("mutate_all(elbow_plot$data, as.double) should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(mutate_all(elbow_plot$data, as.double))), "63833")), "7b6a009426693b9ec3d3e54d7a6f35d8"))
stopifnot("dimensions of mutate_all(elbow_plot$data, as.double) are not correct"= setequal(digest(paste(toString(dim(mutate_all(elbow_plot$data, as.double))), "63833")), "d5056596286eb195dabcd6ed4f474a9f"))
stopifnot("column names of mutate_all(elbow_plot$data, as.double) are not correct"= setequal(digest(paste(toString(sort(colnames(mutate_all(elbow_plot$data, as.double)))), "63833")), "b9968607c64c5859a284b90063fafb87"))
stopifnot("types of columns in mutate_all(elbow_plot$data, as.double) are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(mutate_all(elbow_plot$data, as.double), class)))), "63833")), "c8f03653e1fc198e93a075360081f2a7"))
stopifnot("values in one or more numerical columns in mutate_all(elbow_plot$data, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_all(elbow_plot$data, as.double), is.numeric))) sort(round(sapply(mutate_all(elbow_plot$data, as.double)[, sapply(mutate_all(elbow_plot$data, as.double), is.numeric)], sum, na.rm = TRUE), 2)) else 0), "63833")), "ba22a54a0568dd7945cf4af5cab27e40"))
stopifnot("values in one or more character columns in mutate_all(elbow_plot$data, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_all(elbow_plot$data, as.double), is.character))) sum(sapply(mutate_all(elbow_plot$data, as.double)[sapply(mutate_all(elbow_plot$data, as.double), is.character)], function(x) length(unique(x)))) else 0), "63833")), "37053f81463da731c11643bab29bc67a"))
stopifnot("values in one or more factor columns in mutate_all(elbow_plot$data, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_all(elbow_plot$data, as.double), is.factor))) sum(sapply(mutate_all(elbow_plot$data, as.double)[, sapply(mutate_all(elbow_plot$data, as.double), is.factor)], function(col) length(unique(col)))) else 0), "63833")), "37053f81463da731c11643bab29bc67a"))

print('Success!')

**Question 1.9** 
<br> {points: 3}

Based on the elbow plot above, what value of $K$ would you choose? Explain why.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.10**
<br> {points: 1}

Using the value that you chose for $K$, run the K-means algorithm with `nstart = 10` and assign your answer to an object called `pokemon_final_kmeans`.

Augment the data with the final cluster labels and assign your answer to an object called `pokemon_final_clusters`. 

Finally, create a plot called `pokemon_final_clusters_plot` to visualize the clusters. Include a title, colour the points by cluster, and make sure your axis labels are human-readable.
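
To make "augmenting the data with cluster labels" concrete, here is a small sketch on made-up data. It uses base R's `kmeans()` and `cbind()` as a stand-in for the tidyclust `augment` step, so the column name `cluster` and the objects `toy`, `toy_fit`, and `toy_clusters` are assumptions for illustration only.

```r
set.seed(1)
# Hypothetical toy data frame with two numeric columns (not the Pokemon data).
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 8)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 8))
)

# Fit K-means with K = 2 and 10 random restarts (base R, not tidyclust).
toy_fit <- kmeans(toy, centers = 2, nstart = 10)

# Attach the cluster label to each row. Storing it as a factor keeps the
# labels categorical, so plots treat them as discrete groups, not numbers.
toy_clusters <- cbind(toy, cluster = factor(toy_fit$cluster))
head(toy_clusters)
```

Each row keeps its original values and gains one extra column holding its cluster label, which is what lets you colour points by cluster in a plot.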

In [None]:
set.seed(2019) # DO NOT REMOVE
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(pokemon_final_clusters_plot$layers)), function(i) {c(class(pokemon_final_clusters_plot$layers[[i]]$geom))[1]})), "dbe40")), "32de4b0a553c54ade84e3fa11162b7c8"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(pokemon_final_clusters_plot$layers)), function(i) {rlang::get_expr(c(pokemon_final_clusters_plot$layers[[i]]$mapping, pokemon_final_clusters_plot$mapping)$x)}), as.character))), "dbe40")), "553d05250f7297f35e670d028ff71b2b"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(pokemon_final_clusters_plot$layers)), function(i) {rlang::get_expr(c(pokemon_final_clusters_plot$layers[[i]]$mapping, pokemon_final_clusters_plot$mapping)$y)}), as.character))), "dbe40")), "7dde31c2b83605afbd1e360ee0679b73"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_final_clusters_plot$layers[[1]]$mapping, pokemon_final_clusters_plot$mapping)$x)!= pokemon_final_clusters_plot$labels$x), "dbe40")), "a82e1e8e672d666f3c5216730310bc4e"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_final_clusters_plot$layers[[1]]$mapping, pokemon_final_clusters_plot$mapping)$y)!= pokemon_final_clusters_plot$labels$y), "dbe40")), "a82e1e8e672d666f3c5216730310bc4e"))
stopifnot("incorrect colour variable in pokemon_final_clusters_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_final_clusters_plot$layers[[1]]$mapping, pokemon_final_clusters_plot$mapping)$colour)), "dbe40")), "68d98290143df0a1fe80445d4d1b0f19"))
stopifnot("incorrect shape variable in pokemon_final_clusters_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_final_clusters_plot$layers[[1]]$mapping, pokemon_final_clusters_plot$mapping)$shape)), "dbe40")), "2edee403098e36fab01e0add32c4ae74"))
stopifnot("the colour label in pokemon_final_clusters_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_final_clusters_plot$layers[[1]]$mapping, pokemon_final_clusters_plot$mapping)$colour) != pokemon_final_clusters_plot$labels$colour), "dbe40")), "a82e1e8e672d666f3c5216730310bc4e"))
stopifnot("the shape label in pokemon_final_clusters_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(pokemon_final_clusters_plot$layers[[1]]$mapping, pokemon_final_clusters_plot$mapping)$colour) != pokemon_final_clusters_plot$labels$shape), "dbe40")), "2edee403098e36fab01e0add32c4ae74"))
stopifnot("fill variable in pokemon_final_clusters_plot is not correct"= setequal(digest(paste(toString(quo_name(pokemon_final_clusters_plot$mapping$fill)), "dbe40")), "337222069b752eb291807fa9d0dd5d34"))
stopifnot("fill label in pokemon_final_clusters_plot is not informative"= setequal(digest(paste(toString((quo_name(pokemon_final_clusters_plot$mapping$fill) != pokemon_final_clusters_plot$labels$fill)), "dbe40")), "2edee403098e36fab01e0add32c4ae74"))
stopifnot("position argument in pokemon_final_clusters_plot is not correct"= setequal(digest(paste(toString(class(pokemon_final_clusters_plot$layers[[1]]$position)[1]), "dbe40")), "cd14d7aeb9d5fb6e3c04386a08b5392e"))

stopifnot("pokemon_final_clusters_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(pokemon_final_clusters_plot$data)), "dbe41")), "fa4b1bbca4914b5a41b65146f16955b8"))
stopifnot("dimensions of pokemon_final_clusters_plot$data are not correct"= setequal(digest(paste(toString(dim(pokemon_final_clusters_plot$data)), "dbe41")), "2c0d60fd779bca602efa9d43ace64cd1"))
stopifnot("column names of pokemon_final_clusters_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(pokemon_final_clusters_plot$data))), "dbe41")), "37db9972f627dd200c60fee606a2114f"))
stopifnot("types of columns in pokemon_final_clusters_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(pokemon_final_clusters_plot$data, class)))), "dbe41")), "2f2b576f4f954f5ffd7b2f3721053a14"))
stopifnot("values in one or more numerical columns in pokemon_final_clusters_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_final_clusters_plot$data, is.numeric))) sort(round(sapply(pokemon_final_clusters_plot$data[, sapply(pokemon_final_clusters_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "dbe41")), "16ee0f934560bc2e9d47c764fa82214e"))
stopifnot("values in one or more character columns in pokemon_final_clusters_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_final_clusters_plot$data, is.character))) sum(sapply(pokemon_final_clusters_plot$data[sapply(pokemon_final_clusters_plot$data, is.character)], function(x) length(unique(x)))) else 0), "dbe41")), "aa6ec4b1c18e5b5f259b01eed552ea4b"))
stopifnot("values in one or more factor columns in pokemon_final_clusters_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(pokemon_final_clusters_plot$data, is.factor))) sum(sapply(pokemon_final_clusters_plot$data[, sapply(pokemon_final_clusters_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "dbe41")), "56a308146951896619abe0e0238dbeb2"))

stopifnot("type of is.character(pokemon_final_clusters_plot$labels$title) is not logical"= setequal(digest(paste(toString(class(is.character(pokemon_final_clusters_plot$labels$title))), "dbe42")), "1c30b3f263c386af3705b7ee6105f7b6"))
stopifnot("logical value of is.character(pokemon_final_clusters_plot$labels$title) is not correct"= setequal(digest(paste(toString(is.character(pokemon_final_clusters_plot$labels$title)), "dbe42")), "175acf11d28a98f4e89db96bc4f03f9c"))

print('Success!')

**Question 1.11**
<br> {points: 3}

This looks perhaps a bit better than when we used $K=4$ clusters originally, but is it really a lot better? Use the plot in Question 1.10 and the elbow plot from Question 1.8 to reason about what might be going on here.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

# 2. Tourism Reviews

![](https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif)
Source: https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif

The Ministry of Land, Infrastructure, Transport and Tourism of Japan is interested in knowing the types of tourists that visit East Asia. They know the [majority of their visitors come from this region](https://statistics.jnto.go.jp/en/graph/) and would like to stay competitive in the region to keep growing the tourism industry. To that end, they have hired us to segment the tourists. A [dataset from TripAdvisor](https://archive.ics.uci.edu/ml/datasets/Travel+Reviews) has been scraped and is provided to you.

This dataset contains the following variables:

- User ID : Unique user id 
- Category 1 : Average user feedback on art galleries 
- Category 2 : Average user feedback on dance clubs 
- Category 3 : Average user feedback on juice bars 
- Category 4 : Average user feedback on restaurants 
- Category 5 : Average user feedback on museums 
- Category 6 : Average user feedback on resorts 
- Category 7 : Average user feedback on parks/picnic spots 
- Category 8 : Average user feedback on beaches 
- Category 9 : Average user feedback on theaters 
- Category 10 : Average user feedback on religious institutions

**Question 2.0**
<br> {points: 3}

Load the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv and clean it so that only the Category # columns are in the data frame (i.e., remove the `User ID` column). 

Assign your answer to an object called `clean_reviews`.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of exists('clean_reviews') is not logical"= setequal(digest(paste(toString(class(exists('clean_reviews'))), "4d0d9")), "30315f07816bc8ae5795319bd8def115"))
stopifnot("logical value of exists('clean_reviews') is not correct"= setequal(digest(paste(toString(exists('clean_reviews')), "4d0d9")), "5673fe82f0f89f286676f8f91f3d3916"))

# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
print('Success!')

**Question 2.1**
<br> {points: 3}

Perform K-means and vary $K$ from 1 to 10 to identify the optimal number of clusters. Use `nstart = 100`. Assign your answer to a tibble object called `tourism_elbow_stats` that has the columns `num_clusters` and `total_WSSD`.

Afterwards, create an elbow plot to help you choose $K$. Assign your answer to an object called `tourism_elbow_plot`.

*Note: You may see a warning message indicating that your model did not converge within 10 iterations. Please ignore this message.*
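
The effect of `nstart` can be sketched on made-up data. This example assumes base R's `kmeans()` rather than the tidyclust interface used in this worksheet, and the objects `centres`, `toy`, `single_start`, and `many_starts` are hypothetical. Since `kmeans()` keeps the run with the lowest total WSSD among its random starts, using more starts should give a total WSSD that is no worse than a single start.

```r
set.seed(2)
# Hypothetical toy data: four 2D blobs (not the TripAdvisor data).
centres <- rbind(c(0, 0), c(0, 6), c(6, 0), c(6, 6))
toy <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(25, centres[i, 1]), rnorm(25, centres[i, 2]))
}))

# Reuse the same seed for both fits to keep the comparison like-for-like.
set.seed(2019)
single_start <- kmeans(toy, centers = 4, nstart = 1)
set.seed(2019)
many_starts <- kmeans(toy, centers = 4, nstart = 100)

# Compare the total WSSD of the two fits.
c(nstart_1 = single_start$tot.withinss,
  nstart_100 = many_starts$tot.withinss)
```

If the two numbers match, the single start happened to land on a good solution; if they differ, the single start was stuck in a poorer local optimum.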

In [None]:
#DON'T CHANGE THIS SEED VALUE
set.seed(2019)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of exists('tourism_elbow_stats') is not logical"= setequal(digest(paste(toString(class(exists('tourism_elbow_stats'))), "400c0")), "7d02cd6ac551938cebcdf3777f1f7143"))
stopifnot("logical value of exists('tourism_elbow_stats') is not correct"= setequal(digest(paste(toString(exists('tourism_elbow_stats')), "400c0")), "6a1b2191ba194266195096b5dfeca8e6"))

stopifnot("type of exists('tourism_elbow_plot') is not logical"= setequal(digest(paste(toString(class(exists('tourism_elbow_plot'))), "400c1")), "a1cf4ad42fd76fe3f7437804af517bbd"))
stopifnot("logical value of exists('tourism_elbow_plot') is not correct"= setequal(digest(paste(toString(exists('tourism_elbow_plot')), "400c1")), "7593126db59bc8c603b0ff2b73ba23c2"))

# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
print('Success!')

**Question 2.2** 
<br> {points: 3}

From the elbow plot above, which $K$ should you choose? Explain why you chose that $K$.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.3**
<br> {points: 3}

Run K-means again, with the optimal $K$, and assign your answer to an object called `tourism_model_2`. Use `nstart = 100`. Then, use the `augment` function to get the cluster assignments for each point. Name the data frame `cluster_assignments`.

In [None]:
#DON'T CHANGE THIS SEED VALUE
set.seed(2019)

# your code here
fail() # No Answer - remove if you provide an answer
cluster_assignments

Use the plot below as a reference for the next two questions.

> The visualization below is a density plot, which you can think of as a smoothed version of a histogram. Density plots are more effective than histograms for comparing multiple distributions. In these visualizations, we are looking for variables whose distributions differ between the clusters.
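
To see what "distributions differ between clusters" can look like, here is a minimal sketch on made-up ratings for two groups. It assumes base R's `density()` and `plot()` rather than `geom_density`, and `group_a`, `group_b`, and the other names are hypothetical.

```r
set.seed(3)
# Hypothetical ratings for two groups (not the TripAdvisor data).
group_a <- rnorm(200, mean = 2, sd = 0.5)
group_b <- rnorm(200, mean = 3.5, sd = 0.5)

# density() returns a smooth estimate evaluated on a grid of x values.
dens_a <- density(group_a)
dens_b <- density(group_b)

# Overlay the two curves on shared axes.
plot(dens_a,
     xlim = range(c(dens_a$x, dens_b$x)),
     ylim = c(0, max(dens_a$y, dens_b$y)),
     main = "Toy density curves", xlab = "Rating")
lines(dens_b, lty = 2)
legend("topright", legend = c("Group A", "Group B"), lty = c(1, 2))

# The area under each curve is approximately 1.
area_a <- sum(dens_a$y) * diff(dens_a$x[1:2])
area_a
```

When two curves barely overlap, as in this toy case, that variable separates the groups well; when the curves sit on top of each other, it tells us little about group membership.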

In [None]:
options(repr.plot.height = 8, repr.plot.width = 15)
cluster_assignments |>
    pivot_longer(cols = -.pred_cluster, names_to = 'category', values_to = 'value')  |> 
    ggplot(aes(value, fill = .pred_cluster)) +
        geom_density(alpha = 0.4, colour = 'white') +
        # We set the facet scales to "free" since we standardized the rating values before clustering them,
        # which means that their original range (which is what we show here) does not matter
        facet_wrap(facets = vars(category), scales = 'free') +
        theme_minimal() +
        theme(text = element_text(size = 20))

**Question 2.4** Multiple Choice:
<br> {points: 1}

From the plots above, which categories might we hypothesize are driving the clustering (i.e., are useful for distinguishing between types of tourists)? The categories are listed below.

- Category 1 : Average user feedback on art galleries 
- Category 2 : Average user feedback on dance clubs 
- Category 3 : Average user feedback on juice bars 
- Category 4 : Average user feedback on restaurants 
- Category 5 : Average user feedback on museums 
- Category 6 : Average user feedback on resorts 
- Category 7 : Average user feedback on parks/picnic spots 
- Category 8 : Average user feedback on beaches 
- Category 9 : Average user feedback on theaters 
- Category 10 : Average user feedback on religious institutions

A. 10, 3, 5, 6, 7

B. 10, 3, 5, 6, 1

C. 10, 3, 4, 6, 7

D. 10, 2, 5, 6, 7

*Assign your answer to an object called `answer2.4`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
answer2.4

In [None]:
library(digest)
stopifnot("type of exists('answer2.4') is not logical"= setequal(digest(paste(toString(class(exists('answer2.4'))), "8c6d4")), "a4a37dbdda577e23831b184df4689bcc"))
stopifnot("logical value of exists('answer2.4') is not correct"= setequal(digest(paste(toString(exists('answer2.4')), "8c6d4")), "c2d53029cfec982f1ef2ca5904b7c1bc"))

# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
print('Success!')

**Question 2.5** 
<br> {points: 3}

Discuss one disadvantage of not being able to visualize the clusters when dealing with multidimensional data.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

In [None]:
source("cleanup.R")