# Worksheet: Clustering

This worksheet covers the [Clustering](https://datasciencebook.ca/clustering.html) chapter of the online textbook, which also lists the learning objectives for this worksheet. You should read the textbook chapter before attempting this worksheet. 

In [None]:
### Run this cell before continuing.
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(forcats)
library(repr)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

**Question 0.0** Multiple Choice:
<br> {points: 1}

In which of the following scenarios would clustering methods likely be appropriate?

A. Identifying sub-groups of houses according to their house type, value, and geographical location

B. Predicting whether a given user will click on an ad on a website

C. Segmenting customers based on their preferences to target advertising

D. Both A. and B.

E. Both A. and C. 

*Assign your answer to an object called `answer0.0`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer0.0 is not character"= setequal(digest(paste(toString(class(answer0.0)), "687d1")), "b0606851f8ab1403affe70dae88277a1"))
stopifnot("length of answer0.0 is not correct"= setequal(digest(paste(toString(length(answer0.0)), "687d1")), "8be1aa3b66de219b03427ea3a762d5f1"))
stopifnot("value of answer0.0 is not correct"= setequal(digest(paste(toString(tolower(answer0.0)), "687d1")), "ca83714cf7394644cf305070b2af6f05"))
stopifnot("letters in string value of answer0.0 are correct but case is not correct"= setequal(digest(paste(toString(answer0.0), "687d1")), "1fa65423585b1341f90e6244f1deb7ab"))

print('Success!')

**Question 0.1** Multiple Choice:
<br> {points: 1}

Which step in the description of the K-means algorithm below is *incorrect*?

0. Choose the number of clusters

1. Randomly assign each of the points to one of the clusters

2. Calculate the position for the cluster centre (centroid) for each of the clusters (this is the middle of the points in the cluster, as measured by straight-line distance)

3. Re-assign each of the points to the cluster whose centroid is furthest from that point

4. Repeat steps 2 - 3 until the cluster centroids don't change at all

*Assign your answer to an object called `answer0.1`. Your answer should be a single numerical character surrounded by quotes.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer0.1 is not character"= setequal(digest(paste(toString(class(answer0.1)), "8c44c")), "2afe90074fad3ec3878a60a49c4f380e"))
stopifnot("length of answer0.1 is not correct"= setequal(digest(paste(toString(length(answer0.1)), "8c44c")), "5ef1b720a52eb9027ba23705d33bbf41"))
stopifnot("value of answer0.1 is not correct"= setequal(digest(paste(toString(tolower(answer0.1)), "8c44c")), "a93e7b85ea377bee22defced1879188f"))
stopifnot("letters in string value of answer0.1 are correct but case is not correct"= setequal(digest(paste(toString(answer0.1), "8c44c")), "a93e7b85ea377bee22defced1879188f"))

print('Success!')

## Hoppy Craft Beer

Craft beer is a strong market in Canada and the US, and is expanding to other countries as well. If you wanted to get into the craft beer brewing market, you might want to better understand the product landscape. One popular craft beer product is hopped craft beer. Breweries create and label many different kinds of hopped craft beer, but how many kinds are there really when you look at the chemical properties instead of the human labels?

We will start exploring this question using a [craft beer data set from Kaggle](https://www.kaggle.com/nickhould/craft-cans#beers.csv). In this data set, we will use the alcohol by volume (`abv` column) and the International Bittering Units (`ibu` column) as variables to try to cluster the beers. The `abv` variable ranges from 0 (no alcohol) to 1 (pure alcohol), and the `ibu` variable quantifies the bitterness of the beer (higher values indicate higher bitterness).

**Question 1.0** 
<br> {points: 1}

Read in the `beers.csv` data using `read_csv()` and assign it to an object called `beer`. The data is located within the `data/` folder. 

*Assign your dataframe answer to an object called `beer`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
beer

In [None]:
library(digest)
stopifnot("beer should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(beer)), "af0d9")), "65a634e10902025124b0cdeae053cdba"))
stopifnot("dimensions of beer are not correct"= setequal(digest(paste(toString(dim(beer)), "af0d9")), "e95f3a4ef97858969ce19f3a38d95329"))
stopifnot("column names of beer are not correct"= setequal(digest(paste(toString(sort(colnames(beer))), "af0d9")), "2efaff770197da938bf0ba3a76735251"))
stopifnot("types of columns in beer are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(beer, class)))), "af0d9")), "59f1716f2874dca774d51856e34a74f1"))
stopifnot("values in one or more numerical columns in beer are not correct"= setequal(digest(paste(toString(if (any(sapply(beer, is.numeric))) sort(round(sapply(beer[, sapply(beer, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "af0d9")), "51fe6a9e4bb8100d70a2ac0758457477"))
stopifnot("values in one or more character columns in beer are not correct"= setequal(digest(paste(toString(if (any(sapply(beer, is.character))) sum(sapply(beer[sapply(beer, is.character)], function(x) length(unique(x)))) else 0), "af0d9")), "9c90ba9bc164febcc8c0c72bc1e11550"))
stopifnot("values in one or more factor columns in beer are not correct"= setequal(digest(paste(toString(if (any(sapply(beer, is.factor))) sum(sapply(beer[, sapply(beer, is.factor)], function(col) length(unique(col)))) else 0), "af0d9")), "8a2d3e749ec4aedda74265256047e7e6"))

print('Success!')

**Question 1.1**
<br> {points: 1}

Let's start by visualizing the variables we are going to use in our cluster analysis as a scatter plot. Put `ibu` on the horizontal axis and `abv` on the vertical axis.

*Assign your plot to an object named `beer_plot`, and remember to follow the best visualization practices, including adding human-readable labels to your plot.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
beer_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(beer_plot$layers)), function(i) {c(class(beer_plot$layers[[i]]$geom))[1]})), "8dc79")), "d83478f91f40b42aaa8f416949f79068"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(beer_plot$layers)), function(i) {rlang::get_expr(c(beer_plot$layers[[i]]$mapping, beer_plot$mapping)$x)}), as.character))), "8dc79")), "71fb719b6782a1af193c210eb75096d2"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(beer_plot$layers)), function(i) {rlang::get_expr(c(beer_plot$layers[[i]]$mapping, beer_plot$mapping)$y)}), as.character))), "8dc79")), "b3dd7ca08f1f780dda9032a62d1104bf"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(beer_plot$layers[[1]]$mapping, beer_plot$mapping)$x)!= beer_plot$labels$x), "8dc79")), "f80e9a8ca7c8408f4132aa6dcec05d81"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(beer_plot$layers[[1]]$mapping, beer_plot$mapping)$y)!= beer_plot$labels$y), "8dc79")), "f80e9a8ca7c8408f4132aa6dcec05d81"))
stopifnot("incorrect colour variable in beer_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(beer_plot$layers[[1]]$mapping, beer_plot$mapping)$colour)), "8dc79")), "cdf5bcc08dd28c04aa2dd4f57a5b7619"))
stopifnot("incorrect shape variable in beer_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(beer_plot$layers[[1]]$mapping, beer_plot$mapping)$shape)), "8dc79")), "cdf5bcc08dd28c04aa2dd4f57a5b7619"))
stopifnot("the colour label in beer_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(beer_plot$layers[[1]]$mapping, beer_plot$mapping)$colour) != beer_plot$labels$colour), "8dc79")), "cdf5bcc08dd28c04aa2dd4f57a5b7619"))
stopifnot("the shape label in beer_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(beer_plot$layers[[1]]$mapping, beer_plot$mapping)$colour) != beer_plot$labels$shape), "8dc79")), "cdf5bcc08dd28c04aa2dd4f57a5b7619"))
stopifnot("fill variable in beer_plot is not correct"= setequal(digest(paste(toString(quo_name(beer_plot$mapping$fill)), "8dc79")), "169169e4051f31db60caedca614d1696"))
stopifnot("fill label in beer_plot is not informative"= setequal(digest(paste(toString((quo_name(beer_plot$mapping$fill) != beer_plot$labels$fill)), "8dc79")), "cdf5bcc08dd28c04aa2dd4f57a5b7619"))
stopifnot("position argument in beer_plot is not correct"= setequal(digest(paste(toString(class(beer_plot$layers[[1]]$position)[1]), "8dc79")), "9f9330e52ee60aec346db5ca44dc3691"))

stopifnot("beer_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(beer_plot$data)), "8dc7a")), "306ae95914bb8e022832ca9a04a70496"))
stopifnot("dimensions of beer_plot$data are not correct"= setequal(digest(paste(toString(dim(beer_plot$data)), "8dc7a")), "36f51ecdf3c29306725ea0cdb479d273"))
stopifnot("column names of beer_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(beer_plot$data))), "8dc7a")), "4730bbb7f6dd2d8155c08b9e8ee6de74"))
stopifnot("types of columns in beer_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(beer_plot$data, class)))), "8dc7a")), "0d39c844fdb302a2375d625450131646"))
stopifnot("values in one or more numerical columns in beer_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(beer_plot$data, is.numeric))) sort(round(sapply(beer_plot$data[, sapply(beer_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "8dc7a")), "6f5b967ac30c3bf24b2c1eeca246afa1"))
stopifnot("values in one or more character columns in beer_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(beer_plot$data, is.character))) sum(sapply(beer_plot$data[sapply(beer_plot$data, is.character)], function(x) length(unique(x)))) else 0), "8dc7a")), "d5172022a7473ef74101c8c391f920f6"))
stopifnot("values in one or more factor columns in beer_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(beer_plot$data, is.factor))) sum(sapply(beer_plot$data[, sapply(beer_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "8dc7a")), "20ec1157a96415e272d5d460dc4c0e7a"))

print('Success!')

**Question 1.2**
<br> {points: 1}

We need to clean this data a bit. Specifically, we need to remove the rows where `ibu` is `NA`, and select only the columns we will use for clustering, which are `ibu` and `abv`.

*Assign your answer to an object named `clean_beer`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
clean_beer

In [None]:
library(digest)
stopifnot("clean_beer should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(clean_beer)), "54f74")), "3e8934df0d3c845f84b67318998bfca2"))
stopifnot("dimensions of clean_beer are not correct"= setequal(digest(paste(toString(dim(clean_beer)), "54f74")), "5ef23a730675c079e340e18114d93858"))
stopifnot("column names of clean_beer are not correct"= setequal(digest(paste(toString(sort(colnames(clean_beer))), "54f74")), "ea35b9d11845b882d528dda27651dbb3"))
stopifnot("types of columns in clean_beer are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(clean_beer, class)))), "54f74")), "5271b114451a38e6d8aefc5ad0945bae"))
stopifnot("values in one or more numerical columns in clean_beer are not correct"= setequal(digest(paste(toString(if (any(sapply(clean_beer, is.numeric))) sort(round(sapply(clean_beer[, sapply(clean_beer, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "54f74")), "ab04570224daf680c8b2d58ebfec29c0"))
stopifnot("values in one or more character columns in clean_beer are not correct"= setequal(digest(paste(toString(if (any(sapply(clean_beer, is.character))) sum(sapply(clean_beer[sapply(clean_beer, is.character)], function(x) length(unique(x)))) else 0), "54f74")), "14d49f5f36537602227474bc7326644f"))
stopifnot("values in one or more factor columns in clean_beer are not correct"= setequal(digest(paste(toString(if (any(sapply(clean_beer, is.factor))) sum(sapply(clean_beer[, sapply(clean_beer, is.factor)], function(col) length(unique(col)))) else 0), "54f74")), "14d49f5f36537602227474bc7326644f"))

print('Success!')

**Question 1.3** Multiple Choice:
<br> {points: 1}

Why do we need to scale the variables when using K-means clustering?

A. K-means uses the Euclidean distance to compute how similar data points are to each cluster center

B. K-means is an iterative algorithm

C. Some variables might be more important for prediction than others

D. To make sure their mean is 0

*Assign your answer to an object named `answer1.3`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.3 is not character"= setequal(digest(paste(toString(class(answer1.3)), "75192")), "6df0cbdd2741cf1a8f336efcf3b44b15"))
stopifnot("length of answer1.3 is not correct"= setequal(digest(paste(toString(length(answer1.3)), "75192")), "d64af15441c6d637ced918a86298e49a"))
stopifnot("value of answer1.3 is not correct"= setequal(digest(paste(toString(tolower(answer1.3)), "75192")), "02fd308614967d37c296f81259f682de"))
stopifnot("letters in string value of answer1.3 are correct but case is not correct"= setequal(digest(paste(toString(answer1.3), "75192")), "289ca03fc96b458a1a4f4113328bc164"))

print('Success!')

**Question 1.4**
<br> {points: 1}

We will now build a `tidymodels` workflow to cluster the data. The first step is to create a `recipe` that specifies that we want to center and scale all of the variables in the `clean_beer` data frame. 

*Recall that we used a `recipe` for scaling when doing classification and regression. Even though `recipe`s were originally designed for predictive modeling tasks (like classification and regression), the `tidyclust` library lets us use our familiar `tidymodels` functions for clustering too!*
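
As a rough illustration of what centering and scaling do to the numbers, the sketch below standardizes a small made-up data frame with base R's `scale()` rather than with a `recipe`. The data frame `toy` and its columns `height_cm` and `weight_kg` are hypothetical and are not part of the worksheet data.

```r
# Hypothetical toy data; not related to the beer data set.
toy <- data.frame(
  height_cm = c(150, 160, 170, 180, 190),
  weight_kg = c(50, 58, 66, 80, 96)
)

# scale() subtracts each column's mean (centering) and then divides
# by each column's standard deviation (scaling).
toy_scaled <- as.data.frame(scale(toy))

# After standardizing, every column has mean 0 and standard deviation 1
# (up to floating-point rounding).
col_means <- sapply(toy_scaled, mean)
col_sds <- sapply(toy_scaled, sd)
print(round(col_means, 10))
print(col_sds)
```

The two columns start on very different numeric ranges, but after standardization they are expressed in the same units (standard deviations from the mean).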

*Assign your answer to an object named `kmeans_recipe`. Use the scaffolding provided.*

In [None]:
# ... <- ...( ~ . , ...) |> 
#        ...(...) |>
#        ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_recipe

In [None]:
library(digest)
stopifnot("kmeans_recipe should be a recipe"= setequal(digest(paste(toString('recipe' %in% class(kmeans_recipe)), "b2221")), "946ae82dcfd2e3fc1fc519c860e274f4"))
stopifnot("response variable of kmeans_recipe is not correct"= setequal(digest(paste(toString(sort(filter(kmeans_recipe$var_info, role == 'outcome')$variable)), "b2221")), "f4aa5da6f857ea0c153ee7730caafdaa"))
stopifnot("predictor variable(s) of kmeans_recipe are not correct"= setequal(digest(paste(toString(sort(filter(kmeans_recipe$var_info, role == 'predictor')$variable)), "b2221")), "7cb38a67e086e18f8b9e5a2f1536c186"))
stopifnot("kmeans_recipe does not contain the correct data, might need to be standardized"= setequal(digest(paste(toString(round(sum(bake(prep(kmeans_recipe), kmeans_recipe$template) %>% select_if(is.numeric), na.rm = TRUE), 2)), "b2221")), "7f7135393b20824949ef6ebf96db8353"))

print('Success!')

**Question 1.5**
<br> {points: 1}

The next step in our `tidymodels` workflow is a model specification that specifies that we want to cluster the data. From our exploratory data visualization, 2 seems like a reasonable number of clusters. Use the `k_means` function with `num_clusters = 2` to perform clustering with this choice of $k$. Make sure to use the "stats" engine.

*Assign your answer to an object named `kmeans_spec`. Use the scaffolding provided.*

In [None]:
# ... <- ...(... = ...) |>
#        ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_spec

In [None]:
library(digest)
stopifnot("kmeans_spec should be a k_means model specification"= setequal(digest(paste(toString('k_means' %in% class(kmeans_spec)), "6456d")), "03bf77c94b9e19a81d380cacb7ce1fab"))
stopifnot("kmeans_spec did not specify to use the correct number of centers"= setequal(digest(paste(toString(quo_name(rlang::get_expr(kmeans_spec$args$num_clusters))), "6456d")), "052a40a07a4683bca86bfb89871694d3"))
stopifnot("the engine specified in kmeans_spec is not correct"= setequal(digest(paste(toString(kmeans_spec$engine), "6456d")), "1cfa52e130f8f8f7a1f7400a16cdc900"))
stopifnot("the nstart argument is not correct"= setequal(digest(paste(toString(rlang::get_expr(kmeans_spec$eng_args$nstart)), "6456d")), "10922a4790d665038bfee1e22452dc93"))

print('Success!')

**Question 1.6**
<br> {points: 1}

Combine the recipe and model specification into a `workflow`, and fit the `workflow` on the `clean_beer` data.

*Assign your model to an object named `kmeans_fit`. Note that since K-means uses a random initialization, we need to set the seed; don't change the value!*
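
To see why the seed matters, here is a minimal sketch on made-up data. It calls base R's `kmeans()` directly instead of a `tidymodels` workflow, and the names `toy`, `fit_a`, `fit_b` and `same_clusters` are hypothetical. Resetting the same seed before each call makes the random starting centres, and therefore the resulting clustering, reproducible.

```r
# Hypothetical toy data: 40 random points in two dimensions.
set.seed(2024)
toy <- matrix(rnorm(80), ncol = 2)

# Reset the same seed before each call so that both runs draw
# the same random starting centres.
set.seed(1)
fit_a <- kmeans(toy, centers = 3)
set.seed(1)
fit_b <- kmeans(toy, centers = 3)

# Compare the cluster assignments from the two runs.
same_clusters <- identical(fit_a$cluster, fit_b$cluster)
print(same_clusters)
```

Without resetting the seed, two runs may start from different centres and can label or even group the points differently.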

In [None]:
# DON'T CHANGE THE SEED VALUE!
set.seed(1234)

# ... <- ...() |>
#     ...(...) |>
#     ...(...) |>
#     ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_fit

In [None]:
library(digest)
stopifnot("kmeans_fit should be a workflow"= setequal(digest(paste(toString('workflow' %in% class(kmeans_fit)), "bffe0")), "7c0077287bd812f15aacf90036877e3b"))
stopifnot("computational engine used in kmeans_fit is not correct"= setequal(digest(paste(toString(kmeans_fit$fit$actions$model$spec$engine), "bffe0")), "dac369a701ce8eb271478f8e1f00a2d5"))
stopifnot("model specification used in kmeans_fit is not correct"= setequal(digest(paste(toString(kmeans_fit$fit$actions$model$spec$mode), "bffe0")), "ec663607d407edfeac6ffd952395cf04"))
stopifnot("kmeans_fit must be a trained workflow, make sure to call the fit() function"= setequal(digest(paste(toString(kmeans_fit$trained), "bffe0")), "7c0077287bd812f15aacf90036877e3b"))
stopifnot("predictor variable(s) of kmeans_fit are not correct"= setequal(digest(paste(toString(sort(filter(kmeans_fit$pre$actions$recipe$recipe$var_info, role == 'predictor')$variable)), "bffe0")), "f52a6b5431db09ce70cdef67bdec5a3a"))
stopifnot("kmeans_fit does not contain the correct data"= setequal(digest(paste(toString(sort(vapply(kmeans_fit$pre$mold$predictors[, sapply(kmeans_fit$pre$mold$predictors, is.numeric)], function(col) if(!is.null(col)) round(sum(col), 2) else NA_real_, numeric(1)), na.last = NA)), "bffe0")), "b4de568b9ebd382d46b093deac961158"))
stopifnot("did not fit kmeans_fit on the training dataset"= setequal(digest(paste(toString(nrow(kmeans_fit$pre$mold$outcomes)), "bffe0")), "79337745868055c07f1626663cc6f156"))
stopifnot("for classification/regression models, weight function is not correct"= setequal(digest(paste(toString(quo_name(kmeans_fit$fit$actions$model$spec$args$weight_func)), "bffe0")), "3a6f658928be6733f70332c8b4bc8e57"))
stopifnot("for classification/regression models, response variable of kmeans_fit is not correct"= setequal(digest(paste(toString(sort(filter(kmeans_fit$pre$actions$recipe$recipe$var_info, role == 'outcome')$variable)), "bffe0")), "5c8e7f4a14c142f26feaf6650567dfa8"))
stopifnot("for KNN models, number of neighbours is not correct"= setequal(digest(paste(toString(quo_name(kmeans_fit$fit$actions$model$spec$args$neighbors)), "bffe0")), "3a6f658928be6733f70332c8b4bc8e57"))
stopifnot("for clustering models, the clustering is not correct"= setequal(digest(paste(toString(kmeans_fit$fit$fit$fit$cluster), "bffe0")), "312d089ec58ff2934250dc3f46997686"))
stopifnot("for clustering models, the total within-cluster sum-of-squared distances is not correct"= setequal(digest(paste(toString(if (!is.null(kmeans_fit$fit$fit$fit$tot.withinss)) round(kmeans_fit$fit$fit$fit$tot.withinss, 2) else NULL), "bffe0")), "8f11d4ca89f91e7a95b0536552b6b10a"))

print('Success!')

**Question 1.7**
<br> {points: 1}

Use the `augment` function to add the cluster assignment for each point to the `clean_beer` data frame. 

*Assign your answer to an object named `labelled_beer`.* 

In [None]:
# ... <- augment(..., ...)
# your code here
fail() # No Answer - remove if you provide an answer
labelled_beer

In [None]:
library(digest)
stopifnot("labelled_beer should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(labelled_beer)), "7116b")), "325a263c630b0fa51633c1e7ff53c362"))
stopifnot("dimensions of labelled_beer are not correct"= setequal(digest(paste(toString(dim(labelled_beer)), "7116b")), "f27a558412aabdeff74dd49e33e3cec2"))
stopifnot("column names of labelled_beer are not correct"= setequal(digest(paste(toString(sort(colnames(labelled_beer))), "7116b")), "3fc86d09cd88ed55cf2261ae98efacae"))
stopifnot("types of columns in labelled_beer are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(labelled_beer, class)))), "7116b")), "15e70b9bbf0382b42f68bc13045c4530"))
stopifnot("values in one or more numerical columns in labelled_beer are not correct"= setequal(digest(paste(toString(if (any(sapply(labelled_beer, is.numeric))) sort(round(sapply(labelled_beer[, sapply(labelled_beer, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "7116b")), "2176daac06bebf9456f9c16f101635c1"))
stopifnot("values in one or more character columns in labelled_beer are not correct"= setequal(digest(paste(toString(if (any(sapply(labelled_beer, is.character))) sum(sapply(labelled_beer[sapply(labelled_beer, is.character)], function(x) length(unique(x)))) else 0), "7116b")), "eb14eac0a95351ca867924e51c1e9e3f"))
stopifnot("values in one or more factor columns in labelled_beer are not correct"= setequal(digest(paste(toString(if (any(sapply(labelled_beer, is.factor))) sum(sapply(labelled_beer[, sapply(labelled_beer, is.factor)], function(col) length(unique(col)))) else 0), "7116b")), "e9c979ea0b9f2950e9b9dd65d16da392"))

print('Success!')

**Question 1.8**
<br> {points: 1}

Create a scatter plot of `abv` on the y-axis versus `ibu` on the x-axis (using the data in `labelled_beer`) where the points are colored by their cluster assignment. Name the plot object `cluster_plot`.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
cluster_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(cluster_plot$layers)), function(i) {c(class(cluster_plot$layers[[i]]$geom))[1]})), "83580")), "c82778ae72df898e841ac132df41ac91"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(cluster_plot$layers)), function(i) {rlang::get_expr(c(cluster_plot$layers[[i]]$mapping, cluster_plot$mapping)$x)}), as.character))), "83580")), "71342a18c4c93774c040e7ad7f0260f8"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(cluster_plot$layers)), function(i) {rlang::get_expr(c(cluster_plot$layers[[i]]$mapping, cluster_plot$mapping)$y)}), as.character))), "83580")), "3706048ab536be4abc7419dfd3c7d1d7"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(cluster_plot$layers[[1]]$mapping, cluster_plot$mapping)$x)!= cluster_plot$labels$x), "83580")), "6d44a01806bde6fcec5c658604cd0105"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(cluster_plot$layers[[1]]$mapping, cluster_plot$mapping)$y)!= cluster_plot$labels$y), "83580")), "6d44a01806bde6fcec5c658604cd0105"))
stopifnot("incorrect colour variable in cluster_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(cluster_plot$layers[[1]]$mapping, cluster_plot$mapping)$colour)), "83580")), "26ecff1028a9384363814a565aa99581"))
stopifnot("incorrect shape variable in cluster_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(cluster_plot$layers[[1]]$mapping, cluster_plot$mapping)$shape)), "83580")), "f98c96b531bc0f0a252e0c6394a88283"))
stopifnot("the colour label in cluster_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(cluster_plot$layers[[1]]$mapping, cluster_plot$mapping)$colour) != cluster_plot$labels$colour), "83580")), "6d44a01806bde6fcec5c658604cd0105"))
stopifnot("the shape label in cluster_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(cluster_plot$layers[[1]]$mapping, cluster_plot$mapping)$colour) != cluster_plot$labels$shape), "83580")), "f98c96b531bc0f0a252e0c6394a88283"))
stopifnot("fill variable in cluster_plot is not correct"= setequal(digest(paste(toString(quo_name(cluster_plot$mapping$fill)), "83580")), "2ec797b6febb4f91388ab4c88cfe193f"))
stopifnot("fill label in cluster_plot is not informative"= setequal(digest(paste(toString((quo_name(cluster_plot$mapping$fill) != cluster_plot$labels$fill)), "83580")), "f98c96b531bc0f0a252e0c6394a88283"))
stopifnot("position argument in cluster_plot is not correct"= setequal(digest(paste(toString(class(cluster_plot$layers[[1]]$position)[1]), "83580")), "7cece97c4438a80d9ab42e084752283f"))

stopifnot("cluster_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(cluster_plot$data)), "83581")), "22e1bbf90a75adb1810aa4c7c13bdaf2"))
stopifnot("dimensions of cluster_plot$data are not correct"= setequal(digest(paste(toString(dim(cluster_plot$data)), "83581")), "4d23c98c61ca429bb26f4d46e4422963"))
stopifnot("column names of cluster_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(cluster_plot$data))), "83581")), "eed0b906cdca69187f25336a1f558bb6"))
stopifnot("types of columns in cluster_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(cluster_plot$data, class)))), "83581")), "576ff77d35757cc95fb0ac3f4af1bd84"))
stopifnot("values in one or more numerical columns in cluster_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(cluster_plot$data, is.numeric))) sort(round(sapply(cluster_plot$data[, sapply(cluster_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "83581")), "ee7aeaf48f699cecb2ac823f7177609f"))
stopifnot("values in one or more character columns in cluster_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(cluster_plot$data, is.character))) sum(sapply(cluster_plot$data[sapply(cluster_plot$data, is.character)], function(x) length(unique(x)))) else 0), "83581")), "de1bd6b1e3c589811c9d03aa0390a5f9"))
stopifnot("values in one or more factor columns in cluster_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(cluster_plot$data, is.factor))) sum(sapply(cluster_plot$data[, sapply(cluster_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "83581")), "7cc2f8e87512b251c37282ede4f29bf0"))

print('Success!')

**Question 1.9.1** Multiple Choice:
<br> {points: 1}

We do not know, however, that two clusters ($K$ = 2) is the best choice for this data set. What can we do to choose the best $K$?

A. Perform *cross-validation* for a variety of possible $K$s. Choose the one where the total within-cluster sum of squared distances starts to *decrease less*.

B. Perform *cross-validation* for a variety of possible $K$s. Choose the one where the total within-cluster sum of squared distances starts to *decrease more*.

C. Perform *clustering* for a variety of possible $K$s. Choose the one where the total within-cluster sum of squared distances starts to *decrease less*.

D. Perform *clustering* for a variety of possible $K$s. Choose the one where the total within-cluster sum of squared distances starts to *decrease more*.

*Assign your answer to an object called `answer1.9.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.9.1 is not character"= setequal(digest(paste(toString(class(answer1.9.1)), "5ccbf")), "6316f2ce2ece2bf442e8dd8618597cbc"))
stopifnot("length of answer1.9.1 is not correct"= setequal(digest(paste(toString(length(answer1.9.1)), "5ccbf")), "c8fb905455ad361e88bc322f1b36e56f"))
stopifnot("value of answer1.9.1 is not correct"= setequal(digest(paste(toString(tolower(answer1.9.1)), "5ccbf")), "d5231719f211d039dece5cf74aaa2d13"))
stopifnot("letters in string value of answer1.9.1 are correct but case is not correct"= setequal(digest(paste(toString(answer1.9.1), "5ccbf")), "b0be05034ec14f6585f280be0a74f62e"))

print('Success!')

**Question 1.9.2**
<br> {points: 1}

Use the `glance` function to get the model-level statistics for the clustering we just performed, including the total within-cluster sum of squared distances.

*Assign your answer to an object named `clustering_stats`.*
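
For intuition about what the total within-cluster sum of squared distances measures, here is a minimal sketch that recomputes it by hand on made-up data. It uses base R's `kmeans()` directly rather than the worksheet's workflow, and the names `toy`, `fit`, `assigned_centres` and `manual_wssd` are hypothetical.

```r
# Hypothetical toy data: two loose groups of 20 points each.
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 5)

# For each point, look up the centre of the cluster it was assigned to.
assigned_centres <- fit$centers[fit$cluster, , drop = FALSE]

# Sum the squared straight-line distances from each point to its own centre.
manual_wssd <- sum((toy - assigned_centres)^2)

# The hand-computed value and the value stored in the fit should agree.
print(manual_wssd)
print(fit$tot.withinss)
```

A smaller total means the points sit closer to their cluster centres.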

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
clustering_stats

In [None]:
library(digest)
stopifnot("clustering_stats should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(clustering_stats)), "aa8e5")), "23b7494b72e5a09d2cdba645070863ae"))
stopifnot("dimensions of clustering_stats are not correct"= setequal(digest(paste(toString(dim(clustering_stats)), "aa8e5")), "ed4cf8a4b3349c54827ff87b3bdd7cb7"))
stopifnot("column names of clustering_stats are not correct"= setequal(digest(paste(toString(sort(colnames(clustering_stats))), "aa8e5")), "e1751ea4cef61d370bafe85825136392"))
stopifnot("types of columns in clustering_stats are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(clustering_stats, class)))), "aa8e5")), "20b265eaaa93caf8c52a559557eacace"))
stopifnot("values in one or more numerical columns in clustering_stats are not correct"= setequal(digest(paste(toString(if (any(sapply(clustering_stats, is.numeric))) sort(round(sapply(clustering_stats[, sapply(clustering_stats, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "aa8e5")), "75c85ef1769a8e6d709e94bc7931fad5"))
stopifnot("values in one or more character columns in clustering_stats are not correct"= setequal(digest(paste(toString(if (any(sapply(clustering_stats, is.character))) sum(sapply(clustering_stats[sapply(clustering_stats, is.character)], function(x) length(unique(x)))) else 0), "aa8e5")), "344c47addc49b471912e531c2b845891"))
stopifnot("values in one or more factor columns in clustering_stats are not correct"= setequal(digest(paste(toString(if (any(sapply(clustering_stats, is.factor))) sum(sapply(clustering_stats[, sapply(clustering_stats, is.factor)], function(col) length(unique(col)))) else 0), "aa8e5")), "344c47addc49b471912e531c2b845891"))

print('Success!')

**Question 1.9.3**
<br> {points: 1}

What is the total within-cluster sum of squared distances for this clustering?

*Assign your answer to an object named `totalWSSD`. Round your answer to 2 decimal places.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of totalWSSD is not numeric"= setequal(digest(paste(toString(class(totalWSSD)), "fc81")), "d79e4ebc33e87f9389abaf8d0bd9f97b"))
stopifnot("value of totalWSSD is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(totalWSSD, 2)), "fc81")), "47425a447322dbc9e073bf2222b5e6c4"))
stopifnot("length of totalWSSD is not correct"= setequal(digest(paste(toString(length(totalWSSD)), "fc81")), "6bb692074eb766a4d12f095f2369cd25"))
stopifnot("values of totalWSSD are not correct"= setequal(digest(paste(toString(sort(round(totalWSSD, 2))), "fc81")), "47425a447322dbc9e073bf2222b5e6c4"))

print('Success!')

**Question 2.0**
<br> {points: 1}

Let's now choose the best $K$ for this clustering problem. To do this, we need to create a tibble with a single column that has the same name as the parameter we want to tune (`num_clusters`) and takes the values 1 to 10.

*Assign your answer to an object named `beer_ks`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of as.integer(beer_ks$num_clusters) is not integer"= setequal(digest(paste(toString(class(as.integer(beer_ks$num_clusters))), "50bc7")), "17d597907ed91d691187ffab09c4da43"))
stopifnot("length of as.integer(beer_ks$num_clusters) is not correct"= setequal(digest(paste(toString(length(as.integer(beer_ks$num_clusters))), "50bc7")), "b32ba2d7b8ec74ebaedda444baf7470f"))
stopifnot("values of as.integer(beer_ks$num_clusters) are not correct"= setequal(digest(paste(toString(sort(as.integer(beer_ks$num_clusters))), "50bc7")), "705159601ddd6f9ebb32389ff86853b6"))

print('Success!')

**Question 2.1**
<br> {points: 1}

We also need to create a new model specification that lets `tidymodels` tune the number of clusters. Rather than setting `num_clusters` to a particular value in the model specification, set it to `tune()`. Use 10 random starts (`nstart = 10`).

*Assign your answer to an object named `kmeans_spec_tune`.*
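
To see why multiple starts are useful, here is a small sketch on made-up data. It calls base R's `stats::kmeans` directly, on the assumption that the `nstart` engine argument in the model specification behaves like the `nstart` argument of `stats::kmeans`. The names `toy`, `single_start`, and `ten_starts` are hypothetical. K-means begins from randomly chosen centers, so a single start can settle on a poor solution; with several starts, `kmeans` keeps the run with the smallest total WSSD.

```r
# Toy data: three made-up groups of points
set.seed(2)
toy <- data.frame(
  x = c(rnorm(15, mean = 0), rnorm(15, mean = 4), rnorm(15, mean = 8)),
  y = c(rnorm(15, mean = 0), rnorm(15, mean = 4), rnorm(15, mean = 0))
)

# One random start
set.seed(10)
single_start <- kmeans(toy, centers = 3, nstart = 1)

# Ten random starts; kmeans keeps the run with the smallest total WSSD
set.seed(10)
ten_starts <- kmeans(toy, centers = 3, nstart = 10)

# Compare the total WSSD of the two fits
c(single = single_start$tot.withinss, ten = ten_starts$tot.withinss)
```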

In [None]:
# ... <- ...(... = ...) |>
#        ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_spec_tune

In [None]:
library(digest)
stopifnot("kmeans_spec_tune should be a k_means model specification"= setequal(digest(paste(toString('k_means' %in% class(kmeans_spec_tune)), "9ff8d")), "4156090925f6404984d13142ea28efcf"))
stopifnot("kmeans_spec_tune did not specify to use the correct number of centers"= setequal(digest(paste(toString(quo_name(rlang::get_expr(kmeans_spec_tune$args$num_clusters))), "9ff8d")), "cd38d5ca2073f87c08b5a51419d5e579"))
stopifnot("the engine specified in kmeans_spec_tune is not correct"= setequal(digest(paste(toString(kmeans_spec_tune$engine), "9ff8d")), "18de6f26b81c2b3cea183d73b0de24bd"))
stopifnot("the nstart argument is not correct"= setequal(digest(paste(toString(rlang::get_expr(kmeans_spec_tune$eng_args$nstart)), "9ff8d")), "e118fb788a117778ac36afebdf5df298"))

print('Success!')

**Question 2.2**
<br>{points: 1}

Now combine the new model specification and our original recipe into a new `workflow`. Pipe the workflow into the `tune_cluster` function to run the tuning procedure. In the `tune_cluster` function, set the `resamples` argument to `apparent(clean_beer)` so that the same full data set is used for each tuning trial, and set the `grid` argument to the data frame of $K$ values we just created. Finally, pipe the result into `collect_metrics` to gather the results of the tuning procedure.

*Assign your answer to an object named `kmeans_tuning_stats`*.
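
Conceptually, the tuning procedure fits K-means once for each candidate $K$ on the same data and records the resulting metrics. The sketch below illustrates that idea with a plain loop over base R's `stats::kmeans` on made-up data. It is not the `tune_cluster` call this question asks for, and the names `toy`, `ks`, and `toy_tuning` are hypothetical.

```r
# Toy data: three made-up groups of points
set.seed(3)
toy <- data.frame(
  x = c(rnorm(15, mean = 0), rnorm(15, mean = 4), rnorm(15, mean = 8)),
  y = c(rnorm(15, mean = 0), rnorm(15, mean = 4), rnorm(15, mean = 0))
)

# Candidate values of K (a hypothetical grid for this toy example)
ks <- 1:6

# Fit K-means once per K on the same full data and record the total WSSD
toy_tuning <- data.frame(
  num_clusters = ks,
  total_WSSD = sapply(ks, function(k) {
    kmeans(toy, centers = k, nstart = 10)$tot.withinss
  })
)
toy_tuning
```

Each row of `toy_tuning` corresponds to one tuning trial, which is the same shape of result you will later extract from the tuning metrics.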

In [None]:
# DON'T CHANGE THE SEED VALUE!
set.seed(9999)
# 
# ... <- ... |>
#        ...(...) |>
#        ...(...) |>
#        tune_cluster(resamples = ..., grid = ...) |>
#        ...()

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_tuning_stats

In [None]:
library(digest)
stopifnot("mutate_if(kmeans_tuning_stats, is.integer, as.double) should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(mutate_if(kmeans_tuning_stats, is.integer, as.double))), "32952")), "26dc3ef90520c743e3bb546c5f77ab5b"))
stopifnot("dimensions of mutate_if(kmeans_tuning_stats, is.integer, as.double) are not correct"= setequal(digest(paste(toString(dim(mutate_if(kmeans_tuning_stats, is.integer, as.double))), "32952")), "e55625c33cb6d51b934444ef41bd50a6"))
stopifnot("column names of mutate_if(kmeans_tuning_stats, is.integer, as.double) are not correct"= setequal(digest(paste(toString(sort(colnames(mutate_if(kmeans_tuning_stats, is.integer, as.double)))), "32952")), "f1d20cd8bce234409b3178a8e224c5aa"))
stopifnot("types of columns in mutate_if(kmeans_tuning_stats, is.integer, as.double) are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double), class)))), "32952")), "8285c421ab33355cc12892fc89ac4c6e"))
stopifnot("values in one or more numerical columns in mutate_if(kmeans_tuning_stats, is.integer, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double), is.numeric))) sort(round(sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double)[, sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double), is.numeric)], sum, na.rm = TRUE), 2)) else 0), "32952")), "c97735cf6296b3ad15d4ac454aaad257"))
stopifnot("values in one or more character columns in mutate_if(kmeans_tuning_stats, is.integer, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double), is.character))) sum(sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double)[sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double), is.character)], function(x) length(unique(x)))) else 0), "32952")), "bc0a23a532c98c2d2b9830c6125d9adf"))
stopifnot("values in one or more factor columns in mutate_if(kmeans_tuning_stats, is.integer, as.double) are not correct"= setequal(digest(paste(toString(if (any(sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double), is.factor))) sum(sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double)[, sapply(mutate_if(kmeans_tuning_stats, is.integer, as.double), is.factor)], function(col) length(unique(col)))) else 0), "32952")), "7283d21e282d896af34290043be0c140"))

print('Success!')

**Question 2.3**
<br> {points: 1}

Now we need to extract the total WSSD results from the `kmeans_tuning_stats` data frame. Recall that we want to look at the `mean` variable for rows where the `.metric` variable is `sse_within_total`. Use `mutate` to store `mean` in a new variable named `total_WSSD`, `filter` to keep only the `sse_within_total` rows, and `select` to keep only two variables: `num_clusters` and `total_WSSD`.

*Assign your answer to an object named `tidy_tuning_stats`.*
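
The shape of this operation can be sketched on a small hypothetical metrics table. The example below uses base R subsetting rather than the `dplyr` verbs this question asks for. The values are made up, and the second metric name (`sse_total`) is an assumption about what the tuning output may contain.

```r
# Hypothetical long-format metrics table: one row per (K, metric) pair
toy_metrics <- data.frame(
  num_clusters = rep(1:3, each = 2),
  .metric = rep(c("sse_total", "sse_within_total"), times = 3),
  mean = c(90, 90, 90, 35, 90, 12)
)

# Keep only the rows for the metric of interest and the two columns we need
keep_rows <- toy_metrics$.metric == "sse_within_total"
toy_wssd <- toy_metrics[keep_rows, c("num_clusters", "mean")]

# Give the value column a more descriptive name
names(toy_wssd)[names(toy_wssd) == "mean"] <- "total_WSSD"
toy_wssd
```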

In [None]:
# ... <- ... |>
#        mutate(... = ...) |>
#        filter(... == ...) |>
#        select(..., ...)

# your code here
fail() # No Answer - remove if you provide an answer
print(tidy_tuning_stats)

In [None]:
library(digest)
stopifnot("(mutate_all(tidy_tuning_stats, as.double)) should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class((mutate_all(tidy_tuning_stats, as.double)))), "8262c")), "083c4756efb67c13df46425744af2d91"))
stopifnot("dimensions of (mutate_all(tidy_tuning_stats, as.double)) are not correct"= setequal(digest(paste(toString(dim((mutate_all(tidy_tuning_stats, as.double)))), "8262c")), "753310a8a7d1f39e508d5ca516b51a1e"))
stopifnot("column names of (mutate_all(tidy_tuning_stats, as.double)) are not correct"= setequal(digest(paste(toString(sort(colnames((mutate_all(tidy_tuning_stats, as.double))))), "8262c")), "76984e5268fed553b2bdcacbd797f34d"))
stopifnot("types of columns in (mutate_all(tidy_tuning_stats, as.double)) are not correct"= setequal(digest(paste(toString(sort(unlist(sapply((mutate_all(tidy_tuning_stats, as.double)), class)))), "8262c")), "387d628fe8e0b7157a2e73bddf4a7587"))
stopifnot("values in one or more numerical columns in (mutate_all(tidy_tuning_stats, as.double)) are not correct"= setequal(digest(paste(toString(if (any(sapply((mutate_all(tidy_tuning_stats, as.double)), is.numeric))) sort(round(sapply((mutate_all(tidy_tuning_stats, as.double))[, sapply((mutate_all(tidy_tuning_stats, as.double)), is.numeric)], sum, na.rm = TRUE), 2)) else 0), "8262c")), "e6c12a67ac06bd9c19109d542ce07257"))
stopifnot("values in one or more character columns in (mutate_all(tidy_tuning_stats, as.double)) are not correct"= setequal(digest(paste(toString(if (any(sapply((mutate_all(tidy_tuning_stats, as.double)), is.character))) sum(sapply((mutate_all(tidy_tuning_stats, as.double))[sapply((mutate_all(tidy_tuning_stats, as.double)), is.character)], function(x) length(unique(x)))) else 0), "8262c")), "7d4aea4213c873bab686054462325c86"))
stopifnot("values in one or more factor columns in (mutate_all(tidy_tuning_stats, as.double)) are not correct"= setequal(digest(paste(toString(if (any(sapply((mutate_all(tidy_tuning_stats, as.double)), is.factor))) sum(sapply((mutate_all(tidy_tuning_stats, as.double))[, sapply((mutate_all(tidy_tuning_stats, as.double)), is.factor)], function(col) length(unique(col)))) else 0), "8262c")), "7d4aea4213c873bab686054462325c86"))

print('Success!')

**Question 2.4**
<br> {points: 1}

We now have the values of the total within-cluster sum of squares for each model in a column (`total_WSSD`). Let's use it to create a line plot with points of total within-cluster sum of squares versus $K$, so that we can choose the best number of clusters to use.

*Assign your plot to an object called `choose_beer_k`. Total within-cluster sum of squares should be on the y-axis and $K$ should be on the x-axis. Remember to follow the best visualization practices, including adding human-readable labels to your plot.*
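
To illustrate how such a plot is read, here is a sketch with made-up total WSSD values (not results from the beer data). It uses base graphics rather than the `ggplot2` plot this question asks for, and the names `toy_elbow` and `drops` are hypothetical. Alongside the plot, it computes how much the total WSSD drops each time $K$ increases by one, which is the quantity the eye judges when looking for an "elbow".

```r
# Hypothetical total WSSD values for K = 1, ..., 6 (made up for illustration)
toy_elbow <- data.frame(
  num_clusters = 1:6,
  total_WSSD = c(120, 60, 25, 21, 18, 16)
)

# Line plot with points, using base graphics
plot(
  toy_elbow$num_clusters, toy_elbow$total_WSSD,
  type = "b",
  xlab = "Number of clusters (K)",
  ylab = "Total within-cluster sum of squares"
)

# Decrease in total WSSD each time K increases by one
drops <- -diff(toy_elbow$total_WSSD)
drops
```

In this made-up example the drops are large at first and then become small, so the curve bends sharply and then flattens out.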

In [None]:
options(repr.plot.width = 8, repr.plot.height = 7)

# your code here
fail() # No Answer - remove if you provide an answer
choose_beer_k

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(choose_beer_k$layers)), function(i) {c(class(choose_beer_k$layers[[i]]$geom))[1]})), "9f8b2")), "abe06b7149611997d67a87b181bc153e"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(choose_beer_k$layers)), function(i) {rlang::get_expr(c(choose_beer_k$layers[[i]]$mapping, choose_beer_k$mapping)$x)}), as.character))), "9f8b2")), "b68850e9c150abc53d53fb7063977b67"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(choose_beer_k$layers)), function(i) {rlang::get_expr(c(choose_beer_k$layers[[i]]$mapping, choose_beer_k$mapping)$y)}), as.character))), "9f8b2")), "17feb5e453f4981639ff821116cc6f05"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(choose_beer_k$layers[[1]]$mapping, choose_beer_k$mapping)$x)!= choose_beer_k$labels$x), "9f8b2")), "a2c975237561577faf0af3f69f9d57c9"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(choose_beer_k$layers[[1]]$mapping, choose_beer_k$mapping)$y)!= choose_beer_k$labels$y), "9f8b2")), "a2c975237561577faf0af3f69f9d57c9"))
stopifnot("incorrect colour variable in choose_beer_k, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(choose_beer_k$layers[[1]]$mapping, choose_beer_k$mapping)$colour)), "9f8b2")), "014d0667b8b165dd62ec0c423f4f0187"))
stopifnot("incorrect shape variable in choose_beer_k, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(choose_beer_k$layers[[1]]$mapping, choose_beer_k$mapping)$shape)), "9f8b2")), "014d0667b8b165dd62ec0c423f4f0187"))
stopifnot("the colour label in choose_beer_k is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(choose_beer_k$layers[[1]]$mapping, choose_beer_k$mapping)$colour) != choose_beer_k$labels$colour), "9f8b2")), "014d0667b8b165dd62ec0c423f4f0187"))
stopifnot("the shape label in choose_beer_k is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(choose_beer_k$layers[[1]]$mapping, choose_beer_k$mapping)$shape) != choose_beer_k$labels$shape), "9f8b2")), "014d0667b8b165dd62ec0c423f4f0187"))
stopifnot("fill variable in choose_beer_k is not correct"= setequal(digest(paste(toString(quo_name(choose_beer_k$mapping$fill)), "9f8b2")), "1c93fde1bc2d06d79729efbdcf13950d"))
stopifnot("fill label in choose_beer_k is not informative"= setequal(digest(paste(toString((quo_name(choose_beer_k$mapping$fill) != choose_beer_k$labels$fill)), "9f8b2")), "014d0667b8b165dd62ec0c423f4f0187"))
stopifnot("position argument in choose_beer_k is not correct"= setequal(digest(paste(toString(class(choose_beer_k$layers[[1]]$position)[1]), "9f8b2")), "496d71218522b5f314e1f8b354fec920"))

stopifnot("(mutate_all(choose_beer_k$data, as.double)) should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class((mutate_all(choose_beer_k$data, as.double)))), "9f8b3")), "cb22fd02395a55f25f6ef7edee12779d"))
stopifnot("dimensions of (mutate_all(choose_beer_k$data, as.double)) are not correct"= setequal(digest(paste(toString(dim((mutate_all(choose_beer_k$data, as.double)))), "9f8b3")), "8b1cdc4662c94f4c291ee92aefc3e4d5"))
stopifnot("column names of (mutate_all(choose_beer_k$data, as.double)) are not correct"= setequal(digest(paste(toString(sort(colnames((mutate_all(choose_beer_k$data, as.double))))), "9f8b3")), "9ac38e47b6bf0b7c61f8366ab4ac3d2e"))
stopifnot("types of columns in (mutate_all(choose_beer_k$data, as.double)) are not correct"= setequal(digest(paste(toString(sort(unlist(sapply((mutate_all(choose_beer_k$data, as.double)), class)))), "9f8b3")), "49890d4e1ff5f87e87f2ed120bd629e0"))
stopifnot("values in one or more numerical columns in (mutate_all(choose_beer_k$data, as.double)) are not correct"= setequal(digest(paste(toString(if (any(sapply((mutate_all(choose_beer_k$data, as.double)), is.numeric))) sort(round(sapply((mutate_all(choose_beer_k$data, as.double))[, sapply((mutate_all(choose_beer_k$data, as.double)), is.numeric)], sum, na.rm = TRUE), 2)) else 0), "9f8b3")), "e8ff82fa4f5cc0e8dd9dfed2a05b4730"))
stopifnot("values in one or more character columns in (mutate_all(choose_beer_k$data, as.double)) are not correct"= setequal(digest(paste(toString(if (any(sapply((mutate_all(choose_beer_k$data, as.double)), is.character))) sum(sapply((mutate_all(choose_beer_k$data, as.double))[sapply((mutate_all(choose_beer_k$data, as.double)), is.character)], function(x) length(unique(x)))) else 0), "9f8b3")), "55f3c6dfe130bca7801204bb87a856d3"))
stopifnot("values in one or more factor columns in (mutate_all(choose_beer_k$data, as.double)) are not correct"= setequal(digest(paste(toString(if (any(sapply((mutate_all(choose_beer_k$data, as.double)), is.factor))) sum(sapply((mutate_all(choose_beer_k$data, as.double))[, sapply((mutate_all(choose_beer_k$data, as.double)), is.factor)], function(col) length(unique(col)))) else 0), "9f8b3")), "55f3c6dfe130bca7801204bb87a856d3"))

print('Success!')

**Question 2.5**
<br> {points: 1}

From the plot above, which $K$ should we choose? 

*Assign your answer to an object called `answer2.5`. Make sure your answer is a single digit surrounded by quotation marks.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer2.5 is not character"= setequal(digest(paste(toString(class(answer2.5)), "e9f9a")), "8a3dfd3fd2d4cb3377731560476cbe46"))
stopifnot("length of answer2.5 is not correct"= setequal(digest(paste(toString(length(answer2.5)), "e9f9a")), "be56a60858a1a66939f015e7ec3cb598"))
stopifnot("value of answer2.5 is not correct"= setequal(digest(paste(toString(tolower(answer2.5)), "e9f9a")), "b899ec74bf3dd2c1ccd8b6a3924214e1"))
stopifnot("letters in string value of answer2.5 are correct but case is not correct"= setequal(digest(paste(toString(answer2.5), "e9f9a")), "b899ec74bf3dd2c1ccd8b6a3924214e1"))

print('Success!')

**Question 2.6**
<br> {points: 1}

Why did we choose the $K$ we chose above?

A. It had the greatest total within-cluster sum of squares

B. It had the smallest total within-cluster sum of squares

C. Increasing $K$ further than this only decreased the total within-cluster sum of squares a small amount

D. Increasing $K$ further than this only increased the total within-cluster sum of squares a small amount

*Assign your answer to an object called `answer2.6`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer2.6 is not character"= setequal(digest(paste(toString(class(answer2.6)), "a939b")), "0550f51f4eb0c2f75cb902195850d8bc"))
stopifnot("length of answer2.6 is not correct"= setequal(digest(paste(toString(length(answer2.6)), "a939b")), "5ee198740a82d467b15cc4c9112aff33"))
stopifnot("value of answer2.6 is not correct"= setequal(digest(paste(toString(tolower(answer2.6)), "a939b")), "d6623aadd2d5526430f1858f45d5ee6a"))
stopifnot("letters in string value of answer2.6 are correct but case is not correct"= setequal(digest(paste(toString(answer2.6), "a939b")), "6b70b1fe0021e8f29ab9770bfefb6d22"))

print('Success!')

**Question 2.7** Multiple Choice:
<br> {points: 1}

What can we conclude from our analysis? How many different types of hoppy craft beer are there in this data set using the two variables we have? 


A. 1

B. 2 to 4

C. 5 to 7

D. more than 7

*Assign your answer to an object called `answer2.7`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer2.7 is not character"= setequal(digest(paste(toString(class(answer2.7)), "b65d7")), "33084850a2f49bc50f47eb6de7eef923"))
stopifnot("length of answer2.7 is not correct"= setequal(digest(paste(toString(length(answer2.7)), "b65d7")), "e1436e1f5637ed15e3c56276e5e999f6"))
stopifnot("value of answer2.7 is not correct"= setequal(digest(paste(toString(tolower(answer2.7)), "b65d7")), "0aabc483466812ca0e1f002937f98996"))
stopifnot("letters in string value of answer2.7 are correct but case is not correct"= setequal(digest(paste(toString(answer2.7), "b65d7")), "db890eb0d7b33b8598ab8b02e29b27aa"))

print('Success!')

**Question 2.8** True or false:
<br> {points: 1}

Our analysis might change if we added more variables. True or false?

*Assign your answer to an object called `answer2.8`. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. `"true"` or `"false"`).* 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer2.8 is not character"= setequal(digest(paste(toString(class(answer2.8)), "aaedd")), "e624160fbe6a234634b06173ac0b7299"))
stopifnot("length of answer2.8 is not correct"= setequal(digest(paste(toString(length(answer2.8)), "aaedd")), "4517a06e719fd89b2392344df51db16d"))
stopifnot("value of answer2.8 is not correct"= setequal(digest(paste(toString(tolower(answer2.8)), "aaedd")), "584d569a99587a83803a0329aa025222"))
stopifnot("letters in string value of answer2.8 are correct but case is not correct"= setequal(digest(paste(toString(answer2.8), "aaedd")), "584d569a99587a83803a0329aa025222"))

print('Success!')

In [None]:
source("cleanup.R")