# Worksheet 10 - Clustering

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
* Explain the k-means clustering algorithm.
* Interpret the output of a k-means cluster analysis.
* Perform k-means clustering in Python using `scikit-learn`.
* Visualize the output of k-means clustering in Python using a coloured scatter plot.
* Identify when it is necessary to scale variables before clustering and do this using Python.
* Use the elbow method to choose the number of clusters for k-means.
* Describe advantages, limitations and assumptions of the k-means clustering algorithm.

This worksheet covers parts of [Chapter 9](https://python.datasciencebook.ca/clustering) of the online textbook. You should read this chapter before attempting this assignment. Any place you see `___`, you must fill in the function, variable, or data to complete the code. Replace `raise NotImplementedError` with your completed code and answers, then run the cell.

In [None]:
### Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn import set_config

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

**Question 0.0** Multiple Choice:
<br> {points: 1}

In which of the following scenarios would clustering methods likely be appropriate?

A. Identifying sub-groups of houses according to their house type, value, and geographical location

B. Predicting whether a given user will click on an ad on a website

C. Segmenting customers based on their preferences to target advertising

D. Both A. and B.

E. Both A. and C. 

*Assign your answer to an object called `answer0_0`. Your answer should be a single upper-case character surrounded by quotes.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_0)).encode("utf-8")+b"8410de25b60e1969").hexdigest() == "74c7fa636da10ec81606aaf89ce9d63b13f0b4d0", "type of answer0_0 is not str. answer0_0 should be an str"
assert sha1(str(len(answer0_0)).encode("utf-8")+b"8410de25b60e1969").hexdigest() == "0dee4d7f831cbc6d2f8bb54295ea7019773e52f4", "length of answer0_0 is not correct"
assert sha1(str(answer0_0.lower()).encode("utf-8")+b"8410de25b60e1969").hexdigest() == "ddc450446039be2700a3ed35f6a243058a450d33", "value of answer0_0 is not correct"
assert sha1(str(answer0_0).encode("utf-8")+b"8410de25b60e1969").hexdigest() == "77384839be0bf16cd80428535e26369cff8856f3", "correct string value of answer0_0 but incorrect case of letters"

print('Success!')

**Question 0.1** Multiple Choice:
<br> {points: 1}

Which step in the description of the k-means algorithm below is incorrect?

0. Choose the number of clusters

1. Randomly assign each of the points to one of the clusters

2. Calculate the position for the cluster centre (centroid) for each of the clusters (this is the middle of the points in the cluster, as measured by straight-line distance)

3. Re-assign each of the points to the cluster whose centroid is furthest from that point

4. Repeat steps 2 - 3 until the cluster centroids don't change very much between iterations

*Assign your answer to an object called `answer0_1`. Your answer should be a single numerical character surrounded by quotes.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_1)).encode("utf-8")+b"1c100869e4afe4ef").hexdigest() == "8b5938e6d46b00ec6babb00948adb81d7c27f1d8", "type of answer0_1 is not str. answer0_1 should be an str"
assert sha1(str(len(answer0_1)).encode("utf-8")+b"1c100869e4afe4ef").hexdigest() == "5a866734cb9501ad540c14d96ad328e3425ca113", "length of answer0_1 is not correct"
assert sha1(str(answer0_1.lower()).encode("utf-8")+b"1c100869e4afe4ef").hexdigest() == "1c0245bd0b6f91a00342a61b84025bb8e86e5222", "value of answer0_1 is not correct"
assert sha1(str(answer0_1).encode("utf-8")+b"1c100869e4afe4ef").hexdigest() == "1c0245bd0b6f91a00342a61b84025bb8e86e5222", "correct string value of answer0_1 but incorrect case of letters"

print('Success!')

## Hoppy Craft Beer

Craft beer is a strong market in Canada and the US, and is expanding to other countries as well. If you wanted to get into the craft beer brewing market, you might want to better understand the product landscape. One popular craft beer product is hopped craft beer. Breweries create/label many different kinds of hopped craft beer, but how many different kinds of hopped craft beer are there really when you look at the chemical properties instead of the human labels? 

We will begin to explore this question using a [craft beer data set from Kaggle](https://www.kaggle.com/nickhould/craft-cans#beers.csv). In this data set, we will use the alcohol by volume (`abv` column) and the International Bitterness Units (`ibu` column) as variables to try to cluster the beers.

**Question 1.0** 
<br> {points: 1}

Read in the `beers.csv` data using `pd.read_csv` and assign it to an object called `beer`. The data is located within the `data/` folder. 

*Assign your dataframe answer to an object called `beer`.*

In [None]:
# your code here
raise NotImplementedError
beer

In [None]:
from hashlib import sha1
assert sha1(str(type(beer is None)).encode("utf-8")+b"e341ae8bdad12023").hexdigest() == "80fc56569a79c1a8c42b767d1e7615f0f66c9df9", "type of beer is None is not bool. beer is None should be a bool"
assert sha1(str(beer is None).encode("utf-8")+b"e341ae8bdad12023").hexdigest() == "5241e8fc32cc938e8ae3dfda6fe972b843705fa5", "boolean value of beer is None is not correct"

assert sha1(str(type(beer)).encode("utf-8")+b"d2a59438c486c12a").hexdigest() == "93edeae1fb74fb91bf256c682b5c99588d0143f7", "type of type(beer) is not correct"

assert sha1(str(type(beer.shape)).encode("utf-8")+b"6dcc24337525a300").hexdigest() == "ac8fd7cfcc30cfd13d022eba52076158362a6533", "type of beer.shape is not tuple. beer.shape should be a tuple"
assert sha1(str(len(beer.shape)).encode("utf-8")+b"6dcc24337525a300").hexdigest() == "a4ae6b0663b365b9e09328bcf6e9b156298c4e13", "length of beer.shape is not correct"
assert sha1(str(sorted(map(str, beer.shape))).encode("utf-8")+b"6dcc24337525a300").hexdigest() == "37fc58e3c4c6962f998590864ea18cb7e4423cee", "values of beer.shape are not correct"
assert sha1(str(beer.shape).encode("utf-8")+b"6dcc24337525a300").hexdigest() == "ead1f75e794339b5ed7adf81752eaa4b48e031a5", "order of elements of beer.shape is not correct"

assert sha1(str(type("abv" in beer.columns)).encode("utf-8")+b"977bb85a445f3e23").hexdigest() == "f4767e3aeabec04a91e9707f8ff49f6d4d7fedc4", "type of \"abv\" in beer.columns is not bool. \"abv\" in beer.columns should be a bool"
assert sha1(str("abv" in beer.columns).encode("utf-8")+b"977bb85a445f3e23").hexdigest() == "1bf050c039e806c19abe089c3c408914dc294f74", "boolean value of \"abv\" in beer.columns is not correct"

assert sha1(str(type("ibu" in beer.columns)).encode("utf-8")+b"a54b4b4fa472c956").hexdigest() == "02e7a48438573998c580abe632d11fb35efa7429", "type of \"ibu\" in beer.columns is not bool. \"ibu\" in beer.columns should be a bool"
assert sha1(str("ibu" in beer.columns).encode("utf-8")+b"a54b4b4fa472c956").hexdigest() == "1cb4457e913321254575e9f6c6e3a29f3f5354f1", "boolean value of \"ibu\" in beer.columns is not correct"

assert sha1(str(type("id" in beer.columns)).encode("utf-8")+b"06390c317d496591").hexdigest() == "42efb3a9d64d019128e9e6f9910cab5e1f629d26", "type of \"id\" in beer.columns is not bool. \"id\" in beer.columns should be a bool"
assert sha1(str("id" in beer.columns).encode("utf-8")+b"06390c317d496591").hexdigest() == "d362d424648a60ed18992e04dde972be0ea1f630", "boolean value of \"id\" in beer.columns is not correct"

assert sha1(str(type("name" in beer.columns)).encode("utf-8")+b"1f964002b847a44a").hexdigest() == "91cd45a1e630c86770d6470f719c1bb3463987ec", "type of \"name\" in beer.columns is not bool. \"name\" in beer.columns should be a bool"
assert sha1(str("name" in beer.columns).encode("utf-8")+b"1f964002b847a44a").hexdigest() == "0c7e62ef4c219c86b6854f585d89784cb60c6873", "boolean value of \"name\" in beer.columns is not correct"

assert sha1(str(type("style" in beer.columns)).encode("utf-8")+b"7d2c05f736a6e8d4").hexdigest() == "c8c75eb837a04f718a3b8884f8554db9d2a10488", "type of \"style\" in beer.columns is not bool. \"style\" in beer.columns should be a bool"
assert sha1(str("style" in beer.columns).encode("utf-8")+b"7d2c05f736a6e8d4").hexdigest() == "66bd4ec2fd1dd4c9b8d5e4e77ed9e6ec995927e2", "boolean value of \"style\" in beer.columns is not correct"

assert sha1(str(type("brewery_id" in beer.columns)).encode("utf-8")+b"a4eb108b541c35fd").hexdigest() == "ba70989653cdef36e8d7f564f8580dea430123c0", "type of \"brewery_id\" in beer.columns is not bool. \"brewery_id\" in beer.columns should be a bool"
assert sha1(str("brewery_id" in beer.columns).encode("utf-8")+b"a4eb108b541c35fd").hexdigest() == "ac787d7ebbc20b809c1e880fd87067db8ce1e27e", "boolean value of \"brewery_id\" in beer.columns is not correct"

assert sha1(str(type("ounces" in beer.columns)).encode("utf-8")+b"dd23c29e721de384").hexdigest() == "408b3745538ca2dfbde826aeef8b14989beca868", "type of \"ounces\" in beer.columns is not bool. \"ounces\" in beer.columns should be a bool"
assert sha1(str("ounces" in beer.columns).encode("utf-8")+b"dd23c29e721de384").hexdigest() == "af73db6b32b25c4cb7a54607c445e189a3da9970", "boolean value of \"ounces\" in beer.columns is not correct"

print('Success!')

**Question 1.1**
<br> {points: 1}

Let's start by visualizing the variables we are going to use in our cluster analysis as a scatter plot. Put `ibu` on the horizontal axis, and `abv` on the vertical axis. Name the plot object `beer_scatter`. 

*Remember to follow the best visualization practices, including adding human-readable labels to your plot.*

In [None]:
# your code here
raise NotImplementedError
beer_scatter

In [None]:
from hashlib import sha1
assert sha1(str(type(beer_scatter is None)).encode("utf-8")+b"fe07cdf30d31f87d").hexdigest() == "7f3d8b387ff232417e9464d3e694fc74c37324ce", "type of beer_scatter is None is not bool. beer_scatter is None should be a bool"
assert sha1(str(beer_scatter is None).encode("utf-8")+b"fe07cdf30d31f87d").hexdigest() == "61e2dc8a4478807db03b4d1cdd8cca937f9fc865", "boolean value of beer_scatter is None is not correct"

assert sha1(str(type(beer_scatter.encoding.x['shorthand'])).encode("utf-8")+b"98a4c02c17b658f1").hexdigest() == "f33a70f51fdd88f32e03616fc68f2b4bfaae39f8", "type of beer_scatter.encoding.x['shorthand'] is not str. beer_scatter.encoding.x['shorthand'] should be an str"
assert sha1(str(len(beer_scatter.encoding.x['shorthand'])).encode("utf-8")+b"98a4c02c17b658f1").hexdigest() == "1dac1afaaaeeb4a6fdee2760eac6fda09f2e1d3a", "length of beer_scatter.encoding.x['shorthand'] is not correct"
assert sha1(str(beer_scatter.encoding.x['shorthand'].lower()).encode("utf-8")+b"98a4c02c17b658f1").hexdigest() == "aa945d968587257bb6dc45592c825ae891be80b1", "value of beer_scatter.encoding.x['shorthand'] is not correct"
assert sha1(str(beer_scatter.encoding.x['shorthand']).encode("utf-8")+b"98a4c02c17b658f1").hexdigest() == "aa945d968587257bb6dc45592c825ae891be80b1", "correct string value of beer_scatter.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(beer_scatter.encoding.y['shorthand'])).encode("utf-8")+b"e82c48935b034055").hexdigest() == "1322de0b1afe6df9a80c443bede8c328bce3360a", "type of beer_scatter.encoding.y['shorthand'] is not str. beer_scatter.encoding.y['shorthand'] should be an str"
assert sha1(str(len(beer_scatter.encoding.y['shorthand'])).encode("utf-8")+b"e82c48935b034055").hexdigest() == "31965e207e6e9217c2f68b82ac487f585e177c2b", "length of beer_scatter.encoding.y['shorthand'] is not correct"
assert sha1(str(beer_scatter.encoding.y['shorthand'].lower()).encode("utf-8")+b"e82c48935b034055").hexdigest() == "b4c888ac8213baefbbe409f2b0030693e6250977", "value of beer_scatter.encoding.y['shorthand'] is not correct"
assert sha1(str(beer_scatter.encoding.y['shorthand']).encode("utf-8")+b"e82c48935b034055").hexdigest() == "b4c888ac8213baefbbe409f2b0030693e6250977", "correct string value of beer_scatter.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(beer_scatter.to_dict()['mark']['type'] in ['circle', 'point'])).encode("utf-8")+b"da1487085eeefa5a").hexdigest() == "239cb13aa8b69da98b43d53adb05488c28dddea5", "type of beer_scatter.to_dict()['mark']['type'] in ['circle', 'point'] is not bool. beer_scatter.to_dict()['mark']['type'] in ['circle', 'point'] should be a bool"
assert sha1(str(beer_scatter.to_dict()['mark']['type'] in ['circle', 'point']).encode("utf-8")+b"da1487085eeefa5a").hexdigest() == "2e31344454ecf20393286e7866ed4f4206616166", "boolean value of beer_scatter.to_dict()['mark']['type'] in ['circle', 'point'] is not correct"

assert sha1(str(type('opacity' in beer_scatter.mark.to_dict())).encode("utf-8")+b"7b69d003e953b0c9").hexdigest() == "4524d3fdd60f7fb398013611ef2447532024df7b", "type of 'opacity' in beer_scatter.mark.to_dict() is not bool. 'opacity' in beer_scatter.mark.to_dict() should be a bool"
assert sha1(str('opacity' in beer_scatter.mark.to_dict()).encode("utf-8")+b"7b69d003e953b0c9").hexdigest() == "966899ea2e79ff18629b9e72571d1dc920c5928c", "boolean value of 'opacity' in beer_scatter.mark.to_dict() is not correct"

assert sha1(str(type(isinstance(beer_scatter.encoding.x['title'], str))).encode("utf-8")+b"f7591feee5b371cc").hexdigest() == "aa17042e894088231e693111b4b7fa9c403b1459", "type of isinstance(beer_scatter.encoding.x['title'], str) is not bool. isinstance(beer_scatter.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(beer_scatter.encoding.x['title'], str)).encode("utf-8")+b"f7591feee5b371cc").hexdigest() == "8a1ce6e619bea16c1ce8b378ca74f284118c57dd", "boolean value of isinstance(beer_scatter.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(beer_scatter.encoding.y['title'], str))).encode("utf-8")+b"30a5c138c7490f85").hexdigest() == "2dc049f1c7863463217941ad4d10355a1911d055", "type of isinstance(beer_scatter.encoding.y['title'], str) is not bool. isinstance(beer_scatter.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(beer_scatter.encoding.y['title'], str)).encode("utf-8")+b"30a5c138c7490f85").hexdigest() == "cc085b4d0e4342fc6510b605ff025c143c8d46e4", "boolean value of isinstance(beer_scatter.encoding.y['title'], str) is not correct"

print('Success!')

**Question 1.2**
<br> {points: 1}

`KMeans` clustering in scikit-learn does not handle missing values. Therefore, we need to drop any rows that contain missing values in the columns we are using in the clustering, which are `ibu` and `abv`. Remember that you can use the `subset` parameter with `dropna` to specify which columns to look at.

Do not select/drop any columns at this point, only remove rows that contain missing values in the columns `ibu` and `abv`.

*Assign your answer to an object named `clean_beer`.*
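As a reminder of how `subset` behaves, here is a minimal sketch on a small made-up dataframe (the `toy` data and its column names are hypothetical and are not part of the beer data set). Missing values in columns that are not listed in `subset` do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate `subset`
toy = pd.DataFrame({
    "height": [1.0, np.nan, 3.0, 4.0],
    "weight": [10.0, 20.0, np.nan, 40.0],
    "note": ["a", "b", "c", None],
})

# Rows missing `height` or `weight` are removed;
# the missing `note` in the last row is ignored
toy_complete = toy.dropna(subset=["height", "weight"])
print(toy_complete)
```

Note that `dropna` returns a new dataframe and leaves `toy` unchanged, and that no columns are removed.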

In [None]:
# your code here
raise NotImplementedError
clean_beer

In [None]:
from hashlib import sha1
assert sha1(str(type(clean_beer is None)).encode("utf-8")+b"a3e02d6c2ba54efb").hexdigest() == "150d8d0f6883cc16bf936fcf7ea6306d4a0b6196", "type of clean_beer is None is not bool. clean_beer is None should be a bool"
assert sha1(str(clean_beer is None).encode("utf-8")+b"a3e02d6c2ba54efb").hexdigest() == "63f71aa344563944f151422f3475acf58e45aff6", "boolean value of clean_beer is None is not correct"

assert sha1(str(type(clean_beer)).encode("utf-8")+b"b885eab89e89bea9").hexdigest() == "13ddf270173ecfc7ccb53be1c25f6f4c9db65990", "type of type(clean_beer) is not correct"

assert sha1(str(type(clean_beer.shape)).encode("utf-8")+b"5055fc6ecaba3a0c").hexdigest() == "b2550f3233c0e375697b4231615b1cd40b51f15c", "type of clean_beer.shape is not tuple. clean_beer.shape should be a tuple"
assert sha1(str(len(clean_beer.shape)).encode("utf-8")+b"5055fc6ecaba3a0c").hexdigest() == "471828ed7a9d71fce6526c66af94692b637251fe", "length of clean_beer.shape is not correct"
assert sha1(str(sorted(map(str, clean_beer.shape))).encode("utf-8")+b"5055fc6ecaba3a0c").hexdigest() == "aa4f7b83b87d63379711fe2ab70c020416673c9f", "values of clean_beer.shape are not correct"
assert sha1(str(clean_beer.shape).encode("utf-8")+b"5055fc6ecaba3a0c").hexdigest() == "4075cb5653698d16c50947eeaec2eae23331e685", "order of elements of clean_beer.shape is not correct"

assert sha1(str(type("abv" in clean_beer.columns)).encode("utf-8")+b"dd188309f6fa33c0").hexdigest() == "1886644f7b3503ec0a2d076570691942264446d0", "type of \"abv\" in clean_beer.columns is not bool. \"abv\" in clean_beer.columns should be a bool"
assert sha1(str("abv" in clean_beer.columns).encode("utf-8")+b"dd188309f6fa33c0").hexdigest() == "54deaa11841024df8e452f68ac6bba14073244b5", "boolean value of \"abv\" in clean_beer.columns is not correct"

assert sha1(str(type("ibu" in clean_beer.columns)).encode("utf-8")+b"91a371d0d8a179a9").hexdigest() == "864b0e41703539649149b956286fa58f7316b56b", "type of \"ibu\" in clean_beer.columns is not bool. \"ibu\" in clean_beer.columns should be a bool"
assert sha1(str("ibu" in clean_beer.columns).encode("utf-8")+b"91a371d0d8a179a9").hexdigest() == "a3420acad491b01544bfb16cdd9ffdd9fd4e71e5", "boolean value of \"ibu\" in clean_beer.columns is not correct"

assert sha1(str(type("id" in clean_beer.columns)).encode("utf-8")+b"3097f30a3f190d5d").hexdigest() == "e052c4eb61c1e983ee1deb6fc75696f34ee4cff7", "type of \"id\" in clean_beer.columns is not bool. \"id\" in clean_beer.columns should be a bool"
assert sha1(str("id" in clean_beer.columns).encode("utf-8")+b"3097f30a3f190d5d").hexdigest() == "6910701f9acab8708ebd857d42250afb41a019c5", "boolean value of \"id\" in clean_beer.columns is not correct"

assert sha1(str(type("name" in clean_beer.columns)).encode("utf-8")+b"acbdf581fb94f906").hexdigest() == "b18a6f5f7ea1ec05ca4a9da04f77db44d9e20268", "type of \"name\" in clean_beer.columns is not bool. \"name\" in clean_beer.columns should be a bool"
assert sha1(str("name" in clean_beer.columns).encode("utf-8")+b"acbdf581fb94f906").hexdigest() == "4a7e559b30809d053cdede848e8fd79c35866564", "boolean value of \"name\" in clean_beer.columns is not correct"

assert sha1(str(type("style" in clean_beer.columns)).encode("utf-8")+b"2aaf4bd2f3671e54").hexdigest() == "120138e89f291afaf8ee22881b0d565599df9c48", "type of \"style\" in clean_beer.columns is not bool. \"style\" in clean_beer.columns should be a bool"
assert sha1(str("style" in clean_beer.columns).encode("utf-8")+b"2aaf4bd2f3671e54").hexdigest() == "fc4551319711c95ad3fa8fbf74225f3899915ac3", "boolean value of \"style\" in clean_beer.columns is not correct"

assert sha1(str(type("brewery_id" in clean_beer.columns)).encode("utf-8")+b"5285d68068c6d968").hexdigest() == "708d2f08dd764450a9808d0f04f52c2177916ee5", "type of \"brewery_id\" in clean_beer.columns is not bool. \"brewery_id\" in clean_beer.columns should be a bool"
assert sha1(str("brewery_id" in clean_beer.columns).encode("utf-8")+b"5285d68068c6d968").hexdigest() == "fa7ec0810a4009192b9167755c147e170c8bb614", "boolean value of \"brewery_id\" in clean_beer.columns is not correct"

assert sha1(str(type("ounces" in clean_beer.columns)).encode("utf-8")+b"8f4a4bc454ffbb05").hexdigest() == "91043617ffa277b7b132c72248ce26a0a2ec7936", "type of \"ounces\" in clean_beer.columns is not bool. \"ounces\" in clean_beer.columns should be a bool"
assert sha1(str("ounces" in clean_beer.columns).encode("utf-8")+b"8f4a4bc454ffbb05").hexdigest() == "7defeec87e2494d13886571cebef2846f9a9f53e", "boolean value of \"ounces\" in clean_beer.columns is not correct"

print('Success!')

**Question 1.3**
<br>{points: 1}

Why do we need to scale the variables when using k-means clustering?

A. k-means uses the Euclidean distance to compute how similar data points are to each cluster centre

B. k-means is an iterative algorithm

C. Some variables might be more important for prediction than others

D. To make sure their mean is 0

*Assign your answer to an object named `answer1_3`. Make sure your answer is a single upper-case character surrounded by quotes.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_3)).encode("utf-8")+b"3770ab9b6a8ae236").hexdigest() == "766b64fe4856bb2bbda6fabd58bfb11100a66877", "type of answer1_3 is not str. answer1_3 should be an str"
assert sha1(str(len(answer1_3)).encode("utf-8")+b"3770ab9b6a8ae236").hexdigest() == "2d211fdb3d50d04bdbca8ff33a572a7e6a99532a", "length of answer1_3 is not correct"
assert sha1(str(answer1_3.lower()).encode("utf-8")+b"3770ab9b6a8ae236").hexdigest() == "096480e9482bca090b0838e78d90e57104f5fc8f", "value of answer1_3 is not correct"
assert sha1(str(answer1_3).encode("utf-8")+b"3770ab9b6a8ae236").hexdigest() == "2c5780f0d777f5b39995b4ba0423ab149141b0d3", "correct string value of answer1_3 but incorrect case of letters"

print('Success!')

**Question 1.4**
<br> {points: 1}

Let's set up that scaling now. Use `make_column_transformer` to specify that we want to apply a `StandardScaler()` to the columns `ibu` and `abv` (in that order), and that we want to drop any other columns.

*Assign your answer to an object named `beer_preprocessor`. Use the scaffolding provided.*
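To see what `StandardScaler` does to each column, here is a small sketch on made-up numbers (the `raw` array is hypothetical). It compares the scaler's output with the by-hand calculation: subtract the column mean, then divide by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements on very different scales (not the beer data)
raw = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
])

# np.asarray keeps this a plain array even if pandas output is configured
scaled = np.asarray(StandardScaler().fit_transform(raw))

# The same calculation by hand, using the population standard deviation (ddof=0)
by_hand = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled)
print(np.allclose(scaled, by_hand))
```

After scaling, both columns contain the same values even though the raw columns differ by a factor of 1000, so neither variable dominates a distance calculation.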

In [None]:
# ___ = ___(
#     (___(), [___, ___]),
#     ___='drop',
#     verbose_feature_names_out=False,
# )

# your code here
raise NotImplementedError
beer_preprocessor

In [None]:
from hashlib import sha1
assert sha1(str(type(beer_preprocessor is None)).encode("utf-8")+b"4dfed88add813cd8").hexdigest() == "8be1a4b634f199d0c4431eb00b3616112970a240", "type of beer_preprocessor is None is not bool. beer_preprocessor is None should be a bool"
assert sha1(str(beer_preprocessor is None).encode("utf-8")+b"4dfed88add813cd8").hexdigest() == "c0d529ca31bc5ba5f0a5de7b764f36d212ea8c95", "boolean value of beer_preprocessor is None is not correct"

assert sha1(str(type(type(beer_preprocessor))).encode("utf-8")+b"4d596b0df8ddd0a5").hexdigest() == "562ec7a052ea3e7345a1aadad66dcae8427e1d4b", "type of type(beer_preprocessor) is not correct"
assert sha1(str(type(beer_preprocessor)).encode("utf-8")+b"4d596b0df8ddd0a5").hexdigest() == "27e5d95a7d96553925c41eefd3ad6b6f802a60dd", "value of type(beer_preprocessor) is not correct"

assert sha1(str(type(beer_preprocessor.get_feature_names_out)).encode("utf-8")+b"abb9a63b17af75dd").hexdigest() == "5d439eebfc9013917c4b2d625bffa02ee8a96c2f", "type of beer_preprocessor.get_feature_names_out is not correct"
assert sha1(str(beer_preprocessor.get_feature_names_out).encode("utf-8")+b"abb9a63b17af75dd").hexdigest() == "4fbbe85f6182c8825abeeddf59328d623a7340a3", "value of beer_preprocessor.get_feature_names_out is not correct"

print('Success!')

**Question 1.5**
<br> {points: 1}

The next step in our clustering workflow is to create a model that specifies how we want to cluster the data. Based on our exploratory data visualization, we will start simple with only 2 clusters. Use `KMeans` with `n_clusters=2` to perform clustering with this choice of K.

*Assign your model to an object named `beer_cluster_k2`. Note that since k-means uses a random initialization, we need to set the `random_state`; don't change the value!*
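The sketch below illustrates why `random_state` matters, using a tiny made-up array (`points` is hypothetical, and the values `n_init=10` and `random_state=0` are arbitrary choices for this illustration, not the values used in this worksheet). Fitting twice with the same `random_state` gives the same labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated hypothetical groups of points
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# Same settings and same random_state for both fits
first_fit = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
second_fit = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(first_fit.labels_)
print(first_fit.cluster_centers_)
print(np.array_equal(first_fit.labels_, second_fit.labels_))
```

The cluster numbers themselves (0 or 1) carry no meaning; only which points share a label matters.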

In [None]:
# ___ = KMeans(n_clusters=2, random_state=1234)  # Don't change the random_state value
# ___.fit(___)

# your code here
raise NotImplementedError
beer_cluster_k2

In [None]:
from hashlib import sha1
assert sha1(str(type(type(beer_cluster_k2))).encode("utf-8")+b"58c5ff665ff58fcf").hexdigest() == "a3b7725a5df6c50408b3d43610b16098c6d816ee", "type of type(beer_cluster_k2) is not correct"
assert sha1(str(type(beer_cluster_k2)).encode("utf-8")+b"58c5ff665ff58fcf").hexdigest() == "d40dfe3540f398cb7cb7b2fe4e562ef60d0dc234", "value of type(beer_cluster_k2) is not correct"

assert sha1(str(type(beer_cluster_k2.n_clusters)).encode("utf-8")+b"173e74534c2cb485").hexdigest() == "b922d07be947674d09628c6c4ff98aefd4fafd0e", "type of beer_cluster_k2.n_clusters is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(beer_cluster_k2.n_clusters).encode("utf-8")+b"173e74534c2cb485").hexdigest() == "106b6306e1905f633ecc474391bba1d2338e6cfe", "value of beer_cluster_k2.n_clusters is not correct"

print('Success!')

**Question 1.6**
<br> {points: 1}

Combine the preprocessor and model specification into a pipeline, and fit the pipeline on the `clean_beer` data.

*Assign your model to an object named `beer_pipe`.*
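As a generic illustration of how `make_pipeline` chains steps, here is a sketch on made-up data (the `toy` dataframe, its column names, and the settings `n_init=10` and `random_state=0` are all hypothetical). It also shows that `make_pipeline` names each step after its class, in lower case, which is how the steps can be looked up later through `named_steps`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two obvious groups
toy = pd.DataFrame({
    "length_cm": [1.0, 1.2, 0.9, 8.0, 8.3, 7.9],
    "mass_g": [100.0, 120.0, 90.0, 900.0, 950.0, 880.0],
})

# Scale first, then cluster the scaled values
toy_pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
toy_pipe.fit(toy)

# Step names are generated automatically from the class names
step_names = list(toy_pipe.named_steps)
print(step_names)  # ['standardscaler', 'kmeans']
```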

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(beer_pipe is None)).encode("utf-8")+b"45cadb8450aa7723").hexdigest() == "509259dab5310332347ad71bfd26f363798fd7c0", "type of beer_pipe is None is not bool. beer_pipe is None should be a bool"
assert sha1(str(beer_pipe is None).encode("utf-8")+b"45cadb8450aa7723").hexdigest() == "6e76120e8e9ae29a653925f4e53a14492ca877d4", "boolean value of beer_pipe is None is not correct"

assert sha1(str(type(type(beer_pipe))).encode("utf-8")+b"cb43ee60667844e7").hexdigest() == "8d983a8c7ebd047d987a16029dd8528f43ec75e1", "type of type(beer_pipe) is not correct"
assert sha1(str(type(beer_pipe)).encode("utf-8")+b"cb43ee60667844e7").hexdigest() == "67049472b9cf2b3e39a061be98ed35e505516cd1", "value of type(beer_pipe) is not correct"

assert sha1(str(type(beer_pipe.named_steps.kmeans.n_clusters)).encode("utf-8")+b"92e6e410a95563c0").hexdigest() == "53cf0d2aed7ea9ec868ae7cb3354ddac2bf8fd4e", "type of beer_pipe.named_steps.kmeans.n_clusters is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(beer_pipe.named_steps.kmeans.n_clusters).encode("utf-8")+b"92e6e410a95563c0").hexdigest() == "c119a00dc7c06173470daee81eb1c0474f301467", "value of beer_pipe.named_steps.kmeans.n_clusters is not correct"

print('Success!')

**Question 1.7**
<br> {points: 1}

Use the `labels_` attribute of the `KMeans` model inside `beer_pipe` to get the cluster assignment for each point in the `clean_beer` data. Create a new dataframe called `clustered_beer` and assign the cluster labels to a column named `cluster`.
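The following sketch shows the two pieces involved here on made-up data (the `toy` dataframe and the pipeline settings are hypothetical): a pipeline step can be retrieved by position, and `assign` returns a new dataframe rather than modifying the original.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two obvious groups
toy = pd.DataFrame({
    "length_cm": [1.0, 1.2, 0.9, 8.0, 8.3, 7.9],
    "mass_g": [100.0, 120.0, 90.0, 900.0, 950.0, 880.0],
})

toy_pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
).fit(toy)

# Indexing by position and looking up by name give the same fitted object
same_object = toy_pipe[1] is toy_pipe.named_steps.kmeans

# One label per row, in the same row order as the data used for fitting
toy_clustered = toy.assign(cluster=toy_pipe[1].labels_)

print(same_object)
print(toy_clustered)
print("cluster" in toy.columns)  # the original dataframe is unchanged
```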

In [None]:
# ___ = ___.assign(
#     cluster=___[1].___  # The KMeans model is in the second position of the pipeline
# )

# your code here
raise NotImplementedError
clustered_beer

In [None]:
from hashlib import sha1
assert sha1(str(type("abv" in clustered_beer.columns)).encode("utf-8")+b"aaef6aaeef49f8af").hexdigest() == "c517824dcd7596752e0c9d233ac9a69a88140dde", "type of \"abv\" in clustered_beer.columns is not bool. \"abv\" in clustered_beer.columns should be a bool"
assert sha1(str("abv" in clustered_beer.columns).encode("utf-8")+b"aaef6aaeef49f8af").hexdigest() == "900f3ff3759f5ff07e818057176c2c8e83aeea90", "boolean value of \"abv\" in clustered_beer.columns is not correct"

assert sha1(str(type("ibu" in clustered_beer.columns)).encode("utf-8")+b"522178535ede6532").hexdigest() == "fb869f864491fd4f023a5e3b93d96116d827e727", "type of \"ibu\" in clustered_beer.columns is not bool. \"ibu\" in clustered_beer.columns should be a bool"
assert sha1(str("ibu" in clustered_beer.columns).encode("utf-8")+b"522178535ede6532").hexdigest() == "4591feef988ce381c7f9929a9d4150e986c8fbae", "boolean value of \"ibu\" in clustered_beer.columns is not correct"

assert sha1(str(type("cluster" in clustered_beer.columns)).encode("utf-8")+b"b3a0e862ae816948").hexdigest() == "c609806123f071a66c49968baa1b403c1f5c08c4", "type of \"cluster\" in clustered_beer.columns is not bool. \"cluster\" in clustered_beer.columns should be a bool"
assert sha1(str("cluster" in clustered_beer.columns).encode("utf-8")+b"b3a0e862ae816948").hexdigest() == "4de02870b71d669c1cd79d3a54cfdcc544b7ea95", "boolean value of \"cluster\" in clustered_beer.columns is not correct"

assert sha1(str(type(clustered_beer.shape[0])).encode("utf-8")+b"01776c4ed1d9ff3d").hexdigest() == "744cd9bb55620f4aee86d8ef70f7b10ac958f140", "type of clustered_beer.shape[0] is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(clustered_beer.shape[0]).encode("utf-8")+b"01776c4ed1d9ff3d").hexdigest() == "39cbff105a9f68f05d1300a85a2b0115e6575c1c", "value of clustered_beer.shape[0] is not correct"

assert sha1(str(type(clustered_beer.shape[1])).encode("utf-8")+b"53a1f5a00e28fe10").hexdigest() == "7e929ae14a1626de504da0b19b4aab3da77cfd9e", "type of clustered_beer.shape[1] is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(clustered_beer.shape[1]).encode("utf-8")+b"53a1f5a00e28fe10").hexdigest() == "74d5d07ca3e00471bb885ed459aaacb51add9f77", "value of clustered_beer.shape[1] is not correct"

print('Success!')

**Question 1.8**
<br> {points: 1}

Create a scatter plot of `abv` on the y-axis versus `ibu` on the x-axis (using the data in `clustered_beer`) where the points are coloured by their cluster assignment. Add the `:N` suffix to the column name to ensure that Altair will treat the `cluster` column as a categorical variable, and hence use a suitable colour scheme. Name the plot object `clustered_beer_chart`.

*Remember to follow the best visualization practices, including adding human-readable labels to your plot.*

In [None]:
# your code here
raise NotImplementedError
clustered_beer_chart

In [None]:
from hashlib import sha1
assert sha1(str(type(clustered_beer_chart is None)).encode("utf-8")+b"06d83f3a5c842424").hexdigest() == "f7a7afa5ad3a071b254d186a6016fa9277068fb6", "type of clustered_beer_chart is None is not bool. clustered_beer_chart is None should be a bool"
assert sha1(str(clustered_beer_chart is None).encode("utf-8")+b"06d83f3a5c842424").hexdigest() == "86e03b0238361c4ded5438b8906119e984d09015", "boolean value of clustered_beer_chart is None is not correct"

assert sha1(str(type(clustered_beer_chart.to_dict()['mark']['type'] in ['circle', 'point'])).encode("utf-8")+b"60090b29a3f6f2bd").hexdigest() == "89683360e5b3be0a8ce170e98068c8735d92a435", "type of clustered_beer_chart.to_dict()['mark']['type'] in ['circle', 'point'] is not bool. clustered_beer_chart.to_dict()['mark']['type'] in ['circle', 'point'] should be a bool"
assert sha1(str(clustered_beer_chart.to_dict()['mark']['type'] in ['circle', 'point']).encode("utf-8")+b"60090b29a3f6f2bd").hexdigest() == "7d91e32d81025620e17cfd74b9a7c8dd8ed27a43", "boolean value of clustered_beer_chart.to_dict()['mark']['type'] in ['circle', 'point'] is not correct"

assert sha1(str(type(clustered_beer_chart.data.equals(clustered_beer_chart))).encode("utf-8")+b"155b1a19f2d8a91b").hexdigest() == "4cbc2874b5972a54f6783255f8e72ee7c2cd373c", "type of clustered_beer_chart.data.equals(clustered_beer_chart) is not bool. clustered_beer_chart.data.equals(clustered_beer_chart) should be a bool"
assert sha1(str(clustered_beer_chart.data.equals(clustered_beer_chart)).encode("utf-8")+b"155b1a19f2d8a91b").hexdigest() == "d3f951cfe0846dce524dae3a4beb5f286487132e", "boolean value of clustered_beer_chart.data.equals(clustered_beer_chart) is not correct"

assert sha1(str(type(clustered_beer_chart.encoding.x['shorthand'])).encode("utf-8")+b"f5f342265fd6a188").hexdigest() == "1a220954c4281f09bdd724241a9d7cddc863ff74", "type of clustered_beer_chart.encoding.x['shorthand'] is not str. clustered_beer_chart.encoding.x['shorthand'] should be an str"
assert sha1(str(len(clustered_beer_chart.encoding.x['shorthand'])).encode("utf-8")+b"f5f342265fd6a188").hexdigest() == "35f806bedefd612e09f9e455bf1b4ad00f963df9", "length of clustered_beer_chart.encoding.x['shorthand'] is not correct"
assert sha1(str(clustered_beer_chart.encoding.x['shorthand'].lower()).encode("utf-8")+b"f5f342265fd6a188").hexdigest() == "1bb2da6359b3e54063726927d9cec19fb491db90", "value of clustered_beer_chart.encoding.x['shorthand'] is not correct"
assert sha1(str(clustered_beer_chart.encoding.x['shorthand']).encode("utf-8")+b"f5f342265fd6a188").hexdigest() == "1bb2da6359b3e54063726927d9cec19fb491db90", "correct string value of clustered_beer_chart.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(clustered_beer_chart.encoding.y['shorthand'])).encode("utf-8")+b"0f7c50e7bfb7164b").hexdigest() == "06dce14162526409f1a2d6ae6f6a4f186d1f2676", "type of clustered_beer_chart.encoding.y['shorthand'] is not str. clustered_beer_chart.encoding.y['shorthand'] should be an str"
assert sha1(str(len(clustered_beer_chart.encoding.y['shorthand'])).encode("utf-8")+b"0f7c50e7bfb7164b").hexdigest() == "ff4e8c4a8760d3b8d725f66a3697bb9b6ad994e4", "length of clustered_beer_chart.encoding.y['shorthand'] is not correct"
assert sha1(str(clustered_beer_chart.encoding.y['shorthand'].lower()).encode("utf-8")+b"0f7c50e7bfb7164b").hexdigest() == "d578e19ce151fbccb01f0fb9a11504423c3580e7", "value of clustered_beer_chart.encoding.y['shorthand'] is not correct"
assert sha1(str(clustered_beer_chart.encoding.y['shorthand']).encode("utf-8")+b"0f7c50e7bfb7164b").hexdigest() == "d578e19ce151fbccb01f0fb9a11504423c3580e7", "correct string value of clustered_beer_chart.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(clustered_beer_chart.encoding.color['shorthand'])).encode("utf-8")+b"f9a33f99e8e0af06").hexdigest() == "57d08d23d6bbf647a698588120d29707a4a4ddf6", "type of clustered_beer_chart.encoding.color['shorthand'] is not str. clustered_beer_chart.encoding.color['shorthand'] should be an str"
assert sha1(str(len(clustered_beer_chart.encoding.color['shorthand'])).encode("utf-8")+b"f9a33f99e8e0af06").hexdigest() == "3933a2ef53664ae0ac80871396da612b7ae08ed0", "length of clustered_beer_chart.encoding.color['shorthand'] is not correct"
assert sha1(str(clustered_beer_chart.encoding.color['shorthand'].lower()).encode("utf-8")+b"f9a33f99e8e0af06").hexdigest() == "ebbaf1fcbd0c9ea4cfd1aef304b0e97239c82eae", "value of clustered_beer_chart.encoding.color['shorthand'] is not correct"
assert sha1(str(clustered_beer_chart.encoding.color['shorthand']).encode("utf-8")+b"f9a33f99e8e0af06").hexdigest() == "6dbdff11407303ccdb905fc07bbc6ace8f1a0867", "correct string value of clustered_beer_chart.encoding.color['shorthand'] but incorrect case of letters"

assert sha1(str(type(isinstance(clustered_beer_chart.encoding.x['title'], str))).encode("utf-8")+b"388b45056519bd72").hexdigest() == "1b9c55d00069543929dc70257548673f2fea281e", "type of isinstance(clustered_beer_chart.encoding.x['title'], str) is not bool. isinstance(clustered_beer_chart.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(clustered_beer_chart.encoding.x['title'], str)).encode("utf-8")+b"388b45056519bd72").hexdigest() == "f63601819c7f60f308008806df01de326f795045", "boolean value of isinstance(clustered_beer_chart.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(clustered_beer_chart.encoding.y['title'], str))).encode("utf-8")+b"262e3f1b055b65c9").hexdigest() == "1c2bec57a300b90077ea747360b70997039209b1", "type of isinstance(clustered_beer_chart.encoding.y['title'], str) is not bool. isinstance(clustered_beer_chart.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(clustered_beer_chart.encoding.y['title'], str)).encode("utf-8")+b"262e3f1b055b65c9").hexdigest() == "9699ce969ea0f86bf60678c6c64fa2a728daf384", "boolean value of isinstance(clustered_beer_chart.encoding.y['title'], str) is not correct"

assert sha1(str(type(isinstance(clustered_beer_chart.encoding.color['title'], str))).encode("utf-8")+b"07a2ca55e3026063").hexdigest() == "c1316161121d44b5358e1839d2aa3c6d667a2437", "type of isinstance(clustered_beer_chart.encoding.color['title'], str) is not bool. isinstance(clustered_beer_chart.encoding.color['title'], str) should be a bool"
assert sha1(str(isinstance(clustered_beer_chart.encoding.color['title'], str)).encode("utf-8")+b"07a2ca55e3026063").hexdigest() == "52e6674d5f9772e065e4590ae38d8a525a784b37", "boolean value of isinstance(clustered_beer_chart.encoding.color['title'], str) is not correct"

print('Success!')

**Question 1.9.1** Multiple Choice:
<br> {points: 1}

We do not know, however, that two clusters (K = 2) is the best choice for this data set. What can we do to choose the best K?

A. Perform *cross-validation* for a variety of possible Ks. Choose the one where the total within-cluster sum of squared distances starts to *decrease less*.

B. Perform *cross-validation* for a variety of possible Ks. Choose the one where the total within-cluster sum of squared distances starts to *decrease more*.

C. Perform *clustering* for a variety of possible Ks. Choose the one where the total within-cluster sum of squared distances starts to *decrease less*.

D. Perform *clustering* for a variety of possible Ks. Choose the one where the total within-cluster sum of squared distances starts to *decrease more*.

*Assign your answer to an object called `answer1_9_1`. Make sure it is a single upper-case character surrounded by quotes.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_9_1)).encode("utf-8")+b"697d3926a83e5104").hexdigest() == "ecf156f012a16310010caa66282d4957a0357a3f", "type of answer1_9_1 is not str. answer1_9_1 should be an str"
assert sha1(str(len(answer1_9_1)).encode("utf-8")+b"697d3926a83e5104").hexdigest() == "98f125d5911e0fad2442282287b6016a27207f79", "length of answer1_9_1 is not correct"
assert sha1(str(answer1_9_1.lower()).encode("utf-8")+b"697d3926a83e5104").hexdigest() == "f45d4c52544b2d67be10b685d9dd64b53ca8fa61", "value of answer1_9_1 is not correct"
assert sha1(str(answer1_9_1).encode("utf-8")+b"697d3926a83e5104").hexdigest() == "e8beb78dc890b1a931aad2e66d00eeede2e412a8", "correct string value of answer1_9_1 but incorrect case of letters"

print('Success!')

**Question 1.9.2**
<br> {points: 1}

Let's check the total within-cluster sum of squared distances (WSSD) for our K=2 model. Remember that scikit-learn already computes this for us and stores it in an attribute of the fitted model object. Find the name of the attribute and store its value in a new variable called `beer_cluster_k2_wssd`. Remember that you need to access the model through its numerical position among the steps of the `beer_pipe` pipeline.

*Hint: Check the textbook if you don't remember the name of the attribute.*
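
To make the quantity concrete, here is a minimal sketch that computes the total within-cluster sum of squared distances by hand for a tiny made-up data set. The points, the labels, and the helper name `total_wssd` are all hypothetical and have nothing to do with the beer data; scikit-learn computes the same kind of quantity for you on a fitted model.

```python
# Hypothetical toy data: four 2-D points with hard-coded cluster labels.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
labels = [0, 0, 1, 1]


def total_wssd(points, labels):
    """Sum of squared distances from each point to its cluster's centre."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        # The cluster centre is the coordinate-wise mean of its members.
        centre = [sum(coord) / len(members) for coord in zip(*members)]
        for p in members:
            total += sum((x - c) ** 2 for x, c in zip(p, centre))
    return total


toy_wssd = total_wssd(points, labels)
print(toy_wssd)
```

Putting all four toy points into a single cluster instead (labels `[0, 0, 0, 0]`) gives a larger total, which illustrates why WSSD shrinks as more clusters are allowed.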

In [None]:
# your code here
raise NotImplementedError
beer_cluster_k2_wssd

In [None]:
from hashlib import sha1
assert sha1(str(type(beer_cluster_k2_wssd is None)).encode("utf-8")+b"ecf861d3e08c3e50").hexdigest() == "947ac8d2867f367fe08ad85f5540e8d6482377fb", "type of beer_cluster_k2_wssd is None is not bool. beer_cluster_k2_wssd is None should be a bool"
assert sha1(str(beer_cluster_k2_wssd is None).encode("utf-8")+b"ecf861d3e08c3e50").hexdigest() == "66f2a1f3b3fc3a1843a544753a90219088a43a27", "boolean value of beer_cluster_k2_wssd is None is not correct"

assert sha1(str(type(beer_cluster_k2_wssd)).encode("utf-8")+b"2dd25fb58665fca5").hexdigest() == "32e6228781cb45951a21fda8dbae88f574d91f2d", "type of type(beer_cluster_k2_wssd) is not correct"

assert sha1(str(type(round(beer_cluster_k2_wssd, 2))).encode("utf-8")+b"542988c443b6b8cd").hexdigest() == "9186637b20f5f322c5978198f8944f4375e14341", "type of round(beer_cluster_k2_wssd, 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(beer_cluster_k2_wssd, 2), 2)).encode("utf-8")+b"542988c443b6b8cd").hexdigest() == "4dcbe0ad1de45b6a68a3f27efeb4eeb4a87512e9", "value of round(beer_cluster_k2_wssd, 2) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 2.0**
<br> {points: 1}

Let's now choose the best $K$ for this clustering problem by computing the total within-cluster sum of squares for multiple values of $K$ and finding the point where increasing $K$ further no longer gives a substantial decrease (the "elbow"). To do this, we first need to create a `range` of values to test; in this case we are interested in all the integers from 1 to 10 (both inclusive).

*Assign your answer to an object named `beer_ks`.*
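
As a reminder of how `range` treats its endpoints, the short sketch below uses arbitrary bounds that differ from the ones this question asks for. The name `example_ks` is hypothetical.

```python
# range(start, stop) includes `start` but excludes `stop`.
example_ks = range(3, 8)

print(list(example_ks))                   # the values produced
print(example_ks.start, example_ks.stop)  # the stored endpoints
```

Note that the last value produced is one less than `stop`, so an inclusive upper bound needs a `stop` that is one larger.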

In [None]:
# your code here
raise NotImplementedError
beer_ks

In [None]:
from hashlib import sha1
assert sha1(str(type(beer_ks)).encode("utf-8")+b"a5823a68ba1530ec").hexdigest() == "5dfaf11c5adb17316e102e87909b2c4ae9e600a3", "type of type(beer_ks) is not correct"

assert sha1(str(type(beer_ks.start)).encode("utf-8")+b"542d0fcd2dec327a").hexdigest() == "4f9d44517d96cc1c43b4cdafa5b38c5a01fa7fc0", "type of beer_ks.start is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(beer_ks.start).encode("utf-8")+b"542d0fcd2dec327a").hexdigest() == "c79b54872003abb0de702da4e3a12eeb2855f7bc", "value of beer_ks.start is not correct"

assert sha1(str(type(beer_ks.stop)).encode("utf-8")+b"8bd114584cc95abd").hexdigest() == "10f5915b6a382993d8216ac3d58112b2ad6c2b59", "type of beer_ks.stop is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(beer_ks.stop).encode("utf-8")+b"8bd114584cc95abd").hexdigest() == "47208d27f219d505d8b580053ccf6300a79caca8", "value of beer_ks.stop is not correct"

print('Success!')

**Question 2.1**
<br> {points: 1}

Next, we want to compute the WSSD for each value of $K$.

Use a list comprehension to create a KMeans clustering model with $K$ clusters for each value of $K$ in the `beer_ks` range you just created. Each model should be wrapped in a pipeline together with the `beer_preprocessor` we created earlier. Train the pipeline on the `clean_beer` data and output the WSSD value for each value of $K$ as a list.

*Assign your answer to an object named `beer_wssds`.*

In [None]:
# ___ = [
#     ___(
#         beer_preprocessor, 
#         ___(n_clusters=___, random_state=1234)  # Create a new model with `k` clusters
#     ).fit(___)[1].___  # Fit the pipeline and compute its WSSD
#     for k in ___
# ]

# your code here
raise NotImplementedError
beer_wssds

In [None]:
from hashlib import sha1
assert sha1(str(type(len(beer_wssds))).encode("utf-8")+b"c64fa146de130cd3").hexdigest() == "b683874c5603cbf573a03bd20759af408debebce", "type of len(beer_wssds) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(len(beer_wssds)).encode("utf-8")+b"c64fa146de130cd3").hexdigest() == "628e4e869eadf113189f8a3b289142bcad45b7da", "value of len(beer_wssds) is not correct"

assert sha1(str(type(round(sum(beer_wssds), 2))).encode("utf-8")+b"b843adffe6deb299").hexdigest() == "973a8a55013ac92f366e6a9adf5a2c299868cdf4", "type of round(sum(beer_wssds), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(beer_wssds), 2), 2)).encode("utf-8")+b"b843adffe6deb299").hexdigest() == "f3ffb8dc9df5986998a4c8f3f9303a8dab4ebe3b", "value of round(sum(beer_wssds), 2) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 2.2**
<br> {points: 1}

Before visualizing our results, we need to create a dataframe that holds the values of $K$ in a column called `k` and the WSSD values in a column called `wssd`.

*Assign your answer to an object named `beer_model_stats`.*

In [None]:
# your code here
raise NotImplementedError
beer_model_stats

In [None]:
from hashlib import sha1
assert sha1(str(type(beer_model_stats.shape)).encode("utf-8")+b"c3db79f44a07daa7").hexdigest() == "8742414f8e0c3c536bf620d25a06868270177f43", "type of beer_model_stats.shape is not tuple. beer_model_stats.shape should be a tuple"
assert sha1(str(len(beer_model_stats.shape)).encode("utf-8")+b"c3db79f44a07daa7").hexdigest() == "578f9dc78cdd4cb58967b6b0e915acd719dbb19c", "length of beer_model_stats.shape is not correct"
assert sha1(str(sorted(map(str, beer_model_stats.shape))).encode("utf-8")+b"c3db79f44a07daa7").hexdigest() == "a84b9d7baea6b76322723bb123bc9900a8b2d11a", "values of beer_model_stats.shape are not correct"
assert sha1(str(beer_model_stats.shape).encode("utf-8")+b"c3db79f44a07daa7").hexdigest() == "96b61213faa03e00f9e29fe0a18ae7155af5164b", "order of elements of beer_model_stats.shape is not correct"

assert sha1(str(type("k" in beer_model_stats.columns.values)).encode("utf-8")+b"3170219e598f17b5").hexdigest() == "4dcd3cfff17f2c6730a1d4f114fc6eabac7ce39f", "type of \"k\" in beer_model_stats.columns.values is not bool. \"k\" in beer_model_stats.columns.values should be a bool"
assert sha1(str("k" in beer_model_stats.columns.values).encode("utf-8")+b"3170219e598f17b5").hexdigest() == "d006cb152ab28b98d9f37830a2264ae00d9b8982", "boolean value of \"k\" in beer_model_stats.columns.values is not correct"

assert sha1(str(type("wssd" in beer_model_stats.columns)).encode("utf-8")+b"eb34a326174fe368").hexdigest() == "c3acb10c58bbc33583b0385946360aef3b6ab230", "type of \"wssd\" in beer_model_stats.columns is not bool. \"wssd\" in beer_model_stats.columns should be a bool"
assert sha1(str("wssd" in beer_model_stats.columns).encode("utf-8")+b"eb34a326174fe368").hexdigest() == "eabf37e4ff21d0a54d13743d202a4fedc188d066", "boolean value of \"wssd\" in beer_model_stats.columns is not correct"

assert sha1(str(type(beer_model_stats['k'][0])).encode("utf-8")+b"70c468b564c6c9e5").hexdigest() == "7016c656828ebc028c269da837d51cbea1045fda", "type of type(beer_model_stats['k'][0]) is not correct"

assert sha1(str(type(beer_model_stats['wssd'][0])).encode("utf-8")+b"57e8db2418145c5f").hexdigest() == "43a2bc43320f9a1d62a002650a575ab6dd24951f", "type of type(beer_model_stats['wssd'][0]) is not correct"

print('Success!')

**Question 2.3**
<br> {points: 1}

Create a line plot of total within-cluster sum of squares (y-axis) versus the number of clusters (x-axis), so that we can choose the best number of clusters to use. Use the correct parameter inside `mark_line` to include a point in the chart for each data point.

*Assign your plot to an object called `elbow_plot`. Remember to follow the best visualization practices, including adding human-readable labels to your plot.*
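
An elbow plot is read by looking at how much the WSSD drops between consecutive values of $K$. The sketch below does a rough numerical version of that reading on made-up WSSD values; the numbers and the names `example_ks`, `example_wssds`, and `drops` are hypothetical and are not results from the beer data.

```python
# Hypothetical WSSD values for K = 1..6 (illustrative only, not from the beer data).
example_ks = range(1, 7)
example_wssds = [100.0, 40.0, 15.0, 12.0, 10.5, 9.5]

# Decrease in WSSD obtained by moving from K-1 clusters to K clusters.
drops = [prev - curr for prev, curr in zip(example_wssds, example_wssds[1:])]

for k, drop in zip(list(example_ks)[1:], drops):
    print(f"K={k}: WSSD decreased by {drop:.1f}")
```

In this made-up example the decreases become small after K = 3, which is where the bend would appear on a line plot of these values.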

In [None]:
# your code here
raise NotImplementedError
elbow_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(elbow_plot is None)).encode("utf-8")+b"496e28b810a8a328").hexdigest() == "124a994f03de88d3237d59bbef78bd985d9bd767", "type of elbow_plot is None is not bool. elbow_plot is None should be a bool"
assert sha1(str(elbow_plot is None).encode("utf-8")+b"496e28b810a8a328").hexdigest() == "ccf1676d5df61beea3cc35c56830dcfb38f58101", "boolean value of elbow_plot is None is not correct"

assert sha1(str(type(elbow_plot.encoding.x['shorthand'])).encode("utf-8")+b"73d2d0503ad90cce").hexdigest() == "ae027c28df98cfd8f4ab937940d328a355d4dc5e", "type of elbow_plot.encoding.x['shorthand'] is not str. elbow_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(elbow_plot.encoding.x['shorthand'])).encode("utf-8")+b"73d2d0503ad90cce").hexdigest() == "2302b38c97bdb10882895a9c78c1904808667b2d", "length of elbow_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"73d2d0503ad90cce").hexdigest() == "fc0560533c56ed73020767962067d2afc712c546", "value of elbow_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.x['shorthand']).encode("utf-8")+b"73d2d0503ad90cce").hexdigest() == "fc0560533c56ed73020767962067d2afc712c546", "correct string value of elbow_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(elbow_plot.encoding.y['shorthand'])).encode("utf-8")+b"02ff938dd19f1583").hexdigest() == "458f10e4c70d22a126cc427ae71889d0a6599975", "type of elbow_plot.encoding.y['shorthand'] is not str. elbow_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(elbow_plot.encoding.y['shorthand'])).encode("utf-8")+b"02ff938dd19f1583").hexdigest() == "1dd91b100c1ffd7b25add27e9e4afebaaa358b6f", "length of elbow_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"02ff938dd19f1583").hexdigest() == "afbabc1d72e7a8c1a0e542c2ff92fbf01759b1d4", "value of elbow_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.y['shorthand']).encode("utf-8")+b"02ff938dd19f1583").hexdigest() == "afbabc1d72e7a8c1a0e542c2ff92fbf01759b1d4", "correct string value of elbow_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(elbow_plot.mark)).encode("utf-8")+b"7efe4959c24c52b1").hexdigest() == "0eafb4e74e656602375c98fa29d6aa5dbebef904", "type of elbow_plot.mark is not correct"
assert sha1(str(elbow_plot.mark).encode("utf-8")+b"7efe4959c24c52b1").hexdigest() == "d9125e6db5c0e65c86357766d1aa407a39a4d436", "value of elbow_plot.mark is not correct"

assert sha1(str(type(isinstance(elbow_plot.encoding.x['title'], str))).encode("utf-8")+b"ed3b0e39603beff1").hexdigest() == "b89205c7579b6988d28f68490fa688126931f041", "type of isinstance(elbow_plot.encoding.x['title'], str) is not bool. isinstance(elbow_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(elbow_plot.encoding.x['title'], str)).encode("utf-8")+b"ed3b0e39603beff1").hexdigest() == "ac4eb33c2080b9e7480ac088f417997c3ea082a6", "boolean value of isinstance(elbow_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(elbow_plot.encoding.y['title'], str))).encode("utf-8")+b"35b6df5197f50156").hexdigest() == "a80bc9129b52409c103478e3057bd11467d57064", "type of isinstance(elbow_plot.encoding.y['title'], str) is not bool. isinstance(elbow_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(elbow_plot.encoding.y['title'], str)).encode("utf-8")+b"35b6df5197f50156").hexdigest() == "8150608b324da96e607ef958198170e92657abde", "boolean value of isinstance(elbow_plot.encoding.y['title'], str) is not correct"

print('Success!')

**Question 2.4**
<br> {points: 1}

From the plot above, which $K$ should we choose? 

*Assign your answer to an object called `answer2_4`. Make sure your answer is a single numerical character surrounded by quotation marks, e.g. `'3'`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_4 in ['2', '3', '4'])).encode("utf-8")+b"44fe820a74d9e18b").hexdigest() == "df893c384ab657c265d12c53373c6909cade9ba7", "type of answer2_4 in ['2', '3', '4'] is not bool. answer2_4 in ['2', '3', '4'] should be a bool"
assert sha1(str(answer2_4 in ['2', '3', '4']).encode("utf-8")+b"44fe820a74d9e18b").hexdigest() == "48f57efcff8d71a11c0499b62ea55ea690d2a095", "boolean value of answer2_4 in ['2', '3', '4'] is not correct"

print('Success!')

**Question 2.5** Multiple Choice:
<br> {points: 1}

Why did we choose the $K$ we chose above?

A. It had the greatest total within-cluster sum of squares

B. It had the smallest total within-cluster sum of squares

C. Increasing $k$ further than this only decreased the total within-cluster sum of squares a small amount

D. Increasing $k$ further than this only increased the total within-cluster sum of squares a small amount

*Assign your answer to an object called `answer2_5`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_5)).encode("utf-8")+b"cbd2daf3e135f76f").hexdigest() == "ea7f4fe24a3a1c1c0952e3b045ce46cb32eb1c79", "type of answer2_5 is not str. answer2_5 should be an str"
assert sha1(str(len(answer2_5)).encode("utf-8")+b"cbd2daf3e135f76f").hexdigest() == "9ac0331ceba761942475a79c3a506d04ce45c614", "length of answer2_5 is not correct"
assert sha1(str(answer2_5.lower()).encode("utf-8")+b"cbd2daf3e135f76f").hexdigest() == "7c9b3374727f65e88c40f7537fe43aaac4902faf", "value of answer2_5 is not correct"
assert sha1(str(answer2_5).encode("utf-8")+b"cbd2daf3e135f76f").hexdigest() == "0f8bf20db2205493d24395d3f106d5a9eb66b65d", "correct string value of answer2_5 but incorrect case of letters"

print('Success!')

**Question 2.6** Multiple Choice:
<br> {points: 1}

What can we conclude from our analysis? How many different types of hoppy craft beer are there in this data set, based on the two variables we have?

A. 1

B. 2 to 4

C. 5 to 7

D. more than 7

*Assign your answer to an object called `answer2_6`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_6)).encode("utf-8")+b"5785a2d9f102c984").hexdigest() == "e4bcdd80ee61ba9b533f334c21a7576b9076ee71", "type of answer2_6 is not str. answer2_6 should be an str"
assert sha1(str(len(answer2_6)).encode("utf-8")+b"5785a2d9f102c984").hexdigest() == "3a3b442755a433388f8986f46d9b1219c3ac8b2c", "length of answer2_6 is not correct"
assert sha1(str(answer2_6.lower()).encode("utf-8")+b"5785a2d9f102c984").hexdigest() == "8f549aa1a33cddeb4b7709aa4cf80e4d36bd702f", "value of answer2_6 is not correct"
assert sha1(str(answer2_6).encode("utf-8")+b"5785a2d9f102c984").hexdigest() == "8af43c24e01e6d1fff8185e2d580331c5b436102", "correct string value of answer2_6 but incorrect case of letters"

print('Success!')

**Question 2.7** True or false:
<br> {points: 1}

Our analysis might change if we added additional variables, true or false?

*Assign your answer to an object called `answer2_7`. Make sure your answer is a boolean. i.e. `True` or `False`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_7)).encode("utf-8")+b"c478fd58c79ecf1b").hexdigest() == "0839d72f574e8a01bb6c1733f3e7c28a030a361d", "type of answer2_7 is not bool. answer2_7 should be a bool"
assert sha1(str(answer2_7).encode("utf-8")+b"c478fd58c79ecf1b").hexdigest() == "b4b2fae8a06961ce87a4a838c3f24d3c1cbbff75", "boolean value of answer2_7 is not correct"

print('Success!')