# Tutorial 10 - Clustering

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
* Explain the k-means clustering algorithm.
* Interpret the output of a k-means cluster analysis.
* Perform k-means clustering in Python using `scikit-learn`.
* Visualize the output of k-means clustering in Python using a coloured scatter plot.
* Identify when it is necessary to scale variables before clustering and do this using Python.
* Use the elbow method to choose the number of clusters for k-means.
* Describe advantages, limitations and assumptions of the k-means clustering algorithm.

This tutorial covers parts of [Chapter 9](https://python.datasciencebook.ca/clustering) of the online textbook. You should read this chapter before attempting this assignment. Any place you see `___`, you must fill in the function, variable, or data to complete the code. Replace the `raise NotImplementedError` line with your completed code and answers, then run the cell.
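
As a refresher on the algorithm named in the learning goals, here is a minimal pure-Python sketch of the k-means loop: pick starting centers, assign each point to its nearest center, move each center to the mean of its points, and repeat until the assignments stop changing. The function names and the toy data are hypothetical and only for illustration. The `KMeans` class from `scikit-learn` that you will use in this tutorial is more sophisticated (for example in how it chooses the starting centers), so treat this as a sketch of the idea rather than of the library's implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Toy k-means: returns (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Initialization: pick k distinct observations as the starting centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no point changed cluster.
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # Keep the old center if a cluster ends up empty.
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical toy data: two well-separated groups in two dimensions.
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
toy_centers, toy_labels = kmeans(toy_points, k=2)
print(toy_labels)
```

The label numbers themselves are arbitrary; what matters is which points share a label.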

In [None]:
### Run this cell before continuing.
import numpy as np
import pandas as pd
import altair as alt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn import set_config


# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

# 1. Pokemon

We will be working with the Pokemon dataset from Kaggle, which can be found [here](https://www.kaggle.com/abcsds/pokemon).
This dataset compiles the statistics on 721 Pokemon. The information in this dataset includes Pokemon name, type, health points, attack strength, defensive strength, speed points, and more. We are interested in seeing whether there are any sub-groups/clusters of Pokemon based on these statistics and, if so, how many there are.

![](https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif)

Source: https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif


**Question 1.0**
<br> {points: 1}

Use `read_csv` to load `pokemon.csv` from the `data/` folder. Don't forget to clean the column names by removing the "." from any column name that contains one.

*A scaffolding of changing ". x" to "_x" has been given below, but you could choose to remove it or make other changes.*

*Assign your answer to an object called `pm_data`.*

In [None]:
# ___ = pd.read_csv(___).rename(columns={___: "Sp_Atk", "Sp. Def":___})


# your code here
raise NotImplementedError
pm_data

In [None]:
from hashlib import sha1
assert sha1(str(type(pm_data is None)).encode("utf-8")+b"afdc").hexdigest() == "852820664d5d68571ba9a684d465e06c4098fd34", "type of pm_data is None is not bool. pm_data is None should be a bool"
assert sha1(str(pm_data is None).encode("utf-8")+b"afdc").hexdigest() == "ffe2da6cbce69fa486812257c53d075d1410ea0c", "boolean value of pm_data is None is not correct"

assert sha1(str(type(pm_data)).encode("utf-8")+b"afdd").hexdigest() == "eca45b169b57a6ccf3b3b21d7c847ee047595cfd", "type of type(pm_data) is not correct"

assert sha1(str(type(pm_data.shape)).encode("utf-8")+b"afde").hexdigest() == "c962e4855b1aafc3fca43d469297dc9d2631eb38", "type of pm_data.shape is not tuple. pm_data.shape should be a tuple"
assert sha1(str(len(pm_data.shape)).encode("utf-8")+b"afde").hexdigest() == "c97b7e5fb5b8d6efd4aadc1f89a25ad0fea3d0a5", "length of pm_data.shape is not correct"
assert sha1(str(sorted(map(str, pm_data.shape))).encode("utf-8")+b"afde").hexdigest() == "a58303e07375e8bbe5b0b5ed13d04073b3724edb", "values of pm_data.shape are not correct"
assert sha1(str(pm_data.shape).encode("utf-8")+b"afde").hexdigest() == "b52533fd4c4ee783cd24703649eb11b68eda1c4c", "order of elements of pm_data.shape is not correct"

assert sha1(str(type('Name' in pm_data.columns)).encode("utf-8")+b"afdf").hexdigest() == "f841c072c04b85b5c2e33cf712e779167144828a", "type of 'Name' in pm_data.columns is not bool. 'Name' in pm_data.columns should be a bool"
assert sha1(str('Name' in pm_data.columns).encode("utf-8")+b"afdf").hexdigest() == "8bdc2d2aa36fd24a0ad95789da7c90f6b9796568", "boolean value of 'Name' in pm_data.columns is not correct"

assert sha1(str(type('HP' in pm_data.columns)).encode("utf-8")+b"afe0").hexdigest() == "bc9af893dc279ddc528a0797e8454bc34772a8fb", "type of 'HP' in pm_data.columns is not bool. 'HP' in pm_data.columns should be a bool"
assert sha1(str('HP' in pm_data.columns).encode("utf-8")+b"afe0").hexdigest() == "585cbe475ac364d5c5ba082824919dc0dcccc0fc", "boolean value of 'HP' in pm_data.columns is not correct"

assert sha1(str(type('Attack' in pm_data.columns)).encode("utf-8")+b"afe1").hexdigest() == "1ffefbf4054712808a3ac3aef5c36c50b0ee6a7c", "type of 'Attack' in pm_data.columns is not bool. 'Attack' in pm_data.columns should be a bool"
assert sha1(str('Attack' in pm_data.columns).encode("utf-8")+b"afe1").hexdigest() == "c94d534efa640ee2e9709e56c7de7767a9647949", "boolean value of 'Attack' in pm_data.columns is not correct"

assert sha1(str(type('Defense' in pm_data.columns)).encode("utf-8")+b"afe2").hexdigest() == "95636d0af31dcbc1762f15f5567a00ed050683b5", "type of 'Defense' in pm_data.columns is not bool. 'Defense' in pm_data.columns should be a bool"
assert sha1(str('Defense' in pm_data.columns).encode("utf-8")+b"afe2").hexdigest() == "794d3c46f863055e63bc473956b9d4ecba0a486c", "boolean value of 'Defense' in pm_data.columns is not correct"

assert sha1(str(type('#' in pm_data.columns)).encode("utf-8")+b"afe3").hexdigest() == "7e82e7d0f19d78d84e505207d4bf93e5772d81c0", "type of '#' in pm_data.columns is not bool. '#' in pm_data.columns should be a bool"
assert sha1(str('#' in pm_data.columns).encode("utf-8")+b"afe3").hexdigest() == "110fba95d5e22dc64b9c7a4a8ca2d3af83108bf4", "boolean value of '#' in pm_data.columns is not correct"

assert sha1(str(type('Type 1' in pm_data.columns)).encode("utf-8")+b"afe4").hexdigest() == "a6bfcb79d9f657af20ee7809c6d8fad3707d8b40", "type of 'Type 1' in pm_data.columns is not bool. 'Type 1' in pm_data.columns should be a bool"
assert sha1(str('Type 1' in pm_data.columns).encode("utf-8")+b"afe4").hexdigest() == "c833c0d09090f8d58db9424c6e67dbe37e08f454", "boolean value of 'Type 1' in pm_data.columns is not correct"

assert sha1(str(type('Type 2' in pm_data.columns)).encode("utf-8")+b"afe5").hexdigest() == "780b35570ea7ae7e73289bcb9a7cb27b78ceab96", "type of 'Type 2' in pm_data.columns is not bool. 'Type 2' in pm_data.columns should be a bool"
assert sha1(str('Type 2' in pm_data.columns).encode("utf-8")+b"afe5").hexdigest() == "ca2c2008067d932b56088134d39fbffc3a17c3cb", "boolean value of 'Type 2' in pm_data.columns is not correct"

assert sha1(str(type('Total' in pm_data.columns)).encode("utf-8")+b"afe6").hexdigest() == "fe64bf62977a8964b945a49caf2b838baf5199d5", "type of 'Total' in pm_data.columns is not bool. 'Total' in pm_data.columns should be a bool"
assert sha1(str('Total' in pm_data.columns).encode("utf-8")+b"afe6").hexdigest() == "e8e725afbff1b657f410d0b6d428f4193b026c96", "boolean value of 'Total' in pm_data.columns is not correct"

assert sha1(str(type('Sp_Atk' in pm_data.columns)).encode("utf-8")+b"afe7").hexdigest() == "f10de8bb6b2933d69955ccf613a3ac5c92852f99", "type of 'Sp_Atk' in pm_data.columns is not bool. 'Sp_Atk' in pm_data.columns should be a bool"
assert sha1(str('Sp_Atk' in pm_data.columns).encode("utf-8")+b"afe7").hexdigest() == "469fb30f7480961d9baaafa96cdb41fcf6fa7afa", "boolean value of 'Sp_Atk' in pm_data.columns is not correct"

assert sha1(str(type('Sp_Def' in pm_data.columns)).encode("utf-8")+b"afe8").hexdigest() == "ef885f8f543b37457d7c8599380be66e1b2a34d1", "type of 'Sp_Def' in pm_data.columns is not bool. 'Sp_Def' in pm_data.columns should be a bool"
assert sha1(str('Sp_Def' in pm_data.columns).encode("utf-8")+b"afe8").hexdigest() == "3a255e7ff51363c6f306a1ee3f0cc4ac09352213", "boolean value of 'Sp_Def' in pm_data.columns is not correct"

assert sha1(str(type('Speed' in pm_data.columns)).encode("utf-8")+b"afe9").hexdigest() == "5ff62b2e90353c92ebbcdbef4f0f27890f2368aa", "type of 'Speed' in pm_data.columns is not bool. 'Speed' in pm_data.columns should be a bool"
assert sha1(str('Speed' in pm_data.columns).encode("utf-8")+b"afe9").hexdigest() == "ea5217e7b93e5b27e08d1dd4a7152193bdf84ae1", "boolean value of 'Speed' in pm_data.columns is not correct"

assert sha1(str(type('Generation' in pm_data.columns)).encode("utf-8")+b"afea").hexdigest() == "568ab6ac22e7334e6ba43c56ff9a9d2ec90d200b", "type of 'Generation' in pm_data.columns is not bool. 'Generation' in pm_data.columns should be a bool"
assert sha1(str('Generation' in pm_data.columns).encode("utf-8")+b"afea").hexdigest() == "c46e74945f2e7da49f5b5fe1a32d123027a22641", "boolean value of 'Generation' in pm_data.columns is not correct"

assert sha1(str(type('Legendary' in pm_data.columns)).encode("utf-8")+b"afeb").hexdigest() == "f00b58cf36b65da0f6dff3b49d0457fe04fcac6f", "type of 'Legendary' in pm_data.columns is not bool. 'Legendary' in pm_data.columns should be a bool"
assert sha1(str('Legendary' in pm_data.columns).encode("utf-8")+b"afeb").hexdigest() == "98c145184a2b13b557833593cb4a3d71950c96f5", "boolean value of 'Legendary' in pm_data.columns is not correct"

print('Success!')

**Question 1.1**
<br> {points: 1}

To start exploring the Pokemon data, create a scatter plot matrix (or pairplot) using `altair`. The plot should only contain the columns `Total` to `Speed` from `pm_data`. You can check the data wrangling chapter in the textbook to recall how to select a range of columns using `loc` with `:`.

*Assign your answer to an object called `pm_pairs`.*

In [None]:
# columns_to_plot = ___

# pm_pairs = alt.Chart(pm_data).mark_circle(opacity=0.2).encode(
#     alt.X(alt.repeat("row"), type="quantitative"),
#     alt.Y(alt.repeat("column"), type="quantitative"),
# ).properties(
#     width=150,
#     height=150
# ).repeat(
#     column=columns_to_plot,
#     row=columns_to_plot
# )


# your code here
raise NotImplementedError
pm_pairs

In [None]:
from hashlib import sha1
assert sha1(str(type(pm_pairs is None)).encode("utf-8")+b"ce963").hexdigest() == "6e3c17770d5066e3b49862a1624654f6df1c1693", "type of pm_pairs is None is not bool. pm_pairs is None should be a bool"
assert sha1(str(pm_pairs is None).encode("utf-8")+b"ce963").hexdigest() == "f33019d278d1a9a5ab54b2ec14caac1148dbe0e9", "boolean value of pm_pairs is None is not correct"

assert sha1(str(type(pm_pairs)).encode("utf-8")+b"ce964").hexdigest() == "1ade6100bb11c155f191023d379631e4b129cdae", "type of type(pm_pairs) is not correct"

assert sha1(str(type(pm_pairs['repeat']['column'])).encode("utf-8")+b"ce965").hexdigest() == "c478ad611e3350f09751cc1d9f658a181665d680", "type of pm_pairs['repeat']['column'] is not list. pm_pairs['repeat']['column'] should be a list"
assert sha1(str(len(pm_pairs['repeat']['column'])).encode("utf-8")+b"ce965").hexdigest() == "d2ff36976010273d0047b817cd72cfed950da213", "length of pm_pairs['repeat']['column'] is not correct"
assert sha1(str(sorted(map(str, pm_pairs['repeat']['column']))).encode("utf-8")+b"ce965").hexdigest() == "ddeee60943099571c740d188c49ca398ec8c3b87", "values of pm_pairs['repeat']['column'] are not correct"
assert sha1(str(pm_pairs['repeat']['column']).encode("utf-8")+b"ce965").hexdigest() == "1a1015520a1ac98263a15223c006751929007810", "order of elements of pm_pairs['repeat']['column'] is not correct"

print('Success!')

**Question 1.2**
<br> {points: 1}

From the pairplot above, it does not look like the Pokemon are separated into clear groups in any of the pairwise variable scatterplots. Here, we will continue exploring the relationship between `Speed` and `Defense` and see what happens if we try to cluster the data points on these two variables, even though there are no visually discernible clusters in the chart.

First, create a scatter plot of only these two variables so that we can look closely at their relationship. Put the `Speed` variable on the x-axis, and the `Defense` variable on the y-axis.

*Assign your plot to an object called `pm_scatter`. Don't forget to do everything needed to make an effective visualization including setting the `opacity` to a suitable value.*

In [None]:
# your code here
raise NotImplementedError
pm_scatter

In [None]:
from hashlib import sha1
assert sha1(str(type(pm_scatter is None)).encode("utf-8")+b"15451").hexdigest() == "6c8e7a6e46f0eb8699eda225fbad5dbe6d2e7db9", "type of pm_scatter is None is not bool. pm_scatter is None should be a bool"
assert sha1(str(pm_scatter is None).encode("utf-8")+b"15451").hexdigest() == "c895e6e5a0da28881758b7dca42497839e3389e0", "boolean value of pm_scatter is None is not correct"

assert sha1(str(type(pm_scatter.encoding.x['shorthand'])).encode("utf-8")+b"15452").hexdigest() == "8604fb7271511b2d340cdf280aa8290143e51aa8", "type of pm_scatter.encoding.x['shorthand'] is not str. pm_scatter.encoding.x['shorthand'] should be an str"
assert sha1(str(len(pm_scatter.encoding.x['shorthand'])).encode("utf-8")+b"15452").hexdigest() == "ea9cb7f5258ad8f4a1affccfd9caee1979482196", "length of pm_scatter.encoding.x['shorthand'] is not correct"
assert sha1(str(pm_scatter.encoding.x['shorthand'].lower()).encode("utf-8")+b"15452").hexdigest() == "af904edc1924a6627a35cae560f6c12c34055765", "value of pm_scatter.encoding.x['shorthand'] is not correct"
assert sha1(str(pm_scatter.encoding.x['shorthand']).encode("utf-8")+b"15452").hexdigest() == "0d51b252baf966d16011aa7f8d01c8ef13d99e2b", "correct string value of pm_scatter.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(pm_scatter.encoding.y['shorthand'])).encode("utf-8")+b"15453").hexdigest() == "933d7a17b8deeca19833e8c7f08e724dcb989f2b", "type of pm_scatter.encoding.y['shorthand'] is not str. pm_scatter.encoding.y['shorthand'] should be an str"
assert sha1(str(len(pm_scatter.encoding.y['shorthand'])).encode("utf-8")+b"15453").hexdigest() == "9f67de18eabb254b19b346ed8601b3b8a479a5cd", "length of pm_scatter.encoding.y['shorthand'] is not correct"
assert sha1(str(pm_scatter.encoding.y['shorthand'].lower()).encode("utf-8")+b"15453").hexdigest() == "e928adf5c812e0324f4936da2169e97df1a07318", "value of pm_scatter.encoding.y['shorthand'] is not correct"
assert sha1(str(pm_scatter.encoding.y['shorthand']).encode("utf-8")+b"15453").hexdigest() == "c9550f1d9987ff9999a0d95884edd5f1b87166f7", "correct string value of pm_scatter.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(pm_scatter.mark.type in ['circle', 'point'])).encode("utf-8")+b"15454").hexdigest() == "4806a77177529c6d42ee0ebe81f06ef7a0213865", "type of pm_scatter.mark.type in ['circle', 'point'] is not bool. pm_scatter.mark.type in ['circle', 'point'] should be a bool"
assert sha1(str(pm_scatter.mark.type in ['circle', 'point']).encode("utf-8")+b"15454").hexdigest() == "28e8c8111895f7dc889a7f07f0e7c62c6a92e19d", "boolean value of pm_scatter.mark.type in ['circle', 'point'] is not correct"

assert sha1(str(type('opacity' in pm_scatter.mark.to_dict())).encode("utf-8")+b"15455").hexdigest() == "1a5d5884b15bcb0c19982ee46bd0d3d293d617ce", "type of 'opacity' in pm_scatter.mark.to_dict() is not bool. 'opacity' in pm_scatter.mark.to_dict() should be a bool"
assert sha1(str('opacity' in pm_scatter.mark.to_dict()).encode("utf-8")+b"15455").hexdigest() == "b5022c54c69a0025f669195d163bd20cada7d415", "boolean value of 'opacity' in pm_scatter.mark.to_dict() is not correct"

assert sha1(str(type(isinstance(pm_scatter.encoding.x['title'], str))).encode("utf-8")+b"15456").hexdigest() == "209a5595a2f3fe46a4e5f9391e3d2f165f368c78", "type of isinstance(pm_scatter.encoding.x['title'], str) is not bool. isinstance(pm_scatter.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(pm_scatter.encoding.x['title'], str)).encode("utf-8")+b"15456").hexdigest() == "9ef47e979f08bd8ceed1d7bc521dc671812cb416", "boolean value of isinstance(pm_scatter.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(pm_scatter.encoding.y['title'], str))).encode("utf-8")+b"15457").hexdigest() == "e2524e75d858f481ff8ffe3d0f4bde99a273d130", "type of isinstance(pm_scatter.encoding.y['title'], str) is not bool. isinstance(pm_scatter.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(pm_scatter.encoding.y['title'], str)).encode("utf-8")+b"15457").hexdigest() == "7123d0fb89dfe4f6377e6bb604f04510313bcecc", "boolean value of isinstance(pm_scatter.encoding.y['title'], str) is not correct"

print('Success!')

**Question 1.3** 
<br> {points: 3}

The chart above confirms what we saw in the pairplot; there don't seem to be visually distinct clusters of points in these two dimensions. Could it still be informative to run clustering with this data? Let's find out by using K-Means to cluster the Pokemon based on their `Speed` and `Defense`.

So far when using K-Means, we have scaled our input features. Will it matter much for our clustering if we scale our variables for the Pokemon data? Is there any argument against scaling here?
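
To recall what scaling does in general, the sketch below standardizes two hypothetical variables measured on very different scales (heights in metres and incomes in dollars, not taken from the Pokemon data) and compares how much each one contributes to the squared distance between two observations. The helper names are made up for this illustration. The arithmetic (subtract the mean, divide by the population standard deviation) is the same centering and scaling that `StandardScaler` applies to each column.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center values at 0 and scale them to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def height_share(heights, incomes, i, j):
    """Fraction of the squared distance between rows i and j that is due to height."""
    d_height = (heights[i] - heights[j]) ** 2
    d_income = (incomes[i] - incomes[j]) ** 2
    return d_height / (d_height + d_income)


# Hypothetical measurements on very different scales.
height_m = [1.6, 1.7, 1.8, 1.9]
income_usd = [30_000.0, 45_000.0, 60_000.0, 90_000.0]

raw_share = height_share(height_m, income_usd, 0, 3)
scaled_share = height_share(standardize(height_m), standardize(income_usd), 0, 3)
print(f"height share of squared distance, raw: {raw_share:.12f}")
print(f"height share of squared distance, standardized: {scaled_share:.2f}")
```

Before standardizing, the distance is driven almost entirely by the variable with the larger numeric range; afterwards, both variables contribute on a comparable footing. Whether that difference matters for a particular dataset depends on how different the original scales are.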

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.4**
<br> {points: 1}

Now, let's use K-means to cluster the Pokemon based on their `Speed` and `Defense` variables.

1. Create a preprocessor named `pm_preprocessor` that standardizes the data and only keeps the two columns we are interested in.
2. Create a model named `pm_kmeans` for K-means clustering with 4 clusters.
3. Combine the preprocessor and model in a pipeline named `pm_pipe`.
4. Fit the pipeline on the `pm_data` data.

*Assign your answers to objects called `pm_preprocessor`, `pm_kmeans`, and `pm_pipe`.*


**Note:** We set the `random_state` here because K-means starts from a random initialization, so the resulting clusters can differ from run to run. Don't change the value!

In [None]:
# # 1.
# pm_preprocessor = ___(
#     (___(), [___, ___]),
#     ___=___,
#     verbose_feature_names_out=False
# )

# # 2.
# pm_kmeans = ___(___=___, random_state=2019)

# # 3.
# pm_pipe = ___(___, ___)
# pm_pipe.___(___)

# your code here
raise NotImplementedError
pm_pipe

In [None]:
from hashlib import sha1
assert sha1(str(type(pm_pipe[1].n_clusters)).encode("utf-8")+b"e469e").hexdigest() == "b760d8105ac4fba8a2de50ddf5bdf0c775454ea4", "type of pm_pipe[1].n_clusters is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(pm_pipe[1].n_clusters).encode("utf-8")+b"e469e").hexdigest() == "07f4963babd8b42c4694c621f0973ac281e06456", "value of pm_pipe[1].n_clusters is not correct"

assert sha1(str(type(pm_pipe[1].n_features_in_)).encode("utf-8")+b"e469f").hexdigest() == "fd443d7cc694e2ad4cafcfb700fb27470ba3d13c", "type of pm_pipe[1].n_features_in_ is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(pm_pipe[1].n_features_in_).encode("utf-8")+b"e469f").hexdigest() == "83bb6eb47774c1802784839dac215d2bbb77d1e4", "value of pm_pipe[1].n_features_in_ is not correct"

assert sha1(str(type(type(pm_pipe))).encode("utf-8")+b"e46a0").hexdigest() == "d939d34ef3cd2c12946d23d0c85d1bb8e96cb1e8", "type of type(pm_pipe) is not correct"
assert sha1(str(type(pm_pipe)).encode("utf-8")+b"e46a0").hexdigest() == "0db4f1ca45951a2b66d677ba0a2e8fd03ddcabe4", "value of type(pm_pipe) is not correct"

print('Success!')

**Question 1.5**
<br> {points: 1}

Let's visualize the clusters we just created.

1. Use the `assign` method on the `pm_data` data frame to create a column called `cluster` with the cluster labels from our model for each data point. Name the new data frame `pm_clustered`.
2. Create a scatter plot of `Speed` (x-axis) vs `Defense` (y-axis) with the points coloured by their cluster assignment. Make sure to specify that the clusters are a nominal variable so that the correct color scheme is used. Name this plot `pm_scatter_clustered`.

In [None]:
# your code here
raise NotImplementedError
pm_scatter_clustered

In [None]:
from hashlib import sha1
assert sha1(str(type(pm_scatter_clustered is None)).encode("utf-8")+b"a7e3c").hexdigest() == "bf08fe3ce24d434dd6a22511acbd6ee0ebbed0b1", "type of pm_scatter_clustered is None is not bool. pm_scatter_clustered is None should be a bool"
assert sha1(str(pm_scatter_clustered is None).encode("utf-8")+b"a7e3c").hexdigest() == "b875a7ebee1c415a0eddd6b6181dc66ee794d0eb", "boolean value of pm_scatter_clustered is None is not correct"

assert sha1(str(type(pm_scatter_clustered.encoding.x['shorthand'])).encode("utf-8")+b"a7e3d").hexdigest() == "1aececfbfd7c3312eb84440b5a9bac0c32834abb", "type of pm_scatter_clustered.encoding.x['shorthand'] is not str. pm_scatter_clustered.encoding.x['shorthand'] should be an str"
assert sha1(str(len(pm_scatter_clustered.encoding.x['shorthand'])).encode("utf-8")+b"a7e3d").hexdigest() == "27c71297a9478f64291422600200d69a2c683441", "length of pm_scatter_clustered.encoding.x['shorthand'] is not correct"
assert sha1(str(pm_scatter_clustered.encoding.x['shorthand'].lower()).encode("utf-8")+b"a7e3d").hexdigest() == "b8eafbcd59a402b8982ec5d32372742ebd391f42", "value of pm_scatter_clustered.encoding.x['shorthand'] is not correct"
assert sha1(str(pm_scatter_clustered.encoding.x['shorthand']).encode("utf-8")+b"a7e3d").hexdigest() == "f48aaa52415207046a8e735f859b14cf5da8afe3", "correct string value of pm_scatter_clustered.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(pm_scatter_clustered.encoding.y['shorthand'])).encode("utf-8")+b"a7e3e").hexdigest() == "fc94d6d1188d7f7cd00d4d852a7227bf1206e62a", "type of pm_scatter_clustered.encoding.y['shorthand'] is not str. pm_scatter_clustered.encoding.y['shorthand'] should be an str"
assert sha1(str(len(pm_scatter_clustered.encoding.y['shorthand'])).encode("utf-8")+b"a7e3e").hexdigest() == "4b5fd3c6e383271fc91932ac4d18c257510916cd", "length of pm_scatter_clustered.encoding.y['shorthand'] is not correct"
assert sha1(str(pm_scatter_clustered.encoding.y['shorthand'].lower()).encode("utf-8")+b"a7e3e").hexdigest() == "0c123a7af6f696c9b913b69a25cb99db54d3467d", "value of pm_scatter_clustered.encoding.y['shorthand'] is not correct"
assert sha1(str(pm_scatter_clustered.encoding.y['shorthand']).encode("utf-8")+b"a7e3e").hexdigest() == "74fd1326ccb2666f606fffa396f985d137643a53", "correct string value of pm_scatter_clustered.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(pm_scatter_clustered.encoding.color['shorthand'])).encode("utf-8")+b"a7e3f").hexdigest() == "b30ce0f64eec672a0f347d3c0bbd058e8eab930b", "type of pm_scatter_clustered.encoding.color['shorthand'] is not str. pm_scatter_clustered.encoding.color['shorthand'] should be an str"
assert sha1(str(len(pm_scatter_clustered.encoding.color['shorthand'])).encode("utf-8")+b"a7e3f").hexdigest() == "1caf5d91096bd46f2b64c3b7ba39e813ebdc2295", "length of pm_scatter_clustered.encoding.color['shorthand'] is not correct"
assert sha1(str(pm_scatter_clustered.encoding.color['shorthand'].lower()).encode("utf-8")+b"a7e3f").hexdigest() == "eb4a51af282d8139e138e7b542f7df30e433c849", "value of pm_scatter_clustered.encoding.color['shorthand'] is not correct"
assert sha1(str(pm_scatter_clustered.encoding.color['shorthand']).encode("utf-8")+b"a7e3f").hexdigest() == "0e17ae170a8c05aab2591901d7ebfbbe1bb6ccc0", "correct string value of pm_scatter_clustered.encoding.color['shorthand'] but incorrect case of letters"

assert sha1(str(type(pm_scatter_clustered.mark in ['circle', 'point'])).encode("utf-8")+b"a7e40").hexdigest() == "9ebcc41b4ca69cfa8c77950084f3330faa022fa4", "type of pm_scatter_clustered.mark in ['circle', 'point'] is not bool. pm_scatter_clustered.mark in ['circle', 'point'] should be a bool"
assert sha1(str(pm_scatter_clustered.mark in ['circle', 'point']).encode("utf-8")+b"a7e40").hexdigest() == "fa389a991f2a4383143f16300a36a2dcaf84b222", "boolean value of pm_scatter_clustered.mark in ['circle', 'point'] is not correct"

assert sha1(str(type(isinstance(pm_scatter_clustered.encoding.x['title'], str))).encode("utf-8")+b"a7e41").hexdigest() == "3c5e76b5f1feac15790e0e0e7a96f488590f40c1", "type of isinstance(pm_scatter_clustered.encoding.x['title'], str) is not bool. isinstance(pm_scatter_clustered.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(pm_scatter_clustered.encoding.x['title'], str)).encode("utf-8")+b"a7e41").hexdigest() == "1edc396e7be8132d0e69c85112a8354368530da2", "boolean value of isinstance(pm_scatter_clustered.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(pm_scatter_clustered.encoding.y['title'], str))).encode("utf-8")+b"a7e42").hexdigest() == "c09a4b2df288de933d48e43bad0e9a23c3f32481", "type of isinstance(pm_scatter_clustered.encoding.y['title'], str) is not bool. isinstance(pm_scatter_clustered.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(pm_scatter_clustered.encoding.y['title'], str)).encode("utf-8")+b"a7e42").hexdigest() == "55eb3339955066de1e064ae21ea4b47110c55d24", "boolean value of isinstance(pm_scatter_clustered.encoding.y['title'], str) is not correct"

assert sha1(str(type(isinstance(pm_scatter_clustered.encoding.color['title'], str))).encode("utf-8")+b"a7e43").hexdigest() == "a1a841ff9dab32ebca84b4f5a3839bbc195a44a8", "type of isinstance(pm_scatter_clustered.encoding.color['title'], str) is not bool. isinstance(pm_scatter_clustered.encoding.color['title'], str) should be a bool"
assert sha1(str(isinstance(pm_scatter_clustered.encoding.color['title'], str)).encode("utf-8")+b"a7e43").hexdigest() == "0342bb873d69ea940a9d0a09aa093fc8b390fb6a", "boolean value of isinstance(pm_scatter_clustered.encoding.color['title'], str) is not correct"

print('Success!')

**Question 1.6**
<br> {points: 3}

Below you can see multiple initializations of k-means with different seeds for `K=4`. Can you explain what is happening and how we can mitigate this when creating the `KMeans` model?

![](imgs/multiple_initializations.png)

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.7**
<br> {points: 1}

We know that comparing how the WSSD varies for multiple values of $K$ is an important step of selecting a suitable clustering model. That's what we will do next!

1. Create a range with the values 1 to 10 (both inclusive) for $K$ and store it in a variable called `pm_ks`.
2. Use a list comprehension to create a KMeans clustering model with $K$ clusters for each value of $K$ in the `pm_ks` range. Each model should be wrapped in a pipeline together with the `pm_preprocessor` we created earlier. Train the pipeline on the `pm_data` data and output the WSSD value for each value of $K$ as a list.
3. Create a dataframe that holds the values of $K$ in a column called `k` and the WSSD values in a column called `wssd`.

*Assign your answer to a data frame object named `pm_model_stats`.*
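
As a reminder of what WSSD measures, the sketch below computes the total within-cluster sum of squared distances by hand for a hypothetical set of four points under three different cluster assignments. The function name and the data are made up for illustration. For a fitted `KMeans` model, scikit-learn reports this same quantity in the `inertia_` attribute, computed on whatever (possibly scaled) data the model was given.

```python
def wssd(points, labels):
    """Total within-cluster sum of squared distances to each cluster's mean."""
    total = 0.0
    for cluster in set(labels):
        # Collect the points assigned to this cluster and find their mean.
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        center = [sum(coord) / len(members) for coord in zip(*members)]
        # Add up the squared distance from each member to the cluster mean.
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, center))
    return total


# Hypothetical toy data: two pairs of points far apart from each other.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = wssd(toy_points, [0, 0, 0, 0])
two_clusters = wssd(toy_points, [0, 0, 1, 1])
four_clusters = wssd(toy_points, [0, 1, 2, 3])
print(one_cluster, two_clusters, four_clusters)
```

WSSD can only shrink or stay the same as clusters are added, and it reaches zero when every point is its own cluster, which is why we look at how quickly it drops rather than simply picking the smallest value.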

In [None]:
# # 1.
# pm_ks = ___
#
# # 2.
# pm_wssds = [
#     ___(
#         ___,
#         ___(___=___, random_state=4313)  # Create a new model with `k` clusters
#     ).___(___)[___].___  # Fit the pipeline and compute its WSSD
#     for ___ in ___
# ]
#
# # 3.
# pm_model_stats = pd.DataFrame({
#     ___: ___,
#     ___: ___
# })

# your code here
raise NotImplementedError
pm_model_stats

In [None]:
from hashlib import sha1
assert sha1(str(type(len(pm_wssds))).encode("utf-8")+b"e10a0").hexdigest() == "e29d12749073a3918276f473c0dea13294308adc", "type of len(pm_wssds) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(len(pm_wssds)).encode("utf-8")+b"e10a0").hexdigest() == "b694c5def8365ff7732e2bed0b7f9e627817c936", "value of len(pm_wssds) is not correct"

assert sha1(str(type(round(sum(pm_wssds), 2))).encode("utf-8")+b"e10a1").hexdigest() == "243d0e71718f95bcf73fb17f786773a79eddccbb", "type of round(sum(pm_wssds), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(pm_wssds), 2), 2)).encode("utf-8")+b"e10a1").hexdigest() == "382bbe0fbfd4c4aabd0a2cf159b4bc164115cba1", "value of round(sum(pm_wssds), 2) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(pm_model_stats.columns)).encode("utf-8")+b"e10a2").hexdigest() == "71c279d7bbe4187efc3577173e0dac1a6c1992be", "type of pm_model_stats.columns is not correct"
assert sha1(str(pm_model_stats.columns).encode("utf-8")+b"e10a2").hexdigest() == "3694dec8e7231594f346bf34f731e8a07c228eab", "value of pm_model_stats.columns is not correct"

print('Success!')

**Question 1.8**
<br> {points: 1}

Let's visualize how WSSD changes as we vary the value of $K$. To do this, create the elbow plot. Put the within-cluster sum of squares on the y-axis, and the number of clusters on the x-axis. Use the correct parameter inside `mark_line` to include a point in the chart for each data point.

*Assign your plot to an object called `elbow_plot`*.

In [None]:
# your code here
raise NotImplementedError
elbow_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(elbow_plot is None)).encode("utf-8")+b"98bb0").hexdigest() == "f831a52a6d010c24c6a20956e19f21c71e63977a", "type of elbow_plot is None is not bool. elbow_plot is None should be a bool"
assert sha1(str(elbow_plot is None).encode("utf-8")+b"98bb0").hexdigest() == "ac73506f16c89eaf3e4fd2053bf39f6705f87f4d", "boolean value of elbow_plot is None is not correct"

assert sha1(str(type(elbow_plot.encoding.x['shorthand'])).encode("utf-8")+b"98bb1").hexdigest() == "c361ae94b355284d3b7b0c0ed952928c343ff951", "type of elbow_plot.encoding.x['shorthand'] is not str. elbow_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(elbow_plot.encoding.x['shorthand'])).encode("utf-8")+b"98bb1").hexdigest() == "95158a4ad4d8977992676b3eab56de5d947a3fa4", "length of elbow_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"98bb1").hexdigest() == "d69278b44ffbc1a53e145e42dbb6aa24d3a9b064", "value of elbow_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.x['shorthand']).encode("utf-8")+b"98bb1").hexdigest() == "d69278b44ffbc1a53e145e42dbb6aa24d3a9b064", "correct string value of elbow_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(elbow_plot.encoding.y['shorthand'])).encode("utf-8")+b"98bb2").hexdigest() == "26af32ce20079daf9c9241f2ac22f9129c97aab0", "type of elbow_plot.encoding.y['shorthand'] is not str. elbow_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(elbow_plot.encoding.y['shorthand'])).encode("utf-8")+b"98bb2").hexdigest() == "b4fc205046385ba75d2b5a3b1af93c0d4d25c6a7", "length of elbow_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"98bb2").hexdigest() == "57d4267889852cb27dc705db774de23ebe7c7e50", "value of elbow_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.y['shorthand']).encode("utf-8")+b"98bb2").hexdigest() == "57d4267889852cb27dc705db774de23ebe7c7e50", "correct string value of elbow_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(elbow_plot.mark)).encode("utf-8")+b"98bb3").hexdigest() == "678ce05595bdf0037f52821d27fcc64518ec9055", "type of elbow_plot.mark is not correct"
assert sha1(str(elbow_plot.mark).encode("utf-8")+b"98bb3").hexdigest() == "d008f6e5c2ca3b6da3116e2b08eb13791d485aa0", "value of elbow_plot.mark is not correct"

assert sha1(str(type(isinstance(elbow_plot.encoding.x['title'], str))).encode("utf-8")+b"98bb4").hexdigest() == "207c68dfa61656cf35d5d1213be30ac29c9ef858", "type of isinstance(elbow_plot.encoding.x['title'], str) is not bool. isinstance(elbow_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(elbow_plot.encoding.x['title'], str)).encode("utf-8")+b"98bb4").hexdigest() == "a880914625209666c97cdf83b37f1286a2932761", "boolean value of isinstance(elbow_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(elbow_plot.encoding.y['title'], str))).encode("utf-8")+b"98bb5").hexdigest() == "0196c44352ed4881e6a6e795d9a1df55f4480b6f", "type of isinstance(elbow_plot.encoding.y['title'], str) is not bool. isinstance(elbow_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(elbow_plot.encoding.y['title'], str)).encode("utf-8")+b"98bb5").hexdigest() == "4544f494c5e6e334104b5a1b1a672953c0e2f0e4", "boolean value of isinstance(elbow_plot.encoding.y['title'], str) is not correct"

print('Success!')

**Question 1.9** 
<br> {points: 3}

Based on the elbow plot above, what value of $K$ would you choose? Explain why.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.10**
<br> {points: 3}

1. Train a K-Means model with the value that you chose for $K$ based on the elbow plot. Combine it with the appropriate preprocessor in a pipeline and store it in a variable called `pm_pipe2`. Fit the pipeline on the relevant data frame.
2. Use the `pm_data` data frame to create a new column called `cluster` that holds the cluster labels from your model. Assign the new dataframe to a variable called `pm_clustered2`.
3. Finally, create a plot called `pm_scatter_clustered2` to visualize the clusters. Include a title, colour the points by the cluster and make sure your axes are human-readable.
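The three steps above follow a general pattern: preprocess, cluster, then attach the labels to the original data frame. Below is a minimal sketch of that pattern on a made-up data frame. The names `toy`, `toy_preprocessor`, `toy_pipe`, and `toy_clustered`, the column names, and the choice of `n_clusters=2` are all assumptions for illustration, not the Pokemon data or the expected answer.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features on very different scales,
# plus a text column that should not be used for clustering.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "speed": [10, 12, 11, 90, 95, 92],
    "defense": [1000, 1100, 1050, 200, 250, 220],
})

# Scale only the numeric columns; columns not listed are dropped by default.
toy_preprocessor = make_column_transformer(
    (StandardScaler(), ["speed", "defense"]),
)

# n_clusters=2 is an arbitrary choice for the toy data.
toy_pipe = make_pipeline(
    toy_preprocessor,
    KMeans(n_clusters=2, random_state=0, n_init=10),
)
toy_pipe.fit(toy)

# The fitted KMeans step is the last step of the pipeline.
toy_clustered = toy.assign(cluster=toy_pipe[-1].labels_)
print(toy_clustered)
```

Because `assign` returns a new data frame, `toy` keeps its original columns and `toy_clustered` carries the extra `cluster` column that a plot can use for colouring.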

In [None]:
# your code here
raise NotImplementedError
pm_scatter_clustered2

**Question 1.11** 
<br> {points: 3}

This looks perhaps a bit better than when we used $K=4$ clusters originally, but is it really a lot better? Use this plot and the elbow plot from Question 1.8 to reason about what might be going on here.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

# 2. Tourism Reviews

![](https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif)
Source: https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif

The Ministry of Land, Infrastructure, Transport and Tourism of Japan is interested in knowing the types of tourists that visit East Asia. They know the [majority of their visitors come from this region](https://statistics.jnto.go.jp/en/graph/) and would like to stay competitive in the region to keep growing the tourism industry. For this, they have hired us to perform segmentation of the tourists. A [dataset from TripAdvisor](https://archive.ics.uci.edu/ml/datasets/Travel+Reviews) has been scraped and is provided to you.

This dataset contains the following variables:

- User ID : Unique user id 
- Category 1 : Average user feedback on art galleries 
- Category 2 : Average user feedback on dance clubs 
- Category 3 : Average user feedback on juice bars 
- Category 4 : Average user feedback on restaurants 
- Category 5 : Average user feedback on museums 
- Category 6 : Average user feedback on resorts 
- Category 7 : Average user feedback on parks/picnic spots 
- Category 8 : Average user feedback on beaches 
- Category 9 : Average user feedback on theaters 
- Category 10 : Average user feedback on religious institutions

**Question 2.0**
<br> {points: 1}

Load the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv and clean it so that only the Category # columns are in the data frame (i.e., remove the User ID column). 

*Assign your answer to an object called `reviews`.*
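Removing an identifier column after loading a CSV can be sketched as follows. The miniature CSV, its values, and the names `toy_csv`, `toy_raw`, and `toy_clean` are made up for illustration; only the layout (an ID column followed by rating columns) mirrors the description above.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: an ID column followed by numeric rating columns.
toy_csv = io.StringIO(
    "User ID,Category 1,Category 2\n"
    "User 1,0.93,1.8\n"
    "User 2,1.02,2.2\n"
)
toy_raw = pd.read_csv(toy_csv)

# drop returns a new data frame; toy_raw itself is unchanged.
toy_clean = toy_raw.drop(columns=["User ID"])
print(toy_clean.columns.tolist())
```

`pd.read_csv` also accepts a URL string in place of the in-memory buffer used here.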

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(reviews is None)).encode("utf-8")+b"c2c61").hexdigest() == "043eb3ef22ea900104bb376b5b1a58cd73536f68", "type of reviews is None is not bool. reviews is None should be a bool"
assert sha1(str(reviews is None).encode("utf-8")+b"c2c61").hexdigest() == "17f57eabc4796266e22e388ec465f63a7f17344a", "boolean value of reviews is None is not correct"

assert sha1(str(type(reviews)).encode("utf-8")+b"c2c62").hexdigest() == "4350e7fc882009cb1bc8f2c9087a04103b4c64a9", "type of type(reviews) is not correct"

assert sha1(str(type(reviews.shape)).encode("utf-8")+b"c2c63").hexdigest() == "8d217e48bf7d5bbfa597404d33462c6ed5b2abde", "type of reviews.shape is not tuple. reviews.shape should be a tuple"
assert sha1(str(len(reviews.shape)).encode("utf-8")+b"c2c63").hexdigest() == "8273a4b67939a0cc53350df70b4940d03e1498f9", "length of reviews.shape is not correct"
assert sha1(str(sorted(map(str, reviews.shape))).encode("utf-8")+b"c2c63").hexdigest() == "cd32280629c2fced88d05e5485ac02e4dc7aaf90", "values of reviews.shape are not correct"
assert sha1(str(reviews.shape).encode("utf-8")+b"c2c63").hexdigest() == "862f45a102c520c045763ade9b54ea0449bce9a4", "order of elements of reviews.shape is not correct"

assert sha1(str(type("User ID" in reviews.columns)).encode("utf-8")+b"c2c64").hexdigest() == "0cc5b4597b5d132cdd296db1c1bca2a9adc30c1f", "type of \"User ID\" in reviews.columns is not bool. \"User ID\" in reviews.columns should be a bool"
assert sha1(str("User ID" in reviews.columns).encode("utf-8")+b"c2c64").hexdigest() == "2c83138970f0b1f3c2f4395c9eb8f6ec4002cf2b", "boolean value of \"User ID\" in reviews.columns is not correct"

assert sha1(str(type(round(sum(reviews["Category 1"]), 2))).encode("utf-8")+b"c2c65").hexdigest() == "0222c063420102e481eab77097cb56d34ea5a7b7", "type of round(sum(reviews[\"Category 1\"]), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(reviews["Category 1"]), 2), 2)).encode("utf-8")+b"c2c65").hexdigest() == "e7f4a95e2d9b5d4adf00cdf03f9be6f22118c03b", "value of round(sum(reviews[\"Category 1\"]), 2) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(round(sum(reviews["Category 10"]), 2))).encode("utf-8")+b"c2c66").hexdigest() == "770bf23dc592212a9003a38a4525341088e17fd1", "type of round(sum(reviews[\"Category 10\"]), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(reviews["Category 10"]), 2), 2)).encode("utf-8")+b"c2c66").hexdigest() == "236fbf50525015e753fee32e5780ffc560c84a47", "value of round(sum(reviews[\"Category 10\"]), 2) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 2.1**
<br> {points: 1}

Standardize all variables, then perform K-means with $K$ ranging from 1 to 10 to identify the optimal number of clusters. Use `random_state=2019` for K-means.

Create a data frame called `reviews_model_stats` that contains the values of $K$ and their corresponding WSSD (with column names `k` and `wssd`). From this data frame, create an elbow plot and assign it to a variable named `elbow_plot`.
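One way to collect the WSSD for several values of $K$ is to fit a fresh pipeline per value and read the fitted model's `inertia_` attribute, which is scikit-learn's name for the total within-cluster sum of squared distances. The sketch below does this on synthetic data. The data, the names `toy_points` and `toy_stats`, the range of $K$, and the random seeds are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
toy_points = pd.DataFrame(
    np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers]),
    columns=["x", "y"],
)

records = []
for k in range(1, 7):
    # A new pipeline per k so that each fit starts from scratch.
    pipe = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k, random_state=0, n_init=10),
    )
    pipe.fit(toy_points)
    # inertia_ holds the total WSSD of the fitted model.
    records.append({"k": k, "wssd": float(pipe[-1].inertia_)})

toy_stats = pd.DataFrame(records)
print(toy_stats)
```

A line plot of `wssd` against `k` from a table like this is the elbow plot. For these three blobs, the WSSD falls steeply up to `k = 3` and changes little afterwards.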

In [None]:
# your code here
raise NotImplementedError
elbow_plot

In [None]:
from hashlib import sha1
assert str(type(reviews_model_stats is None)) == "<class 'bool'>", "type of reviews_model_stats is None is not bool. reviews_model_stats is None should be a bool"
assert str(reviews_model_stats is None) == "False", "boolean value of reviews_model_stats is None is not correct"

assert str(type(elbow_plot is None)) == "<class 'bool'>", "type of elbow_plot is None is not bool. elbow_plot is None should be a bool"
assert str(elbow_plot is None) == "False", "boolean value of elbow_plot is None is not correct"


# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.

assert sha1(str(type(reviews_model_stats)).encode("utf-8")+b"4b6d6").hexdigest() == "f07f5e59ed36cf4ae5265173424e864bc69705e2", "type of type(reviews_model_stats) is not correct"

assert sha1(str(type(reviews_model_stats.shape)).encode("utf-8")+b"4b6d7").hexdigest() == "55a260a828d809557c52e68ba929e7e278832e98", "type of reviews_model_stats.shape is not tuple. reviews_model_stats.shape should be a tuple"
assert sha1(str(len(reviews_model_stats.shape)).encode("utf-8")+b"4b6d7").hexdigest() == "8e3668004ff4afa516c60d7badb210043aa9beb5", "length of reviews_model_stats.shape is not correct"
assert sha1(str(sorted(map(str, reviews_model_stats.shape))).encode("utf-8")+b"4b6d7").hexdigest() == "9b3e8ff45ce0815877955248376aae040615bf63", "values of reviews_model_stats.shape are not correct"
assert sha1(str(reviews_model_stats.shape).encode("utf-8")+b"4b6d7").hexdigest() == "e989340dc047fb9c772ee898d6726fee52ed2d66", "order of elements of reviews_model_stats.shape is not correct"

assert sha1(str(type(round(sum(reviews_model_stats.k), 2))).encode("utf-8")+b"4b6d8").hexdigest() == "f033118e139c854f2ff1fa3cc5ed64c39600d198", "type of round(sum(reviews_model_stats.k), 2) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(round(sum(reviews_model_stats.k), 2)).encode("utf-8")+b"4b6d8").hexdigest() == "828552a7b1b390e47cb24baf2b3b8722c5b6b65b", "value of round(sum(reviews_model_stats.k), 2) is not correct"

assert sha1(str(type(round(sum(reviews_model_stats.wssd), 2))).encode("utf-8")+b"4b6d9").hexdigest() == "6650408d1dd40b9476a614b185c72ae680eb143d", "type of round(sum(reviews_model_stats.wssd), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(reviews_model_stats.wssd), 2), 2)).encode("utf-8")+b"4b6d9").hexdigest() == "52f26522dc8c7cdaca900327d0aece4657d1bc49", "value of round(sum(reviews_model_stats.wssd), 2) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(elbow_plot.mark.point)).encode("utf-8")+b"4b6da").hexdigest() == "c035b785a027c9d9ebc915ef2fce2a6c2574c3f4", "type of elbow_plot.mark.point is not bool. elbow_plot.mark.point should be a bool"
assert sha1(str(elbow_plot.mark.point).encode("utf-8")+b"4b6da").hexdigest() == "118fcc4c94f7a01fc19b7512d5d8dd0415eb2c68", "boolean value of elbow_plot.mark.point is not correct"

assert sha1(str(type(elbow_plot.mark.type)).encode("utf-8")+b"4b6db").hexdigest() == "99d9c224022f1da6c39c40dceb901818d4a590ae", "type of elbow_plot.mark.type is not str. elbow_plot.mark.type should be an str"
assert sha1(str(len(elbow_plot.mark.type)).encode("utf-8")+b"4b6db").hexdigest() == "b579aa79e7ce5870e39f7f23cbfd37ac09e04b24", "length of elbow_plot.mark.type is not correct"
assert sha1(str(elbow_plot.mark.type.lower()).encode("utf-8")+b"4b6db").hexdigest() == "e3a3163f600ff172e8c0565496153cc473cd3ecf", "value of elbow_plot.mark.type is not correct"
assert sha1(str(elbow_plot.mark.type).encode("utf-8")+b"4b6db").hexdigest() == "e3a3163f600ff172e8c0565496153cc473cd3ecf", "correct string value of elbow_plot.mark.type but incorrect case of letters"

assert sha1(str(type(elbow_plot.encoding.x['shorthand'])).encode("utf-8")+b"4b6dc").hexdigest() == "2ecc6cfc48e6ad67d32df64471bb0f8df2dfb6ea", "type of elbow_plot.encoding.x['shorthand'] is not str. elbow_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(elbow_plot.encoding.x['shorthand'])).encode("utf-8")+b"4b6dc").hexdigest() == "6ddc862d8bdca42503a5c3de0a04f797410ad336", "length of elbow_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"4b6dc").hexdigest() == "48f7d8bee959d3d80fd081a1148c6de7c7e8dd74", "value of elbow_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.x['shorthand']).encode("utf-8")+b"4b6dc").hexdigest() == "48f7d8bee959d3d80fd081a1148c6de7c7e8dd74", "correct string value of elbow_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(elbow_plot.encoding.y['shorthand'])).encode("utf-8")+b"4b6dd").hexdigest() == "b720a24f46533c581a141c7afa6e8f83146efcbf", "type of elbow_plot.encoding.y['shorthand'] is not str. elbow_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(elbow_plot.encoding.y['shorthand'])).encode("utf-8")+b"4b6dd").hexdigest() == "0bfc260bf040fc622a3762da5b61b46804dd2c7c", "length of elbow_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"4b6dd").hexdigest() == "21f5fd5b2374fc0eace0d7529764b203e5250b27", "value of elbow_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(elbow_plot.encoding.y['shorthand']).encode("utf-8")+b"4b6dd").hexdigest() == "21f5fd5b2374fc0eace0d7529764b203e5250b27", "correct string value of elbow_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(isinstance(elbow_plot.encoding.y['title'], str))).encode("utf-8")+b"4b6de").hexdigest() == "b27797573a084872db631eb5c25ba366bca3ca60", "type of isinstance(elbow_plot.encoding.y['title'], str) is not bool. isinstance(elbow_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(elbow_plot.encoding.y['title'], str)).encode("utf-8")+b"4b6de").hexdigest() == "b9658eed96054b8c7639bb65609c962651c92ddd", "boolean value of isinstance(elbow_plot.encoding.y['title'], str) is not correct"

print('Success!')

**Question 2.2** 
<br> {points: 3}

From the elbow plot above, which $K$ should you choose? Explain why you chose that $K$.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.3**
<br> {points: 3}

Run K-means again, with the optimal $K$ and `random_state=2019`. Wrap your model in a pipeline together with a step that standardizes all columns. Assign this pipeline to a variable called `reviews_pipe`.

Then, use the `assign` method of the `reviews` dataframe to assign the cluster labels to a new column called `cluster`, and save the returned data frame as `reviews_clustered`.

In [None]:
# your code here
raise NotImplementedError
reviews_clustered

Use the plot below as a reference for the next two questions.

> The visualization below is a density plot, which you can think of as a smoothed version of a histogram. Density plots are more effective than histograms for comparing multiple distributions. What we are looking for in these visualizations is which variables have different distributions across the clusters.

In [None]:
alt.Chart(
    reviews_clustered.melt(
        id_vars=["cluster"],
        var_name="Category",
        value_name="Rating",
    )
).transform_density(
    "Rating",
    groupby=["cluster", "Category"],
    as_=["Rating", "Density"],
    resolve='independent'
).mark_area(opacity=0.4).encode(
    x="Rating",
    y=alt.Y("Density:Q").stack(False),
    color="cluster:N"
).properties(
    width=120,
    height=120
).facet(
    alt.Facet(
        "Category",
        sort=reviews_clustered.columns.sort_values().tolist()
    ),
    columns=5
).resolve_scale(
    # We are setting the x-scale to "independent" since we standardized the rating values before clustering them,
    # which means that their original range (which is what we show here) does not matter
    x="independent",
    y="independent"
)
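To make the "smoothed histogram" idea concrete, here is a minimal sketch of a kernel density estimate computed by hand with NumPy. It assumes a Gaussian kernel and a hand-picked bandwidth. The ratings and the names `ratings`, `grid`, `bandwidth`, and `density` are made up for illustration and are not taken from the review data.

```python
import numpy as np

# Hypothetical ratings for one category within one cluster.
ratings = np.array([0.8, 0.9, 1.0, 1.1, 1.2, 2.4, 2.5, 2.6])

# Points along the x-axis where the density is evaluated.
grid = np.linspace(0.0, 3.5, 351)

# Bandwidth chosen by hand for illustration; plotting libraries normally
# pick one automatically.
bandwidth = 0.2

# Place one Gaussian bump on every observation and average the bumps.
z = (grid[:, None] - ratings[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
density = bumps.mean(axis=1)

# The area under a density curve should be close to 1.
step = grid[1] - grid[0]
print(round(float(density.sum() * step), 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely, much like changing the bin width of a histogram.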

**Question 2.4** Multiple Choice:
<br> {points: 1}

From the plots above, which categories might we hypothesize are driving the clustering (i.e., are useful for distinguishing between the types of tourists)? The categories are listed below.

- Category 1 : Average user feedback on art galleries 
- Category 2 : Average user feedback on dance clubs 
- Category 3 : Average user feedback on juice bars 
- Category 4 : Average user feedback on restaurants 
- Category 5 : Average user feedback on museums 
- Category 6 : Average user feedback on resorts 
- Category 7 : Average user feedback on parks/picnic spots 
- Category 8 : Average user feedback on beaches 
- Category 9 : Average user feedback on theaters 
- Category 10 : Average user feedback on religious institutions

A. 10, 3, 5, 6, 7

B. 10, 3, 5, 6, 1

C. 10, 3, 4, 6, 7

D. 10, 2, 5, 6, 7

*Assign your answer to an object called `answer2_4`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError
answer2_4

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_4 is None)).encode("utf-8")+b"40339").hexdigest() == "e676d4d4643a523fbc8e60536bca9d4309a08d17", "type of answer2_4 is None is not bool. answer2_4 is None should be a bool"
assert sha1(str(answer2_4 is None).encode("utf-8")+b"40339").hexdigest() == "063ab4a4279e90cdf4d9a5714c2d803679e293c1", "boolean value of answer2_4 is None is not correct"

assert sha1(str(type(answer2_4)).encode("utf-8")+b"4033a").hexdigest() == "70576d542d579e2a7d75d48f4aa917458c315121", "type of answer2_4 is not str. answer2_4 should be an str"
assert sha1(str(len(answer2_4)).encode("utf-8")+b"4033a").hexdigest() == "4f843b8ea59de23e2b58b7e17f2dc232c65749c1", "length of answer2_4 is not correct"
assert sha1(str(answer2_4.lower()).encode("utf-8")+b"4033a").hexdigest() == "e3a050c84d59214de76541437917673982c7619b", "value of answer2_4 is not correct"
assert sha1(str(answer2_4).encode("utf-8")+b"4033a").hexdigest() == "576362ec4a424e879c960ff0868554262fa3ec04", "correct string value of answer2_4 but incorrect case of letters"

print('Success!')

**Question 2.5** 
<br> {points: 3}

Discuss one disadvantage of only being able to compare clusters along single categories when dealing with multidimensional data.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.