# Tutorial 10 - Clustering

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
* Explain the k-means clustering algorithm.
* Interpret the output of a k-means cluster analysis.
* Perform k-means clustering in Python using `scikit-learn`.
* Visualize the output of k-means clustering in Python using a coloured scatter plot.
* Identify when it is necessary to scale variables before clustering and do this using Python.
* Use the elbow method to choose the number of clusters for k-means.
* Describe advantages, limitations and assumptions of the k-means clustering algorithm.

In [None]:
### Run this cell before continuing.
import numpy as np
import pandas as pd
import altair as alt
from sklearn import cluster, datasets, metrics
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
alt.data_transformers.disable_max_rows()

# 1. Pokemon

We will be working with the Pokemon dataset from Kaggle, which can be found [here](https://www.kaggle.com/abcsds/pokemon).
This dataset compiles the statistics on 721 Pokemon. The information in this dataset includes Pokemon name, type, health points, attack strength, defensive strength, speed points, etc. We are interested in seeing whether there are any sub-groups/clusters of Pokemon based on these statistics and, if so, how many sub-groups/clusters there are.

![](https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif)

Source: https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif


**Question 1.0**
<br> {points: 1}

Use `read_csv` to load `pokemon.csv` from the `data/` folder. Don't forget to clean the column names to remove the "." they contain.

*Scaffolding for changing ". x" to "_x" in the column names is given below, but you may remove it or make other changes.*

*Assign your answer to an object called `pm_data`.*
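As a reminder of how `rename` behaves, here is a minimal sketch on a made-up two-row data frame. The frame `toy` and its values are hypothetical and only mimic the style of the real column names; this is not the contents of `pokemon.csv`.

```python
import pandas as pd

# Hypothetical toy frame; only the column-name style mirrors the real file.
toy = pd.DataFrame({"Name": ["A", "B"], "Sp. Atk": [65, 80], "Sp. Def": [60, 75]})

# rename() returns a new data frame and leaves the original untouched.
# Keys in the mapping are old names, values are new names.
toy_clean = toy.rename(columns={"Sp. Atk": "Sp_Atk", "Sp. Def": "Sp_Def"})

print(list(toy_clean.columns))
```

Because `rename` returns a new object, it can be chained directly after `pd.read_csv(...)`, which is what the scaffolding in the next cell does.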

In [None]:
# ___ = pd.read_csv(___).rename(columns={___: "Sp_Atk", "Sp. Def":___})


# your code here
raise NotImplementedError
pm_data.head()

In [None]:
from hashlib import sha1
assert sha1(str(type(pm_data is None)).encode("utf-8")+b"64487ca97ca64b97").hexdigest() == "a3bc28c355ac50e0d3d6be14c658b2118f0e8e53", "type of pm_data is None is not bool. pm_data is None should be a bool"
assert sha1(str(pm_data is None).encode("utf-8")+b"64487ca97ca64b97").hexdigest() == "6fc001a4eb9d26adc046c1564f82d94c8e02cbe3", "boolean value of pm_data is None is not correct"

assert sha1(str(type(pm_data)).encode("utf-8")+b"0c658335fcadcd3e").hexdigest() == "be538acddb0a3e6b7ccf24f4675cb6603768d790", "type of type(pm_data) is not correct"

assert sha1(str(type(pm_data.shape)).encode("utf-8")+b"a4c24090de27c18b").hexdigest() == "605e39361b80a8845aea5ef8e832c389340f102c", "type of pm_data.shape is not tuple. pm_data.shape should be a tuple"
assert sha1(str(len(pm_data.shape)).encode("utf-8")+b"a4c24090de27c18b").hexdigest() == "3f40e03ba3ed60e277337d8a7a791c4e2bdcd1b6", "length of pm_data.shape is not correct"
assert sha1(str(sorted(map(str, pm_data.shape))).encode("utf-8")+b"a4c24090de27c18b").hexdigest() == "5539c1c2260dc11a71828dba123d37c5bfd07691", "values of pm_data.shape are not correct"
assert sha1(str(pm_data.shape).encode("utf-8")+b"a4c24090de27c18b").hexdigest() == "6c35b1eda64f8751d1a659231b2b275dec1765cb", "order of elements of pm_data.shape is not correct"

assert sha1(str(type('Name' in pm_data.columns)).encode("utf-8")+b"c5ca54ee3182d2e0").hexdigest() == "8fb61f468bf3a002e8db5d4993c4a7c1a514bc39", "type of 'Name' in pm_data.columns is not bool. 'Name' in pm_data.columns should be a bool"
assert sha1(str('Name' in pm_data.columns).encode("utf-8")+b"c5ca54ee3182d2e0").hexdigest() == "a4a3feb8f2cbbcd9d8c86a6df62590c0fbd51e49", "boolean value of 'Name' in pm_data.columns is not correct"

assert sha1(str(type('HP' in pm_data.columns)).encode("utf-8")+b"ed964b298e9710e1").hexdigest() == "cc5190ab04a4a712577413626d6a7aecd9db5470", "type of 'HP' in pm_data.columns is not bool. 'HP' in pm_data.columns should be a bool"
assert sha1(str('HP' in pm_data.columns).encode("utf-8")+b"ed964b298e9710e1").hexdigest() == "f8325375bbda9ee6f3bb6103b4848b1635c77602", "boolean value of 'HP' in pm_data.columns is not correct"

assert sha1(str(type('Attack' in pm_data.columns)).encode("utf-8")+b"c723fed783cae1a8").hexdigest() == "f11c63aced1c5de8d9a51c8f66bf3f29838609f6", "type of 'Attack' in pm_data.columns is not bool. 'Attack' in pm_data.columns should be a bool"
assert sha1(str('Attack' in pm_data.columns).encode("utf-8")+b"c723fed783cae1a8").hexdigest() == "805f88745f19f5058762bea54b5333cb6d156f92", "boolean value of 'Attack' in pm_data.columns is not correct"

assert sha1(str(type('Defense' in pm_data.columns)).encode("utf-8")+b"9bf37d13ca9eab22").hexdigest() == "64bafccc8d874d9ca96b23250139e9a72b6d55d7", "type of 'Defense' in pm_data.columns is not bool. 'Defense' in pm_data.columns should be a bool"
assert sha1(str('Defense' in pm_data.columns).encode("utf-8")+b"9bf37d13ca9eab22").hexdigest() == "75347da29b18060f1c152a9c3d77ada2c86717ec", "boolean value of 'Defense' in pm_data.columns is not correct"

assert sha1(str(type('#' in pm_data.columns)).encode("utf-8")+b"6851ec940d00858e").hexdigest() == "6f4619b061c5b66d2ef5fb5947cbfb699abe9f46", "type of '#' in pm_data.columns is not bool. '#' in pm_data.columns should be a bool"
assert sha1(str('#' in pm_data.columns).encode("utf-8")+b"6851ec940d00858e").hexdigest() == "6037b97cd69af3f1a397011f2ebba59ae8d7ad2c", "boolean value of '#' in pm_data.columns is not correct"

assert sha1(str(type('Type 1' in pm_data.columns)).encode("utf-8")+b"461dd7b572201487").hexdigest() == "ea7d35f333ab2af712142a6dc4492a52da3f9f47", "type of 'Type 1' in pm_data.columns is not bool. 'Type 1' in pm_data.columns should be a bool"
assert sha1(str('Type 1' in pm_data.columns).encode("utf-8")+b"461dd7b572201487").hexdigest() == "0b4e682f426ccb0f1d5bb1dae63fca4d47653c44", "boolean value of 'Type 1' in pm_data.columns is not correct"

assert sha1(str(type('Type 2' in pm_data.columns)).encode("utf-8")+b"d5b12ad3c55698fd").hexdigest() == "106e3f490f693098ca5b7d3ca7376bfc15f1f98a", "type of 'Type 2' in pm_data.columns is not bool. 'Type 2' in pm_data.columns should be a bool"
assert sha1(str('Type 2' in pm_data.columns).encode("utf-8")+b"d5b12ad3c55698fd").hexdigest() == "48d8d773346cea9f1b98ed4472988f526b08aee1", "boolean value of 'Type 2' in pm_data.columns is not correct"

assert sha1(str(type('Total' in pm_data.columns)).encode("utf-8")+b"fafb4978cba3ea16").hexdigest() == "7853bdeb13ef08024525aaed58e5995b86dcc32c", "type of 'Total' in pm_data.columns is not bool. 'Total' in pm_data.columns should be a bool"
assert sha1(str('Total' in pm_data.columns).encode("utf-8")+b"fafb4978cba3ea16").hexdigest() == "a0eeba53e88e413402f9fab341878a549f20e869", "boolean value of 'Total' in pm_data.columns is not correct"

assert sha1(str(type('Sp_Atk' in pm_data.columns)).encode("utf-8")+b"390cc9c35ca26306").hexdigest() == "3c4d00449dbe102d7f30c5817b1d96f84c744036", "type of 'Sp_Atk' in pm_data.columns is not bool. 'Sp_Atk' in pm_data.columns should be a bool"
assert sha1(str('Sp_Atk' in pm_data.columns).encode("utf-8")+b"390cc9c35ca26306").hexdigest() == "ca3863ef742b0fb6de469b1fc3ebe42918c10fdd", "boolean value of 'Sp_Atk' in pm_data.columns is not correct"

assert sha1(str(type('Sp_Def' in pm_data.columns)).encode("utf-8")+b"0bb3acbb92c777e6").hexdigest() == "e204bdf305b358223c656738c5d08f7c8d0a47fc", "type of 'Sp_Def' in pm_data.columns is not bool. 'Sp_Def' in pm_data.columns should be a bool"
assert sha1(str('Sp_Def' in pm_data.columns).encode("utf-8")+b"0bb3acbb92c777e6").hexdigest() == "d271440323f7b5b36c9fc5e904f05fa2f994045c", "boolean value of 'Sp_Def' in pm_data.columns is not correct"

assert sha1(str(type('Speed' in pm_data.columns)).encode("utf-8")+b"dcbb7af0f37648d3").hexdigest() == "c43c659c161fa513d623c3a2425aed4a1823d79d", "type of 'Speed' in pm_data.columns is not bool. 'Speed' in pm_data.columns should be a bool"
assert sha1(str('Speed' in pm_data.columns).encode("utf-8")+b"dcbb7af0f37648d3").hexdigest() == "b360969ecb8ae6d4b30fad0864b2538e60cd56e4", "boolean value of 'Speed' in pm_data.columns is not correct"

assert sha1(str(type('Generation' in pm_data.columns)).encode("utf-8")+b"ffe897fecbdbc3c5").hexdigest() == "877bcd801a3a954fb9acdc987355602391abb4dd", "type of 'Generation' in pm_data.columns is not bool. 'Generation' in pm_data.columns should be a bool"
assert sha1(str('Generation' in pm_data.columns).encode("utf-8")+b"ffe897fecbdbc3c5").hexdigest() == "3c677c07504bf86fec831767e4ee1646c2eb433c", "boolean value of 'Generation' in pm_data.columns is not correct"

assert sha1(str(type('Legendary' in pm_data.columns)).encode("utf-8")+b"a88be686aa1c9101").hexdigest() == "bb9ec0c7adea36ddc6ab346d5db66f2f02d69d4a", "type of 'Legendary' in pm_data.columns is not bool. 'Legendary' in pm_data.columns should be a bool"
assert sha1(str('Legendary' in pm_data.columns).encode("utf-8")+b"a88be686aa1c9101").hexdigest() == "1b77930992d0b1575529f3d5cd33352634a63b5d", "boolean value of 'Legendary' in pm_data.columns is not correct"

print('Success!')

**Question 1.1**
<br> {points: 1}

Create a scatter plot matrix (or pairplot) using `Altair`, choosing columns 5 to 11 (or equivalently, Total to Speed) from `pm_data`. 

*Assign your answer to an object called `pm_pairs`.*

In [None]:
# your code here
raise NotImplementedError
pm_pairs

In [None]:
from hashlib import sha1
assert sha1(str(type(pm_pairs is None)).encode("utf-8")+b"eeda19021493fc03").hexdigest() == "dac6b194b3ad3bc2984fdbd5c0069043fba8f49d", "type of pm_pairs is None is not bool. pm_pairs is None should be a bool"
assert sha1(str(pm_pairs is None).encode("utf-8")+b"eeda19021493fc03").hexdigest() == "e9eea258d8aa6cee4053f57d99847bc19a37fa2b", "boolean value of pm_pairs is None is not correct"

assert sha1(str(type(pm_pairs)).encode("utf-8")+b"63dd183fdaad3818").hexdigest() == "eff07614a8052cdd387b4171100853f33cebcfce", "type of type(pm_pairs) is not correct"

print('Success!')

**Question 1.2** 
<br> {points: 1}

Select the columns `Speed` and `Defense`, creating a new dataframe with only those columns.

*Assign your answer to an object named `km_data`.*

In [None]:
# your code here
raise NotImplementedError
km_data.head()

In [None]:
from hashlib import sha1
assert sha1(str(type("Speed" in km_data.columns)).encode("utf-8")+b"f0c1f346bdc36165").hexdigest() == "5f67d6f656c8a2524ef3d9200df19e27db8e7d1f", "type of \"Speed\" in km_data.columns is not bool. \"Speed\" in km_data.columns should be a bool"
assert sha1(str("Speed" in km_data.columns).encode("utf-8")+b"f0c1f346bdc36165").hexdigest() == "be1dfb6055c5633cca31a40472a0f92f53a6ac65", "boolean value of \"Speed\" in km_data.columns is not correct"

assert sha1(str(type("Defense" in km_data.columns)).encode("utf-8")+b"f5015d8c3abb82c0").hexdigest() == "3c7b084f685078ab9cfd4f8ca99e33420b61ac9d", "type of \"Defense\" in km_data.columns is not bool. \"Defense\" in km_data.columns should be a bool"
assert sha1(str("Defense" in km_data.columns).encode("utf-8")+b"f5015d8c3abb82c0").hexdigest() == "a532eaf8c5663a348f7057f00782cd93cd31be0d", "boolean value of \"Defense\" in km_data.columns is not correct"

assert sha1(str(type(km_data.shape[0])).encode("utf-8")+b"a90089ffb555ed25").hexdigest() == "ca4ca2706b9d7e0a6169452ea5956c448f77dadb", "type of km_data.shape[0] is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(km_data.shape[0]).encode("utf-8")+b"a90089ffb555ed25").hexdigest() == "4d03c87b90b672d4e72cefd96ca2312cb150ba48", "value of km_data.shape[0] is not correct"

assert sha1(str(type(km_data.shape[1])).encode("utf-8")+b"e9e96d2c1b9c960e").hexdigest() == "0c9673122e359130398a353d92d9c765c2768850", "type of km_data.shape[1] is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(km_data.shape[1]).encode("utf-8")+b"e9e96d2c1b9c960e").hexdigest() == "233ce6103a39854f07c5132a10e699b3fd144769", "value of km_data.shape[1] is not correct"

print('Success!')

**Question 1.3**
<br> {points: 1}

Make a scatterplot to visualize the relationship between `Speed` and `Defense` of the Pokemon. Put the `Speed` variable on the x-axis, and the `Defense` variable on the y-axis.

*Assign your plot to an object called `pm_scatter`. Don't forget to do everything needed to make an effective visualization.*

In [None]:
# your code here
raise NotImplementedError
pm_scatter

In [None]:
from hashlib import sha1
assert sha1(str(type(pm_scatter is None)).encode("utf-8")+b"04977e768ad8112f").hexdigest() == "961674303ee737cf11bc31e613a366478815d3a7", "type of pm_scatter is None is not bool. pm_scatter is None should be a bool"
assert sha1(str(pm_scatter is None).encode("utf-8")+b"04977e768ad8112f").hexdigest() == "0a3a19f2d8de7a935ca31c5c45f362e1e0012dff", "boolean value of pm_scatter is None is not correct"

assert sha1(str(type(pm_scatter.encoding.x.field)).encode("utf-8")+b"4b85ac20e845a3f5").hexdigest() == "11bb70c7bbe0bc5e9e5d2289bd1c5d57f95a0f45", "type of pm_scatter.encoding.x.field is not str. pm_scatter.encoding.x.field should be an str"
assert sha1(str(len(pm_scatter.encoding.x.field)).encode("utf-8")+b"4b85ac20e845a3f5").hexdigest() == "11650d46abc0115e4d03171d487e14a97b50bf52", "length of pm_scatter.encoding.x.field is not correct"
assert sha1(str(pm_scatter.encoding.x.field.lower()).encode("utf-8")+b"4b85ac20e845a3f5").hexdigest() == "1cd8b1d8ab6a2689193705ddf8efde09e51cb953", "value of pm_scatter.encoding.x.field is not correct"
assert sha1(str(pm_scatter.encoding.x.field).encode("utf-8")+b"4b85ac20e845a3f5").hexdigest() == "b9d74325492dbfc4996297e00ca864847a183bcb", "correct string value of pm_scatter.encoding.x.field but incorrect case of letters"

assert sha1(str(type(pm_scatter.encoding.y.field)).encode("utf-8")+b"78864306a591a13f").hexdigest() == "7711ac5ab3f7721a4fcf0d25cc3baa93a87e1a28", "type of pm_scatter.encoding.y.field is not str. pm_scatter.encoding.y.field should be an str"
assert sha1(str(len(pm_scatter.encoding.y.field)).encode("utf-8")+b"78864306a591a13f").hexdigest() == "3b443d26a6731866b51724d63424d52669f5f0e9", "length of pm_scatter.encoding.y.field is not correct"
assert sha1(str(pm_scatter.encoding.y.field.lower()).encode("utf-8")+b"78864306a591a13f").hexdigest() == "e92fd515b3412e4bca27a96b843a5b603b0e701b", "value of pm_scatter.encoding.y.field is not correct"
assert sha1(str(pm_scatter.encoding.y.field).encode("utf-8")+b"78864306a591a13f").hexdigest() == "0e762772c8adec82459b776dac5a111de42ee1f5", "correct string value of pm_scatter.encoding.y.field but incorrect case of letters"

assert sha1(str(type(pm_scatter.mark)).encode("utf-8")+b"1ba5dd133b5a6036").hexdigest() == "eca50138855987121815ec8d2d8b51194713b663", "type of pm_scatter.mark is not str. pm_scatter.mark should be an str"
assert sha1(str(len(pm_scatter.mark)).encode("utf-8")+b"1ba5dd133b5a6036").hexdigest() == "d998a85261c32922e23c2ccf726c0b45dc9c67ff", "length of pm_scatter.mark is not correct"
assert sha1(str(pm_scatter.mark.lower()).encode("utf-8")+b"1ba5dd133b5a6036").hexdigest() == "3675c4523334687ac51298ee0f563768dd3b2cb0", "value of pm_scatter.mark is not correct"
assert sha1(str(pm_scatter.mark).encode("utf-8")+b"1ba5dd133b5a6036").hexdigest() == "3675c4523334687ac51298ee0f563768dd3b2cb0", "correct string value of pm_scatter.mark but incorrect case of letters"

assert sha1(str(type(pm_scatter.encoding.x.field !=  pm_scatter.encoding.x.title)).encode("utf-8")+b"3de025f51858ac3a").hexdigest() == "d54606f6982511d54d8cb49f385bd3b28053a6c8", "type of pm_scatter.encoding.x.field !=  pm_scatter.encoding.x.title is not bool. pm_scatter.encoding.x.field !=  pm_scatter.encoding.x.title should be a bool"
assert sha1(str(pm_scatter.encoding.x.field !=  pm_scatter.encoding.x.title).encode("utf-8")+b"3de025f51858ac3a").hexdigest() == "b6a8af94e6cad383a0d74eb92488623780a21012", "boolean value of pm_scatter.encoding.x.field !=  pm_scatter.encoding.x.title is not correct"

assert sha1(str(type(pm_scatter.encoding.y.field !=  pm_scatter.encoding.y.title)).encode("utf-8")+b"87592d924d195e10").hexdigest() == "1f1f60c13d4769436e9d8eb4ed1bd68f6031818d", "type of pm_scatter.encoding.y.field !=  pm_scatter.encoding.y.title is not bool. pm_scatter.encoding.y.field !=  pm_scatter.encoding.y.title should be a bool"
assert sha1(str(pm_scatter.encoding.y.field !=  pm_scatter.encoding.y.title).encode("utf-8")+b"87592d924d195e10").hexdigest() == "b3af1689322ffd30d874a5bf728bc15807ed85e0", "boolean value of pm_scatter.encoding.y.field !=  pm_scatter.encoding.y.title is not correct"

print('Success!')

**Question 1.4.1** 
<br> {points: 3}

We are going to cluster the Pokemon based on their `Speed` and `Defense`. Will it matter much for our clustering if we scale our variables? Is there any argument against scaling here?
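For reference, the mechanics of standardizing variables can be sketched as follows. The data frame `toy` and its column names are hypothetical and chosen so that the two columns sit on very different scales; whether scaling is appropriate for the Pokemon data is still for you to argue.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: "big" is 1000 times larger than "small".
toy = pd.DataFrame({"small": [1.0, 2.0, 3.0, 4.0], "big": [1000.0, 2000.0, 3000.0, 4000.0]})

# StandardScaler subtracts each column's mean and divides by its
# (population) standard deviation, returning a NumPy array.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print(scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance calculation simply because of its units.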

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.4.2**
<br> {points: 1}

Now, let's use the `KMeans` class from `scikit-learn` to cluster the Pokemon based on their `Speed` and `Defense` variables. For this question, use K = 4.

*Assign your answer to an object called `pokemon_clusters`.*

**Note:** since k-means uses a random initialization, we need to set `random_state` to match the value used in our answer key. Don't change the value!
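The general pattern for fitting k-means with `scikit-learn` can be sketched on a small made-up array. The points in `toy`, the choice of two clusters, and `random_state=0` are all assumptions for illustration; they are not the values expected by the answer key.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of three points.
toy = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# n_clusters is K; random_state fixes the random initialization so that
# results are reproducible. fit() returns the fitted estimator.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

print(model.cluster_centers_)  # one row per cluster centre
print(model.labels_)           # cluster index assigned to each row of toy
```

The fitted object stores its results in attributes ending in an underscore, such as `cluster_centers_`, `labels_` and `inertia_`.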

In [None]:
# your code here
raise NotImplementedError
pokemon_clusters

In [None]:
from hashlib import sha1
assert sha1(str(type(len(pokemon_clusters.cluster_centers_))).encode("utf-8")+b"fbda1ce97ad94ccd").hexdigest() == "847e2e1fd908b49cd71af22b4b7c1000a1c99456", "type of len(pokemon_clusters.cluster_centers_) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(len(pokemon_clusters.cluster_centers_)).encode("utf-8")+b"fbda1ce97ad94ccd").hexdigest() == "8669bb3c73151115e57a02ab00331815e9c82d24", "value of len(pokemon_clusters.cluster_centers_) is not correct"

assert sha1(str(type(pokemon_clusters.n_features_in_)).encode("utf-8")+b"cdaee28d6ca64b13").hexdigest() == "23e1c9013ba3173edb118405ecfa0fb54c25b08b", "type of pokemon_clusters.n_features_in_ is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(pokemon_clusters.n_features_in_).encode("utf-8")+b"cdaee28d6ca64b13").hexdigest() == "e5aa7c79319f3943254fb6b1288bb70b1cd1596c", "value of pokemon_clusters.n_features_in_ is not correct"

assert sha1(str(type(type(pokemon_clusters))).encode("utf-8")+b"6bb19d6014fcc952").hexdigest() == "7cee54ab46c958178358e830c4349d7b231d3b7f", "type of type(pokemon_clusters) is not correct"
assert sha1(str(type(pokemon_clusters)).encode("utf-8")+b"6bb19d6014fcc952").hexdigest() == "c50623c0e402ee3882d30b65232a97bbd21497e5", "value of type(pokemon_clusters) is not correct"

print('Success!')

**Question 1.5**
<br> {points: 1}

Let's visualize the clusters we built in `pokemon_clusters`. 

Your tasks:

1. Use the `assign` function to create a column called `cluster` in the `km_data` data frame with the cluster assignments for each data point from k-means. Name the new data frame `clustered_pokemon`; it should have the columns `Speed`, `Defense` and `cluster`.
2. Create a scatter plot of `Speed` (x-axis) vs `Defense` (y-axis) with the points coloured by their cluster assignment. 

Name this plot `answer1_5`.
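Attaching cluster labels to a data frame with `assign` can be sketched as below. The frame `toy`, its columns `x` and `y`, and the settings passed to `KMeans` are hypothetical and only illustrate the pattern.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data with two obvious groups.
toy = pd.DataFrame({
    "x": [0.0, 0.0, 1.0, 10.0, 10.0, 11.0],
    "y": [0.0, 1.0, 0.0, 10.0, 11.0, 10.0],
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# labels_ has one entry per row, in the same order as the rows of toy,
# so it can be added as a column directly. assign() returns a new frame.
toy_clustered = toy.assign(cluster=model.labels_)

print(toy_clustered)
```

The labels are arbitrary integers, so when plotting it usually makes sense to treat the new column as a category rather than a number.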

In [None]:
# your code here
raise NotImplementedError
answer1_5

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_5 is None)).encode("utf-8")+b"a5ee3bb2845dcbce").hexdigest() == "6bad298145f22fece0351150b6c3af975ffc9ce7", "type of answer1_5 is None is not bool. answer1_5 is None should be a bool"
assert sha1(str(answer1_5 is None).encode("utf-8")+b"a5ee3bb2845dcbce").hexdigest() == "9fd545613a637a94ecdb14e7e2c2e6bf8962ae0b", "boolean value of answer1_5 is None is not correct"

assert sha1(str(type(answer1_5.encoding.x.field)).encode("utf-8")+b"b02480eb0f02ad29").hexdigest() == "d5fb0ad8f3ba8305e82198e0da18d628a4131152", "type of answer1_5.encoding.x.field is not str. answer1_5.encoding.x.field should be an str"
assert sha1(str(len(answer1_5.encoding.x.field)).encode("utf-8")+b"b02480eb0f02ad29").hexdigest() == "f93e7af194deb4b7ebe0a124ba01c888ff160fac", "length of answer1_5.encoding.x.field is not correct"
assert sha1(str(answer1_5.encoding.x.field.lower()).encode("utf-8")+b"b02480eb0f02ad29").hexdigest() == "d8e561c12d195cfd1e84103c8f91b81b6bc04389", "value of answer1_5.encoding.x.field is not correct"
assert sha1(str(answer1_5.encoding.x.field).encode("utf-8")+b"b02480eb0f02ad29").hexdigest() == "f061ebeb035dd150f11645a228759d9adeb72327", "correct string value of answer1_5.encoding.x.field but incorrect case of letters"

assert sha1(str(type(answer1_5.encoding.y.field)).encode("utf-8")+b"78de86cdcaf29a02").hexdigest() == "f386484c05bb2e2a945c6f138d9acbd0ba052c9b", "type of answer1_5.encoding.y.field is not str. answer1_5.encoding.y.field should be an str"
assert sha1(str(len(answer1_5.encoding.y.field)).encode("utf-8")+b"78de86cdcaf29a02").hexdigest() == "3e66d9c43905c39cd4e1a0c62434741f86b6aaba", "length of answer1_5.encoding.y.field is not correct"
assert sha1(str(answer1_5.encoding.y.field.lower()).encode("utf-8")+b"78de86cdcaf29a02").hexdigest() == "a628c947c6fcb7cd30403e3e405b36e594e5b55b", "value of answer1_5.encoding.y.field is not correct"
assert sha1(str(answer1_5.encoding.y.field).encode("utf-8")+b"78de86cdcaf29a02").hexdigest() == "dac88009501bd126c08c79fe6ac26e9476b7813a", "correct string value of answer1_5.encoding.y.field but incorrect case of letters"

assert sha1(str(type(answer1_5.encoding.color.field)).encode("utf-8")+b"1555f3ddb415689f").hexdigest() == "342a2e3ba91217070ea0ad673b12d4f06c80dc7e", "type of answer1_5.encoding.color.field is not str. answer1_5.encoding.color.field should be an str"
assert sha1(str(len(answer1_5.encoding.color.field)).encode("utf-8")+b"1555f3ddb415689f").hexdigest() == "9bf9f65755699baec74c645e4c38e9d19c164819", "length of answer1_5.encoding.color.field is not correct"
assert sha1(str(answer1_5.encoding.color.field.lower()).encode("utf-8")+b"1555f3ddb415689f").hexdigest() == "762c4a02d610d524bd1d75a6a497e731e91fbd21", "value of answer1_5.encoding.color.field is not correct"
assert sha1(str(answer1_5.encoding.color.field).encode("utf-8")+b"1555f3ddb415689f").hexdigest() == "762c4a02d610d524bd1d75a6a497e731e91fbd21", "correct string value of answer1_5.encoding.color.field but incorrect case of letters"

assert sha1(str(type(answer1_5.mark)).encode("utf-8")+b"7683d4485a7af855").hexdigest() == "e052acfc0575e98c376771488620e47391e0fe98", "type of answer1_5.mark is not str. answer1_5.mark should be an str"
assert sha1(str(len(answer1_5.mark)).encode("utf-8")+b"7683d4485a7af855").hexdigest() == "e0c071dc3b8389e9adff2926c7168391b2a77aa2", "length of answer1_5.mark is not correct"
assert sha1(str(answer1_5.mark.lower()).encode("utf-8")+b"7683d4485a7af855").hexdigest() == "cc4bb1113616d8ddf954ae7ae67716bb1ddd2569", "value of answer1_5.mark is not correct"
assert sha1(str(answer1_5.mark).encode("utf-8")+b"7683d4485a7af855").hexdigest() == "cc4bb1113616d8ddf954ae7ae67716bb1ddd2569", "correct string value of answer1_5.mark but incorrect case of letters"

assert sha1(str(type(answer1_5.encoding.x.field != answer1_5.encoding.x.title)).encode("utf-8")+b"03391754cc928ca8").hexdigest() == "cb109906b19e0b4cf6c1979602f2f93381a550d9", "type of answer1_5.encoding.x.field != answer1_5.encoding.x.title is not bool. answer1_5.encoding.x.field != answer1_5.encoding.x.title should be a bool"
assert sha1(str(answer1_5.encoding.x.field != answer1_5.encoding.x.title).encode("utf-8")+b"03391754cc928ca8").hexdigest() == "810248e784ed1b98af830b6fa7cde96bfa5f0f7f", "boolean value of answer1_5.encoding.x.field != answer1_5.encoding.x.title is not correct"

assert sha1(str(type(answer1_5.encoding.y.field != answer1_5.encoding.y.title)).encode("utf-8")+b"1270cdf5d6252f56").hexdigest() == "17d857d7e037b583aab7f4a45d0eb00f5a6061fc", "type of answer1_5.encoding.y.field != answer1_5.encoding.y.title is not bool. answer1_5.encoding.y.field != answer1_5.encoding.y.title should be a bool"
assert sha1(str(answer1_5.encoding.y.field != answer1_5.encoding.y.title).encode("utf-8")+b"1270cdf5d6252f56").hexdigest() == "25e1c7abba153ec90e8b7adf6a967215b5999716", "boolean value of answer1_5.encoding.y.field != answer1_5.encoding.y.title is not correct"

assert sha1(str(type(answer1_5.encoding.color.field != answer1_5.encoding.color.title)).encode("utf-8")+b"6551c4f43cb4fcd1").hexdigest() == "0c15ef0f45374664dc7aa708f1c792226252c58e", "type of answer1_5.encoding.color.field != answer1_5.encoding.color.title is not bool. answer1_5.encoding.color.field != answer1_5.encoding.color.title should be a bool"
assert sha1(str(answer1_5.encoding.color.field != answer1_5.encoding.color.title).encode("utf-8")+b"6551c4f43cb4fcd1").hexdigest() == "c58f45a04f61a6e1b692840bcb6dbe57207ee4ef", "boolean value of answer1_5.encoding.color.field != answer1_5.encoding.color.title is not correct"

print('Success!')

**Question 1.6**
<br> {points: 3}

Below you can see multiple initializations of k-means with different seeds for `K = 4`. Can you explain what is happening and how we can mitigate this when using `KMeans`?

![](imgs/multiple_initializations.png)

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.7**
<br> {points: 1}

We know that choosing a K is an important step of the process. We can do this by examining how the total within-cluster sum of squares changes as we change K on a plot (which we call an elbow plot).
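To make "total within-cluster sum of squares" concrete, the sketch below computes it by hand on a made-up array and compares it with the value `scikit-learn` stores. The points in `toy` and the `KMeans` settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of three points.
toy = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Look up, for every point, the centre of the cluster it was assigned to.
centre_of_each_point = model.cluster_centers_[model.labels_]

# Sum the squared distances from each point to its own cluster centre.
wssd = float(((toy - centre_of_each_point) ** 2).sum())

print(wssd, model.inertia_)
```

The two printed numbers should agree up to floating-point error. Adding clusters can only lower this quantity, which is why the elbow plot looks for the point where the decrease levels off rather than for the minimum.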

For this exercise, from K = 1 to K = 10, you will calculate the total within-cluster sum of squares:
1. create a dataframe with the K values
2. create a new column `poke_clusts` by applying `KMeans` to each value of `k` (set `n_init` to be 10 and set `random_state` to be 2020)
3. create a new column `inertia` by accessing the `inertia_` attribute of each of the results
4. remove the `poke_clusts` column


*Assign your answer to a data frame object named `elbow_stats`. It should have the columns `k` and `inertia`.*

Remember, to access the total within-cluster sum of squares, you can use the `inertia_` attribute:

In [None]:
pokemon_clusters.inertia_

In [None]:
# your code here
raise NotImplementedError
elbow_stats.head()

In [None]:
from hashlib import sha1
assert sha1(str(type(elbow_stats.shape[0])).encode("utf-8")+b"fbde95488b176cd9").hexdigest() == "ac9eec5311699554c31bf439b5ba2d3f47d6068d", "type of elbow_stats.shape[0] is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(elbow_stats.shape[0]).encode("utf-8")+b"fbde95488b176cd9").hexdigest() == "f30a9cf25122b557353d6d1c74637385141fc472", "value of elbow_stats.shape[0] is not correct"

assert sha1(str(type(round(sum(elbow_stats.inertia), 2))).encode("utf-8")+b"0884cd398823c45d").hexdigest() == "3cf77f8be8a09f77f39605352cb5f157356b9607", "type of round(sum(elbow_stats.inertia), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(elbow_stats.inertia), 2), 2)).encode("utf-8")+b"0884cd398823c45d").hexdigest() == "0ed8d14d260add3a098ae0b33232d3612b1e67ad", "value of round(sum(elbow_stats.inertia), 2) is not correct (rounded to 2 decimal places)"

assert sha1(str(type("poke_clusts" in elbow_stats.columns)).encode("utf-8")+b"789b185c8a244fa2").hexdigest() == "63b475e4a6b5dd09c9fd8da99d9545c315d9e2dd", "type of \"poke_clusts\" in elbow_stats.columns is not bool. \"poke_clusts\" in elbow_stats.columns should be a bool"
assert sha1(str("poke_clusts" in elbow_stats.columns).encode("utf-8")+b"789b185c8a244fa2").hexdigest() == "2937a6d7153515454c8d297dbaed9c36a723fb13", "boolean value of \"poke_clusts\" in elbow_stats.columns is not correct"

print('Success!')

**Question 1.8**
<br> {points: 1}

Create the elbow plot. Put the within-cluster sum of squares on the y-axis, and the number of clusters on the x-axis.

*Assign your plot to an object called `elbow_plot`*.

In [None]:
# your code here
raise NotImplementedError
elbow_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(elbow_plot is None)).encode("utf-8")+b"febb0e035f072f14").hexdigest() == "131b19535cd8ba0534cf56ca7fcfdc822360ada5", "type of elbow_plot is None is not bool. elbow_plot is None should be a bool"
assert sha1(str(elbow_plot is None).encode("utf-8")+b"febb0e035f072f14").hexdigest() == "55fdcff7ee9bc833a7af57fb3829e299b3db0192", "boolean value of elbow_plot is None is not correct"

assert sha1(str(type(elbow_plot.mark.point)).encode("utf-8")+b"13f8938680b390a4").hexdigest() == "9681631ac248ab0a7b2489734c50bf29f528d983", "type of elbow_plot.mark.point is not bool. elbow_plot.mark.point should be a bool"
assert sha1(str(elbow_plot.mark.point).encode("utf-8")+b"13f8938680b390a4").hexdigest() == "48e0f177efd5d44c157b1acd722f2130eba881b7", "boolean value of elbow_plot.mark.point is not correct"

assert sha1(str(type(elbow_plot.mark.type)).encode("utf-8")+b"3867d5f648eac550").hexdigest() == "580f51bec6c23edcab14ec509b4fcd7a1e5490e8", "type of elbow_plot.mark.type is not str. elbow_plot.mark.type should be an str"
assert sha1(str(len(elbow_plot.mark.type)).encode("utf-8")+b"3867d5f648eac550").hexdigest() == "0901533b137ff8b1fee22188e64967a865d06f5f", "length of elbow_plot.mark.type is not correct"
assert sha1(str(elbow_plot.mark.type.lower()).encode("utf-8")+b"3867d5f648eac550").hexdigest() == "ceeb4344b9feb5dadaafa5a18348e3d34be4c222", "value of elbow_plot.mark.type is not correct"
assert sha1(str(elbow_plot.mark.type).encode("utf-8")+b"3867d5f648eac550").hexdigest() == "ceeb4344b9feb5dadaafa5a18348e3d34be4c222", "correct string value of elbow_plot.mark.type but incorrect case of letters"

assert sha1(str(type(elbow_plot.encoding.x.field)).encode("utf-8")+b"5b4d0f7a3f2284a6").hexdigest() == "375e44c4a5d9cbda108506b430abe549af720c41", "type of elbow_plot.encoding.x.field is not str. elbow_plot.encoding.x.field should be an str"
assert sha1(str(len(elbow_plot.encoding.x.field)).encode("utf-8")+b"5b4d0f7a3f2284a6").hexdigest() == "14646f656d9d7444f65b77c3e36e982cd91a530f", "length of elbow_plot.encoding.x.field is not correct"
assert sha1(str(elbow_plot.encoding.x.field.lower()).encode("utf-8")+b"5b4d0f7a3f2284a6").hexdigest() == "a3a90e9d00de221e7a39e2ee9b8460a6e49be452", "value of elbow_plot.encoding.x.field is not correct"
assert sha1(str(elbow_plot.encoding.x.field).encode("utf-8")+b"5b4d0f7a3f2284a6").hexdigest() == "a3a90e9d00de221e7a39e2ee9b8460a6e49be452", "correct string value of elbow_plot.encoding.x.field but incorrect case of letters"

assert sha1(str(type(elbow_plot.encoding.y.field)).encode("utf-8")+b"44759e578973bad7").hexdigest() == "461c1b6fd03b0bc5455a2f2bd3cf7950663fc79e", "type of elbow_plot.encoding.y.field is not str. elbow_plot.encoding.y.field should be an str"
assert sha1(str(len(elbow_plot.encoding.y.field)).encode("utf-8")+b"44759e578973bad7").hexdigest() == "ed313b1ea7fec8b2818f66d540be6c3c63dbcb26", "length of elbow_plot.encoding.y.field is not correct"
assert sha1(str(elbow_plot.encoding.y.field.lower()).encode("utf-8")+b"44759e578973bad7").hexdigest() == "b29a5529e9f4bfc32d263b568dea4542f5149209", "value of elbow_plot.encoding.y.field is not correct"
assert sha1(str(elbow_plot.encoding.y.field).encode("utf-8")+b"44759e578973bad7").hexdigest() == "b29a5529e9f4bfc32d263b568dea4542f5149209", "correct string value of elbow_plot.encoding.y.field but incorrect case of letters"

assert sha1(str(type(elbow_plot.encoding.y.field != elbow_plot.encoding.y.title)).encode("utf-8")+b"19fed91d9b27064b").hexdigest() == "d0aa65db620312950681fc92f73d0968678e79d8", "type of elbow_plot.encoding.y.field != elbow_plot.encoding.y.title is not bool. elbow_plot.encoding.y.field != elbow_plot.encoding.y.title should be a bool"
assert sha1(str(elbow_plot.encoding.y.field != elbow_plot.encoding.y.title).encode("utf-8")+b"19fed91d9b27064b").hexdigest() == "eee9f5bdbbda7925983c90acc288b5d5960fe987", "boolean value of elbow_plot.encoding.y.field != elbow_plot.encoding.y.title is not correct"

print('Success!')

**Question 1.9** 
<br> {points: 3}

Based on the elbow plot above, what value of k do you choose? Explain why.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.10**
<br> {points: 3}

Using the value that you chose for k, perform the k-means algorithm with `n_init=10` and `random_state=2019`, and assign your answer to an object called `pokemon_final_kmeans`.

Augment the data with the final cluster labels and assign your answer to an object called `pokemon_final_clusters`. 

Finally, create a plot called `pokemon_final_clusters_plot` to visualize the clusters. Include a title, colour the points by the cluster and make sure your axes are human-readable.
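
As a rough illustration of the fit-then-augment pattern described above, here is a minimal sketch on made-up data (not the Pokemon data). The names `toy`, `feature_a`, `feature_b`, `toy_kmeans` and `toy_clusters` are hypothetical, and the scaling step is an assumption that may or may not match the earlier steps of your own analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two well-separated groups on two made-up features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    np.vstack([
        rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
        rng.normal(loc=[10, 10], scale=0.5, size=(20, 2)),
    ]),
    columns=["feature_a", "feature_b"],
)

# Standardize so both features contribute equally to the distances
# (an assumption for this sketch).
toy_scaled = StandardScaler().fit_transform(toy)

# Fit k-means with a fixed seed and several random restarts.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=2019).fit(toy_scaled)

# Attach the fitted labels to a copy of the original (unscaled) data.
toy_clusters = toy.assign(cluster=toy_kmeans.labels_)
print(toy_clusters.head())
```

The labels are attached to the unscaled frame here so that any later plot can show the variables in their original units.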

In [None]:
# your code here
raise NotImplementedError

**Question 1.11**
<br> {points: 3}

Using only `Speed` and `Defense`, we found some number of clusters in our data. However, the dataset contains more information that might be useful for clustering. Let's incorporate all of the numeric variables into our k-means model. Again, use `n_init=10`.

Your tasks:

1. Select the numeric type columns only. For example, do not include the `#` or `Generation`  columns (they are not pokemon statistics). Assign your answer to an object called `pm_multi`.
2. From K = 1 to K = 10, calculate the total within-cluster sum of squares. Set `n_init` to be 10 and `random_state` to be 2019. Assign your answer to an object called `pm_multi_elbow_stats`. 
3. Use the elbow plot method to determine the number of clusters. Assign your answer to an object called `pm_multi_elbow_plot`.
4. Train a k-means model with the number of clusters determined in (3). Assign your answer to an object called `multi_kmeans`. 
5. Print the cluster means for the trained model by calling the `cluster_centers_` attribute.
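
Steps 1, 4 and 5 above can be sketched on made-up data as follows. This is only an illustration under stated assumptions: the frame `toy`, its columns (`id`, `hp`, `attack`, `speed`, `name`) and the choice of three clusters are hypothetical, and standardizing before clustering is an assumption rather than a requirement stated in the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three groups of 30 rows around made-up centres.
rng = np.random.default_rng(42)
centres = np.array([[20, 20, 20], [60, 60, 60], [100, 100, 100]])
values = np.vstack([rng.normal(loc=c, scale=3.0, size=(30, 3)) for c in centres])

toy = pd.DataFrame(values, columns=["hp", "attack", "speed"])
toy.insert(0, "id", range(len(toy)))                      # numeric identifier, not a statistic
toy["name"] = [f"creature_{i}" for i in range(len(toy))]  # text column

# Keep numeric columns only, then drop the numeric identifier by name.
toy_numeric = toy.select_dtypes(include="number").drop(columns=["id"])

# Standardize (an assumption for this sketch) and fit k-means.
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy_numeric)
toy_model = KMeans(n_clusters=3, n_init=10, random_state=2019).fit(toy_scaled)

# cluster_centers_ is in standardized units; map it back to original units.
centers_original = pd.DataFrame(
    scaler.inverse_transform(toy_model.cluster_centers_),
    columns=toy_numeric.columns,
)
print(centers_original.round(1))
```

`select_dtypes` keeps every numeric column, including identifiers, so columns that are numeric but not statistics still have to be dropped by name.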

In [None]:
# your code here
raise NotImplementedError

**Question 1.12** 
<br> {points: 3}

Visualizing these clusters is not a simple task given the high dimensionality of the model. Does the output of the cluster means help? Justify your reasoning.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

# 2. Tourism Reviews

![](https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif)
Source: https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif

The Ministry of Land, Infrastructure, Transport and Tourism of Japan is interested in understanding the types of tourists who visit East Asia. They know that the [majority of their visitors come from this region](https://statistics.jnto.go.jp/en/graph/) and would like to stay competitive there to keep growing the tourism industry. To that end, they have hired us to segment the tourists. A [dataset from TripAdvisor](https://archive.ics.uci.edu/ml/datasets/Travel+Reviews) has been scraped and is provided to you.

This dataset contains the following variables:

- User ID : Unique user id 
- Category 1 : Average user feedback on art galleries 
- Category 2 : Average user feedback on dance clubs 
- Category 3 : Average user feedback on juice bars 
- Category 4 : Average user feedback on restaurants 
- Category 5 : Average user feedback on museums 
- Category 6 : Average user feedback on resorts 
- Category 7 : Average user feedback on parks/picnic spots 
- Category 8 : Average user feedback on beaches 
- Category 9 : Average user feedback on theaters 
- Category 10 : Average user feedback on religious institutions

**Question 2.0**
<br> {points: 1}

Load the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv and clean it so that only the Category # columns are in the data frame (i.e., remove the User ID column). 

*Assign your answer to an object called `clean_reviews`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(clean_reviews is None)).encode("utf-8")+b"c9dd2e9b99775694").hexdigest() == "9b4ed75c9f0016f4d2a09cb3e32f28a149561a1f", "type of clean_reviews is None is not bool. clean_reviews is None should be a bool"
assert sha1(str(clean_reviews is None).encode("utf-8")+b"c9dd2e9b99775694").hexdigest() == "da3448773b6bd2ad8cb690c7b81c26488b53870e", "boolean value of clean_reviews is None is not correct"

assert sha1(str(type(clean_reviews)).encode("utf-8")+b"ba8e2d91d77cb456").hexdigest() == "bb9522487b8fdc0f955e3d16535087fbaf85faeb", "type of type(clean_reviews) is not correct"

assert sha1(str(type(clean_reviews.shape)).encode("utf-8")+b"065624ae346b7ec5").hexdigest() == "a527251f2218461f20b270b3bcb432e6a666b288", "type of clean_reviews.shape is not tuple. clean_reviews.shape should be a tuple"
assert sha1(str(len(clean_reviews.shape)).encode("utf-8")+b"065624ae346b7ec5").hexdigest() == "d5249cb170cddd1673ebf9e8b71908eded855425", "length of clean_reviews.shape is not correct"
assert sha1(str(sorted(map(str, clean_reviews.shape))).encode("utf-8")+b"065624ae346b7ec5").hexdigest() == "5f0ab56260dfb4b033eb9e8414f99fe27a97f6a6", "values of clean_reviews.shape are not correct"
assert sha1(str(clean_reviews.shape).encode("utf-8")+b"065624ae346b7ec5").hexdigest() == "4380eea72b9375befbb5e03564c5a9796c202e67", "order of elements of clean_reviews.shape is not correct"

assert sha1(str(type("User ID" in clean_reviews.columns)).encode("utf-8")+b"8513cea249c501e0").hexdigest() == "f02c4b574a4f96b9b4cce8fa461b1bb1b0b8ab34", "type of \"User ID\" in clean_reviews.columns is not bool. \"User ID\" in clean_reviews.columns should be a bool"
assert sha1(str("User ID" in clean_reviews.columns).encode("utf-8")+b"8513cea249c501e0").hexdigest() == "d0869d7c63b64a533574db81dacc0444bdfdde18", "boolean value of \"User ID\" in clean_reviews.columns is not correct"

assert sha1(str(type(round(sum(clean_reviews["Category 1"]), 2))).encode("utf-8")+b"4bd61681aa959a30").hexdigest() == "6fd57970d68361ee26a803a9cbd35308247e4798", "type of round(sum(clean_reviews[\"Category 1\"]), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(clean_reviews["Category 1"]), 2), 2)).encode("utf-8")+b"4bd61681aa959a30").hexdigest() == "5e740835786a580c5445ebfd7faaa9cef7436a0f", "value of round(sum(clean_reviews[\"Category 1\"]), 2) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(round(sum(clean_reviews["Category 10"]), 2))).encode("utf-8")+b"cc8602ff8d745b8b").hexdigest() == "d75058aa2241fc2d82ae23a4f811d0ac4eca7e2c", "type of round(sum(clean_reviews[\"Category 10\"]), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(clean_reviews["Category 10"]), 2), 2)).encode("utf-8")+b"cc8602ff8d745b8b").hexdigest() == "3c7f6fac7c6e13aa15822d2ce4d363e158f45b11", "value of round(sum(clean_reviews[\"Category 10\"]), 2) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 2.1**
<br> {points: 1}

Perform k-means and vary K from 1 to 10 to identify the optimal number of clusters. Use `n_init = 100` and `random_state=2019`. Assign your answer to a dataframe object called `elbow_stats` that has the columns `k` and `inertia`.  

Afterwards, create an elbow plot to help you choose K. Assign your answer to an object called `tourism_elbow_plot`.

**Note:** `altair` should only be given variables that can be plotted - select the `k` and `inertia` columns for your x and y plotting variables.
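
The general shape of an elbow table can be sketched on made-up points as below. The names `points` and `toy_elbow`, the range of k, and the use of `n_init=10` are all choices for this illustration, not the settings asked for in the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: three tight groups of 25 points in two dimensions.
rng = np.random.default_rng(7)
points = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(25, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

# Fit one model per candidate k and record its total within-cluster
# sum of squares, which scikit-learn exposes as `inertia_`.
ks = range(1, 7)
inertias = [
    float(KMeans(n_clusters=k, n_init=10, random_state=2019).fit(points).inertia_)
    for k in ks
]

# One row per k, holding only the two columns needed for plotting.
toy_elbow = pd.DataFrame({"k": list(ks), "inertia": inertias})
print(toy_elbow)
```

Because the toy data were generated from three groups, the drop in inertia should flatten after k = 3. A frame like this, containing only the plotting columns, is the kind of input that can be passed to `alt.Chart`.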

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(elbow_stats is None)).encode("utf-8")+b"6b85efc262ab5b72").hexdigest() == "22ce3200209361eb1bbc917ffe5671e3787d3fc8", "type of elbow_stats is None is not bool. elbow_stats is None should be a bool"
assert sha1(str(elbow_stats is None).encode("utf-8")+b"6b85efc262ab5b72").hexdigest() == "a76849e061b3d33b66119b0330de9efe5de9ede0", "boolean value of elbow_stats is None is not correct"

assert sha1(str(type(tourism_elbow_plot is None)).encode("utf-8")+b"9fcf098da73eb9bb").hexdigest() == "c690cbcef72d59d1db3174173b8e57285f9a334c", "type of tourism_elbow_plot is None is not bool. tourism_elbow_plot is None should be a bool"
assert sha1(str(tourism_elbow_plot is None).encode("utf-8")+b"9fcf098da73eb9bb").hexdigest() == "92614e632e9887d8deba5aad9b5a73f835b6aa15", "boolean value of tourism_elbow_plot is None is not correct"


# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.

assert sha1(str(type(elbow_stats)).encode("utf-8")+b"3b0071310e36c5f8").hexdigest() == "14b20ccc87f52b5e5cb3360bc109718a30eed0d1", "type of type(elbow_stats) is not correct"

assert sha1(str(type(elbow_stats.shape)).encode("utf-8")+b"c84dd184c1818ee7").hexdigest() == "c1196f8dbc0ed4bd494d9ab9711a806b82b97dbb", "type of elbow_stats.shape is not tuple. elbow_stats.shape should be a tuple"
assert sha1(str(len(elbow_stats.shape)).encode("utf-8")+b"c84dd184c1818ee7").hexdigest() == "731aaf8dc22e5532652a1829012b4cbf61a89c91", "length of elbow_stats.shape is not correct"
assert sha1(str(sorted(map(str, elbow_stats.shape))).encode("utf-8")+b"c84dd184c1818ee7").hexdigest() == "1d5c0cc18bbe310e5e148fa68b6a2812a133111e", "values of elbow_stats.shape are not correct"
assert sha1(str(elbow_stats.shape).encode("utf-8")+b"c84dd184c1818ee7").hexdigest() == "6c98ec4e936ce1f293b15df17e34bb67292dac14", "order of elements of elbow_stats.shape is not correct"

assert sha1(str(type(round(sum(elbow_stats.k), 2))).encode("utf-8")+b"74ab083c1de1a8f8").hexdigest() == "a6e846b9dda437457b3f2f529e867837be16a92f", "type of round(sum(elbow_stats.k), 2) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(round(sum(elbow_stats.k), 2)).encode("utf-8")+b"74ab083c1de1a8f8").hexdigest() == "2161d5b7d7d1d1def65272d4425d184e242036e9", "value of round(sum(elbow_stats.k), 2) is not correct"

assert sha1(str(type(round(sum(elbow_stats.inertia), 2))).encode("utf-8")+b"0a0d9674e0b6b3e1").hexdigest() == "3f16b5e4ab3adf47cf6c7cf320c69e60096f486e", "type of round(sum(elbow_stats.inertia), 2) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(elbow_stats.inertia), 2), 2)).encode("utf-8")+b"0a0d9674e0b6b3e1").hexdigest() == "96f3c0b12d6cdabbb8b8a884b4af9a60f7ceaf2e", "value of round(sum(elbow_stats.inertia), 2) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(tourism_elbow_plot.mark.point)).encode("utf-8")+b"c21fbab4fcaa8538").hexdigest() == "5949a051010aa36622ad6c98bdefd515fda801e1", "type of tourism_elbow_plot.mark.point is not bool. tourism_elbow_plot.mark.point should be a bool"
assert sha1(str(tourism_elbow_plot.mark.point).encode("utf-8")+b"c21fbab4fcaa8538").hexdigest() == "d59b2f1f1418189f2c82e237d84885b9f5be7f55", "boolean value of tourism_elbow_plot.mark.point is not correct"

assert sha1(str(type(tourism_elbow_plot.mark.type)).encode("utf-8")+b"3019befe064cc7ec").hexdigest() == "d035651dec957780007f62665ac9c6912c3433f5", "type of tourism_elbow_plot.mark.type is not str. tourism_elbow_plot.mark.type should be an str"
assert sha1(str(len(tourism_elbow_plot.mark.type)).encode("utf-8")+b"3019befe064cc7ec").hexdigest() == "414659730228a81ddab184f5a0ace2f02da27e75", "length of tourism_elbow_plot.mark.type is not correct"
assert sha1(str(tourism_elbow_plot.mark.type.lower()).encode("utf-8")+b"3019befe064cc7ec").hexdigest() == "bd21f72d1a5363cf461c3a7ff9ceeb6163669e6c", "value of tourism_elbow_plot.mark.type is not correct"
assert sha1(str(tourism_elbow_plot.mark.type).encode("utf-8")+b"3019befe064cc7ec").hexdigest() == "bd21f72d1a5363cf461c3a7ff9ceeb6163669e6c", "correct string value of tourism_elbow_plot.mark.type but incorrect case of letters"

assert sha1(str(type(tourism_elbow_plot.encoding.x.field)).encode("utf-8")+b"61a03f9765d3f3aa").hexdigest() == "3aae732bd3a886ef2d37a61524ea4eb6dee4093a", "type of tourism_elbow_plot.encoding.x.field is not str. tourism_elbow_plot.encoding.x.field should be an str"
assert sha1(str(len(tourism_elbow_plot.encoding.x.field)).encode("utf-8")+b"61a03f9765d3f3aa").hexdigest() == "7bb4e87af7c62621deca4c18b3c92a8a0dc6dd35", "length of tourism_elbow_plot.encoding.x.field is not correct"
assert sha1(str(tourism_elbow_plot.encoding.x.field.lower()).encode("utf-8")+b"61a03f9765d3f3aa").hexdigest() == "c681db7271fb8578bc23adb6d42df2d3fb6a10ad", "value of tourism_elbow_plot.encoding.x.field is not correct"
assert sha1(str(tourism_elbow_plot.encoding.x.field).encode("utf-8")+b"61a03f9765d3f3aa").hexdigest() == "c681db7271fb8578bc23adb6d42df2d3fb6a10ad", "correct string value of tourism_elbow_plot.encoding.x.field but incorrect case of letters"

assert sha1(str(type(tourism_elbow_plot.encoding.y.field)).encode("utf-8")+b"2ade2ea7fae7141b").hexdigest() == "1ee06d68c84956959a63c9630dc100702c0f546f", "type of tourism_elbow_plot.encoding.y.field is not str. tourism_elbow_plot.encoding.y.field should be an str"
assert sha1(str(len(tourism_elbow_plot.encoding.y.field)).encode("utf-8")+b"2ade2ea7fae7141b").hexdigest() == "21f1ff4bb83e02c29d8dde7cf20b68c34d274458", "length of tourism_elbow_plot.encoding.y.field is not correct"
assert sha1(str(tourism_elbow_plot.encoding.y.field.lower()).encode("utf-8")+b"2ade2ea7fae7141b").hexdigest() == "df439e5fcbbe59fae146fec83ebbc98f44ef09a9", "value of tourism_elbow_plot.encoding.y.field is not correct"
assert sha1(str(tourism_elbow_plot.encoding.y.field).encode("utf-8")+b"2ade2ea7fae7141b").hexdigest() == "df439e5fcbbe59fae146fec83ebbc98f44ef09a9", "correct string value of tourism_elbow_plot.encoding.y.field but incorrect case of letters"

assert sha1(str(type(tourism_elbow_plot.encoding.y.field == tourism_elbow_plot.encoding.y.title)).encode("utf-8")+b"43b2fcc51c5ac461").hexdigest() == "9f233e6c324e3575e845e30863d5df0b40e073c9", "type of tourism_elbow_plot.encoding.y.field == tourism_elbow_plot.encoding.y.title is not bool. tourism_elbow_plot.encoding.y.field == tourism_elbow_plot.encoding.y.title should be a bool"
assert sha1(str(tourism_elbow_plot.encoding.y.field == tourism_elbow_plot.encoding.y.title).encode("utf-8")+b"43b2fcc51c5ac461").hexdigest() == "44ca67a5e4a8221436eabdadd8dac427b0096970", "boolean value of tourism_elbow_plot.encoding.y.field == tourism_elbow_plot.encoding.y.title is not correct"

print('Success!')

**Question 2.2** 
<br> {points: 3}

From the elbow plot above, which K should you choose? Explain why you chose that K.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.3**
<br> {points: 3}

Run k-means using `n_init=100` and `random_state=2019` with the optimal K, and assign your answer to an object called `reviews_clusters`. Then, use the `predict` method to get the cluster assignment for each point and add the result to a new column called `cluster` in `clean_reviews`. Name the resulting data frame `cluster_assignments`.
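
To illustrate how `predict` relates to the fitted labels, here is a small sketch on made-up ratings. The names `ratings`, `cat_a`, `cat_b`, `cat_c`, `model` and `labelled` are hypothetical, and two clusters are used only because the toy data were generated that way.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy ratings: two groups of 15 users on three made-up categories.
rng = np.random.default_rng(1)
ratings = pd.DataFrame(
    np.vstack([
        rng.normal(loc=1.0, scale=0.1, size=(15, 3)),
        rng.normal(loc=3.0, scale=0.1, size=(15, 3)),
    ]),
    columns=["cat_a", "cat_b", "cat_c"],
)

model = KMeans(n_clusters=2, n_init=10, random_state=2019).fit(ratings)

# `assign` returns a new frame with `cluster` appended as the last column;
# `ratings` itself keeps only the feature columns.
labelled = ratings.assign(cluster=model.predict(ratings))
print(labelled.head())
```

On the data the model was fitted to, `predict` should return the same assignments as the `labels_` attribute. Keeping `cluster` as the last column matters if later code selects the feature columns by position.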

In [None]:
# your code here
raise NotImplementedError

Use the plot below as a reference for the next two questions.

In [None]:
(
    alt.Chart(
        cluster_assignments.melt(
            id_vars=["cluster"],
            value_vars=cluster_assignments.iloc[:, :-1].columns.values.tolist(),
            var_name="category",
            value_name="value",
        )
    )
    .transform_density(
        "value", groupby=["cluster", "category"], as_=["value", "density"]
    )
    .mark_area(interpolate="monotone", opacity=0.4)
    .encode(
        x=alt.X("value", scale=alt.Scale(zero=False)), y="density:Q", fill="cluster:N"
    )    
    .properties(width=100, height=100)
    .facet("category", columns=4)
    .resolve_scale(x="independent", y="independent")
    .configure_axis(labelFontSize=8, titleFontSize=10)
)

**Question 2.4** Multiple Choice:
<br> {points: 1}

From the plots above, which categories might we hypothesize are driving the clustering (i.e., are useful for distinguishing between the types of tourists)? The categories are listed below. 

- Category 1 : Average user feedback on art galleries 
- Category 2 : Average user feedback on dance clubs 
- Category 3 : Average user feedback on juice bars 
- Category 4 : Average user feedback on restaurants 
- Category 5 : Average user feedback on museums 
- Category 6 : Average user feedback on resorts 
- Category 7 : Average user feedback on parks/picnic spots 
- Category 8 : Average user feedback on beaches 
- Category 9 : Average user feedback on theaters 
- Category 10 : Average user feedback on religious institutions

A. 10, 3, 5, 6, 7

B. 10, 3, 5, 6, 1

C. 10, 3, 4, 6, 7

D. 10, 2, 5, 6, 7

*Assign your answer to an object called `answer2_4`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError
answer2_4

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_4 is None)).encode("utf-8")+b"a5db7941997aed14").hexdigest() == "821211cfd48983ae31692b58ac508707024a88d1", "type of answer2_4 is None is not bool. answer2_4 is None should be a bool"
assert sha1(str(answer2_4 is None).encode("utf-8")+b"a5db7941997aed14").hexdigest() == "c577cc89f094c385fea1d18844727172cfb481ea", "boolean value of answer2_4 is None is not correct"


# The remainder of the tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.

assert sha1(str(type(answer2_4)).encode("utf-8")+b"f53baf9b70a23dfd").hexdigest() == "87a177958fab36125b836bf5c35048c4a2e8d3b5", "type of answer2_4 is not str. answer2_4 should be an str"
assert sha1(str(len(answer2_4)).encode("utf-8")+b"f53baf9b70a23dfd").hexdigest() == "324c29e4282b31a464db144ac27f4710a8c6cd57", "length of answer2_4 is not correct"
assert sha1(str(answer2_4.lower()).encode("utf-8")+b"f53baf9b70a23dfd").hexdigest() == "fa535b33d6bd492cb39a9f15557421593c09badb", "value of answer2_4 is not correct"
assert sha1(str(answer2_4).encode("utf-8")+b"f53baf9b70a23dfd").hexdigest() == "a8652827f127a29f2a720b16812fec3ab7983aa6", "correct string value of answer2_4 but incorrect case of letters"

print('Success!')

**Question 2.5** 
<br> {points: 3}

Discuss one disadvantage of only being able to compare clusters along single categories when dealing with multidimensional data.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.