# Travel Review Dataset 
The dataset is taken from the University of California, Irvine (UCI) Machine Learning Repository and can be downloaded from [here](https://archive.ics.uci.edu/ml/datasets/Travel+Reviews).

It contains reviews of East Asian destinations across 10 categories.

Each traveler rating is mapped as Excellent (4), Very Good (3), Average (2), Poor (1), or Terrible (0), and the average rating per category is used for each user.
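
As a rough illustration of the mapping above, the sketch below averages a handful of made-up review labels. The names `RATING_SCORES`, `average_rating` and `example_labels` are hypothetical and do not appear in the dataset, which already stores the averaged values.

```python
# Mapping from review label to numeric score, as described above.
RATING_SCORES = {"Excellent": 4, "Very Good": 3, "Average": 2, "Poor": 1, "Terrible": 0}


def average_rating(labels):
    """Return the mean numeric score for a list of rating labels."""
    scores = [RATING_SCORES[label] for label in labels]
    return sum(scores) / len(scores)


# Hypothetical reviews left by one user in a single category.
example_labels = ["Excellent", "Average", "Poor", "Very Good"]
example_average = average_rating(example_labels)
print(example_average)  # (4 + 2 + 1 + 3) / 4 = 2.5
```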

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

# Import The Dataset

Let's first import the CSV dataset, which is in the `data/` directory.

The dataset attributes are summarised in the following table: 

| Variable  | Meaning |
| ------------- | ------------- | 
|Attribute 1  | Unique user id  |
|Category 1| Average user feedback on art galleries | 
|Category 2 | Average user feedback on dance clubs | 
|Category 3| Average user feedback on juice bars |
|Category 4| Average user feedback on restaurants | 
|Category 5| Average user feedback on museums |
|Category 6| Average user feedback on resorts |
|Category 7| Average user feedback on parks/picnic spots |
|Category 8| Average user feedback on beaches |
|Category 9| Average user feedback on theatres |
|Category 10| Average user feedback on religious institutions | 


In [None]:
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

**Q1. Load the dataset called `tripadvisor_review.csv` into a DataFrame called `df`.**


Call the `.info()` method to view its concise summary.
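
A minimal sketch of the loading pattern, using a two-row in-memory CSV instead of the real file. The variable names `csv_text` and `toy`, and the sample values, are made up for illustration and chosen so they do not clash with the variables KATE checks.

```python
import io

import pandas as pd

# Hypothetical two-row extract standing in for the real CSV file.
csv_text = "User ID,Category 1,Category 2\nUser 1,0.93,1.80\nUser 2,1.02,2.20\n"

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# .info() prints the column names, non-null counts and dtypes.
toy.info()
```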

In [None]:
# Add your code below
# df = ...


# Preprocess the dataset 

**Q2. Remove the first column, `User ID`, from the DataFrame.**


Before doing this, make a copy of `df`, named `df_new`, and work on `df_new`. 

*Hint: Use the `.drop()` method to remove columns from the DataFrame.*
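
The copy-then-drop pattern can be sketched on a toy DataFrame as follows. The names `toy` and `toy_new` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
toy = pd.DataFrame(
    {
        "User ID": ["User 1", "User 2"],
        "Category 1": [0.93, 1.02],
        "Category 2": [1.80, 2.20],
    }
)

# Work on a copy so the original DataFrame stays untouched.
toy_new = toy.copy()

# .drop() returns a new DataFrame without the listed columns.
toy_new = toy_new.drop(columns=["User ID"])

print(toy_new.columns.tolist())
```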

In [None]:
# Add your code below
# df_new = ...


**Q3. Using `df_new`, scale its values between 0 and 1 using a min-max scaler.
Save the scaled dataset to a new DataFrame named `df_scaled`.**


*Hint: The min-max scaling formula is as follows:* 

$$
df\_scaled = \frac{original\_df \ - \ min\_val\_original\_df}{max\_val\_original\_df \ - \ min\_val\_original\_df}
$$
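
A small worked example of the formula, assuming the minimum and maximum are taken per column (this is also how scikit-learn's `MinMaxScaler` treats features). The DataFrame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with two columns on very different scales.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

# Assumption: min and max are computed column by column.
# pandas aligns toy.min() and toy.max() with the columns automatically.
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_scaled)
```

After scaling, every column has a minimum of 0 and a maximum of 1, so no single category dominates the distance computations used later by K-Means.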


In [None]:
# Add your code below
# df_scaled = ...


# Clustering 

It's time to try some clustering. 

Let's start with K-Means and let's see how the clusters look. 

We will use the package `KMeans` from `sklearn.cluster` which we imported earlier.

**Q4. Run K-Means with 2 clusters. Save the model into a variable called `kmeans2`, and fit it to `df_scaled`.**


**Note**: When initializing the `KMeans` object, specify `random_state = 8`.
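
The sketch below shows the general fit pattern on six made-up 2-D points. The names `points` and `toy_model` are hypothetical, and `n_init=10` is an addition here to keep the toy result stable across scikit-learn versions; the exercise itself only asks for `random_state`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated hypothetical groups of points.
points = np.array(
    [[0.0, 0.1], [0.1, 0.0], [0.05, 0.2], [0.9, 1.0], [1.0, 0.8], [0.95, 0.9]]
)

# n_init=10 is an assumption of this sketch, not a requirement of the exercise.
toy_model = KMeans(n_clusters=2, random_state=8, n_init=10)
toy_model.fit(points)

# One integer label per input row.
print(toy_model.labels_)
```

Fixing `random_state` makes the centroid initialization reproducible, so repeated runs give the same clusters.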

In [None]:
# Add your code below
# kmeans2 = ...


**Q5. Save the centroids estimated by the K-Means model in a variable called `kmeans2_centres`.**


Recall that K-Means works by re-estimating the centroids of the requested clusters at each iteration.

*Hint: You can access this information using the attribute `cluster_centers_` of the fitted model.* 
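
To see what the attribute holds, here is a toy sketch in which the centroids are easy to verify by hand. The names `points`, `toy_model` and `toy_centres` are hypothetical, and `n_init=10` is an assumption added for stability across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical groups: one around (0, 1), one around (10, 11).
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])

toy_model = KMeans(n_clusters=2, random_state=8, n_init=10).fit(points)

# cluster_centers_ has one row per cluster and one column per feature.
toy_centres = toy_model.cluster_centers_
print(toy_centres.shape)

# Cluster numbering is arbitrary, so sort by the first coordinate before comparing.
sorted_centres = toy_centres[np.argsort(toy_centres[:, 0])]
print(sorted_centres)
```

Each centroid is the mean of the points assigned to that cluster, which is why the two rows come out as the group averages.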

In [None]:
# Add your code below
# kmeans2_centres = ...


**Q6. Create a copy of `df_scaled` and call it `df_scaled_predictions_2`. Add to this DataFrame a column called `predicted_kMeans_2` with the values of the labels assigned by `kmeans2`.**


*Hint: You can access this information using the attribute `labels_` of the fitted model.*
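
A minimal sketch of attaching labels to a copy, on made-up data. The names `toy_scaled`, `toy_model`, `toy_predictions` and the column name `cluster` are hypothetical, and `n_init=10` is an assumption added for stability.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical scaled data with two obvious groups.
toy_scaled = pd.DataFrame({"x": [0.0, 0.1, 0.9, 1.0], "y": [0.0, 0.2, 0.8, 1.0]})

toy_model = KMeans(n_clusters=2, random_state=8, n_init=10).fit(toy_scaled)

# Copy first so the scaled data keeps only the feature columns.
toy_predictions = toy_scaled.copy()

# labels_ is ordered like the rows the model was fitted on.
toy_predictions["cluster"] = toy_model.labels_

print(toy_predictions)
```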

In [None]:
# Add your code below
# df_scaled_predictions_2 = ...


**Q7. Create a DataFrame with all the rows that belong to *cluster 0* and store it in a variable called `df_k2_cluster_0`.
Then create a DataFrame with all the rows that belong to *cluster 1* and store it in a variable called `df_k2_cluster_1`.**


*Note: these DataFrames should not contain the column `predicted_kMeans_2`.*
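
One way to split rows by label and discard the label column is a boolean mask followed by `.drop()`, sketched here on made-up data. The names `toy_predictions`, `toy_cluster_0`, `toy_cluster_1` and the column name `cluster` are hypothetical.

```python
import pandas as pd

# Hypothetical predictions table with hand-written labels.
toy_predictions = pd.DataFrame(
    {"x": [0.0, 0.1, 0.9, 1.0], "cluster": [0, 0, 1, 1]}
)

# Keep the rows of one cluster, then remove the label column.
toy_cluster_0 = toy_predictions[toy_predictions["cluster"] == 0].drop(columns=["cluster"])
toy_cluster_1 = toy_predictions[toy_predictions["cluster"] == 1].drop(columns=["cluster"])

print(toy_cluster_0)
print(toy_cluster_1)
```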

In [None]:
# Add your code below
# df_k2_cluster_0 = ...
# df_k2_cluster_1 = ...


**Q8. Plot the histograms of all the categories of `df_k2_cluster_0` and `df_k2_cluster_1`.**


Once you have plotted them, you can visually compare the histograms and see which categories' distributions differ between the two clusters.

Assign the two plots to the variables called `hist_k2_cluster_0` and `hist_k2_cluster_1` respectively.

*Hint: use the `.hist()` method on the DataFrame of interest and use `bins = 13`.*

*`matplotlib.pyplot` (imported earlier with the alias `plt`) provides a `tight_layout()` function, which you can call after each call to `.hist()` to tidy the layout of the histograms produced.*
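
A short sketch of the `.hist()` pattern on made-up data. The names `toy` and `toy_hist` are hypothetical, and the `matplotlib.use("Agg")` line is an assumption so the sketch also runs as a plain script; leave it out inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with two numeric columns.
toy = pd.DataFrame({"a": [0.1, 0.2, 0.2, 0.9], "b": [0.5, 0.6, 0.7, 0.8]})

# .hist() draws one histogram per numeric column and returns an array of Axes.
toy_hist = toy.hist(bins=13)

# Adjust spacing so titles and tick labels do not overlap.
plt.tight_layout()

print(toy_hist.shape)
```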

In [None]:
# We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
# Add your code below
# hist_k2_cluster_0 = ...
# plt.tight_layout()
# hist_k2_cluster_1 = ...
# plt.tight_layout()


From the histograms, we can see that two categories differ markedly between the two clusters: Category 3 and Category 10.

**Q9. Make boxplots of Category 3 and Category 10, grouping by `predicted_kMeans_2`. Store your plots in two variables called `box_plot3_k2` and `box_plot10_k2` respectively.**


*Hint: In this case you should use `df_scaled_predictions_2`.*

*Call `plt.suptitle("")` after each call to the `.boxplot()` method to remove the automatic figure title and get a cleaner plot.*
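
A grouped boxplot can be sketched on made-up data as follows. The names `toy`, `toy_box`, `score` and `cluster` are hypothetical, and the `matplotlib.use("Agg")` line is an assumption so the sketch also runs as a plain script; leave it out inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores with a hand-written cluster label.
toy = pd.DataFrame(
    {"score": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9], "cluster": [0, 0, 0, 1, 1, 1]}
)

# One box per value of the grouping column.
toy_box = toy.boxplot(column="score", by="cluster")

# pandas adds a "Boxplot grouped by ..." figure title; an empty suptitle clears it.
plt.suptitle("")
```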

In [None]:
# We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
# Add your code below
# box_plot3_k2 = ...
# plt.suptitle("")
# box_plot10_k2 = ...
# plt.suptitle("")


The model has some parameters that can be tweaked. Let's change the number of clusters and some of the other parameters to see if the results change.

**Q10. Run K-Means with 4 clusters. Change the parameter `max_iter` to 400 and fit the model to `df_scaled`. Call the variable where the model is saved `kmeans4`.**


**Note**: When initializing the `KMeans` object, specify `random_state = 8`.
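
`max_iter` is an upper bound on the number of iterations in a single run, not a target. The toy sketch below shows this using the fitted attribute `n_iter_`. The names `points` and `toy_model` are hypothetical, and `n_init=10` is an assumption added for stability across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical points forming two tight groups.
points = np.array(
    [[0.0, 0.1], [0.1, 0.0], [0.05, 0.2], [0.9, 1.0], [1.0, 0.8], [0.95, 0.9]]
)

toy_model = KMeans(n_clusters=2, max_iter=400, random_state=8, n_init=10).fit(points)

# n_iter_ reports how many iterations the best run actually needed.
print(toy_model.n_iter_)
```

On easy data the algorithm usually converges long before the limit, so raising `max_iter` only matters when a run would otherwise stop early.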

In [None]:
# Add your code below
# kmeans4 = ...


**Q11. Create a copy of `df_scaled` and call it `df_scaled_predictions_4`. Add to this DataFrame a column called `predicted_kMeans_4` with the values of the labels assigned by `kmeans4`.**


*Hint: You can access this information using the attribute `labels_` of the fitted model.*

In [None]:
# Add your code below
# df_scaled_predictions_4 = ...


**Q12. Based on the prediction in column `predicted_kMeans_4`, create 4 DataFrames. One with all the rows that belong to *cluster 0*, one with all the rows that belong to *cluster 1*, one with all the rows that belong to *cluster 2*, and one with all the rows that belong to *cluster 3*. Save these DataFrames to the variables `df_k4_cluster_0`, `df_k4_cluster_1`, `df_k4_cluster_2`, and `df_k4_cluster_3` respectively.**


*Note: these DataFrames should not contain the column `predicted_kMeans_4`.*
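
With more clusters, writing one filter per label gets repetitive. As an optional alternative, the sketch below builds all per-cluster DataFrames in one pass with `groupby`. The names `toy_predictions`, `toy_clusters` and `cluster` are hypothetical, and the exercise still expects the four separately named variables.

```python
import pandas as pd

# Hypothetical predictions table with four hand-written labels.
toy_predictions = pd.DataFrame(
    {
        "x": [0.1, 0.2, 0.5, 0.6, 0.8, 0.9, 0.3, 0.4],
        "cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    }
)

# Map each label to its rows, with the label column removed.
toy_clusters = {
    label: group.drop(columns=["cluster"])
    for label, group in toy_predictions.groupby("cluster")
}

print(toy_clusters[2])
```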

In [None]:
# Add your code below
# df_k4_cluster_0 = ...
# df_k4_cluster_1 = ...
# df_k4_cluster_2 = ...
# df_k4_cluster_3 = ...


**Q13. Plot the histograms of all the categories of `df_k4_cluster_0`, `df_k4_cluster_1`, `df_k4_cluster_2`, and `df_k4_cluster_3`.**


Once you have plotted them, you can visually compare the histograms and see which categories' distributions differ between the four clusters.

Assign the four plots to the variables `hist_k4_cluster_0`, `hist_k4_cluster_1`, `hist_k4_cluster_2`, and `hist_k4_cluster_3` respectively.

*Hint: use the `.hist()` method on the DataFrame of interest and use `bins = 13`.*

*`matplotlib.pyplot` (imported earlier with the alias `plt`) provides a `tight_layout()` function, which you can call after each call to `.hist()` to tidy the layout of the histograms produced.*

In [None]:
# We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
# Add your code below
# hist_k4_cluster_0 = ...
# plt.tight_layout()
# hist_k4_cluster_1 = ...
# plt.tight_layout()
# hist_k4_cluster_2 = ...
# plt.tight_layout()
# hist_k4_cluster_3 = ...
# plt.tight_layout()


Are the Category 3 and Category 10 distributions still different among the 4 groups of observations obtained?

**Q14. Make boxplots of Category 3 and Category 10, grouping by `predicted_kMeans_4`.**


Store your plots in two variables called `box_plot3_k4` and `box_plot10_k4` respectively.

*Hint: In this case you should use `df_scaled_predictions_4`.*

*Call `plt.suptitle("")` after each call to the `.boxplot()` method to remove the automatic figure title and get a cleaner plot.*

In [None]:
# We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
# Add your code below
# box_plot3_k4 = ...
# plt.suptitle("")
# box_plot10_k4 = ...
# plt.suptitle("")


**Q15. When using `k = 2`, how many elements are present in each cluster? Store your answer in a variable called `distribution_2_clusters`.**


**Note**: your answer should be a pandas Series, with the cluster label as the index and the count as the value.

*Hint: you can use the `.value_counts()` method.*
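
A quick sketch of `.value_counts()` on a made-up label column. The names `toy_predictions`, `toy_counts` and `cluster` are hypothetical.

```python
import pandas as pd

# Hypothetical cluster labels for five rows.
toy_predictions = pd.DataFrame({"cluster": [0, 1, 1, 0, 1]})

# Returns a Series indexed by label, with the number of rows per label.
toy_counts = toy_predictions["cluster"].value_counts()

print(toy_counts)
```

The result is sorted by count in descending order by default, so the largest cluster appears first.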

In [None]:
# Add your code below
# distribution_2_clusters = ...


**Q16. When using `k = 4`, how many elements are present in each cluster? Store your answer in a variable called `distribution_4_clusters`.**


**Note**: your answer should be a pandas Series, with the cluster label as the index and the count as the value.

*Hint: you can use the `.value_counts()` method.*

In [None]:
# Add your code below
# distribution_4_clusters = ...
