# 03 - Clustering listings by Amenities
As well as geographical location, the file `wk9_airbnb_listings.csv` contains lots of other attributes we can use to group listings, including the [amenities](https://en.wikipedia.org/wiki/Amenity) on offer for each listing. For example, Wifi, TV, Kitchen, Board games, Free parking, etc.

In this notebook, the goal is to find clusters of listings which offer similar amenities.

### Task 01 - Load the data
In the code block below, first import the `pandas` package, then read `wk9_airbnb_listings.csv` in to create a `DataFrame` named `df_listings`. Finally, print the `head` of the `id` and `amenities` columns.
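
As a rough sketch of the loading pattern, the example below reads a tiny, made-up CSV from memory instead of the real file. The names `csv_text` and `df_demo` and the two rows of data are illustrative assumptions, not part of `wk9_airbnb_listings.csv`.

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for wk9_airbnb_listings.csv (illustration only)
csv_text = 'id,name,amenities\n101,Flat A,"[""Wifi"", ""TV""]"\n102,Flat B,"[""Kitchen""]"\n'

# read_csv accepts a file path or a file-like object such as StringIO
df_demo = pd.read_csv(io.StringIO(csv_text))

# Select two columns with a list of names, then show the first rows
print(df_demo[['id', 'amenities']].head())
```

With the real file, the path to the CSV would replace the `io.StringIO(...)` argument.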

In [None]:
# (SOLUTION)


**Output check:** The output should look like this:
<img src="../images/Result_Check_03_01.jpg" width=400 height=400 />

## Data Pre-processing
We need to process the data contained in the `amenities` column of `df_listings` before we can use it to cluster the listings. We do this in stages.

### Task 01 - Convert strings to lists
The values in the `amenities` column are in [string](https://www.w3schools.com/python/python_strings.asp) format and we need to convert them into [list](https://www.w3schools.com/python/python_lists.asp) format.

To do this, we will import the `json` package and use the `apply` method together with the `json.loads` function to "load" the string value in each row into a list in a new column which we will name `amenities_list`.

No code task here, just run the code below.

In [None]:
import json
df_listings['amenities_list'] = df_listings['amenities'].apply(lambda s : json.loads(s))
df_listings[['amenities','amenities_list']].head()

Do you see any differences between the values in `amenities` and the values in `amenities_list`? They are subtle, but very important.
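
One way to check the difference is to compare a value before and after parsing. This is a small sketch on a made-up string in the same format as the `amenities` column; the names `raw` and `parsed` are hypothetical.

```python
import json

# Made-up value in the same format as one cell of the amenities column
raw = '["Wifi", "TV", "Kitchen"]'
parsed = json.loads(raw)

# The string is a sequence of characters; the list is a sequence of amenities
print(type(raw).__name__, len(raw))
print(type(parsed).__name__, len(parsed))

# Membership tests on the list match whole amenities only
print('TV' in parsed)
```

The two values look almost identical when printed in a `DataFrame`, but only the list can be treated as a collection of separate amenities.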

### Task 02 - Map lists to binary indicators
We now need to map the lists in `amenities_list` to a format which can be used for cluster analysis. This is similar to how `get_dummies` was used in the "Example of Kmeans Clustering" in the first notebook, but `get_dummies` can be slow for large datasets. Instead, we are going to use the awesome `MultiLabelBinarizer` from the `sklearn.preprocessing` module.

Specifically, we are going to use an instance of `MultiLabelBinarizer` to fit and transform the values in `df_listings['amenities_list']` and create an array of binary indicators named `amenities_binary`.
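
To get a feel for what `MultiLabelBinarizer` produces, here is a minimal sketch on three made-up listings. The names `toy_lists`, `toy_mlb` and `toy_binary` are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three made-up listings, each with its own list of amenities
toy_lists = [['Wifi', 'TV'], ['Kitchen'], ['TV', 'Kitchen', 'Wifi']]

toy_mlb = MultiLabelBinarizer()
toy_binary = toy_mlb.fit_transform(toy_lists)

# classes_ holds the distinct amenities, in the same order as the columns
print(toy_mlb.classes_)

# One row per listing, one column per amenity: 1 if offered, 0 if not
print(toy_binary)
```

Each row has a 1 in the columns for the amenities that listing offers, regardless of the order in which they appeared in the original list.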

This may be all new to you, so please do just run the code below.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
amenities_binary = mlb.fit_transform(df_listings['amenities_list'])
amenities_binary

Given the above array is just an array of zeros and ones, it is difficult for us to know which listing/amenity each row/column corresponds to. To solve this, we will create a `DataFrame` which will contain everything we need.

Please run the code below.

In [None]:
X_data = pd.DataFrame(amenities_binary, columns=mlb.classes_, index=df_listings['id'])
X_data.head()

### Task 03 - Check how many times each amenity appears in the data
Some of the amenities shown above look very niche, so let's check how many times they occur.

Run the code below to quickly see how often each amenity appears in a listing. Why can `sum` be used to do this?
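
As a hint, the sketch below applies `sum` to a tiny made-up table of binary indicators. The name `toy_data` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up binary indicators: rows are listings, columns are amenities
toy_data = pd.DataFrame(
    {'Kitchen': [0, 1, 1], 'TV': [1, 0, 1], 'Wifi': [1, 1, 1]},
    index=[101, 102, 103],
)

# sum() adds up each column; every 1 contributes one to the total
toy_counts = toy_data.sum()
print(toy_counts)
```

Because every cell is either 0 or 1, adding up a column gives the number of listings that offer that amenity.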

In [None]:
X_data.sum()

### Task 04 - Filter to the most common amenities
Given that many amenities appear in only one (or a few) listing(s), we are going to keep only the amenities that appear more often than the mean number of appearances across all amenities. Amenities with a count less than or equal to the mean are filtered out.

Due to the beauty of `pandas` we can do this in a very "Pythonic" way.
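
The same filtering idea can be sketched step by step on a small made-up table. The names `toy`, `toy_counts`, `threshold` and `toy_final` are hypothetical.

```python
import pandas as pd

# Made-up indicators: two common amenities and two niche ones
toy = pd.DataFrame({
    'Wifi':  [1, 1, 1, 1],
    'TV':    [1, 1, 0, 1],
    'Piano': [0, 0, 1, 0],
    'Sauna': [0, 1, 0, 0],
})

toy_counts = toy.sum()         # appearances per amenity: 4, 3, 1, 1
threshold = toy_counts.mean()  # mean of those counts: 9 / 4 = 2.25

# Boolean mask over the columns: keep amenities counted above the mean
toy_final = toy.loc[:, toy_counts > threshold]
print(list(toy_final.columns))
```

Here only `Wifi` and `TV` are kept, because their counts are above the mean of 2.25.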

Please run the code below.

In [None]:
X_final = X_data[X_data.columns[X_data.sum()>X_data.sum().mean()]]
X_final.head()

## Group listings by amenities
Now we have prepared our data (`X_final`) we can perform cluster analysis to group listings by their amenities.

For simplicity, we are going to manually set the number of clusters to *5*.
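
Before running the full analysis, it may help to see what `fit_predict` returns. The sketch below uses four made-up listings and 2 clusters rather than 5. The names `toy_X`, `toy_km` and `toy_labels` are illustrative, and `random_state=0` and `n_init=10` are assumptions added here so the sketch is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up binary indicators: the first two listings share amenities,
# and so do the last two
toy_X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict returns one cluster label per row
toy_labels = toy_km.fit_predict(toy_X)
print(toy_labels)
```

The label numbers themselves are arbitrary; what matters is which listings share the same label.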

### Task 05 - Run the code below

In [None]:
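pd.options.mode.chained_assignment = None # Ignore this
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster on the amenity columns only, so re-running this cell does not feed
# a previously added 'cluster' column back into KMeans
features = X_final.drop(columns='cluster', errors='ignore')
X_final.loc[:, 'cluster'] = KMeans(n_clusters=5).fit_predict(features)

# Count how many listings in each cluster offer each amenity
pivot_cluster = X_final.pivot_table(index=['cluster'], aggfunc='sum')

# One colour per amenity, reused in every subplot
category_colors = plt.get_cmap('hsv')(np.linspace(0, 1, len(pivot_cluster.columns)))

# Draw one horizontal bar chart per cluster, side by side
plt.figure(figsize=(20, 12))
for idx, row in pivot_cluster.iterrows():
    plt.subplot(1, 10, idx + 1)
    plt.barh(row.index, row.values, color=category_colors)
    if idx > 0:
        plt.yticks([])  # Only the first chart shows the amenity names
    plt.xlim(0, 8000)
    plt.title(f'Cluster {idx}')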


### Task 06 - Reflect on results
* Which amenities appear for almost all listings?
* Do these clusters make sense to you?
* Do you think there were limitations in the cluster analysis?
* How do you think filtering to the most common amenities might impact the usefulness of the results?