# Guided Problems - Unsupervised Learning 

### The Dataset

We will be using the commonly referenced Airbnb dataset, which contains a range of attributes for individual Airbnb listings.

In [None]:
# importing different dependencies 

import pandas as pd
import numpy as np
import random

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

### Part 1: EDA

Load the Airbnb data ("airbnb.csv") and explore it to see which variables might be interesting to work with.

#### Questions: 
1. How many total listings are present in the data?
2. What are the different variables in the data?
3. What data types are present within the data?
4. What is the average number of reviews for a listing?
5. What is the average review score for a listing?
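
As a starting point for the questions above, here is a minimal sketch of the pandas calls that are typically used for this kind of summary. It runs on a small made-up DataFrame (`toy`) rather than the real file, and it assumes one row per listing; only the column names 'number_of_reviews' and 'review_scores_rating' come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up). In the exercise you
# would instead load the file, e.g. airbnb = pd.read_csv("airbnb.csv")
toy = pd.DataFrame({
    "name": ["Loft", "Studio", "Cabin", "Condo"],
    "number_of_reviews": [10, 0, 25, 5],
    "review_scores_rating": [95.0, None, 88.0, 99.0],
})

n_listings = len(toy)                           # Q1: assumes one row per listing
columns = list(toy.columns)                     # Q2: variable names
dtypes = toy.dtypes                             # Q3: data type of each column
avg_reviews = toy["number_of_reviews"].mean()   # Q4
avg_score = toy["review_scores_rating"].mean()  # Q5: missing values are skipped

print(n_listings, columns)
print(dtypes)
print(avg_reviews, avg_score)
```

Note that `mean()` ignores missing values by default, so the average review score only reflects listings that actually have a score.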

#### Data Visualizations

The 'number_of_reviews' variable and the 'review_scores_rating' variable are pretty interesting. Let's plot the two variables against each other and check whether any distinct clusters appear.

K Means clustering can help us find clusters in the data: it assigns each observation to the nearest cluster center, using Euclidean distance as the measure of similarity between observations.
Question: Can you run K Means clustering with 2 clusters on the number of reviews and the review scores rating, and plot the results?
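
One possible way to approach this is sketched below. It uses a synthetic DataFrame (`demo`, with made-up values) in place of the real data, and the settings `n_init=10` and `random_state=0` are illustrative choices, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic stand-in for the two Airbnb columns (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "number_of_reviews": np.concatenate(
        [rng.integers(0, 20, 50), rng.integers(150, 200, 50)]
    ),
    "review_scores_rating": rng.integers(80, 101, 100),
})

# KMeans cannot handle missing values, so drop incomplete rows first.
X = demo[["number_of_reviews", "review_scores_rating"]].dropna()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Color each listing by its cluster and mark the two cluster centers.
fig, ax = plt.subplots()
ax.scatter(X["number_of_reviews"], X["review_scores_rating"], c=labels)
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           marker="x", color="red")
ax.set_xlabel("number_of_reviews")
ax.set_ylabel("review_scores_rating")
```

Because the two variables are on different scales, the one with the larger range tends to dominate the Euclidean distance, which is worth keeping in mind when interpreting the plot.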

### Part 2: Airbnb Listing Similarity

We saw a great way to cluster two variables in the Airbnb dataset using K Means clustering. What happens when we try to find Airbnb listings that are "similar" to each other? It is still a clustering problem, but now we are examining more than just 2 variables. We could take into account both categorical and numerical variables, but for simplicity in this example let's focus on the numerical variables.

#### Questions:
1. Which variables in the dataset are numerical variables?
2. Can you cluster the data based on all of the numerical variables for 2 clusters? Show an example of two listings that were determined to be "similar" to each other based on these 2 clusters.
3. What is the optimal number of clusters? (hint: elbow method)
4. After clustering based on the optimal number of clusters, which cluster contains the most listings? Which one contains the fewest listings?
5. Given a new listing, can you show similar listings to it? (code to generate random listing is below)
6. What are some applications of clustering in your client work? E.g., customer segmentation, spam filtering, anomaly detection
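
The sketch below illustrates one way to tackle questions 1, 3, and 4. It runs on synthetic data with three well-separated groups (the column names `feature_a`, `feature_b`, and `label_text` are hypothetical), and it standardizes the columns with `StandardScaler` before clustering. The scaling step is an assumption added here, not something the exercise requires.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with three well-separated groups (values are made up).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])
points = np.vstack([
    c + rng.normal(scale=0.5, size=(n, 2))
    for c, n in zip(centers, [60, 30, 10])
])
demo = pd.DataFrame(points, columns=["feature_a", "feature_b"])
demo["label_text"] = "x"  # a non-numerical column, ignored below

# Q1: keep only the numerical columns.
numerical_vars = demo.select_dtypes(include="number").columns.tolist()

# Standardize so that no single column dominates the Euclidean distance.
X = StandardScaler().fit_transform(demo[numerical_vars])

# Q3: inertia (within-cluster sum of squares) for k = 1..6.
# The "elbow" is the k after which the curve flattens out.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Q4: cluster sizes for the chosen k (3 here, by construction of the toy data).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sizes = pd.Series(labels).value_counts()

print(inertias)
print(sizes)
```

Plotting `inertias` against `k` (for example with `plt.plot`) makes the elbow easier to spot than reading the raw numbers. On real data the elbow is often less sharp than in this toy example.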

In [None]:
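# Random listing generator
# Assumes `airbnb` (the loaded DataFrame) and `numerical_vars` (a list of its
# numerical column names) were defined in the earlier steps.

# randint is inclusive on both ends; start at 1 so the slice below is never empty
l = random.randint(1, len(airbnb))

test_listing = airbnb[l-1:l][numerical_vars].reset_index(drop=True)
for i in numerical_vars:
    if i not in ['latitude', 'longitude']:
        # .loc avoids chained assignment, which may silently fail to update test_listing
        test_listing.loc[0, i] = random.randint(1, round(airbnb[i].max()))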
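
For question 5, one possible approach is to assign the new listing to a cluster with `predict` and then rank the listings in that cluster by distance. The sketch below uses synthetic stand-ins for `airbnb`, `numerical_vars`, and `test_listing` (all values are made up), so it only illustrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-ins for `airbnb` and `numerical_vars` (values are made up).
rng = np.random.default_rng(1)
numerical_vars = ["number_of_reviews", "review_scores_rating"]
airbnb = pd.DataFrame({
    "number_of_reviews": np.concatenate(
        [rng.integers(0, 20, 40), rng.integers(150, 200, 40)]
    ),
    "review_scores_rating": rng.integers(80, 101, 80),
})

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(airbnb[numerical_vars])

# A hypothetical new listing with the same columns as the training data.
test_listing = pd.DataFrame(
    {"number_of_reviews": [175], "review_scores_rating": [90]}
)

# Assign the new listing to its nearest cluster center ...
cluster = kmeans.predict(test_listing[numerical_vars])[0]
same_cluster = airbnb[kmeans.labels_ == cluster]

# ... then rank the listings in that cluster by Euclidean distance to it.
distances = np.linalg.norm(
    same_cluster[numerical_vars].to_numpy()
    - test_listing[numerical_vars].to_numpy(),
    axis=1,
)
similar = same_cluster.assign(distance=distances).sort_values("distance").head(5)
print(similar)
```

If the model was fit on scaled data, the new listing needs the same transformation before `predict` is called.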


## Additional Resources

Sklearn Documentation:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Articles on Clustering:
https://datafloq.com/read/7-innovative-uses-of-clustering-algorithms/6224
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
https://machinelearningmastery.com/clustering-algorithms-with-python/

YouTube Tutorials:
https://www.youtube.com/watch?v=1XqG0kaJVHY
https://www.youtube.com/watch?v=EItlUEPCIzM