Andy Nguyen, Michael Wolfe, Spencer Fogelman, Joseph Caguioa

DS 7331-407

Thursday 6:30 - 8:00 PM

<center><h1>Lab 3: Cluster Analysis of AirBnb listings in major U.S. cities</h1></center>

## Introduction

*Dataset Source: https://www.kaggle.com/rudymizrahi/airbnb-listings-in-major-us-cities-deloitte-ml*

This Kaggle dataset contains data on AirBnb listings for 6 major U.S. cities (CHI, DC, LA, NYC, SF, BOS). The competition's original objective was to use the provided attributes to predict the price of a listing. However, the dataset can be adjusted to predict other variables as well.

In this notebook, we will use the data on these AirBnb listings to practice clustering algorithms to extract new features from the available data. We will use these new features to build new classifiers and assess any improvements in performance compared to the classification models in our previous notebook.

### Setup

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import csv
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import warnings

#from plotnine import *
%matplotlib inline
warnings.filterwarnings('ignore')

In [3]:
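def EncodeCategory(col_to_encode):
    """Pair each distinct value with a 1-based integer code, in order of first appearance."""
    distinct_cats = col_to_encode.unique()
    # enumerate(start=1) replaces the manual counter
    return [[cat, i] for i, cat in enumerate(distinct_cats, start=1)]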
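
The helper above is not called elsewhere in this section, so a small usage sketch may help show what it returns. The snake_case name `encode_category` and the sample room types are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def encode_category(col_to_encode):
    """Hypothetical stand-in with the same behavior as EncodeCategory."""
    return [[cat, i] for i, cat in enumerate(col_to_encode.unique(), start=1)]

# Made-up sample values, for illustration only
room_type = pd.Series(["Entire home/apt", "Private room", "Entire home/apt", "Shared room"])

pairs = encode_category(room_type)   # list of [category, code] pairs
mapping = dict(pairs)                # turn the pairs into a lookup table
codes = room_type.map(mapping)       # integer-encoded column
print(pairs)
print(codes.tolist())
```

Because the codes follow order of first appearance, they carry no ordinal meaning; for nominal variables the one-hot encoding used later in this notebook is usually the safer choice.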

### Data Preparation

In [None]:
# Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/anguyen-07/DS7331-ML_Labs/master/data/airbnb_train.csv')

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df.head()

In [None]:
df['price'] = np.exp(df['log_price'])
df['grade'] = pd.cut(df.review_scores_rating, [0,60,70,80,90,101], right=False, labels = ['F', 'D', 'C', 'B', 'A'])
df[['grade', 'review_scores_rating']].head()
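
To make the bin edges in the `pd.cut` call concrete, here is a minimal sketch on made-up scores. With `right=False` each bin includes its left edge and excludes its right edge, which is why the last edge is 101 rather than 100.

```python
import pandas as pd

# Made-up review scores chosen to sit on or near the bin edges
scores = pd.Series([59, 60, 89.9, 90, 100])

grades = pd.cut(scores, [0, 60, 70, 80, 90, 101], right=False,
                labels=['F', 'D', 'C', 'B', 'A'])
print(grades.tolist())
```

A score of exactly 60 lands in `D` and a score of exactly 90 lands in `A`, while a perfect 100 is still captured by the final `[90, 101)` bin.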

In [None]:
# Clean up datatypes and duplicates
# .copy() avoids chained-assignment problems when modifying the filtered frame
df_ratings = df.dropna(subset=['review_scores_rating']).drop_duplicates().copy()
floats = ['log_price','bathrooms','latitude','longitude','review_scores_rating', 'price']
df_ratings[floats] = df_ratings[floats].astype(np.float64)
# Integer columns are cast after imputation, since NaN cannot be stored as int64
ints = ['id','accommodates','number_of_reviews','bedrooms','beds']
df_ratings["host_response_rate"] = df_ratings["host_response_rate"].str.rstrip('%').astype(np.float64)/100
date_time = ['first_review','host_since','last_review']
df_ratings[date_time] = df_ratings[date_time].apply(pd.to_datetime)
booleans = ['host_has_profile_pic','host_identity_verified','instant_bookable']
# np.bool was removed in NumPy 1.24; the built-in bool is equivalent
df_ratings[booleans] = df_ratings[booleans].replace({'t':True,'f':False}).astype(bool)
categorical = ['property_type','room_type','bed_type','cancellation_policy','city','neighbourhood','zipcode']
df_ratings[categorical] = df_ratings[categorical].astype('category')
# Fall back to the first review date when host_since is missing
df_ratings['host_since'] = df_ratings['host_since'].fillna(df_ratings['first_review'])

In [None]:
# Impute missing values
df_imputed = df_ratings
# Fill room counts with the median of listings sharing the same property type and capacity
for col in ["bathrooms", "bedrooms", "beds"]:
    group_median = df_imputed.groupby(["property_type","accommodates"], observed=True)[col].transform("median")
    df_imputed[col] = df_imputed[col].fillna(group_median)
# Fill host response rate with the mean of listings that have the same number of reviews
group_mean = df_imputed.groupby("number_of_reviews")["host_response_rate"].transform("mean")
df_imputed["host_response_rate"] = df_imputed["host_response_rate"].fillna(group_mean)
# Impute a 100% host response rate for the remaining missing value (row 48194, a private room in an apartment)
df_imputed["host_response_rate"] = df_imputed["host_response_rate"].fillna(1.0)
df_imputed[ints] = df_imputed[ints].astype(np.int64)
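
The group-wise imputation above can be illustrated on a toy frame. This is a minimal sketch with made-up values; it uses `groupby(...).transform("median")`, which returns a result aligned to the original index, as one way to express the same idea.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two groups defined by property_type and accommodates
toy = pd.DataFrame({
    "property_type": ["Apartment", "Apartment", "Apartment", "House", "House"],
    "accommodates":  [2, 2, 2, 4, 4],
    "bathrooms":     [1.0, np.nan, 2.0, np.nan, 3.0],
})

# Median of each group, broadcast back to every row of that group
group_median = toy.groupby(["property_type", "accommodates"])["bathrooms"].transform("median")

# Only the missing entries are replaced; observed values are left untouched
toy["bathrooms"] = toy["bathrooms"].fillna(group_median)
print(toy)
```

The missing apartment value becomes 1.5 (the median of 1.0 and 2.0) and the missing house value becomes 3.0. A group whose values are all missing would stay missing, so a global fallback may still be needed.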

In [None]:
import re
#Create a new cleaned amenities column where all amenities are in list form
df_imputed['amenities_new'] = df_imputed.apply(lambda row: re.sub(r'[{}""]', '', row['amenities']), axis=1)
df_imputed['amenities_new'] = df_imputed.apply(lambda row: row['amenities_new'].lower().split(','), axis=1)
df_imputed = df_imputed.reset_index()
df_imputed['length_amenities'] = df_imputed.apply(lambda row: len(row['amenities_new']), axis=1)

# Create separate columns based on amenities
df_imputed['internet'] = df_imputed.apply(lambda row: 'internet' in row.amenities.lower(), axis=1)
df_imputed['TV'] = df_imputed.apply(lambda row: 'tv' in row.amenities.lower(), axis=1)
df_imputed['air_conditioning'] = df_imputed.apply(lambda row: 'air conditioning' in row.amenities.lower(), axis=1)
df_imputed['kitchen'] = df_imputed.apply(lambda row: 'kitchen' in row.amenities.lower(), axis=1)
df_imputed['pool'] = df_imputed.apply(lambda row: 'pool' in row.amenities.lower(), axis=1)
df_imputed['parking'] = df_imputed.apply(lambda row: 'parking' in row.amenities.lower(), axis=1)

# Get information from description based on length in characters
df_imputed['description_length'] = df_imputed['description'].apply(len)

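# Create the target variable superuser (review score of 96 or higher)
# Build the mask from df_imputed itself: its index was reset above, so a mask taken from df would be misaligned
df_imputed['superuser'] = df_imputed['review_scores_rating'] >= 96

# Create alternative target variable grade_grouped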
new_grades = {
    'A':'A',
    'B':'<A',
    'C':'<A',
    'D':'<A',
    'F':'<A'
}
df_imputed['grade_grouped'] = df_imputed['grade'].map(new_grades)
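
The amenities handling in the cell above can be traced on a single value. This sketch assumes the raw `amenities` field looks like a brace-wrapped, comma-separated string; the sample string is made up for illustration.

```python
import re

# Made-up example of a raw amenities string
raw = '{TV,"Wireless Internet","Air conditioning",Kitchen}'

# Strip braces and quotes, then lowercase and split into a list
cleaned = re.sub(r'[{}"]', '', raw)
amenities = cleaned.lower().split(',')

# Flags use substring checks against the lowercased raw string
has_internet = 'internet' in raw.lower()
has_pool = 'pool' in raw.lower()

print(amenities, len(amenities), has_internet, has_pool)
```

Note that the substring checks match any amenity containing the keyword (for example, "wireless internet" counts as `internet`), which is convenient but can over-match short keywords such as `tv`.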

In [None]:
import datetime
date_published = datetime.datetime(2018,3,14)
df_imputed['host_since'] = pd.to_datetime(df_imputed['host_since'])
# Vectorized day difference between the publication date and each host's start date
df_imputed['host_since_days'] = (date_published - df_imputed['host_since']).dt.days
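
As a quick check of the host-tenure calculation, here is a minimal sketch on made-up dates. It uses `pd.Timestamp` for the reference date and a vectorized subtraction instead of a row-wise `apply`; both choices are assumptions made for illustration.

```python
import pandas as pd

date_published = pd.Timestamp(2018, 3, 14)

# Made-up host start dates, including one missing value
host_since = pd.Series(pd.to_datetime(["2018-03-04", "2017-03-14", None]))

# Subtracting a datetime Series yields timedeltas; .dt.days extracts whole days
host_since_days = (date_published - host_since).dt.days
print(host_since_days.tolist())
```

A missing start date propagates as a missing tenure rather than raising an error, so any remaining gaps should be handled before modeling.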

We removed columns that are not meaningful in a modeling context (e.g., index, id), that are redundant with newly engineered features (e.g., amenities, description), or that are likely not useful in predicting the target variable grade_grouped.

In [None]:
# Delete columns that are not useful for modeling ('zipcode' was previously listed twice)
df_logistic = df_imputed.drop(['index','id','amenities','description','zipcode','description_length',
                               'latitude','longitude','name','neighbourhood','review_scores_rating','thumbnail_url'],
                              axis=1)

# delete other redundant variables
del df_logistic['amenities_new']
del df_logistic['host_since']
# del df_logistic['grade']
del df_logistic['first_review']
del df_logistic['last_review']

# Remove factor with an unnecessarily large number of levels
df_logistic.drop(['property_type'], axis=1, inplace=True)

One-Hot Encode Categorical Variables with multiple levels.

In [None]:
categorical_vars = ['room_type','bed_type','cancellation_policy','city']

# Create Dummy Variables
multiCategorical_vars = ['city','cancellation_policy','bed_type','room_type']

dummy_df = pd.get_dummies(df_logistic[multiCategorical_vars], drop_first = True)
df_final = pd.concat([df_logistic, dummy_df], axis=1)
del df_final['grade']

# Delete non-dummy categorical variables
for x in categorical_vars:
    if x in df_final:
        del df_final[x]
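
The one-hot encoding step can be seen on a toy frame. This is a minimal sketch with made-up rows; it shows how `drop_first=True` drops one level per variable to serve as the reference category.

```python
import pandas as pd

# Toy data (made up): one categorical column and one numeric column
toy = pd.DataFrame({"city": ["NYC", "LA", "SF", "LA"], "beds": [1, 2, 1, 3]})

# Levels are sorted alphabetically, so 'LA' is dropped as the reference level
dummies = pd.get_dummies(toy[["city"]], drop_first=True)

# Replace the original categorical column with its dummy columns
encoded = pd.concat([toy.drop(columns=["city"]), dummies], axis=1)
print(encoded)
```

A row with zeros in both `city_NYC` and `city_SF` therefore represents the reference city, which avoids perfectly collinear dummy columns in linear models.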

## Business Understanding

Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?

In [None]:
df_final.describe()

**INCOMPLETE**
* *This answer was taken from Lab 1 unedited to provide context. It needs to be edited to answer the specific questions asked.*

<font color = "red">

This dataset is provided by Kaggle and aims to predict the price of AirBnb listings in major cities in the United States. Each record is one AirBnb listing from 2018 with 29 attributes that describe the listing and what it offers to guests. A linear regression model would be a strong choice for predicting the continuous value of the listing price. This model could be extremely beneficial for new AirBnb hosts to guide them in setting an initial price for their listings based on location, amenities, and other features that they offer with their listing. Furthermore, the price prediction model could help ensure that AirBnb guests are not overcharged for their stay and that hosts are adequately compensated for their hospitality. The performance of this model would be measured by the root mean square error (RMSE) to determine how far predictions deviate from actual values. Using the RMSE metric is most appropriate within this context because it heavily penalizes large errors and we want to avoid significant prediction errors in listing prices.


In addition, the rating score from the listing reviews is a potential variable of interest that can indicate whether a listing provided guests with a positive experience during their stay. AirBnb currently employs a "Superhost" program that rewards the hosts with the most experience and highest-rated reviews. A model aimed at classifying review rating scores could provide valuable insight into how AirBnb hosts could improve their hospitality standards and provide a better experience for their guests. It could also help new hosts better market their listings and attain Superhost status. We would use misclassification rate as our performance evaluation metric, with the goal of maximizing the classification accuracy for this type of model. A model with a minimal misclassification rate would adequately capture the attributes influencing highly rated reviews and further our understanding of what consumers look for when searching for an AirBnb listing to book.

* **Recall vs. Precision?** I think we should prioritize recall for the context of this classification model.

</font>
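
The metrics named in the draft answer above (RMSE for price prediction, precision and recall for the rating classifier) can be written out directly. This is a minimal NumPy sketch on made-up values; the function names `rmse` and `precision_recall` are hypothetical helpers, not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: squaring penalizes large errors more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def precision_recall(y_true, y_pred):
    """Precision and recall for a binary target where True is the positive class."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # correctly flagged positives
    fp = np.sum(~y_true & y_pred)   # negatives flagged as positive
    fn = np.sum(y_true & ~y_pred)   # positives that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Made-up prices and labels, for illustration only
price_error = rmse([100, 150, 200], [110, 150, 170])
precision, recall = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 0])
print(round(price_error, 2), round(precision, 2), round(recall, 2))
```

Prioritizing recall, as suggested above, means accepting more false positives in exchange for missing fewer truly high-rated listings.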

## Modeling and Evaluation

Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results.

**Cluster Analysis**
+ Perform cluster analysis using several clustering methods
+ How did you determine a suitable number of clusters for each method?
+ Use internal and/or external validation measures to describe and compare the clusterings and the clusters (some visual methods would be good.)
+ Describe your results. What findings are the most interesting and why?
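
One common way to address the number-of-clusters question is to compare an internal validation measure across candidate values of k. The sketch below assumes scikit-learn is available and uses synthetic, well-separated blobs rather than the AirBnb features; it illustrates the approach and is not the analysis performed in this lab.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated 2-D blobs (assumption for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# K-means is distance based, so features are standardized first
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means for several k and record the mean silhouette coefficient
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette coefficient ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. On real listing data the peak is usually far less pronounced, so it is worth pairing this with a visual check such as an elbow plot.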

## Deployment

Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?
+ How useful is your model for interested parties (i.e., the companies or organizations that might want to use it?)
+ How would you deploy your model for interested parties?
+ What other data should be collected?
+ How often would the model need to be updated, etc.?

## Exceptional Work

You have free rein to provide additional analyses or combine analyses.