# Lab 1
## AirBnB Listings for Los Angeles Area
### Jason McDonald, Miguel Bonilla, Zachary Bunn

**Table of Contents**

[Business Understanding](#Business-Understanding)

[Data Meaning Type](#Data-Meaning-Type)

[Data Quality](#Data-Quality)

[Simple Statistics](#Simple-Statistics)

[Visualize Attributes](#Visualize-Attributes)

[Explore Joint Attributes](#Explore-Joint-Attributes)

[Explore Attributes and Class](#Explore-Attributes-and-Class)

[New Features](#New-Features)

[Exceptional Work](#Exceptional_Work)


### Business Understanding

The dataset we chose comes from insideairbnb.com, which compiles data from Airbnb to highlight the company's impact on local housing markets in an attempt to expose wastefulness. Its authors reject the claim that Airbnb is solely "disrupting" the hotel industry and argue instead that it negatively affects housing costs. The data is currently limited to several major cities across the globe, including Los Angeles, where our focus lies.

Rather than focusing on the wastefulness of Airbnb, we will focus our study on business opportunity modeling for Airbnb rentals in the Los Angeles area.

Using Los Angeles as our city of study lets us focus on one major housing market. Our goal is to build a model that predicts the daily rental listing price, using adjusted r^2 as the performance metric to measure its effectiveness. Measuring the housing space wasted by Airbnb, while interesting, would be an arduous task and difficult to prove within the scope of our class. What is useful about this dataset is that it is versatile enough to serve many uses beyond the authors' intended purposes.

**Price**

To predict listing price from the attributes we've identified as valuable, we intend to use 10-fold cross-validation. This entails first splitting the data into training and test data sets.  The training data set is then split into 10 folds, or smaller subsets of the training data.  A model is trained on ***k*** - 1 of the folds (9 in our 10-fold case), and the resulting model is validated on the remaining fold by calculating our performance metric, adjusted r^2 for the pricing regression.

This process repeats until each fold has served as the validation set once, and the average of the performance measure across the runs is used as the final performance metric of our model.

Once our model has been tuned using 10-fold CV, we can use it to predict the price for the held-back test data set and calculate adjusted r^2 between the actual and predicted prices.
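The procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (not the Airbnb listings), the model is a plain least-squares fit, and the helper names `adjusted_r2` and `kfold_adjusted_r2` are hypothetical. Our actual model and tooling may differ.

```python
import numpy as np

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

def kfold_adjusted_r2(X, y, k=10, seed=0):
    """Average adjusted R^2 of a least-squares fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on k - 1 folds (with an intercept column), score on the held-out fold.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        scores.append(adjusted_r2(y[val_idx], A_val @ coef, X.shape[1]))
    return float(np.mean(scores))

# Synthetic stand-in for listing features and price (assumption, not real data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = 100 + X_demo @ np.array([30.0, 15.0, -5.0]) + rng.normal(scale=5.0, size=500)

cv_score = kfold_adjusted_r2(X_demo, y_demo, k=10)
print(round(cv_score, 3))
```

Because adjusted r^2 penalizes the number of predictors, it is computed here with the size of each validation fold as n and the number of features as p.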

**Superhost**

From the property host's perspective, belonging to AirBNB's Superhost program provides a confidence boost to tenants, who know that the host of the property they've booked is responsive to questions, has high user reviews, and typically cares about their properties.  With a required rating of 4.8+ over the prior year and a cancellation rate of less than 1%, AirBNB elevates these hosts as the best of the best.

From a tenant standpoint, booking with a Superhost implies that your host not only cares about the property and your enjoyment, but takes pride in making sure that your stay is excellent and worthy of a high user rating.

For active and potential hosts, our goal is to develop a model which can predict whether a property is likely to be hosted by a Superhost.  This could help hosts refine the listings and their own behavior to elevate their status as a Superhost.

We will again use 10-fold cross-validation to train a model on successive subsets of the training split, with a single fold held back to score each iteration, and then average performance across the runs.

When our model is complete, we will measure its effectiveness by running it against the originally held-back test data set and calculating an F1 score.  The F1 score takes into account both precision and recall.  Recall is important because we want to minimize false negatives (predicting a negative response when the opposite is true).  Too many false negatives could cause hosts to see Superhost status as unattainable or expensive, driving them away.  Precision is also important because we don't want a high rate of false positives (predicting a positive response when the true value is negative), which may prolong a host's time until they become a Superhost by misleading them into thinking they are on the right path.  "I have great reviews and am well on my way to Superhost!  I don't need to improve!  Tenants love my properties!"  This could give them a false sense of their future status.

The F1 score is a good overall performance metric for data sets such as ours because it gives equal weight and importance to both precision and recall.  It will be the best metric to measure how effective our model is for hosts.
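To make the metric concrete, here is a minimal sketch of how precision, recall, and F1 relate. The labels are made up for illustration, using the dataset's `t`/`f` encoding for `host_is_superhost`, and the helper `precision_recall_f1` is a hypothetical name.

```python
def precision_recall_f1(y_true, y_pred, positive="t"):
    """Return (precision, recall, F1) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
actual    = ["t", "t", "t", "f", "f", "f", "f", "t"]
predicted = ["t", "f", "t", "t", "f", "f", "f", "t"]

precision, recall, f1 = precision_recall_f1(actual, predicted)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Since F1 is a harmonic mean, a model cannot score well by maximizing only one of precision or recall.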

In [None]:

## load pandas, numpy, plotly.express, plotly.graph_objects, and import dataset from GitHub

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

df = pd.read_csv("https://media.githubusercontent.com/media/boneeyah/DS7331_Group/main/Data_Files/airbnb_los_angeles.csv")
#df = pd.read_csv("Data_Files/airbnb_los_angeles.csv")


### Data Meaning Type

Below, we begin to explore the dataset focusing on 25 attributes that our work will be based on.


In [2]:
# verify data loaded properly and see what we have in the dataset
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,45392,https://www.airbnb.com/rooms/45392,20220606014052,2022-06-06,Cute Home in Mount Washington,<b>The space</b><br />Cute house in Mount Wash...,,https://a0.muscache.com/pictures/miso/Hosting-...,201514,https://www.airbnb.com/users/show/201514,...,4.96,4.58,4.85,,f,1,0,1,0,0.2
1,45417,https://www.airbnb.com/rooms/45417,20220606014052,2022-06-06,Silver Lake Cottage Oasis- Private Terrace- Th...,"Ideal for an extended stay, our Cottage perfec...",Walk just a couple of blocks to reach all the ...,https://a0.muscache.com/pictures/monet/Select-...,50231247,https://www.airbnb.com/users/show/50231247,...,4.98,4.91,4.82,,f,4,4,0,0,1.09
2,5728,https://www.airbnb.com/rooms/5728,20220606014052,2022-06-06,Tiny Home in Artistic Oasis near Venice and LAX,Our home is located near Venice Beach without ...,Our neighborhood is very quiet and save. There...,https://a0.muscache.com/pictures/7a29d275-f293...,9171,https://www.airbnb.com/users/show/9171,...,4.92,4.8,4.7,HSR19-002149,f,3,0,3,0,2.0
3,5729,https://www.airbnb.com/rooms/5729,20220606014052,2022-06-06,Zen Room with Floating Bed near Venice and LAX,Our home is located near Venice Beach without ...,Our neighborhood is very quiet and save. There...,https://a0.muscache.com/pictures/f48e3ea8-2075...,9171,https://www.airbnb.com/users/show/9171,...,4.89,4.77,4.71,HSR19-002149,f,3,0,3,0,1.52
4,109,https://www.airbnb.com/rooms/109,20220606014052,2022-06-06,Amazing bright elegant condo park front *UPGRA...,"*** Unit upgraded with new bamboo flooring, br...",,https://a0.muscache.com/pictures/4321499/1da98...,521,https://www.airbnb.com/users/show/521,...,4.0,5.0,4.0,,f,1,1,0,0,0.02


In [3]:
## removing attributes which have no value to us
### keeping longitude and latitude in case we want to build a heat map of Airbnb hotspots for the exceptional work section; if not, we can remove them
### neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed are variations of the same information; the cleansed version seems better to use (our guess is that it has imputed values)
### all listings have NaN for bathrooms, so removing it
for col in [
    'listing_url','scrape_id','last_scraped','description','neighborhood_overview','picture_url','host_url','host_about','host_response_time','host_response_rate','host_acceptance_rate',
    'host_thumbnail_url','host_picture_url','host_verifications','host_has_profile_pic','bathroom_text','host_listings_count','host_neighbourhood','bathrooms','minimum_minimum_nights',
    'maximum_minimum_nights','minimum_maximum_nights','maximum_maximum_nights','minimum_nights_avg_ntm','maximum_nights_avg_ntm','calendar_updated','availability_30','availability_60',
    'availability_90','availability_365','calendar_last_scraped','number_of_reviews_ltm','number_of_reviews_l30d','review_scores_accuracy','review_scores_communication','review_scores_cleanliness',
    'review_scores_checkin','review_scores_value','review_scores_location','calculated_host_listings_count_entire_homes','calculated_host_listings_count_private_rooms',
    'calculated_host_listings_count_shared_rooms','reviews_per_month','neighbourhood','neighbourhood_group_cleansed', 'first_review','last_review','minimum_nights','maximum_nights','license'
]:
    if col in df:
        del df[col]

We removed a number of attributes which didn't appear to be of value for our purposes.  We did choose to keep latitude and longitude so that we can explore any location based information through heatmaps or other mapping tools.

We also found that neighbourhood_cleansed contained similar data to two other neighbourhood columns but appeared to be of better quality, so we chose to keep it and drop the other two.

| Field | Type | Description | Scale |
| :--- | :---: | :---: | :--- |
| id | integer | Airbnb's unique identifier for the listing | 42041 Unique Identifiers|
| name | text | Name of the listing | NA|
| host_id | integer | Airbnb's unique identifier for the host/user | Less than id since a host can have multiple listings|
| host_name | text | Name of the host. Usually just the first name(s).| NA|
| host_since | date | The date the host/user was created. For hosts that are Airbnb guests this could be the date they registered as a guest.| NA|
| host_location | text | The host's self reported location| NA|
| host_is_superhost | boolean | If host has Superhost classification. These are highly rated hosts.| NA|
| host_total_listings_count | numeric | The number of listings the host has (per Airbnb calculations)| 0 to 3322|
| host_identity_verified | boolean | The Host has verified their identity| NA| 
| neighbourhood_cleansed | text | The neighbourhood as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.| NA|
| latitude | numeric | Uses the World Geodetic System (WGS84) projection for latitude and longitude. | Any latitudinal value|
| longitude | numeric | Uses the World Geodetic System (WGS84) projection for latitude and longitude. | Any longitudinal value|
| property_type | text | Self selected property type. Hotels and Bed and Breakfasts are described as such by their hosts in this field | NA|
| room_type | text | [Entire home/apt, Private room, Shared room, Hotel] All homes are grouped into the following room types: Entire place - private access to the entire home; Private rooms - your own room in a shared house; Shared rooms - shared bedrooms and common rooms with other guests. | 4 types of rooms|
| accommodates | integer | The maximum capacity of the listing | 0 to 16 guests|
| bathrooms_text | string | The number of bathrooms in the listing. On the Airbnb web-site, the bathrooms field has evolved from a number to a textual description. For older scrapes, bathrooms is used. | anywhere from 0 - 8 counting half-baths|
|bedrooms | integer | The number of bedrooms | 0 to 24 bedrooms|
| beds | integer | The number of bed(s) | 0 - 34 beds|
| amenities | json | Array of added features that the host wanted to include as a listed benefit| Differs from listing to listing|
| price | currency | Daily price in local currency | Mostly 0 to 25,000 dollars, with an outlier at 100k|
| has_availability | boolean [t=true; f=false] | if the listing is available or not | NA|
| calendar_last_scraped | date | Last calendar date the data was scraped | NA|
| number_of_reviews | integer | The number of reviews the listing has | 0 to 1512|
| review_scores_rating | float | Score of property | 0.0 to 5.0 averaged scale|
| instant_bookable | boolean[t=true; f=false] | Whether the guest can automatically book the listing without the host requiring to accept their booking request. An indicator of a commercial listing. | NA|
| calculated_host_listings_count | integer | The number of listings the host has in the current scrape, in the city/region geography. | 0 to 532|

### Data Quality



**Missing values**

We will begin looking into the quality of the data by checking for any missing values for any of the attributes.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42041 entries, 0 to 42040
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              42041 non-null  int64  
 1   name                            42039 non-null  object 
 2   host_id                         42041 non-null  int64  
 3   host_name                       41858 non-null  object 
 4   host_since                      41858 non-null  object 
 5   host_location                   41780 non-null  object 
 6   host_is_superhost               41858 non-null  object 
 7   host_total_listings_count       41858 non-null  float64
 8   host_identity_verified          41858 non-null  object 
 9   neighbourhood_cleansed          42041 non-null  object 
 10  latitude                        42041 non-null  float64
 11  longitude                       42041 non-null  float64
 12  property_type                   

We can see from the output that name, host_name, host_since, host_location, host_is_superhost, host_total_listings_count, host_identity_verified, bathrooms_text, bedrooms, beds and review_scores_rating all have missing values. Review scores have the largest number of missing values. *We decided to drop the listings with missing review scores, as many factors affect a review score and we felt that imputation would be limited in that regard.*

We will focus the scope of our project on predictions for listings which have been reviewed by guests. Additionally, we determined that imputing bathroom descriptions, host_location, and host_since *could lead to misleading values*, so we made the decision to drop listings missing these values from the dataset as well.

It is difficult to attribute the missing values to mistakes alone.  Surely some are due to mistaken entry, differences in style of entry, and changes in how data is collected. However, missing values across such a wide swath of attributes lead us to wonder whether additional data began being collected at a later time, or whether some hosts simply didn't provide the requested data.  Because our data source is a group which is not part of AirBNB but is trying to capture harm done by AirBNB, we must always consider that the data is simply as complete as they could get it from public sources.

In [5]:
df = df[~df.review_scores_rating.isnull() & ~df.bathrooms_text.isnull() & ~df.host_since.isnull() & ~df.host_location.isnull()]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32329 entries, 0 to 42000
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              32329 non-null  int64  
 1   name                            32329 non-null  object 
 2   host_id                         32329 non-null  int64  
 3   host_name                       32329 non-null  object 
 4   host_since                      32329 non-null  object 
 5   host_location                   32329 non-null  object 
 6   host_is_superhost               32329 non-null  object 
 7   host_total_listings_count       32329 non-null  float64
 8   host_identity_verified          32329 non-null  object 
 9   neighbourhood_cleansed          32329 non-null  object 
 10  latitude                        32329 non-null  float64
 11  longitude                       32329 non-null  float64
 12  property_type                   

In [6]:
## print the room type and property type attributes for comparisons
print(df.room_type.unique())
df.property_type.unique()

['Private room' 'Entire home/apt' 'Hotel room' 'Shared room']


array(['Private room in home', 'Entire bungalow',
       'Private room in tiny home', 'Private room in guesthouse',
       'Entire condo', 'Room in hotel', 'Private room in rental unit',
       'Entire guesthouse', 'Entire home', 'Entire guest suite',
       'Private room in loft', 'Entire rental unit', 'Entire townhouse',
       'Entire cottage', 'Entire villa', 'Private room',
       'Room in boutique hotel', 'Private room in townhouse',
       'Private room in hostel', 'Camper/RV',
       'Private room in bed and breakfast', 'Farm stay', 'Entire cabin',
       'Entire loft', 'Private room in farm stay',
       'Private room in condo', 'Private room in bungalow',
       'Private room in guest suite', 'Shared room in hostel',
       'Shared room in home', 'Shared room in villa', 'Entire place',
       'Private room in treehouse', 'Treehouse', 'Yurt',
       'Private room in villa', 'Private room in cottage',
       'Private room in castle', 'Shared room in rental unit',
       'Tiny h

The room_type variable has broader categories and no category for "other". Given that some listings have types that would not strictly fit into one of the room type categories, we made the decision to derive new categories from the property_type strings.

In [7]:
## getting property type from string
types = ['Private room', 'Entire', 'Room in hotel','Room','Shared room']
pat = '|'.join(r"\b{}\b".format(x) for x in types)

df['property_type']= df['property_type'].str.extract('('+ pat + ')', expand = False)
df['property_type'] = (df.property_type.
                       fillna(value = 'other').
                       replace(['Entire','Room in hotel'],['Entire unit','Hotel room']))
df['property_type'].value_counts()

Entire unit     22767
Private room     8081
Shared room       510
Room              401
other             298
Hotel room        272
Name: property_type, dtype: int64

After creating new categories from the property type descriptions, we can see that out of the 6 categories formed the vast majority are for entire units, with shared rooms, rooms, hotel rooms and others accounting for a very small number of listings.
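The order of the alternatives in the pattern matters: for example, `Room in hotel` must be tried before the bare `Room`. A minimal sketch using the standard `re` module (instead of pandas `str.extract`) on a handful of sample strings illustrates the mapping; the helper `categorize` and the `labels` dictionary are hypothetical names used only here.

```python
import re

types = ['Private room', 'Entire', 'Room in hotel', 'Room', 'Shared room']
pattern = re.compile('|'.join(r"\b{}\b".format(t) for t in types))
labels = {'Entire': 'Entire unit', 'Room in hotel': 'Hotel room'}

def categorize(property_type):
    """Map a raw property_type string to one of the six categories."""
    match = pattern.search(property_type)
    if match is None:
        return 'other'  # no keyword found, e.g. 'Camper/RV'
    return labels.get(match.group(0), match.group(0))

samples = ['Entire bungalow', 'Room in hotel', 'Room in boutique hotel',
           'Shared room in hostel', 'Private room in loft', 'Camper/RV']
result = {s: categorize(s) for s in samples}
for raw, category in result.items():
    print(f'{raw!r:28} -> {category}')
```

Note that the match is case-sensitive, so the lowercase "room" in "Shared room" is not captured by the `Room` alternative.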

We still have missing values for bedrooms and beds attributes. First we will impute beds based on the number of guests that can be accommodated, using median values, and we will then use beds to impute the number of bedrooms by type of property listed.

In [8]:
## impute missing beds with the median number of beds among listings that accommodate the same number of guests
df['beds'] = df.groupby(by = 'accommodates')['beds'].transform(lambda grp: grp.fillna(grp.median()))
df['beds'].info()

<class 'pandas.core.series.Series'>
Int64Index: 32329 entries, 0 to 42000
Series name: beds
Non-Null Count  Dtype  
--------------  -----  
32329 non-null  float64
dtypes: float64(1)
memory usage: 505.1 KB


Quick check shows that there are no more missing values for beds after imputing based on the median number of beds by number of guests accommodated. We will now proceed to impute number of bedrooms using the number of beds by type of property.
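As a small illustration of the group-wise median imputation, the sketch below applies the same idea to a toy table. The values are made up and the column `beds_filled` is a hypothetical name used only for this example.

```python
import numpy as np
import pandas as pd

# Toy data: two groups of listings by number of guests accommodated.
toy = pd.DataFrame({
    'accommodates': [2, 2, 2, 4, 4, 4],
    'beds':         [1.0, np.nan, 1.0, 2.0, 3.0, np.nan],
})

# Each missing value is replaced by the median of its own group.
toy['beds_filled'] = (
    toy.groupby('accommodates')['beds']
       .transform(lambda s: s.fillna(s.median()))
)
print(toy)
```

The missing value in the 2-guest group becomes 1.0 and the one in the 4-guest group becomes 2.5, the median of 2.0 and 3.0.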

In [9]:
df_grouped = df.groupby(by = ['property_type','beds'])
df_grouped[['bedrooms']].describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,bedrooms,bedrooms,bedrooms,bedrooms,bedrooms,bedrooms,bedrooms,bedrooms
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max
property_type,beds,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Entire unit,1.0,5120.0,1.019336,0.164835,1.0,1.00,1.0,1.0,5.0
Entire unit,2.0,5614.0,1.600641,0.517748,1.0,1.00,2.0,2.0,4.0
Entire unit,3.0,3839.0,2.375879,0.665655,1.0,2.00,2.0,3.0,6.0
Entire unit,4.0,2593.0,2.975318,0.857164,1.0,2.00,3.0,4.0,6.0
Entire unit,5.0,1313.0,3.542270,0.974485,1.0,3.00,4.0,4.0,6.0
...,...,...,...,...,...,...,...,...,...
other,4.0,8.0,2.500000,1.309307,1.0,1.75,2.0,4.0,4.0
other,5.0,8.0,1.625000,1.060660,1.0,1.00,1.0,2.0,4.0
other,6.0,4.0,2.500000,2.380476,1.0,1.00,1.5,3.0,6.0
other,7.0,3.0,3.333333,3.214550,1.0,1.50,2.0,4.5,7.0


This table shows the descriptive statistics for the number of bedrooms, grouped by property type and number of beds. The median (represented here by the 50% column) will be used to impute missing bedroom values based on these groupings.

In [10]:
df_imputed = df_grouped[['beds','bedrooms']].transform(lambda grp: grp.fillna(grp.median()))
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32329 entries, 0 to 42000
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   beds      32329 non-null  float64
 1   bedrooms  32327 non-null  float64
dtypes: float64(2)
memory usage: 757.7 KB


A quick check of the attributes shows that there are still 2 missing values for number of bedrooms. This is probably due to a specific grouping having no recorded value for the number of bedrooms, so no median could be calculated for the imputation.
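A toy example (made-up values) shows why this can happen: when every value in a group is missing, the group's median is itself NaN, so filling with it changes nothing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'property_type': ['Private room', 'Private room', 'Entire unit', 'Entire unit'],
    'beds':          [12.0, 12.0, 2.0, 2.0],
    'bedrooms':      [np.nan, np.nan, 1.0, np.nan],
})

filled = (
    toy.groupby(['property_type', 'beds'])['bedrooms']
       .transform(lambda s: s.fillna(s.median()))
)
# The ('Private room', 12.0) group has no observed bedrooms, so it stays NaN;
# the ('Entire unit', 2.0) group is filled with its median of 1.0.
print(filled)
```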

In [11]:
## check for value that is still null
index = df_imputed[df_imputed.bedrooms.isnull()].index
df[['id','name','property_type','beds','bedrooms']][df.index.isin(index)]

Unnamed: 0,id,name,property_type,beds,bedrooms
20065,45257193,Safari Tent MASH Whites Landing,Private room,12.0,
20077,45257350,Safari Tent MASH Whites Landing,Private room,12.0,


Checking the 2 missing values for number of bedrooms shows that both listings are for a safari tent. We will therefore drop these listings from our dataset, as they are so unique that they are not well represented in the data set.

In [12]:
df = df.drop(index= index)

In [13]:
## imputing missing bedroom number with the median
df['imputed']=df_imputed[['bedrooms']]
## verifying that missing values were imputed with the median
df[['beds','bedrooms','imputed']][df.bedrooms.isnull()]

Unnamed: 0,beds,bedrooms,imputed
17,3.0,,2.0
21,1.0,,1.0
28,1.0,,1.0
33,2.0,,2.0
50,1.0,,1.0
...,...,...,...
40952,2.0,,2.0
40972,1.0,,1.0
41221,1.0,,1.0
41237,1.0,,1.0


In [14]:
# replace 'bedrooms' column with imputed column and deleting the duplicated column
df['bedrooms'] = df['imputed']
del df['imputed']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32327 entries, 0 to 42000
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              32327 non-null  int64  
 1   name                            32327 non-null  object 
 2   host_id                         32327 non-null  int64  
 3   host_name                       32327 non-null  object 
 4   host_since                      32327 non-null  object 
 5   host_location                   32327 non-null  object 
 6   host_is_superhost               32327 non-null  object 
 7   host_total_listings_count       32327 non-null  float64
 8   host_identity_verified          32327 non-null  object 
 9   neighbourhood_cleansed          32327 non-null  object 
 10  latitude                        32327 non-null  float64
 11  longitude                       32327 non-null  float64
 12  property_type                   

In [18]:
df

Unnamed: 0,id,name,host_id,host_name,host_since,host_location,host_is_superhost,host_total_listings_count,host_identity_verified,neighbourhood_cleansed,...,bathrooms_text,bedrooms,beds,amenities,price,has_availability,number_of_reviews,review_scores_rating,instant_bookable,calculated_host_listings_count
0,45392,Cute Home in Mount Washington,201514,Olivia & Alexey,2010-08-14,"Los Angeles, California, United States",f,1.0,t,Mount Washington,...,1 private bath,1.0,2.0,"[""Hangers"", ""Free parking on premises"", ""Essen...",$60.00,t,27,4.88,f,1
1,45417,Silver Lake Cottage Oasis- Private Terrace- Th...,50231247,Tim,2015-11-30,"Camarillo, California, United States",t,8.0,t,Silver Lake,...,1 bath,1.0,2.0,"[""Bathtub"", ""TV"", ""Iron"", ""Pack \u2019n play/T...",$135.00,t,154,4.91,f,4
2,5728,Tiny Home in Artistic Oasis near Venice and LAX,9171,Sanni,2009-03-05,"Los Angeles, California, United States",t,8.0,t,Del Rey,...,1 shared bath,1.0,1.0,"[""Hangers"", ""Free parking on premises"", ""Essen...",$50.00,t,314,4.79,f,3
3,5729,Zen Room with Floating Bed near Venice and LAX,9171,Sanni,2009-03-05,"Los Angeles, California, United States",t,8.0,t,Del Rey,...,1 shared bath,1.0,1.0,"[""Hangers"", ""Free parking on premises"", ""Essen...",$65.00,t,236,4.77,f,3
4,109,Amazing bright elegant condo park front *UPGRA...,521,Paolo,2008-06-27,"San Francisco, California, United States",f,1.0,t,Culver City,...,2 baths,2.0,3.0,"[""Hangers"", ""Pool"", ""Free parking on premises""...",$115.00,t,2,4.00,f,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41978,640114077433078451,COZY Travelers Dream Disney Home - REMODELED,121150191,Calvin,2017-03-17,"Garden Grove, California, United States",t,5.0,t,Santa Ana,...,2 baths,3.0,4.0,"[""Hangers"", ""Clothing storage"", ""Free parking ...",$326.00,t,2,5.00,f,5
41980,640360046512608448,💟2BR/2BA 5MIN toDisney!Angelstadium!HondaCenter.,434985835,Monika,2021-12-06,US,t,1.0,f,Orange,...,2 baths,2.0,2.0,"[""Hangers"", ""Free parking on premises"", ""Essen...",$221.00,t,1,5.00,t,3
41989,625482670051646954,Spacious serene master suite,329317578,Kenneth,2020-01-20,"Fountain Valley, California, United States",t,7.0,t,Fountain Valley,...,1 private bath,1.0,1.0,"[""Free parking on premises"", ""Essentials"", ""Ba...",$85.00,t,1,5.00,f,6
41992,625582857766131091,Lovely 1bedroom unit with parking on premises.,458841226,Alonzo,2022-05-12,US,f,0.0,t,Stanton,...,1 bath,1.0,2.0,"[""Hangers"", ""Central air conditioning"", ""Free ...",$124.00,t,4,5.00,t,1


We can now see there are no longer any missing values in our dataset; every attribute has a value for every listing.

The host_since and price attributes have the wrong data types. We will adjust these manually and also transform the bathrooms_text string to get meaningful information about the number and type of bathrooms for the listing.

In [None]:
## now fixing dtypes for attributes
df['host_since'] = pd.to_datetime(df.host_since)
## strip the dollar sign and thousands separators before converting price to float
df['price'] = df['price'].replace(r'[\$,]','',regex = True).astype(float)
## express half-baths as 0.5 so that every entry starts with a number
df['bathrooms_text'] = df['bathrooms_text'].replace(['Half-bath', 'Shared half-bath', 'Private half-bath'],['0.5 bath','0.5 shared bath', '0.5 private bath'])
## split into the leading number and the remaining description
df_bathrooms = df['bathrooms_text'].str.split(n=1, expand=True).rename(columns = {0:'bathroom_number',1:'bathroom_type'})
## entries with no description are treated as regular baths
df_bathrooms['bathroom_type'] = df_bathrooms.bathroom_type.fillna(value = 'bath')
## normalize plural forms
df_bathrooms['bathroom_type'] = df_bathrooms['bathroom_type'].replace(['baths','shared baths'],['bath','shared bath'])
df_bathrooms['bathroom_number'] = df_bathrooms['bathroom_number'].astype('float')
## insert the new columns and drop the original text column
df.insert(15, 'bathroom_number',df_bathrooms['bathroom_number'])
df.insert(16, 'bathroom_type', df_bathrooms['bathroom_type'])
del df['bathrooms_text']

In [None]:
df.info()

Quick check shows attributes are properly typed, and our newly formed bathroom_number and bathroom_type features have no missing values.
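To show what the bathroom transformation does to individual values, here is a minimal sketch on a few sample strings. The strings are examples in the style of the dataset rather than an exhaustive list, and the variable names `raw`, `text`, and `parts` are used only for this illustration.

```python
import pandas as pd

raw = pd.Series(['1 bath', '2 baths', '1.5 shared baths', 'Half-bath',
                 'Shared half-bath', '1 private bath'])

# Rewrite half-baths so every entry begins with a number.
text = raw.replace(['Half-bath', 'Shared half-bath', 'Private half-bath'],
                   ['0.5 bath', '0.5 shared bath', '0.5 private bath'])

# Split once on whitespace: first token is the count, the rest is the type.
parts = (text.str.split(n=1, expand=True)
             .rename(columns={0: 'bathroom_number', 1: 'bathroom_type'}))
parts['bathroom_number'] = parts['bathroom_number'].astype(float)
parts['bathroom_type'] = parts['bathroom_type'].replace(
    ['baths', 'shared baths'], ['bath', 'shared bath'])
print(parts)
```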

**Duplicates**

In [None]:
## checking for duplicates
df.nunique()

We can see that id, the unique identifier for each observation, shows no duplicates: there are 32327 unique ids, which equals the total number of rows left in the dataset. There are no duplicates.
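Comparing the unique count with the row count works, but pandas also offers more direct checks. The sketch below uses a toy frame with one deliberately repeated id (made-up values) to show what a failure would look like.

```python
import pandas as pd

toy = pd.DataFrame({'id': [101, 102, 103, 102],
                    'price': [60.0, 135.0, 50.0, 135.0]})

# True only when every id appears exactly once.
id_is_unique = toy['id'].is_unique

# List the ids that appear more than once.
repeated_ids = toy.loc[toy['id'].duplicated(keep=False), 'id'].unique().tolist()

# Keep the first occurrence of each id.
deduplicated = toy.drop_duplicates(subset='id', keep='first')
print(id_is_unique, repeated_ids, len(deduplicated))
```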

**Outliers**

In [None]:
px.box(df,x=['bathroom_number','bedrooms','beds']).show()
px.box(df,x='price').show()

Visual inspection of the boxplots shows there are extreme outliers for price and number of beds. We will calculate possible cutoff values based on the 1.5 × interquartile range (IQR) rule.

In [None]:
### find a cutoff for outliers based on IQR*1.5
print('Price cutoff value: ',df.price.quantile(.75)+(df.price.quantile(.75)-df.price.quantile(.25))*1.5)
print('Beds cutoff value: ',df.beds.quantile(.75)+(df.beds.quantile(.75)-df.beds.quantile(.25))*1.5)

In [None]:
### filter out price and beds outliers
df = df[(df.beds<10) & (df.price<750)]
df.count()

Using the outlier estimate of the 75th percentile plus 1.5 × IQR yields a price cutoff of 554 and a bed cutoff of 6. However, in the hope of not reducing the scope of the project too much, we decided to go with slightly higher cutoff values of 750 and 10, respectively. This leaves us with 30580 listings in our dataset.
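For reference, the upper fence is Q3 + 1.5 × (Q3 − Q1). A minimal sketch on made-up prices follows; the helper name `upper_fence` is hypothetical, and `np.percentile` with its default linear interpolation is assumed to match the pandas `quantile` default.

```python
import numpy as np

def upper_fence(values, k=1.5):
    """Upper outlier fence: Q3 + k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + k * (q3 - q1)

# Made-up nightly prices with one extreme value.
prices = np.array([50, 75, 100, 125, 150, 175, 200, 2500])
fence = upper_fence(prices)
flagged = prices[prices > fence]
print(fence, flagged)  # 312.5 [2500]
```

Here Q1 = 93.75 and Q3 = 181.25, so the IQR is 87.5 and the fence is 181.25 + 131.25 = 312.5.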

### Simple Statistics



In [None]:
## removing attributes for which it does not make sense to calculate descriptive statistics.
df[df.columns.difference(['id','host_id','latitude','longitude'])].describe()

This table shows the descriptive statistics for the full dataset. With the exception of review scores, we can see that all of the attributes have outliers on the high end, with a jump from the 75th percentile to the maximum that is much larger than the jumps from the 25th to the 50th and from the 50th to the 75th percentiles.

Comparing the mean and median for each attribute gives us an idea of the shape of the distribution and whether it shows skewness. Number of reviews shows significant right skewness, with a mean that is much larger than the median.
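The mean-versus-median comparison can be illustrated on a small made-up sample of review counts. The skewness computed here is the simple population moment coefficient, which is an assumption for this sketch and differs slightly from the bias-corrected value that pandas `skew()` reports.

```python
import numpy as np

# Made-up review counts: most listings have few reviews, a few have many.
reviews = np.array([0, 1, 2, 3, 5, 8, 12, 40, 150, 600])

mean = reviews.mean()
median = np.median(reviews)

# Population skewness: third central moment over the 1.5 power of the variance.
centered = reviews - mean
skewness = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

print(mean, median)  # 82.1 6.5
print(skewness > 0)  # True
```

A mean far above the median, together with positive skewness, is the signature of a long right tail.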

Additional insight could be gained by looking at the breakdown by property type, so we will calculate the descriptive statistics to see how each listing type compares to the overall values.

In [None]:
## repeat statistics per property type
## shorter explanation needed for this
from IPython.display import display

for ptype in ['Entire unit','Private room', 'Shared room', 'Hotel room', 'Room', 'other']:
    print('\nDescriptive Statistics for ',ptype)
    df_temp = df[df.columns.difference(['id','host_id','latitude','longitude'])][df.property_type == ptype].describe()
    display(df_temp)

The breakdown shows that some of the attributes have very similar outliers on the high end; these include number of guests accommodated and number of beds. Number of reviews shows right skewness across the different property types.

We will proceed with the current data, keeping in mind that there are outliers on the high end which could be influential and which we might need to remove in the future based on the performance of our models.

### Visualize Attributes

Visualizations of selected attributes to help identify patterns and gain insight from our data.


In [None]:
## distribution of number of people accommodated, bedrooms, and beds per listing
#df.boxplot(column=['accommodates','bedrooms','beds'])
detailsPlot = px.box(df, x=['accommodates','bedrooms','beds'], labels={'value':'Value', 'variable':'Variable'}, title='Distribution of Accommodates, Bedrooms, and Beds of AirBNB Property',height =400)
detailsPlot.show()

**Distribution of Accommodates, Bedrooms, and Beds of AirBNB Properties.**

We chose to visualize the distributions of the accommodates, bedrooms, and beds attributes together with boxplots, in part because the data share a similar scale.  We can clearly see that all three are right-skewed, concentrated on properties which would typically be sought after by smaller groups or individuals.  Note the median accommodates value of 3 as an example of this.

Knowing the distribution, along with the summary statistics that Plotly charts provide on hover, gives us much-needed information such as potential outliers that sit above the upper fence.  For instance, there is a property that accommodates 16 people according to the listing, but the maximum bed count is 9 (we cannot determine if they are the same property without further investigation).

This could indicate problematic data or simply that there is more to understand about this dataset.  Do properties use king and queen beds to increase the number of guests accommodated with a minimal number of beds?  Can we make any inferences about the relationship between beds and accommodates?  Can we infer the bed size?  Is the bed size reported in amenities?

This leaves us with many possible questions to explore to better understand our data.  In joint attributes, we'll take a look at these further.



In [None]:
#df.boxplot(column=['price'])

#pricePlot = px.box(df, x='price', labels={'price':'Price in thousands of US dollars'}, title='Distribution of price of AirBNB Property')
#pricePlot.show()
# have to fix the scale of this one

#trying a histogram of price
priceHist = px.histogram(df, x='price', labels={'price':'Price in US dollars'}, title='Distribution of price of AirBNB Property',height =500)
priceHist.show()

#trying a binned histogram

#counts, bins = np.histogram(df['price'], bins=range(0,16000,500))
#bins = .5 * (bins[:-1] + bins[1:])

#priceHist2 = px.bar(x=bins, y=counts, labels={'x': 'Price in thousands of US dollars, in $500 bins', 'y':'Count of properties in $500 bin'}, text_auto=True, title='Distribution of price of AirBNB Property')
#priceHist2.show()

**Distribution of price of AirBNB Property**

A histogram that binned the counts in $5 bins let us spot the obvious right skew of the price distribution.  This isn't unexpected because (1) consumers are typically price conscious and look for good deals, and (2) the earlier boxplots of bedrooms and accommodates indicated a right skew towards properties that suit smaller groups or individuals.
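
The right skew described above can also be expressed as a number with pandas' `Series.skew()`, and a log transform is one common way to reduce it before regression. This is a minimal sketch on synthetic log-normal data, not on the actual `price` column; `toy_price` and its parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (log-normal); not the real listing data.
rng = np.random.default_rng(0)
toy_price = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5000), name="price")

raw_skew = toy_price.skew()            # strongly positive for right-skewed data
log_skew = np.log1p(toy_price).skew()  # much closer to 0 after the transform

print(f"skew of raw price: {raw_skew:.2f}, skew of log1p(price): {log_skew:.2f}")
```

Whether a log-transformed price would actually help our pricing regression is something we would need to check against the real data.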



In [None]:
#px.treemap(df, path = [px.Constant('Type'),'type'], values = 'count')
df_properties = df[['id','property_type']].groupby(by='property_type', as_index = False).count().rename(columns = {'property_type':'type','id':'count'})
df_properties

dfTreemap = px.treemap(df_properties, path = [px.Constant('Type'),'type'], values = 'count', title='TreeMap of Property Type, indicating Count of each Property Type', height =500)
dfTreemap.show()

**Treemap of Property Type, indicating Count of each Property Type**

We chose a treemap to demonstrate the disparity among the various types of properties. The overwhelming majority of our cleaned dataset consists of listings where the tenant gets the entire unit, as opposed to listings where they may share common areas, for example.

Given that we saw right skewed distributions towards properties where you would expect a lower number of people in a booking party, we were mildly surprised to find that such a large share of listings are rentals of entire properties as opposed to shared areas within a larger property.

In [None]:
## top 10 hosts by listings
df_byhost = (df[['host_id','host_name','calculated_host_listings_count']].
             groupby(by=['host_id','host_name'],as_index=False).
             mean().
             nlargest(n=10,columns='calculated_host_listings_count'))  # nlargest already returns rows in descending order
# keep the figure object; assigning the result of .show() would store None
df_byhostPlot = px.bar(df_byhost, x='host_name',y='calculated_host_listings_count', labels={'host_name':'Host Name/Company', 'calculated_host_listings_count':'Host\'s Listings'}, title='Top 10 Hosts by Number of Listings', height = 500)
df_byhostPlot.show()

**Top 10 Hosts by Number of Listings**

It caught our eye that a small number of hosts own a disproportionately large number of listed properties.  We chose to focus on the top 10 hosts by count of listings and display them in a bar chart, which shows that the top 10 hosts own over 1500 units listed in the area.



In [None]:
#Visualize the location of the AirBNB properties on a map
airBNBMap = px.density_mapbox(df, lat='latitude', lon='longitude', radius=1, mapbox_style="open-street-map", height = 500)
airBNBMap.show()

**Location of AirBNB listed properties in Los Angeles**

The map above shows the location of the properties listed on AirBNB in the Los Angeles area.  The map visualization can be manipulated by zooming in and out to focus on specific areas.

The heatmap of properties was our best method to show visually whether there are any regional clusters and how they relate to other clusters.  While we do have neighborhood names through neighbourhood_cleansed, this map lets us show where the larger concentrations of listings are, as well as their proximity to other clusters.

Areas that appear to have larger groups of properties include Hollywood, Downtown, and the stretch from Venice Beach to Santa Monica.

In [None]:
#Visualize the host's location
hostLocationPlot = go.Figure()
hostLocationPlot.add_trace(go.Histogram(histfunc="count", x=df['host_location']))
hostLocationPlot.show()

**Host Location as given by the property owner/host**

Plotting the location as given by the host was useful, but the result is noisy because hosts report their location in many different ways. We explore this further by extracting state and country from the reported locations so that we can focus on larger regions.

In [None]:
#Create new columns for host_country and host_state
#Having to handle these with some if else statements.  Tried splitting on comma across the entire dataframe using the pandas Series.str.split but didn't have success
#As with much of user entered content, the data quality is poor here with variations on spelling, casing, and even what is input.  This makes this difficult to 
#do much with and with an uncertainty of value, I'm not willing to dive in too deep to get this near perfect.  I'll leave as is for now.
for index in df.index:
    if df.loc[index, 'host_location'] == "US":
        df.loc[index,'host_location_country'] = "United States"
        df.loc[index,'host_location_state'] = "United States"
    elif len(df.loc[index, 'host_location'].split(',')) == 1:
        df.loc[index,'host_location_country'] = df.loc[index, 'host_location'].strip()
        df.loc[index,'host_location_state'] = df.loc[index, 'host_location'].strip()
    elif len(df.loc[index, 'host_location'].split(',')) == 2:
        df.loc[index,'host_location_country'] = df.loc[index, 'host_location'].split(',')[1].strip()
        df.loc[index,'host_location_state'] = df.loc[index, 'host_location'].split(',')[0].strip()
    elif len(df.loc[index, 'host_location'].split(',')) == 3:
        df.loc[index,'host_location_country'] = df.loc[index, 'host_location'].split(',')[2].strip()
        df.loc[index,'host_location_state'] = df.loc[index, 'host_location'].split(',')[1].strip()
    else:
        df.loc[index,'host_location_country'] = "Unknown"
        df.loc[index,'host_location_state'] = "Unknown"

#print(df['host_location_country'])
hostLocationCountryPlot = go.Figure()
hostLocationCountryPlot.add_trace(go.Histogram(histfunc="count", x=df['host_location_country']))
hostLocationCountryPlot.show()

hostLocationStatePlot = go.Figure()
hostLocationStatePlot.add_trace(go.Histogram(histfunc="count", x=df['host_location_state']))
hostLocationStatePlot.show()

**Trying to extract new features from the Host Location**

The data entered by the hosts is very inconsistent, and its value is uncertain at this point.  Here we try to break it down with rule-based extractions that standardize and clean as much of the data as we can.
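
The comma-splitting rules used in the loop above can also be written as a small function applied once per row, which avoids repeated `df.loc` lookups. This is a sketch under assumptions: `split_host_location` is a hypothetical helper, the toy frame stands in for `df`, and treating missing values as "Unknown" is our own addition rather than something the original loop handles.

```python
import pandas as pd

def split_host_location(location):
    """Return (country, state) using the same comma-count rules as the loop above."""
    if not isinstance(location, str):
        return ("Unknown", "Unknown")  # assumption: missing values become "Unknown"
    if location == "US":
        return ("United States", "United States")
    parts = [part.strip() for part in location.split(",")]
    if len(parts) == 1:
        return (parts[0], parts[0])
    if len(parts) == 2:
        return (parts[1], parts[0])
    if len(parts) == 3:
        return (parts[2], parts[1])
    return ("Unknown", "Unknown")

# Toy rows standing in for df['host_location'] (made-up examples).
toy = pd.DataFrame({"host_location": [
    "US",
    "California, United States",
    "Los Angeles, California, United States",
    "France",
    None,
]})
countries, states = zip(*toy["host_location"].apply(split_host_location))
toy["host_location_country"] = list(countries)
toy["host_location_state"] = list(states)
print(toy)
```

Keeping the rules in one function would also make it easier to add further cleanup, such as normalizing casing, if these features turn out to be valuable.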

### Explore Joint Attributes



In [None]:
#Compare review_scores_rating across neighbourhood_cleansed
#trying out a scatterplot (dot plot) of rating by neighborhood

jaFig = px.scatter(df, y='review_scores_rating', x='neighbourhood_cleansed', title='Dot Plot Comparing Rating Scores across the Neighborhoods')
jaFig.update_traces(marker_size=2)
jaFig.show()


**Dot Plot Comparing Rating Score across the Neighborhoods**

We wondered whether a property's rating is related to the neighborhood in which it is located.

We chose a dot plot to show the density of ratings at a glance.  One issue is that there are so many neighborhoods that there isn't room to display all of the names along the X axis.  Plotly helps by allowing a viewer to zoom in and hover for more information when an interesting data point is found.

A viewer can certainly see some neighborhoods, such as Hollywood, Marina del Rey, and Newport Beach, with reviews towards the midpoint of the scale.

In [None]:
#Compare Beds to Bedrooms to Accommodates with a 3d scatterplot

# keep the figure object; assigning the result of .show() would store None
bedVsAccPlot = px.scatter_3d(df, x='beds', y='accommodates', z='bedrooms', color='property_type', opacity=0.3, size_max=2,height=400)
bedVsAccPlot.show()

**Compare Beds to Bedrooms to Accommodates with a 3d Scatterplot**

We wanted to further explore the relationship between beds, bedrooms, and accommodates to confirm that these follow the logical conclusion that more beds and more bedrooms mean a listing should accommodate more people.  We also wanted to be able to look at this by property type.

We chose a 3d scatterplot because Plotly lets the viewer rotate and zoom into specific regions of the interaction, which would not be possible when comparing 3 dimensions on a 2 dimensional chart.

Interestingly, while the logical assumptions seem to hold overall for these attributes, some points may indicate outliers, poor data quality (bad entry by a host), or possibly some other configuration that we haven't yet considered.  Do some properties have nothing but large beds?  Small beds?  Are there many beds in fewer bedrooms?  We will have to dig deeper to find out.

In [None]:
dfMatrix = px.scatter_matrix(df, dimensions= ['accommodates','bathroom_number','bedrooms','beds','price','number_of_reviews','review_scores_rating'], color = 'property_type', height = 800)
dfMatrix.update_traces(diagonal_visible=True)
dfMatrix.update_layout({"xaxis"+str(i+1): dict(tickangle = -45) for i in range(9)}, font_size=9)
dfMatrix.update_layout({"yaxis"+str(i+1): dict(tickangle = -45) for i in range(9)}, font_size=9)
dfMatrix.update_layout(legend = dict(font = dict(size=15)))
dfMatrix.show()

This paired scatterplot matrix shows the pairwise relationships between the numerical variables. We do not see strong correlations between any of the attributes, although there is some evidence of mild correlation between bedrooms and beds and between accommodates and bedrooms. 

Looking at the number of reviews, we can also see there is a listing with an abnormally high number of reviews. This unit has over 1500 reviews, with the next highest reviewed unit having just over 1000. 

### Explore Attributes and Class

**Price**

We previously looked into price versus some other variables in the prior section on the scatter matrix, colored by property type.  

Price vs review_scores_rating by property type:  One thing that stood out in this visual is a clear trend where higher priced properties appear to have fewer instances of lower review scores.  While there are certainly fewer properties at the higher prices, it stands to reason that many higher priced properties are priced that way due to the quality of the property, and are therefore more likely to achieve higher review scores.

Price vs accommodates: There is clearly a positive relationship where a property's price increases with the number of people that it accommodates.  

**Host is Superhost**

The box plots below explore some of the relationships that host_is_superhost has with the other attributes in the dataset.  We can see that Superhosts do appear to have higher ratings and to charge higher prices for their properties.  

AirBNB states that it evaluates Superhost status once every 3 months, using metrics such as a 4.8 or higher rating across properties, less than 1% cancellation rate, and greater than 90% responsiveness to communications from tenants.  Given those metrics, it is reasonable to say that Superhost status is desirable for hosts and tenants alike, and could lead to increased revenue for hosts.

In [None]:
#Host is superhost boxplot vs ratings/price

superHostRatingPlot = px.box(df, x='review_scores_rating', y='host_is_superhost', labels={'review_scores_rating':'Review Score Rating', 'host_is_superhost':'Super Host True/False'}, title='Does a Super Host mean a higher rating?')
superHostRatingPlot.show()

superHostPricePlot = px.box(df, x='price', y='host_is_superhost', labels={'price':'Price in USD', 'host_is_superhost':'Super Host True/False'}, title='Can a Super Host command a higher price?')
superHostPricePlot.show()


### New Features

**New host location features**

Hosts provided their own location in a column titled 'host_location'.  Reviewing this found wide variation in how the hosts entered their location and communicated that to AirBNB.  

To bring this into a more standard format, we've added two new features that attempt to extract the country and, if in the US, the state in which the host resides, or at least reports residing.

The new features are named host_location_country and host_location_state.  Code to generate these features from the host_location resides under the visualizations above, under the host_location visualization and includes visuals to represent these new features.

**Beds vs Accommodates**

We began to wonder whether we could improve predictive value by better associating the number of beds with the accommodates value.  It seems obvious to us that some bed sizes are for singles and some for doubles.  Is that presented to us in amenities?  Can we infer the bed size by comparing the number of beds to accommodates?

We explored this in joint attributes above but there may be more work to do.
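
One possible starting point for that further work is a simple ratio of guests to beds, where values near 1 would suggest single beds and values near 2 would suggest doubles. The sketch below is only an idea under assumptions: `guests_per_bed` is a hypothetical feature name that does not exist in our dataset, and the toy rows are made up.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for df[['accommodates', 'beds']] (values are made up).
toy = pd.DataFrame({"accommodates": [2, 4, 16, 3], "beds": [1, 2, 9, 0]})

# Replace zero beds with NaN so the division yields NaN instead of inf.
toy["guests_per_bed"] = toy["accommodates"] / toy["beds"].replace(0, np.nan)
print(toy)
```

Listings with zero reported beds come out as missing and would need a separate decision before modeling.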

**New property type feature**

A new feature that grouped properties by type was created which narrowed properties into 6 categories which replaced the old property_type attribute. This included an "other" category for properties that don't fully meet one of the other property types such as huts, domes, etc.

**New bathroom features**

Two new attributes for bathrooms were created by extracting information from the bathroom_text variable. The new features, bathroom_number and bathroom_type, give us the number of bathrooms in the listing as well as the type of bathroom (private, shared, etc.).

### Exceptional Work

For exceptional work, we chose to perform a Principal Component Analysis on our attributes and to highlight some of our prior work.

Mapping locations using Plotly's Density Mapbox allows us to visualize clusters or hotspots of properties in ways that would be impossible without keenly knowing the neighborhood names or other details of the city of Los Angeles.  

We also chose to present almost all charts using Plotly, as it allowed us to explore the data through the visuals more than a flat chart typically would.  Plotly gives us statistics on many charts to help guide our decision making in the exploratory data analysis process. 

Finally, we spent a considerable amount of time to learn important features of markdown so that as we go through the semester, we can present our work with easy to read text embedded within the Jupyter Notebook.  These included the table used to display the attributes and their details, using links for a Table of Contents to allow us to quickly navigate to relevant sections, and making use of headings, bolded sub headings, and formatted text where applicable.  Learning this now will prove valuable to our future work.

In [None]:
### dimensionality reduction with PCA
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

PCA works with numerical variables, so the first step is to make a copy of the dataset containing only the numerical variables that we will use for principal components analysis.

In [None]:
df_pca = df[['accommodates','bathroom_number','bedrooms','beds','price','number_of_reviews','review_scores_rating']].copy()
df_pca.describe()

We can see from the description that the attributes have large differences in variance. Since PCA is sensitive to variance, we will proceed by standardizing all attributes. Without scaling, price would massively outweigh the other attributes in the principal components.

In [None]:
for col in ['accommodates','bathroom_number','bedrooms','beds','price','number_of_reviews','review_scores_rating']:
    df_pca[col] = (df_pca[col] - df_pca[col].mean())/(df_pca[col].std()) #scaling by finding z-score (value-mean)/std
df_pca.describe()

All attributes are now on the same scale, so we can proceed with the Principal Components Analysis.
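
As a cross-check on the manual z-score above, the same standardization can be sketched with scikit-learn's `StandardScaler`. One detail worth knowing is that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two results differ by a constant factor per column. The toy frame below uses made-up values and stands in for `df_pca`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for df_pca (values are made up).
toy = pd.DataFrame({"price": [50.0, 120.0, 300.0, 1000.0],
                    "beds": [1.0, 2.0, 3.0, 6.0]})

manual_sample = (toy - toy.mean()) / toy.std()            # pandas default: ddof=1
manual_population = (toy - toy.mean()) / toy.std(ddof=0)  # matches StandardScaler
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(np.allclose(scaled.values, manual_population.values))
```

With thousands of listings the difference between the two conventions is negligible, and because it is a uniform rescaling of every column it should not change the directions of the principal components.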

Principal components are calculated in a way that eliminates correlation. We can visualize this by computing the correlation present after scaling the data and comparing it to the correlation present among the fitted Principal Components.


In [None]:
sns.set(rc = {'figure.figsize':(12,8)})
sns.heatmap(df_pca.corr())

We can see that there is correlation present between accommodates and bedrooms, accommodates and beds, and bedrooms and beds. This is not unexpected, since it stands to reason that having more bedrooms allows for more beds, and the number of beds is related to the number of guests that can be accommodated.

In [None]:
## Performing PCA and getting dataframe with proportion of explained variance and cumulative explained variance
pca = PCA()
components = pca.fit_transform(df_pca)
components = pd.DataFrame(components, columns=['PC1','PC2','PC3','PC4','PC5','PC6','PC7'])
explained = pca.explained_variance_ratio_
cumulative=[]
for i in range(0,len(explained)):
    cumulative.append(sum(explained[:i+1]))
cumulative

pca_explained = pd.DataFrame({
    'PC': ['PC1','PC2','PC3','PC4','PC5','PC6','PC7'],
    'explained_variance': explained,
    'cumulative_explained': cumulative
})
pca_explained

In [None]:
sns.heatmap(components.corr())

The correlation heatmap shows there is no correlation present between the principal components.

In [None]:
### scree plot
px.line(pca_explained, y = 'explained_variance', x = 'PC', hover_data=['explained_variance','cumulative_explained'],hover_name = 'PC', labels={'explained_variance':'Explained Variance',
                                                                                     'PC': 'Principal Component'},height=500)

The scree plot is a visual aid that shows the explained variance for each Principal Component. We can see from the plot that there is a sharp elbow at Principal Component 2, meaning the largest drop in explained variance occurs from PC1 to PC2. Principal Component 1 explains over 51% of the variance in the data; including components 1 through 4 would explain over 88% of the variance, reducing our number of dimensions from 7 to 4 at the cost of direct interpretability in terms of the original features of our dataset.
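
Choosing how many components to keep for a target share of explained variance can be sketched as a cumulative sum followed by a lookup. `components_for_threshold` is a hypothetical helper, and the ratios below are made-up numbers rather than the output of our fitted `pca` object; in the notebook the input would be `pca.explained_variance_ratio_`.

```python
import numpy as np

def components_for_threshold(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

# Made-up ratios for 7 components (they sum to 1); not our actual PCA results.
toy_ratios = np.array([0.50, 0.20, 0.12, 0.08, 0.05, 0.03, 0.02])
n_needed = components_for_threshold(toy_ratios, 0.88)
print(n_needed)
```

scikit-learn also accepts a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.88)`), which selects the number of components needed to explain at least that fraction of the variance.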