# Using AirBnB data to analyze the market in Seattle
I did the analysis from the point of view of a Seattle homeowner, whose main objective would be to maximize revenue by attracting customers willing to pay high prices. This leads to the following questions:

### 1. Business Questions

#### 1.1 What drives higher ratings?

#### 1.2 Where are the highest prices for rentals?

#### 1.3 When are the highest prices for rentals?

### 2. Data Understanding

#### 2.1 Retrieve Data

In [39]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython import display
import collections
from itertools import chain
import sklearn
from time import time
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
%matplotlib inline

In [9]:
df_listings = pd.read_csv("data/listings.csv")
df_calendar = pd.read_csv("data/calendar.csv")
df_reviews = pd.read_csv("data/reviews.csv")

#### 2.2 Overview and Exploration

In [30]:
# Set some display options
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 2000)

In [31]:
# Check number of rows and columns
print(df_listings.shape)
print(df_calendar.shape)
print(df_reviews.shape)

(3818, 92)
(1393570, 4)
(84849, 6)


In [34]:
# Get column names
print(df_listings.columns.values)
print(df_calendar.columns.values)
print(df_reviews.columns.values)

['id' 'listing_url' 'scrape_id' 'last_scraped' 'name' 'summary' 'space'
 'description' 'experiences_offered' 'neighborhood_overview' 'notes'
 'transit' 'thumbnail_url' 'medium_url' 'picture_url' 'xl_picture_url'
 'host_id' 'host_url' 'host_name' 'host_since' 'host_location'
 'host_about' 'host_response_time' 'host_response_rate'
 'host_acceptance_rate' 'host_is_superhost' 'host_thumbnail_url'
 'host_picture_url' 'host_neighbourhood' 'host_listings_count'
 'host_total_listings_count' 'host_verifications' 'host_has_profile_pic'
 'host_identity_verified' 'street' 'neighbourhood'
 'neighbourhood_cleansed' 'neighbourhood_group_cleansed' 'city' 'state'
 'zipcode' 'market' 'smart_location' 'country_code' 'country' 'latitude'
 'longitude' 'is_location_exact' 'property_type' 'room_type'
 'accommodates' 'bathrooms' 'bedrooms' 'beds' 'bed_type' 'amenities'
 'square_feet' 'price' 'weekly_price' 'monthly_price' 'security_deposit'
 'cleaning_fee' 'guests_included' 'extra_people' 'minimum_nights'
 'm

In [36]:
# Share of missing values per column in listings, largest first
(df_listings.isnull().sum()/len(df_listings)).sort_values(ascending=False)

license                             1.000000
square_feet                         0.974594
monthly_price                       0.602672
security_deposit                    0.511262
weekly_price                        0.473808
notes                               0.420639
neighborhood_overview               0.270299
cleaning_fee                        0.269775
transit                             0.244631
host_about                          0.224987
host_acceptance_rate                0.202462
review_scores_accuracy              0.172342
review_scores_checkin               0.172342
review_scores_value                 0.171818
review_scores_location              0.171556
review_scores_cleanliness           0.171032
review_scores_communication         0.170508
review_scores_rating                0.169460
reviews_per_month                   0.164222
first_review                        0.164222
last_review                         0.164222
space                               0.149031
host_respo

In [41]:
# Distribution of missing values.
(df_listings.isnull().sum()/len(df_listings)).describe()

count    92.000000
mean      0.084893
std       0.181492
min       0.000000
25%       0.000000
50%       0.000000
75%       0.136983
max       1.000000
dtype: float64

Several columns contain missing values. On average a column is missing about 8% of its values, square_feet is missing in 97% of rows, and the license column has no values at all.

There are no missing values in the price column, so it may be usable in place of the weekly and monthly prices. There seems to be no good substitute for square_feet, however.
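The per-column share of missing values computed above can also be used to list candidate columns for removal. A minimal sketch, assuming a hypothetical cutoff of 50% and a small toy frame standing in for `df_listings` (the helper name `columns_above_missing_share` is made up for illustration):

```python
import numpy as np
import pandas as pd


def columns_above_missing_share(df, threshold=0.5):
    """Return the names of columns whose share of missing values exceeds threshold."""
    missing_share = df.isnull().mean()  # same as isnull().sum() / len(df)
    return missing_share[missing_share > threshold].index.tolist()


# Toy stand-in for df_listings; the values are illustrative only.
toy_listings = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, np.nan],
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "price": ["$85.00", "$150.00", "$975.00", "$100.00"],
})

sparse_columns = columns_above_missing_share(toy_listings, threshold=0.5)
print(sparse_columns)
```

The cutoff is a judgment call: a lower threshold would also flag columns such as weekly_price, which may still carry useful information.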

In [47]:
df_listings[['monthly_price', 'weekly_price', 'price']].head(10)

  monthly_price weekly_price    price
0           NaN          NaN   $85.00
1     $3,000.00    $1,000.00  $150.00
2           NaN          NaN  $975.00
3     $2,300.00      $650.00  $100.00
4           NaN          NaN  $450.00
5           NaN      $800.00  $120.00
6           NaN      $575.00   $80.00
7           NaN      $360.00   $60.00
8     $1,700.00      $500.00   $90.00
9     $3,000.00    $1,000.00  $150.00


In [71]:
# Strip '$' and ',' from the price strings, convert to float and correlate
df_listings[['monthly_price', 'weekly_price', 'price']].replace(r'[\$,]', '', regex=True).astype(float).corr()

               monthly_price  weekly_price     price
monthly_price       1.000000      0.942644  0.873450
weekly_price        0.942644      1.000000  0.937861
price               0.873450      0.937861  1.000000


The price field appears to hold the nightly price, and it correlates strongly with both the weekly price (0.94) and the monthly price (0.87). Since the weekly and monthly prices are missing in 47% and 60% of rows, I will drop them from the dataset and use the price field as a substitute.
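Because the price columns are stored as strings such as `$1,000.00`, they have to be converted before any numeric work. A minimal sketch of that conversion on toy values (the helper name `price_to_float` is hypothetical, not part of the notebook):

```python
import numpy as np
import pandas as pd


def price_to_float(series):
    """Strip '$' and ',' from currency strings and convert to float; missing values stay NaN."""
    cleaned = series.str.replace(r"[\$,]", "", regex=True)
    return pd.to_numeric(cleaned)


# Toy values in the same format as the price columns.
toy_prices = pd.Series(["$85.00", "$1,000.00", np.nan])
parsed_prices = price_to_float(toy_prices)
print(parsed_prices)
```

Wrapping the conversion in a function keeps the regular expression in one place, so the same cleaning could be applied to the price column of the calendar data as well.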

In [62]:
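# Correlate square_feet with the other size-related columns, using only rows where square_feet is present
df_listings[df_listings['square_feet'].notnull()][['square_feet', 'bathrooms', 'bedrooms', 'beds']].corr()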

             square_feet  bathrooms  bedrooms      beds
square_feet     1.000000   0.381094  0.448786  0.312155
bathrooms       0.381094   1.000000  0.418992  0.303472
bedrooms        0.448786   0.418992  1.000000  0.742920
beds            0.312155   0.303472  0.742920  1.000000


The bathrooms and bedrooms fields will be used as substitutes for the mostly missing square_feet field, although their correlation with square_feet is only moderate (0.38 and 0.45).

In [63]:
(df_calendar.isnull().sum()/len(df_calendar)).sort_values(ascending=False)

price         0.32939
available     0.00000
date          0.00000
listing_id    0.00000
dtype: float64

In the calendar dataset, the price column is missing in about 33% of rows.
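One way to investigate the missing calendar prices is to check whether they coincide with dates on which a listing is not available. A minimal sketch on toy data; the `'t'`/`'f'` encoding of `available`, the dates and the prices are assumptions for illustration, not values taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_calendar with the same four columns (values are assumed).
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "date": ["2016-01-04", "2016-01-05", "2016-01-04", "2016-01-05"],
    "available": ["t", "f", "f", "t"],
    "price": ["$85.00", np.nan, np.nan, "$150.00"],
})

# Share of rows with a missing price, split by the availability flag.
missing_by_availability = (
    toy_calendar["price"].isnull().groupby(toy_calendar["available"]).mean()
)
print(missing_by_availability)
```

If the same split on the real `df_calendar` showed missing prices only for unavailable dates, the gaps would reflect bookings or blocked dates rather than data quality problems.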

In [42]:
(df_reviews.isnull().sum()/len(df_reviews)).sort_values(ascending=False)

comments         0.000212
reviewer_name    0.000000
reviewer_id      0.000000
date             0.000000
id               0.000000
listing_id       0.000000
dtype: float64

The reviews dataset has no missing data apart from a small number of missing comments (about 0.02% of rows).

There are 3 datasets:

- listings
- calendar
- reviews

The most relevant dataset for our analysis is the listings dataset. The calendar dataset is relevant for answering the third question, about when prices are highest.

Since the reviews dataset consists mainly of unstructured text, we will postpone its analysis.
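To illustrate how the calendar dataset could serve the third question, prices can be averaged per calendar month. This is only a sketch on toy data, assuming `date` parses as a date and `price` uses the same `$` string format as in the listings; the dates and prices below are made up:

```python
import pandas as pd

# Toy stand-in for df_calendar (values are assumed for illustration).
toy_calendar_prices = pd.DataFrame({
    "date": ["2016-01-04", "2016-01-20", "2016-07-04", "2016-07-20"],
    "price": ["$100.00", "$120.00", "$150.00", "$170.00"],
})

# Parse dates and currency strings before aggregating.
toy_calendar_prices["date"] = pd.to_datetime(toy_calendar_prices["date"])
toy_calendar_prices["price"] = pd.to_numeric(
    toy_calendar_prices["price"].str.replace(r"[\$,]", "", regex=True)
)

# Mean price per calendar month (1 = January, ..., 12 = December).
monthly_mean_price = toy_calendar_prices.groupby(
    toy_calendar_prices["date"].dt.month
)["price"].mean()
print(monthly_mean_price)
```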

#### 2.3 Treatment of missing data