# San Francisco Airbnb Data Analysis Report

A Proposed Capstone Project for TDI

Haoming Jin

![](figures/AirBnB-San-Francisco-hq.png)

## 1. Introduction
### 1.1 Motivation
Airbnb is an online marketplace for arranging or offering lodging (primarily homestays) and tourism experiences. It was founded in 2008 and is based in San Francisco, California. San Francisco is also where I am most interested in developing my career as a data analyst or data scientist, and analysis of this housing data can inform business decisions.

### 1.2 Data

The data used in this project comes from the website _Inside Airbnb_ (http://insideairbnb.com/), which collects publicly available information from Airbnb. It contains detailed information scraped from Airbnb listings all over the world and is updated monthly. We use all listings in San Francisco from 2019/04 to 2020/04. The dataset has 12074 rows and 82 columns describing each listing, including location, neighbourhood, prices and fees, review scores, host information, the listing description, and images.

## 2. Exploratory Data Analysis

### 2.1 Intuitive Map Images

The interactive map shows a heatmap of the density of Airbnb listings in San Francisco. The density is much higher in the north-east coastal areas.

```python
import folium
import pandas as pd
from folium.plugins import HeatMap
from IPython.display import display

# Load the scraped listings and keep only the columns used in this report.
df_original = pd.read_pickle('airbnb_SF_2019_04_to_2020_04.pkl')
col_lists = ['id','host_name','host_response_time','host_response_rate','host_acceptance_rate','host_is_superhost',
            'host_total_listings_count','neighbourhood_cleansed','latitude','longitude','is_location_exact','property_type',
            'room_type', 'accommodates','bathrooms','bedrooms','beds','bed_type','price','weekly_price','monthly_price',
            'security_deposit','cleaning_fee','guests_included']
df = df_original[col_lists]

# Center the map on San Francisco and overlay a heatmap of listing locations.
m = folium.Map([37.76, -122.44], zoom_start=13)
HeatMap(
    df[['latitude', 'longitude']].dropna(),
    radius=8,
    gradient={0.2: 'blue', 0.4: 'purple', 0.6: 'orange', 1.0: 'red'},
).add_to(m)
display(m)
```

For platforms that don't render the interactive map, the following is a screenshot.
![](figures/folium_heatmap.png)

If we plot the prices as colored dots on the map, we can see that the higher-priced listings are mostly in the central areas and in houses very close to the coast.

<img src="figures/prices_on_map.png" width="700"/>

### 2.2 Neighbourhoods

We want to know whether the neighbourhood has an effect on pricing. The original data contains 36 different neighbourhoods. The following plots show the five neighbourhoods with the most listings, their listing counts, and their price distributions.

<img src="figures/neighbourhood_on_map.png" width="500"/>

<img src="figures/neighbourhood_price.png" width="900"/>
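
A ranking like this can be produced with a pandas `groupby`. The following is a minimal sketch, not the code used for the plots: the rows are hypothetical toy values, and it assumes `price` is already numeric and that the neighbourhood is stored in the `neighbourhood_cleansed` column.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real listings table.
listings = pd.DataFrame({
    "neighbourhood_cleansed": [
        "Mission", "Mission", "Mission",
        "Downtown/Civic Center", "Downtown/Civic Center",
        "Noe Valley",
    ],
    "price": [150.0, 200.0, 250.0, 90.0, 110.0, 300.0],
})


def top_neighbourhoods(df, n=5):
    """Listing count and price summary for the n neighbourhoods with the most listings."""
    summary = (
        df.groupby("neighbourhood_cleansed")["price"]
        .agg(listings="count", mean_price="mean", median_price="median")
        .sort_values("listings", ascending=False)
    )
    return summary.head(n)


top = top_neighbourhoods(listings, n=2)
print(top)
```

On the real data, calling `top_neighbourhoods(df, n=5)` would give the five neighbourhoods with the most listings, provided the price column has been converted to numbers first.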

Indeed, most of the listings are in the central and north-east coastal areas. The Downtown/Civic Center area has the lowest average price, but its prices separate clearly into a higher group around 200 and a lower group around 90.

### 2.3 Superhosts

Superhosts are experienced hosts who meet a certain set of requirements and are supposedly better and more qualified than regular hosts. In San Francisco, 39.6% of the hosts qualify as superhosts. The rating categories are described as follows:
1. Overall: The overall experience.
2. Accuracy: How accurately did the listing page represent the space?
3. Cleanliness: Did guests feel that the space was clean and tidy?
4. Check-in: How smoothly did check-in go?
5. Communication: How well did the host communicate before and during the stay? Guests often care that their host responds quickly, reliably, and frequently to their messages and questions.
6. Location: How did guests feel about the neighborhood? This may mean that there's an accurate description for proximity and access to transportation, shopping centers, city center, etc., and a description that includes special considerations, like noise, and family safety.
7. Value: Did the guest feel that the listing provided good value for the price?

On average, superhosts' ratings are 0.2 to 0.4 higher than those of regular hosts in all categories.

![](figures/superhost_ratings.png)
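
The comparison between superhosts and regular hosts can be sketched as a grouped mean over the rating columns. This is an illustration on hypothetical toy values. It assumes that the rating columns are named `review_scores_*` and that `host_is_superhost` holds `'t'`/`'f'` flags; neither assumption is confirmed by the text above.

```python
import pandas as pd

# Hypothetical toy data; column names and the 't'/'f' encoding are assumptions.
reviews = pd.DataFrame({
    "host_is_superhost": ["t", "t", "f", "f"],
    "review_scores_cleanliness": [10.0, 9.0, 9.0, 8.0],
    "review_scores_value": [10.0, 10.0, 9.0, 9.0],
})

# Share of rows that belong to superhosts.
superhost_share = (reviews["host_is_superhost"] == "t").mean()

# Mean rating per category for superhosts ('t') and regular hosts ('f').
score_cols = [c for c in reviews.columns if c.startswith("review_scores_")]
means = reviews.groupby("host_is_superhost")[score_cols].mean()

# Positive values mean superhosts are rated higher in that category.
gap = means.loc["t"] - means.loc["f"]
print(superhost_share)
print(gap)
```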

## 3. Machine Learning Price Prediction

### 3.1 Model Building

In this section, we apply several machine learning methods to predict the price of a listing from the following information it provides:
1. Neighbourhood
2. Property type
3. Room type
4. Number of accommodated guests
5. Number of bathrooms
6. Number of bedrooms
7. Number of beds
8. Bed type

Rows with missing values are dropped, and the categorical features (neighbourhood, property type, room type and bed type) are converted into one-hot encodings.

After this preprocessing, X is a matrix with 11307 rows and 78 columns, and y is a vector of 11307 prices.

X and y are split into 70% training data and 30% testing data.
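
The preprocessing and split described above might look like the following sketch. It uses pandas only and hypothetical toy rows with a reduced set of columns. The split here is done with `DataFrame.sample`; scikit-learn's `train_test_split` is a common alternative, and the report does not say which was used.

```python
import pandas as pd

# Hypothetical toy rows; one row has a missing room type and will be dropped.
raw = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room", None,
                  "Shared room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Shared room", "Private room",
                  "Entire home/apt"],
    "accommodates": [4, 2, 2, 3, 1, 6, 2, 4, 1, 2, 5],
    "bedrooms": [2, 1, 1, 1, 1, 3, 1, 2, 1, 1, 2],
    "price": [250.0, 110.0, 95.0, 130.0, 50.0, 400.0,
              120.0, 230.0, 45.0, 105.0, 300.0],
})

# Drop rows with missing values, then one-hot encode the categorical column.
clean = raw.dropna()
X = pd.get_dummies(clean.drop(columns="price"), columns=["room_type"])
y = clean["price"]

# 70% of the rows go to training; the remaining 30% form the test set.
X_train = X.sample(frac=0.7, random_state=0)
X_test = X.drop(X_train.index)
y_train, y_test = y.loc[X_train.index], y.loc[X_test.index]

print(X.shape, len(X_train), len(X_test))
```
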
The following models are used:

**1. Linear Regression (Ordinary Least Squares)**
Fits a linear model with coefficients $w$ to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear model:

$\min_{w} || X w - y||_2^2$
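
As a small illustration of this objective, the sketch below solves a least-squares problem with NumPy on hypothetical toy data. The targets are generated exactly from $w = (2, 3)$, so the solver should recover those coefficients. There is no intercept term, matching the formula above.

```python
import numpy as np

# Hypothetical toy data: y is an exact linear function of the two columns of X.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
w_true = np.array([2.0, 3.0])
y = X @ w_true

# Least-squares solution of min_w ||Xw - y||_2^2.
w_ols, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(w_ols)
```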

**2. Ridge Regression**
Addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares:

$\min_{w} || X w - y||_2^2 + \alpha ||w||_2^2$
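
The penalized objective has the closed-form solution $w = (X^T X + \alpha I)^{-1} X^T y$. The sketch below applies it to hypothetical toy data without an intercept and shows that the penalty shrinks the coefficient vector. It illustrates the formula and is not the fitting code used for the results in this report.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve min_w ||Xw - y||_2^2 + alpha * ||w||_2^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Hypothetical toy data.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = np.array([2.1, 2.9, 5.2, 6.8])

w_ols = ridge_closed_form(X, y, alpha=0.0)    # alpha = 0 reduces to OLS
w_ridge = ridge_closed_form(X, y, alpha=1.0)  # alpha > 0 shrinks the coefficients
print(w_ols, w_ridge)
```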

**3. Lasso Regression**
Similar to Ridge Regression, but the penalty is on the l1-norm of the coefficient vector $w$:

$\min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}$
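
One way to minimize this objective is coordinate descent with soft-thresholding. The sketch below is a minimal version on hypothetical toy data with orthogonal columns and no intercept. It shows the characteristic effect of the l1 penalty: a weak coefficient is set exactly to zero. The function names are illustrative, and the report does not state which solver was used.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it falls inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize 1/(2n) * ||Xw - y||_2^2 + alpha * ||w||_1 one coordinate at a time."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with the contribution of feature j removed.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Hypothetical toy data: the third feature has only a weak effect (0.05).
X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, -1.0],
              [1.0, -1.0, -1.0]])
y = X @ np.array([3.0, 1.0, 0.05])

w_lasso = lasso_coordinate_descent(X, y, alpha=0.1)
print(w_lasso)
```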

### 3.2 Model Evaluation
In this part, we use three metrics to evaluate the predictions:
1. Mean absolute error (MAE): The average absolute difference between the predicted and actual prices
2. Root mean square error (RMSE): The square root of the average squared difference between the predicted and actual prices
3. R2 score: The proportion of the variation in the response variable (price) that is explained by the model
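
These three metrics can be computed directly from their definitions. The sketch below uses NumPy and three hypothetical price pairs. The function name `regression_metrics` is illustrative and not taken from the project code.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    # R2 = 1 - (residual sum of squares) / (total sum of squares).
    r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mae, rmse, r2


# Hypothetical actual and predicted prices.
mae, rmse, r2 = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(mae, rmse, r2)
```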

| Model | MAE | RMSE | R2 score |
|-------|-----|------|----------|
| Linear Regression | 51.773 | 71.587 | 0.513 |
| Ridge Regression | 51.766 | 71.585 | 0.513 |
| Lasso Regression | 51.999 | 72.110 | 0.506 |


The performance of these models is almost the same, with Ridge regression marginally better than basic Linear regression.

The following are 10 random samples from the test data and the predicted price.

| Price | Predicted price |
|-------|-----------------|
| 300.0 | 201.59 |
| 139.0 | 160.51 |
| 200.0 | 127.76 |
| 175.0 | 190.97 |
| 50.0 | 89.59 |
| 107.0 | 143.07 |
| 94.0 | 87.22 |
| 265.0 | 109.85 |
| 199.0 | 192.00 |
| 349.0 | 222.84 |
