# San Francisco Airbnb Data Analysis

## Introduction
### Motivation
Airbnb is an online marketplace for arranging or offering lodging (primarily homestays) and tourism experiences. It was founded in 2008 and is based in San Francisco, California. San Francisco is also where I am most interested in building my career as a data analyst/scientist, and analysis of this housing data is valuable for business decisions.

### Business Understanding
Some of the questions I want to answer are:

1. Which locations do people rent Airbnb listings in?
2. How does neighborhood affect pricing?
3. Do superhosts provide a better renting experience?
4. Can we predict price from the information provided and, if so, which method does it best?

## Data Understanding

The data used in this project comes from the website _Inside Airbnb_ (http://insideairbnb.com/), which collects publicly available information from Airbnb. It contains very detailed information scraped from Airbnb listings all over the world and is updated monthly. We use all the listings in San Francisco from 2019/04 to 2020/04. The raw data contains 12074 rows and 106 columns (83 after cleaning) with relevant data about listings, including location, neighbourhood, prices and fees, review scores, host information, detailed listing descriptions, images, etc.

In this section, we will load the data, check for cleanliness, and then clean the data.

In [2]:
import numpy as np
import pandas as pd
from urllib.request import urlopen
import matplotlib.pyplot as plt
import seaborn
from tqdm import tqdm

In [3]:
def load_data():
    '''
    INPUT: None
    OUTPUT: A data frame with loaded airbnb listing data, no duplicate listing id.
    '''
    frames = []
    # Snapshot dates, ordered newest first
    filelist = ['2020-04-07','2020-03-13','2020-02-12','2020-01-04','2020-01-02','2019-12-04','2019-11-01','2019-10-14',
                '2019-09-12','2019-08-06','2019-07-08','2019-06-02','2019-05-03','2019-04-03']
    for file in tqdm(filelist):
        try:
            url = "http://data.insideairbnb.com/united-states/ca/san-francisco/"+file+"/data/listings.csv.gz"
            resp = urlopen(url)
            frames.append(pd.read_csv(resp, compression = 'gzip'))
        except Exception as e:
            # Report the failed snapshot and continue with the remaining ones
            print(file, ' fail:', e)
    if not frames:
        return pd.DataFrame()
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat(frames)
    # For records with the same id, keep the newest one (the files are ordered newest first)
    return df[~df.id.duplicated()]
df = load_data()

100%|██████████| 14/14 [00:34<00:00,  2.47s/it]
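
The deduplication in `load_data` relies on the snapshots being processed newest first, so that the first row seen for each `id` is also the most recent one. Below is a minimal sketch of that behaviour on two toy snapshots; the frames and their values are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Two toy "snapshots" (illustrative values), listed newest first like filelist.
newest = pd.DataFrame({"id": [1, 2], "price": ["$120.00", "$80.00"]})
older = pd.DataFrame({"id": [2, 3], "price": ["$75.00", "$200.00"]})

# Stack the snapshots in newest-first order.
combined = pd.concat([newest, older])

# duplicated() marks every occurrence of an id after the first one,
# so negating it keeps the newest record per listing.
deduped = combined[~combined["id"].duplicated()]

print(deduped)
```

Listing 2 appears in both snapshots, and only its row from the newer snapshot survives. If the file list were ordered oldest first, the same line would keep the oldest record instead.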


## Cleaning

Given the length of this notebook, I am not showing the full inspection of all 106 columns, but you can check the conclusions below.

I first drop the columns that we are definitely not using: they are empty, meaningless, or can easily be derived from other data in the dataset.

In [10]:
drop_list = ['listing_url','scrape_id','experiences_offered','thumbnail_url','medium_url','xl_picture_url',
             'host_thumbnail_url', 'neighbourhood_group_cleansed', 'city','state','market','smart_location',
            'country_code', 'country', 'square_feet', 'minimum_minimum_nights', 'maximum_minimum_nights',
             'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm','maximum_nights_avg_ntm',
             'calendar_last_scraped','jurisdiction_names']
df_dropped = df.drop(labels = drop_list, axis = 1)
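
Since the column-by-column inspection is not shown, here is a rough sketch of one way candidates for dropping could be flagged automatically. The helper `find_uninformative_columns` and its `max_missing` threshold are hypothetical and were not used to build `drop_list`; the toy frame only borrows a few column names from the dataset.

```python
import pandas as pd

def find_uninformative_columns(frame, max_missing=0.99):
    """Return columns that are almost entirely missing or hold a single value."""
    # Share of missing values per column.
    missing_share = frame.isna().mean()
    mostly_empty = missing_share[missing_share >= max_missing].index.tolist()
    # nunique() ignores NaN by default, so an all-missing column has 0 unique values.
    unique_counts = frame.nunique()
    constant = unique_counts[unique_counts <= 1].index.tolist()
    return sorted(set(mostly_empty) | set(constant))

# Toy frame with illustrative values.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "country": ["United States"] * 3,       # constant column
    "xl_picture_url": [None, None, None],   # empty column
    "price": ["$100.00", "$85.00", "$250.00"],
})

flagged = find_uninformative_columns(toy)
print(flagged)
```

A check like this only catches empty or constant columns. Columns that duplicate information held elsewhere still need a manual look.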

Because of the large number of columns, we first clean the items that are relevant or easy to clean.
1. Change the t/f columns to boolean type.
2. Change prices to float type.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7275
Columns: 106 entries, id to reviews_per_month
dtypes: float64(22), int64(21), object(63)
memory usage: 10.2+ MB


In [12]:
cols_to_bool = ['host_is_superhost','host_has_profile_pic','host_identity_verified','has_availability','instant_bookable',
                'is_business_travel_ready','require_guest_profile_picture','require_guest_phone_verification']
cols_to_price = ['price','weekly_price','monthly_price','security_deposit','cleaning_fee','extra_people']
cols_to_datetime = ['last_scraped', 'first_review','last_review']
cols_to_perc = ['host_response_rate','host_acceptance_rate']
cols_to_str = ['id','host_id']

In [13]:
df_dropped[cols_to_bool] = (df_dropped[cols_to_bool] == 't')
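
One thing to keep in mind with the `== 't'` comparison is that a missing value becomes `False`, which makes it indistinguishable from a real `'f'`. The sketch below contrasts this with mapping to the nullable `boolean` dtype (available since pandas 1.0). This alternative is an assumption on my part and is not what the notebook does; the toy series is illustrative.

```python
import pandas as pd

# Toy flag column with one missing entry (illustrative values).
flags = pd.Series(["t", "f", None, "t"], name="host_is_superhost")

# The comparison used in the notebook: missing values silently become False.
as_plain_bool = flags == "t"

# Alternative: map to the nullable boolean dtype so missing stays missing.
as_nullable = flags.map({"t": True, "f": False}).astype("boolean")

print(as_plain_bool.tolist())
print(as_nullable.isna().sum())
```

Whether this distinction matters depends on how many values are missing in each flag column, which the `info()` output can tell you.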

In [14]:
def clean_currency(x):
    if isinstance(x, str):
        return x.replace('$','').replace(',','')
    else: return x
for col in cols_to_price:
    df_dropped[col] = df_dropped[col].apply(clean_currency).astype('float')
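
As an alternative to the row-wise `apply`, the same conversion can be sketched with pandas' vectorized string methods. This is only an illustration on a toy series, assuming the prices always look like `'$1,200.00'`; it is not the code used to produce the results in this notebook.

```python
import pandas as pd

# Toy price column with one missing entry (illustrative values).
prices = pd.Series(["$1,200.00", "$85.00", None, "$0.00"])

# Strip '$' and ',' in one vectorized pass; missing values stay NaN.
stripped = prices.str.replace(r"[$,]", "", regex=True)

# Convert the remaining digit strings to floats.
cleaned = pd.to_numeric(stripped)

print(cleaned)
```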

I am converting datetime columns from a string to datetime type.

In [15]:
df_dropped[cols_to_datetime] = df_dropped[cols_to_datetime].apply(lambda x: pd.to_datetime(x))
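
If a date column ever contained a malformed value, `pd.to_datetime` would raise by default. The sketch below shows the `errors="coerce"` option, which turns unparseable entries into `NaT` instead. The toy series is made up, and I am not claiming the real columns contain such values.

```python
import pandas as pd

# Toy date column: one valid date, one missing value, one malformed string.
dates = pd.Series(["2019-04-03", None, "not a date"])

# errors="coerce" converts anything unparseable to NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")

print(parsed.isna().tolist())
```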

I am converting the percentage columns from the string form '90%' to the float 0.9.

In [16]:
def clean_perc(x):
    if isinstance(x, str):
        return x.replace('%','')
    else: return x
for col in cols_to_perc:
    df_dropped[col] = df_dropped[col].apply(clean_perc).astype('float')/100
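
After the conversion, a quick range check can confirm that every rate is a fraction between 0 and 1. Here is a small sketch on a toy series, using a vectorized `str.rstrip` in place of `clean_perc`; the values are illustrative.

```python
import pandas as pd

# Toy rate column with one missing entry (illustrative values).
rates = pd.Series(["90%", "100%", None, "0%"])

# Remove the trailing '%' and rescale to a fraction.
as_fraction = pd.to_numeric(rates.str.rstrip("%")) / 100

# Every non-missing value should now lie in [0, 1].
all_in_range = bool(as_fraction.dropna().between(0, 1).all())

print(as_fraction.tolist(), all_in_range)
```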

I am changing id columns to string because the numerical values of these columns are meaningless.

In [17]:
for col in cols_to_str:
    df_dropped[col] = df_dropped[col].astype(str)

In [18]:
df_dropped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7275
Data columns (total 83 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   id                                            12074 non-null  object        
 1   last_scraped                                  12074 non-null  datetime64[ns]
 2   name                                          12074 non-null  object        
 3   summary                                       11730 non-null  object        
 4   space                                         10064 non-null  object        
 5   description                                   11947 non-null  object        
 6   neighborhood_overview                         8759 non-null   object        
 7   notes                                         6860 non-null   object        
 8   transit                                       8106 non-null   object        

In [19]:
df_dropped.describe()

|  | host_response_rate | host_acceptance_rate | host_listings_count | host_total_listings_count | latitude | longitude | accommodates | bathrooms | bedrooms | beds | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 10364.0 | 8516.0 | 12070.0 | 12070.0 | 12074.0 | 12074.0 | 12074.0 | 12064.0 | 12065.0 | 12007.0 | ... | 8762.0 | 8759.0 | 8763.0 | 8759.0 | 8759.0 | 12074.0 | 12074.0 | 12074.0 | 12074.0 | 8891.0 |
| mean | 0.949767 | 0.880082 | 115.239685 | 115.239685 | 37.766015 | -122.429654 | 3.15347 | 1.409773 | 1.351513 | 1.746315 | ... | 9.569276 | 9.818587 | 9.783864 | 9.678845 | 9.352552 | 25.947656 | 20.978218 | 4.004721 | 0.779443 | 1.641264 |
| std | 0.146818 | 0.19644 | 428.131384 | 428.131384 | 0.023899 | 0.02705 | 1.983336 | 0.923997 | 0.963192 | 1.263385 | ... | 0.894532 | 0.67181 | 0.738809 | 0.742596 | 0.928804 | 64.851421 | 62.305973 | 10.888533 | 4.03521 | 1.936059 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 37.70417 | -122.51306 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 1.0 | 0.82 | 1.0 | 1.0 | 37.751243 | -122.442717 | 2.0 | 1.0 | 1.0 | 1.0 | ... | 9.0 | 10.0 | 10.0 | 10.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.26 |
| 50% | 1.0 | 0.98 | 2.0 | 2.0 | 37.77005 | -122.42372 | 2.0 | 1.0 | 1.0 | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.9 |
| 75% | 1.0 | 1.0 | 12.0 | 12.0 | 37.78585 | -122.41036 | 4.0 | 1.5 | 2.0 | 2.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | 2.0 | 2.0 | 0.0 | 2.41 |
| max | 1.0 | 1.0 | 2347.0 | 2347.0 | 37.82879 | -122.36702 | 16.0 | 14.0 | 30.0 | 30.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 301.0 | 301.0 | 87.0 | 36.0 | 31.02 |


### Wrap the process up as a pipeline

In [21]:
def clean_data(df):
    '''
    INPUT: The dataframe containing unprocessed data (106 columns).
    OUTPUT: A dataframe containing the cleaned data (83 columns, with boolean, percentage, currency, etc. in the correct data type).
    Relies on the helper functions clean_currency and clean_perc defined above.
    '''
    # Define the columns that are no longer needed because they are duplicate, empty, or meaningless
    drop_list = ['listing_url','scrape_id','experiences_offered','thumbnail_url','medium_url','xl_picture_url',
             'host_thumbnail_url', 'neighbourhood_group_cleansed', 'city','state','market','smart_location',
            'country_code', 'country', 'square_feet', 'minimum_minimum_nights', 'maximum_minimum_nights',
             'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm','maximum_nights_avg_ntm',
             'calendar_last_scraped','jurisdiction_names']
    df_dropped = df.drop(labels = drop_list, axis = 1)
    
    # Define the columns that need to adjust data type
    cols_to_bool = ['host_is_superhost','host_has_profile_pic','host_identity_verified','has_availability','instant_bookable',
                'is_business_travel_ready','require_guest_profile_picture','require_guest_phone_verification']
    cols_to_price = ['price','weekly_price','monthly_price','security_deposit','cleaning_fee','extra_people']
    cols_to_datetime = ['last_scraped', 'first_review','last_review']
    cols_to_perc = ['host_response_rate','host_acceptance_rate']
    cols_to_str = ['id','host_id']

    # change boolean values from 't' or 'f' to boolean type
    df_dropped[cols_to_bool] = (df_dropped[cols_to_bool] == 't')

    # change price from the form '$1,200.00' to a float 1200
    for col in cols_to_price:
        df_dropped[col] = df_dropped[col].apply(clean_currency).astype('float')
    
    # change datetime from string form to datetime type
    df_dropped[cols_to_datetime] = df_dropped[cols_to_datetime].apply(lambda x: pd.to_datetime(x))

    # change percentage data from form '90%' to float value 0.9
    for col in cols_to_perc:
        df_dropped[col] = df_dropped[col].apply(clean_perc).astype('float')/100

    # change id columns to string because their numerical values are meaningless
    for col in cols_to_str:
        df_dropped[col] = df_dropped[col].astype(str)

    return df_dropped
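
Once the cleaning is wrapped in a function, a small dtype check makes it easy to confirm the pipeline did what it promises. The helper `check_cleaned_dtypes` below is a hypothetical sketch that is not part of the original notebook. It is shown on a toy frame where one price column is deliberately left uncleaned.

```python
import pandas as pd
from pandas.api import types as ptypes

def check_cleaned_dtypes(frame, bool_cols, float_cols, datetime_cols):
    """Return a list of (column, problem) pairs; an empty list means all checks passed."""
    problems = []
    for col in bool_cols:
        if not ptypes.is_bool_dtype(frame[col]):
            problems.append((col, "expected bool"))
    for col in float_cols:
        if not ptypes.is_float_dtype(frame[col]):
            problems.append((col, "expected float"))
    for col in datetime_cols:
        if not ptypes.is_datetime64_any_dtype(frame[col]):
            problems.append((col, "expected datetime"))
    return problems

# Toy frame with illustrative values.
toy = pd.DataFrame({
    "host_is_superhost": [True, False],
    "price": [120.0, 85.0],
    "last_scraped": pd.to_datetime(["2020-04-07", "2020-03-13"]),
    "cleaning_fee": ["$50.00", "$30.00"],  # deliberately left as strings
})

problems = check_cleaned_dtypes(
    toy,
    bool_cols=["host_is_superhost"],
    float_cols=["price", "cleaning_fee"],
    datetime_cols=["last_scraped"],
)
print(problems)
```

On the real data, the same function could be called with `cols_to_bool`, `cols_to_price + cols_to_perc`, and `cols_to_datetime` on the output of `clean_data`.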

In [22]:
def load_and_clean():
    df = load_data()
    df_dropped = clean_data(df)
    return df_dropped
df_dropped = load_and_clean()
df_dropped.info()

100%|██████████| 14/14 [00:36<00:00,  2.62s/it]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7275
Data columns (total 83 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   id                                            12074 non-null  object        
 1   last_scraped                                  12074 non-null  datetime64[ns]
 2   name                                          12074 non-null  object        
 3   summary                                       11730 non-null  object        
 4   space                                         10064 non-null  object        
 5   description                                   11947 non-null  object        
 6   neighborhood_overview                         8759 non-null   object        
 7   notes                                         6860 non-null   object        
 8   transit            

In [142]:
df_dropped.to_pickle('airbnb_SF_2019_04_to_2020_04.pkl')
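
Pickle is a convenient format here because, unlike CSV, it keeps the cleaned dtypes (strings, floats, datetimes) intact when the data is loaded again. Here is a minimal round-trip sketch on a toy frame written to a temporary directory; the file name and values are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy cleaned frame with illustrative values.
toy = pd.DataFrame({
    "id": ["1", "2"],
    "price": [170.0, 235.0],
    "last_scraped": pd.to_datetime(["2020-04-07", "2020-04-07"]),
})

# Write to a temporary location and read the frame back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_listings.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The dtypes survive the round trip unchanged.
print(restored.dtypes.equals(toy.dtypes))
```

Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.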