# Predicting Airbnb Prices

## Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Jakarta is the capital and largest city of Indonesia, home to over 10 million people at a population density of 14,464 people per square kilometer. It consists of five administrative cities (Central Jakarta, West Jakarta, South Jakarta, East Jakarta, and North Jakarta) and one administrative regency, the Thousand Islands. In this project, I limit the scope of my research to the districts of Central Jakarta.

As the capital, Jakarta is one of the most popular destinations in Indonesia, so property owners can potentially earn substantial income by listing a home on Airbnb there. However, it is hard for a new host to decide what nightly rate to charge. This research addresses that problem by predicting a suitable nightly rate with a machine learning model trained on data from Airbnb listings.

## Data <a name="data"></a>

### Import Necessary Libraries

In [1]:
import pandas as pd # Library for data analysis
import numpy as np # Library to handle data in a vectorized manner
import requests # Library to handle requests
from pandas import json_normalize # Function to flatten JSON into a pandas dataframe (top-level import since pandas 1.0)

Define location queries.

In [2]:
locations = ['Cempaka+Putih+Barat', 'Cempaka+Putih+Timur', 'Rawasari', 'Cideng', 'Duri+Pulo', 'Gambir', 'Kebon+Kelapa', 'Petojo+Selatan', 'Petojo+Utara', 'Galur', 'Johar+Baru', 'Kampung+Rawa', 'Tanah+Tinggi', 'Cempaka+Baru', 'Gunung+Sahari+Selatan', 'Harapan+Mulya', 'Kebon+Kosong', 'Kemayoran', 'Serdang', 'Sumur+Batu', 'Utan+Panjang', 'Cikini', 'Gondangdia', 'Kebon+Sirih', 'Menteng', 'Pegangsaan', 'Gunung+Sahari+Utara', 'Karang+Anyar', 'Kartini', 'Mangga+Dua+Selatan', 'Pasar+Baru', 'Bungur', 'Kenari', 'Kramat', 'Kwitang', 'Paseban', 'Senen', 'Bendungan+Hilir', 'Gelora', 'Kampung+Bali', 'Karet+Tengsin', 'Kebon+Kacang', 'Kebon+Melati', 'Petamburan']
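
The `+` characters in these queries are URL-encoded spaces. As a minimal sketch, the standard library's `urllib.parse.quote_plus` can produce that form from readable names. The `place_names` list here is a hypothetical three-item subset used only for illustration, not the full list.

```python
from urllib.parse import quote_plus, unquote_plus

# Hypothetical subset of readable names (illustration only)
place_names = ['Cempaka Putih Barat', 'Gambir', 'Gunung Sahari Selatan']

# quote_plus encodes each space as '+', matching the query format used here
encoded = [quote_plus(name) for name in place_names]
print(encoded)

# unquote_plus reverses the encoding, e.g. for display labels
decoded = [unquote_plus(query) for query in encoded]
print(decoded)
```

Keeping the readable names as the source of truth would also remove the need to call `replace('+', ' ')` later when labelling rows.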

### Use Airbnb API to Get Listings Data

Create a function that requests listings from the Airbnb API for each location and extracts the relevant fields from the response.

In [3]:
def getListings(locations=()):  # Immutable default avoids the shared mutable-default pitfall
    
    _FORMAT = 'for_explore_search_web'
    ITEMS_PER_GRID = '300'
    KEY = 'd306zoyjsyarp7ifhu67rjxn52tv0t20'
    SECTION_OFFSET = '4'
    SUPPORT_FOR_YOU_V3 = 'true'
    TAB_ID = 'home_tab'
    TIMEZONE_OFFSET = '300'
    VERSION = '1.3.4'
    CURRENCY = 'IDR'

    listings_list = []

    for location in locations:

        # Create the API request URL
        url = 'https://api.airbnb.com/v2/explore_tabs?_format={}&items_per_grid={}&key={}&location={}&section_offset={}&supports_for_you_v3={}&tab_id={}' \
              '&timezone_offset={}&version={}&currency={}'.format(
            _FORMAT, 
            ITEMS_PER_GRID, 
            KEY, 
            location, 
            SECTION_OFFSET, 
            SUPPORT_FOR_YOU_V3, 
            TAB_ID, 
            TIMEZONE_OFFSET, 
            VERSION, 
            CURRENCY)
        
        # Make the GET request and fail early on an HTTP error
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        results = response.json()['explore_tabs'][0]['sections'][0]['listings']
        
        # Keep only the relevant fields of each listing for this location
        listings_list.append([(
            listing['listing'].get('id', np.nan),
            location.replace('+', ' '),
            listing['listing'].get('lat', np.nan),
            listing['listing'].get('lng', np.nan),
            listing['listing'].get('person_capacity', np.nan),
            listing['listing'].get('bathrooms', np.nan),
            listing['listing'].get('bedrooms', np.nan),
            listing['listing'].get('beds', np.nan),
            listing['listing'].get('reviews_count', np.nan),
            listing['listing'].get('room_type', np.nan),
            listing['listing'].get('avg_rating', np.nan),
            listing['listing'].get('min_nights', np.nan),
            listing['listing'].get('max_nights', np.nan),
            listing['pricing_quote']['rate'].get('amount', np.nan),
            listing['pricing_quote'].get('rate_type', np.nan)
        ) for listing in results])
    
    # Flatten the per-location lists into a single dataframe
    listings = pd.DataFrame([item for listing_list in listings_list for item in listing_list])
    listings.columns = ['Listing ID', 'Location', 'Latitude', 'Longitude', 'Number of Guests', 'Bathrooms', 'Bedrooms',
                        'Beds', 'Review Count', 'Room Type', 'Rating', 'Minimum Nights', 'Maximum Nights', 'Rate (Rp)', 'Rate Type']

    return listings
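
To see what the extraction step does without calling the API, here is a sketch run against a hand-written payload. The `sample_response` dictionary is an assumption that mirrors only the keys the function reads (`explore_tabs`, `sections`, `listings`, `listing`, `pricing_quote`). The real response has many more fields and its layout may differ. It uses `math.nan` in place of `np.nan` so that it needs only the standard library.

```python
import math

# Hypothetical payload that mirrors only the keys read by the function
sample_response = {
    'explore_tabs': [{
        'sections': [{
            'listings': [
                {'listing': {'id': 1, 'lat': -6.19, 'lng': 106.85, 'bedrooms': 1},
                 'pricing_quote': {'rate': {'amount': 270000.0}, 'rate_type': 'nightly'}},
                # This listing has no 'bedrooms' key
                {'listing': {'id': 2, 'lat': -6.20, 'lng': 106.86},
                 'pricing_quote': {'rate': {'amount': 200000.0}, 'rate_type': 'nightly'}},
            ]
        }]
    }]
}

results = sample_response['explore_tabs'][0]['sections'][0]['listings']

# dict.get returns the fallback (NaN here) when a key is absent
rows = [(
    item['listing'].get('id', math.nan),
    item['listing'].get('bedrooms', math.nan),
    item['pricing_quote']['rate'].get('amount', math.nan),
) for item in results]
print(rows)
```

Note that only the innermost lookups are guarded by `.get`. A listing without a `pricing_quote` or `rate` key would still raise a `KeyError`.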

Run the function above over all locations to create the listings dataframe.

In [4]:
listings_df = getListings(locations=locations)

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(2104, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18378734 | Cempaka Putih Barat | -6.19312 | 106.85025 | 2 | 1.0 | 0.0 | 1.0 | 144 | Entire home/apt | 4.67 | 2 | 1125 | 270000.0 | nightly |
| 1 | 32297852 | Cempaka Putih Barat | -6.19104 | 106.87336 | 2 | 1.0 | 0.0 | 1.0 | 20 | Entire home/apt | 5.0 | 2 | 360 | 169225.0 | nightly |
| 2 | 19820937 | Cempaka Putih Barat | -6.19815 | 106.85105 | 2 | 1.0 | 0.0 | 2.0 | 70 | Entire home/apt | 4.59 | 4 | 30 | 166348.0 | nightly |
| 3 | 40735320 | Cempaka Putih Barat | -6.18998 | 106.87367 | 2 | 1.0 | 0.0 | 1.0 | 4 | Entire home/apt | 5.0 | 7 | 1125 | 200000.0 | nightly |
| 4 | 43812726 | Cempaka Putih Barat | -6.18463 | 106.87085 | 2 | 1.0 | 0.0 | NaN | 0 | Entire home/apt | NaN | 1 | 30 | 240000.0 | nightly |


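The same listing can appear in the results of more than one location query, so drop duplicate rows by Listing ID and keep the first occurrence.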

In [5]:
# Sort dataframe according to Location
listings_df.sort_values('Location', inplace=True)

# Dropping all duplicate values except the first value
listings_df.drop_duplicates(subset='Listing ID', keep='first', inplace=True)

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(1043, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1780 | 14223825 | Bendungan Hilir | -6.21361 | 106.81209 | 2 | 1.0 | 1.0 | 1.0 | 0 | Private room | NaN | 7 | 1125 | 260000.0 | nightly |
| 1808 | 21949711 | Bendungan Hilir | -6.19742 | 106.81673 | 2 | 1.0 | 1.0 | 1.0 | 80 | Entire home/apt | 4.85 | 1 | 365 | 383428.0 | nightly |
| 1809 | 12095585 | Bendungan Hilir | -6.22894 | 106.80621 | 3 | 1.0 | 1.0 | 1.0 | 19 | Entire home/apt | 4.76 | 1 | 1125 | 600000.0 | nightly |
| 1810 | 42779832 | Bendungan Hilir | -6.19506 | 106.81784 | 3 | 1.0 | 2.0 | 2.0 | 0 | Entire home/apt | NaN | 2 | 1125 | 270000.0 | nightly |
| 1811 | 39062975 | Bendungan Hilir | -6.20995 | 106.81957 | 4 | 1.0 | 2.0 | 2.0 | 22 | Entire home/apt | 4.95 | 3 | 1125 | 450000.0 | nightly |
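
The effect of sorting before dropping duplicates can be seen on a small example. This is a sketch with made-up listing IDs in a hypothetical `toy` dataframe, not data from the API. Because the rows are sorted by Location first, `keep='first'` retains the alphabetically first location for each duplicated ID.

```python
import pandas as pd

# Toy data: listing 101 was returned by two different location queries
toy = pd.DataFrame({
    'Listing ID': [101, 102, 101, 103],
    'Location': ['Menteng', 'Gambir', 'Cikini', 'Senen'],
})

# Sort by Location, then keep the first row seen for each Listing ID
deduped = toy.sort_values('Location').drop_duplicates(subset='Listing ID', keep='first')
print(deduped)
```

Here listing 101 ends up labelled `Cikini` rather than `Menteng`, so the Location column of a deduplicated listing reflects sort order rather than the listing's actual neighbourhood.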


Evaluate missing data

In [6]:
missing_data = listings_df.isnull()

for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print('')

Listing ID
False    1043
Name: Listing ID, dtype: int64

Location
False    1043
Name: Location, dtype: int64

Latitude
False    1043
Name: Latitude, dtype: int64

Longitude
False    1043
Name: Longitude, dtype: int64

Number of Guests
False    1043
Name: Number of Guests, dtype: int64

Bathrooms
False    1042
True        1
Name: Bathrooms, dtype: int64

Bedrooms
False    1035
True        8
Name: Bedrooms, dtype: int64

Beds
False    1029
True       14
Name: Beds, dtype: int64

Review Count
False    1043
Name: Review Count, dtype: int64

Room Type
False    1043
Name: Room Type, dtype: int64

Rating
True     552
False    491
Name: Rating, dtype: int64

Minimum Nights
False    1043
Name: Minimum Nights, dtype: int64

Maximum Nights
False    1043
Name: Maximum Nights, dtype: int64

Rate (Rp)
False    1043
Name: Rate (Rp), dtype: int64

Rate Type
False    1043
Name: Rate Type, dtype: int64
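
A more compact view of the same information is the per-column count and share of missing values. The sketch below uses a small hypothetical `toy` dataframe rather than the listings data.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two columns
toy = pd.DataFrame({
    'Beds': [1.0, np.nan, 2.0, 1.0],
    'Rating': [4.8, np.nan, np.nan, np.nan],
})

# isnull() gives a boolean frame; sum() counts True values, mean() gives their share
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()
print(pd.DataFrame({'missing': missing_count, 'share': missing_share}))
```

Applied to the output above, the Rating column has 552 missing values out of 1,043 rows, a share of about 0.53.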



More than half of the Rating values are missing (552 of 1,043 rows, about 53%), so drop the rows that have no rating.

In [7]:
listings_df.dropna(subset=['Rating'], axis=0, inplace=True)

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(491, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1808 | 21949711 | Bendungan Hilir | -6.19742 | 106.81673 | 2 | 1.0 | 1.0 | 1.0 | 80 | Entire home/apt | 4.85 | 1 | 365 | 383428.0 | nightly |
| 1809 | 12095585 | Bendungan Hilir | -6.22894 | 106.80621 | 3 | 1.0 | 1.0 | 1.0 | 19 | Entire home/apt | 4.76 | 1 | 1125 | 600000.0 | nightly |
| 1811 | 39062975 | Bendungan Hilir | -6.20995 | 106.81957 | 4 | 1.0 | 2.0 | 2.0 | 22 | Entire home/apt | 4.95 | 3 | 1125 | 450000.0 | nightly |
| 1777 | 12575863 | Bendungan Hilir | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |
| 1776 | 2317417 | Bendungan Hilir | -6.20933 | 106.81047 | 2 | 1.0 | 1.0 | 1.0 | 12 | Entire home/apt | 4.82 | 4 | 365 | 454433.0 | nightly |
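
Note that `dropna(subset=[...])` inspects only the listed columns. The following sketch, on a hypothetical `toy` dataframe, shows that a gap in another column survives the call.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Listing ID': [1, 2, 3],
    'Beds': [np.nan, 1.0, 2.0],
    'Rating': [4.9, np.nan, 4.7],
})

# Only the row with a missing Rating is removed; the missing Beds value in the first row stays
rated = toy.dropna(subset=['Rating'])
print(rated)
```

This is why it is worth checking the remaining columns for missing values again after the drop.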


Evaluate missing data once again

In [8]:
missing_data = listings_df.isnull()

for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print('')

Listing ID
False    491
Name: Listing ID, dtype: int64

Location
False    491
Name: Location, dtype: int64

Latitude
False    491
Name: Latitude, dtype: int64

Longitude
False    491
Name: Longitude, dtype: int64

Number of Guests
False    491
Name: Number of Guests, dtype: int64

Bathrooms
False    491
Name: Bathrooms, dtype: int64

Bedrooms
False    491
Name: Bedrooms, dtype: int64

Beds
False    491
Name: Beds, dtype: int64

Review Count
False    491
Name: Review Count, dtype: int64

Room Type
False    491
Name: Room Type, dtype: int64

Rating
False    491
Name: Rating, dtype: int64

Minimum Nights
False    491
Name: Minimum Nights, dtype: int64

Maximum Nights
False    491
Name: Maximum Nights, dtype: int64

Rate (Rp)
False    491
Name: Rate (Rp), dtype: int64

Rate Type
False    491
Name: Rate Type, dtype: int64



Drop the listings whose Room Type is `Shared room`.

In [9]:
listings_df = listings_df[listings_df['Room Type'] != 'Shared room']

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(488, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1808 | 21949711 | Bendungan Hilir | -6.19742 | 106.81673 | 2 | 1.0 | 1.0 | 1.0 | 80 | Entire home/apt | 4.85 | 1 | 365 | 383428.0 | nightly |
| 1809 | 12095585 | Bendungan Hilir | -6.22894 | 106.80621 | 3 | 1.0 | 1.0 | 1.0 | 19 | Entire home/apt | 4.76 | 1 | 1125 | 600000.0 | nightly |
| 1811 | 39062975 | Bendungan Hilir | -6.20995 | 106.81957 | 4 | 1.0 | 2.0 | 2.0 | 22 | Entire home/apt | 4.95 | 3 | 1125 | 450000.0 | nightly |
| 1777 | 12575863 | Bendungan Hilir | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |
| 1776 | 2317417 | Bendungan Hilir | -6.20933 | 106.81047 | 2 | 1.0 | 1.0 | 1.0 | 12 | Entire home/apt | 4.82 | 4 | 365 | 454433.0 | nightly |
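
The filter works through a boolean mask. As a sketch on a hypothetical `toy` dataframe, counting the room types first shows how many rows the mask will remove.

```python
import pandas as pd

toy = pd.DataFrame({
    'Listing ID': [1, 2, 3, 4],
    'Room Type': ['Entire home/apt', 'Shared room', 'Private room', 'Entire home/apt'],
})

# Count listings per room type before filtering
print(toy['Room Type'].value_counts())

# The comparison yields one boolean per row; indexing with it keeps the True rows
mask = toy['Room Type'] != 'Shared room'
kept = toy[mask]
print(kept)
```

The comparison is case-sensitive, so the string must match the API's spelling, `Shared room`, exactly.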


Reset index

In [10]:
listings_df.reset_index(drop=True, inplace=True)

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(488, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21949711 | Bendungan Hilir | -6.19742 | 106.81673 | 2 | 1.0 | 1.0 | 1.0 | 80 | Entire home/apt | 4.85 | 1 | 365 | 383428.0 | nightly |
| 1 | 12095585 | Bendungan Hilir | -6.22894 | 106.80621 | 3 | 1.0 | 1.0 | 1.0 | 19 | Entire home/apt | 4.76 | 1 | 1125 | 600000.0 | nightly |
| 2 | 39062975 | Bendungan Hilir | -6.20995 | 106.81957 | 4 | 1.0 | 2.0 | 2.0 | 22 | Entire home/apt | 4.95 | 3 | 1125 | 450000.0 | nightly |
| 3 | 12575863 | Bendungan Hilir | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |
| 4 | 2317417 | Bendungan Hilir | -6.20933 | 106.81047 | 2 | 1.0 | 1.0 | 1.0 | 12 | Entire home/apt | 4.82 | 4 | 365 | 454433.0 | nightly |
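
The `drop=True` argument matters here. The sketch below, on a two-row hypothetical `toy` dataframe, contrasts it with the default behaviour, which keeps the old labels as a new `index` column.

```python
import pandas as pd

# Two rows carrying leftover index labels from earlier filtering
toy = pd.DataFrame({'Listing ID': [21949711, 12095585]}, index=[1808, 1809])

# drop=True discards the old labels and renumbers from 0
renumbered = toy.reset_index(drop=True)

# The default keeps the old labels in a column named 'index'
with_old = toy.reset_index()

print(renumbered)
print(with_old)
```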


Check data types for each column

In [11]:
listings_df.dtypes

Listing ID            int64
Location             object
Latitude            float64
Longitude           float64
Number of Guests      int64
Bathrooms           float64
Bedrooms            float64
Beds                float64
Review Count          int64
Room Type            object
Rating              float64
Minimum Nights        int64
Maximum Nights        int64
Rate (Rp)           float64
Rate Type            object
dtype: object
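
Bedrooms and Beds are stored as `float64` even though they are counts, most likely because they contained missing values when the dataframe was built. As an optional sketch, assuming both columns hold only whole numbers, they could be cast to integers now that no values are missing. The `toy` dataframe below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'Bedrooms': [1.0, 2.0],
    'Beds': [1.0, 2.0],
    'Bathrooms': [1.0, 1.5],
})

# Cast the count columns to integers; Bathrooms can be fractional (1.5) and stays float64
converted = toy.astype({'Bedrooms': 'int64', 'Beds': 'int64'})
print(converted.dtypes)
```

The cast raises an error if any value is still missing, so it only works after the missing-data step.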