# Predicting Airbnb Prices

## Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Jakarta is the capital and largest city of Indonesia. Over 10 million people live there, at a population density of 14,464 people per square kilometer. The city consists of five administrative cities (Central, West, South, East and North Jakarta) and one administrative regency, the Thousand Islands. In this project, I limit the scope of my research to the districts of Central Jakarta.

As the capital, Jakarta is one of the most popular destinations in Indonesia, so property owners there stand to earn a substantial profit by listing a home on Airbnb. However, it’s hard for a new host to decide what nightly rate to charge. This research aims to solve that problem by predicting a suitable rate with a machine learning model trained on data from Airbnb listings.

## Data <a name="data"></a>

### Import Necessary Libraries

In [1]:
import pandas as pd # Library for data analysis
import numpy as np # Library to handle data in a vectorized manner
import requests # Library to handle requests
from pandas import json_normalize # Function to flatten JSON into a pandas dataframe (top-level import since pandas 1.0)

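Define the location queries, one per area of Central Jakarta, with spaces written as `+` so each string can go straight into a URL.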

In [2]:
locations = ['Cempaka+Putih+Barat,+Central+Jakarta', 'Cempaka+Putih+Timur,+Central+Jakarta', 'Rawasari,+Central+Jakarta', 'Cideng,+Central+Jakarta', 'Duri+Pulo,+Central+Jakarta', 'Gambir,+Central+Jakarta', 'Kebon+Kelapa,+Central+Jakarta', 'Petojo+Selatan,+Central+Jakarta', 'Petojo+Utara,+Central+Jakarta', 'Galur,+Central+Jakarta', 'Johar+Baru,+Central+Jakarta', 'Kampung+Rawa,+Central+Jakarta', 'Tanah+Tinggi,+Central+Jakarta', 'Cempaka+Baru,+Central+Jakarta', 'Gunung+Sahari+Selatan,+Central+Jakarta', 'Harapan+Mulya,+Central+Jakarta', 'Kebon+Kosong,+Central+Jakarta', 'Kemayoran,+Central+Jakarta', 'Serdang,+Central+Jakarta', 'Sumur+Batu,+Central+Jakarta', 'Utan+Panjang,+Central+Jakarta', 'Cikini,+Central+Jakarta', 'Gondangdia,+Central+Jakarta', 'Kebon+Sirih,+Central+Jakarta', 'Menteng,+Central+Jakarta', 'Pegangsaan,+Central+Jakarta', 'Gunung+Sahari+Utara,+Central+Jakarta', 'Karang+Anyar,+Central+Jakarta', 'Kartini,+Central+Jakarta', 'Mangga+Dua+Selatan,+Central+Jakarta', 'Pasar+Baru,+Central+Jakarta', 'Bungur,+Central+Jakarta', 'Kenari,+Central+Jakarta', 'Kramat,+Central+Jakarta', 'Kwitang,+Central+Jakarta', 'Paseban,+Central+Jakarta', 'Senen,+Central+Jakarta', 'Bendungan+Hilir,+Central+Jakarta', 'Gelora,+Central+Jakarta', 'Kampung+Bali,+Central+Jakarta', 'Karet+Tengsin,+Central+Jakarta', 'Kebon+Kacang,+Central+Jakarta', 'Kebon+Melati,+Central+Jakarta', 'Petamburan,+Central+Jakarta']
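The query strings above were written by hand. As a minimal sketch, the same format could be generated from plain place names with the standard library; `to_query` and `example_places` are hypothetical names used only for illustration.

```python
from urllib.parse import quote_plus


def to_query(place: str) -> str:
    """Hypothetical helper: encode a place name for use in a URL query."""
    # quote_plus turns spaces into '+'; safe=',' keeps commas literal,
    # which matches the hand-written entries in the locations list.
    return quote_plus(place, safe=',')


example_places = ['Cempaka Putih Barat, Central Jakarta', 'Menteng, Central Jakarta']
example_queries = [to_query(place) for place in example_places]
print(example_queries)
```

Generating the strings this way avoids typos in a long hand-typed list, at the cost of one extra step.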

### Use Airbnb API to Get Listings Data

Create a function that requests the listings for each location from the Airbnb API and extracts the relevant fields from the response.

In [3]:
def getListings(locations=()):  # Immutable default avoids the shared mutable-default pitfall
    
    _FORMAT = 'for_explore_search_web'
    ITEMS_PER_GRID = '300'
    KEY = 'd306zoyjsyarp7ifhu67rjxn52tv0t20'
    SECTION_OFFSET = '4'
    SUPPORT_FOR_YOU_V3 = 'true'
    TAB_ID = 'home_tab'
    TIMEZONE_OFFSET = '300'
    VERSION = '1.3.4'
    CURRENCY = 'IDR'

    listings_list = []

    for location in locations:

        # Create the API request URL
        url = 'https://api.airbnb.com/v2/explore_tabs?_format={}&items_per_grid={}&key={}&location={}&section_offset={}&supports_for_you_v3={}&tab_id={}' \
              '&timezone_offset={}&version={}&currency={}'.format(
            _FORMAT, 
            ITEMS_PER_GRID, 
            KEY, 
            location, 
            SECTION_OFFSET, 
            SUPPORT_FOR_YOU_V3, 
            TAB_ID, 
            TIMEZONE_OFFSET, 
            VERSION, 
            CURRENCY)
        
        # Make the GET request
        results = requests.get(url).json()['explore_tabs'][0]['sections'][0]['listings']
        
        # Return only relevant information for each location
        listings_list.append([(
            listing['listing'].get('id', np.nan),
            location.replace('+', ' '),
            listing['listing'].get('lat', np.nan),
            listing['listing'].get('lng', np.nan),
            listing['listing'].get('person_capacity', np.nan),
            listing['listing'].get('bathrooms', np.nan),
            listing['listing'].get('bedrooms', np.nan),
            listing['listing'].get('beds', np.nan),
            listing['listing'].get('reviews_count', np.nan),
            listing['listing'].get('room_type', np.nan),
            listing['listing'].get('avg_rating', np.nan),
            listing['listing'].get('min_nights', np.nan),
            listing['listing'].get('max_nights', np.nan),
            listing['pricing_quote']['rate'].get('amount', np.nan),
            listing['pricing_quote'].get('rate_type', np.nan)
        ) for listing in results])
    
    # Flatten the per-location lists into one dataframe with named columns
    listings = pd.DataFrame(
        [item for listing_list in listings_list for item in listing_list],
        columns=['Listing ID', 'Location', 'Latitude', 'Longitude', 'Number of Guests', 'Bathrooms', 'Bedrooms',
                 'Beds', 'Review Count', 'Room Type', 'Rating', 'Minimum Nights', 'Maximum Nights', 'Rate (Rp)', 'Rate Type'])

    return listings
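The function above indexes `listing['pricing_quote']['rate']` directly, so a result without a price would raise a `KeyError`. The sketch below shows one way the extraction could tolerate missing keys. It is not the notebook's method: `extract_row` is a hypothetical helper, only a few of the fields are shown, and the two payloads are made-up dictionaries shaped like the fields read above, not real API output.

```python
import math


def extract_row(item, location):
    """Hypothetical helper: flatten one search result, tolerating missing keys."""
    listing = item.get('listing') or {}
    quote = item.get('pricing_quote') or {}
    rate = quote.get('rate') or {}
    return (
        listing.get('id', math.nan),
        location.replace('+', ' '),
        listing.get('room_type', math.nan),
        rate.get('amount', math.nan),
        quote.get('rate_type', math.nan),
    )


# Made-up payloads for illustration only.
complete = {
    'listing': {'id': 1, 'room_type': 'Private room'},
    'pricing_quote': {'rate': {'amount': 150000.0}, 'rate_type': 'nightly'},
}
no_price = {'listing': {'id': 2, 'room_type': 'Entire home/apt'}}

rows = [extract_row(item, 'Gambir,+Central+Jakarta') for item in (complete, no_price)]
print(rows)
```

Rows with a missing price come back with NaN in the rate fields, so they can be dropped later with `dropna` instead of stopping the whole request loop.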

Run the function on all locations to create the listings dataframe.

In [4]:
listings_df = getListings(locations=locations)

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(2154, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18378734 | Cempaka Putih Barat, Central Jakarta | -6.19312 | 106.85025 | 2 | 1.0 | 0.0 | 1.0 | 144 | Entire home/apt | 4.67 | 2 | 1125 | 270000.0 | nightly |
| 1 | 32297852 | Cempaka Putih Barat, Central Jakarta | -6.19104 | 106.87336 | 2 | 1.0 | 0.0 | 1.0 | 20 | Entire home/apt | 5.0 | 2 | 360 | 169225.0 | nightly |
| 2 | 19820937 | Cempaka Putih Barat, Central Jakarta | -6.19815 | 106.85105 | 2 | 1.0 | 0.0 | 2.0 | 70 | Entire home/apt | 4.59 | 4 | 30 | 166348.0 | nightly |
| 3 | 40735320 | Cempaka Putih Barat, Central Jakarta | -6.18998 | 106.87367 | 2 | 1.0 | 0.0 | 1.0 | 4 | Entire home/apt | 5.0 | 7 | 1125 | 200000.0 | nightly |
| 4 | 5157038 | Cempaka Putih Barat, Central Jakarta | -6.19481 | 106.85622 | 3 | 1.0 | 1.0 | 1.0 | 75 | Entire home/apt | 4.84 | 4 | 1125 | 335052.0 | nightly |


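Drop duplicate listings. The same listing can be returned by several location queries, so keep only its first occurrence after sorting by Location.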

In [5]:
# Sort dataframe according to Location
listings_df.sort_values('Location', inplace=True)

# Dropping all duplicate values except the first value
listings_df.drop_duplicates(subset='Listing ID', keep='first', inplace=True)

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(725, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1812 | 22987741 | Bendungan Hilir, Central Jakarta | -6.21335 | 106.81134 | 3 | 1.0 | 2.0 | 2.0 | 25 | Entire home/apt | 4.76 | 1 | 1125 | 700000.0 | nightly |
| 1826 | 10344900 | Bendungan Hilir, Central Jakarta | -6.20482 | 106.81693 | 2 | 1.0 | 1.0 | 0.0 | 20 | Entire home/apt | 4.55 | 2 | 1125 | 213544.0 | nightly |
| 1825 | 22183883 | Bendungan Hilir, Central Jakarta | -6.20685 | 106.81307 | 1 | 1.0 | 1.0 | 1.0 | 0 | Private room | NaN | 1 | 1125 | 150000.0 | nightly |
| 1824 | 4619772 | Bendungan Hilir, Central Jakarta | -6.21704 | 106.80577 | 2 | 1.0 | 1.0 | 1.0 | 17 | Entire home/apt | 4.76 | 4 | 1125 | 455560.0 | nightly |
| 1823 | 40116584 | Bendungan Hilir, Central Jakarta | -6.20918 | 106.80547 | 2 | 1.0 | 1.0 | 1.0 | 0 | Private room | NaN | 2 | 60 | 180000.0 | nightly |
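The effect of sorting before de-duplicating can be seen on a toy frame. This is a minimal sketch with made-up IDs and shortened location names, not the real data.

```python
import pandas as pd

# Made-up example: listing 101 was returned by two different location queries.
toy = pd.DataFrame({
    'Listing ID': [101, 102, 101, 103],
    'Location': ['Menteng', 'Gambir', 'Cikini', 'Gambir'],
})

deduped = (
    toy.sort_values('Location')
       .drop_duplicates(subset='Listing ID', keep='first')
)
print(deduped)
```

Listing 101 ends up labeled with the alphabetically first query that returned it (`Cikini` here), which is not necessarily the area it sits in. That is worth keeping in mind if Location is later used as a feature.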


Evaluate missing data.

In [6]:
missing_data = listings_df.isnull()

# Count missing (True) and present (False) values per column
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print('')

Listing ID
False    725
Name: Listing ID, dtype: int64

Location
False    725
Name: Location, dtype: int64

Latitude
False    725
Name: Latitude, dtype: int64

Longitude
False    725
Name: Longitude, dtype: int64

Number of Guests
False    725
Name: Number of Guests, dtype: int64

Bathrooms
False    725
Name: Bathrooms, dtype: int64

Bedrooms
False    722
True       3
Name: Bedrooms, dtype: int64

Beds
False    717
True       8
Name: Beds, dtype: int64

Review Count
False    725
Name: Review Count, dtype: int64

Room Type
False    725
Name: Room Type, dtype: int64

Rating
False    391
True     334
Name: Rating, dtype: int64

Minimum Nights
False    725
Name: Minimum Nights, dtype: int64

Maximum Nights
False    725
Name: Maximum Nights, dtype: int64

Rate (Rp)
False    725
Name: Rate (Rp), dtype: int64

Rate Type
False    725
Name: Rate Type, dtype: int64
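The per-column loop prints one block per column. As a more compact alternative, the same counts can be read from a single Series. The sketch below uses a small made-up frame rather than the real listings.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, mimicking three of the listing columns.
toy = pd.DataFrame({
    'Beds': [1.0, np.nan, 2.0, 1.0],
    'Rating': [4.8, np.nan, np.nan, 5.0],
    'Rate (Rp)': [270000.0, 150000.0, 180000.0, 200000.0],
})

missing_counts = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column
print(missing_counts)
print(missing_share)
```

The share is often more useful than the raw count when deciding whether to drop rows or impute values.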



Because Rating is missing for 334 of the 725 listings, a significant share, drop the rows with a missing rating.

In [7]:
listings_df.dropna(subset=['Rating'], axis=0, inplace=True)

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(391, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1812 | 22987741 | Bendungan Hilir, Central Jakarta | -6.21335 | 106.81134 | 3 | 1.0 | 2.0 | 2.0 | 25 | Entire home/apt | 4.76 | 1 | 1125 | 700000.0 | nightly |
| 1826 | 10344900 | Bendungan Hilir, Central Jakarta | -6.20482 | 106.81693 | 2 | 1.0 | 1.0 | 0.0 | 20 | Entire home/apt | 4.55 | 2 | 1125 | 213544.0 | nightly |
| 1824 | 4619772 | Bendungan Hilir, Central Jakarta | -6.21704 | 106.80577 | 2 | 1.0 | 1.0 | 1.0 | 17 | Entire home/apt | 4.76 | 4 | 1125 | 455560.0 | nightly |
| 1821 | 30411330 | Bendungan Hilir, Central Jakarta | -6.20469 | 106.81627 | 4 | 1.5 | 2.0 | 2.0 | 45 | Entire home/apt | 4.96 | 2 | 1125 | 435000.0 | nightly |
| 1819 | 12575863 | Bendungan Hilir, Central Jakarta | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |


Evaluate missing data once again.

In [8]:
missing_data = listings_df.isnull()

# Count missing (True) and present (False) values per column
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print('')

Listing ID
False    391
Name: Listing ID, dtype: int64

Location
False    391
Name: Location, dtype: int64

Latitude
False    391
Name: Latitude, dtype: int64

Longitude
False    391
Name: Longitude, dtype: int64

Number of Guests
False    391
Name: Number of Guests, dtype: int64

Bathrooms
False    391
Name: Bathrooms, dtype: int64

Bedrooms
False    391
Name: Bedrooms, dtype: int64

Beds
False    390
True       1
Name: Beds, dtype: int64

Review Count
False    391
Name: Review Count, dtype: int64

Room Type
False    391
Name: Room Type, dtype: int64

Rating
False    391
Name: Rating, dtype: int64

Minimum Nights
False    391
Name: Minimum Nights, dtype: int64

Maximum Nights
False    391
Name: Maximum Nights, dtype: int64

Rate (Rp)
False    391
Name: Rate (Rp), dtype: int64

Rate Type
False    391
Name: Rate Type, dtype: int64



Drop listings whose Room Type is `Shared room`.

In [9]:
listings_df = listings_df[listings_df['Room Type'] != 'Shared room']

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(388, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1812 | 22987741 | Bendungan Hilir, Central Jakarta | -6.21335 | 106.81134 | 3 | 1.0 | 2.0 | 2.0 | 25 | Entire home/apt | 4.76 | 1 | 1125 | 700000.0 | nightly |
| 1826 | 10344900 | Bendungan Hilir, Central Jakarta | -6.20482 | 106.81693 | 2 | 1.0 | 1.0 | 0.0 | 20 | Entire home/apt | 4.55 | 2 | 1125 | 213544.0 | nightly |
| 1824 | 4619772 | Bendungan Hilir, Central Jakarta | -6.21704 | 106.80577 | 2 | 1.0 | 1.0 | 1.0 | 17 | Entire home/apt | 4.76 | 4 | 1125 | 455560.0 | nightly |
| 1821 | 30411330 | Bendungan Hilir, Central Jakarta | -6.20469 | 106.81627 | 4 | 1.5 | 2.0 | 2.0 | 45 | Entire home/apt | 4.96 | 2 | 1125 | 435000.0 | nightly |
| 1819 | 12575863 | Bendungan Hilir, Central Jakarta | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |


Drop listings that accommodate more than 4 guests.

In [10]:
listings_df = listings_df[listings_df['Number of Guests'] < 5]

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(342, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1812 | 22987741 | Bendungan Hilir, Central Jakarta | -6.21335 | 106.81134 | 3 | 1.0 | 2.0 | 2.0 | 25 | Entire home/apt | 4.76 | 1 | 1125 | 700000.0 | nightly |
| 1826 | 10344900 | Bendungan Hilir, Central Jakarta | -6.20482 | 106.81693 | 2 | 1.0 | 1.0 | 0.0 | 20 | Entire home/apt | 4.55 | 2 | 1125 | 213544.0 | nightly |
| 1824 | 4619772 | Bendungan Hilir, Central Jakarta | -6.21704 | 106.80577 | 2 | 1.0 | 1.0 | 1.0 | 17 | Entire home/apt | 4.76 | 4 | 1125 | 455560.0 | nightly |
| 1821 | 30411330 | Bendungan Hilir, Central Jakarta | -6.20469 | 106.81627 | 4 | 1.5 | 2.0 | 2.0 | 45 | Entire home/apt | 4.96 | 2 | 1125 | 435000.0 | nightly |
| 1819 | 12575863 | Bendungan Hilir, Central Jakarta | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |


Drop listings that require a minimum stay of more than 3 nights.

In [11]:
listings_df = listings_df[listings_df['Minimum Nights'] < 4]

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(281, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1812 | 22987741 | Bendungan Hilir, Central Jakarta | -6.21335 | 106.81134 | 3 | 1.0 | 2.0 | 2.0 | 25 | Entire home/apt | 4.76 | 1 | 1125 | 700000.0 | nightly |
| 1826 | 10344900 | Bendungan Hilir, Central Jakarta | -6.20482 | 106.81693 | 2 | 1.0 | 1.0 | 0.0 | 20 | Entire home/apt | 4.55 | 2 | 1125 | 213544.0 | nightly |
| 1821 | 30411330 | Bendungan Hilir, Central Jakarta | -6.20469 | 106.81627 | 4 | 1.5 | 2.0 | 2.0 | 45 | Entire home/apt | 4.96 | 2 | 1125 | 435000.0 | nightly |
| 1819 | 12575863 | Bendungan Hilir, Central Jakarta | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |
| 1827 | 12855114 | Bendungan Hilir, Central Jakarta | -6.19405 | 106.81437 | 2 | 1.0 | 1.0 | 1.0 | 136 | Entire home/apt | 4.9 | 2 | 1125 | 300000.0 | nightly |
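The three row filters (room type, guest capacity, minimum stay) are applied one cell at a time so the row count can be checked after each step. As a sketch of an equivalent single step, the conditions can be combined into one boolean mask. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows: only the first one passes all three conditions.
toy = pd.DataFrame({
    'Room Type': ['Entire home/apt', 'Shared room', 'Private room', 'Entire home/apt'],
    'Number of Guests': [2, 2, 6, 4],
    'Minimum Nights': [1, 1, 2, 7],
})

keep = (
    (toy['Room Type'] != 'Shared room')
    & (toy['Number of Guests'] < 5)
    & (toy['Minimum Nights'] < 4)
)
filtered = toy[keep]
print(filtered)
```

A single mask is shorter, while the step-by-step version makes it easier to see how many rows each condition removes.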


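Remove outliers in the nightly rate with an interquartile range (IQR) filter. The code keeps rates between Q1 - IQR and Q3 + IQR, which is tighter than the more common 1.5 × IQR fences.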

In [12]:
Q1 = listings_df['Rate (Rp)'].quantile(0.25)
Q3 = listings_df['Rate (Rp)'].quantile(0.75)
IQR = Q3 - Q1

# Keep rates within one IQR of the quartiles (named rate_mask to avoid shadowing the built-in filter)
rate_mask = (Q1-IQR <= listings_df['Rate (Rp)']) & (listings_df['Rate (Rp)'] <= Q3+IQR)

listings_df = listings_df[rate_mask]

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(252, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1826 | 10344900 | Bendungan Hilir, Central Jakarta | -6.20482 | 106.81693 | 2 | 1.0 | 1.0 | 0.0 | 20 | Entire home/apt | 4.55 | 2 | 1125 | 213544.0 | nightly |
| 1821 | 30411330 | Bendungan Hilir, Central Jakarta | -6.20469 | 106.81627 | 4 | 1.5 | 2.0 | 2.0 | 45 | Entire home/apt | 4.96 | 2 | 1125 | 435000.0 | nightly |
| 1819 | 12575863 | Bendungan Hilir, Central Jakarta | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |
| 1827 | 12855114 | Bendungan Hilir, Central Jakarta | -6.19405 | 106.81437 | 2 | 1.0 | 1.0 | 1.0 | 136 | Entire home/apt | 4.9 | 2 | 1125 | 300000.0 | nightly |
| 1811 | 40117294 | Bendungan Hilir, Central Jakarta | -6.20994 | 106.80463 | 3 | 1.0 | 1.0 | 1.0 | 6 | Private room | 4.83 | 2 | 60 | 210000.0 | nightly |
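How wide the IQR fences are changes which rows count as outliers. The sketch below wraps the filter in a hypothetical helper, `iqr_mask`, with the fence width exposed as a parameter `k`. The rates are made-up numbers chosen so the two settings disagree.

```python
import pandas as pd


def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value lies within k * IQR of the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends, like the <= comparisons in the cell above
    return values.between(q1 - k * iqr, q3 + k * iqr)


# Made-up rates: Q1 = 250, Q3 = 550, IQR = 300
rates = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 900.0])

kept_tight = rates[iqr_mask(rates, k=1.0)]   # fences at -50 and 850, as in the cell above
kept_usual = rates[iqr_mask(rates, k=1.5)]   # fences at -200 and 1000
print(len(kept_tight), len(kept_usual))
```

With `k=1.0` the value 900 is treated as an outlier, while with `k=1.5` it is kept. The tighter setting removes more of the expensive listings from the training data.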


Reset index.

In [13]:
listings_df.reset_index(drop=True, inplace=True)

# Check the size of the dataframe
print(listings_df.shape)
listings_df.head()

(252, 15)


| | Listing ID | Location | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Room Type | Rating | Minimum Nights | Maximum Nights | Rate (Rp) | Rate Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10344900 | Bendungan Hilir, Central Jakarta | -6.20482 | 106.81693 | 2 | 1.0 | 1.0 | 0.0 | 20 | Entire home/apt | 4.55 | 2 | 1125 | 213544.0 | nightly |
| 1 | 30411330 | Bendungan Hilir, Central Jakarta | -6.20469 | 106.81627 | 4 | 1.5 | 2.0 | 2.0 | 45 | Entire home/apt | 4.96 | 2 | 1125 | 435000.0 | nightly |
| 2 | 12575863 | Bendungan Hilir, Central Jakarta | -6.22503 | 106.81923 | 1 | 1.5 | 1.0 | 1.0 | 77 | Entire home/apt | 4.91 | 2 | 1125 | 350000.0 | nightly |
| 3 | 12855114 | Bendungan Hilir, Central Jakarta | -6.19405 | 106.81437 | 2 | 1.0 | 1.0 | 1.0 | 136 | Entire home/apt | 4.9 | 2 | 1125 | 300000.0 | nightly |
| 4 | 40117294 | Bendungan Hilir, Central Jakarta | -6.20994 | 106.80463 | 3 | 1.0 | 1.0 | 1.0 | 6 | Private room | 4.83 | 2 | 60 | 210000.0 | nightly |


Check data types for each column.

In [14]:
listings_df.dtypes

Listing ID            int64
Location             object
Latitude            float64
Longitude           float64
Number of Guests      int64
Bathrooms           float64
Bedrooms            float64
Beds                float64
Review Count          int64
Room Type            object
Rating              float64
Minimum Nights        int64
Maximum Nights        int64
Rate (Rp)           float64
Rate Type            object
dtype: object

Check the summary of the data.

In [15]:
listings_df.describe()

| | Listing ID | Latitude | Longitude | Number of Guests | Bathrooms | Bedrooms | Beds | Review Count | Rating | Minimum Nights | Maximum Nights | Rate (Rp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 |
| mean | 25159060.0 | -6.184911 | 106.834473 | 2.535714 | 1.059524 | 0.97619 | 1.373016 | 39.865079 | 4.720794 | 1.678571 | 857.253968 | 334464.722222 |
| std | 10827370.0 | 0.022536 | 0.024753 | 0.885201 | 0.200668 | 0.661572 | 0.688368 | 48.291257 | 0.328272 | 0.71122 | 447.888111 | 99773.65622 |
| min | 290283.0 | -6.24268 | 106.78203 | 1.0 | 0.5 | 0.0 | 0.0 | 3.0 | 2.33 | 1.0 | 5.0 | 142000.0 |
| 25% | 17506030.0 | -6.196313 | 106.816997 | 2.0 | 1.0 | 1.0 | 1.0 | 9.0 | 4.6675 | 1.0 | 365.0 | 261006.0 |
| 50% | 23943290.0 | -6.19124 | 106.837675 | 2.0 | 1.0 | 1.0 | 1.0 | 26.0 | 4.8 | 2.0 | 1125.0 | 325000.0 |
| 75% | 35286110.0 | -6.17465 | 106.850912 | 3.0 | 1.0 | 1.0 | 2.0 | 48.25 | 4.9125 | 2.0 | 1125.0 | 400000.0 |
| max | 42842910.0 | -6.12839 | 106.89362 | 4.0 | 2.0 | 3.0 | 4.0 | 369.0 | 5.0 | 3.0 | 1125.0 | 585000.0 |


Export the data to a CSV file.

In [16]:
listings_df.to_csv(r'~/Projects/predicting_airbnb_prices/data/central_jakarta_airbnb_listings.csv', index=False)
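A quick way to confirm the export is to read the data back and compare it with the original. The sketch below does this in memory with `io.StringIO` instead of the real file path. The two rows reuse values from the first table above and the frame is otherwise illustrative.

```python
import io

import pandas as pd

# Small illustrative frame; the location contains a comma, so it must be quoted in the CSV.
toy = pd.DataFrame({
    'Listing ID': [18378734, 32297852],
    'Location': ['Cempaka Putih Barat, Central Jakarta'] * 2,
    'Rate (Rp)': [270000.0, 169225.0],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # index=False keeps the row index out of the file
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.shape)
```

Writing with `index=False` is also what avoids an extra unnamed index column when the file is loaded again.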