# Optimal Price for an AirBnB Rental
## Introduction
AirBnB is a marketplace for short-term rentals that allows users to list part or all of a living space for others to rent. Users can rent everything from a room in an apartment to an entire house on AirBnB. Because most listings are rented on a short-term basis, AirBnB has become a popular alternative to hotels. The company has grown from its founding in 2008 to a 30 billion dollar valuation in 2016 and is currently worth more than any hotel chain in the world.
## Problem Definition
One challenge hosts face when renting out their living space is determining the optimal nightly price. In many areas, renters are presented with a good selection of listings and can filter on criteria like price, number of bedrooms, room type and more. Since AirBnB is a marketplace, the amount a host can charge per night is closely linked to the dynamics of that marketplace. In this project I will take on the role of a host wanting to rent out my living space.
As a host, if I try to charge above market price for my living space, then renters will select more affordable alternatives which are similar to mine. If I set my nightly rent price too low, I could miss out on potential revenue.

One strategy I could use is to:

* find a few listings that are similar to mine (k=5),
* average the listed price for the ones most similar to mine,
* set my listing price to this calculated average price.

In this project I will use data on local listings to predict the optimal price to set for a rental. I will use the __K-Nearest Neighbors__ algorithm to solve this problem.
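
The three steps above can be sketched in plain Python. This is only a minimal illustration: the `(accommodates, price)` pairs are invented and `knn_average_price` is a hypothetical helper, not the code used later in this project.

```python
# Toy (accommodates, nightly price) pairs; these values are invented for illustration.
listings = [(2, 90.0), (3, 120.0), (3, 150.0), (4, 160.0), (6, 300.0), (1, 60.0), (3, 135.0)]

def knn_average_price(listings, my_accommodates, k=5):
    """Average the price of the k listings whose capacity is closest to mine."""
    # Step 1: rank listings by how far their capacity is from mine.
    ranked = sorted(listings, key=lambda row: abs(row[0] - my_accommodates))
    # Step 2: keep the k most similar listings.
    nearest = ranked[:k]
    # Step 3: the suggested price is the mean of their prices.
    return sum(price for _, price in nearest) / len(nearest)

suggested = knn_average_price(listings, my_accommodates=3, k=5)
print(suggested)
```

With this toy data the five nearest listings are the three that accommodate 3 people plus the two that accommodate 2 and 4, so the suggestion is the mean of their five prices.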

## The Data
While AirBnB doesn't release any data on the listings in their marketplace, a separate group named _Inside AirBnB_ has extracted data on a sample of the listings for many of the major cities on the website. I'll be working with their dataset from October 3, 2015 on the listings from Washington, D.C., the capital of the United States. Here's a [direct link to that dataset.](http://data.insideairbnb.com/united-states/dc/washington-dc/2015-10-03/data/listings.csv.gz) Each row in the dataset is a specific listing that's available for renting on AirBnB in the Washington, D.C. area.

The data file is named `dc_airbnb.csv`. Its columns and their descriptions are listed below:
* `host_response_rate`: the response rate of the host
* `host_acceptance_rate`: percentage of requests to the host that convert to rentals
* `host_listings_count`: number of other listings the host has
* `latitude`: latitude of the geographic coordinates
* `longitude`: longitude of the geographic coordinates
* `city`: the city in which the living space is located
* `zipcode`: the zip code in which the living space is located
* `state`: the state in which the living space is located
* `accommodates`: the number of guests the rental can accommodate
* `room_type`: the type of living space (Private room, Shared room or Entire home/apt)
* `bedrooms`: number of bedrooms included in the rental
* `bathrooms`: number of bathrooms included in the rental
* `beds`: number of beds included in the rental
* `price`: nightly price for the rental
* `cleaning_fee`: additional fee used for cleaning the living space after the guest leaves
* `security_deposit`: refundable security deposit, in case of damages
* `minimum_nights`: minimum number of nights a guest can stay for the rental
* `maximum_nights`: maximum number of nights a guest can stay for the rental
* `number_of_reviews`: number of reviews that previous guests have left


In [1]:
%pip install numpy
import pandas as pd
import numpy as np
dc_listings = pd.read_csv("dc_airbnb.csv")
dc_listings.head()



| | host_response_rate | host_acceptance_rate | host_listings_count | accommodates | room_type | bedrooms | bathrooms | beds | price | cleaning_fee | security_deposit | minimum_nights | maximum_nights | number_of_reviews | latitude | longitude | city | zipcode | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 92% | 91% | 26 | 4 | Entire home/apt | 1.0 | 1.0 | 2.0 | $160.00 | $115.00 | $100.00 | 1 | 1125 | 0 | 38.890046 | -77.002808 | Washington | 20003 | DC |
| 1 | 90% | 100% | 1 | 6 | Entire home/apt | 3.0 | 3.0 | 3.0 | $350.00 | $100.00 | | 2 | 30 | 65 | 38.880413 | -76.990485 | Washington | 20003 | DC |
| 2 | 90% | 100% | 2 | 1 | Private room | 1.0 | 2.0 | 1.0 | $50.00 | | | 2 | 1125 | 1 | 38.955291 | -76.986006 | Hyattsville | 20782 | MD |
| 3 | 100% | | 1 | 2 | Private room | 1.0 | 1.0 | 1.0 | $95.00 | | | 1 | 1125 | 0 | 38.872134 | -77.019639 | Washington | 20024 | DC |
| 4 | 92% | 67% | 1 | 4 | Entire home/apt | 1.0 | 1.0 | 1.0 | $50.00 | $15.00 | $450.00 | 7 | 1125 | 0 | 38.996382 | -77.041541 | Silver Spring | 20910 | MD |


In [2]:
dc_listings.dtypes

host_response_rate       object
host_acceptance_rate     object
host_listings_count       int64
accommodates              int64
room_type                object
bedrooms                float64
bathrooms               float64
beds                    float64
price                    object
cleaning_fee             object
security_deposit         object
minimum_nights            int64
maximum_nights            int64
number_of_reviews         int64
latitude                float64
longitude               float64
city                     object
zipcode                  object
state                    object
dtype: object

### Finding Similar Listings
To find listings similar to mine, I need a similarity metric to compare attributes. I will use the univariate case of the Euclidean distance metric:

<center><i> 
    distance = |q1 - p1| 
    </i></center>

* where q1 and p1 are feature values of observations to be compared
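
For reference, the general Euclidean distance between two observations with n features, and its reduction to the single-feature case used here, can be written as follows. The symbols q_i and p_i follow the notation above; n, the number of features, is introduced here for illustration.

```latex
d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}, \qquad
n = 1 \;\Rightarrow\; d(q, p) = \sqrt{(q_1 - p_1)^2} = |q_1 - p_1|
```

For example, a listing that accommodates 5 people is at distance |5 - 3| = 2 from a listing that accommodates 3.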

My living space can accommodate 3 people, so I will use the `accommodates` column to calculate the distance between my listing and every other listing in the data set.

In [3]:
#calculating Euclidean distance for all observations
dc_listings['distance'] = dc_listings['accommodates'].apply(lambda x: np.abs(x-3))

#counting how many listings fall at each distance value
print(dc_listings['distance'].value_counts())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64


From these results, I am primarily interested in the rows where distance = 0. A distance of 0 means the two feature values are exactly the same; there are 461 listings in this data set that can also accommodate 3 people.

To avoid biasing my results toward the original ordering of the data set, I will randomize the row order and then sort the dataframe by the `distance` column.

In [4]:
np.random.seed(1)
#np.random.permutation returns a shuffled array of index values; .loc then returns the dataframe rows in that shuffled order
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
dc_listings = dc_listings.sort_values('distance')
#now that the dataframe is sorted, I'll display the first 10 values in the price column that I'll work with next
print(dc_listings.iloc[0:10]['price'])

577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
625     $150.00
Name: price, dtype: object
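
To see why shuffling before sorting matters, here is a small sketch in plain Python with invented data (the names `rows` and `top_ids` are hypothetical). When many rows tie on distance, a stable sort keeps them in their prior order, so without shuffling the "nearest" rows are simply whichever ones came first in the file.

```python
import random

# Invented rows: (listing_id, distance). Ids 0-5 all tie at distance 0.
rows = [(i, 0) for i in range(6)] + [(6, 1), (7, 2)]

def top_ids(rows, k=3):
    # sorted() is stable, so rows with equal distance keep their current order.
    return [listing_id for listing_id, _ in sorted(rows, key=lambda r: r[1])[:k]]

# Without shuffling, ties are always broken by file order.
unshuffled_top = top_ids(rows)

# Shuffling first (with a fixed seed for reproducibility) breaks ties at random.
shuffled = rows[:]
random.Random(1).shuffle(shuffled)
shuffled_top = top_ids(shuffled)

print(unshuffled_top)
print(shuffled_top)
```

In both cases the selected rows are at distance 0; the shuffle only changes which of the tied rows are picked.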


### Find Average Listing Price
Before I can select the 5 most similar living spaces and compute the average price, I need to clean the `price` column. Currently, the column contains comma characters (,) and dollar sign characters ($) and is stored as text instead of as numbers. I need to remove these characters and convert the entire column to the float datatype. Then I can calculate the average price.

In [5]:
#cleaning price and converting to float
cleaned_price = dc_listings['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False)
dc_listings['price'] = pd.to_numeric(cleaned_price, errors='coerce')
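
As a small check of the cleaning step, the sketch below applies the same idea to a few invented price strings. Passing `regex=False` is an addition for illustration: it asks pandas to treat `$` as a literal character rather than as a regular-expression anchor, regardless of the default in the installed pandas version.

```python
import pandas as pd

# Invented sample values in the same text format as the price column.
raw = pd.Series(["$160.00", "$1,250.00", "$94.00"])

# Strip thousands separators and dollar signs as literal characters.
cleaned = raw.str.replace(",", "", regex=False).str.replace("$", "", regex=False)

# Convert to floats; anything unparseable would become NaN.
prices = pd.to_numeric(cleaned, errors="coerce")
print(prices.tolist())
```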

In [6]:
#calculating the mean price of the first five values
mean_price = np.mean(dc_listings.iloc[0:5]['price'])
print(mean_price)

156.6


Based on the average price of other listings that accommodate 3 people, I should charge $156.60 per night for a guest to stay in my place.
I can create a more general function that can suggest the optimal price for other values of the `accommodates` column:

In [12]:
#where 'new_listing' is the number of guests to accommodate; NOTE: this function does not shuffle the indices
def predict_price(new_listing):
    temp_df = dc_listings.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    price = temp_df.iloc[0:5]['price']
    
    return np.mean(price)

#test
print(predict_price(3))

194.8


### Evaluating Model Performance

In [15]:
#first 2792 rows for training, the rest for testing
#taking explicit copies so a new column can be added to test_df without a SettingWithCopyWarning
train_df = dc_listings.iloc[0:2792].copy()
test_df = dc_listings.iloc[2792:].copy()

def predict_price(new_listing):
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price

test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)
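
The hold-out idea above can be sketched on a small invented DataFrame. The column values are made up, the split point mirrors the roughly 75/25 split used above, and `k=3` is chosen only to suit the tiny sample. Taking an explicit `.copy()` of each slice is one way to make it unambiguous that the new column is added to an independent frame.

```python
import pandas as pd

# Invented listings; only the two columns the model needs.
df = pd.DataFrame({
    "accommodates": [2, 3, 3, 4, 2, 6, 3, 4],
    "price": [80.0, 120.0, 130.0, 160.0, 90.0, 300.0, 110.0, 150.0],
})

# First 75% of rows for training, the rest for testing.
split = int(len(df) * 0.75)
train = df.iloc[:split].copy()
test = df.iloc[split:].copy()

def predict_price(new_listing, k=3):
    # Distance from every training listing to the new one, on one feature.
    distance = (train["accommodates"] - new_listing).abs()
    # Stable sort so tied rows keep their training-set order.
    nearest = distance.sort_values(kind="mergesort").index[:k]
    return train.loc[nearest, "price"].mean()

test["predicted_price"] = test["accommodates"].apply(predict_price)
print(test)
```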


In [16]:
mae = (np.abs(test_df['price']-test_df['predicted_price'])).mean()
print(mae)

96.32309344790546
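
For clarity, the mean absolute error (MAE) computed above is the average of |actual - predicted| over the test set. A minimal sketch with invented numbers, shown alongside the root mean squared error (RMSE) for comparison:

```python
import math

# Invented actual and predicted nightly prices.
actual = [100.0, 150.0, 200.0, 80.0]
predicted = [110.0, 140.0, 260.0, 80.0]

errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average size of the miss, in dollars.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: squares each miss first, so large misses weigh more.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(mae, rmse)
```

Because of the single 60 dollar miss, the RMSE in this toy example comes out larger than the MAE.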


This tells me that, on average, my model's predictions are about 96 dollars off from the actual price. This is a large error considering that the two estimates above for a 3-person listing were about 156 and 195 dollars. There are 2 ways I can alter the model to try to improve its accuracy (that is, decrease the error during validation, whether measured by MAE as here or by RMSE):

* increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
* increase k, the number of nearby neighbors the model uses when computing the prediction 

I'll start by increasing the number of attributes first to see if I can reduce the model error.
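
Both changes can be sketched together in plain Python. The example below is illustrative only: the data are invented, `knn_predict` is a hypothetical helper, and the features are used unscaled, which a real model would likely need to address.

```python
import math

# Invented training rows: ((accommodates, bedrooms), nightly price).
train = [
    ((2, 1), 90.0),
    ((3, 1), 120.0),
    ((3, 2), 150.0),
    ((4, 2), 170.0),
    ((6, 3), 320.0),
]

def knn_predict(features, k):
    """Mean price of the k training rows closest in Euclidean distance."""
    # math.dist computes the Euclidean distance between two points.
    ranked = sorted(train, key=lambda row: math.dist(row[0], features))
    nearest = ranked[:k]
    return sum(price for _, price in nearest) / len(nearest)

# Same listing, two neighbourhood sizes.
print(knn_predict((3, 2), k=1))
print(knn_predict((3, 2), k=3))
```

With `k=1` the prediction is just the price of the single closest row; a larger `k` averages over more neighbours and smooths the estimate.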

In [17]:
#look for non-ordinal/non-numeric and missing data
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 577 to 1224
Data columns (total 20 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
distance                3723 non-nu

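Now I'll drop the columns that are non-numeric (`room_type`, `city`, `state`), numeric but not ordinal (`latitude`, `longitude`, `zipcode`), or that describe the host rather than the living space (`host_response_rate`, `host_acceptance_rate`, `host_listings_count`).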

In [18]:
drop_columns = ['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode', 'host_response_rate', 'host_acceptance_rate', 'host_listings_count']

dc_listings = dc_listings.drop(drop_columns, axis=1)

In [20]:
#looking for missing values
dc_listings.isnull().sum()

accommodates            0
bedrooms               21
bathrooms              27
beds                   11
price                   0
cleaning_fee         1388
security_deposit     2297
minimum_nights          0
maximum_nights          0
number_of_reviews       0
distance                0
dtype: int64

I'll also drop both the `cleaning_fee` and `security_deposit` columns since they are missing a lot of data.

In [22]:
dc_listings = dc_listings.drop(['cleaning_fee', 'security_deposit'], axis=1)


In [24]:
#I'll also drop the rest of the missing data
dc_listings = dc_listings.dropna(axis=0)
dc_listings.isnull().sum()

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
distance             0
dtype: int64
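
The two-stage handling of missing values above (drop the sparse columns, then drop the few remaining incomplete rows) can be sketched on invented data. The 50% cut-off used here is an arbitrary assumption for illustration, not a threshold taken from this project.

```python
import numpy as np
import pandas as pd

# Invented frame: one column that is mostly missing, one with a single gap.
df = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 1.0],
    "price": [100.0, 180.0, 90.0, 120.0],
    "security_deposit": [np.nan, np.nan, np.nan, 200.0],
})

# Fraction of missing values per column.
missing_share = df.isnull().mean()

# Drop columns missing more than half of their values (arbitrary cut-off).
sparse_cols = missing_share[missing_share > 0.5].index
df = df.drop(columns=sparse_cols)

# Drop the few rows that still contain a missing value.
df = df.dropna(axis=0)
print(df)
```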