<a href="https://colab.research.google.com/github/dfeng423/Optimal-Prices-for-Renters-in-Airbnb/blob/master/optimal_prices_for_renters_in_airbnb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Optimal Airbnb List Price Suggestion

## Problem Definition

As a host, if we try to charge above market price for a living space we'd like to rent, then renters will select more affordable alternatives which are similar to ours. If we set our nightly rent price too low, we'll miss out on potential revenue.

## Dataset

The dataset is a snapshot, from October 3, 2015, of the listings in Washington, D.C., the capital of the United States.

To make the dataset less cumbersome to work with, we've removed many of the columns in the original dataset and renamed the file to dc_airbnb.csv. Here are the columns we kept:

- host_response_rate: the response rate of the host
- host_acceptance_rate: the share of requests to the host that convert to rentals
- host_listings_count: number of other listings the host has
- latitude: latitude of the geographic coordinates
- longitude: longitude of the geographic coordinates
- city: the city where the living space is located
- zipcode: the zip code where the living space is located
- state: the state where the living space is located
- accommodates: the number of guests the rental can accommodate
- room_type: the type of living space (Private room, Shared room or Entire home/apt)
- bedrooms: number of bedrooms included in the rental
- bathrooms: number of bathrooms included in the rental
- beds: number of beds included in the rental
- price: nightly price for the rental
- cleaning_fee: additional fee used for cleaning the living space after the guest leaves
- security_deposit: refundable security deposit, in case of damages
- minimum_nights: minimum number of nights a guest can stay for the rental
- maximum_nights: maximum number of nights a guest can stay for the rental
- number_of_reviews: number of reviews that previous guests have left


In [0]:
import pandas as pd
import numpy as np

dc_listings = pd.read_csv("dc_airbnb.csv")
print(dc_listings)
dc_listings.info()

     host_response_rate host_acceptance_rate  ...  zipcode  state
0                   92%                  91%  ...    20003     DC
1                   90%                 100%  ...    20003     DC
2                   90%                 100%  ...    20782     MD
3                  100%                  NaN  ...    20024     DC
4                   92%                  67%  ...    20910     MD
...                 ...                  ...  ...      ...    ...
3718               100%                  60%  ...    20003     DC
3719               100%                  50%  ...    20003     DC
3720               100%                 100%  ...    20003     DC
3721                88%                 100%  ...    20002     DC
3722                70%                 100%  ...    20003     DC

[3723 rows x 19 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 n

## Univariate K-Nearest Neighbors


#### Overview of the k-nearest neighbors algorithm

1. Select the number of similar listings, k, we want to compare with.
2. For each listing, calculate how similar it is to our unpriced listing.
3. Rank each listing by the similarity metric and select the first k listings.
4. Calculate the mean list price for the k similar listings and use it as our list price.
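
The four steps above can be sketched on a handful of made-up listings. This is a minimal illustration rather than the notebook's implementation: the function name `knn_predict`, the toy arrays, and the choice of k=3 are assumptions introduced here.

```python
import numpy as np

def knn_predict(feature_values, prices, new_value, k=5):
    """Average the prices of the k listings whose feature is closest to new_value."""
    feature_values = np.asarray(feature_values, dtype=float)
    prices = np.asarray(prices, dtype=float)
    # Step 2: similarity is the absolute difference on a single feature.
    distances = np.abs(feature_values - new_value)
    # Step 3: rank by distance (a stable sort keeps ties in input order), keep the first k.
    nearest = np.argsort(distances, kind="stable")[:k]
    # Step 4: the prediction is the mean price of those k listings.
    return prices[nearest].mean()

# Hypothetical toy data: guests accommodated and nightly price in dollars.
toy_accommodates = [1, 2, 2, 3, 4, 6]
toy_prices = [50.0, 80.0, 90.0, 120.0, 150.0, 300.0]

# Step 1: choose k, then predict a price for a listing that accommodates 2 guests.
toy_prediction = knn_predict(toy_accommodates, toy_prices, new_value=2, k=3)
print(toy_prediction)
```

When several listings are equally distant, which of them make the top k depends on their order in the data, which is one reason to shuffle the rows before predicting.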



#### Implementation

$$d(x_1, x_2) = \sqrt{\sum_{i=1}^{N} (x_{1,i} - x_{2,i})^2}$$

where $x_1$ is the first row of data, $x_2$ is the second row of data, and $i$ indexes the columns as we sum across all $N$ columns.
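
The Euclidean distance formula can be checked with a few lines of NumPy. The helper name `euclidean_distance` and the sample points are assumptions made for illustration. The second call shows why the univariate case can use a plain absolute difference: with a single column, the square root of the squared difference is the absolute difference.

```python
import numpy as np

def euclidean_distance(x1, x2):
    """Square root of the summed squared differences across all columns."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.sqrt(np.sum((x1 - x2) ** 2))

# Two columns: sqrt(3^2 + 4^2)
d_two_features = euclidean_distance([1, 2], [4, 6])
# One column: reduces to abs(3 - 8)
d_one_feature = euclidean_distance([3], [8])

print(d_two_features, d_one_feature)
```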

In [0]:
np.random.seed(1)
# The price column contains comma (,) and dollar sign ($) characters and is stored as text
# instead of as numbers. Strip those characters and convert the column to float so that
# we can average prices.
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
# np.random.permutation() returns a NumPy array of shuffled index values.
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]


# Write a function named predict_price that can use the k-nearest neighbors machine learning technique to calculate the suggested price for any value for accommodates.
def predict_price(new_listing):
    temp_df = dc_listings.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing)) # univariate case, Euclidean distance
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors = temp_df.iloc[0:5]['price'] # select 5 "nearest neighbors" 
    predicted_price = nearest_neighbors.mean()
    return(predicted_price)

acc_one = predict_price(1) # accommodates 1 person, assign the suggested price to acc_one
acc_two = predict_price(2)
acc_four = predict_price(4)
print(acc_one)
print(acc_two)
print(acc_four)

68.0
112.8
124.8


#### Model Performance Evaluation

In [0]:
import pandas as pd
import numpy as np
dc_listings = pd.read_csv("dc_airbnb.csv")
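stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
# .copy() makes each split an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
train_df = dc_listings.iloc[0:2792].copy()  # 75%
test_df = dc_listings.iloc[2792:].copy()  # 25%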

def predict_price(new_listing):
    ## DataFrame.copy() performs a deep copy
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return(predicted_price)
test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)



##### MAE

In [0]:
# mean absolute error
mae = (test_df['predicted_price'] - test_df['price']).map(np.absolute).mean()
print(mae)

56.29001074113876


##### MSE


In [0]:
# mean squared error: The MSE makes the gap between the predicted and actual values more clear.
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
print(mse)

18646.525370569325




##### Using the bathrooms column to build the model


In [0]:
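# .copy() avoids pandas' SettingWithCopyWarning when columns are added to test_df below.
train_df = dc_listings.iloc[0:2792].copy()
test_df = dc_listings.iloc[2792:].copy()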

def predict_price(new_listing):
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return(predicted_price)

test_df['predicted_price'] = test_df['bathrooms'].apply(lambda x: predict_price(x))
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
print(mse)

18405.444081632548




The MSE fell from about 18,646.5 with the accommodates column to about 18,405.4 with the bathrooms column, a modest improvement in accuracy.

##### RMSE

Root mean squared error (RMSE) is an error metric whose units are the base unit (in our case, dollars).

Because each error is squared before averaging, an individual error's contribution grows quadratically, so large errors affect the final RMSE value more than small ones.


In [0]:
rmse = mse**0.5
print(rmse)

136.55228072269364


The printed RMSE of approximately 136.6 is the square root of the accommodates model's MSE (18,646.5); the bathrooms model's MSE (18,405.4) corresponds to an RMSE of about 135.7. Either way, we should expect the model's predicted prices to be off by roughly 136 dollars. Given that most of the living spaces are listed at just a few hundred dollars, we need to reduce this error as much as possible to improve the model's usefulness.

Taking the square root of the average squared errors has some interesting implications for RMSE. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful when large errors are particularly undesirable.
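
A small numeric comparison makes the weighting concrete. The two error sets below are made up so that both have the same MAE; the helper names `mae` and `rmse` are introduced here for illustration only.

```python
import numpy as np

def mae(errors):
    """Mean absolute error of an array of prediction errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error of an array of prediction errors."""
    return np.sqrt(np.mean(np.square(errors)))

# Hypothetical errors in dollars: four equal misses versus one large miss.
even_errors = np.array([10.0, 10.0, 10.0, 10.0])
one_large_error = np.array([0.0, 0.0, 0.0, 40.0])

print(mae(even_errors), rmse(even_errors))          # MAE and RMSE agree
print(mae(one_large_error), rmse(one_large_error))  # same MAE, RMSE twice as large
```

Both sets average 10 dollars of absolute error, but the set with a single 40-dollar miss has an RMSE of 20, so the gap between RMSE and MAE is itself a hint that a few predictions are far off.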



## Multivariate K-Nearest Neighbors 

### **Feature Selection**

It's clear that using just a single feature to compare listings doesn't reflect the reality of the market. An apartment that can accommodate 4 guests in a popular part of Washington, D.C. will rent for a much higher price than one that can accommodate 4 guests in a crime-ridden area.

There are 2 ways we can tweak the model to try to improve the accuracy (decrease the RMSE during validation):

- increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
- increase k, the number of nearby neighbors the model uses when computing the prediction

Here we'll increase the number of attributes the model uses.

When selecting more attributes to use in the model, we need to watch out for columns that don't work well with the distance equation. This includes columns containing:

- non-numerical values (e.g. city or state)
 +  Euclidean distance equation expects numerical values
- missing values
 + distance equation expects a value for each observation and attribute
- non-ordinal values (e.g. latitude or longitude)
 + ranking by Euclidean distance doesn't make sense if all attributes aren't ordinal
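
The first two checks can be automated with pandas; the third cannot, because a column such as latitude is numeric and complete, so recognizing it as non-ordinal takes domain knowledge. The sketch below runs on a made-up three-row frame (the `toy` data is an assumption, not the real dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings table.
toy = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "bedrooms": [1.0, np.nan, 1.0],
    "city": ["Washington", "Washington", "Silver Spring"],
})

# Columns the distance equation can consume directly: numeric dtypes only.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

# Columns that contain at least one missing value.
columns_with_missing = toy.columns[toy.isnull().any()].tolist()

print(numeric_columns)
print(columns_with_missing)
```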

In [0]:
import pandas as pd
import numpy as np
np.random.seed(1)

dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
stripped_commas = dc_listings['price'].str.replace(',', '')
stripped_dollars = stripped_commas.str.replace('$', '')
dc_listings['price'] = stripped_dollars.astype('float')

dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), objec

#### 1. Removing features

Remove the 9 columns we discussed above from dc_listings:
- 3 containing non-numerical values
 + room_type: e.g. Private room
 + city: e.g. Washington
 + state: e.g. DC
- 3 containing numerical but non-ordinal values. Geographic values like these aren't ordinal, because a smaller numerical value doesn't correspond to a smaller quantity in any meaningful way.
 + latitude: e.g. 38.913458
 + longitude: e.g. -77.031
 + zipcode: e.g. 20009
- 3 describing the host instead of the living space itself
 + host_response_rate
 + host_acceptance_rate
 + host_listings_count

Since a host could have many living spaces and we don't have enough information to uniquely group living spaces to the hosts themselves, let's avoid using any columns that don't directly describe the living space or the listing itself.

In [0]:
drop_columns = ['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode', 'host_response_rate', 'host_acceptance_rate', 'host_listings_count']
dc_listings = dc_listings.drop(drop_columns, axis=1)
print(dc_listings.isnull().sum())




accommodates            0
bedrooms               21
bathrooms              27
beds                   11
price                   0
cleaning_fee         1388
security_deposit     2297
minimum_nights          0
maximum_nights          0
number_of_reviews       0
dtype: int64


#### 2. Handling missing values


Of the remaining columns, 3 columns have a few missing values (less than 1% of the total number of rows):

- bedrooms
- bathrooms
- beds

Since the number of rows containing missing values for one of these 3 columns is low, we can select and remove those rows without losing much information. There are also 2 columns that have a large number of missing values:

- cleaning_fee - 37.3% of the rows
- security_deposit - 61.7% of the rows

We can't handle these 2 columns as easily. Removing the rows with missing values in these columns would discard a large share of the observations in the dataset (over 60% for security_deposit alone). Instead, let's remove these 2 columns entirely from consideration.
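
The percentages quoted above follow directly from the missing-value counts and the 3723-row total. The sketch below redoes that arithmetic; the 1% cutoff is taken from the discussion above, while the constant name `ROW_DROP_CUTOFF` and the two decision labels are assumptions introduced for illustration.

```python
# Counts copied from the isnull().sum() output; 3723 is the number of rows.
total_rows = 3723
missing_counts = {
    "bedrooms": 21,
    "bathrooms": 27,
    "beds": 11,
    "cleaning_fee": 1388,
    "security_deposit": 2297,
}

# Hypothetical rule: below 1% missing, drop the affected rows;
# otherwise drop the whole column.
ROW_DROP_CUTOFF = 1.0

missing_share = {}
decisions = {}
for column, count in missing_counts.items():
    share = round(100 * count / total_rows, 1)
    missing_share[column] = share
    decisions[column] = "drop rows" if share < ROW_DROP_CUTOFF else "drop column"
    print(f"{column}: {share}% missing -> {decisions[column]}")
```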

In [0]:
dc_listings.drop(['cleaning_fee', 'security_deposit'], axis=1, inplace=True)
dc_listings.dropna(axis=0, inplace=True)

print(dc_listings.isnull().sum())

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64


#### 3. Normalize columns

In [0]:
print(dc_listings.head())

      accommodates  bedrooms  ...  maximum_nights  number_of_reviews
574              2       1.0  ...               4                149
1593             2       1.0  ...              30                 49
3091             1       1.0  ...            1125                  1
420              2       1.0  ...             730                  2
808             12       5.0  ...            1825                 34

[5 rows x 8 columns]


While the accommodates, bedrooms, bathrooms, beds, and minimum_nights columns hover between 0 and 12 (at least in the first few rows), the values in the maximum_nights and number_of_reviews columns span much larger ranges. For example, the maximum_nights column has values as low as 4 and as high as 1825 in just the first few rows. If we use these 2 columns as part of a k-nearest neighbors model, they could have an outsized effect on the distance calculations simply because their values are larger.

To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1.
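
The effect is easy to see on a made-up example. In the sketch below (the `toy` frame and the helper `row_distance` are assumptions for illustration), the first two listings accommodate the same number of guests, yet on raw values the third listing looks closer to the first because maximum_nights dominates the distance. After normalization the ordering flips. Note that pandas' `std()` computes the sample standard deviation (ddof=1) by default.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: the first two differ only in maximum_nights.
toy = pd.DataFrame({
    "accommodates":   [2, 2, 8, 4, 6],
    "maximum_nights": [30, 60, 35, 1125, 730],
})

def row_distance(df, i, j):
    """Euclidean distance between rows i and j of df."""
    return float(np.sqrt(((df.iloc[i] - df.iloc[j]) ** 2).sum()))

# Rescale every column to mean 0 and standard deviation 1.
normalized_toy = (toy - toy.mean()) / toy.std()

raw_to_same_size = row_distance(toy, 0, 1)
raw_to_larger = row_distance(toy, 0, 2)
norm_to_same_size = row_distance(normalized_toy, 0, 1)
norm_to_larger = row_distance(normalized_toy, 0, 2)

print(raw_to_same_size, raw_to_larger)    # raw: the larger listing looks closer
print(norm_to_same_size, norm_to_larger)  # normalized: the same-size listing is closer
```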

In [0]:
# To apply this transformation across all of the columns in a Dataframe, you can use the corresponding Dataframe methods mean() and std():

normalized_listings = (dc_listings - dc_listings.mean())/(dc_listings.std())
normalized_listings['price'] = dc_listings['price']
print(normalized_listings.head(3))

      accommodates  bedrooms  ...  maximum_nights  number_of_reviews
574      -0.596544 -0.249467  ...       -0.016604           4.579650
1593     -0.596544 -0.249467  ...       -0.016603           1.159275
3091     -1.095499 -0.249467  ...       -0.016573          -0.482505

[3 rows x 8 columns]


#### 4. Calculate Euclidean distance for multivariate case

In [0]:
# Calculate the Euclidean distance between the first and fifth rows of normalized_listings,
# using only the accommodates and bathrooms features, with distance.euclidean().
from scipy.spatial import distance
first_listing = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth_listing = normalized_listings.iloc[4][['accommodates', 'bathrooms']]
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
print(first_fifth_distance)

5.272543124668404


#### 5. Fitting a model and making predictions with the scikit-learn library

In [0]:
from sklearn.neighbors import KNeighborsRegressor

train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]
train_columns = ['accommodates', 'bathrooms']

# Instantiate ML model.
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

# Fit model to data.
knn.fit(train_df[train_columns], train_df['price'])

# Use model to make predictions.
predictions = knn.predict(test_df[train_columns])

#### 6. Calculating MSE using Scikit-Learn

In [0]:
from sklearn.metrics import mean_squared_error

two_features_mse = mean_squared_error(test_df['price'], predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_mse)
print(two_features_rmse)

15600.51385665529
124.90201702396679


#### 7. Using more features

In [0]:
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train_df[features], train_df['price'])
four_predictions = knn.predict(test_df[features])
four_mse = mean_squared_error(test_df['price'], four_predictions)
four_rmse = four_mse ** (1/2)
print(four_mse)
print(four_rmse)

13322.432400455064
115.42284176217056


In [0]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

features = train_df.columns.tolist()
features.remove('price')

knn.fit(train_df[features], train_df['price'])
all_features_predictions = knn.predict(test_df[features])
all_features_mse = mean_squared_error(test_df['price'], all_features_predictions)
all_features_rmse = all_features_mse ** (1/2)
print(all_features_mse)
print(all_features_rmse)

15455.275631399316
124.31924883701363


The RMSE value actually increased to about 124.3 (from about 115.4 with four features) when we used all of the features available to us.

Levers we can use in k-nearest neighbors:
- increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
- select the relevant attributes the model uses to calculate similarity when ranking the closest neighbors

### **Hyperparameter Optimization**
