# Airbnb Price Prediction

## Building multiple models to predict Airbnb listing prices from multiple features
 
## Team Members:
- Azam Zubairi (27424)
- Umair Afzal (28004)

### Project Index
1. [Import Libraries and Data](#import)
2. [Exploratory Data Analysis](#eda)
3. [Preprocessing data](#proc)
4. [Encoding categorical variables](#encode)
5. [Train Test Split](#ttsplit)
6. [Linear Regression](#lreg)
7. [Random Forest Regression](#rfreg)
8. [Decision Tree Regression](#dtreg)
9. [Model Comparison](#compare)
10. [Conclusion](#conclude)

<a id='import'></a>
# Import Libraries and Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as st
from scipy import stats
from scipy.stats import norm, skew #for some statistics

from sklearn import ensemble, tree, linear_model
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_squared_error

import folium

from folium import plugins
import branca.colormap as cm

import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go

In [2]:
df=pd.read_csv('listing_summary.csv')
df.head()


Columns (17) have mixed types.Specify dtype option on import or set low_memory=False.



| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75356 | -73.98559 | Entire home/apt | 150 | 30 | 48 | 2019-11-04 | 0.32 | 3 | 334 | 0 | |
| 1 | 5121 | BlissArtsSpace! | 7356 | Garon | Brooklyn | Bedford-Stuyvesant | 40.68535 | -73.95512 | Private room | 60 | 30 | 50 | 2019-12-02 | 0.32 | 2 | 365 | 0 | |
| 2 | 5136 | Spacious Brooklyn Duplex, Patio + Garden | 7378 | Rebecca | Brooklyn | Sunset Park | 40.66265 | -73.99454 | Entire home/apt | 275 | 5 | 2 | 2021-08-08 | 0.02 | 1 | 201 | 1 | |
| 3 | 5178 | Large Furnished Room Near B'way | 8967 | Shunichi | Manhattan | Midtown | 40.76457 | -73.98317 | Private room | 68 | 2 | 520 | 2022-02-18 | 3.33 | 1 | 154 | 46 | |
| 4 | 5203 | Cozy Clean Guest Room - Family Apt | 7490 | MaryEllen | Manhattan | Upper West Side | 40.8038 | -73.96751 | Private room | 75 | 2 | 118 | 2017-07-21 | 0.77 | 1 | 0 | 0 | |


In [3]:
df.shape

(37631, 18)

<a id='eda'></a>
# Exploratory Data Analysis

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37631 entries, 0 to 37630
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              37631 non-null  int64  
 1   name                            37617 non-null  object 
 2   host_id                         37631 non-null  int64  
 3   host_name                       37544 non-null  object 
 4   neighbourhood_group             37631 non-null  object 
 5   neighbourhood                   37631 non-null  object 
 6   latitude                        37631 non-null  float64
 7   longitude                       37631 non-null  float64
 8   room_type                       37631 non-null  object 
 9   price                           37631 non-null  int64  
 10  minimum_nights                  37631 non-null  int64  
 11  number_of_reviews               37631 non-null  int64  
 12  last_review                     

In [5]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

MemoryError: Unable to allocate 2.02 TiB for an array with shape (277061618929,) and data type float64



In [None]:
#Count missing values and their percentage per column
total = df.isnull().sum()
percent = df.isnull().mean() * 100
missing_data = pd.concat([total, percent], axis=1, keys=['Total','Percent']).sort_values('Total', ascending=False)
missing_data

We can see that the license column is almost empty, so we will drop it. There are also null values in the name, host_name, last_review and reviews_per_month columns. The number of null values in last_review and reviews_per_month is the same, so we assume that those 8974 listings have not received any reviews and will replace the missing reviews_per_month values with 0.
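
Equal null counts do not by themselves guarantee that the same rows are affected, so the assumption can be checked row by row. The following is a minimal sketch on a small made-up frame; the `toy` data and the helper name `same_missing_rows` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def same_missing_rows(frame, col_a, col_b):
    """Return True if col_a and col_b are null in exactly the same rows."""
    return bool((frame[col_a].isnull() == frame[col_b].isnull()).all())

# Hypothetical toy data standing in for the listings table
toy = pd.DataFrame({
    "number_of_reviews": [48, 0, 2, 0],
    "last_review": ["2019-11-04", None, "2021-08-08", None],
    "reviews_per_month": [0.32, np.nan, 0.02, np.nan],
})

coincide = same_missing_rows(toy, "last_review", "reviews_per_month")

# Rows with a missing last_review should also report zero reviews
no_review_rows = toy.loc[toy["last_review"].isnull(), "number_of_reviews"]
zero_reviews = bool((no_review_rows == 0).all())

print(coincide, zero_reviews)
```

If both checks hold on the real `df`, filling `reviews_per_month` with 0 is consistent with the data rather than only with the counts.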

In [None]:
#Bar chart for room type
ax = sns.countplot(x="room_type", data=df)

We can see that there are 4 room types, and most listings are either "Entire home/apt" or "Private room".

In [None]:
#Calculating the mean price for each room type
room_price = df.groupby("room_type")["price"].mean()
room_price

In [None]:
#Plotting room type against price to compare prices of each room type
plt.figure(figsize = (10,6))
c = ['b', 'r', 'g', 'y']
df.groupby('room_type')['price'].mean().plot(kind='bar', color=c)
plt.xlabel('Room type')
plt.ylabel('Average price')
plt.title("Average Price by Room Type")
plt.show()

It can be seen that Hotel room and Entire home are more expensive than Private room and Shared room, which can be expected.

In [None]:
#Exploring price distribution
plt.figure(figsize = (10,6))
plt.title('Price Distribution',fontsize=20)
sns.kdeplot(df['price'], fill=True)  # fill replaces the deprecated shade argument (seaborn >= 0.11)


Prices range from 0 to 10000, but the plot shows that most listings are priced below 500, so the distribution is strongly right-skewed. To reduce this skew and improve model scores, we will apply a log transformation to the price column.
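
The effect of the log transformation can be illustrated on a handful of made-up prices. This is a sketch under assumptions: the `prices` values are hypothetical, and `log1p` is used because the price range includes 0, where a plain logarithm is undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices, including a free (0) listing
prices = pd.Series([0, 40, 60, 75, 90, 120, 150, 275, 900, 10000], dtype=float)

log_prices = np.log1p(prices)    # log(1 + price), defined for price == 0
restored = np.expm1(log_prices)  # inverse transform, back to the original scale

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformed values are much closer to symmetric, and `expm1` recovers the original prices, which is what makes it possible to report predictions in the original units later on.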

In [None]:
#Mapping the whole data
fig=px.scatter_mapbox(data_frame=df,
                      lat="latitude",
                      lon="longitude",
                      color="price",
                    hover_data=["price"],
                     hover_name="neighbourhood",
                     height=800,
                      width=1000,
                     size="price",
                     zoom=10
                     );

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":1,"l":0,"b":0})
fig.show()

Visualizing the whole dataset shows that some listings on the map above have very high prices. These are either luxury listings or errors made when the price was set.

In [None]:
#Filtering the price > 1000
price1000 = df['price'] > 1000
df1000 = df[price1000]
fig=px.scatter_mapbox(data_frame=df1000,
                      lat="latitude",
                      lon="longitude",
                      color="price",
                    hover_data=["price"],
                     hover_name="neighbourhood",
                     height=800,
                      width=1000,
                     size="price",
                     zoom=10
                     );

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":1,"l":0,"b":0})
fig.show()

In [None]:
#Filtering the price < 500
price500 = df['price'] < 500
df500 = df[price500]
fig=px.scatter_mapbox(data_frame=df500,
                      lat="latitude",
                      lon="longitude",
                      color="price",
                    hover_data=["price"],
                     hover_name="neighbourhood",
                     height=800,
                      width=1000,
                     size="price",
                     zoom=10
                     );

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":1,"l":0,"b":0})
fig.show()

In [None]:
#Exploring the neighbourhood_group variable
plt.figure(figsize = (10,6))
ax = sns.countplot(x="neighbourhood_group", data=df)

Brooklyn has the most listings, followed by Manhattan, while Staten Island has the fewest. This suggests that Brooklyn and Manhattan are the areas that attract the most visitors.

In [None]:
#Plotting the average price of the Neighbourhood Groups
plt.figure(figsize = (10,6))
c = ['b', 'r', 'g', 'y', 'm']
df.groupby("neighbourhood_group")["price"].mean().sort_values().plot(kind='bar', color=c)
plt.xlabel("Neighbourhood Group")
plt.ylabel("Average price")
plt.title("Average Price of Neighbourhood Groups")

We can see that the average price in Manhattan is much higher than other neighbourhood groups. 

In [None]:
#Explore minimum nights variable
plt.figure(figsize = (10,6))
plt.title('Minimum Nights')
sns.stripplot(x=df['minimum_nights'])

In [None]:
df.boxplot(column="minimum_nights")

We can see that minimum nights range from 0 nights to about 4 years, and that most listings require a minimum stay of between 1 night and 1 year.

<a id='proc'></a>
# Preprocessing data

Some columns, such as identifiers, names and the almost empty license column, carry no useful signal for the model, so we will drop them.

In [None]:
df.drop(['name','id','host_name', 'host_id', 'last_review', 'license'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.isnull().sum()

We still have null values in the reviews_per_month column, so we will impute them with 0.

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

In [None]:
#Exploring and removing outliers
df[df["price"]>500]

1214 listings have a nightly price above 500. These are either very lavish luxury listings or errors in the data. Either way, these records distort the price distribution, so we will treat them as outliers and drop them.
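
Before dropping rows by a fixed cutoff, it can help to quantify how much data the cutoff removes. The sketch below uses made-up prices; the 500 cutoff mirrors the one used in this notebook, while the 95th-percentile comparison is an assumption added only for illustration.

```python
import pandas as pd

# Hypothetical nightly prices
prices = pd.Series([60, 75, 150, 275, 480, 500, 650, 1200, 9999])
cutoff = 500

above = prices > cutoff
print(f"{above.sum()} of {len(prices)} listings ({above.mean():.1%}) exceed {cutoff}")

# Keep everything at or below the cutoff
kept = prices[prices <= cutoff]

# A percentile is one data-driven alternative to a hand-picked threshold
p95 = prices.quantile(0.95)
print(f"95th percentile: {p95:.1f}")
```

On the real data, a small removed share supports treating those rows as outliers; a large share would suggest the cutoff is too aggressive.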

In [None]:
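# Keep listings priced at or below 500; copy() avoids chained-assignment warnings later
df = df[df["price"] <= 500].copy()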

In [None]:
df.iloc[:,3:].describe()

In [None]:
#For a more normal distribution we apply a log(1 + price) transformation to the price
df['price'] = np.log1p(df['price'])

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(df['price'], fit=norm)
plt.title("Log-Price Distribution Plot")

In [None]:
#Correlation matrix (numeric columns only; numeric_only requires pandas >= 1.5)
plt.figure(figsize=(15,12))
corr=df.corr(method='pearson', numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

<a id='encode'></a>
# Encoding categorical variables

In [None]:
feature_columns=['neighbourhood_group','room_type','price','minimum_nights','calculated_host_listings_count','availability_365']
# copy() so that the encoding below modifies an independent frame, not a view of df
features=df[feature_columns].copy()
features.head()

In [None]:
features['room_type']=features['room_type'].factorize()[0]
features['neighbourhood_group']=features['neighbourhood_group'].factorize()[0]
features.head()
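
`factorize()` assigns integer codes in order of first appearance, which a linear model reads as an ordered numeric scale even though room types and boroughs have no natural order. One common alternative, not used in this notebook, is one-hot encoding. A minimal sketch on a hypothetical `sample` frame:

```python
import pandas as pd

# Hypothetical sample with the two categorical columns used above
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens", "Brooklyn"],
    "room_type": ["Entire home/apt", "Private room", "Private room", "Shared room"],
    "minimum_nights": [30, 30, 5, 2],
})

# Integer codes as produced by factorize(): numbered by first appearance
codes, uniques = sample["room_type"].factorize()

# One-hot encoding: one indicator column per category, no implied ordering.
# drop_first=True drops one level per column to avoid redundant indicators.
one_hot = pd.get_dummies(
    sample, columns=["neighbourhood_group", "room_type"], drop_first=True
)

print(list(codes))
print(one_hot.columns.tolist())
```

Tree-based models are largely insensitive to this choice, so the difference would mainly be expected to show up in the linear regression.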

<a id='ttsplit'></a>
# Train Test Split

In [None]:
y = features['price']
x= features.drop(['price'],axis=1)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

# collect scores of each algorithm for model comparison
mae_dict = {}
mse_dict = {}
x

<a id='lreg'></a>
# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


model = LinearRegression()
model.fit(x_train,y_train)
y_pred=model.predict(x_test)


print('MAE:{:.3f}'.format(mean_absolute_error(y_test, y_pred)))
print('MSE:{:.3f}'.format(mean_squared_error(y_test, y_pred)))
print('RMSE:{:.3f}'.format(np.sqrt(mean_squared_error(y_test, y_pred))))

In [None]:
mae_dict['Linear Regression'] = mean_absolute_error(y_test, y_pred)
mse_dict['Linear Regression'] = mean_squared_error(y_test, y_pred)

In [None]:
error=pd.DataFrame(np.array(y_test).flatten(),columns=['actual'])
# expm1 is used to revert the data normalized by log1p
error['actual']=np.expm1(error['actual'])
error['prediction']=np.array(np.expm1(y_pred))
error.head(20)

<a id='rfreg'></a>
# Random Forest Regression 

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

In [None]:
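# parameter ranges for the randomized search
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 500, num = 5)]
# 1.0 uses all features, which is what 'auto' meant for regressors before it was removed in scikit-learn 1.3
max_features = [1.0, 'sqrt']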
max_depth = [int(x) for x in np.linspace(10, 110, num = 6)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
# Create the random grid
rm_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
rf_model = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf_model, 
                                param_distributions = rm_grid, 
                                n_iter = 10, 
                                cv = 3, 
                                verbose=2, 
                                random_state=66,
                                n_jobs = -1)

In [None]:
# search for the best parameters
rf_random.fit(x_train, y_train)

In [None]:
rf_random.best_estimator_

In [None]:
# Random forest with fixed hyperparameters taken from the search space above;
# arguments that are left out keep their scikit-learn defaults
rf_model_best = RandomForestRegressor(bootstrap=True, max_depth=10,
                      max_features='sqrt',
                      min_samples_leaf=4, min_samples_split=10,
                      n_estimators=300)

In [None]:
rf_model_best.fit(x_train, y_train)
y_pred = rf_model_best.predict(x_test)

In [None]:
print('MAE:{:.3f}'.format(mean_absolute_error(y_test, y_pred)))
print('MSE:{:.3f}'.format(mean_squared_error(y_test, y_pred)))
print('RMSE:{:.3f}'.format(np.sqrt(mean_squared_error(y_test, y_pred))))

In [None]:
mae_dict['Random Forest'] = mean_absolute_error(y_test, y_pred)
mse_dict['Random Forest'] = mean_squared_error(y_test, y_pred)

In [None]:
error=pd.DataFrame(np.array(y_test).flatten(),columns=['actual'])
error['actual']=np.expm1(error['actual'])
error['prediction']=np.array(np.expm1(y_pred))
error.head(20)

<a id='dtreg'></a>
# Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dtree_model = DecisionTreeRegressor()

In [None]:
dtree_model.fit(x_train, y_train)
y_pred = dtree_model.predict(x_test)

In [None]:
print('MAE:{:.3f}'.format(mean_absolute_error(y_test, y_pred)))
print('MSE:{:.3f}'.format(mean_squared_error(y_test, y_pred)))
print('RMSE:{:.3f}'.format(np.sqrt(mean_squared_error(y_test, y_pred))))

In [None]:
mae_dict['Decision Tree'] = mean_absolute_error(y_test, y_pred)
mse_dict['Decision Tree'] = mean_squared_error(y_test, y_pred)

In [None]:
error=pd.DataFrame(np.array(y_test).flatten(),columns=['actual'])
# expm1 is used to revert the data normalized by log1p
error['actual']=np.expm1(error['actual'])
error['prediction']=np.array(np.expm1(y_pred))
error.head(20)

In [None]:
plt.figure(figsize=(10,10))
sns.regplot(x=np.array(np.expm1(y_test)), y=np.array(np.expm1(y_pred)), line_kws={"color": "red"}, color='springgreen')
plt.title('Evaluated predictions', fontsize=15)
plt.xlabel('Real values')
plt.ylabel('Predicted values')
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (16,8))
ax1.scatter(np.array(np.expm1(y_test)),np.expm1(y_pred))
ax1.set_xlabel("True", size = 14)
ax1.set_ylabel("Prediction", size = 14)
ax2.plot(np.array(np.expm1(y_test)), label="True")
ax2.plot(np.expm1(y_pred), label = "Prediction")
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left', prop={'size': 12})

<a id='compare'></a>
# Model Comparison

In [None]:
# compare MAE and MSE scores of the three ML algorithms
fig, (ax1, ax2) = plt.subplots(2, 1 ,figsize = (6, 16))
ax1.set_title("MAE")
ax1.plot(list(mae_dict.keys()), list(mae_dict.values()), marker = "o", color = "green")
ax2.set_title("MSE")
ax2.plot(list(mse_dict.keys()), list(mse_dict.values()), marker = "o", color = "orange")
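
The comparison above plots MAE and MSE only. R-squared can be added alongside them; the sketch below computes it with NumPy on made-up values (the arrays are hypothetical, and `r_squared` is an illustrative helper). The `r2_score` function imported earlier from `sklearn.metrics` computes the same quantity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical log-price targets and predictions
y_true = np.array([4.0, 4.5, 5.0, 5.5, 6.0])
y_pred = np.array([4.2, 4.4, 5.1, 5.3, 5.9])

mse = np.mean((y_true - y_pred) ** 2)
r2 = r_squared(y_true, y_pred)

# On a fixed test set, R^2 = 1 - MSE / Var(y_true),
# so ranking models by MSE and ranking them by R^2 give the same order
matches_identity = bool(np.isclose(r2, 1 - mse / y_true.var()))
print(round(r2, 3), matches_identity)
```

In the notebook, an `r2_dict` filled next to `mae_dict` and `mse_dict` would be one way to include this score in the comparison plot.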

<a id='conclude'></a>
# Conclusion

We applied three ML algorithms (Linear Regression, Random Forest Regression and Decision Tree Regression) to the data and found that Random Forest Regression performs best, with the highest R-squared score and the lowest MSE. There are two directions for future work. First, we could examine the relationship between an Airbnb listing's name and its price. Second, instead of predicting a specific price, we could treat the task as a classification problem and predict a price bucket, which would give an Airbnb host a lowest and a highest suggested price. This may well increase accuracy and would make the model more useful from a practical standpoint.
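
The price-bucket idea can be sketched with `pd.cut`. This is only an illustration under assumptions: the `prices` values, bucket `edges` and `labels` are made up, and in practice the edges could instead be derived from quantiles with `pd.qcut`.

```python
import pandas as pd

# Hypothetical nightly prices
prices = pd.Series([45, 60, 75, 90, 120, 150, 200, 275, 350, 480])

# Assumed bucket edges; each interval is closed on the right
edges = [0, 100, 200, 500]
labels = ["0-100", "101-200", "201-500"]
buckets = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)

# Lowest and highest observed price per bucket, usable as a suggested range
suggested = prices.groupby(buckets, observed=True).agg(["min", "max"])
print(suggested)
```

The `buckets` column would then serve as the classification target in place of the log price.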