In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("preprocessed.csv")
data.head()

Unnamed: 0,listing_id,latitude,longitude,accomodates,bedrooms,beds,host_since,date,available,price,...,"""nvt body soap""","""Electrolux induction stove""","""Qumi Bluetooth sound system""","""Private patio or balcony""","""Private entrance""","""Extra pillows and blankets""","""Smeg gas stove""","""Whirlpool oven""","""Carbon monoxide alarm""",HBO Max
0,40334325,51.20989,4.42298,2,1.0,2.0,1574467200,1659484800,0,56.0,...,0,0,0,0,0,0,0,0,0,0
1,22742449,51.21905,4.42292,4,2.0,2.0,1484092800,1668297600,1,95.0,...,0,0,0,0,0,0,0,0,0,0
2,34621717,51.19893,4.40269,2,1.0,1.0,1415491200,1650153600,0,75.0,...,0,0,0,0,0,0,0,0,0,0
3,38281744,51.219448,4.402464,2,1.0,1.0,1442361600,1643587200,1,150.0,...,0,0,0,0,0,0,0,0,0,0
4,18835003,51.21328,4.39494,2,1.0,1.0,1495238400,1653091200,0,100.0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319192 entries, 0 to 319191
Columns: 592 entries, listing_id to  HBO Max
dtypes: float64(6), int64(586)
memory usage: 1.4 GB


In [4]:
X = data.drop('price',axis=1)
y = data[['price']]

Let's compare the R2 score and root mean squared error (RMSE) with the amenities columns included against the same scores with these columns excluded. There are more than 500 amenity columns, so removing them would significantly reduce complexity, provided they turn out not to contribute much to the model.
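The comparison described above is repeated for each model, so it can be wrapped in one helper. The following is a minimal sketch, not the notebook's own code: `compare_feature_sets`, `make_model`, `n_base` and the toy data are illustrative names and assumptions, and the toy frame only stands in for the listings table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def compare_feature_sets(make_model, X, y, n_base, seed=42):
    """Fit the same model on all columns and on the first n_base columns.

    Returns {label: (r2, rmse)}, both measured on the same 80/20 split.
    """
    results = {}
    for label, features in (("all", X), ("base", X.iloc[:, :n_base])):
        # Same seed and same row count, so both runs share one row split.
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, y, test_size=0.2, random_state=seed
        )
        model = make_model().fit(X_tr, y_tr)
        pred = model.predict(X_te)
        results[label] = (
            r2_score(y_te, pred),
            np.sqrt(mean_squared_error(y_te, pred)),
        )
    return results


# Toy data (assumption): 3 informative columns followed by 5 pure-noise columns.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 8)), columns=[f"c{i}" for i in range(8)])
y_demo = (
    2.0 * X_demo["c0"] - X_demo["c1"] + 0.5 * X_demo["c2"]
    + rng.normal(scale=0.1, size=500)
)

demo_scores = compare_feature_sets(LinearRegression, X_demo, y_demo, n_base=3)
for label, (r2, rmse) in demo_scores.items():
    print(f"{label}: R2={r2:.4f} RMSE={rmse:.4f}")
```

Passing a factory such as `LinearRegression` (rather than a fitted instance) keeps the two fits independent of each other.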

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42) #data with all columns

In [6]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression()

In [7]:
from sklearn.metrics import r2_score, mean_squared_error
print(r2_score(y_test, linreg.predict(X_test)))
print(np.sqrt(mean_squared_error(y_test, linreg.predict(X_test)))) #RMSE

0.2295801953474722
165.88911074557882


In [8]:
X_train, X_test, y_train, y_test = train_test_split(X.iloc[:,:54], y, test_size=0.2, random_state=42) #data with amenity columns removed

In [9]:
linreg2 = LinearRegression()
linreg2.fit(X_train, y_train)

LinearRegression()

In [10]:
print(r2_score(y_test, linreg2.predict(X_test)))
print(np.sqrt(mean_squared_error(y_test, linreg2.predict(X_test))))

0.22958019483555792
165.88911080069227


These two linear regression models show that the additional amenities columns add very little: R2 and RMSE are nearly identical with or without them (the differences are below 1e-7). Let us still test a few more models before removing them.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [12]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)

DecisionTreeRegressor()

In [13]:
print(r2_score(y_test, dtr.predict(X_test)))
print(np.sqrt(mean_squared_error(y_test, dtr.predict(X_test))))

0.9868624115709951
21.662670107332985


In [14]:
X_train, X_test, y_train, y_test = train_test_split(X.iloc[:,:54], y, test_size=0.2, random_state=42)

In [15]:
dtr2 = DecisionTreeRegressor().fit(X_train,y_train)

In [16]:
print(r2_score(y_test, dtr2.predict(X_test)))
print(np.sqrt(mean_squared_error(y_test, dtr2.predict(X_test))))

0.9870219853035858
21.530706991087936


Decision Tree gives much better scores, though that might be due to overfitting. Taking out the amenities columns does not degrade either score: R2 moves from 0.9869 to 0.9870 and RMSE from 21.66 to 21.53, a negligible difference. It therefore makes sense to decrease the complexity of the data, since nothing is sacrificed in the scoring. Let us look at one final model to see if it's worth keeping the additional columns.
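One way to probe the overfitting concern is to score the tree on listings it has never seen. The sketch below assumes that rows sharing a `listing_id` are the same listing on different dates; the `listing_id`, `date` and `available` columns suggest this, but the notebook does not confirm it. `grouped_r2` and the toy data are illustrative, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.tree import DecisionTreeRegressor


def grouped_r2(X, y, groups, seed=42):
    """R2 on a test set whose groups never appear in the training set."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=groups))
    model = DecisionTreeRegressor(random_state=seed)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    return model.score(X.iloc[test_idx], y.iloc[test_idx]), train_idx, test_idx


# Toy data (assumption): 50 listings observed on 10 dates each.
rng = np.random.default_rng(0)
n_listings, n_dates = 50, 10
listing = np.repeat(np.arange(n_listings), n_dates)
size = np.repeat(rng.integers(1, 7, size=n_listings), n_dates)
X_demo = pd.DataFrame(
    {
        "listing_id": listing,
        "accomodates": size,
        "date": np.tile(np.arange(n_dates), n_listings),
    }
)
y_demo = pd.Series(30.0 * size + rng.normal(scale=5.0, size=len(size)))

score, train_idx, test_idx = grouped_r2(X_demo, y_demo, groups=X_demo["listing_id"])
print(score)
```

If the grouped score on the real data were much lower than the random-split score, that would point to the tree memorising individual listings rather than generalising.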

In [17]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=50)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [19]:
rfr.fit(X_train, y_train)


RandomForestRegressor(n_estimators=50)

In [20]:
print(r2_score(y_test, rfr.predict(X_test)))
print(np.sqrt(mean_squared_error(y_test, rfr.predict(X_test))))

0.9899947759340586
18.904603344508832


In [21]:
X_train, X_test, y_train, y_test = train_test_split(X.iloc[:,:54], y, test_size=0.2, random_state=42)

In [22]:
rfr2 = RandomForestRegressor(n_estimators=50).fit(X_train, y_train)


In [23]:
print(r2_score(y_test, rfr2.predict(X_test)))
print(np.sqrt(mean_squared_error(y_test, rfr2.predict(X_test))))

0.9903623735679983
18.554070571532662


Once again, the two datasets score almost the same with the same model. Let us drop the excess complexity introduced by the amenities columns. This leaves us with 53 feature columns plus the price target, 54 columns in total.
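Note that the comparison cells above slice `X.iloc[:, :54]` while the cell below keeps `X.iloc[:, :53]`, so a positional cut is easy to get off by one. A minimal sketch of cutting by column name instead follows; `split_at_column` is an illustrative helper and `"last_base"` is a hypothetical stand-in, since the notebook does not show the name of the real boundary column.

```python
import pandas as pd


def split_at_column(X, last_base_col):
    """Return (base, extra): columns up to and including last_base_col, then the rest."""
    cut = X.columns.get_loc(last_base_col) + 1
    return X.iloc[:, :cut], X.iloc[:, cut:]


# Toy frame; "last_base" is a hypothetical name for the last non-amenity column.
X_demo = pd.DataFrame(
    {
        "latitude": [51.2, 51.3],
        "beds": [1, 2],
        "last_base": [0, 1],
        "Wifi": [1, 1],
        "HBO Max": [0, 0],
    }
)
base_demo, amenities_demo = split_at_column(X_demo, "last_base")
print(list(base_demo.columns), list(amenities_demo.columns))
```

Defining the boundary once and reusing it keeps the experiments and the saved file on exactly the same set of columns.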

In [24]:
new_X = X.iloc[:,:53]

In [41]:
new_data = new_X.copy()
new_data['price']=y
new_data.to_csv("final_data.csv",index=False)

**We now have our final data for model building. It has been saved as a CSV file called 'final_data.csv'. All further models will be based on this data, which has 54 columns and 319192 rows.**

# Start here to compare some models on cut-down dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("final_data.csv")
X = data.drop("price",axis=1)
y = data[['price']]

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319192 entries, 0 to 319191
Data columns (total 54 columns):
 #   Column                                             Non-Null Count   Dtype  
---  ------                                             --------------   -----  
 0   listing_id                                         319192 non-null  int64  
 1   latitude                                           319192 non-null  float64
 2   longitude                                          319192 non-null  float64
 3   accomodates                                        319192 non-null  int64  
 4   bedrooms                                           319192 non-null  float64
 5   beds                                               319192 non-null  float64
 6   host_since                                         319192 non-null  int64  
 7   date                                               319192 non-null  int64  
 8   available                                          319192 non-null  int64 

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2) #80% training data, 20% test data.
#There may be some variability in reported and observed scores due to not using random_state in this step. 

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
linreg = LinearRegression().fit(X_train,y_train)
print(linreg.score(X_test,y_test))
print(mean_squared_error(y_test, linreg.predict(X_test), squared=False))

0.2241974570455354
149.82455115019638


Linear Regression does not seem to be a good enough model for this dataset. We observe an $R^2$ score of only 22.42%, with an RMSE of $149.82, which is a poor result.
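The cells in this section request RMSE through `squared=False`. To our knowledge, newer scikit-learn releases deprecate and later drop that keyword, so the same number can be computed in a version-independent way, as the first half of this notebook already does. A minimal sketch (`rmse` is an illustrative helper name):

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error, without relying on the `squared` keyword."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Every prediction is off by exactly 3, so the RMSE is 3.
print(rmse([0.0, 0.0], [3.0, 3.0]))
```

On the real data this would be called as `rmse(y_test, linreg.predict(X_test))`.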

In [5]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
print(dtr.score(X_test,y_test))
print(mean_squared_error(y_test, dtr.predict(X_test), squared=False))

0.98642666917453
19.81757922585165


Decision Tree Regressor immediately shows a massive improvement over Linear Regression, with an $R^2$ score of 98.64% and an RMSE of $19.82.

In [6]:
dtr2 = DecisionTreeRegressor(max_depth=15, max_features='sqrt',random_state=42).fit(X_train, y_train)
print(dtr2.score(X_test,y_test))
print(mean_squared_error(y_test, dtr2.predict(X_test), squared=False))

0.9580580100381929
34.836290371280626


With some parameter tuning to reduce overfitting, we still see an $R^2$ score of 95.81% and an RMSE of $34.84 with Decision Tree Regressor.
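The parameters above were set by hand. A cross-validated search is one way to choose them more systematically. This is a minimal sketch on toy data, not the tuning actually performed here; the grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem (assumption) standing in for the listings data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=400)

# Candidate settings that limit how far the tree can grow.
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, -search.best_score_)
```

On the full dataset a search like this would be fitted on the training split only, keeping the test split for the final score.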

In [7]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=200, n_jobs=-1).fit(X_train,y_train.values.ravel())
print(rfr.score(X_test,y_test))
print(mean_squared_error(y_test, rfr.predict(X_test), squared=False))

0.9890407420012256
17.807295329092803


Random Forest Regressor further improves on the score of Decision Tree, with an $R^2$ score of 98.90% and an RMSE of $17.81. This is the best result we have obtained.

In [8]:
rfr2 = RandomForestRegressor(min_samples_split=100, n_jobs=-1, max_features='sqrt', max_samples=0.7).fit(X_train,y_train.values.ravel())
print(rfr2.score(X_test,y_test))
print(mean_squared_error(y_test, rfr2.predict(X_test), squared=False))

0.972624169807614
28.14433782185646


To reduce overfitting, we do some hyperparameter tuning, and obtain an $R^2$ score of 97.26% and RMSE of $28.14 with Random Forest.

In [9]:
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor().fit(X_train,y_train.values.ravel())
print(ada.score(X_test,y_test))
print(mean_squared_error(y_test, ada.predict(X_test), squared=False))

0.05446534109649981
165.40403970594974


In [10]:
ada2 = AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=5)).fit(X_train,y_train.values.ravel())
print(ada2.score(X_test,y_test))
print(mean_squared_error(y_test, ada2.predict(X_test),squared=False))

0.4684638069176431
124.01488266252962


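AdaBoost fails to give a good result on this dataset. The default parameters only manage an $R^2$ score of 5.45%, with an RMSE of $165.40. This score is worse than Linear Regression, from a much more complex and time-intensive model. Using a depth-5 decision tree as the base estimator raises the $R^2$ score to 46.85% and lowers the RMSE to $124.01, which is still far behind the tree-based models above.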
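The `base_estimator` keyword used above was, to our knowledge, renamed to `estimator` in later scikit-learn releases. The sketch below picks whichever name the installed version accepts; `make_adaboost` is an illustrative helper and the toy data is an assumption.

```python
import inspect

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor


def make_adaboost(base, **kwargs):
    """Build an AdaBoostRegressor using whichever base-learner keyword is supported."""
    params = inspect.signature(AdaBoostRegressor.__init__).parameters
    key = "estimator" if "estimator" in params else "base_estimator"
    return AdaBoostRegressor(**{key: base}, **kwargs)


# Toy data (assumption) just to show the model fits and predicts.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

ada_demo = make_adaboost(DecisionTreeRegressor(max_depth=5), n_estimators=20, random_state=0)
ada_demo.fit(X_demo, y_demo)
print(ada_demo.score(X_demo, y_demo))
```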

In [11]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor().fit(X_train,y_train)
print(gbr.score(X_test,y_test))
print(mean_squared_error(y_test, gbr.predict(X_test), squared=False))


0.791657006996321
77.64202169374471


GBM gives a decent $R^2$ score of 79.17% and RMSE of $77.64.

In [15]:
import xgboost as xgb
xgboost = xgb.XGBRegressor().fit(X_train,y_train.values.ravel())
print(xgboost.score(X_test,y_test))
print(mean_squared_error(y_test, xgboost.predict(X_test), squared=False))

0.979996147759562
24.058259491727803


XGBoost also gives a good $R^2$ score of 98.00%, with an RMSE of $24.06.

#### Using the best model we have, random forest, let's find the top predictors in our model

In [18]:
importance_df = pd.DataFrame({'columns': X.columns, 'importance': rfr.feature_importances_})

In [23]:
importance_df.sort_values(by='importance',ascending=False).head()

Unnamed: 0,columns,importance
38,property_type_Private room in townhouse,0.276067
24,property_type_Entire villa,0.142162
3,accomodates,0.075672
11,bathrooms,0.068826
5,beds,0.065905
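The ranking above comes from impurity-based importances, which are computed on the training data and can favour features with many distinct values. Permutation importance on the held-out split is a common cross-check. The following is a minimal sketch on toy data, not a result for this dataset; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data (assumption): one strong predictor, one weak predictor, two noise columns.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(400, 4)), columns=["signal", "weak", "noise_a", "noise_b"]
)
y_demo = 3.0 * X_demo["signal"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the test set in turn and record the drop in R2.
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

perm_df = pd.DataFrame({"columns": X_demo.columns, "importance": perm.importances_mean})
print(perm_df.sort_values(by="importance", ascending=False))
```

On the real data the call would take the fitted `rfr` with `X_test` and `y_test`. With 200 trees and roughly 64,000 test rows this may take a while, so a small `n_repeats` is a reasonable starting point.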
