# Predicting Airbnb Prices for Munich

The goal of our data mining project is to predict prices for new Airbnb listings in Munich. To achieve this, we will train a regression model on existing Airbnb data from www.insideairbnb.com.

## Table of Contents
##### [1 Preprocessing](#preprocessing)
##### [2 Data Mining](#data_mining)
##### [3 Interpretation and Evaluation](#interpretation_evaluation)

In [47]:
#Imports
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV, train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor
from math import sqrt
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.feature_selection import RFECV, f_regression, SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import itertools
from sklearn import neighbors
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import fbeta_score, make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV

<a id='preprocessing'></a>
## 1 Preprocessing

In [48]:
%run modules/preprocessing.py
preprocessed_df = load_and_preprocess_dataset()

2019-11-24 13:14:30 : Dataset loaded and preprocessed.


In [49]:
%run modules/preprocessing.py
features = select_best_features(preprocessed_df, number_of_features = 70) # Total number of features: 70
label = preprocessed_df['maximum_price']

Selected Features:  15               maximum_nights
12             security_deposit
13                 cleaning_fee
1     host_total_listings_count
10                         beds
                ...            
23                      heating
11                     bed_type
20                     internet
34           verification_phone
2          host_has_profile_pic
Name: Specs, Length: 70, dtype: object


In [50]:
# Create suitable bins for stratification in train-test-split
bins = pd.qcut(label, 30, labels=False)

# Train-test-split
x_train, x_test, y_train, y_test = train_test_split(features, label, test_size = 0.2, random_state = 42, stratify=bins)

In [None]:
# if you want to delete outliers, execute this after the train-test-split:

# %run modules/preprocessing.py
# x_train, y_train = delete_price_outliers(x_train, y_train)

In [51]:
# Check correlation of independant variables
corr_matrix = preprocessed_df.corr()
for (column_name, column_data) in corr_matrix.iteritems():
    for row_name, value in column_data.iteritems():
        if(value > 0.6 and column_name != row_name):
            print(column_name, " ", row_name, " ", value)

accommodates   beds   0.7060608155629018
bedrooms   beds   0.6501529074983061
beds   accommodates   0.7060608155629018
beds   bedrooms   0.6501529074983061
require_guest_profile_picture   require_guest_phone_verification   0.7871857495927765
require_guest_phone_verification   require_guest_profile_picture   0.7871857495927765
verification_government_id   verification_jumio   0.9765520138913213
verification_jumio   verification_government_id   0.9765520138913213
verification_offline_government_id   verification_selfie   0.6336336790497336
verification_selfie   verification_offline_government_id   0.6336336790497336


<a id='data_mining'></a>
## 2 Data Mining

### 2.1 Evaluation of a Dummy Regressor

In [22]:
# k-fold cross-validation (k = 10)
scores = []
rmse = []
k_fold_cross_validation = KFold(10, True, 1)
for train_index, test_index in k_fold_cross_validation.split(features):

    # Split the dataset for training and testing
    x_train, x_test, y_train, y_test = features.loc[train_index, :], features.loc[test_index, :], label[train_index], label[test_index]
    
    # Dummy Regressor
    regressor = DummyRegressor(strategy='median')
    regressor.fit(x_train, y_train)

    # Evaluation using testing dataset
    scores.append(regressor.score(x_test, y_test))  
    predictions = regressor.predict(x_test)
    rmse.append(sqrt(mean_squared_error(y_test, predictions)))

# Calculate performance measures
print("Dummy Regressor: ", str(np.mean(scores)))
print("RMSE: ", str(np.mean(rmse)))

Dummy Regressor:  -0.07154551044155429
RMSE:  143.85943496805479


### 2.2 Evaluation of different Regression Approaches

In [11]:
#x_train, x_test, y_train, y_test = train_test_split(features, label, test_size = 0.2, random_state = 0)

#x_train, y_train = delete_price_outliers(x_train, y_train)

# test different regression approaches
estimators = [ LinearRegression(), Ridge(), KNeighborsRegressor(), DecisionTreeRegressor(), MLPRegressor(), SVR() ]
svr = [ SVR() ]
pipeline = Pipeline( [ ('preprocessing', StandardScaler()), ('estimator', None) ])

# define a parameter grid
parameters = {
    'estimator': estimators
}

# define and run a grid search using MSE as scoring metric
search = GridSearchCV(pipeline, parameters, cv=10, scoring='neg_mean_squared_error')
search.fit(x_train, y_train)

# evaluate on test set
predictions = search.predict(x_test)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Best Model: {}".format(search.best_params_))
print("RMSE: {}".format(sqrt(mse)))
print("R^2: {}".format(r2))

Best Model: {'estimator': Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)}
RMSE: 100.0147387841499
R^2: 0.43635167861691004


### 2.3 Evaluation of the Support Vector Machine

In [None]:
%run modules/evaluation.py

best_r2 = 0

# k-fold cross-validation (k = 10)
scores = []
k_fold_cross_validation = KFold(10, True, 1)
for train_index, test_index in k_fold_cross_validation.split(features):

    # Split the dataset for training and testing
    x_train, x_test, y_train, y_test = features.loc[train_index, :], features.loc[test_index, :], label[train_index], label[test_index]

    # Support Vector Regressor (SVR) using training dataset
    svr = SVR(kernel='linear', C = 0.7)
    svr.fit(x_train, y_train)

    # Evaluation using testing dataset
    scores.append(svr.score(x_test, y_test))  

# Calculate performance measures
print("r2: ", str(np.mean(scores)))

In [33]:
%run modules/evaluation.py


best_r2 = 0

# Generate all feature combinations
feature_combinations = generate_feature_combinations(already_preprocessed)

for feature_combination in feature_combinations:
    
    # Filter the selected features
    selected_features = already_preprocessed[feature_combination]
    
    # k-fold cross-validation (k = 10)
    scores = []
    k_fold_cross_validation = KFold(10, True, 1)
    for train_index, test_index in k_fold_cross_validation.split(selected_features):
    
        # Split the dataset for training and testing
        x_train, x_test, y_train, y_test = selected_features.loc[train_index, :], selected_features.loc[test_index, :], label[train_index], label[test_index]

        # Support Vector Regressor (SVR) using training dataset
        svr = SVR(kernel='linear', C = 0.7)
        svr.fit(x_train, y_train)

        # Evaluation using testing dataset
        scores.append(svr.score(x_test, y_test))  

    # Calculate performance measures
    print(np.mean(scores), " - ", feature_combination)

    # Save best model
    if(np.mean(scores) > best_r2):
          best_r2 = np.mean(scores)
        
print("Best: ", best_r2)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

### 2.4 Evaluation of Linear Regression

In [102]:
#, stratify=y_binned
#, random_state = 42
x_train, x_test, y_train, y_test = train_test_split(features, label, test_size = 0.2, random_state = 42, stratify=bins)

#x_train, y_train = delete_price_outliers(x_train, y_train)

reg = LinearRegression()
reg.fit(x_train, y_train)
prediction = reg.predict(x_test)
mse = mean_squared_error(y_test, prediction)
r2 = r2_score(y_test, prediction)
print("MSE:", mse)
print("RMSE:", sqrt(mse))
print("R^2:", r2)

MSE: 12205.405453113695
RMSE: 110.47807679858342
R^2: 0.39620656940976784


### 2.5 Evaluation of Advanced Regression

In [128]:
#1. Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rr_est = Ridge()
param = {"alpha": [1e-10, 1e-8, 1e-4,1e-3, 1e-2, 1, 5, 10, 20]}

rr_est_opt = GridSearchCV(rr_est, param, scoring="neg_mean_squared_error", cv=5)
rr_est_opt.fit(x_train, y_train)

# Get best param
print(rr_est_opt.best_params_)
print(rr_est_opt.best_score_)

{'alpha': 20}
-4696.814613726855


In [129]:
# Test with alpha 20
from sklearn.linear_model import ridge
rr = Ridge(alpha = 20)
rr.fit(x_train, y_train)
rr.coef_

#Test and Evaluate
rr_predictions = rr.predict(x_test)

mse = mean_squared_error(y_test, rr_predictions)
r2 = r2_score(y_test, rr_predictions)

print("MSE:", mse)
print("RMSE:", sqrt(mse))
print("R^2:", r2)

MSE: 11276.885237588422
RMSE: 106.19267977402407
R^2: 0.3831739616006318


## 2.6. Evaluation of KNN Regression

In [2]:
import plotly.express as px

#### Evaluate development of error term with different values for K

In [None]:
%run modules/evaluation.py

best_nmse = -100000

neg_mse = []
for K in range(30):
    K = K + 1
    knn = neighbors.KNeighborsRegressor(n_neighbors=K)

    # k-fold cross-validation (k = 10)
    scores = []
    k_fold_cross_validation = KFold(10, True, 1)
    for train_index, test_index in k_fold_cross_validation.split(features):

        # KNN Regression using training dataset
        knn = neighbors.KNeighborsRegressor(n_neighbors=K)
        knn.fit(x_train, y_train)

        # Evaluation using testing dataset
        scores.append(cross_val_score(knn, x_train, y_train, scoring='neg_mean_squared_error'))  

    # Calculate performance measures
    print("Negative mean squared error: ", str(np.mean(scores)), "for a K of", K)
    neg_mse.append(np.mean(scores))
    
    # Save best model
    if(np.mean(scores) > best_nmse):
          best_nmse = np.mean(scores)
            
# plotting the rmse values against k values
curve = px.line(data, x="Value of K", y="Negative Mean Squared Error", title='Change of NMSE for different values of K')
curve.show()


print("Best: ", best_nmse)

Negative Mean squared error:  -21477.76092454774 for a K of 1
Negative Mean squared error:  -18172.159012237866 for a K of 2
Negative Mean squared error:  -17441.135494347203 for a K of 3
Negative Mean squared error:  -16850.657433589662 for a K of 4
Negative Mean squared error:  -16791.13280286299 for a K of 5
Negative Mean squared error:  -16765.48719289242 for a K of 6
Negative Mean squared error:  -16734.8804725024 for a K of 7
Negative Mean squared error:  -16628.455845007862 for a K of 8
Negative Mean squared error:  -16626.042864958483 for a K of 9
Negative Mean squared error:  -16627.058175746464 for a K of 10
Negative Mean squared error:  -16611.733558632426 for a K of 11
Negative Mean squared error:  -16613.513410009924 for a K of 12
Negative Mean squared error:  -16660.906663192593 for a K of 13
Negative Mean squared error:  -16697.553606518184 for a K of 14
Negative Mean squared error:  -16750.762711447842 for a K of 15
Negative Mean squared error:  -16744.852812006735 for 

#### GridSearch to evaluate the best parameter values

In [11]:
# create an estimator
knn_estimator = neighbors.KNeighborsRegressor()
parameters = {
    'n_neighbors': range(2, 30),
    'weights': ['uniform', 'distance'],
    'metric': ['manhattan', 'euclidean', 'minkowski', 'chebyshev']
}

# specify the cross validation
k_fold_cross_validation = KFold(10, True, 1)

grid_search_estimator = GridSearchCV(knn_estimator, parameters, cv=k_fold_cross_validation, scoring='neg_mean_squared_error')
grid_search_estimator.fit(x_train,y_train)
grid_search_estimator.best_params_

# evaluate on test set
predictions = grid_search_estimator.predict(x_test)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Best Model: {}".format(grid_search_estimator.best_params_))
print("RMSE: {}".format(sqrt(mse)))
print("R^2: {}".format(r2))


# print the best parameter setting
print("best score is {} with params {}".format(grid_search_estimator.best_score_, grid_search_estimator.best_params_))

Best Model: {'metric': 'manhattan', 'n_neighbors': 12, 'weights': 'distance'}
RMSE: 130.14655641574794
R^2: 0.16208196988783952
best score is -15456.122970596134 with params {'metric': 'manhattan', 'n_neighbors': 12, 'weights': 'distance'}


#### Tryout different feature combinations

In [None]:
%run modules/evaluation.py

best_r2 = 0
best_feature_combi = 0

# Generate all feature combinations
feature_combinations = generate_feature_combinations(features)

for feature_combination in feature_combinations:
    # Filter the selected features
    
    x_train = x_train[feature_combination]
    y_train = y_train[feature_combination]
            
    # Tryout with manhatten
    knn = KNeighborsRegressor(n_neighbors = 9, metric = 'manhattan', weights= 'uniform')
    knn.fit(x_train, y_train)
    price_predicted = knn.predict(x_test)

    # evaluate using different measures
    mse = mean_squared_error(y_test, price_predicted)
    r2 = r2_score(y_test, price_predicted)
        
    print("Evaluation of feature combination", feature_combination)
    print("MSE:", mse)
    print("RMSE:", sqrt(mse))
    print("R^2:", r2)
        
    if(r2 > best_r2):
        best_r2 = r2
        best_feature_combi = feature_combination
            
    print("Best score so far:", best_r2, "with the features:", best_feature_combi)
        
print("Best score overall: ", best_r2, "with the features:", best_feature_combi)

# 2.7 Random Forest

In [52]:
#Random Forest
print("Random Forest:")

x_train, x_test, y_train, y_test = train_test_split(features, label, test_size = 0.2, random_state = 42, stratify=bins)
#x_train, y_train = delete_price_outliers(x_train, y_train)


rf_regressor = RandomForestRegressor()

#define ranges for parameters
max_depth = np.arange(5,15,2)
min_samples_split = np.arange(2,100,25)
min_samples_leafs = np.arange(1,50,15)
max_leaf_nodes = np.arange(2,50,15)


pipeline = Pipeline( [ ('preprocessing', StandardScaler()), ('regressor', rf_regressor) ])


# define and run a grid search using r2 as scoring metric
parameters = {'regressor__n_estimators': [100],
                'regressor__max_depth': max_depth,
                'regressor__n_jobs': [-1],
                'regressor__min_samples_leaf': min_samples_leafs,
                   'regressor__min_samples_split': min_samples_split,
                'regressor__max_leaf_nodes': max_leaf_nodes}

# define and run a grid search using r2 as scoring metric
rf_search = GridSearchCV(pipeline, parameters, cv=10, scoring='r2', verbose=1, n_jobs=-1)
rf_search.fit(x_train, y_train)

#Training Error
print(f"Best Model: {rf_search.best_params_}")
training_predictions = rf_search.predict(x_train)
training_mse = mean_squared_error(y_train, training_predictions)
training_r2 = r2_score(y_train, training_predictions)
print("Training performance:")
print(f"RMSE: {sqrt(training_mse)}")
print(f"R^2: {training_r2}")

print()

#Test Error
test_predictions = rf_search.predict(x_test)
test_mse = mean_squared_error(y_test, test_predictions)
test_r2 = r2_score(y_test, test_predictions)
print("Test performance:")
print(f"RMSE: {sqrt(test_mse)}")
print(f"R^2: {test_r2}")

Random Forest:
Fitting 10 folds for each of 320 candidates, totalling 3200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   44.6s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 11.1min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 15.9min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 21.9min
[Parallel(n_jobs=-1)]: Done 3200 out of 3200 | elapsed: 22.0min finished


Best Model: {'regressor__max_depth': 9, 'regressor__max_leaf_nodes': 47, 'regressor__min_samples_leaf': 16, 'regressor__min_samples_split': 27, 'regressor__n_estimators': 100, 'regressor__n_jobs': -1}
Training performance:
RMSE: 107.40282089437214
R^2: 0.41768276710108176

Test performance:
RMSE: 109.96406662732902
R^2: 0.40181191651089243


#### Examine importance of individual features

In [53]:
tuned_rf = rf_search.best_estimator_
importances = pd.DataFrame({'feature':x_train.columns,'importance':np.round(tuned_rf[1].feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(40)

Unnamed: 0_level_0,importance
feature,Unnamed: 1_level_1
accommodates,0.754
average_rent_neighbourhood,0.037
cleaning_fee,0.032
security_deposit,0.024
maximum_nights,0.02
minimum_nights,0.015
distance_centre,0.014
ludwigsvorstadt-isarvorstadt,0.012
is_location_exact,0.01
verification_reviews,0.009


In [54]:
# create selector for features that have an importance of more than 0.001
sfm = SelectFromModel(tuned_rf[1], threshold=0.001)

# Train selector
sfm.fit(x_train, y_train)

SelectFromModel(estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                                max_depth=9,
                                                max_features='auto',
                                                max_leaf_nodes=47,
                                                min_impurity_decrease=0.0,
                                                min_impurity_split=None,
                                                min_samples_leaf=16,
                                                min_samples_split=27,
                                                min_weight_fraction_leaf=0.0,
                                                n_estimators=100, n_jobs=-1,
                                                oob_score=False,
                                                random_state=None, verbose=0,
                                                warm_start=False),
                max_features=None, norm_order=1, prefit=False, thresho

In [55]:
# create datasets without the features that did not pass the threshold and learn model again
X_important_train = sfm.transform(x_train)
X_important_test = sfm.transform(x_test)

#update  ranges for parameters
max_depth = np.arange(8,15,2)
min_samples_split = np.arange(20,30,2)
min_samples_leafs = np.arange(10,20,2)
max_leaf_nodes = np.arange(60,70,2)

# define and run a grid search using r2 as scoring metric
parameters = {'regressor__n_estimators': [100],
                'regressor__max_depth': max_depth,
                'regressor__n_jobs': [-1],
                'regressor__min_samples_leaf': min_samples_leafs,
                   'regressor__min_samples_split': min_samples_split,
                'regressor__max_leaf_nodes': max_leaf_nodes}

# define and run a grid search using r2 as scoring metric
rfi_search = GridSearchCV(pipeline, parameters, cv=10, scoring='r2', verbose=1, n_jobs=-1)
rfi_search.fit(X_important_train, y_train)

#Training Error
print(f"Best Model: {rfi_search.best_params_}")
training_predictions = rfi_search.predict(X_important_train)
training_mse = mean_squared_error(y_train, training_predictions)
training_r2 = r2_score(y_train, training_predictions)
print("Training performance:")
print(f"RMSE: {sqrt(training_mse)}")
print(f"R^2: {training_r2}")

print()

#Test Error
test_predictions = rfi_search.predict(X_important_test)
test_mse = mean_squared_error(y_test, test_predictions)
test_r2 = r2_score(y_test, test_predictions)
print("Test performance:")
print(f"RMSE: {sqrt(test_mse)}")
print(f"R^2: {test_r2}")

Fitting 10 folds for each of 500 candidates, totalling 5000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  8.1min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 11.8min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 16.1min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 21.1min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 26.9min
[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed: 33.5min
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed: 33.5min finished


Best Model: {'regressor__max_depth': 10, 'regressor__max_leaf_nodes': 64, 'regressor__min_samples_leaf': 10, 'regressor__min_samples_split': 24, 'regressor__n_estimators': 100, 'regressor__n_jobs': -1}
Training performance:
RMSE: 104.07907527111573
R^2: 0.4531664950272928

Test performance:
RMSE: 108.60648274964373
R^2: 0.41649084951226334


<a id='interpretation_evaluation'></a>
## 3 Interpretation and Evaluation