# MODELING METHODS

Because Airbnb rental price is a continuous outcome, I will use regression modeling. First I will fit a linear regression model that predicts price from all of the predictor variables and use it as the baseline against which the other models are compared. The other models I will use are Ridge, LASSO, Random Forest, SVM and Gradient Boosting.

I will evaluate and compare the performance of the models using two metrics: Root Mean Squared Error (RMSE) and R-squared. RMSE measures the typical magnitude of the residuals, that is, the prediction errors. R-squared measures the proportion of variance in the target variable that is explained by the independent variables. A lower RMSE (closer to zero) and a higher R-squared (closer to 1) indicate a better model.
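
As a rough illustration, both metrics can be computed by hand. This is a minimal sketch on made-up toy values; the helper names `rmse` and `r_squared` are my own and are not used elsewhere in this notebook, which relies on scikit-learn's scorers instead.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Toy values, made up for illustration only
y_true = [2.0, 2.5, 3.0, 3.5]
y_pred = [2.1, 2.4, 3.2, 3.3]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

Note that R-squared can become negative when a model predicts worse than simply using the mean of the target.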

Before diving into modeling, I will prepare the data. 

# DATA PREPARATION

In [315]:
# Load the required libraries and modules
%matplotlib inline 

import timeit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import cufflinks as cf
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
cf.go_offline()

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict, cross_validate
from sklearn.utils import shuffle

import statsmodels.api as sm
from statsmodels.formula.api import ols

import warnings
warnings.filterwarnings("ignore")

In [316]:
# Read in the CSV file
df = pd.read_csv('Data/airbnb_clean.csv')
df.head()

Unnamed: 0,listing_id,zip_code,latitude,longitude,room_type,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,number_of_reviews,review_scores_rating,neighbourhood,number_of_bookings,bedroom_bath_ratio
0,2265,78702,30.2775,-97.71398,Entire home/apt,4,2.0,2.0,2.0,225.0,30,24,93.0,East Downtown,365.0,100.0
1,5245,78702,30.27577,-97.71379,Private room,2,1.0,1.0,2.0,100.0,30,9,91.0,East Downtown,354.0,100.0
2,5456,78702,30.26112,-97.73448,Entire home/apt,3,1.0,1.0,2.0,95.0,2,499,96.0,East Downtown,74.0,100.0
3,75174,78702,30.24773,-97.72584,Entire home/apt,3,1.0,1.0,1.0,130.0,2,249,98.0,East Downtown,131.0,100.0
4,76911,78702,30.26775,-97.72695,Entire home/apt,10,3.0,5.0,12.0,821.0,2,126,99.0,East Downtown,56.0,60.0


Price is easier to model when its distribution is roughly normal. From prior exploratory data analysis and statistical inference, I observed that price is heavily skewed to the right, which means it is not normally distributed. I will therefore transform price with a base-10 logarithm.

In [317]:
# Convert raw price column to log
df['price'] = np.log10(df.price)
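
The effect of the log transform on skewness can be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for prices (an assumption for illustration, not the Airbnb data) and also shows that values on the log10 scale map back to the original scale with `10 ** x`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal); not the Airbnb data
prices = pd.Series(rng.lognormal(mean=5.0, sigma=0.8, size=5000))
log_prices = np.log10(prices)

# Sample skewness before and after the transform
skew_raw = prices.skew()
skew_log = log_prices.skew()
print(f"skew raw: {skew_raw:.2f}, skew log10: {skew_log:.2f}")

# Values on the log10 scale convert back to the original scale with 10 ** x
recovered = 10 ** log_prices
```

The same back-transformation applies to model predictions, since every model below predicts log10 price rather than price.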

I will drop the columns that I do not expect to need for modeling: listing id, zip code, latitude and longitude.

In [318]:
# Drop columns
drop_cols = ['listing_id', 'zip_code', 'latitude', 'longitude']
df.drop(columns=drop_cols, inplace=True)

In [319]:
# Inspect new df
df.head()

Unnamed: 0,room_type,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,number_of_reviews,review_scores_rating,neighbourhood,number_of_bookings,bedroom_bath_ratio
0,Entire home/apt,4,2.0,2.0,2.0,2.352183,30,24,93.0,East Downtown,365.0,100.0
1,Private room,2,1.0,1.0,2.0,2.0,30,9,91.0,East Downtown,354.0,100.0
2,Entire home/apt,3,1.0,1.0,2.0,1.977724,2,499,96.0,East Downtown,74.0,100.0
3,Entire home/apt,3,1.0,1.0,1.0,2.113943,2,249,98.0,East Downtown,131.0,100.0
4,Entire home/apt,10,3.0,5.0,12.0,2.914343,2,126,99.0,East Downtown,56.0,60.0


The regression models I will be using cannot handle categorical variables unless I convert them to numerical values. There are two categorical variables in the data, room_type and neighbourhood. I will use one-hot encoding to map each category to a binary indicator column, where 1 and 0 denote the presence or absence of that category. The first level of each variable is dropped so that the indicator columns are not redundant.

In [320]:
# Perform get_dummies to encode categorical variables
df = pd.get_dummies(df, columns=['room_type', 'neighbourhood'], drop_first=True)

# Confirm that only numeric variables remain
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11333 entries, 0 to 11332
Data columns (total 44 columns):
accommodates                                11333 non-null int64
bathrooms                                   11333 non-null float64
bedrooms                                    11333 non-null float64
beds                                        11333 non-null float64
price                                       11333 non-null float64
minimum_nights                              11333 non-null int64
number_of_reviews                           11333 non-null int64
review_scores_rating                        11333 non-null float64
number_of_bookings                          11333 non-null float64
bedroom_bath_ratio                          11333 non-null float64
room_type_Hotel room                        11333 non-null uint8
room_type_Private room                      11333 non-null uint8
room_type_Shared room                       11333 non-null uint8
neighbourhood_Balcones Civic Ass
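
To make the encoding concrete, here is a minimal sketch on a four-row toy frame (the rows are invented for illustration). With `drop_first=True`, the first category in sorted order becomes the reference level and gets no column of its own.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "beds": [2, 1, 1, 2],
})

# One indicator column per category, minus the first (reference) category
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

A row whose indicator columns are all zero belongs to the dropped reference category, here "Entire home/apt".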

My next step is to create arrays of the dependent target (y) and independent predictor (X) variables.

In [321]:
# Split data into target and predictors 
y = df.price.values
X = df.drop('price', axis = 1).values

The predictors need to be scaled so that variables measured on different scales have a comparable influence on the model weights.

In [322]:
# Standardize the independent (predictor) variables
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
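
One alternative worth noting: fitting the scaler on the full data set before cross validation lets each validation fold influence the scaling statistics. The sketch below, which is not what this notebook does, wraps the scaler and the model in a scikit-learn pipeline so that the scaler is refit on the training folds only. The data is synthetic and the names `X_demo`, `y_demo` and `model` are placeholders of my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic predictors and target, for illustration only
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=1.0, size=200)

# The scaler is fit inside each training fold, then applied to the held-out fold
model = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_validate(
    model, X_demo, y_demo, cv=5, scoring=["neg_mean_squared_error", "r2"]
)
print(cv_scores["test_r2"].mean())
```

For plain standardization the difference in scores is usually small, but the pipeline form keeps the evaluation free of information from the held-out folds.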

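I will initialize a list of scorer names for MSE and R-squared because I will be using both for evaluation throughout this notebook. scikit-learn reports MSE as a negated value (neg_mean_squared_error), so I will flip the sign and take the square root to convert it to RMSE for comparison.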

In [323]:
# Array for mean squared error and R-squared 
scoring = ['neg_mean_squared_error', 'r2']
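
The conversion from the negated MSE scores to RMSE can be sketched as a small helper. The function name `summarize_cv` and the hard-coded fold scores are hypothetical, used only to show the arithmetic on a dictionary shaped like the output of `cross_validate`.

```python
import numpy as np

def summarize_cv(cv_scores):
    """Return (RMSE, mean R-squared) from a cross_validate-style result dict."""
    # Scores are negated MSE values, so flip the sign before taking the root
    mean_mse = -cv_scores["test_neg_mean_squared_error"].mean()
    return float(np.sqrt(mean_mse)), float(cv_scores["test_r2"].mean())

# Made-up fold scores, for illustration only
fake_scores = {
    "test_neg_mean_squared_error": np.array([-0.09, -0.16, -0.25]),
    "test_r2": np.array([0.5, 0.6, 0.7]),
}
rmse_demo, r2_demo = summarize_cv(fake_scores)
print(rmse_demo, r2_demo)
```

This takes the square root of the mean MSE across folds, which is the convention used in the cells below; averaging the per-fold RMSE values instead would give a slightly different number.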

# LINEAR REGRESSION

In [324]:
# Shuffle the data 
X, y = shuffle(X, y)

In [325]:
# Instantiate the linear regressor 
linear_regressor = LinearRegression()

# Perform 5 fold cross validation on the data
linear_scores = cross_validate(linear_regressor, X, y, cv=5, scoring=scoring, n_jobs=-1, verbose=2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.1s finished


In [326]:
# Capture the average RMSE and R-squared scores from cross validation
linear_rmse = np.sqrt(-linear_scores['test_neg_mean_squared_error'].mean())
linear_r2 = linear_scores['test_r2'].mean()

In [327]:
# Store and print the RMSE and R-squared scores in dataframe
scores_df = pd.DataFrame([{'Model':'Linear Regression', 'RMSE':linear_rmse, 'R-squared':linear_r2}])
print(scores_df.to_string(index=False))

             Model      RMSE  R-squared
 Linear Regression  0.331937   0.499861


# RIDGE REGRESSION

In [328]:
# Shuffle the data 
X, y = shuffle(X, y)

In [329]:
# Create dictionary of alpha values
param_grid = {'alpha':[0.001, 0.01, 0.1, 1, 5, 10, 50, 100, 500, 1000]}

# Perform 5 fold grid search to get the best alpha value
ridge_gsc = GridSearchCV(
    estimator=Ridge(),
    param_grid=param_grid,
    cv=5, 
    n_jobs=-1,
    verbose=2
)

ridge_grid_result = ridge_gsc.fit(X, y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  19 out of  50 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    0.2s finished


In [330]:
# Get and print the best alpha from the grid search
ridge_best_params = ridge_grid_result.best_params_
print(ridge_best_params)

{'alpha': 5}


In [331]:
# Instantiate ridge regressor with the best alpha
ridge = Ridge(alpha=ridge_best_params['alpha'])

In [332]:
# Perform 5 fold cross validation on the data 
ridge_scores = cross_validate(ridge, X, y, cv=5, scoring=scoring, return_estimator=True)

In [333]:
# Capture the average RMSE and R-squared scores from cross validation
ridge_rmse = np.sqrt(-ridge_scores['test_neg_mean_squared_error'].mean())
ridge_r2 = ridge_scores['test_r2'].mean()

In [334]:
# Store and print the RMSE and R-squared scores in dataframe
scores_df = pd.concat([scores_df, pd.DataFrame([{'Model':'Ridge Regression', 
                                                 'RMSE':ridge_rmse, 'R-squared':ridge_r2}])],
                      ignore_index=True)

print(scores_df.to_string(index=False))

             Model      RMSE  R-squared
 Linear Regression  0.331937   0.499861
  Ridge Regression  0.331370   0.501626


# LASSO REGRESSION

In [335]:
# Shuffle the data 
X, y = shuffle(X, y)

In [336]:
# Create dictionary of alpha values
param_grid = {'alpha':[0.001, 0.01, 0.1, 1, 5, 10, 50, 100, 500, 1000]}

# Perform 5 fold grid search to get the best alpha value
lasso_gsc = GridSearchCV(
    estimator=Lasso(),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    verbose=2
)

lasso_grid_result = lasso_gsc.fit(X, y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  19 out of  50 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    0.3s finished


In [337]:
# Get and print the best alpha from the grid search
lasso_best_params = lasso_grid_result.best_params_
print(lasso_best_params)

{'alpha': 0.001}


In [338]:
# Instantiate lasso regressor with the best alpha
lasso = Lasso(alpha=lasso_best_params['alpha'])

In [339]:
# Perform 5 fold cross validation on the data 
lasso_scores = cross_validate(lasso, X, y, cv=5, scoring=scoring, return_estimator=True)

In [340]:
# Capture the average RMSE and R-squared scores from cross validation
lasso_rmse = np.sqrt(-lasso_scores['test_neg_mean_squared_error'].mean())
lasso_r2 = lasso_scores['test_r2'].mean()

In [341]:
# Store and print the RMSE and R-squared scores in dataframe
scores_df = pd.concat([scores_df, pd.DataFrame([{'Model':'Lasso Regression', 
                                                 'RMSE':lasso_rmse, 'R-squared':lasso_r2}])],
                      ignore_index=True)

print(scores_df.to_string(index=False))

             Model      RMSE  R-squared
 Linear Regression  0.331937   0.499861
  Ridge Regression  0.331370   0.501626
  Lasso Regression  0.331186   0.502434
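
The ridge and lasso cells request `return_estimator=True`, which keeps the model fitted on each fold. One possible use, sketched here on synthetic data rather than the Airbnb features, is to count how many coefficients each penalty sets exactly to zero. The helper `zero_coefs_per_fold` and the alpha values are my own choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 10 features affect the target
X_demo = rng.normal(size=(300, 10))
y_demo = (2.0 * X_demo[:, 0] - X_demo[:, 1] + 0.5 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=300))

def zero_coefs_per_fold(estimator):
    """Count exactly-zero coefficients in the model fitted on each CV fold."""
    result = cross_validate(estimator, X_demo, y_demo, cv=5, return_estimator=True)
    return [int(np.sum(fitted.coef_ == 0)) for fitted in result["estimator"]]

lasso_zeros = zero_coefs_per_fold(Lasso(alpha=0.1))
ridge_zeros = zero_coefs_per_fold(Ridge(alpha=0.1))
print("lasso:", lasso_zeros)
print("ridge:", ridge_zeros)
```

The L1 penalty in lasso can remove uninformative features entirely, while the L2 penalty in ridge only shrinks coefficients toward zero. With an alpha as small as the 0.001 selected above, lasso may remove few or no features, which would be consistent with its scores being close to those of plain linear regression.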


# SVM REGRESSION

In [342]:
# Create grid of different hyperparameter values
random_grid = {
    'kernel': ['rbf', 'poly', 'sigmoid'], 
    'C': [0.1, 1, 10, 50, 100, 1000],
    'gamma': [0, 0.01, 0.001, 0.0005, 'auto'],
    'epsilon': [0, 0.01, 0.1, 1, 2]
}

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations
svm_random = RandomizedSearchCV(
    estimator=SVR(), 
    param_distributions=random_grid, 
    n_iter=100, 
    cv=3, 
    verbose=2, 
    random_state=42, 
    n_jobs=-1
)

In [343]:
# Fit the random search model
svm_grid_result = svm_random.fit(X, y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   15.8s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 11.1min finished


Executed in 669.30 seconds.


In [344]:
# Get the best hyperparameters from the random search
print(svm_grid_result.best_params_)

{'kernel': 'rbf', 'gamma': 'auto', 'epsilon': 0.1, 'C': 1}


In [345]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'kernel': ['rbf'],
    'gamma': ['auto'],
    'epsilon': [0.05, 0.1, 0.15],
    'C': [0.5, 1, 2]
}

# Instantiate the grid search
svm_gsc = GridSearchCV(
    estimator=SVR(), 
    param_grid=param_grid, 
    cv=3, 
    n_jobs=-1,
    verbose=2
)

# Fit the grid search model
svm_grid_result = svm_gsc.fit(X, y)

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  27 | elapsed:   15.1s remaining:   25.6s
[Parallel(n_jobs=-1)]: Done  24 out of  27 | elapsed:   23.6s remaining:    3.0s
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:   24.5s finished


In [346]:
# Get and print the best hyperparameters from the grid search
svm_best_params = svm_grid_result.best_params_
print(svm_best_params)

{'C': 2, 'epsilon': 0.15, 'gamma': 'auto', 'kernel': 'rbf'}


In [347]:
# Instantiate SVM with the best hyperparameters
svm = SVR(
    kernel=svm_best_params['kernel'],
    C=svm_best_params['C'], 
    gamma=svm_best_params['gamma'],
    epsilon=svm_best_params['epsilon'],
)

In [348]:
# Perform 5 fold cross validation on the data 
svm_scores = cross_validate(svm, X, y, cv=5, scoring=scoring)

In [349]:
# Capture the average RMSE and R-squared scores from cross validation
svm_rmse = np.sqrt(-svm_scores['test_neg_mean_squared_error'].mean())
svm_r2 = svm_scores['test_r2'].mean()

In [350]:
# Store and print the RMSE and R-squared scores in dataframe
scores_df = pd.concat([scores_df, pd.DataFrame([{'Model':'SVM Regression', 
                                                 'RMSE':svm_rmse, 'R-squared':svm_r2}])],
                      ignore_index=True)

print(scores_df.to_string(index=False))

             Model      RMSE  R-squared
 Linear Regression  0.331937   0.499861
  Ridge Regression  0.331370   0.501626
  Lasso Regression  0.331186   0.502434
    SVM Regression  0.320084   0.535206


# RANDOM FOREST REGRESSION

In [351]:
# Number of trees
n_estimators = [int(x) for x in np.linspace(start=100, stop=4000, num=20)]

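# Number of features to consider at every split
# ('auto' means all features for a regressor; recent scikit-learn releases
# dropped this alias, and 1.0 is the equivalent setting there)
max_features = ['auto', 'sqrt']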

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 200, num=20)]

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 9, 14, 20]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 7, 11]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create grid of different hyperparameter values
random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'bootstrap': bootstrap
}

print(random_grid)

{'n_estimators': [100, 305, 510, 715, 921, 1126, 1331, 1536, 1742, 1947, 2152, 2357, 2563, 2768, 2973, 3178, 3384, 3589, 3794, 4000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200], 'min_samples_split': [2, 5, 9, 14, 20], 'min_samples_leaf': [1, 2, 4, 7, 11], 'bootstrap': [True, False]}


In [353]:
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations
rfr_random = RandomizedSearchCV(
    estimator=RandomForestRegressor(), 
    param_distributions=random_grid, 
    n_iter=100, 
    cv=3, 
    verbose=2, 
    random_state=42, 
    n_jobs=-1
)

In [354]:
# Fit the random search model
rfr_random.fit(X, y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   32.5s
[Parallel(n_jobs=-1)]: Done 130 tasks      | elapsed: 12.0min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 55.9min finished


Executed in 1581.02 seconds.


In [359]:
# Get the best hyperparameters from the random search
print(rfr_random.best_params_)

{'n_estimators': 3178, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 110, 'bootstrap': False}


In [356]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [False],
    'max_depth': [100, 110, 120],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [4, 5, 6],
    'n_estimators': [3000, 3100, 3200, 3300]
}

# Instantiate the grid search
rfr_gsc = GridSearchCV(
    estimator=RandomForestRegressor(), 
    param_grid=param_grid, 
    cv=3, 
    n_jobs=-1,
    verbose=2
)

In [357]:
# Fit the grid search model
rfr_grid_result = rfr_gsc.fit(X, y)

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 130 tasks      | elapsed: 11.3min
[Parallel(n_jobs=-1)]: Done 216 out of 216 | elapsed: 17.3min finished


In [360]:
# Get and print the best hyperparameters from the grid search
rfr_best_params = rfr_grid_result.best_params_
print(rfr_best_params)

{'bootstrap': False, 'max_depth': 120, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 6, 'n_estimators': 3100}


In [361]:
# Instantiate random forest regressor with the best hyperparameters
rfr = RandomForestRegressor(
    bootstrap=rfr_best_params['bootstrap'],
    max_depth=rfr_best_params['max_depth'], 
    max_features=rfr_best_params['max_features'],
    min_samples_leaf=rfr_best_params['min_samples_leaf'],
    min_samples_split=rfr_best_params['min_samples_split'],
    n_estimators=rfr_best_params['n_estimators']
)

In [362]:
# Perform 5 fold cross validation on the data
rfr_scores = cross_validate(rfr, X, y, cv=5, scoring=scoring, n_jobs=-1, verbose=2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   39.0s finished


In [363]:
# Capture the average RMSE and R-squared scores from cross validation
rfr_rmse = np.sqrt(-rfr_scores['test_neg_mean_squared_error'].mean())
rfr_r2 = rfr_scores['test_r2'].mean()

In [364]:
# Store and print the RMSE and R-squared scores in dataframe
scores_df = pd.concat([scores_df, pd.DataFrame([{'Model':'Random Forest Regression', 
                                                 'RMSE':rfr_rmse, 'R-squared':rfr_r2}])],
                      ignore_index=True)

print(scores_df.to_string(index=False))

                    Model      RMSE  R-squared
        Linear Regression  0.331937   0.499861
         Ridge Regression  0.331370   0.501626
         Lasso Regression  0.331186   0.502434
           SVM Regression  0.320084   0.535206
 Random Forest Regression  0.268167   0.673625


# GRADIENT BOOSTING REGRESSION

In [365]:
# Create grid of different hyperparameter values
random_grid = {
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
    'n_estimators': n_estimators, 
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'max_features': max_features
}

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations
gbr_random = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(), 
    param_distributions=random_grid, 
    n_iter=100, 
    cv=3, 
    verbose=2, 
    random_state=42, 
    n_jobs=-1
)

In [366]:
# Fit the random search model
gbr_grid_result = gbr_random.fit(X, y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 130 tasks      | elapsed: 485.7min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 602.3min finished


Executed in 12665.02 seconds.


In [367]:
# Get the best hyperparameters from the random search
print(gbr_grid_result.best_params_)

{'n_estimators': 3589, 'min_samples_split': 9, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 10, 'learning_rate': 0.01}


In [368]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'learning_rate': [0.01],
    'max_depth': [5, 10, 15],
    'min_samples_split': [8, 9, 10],
    'n_estimators': [3000, 3500, 4000],
    'min_samples_leaf': [3, 4, 5],
    'max_features': ['sqrt']
}

# Instantiate the grid search
gbr_gsc = GridSearchCV(
    estimator=GradientBoostingRegressor(), 
    param_grid=param_grid, 
    cv=5, 
    n_jobs=-1,
    verbose=2
)

In [369]:
# Fit the grid search model
gbr_grid_result = gbr_gsc.fit(X, y)

Fitting 5 folds for each of 81 candidates, totalling 405 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 130 tasks      | elapsed:  9.0min
[Parallel(n_jobs=-1)]: Done 333 tasks      | elapsed: 47.5min
[Parallel(n_jobs=-1)]: Done 405 out of 405 | elapsed: 73.3min finished


Executed in 3777.86 seconds.


In [370]:
# Get and print the best hyperparameters from the grid search
gbr_best_params = gbr_grid_result.best_params_
print(gbr_best_params)

{'learning_rate': 0.01, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 3500}


In [371]:
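# Instantiate gradient boosting regressor with the best hyperparameters
# Note: min_samples_leaf and max_features from the grid search are not passed
# here, so the regressor uses the scikit-learn defaults for those two settings
gbr = GradientBoostingRegressor(
    learning_rate=gbr_best_params['learning_rate'],
    max_depth=gbr_best_params['max_depth'], 
    min_samples_split=gbr_best_params['min_samples_split'],
    n_estimators=gbr_best_params['n_estimators']
)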
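
Listing each hyperparameter by hand makes it easy to leave one out. A minimal alternative, assuming the dictionary keys match the constructor argument names (which is the case for `best_params_`), is to unpack the dictionary directly. The values below are copied from the grid search output above; `best_params_demo`, `gbr_demo` and `applied` are illustrative names.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in for gbr_grid_result.best_params_, using the values printed above
best_params_demo = {
    "learning_rate": 0.01,
    "max_depth": 10,
    "max_features": "sqrt",
    "min_samples_leaf": 5,
    "min_samples_split": 10,
    "n_estimators": 3500,
}

# Unpack every tuned hyperparameter into the constructor in one step
gbr_demo = GradientBoostingRegressor(**best_params_demo)

# Confirm that each tuned value was actually applied to the estimator
applied = {name: gbr_demo.get_params()[name] for name in best_params_demo}
print(applied)
```

With the default `refit=True`, a fitted search object also exposes `best_estimator_`, which is already configured with the best parameters and fitted on the full data.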

In [372]:
# Perform 5 fold cross validation on the data
gbr_scores = cross_validate(gbr, X, y, cv=5, scoring=scoring, n_jobs=-1, verbose=2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  3.5min finished


In [373]:
# Capture the average RMSE and R-squared scores from cross validation
gbr_rmse = np.sqrt(-gbr_scores['test_neg_mean_squared_error'].mean())
gbr_r2 = gbr_scores['test_r2'].mean()

In [374]:
# Store and print the RMSE and R-squared scores in dataframe
scores_df = pd.concat([scores_df, pd.DataFrame([{'Model':'Gradient Boosting Regression', 
                                                 'RMSE':gbr_rmse, 'R-squared':gbr_r2}])],
                      ignore_index=True)

print(scores_df.to_string(index=False))

                        Model      RMSE  R-squared
            Linear Regression  0.331937   0.499861
             Ridge Regression  0.331370   0.501626
             Lasso Regression  0.331186   0.502434
               SVM Regression  0.320084   0.535206
     Random Forest Regression  0.268167   0.673625
 Gradient Boosting Regression  0.273974   0.659218


In [375]:
# Plot the scores for a visual comparison
fig = go.Figure(
    data=[
        go.Bar(
            name="RMSE",
            x=scores_df["Model"],
            y=scores_df["RMSE"],
            offsetgroup=0,
        ),
        go.Bar(
            name="R-squared",
            x=scores_df["Model"],
            y=scores_df["R-squared"],
            offsetgroup=1,
        ),
    ],
    layout=go.Layout(
        title="Model Comparison",
        yaxis_title="Scores"
        
    )
)
fig.show()

R-squared is the coefficient of determination: the proportion of variance in the target that the model explains. It typically falls between 0 and 1 and can be read as a percentage, so the higher the value, the better the model. RMSE is a measure of error, here in log10 price units, so the lower the value, the better. I can see from the graph above that random forest regression had the highest R-squared of 0.67 (67%) and also the lowest RMSE of 0.27. Gradient boosting regression wasn't far behind at 0.66 (66%) R-squared and also 0.27 RMSE. SVM regression fell in between at 0.54 (54%) R-squared and 0.32 RMSE. Linear, ridge and lasso regressions were the worst performing models, with similar scores of around 0.50 (50%) R-squared and 0.33 RMSE.
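
Because price was transformed with log10 before modeling, these RMSE values are on the log10 scale. As a rough way to read them, an error of one RMSE on that scale corresponds to multiplying or dividing the price by `10 ** RMSE`. The sketch below applies that conversion to three of the scores reported in the table above; it is an interpretation aid, not an additional evaluation.

```python
# RMSE values on the log10 price scale, copied from the results table above
rmse_log10 = {
    "Linear Regression": 0.331937,
    "Random Forest Regression": 0.268167,
    "Gradient Boosting Regression": 0.273974,
}

# An error of one RMSE in log10 units is a multiplicative factor of 10 ** RMSE
factors = {model: 10 ** value for model, value in rmse_log10.items()}

for model, factor in factors.items():
    print(f"{model}: typical error of about a factor of {factor:.2f} in price")
```

Read this way, the gap between the tree-based models and the linear models is easier to judge in terms of actual prices than from the log10 values alone.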