<center> <h1><font size=7> Case Study C</font> </h1> </center>

# Predicting AirBnB Prices - Example Solution - part 1

This notebook will take the data that was cleaned in the Case Study C part 1 notebook.

This notebook focuses on the analysis of features and on regression modelling to predict the "price" of listings.

# 1. Import packages

In [None]:
import pandas as pd 
import numpy as np

import sklearn as sk
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

import matplotlib.pyplot as plt

import random
random.seed(123)

%matplotlib inline

# 2. Reading the data

In [None]:
airbnb_df=pd.read_csv('../../data/airbnb/example_cleaned_data.csv')

In [None]:
airbnb_df.columns

In [None]:
airbnb_df

# 3. Exploratory data analysis

## 3.1. Distribution of price

In [None]:
plt.figure(figsize=(20,6))
airbnb_df['price'].plot.density()
plt.plot(airbnb_df['price'], [0.0001]*len(airbnb_df), '|', color='r');

We can see from the above that there are some substantial outliers in our data (£8000 for a night? really?) which may impact our ability to model. The data is positive (prices we hope) and has a long right tail.

In [None]:
airbnb_df['price'][airbnb_df['price'] < 500].plot.density();

In [None]:
# plotting the log of the data shows we have a near lognormal distribution
airbnb_df['price'][airbnb_df['price'] < 500].apply(np.log1p).plot.density();
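
The "near lognormal" observation can also be checked numerically instead of by eye. The sketch below is a minimal illustration on synthetic data: `toy_price` is a made-up stand-in for the real `price` column, and the lognormal parameters are assumptions chosen only to give a long right tail. It compares the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price column (assumed parameters, not the real data)
rng = np.random.default_rng(123)
toy_price = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=5000))

# Skewness near 0 suggests a roughly symmetric distribution
skew_raw = toy_price.skew()
skew_log = np.log1p(toy_price).skew()

print(f"skew of raw prices:   {skew_raw:.2f}")
print(f"skew of log1p prices: {skew_log:.2f}")
```

On the real data the same two lines could be applied to `airbnb_df['price']`; a skewness much closer to zero after the transform would support the log-normal assumption.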

## 3.2. Impact of factors on price

### 3.2.1. Property type 

In [None]:
airbnb_df.boxplot(column='price', by='property_type', figsize=(20,6), rot=90);

In [None]:
with pd.option_context('display.max_rows', None): # this line stops our Series collapsing
    print(airbnb_df.groupby("property_type")["price"].mean().sort_values(ascending=False))

I would also have plotted room type, but it has already been encoded, so it can't be shown in a boxplot.

### 3.2.2. Reviews

In [None]:
airbnb_df.plot.scatter(x='number_of_reviews', y='price', alpha=0.4, figsize=(20,8));

In [None]:
airbnb_df[airbnb_df["price"] < 500].plot.scatter(x='number_of_reviews', y='price', alpha=0.3, figsize=(20,8));

Even excluding the price outliers, there is no clear linear relationship. A transformation of the data could show a relationship better.

In [None]:
plt.figure(figsize=(20,8))
plt.scatter(np.log1p(airbnb_df['number_of_reviews']), airbnb_df['price'], alpha=0.3)
plt.title('Price vs log(reviews)');

Potentially some positive correlation, however, this is unclear from plotting alone.
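
When a plot is ambiguous, a correlation coefficient gives a quick numerical check. The sketch below uses synthetic data: the `toy` frame and the relationship built into it are assumptions for illustration only. Spearman correlation is computed as the Pearson correlation of the ranks, which means it is unchanged by monotonic transformations such as `np.log1p`.

```python
import numpy as np
import pandas as pd

# Toy data: price depends weakly on log(reviews) plus noise (assumed relationship)
rng = np.random.default_rng(123)
toy = pd.DataFrame({"number_of_reviews": rng.integers(0, 300, size=1000)})
toy["price"] = 60 + 10 * np.log1p(toy["number_of_reviews"]) + rng.normal(0, 20, size=1000)

log_reviews = np.log1p(toy["number_of_reviews"])

# Pearson measures linear association, so it changes under the log transform
pearson_raw = toy["number_of_reviews"].corr(toy["price"])
pearson_log = log_reviews.corr(toy["price"])

# Spearman = Pearson on ranks; identical for raw and log-transformed reviews
spearman_raw = toy["number_of_reviews"].rank().corr(toy["price"].rank())
spearman_log = log_reviews.rank().corr(toy["price"].rank())

print(f"Pearson (raw reviews):  {pearson_raw:.3f}")
print(f"Pearson (log reviews):  {pearson_log:.3f}")
print(f"Spearman (either form): {spearman_raw:.3f}")
```

The same calls could be run on `airbnb_df['number_of_reviews']` and `airbnb_df['price']` to put a number on the trend seen in the plot.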

### 3.2.3. Bathrooms, bedrooms and accommodates

In [None]:
f, ax=plt.subplots(figsize=(4,4))

plt.scatter(x=airbnb_df['bathrooms'], y=airbnb_df['price'], alpha=0.2);

There are some outliers with a high price or a high number of bathrooms, but in this visualisation it is difficult to see a clear relationship.

In [None]:
f, ax=plt.subplots(figsize=(4,4))
airbnb_df_subset = airbnb_df[airbnb_df['price'] < 500]
plt.scatter(x=airbnb_df_subset['bathrooms'], y=airbnb_df_subset['price'], alpha=0.2);

There appears to be *some* correlation, or at least a relationship. The outlier listings with many bathrooms do tend to cost more, the properties with no bathrooms generally cost less, and there is some trend in between.

In [None]:
f, ax=plt.subplots(figsize=(4,4))

plt.scatter(x=airbnb_df['bedrooms'], y=airbnb_df['price']);



In [None]:
f, ax=plt.subplots(figsize=(4,4))

plt.scatter(x=airbnb_df['accommodates'], y=airbnb_df['price']);

# 4. Feature engineering

Our non-numeric columns need to be converted into numerical formats.

In [None]:
engineered_df = airbnb_df.copy()

# take the log of reviews as new feature
engineered_df['logreviews'] = np.log1p(engineered_df['number_of_reviews'])  


engineered_df = pd.get_dummies(engineered_df, columns=[
                               'city'])  # OHE the cities

In [None]:
engineered_df

# 5. Initial model

In [None]:
engineered_df.columns

In [None]:
# select features and target
# remove unique features and unneeded features

features = engineered_df.drop(columns=['id', 'property_type', 'LSOA11CD', 'price', 
                                       'number_of_reviews', 'neighbourhood_cleansed'])
target = engineered_df['price']  

In [None]:
# create a train / test split
# the earlier we do this, the less chance the test set
# has of influencing our modelling decisions
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=123)

In [None]:
scaler = RobustScaler()

# fit scaler to training data
# robust scaling all numeric features: some of our data has significant outliers
# that we do not want to distort the scaling
X_train_scaled = scaler.fit_transform(X=X_train)

# transform but do NOT fit on the test data
X_test_scaled = scaler.transform(X=X_test)

In [None]:
# Define the parameters to be searched in the cross validated linear models

alphas = [1000, 100, 50, 20, 10, 1, 0.1, 0.01]

lr = LinearRegression()
ridge = RidgeCV(alphas=alphas)
lasso = LassoCV(alphas=alphas, max_iter=10000)


We are choosing to evaluate using mean absolute error (MAE) because it is robust to outliers in the target distribution. Prices that are far from the average (such as £8,000) will not affect the evaluation as heavily as they would under a squared-error metric.

This choice is made due to the challenge of predicting these outliers. In addition, just because a listing is made and the price is set, this does not mean:

* anyone has actually ever paid that amount
* it is a reflection of value based on the attributes in the data
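
To make the robustness argument concrete, the sketch below compares MAE with root mean squared error (RMSE) on a handful of made-up prices, one of which is an £8,000 listing that the hypothetical model badly under-predicts. All values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted prices; the last listing is an extreme outlier
y_true = np.array([50.0, 60.0, 70.0, 80.0, 8000.0])
y_pred = np.array([55.0, 65.0, 65.0, 85.0, 100.0])


def mae_and_rmse(y_true, y_pred):
    """Return (MAE, RMSE) for the given true and predicted values."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return mae, rmse


# Metrics on the four typical listings only, then with the outlier included
mae_typical, rmse_typical = mae_and_rmse(y_true[:-1], y_pred[:-1])
mae_all, rmse_all = mae_and_rmse(y_true, y_pred)

print(f"typical listings: MAE={mae_typical:.1f}  RMSE={rmse_typical:.1f}")
print(f"with outlier:     MAE={mae_all:.1f}  RMSE={rmse_all:.1f}")
```

Both metrics get worse when the outlier is included, but RMSE grows by considerably more because the single large error is squared before averaging.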

In [None]:
# Loop through each model type
for model, name in zip([lr, ridge, lasso], ['LinearRegression', 'RidgeRegression', 'LassoRegression']):
    
    # fit model
    model.fit(X_train_scaled, y_train)
    
    # generate prediction to evaluate on training set
    y_pred_train = model.predict(X_train_scaled)
    mae_train = mean_absolute_error(y_pred=y_pred_train, y_true=y_train).round(3)
    
    # generate predictions on the TEST set
    # we do both to compare
    y_pred_test = model.predict(X_test_scaled)
    mae_test = mean_absolute_error(y_pred=y_pred_test, y_true=y_test).round(3)
    
    best_alpha = ''
    if name != 'LinearRegression':
        best_alpha = ' best alpha: ' + str(model.alpha_)
        
    print(f"{name}\n\t MAE train: {mae_train}\t  MAE test:{mae_test} \t{best_alpha}")

In [None]:
# knn regressor
grid_params = {
    'n_neighbors': [3, 7, 12, 14, 40, 60, 80, 100],
    'weights': ['distance', 'uniform'],
    'metric': ['minkowski', 'euclidean', 'manhattan']
}

grid = GridSearchCV(KNeighborsRegressor(), grid_params,
                    cv=5, verbose=1, n_jobs=-1)

grid_result = grid.fit(X_train_scaled, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best estimator: ', grid_result.best_estimator_)
print('Best Params: ', grid_result.best_params_)
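
One thing to keep in mind when reading `best_score_`: when no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for scikit-learn regressors is R², not MAE. If the search should optimise the same metric used for evaluation, `scoring="neg_mean_absolute_error"` can be passed. The sketch below shows this on synthetic data; `X_toy`, `y_toy` and the reduced parameter grid are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, right-skewed target (assumed data, not the AirBnB set)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(300, 3))
y_toy = np.exp(1.0 + X_toy[:, 0] + rng.normal(scale=0.3, size=300))

toy_grid = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 7, 12], "weights": ["distance", "uniform"]},
    scoring="neg_mean_absolute_error",  # sklearn maximises scores, so MAE is negated
    cv=5,
)
toy_grid.fit(X_toy, y_toy)

# Flip the sign to recover the cross-validated MAE of the best candidate
cv_mae = -toy_grid.best_score_
print("Best params:", toy_grid.best_params_)
print("Cross-validated MAE:", round(cv_mae, 3))
```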

In [None]:
y_pred_knn = grid_result.best_estimator_.predict(X_test_scaled)

print("Best KNN MAE:", mean_absolute_error(y_pred=y_pred_knn, y_true=y_test).round(3))

The KNN outperforms the linear / regularized models. This shows there may be significant non-linearity in the relationships.

We can further improve our model by transforming more data.

### Exploring target transformation

Our target is highly skewed. If we transform it into a different distribution, train the model, then transform the predictions back to the original scale to evaluate them, we can improve our model's performance. This is because our model will find it easier to relate features to a roughly normal target than to a skewed one.

We are going to assume our `"price"` column follows a log-normal distribution. Therefore, to convert it to a normal distribution we will take the log of it (in practice, a log(X+1) transformation to avoid undefined values). To convert back to the original distribution we will need to take the exponent of each value (the inverse of log) and then subtract 1.

*forward transformation for each data point:* $\log(x+1)$

*backward transformation for each data point:* $\exp(x) - 1$

This will *hopefully* improve our model's ability to predict.
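
The round trip can be checked directly with `np.log1p` and `np.expm1`. As an alternative to transforming the target by hand, scikit-learn's `TransformedTargetRegressor` wraps a regressor so that the forward transform is applied before fitting and the inverse is applied to predictions. This notebook does the transformation manually; the sketch below is only an illustration on synthetic data, and `X_toy`, `y_toy` and `prices` are made-up values.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# 1) The backward transform undoes the forward transform
prices = np.array([0.0, 15.0, 80.0, 500.0, 8000.0])
round_trip = np.expm1(np.log1p(prices))

# 2) Synthetic skewed target for a small regression example (assumed data)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 2))
y_toy = np.expm1(4.0 + 0.5 * X_toy[:, 0] + rng.normal(scale=0.2, size=200))

# Manual approach: fit on log1p(y), map predictions back with expm1
manual = LinearRegression().fit(X_toy, np.log1p(y_toy))
pred_manual = np.expm1(manual.predict(X_toy))

# Wrapped approach: the same two steps handled by the wrapper
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
wrapped.fit(X_toy, y_toy)
pred_wrapped = wrapped.predict(X_toy)

print("round trip matches:", np.allclose(round_trip, prices))
print("manual and wrapped agree:", np.allclose(pred_manual, pred_wrapped))
```

A side effect of predicting on the log scale is that back-transformed predictions can never fall below -1, which suits a price target.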

In [None]:
y_train_log, y_test_log = np.log1p(y_train), np.log1p(y_test)

In [None]:
# Loop through each model type
for model, name in zip([lr, ridge, lasso], ['LinearRegression', 'RidgeRegression', 'LassoRegression']):
    
    # fit model
    model.fit(X_train_scaled, y_train_log)
    
    # generate prediction to evaluate on training set
    y_pred_train = model.predict(X_train_scaled)
    mae_train = mean_absolute_error(y_pred=np.expm1(y_pred_train), y_true=y_train).round(3)
    
    # remember np.expm1(y_train_log) == y_train
    
    # generate predictions on the TEST set
    # we do both to compare
    y_pred_test = model.predict(X_test_scaled)
    mae_test = mean_absolute_error(y_pred=np.expm1(y_pred_test), y_true=np.expm1(y_test_log)).round(3)
    
    best_alpha = ''
    if name != 'LinearRegression':
        best_alpha = ' best alpha: ' + str(model.alpha_)
    print(f"{name}\n\t MAE train: {mae_train}\t  MAE test:{mae_test} \t{best_alpha}")

We can see from the above that our performance has already improved, reducing the MAE from ~38 to 32.

In [None]:
# knn regressor
grid_params = {
    'n_neighbors': [3, 7, 12, 14, 40, 60, 80, 100],
    'weights': ['distance', 'uniform'],
    'metric': ['minkowski', 'euclidean', 'manhattan']
}

grid = GridSearchCV(KNeighborsRegressor(), grid_params,
                    cv=5, verbose=1, n_jobs=-1)

grid_result = grid.fit(X_train_scaled, y_train_log)
print('Best Score: ', grid_result.best_score_)
print('Best estimator: ', grid_result.best_estimator_)
print('Best Params: ', grid_result.best_params_)

In [None]:
y_pred_knn = grid_result.best_estimator_.predict(X_test_scaled)

print("Best KNN MAE:", mean_absolute_error(y_pred=np.expm1(y_pred_knn), y_true=y_test).round(3))

We have yet again shaved off more error in our model, this time jumping from 32 to 26 by just transforming our target with a log and back.

# 6. Improved modelling (Playing around with sampling and data)

In this section we will only keep listings that have more than 5 reviews, remove prices more than 3 standard deviations above the mean, and take an even sample of Manchester and Bristol.

It's important we only do this with our **training** data. We don't want to bias the evaluation step. If we were to remove data from our **test** set we would be making the evaluation easier,  decreasing the representativeness of the evaluation.

In [None]:
new_engineered_df = airbnb_df.copy()

# take the log of reviews as new feature
new_engineered_df['logreviews'] = np.log1p(new_engineered_df['number_of_reviews'])  


# 'city' is deliberately NOT one-hot encoded here: the raw column is needed
# further down to sample evenly by city, and it is dropped before modelling

new_features = new_engineered_df.drop(columns=['id', 'property_type', 'LSOA11CD', 
                                       'number_of_reviews', 'price', 'neighbourhood_cleansed'])
new_target = new_engineered_df['price']  

In [None]:
# Create new split with new features
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(new_features, 
                                                                    new_target, 
                                                                    test_size=0.2, 
                                                                    random_state=123)

In [None]:
# create mask so it can be applied to both X_train and y_train
# we only have the number of reviews as a log feature, but it's easy to convert the threshold to the same scale
review_number_mask = X_train_new["logreviews"] > np.log1p(5)

X_train_new, y_train_new = X_train_new[review_number_mask], y_train_new[review_number_mask]

In [None]:
# Calculate what price is needed to cut off for lower and upper price

# get the mean and 1 standard deviation
mean_price, std_price = y_train_new.mean(), y_train_new.std()

# identify outliers defined as 3 std out from mean
cut_off = std_price * 3
lower_bound, upper_bound = mean_price - cut_off, mean_price + cut_off

print("mean", mean_price)
print("std", std_price)
print("lower bound", lower_bound)
print("upper bound", upper_bound)

There are probably not many prices below that lower bound... As we are assuming a normal distribution (by using the mean and standard deviation), we could convert to a normal distribution using `np.log1p` again. Instead of the lower bound, let's choose a sensible boundary using *domain knowledge*: it is probably not reasonable to model properties going for less than £15 per night.
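
The log-scale alternative mentioned above is not what the following cells use, but it can be sketched briefly. The idea is to compute the mean and standard deviation on `np.log1p(price)`, where the normality assumption is more plausible, and map the 3-standard-deviation bounds back with `np.expm1`. The data below is synthetic (`toy_prices` and its parameters are assumptions), so the printed bounds say nothing about the real listings.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (assumed parameters, for illustration only)
rng = np.random.default_rng(123)
toy_prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.7, size=5000))

# Bounds on the raw scale: the lower bound is negative for skewed data
raw_lower = toy_prices.mean() - 3 * toy_prices.std()
raw_upper = toy_prices.mean() + 3 * toy_prices.std()

# Bounds computed on the log scale, then mapped back to prices
log_prices = np.log1p(toy_prices)
log_mean, log_std = log_prices.mean(), log_prices.std()
log_lower = np.expm1(log_mean - 3 * log_std)
log_upper = np.expm1(log_mean + 3 * log_std)

print(f"raw-scale bounds: {raw_lower:.1f} to {raw_upper:.1f}")
print(f"log-scale bounds: {log_lower:.1f} to {log_upper:.1f}")
```

Unlike the raw-scale version, the back-transformed lower bound is always above -1, and for this toy data it is a positive price.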

In [None]:
outlier_mask = (y_train_new > 15) & (y_train_new < upper_bound)

In [None]:
X_train_new, y_train_new = X_train_new[outlier_mask], y_train_new[outlier_mask]

Produce an even sample across Bristol and Manchester for the training data.

In [None]:
X_train_new.groupby("city").size()

In [None]:
# Sample the data based on the city of origin
X_train_new_reweighted = X_train_new.groupby("city").sample(n=800, random_state=123).drop(columns="city")

X_test_new = X_test_new.drop(columns="city")

In [None]:
# keep only y that have the same index as the resulting X
y_train_new_reweighted = y_train_new[X_train_new_reweighted.index]

In [None]:
X_train_new_reweighted
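
The even-sampling and index-alignment steps above can be sketched on a toy frame. Everything in `toy` (the city labels, the `bedrooms` column, the prices) is invented for illustration. `groupby(...).sample(n=...)` draws the same number of rows from each group, and because the sampled rows keep their original index, that index can be used to pick out the matching targets.

```python
import pandas as pd

# Toy data with an uneven split between two cities (made-up values)
toy = pd.DataFrame({
    "city": ["bristol"] * 6 + ["manchester"] * 3,
    "bedrooms": [1, 2, 1, 3, 2, 1, 2, 1, 4],
    "price": [40, 75, 50, 120, 80, 45, 60, 35, 150],
})
X_toy, y_toy = toy.drop(columns="price"), toy["price"]

# Draw the same number of rows from each city
X_balanced = X_toy.groupby("city").sample(n=3, random_state=123)

# Keep only the targets whose index survived the sampling
y_balanced = y_toy[X_balanced.index]

# Count rows per city, then drop the grouping column before modelling
counts = X_balanced["city"].value_counts()
X_balanced = X_balanced.drop(columns="city")

print(counts)
```

Note that `sample(n=...)` raises an error if any group has fewer than `n` rows, so the group sizes are worth checking first.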

In [None]:
scaler = RobustScaler()

X_train_new_scaled = scaler.fit_transform(X=X_train_new_reweighted)

# transform but do NOT fit on the test data
X_test_new_scaled = scaler.transform(X=X_test_new)

In [None]:
# Using the previously initialized model objects - same search parameters

# Loop through each model type
for model, name in zip([lr, ridge, lasso], ['LinearRegression', 'RidgeRegression', 'LassoRegression']):
    
    # fit model
    model.fit(X_train_new_scaled, y_train_new_reweighted)
    
    # generate prediction to evaluate on training set
    # (use the scaled training data so train and test share the same scale)
    y_pred_train = model.predict(X_train_new_scaled)
    mae_train = mean_absolute_error(y_pred=y_pred_train, y_true=y_train_new_reweighted).round(3)
    
    # generate predictions on the TEST set
    # we do both to compare
    y_pred_test = model.predict(X_test_new_scaled)
    mae_test = mean_absolute_error(y_pred=y_pred_test, y_true=y_test_new).round(3)
    
    best_alpha = ''
    if name != 'LinearRegression':
        best_alpha = ' best alpha: ' + str(model.alpha_)
        
    print(f"{name}\n\t MAE train: {mae_train}\t  MAE test:{mae_test} \t{best_alpha}")

This has improved our predictive ability on the training set but made our model much worse on the test set. This is because the test set contains data unlike anything the model has seen: test listings may have a "price" outside the range we kept, and the learned parameters may be wrong for listings with few reviews.

## 6.1. Model diagnostics

In [None]:
# plotting true vs predicted for the knn log model

limit = 500

# we need to have an equal scale to best interpret this graph
# so the figure size must be the same for x and y
plt.figure(figsize=(6,6))
plt.plot([0,limit],[0,limit], "--", c="r") # plot straight line for comparison
plt.scatter(y_test, np.expm1(y_pred_knn), alpha=0.2)
plt.ylabel("Predicted Values")
plt.xlabel("True values")
plt.xlim(0, limit) # this excludes some outliers
plt.ylim(0, limit);

From this plot we can see that the predictions are not consistently under or over predicting for the non-outlier data.

In [None]:
#plotting residuals / error
residuals = y_test-np.expm1(y_pred_knn)
plt.scatter(np.expm1(y_pred_knn),residuals, alpha=0.1)
plt.ylabel("Residuals")
plt.xlabel("Predicted values");

From these plots we can understand a few things:

Our best model - 
* Under predicts high values of the prices / is unable to predict well the large price values
* There doesn't appear to be a clear correlation between size of residuals and predicted value for the outliers

When we consider how the knn regression works, taking the average of surrounding data points, we are unlikely to be able to predict these outliers, as we will always tend towards the mean.
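
This averaging behaviour can be demonstrated on synthetic data. In the sketch below (all data is made up), the true relationship is linear, yet a KNN regressor asked about inputs far beyond the training range returns the average of the same nearest training targets each time, so its predictions can never exceed the largest target it was trained on.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data with a linear trend: y is roughly 20 * x (assumed)
rng = np.random.default_rng(123)
X_train_toy = rng.uniform(0, 10, size=(200, 1))
y_train_toy = 20 * X_train_toy[:, 0] + rng.normal(scale=5, size=200)

knn = KNeighborsRegressor(n_neighbors=7).fit(X_train_toy, y_train_toy)

# Inputs far outside the training range; the linear trend would give 300, 1000, 20000
X_extreme = np.array([[15.0], [50.0], [1000.0]])
pred_extreme = knn.predict(X_extreme)

print("largest training target:", round(y_train_toy.max(), 1))
print("predictions for extreme inputs:", pred_extreme.round(1))
```

All three extreme inputs share the same seven nearest neighbours, so they receive the same prediction, which is an average of training targets and therefore bounded by them.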

To further improve the model however, we could rebalance more classes for training and transform more features into different distributions that may be easier for our model to calculate relevant distances for.

Our training and test results appear to be similar, indicating that we have not yet overfit to the data (which is an interesting concept in and of itself when thinking about the knn regressor), nor have we underfit when using our tuned hyperparameters.

Further exploration would entail looking at how to predict these higher values better, potentially with multi-level models or better outlier handling.