# Modeling


Overview of models:

- Model to predict whether a host has superhost status
- Model to predict price, number of bookings, or occupancy rate
- Clustering analysis (may not be needed if we use PCA)

In [5]:
# Imports
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold, KFold, cross_val_score
from sklearn.metrics import make_scorer, recall_score, precision_recall_curve, average_precision_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import lightgbm as lgb


In [2]:
# Loading Dataset
df = pd.read_csv("Datasets/AirbnbData/ModelingData.csv")

# Convert boolean amenity columns to int to avoid dtype errors downstream
amenity_cols = ['oven', 'stove', 'refrigerator', 'air conditioning', 'tv', 'parking',
                'gym/exercise equipment', 'pool', 'hygiene products', 'laundry',
                'coffee', 'view']
df[amenity_cols] = df[amenity_cols].astype(int)
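
As an alternative to listing each amenity column by name, the boolean columns can be picked out by dtype. The sketch below uses a toy frame with made-up values (the column names `oven`, `pool`, and `price` are borrowed for illustration) and assumes the columns were parsed as `bool`, which may not hold if a column contains missing values.

```python
import pandas as pd

# Toy frame with hypothetical values; the real dataset has many more columns.
toy = pd.DataFrame({
    'oven': [True, False],
    'pool': [False, False],
    'price': [120.0, 80.0],
})

# Select every bool-typed column and cast it to int in one step.
bool_cols = toy.select_dtypes(include='bool').columns
toy[bool_cols] = toy[bool_cols].astype(int)

print(toy.dtypes)
```

This keeps the conversion in sync with the data if amenity columns are added or renamed, at the cost of being less explicit about which columns are touched.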

# Modeling to predict whether a host is a superhost

One scenario where this model would be useful is if Airbnb wanted to revamp how it assigns superhost status. Currently, Airbnb assigns superhost status as follows:
<blockquote>To be a Superhost, hosts must be the listing owner of a homes listing with an account in good standing and need to have met the following criteria:

Hosted at least 10 reservations, or 3 reservations that total at least 100 nights
Maintained a 90% or higher response rate  
Maintained a less than 1% cancellation rate, with exceptions for cancellations due to Major Disruptive Events or other valid reasons  
Maintained a 4.8 or higher overall rating (A review counts towards Superhost status when either both the guest and the host have submitted a review, or the 14-day window for reviews is over, whichever comes first.)  
Note: The criteria is only evaluated for listings in which the host is the listing owner—any listings in which the host is a co-host won’t contribute towards their Superhost eligibility.</blockquote>

Source: https://airbnb.com/help/article/829

Airbnb may want to modify which factors determine superhost status by looking at other aspects of a listing, without greatly affecting the hosts who are currently superhosts. Another use would be to suggest to hosts which aspects of their listings to focus on in order to earn superhost status later on.  
We therefore want a model with a high true positive rate (recall), meaning it correctly classifies superhosts as superhosts. We focus on recall rather than accuracy alone because the superhost variable is somewhat imbalanced. However, we also don't want to compromise overall accuracy too much: a model that labels the majority of hosts as superhosts would defeat the purpose of having a superhost designation in the first place.
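
To make this trade-off concrete, here is a minimal sketch on made-up labels. The 30/70 class split is an assumption chosen for illustration, not a figure from the dataset. It shows that a classifier that labels every host a superhost reaches perfect recall while precision and accuracy collapse to the positive rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 30 superhosts (1) and 70 non-superhosts (0).
# The split is illustrative only, not taken from the Airbnb data.
y_true = np.array([1] * 30 + [0] * 70)

# Degenerate classifier: predict "superhost" for every host.
y_all_positive = np.ones_like(y_true)

recall_all = recall_score(y_true, y_all_positive)        # every superhost is caught
precision_all = precision_score(y_true, y_all_positive)  # but most positive calls are wrong
accuracy_all = accuracy_score(y_true, y_all_positive)

print(f"recall={recall_all:.2f}, precision={precision_all:.2f}, accuracy={accuracy_all:.2f}")
# prints: recall=1.00, precision=0.30, accuracy=0.30
```

This is why recall is tracked alongside precision or accuracy rather than maximized on its own.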

Models to test: 
- Logistic Regression
- XGBoost/LightGBM/CatBoost
- SVM

## Logistic Regression

We plan to train using stratified k-fold cross-validation, starting with a model that uses many variables and later building a reduced one that is more interpretable.
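
As a rough illustration of what stratification buys us, the sketch below splits a made-up imbalanced target into five folds and checks the positive rate in each validation fold. The 20% positive rate, the placeholder feature matrix, and the use of `shuffle=True` with a fixed `random_state` are all assumptions for this demo.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 20% positives (illustrative, not the real data).
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature matrix

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=9)

# Positive rate in each validation fold; stratification keeps it at the overall rate.
fold_positive_rates = [
    float(y_demo[val_idx].mean()) for _, val_idx in skf.split(X_demo, y_demo)
]
print(fold_positive_rates)
```

With a plain `KFold` on sorted labels, some folds could contain only one class, which would make recall or ROC AUC undefined for those folds.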

In [3]:
# Separate the feature matrix, dropping some of the unneeded/multicollinear columns

X = df.drop(columns=['id', 'host_id', 'host_since', 'host_is_superhost', 'neighbourhood_cleansed', 'price', 'host_listings_count', "host_total_listings_count"])
y = df["host_is_superhost"]

# Dummy-encode the object-typed categorical columns

categorical_cols = ["license", "host_response_time", "neighbourhood_group_cleansed", "room_type"]
dummy_cols = pd.get_dummies(X[categorical_cols], drop_first=True, dtype=int)
X = X.drop(categorical_cols, axis=1)
X = pd.concat([X, dummy_cols], axis=1)

X.shape

(42430, 50)

In [None]:
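# Run stratified k-fold cross-validation; the precision-recall curve is used later to find a good threshold
# Use RandomizedSearchCV to sample parameter combinations for logistic regression.
# Note: not every combination below is valid (e.g. 'lbfgs' supports only 'l2' or no penalty,
# and 'l1_ratio' applies only to 'elasticnet'); invalid candidates fail to fit and are scored as nan.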
params_logistic = {
    'penalty': ['l1', 'l2', 'elasticnet', None],
    'C': np.logspace(-2, 2, 10),
    'solver': ['saga', 'lbfgs'], 
    'l1_ratio': np.linspace(0, 1, 5)  
}
logit = LogisticRegression(max_iter=1000, random_state=9)

Skfold = StratifiedKFold(n_splits=5, shuffle=False)

logistic_search = RandomizedSearchCV(estimator=logit, param_distributions=params_logistic, cv=Skfold, scoring='roc_auc', n_jobs=-1, verbose=2, n_iter=100)

logistic_search.fit(X, y)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


120 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
75 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Afif\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Afif\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Afif\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1194, in fit
    solver = _check_solver(self
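
The failed fits above appear to come from solver/penalty combinations that scikit-learn rejects (the traceback ends in `_check_solver`). One way to avoid them, sketched below, is to pass `param_distributions` as a list of dicts, one per compatible solver/penalty family. This is an assumption about how the search could be restructured, not the search that produced the output above; it runs on synthetic data from `make_classification`, and the names `compatible_params` and `search` are hypothetical. The unpenalized case (`penalty=None`) is left out for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for (X, y); the real feature matrix is not used here.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=9
)

C_values = np.logspace(-2, 2, 10)

# One dict per solver/penalty family, so every sampled candidate is a valid combination.
compatible_params = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'C': C_values,
     'l1_ratio': np.linspace(0, 1, 5)},
]

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000, random_state=9),
    param_distributions=compatible_params,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=9),
    scoring='roc_auc',
    n_iter=10,
    random_state=9,
    error_score='raise',  # fail loudly instead of silently scoring nan
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Setting `error_score='raise'` is optional, but it surfaces any remaining invalid combination immediately instead of hiding it among the candidates.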

In [None]:
print(f"Best mean cross-validated score ({logistic_search.scoring}): {logistic_search.best_score_}")

Recall Score for best estimator: 0.4540736270368135


array([0.001     , 0.00215443, 0.00464159, 0.01      , 0.02154435,
       0.04641589, 0.1       , 0.21544347, 0.46415888, 1.        ])

In [None]:
# Hyperparameter search space for LightGBM

param_dist = {
    'num_leaves': np.arange(10, 50, 5),
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0],
    'n_estimators': [10, 50, 100, 150, 200, 300, 400, 500],
    'subsample': [0.5, 0.7, 1.0],
    'max_depth': [3, 4, 5, 6]
}

In [None]:
# Plot curves; use the precision-recall curve to determine a good threshold

# Best logistic model (the search already refits it on the full data, so no second fit is needed)
best_logistic = logistic_search.best_estimator_

# Note: these are in-sample probabilities, so both curves will look optimistic
logistic_proba = best_logistic.predict_proba(X)[:, 1]

logistic_precision, logistic_recall, _ = precision_recall_curve(y, logistic_proba)
avg_precision = average_precision_score(y, logistic_proba)

logistic_fpr, logistic_tpr, _ = roc_curve(y, logistic_proba)
logistic_roc_auc = roc_auc_score(y, logistic_proba)

# Best LightGBM





fig, axes = plt.subplots(1, 2, figsize=(10, 5))

axes[0].plot(logistic_recall, logistic_precision, label=f'AP = {avg_precision:.2f}', color='b')
axes[0].set_title('Precision-Recall Curve')
axes[0].set_xlabel('Recall')
axes[0].set_ylabel('Precision')
axes[0].legend(loc='best')
axes[0].grid()

axes[1].plot(logistic_fpr, logistic_tpr, label=f'AUC = {logistic_roc_auc:.2f}', color='b')
axes[1].plot([0, 1], [0, 1], linestyle='--', color='r', label='Random Guessing')
axes[1].set_title('ROC Curve')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].legend(loc='best')
axes[1].grid()

plt.tight_layout()

NameError: name 'logistic_search' is not defined
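
Once the search cell has been re-run so that the model is defined, the precision-recall curve can be turned into a concrete threshold. The sketch below is one possible rule, not the notebook's chosen method: it uses out-of-fold probabilities from `cross_val_predict` (so each row is scored by a model that never saw it) and picks the highest-precision threshold that still reaches a target recall. The synthetic data, the `target_recall` of 0.80, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for (X, y).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=9
)

# Out-of-fold probabilities: each row is scored by a model trained on the other folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=9)
oof_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, method='predict_proba'
)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_demo, oof_proba)

# precision/recall have one more entry than thresholds; drop the final (recall=0) point.
precision, recall = precision[:-1], recall[:-1]

# Hypothetical rule: highest-precision threshold that still reaches the target recall.
target_recall = 0.80
eligible = recall >= target_recall
best_idx = int(np.argmax(np.where(eligible, precision, -np.inf)))
chosen_threshold = thresholds[best_idx]

print(f"threshold={chosen_threshold:.3f}, "
      f"precision={precision[best_idx]:.3f}, recall={recall[best_idx]:.3f}")
```

The target recall is a business choice; raising it pushes the threshold down and labels more hosts as superhosts, which is the trade-off discussed at the start of this section.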