## Instructions {-}

- This is the template for the code and report on the Prediction Problem assignments.

- Your code in steps 1, 3, 4, and 5 will be executed sequentially, and must produce the RMSE / accuracy claimed on Kaggle.

- Your code in step 2 will also be executed, and must produce the optimal hyperparameter values used to train the model.

In [71]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score,train_test_split, KFold, cross_val_predict
from sklearn.metrics import mean_squared_error,r2_score,roc_curve,auc,precision_recall_curve, accuracy_score, \
recall_score, precision_score, confusion_matrix, mean_absolute_error
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import BaggingRegressor,BaggingClassifier,AdaBoostRegressor,AdaBoostClassifier, \
RandomForestRegressor, GradientBoostingRegressor,VotingRegressor, StackingRegressor, VotingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, Ridge, Lasso,LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
import itertools as it
import time as time
import xgboost as xgb
from datetime import datetime as dt

from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from skopt.plots import plot_objective, plot_histogram, plot_convergence
import warnings
from IPython import display

## Read data

In [2]:
train = pd.read_csv('./datasets/train_classification.csv')
test = pd.read_csv('./datasets/test_classification.csv')

## 1) Data pre-processing

Put the data pre-processing code. You don't need to explain it. You may use the same code from last quarter.

In [3]:
train['has_missing'] = train.isnull().any(axis=1).astype(int)
test['has_missing'] = test.isnull().any(axis=1).astype(int)

In [4]:
# Define a function to categorize the property types
def categorize_property(property_type):
    if 'Entire' in property_type:
        return 'Entire Home/Apartment'
    elif 'Private' in property_type:
        return 'Private Room'
    elif 'Shared' in property_type:
        return 'Shared Accommodation'
    elif property_type in ['Room in hotel', 'Room in boutique hotel', 'Boat']:
        return 'Specialty Accommodations'
    else:
        return 'Other'

In [5]:
# overall function to clean training and test data
def clean_data(df):
    
    if 'host_is_superhost' in df.columns:
        df.host_is_superhost = df.host_is_superhost.replace({'t': 1, 'f': 0})
        
    # replace missing values of numeric variables wtih the median
    numeric_columns = df.select_dtypes(include=['number']).columns
    df[numeric_columns] = df[numeric_columns].apply(lambda x: x.fillna(x.median()))

    # replace missing values of categorical variables with the mode 
    categorical_columns = df.select_dtypes(include=['object']).columns
    df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])
    
    # replace any 0 values to 1 so that it can go through log transformation
    df['beds'] = df['beds'].replace(0, .01)
    df['accommodates'] = df['accommodates'].replace(0, .01)
    df['number_of_reviews'] = df['number_of_reviews'].replace(0, .01)
    df['reviews_per_month'] = df['reviews_per_month'].replace(0, .01)
    df['number_of_reviews_ltm'] = df['number_of_reviews_ltm'].replace(0, .01)
    df['number_of_reviews_l30d'] = df['number_of_reviews_l30d'].replace(0, .01)
    df['host_total_listings_count'] = df['host_total_listings_count'].replace(0, .01)
    df['host_listings_count'] = df['host_listings_count'].replace(0, .01)
    df['calculated_host_listings_count_private_rooms'] = df['calculated_host_listings_count_private_rooms'].replace(0, .01)
    df['calculated_host_listings_count_shared_rooms'] = df['calculated_host_listings_count_shared_rooms'].replace(0, .01)
    df['calculated_host_listings_count_entire_homes'] = df['calculated_host_listings_count_entire_homes'].replace(0, .01)
    
    df['log_beds'] = np.log(df.beds)
    df['log_accommodates'] = np.log(df.accommodates)
    df['log_reviews'] = np.log(df.number_of_reviews)
    df['log_reviews_per_month'] = np.log(df.reviews_per_month)
    df['log_reviews_ltm'] = np.log(df.number_of_reviews_ltm)
    df['log_reviews_l30d'] = np.log(df.number_of_reviews_l30d)
    df['log_host_total_listings_count'] = np.log(df.host_total_listings_count)
    df['log_host_listings_count'] = np.log(df.host_listings_count)
    df['log_host_listings_count_private_rooms'] = np.log(df.calculated_host_listings_count_private_rooms)
    df['log_host_listings_count_shared_rooms'] = np.log(df.calculated_host_listings_count_shared_rooms)
    df['log_host_listings_count_entire_homes'] = np.log(df.calculated_host_listings_count_entire_homes)

    # calculate the number of days since the host became a host
    df['host_since'] = pd.to_datetime(df['host_since'])
    current_date = dt.now()
    df['host_since_days'] = (current_date - df['host_since']).dt.days
    
    # calculate days since first/last review
    df['first_review'] = pd.to_datetime(df['first_review'], errors='coerce')
    df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')

    df['first_review_days'] = (current_date - df['first_review']).dt.days
    df['last_review_days'] = (current_date - df['last_review']).dt.days
    
    # make response_rate and acceptance_rate into numeric dtype
    df['host_response_rate'] = df['host_response_rate'].str.rstrip('%').astype('float')
    df['host_acceptance_rate'] = df['host_acceptance_rate'].str.rstrip('%').astype('float')
    
    # subgroup property_type (similar levels as room_type so discard room predictor)
    df['property_cats'] = df['property_type'].apply(categorize_property)
    
    # extract numeric values from the 'bathrooms' column
    df['bath_numeric'] = df['bathrooms_text'].str.extract('(\d+\.*\d*)', expand=False).astype(float)

    # handle "Half-bath" by assigning a numeric value of 0.5
    df['bath_numeric'] = df.apply(lambda row: 0.5 if 'half' in row['bathrooms_text'].lower() \
                                  else row['bath_numeric'], axis=1)
    
    # to binary
    df.host_identity_verified = df.host_identity_verified.replace({'t': 1, 'f': 0})
    df.host_has_profile_pic = df.host_has_profile_pic.replace({'t': 1, 'f': 0})
    df.has_availability = df.has_availability.replace({'t': 1, 'f': 0})
    df.instant_bookable = df.instant_bookable.replace({'t': 1, 'f': 0})

    # drop the modified/redundant columns
    df.drop(columns = ['host_since', 'first_review', 'last_review', 'property_type', 'bathrooms_text', \
                       'number_of_reviews', 'reviews_per_month', 'number_of_reviews_ltm', \
                       'number_of_reviews_l30d', 'host_total_listings_count', 'host_listings_count', \
                      'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', \
                       'calculated_host_listings_count_entire_homes'], inplace = True)
    
    # drop predictors that have low corr with log_price and high corr with others to help remove multi-collinearity
    df.drop(columns = ['first_review_days', 'last_review_days', 'host_acceptance_rate', 'host_response_rate', 
                       'availability_60', 'availability_90', 'minimum_minimum_nights', \
                       'maximum_maximum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', \
                       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm'], inplace = True)

In [6]:
clean_data(train)
clean_data(test)

In [7]:
# drop the string categorical predictors 
train = train.drop(columns = ['host_neighbourhood', 'neighbourhood_cleansed', 'host_location'])
test = test.drop(columns = ['host_neighbourhood', 'neighbourhood_cleansed', 'host_location'])

In [8]:
# OHE the remaining categorical variables
host_response_time_dummies = pd.get_dummies(train['host_response_time'], prefix='host_response_time')
train = pd.concat([train, host_response_time_dummies], axis = 1)

host_response_time_dummies = pd.get_dummies(test['host_response_time'], prefix='host_response_time')
test = pd.concat([test, host_response_time_dummies], axis = 1)

In [9]:
host_verifications_dummies = pd.get_dummies(train['host_verifications'], prefix='host_verifications')
train = pd.concat([train, host_verifications_dummies], axis = 1)

host_verifications_dummies = pd.get_dummies(test['host_verifications'], prefix='host_verifications')
test = pd.concat([test, host_verifications_dummies], axis = 1)

In [10]:
room_type_dummies = pd.get_dummies(train['room_type'], prefix='room_type')
train = pd.concat([train, room_type_dummies], axis = 1)

room_type_dummies = pd.get_dummies(test['room_type'], prefix='room_type')
test = pd.concat([test, room_type_dummies], axis = 1)

In [11]:
property_cats_dummies = pd.get_dummies(train['property_cats'], prefix='property_cats')
train = pd.concat([train, property_cats_dummies], axis = 1)

property_cats_dummies = pd.get_dummies(test['property_cats'], prefix='property_cats')
test = pd.concat([test, property_cats_dummies], axis = 1)

In [12]:
train = train.drop(columns = ['host_response_time', 'host_verifications', 'room_type', 'property_cats'])
test = test.drop(columns = ['host_response_time', 'host_verifications', 'room_type', 'property_cats'])

In [13]:
# variable spacing
train.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)
test.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)

In [14]:
# set response and predictors for scaling
y_train = train.host_is_superhost
X_train = train.drop(columns = ['id', 'host_is_superhost', 'host_id'])
X_test = test.drop(columns = ['id', 'host_id'])

In [15]:
sc = StandardScaler()
sc.fit(X_train)
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)

In [24]:
alphas = np.logspace(-4,1,100)

warnings.filterwarnings("ignore")
lcv = LassoCV(alphas=alphas, cv=10)
lcv.fit(X_train_scaled, y_train)

In [25]:
X_train_cleaned = X_train_scaled.T[lcv.coef_!=0].T
X_test_cleaned = X_test_scaled.T[lcv.coef_!=0].T

## 2) Hyperparameter tuning

### How many attempts did it take you to tune the model hyperparameters?

I took me around 20+ attempts to tune the model hyperparamters.

### Which tuning method did you use (grid search / Bayes search / etc.)?

I used Grid Search.

### What challenges did you face while tuning the hyperparameters, and what actions did you take to address those challenges?

There were not that many challenges tuning the hyperparamters, the main challenge was making sure that the dataset was properly prepared for the models. Narrowing down the ranges was easier this time as I just used the same tuning methods as the previous models.

### How many hours did you spend on hyperparameter tuning?

It took me around 6 hours to tune.

**Paste the hyperparameter tuning code below. You must show at least one hyperparameter tuning procedure.**

In [1]:
#Hyperparameter tuning code

In [78]:
base = DecisionTreeClassifier(random_state = 1, max_depth=16, max_leaf_nodes=186)
ada = AdaBoostClassifier(estimator = base_model, random_state = 1)

In [80]:
skf = StratifiedKFold(n_splits=10, shuffle = True, random_state = 1)

In [92]:
warnings.filterwarnings("ignore")

ada_grid = {'estimator__max_depth': [11, 12],
        'n_estimators': [190],
        'learning_rate': [1,10]
       } 

ada_gcv = GridSearchCV(ada, ada_grid, cv = skf, scoring = 'accuracy', n_jobs = -1, verbose = 1)

ada_gcv.fit(X_train_cleaned, y_train)

print(ada_gcv.best_score_)
print(ada_gcv.best_params_)

Fitting 10 folds for each of 4 candidates, totalling 40 fits
0.8854694431650142
{'estimator__max_depth': 11, 'learning_rate': 1, 'n_estimators': 190}


In [37]:
rf_model = RandomForestClassifier(random_state=1, n_estimators=200, bootstrap=True, max_features=0.75, 
                                 max_samples=0.9)

In [43]:
base_model = DecisionTreeClassifier(random_state=1) 
bag_model = BaggingClassifier(estimator=base_model, bootstrap=False, bootstrap_features=False, max_features=0.75,
                             max_samples=0.9, n_estimators=250, random_state=1)

In [93]:
base_model1 = DecisionTreeClassifier(random_state = 1, max_depth = 11)
ada_model = AdaBoostClassifier(estimator = base_model1, random_state = 1, learning_rate = 1, n_estimators = 190)

**Paste the optimal hyperparameter values below.**

The optimal hyperparamter values are inputted above, I am using the hyperparamters that obtained the best scores I achieved from before with each model, though I improved upon the AdaBoost model a bit further.

## 3) Model

Using the optimal model hyperparameters, train the model, and paste the code below.

In [94]:
en = StackingClassifier(estimators=[('bag',bag_model),('ada',ada_model)],
                                   final_estimator=LogisticRegression(random_state=1,max_iter=10000),n_jobs=-1,
                                   cv = StratifiedKFold(n_splits=5,shuffle=True,random_state=1))
en.fit(X_train_cleaned, y_train)

In [95]:
y_preds = en.predict(X_test_cleaned)
y_preds = y_preds.astype(int)

## 4) Put any ad-hoc steps for further improving model accuracy
For example, scaling up or scaling down the predictions, capping predictions, etc.

Put code below.

#### Implementing `host_id`

In [96]:
id = test.id.values
predicted = y_preds
submission = pd.DataFrame({'id': id, 'predicted': predicted})
submission = submission.reset_index(drop=True)

In [97]:
# Add 'host_id' to submission
submission['host_id'] = test['host_id']

# Use apply to replace 'predicted' based on 'host_id'
submission['predicted'] = submission.apply(lambda row: train[train['host_id'] == row['host_id']]['host_is_superhost'].values[0] 
                                           if not train[train['host_id'] == row['host_id']].empty else row['predicted'], axis=1)

# Drop 'host_id' from submission
submission = submission.drop(columns=['host_id'])

## 5) Export the predictions in the format required to submit on Kaggle
Put code below.

In [98]:
submission.to_csv('ensemble_classification_submission.csv', index=False)