__Title:__ Lab 2: Classificaiton  
__Authors:__ Butler, Derner, Holmes  
__Date:__ 2/5/23 

## Ruberic

| Category                  | Available | Requirements                                                                                                                                                                                                                                                                                                                                                                                                                      |
|---------------------------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Total Points              | 100       |                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| Data Preparation Part 1   | 10        | Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.                                                                                                                                                                        |
| Data Preparation Part 2   | 5         | Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).                                                                                                                                                                                                                                                                                          |
| Modeling and Evaluation 1 | 10        | Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.                                                                                                                                               |
| Modeling and Evaluation 2 | 10        | Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.                                                       |
| Modeling and Evaluation 3 | 20        | Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms! |
| Modeling and Evaluation 4 | 10        | Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.                                                                                                                                                                                                             |
| Modeling and Evaluation 5 | 10        | Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.                       |
| Modeling and Evaluation 6 | 10        | Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.                                                                                                                                                |
| Deployment                | 5         | How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would your deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?                                                                       |
| Exceptional Work          | 10        | You have free reign to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?                                                                                                                                                                 |

__Libraries & Set-up__

In [1]:
# Import libraries
## Support Libraries
import pandas as pd
import numpy as np
import warnings

## Plotting
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

## Preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

## Model Selection
from sklearn.model_selection import train_test_split, GridSearchCV

## Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier

## Feature Selection
from sklearn.feature_selection import SelectFromModel, VarianceThreshold, SelectPercentile

## Model Performance
from sklearn.metrics import classification_report, roc_curve, auc, mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics.cluster import contingency_matrix
from sklearn.inspection import permutation_importance

# Notebook Settings
warnings.filterwarnings(action='once')
pd.set_option('display.max_columns', None)

In [2]:
# Dataset
url = 'https://github.com/cdholmes11/MSDS-7331-ML1-Labs/blob/main/Mini-Lab_LogisticRegression_SVMs/Hotel%20Reservations.csv?raw=true'
hotel_df = pd.read_csv(url, encoding = "utf-8")

### Data Preparation Part 1
Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
_____________
__Dependent Variables__  
Classification: [booking_status]  
Regression: [avg_price_per_room]

In [3]:
# Dropping index column arrival_year
hotel_df_trim = hotel_df.drop(['Booking_ID', 'arrival_year', 'no_of_previous_bookings_not_canceled'], axis=1)
hotel_df_final = hotel_df_trim.loc[hotel_df_trim['avg_price_per_room'] < 400]

# Create data type groups
cat_features = ['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'market_segment_type',
    'repeated_guest', 'booking_status']
int_features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'arrival_month',
    'arrival_date', 'no_of_previous_cancellations', 'no_of_special_requests']
float_features = ['lead_time', 'avg_price_per_room']
cont_features = int_features + float_features

# Enforce data types
hotel_df_trim[cat_features] = hotel_df_trim[cat_features].astype('category')
hotel_df_trim[int_features] = hotel_df_trim[int_features].astype(np.int64)
hotel_df_trim[float_features] = hotel_df_trim[float_features].astype(np.float64)

# Making indexable list suitable for pipeline
cat_features_final = list(hotel_df_final[cat_features].columns)
cont_features_final = list(hotel_df_final[cont_features].columns)

In [4]:
# Classificaiton - Train Test Split 
X = hotel_df_final.drop('booking_status', axis = 1)
Y = hotel_df_final['booking_status']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=110)

# y_test as int for AU-ROC
y_test_num = np.where(y_test == 'Canceled', 1, 0)

In [5]:
# Regression - Train Test Split 
X_reg = hotel_df_final.drop('avg_price_per_room', axis = 1)
Y_reg = hotel_df_final['avg_price_per_room']

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, Y_reg, test_size=0.2, random_state=110)

In [6]:
# Pipeline - Classification
numeric_features = cont_features_final
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

categorical_features = [x for x in cat_features_final if x != 'booking_status']
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

In [7]:
# Pipeline - Regression
numeric_features_reg = [x for x in cont_features_final if x != 'avg_price_per_room']
numeric_transformer_reg = Pipeline(
    steps=[("scaler", StandardScaler())]
)

categorical_features_reg = cat_features_final
categorical_transformer_reg = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ]
)
preprocessor_reg = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer_reg, numeric_features_reg),
        ("cat", categorical_transformer_reg, categorical_features_reg),
    ]
)

### Data Preparation Part 2
Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
______

Our final dataset is very similar to the original dataset found on Kaggle (Hotel Reservations Dataset, 2023). It contains 16 total features and 36,274 observations. We've removed several features due to concerns of outlier influence or irrelevance and limited the range of [avg_price_per_room]. For this reason, all conclusions will be limited to the upper range of $400 for [avg_price_per_room].

Dataset Updates:
 - [avg_price_per_room] - filtered to under $400 due to lack of observations over this price point and concerns of outlier influence
 - [no_of_previous_bookings_not_canceled] - dropped due to concerns of outlier influence and because it explains the same observations as [repated_guest]
 - [Booking_ID] - dropped because it's irrelevant for future predictions
 - [arrival_year] - dropped because this is not a time series model and thus, it's inclusion would limit our model's potential for classification of arrival years not found in our testing dataset.

 Outside of the removal of the dependent feature for classification and regression respectively, the datasets will be the same. Below are the descriptions as outlined from Kaggle.

 | Feature                       | Data Type   | Description                                                                                   |
|-------------------------------|-------------|-----------------------------------------------------------------------------------------------|
| no_of_adults                  | Integer     | Number of adults                                                                              |
| no_of_children                | Integer     | Number of Children                                                                            |
| no_of_weekend_nights          | Integer     | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| no_of_week_nights             | Integer     | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel      |
| type_of_meal_plan             | Categorical | Type of meal plan booked by the customer                                                      |
| required_car_parking_space    | Integer     | Does the customer require a car parking space?                                                |
| room_type_reserved            | Categorical | Does the customer require a car parking space?                                                |
| lead_time                     | Integer     | Number of days between the date of booking and the arrival date                               |
| arrival_month                 | Integer     | Month of arrival date                                                                         |
| arrival_date                  | Integer     | Date of the month                                                                             |
| market_segment_type           | Categorical | Market segment designation                                                                    |
| repeated_guest                | Integer     | Is the customer a repeated guest?                                                             |
| no_of_previous_cancellations  | Integer     | Number of previous bookings that were canceled by the customer prior to the current booking   |
| avg_price_per_room            | Float       | Average price per day of the reservation; prices of the rooms are dynamic                     |
| no_of_special_requests        | Integer     | Total number of special requests made by the customer                                         |
| booking_status                | Categorical | Flag indicating if the booking was canceled or not                                            |

### Modeling and Evaluation 1
Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
__________
__Classification__  
We will be using AUC-ROC to as the primary metric for evaluating our models. AUC-ROC will give us the model with the best classification of cancellations while reducing false postives. The main use cases for models like this would be for future predictions and understanding the factors that lead to cancelations. If our model is frequently falsely predicting cancellations in order to increase its true-positive rate, it will likely lead the end user to make mislead conclusions. Therefore, a model that is not able to do both well, is not useful to the targeted audience of this analysis.


__Regression__  
We will utilize RMSE as the primary metric to compare models and choose the best hyperparameters. For regression tasks, we believe this metric will provide the best understanding of each model's ability to predict Average Price Per Room in a manor that is easily understood by end users unfamiliar with statistical metrics. RMSE is in the same unit, dollars, as the dependent variable. While this dataset is not designed to regress on this variable, we did believe that it would be an interesting analysis to see if we could reliably predict average price per room based on the given variables. If we can, it has the potential to provide hotel operators some insightful information about their offerings.

### Modeling and Evaluation 2
Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.
______________________
In every model of this analysis we perform 10-fold cross validation with a 20% holdout for final model scores. This helps us reduce the risk of overfitting while maintaining our confidence in the model's ability to generalize. No model tuning will be done to improve scores against the holdout.This maintains the integrity of our holdout dataset to speak to each models ability to generalize to future datasets.

### Modeling and Evaluation 3
Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!
__________________

#### Task 1: Classification - Cancelations

__Gradient Boosting__

In [34]:
# Test Parameters
gb_param_grid = {
              'classifier__n_estimators': [50, 100, 1000, 2000],
              'classifier__learning_rate': [0.01, 0.1, 1],
              'classifier__max_depth': [2, 5, 10],
              'classifier__subsample': [0.5, 0.75, 1.0],
              'classifier__max_features': ['sqrt', 'log2', None],
              'classifier__random_state': [110],
              
    }

In [36]:
%%capture
# Model
gb = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", GradientBoostingClassifier())]
)

grid_gb = GridSearchCV(
    gb,
    gb_param_grid,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='roc_auc'
)

# Fit
grid_gb.fit(X_train, y_train)

In [37]:
# Performance Metrics
print('Internal CV score(AUC-ROC):')
print(f'{grid_gb.best_score_:.4f}')

Internal CV score(AUC-ROC):
0.9546


In [38]:
# Best Parameters
pd.DataFrame.from_dict(grid_gb.best_params_, orient='index', columns=['Values']).reset_index().rename(columns={'index': 'Best Parameters'})

Unnamed: 0,Best Parameters,Values
0,classifier__learning_rate,0.01
1,classifier__max_depth,10.0
2,classifier__max_features,
3,classifier__n_estimators,1000.0
4,classifier__random_state,110.0
5,classifier__subsample,0.75


In [39]:
%%capture
# Best Paramters
gb_param_grid_final = {
              'classifier__n_estimators': [grid_gb.best_params_['classifier__n_estimators']],
              'classifier__learning_rate': [grid_gb.best_params_['classifier__learning_rate']],
              'classifier__max_depth': [grid_gb.best_params_['classifier__max_depth']],
              'classifier__subsample': [grid_gb.best_params_['classifier__subsample']],
              'classifier__max_features': [grid_gb.best_params_['classifier__max_features']],
              'classifier__random_state': [grid_gb.best_params_['classifier__random_state']],
    }

# Model
grid_gb_final = GridSearchCV(
    gb,
    gb_param_grid_final,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=10,
    scoring='roc_auc'
)

# Fit and Predict
grid_gb_final.fit(X_train, y_train)
y_pred_gb_final = grid_gb_final.predict(X_test)

In [40]:
# Performance Metrics
print('Internal CV score(AUC-ROC):')
print(f'{grid_gb_final.best_score_:.4f}')

# Converting to Int
y_pred_gb_final = np.where(y_pred_gb_final == 'Canceled', 1, 0)
print(classification_report(y_test_num, y_pred_gb_final))

Internal CV score(AUC-ROC):
0.9582
              precision    recall  f1-score   support

           0       0.91      0.94      0.93      4824
           1       0.88      0.82      0.85      2431

    accuracy                           0.90      7255
   macro avg       0.90      0.88      0.89      7255
weighted avg       0.90      0.90      0.90      7255



In [148]:
# Feature Importance
f_imp_gb = grid_gb_final.best_estimator_.named_steps['classifier'].feature_importances_
feature_importance_gb = dict(zip(X_train.columns, f_imp_gb))
x_val_gb = list(feature_importance_gb.keys())
y_vals_gb = list(feature_importance_gb.values())

gb_feature_imp = pd.DataFrame({
    'Features': x_val_gb,
    'Values': y_vals_gb

})

fig = px.bar(
        x=x_val_gb,
        y=y_vals_gb,
        title="Gradient Boost - Feature Importance for Booking Status",
        height=500,
        width=1000,
        labels={'y':'Importance Value', 'x': 'Feature'})
fig.update_xaxes(categoryorder='total descending')
fig.show()

In [42]:
# ROC Curve
fpr_gb, tpr_gb, thresholds_gb = roc_curve(y_test_num, y_pred_gb_final)
roc_auc_gb = auc(fpr_gb, tpr_gb)
fig = px.area(
    x=fpr_gb, y=tpr_gb,
    title=f'Gradient Boost - ROC Curve (AUC={roc_auc_gb:.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

__Decision Trees__

In [43]:
# Test Parameters
dt_param_grid = {
              'classifier__max_depth': [1, 2, 5, 10, 20, 50],
              'classifier__min_samples_split': [2, 4, 8, 16, 22, 32],
              'classifier__min_samples_leaf': [1, 2, 3, 4, 5, 10],
              'classifier__random_state': [110]
    }

In [44]:
%%capture
# Model
dt = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier())]
)

grid_dt = GridSearchCV(
    dt,
    dt_param_grid,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='roc_auc'
)

# Fit
grid_dt.fit(X_train, y_train)

In [45]:
# Performance Metrics
print('Internal CV score(AUC-ROC):')
print(f'{grid_dt.best_score_:.4f}')

Internal CV score(AUC-ROC):
0.9239


In [46]:
# Best Parameters
pd.DataFrame.from_dict(grid_dt.best_params_, orient='index', columns=['Values']).reset_index().rename(columns={'index': 'Best Parameters'})

Unnamed: 0,Best Parameters,Values
0,classifier__max_depth,10
1,classifier__min_samples_leaf,3
2,classifier__min_samples_split,32
3,classifier__random_state,110


In [47]:
%%capture
# Best Paramters
dt_param_grid_final = {
              'classifier__max_depth': [grid_dt.best_params_['classifier__max_depth']],
              'classifier__min_samples_split': [grid_dt.best_params_['classifier__min_samples_split']],
              'classifier__min_samples_leaf': [grid_dt.best_params_['classifier__min_samples_leaf']],
              'classifier__random_state': [grid_dt.best_params_['classifier__random_state']]
    }

# Model
grid_dt_final = GridSearchCV(
    dt,
    dt_param_grid_final,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=10,
    scoring='roc_auc'
)

# Fit and Predict
grid_dt_final.fit(X_train, y_train)
y_pred_dt_final = grid_dt_final.predict(X_test)

In [48]:
# Performance Metrics
print('Internal CV score(AUC-ROC):')
print(f'{grid_dt_final.best_score_:.4f}')

# Converting to Int
y_pred_dt_final = np.where(y_pred_dt_final == 'Canceled', 1, 0)
print(classification_report(y_test_num, y_pred_dt_final))

Internal CV score(AUC-ROC):
0.9282
              precision    recall  f1-score   support

           0       0.90      0.91      0.90      4824
           1       0.81      0.79      0.80      2431

    accuracy                           0.87      7255
   macro avg       0.85      0.85      0.85      7255
weighted avg       0.87      0.87      0.87      7255



In [150]:
# Feature Importance
f_imp_dt = grid_dt_final.best_estimator_.named_steps['classifier'].feature_importances_
feature_importance_dt = dict(zip(X_train.columns, f_imp_dt))
x_vals_dt = list(feature_importance_dt.keys())
y_vals_dt = list(feature_importance_dt.values())

dt_feature_imp = pd.DataFrame({
    'Features': x_vals_dt,
    'Values': y_vals_dt

})

fig = px.bar(
        x=x_vals_dt,
        y=y_vals_dt,
        title="Decision Tree - Feature Importance for Booking Status",
        height=500,
        width=1000,
        labels={'y':'Importance Value', 'x': 'Feature'})
fig.update_xaxes(categoryorder='total descending')
fig.show()

In [50]:
# ROC Curve
fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test_num, y_pred_dt_final)
roc_auc_dt = auc(fpr_dt, tpr_dt)
fig = px.area(
    x=fpr_dt, y=tpr_dt,
    title=f'Decision Tree - ROC Curve (AUC={roc_auc_dt:.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

__SVM__

In [51]:
# Test Parameters
param_grid_svc = {
    'classifier__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__degree': [2, 4],
    'classifier__gamma': ['scale', 'auto'],
    'classifier__class_weight': [None, 'balanced'],
    'classifier__random_state': [110]
}

In [52]:
%%capture
# Model
svm = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", SVC())]
)

grid_clf_svm = GridSearchCV(
    svm,
    param_grid_svc,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='roc_auc'
)

# Fit
grid_clf_svm.fit(X_train, y_train)

In [53]:
# Performance Metrics
print('Internal CV score(AUC-ROC):')
print(f'{grid_clf_svm.best_score_:.4f}')

Internal CV score(AUC-ROC):
0.9112


In [54]:
# Best Parameters
pd.DataFrame.from_dict(grid_clf_svm.best_params_, orient='index', columns=['Values']).reset_index().rename(columns={'index': 'Best Parameters'})

Unnamed: 0,Best Parameters,Values
0,classifier__C,100
1,classifier__class_weight,balanced
2,classifier__degree,2
3,classifier__gamma,scale
4,classifier__kernel,rbf
5,classifier__random_state,110


In [55]:
%%capture
# Best Parameters
param_grid_svc_final = {
    'classifier__kernel': [grid_clf_svm.best_params_['classifier__kernel']],
    'classifier__C': [grid_clf_svm.best_params_['classifier__C']],
    'classifier__degree': [grid_clf_svm.best_params_['classifier__degree']],
    'classifier__gamma': [grid_clf_svm.best_params_['classifier__gamma']],
    'classifier__class_weight': [grid_clf_svm.best_params_['classifier__class_weight']],
    'classifier__random_state': [grid_clf_svm.best_params_['classifier__random_state']]
}

# Model
grid_clf_svm_final = GridSearchCV(
    svm,
    param_grid_svc_final,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=10,
    scoring='roc_auc'
)
grid_clf_svm_final.fit(X_train, y_train)
y_pred_svm_final = grid_clf_svm_final.predict(X_test)

In [56]:
# Performance Metrics
print('Internal CV score(AUC-ROC):')
print(f'{grid_clf_svm_final.best_score_:.4f}')

# Converting to Int
y_pred_svm_final = np.where(y_pred_svm_final == 'Canceled', 1, 0)
print(classification_report(y_test_num, y_pred_svm_final))

Internal CV score(AUC-ROC):
0.9172
              precision    recall  f1-score   support

           0       0.91      0.86      0.88      4824
           1       0.74      0.83      0.79      2431

    accuracy                           0.85      7255
   macro avg       0.83      0.84      0.83      7255
weighted avg       0.85      0.85      0.85      7255



In [57]:
# Feature Importance
perm_importance = permutation_importance(grid_clf_svm_final, X_test, y_test, n_jobs=-1)

# Top 10 - Feature Importance SVM - GridCV
feature_names_grid_final = grid_clf_svm_final.best_estimator_.named_steps['preprocessor'].get_feature_names_out()
mean_importance = perm_importance.importances_mean

feat_imp_svc_final = pd.DataFrame(zip(feature_names_grid_final, mean_importance), columns=['Feature', 'Value'])
feat_imp_svc_final['Absolute Value'] = feat_imp_svc_final['Value'].apply(lambda x: abs(x))
feat_imp_svc_final = feat_imp_svc_final.sort_values('Absolute Value', ascending = False)

# Feature Importance from permutation_importance()
feat_imp_svc_final.head(10).reset_index(drop=True)

Unnamed: 0,Feature,Value,Absolute Value
0,num__no_of_special_requests,0.219161,0.219161
1,cat__required_car_parking_space_0,0.135169,0.135169
2,cat__type_of_meal_plan_Not Selected,0.091691,0.091691
3,cat__type_of_meal_plan_Meal Plan 1,0.091073,0.091073
4,num__lead_time,0.043637,0.043637
5,num__no_of_weekend_nights,0.036028,0.036028
6,num__no_of_adults,0.029221,0.029221
7,num__no_of_week_nights,0.028436,0.028436
8,num__avg_price_per_room,0.026327,0.026327
9,num__arrival_month,0.018303,0.018303


In [58]:
# Feature Importance Plot
fig = px.bar(
    feat_imp_svc_final,
    x='Feature',
    y='Value',
    title='SVM - Feature Importance for Booking Status',
    height=500,
    width=1000
)
fig.show()   

In [59]:
# ROC Curve
fpr_svm, tpr_svm, thresholds_svm = roc_curve(y_test_num, y_pred_svm_final)
roc_auc_svm = auc(fpr_svm, tpr_svm)
fig = px.area(
    x=fpr_svm, y=tpr_svm,
    title=f'SVM - ROC Curve (AUC={roc_auc_svm:.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

#### Task 2: Regression - Average Room Price

__Gradient Boosting__

In [60]:
# Test Parameters
gb_reg_param_grid = {
    'regressor__learning_rate': [0.01, 0.1, 0.5],
    'regressor__n_estimators': [50, 100, 200],
    'regressor__max_depth': [3, 5, None],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4],
    'regressor__max_features': ['auto', 'sqrt', None],
    'regressor__random_state': [110]        
    }

In [61]:
%%capture
# Model
gb_reg = Pipeline(
    steps=[("preprocessor", preprocessor_reg), ("regressor", GradientBoostingRegressor())]
)

grid_gb_reg = GridSearchCV(
    gb_reg,
    gb_reg_param_grid,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='neg_root_mean_squared_error'
)

# Fit
grid_gb_reg.fit(X_train_reg, y_train_reg)

In [62]:
# Performance Metrics
print('Internal CV score(neg-RMSE):')
print(f'{grid_gb_reg.best_score_:.4f}')

Internal CV score(neg-RMSE):
-15.7462


In [63]:
# Best Parameters Table
pd.DataFrame.from_dict(grid_gb_reg.best_params_, orient='index', columns=['Values']).reset_index().rename(columns={'index': 'Best Parameters'})

Unnamed: 0,Best Parameters,Values
0,regressor__learning_rate,0.1
1,regressor__max_depth,
2,regressor__max_features,sqrt
3,regressor__min_samples_leaf,4
4,regressor__min_samples_split,2
5,regressor__n_estimators,200
6,regressor__random_state,110


In [64]:
%%capture
# Best Parameters
param_grid_gb_reg_final = {
    'regressor__learning_rate': [grid_gb_reg.best_params_['regressor__learning_rate']],
    'regressor__n_estimators': [grid_gb_reg.best_params_['regressor__n_estimators']],
    'regressor__max_depth': [grid_gb_reg.best_params_['regressor__max_depth']],
    'regressor__min_samples_split': [grid_gb_reg.best_params_['regressor__min_samples_split']],
    'regressor__min_samples_leaf': [grid_gb_reg.best_params_['regressor__min_samples_leaf']],
    'regressor__max_features': [grid_gb_reg.best_params_['regressor__max_features']],
    'regressor__random_state': [grid_gb_reg.best_params_['regressor__random_state']]
}

# Model
grid_gb_reg_final = GridSearchCV(
    gb_reg,
    param_grid_gb_reg_final,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=10,
    scoring='neg_root_mean_squared_error'
)

# Fit
grid_gb_reg_final.fit(X_train_reg, y_train_reg)
y_pred_gb_reg_final = grid_gb_reg_final.predict(X_test_reg)

In [66]:
# Performance Metrics
print('Internal CV score(neg-RMSE):')
print(f'{grid_gb_reg_final.best_score_:.4f}')

# Compute regression metrics
rmse_gb = mean_squared_error(y_test_reg, y_pred_gb_reg_final, squared=False)
mae_gb = mean_absolute_error(y_test_reg, y_pred_gb_reg_final)
r2_gb = r2_score(y_test_reg, y_pred_gb_reg_final)

# Print the results
print("Root Mean Squared Error: {:.2f}".format(rmse_gb))
print("Mean Absolute Error: {:.2f}".format(mae_gb))
print("R^2 Score: {:.2f}".format(r2_gb))

Internal CV score(neg-RMSE):
-15.2703
Root Mean Squared Error: 15.88
Mean Absolute Error: 8.45
R^2 Score: 0.80


In [173]:
# Feature Importance
f_imp_gb_reg = grid_gb_reg_final.best_estimator_.named_steps['regressor'].feature_importances_
feature_importance_gb_reg = dict(zip(X_train.columns, f_imp_gb_reg))
x_vals_gb_reg = list(feature_importance_gb_reg.keys())
y_vals_gb_reg = list(feature_importance_gb_reg.values())

gb_feature_imp_reg = pd.DataFrame({
    'Features': x_vals_gb_reg,
    'Values': y_vals_gb_reg

})

fig = px.bar(
        x=x_vals_gb_reg,
        y=y_vals_gb_reg,
        title="Gradient Boosting - Feature Importance for Average Price Per Room",
        height=500,
        width=1000,
        labels={'y':'Importance Value', 'x': 'Feature'})
fig.update_xaxes(categoryorder='total descending')
fig.show()

In [68]:
# Create a scatter plot using Plotly
fig = px.scatter(x=y_test_reg, y=y_pred_gb_reg_final,
                 labels={'x': 'Average Price Per Room', 'y': 'Predicted Values'},
                 title='Gradient Boosting - Regression Model Performance of Average Price Per Room',
                 trendline='ols', trendline_color_override='red',
                 marginal_x='histogram', marginal_y='histogram',
                 height=500,
                 width=1000)
fig.show()

__Decision Trees__

In [69]:
# Test Parameters
dt_reg_param_grid = {
    'regressor__criterion': ['mse', 'friedman_mse', 'mae'],
    'regressor__splitter': ['best', 'random'],
    'regressor__max_depth': [None, 2, 5, 10],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4],
    'regressor__max_features': [None, 'auto', 'sqrt', 'log2'],
    'regressor__random_state': [110]        
    }

In [70]:
%%capture
# Model
dt_reg = Pipeline(
    steps=[("preprocessor", preprocessor_reg), ("regressor", DecisionTreeRegressor())]
)

grid_dt_reg = GridSearchCV(
    dt_reg,
    dt_reg_param_grid,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='neg_root_mean_squared_error'
)

# Fit
grid_dt_reg.fit(X_train_reg, y_train_reg)

In [71]:
# Performance Metrics
print('Internal CV score(neg-RMSE):')
print(f'{grid_dt_reg.best_score_:.4f}')

Internal CV score(neg-RMSE):
-19.4389


In [72]:
# Best Parameters Table
pd.DataFrame.from_dict(grid_dt_reg.best_params_, orient='index', columns=['Values']).reset_index().rename(columns={'index': 'Best Parameters'})

Unnamed: 0,Best Parameters,Values
0,regressor__criterion,friedman_mse
1,regressor__max_depth,
2,regressor__max_features,
3,regressor__min_samples_leaf,4
4,regressor__min_samples_split,10
5,regressor__random_state,110
6,regressor__splitter,best


In [73]:
%%capture
# Best Parameters
param_grid_dt_reg_final = {
    'regressor__criterion': [grid_dt_reg.best_params_['regressor__criterion']],
    'regressor__splitter': [grid_dt_reg.best_params_['regressor__splitter']],
    'regressor__max_depth': [grid_dt_reg.best_params_['regressor__max_depth']],
    'regressor__min_samples_split': [grid_dt_reg.best_params_['regressor__min_samples_split']],
    'regressor__min_samples_leaf': [grid_dt_reg.best_params_['regressor__min_samples_leaf']],
    'regressor__max_features': [grid_dt_reg.best_params_['regressor__max_features']],
    'regressor__random_state': [grid_dt_reg.best_params_['regressor__random_state']],
}

# Model
grid_dt_reg_final = GridSearchCV(
    dt_reg,
    param_grid_dt_reg_final,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=10,
    scoring='neg_root_mean_squared_error'
)

# Fit
grid_dt_reg_final.fit(X_train_reg, y_train_reg)
y_pred_dt_reg_final = grid_dt_reg_final.predict(X_test_reg)

In [74]:
# Performance Metrics
print('Internal CV score(neg-RMSE):')
print(f'{grid_dt_reg_final.best_score_:.4f}')

# Compute regression metrics
rmse_dt = mean_squared_error(y_test_reg, y_pred_dt_reg_final, squared=False)
mae_dt = mean_absolute_error(y_test_reg, y_pred_dt_reg_final)
r2_dt = r2_score(y_test_reg, y_pred_dt_reg_final)

# Print the results
print("Root Mean Squared Error: {:.2f}".format(rmse_dt))
print("Mean Absolute Error: {:.2f}".format(mae_dt))
print("R^2 Score: {:.2f}".format(r2_gb))

Internal CV score(neg-RMSE):
-18.8140
Root Mean Squared Error: 18.60
Mean Absolute Error: 10.48
R^2 Score: 0.80


In [172]:
# Feature Importance
f_imp_dt_reg = grid_dt_reg_final.best_estimator_.named_steps['regressor'].feature_importances_
feature_importance_dt_reg = dict(zip(X_train.columns, f_imp_dt_reg))
x_vals_dt_reg = list(feature_importance_dt_reg.keys())
y_vals_dt_reg = list(feature_importance_dt_reg.values())

dt_feature_imp_reg = pd.DataFrame({
    'Features': x_vals_dt_reg,
    'Values': y_vals_dt_reg

})

fig = px.bar(
        x=x_vals_dt_reg,
        y=y_vals_dt_reg,
        title="Decision Tree - Feature Importance for Average Price Per Room",
        height=500,
        width=1000,
        labels={'y':'Importance Value', 'x': 'Feature'})
fig.update_xaxes(categoryorder='total descending')
fig.show()

In [76]:
# Create a scatter plot using Plotly
fig = px.scatter(x=y_test_reg, y=y_pred_gb_reg_final,
                 labels={'x': 'Average Price Per Room', 'y': 'Predicted Values'},
                 title='Decision Tree - Regression Model Performance of Average Price Per Room',
                 trendline='ols', trendline_color_override='red',
                 marginal_x='histogram', marginal_y='histogram',
                 height=500,
                 width=1000)
fig.show()

__SVM__

In [77]:
# Test Parameters
svr_reg_param_grid = {
    'regressor__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'regressor__degree': [2, 3, 4],
    'regressor__gamma': ['scale', 'auto'],
    'regressor__C': [0.1, 1, 10, 100],
    'regressor__epsilon': [0.1, 0.3, 0.5],
}

In [13]:
%%capture
# Model
svr_reg = Pipeline(
    steps=[("preprocessor", preprocessor_reg), ("regressor", SVR())]
)

grid_svr_reg = GridSearchCV(
    svr_reg,
    svr_reg_param_grid,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='neg_root_mean_squared_error'
)

# Fit
grid_svr_reg.fit(X_train_reg, y_train_reg)

In [14]:
# Performance Metrics
print('Internal CV score(neg-RMSE):')
print(f'{grid_svr_reg.best_score_:.4f}')

Internal CV score(neg-RMSE):
-19.9496


In [15]:
# Best Parameters Table
pd.DataFrame.from_dict(grid_svr_reg.best_params_, orient='index', columns=['Values']).reset_index().rename(columns={'index': 'Best Parameters'})

Unnamed: 0,Best Parameters,Values
0,regressor__C,100
1,regressor__degree,2
2,regressor__epsilon,0.5
3,regressor__gamma,scale
4,regressor__kernel,rbf


In [16]:
%%capture
# Best Parameters
param_grid_svr_reg_final = {
    'regressor__kernel': [grid_svr_reg.best_params_['regressor__kernel']],
    'regressor__degree': [grid_svr_reg.best_params_['regressor__degree']],
    'regressor__gamma': [grid_svr_reg.best_params_['regressor__gamma']],
    'regressor__C': [grid_svr_reg.best_params_['regressor__C']],
    'regressor__epsilon': [grid_svr_reg.best_params_['regressor__epsilon']],
}

# Model
grid_svr_reg_final = GridSearchCV(
    svr_reg,
    param_grid_svr_reg_final,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=10,
    scoring='neg_root_mean_squared_error'
)

# Fit
grid_svr_reg_final.fit(X_train_reg, y_train_reg)
y_pred_svr_reg_final = grid_svr_reg_final.predict(X_test_reg)

In [17]:
# Performance Metrics
print('Internal CV score(neg-RMSE):')
print(f'{grid_svr_reg_final.best_score_:.4f}')

# Compute regression metrics
rmse_svr = mean_squared_error(y_test_reg, y_pred_svr_reg_final, squared=False)
mae_svr = mean_absolute_error(y_test_reg, y_pred_svr_reg_final)
r2_svr = r2_score(y_test_reg, y_pred_svr_reg_final)

# Print the results
print("Root Mean Squared Error: {:.2f}".format(rmse_svr))
print("Mean Absolute Error: {:.2f}".format(mae_svr))
print("R^2 Score: {:.2f}".format(r2_svr))

Internal CV score(neg-RMSE):
-19.7780
Root Mean Squared Error: 20.13
Mean Absolute Error: 12.57
R^2 Score: 0.68


In [25]:
# Create a scatter plot using Plotly
fig = px.scatter(x=y_test_reg, y=y_pred_svr_reg_final,
                 labels={'x': 'Average Price Per Room', 'y': 'Predicted Values'},
                 title='SVM - Regression Model Performance of Average Price Per Room',
                 trendline='ols', trendline_color_override='red',
                 marginal_x='histogram', marginal_y='histogram',
                 height=500,
                 width=1000)
fig.show()

 ### Modeling and Evaluation 4
 Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.

__Task 1: Classification__

In [96]:
# All ROC Curves
trace1 = go.Scatter(
    x=fpr_gb,
    y=tpr_gb,
    mode='lines',
    name=f'Gradient Boost - AUC={roc_auc_gb:.4f}'
    )
trace2 = go.Scatter(
    x=fpr_dt,
    y=tpr_dt,
    mode='lines',
    name=f'Decision Tree - AUC={roc_auc_dt:.4f}'
    )
trace3 = go.Scatter(
    x=fpr_svm,
    y=tpr_svm,   
    mode='lines',
    name=f'SVM - AUC={roc_auc_svm:.4f}'
    )
trace4 = go.Scatter(
    x=[0, 1],
    y=[0, 1],
    mode='lines',
    name='Baseline', line=dict(dash='dot')
    )

data = [trace1, trace2, trace3, trace4]

layout = go.Layout(title='AUC-ROC - Booking Status Classification',
                   xaxis=dict(title='False Positive Rate'),
                   yaxis=dict(title='True Positive Rate'),
                   hovermode='closest',
                   height=500,
                   width=1000,
                   xaxis_range=[0, 1],
                   yaxis_range=[0, 1])


fig = go.Figure(data=data, layout=layout)

fig.show()

Our analysis indicates that Graident Boosting is the best classification method for predicting cancelations of hotel reservations. It's AUC-ROC score is 3% higher than the second best model, Decision Trees. This AUC-ROC score indicates that it was best at both predicting cancellations while minimizing false postives.

In the above plot, the findings of our analysis can be quickly consumed and understood. For those more familiar with the AUC-ROC curve, they are able to compare the prediction tradeoffs between each model. For those unfamiliar, the plot is simple enough to quickly educate someone on the important takeaways. While this same information can be found at the end of each model anlaysis, this plot aggegates the important data and is built in a manor that is easily expanable for future models. The goal of our notebook is to build out foundations and methods that allow for repeated use and iteration. This plot accomplishes that goal. For that reason, it is not only useful to the data scientist, but to the business leader as well. 

__Task 2: Regression__

In [111]:
# RMSE Plot
rmse_df = pd.DataFrame.from_dict({
    'Model': ['Gradient Boosting', 'Decision Trees', 'SVM'], 
    'RMSE': [round(rmse_gb,2), round(rmse_dt,2), round(rmse_svr,2)]
    })


fig = px.bar(
    rmse_df, 
    x='Model', 
    y='RMSE',
    height=500,
    width=1000,
    title='Model RMSE Scores - Regression of Average Price Per Room',
    text='RMSE',
    )
fig.update_xaxes(categoryorder='total ascending')
fig.show()

Our analysis indicates that Gradient Boosting was again the best model. It outperformed Decision Trees and SVM in predicting Average Price Per Room.

For a simple and straight forward comparision, this bar chart of RMSE shows which models performed the best and by how much. For a deeper understanding about each models performance, we've built out a unified residuals plot for each model. From these, we can see that each model has a fairly uniform distirbution round around 0 at all prices. Additionaly, we can see that Gradient Boosting appears to have a slightly tighter residual plot than Decision Trees and SVM. While the difference isn't significant, it's a little bit easier to see that SVM has a looser spread.

These plots allow us to dig further into the indivudal observations that appear to be a problem for each model. By doing this, we could potentially improve the models or provide discover helpful information to hotel operators.

In [122]:
# Residuals
gb_res = y_pred_gb_reg_final - y_test_reg
dt_res = y_pred_dt_reg_final - y_test_reg
svr_res = y_pred_svr_reg_final - y_test_reg

x = np.arange(0, 400, 0.1)
fig = sp.make_subplots(rows=3, cols = 1, shared_xaxes=False)

# Traces
fig.add_trace(go.Scatter(x=x, y=gb_res, mode='markers', name=f'Gradient Boosting - RMSE: {rmse_gb:.2f}'), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=dt_res, mode='markers', name=f'Decision Trees - RMSE: {rmse_dt:.2f}'), row=2, col=1)
fig.add_trace(go.Scatter(x=x, y=svr_res, mode='markers', name=f'SVM - RMSE: {rmse_svr:.2f}'), row=3, col=1)


# Set plot titles and axis labels
fig.update_layout(title='Residual Plots - Regression of Average Price Per Room',
                  xaxis_title='Average Price Per Price',
                  yaxis_title='Residuals',
                  height = 1200,
                  width = 1000)

# Show plot
fig.show()

### Modeling and Evaluation 5
Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.
_____________

For our study, the primary metrics to evaluate each model were performance. We've discussed in the earlier sections which models performed best according to our selected meterics. Additionally, because of the way we built our analysis, virtually every other element of the model building process required an equal effort to build. For that reason, we will be focusing on the resource cost of each model and the statistical differences between each model in this section.

__Classification__  
Earlier, we discussed that Gradient Boosting was performed the best for our metric AUC-ROC. However, training the model took significantly longer than our second place model, Decision Trees. SVM had the highest cost to model by far. Once the final parameters are found through GridSearchCV, each model was fairly quick to fit. However, Gradient Boosting took 10 times as long to find the best paramters than Gradient Boosting. If performance is weighted heavily in the decision making process and the model must refind the best parameters, we suggest using the very cost effective Decision Trees model.

For comparing significant differences between classification models, we usedMcNemar's test. The results found in the below table indicate that between each model, the difference appears to be significant.

In [143]:
# Contingency tables for mcnemar function
gb_cm = contingency_matrix(y_test_num, y_pred_gb_final)
dt_cm = contingency_matrix(y_test_num, y_pred_dt_final)
svm_cm = contingency_matrix(y_test_num, y_pred_svm_final)

# Correct
gb_correct = gb_cm[0][0] + gb_cm[1][1]
dt_correct = dt_cm[0][0] + dt_cm[1][1]
svm_correct = svm_cm[0][0] + svm_cm[1][1]

# Incorrect
gb_incorrect = gb_cm[0][1] + gb_cm[1][0]
dt_incorrect = dt_cm[0][1] + dt_cm[1][0]
svm_incorrect = svm_cm[0][1] + svm_cm[1][0]

contingency_table_gb_dt = np.array([[gb_correct, gb_incorrect], [dt_incorrect, dt_correct]])
contingency_table_gb_svm = np.array([[gb_correct, gb_incorrect], [svm_incorrect, svm_correct]])
contingency_table_dt_svm = np.array([[dt_correct, dt_incorrect], [svm_incorrect, svm_correct]])

result_1 = mcnemar(contingency_table_gb_dt, exact=True)
result_2 = mcnemar(contingency_table_gb_svm, exact=True)
result_3 = mcnemar(contingency_table_dt_svm, exact=True)

mcnemar_df = pd.DataFrame({
    'Model Comparison': ['Gradient Boosting vs. Decision Trees', 'Gradient Boosting vs. SVM', 'Decision Trees vs. SVM'],
    'McNemar t-statistic': [result_1.statistic, result_2.statistic, result_3.statistic],
    'McNemar p-value': [result_1.pvalue, result_2.pvalue, result_3.pvalue]
})

mcnemar_df

Unnamed: 0,Model Comparison,McNemar t-statistic,McNemar p-value
0,Gradient Boosting vs. Decision Trees,703.0,1.790481e-09
1,Gradient Boosting vs. SVM,703.0,2.292849e-21
2,Decision Trees vs. SVM,948.0,0.0005288113


__Regression__  
Our regression models were very similar in both performance and overall resource cost to their classification counterparts. Gradient Boosting performed the best in term of RMSE, but it was 20 times slower than our Decision Trees model. The SVM model was both the worst performer and took nearly 50 minutes to run through it's various hyperparameters. Once again, for those who are resource cost adverse, the Decision Trees model offers a good balance of prediction performance and resource cost requirement.

When looking at the residual differences in the below t-tests, we see no significant differences between the models. While we can see difference in our above findings, we do not have enough evidence to support the claim of signifcant differences at the 95% confidence level between any of the three models.

In [130]:
# GB to DT ttest
t_stat_gb_dt, p_value_gb_dt = stats.ttest_ind(gb_res, dt_res)
t_stat_gb_svr, p_value_gb_svr = stats.ttest_ind(gb_res, svr_res)
t_stat_dt_svr, p_value_dt_svr = stats.ttest_ind(dt_res, svr_res)

# Results
ttest_df = pd.DataFrame({
    'Model Comparison': ['Gradient Boosting vs. Decision Trees', 'Gradient Boosting vs. SVM', 'Decision Trees vs. SVM'],
    't-statistic': [round(t_stat_gb_dt,3), round(t_stat_gb_svr,3), round(t_stat_dt_svr,3)],
    'p-value': [round(p_value_gb_dt,3), round(p_value_gb_svr,3), round(p_value_dt_svr,3)]
})
ttest_df

Unnamed: 0,Model Comparison,t-statistic,p-value
0,Gradient Boosting vs. Decision Trees,0.321,0.748
1,Gradient Boosting vs. SVM,0.525,0.599
2,Decision Trees vs. SVM,0.205,0.838


### Modeling and Evaluation 6
Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
______________
__Classification__  
Given the structure of each model, we are able to directly pull feature importance out of the Gradient Boosting and Decision Trees models on the same scale.

While each model gave different weights to the features, the ranking was nearly identical. The one exception to this is [no_of_weekend_nights]. In the Decision Trees model, it was ranked 5th. In the Gradient Boosting Model, it was ranked 7th. The vast majority of the weighting was given to the top three features.

1. <font color=dodgerblue>[arrival_month]</font>
    - Without futher understanding as to which months were more important, we can't speak in detail. However, we can see that the time of year plays a strong role in the classification of the record.
    - Our hypothesis is that there is clustering happening around holidays and summer months when travel is greatest.
2. <font color=dodgerblue>[arrival_date]</font>
    - Similar to arrival_month, we can't see the specific days that are leading to strong classification weights.
    - Our hypothesis is that travel clustering around the beginning and end of months is driving the strong classification weights.
3. <font color=dodgerblue>[num_lead_time]</font>
    - Longer lead times are more likely to lead to cancelation.
    - Logically, this makes sense. The further out from the actual reservation, the more time a customer has to change their mind or have their circumstances change. 


In [170]:
gb_features_imp_new = gb_feature_imp.rename(columns={'Feature': 'Feature', 'Value': 'GB'})
dt_features_imp_new = dt_feature_imp.rename(columns={'Feature': 'Feature', 'Value': 'DT'})
feature_imp = pd.merge(gb_features_imp_new, dt_features_imp_new,on='Features').rename(columns={'Values_x': 'GB', 'Values_y': 'DT'})
feature_imp.sort_values('GB', ascending=False).reset_index(drop=True)

Unnamed: 0,Features,GB,DT
0,arrival_month,0.35498,0.422175
1,arrival_date,0.180213,0.141122
2,lead_time,0.083978,0.105747
3,type_of_meal_plan,0.075437,0.058778
4,required_car_parking_space,0.065466,0.023628
5,no_of_week_nights,0.038925,0.022252
6,no_of_weekend_nights,0.029362,0.026704
7,no_of_adults,0.027213,0.023479
8,market_segment_type,0.005583,0.004933
9,avg_price_per_room,0.005224,0.001827


The benefit to SVM, is that we get to see the categorical levels broken out by significance. While this prevents us from directly comparing the feature importance to the other models, it does give us insight into the dataset that the other two models do not.

Since the results are the same as our findings from the previous Mini-Lab, we have restated our analysis and hypthesis about the top four most important features.

1. <font color=dodgerblue>[num_no_of_special_requests]</font>
    - SVM has weighted special requests at number one and moved lead time to 5th.
    - Based on the weight of the logistic regression model, this would indicate more special requests leading to fewer cancelations.
2. <font color=dodgerblue>[cat__required_car_parking_space_0]</font>
    - Customers who do not require a parking spot are weighted much high in this model.
    - Based on the weight of the logistic regression model, this would indicate that customers who don't require parking would lead to more cancelations.
    - The distinct difference is that not requiring a parking spot is more important to the classification than requiring a parking spot.
3. <font color=dodgerblue>[cat__type_of_meal_plan_Not Selected]</font>
    - Falling just out of the top 10 for our Logistical Regression model, [cat__type_of_meal_plan_Not Selected] is ranked third in the SVM model.
    - Based on the weight of the logistic regression model, this would indicate that customers who do not select a meal plan are more likely to cancel.
    - Our hypotheis is that more series and committed customers are likely to fill out this part of the form.
4. <font color=dodgerblue>[cat__type_of_meal_plan_Meal Plan 1]</font>
    - Without knowing the specifics of what Meal Plan 1 contains, it is impossible to draw any conclusions as to the reason for this position in the ranking.
    - However, we can assume that the selection of Meal Plan 1 is likely associated with customers keeping their reservation. This is based on the Logistical Regression model.
    - Additionally, we can point out that two of the Meal Plan levels made it into the top 5. Further understanding of this feature would be beneficial for hotel operators.

Source: (Butler et al., 2023)

In [166]:
feat_imp_svc_final.sort_values('Absolute Value', ascending=False)

Unnamed: 0,Feature,Value,Absolute Value
7,num__no_of_special_requests,0.219161,0.219161
14,cat__required_car_parking_space_0,0.135169,0.135169
13,cat__type_of_meal_plan_Not Selected,0.091691,0.091691
10,cat__type_of_meal_plan_Meal Plan 1,0.091073,0.091073
8,num__lead_time,0.043637,0.043637
2,num__no_of_weekend_nights,0.036028,0.036028
0,num__no_of_adults,0.029221,0.029221
3,num__no_of_week_nights,0.028436,0.028436
9,num__avg_price_per_room,0.026327,0.026327
4,num__arrival_month,0.018303,0.018303


__Regression__  
Once again, we were only able to directly compare the feature importance of the Gradient Boosting and Decision Tree models. We were unable to pull feature importance from the SVM model for regression. Unlike the classification model, the permutation_importance() function did not work. If we used a linear kernal, we would be able to pull feature names from the model, but it would not be representative of the final model that best fit the data set.

In [174]:
gb_features_imp_reg_new = gb_feature_imp_reg.rename(columns={'Feature': 'Feature', 'Value': 'GB'})
dt_features_imp_reg_new = dt_feature_imp_reg.rename(columns={'Feature': 'Feature', 'Value': 'DT'})
feature_imp_reg = pd.merge(gb_features_imp_reg_new, dt_features_imp_reg_new, on='Features').rename(columns={'Values_x': 'GB', 'Values_y': 'DT'})
feature_imp_reg.sort_values('GB', ascending=False).reset_index(drop=True)

Unnamed: 0,Features,GB,DT
0,arrival_month,0.136616,0.105857
1,type_of_meal_plan,0.134492,0.133772
2,required_car_parking_space,0.089604,0.084013
3,no_of_adults,0.052037,0.06349
4,no_of_children,0.042887,0.019328
5,no_of_week_nights,0.04033,0.025288
6,no_of_weekend_nights,0.027085,0.017422
7,market_segment_type,0.02568,0.040755
8,lead_time,0.023255,0.011854
9,arrival_date,0.012052,0.00442


In [194]:
# Feature Importance Plot
fig = go.Figure()
fig.add_trace(go.Bar(x=feature_imp_reg['Features'],
                     y=feature_imp_reg['GB'],
                     name="GB"))
fig.add_trace(go.Bar(x=feature_imp_reg['Features'],
                     y=feature_imp_reg['DT'],
                     name="DT"))

fig.update_layout(
    title='Feature Importance - Model Comparison - Regression on Average Price Per Room',
    height=500,
    width=1200
)
fig.update_xaxes(categoryorder='total descending')

Looking at the grouped bar plot of feature importance, we can more easily see the differences between the models.

1. <font color=dodgerblue>[arrival_month]</font>
    - Gradient Boosting puts a much higher weight on this feature.
    - Our hypothesis is that high travel months are coorelated with higher room prices.
2. <font color=dodgerblue>[type_of_meal_plan]</font>
    - Without knowing what the meal plans consist of, it's hard to make any solid hypothesis'.
    - That being said, our hypthesis is that hotels with higher end meal plans or meal plans at all tend to be more expensive.
3. <font color=dodgerblue>[required_car_parking_space]</font>
    - Yet again, without knowing which level of the feature is coorelated with higher prices, it's hard to make any hypothesis'.
    - We know that parking at higher end hotels tends to be expensive. Our hypothesis is that those who require parking would tend to book cheaper hotels.

### Deployment
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would your deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
_________
__Classification__
We believe our model for classification is very useful. Not only do we provide three models with good AUC-ROC scores, we also are able to show relative feature imporatance. Even though the SVM model came at a higher performance cost and a lower AUC-ROC score, it did offer a categorical level feature importance view that we believe would be very benefitial to hotel operators. We believe these models would be benefitial to them because they can provide actionable data to better understand the factors that lead to cancelations. While their practices may not chance, they would at least be expecting a certain cancellation rate based on the feature values of their future datasets.

__Regression__
We believe the model would be best deployed as a server side resource. Some of these models require a sigificant amount of resources to optimize. We would suggest rerunning the model every quarter to see if any changes or new features would impact the best parameters or best model. In the last two years, we went through a global pandemic. This lead to many cancelations. If a parallel dataset could be gathered that tracked regional desasters, we believe that it could offer a greater understanding of outliers in the data. While we wouldn't have access to this information for future predictions, it could help us better train our existing models.

### Exceptional Work
You have free reign to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
___________________________
Throughout this analysis, we have utilizeded a common pipeline and built parameter grids for GridSearchCV(). A great deal of time and effort was put in to understand and build out a framework for easily emplementing additional models. As much as possible, we wanted to reduce repeated tasks. Each task has it's own train/test split andpipeline. Instead of remaking the categorical and numeric variable lists, we were able to build off the same base lists and remove the prediction variable as needed for the pipeline transformers. Working off the strengths of sci-kit learn, we were able to build out three classification models and three regression models with shared designs and foundations. This grately increases the speed with which we can incorporate additional models and compare their feature importance and predictive abilities.

In addition to the extra work outlined above, we've built out two additional classification models to show the ease with which new models can be incorporated as this point.

__KNN__

In [8]:
%%capture
# Test Parameter
knn_param_grid = {
    'classififer__n_neighbors': [3,5,7,9,15,25,35,45,55],
    'classififer__weights': ['uniform', 'distance'],
    'classififer__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}

# Model
knn = Pipeline(
    steps=[("preprocessor", preprocessor), ('classififer', KNeighborsClassifier())]
    )

knn_grid = GridSearchCV(
    knn,
    knn_param_grid,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='roc_auc'
    ) 

# Fit
knn_grid.fit(X_train, y_train)

In [9]:
# Performance Metrics
print('Internal CV score(AUC-ROC):')
print(f'{knn_grid.best_score_:.4f}')

Internal CV score(AUC-ROC):
0.9278


In [10]:
# Best Parameters
pd.DataFrame.from_dict(knn_grid.best_params_, orient='index', columns=['Values']).reset_index().rename(columns={'index': 'Best Paramaters'})

Unnamed: 0,Best Paramaters,Values
0,classififer__algorithm,auto
1,classififer__n_neighbors,55
2,classififer__weights,distance


In [17]:
%%capture
# Best Paramters
knn_param_grid_final = {
    'classififer__n_neighbors': [knn_grid.best_params_['classififer__n_neighbors']],
    'classififer__weights': [knn_grid.best_params_['classififer__weights']],
    'classififer__algorithm': [knn_grid.best_params_['classififer__algorithm']]
}

# Model
grid_knn_final = GridSearchCV(
    knn,
    knn_param_grid_final,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=10,
    scoring='roc_auc'
)
grid_knn_final.fit(X_train, y_train)
y_pred_knn = grid_knn_final.predict(X_test)

In [18]:
print('Internal CV score(AUC-ROC):')
print(round(grid_knn_final.best_score_,4))

# Converting to Int
y_pred_knn = np.where(y_pred_knn == 'Canceled', 1, 0)
print(classification_report(y_test_num, y_pred_knn))

Internal CV score(AUC-ROC):
0.9332
              precision    recall  f1-score   support

           0       0.89      0.92      0.91      4824
           1       0.84      0.77      0.80      2431

    accuracy                           0.87      7255
   macro avg       0.86      0.85      0.85      7255
weighted avg       0.87      0.87      0.87      7255



The KNN model with K = 55 gave us a precision of canceled bookings of 0.84 and a precision of not canceled bookings of 0.89. The recall of canceled is 0.77 and of not canceled is 0.92. This gives us an f1-score of 0.80 for canceled and 0.91 for not canceled. What is immediately apparent from these numbers is that the KNN model is better at predicting bookings that were not canceled compared to those that were canceled. The model also has an accuracy of 0.87.  After this, we one hot encoded Canceled and Not_canceled having 1 = Canceled and 0 = not canceled. This was done to be able to generate a ROC curve which is down below.  It has an area under the curve of 0.8483 with a false positive rate of 0.0756 and a true positive rate of 0.7725. 

In [20]:
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test_num, y_pred_knn)
roc_auc_knn = auc(fpr_knn, tpr_knn)
fig = px.area(
    x=fpr_knn, y=tpr_knn,
    title=f'KNN - ROC Curve - ROC Curve (AUC={roc_auc_knn:.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

__Random Forest__

In [198]:
# Random Forest Paramters
rf_param_grid = {
    'classifier__n_estimators': [10, 50, 100, 200, 300],
    'classifier__criterion': ['gini', 'entropy', 'log_loss'],
    'classifier__max_depth': [5, 10, 20, None],
    'classifier__min_samples_split': [1, 2, 5, 10, None],
    'classifier__max_features': ['sqrt', 'log2', None]
    }

In [199]:
%%capture
# Model
rf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier(random_state=110))]
)

grid_rf = GridSearchCV(
    rf,
    rf_param_grid,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='roc_auc'
)
grid_rf.fit(X_train, y_train)

In [200]:
# Best Parameters
pd.DataFrame.from_dict(grid_rf.best_params_, orient='index', columns=['Values']).reset_index().rename(columns={'index': 'Best Paramaters'})

Unnamed: 0,Best Paramaters,Values
0,classifier__criterion,entropy
1,classifier__max_depth,20
2,classifier__max_features,
3,classifier__min_samples_split,1
4,classifier__n_estimators,300


In [206]:
%%capture
# Random Forest Paramters
rf_param_grid_final = {
    'classifier__n_estimators': [grid_rf.best_params_['classifier__n_estimators']],
    'classifier__criterion': [grid_rf.best_params_['classifier__criterion']],
    'classifier__max_depth': [grid_rf.best_params_['classifier__max_depth']],
    'classifier__min_samples_split': [grid_rf.best_params_['classifier__min_samples_split']],
    'classifier__max_features': [grid_rf.best_params_['classifier__max_features']]
    }

# Model
rf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier(random_state=110))]
)

grid_rf_final = GridSearchCV(
    rf,
    rf_param_grid_final,
    verbose=False,
    n_jobs=-1,
    refit=True,
    cv=10,
    scoring='roc_auc'
)
grid_rf_final.fit(X_train, y_train)
y_pred_rf = grid_rf_final.predict(X_test)

In [207]:
# Performance Metrics
print('Internal CV score(AUC-ROC):')
print(f'{grid_rf_final.best_score_:.4f}')

# Converting to Int
y_pred_rf = np.where(y_pred_rf == 'Canceled', 1, 0)
print(classification_report(y_test_num, y_pred_rf))

Internal CV score(AUC-ROC):
0.9559
              precision    recall  f1-score   support

           0       0.91      0.94      0.93      4824
           1       0.88      0.82      0.85      2431

    accuracy                           0.90      7255
   macro avg       0.89      0.88      0.89      7255
weighted avg       0.90      0.90      0.90      7255



In [208]:
# Feature Importance
f_imp_rf = grid_rf_final.best_estimator_.named_steps['classifier'].feature_importances_
feature_importance_rf = dict(zip(X_train.columns, f_imp_rf))
x_vals_rf = list(feature_importance_rf.keys())
y_vals_rf = list(feature_importance_rf.values())

fig = px.bar(
        x=x_vals_rf,
        y=y_vals_rf,
        title="Random Forest - Feature Importance",
        height=500,
        width=1000,
        labels={'y':'Importance Value', 'x': 'Feature'})
fig.update_xaxes(categoryorder='total descending')
fig.show()

In [209]:
# ROC Curve
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test_num, y_pred_rf)
roc_auc_rf = auc(fpr_rf, tpr_rf)
fig = px.area(
    x=fpr_rf, y=tpr_rf,
    title=f'Random Forest - ROC Curve (AUC={roc_auc_rf:.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

## Sources

1. Hotel Reservations Dataset. (2023, January 4). Kaggle. https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset?resource=download
2. Butler, Derner, Holmes, & Traxler. (2023). Mini-Lab: Logistic Regression and SVMs. SMU Data Science.