# Data Notes

[Kaggle data link](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)

* **adr**
    * Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
* **reservation_status**
    * Last reservation status, one of three categories: Canceled: booking was canceled by the customer; Check-Out: customer checked in and has already departed; No-Show: customer did not check in and did inform the hotel of the reason why

* **reservation_status_date**
    * Date at which the last status was set. This variable can be used together with `reservation_status` to determine when the booking was canceled or when the customer checked out of the hotel
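
As a worked illustration of the `adr` definition above, here is a minimal sketch. The `average_daily_rate` helper and the transaction amounts are hypothetical and not taken from the dataset.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """Sum of lodging transactions divided by the total number of staying nights."""
    if staying_nights <= 0:
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights

# Hypothetical booking: three nights billed at 90, 90 and 120
adr = average_daily_rate([90.0, 90.0, 120.0], staying_nights=3)
print(adr)  # 100.0
```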

# Modeling Notes

## Modeling Procedure
1. Data EDA
2. Determine Initial Interest Variables
3. Process Data
4. Check VIF

TODO: ADR
* There is a negative value (min of -6.38) and an extreme maximum (5,400)

In [1]:
import pandas as pd
from dataclasses import dataclass
from IPython.display import HTML
import statsmodels.api as sm
import pprint
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf
import re
import seaborn as sns
import plotly.express as px
from dominance_analysis import Dominance

# 1. Exploratory Data Analysis (EDA)
* Data Type
* Unique Values
* Missing Values

Notes: 
* **ADR**: had negative values and a max of 5,400

In [2]:
df = pd.read_csv("data/raw/hotel_bookings.csv")
display(df.head())

index,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [3]:
df.shape

(119390, 32)

In [4]:
df.dtypes

hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             

In [5]:
display(df.describe())

statistic,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,agent,company,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119386.0,119390.0,119390.0,119390.0,119390.0,119390.0,103050.0,6797.0,119390.0,119390.0,119390.0,119390.0
mean,0.370416,104.011416,2016.156554,27.165173,15.798241,0.927599,2.500302,1.856403,0.10389,0.007949,0.031912,0.087118,0.137097,0.221124,86.693382,189.266735,2.321149,101.831122,0.062518,0.571363
std,0.482918,106.863097,0.707476,13.605138,8.780829,0.998613,1.908286,0.579261,0.398561,0.097436,0.175767,0.844336,1.497437,0.652306,110.774548,131.655015,17.594721,50.53579,0.245291,0.792798
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,0.0,-6.38,0.0,0.0
25%,0.0,18.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,62.0,0.0,69.29,0.0,0.0
50%,0.0,69.0,2016.0,28.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,179.0,0.0,94.575,0.0,0.0
75%,1.0,160.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,229.0,270.0,0.0,126.0,0.0,1.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,21.0,535.0,543.0,391.0,5400.0,8.0,5.0


## Check for Missing Values

In [6]:
def check_missing_proportion(df):
    # Calculate the proportion of missing values in each column
    missing_proportion = df.isnull().mean()

    # Filter out columns with no missing values (if needed)
    missing_proportion = missing_proportion[missing_proportion > 0]

    if missing_proportion.empty:
        print("No columns with missing values.")
    else:
        # Print the missing proportion for each column
        for col, prop in missing_proportion.items():
            print(f"Column '{col}' has {prop * 100:.2f}% missing values.")
            
check_missing_proportion(df=df)

Column 'children' has 0.00% missing values.
Column 'country' has 0.41% missing values.
Column 'agent' has 13.69% missing values.
Column 'company' has 94.31% missing values.
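
The `0.00%` reported for `children` is a rounding effect: the summary statistics above show 119,386 non-null values out of 119,390 rows, so 4 values are missing. A minimal sketch that reports counts next to percentages, so small gaps stay visible, could look like this. The `missing_summary` helper and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": n_missing / len(df) * 100,
    })
    return summary[summary["n_missing"] > 0]

# Toy frame standing in for the bookings data
toy = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 2.0],
    "adults": [2, 2, 1, 2],
})
summary = missing_summary(toy)
print(summary)
```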


## Check for Unique Values

In [7]:
# Function to check unique values in categorical columns
def unique_values_in_categorical_columns(df):
    # Select categorical columns (object and category data types)
    categorical_cols = df.select_dtypes(include=['object', 'category'])

    # Calculate the number of unique values in each categorical column
    unique_counts = categorical_cols.nunique()

    if unique_counts.empty:
        print("No categorical columns in the DataFrame.")
    else:
        # Print the number of unique values for each categorical column
        for col, count in unique_counts.items():
            print(f"Column '{col}' has {count} unique values.")

unique_values_in_categorical_columns(df)

Column 'hotel' has 2 unique values.
Column 'arrival_date_month' has 12 unique values.
Column 'meal' has 5 unique values.
Column 'country' has 177 unique values.
Column 'market_segment' has 8 unique values.
Column 'distribution_channel' has 5 unique values.
Column 'reserved_room_type' has 10 unique values.
Column 'assigned_room_type' has 12 unique values.
Column 'deposit_type' has 3 unique values.
Column 'customer_type' has 4 unique values.
Column 'reservation_status' has 3 unique values.
Column 'reservation_status_date' has 926 unique values.


## Display Unique Values for Categorical Variables
I'm interested in columns with 20 or fewer unique values.

In [8]:
def unique_categorical_below_threshold(df, N):
    # Select categorical columns (object and category data types)
    categorical_cols = df.select_dtypes(include=['object', 'category'])

    # Calculate the number of unique values in each categorical column
    unique_counts = categorical_cols.nunique()

    # Filter columns where the number of unique values is less than or equal to N
    below_threshold = unique_counts[unique_counts <= N]

    if below_threshold.empty:
        print(f"No categorical columns with {N} or fewer unique values.")
    else:
        # Print the unique values for each categorical column below the threshold
        for col in sorted(below_threshold.index):
            unique_vals = df[col].unique()
            print(f"Column '{col}' has {len(unique_vals)} unique values (<= {N}): {unique_vals}")
unique_categorical_below_threshold(df, N=20)

Column 'arrival_date_month' has 12 unique values (<= 20): ['July' 'August' 'September' 'October' 'November' 'December' 'January'
 'February' 'March' 'April' 'May' 'June']
Column 'assigned_room_type' has 12 unique values (<= 20): ['C' 'A' 'D' 'E' 'G' 'F' 'I' 'B' 'H' 'P' 'L' 'K']
Column 'customer_type' has 4 unique values (<= 20): ['Transient' 'Contract' 'Transient-Party' 'Group']
Column 'deposit_type' has 3 unique values (<= 20): ['No Deposit' 'Refundable' 'Non Refund']
Column 'distribution_channel' has 5 unique values (<= 20): ['Direct' 'Corporate' 'TA/TO' 'Undefined' 'GDS']
Column 'hotel' has 2 unique values (<= 20): ['Resort Hotel' 'City Hotel']
Column 'market_segment' has 8 unique values (<= 20): ['Direct' 'Corporate' 'Online TA' 'Offline TA/TO' 'Complementary' 'Groups'
 'Undefined' 'Aviation']
Column 'meal' has 5 unique values (<= 20): ['BB' 'FB' 'HB' 'SC' 'Undefined']
Column 'reservation_status' has 3 unique values (<= 20): ['Check-Out' 'Canceled' 'No-Show']
Column 'reserved_room_

## Display Unique Values for Numerical Variables
My suspicion is that some dichotomous variables (values 0 and 1) are stored as type int64.

In [9]:
# Function to count unique values in numerical columns
def unique_values_in_numerical_columns(df):
    # Select numerical columns (int and float data types)
    numerical_cols = df.select_dtypes(include=['number'])

    # Calculate the number of unique values in each numerical column
    unique_counts = numerical_cols.nunique()
    
    if unique_counts.empty:
        print("No numerical columns in the DataFrame.")
    else:
        unique_counts = unique_counts.loc[sorted(unique_counts.index)]
        # Print the number of unique values for each numerical column
        for col, count in unique_counts.items():
            print(f"Column '{col}' has {count} unique values.")

unique_values_in_numerical_columns(df)

Column 'adr' has 8879 unique values.
Column 'adults' has 14 unique values.
Column 'agent' has 333 unique values.
Column 'arrival_date_day_of_month' has 31 unique values.
Column 'arrival_date_week_number' has 53 unique values.
Column 'arrival_date_year' has 3 unique values.
Column 'babies' has 5 unique values.
Column 'booking_changes' has 21 unique values.
Column 'children' has 5 unique values.
Column 'company' has 352 unique values.
Column 'days_in_waiting_list' has 128 unique values.
Column 'is_canceled' has 2 unique values.
Column 'is_repeated_guest' has 2 unique values.
Column 'lead_time' has 479 unique values.
Column 'previous_bookings_not_canceled' has 73 unique values.
Column 'previous_cancellations' has 15 unique values.
Column 'required_car_parking_spaces' has 5 unique values.
Column 'stays_in_week_nights' has 35 unique values.
Column 'stays_in_weekend_nights' has 17 unique values.
Column 'total_of_special_requests' has 6 unique values.


In [10]:
def numerical_columns_below_threshold(df, N):
    # Select numerical columns (int and float data types)
    numerical_cols = df.select_dtypes(include=['number'])

    # Calculate the number of unique values in each numerical column
    unique_counts = numerical_cols.nunique()

    # Filter columns where the number of unique values is less than the threshold N
    below_threshold = unique_counts[unique_counts < N]

    if below_threshold.empty:
        print(f"No numerical columns with fewer than {N} unique values.")
    else:
        # Print the number of unique values for each numerical column below the threshold
        below_threshold = below_threshold.loc[sorted(below_threshold.index)]
        for col in below_threshold.index:
            unique_vals = df[col].unique()
            print(f"Column '{col}' has {len(unique_vals)} unique values (< {N}): {unique_vals}")
# Run the function
numerical_columns_below_threshold(df, N=10)

Column 'arrival_date_year' has 3 unique values (< 10): [2015 2016 2017]
Column 'babies' has 5 unique values (< 10): [ 0  1  2 10  9]
Column 'children' has 6 unique values (< 10): [ 0.  1.  2. 10.  3. nan]
Column 'is_canceled' has 2 unique values (< 10): [0 1]
Column 'is_repeated_guest' has 2 unique values (< 10): [0 1]
Column 'required_car_parking_spaces' has 5 unique values (< 10): [0 1 2 8 3]
Column 'total_of_special_requests' has 6 unique values (< 10): [0 1 3 2 4 5]
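
A more direct way to confirm which numerical columns are dichotomous is to test whether their non-missing values are exactly 0 and 1. This is a sketch only; `binary_columns` and the toy frame are hypothetical names and data.

```python
import pandas as pd

def binary_columns(df):
    """Numeric columns whose non-missing values are exactly {0, 1}."""
    numeric = df.select_dtypes(include=["number"])
    return [col for col in numeric.columns
            if set(numeric[col].dropna().unique()) == {0, 1}]

toy = pd.DataFrame({
    "is_canceled": [0, 1, 1, 0],
    "babies": [0, 1, 2, 0],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
})
flags = binary_columns(toy)
print(flags)  # ['is_canceled']
```

In the output above, `is_canceled` and `is_repeated_guest` are the two columns that would pass this check.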


In [11]:
def numerical_columns_below_threshold_crosstabs(df, dv="is_canceled", N=10):
    """dv: dependent variable"""
    # Select numerical columns (int and float data types)
    numerical_cols = df.select_dtypes(include=['number'])

    # Calculate the number of unique values in each numerical column
    unique_counts = numerical_cols.nunique()

    # Filter columns where the number of unique values is less than the threshold N
    below_threshold = unique_counts[unique_counts < N]

    if below_threshold.empty:
        print(f"No numerical columns with fewer than {N} unique values.")
    else:
        # Print a crosstab of the dependent variable against each column below the threshold
        below_threshold = below_threshold.loc[sorted(below_threshold.index)]
        for col in below_threshold.index:
            tbl = pd.crosstab(df[dv], df[col], dropna=False)
            print(f"\nColumn '{col}'\n{'='*20}\n{tbl}")
            
# Run the function
numerical_columns_below_threshold_crosstabs(df, dv="is_canceled", N=10)


Column 'arrival_date_year'
arrival_date_year   2015   2016   2017
is_canceled                           
0                  13854  36370  24942
1                   8142  20337  15745

Column 'babies'
babies          0    1   2   9   10
is_canceled                        
0            74416  735  13   1   1
1            44057  165   2   0   0

Column 'children'
children      0.0   1.0   2.0   3.0   10.0
is_canceled                               
0            69702  3294  2111    59     0
1            41094  1567  1541    17     1

Column 'is_canceled'
is_canceled      0      1
is_canceled              
0            75166      0
1                0  44224

Column 'is_repeated_guest'
is_repeated_guest      0     1
is_canceled                   
0                  71908  3258
1                  43672   552

Column 'required_car_parking_spaces'
required_car_parking_spaces      0     1   2  3  8
is_canceled                                       
0                            67750  7383  28  

In [12]:
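import os

# Save a histogram (with a marginal rug plot) for each numerical column, colored by is_canceled
def get_dist_plots_for_numerical_columns(df, out_dir="figs/dist_plots"):
    # Create the output directory if it does not exist yet
    os.makedirs(out_dir, exist_ok=True)
    # Select numerical columns (int and float data types)
    numerical_cols = df.select_dtypes(include=['number'])
    for num_col in numerical_cols:
        fig = px.histogram(df, x=num_col, color="is_canceled", marginal="rug")
        # write_image needs a static-image backend such as kaleido installed
        fig.write_image(f"{out_dir}/{num_col}.png")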

get_dist_plots_for_numerical_columns(df)

# 2. Determine Initial Interest Variables
Goal: Create a model to predict whether a customer will keep or cancel their reservation.

* Dependent Variable: "is_canceled" (need to convert to factor)
    
*Initial interest columns*
* hotel                              object - yes
* is_canceled                         int64 - yes, DV (convert to category)
* lead_time                           int64 - yes
* stays_in_weekend_nights             int64 - yes
* stays_in_week_nights                int64 - yes
* adults                              int64 - yes
* children                          float64 - Yes (missing)
* babies                              int64 - yes
* meal                               object - yes
* country                            object - yes (country)
* market_segment                     object - yes
* distribution_channel               object - yes
* is_repeated_guest                   int64 - yes (convert to category)
* previous_cancellations              int64 - yes
* previous_bookings_not_canceled      int64 - yes
* reserved_room_type                 object - yes
* assigned_room_type                 object - no, not relevant for cancelling
* booking_changes                     int64 - yes
* deposit_type                       object - yes
* agent                             float64 - no (missing)
* company                           float64 - no (missing)
* days_in_waiting_list                int64 - no
* customer_type                      object - yes
* adr                               float64 - yes (average daily rate)
* required_car_parking_spaces         int64 - no (Number of car parking spaces required by the customer)
* total_of_special_requests           int64 - yes (Number of special requests made by the customer ...)
* reservation_status                 object - no
* reservation_status_date            object - no





# 3. Process Data

In [13]:
df["is_repeated_guest"] = df["is_repeated_guest"].astype(str)

In [14]:
class LogisticRegressionParams:
    dv = "is_canceled"
    ivs_num = ["lead_time", 
           "stays_in_weekend_nights",
           "stays_in_week_nights",
           "adults",
           "children",
           "babies",
           "previous_cancellations",
           "previous_bookings_not_canceled",
           "booking_changes",
           "adr", 
           "total_of_special_requests"
          ]
    ivs_cat = ["hotel", 
               "meal",
               "country", 
               "market_segment",
               "distribution_channel", 
               "is_repeated_guest",
               "reserved_room_type",
               "deposit_type",
               "customer_type"
              ]

    convert_to_cat = [
        "is_repeated_guest"
    ]
    def __init__(self):
        pass
    
    def get_sm_formula(self):
        formula = (self.dv + " ~ " 
           + " + ".join(self.ivs_num) 
           + " + " + " + ".join([f"C({ii})" for ii in self.ivs_cat])
          )
        return formula

lrp = LogisticRegressionParams()
formula = lrp.get_sm_formula()

## Filter for fields that are of interest
If I don't think a field has practical importance for the initial model, I remove it from the data.

In [15]:
n_cols_before = df.shape[1]
df = df[[lrp.dv] + lrp.ivs_num + lrp.ivs_cat]
n_cols_after = df.shape[1]
print(f"Number of dropped columns:\t{n_cols_before - n_cols_after}")
print(f"Remaining columns:\t{n_cols_after}")

Number of dropped columns:	11
Remaining columns:	21


## Remove rows with missing values

In [16]:
n_rows_before = df.shape[0]
df = df.dropna(subset=[lrp.dv] + lrp.ivs_num + lrp.ivs_cat)
n_rows_after = df.shape[0]
print(f"Number of dropped rows:\t{n_rows_before - n_rows_after}")

Number of dropped rows:	492


## Remove outliers

* adr, values below zero and extreme values

In [17]:
print(df["adr"].nlargest(5))

48515     5400.0
111403     510.0
15083      508.0
103912     451.5
13142      450.0
Name: adr, dtype: float64


In [18]:
df["adr"].nsmallest(5)

14969   -6.38
0        0.00
1        0.00
125      0.00
167      0.00
Name: adr, dtype: float64

In [19]:
n_rows_before = df.shape[0]
df = df.loc[(df["adr"]>=0) & (df["adr"] <=5000)]
n_rows_after = df.shape[0]
print(f"Number of dropped rows:\t{n_rows_before - n_rows_after}")

Number of dropped rows:	2


## Check Categorical Observation Sizes
1. Check for variable levels with zero observations
2. Check for variable levels with <30 observations. The cutoff of 30 is arbitrary.

In [20]:
def flag_categorical_vars(df, dependent_var):
    categorical_vars = df.select_dtypes(include=['object', 'category']).columns
    flagged_vars = {}

    for var in categorical_vars:
        if var == dependent_var:
            continue

        # Create contingency table
        contingency_table = pd.crosstab(df[var], df[dependent_var])

        # Check for zero response options
        zero_response_options = (contingency_table == 0).sum().sum()
        
        # Check for observed values below 30
        observed_values_below_30 = (contingency_table < 30).sum().sum()
        
        if zero_response_options > 0 or observed_values_below_30 > 0:
            flagged_vars[var] = {
                'zero_response_options': zero_response_options,
                'observed_values_below_30': observed_values_below_30
            }
    
    return flagged_vars

dependent_var = lrp.dv
flagged_vars = flag_categorical_vars(df, dependent_var)

print("Flagged Variables:")
for var, details in flagged_vars.items():
    nuniq = df[var].nunique()
    print(f"\n{var}:")
    print("="*20)
    print(f"  Nunique Response Options: {nuniq}")
    print(f"  Zero Response Options: {details['zero_response_options']}, "
          f"({(details['zero_response_options']/nuniq) *100:0.1f}%)")
    print(f"  Observed Values Below 30: {details['observed_values_below_30']}, "
          f"({(details['observed_values_below_30']/(nuniq*2)) *100:0.1f}%)")

Flagged Variables:

country:
  Nunique Response Options: 177
  Zero Response Options: 62, (35.0%)
  Observed Values Below 30: 261, (73.7%)

distribution_channel:
  Nunique Response Options: 5
  Zero Response Options: 1, (20.0%)
  Observed Values Below 30: 2, (20.0%)

reserved_room_type:
  Nunique Response Options: 10
  Zero Response Options: 1, (10.0%)
  Observed Values Below 30: 4, (20.0%)


Based on the number of observed values below 30 and empty response options, I'm omitting the following variables.
* country

**distribution_channel**: 
Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

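One option is to merge Undefined and GDS into TA/TO. The filter in cell 23 instead drops the single Undefined row and leaves GDS as its own level.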
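
Should the merge be wanted, one way to sketch it is with `Series.replace`. The toy column below is an assumption for illustration, not the bookings data.

```python
import pandas as pd

toy = pd.DataFrame({
    "distribution_channel": ["Direct", "GDS", "TA/TO", "Undefined", "Corporate"],
})
# Map the sparse levels onto TA/TO; all other levels are left unchanged
merged = toy["distribution_channel"].replace({"Undefined": "TA/TO", "GDS": "TA/TO"})
print(merged.tolist())
# ['Direct', 'TA/TO', 'TA/TO', 'TA/TO', 'Corporate']
```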

**reserved_room_type**:
Code of room type reserved. Code is presented instead of designation for anonymity reasons.
Rows with value L or P could be dropped since those levels don't have enough responses. I feel okay about removing 8 rows instead of merging them into a group I don't understand.

In [21]:
pd.crosstab(df["distribution_channel"], df[lrp.dv])

distribution_channel,is_canceled=0,is_canceled=1
Corporate,5037,1454
Direct,11939,2543
GDS,156,37
TA/TO,57611,40118
Undefined,1,0


In [22]:
pd.crosstab(df["reserved_room_type"], df[lrp.dv])

reserved_room_type,is_canceled=0,is_canceled=1
A,52021,33578
B,750,364
C,623,308
D,13072,6101
E,4588,1909
F,2010,880
G,1320,763
H,356,245
L,4,2
P,0,2


In [23]:
n_before = df.shape[0]
# Drop the single "Undefined" channel row and the sparse room types L and P
df = df.query('distribution_channel != "Undefined" and '
              'reserved_room_type not in ["L", "P"]')
n_after = df.shape[0]
print(f"Rows dropped: {n_before - n_after}")

Rows dropped: 9


### Remove Fields
Remove country from our analysis object.

In [24]:
lrp.ivs_cat.remove("country")
df = df.drop(columns="country")

In [25]:
cat_levels = {}
for iv_cat in lrp.ivs_cat:
    uniq_vals = sorted(df[iv_cat].unique())
    cat_levels[iv_cat] = [iv_cat + "_" + xx for xx in uniq_vals[1:]]
pprint.pprint(cat_levels)

{'customer_type': ['customer_type_Group',
                   'customer_type_Transient',
                   'customer_type_Transient-Party'],
 'deposit_type': ['deposit_type_Non Refund', 'deposit_type_Refundable'],
 'distribution_channel': ['distribution_channel_Direct',
                          'distribution_channel_GDS',
                          'distribution_channel_TA/TO'],
 'hotel': ['hotel_Resort Hotel'],
 'is_repeated_guest': ['is_repeated_guest_1'],
 'market_segment': ['market_segment_Complementary',
                    'market_segment_Corporate',
                    'market_segment_Direct',
                    'market_segment_Groups',
                    'market_segment_Offline TA/TO',
                    'market_segment_Online TA'],
 'meal': ['meal_FB', 'meal_HB', 'meal_SC', 'meal_Undefined'],
 'reserved_room_type': ['reserved_room_type_B',
                        'reserved_room_type_C',
                        'reserved_room_type_D',
                        'reserved_room_t

# 4. Check VIF
Remove covariates with VIF > 10. Thresholds of 10, 5, and 2.5 all appear in the literature; 10 is the most lenient of the three.
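
For intuition, the textbook VIF of a covariate is 1 / (1 - R²), where R² comes from regressing that covariate on all the others. The sketch below applies that definition to synthetic data with NumPy; the `vif` helper and the simulated columns are assumptions for illustration, and it is not the routine used in the cells that follow.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)             # independent of x1
x3 = x1 + 0.1 * rng.normal(size=500)  # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

The independent column should land near 1, while the two nearly collinear columns should sit far above the threshold of 10.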

In [26]:
def extract_categorical_model(val):
    """Extracts market_segment from C(market_segment)[T.Complementary]"""
    if val.startswith("C("):
        # Raw string avoids invalid-escape warnings for \( and \)
        val = re.search(r'^C\((.*)\).*', val).group(1)
    return val
# print(extract_categorical_model('C(meal)[T.SC]'))
# print(extract_categorical_model('stays_in_weekend_nights'))

def vif_from_R(model):
    """
    model: glm statsmodel
    """
    full_terms = model.model.data.design_info.column_names
    v = model.cov_params()

    # drop Intercept term
    if full_terms[0] == "Intercept":
        full_terms = full_terms[1:]
        v = v.iloc[1:, 1:]

    # C(meal)[T.FB], C(meal)[T.HB] becomes "meal"
    full_terms = [extract_categorical_model(ii) for ii in full_terms]
    # Keep unique term names while preserving their order
    terms = list(dict.fromkeys(full_terms))
    n_terms = len(terms)

    R = sm.stats.moment_helpers.cov2corr(v)
    # Convert for easier filtering
    R = pd.DataFrame(R)
    R.index = full_terms
    R.columns = full_terms

    detR = np.linalg.det(R)
    result = pd.DataFrame(np.zeros((n_terms, 3)))
    result.index = terms
    result.columns = ["GVIF", "Df", "GVIF^(1/(2*Df))"]
    for term in terms:
        result.loc[term, "GVIF"] = np.linalg.det(R.loc[[term], [term]]) * np.linalg.det(R.loc[R.index!=term, R.columns!=term])/detR
        result.loc[term, "Df"] = len([ii for ii in full_terms if ii == term])

    result.loc[:, "GVIF^(1/(2*Df))"] = result.loc[:,"GVIF"]**(1/(2 * result.loc[:,"Df"]))
    return result

### Check VIF with Python Libraries

In [27]:
# Independent variables (without the target variable 'is_canceled')
X = df[lrp.ivs_cat + lrp.ivs_num]

# Fit logistic regression model using formula
formula = lrp.get_sm_formula()
model = smf.logit(formula, data=df).fit()

# Check for VIF
# Get the design matrix
X = model.model.exog

# Calculate VIF
vif_data = pd.DataFrame()
vif_data["feature"] = model.model.exog_names
vif_data["VIF"] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]

print(vif_data)

Optimization terminated successfully.
         Current function value: 0.448171
         Iterations 9
                                feature         VIF
0                             Intercept  550.359159
1              C(hotel)[T.Resort Hotel]    1.527325
2                         C(meal)[T.FB]    1.066968
3                         C(meal)[T.HB]    1.218659
4                         C(meal)[T.SC]    1.252856
5                  C(meal)[T.Undefined]    1.083971
6    C(market_segment)[T.Complementary]    4.569303
7        C(market_segment)[T.Corporate]   21.863427
8           C(market_segment)[T.Direct]   57.176447
9           C(market_segment)[T.Groups]   79.387985
10   C(market_segment)[T.Offline TA/TO]   92.525448
11       C(market_segment)[T.Online TA]  142.159601
12    C(distribution_channel)[T.Direct]   10.712941
13       C(distribution_channel)[T.GDS]    1.110227
14     C(distribution_channel)[T.TA/TO]    9.264862
15            C(is_repeated_guest)[T.1]    1.363222
16           C

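### Check VIF with R libraries
R's grouped (generalized) VIF treats all dummy columns of a categorical variable as one term. `vif_from_R`, defined above, reproduces that calculation in Python.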
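
The ratio computed inside `vif_from_R` is det(R11) * det(R22) / det(R), where R is the correlation matrix of the coefficient estimates, R11 is the block for one term's columns, and R22 is the block for everything else. The sketch below evaluates that ratio on a small hypothetical correlation matrix; the `gvif` helper and the matrix values are assumptions for illustration.

```python
import numpy as np

def gvif(R, idx):
    """Generalized VIF for the coefficients at positions idx of correlation matrix R."""
    idx = np.asarray(idx)
    rest = np.setdiff1d(np.arange(R.shape[0]), idx)
    R11 = R[np.ix_(idx, idx)]
    R22 = R[np.ix_(rest, rest)]
    return np.linalg.det(R11) * np.linalg.det(R22) / np.linalg.det(R)

# Hypothetical correlation matrix: coefficients 0 and 1 are the two dummy
# columns of one categorical term, coefficient 2 is a separate numeric term
R = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.3],
    [0.3, 0.3, 1.0],
])
g = gvif(R, [0, 1])
df_term = 2
# GVIF^(1/(2*Df)) makes terms with different Df comparable
print(g, g ** (1 / (2 * df_term)))
```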

In [28]:
vif_from_R(model=model)

term,GVIF,Df,GVIF^(1/(2*Df))
hotel,1.473928,1.0,1.214054
meal,1.665499,4.0,1.065842
market_segment,67.661734,6.0,1.420786
distribution_channel,24.334875,3.0,1.702308
is_repeated_guest,1.319138,1.0,1.148537
reserved_room_type,3.268948,7.0,1.088287
deposit_type,1.085992,2.0,1.020838
customer_type,2.238043,3.0,1.143698
lead_time,1.292819,1.0,1.137022
stays_in_weekend_nights,1.38628,1.0,1.177404


In [29]:
lrp.ivs_cat.remove("market_segment")
formula = lrp.get_sm_formula()
# Fit logistic regression model using formula
model = smf.logit(formula, data=df).fit()
vif_from_R(model=model)

Optimization terminated successfully.
         Current function value: 0.459878
         Iterations 9


term,GVIF,Df,GVIF^(1/(2*Df))
hotel,1.45258,1.0,1.20523
meal,1.503974,4.0,1.052337
distribution_channel,1.223919,3.0,1.03425
is_repeated_guest,1.312567,1.0,1.145673
reserved_room_type,3.171434,7.0,1.085935
deposit_type,1.02326,2.0,1.005765
customer_type,1.511278,3.0,1.07125
lead_time,1.263776,1.0,1.124178
stays_in_weekend_nights,1.397292,1.0,1.182071
stays_in_week_nights,1.512113,1.0,1.22968


# 5. Full Model

In [30]:
# Fit the logistic regression model
formula = lrp.get_sm_formula() 
model = sm.Logit.from_formula(formula, data=df).fit()
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.459878
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:            is_canceled   No. Observations:               118887
Model:                          Logit   Df Residuals:                   118854
Method:                           MLE   Df Model:                           32
Date:                Mon, 14 Oct 2024   Pseudo R-squ.:                  0.3029
Time:                        21:20:16   Log-Likelihood:                -54673.
converged:                       True   LL-Null:                       -78426.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                          coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                              -3.1034      0.071 

In [39]:
# Odds Ratios
np.exp(model.params)

Intercept                                0.044895
C(hotel)[T.Resort Hotel]                 0.942544
C(meal)[T.FB]                            1.428275
C(meal)[T.HB]                            0.698070
C(meal)[T.SC]                            1.492876
C(meal)[T.Undefined]                     0.518868
C(distribution_channel)[T.Direct]        0.541836
C(distribution_channel)[T.GDS]           0.517647
C(distribution_channel)[T.TA/TO]         1.324022
C(is_repeated_guest)[T.1]                0.472128
C(reserved_room_type)[T.B]               1.502190
C(reserved_room_type)[T.C]               0.976877
C(reserved_room_type)[T.D]               1.040394
C(reserved_room_type)[T.E]               1.024588
C(reserved_room_type)[T.F]               0.702419
C(reserved_room_type)[T.G]               0.753926
C(reserved_room_type)[T.H]               0.775641
C(deposit_type)[T.Non Refund]          170.758982
C(deposit_type)[T.Refundable]            1.492579
C(customer_type)[T.Group]                1.023402
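
To read these values, an odds ratio can be converted to a percent change in the odds of cancellation relative to the reference level, holding the other covariates fixed. This is a small sketch: `pct_change_in_odds` is a hypothetical helper, and the two inputs are copied from the output above.

```python
def pct_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1) * 100

# Values copied from the odds-ratio output above
repeat_guest = pct_change_in_odds(0.472128)  # vs. non-repeated guests
ta_to = pct_change_in_odds(1.324022)         # vs. the Corporate channel
print(f"{repeat_guest:.1f}%")  # -52.8%
print(f"{ta_to:.1f}%")         # 32.4%
```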


# 6. AIC Removal
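
AIC is defined as 2k - 2·ln(L), where k is the number of estimated parameters and ln(L) is the log-likelihood; lower is better. As a rough check, the sketch below plugs in the full-model summary values. The `aic` helper is hypothetical, and the log-likelihood is the rounded figure printed in the summary.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# From the full-model summary: log-likelihood of about -54673 and
# Df Model = 32, plus one intercept parameter
full_model_aic = aic(-54673, 32 + 1)
print(full_model_aic)  # 109412
```

This agrees with the starting AIC of 109,413 in the output below up to the rounding of the printed log-likelihood.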

In [31]:
# Backward elimination based on AIC. Each term is dropped as a whole, so a
# categorical variable wrapped in C(...) is removed together with all of its levels.
def stepwise_aic_reduction(data, dv: str, ivs: list):
    current_ivs = list(ivs)  # copy so the caller's list is never mutated
    current_formula = dv + " ~ " + " + ".join(current_ivs)
    improvement = True

    while improvement:
        # Fit the logistic regression model on the current set of terms
        model = sm.Logit.from_formula(current_formula, data=data).fit(disp=0)
        current_aic = model.aic
        print(f"Starting AIC: {current_aic: ,.0f}")

        # Reset the flag; it is set again only if some removal lowers the AIC
        improvement = False

        # Track the lowest AIC seen this round and the term whose removal produced it
        best_aic_this_round = current_aic
        worst_feature = None

        # Try removing each continuous or categorical variable as a whole
        for feature in current_ivs:
            reduced_ivs = [ii for ii in current_ivs if ii != feature]
            # Fall back to an intercept-only model if no terms remain
            reduced_formula = dv + " ~ " + (" + ".join(reduced_ivs) or "1")

            # Fit the reduced model
            reduced_model = sm.Logit.from_formula(reduced_formula, data=data).fit(disp=0)
            reduced_aic = reduced_model.aic
            print(f"{feature} \t AIC: {reduced_aic: ,.0f}")

            # If the reduced model has a lower AIC, track this variable for removal
            if reduced_aic < best_aic_this_round:
                best_aic_this_round = reduced_aic
                worst_feature = feature
                improvement = True

        # If we found a variable to remove that improves the AIC, update the formula
        if improvement:
            print(f"Removing '{worst_feature}' reduced AIC from {current_aic: ,.0f} to {best_aic_this_round: ,.0f}")
            # Filter current_ivs (not the original ivs) so earlier removals are kept
            current_ivs = [ii for ii in current_ivs if ii != worst_feature]
            current_formula = dv + " ~ " + (" + ".join(current_ivs) or "1")

    # Return the final model after AIC-based reduction
    final_model = sm.Logit.from_formula(current_formula, data=data).fit(disp=0)
    return final_model

# Wrap categorical variables in C(...) so the formula treats each as a single term
model_ivs = lrp.ivs_num + [f"C({ii})" for ii in lrp.ivs_cat]
# Perform AIC-based stepwise reduction
final_model = stepwise_aic_reduction(df, dv=lrp.dv, ivs=model_ivs)

# Print final model summary
print(final_model.summary())

Starting AIC:  109,413
lead_time 	 AIC:  111,222
stays_in_weekend_nights 	 AIC:  109,437
stays_in_week_nights 	 AIC:  109,537
adults 	 AIC:  109,460
children 	 AIC:  109,480
babies 	 AIC:  109,412
previous_cancellations 	 AIC:  113,075
previous_bookings_not_canceled 	 AIC:  110,789
booking_changes 	 AIC:  110,224
adr 	 AIC:  110,429
total_of_special_requests 	 AIC:  112,927
C(hotel) 	 AIC:  109,421
C(meal) 	 AIC:  109,939
C(distribution_channel) 	 AIC:  110,669
C(is_repeated_guest) 	 AIC:  109,504
C(reserved_room_type) 	 AIC:  109,502
C(deposit_type) 	 AIC:  120,862
C(customer_type) 	 AIC:  111,283
Removing 'babies' reduced AIC from  109,413 to  109,412
Starting AIC:  109,412
lead_time 	 AIC:  111,220
stays_in_weekend_nights 	 AIC:  109,436
stays_in_week_nights 	 AIC:  109,536
adults 	 AIC:  109,459
children 	 AIC:  109,479
previous_cancellations 	 AIC:  113,075
previous_bookings_not_canceled 	 AIC:  110,788
booking_changes 	 AIC:  110,223
adr 	 AIC:  110,428
total_of_special_requests 
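
The selection rule behind the log above can be illustrated without refitting any models. This is a minimal sketch, assuming the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and ln(L) is the maximized log-likelihood. The helper names `aic` and `pick_term_to_drop` are hypothetical and not part of the notebook; the AIC values are a subset of the rounded figures printed in the first round.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def pick_term_to_drop(current_aic: float, aic_without: dict):
    """Return the term whose removal yields the lowest AIC,
    or None if no removal beats the current AIC."""
    best_term, best_aic = None, current_aic
    for term, value in aic_without.items():
        if value < best_aic:
            best_term, best_aic = term, value
    return best_term


# Rounded AIC values copied from the first round of the log (subset of terms)
first_round = {
    "lead_time": 111_222,
    "babies": 109_412,
    "C(hotel)": 109_421,
    "C(deposit_type)": 120_862,
}
print(pick_term_to_drop(109_413, first_round))  # babies
```

Removing a term with m parameters lowers the 2k penalty by 2m, so AIC only favors the removal when the log-likelihood falls by less than m. That is why `babies` is dropped despite a gain of only about one AIC point.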

# 7. Dominance Analysis

In [32]:
# Dominance Prep
keep_cols = [lrp.dv] + lrp.ivs_num + lrp.ivs_cat
df_encoded = pd.get_dummies(df[keep_cols], columns=lrp.ivs_cat, drop_first=True)
print(df_encoded.columns)
ivs_features = [ii for ii in df_encoded.columns if ii != lrp.dv]
print(ivs_features)

Index(['is_canceled', 'lead_time', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies',
       'previous_cancellations', 'previous_bookings_not_canceled',
       'booking_changes', 'adr', 'total_of_special_requests',
       'hotel_Resort Hotel', 'meal_FB', 'meal_HB', 'meal_SC', 'meal_Undefined',
       'distribution_channel_Direct', 'distribution_channel_GDS',
       'distribution_channel_TA/TO', 'is_repeated_guest_1',
       'reserved_room_type_B', 'reserved_room_type_C', 'reserved_room_type_D',
       'reserved_room_type_E', 'reserved_room_type_F', 'reserved_room_type_G',
       'reserved_room_type_H', 'deposit_type_Non Refund',
       'deposit_type_Refundable', 'customer_type_Group',
       'customer_type_Transient', 'customer_type_Transient-Party'],
      dtype='object')
['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', '

In [33]:
# Drop rows with a negative ADR and report how many were removed
nrow_before = df_encoded.shape[0]
df_encoded = df_encoded.loc[~(df_encoded["adr"] < 0)]
nrow_after = df_encoded.shape[0]
print(f"The following number of rows were dropped: {nrow_before - nrow_after}")

The following number of rows were dropped: 0


In [34]:
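# objective=0 runs the analysis as classification (is_canceled is binary);
# top_k=10 keeps the 10 best predictors
dominance_classification = Dominance(data=df_encoded, target='is_canceled', top_k=10, objective=0)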

Selecting 10 Best Predictors for the Model
Selected Predictors :  ['lead_time', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'adr', 'total_of_special_requests', 'hotel_Resort Hotel', 'distribution_channel_Direct', 'deposit_type_Non Refund', 'customer_type_Transient-Party']

********************  Pseudo R-Squared of Complete Model :  ********************

MacFadden's R-Squared : 0.2908269160874535 
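
For readers unfamiliar with the reported statistic, here is a minimal sketch of McFadden's pseudo R-squared, assuming its usual definition 1 - ln(L_model) / ln(L_null), where the null model contains only an intercept. The function name and the log-likelihood values are hypothetical, chosen only to show the arithmetic; they are not taken from this notebook.

```python
def mcfadden_r2(llf_model: float, llf_null: float) -> float:
    """McFadden's pseudo R-squared from two log-likelihoods.

    llf_model: log-likelihood of the fitted model
    llf_null:  log-likelihood of the intercept-only model
    """
    return 1.0 - llf_model / llf_null


# Hypothetical log-likelihoods, for illustration only
example = mcfadden_r2(llf_model=-70.0, llf_null=-100.0)
print(round(example, 2))  # 0.3
```

For a fitted statsmodels `Logit` result, the same quantity is available as the `prsquared` attribute.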



In [35]:
incr_variable_rsquare = dominance_classification.incremental_rsquare()

Selecting 10 Best Predictors for the Model
Selected Predictors :  ['lead_time', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'adr', 'total_of_special_requests', 'hotel_Resort Hotel', 'distribution_channel_Direct', 'deposit_type_Non Refund', 'customer_type_Transient-Party']

Creating models for 1023 possible combinations of 10 features :


100%|███████████████████████████████████████████| 10/10 [06:44<00:00, 40.48s/it]


#########################  Model Training Done!!!!!  #########################

#########################  Calculating Variable Dominances  #########################


100%|███████████████████████████████████████████████| 9/9 [00:00<00:00, 28.44it/s]

#########################  Variable Dominance Calculation Done!!!!!  #########################
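
The count of 1023 models in the log is the number of non-empty subsets of 10 predictors (2^10 - 1). The sketch below reproduces that count and shows one common way to compute a general dominance score: average each predictor's incremental fit within every subset size, then average across sizes. It is an illustration under stated assumptions, not necessarily what the `dominance_analysis` package does internally; `nonempty_subsets`, `general_dominance`, and the additive toy fit function are all hypothetical.

```python
from itertools import combinations


def nonempty_subsets(features):
    """Yield every non-empty subset of the given features as a tuple."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def general_dominance(features, fit_stat):
    """Average incremental contribution of each feature.

    fit_stat maps a set of features to a fit statistic (e.g. a pseudo R-squared)
    and must return 0 for the empty set.
    """
    scores = {}
    for feat in features:
        others = [f for f in features if f != feat]
        per_size_means = []
        for size in range(len(others) + 1):
            # Gain from adding `feat` to every subset of the other features of this size
            gains = [
                fit_stat(set(sub) | {feat}) - fit_stat(set(sub))
                for sub in combinations(others, size)
            ]
            per_size_means.append(sum(gains) / len(gains))
        scores[feat] = sum(per_size_means) / len(per_size_means)
    return scores


# 10 predictors give 2**10 - 1 candidate models, matching the log
n_models = sum(1 for _ in nonempty_subsets(list(range(10))))
print(n_models)  # 1023

# Toy additive fit statistic (hypothetical weights, not results from this notebook)
weights = {"a": 0.20, "b": 0.10, "c": 0.05}
toy_scores = general_dominance(list(weights), lambda s: sum(weights[f] for f in s))
```

With a purely additive fit statistic, each predictor's score equals its own weight, and the scores always sum to the fit of the full model. That additivity of the scores is what makes the incremental R-squared values comparable across predictors.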






In [36]:
dominance_classification.plot_incremental_rsquare()