# Hotel Facilito

## ¿Quién va a cancelar?

*Hotel Facilito* se está preparando para las vacaciones de verano, sin embargo, les preocupa que muchos de sus clientes cancelan de ultima hora, dejándoles con habitaciones vacías – sucede que algunos clientes simplemente no se presentan y, a pesar de que a veces se cobra una cuota de reservación, la gran mayoría de las ganancias se obtiene cuando los huéspedes pagan el resto al ocupar la habitación. 

Usando sus datos, les gustaría que les ayudaras a identificar a aquellos clientes que tienen más posibilidad de cancelar ya que les gustaría darles seguimiento para que si en caso de que requieran cancelar, se haga con la mayor antelación posible.

![](./header.png)

## Datos

*Hoteles Facilito* tiene dos sucursales, uno ubicado en la capital del estado, "City Hotel" y otro en una comunidad cercana a la costa, "Resort Hotel".

Los datos que te ha enviado están en formato CSV, en donde cada línea representa una reservación con los siguientes atributos:

  - `hotel`: Reservation hotel.
  - `is_canceled`: Indicates whether the reservation was canceled or not
  - `lead_time`: Number of days that elapsed between the entering date of the booking into the PMS and the arrival date.
  - `arrival_date_year`: Year of arrival date.
  - `arrival_date_month`: Month of arrival date with 12 categories: “January” to “December”.
  - `arrival_date_week_number`: Week number of the arrival date.
  - `arrival_date_day_of_month`: Day of the month of the arrival date.
  - `stays_in_weekend_nights`: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel.
  - `stays_in_week_nights`: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel.
  - `adults`: Number of adults.
  - `children`: Number of children.
  - `babies`: Number of babies.
  - `meal`: Type of meal booked. Categorical value.
  - `country`: Country of origin. Categories are represented in ISO 3155–3:2013.
  - `market_segment`: Market segment designation.
  - `distribution_channel`: Booking distribution channel.
  - `is_repeated_guest`: Value indicating if the booking name was from a repeated guest (1) or not (0)
  - `previous_cancellations`: Number of previous bookings that were cancelled by the customer prior to the current booking.
  - `previous_bookings_not_canceled`: Number of previous bookings not cancelled by the customer prior to the current booking.
  - `reserved_room_type`: Code of room type reserved. Code is presented instead of designation for anonymity reasons.
  - `assigned_room_type`: Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
  - `booking_changes`: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.
  - `deposit_type`: Indication on if the customer made a deposit to guarantee the booking. 
  - `agent`: ID of the travel agency that made the booking.
  - `company`: ID of the company/entity that made the booking or responsible for paying the booking.
  - `days_in_waiting_list`: Number of days the booking was in the waiting list before it was confirmed to the customer.
  - `customer_type`: Type of booking.
  - `adr`: Average Daily Rate.
  - `required_car_parking_spaces`: Number of car parking spaces required by the customer.
  - `total_of_special_requests`: Number of special requests made by the customer (e.g. twin bed or high floor).
  - `reservation_status`: Reservation last status.
  - `reservation_status_date`: Date at which the last status was set.
  - `name`: Customer name.
  - `email`: Customer email.
  - `phone`: Customer phone.
  - `credit_card`: Last four digits of the customer credit card.

Los datos que te ha enviado la compañía están en el archivo `hotel_bookings.csv`.

> En realidad los datos provienen de [este dataset de Kaggle](https://www.kaggle.com/datasets/mojtaba142/hotel-booking), y puedes consultar más sobre el origen de los datos [en esta publicación](https://www.sciencedirect.com/science/article/pii/S2352340918315191)

## ¿Qué métricas podemos medir? 

¿Accuracy? pero, y ¿si nos interesa otra cosa?

Nos interesa hallar a TODAS las personas que potencialmente pueden cancelar, y la verdad es que a nuestros clientes no les molesta mucho si les llamamos para confirmar su reserva.

## Ejercicio

In [1]:
import pandas as pd

In [2]:
hotel_bookings = pd.read_csv("hotel_bookings_training.csv")

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
hotel_bookings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119190 entries, 0 to 119189
Data columns (total 36 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119190 non-null  object 
 1   is_canceled                     119190 non-null  int64  
 2   lead_time                       119190 non-null  int64  
 3   arrival_date_year               119190 non-null  int64  
 4   arrival_date_month              119190 non-null  object 
 5   arrival_date_week_number        119190 non-null  int64  
 6   arrival_date_day_of_month       119190 non-null  int64  
 7   stays_in_weekend_nights         119190 non-null  int64  
 8   stays_in_week_nights            119190 non-null  int64  
 9   adults                          119190 non-null  int64  
 10  children                        119186 non-null  float64
 11  babies                          119190 non-null  int64  
 12  meal            

### Sobre los datos personales...

Nos deshacemos de los datos personales – que además carecen de utilidad por ser únicos:

In [5]:
# Remove personal information of customers
hotel_bookings = hotel_bookings.drop(['name', 'email', 'phone-number', 'credit_card'], axis=1)

In [6]:
hotel_bookings.sample(10)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
35260,Resort Hotel,1,277,2015,October,41,5,4,10,2,...,Non Refund,273.0,,0,Transient,78.5,0,0,Canceled,2015-10-02
20740,Resort Hotel,0,22,2016,September,40,30,0,2,2,...,No Deposit,240.0,,0,Transient,110.0,1,2,Check-Out,2016-10-02
103459,City Hotel,0,47,2016,April,17,19,0,2,2,...,No Deposit,9.0,,0,Transient,105.3,0,1,Check-Out,2016-04-21
29841,City Hotel,0,48,2016,February,9,23,0,2,2,...,No Deposit,9.0,,0,Transient,89.3,1,1,Check-Out,2016-02-25
50704,Resort Hotel,0,185,2015,September,39,21,1,1,2,...,No Deposit,208.0,,0,Transient-Party,90.0,0,0,Check-Out,2015-09-23
46328,Resort Hotel,0,132,2016,May,23,29,2,5,2,...,No Deposit,243.0,,0,Contract,62.9,0,1,Check-Out,2016-06-05
7461,City Hotel,0,18,2017,August,31,5,0,1,3,...,No Deposit,7.0,,0,Transient,241.48,0,2,Check-Out,2017-08-06
77628,City Hotel,1,31,2017,July,30,28,0,1,2,...,No Deposit,7.0,,0,Transient,112.0,0,1,Canceled,2017-06-28
11569,Resort Hotel,0,15,2016,August,32,4,0,3,3,...,No Deposit,241.0,,0,Transient,195.58,0,1,Check-Out,2016-08-07
7227,City Hotel,1,35,2016,February,7,13,1,1,1,...,No Deposit,9.0,,0,Transient,72.9,0,1,Canceled,2016-01-23


## EDA

In [7]:
from ydata_profiling import ProfileReport

In [8]:
profile = ProfileReport(hotel_bookings, title="Pandas Profiling Report")

In [9]:
# profile.to_file("bookings_profile.html")
## file:///Users/antonio.feregrino/hub/cf-ml/hotel-facilito/bookings_profile.html

### Custom plots ...

## Data leakage

La variable `reservation_status` indica el estado de la reservación, esta es en realidad un reflejo de `is_canceled`, si la dejamos dentro de las variables de entrada estamos filtrando información al modelo. Este nos dará un resultado excelente, pero en realidad no nos servirá de nada en la vida real. Debemos sacarla del dataset junto con otras varaibles asociadas:

In [10]:
# Avoid data leakage
hotel_bookings = hotel_bookings.drop(['reservation_status', 'reservation_status_date'], axis=1)

## Separa la variable a predecir


In [11]:
is_canceled = hotel_bookings['is_canceled'].copy()
hotel_data = hotel_bookings.drop(['is_canceled'], axis=1)

## Split dataset

In [12]:
# Calculate test and validation set size:
original_count = len(hotel_bookings)
training_size = 0.60 # 60% of records
test_size = (1 - training_size) / 2


training_count = int(original_count * training_size)
test_count = int(original_count * test_size)
validation_count = original_count - training_count - test_count

print(training_count, test_count, validation_count, original_count)

71514 23838 23838 119190


In [13]:
from sklearn.model_selection import train_test_split

train_x, rest_x, train_y, rest_y = train_test_split(hotel_data, is_canceled, train_size=training_count)
test_x, validate_x, test_y, validate_y = train_test_split(rest_x, rest_y, train_size=test_count)

print(len(train_x), len(test_x), len(validate_x))

71514 23838 23838


## Variables a codificar - One-hot encoding

 - hotel, meal, distribution_channel, reserved_room_type, assigned_room_type, customer_type, deposit_type

In [14]:
from sklearn.preprocessing import OneHotEncoder

In [15]:
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

In [16]:
one_hot_encoder.fit(train_x[['hotel']])
one_hot_encoder.transform(train_x[['hotel']])

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]])

## Variables a binarizar

 - total_of_special_requests, required_car_parking_spaces, booking_changes, previous_bookings_not_canceled, previous_cancellations

In [17]:
from sklearn.preprocessing import Binarizer

In [18]:
binarizer = Binarizer()

In [19]:
_ = train_x.copy()
binarizer.fit(_[['total_of_special_requests']])
_['has_made_special_requests'] = binarizer.transform(train_x[['total_of_special_requests']])

_[['total_of_special_requests', 'has_made_special_requests']].sample(10)

Unnamed: 0,total_of_special_requests,has_made_special_requests
110362,4,1
116663,1,1
1458,0,0
50904,1,1
34560,0,0
65570,0,0
11599,1,1
37363,0,0
52496,1,1
21766,0,0


## Variables a escalar

 - adr

In [20]:
from sklearn.preprocessing import RobustScaler

In [21]:
scaler = RobustScaler()

In [22]:
_ = train_x.copy()
scaler.fit(_[['adr']])
_['adr_scaled'] = scaler.transform(train_x[['adr']])

_[['adr', 'adr_scaled']].sample(10)

Unnamed: 0,adr,adr_scaled
12594,192.67,1.722271
86422,62.0,-0.581908
60207,68.0,-0.476107
91153,62.0,-0.581908
88804,84.0,-0.193969
33536,229.0,2.362899
84466,244.0,2.627403
85041,90.0,-0.088168
71935,102.7,0.135779
95734,85.0,-0.176336


## Variables a dejar como tal

 - stays_in_weekend_nights, stays_in_week_nights
 
 
 > El tratamiento de estas depende del modelo a usar

## Armando un pipeline de transformación

In [23]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

In [24]:
one_hot_encoding = ColumnTransformer([
    (
        'one_hot_encode',
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        [
            "hotel",
            "meal", 
            "distribution_channel", 
            "reserved_room_type", 
            "assigned_room_type", 
            "customer_type"
        ]
    )
])

In [25]:
binarizer = ColumnTransformer([
    (
        'binarizer',
        Binarizer(),
        [
            "total_of_special_requests", 
            "required_car_parking_spaces", 
            "booking_changes", 
            "previous_bookings_not_canceled", 
            "previous_cancellations",
        ]
    )
])
    
one_hot_binarized = Pipeline([
    ("binarizer", binarizer),
    ("one_hot_encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
])

In [26]:
scaler = ColumnTransformer([
    ("scaler", RobustScaler(), ["adr"])
])

In [27]:
passthrough = ColumnTransformer([
    (
        "passthrough",
        "passthrough",
        [
            "stays_in_week_nights",
            "stays_in_weekend_nights",
        ]
    )
])

In [28]:
feature_engineering_pipeline = pipe = Pipeline(
    [
        (
            "features",
            FeatureUnion(
                [
                    ("categorical", one_hot_encoding),
                    ("categorical_binarized", one_hot_binarized),
                    ("scaled", scaler),
                    ("pass", passthrough),
                ]
            ),
        )
    ]
)

In [29]:
transformed = feature_engineering_pipeline.fit_transform(train_x)
transformed.shape

(71514, 51)

In [30]:
transformed

array([[ 1.        ,  0.        ,  0.        , ..., -1.36483865,
         3.        ,  1.        ],
       [ 1.        ,  0.        ,  1.        , ..., -0.32622113,
         2.        ,  1.        ],
       [ 1.        ,  0.        ,  1.        , ..., -0.25921354,
         1.        ,  2.        ],
       ...,
       [ 1.        ,  0.        ,  1.        , ...,  0.26450361,
         5.        ,  1.        ],
       [ 1.        ,  0.        ,  1.        , ..., -0.7582437 ,
         2.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        , ...,  0.35267149,
         2.        ,  0.        ]])

## Model training

In [31]:
# Get a fresh copy of the pipeline
from sklearn.base import clone

feature_transformer = clone(feature_engineering_pipeline)

features_train_x = feature_transformer.fit_transform(train_x)
features_validate_x = feature_transformer.transform(validate_x)

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

model = RandomForestClassifier(n_estimators=100)

model.fit(features_train_x, train_y)

## Model validation

In [33]:
from sklearn.metrics import accuracy_score, recall_score

pred_y = model.predict(features_validate_x)

print(accuracy_score(validate_y, pred_y))
print(recall_score(validate_y, pred_y))

0.8077439382498531
0.7130906624463034


## Construcción del pipeline final

In [34]:
final_inference_pipeline = Pipeline([
    ("feature_engineering", clone(feature_engineering_pipeline)),
    ("model", RandomForestClassifier(n_estimators=100))
])

In [35]:
final_training_dataset = pd.concat([train_x, validate_x])
final_training_response = pd.concat([train_y, validate_y])

In [36]:
final_inference_pipeline.fit(final_training_dataset, final_training_response)

## Model testing


In [37]:
test_pred_y = final_inference_pipeline.predict(test_x)

print(accuracy_score(test_pred_y, test_y))
print(recall_score(test_pred_y, test_y))

0.8150432083228458
0.7610536218250236


## Model persistence

In [38]:
from joblib import dump

dump(final_inference_pipeline, "inference_pipeline.joblib")

['inference_pipeline.joblib']

## Y entonces... ¿a quién le hablamos?

In [39]:
from joblib import load

ultimate_inference_pipeline = load("inference_pipeline.joblib")

In [40]:
new_customers = pd.read_csv("new_customers.csv")
new_customers.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,name,email,phone-number,credit_card
0,City Hotel,0,0,2016,March,13,22,0,1,2,...,,0,Transient,99.0,0,1,Elizabeth Morton,ElizabethMorton@xfinity.com,218-662-6872,************5891
1,City Hotel,0,21,2016,March,13,23,0,3,2,...,,0,Transient-Party,62.0,0,1,Virginia Ward,Virginia_Ward51@gmail.com,845-529-3632,************1071
2,City Hotel,0,418,2016,September,40,26,1,2,2,...,,223,Transient-Party,107.0,0,0,Joseph Taylor,Joseph_T@protonmail.com,451-454-5767,************5326
3,City Hotel,0,58,2016,March,12,17,0,3,2,...,,0,Transient,63.0,0,0,Sara Allen,Sara_Allen18@yandex.com,402-581-2687,************8597
4,Resort Hotel,0,130,2017,July,28,9,2,0,1,...,,0,Transient-Party,178.0,0,1,John Black,Black.John47@yandex.com,541-901-5663,************9017


In [41]:
new_customers['will_cancel'] = ultimate_inference_pipeline.predict(new_customers)
new_customers[['proba_check_in', 'proba_cancel']] = ultimate_inference_pipeline.predict_proba(new_customers)

In [42]:
new_customers[['name', 'phone-number', 'will_cancel', 'proba_cancel']].sort_values(by='proba_cancel', ascending=False).head(20)

Unnamed: 0,name,phone-number,will_cancel,proba_cancel
32,Renee Reed,970-325-8809,1,1.0
91,Donald Alvarez,105-155-9939,1,1.0
63,Deanna Jenkins,229-507-3138,1,1.0
46,Katelyn Jones,323-339-3265,1,1.0
43,Joseph Lawson,166-493-3428,1,1.0
22,Martin Valenzuela,165-579-5602,1,1.0
90,Regina Pacheco,350-100-9605,1,1.0
8,Carrie Tanner,392-436-1692,1,1.0
58,Cory Alexander,864-688-3246,1,0.993343
64,Daniel Ortiz,960-672-0720,1,0.993343


## De tarea...

 - Entrena un modelo con *data leakage* y ve si es sospechosamente bueno
 - ¿Qué pasa si el cliente no recibió la habitación que solicitó? (`reserved_room_type` vs `assigned_room_type`)
 - No consideramos las fechas en las que iba a quedarse, ¿y si las incluyes en tu modelo?
 - ¿Podemos usar validación cruzada? 
 - ¿Qué tal de la búsqueda de hiper parámetros?