# Exercise (Supervised Learning) - Hotel Booking Demand Datasets

##  ``Hotel Reservation Cancellation Prediction``
 
You are a Data Scientist at a hotel company. You are given a dataset of room booking information for both a city hotel and a resort hotel. The dataset also records when each booking was made, the length of stay, the number of adults, children, and/or babies, and the parking spaces required. Further details are given in the dataset description below:  


## **Dataset**

This dataset comes from the journal article "Hotel booking demand datasets" by Nuno Antonio, Ana Almeida, and Luis Nunes, published in Data in Brief, Volume 22, February 2019. The article's description of each feature/variable is available at https://www.sciencedirect.com/science/article/pii/S2352340918315191

For a description of each column, see: https://www.kaggle.com/jessemostipak/hotel-booking-demand/data. 

__Data Constraints for the Exam__
* __Data size__: use only the **first 5000 rows [:5000]**.
* __Variables__: use the following 16 columns: ['hotel', 'is_canceled', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'booking_changes', 'deposit_type', 'days_in_waiting_list', 'customer_type', 'required_car_parking_spaces', 'total_of_special_requests']

## Cleaning and EDA
Perform data cleaning:
1) Check for anomalies in your data and fix them as needed.

Perform EDA. Besides using it to decide your preprocessing scheme, run Exploratory Data Analysis to:  
1) Understand the profile of hotel guests (_customer profiling_).  
2) Understand the habits of hotel guests (_customer behavior_).  

Explain the _insights_ you find from the _Exploratory Data Analysis_ to the hotel management!   

## Modelling

1. There are **2 types of errors** that the ML model can make in this case study:
>* The model predicts that the user will *cancel the booking*, but in reality the user does not cancel. (**false positive**)
>* The model predicts that the user will not cancel, but in reality the user *cancels the booking*. (**false negative**)  

In the hotel business, when a guest is assumed not to *cancel the booking*, the hotel prepares several things for their arrival, including:  
>* Contacting the guest about their expected arrival time,
>* Cleaning, tidying, and preparing the room as ordered by the guest,
>* Preparing food and drinks to welcome the guest,
>* Turning away other guests who try to book the already *booked room*, and
>* Providing pickup service from the airport/station/terminal if needed.  

a. Choose the type of error that has the greatest impact on the company's financial loss and explain your choice! 
b. Choose an *evaluation metric* that suppresses the type of error you chose! Give your reasons!  

2.  Choose at least 3 _machine learning_ models that you understand to obtain a benchmark ML model for predicting whether a user will *cancel the booking* or not! 
>* Briefly explain how the ML models you use work!  

3.  After choosing the best _benchmark_ model, perform hyperparameter tuning to improve your model's performance! Which `parameters` did you choose for `tuning`? Explain the meaning of each parameter!

4. How does your model perform after *Hyper-parameter Tuning*? Are there further steps you could take to improve the model's performance? Write a final conclusion: which model will you use to predict whether a user will *cancel the booking* or not!  

``Good luck & Happy Coding``

In [1]:
# library
# basic data analysis and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# preprocessing
import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# evaluation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
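# plot_precision_recall_curve and plot_roc_curve were removed in scikit-learn 1.2; the Display classes replace them
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score, PrecisionRecallDisplay, RocCurveDisplay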

# models
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# EDA

## Data

In [2]:
df= pd.read_csv('hotel_bookings.csv').head(5000)
df

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,Resort Hotel,1,212,2016,April,16,11,2,5,2,...,Non Refund,273.0,,0,Transient,76.05,0,0,Canceled,2015-10-16
4996,Resort Hotel,1,212,2016,April,16,11,2,5,2,...,Non Refund,273.0,,0,Transient,76.05,0,0,Canceled,2015-10-16
4997,Resort Hotel,1,212,2016,April,16,11,2,5,2,...,Non Refund,273.0,,0,Transient,67.05,0,0,Canceled,2015-10-16
4998,Resort Hotel,1,212,2016,April,16,11,2,5,2,...,Non Refund,273.0,,0,Transient,67.05,0,0,Canceled,2015-10-16


In [3]:
df=df[['hotel', 'is_canceled', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'booking_changes', 'deposit_type', 'days_in_waiting_list', 'customer_type', 'required_car_parking_spaces', 'total_of_special_requests']]
df

Unnamed: 0,hotel,is_canceled,adults,children,babies,meal,country,market_segment,distribution_channel,reserved_room_type,booking_changes,deposit_type,days_in_waiting_list,customer_type,required_car_parking_spaces,total_of_special_requests
0,Resort Hotel,0,2,0.0,0,BB,PRT,Direct,Direct,C,3,No Deposit,0,Transient,0,0
1,Resort Hotel,0,2,0.0,0,BB,PRT,Direct,Direct,C,4,No Deposit,0,Transient,0,0
2,Resort Hotel,0,1,0.0,0,BB,GBR,Direct,Direct,A,0,No Deposit,0,Transient,0,0
3,Resort Hotel,0,1,0.0,0,BB,GBR,Corporate,Corporate,A,0,No Deposit,0,Transient,0,0
4,Resort Hotel,0,2,0.0,0,BB,GBR,Online TA,TA/TO,A,0,No Deposit,0,Transient,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,Resort Hotel,1,2,0.0,0,HB,PRT,Groups,Direct,D,0,Non Refund,0,Transient,0,0
4996,Resort Hotel,1,2,0.0,0,HB,PRT,Groups,Direct,D,0,Non Refund,0,Transient,0,0
4997,Resort Hotel,1,2,0.0,0,HB,PRT,Groups,Direct,A,0,Non Refund,0,Transient,0,0
4998,Resort Hotel,1,2,0.0,0,HB,PRT,Groups,Direct,A,0,Non Refund,0,Transient,0,0


### Description of each column

**is_canceled**
* Value indicating if the booking was canceled (1) or not (0)

**adults, children, babies**
* Number of adults, children, babies

**meal**
* Type of meal booked.
    * BB – Bed & Breakfast
    * HB – Half board (breakfast and one other meal – usually dinner)
    * FB – Full board (breakfast, lunch and dinner)

**country**
* Country of origin. Categories are three-letter country codes such as PRT and GBR (the ISO 3166-1 alpha-3 scheme; the source article labels the format ISO 3155–3:2013)

**market_segment & distribution_channel**
* Market segment designation & Booking distribution channel.
    * In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

**reserved_room_type**
* Code of room type reserved. Code is presented instead of designation for anonymity reasons.

**booking_changes**
* Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

**deposit_type**
* Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: 
    * No Deposit – no deposit was made
    * Non Refund – a deposit was made in the value of the total stay cost
    * Refundable – a deposit was made with a value under the total cost of stay.

**days_in_waiting_list**
* Number of days the booking was in the waiting list before it was confirmed to the customer

**customer_type**
* Type of booking, assuming one of four categories:
    * Contract - when the booking has an allotment or other type of contract associated to it
    * Group – when the booking is associated to a group
    * Transient – when the booking is not part of a group or contract, and is not associated to other transient booking
    * Transient-party – when the booking is transient, but is associated to at least other transient booking

**required_car_parking_spaces**
* Number of car parking spaces required by the customer

**total_of_special_requests**
* Number of special requests made by the customer (e.g. twin bed or high floor)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   hotel                        5000 non-null   object 
 1   is_canceled                  5000 non-null   int64  
 2   adults                       5000 non-null   int64  
 3   children                     5000 non-null   float64
 4   babies                       5000 non-null   int64  
 5   meal                         5000 non-null   object 
 6   country                      4998 non-null   object 
 7   market_segment               5000 non-null   object 
 8   distribution_channel         5000 non-null   object 
 9   reserved_room_type           5000 non-null   object 
 10  booking_changes              5000 non-null   int64  
 11  deposit_type                 5000 non-null   object 
 12  days_in_waiting_list         5000 non-null   int64  
 13  customer_type     

In [5]:
df['is_canceled'].value_counts()/df.shape[0]*100
# cancelled and non-cancelled bookings are fairly balanced: 45.96% were cancelled and 54.04% were not

0    54.04
1    45.96
Name: is_canceled, dtype: float64

In [6]:
df.describe()

Unnamed: 0,is_canceled,adults,children,babies,booking_changes,days_in_waiting_list,required_car_parking_spaces,total_of_special_requests
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.4596,1.9698,0.1156,0.0148,0.2072,1.772,0.0938,0.602
std,0.498415,1.566326,0.444833,0.122409,0.612489,13.545358,0.292948,0.823245
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,55.0,10.0,2.0,17.0,122.0,2.0,4.0


In [7]:
df.groupby(['is_canceled'])['required_car_parking_spaces'].value_counts()

is_canceled  required_car_parking_spaces
0            0                              2235
             1                               465
             2                                 2
1            0                              2298
Name: required_car_parking_spaces, dtype: int64

In [8]:
df.groupby(['is_canceled'])['total_of_special_requests'].value_counts()

is_canceled  total_of_special_requests
0            0                            1470
             1                             754
             2                             396
             3                              77
             4                               5
1            0                            1473
             1                             485
             2                             294
             3                              44
             4                               2
Name: total_of_special_requests, dtype: int64

In [9]:
df.groupby(['is_canceled'])['booking_changes'].value_counts()

is_canceled  booking_changes
0            0                  2116
             1                   429
             2                   107
             3                    27
             4                    16
             5                     4
             6                     2
             17                    1
1            0                  2126
             1                   153
             2                    13
             3                     4
             4                     2
Name: booking_changes, dtype: int64

In [10]:
df.describe(include='object')

Unnamed: 0,hotel,meal,country,market_segment,distribution_channel,reserved_room_type,deposit_type,customer_type
count,5000,5000,4998,5000,5000,5000,5000,5000
unique,1,5,56,6,3,9,3,4
top,Resort Hotel,BB,PRT,Online TA,TA/TO,A,No Deposit,Transient
freq,5000,3418,3174,2156,3657,3167,4461,3699


Hotel guest profile (customer profiling):
* Most guests come from PRT (Portugal): 3174 of the 4998 bookings with a recorded country.
* Most bookings arrive through the TA/TO (travel agent / tour operator) distribution channel (3657 of 5000).
* Most guests are of the Transient type (3699 of 5000), i.e. not part of a group or contract.

Hotel guest behavior (customer behavior):
* The most frequently booked room type is type A.
* The most frequently booked meal is BB (bed & breakfast), which suggests guests tend to eat lunch and dinner outside the hotel.
* Most guests do not require a parking space, make no special requests, and do not change their booking.
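
The profiling above is based on overall frequencies. To see how a category relates to cancellation, one option is to compute the cancellation rate per category. The following is a minimal sketch on a small made-up frame: the rows are illustrative and not taken from the dataset, and `cancel_rate_by` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def cancel_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the booking count and cancellation rate for each value of `column`."""
    return (
        frame.groupby(column)["is_canceled"]
        .agg(bookings="count", cancel_rate="mean")
        .sort_values("cancel_rate", ascending=False)
    )

# Illustrative toy rows only; in the notebook, pass the real `df` instead.
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund", "Non Refund", "No Deposit"],
    "is_canceled": [0, 1, 1, 1, 0],
})
rates = cancel_rate_by(toy, "deposit_type")
print(rates)
```

Because `is_canceled` is coded 0/1, its mean within a group is the share of cancelled bookings in that group.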

## Cleaning data

In [11]:
df['hotel'].nunique() # the hotel column holds a single value, so we drop it

1

In [12]:
df= df.drop(columns='hotel')
df.head()

Unnamed: 0,is_canceled,adults,children,babies,meal,country,market_segment,distribution_channel,reserved_room_type,booking_changes,deposit_type,days_in_waiting_list,customer_type,required_car_parking_spaces,total_of_special_requests
0,0,2,0.0,0,BB,PRT,Direct,Direct,C,3,No Deposit,0,Transient,0,0
1,0,2,0.0,0,BB,PRT,Direct,Direct,C,4,No Deposit,0,Transient,0,0
2,0,1,0.0,0,BB,GBR,Direct,Direct,A,0,No Deposit,0,Transient,0,0
3,0,1,0.0,0,BB,GBR,Corporate,Corporate,A,0,No Deposit,0,Transient,0,0
4,0,2,0.0,0,BB,GBR,Online TA,TA/TO,A,0,No Deposit,0,Transient,0,1


In [13]:
df[(df['adults']==0)&(df['children']==0)&(df['babies']==0)]

Unnamed: 0,is_canceled,adults,children,babies,meal,country,market_segment,distribution_channel,reserved_room_type,booking_changes,deposit_type,days_in_waiting_list,customer_type,required_car_parking_spaces,total_of_special_requests
2224,0,0,0.0,0,SC,PRT,Corporate,Corporate,A,1,No Deposit,0,Transient-Party,0,0
2409,0,0,0.0,0,SC,PRT,Corporate,Corporate,A,0,No Deposit,0,Transient,0,0
3181,0,0,0.0,0,SC,ESP,Groups,TA/TO,A,0,No Deposit,0,Transient-Party,0,0
3684,0,0,0.0,0,SC,PRT,Groups,TA/TO,A,1,No Deposit,122,Transient-Party,0,0
3708,0,0,0.0,0,SC,PRT,Groups,TA/TO,A,1,No Deposit,122,Transient-Party,0,0
4127,1,0,0.0,0,SC,,Offline TA/TO,TA/TO,P,0,No Deposit,0,Transient,0,0


In [14]:
# bookings with no adults, children, or babies are assumed to be
# part of a group/corporate booking that needs an additional room,
# so they are neither dropped nor treated as missing values

# Missing Value

In [15]:
df.isna().sum()

is_canceled                    0
adults                         0
children                       0
babies                         0
meal                           0
country                        2
market_segment                 0
distribution_channel           0
reserved_room_type             0
booking_changes                0
deposit_type                   0
days_in_waiting_list           0
customer_type                  0
required_car_parking_spaces    0
total_of_special_requests      0
dtype: int64

# Preprocessing

* Missing values: country --> SimpleImputer with strategy most_frequent
* One-hot encoding: meal, distribution_channel, deposit_type, customer_type --> nominal data with few categories
* Binary encoding: country, market_segment, reserved_room_type --> nominal data with many categories
* No treatment: numerical columns
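
To illustrate why binary encoding suits columns with many categories, here is a hand-rolled sketch of the idea. It is not the `category_encoders` implementation, and the library's exact column count and code assignment may differ; `binary_encode` is a hypothetical helper that assigns ordinal codes in order of first appearance and writes each code in base 2.

```python
def binary_encode(categories):
    """Map each distinct category to a fixed-width list of 0/1 bits.

    Hand-rolled illustration: categories get ordinal codes 1..n in order of
    first appearance, and each code is written in base 2.
    """
    codes = {}
    for value in categories:
        if value not in codes:
            codes[value] = len(codes) + 1
    width = max(codes.values()).bit_length()
    return {
        value: [int(bit) for bit in format(code, f"0{width}b")]
        for value, code in codes.items()
    }

encoded = binary_encode(["PRT", "GBR", "ESP", "PRT", "FRA", "IRL"])
for country, bits in encoded.items():
    print(country, bits)

# Number of binary columns needed for the 5 distinct categories above
n_binary_columns = len(next(iter(encoded.values())))
print(n_binary_columns, "columns")
```

Under this scheme the number of columns grows with the logarithm of the number of categories, so the 56 distinct countries in this data would need 6 columns, whereas one-hot encoding with `drop='first'` would need 55.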

In [16]:
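# impute missing values with the mode, then apply binary encoding
# (the second step is named 'onehot' but it applies BinaryEncoder)
binary_pipe= Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot',ce.BinaryEncoder())])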

In [17]:
transformer= ColumnTransformer([
    ('one hot', OneHotEncoder(drop='first'), ['meal', 'distribution_channel', 'deposit_type', 'customer_type']), # encoding (one-hot)
    ('binary', binary_pipe,['country', 'market_segment', 'reserved_room_type']), # pipeline (see above): impute, then binary encoding
], remainder= 'passthrough')

In [18]:
# quick check of the transformer output; is_canceled is still in df here and simply passes through
transformer.fit_transform(df)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

# Splitting Data

In [19]:
x=df.drop(columns=['is_canceled'])
y=df['is_canceled']

In [20]:
x_train, x_test, y_train, y_test= train_test_split(x,y, stratify=y, random_state=2020)

In [21]:
x_train

Unnamed: 0,adults,children,babies,meal,country,market_segment,distribution_channel,reserved_room_type,booking_changes,deposit_type,days_in_waiting_list,customer_type,required_car_parking_spaces,total_of_special_requests
4232,2,0.0,0,BB,PRT,Online TA,TA/TO,A,0,No Deposit,0,Transient,0,2
918,3,0.0,0,BB,PRT,Online TA,TA/TO,A,1,No Deposit,0,Transient,0,2
4952,2,0.0,0,HB,GBR,Groups,Direct,D,0,No Deposit,0,Transient-Party,0,0
2842,2,0.0,0,BB,PRT,Online TA,TA/TO,A,0,No Deposit,0,Transient,0,0
4605,2,0.0,0,FB,PRT,Groups,Direct,A,0,Non Refund,0,Transient,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
842,2,0.0,0,BB,PRT,Offline TA/TO,TA/TO,A,0,No Deposit,0,Transient,0,0
3595,2,0.0,0,BB,PRT,Online TA,TA/TO,A,0,Non Refund,0,Transient,0,0
4606,2,0.0,0,FB,PRT,Groups,Direct,A,0,Non Refund,0,Transient,0,0
2287,1,0.0,0,BB,PRT,Corporate,Corporate,A,0,No Deposit,0,Transient-Party,0,0


# Benchmark

In the Hotel Reservation Cancellation case there are two kinds of error: a **FP (false positive)**, where a guest is predicted to cancel the booking (1) but does not cancel (0), and a **FN (false negative)**, where a guest is predicted not to cancel (0) but does cancel the room booking (1). Both are costly because both lead to lost hotel guests, so the appropriate evaluation metric is **f1**, which penalizes both kinds of error.


In [22]:
tree= DecisionTreeClassifier()
logreg= LogisticRegression(solver='liblinear')
knn=KNeighborsClassifier()

In [23]:
models=[tree,logreg,knn]
score=[]
rata=[]
std=[]

for i in models:
    skfold= StratifiedKFold(n_splits=5)
    estimator= Pipeline([
        ('preprocessing', transformer),
        ('model',i)
    ])
    model_cv= cross_val_score(estimator, x_train, y_train, cv=skfold, scoring='f1')
    score.append(model_cv)
    rata.append(model_cv.mean())
    std.append(model_cv.std())

In [24]:
pd.DataFrame({
    'model':['tree','logreg','knn'],
    'mean':rata,
    'std':std
})

Unnamed: 0,model,mean,std
0,tree,0.85307,0.013728
1,logreg,0.857136,0.01222
2,knn,0.829777,0.02262


In cross-validation, logreg performs best (highest mean f1) and is the most stable (lowest standard deviation) compared with tree and knn, although the margins are small. We therefore choose logreg as the benchmark model to tune.

# Model Performance with Data test

In [25]:
logreg= LogisticRegression(solver='liblinear')
estimator= Pipeline([
    ('preprocess', transformer),
    ('model', logreg)
])
estimator.fit(x_train, y_train)
y_pred= estimator.predict(x_test)

In [26]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.90      0.78      0.84       676
           1       0.78      0.90      0.83       574

    accuracy                           0.84      1250
   macro avg       0.84      0.84      0.84      1250
weighted avg       0.85      0.84      0.84      1250



For the tree, logreg, and knn models we obtain precision, recall, and f1-score.  
With this model we want to minimize both False Positive and False Negative predictions.  
Precision is sensitive to FP, recall is sensitive to FN, and f1-score accounts for both FP and FN.
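
The relationship between the three metrics and the two error types can be made concrete with the standard formulas. The sketch below uses made-up counts for the positive class (cancelled bookings); they are not the confusion matrix of the model in this notebook, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts (assumed, not from the notebook):
# 90 cancellations caught, 25 false alarms (FP), 10 missed cancellations (FN)
precision, recall, f1 = precision_recall_f1(tp=90, fp=25, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Adding only false positives lowers precision but leaves recall unchanged
p_more_fp, r_more_fp, f1_more_fp = precision_recall_f1(tp=90, fp=50, fn=10)
print(f"precision={p_more_fp:.3f} recall={r_more_fp:.3f} f1={f1_more_fp:.3f}")
```

Since F1 is the harmonic mean of precision and recall, it drops whenever either FP or FN grows, which is why it fits a case where both errors matter.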

# Hyperparameter Tuning

In [27]:
logreg= LogisticRegression(solver='liblinear')
estimator= Pipeline([
    ('preprocess', transformer),
    ('model', logreg)
])

In [28]:
# estimator.get_params()

In [29]:
#logreg
hyperparam_space={
    'model__C':[100,10,1,0.1,0.001], # inverse of regularization strength (positive float); smaller values mean stronger regularization
    'model__solver':['liblinear','newton-cg','lbfgs'], # algorithm used to solve the optimization problem
    'model__max_iter':[100,125,150,175,200] # maximum number of iterations allowed for the solver to converge
}
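
A grid search evaluates every combination of the values listed above, so it helps to know how many fits that implies before running it. The sketch below counts them with the standard library only; it repeats the grid from the cell above and assumes the 5-fold `StratifiedKFold` used elsewhere in this notebook.

```python
from itertools import product

# Same grid as in the cell above
hyperparam_space = {
    'model__C': [100, 10, 1, 0.1, 0.001],
    'model__solver': ['liblinear', 'newton-cg', 'lbfgs'],
    'model__max_iter': [100, 125, 150, 175, 200],
}

# Every combination of one value per parameter is one candidate model
candidates = [
    dict(zip(hyperparam_space, values))
    for values in product(*hyperparam_space.values())
]
n_candidates = len(candidates)
n_splits = 5  # folds used by StratifiedKFold in this notebook
n_fits = n_candidates * n_splits

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With its default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set.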

In [30]:
# #tree
# hyperparam_space={
#     'model__criterion':['gini','entropy'], # function used to measure the quality of a split
#     'model__max_depth':[3,4,5], # maximum depth of the tree
#     'model__min_samples_leaf':[2,3,4,5], # minimum number of samples required at a leaf node
#     'model__min_samples_split':[5,7,9] # minimum number of samples required to split an internal node
# }

In [31]:
skfold= StratifiedKFold(n_splits=5)
grid_search= GridSearchCV(
    estimator,
    param_grid= hyperparam_space,
    cv=skfold,
    scoring='f1',
    n_jobs=-1
)

In [32]:
grid_search.fit(x_train,y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=Pipeline(steps=[('preprocess',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('one '
                                                                         'hot',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['meal',
                                                                          'distribution_channel',
                                                                          'deposit_type',
                                                                          'customer_type']),
                                                                        ('binary',
                                                                         Pip

In [33]:
print(grid_search.best_score_)
print(grid_search.best_params_)

0.8574070654446665
{'model__C': 1, 'model__max_iter': 100, 'model__solver': 'newton-cg'}


In [34]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__C,param_model__max_iter,param_model__solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.221273,0.012479,0.063162,0.024564,100,100,liblinear,"{'model__C': 100, 'model__max_iter': 100, 'mod...",0.847278,0.842384,0.868280,0.869452,0.851296,0.855738,0.011090,33
1,0.356172,0.050216,0.052235,0.010242,100,100,newton-cg,"{'model__C': 100, 'model__max_iter': 100, 'mod...",0.847278,0.842384,0.868280,0.869452,0.851296,0.855738,0.011090,33
2,0.305921,0.041719,0.039963,0.008897,100,100,lbfgs,"{'model__C': 100, 'model__max_iter': 100, 'mod...",0.847278,0.845033,0.868280,0.871261,0.851296,0.856630,0.010956,21
3,0.195014,0.022565,0.045449,0.007387,100,125,liblinear,"{'model__C': 100, 'model__max_iter': 125, 'mod...",0.847278,0.842384,0.868280,0.869452,0.851296,0.855738,0.011090,33
4,0.314173,0.083157,0.041916,0.008332,100,125,newton-cg,"{'model__C': 100, 'model__max_iter': 125, 'mod...",0.847278,0.842384,0.868280,0.869452,0.851296,0.855738,0.011090,33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,0.216311,0.046136,0.040258,0.003242,0.001,175,newton-cg,"{'model__C': 0.001, 'model__max_iter': 175, 'm...",0.763636,0.695205,0.724919,0.739771,0.704698,0.725646,0.024531,66
71,0.200610,0.031160,0.048854,0.027721,0.001,175,lbfgs,"{'model__C': 0.001, 'model__max_iter': 175, 'm...",0.763636,0.695205,0.724919,0.739771,0.704698,0.725646,0.024531,66
72,0.150788,0.022735,0.041436,0.004123,0.001,200,liblinear,"{'model__C': 0.001, 'model__max_iter': 200, 'm...",0.799447,0.766667,0.795948,0.770149,0.722741,0.770990,0.027495,61
73,0.189782,0.023134,0.036650,0.003052,0.001,200,newton-cg,"{'model__C': 0.001, 'model__max_iter': 200, 'm...",0.763636,0.695205,0.724919,0.739771,0.704698,0.725646,0.024531,66


# Before and after Tuning

In [35]:
# before
logreg= LogisticRegression(solver='liblinear')
estimator= Pipeline([
    ('preprocess', transformer),
    ('model', logreg)
])
estimator.fit(x_train,y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('one hot',
                                                  OneHotEncoder(drop='first'),
                                                  ['meal',
                                                   'distribution_channel',
                                                   'deposit_type',
                                                   'customer_type']),
                                                 ('binary',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   BinaryEncoder())]),
                                                  ['country', 'market_segment',
       

In [36]:
y_pred= estimator.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.90      0.78      0.84       676
           1       0.78      0.90      0.83       574

    accuracy                           0.84      1250
   macro avg       0.84      0.84      0.84      1250
weighted avg       0.85      0.84      0.84      1250



In [37]:
from sklearn.metrics import f1_score
f1_score(y_test,y_pred)

0.83454398708636

In [38]:
# after
best_model= grid_search.best_estimator_
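# best_estimator_ was already refit on x_train by GridSearchCV (refit=True by default), so this call is redundant but harmless
best_model.fit(x_train,y_train)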

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('one hot',
                                                  OneHotEncoder(drop='first'),
                                                  ['meal',
                                                   'distribution_channel',
                                                   'deposit_type',
                                                   'customer_type']),
                                                 ('binary',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   BinaryEncoder())]),
                                                  ['country', 'market_segment',
       

In [39]:
y_pred= best_model.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.90      0.78      0.84       676
           1       0.78      0.90      0.84       574

    accuracy                           0.84      1250
   macro avg       0.84      0.84      0.84      1250
weighted avg       0.85      0.84      0.84      1250



In [40]:
f1_score(y_test,y_pred)

0.8352180936995154

After hyperparameter tuning the model's performance improves slightly: the f1-score on the test set rises from 0.8345 to 0.8352 (about 83.45% to 83.52%).

Are there further steps you could take to improve the model's performance? 
* Many other parameters could still be tuned; they can be listed with `estimator.get_params()`.
* For (supervised) classification there are other models besides tree, logreg, and knn.

Write a final conclusion: which model will you use to predict whether a user will cancel the booking or not!
* To predict whether a user will cancel the booking, I will use the tuned logreg model, because its performance improved, if only slightly.