## **Classification Challenge:**

Using the basic methodology of a data science project, implement possible solutions for the following case study (https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset?resource=download). Develop and implement methods that aim to answer the question: **Will a guest cancel their reservation?**

Your project must include a detailed analysis of the dataset and a description of the data transformations performed, with their respective justifications. Additionally, create a baseline using the techniques covered in this class. To beat this baseline, research more complex methods and techniques. You must submit a report containing the following items:

- Description of the problem and analysis of the data
- Description of the techniques used
- Interpretation of the results obtained
- Conclusion
- Appendix (description of the classification techniques presented by classmates during the seminars)

Additionally, include in the report the URL of your online repository for reference. Your code must be commented!

Grading scheme:

- (1.0) Quality of the report
- (1.0) Analysis of the data and description of the problem
- (3.0) Implementation of the solution
- (3.0) Interpretation of the results
- (2.0) Summary of the techniques from the classification seminars


# Cancel Culture

This project aims to predict whether a client will cancel a hotel reservation. The data comes from a Kaggle dataset, and we train several common ML algorithms on it and compare their results.

### Table of Contents

- [Introduction](#introduction)
- [Exploratory Data Analysis & Pre-processing](#exploratory-data-analysis--pre-processing)
- [Model Training](#model-training)


## Introduction

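This project aims to predict whether a client will cancel a hotel reservation.

We use the hotel reservations dataset hosted on Kaggle (https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset).

The work is organized in three stages:

- Exploratory data analysis and pre-processing
- Model training: KNN, Naive Bayes, logistic regression, SVM, decision tree, random forest, and XGBoost
- Interpretation of the results and conclusion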


## Exploratory Data Analysis & Pre-processing


In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import tree
from sklearn.ensemble import (
    RandomForestClassifier,
    BaggingClassifier,
    AdaBoostClassifier,
)
from xgboost import XGBClassifier

from imblearn.over_sampling import RandomOverSampler

In [2]:
# Load dataset
df = pd.read_csv("./hotel_reservations.csv")
df.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,INN00002,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.0,0,Canceled
3,INN00004,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.0,0,Canceled
4,INN00005,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.5,0,Canceled


Let's get started with the analysis!

First, let's check our attributes in a more readable way:


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date   

Some of these features are categorical (dtype `object`), so let's look at the values they take:


In [4]:
n_top_values = 5  # show at most this many values per column
for col in df.select_dtypes(include="object"):
    n_unique = df[col].nunique()
    # first few distinct values, in alphabetical order
    sample_values = sorted(df[col].unique())[:n_top_values]
    sample_str = ", ".join(f"<{v}>" for v in sample_values)
    # append an ellipsis only when some values were left out
    suffix = ", ..." if n_unique > n_top_values else ""
    print(f"{col}: ({n_unique} distinct) {sample_str}{suffix}")

Booking_ID: (36275 distinct) <INN00001>, <INN00002>, <INN00003>, <INN00004>, <INN00005>, ...
type_of_meal_plan: (4 distinct) <Meal Plan 1>, <Meal Plan 2>, <Meal Plan 3>, <Not Selected>
room_type_reserved: (7 distinct) <Room_Type 1>, <Room_Type 2>, <Room_Type 3>, <Room_Type 4>, <Room_Type 5>, ...
market_segment_type: (5 distinct) <Aviation>, <Complementary>, <Corporate>, <Offline>, <Online>
booking_status: (2 distinct) <Canceled>, <Not_Canceled>
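
Since `booking_status` is the target, it can also help to check how balanced its two classes are before training. The snippet below is a minimal sketch of that check; the small `status` series is made up for illustration and stands in for `df["booking_status"]`.

```python
import pandas as pd

# Toy stand-in for the booking_status column (values are illustrative only)
status = pd.Series(
    ["Not_Canceled", "Canceled", "Not_Canceled",
     "Not_Canceled", "Canceled", "Not_Canceled"],
    name="booking_status",
)

# Absolute and relative frequency of each class
counts = status.value_counts()
shares = status.value_counts(normalize=True)

# Side-by-side view: one row per class
balance = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(balance)
```

On the real data the same two `value_counts` calls would be applied to `df["booking_status"]` directly.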


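`Booking_ID` is a unique identifier for each reservation (36275 distinct values in 36275 rows), so it carries no information our algorithms can generalize from, and we can remove that column.

We can also convert the categorical features to numbers with label encoding. Note that `LabelEncoder` assigns integer codes in alphabetical order, so `booking_status` becomes 0 for `Canceled` and 1 for `Not_Canceled`: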


In [5]:
df.drop("Booking_ID", axis=1, inplace=True)

# identify categorical columns
cat_cols = df.select_dtypes(include="object").columns

# label encode categorical columns
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
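
To make the encoding concrete, here is a small sketch of what `LabelEncoder` does to a column like `booking_status`. The `toy` series is invented for illustration; the point is that the codes follow the sorted order of the labels and can be mapped back with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample of target labels, for illustration only
toy = pd.Series(["Not_Canceled", "Canceled", "Canceled", "Not_Canceled"])

le = LabelEncoder()
codes = le.fit_transform(toy)

# classes_ is sorted, so each label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'Canceled': 0, 'Not_Canceled': 1}
print(codes)    # [1 0 0 1]

# inverse_transform recovers the original labels from the codes
restored = le.inverse_transform(codes)
```

In the loop above the encoder is overwritten on every iteration, so the fitted mappings are not kept. If decoding is needed later, one option would be to store each fitted encoder in a dictionary keyed by column name.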

## Model Training


### Splitting the dataset


In [6]:
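# Separate the features from the target
X = df.drop("booking_status", axis=1)
Y = df["booking_status"]

# Balance the classes by randomly duplicating minority-class rows.
# NOTE: this resamples the full dataset before cross-validation, so copies of
# the same row can fall in both the training and the test folds, which may
# inflate the scores reported below.
ros = RandomOverSampler(random_state=0)
X, Y = ros.fit_resample(X, Y)

# 10-fold stratified cross-validation, shared by every model below
n_folds = 10
kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)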
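
The cell above oversamples the whole dataset before the folds are created. An alternative is to oversample only the training part of each fold, so that the test fold never contains copies of training rows. The sketch below illustrates that idea under stated assumptions: it uses synthetic data from `make_classification` instead of the hotel dataset, and `oversample_minority` is a hypothetical helper written for this example, not part of the notebook or of imbalanced-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def oversample_minority(X, y, rng):
    """Duplicate random rows of the smaller classes until every class
    has as many rows as the largest one (hypothetical helper)."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, cnt in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=int(target - cnt), replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]


# Synthetic imbalanced data standing in for the hotel features (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=0
)

rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Resample the training fold only; the test fold keeps its original rows
    X_tr, y_tr = oversample_minority(X_demo[train_idx], y_demo[train_idx], rng)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))

fold_scores = np.array(fold_scores)
print("Accuracy: %0.4f ± %0.4f" % (fold_scores.mean(), fold_scores.std()))
```

imbalanced-learn also provides `imblearn.pipeline.Pipeline`, which accepts samplers such as `RandomOverSampler` as steps and applies them only when fitting. Passing such a pipeline to `cross_val_score` should give the same per-fold behaviour with less code.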

### KNN


In [7]:
# 1-nearest-neighbour classifier evaluated with the shared 10-fold CV
knn = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(knn, X, Y, cv=kf, n_jobs=-1)
print("Accuracy with 1-NN: %0.4f ± %0.4f" % (scores.mean(), scores.std()))

Accuracy with 1-NN: 0.8937 ± 0.0036
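
KNN relies on distances, so features with large numeric ranges (here, for instance, `lead_time` or `avg_price_per_room` compared with the 0/1 flags) can dominate the neighbour search. The sketch below shows one common way to check whether standardizing the features changes the result. It is an illustration on synthetic data with one artificially rescaled column, not a result for the hotel dataset, and `n_neighbors=5` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one column is rescaled to mimic a wide-range feature
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1000

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Same classifier with and without standardization
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# The scaler sits inside the pipeline, so it is fit on each training fold only
raw_scores = cross_val_score(raw_knn, X_demo, y_demo, cv=cv)
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=cv)

print("Raw features:    %0.4f ± %0.4f" % (raw_scores.mean(), raw_scores.std()))
print("Scaled features: %0.4f ± %0.4f" % (scaled_scores.mean(), scaled_scores.std()))
```

The same pipeline object can be passed to `cross_val_score` with the notebook's `X`, `Y` and `kf` to see whether scaling helps on the real features.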


### Naive Bayes


In [8]:
# Gaussian Naive Bayes evaluated with the shared 10-fold CV
nb = GaussianNB()
scores = cross_val_score(nb, X, Y, cv=kf, n_jobs=-1)
print("Accuracy with Naive Bayes: %0.4f ± %0.4f" % (scores.mean(), scores.std()))

Accuracy with Naive Bayes: 0.5672 ± 0.0032


### Logistic Regression


In [9]:
# max_iter is raised so the solver has room to converge on the unscaled features
rlog = LogisticRegression(max_iter=5000)
scores = cross_val_score(rlog, X, Y, cv=kf, n_jobs=-1)
print("Accuracy with Logistic Regression: %0.4f ± %0.4f" % (scores.mean(), scores.std()))

Accuracy with Logistic Regression: 0.7786 ± 0.0051


### SVM


In [10]:
# SVM runs are currently commented out, so no results are reported for them
# svm = SVC(kernel='linear')
# scores = cross_val_score(svm, X, Y, cv=kf)
# print('Accuracy with linear SVM: %0.4f ± %0.4f' % (scores.mean(), scores.std()))

# svm = SVC(kernel='rbf')
# scores = cross_val_score(svm, X, Y, cv=kf)
# print('Accuracy with RBF SVM: %0.4f ± %0.4f' % (scores.mean(), scores.std()))

# svm = SVC(kernel='poly', degree=3)
# scores = cross_val_score(svm, X, Y, cv=kf)
# print('Accuracy with polynomial SVM: %0.4f ± %0.4f' % (scores.mean(), scores.std()))

### Decision Tree


In [11]:
# Unpruned tree with the default Gini criterion
dct = tree.DecisionTreeClassifier()
scores = cross_val_score(dct, X, Y, cv=kf)
print("Accuracy with Gini: %0.4f ± %0.4f" % (scores.mean(), scores.std()))

# Unpruned tree with the entropy criterion
dct = tree.DecisionTreeClassifier(criterion="entropy")
scores = cross_val_score(dct, X, Y, cv=kf)
print("Accuracy with Entropy: %0.4f ± %0.4f" % (scores.mean(), scores.std()))

# Gini tree limited to a depth of 3
dct = tree.DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(dct, X, Y, cv=kf)
print("Accuracy with Gini at depth 3: %0.4f ± %0.4f" % (scores.mean(), scores.std()))

Accuracy with Gini: 0.9264 ± 0.0038
Accuracy with Entropy: 0.9282 ± 0.0025
Accuracy with Gini at depth 3: 0.7737 ± 0.0060


### Random Forest


In [12]:
# Forest of 200 shallow trees (max depth 4) with the Gini criterion
rf = RandomForestClassifier(n_estimators=200, criterion="gini", max_depth=4)
scores = cross_val_score(rf, X, Y, cv=kf)
print("Accuracy with Random Forest: %0.4f ± %0.4f" % (scores.mean(), scores.std()))

Accuracy with Random Forest: 0.8019 ± 0.0063
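
For the interpretation step, a fitted random forest exposes impurity-based importances through its `feature_importances_` attribute. The sketch below shows how they could be ranked. It runs on synthetic data with placeholder feature names (`feature_0`, `feature_1`, ...), so the ranking it prints says nothing about the hotel dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names and synthetic data, for illustration only
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)

# Same hyperparameters as the notebook cell, plus a fixed seed
rf_demo = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Importances sum to 1; higher means the feature was used for better splits
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```

With the real data, `index=X.columns` would label each importance with its column name. Impurity-based importances can favour features with many distinct values, so they are best read as a rough guide.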


### XGBoost


In [13]:
# Gradient-boosted trees with XGBoost's default hyperparameters
xgb = XGBClassifier()
scores = cross_val_score(xgb, X, Y, cv=kf, n_jobs=-1)
print("Accuracy with XGBoost: %0.4f ± %0.4f" % (scores.mean(), scores.std()))

Accuracy with XGBoost: 0.8952 ± 0.0046
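
Since the goal is to compare the algorithms, the per-model cells above could be collected into a single loop that builds a summary table. The following is a minimal sketch of that pattern, assuming synthetic data and only three of the models. The `models` dictionary and the column names of `summary` are choices made for this example. It also uses `cross_validate` to report F1 next to accuracy, which the notebook does not do.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Models to compare, keyed by a display name
models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    # cross_validate returns one array of fold scores per requested metric
    res = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1"])
    rows.append({
        "model": name,
        "accuracy_mean": res["test_accuracy"].mean(),
        "accuracy_std": res["test_accuracy"].std(),
        "f1_mean": res["test_f1"].mean(),
    })

# One row per model, best accuracy first
summary = (
    pd.DataFrame(rows)
    .sort_values("accuracy_mean", ascending=False)
    .reset_index(drop=True)
)
print(summary.round(4))
```

Adding the remaining classifiers to `models` and swapping in the notebook's `X`, `Y` and `kf` would reproduce the comparison in one table.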
