
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load dataset

In [6]:
# Use the raw file URL; the GitHub "blob" page is HTML and cannot be parsed as CSV
df = pd.read_csv('https://raw.githubusercontent.com/Vikry99/dataset/3faa50a057e69164d00206a45606f42f509a13ff/Europe%20Hotel%20Booking%20Satisfaction%20Score.csv')
df.head()
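
`pd.read_csv` tends to raise a `ParserError` on a GitHub `blob` URL because that address serves an HTML page, not the CSV itself. Below is a minimal sketch of rewriting such a link to its `raw.githubusercontent.com` form. The helper name `to_raw_url` is hypothetical, and the rewrite assumes the usual `github.com/<user>/<repo>/blob/<ref>/<path>` layout.

```python
def to_raw_url(blob_url: str) -> str:
    """Rewrite a github.com 'blob' link to its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        return blob_url  # not a blob link; leave it untouched
    user_repo, ref_and_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{user_repo}/{ref_and_path}"

blob = ("https://github.com/Vikry99/dataset/blob/"
        "3faa50a057e69164d00206a45606f42f509a13ff/"
        "Europe%20Hotel%20Booking%20Satisfaction%20Score.csv")
raw = to_raw_url(blob)
print(raw)
```

The resulting address can then be passed to `pd.read_csv` directly.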

In [None]:
df.info()

# EDA (Exploratory Data Analysis)

## Check shape dataset

In [None]:
row = df.shape[0]
column = df.shape[1]
print(f"Number of rows = {row} \nNumber of columns = {column}")

## Check missing value

In [None]:
df.info()

In [None]:
df.isnull().sum()

## Unique data from each column

In [None]:
print("Number of unique labels from each column")
print("==="*16)
for x in df.columns:
    print(f"{x} : {len(df[x].unique())} labels")
    print(f"{x} : {df[x].unique()} \n")

The Type Of Booking column contains the value 'Not defined', so we check how the values in this column are distributed.

In [None]:
print(df['Type Of Booking'].value_counts())
print(df['Type Of Booking'].value_counts(normalize=True))

The value 'Not defined' appears in 7,494 rows, or about 7.2% of the data. Because this share is small, we drop those rows.
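
The 7.2% figure can be checked from the counts reported in this notebook. The sketch below assumes that the pre-filter total equals the 96,410 rows that remain after filtering plus the 7,494 dropped rows.

```python
not_defined = 7_494   # rows with Type Of Booking == 'Not defined'
kept_rows = 96_410    # rows remaining after the filter
total_rows = not_defined + kept_rows  # assumed dataset size before filtering

share = not_defined / total_rows * 100
print(f"{share:.1f}% of {total_rows:,} rows are 'Not defined'")
```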

In [None]:
# Drop rows whose booking type is 'Not defined' and rebuild the index
x_drop = df[df['Type Of Booking'].str.startswith('Not defined', na=False)].index
df = df.drop(x_drop)
df = df.reset_index(drop=True)
df.head()

In [None]:
df.shape

After filtering, 96,410 rows remain; these rows are used in the following stages.

## Check Duplicated

In [None]:
df["id"].duplicated().sum()

## Check data outliers

In [None]:
df.plot(kind = 'box', subplots=True, figsize=(25,20), layout = (5,5))
plt.show()

The box plots above show outliers only in the Checkin/Checkout service ratings. They are few in number and still lie within the 0-5 rating scale, so we leave them as they are and do no outlier handling.
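
Box plots usually flag points that fall more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of counting such points per column follows. The helper `count_iqr_outliers` and the toy ratings are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy ratings on a 0-5 scale (made-up values, not the hotel data)
toy = pd.DataFrame({
    "Checkin/Checkout service": [4, 4, 4, 4, 4, 5, 3, 0],
    "Cleanliness": [1, 2, 3, 4, 5, 3, 2, 4],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

When ratings cluster tightly around one value, as in the first toy column, the interquartile range is narrow and even valid scores get flagged, which is one reason to leave such points untouched.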

## View Distribution of target Label data

In [None]:
plt.rcParams["figure.figsize"] = [6,6]
plt.rcParams["figure.autolayout"] = True
ax = sns.countplot(x="satisfaction", data=df)

for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.25, p.get_height()+0.01))

In [None]:
df.satisfaction.value_counts(normalize=True)

The classes are fairly balanced: 53,229 records are neutral or dissatisfied and 43,181 are satisfied.
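
As a quick check of the balance, the two counts quoted above can be turned into shares. This is plain arithmetic on the reported numbers, not a rerun of the notebook.

```python
counts = {"neutral or dissatisfied": 53_229, "satisfied": 43_181}
total = sum(counts.values())

# Share of each class in the filtered dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```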

## Visualization of categorical data

In [None]:
# sns.set(style='darkgrid')
fig, ax = plt.subplots(2,2, figsize=(14, 12))
sns.countplot(data=df, x='Gender', hue='satisfaction', ax=ax[0][0])
sns.countplot(data=df, x='purpose_of_travel', hue='satisfaction', ax=ax[0][1])
sns.countplot(data=df, x='Type of Travel', hue='satisfaction', ax=ax[1][0])
sns.countplot(data=df, x='Type Of Booking', hue='satisfaction', ax=ax[1][1])
plt.tight_layout();

Based on the visualization above, we can conclude:

1. In the Gender column, the numbers of satisfied and dissatisfied guests are almost the same for both genders.
2. In the purpose of travel column, tourism has the highest counts of both satisfied and dissatisfied guests, likely because tourism trips are the most numerous.
3. In the type of travel column, group trips account for about 40k satisfied guests, while personal trips account for only about 2,500.
4. In the type of booking column, group bookings account for about 34k satisfied guests, in line with group travel, while individual/couple bookings account for only about 9,000, in line with personal travel.

## Breakdown Categorical data for each feature

### Gender

In [None]:
def df_countplot(df, target):
    f, axes = plt.subplots(1, 2, figsize=(15,5))
    ax1 = sns.countplot( x = target, data = df,  ax=axes[0])

    counts = df.groupby([target, 'satisfaction']).size().to_frame('Total')
    counts = counts.reset_index()
    ax2 = sns.barplot(data=counts, y='Total', x=target, hue='satisfaction', ax=axes[1])
    plt.show()
#     return ax1

def pivot_satisfaction(df, target):
    """Count satisfaction labels per value of `target` and add percentage rates."""
    df_rate = pd.pivot_table(
        df[['id', target, 'satisfaction']],
        index=[target],
        columns=['satisfaction'],
        aggfunc="count",
        fill_value=0,
    ).reset_index()

    # Flatten the column index; pivot_table sorts the satisfaction labels alphabetically
    df_rate.columns = [target, 'neutral or dissatisfied', 'satisfied']

    df_rate['total'] = df_rate['neutral or dissatisfied'] + df_rate['satisfied']
    df_rate["satisfaction Rate"] = round((df_rate['satisfied'] / df_rate['total']) * 100, 2)
    df_rate["dissatisfied Rate"] = round((df_rate['neutral or dissatisfied'] / df_rate['total']) * 100, 2)
    return df_rate

In [None]:
df_countplot(df,"Gender")

In [None]:
pivot_satisfaction(df,"Gender")

The shares of "satisfied" and "neutral or dissatisfied" are similar for "Female" and "Male": 44.28% vs. 45.30% satisfied, and 55.71% vs. 54.69% neutral or dissatisfied. "Male" guests are slightly more satisfied with the hotel services than "Female" guests.
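
The same kind of per-group rate can also be obtained with `pd.crosstab`. The sketch below is an alternative to `pivot_satisfaction`, shown on made-up rows rather than the hotel data.

```python
import pandas as pd

# Made-up rows, only to show the shape of the result
toy = pd.DataFrame({
    "Gender": ["Female"] * 4 + ["Male"] * 4,
    "satisfaction": [
        "satisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
        "satisfied", "satisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
    ],
})

# normalize="index" makes each row sum to 1, i.e. a rate within each group
rates = pd.crosstab(toy["Gender"], toy["satisfaction"], normalize="index").mul(100).round(2)
print(rates)
```

Because the rates are computed within each row, groups of very different sizes remain comparable.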

### Purpose of Travel

In [None]:
df_countplot(df,"purpose_of_travel")

In [None]:
pivot_satisfaction(df,"purpose_of_travel")

The graph above shows that "tourism" has the highest count of satisfied guests, although its count of dissatisfied guests is also high. In detail, the satisfaction rate is similar for every purpose of travel, at about 44-45%, with "tourism" the highest. The highest "neutral or dissatisfied" rate belongs to "Aviation", at 55.97%. Overall, the rates are almost the same for every purpose of travel.

### Type of Travel

In [None]:
df_countplot(df,"Type of Travel")

In [None]:
pivot_satisfaction(df,"Type of Travel")

The graph shows something interesting. For "Group Travel", the satisfaction rate is fairly high at 59.34%, the opposite of "Personal Travel", which has a very high "neutral or dissatisfied" rate of 89.64%. This suggests that the hotel is better suited to "Group Travel".

### Type of Booking

In [None]:
df_countplot(df,"Type Of Booking")

In [None]:
pivot_satisfaction(df,"Type Of Booking")

We see the same pattern in "Type Of Booking", which follows "Type of Travel": when the "Type of Travel" is "Group Travel", the "Type Of Booking" is naturally "Group Booking" as well. The satisfaction rate for "Group Booking" reached 69.42%, higher than for "Individual/Couple". One insight is that a "Couple" booking most likely corresponds to "Group Travel" (more than one person). That may be why the "neutral or dissatisfied" level for "Group Travel" is still quite high.

## Numeric data visualization

In [None]:
fig, ax = plt.subplots(5,2, figsize=(14, 20))
sns.countplot(data=df, x='Hotel wifi service', hue='satisfaction', ax=ax[0][0])
sns.countplot(data=df, x='Departure/Arrival  convenience', hue='satisfaction', ax=ax[0][1])
sns.countplot(data=df, x='Ease of Online booking', hue='satisfaction', ax=ax[1][0])
sns.countplot(data=df, x='Hotel location', hue='satisfaction', ax=ax[1][1])
sns.countplot(data=df, x='Food and drink', hue='satisfaction', ax=ax[2][0])
sns.countplot(data=df, x='Other service', hue='satisfaction', ax=ax[2][1])
sns.countplot(data=df, x='Stay comfort', hue='satisfaction', ax=ax[3][0])
sns.countplot(data=df, x='Common Room entertainment', hue='satisfaction', ax=ax[3][1])
sns.countplot(data=df, x='Checkin/Checkout service', hue='satisfaction', ax=ax[4][0])
sns.countplot(data=df, x='Cleanliness', hue='satisfaction', ax=ax[4][1])
plt.tight_layout();

If we look more closely, treating a score of 0-3 as dissatisfied and 4-5 as satisfied, the visualization shows several services with a high number of dissatisfied scores. We take only the top 3:
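
One way to make this ranking explicit is to compute, for each service column, the share of ratings at or below 3 and sort the result. The following is a sketch under that assumption. The helper `low_score_share` and the toy ratings are hypothetical and do not reproduce the notebook's figures.

```python
import pandas as pd

def low_score_share(frame: pd.DataFrame, columns: list, threshold: int = 3) -> pd.Series:
    """Share of ratings <= threshold per column, highest first."""
    return (frame[columns] <= threshold).mean().sort_values(ascending=False)

# Made-up ratings on the 0-5 scale
toy = pd.DataFrame({
    "Hotel wifi service": [1, 2, 3, 5],
    "Cleanliness": [4, 5, 3, 5],
    "Stay comfort": [2, 4, 5, 3],
})
ranking = low_score_share(toy, list(toy.columns))
print(ranking.head(3))
```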

1. Hotel wifi service
2. Common Room entertainment
3. Stay comfort

### Age

In [None]:
facet = sns.FacetGrid(df, hue = 'satisfaction', aspect = 4)
facet.map(sns.kdeplot, "Age", fill=True)  # `fill` replaces the deprecated `shade` argument
facet.add_legend()
plt.show()

plt.rcParams["figure.figsize"] = [8,8]
plt.rcParams["figure.autolayout"] = True
sns.histplot(data = df, x = "Age", kde = True, hue = "satisfaction")

Satisfaction is highest in the 40-60 age range. Guests under 40 and guests over 60 are, on average, dissatisfied.

In [None]:
def age_group(age):
  if age < 40:
    return "Young"
  elif age > 60:
    return "Old"
  else:
    return "Middle"

df['age_group'] = df['Age'].apply(age_group)
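
For larger datasets, the same grouping can be written without a row-by-row `apply`. A minimal vectorized sketch with `np.select` follows, using the same thresholds as `age_group` on made-up ages.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 39, 40, 60, 61, 75], name="Age")  # made-up ages

# Same thresholds as age_group(): < 40 Young, > 60 Old, otherwise Middle
groups = pd.Series(
    np.select([ages < 40, ages > 60], ["Young", "Old"], default="Middle"),
    index=ages.index,
    name="age_group",
)
print(groups.tolist())
```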

In [None]:
df_countplot(df,"age_group")

In [None]:
pivot_satisfaction(df,"age_group")

Next we focus on guests whose age_group is not 'Middle', because this is where there is an opportunity to improve our service.

In [None]:
df_not_mid_age = df[df['age_group'] != 'Middle']
df_not_mid_age.shape

#### Age and hotel wifi service

In [None]:
df_countplot(df_not_mid_age,"Hotel wifi service")

In [None]:
pivot_satisfaction(df_not_mid_age,"Hotel wifi service")

#### Age and common room entertainment

In [None]:
df_countplot(df_not_mid_age,"Common Room entertainment")

In [None]:
pivot_satisfaction(df_not_mid_age,"Common Room entertainment")

#### Age and stay comfort

In [None]:
df_countplot(df_not_mid_age,"Stay comfort")

In [None]:
pivot_satisfaction(df_not_mid_age,"Stay comfort")

# Preprocessing data

## Encode data

### One hot encode

In [None]:
def one_hot_encode(df_, variable, top_x_labels):
    # Add one 0/1 column per label, reading from the frame that was passed in
    for label in top_x_labels:
        df_[variable + '_' + label] = np.where(df_[variable] == label, 1, 0)

# Purpose_of_travel
one_hot = [x for x in df['purpose_of_travel'].value_counts().sort_values(ascending=False).head().index]
one_hot_encode(df,'purpose_of_travel',one_hot)
df = df.drop(['purpose_of_travel'], axis=1)

# Type_of_travel
one_hot = [x for x in df['Type of Travel'].value_counts().sort_values(ascending=False).head().index]
one_hot_encode(df,'Type of Travel',one_hot)
df = df.drop(['Type of Travel'], axis=1)

# Type_of_booking
one_hot = [x for x in df['Type Of Booking'].value_counts().sort_values(ascending=False).head().index]
one_hot_encode(df,'Type Of Booking',one_hot)
df = df.drop(['Type Of Booking'], axis=1)

df = df.drop(['age_group'], axis=1)


df.head()
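
pandas also ships a built-in for this step. Below is a sketch of the same encoding with `pd.get_dummies` on a toy column. It creates one column per label present in the data, so it is an alternative to the helper above rather than a drop-in copy of it.

```python
import pandas as pd

toy = pd.DataFrame({"Type of Travel": ["Group Travel", "Personal Travel", "Group Travel"]})

# dtype=int gives 0/1 columns; names follow the '<column>_<label>' pattern
encoded = pd.get_dummies(toy, columns=["Type of Travel"], dtype=int)
print(encoded)
```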

### Label encoder

In [None]:
# first step
from sklearn.preprocessing import LabelEncoder
df['satisfaction'] = LabelEncoder().fit_transform(df['satisfaction'])

# second step
df['Gender'] = df['Gender'].replace('Female',0)
df['Gender'] = df['Gender'].replace('Male',1)
df['Gender'] = df['Gender'].astype('int')
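
`LabelEncoder` assigns integers in sorted label order, which matters later when reading precision and recall for class 1. This small sketch inspects the mapping on sample labels. The dictionary name `mapping` is only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["satisfied", "neutral or dissatisfied", "satisfied"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so position i in classes_ is encoded as integer i
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```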

In [None]:
df[['Gender','satisfaction']]

In [None]:
# Create correlation matrix
corr_matrix = df.corr()

plt.figure(figsize=(35,40))
sns.set(font_scale=1.5)
sns.heatmap(corr_matrix, annot=True, cmap='Blues',fmt='.2g' )

In [None]:
plt.figure(figsize=(6,12))
sns.heatmap(corr_matrix[['satisfaction']].sort_values(by=['satisfaction'],ascending=False,),annot=True)

# Modeling

In [None]:
# Disable Warnings
import warnings
warnings.filterwarnings("ignore")

# Machine Learning Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier, AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier, plot_importance

# Evaluation
from sklearn import metrics
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.metrics import make_scorer,accuracy_score,roc_auc_score,precision_score,recall_score,f1_score,log_loss
from sklearn.metrics import confusion_matrix

# Train-Test Split
from sklearn.model_selection import train_test_split

# Cross Validation
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold


from sklearn.model_selection import cross_val_score, learning_curve

In [None]:
y= df['satisfaction']
X= df.drop(['satisfaction'], axis=1)

In [None]:
X.shape,y.shape

In [None]:
X.head()

In [None]:
y

In [None]:
# Modelling Algorithms

kf = StratifiedKFold(n_splits=5,shuffle=True,random_state=0)

## Collect all model in one list
all_model = [DecisionTreeClassifier,
            LogisticRegression,
             KNeighborsClassifier,
             GaussianNB,
            RandomForestClassifier,
            GradientBoostingClassifier,
            ExtraTreesClassifier,
             XGBClassifier]

model_name = ['DecisionTreeClassifier',
            'LogisticRegression',
             'KNeighborsClassifier',
             'GaussianNB',
            'RandomForestClassifier',
            'GradientBoostingClassifier',
            'ExtraTreesClassifier',
             'XGBClassifier']
## loop for all model

datatr = []
datasc = []
Recall =[]
Precision =[]
auc =[]

for idx, model_type in enumerate(all_model):
    num = 1
    AccTrain = []
    AccTest = []
    RecallTemp = []
    PrecisionTemp = []
    AucTemp = []
    nfold = 1
    for train_index,test_index in kf.split(X,y):

        print("----------BEFORE------------")
        print("{} Acc Train: {}, {} of KFold {}".format(model_name[idx], AccTrain, nfold, kf.n_splits))
        print("{} Acc Test: {}, {} of KFold {}".format(model_name[idx], AccTest, nfold, kf.n_splits))
        print("{} Recall: {}, {} of KFold {}".format(model_name[idx], RecallTemp, nfold, kf.n_splits))
        print("{} Precision: {}, {} of KFold {}".format(model_name[idx], PrecisionTemp, nfold, kf.n_splits))
        print("{} AUC: {}, {} of KFold {}".format(model_name[idx], AucTemp, nfold, kf.n_splits))
        print("---------------------------")

        # Note: despite the "_scaled" names, no feature scaling is applied here
        X_train_scaled, X_test_scaled = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        model = model_type()
        model.fit(X_train_scaled,y_train)
        y_pred=model.predict(X_test_scaled)

        AccTrain.append(model.score(X_train_scaled , y_train))
        AccTest.append(model.score(X_test_scaled , y_test))
        RecallTemp.append(recall_score(y_test,y_pred))
        PrecisionTemp.append(precision_score(y_test,y_pred))
        AucTemp.append(roc_auc_score(y_test, y_pred))

        print("----------AFTER------------")
        print("{} Acc Train: {}, {} of KFold {}".format(model_name[idx], AccTrain, nfold, kf.n_splits))
        print("{} Acc Test: {}, {} of KFold {}".format(model_name[idx], AccTest, nfold, kf.n_splits))
        print("{} Recall: {}, {} of KFold {}".format(model_name[idx], RecallTemp, nfold, kf.n_splits))
        print("{} Precision: {}, {} of KFold {}".format(model_name[idx], PrecisionTemp, nfold, kf.n_splits))
        print("{} AUC: {}, {} of KFold {}".format(model_name[idx], AucTemp, nfold, kf.n_splits))
        print("---------------------------")

        nfold += 1

    print("----------FINAL------------")
    print("{} Acc Train: {}".format(model_name[idx], np.mean(AccTrain)))
    print("{} Acc Test: {}".format(model_name[idx], np.mean(AccTest)))
    print("{} Recall: {}".format(model_name[idx], np.mean(RecallTemp)))
    print("{} Precision: {}".format(model_name[idx], np.mean(PrecisionTemp)))
    print("{} AUC: {}".format(model_name[idx], np.mean(AucTemp)))
    print("---------------------------")
    datatr.append(np.mean(AccTrain))
    datasc.append(np.mean(AccTest))
    Recall.append(np.mean(RecallTemp))
    Precision.append(np.mean(PrecisionTemp))
    auc.append(np.mean(AucTemp))
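
scikit-learn's `cross_validate` can collect the same per-fold metrics with less bookkeeping. The sketch below runs it on synthetic data with two of the models. `X_demo`, `y_demo`, `candidates`, and `summary` are illustrative names. Note that the built-in `roc_auc` scorer ranks by predicted scores, so its values will not match an AUC computed from hard class predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (not the hotel data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

summary = {}
for name, estimator in candidates.items():
    scores = cross_validate(
        estimator, X_demo, y_demo, cv=cv,
        scoring=["accuracy", "precision", "recall", "roc_auc"],
        return_train_score=True,
    )
    # Mean over the five folds for every train_*/test_* entry
    summary[name] = {
        key: values.mean()
        for key, values in scores.items()
        if key.startswith(("train_", "test_"))
    }

for name, result in summary.items():
    print(name, {key: round(value, 3) for key, value in result.items()})
```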



In [None]:
# Compare the models with each other
data_hasil = pd.DataFrame()
data_hasil['model'] = model_name
data_hasil['Accuracy training'] = datatr
data_hasil['Accuracy test'] = datasc
data_hasil['Precision'] = Precision
data_hasil['Recall']= Recall
data_hasil['AUC']=auc
data_hasil['gap'] = abs(data_hasil['Accuracy training'] - data_hasil['Accuracy test'])
data_hasil.sort_values(by='Accuracy test',ascending=False)

We choose the XGBClassifier model because its training and test results differ only slightly.

# Train Model

In [None]:
# X_train_scaled and y_train still hold the last fold of the cross-validation loop above
model = XGBClassifier().fit(X_train_scaled,y_train)

# training data
y_pred=model.predict(X_train_scaled)
print(classification_report(y_train,y_pred))

In [None]:
# test data
y_pred_test = model.predict(X_test_scaled)
print(classification_report(y_test,y_pred_test))
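
The training and test sets used here are whatever the last cross-validation fold left behind. A sketch of an explicit stratified hold-out split with `train_test_split` follows. It uses synthetic data and `GradientBoostingClassifier` as a stand-in model, both of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (not the hotel data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

Fixing `random_state` makes the split reproducible, so the report can be compared across runs.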

# Confusion Matrix

In [None]:
cm_test = confusion_matrix(y_test, y_pred_test)  # (y_true, y_pred): rows are true labels
print('Confusion Matrix: {}'.format(cm_test))
# visualization
sns.heatmap(cm_test, annot=True, fmt='d', cmap="Blues");
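
Precision and recall can be read straight off a binary confusion matrix. The sketch below uses made-up labels and assumes class 1 means "satisfied".

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]  # made-up labels: 1 = satisfied
y_hat = [0, 0, 0, 1, 1, 1, 1, 0]   # made-up predictions

# scikit-learn's convention: rows are true classes, columns are predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted "satisfied", how many really are
recall = tp / (tp + fn)     # of the truly "satisfied", how many were found
print(tn, fp, fn, tp, precision, recall)
```

A false positive here is a guest predicted as satisfied who is not, which is the error that precision penalizes.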

# Feature Importance

We want to see the feature importances of the model we chose.

In [None]:
# view the feature scores
feature_scores = pd.Series(model.feature_importances_, index=X_train_scaled.columns).sort_values(ascending=False)
feature_scores

In [None]:
importances = pd.Series(data=model.feature_importances_,
                        index= X_train_scaled.columns)

importances_sorted = importances.sort_values()
plt.figure(figsize=(15,8))
importances_sorted.plot(kind='barh')
plt.title('Feature Importances')
plt.show()

# Conclusion

Based on the insights we obtained:

- Five customer characteristics are associated with a potential “neutral or dissatisfied” rating: gender “Female”, purpose of travel “tourism”, type of travel “Group Travel”, type of booking “Individual/Couple”, and age group “Young” (age < 40).

- Three service factors most significantly affect customer satisfaction: Hotel wifi service, Common Room entertainment, and Stay comfort.

Our strategy is therefore to focus on these 3 main factors to raise customer satisfaction ratings for hotel services, especially among customers who are potentially neutral/dissatisfied. In addition, we recommend offering attractive promos to customers who book in the individual/couple category.

From the data mining process through modeling and evaluation, the best model for predicting whether a customer is satisfied or neutral/dissatisfied with hotel services is the XGBoost Classifier (XGB). In choosing the best model, our first criterion is the highest training and test accuracy without overfitting (accuracy is a reasonable metric here because the dataset is fairly balanced). We also look at the precision score, which is highest at 0.92, because we want to limit false positives (cases where the model predicts "customer is satisfied" when the customer is in fact not satisfied).

# Help Desk

If you have any questions about the code above, you can contact us on LinkedIn:

- https://www.linkedin.com/in/fauzi-reza-a4513a141/
- https://www.linkedin.com/in/aldanestitalentapakpahan/
- https://www.linkedin.com/in/brian-laurensz-21a7a6228/
- https://www.linkedin.com/in/yusfiflo/
- https://www.linkedin.com/in/heru-pratiknyo/