Over the past two weeks we worked through an end-to-end case to give you a broad overview of what a data science project can look like using the CRISP-DM framework. From now on we'll take a slower, more methodical approach so that you are capable of taking on such a project yourself.

<img src="images/Logo_UCLL_ENG_RGB.png" style="background-color:white;" />

# Data Analytics & Machine learning

Lecturers: Aimée Lynn Backiel, Chidi Nweke, Daan Nijs

Academic year 2023-2024

## Lab 7 Part 2: End-to-end exercise + classification

### Lecture outline

1. One large exercise covering all we have done so far in this course applied to classification.

### Recap of last lecture(s)

#### Lab 1

1. We ensured we had a valid Python installation.
2. We learnt what a virtual environment is:
   * Isolated Python executable and packages.
   * We created a virtual environment.
3. Absolute path vs relative path recap.
4. Recap of data structures in Python

#### Lab 2
1. Installed Pandas
2. Learnt how to read data
3. Learnt how to calculate mean, mode, median etc.
4. Basic exploration of the 4 variables

#### Lab 3
1. Wrapped up computing summary statistics (mean, median, mode, ...)
2. Learnt how to deal with outliers 
3. Focused on exploration of data

#### Lab 4
1. Univariate data visualization using Matplotlib
   1. Figures and axes
   2. Histograms
   3. Box plots
   4. Bar charts
2. Multivariate data visualization using Seaborn
   1. Scatter plots
   2. Small multiples
   3. Color coding

#### Lab 5
1. Intro to machine learning using scikit-learn
   1. Preprocessing
      1. One Hot encoding
      2. Scaling
      3. Outliers
   2. Regression

#### Lab 6
1. Preprocessing with scikit-learn
   1. ColumnTransformer: Apply a transformation to specific columns.
   2. Pipeline: Apply several transformations one after another.
2. Evaluation:
   1. Why the mean of the error is a bad idea
   2. Mean absolute error
   3. Mean squared error

#### Lab 7
1. Feature engineering
   1. Binning
   2. Interactions
   3. Custom features
2. Rounding up model evaluation
   1. Cross validation
   2. Hyperparameter tuning

#### Our next case: Hotel booking dataset

source: https://www.sciencedirect.com/science/article/pii/S2352340918315191

*This data article describes two datasets with hotel demand data. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Both hotels are located in Portugal: H1 at the resort region of Algarve and H2 at the city of Lisbon.*

The goal is to help the two hotels maximize their revenue.



|variable                       |class     |description |
|:------------------------------|:---------|:-----------|
|hotel                          |character | Hotel (H1 = Resort Hotel or H2 = City Hotel) |
|is_canceled                    |double    | Value indicating if the booking was canceled (1) or not (0) |
|lead_time                      |double    | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
|arrival_date_year              |double    | Year of arrival date|
|arrival_date_month             |character | Month of arrival date|
|arrival_date_week_number       |double    | Week number of year for arrival date|
|arrival_date_day_of_month      |double    | Day of arrival date|
|stays_in_weekend_nights        |double    | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
|stays_in_week_nights           |double    |  Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel|
|adults                         |double    | Number of adults|
|children                       |double    | Number of children|
|babies                         |double    |Number of babies |
|meal                           |character | Type of meal booked. Categories are presented in standard hospitality meal packages: <br> Undefined/SC – no meal package;<br>BB – Bed & Breakfast; <br> HB – Half board (breakfast and one other meal – usually dinner); <br> FB – Full board (breakfast, lunch and dinner) |
|country                        |character | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
|market_segment                 |character | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
|distribution_channel           |character | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
|is_repeated_guest              |double    | Value indicating if the booking name was from a repeated guest (1) or not (0) |
|previous_cancellations         |double    | Number of previous bookings that were cancelled by the customer prior to the current booking |
|previous_bookings_not_canceled |double    | Number of previous bookings not cancelled by the customer prior to the current booking |
|reserved_room_type             |character | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
|assigned_room_type             |character | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
|booking_changes                |double    | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation|
|deposit_type                   |character | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:<br>No Deposit – no deposit was made;<br>Non Refund – a deposit was made in the value of the total stay cost;<br>Refundable – a deposit was made with a value under the total cost of stay. |
|agent                          |character | ID of the travel agency that made the booking |
|company                        |character | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
|days_in_waiting_list           |double    | Number of days the booking was in the waiting list before it was confirmed to the customer |
|customer_type                  |character | Type of booking, assuming one of four categories:<br>Contract - when the booking has an allotment or other type of contract associated to it;<br>Group – when the booking is associated to a group;<br>Transient – when the booking is not part of a group or contract, and is not associated to other transient booking;<br>Transient-party – when the booking is transient, but is associated to at least other transient booking|
|adr                            |double    | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
|required_car_parking_spaces    |double    | Number of car parking spaces required by the customer |
|total_of_special_requests      |double    | Number of special requests made by the customer (e.g. twin bed or high floor)|
|reservation_status             |character | Reservation last status, assuming one of three categories:<br>Canceled – booking was canceled by the customer;<br>Check-Out – customer has checked in but already departed;<br>No-Show – customer did not check-in and did inform the hotel of the reason why |
|reservation_status_date        |double    | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel|





In [None]:
import pandas as pd

In [None]:
h1 = pd.read_csv("data/H1.csv")
h1.head()

In [None]:
h2 = pd.read_csv("data/H2.csv")
h2.head()

#### ❓ The first thing we need to decide is what we want to do with this dataset. We have many different variables and thus things we can do. What would you do with this dataset?

The first goal, finding out who cancels, means **predicting a categorical value**. `IsCanceled` is stored as a number, but in our dataset it should be regarded as a discrete category (canceled or not canceled). This is called **classification**. When doing classification we will need slightly different machine learning models but, most importantly, we will need different evaluation metrics. The full details are discussed in the theory sessions.
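To make the idea of a numeric column that is really a category concrete, here is a minimal sketch on toy data. The column name `IsCanceled` is assumed to match the CSV files, and the label mapping is only illustrative.

```python
import pandas as pd

# Toy stand-in for the booking data (values are made up for illustration).
bookings = pd.DataFrame({"IsCanceled": [0, 1, 0, 0, 1, 0, 0, 0]})

# The 0/1 values are labels, not quantities, so we map them to readable names.
labels = bookings["IsCanceled"].map({0: "not canceled", 1: "canceled"})

# Share of each class. A strong imbalance can make plain accuracy misleading.
class_share = labels.value_counts(normalize=True)
print(class_share)
```

Looking at the class shares early helps when choosing an evaluation metric later on.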

We will start off by exploring the dataset using Pandas exclusively and move on to some plots later on. The choice of which tool to use is yours.

In [None]:
h1["hotel"] = "Resort hotel"
h2["hotel"] = "City hotel"
hotel_df = pd.concat([h1, h2], ignore_index=True)

In [None]:
hotel_df

### Data understanding

Business people might ask questions like:

Are we booking rooms correctly in our system? Do all bookings have guests?

Where do the guests come from?

How much do they pay per night?

Does the price vary per time of the year?

When are the hotels the busiest?

How long do customers stay? Are there any noticeable differences?

When and for which type of customers do the biggest cancellations happen?

We will try and answer all of these.

In [None]:
# ❓ Do we have null values?

In [None]:
# ❓ What would you do with the missing values. (hint: df.fillna)
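As a hint for the two questions above, the sketch below counts missing values and fills them on a small toy frame. The column names and the chosen fill values are assumptions for illustration; what makes sense for the real data is up to you.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (made-up values, CamelCase names as used in this notebook).
toy = pd.DataFrame({
    "Children": [0.0, np.nan, 2.0, 1.0],
    "Country": ["PRT", None, "GBR", "PRT"],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One possible choice: 0 for a missing count, an explicit label for a missing category.
filled = toy.fillna({"Children": 0, "Country": "Unknown"})
print(filled)
```

Passing a dictionary to `fillna` lets each column get its own replacement value.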


In [None]:
# ❓ Do all bookings have guests? 



In [None]:
# ❓ How much do they pay per night?


In [None]:
# ❓ Where do the guests come from?


In [None]:
# ❓ Does the price vary per time of the year?



In [None]:
# ❓ When are the hotels the busiest?



In [None]:
# ❓ How long do customers stay? Are there any noticeable differences?



In [None]:
# ❓ Analyze the bookings per market segment



In [None]:
# ❓ When and for which type of customers do the biggest cancellations happen?
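One way to approach this kind of question is a grouped cancellation rate. The sketch below uses toy data; the column names `CustomerType` and `IsCanceled` are assumed to match the dataset.

```python
import pandas as pd

# Toy bookings (made-up values).
toy = pd.DataFrame({
    "CustomerType": ["Transient", "Transient", "Group", "Contract", "Transient", "Group"],
    "IsCanceled": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the cancellation rate; count gives the number of bookings.
summary = (
    toy.groupby("CustomerType")["IsCanceled"]
    .agg(cancel_rate="mean", bookings="count")
    .sort_values("cancel_rate", ascending=False)
)
print(summary)
```

Keeping the booking count next to the rate shows whether a high rate rests on only a handful of bookings.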



#### Data exploration with plots

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import numpy as np

In [None]:
# Combine year and month name into a single label, e.g. "2015July"
hotel_df["YearMonth"] = hotel_df["ArrivalDateYear"].astype(str) + hotel_df["ArrivalDateMonth"]

hotel_df["YearMonth"]
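A text label such as "2015July" sorts alphabetically, not chronologically. If you want a time axis in the right order, one option is to parse year and month into a real date. This is a sketch on toy data; it assumes the month column holds full English month names, and `YearMonthDate` is a hypothetical column name.

```python
import pandas as pd

# Toy data (made-up rows).
toy = pd.DataFrame({
    "ArrivalDateYear": [2016, 2015, 2016],
    "ArrivalDateMonth": ["January", "July", "August"],
})

# Plain string concatenation: sorting puts "2016August" before "2016January".
as_text = toy["ArrivalDateYear"].astype(str) + toy["ArrivalDateMonth"]

# Parsing to datetime: %Y is the four-digit year, %B the full month name.
toy["YearMonthDate"] = pd.to_datetime(
    toy["ArrivalDateYear"].astype(str) + "-" + toy["ArrivalDateMonth"],
    format="%Y-%B",
)

# Sorting on the datetime column gives chronological order.
chronological = toy.sort_values("YearMonthDate")["ArrivalDateMonth"].tolist()
print(chronological)
```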

In [None]:
# ❓ Make a plot that shows the bookings over time


In [None]:
# ❓ Make a line plot that shows the bookings over time. Divide into cancellations and not cancelled bookings.


In [None]:
# ❓ Where do the guests come from?

In [None]:
# Extra, just the top 10

fig, ax = plt.subplots(figsize=(24,24))
sns.countplot(data=hotel_df, x="Country", order = hotel_df['Country'].value_counts().iloc[:10].index, ax=ax)
ax.tick_params(axis='x', rotation=90)


In [None]:
# Extra extra

countryData = hotel_df.groupby("Country", as_index=False).size()

countryData.rename(columns={"size": "Number of Guests"}, inplace=True)
total_guests = countryData["Number of Guests"].sum()
countryData["Guests in %"] = np.round(countryData["Number of Guests"] / total_guests * 100, 2)

guest_map = px.choropleth(countryData,
                    locations=countryData.Country,
                    color=countryData["Guests in %"],
                    hover_name=countryData.Country, 
                    title="Home country of guests")
guest_map.show()

In [None]:
# ❓ How much do they pay per night (hint distribution of ...)


In [None]:
# ❓ How much do they pay per night (hint distribution of ...)
# Hint, only use the customers that didn't cancel


In [None]:
# ❓ Does the price vary per time of the year? (hint: distribution of ... per month or per year month)
# Hint, only use the customers that didn't cancel


In [None]:
# ❓ Is there a difference in daily rate for both hotels?



In [None]:
# ❓ How long do customers stay? Are there any noticeable differences? (hint: you can analyze this variable over YearMonth)
# Hint, only use the customers that didn't cancel


In [None]:
# ❓ What variables are correlated with cancellations? (hint use numeric_only=True when making the correlations)
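The hint about `numeric_only=True` can be sketched as follows on toy data. The column names are assumed to match the dataset and the values are made up.

```python
import pandas as pd

# Toy data with one text column, which numeric_only=True leaves out.
toy = pd.DataFrame({
    "IsCanceled": [0, 0, 1, 1, 0, 1],
    "LeadTime": [5, 10, 200, 150, 20, 300],
    "TotalOfSpecialRequests": [2, 1, 0, 0, 3, 0],
    "Meal": ["BB", "HB", "BB", "BB", "FB", "BB"],
})

# Correlation of every numeric column with the target, sorted from negative to positive.
correlations = (
    toy.corr(numeric_only=True)["IsCanceled"]
    .drop("IsCanceled")
    .sort_values()
)
print(correlations)
```

Keep in mind that this only measures linear relationships between numeric columns; categorical variables need a different kind of analysis.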


In [None]:
# ❓ What does the evolution of bookings per year look like?

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer # missing values
from sklearn.preprocessing import StandardScaler # Scaling the data to have a mean of 0 and a standard deviation of 1 
from sklearn.pipeline import make_pipeline # Composing multiple steps behind each other
from sklearn.compose import make_column_transformer # Applying a transformer to a subset of columns
from sklearn.preprocessing import OneHotEncoder # Turning categorical data to numeric by making a column for each unique value and denoting it as 0 or 1
from sklearn.model_selection import train_test_split

In [None]:
num_features = ["LeadTime","ArrivalDateWeekNumber","ArrivalDateDayOfMonth",
                "StaysInWeekendNights","StaysInWeekNights","Adults","Children",
                "Babies","PreviousCancellations",
                "PreviousBookingsNotCanceled", "ADR"]

cat_features = ["hotel","ArrivalDateMonth","Meal","MarketSegment", "IsRepeatedGuest", "RequiredCarParkingSpaces", "TotalOfSpecialRequests",
                "DistributionChannel","ReservedRoomType","DepositType","CustomerType"]

### ❓ Which variable would be a bad idea to one-hot encode?
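A quick way to reason about this is to count unique values per column, because one-hot encoding creates one output column per unique value. The sketch below uses two hypothetical columns (`SmallCategory` and `LargeCategory`) that are not part of the hotel data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy columns: one with few unique values, one with many.
toy = pd.DataFrame({
    "SmallCategory": ["a", "b", "a", "c", "a", "b"],
    "LargeCategory": ["id1", "id2", "id3", "id4", "id5", "id6"],
})

# Number of unique values per column.
unique_counts = toy.nunique()
print(unique_counts)

# Each unique value becomes its own 0/1 column after encoding.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy)
print(encoded.shape)
```

The same `nunique()` check can be run on the candidate categorical columns of the real dataset.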


In [None]:
# Divide the data into X and y


In [None]:
# Make a pipeline for the categorical values


In [None]:
# Make a pipeline for the numeric values



In [None]:
# compose the two pipelines you made
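As a reminder of how the pieces from Lab 6 fit together, here is a minimal sketch on toy data. The imputation strategies are assumptions chosen for illustration, and only two columns are used; your own pipelines should cover `num_features` and `cat_features`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numeric and one categorical column, both with a gap.
toy = pd.DataFrame({
    "LeadTime": [10.0, np.nan, 30.0, 50.0],
    "Meal": ["BB", "HB", np.nan, "BB"],
})

# Numeric columns: fill gaps with the median, then scale.
numeric_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Route each pipeline to its own columns.
preprocessor = make_column_transformer(
    (numeric_pipeline, ["LeadTime"]),
    (categorical_pipeline, ["Meal"]),
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # one scaled column plus one column per meal category
```

The `preprocessor` can itself be placed in front of a model with `make_pipeline`, so that preprocessing is refit inside every cross-validation fold.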



In [None]:
# Split the data into training and testing


In [None]:
from sklearn.linear_model import LogisticRegressionCV # Logistic regression (a linear model for classification) with built-in cross-validation of the regularization strength
from sklearn.tree import DecisionTreeClassifier # A single decision tree
from sklearn.ensemble import HistGradientBoostingClassifier # Fits many decision trees sequentially. Each tree is trained to improve on the performance of all the previous trees.
from sklearn.ensemble import RandomForestClassifier # Fits many decision trees in parallel with some added randomness. The final prediction combines the predictions of all trees.


### ❗ Make sure to use n_jobs = -1 when doing cross validation. Training these models can take a while.
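The sketch below shows where `n_jobs=-1` goes in `cross_val_score`. It runs on synthetic data from `make_classification`, not on the hotel data, and the model settings are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# n_jobs=-1 asks scikit-learn to use all available CPU cores for the folds.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy", n_jobs=-1)
print(scores.mean())
```

With five folds, up to five models are trained at the same time, which shortens the wait on larger datasets.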

### ❓ What evaluation metrics do you already know for classification models? Which one will you use and why?
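To compare some common classification metrics, here is a small sketch with hand-made labels and predictions. It is not a recommendation for which metric to use; that choice depends on the business question.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = canceled, 0 = not canceled.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # of the predicted cancellations, how many were real
recall = recall_score(y_true, y_pred)        # of the real cancellations, how many were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Here the accuracy is 0.7 even though only half of the actual cancellations are detected, which illustrates why accuracy alone can hide poor performance on the class you care about.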

In [None]:
# Evaluate logistic regression



In [None]:
# Evaluate decision tree


In [None]:
# Evaluate gradient boosting


In [None]:
# Evaluate random forest


### ❓ Motivate which model is the best and use it on the held-out test data


In [None]:
# from sklearn.metrics import <your chosen evaluation metric>

In [None]:
# Fit the final model


In [None]:
# Make your predictions

In [None]:
# Calculate the final score using your chosen evaluation metric
