# Use PyTorch to Predict Hotel Cancellations

- [View Solution Notebook](./solutions.html)
- [View Project Page](https://www.codecademy.com/)

**Setup - Import libraries**

In [1]:
# Import the libraries used throughout the project
import pandas as pd
import numpy as np

## Task Group 1 - Import and Inspect

The file `'datasets/resort_hotel_bookings.csv'` contains a subset of a [real-world dataset](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand) containing reservation and cancellation data for a resort hotel.

Your goal in this project is to build and train a neural network to predict whether a customer will cancel their hotel reservation, based on data including the booking dates, average daily cost, number of adults/children/babies, duration of stay, and so forth.

### Task 1

Begin by importing the CSV file to a pandas DataFrame named `hotels`.

Preview the first five rows using the `.head()` method.

In [None]:
"""
    -> we are importing and inspecting the dataset, which is a CSV file of hotel booking data
    -> the goal is a network that predicts whether customers will cancel their bookings or not
    -> pd.read_csv() loads the CSV file into a DataFrame
    -> the .head() method is then used to preview the first 5 rows
"""

hotels = pd.read_csv("datasets/resort_hotel_bookings.csv")
hotels.head()

| | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


<details><summary style="display:list-item; font-size:16px; color:blue;">Here's a quick summary of the columns</summary>

- **is_canceled**: Whether the booking was canceled (1) or kept (0)
- **lead_time**: Number of days between booking date and arrival date
- **arrival_date_year**: Year of arrival date
- **arrival_date_month**: Month of arrival date
- **arrival_date_week_number**: Week number of arrival date
- **arrival_date_day_of_month**: Day of the month of arrival date
- **stays_in_weekend_nights**: Number of weekend nights booked (Sat-Sun)
- **stays_in_week_nights**: Number of weekday nights booked (Mon-Fri)
- **adults**: Number of adults
- **children**: Number of children
- **babies**: Number of babies
- **meal**: Type of meal booked (Undefined/SC, BB, HB, or FB)
- **country**: Country of origin of the booker
- **market_segment**: Market segment (TA - travel agent, TO - tour operators)
- **distribution_channel**: Booking distribution channel (TA - travel agent, TO - tour operators)
- **is_repeated_guest**: Is this a repeated guest (1) or not (0)
- **previous_cancellations**: The number of previous bookings canceled by the customer
- **previous_bookings_not_canceled**: The number of previous bookings not canceled by the customer
- **reserved_room_type**: Room type reserved
- **assigned_room_type**: Type of assigned room booked
- **booking_changes**: Number of booking changes or modifications
- **deposit_type**: Type of deposit to guarantee booking (No Deposit, Non Refund, or Refundable)
- **agent**: ID of the travel agency that made the booking
- **company**: ID of the company that made the booking
- **days_in_waiting_list**: Number of days booking was waitlisted before confirmation
- **customer_type**: The customer type of booking (Contract, Group, Transient, or Transient-party)
- **adr**: The average daily rate (cost) of the booking
- **required_car_parking_spaces**: Number of parking spaces requested by the customer
- **total_of_special_requests**: Number of special requests by the customer
- **reservation_status**: The last reservation status (Canceled, Check-Out, No-Show)
- **reservation_status_date**: The date of the last reservation status

</details>

### Task 2

Let's explore the data types and whether any data is missing.

Use the `.info()` method on the `hotels` DataFrame to inspect the data.

In [3]:
"""
    -> we are still inspecting the dataset, this time its data types and missing values
    -> the .info() method prints metadata about each column (non-null count and dtype)
    -> it is called on the DataFrame which stores the CSV data
"""

hotels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40060 entries, 0 to 40059
Data columns (total 31 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   is_canceled                     40060 non-null  int64  
 1   lead_time                       40060 non-null  int64  
 2   arrival_date_year               40060 non-null  int64  
 3   arrival_date_month              40060 non-null  object 
 4   arrival_date_week_number        40060 non-null  int64  
 5   arrival_date_day_of_month       40060 non-null  int64  
 6   stays_in_weekend_nights         40060 non-null  int64  
 7   stays_in_week_nights            40060 non-null  int64  
 8   adults                          40060 non-null  int64  
 9   children                        40060 non-null  float64
 10  babies                          40060 non-null  int64  
 11  meal                            40060 non-null  object 
 12  country                         

<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the dataset under inspection?</summary>

There are 31 columns and 40,060 total observations in our dataset. The majority of columns do not have missing values.

However, we do notice that: 
- the `agent` and `company` columns seem to have missing values that need to be addressed
- the `country` column has a couple of missing values as well

There are a variety of data types represented. To work with a neural network, we'll have to address any non-numeric columns in our data preparation.

</details>
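
To quantify missing values instead of reading them off the `.info()` output, one option is to count nulls per column. The sketch below runs on a tiny made-up DataFrame named `toy` (its values are illustrative and not taken from the real file). With the project data, the same calls could be made on `hotels`.

```py
import numpy as np
import pandas as pd

# Toy stand-in for `hotels` (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "country": ["PRT", "GBR", None, "PRT"],
    "agent": [np.nan, 304.0, 240.0, np.nan],
    "adr": [0.0, 75.0, 75.0, 98.0],
})

# Count missing values per column, keeping only columns that have any
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Columns with a large share of missing values are natural candidates to drop later, while columns with only a few could instead be filled or have the affected rows removed.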

### Task 3

Let's now explore the cancellation column we want to predict.

Use the `.value_counts()` method on the `is_canceled` column to count the number **and** the percentage of overall cancellations. 

In [4]:
"""
    -> we are now inspecting the column we want to predict: whether each booking was canceled
    -> the .value_counts() method counts how many times each value appears in a column
    -> passing normalize=True returns proportions instead of raw counts
"""

# Number of cancellations
print(hotels['is_canceled'].value_counts())

# Percentage of cancellations (as a proportion between 0 and 1)
print(hotels['is_canceled'].value_counts(normalize=True))

is_canceled
0    28938
1    11122
Name: count, dtype: int64
is_canceled
0    0.722366
1    0.277634
Name: proportion, dtype: float64


<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the number of cancellations?</summary>

The number of cancellations is much lower than the number of non-cancellations (27.8% canceled vs 72.2% did not cancel). 

We'll need to take this imbalance into account when we evaluate our model. For example, a naive model could simply predict every booking will **not be canceled** and achieve a decent accuracy of 72.2%.

</details>
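
The baseline figure can be reproduced with a few lines of arithmetic. This is a minimal sketch that hard-codes the two class counts printed by `.value_counts()` in this task. The variable names are illustrative.

```py
# Class counts copied from the value_counts() output for is_canceled
not_canceled = 28938
canceled = 11122
total = not_canceled + canceled

# A model that always predicts "not canceled" is right for every kept booking
baseline_accuracy = not_canceled / total

# ...but it never identifies a single cancellation
baseline_recall = 0 / canceled

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall on cancellations: {baseline_recall:.4f}")
```

Any trained model should beat this accuracy, and its recall on cancellations shows whether it is doing more than echoing the majority class.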

### Task 4

The `reservation_status` column tells us if the booking was canceled while also telling us if the customer was a no-show.

We need to be sure to exclude this column from the training features. Otherwise the outcome we are trying to predict will be _leaked_ to our model, resulting in misleadingly high performance.

First, let's take a quick look at the values in this column.

Use the `.value_counts()` method on the `reservation_status` column to count the number **and** the percentage of bookings with each reservation status.

In [5]:
"""
	-> reservation_status reveals the outcome we want to predict, so it must not be used as a training feature
	-> before excluding it, we inspect the values it contains
	-> .value_counts() gives the count of each status, and normalize=True gives the proportions
"""

# Number of bookings per reservation status
print(hotels['reservation_status'].value_counts())

# Proportion of bookings per reservation status
print(hotels['reservation_status'].value_counts(normalize=True))

reservation_status
Check-Out    28938
Canceled     10831
No-Show        291
Name: count, dtype: int64
reservation_status
Check-Out    0.722366
Canceled     0.270369
No-Show      0.007264
Name: proportion, dtype: float64


<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the reservation_status column?</summary>

The number of no-shows is extremely small, accounting for only 291 (or 0.7%) of the observations in the dataset.

Later on, we'll look at creating a multiclass model to predict no-show in addition to canceled.

</details>
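
One way to see why `reservation_status` would leak the answer is to compare its counts with those of `is_canceled`. The sketch below only uses the totals printed in Tasks 3 and 4. Matching totals are consistent with every Canceled or No-Show booking being flagged as canceled, although a row-by-row check on `hotels` would be needed to confirm it.

```py
# Counts copied from the two value_counts() outputs in Tasks 3 and 4
status_counts = {"Check-Out": 28938, "Canceled": 10831, "No-Show": 291}
is_canceled_counts = {0: 28938, 1: 11122}

# If every Canceled or No-Show booking is flagged as canceled,
# the totals should line up exactly
implied_canceled = status_counts["Canceled"] + status_counts["No-Show"]
implied_kept = status_counts["Check-Out"]

print(implied_canceled == is_canceled_counts[1])
print(implied_kept == is_canceled_counts[0])
```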

### Task 5

Before diving into building a model, let's continue to explore the dataset. It's important to understand how different columns interact with cancellations to guide our model structure! 

For example, cancellations might be higher in the summer months (June - September) and lower in the winter months (November - January).

Use the `.groupby()` method to group the data by the `arrival_date_month` column and apply the `.mean()` aggregation function on the `is_canceled` column. Because `is_canceled` only contains 0s and 1s, this returns the proportion of reservations canceled in each month.

Then, use the `.sort_values()` method to sort the percentages from lowest to highest.

In [6]:
"""
	-> inspecting trends in the data before building the model
	-> the .groupby() method groups the bookings by arrival month
	-> the .mean() of the 0/1 is_canceled column then gives the cancellation rate for each month
	-> trends like this can guide which features we use and how we structure the model
	-> the .sort_values() method sorts the rates in ascending order
"""

cancellations_by_month = hotels.groupby('arrival_date_month')['is_canceled'].mean()
cancellations_by_month.sort_values()

arrival_date_month
January      0.148199
November     0.189167
March        0.228717
December     0.238293
February     0.256204
October      0.275105
May          0.287721
April        0.293433
July         0.314017
September    0.323681
June         0.330706
August       0.334491
Name: is_canceled, dtype: float64

<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the percentage of cancellations by month?</summary>

It looks like our intuition was mostly correct! The lowest cancellation rates occur in the cooler months from November through March, while the highest occur from June through September. This information can be very useful for our model!

It might be useful to do more exploratory data analysis to gain additional insights about hotel cancellations. For example, additional analysis may help you select better features to train the model on and exclude features that seem irrelevant. But for now, let's move on to cleaning and preparing the data.

</details>
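
As one small example of further exploration, the same grouped rates can be viewed in calendar order, which can make a seasonal pattern easier to read than a list sorted by rate. The sketch below uses a made-up DataFrame named `toy` with only three months. `month_order` is a hypothetical helper list that would hold all twelve month names for the real data.

```py
import pandas as pd

# Toy bookings (hypothetical values, not from the real file)
toy = pd.DataFrame({
    "arrival_date_month": ["January", "January", "August", "August", "August", "March"],
    "is_canceled": [0, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the fraction of rows equal to 1
rate_by_month = toy.groupby("arrival_date_month")["is_canceled"].mean()

# Reorder by calendar month instead of by rate
month_order = ["January", "March", "August"]
print(rate_by_month.reindex(month_order))
```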

## Task Group 2 - Data Cleaning and Preparation

In this section, we'll encode categorical data for use in our neural networks.

### Task 6

To get a sense of the categorical data in the dataset, let's start by previewing the first five rows of all columns with `object` datatype.

Create a list named `object_columns` containing only the names of the object columns (except for the reservation status columns). Select those columns from `hotels` and preview the first `5` rows.

In [7]:
"""
	-> the data has been imported and inspected
	-> before we build a model, the categorical data needs to be encoded as numbers (or dropped)
	-> object_columns stores the names of the categorical (object) columns
	-> the .head() method previews the first 5 rows of just those columns
"""

object_columns = ['arrival_date_month', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type']
hotels[object_columns].head()

| | arrival_date_month | meal | country | market_segment | distribution_channel | reserved_room_type | assigned_room_type | deposit_type | customer_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | July | BB | PRT | Direct | Direct | C | C | No Deposit | Transient |
| 1 | July | BB | PRT | Direct | Direct | C | C | No Deposit | Transient |
| 2 | July | BB | GBR | Direct | Direct | A | C | No Deposit | Transient |
| 3 | July | BB | GBR | Corporate | Corporate | A | A | No Deposit | Transient |
| 4 | July | BB | GBR | Online TA | TA/TO | A | A | No Deposit | Transient |


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Preview the first five rows subset by the object columns</summary>

Here's how we can subset the DataFrame by the object columns and preview the first five rows:

```py
object_columns = ['arrival_date_month', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type']
hotels[object_columns].head()
```

Additionally, it might be helpful to explore the categorical data in each object column using the `.value_counts()` method.

</details>
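
To follow up on the `.value_counts()` suggestion in the hint, it can help to check how many distinct categories each column has before choosing an encoding. The sketch below uses a small made-up DataFrame named `toy`, and its values are purely illustrative.

```py
import pandas as pd

# Toy categorical columns (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "meal": ["BB", "BB", "HB", "SC"],
    "country": ["PRT", "GBR", "ESP", "FRA"],
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund", "No Deposit"],
})

# Number of distinct categories per column
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# Frequency of each category in a single column
deposit_counts = toy["deposit_type"].value_counts()
print(deposit_counts)
```

A column with very many distinct values would add one new column per value under one-hot encoding, which is one reason such columns are often dropped or grouped first.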

### Task 7

Typically, we don't want to use every column in training. For example, we may want to drop columns with many missing values or columns that are irrelevant to our prediction task.

Drop any columns you don't want to use to train a cancellation model (do not remove the target label column). Feel free to open our Hint to review the columns we chose to drop in our solution.

Note: We don't want to drop the `reservation_status` column from the dataset quite yet because we'll be using this column to train our multiclass neural network.

In [None]:
"""
    -> drop_columns stores the names of the columns which we want to drop
    -> the .drop() method with axis=1 removes those columns, which we don't want to train the model on
    -> the result is assigned back to hotels
"""

drop_columns = ['country', 'agent', 'company', 'reservation_status_date',
                'arrival_date_week_number', 'arrival_date_day_of_month', 'arrival_date_year']

hotels = hotels.drop(labels=drop_columns, axis=1)

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Drop columns in the dataset not used for training.</summary>

Here's a list of potential features to drop. Feel free to experiment on your own by dropping or keeping columns you might believe may contribute to training.

```py
drop_columns = ['country', 'agent', 'company', 'reservation_status_date',
                'arrival_date_week_number', 'arrival_date_day_of_month', 'arrival_date_year']

hotels = hotels.drop(labels=drop_columns, axis=1)
```

Here's why we chose these columns:

- `country` - there are many countries that only appear a handful of times in the dataset which may make our model less generalizable and even discriminate against customers based on their country
- `agent` - similar to `country`, there are many agents that only appear a handful of times which may make our model less generalizable (and there are many missing values!)
- `company` - similar to `agent`, there are many companies that only appear a handful of times which may make our model less generalizable (and there are many missing values!)
- `reservation_status_date` - tells us the date of the latest status change of the reservation which shouldn't be helpful and if anything may leak data
- `arrival_date_week_number` - tells us the week of the year which may be too specific and prone to overfitting
- `arrival_date_day_of_month` - tells us the day of the month which may be too specific and prone to overfitting
- `arrival_date_year` - tells us the year of the booking which may not be helpful to predict future years

</details>

### Task 8

Next, let's encode the `meal` column which tells us which type of meal(s) the customer booked: 

- `Undefined` and `SC` correspond to no meal packages
- `BB` corresponds to breakfast only
- `HB` (half board) corresponds to breakfast + lunch or dinner
- `FB` (full board) corresponds to breakfast, lunch, and dinner.

Label encode the `meal` column with a meaningful order (# of meals booked) using the following scheme:

- `Undefined` and `SC` to `0`
- `BB` to `1`
- `HB` to `2`
- `FB` to `3` 

In [9]:
"""
	-> we have imported the dataset, inspected it and dropped the columns which we don't want
	-> now we are converting the categorical (text) data in the meal column into numbers
	-> this is done with the .replace() method, which takes a dictionary mapping each old value to its new value
"""

hotels['meal'] = hotels['meal'].replace({'Undefined':0, 'SC':0, 'BB':1, 'HB':2, 'FB':3})
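
A possible variation on the encoding step is `.map()`, which differs from `.replace()` in that any value missing from the dictionary becomes `NaN`. That makes unexpected categories easy to detect. This is a sketch on a made-up Series, where `"XX"` is a hypothetical unexpected code and `meal_order` is an illustrative name for the mapping from the task.

```py
import pandas as pd

# Toy column (hypothetical values) including one unexpected code, "XX"
meal = pd.Series(["BB", "SC", "Undefined", "HB", "FB", "XX"])

meal_order = {"Undefined": 0, "SC": 0, "BB": 1, "HB": 2, "FB": 3}

# .map() returns NaN for any value missing from the dictionary
encoded = meal.map(meal_order)
unmapped = meal[encoded.isna()].unique().tolist()

print(encoded.tolist())
print("Unmapped values:", unmapped)
```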

### Task 9

Let's prepare the rest of the categorical columns using one-hot encoding. 

Create a list named `one_hot_columns` containing the list of categorical column names (all the remaining categorical columns) to be one-hot encoded using the `pd.get_dummies()` method.

Preview the cleaned `hotels` DataFrame using the `.head()` method.

In [10]:
"""
	-> a neural network needs numeric inputs, so text categories must be encoded first
	-> we are now encoding the rest of the categorical columns using one-hot encoding
	-> one_hot_columns stores the names of the columns which we want to encode
	-> pd.get_dummies() replaces each of them with one 0/1 column per category
	-> we can check that this has worked by previewing the first 5 rows with the .head() method
"""

one_hot_columns = ['arrival_date_month', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type', 'market_segment']

hotels = pd.get_dummies(hotels, columns=one_hot_columns, dtype=int)

hotels.head()

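| | is_canceled | lead_time | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | is_repeated_guest | previous_cancellations | ... | customer_type_Contract | customer_type_Group | customer_type_Transient | customer_type_Transient-Party | market_segment_Complementary | market_segment_Corporate | market_segment_Direct | market_segment_Groups | market_segment_Offline TA/TO | market_segment_Online TA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 342 | 0 | 0 | 2 | 0.0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 737 | 0 | 0 | 2 | 0.0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 7 | 0 | 1 | 1 | 0.0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 13 | 0 | 1 | 1 | 0.0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 14 | 0 | 2 | 2 | 0.0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |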


Perfect! It looks like we've handled all of the categorical variables and prepared the DataFrame for training.

Note that the cleaned DataFrame now has 67 columns due to the additional columns created using one-hot encoding.
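
To see where the extra columns come from, here is a minimal sketch of one-hot encoding on a made-up three-row DataFrame named `toy` (values are illustrative). One text column with two categories is replaced by two 0/1 columns.

```py
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "lead_time": [342, 7, 13],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit"],
})

# Each distinct category becomes its own 0/1 column named <column>_<category>
encoded = pd.get_dummies(toy, columns=["deposit_type"], dtype=int)
print(encoded)
```

Exactly one of the new columns is 1 in each row, so the original category can always be recovered.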

## Task Group 3 - Create Training and Testing Sets

Next, let's convert our dataset into PyTorch tensors and split them into training and testing sets.

### Task 10

Let's import the necessary PyTorch libraries and modules. 

In [11]:
"""
	-> the dataset was imported, inspected and cleaned
	-> one-hot encoding added extra columns, so there are now 67 in total
	-> next we want to convert the data into PyTorch tensors for training
	-> there will be four of these: X and y, each split into a training part and a testing part
	-> the split between training and testing isn't necessarily 50/50
	-> the first step is to import the PyTorch modules
"""

import torch
import torch.nn as nn
import torch.optim as optim

### Task 11

We need to start by separating our training features from the target labels.

Create a list named `train_features` that contains all of the feature names (column names excluding the target variables `is_canceled` and `reservation_status`).

In [12]:
"""
	-> now, we have an entire cleaned dataset
	-> we are splitting this into target labels and training features
	-> target labels
		-> target, meaning something we want to predict (the target of the model)
		-> labels <- the known outcome for each row, which the model learns to predict
			-> a label is an output of the model, not one of its inputs
		-> these are stored in the target columns
			-> columns which contain the values we want to predict

	-> training features
		-> features <- the remaining columns (factors that go into making the prediction)
		-> training <- the model is trained on them

	-> we are taking the cleaned dataset, and setting aside the columns which we want the model to predict
	-> remove_cols stores the names of those target columns
	-> train_features is built with a list comprehension that keeps every other column name
"""

# Remove target columns
remove_cols = ['is_canceled', 'reservation_status']

# Select training features
train_features = [x for x in hotels.columns if x not in remove_cols]

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Select training features.</summary>

```py
# Remove target columns
remove_cols = ['is_canceled', 'reservation_status']

# Select training features
train_features = [x for x in hotels.columns if x not in remove_cols]
```
 
</details>

### Task 12

Using the list of training features in `train_features`, create `X` and `y` tensors:

- `X` contains the data values from the `train_features` columns
- `y` contains the binary labels in the `is_canceled` column in `hotels`

Both `X` and `y` should have the float datatype.

Be sure to set the correct view of `y` using `.view(-1,1)`

In [13]:
"""
	-> the data is first imported and inspected
	-> the columns which we won't use are then dropped
	-> the categorical data is then converted into numerical data
	-> here the full dataset is split into X (training features) and y (target labels)
	-> the split into training and testing sets happens in the next task
	-> both tensors are created with torch.tensor(), using a float dtype
	-> .view(-1, 1) reshapes y into a single column
"""

X = torch.tensor(hotels[train_features].values, dtype=torch.float)
y = torch.tensor(hotels['is_canceled'].values, dtype=torch.float).view(-1,1)

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Create X and y tensors.</summary>

When creating the tensors, be sure to extract the data values in the specified columns using `.values` as floats:
    
```py
X = torch.tensor(hotels[train_features].values, dtype=torch.float)
y = torch.tensor(hotels['is_canceled'].values, dtype=torch.float).view(-1,1)
```
 
</details>
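
The reason for `.view(-1,1)` is that the labels should have the same two-dimensional shape as the output of a network whose last layer has a single node. A minimal sketch with made-up labels follows. The `nn.Linear(3, 1)` layer is only a stand-in for a model's output layer.

```py
import torch

# Toy labels (hypothetical), shaped like a single DataFrame column
labels = torch.tensor([0, 1, 1, 0], dtype=torch.float)
print(labels.shape)   # one dimension with 4 entries

# -1 asks PyTorch to infer that dimension; 1 fixes a single column
column = labels.view(-1, 1)
print(column.shape)   # 4 rows, 1 column

# A layer with one output node produces this same (rows, 1) shape
layer = torch.nn.Linear(3, 1)
outputs = layer(torch.zeros(4, 3))
print(outputs.shape == column.shape)
```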

### Task 13

Let's now split our data contained in `X` and `y` into training and testing sets.

Import the `train_test_split` function from scikit-learn's `sklearn.model_selection` module.

Split `X` and `y` using the following scheme:
- Use 80% of the data for the training set `X_train` and `y_train`
- Use 20% of the data for the testing set `X_test` and `y_test`
- Set the random state to `42` to match our solution

Print out the shape of `X_train` and `X_test` to see how many observations and columns are in the training and testing sets.

How many training features does our training set `X_train` have?

In [14]:
"""
	-> X and y were created first; now each is split into training and testing sets
	-> train_test_split is first imported from sklearn.model_selection
	-> this is an 80/20 split, created with the train_test_split() function
	-> the .shape attribute is then used to print the shapes of X_train and X_test
"""

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80,
                                                    test_size=0.20,
                                                    random_state=42) 

print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)

Training Shape: torch.Size([32048, 65])
Testing Shape: torch.Size([8012, 65])


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Split the dataset into training and testing splits.</summary>
    
```py
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80,
                                                    test_size=0.20,
                                                    random_state=42) 
print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)
```

It looks like our data was successfully split into 80% training and 20% testing sets. 

Importantly, we see that the number of columns is `65` which corresponds to the number of input nodes (or features) needed in the input layer of our neural network!

</details>
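
Because the classes are imbalanced, an optional variation is a stratified split, which keeps the share of cancellations the same in both sets. The project's solution does not do this. The sketch below is an illustration on synthetic NumPy data (`X_toy` and `y_toy` are made up) using the `stratify` parameter of `train_test_split`.

```py
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 rows with 28% positives (hypothetical, for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 280 + [0] * 720)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```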

## Task Group 4 - Train a Neural Network for Binary Classification

Let's now create a neural network for binary classification to predict hotel cancellations.

### Task 14

Set a random seed to `42` using `torch.manual_seed(42)`.

Build the neural network architecture using `nn.Sequential` with the following:
- input layer with `65` nodes (equal to the number of training features)
- first hidden layer with `36` nodes and a ReLU activation
- second hidden layer with `18` nodes and a ReLU activation
- output layer with `1` node and a Sigmoid activation

Save the network to the variable `model`.

In [15]:
"""
	-> the random seed is first set with torch.manual_seed(), so the initial weights are reproducible
	-> before we train a neural network on the data, we need to define its architecture
	-> this is done with the nn.Sequential container
	-> each argument passed to it is a layer or an activation, applied in order
	-> ReLU and Sigmoid are activation functions
	-> each nn.Linear is a fully connected layer: the first two produce the hidden layers and the last produces the output
"""

torch.manual_seed(42)

model = nn.Sequential(
    nn.Linear(65, 36),
    nn.ReLU(),
    nn.Linear(36, 18),
    nn.ReLU(),
    nn.Linear(18, 1),
    nn.Sigmoid()
)
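
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand and checked against PyTorch. The sketch rebuilds the same architecture under the illustrative name `model_sketch` so that it runs on its own.

```py
import torch.nn as nn

# Same layer sizes as the model in this task
model_sketch = nn.Sequential(
    nn.Linear(65, 36), nn.ReLU(),
    nn.Linear(36, 18), nn.ReLU(),
    nn.Linear(18, 1), nn.Sigmoid(),
)

# Each Linear layer holds (inputs x outputs) weights plus one bias per output
by_hand = (65 * 36 + 36) + (36 * 18 + 18) + (18 * 1 + 1)
counted = sum(p.numel() for p in model_sketch.parameters())
print(by_hand, counted)
```

The activation functions contribute no parameters, so both counts come from the three `nn.Linear` layers alone.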

### Task 15

Next, let's define the loss function and optimizer used for training:
- set the **binary cross-entropy** loss function to the variable `loss`
- set the **Adam** optimizer to the variable `optimizer` with a learning rate of `0.005`

In [16]:
"""
	-> we have the data which has been imported, cleaned and split into four PyTorch tensors
	-> the architecture of the model has also been defined
	-> now we are defining a loss function and optimiser to train the model on this data
	-> the loss function
		-> BCE <- binary cross-entropy
		-> this is created with nn.BCELoss()
	-> the optimiser
		-> this is the Adam optimiser, with a learning rate of 0.005
		-> this is created with optim.Adam(), which is given the model's parameters
"""

loss = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.005)
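
As a sanity check on what binary cross-entropy measures, the value from `nn.BCELoss` can be compared with the formula computed by hand. This is a minimal sketch with two made-up predictions. The tensors `probs` and `targets` are illustrative and unrelated to the hotel data.

```py
import math
import torch
import torch.nn as nn

# Toy predicted probabilities and true labels (hypothetical values)
probs = torch.tensor([[0.9], [0.2]])
targets = torch.tensor([[1.0], [0.0]])

# BCE averages -[y*log(p) + (1-y)*log(1-p)] over the observations
by_hand = -(math.log(0.9) + math.log(1 - 0.2)) / 2
from_torch = nn.BCELoss()(probs, targets).item()

print(f"{by_hand:.4f} {from_torch:.4f}")
```

Confident predictions that match the label contribute little to the loss, while confident predictions that are wrong are penalized heavily.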

### Task 16

Let's build the training loop to train our neural network.

Train the neural network for `1000` epochs.

Keep track of the training performance by printing out the binary cross-entropy loss and accuracy score every `100` epochs.

Before calculating accuracy, convert the model's predicted probabilities to binary labels (as integers) using `0.5` as the threshold.

In [None]:
"""
	-> we are training the model on the training set for 1,000 epochs
		-> this is done with a for loop
	-> an epoch is one full pass of the model over the training data
	-> every 100 epochs, the binary cross-entropy loss and accuracy score are printed
		-> the model's predicted probabilities are converted to binary labels (as integers) for this
	-> accuracy_score is first imported from sklearn.metrics
	-> in each epoch
		-> calling model(X_train) makes predictions, which are stored in `predictions`
		-> the loss of these predictions is calculated with loss()
		-> .backward() computes the gradients, optimizer.step() updates the weights and optimizer.zero_grad() resets the gradients
	-> every 100 epochs
		-> the accuracy of the predictions is calculated with the accuracy_score() function
		-> the loss and accuracy are printed using an f-string
"""

from sklearn.metrics import accuracy_score

num_epochs = 1000
for epoch in range(num_epochs):
    predictions = model(X_train)
    BCELoss = loss(predictions, y_train)
    BCELoss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
    if (epoch + 1) % 100 == 0:
        predicted_labels = (predictions >= 0.5).int()
        accuracy = accuracy_score(y_train, predicted_labels)
        # accuracy_score returns a plain number, so it can be formatted directly
        print(f'Epoch [{epoch+1}/{num_epochs}], BCELoss: {BCELoss.item():.4f}, Accuracy: {accuracy:.4f}')

Epoch [100/1000], BCELoss: 0.3975, Accuracy: 0.8224
Epoch [200/1000], BCELoss: 0.3625, Accuracy: 0.8297
Epoch [300/1000], BCELoss: 0.3537, Accuracy: 0.8347
Epoch [400/1000], BCELoss: 0.3442, Accuracy: 0.8377
Epoch [500/1000], BCELoss: 0.3399, Accuracy: 0.8395
Epoch [600/1000], BCELoss: 0.3339, Accuracy: 0.8409
Epoch [700/1000], BCELoss: 0.3353, Accuracy: 0.8426
Epoch [800/1000], BCELoss: 0.3315, Accuracy: 0.8414


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Keep track of the training loss and accuracy.</summary>

    
Here's how to print the accuracy and BCE loss every 100 epochs during training:
    
```py
if (epoch + 1) % 100 == 0:
    predicted_labels = (predictions >= 0.5).int()
    accuracy = accuracy_score(y_train, predicted_labels)
    print(f'Epoch [{epoch+1}/{num_epochs}], BCELoss: {BCELoss.item():.4f}, Accuracy: {accuracy:.4f}')
```

</details>
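
The thresholding step can also be tried in isolation. The sketch below uses four made-up probabilities and computes accuracy with PyTorch alone, as an alternative to `accuracy_score`. The tensor names are illustrative.

```py
import torch

# Toy probabilities from a sigmoid output and the matching true labels
probs = torch.tensor([[0.91], [0.35], [0.50], [0.08]])
y_true = torch.tensor([[1.0], [1.0], [0.0], [0.0]])

# >= 0.5 yields booleans; .int() turns them into 0/1 labels
predicted = (probs >= 0.5).int()

# Accuracy is the fraction of predicted labels that match the true labels
accuracy = (predicted == y_true.int()).float().mean().item()
print(predicted.flatten().tolist(), accuracy)
```

Note that a probability of exactly `0.5` counts as a predicted cancellation because the comparison is `>=`.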

### Task 17

Let's evaluate the trained neural network on the testing set:

1. Set the model to **evaluation mode**
2. Turn off gradient calculations
3. Generate predicted probabilities on `X_test`. Save the probabilities to the variable `test_predictions`.
4. Convert the predicted probabilities to binary labels using `0.5` as the threshold. Save the labels to the variable `test_predicted_labels`.

In [None]:
"""
	-> now we have a trained model
	-> we want to see how accurate its predictions are on data it has not seen
	-> the .eval() method first sets the model to evaluation mode
	-> gradient calculations are then turned off with the torch.no_grad() context manager
	-> predicted probabilities are generated by calling model(X_test)
	-> these are converted to binary labels: >= 0.5 gives True/False and .int() turns that into 1/0
"""

model.eval()
with torch.no_grad():
    test_predictions = model(X_test)
    test_predicted_labels = (test_predictions >= 0.5).int()
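
To illustrate what turning off gradient calculations does, the sketch below runs the same forward pass with and without `torch.no_grad()`. The single `nn.Linear(4, 1)` layer is a hypothetical stand-in for the trained model.

```py
import torch
import torch.nn as nn

layer = nn.Linear(4, 1)
inputs = torch.ones(2, 4)

# Normal forward pass: the output is attached to the autograd graph
tracked = layer(inputs)

# Inside no_grad the same computation is not tracked
with torch.no_grad():
    untracked = layer(inputs)

print(tracked.requires_grad, untracked.requires_grad)
```

The predicted values are the same either way. Skipping the tracking saves memory and computation during evaluation. Evaluation mode changes the behavior of layers such as dropout and batch normalization, which this particular network does not contain, so calling `.eval()` here is mainly good practice.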

### Task 18

Recall that the number of cancellations is much lower than the number of non-cancellations (27.8% canceled vs 72.2% did not cancel). 

To evaluate our neural network effectively, compute the accuracy, precision, recall, and F1 scores using the `sklearn.metrics` module:

- use the `accuracy_score` function to compute the overall accuracy
- use the `classification_report` function to compute the precision, recall, and F1 scores

Print out the accuracy and classification report.

In [None]:
"""
	-> fewer bookings are canceled than kept, so accuracy alone can be misleading
	-> calculating the performance of the model
		-> we want the accuracy, precision, recall, and F1 scores on the test set
	-> the functions are first imported from sklearn.metrics
	-> accuracy_score() returns the accuracy, which is printed to 4 decimal places using an f-string
	-> classification_report() returns the precision, recall, and F1 scores, which are then printed
"""

from sklearn.metrics import accuracy_score, classification_report

test_accuracy = accuracy_score(y_test, test_predicted_labels)
print(f'Accuracy: {test_accuracy:.4f}')

report = classification_report(y_test, test_predicted_labels)
print("Classification Report:\n", report)

Overall, the model seems to perform reasonably well at predicting hotel cancellations!

The model has an overall accuracy of 83.7%, indicating that 83.7% of our model's predictions are correct.
The precision score tells us that when our model predicts a cancellation, it is correct ~72% of the time.
The recall score tells us that our model captures about 68% of the actual cancellations in our data. 
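
As a reminder of where these numbers come from, here is a small sketch that computes precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical, chosen only so that the results land near the figures quoted above. They are not the model's actual counts.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 for the positive class."""
    precision = true_pos / (true_pos + false_pos)  # correct among predicted positives
    recall = true_pos / (true_pos + false_neg)     # found among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the "canceled" class (illustration only).
p, r, f1 = precision_recall_f1(true_pos=72, false_pos=28, false_neg=34)

print(f"precision={p:.2f}, recall={r:.2f}, f1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two scores.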

In future research, we could improve the model by performing a more in-depth analysis of the features and doing a more robust feature selection process (like gathering more features or dropping less useful features). 

Furthermore, we could modify the neural network architecture by changing the number of nodes across the hidden layers, trying out different activation functions and optimizers, adding more hidden layers, or training on additional epochs.

## Task Group 5 - Train a Neural Network for Multiclass Classification

Let's now extend our binary classification task to multiclass by attempting to also predict customers who **no-showed** within the `reservation_status` column.

If a hotel can accurately predict no-shows, it can reach out ahead of time to customers who are at high risk of not showing up for their reservation.

### Task 19

First, let's label encode the three categories in the `reservation_status` column:
- **Check-Out** to `2`
- **Canceled** to `1`
- **No-Show** to `0`

In [None]:
"""
	-> we can then analyse the model's accuracy score from this
	-> there are multiple ways which this model can be improved (see above)
	-> extending this to a multiclass classification model 
	-> we want to make more predictions, by adding a third category  
	->  this means converting the categorical data into numerical data again, by using the .replace() method 
"""

hotels['reservation_status'] = hotels['reservation_status'].replace({'Check-Out':2, 'Canceled':1, 'No-Show':0})
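
The encoding itself is just a dictionary lookup. The plain-Python sketch below applies the same mapping to a short, made-up list of statuses and builds the inverse mapping, which is handy later for reading the rows `0`, `1`, and `2` of a classification report. The names `status_to_code` and `code_to_status` are hypothetical helpers, not part of the project.

```python
# Mapping from the task description.
status_to_code = {"Check-Out": 2, "Canceled": 1, "No-Show": 0}

# Hypothetical sample of raw values (illustration only).
statuses = ["Check-Out", "Canceled", "No-Show", "Check-Out"]

# A direct lookup raises KeyError on any unexpected category,
# which surfaces typos or unseen labels early.
codes = [status_to_code[s] for s in statuses]

# Inverse mapping for turning predicted class indices back into names.
code_to_status = {code: status for status, code in status_to_code.items()}

print(codes)
print([code_to_status[c] for c in codes])
```

In pandas, `Series.map` with the same dictionary is an alternative to `.replace`. Some recent pandas releases may emit a `FutureWarning` when `.replace` changes a column's dtype from strings to integers.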

### Task 20

Using the same list of training features in `train_features`, create the `X` and `y` tensors where:

- `X` contains the data values from the `train_features` columns
- `y` contains the multiclass data values in the `reservation_status` column

Make sure that `y` uses the `long` datatype.

In [None]:
"""
	-> we are adding another class onto the model 
	-> the list of training features for doing this is the same 
	-> we repeat the process when making the binary classification model for this 
	-> this is again done using the .tensor() method
"""

X = torch.tensor(hotels[train_features].values, dtype=torch.float)
y = torch.tensor(hotels['reservation_status'].values, dtype=torch.long)

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Create X and y tensors.</summary>

When creating the tensors, be sure to extract the data values in the specified columns using `.values`:
    
```py
X = torch.tensor(hotels[train_features].values, dtype=torch.float)
y = torch.tensor(hotels['reservation_status'].values, dtype=torch.long)
```
 
</details>

### Task 21

Similar to before, split the `X` and `y` tensors into training and testing splits using the following scheme:
- Use 80% of the data for the training set `X_train` and `y_train`
- Use 20% of the data for the testing set `X_test` and `y_test`
- Set the random state to `42`

In [None]:
"""
	-> we are still in the process of adding another class to the model 
	-> this involves splitting the x and y tensors into testing and training splits 
	-> modules are first imported for this 
	-> the train_test_split() method is then used to split the data
	-> this is printed, so we know it has worked
"""

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.8,
                                                    test_size=0.2, 
                                                    random_state=42) 

print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Split the dataset into training and testing splits.</summary>
    
```py
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80,
                                                    test_size=0.20,
                                                    random_state=42) 
print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)
```

</details>

It looks like our data was successfully split into 80% training and 20% testing sets. 

Importantly, we see that the number of columns is `65` which corresponds to the number of input nodes (or features) needed in the input layer of our neural network!
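
Conceptually, a seeded split shuffles the row indices with a fixed random state and cuts the shuffled list at the 80% mark. The sketch below illustrates that idea in plain Python. It is an assumption-laden simplification that will not reproduce the exact rows `train_test_split` selects, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Shuffle row indices with a fixed seed, then cut into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every run
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(10)
again_train, again_test = split_indices(10)

print(len(train_idx), len(test_idx))
print(train_idx == again_train)  # the split is reproducible
```

With a rare class such as no-shows, `train_test_split` also accepts a `stratify` argument, which could be used to keep the class proportions similar in both splits.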

### Task 22

Set a random seed using `torch.manual_seed(42)`.

Next, let's construct the multiclass neural network with the following architecture:

- input layer with `65` nodes (equal to the number of training features)
- first hidden layer with `65` nodes and a ReLU activation
- second hidden layer with `36` nodes and a ReLU activation
- final output layer with `3` nodes corresponding to each of the categories in `reservation_status`

Save the network to the variable `multiclass_model`.

In [None]:
"""
	-> we are still in the process of converting this into a multiclass model, rather than a binary one
	-> a random seed is first set using torch.manual_seed(42)
	-> the architecture of the model is then altered
	-> ReLU <- this is an activation function 
	-> the first layer is called the input layer
	-> some of the layers after this are referred to as hidden layers 
	-> the final layer is the output layer 
"""

torch.manual_seed(42)

multiclass_model = nn.Sequential(
    nn.Linear(65, 65),
    nn.ReLU(),
    nn.Linear(65, 36),
    nn.ReLU(),
    nn.Linear(36, 3)
)
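
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. It assumes each `nn.Linear` layer keeps its default bias term, so a layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. ReLU layers have no parameters.

```python
# Layer widths from the architecture above: input, two hidden layers, output.
layer_sizes = [65, 65, 36, 3]

def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer (bias assumed enabled)."""
    return n_in * n_out + n_out

per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [4290, 2376, 111]
print(total_params)  # 6777
```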

### Task 23

Next, let's define the loss function and optimizer used for multiclass training:
- set the **cross-entropy** loss function for multiclass to the variable `loss`
- set the **Adam** optimizer to the variable `optimizer` with a learning rate of `0.01`

In [None]:
"""
	-> the loss function for the model is then altered
	-> this is a cross-entropy loss function for multiclass 
	-> this is then set to the variable `loss`, using the nn.CrossEntropyLoss() class
	-> the Adam optimiser is then created with a learning rate of 0.01
		-> this is done using the optim.Adam() class 
"""

loss = nn.CrossEntropyLoss()
optimizer = optim.Adam(multiclass_model.parameters(), lr=0.01)
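
`nn.CrossEntropyLoss` expects raw output scores (logits) and integer class indices, and applies a log-softmax internally, which is why the model above has no softmax layer and why `y` uses the `long` datatype. The plain-Python sketch below works through the same arithmetic for a single sample. The logits are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])

# Hypothetical logits for classes 0 (No-Show), 1 (Canceled), 2 (Check-Out).
logits = [0.2, 1.5, 3.0]
probs = softmax(logits)

loss_if_checkout = cross_entropy(logits, target=2)  # model favours class 2
loss_if_noshow = cross_entropy(logits, target=0)    # model gave class 0 little weight

print([round(p, 3) for p in probs])
print(round(loss_if_checkout, 3), round(loss_if_noshow, 3))
```

The loss is small when the true class receives a high probability and grows as that probability shrinks.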

### Task 24

Let's build the training loop to train our neural network.

1. Train the neural network for `500` epochs.
2. Keep track of the training performance by printing out the cross-entropy loss and accuracy score every `100` epochs.
3. Be sure to convert the output probabilities of the multiclass model to labels using the `torch.argmax()` function.

In [None]:
"""
	-> we are still in the process of converting the binary classification model to a multiclassification model 
	-> we are now training the model with the third class being added onto it 
	-> this is the same process as before 
		-> an epoch is one time the model is trained on the data
		-> we train the model on the same data, hundreds of times 
		-> modules for this are first imported
		-> and then the model is trained on this, using the multiclass_model() method 
		-> the loss function for this is then calculated, along with the use of an optimiser
		-> the accuracy score for this is printed every 100 epochs 
		-> the torch.argmax() method converts the outputs of the model to labels from this
"""

from sklearn.metrics import accuracy_score

num_epochs = 500
for epoch in range(num_epochs):
    predictions = multiclass_model(X_train)
    CELoss = loss(predictions, y_train)
    CELoss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if (epoch + 1) % 100 == 0:
        predicted_labels = torch.argmax(predictions, dim=1)
        accuracy = accuracy_score(y_train, predicted_labels)
        print(f'Epoch [{epoch+1}/{num_epochs}], CELoss: {CELoss.item():.4f}, Accuracy: {accuracy:.4f}')
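
The `torch.argmax(predictions, dim=1)` call picks, for each row, the index of the largest output. The plain-Python sketch below shows the same operation on a tiny, made-up batch, followed by the accuracy calculation. The helper `argmax` and all values are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a row of class scores."""
    return max(range(len(row)), key=lambda i: row[i])

# Hypothetical model outputs: one row per booking, one column per class.
batch_outputs = [
    [2.1, 0.3, -1.0],
    [0.1, 0.2, 3.3],
    [-0.5, 1.7, 1.2],
]
targets = [0, 2, 2]

predicted = [argmax(row) for row in batch_outputs]

# Accuracy: share of rows where the predicted class matches the target.
batch_accuracy = sum(p == t for p, t in zip(predicted, targets)) / len(targets)

print(predicted)
print(f"{batch_accuracy:.4f}")
```

Since softmax preserves the ordering of the scores, taking the argmax of the raw outputs gives the same labels as taking it over the probabilities.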

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Keep track of the multiclass training loss and accuracy.</summary>

    
Here's how to print the accuracy and cross-entropy (CE) loss every 100 epochs during training:
    
```py
if (epoch + 1) % 100 == 0:
    predicted_labels = torch.argmax(predictions, dim=1)
    accuracy = accuracy_score(y_train, predicted_labels)
    print(f'Epoch [{epoch+1}/{num_epochs}], CELoss: {CELoss.item():.4f}, Accuracy: {accuracy:.4f}')
```

</details>

### Task 25

Let's evaluate the trained neural network on the testing set:

1. Set the multiclass model to **evaluation mode**
2. Turn off gradient calculations
3. Generate predicted probabilities on `X_test`. Save the predicted probabilities to the variable `multiclass_predictions`.
4. Select the class with the largest predicted probability using the `torch.argmax()` function. Save the predicted classes to the variable `multiclass_predicted_labels`.

In [None]:
"""
	-> we have changed the model from binary classification into multiclassification 
	-> then trained the model on this 
	-> we are now evaluating this on the testing set
	-> this is the same process that was previously used
	-> the .eval() method is first used to set the model to evaluation mode for this	
	-> the gradient calculations are then turned off, by using the .no_grad() method 
	-> the .argmax() function is then used to select the class with the largest probability 
"""

multiclass_model.eval()
with torch.no_grad():
    multiclass_predictions = multiclass_model(X_test)
    multiclass_predicted_labels = torch.argmax(multiclass_predictions, dim=1)

### Task 26

Lastly, let's evaluate the multiclass neural network by calculating the overall accuracy, precision, recall, and F1 scores.

Using the `sklearn.metrics` module:
- use the `accuracy_score` function to compute and save the overall accuracy to the variable `multiclass_accuracy`
- use the `classification_report` function to compute and save the classification metrics for each class to the variable `multiclass_report`

Print the overall accuracy and classification report for our multiclass model.

In [None]:
"""
	-> we are now calculating the different accuracy scores for the multiclass model, with the additional category added in 
	-> modules are first imported for this 
	-> the accuracy_score() method is then used to generate an accuracy score for this, and is printed with an f string literal 
		-> this is repeated with the classification_report() method 
	-> we can compare the performance of this model to the binary classification model using these scores / reports 
	-> this can be interpreted in the context of data science, to gather more complex insights 
	-> it can also tell us where the model is inaccurate <- the percentages of the predictions which are correct 
		-> from the precision and recall score for this 
		-> these can be used to make decisions about the dataset
	-> comparing the accuracy of the multiclass and binary classification models 
	-> there are multiple ways this can be extended (see below)
"""

from sklearn.metrics import accuracy_score, classification_report

multiclass_accuracy = accuracy_score(y_test, multiclass_predicted_labels)
print(f'Accuracy: {multiclass_accuracy:.4f}')

multiclass_report = classification_report(y_test, multiclass_predicted_labels)
print("Classification Report:\n", multiclass_report)

Our multiclass neural network performs similarly to the binary classification network at predicting cancellations.

It has an overall accuracy of 84%, meaning that 84% of all the predictions were correct.
The precision in row `1` tells us that when our model predicts a cancellation, it is correct 72% of the time. 
The recall score in row `1` tells us that our model captures 68% of the actual cancellations in our data.

Unfortunately, the model doesn't do the best job of predicting whether or not the customer will no-show. 

For no-shows (row class `0`), the precision score tells us that when our model predicts a no-show, it is correct 86% of the time, which is surprisingly good.
However, the low recall score tells us that our model captures only 11% of actual no-shows. The low recall pulls the F1 score down to 27%, indicating a poor balance between precision and recall. In practice, the model predicts very few no-shows and would likely miss most customers who no-show in real life.
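
High precision combined with low recall is typical when a model predicts a rare class only in the few cases where it is very sure. The plain-Python sketch below reproduces that pattern on a tiny, invented set of labels. `per_class_scores` is a hypothetical helper, not part of `sklearn`.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision and recall for one class in a multiclass problem."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # times the class was predicted
    actual = sum(1 for t in y_true if t == label)     # times the class truly occurred
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Invented labels: class 0 is predicted only once, although it occurs four times.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 2, 2, 1, 1, 1, 2, 2, 2, 2]

precision_0, recall_0 = per_class_scores(y_true, y_pred, label=0)

print(f"class 0 precision={precision_0:.2f}, recall={recall_0:.2f}")
```

The single class `0` prediction is correct, so precision is perfect, yet three of the four actual class `0` cases are missed.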

If our goal is to be able to reach out to potential no-shows, the low recall score is concerning. It may be due to the low number of no-shows in the dataset: it is much harder for our model to find patterns predicting a no-show without more data. Still, unlike the binary model, the multiclass model does attempt to classify no-shows while predicting cancellations ahead of time with similar performance.

So that's the end of our project on predicting hotel cancellations using real-world data! 
In future research, we could improve the model by performing a more in-depth analysis of the features and doing a more robust feature selection process. Examples of additional features include weather data at the time of each booking, whether the reservation falls on a major holiday, economic conditions, or even global pandemics and health concerns.

Furthermore, we could also try to improve performance by modifying the neural network architecture, for example by changing the number of nodes across the hidden layers, trying out different activation functions and optimizers, adding more hidden layers, or training for additional epochs.