# Use PyTorch to Predict Hotel Cancellations

- [View Solution Notebook](./solutions.html)
- [View Project Page](https://www.codecademy.com/)

**Setup - Import libraries**

In [49]:
import pandas as pd
import numpy as np

## Task Group 1 - Import and Inspect

The file `'datasets/resort_hotel_bookings.csv'` contains a subset of a [real-world dataset](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand) containing reservation and cancellation data for a resort hotel. 

Your goal in this project is to build and train a neural network to predict if a customer will cancel their hotel booking reservation based on data including the booking dates, average daily cost, number of adults/children/babies, duration of stay, and so forth.

### Task 1

Begin by importing the CSV file to a pandas DataFrame named `hotels`.

Preview the first five rows using the `.head()` method.

In [171]:
df = pd.read_csv("hotel_bookings.csv")

<details><summary style="display:list-item; font-size:16px; color:blue;">Here's a quick summary of the columns</summary>

- **is_canceled**: Whether the booking was canceled (1) or kept (0)
- **lead_time**: Number of days between booking date and arrival date
- **arrival_date_year**: Year of arrival date
- **arrival_date_month**: Month of arrival date
- **arrival_date_week_number**: Week number of arrival date
- **arrival_date_day_of_month**: Day of the month of arrival date
- **stays_in_weekend_nights**: Number of weekend nights booked (Sat-Sun)
- **stays_in_week_nights**: Number of weekday nights booked (Mon-Fri)
- **adults**: Number of adults
- **children**: Number of children
- **babies**: Number of babies
- **meal**: Type of meal booked (Undefined/SC, BB, HB, or FB)
- **country**: Country of origin of the booker
- **market_segment**: Market segment (TA - travel agent, TO - tour operators)
- **distribution_channel**: Booking distribution channel (TA - travel agent, TO - tour operators)
- **is_repeated_guest**: Is this a repeated guest (1) or not (0)
- **previous_cancellations**: The number of previous bookings canceled by the customer
- **previous_bookings_not_canceled**: The number of previous bookings not canceled by the customer
- **reserved_room_type**: Room type reserved
- **assigned_room_type**: Type of assigned room booked
- **booking_changes**: Number of booking changes or modifications
- **deposit_type**: Type of deposit to guarantee booking (No Deposit, Non Refund, or Refundable)
- **agent**: ID of the travel agency that made the booking
- **company**: ID of the company that made the booking
- **days_in_waiting_list**: Number of days booking was waitlisted before confirmation
- **customer_type**: The customer type of booking (Contract, Group, Transient, or Transient-party)
- **adr**: The average daily rate (cost) of the booking
- **required_car_parking_spaces**: Number of parking spaces requested by the customer
- **total_of_special_requests**: Number of special requests by the customer
- **reservation_status**: The last reservation status (Canceled, Check-Out, No-Show)
- **reservation_status_date**: The date of the last reservation status

</details>

### Task 2

Let's explore the data types and whether any data is missing.

Use the `.info()` method on the `hotels` DataFrame to inspect the data.

In [172]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the dataset under inspection?</summary>

There are 32 columns and 119,390 total observations in our dataset. The majority of columns do not have missing values.

However, we do notice that: 
- the `agent` and `company` columns seem to have missing values that need to be addressed
- the `country` and `children` columns have a small number of missing values as well

There are a variety of data types represented. To work with a neural network, we'll have to address any non-numeric columns in our data preparation.

</details>

### Task 3

Let's now explore the cancellation column we want to predict.

Use the `.value_counts()` method on the `is_canceled` column to count the number **and** the percentage of overall cancellations. 

In [173]:
# Number of cancellations
print(df['is_canceled'].value_counts())

# Percentage of cancellations
print(df['is_canceled'].value_counts(normalize=True))

is_canceled
0    75166
1    44224
Name: count, dtype: int64
is_canceled
0    0.629584
1    0.370416
Name: proportion, dtype: float64


<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the number of cancellations?</summary>

The number of cancellations is lower than the number of non-cancellations (37.0% canceled vs 63.0% did not cancel). 

We'll need to take this imbalance into account when we evaluate our model. For example, a naive model could simply predict every booking will **not be canceled** and achieve an accuracy of 63.0%.

</details>
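
As a rough check on that naive baseline, the sketch below computes the accuracy of always predicting the majority class. The small `labels` Series is made-up illustration data, not the hotel dataset.

```py
import pandas as pd

# Hypothetical labels for illustration: 1 = canceled, 0 = kept
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# Share of each class; the largest share is the accuracy of a model
# that always predicts the majority class
class_shares = labels.value_counts(normalize=True)
baseline_accuracy = class_shares.max()
majority_class = class_shares.idxmax()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any trained model should be compared against this number rather than against 0%.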

### Task 4

The `reservation_status` column tells us if the booking was canceled while also telling us if the customer was a no-show.

We need to be sure to exclude this column from the training set; otherwise, this information will be _leaked_ to our model, resulting in misleadingly high performance. 

First, let's take a quick look at the values in this column.

Use the `.value_counts()` method on the `reservation_status` column to count the number **and** the percentage of each reservation status. 

In [174]:
# Number of bookings per reservation status
print(df['reservation_status'].value_counts())

# Percentage of bookings per reservation status
print(df['reservation_status'].value_counts(normalize=True))

reservation_status
Check-Out    75166
Canceled     43017
No-Show       1207
Name: count, dtype: int64
reservation_status
Check-Out    0.629584
Canceled     0.360307
No-Show      0.010110
Name: proportion, dtype: float64


<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the reservation_status column?</summary>

The number of no-shows is extremely small and consists of only 1,207 (or about 1.0%) of observations in the dataset.

Later on, we'll look at creating a multiclass model to predict no-show in addition to canceled.

</details>
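
To see why `reservation_status` would leak the label, note that the counts above add up: Canceled (43,017) plus No-Show (1,207) equals the 44,224 bookings with `is_canceled` equal to 1. The sketch below shows one way to check this kind of relationship with a cross-tabulation. The `toy` DataFrame is hypothetical illustration data.

```py
import pandas as pd

# Hypothetical bookings for illustration only
toy = pd.DataFrame({
    "reservation_status": ["Check-Out", "Canceled", "No-Show", "Check-Out", "Canceled"],
    "is_canceled":        [0, 1, 1, 0, 1],
})

# Cross-tabulate the status against the target label
leak_table = pd.crosstab(toy["reservation_status"], toy["is_canceled"])
print(leak_table)

# If every status maps to exactly one label value, the column fully reveals the target
labels_per_status = toy.groupby("reservation_status")["is_canceled"].nunique()
fully_determined = bool((labels_per_status == 1).all())
print("Status determines label:", fully_determined)
```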

### Task 5

Before diving into building a model, let's continue to explore the dataset. It's important to understand how different columns interact with cancellations to guide our model structure! 

For example, cancellations might be higher in the summer months (June - September) and lower in the winter months (November - January).

Use the `.groupby()` method to group the data by the `arrival_date_month` column and apply the `.mean()` aggregation function on the `is_canceled` column. This will return the proportion of reservations canceled in each month.

Then, use the `.sort_values()` method to sort the percentages from lowest to highest.

In [175]:
cancellations_by_month = df.groupby('arrival_date_month')['is_canceled'].mean()
cancellations_by_month.sort_values()

arrival_date_month
January      0.304773
November     0.312334
March        0.321523
February     0.334160
December     0.349705
July         0.374536
August       0.377531
October      0.380466
September    0.391702
May          0.396658
April        0.407972
June         0.414572
Name: is_canceled, dtype: float64

In [176]:
df['hotel'].unique()

array(['Resort Hotel', 'City Hotel'], dtype=object)

<details><summary style="display:list-item; font-size:16px; color:blue;">What do we notice about the percentage of cancellations by month?</summary>

It looks like our intuition was partly correct! The winter months (November through March) have the lowest cancellation percentages, at roughly 30-35%. The highest percentages, however, appear in April through June (about 40-41%), with July through October falling in between (about 37-39%). This information can be very useful for our model!

It might be useful to do more exploratory data analysis to gain additional insights about hotel cancellations. For example, additional analysis may help you select better features to train the model on and exclude features that might seem irrelevant. But for now, let's move on to cleaning and preparing the data.

</details>
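
As one small example of further exploration, monthly rates are often easier to read in calendar order than sorted by value. The sketch below reorders a Series indexed by month name. The `rates` values are hypothetical and only illustrate the reindexing step.

```py
import pandas as pd

# Hypothetical cancellation rates, indexed by month name (illustration only)
rates = pd.Series({"March": 0.32, "January": 0.30, "June": 0.41, "February": 0.33})

# Calendar order of month names
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Keep only the months that are present, in calendar order
ordered = rates.reindex([m for m in month_order if m in rates.index])
print(ordered)
```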

## Task Group 2 - Data Cleaning and Preparation

In this section, we'll encode categorical data for use in our neural networks.

### Task 6

To get a sense of the categorical data in the dataset, let's start by previewing the first five rows of all columns with `object` datatype.

Create a list named `object_columns` containing only the names of the object columns (except for the reservation status columns). Select those columns from `hotels` and preview the first `5` rows.

In [177]:
object_columns = ['hotel','arrival_date_month', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type']
df[object_columns].head()

| | hotel | arrival_date_month | meal | country | market_segment | distribution_channel | reserved_room_type | assigned_room_type | deposit_type | customer_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | July | BB | PRT | Direct | Direct | C | C | No Deposit | Transient |
| 1 | Resort Hotel | July | BB | PRT | Direct | Direct | C | C | No Deposit | Transient |
| 2 | Resort Hotel | July | BB | GBR | Direct | Direct | A | C | No Deposit | Transient |
| 3 | Resort Hotel | July | BB | GBR | Corporate | Corporate | A | A | No Deposit | Transient |
| 4 | Resort Hotel | July | BB | GBR | Online TA | TA/TO | A | A | No Deposit | Transient |


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Preview the first five rows subset by the object columns</summary>

Here's how we can subset the DataFrame by the object columns and preview the first five rows:

```py
object_columns = ['arrival_date_month', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type']
hotels[object_columns].head()
```

Additionally, it might be helpful to explore the categorical data in each object column using the `.value_counts()` method.

</details>

In [178]:
df.isna().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

### Task 7

Typically, we don't want to use every column in training. For example, we may want to drop columns with many missing values or columns that are irrelevant to our prediction task.

Drop any columns you don't want to use to train a cancellation model (do not remove the target label column). Feel free to open our Hint to review the columns we chose to drop in our solution.

Note: We don't want to drop the `reservation_status` column from the dataset quite yet because we'll be using this column to train our multiclass neural network.

In [179]:
drop_cols = ['country','agent','company','reservation_status_date']
df = df.drop(labels=drop_cols, axis=1)

In [180]:
df['children'] = df['children'].fillna(df['children'].median())


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Drop columns in the dataset not used for training.</summary>

Here's a list of potential features to drop. Feel free to experiment on your own by dropping or keeping columns you believe may contribute to training.

```py
drop_columns = ['country', 'agent', 'company', 'reservation_status_date',
                'arrival_date_week_number', 'arrival_date_day_of_month', 'arrival_date_year']

hotels = hotels.drop(labels=drop_columns, axis=1)
```

Here's why we chose these columns:

- `country` - there are many countries that only appear a handful of times in the dataset which may make our model less generalizable and even discriminate against customers based on their country
- `agent` - similar to `country`, there are many agents that only appear a handful of times which may make our model less generalizable (and there are many missing values!)
- `company` - similar to `agent`, there are many companies that only appear a handful of times which may make our model less generalizable (and there are many missing values!)
- `reservation_status_date` - tells us the date of the latest status change of the reservation which shouldn't be helpful and if anything may leak data
- `arrival_date_week_number` - tells us the week of the year which may be too specific and prone to overfitting
- `arrival_date_day_of_month` - tells us the day of the month which may be too specific and prone to overfitting
- `arrival_date_year` - tells us the year of the booking which may not be helpful to predict future years

</details>
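
The concern about categories that appear only a handful of times can be checked directly before deciding what to drop. The sketch below counts rare categories in a column. The `countries` Series and the `min_count` threshold are hypothetical choices for illustration.

```py
import pandas as pd

# Hypothetical country codes for illustration only
countries = pd.Series(["PRT", "PRT", "PRT", "GBR", "GBR", "ESP", "FRA", "PRT", "GBR", "ITA"])

counts = countries.value_counts()

# Categories seen fewer than `min_count` times (the threshold is an arbitrary choice)
min_count = 2
rare = counts[counts < min_count]

print(f"{len(rare)} of {len(counts)} categories appear fewer than {min_count} times")
print(sorted(rare.index))
```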

### Task 8

Next, let's encode the `meal` column which tells us which type of meal(s) the customer booked: 

- `Undefined` and `SC` correspond to no meal packages
- `BB` corresponds to breakfast only
- `HB` (half board) corresponds to breakfast + lunch or dinner
- `FB` (full board) corresponds to breakfast, lunch, and dinner.

Label encode the `meal` column with a meaningful order (# of meals booked) using the following scheme:

- `Undefined` and `SC` to `0`
- `BB` to `1`
- `HB` to `2`
- `FB` to `3` 

In [181]:
df['meal'].unique()

array(['BB', 'FB', 'HB', 'SC', 'Undefined'], dtype=object)

In [182]:
df['meal']=df['meal'].map({'Undefined':0, 'SC':0, 'BB':1, 'HB':2, 'FB':3})
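
One thing to keep in mind with `.map()` is that any value missing from the dictionary becomes `NaN`. The sketch below shows a quick way to spot unmapped values. The `meals` Series, including the unexpected `"XX"` code, is hypothetical.

```py
import pandas as pd

# Hypothetical meal codes, including one unexpected value ("XX")
meals = pd.Series(["BB", "HB", "SC", "Undefined", "FB", "XX"])

meal_order = {"Undefined": 0, "SC": 0, "BB": 1, "HB": 2, "FB": 3}
encoded = meals.map(meal_order)

# .map() returns NaN for any value missing from the dictionary
unmapped = meals[encoded.isna()].unique().tolist()
print("Unmapped values:", unmapped)
```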

### Task 9

Let's prepare the rest of the categorical columns using one-hot encoding. 

Create a list named `one_hot_columns` containing the list of categorical column names (all the remaining categorical columns) to be one-hot encoded using the `pd.get_dummies()` method.

Preview the cleaned `hotels` DataFrame using the `.head()` method.

In [183]:
cat_col = ['hotel','arrival_date_month', 'meal', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type']
df=pd.get_dummies(df, columns=cat_col, dtype=int)

Perfect! It looks like we've handled all of the categorical variables and prepared the DataFrame for training.

Note that the cleaned DataFrame now has 79 columns due to the additional columns created using one-hot encoding.
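
To illustrate how one-hot encoding adds columns, here is a minimal sketch on a made-up two-column frame. The `toy` DataFrame is hypothetical and much smaller than the real data.

```py
import pandas as pd

# Hypothetical two-column frame for illustration only
toy = pd.DataFrame({
    "lead_time": [10, 45, 3],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit"],
})

# Each distinct category becomes its own 0/1 column; numeric columns are untouched
encoded = pd.get_dummies(toy, columns=["deposit_type"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

A column with many distinct categories therefore adds many new columns, which is why the column count grows so quickly.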

## Task Group 3 - Create Training and Testing Sets

Next, let's convert our dataset into PyTorch tensors and split them into training and testing sets.

### Task 10

Let's import the necessary PyTorch libraries and modules. 

In [184]:
import torch
import torch.nn as nn
import torch.optim as optim


### Task 11

We need to start by separating our training features from the target labels.

Create a list named `train_features` that contains all of the feature names (column names excluding the target variables `is_canceled` and `reservation_status`).

In [185]:
train_features = df.drop(labels=['is_canceled','reservation_status'], axis =1)


In [186]:
train_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 77 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   lead_time                       119390 non-null  int64  
 1   arrival_date_year               119390 non-null  int64  
 2   arrival_date_week_number        119390 non-null  int64  
 3   arrival_date_day_of_month       119390 non-null  int64  
 4   stays_in_weekend_nights         119390 non-null  int64  
 5   stays_in_week_nights            119390 non-null  int64  
 6   adults                          119390 non-null  int64  
 7   children                        119390 non-null  float64
 8   babies                          119390 non-null  int64  
 9   is_repeated_guest               119390 non-null  int64  
 10  previous_cancellations          119390 non-null  int64  
 11  previous_bookings_not_canceled  119390 non-null  int64  
 12  booking_changes 

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Select training features.</summary>

```py
# Remove target columns
remove_cols = ['is_canceled', 'reservation_status']

# Select training features
train_features = [x for x in hotels.columns if x not in remove_cols]
```
 
</details>

In [187]:
train_features.head()

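| | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | ... | assigned_room_type_K | assigned_room_type_L | assigned_room_type_P | deposit_type_No Deposit | deposit_type_Non Refund | deposit_type_Refundable | customer_type_Contract | customer_type_Group | customer_type_Transient | customer_type_Transient-Party |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 342 | 2015 | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 737 | 2015 | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 7 | 2015 | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 13 | 2015 | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 14 | 2015 | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |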


### Task 12

Using the list of training features in `train_features`, create `X` and `y` tensors:

- `X` contains the data values from the `train_features` columns
- `y` contains the binary labels in the `is_canceled` column in `hotels`

Both `X` and `y` should have the float datatype.

Be sure to set the correct view of `y` using `.view(-1,1)`

In [135]:
X = torch.tensor(train_features.values, dtype = torch.float)
y = torch.tensor(df[['is_canceled']].values, dtype = torch.float)

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Create X and y tensors.</summary>

When creating the tensors, be sure to extract the data values in the specified columns using `.values` as floats:
    
```py
X = torch.tensor(hotels[train_features].values, dtype=torch.float)
y = torch.tensor(hotels['is_canceled'].values, dtype=torch.float).view(-1,1)
```
 
</details>
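
The reason for `.view(-1,1)` is that the model outputs one column per row, so the labels need the same two-dimensional shape. The sketch below compares two ways of getting there, on a hypothetical four-row `toy` DataFrame.

```py
import pandas as pd
import torch

# Hypothetical labels for illustration only
toy = pd.DataFrame({"is_canceled": [0, 1, 1, 0]})

# Selecting a Series gives a 1-D tensor, which needs reshaping to a column
y_flat = torch.tensor(toy["is_canceled"].values, dtype=torch.float)
y_col = y_flat.view(-1, 1)

# Selecting with double brackets gives a 2-D array, so the tensor is already a column
y_from_frame = torch.tensor(toy[["is_canceled"]].values, dtype=torch.float)

print(y_flat.shape, y_col.shape, y_from_frame.shape)
print(torch.equal(y_col, y_from_frame))
```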

### Task 13

Let's now split our data contained in `X` and `y` into training and testing sets.

Import the `train_test_split` function from scikit-learn's `sklearn.model_selection` module.

Split `X` and `y` using the following scheme:
- Use 80% of the data for the training set `X_train` and `y_train`
- Use 20% of the data for the testing set `X_test` and `y_test`
- Set the random state to `42` to match our solution

Print out the shape of `X_train` and `X_test` to see how many observations and columns are in the training and testing sets.

How many training features does our training set `X_train` have?

In [136]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)

torch.Size([95512, 77])
torch.Size([23878, 77])


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Split the dataset into training and testing splits.</summary>
    
```py
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80,
                                                    test_size=0.20,
                                                    random_state=42) 
print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)
```
It looks like our data was successfully split into 80% training and 20% testing sets. 

Importantly, we see that the number of columns is `77` which corresponds to the number of input nodes (or features) needed in the input layer of our neural network!

</details>

## Task Group 4 - Train a Neural Network for Binary Classification

Let's now create a neural network for binary classification to predict hotel cancellations.

### Task 14

Set a random seed to `42` using `torch.manual_seed(42)`.

Build the neural network architecture using `nn.Sequential` with the following:
- input layer with `77` nodes (equal to the number of training features)
- first hidden layer with `36` nodes and a ReLU activation
- second hidden layer with `18` nodes and a ReLU activation
- output layer with `1` node and a Sigmoid activation

Save the network to the variable `model`.

In [137]:
model = nn.Sequential(
    nn.Linear(77,36),
    nn.ReLU(),
    nn.Linear(36,18),
    nn.ReLU(),
    nn.Linear(18,1),
    nn.Sigmoid()
)
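
Because the input layer must match the number of feature columns, one option is to read that number from the tensor instead of typing it by hand. The sketch below assumes a random stand-in tensor named `X_example` (hypothetical) in place of the real training data.

```py
import torch
import torch.nn as nn

torch.manual_seed(42)

# Hypothetical stand-in for the training tensor: 8 rows, 77 feature columns
X_example = torch.rand(8, 77)

# Read the input size from the data instead of hard-coding it
n_features = X_example.shape[1]

example_model = nn.Sequential(
    nn.Linear(n_features, 36),
    nn.ReLU(),
    nn.Linear(36, 18),
    nn.ReLU(),
    nn.Linear(18, 1),
    nn.Sigmoid(),
)

# One probability per row, each between 0 and 1
probs = example_model(X_example)
print(probs.shape)
```

This keeps the architecture in sync if columns are added or dropped during data preparation.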

### Task 15

Next, let's define the loss function and optimizer used for training:
- set the **binary cross-entropy** loss function to the variable `loss`
- set the **Adam** optimizer to the variable `optimizer` with a learning rate of `0.005`

In [138]:
loss = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr= 0.005)
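
For intuition about what the binary cross-entropy loss measures, the sketch below computes it by hand and compares the result with `nn.BCELoss`. The `probs` and `targets` tensors are hypothetical values chosen for illustration.

```py
import torch
import torch.nn as nn

# Hypothetical predicted probabilities and true labels
probs = torch.tensor([[0.9], [0.2], [0.6]])
targets = torch.tensor([[1.0], [0.0], [0.0]])

# Binary cross-entropy: mean of -[y*log(p) + (1-y)*log(1-p)]
manual = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()

builtin = nn.BCELoss()(probs, targets)
print(manual.item(), builtin.item())
```

Confident wrong predictions (for example, 0.6 for a label of 0) contribute more to the loss than confident correct ones.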

### Task 16

Let's build the training loop to train our neural network.

Train the neural network for `1000` epochs.

Keep track of the training performance by printing out the binary cross-entropy loss and accuracy score every `100` epochs.

Before calculating accuracy, convert the model's predicted probabilities to binary labels (as integers) using `0.5` as the threshold.

In [139]:
from sklearn.metrics import accuracy_score

In [140]:
epoch = 1000 
for i in range(1,epoch+1):
    probability = model(X_train)
    BCEloss = loss(probability, y_train)
    BCEloss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if i%100 ==0:
        pred1_0 = (probability>=0.5).int()
        accuracy = accuracy_score(pred1_0, y_train)

        print(f"epoch number = {i}, accuracy = {accuracy}, loss = {BCEloss} ")
        

epoch number = 100, accuracy = 0.724128905268448, loss = 0.5556249022483826 
epoch number = 200, accuracy = 0.770081246335539, loss = 0.4975144565105438 
epoch number = 300, accuracy = 0.7743320211072954, loss = 0.48603591322898865 
epoch number = 400, accuracy = 0.7870843454225647, loss = 0.4516979157924652 
epoch number = 500, accuracy = 0.793858363347014, loss = 0.45300573110580444 
epoch number = 600, accuracy = 0.8084638579445514, loss = 0.4278019368648529 
epoch number = 700, accuracy = 0.8072074713125053, loss = 0.4314311444759369 
epoch number = 800, accuracy = 0.8092805092553815, loss = 0.4298108220100403 
epoch number = 900, accuracy = 0.8138139710193484, loss = 0.42679882049560547 
epoch number = 1000, accuracy = 0.8147667308819834, loss = 0.42440131306648254 


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Keep track of the training loss and accuracy.</summary>

    
Here's how to print the accuracy and BCE loss every 100 epochs during training:
    
```py
if (epoch + 1) % 100 == 0:
    predicted_labels = (predictions >= 0.5).int()
    accuracy = accuracy_score(y_train, predicted_labels)
    print(f'Epoch [{epoch+1}/{num_epochs}], BCELoss: {BCELoss.item():.4f}, Accuracy: {accuracy:.4f}')
```

</details>

### Task 17

Let's evaluate the trained neural network on the testing set:

1. Set the model to **evaluation mode**
2. Turn off gradient calculations
3. Generate predicted probabilities on `X_test`. Save the probabilities to the variable `test_predictions`.
4. Convert the predicted probabilities to binary labels using `0.5` as the threshold. Save the labels to the variable `test_predicted_labels`.

In [102]:
model.eval()
with torch.no_grad():
    preds_test = model(X_test)
    preds1_0_test = (preds_test>=0.5).int()
    

### Task 18

Recall that the number of cancellations is lower than the number of non-cancellations (37.0% canceled vs 63.0% did not cancel). 

To evaluate our neural network effectively, compute the accuracy, precision, recall, and F1 scores using the `sklearn.metrics` module:

- use the `accuracy_score` function to compute the overall accuracy
- use the `classification_report` function to compute the precision, recall, and F1 scores

Print out the accuracy and classification report.

In [104]:
from sklearn.metrics import precision_score, recall_score, f1_score

# scikit-learn metrics expect the true labels first, then the predictions
accuracy_test = accuracy_score(y_test, preds1_0_test)
precision_test = precision_score(y_test, preds1_0_test)
recall_test = recall_score(y_test, preds1_0_test)
f1_test = f1_score(y_test, preds1_0_test)

print(f"accuracy = {accuracy_test:.4f}")
print(f"precision = {precision_test:.4f}")
print(f"recall = {recall_test:.4f}")
print(f"f1 = {f1_test:.4f}")

accuracy = 0.8123
precision = 0.8257
recall = 0.6344
f1 = 0.7175


Overall, the model seems to perform reasonably well at predicting hotel cancellations!

The model has an overall accuracy of 81.2%, indicating that 81.2% of our model's predictions are correct.
The precision score tells us that when our model predicts a cancellation, it is correct ~83% of the time.
The recall score tells us that our model captures about 63% of the actual cancellations in our data. 
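
To make the difference between precision and recall concrete, the sketch below derives both from a confusion matrix and compares them with the scikit-learn functions. The `y_true` and `y_pred` lists are hypothetical illustration data.

```py
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = canceled)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of predicted cancellations, how many were real
recall = tp / (tp + fn)      # of real cancellations, how many were caught

# scikit-learn expects the true labels first, then the predictions
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

Swapping the two arguments swaps false positives with false negatives, so precision and recall trade places while accuracy and F1 stay the same.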

In future research, we could improve the model by performing a more in-depth analysis of the features and doing a more robust feature selection process (like gathering more features or dropping less useful features). 

Furthermore, we could modify the neural network architecture by changing the number of nodes across the hidden layers, trying out different activation functions and optimizers, adding more hidden layers, or training on additional epochs.

## Task Group 5 - Train a Neural Network for Multiclass Classification

Let's now extend our binary classification task to multiclass by attempting to also predict customers who **no-showed** within the `reservation_status` column.

If a hotel can accurately predict no-shows, it can reach out ahead of time to customers who are at high risk of not showing up for their reservation.

### Task 19

First, let's label encode the three categories in the `reservation_status` column:
- **Check-Out** to `2`
- **Canceled** to `1`
- **No-Show** to `0`

In [189]:
df['reservation_status'] = df['reservation_status'].map({'Check-Out':2,'Canceled':1,'No-Show':0})

In [195]:
df['reservation_status'].unique()

array([2, 1, 0], dtype=int64)


### Task 20

Using the same list of training features in `train_features`, create the `X` and `y` tensors where:

- `X` contains the data values from the `train_features` columns
- `y` contains the multiclass data values in the `reservation_status` column

Make sure that `y` uses the `long` datatype.

In [196]:
X = torch.tensor(train_features.values, dtype = torch.float)
y = torch.tensor(df['reservation_status'].values, dtype = torch.long)

<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Create X and y tensors.</summary>

When creating the tensors, be sure to extract the data values in the specified columns using `.values`:
    
```py
X = torch.tensor(hotels[train_features].values, dtype=torch.float)
y = torch.tensor(hotels['reservation_status'].values, dtype=torch.long)
```
 
</details>

### Task 21

Similar to before, split the `X` and `y` tensors into training and testing splits using the following scheme:
- Use 80% of the data for the training set `X_train` and `y_train`
- Use 20% of the data for the testing set `X_test` and `y_test`
- Set the random state to `42`

In [197]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)

torch.Size([95512, 77])
torch.Size([23878, 77])


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Split the dataset into training and testing splits.</summary>
    
```py
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80,
                                                    test_size=0.20,
                                                    random_state=42) 
print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)
```
It looks like our data was successfully split into 80% training and 20% testing sets. 

Importantly, we see that the number of columns is `77` which corresponds to the number of input nodes (or features) needed in the input layer of our neural network!

</details>

### Task 22

Set a random seed using `torch.manual_seed(42)`.

Next, let's construct the multiclass neural network with the following architecture:

- input layer with `77` nodes (equal to the number of training features)
- first hidden layer with `65` nodes and a ReLU activation
- second hidden layer with `36` nodes and a ReLU activation
- final output layer with `3` nodes corresponding to each of the categories in `reservation_status`

Save the network to the variable `multiclass_model`.

In [198]:
multiclass_model = nn.Sequential(
    nn.Linear(77,65),
    nn.ReLU(),
    nn.Linear(65,36),
    nn.ReLU(),
    nn.Linear(36,3)
)

### Task 23

Next, let's define the loss function and optimizer used for multiclass training:
- set the **cross-entropy** loss function for multiclass to the variable `loss`
- set the **Adam** optimizer to the variable `optimizer` with a learning rate of `0.01`

In [199]:
loss = nn.CrossEntropyLoss()
optimizer = optim.Adam(multiclass_model.parameters(), lr = 0.01)
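
Note that `nn.CrossEntropyLoss` works on the raw outputs (logits) of the final layer and expects integer class indices, which is why the model has no softmax layer and `y` uses the `long` datatype. The sketch below checks this on hypothetical `logits` and `targets` tensors by comparing the built-in loss with an equivalent two-step computation.

```py
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical raw model outputs (logits) for 2 bookings and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

# Class indices must be integers of dtype long, not one-hot vectors or floats
targets = torch.tensor([0, 2], dtype=torch.long)

builtin = nn.CrossEntropyLoss()(logits, targets)

# Equivalent two-step computation: log-softmax, then negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(builtin.item(), manual.item())
```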

### Task 24

Let's build the training loop to train our neural network.

1. Train the neural network for `500` epochs.
2. Keep track of the training performance by printing out the cross-entropy loss and accuracy score every `100` epochs.
3. Be sure to convert the raw outputs (logits) of the multiclass model to predicted labels using the `torch.argmax()` function.

In [201]:
epochs = 500
for i in range(1,epochs+1):
    preds = multiclass_model(X_train)
    cem_loss = loss(preds, y_train)
    cem_loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if i%100 == 0:
        pred_labels = torch.argmax(preds, dim=1)
        accuracy = accuracy_score(pred_labels, y_train)
        print(f"epoch number = {i}, accuracy = {accuracy:0.4f}, loss = {cem_loss}")

epoch number = 100, accuracy = 0.8007, loss = 0.4753957986831665
epoch number = 200, accuracy = 0.8131, loss = 0.4552313983440399
epoch number = 300, accuracy = 0.7988, loss = 0.4728319048881531
epoch number = 400, accuracy = 0.8091, loss = 0.4515247046947479
epoch number = 500, accuracy = 0.7754, loss = 0.47718796133995056


<details><summary style="display:list-item; font-size:16px; color:blue;">Hint: Keep track of the multiclass training loss and accuracy.</summary>

    
Here's how to print the accuracy and cross-entropy (CE) loss every 100 epochs during training:
    
```py
if (epoch + 1) % 100 == 0:
    predicted_labels = torch.argmax(predictions, dim=1)
    accuracy = accuracy_score(y_train, predicted_labels)
    print(f'Epoch [{epoch+1}/{num_epochs}], CELoss: {CELoss.item():.4f}, Accuracy: {accuracy:.4f}')
```

</details>

### Task 25

Let's evaluate the trained neural network on the testing set:

1. Set the multiclass model to **evaluation mode**
2. Turn off gradient calculations
3. Generate predictions on `X_test`. Save the model outputs (one raw score per class) to the variable `multiclass_predictions`.
4. Select the class with the largest predicted score using the `torch.argmax()` function. Save the predicted classes to the variable `multiclass_predicted_labels`.

In [203]:
from sklearn.metrics import classification_report

In [204]:
multiclass_model.eval()      # switch to evaluation mode
with torch.no_grad():        # no gradient tracking during inference
    preds = multiclass_model(X_test)
    pred_label = torch.argmax(preds, dim=1)
    # sklearn expects (y_true, y_pred)
    accuracy = accuracy_score(y_test, pred_label)
    report = classification_report(y_test, pred_label)




### Task 26

Lastly, let's evaluate the multiclass neural network by calculating the overall accuracy, precision, recall, and F1 scores.

Using the `sklearn.metrics` module:
- use the `accuracy_score` function to compute and save the overall accuracy to the variable `multiclass_accuracy`
- use the `classification_report` function to compute and save the classification metrics for each class to the variable `multiclass_report`

Print the overall accuracy and classification report for our multiclass model.

In [206]:
print(f"accuracy = {accuracy}")
print(report)

accuracy = 0.7787084345422565
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       234
           1       0.98      0.43      0.60      8737
           2       0.74      0.99      0.85     14907

    accuracy                           0.78     23878
   macro avg       0.57      0.48      0.48     23878
weighted avg       0.82      0.78      0.75     23878



Our multiclass neural network performs similarly to the binary classification network at predicting cancellations.

It has an overall accuracy of 78%, meaning that 78% of all the predictions on the test set were correct.
The precision in row `1` tells us that when our model predicts a cancellation, it is correct 98% of the time. 
The recall score in row `1` tells us that our model captures only 43% of the actual cancellations in our data.
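
To see how the F1 and average rows of the report relate to precision and recall, the sketch below recomputes them from the rounded per-class values printed in the report above. The helper `f1` and the `rows` dictionary are illustrative names, not part of `sklearn`. Because the inputs are already rounded to two decimals, results can differ slightly from `sklearn` in general, although here they match the report.

```py
def f1(precision, recall):
    # Harmonic mean of precision and recall; taken as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the report above.
rows = {0: (0.00, 0.00, 234), 1: (0.98, 0.43, 8737), 2: (0.74, 0.99, 14907)}

f1_scores = {c: f1(p, r) for c, (p, r, _) in rows.items()}
total = sum(support for _, _, support in rows.values())

macro_f1 = sum(f1_scores.values()) / len(rows)                         # every class counts equally
weighted_f1 = sum(f1_scores[c] * rows[c][2] for c in rows) / total     # weighted by support

print({c: round(v, 2) for c, v in f1_scores.items()})
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average (0.48) is far below the weighted average (0.75) because it gives the small class `0` the same weight as the two large classes.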

Unfortunately, the model fails at predicting whether or not the customer will no-show. 

For no-shows (row class `0`), the precision, recall, and F1 scores are all 0.00. The recall score tells us that our model captures none of the 234 actual no-shows in the test set, so it will most likely not be able to flag customers who no-show in real life. 

If our goal is to be able to reach out to potential no-shows, the zero recall score is concerning. However, this may be due to the low number of no-shows in the dataset: they make up only 234 of the 23,878 test bookings (about 1%), and it is much harder for our model to find patterns predicting a no-show without more data. Unlike the binary model, the multiclass model is at least set up to treat no-shows as their own class while still predicting cancellations ahead of time.

So that's the end of our project on predicting hotel cancellations using real-world data! 
In future research, we could improve the model by analyzing the features in more depth and using a more robust feature selection process. We could also add new features, such as weather data at the time of each booking, whether a reservation falls on a major holiday, economic conditions, or even global pandemics and health concerns.

Furthermore, we could also try to improve performance by modifying the neural network architecture: changing the number of nodes in the hidden layers, trying out different activation functions and optimizers, adding more hidden layers, or training for additional epochs.
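
One way to make these architecture experiments easier is to build the network from a list of hidden layer sizes. The sketch below is only a starting point: `build_mlp` is a hypothetical helper, and the sizes and activation chosen for `variant` are arbitrary, not recommendations.

```py
import torch
from torch import nn

def build_mlp(in_features, hidden_sizes, n_classes, activation=nn.ReLU):
    # Hypothetical helper: Linear + activation for each hidden layer,
    # followed by a final Linear layer that outputs one raw score per class.
    layers = []
    prev = in_features
    for size in hidden_sizes:
        layers.append(nn.Linear(prev, size))
        layers.append(activation())
        prev = size
    layers.append(nn.Linear(prev, n_classes))
    return nn.Sequential(*layers)

# Same shape as the multiclass network in this project: 77 -> 65 -> 36 -> 3.
baseline = build_mlp(77, [65, 36], 3)

# A deeper variant with a different activation function.
variant = build_mlp(77, [128, 64, 32], 3, activation=nn.LeakyReLU)

x = torch.randn(5, 77)  # 5 fake rows with 77 features
print(baseline(x).shape, variant(x).shape)
```

Each variant can then be trained with the same loss, optimizer, and training loop, so that differences in test metrics can be attributed to the architecture change.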