# Hotel Bookings Cancellations - Understanding Data Leakage

During both model validation and testing, I compared the performance of an original model with another that I intentionally modified to include data leakage through the variable `early_cancellation_or_noshow`. This variable is built from information that reflects the booking outcome, so it acts as a direct indicator of whether a booking was canceled.

## Model Performance Without Data Leakage

For the original model, accuracy and recall metrics during the validation phase were 0.8088 and 0.7134, respectively. During the testing phase, accuracy was 0.8146 with a recall of 0.7607. These metrics reflect the model's performance based on features without foreknowledge of the outcomes.

## Impact of Data Leakage

However, with the introduction of data leakage, both accuracy and recall metrics soared to 0.9997 in the validation phase, and to 0.9994 for accuracy and 0.9992 for recall in the testing phase. This significant inflation in perceived model performance demonstrates how data leakage can falsely enhance a machine learning model's effectiveness by providing it with information that would not realistically be available at the time of making predictions.

The almost perfect accuracy and recall achieved with data leakage suggest substantial overfitting, as the model has "learned" the outcomes directly from the training data, rather than generalizing from learned features.

## The Critical Importance of Avoiding Data Leakage

This experiment underscores the critical importance of identifying and preventing data leakage in predictive modeling practice. The intentional inclusion of data leakage in this case study showed how it could affect performance evaluations, resulting in metrics that do not truly represent the model's predictive capability.

These findings highlight the necessity for careful data preparation and feature selection practices to ensure models are robust, generalizable, and reliable in real applications, thus avoiding the false precision that data leakage can induce.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Set up the variable for your file path
file_path =  'data/hotel_bookings_training.csv' #or the Google Drive path

In [None]:
import pandas as pd

hotel_bookings = pd.read_csv(file_path)

from sklearn.model_selection import train_test_split

In [None]:
hotel_bookings.info()

In [None]:
# Remove personal information of customers
hotel_bookings = hotel_bookings.drop(['name', 'email', 'phone-number', 'credit_card'], axis=1)

In [None]:
hotel_bookings.sample(10)

## EDA

In [None]:
!pip install ydata-profiling  # the package formerly published as pandas_profiling

In [None]:
from ydata_profiling import ProfileReport # Previously pandas_profiling # Create a report on our data in an HTML file

In [None]:
profile = ProfileReport(hotel_bookings, title="Pandas Profiling Report")

In [None]:
!pip install matplotlib
!pip install --upgrade Pillow
import matplotlib.pyplot as plt
profile.to_file("data/bookings_profile.html")

In [None]:
#!ls

In [None]:
from google.colab import files
files.download('data/bookings_profile.html')

# Intentional Data Leakage



Data leakage occurs when information from outside the training dataset is used to create the model. This can happen in two main ways:

- Not hiding certain information (e.g., accidentally including personal information that directly correlates with the target variable).
- Using test/validation data for training.

This can lead to the model learning patterns it shouldn't know, resulting in misleadingly high performance when evaluated with the same data. However, this performance may significantly drop on new or unseen data.
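
The effect of the first kind of leakage can be reproduced on synthetic data. The following is a minimal sketch, not part of the hotel analysis: the data are random, and the names `noise`, `leaky` and `holdout_accuracy` are hypothetical. A feature that simply copies the outcome lets a decision tree score perfectly, while the same model without it does no better than guessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_rows = 1000
noise = rng.normal(size=(n_rows, 3))      # features unrelated to the target
target = rng.integers(0, 2, size=n_rows)  # random 0/1 outcome

# Leaky feature: a copy of the outcome, which would only be known after the fact.
leaky = np.column_stack([noise, target])


def holdout_accuracy(features, labels):
    """Train on 80% of the rows and return accuracy on the remaining 20%."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=0
    )
    model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
    return model.score(x_test, y_test)


honest_accuracy = holdout_accuracy(noise, target)
leaky_accuracy = holdout_accuracy(leaky, target)
print(f"without leak: {honest_accuracy:.3f}, with leak: {leaky_accuracy:.3f}")
```

The perfect score says nothing about predictive skill: at prediction time the copied outcome would not exist yet.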

# Introducing Data Leakage for Testing Purposes
To demonstrate the effects of data leakage, I deliberately retain the 'reservation_status' variable instead of excluding it. I also use the 'reservation_status_date' variable to create another variable that introduces data leakage, because it directly reflects the is_canceled outcome. This approach allows us to examine the significant impact data leakage can have on the model's perceived performance.

Note: In real-world scenarios, it's crucial to avoid data leakage to ensure the model's predictions are genuine and applicable.

This approach helps illustrate the importance of careful feature selection to prevent unrealistic model performance estimations.


## Retention of the `reservation_status` variable

The `reservation_status` variable indicates the status of the reservation, which is actually a reflection of `is_canceled`. If it is included among the input variables, I am effectively leaking information to the model. This will yield excellent results, but in reality, it won't be of any use in real-life scenarios.

In [None]:
# I deliberately retain the 'reservation_status' variable, so the drop below stays commented out.
#hotel_bookings = hotel_bookings.drop(['reservation_status', 'reservation_status_date'], axis=1)

## Creation of the `early_cancellation_or_noshow` variable based on the `duration_to_arrival` variable, obtained through `arrival_date` and `reservation_status_date`

In [None]:
from datetime import datetime

# The year and day of the month are converted to string format to enable concatenation with the month for the 'arrival_date'.
hotel_bookings['arrival_date'] = pd.to_datetime(hotel_bookings['arrival_date_year'].astype(str) + '-' +
                                                hotel_bookings['arrival_date_month'] + '-' +
                                                hotel_bookings['arrival_date_day_of_month'].astype(str))

# Ensuring that 'reservation_status_date' is formatted as datetime, similar to 'arrival_date'.
hotel_bookings['reservation_status_date'] = pd.to_datetime(hotel_bookings['reservation_status_date'])

# The duration (in days) between the reservation status date and the arrival date is calculated.
hotel_bookings['duration_to_arrival'] = (hotel_bookings['reservation_status_date'] - hotel_bookings['arrival_date']).dt.days


## Data Verification

### Checking Data Types
I verified the data types for date columns to ensure they were in the proper format for analysis.




In [None]:
print(hotel_bookings[['arrival_date', 'reservation_status_date']].dtypes)

### Comparing Original and Transformed Columns
I displayed original and transformed columns side by side to assess the effectiveness of my data transformations.

In [None]:
hotel_bookings[['arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month', 'arrival_date', 'reservation_status_date', 'duration_to_arrival']].head()

### Analyzing Null Values
I inspected the distribution of the 'duration_to_arrival' column, including its null values, to identify any patterns or irregularities.

In [None]:
print(hotel_bookings['duration_to_arrival'].describe())

### Evaluating Negative Values
I identified and examined rows with negative durations. These instances are significant as they indicate cancellations that occurred before the scheduled arrival date.

In [None]:
negative_duration_rows = hotel_bookings[hotel_bookings['duration_to_arrival'] < 0]
columns_of_interest = negative_duration_rows[['arrival_date', 'reservation_status_date', 'duration_to_arrival', 'reservation_status']]
print(columns_of_interest.head())

### Assessing Positive and Zero Durations
I reviewed rows with zero or positive durations. Zero durations represent no-shows, while positive durations signify guests who checked in as planned, making these distinctions crucial for model accuracy.

In [None]:
positive_and_zero_duration_rows = hotel_bookings[hotel_bookings['duration_to_arrival'] >= 0]
columns_of_interest_2 = positive_and_zero_duration_rows[['arrival_date', 'reservation_status_date', 'duration_to_arrival', 'reservation_status']]
print(columns_of_interest_2.sample(50))

### Conclusion:
My analysis showed that negative durations predominantly represent early cancellations, with durations close to zero being especially concerning as they indicate last-minute cancellations or no-shows. Conversely, positive durations typically indicate guests who checked in as anticipated.
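
To make the sign convention concrete, here is a small sketch on three invented bookings (the dates and the `toy` frame are hypothetical, not rows from the dataset). It applies the same subtraction and the same `<= 0` rule used in this notebook and shows that the resulting flag lines up with the cancellation outcome.

```python
import pandas as pd

# Three invented bookings that all share the same arrival date.
toy = pd.DataFrame({
    "arrival_date": pd.to_datetime(["2017-07-01", "2017-07-01", "2017-07-01"]),
    "reservation_status_date": pd.to_datetime(["2017-06-20", "2017-07-01", "2017-07-04"]),
    "reservation_status": ["Canceled", "No-Show", "Check-Out"],
    "is_canceled": [1, 1, 0],
})

# Same calculation as above: status date minus arrival date, in days.
toy["duration_to_arrival"] = (toy["reservation_status_date"] - toy["arrival_date"]).dt.days

# Zero or negative durations are flagged as early cancellation or no-show.
toy["early_cancellation_or_noshow"] = (toy["duration_to_arrival"] <= 0).astype(int)

flag_matches_target = bool((toy["early_cancellation_or_noshow"] == toy["is_canceled"]).all())
print(toy[["reservation_status", "duration_to_arrival", "early_cancellation_or_noshow"]])
print("flag equals is_canceled on every row:", flag_matches_target)
```

On these rows the flag is a copy of the target, which is exactly what makes it a leak.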

### Introducing a New Variable for Early Cancellations and No-shows
To enhance the model's predictive capability, I created a variable that captures early cancellations and no-shows. This addition is aimed at providing deeper insights into booking behaviors and improving model performance.

#### Importance of Including `early_cancellation_or_noshow` in Training Data

When introducing the `early_cancellation_or_noshow` variable to the analysis, it's vital to incorporate it into the hotel_data dataset. This dataset, created from the original hotel_bookings data by excluding the `is_canceled` variable, forms the foundation for the model training.

Incorporating the `early_cancellation_or_noshow` variable is a key step. It introduces critical information related to data leakage into the training process, fulfilling the objective of data leakage for testing and analytical purposes. This method results in metrics that might seem "too good to be true," highlighting not an enhancement but an overfitting of the model. This overfitting, while illustrative of the potential impacts of data leakage, does not constitute an improvement in the model's predictive capacity. Instead, it serves as a caution against the misleading accuracy that can arise from improperly incorporating predictive information into the training data.



In [None]:
hotel_bookings['early_cancellation_or_noshow'] = ((hotel_bookings['duration_to_arrival'] <= 0).astype(int))
print(hotel_bookings.head())

In [None]:
print(hotel_bookings.columns) #Includes the new variable, 'early_cancellation_or_noshow'

## Extract the target variable
I carefully extracted the target variable from the dataset, ensuring that the integrity of the data remained intact for accurate model training.

In [None]:
is_canceled = hotel_bookings['is_canceled'].copy()
hotel_data = hotel_bookings.drop(['is_canceled'], axis=1)
print(hotel_data.columns)  # Only 'is_canceled' was dropped, so 'early_cancellation_or_noshow' is present if the cell that creates it ran first. hotel_data will be the train_x data.

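#### Verifying Dataset Columns

Inspecting the hotel_data columns shows whether the newly introduced 'early_cancellation_or_noshow' variable is present. When the cells run in order it already is, because hotel_data is derived from hotel_bookings after the variable was created; the assignment below acts as a safeguard in case the cells were run in a different order. It's crucial to include this variable in hotel_data, which will serve as the basis for the training dataset (train_x). This ensures the model is trained with all pertinent features, including those introduced to illustrate the effects of data leakage.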

In [None]:
hotel_data['early_cancellation_or_noshow'] = hotel_bookings['early_cancellation_or_noshow']
print (hotel_data.head())

# Splitting the Data into Training, Testing, and Validation Sets
Finally, I divided the data into distinct sets for training, testing, and validation. This step is crucial for evaluating the model's performance and its ability to generalize to new, unseen data.

In [None]:
# Obtaining the total number of records in the dataset
original_count = len(hotel_bookings)

# Defining the proportion of the dataset to allocate for training
training_size = 0.60  # 60% of records for training

# Calculating the sizes for the test and validation sets, splitting the remaining data equally
test_size = (1 - training_size) / 2  # 20% for testing, 20% for validation

# Calculating the actual number of records for each set based on their proportions
training_count = int(original_count * training_size)  # Number of records for training
test_count = int(original_count * test_size)  # Number of records for testing
validation_count = original_count - training_count - test_count  # Remaining records for validation

# Printing out the sizes for each set to verify the distribution
print(f"Training count: {training_count}, Test count: {test_count}, Validation count: {validation_count}, Total: {original_count}")


In [None]:
from sklearn.model_selection import train_test_split

# Splitting the data into training data for 'hotel_data' and the target variable 'is_canceled'.
# The dataset is split into training and 'rest' (which includes both test and validation subsets).
train_x, rest_x, train_y, rest_y = train_test_split(hotel_data, is_canceled, train_size=training_count)
# Here, 'hotel_data' and the target variable 'is_canceled' are being split.
# 'train_size' is set to 'training_count' (60% of records as defined above).

# Further split the 'rest' data into test and validation sets, each comprising 20% of the total data.
test_x, validate_x, test_y, validate_y = train_test_split(rest_x, rest_y, train_size=test_count)
# This operation splits the remaining data into test and validation subsets, based on 'test_count'.

# Printing the lengths of the training, test, and validation datasets to verify the splits.
print(len(train_x), len(test_x), len(validate_x))
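
The split above is random on every run. As a hedged variation, not what the notebook does, the sketch below fixes `random_state` so the split is repeatable and passes `stratify` so each subset keeps the same share of cancellations. The toy arrays and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(100).reshape(-1, 1)   # 100 toy rows
labels = np.array([0] * 70 + [1] * 30)     # imbalanced toy target

# First split: 60 rows for training, 40 left over.
train_x_demo, rest_x_demo, train_y_demo, rest_y_demo = train_test_split(
    features, labels, train_size=60, stratify=labels, random_state=42
)

# Second split: the 40 remaining rows become 20 test and 20 validation rows.
test_x_demo, validate_x_demo, test_y_demo, validate_y_demo = train_test_split(
    rest_x_demo, rest_y_demo, train_size=20, stratify=rest_y_demo, random_state=42
)

print(len(train_x_demo), len(test_x_demo), len(validate_x_demo))
```

Fixing the seed makes the reported metrics reproducible between runs; whether stratification is needed depends on how imbalanced the target is.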


# One-hot encoding

One-hot encoding is a technique for converting categorical variables (strings) into a numerical representation. In this case, it applies to the column indicating the hotel type associated with each booking.

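While Pandas provides a convenient method called get_dummies for quick analysis, it does not remember which categories it has seen, so the columns it produces can differ between the training data and new data. That makes it hard to reproduce in a production or more formal analysis setting. Instead, it's recommended to use the OneHotEncoder from the scikit-learn library, which learns the categories once with .fit and reuses them on every later .transform.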

## Variables to Encode - One-hot Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

In [None]:
one_hot_encoder.fit(train_x[['hotel']])
one_hot_encoder.transform(train_x[['hotel']])

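When constructing a One-Hot Encoder, the output can be a sparse matrix (the default, which saves memory when there are many categories) or, as here with sparse_output=False, a dense array that is easier to inspect. Setting handle_unknown="ignore" makes the encoder ignore values it did not see during fitting. It's crucial to remember that the .fit method should only be applied to the training data (train_x), while the .transform method is then applied to the training, testing (test_x) and validation data (validate_x).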

For ordinal variables, consider using scikit-learn's OrdinalEncoder instead; LabelEncoder is intended for encoding the target variable, not input features.

In cases of high cardinality - where a categorical variable contains a large number of unique values - it's important to maintain information while reducing the number of categorical variables. Techniques such as embeddings or grouping can be employed to manage variables with high levels of unique values without resorting to one-hot encoding. For instance, a variable representing countries can exhibit high cardinality; in such scenarios, rather than applying one-hot encoding, more advanced techniques like embeddings or grouping countries into categorical variables (like continents) could be used to reduce cardinality effectively.
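
One simple form of grouping is to keep the frequent categories and fold the rest into a single bucket. The sketch below assumes a hypothetical series of country codes and an arbitrary cut-off of two occurrences; neither comes from the project.

```python
import pandas as pd

# Hypothetical country codes; only PRT and GBR occur at least twice.
countries = pd.Series(["PRT", "PRT", "PRT", "GBR", "GBR", "FRA", "ESP", "DEU"])

counts = countries.value_counts()
frequent = counts[counts >= 2].index   # categories worth keeping as they are

# Keep frequent values and replace everything else with a shared "Other" label.
grouped = countries.where(countries.isin(frequent), "Other")

print(grouped.value_counts())
```

The grouped column has three categories instead of five, so one-hot encoding it produces fewer columns. In a real pipeline the list of frequent categories would be learned from the training data only.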

In the context of this project, NLTK (Natural Language Toolkit) is not utilized as there are no extensive text variables to process.

However, if your dataset includes variables with substantial text content (such as customer comments), tools and techniques for Natural Language Processing (NLP) in Python, like NLTK, can be highly effective for processing and extracting meaningful information from text data.

# Binarizer

## Variables to Binarize

 - total_of_special_requests, required_car_parking_spaces, booking_changes, previous_bookings_not_canceled, previous_cancellations

In this scenario, the chosen approach was to binarize these variables to determine whether a client made a specific request or took a particular action, translating it into a binary format represented as a 0 or 1 value. These variables will be incorporated into the feature engineering pipeline within the binarizer column transformer.

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
binarizer = Binarizer()

In [None]:
train_x_copy = train_x.copy()

binarizer.fit(train_x[['total_of_special_requests']])  # fit on the training data (Binarizer is stateless, so this only validates the input)
train_x_copy['has_made_special_requests'] = binarizer.transform(train_x[['total_of_special_requests']])

train_x_copy[['total_of_special_requests', 'has_made_special_requests']].sample(10)

Instead of a count with several possible values, it is now a binary variable with two values: 1 (yes) and 0 (no).

Although total_of_special_requests is a count (customers made 0, 1, 2, 3, 4 or 5 special requests), it is not treated as ordinal here, because the count says little about the requests themselves.

The requests are different from each other, and the relationship between them varies.
Perhaps one request was very specific and another was for two bottles of water.

Binarizer: to determine if the customer made any requests or not. (True/False)

(see the HTML report, Pandas Profiling Report)

booking_changes: the number of changes requested by the customer. Perhaps we are not interested in how many changes the customer made, but whether they made any changes or not.

previous_cancellations, previous_bookings_not_canceled: To identify someone who has made cancellations and someone who has not, regardless of the number.

The main purpose is not to reduce the model's complexity:

- The number of cancellations is discarded because it is less informative than whether any cancellation occurred.
- Most clients did not cancel.
- Binarizing can help the model generalize.

As a side effect, the reduced complexity also shortens execution time during training and testing, which is more economical.

Binarizer documentation:
A threshold can be specified. For example, to distinguish few from many requests, `binarizer = Binarizer(threshold=3)` maps counts greater than 3 to 1 and counts of 3 or fewer to 0.

One might choose not to binarize a variable in another case; the total requests could be treated as ordinal in another scenario.
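
The two threshold choices can be compared on a handful of made-up request counts. This is an illustrative sketch; the `requests` array is invented.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

requests = np.array([[0], [1], [3], [4]])  # made-up special-request counts

# Default threshold (0): any value greater than 0 becomes 1.
any_request = Binarizer().fit_transform(requests)

# threshold=3: only values strictly greater than 3 become 1.
many_requests = Binarizer(threshold=3).fit_transform(requests)

print(any_request.ravel())
print(many_requests.ravel())
```

Note that a count of exactly 3 maps to 0 with `threshold=3`, because the comparison is strict.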


# Scaler

## Variable to scale

 - adr

The scikit-learn documentation offers general recommendations on using scalers. They are particularly useful for a variable such as adr, which represents how much a hotel earns when it's occupied. It includes rooms that generate profit and others that result in losses, creating a wide range of values from -6 to 5000.

There are various scalers such as StandardScaler (best suited to roughly normally distributed data), MinMaxScaler, and MaxAbsScaler. For cases with skewed data and extreme outliers, which deviate significantly from the expected range, a different approach is recommended.

The RobustScaler, as detailed in scikit-learn's documentation, is specifically designed to handle outliers effectively. This scaler adjusts the data in a way that is less influenced by the presence of outliers, making it a suitable choice for variables with a wide range of values and potential outlier data points.

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
scaler = RobustScaler()

In [None]:
train_x_copy = train_x.copy()
scaler.fit(train_x[['adr']])  # learn the median and interquartile range from the training data only
train_x_copy['adr_scaled'] = scaler.transform(train_x[['adr']])

train_x_copy[['adr', 'adr_scaled']].sample(10)

The result is a smaller range that our machine learning model can handle.
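
What RobustScaler does with its default settings can be checked by hand: it subtracts the median and divides by the interquartile range. The sketch below uses five invented rates, one of them an extreme outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

adr_demo = np.array([[50.0], [80.0], [100.0], [120.0], [5000.0]])  # invented rates

scaled = RobustScaler().fit_transform(adr_demo)

# The same result computed by hand: (x - median) / (Q3 - Q1).
median = np.median(adr_demo)
q1, q3 = np.percentile(adr_demo, [25, 75])
by_hand = (adr_demo - median) / (q3 - q1)

print(scaled.ravel())
```

The outlier is still far from the other values after scaling, but it no longer influences the centre and spread used to scale the typical rows, because the median and quartiles barely move when one value is extreme.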

Standardization is a specific form of data scaling. Generally, in the context of data preprocessing for machine learning, scaling refers to modifying the values of features (variables) to fit them onto a common scale. There are several ways to do this, with normalization and standardization being two of the most common.

**Standardization**
Standardization involves rescaling data so they have a mean (μ) of 0 and a standard deviation (σ) of 1. The formula for standardizing a feature is:


### **z = (x - μ) / σ**

where x is the original value, μ is the mean of the feature, and σ is the standard deviation of the feature. Standardization does not bound values to a specific range.

**Normalization**
On the other hand, normalization (often referred to as min-max scaling) rescales the data to a specific range, typically 0 to 1. The formula for normalization is:

### **xnorm = (x - x_min)/(x_max - x_min)**

where x_min and x_max are the minimum and maximum values of the feature, respectively.
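
Both formulas can be applied directly with NumPy. This is a small worked sketch on four arbitrary values; it uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Standardization: z = (x - mean) / standard deviation
z = (x - x.mean()) / x.std()

# Normalization (min-max scaling): x_norm = (x - x_min) / (x_max - x_min)
x_norm = (x - x.min()) / (x.max() - x.min())

print("standardized:", z)
print("normalized:  ", x_norm)
```

After standardization the values have mean 0 and standard deviation 1 but no fixed bounds; after normalization they lie between 0 and 1.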

Comparison and Usage

Standardization vs. Normalization: The choice between standardization and normalization depends on the specific model and the context of the problem.

Some machine learning models, like those that assume the data is **normally distributed**, may benefit more from standardization.

Others, especially those **sensitive to the magnitude of features but that do not assume a specific distribution**, like distance-based models, may benefit more from normalization.

Invariance of Standardization: Standardization is invariant to the scale of measurement, meaning it changes the data to a scale that is relative to the mean and standard deviation of the data, making it useful for comparisons and for models that are sensitive to variance in the data but not necessarily to the absolute magnitude.

In summary, both standardization and normalization are important data preprocessing techniques that scale features but do so in ways that may be more suitable for different types of models and analysis problems.

Bimodal and Multimodal Distributions

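Scaling does not change whether a distribution is bimodal or multimodal; it only changes the scale of the values.
Which scaler fits best depends on the data, so it is worth studying the scaling techniques in more depth before choosing one.
**Scaling is generally recommended for models that are sensitive to feature magnitude; tree-based models such as random forests are largely insensitive to it.**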

# No Transformation

### Variables to Maintain in Their Original Form:

 - stays_in_weekend_nights, stays_in_week_nights

The approach to these variables is contingent upon the predictive model selected for implementation.

It is essential to evaluate the nature and assumptions of the chosen model to determine whether these variables require any form of transformation or can be incorporated directly in their original state.

# Transformation Pipeline

The transformation pipeline groups together various transformations to be applied to the data, streamlining the preprocessing phase. This approach enables the efficient execution of multiple operations in unison, ensuring consistency across the dataset. The pipeline is particularly useful for applying specific transformations, such as one-hot encoding, to several variables simultaneously.

Applying One-Hot Encoding
One-hot encoding is a crucial step in preparing categorical variables for machine learning models. This process converts categorical data into a format that can be provided to ML algorithms to improve prediction accuracy. In our pipeline, we use a ColumnTransformer to apply one-hot encoding to specified columns. The ColumnTransformer targets columns for transformation, ensuring that the encoded output is aligned with the corresponding feature in our dataset. The variables targeted for one-hot encoding in this case include: hotel, meal, distribution_channel, reserved_room_type, assigned_room_type, and customer_type. This methodical application of one-hot encoding across multiple variables enhances the model's ability to understand and utilize categorical data effectively.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

In [None]:
one_hot_encoding = ColumnTransformer([
    (
        'one_hot_encode',
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        [
            "hotel",
            "meal",
            "distribution_channel",
            "reserved_room_type",
            "assigned_room_type",
            "customer_type"
        ]
    )
])

In [None]:
binarizer = ColumnTransformer([
    (
        'binarizer',
        Binarizer(),
        [
            "total_of_special_requests",
            "required_car_parking_spaces",
            "booking_changes",
            "previous_bookings_not_canceled",
            "previous_cancellations",
        ]
    )
])
# After binarizing, the one-hot encoder expands each 0/1 column into indicator columns, so no hierarchy or order is implied between the two values.
one_hot_binarized = Pipeline([ #both
    ("binarizer", binarizer),
    ("one_hot_encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
])

In [None]:
scaler = ColumnTransformer([
    ("scaler", RobustScaler(), ["adr"])
])

### The passthrough approach

It allows for certain features to remain unaltered, ensuring that the original data structure is preserved for these specific variables while still benefiting from the one-hot encoding applied to other categorical variables.

### Including Key Variables in the Passthrough Transformer

Ensuring 'early_cancellation_or_noshow' is passed through unchanged is crucial. Columns that are not listed in any ColumnTransformer are dropped by default (`remainder='drop'`), so they never reach the model and influence neither training nor predictions. This step guarantees the model receives all relevant data, including the data leakage variable intentionally added for analysis.

In [None]:
passthrough = ColumnTransformer([
    (
        "passthrough",
        "passthrough",
        [
            "stays_in_week_nights",
            "stays_in_weekend_nights",
            "early_cancellation_or_noshow",  # Add the variable here.
        ]
    )
])
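
That unlisted columns are left out can be verified on a toy frame. In this sketch the frame and the step name "keep" are hypothetical; only the two listed columns survive, and the unlisted `reservation_status` column is dropped.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

toy = pd.DataFrame({
    "stays_in_week_nights": [2, 3],
    "stays_in_weekend_nights": [0, 2],
    "reservation_status": ["Canceled", "Check-Out"],  # not listed below
})

keep_listed = ColumnTransformer([
    ("keep", "passthrough", ["stays_in_week_nights", "stays_in_weekend_nights"])
])

output = keep_listed.fit_transform(toy)
print(output.shape)   # two columns remain
print(output)
```

Passing `remainder="passthrough"` to the ColumnTransformer would keep the unlisted columns instead.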

In machine learning workflows, efficiently managing data transformations is crucial for model performance. The process typically involves three key steps:

* Preparation of Individual Transformers or Pipelines: For specific groups of columns, individual transformers or pipelines are prepared to handle different data types or perform specific transformations, such as one-hot encoding for categorical variables or scaling for numerical variables.

* Integration into a Unified Feature Engineering Scheme: These transformers are then integrated into a global schema using either FeatureUnion or ColumnTransformer. This allows for the parallel application of all necessary transformations, ensuring a comprehensive and efficient feature engineering process.

* Global Pipeline Construction: The final step involves encapsulating the entire feature engineering process and the machine learning model into a global pipeline. This global pipeline, which can be considered a 'pipeline of pipelines,' ensures a seamless workflow from data preprocessing to model training and prediction.

This structured approach not only facilitates the management of complex data transformations but also helps prevent common errors such as fitting transformers on validation or test data, enhancing model development and deployment efficiency.

## Feature Engineering Pipeline with Feature Union

This process consolidates all prior transformations, for which individual pipelines were created, into a comprehensive feature engineering pipeline. Essentially, it acts as a 'pipeline of pipelines,' effectively grouping together various transformation pipelines.

The Feature Union object serves as the core component, facilitating the merger of all transformation elements into a unified whole. This streamlined approach ensures that all specified transformations are applied in parallel, optimizing the feature engineering process for the machine learning model.

In [None]:
# Defining the feature engineering pipeline
feature_engineering_pipeline = Pipeline(
    [
        (  # A tuple with the name 'features' and the FeatureUnion object
            "features",
            FeatureUnion(
                [
                    ("categorical", one_hot_encoding),  # An identifier/any name and the one-hot encoder
                    ("categorical_binarized", one_hot_binarized),  # Binarized categorical features
                    ("scaled", scaler),  # Scaled features
                    ("pass", passthrough),  # Features to pass through without transformation
                ]
            ),
        )
    ]
)

In [None]:
# Applying the pipeline to the training data
transformed = feature_engineering_pipeline.fit_transform(train_x)
print(f"Transformed shape: {transformed.shape}")

In [None]:
transformed # It's a matrix that our model can handle.

## Model training

In [None]:
# Getting a fresh copy of the pipeline
from sklearn.base import clone

feature_transformer = clone(feature_engineering_pipeline)  # Obtaining an untrained copy of the pipeline.

features_train_x = feature_transformer.fit_transform(train_x)  # Fitting the pipeline on the training dataset and transforming it in one step.
features_validate_x = feature_transformer.transform(validate_x)  # Transforming the validation dataset.


The dataset is now converted to numbers so the model can be trained.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

model = RandomForestClassifier(n_estimators=100) # Vary n_estimators to improve recall during validation.

model.fit(features_train_x, train_y)


## Model validation

The validation dataset serves as a critical tool for evaluating the model's performance, with a particular focus on the **recall score**. In this process, the model, already trained, is applied to identify the most suitable hyperparameters for optimal performance.

During this phase, while the algorithm employs the features from features_train_x to refine the model's internal parameters, the role of a data scientist is to leverage the validation dataset for the fine-tuning of the hyperparameters, enhancing the control over the model's behavior.

- TRAINING DATA: This dataset is instrumental for the model to learn and adjust its internal parameters.
- VALIDATION DATA: Conversely, this dataset is utilized by the data scientist to fine-tune the hyperparameters, aiming to optimize the model's performance.

The **modification of hyperparameters**, informed by the model's performance on metrics such as **recall and accuracy score**, is pivotal to ensuring the model's effectiveness. This approach underscores the iterative process of model refinement, where both the model's internal learning and the data scientist's strategic adjustments play crucial roles in achieving optimal predictive performance.

In [None]:
from sklearn.metrics import accuracy_score, recall_score

pred_y = model.predict(features_validate_x)

print(accuracy_score(validate_y, pred_y))
print(recall_score(validate_y, pred_y))

Results:

0.99970635120396

0.9996552120445926

Attempting to enhance the model involves:
- Adjusting the hyperparameters of the **RandomForestClassifier**, specifically the **n_estimators**, during model training. Evaluate performance on model validation using recall and accuracy score to determine the optimal estimators/hyperparameters based on the best results.
- Exploring other models such as **Support Vector Machines, Logistic Regression**, and others could also improve outcomes.

These results can be further improved by experimenting with different models or adjusting parameters/hyperparameters.
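
The tuning loop described above could look like the following sketch. It runs on synthetic data from `make_classification`, not on the hotel data, and the candidate values for `n_estimators` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transformed training and validation features.
x, y = make_classification(n_samples=600, n_features=8, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=0)

validation_recall = {}
for n_estimators in (10, 50, 100):
    candidate = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    candidate.fit(x_train, y_train)                      # learn from training data
    predictions = candidate.predict(x_val)               # evaluate on validation data
    validation_recall[n_estimators] = recall_score(y_val, predictions)

# Pick the setting with the highest validation recall.
best_n_estimators = max(validation_recall, key=validation_recall.get)
print(validation_recall, "best:", best_n_estimators)
```

The test set plays no part in this loop; it is reserved for the final evaluation.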

# Construction of the Final Pipeline

This final pipeline encapsulates the entire data processing and modeling flow, ensuring a streamlined and reproducible approach for prediction.

In [None]:
final_inference_pipeline = Pipeline([
    # Incorporate a fresh copy of the pre-established feature engineering pipeline.
    ("feature_engineering", clone(feature_engineering_pipeline)),

    # Model selection with pre-defined hyperparameters.
    ("model", RandomForestClassifier(n_estimators=100))
])

Concatenating the training data with the validation data to create a large dataset:

In [None]:
final_training_dataset = pd.concat([train_x, validate_x])  # 95352 records in total
final_training_response = pd.concat([train_y, validate_y])

Instead of using only the initial training dataset, I concatenate the training and validation datasets to create a larger dataset. This enhanced dataset is then used to train the model that will be deployed into production.

The dataset comprises the input variables, while the response includes the binary values 1 and 0.

Training the final pipeline which was created above with these data:

In [None]:
final_inference_pipeline.fit(final_training_dataset, final_training_response)

## Model testing

Seeing how the model performs on the held-out test set, which stands in for unseen real-world data:

In [None]:
test_pred_y = final_inference_pipeline.predict(test_x)

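# scikit-learn metrics expect (y_true, y_pred); with the arguments swapped, recall_score would report precision instead.
print(accuracy_score(test_y, test_pred_y))
print(recall_score(test_y, test_pred_y))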

Results:

0.9994127024079201

0.9992082343626286

These metrics estimate the model's effectiveness on data it has not seen during training.
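
Argument order matters for recall. scikit-learn's metric functions take the true labels first and the predictions second. The sketch below, on five invented labels, shows that reversing the arguments leaves accuracy unchanged but turns recall into precision.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0]   # three real cancellations
y_pred = [1, 0, 0, 0, 0]   # the model catches only one of them

recall = recall_score(y_true, y_pred)          # 1 of 3 cancellations found
swapped = recall_score(y_pred, y_true)         # arguments reversed
precision = precision_score(y_true, y_pred)    # 1 of 1 predicted cancellations is correct

accuracy = accuracy_score(y_true, y_pred)
accuracy_swapped = accuracy_score(y_pred, y_true)

print(f"recall={recall:.3f}, swapped={swapped:.3f}, precision={precision:.3f}")
```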

### Interpreting Results: successful data leakage (not recommended in real-life scenarios)
The results, showing accuracy and recall scores of approximately 0.9994 and 0.9992 respectively, align with the anticipation of overfitting due to deliberately induced data leakage. This level of performance, while seemingly exceptional, is indicative of the model having access to information it would not have in a realistic scenario, leading to inflated metrics. This serves as a clear demonstration of how data leakage can artificially enhance a model's performance, underscoring the necessity of vigilance against such pitfalls in model training and validation processes.
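
The same effect can be reproduced on a small scale. The sketch below uses synthetic data and a hypothetical leaky column that is simply a copy of the target; it is not the notebook's `early_cancellation_or_noshow` variable, only an analogue of it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Noisy synthetic problem (assumption): label noise keeps honest accuracy well below 1.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)

# Hypothetical leaky feature: a column that encodes the outcome itself.
X_leaky = np.column_stack([X, y])


def held_out_accuracy(features):
    """Train a forest on 70% of the rows and return accuracy on the other 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.3, random_state=0
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


clean_acc = held_out_accuracy(X)
leaky_acc = held_out_accuracy(X_leaky)
print(f"without leak: {clean_acc:.4f}, with leak: {leaky_acc:.4f}")
```

The leaky column is available in the test rows too, which is why the held-out score rises along with the training score.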

## Model persistence


An artifact refers to any object or file created as a result of training a machine learning model. In this case, there is only one piece, the model itself. In other scenarios, when not using pipelines, there might be multiple components to deploy to production.

In [None]:
from joblib import dump

dump(final_inference_pipeline, "inference_pipeline.joblib")
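
A quick way to gain confidence in a saved artifact is to reload it and compare its predictions with those of the in-memory object. The sketch below does this for a toy pipeline on synthetic data; the scaler step, the file name `toy_pipeline.joblib`, and the temporary directory are illustrative assumptions, not part of the notebook's pipeline.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline standing in for the real inference pipeline (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
]).fit(X, y)

# Save to a throwaway directory, then load the artifact back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(pipeline, path)
    restored = load(path)

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
print(same_predictions)
```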


# New clients - checking the model performance

In [None]:
from joblib import load

ultimate_inference_pipeline = load("inference_pipeline.joblib")

The hotel has provided a file with 100 new clients and wants the model to predict whether each of them will cancel.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Set up the variable for your file path
file_path = '/content/drive/My Drive/My Colabs/enhanced_hotel_cancellation_prediction/data/new_customers.csv'
#file_path = 'data/new_customers.csv'

new_customers = pd.read_csv(file_path)
new_customers.head()

# Predict on a copy of the original columns so that the prediction columns added below are not fed back into the pipeline.
features = new_customers.copy()

# 'will_cancel': 1 if the customer is predicted to cancel, 0 otherwise.
new_customers['will_cancel'] = ultimate_inference_pipeline.predict(features)

# 'predict_proba' returns one probability column per class, in class order (0 = check-in, 1 = cancel).
new_customers[['proba_check_in', 'proba_cancel']] = ultimate_inference_pipeline.predict_proba(features)

# Select the relevant columns and show the 20 customers most likely to cancel.
new_customers[['name', 'phone-number', 'will_cancel', 'proba_cancel']].sort_values(by='proba_cancel', ascending=False).head(20)
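
The mapping between probability columns and classes can be checked rather than assumed. In scikit-learn, the columns of `predict_proba` follow the order of the fitted estimator's `classes_` attribute. The sketch below shows this on a toy classifier with synthetic data; the `proba_class_*` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy binary classifier standing in for the inference pipeline (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Name each probability column after the class it belongs to.
proba = pd.DataFrame(
    model.predict_proba(X[:5]),
    columns=[f"proba_class_{label}" for label in model.classes_],
)
print(list(model.classes_))
print(proba)
```

Deriving the column names from `classes_` keeps the labels correct even if the class order ever differs from what was expected.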


### Final Overview: The Implications of Data Leakage

In this data leakage exploration, I intentionally included the `early_cancellation_or_noshow` variable to demonstrate, for testing and analytical purposes, how strongly leakage affects model performance. Comparing the original model against a version modified to include this variable revealed a stark difference in both the validation and testing phases.

The original model, free from data leakage, showed modest performance in the testing phase: an accuracy of 0.8146 and a recall of 0.7607. After introducing `early_cancellation_or_noshow`, a variable that gives advance insight into whether a booking is canceled, performance rose dramatically. Accuracy and recall both reached 0.9997 in validation, and 0.9994 and 0.9992 respectively in testing.

This experiment highlights the deceptive nature of data leakage, illustrating how it can artificially inflate a model's apparent effectiveness by giving it access to information not realistically available at the time of prediction. The elevated performance metrics, while impressive at first glance, are a warning sign rather than an achievement. The model, in essence, read the outcome off a variable that already encodes it, rather than generalizing from the features it was intended to analyze.

These findings are a reminder to stay vigilant against data leakage throughout model training and validation. Models must be trained only on information that is realistically obtainable at prediction time, with no insight into the future, if their predictions are to remain robust and reliable in real-world applications.
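
One simple screening step, offered here as a sketch rather than a complete safeguard, is to check whether any single feature predicts the target almost perfectly on its own. The data below is synthetic, the `leaky_flag` column and the 0.98 threshold are hypothetical choices, and a feature flagged this way still needs a manual review of when its value becomes known.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic features plus one hypothetical column that copies the target.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, flip_y=0.2, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])] + ["leaky_flag"]
X = np.column_stack([X, y])

# Score a shallow tree on each feature alone; near-perfect scores are suspicious.
suspicious = []
for idx, name in enumerate(feature_names):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    score = cross_val_score(tree, X[:, [idx]], y, cv=5).mean()
    if score > 0.98:  # threshold is an illustrative assumption
        suspicious.append(name)

print(suspicious)
```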
