
# Hotel Booking Cancellation Prediction

## Data Overview

A hotel operates two branches: a "City Hotel" located in the state's capital and a "Resort Hotel" near the coastal area. The provided dataset in CSV format encompasses reservations for these hotels with a wide array of attributes, offering a comprehensive view of each booking's characteristics.

### Dataset Attributes:

- **hotel**: Type of hotel ("City Hotel" or "Resort Hotel").
- **is_canceled**: Indicates if a booking was canceled (`1`) or not (`0`).
- **lead_time**: Days between the booking date and arrival date.
- **arrival_date_year**: Year of arrival.
- **arrival_date_month**: Month of arrival (January to December).
- **arrival_date_week_number**: Week number of arrival.
- **arrival_date_day_of_month**: Day of the month of arrival.
- **stays_in_weekend_nights**: Number of weekend nights (Saturday or Sunday) the guest stayed or booked.
- **stays_in_week_nights**: Number of weeknights (Monday to Friday) the guest stayed or booked.
- **adults**, **children**, **babies**: Number of adults, children, and babies, respectively.
- **meal**: Type of meal booked.
- **country**: Guest's country of origin (ISO 3166 country code).
- **market_segment**: Market segment designation.
- **distribution_channel**: Booking distribution channel.
- **is_repeated_guest**: Indicates if the booking was made by a repeated guest (`1`) or not (`0`).
- **previous_cancellations**: Number of previous bookings canceled by the customer before the current booking.
- **previous_bookings_not_canceled**: Number of previous bookings not canceled by the customer before the current booking.
- **reserved_room_type**, **assigned_room_type**: Codes for room types booked and assigned, respectively.
- **booking_changes**: Number of changes made to the booking from the time it was entered into the PMS until check-in or cancellation.
- **deposit_type**: Indicates if the customer made a deposit to guarantee the booking.
- **agent**, **company**: ID of the travel agency and company that made the booking.
- **days_in_waiting_list**: Days the booking was on the waiting list before being confirmed.
- **customer_type**: Type of booking.
- **adr**: Average daily rate.
- **required_car_parking_spaces**: Number of car parking spaces requested by the guest.
- **total_of_special_requests**: Number of special requests made by the guest.
- **reservation_status**: Last status of the booking.
- **reservation_status_date**: Date of the last status update.
- **name**, **email**, **phone-number**, **credit_card**: Customer's name, email, phone number, and last four digits of the credit card.

The dataset for this analysis is sourced from `hotel_bookings_training.csv`, which is a subset of a larger dataset available on Kaggle. Further information about the dataset's origin can be explored [here](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand).

### Focus on Recall

In this analysis, beyond the conventional metric of accuracy, the principal concern is the comprehensive identification of potential booking cancellations. The client prioritizes preventive measures and values the act of confirming bookings with customers — a practice seen as both acceptable and indicative of attentive customer service, even if it includes contacting those who might not eventually cancel their reservations. This emphasis on recall ensures that predictive efforts align with the client's strategy to mitigate last-minute cancellations through proactive engagement.
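
To make the metric concrete, here is a minimal sketch of how recall and precision are computed from confusion-matrix counts. The counts are invented for illustration and are not taken from the dataset.

```python
# Hypothetical counts for illustration only (not taken from the dataset).
true_positives = 80   # cancellations the model flagged
false_negatives = 20  # cancellations the model missed
false_positives = 40  # bookings flagged that did not cancel

# Recall: share of actual cancellations that were caught.
recall = true_positives / (true_positives + false_negatives)
# Precision: share of flagged bookings that really canceled.
precision = true_positives / (true_positives + false_positives)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

A recall-focused client accepts the lower precision (extra confirmation calls) in exchange for missing fewer cancellations.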

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Set up the variable for your file path
file_path = 'data/hotel_bookings_training.csv'  # or the Google Drive path

In [None]:
import pandas as pd

hotel_bookings = pd.read_csv(file_path)

from sklearn.model_selection import train_test_split

In [None]:
hotel_bookings.info()

In [None]:
# Remove personal information of customers
hotel_bookings = hotel_bookings.drop(['name', 'email', 'phone-number', 'credit_card'], axis=1)

In [None]:
hotel_bookings.sample(10)

## EDA

In [None]:
!pip install ydata-profiling

In [None]:
# ydata_profiling (previously pandas_profiling) creates a report on our data in an HTML file
from ydata_profiling import ProfileReport

In [None]:
profile = ProfileReport(hotel_bookings, title="Pandas Profiling Report")

In [None]:
!pip install matplotlib
!pip install --upgrade Pillow
import matplotlib.pyplot as plt
profile.to_file("data/bookings_profile.html")

In [None]:
#!ls

In [None]:
from google.colab import files
files.download('data/bookings_profile.html')

### Avoiding Data Leakage and Handling Imbalanced Data

It's crucial to split our dataset before applying any transformations to ensure that our model training and evaluation phases are as realistic and unbiased as possible. Here, we explain our strategy to avoid data leakage and how we address the challenge of imbalanced data.

## Data Leakage

The variable `reservation_status` indicates the status of the reservation. However, it is essentially a reflection of `is_canceled`. If we include it among the input variables, we would be leaking information to the model. This could lead to an excellent model performance, but it wouldn't be useful in real-world scenarios. Therefore, we need to remove it from the dataset along with other associated variables.

**Understanding Data Leakage**

Data leakage can occur in several ways, including but not limited to:

- Forgetting to hide certain information, such as personal data, which should not be available to the model during training.
- Using information from the test/validation set to train the model.

Data leakage leads to models learning patterns they shouldn't, resulting in deceptively high performance when evaluated on the same data but potentially much poorer performance on new or unseen data. This underscores the importance of careful data handling and model evaluation strategies to ensure the model's real-world applicability and reliability.
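
As a rough illustration of how such a leak can be spotted, the sketch below checks whether every value of a candidate feature maps to a single target value. The rows are toy data invented for this example, and the check is only a heuristic hint, not a complete leakage test.

```python
from collections import defaultdict

# Toy (reservation_status, is_canceled) pairs, invented for illustration.
rows = [
    ("Check-Out", 0),
    ("Canceled", 1),
    ("Check-Out", 0),
    ("No-Show", 1),
    ("Canceled", 1),
]

# Collect the set of target values observed for each feature value.
targets_by_status = defaultdict(set)
for status, is_canceled in rows:
    targets_by_status[status].add(is_canceled)

# If each feature value always comes with the same target, the feature
# effectively encodes the answer and is a strong leakage suspect.
leaks_target = all(len(targets) == 1 for targets in targets_by_status.values())
print(dict(targets_by_status), leaks_target)
```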

In [None]:
# Avoid data leakage
hotel_bookings = hotel_bookings.drop(['reservation_status', 'reservation_status_date'], axis=1)

## Separate the Target Variable

In [None]:
is_canceled = hotel_bookings['is_canceled'].copy()
hotel_data = hotel_bookings.drop(['is_canceled'], axis=1)

# Split the Data into Training, Testing, and Validation Sets

In [None]:
# Obtain the total number of records in the dataset
original_count = len(hotel_bookings)

# Define the proportion of the dataset to allocate for training
training_size = 0.60  # 60% of records for training

# Calculate the sizes for the test and validation sets, splitting the remaining data equally
test_size = (1 - training_size) / 2  # 20% for testing, 20% for validation

# Calculate the actual number of records for each set based on their proportions
training_count = int(original_count * training_size)  # Number of records for training
test_count = int(original_count * test_size)  # Number of records for testing
validation_count = original_count - training_count - test_count  # Remaining records for validation

# Print out the sizes for each set to verify the distribution
print(f"Training count: {training_count}, Test count: {test_count}, Validation count: {validation_count}, Total: {original_count}")


In [None]:
from sklearn.model_selection import train_test_split

# Splitting the data into training data for 'hotel_data' and the target variable 'is_canceled'.
# The dataset is split into training and 'rest' (which includes both test and validation subsets).
train_x, rest_x, train_y, rest_y = train_test_split(hotel_data, is_canceled, train_size=training_count)
# Here, 'hotel_data' and the target variable 'is_canceled' are being split.
# 'train_size' is set to 'training_count' (60% of records as defined above).

# Further split the 'rest' data into test and validation sets, each comprising 20% of the total data.
test_x, validate_x, test_y, validate_y = train_test_split(rest_x, rest_y, train_size=test_count)
# This operation splits the remaining data into test and validation subsets, based on 'test_count'.

# Print the lengths of the training, test, and validation datasets to verify the splits.
print(len(train_x), len(test_x), len(validate_x))


# One-hot encoding

One-hot encoding is a technique for converting categorical variables (strings) into a numerical representation. In this case, it applies to the column indicating the hotel type associated with each booking.

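While Pandas provides a convenient method called `get_dummies` for quick analysis, it derives the output columns from whatever categories are present in the data it is given, so training data and new data can end up with different columns. That makes it hard to reproduce in a production or more formal analysis setting. Instead, it's recommended to use the `OneHotEncoder` from the scikit-learn library, which remembers the categories learned during fitting.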

## Variables to Encode - One-hot Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

In [None]:
one_hot_encoder.fit(train_x[['hotel']])
one_hot_encoder.transform(train_x[['hotel']])

The encoder above is configured to return a dense array (`sparse_output=False`) and to ignore values it did not see during fitting (`handle_unknown="ignore"`); for columns with many categories, the default sparse output saves memory. It's crucial to remember that the `.fit` method should only ever be applied to the training data (`train_x`), while the `.transform` method is applied to the training data as well as to the testing (`test_x`) and validation data (`validate_x`).

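For ordinal variables, consider using an ordinal encoder such as scikit-learn's `OrdinalEncoder` instead (`LabelEncoder` is intended for target labels rather than input features).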

In cases of high cardinality - where a categorical variable contains a large number of unique values - it's important to maintain information while reducing the number of categorical variables. Techniques such as embeddings or grouping can be employed to manage variables with high levels of unique values without resorting to one-hot encoding. For instance, a variable representing countries can exhibit high cardinality; in such scenarios, rather than applying one-hot encoding, more advanced techniques like embeddings or grouping countries into categorical variables (like continents) could be used to reduce cardinality effectively.
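
One simple form of grouping is to keep frequent categories and fold the rest into a single bucket. The sketch below assumes a hypothetical cut-off (`min_count`) and toy country codes; in practice the frequent set would be derived from the training data only.

```python
from collections import Counter

# Toy country codes invented for illustration.
countries = ["PRT", "PRT", "GBR", "PRT", "FRA", "GBR", "BRA", "PRT"]
min_count = 2  # hypothetical cut-off; choose it from the training data

counts = Counter(countries)
frequent = {code for code, n in counts.items() if n >= min_count}

# Rare codes are replaced by a shared "OTHER" label, reducing cardinality.
grouped = [code if code in frequent else "OTHER" for code in countries]
print(grouped)
```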

In the context of this project, NLTK (Natural Language Toolkit) is not utilized as there are no extensive text variables to process.

However, if your dataset includes variables with substantial text content (such as customer comments), tools and techniques for Natural Language Processing (NLP) in Python, like NLTK, can be highly effective for processing and extracting meaningful information from text data.

# Binarizer

## Variables to Binarize

 - total_of_special_requests, required_car_parking_spaces, booking_changes, previous_bookings_not_canceled, previous_cancellations

In this scenario, the chosen approach was to binarize these variables to determine whether a client made a specific request or took a particular action, translating it into a binary format represented as a 0 or 1 value. These variables will be incorporated into the feature engineering pipeline within the binarizer column transformer.

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
binarizer = Binarizer()

In [None]:
train_x_copy = train_x.copy()

binarizer.fit(train_x[['total_of_special_requests']])  # fit on the training data
train_x_copy['has_made_special_requests'] = binarizer.transform(train_x[['total_of_special_requests']])

train_x_copy[['total_of_special_requests', 'has_made_special_requests']].sample(10)

Instead of a count with several possible values, the result is now a binary variable with two values, 1 and 0 (yes and no).

Although `total_of_special_requests` is a number (customers made 0, 1, 2, 3, 4, or 5 special requests), the count is not a meaningful measure of magnitude. The requests are different from each other, and the relationship between them varies: perhaps one request was very specific and another was for two bottles of water.

The Binarizer therefore only determines whether the customer made any requests or not (True/False). The distribution of the original counts can be seen in the HTML report (Pandas Profiling Report).

`booking_changes`: the number of changes requested by the customer. Here we are not interested in how many changes the customer made, only in whether they made any.

`previous_cancellations`, `previous_bookings_not_canceled`: these identify customers who have canceled (or completed) a booking before, regardless of how many times.

The main purpose is not to reduce the model's complexity. The number of cancellations is discarded because it is less informative than the yes/no signal: most clients never canceled. Dropping that detail can make the model more general.

As a side effect, the simpler representation also reduces execution time during training and testing, which is more economical.

Binarizer documentation: a threshold can be specified. To distinguish few requests from many, `binarizer = Binarizer(threshold=3)` marks values greater than 3 as 1 and values of 3 or less as 0.

In another scenario one might choose not to binarize a variable at all; the total number of requests could instead be treated as ordinal.
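
The effect of the threshold can be checked on a small array. This is a minimal sketch with invented request counts; it compares the default threshold of 0 with the `threshold=3` variant mentioned above.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy request counts invented for illustration (one column, five bookings).
requests = np.array([[0], [1], [3], [4], [5]])

# Default threshold=0: any request at all becomes 1.
any_request = Binarizer().fit_transform(requests)
# threshold=3: only values strictly greater than 3 become 1.
many_requests = Binarizer(threshold=3).fit_transform(requests)

print(any_request.ravel(), many_requests.ravel())
```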


# Scaler

## Variable to scale

 - adr

The scikit-learn documentation gives general recommendations on choosing a scaler. They are particularly relevant for `adr`, which represents how much the hotel earns from a room when it is occupied. Some rooms generate profit and others result in losses, so the values span a wide range, from about -6 to 5000.

There are various scalers, such as StandardScaler (best suited to roughly normally distributed data), MinMaxScaler, and MaxAbsScaler. For skewed data with extreme outliers, which deviate significantly from the expected range, a different approach is recommended.

The RobustScaler, as detailed in scikit-learn's documentation, is specifically designed to handle outliers effectively. This scaler adjusts the data in a way that is less influenced by the presence of outliers, making it a suitable choice for variables with a wide range of values and potential outlier data points.
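
With its default settings, RobustScaler subtracts the median and divides by the interquartile range (the distance between the 25th and 75th percentiles). The sketch below reproduces that arithmetic with NumPy on toy `adr` values invented for illustration.

```python
import numpy as np

# Toy rates with one extreme outlier, invented for illustration.
adr = np.array([50.0, 80.0, 100.0, 120.0, 5000.0])

median = np.median(adr)
q1, q3 = np.percentile(adr, [25, 75])

# Centre on the median and scale by the interquartile range.
robust = (adr - median) / (q3 - q1)
print(median, q3 - q1, robust)
```

The typical values land in a narrow band around zero, and the outlier stays large without distorting how the other values are scaled.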

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
scaler = RobustScaler()

In [None]:
train_x_copy = train_x.copy()
scaler.fit(train_x[['adr']])  # learn the median and interquartile range from the training data
train_x_copy['adr_scaled'] = scaler.transform(train_x[['adr']])

train_x_copy[['adr', 'adr_scaled']].sample(10)

The result is a smaller range that our machine learning model can handle.

Standardization is a specific form of data scaling. Generally, in the context of data preprocessing for machine learning, scaling refers to modifying the values of features (variables) to fit them onto a common scale. There are several ways to do this, with normalization and standardization being two of the most common.

**Standardization**
Standardization involves rescaling data so they have a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1. The formula for standardizing a feature is:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the original value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature. Standardization does not bound values to a specific range.

**Normalization**
Normalization (often referred to as min-max scaling), on the other hand, rescales the data to a specific range, typically 0 to 1. The formula for normalization is:

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the feature, respectively.

Comparison and Usage

Standardization vs. Normalization: The choice between standardization and normalization depends on the specific model and the context of the problem.

Some machine learning models, like those that assume the data is **normally distributed**, may benefit more from standardization.

Others, especially those **sensitive to the magnitude of features but that do not assume a specific distribution**, like distance-based models, may benefit more from normalization.

Invariance of Standardization: Standardization expresses each value relative to the feature's mean and standard deviation, so the result does not depend on the original unit of measurement. This makes it useful for comparing features and for models that are sensitive to variance in the data but not necessarily to the absolute magnitude.

In summary, both standardization and normalization are important data preprocessing techniques that scale features but do so in ways that may be more suitable for different types of models and analysis problems.
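
The two formulas can be compared on a small list of numbers. This is a minimal sketch using only the standard library; the values are invented, and the population standard deviation is used for the standardization step.

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # toy feature values

# Standardization: z = (x - mean) / standard deviation
mu, sigma = mean(values), pstdev(values)
standardized = [(x - mu) / sigma for x in values]

# Normalization (min-max scaling): (x - min) / (max - min)
lowest, highest = min(values), max(values)
normalized = [(x - lowest) / (highest - lowest) for x in values]

print(standardized)
print(normalized)
```

The standardized values have mean 0 and standard deviation 1 but no fixed bounds, while the normalized values fall exactly between 0 and 1.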

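**Bimodal and Multimodal Distributions**

Scaling does not change the shape of a distribution: a bimodal or multimodal feature remains bimodal or multimodal after scaling.
Which scaler fits best depends on the data, so it is worth delving deeper into scaling techniques before choosing one.
**Scaling is generally a sensible default regardless of the distribution**, although tree-based models such as random forests are largely insensitive to feature scale.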

# No Transformation

### Variables to Maintain in Their Original Form:

 - stays_in_weekend_nights, stays_in_week_nights

The approach to these variables is contingent upon the predictive model selected for implementation.

It is essential to evaluate the nature and assumptions of the chosen model to determine whether these variables require any form of transformation or can be incorporated directly in their original state.

# Transformation Pipeline

The transformation pipeline groups together various transformations to be applied to the data, streamlining the preprocessing phase. This approach enables the efficient execution of multiple operations in unison, ensuring consistency across the dataset. The pipeline is particularly useful for applying specific transformations, such as one-hot encoding, to several variables simultaneously.

**Applying One-Hot Encoding**
One-hot encoding converts categorical data into a numeric format that ML algorithms can consume. In our pipeline, we use a ColumnTransformer to apply one-hot encoding to specified columns: the ColumnTransformer selects the listed columns, applies the encoder to them, and by default drops every column that is not listed. The variables targeted for one-hot encoding in this case are: hotel, meal, distribution_channel, reserved_room_type, assigned_room_type, and customer_type.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

In [None]:
one_hot_encoding = ColumnTransformer([
    (
        'one_hot_encode',
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        [
            "hotel",
            "meal",
            "distribution_channel",
            "reserved_room_type",
            "assigned_room_type",
            "customer_type"
        ]
    )
])

In [None]:
binarizer = ColumnTransformer([
    (
        'binarizer',
        Binarizer(),
        [
            "total_of_special_requests",
            "required_car_parking_spaces",
            "booking_changes",
            "previous_bookings_not_canceled",
            "previous_cancellations",
        ]
    )
])
# Apply both steps: binarize the counts first, then one-hot encode the resulting 0/1 columns.
# One-hot encoding breaks categorical variables down into a binary format (0 and 1), eliminating any hierarchy or order within the categories.
one_hot_binarized = Pipeline([
    ("binarizer", binarizer),
    ("one_hot_encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
])

In [None]:
scaler = ColumnTransformer([
    ("scaler", RobustScaler(), ["adr"])
])

In [None]:
passthrough = ColumnTransformer([
    (
        "passthrough",
        "passthrough",
        [
            "stays_in_week_nights",
            "stays_in_weekend_nights",
        ]
    )
])
#The passthrough approach allows for certain features to remain unaltered, ensuring that the original data structure is preserved for these specific variables while still benefiting from the one-hot encoding applied to other categorical variables.

In machine learning workflows, efficiently managing data transformations is crucial for model performance. The process typically involves three key steps:

* Preparation of Individual Transformers or Pipelines: For specific groups of columns, individual transformers or pipelines are prepared to handle different data types or perform specific transformations, such as one-hot encoding for categorical variables or scaling for numerical variables.

* Integration into a Unified Feature Engineering Scheme: These transformers are then integrated into a global schema using either FeatureUnion or ColumnTransformer. This allows for the parallel application of all necessary transformations, ensuring a comprehensive and efficient feature engineering process.

* Global Pipeline Construction: The final step involves encapsulating the entire feature engineering process and the machine learning model into a global pipeline. This global pipeline, which can be considered a 'pipeline of pipelines,' ensures a seamless workflow from data preprocessing to model training and prediction.

This structured approach not only facilitates the management of complex data transformations but also prevents common errors such as data leakage, enhancing model development and deployment efficiency.

## Feature Engineering Pipeline with Feature Union

This process consolidates all prior transformations, for which individual pipelines were created, into a comprehensive feature engineering pipeline. Essentially, it acts as a 'pipeline of pipelines,' effectively grouping together various transformation pipelines.

The Feature Union object serves as the core component, facilitating the merger of all transformation elements into a unified whole. This streamlined approach ensures that all specified transformations are applied in parallel, optimizing the feature engineering process for the machine learning model.

In [None]:
# Define the feature engineering pipeline
feature_engineering_pipeline = Pipeline(
    [
        (  # A tuple with the name 'features' and the FeatureUnion object
            "features",
            FeatureUnion(
                [
                    ("categorical", one_hot_encoding),  # An identifier/any name and the one-hot encoder
                    ("categorical_binarized", one_hot_binarized),  # Binarized categorical features
                    ("scaled", scaler),  # Scaled features
                    ("pass", passthrough),  # Features to pass through without transformation
                ]
            ),
        )
    ]
)

In [None]:
# Apply the pipeline to the training data
transformed = feature_engineering_pipeline.fit_transform(train_x)
print(f"Transformed shape: {transformed.shape}")

In [None]:
transformed # It's a matrix that our model can handle.

## Model training

In [None]:
# Get a fresh copy of the pipeline
from sklearn.base import clone

feature_transformer = clone(feature_engineering_pipeline)  # Obtain an untrained copy of the pipeline.

features_train_x = feature_transformer.fit_transform(train_x)  # Train the pipeline and use it to transform the training dataset.
features_validate_x = feature_transformer.transform(validate_x)  # Transform the validation dataset.


We now have the dataset converted to numbers. We can start training the model.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

model = RandomForestClassifier(n_estimators=100) # Vary n_estimators to improve recall during validation.

model.fit(features_train_x, train_y)


## Model validation

Use the validation dataset to evaluate the model's performance, focusing on recall score.
This step involves using the trained model to determine the optimal hyperparameters for our model.

While the algorithm uses features_train_x to adjust the model's internal parameters,
as data scientists, we use the validation dataset to fine-tune the **hyperparameters** for better model control.

* TRAINING DATA: Used by the MODEL to learn internal parameters.
* VALIDATION DATA: Used by DATA SCIENTISTS to adjust **hyperparameters**.

We adjust hyperparameters based on the performance metrics like recall and accuracy score, ensuring the model's effectiveness.


In [None]:
from sklearn.metrics import accuracy_score, recall_score

pred_y = model.predict(features_validate_x)

print(accuracy_score(validate_y, pred_y))
print(recall_score(validate_y, pred_y))

Attempting to enhance the model involves:
- Adjusting the hyperparameters of the **RandomForestClassifier**, specifically the **n_estimators**, during model training. Evaluate performance on model validation using recall and accuracy score to determine the optimal estimators/hyperparameters based on the best results.
- Exploring other models such as **Support Vector Machines, Logistic Regression**, and others could also improve outcomes.

These results can be further improved by experimenting with different models or adjusting parameters/hyperparameters.
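
A manual search over `n_estimators` could look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for `features_train_x` and `features_validate_x`, and the candidate values are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transformed training and validation features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

recall_by_n = {}
for n in (10, 50, 100):  # hypothetical candidate values
    candidate = RandomForestClassifier(n_estimators=n, random_state=0)
    candidate.fit(X_train, y_train)
    # True labels first, predictions second.
    recall_by_n[n] = recall_score(y_val, candidate.predict(X_val))

# Keep the setting with the highest validation recall.
best_n = max(recall_by_n, key=recall_by_n.get)
print(recall_by_n, best_n)
```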

# Construction of the Final Pipeline

This final pipeline encapsulates the entire data processing and modeling flow, ensuring a streamlined and reproducible approach for prediction.

In [None]:
final_inference_pipeline = Pipeline([
    # Incorporate a fresh copy of the pre-established feature engineering pipeline.
    ("feature_engineering", clone(feature_engineering_pipeline)),

    # Model selection with pre-defined hyperparameters.
    ("model", RandomForestClassifier(n_estimators=100))
])

We can concatenate the training data with the validation data to create a large dataset:

In [None]:
final_training_dataset = pd.concat([train_x, validate_x])  # 95352 records in total
final_training_response = pd.concat([train_y, validate_y])

Instead of using only the initial training dataset, we concatenate the training and validation datasets to create a larger dataset. This enhanced dataset is then used to train the model that will be deployed into production.

The dataset comprises the input variables, while the response includes the binary values 1 and 0.

Train the final pipeline we created above with these data:

In [None]:
final_inference_pipeline.fit(final_training_dataset, final_training_response)

## Model testing

See how the model performs on data it has never seen, as a proxy for the real world.

In [None]:
test_pred_y = final_inference_pipeline.predict(test_x)

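# scikit-learn metrics expect the true labels first, then the predictions
print(accuracy_score(test_y, test_pred_y))
print(recall_score(test_y, test_pred_y))

Results:

0.8145817602147831

0.7606827545615068

Based on these numbers, you can compile a report for stakeholders. Because they come from data the model never saw during training or validation, they show how the model is likely to behave on new bookings, and a large drop relative to the validation scores would point to overfitting. In this case, it's a success because the results are even better than those obtained during the model training and validation process. Note that the two figures above were recorded with the arguments passed as `(test_pred_y, test_y)`: accuracy is unaffected by the order, but the second value is then the precision for cancellations rather than the recall, so it should be regenerated with the call above before it is reported.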

## Model persistence


An artifact refers to any object or file created as a result of training a machine learning model. In this case, there is only one piece, the model itself. In other scenarios, when not using pipelines, there might be multiple components to deploy to production.

In [None]:
from joblib import dump

dump(final_inference_pipeline, "inference_pipeline.joblib")


# So, who are we addressing?

In [None]:
from joblib import load

ultimate_inference_pipeline = load("inference_pipeline.joblib")

This involves a file with 100 new clients that the hotel wants us to evaluate using the model to determine if they will cancel or not.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Set up the variable for your file path
file_path = '/content/drive/My Drive/My Colabs/enhanced_hotel_cancellation_prediction/data/new_customers.csv'
#file_path = 'data/new_customers.csv'

new_customers = pd.read_csv(file_path)
new_customers.head()

new_customers['will_cancel'] = ultimate_inference_pipeline.predict(new_customers)  # adds the column 'will_cancel', indicating if a customer will cancel (1) or not (0).
new_customers[['proba_check_in', 'proba_cancel']] = ultimate_inference_pipeline.predict_proba(new_customers)  # 'predict_proba' provides a probability estimate of a customer canceling or not.

# Selects the columns and sorts them in descending order by 'proba_cancel'.
new_customers[['name', 'phone-number', 'will_cancel', 'proba_cancel']].sort_values(by='proba_cancel', ascending=False).head(20)


The value is not a certainty; it is an estimate produced by our model. The company decides which probability it considers high enough to act on (it could be 60%).

By default, the RandomForestClassifier predicts the class with the higher estimated probability, which for two classes amounts to a 50% threshold. For example, a booking with a cancellation probability of 0.51 is marked as a cancellation, while one with 0.40 is not.
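
A custom threshold can be applied directly to the predicted probabilities. The sketch below uses invented probabilities and a hypothetical helper, `flag_cancellations`, to show how lowering the cut-off flags more bookings, which favours recall.

```python
# Hypothetical cancellation probabilities, e.g. the 'proba_cancel' column.
proba_cancel = [0.51, 0.45, 0.75, 0.30, 0.62]


def flag_cancellations(probabilities, threshold=0.5):
    """Return 1 where the probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


default_flags = flag_cancellations(proba_cancel)        # 0.5 cut-off
cautious_flags = flag_cancellations(proba_cancel, 0.4)  # lower cut-off flags more bookings

print(default_flags, cautious_flags)
```

The lower threshold means more confirmation calls, a trade-off the client has said it is willing to accept.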



Questions and considerations for improving the model and analyzing potential data leakage:

* Prediction Interval and Confidence Levels: Generating a prediction interval for the probability that accommodates a 95% confidence level requires additional calculation and research. This can help in assessing the uncertainty in the model's predictions.

* Graph Creation: Learning to create various types of graphs is essential for visual data analysis and interpretation. Graphical representations can provide insights into the data and model performance.

* Data Leakage Concerns: Training a model with potential data leakage (e.g., not removing certain columns initially or using validation/test data during training) should raise suspicions if the model performs exceptionally well. If a model seems too good to be true initially, it's worth investigating for data leakage.

* Feature Engineering and Room Assignment: Incorporating derived features, such as whether the customer received the room they requested (comparing reserved_room_type vs. assigned_room_type), can enhance model predictions. Dates of stay could also be a powerful predictive tool, suggesting the inclusion of date features in your model.

* Cross-validation and Hyperparameter Tuning: Automated cross-validation can replace manual validation processes, making model evaluation more efficient. Hyperparameter tuning is also crucial for optimizing model performance.

* Handling reservation_status_date: Decomposing the date into year, month, and day can capture seasonal trends or specific calendar events affecting customer decisions. However, care should be taken to avoid inadvertently introducing data leakage through this process.

* Advanced Tools and Techniques: Exploring advanced tools that train meta-models to identify and address low-quality data can further improve model training and accuracy. Tools like Edge Impulse might offer such capabilities, though their specific functionalities should be explored further.

In summary, addressing these points involves careful consideration of data preprocessing, feature engineering, model evaluation techniques, and potential use of advanced tools to refine the predictive model and ensure its validity and robustness.
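
The cross-validation point in the list above can be sketched with scikit-learn's `GridSearchCV`, scored on recall. The data is synthetic and the parameter grid is a hypothetical example; with the real project, the estimator would be the full inference pipeline and the parameter names would carry the step prefix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},  # hypothetical grid
    scoring="recall",  # optimize the metric the client cares about
    cv=3,              # 3-fold cross-validation replaces the manual validation split
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```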

