# Phase 4: Machine Learning Models

This notebook trains two machine learning models to predict hotel booking cancellations:
1. Naive Bayes
2. Decision Tree

Both models are trained with PySpark MLlib for distributed processing.


In [1]:
# Install PySpark if not already installed
%pip install pyspark findspark -q

# Import libraries
from pyspark.sql import SparkSession
from pyspark.ml.classification import (
    NaiveBayes, DecisionTreeClassifier
)
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.feature import MinMaxScaler, VectorAssembler
from pyspark.sql.functions import col
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported")


Note: you may need to restart the kernel to use updated packages.
✓ Libraries imported


## Step 1: Initialize Spark and Load Data

**Note**: This notebook assumes you've run the previous preprocessing notebook. If not, run `03_spark_preprocessing.ipynb` first.


In [2]:
# Reuse an existing Spark session if one is alive; otherwise create a new one
import os

try:
    # Raises NameError if `spark` is not defined in this kernel
    spark
    # Raises if the session object exists but is not usable
    spark.sparkContext.getConf().get("spark.app.name")
    print("✓ Using existing Spark session")
except Exception:  # NameError and AttributeError are subclasses of Exception
    # Create new Spark session
    print("Creating new Spark session...")
    try:
        spark = SparkSession.builder \
            .appName("HotelBookingML") \
            .config("spark.sql.adaptive.enabled", "true") \
            .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
            .config("spark.driver.memory", "4g") \
            .getOrCreate()
        spark.sparkContext.setLogLevel("WARN")
        print("✓ New Spark session created")
        print(f"Spark version: {spark.version}")
    except Exception as spark_error:
        print(f"⚠ ERROR: Failed to create Spark session: {spark_error}")
        print("\nTroubleshooting steps:")
        print("1. Make sure PySpark is installed: !pip install pyspark")
        print("2. Restart the runtime: Runtime -> Restart runtime")
        print("3. Run this cell again")
        raise

# Load processed data from previous notebook
# If running notebooks separately, data is saved as Parquet files

# Build data paths relative to the notebook location (`os` is imported above)
save_dir = os.path.abspath(os.path.join("..", "data", "processed_data"))
print(f"Working directory: {os.getcwd()}")

train_path = os.path.join(save_dir, "train_data.parquet")
test_path = os.path.join(save_dir, "test_data.parquet")

print(f"\nLooking for data in: {save_dir}")

try:
    # Try to load from saved Parquet files (when running notebooks separately)
    # Check the Parquet outputs themselves, not only their parent directory
    if os.path.exists(train_path) and os.path.exists(test_path):
        train_df = spark.read.parquet(train_path)
        test_df = spark.read.parquet(test_path)
        print("✓ Data loaded from Parquet files")
        print(f"  - Training set: {train_df.count():,} records")
        print(f"  - Test set: {test_df.count():,} records")
    else:
        raise FileNotFoundError(f"Parquet files not found in {save_dir}")
except Exception as e:
    # If Parquet files don't exist, try to use variables from previous notebook
    try:
        train_df
        test_df
        print("\n✓ Using train_df and test_df from previous notebook")
        print(f"  - Training set: {train_df.count():,} records")
        print(f"  - Test set: {test_df.count():,} records")
    except NameError:
        print("\n⚠ ERROR: Could not find processed data!")
        print(f"\nPossible issues:")
        print(f"1. Parquet files not found at: {save_dir}")
        print(f"2. Previous notebook variables not available")
        print(f"\nSolution:")
        print(f"  - Run '03_spark_preprocessing.ipynb' first")
        print(f"  - Make sure Cell 27 in preprocessing notebook completed successfully")
        print(f"  - Check that data was saved to: {save_dir}")
        print(f"\nError details: {e}")
        raise


Creating new Spark session...


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/30 20:57:08 WARN Utils: Your hostname, Abdelrahmans-MacBook-Pro.local, resolves to a loopback address: 127.0.0.1; using 192.168.1.18 instead (on interface en0)
25/12/30 20:57:08 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/30 20:57:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/30 20:57:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


✓ New Spark session created
Spark version: 4.1.0
Working directory: /Users/abdelrahman/Developer/Hotel Booking Cancellation Prediction/notebooks

Looking for data in: /Users/abdelrahman/Developer/Hotel Booking Cancellation Prediction/data/processed_data
✓ Data loaded from Parquet files
  - Training set: 95,673 records
  - Test set: 23,717 records


## Step 2: Model Training Functions

Define functions to train and evaluate models.
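
As a rough illustration of the scaling step used for Naive Bayes, the sketch below applies min-max scaling with bounds learned from a training column, then clips to [0, 1] so that unseen values outside the training range still end up non-negative. It is plain Python on a single column, not the MLlib `MinMaxScaler`; the helper names `fit_min_max` and `scale_and_clip` and the sample values are hypothetical.

```python
def fit_min_max(train_values):
    """Learn the column minimum and maximum from training data only."""
    return min(train_values), max(train_values)


def scale_and_clip(values, col_min, col_max):
    """Rescale to [0, 1] using the training bounds, then clip the result."""
    span = col_max - col_min
    scaled = []
    for v in values:
        # Assumption: a constant training column (zero span) maps to the
        # midpoint 0.5; adjust if your scaler uses a different convention.
        x = 0.5 if span == 0 else (v - col_min) / span
        scaled.append(max(0.0, min(1.0, x)))
    return scaled


train_column = [0.0, 10.0, 40.0]
test_column = [-5.0, 20.0, 80.0]  # -5 and 80 fall outside the training range

col_min, col_max = fit_min_max(train_column)
print(scale_and_clip(train_column, col_min, col_max))  # [0.0, 0.25, 1.0]
print(scale_and_clip(test_column, col_min, col_max))   # [0.0, 0.5, 1.0]
```

Fitting the bounds on the training column only and reusing them for the test column keeps information from the test set out of the preprocessing step.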


In [3]:
def prepare_data_for_naive_bayes(df):
    """
    Prepare data for Naive Bayes by scaling features to [0, 1] range.
    Naive Bayes requires non-negative feature values.
    """
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT
    
    # Extract features vector to scale
    # MinMaxScaler scales features to [0, 1] range, ensuring non-negative values
    scaler = MinMaxScaler(
        inputCol="features",
        outputCol="features_scaled",
        min=0.0,
        max=1.0
    )
    
    # Fit scaler on training data and transform
    scaler_model = scaler.fit(df)
    df_scaled = scaler_model.transform(df)
    
    # Clip values to [0, 1] as a safeguard so Naive Bayes never sees negatives.
    # Rows the scaler was fitted on already lie in range; the same clipping is
    # applied to the test set, where values can fall outside the training bounds.
    def clip_vector(v):
        """Clip vector values to [0, 1] range."""
        values = v.toArray()
        clipped = [max(0.0, min(1.0, val)) for val in values]
        return Vectors.dense(clipped)
    
    clip_udf = udf(clip_vector, VectorUDT())
    df_scaled = df_scaled.withColumn("features", clip_udf(col("features_scaled"))).drop("features_scaled")
    
    return df_scaled, scaler_model

def train_naive_bayes(train_df):
    """Train Naive Bayes model."""
    print("Training Naive Bayes...")
    print("  - Scaling features to [0, 1] range (Naive Bayes requires non-negative values)...")
    
    # Scale features to make them non-negative
    train_df_scaled, scaler_model = prepare_data_for_naive_bayes(train_df)
    
    # Train Naive Bayes on scaled features
    nb = NaiveBayes(
        featuresCol='features',
        labelCol='label'
    )
    model = nb.fit(train_df_scaled)
    print("✓ Naive Bayes trained")
    return model, scaler_model

def train_decision_tree(train_df, max_depth=10, max_bins=256):
    """
    Train Decision Tree model.
    
    Args:
        train_df: Training DataFrame
        max_depth: Maximum depth of the tree (default: 10)
        max_bins: Maximum number of bins for discretizing continuous features
                  Must be at least as large as the number of values in any categorical feature
                  (default: 256 to handle high-cardinality categorical features)
    """
    print("Training Decision Tree...")
    print(f"  - maxDepth: {max_depth}, maxBins: {max_bins}")
    dt = DecisionTreeClassifier(
        featuresCol='features',
        labelCol='label',
        maxDepth=max_depth,
        maxBins=max_bins,  # Increased to handle high-cardinality categorical features
        impurity='gini'
    )
    model = dt.fit(train_df)
    print("✓ Decision Tree trained")
    return model

print("✓ Model training functions defined")


✓ Model training functions defined


In [4]:
def evaluate_model(predictions, model_name):
    """Evaluate model and return metrics."""
    # Binary classification evaluator for AUC
    binary_evaluator = BinaryClassificationEvaluator(
        labelCol='label',
        rawPredictionCol='rawPrediction',
        metricName='areaUnderROC'
    )
    
    # Multiclass evaluator for other metrics
    multiclass_evaluator = MulticlassClassificationEvaluator(
        labelCol='label',
        predictionCol='prediction',
        metricName='accuracy'
    )
    
    metrics = {
        'model': model_name,
        'accuracy': multiclass_evaluator.evaluate(predictions),
        'auc': binary_evaluator.evaluate(predictions)
    }
    
    # Calculate precision, recall, F1
    for metric_name in ['weightedPrecision', 'weightedRecall', 'f1']:
        evaluator = MulticlassClassificationEvaluator(
            labelCol='label',
            predictionCol='prediction',
            metricName=metric_name
        )
        metrics[metric_name] = evaluator.evaluate(predictions)
    
    return metrics

print("✓ Evaluation function defined")


✓ Evaluation function defined


## Step 3: Train Model 1 - Naive Bayes


In [5]:
# Train Naive Bayes
nb_model, nb_scaler = train_naive_bayes(train_df)

# Scale test data using the same scaler fitted on training data
print("  - Scaling test data...")
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# Transform test data
test_df_scaled = nb_scaler.transform(test_df)

# Clip values to [0, 1] to handle any out-of-range values
def clip_vector(v):
    """Clip vector values to [0, 1] range."""
    values = v.toArray()
    clipped = [max(0.0, min(1.0, val)) for val in values]
    return Vectors.dense(clipped)

clip_udf = udf(clip_vector, VectorUDT())
test_df_scaled = test_df_scaled.withColumn("features", clip_udf(col("features_scaled"))).drop("features_scaled")

# Make predictions on scaled test data
nb_predictions = nb_model.transform(test_df_scaled)

# Evaluate
nb_metrics = evaluate_model(nb_predictions, "Naive Bayes")

print("\n=== Naive Bayes Results ===")
for metric, value in nb_metrics.items():
    if metric != 'model':
        print(f"{metric.capitalize()}: {value:.4f}")


Training Naive Bayes...
  - Scaling features to [0, 1] range (Naive Bayes requires non-negative values)...



✓ Naive Bayes trained
  - Scaling test data...


25/12/30 20:57:16 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS



=== Naive Bayes Results ===
Accuracy: 0.7709
Auc: 0.4942
Weightedprecision: 0.8257
Weightedrecall: 0.7709
F1: 0.7382
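
An AUC just below 0.5 alongside 77% accuracy is unusual and worth checking. One possible explanation (an assumption, not verified on this dataset) is that `BinaryClassificationEvaluator` ranks rows by the class-1 entry of `rawPrediction`, which for Naive Bayes is an unnormalized log-score rather than a posterior probability. The plain-Python sketch below uses made-up scores to show how ranking by the raw class-1 score can invert an ordering that the normalized probability gets right.

```python
import math


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prob_class1(raw):
    """Softmax over two unnormalized log-scores, returning P(class 1)."""
    m = max(raw)
    e0, e1 = math.exp(raw[0] - m), math.exp(raw[1] - m)
    return e1 / (e0 + e1)


# Made-up [class-0, class-1] log-scores for four rows (illustrative only)
raw_scores = [[-30.0, -28.0], [-12.0, -11.0], [-5.0, -9.0], [-8.0, -10.0]]
labels = [1, 1, 0, 0]

auc_raw = auc(labels, [r[1] for r in raw_scores])             # rank by raw class-1 score
auc_prob = auc(labels, [prob_class1(r) for r in raw_scores])  # rank by P(class 1)

print(f"AUC from raw class-1 score: {auc_raw:.2f}")  # 0.00
print(f"AUC from probability:       {auc_prob:.2f}")  # 1.00
```

If this is the cause here, re-evaluating Naive Bayes with `rawPredictionCol='probability'` would be a reasonable check, since the evaluator accepts a vector of label probabilities in that column.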


In [6]:
# Show sample predictions
print("\n=== Sample Predictions ===")
nb_predictions.select("label", "prediction", "probability").show(10)



=== Sample Predictions ===
+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|    1|       0.0|[0.78765659530828...|
|    1|       1.0|[0.13344172323286...|
|    1|       1.0|[0.38538517338815...|
|    0|       0.0|[0.93500605987029...|
|    0|       0.0|[0.90982296594122...|
|    0|       0.0|[0.88433900542598...|
|    0|       0.0|[0.91784373778433...|
|    0|       0.0|[0.88634665148152...|
|    0|       0.0|[0.89232141611945...|
|    0|       0.0|[0.91506695796855...|
+-----+----------+--------------------+
only showing top 10 rows


Traceback (most recent call last):
  File "/Users/abdelrahman/Developer/Hotel Booking Cancellation Prediction/venv/lib/python3.13/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 233, in manager
    code = worker(sock, authenticated)
  File "/Users/abdelrahman/Developer/Hotel Booking Cancellation Prediction/venv/lib/python3.13/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 87, in worker
    outfile.flush()
    ~~~~~~~~~~~~~^^
BrokenPipeError: [Errno 32] Broken pipe


## Step 4: Train Model 2 - Decision Tree


In [7]:
# Train Decision Tree
dt_model = train_decision_tree(train_df, max_depth=10)

# Make predictions
dt_predictions = dt_model.transform(test_df)

# Evaluate
dt_metrics = evaluate_model(dt_predictions, "Decision Tree")

print("\n=== Decision Tree Results ===")
for metric, value in dt_metrics.items():
    if metric != 'model':
        print(f"{metric.capitalize()}: {value:.4f}")


Training Decision Tree...
  - maxDepth: 10, maxBins: 256
✓ Decision Tree trained

=== Decision Tree Results ===
Accuracy: 0.8390
Auc: 0.8581
Weightedprecision: 0.8390
Weightedrecall: 0.8390
F1: 0.8390
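
Accuracy and weighted recall are identical for both models, and that is expected: weighting each class's recall by its share of the true labels reduces algebraically to overall accuracy. The small sketch below checks this on made-up labels; the helper names are hypothetical and the code is plain Python rather than the MLlib evaluator.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of rows where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, weighted by each class's share of the true labels."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)


y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(accuracy(y_true, y_pred))         # 0.7
print(weighted_recall(y_true, y_pred))  # same value, up to float rounding
```

Because the two numbers always coincide, reporting per-class recall (for example, recall on cancelled bookings) would add information that the weighted figure cannot.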


## Step 5: Decision Tree Feature Importance


In [8]:
# Decision Tree feature importance (if available)
try:
    feature_importance = dt_model.featureImportances
    print("\n=== Top 10 Most Important Features (Decision Tree) ===")
    # Note: Feature names would need to be mapped from indices
    # This is a simplified version
    importances = feature_importance.toArray()
    top_indices = np.argsort(importances)[-10:][::-1]
    for idx in top_indices:
        print(f"Feature {idx}: {importances[idx]:.4f}")
except AttributeError:  # model type does not expose featureImportances
    print("Feature importance not available for this model")



=== Top 10 Most Important Features (Decision Tree) ===
Feature 25: 0.4414
Feature 0: 0.1046
Feature 20: 0.0859
Feature 13: 0.0837
Feature 17: 0.0761
Feature 21: 0.0739
Feature 12: 0.0429
Feature 1: 0.0337
Feature 26: 0.0192
Feature 18: 0.0070
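
The indices above are positions in the assembled `features` vector. When that vector was built with `VectorAssembler`, Spark usually records per-slot names in the column's `ml_attr` metadata, which in a live session could be read from `train_df.schema["features"].metadata`. The sketch below is a hypothetical helper that flattens such a metadata dictionary into an index-to-name map. The example dictionary, its feature names, and the importances are invented for illustration, and the real metadata may be absent if an upstream stage dropped it.

```python
def feature_names_from_metadata(metadata):
    """Return {vector index: feature name} from a Spark ML attribute dict."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    names = {}
    # Attributes are grouped by kind (for example "numeric", "binary", "nominal")
    for group in attrs.values():
        for attr in group:
            idx = attr["idx"]
            names[idx] = attr.get("name", "feature_" + str(idx))
    return dict(sorted(names.items()))


# Invented metadata in the shape VectorAssembler typically writes
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "col_a"}, {"idx": 2, "name": "col_c"}],
            "binary": [{"idx": 1, "name": "col_b_flag"}],
        },
        "num_attrs": 3,
    }
}

index_to_name = feature_names_from_metadata(example_metadata)
importances = [0.6, 0.1, 0.3]  # made-up importances, one per vector slot

# Print features from most to least important, using names where known
for idx in sorted(range(len(importances)), key=lambda i: -importances[i]):
    name = index_to_name.get(idx, "feature_" + str(idx))
    print(f"{name}: {importances[idx]:.4f}")
```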


## Step 6: Model Comparison


In [9]:
# Collect all metrics
all_metrics = [nb_metrics, dt_metrics]

# Create comparison DataFrame
metrics_df = pd.DataFrame(all_metrics)
metrics_df = metrics_df.set_index('model')

print("=== Model Comparison ===")
display(metrics_df.round(4))


=== Model Comparison ===


| model | accuracy | auc | weightedPrecision | weightedRecall | f1 |
|---|---|---|---|---|---|
| Naive Bayes | 0.7709 | 0.4942 | 0.8257 | 0.7709 | 0.7382 |
| Decision Tree | 0.8390 | 0.8581 | 0.8390 | 0.8390 | 0.8390 |


In [10]:
# List the objects this notebook leaves in memory for the evaluation notebook.
# Predictions are not written to disk here; only the metrics table is saved below.
print("✓ All models trained and evaluated")
print("\nModels and predictions available:")
print("  - nb_model, nb_predictions, nb_metrics")
print("  - dt_model, dt_predictions, dt_metrics")
print("  - metrics_df (comparison table)")

# Save metrics to CSV for report
import os

# Use relative path from notebook location
metrics_path = os.path.join("..", "data", "model_metrics.csv")
metrics_path = os.path.abspath(metrics_path)  # Convert to absolute path

metrics_df.to_csv(metrics_path)
print(f"\n✓ Metrics saved to {metrics_path}")


✓ All models trained and evaluated

Models and predictions available:
  - nb_model, nb_predictions, nb_metrics
  - dt_model, dt_predictions, dt_metrics
  - metrics_df (comparison table)

✓ Metrics saved to /Users/abdelrahman/Developer/Hotel Booking Cancellation Prediction/data/model_metrics.csv


## Summary

- ✓ Naive Bayes trained and evaluated
- ✓ Decision Tree trained and evaluated
- ✓ Model comparison completed

**Next Steps**: Proceed to `05_evaluation_visualization.ipynb` for detailed evaluation and visualizations.
