# Phase 3: Spark Data Processing and Feature Engineering

This notebook demonstrates:
1. Setting up PySpark
2. Loading data into a Spark DataFrame
3. Feature engineering using Spark transformations
4. Data preprocessing pipeline
5. Preparing data for machine learning

## Why Spark?
- Distributed processing for large datasets
- Scalable feature engineering
- MLlib for machine learning
- Handles datasets larger than a single machine's memory


In [9]:
# Install PySpark
%pip install pyspark findspark -q


Note: you may need to restart the kernel to use updated packages.


In [10]:
# Import libraries
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, isnan, isnull, count, avg, sum as spark_sum
from pyspark.sql.types import IntegerType, DoubleType, StringType
from pyspark.ml.feature import VectorAssembler, StringIndexer, Imputer, StandardScaler
from pyspark.ml import Pipeline
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported")


✓ Libraries imported


## Step 1: Initialize Spark Session


In [11]:
# Create Spark session
spark = SparkSession.builder \
    .appName("HotelBookingPreprocessing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

# Set log level to reduce verbosity
spark.sparkContext.setLogLevel("WARN")

print("✓ Spark session created")
print(f"Spark version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl if hasattr(spark.sparkContext, 'uiWebUrl') else 'N/A'}")


✓ Spark session created
Spark version: 4.1.0
Spark UI: http://192.168.1.18:4040


## Step 2: Load Data into Spark


In [12]:
# Load from CSV
csv_path = "../data/hotel_bookings.csv"

# Alternative source: MongoDB (requires the MongoDB Spark Connector).
# For simplicity, this notebook loads from CSV only.

# Load CSV into Spark DataFrame
# ⚠️ CRITICAL: Use nullValue="NA" to treat 'NA' strings as null values during load
# This prevents casting errors later in the pipeline
df_spark = spark.read.csv(
    csv_path, 
    header=True, 
    inferSchema=True,
    nullValue="NA"  # This converts 'NA' strings to null at source
)

print(f"✓ Data loaded into Spark")
print(f"Number of partitions: {df_spark.rdd.getNumPartitions()}")
print(f"Total records: {df_spark.count():,}")
print(f"\nSchema:")
df_spark.printSchema()



✓ Data loaded into Spark
Number of partitions: 5
Total records: 119,390

Schema:
root
 |-- hotel: string (nullable = true)
 |-- is_canceled: integer (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_date_year: integer (nullable = true)
 |-- arrival_date_month: string (nullable = true)
 |-- arrival_date_week_number: integer (nullable = true)
 |-- arrival_date_day_of_month: integer (nullable = true)
 |-- stays_in_weekend_nights: integer (nullable = true)
 |-- stays_in_week_nights: integer (nullable = true)
 |-- adults: integer (nullable = true)
 |-- children: integer (nullable = true)
 |-- babies: integer (nullable = true)
 |-- meal: string (nullable = true)
 |-- country: string (nullable = true)
 |-- market_segment: string (nullable = true)
 |-- distribution_channel: string (nullable = true)
 |-- is_repeated_guest: integer (nullable = true)
 |-- previous_cancellations: integer (nullable = true)
 |-- previous_bookings_not_canceled: integer (nullable = true)
 |-- re
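
To make the effect of `nullValue="NA"` concrete, here is a minimal plain-Python sketch (not Spark's CSV reader) that parses a tiny made-up CSV and maps the literal `NA` to `None` before converting the column to integers. The sample rows and the helper name `parse_int_column` are illustrative assumptions, not part of the notebook.

```python
import csv
import io

# Made-up sample resembling two columns of the bookings file.
SAMPLE_CSV = "adults,children\n2,0\n2,NA\n1,1\n"

def parse_int_column(text, column, null_token="NA"):
    """Return the column as ints, mapping the null token to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [None if row[column] == null_token else int(row[column])
            for row in reader]

children = parse_int_column(SAMPLE_CSV, "children")
print(children)  # [0, None, 1]
```

Without the mapping, `int("NA")` raises `ValueError`. This is roughly analogous to why a column containing `NA` strings would otherwise be inferred as a string column rather than an integer column.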

In [13]:
# Show first few rows
print("=== Sample Data ===")
df_spark.show(5, truncate=False)


=== Sample Data ===
+------------+-----------+---------+-----------------+------------------+------------------------+-------------------------+-----------------------+--------------------+------+--------+------+----+-------+--------------+--------------------+-----------------+----------------------+------------------------------+------------------+------------------+---------------+------------+-----+-------+--------------------+-------------+----+---------------------------+-------------------------+------------------+-----------------------+
|hotel       |is_canceled|lead_time|arrival_date_year|arrival_date_month|arrival_date_week_number|arrival_date_day_of_month|stays_in_weekend_nights|stays_in_week_nights|adults|children|babies|meal|country|market_segment|distribution_channel|is_repeated_guest|previous_cancellations|previous_bookings_not_canceled|reserved_room_type|assigned_room_type|booking_changes|deposit_type|agent|company|days_in_waiting_list|customer_type|adr |required_car_p

25/12/30 16:48:20 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


## Step 3: Data Quality Check


In [14]:
# Check for missing values
print("=== Missing Values Analysis ===")

# Use isnull() for all columns (works for every data type: numeric, string, date, etc.)
# Note: NaN is not null in Spark; isnan() would be needed to catch NaN in float/double columns
missing_counts = df_spark.select([count(when(isnull(col(c)), c)).alias(c) 
                                   for c in df_spark.columns])

# Convert to pandas for better display
missing_pd = missing_counts.toPandas().T
missing_pd.columns = ['Missing Count']
missing_pd = missing_pd[missing_pd['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_pd) > 0:
    print("Columns with missing values:")
    display(missing_pd)
else:
    print("✓ No missing values found")
    

=== Missing Values Analysis ===
Columns with missing values:


          Missing Count
children              4
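
Spark distinguishes null from NaN: `isnull` matches only null, while `isnan` is the check for NaN in float or double columns. The plain-Python sketch below mirrors that distinction with `None` and `float("nan")`. The helper `count_missing` and the sample values are hypothetical.

```python
import math

def count_missing(values):
    """Count None and NaN separately, mirroring the isnull / isnan distinction."""
    nulls = sum(1 for v in values if v is None)
    nans = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
    return {"null": nulls, "nan": nans}

sample = [0.0, None, float("nan"), 2.0, None]
result = count_missing(sample)
print(result)  # {'null': 2, 'nan': 1}
```

A null-only check would report 2 missing values for this sample and overlook the NaN entry.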


In [15]:
# Basic statistics
print("=== Basic Statistics ===")
df_spark.describe().show()


=== Basic Statistics ===




+-------+------------+-------------------+------------------+------------------+------------------+------------------------+-------------------------+-----------------------+--------------------+------------------+-------------------+--------------------+---------+-------+--------------+--------------------+-------------------+----------------------+------------------------------+------------------+------------------+-------------------+------------+-----------------+------------------+--------------------+---------------+------------------+---------------------------+-------------------------+------------------+
|summary|       hotel|        is_canceled|         lead_time| arrival_date_year|arrival_date_month|arrival_date_week_number|arrival_date_day_of_month|stays_in_weekend_nights|stays_in_week_nights|            adults|           children|              babies|     meal|country|market_segment|distribution_channel|  is_repeated_guest|previous_cancellations|previous_bookings_not_cance


## Step 4: Feature Engineering

Create new features using Spark transformations.


In [16]:
# Feature 1: Total nights
df_spark = df_spark.withColumn(
    "total_nights",
    col("stays_in_weekend_nights") + col("stays_in_week_nights")
)

# Feature 2: Total guests
# Use try_cast to safely handle 'NA' strings, then coalesce for nulls
from pyspark.sql.functions import coalesce, lit, expr
df_spark = df_spark.withColumn(
    "children_clean",
    expr("try_cast(children as double)")  # Safely cast, returns null if 'NA'
).withColumn(
    "babies_clean",
    expr("try_cast(babies as double)")  # Safely cast, returns null if 'NA'
).withColumn(
    "total_guests",
    coalesce(col("adults"), lit(0)) + 
    coalesce(col("children_clean"), lit(0.0)) +
    coalesce(col("babies_clean"), lit(0.0))
).drop("children_clean", "babies_clean")  # Drop temporary columns

# Feature 3: Total booking value (ADR × total nights)
df_spark = df_spark.withColumn(
    "total_booking_value",
    col("adr") * col("total_nights")
)

# Feature 4: Lead time categories
df_spark = df_spark.withColumn(
    "lead_time_category",
    when(col("lead_time") <= 30, "Very Short")
    .when(col("lead_time") <= 90, "Short")
    .when(col("lead_time") <= 180, "Medium")
    .when(col("lead_time") <= 365, "Long")
    .otherwise("Very Long")
)

# Feature 5: Previous cancellation ratio
df_spark = df_spark.withColumn(
    "previous_cancellation_ratio",
    col("previous_cancellations") / 
    (col("previous_cancellations") + col("previous_bookings_not_canceled") + 1)
)

print("✓ Feature engineering completed")
print("\nNew features created:")
print("  - total_nights")
print("  - total_guests")
print("  - total_booking_value")
print("  - lead_time_category")
print("  - previous_cancellation_ratio")


✓ Feature engineering completed

New features created:
  - total_nights
  - total_guests
  - total_booking_value
  - lead_time_category
  - previous_cancellation_ratio
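
The `when(...).otherwise(...)` chain for `lead_time_category` picks the first matching upper bound. A plain-Python reference version, sketched below, can help when checking the thresholds by hand. The function name is hypothetical, and the `None` branch reflects the assumption that a null comparison in Spark is not true, so such a row would fall through to `otherwise`.

```python
def lead_time_category(lead_time):
    """Mirror the when/otherwise chain: the first matching upper bound wins."""
    if lead_time is None:
        # A null comparison is not true, so the row reaches the final branch.
        return "Very Long"
    bounds = [(30, "Very Short"), (90, "Short"), (180, "Medium"), (365, "Long")]
    for upper, name in bounds:
        if lead_time <= upper:
            return name
    return "Very Long"

for days in (0, 30, 31, 180, 365, 366):
    print(days, lead_time_category(days))
```

The bounds are inclusive, so a lead time of exactly 30 days is "Very Short" and 31 days is "Short".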


In [17]:
# Verify new features
print("=== Sample of New Features ===")
df_spark.select("total_nights", "total_guests", "total_booking_value", 
                "lead_time_category", "previous_cancellation_ratio").show(10)


=== Sample of New Features ===
+------------+------------+-------------------+------------------+---------------------------+
|total_nights|total_guests|total_booking_value|lead_time_category|previous_cancellation_ratio|
+------------+------------+-------------------+------------------+---------------------------+
|           0|         2.0|                0.0|              Long|                        0.0|
|           0|         2.0|                0.0|         Very Long|                        0.0|
|           1|         1.0|               75.0|        Very Short|                        0.0|
|           1|         1.0|               75.0|        Very Short|                        0.0|
|           2|         2.0|              196.0|        Very Short|                        0.0|
|           2|         2.0|              196.0|        Very Short|                        0.0|
|           2|         2.0|              214.0|        Very Short|                        0.0|
|           2|     
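
The two arithmetic features can be checked by hand. The sketch below restates them as plain-Python functions (hypothetical names, illustrative inputs). The `+ 1` in the denominator avoids division by zero for guests with no booking history, at the cost of pulling every ratio slightly toward zero.

```python
def previous_cancellation_ratio(cancellations, not_canceled):
    """Share of past bookings that were canceled, with +1 smoothing."""
    return cancellations / (cancellations + not_canceled + 1)

def total_booking_value(adr, weekend_nights, week_nights):
    """Average daily rate multiplied by the total number of nights."""
    return adr * (weekend_nights + week_nights)

print(previous_cancellation_ratio(0, 0))  # 0.0 (no history, no division by zero)
print(previous_cancellation_ratio(1, 0))  # 0.5 (not 1.0, because of the +1)
print(total_booking_value(75.0, 0, 1))    # 75.0
print(total_booking_value(75.0, 0, 0))    # 0.0 (zero-night stays have zero value)
```

Rows with `total_nights` equal to 0, like the first two in the sample output, always get a `total_booking_value` of 0.0 regardless of their rate.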

## Step 5: Handle Missing Values


In [18]:
# Fill missing values
# Handle both null values and 'NA' strings using try_cast for safety
from pyspark.sql.functions import expr, coalesce, lit

# Children
df_spark = df_spark.withColumn(
    "children",
    coalesce(expr("try_cast(children as double)"), lit(0.0))  # try_cast returns null if 'NA', then coalesce to 0
)

# Agent
df_spark = df_spark.withColumn(
    "agent",
    coalesce(expr("try_cast(agent as double)"), lit(0.0))  # try_cast returns null if 'NA', then coalesce to 0
)

# Company
df_spark = df_spark.withColumn(
    "company",
    coalesce(expr("try_cast(company as double)"), lit(0.0))  # try_cast returns null if 'NA', then coalesce to 0
)

# Country (string column - handle differently)
df_spark = df_spark.withColumn(
    "country",
    when(col("country").isNull(), "Unknown")
    .when(col("country") == "NA", "Unknown")
    .otherwise(col("country"))
)

# CRITICAL: Clean is_canceled column early to avoid issues later
# This is the target variable and must be integer type with no 'NA' strings
df_spark = df_spark.withColumn(
    "is_canceled",
    when(col("is_canceled").cast("string").isin(["NA", "na", "Na", "nA", "null", "NULL", ""]), 0)
    .when(col("is_canceled").isNull(), 0)
    .otherwise(col("is_canceled").cast("integer"))
)

print("✓ Missing values handled (including 'NA' strings)")
print("✓ Target variable (is_canceled) cleaned and cast to integer")


✓ Missing values handled (including 'NA' strings)
✓ Target variable (is_canceled) cleaned and cast to integer
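
The pattern `coalesce(try_cast(x as double), 0.0)` means "parse if possible, otherwise fall back to a default". A rough plain-Python analogue is sketched below. The helper name is hypothetical, and Python's `float()` does not accept exactly the same strings as Spark's cast, so treat it as an illustration of the control flow only.

```python
def to_double_or_default(value, default=0.0):
    """Plain-Python analogue of coalesce(try_cast(value as double), default)."""
    if value is None:
        return default
    try:
        return float(value)
    except (TypeError, ValueError):
        # Unparseable input such as 'NA' falls back to the default.
        return default

for raw in ("3", "NA", None, 2):
    print(repr(raw), "->", to_double_or_default(raw))
```

For ID-like columns such as `agent` and `company`, the fallback 0.0 acts as a sentinel for "none" rather than as a quantity, which is worth keeping in mind when the values are later used as numeric features.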


## Step 6: Prepare Features for ML

Select and prepare features for machine learning models.


In [19]:
# Select features for ML
# Numerical features
numerical_features = [
    "lead_time", "arrival_date_year", "arrival_date_week_number",
    "arrival_date_day_of_month", "stays_in_weekend_nights", 
    "stays_in_week_nights", "adults", "children", "babies",
    "previous_cancellations", "previous_bookings_not_canceled",
    "adr", "required_car_parking_spaces", "total_of_special_requests",
    "total_nights", "total_guests", "total_booking_value",
    "previous_cancellation_ratio"
]

# Categorical features to encode
categorical_features = [
    "hotel", "meal", "country", "market_segment",
    "distribution_channel", "reserved_room_type",
    "assigned_room_type", "deposit_type", "customer_type",
    "lead_time_category"
]

# Filter to existing columns
numerical_features = [f for f in numerical_features if f in df_spark.columns]
categorical_features = [f for f in categorical_features if f in df_spark.columns]

print(f"Numerical features: {len(numerical_features)}")
print(f"Categorical features: {len(categorical_features)}")


Numerical features: 18
Categorical features: 10


In [20]:
# Create StringIndexers for categorical features
indexers = []
for col_name in categorical_features:
    indexer = StringIndexer(
        inputCol=col_name,
        outputCol=f"{col_name}_indexed",
        handleInvalid="keep"
    )
    indexers.append(indexer)

print(f"✓ Created {len(indexers)} StringIndexers")


✓ Created 10 StringIndexers


In [21]:
# Apply StringIndexers
df_indexed = df_spark
for indexer in indexers:
    df_indexed = indexer.fit(df_indexed).transform(df_indexed)

print("✓ Categorical features indexed")


✓ Categorical features indexed
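
By default, `StringIndexer` orders labels by descending frequency, so the most common label gets index 0. With `handleInvalid="keep"`, labels not seen during fitting go into one extra bucket at index `numLabels`. The sketch below imitates that behavior in plain Python. The function names and the sample labels are made up for illustration.

```python
from collections import Counter

def fit_string_index(values):
    """Most frequent label gets 0.0; ties are broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(value, mapping):
    """Unseen labels map to one extra index, len(mapping)."""
    return mapping.get(value, float(len(mapping)))

hotels = ["City Hotel", "Resort Hotel", "City Hotel"]  # made-up sample
mapping = fit_string_index(hotels)
print(mapping)                            # {'City Hotel': 0.0, 'Resort Hotel': 1.0}
print(transform_keep("Hostel", mapping))  # 2.0
```

The resulting indices are arbitrary codes rather than ordered quantities. Tree-based models can usually work with them directly, whereas linear models typically need a one-hot encoding step afterwards.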


In [22]:
# Prepare feature columns for VectorAssembler
feature_columns = numerical_features + [f"{col}_indexed" for col in categorical_features]

# Final cleaning: ensure every numerical column is double type
df_cleaned = df_indexed

# Cast each numerical column to double with try_cast, which returns null
# (instead of raising) for any value that cannot be parsed, such as 'NA'.
# Nulls are deliberately left in place here so that the mean imputation
# below has something to fill; zero-filling at this point would make it a no-op.
for col_name in numerical_features:
    if col_name in df_cleaned.columns:
        df_cleaned = df_cleaned.withColumn(
            col_name,
            expr(f"try_cast({col_name} as double)")
        )

print("✓ Cleaned 'NA' strings and ensured numerical columns are double type")

# Handle missing values manually - calculate mean on clean data
df_imputed = df_cleaned

print("Calculating means and filling missing values...")
for i, col_name in enumerate(numerical_features):
    if col_name in df_imputed.columns:
        try:
            # Calculate mean (this will work now that 'NA' strings are null)
            mean_result = df_imputed.select(avg(col(col_name)).alias("mean")).collect()[0]
            mean_value = mean_result["mean"] if mean_result["mean"] is not None else 0.0
            
            # Fill null values with mean
            df_imputed = df_imputed.withColumn(
                f"{col_name}_imputed",
                when(col(col_name).isNull(), mean_value)
                .otherwise(col(col_name))
            )
            
            if (i + 1) % 5 == 0:
                print(f"  Processed {i + 1}/{len(numerical_features)} columns...")
        except Exception as e:
            print(f"  Warning: Could not process {col_name}: {e}")
            # If mean calculation fails, just use 0.0
            df_imputed = df_imputed.withColumn(
                f"{col_name}_imputed",
                when(col(col_name).isNull(), 0.0)
                .otherwise(col(col_name))
            )

print("✓ Missing values filled with column means")

# Update feature columns to use imputed versions
feature_columns_imputed = [f"{col}_imputed" for col in numerical_features] + \
                         [f"{col}_indexed" for col in categorical_features]

print(f"✓ Features prepared: {len(feature_columns_imputed)} total features")


✓ Cleaned 'NA' strings and ensured numerical columns are double type
Calculating means and filling missing values...
  Processed 5/18 columns...
  Processed 10/18 columns...
  Processed 15/18 columns...
✓ Missing values filled with column means
✓ Features prepared: 28 total features
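
Order matters for mean imputation. Spark's `avg` ignores nulls, so the mean is computed over observed values only. If nulls are replaced with 0.0 before the mean is taken, nothing is left to impute. The plain-Python sketch below shows both cases on made-up values; `impute_mean` is a hypothetical helper.

```python
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if observed else 0.0
    return [fill if v is None else v for v in values]

raw = [1.0, None, 3.0]
imputed = impute_mean(raw)
print(imputed)  # [1.0, 2.0, 3.0]

# Zero-filling first leaves no None values, so imputation changes nothing.
zero_filled_first = [0.0 if v is None else v for v in raw]
unchanged = impute_mean(zero_filled_first)
print(unchanged)  # [1.0, 0.0, 3.0]
```

MLlib's `Imputer`, which the notebook imports but does not use, is the built-in alternative to the manual loop.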


## Step 7: Create Feature Vector


In [23]:
# Assemble features into a vector
assembler = VectorAssembler(
    inputCols=feature_columns_imputed,
    outputCol="features",
    handleInvalid="skip"
)

df_features = assembler.transform(df_imputed)

print("✓ Features assembled into vector")
print(f"Feature vector size: {len(feature_columns_imputed)}")


✓ Features assembled into vector
Feature vector size: 28


In [24]:
# Select final columns for ML
# Aggressively clean is_canceled to handle any remaining 'NA' strings
from pyspark.sql.functions import expr

# Step 1: Clean is_canceled column - replace any 'NA' variants with null, then cast
# This handles cases where 'NA' strings might still exist
df_ml = df_features.withColumn(
    "is_canceled_clean",
    # First, replace 'NA' strings with null
    when(col("is_canceled").cast("string").isin(["NA", "na", "Na", "nA", "null", "NULL", ""]), None)
    .otherwise(col("is_canceled"))
).withColumn(
    "label_temp",
    # Use try_cast to safely convert to integer
    expr("try_cast(is_canceled_clean as int)")
).withColumn(
    "label",
    # Convert nulls to 0, ensure integer type
    when(col("label_temp").isNull(), 0)
    .otherwise(col("label_temp").cast("integer"))
).select("features", "label")

# Step 2: Filter out any rows with null labels (safety check)
df_ml = df_ml.filter(col("label").isNotNull())

# Step 3: Verify data is clean before caching
# Force evaluation to catch any errors early
print("Verifying data integrity...")
try:
    sample_count = df_ml.limit(100).count()
    print(f"✓ Sample check passed ({sample_count} records)")
except Exception as e:
    print(f"⚠ Error during verification: {e}")
    print("Attempting to filter problematic rows...")
    # Try to filter out any rows that might cause issues
    df_ml = df_ml.filter(col("label").isNotNull())
    sample_count = df_ml.limit(100).count()
    print(f"✓ After filtering: {sample_count} records")

# Cache for faster access (only after data is verified clean)
df_ml.cache()

print("\n✓ Data prepared for ML")
print(f"Total records: {df_ml.count():,}")
df_ml.show(5)


Verifying data integrity...
✓ Sample check passed (100 records)

✓ Data prepared for ML
Total records: 119,390
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(28,[0,1,2,3,6,15...|    0|
|(28,[0,1,2,3,6,15...|    0|
|(28,[0,1,2,3,5,6,...|    0|
|(28,[0,1,2,3,5,6,...|    0|
|(28,[0,1,2,3,5,6,...|    0|
+--------------------+-----+
only showing top 5 rows
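
The `features` column is printed in sparse-vector notation: the vector size, then the indices of the non-zero entries, then their values (cut off by `show`). The sketch below converts between dense and sparse forms in plain Python to show what that notation encodes. The helper names and the sample vector are made up.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Rebuild the full vector, filling unlisted positions with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

example = [7.0, 2015.0, 0.0, 2.0, 0.0]  # made-up feature values
sparse = to_sparse(example)
print(sparse)  # (5, [0, 1, 3], [7.0, 2015.0, 2.0])
```

Positions missing from the index list, such as 4 in the first displayed row, hold a value of 0.0. This is common here because many count features and the most frequent category index are zero.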


## Step 8: Train/Test Split


In [25]:
# Split data into training and test sets (80/20)
train_df, test_df = df_ml.randomSplit([0.8, 0.2], seed=42)

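# Count each DataFrame once instead of re-running the job inside every f-string
train_count = train_df.count()
test_count = test_df.count()
total_count = df_ml.count()

print("=== Train/Test Split ===")
print(f"Training set: {train_count:,} records ({train_count/total_count*100:.1f}%)")
print(f"Test set: {test_count:,} records ({test_count/total_count*100:.1f}%)")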

# Check label distribution in both sets
print("\n=== Label Distribution ===")
print("Training set:")
train_df.groupBy("label").count().show()
print("\nTest set:")
test_df.groupBy("label").count().show()


=== Train/Test Split ===
Training set: 95,673 records (80.1%)
Test set: 23,717 records (19.9%)

=== Label Distribution ===
Training set:
+-----+-----+
|label|count|
+-----+-----+
|    1|35495|
|    0|60178|
+-----+-----+


Test set:
+-----+-----+
|label|count|
+-----+-----+
|    1| 8729|
|    0|14988|
+-----+-----+
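
`randomSplit` assigns rows at random according to the weights, so the split is approximate (80.1% / 19.9% here) and is not stratified by label. A quick check of the class balance, using only the counts printed above, is sketched below. The variable and function names are illustrative.

```python
# Label counts copied from the output above.
train_counts = {1: 35495, 0: 60178}
test_counts = {1: 8729, 0: 14988}

def positive_rate(counts):
    """Fraction of rows with label 1 (canceled bookings)."""
    return counts[1] / sum(counts.values())

train_rate = positive_rate(train_counts)
test_rate = positive_rate(test_counts)
print(f"train cancellation rate: {train_rate:.3f}")  # 0.371
print(f"test cancellation rate:  {test_rate:.3f}")   # 0.368
```

The two rates are close, which suggests the random split kept the class balance similar in both sets even without explicit stratification.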



## Step 9: Save Processed Data (Optional)

Save the processed data for use in the next notebook.


In [26]:
# Save processed data to Parquet for use in next notebook
# This allows you to run notebooks separately

import os

# Create directory for saved data
save_dir = "../data/processed_data"
os.makedirs(save_dir, exist_ok=True)

# Save train and test DataFrames
train_path = f"{save_dir}/train_data.parquet"
test_path = f"{save_dir}/test_data.parquet"

print("Saving processed data...")
train_df.write.mode("overwrite").parquet(train_path)
test_df.write.mode("overwrite").parquet(test_path)

print("✓ Data preprocessing completed")
print(f"\n✓ Training data saved to: {train_path}")
print(f"✓ Test data saved to: {test_path}")
print("\nReady for machine learning models!")
print("\nNext steps:")
print("  - Run the next notebook (04_ml_models.ipynb)")
print("  - The data will be automatically loaded from saved Parquet files")
print("  - Features are in 'features' column")
print("  - Labels are in 'label' column")


Saving processed data...
✓ Data preprocessing completed

✓ Training data saved to: ../data/processed_data/train_data.parquet
✓ Test data saved to: ../data/processed_data/test_data.parquet

Ready for machine learning models!

Next steps:
  - Run the next notebook (04_ml_models.ipynb)
  - The data will be automatically loaded from saved Parquet files
  - Features are in 'features' column
  - Labels are in 'label' column



## Summary

✓ Spark session initialized
✓ Data loaded and cached
✓ Feature engineering completed
✓ Missing values handled
✓ Categorical features encoded
✓ Features assembled into vectors
✓ Train/test split performed

**Next Steps**: Proceed to `04_ml_models.ipynb` to train machine learning models.
