# CS 236 Project Phase 1: Data Cleaning and EDA

## Installation

We installed the required dependencies using pip.

```
pip install pyspark pandas plotly nbformat
```

We couldn't get PySpark 3 to work on our machines, so we used the latest PySpark 4 instead.
We also had issues with the Java version: Java 25 is not supported, so we used Java 21.

## Analysis

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, when, col, count, expr
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

hotel_booking = spark.read.csv(
    "datasets/hotel-booking.csv",
    header=True,
    inferSchema=True,
)
customer_reservations = spark.read.csv(
    "datasets/customer-reservations.csv",
    header=True,
    inferSchema=True,
)

hotel_booking.printSchema()
customer_reservations.printSchema()

root
 |-- hotel: string (nullable = true)
 |-- booking_status: integer (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_year: integer (nullable = true)
 |-- arrival_month: string (nullable = true)
 |-- arrival_date_week_number: integer (nullable = true)
 |-- arrival_date_day_of_month: integer (nullable = true)
 |-- stays_in_weekend_nights: integer (nullable = true)
 |-- stays_in_week_nights: integer (nullable = true)
 |-- market_segment_type: string (nullable = true)
 |-- country: string (nullable = true)
 |-- avg_price_per_room: double (nullable = true)
 |-- email: string (nullable = true)

root
 |-- Booking_ID: string (nullable = true)
 |-- stays_in_weekend_nights: integer (nullable = true)
 |-- stays_in_week_nights: integer (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_year: integer (nullable = true)
 |-- arrival_month: integer (nullable = true)
 |-- arrival_date: integer (nullable = true)
 |-- market_segment_type: string (nullable 

We can do some exploratory data analysis to see what these datasets look like. For instance, we can see which months are the most popular.

In [2]:
customer_reservations.groupBy("arrival_month").agg(count("*").alias("count")).sort("arrival_month").plot.line(x="arrival_month", y="count")

We can also see what the lead times of customers are.

In [3]:
customer_reservations.select("lead_time").plot.hist(column="lead_time")

From the above, we can see that the distribution is heavily right-skewed, showing that most people book close to the date they plan to stay.
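
To put a number on the skew rather than reading it off the histogram, one option is the moment-based (Fisher-Pearson) skewness coefficient. The sketch below is plain Python; the `skewness` helper and the sample lead times are made up for illustration and are not taken from the dataset.

```python
def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no spread, treat as unskewed
    return m3 / m2 ** 1.5


# Hypothetical lead times in days (NOT real data): many short, a few very long
sample_lead_times = [0, 1, 2, 3, 5, 8, 14, 30, 90, 200]

# A positive result means a long right tail, i.e. most bookings have short lead times
print(skewness(sample_lead_times))
```

On the real data, the `skewness` aggregate in `pyspark.sql.functions` should give the corresponding figure for the `lead_time` column without collecting it to the driver.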

Here we can see what the most popular market segments are. We find that there are five segments, with online being the most popular, offline being the next most, and the rest far below.

In [4]:
customer_reservations.groupBy("market_segment_type").agg(count("*").alias("count")).plot.bar(x="market_segment_type", y="count")

Two parts of `customer_reservations` need to be aligned with `hotel_booking`. We rename `arrival_date` to match `arrival_date_day_of_month`, and we map the booking status from `Canceled` and `Not_Cancelled` to 0 and 1, respectively.

In [5]:
customer_reservations = customer_reservations.withColumn(
    "booking_status", when(col("booking_status") == "Not_Cancelled", 1).otherwise(0)
).withColumnRenamed("arrival_date", "arrival_date_day_of_month")
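
The printed schemas also show `arrival_month` as a string in `hotel_booking` and an integer in `customer_reservations`. The sketch below shows one way to reconcile them, assuming the string column holds full English month names such as "July", which the output above does not confirm. `MONTH_TO_NUMBER` and `month_to_number` are hypothetical names.

```python
import calendar

# Assumption: hotel_booking.arrival_month holds full English month names.
# calendar.month_name[0] is an empty string, so it is filtered out.
MONTH_TO_NUMBER = {
    name: number for number, name in enumerate(calendar.month_name) if name
}


def month_to_number(value):
    """Map a month name such as 'July' to 7; return None if unrecognised."""
    if value is None:
        return None
    # Normalise whitespace and capitalisation before the lookup
    return MONTH_TO_NUMBER.get(value.strip().title())


print(month_to_number("July"))
```

The same dictionary could drive a `when` chain like the one used for `booking_status` above, so the conversion stays inside Spark instead of going through a Python UDF.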

After we align the two datasets, we can merge them and write to a new file.

In [6]:
customer_reservations.printSchema()

# merged_df = hotel_booking.union(customer_reservations)
# write.csv takes an output path (Spark creates a directory of part files), not a file handle
# merged_df.write.csv("merged.csv", mode="overwrite")

root
 |-- Booking_ID: string (nullable = true)
 |-- stays_in_weekend_nights: integer (nullable = true)
 |-- stays_in_week_nights: integer (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_year: integer (nullable = true)
 |-- arrival_month: integer (nullable = true)
 |-- arrival_date_day_of_month: integer (nullable = true)
 |-- market_segment_type: string (nullable = true)
 |-- avg_price_per_room: double (nullable = true)
 |-- booking_status: integer (nullable = false)
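
Comparing this schema with the `hotel_booking` schema printed earlier suggests that the two frames still differ after the rename. The sketch below lists the remaining differences. The `(name, type)` pairs are copied from the `printSchema()` output in this notebook, in the form `df.dtypes` returns them, and `schema_diff` is a hypothetical helper.

```python
# (name, type) pairs copied from the printSchema() output in this notebook;
# in a live session the same lists come from `df.dtypes`.
HOTEL_BOOKING = [
    ("hotel", "string"),
    ("booking_status", "int"),
    ("lead_time", "int"),
    ("arrival_year", "int"),
    ("arrival_month", "string"),
    ("arrival_date_week_number", "int"),
    ("arrival_date_day_of_month", "int"),
    ("stays_in_weekend_nights", "int"),
    ("stays_in_week_nights", "int"),
    ("market_segment_type", "string"),
    ("country", "string"),
    ("avg_price_per_room", "double"),
    ("email", "string"),
]
CUSTOMER_RESERVATIONS = [
    ("Booking_ID", "string"),
    ("stays_in_weekend_nights", "int"),
    ("stays_in_week_nights", "int"),
    ("lead_time", "int"),
    ("arrival_year", "int"),
    ("arrival_month", "int"),
    ("arrival_date_day_of_month", "int"),
    ("market_segment_type", "string"),
    ("avg_price_per_room", "double"),
    ("booking_status", "int"),
]


def schema_diff(left, right):
    """Return (only_left, only_right, type_mismatches) for two (name, type) lists."""
    left_types, right_types = dict(left), dict(right)
    only_left = [name for name, _ in left if name not in right_types]
    only_right = [name for name, _ in right if name not in left_types]
    mismatched = {
        name: (left_types[name], right_types[name])
        for name, _ in left
        if name in right_types and left_types[name] != right_types[name]
    }
    return only_left, only_right, mismatched


only_hotel, only_customer, mismatched = schema_diff(HOTEL_BOOKING, CUSTOMER_RESERVATIONS)
print("only in hotel_booking:", only_hotel)
print("only in customer_reservations:", only_customer)
print("type mismatches:", mismatched)
```

Because the column sets and order differ, a positional `union` is unlikely to work as-is. One option, once `arrival_month` has the same type on both sides, is `unionByName` with `allowMissingColumns=True`, which fills columns missing from either side with nulls.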

