# Hotel Booking Demand (July 2015 - August 2017)

The booking analysis proceeds as follows:

1. PySpark **environment setup**
2. Data source and **Spark data abstraction** (DataFrame) **setup**
3. Data set **metadata analysis**:
  1. Display **schema and size** of the DataFrame
  2. Data cleaning and formatting
  3. Get one or multiple **random samples** from the data set to better understand what the data is all about
  4. Identify **data entities**, **metrics** and **dimensions**
  5. **Columns/fields categorization**
4. Column group **basic profiling** to better understand our data set:
  1. **Timing related** columns basic profiling
  2. **Booking related** columns basic profiling
5. **Answer some business questions** to improve service:
  1. **Ratio of bookings** per season
  2. **Season cancellation statistics** by different characteristics (days, people, changes and lead time)
  3. **Top 20 countries** with the most bookings in Warm Seasons (Spring and Summer)

Let's go for it:

## 1. PySpark environment setup

In [2]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Reuse the running session if there is one, otherwise create it
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## 2. Data source and Spark data abstraction (DataFrame) setup

In [3]:
bookingsDF = spark.read \
                  .option("inferSchema", "true") \
                  .option("header", "true") \
                  .csv("hotel_bookings.csv")

## 3. Data set metadata analysis
### A. Display schema and size of the DataFrame

In [4]:
from IPython.display import display, Markdown

bookingsDF.printSchema()
display(Markdown("This DataFrame has **%d rows**." % bookingsDF.count()))

root
 |-- hotel: string (nullable = true)
 |-- is_canceled: integer (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_date_year: integer (nullable = true)
 |-- arrival_date_month: string (nullable = true)
 |-- arrival_date_week_number: integer (nullable = true)
 |-- arrival_date_day_of_month: integer (nullable = true)
 |-- stays_in_weekend_nights: integer (nullable = true)
 |-- stays_in_week_nights: integer (nullable = true)
 |-- adults: integer (nullable = true)
 |-- children: string (nullable = true)
 |-- babies: integer (nullable = true)
 |-- meal: string (nullable = true)
 |-- country: string (nullable = true)
 |-- market_segment: string (nullable = true)
 |-- distribution_channel: string (nullable = true)
 |-- is_repeated_guest: integer (nullable = true)
 |-- previous_cancellations: integer (nullable = true)
 |-- previous_bookings_not_canceled: integer (nullable = true)
 |-- reserved_room_type: string (nullable = true)
 |-- assigned_room_type: string (nullab

This DataFrame has **119390 rows**.
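
In the schema above, `children` is the only count-like column inferred as a string, which usually means at least one value in the file is not numeric. The source does not show the offending value, so the `NA` placeholder below is an assumption. The sketch is plain Python with a hypothetical helper, `to_int_or_none`, that turns unparseable values into `None`. This is roughly what to expect when section 5.B casts the column to an integer, assuming Spark's default non-ANSI cast behaviour.

```python
from typing import Optional


def to_int_or_none(raw: Optional[str]) -> Optional[int]:
    """Return raw as an int, or None when it is missing or not a whole number."""
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        return None


# Hypothetical raw values, as they might appear in the CSV column.
raw_children = ["0", "2", "NA", "", " 1 "]
parsed = [to_int_or_none(value) for value in raw_children]
print(parsed)
```

Counting the `None` results before any aggregation is a cheap way to see how many rows would drop out of averages and standard deviations.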

### B. Data cleaning & formatting

In [5]:
# Converting the month (string) to an integer
from pyspark.sql.functions import when, count, col, countDistinct, desc, first
bookingsFinalDF = bookingsDF\
   .where(col("arrival_date_month")!="NA")\
   .withColumn("month", when(col("arrival_date_month")=="January",1)\
                       .when((col("arrival_date_month")=="February"),2)\
                       .when((col("arrival_date_month")=="March"),3)\
                       .when((col("arrival_date_month")=="April"),4)\
                       .when((col("arrival_date_month")=="May"),5)\
                       .when((col("arrival_date_month")=="June"),6)\
                       .when((col("arrival_date_month")=="July"),7)\
                       .when((col("arrival_date_month")=="August"),8)\
                       .when((col("arrival_date_month")=="September"),9)\
                       .when((col("arrival_date_month")=="October"),10)\
                       .when((col("arrival_date_month")=="November"),11)\
                       .otherwise(12))  # any other value is treated as December

# Drop columns that contain null values or are not relevant to this analysis
columns_to_drop = ["market_segment", "distribution_channel", "deposit_type", "agent", \
                   "company", "customer_type", "adr"]
bookingsFinalDF = bookingsFinalDF.drop(*columns_to_drop)
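
The month conversion above is a lookup from an English month name to a number, and its final `otherwise(12)` branch maps every unrecognised value to December. As a minimal sketch of the same lookup in plain Python, the hypothetical names `MONTH_TO_NUMBER` and `month_number` below return `None` for unknown names, so bad input stays visible instead of being counted as December.

```python
from typing import Optional

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

# Month name -> month number (1..12).
MONTH_TO_NUMBER = {name: number for number, name in enumerate(MONTHS, start=1)}


def month_number(month_name: str) -> Optional[int]:
    """Map an English month name to 1..12, or None for an unknown name."""
    return MONTH_TO_NUMBER.get(month_name)


print(month_number("July"))
print(month_number("NA"))
```

A dictionary like this is also easy to unit test without starting a Spark session.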

### C. Get one or multiple random samples from the data set

In [6]:
bookingsFinalDF.cache() # Cached because the following cells reuse this DataFrame
# Sample about 10% of the rows without replacement and show the first two
bookingsFinalDF.sample(False, 0.1).take(2)

[Row(hotel='Resort Hotel', is_canceled=0, lead_time=78, arrival_date_year=2015, arrival_date_month='July', arrival_date_week_number=27, arrival_date_day_of_month=1, stays_in_weekend_nights=2, stays_in_week_nights=5, adults=2, children='0', babies=0, meal='BB', country='PRT', is_repeated_guest=0, previous_cancellations=0, previous_bookings_not_canceled=0, reserved_room_type='D', assigned_room_type='D', booking_changes=0, days_in_waiting_list=0, required_car_parking_spaces=1, total_of_special_requests=0, reservation_status='Check-Out', reservation_status_date='2015-07-08', month=7),
 Row(hotel='Resort Hotel', is_canceled=0, lead_time=1, arrival_date_year=2015, arrival_date_month='July', arrival_date_week_number=27, arrival_date_day_of_month=2, stays_in_weekend_nights=0, stays_in_week_nights=1, adults=2, children='2', babies=0, meal='BB', country='ESP', is_repeated_guest=0, previous_cancellations=0, previous_bookings_not_canceled=0, reserved_room_type='C', assigned_room_type='C', booking_
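
`sample(False, 0.1)` samples without replacement: each row is kept with probability 0.1, so the sample size is only approximately 10% of the data. `take(2)` then returns the first two sampled rows, which may explain why both rows shown come from the first days of July 2015. The plain-Python sketch below illustrates that behaviour. The helper `bernoulli_sample` and the seed are assumptions for illustration, not Spark's implementation.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Sequence[T], fraction: float, seed: Optional[int] = None) -> List[T]:
    """Keep each row independently with probability `fraction`, preserving order."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))  # stand-in for the bookings, in file order
sampled = bernoulli_sample(rows, fraction=0.1, seed=42)
first_two = sampled[:2]   # analogous to .take(2): the earliest sampled rows
print(len(sampled), first_two)
```

To inspect rows spread across the whole period, one option is to order the sample by a random key before taking rows.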

### D. Data entities, metrics and dimensions

I've identified the following elements:

* **Entities:** Booking (the fact being measured), Country (dimension), Date (dimension)
* **Metrics:** Number of nights (weekend and week nights), Number of people (adults and children), ...
* **Dimensions:** Type of hotel, Room Type, Type of Meal, Reservation Status, Arrival Season, ...

### E. Column categorization

The following could be a potential column categorization:

* **Timing related columns:** *arrival_date_year*, *month*, *arrival_date_day_of_month*, *arrival_date_week_number* and *reservation_status_date*
* **Booking related columns:** *hotel*, *is_canceled*, *lead_time*, *stays_in_weekend_nights*, *stays_in_week_nights*, *adults*, *children*, *babies*, *meal*, *country*, *is_repeated_guest*, *previous_cancellations*, *previous_bookings_not_canceled*, *reserved_room_type*, *assigned_room_type*, *booking_changes*, *days_in_waiting_list*, *required_car_parking_spaces*, *total_of_special_requests* and *reservation_status*
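
The categorization above can also be kept as data, so that later profiling steps can loop over a group instead of repeating column names. The sketch below is one possible layout. `COLUMN_GROUPS` and `columns_in_several_groups` are hypothetical names, and the check only confirms that no column was assigned to both groups.

```python
from collections import Counter
from typing import Dict, List, Set

COLUMN_GROUPS: Dict[str, List[str]] = {
    "timing": [
        "arrival_date_year", "month", "arrival_date_day_of_month",
        "arrival_date_week_number", "reservation_status_date",
    ],
    "booking": [
        "hotel", "is_canceled", "lead_time", "stays_in_weekend_nights",
        "stays_in_week_nights", "adults", "children", "babies", "meal",
        "country", "is_repeated_guest", "previous_cancellations",
        "previous_bookings_not_canceled", "reserved_room_type",
        "assigned_room_type", "booking_changes", "days_in_waiting_list",
        "required_car_parking_spaces", "total_of_special_requests",
        "reservation_status",
    ],
}


def columns_in_several_groups(groups: Dict[str, List[str]]) -> Set[str]:
    """Return the columns that appear in more than one group."""
    counts = Counter(column for columns in groups.values() for column in columns)
    return {column for column, seen in counts.items() if seen > 1}


overlaps = columns_in_several_groups(COLUMN_GROUPS)
print({name: len(columns) for name, columns in COLUMN_GROUPS.items()}, overlaps)
```

The two groups cover 25 of the 26 columns in the cleaned DataFrame. The original `arrival_date_month` string is left out because `month` replaces it.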

## 4. Column group basic profiling to better understand our data set
### A. Timing related columns basic profiling

In [7]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit


print("Summary of columns Year, Month, DayofMonth, NumberOfWeek:")
bookingsFinalDF.select("arrival_date_year","month","arrival_date_day_of_month","arrival_date_week_number").summary().show()

print("Checking for nulls on columns Year, Month, DayofMonth, NumberOfWeek:")
bookingsFinalDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["arrival_date_year","month",\
                                                                           "arrival_date_day_of_month", \
                                                                           "arrival_date_week_number"]]).show()

print("Checking amount of distinct values in columns Year, Month, DayofMonth, NumberOfWeek:")
bookingsFinalDF.select([countDistinct(c).alias(c) for c in ["arrival_date_year","month","arrival_date_day_of_month",\
                                                            "arrival_date_week_number"]]).show()

print("Most and least frequent occurrences for Month and DayOfMonth columns:")
MonthOccurrencesDF = bookingsFinalDF.groupBy("month").agg(count(lit(1)).alias("Total"))
dayOfMonthDF       = bookingsFinalDF.groupBy("arrival_date_day_of_month").agg(count(lit(1)).alias("Total"))

leastFreqMonth      = MonthOccurrencesDF.orderBy(col("Total").asc()).first()
mostFreqMonth       = MonthOccurrencesDF.orderBy(col("Total").desc()).first()
leastFreqDayOfMonth = dayOfMonthDF.orderBy(col("Total").asc()).first()
mostFreqDayOfMonth  = dayOfMonthDF.orderBy(col("Total").desc()).first()

display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqMonth", "mostFreqMonth", "leastFreqDayOfMonth", "mostFreqDayOfMonth", \
       "%d (%d occurrences)" % (leastFreqMonth["month"], leastFreqMonth["Total"]), \
       "%d (%d occurrences)" % (mostFreqMonth["month"], mostFreqMonth["Total"]), \
       "%d (%d occurrences)" % (leastFreqDayOfMonth["arrival_date_day_of_month"], leastFreqDayOfMonth["Total"]), \
       "%d (%d occurrences)" % (mostFreqDayOfMonth["arrival_date_day_of_month"], mostFreqDayOfMonth["Total"]))))

Summary of columns Year, Month, DayofMonth, NumberOfWeek:
+-------+------------------+-----------------+-------------------------+------------------------+
|summary| arrival_date_year|            month|arrival_date_day_of_month|arrival_date_week_number|
+-------+------------------+-----------------+-------------------------+------------------------+
|  count|            119390|           119390|                   119390|                  119390|
|   mean| 2016.156554150264|6.552483457576011|       15.798241058715135|       27.16517296255968|
| stddev|0.7074759445220408|3.090618686900272|        8.780829470578343|      13.605138355497665|
|    min|              2015|                1|                        1|                       1|
|    25%|              2016|                4|                        8|                      16|
|    50%|              2016|                7|                       16|                      28|
|    75%|              2017|                9|              


| leastFreqMonth | mostFreqMonth | leastFreqDayOfMonth | mostFreqDayOfMonth |
|----|----|----|----|
| 1 (5929 occurrences) | 8 (13877 occurrences) | 31 (2208 occurrences) | 17 (4406 occurrences) |
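
The most and least frequent values above come from counting rows per value and taking the first row of an ascending or descending sort. When several values share the same count, that first row may be any of the tied values. A minimal plain-Python sketch of the same idea, using hypothetical input data and an explicit tie-breaker on the value itself:

```python
from collections import Counter

# Hypothetical arrival months; months 1 and 2 tie for least frequent.
arrival_months = [8, 7, 8, 1, 5, 8, 7, 5, 5, 8, 2]

counts = Counter(arrival_months)

# Sort by (count, value) so that ties are resolved the same way on every run.
ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
least_frequent = ranked[0]
most_frequent = ranked[-1]

print("least: %d (%d occurrences)" % least_frequent)
print("most: %d (%d occurrences)" % most_frequent)
```

In the notebook, the equivalent would be a second sort column in `orderBy`, which matters most for results that report a single occurrence.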


### B. Booking related columns basic profiling

In [8]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

print("Summary of columns Weekend nights, Week nights, Number of adults, Number of children, Number of babies:")
bookingsFinalDF.select("stays_in_weekend_nights", "stays_in_week_nights", "adults", "children","babies").summary().show()

print("Checking for nulls on columns Hotel, Weekend nights, Week nights, Number of adults, Number of children, Number of babies:")
bookingsFinalDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["hotel", "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies"]]).show()

print("Checking amount of distinct values in columns Hotel, Weekend nights, Week nights, Number of adults, Number of children, Number of babies:")
bookingsFinalDF.select([countDistinct(c).alias(c) for c in ["hotel", "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies"]]).show()

print("Most and least frequent occurrences for Weekend nights, Week nights, Number of adults, Number of children, Number of babies:")
TotalWeekendNightsDF = bookingsFinalDF.groupBy("stays_in_weekend_nights").agg(count(lit(1)).alias("Total"))
TotalWeekNightsDF    = bookingsFinalDF.groupBy("stays_in_week_nights").agg(count(lit(1)).alias("Total"))
AdultsDF             = bookingsFinalDF.groupBy("adults").agg(count(lit(1)).alias("Total"))
ChildrenDF           = bookingsFinalDF.groupBy("children").agg(count(lit(1)).alias("Total"))
BabiesDF             = bookingsFinalDF.groupBy("babies").agg(count(lit(1)).alias("Total"))

leastFreqTotalWeekendNights    = TotalWeekendNightsDF.orderBy(col("Total").asc()).first()
mostFreqTotalWeekendNights     = TotalWeekendNightsDF.orderBy(col("Total").desc()).first()
leastFreqTotalWeekNights       = TotalWeekNightsDF.orderBy(col("Total").asc()).first()
mostFreqTotalWeekNights        = TotalWeekNightsDF.orderBy(col("Total").desc()).first()
leastFreqAdults                = AdultsDF.orderBy(col("Total").asc()).first()
mostFreqAdults                 = AdultsDF.orderBy(col("Total").desc()).first()
leastFreqChildren              = ChildrenDF.orderBy(col("Total").asc()).first()
mostFreqChildren               = ChildrenDF.orderBy(col("Total").desc()).first()
leastFreqBabies                = BabiesDF.orderBy(col("Total").asc()).first()
mostFreqBabies                 = BabiesDF.orderBy(col("Total").desc()).first()

display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqTotalWeekendNights", "mostFreqTotalWeekendNights", "leastFreqTotalWeekNights", "mostFreqTotalWeekNights", \
       "%d (%d occurrences)" % (leastFreqTotalWeekendNights["stays_in_weekend_nights"], leastFreqTotalWeekendNights["Total"]), \
       "%d (%d occurrences)" % (mostFreqTotalWeekendNights["stays_in_weekend_nights"], mostFreqTotalWeekendNights["Total"]), \
       "%s (%d occurrences)" % (leastFreqTotalWeekNights["stays_in_week_nights"], leastFreqTotalWeekNights["Total"]), \
       "%s (%d occurrences)" % (mostFreqTotalWeekNights["stays_in_week_nights"], mostFreqTotalWeekNights["Total"]))))
display(Markdown("""
| %s | %s | %s | %s | %s | %s |
|----|----|----|----|----|----|
| %s | %s | %s | %s | %s | %s |
""" % ("leastFreqAdults", "mostFreqAdults", "leastFreqChildren", "mostFreqChildren", "leastFreqBabies", "mostFreqBabies", \
       "%s (%d occurrences)" % (leastFreqAdults["adults"], leastFreqAdults["Total"]), \
       "%s (%d occurrences)" % (mostFreqAdults["adults"], mostFreqAdults["Total"]), \
       "%s (%d occurrences)" % (leastFreqChildren["children"], leastFreqChildren["Total"]), \
       "%s (%d occurrences)" % (mostFreqChildren["children"], mostFreqChildren["Total"]), 
       "%s (%d occurrences)" % (leastFreqBabies["babies"], leastFreqBabies["Total"]), \
       "%s (%d occurrences)" % (mostFreqBabies["babies"], mostFreqBabies["Total"]))))

print("Summary of columns RepeatedGuest, PreviousCancellations, PreviousBookingsNotCancelled, DaysInWaitingList, RequiredCarParkingSpaces and TotalOfSpecialRequests:")
bookingsFinalDF.select("is_repeated_guest", "previous_cancellations", "days_in_waiting_list", "required_car_parking_spaces", "total_of_special_requests").summary().show()

print("Checking for nulls on columns RepeatedGuest, PreviousCancellations, PreviousBookingsNotCancelled, DaysInWaitingList, RequiredCarParkingSpaces and TotalOfSpecialRequests:")
bookingsFinalDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["is_repeated_guest", "previous_cancellations", "days_in_waiting_list", "required_car_parking_spaces", "total_of_special_requests"]]).show()

print("Checking amount of distinct values in columns RepeatedGuest, PreviousCancellations, PreviousBookingsNotCancelled, DaysInWaitingList, RequiredCarParkingSpaces and TotalOfSpecialRequests:")
bookingsFinalDF.select([countDistinct(c).alias(c) for c in ["is_repeated_guest", "previous_cancellations", "days_in_waiting_list", "required_car_parking_spaces", "total_of_special_requests"]]).show()

Summary of columns Weekend nights, Week nights, Number of adults, Number of children, Number of babies:
+-------+-----------------------+--------------------+------------------+-------------------+--------------------+
|summary|stays_in_weekend_nights|stays_in_week_nights|            adults|           children|              babies|
+-------+-----------------------+--------------------+------------------+-------------------+--------------------+
|  count|                 119390|              119390|            119390|             119390|              119390|
|   mean|     0.9275986263506156|   2.500301532791691|1.8564033838679956|0.10388990333874994|0.007948739425412514|
| stddev|     0.9986134945978791|  1.9082856150479042|0.5792609988327531| 0.3985614447864427|  0.0974361913012642|
|    min|                      0|                   0|                 0|                  0|                   0|
|    25%|                      0|                   1|                 2|                0.


| leastFreqTotalWeekendNights | mostFreqTotalWeekendNights | leastFreqTotalWeekNights | mostFreqTotalWeekNights |
|----|----|----|----|
| 19 (1 occurrences) | 0 (51998 occurrences) | 26 (1 occurrences) | 2 (33684 occurrences) |



| leastFreqAdults | mostFreqAdults | leastFreqChildren | mostFreqChildren | leastFreqBabies | mostFreqBabies |
|----|----|----|----|----|----|
| 6 (1 occurrences) | 2 (89680 occurrences) | 10 (1 occurrences) | 0 (110796 occurrences) | 9 (1 occurrences) | 0 (118473 occurrences) |


Summary of columns RepeatedGuest, PreviousCancellations, PreviousBookingsNotCancelled, DaysInWaitingList, RequiredCarParkingSpaces and TotalOfSpecialRequests:
+-------+-------------------+----------------------+--------------------+---------------------------+-------------------------+
|summary|  is_repeated_guest|previous_cancellations|days_in_waiting_list|required_car_parking_spaces|total_of_special_requests|
+-------+-------------------+----------------------+--------------------+---------------------------+-------------------------+
|  count|             119390|                119390|              119390|                     119390|                   119390|
|   mean|0.03191222045397437|   0.08711784906608594|   2.321149174972778|        0.06251779881062065|       0.5713627607002262|
| stddev|0.17576714541065672|    0.8443363841545121|  17.594720878776243|        0.24529114746749414|       0.7927984228094107|
|    min|                  0|                     0|                   0|

## 5. Answer some business questions to improve service

### A. Ratio of bookings per season

In [9]:
from pyspark.sql.functions import count, round

# Bookings are grouped by arrival season, with the months split as follows:
#
#   "Summer"   - July-September
#   "Autumn"   - October-December
#   "Winter"   - January-March
#   "Spring"   - April-June

# 1. Let's enrich the DF with seasonality based on our categorization
totalBookings = bookingsFinalDF.count()
seasonCategorizationDF = bookingsFinalDF\
   .where(col("arrival_date_month")!="NA")\
   .withColumn("season", when(col("arrival_date_month")=="January","Winter")\
                        .when((col("arrival_date_month")=="February"),"Winter")\
                        .when((col("arrival_date_month")=="March"),"Winter")\
                        .when((col("arrival_date_month")=="April"),"Spring")\
                        .when((col("arrival_date_month")=="May"),"Spring")\
                        .when((col("arrival_date_month")=="June"),"Spring")\
                        .when((col("arrival_date_month")=="July"),"Summer")\
                        .when((col("arrival_date_month")=="August"),"Summer")\
                        .when((col("arrival_date_month")=="September"),"Summer")\
                        .when((col("arrival_date_month")=="October"),"Autumn")\
                        .when((col("arrival_date_month")=="November"),"Autumn")\
                        .otherwise("Autumn"))
seasonCategorizationDF.cache()
# 2. Answer the business question: bookings per season and their share of the total
seasonCategorizationDF.select("season")\
                      .groupBy("season")\
                      .agg(count("season").alias("NumRooms"), \
                          (count("season")/totalBookings*100).alias("Ratio"))\
                      .orderBy("season")\
                      .select("season","NumRooms",round("Ratio",2).alias("RoundedRatio")).show()

+------+--------+------------+
|season|NumRooms|RoundedRatio|
+------+--------+------------+
|Autumn|   24734|       20.72|
|Spring|   33819|       28.33|
|Summer|   37046|       31.03|
|Winter|   23791|       19.93|
+------+--------+------------+
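
Because each season in this notebook covers three consecutive calendar months, the month number determines the season by integer division. The sketch below (plain Python, with the hypothetical helper `season_for_month`) shows that mapping and recomputes the ratios from the counts in the table above as a cross-check.

```python
SEASONS = ("Winter", "Spring", "Summer", "Autumn")


def season_for_month(month: int) -> str:
    """Map a month number (1..12) to the season used in this notebook."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12, got %r" % (month,))
    return SEASONS[(month - 1) // 3]


# Booking counts per season, copied from the table above.
season_counts = {"Autumn": 24734, "Spring": 33819, "Summer": 37046, "Winter": 23791}
total_bookings = sum(season_counts.values())

ratios = {
    season: round(bookings / total_bookings * 100, 2)
    for season, bookings in season_counts.items()
}
print(total_bookings, ratios)
```

The counts add up to the 119390 rows reported in section 3, so every booking falls into exactly one season.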



### B. Season cancellation statistics by different characteristics (days, people, changes and lead time)

In [10]:
from pyspark.sql.functions import max, min, avg, stddev
from pyspark.sql.types import IntegerType

# To compute cancellation statistics per season, prepare seasonCategorizationDF:
#   1. Keep only cancelled bookings
#   2. Cast the string column children to an integer column (IntChildren)

bookingCancellationDF = \
  seasonCategorizationDF.where((col("is_canceled")==1))\
                        .withColumn("IntChildren", col("children").cast(IntegerType()))\
                        .select("season","stays_in_weekend_nights", "stays_in_week_nights","adults","IntChildren",\
                               "babies", "total_of_special_requests", "lead_time", "is_repeated_guest", "booking_changes")
bookingCancellationDF.cache() # optimization to make the processing faster

display(Markdown("**'Weekend days' cancelled basic stats** (in days):"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("stays_in_weekend_nights").alias("AverageCancellation"),\
                          min("stays_in_weekend_nights").alias("LowestCancellation"),\
                          max("stays_in_weekend_nights").alias("HighestCancellation"),\
                          stddev("stays_in_weekend_nights").alias("StdDevCancellation"))\
                     .orderBy("season").show()

display(Markdown("**'Week days' cancelled basic stats** (in days):"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("stays_in_week_nights").alias("AverageCancellation"),\
                          min("stays_in_week_nights").alias("LowestCancellation"),\
                          max("stays_in_week_nights").alias("HighestCancellation"),\
                          stddev("stays_in_week_nights").alias("StdDevCancellation"))\
                     .orderBy("season").show()

display(Markdown("**'Number of adults' cancelled basic stats**:"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("adults").alias("AverageCancellation"),\
                          min("adults").alias("LowestCancellation"),\
                          max("adults").alias("HighestCancellation"),\
                          stddev("adults").alias("StdDevCancellation"))\
                     .orderBy("season").show()

display(Markdown("**'Number of children' cancelled basic stats**:"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("IntChildren").alias("AverageCancellation"),\
                          min("IntChildren").alias("LowestCancellation"),\
                          max("IntChildren").alias("HighestCancellation"),\
                          stddev("IntChildren").alias("StdDevCancellation"))\
                     .orderBy("season").show()

display(Markdown("**'Number of babies' cancelled basic stats**:"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("babies").alias("AverageCancellation"),\
                          min("babies").alias("LowestCancellation"),\
                          max("babies").alias("HighestCancellation"),\
                          stddev("babies").alias("StdDevCancellation"))\
                     .orderBy("season").show()

display(Markdown("**'Number of special requests' cancelled basic stats**:"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("total_of_special_requests").alias("AverageCancellation"),\
                          min("total_of_special_requests").alias("LowestCancellation"),\
                          max("total_of_special_requests").alias("HighestCancellation"),\
                          stddev("total_of_special_requests").alias("StdDevCancellation"))\
                     .orderBy("season").show()

display(Markdown("**'Lead Time' cancelled basic stats** (in days):"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("lead_time").alias("AverageCancellation"),\
                          min("lead_time").alias("LowestCancellation"),\
                          max("lead_time").alias("HighestCancellation"),\
                          stddev("lead_time").alias("StdDevCancellation"))\
                     .orderBy("season").show()

display(Markdown("**'Repeated Guest' cancelled basic stats**:"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("is_repeated_guest").alias("AverageCancellation"),\
                          min("is_repeated_guest").alias("LowestCancellation"),\
                          max("is_repeated_guest").alias("HighestCancellation"),\
                          stddev("is_repeated_guest").alias("StdDevCancellation"))\
                     .orderBy("season").show()

display(Markdown("**'Number of changes' cancelled basic stats**:"))
bookingCancellationDF.groupBy("season")\
                     .agg(avg("booking_changes").alias("AverageCancellation"),\
                          min("booking_changes").alias("LowestCancellation"),\
                          max("booking_changes").alias("HighestCancellation"),\
                          stddev("booking_changes").alias("StdDevCancellation"))\
                     .orderBy("season").show()

**'Weekend days' cancelled basic stats** (in days):

+------+-------------------+------------------+-------------------+------------------+
|season|AverageCancellation|LowestCancellation|HighestCancellation|StdDevCancellation|
+------+-------------------+------------------+-------------------+------------------+
|Autumn| 0.8851127131250716|                 0|                  9|0.9773501038264026|
|Spring| 0.8717239370995923|                 0|                  9|0.9322119487416681|
|Summer| 1.0148967865503298|                 0|                  8|0.9954804107635634|
|Winter| 0.9021170935703084|                 0|                 16|1.1694542652049256|
+------+-------------------+------------------+-------------------+------------------+



**'Week days' cancelled basic stats** (in days):

+------+-------------------+------------------+-------------------+------------------+
|season|AverageCancellation|LowestCancellation|HighestCancellation|StdDevCancellation|
+------+-------------------+------------------+-------------------+------------------+
|Autumn| 2.3772742876759354|                 0|                 24|1.7837138237346148|
|Spring|  2.515579499126383|                 0|                 24|1.5682661617444245|
|Summer|  2.712704830815067|                 0|                 20| 1.818027835411011|
|Winter| 2.5781495033978046|                 0|                 40|2.4836747483878416|
+------+-------------------+------------------+-------------------+------------------+



**'Number of adults' cancelled basic stats**:

+------+-------------------+------------------+-------------------+------------------+
|season|AverageCancellation|LowestCancellation|HighestCancellation|StdDevCancellation|
+------+-------------------+------------------+-------------------+------------------+
|Autumn| 1.8487241103101042|                 0|                 55|0.8256040377916612|
|Spring| 1.8731799650553291|                 0|                  4|0.4579976917298583|
|Summer| 1.9950344044832233|                 0|                 50|0.8242068904829948|
|Winter| 1.8416100365917407|                 0|                  4|0.4757937962507202|
+------+-------------------+------------------+-------------------+------------------+



**'Number of children' cancelled basic stats**:

+------+-------------------+------------------+-------------------+-------------------+
|season|AverageCancellation|LowestCancellation|HighestCancellation| StdDevCancellation|
+------+-------------------+------------------+-------------------+-------------------+
|Autumn| 0.0724339169241332|                 0|                  3|0.33543140688317113|
|Spring|0.09981071636575423|                 0|                  3| 0.4022289107924577|
|Summer|0.14269495494216988|                 0|                 10|0.47144101551588674|
|Winter|0.09082592786199686|                 0|                  3| 0.3815569059509733|
+------+-------------------+------------------+-------------------+-------------------+



**'Number of babies' cancelled basic stats**:

+------+--------------------+------------------+-------------------+--------------------+
|season| AverageCancellation|LowestCancellation|HighestCancellation|  StdDevCancellation|
+------+--------------------+------------------+-------------------+--------------------+
|Autumn|0.003890605332417...|                 0|                  1| 0.06225682325262109|
|Spring|0.002984857309260338|                 0|                  2|0.055872873574566294|
|Summer|0.006100588777754...|                 0|                  2| 0.07877617638077825|
|Winter|0.001045478306325...|                 0|                  1| 0.03231906224729337|
+------+--------------------+------------------+-------------------+--------------------+



**'Number of special requests' cancelled basic stats**:

+------+-------------------+------------------+-------------------+------------------+
|season|AverageCancellation|LowestCancellation|HighestCancellation|StdDevCancellation|
+------+-------------------+------------------+-------------------+------------------+
|Autumn| 0.3461494450165923|                 0|                  4|0.6901439612197943|
|Spring|0.28188701223063484|                 0|                  5|0.5979295358053065|
|Summer|0.37348371994041285|                 0|                  5| 0.679719189370532|
|Winter|0.31102979613173026|                 0|                  4|  0.62532167492052|
+------+-------------------+------------------+-------------------+------------------+



**'Lead Time' cancelled basic stats** (in days):

+------+-------------------+------------------+-------------------+------------------+
|season|AverageCancellation|LowestCancellation|HighestCancellation|StdDevCancellation|
+------+-------------------+------------------+-------------------+------------------+
|Autumn| 153.57924247625587|                 0|                626|135.53751315590188|
|Spring| 142.49956319161328|                 0|                471|100.63429052449636|
|Summer|  170.3944810952685|                 0|                521|118.46271009929012|
|Winter|   92.0334553058024|                 0|                629|110.59391798612799|
+------+-------------------+------------------+-------------------+------------------+



**'Repeated Guest' cancelled basic stats**:

+------+--------------------+------------------+-------------------+-------------------+
|season| AverageCancellation|LowestCancellation|HighestCancellation| StdDevCancellation|
+------+--------------------+------------------+-------------------+-------------------+
|Autumn|0.016477857878475798|                 0|                  1| 0.1273114008041682|
|Spring|0.004659289458357601|                 0|                  1| 0.0681022622753702|
|Summer| 0.01929488543661772|                 0|                  1|0.13756429493619027|
|Winter|0.009409304756926294|                 0|                  1|0.09655044268641946|
+------+--------------------+------------------+-------------------+-------------------+



**'Number of changes' cancelled basic stats**:

+------+-------------------+------------------+-------------------+-------------------+
|season|AverageCancellation|LowestCancellation|HighestCancellation| StdDevCancellation|
+------+-------------------+------------------+-------------------+-------------------+
|Autumn|0.09955372468245795|                 0|                 16|0.49086275235414834|
|Spring|0.08590564938846826|                 0|                 14| 0.4205004501695135|
|Summer|0.10966872384195218|                 0|                  7| 0.4602881932426244|
|Winter|0.09840564558285415|                 0|                  6|0.43846259920916403|
+------+-------------------+------------------+-------------------+-------------------+
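
Every table above applies the same four aggregates (average, minimum, maximum and standard deviation) to one column, grouped by season. Note that `stddev` in `pyspark.sql.functions` is the sample standard deviation (an alias of `stddev_samp`). The plain-Python sketch below reproduces the pattern on a few hypothetical rows. `stats_by_group` is an illustrative helper, not part of the notebook.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, List, Optional, Tuple


def stats_by_group(rows: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, Optional[float]]]:
    """Group (key, value) pairs by key and compute avg, min, max and sample stddev."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return {
        key: {
            "avg": mean(values),
            "min": min(values),
            "max": max(values),
            # The sample standard deviation is undefined for a single value.
            "stddev": stdev(values) if len(values) > 1 else None,
        }
        for key, values in sorted(grouped.items())
    }


# Hypothetical (season, week nights) pairs for cancelled bookings.
rows = [("Winter", 2), ("Winter", 4), ("Summer", 1), ("Summer", 1), ("Summer", 4)]
stats = stats_by_group(rows)
for season, values in stats.items():
    print(season, values)
```

In the notebook itself, the same idea suggests looping over the list of column names and building the four aggregates once, rather than repeating the block for each column.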



### C. Top 20 countries with the most bookings in Warm Seasons (Spring and Summer)

In [12]:
# Our answer to this business question will be:
#   1. List of top 20 countries with highest warm seasons ratio (based on total number of bookings)
#   2. List of top 20 countries with warm season ratio by season

# In order to be able to deliver these insights, we need some preparation:
#   1. Define a DataFrame with total bookings per country (totalBookingsCountryDF)
#   2. Define a DataFrame with aggregated data by Country and Season to figure out
#      number of bookings per Warm Seasons (warmSeasonsCountryDF)
#   3. Combine both DataFrames to come up with one single DataFrame containing total bookings
#      per country and number of bookings per Warm Season to compute ratios (combinedDF)

totalBookingsCountryDF = \
   bookingsFinalDF.groupBy("country")\
                  .agg(count(lit(1)).alias("TotalBookings"))
warmSeasonsCountryDF = \
  seasonCategorizationDF.where((col("season")!="Winter") & (col("season")!="Autumn"))\
                        .select("country", "season")\
                        .groupBy("country", "season")\
                        .agg(count(lit(1)).alias("NumWarmSeasonBookings"))

combinedDF = \
  warmSeasonsCountryDF\
     .where((col("NumWarmSeasonBookings")>2))\
     .join(totalBookingsCountryDF, "country")\
     .withColumn("WarmSeasonRatio", round(col("NumWarmSeasonBookings")/col("TotalBookings")*100,2))\
     .orderBy(col("WarmSeasonRatio").desc())
combinedDF.cache() # optimization to make the processing faster

display(Markdown("**Top 20 countries** with highest bookings ratio in Warm Seasons (in %):"))
combinedDF.limit(20).show()
display(Markdown("**Top 20 countries with warm seasons bookings ratio** by season (in %):"))
combinedDF\
   .groupBy("country")\
   .pivot("season")\
   .min("WarmSeasonRatio")\
   .orderBy(col("Summer").desc(), col("Spring").desc())\
   .limit(20).show()

**Top 20 countries** with highest bookings ratio in Warm Seasons (in \%):

+-------+------+---------------------+-------------+---------------+
|country|season|NumWarmSeasonBookings|TotalBookings|WarmSeasonRatio|
+-------+------+---------------------+-------------+---------------+
|    BRB|Spring|                    4|            4|          100.0|
|    FRO|Summer|                    5|            5|          100.0|
|    BOL|Spring|                   10|           10|          100.0|
|    TJK|Spring|                    8|            9|          88.89|
|    GEO|Summer|                   18|           22|          81.82|
|    SRB|Spring|                   79|          101|          78.22|
|    ISL|Spring|                   43|           57|          75.44|
|    PRY|Spring|                    3|            4|           75.0|
|    GAB|Summer|                    3|            4|           75.0|
|    AND|Summer|                    5|            7|          71.43|
|    URY|Spring|                   22|           32|          68.75|
|    BGD|Spring|                  

**Top 20 countries with warm seasons bookings ratio** by season (in \%):

+-------+------+------+
|country|Spring|Summer|
+-------+------+------+
|    FRO|  null| 100.0|
|    GEO|  null| 81.82|
|    GAB|  null|  75.0|
|    AND|  null| 71.43|
|    VNM|  null|  62.5|
|    QAT| 33.33|  60.0|
|    BLR|  null| 53.85|
|    KAZ| 31.58| 52.63|
|    GIB|  null|  50.0|
|    MAR| 11.58| 48.65|
|    MOZ| 23.88| 47.76|
|     CN| 30.26| 46.83|
|    PHL|  42.5|  42.5|
|    CZE| 19.88| 42.11|
|    DNK| 25.75| 42.07|
|    NOR| 35.42| 42.01|
|    ROU|  23.6|  41.8|
|    AZE|  null| 41.18|
|    TWN| 19.61| 39.22|
|    RUS| 27.06| 38.92|
+-------+------+------+
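
Each ratio above is one season's bookings for a country divided by that country's total bookings, so the ranking is over (country, season) pairs, and countries with very few bookings easily reach the top. The plain-Python sketch below recomputes a few ratios from the first table and pivots them by season. The `min_total` parameter is a hypothetical extra filter on total bookings. It is not part of the notebook, which only filters on more than two warm-season bookings.

```python
from typing import Dict, Iterable, Optional, Tuple

# (country, season, warm-season bookings), copied from the first table above.
warm_bookings = [("GEO", "Summer", 18), ("SRB", "Spring", 79),
                 ("TJK", "Spring", 8), ("AND", "Summer", 5)]
# Total bookings per country, copied from the same table.
total_bookings = {"GEO": 22, "SRB": 101, "TJK": 9, "AND": 7}


def warm_season_ratios(
    bookings: Iterable[Tuple[str, str, int]],
    totals: Dict[str, int],
    min_total: int = 0,
) -> Dict[str, Dict[str, Optional[float]]]:
    """Return {country: {"Spring": ratio, "Summer": ratio}} with ratios in percent."""
    pivoted: Dict[str, Dict[str, Optional[float]]] = {}
    for country, season, warm in bookings:
        total = totals[country]
        if total < min_total:
            continue  # hypothetical filter: skip countries with too few bookings
        row = pivoted.setdefault(country, {"Spring": None, "Summer": None})
        row[season] = round(warm / total * 100, 2)
    return pivoted


ratios = warm_season_ratios(warm_bookings, total_bookings)
filtered = warm_season_ratios(warm_bookings, total_bookings, min_total=20)
print(ratios)
print(sorted(filtered))
```

With a threshold of 20 total bookings, only GEO and SRB remain from this small extract, which shows how sensitive the top of the list is to low-volume countries.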

