# EDA of Agoda's flights and hotel dataset using SQL

This notebook analyzes the flight and hotel bookings of Agoda's customers from 26-Sept-2019 to 24-July-2023. The aim of this project is to analyze customer booking behavior and booking pricing by answering the following questions:

- What are the booking trends by gender, age and month?
- What are the factors that affect flight and hotel prices?

## Set Up

The following code connects the notebook to a PostgreSQL database hosted on Supabase, so that query results can be displayed directly in this notebook.

In [1]:
from sqlalchemy import create_engine
# from sqlalchemy.pool import NullPool
from dotenv import load_dotenv
import os
import pandas as pd

# Load environment variables from .env
load_dotenv()

# Fetch variables
USER = os.getenv("user")
PASSWORD = os.getenv("password")
HOST = os.getenv("host")
PORT = os.getenv("port")
DBNAME = os.getenv("dbname")

# Construct the SQLAlchemy connection string
DATABASE_URL = f"postgresql+psycopg2://{USER}:{PASSWORD}@{HOST}:{PORT}/{DBNAME}?sslmode=require"

# Create the SQLAlchemy engine
engine = create_engine(DATABASE_URL)
# If using Transaction Pooler or Session Pooler, we want to ensure we disable SQLAlchemy client side pooling -
# https://docs.sqlalchemy.org/en/20/core/pooling.html#switching-pool-implementations
# engine = create_engine(DATABASE_URL, poolclass=NullPool)

# Test the connection
try:
    with engine.connect() as connection:
        print("Connection successful!")
except Exception as e:
    print(f"Failed to connect: {e}")

Connection successful!
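
The connection string above interpolates the credentials directly into the URL. If the password contains characters such as `@` or `/`, the URL may be parsed incorrectly. The following is a minimal sketch of one way to guard against that by percent-encoding the credentials with the standard library. The helper name `build_database_url` and the sample credentials are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_database_url(user, password, host, port, dbname):
    """Build a PostgreSQL URL, percent-encoding the user and password."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{dbname}?sslmode=require"
    )


# Hypothetical credentials, for illustration only
url = build_database_url("postgres.demo", "p@ss/word", "localhost", 5432, "postgres")
print(url)
# postgresql+psycopg2://postgres.demo:p%40ss%2Fword@localhost:5432/postgres?sslmode=require
```

The resulting string can be passed to `create_engine` in the same way as `DATABASE_URL`.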


The flights, hotels, and users tables are created using the following code.

In [2]:
# Load the CSV data into a DataFrame
flights = pd.read_csv("../data/flights.csv")
hotels = pd.read_csv("../data/hotels.csv")
users = pd.read_csv("../data/users.csv")

# Push the DataFrames to the PostgreSQL database (creating the tables)
flights.to_sql("flights", con=engine, if_exists="replace", index=False)
hotels.to_sql("hotels", con=engine, if_exists="replace", index=False)
users.to_sql("users", con=engine, if_exists="replace", index=False)

340

### Loading Extensions

In [3]:
%load_ext sql
%config SqlMagic.style = '_DEPRECATED_DEFAULT'
%sql $DATABASE_URL

### Converting Data Types

In [4]:
%%sql
-- Step 1: Add a new column
ALTER TABLE flights ADD COLUMN date_date DATE;
ALTER TABLE hotels ADD COLUMN date_date DATE;

-- Step 2: Copy and convert the data
UPDATE flights SET date_date = to_date(date, 'MM/DD/YYYY')::DATE;
UPDATE hotels SET date_date = to_date(date, 'MM/DD/YYYY')::DATE;

-- Step 3: Drop the old column
ALTER TABLE flights DROP COLUMN date;
ALTER TABLE hotels DROP COLUMN date;

-- Step 4: Rename the new column
ALTER TABLE flights RENAME COLUMN date_date TO date;
ALTER TABLE hotels RENAME COLUMN date_date TO date;



 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
Done.
Done.
271888 rows affected.
40552 rows affected.
Done.
Done.
Done.
Done.


[]
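
To illustrate what Step 2 does, here is a rough Python analogue of `to_date(date, 'MM/DD/YYYY')`. It is only a sketch: the sample strings are assumed values in the same format, and PostgreSQL's `to_date` is more lenient about malformed input than `strptime`.

```python
from datetime import date, datetime


def parse_us_date(text):
    """Parse an 'MM/DD/YYYY' string into a date object."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# Assumed sample values in the same text format as the CSV column
samples = ["09/26/2019", "07/24/2023"]
parsed = [parse_us_date(s) for s in samples]
print(parsed)                     # [datetime.date(2019, 9, 26), datetime.date(2023, 7, 24)]
print([d.month for d in parsed])  # [9, 7]
```

Once the column holds real dates, expressions such as `EXTRACT(MONTH FROM date)` work directly on it, which the monthly queries later in the notebook rely on.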

## 1. Booking Trends By Demographic Factors

### Gender

The following query counts flight bookings by gender. Bookings are evenly distributed across the three gender categories (roughly 89,000 to 92,000 each), which suggests that booking volume does not depend heavily on gender. Note that the query returns booking counts only, so comparing average prices by gender would require adding AVG(price) to the SELECT list.

In [5]:
%%sql
-- flight bookings contributed by different genders
SELECT 
    gender, COUNT(*) AS flight_count
FROM 
    flights
INNER JOIN 
    users 
ON 
    flights."userCode" = users."code" 
GROUP BY 
    gender;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
3 rows affected.


gender,flight_count
female,91580
male,91248
none,89060


This query identifies the most popular travel destinations by gender, sorting the results by gender and then by number of bookings in descending order. Florianopolis is the top destination across all genders.

In [6]:
%%sql
-- popular destinations by gender
SELECT flights.to, users.gender, COUNT(*) AS bookings
FROM flights
INNER JOIN users
ON flights."userCode" = users.code
GROUP BY flights.to, users.gender
ORDER BY 2,3 DESC;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
27 rows affected.


to,gender,bookings
Florianopolis (SC),female,19567
Aracaju (SE),female,12626
Campo Grande (MS),female,11050
Brasilia (DF),female,10470
Recife (PE),female,10260
Natal (RN),female,8112
Sao Paulo (SP),female,8070
Salvador (BH),female,5718
Rio de Janeiro (RJ),female,5707
Florianopolis (SC),male,19914


### Age Group

The following query examines the number of flight bookings and the average flight price by age group. The 20-29 and 60-69 age groups contribute the fewest bookings, with 60-69 the lowest by a clear margin. Meanwhile, the average flight price does not vary significantly across age groups.

In [7]:
%%sql
-- flights by age group
SELECT 
    CONCAT(floor(age/10)*10,'-',floor(age/10)*10+9) AS age_group, 
    COUNT(*) AS number_of_bookings, 
    ROUND(AVG(price)::NUMERIC,2) AS average_price
FROM flights 
INNER JOIN users 
ON flights."userCode" = users.code
GROUP BY floor(age/10)*10
ORDER BY 1;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
5 rows affected.


age_group,number_of_bookings,average_price
20-29,52956,963.53
30-39,64670,962.19
40-49,61548,953.59
50-59,57260,957.44
60-69,35454,945.87
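
The bucketing expression `floor(age/10)*10` can be sketched in Python as follows. This is an illustration only; the function name `age_group` and the sample ages are hypothetical, and integer ages are assumed.

```python
def age_group(age):
    """Map an integer age to a ten-year label such as '20-29'."""
    lower = (age // 10) * 10  # same idea as floor(age/10)*10 in the query
    return f"{lower}-{lower + 9}"


ages = [21, 29, 30, 45, 65]  # hypothetical sample ages
labels = [age_group(a) for a in ages]
print(labels)  # ['20-29', '20-29', '30-39', '40-49', '60-69']
```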


This query identifies the most popular travel destinations across age groups, sorting the results by age group and then by number of bookings in descending order. Florianopolis is the top destination across all age groups.

In [8]:
%%sql
-- popular destinations by age group
SELECT 
    flights.to, 
    CONCAT(floor(age/10)*10,'-',floor(age/10)*10+9) AS age_group, 
    COUNT(*) AS bookings
FROM flights
INNER JOIN users
ON flights."userCode" = users.code
GROUP BY flights.to, floor(age/10)*10
ORDER BY 2,3 DESC;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
45 rows affected.


to,age_group,bookings
Florianopolis (SC),20-29,12180
Aracaju (SE),20-29,7050
Campo Grande (MS),20-29,6008
Brasilia (DF),20-29,5891
Recife (PE),20-29,5804
Sao Paulo (SP),20-29,4701
Natal (RN),20-29,4665
Rio de Janeiro (RJ),20-29,3347
Salvador (BH),20-29,3310
Florianopolis (SC),30-39,14433


## 2. Flight and Hotel Monthly Booking and Revenue Trends

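This query calculates the total number of flights and hotel stays for each calendar month by combining records from flights and hotels. Because it groups by EXTRACT(MONTH FROM date), bookings from the same month in different years are pooled together.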

In [9]:
%%sql
-- bookings by months
SELECT a.travel_month, a.flight_bookings, b.hotel_bookings, a.flight_bookings+b.hotel_bookings AS total_bookings
FROM (
	SELECT EXTRACT(MONTH FROM date) AS travel_month, COUNT(*) AS flight_bookings
	FROM flights
	GROUP BY EXTRACT(MONTH FROM date)
	) AS a
INNER JOIN (
	SELECT EXTRACT(MONTH FROM date) AS travel_month, COUNT(*) AS hotel_bookings
	FROM hotels
	GROUP BY EXTRACT(MONTH FROM date)
	) AS b
ON a.travel_month = b.travel_month
ORDER BY 4 DESC;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
12 rows affected.


travel_month,flight_bookings,hotel_bookings,total_bookings
10,28980,4504,33484
12,26346,4052,30398
11,25764,3634,29398
1,25587,3775,29362
4,22607,3576,26183
3,22741,3380,26121
2,22387,3284,25671
5,20968,2871,23839
7,20113,3079,23192
9,19468,3011,22479
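
As a small illustration of how grouping by calendar month pools records across years, and how the inner join keeps only months present in both tables, here is a sketch in plain Python. The dates are made up for the example and do not come from the dataset.

```python
from collections import Counter
from datetime import date

# Hypothetical booking dates spanning several years
flight_dates = [date(2019, 10, 3), date(2020, 10, 8), date(2021, 10, 1), date(2020, 1, 15)]
hotel_dates = [date(2019, 10, 4), date(2021, 1, 20), date(2021, 1, 21)]

# Equivalent of GROUP BY EXTRACT(MONTH FROM date): the year is discarded
flights_by_month = Counter(d.month for d in flight_dates)
hotels_by_month = Counter(d.month for d in hotel_dates)

# Equivalent of the INNER JOIN on travel_month, ordered by total bookings
common_months = flights_by_month.keys() & hotels_by_month.keys()
monthly = sorted(
    (
        (m, flights_by_month[m], hotels_by_month[m], flights_by_month[m] + hotels_by_month[m])
        for m in common_months
    ),
    key=lambda row: row[3],
    reverse=True,
)
print(monthly)  # [(10, 3, 1, 4), (1, 1, 2, 3)]
```

The three October flights come from three different years but land in the same row, which is the behavior to keep in mind when reading the monthly tables.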


This query calculates the total revenue generated each month by combining revenue from flights and hotels.

In [10]:
%%sql
-- revenue by month
SELECT a.travel_month, a.flight_revenue, b.hotel_revenue, ROUND(a.flight_revenue + b.hotel_revenue,2) AS total_revenue
FROM (
    SELECT EXTRACT(MONTH FROM date) AS travel_month, ROUND(SUM(price)::NUMERIC, 2) AS flight_revenue
    FROM flights
    GROUP BY EXTRACT(MONTH FROM date)
) AS a
INNER JOIN (
    SELECT EXTRACT(MONTH FROM date) AS travel_month, ROUND(SUM(total)::NUMERIC, 2) AS hotel_revenue
    FROM hotels
    GROUP BY EXTRACT(MONTH FROM date)
) AS b
ON a.travel_month = b.travel_month
ORDER BY total_revenue DESC;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
12 rows affected.


travel_month,flight_revenue,hotel_revenue,total_revenue
10,27663040.66,2417426.96,30080467.62
12,25197015.56,2182197.21,27379212.77
11,24672328.58,1948201.51,26620530.09
1,24334009.62,1990376.79,26324386.41
3,21820447.99,1817208.18,23637656.17
4,21599832.6,1909806.3,23509638.9
2,21455153.5,1767709.29,23222862.79
5,20185235.75,1543882.23,21729117.98
7,19335078.33,1646520.04,20981598.37
9,18565907.08,1601906.05,20167813.13


The results show that:
- The months with the most bookings and the highest revenue are October through January, which coincides with the holiday season and year-end vacations.
- This points to a seasonal peak in travel bookings during this period.

## 3. Flight and Hotel Price Factors

### Distance

This query calculates the average price of flights based on the distance traveled, grouped into intervals of 100 kilometers. The distance intervals are displayed as ranges (e.g., 100-199 km, 200-299 km), and for each range, the average price of the flights within that range is calculated.

The results show that, as the distance increases, the average flight price also tends to rise, with flights in the 900-999 km range having the highest average price of 1348.25.

In [11]:
%%sql
-- price by distance
SELECT 
	CONCAT(FLOOR(distance/100)*100,'-',FLOOR(distance/100)*100+99) AS flight_distance,
	ROUND(AVG(price)::NUMERIC,2) AS average_price
FROM flights
GROUP BY FLOOR(distance/100)*100
ORDER BY 1;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
9 rows affected.


flight_distance,average_price
100-199,494.69
200-299,673.56
300-399,796.14
400-499,792.84
500-599,961.07
600-699,1099.14
700-799,1187.1
800-899,1274.6
900-999,1348.25
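
One detail worth noting is that `ORDER BY 1` sorts on the text label rather than on the numeric lower bound. That works here because every bucket has three digits, but a bucket such as 1000-1099 could sort out of place. The sketch below shows the difference in Python using made-up distances; the exact ordering in PostgreSQL also depends on the database collation, so treat this as an illustration of the idea rather than a reproduction of the query.

```python
def distance_bucket(distance_km):
    """Return (lower_bound, label) for a 100 km bucket."""
    lower = int(distance_km // 100) * 100
    return lower, f"{lower}-{lower + 99}"


distances = [150.0, 250.5, 1020.0]  # hypothetical distances in km
buckets = [distance_bucket(d) for d in distances]

by_label = sorted(label for _, label in buckets)    # text ordering
by_lower = [label for _, label in sorted(buckets)]  # numeric ordering

print(by_label)  # ['100-199', '1000-1099', '200-299']
print(by_lower)  # ['100-199', '200-299', '1000-1099']
```

In SQL, the equivalent safeguard would be to order by the grouping expression `FLOOR(distance/100)*100` instead of the label column.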


### Location

This query shows the average flight price and average hotel price for various destinations.

Insights:
- Florianopolis (SC) and Salvador (BH) have the highest average prices for flights among the destinations listed, at 1185.69 and 1264.35, respectively.
- Campo Grande (MS) has the lowest average hotel price (60.39), while Florianopolis (SC) has the highest hotel price (313.02).
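- There may be a relationship between higher flight prices and higher hotel prices: Salvador (BH) and Florianopolis (SC) rank high on both. The pattern is not consistent, however, since Recife (PE) has the second-highest hotel price (312.83) but only a mid-range flight price (951.23).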

In [12]:
%%sql
-- average prices by destinations
SELECT a.destination, a.avg_flight_price, b.avg_hotel_price
FROM (
	SELECT flights.to as destination, ROUND(AVG(price)::NUMERIC,2) as avg_flight_price
	FROM flights
	GROUP BY flights.to
	) AS a
LEFT JOIN (
	SELECT place, ROUND(AVG(price)::NUMERIC,2) as avg_hotel_price
	FROM hotels
	GROUP BY place
	) AS b
ON a.destination = b.place
ORDER BY avg_flight_price DESC, avg_hotel_price DESC;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require


9 rows affected.


destination,avg_flight_price,avg_hotel_price
Salvador (BH),1264.35,263.41
Florianopolis (SC),1185.69,313.02
Aracaju (SE),1065.24,208.04
Recife (PE),951.23,312.83
Rio de Janeiro (RJ),846.77,165.99
Campo Grande (MS),841.35,60.39
Natal (RN),826.87,242.88
Brasilia (DF),766.76,247.62
Sao Paulo (SP),648.34,139.1


### Month

This query shows the average flight and hotel prices for each month of travel:

Insights:
- The highest average flight price is observed in June (962.74), followed closely by May (962.67).
- The lowest flight prices occur in January (951.03) and September (953.66).
- Hotel prices are generally consistent throughout the year, ranging from a low of 212.38 in January to a high of 215.59 in June.

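Average flight prices differ by less than 12 across months (951.03 to 962.74) and hotel prices by about 3 (212.38 to 215.59), so the month of travel has little effect on average price. Travelers should expect only marginal savings from choosing one month over another.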

In [13]:
%%sql
-- avg price by months
SELECT a.travel_month, a.avg_flight_price, b.avg_hotel_price
FROM (
	SELECT EXTRACT(MONTH FROM date) AS travel_month, ROUND(AVG(price)::NUMERIC,2) AS avg_flight_price
	FROM flights
	GROUP BY EXTRACT(MONTH FROM date)
	) AS a
LEFT JOIN (
	SELECT EXTRACT(MONTH FROM date) AS travel_month, ROUND(AVG(price)::NUMERIC,2) as avg_hotel_price
	FROM hotels
	GROUP BY EXTRACT(MONTH FROM date)
	) AS b
ON a.travel_month = b.travel_month
ORDER BY 2 DESC;


 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
12 rows affected.


travel_month,avg_flight_price,avg_hotel_price
6,962.74,215.59
5,962.67,215.4
7,961.32,213.26
3,959.52,215.12
2,958.38,214.09
8,958.25,214.79
11,957.63,214.0
12,956.39,215.09
4,955.45,214.47
10,954.56,215.11


### Flight Route

This query returns the ten flight routes with the highest average price, showing the departure city (from), destination city (to), the number of bookings (bookings), and the average price (avg_price) for each route.

Insights:
- Highest Average Price: Sao Paulo (SP) to Florianopolis (SC) is the most expensive route, with an average price of 1380.88, yet it still has 6717 bookings.
- Most Booked Among These Routes: Brasilia (DF) to Florianopolis (SC) has the most bookings of the ten routes returned (7779), at an average price of 1260.25. Because the query sorts by average price and keeps only ten rows, it does not identify the most booked route overall.
- Lowest Average Price: Florianopolis (SC) to Rio de Janeiro (RJ) has the lowest average price (474.81) and still has 5807 bookings, making it a popular and affordable route. This route falls outside the ten rows returned here; sorting by average price in ascending order would surface it.

In [14]:
%%sql
-- flight bookings and avg price by flight route
SELECT "from", "to", COUNT(*) AS bookings, ROUND(AVG(price)::NUMERIC,2) AS avg_price
FROM flights
GROUP BY "from", "to"
ORDER BY 4 DESC
LIMIT 10;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
10 rows affected.


from,to,bookings,avg_price
Sao Paulo (SP),Florianopolis (SC),6717,1380.88
Campo Grande (MS),Rio de Janeiro (RJ),2415,1371.03
Brasilia (DF),Salvador (BH),2009,1366.8
Aracaju (SE),Salvador (BH),2918,1364.97
Rio de Janeiro (RJ),Recife (PE),1897,1361.2
Natal (RN),Salvador (BH),926,1357.97
Florianopolis (SC),Salvador (BH),5800,1350.41
Salvador (BH),Florianopolis (SC),5800,1346.09
Recife (PE),Rio de Janeiro (RJ),1897,1341.6
Brasilia (DF),Florianopolis (SC),7779,1260.25


### Agency

This query returns the average flight and hotel prices for different agencies:

Insights:
- CloudFy: Has the lowest average flight price (918.90), with an average hotel price of 213.73.
- Rainbow: Prices are nearly identical to CloudFy, with an average flight price of 919.78 and an average hotel price of 214.76.
- FlyingDrops: Charges roughly 29% more for flights than the other two agencies, with an average price of 1186.16, while its average hotel price (215.61) is almost the same as Rainbow's.

In [15]:
%%sql
-- average prices by agency
SELECT f.agency, 
       ROUND(AVG(f.price)::NUMERIC, 2) AS avg_flight_price, 
       ROUND(AVG(h.price)::NUMERIC, 2) AS avg_hotel_price
FROM flights f
LEFT JOIN hotels h ON f."travelCode" = h."travelCode"
GROUP BY f.agency;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
3 rows affected.


agency,avg_flight_price,avg_hotel_price
CloudFy,918.9,213.73
FlyingDrops,1186.16,215.61
Rainbow,919.78,214.76
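
Because this query joins flights to hotels row by row, a hotel row is repeated once for every flight row that shares its `travelCode`. The sketch below uses an in-memory SQLite database with made-up data to show the effect. It assumes, hypothetically, that a trip has two flight rows (outbound and return); the source does not confirm this. Under that assumption `AVG(h.price)` is unaffected, since every matched hotel is repeated equally and unmatched flights contribute NULL, but a `SUM` over the same join would be inflated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE flights (travelCode INTEGER, agency TEXT, price REAL);
    CREATE TABLE hotels (travelCode INTEGER, price REAL);
    -- Hypothetical data: each trip has an outbound and a return leg
    INSERT INTO flights VALUES (1, 'A', 100), (1, 'A', 120), (2, 'A', 200), (2, 'A', 220);
    -- Only trip 1 includes a hotel stay
    INSERT INTO hotels VALUES (1, 50);
    """
)

joined_rows, avg_hotel, sum_hotel = conn.execute(
    """
    SELECT COUNT(*), AVG(h.price), SUM(h.price)
    FROM flights f
    LEFT JOIN hotels h ON f.travelCode = h.travelCode
    """
).fetchone()
conn.close()

print(joined_rows, avg_hotel, sum_hotel)  # 4 50.0 100.0
```

The single hotel stay of 50 is counted twice in the sum, which is why the revenue queries in this notebook aggregate each table separately before joining.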


### Flight Type

The following query examines the number of flight bookings and the average flight price by flight type. The average prices are consistent with business logic: first class tickets cost more on average than premium tickets, which in turn cost more than economic tickets. First class also has the most bookings (116,418, compared with about 78,000 each for premium and economic), indicating that first class customers contributed the largest share of flight revenue during this period.

In [16]:
%%sql
-- number of booking and average price by flight type
SELECT "flightType", COUNT(*) AS number_of_bookings, ROUND(AVG(price)::NUMERIC,2) AS avg_price
FROM flights
GROUP BY "flightType";

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require
3 rows affected.


flightType,number_of_bookings,avg_price
economic,77466,658.44
firstClass,116418,1181.07
premium,78004,920.39


## 4. Top Customers

This query returns the names of users along with their corresponding flight, hotel, and total revenue. These are the top 10 customers who have spent the most on flights and hotels combined.

In [17]:
%%sql
-- top spenders
SELECT users.name, c.flight_revenue, c.hotel_revenue, c.total_revenue
FROM (
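	-- COALESCE keeps users who appear in only one of the two subqueries
	SELECT COALESCE(a."userCode", b."userCode") AS "userCode", 
		COALESCE(a.flight_revenue, 0) AS flight_revenue, 
		COALESCE(b.hotel_revenue, 0) AS hotel_revenue, 
		ROUND(COALESCE(a.flight_revenue, 0)+COALESCE(b.hotel_revenue, 0),2) AS total_revenue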
	FROM (
		SELECT "userCode", ROUND(SUM(price)::NUMERIC,2) AS flight_revenue
		FROM flights
		GROUP BY "userCode"
		) AS a
	FULL OUTER JOIN (
		SELECT "userCode", ROUND(SUM(total)::NUMERIC,2) AS hotel_revenue
		FROM hotels
		GROUP BY "userCode"
		) AS b
	ON a."userCode" = b."userCode"
	) AS c
INNER JOIN users 
ON c."userCode" = users.code
ORDER BY 4 DESC
LIMIT 10;

 * postgresql+psycopg2://postgres.maiqnwbncklkjhqpqboa:***@aws-0-ap-southeast-1.pooler.supabase.com:6543/postgres?sslmode=require


10 rows affected.


name,flight_revenue,hotel_revenue,total_revenue
Tiffany Behm,442901.02,34051.8,476952.82
Christopher Mccormick,427202.93,34586.09,461789.02
Steven Smith,426133.03,34725.74,460858.77
Jessie Armstrong,427218.92,30112.59,457331.51
Lyndon Germain,420362.27,34435.88,454798.15
Helen Warner,411852.92,40999.86,452852.78
Jeffrey Ramage,420462.47,32103.16,452565.63
Laurel Rodriguez,423936.24,28170.15,452106.39
Albert Garroutte,420398.06,30218.54,450616.6
Trevor Robinson,418347.29,30776.36,449123.65
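
The idea behind combining per-user flight and hotel totals with a full outer join, treating a missing side as zero, can be sketched in plain Python as follows. The function name `combine_revenue` and the sample totals are hypothetical and only illustrate the logic.

```python
def combine_revenue(flight_revenue, hotel_revenue):
    """Combine per-user totals, treating a missing side as 0, highest total first."""
    users = set(flight_revenue) | set(hotel_revenue)  # full outer join on user code
    rows = []
    for user in users:
        f = flight_revenue.get(user, 0.0)  # like COALESCE(flight_revenue, 0)
        h = hotel_revenue.get(user, 0.0)   # like COALESCE(hotel_revenue, 0)
        rows.append((user, f, h, round(f + h, 2)))
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Hypothetical per-user totals keyed by user code
flight_totals = {1: 900.0, 2: 400.0}
hotel_totals = {2: 150.0, 3: 700.0}

top = combine_revenue(flight_totals, hotel_totals)
print(top)
# [(1, 900.0, 0.0, 900.0), (3, 0.0, 700.0, 700.0), (2, 400.0, 150.0, 550.0)]
```

User 3 has hotel spending but no flights and still appears in the ranking, which is the behavior a full outer join is meant to preserve.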
