## 1. Data Collected and Its Use
> I choose booking.com as the platform for data collection. The data I want to collect includes:

- The name of the accommodation: Unique name of the property (hotel, guesthouse, etc.) for identification and branding.
- Price per night: Cost of staying per night, crucial for price comparison and booking decisions.

- Guest Rating: Average score from guest reviews (1-10 scale), indicating customer satisfaction and influencing reputation.
- Guest Total Ratings: Total number of reviews given by guests, providing context for the guest rating's validity and the property's popularity.

- Location: Geographical address of the property, important for location analysis, mapping, and guest choice based on their destination.
- Star Rating: Standardized classification system (1-5 stars) given by authorities, indicating the general level of facilities and service quality.

> Reasons for Data Importance:

- This data can be used to build recommendation systems, analyze price trends, or determine marketing strategies based on location and ratings.
- Storing data in PostgreSQL allows us to take advantage of features such as indexing for quick searches, JSON support for semi-structured data, and powerful analytics capabilities.

## 2.Data Scraping Using BeautifulSoup

In [1]:
pip install selenium beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


### 

In [2]:
# import library yang dibutuhkan

import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import pandas as pd
import re
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementClickInterceptedException, NoSuchElementException
from selenium.webdriver.common.by import By
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import os
pd.options.display.max_colwidth = 999
pd.options.display.max_rows = 999

#### Five star rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 26–27, 2025, filtered by a five-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [3]:
# URL Booking.com dengan filter bintang 4
url_hotel_4_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D5"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_4_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_5_star = BeautifulSoup(driver.page_source, "html.parser")

In [4]:
for tag in soup_5_star.find_all(True):
    print(tag.name)

html
head
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
title
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
script
script
meta
meta
meta
meta
script
link
link
script
style
link
body
div
div
div
a
span
div
div
div
div
div
div
div
div
div
div
div
header
div
nav
div
div
span
a
svg
path
div
span
button
span
span
button
span
div
picture
img
span
a
span
span
svg
path
a
span
a
span
div
a
span
nav
div
ul
li
a
span
span
svg
path
div
span
li
a
span
span
svg
path
div
span
li
a
s

In [5]:
div_5_star_containers = soup_5_star.find_all('div', attrs={'class': 'aa97d6032f'})
print(div_5_star_containers)

[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/grand-mercure-yogyakarta-adi-sucipto-opening-september-2016.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-26&amp;checkout=2025-07-27&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D5&amp;srpvid=4a2a4869cc8b2d6e14ba1324d9fec203&amp;srepoch=1753111418&amp;all_sr_blocks=185002303_95415816_2_2_0&amp;highlighted_blocks=185002303_95415816_2_2_0&amp;matching_block_id=185002303_95415816_2_2_0&amp;sr_pri_blocks=185002303_95415816_2_2

- Hotel names

In [6]:
hotel_names = []
for container in div_5_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_5_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_5_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Grand Mercure Yogyakarta Adi Sucipto</div>
==== part 2====
['Grand Mercure Yogyakarta Adi Sucipto']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Mustika Yogyakarta Resort and Spa</div>
==== part 2====
['Grand Mercure Yogyakarta Adi Sucipto', 'Mustika Yogyakarta Resort and Spa']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">The Phoenix Hotel Yogyakarta - Handwritten Collection</div>
==== part 2====
['Grand Mercure Yogyakarta Adi Sucipto', 'Mustika Yogyakarta Resort and Spa', 'The Phoenix Hotel Yogyakarta - Handwritten Collection']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Melia Purosani Yogyakarta</div>
==== part 2====
['Grand Mercure Yogyakarta Adi Sucipto', 'Mustika Yogyakarta Resort and Spa', 'The Phoenix Hotel Yogyakarta - Handwritten Collection', 'Melia Purosani Yogyakarta']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-tes

Unnamed: 0,Name
0,Grand Mercure Yogyakarta Adi Sucipto
1,Mustika Yogyakarta Resort and Spa
2,The Phoenix Hotel Yogyakarta - Handwritten Collection
3,Melia Purosani Yogyakarta
4,Yogyakarta Marriott Hotel
5,Jambuluwuk Malioboro Hotel Yogyakarta
6,Amaranta Prambanan Yogyakarta
7,Royal Ambarrukmo Yogyakarta
8,Garrya Bianti Yogyakarta
9,Hotel Tentrem Yogyakarta


- Price Hotel

In [7]:
# Extract prices
Price_hotel = []
for container in div_5_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_5_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_5_star

Unnamed: 0,Price
0,Rp 807.500
1,Rp 1.055.925
2,Rp 1.710.000
3,Rp 1.776.500
4,Rp 2.335.300
5,Rp 2.099.051
6,Rp 1.547.351
7,Rp 2.537.375
8,Rp 6.669.520
9,Rp 1.146.724


- Guest Rating

In [8]:
guest_ratings_hotel = []

for container in div_5_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_5_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_5_star


Unnamed: 0,guest_rating
0,75.0
1,81.0
2,87.0
3,86.0
4,94.0
5,82.0
6,93.0
7,89.0
8,94.0
9,


- Total Ratings

In [9]:
guest_review_counts = []

for container in div_5_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

df_guest_review_counts_5_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
df_guest_review_counts_5_star


Unnamed: 0,total_guest_reviews
0,148 ulasan
1,332 ulasan
2,1.264 ulasan
3,1.588 ulasan
4,512 ulasan
5,999 ulasan
6,34 ulasan
7,205 ulasan
8,14 ulasan
9,


- Location hotel

In [10]:
location_hotel_5_star = []
for container in div_5_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_5_star.append((location))

location_hotel_5_star = pd.DataFrame(location_hotel_5_star, columns=['Location'])
location_hotel_5_star

Unnamed: 0,Location
0,"Catur Tunggal, Yogyakarta"
1,Yogyakarta
2,"Jetis, Yogyakarta"
3,"Gondomanan, Yogyakarta (Malioboro)"
4,Yogyakarta
5,"Pakualaman, Yogyakarta"
6,Yogyakarta
7,"Catur Tunggal, Yogyakarta"
8,Yogyakarta
9,Jetis


- Merge columns

In [11]:
df_5_star_hotels = pd.concat([hotel_name_5_star, Price_hotel_5_star, guest_ratings_hotel_5_star, df_guest_review_counts_5_star, location_hotel_5_star], axis=1)
df_5_star_hotels['star_rating'] = 5

# Print the total number of hotels
print('Total Hotels : ', len(df_5_star_hotels))

# Display the first few rows of the combined DataFrame
df_5_star_hotels

Total Hotels :  11


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,Grand Mercure Yogyakarta Adi Sucipto,Rp 807.500,75.0,148 ulasan,"Catur Tunggal, Yogyakarta",5
1,Mustika Yogyakarta Resort and Spa,Rp 1.055.925,81.0,332 ulasan,Yogyakarta,5
2,The Phoenix Hotel Yogyakarta - Handwritten Collection,Rp 1.710.000,87.0,1.264 ulasan,"Jetis, Yogyakarta",5
3,Melia Purosani Yogyakarta,Rp 1.776.500,86.0,1.588 ulasan,"Gondomanan, Yogyakarta (Malioboro)",5
4,Yogyakarta Marriott Hotel,Rp 2.335.300,94.0,512 ulasan,Yogyakarta,5
5,Jambuluwuk Malioboro Hotel Yogyakarta,Rp 2.099.051,82.0,999 ulasan,"Pakualaman, Yogyakarta",5
6,Amaranta Prambanan Yogyakarta,Rp 1.547.351,93.0,34 ulasan,Yogyakarta,5
7,Royal Ambarrukmo Yogyakarta,Rp 2.537.375,89.0,205 ulasan,"Catur Tunggal, Yogyakarta",5
8,Garrya Bianti Yogyakarta,Rp 6.669.520,94.0,14 ulasan,Yogyakarta,5
9,Hotel Tentrem Yogyakarta,Rp 1.146.724,,,Jetis,5


#### Four Star Rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 26–27, 2025, filtered by a four-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [12]:
# URL Booking.com dengan filter bintang 4
url_hotel_4_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D4"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_4_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_4_star = BeautifulSoup(driver.page_source, "html.parser")

In [13]:
div_4_star_containers = soup_4_star.find_all('div', attrs={'class':"aa97d6032f"})
print(div_4_star_containers)

[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/novotel-suites-yogyakarta-malioboro.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-26&amp;checkout=2025-07-27&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D4&amp;srpvid=32e0dff21fe9a8ba7fe02aa59ed3d5fd&amp;srepoch=1753111446&amp;all_sr_blocks=523032702_405605369_2_2_0&amp;highlighted_blocks=523032702_405605369_2_2_0&amp;matching_block_id=523032702_405605369_2_2_0&amp;sr_pri_blocks=523032702_405605369_2_2_0__161414000&amp;fr

- Hotel Names

In [14]:
hotel_names = []
for container in div_4_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_4_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_4_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Novotel Suites Yogyakarta Malioboro</div>
==== part 2====
['Novotel Suites Yogyakarta Malioboro']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">INNSiDE by Meliá Yogyakarta</div>
==== part 2====
['Novotel Suites Yogyakarta Malioboro', 'INNSiDE by Meliá Yogyakarta']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Gallery Prawirotaman Hotel</div>
==== part 2====
['Novotel Suites Yogyakarta Malioboro', 'INNSiDE by Meliá Yogyakarta', 'Gallery Prawirotaman Hotel']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Grand Keisha Yogyakarta</div>
==== part 2====
['Novotel Suites Yogyakarta Malioboro', 'INNSiDE by Meliá Yogyakarta', 'Gallery Prawirotaman Hotel', 'Grand Keisha Yogyakarta']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">The Manohara Hotel Yogyakarta</div>
==== part 2====
['Novotel Suites Yogyakarta Malioboro', 'INNSiDE b

Unnamed: 0,Name
0,Novotel Suites Yogyakarta Malioboro
1,INNSiDE by Meliá Yogyakarta
2,Gallery Prawirotaman Hotel
3,Grand Keisha Yogyakarta
4,The Manohara Hotel Yogyakarta
5,Merapi Merbabu Hotel Yogyakarta Powered by Archipelago
6,Hotel Santika Premiere Jogja
7,Platinum Adisucipto Hotel & Conference Center
8,d'Salvatore Boutique Hotel Yogyakarta
9,Omah Siliran Heritage


- Price hotel

In [15]:
# Extract prices
Price_hotel = []
for container in div_4_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_4_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_4_star

Unnamed: 0,Price
0,Rp 1.614.140
1,Rp 845.184
2,Rp 819.000
3,Rp 633.360
4,Rp 833.628
5,Rp 675.439
6,Rp 850.850
7,Rp 582.015
8,Rp 550.000
9,Rp 441.984


- Guest Ratings

In [16]:
guest_ratings_hotel = []

for container in div_4_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_4_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_4_star


Unnamed: 0,guest_rating
0,86.0
1,83.0
2,86.0
3,87.0
4,89.0
5,82.0
6,83.0
7,84.0
8,62.0
9,87.0


- Total Ratings

In [17]:
guest_review_counts = []

for container in div_4_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_4_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_4_star


Unnamed: 0,total_guest_reviews
0,433 ulasan
1,338 ulasan
2,1.382 ulasan
3,120 ulasan
4,301 ulasan
5,40 ulasan
6,132 ulasan
7,438 ulasan
8,24 ulasan
9,133 ulasan


- Location

In [18]:
location_hotel_4_star = []
for container in div_4_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_4_star.append((location))

location_hotel_4_star = pd.DataFrame(location_hotel_4_star, columns=['Location'])
location_hotel_4_star

Unnamed: 0,Location
0,"Danurejan, Yogyakarta (Malioboro)"
1,Yogyakarta
2,"Mergangsan, Yogyakarta (Prawirotaman)"
3,Yogyakarta
4,"Catur Tunggal, Yogyakarta"
5,"Catur Tunggal, Yogyakarta"
6,"Jetis, Yogyakarta"
7,Yogyakarta
8,Yogyakarta
9,"Kraton, Yogyakarta"


- Merge Columns

In [19]:
df_4_star_hotels = pd.concat([hotel_name_4_star, Price_hotel_4_star, guest_ratings_hotel_4_star, guest_review_counts_4_star, location_hotel_4_star], axis=1)
df_4_star_hotels['star_rating'] = 4

# Print the total number of hotels
print('Total Hotels : ', len(df_4_star_hotels))

# Display the first few rows of the combined DataFrame
df_4_star_hotels

Total Hotels :  77


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,Novotel Suites Yogyakarta Malioboro,Rp 1.614.140,86.0,433 ulasan,"Danurejan, Yogyakarta (Malioboro)",4
1,INNSiDE by Meliá Yogyakarta,Rp 845.184,83.0,338 ulasan,Yogyakarta,4
2,Gallery Prawirotaman Hotel,Rp 819.000,86.0,1.382 ulasan,"Mergangsan, Yogyakarta (Prawirotaman)",4
3,Grand Keisha Yogyakarta,Rp 633.360,87.0,120 ulasan,Yogyakarta,4
4,The Manohara Hotel Yogyakarta,Rp 833.628,89.0,301 ulasan,"Catur Tunggal, Yogyakarta",4
5,Merapi Merbabu Hotel Yogyakarta Powered by Archipelago,Rp 675.439,82.0,40 ulasan,"Catur Tunggal, Yogyakarta",4
6,Hotel Santika Premiere Jogja,Rp 850.850,83.0,132 ulasan,"Jetis, Yogyakarta",4
7,Platinum Adisucipto Hotel & Conference Center,Rp 582.015,84.0,438 ulasan,Yogyakarta,4
8,d'Salvatore Boutique Hotel Yogyakarta,Rp 550.000,62.0,24 ulasan,Yogyakarta,4
9,Omah Siliran Heritage,Rp 441.984,87.0,133 ulasan,"Kraton, Yogyakarta",4


#### Three Star Rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 26–27, 2025, filtered by a three-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [20]:
# URL Booking.com dengan filter bintang 3
url_hotel_3_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D3"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_3_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_3_star = BeautifulSoup(driver.page_source, "html.parser")

In [21]:
div_3_star_containers = soup_3_star.find_all('div', attrs={'class':"aa97d6032f"})
print(div_3_star_containers)

[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/the-lawang-yogya.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-26&amp;checkout=2025-07-27&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D3&amp;srpvid=9a3a6c5294b80866&amp;srepoch=1753111477&amp;all_sr_blocks=460684102_204728754_2_0_0&amp;highlighted_blocks=460684102_204728754_2_0_0&amp;matching_block_id=460684102_204728754_2_0_0&amp;sr_pri_blocks=460684102_204728754_2_0_0__24750000&amp;from=searchresults" rel="noopener nore

- Hotel Names

In [22]:
hotel_names = []
for container in div_3_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_3_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_3_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">The LaWang Yogya Guesthouse</div>
==== part 2====
['The LaWang Yogya Guesthouse']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">KHAS Malioboro Hotel</div>
==== part 2====
['The LaWang Yogya Guesthouse', 'KHAS Malioboro Hotel']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Tjokro Style Yogyakarta</div>
==== part 2====
['The LaWang Yogya Guesthouse', 'KHAS Malioboro Hotel', 'Tjokro Style Yogyakarta']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Omah Konco Yogyakarta</div>
==== part 2====
['The LaWang Yogya Guesthouse', 'KHAS Malioboro Hotel', 'Tjokro Style Yogyakarta', 'Omah Konco Yogyakarta']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">ibis Yogyakarta Adi Sucipto</div>
==== part 2====
['The LaWang Yogya Guesthouse', 'KHAS Malioboro Hotel', 'Tjokro Style Yogyakarta', 'Omah Konco Yogyakarta', 'ibis Yogyakarta Adi Suci

Unnamed: 0,Name
0,The LaWang Yogya Guesthouse
1,KHAS Malioboro Hotel
2,Tjokro Style Yogyakarta
3,Omah Konco Yogyakarta
4,ibis Yogyakarta Adi Sucipto
5,MMUGM Hotel
6,Naima Jiwo
7,Omah Ambarukmo
8,KHAS Tugu Hotel Yogyakarta
9,Aveta Hotel Malioboro


- Price Hotels

In [23]:
# Extract prices
Price_hotel = []
for container in div_3_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_3_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_3_star

Unnamed: 0,Price
0,Rp 247.500
1,Rp 725.127
2,Rp 528.296
3,Rp 380.000
4,Rp 522.500
5,Rp 420.420
6,Rp 397.215
7,Rp 247.520
8,Rp 585.185
9,Rp 1.306.806


- Guest Rating

In [24]:
guest_ratings_hotel = []

for container in div_3_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_3_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_3_star


Unnamed: 0,guest_rating
0,84.0
1,78.0
2,74.0
3,84.0
4,78.0
5,80.0
6,83.0
7,87.0
8,78.0
9,82.0


- Total Rating

In [25]:
guest_review_counts = []

for container in div_3_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_3_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_3_star


Unnamed: 0,total_guest_reviews
0,158 ulasan
1,305 ulasan
2,73 ulasan
3,572 ulasan
4,154 ulasan
5,60 ulasan
6,64 ulasan
7,8 ulasan
8,80 ulasan
9,1.089 ulasan


- Location

In [26]:
location_hotel_3_star = []
for container in div_3_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_3_star.append((location))

location_hotel_3_star = pd.DataFrame(location_hotel_3_star, columns=['Location'])
location_hotel_3_star

Unnamed: 0,Location
0,"Danurejan, Yogyakarta"
1,"Gondomanan, Yogyakarta (Malioboro)"
2,"Umbulharjo, Yogyakarta"
3,"Kraton, Yogyakarta"
4,"Catur Tunggal, Yogyakarta"
5,"Catur Tunggal, Yogyakarta"
6,"Mergangsan, Yogyakarta"
7,"Catur Tunggal, Yogyakarta"
8,"Jetis, Yogyakarta"
9,"Danurejan, Yogyakarta (Malioboro)"


- Merge Columns

In [27]:
df_3_star_hotels = pd.concat([hotel_name_3_star, Price_hotel_3_star, guest_ratings_hotel_3_star, guest_review_counts_3_star, location_hotel_3_star], axis=1)
df_3_star_hotels['star_rating'] = 3

# Print the total number of hotels
print('Total Hotels : ', len(df_3_star_hotels))

# Display the first few rows of the combined DataFrame
df_3_star_hotels

Total Hotels :  70


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,The LaWang Yogya Guesthouse,Rp 247.500,84.0,158 ulasan,"Danurejan, Yogyakarta",3
1,KHAS Malioboro Hotel,Rp 725.127,78.0,305 ulasan,"Gondomanan, Yogyakarta (Malioboro)",3
2,Tjokro Style Yogyakarta,Rp 528.296,74.0,73 ulasan,"Umbulharjo, Yogyakarta",3
3,Omah Konco Yogyakarta,Rp 380.000,84.0,572 ulasan,"Kraton, Yogyakarta",3
4,ibis Yogyakarta Adi Sucipto,Rp 522.500,78.0,154 ulasan,"Catur Tunggal, Yogyakarta",3
5,MMUGM Hotel,Rp 420.420,80.0,60 ulasan,"Catur Tunggal, Yogyakarta",3
6,Naima Jiwo,Rp 397.215,83.0,64 ulasan,"Mergangsan, Yogyakarta",3
7,Omah Ambarukmo,Rp 247.520,87.0,8 ulasan,"Catur Tunggal, Yogyakarta",3
8,KHAS Tugu Hotel Yogyakarta,Rp 585.185,78.0,80 ulasan,"Jetis, Yogyakarta",3
9,Aveta Hotel Malioboro,Rp 1.306.806,82.0,1.089 ulasan,"Danurejan, Yogyakarta (Malioboro)",3


#### Two Star Rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 26–27, 2025, filtered by a Two-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [28]:
# URL Booking.com dengan filter bintang 2
url_hotel_2_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D2"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_2_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_2_star = BeautifulSoup(driver.page_source, "html.parser")

In [29]:
div_2_star_containers = soup_2_star.find_all('div', attrs={'class':"aa97d6032f"})
print(div_2_star_containers)

[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/venezia-homestay-and-garden.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-26&amp;checkout=2025-07-27&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D2&amp;srpvid=c7cf6c66fec6046f&amp;srepoch=1753111520&amp;all_sr_blocks=217214703_240565989_0_2_0_238737%2C217214703_240565989_0_2_0_238737&amp;highlighted_blocks=217214703_240565989_0_2_0_238737%2C217214703_240565989_0_2_0_238737&amp;matching_block_id=217214703_240565989_0_2_0_2

- Hotel Names

In [30]:
hotel_names = []
for container in div_2_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_2_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_2_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Venezia Homestay and Garden</div>
==== part 2====
['Venezia Homestay and Garden']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Griya Wijilan Syariah</div>
==== part 2====
['Venezia Homestay and Garden', 'Griya Wijilan Syariah']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Ndalem Mantrigawen</div>
==== part 2====
['Venezia Homestay and Garden', 'Griya Wijilan Syariah', 'Ndalem Mantrigawen']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Grand Marto Hotel</div>
==== part 2====
['Venezia Homestay and Garden', 'Griya Wijilan Syariah', 'Ndalem Mantrigawen', 'Grand Marto Hotel']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Hotel O Homestay Alqid Syariah</div>
==== part 2====
['Venezia Homestay and Garden', 'Griya Wijilan Syariah', 'Ndalem Mantrigawen', 'Grand Marto Hotel', 'Hotel O Homestay Alqid Syariah']
==== part 1====

Unnamed: 0,Name
0,Venezia Homestay and Garden
1,Griya Wijilan Syariah
2,Ndalem Mantrigawen
3,Grand Marto Hotel
4,Hotel O Homestay Alqid Syariah
5,Mawar Asri Hotel
6,Capital O 94274 Homestay Balimoelih
7,Aloha Hotel Yogyakarta
8,Ndalem Diajeng
9,Yellow Star Gejayan Hotel


- Hotel Price

In [31]:
# Extract prices
Price_hotel = []
for container in div_2_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_2_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_2_star

Unnamed: 0,Price
0,Rp 250.600
1,Rp 350.000
2,Rp 120.000
3,Rp 275.000
4,Rp 75.854
5,Rp 309.690
6,Rp 188.328
7,Rp 270.000
8,Rp 170.000
9,Rp 474.078


- Guest Rating

In [32]:
guest_ratings_hotel = []

for container in div_2_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_2_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_2_star


Unnamed: 0,guest_rating
0,81.0
1,84.0
2,80.0
3,74.0
4,37.0
5,78.0
6,86.0
7,91.0
8,91.0
9,81.0


- Total rating

In [33]:
guest_review_counts = []

for container in div_2_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_2_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_2_star

Unnamed: 0,total_guest_reviews
0,121 ulasan
1,870 ulasan
2,243 ulasan
3,246 ulasan
4,6 ulasan
5,301 ulasan
6,13 ulasan
7,1.250 ulasan
8,271 ulasan
9,129 ulasan


- Location

In [34]:
location_hotel_2_star = []
for container in div_2_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_2_star.append((location))

location_hotel_2_star = pd.DataFrame(location_hotel_2_star, columns=['Location'])
location_hotel_2_star

Unnamed: 0,Location
0,"Mantrijeron, Yogyakarta"
1,"Kraton, Yogyakarta"
2,"Kraton, Yogyakarta"
3,"Mergangsan, Yogyakarta (Prawirotaman)"
4,"Kraton, Yogyakarta"
5,"Ngampilan, Yogyakarta"
6,"Mantrijeron, Yogyakarta"
7,"Mergangsan, Yogyakarta (Prawirotaman)"
8,"Kraton, Yogyakarta"
9,"Catur Tunggal, Yogyakarta"


- Merge Columns

In [35]:
df_2_star_hotels = pd.concat([hotel_name_2_star, Price_hotel_2_star, guest_ratings_hotel_2_star, guest_review_counts_2_star, location_hotel_2_star], axis=1)
df_2_star_hotels['star_rating'] = 2

# Print the total number of hotels
print('Total Hotels : ', len(df_3_star_hotels))

# Display the first few rows of the combined DataFrame
df_2_star_hotels

Total Hotels :  70


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,Venezia Homestay and Garden,Rp 250.600,81.0,121 ulasan,"Mantrijeron, Yogyakarta",2
1,Griya Wijilan Syariah,Rp 350.000,84.0,870 ulasan,"Kraton, Yogyakarta",2
2,Ndalem Mantrigawen,Rp 120.000,80.0,243 ulasan,"Kraton, Yogyakarta",2
3,Grand Marto Hotel,Rp 275.000,74.0,246 ulasan,"Mergangsan, Yogyakarta (Prawirotaman)",2
4,Hotel O Homestay Alqid Syariah,Rp 75.854,37.0,6 ulasan,"Kraton, Yogyakarta",2
5,Mawar Asri Hotel,Rp 309.690,78.0,301 ulasan,"Ngampilan, Yogyakarta",2
6,Capital O 94274 Homestay Balimoelih,Rp 188.328,86.0,13 ulasan,"Mantrijeron, Yogyakarta",2
7,Aloha Hotel Yogyakarta,Rp 270.000,91.0,1.250 ulasan,"Mergangsan, Yogyakarta (Prawirotaman)",2
8,Ndalem Diajeng,Rp 170.000,91.0,271 ulasan,"Kraton, Yogyakarta",2
9,Yellow Star Gejayan Hotel,Rp 474.078,81.0,129 ulasan,"Catur Tunggal, Yogyakarta",2


#### One Star Rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 26–27, 2025, filtered by a five-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [36]:
# URL Booking.com dengan filter bintang 3
url_hotel_1_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D1"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_1_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_1_star = BeautifulSoup(driver.page_source, "html.parser")

In [37]:
div_1_star_containers = soup_1_star.find_all('div', attrs={'class':"aa97d6032f"})
print(div_1_star_containers)

[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/we-stay-condongcatur.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-26&amp;checkout=2025-07-27&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D1&amp;srpvid=5d444c33b94a81b3aad27c9db556ff4d&amp;srepoch=1753111553&amp;all_sr_blocks=316985705_375878243_4_0_0&amp;highlighted_blocks=316985705_375878243_4_0_0&amp;matching_block_id=316985705_375878243_4_0_0&amp;sr_pri_blocks=316985705_375878243_4_0_0__13500000&amp;from=searchresults

- Hotel Names

In [38]:
hotel_names = []
for container in div_1_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_1_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_1_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Riverside Homestay Jogja</div>
==== part 2====
['Riverside Homestay Jogja']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Pamularsih Homestay</div>
==== part 2====
['Riverside Homestay Jogja', 'Pamularsih Homestay']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Ndalem Sarengat</div>
==== part 2====
['Riverside Homestay Jogja', 'Pamularsih Homestay', 'Ndalem Sarengat']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">EC Pondokan</div>
==== part 2====
['Riverside Homestay Jogja', 'Pamularsih Homestay', 'Ndalem Sarengat', 'EC Pondokan']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Andelis Homestay</div>
==== part 2====
['Riverside Homestay Jogja', 'Pamularsih Homestay', 'Ndalem Sarengat', 'EC Pondokan', 'Andelis Homestay']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">BeOne House Jogja</div>
==== p

Unnamed: 0,Name
0,Riverside Homestay Jogja
1,Pamularsih Homestay
2,Ndalem Sarengat
3,EC Pondokan
4,Andelis Homestay
5,BeOne House Jogja
6,Jaya Homestay
7,Arjuna Garden Homestay
8,Panca Dewi Guest House
9,Hotel Batik Yogyakarta


- Hotel Price

In [39]:
# Extract prices
Price_hotel = []
for container in div_1_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_1_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_1_star

Unnamed: 0,Price
0,Rp 135.000
1,Rp 270.000
2,Rp 280.000
3,Rp 250.000
4,Rp 250.000
5,Rp 362.319
6,Rp 190.000
7,Rp 527.250
8,Rp 300.000
9,Rp 488.740


- Guest Rating

In [40]:
guest_ratings_hotel = []

for container in div_1_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_1_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_1_star


Unnamed: 0,guest_rating
0,68.0
1,90.0
2,86.0
3,76.0
4,82.0
5,84.0
6,80.0
7,83.0
8,77.0
9,


- Total Rating

In [41]:
guest_review_counts = []

for container in div_1_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_1_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_1_star

Unnamed: 0,total_guest_reviews
0,35 ulasan
1,193 ulasan
2,69 ulasan
3,127 ulasan
4,25 ulasan
5,46 ulasan
6,40 ulasan
7,121 ulasan
8,60 ulasan
9,


- Location

In [42]:
location_hotel_1_star = []
for container in div_1_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_1_star.append((location))

location_hotel_1_star = pd.DataFrame(location_hotel_1_star, columns=['Location'])
location_hotel_1_star

Unnamed: 0,Location
0,Yogyakarta
1,"Wirobrajan, Yogyakarta"
2,"Kraton, Yogyakarta"
3,"Gondomanan, Yogyakarta (Malioboro)"
4,Yogyakarta
5,Yogyakarta
6,Yogyakarta
7,"Mantrijeron, Yogyakarta"
8,"Danurejan, Yogyakarta"
9,"Gedongtengen, Yogyakarta (Malioboro)"


- Merge Columns

In [43]:
df_1_star_hotels = pd.concat([hotel_name_1_star, Price_hotel_1_star, guest_ratings_hotel_1_star, guest_review_counts_1_star, location_hotel_1_star], axis=1)
df_1_star_hotels['star_rating'] = 1

# Print the total number of hotels
print('Total Hotels : ', len(df_1_star_hotels))

# Display the first few rows of the combined DataFrame
df_1_star_hotels

Total Hotels :  39


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,Riverside Homestay Jogja,Rp 135.000,68.0,35 ulasan,Yogyakarta,1
1,Pamularsih Homestay,Rp 270.000,90.0,193 ulasan,"Wirobrajan, Yogyakarta",1
2,Ndalem Sarengat,Rp 280.000,86.0,69 ulasan,"Kraton, Yogyakarta",1
3,EC Pondokan,Rp 250.000,76.0,127 ulasan,"Gondomanan, Yogyakarta (Malioboro)",1
4,Andelis Homestay,Rp 250.000,82.0,25 ulasan,Yogyakarta,1
5,BeOne House Jogja,Rp 362.319,84.0,46 ulasan,Yogyakarta,1
6,Jaya Homestay,Rp 190.000,80.0,40 ulasan,Yogyakarta,1
7,Arjuna Garden Homestay,Rp 527.250,83.0,121 ulasan,"Mantrijeron, Yogyakarta",1
8,Panca Dewi Guest House,Rp 300.000,77.0,60 ulasan,"Danurejan, Yogyakarta",1
9,Hotel Batik Yogyakarta,Rp 488.740,,,"Gedongtengen, Yogyakarta (Malioboro)",1


#### Merge All Tables

In [44]:
df_hotel = pd.concat([
    df_1_star_hotels,
    df_2_star_hotels,
    df_3_star_hotels,
    df_4_star_hotels,
    df_5_star_hotels
], ignore_index=True)

In [45]:
print('Total Hotels : ', len(df_hotel))
df_hotel.sample(10)

Total Hotels :  265


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
126,Paku Mas Hotel,Rp 370.139,85.0,35 ulasan,"Catur Tunggal, Yogyakarta",3
223,GAIA Cosmo Hotel,Rp 910.849,88.0,452 ulasan,"Umbulharjo, Yogyakarta",4
20,Griya Jogja Hotel,Rp 338.371,,,"Jetis, Yogyakarta",1
39,Venezia Homestay and Garden,Rp 250.600,81.0,121 ulasan,"Mantrijeron, Yogyakarta",2
178,INNSiDE by Meliá Yogyakarta,Rp 845.184,83.0,338 ulasan,Yogyakarta,4
50,Griya Langen Guesthouse,Rp 224.000,72.0,30 ulasan,"Kraton, Yogyakarta",2
231,Yoga Lover,Rp 697.500,79.0,59 ulasan,"Mantrijeron, Yogyakarta",4
200,Abadi Hotel Malioboro Yogyakarta,Rp 969.497,80.0,30 ulasan,"Gedongtengen, Yogyakarta (Malioboro)",4
162,Urbanview TWH Costel near Stadion Maguwoharjo Yogyakarta,Rp 293.436,,,Yogyakarta,3
227,Swiss-Belboutique Yogyakarta,Rp 1.130.000,89.0,238 ulasan,"Gondokusuman, Yogyakarta",4


Description:
In this script, we use the pd.concat() function from the Pandas library to vertically concatenate multiple hotel DataFrames (df_1_star_hotels, df_2_star_hotels, ..., df_5_star_hotels) into a single unified DataFrame named df_hotel. The parameter ignore_index=True ensures the resulting DataFrame has a clean, continuous index.

After merging the data, we create a new column named Rp_Price by removing the "Rp" prefix and thousand separators (.) from the original Price column. The cleaned values are then converted into numeric format using the int64 data type, which represents the hotel prices in Indonesian Rupiah (Rp) as whole numbers.

Finally, we print the total number of hotel entries using the len() function.

Impact on Data Integrity:
* Vertical concatenation (union) ensures that all hotel records from different star ratings are combined row-wise, with no column misalignment.

* Converting the Price column to int64 ensures accurate handling of currency as whole numbers, which is typically how prices are represented in real-world hotel listings.

* Removing formatting characters from the Price column helps avoid errors during numeric operations like sorting, filtering, or aggregation.

* Knowing the total number of hotel entries provides insight into the dataset’s size and helps assess completeness and readiness for further analysis.

#### Cleaning

In [46]:
df_hotel['Total_ratings'] = df_hotel['total_guest_reviews'].str.extract('(\d+)')
df_hotel['Total_ratings'] = df_hotel['Total_ratings'].astype('Int64')
df_hotel.sample(10)

Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating,Total_ratings
11,Arjuna 31,Rp 132.593,50,1 ulasan,"Wirobrajan, Yogyakarta",1,1
211,Abhayagiri - Sumberwatu Heritage Resort,Rp 899.000,88,196 ulasan,Yogyakarta,4,196
28,The Samirono,Rp 772.535,80,1 ulasan,"Catur Tunggal, Yogyakarta",1,1
239,Surokarsan 9 House Yogyakarta,Rp 418.905,89,75 ulasan,"Pakualaman, Yogyakarta",4,75
180,Grand Keisha Yogyakarta,Rp 633.360,87,120 ulasan,Yogyakarta,4,120
51,Pendhapa Art Space - PAS Limasan Homestay,Rp 314.888,83,39 ulasan,Yogyakarta,2,39
194,Hotel New Saphir Yogyakarta,Rp 528.328,71,33 ulasan,"Gondokusuman, Yogyakarta",4,33
132,The Victoria Hotel Yogyakarta,Rp 455.832,86,23 ulasan,"Catur Tunggal, Yogyakarta",3,23
134,RedDoorz Plus near Alun Alun Selatan 2,Rp 226.287,63,36 ulasan,"Mantrijeron, Yogyakarta",3,36
227,Swiss-Belboutique Yogyakarta,Rp 1.130.000,89,238 ulasan,"Gondokusuman, Yogyakarta",4,238


> Description:
<br> We extract numeric values from the 'total_guest_reviews' column using str.extract() with the regex (\d+), then convert the result to the nullable integer type Int64 in a new column called 'Total_ratings'. This allows numerical operations while handling missing values (NaN).

> Impact on Data Integrity:
<br> If the original text contains no digits, NaN will be returned. Using Int64 prevents errors from missing values and ensures the column is ready for analysis.

In [47]:
df_hotel['Rp_Price'] = df_hotel['Price'].str.replace('Rp', '', regex=False).str.replace('.', '', regex=False).str.strip()
df_hotel['Rp_Price'] = df_hotel['Rp_Price'].astype('int64')
df_hotel.sample(10)

Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating,Total_ratings,Rp_Price
27,Koolkost near Stadion Maguwoharjo,Rp 125.096,,,Yogyakarta,1,,125096
190,The Atrium Hotel & Resort Yogyakarta,Rp 480.000,85.0,34 ulasan,"Mlati, Yogyakarta",4,34.0,480000
219,Sofia Boutique Residence,Rp 975.851,83.0,55 ulasan,Yogyakarta,4,55.0,975851
99,Super OYO 13I9 88 Exclusive Guesthouse,Rp 228.571,58.0,9 ulasan,Yogyakarta,2,9.0,228571
172,With Friends Homestay Jogja,Rp 500.000,62.0,9 ulasan,"Mantrijeron, Yogyakarta",3,9.0,500000
202,Hotel Sumaryo,Rp 264.865,77.0,13 ulasan,"Mergangsan, Yogyakarta (Prawirotaman)",4,13.0,264865
16,Prayogo Lama Family GH Prawirotaman,Rp 160.958,81.0,84 ulasan,"Mergangsan, Yogyakarta (Prawirotaman)",1,84.0,160958
41,Ndalem Mantrigawen,Rp 120.000,80.0,243 ulasan,"Kraton, Yogyakarta",2,243.0,120000
188,KESATRIYAN JOGJA GUEST HOUSE,Rp 553.000,84.0,351 ulasan,"Kraton, Yogyakarta",4,351.0,553000
125,Metro Malioboro Living,Rp 294.840,77.0,653 ulasan,"Danurejan, Yogyakarta",3,653.0,294840


In [48]:
# Ganti koma menjadi titik dan ubah ke float
df_hotel['guest_rating'] = df_hotel['guest_rating'].str.replace(',', '.', regex=False).astype(float)
# Normalisasi ke skala 1-5
df_hotel['guest_rating_normalized'] = df_hotel['guest_rating'] / 2

In [49]:
# Drop kolom mentah setelah dibuat versi bersihnya
df_hotel = df_hotel.drop(['Price', 'guest_rating', 'total_guest_reviews'], axis=1)

> Description:<br>
The `guest_rating` column was normalized by dividing all values by 2, converting the rating scale from 1–10 to 1–5. This was stored in a new column `guest_rating_normalized`.

> Impact on Data Integrity:<br>
Normalizing `guest_rating` to a 5-point scale ensures consistency with common rating systems and improves comparability for analysis or modeling. The original data remains intact, but slight floating-point differences may occur, requiring rounding if needed.

In [50]:
df_hotel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Name                     265 non-null    object 
 1   Location                 265 non-null    object 
 2   star_rating              265 non-null    int64  
 3   Total_ratings            249 non-null    Int64  
 4   Rp_Price                 265 non-null    int64  
 5   guest_rating_normalized  249 non-null    float64
dtypes: Int64(1), float64(1), int64(2), object(2)
memory usage: 12.8+ KB


In [51]:
# Konversi ke float untuk memastikan tidak error
df_hotel['Total_ratings'] = df_hotel['Total_ratings'].astype(float)
df_hotel['guest_rating_normalized'] = df_hotel['guest_rating_normalized'].astype(float)

# Isi nilai null dengan median
df_hotel['Total_ratings'] = df_hotel['Total_ratings'].fillna(df_hotel['Total_ratings'].median())
df_hotel['guest_rating_normalized'] = df_hotel['guest_rating_normalized'].fillna(df_hotel['guest_rating_normalized'].median())


> Description: <br>
Missing values in both `Total_ratings` and `guest_rating_normalized` were replaced with the median of each column. Median was chosen to reduce the influence of outliers.

> Impact on Data Integrity:<br>
Filling missing values with the median retains all rows, avoiding data loss from nulls. Median is a robust measure that provides a fair estimate, especially with outliers. However, since the values are imputed, they may not reflect the actual data and could slightly bias the distribution if the missingness is not random.

In [52]:
df_hotel.sample(10)

Unnamed: 0,Name,Location,star_rating,Total_ratings,Rp_Price,guest_rating_normalized
230,The Malioboro Hotel & Conference Center,"Gedongtengen, Yogyakarta",4,67.0,1280988,4.35
136,Tujuan Jogja Villas - Kresna,"Sinduadi, Yogyakarta",3,7.0,1632411,4.5
261,Royal Ambarrukmo Yogyakarta,"Catur Tunggal, Yogyakarta",5,205.0,2537375,4.45
246,Hotel O Prawirotaman Near Keraton Yogyakarta Formerly Paris Guesthouse,Yogyakarta,4,1.0,243555,3.5
15,Rumah Pathuk Syariah Homestay,"Ngampilan, Yogyakarta",1,57.0,250000,4.0
105,OYO 409 Pondok Helomi,Yogyakarta,2,8.0,140082,4.6
16,Prayogo Lama Family GH Prawirotaman,"Mergangsan, Yogyakarta (Prawirotaman)",1,84.0,160958,4.05
13,Hotel Pules,"Danurejan, Yogyakarta",1,1.0,414452,3.5
62,SPOT ON 94319 Daffsell Homestay Syariah,"Danurejan, Yogyakarta",2,10.0,91287,2.8
189,Royal Malioboro by ASTON,"Gedongtengen, Yogyakarta (Malioboro)",4,902.0,1632310,4.45


In [53]:
# #convert df_hotel to csv
df_hotel.to_csv('hotel.csv', index=False)
# # After this we can export to SQL table