## 1. Data Collected and Its Use
> I choose booking.com as the platform for data collection. The data I want to collect includes:

- The name of the accommodation: Unique name of the property (hotel, guesthouse, etc.) for identification and branding.
- Price per night: Cost of staying per night, crucial for price comparison and booking decisions.

- Guest Rating: Average score from guest reviews (1-10 scale), indicating customer satisfaction and influencing reputation.
- Guest Total Ratings: Total number of reviews given by guests, providing context for the guest rating's validity and the property's popularity.

- Location: Geographical address of the property, important for location analysis, mapping, and guest choice based on their destination.
- Star Rating: Standardized classification system (1-5 stars) given by authorities, indicating the general level of facilities and service quality.

> Reasons for Data Importance:

- This data can be used to build recommendation systems, analyze price trends, or determine marketing strategies based on location and ratings.
- Storing data in PostgreSQL allows us to take advantage of features such as indexing for quick searches, JSON support for semi-structured data, and powerful analytics capabilities.

## 2.Data Scraping Using BeautifulSoup

In [1]:
pip install selenium beautifulsoup4




### 

In [2]:
# import library yang dibutuhkan

import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import pandas as pd
import re
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementClickInterceptedException, NoSuchElementException
from selenium.webdriver.common.by import By
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import os
pd.options.display.max_colwidth = 999
pd.options.display.max_rows = 999

#### Five star rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 19–20, 2025, filtered by a five-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [3]:
# URL Booking.com dengan filter bintang 4
url_hotel_4_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-19&checkout=2025-07-20&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D5"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_4_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_5_star = BeautifulSoup(driver.page_source, "html.parser")

In [4]:
for tag in soup_5_star.find_all(True):
    print(tag.name)

html
head
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
script
script
title
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
link
meta
meta
meta
meta
script
link
script
link
script
style
link
body
div
div
div
a
span
div
div
div
div
div
div
div
div
div
div
div
header
div
nav
div
div
span
a
svg
path
div
span
button
span
span
button
span
div
picture
img
span
a
span
span
svg
path
a
span
a
span
div
a
span
nav
div
ul
li
a
span
span
svg
path
div
span
li
a
span
span
svg
pat

In [5]:
div_5_star_containers = soup_5_star.findAll('div', attrs={'class':"aa97d6032f"})
print(div_5_star_containers)

  div_5_star_containers = soup_5_star.findAll('div', attrs={'class':"aa97d6032f"})


[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/grand-mercure-yogyakarta-adi-sucipto-opening-september-2016.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-19&amp;checkout=2025-07-20&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D5&amp;srpvid=010a77cd3170591c00005260eee9cfbf&amp;srepoch=1752672048&amp;all_sr_blocks=185002303_95415816_2_2_0&amp;highlighted_blocks=185002303_95415816_2_2_0&amp;matching_block_id=185002303_95415816_2_2_0&amp;sr_pri_blocks=185002303_95415816_2_2

- Hotel names

In [6]:
hotel_names = []
for container in div_5_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_5_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_5_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Grand Mercure Yogyakarta Adi Sucipto</div>
==== part 2====
['Grand Mercure Yogyakarta Adi Sucipto']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Mustika Yogyakarta Resort and Spa</div>
==== part 2====
['Grand Mercure Yogyakarta Adi Sucipto', 'Mustika Yogyakarta Resort and Spa']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Hyatt Regency Yogyakarta</div>
==== part 2====
['Grand Mercure Yogyakarta Adi Sucipto', 'Mustika Yogyakarta Resort and Spa', 'Hyatt Regency Yogyakarta']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">The Phoenix Hotel Yogyakarta - Handwritten Collection</div>
==== part 2====
['Grand Mercure Yogyakarta Adi Sucipto', 'Mustika Yogyakarta Resort and Spa', 'Hyatt Regency Yogyakarta', 'The Phoenix Hotel Yogyakarta - Handwritten Collection']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Jambuluwuk Maliobor

Unnamed: 0,Name
0,Grand Mercure Yogyakarta Adi Sucipto
1,Mustika Yogyakarta Resort and Spa
2,Hyatt Regency Yogyakarta
3,The Phoenix Hotel Yogyakarta - Handwritten Collection
4,Jambuluwuk Malioboro Hotel Yogyakarta
5,Yogyakarta Marriott Hotel
6,Melia Purosani Yogyakarta
7,Royal Ambarrukmo Yogyakarta
8,Lafayette Boutique Hotel
9,ARTOTEL Suites Bianti Yogyakarta


- Price Hotel

In [7]:
# Extract prices
Price_hotel = []
for container in div_5_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_5_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_5_star

Unnamed: 0,Price
0,Rp 950.000
1,Rp 960.892
2,Rp 2.722.500
3,Rp 1.710.000
4,Rp 1.204.441
5,Rp 1.839.200
6,Rp 1.776.500
7,Rp 1.596.275
8,Rp 802.040
9,Rp 1.270.126


- Guest Rating

In [8]:
guest_ratings_hotel = []

for container in div_5_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_5_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_5_star


Unnamed: 0,guest_rating
0,75.0
1,81.0
2,91.0
3,87.0
4,82.0
5,94.0
6,86.0
7,89.0
8,
9,


- Total Ratings

In [9]:
guest_review_counts = []

for container in div_5_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

df_guest_review_counts_5_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
df_guest_review_counts_5_star


Unnamed: 0,total_guest_reviews
0,151 ulasan
1,333 ulasan
2,909 ulasan
3,1.266 ulasan
4,1.000 ulasan
5,510 ulasan
6,1.577 ulasan
7,208 ulasan
8,
9,


- Location hotel

In [10]:
location_hotel_5_star = []
for container in div_5_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_5_star.append((location))

location_hotel_5_star = pd.DataFrame(location_hotel_5_star, columns=['Location'])
location_hotel_5_star

Unnamed: 0,Location
0,"Catur Tunggal, Yogyakarta"
1,Yogyakarta
2,Yogyakarta
3,"Jetis, Yogyakarta"
4,"Pakualaman, Yogyakarta"
5,Yogyakarta
6,"Gondomanan, Yogyakarta (Malioboro)"
7,"Catur Tunggal, Yogyakarta"
8,"Catur Tunggal, Yogyakarta"
9,"Gondokusuman, Yogyakarta"


- Merge columns

In [11]:
df_5_star_hotels = pd.concat([hotel_name_5_star, Price_hotel_5_star, guest_ratings_hotel_5_star, df_guest_review_counts_5_star, location_hotel_5_star], axis=1)
df_5_star_hotels['star_rating'] = 5

# Print the total number of hotels
print('Total Hotels : ', len(df_5_star_hotels))

# Display the first few rows of the combined DataFrame
df_5_star_hotels

Total Hotels :  13


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,Grand Mercure Yogyakarta Adi Sucipto,Rp 950.000,75.0,151 ulasan,"Catur Tunggal, Yogyakarta",5
1,Mustika Yogyakarta Resort and Spa,Rp 960.892,81.0,333 ulasan,Yogyakarta,5
2,Hyatt Regency Yogyakarta,Rp 2.722.500,91.0,909 ulasan,Yogyakarta,5
3,The Phoenix Hotel Yogyakarta - Handwritten Collection,Rp 1.710.000,87.0,1.266 ulasan,"Jetis, Yogyakarta",5
4,Jambuluwuk Malioboro Hotel Yogyakarta,Rp 1.204.441,82.0,1.000 ulasan,"Pakualaman, Yogyakarta",5
5,Yogyakarta Marriott Hotel,Rp 1.839.200,94.0,510 ulasan,Yogyakarta,5
6,Melia Purosani Yogyakarta,Rp 1.776.500,86.0,1.577 ulasan,"Gondomanan, Yogyakarta (Malioboro)",5
7,Royal Ambarrukmo Yogyakarta,Rp 1.596.275,89.0,208 ulasan,"Catur Tunggal, Yogyakarta",5
8,Lafayette Boutique Hotel,Rp 802.040,,,"Catur Tunggal, Yogyakarta",5
9,ARTOTEL Suites Bianti Yogyakarta,Rp 1.270.126,,,"Gondokusuman, Yogyakarta",5


#### Four Star Rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 19–20, 2025, filtered by a four-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [12]:
# URL Booking.com dengan filter bintang 4
url_hotel_4_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D4"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_4_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_4_star = BeautifulSoup(driver.page_source, "html.parser")

In [13]:
div_4_star_containers = soup_4_star.findAll('div', attrs={'class':"aa97d6032f"})
print(div_4_star_containers)

  div_4_star_containers = soup_4_star.findAll('div', attrs={'class':"aa97d6032f"})


[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/santika-premiere-jogja.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-26&amp;checkout=2025-07-27&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D4&amp;srpvid=3ce35dde078805d8&amp;srepoch=1752672076&amp;all_sr_blocks=23828909_100790840_0_1_0_102735&amp;highlighted_blocks=23828909_100790840_0_1_0_102735&amp;matching_block_id=23828909_100790840_0_1_0_102735&amp;sr_pri_blocks=23828909_100790840_0_1_0_102735_85000000&amp;from=sear

- Hotel Names

In [14]:
hotel_names = []
for container in div_4_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_4_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_4_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Hotel Santika Premiere Jogja</div>
==== part 2====
['Hotel Santika Premiere Jogja']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">The Alana Hotel &amp; Conference Center Malioboro Yogyakarta by ASTON</div>
==== part 2====
['Hotel Santika Premiere Jogja', 'The Alana Hotel & Conference Center Malioboro Yogyakarta by ASTON']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">INNSiDE by Meliá Yogyakarta</div>
==== part 2====
['Hotel Santika Premiere Jogja', 'The Alana Hotel & Conference Center Malioboro Yogyakarta by ASTON', 'INNSiDE by Meliá Yogyakarta']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Grand Tjokro Yogyakarta</div>
==== part 2====
['Hotel Santika Premiere Jogja', 'The Alana Hotel & Conference Center Malioboro Yogyakarta by ASTON', 'INNSiDE by Meliá Yogyakarta', 'Grand Tjokro Yogyakarta']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1

Unnamed: 0,Name
0,Hotel Santika Premiere Jogja
1,The Alana Hotel & Conference Center Malioboro Yogyakarta by ASTON
2,INNSiDE by Meliá Yogyakarta
3,Grand Tjokro Yogyakarta
4,Grand Keisha Yogyakarta
5,The Westlake Hotel & Resort Yogyakarta
6,Merapi Merbabu Hotel Yogyakarta Powered by Archipelago
7,Gallery Prawirotaman Hotel
8,FX Stay & Coffee
9,Sahid Raya Hotel & Convention Yogyakarta


- Price hotel

In [15]:
# Extract prices
Price_hotel = []
for container in div_4_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_4_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_4_star

Unnamed: 0,Price
0,Rp 850.000
1,Rp 683.677
2,Rp 889.792
3,Rp 603.556
4,Rp 592.000
5,Rp 951.944
6,Rp 660.914
7,Rp 794.235
8,Rp 640.000
9,Rp 431.993


- Guest Ratings

In [16]:
guest_ratings_hotel = []

for container in div_4_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_4_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_4_star


Unnamed: 0,guest_rating
0,83.0
1,89.0
2,83.0
3,77.0
4,87.0
5,85.0
6,82.0
7,86.0
8,83.0
9,75.0


- Total Ratings

In [17]:
guest_review_counts = []

for container in div_4_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_4_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_4_star


Unnamed: 0,total_guest_reviews
0,130 ulasan
1,564 ulasan
2,338 ulasan
3,66 ulasan
4,119 ulasan
5,10 ulasan
6,40 ulasan
7,1.380 ulasan
8,81 ulasan
9,85 ulasan


- Location

In [18]:
location_hotel_4_star = []
for container in div_4_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_4_star.append((location))

location_hotel_4_star = pd.DataFrame(location_hotel_4_star, columns=['Location'])
location_hotel_4_star

Unnamed: 0,Location
0,"Jetis, Yogyakarta"
1,"Mantrijeron, Yogyakarta"
2,Yogyakarta
3,"Catur Tunggal, Yogyakarta"
4,Yogyakarta
5,Yogyakarta
6,"Catur Tunggal, Yogyakarta"
7,"Mergangsan, Yogyakarta (Prawirotaman)"
8,"Kraton, Yogyakarta"
9,"Catur Tunggal, Yogyakarta"


- Merge Columns

In [19]:
df_4_star_hotels = pd.concat([hotel_name_4_star, Price_hotel_4_star, guest_ratings_hotel_4_star, guest_review_counts_4_star, location_hotel_4_star], axis=1)
df_4_star_hotels['star_rating'] = 4

# Print the total number of hotels
print('Total Hotels : ', len(df_4_star_hotels))

# Display the first few rows of the combined DataFrame
df_4_star_hotels

Total Hotels :  65


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,Hotel Santika Premiere Jogja,Rp 850.000,83.0,130 ulasan,"Jetis, Yogyakarta",4
1,The Alana Hotel & Conference Center Malioboro Yogyakarta by ASTON,Rp 683.677,89.0,564 ulasan,"Mantrijeron, Yogyakarta",4
2,INNSiDE by Meliá Yogyakarta,Rp 889.792,83.0,338 ulasan,Yogyakarta,4
3,Grand Tjokro Yogyakarta,Rp 603.556,77.0,66 ulasan,"Catur Tunggal, Yogyakarta",4
4,Grand Keisha Yogyakarta,Rp 592.000,87.0,119 ulasan,Yogyakarta,4
5,The Westlake Hotel & Resort Yogyakarta,Rp 951.944,85.0,10 ulasan,Yogyakarta,4
6,Merapi Merbabu Hotel Yogyakarta Powered by Archipelago,Rp 660.914,82.0,40 ulasan,"Catur Tunggal, Yogyakarta",4
7,Gallery Prawirotaman Hotel,Rp 794.235,86.0,1.380 ulasan,"Mergangsan, Yogyakarta (Prawirotaman)",4
8,FX Stay & Coffee,Rp 640.000,83.0,81 ulasan,"Kraton, Yogyakarta",4
9,Sahid Raya Hotel & Convention Yogyakarta,Rp 431.993,75.0,85 ulasan,"Catur Tunggal, Yogyakarta",4


#### Three Star Rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 19–20, 2025, filtered by a three-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [20]:
# URL Booking.com dengan filter bintang 3
url_hotel_3_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-19&checkout=2025-07-20&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D3"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_3_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_3_star = BeautifulSoup(driver.page_source, "html.parser")

In [21]:
div_3_star_containers = soup_3_star.findAll('div', attrs={'class':"aa97d6032f"})
print(div_3_star_containers)

  div_3_star_containers = soup_3_star.findAll('div', attrs={'class':"aa97d6032f"})


[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/villa-d-maguwo-suites.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-19&amp;checkout=2025-07-20&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D3&amp;srpvid=28165deb21030475&amp;srepoch=1752672102&amp;all_sr_blocks=1271279001_414698840_6_0_0&amp;highlighted_blocks=1271279001_414698840_6_0_0&amp;matching_block_id=1271279001_414698840_6_0_0&amp;sr_pri_blocks=1271279001_414698840_6_0_0__183600000&amp;from=searchresults" rel="noo

- Hotel Names

In [22]:
hotel_names = []
for container in div_3_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_3_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_3_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Villa D'Maguwo Suites</div>
==== part 2====
["Villa D'Maguwo Suites"]
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Argya Guesthouse</div>
==== part 2====
["Villa D'Maguwo Suites", 'Argya Guesthouse']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Kebon Krapyak Cottage by Secoms</div>
==== part 2====
["Villa D'Maguwo Suites", 'Argya Guesthouse', 'Kebon Krapyak Cottage by Secoms']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">The LaWang Yogya Guesthouse</div>
==== part 2====
["Villa D'Maguwo Suites", 'Argya Guesthouse', 'Kebon Krapyak Cottage by Secoms', 'The LaWang Yogya Guesthouse']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Metro Malioboro Living</div>
==== part 2====
["Villa D'Maguwo Suites", 'Argya Guesthouse', 'Kebon Krapyak Cottage by Secoms', 'The LaWang Yogya Guesthouse', 'Metro Malioboro Living']
==== part 

Unnamed: 0,Name
0,Villa D'Maguwo Suites
1,Argya Guesthouse
2,Kebon Krapyak Cottage by Secoms
3,The LaWang Yogya Guesthouse
4,Metro Malioboro Living
5,Capital O 94314 Sunny Co Living
6,MMUGM Hotel
7,Grhatama Guest House
8,Tjokro Style Yogyakarta
9,Serambut Widi Artcation


- Price Hotels

In [23]:
# Extract prices
Price_hotel = []
for container in div_3_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_3_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_3_star

Unnamed: 0,Price
0,Rp 1.836.000
1,Rp 200.000
2,Rp 360.000
3,Rp 247.500
4,Rp 238.950
5,Rp 125.467
6,Rp 396.000
7,Rp 350.000
8,Rp 450.491
9,Rp 680.000


- Guest Rating

In [24]:
guest_ratings_hotel = []

for container in div_3_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_3_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_3_star


Unnamed: 0,guest_rating
0,95.0
1,79.0
2,20.0
3,84.0
4,77.0
5,69.0
6,80.0
7,75.0
8,74.0
9,93.0


- Total Rating

In [25]:
guest_review_counts = []

for container in div_3_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_3_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_3_star


Unnamed: 0,total_guest_reviews
0,4 ulasan
1,103 ulasan
2,1 ulasan
3,153 ulasan
4,657 ulasan
5,6 ulasan
6,60 ulasan
7,49 ulasan
8,73 ulasan
9,3 ulasan


- Location

In [26]:
location_hotel_3_star = []
for container in div_3_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_3_star.append((location))

location_hotel_3_star = pd.DataFrame(location_hotel_3_star, columns=['Location'])
location_hotel_3_star

Unnamed: 0,Location
0,Yogyakarta
1,"Gondokusuman, Yogyakarta"
2,Yogyakarta
3,"Danurejan, Yogyakarta"
4,"Danurejan, Yogyakarta"
5,Yogyakarta
6,"Catur Tunggal, Yogyakarta"
7,"Catur Tunggal, Yogyakarta"
8,"Umbulharjo, Yogyakarta"
9,Yogyakarta


- Merge Columns

In [27]:
df_3_star_hotels = pd.concat([hotel_name_3_star, Price_hotel_3_star, guest_ratings_hotel_3_star, guest_review_counts_3_star, location_hotel_3_star], axis=1)
df_3_star_hotels['star_rating'] = 3

# Print the total number of hotels
print('Total Hotels : ', len(df_3_star_hotels))

# Display the first few rows of the combined DataFrame
df_3_star_hotels

Total Hotels :  68


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,Villa D'Maguwo Suites,Rp 1.836.000,95.0,4 ulasan,Yogyakarta,3
1,Argya Guesthouse,Rp 200.000,79.0,103 ulasan,"Gondokusuman, Yogyakarta",3
2,Kebon Krapyak Cottage by Secoms,Rp 360.000,20.0,1 ulasan,Yogyakarta,3
3,The LaWang Yogya Guesthouse,Rp 247.500,84.0,153 ulasan,"Danurejan, Yogyakarta",3
4,Metro Malioboro Living,Rp 238.950,77.0,657 ulasan,"Danurejan, Yogyakarta",3
5,Capital O 94314 Sunny Co Living,Rp 125.467,69.0,6 ulasan,Yogyakarta,3
6,MMUGM Hotel,Rp 396.000,80.0,60 ulasan,"Catur Tunggal, Yogyakarta",3
7,Grhatama Guest House,Rp 350.000,75.0,49 ulasan,"Catur Tunggal, Yogyakarta",3
8,Tjokro Style Yogyakarta,Rp 450.491,74.0,73 ulasan,"Umbulharjo, Yogyakarta",3
9,Serambut Widi Artcation,Rp 680.000,93.0,3 ulasan,Yogyakarta,3


#### Two Star Rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 19–20, 2025, filtered by a Two-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [28]:
# URL Booking.com dengan filter bintang 2
url_hotel_2_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-19&checkout=2025-07-20&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D2"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_2_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_2_star = BeautifulSoup(driver.page_source, "html.parser")

In [29]:
div_2_star_containers = soup_2_star.findAll('div', attrs={'class':"aa97d6032f"})
print(div_2_star_containers)

  div_2_star_containers = soup_2_star.findAll('div', attrs={'class':"aa97d6032f"})


[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/grand-marto.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-19&amp;checkout=2025-07-20&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D2&amp;srpvid=18543a3c7653b8f3c4621d540d5b5d7e&amp;srepoch=1752672138&amp;all_sr_blocks=117828001_270861481_2_1_0&amp;highlighted_blocks=117828001_270861481_2_1_0&amp;matching_block_id=117828001_270861481_2_1_0&amp;sr_pri_blocks=117828001_270861481_2_1_0__27500000&amp;from=searchresults" rel="no

- Hotel Names

In [30]:
hotel_names = []
for container in div_2_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_2_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_2_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Grand Marto Hotel</div>
==== part 2====
['Grand Marto Hotel']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Capital O 94274 Homestay Balimoelih</div>
==== part 2====
['Grand Marto Hotel', 'Capital O 94274 Homestay Balimoelih']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Ndalem Mantrigawen</div>
==== part 2====
['Grand Marto Hotel', 'Capital O 94274 Homestay Balimoelih', 'Ndalem Mantrigawen']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Hotel O Homestay Alqid Syariah</div>
==== part 2====
['Grand Marto Hotel', 'Capital O 94274 Homestay Balimoelih', 'Ndalem Mantrigawen', 'Hotel O Homestay Alqid Syariah']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">OstiC House</div>
==== part 2====
['Grand Marto Hotel', 'Capital O 94274 Homestay Balimoelih', 'Ndalem Mantrigawen', 'Hotel O Homestay Alqid Syariah', 'OstiC House']
====

Unnamed: 0,Name
0,Grand Marto Hotel
1,Capital O 94274 Homestay Balimoelih
2,Ndalem Mantrigawen
3,Hotel O Homestay Alqid Syariah
4,OstiC House
5,Mawar Asri Hotel
6,SPOT ON 94634 Homestay Griya Sunarti
7,Collection O 94154 Puri Gusti Ayu
8,Amaris Hotel Malioboro - Jogja
9,OYO 94311 Homestay Ayana Syariah


- Hotel Price

In [31]:
# Extract prices
Price_hotel = []
for container in div_2_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_2_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_2_star

Unnamed: 0,Price
0,Rp 275.000
1,Rp 235.664
2,Rp 150.000
3,Rp 135.037
4,Rp 230.736
5,Rp 313.131
6,Rp 138.848
7,Rp 215.489
8,Rp 797.325
9,Rp 55.930


- Guest Rating

In [32]:
guest_ratings_hotel = []

for container in div_2_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_2_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_2_star


Unnamed: 0,guest_rating
0,75.0
1,86.0
2,80.0
3,37.0
4,87.0
5,79.0
6,90.0
7,
8,79.0
9,50.0


- Total rating

In [33]:
guest_review_counts = []

for container in div_2_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_2_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_2_star

Unnamed: 0,total_guest_reviews
0,245 ulasan
1,13 ulasan
2,244 ulasan
3,6 ulasan
4,478 ulasan
5,300 ulasan
6,1 ulasan
7,
8,376 ulasan
9,4 ulasan


- Location

In [34]:
location_hotel_2_star = []
for container in div_2_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_2_star.append((location))

location_hotel_2_star = pd.DataFrame(location_hotel_2_star, columns=['Location'])
location_hotel_2_star

Unnamed: 0,Location
0,"Mergangsan, Yogyakarta (Prawirotaman)"
1,"Mantrijeron, Yogyakarta"
2,"Kraton, Yogyakarta"
3,"Kraton, Yogyakarta"
4,"Mantrijeron, Yogyakarta"
5,"Ngampilan, Yogyakarta"
6,Yogyakarta
7,Yogyakarta
8,"Gedongtengen, Yogyakarta (Malioboro)"
9,Yogyakarta


- Merge Columns

In [35]:
df_2_star_hotels = pd.concat([hotel_name_2_star, Price_hotel_2_star, guest_ratings_hotel_2_star, guest_review_counts_2_star, location_hotel_2_star], axis=1)
df_2_star_hotels['star_rating'] = 2

# Print the total number of hotels
print('Total Hotels : ', len(df_3_star_hotels))

# Display the first few rows of the combined DataFrame
df_2_star_hotels

Total Hotels :  68


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,Grand Marto Hotel,Rp 275.000,75.0,245 ulasan,"Mergangsan, Yogyakarta (Prawirotaman)",2
1,Capital O 94274 Homestay Balimoelih,Rp 235.664,86.0,13 ulasan,"Mantrijeron, Yogyakarta",2
2,Ndalem Mantrigawen,Rp 150.000,80.0,244 ulasan,"Kraton, Yogyakarta",2
3,Hotel O Homestay Alqid Syariah,Rp 135.037,37.0,6 ulasan,"Kraton, Yogyakarta",2
4,OstiC House,Rp 230.736,87.0,478 ulasan,"Mantrijeron, Yogyakarta",2
5,Mawar Asri Hotel,Rp 313.131,79.0,300 ulasan,"Ngampilan, Yogyakarta",2
6,SPOT ON 94634 Homestay Griya Sunarti,Rp 138.848,90.0,1 ulasan,Yogyakarta,2
7,Collection O 94154 Puri Gusti Ayu,Rp 215.489,,,Yogyakarta,2
8,Amaris Hotel Malioboro - Jogja,Rp 797.325,79.0,376 ulasan,"Gedongtengen, Yogyakarta (Malioboro)",2
9,OYO 94311 Homestay Ayana Syariah,Rp 55.930,50.0,4 ulasan,Yogyakarta,2


#### One Star Rating

> * We are scraping data from hotels available in Yogyakarta, Indonesia, for the dates around July 19–20, 2025, filtered by a five-star hotel rating.

> * We are scraping the first page using a scrolling system.

In [36]:
# URL Booking.com dengan filter bintang 3
url_hotel_1_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-19&checkout=2025-07-20&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D1"

# Konfigurasi Chrome
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_1_star)

# Waktu tunggu awal
time.sleep(5)

# Simulasikan scroll hingga akhir halaman
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll ke bawah
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Cek apakah sudah tidak ada perubahan scroll
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Setelah selesai scroll, ambil HTML
soup_1_star = BeautifulSoup(driver.page_source, "html.parser")

In [37]:
div_1_star_containers = soup_1_star.findAll('div', attrs={'class':"aa97d6032f"})
print(div_1_star_containers)

  div_1_star_containers = soup_1_star.findAll('div', attrs={'class':"aa97d6032f"})


[<div class="aa97d6032f" data-testid="property-card-container"><div class="e05069daf3"><div class="c17271c4d7"><a aria-hidden="true" data-testid="property-card-desktop-single-image" href="https://www.booking.com/hotel/id/ec-pondokan.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&amp;sid=98a2dc5e83d843937bd12ce44cdf1994&amp;aid=304142&amp;ucfs=1&amp;arphpl=1&amp;checkin=2025-07-19&amp;checkout=2025-07-20&amp;dest_id=-2703546&amp;dest_type=city&amp;group_adults=2&amp;req_adults=2&amp;no_rooms=1&amp;group_children=0&amp;req_children=0&amp;hpos=1&amp;hapos=1&amp;sr_order=popularity&amp;nflt=class%3D1&amp;srpvid=90125e0e755c055c&amp;srepoch=1752672172&amp;all_sr_blocks=425672907_246385111_2_0_0&amp;highlighted_blocks=425672907_246385111_2_0_0&amp;matching_block_id=425672907_246385111_2_0_0&amp;sr_pri_blocks=425672907_246385111_2_0_0__26500000&amp;from=searchresults" rel="noopener noreferre

- Hotel Names

In [38]:
hotel_names = []
for container in div_1_star_containers:
    name = container.find('div', attrs={"class":'b87c397a13 a3e0b4ffd1'})
    print('==== part 1====')
    print(name)
    if name:
      names = name.text.strip()
    else:
      names = None
    hotel_names.append((names))
    print('==== part 2====')
    print(hotel_names)

hotel_name_1_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_1_star

==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">EC Pondokan</div>
==== part 2====
['EC Pondokan']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">The Cabin Hotel Gandekan</div>
==== part 2====
['EC Pondokan', 'The Cabin Hotel Gandekan']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Andelis Homestay</div>
==== part 2====
['EC Pondokan', 'The Cabin Hotel Gandekan', 'Andelis Homestay']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Roemah Canting Homestay</div>
==== part 2====
['EC Pondokan', 'The Cabin Hotel Gandekan', 'Andelis Homestay', 'Roemah Canting Homestay']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Prayogo Lama Family GH Prawirotaman</div>
==== part 2====
['EC Pondokan', 'The Cabin Hotel Gandekan', 'Andelis Homestay', 'Roemah Canting Homestay', 'Prayogo Lama Family GH Prawirotaman']
==== part 1====
<div class="b87c397a13 a3e0b4ffd1" data-testid="title">Bring

Unnamed: 0,Name
0,EC Pondokan
1,The Cabin Hotel Gandekan
2,Andelis Homestay
3,Roemah Canting Homestay
4,Prayogo Lama Family GH Prawirotaman
5,Bring In House Yogyakarta
6,Riverside Homestay Jogja
7,Hotel Batik Yogyakarta
8,Panca Dewi Guest House
9,Pamularsih Homestay


- Hotel Price

In [39]:
# Extract prices
Price_hotel = []
for container in div_1_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_1_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_1_star

Unnamed: 0,Price
0,Rp 265.000
1,Rp 204.371
2,Rp 252.172
3,Rp 371.348
4,Rp 169.415
5,Rp 238.027
6,Rp 135.000
7,Rp 478.330
8,Rp 300.000
9,Rp 315.000


- Guest Rating

In [40]:
guest_ratings_hotel = []

for container in div_1_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_1_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_1_star


Unnamed: 0,guest_rating
0,76.0
1,84.0
2,82.0
3,81.0
4,81.0
5,89.0
6,68.0
7,
8,77.0
9,90.0


- Total Rating

In [41]:
guest_review_counts = []

for container in div_1_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_1_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_1_star

Unnamed: 0,total_guest_reviews
0,128 ulasan
1,15 ulasan
2,25 ulasan
3,105 ulasan
4,85 ulasan
5,284 ulasan
6,34 ulasan
7,
8,60 ulasan
9,194 ulasan


- Location

In [42]:
location_hotel_1_star = []
for container in div_1_star_containers:
    location = container.find('span', attrs={"class":'d823fbbeed f9b3563dd4'}).text
    location_hotel_1_star.append((location))

location_hotel_1_star = pd.DataFrame(location_hotel_1_star, columns=['Location'])
location_hotel_1_star

Unnamed: 0,Location
0,"Gondomanan, Yogyakarta (Malioboro)"
1,"Gedongtengen, Yogyakarta (Malioboro)"
2,Yogyakarta
3,"Gondokusuman, Yogyakarta"
4,"Mergangsan, Yogyakarta (Prawirotaman)"
5,"Mantrijeron, Yogyakarta"
6,Yogyakarta
7,"Gedongtengen, Yogyakarta (Malioboro)"
8,"Danurejan, Yogyakarta"
9,"Wirobrajan, Yogyakarta"


- Merge Columns

In [43]:
df_1_star_hotels = pd.concat([hotel_name_1_star, Price_hotel_1_star, guest_ratings_hotel_1_star, guest_review_counts_1_star, location_hotel_1_star], axis=1)
df_1_star_hotels['star_rating'] = 1

# Print the total number of hotels
print('Total Hotels : ', len(df_1_star_hotels))

# Display the first few rows of the combined DataFrame
df_1_star_hotels

Total Hotels :  33


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
0,EC Pondokan,Rp 265.000,76.0,128 ulasan,"Gondomanan, Yogyakarta (Malioboro)",1
1,The Cabin Hotel Gandekan,Rp 204.371,84.0,15 ulasan,"Gedongtengen, Yogyakarta (Malioboro)",1
2,Andelis Homestay,Rp 252.172,82.0,25 ulasan,Yogyakarta,1
3,Roemah Canting Homestay,Rp 371.348,81.0,105 ulasan,"Gondokusuman, Yogyakarta",1
4,Prayogo Lama Family GH Prawirotaman,Rp 169.415,81.0,85 ulasan,"Mergangsan, Yogyakarta (Prawirotaman)",1
5,Bring In House Yogyakarta,Rp 238.027,89.0,284 ulasan,"Mantrijeron, Yogyakarta",1
6,Riverside Homestay Jogja,Rp 135.000,68.0,34 ulasan,Yogyakarta,1
7,Hotel Batik Yogyakarta,Rp 478.330,,,"Gedongtengen, Yogyakarta (Malioboro)",1
8,Panca Dewi Guest House,Rp 300.000,77.0,60 ulasan,"Danurejan, Yogyakarta",1
9,Pamularsih Homestay,Rp 315.000,90.0,194 ulasan,"Wirobrajan, Yogyakarta",1


#### Merge All Tables

In [44]:
df_hotel = pd.concat([
    df_1_star_hotels,
    df_2_star_hotels,
    df_3_star_hotels,
    df_4_star_hotels,
    df_5_star_hotels
], ignore_index=True)

In [45]:
print('Total Hotels : ', len(df_hotel))
df_hotel.sample(10)

Total Hotels :  253


Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating
99,Garuda Guesthouse Yogyakarta RedPartner,Rp 191.852,32.0,23 ulasan,Yogyakarta,2
30,City Stay Homestay Jogja,Rp 2.520.000,65.0,4 ulasan,"Tegalrejo, Yogyakarta",1
76,Super OYO 90778 River inn Malioboro,Rp 241.278,48.0,24 ulasan,"Gedongtengen, Yogyakarta",2
113,MMUGM Hotel,Rp 396.000,80.0,60 ulasan,"Catur Tunggal, Yogyakarta",3
190,The Atrium Hotel & Resort Yogyakarta,Rp 480.000,85.0,34 ulasan,"Mlati, Yogyakarta",4
158,University Club (UC) Hotel UGM,Rp 421.587,73.0,29 ulasan,"Catur Tunggal, Yogyakarta",3
240,Grand Mercure Yogyakarta Adi Sucipto,Rp 950.000,75.0,151 ulasan,"Catur Tunggal, Yogyakarta",5
134,favehotel Malioboro - Yogyakarta,Rp 522.391,82.0,160 ulasan,"Gondokusuman, Yogyakarta",3
251,Hotel Tentrem Yogyakarta,Rp 7.591.980,,,"Jetis, Yogyakarta",5
89,Super OYO 90210 Kenari House Syariah,Rp 135.109,30.0,1 ulasan,Yogyakarta,2


Description:
In this script, we use the pd.concat() function from the Pandas library to vertically concatenate multiple hotel DataFrames (df_1_star_hotels, df_2_star_hotels, ..., df_5_star_hotels) into a single unified DataFrame named df_hotel. The parameter ignore_index=True ensures the resulting DataFrame has a clean, continuous index.

After merging the data, we create a new column named Rp_Price by removing the "Rp" prefix and thousand separators (.) from the original Price column. The cleaned values are then converted into numeric format using the int64 data type, which represents the hotel prices in Indonesian Rupiah (Rp) as whole numbers.

Finally, we print the total number of hotel entries using the len() function.

Impact on Data Integrity:
* Vertical concatenation (union) ensures that all hotel records from different star ratings are combined row-wise, with no column misalignment.

* Converting the Price column to int64 ensures accurate handling of currency as whole numbers, which is typically how prices are represented in real-world hotel listings.

* Removing formatting characters from the Price column helps avoid errors during numeric operations like sorting, filtering, or aggregation.

* Knowing the total number of hotel entries provides insight into the dataset’s size and helps assess completeness and readiness for further analysis.

#### Cleaning

In [46]:
df_hotel['Total_ratings'] = df_hotel['total_guest_reviews'].str.extract('(\d+)')
df_hotel['Total_ratings'] = df_hotel['Total_ratings'].astype('Int64')
df_hotel.sample(10)

Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating,Total_ratings
102,Ndalem Suryo Saptono Guest House,Rp 600.000,88.0,278 ulasan,Yogyakarta,2,278.0
40,Collection O 94154 Puri Gusti Ayu,Rp 215.489,,,Yogyakarta,2,
157,Urbanview TWH Costel near Stadion Maguwoharjo Yogyakarta,Rp 281.275,,,Yogyakarta,3,
224,Surokarsan 9 House Yogyakarta,Rp 418.905,89.0,76 ulasan,"Pakualaman, Yogyakarta",4,76.0
252,Wyndham Garden Yogyakarta Hotel Conference & Action Park,Rp 1.140.000,81.0,10 ulasan,Sleman,5,10.0
234,The Lavana Greenkhaza Villa Jogja,Rp 1.551.076,93.0,4 ulasan,Yogyakarta,4,4.0
120,PORTA by Ambarrukmo,Rp 771.473,88.0,169 ulasan,"Catur Tunggal, Yogyakarta",3,169.0
70,Ndalem Diajeng,Rp 170.000,91.0,269 ulasan,"Kraton, Yogyakarta",2,269.0
229,@K Hotel,Rp 614.415,63.0,36 ulasan,Yogyakarta,4,36.0
227,Most Bali Malioboro Villa,Rp 1.368.000,85.0,39 ulasan,"Gedongtengen, Yogyakarta",4,39.0


> Description:
<br> We extract numeric values from the 'total_guest_reviews' column using str.extract() with the regex (\d+), then convert the result to the nullable integer type Int64 in a new column called 'Total_ratings'. This allows numerical operations while handling missing values (NaN).

> Impact on Data Integrity:
<br> If the original text contains no digits, NaN will be returned. Using Int64 prevents errors from missing values and ensures the column is ready for analysis.

In [47]:
df_hotel['Rp_Price'] = df_hotel['Price'].str.replace('Rp', '', regex=False).str.replace('.', '', regex=False).str.strip()
df_hotel['Rp_Price'] = df_hotel['Rp_Price'].astype('int64')
df_hotel.sample(10)

Unnamed: 0,Name,Price,guest_rating,total_guest_reviews,Location,star_rating,Total_ratings,Rp_Price
199,Swiss-Belhotel Airport Yogyakarta,Rp 613.800,88.0,152 ulasan,Yogyakarta,4,152.0,613800
203,eL Hotel Yogyakarta Malioboro,Rp 1.047.220,80.0,195 ulasan,"Gedongtengen, Yogyakarta (Malioboro)",4,195.0,1047220
240,Grand Mercure Yogyakarta Adi Sucipto,Rp 950.000,75.0,151 ulasan,"Catur Tunggal, Yogyakarta",5,151.0,950000
204,Hotel FortunaGrande Malioboro Yogyakarta,Rp 1.700.000,80.0,734 ulasan,"Gedongtengen, Yogyakarta (Malioboro)",4,734.0,1700000
51,SPOT ON Podomoro Homestay,Rp 72.026,45.0,2 ulasan,Yogyakarta,2,2.0,72026
198,Tara Hotel Yogyakarta,Rp 512.148,71.0,42 ulasan,"Sinduadi, Yogyakarta",4,42.0,512148
119,The Victoria Hotel Yogyakarta,Rp 450.365,86.0,23 ulasan,"Catur Tunggal, Yogyakarta",3,23.0,450365
73,The Eco Village Homestay RedPartner,Rp 156.371,80.0,1 ulasan,Yogyakarta,2,1.0,156371
71,Griya Aneka Malioboro Yogyakarta Mitra RedDoorz,Rp 179.871,,,"Ngampilan, Yogyakarta",2,,179871
147,Hotel Neo Malioboro by ASTON,Rp 1.608.144,82.0,570 ulasan,"Gedongtengen, Yogyakarta (Malioboro)",3,570.0,1608144


In [48]:
#convert df_hotel to csv
df_hotel.to_csv('hotel.csv', index=False)
# After this we can export to SQL table