## 1. Data Collected and Its Use
> I chose Booking.com as the platform for data collection. The data I want to collect includes:

- The name of the accommodation: Unique name of the property (hotel, guesthouse, etc.) for identification and branding.
- Price per night: Cost of staying per night, crucial for price comparison and booking decisions.

- Guest Rating: Average score from guest reviews (1-10 scale), indicating customer satisfaction and influencing reputation.
- Guest Total Ratings: Total number of reviews given by guests, providing context for the guest rating's validity and the property's popularity.

- Location: Geographical address of the property, important for location analysis, mapping, and guest choice based on their destination.
- Star Rating: Standardized classification system (1-5 stars) given by authorities, indicating the general level of facilities and service quality.

> Reasons for Data Importance:

- This data can be used to build recommendation systems, analyze price trends, or determine marketing strategies based on location and ratings.
- Storing data in PostgreSQL allows us to take advantage of features such as indexing for quick searches, JSON support for semi-structured data, and powerful analytics capabilities.

## 2. Data Scraping Using BeautifulSoup

In [None]:
# Import the required libraries

import os
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import ElementClickInterceptedException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Display settings: show all columns, up to 999 rows, and wide cell contents
pd.set_option('display.max_columns', None)
pd.options.display.max_rows = 999
pd.options.display.max_colwidth = 999

#### Five Star Rating

> * We scrape hotels available in Yogyakarta, Indonesia, for a one-night stay (check-in July 26, check-out July 27, 2025), filtered to five-star hotels.

> * We load the first results page and scroll to the bottom until no more listings are loaded.
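
The scroll-until-stable approach used for each results page can be sketched as a small reusable helper. This is only a sketch: the function name `scroll_until_stable` and the `max_rounds` cap are assumptions added for illustration (the notebook's own loop has no cap), and `driver` is expected to be any object exposing Selenium's `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=3.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with Selenium's `execute_script` method.
    `max_rounds` is an added safeguard (not part of the original loop) so a
    page that keeps growing cannot hang the notebook.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded listings time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # height unchanged: nothing new was loaded
        last_height = new_height
    return max_rounds
```

With a real browser this would be called as `scroll_until_stable(driver)` right after `driver.get(...)`, followed by parsing `driver.page_source`.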

In [None]:
# Booking.com search URL filtered to 5-star hotels
url_hotel_5_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D5"

# Chrome configuration
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_5_star)

# Initial wait for the page to load
time.sleep(5)

# Scroll to the bottom of the page until no new content loads
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Stop once the page height no longer changes
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# After scrolling, parse the rendered HTML and close the browser
soup_5_star = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

In [None]:
for tag in soup_5_star.find_all(True):
    print(tag.name)

In [None]:
div_5_star_containers = soup_5_star.find_all('div', attrs={'class': 'aa97d6032f'})
print(div_5_star_containers)

- Hotel names

In [None]:
hotel_names = []
for container in div_5_star_containers:
    # Hotel title element; None when the card has no matching node
    name = container.find('div', attrs={"class": 'b87c397a13 a3e0b4ffd1'})
    hotel_names.append(name.text.strip() if name else None)

hotel_name_5_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_5_star

- Price Hotel

In [None]:
# Extract prices
Price_hotel = []
for container in div_5_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_5_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_5_star

- Guest Rating

In [None]:
guest_ratings_hotel = []

for container in div_5_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_5_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_5_star


- Total Ratings

In [None]:
guest_review_counts = []

for container in div_5_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

df_guest_review_counts_5_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
df_guest_review_counts_5_star


- Location hotel

In [None]:
location_hotel_5_star = []
for container in div_5_star_containers:
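    location = container.find('span', attrs={"class": 'd823fbbeed f9b3563dd4'})
    # Guard against cards without a location node so the rows stay aligned
    location_hotel_5_star.append(location.text.strip() if location else None)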

location_hotel_5_star = pd.DataFrame(location_hotel_5_star, columns=['Location'])
location_hotel_5_star

- Merge columns

In [None]:
df_5_star_hotels = pd.concat([hotel_name_5_star, Price_hotel_5_star, guest_ratings_hotel_5_star, df_guest_review_counts_5_star, location_hotel_5_star], axis=1)
df_5_star_hotels['star_rating'] = 5

# Print the total number of hotels
print('Total Hotels : ', len(df_5_star_hotels))

# Display the first few rows of the combined DataFrame
df_5_star_hotels

#### Four Star Rating

> * We scrape hotels available in Yogyakarta, Indonesia, for a one-night stay (check-in July 26, check-out July 27, 2025), filtered to four-star hotels.

> * We load the first results page and scroll to the bottom until no more listings are loaded.
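
The search URLs in this notebook differ only in the star filter (`nflt=class%3D5`, `class%3D4`, and so on). A minimal sketch of building them from one template is shown below. The helper name `build_search_url` is hypothetical, and it is an assumption that Booking.com returns the same results when the tracking parameters (`label`, `sid`, `aid`) are left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.booking.com/searchresults.id.html"


def build_search_url(star_class, checkin="2025-07-26", checkout="2025-07-27"):
    """Return a Yogyakarta search URL filtered to one star class (1-5)."""
    if star_class not in range(1, 6):
        raise ValueError("star_class must be between 1 and 5")
    params = {
        "ss": "Yogyakarta",
        "dest_id": "-2703546",
        "dest_type": "city",
        "checkin": checkin,
        "checkout": checkout,
        "group_adults": 2,
        "no_rooms": 1,
        "group_children": 0,
        "lang": "id",
        "nflt": f"class={star_class}",  # urlencode turns "=" into "%3D"
    }
    return f"{BASE_URL}?{urlencode(params)}"


# One URL per star rating, keyed by the rating itself
urls = {star: build_search_url(star) for star in range(5, 0, -1)}
```

Generating the URLs this way avoids copy-paste slips such as a variable name or comment that refers to the wrong star rating.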

In [None]:
# Booking.com search URL filtered to 4-star hotels
url_hotel_4_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D4"

# Chrome configuration
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_4_star)

# Initial wait for the page to load
time.sleep(5)

# Scroll to the bottom of the page until no new content loads
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Stop once the page height no longer changes
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# After scrolling, parse the rendered HTML and close the browser
soup_4_star = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

In [None]:
div_4_star_containers = soup_4_star.find_all('div', attrs={'class':"aa97d6032f"})
print(div_4_star_containers)

- Hotel Names

In [None]:
hotel_names = []
for container in div_4_star_containers:
    # Hotel title element; None when the card has no matching node
    name = container.find('div', attrs={"class": 'b87c397a13 a3e0b4ffd1'})
    hotel_names.append(name.text.strip() if name else None)

hotel_name_4_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_4_star

- Price hotel

In [None]:
# Extract prices
Price_hotel = []
for container in div_4_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_4_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_4_star

- Guest Ratings

In [None]:
guest_ratings_hotel = []

for container in div_4_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_4_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_4_star


- Total Ratings

In [None]:
guest_review_counts = []

for container in div_4_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_4_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_4_star


- Location

In [None]:
location_hotel_4_star = []
for container in div_4_star_containers:
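    location = container.find('span', attrs={"class": 'd823fbbeed f9b3563dd4'})
    # Guard against cards without a location node so the rows stay aligned
    location_hotel_4_star.append(location.text.strip() if location else None)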

location_hotel_4_star = pd.DataFrame(location_hotel_4_star, columns=['Location'])
location_hotel_4_star

- Merge Columns

In [None]:
df_4_star_hotels = pd.concat([hotel_name_4_star, Price_hotel_4_star, guest_ratings_hotel_4_star, guest_review_counts_4_star, location_hotel_4_star], axis=1)
df_4_star_hotels['star_rating'] = 4

# Print the total number of hotels
print('Total Hotels : ', len(df_4_star_hotels))

# Display the first few rows of the combined DataFrame
df_4_star_hotels

#### Three Star Rating

> * We scrape hotels available in Yogyakarta, Indonesia, for a one-night stay (check-in July 26, check-out July 27, 2025), filtered to three-star hotels.

> * We load the first results page and scroll to the bottom until no more listings are loaded.

In [None]:
# Booking.com search URL filtered to 3-star hotels
url_hotel_3_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D3"

# Chrome configuration
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_3_star)

# Initial wait for the page to load
time.sleep(5)

# Scroll to the bottom of the page until no new content loads
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Stop once the page height no longer changes
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# After scrolling, parse the rendered HTML and close the browser
soup_3_star = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

In [None]:
div_3_star_containers = soup_3_star.find_all('div', attrs={'class':"aa97d6032f"})
print(div_3_star_containers)

- Hotel Names

In [None]:
hotel_names = []
for container in div_3_star_containers:
    # Hotel title element; None when the card has no matching node
    name = container.find('div', attrs={"class": 'b87c397a13 a3e0b4ffd1'})
    hotel_names.append(name.text.strip() if name else None)

hotel_name_3_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_3_star

- Price Hotels

In [None]:
# Extract prices
Price_hotel = []
for container in div_3_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_3_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_3_star

- Guest Rating

In [None]:
guest_ratings_hotel = []

for container in div_3_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_3_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_3_star


- Total Rating

In [None]:
guest_review_counts = []

for container in div_3_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_3_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_3_star


- Location

In [None]:
location_hotel_3_star = []
for container in div_3_star_containers:
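    location = container.find('span', attrs={"class": 'd823fbbeed f9b3563dd4'})
    # Guard against cards without a location node so the rows stay aligned
    location_hotel_3_star.append(location.text.strip() if location else None)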

location_hotel_3_star = pd.DataFrame(location_hotel_3_star, columns=['Location'])
location_hotel_3_star

- Merge Columns

In [None]:
df_3_star_hotels = pd.concat([hotel_name_3_star, Price_hotel_3_star, guest_ratings_hotel_3_star, guest_review_counts_3_star, location_hotel_3_star], axis=1)
df_3_star_hotels['star_rating'] = 3

# Print the total number of hotels
print('Total Hotels : ', len(df_3_star_hotels))

# Display the first few rows of the combined DataFrame
df_3_star_hotels

#### Two Star Rating

> * We scrape hotels available in Yogyakarta, Indonesia, for a one-night stay (check-in July 26, check-out July 27, 2025), filtered to two-star hotels.

> * We load the first results page and scroll to the bottom until no more listings are loaded.

In [None]:
# Booking.com search URL filtered to 2-star hotels
url_hotel_2_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D2"

# Chrome configuration
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_2_star)

# Initial wait for the page to load
time.sleep(5)

# Scroll to the bottom of the page until no new content loads
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Stop once the page height no longer changes
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# After scrolling, parse the rendered HTML and close the browser
soup_2_star = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

In [None]:
div_2_star_containers = soup_2_star.find_all('div', attrs={'class':"aa97d6032f"})
print(div_2_star_containers)

- Hotel Names

In [None]:
hotel_names = []
for container in div_2_star_containers:
    # Hotel title element; None when the card has no matching node
    name = container.find('div', attrs={"class": 'b87c397a13 a3e0b4ffd1'})
    hotel_names.append(name.text.strip() if name else None)

hotel_name_2_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_2_star

- Hotel Price

In [None]:
# Extract prices
Price_hotel = []
for container in div_2_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_2_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_2_star

- Guest Rating

In [None]:
guest_ratings_hotel = []

for container in div_2_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_2_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_2_star


- Total rating

In [None]:
guest_review_counts = []

for container in div_2_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_2_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_2_star

- Location

In [None]:
location_hotel_2_star = []
for container in div_2_star_containers:
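    location = container.find('span', attrs={"class": 'd823fbbeed f9b3563dd4'})
    # Guard against cards without a location node so the rows stay aligned
    location_hotel_2_star.append(location.text.strip() if location else None)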

location_hotel_2_star = pd.DataFrame(location_hotel_2_star, columns=['Location'])
location_hotel_2_star

- Merge Columns

In [None]:
df_2_star_hotels = pd.concat([hotel_name_2_star, Price_hotel_2_star, guest_ratings_hotel_2_star, guest_review_counts_2_star, location_hotel_2_star], axis=1)
df_2_star_hotels['star_rating'] = 2

# Print the total number of hotels
print('Total Hotels : ', len(df_2_star_hotels))

# Display the first few rows of the combined DataFrame
df_2_star_hotels

#### One Star Rating

> * We scrape hotels available in Yogyakarta, Indonesia, for a one-night stay (check-in July 26, check-out July 27, 2025), filtered to one-star hotels.

> * We load the first results page and scroll to the bottom until no more listings are loaded.

In [None]:
# Booking.com search URL filtered to 1-star hotels
url_hotel_1_star = "https://www.booking.com/searchresults.id.html?label=gen173nr-1BCAEoggI46AdIM1gEaGiIAQGYARK4ARfIAQzYAQHoAQGIAgGoAgO4AsneycMGwAIB0gIkM2NiM2FjZTgtN2QyNC00ODcwLWE4NDktOGViYTUzN2NjNWFk2AIF4AIB&sid=98a2dc5e83d843937bd12ce44cdf1994&aid=304142&ss=Yogyakarta&ssne=Yogyakarta&ssne_untouched=Yogyakarta&efdco=1&lang=id&src=searchresults&dest_id=-2703546&dest_type=city&checkin=2025-07-26&checkout=2025-07-27&group_adults=2&no_rooms=1&group_children=0&nflt=class%3D1"

# Chrome configuration
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

driver = webdriver.Chrome(options=option)
driver.get(url_hotel_1_star)

# Initial wait for the page to load
time.sleep(5)

# Scroll to the bottom of the page until no new content loads
last_height = driver.execute_script("return document.body.scrollHeight")
scroll_pause_time = 3

while True:
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)

    # Stop once the page height no longer changes
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# After scrolling, parse the rendered HTML and close the browser
soup_1_star = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

In [None]:
div_1_star_containers = soup_1_star.find_all('div', attrs={'class':"aa97d6032f"})
print(div_1_star_containers)

- Hotel Names

In [None]:
hotel_names = []
for container in div_1_star_containers:
    # Hotel title element; None when the card has no matching node
    name = container.find('div', attrs={"class": 'b87c397a13 a3e0b4ffd1'})
    hotel_names.append(name.text.strip() if name else None)

hotel_name_1_star = pd.DataFrame(hotel_names, columns=['Name'])
hotel_name_1_star

- Hotel Price

In [None]:
# Extract prices
Price_hotel = []
for container in div_1_star_containers:
    Price = container.find('span', attrs={"class": 'b87c397a13 f2f358d1de ab607752a2'})
    # listingPrice__finalPrice listingPrice__finalPrice--black
    if Price:
       Prices = Price.text.strip()
    else:
       Prices = None
    Price_hotel.append(Prices)
    # print(Price_hotel)

# Convert price list to DataFrame
Price_hotel_1_star = pd.DataFrame(Price_hotel, columns=['Price'])
Price_hotel_1_star

- Guest Rating

In [None]:
guest_ratings_hotel = []

for container in div_1_star_containers:
    review_div = container.find('div', attrs={'data-testid': 'review-score'})
    if review_div:
        rating_span = review_div.find('div', class_='f63b14ab7a dff2e52086')
        if rating_span:
            guest_ratings_hotel.append(rating_span.text.strip())
        else:
            guest_ratings_hotel.append(None)
    else:
        guest_ratings_hotel.append(None)

guest_ratings_hotel_1_star = pd.DataFrame(guest_ratings_hotel, columns=['guest_rating'])
guest_ratings_hotel_1_star


- Total Rating

In [None]:
guest_review_counts = []

for container in div_1_star_containers:
    review_score_section = container.find('div', attrs={'data-testid': 'review-score'})
    if review_score_section:
        review_count_div = review_score_section.find('div', class_='fff1944c52 fb14de7f14 eaa8455879')
        if review_count_div:
            text = review_count_div.text
            guest_review_counts.append(text)
        else:
            guest_review_counts.append(None)
    else:
        guest_review_counts.append(None)

guest_review_counts_1_star = pd.DataFrame(guest_review_counts, columns=['total_guest_reviews'])
guest_review_counts_1_star

- Location

In [None]:
location_hotel_1_star = []
for container in div_1_star_containers:
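    location = container.find('span', attrs={"class": 'd823fbbeed f9b3563dd4'})
    # Guard against cards without a location node so the rows stay aligned
    location_hotel_1_star.append(location.text.strip() if location else None)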

location_hotel_1_star = pd.DataFrame(location_hotel_1_star, columns=['Location'])
location_hotel_1_star

- Merge Columns

In [None]:
df_1_star_hotels = pd.concat([hotel_name_1_star, Price_hotel_1_star, guest_ratings_hotel_1_star, guest_review_counts_1_star, location_hotel_1_star], axis=1)
df_1_star_hotels['star_rating'] = 1

# Print the total number of hotels
print('Total Hotels : ', len(df_1_star_hotels))

# Display the first few rows of the combined DataFrame
df_1_star_hotels

#### Merge All Tables

In [None]:
df_hotel = pd.concat([
    df_1_star_hotels,
    df_2_star_hotels,
    df_3_star_hotels,
    df_4_star_hotels,
    df_5_star_hotels
], ignore_index=True)

In [None]:
print('Total Hotels : ', len(df_hotel))
df_hotel.sample(10)

Description:
In this step, we use the pd.concat() function from the Pandas library to vertically concatenate the five per-rating DataFrames (df_1_star_hotels, df_2_star_hotels, ..., df_5_star_hotels) into a single DataFrame named df_hotel. The parameter ignore_index=True replaces the per-rating indexes with one continuous index.

The price is cleaned afterwards, in the Cleaning section: a new column named Rp_Price is created by removing the "Rp" prefix and thousands separators (.) from the original Price column. The cleaned values are then converted to an integer type, so hotel prices in Indonesian Rupiah (Rp) are stored as whole numbers.

Finally, we print the total number of hotel entries using the len() function.

Impact on Data Integrity:
* Vertical concatenation (union) combines the hotel records from all star ratings row-wise. Because the five DataFrames share the same column names, pd.concat() aligns them without creating extra or misaligned columns.

* Storing the price as an integer ensures accurate handling of currency as whole numbers, which is how Rupiah prices appear in the listings.

* Removing formatting characters from the Price column avoids errors during numeric operations such as sorting, filtering, or aggregation.

* Knowing the total number of hotel entries shows the dataset's size and helps assess completeness and readiness for further analysis.
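
The price cleaning described above can be illustrated on a single label. The sketch below is a plain-Python stand-in, not the notebook's pandas code: the helper name `parse_rupiah` is hypothetical, and it assumes price labels look like `Rp 1.250.000` with no decimal part.

```python
import re


def parse_rupiah(text):
    """Convert a price label such as 'Rp 1.250.000' to the integer 1250000.

    Returns None when the label is missing or contains no digits, so a
    listing without a price does not crash the conversion.
    """
    if text is None:
        return None
    digits = re.sub(r"\D", "", text)  # drop 'Rp', spaces, and '.' separators
    return int(digits) if digits else None


print(parse_rupiah("Rp 1.250.000"))  # 1250000
print(parse_rupiah(None))            # None
```

Returning `None` for missing prices matters because a plain `int64` column cannot hold missing values; a nullable integer type is needed if any listing lacks a price.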

#### Cleaning

In [None]:
# Remove thousands separators first so a count like "1.234" is not truncated to 1
counts = df_hotel['total_guest_reviews'].str.replace(r'[.,]', '', regex=True)
df_hotel['Total_ratings'] = counts.str.extract(r'(\d+)', expand=False)
df_hotel['Total_ratings'] = pd.to_numeric(df_hotel['Total_ratings'], errors='coerce').astype('Int64')
df_hotel.sample(10)

> Description:
<br> We strip thousands separators from the 'total_guest_reviews' column, extract the first run of digits using str.extract() with the regex (\d+), and convert the result to the nullable integer type Int64 in a new column called 'Total_ratings'. This allows numerical operations while handling missing values.

> Impact on Data Integrity:
<br> If the original text contains no digits, a missing value is returned. Using Int64 keeps the column integer-typed despite missing values and ensures it is ready for analysis. Without the separator removal, a count written as "1.234" would be read as 1.
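
To show why thousands separators matter when extracting the review count, here is a small plain-Python sketch. The helper name `parse_review_count` and the sample label `"1.234 ulasan"` are assumptions for illustration; the actual label text depends on the page language.

```python
import re


def parse_review_count(text):
    """Extract the review count from a label, tolerating '.' or ',' separators."""
    if text is None:
        return None
    match = re.search(r"\d[\d.,]*", text)  # digits plus any separators
    if match is None:
        return None
    return int(re.sub(r"[.,]", "", match.group()))


label = "1.234 ulasan"
naive = re.search(r"\d+", label).group()  # stops at the first separator
print(naive)                      # 1
print(parse_review_count(label))  # 1234
```

A pattern such as `(\d+)` on its own stops at the first separator, so large counts would be silently under-reported unless the separators are removed first.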

In [None]:
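df_hotel['Rp_Price'] = df_hotel['Price'].str.replace('Rp', '', regex=False).str.replace('.', '', regex=False).str.strip()
# Use nullable Int64 so listings without a scraped price (None) do not raise on conversion
df_hotel['Rp_Price'] = pd.to_numeric(df_hotel['Rp_Price'], errors='coerce').astype('Int64')
df_hotel.sample(10)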

In [None]:
# Replace the decimal comma with a point and convert to float
df_hotel['guest_rating'] = df_hotel['guest_rating'].str.replace(',', '.', regex=False).astype(float)
# Halve the 10-point score to put it on a 5-point scale
df_hotel['guest_rating_normalized'] = df_hotel['guest_rating'] / 2

In [None]:
# Drop the raw columns now that cleaned versions exist
df_hotel = df_hotel.drop(['Price', 'guest_rating', 'total_guest_reviews'], axis=1)

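> Description:<br>
The `guest_rating` column was rescaled by dividing all values by 2, converting the 10-point scale to a 5-point scale (scores of 1–10 become 0.5–5). The result was stored in a new column `guest_rating_normalized`, after which the raw `Price`, `guest_rating`, and `total_guest_reviews` columns were dropped.

> Impact on Data Integrity:<br>
Rescaling `guest_rating` to a 5-point scale ensures consistency with common rating systems and improves comparability for analysis or modeling. No information is lost, since the original score can be recovered by multiplying by 2, but the raw column itself no longer exists in `df_hotel`. Halving a float adds no rounding error of its own; values may still need rounding for display.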
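
As a worked example of the rescaling, the sketch below converts a few comma-decimal scores the same way the notebook does column-wise. The helper name `normalize_rating` is hypothetical and the sample scores are made up for illustration.

```python
def normalize_rating(text):
    """Convert a comma-decimal score on the 10-point scale to a 5-point scale."""
    if text is None:
        return None
    score = float(text.replace(",", "."))  # "8,5" -> 8.5
    return score / 2


print(normalize_rating("8,5"))  # 4.25
print(normalize_rating("10"))   # 5.0
print(normalize_rating("1"))    # 0.5
```

Note that the lowest possible score maps to 0.5 rather than 1, so the rescaled range is 0.5 to 5.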

In [None]:
df_hotel.info()

In [None]:
# Convert to float so the median fill below works without dtype errors
df_hotel['Total_ratings'] = df_hotel['Total_ratings'].astype(float)
df_hotel['guest_rating_normalized'] = df_hotel['guest_rating_normalized'].astype(float)

# Fill missing values with the column median
df_hotel['Total_ratings'] = df_hotel['Total_ratings'].fillna(df_hotel['Total_ratings'].median())
df_hotel['guest_rating_normalized'] = df_hotel['guest_rating_normalized'].fillna(df_hotel['guest_rating_normalized'].median())


> Description: <br>
Missing values in both `Total_ratings` and `guest_rating_normalized` were replaced with the median of each column. Median was chosen to reduce the influence of outliers.

> Impact on Data Integrity:<br>
Filling missing values with the median retains all rows, avoiding data loss from nulls. Median is a robust measure that provides a fair estimate, especially with outliers. However, since the values are imputed, they may not reflect the actual data and could slightly bias the distribution if the missingness is not random.
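
The effect of choosing the median over the mean can be seen on a small made-up sample. This is a plain-Python sketch using the standard library rather than pandas; the helper name `fill_missing_with_median` and the numbers are illustrative assumptions, not scraped values.

```python
from statistics import mean, median


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to compute a median from
    fill = median(observed)
    return [fill if v is None else v for v in values]


# One very popular hotel (5000 reviews) acts as an outlier
review_counts = [12, 35, None, 40, 5000, None]
observed = [v for v in review_counts if v is not None]

print(median(observed))  # 37.5
print(mean(observed))    # 1271.75
print(fill_missing_with_median(review_counts))
```

Filling with the mean would assign over a thousand reviews to hotels that have none recorded, while the median stays close to a typical listing.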

In [None]:
df_hotel.sample(10)

In [None]:
# Export df_hotel to CSV
df_hotel.to_csv('hotel.csv', index=False)
# The CSV can then be loaded into a SQL table