# Scraping Hotel URLs with Selenium

## Overview

This notebook demonstrates how to scrape hotel listing URLs from Booking.com using **Selenium**. We automate a Chrome browser to navigate through search results and extract hotel links. The extracted URLs are saved to text files for further processing.

While Booking.com URLs can be generated directly with query parameters, we **chose not to** use this approach. Instead, we simulate a real user’s interaction with the webpage.

If we wanted to avoid using Selenium and directly generate a URL for Booking.com, we could use the following format:

```
https://www.booking.com/searchresults.html?ss=Barcelona&checkin=2025-06-03&checkout=2025-06-09&group_adults=2&group_children=0&no_rooms=1&order=price
```

Where:
- ss=Barcelona → City to search (Barcelona)
- checkin=2025-06-03 → Check-in date (June 3, 2025)
- checkout=2025-06-09 → Check-out date (June 9, 2025)
- group_adults=2 → Number of adults (2 people)
- group_children=0 → Number of children (0 children)
- no_rooms=1 → Number of rooms (1 room)
- order=price → Sorting order (e.g., by price)

This URL can be easily adapted for any other city, date range, number of guests, or sorting preference by replacing the corresponding values.
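
As a minimal sketch of that adaptation, the query string can be assembled with the standard library's `urlencode`. The helper name `build_search_url` and its default values are illustrative assumptions, not part of the notebook; the parameter names are the ones listed above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.booking.com/searchresults.html"

def build_search_url(city, checkin, checkout, adults=2, children=0, rooms=1, order="price"):
    """Assemble a search URL from the query parameters listed above (hypothetical helper)."""
    params = {
        "ss": city,
        "checkin": checkin,
        "checkout": checkout,
        "group_adults": adults,
        "group_children": children,
        "no_rooms": rooms,
        "order": order,
    }
    # urlencode escapes spaces and non-ASCII characters in city names
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

print(build_search_url("Barcelona", "2025-06-03", "2025-06-09"))
```

For the arguments shown, this reproduces the example URL above; a city such as "San Sebastián" is percent-encoded automatically.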

## Scraping Strategy

1. **Load the main page**: Navigate to [Booking.com](https://www.booking.com/index.es.html).
2. **Enter search parameters**:
   - Set the destination city.
   - Select check-in and check-out dates via the calendar.
3. **Scroll and load more results**:
   - Initially, three scrolls load approximately **100 hotels**.
   - Each subsequent scroll adds **~25 more hotels**.
   - We keep scrolling and clicking the "Load more" button for a predefined number of cycles (`number_of_cycles = 37`), which yields roughly 1,000 hotels.
4. **Extract hotel URLs**:
   - Scrape the URLs of hotel listings by identifying the appropriate **CSS class**.
   - Store the results in a text file.
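
The relationship between the cycle count and the number of hotels can be estimated with a little arithmetic. The sketch below assumes the approximate figures quoted in step 3 (about 100 hotels initially, about 25 per additional cycle); the function `cycles_needed` is a hypothetical helper, and the real page may load slightly different batch sizes.

```python
import math

def cycles_needed(target_hotels, initial_batch=100, per_cycle=25):
    """Estimate how many "Load more" cycles are needed to reach target_hotels.

    initial_batch and per_cycle are the approximate figures from the text,
    not values guaranteed by the website.
    """
    if target_hotels <= initial_batch:
        return 0
    return math.ceil((target_hotels - initial_batch) / per_cycle)

print(cycles_needed(1000))  # 36
```

Under these assumptions, 36 cycles reach about 1,000 hotels, so `number_of_cycles = 37` leaves one spare cycle.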

## Code Components

- **`scroll_to_bottom(driver)`** – Scrolls the page to the bottom to load more results.
- **`scroll_three_times(driver)`** – Performs an initial set of three scrolls.
- **`go_to_calendar_page(driver, current_date, target_date)`** – Navigates through the calendar to select check-in and check-out dates.
- **`scrape_hotels(city, start_date, end_date, number_of_cycles)`** – The main function that automates the scraping process.
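
The calendar navigation boils down to counting how many times the "next month" button must be clicked. The sketch below isolates that calculation in a standalone function (the name `months_between` is an illustrative assumption) so it can be checked without a browser.

```python
from datetime import datetime

def months_between(current_date, target_date):
    """Number of calendar pages to advance from current_date to target_date."""
    current = datetime.strptime(current_date, "%Y-%m-%d")
    target = datetime.strptime(target_date, "%Y-%m-%d")
    return (target.year - current.year) * 12 + (target.month - current.month)

print(months_between("2025-03-15", "2025-06-03"))  # 3
print(months_between("2024-12-20", "2025-06-03"))  # 6
print(months_between("2025-06-03", "2025-06-09"))  # 0
```

Only the year and month matter, so two dates in the same month need no clicks. A target month in the past gives a negative value, and `range()` over a negative number is empty, so no clicks happen in that case.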

## Expected Results

For each city and date range, a text file is created in `./scraped_hotel_urls/`, containing the list of hotel URLs.

Example:
```
./scraped_hotel_urls/Barcelona_2025-06-03_to_2025-06-09_hotel_urls.txt
```

## Execution Plan

We run the scraping function for two cities (**Barcelona** and **Madrid**) across two date ranges:
- **June 3–9, 2025**
- **June 10–16, 2025**

This results in **4 separate scraping runs**.
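
One way to enumerate these runs, shown here only as a sketch, is to take the Cartesian product of date ranges and cities. The variable names are assumptions; the notebook itself simply calls `scrape_hotels` four times.

```python
from itertools import product

cities = ["Barcelona", "Madrid"]
date_ranges = [("2025-06-03", "2025-06-09"), ("2025-06-10", "2025-06-16")]

# Date ranges vary slowest, so both cities are scraped for one week before the next
runs = [(city, start, end) for (start, end), city in product(date_ranges, cities)]

for city, start, end in runs:
    print(city, start, end)
```

Each tuple in `runs` could then be passed to `scrape_hotels(city, start, end, number_of_cycles=37)`.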

## Performance

Each 37-cycle run takes about 7 minutes, so we scrape roughly 4,000 hotels in less than 30 minutes.

We could improve runtime and robustness by replacing the `time.sleep` calls with `WebDriverWait(...).until(...)`, but we didn't cover it in class.
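
The benefit of an explicit wait over a fixed sleep can be illustrated without a browser. The sketch below is a plain-Python stand-in for the polling that `WebDriverWait(...).until(...)` performs: it re-checks a condition at short intervals and returns as soon as the condition holds. The helper `wait_until` and the simulated "element" are assumptions made for illustration; in Selenium the equivalent wait raises `TimeoutException` rather than `TimeoutError`.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)

# Stand-in for a page element that becomes available after 0.3 s
started = time.monotonic()
ready_at = started + 0.3
wait_until(lambda: time.monotonic() >= ready_at, timeout=5, poll_interval=0.05)
elapsed = time.monotonic() - started
print(f"waited {elapsed:.2f} s instead of a fixed 5 s")
```

A fixed `time.sleep(5)` would always cost five seconds, whereas the polling loop stops shortly after the condition becomes true and still fails loudly if it never does.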



In [6]:
#!pip install selenium tqdm
import time
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
from datetime import datetime
from tqdm import tqdm  # For progress bar

# Configuration
base_url = 'https://www.booking.com/index.es.html'
output_dir = './scraped_hotel_urls/'  # Directory to save the text files
hotels_webpage_class = 'a78ca197d0'
button_xpath = '//button[contains(@class, "bf0537ecb5")]'
next_month_button_xpath = '//button[contains(@class, "f073249358")]'

# Create output directory
os.makedirs(output_dir, exist_ok=True)

# Initialize WebDriver
driver = Chrome()

def scroll_to_bottom(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

def scroll_three_times(driver):
    for _ in range(3):
        scroll_to_bottom(driver)

def go_to_calendar_page(driver, current_date, target_date):
    """
    Navigates to the appropriate calendar pages to select the required dates.
    """
    current_date_obj = datetime.strptime(current_date, "%Y-%m-%d")
    target_date_obj = datetime.strptime(target_date, "%Y-%m-%d")
    month_diff = (target_date_obj.year - current_date_obj.year) * 12 + (target_date_obj.month - current_date_obj.month)

    for _ in range(month_diff):
        next_month_button = driver.find_element(By.XPATH, next_month_button_xpath)
        next_month_button.click()
        time.sleep(1)

def scrape_hotels(city, start_date, end_date, number_of_cycles=5):
    """
    Scrape hotel URLs for a given city and date range.
    """
    print(f"\nStarting scraping for {city} ({start_date} to {end_date})")
    print("=" * 100)

    driver.get(base_url)
    time.sleep(1)

    # Input city
    search_input = driver.find_element(By.ID, ':rh:')
    search_input.clear()
    search_input.send_keys(city)
    time.sleep(1)

    # Open calendar and select dates
    calendar_css = 'button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'
    scroll_to_bottom(driver) # Scroll to make the calendar visible on small screens
    driver.find_element(By.CSS_SELECTOR, calendar_css).click()
    time.sleep(1)

    # Select start and end dates
    current_date = datetime.now().strftime("%Y-%m-%d")
    go_to_calendar_page(driver, current_date, start_date)
    driver.find_element(By.XPATH, f'//span[@data-date="{start_date}"]').click()
    time.sleep(1)

    go_to_calendar_page(driver, start_date, end_date)
    driver.find_element(By.XPATH, f'//span[@data-date="{end_date}"]').click()
    time.sleep(1)

    # Submit search
    driver.find_element(By.XPATH, '//*[@id="indexsearch"]/div[2]/div/form/div/div[4]/button').click()
    time.sleep(3)

    # Scroll and load more results
    with tqdm(total=number_of_cycles, desc=f"Scraping {city} ({start_date} to {end_date})", unit="cycle") as pbar:
        cycle_count = 0
        while cycle_count < number_of_cycles:
            scroll_three_times(driver)  # Three scrolls are only needed on the first cycle; later cycles need just one
            try:
                load_more_button = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, button_xpath))
                )
                driver.execute_script("arguments[0].scrollIntoView();", load_more_button)
                load_more_button.click()
                time.sleep(1)
            except Exception as e:
                print(f"\nReached end of results or error encountered: {e}")
                break
            cycle_count += 1
            pbar.update(1)

    # Collect hotel URLs
    hotels = driver.find_elements(By.CLASS_NAME, hotels_webpage_class)
    hotel_urls = [hotel.get_attribute("href") for hotel in hotels]

    # Save results to a file
    filename = f"{output_dir}{city}_{start_date}_to_{end_date}_hotel_urls.txt"
    with open(filename, 'w') as f:
        for url in hotel_urls:
            f.write(f"{url}\n")

    print(f"\n{len(hotel_urls)} hotels found for {city} ({start_date} to {end_date})")
    print(f"Results saved to {filename}")
    print("=" * 100)

    return len(hotel_urls)

# Scrape hotel URLs for Barcelona and Madrid
try:
    scrape_hotels("Barcelona", "2025-06-03", "2025-06-09", number_of_cycles=37)  # first cycle loads about 100 hotels, then ~25 hotels per cycle, up to a maximum of 36 cycles
    scrape_hotels("Madrid", "2025-06-03", "2025-06-09", number_of_cycles=37)
    scrape_hotels("Barcelona", "2025-06-10", "2025-06-16", number_of_cycles=37)
    scrape_hotels("Madrid", "2025-06-10", "2025-06-16", number_of_cycles=37)
    print("\nScraping complete for all cities and dates.")
finally:
    driver.quit()



Starting scraping for Barcelona (2025-06-03 to 2025-06-09)


Scraping Barcelona (2025-06-03 to 2025-06-09): 100%|██████████| 37/37 [04:54<00:00,  7.95s/cycle]



999 hotels found for Barcelona (2025-06-03 to 2025-06-09)
Results saved to ./scraped_hotel_urls/Barcelona_2025-06-03_to_2025-06-09_hotel_urls.txt

Starting scraping for Madrid (2025-06-03 to 2025-06-09)


Scraping Madrid (2025-06-03 to 2025-06-09): 100%|██████████| 37/37 [04:49<00:00,  7.82s/cycle]



1000 hotels found for Madrid (2025-06-03 to 2025-06-09)
Results saved to ./scraped_hotel_urls/Madrid_2025-06-03_to_2025-06-09_hotel_urls.txt

Starting scraping for Barcelona (2025-06-10 to 2025-06-16)


Scraping Barcelona (2025-06-10 to 2025-06-16): 100%|██████████| 37/37 [04:58<00:00,  8.07s/cycle]



992 hotels found for Barcelona (2025-06-10 to 2025-06-16)
Results saved to ./scraped_hotel_urls/Barcelona_2025-06-10_to_2025-06-16_hotel_urls.txt

Starting scraping for Madrid (2025-06-10 to 2025-06-16)


Scraping Madrid (2025-06-10 to 2025-06-16): 100%|██████████| 37/37 [04:49<00:00,  7.83s/cycle]



999 hotels found for Madrid (2025-06-10 to 2025-06-16)
Results saved to ./scraped_hotel_urls/Madrid_2025-06-10_to_2025-06-16_hotel_urls.txt

Scraping complete for all cities and dates.


# Scraping Hotel Details with BeautifulSoup

## Overview

This script extracts detailed information about hotels from **Booking.com** using **requests** and **BeautifulSoup**. It processes previously scraped hotel URLs stored in text files and retrieves information such as the hotel's name, description, room type, rating, and price. The extracted data is then saved to a CSV file.

## Why Use Requests and BeautifulSoup?

Instead of using Selenium for this step, we opted for **requests** and **BeautifulSoup** to:
- **Improve efficiency** – Requests are generally faster than Selenium-based browser automation.
- **Avoid unnecessary browser rendering** – Since we only need structured data, using a lightweight parser is more efficient.

## Scraping Strategy

1. **Read URL files**:
   - The script looks for text files in the `./scraped_hotel_urls/` directory.
   - Each file contains a list of hotel URLs from a specific city and date range.

2. **Extract metadata from filenames**:
   - City, check-in date, and check-out date are extracted from the filename format:
     ```
     City_YYYY-MM-DD_to_YYYY-MM-DD_hotel_urls.txt
     ```

3. **Loop through URLs**:
   - Send a **GET request** with a **random user agent** to avoid detection.
   - Parse the response using **BeautifulSoup**.

4. **Extract hotel details**:
   - **Hotel Name**: Extracted from the page header.
   - **Description**: Retrieved from the property’s summary section.
   - **Room Type**: Identified from the room listing.
   - **Rating**: Extracted from the user review section.
   - **Price**: Scraped from the pricing display.

5. **Store results in a DataFrame**:
   - Each hotel’s data is stored in a Pandas **DataFrame**.
   - The data is then **saved as a CSV file** (`scraped_hotel_data.csv`).
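
Step 2 can be sketched as a single regular expression that captures all three fields at once. The function name `parse_url_filename` is an assumption for illustration. Unlike splitting on the first underscore, this variant also copes with a city name that itself contains underscores, and it returns `None` for files that do not follow the naming scheme.

```python
import re

FILENAME_PATTERN = re.compile(
    r"^(?P<city>.+)_(?P<start>\d{4}-\d{2}-\d{2})_to_(?P<end>\d{4}-\d{2}-\d{2})_hotel_urls\.txt$"
)

def parse_url_filename(filename):
    """Return (city, start_date, end_date), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("city"), match.group("start"), match.group("end")

print(parse_url_filename("Barcelona_2025-06-03_to_2025-06-09_hotel_urls.txt"))
```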

## Code Components

- **`USER_AGENTS`** – A list of different User-Agent strings to reduce blocking risks.
- **`requests.get(url, headers=headers)`** – Sends an HTTP request with a random User-Agent.
- **`BeautifulSoup(response.content, "html.parser")`** – Parses the HTML response.
- **Regular Expressions (`re.search()`)** – Used to extract numeric rating values.
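
The rating extraction can be tried in isolation. The sketch below follows the same `re.search()` idea but returns a float, and it additionally accepts a decimal comma, which is an assumption about how localized pages might print ratings. The sample strings are hypothetical, not taken from Booking.com.

```python
import re

def parse_rating(text):
    """Return the first number in text as a float, or None if there is none."""
    # Accept both "8.7" and "8,7"; the decimal comma is an assumed variant
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", "."))

print(parse_rating("Scored 8.7"))  # 8.7
print(parse_rating("No reviews"))  # None
```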

## Expected Results

After running the script, a CSV file `scraped_hotel_data.csv` is generated, containing:
- **URL**
- **Hotel Name**
- **Description**
- **Room Type**
- **City**
- **Check-in Date**
- **Check-out Date**
- **Rating**
- **Price**

## Limitations and Considerations

- **Anti-Scraping Protection** – Booking.com may block repeated requests, so using **rotating user agents** and **delays** can help.
- **HTML Structure Changes** – If the website layout changes, selectors might need to be updated.
- **Missing Data** – Some elements may not be present on all hotel pages, leading to `"N/A"` values.
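
A rough sketch of the first point: choose a random User-Agent for each request and pause for a slightly randomized interval between requests. The helper names and the delay values (1.0 s base, 0.5 s jitter) are assumptions for illustration, not tuned settings.

```python
import random

def pick_headers(user_agents, rng=random):
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(user_agents)}

def jittered_delay(base=1.0, jitter=0.5, rng=random):
    """Return a delay in seconds between base and base + jitter."""
    return base + rng.uniform(0, jitter)

sample_agents = ["agent-a", "agent-b", "agent-c"]  # placeholders for real User-Agent strings
print(pick_headers(sample_agents))
print(round(jittered_delay(), 2))
```

In the scraping loop, `time.sleep(jittered_delay())` would be called before each `requests.get(url, headers=pick_headers(USER_AGENTS))`.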

## Performance

Each run of about 1,000 hotels takes roughly 30 minutes (about 1.9 s per URL), so scraping about 4,000 hotels takes just over 2 hours.

We could improve runtime by using asynchronous requests.
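
A minimal sketch of the asynchronous idea, using only `asyncio` from the standard library: many fetches run concurrently, while a semaphore caps how many are in flight at once. The `fake_fetch` coroutine is a stand-in that sleeps instead of touching the network, and `max_concurrency=5` is an arbitrary assumption; a real version would swap in an asynchronous HTTP client.

```python
import asyncio

async def fetch_all(urls, fetch, max_concurrency=5):
    """Run fetch(url) for every URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))

async def fake_fetch(url):
    # Stand-in for a real HTTP call; sleeps briefly instead of hitting the network
    await asyncio.sleep(0.05)
    return f"<html>{url}</html>"

urls = [f"https://example.com/hotel/{i}" for i in range(20)]
pages = asyncio.run(fetch_all(urls, fake_fetch))
print(len(pages))  # 20
```

Inside a Jupyter notebook an event loop is already running, so `await fetch_all(urls, fake_fetch)` would be used in place of `asyncio.run(...)`.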


In [14]:
#!pip install bs4 requests pandas tqdm
import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm  # For progress bar
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/108.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Gecko/20100101 Firefox/108.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Edge/114.0.0.0",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
]

# Directory containing URL files
directory = './scraped_hotel_urls/'

# DataFrame to store the results
df = pd.DataFrame(columns=["url", "hotel_name", "description", "room_type", "city", "start_date", "end_date", "rating", "price"])

# Get a list of all files in the directory
files = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]
total_files = len(files)  # Total number of files to process

# Loop through the filenames in the directory with a progress bar
for file_index, filename in enumerate(tqdm(files, desc="Processing files", unit="file")):
    # Create the full path to the file
    file_path = os.path.join(directory, filename)
    
    # Extract city, start_date, and end_date from the filename
    city = filename.split('_')[0]
    match = re.search(r'_(\d{4}-\d{2}-\d{2})_to_(\d{4}-\d{2}-\d{2})_', filename)
    start_date, end_date = match.groups() if match else ("N/A", "N/A")
    
    # Read the URLs from the file
    with open(file_path) as f:
        lines = f.readlines()
    total_urls = len(lines)
    
    # Progress bar for URLs in the file
    print("=" * 100)
    print(f"\nProcessing file {file_index + 1} of {total_files}: {filename} ({total_urls} URLs)")
    print("=" * 100)
    for line in tqdm(lines, desc=f"File {file_index + 1}/{total_files} Progress", unit="URL"):
        url = line.strip()
        
        # Send GET request to the URL
        headers = {
            "User-Agent": random.choice(USER_AGENTS)
        }
        response = requests.get(url, headers=headers)
        
        # If request is successful, parse the HTML
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            # Extract hotel name
            hotel_name_element = soup.find("h2", class_="pp-header__title")
            hotel_name = hotel_name_element.get_text(strip=True) if hotel_name_element else "N/A"
            
            # Extract property description
            description_element = soup.find("p", {"data-testid": "property-description"})
            hotel_description = description_element.get_text(strip=True) if description_element else "N/A"
            
            # Extract room type
            room_type_element = soup.find("span", class_="hprt-roomtype-icon-link")
            hotel_short_description = room_type_element.get_text(strip=True) if room_type_element else "N/A"
            
            # Extract rating
            rating_element = soup.find("div", class_="ac4a7896c7")
            if rating_element:
                rating_match = re.search(r'\d+(\.\d+)?', rating_element.text)
                hotel_rating = rating_match.group() if rating_match else "N/A"
            else:
                hotel_rating = "N/A"
            
            # Extract price
            price_element = soup.find("div", class_="bui-price-display__value")
            hotel_price = price_element.get_text(strip=True) if price_element else "N/A"
            
            # Append the extracted data to the DataFrame
            df = pd.concat([df, pd.DataFrame([{
                "url": url,
                "hotel_name": hotel_name,
                "description": hotel_description,
                "room_type": hotel_short_description,
                "city": city,
                "start_date": start_date,
                "end_date": end_date,
                "rating": hotel_rating,
                "price": hotel_price
            }])], ignore_index=True)
        else:
            print(f"Failed to fetch URL: {url}, Status Code: {response.status_code}")

# Convert ratings to numbers ("N/A" placeholders become NaN) and save to CSV
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
output_file = 'scraped_hotel_data.csv'
df.to_csv(output_file, index=False)

print(f"\nProcessing complete. Data saved to {output_file}")


Processing files:   0%|          | 0/4 [00:00<?, ?file/s]


Processing file 1 of 4: Barcelona_2025-06-03_to_2025-06-09_hotel_urls.txt (999 URLs)


File 1/4 Progress: 100%|██████████| 999/999 [32:18<00:00,  1.94s/URL]
Processing files:  25%|██▌       | 1/4 [32:18<1:36:54, 1938.29s/file]


Processing file 2 of 4: Barcelona_2025-06-10_to_2025-06-16_hotel_urls.txt (992 URLs)


File 2/4 Progress: 100%|██████████| 992/992 [31:51<00:00,  1.93s/URL]
Processing files:  50%|█████     | 2/4 [1:04:10<1:04:05, 1922.70s/file]


Processing file 3 of 4: Madrid_2025-06-03_to_2025-06-09_hotel_urls.txt (1000 URLs)


File 3/4 Progress: 100%|██████████| 1000/1000 [30:14<00:00,  1.81s/URL]
Processing files:  75%|███████▌  | 3/4 [1:34:24<31:13, 1873.41s/file]  


Processing file 4 of 4: Madrid_2025-06-10_to_2025-06-16_hotel_urls.txt (999 URLs)


File 4/4 Progress: 100%|██████████| 999/999 [30:43<00:00,  1.85s/URL]
Processing files: 100%|██████████| 4/4 [2:05:08<00:00, 1877.01s/file]


Processing complete. Data saved to scraped_hotel_data.csv



