# Objective

Scrape property listings from Redfin.com and extract a range of essential details, including prices, addresses, beds/baths, images, and geo-coordinates.

## Target URL

https://www.redfin.com/neighborhood/547223/CA/Los-Angeles/Hollywood-Hills

## Approach

- Check if site is static or dynamically generated
- Methodically identify and bypass anti-bot techniques
- Identify and parse relevant page elements for the data points in the first listing
- Put everything together: loop through the listings on the first page and build a dataframe from them
- Extend the script to cover multiple pages


## Determine whether the site is static or dynamically generated, and identify anti-bot techniques

Typical symptoms of bot detection:
- 403 Forbidden
- Empty response or "Access Denied"
- CAPTCHA pages
- Meta refresh or JavaScript redirection
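
These symptoms can be checked in code before any parsing is attempted. The sketch below is a rough heuristic, not something Redfin documents: the name `looks_blocked`, the status-code set, and the marker strings are all assumptions chosen for illustration (429 is included because rate limiting is another common anti-bot response).

```python
# Heuristic check for the bot-detection symptoms listed above.
# Status codes and marker strings are illustrative assumptions.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = ("access denied", "are you a robot", "captcha", 'http-equiv="refresh"')


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if a response shows a typical sign of bot detection."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    if not body.strip():
        return True  # empty response
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


print(looks_blocked(429, "<title>Are You a Robot? | Redfin</title>"))
print(looks_blocked(200, "<html><body>listings</body></html>"))
```

A plain substring match can give false positives (a normal page may mention "captcha" in a script), so a positive result is a prompt to look at the response by hand rather than a verdict.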

In [1]:
# import basic libraries
import requests
from bs4 import BeautifulSoup

In [2]:
# Redfin search URL for Hollywood Hills, Los Angeles
base_url = "https://www.redfin.com/neighborhood/547223/CA/Los-Angeles/Hollywood-Hills"

## Testing requests without headers

In [3]:
r = requests.get(base_url)
print(r.status_code)
print(r.text[:500])

429
<!doctype html>
<html>

<head>
    <meta charset="utf-8">
    <title>Are You a Robot? | Redfin</title>
    <style>
        body {
            font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
            margin: 0;
            text-align: left;
            font-size: 16px;
            color: #333;
        }

        #header {
            min-width: 300px;
            height: 60px;
            width: 100%;
            background-color: #fff;
        }

        #header .logo {
         


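Status code 429 (Too Many Requests), together with the "Are You a Robot?" page title, shows that a request without headers is rejected. This is an anti-bot response, not the listings page.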

## Capture headers from live browsing

- Visit target URL
- Open the Network tab in the DevTools
- Right click (or Ctrl-click) a request
- Click "Copy" → "Copy as cURL (bash)"
- Paste the command into a curl converter (e.g. https://curlconverter.com/) to translate it into the language you want, in our case Python
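
For the headers alone, the conversion amounts to collecting the `-H` values of the copied command. The sketch below shows that idea with the standard library only; `headers_from_curl` is a hypothetical helper, the example command is shortened and made up, and a real converter handles many more curl options (cookies, data, methods).

```python
import shlex


def headers_from_curl(curl_command: str) -> dict:
    """Collect the -H/--header values of a curl command into a dict."""
    # Join backslash-continued lines, then split the way a POSIX shell would
    tokens = shlex.split(curl_command.replace("\\\n", " "))
    headers = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and ":" in value:
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


# Shortened, made-up example of what "Copy as cURL (bash)" produces
example = """curl 'https://www.redfin.com/' \\
  -H 'Referer: https://www.redfin.com/' \\
  -H 'sec-ch-ua-mobile: ?0'"""

print(headers_from_curl(example))
```

The resulting dict has the same shape as the `headers` argument that `requests.get` accepts.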

## Testing requests with captured headers

In [4]:
# headers copied from the Python output of the curl converter

headers = {
    'sec-ch-ua-platform': '"Windows"',
    'Referer': 'https://www.redfin.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Not)A;Brand";v="8", "Chromium";v="138", "Google Chrome";v="138"',
    'sec-ch-ua-mobile': '?0',
}

r = requests.get(base_url, headers=headers)
print(r.status_code)
print(r.text[:2000])

200
<!DOCTYPE html><html lang="en"><head>
		<script src="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js"  type="text/javascript" charset="UTF-8" data-domain-script="7e5bc3d6-ef20-4760-aa0d-c8df4649fae2" ></script>
		
		<script>
			window.__uspapi = function (command, version, callback) {
				callback({}, false);
			}
		</script>
	
	<!-- Server: customer-pages-map-dp --><!-- Time generated: Sun Jul 27 2025 21:20:24 GMT+0000 (Coordinated Universal Time) --><script>(function(a){window.__reactSe


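The status code is now 200, so the captured headers get us past the block. The response opens with inline scripts rather than listing markup, though, which suggests the page is rendered with JavaScript (likely a React app).

If the data we need is loaded dynamically via JavaScript, a plain requests + BeautifulSoup approach won't see it. Selenium can load the page and execute JavaScript, allowing us to scrape the rendered content.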
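
One quick way to test the static-versus-dynamic question is to search the raw response for class names seen in the browser's inspector. The sketch below assumes the class names used by the selectors later in this notebook; `missing_markers` is an illustrative name and the two HTML fragments are made up to stand in for `r.text`.

```python
def missing_markers(html: str, markers: list) -> list:
    """Return the markers that do not occur anywhere in the raw HTML."""
    return [marker for marker in markers if marker not in html]


# Class names taken from the selectors used later in this notebook
LISTING_MARKERS = ["HomeCardContainer", "bp-Homecard__Price--value"]

# Made-up fragments standing in for r.text
static_like = (
    '<div class="HomeCardContainer">'
    '<span class="bp-Homecard__Price--value">$1</span></div>'
)
shell_like = '<div id="content"></div><script src="app.js"></script>'

print(missing_markers(static_like, LISTING_MARKERS))
print(missing_markers(shell_like, LISTING_MARKERS))
```

An empty list means every marker is already in the server response, in which case a browser may not be needed; any missing marker points towards client-side rendering.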

## Exploratory scraping with Selenium

In [17]:
# Import selenium and related libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
import json
import random
import re

In [18]:
# helper to set up selenium
def init_chrome_driver():
    
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # run Chrome without a visible window
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    return driver

In [19]:
# Redfin search URL for Hollywood Hills, Los Angeles
base_url = "https://www.redfin.com/neighborhood/547223/CA/Los-Angeles/Hollywood-Hills"

In [20]:
driver = init_chrome_driver()

In [None]:
# Load the website
driver.get(base_url)

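# driver.get() returns once the page's load event has fired; the implicit wait
# makes later find_element calls retry for up to 10 seconds before failing
driver.implicitly_wait(10)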

# Get the page source after JavaScript has been executed
html = driver.page_source

# Parse with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)
print()
soup.prettify()
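
The implicit wait above, and the fixed `time.sleep` pauses used later, either wait longer than needed or not long enough. Selenium ships `WebDriverWait` for condition-based waiting; the sketch below shows the same polling idea in plain Python so it can run without a browser. `wait_until` is a hypothetical helper, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Usage with the driver from this notebook (not executed here):
# cards = wait_until(
#     lambda: driver.find_elements("css selector", "div.HomeCardContainer"),
#     timeout=15,
# )
```

Because `find_elements` returns an empty list when nothing matches, the lambda stays falsy until at least one card has been rendered.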

An important part of building a web scraper is navigating the source of the page being scraped. The output above is only part of the whole page. We can read through it to find the main listings container. Alternatively, we can open the page in a browser, locate where the listings of interest start, then right-click that area and select Inspect (or press Ctrl+Shift+I) to see the underlying HTML elements and their CSS classes.

### Find the data points in the first listing

Let's extract each data point for the first listing on the page before looping over the rest of the first page.

In [21]:
driver.get(base_url)
time.sleep(random.uniform(5, 8))

# Locate the main listings container
try:
    container = driver.find_element("css selector", "div.HomeCardsContainer")
    listings = container.find_elements("css selector", "div.HomeCardContainer")
    print(f"Found {len(listings)} listings on the page")
except Exception as e:
    listings = []  # keep the name defined so later cells fail gracefully
    print(f"Failed to locate the property list container: {e}")

Found 44 listings on the page


In [None]:
# Extract price
try:
    price = listings[0].find_element("css selector", "span.bp-Homecard__Price--value").text.strip()
except:
    price = "N/A"
print(price)

In [None]:
# Extract address
try:
    address = listings[0].find_element("css selector", "div.bp-Homecard__Address").text.strip()
    print(address)
except:
    print("Skipping a listing due to missing address data")


In [None]:
# Extract beds
try:
    beds = listings[0].find_element("css selector", "span.bp-Homecard__Stats--beds").text.strip()
except:
    beds = "N/A"
print(beds)

In [None]:
# Extract baths
try:
    baths = listings[0].find_element("css selector", "span.bp-Homecard__Stats--baths").text.strip()
except:
    baths = "N/A"
print(baths)

In [None]:
# Extract sqft
try:
    sqft = listings[0].find_element("css selector", "span.bp-Homecard__LockedStat--value").text.strip()
except:
    sqft = "N/A"
print(sqft)

In [None]:
# Extract listing link
try:
    link = listings[0].find_element("css selector", "a.bp-Homecard").get_attribute("href")
    link = f"https://www.redfin.com{link}" if link.startswith("/") else link
    print(link)
    
    # Extract ID after /home/
    match = re.search(r'/home/(\d+)', link)
    if match:
        listing_id = match.group(1)
        print("🏠 Listing ID:", listing_id)
    else:
        print("❌ Listing ID not found")
        
except:
    print("Skipping a listing due to missing link data")
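
The ID lookup can be isolated into a small function so it is easy to check against sample links. This is a sketch using the same `/home/(\d+)` pattern as the cell above; `extract_listing_id` is an illustrative name and the example URL is made up.

```python
import re
from typing import Optional

# Same pattern as the cell above: the digits that follow "/home/"
LISTING_ID_PATTERN = re.compile(r"/home/(\d+)")


def extract_listing_id(link: str) -> Optional[str]:
    """Return the numeric listing ID found in a listing URL, or None."""
    match = LISTING_ID_PATTERN.search(link)
    return match.group(1) if match else None


# Made-up URL in the shape the pattern expects
sample = "https://www.redfin.com/CA/Los-Angeles/123-Example-St-90068/home/1234567"
print(extract_listing_id(sample))
print(extract_listing_id("https://www.redfin.com/"))
```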

In [None]:
# Extract image URL
try:
    image_element = listings[0].find_element("css selector", "img.bp-Homecard__Photo--image")
    image_url = image_element.get_attribute("src")
except:
    image_url = "N/A"
print(image_url)

In [None]:
try:
    # Extract Geo-Coordinates (Latitude & Longitude)
    json_script = listings[0].find_element(By.CSS_SELECTOR, "script[type='application/ld+json']").get_attribute("innerHTML")
    json_data = json.loads(json_script)

    # Sometimes it's a dict, sometimes a list of dicts
    if isinstance(json_data, list):
        geo_data = next((item.get("geo") for item in json_data if item.get("geo")), None)
    else:
        geo_data = json_data.get("geo")

    if geo_data:
        latitude = geo_data.get("latitude", "N/A")
        longitude = geo_data.get("longitude", "N/A")
    else:
        latitude = "N/A"
        longitude = "N/A"
except Exception as e:
    print(f"⚠️ Failed to extract geo-coordinates: {e}")
    latitude = "N/A"
    longitude = "N/A"
print(f"Latitude: {latitude} Longitude:{longitude}")
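
The dict-or-list handling is easier to verify when it is separated from the browser. The sketch below applies the same logic to a JSON-LD string; `extract_geo` is an illustrative name, it returns `None` rather than "N/A" for missing values, and the sample payloads are made up.

```python
import json


def extract_geo(json_ld_text: str):
    """Return (latitude, longitude) from a JSON-LD string, or (None, None)."""
    try:
        data = json.loads(json_ld_text)
    except json.JSONDecodeError:
        return None, None

    # Normalise: the payload may be one object or a list of objects
    items = data if isinstance(data, list) else [data]
    for item in items:
        geo = item.get("geo") if isinstance(item, dict) else None
        if isinstance(geo, dict):
            return geo.get("latitude"), geo.get("longitude")
    return None, None


# Made-up payloads covering both shapes
as_dict = '{"@type": "SingleFamilyResidence", "geo": {"latitude": 34.1, "longitude": -118.3}}'
as_list = '[{"@type": "Product"}, {"geo": {"latitude": 34.2, "longitude": -118.4}}]'

print(extract_geo(as_dict))
print(extract_geo(as_list))
print(extract_geo("not json"))
```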

### Put everything together: loop through results and append data to a list

In [None]:
# Initialize data storage
redfin_data = []

for listing in listings:
    # Extract price
    try:
        price = listing.find_element("css selector", "span.bp-Homecard__Price--value").text.strip()
    except:
        price = "N/A"

    # Extract address
    try:
        address = listing.find_element("css selector", "div.bp-Homecard__Address").text.strip()
    except:
        print("Skipping a listing due to missing address data")
        continue  # Skip listings with missing elements

    # Extract beds, baths, and sqft
    try:
        beds = listing.find_element("css selector", "span.bp-Homecard__Stats--beds").text.strip()
    except:
        beds = "N/A"

    try:
        baths = listing.find_element("css selector", "span.bp-Homecard__Stats--baths").text.strip()
    except:
        baths = "N/A"

    try:
        sqft = listing.find_element("css selector", "span.bp-Homecard__LockedStat--value").text.strip()
    except:
        sqft = "N/A"

    # Extract listing link
    try:
        link = listing.find_element("css selector", "a.bp-Homecard").get_attribute("href")
        link = f"https://www.redfin.com{link}" if link.startswith("/") else link

        # Extract ID after /home/
        match = re.search(r'/home/(\d+)', link)
        if match:
            listing_id = match.group(1)
        else:
            listing_id = "N/A"

    except:
        print("Skipping a listing due to missing link data")
        continue  # Skip listings with missing elements

    # Extract image URL
    try:
        image_element = listing.find_element("css selector", "img.bp-Homecard__Photo--image")
        image_url = image_element.get_attribute("src")
    except:
        image_url = "N/A"
    
    try:
        # Extract Geo-Coordinates (Latitude & Longitude)
        json_script = listing.find_element(By.CSS_SELECTOR, "script[type='application/ld+json']").get_attribute("innerHTML")
        json_data = json.loads(json_script)

        # Sometimes it's a dict, sometimes a list of dicts
        if isinstance(json_data, list):
            geo_data = next((item.get("geo") for item in json_data if item.get("geo")), None)
        else:
            geo_data = json_data.get("geo")

        if geo_data:
            latitude = geo_data.get("latitude", "N/A")
            longitude = geo_data.get("longitude", "N/A")
        else:
            latitude = "N/A"
            longitude = "N/A"
    except Exception as e:
        print(f"⚠️ Failed to extract geo-coordinates: {e}")
        latitude = "N/A"
        longitude = "N/A"

    # Store the data
    redfin_data.append({
        "Listing ID": listing_id,
        "Price": price,
        "Address": address,
        "Beds": beds,
        "Baths": baths,
        "SqFt": sqft,
        "Link": link,
        "Image URL": image_url, 
        "Latitude": latitude,
        "Longitude": longitude
    })
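
The loop above repeats one try/except pattern for every text field. A small helper could carry that pattern once; the sketch below is one possible shape, with `safe_text` as a hypothetical name. The `FakeCard` and `FakeNode` classes are stand-ins so the example runs without a browser; with Selenium, `listing` would be passed in directly.

```python
def safe_text(element, selector, default="N/A"):
    """Return the stripped text of the first match, or `default` on any failure."""
    try:
        return element.find_element("css selector", selector).text.strip()
    except Exception:
        return default


class FakeNode:
    """Stand-in for a Selenium WebElement that only carries text."""

    def __init__(self, text):
        self.text = text


class FakeCard:
    """Stand-in for a listing card; raises when a selector has no match."""

    def __init__(self, fields):
        self._fields = fields

    def find_element(self, by, selector):
        if selector not in self._fields:
            raise LookupError(selector)
        return FakeNode(self._fields[selector])


card = FakeCard({"span.bp-Homecard__Price--value": " $1,000,000 "})
print(safe_text(card, "span.bp-Homecard__Price--value"))
print(safe_text(card, "span.bp-Homecard__Stats--beds"))
```

Fields that should cause the listing to be skipped (address and link in the loop above) would still need their own check, for example by comparing the result against the default.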


In [None]:
# create a pandas dataframe from the scraped page
import pandas as pd

df_redfin = pd.DataFrame(redfin_data)

In [None]:
df_redfin

### Multiple Pages

In [None]:
# Initialize data storage
redfin_data = []

# Scrape up to 5 pages; page_number only advances when a next-page link exists
page_number = 1
for _ in range(5):
    print(f"Scraping page {page_number}...")

    url = base_url if page_number == 1 else f"{base_url}/page-{page_number}"
    driver.get(url)
    time.sleep(random.uniform(5, 8))

    # Locate the main listings container
    try:
        container = driver.find_element("css selector", "div.HomeCardsContainer")
        listings = container.find_elements("css selector", "div.HomeCardContainer")
    except:
        print("Failed to locate the property list container. Exiting...")
        break

    print(f"Found {len(listings)} listings on page {page_number}")

    for listing in listings:
        # Extract price
        try:
            price = listing.find_element("css selector", "span.bp-Homecard__Price--value").text.strip()
        except:
            price = "N/A"
    
        # Extract address
        try:
            address = listing.find_element("css selector", "div.bp-Homecard__Address").text.strip()
        except:
            print("Skipping a listing due to missing address data")
            continue  # Skip listings with missing elements
    
        # Extract beds, baths, and sqft
        try:
            beds = listing.find_element("css selector", "span.bp-Homecard__Stats--beds").text.strip()
        except:
            beds = "N/A"
    
        try:
            baths = listing.find_element("css selector", "span.bp-Homecard__Stats--baths").text.strip()
        except:
            baths = "N/A"

        try:
            sqft = listing.find_element("css selector", "span.bp-Homecard__LockedStat--value").text.strip()
        except:
            sqft = "N/A"
    
        # Extract listing link and listing id
        try:
            link = listing.find_element("css selector", "a.bp-Homecard").get_attribute("href")
            link = f"https://www.redfin.com{link}" if link.startswith("/") else link
    
            # Extract ID after /home/
            match = re.search(r'/home/(\d+)', link)
            if match:
                listing_id = match.group(1)
            else:
                listing_id = "N/A"
    
        except:
            print("Skipping a listing due to missing link data")
            continue  # Skip listings with missing elements

        # Extract image URL
        try:
            image_element = listing.find_element("css selector", "img.bp-Homecard__Photo--image")
            image_url = image_element.get_attribute("src")
        except:
            image_url = "N/A"
    
        try:
            # Extract Geo-Coordinates (Latitude & Longitude)
            json_script = listing.find_element(By.CSS_SELECTOR, "script[type='application/ld+json']").get_attribute("innerHTML")
            json_data = json.loads(json_script)
    
            # Sometimes it's a dict, sometimes a list of dicts
            if isinstance(json_data, list):
                geo_data = next((item.get("geo") for item in json_data if item.get("geo")), None)
            else:
                geo_data = json_data.get("geo")
    
            if geo_data:
                latitude = geo_data.get("latitude", "N/A")
                longitude = geo_data.get("longitude", "N/A")
            else:
                latitude = "N/A"
                longitude = "N/A"
        except Exception as e:
            print(f"⚠️ Failed to extract geo-coordinates: {e}")
            latitude = "N/A"
            longitude = "N/A"

        # Store the data
        redfin_data.append({
            "Listing ID": listing_id,
            "Price": price,
            "Address": address,
            "Beds": beds,
            "Baths": baths,
            "SqFt": sqft,
            "Link": link,
            "Image URL": image_url, 
            "Latitude": latitude,
            "Longitude": longitude
        })
    

    # Move on only if a link to the next page exists; find_element raises
    # when there is no such anchor, which ends the loop
    try:
        driver.find_element(By.CSS_SELECTOR, f"a[aria-label='page {page_number + 1}']")
        page_number += 1
        time.sleep(random.uniform(3, 6))
    except Exception:
        print("✅ No more pages.")
        break
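
The loop above builds page URLs by appending a `/page-N` suffix for every page after the first. A minimal sketch of that rule as a standalone function, assuming the suffix scheme holds for this neighborhood URL; `page_url` is an illustrative name.

```python
def page_url(base_url: str, page_number: int) -> str:
    """Build the URL for a results page using the /page-N suffix scheme."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    base = base_url.rstrip("/")
    # Page 1 is the bare URL; later pages get the suffix
    return base if page_number == 1 else f"{base}/page-{page_number}"


base_url = "https://www.redfin.com/neighborhood/547223/CA/Los-Angeles/Hollywood-Hills"
for n in (1, 2, 5):
    print(page_url(base_url, n))
```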
        

In [None]:
# create a pandas dataframe from the scraped pages
import pandas as pd

df_redfin = pd.DataFrame(redfin_data)

In [None]:
df_redfin

### Close the browser

In [25]:
driver.quit()
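
If a cell raises before this point, the browser process stays open. One way to guard against that is a context manager that always calls `quit()`. The sketch below assumes a factory function such as `init_chrome_driver` from earlier; `managed_driver` is a hypothetical name and `FakeDriver` is a stand-in so the example runs without Chrome.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the body raised


class FakeDriver:
    """Stand-in for a Selenium driver that records whether quit() ran."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as d:
    print("driver open:", not d.closed)
print("driver closed:", d.closed)

# With Selenium (not executed here):
# with managed_driver(init_chrome_driver) as driver:
#     driver.get(base_url)
```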