# Craigslist Webscrape 🕵🏻‍♀️💻

**Date**: 2025-08-18  
**Collaborators**: Amy Zhang, ChatGPT, Perplexity AI, Claude AI

## Objective 🎯
This notebook demonstrates how to extract structured data from Craigslist listings using Python. Here, we focus on gathering item listings, prices, and descriptions to explore automated data collection pipelines.

## Exploration Summary 🔍

1. **Initial Craigslist scrape** 📝  
   Searched for "Roland amp" in Brooklyn, NY using gallery view. Retrieved only 13 results because the rest load dynamically, but successfully scraped descriptions from the individual listing links.

2. **Attempted Selenium automation** 🤖  
   Tried using Selenium to load more results beyond the first 13. This attempt returned an empty DataFrame: the `.result-title` selector matched no elements, either because the content had not finished loading or because the page no longer uses that class.

3. **Adjusted scraping strategy** 🔧  
   Switched to list view and read the JSON embedded in the page. Successfully retrieved more listings, but the embedded JSON did not include listing URLs.

4. **Debugging and key insight** 💡  
   Through debugging, we discovered that Craigslist no longer uses the old `li.result-row` containers. Listing URLs still exist but follow a new pattern (`/brk/msg/d/`). Identifying this pattern allowed us to correctly extract actual listings while ignoring navigation links.

5. **Final multi-page scrape** 📊  
   Scraped three pages of listings. After removing duplicates, we compiled **165 entries** with URLs (`df_list_2.shape` is `(165, 4)`).

*Additional notes and reflections are included at the end of the notebook.* 📌


---
# 1. Initial Craigslist Scrape

## 1) Import libraries


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


## 2) Define the URL
- Pick a Craigslist search results page.
- Copy-paste that URL here.


In [2]:
url = "https://newyork.craigslist.org/search/brk/msa?query=roland%20amp#search=2~gallery~54"
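
One detail worth checking in the URL above: everything after `#` is a fragment, which stays on the client and is not sent to the server by HTTP clients such as `requests`. The sketch below uses only the standard library to split the URL. It assumes (not verified here) that the `search=2~gallery~54` fragment is what the site's JavaScript reads to select gallery view; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://newyork.craigslist.org/search/brk/msa?query=roland%20amp#search=2~gallery~54"

parts = urlsplit(url)
query = parse_qs(parts.query)      # parameters the server actually receives
fragment = parts.fragment          # client-side only, never part of the request
server_url = parts._replace(fragment="").geturl()

print(query)
print(fragment)
print(server_url)
```

If that assumption holds, a static request sees the same HTML regardless of the view named in the fragment, which is consistent with the limited results in this section.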


## 3) Fetch the webpage


In [3]:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")


## 4) Parse out listing data
- Title
- Price
- Location (if available)
- Link


In [4]:
listings = []
for result in soup.find_all("li", class_="cl-static-search-result"):
    title = result.find("div", class_="title").get_text(strip=True)
    price = result.find("span", class_="price").get_text(strip=True) if result.find("span", class_="price") else "N/A"
    link = result.find("a")["href"]
    location = result.find("span", class_="nearby")  # may not always exist
    location = location.get_text(strip=True) if location else "N/A"
    
    listings.append({
        "title": title,
        "price": price,
        "location": location,
        "link": link
    })


## 5) Convert to DataFrame


In [5]:
df = pd.DataFrame(listings)
df.head(15)


Unnamed: 0,title,price,location,link
0,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...
1,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...
2,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...
3,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...
4,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...
5,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...
6,Roland RD-700GX 88-Key Digital Stage Piano W.K...,,,https://newyork.craigslist.org/brk/msg/d/brook...
7,1980s Roland Cube 20 Guitar combo amplifier - ...,,,https://newyork.craigslist.org/brk/msg/d/brook...
8,Roland Blues Cube Hot 30-watt,,,https://newyork.craigslist.org/brk/msg/d/brook...
9,Hourly Rehearsal @ Battalion Studios: large cl...,,,https://newyork.craigslist.org/brk/msg/d/brook...


## 6) Scrape: grab descriptions from individual links

In [6]:
descriptions = []

for link in df['link']:
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Grab the main body of the listing (Craigslist uses <section id="postingbody">)
    post_body = soup.find('section', id='postingbody')
    
    if post_body:
        descriptions.append(post_body.get_text(strip=True))
    else:
        descriptions.append('N/A')

df['description'] = descriptions

In [7]:
df.head(15)

Unnamed: 0,title,price,location,link,description
0,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostRoland KC60 1x10 40W ...
1,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostRoland KC60 1x10 40W ...
2,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostRoland KC60 1x10 40W ...
3,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostRoland KC60 1x10 40W ...
4,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostRoland KC60 1x10 40W ...
5,Roland KC60 1x10 40W Keyboard Amp For Sale,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostRoland KC60 1x10 40W ...
6,Roland RD-700GX 88-Key Digital Stage Piano W.K...,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostRoland RD-700GX 88-Ke...
7,1980s Roland Cube 20 Guitar combo amplifier - ...,,,https://newyork.craigslist.org/brk/msg/d/brook...,"QR Code Link to This PostHello,Here's a 1980s ..."
8,Roland Blues Cube Hot 30-watt,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostRoland Blues Cube Hot...
9,Hourly Rehearsal @ Battalion Studios: large cl...,,,https://newyork.craigslist.org/brk/msg/d/brook...,QR Code Link to This PostBattalion Studios has...
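
Every description above begins with the label "QR Code Link to This Post", which comes from the listing page rather than from the seller's text. A minimal cleaning sketch, assuming that label is the only artifact to remove; `clean_description` and the sample string are illustrative and not part of the notebook:

```python
ARTIFACT = "QR Code Link to This Post"

def clean_description(text):
    """Drop the leading QR-code label and surrounding whitespace, if present."""
    text = text.strip()
    if text.startswith(ARTIFACT):
        text = text[len(ARTIFACT):]
    return text.strip()

raw = "QR Code Link to This PostRoland KC60 1x10 40W"  # made-up sample
print(clean_description(raw))
```

It could be applied to the column with `df['description'].map(clean_description)`; values without the label, such as `'N/A'`, pass through unchanged.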


## 7) Export dataframe as CSV

In [8]:
import os

# Export data to csv
path = '/Users/amyzhang/Documents/Projects/craigslist-webscrape/data'
output_file = 'roland_amp_listings.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
df.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'roland_amp_listings.csv' has been created successfully.


In [9]:
df.shape

(13, 5)

# 2. Experiment: Selenium Automation 

A plain `requests.get()` retrieves only the initial HTML, which is why only the first 13 listings appeared. To handle content that JavaScript loads after the page renders, we tried rendering the page fully with Selenium:

* Selenium simulates a real browser, so the JavaScript executes and the dynamically loaded listings become visible.  
* Once the page is fully rendered (after scrolling or waiting for all content), we can grab the complete HTML for further parsing.


In [10]:
# 1️⃣ Install packages if you haven't already
# (.venv) $ uv add selenium
# Download chromedriver and make sure it's in your PATH or same folder

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# import pandas as pd

# 2️⃣ Configure headless Chrome
chrome_options = Options()
chrome_options.add_argument("--headless")  # no visible window
chrome_options.add_argument("--disable-gpu")  # often recommended
chrome_options.add_argument("--no-sandbox")   # disables Chrome's sandbox; sometimes needed in containers, not a safety measure

driver = webdriver.Chrome(options=chrome_options)

# 3️⃣ Go to Craigslist search page
url = "https://newyork.craigslist.org/search/brk/msa?query=roland%20amp#search=2~gallery~0"
driver.get(url)

# 4️⃣ Grab listing titles
titles = driver.find_elements(By.CSS_SELECTOR, ".result-title")  # CSS class for titles
title_list = [t.text for t in titles]

# 5️⃣ Optional: grab links
links = [t.get_attribute("href") for t in titles]

# 6️⃣ Put into dataframe
df_selenium = pd.DataFrame({"title": title_list, "link": links})
print(df_selenium.head(15))

# 7️⃣ Close browser
driver.quit()


Empty DataFrame
Columns: [title, link]
Index: []


In [11]:
df_selenium.shape

(0, 2)

### Lesson 📌
Scrapers built on tools like Selenium can break when the webpage structure changes, because they locate content through specific HTML elements and class names.
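
One hedged way to notice this kind of breakage early is to check, before parsing, whether the class names a scraper depends on still appear in the page. The sketch below does that with the standard library only. `ClassCounter`, `missing_classes`, and the sample HTML are illustrative and not part of the notebook; the class names `result-title` and `cl-static-search-result` are the ones used in the cells above.

```python
from collections import Counter
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count how many elements carry each CSS class name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = dict(attrs).get("class") or ""
        self.counts.update(classes.split())

def missing_classes(html, expected):
    """Return the expected class names that match no element in html."""
    counter = ClassCounter()
    counter.feed(html)
    return [name for name in expected if counter.counts[name] == 0]

# Made-up page fragment for illustration only
SAMPLE_HTML = """
<ol>
  <li class="cl-static-search-result"><div class="title">Example amp</div></li>
  <li class="cl-static-search-result"><div class="title">Another amp</div></li>
</ol>
"""

missing = missing_classes(SAMPLE_HTML, ["result-title", "cl-static-search-result", "title"])
print(missing)
```

A non-empty result is a signal to inspect the page before trusting an empty DataFrame.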


# 3. Adjusted Scraping Strategy
* List View
* JSON Endpoints

🤖 After running a ChatGPT-assisted debugging script, we discovered that Craigslist now delivers listing data as **JSON embedded inside a `<script>` tag** rather than directly in the visible HTML.  

To adapt, we:

1. Built a more **robust scraper** that doesn’t rely on fragile HTML element classes or structures.  
2. Targeted the **embedded JSON** inside the `<script>` tag with ID `ld_searchpage_results`.  
3. Parsed this JSON to extract listing details—title, price, location, and link—directly, avoiding issues with dynamically loaded or frequently changing HTML markup.
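
The steps above can be tried offline first. The following is a minimal sketch that uses only the standard library instead of BeautifulSoup, run against a made-up HTML snippet; `ScriptJSONExtractor` and `SAMPLE_HTML` are hypothetical names, and the JSON shape is a reduced version of what the notebook prints further down.

```python
import json
from html.parser import HTMLParser

class ScriptJSONExtractor(HTMLParser):
    """Collect the text content of the <script> tag with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)

# Made-up page for illustration only
SAMPLE_HTML = """
<html><body>
<script id="ld_searchpage_results" type="application/ld+json">
{"itemListElement": [{"item": {"name": "Example amp", "offers": {"price": "150.00"}}}]}
</script>
</body></html>
"""

parser = ScriptJSONExtractor("ld_searchpage_results")
parser.feed(SAMPLE_HTML)
data = json.loads("".join(parser.chunks))
titles = [entry["item"]["name"] for entry in data["itemListElement"]]
print(titles)
```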


In [18]:
#import requests
#from bs4 import BeautifulSoup
import json
#import pandas as pd
import time

base_url = "https://newyork.craigslist.org/search/brk/msa"
query_params = {
    "purveyor": "owner",
    "query": "amp",
    "view": "list"
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}

listings = []
results_per_page = 120
max_pages = 3

for page in range(max_pages):
    offset = page * results_per_page
    query_params['s'] = offset

    response = requests.get(base_url, params=query_params, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch page {page + 1}")
        continue

    soup = BeautifulSoup(response.text, 'html.parser')

    # Get embedded JSON with listing data
    script_tag = soup.select_one("script#ld_searchpage_results")
    if not script_tag:
        print(f"JSON data not found on page {page + 1}")
        break

    json_text = script_tag.string

    data = json.loads(json_text)
    items = data.get('itemListElement', [])

    if not items:
        print(f"No listings found in JSON on page {page + 1}")
        break

    for entry in items:
        item = entry.get('item', {})
        name = item.get('name', 'N/A')
        url = item.get('url', 'N/A')
        offers = item.get('offers', {})
        price = offers.get('price', 'N/A')
        location_info = offers.get('availableAtOrFrom', {}).get('address', {})
        city = location_info.get('addressLocality', 'N/A')
        region = location_info.get('addressRegion', 'N/A')
        full_location = f"{city}, {region}" if city != 'N/A' else 'N/A'

        listings.append({
            'title': name,
            'price': price,
            'location': full_location,
            'link': url
        })

    print(f"Scraped page {page + 1}, total listings so far: {len(listings)}")
    time.sleep(2)

df_list = pd.DataFrame(listings)
print(df_list.head(20))
print(f"Total listings scraped: {len(df_list)}")


Scraped page 1, total listings so far: 166
Scraped page 2, total listings so far: 333
Scraped page 3, total listings so far: 499
                                                title    price      location  \
0               Fishman LoudBox Artist with amp cover   150.00  Brooklyn, NY   
1      Tone King Falcon Amp with Cover and Footswitch  1000.00  Brooklyn, NY   
2                  Evans FET 500 - Amp -- pedal steel   400.00  Brooklyn, NY   
3                      Ampeg Gemini 2 - Amp Amplifier   700.00  Brooklyn, NY   
4          Roland KC60 1x10 40W Keyboard Amp For Sale   125.00  Brooklyn, NY   
5                           Wells amp NYC 5:00 shadow   750.00  Brooklyn, NY   
6   Electric guitar with fender amp (all wires inc...   350.00  Brooklyn, NY   
7            Caroline The Blues “Expensive Amp” Pedal   100.00  Brooklyn, NY   
8   Orange AD30TC 30W 2x12 Tube Combo Amp – Loud, ...   800.00  Brooklyn, NY   
9          Roland KC60 1x10 40W Keyboard Amp For Sale   125.00  Brookly

In [19]:
for entry in items[:1]:
    print(json.dumps(entry, indent=2))


{
  "@type": "ListItem",
  "item": {
    "@context": "http://schema.org",
    "name": "Fishman LoudBox Artist with amp cover",
    "description": "",
    "offers": {
      "priceCurrency": "USD",
      "price": "150.00",
      "@type": "Offer",
      "availableAtOrFrom": {
        "@type": "Place",
        "address": {
          "streetAddress": "",
          "addressRegion": "NY",
          "addressLocality": "Brooklyn",
          "@type": "PostalAddress",
          "postalCode": "",
          "addressCountry": ""
        },
        "geo": {
          "@type": "GeoCoordinates",
          "latitude": 40.6205012121211,
          "longitude": -73.9239022354266
        }
      }
    },
    "@type": "Product",
    "image": [
      "https://images.craigslist.org/00X0X_iSLHoqzJA5K_12S0KD_600x450.jpg",
      "https://images.craigslist.org/00606_9HQgVy2lHia_0Ko0zD_600x450.jpg",
      "https://images.craigslist.org/00303_5eBszzoZIyF_0Pd0yN_600x450.jpg",
      "https://images.craigslist.org/00y0

# 4. Debugging & Key Insight

a) Debugging Script (Claude AI)

In [23]:
# import requests
# from bs4 import BeautifulSoup
# import json

base_url = "https://newyork.craigslist.org/search/brk/msa"
query_params = {
    "purveyor": "owner",
    "query": "amp",
    "view": "list"
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}

response = requests.get(base_url, params=query_params, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

print("=== DEBUGGING CRAIGSLIST STRUCTURE ===\n")

# Check for JSON data
script_tag = soup.select_one("script#ld_searchpage_results")
if script_tag:
    try:
        data = json.loads(script_tag.string)
        items = data.get('itemListElement', [])
        print(f"JSON data found: {len(items)} items")
        if items:
            first_item = items[0].get('item', {})
            print("Sample JSON item keys:", list(first_item.keys()))
            print("Sample item data:", json.dumps(first_item, indent=2)[:500] + "...")
    except (json.JSONDecodeError, TypeError):  # invalid JSON, or an empty script tag
        print("JSON parsing failed")
else:
    print("No JSON script tag found")

print("\n" + "="*50 + "\n")

# Look for different possible container structures
containers = [
    ("li.result-row", "Traditional result rows"),
    ("li.cl-search-result", "New cl-search-result"),
    (".result-info", "Result info divs"),
    (".cl-search-result", "Search result class"),
    ("[data-pid]", "Elements with data-pid"),
    (".gallery .result-image", "Gallery results"),
    (".rows .result-row", "Nested result rows")
]

for selector, description in containers:
    elements = soup.select(selector)
    print(f"{description} ({selector}): {len(elements)} found")
    
    if elements:
        print(f"  Sample element structure:")
        sample = elements[0]
        print(f"    Classes: {sample.get('class', [])}")
        print(f"    Attributes: {list(sample.attrs.keys())}")
        
        # Look for links within this element
        links = sample.select("a[href]")
        print(f"    Links inside: {len(links)}")
        for i, link in enumerate(links[:3]):  # Show first 3 links
            href = link.get('href')
            text = link.get_text(strip=True)[:50]
            print(f"      Link {i+1}: {href} ('{text}')")
        print()

print("\n" + "="*50 + "\n")

# Look for any links that might be listings (containing "/brk/" or similar patterns)
all_links = soup.select("a[href]")
listing_links = []
for link in all_links:
    href = link.get('href', '')
    if any(pattern in href for pattern in ['/brk/', '/msa/', 'craigslist.org']):
        if href not in [l[0] for l in listing_links]:  # Avoid duplicates
            listing_links.append((href, link.get_text(strip=True)[:50]))

print(f"Potential listing links found: {len(listing_links)}")
for i, (href, text) in enumerate(listing_links[:10]):  # Show first 10
    print(f"  {i+1}. {href} ('{text}')")

print("\n" + "="*50 + "\n")

# Save a sample of the HTML for manual inspection
with open('craigslist_sample.html', 'w', encoding='utf-8') as f:
    f.write(str(soup.prettify()))
print("Full HTML saved to 'craigslist_sample.html' for manual inspection")

# Also print first 2000 characters of the main content area
main_content = soup.select_one('.cl-results-page') or soup.select_one('#search-results') or soup.select_one('.content')
if main_content:
    print(f"\nFirst 2000 chars of main content:")
    print(str(main_content)[:2000] + "...")

=== DEBUGGING CRAIGSLIST STRUCTURE ===

JSON data found: 168 items
Sample JSON item keys: ['@context', 'image', 'name', 'offers', 'description', '@type']
Sample item data: {
  "@context": "http://schema.org",
  "image": [
    "https://images.craigslist.org/00X0X_iSLHoqzJA5K_12S0KD_600x450.jpg",
    "https://images.craigslist.org/00606_9HQgVy2lHia_0Ko0zD_600x450.jpg",
    "https://images.craigslist.org/00303_5eBszzoZIyF_0Pd0yN_600x450.jpg",
    "https://images.craigslist.org/00y0y_8VvfUSuikv4_10l0zD_600x450.jpg",
    "https://images.craigslist.org/00r0r_enYHpdNKbPa_0KD0By_600x450.jpg"
  ],
  "name": "Fishman LoudBox Artist with amp cover",
  "offers": {
    "priceC...


Traditional result rows (li.result-row): 0 found
New cl-search-result (li.cl-search-result): 0 found
Result info divs (.result-info): 0 found
Search result class (.cl-search-result): 0 found
Elements with data-pid ([data-pid]): 0 found
Gallery results (.gallery .result-image): 0 found
Nested result rows (.rows .result-ro

---
###  Key Insights & Corrections (Claude AI🤖): 

"Here's your fixed scraper: This new approach:

1. **Uses the JSON data** for clean title/price/location info (which you were getting correctly)
2. **Finds actual listing links** using the pattern `/brk/msg/d/` which identifies real listing URLs
3. **Matches them intelligently** by comparing words in the title with words in the link text
4. **Has a fallback method** that extracts info directly from the link text if the JSON matching fails

The key insight from your debug output is that Craigslist is no longer using the old HTML structure with `li.result-row` containers, but the listing links ARE still there - they're just not wrapped in the expected containers. The pattern `/brk/msg/d/` in the URLs is the key to identifying actual listings vs navigation links.

Try this and let me know how many links you get! The debug showed 184 potential listing links, so you should get close to that number with working URLs."
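
Points 2 and 3 of the quoted approach can be isolated and tested without any network access. This is a simplified sketch, not the notebook's scraper: it matches only the `/brk/msg/d/` pattern (the cell below also accepts other area codes), and `is_listing_link`, `titles_match`, and the sample paths are hypothetical.

```python
import re
from urllib.parse import urljoin

BASE = "https://newyork.craigslist.org"
LISTING_PATTERN = re.compile(r"/brk/msg/d/")

def is_listing_link(href):
    """True if the href looks like an individual listing rather than navigation."""
    return bool(LISTING_PATTERN.search(href))

def titles_match(json_title, link_text, min_overlap=3, ratio=0.6):
    """Word-overlap heuristic: enough shared words counts as a match."""
    name_words = set(json_title.lower().split())
    link_words = set(link_text.lower().split())
    threshold = min(min_overlap, len(name_words) * ratio)
    return len(name_words & link_words) >= threshold

# Made-up hrefs for illustration only
hrefs = [
    "/brk/msg/d/brooklyn-example-amp/1234567890.html",
    "/search/brk/msa?query=amp",
    "https://newyork.craigslist.org/brk/msg/d/brooklyn-other-amp/1234567891.html",
]
# urljoin leaves absolute URLs unchanged and resolves relative ones against BASE
listing_urls = [urljoin(BASE, h) for h in hrefs if is_listing_link(h)]
print(listing_urls)
```

Two caveats follow from the heuristic itself: an empty title matches any link text (the threshold becomes 0), and listings with identical titles cannot be told apart, so each is paired with whichever unmatched link comes first.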

---

In [24]:
# import requests
# from bs4 import BeautifulSoup
# import json
# import pandas as pd
# import time
import re

base_url = "https://newyork.craigslist.org/search/brk/msa"
query_params = {
    "purveyor": "owner",
    "query": "amp",
    "view": "list"
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}

listings = []
results_per_page = 120
max_pages = 3

for page in range(max_pages):
    offset = page * results_per_page
    query_params['s'] = offset
    
    response = requests.get(base_url, params=query_params, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch page {page + 1}")
        continue
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Get JSON data for titles, prices, locations
    script_tag = soup.select_one("script#ld_searchpage_results")
    json_items = []
    if script_tag:
        try:
            data = json.loads(script_tag.string)
            json_items = data.get('itemListElement', [])
        except (json.JSONDecodeError, TypeError):  # invalid JSON, or an empty script tag
            print("JSON parsing failed")
    
    # Get all listing links from HTML (they're not in the old containers anymore)
    listing_links = []
    all_links = soup.select("a[href]")
    
    for link in all_links:
        href = link.get('href', '')
        # Match actual listing URLs (contain /brk/msg/d/ pattern)
        if re.match(r'.*/(brk|nyc|que|wch|lgi)/msg/d/', href):
            text = link.get_text(strip=True)
            # Make sure it has actual content and isn't a navigation link
            if len(text) > 5 and text.lower() not in ['next', 'prev', 'previous', 'brooklyn', 'queens']:
                listing_links.append((href, text))
    
    print(f"Found {len(json_items)} JSON items and {len(listing_links)} listing links")
    
    # Method 1: If we have both JSON and links, try to match them
    if json_items and listing_links:
        for i, entry in enumerate(json_items):
            item = entry.get('item', {})
            name = item.get('name', 'N/A')
            offers = item.get('offers', {})
            price = offers.get('price', 'N/A')
            location_info = offers.get('availableAtOrFrom', {}).get('address', {})
            city = location_info.get('addressLocality', 'N/A')
            region = location_info.get('addressRegion', 'N/A')
            full_location = f"{city}, {region}" if city != 'N/A' else 'N/A'
            
            # Try to find matching link by title similarity
            link = 'N/A'
            name_words = set(name.lower().split())
            
            for href, link_text in listing_links:
                link_words = set(link_text.lower().split())
                # If there's significant word overlap, it's probably a match
                if len(name_words.intersection(link_words)) >= min(3, len(name_words) * 0.6):
                    link = href if href.startswith('http') else f"https://newyork.craigslist.org{href}"
                    # Remove this link so we don't match it again
                    listing_links.remove((href, link_text))
                    break
            
            listings.append({
                'title': name,
                'price': price,
                'location': full_location,
                'link': link
            })
    
    # Method 2: Fallback - if JSON matching doesn't work well, use links directly
    elif listing_links:
        for href, text in listing_links:
            # Extract basic info from link text if possible
            title = text
            price = 'N/A'
            location = 'Brooklyn, NY'  # Default since we're searching Brooklyn
            
            # Try to extract price from link text
            price_match = re.search(r'\$[\d,]+', text)
            if price_match:
                price = price_match.group()
                title = text.replace(price, '').strip()
            
            link = href if href.startswith('http') else f"https://newyork.craigslist.org{href}"
            
            listings.append({
                'title': title,
                'price': price,
                'location': location,
                'link': link
            })
    
    print(f"Scraped page {page + 1}, total listings so far: {len(listings)}")
    time.sleep(2)

# Remove duplicates based on link
df_list_2 = pd.DataFrame(listings)
if not df_list_2.empty:
    df_list_2 = df_list_2.drop_duplicates(subset=['link'], keep='first')

print(df_list_2.head(20))
print(f"Total listings scraped: {len(df_list_2)}")

# Check success rate
links_found = df_list_2[df_list_2['link'] != 'N/A']
print(f"Links successfully extracted: {len(links_found)}")
print(f"Percentage with links: {len(links_found)/len(df_list_2)*100:.1f}%")

# Show a few sample links to verify they work
print(f"\nSample working links:")
for i, link in enumerate(links_found['link'].head(5)):
    print(f"  {i+1}. {link}")

Found 166 JSON items and 182 listing links
Scraped page 1, total listings so far: 166
Found 167 JSON items and 183 listing links
Scraped page 2, total listings so far: 333
Found 166 JSON items and 181 listing links
Scraped page 3, total listings so far: 499
                                                title    price      location  \
0               Fishman LoudBox Artist with amp cover   150.00  Brooklyn, NY   
1      Tone King Falcon Amp with Cover and Footswitch  1000.00  Brooklyn, NY   
2                  Evans FET 500 - Amp -- pedal steel   400.00  Brooklyn, NY   
3                      Ampeg Gemini 2 - Amp Amplifier   700.00  Brooklyn, NY   
4          Roland KC60 1x10 40W Keyboard Amp For Sale   125.00  Brooklyn, NY   
5                           Wells amp NYC 5:00 shadow   750.00  Brooklyn, NY   
6   Electric guitar with fender amp (all wires inc...   350.00  Brooklyn, NY   
7            Caroline The Blues “Expensive Amp” Pedal   100.00  Brooklyn, NY   
8   Orange AD30TC 30W 

In [25]:
df_list_2.shape

(165, 4)

In [26]:
df_list_2.head()

Unnamed: 0,title,price,location,link
0,Fishman LoudBox Artist with amp cover,150.0,"Brooklyn, NY",https://newyork.craigslist.org/brk/msg/d/brook...
1,Tone King Falcon Amp with Cover and Footswitch,1000.0,"Brooklyn, NY",https://newyork.craigslist.org/brk/msg/d/brook...
2,Evans FET 500 - Amp -- pedal steel,400.0,"Brooklyn, NY",https://newyork.craigslist.org/brk/msg/d/brook...
3,Ampeg Gemini 2 - Amp Amplifier,700.0,"Brooklyn, NY",https://newyork.craigslist.org/brk/msg/d/brook...
4,Roland KC60 1x10 40W Keyboard Amp For Sale,125.0,"Brooklyn, NY",https://newyork.craigslist.org/brk/msg/d/brook...
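
The log above reports 166, 167, and 166 JSON items for the three pages, yet `df_list_2` holds 165 rows after de-duplication. One possible reading, not verified here, is that the `s` offset had little effect and the three requests returned mostly the same listings; another is that many rows shared one `link` value such as `'N/A'`. A small check like the following could help tell these apart. `new_items_per_page` and the toy data are illustrative.

```python
def new_items_per_page(pages):
    """For each page (a list of hashable keys), count keys not seen on earlier pages."""
    seen = set()
    fresh_counts = []
    for keys in pages:
        fresh = set(keys) - seen
        fresh_counts.append(len(fresh))
        seen |= fresh
    return fresh_counts

# Toy data: page 2 repeats page 1, page 3 adds a single new key
pages = [["a", "b", "c"], ["a", "b", "c"], ["b", "c", "d"]]
counts = new_items_per_page(pages)
print(counts)
```

In the notebook, the keys could be the titles or links collected on each page; counts near zero for pages 2 and 3 would indicate that pagination is not advancing.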


# 5. Final Multi-Page Scrape
(Descriptions from individual links are merged with `df_list_2` into a new DataFrame, `df_final`)

In [27]:
# import requests
# from bs4 import BeautifulSoup
# import pandas as pd
# import time
# import re

def scrape_listing_description(url, headers, max_retries=3):
    """
    Scrape the description text from a single Craigslist listing page
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code != 200:
                return f"Error: HTTP {response.status_code}"
            
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Try multiple selectors for the description text
            description_selectors = [
                "#postingbody",           # Most common
                ".userbody",              # Alternative
                ".posting-body",          # Another variant
                ".postinginfos",          # Sometimes used
                "[id*='postingbody']",    # Partial match
                ".section.description"    # New structure
            ]
            
            description = None
            for selector in description_selectors:
                desc_element = soup.select_one(selector)
                if desc_element:
                    # Clean up the text
                    description = desc_element.get_text(separator=' ', strip=True)
                    # Remove common Craigslist artifacts
                    description = re.sub(r'QR Code Link to This Post', '', description)
                    description = re.sub(r'do NOT contact me with unsolicited services or offers', '', description)
                    description = description.strip()
                    break
            
            return description or "No description found"
            
        except Exception as e:
            if attempt == max_retries - 1:
                return f"Error: {str(e)}"
            time.sleep(1)

# Assuming you already have df_list_2 from your working scraper
# Let's scrape descriptions for the first 50 listings with valid links

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}

# Get first 50 listings that have valid links
valid_listings = df_list_2[df_list_2['link'] != 'N/A'].head(50).copy()
print(f"Scraping descriptions for {len(valid_listings)} listings...")

# Scrape descriptions
descriptions_data = []
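# enumerate gives a 1-based progress counter; the DataFrame index may have gaps after drop_duplicates
for position, (_, row) in enumerate(valid_listings.iterrows(), start=1):
    url = row['link']
    print(f"Scraping {position}/{len(valid_listings)}: {row['title'][:50]}...")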
    
    description = scrape_listing_description(url, headers)
    
    descriptions_data.append({
        'link': url,
        'description': description
    })
    
    # Be nice to Craigslist servers
    time.sleep(1.5)

# Create descriptions DataFrame
df_descriptions = pd.DataFrame(descriptions_data)

# Merge back with original data
df_final = df_list_2.merge(df_descriptions, on='link', how='left')

# Fill missing descriptions
df_final['description'] = df_final['description'].fillna('No description scraped')

print(f"\n{'='*60}")
print(f"RESULTS:")
print(f"Original listings: {len(df_list_2)}")
print(f"Descriptions scraped: {len(df_descriptions)}")
print(f"Final merged dataset: {len(df_final)}")

# Show sample with descriptions
sample_with_desc = df_final[df_final['description'].str.len() > 20].head(3)
for idx, row in sample_with_desc.iterrows():
    print(f"\n--- Sample Listing ---")
    print(f"Title: {row['title']}")
    print(f"Price: {row['price']}")
    print(f"Description: {row['description'][:200]}...")

# Save results
df_final.to_csv('craigslist_with_descriptions.csv', index=False)
print(f"\nSaved to 'craigslist_with_descriptions.csv'")

Scraping descriptions for 50 listings...
Scraping 1/50: Fishman LoudBox Artist with amp cover...
Scraping 2/50: Tone King Falcon Amp with Cover and Footswitch...
Scraping 3/50: Evans FET 500 - Amp -- pedal steel...
Scraping 4/50: Ampeg Gemini 2 - Amp Amplifier...
Scraping 5/50: Roland KC60 1x10 40W Keyboard Amp For Sale...
Scraping 6/50: Wells amp NYC 5:00 shadow...
Scraping 7/50: Electric guitar with fender amp (all wires include...
Scraping 8/50: Caroline The Blues “Expensive Amp” Pedal...
Scraping 9/50: Orange AD30TC 30W 2x12 Tube Combo Amp – Loud, Migh...
Scraping 10/50: Roland KC60 1x10 40W Keyboard Amp For Sale...
Scraping 11/50: Stewart Audio PA-50B 100 watt Power Amp...
Scraping 12/50: 🟢 Stereo Headphone tube Amp 5 tubes preamp hi-end ...
Scraping 13/50: Amphion Amp500 Amp 500 Amp-500 One15 Two15 one18 t...
Scraping 14/50: Roland KC60 1x10 40W Keyboard Amp For Sale...
Scraping 15/50: Visible Signals Dual Distrib 2 channel video distr...
Scraping 16/50: Roland KC60 1x10 40W Keyb

In [29]:
# Export data to csv
path = '/Users/amyzhang/Documents/Projects/craigslist-webscrape/data'
output_file = 'roland_amp_listings_FULL.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
df_list_2.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'roland_amp_listings_FULL.csv' has been created successfully.


In [30]:
df_list_2.shape

(165, 4)

## Additional Notes

### What is Web Scraping?

Web scraping is the process of programmatically requesting data from web pages and organizing it into a structured format, like a DataFrame, for analysis and visualization. In other words, your code talks to the website (usually via HTTP requests), retrieves information, and then arranges it so you can actually work with it.

### Key Lessons Learned

1. **Traditional web scraping is brittle.** Small changes to the website structure can break your code entirely. Even tools like Selenium, a browser automation framework that controls a real browser to mimic human actions, are vulnerable.  
2. **There is no escaping web scraping.** LLMs are pre-trained, but for up-to-date information they rely on retrieval. **However**, the LLM itself doesn’t scrape: it sends a query to a search engine (e.g., Brave), which crawls and scrapes the web, then returns processed results (often in JSON) that the LLM can use to generate a response.  
3. **Hypothesis:** Directly querying a commercial API (e.g., eBay’s API) is generally more compute-efficient and cost-effective than having an LLM make repeated search queries. However, an LLM + search engine approach can serve as a useful fallback when APIs are unavailable.  
4. **Open Questions:** Are more websites actively resisting web scraping? If so, how? Can LLMs circumvent these protections by using search engines whose web crawling is designed to mimic human browsing behavior?

---

### 🤖 ChatGPT's Thoughts re: Open Questions

#### **Are more websites resistant to web scraping? Why?**

"Yes. Many sites actively try to prevent scraping to protect **intellectual property, server load, and user data privacy**. Techniques include:

- CAPTCHAs  
- Rate limiting  
- Dynamic content loading (AJAX/JSON)  
- Bot detection scripts  

From a developer’s perspective, this increases the brittleness of traditional scraping pipelines — what worked last week may fail today."

---

#### **Are LLMs able to circumvent this via search engines?**

"LLMs themselves don’t bypass protections; they rely on external search engines that have legitimate crawling capabilities. These crawlers are designed to **mimic human browsing patterns** (delays, page rendering, link-following) so that the data they return is accessible despite protections.

Essentially, the LLM delegates the “hard work” of scraping to a crawler that’s already designed to handle the anti-scraping measures. This makes the pipeline more robust without the LLM itself violating rules or scraping directly."