# 📚 Module 1: Static HTML Scraping

**Learn web scraping fundamentals using BeautifulSoup**

In this notebook, you'll learn how to:
- Fetch HTML using `requests`
- Parse HTML with BeautifulSoup
- Extract structured data
- Validate data with Pydantic

**Target Website**: [Bonbanh.com](https://bonbanh.com) (Vietnamese car marketplace)

## 🔧 Setup
Install required packages (run once)

In [None]:
!pip install requests beautifulsoup4 pydantic -q
print("✅ Packages installed!")

---
## Step 1: Basic HTTP Request

**Goal**: Fetch HTML content from a website

**Concepts**:
- HTTP GET request
- Response object
- Basic HTML structure

In [None]:
import requests

# Target URL - page 1 of car listings
url = "https://bonbanh.com/oto/page,1?q="

print("Fetching webpage...")
print(f"URL: {url}\n")

# Send GET request (the timeout stops the call from hanging on a stalled server)
response = requests.get(url, timeout=10)

# Check if request was successful
print(f"Status Code: {response.status_code}")  # 200 means success
print(f"Content Type: {response.headers.get('content-type', 'unknown')}\n")

# Print first 500 characters of HTML
print("=" * 50)
print("HTML PREVIEW (first 500 characters):")
print("=" * 50)
print(response.text[:500])
print("...")

# Full HTML is available in response.text
print(f"\nTotal HTML length: {len(response.text)} characters")

### 💡 Key Takeaways

- `requests.get(url)` sends an HTTP GET request
- `response.status_code == 200` means the request succeeded
- `response.text` contains the raw HTML
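
The takeaways above assume the server answers promptly and with a 200. A slightly more defensive fetch might look like the sketch below. The helper name `fetch_html`, the 10-second default timeout, and the User-Agent string are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the page HTML, raising an exception on HTTP errors.

    The function name, timeout value, and User-Agent string are
    illustrative choices for this sketch.
    """
    headers = {"User-Agent": "static-scraping-tutorial/0.1"}
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of quietly
    # returning an error page as if it were real content.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://bonbanh.com/oto/page,1?q=")` would then either return the HTML string or raise, which makes failures visible early instead of surfacing later as "zero listings found".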

---
## Step 2: Parse HTML with BeautifulSoup

**Goal**: Parse HTML and extract car titles by looking up elements by tag name

**Concepts**:
- BeautifulSoup HTML parser
- Finding elements by tag name
- Extracting text from elements

In [None]:
from bs4 import BeautifulSoup

# Fetch the webpage
url = "https://bonbanh.com/oto/page,1?q="
print(f"Fetching: {url}\n")

response = requests.get(url)
html_content = response.content  # Raw HTML bytes

# Parse HTML with BeautifulSoup
# 'html.parser' is Python's built-in parser (no extra install needed)
soup = BeautifulSoup(html_content, 'html.parser')

print("=" * 50)
print("EXTRACTING CAR TITLES")
print("=" * 50)

# Find all <h3> tags which contain the car titles
# On Bonbanh, each car listing has a <h3> with the title
h3_elements = soup.find_all('h3')

print(f"Found {len(h3_elements)} car listings\n")

# Extract and print the text from each title
for i, h3_element in enumerate(h3_elements, 1):
    title_text = h3_element.get_text(strip=True)  # strip=True removes extra whitespace
    print(f"{i:2d}. {title_text}")

print("\n✅ Successfully extracted car titles!")

### 💡 Key Takeaways

- `BeautifulSoup(html, 'html.parser')` creates a parse tree
- `soup.find_all('tag')` finds all elements with that tag
- `.get_text(strip=True)` extracts text content
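
Tag-name lookups match every element with that tag, which can include headings that are not car titles. BeautifulSoup also accepts CSS selectors through `select()`. The sketch below compares the two on a small inline HTML snippet. The markup and the class names `car-item` and `ad` are invented for illustration and are not taken from the real Bonbanh page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real Bonbanh page differs.
html = """
<ul>
  <li class="car-item"><h3><a href="/xe-1">Toyota Vios 2019</a></h3></li>
  <li class="car-item"><h3><a href="/xe-2">Honda City 2021</a></h3></li>
  <li class="ad"><h3>Sponsored</h3></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag-name lookup: every <h3>, including the sponsored one.
by_tag = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]

# CSS selector: only links inside an <h3> inside <li class="car-item">.
by_css = [a.get_text(strip=True) for a in soup.select("li.car-item h3 a")]

print(by_tag)
print(by_css)
```

The selector version is narrower, so it is less likely to pick up unrelated headings, but it depends on class names that you would need to confirm in the browser's developer tools first.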

---
## Step 3: Extract Multiple Data Fields

**Goal**: Extract title, price, URL, and year from car listings

**Concepts**:
- Finding elements by tag name and class
- Extracting attributes (href)
- Regular expressions for pattern matching
- Storing data in dictionaries

In [None]:
import re
from urllib.parse import urljoin

# Fetch and parse page 2 of the listings
url = "https://bonbanh.com/oto/page,2?q="
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all list items (li) that contain car listings
# On Bonbanh, car listings are in <li> elements with <h3> tags
all_lis = soup.find_all('li')
car_listings = [li for li in all_lis if li.find('h3')]

print("=" * 70)
print("EXTRACTING STRUCTURED DATA")
print("=" * 70)

cars = []  # List to store all car data

# Process each car listing (limit to first 5 for demo)
for container in car_listings[:5]:
    # Extract title from H3
    h3_elem = container.find('h3')
    if not h3_elem:
        continue

    title = h3_elem.get_text(strip=True)

    # Extract URL from link in H3
    link = h3_elem.find('a')
    url_raw = link.get('href', '') if link else ''
    # urljoin handles absolute URLs and relative paths with or without a leading slash
    full_url = urljoin("https://bonbanh.com/", url_raw) if url_raw else ''
    
    # Extract price (look for price class or text)
    price_elem = container.find('div', class_='price') or container.find('span', class_='price')
    price = price_elem.get_text(strip=True) if price_elem else "Contact"
    
    # Extract year using regex from title
    year = 0
    year_match = re.search(r'\b(20\d{2}|19\d{2})\b', title)
    if year_match:
        year = int(year_match.group(1))
    
    # Store in dictionary
    car_data = {
        "title": title,
        "price": price,
        "url": full_url,
        "year": year
    }
    
    cars.append(car_data)

# Print results
print(f"\nExtracted data for {len(cars)} cars:\n")

for i, car in enumerate(cars, 1):
    print(f"--- Car {i} ---")
    print(f"Title: {car['title']}")
    print(f"Price: {car['price']}")
    print(f"Year:  {car['year']}")
    print(f"URL:   {car['url'][:60]}...")
    print()

print("✅ Successfully extracted structured data!")

### 💡 Key Takeaways

- `element.find('a')` finds child elements
- `element.get('href')` extracts attributes
- Use regex `re.search()` for pattern matching
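
The year regex and the URL handling are easy to test in isolation. The sketch below wraps the same pattern in a hypothetical helper, `extract_year`, and uses the standard library's `urljoin` to build absolute URLs. The sample titles and paths are made up.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the notebook: a standalone 19xx or 20xx number.
YEAR_PATTERN = re.compile(r"\b(19\d{2}|20\d{2})\b")


def extract_year(title: str) -> int:
    """Return the first 19xx/20xx year in the title, or 0 if none is found."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else 0


print(extract_year("Toyota Vios 1.5G 2019"))   # 2019
print(extract_year("Mazda 3 Luxury"))          # 0
print(extract_year("Peugeot 2008 GT 2022"))    # 2008 (first match wins)

base = "https://bonbanh.com/"
print(urljoin(base, "/xe-toyota-1"))            # https://bonbanh.com/xe-toyota-1
print(urljoin(base, "xe-toyota-1"))             # https://bonbanh.com/xe-toyota-1
print(urljoin(base, "https://example.com/x"))   # https://example.com/x
```

The third title shows a limitation of taking the first match: a model name that looks like a year (such as "2008") is returned instead of the actual year. Taking the last match, or reading the year from a dedicated page element if one exists, may be more reliable.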

---
## Step 4: Complete Crawler with Pydantic Validation

**Goal**: Add data validation and export to JSON

**Concepts**:
- Pydantic models for data validation
- Type checking
- JSON export

In [None]:
from pydantic import BaseModel, Field
import json

# Define Pydantic model
class CarListing(BaseModel):
    """Represents a single car listing from Bonbanh.com"""
    title: str = Field(..., description="Car title")
    price: str = Field(..., description="Price (as displayed)")
    url: str = Field(..., description="Full URL to listing")
    year: int = Field(default=0, description="Year of manufacture")
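
To see what the model enforces, here is a small sketch that feeds it one acceptable and one unacceptable `year`. The sample values are made up, and the class is repeated only so the snippet runs on its own. It assumes Pydantic's default (non-strict) validation mode.

```python
from pydantic import BaseModel, Field, ValidationError


class CarListing(BaseModel):
    """Same fields as the notebook's model, repeated to keep the sketch self-contained."""
    title: str = Field(..., description="Car title")
    price: str = Field(..., description="Price (as displayed)")
    url: str = Field(..., description="Full URL to listing")
    year: int = Field(default=0, description="Year of manufacture")


# A numeric string is coerced to int in the default validation mode.
ok = CarListing(title="Toyota Vios", price="Contact",
                url="https://bonbanh.com/xe-1", year="2019")
print(ok.year)

# A non-numeric year cannot be coerced, so validation fails.
try:
    CarListing(title="Toyota Vios", price="Contact",
               url="https://bonbanh.com/xe-1", year="unknown")
    rejected = False
except ValidationError as exc:
    rejected = True
    print(f"Rejected: {len(exc.errors())} error(s)")
```

Note that `url: str` only checks that the value is a string. It does not check that the string is a well-formed URL.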

In [None]:
from urllib.parse import urljoin

def fetch_car_listings(page=1):
    """Fetch one page of Bonbanh listings and return validated CarListing objects."""
    url = f"https://bonbanh.com/oto/page,{page}?q="
    print(f"Fetching page {page}...")

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all LI elements with H3 (car listings)
    all_lis = soup.find_all('li')
    car_listings = [li for li in all_lis if li.find('h3')]
    listings = []

    for container in car_listings[:10]:  # Limit to 10 for demo
        try:
            # Extract title from H3
            h3_elem = container.find('h3')
            if not h3_elem:
                continue

            title = h3_elem.get_text(strip=True)

            # Extract URL (urljoin handles absolute URLs and relative paths alike)
            link = h3_elem.find('a')
            url_raw = link.get('href', '') if link else ''
            full_url = urljoin("https://bonbanh.com/", url_raw) if url_raw else ''

            # Extract price
            price_elem = container.find('div', class_='price') or container.find('span', class_='price')
            price = price_elem.get_text(strip=True) if price_elem else "Contact"

            # Extract year using regex from title (0 means "not found")
            year = 0
            year_match = re.search(r'\b(20\d{2}|19\d{2})\b', title)
            if year_match:
                year = int(year_match.group(1))
            
            # Create Pydantic model (validates automatically!)
            car = CarListing(
                title=title,
                price=price,
                url=full_url,
                year=year
            )
            
            listings.append(car)
            
        except Exception as e:
            # Skip invalid listings
            print(f"Skipped listing due to error: {e}")
            continue
    
    return listings

In [None]:
print("=" * 70)
print("COMPLETE WEB SCRAPER WITH PYDANTIC VALIDATION")
print("=" * 70)
print()

# Scrape data
cars = fetch_car_listings(page=1)

print(f"\n✅ Successfully scraped {len(cars)} car listings")

# Convert the Pydantic models to plain dictionaries (ready for json.dump)
car_dicts = [car.model_dump() for car in cars]

# Print sample
if cars:
    print("\n" + "=" * 70)
    print("SAMPLE OUTPUT:")
    print("=" * 70)
    print(cars[0].model_dump_json(indent=2))

print("\n🎉 You've built a working web scraper!")

---
## 🏆 Exercises

Try these challenges:

1. **Add more fields**: Modify `CarListing` to include `location`, `mileage`, `fuel_type`
2. **Multi-page**: Modify `fetch_car_listings` to crawl pages 1-5
3. **Save to JSON file**: Write the results to `car_listings.json`
4. **Filter**: Only include cars newer than 2015
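
As a starting point for exercises 2 and 4, here is one possible shape for a multi-page loop and a year filter. The names `crawl_pages`, `newer_than`, and `fake_fetch` are hypothetical, and the stand-in fetcher returns made-up dictionaries so that the sketch runs without network access.

```python
import time
from typing import Callable, Dict, List

Car = Dict[str, object]


def crawl_pages(fetch_page: Callable[[int], List[Car]],
                first: int = 1, last: int = 5,
                delay_seconds: float = 1.0) -> List[Car]:
    """Call fetch_page for each page number and concatenate the results."""
    results: List[Car] = []
    for page in range(first, last + 1):
        results.extend(fetch_page(page))
        if page < last:
            time.sleep(delay_seconds)  # pause between requests to be polite
    return results


def newer_than(cars: List[Car], year: int) -> List[Car]:
    """Keep cars with year strictly greater than `year`; unknown years (0) drop out."""
    return [car for car in cars if int(car["year"]) > year]


# Stand-in for a real page fetcher, so the sketch runs offline.
def fake_fetch(page: int) -> List[Car]:
    return [{"title": f"Car A{page}", "year": 2010 + page * 2},
            {"title": f"Car B{page}", "year": 0}]


all_cars = crawl_pages(fake_fetch, first=1, last=3, delay_seconds=0)
recent = newer_than(all_cars, 2015)
print(len(all_cars), [car["title"] for car in recent])
```

In the notebook you could pass `fetch_car_listings` in place of `fake_fetch`. Because it returns `CarListing` objects rather than dictionaries, the filter would then compare `car.year` instead of `car["year"]`.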

In [None]:
# Exercise 3 (sample solution): save the car_dicts list from Step 4 to a JSON file
output_file = "car_listings.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(car_dicts, f, ensure_ascii=False, indent=2)

print(f"✅ Saved to {output_file}")
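
A quick way to check an export like this is to read the file back and compare it with the original data. The sketch below does that in a temporary directory with one made-up record. It also shows why `ensure_ascii=False` is useful for Vietnamese text: the characters are written as-is rather than as `\uXXXX` escapes.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record standing in for scraped data.
records = [{"title": "Xe Toyota Vios đời 2019", "price": "Contact",
            "url": "https://bonbanh.com/xe-1", "year": 2019}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "car_listings.json"
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    raw = path.read_text(encoding="utf-8")
    loaded = json.loads(raw)

# With the default ensure_ascii=True, non-ASCII characters are escaped.
escaped = json.dumps(records)

print(loaded == records)   # the data survives the round trip
print("đời" in raw)        # readable Vietnamese in the file
print("đời" in escaped)    # escaped form does not contain the raw characters
```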