# 🕷️ Module 1: Static HTML Scraping

### *"The internet is just text. Let's read it."*

---

**Yo, welcome to web scraping!** 👋

Before we dive deep into fancy stuff, let's get the fundamentals right. Today we're gonna learn how to:

1. **Fetch HTML** → basically asking a website "hey, can I see your code?"
2. **Parse it** → making sense of that messy HTML soup
3. **Extract data** → getting the good stuff we actually need
4. **Validate** → making sure our data isn't garbage

> 💭 *"The best scrapers are lazy scrapers. We write code so we don't have to copy-paste."*

**Target**: [Bonbanh.com](https://bonbanh.com) - a Vietnamese car marketplace

## 🔧 Setup

Run this once. Don't overthink it.

In [None]:
!pip install requests beautifulsoup4 pydantic -q

---

## Step 1: The HTTP Request

### What's actually happening?

When you type a URL in your browser, you're basically saying:

> *"Hey server, give me that page."*

That's a **GET request**. The server responds with HTML. That's it. No magic. 🪄

Let's do the same thing, but with Python.

In [None]:
import requests

# This is our target
url = "https://bonbanh.com/oto/page,1?q="

# Send the request (just like your browser does)
# timeout=10 stops the cell from hanging forever if the server never answers
response = requests.get(url, timeout=10)

# Did it work?
print(f"Status: {response.status_code}")  # 200 = success, 404 = not found, 403 = blocked
print(f"Content-Type: {response.headers.get('content-type')}")
print(f"\nHTML size: {len(response.text):,} characters")

# Let's peek at what we got
print("\n" + "─" * 50)
print("First 300 chars:")
print("─" * 50)
print(response.text[:300])

### 🧠 What you should notice:

| Code | Meaning |
|------|---------|
| `200` | All good, we got the page |
| `403` | Access denied (we might be blocked) |
| `404` | Page doesn't exist |
| `503` | Server is having a bad day |

> **Pro tip**: If you're getting 403s, the website might be blocking automated requests. We'll handle that in Module 3.
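
To make the table concrete, here's a minimal sketch that turns any status code into a short label using Python's standard `http.HTTPStatus`. The helper name `describe_status` is made up for this example; it isn't part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase  # e.g. "Not Found"
    except ValueError:
        phrase = "Unknown"  # code isn't in the standard registry

    if 200 <= code < 300:
        verdict = "success"
    elif 400 <= code < 500:
        verdict = "client error"
    elif 500 <= code < 600:
        verdict = "server error"
    else:
        verdict = "other"
    return f"{code} {phrase} ({verdict})"


# The four codes from the table above
for code in (200, 403, 404, 503):
    print(describe_status(code))
```

In a real scraper you could pass in `response.status_code`, or simply call `response.raise_for_status()`, which raises an exception for any 4xx or 5xx response.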

---

## Step 2: Parsing HTML

### The problem with raw HTML

That HTML we just got? It's a mess. It's like trying to read a book where all the pages are shuffled.

**BeautifulSoup** helps us navigate through this chaos. It turns that string into a tree structure we can actually work with.

Think of it like this:
```
Raw HTML  →  BeautifulSoup  →  Organized Tree
(chaos)        (parser)         (makes sense)
```

In [None]:
from bs4 import BeautifulSoup

# Fetch fresh data (timeout so a slow server can't hang the cell)
response = requests.get("https://bonbanh.com/oto/page,1?q=", timeout=10)

# Parse it
soup = BeautifulSoup(response.content, 'html.parser')

# Now let's find all car titles
# On Bonbanh, each car listing has an <h3> tag with the title
titles = soup.find_all('h3')

print(f"Found {len(titles)} car listings\n")
print("─" * 50)

# Print the first 10
for i, h3 in enumerate(titles[:10], 1):
    print(f"{i:2}. {h3.get_text(strip=True)}")

### 🧠 Key methods you'll use constantly:

```python
soup.find('tag')        # Get first match
soup.find_all('tag')    # Get all matches
element.get_text()      # Get the text inside
element.get('href')     # Get an attribute
element.find('child')   # Find inside an element
```

> **Real talk**: 80% of web scraping is just `find()` and `find_all()`. Master these two.
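
Here's a small offline sketch of those methods so you can try them without hitting a real site. The HTML snippet is invented for this demo and is not copied from Bonbanh; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this demo
html = """
<ul>
  <li><h3><a href="/xe-honda-civic">Honda Civic 2020</a></h3><span class="price">500</span></li>
  <li><h3><a href="/xe-toyota-vios">Toyota Vios 2019</a></h3><span class="price">420</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_h3 = soup.find("h3")                   # first match only
all_h3 = soup.find_all("h3")                 # list of every match
first_title = first_h3.get_text(strip=True)  # text inside the tag
first_link = first_h3.find("a").get("href")  # attribute of a nested tag
prices = [el.get_text(strip=True) for el in soup.find_all("span", class_="price")]

print(first_title, first_link, len(all_h3), prices)
```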

---

## Step 3: Extracting Structured Data

### From chaos to clarity

Titles are cool, but we want more. We want:
- **Title** (what car is this?)
- **Price** (how much?)
- **URL** (link to the listing)
- **Year** (when was it made?)

This is where it gets interesting.

In [None]:
import re  # for regex (pattern matching)
from urllib.parse import urljoin  # joins a base URL with a relative link

BASE_URL = "https://bonbanh.com/"

# Fetch and parse
response = requests.get("https://bonbanh.com/oto/page,2?q=", timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all <li> elements that contain car listings
# (they have an <h3> inside them)
all_items = soup.find_all('li')
car_items = [li for li in all_items if li.find('h3')]

print(f"Found {len(car_items)} car listings\n")

cars = []
for item in car_items[:5]:  # just first 5 for demo
    h3 = item.find('h3')
    if not h3:
        continue

    # Extract title
    title = h3.get_text(strip=True)

    # Extract URL from the <a> tag inside <h3>
    # urljoin works whether or not the href starts with a slash
    link = h3.find('a')
    listing_url = urljoin(BASE_URL, link.get('href', '')) if link else ""

    # Extract price (look for price class)
    # Fallback "Liên hệ" means "contact seller" (no price listed)
    price_el = item.find('div', class_='price') or item.find('span', class_='price')
    price = price_el.get_text(strip=True) if price_el else "Liên hệ"

    # Extract year using regex
    # Pattern: find a 4-digit year starting with 19 or 20
    year_match = re.search(r'\b(19|20)\d{2}\b', title)
    year = int(year_match.group()) if year_match else 0  # 0 = no year found

    cars.append({
        "title": title,
        "price": price,
        "url": listing_url,
        "year": year
    })

# Display results
for i, car in enumerate(cars, 1):
    print(f"🚗 Car {i}")
    print(f"   Title: {car['title'][:50]}...")
    print(f"   Price: {car['price']}")
    print(f"   Year:  {car['year']}")
    print()
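
A quick note on building the listing URL: it means combining the site root with whatever sits in `href`, and links come in different styles. Here's a small sketch of how the standard library's `urljoin` resolves each style. The `href` values below are invented for illustration, so check what the real page actually uses.

```python
from urllib.parse import urljoin

BASE_URL = "https://bonbanh.com/"  # site root, used as the base for every link

# Hypothetical href values; a real page may use any of these styles
hrefs = [
    "/xe-honda-civic-2020",                     # root-relative
    "xe-honda-civic-2020",                      # relative, no leading slash
    "https://bonbanh.com/xe-honda-civic-2020",  # already absolute
]

resolved = [urljoin(BASE_URL, href) for href in hrefs]

for href, full in zip(hrefs, resolved):
    print(f"{href!r:45} -> {full}")
```

All three end up as the same absolute URL, which plain string concatenation would not give you.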

### 🧠 The regex pattern explained:

```python
r'\b(19|20)\d{2}\b'
```

| Part | Meaning |
|------|---------|
| `\b` | Word boundary (so "12020" doesn't match) |
| `(19\|20)` | Starts with 19 or 20 |
| `\d{2}` | Followed by 2 digits |
| `\b` | Another word boundary |

> **Exercise**: What years would this match? What about "Toyota 2025"?
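
When you want to check your answers, here's a minimal sketch that runs the same pattern over a few made-up titles. The helper `find_year` and the sample strings are just for this demo.

```python
import re

YEAR_PATTERN = re.compile(r'\b(19|20)\d{2}\b')


def find_year(text):
    """Return the first 19xx/20xx year in text as an int, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group()) if match else None


# Made-up sample titles covering the interesting cases
samples = [
    "Toyota 2025",          # plain year
    "Honda Civic 1998 AT",  # year in the middle
    "Model 12020",          # five digits, no word boundary before 2020
    "Mazda 3 2.0L",         # engine size, not a year
    "Built in 1850",        # starts with 18, outside the pattern
    "VIN2020X",             # glued to letters, no word boundary
    "Concept 2099",         # matches, even though it's in the future
]

for text in samples:
    print(f"{text!r:25} -> {find_year(text)}")
```

Notice the last case: the pattern accepts any 19xx or 20xx number, so it can't tell a real model year from a future one. That's one reason to validate afterwards.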

---

## Step 4: Data Validation with Pydantic

### Why bother?

Real-world data is messy. You'll get:
- Missing fields
- Wrong types ("2024" as string instead of int)
- Unexpected values

**Pydantic** catches these issues early. It's like a security guard for your data.

> *"Trust no data. Validate everything."*

In [None]:
from pydantic import BaseModel, Field
import json

class CarListing(BaseModel):
    """A validated car listing."""
    title: str = Field(..., min_length=1)
    price: str
    url: str
    year: int = Field(default=0, ge=0, le=2030)  # must be reasonable

# Test it
car = CarListing(
    title="Honda Civic 2020",
    price="500 Triệu",
    url="https://bonbanh.com/xe-honda-civic.html",
    year=2020
)

print("✅ Valid car:")
print(car.model_dump_json(indent=2))
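
That was the happy path. Here's a minimal sketch of what happens with messy input, using a stand-in model called `DemoListing` (a hypothetical name for this example, with the same kind of constraints as `CarListing`).

```python
from pydantic import BaseModel, Field, ValidationError


class DemoListing(BaseModel):
    """Hypothetical stand-in model with CarListing-style constraints."""
    title: str = Field(..., min_length=1)
    year: int = Field(default=0, ge=0, le=2030)


# A numeric string is coerced to int (pydantic's default, non-strict behaviour)
coerced = DemoListing(title="Honda Civic 2020", year="2020")
print(type(coerced.year).__name__, coerced.year)

# An empty title and an out-of-range year are both rejected in one go
bad_fields = []
try:
    DemoListing(title="", year=3000)
except ValidationError as exc:
    bad_fields = sorted(str(err["loc"][0]) for err in exc.errors())
    print("Rejected fields:", bad_fields)
```

Pydantic reports every failing field at once, which makes it easy to log exactly why a listing was skipped.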

In [None]:
# Now let's use it in our scraper
from urllib.parse import urljoin
from pydantic import ValidationError

def scrape_bonbanh(page=1):
    """Scrape car listings from Bonbanh with validation."""
    page_url = f"https://bonbanh.com/oto/page,{page}?q="
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()  # stop on 403/404/503 instead of parsing an error page
    soup = BeautifulSoup(response.content, 'html.parser')

    all_items = soup.find_all('li')
    car_items = [li for li in all_items if li.find('h3')]

    results = []
    for item in car_items[:10]:  # limit for demo
        try:
            h3 = item.find('h3')
            if not h3:
                continue

            title = h3.get_text(strip=True)
            link = h3.find('a')
            # urljoin works whether or not the href starts with a slash
            listing_url = urljoin("https://bonbanh.com/", link.get('href', '')) if link else ""

            # Fallback "Liên hệ" means "contact seller" (no price listed)
            price_el = item.find('div', class_='price') or item.find('span', class_='price')
            price = price_el.get_text(strip=True) if price_el else "Liên hệ"

            year_match = re.search(r'\b(19|20)\d{2}\b', title)
            year = int(year_match.group()) if year_match else 0

            # Validate with Pydantic
            car = CarListing(title=title, price=price, url=listing_url, year=year)
            results.append(car)

        except ValidationError as e:
            # Only validation failures are skipped; real bugs still surface
            print(f"⚠️ Skipped invalid listing: {e}")
            continue

    return results

# Run it
cars = scrape_bonbanh(page=1)
print(f"\n✅ Scraped {len(cars)} valid listings")

if cars:
    print("\nSample:")
    print(cars[0].model_dump_json(indent=2))

---

## 🎯 Save Your Data

Always save your scraped data. You never know when you'll need it.

In [None]:
# Save to JSON
output = [car.model_dump() for car in cars]

with open("car_listings.json", "w", encoding="utf-8") as f:
    json.dump(output, f, ensure_ascii=False, indent=2)

print("💾 Saved to car_listings.json")

# Verify
!head -20 car_listings.json
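
Why `ensure_ascii=False`? Here's a small sketch, under the assumption that your listings contain Vietnamese text, showing the difference it makes. The record and the file name `demo_listings.json` are made up for this demo, and the file lives in a temporary folder so nothing is left behind.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record with non-ASCII text
records = [{"title": "Honda Civic 2020", "price": "500 Triệu", "year": 2020}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_listings.json"

    # Write with ensure_ascii=False so Vietnamese characters stay readable
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    raw_text = path.read_text(encoding="utf-8")

    # Read it back to confirm nothing was lost
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# For comparison: the default (ensure_ascii=True) escapes non-ASCII characters
escaped = json.dumps(records)

print(raw_text)
print(escaped)
```

Both versions load back to the same data; the difference is only whether a human can read the file.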

---

## 🏋️ Practice Time

Don't just read. **Do**.

### Exercise 1: Add more fields
Modify `CarListing` to include:
- `location` (where is the car?)
- `kilometer` (how many km?)
- `fuel_type` (gasoline, diesel, or electric? On the site: xăng, dầu, điện)

### Exercise 2: Multi-page scraping
Modify `scrape_bonbanh()` to accept a range of pages:
```python
def scrape_bonbanh(start_page=1, end_page=5):
    ...
```

### Exercise 3: Filter by year
Only keep cars from 2018 or newer.

### Exercise 4: Different website
Try scraping [xe.chotot.com](https://xe.chotot.com) instead. What's different?

---

## 📝 Summary

| Concept | What you learned |
|---------|------------------|
| `requests` | Fetch HTML from websites |
| `BeautifulSoup` | Parse and navigate HTML |
| `find()` / `find_all()` | Locate elements |
| `get_text()` / `get()` | Extract data |
| `Pydantic` | Validate scraped data |
| `json.dump()` | Save data to file |

### Next up: Module 2

What if the website loads data with JavaScript? `requests` can't see that.

We'll need **Selenium**, a tool that controls a real browser.

*See you there.* ✌️