# Module 2: Dynamic Content Scraping

### *"When JavaScript enters the chat..."*

---

**So here's the thing.** 🤔

Some websites don't give you the data in the initial HTML. They:
1. Send you a mostly empty page
2. Run JavaScript to fetch data
3. Render it in your browser

`requests` only sees step 1. It's blind to JavaScript.

**Solution?** Use a real browser. Automate it. Let it do the JavaScript dance.

> 💭 *"If you can see it in your browser, you can scrape it."*

**Target**: [CafeF VNINDEX](https://cafef.vn/du-lieu/Lich-su-giao-dich-vnindex-1.chn) - Stock market data
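
To make the "`requests` only sees step 1" point concrete, here is a small sketch that counts table rows in un-rendered markup using only the standard library. The `RAW_HTML` string is a made-up shell for illustration, not CafeF's actual response, and `count_rows` is a hypothetical helper.

```python
from html.parser import HTMLParser

# Hypothetical "empty shell" similar to what a JS-driven page might send first.
RAW_HTML = """
<html><body>
  <table id="render-table-owner"></table>
  <script src="/app.js"></script>
</body></html>
"""

class RowCounter(HTMLParser):
    """Count <tr> start tags anywhere in the document."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1

def count_rows(html):
    """Return the number of <tr> tags found in an HTML string."""
    parser = RowCounter()
    parser.feed(html)
    return parser.rows

print(count_rows(RAW_HTML))  # 0: the table exists, but nothing has filled it yet
```

Running the same function on `driver.page_source` after the page has rendered is a quick way to check whether a site really needs a browser.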

## Setup

Colab already has Chrome. We just need to configure it.

In [None]:
# Install dependencies
!apt-get update -qq
!apt-get install -y chromium-chromedriver -qq
!pip install selenium pydantic pandas -q

print("✅ Setup complete!")

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import time

def get_driver():
    """Create a headless Chrome driver for Colab."""
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    return webdriver.Chrome(options=options)

print("✅ Driver ready!")

---

## Step 1: Opening a Browser

### Your first robot browser

Think of Selenium as a puppet master. You tell the browser:
- Go here
- Click this
- Wait for that
- Extract data

And it just... does it. 🎭

In [None]:
driver = get_driver()

url = "https://cafef.vn/du-lieu/Lich-su-giao-dich-vnindex-1.chn"
print(f"Opening: {url}")

driver.get(url)

print("✅ Page loaded!")
print(f"Title: {driver.title}")

driver.quit()
print("Browser closed.")

### 🧠 Basic commands:

```python
driver.get(url)         # Navigate to URL
driver.title            # Get page title
driver.page_source      # Get HTML (after JS ran!)
driver.quit()           # Close browser
```
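
Forgetting `driver.quit()` leaves a Chrome process running in the background. One possible guard is a small context manager. This is only a sketch: `managed_driver` is a hypothetical helper name, and it accepts any zero-argument factory, such as the `get_driver` function from Setup.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always quit it on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the body raises

# Intended usage in this notebook (get_driver comes from the Setup cell):
# with managed_driver(get_driver) as driver:
#     driver.get(url)
#     print(driver.title)
```

This does the same job as the `try` / `finally` pattern used in the later steps, just packaged for reuse.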

---

## Step 2: Waiting for Elements

### The #1 mistake beginners make

You open a page and immediately try to scrape. But the data isn't there yet!

JavaScript is still loading. You need to **wait**.

> ⚠️ *"Never assume data is ready. Always wait for it."*

In [None]:
driver = get_driver()

try:
    url = "https://cafef.vn/du-lieu/Lich-su-giao-dich-vnindex-1.chn"
    driver.get(url)
    
    # Create a waiter (max 20 seconds)
    wait = WebDriverWait(driver, 20)
    
    print("Waiting for table to load...")
    
    # Wait until this element exists
    table_xpath = '//*[@id="render-table-owner"]'
    table = wait.until(EC.presence_of_element_located((By.XPATH, table_xpath)))
    
    print("✅ Table found!")
    
    # Give JS a bit more time to populate data
    time.sleep(2)
    
    # Count rows
    rows = table.find_elements(By.TAG_NAME, "tr")
    print(f"Rows in table: {len(rows)}")
    
except Exception as e:
    print(f"❌ Error: {e}")
    
finally:
    driver.quit()

### 🧠 Wait strategies:

| Method | When to use |
|--------|-------------|
| `presence_of_element_located` | Element exists in DOM |
| `visibility_of_element_located` | Element is visible to user |
| `element_to_be_clickable` | Can be clicked |
| `time.sleep(n)` | Last resort, avoid if possible |
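
`WebDriverWait.until` accepts any callable that takes the driver and returns a truthy value once the condition is met, so a fixed `time.sleep` can often be swapped for a check on actual rows. A minimal sketch, assuming data rows are `<tr>` elements that contain `<td>` cells; `table_has_rows` is a hypothetical helper, not part of Selenium.

```python
def table_has_rows(table_id, min_rows=1):
    """Build a wait condition that passes once the table holds enough data rows."""
    # Only rows containing <td> cells count, so a header-only table keeps waiting.
    xpath = f'//*[@id="{table_id}"]//tr[td]'

    def condition(driver):
        rows = driver.find_elements("xpath", xpath)  # "xpath" is the value of By.XPATH
        return rows if len(rows) >= min_rows else False

    return condition

# Intended usage with the objects from this notebook:
# rows = WebDriverWait(driver, 20).until(table_has_rows("render-table-owner"))
```

Because `until` returns whatever the condition returns, the matching rows come back directly and no extra lookup is needed.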

---

## Step 3: Extracting Table Data

### XPath – Your new best friend

XPath is like GPS for HTML. It tells you exactly where an element is.

```
//*[@id="render-table-owner"]  →  "Find any element with this ID"
//tr                           →  "Find all <tr> elements"
//td[1]                        →  "Find first <td> in each row"
```
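
These expressions can be tried without a browser. Python's `xml.etree.ElementTree` supports a limited XPath subset, which is enough for the three patterns above, though it needs relative paths starting with `.`. The sample markup and its values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample that mimics the shape of the target table.
SAMPLE = """
<div>
  <table id="render-table-owner">
    <tr><td>01/02/2024</td><td>1,234.56</td></tr>
    <tr><td>02/02/2024</td><td>1,240.10</td></tr>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)

# Any element with this id
table = root.find(".//*[@id='render-table-owner']")

# All <tr> elements below the table
rows = table.findall(".//tr")

# The first <td> of each row
first_cells = [td.text for td in table.findall(".//td[1]")]

print(table.tag, len(rows), first_cells)
# table 2 ['01/02/2024', '02/02/2024']
```

Real pages are rarely well-formed XML, so this is only a way to practice the syntax. Selenium's `By.XPATH` evaluates full XPath inside the browser.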

In [None]:
driver = get_driver()

try:
    driver.get("https://cafef.vn/du-lieu/Lich-su-giao-dich-vnindex-1.chn")
    
    wait = WebDriverWait(driver, 20)
    table = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="render-table-owner"]')))
    time.sleep(2)
    
    rows = table.find_elements(By.TAG_NAME, "tr")
    
    print(f"Found {len(rows)} rows\n")
    print("─" * 60)
    print("First 5 rows:")
    print("─" * 60)
    
    for i, row in enumerate(rows[:5], 1):
        cells = row.find_elements(By.TAG_NAME, "td")
        if not cells:
            continue
            
        data = [c.text.strip() for c in cells]
        
        # The table structure:
        # 0: Date, 1: Close, 9: Open, 10: High, 11: Low
        if len(data) > 1:
            print(f"{i}. Date: {data[0]:<12} Close: {data[1]:<10}")
    
except Exception as e:
    print(f"❌ {e}")
    
finally:
    driver.quit()

---

## Step 4: Complete Crawler

### Putting it all together

Now let's build a proper crawler with:
- Pydantic validation
- Export to CSV
- Error handling

In [None]:
from pydantic import BaseModel, Field
import pandas as pd
import json

class StockData(BaseModel):
    """One day of VNINDEX data."""
    date: str
    close_price: float = 0.0
    open_price: float = 0.0
    high_price: float = 0.0
    low_price: float = 0.0

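def parse_number(text):
    """Convert a number string with comma thousands separators to float.

    Returns 0.0 for empty, '-', or unparseable input.
    """
    if not text or text == '-':
        return 0.0
    try:
        return float(text.replace(',', ''))
    except ValueError:
        # Not a number after removing separators
        return 0.0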

print("✅ Models defined")
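
`parse_number` above returns `0.0` for missing or unparseable cells, which makes them indistinguishable from a genuine zero. Here is a sketch of a stricter variant that returns `None` instead, so bad cells can be filtered later. `parse_number_or_none` is a hypothetical name, and it assumes the same comma-as-thousands-separator format.

```python
from typing import Optional

def parse_number_or_none(text: str) -> Optional[float]:
    """Parse '1,234.56' style text; return None when there is no usable number."""
    cleaned = text.strip().replace(",", "") if text else ""
    if cleaned in ("", "-"):
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None

# Try a few representative cell values.
for raw in ["1,234.56", "-", "", "n/a"]:
    print(repr(raw), "->", parse_number_or_none(raw))
```

To store `None` in `StockData`, the price fields would need to be declared as `Optional[float]`.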

In [None]:
def scrape_vnindex():
    """Scrape VNINDEX data from CafeF."""
    driver = get_driver()
    results = []
    
    try:
        print("🚀 Starting crawler...")
        driver.get("https://cafef.vn/du-lieu/Lich-su-giao-dich-vnindex-1.chn")
        
        wait = WebDriverWait(driver, 20)
        table = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="render-table-owner"]')))
        time.sleep(2)
        
        rows = table.find_elements(By.TAG_NAME, "tr")
        print(f"📊 Found {len(rows)} rows")
        
        for row in rows:
            cells = row.find_elements(By.TAG_NAME, "td")
            if not cells:
                continue
            
            data = [c.text.strip() for c in cells]
            date = data[0] if len(data) > 0 else ""
            
            # Skip headers
            if not date or "Ngày" in date:
                continue
            
            try:
                stock = StockData(
                    date=date,
                    close_price=parse_number(data[1] if len(data) > 1 else "0"),
                    open_price=parse_number(data[9] if len(data) > 9 else "0"),
                    high_price=parse_number(data[10] if len(data) > 10 else "0"),
                    low_price=parse_number(data[11] if len(data) > 11 else "0")
                )
                results.append(stock)
            except Exception:
                # Skip rows that fail validation
                continue
        
        print(f"✅ Extracted {len(results)} valid records")
        
    except Exception as e:
        print(f"❌ Error: {e}")
        
    finally:
        driver.quit()
    
    return results

# Run it
data = scrape_vnindex()
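
The index lookups inside `scrape_vnindex` can be pulled into a pure function, which makes the column mapping testable without launching a browser. This sketch uses the column positions noted in Step 3 (0, 1, 9, 10, 11). `COLUMNS` and `pick_columns` are hypothetical names, and the sample row is invented.

```python
# Column positions as described in Step 3 of this module.
COLUMNS = {"date": 0, "close_price": 1, "open_price": 9, "high_price": 10, "low_price": 11}

def pick_columns(cells, columns=COLUMNS, default=""):
    """Map a list of cell strings to a dict, using `default` for missing positions."""
    return {
        name: cells[idx] if idx < len(cells) else default
        for name, idx in columns.items()
    }

# Invented sample row with 12 cells; only the mapped positions matter here.
sample = ["01/02/2024", "1,234.56"] + ["x"] * 7 + ["1,230.00", "1,240.00", "1,225.50"]
print(pick_columns(sample))
print(pick_columns(["01/02/2024"]))  # short row: remaining fields fall back to ""
```

If the site reorders its columns, only the `COLUMNS` dict would need to change.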

In [None]:
if data:
    # Convert to DataFrame
    df = pd.DataFrame([d.model_dump() for d in data])
    
    # Save
    df.to_csv("vnindex_data.csv", index=False)
    print("💾 Saved to vnindex_data.csv\n")
    
    # Preview
    print(df.head())
else:
    print("No data to save.")

---

## 🏋️ Practice Time

### Exercise 1: Handle pagination
CafeF has multiple pages. Modify the scraper to click "Next" and collect more data.

### Exercise 2: Different index
Change the URL to scrape HNX instead of VNINDEX.

### Exercise 3: Visualize
Plot the closing prices using matplotlib.

### Exercise 4: Compare speed
Time how long this takes vs Module 3's async approach.
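
A small timing helper keeps the comparison fair by measuring both approaches the same way. This is a sketch using `time.perf_counter`; `timed` is a hypothetical helper name.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Intended usage in this notebook:
# data, seconds = timed(scrape_vnindex)
# print(f"{len(data)} records in {seconds:.1f}s")

# Stand-in workload so the helper can be tried without a browser.
total, seconds = timed(sum, range(1_000_000))
print(total, f"{seconds:.4f}s")
```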

---

## 📝 Summary

| Concept | What you learned |
|---------|------------------|
| Selenium | Control a real browser |
| WebDriverWait | Wait for elements to load |
| XPath | Navigate to specific elements |
| Headless mode | Run without visible browser |

### ⚠️ The catch

Selenium is **slow**. Loading a browser, rendering JavaScript, waiting for elements...

For 200 records, it might take 30-60 seconds.

### Next up: Module 3

What if we could skip all that and call the API directly?

**Spoiler**: We can. And it's 5-10x faster. ⚡

*See you in Module 3.* ✌️