# ⚡ Module 3: API-Based Scraping

### *"The secret pro technique."*

---

**Plot twist incoming.** 🎬

That table you scraped with Selenium? There's probably an API behind it.

When you open a webpage, your browser:
1. Loads the HTML skeleton
2. Runs JavaScript
3. JavaScript fetches data from an **API**
4. Renders it on screen

What if we just... call that API directly?

> 💭 *"The fastest scraper is one that doesn't scrape at all."*

**Speed comparison**:
- Selenium: ~60s for 200 records
- Direct API: ~5s for 200 records

That's roughly a **12x speedup**. Let's go. 🚀

## 🔧 Setup

In [None]:
!pip install requests aiohttp pydantic pandas nest_asyncio -q

# Enable async in Jupyter
import nest_asyncio
nest_asyncio.apply()

print("✅ Ready!")

---

## Step 1: Finding Hidden APIs

### The detective work

This is where it gets fun. You're basically reverse-engineering how the website works.

**How to find APIs:**

1. Open the website in Chrome
2. Press `F12` → Network tab
3. Filter by `Fetch/XHR`
4. Reload the page
5. Look for URLs that return JSON

For CafeF, the magic URL is:
```
https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx
  ?Symbol=VNINDEX
  &PageIndex=1
  &PageSize=20
```

> 🎯 Pro tip: Look for URLs that end in `.json` or `.ashx`, or that contain `/api/` or `/Ajax/`
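
If you export the Network tab to a list of URLs, the tip above can be turned into a quick filter. This is only a sketch: `looks_like_api` and `API_HINTS` are names made up for illustration, and every URL except the CafeF one is a placeholder.

```python
from urllib.parse import urlparse

# Hints taken from the tip above; extend the tuple for other sites.
API_HINTS = (".json", ".ashx", "/api/", "/ajax/")


def looks_like_api(url: str) -> bool:
    """Return True if the URL path contains one of the API hints."""
    path = urlparse(url).path.lower()
    return any(hint in path for hint in API_HINTS)


# Hypothetical URLs, as they might appear in an exported network log.
captured = [
    "https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx?Symbol=VNINDEX&PageIndex=1&PageSize=20",
    "https://example.com/static/site.css",
    "https://example.com/api/v1/prices",
    "https://example.com/images/logo.png",
]

candidates = [u for u in captured if looks_like_api(u)]
for url in candidates:
    print(url)
```

A match is only a hint. You still need to open the response in DevTools and confirm it returns JSON.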

---

## Step 2: Calling the API Directly

### No browser. No JavaScript. Just data.

In [None]:
import requests
import json

api_url = "https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx"

params = {
    "Symbol": "VNINDEX",
    "StartDate": "",
    "EndDate": "",
    "PageIndex": 1,
    "PageSize": 20
}

print(f"Calling: {api_url}")
print(f"Params: {params}\n")

response = requests.get(api_url, params=params, timeout=10)
response.raise_for_status()  # Stop early on HTTP errors (4xx/5xx)
data = response.json()

if data.get("Success"):
    records = data["Data"]["Data"]
    print(f"✅ Got {len(records)} records\n")
    
    # Show first record
    print("Sample record:")
    print(json.dumps(records[0], indent=2, ensure_ascii=False))
else:
    print("❌ API returned error")

### 🧠 What just happened?

We got the same data that took Selenium four steps:
1. Open Chrome
2. Load the page
3. Wait for JavaScript
4. Parse the HTML

But we did it with **one HTTP request**.

> This is the cheat code. 🎮
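
The call above trusts the response shape: a `Success` flag, with the records nested under `Data` → `Data`. A slightly more defensive way to unpack it might look like the sketch below. `extract_records` is a hypothetical helper. The sample payload reuses field names from this module, but its values are placeholders, not real API output.

```python
from typing import Any


def extract_records(payload: Any) -> list[dict[str, Any]]:
    """Return the record list, or [] if the payload is not in the expected shape."""
    if not isinstance(payload, dict) or not payload.get("Success"):
        return []
    inner = payload.get("Data")
    if not isinstance(inner, dict):
        return []
    records = inner.get("Data")
    return records if isinstance(records, list) else []


# Hypothetical payload: field names match this module, values are placeholders.
sample = {
    "Success": True,
    "Data": {"Data": [{"Ngay": "01/01/2024", "GiaDongCua": 100.0}]},
}

good = extract_records(sample)
failed = extract_records({"Success": False})
malformed = extract_records({"Success": True, "Data": None})

print(len(good), len(failed), len(malformed))
```

Returning an empty list for anything unexpected keeps the calling code simple: it can always loop over the result.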

---

## Step 3: Going Async

### Why async?

Normal code: Do one thing → Wait → Do next thing → Wait → ...

Async code: Start 10 things at once → Wait for all → Done

```
Sync:   |──Page1──|──Page2──|──Page3──|  (slow)
Async:  |──Page1──|                      (fast!)
        |──Page2──|
        |──Page3──|
```
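
You can see the diagram's effect without touching the network. In this sketch `fake_fetch` is a made-up stand-in that just sleeps for 0.2s, so the timings come from the sleeps, not from a real server.

```python
import asyncio
import time


async def fake_fetch(page: int, delay: float = 0.2) -> int:
    """Stand-in for a network call: sleeps, then returns the page number."""
    await asyncio.sleep(delay)
    return page


async def run_sequential(pages: int) -> list[int]:
    # Each await finishes before the next one starts.
    return [await fake_fetch(p) for p in range(1, pages + 1)]


async def run_concurrent(pages: int) -> list[int]:
    # All coroutines are scheduled together and wait at the same time.
    return await asyncio.gather(*(fake_fetch(p) for p in range(1, pages + 1)))


def timed(coro) -> tuple[list[int], float]:
    """Run a coroutine to completion and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return list(result), time.perf_counter() - start


seq_result, seq_time = timed(run_sequential(3))
con_result, con_time = timed(run_concurrent(3))
print(f"Sequential: {seq_time:.2f}s  Concurrent: {con_time:.2f}s")
```

The sequential run takes about three sleeps, the concurrent one about a single sleep. Inside Jupyter, `asyncio.run` relies on the `nest_asyncio.apply()` call from the setup cell.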

In [None]:
import aiohttp
import asyncio
import time

API_URL = "https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx"

async def fetch_page(session, page):
    """Fetch one page of data."""
    params = {
        "Symbol": "VNINDEX",
        "StartDate": "",
        "EndDate": "",
        "PageIndex": page,
        "PageSize": 20
    }
    
    async with session.get(API_URL, params=params) as resp:
        data = await resp.json(content_type=None)
        
        if data.get("Success"):
            records = data["Data"]["Data"]
            print(f"  ✅ Page {page}: {len(records)} records")
            return records
        return []

async def fetch_all(total_pages=5):
    """Fetch multiple pages concurrently."""
    print(f"Fetching {total_pages} pages in parallel...\n")
    
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, i) for i in range(1, total_pages + 1)]
        results = await asyncio.gather(*tasks)
        
        # Flatten
        all_records = [r for page in results for r in page]
        return all_records

In [None]:
# Time it!
start = time.time()

records = asyncio.run(fetch_all(total_pages=5))

elapsed = time.time() - start

print(f"\n┌{'─' * 32}┐")
print(f"│  {'Total records:':<15}{len(records):>13}  │")
print(f"│  {'Time taken:':<15}{elapsed:>12.2f}s  │")
print(f"│  {'Speed:':<15}{len(records)/elapsed:>9.1f}/sec  │")
print(f"└{'─' * 32}┘")

### 🧠 Key concepts:

| Concept | What it does |
|---------|-------------|
| `async def` | Declare an async function |
| `await` | Pause here until an async operation finishes |
| `asyncio.gather()` | Run several awaitables concurrently and collect their results |
| `aiohttp.ClientSession` | Async HTTP client |
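
Two details of `asyncio.gather()` are worth knowing: results come back in the order you passed the tasks in, and `return_exceptions=True` returns errors as values instead of raising the first one. A small sketch, with a hypothetical `fetch_or_fail` task where page 2 always fails:

```python
import asyncio


async def fetch_or_fail(page: int) -> str:
    """Hypothetical task: page 2 fails, the others succeed."""
    # Later pages finish first, to show that gather keeps input order.
    await asyncio.sleep(0.05 * (4 - page))
    if page == 2:
        raise ValueError(f"page {page} failed")
    return f"page {page}"


async def main() -> list:
    tasks = [fetch_or_fail(p) for p in (1, 2, 3)]
    # return_exceptions=True puts exceptions in the result list instead of raising.
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(main())
ok = [r for r in results if not isinstance(r, Exception)]
failed = [r for r in results if isinstance(r, Exception)]
print(ok)
print(failed)
```

Without `return_exceptions=True`, the first failure would propagate out of `gather` and you would lose the results of the pages that succeeded.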

---

## Step 4: Production-Ready Crawler

### Adding the important stuff

Real-world APIs can:
- Rate limit you (too many requests = blocked)
- Return bad data
- Be temporarily unavailable

We need:
1. **Rate limiting** → Don't spam
2. **Validation** → Clean data only
3. **Error handling** → Don't crash
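
For the first item, one simple tool is `asyncio.Semaphore`, which caps how many requests are in flight at once. Here's a network-free sketch of the idea. `worker` and `run_capped` are illustrative names, and the sleep stands in for an HTTP request.

```python
import asyncio


async def worker(sem: asyncio.Semaphore, state: dict, delay: float = 0.05) -> None:
    async with sem:                      # waits here if the cap is reached
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(delay)       # stand-in for the HTTP request
        state["running"] -= 1


async def run_capped(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks workers and return the highest number that ran at once."""
    sem = asyncio.Semaphore(max_concurrent)   # created inside the running loop
    state = {"running": 0, "peak": 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(n_tasks)))
    return state["peak"]


peak = asyncio.run(run_capped(n_tasks=20, max_concurrent=5))
print(f"Peak concurrent tasks: {peak}")
```

Note that a semaphore limits concurrency, not requests per second. If a site enforces a strict per-second quota, you would also need a delay between requests.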

In [None]:
from pydantic import BaseModel, ConfigDict, Field
import pandas as pd

class StockData(BaseModel):
    """Validated stock data."""
    # Pydantic v2 config: allow fields to be set by name as well as by alias
    model_config = ConfigDict(populate_by_name=True)

    date: str = Field(alias="Ngay")
    close: float = Field(alias="GiaDongCua")
    open: float = Field(default=0.0, alias="GiaMoCua")
    high: float = Field(default=0.0, alias="GiaCaoNhat")
    low: float = Field(default=0.0, alias="GiaThapNhat")
    volume: float = Field(default=0.0, alias="KhoiLuong")

print("✅ Model defined")

In [None]:
class VNIndexCrawler:
    """Production-ready async crawler."""
    
    def __init__(self, symbol="VNINDEX", max_concurrent=5):
        self.symbol = symbol
        self.max_concurrent = max_concurrent
        self.semaphore = asyncio.Semaphore(max_concurrent)  # Caps in-flight requests
        self.api_url = "https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx"
    
    async def fetch_page(self, session, page):
        """Fetch one page, with at most max_concurrent requests in flight."""
        async with self.semaphore:  # <-- Only N requests at a time
            params = {
                "Symbol": self.symbol,
                "StartDate": "",
                "EndDate": "",
                "PageIndex": page,
                "PageSize": 20
            }
            
            try:
                async with session.get(self.api_url, params=params) as resp:
                    data = await resp.json(content_type=None)
                    
                    if not data.get("Success"):
                        return []
                    
                    # Validate each record; skip any that fail
                    validated = []
                    for record in data["Data"]["Data"]:
                        try:
                            validated.append(StockData(**record))
                        except Exception:  # e.g. pydantic.ValidationError
                            continue
                    
                    print(f"  ✅ Page {page}: {len(validated)} records")
                    return validated
                    
            except Exception as e:
                print(f"  ❌ Page {page}: {e}")
                return []
    
    async def crawl(self, pages=10):
        """Crawl multiple pages."""
        print(f"🚀 Crawling {self.symbol} ({pages} pages, max {self.max_concurrent} concurrent)\n")
        
        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch_page(session, i) for i in range(1, pages + 1)]
            results = await asyncio.gather(*tasks)
            return [r for page in results for r in page]

print("✅ Crawler class defined")
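
The crawler gives up on a page after one failed request. For an API that is only temporarily unavailable, retrying with a growing delay is a common approach. The sketch below is one way to do that and is not part of the crawler above. `with_retries` and `flaky_fetch` are hypothetical names, and the flaky call is simulated.

```python
import asyncio
import random


async def with_retries(make_call, attempts: int = 3, base_delay: float = 0.1):
    """Await make_call(); on failure, wait and try again up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            # Exponential backoff with a little jitter: ~0.1s, ~0.2s, ...
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay / 2)
            print(f"  attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            await asyncio.sleep(delay)


# Hypothetical flaky call: fails twice, then succeeds.
calls = {"count": 0}


async def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporarily unavailable")
    return ["record"]


result = asyncio.run(with_retries(flaky_fetch))
print(result, "after", calls["count"], "calls")
```

To use this with the crawler, you could pass in a small function that performs the request and returns the parsed JSON, assuming that function raises on failure and does not catch the error itself.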

In [None]:
# Run it!
start = time.time()

crawler = VNIndexCrawler(symbol="VNINDEX", max_concurrent=5)
data = asyncio.run(crawler.crawl(pages=10))

elapsed = time.time() - start

print(f"\n{'═'*40}")
print(f"📊 Results: {len(data)} records in {elapsed:.2f}s")
print(f"⚡ Speed: {len(data)/elapsed:.1f} records/sec")
print(f"{'═'*40}")

In [None]:
# Save it
if data:
    df = pd.DataFrame([d.model_dump() for d in data])
    
    df.to_csv("vnindex_api.csv", index=False)
    print("💾 Saved to vnindex_api.csv\n")
    
    print(df.head())

---

## 🏋️ Practice Time

### Exercise 1: Different symbol
Try `VN30`, `HNX`, or a specific stock like `VNM`.

### Exercise 2: Date range
Fill in `StartDate` and `EndDate` parameters.

### Exercise 3: Visualize
Plot closing prices over time.

### Exercise 4: Find another API
Pick any website with a table. Use DevTools to find its API.

---

## 📝 Final Summary

### The Scraping Hierarchy

| Level | Tool | Speed | When to use |
|-------|------|-------|-------------|
| 1 | `requests` + BS4 | ⭐⭐⭐ | Static HTML |
| 2 | Selenium | ⭐ | JavaScript pages |
| 3 | Direct API | ⭐⭐⭐⭐⭐ | When API exists |

### The workflow

```
1. Try to find an API first (DevTools → Network → XHR)
2. If no API, check if data is in static HTML (requests + BS4)
3. Last resort: Use Selenium
```

### What you learned

| Module | Concept |
|--------|---------|
| 1 | HTTP requests, HTML parsing |
| 2 | Browser automation, waits |
| 3 | API discovery, async programming |

---

## 🎓 You're done!

You now know the fundamentals of web scraping at three different levels.

The next step? **Build something real.**

Some ideas:
- Price tracker for e-commerce sites
- News aggregator
- Job listing collector
- Social media dashboard

> *"The best way to learn is to build."*

Go make something cool. ✌️