# 📚 Module 3: API-Based Scraping (Async)

**Learn high-performance data collection via APIs**

In this notebook, you'll learn:
- How to find hidden APIs using browser DevTools
- Direct API calls (no browser needed!)
- Async programming with `aiohttp`
- Rate limiting and best practices

**Target**: CafeF Stock API

⚡ **This is 5-10x faster than Selenium!**

## 🔧 Setup

In [None]:
!pip install aiohttp pydantic pandas nest_asyncio -q

# Enable async in Jupyter/Colab
import nest_asyncio
nest_asyncio.apply()

print("✅ Setup complete!")

---
## Step 1: How to Find API Endpoints

This is a **guide** on how to discover hidden APIs!

### 🔍 Step-by-Step:

1. **Open the Website**: Navigate to [CafeF VNINDEX](https://cafef.vn/du-lieu/Lich-su-giao-dich-vnindex-1.chn)

2. **Open DevTools**: Press `F12` → Click "Network" tab

3. **Filter for XHR**: Click "Fetch/XHR" to show only API calls

4. **Reload the Page**: Press `F5` and watch requests appear

5. **Find the API**: Look for `PriceHistory.ashx` 🎯

6. **Check Response**: Click it → "Response" tab → You'll see JSON!

### 📋 API Details We Found:

```
URL: https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx
Parameters:
  - Symbol=VNINDEX
  - StartDate= (left empty in this notebook)
  - EndDate= (left empty in this notebook)
  - PageIndex=1
  - PageSize=20
```

### 💡 Key Insight:
- Many websites load data via APIs
- APIs are MUCH faster than scraping HTML
- Network tab is your best friend!
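
Once the Network tab shows the request, its URL can be rebuilt outside the browser. The sketch below uses only the standard library to assemble the same query string from the parameters listed above. `build_price_history_url` is a hypothetical helper name chosen for illustration, not part of any library.

```python
from urllib.parse import urlencode

BASE_URL = "https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx"

def build_price_history_url(symbol: str, page_index: int = 1, page_size: int = 20) -> str:
    """Assemble the URL seen in the Network tab (hypothetical helper)."""
    query = urlencode({
        "Symbol": symbol,
        "PageIndex": page_index,
        "PageSize": page_size,
    })
    return f"{BASE_URL}?{query}"

# Paste the printed URL into a browser tab to compare with the DevTools response
print(build_price_history_url("VNINDEX"))
```

Comparing this URL with the one in DevTools is a quick check that no required parameter was missed.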

---
## Step 2: Make a Single API Request

**Goal**: Fetch data directly from the API (no browser needed!)

In [None]:
import requests
import json

print("=" * 70)
print("CALLING CAFEF API DIRECTLY")
print("=" * 70)
print()

# The API endpoint we discovered
api_url = "https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx"

# Parameters for the request
params = {
    "Symbol": "VNINDEX",
    "StartDate": "",
    "EndDate": "",
    "PageIndex": 1,
    "PageSize": 20
}

print("API URL:", api_url)
print("Parameters:", json.dumps(params, indent=2))
print()

# Make the request
print("Sending request...")
response = requests.get(api_url, params=params, timeout=10)  # time out rather than hang indefinitely

print(f"Status Code: {response.status_code}")
print()

# Parse JSON response
data = response.json()

if data.get("Success"):
    print("✅ API call successful!")
    
    records = data.get("Data", {}).get("Data", [])
    print(f"Received: {len(records)} records")
    print()
    
    if records:
        print("=" * 70)
        print("FIRST RECORD:")
        print("=" * 70)
        print(json.dumps(records[0], indent=2, ensure_ascii=False))
        print()
        
        print("=" * 70)
        print("AVAILABLE FIELDS:")
        print("=" * 70)
        for key in records[0].keys():
            print(f"  - {key}")
else:
    print("❌ API call failed!")

print("\n⚡ This is MUCH faster than Selenium!")

### 💡 Key Takeaways

- `requests.get(url, params={...})` sends a GET request with the given query parameters
- `response.json()` parses the JSON response body into Python objects
- No browser to launch, so results come back almost immediately!
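
The chained `data.get("Data", {}).get("Data", [])` lookup raises an `AttributeError` if the outer `"Data"` key is present but null. A slightly more defensive version is sketched below. `extract_records` is a hypothetical helper, and the sample payloads are made up to mirror the envelope shape used in this notebook, not copied from the API.

```python
from typing import Any, Dict, List

def extract_records(payload: Any) -> List[Dict[str, Any]]:
    """Pull the record list out of {"Success": ..., "Data": {"Data": [...]}}."""
    if not isinstance(payload, dict) or not payload.get("Success"):
        return []
    outer = payload.get("Data") or {}      # "Data" may be missing or null
    if not isinstance(outer, dict):
        return []
    records = outer.get("Data") or []
    return [r for r in records if isinstance(r, dict)]

# Made-up payloads for illustration only
ok = {"Success": True, "Data": {"Data": [{"Ngay": "01/01/2024", "GiaDongCua": 1.0}]}}
empty = {"Success": True, "Data": None}
failed = {"Success": False}

print(len(extract_records(ok)), len(extract_records(empty)), len(extract_records(failed)))
```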

---
## Step 3: Async Requests with aiohttp

**Goal**: Fetch multiple pages concurrently (requests overlap instead of running one after another)

**Concepts**:
- Asynchronous programming (async/await)
- aiohttp for async HTTP
- asyncio.gather for concurrent execution

In [None]:
import aiohttp
import asyncio
from typing import List, Dict
import time

API_URL = "https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx"
SYMBOL = "VNINDEX"

async def fetch_page(session: aiohttp.ClientSession, page_index: int) -> List[Dict]:
    """Fetch a single page asynchronously"""
    params = {
        "Symbol": SYMBOL,
        "StartDate": "",
        "EndDate": "",
        "PageIndex": page_index,
        "PageSize": 20
    }
    
    print(f"  Fetching page {page_index}...")
    
    try:
        async with session.get(API_URL, params=params) as response:
            data = await response.json(content_type=None)
            
            if data.get("Success"):
                records = data.get("Data", {}).get("Data", [])
                print(f"  ✅ Page {page_index}: {len(records)} records")
                return records
            else:
                print(f"  ❌ Page {page_index}: Failed")
                return []
                
    except Exception as e:
        print(f"  ❌ Page {page_index}: {e}")
        return []

async def fetch_all_pages(total_pages: int = 5) -> List[Dict]:
    """Fetch multiple pages concurrently"""
    print("=" * 70)
    print(f"FETCHING {total_pages} PAGES CONCURRENTLY")
    print("=" * 70)
    print()
    
    async with aiohttp.ClientSession() as session:
        # Create tasks for all pages
        tasks = [fetch_page(session, i) for i in range(1, total_pages + 1)]
        
        # Execute all tasks concurrently!
        results = await asyncio.gather(*tasks)
        
        # Flatten the list
        all_records = [record for page_records in results for record in page_records]
        
        return all_records

In [None]:
# Run the async function
start_time = time.time()

records = asyncio.run(fetch_all_pages(total_pages=5))

elapsed = time.time() - start_time

print()
print("=" * 70)
print("RESULTS")
print("=" * 70)
print(f"Total records: {len(records)}")
print(f"Time taken: {elapsed:.2f} seconds")
print(f"Speed: {len(records) / elapsed:.1f} records/second")
print()
print("⚡ ASYNC IS FAST!")

### 💡 Key Takeaways

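- `async def` defines a coroutine function
- `await` pauses the coroutine until the awaited operation finishes, letting other tasks run in the meantime
- `asyncio.gather(*tasks)` runs the coroutines concurrently on one event loop and returns their results in input order
- Much faster than sequential requests, because the network waits overlap!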
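
To see the effect without touching the network, the sketch below replaces the HTTP call with `asyncio.sleep` and times both approaches. `fake_request` and the 0.2 second delay are stand-ins invented for this illustration.

```python
import asyncio
import time

async def fake_request(page: int, delay: float = 0.2) -> int:
    """Stand-in for an HTTP call: sleeps instead of hitting the network."""
    await asyncio.sleep(delay)
    return page

async def run_sequential(pages: int) -> float:
    start = time.perf_counter()
    for p in range(1, pages + 1):
        await fake_request(p)              # each wait finishes before the next starts
    return time.perf_counter() - start

async def run_concurrent(pages: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(p) for p in range(1, pages + 1)))  # waits overlap
    return time.perf_counter() - start

async def compare(pages: int = 5):
    return await run_sequential(pages), await run_concurrent(pages)

sequential_s, concurrent_s = asyncio.run(compare())
print(f"sequential: {sequential_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

The sequential run takes roughly the sum of the delays, while the concurrent run takes roughly the longest single delay. In a notebook, `asyncio.run` relies on the `nest_asyncio.apply()` call from the setup cell.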

---
## Step 4: Complete Scraper with Rate Limiting

**Goal**: Production-ready async scraper

**Concepts**:
- Semaphore for concurrency control
- Rate limiting (be nice to servers!)
- Pydantic validation
- Export to JSON/CSV
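
A semaphore caps how many requests are in flight at once, which is related to but not the same as limiting requests per second. The sketch below shows the cap on its own, with `asyncio.sleep` standing in for the HTTP call. `worker`, `run_bounded`, and the `stats` dictionary are hypothetical names used only for this illustration.

```python
import asyncio

async def worker(sem: asyncio.Semaphore, stats: dict, delay: float = 0.05) -> None:
    async with sem:                          # wait here until a slot is free
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(delay)           # stand-in for the HTTP call
        stats["in_flight"] -= 1

async def run_bounded(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks fake requests and return the peak number in flight."""
    sem = asyncio.Semaphore(max_concurrent)
    stats = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(worker(sem, stats) for _ in range(n_tasks)))
    return stats["peak"]

peak = asyncio.run(run_bounded(n_tasks=10, max_concurrent=3))
print(f"peak concurrent requests: {peak}")
```

With ten tasks and a limit of three, the reported peak never exceeds three. To also space requests out over time, one option is an extra `await asyncio.sleep(...)` inside the `async with sem:` block.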

In [None]:
from pydantic import BaseModel, ConfigDict, Field
import pandas as pd

class StockData(BaseModel):
    """Validated stock data model"""
    # Pydantic v2 config: accept field names as well as the API's aliases
    model_config = ConfigDict(populate_by_name=True)

    date: str = Field(alias="Ngay")
    close_price: float = Field(alias="GiaDongCua")
    open_price: float = Field(default=0.0, alias="GiaMoCua")
    high_price: float = Field(default=0.0, alias="GiaCaoNhat")
    low_price: float = Field(default=0.0, alias="GiaThapNhat")
    volume: float = Field(default=0.0, alias="KhoiLuong")

print("✅ Model defined!")

In [None]:
class CafeFAsyncCrawler:
    """Production-ready async crawler with rate limiting"""
    
    API_URL = "https://cafef.vn/du-lieu/Ajax/PageNew/DataHistory/PriceHistory.ashx"
    
    def __init__(self, symbol: str = "VNINDEX", max_concurrent: int = 5):
        self.symbol = symbol
        self.semaphore = asyncio.Semaphore(max_concurrent)  # Caps the number of in-flight requests
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
    
    async def fetch_page(self, session: aiohttp.ClientSession, page_index: int) -> List:
        """Fetch a single page with rate limiting"""
        async with self.semaphore:  # Only N concurrent requests
            params = {
                "Symbol": self.symbol,
                "StartDate": "",
                "EndDate": "",
                "PageIndex": page_index,
                "PageSize": 20
            }
            
            try:
                async with session.get(self.API_URL, params=params, headers=self.headers) as response:
                    data = await response.json(content_type=None)
                    
                    if not data.get("Success"):
                        return []
                    
                    records = data.get("Data", {}).get("Data", [])
                    
                    # Validate with Pydantic
                    validated = []
                    for record in records:
                        try:
                            stock_data = StockData(**record)
                            validated.append(stock_data)
                        except Exception:
                            # Skip records that fail validation
                            continue
                    
                    print(f"  ✅ Page {page_index}: {len(validated)} records")
                    return validated
                    
            except Exception as e:
                print(f"  ❌ Page {page_index}: {e}")
                return []
    
    async def crawl(self, total_pages: int = 10) -> List:
        """Crawl multiple pages with rate limiting"""
        print("=" * 70)
        print(f"CAFEF ASYNC CRAWLER - {self.symbol}")
        print(f"Fetching {total_pages} pages (concurrency capped by the semaphore)")
        print("=" * 70)
        print()
        
        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch_page(session, i) for i in range(1, total_pages + 1)]
            results = await asyncio.gather(*tasks)
            return [item for page in results for item in page]

In [None]:
# Run the production crawler
start_time = time.time()

crawler = CafeFAsyncCrawler(symbol="VNINDEX", max_concurrent=5)
data = asyncio.run(crawler.crawl(total_pages=10))

elapsed = time.time() - start_time

print()
print("=" * 70)
print("CRAWLING COMPLETE")
print("=" * 70)
print(f"Total records: {len(data)}")
print(f"Time taken: {elapsed:.2f} seconds")
print(f"Speed: {len(data) / elapsed:.1f} records/second")

In [None]:
# Export to DataFrame and files
if data:
    df = pd.DataFrame([d.model_dump() for d in data])
    
    # Save to CSV
    df.to_csv("vnindex_final.csv", index=False)
    print("💾 Saved: vnindex_final.csv")
    
    # Save to JSON
    with open("vnindex_final.json", "w") as f:
        json.dump([d.model_dump() for d in data], f, indent=2)
    print("💾 Saved: vnindex_final.json")
    
    print("\n" + "=" * 70)
    print("DATA PREVIEW:")
    print("=" * 70)
    print(df.head())
    
    print("\n🎉 You've built a production-ready async scraper!")
    print("\n⚡ SPEED COMPARISON:")
    print("  Selenium: ~30-60s for 200 records")
    print("  Async API: ~5-10s for 200 records")
    print("  That's roughly 5-10x faster!")

---
## 🏆 Exercises

1. **Different symbol**: Change symbol to "VN30" or "HNX"
2. **More pages**: Increase `total_pages` to 50
3. **Date filter**: Add StartDate and EndDate parameters
4. **Data analysis**: Calculate average, min, max prices
5. **Visualization**: Plot the data with matplotlib

In [None]:
# Exercise: Visualize the data
import matplotlib.pyplot as plt

if 'df' in dir() and len(df) > 0:
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
    
    # Price chart
    ax1.plot(df['close_price'].head(50), marker='o', markersize=3)
    ax1.set_title('VNINDEX Close Price')
    ax1.set_ylabel('Price')
    ax1.grid(True)
    
    # Volume chart
    ax2.bar(range(len(df.head(50))), df['volume'].head(50))
    ax2.set_title('Trading Volume')
    ax2.set_xlabel('Day')
    ax2.set_ylabel('Volume')
    
    plt.tight_layout()
    plt.show()

---
## 📊 Summary: Scraping Techniques Comparison

| Technique | Speed | Complexity | When to Use |
|-----------|-------|------------|-------------|
| **BeautifulSoup** | ⭐⭐⭐ | Easy | Static HTML pages |
| **Selenium** | ⭐ | Medium | JavaScript-rendered content |
| **Async API** | ⭐⭐⭐⭐⭐ | Advanced | When API is available |

### 🎯 Pro Tips:
1. Always check for APIs first (fastest option)
2. Use rate limiting to avoid being blocked
3. Validate data with Pydantic
4. Handle errors gracefully
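
One way to handle errors gracefully is to retry transient failures with exponential backoff instead of returning an empty page on the first error. The sketch below is a generic pattern, not something the crawler above does. `with_retries` and `flaky_fetch` are hypothetical names, and the simulated failure stands in for a real network error.

```python
import asyncio
import random

async def with_retries(make_call, attempts: int = 3, base_delay: float = 0.1):
    """Await make_call() up to `attempts` times, backing off between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception as exc:
            if attempt == attempts:
                raise                         # out of attempts: surface the error
            # Exponential backoff plus a little jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            await asyncio.sleep(delay)

# Hypothetical flaky call: fails twice, then succeeds
calls = {"count": 0}

async def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return ["record"]

result = asyncio.run(with_retries(flaky_fetch))
print(result, "after", calls["count"], "calls")
```

In a real crawler it is usually better to catch only network-related exceptions, such as `aiohttp.ClientError` and `asyncio.TimeoutError`, so that programming errors are not silently retried.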