# 03 - Plant Species Hardiness Map — Data Pull Notebook

Name: Zihan Yin

This notebook downloads the hardiness map HTML for each plant species from Perenual’s API and saves one HTML file per species ID.
It uses a single API key, polite pacing between requests, retry logic for temporary errors, and a “safe stop” when rate limits/quota are likely reached. It is resume-friendly: files that already exist are skipped automatically.

Before each run, teammates should first pull the latest changes from our GitHub repository and then run Steps 1 to 4 in order. Clicking "Run All" is enough.

## Step 1 - Configuration (range, output paths)



Set the API key, the ID range to fetch, basic rate-limit settings, and where the HTML files will be saved.

In [1]:
API_KEY = "sk-hCEF68b0479a9ad5b12090"   

START_ID = 1
END_ID   = 3000

# Rate limiting & retries
SLEEP_BETWEEN = 1
MAX_RETRIES   = 5
BACKOFF_BASE  = 1.6

# Output directory & filename pattern
from pathlib import Path
OUT_DIR = Path("01_raw_data/03_hardiness_map")
FILENAME_PATTERN = "plant_species_hardiness_map_{species_id}.html"
OUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Key: ****{API_KEY[-6:]}, Range: {START_ID}~{END_ID}, Output dir: {OUT_DIR.resolve()}")

Key: ****b12090, Range: 1~3000, Output dir: /Users/klarissa/Documents/MASTERS/SEM3/FIT5120/Project/2025-08-SDG13-Plant-X-Website/01_data_wrangling/01_raw_data/03_hardiness_map
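
Since this notebook lives in a shared repository, it may be safer to keep the key out of the code cell. The following is a minimal sketch of reading it from an environment variable instead. The variable name `PERENUAL_API_KEY` and the helper names `load_api_key` and `mask_key` are assumptions made for illustration; they are not part of the original notebook or of Perenual's API.

```python
import os

def load_api_key(env_var: str = "PERENUAL_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key

def mask_key(key: str, visible: int = 6) -> str:
    """Show only the last few characters, like the print statement in Step 1."""
    return "****" + key[-visible:]
```

With this in place, the configuration cell could use `API_KEY = load_api_key()` and print `mask_key(API_KEY)`, so the full key never appears in the notebook or its saved output.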


## Step 2 - Helper functions



Minimal utilities for saving HTML and building file paths. Keeps things tidy and reusable.

In [2]:
import time
import requests

def save_html(path: Path, html_text: str):
    # Make sure the parent folder exists, then write the HTML as UTF-8
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html_text, encoding="utf-8")

def build_filepath(species_id: int) -> Path:
    # One HTML file per species_id
    return OUT_DIR / FILENAME_PATTERN.format(species_id=species_id)
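
Because the main loop treats any existing file as already downloaded, a write that is interrupted halfway could leave a truncated file that later runs skip. One possible safeguard, sketched below, is to write to a temporary sibling file and rename it into place afterwards. The function name `save_html_atomic` and the `.part` suffix are hypothetical choices for this example, not part of the original notebook.

```python
from pathlib import Path

def save_html_atomic(path: Path, html_text: str) -> None:
    """Write to a temporary sibling file, then rename it onto the final path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_name(path.name + ".part")  # hypothetical suffix
    tmp_path.write_text(html_text, encoding="utf-8")
    # Path.replace overwrites the target; on POSIX systems the rename is atomic
    # when both paths are on the same filesystem.
    tmp_path.replace(path)
```

The final filename only appears once the full content has been written, so the `fp.exists()` check in the main loop should never see a half-written file under that name.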

## Step 3 - Fetch one species (with retries & safe-stop)



Calls the hardiness-map endpoint, handles 404s with a placeholder, retries on temporary errors with exponential backoff, and stops safely when repeated 429s or zero remaining quota are detected.

In [3]:
def fetch_species_hardiness_map_html(species_id: int, api_key: str) -> str | None:
    """
    Success -> returns HTML as a string
    404     -> returns a placeholder HTML comment
    None    -> signals the main loop to stop (e.g., quota/IP limit, repeated 429, or x-ratelimit-remaining == 0)
    """
    base_url = "https://perenual.com/api/hardiness-map"
    params = {"species_id": species_id, "size": "og", "key": api_key}
    attempt = 0
    consecutive_429 = 0

    while attempt <= MAX_RETRIES:
        try:
            resp = requests.get(
                base_url,
                params=params,
                headers={"accept": "text/html"},
                timeout=30
            )

            # A 200 response is still valid when it uses up the last unit of quota,
            # so return it before looking at the quota header
            if resp.status_code == 200:
                return resp.text

            # For non-200 responses, check the remaining-quota hint if the server provides it
            remaining = resp.headers.get("x-ratelimit-remaining")
            if remaining is not None:
                try:
                    if int(remaining) <= 0:
                        print(f"[ID {species_id}] x-ratelimit-remaining=0 → stopping safely (quota likely reached).")
                        return None
                except ValueError:
                    pass  # ignore if not an int

            if resp.status_code == 404:
                print(f"[ID {species_id}] 404: no hardiness map for this species (saving a placeholder).")
                return f"<!-- missing hardiness map for species_id={species_id} -->"

            if resp.status_code == 429:
                # Three 429s in a row → probably quota/IP-level restriction, stop
                consecutive_429 += 1
                if consecutive_429 >= 3:
                    print(f"[ID {species_id}] 429 occurred {consecutive_429} times in a row → stopping safely (rate/IP limit).")
                    return None
                wait = (BACKOFF_BASE ** attempt) + 0.2 * attempt
                print(f"[ID {species_id}] 429 Too Many Requests — waiting {wait:.1f}s before retrying…")
                time.sleep(wait)
                attempt += 1
                continue

            if resp.status_code in (500, 502, 503, 504):
                consecutive_429 = 0  # a non-429 response breaks the 429 streak
                wait = (BACKOFF_BASE ** attempt) + 0.2 * attempt
                print(f"[ID {species_id}] {resp.status_code} server error — waiting {wait:.1f}s then retrying…")
                time.sleep(wait)
                attempt += 1
                continue

            # For other unexpected status codes, log a short snippet for context
            print(f"[ID {species_id}] HTTP {resp.status_code}: {resp.text[:200]}")
            return f"<!-- error status={resp.status_code} for species_id={species_id} -->"

        except requests.RequestException as e:
            consecutive_429 = 0  # a network error also breaks the 429 streak
            wait = (BACKOFF_BASE ** attempt) + 0.2 * attempt
            print(f"[ID {species_id}] Network error {type(e).__name__}: {e}. Waiting {wait:.1f}s then retrying…")
            time.sleep(wait)
            attempt += 1

    print(f"[ID {species_id}] Retries exhausted. Skipping this one.")
    return f"<!-- retries exhausted for species_id={species_id} -->"
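
To make the retry timing concrete, the short sketch below evaluates the wait formula used in the fetcher, `(BACKOFF_BASE ** attempt) + 0.2 * attempt`, for every attempt allowed by the Step 1 settings. The helper name `backoff_wait` is introduced only for this illustration.

```python
BACKOFF_BASE = 1.6   # same values as in Step 1
MAX_RETRIES = 5

def backoff_wait(attempt: int, base: float = BACKOFF_BASE) -> float:
    """Wait time in seconds before the next retry, as computed in the fetcher."""
    return (base ** attempt) + 0.2 * attempt

# Attempts run from 0 to MAX_RETRIES inclusive, matching `while attempt <= MAX_RETRIES`
schedule = [round(backoff_wait(a), 2) for a in range(MAX_RETRIES + 1)]
total_wait = round(sum(backoff_wait(a) for a in range(MAX_RETRIES + 1)), 1)

print(schedule)    # [1.0, 1.8, 2.96, 4.7, 7.35, 11.49]
print(total_wait)  # 29.3
```

With these settings, a species whose requests fail with server or network errors on every attempt spends roughly 29 seconds sleeping (plus request time) before the function gives up. Repeated 429 responses stop earlier, because the third consecutive 429 returns `None` without another wait.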

## Step 4 - Main loop (resume-friendly, skips existing files)



Iterates through the ID range, skips files that already exist, adds a little random jitter between requests, and safely stops when the fetcher returns None.

In [4]:
from tqdm.auto import tqdm
import random

downloaded = 0
skipped = 0

for species_id in tqdm(range(START_ID, END_ID + 1), desc="Downloading Hardiness Maps"):
    fp = build_filepath(species_id)
    if fp.exists():
        skipped += 1
        continue

    html_text = fetch_species_hardiness_map_html(species_id, API_KEY)
    if html_text is None:
        print("Quota/rate-limit inferred → stopping safely.")
        break

    save_html(fp, html_text)
    downloaded += 1
    
    # Small random jitter helps avoid synchronized spikes
    time.sleep(SLEEP_BETWEEN + random.uniform(0.0, 0.6))

print(f"New downloads: {downloaded}, skipped (already exists): {skipped}. Output dir: {OUT_DIR.resolve()}")

Downloading Hardiness Maps:   0%|          | 0/3000 [00:00<?, ?it/s]

[ID 1188] x-ratelimit-remaining=0 → stopping safely (quota likely reached).
Quota/rate-limit inferred → stopping safely.
New downloads: 99, skipped (already exists): 1088. Output dir: /Users/klarissa/Documents/MASTERS/SEM3/FIT5120/Project/2025-08-SDG13-Plant-X-Website/01_data_wrangling/01_raw_data/03_hardiness_map
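
One consequence of the resume logic is worth noting: the placeholder comments saved for unexpected status codes and exhausted retries also count as existing files, so later runs will skip those IDs. The sketch below is one way to list such files so they can be reviewed or deleted before the next run. The helper name `find_transient_placeholders` is hypothetical, and the prefixes are copied from the placeholder strings returned in Step 3. The 404 placeholder is left out on purpose, on the assumption that a missing map is a permanent result.

```python
from pathlib import Path

# Prefixes of the placeholders that Step 3 returns for possibly temporary failures
TRANSIENT_PREFIXES = ("<!-- error status=", "<!-- retries exhausted")

def find_transient_placeholders(out_dir: Path) -> list[Path]:
    """Return saved files whose content starts with a transient-failure placeholder."""
    hits = []
    for fp in sorted(out_dir.glob("plant_species_hardiness_map_*.html")):
        # Only the first few characters are needed to recognise a placeholder
        with fp.open(encoding="utf-8", errors="replace") as fh:
            head = fh.read(64)
        if head.startswith(TRANSIENT_PREFIXES):
            hits.append(fp)
    return hits
```

Calling `find_transient_placeholders(OUT_DIR)` and removing the returned files would let the main loop fetch those IDs again on the next run.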


## Step 5 (Optional) - Pack and download as a zip (for Colab)



When running on Google Colab, zip the output folder and download it. On a local machine, a file manager is easier.

In [5]:
# import shutil
# from google.colab import files
#
# shutil.make_archive("03_species_hardiness_map", 'zip', "01_raw_data/03_hardiness_map")
# files.download("03_species_hardiness_map.zip")