# 01 - Plant Species Details — Data Pull Notebook

Name: Zihan Yin

This notebook fetches plant species detail JSON from Perenual’s Open API and saves one file per species ID. I kept it simple: one API key, polite rate limiting, retry logic, and a “safe stop” when I hit the daily quota. Run it top to bottom; it’s restart-friendly and skips files that already exist.

Before each run, teammates should pull the latest version from our GitHub repository and then run Steps 1 to 4. In practice, clicking "Run All" is enough.

## Step 1 - Configuration (range, output paths)



I set my API key, the ID range to pull, some conservative rate-limit settings, and where to save the files. Paths use `/` so they work cross-platform.

In [1]:
# I put my API key here. Replace with your own.
API_KEY = "sk-kMpx68b02586bb61112084"

# IDs I want to fetch (I usually test with a small range first).
START_ID = 1
END_ID   = 3000

# Rate limit & retry settings 
SLEEP_BETWEEN = 1        # seconds to sleep after each request
MAX_RETRIES   = 5
BACKOFF_BASE  = 1.6

# Output directory & filename pattern.
# I use forward slashes; pathlib will adapt automatically on Windows/macOS/Linux.
from pathlib import Path
OUT_DIR = Path("01_raw_data/01_species_details")
FILENAME_PATTERN = "plant_species_details_{species_id}.json"
OUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Key: ****{API_KEY[-6:]}, Range: {START_ID}~{END_ID}, Output dir: {OUT_DIR.resolve()}")

Key: ****112084, Range: 1~3000, Output dir: E:\05_YZH_DS\02_Monash_DS\2025_S2_FIT5120_Industry_Experience_Studio_Project\06_main_project\03_github_submission\03_github_submission\2025-08-SDG13-Plant-X-Website\01_data_wrangling\01_raw_data\01_species_details
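
Since the key is hard-coded in the cell above, one option before pushing to GitHub is to read it from an environment variable instead. This is only a minimal sketch, not what the notebook currently does: the variable name `PERENUAL_API_KEY` and the helpers `load_api_key` and `mask_key` are hypothetical names I made up for illustration.

```python
import os

def load_api_key(env_var: str = "PERENUAL_API_KEY", fallback: str = "") -> str:
    """Return the key from the environment, or the fallback if it is unset or blank."""
    # env_var is an assumed name; pick whatever your team agrees on.
    key = os.environ.get(env_var, "").strip()
    return key or fallback

def mask_key(key: str, visible: int = 6) -> str:
    """Show only the last few characters, like the print statement in Step 1."""
    return "****" + key[-visible:] if key else "(no key set)"

# Demo with a made-up fallback key (not a real one).
print(f"Key: {mask_key(load_api_key(fallback='sk-demo-123456'))}")
```

With this in place, `API_KEY = load_api_key()` would replace the hard-coded assignment, and each teammate sets the variable on their own machine.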


## Step 2 - Some Small Helper Functions



A tiny save function and a path builder.

In [2]:
import time
import json
from typing import Optional, Dict, Any
import requests

def save_json(path: Path, data: Dict[str, Any]):
    # UTF-8 so everything is readable.
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")

def build_filepath(species_id: int) -> Path:
    # One JSON per species_id, based on the pattern above.
    return OUT_DIR / FILENAME_PATTERN.format(species_id=species_id)
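
Because the main loop skips any file that already exists, a file left half-written by an interrupted run would be skipped on the next run. A possible variation on `save_json` is to write to a temporary file first and then rename it. This is a sketch under that assumption; `save_json_atomic` is a hypothetical name and the sample record is made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

FILENAME_PATTERN = "plant_species_details_{species_id}.json"

def save_json_atomic(path: Path, data: Dict[str, Any]) -> None:
    """Write to a sibling .tmp file first, then rename it over the target."""
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    # The rename happens only after the full content is on disk.
    tmp_path.replace(path)

# Round trip in a throwaway directory with a made-up record.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / FILENAME_PATTERN.format(species_id=7)
    sample = {"id": 7, "common_name": "Example fir (made up)"}
    save_json_atomic(target, sample)
    loaded = json.loads(target.read_text(encoding="utf-8"))
    leftovers = list(Path(tmp_dir).glob("*.tmp"))

print(target.name, loaded == sample, leftovers)
# prints: plant_species_details_7.json True []
```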

## Step 3 - Fetch one species (with retries & safe-stop)



This is the actual fetcher. It handles 404s (I save a placeholder), retries on 429/5xx with exponential backoff, and “safe-stops” if it keeps hitting 429 or the response headers say my daily quota is used up.

In [3]:
def fetch_species_details(species_id: int, api_key: str) -> Optional[dict]:
    """
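    Success              -> returns the JSON dict
    404                  -> returns {"__missing__": True, "id": species_id} as a placeholder
    Other HTTP status    -> returns {"__error_status__": <code>, "id": species_id}
    Retries exhausted    -> returns {"__error_retries_exhausted__": True, "id": species_id}
    Quota / repeated 429 -> returns None, which signals the main loop to stop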
    """
    url = f"https://perenual.com/api/v2/species/details/{species_id}"
    attempt = 0
    consecutive_429 = 0

    while attempt <= MAX_RETRIES:
        try:
            resp = requests.get(
                url,
                params={"key": api_key},
                headers={"accept": "application/json"},
                timeout=30
            )

            # The API tells me the remaining quota in this response header.
            remaining = resp.headers.get("x-ratelimit-remaining")
            if remaining is not None:
                try:
                    if int(remaining) <= 0:
                        print(f"[ID {species_id}] x-ratelimit-remaining=0 → I’ll stop safely (quota reached).")
                        return None
                except ValueError:
                    pass  # if it's not an integer, just ignore it

            if resp.status_code == 200:
                return resp.json()

            if resp.status_code == 404:
                print(f"[ID {species_id}] 404: not found / no data. I’ll save a placeholder.")
                return {"__missing__": True, "id": species_id}

            if resp.status_code == 429:
                consecutive_429 += 1
                # If I hit too many 429s in a row, it’s likely the daily/IP limit.
                if consecutive_429 >= 3:
                    print(f"[ID {species_id}] 429 happened {consecutive_429} times in a row → I’ll stop safely.")
                    return None
                wait = (BACKOFF_BASE ** attempt) + 0.2 * attempt
                print(f"[ID {species_id}] 429 Too Many Requests, waiting {wait:.1f}s before retrying…")
                time.sleep(wait)
                attempt += 1
                continue

            if resp.status_code in (500, 502, 503, 504):
                wait = (BACKOFF_BASE ** attempt) + 0.2 * attempt
                print(f"[ID {species_id}] {resp.status_code} server error, waiting {wait:.1f}s then retrying…")
                time.sleep(wait)
                attempt += 1
                continue

            # For other unexpected status codes, I log a short snippet for context.
            print(f"[ID {species_id}] HTTP {resp.status_code}: {resp.text[:200]}")
            return {"__error_status__": resp.status_code, "id": species_id}

        except requests.RequestException as e:
            wait = (BACKOFF_BASE ** attempt) + 0.2 * attempt
            print(f"[ID {species_id}] Network error ({type(e).__name__}): {e}. Waiting {wait:.1f}s then retrying…")
            time.sleep(wait)
            attempt += 1

    print(f"[ID {species_id}] Retries exhausted. I’m giving up on this one.")
    return {"__error_retries_exhausted__": True, "id": species_id}
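
To see how long the fetcher can end up sleeping on a single ID, the wait formula from the cell above can be tabulated on its own. The formula and constants are copied from this notebook; `backoff_wait` is a hypothetical helper name used only for this sketch.

```python
BACKOFF_BASE = 1.6
MAX_RETRIES = 5

def backoff_wait(attempt: int, base: float = BACKOFF_BASE) -> float:
    """Same formula as in fetch_species_details: exponential term plus a small linear term."""
    return (base ** attempt) + 0.2 * attempt

# The while loop runs for attempt = 0..MAX_RETRIES, so there are six possible waits.
schedule = [round(backoff_wait(a), 2) for a in range(MAX_RETRIES + 1)]
total_wait = round(sum(backoff_wait(a) for a in range(MAX_RETRIES + 1)), 1)

print(schedule)    # [1.0, 1.8, 2.96, 4.7, 7.35, 11.49]
print(total_wait)  # 29.3
```

So a species ID that keeps failing with 5xx or network errors costs about 29 seconds of sleeping, plus the request time itself. In the pure 429 case the fetcher stops on the third 429 in a row, so it only sleeps for the first two waits (about 2.8 seconds) before giving up.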

## Step 4 - Main loop (resume-friendly, skips existing files)



I iterate over the ID range, skip files that already exist, and pause a bit between calls. If the fetcher returns None, I assume quota/rate limit and stop politely.

In [4]:
from tqdm.auto import tqdm
import random

downloaded = 0
skipped = 0

for species_id in tqdm(range(START_ID, END_ID + 1), desc="Downloading Species Details"):
    fp = build_filepath(species_id)
    if fp.exists():
        skipped += 1
        continue

    data = fetch_species_details(species_id, API_KEY)
    if data is None:
        print("Quota/rate-limit inferred → stopping safely.")
        break

    save_json(fp, data)
    downloaded += 1
    time.sleep(SLEEP_BETWEEN + random.uniform(0.0, 0.6))

print(f"New downloads: {downloaded}, skipped (already exists): {skipped}. Output dir: {OUT_DIR.resolve()}")

Downloading Species Details:  49%|████▉     | 1484/3000 [03:45<03:50,  6.58it/s] 

[ID 1485] x-ratelimit-remaining=0 → I’ll stop safely (quota reached).
Quota/rate-limit inferred → stopping safely.
New downloads: 99, skipped (already exists): 1385. Output dir: E:\05_YZH_DS\02_Monash_DS\2025_S2_FIT5120_Industry_Experience_Studio_Project\06_main_project\03_github_submission\03_github_submission\2025-08-SDG13-Plant-X-Website\01_data_wrangling\01_raw_data\01_species_details
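
One thing to keep in mind with the skip logic: the loop also saves the placeholders for 404s and errors, so those IDs are skipped on later runs too. A rough sketch for auditing the output folder is below. The function names are hypothetical, the placeholder keys are the ones used in Step 3, and the demo files are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

PLACEHOLDER_KEYS = ("__missing__", "__error_status__", "__error_retries_exhausted__")

def classify_file(path: Path) -> str:
    """Return 'ok', 'unreadable', or the placeholder key found in the file."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # JSONDecodeError is a ValueError
        return "unreadable"
    if isinstance(data, dict):
        for key in PLACEHOLDER_KEYS:
            if key in data:
                return key
    return "ok"

def audit_output_dir(out_dir: Path) -> Dict[str, List[Path]]:
    """Group the saved species files by classification."""
    groups: Dict[str, List[Path]] = {}
    for path in sorted(out_dir.glob("plant_species_details_*.json")):
        groups.setdefault(classify_file(path), []).append(path)
    return groups

# Demo on a throwaway folder with made-up files.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = Path(tmp_dir)
    (demo_dir / "plant_species_details_1.json").write_text('{"id": 1}', encoding="utf-8")
    (demo_dir / "plant_species_details_2.json").write_text('{"__missing__": true, "id": 2}', encoding="utf-8")
    (demo_dir / "plant_species_details_3.json").write_text('{"__error_status__": 403, "id": 3}', encoding="utf-8")
    summary = {label: len(paths) for label, paths in audit_output_dir(demo_dir).items()}

print(summary)
```

Running `audit_output_dir(OUT_DIR)` on the real folder would list the error placeholders, which could then be deleted so the next run fetches those IDs again.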





## Step 5 (Optional) - Pack and download as a zip (for Colab)



When I’m on Colab, this zips the folder and downloads it to my laptop.

In [5]:
# import shutil
# from google.colab import files
#
# # Zip the entire folder
# shutil.make_archive("01_species_details", 'zip', "01_raw_data/01_species_details")
#
# # Download to my local machine
# files.download("01_species_details.zip")
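
Outside Colab, the zipping part works the same way without the `files.download` call. Here is a small sketch using only the standard library; `zip_folder` is a hypothetical helper name, and the demo runs on a throwaway folder rather than the real output directory.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def zip_folder(src_dir: Path, archive_stem: Path) -> Path:
    """Zip the contents of src_dir into <archive_stem>.zip and return the archive path."""
    return Path(shutil.make_archive(str(archive_stem), "zip", root_dir=str(src_dir)))

# Demo on a throwaway folder with one made-up file.
with tempfile.TemporaryDirectory() as tmp_dir:
    src = Path(tmp_dir) / "01_species_details"
    src.mkdir()
    (src / "plant_species_details_1.json").write_text("{}", encoding="utf-8")
    archive = zip_folder(src, Path(tmp_dir) / "01_species_details")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(archive.name, names)
```

For the real data, `zip_folder(OUT_DIR, Path("01_species_details"))` would write `01_species_details.zip` next to the notebook.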