# Dynamic/API Data Scraper

This notebook does two things:
- Checks whether the Treasury website (https://www.treasury.gov.lk/) exposes an API endpoint that returns JSON or XML.
- If no API is available, falls back to Selenium to scrape dynamically loaded content.


Import libraries

In [1]:
import requests
import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

print("✅ Libraries imported successfully.")


✅ Libraries imported successfully.


Check for possible API endpoints

In [2]:
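# First, check whether the URL returns JSON (the homepage is only a placeholder here)
potential_api_url = "https://www.treasury.gov.lk/"  # Replace with the actual endpoint once you find it

headers = {"User-Agent": "Group5Scraper/1.0 (+https://github.com/YourGitHubRepoLink)"}
# requests has no default timeout, so set one to avoid hanging on a slow server
response = requests.get(potential_api_url, headers=headers, timeout=10)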

if response.status_code == 200:
    if response.headers.get("Content-Type", "").startswith("application/json"):
        print("✅ API endpoint found and returns JSON:")
        print(json.dumps(response.json(), indent=2))
    else:
        print("⚠️ No JSON response detected. Might need Selenium.")
else:
    print("❌ Failed to fetch. Status code:", response.status_code)


⚠️ No JSON response detected. Might need Selenium.
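
The check above only recognizes a `Content-Type` that starts with `application/json`, while the goal of this notebook also mentions XML. A minimal sketch of a broader check is below. `classify_content_type` is a hypothetical helper name (not part of `requests`), and treating `application/xhtml+xml` as a regular web page rather than an API response is an assumption.

```python
def classify_content_type(content_type):
    """Return "json", "xml", or "other" for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()

    if media_type == "application/json" or media_type.endswith("+json"):
        return "json"
    # Assumption: XHTML is an ordinary web page, not an API payload.
    if media_type == "application/xhtml+xml":
        return "other"
    if media_type in ("application/xml", "text/xml") or media_type.endswith("+xml"):
        return "xml"
    return "other"


# Example header values (illustrative, not captured from the Treasury site)
for value in ("application/json; charset=utf-8", "text/xml", "text/html; charset=UTF-8"):
    print(value, "->", classify_content_type(value))
```

In the cell above, the helper could be called as `classify_content_type(response.headers.get("Content-Type"))` in place of the `startswith` test.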


Selenium dynamic scraping example

In [4]:
# Set up headless browser
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=options)  # Make sure ChromeDriver is installed

try:
    driver.get("https://www.treasury.gov.lk/")
    time.sleep(3)  # Wait for JS to load content

    # Example: print the visible text and href of the first 10 <a> elements
    elements = driver.find_elements(By.TAG_NAME, "a")
    for elem in elements[:10]:
        # elem.text is empty for icon-only links; href is None when the attribute is missing
        print(elem.text, "->", elem.get_attribute("href"))

finally:
    driver.quit()


 -> https://www.treasury.gov.lk/#home
 -> https://www.treasury.gov.lk/#budget-highlights
 -> https://www.treasury.gov.lk/#at-a-glance
 -> https://www.treasury.gov.lk/#mof-links
Ministry of Finance, Planning and Economic Development -> https://www.treasury.gov.lk/
සිංහල -> https://www.treasury.gov.lk/si/#
| -> None
தமிழ் -> https://www.treasury.gov.lk/ta/#
| -> None
English -> https://www.treasury.gov.lk/#
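
The output above contains separator items with no `href` (`| -> None`), links with no visible text, and several URLs that differ only by their `#fragment`. A small sketch of cleaning such pairs before further processing follows. It assumes the links have already been collected as `(text, href)` tuples, and `clean_links` is a hypothetical helper name. Collapsing URLs that differ only by fragment is a design choice that suits page-level scraping, not a requirement.

```python
from urllib.parse import urldefrag


def clean_links(pairs):
    """Drop entries without an href and keep one entry per fragment-free URL."""
    best = {}  # URL without fragment -> link text (dicts keep insertion order)
    for text, href in pairs:
        if not href:
            continue  # separators such as "|" have no href
        url = urldefrag(href).url
        label = (text or "").strip()
        # Keep the first label seen, but prefer a non-empty label over an empty one.
        if url not in best or (label and not best[url]):
            best[url] = label
    return list(best.items())


# Sample pairs copied from the output above
sample = [
    ("", "https://www.treasury.gov.lk/#home"),
    ("", "https://www.treasury.gov.lk/#budget-highlights"),
    ("Ministry of Finance, Planning and Economic Development", "https://www.treasury.gov.lk/"),
    ("සිංහල", "https://www.treasury.gov.lk/si/#"),
    ("|", None),
    ("English", "https://www.treasury.gov.lk/#"),
]
for url, label in clean_links(sample):
    print(label, "->", url)
```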


In [6]:
dynamic_html = driver.page_source

with open("data_raw/treasury_dynamic.html", "w", encoding="utf-8") as f:
    f.write(dynamic_html)

print("💾 Dynamic HTML saved to data_raw/treasury_dynamic.html")


MaxRetryError: HTTPConnectionPool(host='localhost', port=56660): Max retries exceeded with url: /session/fa83aa35d9a9dd68f696894617b02f0f/source (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001FFB7E3BD40>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

This cell fails because the `finally` block of the previous cell already called `driver.quit()`. The ChromeDriver process that was listening on `localhost:56660` no longer exists, so `driver.page_source` cannot connect to it. The page source has to be read before the driver is closed, or from a new driver session.

In [7]:
# 1️⃣ Start a new browser session (the earlier driver was already closed by driver.quit())
driver = webdriver.Chrome(options=options)

try:
    # 2️⃣ Open page
    driver.get("https://www.treasury.gov.lk/")
    time.sleep(3)  # Wait for JS to load content

    # 3️⃣ Save HTML while driver is still open (the data_raw/ folder must already exist)
    dynamic_html = driver.page_source
    with open("data_raw/treasury_dynamic.html", "w", encoding="utf-8") as f:
        f.write(dynamic_html)

    print("✅ Dynamic HTML saved to data_raw/treasury_dynamic.html")
finally:
    # 4️⃣ Close driver even if one of the steps above fails
    driver.quit()
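
The `MaxRetryError` seen earlier in this notebook comes from using a driver after `driver.quit()`. One way to make that mistake harder is to tie the driver's lifetime to a `with` block. The sketch below is an illustration, not the notebook's original approach: `browser_session` and `save_page_source` are hypothetical helper names, and the driver is created by a caller-supplied function so the helpers do not depend on a particular browser.

```python
import os
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


def save_page_source(driver, path):
    """Write driver.page_source to path, creating the parent folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    return path


# Usage with the objects defined earlier in this notebook (not executed here):
# with browser_session(lambda: webdriver.Chrome(options=options)) as driver:
#     driver.get("https://www.treasury.gov.lk/")
#     save_page_source(driver, "data_raw/treasury_dynamic.html")
```

Because the driver only exists inside the `with` block, any later cell that needs the page has to read the saved file or open its own session.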