Fetch the Main Page

In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://www.treasury.gov.lk/"
response = requests.get(url)

print(response.status_code)  # 200 means successful
html_content = response.text


200


This code:

Connects to the Ministry of Finance website.

Prints the HTTP status code to confirm the request succeeded (200 means OK).

Saves the website’s HTML code for further processing.
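The request above sets no timeout and only prints the status code. A minimal sketch of a slightly more defensive fetch, assuming a 10-second timeout is acceptable. The names `fetch_html` and `get` are hypothetical helpers introduced here for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page's HTML, raising an exception on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without network access (an assumption
    made for this sketch).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Usage (performs a real network request):
# html_content = fetch_html("https://www.treasury.gov.lk/")
```

With this shape, a failed request stops the notebook early instead of passing an error page on to the parsing steps.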

Parse HTML Content

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Optional: print the first 1000 characters to inspect
print(soup.prettify()[:1000])


<!DOCTYPE html>
<html lang="en">
 <head>
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=G-X5XT78QC7P">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
            function gtag(){dataLayer.push(arguments);}
            gtag('js', new Date());
            gtag('config', 'G-X5XT78QC7P', {
              page_path: window.location.pathname,
            });
  </script>
  <meta content="width=device-width" name="viewport"/>
  <meta charset="utf-8"/>
  <link href="https://fonts.googleapis.com/css2?family=Raleway:wght@400;500;700;900&amp;display=swap" rel="stylesheet"/>
  <title>
   Ministry of Finance - Sri lanka
  </title>
  <link href="/favicon.ico" rel="icon"/>
  <link as="style" href="/_next/static/css/32c251979959008ca97b.css" rel="preload"/>
  <link data-n-g="" href="/_next/static/css/32c251979959008ca97b.css" rel="stylesheet"/>
  <link as="style" href="/_next/static/css/a1dd6b1741da6116c21d.css" rel="preload"/>
  <link data-n-p="" href="/_next/

This code:

Takes the HTML from the website.

Converts it into a structured format that’s easy to work with.

Prints the first part of it so you can visually confirm the page content.
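Once the HTML is parsed, individual elements can be looked up by tag name and attributes. The sketch below runs on a small inline snippet modeled on the <head> printed above, not on the live page; `sample_html` and `sample_soup` are illustrative names.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for response.text, modeled on the <head> printed above
sample_html = """
<html lang="en">
 <head>
  <title>Ministry of Finance - Sri lanka</title>
  <link href="/favicon.ico" rel="icon"/>
  <link href="/_next/static/css/32c251979959008ca97b.css" rel="stylesheet"/>
 </head>
 <body></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all with attribute filters returns only the matching tags
stylesheets = [
    link["href"]
    for link in sample_soup.find_all("link", rel="stylesheet", href=True)
]

print(stylesheets)
print(sample_soup.html.get("lang"))
```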

In [4]:
import pandas as pd

url = "https://www.treasury.gov.lk"  # replace with actual table page
tables = pd.read_html(url)

print(f"Found {len(tables)} tables")
for i, table in enumerate(tables):
    print(f"Table {i}")
    print(table.head())


Found 4 tables
Table 0
  Currency     Buying    Selling
0      USD  298.86LKR  306.09LKR
1      GBP  401.01LKR  413.27LKR
Table 1
                    Unnamed: 0   Year / Month Amount / LKR Bn
0  Governement Revenue & Grant  Jan-July 2024          2155.9
1       Government Expenditure  Jan-July 2024          3034.4
2       Overall Budget Deficit              .               .
Table 2
   #               Item   Pettah Dambulla
0  1              Samba  0LKR/Kg  0LKR/Kg
1  2  Red-Onions(Local)  0LKR/Kg  0LKR/Kg
Table 3
     Month/Year  Export  Import  Trade Balance
0  Jan-Nov 2023       0       0              0
1  Jan-Nov 2022       0       0              0


This code:

Opens the Treasury webpage.

Finds all <table> elements automatically.

Saves each table into a DataFrame.

Displays them one by one.

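It’s a shortcut compared to walking the tables with BeautifulSoup, because pandas.read_html locates and parses every <table> for you. Note that read_html needs a parser library installed (lxml, or BeautifulSoup together with html5lib).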
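The exchange-rate columns come back as text such as "298.86LKR", so they cannot be used in arithmetic directly. A possible cleanup step is sketched below on a hand-built copy of Table 0; the `rates` DataFrame and the `Spread` column are illustrative additions, not part of the scraped data.

```python
import pandas as pd

# Stand-in for tables[0], copied from the output above
rates = pd.DataFrame(
    {
        "Currency": ["USD", "GBP"],
        "Buying": ["298.86LKR", "401.01LKR"],
        "Selling": ["306.09LKR", "413.27LKR"],
    }
)

for column in ["Buying", "Selling"]:
    # Strip the "LKR" suffix, then convert the remaining text to float
    rates[column] = rates[column].str.replace("LKR", "", regex=False).astype(float)

# Difference between selling and buying rate, rounded to 2 decimals
rates["Spread"] = (rates["Selling"] - rates["Buying"]).round(2)
print(rates)
```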

How Many Rows Are in the Tables

In [6]:
from lxml import html
import requests

resp = requests.get("https://www.treasury.gov.lk")
tree = html.fromstring(resp.content)

# Extract all table rows with XPath
rows = tree.xpath("//table//tr")
print(len(rows))


13


This code:

Connects to the Treasury site.

Parses the HTML with lxml.

Counts every <tr> row across all tables combined, header rows included.

It’s like asking: “How many rows of data are on this website’s tables?”
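A single total hides how the rows are spread across tables. As a sketch, the same XPath idea can count rows per table and separate data rows from header rows. It runs on an inline two-table snippet (`sample_page` is made up for illustration), not on the live site.

```python
from lxml import html

sample_page = """
<html><body>
 <table>
  <tr><th>Currency</th><th>Buying</th></tr>
  <tr><td>USD</td><td>298.86LKR</td></tr>
  <tr><td>GBP</td><td>401.01LKR</td></tr>
 </table>
 <table>
  <tr><th>Item</th><th>Pettah</th></tr>
  <tr><td>Samba</td><td>0LKR/Kg</td></tr>
 </table>
</body></html>
"""

sample_tree = html.fromstring(sample_page)

# ".//tr" is relative to each <table>; "//tr" would search the whole document
rows_per_table = [
    len(table.xpath(".//tr")) for table in sample_tree.xpath("//table")
]

# Data rows only: rows that contain at least one <td> cell
data_rows = len(sample_tree.xpath("//table//tr[td]"))

print(rows_per_table, data_rows)
```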

In [8]:
pip install scrapy


Note: you may need to restart the kernel to use updated packages.


In [10]:
import scrapy


In [15]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE = "https://www.treasury.gov.lk"

# Part A: Extract all tables
tables = pd.read_html(BASE)
print(f"Found {len(tables)} tables")
for i, t in enumerate(tables):
    print(f"\nTable {i}")
    print(t.head())

# Save to Excel
with pd.ExcelWriter("treasury_tables.xlsx") as w:
    for i, t in enumerate(tables):
        t.to_excel(w, sheet_name=f"table_{i}", index=False)

# Part B: Extract all links
resp = requests.get(BASE)
soup = BeautifulSoup(resp.text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]

print(f"\nFound {len(links)} links")
print("First 10 links:", links[:10])

pd.Series(links, name="links").to_csv("homepage_links.csv", index=False)

# Part C: Budget Highlights text
budget = soup.find(id="budget-highlights")
if budget:
    print("\nBudget Highlights Section:")
    print(budget.get_text(strip=True))


Found 4 tables

Table 0
  Currency     Buying    Selling
0      USD  298.86LKR  306.09LKR
1      GBP  401.01LKR  413.27LKR

Table 1
                    Unnamed: 0   Year / Month Amount / LKR Bn
0  Governement Revenue & Grant  Jan-July 2024          2155.9
1       Government Expenditure  Jan-July 2024          3034.4
2       Overall Budget Deficit              .               .

Table 2
   #               Item   Pettah Dambulla
0  1              Samba  0LKR/Kg  0LKR/Kg
1  2  Red-Onions(Local)  0LKR/Kg  0LKR/Kg

Table 3
     Month/Year  Export  Import  Trade Balance
0  Jan-Nov 2023       0       0              0
1  Jan-Nov 2022       0       0              0

Found 96 links
First 10 links: ['#home', '#budget-highlights', '#at-a-glance', '#mof-links', '/', '/si/#', '/ta/#', '/#', '/search', '/']

Budget Highlights Section:



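This code:

Extracts all tables from the Treasury homepage.

Saves them into an Excel file, one sheet per table.

Extracts all links (such as menus, resources, and PDFs).

Saves them into a CSV file.

Looks for the element with id "budget-highlights" and prints its text. In the run above only the heading was printed, so the element was found but held no text in the downloaded HTML.

So you’ve got structured data (tables) and navigation data (links), both stored in reusable files.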
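Many of the collected links are same-page anchors ("#home") or relative paths ("/search"), which are not directly usable outside the page. One way to normalize them, sketched with the standard library on a few hrefs copied from the output above; `BASE_URL`, `sample_links`, and `absolute_links` are illustrative names.

```python
from urllib.parse import urldefrag, urljoin

BASE_URL = "https://www.treasury.gov.lk/"

# A few hrefs copied from the output above
sample_links = ["#home", "#budget-highlights", "/", "/si/#", "/search", "/"]

absolute_links = []
for href in sample_links:
    if href.startswith("#"):
        continue  # same-page anchor, not a separate page
    # Resolve the relative path against the site root, then drop any fragment
    absolute = urldefrag(urljoin(BASE_URL, href)).url
    if absolute not in absolute_links:  # keep first occurrence only
        absolute_links.append(absolute)

print(absolute_links)
```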

Metadata

In [16]:
import requests
from bs4 import BeautifulSoup

url = "https://www.treasury.gov.lk"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, "html.parser")

print("Page title:", soup.title.string)


Page title: Ministry of Finance - Sri lanka


This code:

Connects to the Treasury website.

Parses the HTML with BeautifulSoup.

Extracts just the title of the page.
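The title is only one piece of page metadata; <meta> tags carry more. The sketch below collects them from an inline snippet modeled on the <head> printed earlier and guards against a missing <title>. The names `sample_head` and `meta_soup` are illustrative.

```python
from bs4 import BeautifulSoup

sample_head = """
<html><head>
 <meta content="width=device-width" name="viewport"/>
 <meta charset="utf-8"/>
 <title>
  Ministry of Finance - Sri lanka
 </title>
</head><body></body></html>
"""

meta_soup = BeautifulSoup(sample_head, "html.parser")

# get_text(strip=True) trims surrounding whitespace; guard against no <title>
title = meta_soup.title.get_text(strip=True) if meta_soup.title else None

# Map each named <meta> tag to its content attribute
meta = {
    tag["name"]: tag.get("content", "")
    for tag in meta_soup.find_all("meta", attrs={"name": True})
}

# The charset lives on its own <meta charset="..."> tag
charset_tag = meta_soup.find("meta", charset=True)
charset = charset_tag["charset"] if charset_tag else None

print(title, meta, charset)
```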

In [17]:
images = [img["src"] for img in soup.find_all("img", src=True)]
print("Images:", images[:10])


Images: ['/assets/images/main-bannerlogo.png', '/assets/icons/search/searchblue.png', '/assets/icons/search/searchwhite.png', '/assets/icons/hamburger.svg', 'data:image/svg+xml;charset=utf-8,<svg width="20" height="20" xmlns="http://www.w3.org/2000/svg" version="1.1"/>', 'data:image/svg+xml;charset=utf-8,<svg width="20" height="20" xmlns="http://www.w3.org/2000/svg" version="1.1"/>', 'data:image/svg+xml;charset=utf-8,<svg width="20" height="20" xmlns="http://www.w3.org/2000/svg" version="1.1"/>', 'data:image/svg+xml;charset=utf-8,<svg width="20" height="20" xmlns="http://www.w3.org/2000/svg" version="1.1"/>', 'data:image/svg+xml;charset=utf-8,<svg width="20" height="20" xmlns="http://www.w3.org/2000/svg" version="1.1"/>', 'data:image/svg+xml;charset=utf-8,<svg width="20" height="20" xmlns="http://www.w3.org/2000/svg" version="1.1"/>']


This code:

Finds all image elements (<img>) on the page.

Collects their src values, which can be file paths (such as .png or .svg files) or inline data: URIs.

Shows you the first 10 of them.
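Most of the entries in the output are inline "data:" placeholders rather than downloadable files. A short sketch that keeps only real image paths and turns them into absolute URLs, using values copied from the output above; `SITE`, `sample_srcs`, and `image_urls` are illustrative names.

```python
from urllib.parse import urljoin

SITE = "https://www.treasury.gov.lk"

# Values copied from the output above
sample_srcs = [
    "/assets/images/main-bannerlogo.png",
    "/assets/icons/hamburger.svg",
    'data:image/svg+xml;charset=utf-8,<svg width="20" height="20" '
    'xmlns="http://www.w3.org/2000/svg" version="1.1"/>',
]

image_urls = [
    urljoin(SITE, src)
    for src in sample_srcs
    if not src.startswith("data:")  # inline placeholder, nothing to download
]

print(image_urls)
```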

In [18]:
pip install pillow pytesseract requests





In [1]:
pip install selenium webdriver-manager pillow


Collecting webdriver-manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Installing collected packages: webdriver-manager
Successfully installed webdriver-manager-4.0.2
Note: you may need to restart the kernel to use updated packages.


Selenium Screenshot Code

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

URL = "https://www.treasury.gov.lk"

options = webdriver.ChromeOptions()
options.add_argument("--window-size=1400,1600")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

try:
    driver.get(URL)
    time.sleep(2)  # fixed pause so the page can finish rendering
    driver.save_screenshot("treasury_test.png")
    print("Screenshot saved successfully.")
finally:
    driver.quit()


Screenshot saved successfully.


In Simple Words

This code:

Opens the Ministry of Finance website in Chrome.

Waits a little for it to load.

Captures a screenshot of the page.

Saves it to your computer.

Closes the browser.

OCR Code

In [8]:
from PIL import Image
import pytesseract

# Set the Tesseract path explicitly if it is not on your PATH (Windows example below)
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Open the screenshot
img = Image.open("treasury_test.png")

# Convert to grayscale + upscale for better OCR
gray = img.convert("L")
gray = gray.resize((gray.width*2, gray.height*2))

# Run OCR
text = pytesseract.image_to_string(gray, lang="eng", config="--psm 6")

# Preview some text
print("\n--- OCR Preview ---\n")
print(text[:600])  # show first 600 chars

# Save to file
with open("treasury_ocr.txt", "w", encoding="utf-8") as f:
    f.write(text)

print("\nFull OCR text saved -> treasury_ocr.txt")



--- OCR Preview ---

So.
Co Ministry of Finance, Planning and Economic . ; ()
(Fy) ' Boe TA)] English
27) Development ene | Sudlip | Englis
Home | About Us | Ministry and Departments | Acts, Gazettes, Circulars & Guidelines | Newsroom | RTI / Internal Affairs Unit | Contact us
e
&
a
=
o
a=
« ®sri Lanka Recorded One of the Most
NATIONAL <) : = & yr, ~~ ~~. | ce
a ; ie! pg | oh = ; | The ™ “ NS =~ ‘  -_=
Stat ts and Treasury
Citizen Budget > Remarke. an > Management > Vacancies > Publications >
Systems


Full OCR text saved -> treasury_ocr.txt


In Simple Words

This code:

Opens the screenshot from Selenium.

Converts it to grayscale and doubles its size to help the OCR engine.

Runs OCR to extract text from the image.

Prints a preview and saves the full text into a .txt file.
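The OCR preview mixes readable menu text with noise from logos and banner images. One rough way to filter it is to keep only lines that are mostly letters. This is a heuristic sketch: `keep_readable_lines` and its thresholds are assumptions chosen for illustration, and the sample lines are copied from the preview above.

```python
def keep_readable_lines(ocr_text, min_letters=10, min_ratio=0.7):
    """Keep lines with enough letters and a high share of letters among non-space characters."""
    kept = []
    for line in ocr_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        letters = sum(ch.isalpha() for ch in stripped)
        non_space = sum(not ch.isspace() for ch in stripped)
        if letters >= min_letters and letters / non_space >= min_ratio:
            kept.append(stripped)
    return kept


# Lines copied from the OCR preview above
sample_ocr = "\n".join(
    [
        "So.",
        "Home | About Us | Ministry and Departments | Acts, Gazettes, "
        "Circulars & Guidelines | Newsroom | RTI / Internal Affairs Unit | Contact us",
        "e",
        "NATIONAL <) : = & yr, ~~ ~~. | ce",
        "Citizen Budget > Remarke. an > Management > Vacancies > Publications >",
    ]
)

readable = keep_readable_lines(sample_ocr)
for line in readable:
    print(line)
```

The thresholds trade recall for cleanliness: short but legitimate lines (such as a one-word heading) are dropped along with the noise.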