Section 1: Project Setup and Ethical Scraping Practices

1.1. Core Libraries

In [1]:
%pip install requests beautifulsoup4 pandas lxml

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Section 1: Initial Setup and Ethical Practices

# Import the core libraries for web scraping and data handling
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import os

# Define the base URL and headers for ethical scraping
base_url = "https://www.treasury.gov.lk"
headers = {
    'User-Agent': 'Group 5 DA2009 Project Scraper/1.0 (Educational Purposes)'
}

# Add a polite delay before starting to be a good web citizen
time.sleep(2)
print("Ethical scraping setup complete.")
print(f"User-Agent set to: {headers['User-Agent']}")
print("Rate limiting is in effect. Requests will have a delay.")

Ethical scraping setup complete.
User-Agent set to: Group 5 DA2009 Project Scraper/1.0 (Educational Purposes)
Rate limiting is in effect. Requests will have a delay.
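
Note: the setup cell pauses once, but rate limiting only protects the server if a pause comes before every request. The block below is a minimal sketch of one way to enforce that. The `Throttle` class and its `min_delay_seconds` parameter are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import time


class Throttle:
    """Enforce a minimum gap between successive calls to wait()."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay_seconds = min_delay_seconds
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self):
        """Sleep just long enough to keep calls min_delay_seconds apart."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_delay_seconds - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Usage sketch (not executed here): call wait() right before every request.
# throttle = Throttle(min_delay_seconds=2.0)
# throttle.wait()
# response = requests.get(base_url, headers=headers, timeout=30)
```

The first call returns immediately, and later calls sleep only for whatever part of the delay has not already elapsed, so time spent parsing a page counts toward the pause.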


Section 2: Scraping Structured Data (HTML Tables)

In [3]:
# Section 2: Scraping Structured Data (HTML Tables)
from io import StringIO

# The homepage is the base URL itself
homepage_url = base_url

try:
    print("\nScraping for tables on the homepage...")
    time.sleep(2)  # polite delay before the request
    response = requests.get(homepage_url, headers=headers, timeout=30)
    response.raise_for_status()

    try:
        # Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated
        tables = pd.read_html(StringIO(response.text))
    except ValueError:
        # read_html raises ValueError when the page contains no tables
        tables = []

    if tables:
        print(f"Successfully scraped the homepage and found {len(tables)} tables.")

        # Save each table to a uniquely named CSV file
        for i, table in enumerate(tables, start=1):
            filename = f'table_{i}_homepage_data.csv'
            table.to_csv(filename, index=False)
            print(f"Table {i} saved to '{filename}'.")
    else:
        print("No tables found on the homepage.")

except Exception as e:
    print(f"An error occurred while scraping tables: {e}")


Scraping for tables on the homepage...
Successfully scraped the homepage and found 4 tables.
Table 1 saved to 'table_1_homepage_data.csv'.
Table 2 saved to 'table_2_homepage_data.csv'.
Table 3 saved to 'table_3_homepage_data.csv'.
Table 4 saved to 'table_4_homepage_data.csv'.



Section 3: Scraping Unstructured Data (News & Publications)

In [4]:
# Section 3: Scraping Unstructured Data (News & Publications)
from urllib.parse import urljoin

# Placeholder URL: find the real news or publications page by navigating the website
news_url = f"{base_url}/web/newsroom"
scraped_news = []

try:
    print("\nScraping news headlines and links...")
    time.sleep(2)  # polite delay before the request
    response = requests.get(news_url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # Placeholder selector: inspect the page to find the correct tags and classes
    news_items = soup.find_all('div', class_='news-item')

    for item in news_items:
        # Take the first link in the item that actually has an href attribute
        title_tag = item.find('a', href=True)
        if title_tag:
            title = title_tag.get_text(strip=True)
            # Resolve relative links against the URL of the page that was fetched
            link = urljoin(response.url, title_tag['href'])
            scraped_news.append({'Title': title, 'Link': link})

    if scraped_news:
        df_news = pd.DataFrame(scraped_news)
        print("Successfully scraped news data:")
        print(df_news.head())
        df_news.to_csv('news_headlines.csv', index=False)
        print("Data saved to 'news_headlines.csv'.")
    else:
        print("No news items found with the specified class.")

except Exception as e:
    print(f"An error occurred while scraping news: {e}")


Scraping news headlines and links...
An error occurred while scraping news: 404 Client Error: Not Found for url: https://www.treasury.gov.lk/web


Team Note: Status Update on News & Publications Scraping
Issue: The attempt to scrape the news and publications section failed with a 404 Client Error: Not Found. The server has no page at the URL we requested. This is expected for now, because news_url in Section 3 is still a placeholder.

Impact: No news data is collected until news_url and the 'news-item' selector are replaced with values taken from the live site. The remaining sections fetch the homepage themselves, so work on them can continue. Website URLs and page structure can also change without warning, so re-check both whenever a scrape starts failing.

Section 4: Extracting Links to Publications & Circulars

In [5]:
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.treasury.gov.lk"
headers = {'User-Agent': 'Group 5 DA2009 Project Scraper/1.0'}

try:
    time.sleep(2)  # polite delay before the request
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # Collect every anchor that has an href attribute
    publication_links = soup.find_all('a', href=True)

    scraped_links = []
    for link in publication_links:
        # Keep links whose URL mentions 'circular' or 'publication',
        # or whose visible text mentions 'circular'
        href = link['href']
        text = link.get_text(strip=True)
        if 'circular' in href.lower() or 'publication' in href.lower() or 'circular' in text.lower():
            full_url = urljoin(url, href)  # turn relative hrefs into absolute URLs
            scraped_links.append({'Text': text, 'URL': full_url})

    df_links = pd.DataFrame(scraped_links)
    if not df_links.empty:
        print("Successfully scraped publication links.")
        print(df_links.head())
        df_links.to_csv('publication_links.csv', index=False)
    else:
        print("No publication links found.")

except Exception as e:
    print(f"An error occurred: {e}")

Successfully scraped publication links.
                                     Text  \
0  Acts, Gazettes, Circulars & Guidelines   
1                                Gazettes   
2                                    Acts   
3                               Circulars   
4                              Guidelines   

                                                 URL  
0  https://www.treasury.gov.lk/acts-gazettes-circ...  
1  https://www.treasury.gov.lk/acts-gazettes-circ...  
2  https://www.treasury.gov.lk/web/circular-gazet...  
3  https://www.treasury.gov.lk/web/circular-gazet...  
4  https://www.treasury.gov.lk/web/circular-gazet...  
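
Note: a keyword filter like the one above can return the same page several times (for example through menu and footer links) and can pick up non-web links. The block below is a minimal sketch of a clean-up step, assuming the links are available as (text, href) pairs. `normalize_links` is a hypothetical helper, and the sample hrefs use example.org rather than real paths from the site.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, links):
    """Resolve hrefs against base_url, then drop non-HTTP links and duplicates.

    `links` is an iterable of (text, href) pairs.
    """
    seen = set()
    cleaned = []
    for text, href in links:
        # Make the href absolute and strip any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, tel: and similar
        if absolute in seen:
            continue  # keep only the first occurrence of each URL
        seen.add(absolute)
        cleaned.append({'Text': text, 'URL': absolute})
    return cleaned


# Hypothetical sample data for illustration only
sample_links = [
    ('Circulars', '/web/circulars'),
    ('Circulars (menu)', '/web/circulars#top'),
    ('Email us', 'mailto:info@example.org'),
    ('Gazettes', 'https://www.example.org/web/gazettes'),
]
cleaned_links = normalize_links('https://www.example.org', sample_links)
for row in cleaned_links:
    print(row['Text'], row['URL'])
```

In the notebook, the pairs could come from `[(a.get_text(strip=True), a['href']) for a in soup.find_all('a', href=True)]`, and the returned list can be passed straight to `pd.DataFrame`.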


Section 5: Scraping Contact Information

In [6]:
# This assumes contact info is on the main page.
# If not, you'd need to first find the 'Contact Us' link.

# Reuses the `soup` object (homepage) created in Section 4 and `re` imported in Section 1

contacts = soup.find_all(string=re.compile(r'contact|email|phone', re.I))

contact_data = []
for contact in contacts:
    parent = contact.parent.get_text(strip=True)
    contact_data.append({'Contact Info': parent})

df_contacts = pd.DataFrame(contact_data)
if not df_contacts.empty:
    print("Successfully scraped contact information.")
    print(df_contacts.head())
    df_contacts.to_csv('contact_info.csv', index=False)
else:
    print("No contact information found.")

Successfully scraped contact information.
                                        Contact Info
0                                         Contact us
1                                    Contact Details
2                                         Contact us
3  {"props":{"pageProps":{"ua":{"browser":"Chrome...
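
Note: row 3 of the output above is the page's embedded JSON, matched because the search also looks inside script tags, and the other rows are link labels rather than contact details. The block below is a minimal sketch of a stricter extraction, assuming the text nodes are supplied as (parent tag name, text) pairs. `extract_contacts`, both regular expressions, and the sample data are illustrative assumptions. The phone pattern in particular is deliberately loose and would need tuning against the real page.

```python
import re

# Simple patterns for illustration; neither is a full validator
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
# Optional '+', then 9 to 15 digits with optional single spaces or hyphens between them
PHONE_RE = re.compile(r'\+?\d(?:[\s-]?\d){8,14}')
# Text inside these tags is code or styling, not visible page content
SKIPPED_PARENTS = {'script', 'style', 'noscript'}


def extract_contacts(text_nodes):
    """Return unique emails and phone numbers found in visible text nodes.

    `text_nodes` is an iterable of (parent_tag_name, text) pairs.
    """
    emails, phones = [], []
    for parent_tag, text in text_nodes:
        if parent_tag in SKIPPED_PARENTS:
            continue
        for email in EMAIL_RE.findall(text):
            if email not in emails:
                emails.append(email)
        for phone in PHONE_RE.findall(text):
            if phone not in phones:
                phones.append(phone)
    return {'emails': emails, 'phones': phones}


# Hypothetical sample data for illustration only
sample_nodes = [
    ('a', 'Contact us'),
    ('p', 'Email: info@example.org'),
    ('p', 'Phone: +94 11 234 5678'),
    ('script', '{"props":{"email":"hidden@example.org"}}'),
]
contacts = extract_contacts(sample_nodes)
print(contacts)
```

With BeautifulSoup, the pairs could be built as `[(s.parent.name, str(s)) for s in soup.find_all(string=True)]`.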


Section 6: Getting the Page Title and Headings

In [7]:
# Reuses the `soup` object (homepage) created in Section 4

page_title = soup.title.get_text(strip=True) if soup.title else "No Title Found"
print(f"\nPage Title: {page_title}")

headings = soup.find_all(['h1', 'h2', 'h3'])

print("\nPage Headings:")
for heading in headings:
    print(heading.get_text(strip=True))


Page Title: Ministry of Finance - Sri lanka

Page Headings:
Ministry of Finance, Planning and Economic Development
Press Release

Monthly Fiscal Review Report
Press Release
Press Release
Press Release
Press Release

Monthly Fiscal Review Report
Press Release
Press Release
Press Release
Press Release
Economic Indicators
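
Note: the heading list above contains blank lines (headings with no text) and many repeats. The block below is a minimal sketch of a tidy-up step, assuming the headings are supplied as (tag name, text) pairs. `unique_headings` is a hypothetical helper, and the tag levels in the sample are guesses for illustration, not values read from the page.

```python
def unique_headings(headings):
    """Drop empty headings and repeats, keeping first-seen order.

    `headings` is an iterable of (tag_name, text) pairs.
    """
    seen = set()
    result = []
    for tag_name, text in headings:
        text = text.strip()
        if not text or (tag_name, text) in seen:
            continue
        seen.add((tag_name, text))
        result.append((tag_name, text))
    return result


# Sample data: texts taken from the output above, tag levels assumed
sample_headings = [
    ('h1', 'Ministry of Finance, Planning and Economic Development'),
    ('h3', 'Press Release'),
    ('h3', ''),
    ('h3', 'Monthly Fiscal Review Report'),
    ('h3', 'Press Release'),
    ('h2', 'Economic Indicators'),
]
for tag_name, text in unique_headings(sample_headings):
    print(f"{tag_name}: {text}")
```

In the notebook, the pairs could be built as `[(h.name, h.get_text(strip=True)) for h in soup.find_all(['h1', 'h2', 'h3'])]`. Keeping the tag name alongside the text preserves the heading level, which the plain printout loses.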
