# Web Scraping with Unstructured.IO and Selenium

(Auto Scrape Iteration 2 Part 1)

In this attempt, I use the link-extraction capability of Unstructured's partition_html to crawl pages rapidly and collect only their links. The crawl currently targets the DC Guidelines only, but it can be extended to other sites. The links are saved to a CSV file, and Selenium is then used to grab the HTML content in the main body of each webpage.

This notebook is a documented copy of the _dcg\_scraper.py_ file, which is now deprecated and removed from the repo.

Import libraries as below

In [1]:
from unstructured.partition.html import partition_html
import time
import os
import csv
from urllib.parse import urlparse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import json
import subprocess
from bs4 import BeautifulSoup  # convert_images_to_links calls BeautifulSoup directly

Hardcoded function to extract the main body of a page (deprecated, but it still works)

In [2]:
def get_body(elements):
    '''
    This function finds the document element list indexes corresponding to the main body of the webpage, in the absence of usage of proper body tags

    Parameters:
        elements (list): A list of Unstructured document objects

    Returns:
        list: The sliced elements list whose indexes point to document objects containing text and other metadata from the main body of the webpage
    '''

    START = 0
    END = 0
    flag = False

    for i in range(len(elements)):

        if flag == False and elements[i].text == 'Search':
            flag = True
            continue
        elif START == 0 and elements[i].category == 'Title':
            START = i
            continue
        elif END == 0 and elements[i].text == 'Urban Redevelopment Authority':
            END = i
            break
    return elements[START:END]

Starting from the initial page, pages are crawled and each newly found link is appended to a queue. The scraped links are filtered so that, in this case, only DC Guidelines pages and media files are kept. Link crawling is rate-limited, but I believe the pause/sleep time can be reduced.

In [3]:
def scrape_links(url, visited, queue):
    unique_links = []
    try:
        elements = partition_html(url=url,
                                    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'})
        body_elements = get_body(elements)
        for element in body_elements:
            if element.metadata.link_urls is not None:
                links = element.metadata.link_urls
                for link in links:
                    if 'https://' not in link and 'http://' not in link:
                        link = 'https://www.ura.gov.sg' + link
                    if "https://www.ura.gov.sg/" not in link or "https://www.ura.gov.sg/maps" in link:
                        continue
                    if "https://www.ura.gov.sg/-/media/Corporate/Guidelines/Development-control" in link or "https://www.ura.gov.sg/Corporate/Guidelines/Development-Control" in link:
                        if link not in visited and link not in queue:
                            unique_links.append(link)
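    except Exception as e:
        # Best-effort crawl: report the failure and carry on with the next page
        print(f"Failed to scrape {url}: {e}")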
    return unique_links
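
The link handling above builds absolute URLs by string concatenation and filters them by substring. A minimal sketch of the same idea using `urllib.parse` is shown below. The helper names `normalise_link` and `is_dc_guideline` are hypothetical and not part of the original script, and the filter is slightly stricter than the original because it compares the host and the start of the path rather than searching the whole URL.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.ura.gov.sg"  # site root used throughout the notebook

def normalise_link(link, base=BASE_URL):
    """Resolve a possibly relative href against the site root (hypothetical helper)."""
    return urljoin(base + "/", link)

def is_dc_guideline(link):
    """Keep only DC Guidelines pages and media files hosted on the URA site."""
    parsed = urlparse(link)
    if parsed.netloc != "www.ura.gov.sg":
        return False
    return parsed.path.startswith((
        "/Corporate/Guidelines/Development-Control",
        "/-/media/Corporate/Guidelines/Development-control",
    ))

hrefs = [
    "/Corporate/Guidelines/Development-Control/Residential",
    "https://www.ura.gov.sg/maps/",
    "https://example.com/elsewhere",
]
kept = [normalise_link(h) for h in hrefs if is_dc_guideline(normalise_link(h))]
print(kept)
```

Only the first href survives the filter here; the maps page and the external site are dropped.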
    

In [4]:
def all_links_in_page(url):
    links = []
    visited = set()
    queue = [url]
    while queue:
        current_url = queue.pop(0)
        if current_url in visited:
            continue
        print(f"Current Page: {current_url}")
        visited.add(current_url)
        links.append(current_url)
        queue += scrape_links(current_url, visited, queue)
        time.sleep(0.1)
    return links
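
The crawl above is a breadth-first traversal with a visited set. The sketch below shows the same pattern on a small in-memory graph so it can be run without network access. `FAKE_SITE` and the `get_links` callback are made up for illustration, and `collections.deque` is swapped in for the list because `popleft()` avoids the cost of `list.pop(0)` on long queues.

```python
from collections import deque

def crawl(start, get_links):
    """Breadth-first crawl; get_links(url) returns the outgoing links of url."""
    visited = set()
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()  # constant time, unlike list.pop(0)
        if current in visited:
            continue
        visited.add(current)
        order.append(current)
        for link in get_links(current):
            if link not in visited:
                queue.append(link)
    return order

# Hypothetical site graph standing in for real pages
FAKE_SITE = {
    "/dc": ["/dc/res", "/dc/non-res"],
    "/dc/res": ["/dc/res/flats", "/dc"],
    "/dc/non-res": ["/dc/res"],
    "/dc/res/flats": [],
}

pages = crawl("/dc", lambda url: FAKE_SITE.get(url, []))
print(pages)
```

Each page is visited once even though the graph contains a cycle back to the start page.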

In [5]:
test = all_links_in_page("https://www.ura.gov.sg/Corporate/Guidelines/Development-Control")

Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential/Flats-Condominiums
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential/Bungalows
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential/Semi-Detached-Houses
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential/Terrace
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential/Strata-Landed-Housing
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Non-Residential
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Non-Residential/Commercial
Current Page: https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Non-Residential/Hotel
Cur

Quick count of the webpage URLs scraped, excluding media files

In [6]:
count = sum(1 for each_link in test if "https://www.ura.gov.sg/-/media/Corporate/Guidelines/Development-control" not in each_link)
print(count)

448


Write the links to a CSV file of your choice

In [7]:
def write_links_to_csv(links, filename):
  """Writes a list of links to a CSV file, each link on a new line.

  Args:
    links: List of strings containing the links.
    filename: Path to the CSV file to be created.
  """
  with open(filename, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for link in links:
      writer.writerow([link])  # Write each link as a single element list

# Example usage (replace with your list of links and desired filename)
filename = "../data/dc_links.csv"
write_links_to_csv(test, filename)
print(f"Links written to CSV: {filename}")



Links written to CSV: ../data/dc_links.csv
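
Since the Selenium step later reads this CSV back one row at a time, a quick round-trip check can confirm the file holds exactly the links that were crawled. The sketch below writes to a temporary directory rather than `../data`, and `read_links` is a hypothetical helper that is not in the original script.

```python
import csv
import os
import tempfile

def write_links(links, filename):
    """Write one link per row, as in write_links_to_csv."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([link] for link in links)

def read_links(filename):
    """Read the first column of each non-empty row back into a list."""
    with open(filename, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f) if row]

sample = [
    "https://www.ura.gov.sg/Corporate/Guidelines/Development-Control",
    "https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "links.csv")
    write_links(sample, path)
    restored = read_links(path)

print(restored == sample)
```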


Next, we use Selenium to scrape the web content from each of the links. For each link, we store the content in the same hierarchical layout as the URA website by creating the necessary directory structure.

In [None]:
def create_directory_structure(base_dir, url, start_from):
    """
    Create a directory structure based on the URL, starting from a specified part of the URL path,
    excluding the base directory and network location. Creates directories up to the parent of the last segment.
    """
    try:
        parsed_url = urlparse(url)
        if not parsed_url.scheme or not parsed_url.netloc:
            raise ValueError("Invalid URL")

        # Find the starting index of the desired directory structure
        start_index = parsed_url.path.find(start_from)
        if start_index == -1:
            raise ValueError(f"The start_from segment '{start_from}' not found in the URL path")

        # Extract the relevant path starting from the specified part
        relevant_path = parsed_url.path[start_index:].lstrip('/')
        
        # Get the parent directory of the last segment
        parent_dir = os.path.dirname(relevant_path)
        
        # Construct the full path to the parent directory
        full_path = os.path.join(base_dir, parent_dir).replace("Development-Control", "Development-Control-html")
        
        # Create directories if they do not exist
        if not os.path.exists(full_path):
            os.makedirs(full_path)
        
        return full_path
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
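
To see where a given URL ends up, the path logic can be separated from the directory creation. The sketch below reproduces the mapping as a pure function, combining the directory rule above with the filename rule used later in the scraping function. `target_path` is a hypothetical name and nothing is written to disk.

```python
import os
from urllib.parse import urlparse

def target_path(base_dir, url, start_from):
    """Return (directory, filename) for a URL without touching the filesystem."""
    path = urlparse(url).path
    start = path.find(start_from)
    if start == -1:
        raise ValueError(f"'{start_from}' not found in the URL path")
    relevant = path[start:].lstrip("/")
    parent = os.path.dirname(relevant)
    directory = os.path.join(base_dir, parent).replace(
        "Development-Control", "Development-Control-html")
    filename = os.path.basename(relevant)
    if not filename.endswith(".html"):
        filename += ".html"
    return directory, filename

directory, filename = target_path(
    "../data",
    "https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential/Bungalows",
    "Development-Control",
)
print(directory, filename)
```

One consequence of taking the parent of the last segment is that the top-level Development-Control page has no parent segment, so its file is written directly into `base_dir`.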

Helper function for the actual scraping, to convert images into links

In [None]:
def convert_images_to_links(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    for img in soup.find_all('img'):
        try:
            img_url = "https://www.ura.gov.sg" + img['data-original']
            link = soup.new_tag('a', href=img_url)
            link.string = img_url
            img.replace_with(link)
        except KeyError:
            # No lazy-load 'data-original' attribute on this image; leave it untouched
            pass

    return str(soup)

The selectors here are somewhat outdated and do not cover every page layout, so use them with caution and modify them as needed

In [None]:
def scrape_and_save_csv(csv_file, base_dir, start_from):
    '''
    This is the main function to scrape the DC guidelines from the URA website into local HTML files in a manner that mimics the site hierarchy, with Selenium. Scraping cannot be run headless due to webpage restrictions.

    Parameters:
        csv_file (string): Path to the CSV file of URA links to process
        base_dir (string): Relative path to the directory in which the HTML subdirectories will be stored
        start_from (string): Specific path segment in the URL to start the hierarchical directory structure creation from

    '''
    errors_dict = {}

    chrome_options = Options()
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(service=Service(), options=chrome_options)

    try:
        with open(csv_file, newline='') as csvfile:
            reader = csv.reader(csvfile)
            for row in reader:
                url = row[0]
                if "https://www.ura.gov.sg/-/media/Corporate/Guidelines/Development-control" in url:
                    print(f"Media file: {url} - skipping...")
                    continue

                print(f"Processing URL: {url}")
                directory = create_directory_structure(
                    base_dir, url, start_from)
                if directory is None:
                    print(f"Failed to create directory for URL: {url}")
                    continue

                driver.get(url)
                errors = []
                content_html_1 = content_html_2 = content_html_3 = ""

                try:
                    wait = WebDriverWait(driver, 10)
                    wait.until(EC.presence_of_element_located(
                        (By.CSS_SELECTOR, '#pnlMain > div.fullbody-wrapper.no-t-padding > div > div.row > div.col-sm-9.col-md-9.col-xs-12')))
                except:
                    errors.append("Main header wait condition failed")
                    print("Main header wait condition failed")

                try:
                    content_html_1 = driver.find_element(
                        By.CSS_SELECTOR, '#pnlMain > div.fullbody-wrapper.no-t-padding > div > div.row > div.col-sm-9.col-md-9.col-xs-12 > div'
                    ).get_attribute("outerHTML")
                except:
                    errors.append("Main header not found")
                    print("Main header not found")

                try:
                    content_html_2 = driver.find_element(
                        By.CSS_SELECTOR, '#pnlMain > div.fullbody-wrapper.no-t-padding > div > div.row > div.col-sm-9.col-md-9.col-xs-12 > div.fullbody-wrapper.no-t-padding > div > div > div'
                    ).get_attribute("outerHTML")
                except:
                    errors.append("Main body not found")
                    print("Main body not found")

                try:
                    content_html_3 = driver.find_element(
                        By.CSS_SELECTOR, '#pnlMain > div.fullbody-wrapper.no-t-padding > div > div.row > div.col-sm-9.col-md-9.col-xs-12 > div:nth-child(5)'
                    ).get_attribute("outerHTML")
                except:
                    try:
                        content_html_3 = driver.find_element(
                            By.CSS_SELECTOR, '#pnlMain > div.fullbody-wrapper.no-t-padding > div > div.row > div.col-sm-9.col-md-9.col-xs-12 > div:nth-child(3)'
                        ).get_attribute("outerHTML")
                    except:
                        errors.append("Date not found")
                        print("Date not found")

                html_content = convert_images_to_links(
                    content_html_1 + content_html_2 + content_html_3)

                if not html_content.strip():
                    errors.append("No content found")
                    errors_dict[url] = errors  # record the failure before skipping
                    print("No content found, skipping URL...")
                    continue

                if errors:
                    errors_dict[url] = errors

                parsed_url = urlparse(url)

                filename = os.path.basename(parsed_url.path)
                if not filename.endswith('.html'):
                    filename += '.html'
                file_path = os.path.join(directory, filename)

                with open(file_path, 'w', encoding='utf-8') as file:
                    file.write(html_content)

                print(f"Saved content of URL: {url} to {file_path}")
                time.sleep(0.25)

    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        driver.quit()

    with open("../data/errors.json", "w") as outfile:
        # json.dump writes to the file object; json.dumps only returns a string
        json.dump(errors_dict, outfile)
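
The nested try/except used for the date block above generalises to an ordered list of candidate selectors. The sketch below shows that pattern with a stand-in lookup so it runs without a browser. `first_match`, `FAKE_DOM`, and `fake_find` are hypothetical; with Selenium, `find` could be a small wrapper around `driver.find_element(By.CSS_SELECTOR, selector)`, which raises when nothing matches.

```python
def first_match(find, selectors):
    """Try each selector in order; return (selector, result) for the first hit.

    find is any callable that raises an exception when the selector misses.
    """
    for selector in selectors:
        try:
            return selector, find(selector)
        except Exception:
            continue  # fall through to the next candidate
    return None, None

# Stand-in for a page where only the third child holds the date block
FAKE_DOM = {"div:nth-child(3)": "<div>date block</div>"}

def fake_find(selector):
    return FAKE_DOM[selector]  # raises KeyError on a miss

used, html = first_match(fake_find, ["div:nth-child(5)", "div:nth-child(3)"])
print(used, html)
```

Keeping the candidates in a list makes it easier to add a new selector when the page layout changes, without adding another level of nesting.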


In [None]:
csv_file = '../data/dc_links.csv'
base_dir = '../data'
start_from = 'Development-Control'
scrape_and_save_csv(csv_file, base_dir, start_from)
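
After a run, the error log can show which selectors fail most often. The sketch below writes a dictionary in the same shape as `errors_dict` (URL mapped to a list of messages) and tallies the messages. The sample entries are made up for illustration and a temporary directory is used instead of `../data`.

```python
import json
import os
import tempfile
from collections import Counter

# Made-up entries in the same shape as errors_dict
sample_errors = {
    "https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential":
        ["Date not found"],
    "https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Non-Residential":
        ["Main body not found", "Date not found"],
}

def save_errors(errors, path):
    """Serialise the error dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(errors, f, indent=2)

def summarise_errors(path):
    """Count how often each error message occurs across all URLs."""
    with open(path, encoding="utf-8") as f:
        errors = json.load(f)
    return Counter(msg for msgs in errors.values() for msg in msgs)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "errors.json")
    save_errors(sample_errors, path)
    summary = summarise_errors(path)

print(summary.most_common())
```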

Run the .bat file (it needs to be converted to a shell script on macOS/Linux) to convert the HTML files to Markdown with pandoc for later processing

In [None]:
# Path to your .bat file
bat_file_path = "../scripts/html-to-md.bat"

# Run the .bat file and capture the output
result = subprocess.run([bat_file_path], shell=True, capture_output=True, text=True)

# Print the output and error (if any)
print("Output:", result.stdout)
print("Error:", result.stderr)
print("Return Code:", result.returncode)
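
As a platform-independent alternative to the .bat file, pandoc could be called from Python for each HTML file. The contents of html-to-md.bat are not shown in this notebook, so the sketch below is an assumption about what such a conversion might look like and not a reproduction of the script. `pandoc_command` and `convert_tree` are hypothetical helpers, and the default dry run only builds the commands.

```python
import shutil
import subprocess
from pathlib import Path

def pandoc_command(html_path, md_path):
    """Build a pandoc call that converts one HTML file to Markdown."""
    return ["pandoc", "-f", "html", "-t", "markdown",
            str(html_path), "-o", str(md_path)]

def convert_tree(html_root, md_root, dry_run=True):
    """Mirror html_root under md_root, converting every .html file.

    With dry_run=True the commands are returned without being executed.
    """
    html_root, md_root = Path(html_root), Path(md_root)
    if not dry_run and shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    commands = []
    for html_path in sorted(html_root.rglob("*.html")):
        md_path = md_root / html_path.relative_to(html_root).with_suffix(".md")
        cmd = pandoc_command(html_path, md_path)
        commands.append(cmd)
        if not dry_run:
            md_path.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_tree("../data/Development-Control-html", "../data/Development-Control-md")` would list the planned commands, and passing `dry_run=False` would run them. The output directory name here is illustrative only.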