# Unstructured.IO Quickstart

(Auto Scrape Iteration 1)

Hello! This notebook summarises the methods used to try to parse the URA website data with the Unstructured library. Even though the effort was not successful, there are still some valuable insights to be gained. In particular, Unstructured works well for purely semantic text parsing, as opposed to extraction based on HTML classes and ids, and it does so without the hallucinations of an LLM. This is also its weakness: Unstructured is only interested in text content without references, which may be helpful in some cases but not this one.

This notebook is a documented copy of the _process\_text.py_ file, which is now deprecated and removed from the repo.

In [1]:
# Import Unstructured classes and functions

from unstructured.partition.html import partition_html
from unstructured.cleaners.core import clean
from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
import copy

Load the link. Use partition_html instead of the default partition because it's cleaner. Run this only ONCE to limit how often we scrape the site; we will write something to cache it later.

In [2]:
input_html = "https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Residential/Flats-Condominiums/Earthworks"
elements = partition_html(url=input_html,
                          headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'})
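The caching mentioned above could look something like the following. This is a minimal sketch using only the standard library; `fetch_html`, `load_html_cached` and the cache path are hypothetical names, not part of Unstructured or the original repo.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Download a page once and return its HTML as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def load_html_cached(url, cache_path, fetch=fetch_html):
    """Return the cached HTML if it exists, otherwise fetch it and save it."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: no request is made to the website
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

The returned string can then be passed to `partition_html(text=...)` instead of `url=...`, so rerunning the notebook does not hit the URA site again.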

Create a cache by using the copy library to deep-copy the partitioned element objects into a list for parsing. Rerun this cell to reset the cache to the original element objects.

In [3]:
# Deep-copy each element so later in-place edits never touch the originals
doc_elements = []
for item in elements:
    separate_item = copy.deepcopy(item)
    doc_elements.append(separate_item)
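To see why a deep copy is needed for this kind of cache, here is a small sketch. `FakeElement` is a hypothetical stand-in that only models the `text` attribute; real Unstructured elements carry more state.

```python
import copy
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for an Unstructured element; only `text` is modelled."""
    text: str


original = [FakeElement("  Who We Are  "), FakeElement("Earthworks")]

# Deep-copying the list copies every element in one call
working = copy.deepcopy(original)

# Mutating the working copy leaves the originals untouched ...
working[0].text = working[0].text.strip()

# ... whereas a shallow copy shares the same element objects
shallow = copy.copy(original)
shallow[1].text = "changed"
```

After this runs, `original[0].text` still has its padding, but `original[1].text` has been overwritten through the shallow copy.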

In [4]:
for text in doc_elements:
    print(text.text)

Select Category
                                    
	Select a search category
	All
	Planning
	Property
	Guidelines
	Car Parks
	Land Sales
	Get Involved
	Resources
	E-Services
	Media Room
Who We Are
                                            
                                            
                                            
                                        
                                        
                                            
                                                
                                                    
                                                            
                                                                
                                                                    Who We Are
                                                                    
                                                                
                                                                
                                 

Hardcoded function to extract the main body of the webpage, since we can't use HTML selectors with Unstructured.

In [5]:
def get_body(elements):
    '''
    Find the slice of the element list that corresponds to the main body of the webpage, since the page does not use proper body tags.

    Parameters:
        elements (list): A list of Unstructured document objects

    Returns:
        list: The slice of elements holding the text and other metadata from the main body of the webpage, or an empty list if either marker is not found
    '''

    start = None
    end = None
    seen_heading = False

    for i, element in enumerate(elements):
        if not seen_heading:
            # Skip everything up to and including the first page heading
            if element.text == 'Earthworks, Retaining Walls, and Boundary Walls':
                seen_heading = True
        elif start is None and element.category == 'Title':
            # The body starts at the first Title after the page heading
            start = i
        elif element.text == 'Urban Redevelopment Authority':
            # The body ends just before this closing marker
            end = i
            break

    if start is None or end is None:
        return []
    return elements[start:end]
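If the same trick is needed on another page, the marker strings could be passed in as parameters. The following is only a sketch: `slice_between` and its arguments are hypothetical, and the elements are `SimpleNamespace` stand-ins exposing just the `text` and `category` attributes used above.

```python
from types import SimpleNamespace


def slice_between(elements, heading_text, end_text, start_category="Title"):
    """Return the elements from the first `start_category` element after
    `heading_text` up to (not including) the first `end_text` element."""
    texts = [element.text for element in elements]
    try:
        heading = texts.index(heading_text)
    except ValueError:
        return []
    # First element of the wanted category after the heading
    start = next(
        (i for i in range(heading + 1, len(elements))
         if elements[i].category == start_category),
        None,
    )
    if start is None:
        return []
    try:
        end = texts.index(end_text, start)
    except ValueError:
        return []
    return elements[start:end]


def el(text, category="NarrativeText"):
    """Build a stand-in element for the demo."""
    return SimpleNamespace(text=text, category=category)


page = [
    el("Menu", "Title"),
    el("Earthworks", "Title"),
    el("breadcrumb"),
    el("Introduction", "Title"),
    el("Body text"),
    el("Footer"),
]
body = slice_between(page, "Earthworks", "Footer")
```

This is still brittle, since it depends on exact text matches, but it keeps the page-specific strings out of the function body.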

In [6]:
main_elements = get_body(doc_elements)

In [7]:
for text in main_elements:
    print(text.text)

Flats and Condominiums
Content
Advisory Notes
📇 Guidelines at a Glance (PDF, 205 KB)
Introduction
Serviced Apartments (akin to Residential Use)
Serviced Apartments II (SA2)
Gross Plot Ratio
Bonus GFA Incentive Schemes
Balconies, Private Enclosed Spaces, Private Roof Terraces and Indoor Recreation Spaces
Guidelines on Dwelling Units (DU) in Non-Landed Residential Developments
Site Area
Site Coverage
Building Setback from Boundary
Building Height
Building Length
Landscape Deck
Basements
Special and Detailed Control Plans
Street Block Plans
Developments Involving Waterbodies
Attic
Ancillary Shops
Ancillary Structures
Parking
RC Flat Roofs
Greenery
Walking and Cycling Plan
Strata Subdivision
Earthworks, Retaining Walls, and Boundary Walls
Advisory Notes
📇 Guidelines at a Glance (PDF, 205 KB)
Introduction
Serviced Apartments (akin to Residential Use)
Serviced Apartments II (SA2)
Gross Plot Ratio
Bonus GFA Incentive Schemes
Balconies, Private Enclosed Spaces, Private Roof Terraces and Indoor

You can also clean the body text with clean, which takes boolean options: extra_whitespace, bullets, dashes, lowercase and trailing_punctuation. These weren't very helpful though, since we do want to keep the main corpus of information.

In [8]:
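for html in main_elements:
    # Every clean() option defaults to False, so request whitespace cleaning explicitly
    html.text = clean(html.text, extra_whitespace=True)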
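For a rough idea of what whitespace cleaning does to text like the padded "Who We Are" output earlier, here is a standard-library approximation. `collapse_whitespace` is a hypothetical helper and not the library's own implementation.

```python
import re


def collapse_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Sample string modelled on the padded navigation text printed above
raw = "Who We Are\n        \n            Who We Are\n    "
cleaned = collapse_whitespace(raw)
```

Here `cleaned` ends up as a single line with the padding removed, while the words themselves are left alone.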

The main area where Unstructured fell short was chunking. Granted, the final version ingested large documents rather than smaller chunks, but here the unique HTML structure of the website does no favours to the auto-chunker.

You can try the basic chunker:

In [9]:
chunks = chunk_elements(main_elements, overlap=100)
for items in chunks:
    print("CHUNK: \n" + items.text + "\nEND OF CHUNK \n")

CHUNK: 
Flats and Condominiums

Content

Advisory Notes

📇 Guidelines at a Glance (PDF, 205 KB)

Introduction

Serviced Apartments (akin to Residential Use)

Serviced Apartments II (SA2)

Gross Plot Ratio

Bonus GFA Incentive Schemes

Balconies, Private Enclosed Spaces, Private Roof Terraces and Indoor Recreation Spaces

Guidelines on Dwelling Units (DU) in Non-Landed Residential Developments

Site Area

Site Coverage

Building Setback from Boundary

Building Height

Building Length

Landscape Deck
END OF CHUNK 

CHUNK: 
Basements

Special and Detailed Control Plans

Street Block Plans

Developments Involving Waterbodies

Attic

Ancillary Shops

Ancillary Structures

Parking

RC Flat Roofs

Greenery

Walking and Cycling Plan

Strata Subdivision

Earthworks, Retaining Walls, and Boundary Walls

Advisory Notes

📇 Guidelines at a Glance (PDF, 205 KB)

Introduction

Serviced Apartments (akin to Residential Use)

Serviced Apartments II (SA2)

Gross Plot Ratio

Bonus GFA Incentive Schemes
EN
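The chunk boundaries above fall in the middle of the heading list because basic chunking fills each chunk up to a character limit, regardless of structure. A simplified sketch of that greedy packing follows. `pack_texts` is a hypothetical function, the 500-character default is my understanding of the library's default limit, and the real chunker also splits oversized elements, which this sketch does not.

```python
def pack_texts(texts, max_characters=500, separator="\n\n"):
    """Greedily join consecutive texts until the next one would exceed the limit."""
    chunks = []
    current = ""
    for text in texts:
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_characters:
            # The next text does not fit: close this chunk and start a new one
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


headings = ["Site Area", "Site Coverage", "Building Height", "Basements"]
packed = pack_texts(headings, max_characters=30)
```

With a 30-character limit the four headings are packed two per chunk, purely by length, which mirrors how the list of guidelines gets cut between "Landscape Deck" and "Basements".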

Or you can chunk by title, which was better but still defeated by inconsistencies in the HTML formatting that caused Unstructured to misclassify certain portions of text.

In [11]:
chunks2 = chunk_by_title(main_elements)
for items in chunks2:
    print("CHUNK: \n" + items.text + "\nEND OF CHUNK \n")

CHUNK: 
Flats and Condominiums
END OF CHUNK 

CHUNK: 
Content

Advisory Notes

📇 Guidelines at a Glance (PDF, 205 KB)

Introduction

Serviced Apartments (akin to Residential Use)

Serviced Apartments II (SA2)

Gross Plot Ratio

Bonus GFA Incentive Schemes

Balconies, Private Enclosed Spaces, Private Roof Terraces and Indoor Recreation Spaces

Guidelines on Dwelling Units (DU) in Non-Landed Residential Developments

Site Area

Site Coverage

Building Setback from Boundary

Building Height

Building Length

Landscape Deck

Basements
END OF CHUNK 

CHUNK: 
Special and Detailed Control Plans

Street Block Plans

Developments Involving Waterbodies

Attic

Ancillary Shops

Ancillary Structures

Parking

RC Flat Roofs

Greenery

Walking and Cycling Plan

Strata Subdivision

Earthworks, Retaining Walls, and Boundary Walls

Advisory Notes

📇 Guidelines at a Glance (PDF, 205 KB)

Introduction

Serviced Apartments (akin to Residential Use)

Serviced Apartments II (SA2)

Gross Plot Ratio

Bonus GF
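One way to investigate the misclassification is to tally the `category` of each element and list which texts were treated as titles. This is a sketch with stand-in elements: the helper names are hypothetical, and the sample categories are invented for illustration, not the real classification of the URA page.

```python
from collections import Counter
from types import SimpleNamespace


def tally_categories(elements):
    """Count how many elements fall into each category."""
    return Counter(element.category for element in elements)


def title_texts(elements):
    """List the texts that would start a new chunk under title-based chunking."""
    return [element.text for element in elements if element.category == "Title"]


# Invented sample: categories here are assumptions, not real output
sample = [
    SimpleNamespace(text="Flats and Condominiums", category="Title"),
    SimpleNamespace(text="Content", category="Title"),
    SimpleNamespace(text="Advisory Notes", category="ListItem"),
    SimpleNamespace(text="Introduction", category="ListItem"),
]
counts = tally_categories(sample)
titles = title_texts(sample)
```

Running the same two helpers on `main_elements` would show whether the section headings were picked up as `Title` elements or lumped in with the list items.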