# (PART 1) WEBSCRAPING PROPERTY GURU WEBSITE
Property Guru (PG) is a leading real estate listing site in Singapore.  Owners of apartments, houses or rooms who wish to sell or rent typically list them on propertyguru.com.sg through an agent, with an asking price.  However, in my experience, the asking price can be ~10-20% higher than previous transaction prices (obtainable from URA).  The idea here is to scrape the PG website to get a snapshot of the asking prices of condominium listings.

The main steps in this exercise:
1. setup
2. capture master list of condominium projects
3. capture rental listings in each condo project
4. scrape details of each rental listing

In the first part, I will cover the setup and obtaining a master list of condominium projects.

## The Setup
I will rely on Beautiful Soup for this web scraping exercise.
First, load the libraries.

In [1]:
from bs4 import BeautifulSoup
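import urllib.request  #import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded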

The PG website has a page (actually multiple pages) dedicated to condominium projects: "https://www.propertyguru.com.sg/condo-directory/search-condo-project/1"
Go ahead and store it in a variable called "url".

In [2]:
url = "https://www.propertyguru.com.sg/condo-directory/search-condo-project/1"

We then use urllib to request the content of this URL and load it into BeautifulSoup.

In [3]:
req = urllib.request.Request(url = url)
res = urllib.request.urlopen(req).read()
soup = BeautifulSoup(res, "html.parser")

HTTPError: HTTP Error 403: Forbidden

Unfortunately, the server returns a 403 (Forbidden) error.
This error is most likely raised because the request does not carry a browser-like "User-Agent" header, so the server treats it as coming from a non-browser source; the block is there to protect against abnormal requests (like bots, which can lead to a DoS attack).
See this Medium post and this Stack Overflow thread for an explanation:
https://medium.com/@speedforcerun/python-crawler-http-error-403-forbidden-1623ae9ba0f#:~:text=Using%20urllib.&text=urlopen()%20to%20open%20a,to%20prevent%20from%20abnormal%20visit.
https://stackoverflow.com/questions/16627227/http-error-403-in-python-3-web-scraping
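
To see why the server can tell such a request apart from a browser, it helps to look at the User-Agent that urllib sends by default. This is a minimal sketch, assuming a standard CPython install; `default_headers` is just an illustrative name and not part of the scraper.

```python
import urllib.request

# An opener built with default settings carries urllib's own User-Agent header.
default_headers = dict(urllib.request.build_opener().addheaders)

# Prints something like "Python-urllib/3.x" (the exact version depends on your Python).
print(default_headers["User-agent"])
```

A server that filters on this header can spot the "Python-urllib" prefix easily, which is consistent with the 403 seen above.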

So, we need to define a user agent, so that the urllib request looks as though it comes from a normal person using an internet browser like Chrome.

In [6]:
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.retrieve(url)
response


('C:\\Users\\mbijlkh\\AppData\\Local\\Temp\\tmp3bh8hmxt',
 <http.client.HTTPMessage at 0x188a45f3a08>)

`opener.retrieve` returns a tuple: the path of the local temporary file where the content of the URL is stored, and the response headers.

In [7]:
temp = open(response[0]).read()
soup = BeautifulSoup(temp,"html.parser")
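
`FancyURLopener` is deprecated in recent Python versions, so an alternative worth knowing is to pass the User-Agent header directly to `urllib.request.Request`. The sketch below is not what this notebook uses: `make_request` and `fetch_html` are hypothetical helper names, and whether the "Mozilla/5.0" string is still enough for the PG server is an assumption.

```python
import urllib.request

def make_request(url, user_agent="Mozilla/5.0"):
    # Attach a browser-like User-Agent header to the request.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url):
    # Download the page and return the raw bytes (this performs a network call).
    with urllib.request.urlopen(make_request(url)) as response:
        return response.read()

req = make_request("https://www.propertyguru.com.sg/condo-directory/search-condo-project/1")
# urllib stores header names with only the first letter capitalised.
print(req.get_header("User-agent"))
```

The bytes returned by `fetch_html(url)` can be passed straight to `BeautifulSoup(..., "html.parser")`, which skips the temporary file.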

In [9]:
#retrieve title (just a test)
soup.title

<title>Singapore Condominium Projects for Sale and for Rent | PropertyGuru Singapore</title>

There we go, we have the first page of the condo listing (and we will iterate through all the pages to get a master list).

As all the condo names are stored in elements with the class "nav-link", we can retrieve them with the simple loop below.

In [11]:
projects = []
for i in soup.body.find_all(class_="nav-link"):
    #.string is None when a tag does not wrap exactly one piece of text
    if i.string is not None:
        projects.append(i.string)
projects

['Peak Residence',
 'Parkwood Residences',
 'The Verandah Residences',
 'One Holland Village',
 'Royalgreen',
 'Rymden 77',
 'Dunearn 386',
 'Cairnhill 16',
 'Juniper Hill (Former Crystal Tower)',
 'The Landmark',
 'Hyll on Holland',
 'Penrose']
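
The check against `None` matters because `.string` only returns text when a tag wraps a single piece of text. A small sketch on made-up HTML (the markup below is hypothetical and much simpler than the real PG page) shows a "nav-link" element being skipped for that reason:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page is more complex.
sample_html = """
<body>
  <a class="nav-link" href="#">Peak Residence</a>
  <a class="nav-link" href="#"><i class="icon"></i> Menu</a>
  <a class="nav-link" href="#">Penrose</a>
</body>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")

names = []
for tag in demo_soup.body.find_all(class_="nav-link"):
    # The second link has two children (an <i> tag and text), so .string is None.
    if tag.string is not None:
        names.append(str(tag.string))

print(names)
```

Only the two links that contain plain text end up in `names`; the menu link with an icon inside it is dropped.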

We then move on to the next page by changing the URL to
"https://www.propertyguru.com.sg/condo-directory/search-condo-project/2"

In [19]:
url = "https://www.propertyguru.com.sg/condo-directory/search-condo-project/2"
response = opener.retrieve(url)
temp = open(response[0]).read()
soup = BeautifulSoup(temp,"html.parser")

In [13]:
for i in soup.body.find_all(class_="nav-link"):
    if i.string is not None:
        projects.append(i.string)
projects

['Peak Residence',
 'Parkwood Residences',
 'The Verandah Residences',
 'One Holland Village',
 'Royalgreen',
 'Rymden 77',
 'Dunearn 386',
 'Cairnhill 16',
 'Juniper Hill (Former Crystal Tower)',
 'The Landmark',
 'Hyll on Holland',
 'Penrose',
 'The Gazania',
 'The Lilium',
 'Verticus',
 'Mayfair Modern',
 'Mayfair Gardens',
 'Riverfront Residences',
 'Affinity At Serangoon',
 'MeyerHouse',
 'Forett at Bukit Timah',
 'Midwood',
 'Wilshire Residences',
 'The M @ Middle Road']

The 'projects' list has now expanded with the listings from the second page.  We can iterate through all the pages in the same way.
Of course, we need to know when to stop, i.e. the number of the last page.  This can be retrieved from the bottom of the loaded page.

In [40]:
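pages = []
for f in soup.body.find_all(True):
    #only look at elements that carry a 'data-page' attribute
    if f.has_attr('data-page'):
        try:
            pages += [int(f.text)]
        except ValueError:
            #skip elements whose text is not a number
            continue

#get max page
last_page = max(pages)
print(last_page)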

139
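
To make the `try`/`except` step concrete, here is the same idea on a toy snippet. The pagination markup is an assumption (the real PG markup may differ), and `find_all(attrs={"data-page": True})` is used as a shorter way to select every tag that has the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup, for illustration only.
sample_pagination = """
<ul class="pagination">
  <li><a data-page="1" href="#">1</a></li>
  <li><a data-page="2" href="#">2</a></li>
  <li><a data-page="139" href="#">139</a></li>
  <li><a data-page="2" href="#">Next</a></li>
</ul>
"""
demo = BeautifulSoup(sample_pagination, "html.parser")

page_numbers = []
for tag in demo.find_all(attrs={"data-page": True}):
    try:
        page_numbers.append(int(tag.get_text(strip=True)))
    except ValueError:
        # "Next" is not a number, so this link is skipped.
        continue

demo_last_page = max(page_numbers)
print(demo_last_page)
```

Catching only `ValueError` keeps the loop tolerant of non-numeric labels without hiding unrelated errors.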


We can now iterate through all the pages of the condo listing.  Add some buffer time between requests, so that you don't overload the server.

In [47]:
import random, time
def getSoup(page):
    #load one page of the condo directory into BeautifulSoup
    url = "https://www.propertyguru.com.sg/condo-directory/search-condo-project/" + str(page)
    response = opener.retrieve(url)
    temp = open(response[0]).read()
    soup = BeautifulSoup(temp,"html.parser")
    return soup

def getCondos(soup):
    #collect the condo names found on one page
    projects = []
    for i in soup.body.find_all(class_="nav-link"):
        if i.string is not None:
            projects += [i.string]
    return projects

projects = []
#pages are numbered from 1; requesting page 0 returns the same content as page 1
for i in range(1, last_page + 1):
    soup = getSoup(i)
    projects += getCondos(soup)
    #pause 1-2 seconds between requests
    pause_time = 1 + random.random()
    time.sleep(pause_time)
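
As an optional refinement, the loop can be wrapped in a function that also removes duplicate names, in case the same project shows up on more than one page. This is a sketch under assumptions: `collect_projects`, `fetch_names` and `min_pause` are hypothetical names, and a stand-in dictionary replaces the real scraper so that the example runs without network access.

```python
import random
import time

def collect_projects(fetch_names, last_page, min_pause=1.0):
    # fetch_names(page) is expected to return the list of condo names on that page.
    collected = []
    for page in range(1, last_page + 1):
        collected.extend(fetch_names(page))
        # Wait between min_pause and min_pause + 1 seconds before the next request.
        time.sleep(min_pause + random.random())
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(collected))

# Stand-in for the real scraper, so the sketch runs offline.
fake_pages = {1: ["Peak Residence", "Penrose"], 2: ["Penrose", "The Gazania"]}
demo_projects = collect_projects(lambda page: fake_pages[page], last_page=2, min_pause=0)
print(demo_projects)
```

With the functions defined in this notebook, the real call would look like `collect_projects(lambda p: getCondos(getSoup(p)), last_page)`.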

In [49]:
#a small sample (pages 1 and 2; starting from page 0 would repeat the first page)
projects = []
for i in range(1, 3):
    soup = getSoup(i)
    projects += getCondos(soup)
    pause_time = 1 + random.random()
    time.sleep(pause_time)
projects

['Peak Residence',
 'Parkwood Residences',
 'The Verandah Residences',
 'One Holland Village',
 'Royalgreen',
 'Rymden 77',
 'Dunearn 386',
 'Cairnhill 16',
 'Juniper Hill (Former Crystal Tower)',
 'The Landmark',
 'Hyll on Holland',
 'Penrose',
 'The Gazania',
 'The Lilium',
 'Verticus',
 'Mayfair Modern',
 'Mayfair Gardens',
 'Riverfront Residences',
 'Affinity At Serangoon',
 'MeyerHouse',
 'Forett at Bukit Timah',
 'Midwood',
 'Wilshire Residences',
 'The M @ Middle Road']

Voila! The next step is to filter down to condo projects with rental listings, and to obtain the individual URLs in order to access information about these listings and compile them into a dataset.