# Vancouver Real Estate

Brandon Tu

May 3, 2021

---

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-Background" data-toc-modified-id="Project-Background-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project Background</a></span><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href="#Thought-Process" data-toc-modified-id="Thought-Process-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Thought Process</a></span></li><li><span><a href="#Data-Dictionary" data-toc-modified-id="Data-Dictionary-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Dictionary</a></span></li></ul></li><li><span><a href="#Import-Packages-and-Libraries" data-toc-modified-id="Import-Packages-and-Libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Packages and Libraries</a></span></li><li><span><a href="#Web-Scraping-Data" data-toc-modified-id="Web-Scraping-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Web Scraping Data</a></span></li><li><span><a href="#Cleaning" data-toc-modified-id="Cleaning-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Cleaning</a></span></li><li><span><a href="#EDA" data-toc-modified-id="EDA-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>EDA</a></span></li></ul></div>

## Project Background

### Overview

This project uses web scraping to collect data on real estate listings in Vancouver, with the ultimate goal of building an application that helps real estate investors and home owners find underpriced properties. Given the very high cost of housing in Vancouver, finding cheaper housing is valuable to both groups: such properties are difficult to find, and they can provide a strong return on investment for investors or allow lower-income renters to purchase their first property.

### Thought Process
1. I will start with the information available on [REW](https://www.rew.ca/), a real estate website containing listings from the Lower Mainland of British Columbia, and build out the framework needed to scrape and clean that information.
2. Obtain the average price of a property based on bedrooms, size and other characteristics, to use in determining whether a listing is priced above or below the average.
3. Include an interactive population density map to help visualize where the listings are situated and the population in that area. This is important for both investors and home owners: investors may want to rent out these properties or sell them after upgrading, and home owners may want to know how busy the neighbourhood is when weighing their personal and/or family interests.
4. Build a package to automate the scraping of the information.
5. Turn the entire process into an application that can use these functions and visualize the map.
6. Increase the area that the application will consider.
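
Step 2 can be sketched as follows. This is a minimal illustration, assuming listings are grouped by property type and bedroom count and compared with the mean list price of their group. The function name `flag_below_average` and the choice of grouping key are assumptions, not the project's final method. The sample rows reuse addresses and prices from the listings scraped later in this notebook.

```python
from collections import defaultdict
from statistics import mean


def flag_below_average(listings):
    """Return the listings priced below the mean of their (type, bedrooms) group."""
    groups = defaultdict(list)
    for row in listings:
        groups[(row["property_types"], row["bedrooms"])].append(row["list_prices"])

    # Average list price for each (property type, bedrooms) group
    averages = {key: mean(prices) for key, prices in groups.items()}

    return [
        row for row in listings
        if row["list_prices"] < averages[(row["property_types"], row["bedrooms"])]
    ]


sample = [
    {"Address": "209-469 E King Edward Avenue", "property_types": "Apt/Condo",
     "bedrooms": 2, "list_prices": 1699000},
    {"Address": "301-717 W 17th Avenue", "property_types": "Apt/Condo",
     "bedrooms": 2, "list_prices": 2198800},
    {"Address": "707-1661 Quebec Street", "property_types": "Apt/Condo",
     "bedrooms": 2, "list_prices": 925000},
    {"Address": "3823 W 11th Avenue", "property_types": "House",
     "bedrooms": 4, "list_prices": 3588000},
]

underpriced = flag_below_average(sample)
print([row["Address"] for row in underpriced])
```

A group with a single listing never flags that listing, since its price equals the group mean.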


### Data Dictionary

The information that I want to obtain from the website is listed below, with a description of each column.

Column Name   | Datatype     | Description
------------- | -------------| -------------
Address | str | Address of the property
URL | str | URL to find the listing
property_sizes | int | The size of the property in square feet
descriptions | str | Description of the property provided on the website
list_prices | int | Price of the property
bedrooms | int | Number of Bedrooms
bathrooms | int | Number of Bathrooms
property_types | str | Type of property (house, condo, etc.)
lot_sizes | int | Size of the land (only applicable for non-apartment buildings)
years_built | int | The year the property was built
titles | str | The type of ownership
styles | str | The style of the property (e.g. 2 storey, Townhouse)
features | str | A list of features of the property and surrounding community
amenities | str | A list of amenities in and around the property accessible to the owner
appliances | str | Appliances found in the property
communities | str | The neighbourhood to which the property belongs
days_on_rew | int | Number of days the listing has been on the website
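
The declared datatypes can be captured as a small schema and applied to a scraped row. The sketch below is illustrative only: `COLUMN_TYPES` and `coerce_row` are hypothetical names, only a few columns are shown, and missing values are assumed to be represented as `None`.

```python
# Hypothetical schema: a subset of the data dictionary mapped to Python types
COLUMN_TYPES = {
    "Address": str,
    "property_sizes": int,
    "list_prices": int,
    "bedrooms": int,
    "years_built": int,
    "communities": str,
}


def coerce_row(row):
    """Cast each column in the schema to its declared type."""
    coerced = {}
    for column, value in row.items():
        caster = COLUMN_TYPES.get(column)
        # Leave missing values and columns outside the schema untouched
        coerced[column] = value if value is None or caster is None else caster(value)
    return coerced


# Values taken from the first scraped listing, with numbers still held as text
raw = {
    "Address": "260 E 17th Avenue, Vancouver, BC, V5V 1A7",
    "property_sizes": "1104",
    "list_prices": "1498000",
    "bedrooms": 3,
    "years_built": "2021",
    "communities": "Main",
}

clean = coerce_row(raw)
print(clean)
```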

---

## Import Packages and Libraries

Since I want to perform web scraping in this project, I will use the Selenium and BeautifulSoup libraries to obtain my data.

In [2]:
# Import Packages and Libraries

# Fundamentals
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import re
import time

# Web Scraping
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

---

## Web Scraping Data

For web scraping, I will disable many functions that are available for Google Chrome browsers while using the Selenium webdriver in order to reduce time loading webpages and potential errors that may occur while scraping data.

In [3]:
# Customize the options
options = webdriver.ChromeOptions()

# Add argument to ignore certificate errors
options.add_argument("--ignore-certificate-errors")

# Add argument to access web browser in incognito mode
options.add_argument("--incognito")

# Add argument to use Selenium without opening a browser
options.add_argument("--headless")

# Disable Side Navigation
options.add_argument('--disable-browser-side-navigation')

# Disable info bars
options.add_argument("--disable-infobars")

# No sandbox
options.add_argument("--no-sandbox")

# Disable dev shm usage
options.add_argument("--disable-dev-shm-usage")

# Enable automation
options.add_argument("enable-automation")

# Use chrome webdriver as the driver
path = "/Users/Brandon 1/desktop/chromedriver"
driver = webdriver.Chrome(path, options=options)

# Maximize the window size
driver.maximize_window()


In [54]:
# Get REW webpage
# Enter the website url
url = "https://www.rew.ca"
driver.get(url)

In [55]:
# Wait up to 20 seconds (explicit wait) for the page to load before timing out
timeout = 20
try:
    # Find if the element is present
    element_present = EC.presence_of_element_located((By.XPATH, 
                                                      '/html/body/footer/div[1]/div/div/div/div[1]/div/ul/li[1]/a'))
    WebDriverWait(driver, timeout).until(element_present)

    # Click through to view all listings in Vancouver
    driver.find_element_by_xpath("/html/body/footer/div[1]/div/div/div/div[1]/div/ul/li[1]/a").click()
    
    # Wait until the listings are present
    listings_present = EC.presence_of_all_elements_located((By.CLASS_NAME, "displaypanel-body"))
    all_listings = WebDriverWait(driver, timeout).until(listings_present)
        
    # Scroll to the last listing to load all of the listings
    driver.execute_script('arguments[0].scrollIntoView();', all_listings[-1])
    
    # Set a 20-second implicit wait: later element lookups poll up to this long for an element to appear
    implicit_wait = 20
    driver.implicitly_wait(implicit_wait)

    # Create page source alias to use BeautifulSoup
    page_source = driver.page_source

    # Use beautifulsoup to scrape the page
    soup = BeautifulSoup(page_source, 'lxml')

# Print a timeout error if the page did not load
except TimeoutException:
    print ("Timed out waiting for page to load")

In [56]:
# Collect the address and link of every listing on the website
rows_list = []

# While there is a next-page button to click, scrape all of the URLs on the page and then click next
# Start on the first page and stop once page 25 is reached, which is the maximum number of pages on the website
page = 1
while page<25:
    
    # Scrape all of the listing hrefs and then flip to the next page
    try:
        
        # Get the URLs of all of the listings to enter each listing for scraping
        listings = soup.find_all('a', class_=False, title=True, href=True)

        # Use for loop to get all of the listings on the page
        for listing in listings:
    
            # Build the row for this listing
            listings_dict = {"Address":listing["title"], "URL":(url+listing["href"])}

            # Add the listing address and URL only if they are not already in the list
            if listings_dict not in rows_list:
                rows_list.append(listings_dict)
        
        # Find out if the element is present before clicking it
        next_button_present = EC.presence_of_element_located((By.CLASS_NAME, 
                                                              "paginator-next_page.paginator-control"))
        WebDriverWait(driver, timeout).until(next_button_present)
        
        # Click on the next page button
        driver.find_element_by_class_name("paginator-next_page.paginator-control").click()
        
        # Add to the next page
        page += 1
        
        # Wait until all the listings are present
        more_listings_present = EC.presence_of_all_elements_located((By.CLASS_NAME, "displaypanel-body"))
        more_listings = WebDriverWait(driver, timeout).until(more_listings_present)
        
        # Scroll to the last listing to load all of the listings
        driver.execute_script('arguments[0].scrollIntoView();', more_listings[-1])
        
        # Re-apply the same 20-second implicit wait as before
        driver.implicitly_wait(implicit_wait)
        
        # Create page source alias to use BeautifulSoup
        listings_page_source = driver.page_source

        # Use beautifulsoup to scrape the page
        soup = BeautifulSoup(listings_page_source, 'lxml')
    
    # If there are no more next pages (or the page times out), stop paging
    except Exception:
        break
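
Checking `not in rows_list` rescans the whole list for every link. As an alternative sketch (not the approach used in the cell above), a set of already-seen URLs keeps the check constant-time. Note that it deduplicates on the URL alone rather than on the address and URL pair. `collect_unique` and the sample titles and hrefs are hypothetical.

```python
def collect_unique(listings, base_url, rows, seen_urls):
    """Append {"Address", "URL"} rows for listings whose URL has not been seen yet."""
    for title, href in listings:
        full_url = base_url + href
        if full_url not in seen_urls:
            seen_urls.add(full_url)
            rows.append({"Address": title, "URL": full_url})
    return rows


rows, seen = [], set()

# Hypothetical (title, href) pairs standing in for the parsed <a> tags of two pages
page_one = [("Listing A", "/properties/1/a"), ("Listing B", "/properties/2/b")]
page_two = [("Listing B", "/properties/2/b"), ("Listing C", "/properties/3/c")]

for page_listings in (page_one, page_two):
    collect_unique(page_listings, "https://www.rew.ca", rows, seen)

print(len(rows))
```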

In [57]:
# Create new dataframe containing each of the listings and links
listings_df = pd.DataFrame(rows_list, columns=["Address", "URL"])

# Print the number of rows and columns of the dataframe
print(f"Rows: {listings_df.shape[0]}, Columns: {listings_df.shape[1]}\n")

# Check for duplicates and NaN values
print(f"Duplicates: \n{listings_df.duplicated().sum()}\n")
print(f"NaN Values: \n{listings_df.isna().sum()}")

Rows: 154, Columns: 2

Duplicates: 
0

NaN Values: 
Address    0
URL        0
dtype: int64


In [None]:
# Create list for all of the information that will be scraped from the website
property_sizes = []
descriptions = []
list_prices = []
bedrooms = []
bathrooms = []
property_types = []
lot_sizes = []
years_built = []
titles = []
styles = []
features = []
amenities = []
appliances = []
communities = []
days_on_rew = []

# Use for loop to index through each url and scrape the information off of each page
for idx in range(len(listings_df)):
    
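    # Initialize all of the features as None (lot size as 0) to keep the indexing correct when appending to the lists
    property_size = None
    description = None
    list_price = None
    num_bedrooms = None
    num_bathrooms = None
    property_type = None
    lot_size = 0
    year_built = None
    title = None
    style = None
    the_features = None
    the_amenities = None
    the_appliances = None
    community = None
    days_on_website = None

    # Reset the raw strings as well so that values from the previous listing are not reused
    unclean_list_price = None
    unclean_lot_size = None
    unclean_year_built = None
    unclean_days_on_website = None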
    
    # Alias the url
    link = listings_df["URL"][idx]
    
    # Use links with Selenium and start scraping the content off of the webpages
    driver.get(link)
    
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Implicit wait
    driver.implicitly_wait(implicit_wait)
    
    # Find the property size and clean the text
    try:
        unclean_property_size = driver.find_element_by_class_name("listingheader").text
        clean_property_size = unclean_property_size.replace("\n", " ")
        # Match a 3- or 4-digit size followed by " Sqft"; leave the size as None if there is no match
        size_match = re.search(r'([0-9]{3,4}) Sqft', clean_property_size)
        if size_match:
            property_size = size_match.group(1)
    except NoSuchElementException:
        pass
    
    # Find the description and clean the text
    try:
        unclean_description = driver.find_element_by_class_name("listingoverview").text
        description = re.sub(r'[^\w\s]', "", unclean_description)
    except NoSuchElementException:
        pass
    
    # Try to find the class, and just continue if it is not found
    try:
        
        # Wait up to 30 seconds for the section containing all of the text to be present
        site_timeout = 30
        blocks_present = EC.presence_of_element_located((By.CLASS_NAME, "sectionblock"))
        WebDriverWait(driver, site_timeout).until(blocks_present)
    
        # Find the rest of the text from the block of text on the website
        block = driver.find_element_by_class_name("sectionblock").text

        # Split the block of text into each individual line
        split_body_lines = block.split("\n")
    
        # Loop through each line that is found in the tags and alias the values accordingly
        for i in range(len(split_body_lines)):

            # Listing Price
            if split_body_lines[i].lower() == "list price":
                unclean_list_price = str(split_body_lines[i+1])

            # Number of Bedrooms
            elif split_body_lines[i].lower() == "bedrooms":
                num_bedrooms = int(split_body_lines[i+1])

            # Number of Bathrooms
            elif split_body_lines[i].lower() == "bathrooms":
                num_bathrooms = int(split_body_lines[i+1])

            # Property Type
            elif split_body_lines[i].lower() == "property type":
                property_type = str(split_body_lines[i+1])

            # Lot Size
            elif split_body_lines[i].lower() == "lot size":
                unclean_lot_size = str(split_body_lines[i+1])

            # Year Built
            elif split_body_lines[i].lower() == "year built":
                unclean_year_built = str(split_body_lines[i+1])

            # Title/type of property ownership
            elif split_body_lines[i].lower() == "title":
                title = str(split_body_lines[i+1])

            # Style of the Property
            elif split_body_lines[i].lower() == "style":
                style = str(split_body_lines[i+1])

            # Features
            elif split_body_lines[i].lower() == "features":
                the_features = str(split_body_lines[i+1])

            # Amenities found in/around the property
            elif split_body_lines[i].lower() == "amenities":
                the_amenities = str(split_body_lines[i+1])

            # Appliances
            elif split_body_lines[i].lower() == "appliances":
                the_appliances = str(split_body_lines[i+1])

            # Community/neighbourhood the property is situated in
            elif split_body_lines[i].lower() == "community":
                community = str(split_body_lines[i+1])

            # Days on the REW website
            elif split_body_lines[i].lower() == "days on rew":
                unclean_days_on_website = str(split_body_lines[i+1])

            # Pass if the line is none of the above choices
            else:
                pass
    
    # If the page times out, then just keep going
    except TimeoutException:
        pass
    
    # Remove any punctuation and unwanted strings in the features being extracted;
    # a value that is absent or cannot be parsed is stored as None (missing)
    try:
        list_price = int(re.sub(r'[^\d]', "", unclean_list_price))
    except Exception:
        list_price = None
    
    try:
        year_built = int(re.findall(r'[0-9]{4}', unclean_year_built)[0])
    except Exception:
        year_built = None
    
    try:
        days_on_website = int(re.sub(r'[^\d]', "", unclean_days_on_website))
    except Exception:
        days_on_website = None
        
    # Lot size: None when the text holds no usable number, 0 when the text cannot be read at all
    try:
        lot_size_match = re.search(r'[0-9]{4}', unclean_lot_size)
        lot_size = int(lot_size_match.group()) if lot_size_match else None
    except Exception:
        lot_size = 0
    
    # Append all of the values to the respective lists
    property_sizes.append(property_size)
    descriptions.append(description)
    list_prices.append(list_price)
    bedrooms.append(num_bedrooms)
    bathrooms.append(num_bathrooms)
    property_types.append(property_type)
    lot_sizes.append(lot_size)
    years_built.append(year_built)
    titles.append(title)
    styles.append(style)
    features.append(the_features)
    amenities.append(the_amenities)
    appliances.append(the_appliances)
    communities.append(community)
    days_on_rew.append(days_on_website)
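
The label/value loop in the cell above can also be written as a dictionary lookup. The sketch below assumes, as that loop does, that each label appears on its own line with its value on the following line. `FIELD_LABELS`, `parse_block`, and the sample text are illustrative and not taken from the site, and only some of the labels are listed.

```python
# Labels as matched in the scraping loop (lower-cased), mapped to hypothetical output keys
FIELD_LABELS = {
    "list price": "list_price",
    "bedrooms": "bedrooms",
    "bathrooms": "bathrooms",
    "property type": "property_type",
    "year built": "year_built",
    "community": "community",
    "days on rew": "days_on_rew",
}


def parse_block(block):
    """Map each known label to the raw text on the line that follows it."""
    lines = block.split("\n")
    values = {}
    # zip pairs every line with its successor, so a label on the last line is skipped safely
    for label, value in zip(lines, lines[1:]):
        key = FIELD_LABELS.get(label.strip().lower())
        if key is not None:
            values[key] = value.strip()
    return values


# Made-up block text in the assumed "label, then value" layout
sample_block = "List Price\n$1,498,000\nBedrooms\n3\nProperty Type\nDuplex\nCommunity"
parsed = parse_block(sample_block)
print(parsed)
```

The values are still raw strings at this point, so the same cleaning with `re.sub` and `int` would follow.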

In [33]:
# Add all of the lists to the dataframe
listings_df["property_sizes"] = property_sizes
listings_df["descriptions"] = descriptions
listings_df["list_prices"] = list_prices
listings_df["bedrooms"] = bedrooms
listings_df["bathrooms"] = bathrooms
listings_df["property_types"] = property_types
listings_df["lot_sizes"] = lot_sizes
listings_df["years_built"] = years_built
listings_df["titles"] = titles
listings_df["styles"] = styles
listings_df["features"] = features
listings_df["amenities"] = amenities
listings_df["appliances"] = appliances
listings_df["communities"] = communities
listings_df["days_on_rew"] = days_on_rew

# Show the new dataframe
listings_df

Unnamed: 0,Address,URL,property_sizes,descriptions,list_prices,bedrooms,bathrooms,property_types,lot_sizes,years_built,titles,styles,features,amenities,appliances,communities,days_on_rew
0,"260 E 17th Avenue, Vancouver, BC, V5V 1A7",https://www.rew.ca/properties/3299096/260-e-17...,1104,This stunning farfromtypical half duplex offer...,1498000,3,3,Duplex,,2021,Freehold Strata,2 Storey,,"Central Location, Shopping Nearby",Washer/Dryer/Fridge/Stove/Dishwasher,Main,17
1,"3823 W 11th Avenue, Vancouver, BC, V6R 2K8",https://www.rew.ca/properties/3289248/3823-w-1...,3218,Endless unobstructed sweeping mountain city v...,3588000,4,5,House,4035.0,1936,Freehold NonStrata,3 Storey,"Central Location, Drapes/window Coverings, Sec...","Recreation Nearby, Shopping Nearby",Washer/Dryer/Fridge/Stove/Dishwasher,Point Grey,21
2,"4056 W 10th Avenue, Vancouver, BC, V6R 2H1",https://www.rew.ca/properties/3337194/4056-w-1...,2627,Panoramic Views from Bowen Island to Mountains...,2998000,3,4,House,4035.0,2004,Freehold NonStrata,3 Storey w/Bsmt,"Drapes/window Coverings, Garage Door Opener, L...","Golf Course Development, Marina Nearby, Shoppi...","Microwave, Washer/Dryer/Fridge/Stove/Dishwasher",Point Grey,3
3,"209-469 E King Edward Avenue, Vancouver, BC, V...",https://www.rew.ca/properties/3335316/209-469-...,1058,Marquise is situated in the heart of Vancouver...,1699000,2,2,Apt/Condo,4035.0,2020,Freehold Strata,"Corner Unit,Inside Unit",Garage Door Opener,,Washer/Dryer/Fridge/Stove/Dishwasher,Cambie,3
4,"1836 W 60th Avenue, Vancouver, BC, V6P 2A9",https://www.rew.ca/properties/3336095/1836-w-6...,4402,West Boulevard beauty awaits in 1836 W 60th Av...,3688000,6,5,House,7966.0,1986,Freehold NonStrata,2 Storey w/Bsmt.,,"Recreation Nearby, Shopping Nearby",Washer/Dryer/Fridge/Stove/Dishwasher,Southwest Marine,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127,"301-717 W 17th Avenue, Vancouver, BC, V5Z 1V1",https://www.rew.ca/properties/3326092/301-717-...,1315,Heather 17th A rare collection of 16 luxurio...,2198800,2,2,Apt/Condo,,2020,Freehold Strata,1 Storey,,"Air Conditioning, Private Setting, Recreation ...",Washer/Dryer/Fridge/Stove/Dishwasher,Cambie,6
128,"103-334 E 5th Avenue, Vancouver, BC, V5T 1H4",https://www.rew.ca/properties/3325991/103-334-...,,,535000,1,1,Apt/Condo,,1977,Freehold Strata,"Corner Unit,Ground Level Unit",,"Central Location, Private Setting, Private Yar...",,Mount Pleasant East,6
129,"2947 W 35th Avenue, Vancouver, BC, V6N 2M5",https://www.rew.ca/properties/3325994/2947-w-3...,3068,Introducing a beautifully built craftsmanstyle...,3828000,5,4,House,4290.0,2007,Freehold NonStrata,2 Storey w/Bsmt.,"Drapes/window Coverings, Fireplace Insert, Gar...","Air Conditioning, Recreation Nearby, Shopping ...","Microwave, Washer/Dryer/Fridge/Stove/Dishwashe...",MacKenzie Heights,6
130,"312 W 1st Avenue, Vancouver, BC, V5Y 3T7",https://www.rew.ca/properties/3325993/312-w-1s...,1209,This highquality concrete townhouse built in 2...,1199000,2,3,Townhouse,4290.0,2009,Freehold Strata,1 Storey,"Garage Door Opener, Intercom, Security System","Central Location, Recreation Nearby, Shopping ...","Freezer, Washer/Dryer/Fridge/Stove/Dishwasher",False Creek,6


## Cleaning

While scraping the data, I noticed that the webdriver could not open every link: the driver would time out at random points and fail to retrieve the data. Duplicate or NaN values are therefore likely to occur and need to be cleaned.

In [35]:
# Check for missing values
listings_df.isna().sum()

Address            0
URL                0
property_sizes     2
descriptions       2
list_prices        0
bedrooms           0
bathrooms          0
property_types     0
lot_sizes         16
years_built        0
titles             0
styles             0
features          49
amenities         20
appliances        15
communities        0
days_on_rew        0
dtype: int64

In [34]:
# Check if there are any duplicates
listings_df.iloc[:, 2:-1].duplicated().sum()

0

In [26]:
listings_df["property_types"].unique()

array(['Duplex', 'House', 'Apt/Condo', 'Townhouse'], dtype=object)

It seems that there are no duplicates, but there are indeed missing values. In the `lot_sizes` column, these come from listings that have no lot size (i.e. apartments/condos) and from newly built properties without a correct number: the webpage can show the lot size label without a value next to it.

To clean up the missing values, I will start by setting the `lot_sizes` column to 0 where `property_types` is `Apt/Condo`. Any remaining listings with a missing lot size will be set to 1, indicating an unknown lot size. If any other information is missing, the listing will be dropped. Since there are almost no missing values in the other columns, only a small fraction of listings will be removed.
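
The lot-size rule described above can be sketched as a small function. This is an illustration of the stated plan, not the notebook's final cleaning code: `fill_lot_size` is a hypothetical helper, it assumes the 0 applies to every `Apt/Condo` listing, and it assumes missing values arrive as `None` or NaN.

```python
import math


def fill_lot_size(property_type, lot_size):
    """Apply the plan: 0 for Apt/Condo, 1 for an unknown lot size, otherwise keep the value."""
    # Apartments and condos have no lot of their own
    if property_type == "Apt/Condo":
        return 0

    # Any other property with a missing value is marked as unknown
    is_missing = lot_size is None or (isinstance(lot_size, float) and math.isnan(lot_size))
    if is_missing:
        return 1

    return lot_size


print(fill_lot_size("Apt/Condo", float("nan")))
print(fill_lot_size("Duplex", None))
print(fill_lot_size("House", 4035.0))
```

On the DataFrame, a helper like this could be applied row by row, for example through `listings_df.apply(..., axis=1)`.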

In [44]:
listings_df[listings_df["property_sizes"].isna()]

Unnamed: 0,Address,URL,property_sizes,descriptions,list_prices,bedrooms,bathrooms,property_types,lot_sizes,years_built,titles,styles,features,amenities,appliances,communities,days_on_rew
89,"707-1661 Quebec Street, Vancouver, BC, V6A 0H2",https://www.rew.ca/properties/3325334/707-1661...,,,925000,2,2,Apt/Condo,6900.0,2018,Freehold Strata,"End Unit,Upper Unit","Drapes/window Coverings, Oven - Built In","Air Conditioning, Central Location, Lane Acces...","Microwave, Washer/Dryer/Fridge/Stove/Dishwasher",Mount Pleasant East,7
128,"103-334 E 5th Avenue, Vancouver, BC, V5T 1H4",https://www.rew.ca/properties/3325991/103-334-...,,,535000,1,1,Apt/Condo,,1977,Freehold Strata,"Corner Unit,Ground Level Unit",,"Central Location, Private Setting, Private Yar...",,Mount Pleasant East,6


In [47]:
listings_df["URL"][0]

'https://www.rew.ca/properties/3299096/260-e-17th-avenue-vancouver-bc?search_params%5Bquery%5D=Vancouver%2C+BC&searchable_id=361&searchable_type=Geography'

## EDA