# Vancouver Real Estate

Brandon Tu

May 3, 2021

---

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-Background" data-toc-modified-id="Project-Background-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project Background</a></span><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href="#Thought-Process" data-toc-modified-id="Thought-Process-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Thought Process</a></span></li><li><span><a href="#Data-Dictionary" data-toc-modified-id="Data-Dictionary-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Dictionary</a></span></li></ul></li><li><span><a href="#Import-Packages-and-Libraries" data-toc-modified-id="Import-Packages-and-Libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Packages and Libraries</a></span></li><li><span><a href="#Web-Scraping-Data" data-toc-modified-id="Web-Scraping-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Web Scraping Data</a></span></li><li><span><a href="#Cleaning" data-toc-modified-id="Cleaning-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Cleaning</a></span></li><li><span><a href="#EDA" data-toc-modified-id="EDA-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>EDA</a></span></li></ul></div>

## Project Background

### Overview

This project was created for the purpose of using web scraping to obtain data on real estate listings in Vancouver with the ultimate goal of creating an application and the intention to help real estate investors and home owners find possible investments that are underpriced. With the insanely high cost of housing in Vancouver, being able to find cheaper housing is very valuable for both home owners and real estate investors; such properties are difficult to find and can provide great return on investments for investors or allow renters with lower income to purchase their first property.

### Thought Process
1. I will start by using the available information off of a real estate website containing listings from Lower Mainland of British Columbia, [Remax](https://www.remax.ca/bc/vancouver-real-estate?v=1), and will build out the necessary framework to scrape and clean the information off of the website. 
2. Obtain information on the average price of a property based on bedrooms, size and other characteristics of the property to for usage in determining whether a listing is considered above the average or below the average price.
3. Include an interactive population density map to help visualize where the listings are situated and the population in that area. This is important for both investors and home owners as investors may want to rent out these properties or sell these properties after upgrading the property, and home owners may want to know how busy the neighbourhood in consideration of their personal and/or family interests.
4. Build a package to automate the scraping of the information.
5. Turn the entire process into an application that can use these functions and visualize the map.
6. Increase the area that the application will consider.


### Data Dictionary

The information that I want to obtain from the website are included below with a description.

Column Name   | Datatype     | Description
------------- | -------------| -------------
Address | str | Address of the property
URL | str | URL to find the listing
property_size | int | The size of the property in square feet
description | str | Description of the property provided on the website
price | int | Price of the property
bedrooms | int | Number of Bedrooms
bathrooms | int | Number of Bathrooms
property_type | str | Type of property (house, condo, etc.)
style | str | The style of the property (ex. 2 storey, Townhouse, etc.)
subdivision | str | The neighbourhood of which the property belongs

---

## Import Packages and Libraries

Since I want to perform web scraping in this project, I will use the Selenium and BeautifulSoup libraries to obtain my data.

In [1]:
# Import Packages and Libraries

# Fundamentals
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import re
import time

# Web Scraping
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

---

## Web Scraping Data

For web scraping, I will disable many functions that are available for Google Chrome browsers while using the Selenium webdriver in order to reduce time loading webpages and potential errors that may occur while scraping data.

In [2]:
# Customize the options
options = webdriver.ChromeOptions()

# Add arguement to ignore certificate errors
options.add_argument("--ignore-certificate-errors")

# Add argument to access web browser in incognito mode
options.add_argument("--incognito")

# Add argument to use Selenium without opening a browser
options.add_argument("--headless")

# Disable Side Navigation
options.add_argument('--disable-browser-side-navigation')

# Disable info bars
options.add_argument("--disable-infobars")

# No sandbox
options.add_argument("--no-sandbox")

# Disable dev shm usage
options.add_argument("--disable-dev-shm-usage")

# Enable automation
options.add_argument("enable-automation")

# Use chrome webdriver as the driver
path = "/Users/Brandon 1/desktop/chromedriver"
driver = webdriver.Chrome(path, chrome_options=options)

# Maximize the window size
driver.maximize_window()

  driver = webdriver.Chrome(path, chrome_options=options)


When it comes to web scraping, I need to remember to be a responsible web scraper by checking whether web crawling is allowed on the websites. Lucky for me, the Remax website seems to allow the scraping of the listings.

Note: When web scraping, I will only use Selenium for interacting with the webpages and BeautifulSoup for parsing the data in order to reduce runtime, memory and CPU usage, and simply for the robustness of BeautifulSoup compared to Selenium.

In [3]:
# Get REMAX webpage
# Enter the website url
driver.get("https://www.remax.ca/bc/vancouver-real-estate?v=1")

In [4]:
# Add all of the links for each listing in the website
rows_list = []

# Use while there is a next page button to click, scrape all of the urls and then click next 
page = 1

# Alias wait times for timeouts and implicit waits
timeout = 10
implicit_wait = 10

# Find out how many listings there are
num_listings = int(re.sub(r'[^\d]', "", driver.find_element_by_class_name("has-pos-rel.listings-count").text))

# Starting on first page until max pages are reached, contains approximately 10 listings per page
while page<=round(num_listings/10):

    # Wait until all the listings are present
    more_listings_present = EC.presence_of_all_elements_located((By.CLASS_NAME, "card-container"))
    more_listings = WebDriverWait(driver, timeout).until(more_listings_present)
        
    # Scroll to the last listing to load all of the listings
    driver.execute_script('arguments[0].scrollIntoView();', more_listings[-1])
        
    # Include an implicit wait of 10 seconds to allow all of the listings to load
    driver.implicitly_wait(implicit_wait)

    # Create page source alias to use BeautifulSoup
    page_source = driver.page_source

    # Use beautifulsoup to scrape the page
    soup = BeautifulSoup(page_source, 'lxml')
        
    # Get the URLs of all of the listings to enter each listing for scraping
    listings = soup.find_all(class_="card-container")

    # Use for loop to get all of the listings on the page
    for listing in listings:

        # Add listing addresses and URLs as the dict to list if they are not already existing
        if {"Address":listing.find(class_="redesign").string.split(",")[0],
            "Municipality":listing.find(class_="redesign").string.split(",")[1],
            "URL":("https://www.remax.ca"+listing["href"]),
            "Price":re.sub(r'[^\d]', "", listing.find(class_='price').text),
            "Bedrooms":re.sub(r'[^\d]', "", listing.find(class_='property-details').text.split("|")[0]),
            "Bathrooms":re.sub(r'[^\d]', "", listing.find(class_='property-details').text.split("|")[1]),
            "Property Size":re.sub(r'[^\d]', "", listing.find(class_='property-details').text.split("|")[2])
           } not in rows_list:
            rows_list.append(
            {"Address":listing.find(class_="redesign").string, 
            "URL":("https://www.remax.ca"+listing["href"]),
            "Price":re.sub(r'[^\d]', "", listing.find(class_='price').text),
            "Bedrooms":re.sub(r'[^\d]', "", listing.find(class_='property-details').text.split("|")[0]),
            "Bathrooms":re.sub(r'[^\d]', "", listing.find(class_='property-details').text.split("|")[1]),
            "Property Size":re.sub(r'[^\d]', "", listing.find(class_='property-details').text.split("|")[2])
           })

        # If it exists, just pass
        else:
            pass
        
#         # Find out if the element is present before clicking it
#         next_button_present = EC.presence_of_element_located((By.LINK_TEXT, ">"))
#         WebDriverWait(driver, timeout).until(next_button_present)
        
#         # Click on the next page button
#         driver.find_element_by_partial_link_text(">").click()
        
    # Add to the page count
    page += 1
        
    # Get the new page
    try:
        
        # Move to the next page
        driver.get("https://www.remax.ca/bc/vancouver-real-estate?v=1&page="+str(page))
    
    # If there are no more next pages then stop
    except:
        break

In [5]:
page

620

In [9]:
# Use for loop to index through each url and scrape the information off of each page
for i in range(len(rows_list)):
    
    # Use links with Selenium and start scraping the content off of the webpages
    driver.get(rows_list[i]["URL"])
    
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Implicit wait
    # This wait should be enough for property size and description to load as its the top of the page
    driver.implicitly_wait(implicit_wait)
    
    # Create a dictionary containing features of listings that I want
    property_features = {"Description":"",
                         "Property Type":"",
                         "Style":"",
                         "Subdivision":""}
    
    # Find the description and clean the text
    try:
        property_features["Description"] = driver.find_element_by_class_name("content").text
    except NoSuchElementException:
        property_features["Description"] = None
    
    # Try to find the containers holding home features, and just continue if it is not found
    try:
        
        # Find out if the sections containing all of the text are present before 30 seconds
        site_timeout = 30
        features_present = EC.presence_of_all_elements_located((By.CLASS_NAME, "detail-container.ng-star-inserted"))
        WebDriverWait(driver, site_timeout).until(features_present)
    
        # Find the rest of the text from the block of text on the website
        home_features = driver.find_elements_by_class_name("detail-container.ng-star-inserted")[1].text.split("\n")
    
        # Loop through each line that is found in the tags and alias the values accordingly
        for feature in home_features:
            
            # Property Type
            if feature.split(": ")[0] in property_features:
                property_features[feature.split(": ")[0]] = feature.split(": ")[1]

            # Pass if the line is none of the above choices
            else:
                pass
    
     # If the page times out, then just keep going
    except TimeoutException:
        pass
    
    # Update previous rows_list list of dictionaries
    rows_list[i].update(property_features)

KeyboardInterrupt: 

In [10]:
# Create new dataframe containing each of the listings and links
listings_df = pd.DataFrame(rows_list)

# Print the number of rows and columns of the dataframe
print(f"Rows: {listings_df.shape[0]}, Columns: {listings_df.shape[1]}\n")

# Check for duplicates and NaN values
print(f"Duplicates: \n{listings_df.duplicated().sum()}\n")
print(f"NaN Values: \n{listings_df.isna().sum()}")

# Show the dataframe
display(listings_df)

Rows: 6190, Columns: 10

Duplicates: 
44

NaN Values: 
Address          0
URL              0
Price            0
Bedrooms         0
Bathrooms        0
Property Size    0
Description      0
Property Type    0
Style            0
Subdivision      0
dtype: int64


Unnamed: 0,Address,URL,Price,Bedrooms,Bathrooms,Property Size,Description,Property Type,Style,Subdivision
0,"5436 CHAFFEY AVE, Burnaby, BC",https://www.remax.ca/bc/burnaby-real-estate/54...,1850000,4,4,1991,Brand New 1/2 Duplex in the Central of Burnaby...,Duplex,2 Storey,Central Park Bs
1,"706 - 1250 BURNABY ST, Vancouver, BC",https://www.remax.ca/bc/vancouver-real-estate/...,329900,1,1,565,Investor/First time buyer Alert! This property...,Condo,Corner Unit,West End Vw
2,"6561 MACDONALD ST, Vancouver, BC",https://www.remax.ca/bc/vancouver-real-estate/...,6990000,5,6,6451,Located in the Prestigious Vancouver West S.W....,Single Family,2 Storey w/Bsmt.,S.W. Marine
3,"402 - 6969 21 AVE, Burnaby, BC",https://www.remax.ca/bc/burnaby-real-estate/40...,518000,1,1,774,"Welcome to The Stratford! Top unit, a beautifu...",Condo,"Upper Unit, Penthouse",Highgate
4,"3065 W 49 AVE, Vancouver, BC",https://www.remax.ca/bc/vancouver-real-estate/...,5280000,4,6,4830,Beautiful Southlands home at QUIET block of W ...,Single Family,2 Storey w/Bsmt.,Southlands
...,...,...,...,...,...,...,...,...,...,...
6185,"5561 ASH ST, Vancouver, BC",https://www.remax.ca/bc/vancouver-real-estate/...,10000000,,,,Potential for highrise residential rezoning. G...,Vacant Land,,Cambie
6186,"1005 - 1550 ALBERNI ST, Vancouver, BC",https://www.remax.ca/bc/vancouver-real-estate/...,2099999,1,1,788,Under construction. Measurements are approxima...,Condo,Upper Unit,West End Vw
6187,"1105 - 1550 ALBERNI ST, Vancouver, BC",https://www.remax.ca/bc/vancouver-real-estate/...,1899888,1,1,788,!!!MOTIVATED SELLER!!! Under construction. Mea...,Condo,Upper Unit,West End Vw
6188,"402 - 1550 ALBERNI ST, Vancouver, BC",https://www.remax.ca/bc/vancouver-real-estate/...,1749888,0,1,723,!!!MOTIVATED SELLER!!! Under construction. Mea...,Condo,Upper Unit,West End Vw


## Cleaning

As I was scraping the data, it occurred to me that the webdriver would not be able to open every link as the driver would time out at random points and not retrieve the data. Thus, duplicate or NaN values are likely to occur and would need to be cleaned.

In [None]:
listings_df["property_types"].unique()

It seems that there are no duplicates, but there are indeed missing values in this case. Regarding the `lot_sizes` column as listings that have no lot size (ie. apartment/condos) and other newly built properties that do not have a correct number on there can have the column name on the webpage but do not have a value as part of it.

To clean up the missing values, I will start by setting the `lot_sizes` column as 0 if the `property_types` are labelled as `apt/condo`. Any remaining listings will be listed as 1 indicating an unknown lot size. If any other information is missing, the listing will be dropped. Seeing as how there is are almost no missing values in other columns, less than 1% of listings will be removed.

In [None]:
listings_df[listings_df["property_sizes"].isna()]

In [None]:
listings_df["URL"][0]

## EDA