
# Scrape Toronto Neighborhood Data From Wikipedia

A Wikipedia page lists the Toronto postal codes together with their boroughs and neighborhoods, which is all the information we need to explore. We scrape that page, clean and wrangle the data, and then load it into a *pandas* dataframe so that it is in a structured format.

### Import the necessary libraries for the project

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

### Define the Wikipedia page for web scraping

In [2]:
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

### Download the web page HTML content

In [3]:
response = requests.get(wiki_url)
html = response.text
print("Response status code:", response.status_code)

Response status code: 200
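
The download step above prints the status code but would carry on even if the request failed. As a minimal sketch, the same step can be wrapped in a helper that sets a timeout and raises on HTTP errors. The helper name `fetch_html` and its `timeout` and `session` parameters are hypothetical and not part of the original notebook; the `session` parameter only exists so that the helper can be exercised without network access.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical helper: `session` may be a requests.Session (or any
    object with a compatible `get` method); by default the module-level
    `requests.get` is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `html = fetch_html(wiki_url)` would stand in for the two assignment lines of the cell above.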


In [4]:
soup = BeautifulSoup(html, "lxml")

# Partially print the downloaded HTML page
pretty_html = soup.prettify()
pretty_html_omitted = pretty_html[:1024] + "\n...\n...\n...\n" + pretty_html[-1024:]
print(pretty_html_omitted)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":890001695,"wgRevisionId":890001695,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

### Find the target table from the web page

In [5]:
html_table = soup.find("table", {"class": "wikitable sortable"})
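
One caveat with this lookup: when a multi-word string is given for `class`, Beautiful Soup compares it against the exact value of the attribute, so the match depends on the order of the class names. The sketch below illustrates this on a small hypothetical HTML snippet (not the real page) and shows a CSS selector as an order-independent alternative. It uses the built-in `html.parser` instead of `lxml` only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: same classes as the target table, different order.
sample_html = """
<table class="sortable wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match on the class attribute: misses the reordered classes.
by_exact_string = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: requires both classes, in any order.
by_selector = sample_soup.select_one("table.wikitable.sortable")

print("exact string found a table:", by_exact_string is not None)
print("CSS selector found a table:", by_selector is not None)
```

Either way, it is worth checking that the result is not `None` before extracting rows, since the page layout can change.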

### Extract the data from the table

In [6]:
# Define function to extract the text of every data cell (<td>) in a row
def extract_cells(row):
    return [cell.get_text().strip() for cell in row.find_all("td")]

# Define function to extract the cells of every row (<tr>) in a table
def extract_rows(table):
    return [extract_cells(row) for row in table.find_all("tr")]

# Get the table data (filter out any rows that don't have exactly 3 cells)
table = list(filter(lambda row: len(row) == 3, extract_rows(html_table)))
table[:5]

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]
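
The three-cell filter works because only `<td>` cells are collected: a header row made of `<th>` cells yields an empty list, and rows with merged cells yield fewer than three entries. The following sketch shows this on a small hypothetical table (the footnote row is invented for illustration), again using `html.parser` to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row, one data row and one merged row.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>  Parkwoods  </td></tr>
  <tr><td colspan="3">Footnote row</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# One list of stripped cell texts per <tr>; <th> cells are ignored.
rows = [
    [cell.get_text().strip() for cell in tr.find_all("td")]
    for tr in sample_table.find_all("tr")
]

# Keep only complete data rows (exactly three <td> cells).
three_cell_rows = [row for row in rows if len(row) == 3]
print(three_cell_rows)
```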

### Create a pandas dataframe from the table data

In [7]:
df = pd.DataFrame(table)
df.head()

     0                 1                 2
0  M1A      Not assigned      Not assigned
1  M2A      Not assigned      Not assigned
2  M3A        North York         Parkwoods
3  M4A        North York  Victoria Village
4  M5A  Downtown Toronto      Harbourfront


### Clean-up and format the dataframe

In [8]:
# Set the columns
df.columns = ["PostalCode", "Borough", "Neighborhood"]

# Remove the rows that don't have an assigned borough
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
df = df.loc[df["Borough"] != "Not assigned"].copy()

# Replace missing neighborhoods with the same value as borough column
df["Neighborhood"] = np.where(df["Neighborhood"] == "Not assigned", df["Borough"], df["Neighborhood"])

# Group by postal code: keep the first borough and join the neighborhoods with commas
df = df.groupby("PostalCode").agg({"Borough": "first", "Neighborhood": lambda cell: ", ".join(cell)}).reset_index()
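
To make the effect of the three clean-up steps easier to follow, here is a minimal sketch that applies them to a four-row toy dataframe. The rows and the names `toy`, `assigned` and `grouped` are illustrative only and are not taken from the scraped table.

```python
import numpy as np
import pandas as pd

# Illustrative rows: one unassigned code, one code listed twice,
# and one code with a borough but no neighborhood.
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Step 1: drop rows without an assigned borough.
assigned = toy.loc[toy["Borough"] != "Not assigned"].copy()

# Step 2: fall back to the borough name when the neighborhood is missing.
assigned["Neighborhood"] = np.where(
    assigned["Neighborhood"] == "Not assigned",
    assigned["Borough"],
    assigned["Neighborhood"],
)

# Step 3: one row per postal code, neighborhoods joined with commas.
grouped = (
    assigned.groupby("PostalCode")
    .agg({"Borough": "first", "Neighborhood": lambda cell: ", ".join(cell)})
    .reset_index()
)
print(grouped)
```

The unassigned `M1A` row disappears, the two `M5A` rows collapse into one, and `M7A` takes its borough name as its neighborhood.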

### Save the dataframe to disk

In [9]:
df.to_csv("toronto_neighborhood.csv")
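
The `index` argument of `to_csv` controls whether the row index is written along with the data. When it is written, it comes back as an extra `Unnamed: 0` column if the file is later read with a plain `pd.read_csv`. The sketch below shows the difference on a two-row toy dataframe, using in-memory buffers instead of files on disk; the variable names are illustrative.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"PostalCode": ["M1B", "M1C"], "Borough": ["Scarborough", "Scarborough"]}
)

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```

If the index carries no information, as with the default range index here, passing `index=False` when saving or `index_col=0` when reading keeps the round trip clean.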

### Dataframe preview

In [10]:
# Display a preview of the data frame (first 16 rows)
print(df.head(16))

# The size of the dataframe
print("\nDataframe size: {0}".format(df.shape))

   PostalCode      Borough                                       Neighborhood
0         M1B  Scarborough                                     Rouge, Malvern
1         M1C  Scarborough             Highland Creek, Rouge Hill, Port Union
2         M1E  Scarborough                  Guildwood, Morningside, West Hill
3         M1G  Scarborough                                             Woburn
4         M1H  Scarborough                                          Cedarbrae
5         M1J  Scarborough                                Scarborough Village
6         M1K  Scarborough        East Birchmount Park, Ionview, Kennedy Park
7         M1L  Scarborough                    Clairlea, Golden Mile, Oakridge
8         M1M  Scarborough    Cliffcrest, Cliffside, Scarborough Village West
9         M1N  Scarborough                        Birch Cliff, Cliffside West
10        M1P  Scarborough  Dorset Park, Scarborough Town Centre, Wexford ...
11        M1R  Scarborough                                  Mary