<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

# Collecting and Transforming Toronto Neighborhood Data from Wikipedia

## Introduction

This document describes how to obtain data about the neighborhoods in Toronto from the table of postal codes that is freely available on a Wikipedia page, and how to explore that dataset. Furthermore, it explains how the data is transformed and stored in a pandas dataframe.

## Table of Contents

1. <a href="#item1">Download and Explore Dataset</a>


Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests

## 1. Download and Explore Dataset

The data source for exploring the neighborhoods in Toronto is a Wikipedia page.
In order to segment the neighborhoods and explore them, we need a dataset that contains the postal codes, the boroughs, and the neighborhoods that exist in each postal code.

After capturing and formatting the dataset, we will create a new dataframe that consists of three columns: PostalCode, Borough, and Neighborhood.

**Notes on scraping the Wikipedia page:**

Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that **M5A** is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. These two rows will be combined into one row with the neighborhoods separated with a **comma**.

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the **9th** cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be **Queen's Park**.
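
The first and third notes can be sketched on a small hand-made sample, assuming pandas is available. The rows and the names `sample`, `cleaned`, and `unassigned` are illustrative assumptions, not data scraped from the page.

```python
import pandas as pd

# Hypothetical sample rows mimicking the layout of the Wikipedia table
sample = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M3A", "North York", "Parkwoods"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# First note: keep only the rows that have an assigned borough
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Third note: an unassigned neighborhood takes the borough name
unassigned = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighborhood"] = cleaned.loc[unassigned, "Borough"]

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Filtering first means the fill step only ever sees rows with a real borough, so the order of the two steps matters.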

This dataset exists for free on the web.  Here is the link to the dataset: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
raw_toronto_wikipedia_page = requests.get(wikipedia_link)

Storing the text of the Wikipedia page in the `page` variable

In [4]:
page = raw_toronto_wikipedia_page.text

Finding the postal code table inside the Wikipedia page and storing it in the `table_script` variable

In [5]:
html_table_tag_start = "wikitable sortable"
html_table_tag_end = "</tbody></table>"
table_start = page.find(html_table_tag_start) + len(html_table_tag_start)
table_end = page.find(html_table_tag_end,table_start)
table_script = page[table_start:table_end]
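
The string offsets above depend on the exact markup that Wikipedia serves, which may change over time. As an alternative sketch, and not the approach this notebook uses, Python's standard `html.parser` module can collect the cell text of a table. The `sample_html` snippet and the `TableCellParser` class are hypothetical and written only for illustration.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th> cells and stay empty
                self.rows.append(self._row)
            self._row = None


# Hypothetical markup in the same general shape as the Wikipedia table
sample_html = """
<table class="wikitable sortable"><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods</td></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
</tbody></table>
"""

parser = TableCellParser()
parser.feed(sample_html)
print(parser.rows)
```

A parser of this kind ignores link tags and whitespace between cells, so the tag-removal step would not be needed under this approach.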


Removing tags that are not relevant to the dataset

In [6]:
table_script = table_script.replace("</a>","")
table_script = table_script.replace("<td>","")
table_script = table_script.replace("\n","")
table_script = table_script.replace("\t","")
table_script = table_script.replace("\"><tbody><tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th>","")

Removing rows whose borough and neighborhood are both "Not assigned" and storing only the valid rows in a new list

In [7]:
tr_table = table_script.split("</tr>")
tr_table_valid = []
for p in tr_table:
    # keep non-empty rows unless borough and neighborhood are both unassigned
    if "Not assigned</td>Not assigned" not in p and len(p) > 0:
        tr_table_valid.append(p)

In [8]:
print(tr_table_valid)

['<tr>M3A</td><a href="/wiki/North_York" title="North York">North York</td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</td>', '<tr>M4A</td><a href="/wiki/North_York" title="North York">North York</td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</td>', '<tr>M5A</td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</td>', '<tr>M5A</td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</td>', '<tr>M6A</td><a href="/wiki/North_York" title="North York">North York</td><a href="/wiki/Lawrence_Heights" title="Lawrence Heights">Lawrence Heights</td>', '<tr>M6A</td><a href="/wiki/North_York" title="North York">North York</td><a href="/wiki/Lawrence_Manor" title="Lawrence Manor">Lawrence Manor</td>', '<tr>M7A</td><a href="/wiki/Queen%27s_Park_(Toronto)" title="

Create a new DataFrame

In [9]:
# define the dataframe columns
column_names = ['PostalCode','Borough', 'Neighborhood'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [10]:
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood


Extract the postal code, borough, and neighborhood from each row in the list and store them in the DataFrame

In [11]:
rows = []  # one dict per table row; DataFrame.append was removed in pandas 2.0
for r in tr_table_valid:
    PostalCode = ''
    Borough = ''
    neighborhood_name = ''
    td_table = r.split("</td")
    # the cell text is whatever follows the last '>' in each piece
    if (td_table[0].rfind(">") > -1):
        PostalCode = td_table[0][td_table[0].rfind(">")+1:]
    if (td_table[1].rfind(">") > -1):
        Borough = td_table[1][td_table[1].rfind(">")+1:]
    if (td_table[2].rfind(">") > -1):
        neighborhood_name = td_table[2][td_table[2].rfind(">")+1:]
    else:
        neighborhood_name = Borough
    # a borough with an unassigned neighborhood reuses the borough name
    if (neighborhood_name == "Not assigned"):
        neighborhood_name = Borough

    rows.append({'PostalCode': PostalCode, 'Borough': Borough,
                 'Neighborhood': neighborhood_name})

# build the dataframe once from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)
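
To see what the `rfind` slicing does, the sketch below applies it to a single row string in the same shape as the entries of `tr_table_valid`. The helper name `text_after_last_tag` and the `sample_row` value are hypothetical. Unlike the loop above, the helper returns the whole fragment when no `>` is present.

```python
def text_after_last_tag(fragment):
    """Return the text that follows the last '>' in an HTML fragment."""
    position = fragment.rfind(">")
    # rfind returns -1 when '>' is absent, so the slice keeps the whole fragment
    return fragment[position + 1:]


# Hypothetical row, shaped like one entry of tr_table_valid
sample_row = (
    '<tr>M3A</td><a href="/wiki/North_York" title="North York">North York</td>'
    '<a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</td>'
)

# Splitting on "</td" leaves the cell text at the end of each piece
cells = sample_row.split("</td")
postal_code, borough, neighborhood = (text_after_last_tag(c) for c in cells[:3])
print(postal_code, borough, neighborhood)
```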

Check dataframe results

In [12]:
neighborhoods.head(220)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Check the shape of the DataFrame

In [13]:
neighborhoods.shape

(212, 3)

Create a new DataFrame

In [14]:
# instantiate the postal Code dataframe
postalcode = pd.DataFrame(columns=column_names)

Extract the postal code, borough, and neighborhood for each row and store them in the new DataFrame, **grouping by** postal code

In [15]:
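grouped_PostalCode = neighborhoods.groupby('PostalCode')
grouped_rows = []  # one dict per postal code; DataFrame.append was removed in pandas 2.0
for name, group in grouped_PostalCode:
    g_PostalCode = name
    g_Borough = group['Borough'].unique()[0]
    # join all neighborhoods of this postal code with commas
    g_Neighborhood = ",".join(group['Neighborhood'].values.tolist())
    grouped_rows.append({'PostalCode': g_PostalCode, 'Borough': g_Borough,
                         'Neighborhood': g_Neighborhood})

# build the dataframe once from the collected rows
postalcode = pd.DataFrame(grouped_rows, columns=column_names)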
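
The same grouping can also be written without an explicit loop. The sketch below is one possible alternative using pandas named aggregation, shown on a hypothetical `sample` dataframe rather than the scraped data. It assumes each postal code belongs to a single borough, as the loop above does when it takes the first unique value.

```python
import pandas as pd

# Hypothetical sample in the same shape as the `neighborhoods` dataframe
sample = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
    }
)

# One output row per postal code: first borough, comma-joined neighborhoods
grouped = sample.groupby("PostalCode", as_index=False).agg(
    Borough=("Borough", "first"),
    Neighborhood=("Neighborhood", ",".join),
)
print(grouped)
```

Because `groupby` sorts the group keys by default, the result is ordered by postal code, matching the output of the loop.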

Check the DataFrame results again after grouping

In [16]:
postalcode.head(150)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


Check the shape of the DataFrame

In [17]:
postalcode.shape

(103, 3)

### About the Authors:  
[Clayton Magalhaes](https://www.linkedin.com/in/cvianam/) is a Fraud Prevention Specialist at IBM.



 <hr>
Copyright &copy; 2018 [cognitiveclass.ai](cognitiveclass.ai?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).