# Getting the data ready - Toronto postal codes

Import the necessary libraries

In [36]:
import pandas as pd
import numpy as np

Use pandas' read_html function to read the table directly from Wikipedia. Passing na_values converts the 'Not assigned' entries to NaN, and the first table on the page is the one we want as our dataframe.

In [71]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
table = pd.read_html(url, na_values=['Not assigned'])

df = table[0]
df

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
...,...,...,...
175,M5Z,,
176,M6Z,,
177,M7Z,,
178,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


Check for NaN values, since we need to get rid of postal codes with no Borough assigned.

In [72]:
df.isna().sum()

Postal code      0
Borough         77
Neighborhood    77
dtype: int64

Borough and Neighborhood have the same number of NaNs. We need to be sure they all fall in the same rows (i.e. there is no NaN in Neighborhood alongside a value for Borough, or the reverse).

In [73]:
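for n, b in zip(df['Neighborhood'], df['Borough']):
    # NaN never compares equal to anything (itself included), so use pd.isna()
    # The two columns must be missing together, in both directions
    assert pd.isna(n) == pd.isna(b), 'Mismatch: Borough={}, Neighborhood={}'.format(b, n)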
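
The same row-alignment check can be written without a loop by comparing the two missing-value masks. The sketch below runs on a small made-up frame (the `sample` data is hypothetical, not taken from the Wikipedia table) and also shows why a comparison with `==` cannot detect NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the postal-code table, for illustration only
sample = pd.DataFrame({
    'Postal code': ['M1A', 'M3A', 'M4A'],
    'Borough': [np.nan, 'North York', 'North York'],
    'Neighborhood': [np.nan, 'Parkwoods', 'Victoria Village'],
})

# Boolean masks marking where each column is missing
borough_missing = sample['Borough'].isna()
neighborhood_missing = sample['Neighborhood'].isna()

# True only if the two columns are missing in exactly the same rows
aligned = bool((borough_missing == neighborhood_missing).all())
print(aligned)

# NaN is not equal to itself, so an equality test never finds it
print(np.nan == np.nan)
```

On the real dataframe, the same two masks could be built from df['Borough'] and df['Neighborhood'].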

Since no row has a Borough without a Neighborhood (or vice versa), we can simply drop the rows that contain NaN.

In [74]:
df.dropna(inplace=True)
df

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
160,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,Business reply mail Processing Centre
169,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


There are 103 rows in the table. Check whether all postal codes are unique.

In [75]:
df['Postal code'].nunique()

103
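
Comparing nunique() with the row count works, but pandas also offers a more direct test. As a minimal sketch on made-up data (the `codes` series is hypothetical and deliberately contains a repeat), `is_unique` answers the yes/no question and `duplicated` lists the offending values:

```python
import pandas as pd

# Hypothetical sample with one repeated code, for illustration only
codes = pd.Series(['M3A', 'M4A', 'M5A', 'M4A'], name='Postal code')

# False here, because 'M4A' appears twice
print(codes.is_unique)

# keep=False flags every occurrence of a repeated value, not just the later ones
repeated = codes[codes.duplicated(keep=False)]
print(repeated.tolist())
```

For the dataframe in this notebook, df['Postal code'].is_unique would be the equivalent one-line check.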

The number of unique postal codes equals the number of rows, so there are no duplicates. Our final job is to tidy the Neighborhood column by replacing the forward slashes with commas.

In [76]:
df['Neighborhood'] = df['Neighborhood'].str.replace('/', ',')
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park , Harbourfront"
3,M6A,North York,"Lawrence Manor , Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,..."
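
Replacing only the slash character keeps the spaces that surrounded it, which is why the output reads "Regent Park , Harbourfront". If tighter formatting is wanted, one possible variant is a regular expression that consumes the whitespace around each slash. This is a sketch on a made-up `sample` series, assuming the separator always looks like " / ":

```python
import pandas as pd

# Small sample in the same "A / B" format as the Neighborhood column
sample = pd.Series([
    'Parkwoods',
    'Regent Park / Harbourfront',
    'The Kingsway / Montgomery Road / Old Mill North',
])

# Swallow the whitespace around each slash so no space is left before the comma
cleaned = sample.str.replace(r'\s*/\s*', ', ', regex=True)
print(cleaned.tolist())
```

Passing regex=True explicitly avoids relying on the default, which has changed between pandas versions.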


In [77]:
df.shape

(103, 3)