<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Assignment---Toronto-data" data-toc-modified-id="Assignment---Toronto-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Assignment - Toronto data</a></span><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Data-acquisition" data-toc-modified-id="Data-acquisition-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data acquisition</a></span></li><li><span><a href="#Data-preprocessing" data-toc-modified-id="Data-preprocessing-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data preprocessing</a></span></li><li><span><a href="#Add-latitude-and-longitude-coordinates-to-data" data-toc-modified-id="Add-latitude-and-longitude-coordinates-to-data-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Add latitude and longitude coordinates to data</a></span></li><li><span><a href="#Explore-neighborhoods-in-Toronto" data-toc-modified-id="Explore-neighborhoods-in-Toronto-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Explore neighborhoods in Toronto</a></span></li></ul></li></ul></div>

# Assignment - Toronto data

## Import libraries

In [23]:
import pandas as pd  # data analysis
import numpy as np  # vectorized numerical operations
import requests  # HTTP requests
from bs4 import BeautifulSoup  # HTML parsing for web scraping
import geocoder  # geocoding helper
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values
import folium  # map plotting library
import matplotlib.cm as cm  # Matplotlib colormaps
import matplotlib.colors as colors  # Matplotlib color utilities
from sklearn.cluster import KMeans  # k-means clustering

## Data acquisition

The data are obtained by scraping the Wikipedia page that lists the Canadian postal codes starting with M, using the **BeautifulSoup** library:

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
data_from_url = requests.get(url).text
data_bs = BeautifulSoup(data_from_url, 'html.parser')
data_bs

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
...

Extract the text of every table cell, row by row:

In [4]:
data = []
for tr in data_bs.tbody.find_all('tr'):
    # Rows without <td> cells yield an empty list
    data.append([td.get_text().strip() for td in tr.find_all('td')])

data

[[],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', 'Downtown Toronto', "Queen's Park"],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 ['M9B', 'Etobicoke', 'Islington'],
 ['M9B', 'Etobicoke', 'Martin Grove'],
 ['M9B
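The first extracted row is empty, presumably because the header row contains only `<th>` cells. A minimal sketch of one way to skip such rows during extraction follows. The inline HTML table and the helper name `extract_rows` are hypothetical; the table stands in for the scraped page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline table standing in for the scraped page.
SAMPLE_HTML = """
<table><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</tbody></table>
"""

def extract_rows(html):
    """Return the stripped <td> texts of every row, skipping rows without <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        cells = [td.get_text().strip() for td in tr.find_all('td')]
        if cells:  # a row made only of <th> cells yields an empty list
            rows.append(cells)
    return rows

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

With this filter in place, no empty row would need to be dropped from the dataframe afterwards.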

Let's create a dataframe with columns 'PostalCode', 'Borough' and 'Neighborhood':

In [5]:
cols = ['PostalCode', 'Borough', 'Neighborhood']
df = pd.DataFrame(data, columns = cols)
display(df.head())
print(df.shape)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |


(288, 3)


## Data preprocessing

The first row has no data. Let's drop it and reset the index:

In [6]:
df = df.dropna()
df = df.reset_index(drop = True)
display(df.head())
print(df.shape)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


(287, 3)
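With its default arguments, `dropna()` drops every row that contains at least one missing value, which is broader than removing a fully empty row. The sketch below contrasts it with `how='all'` on made-up rows. The `None` entries are an assumption standing in for the empty first row.

```python
import pandas as pd

cols = ['PostalCode', 'Borough', 'Neighborhood']
# Made-up rows: one fully empty, one partially filled, one complete.
sample = pd.DataFrame(
    [[None, None, None],
     ['M1A', 'Not assigned', None],
     ['M3A', 'North York', 'Parkwoods']],
    columns=cols,
)

dropped_any = sample.dropna().reset_index(drop=True)           # removes both incomplete rows
dropped_all = sample.dropna(how='all').reset_index(drop=True)  # removes only the fully empty row

print(dropped_any.shape, dropped_all.shape)
```

For the scraped table both variants should behave the same as long as only the first row is empty. `how='all'` is the more conservative choice if partially filled rows ought to be kept.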


We only process cells that have an assigned borough, so we ignore the rows whose borough is 'Not assigned':

In [7]:
df = df[ df.Borough != 'Not assigned']
display(df.head())
print(df.shape)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |


(210, 3)
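Filtering with a boolean mask keeps the original index labels, which is why the output above starts at index 2. The sketch below shows the same filter on made-up rows. The `reset_index` and `.copy()` calls are optional additions, not part of the original cell.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

mask = sample['Borough'] != 'Not assigned'
filtered = sample[mask]  # index labels 1 and 2 are preserved
# Optional: renumber the rows and take an explicit copy for later modification.
renumbered = sample[mask].reset_index(drop=True).copy()

print(list(filtered.index), list(renumbered.index))
```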


If a row has a borough but its neighborhood is 'Not assigned', the neighborhood should be set to the borough. Let's first inspect the neighborhood values and then check whether any such rows exist:

In [8]:
#Print number of unique values
print('There are {} unique neighborhoods.'.format(len(df['Neighborhood'].unique())))
df.groupby('Neighborhood').count()

There are 208 unique neighborhoods.


| Neighborhood | PostalCode | Borough |
|---|---|---|
| Adelaide | 1 | 1 |
| Agincourt | 1 | 1 |
| Agincourt North | 1 | 1 |
| Albion Gardens | 1 | 1 |
| Alderwood | 1 | 1 |
| ... | ... | ... |
| Woodbine Heights | 1 | 1 |
| York Mills | 1 | 1 |
| York Mills West | 1 | 1 |
| York University | 1 | 1 |


In [9]:
print('Number of rows with not assigned Neighborhood = ' + str(df[ df.Neighborhood == 'Not assigned' ].shape[0]))

Number of rows with not assigned Neighborhood = 0
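No such rows exist here, so no replacement is needed. If they did exist, the rule could be applied as in the following sketch. The rows are made up for illustration, and `Series.mask` is one of several possible ways to do it.

```python
import pandas as pd

# Made-up rows: the first has a borough but no assigned neighborhood.
sample = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

unassigned = sample['Neighborhood'] == 'Not assigned'
# mask() replaces the values where the condition is True with the aligned Borough value.
sample['Neighborhood'] = sample['Neighborhood'].mask(unassigned, sample['Borough'])

print(sample['Neighborhood'].tolist())
```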


Since a postal code can cover more than one neighborhood, we combine the neighborhoods that share a postal code into a single row, separated by commas:

In [10]:
#Print number of unique values
print('There are {} unique postal codes.'.format(len(df['PostalCode'].unique())))
df.groupby('PostalCode').count()

There are 103 unique postal codes.


| PostalCode | Borough | Neighborhood |
|---|---|---|
| M1B | 2 | 2 |
| M1C | 3 | 3 |
| M1E | 3 | 3 |
| M1G | 1 | 1 |
| M1H | 1 | 1 |
| ... | ... | ... |
| M9N | 1 | 1 |
| M9P | 1 | 1 |
| M9R | 4 | 4 |
| M9V | 8 | 8 |


In [11]:
#Aggregate all neighborhoods with same postal code to same row
df_proc = df.groupby('PostalCode').agg({'Borough': 'first', 'Neighborhood': ', '.join}).reset_index()
display(df_proc.head())
print(df_proc.shape)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


(103, 3)
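Aggregating `Borough` with `'first'` is only safe if every postal code belongs to exactly one borough. A small sketch, using three rows taken from the outputs above, checks that assumption before aggregating:

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Number of distinct boroughs per postal code; 'first' loses nothing if this is always 1.
boroughs_per_code = sample.groupby('PostalCode')['Borough'].nunique()
assert (boroughs_per_code == 1).all()

combined = (
    sample.groupby('PostalCode')
    .agg({'Borough': 'first', 'Neighborhood': ', '.join})
    .reset_index()
)
print(combined)
```

If the check failed on the real data, the borough column would need its own aggregation rule rather than `'first'`.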


## Add latitude and longitude coordinates to data

Now we can get the latitude and longitude of every postal code in the dataframe from the provided CSV file of geospatial data:

In [12]:
df_geo = pd.read_csv('http://cocl.us/Geospatial_data')
display(df_geo.head())
print(df_geo.shape)

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


(103, 3)


Rename the column 'Postal Code' to 'PostalCode' so that it matches the key column of our dataframe:

In [18]:
df_geo = df_geo.rename(columns={"Postal Code": "PostalCode"})
display(df_geo.head())
print(df_geo.shape)

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


(103, 3)


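Now we can merge both dataframes to add the latitude and longitude. Note that the merge below uses `df` (one row per neighborhood) rather than the aggregated `df_proc`, so neighborhoods that share a postal code get identical coordinates: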

In [19]:
df = pd.merge(df, df_geo, on= 'PostalCode')
display(df.head())
print(df.shape)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M6A | North York | Lawrence Manor | 43.718518 | -79.464763 |


(210, 5)
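The default inner merge silently drops rows whose postal code has no match in the geospatial file. A minimal sketch of a more defensive merge follows. It uses made-up frames, and the code 'M0Z' is hypothetical. `validate` asserts that each postal code appears only once in the coordinate table, and `indicator` exposes the unmatched rows.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M6A', 'M6A', 'M0Z'],  # 'M0Z' is a made-up code without coordinates
    'Neighborhood': ['Lawrence Heights', 'Lawrence Manor', 'Nowhere'],
})
geo = pd.DataFrame({
    'PostalCode': ['M6A'],
    'Latitude': [43.718518],
    'Longitude': [-79.464763],
})

# how='left' keeps every neighborhood row; validate raises if geo has duplicate keys.
merged = pd.merge(left, geo, on='PostalCode', how='left',
                  validate='many_to_one', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()

print(unmatched)  # ['M0Z']
```

In the notebook the row count is unchanged by the merge (210 before and after), which suggests that every postal code found a match.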


## Explore neighborhoods in Toronto

In [20]:
df_toronto = df[df['Borough'].str.contains('Toronto')]
df_toronto = df_toronto.reset_index(drop = True)
display(df_toronto.head())
print(df_toronto.shape)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 1 | M7A | Downtown Toronto | Queen's Park | 43.662301 | -79.389494 |
| 2 | M5B | Downtown Toronto | Ryerson | 43.657162 | -79.378937 |
| 3 | M5B | Downtown Toronto | Garden District | 43.657162 | -79.378937 |
| 4 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |


(74, 5)
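The filter above keeps every borough whose name contains the substring 'Toronto'. A short sketch on a made-up series shows how `str.contains` builds the mask. The `na=False` argument is an optional addition that treats missing borough names as non-matches.

```python
import pandas as pd

# Made-up borough names, including one missing value.
boroughs = pd.Series(['Downtown Toronto', 'North York', 'East Toronto', None])

# na=False turns missing entries into False instead of propagating them into the mask.
mask = boroughs.str.contains('Toronto', na=False)

print(mask.tolist())  # [True, False, True, False]
```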


Use the geopy library to get the latitude and longitude of the city:

In [27]:
city = 'Toronto, ON'
# Nominatim requires a user_agent; we name ours city_explorer
geolocator = Nominatim(user_agent = "city_explorer") 
location = geolocator.geocode(city)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of ' + city + ' are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, ON are 43.653963, -79.387207.
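The lookup can also be wrapped in a small function that handles the case where the geocoder finds nothing, since geopy's `geocode` returns `None` when there is no result. This is a sketch under assumptions: the function name `city_coordinates` is hypothetical, and a stub geocoder replaces the network call so the example runs offline.

```python
from collections import namedtuple

def city_coordinates(geocode, query):
    """Return (latitude, longitude) for query, or raise if nothing is found.

    `geocode` is any callable shaped like `Nominatim(...).geocode`.
    """
    location = geocode(query)
    if location is None:  # no match for the query
        raise ValueError('No geocoding result for {!r}'.format(query))
    return location.latitude, location.longitude

# Stub geocoder for offline use; with geopy one could instead pass
# Nominatim(user_agent="city_explorer").geocode
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def fake_geocode(query):
    return FakeLocation(43.653963, -79.387207) if query == 'Toronto, ON' else None

print(city_coordinates(fake_geocode, 'Toronto, ON'))
```

Passing the geocoder in as an argument keeps the function testable without network access.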


Generate a map centered on Toronto:

In [28]:
toronto_map = folium.Map(location = [latitude, longitude], zoom_start = 10)

#Display map
toronto_map

Add a red circle marker to represent the center of Toronto:

In [29]:
folium.CircleMarker( [latitude, longitude], radius = 10, color = 'red', popup = city, fill = True,
                    fill_color = 'red', fill_opacity = 0.6).add_to(toronto_map)
toronto_map

Add the neighborhoods as blue circle markers:

In [30]:
for lat, lng, bor, neigh in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}-{}'.format(bor, neigh)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius = 5, color = 'blue', popup = label, fill = True,
                        fill_color='blue', fill_opacity=0.6).add_to(toronto_map)
toronto_map
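Neighborhoods that share a postal code have identical coordinates (Ryerson and Garden District, for example), so their markers are drawn on top of each other and only one popup is reachable. One possible remedy, sketched here on three rows from the table above, is to build a single label per coordinate pair before plotting. The grouping step is an addition, not part of the original notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Ryerson', 'Garden District', 'St. James Town'],
    'Latitude': [43.657162, 43.657162, 43.651494],
    'Longitude': [-79.378937, -79.378937, -79.375418],
})

# One row per marker position; sort=False keeps the order of first appearance.
markers = (
    sample.groupby(['Latitude', 'Longitude', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
labels = ['{}-{}'.format(b, n) for b, n in zip(markers['Borough'], markers['Neighborhood'])]

print(labels)
```

The resulting `markers` frame could then be iterated in the same way as `df_toronto` in the loop above.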