# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Create notebook

Done.

## Part 2: Data Scraping

The Wikipedia table data can be scraped using Pandas directly, or using BeautifulSoup. In this simple case, Pandas requires fewer lines of code. However, BeautifulSoup can be customized for scraping data from dirtier sources. Therefore, both options are shown below.

#### Option A: Use Pandas

In [1]:
# import library
import pandas as pd

In [64]:
# adjust display properties
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated)

In [39]:
# pull table from url
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# read_html returns a list of tables; use an index (e.g. [0] or [1]) to select a specific one
df = pd.read_html(url, header=0)[0]
print(df.shape)
df.head()

(288, 3)


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [14]:
# ignore this irrelevant table, captured from the same url...
# read_html returns a list of tables; use an index (e.g. [0] or [1]) to select a specific one
df2 = pd.read_html(url, header=0)[1]
print(df2.shape)
df2.head()

(3, 18)


|   | Unnamed: 0 | Canadian postal codes | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E... | NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E... | NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E... |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 1 | NL | NS | PE | NB | QC | QC | QC | ON | ON | ON | ON | ON | MB | SK | AB | BC | NU/NT | YT |
| 2 | A | B | C | E | G | H | J | K | L | M | N | P | R | S | T | V | X | Y |


#### Option B: Use BeautifulSoup

In [4]:
# import libraries
import pandas
import requests
from bs4 import BeautifulSoup

In [24]:
# url & acquire data
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
print(type(soup))
soup

<class 'bs4.BeautifulSoup'>


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="UTF-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":920980179,"wgRevisionId":920980179,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","Dec

In [23]:
# isolate table data
# soup.find() finds the first instance only (soup.find_all() will produce a list of tables)
table = soup.find('table',{'class':'wikitable sortable'})
print(type(table))
table

<class 'bs4.element.Tag'>


<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

In [28]:
# create a list where each item starts with <tr> and ends with </tr>
# each item in the list is then a row in the table
table_rows = table.find_all('tr')
print(type(table_rows))
table_rows

<class 'bs4.element.ResultSet'>


[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>, <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td></tr>, <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
 </td></tr>, <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
 </td></tr>, <tr>
 <td>M6A</td>
 <td

In [30]:
# build list from data
# isolate text bookended by <td> and </td>, then strip those bookends
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
print(type(data))
data

<class 'list'>


[[],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 ['M9B', 'Etobicoke', 'Islington'],
 ['M

In [34]:
# convert to pandas dataframe
df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
print(df.shape)
df.head()

(289, 3)


|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 |  |  |  |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |


In [36]:
# drop the first row, which is empty because the header row has <th> cells but no <td> cells
df = df.loc[1:]
print(df.shape)
df.head()

(288, 3)


|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
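
The empty first row arises because the header row contains only `<th>` cells, so `row.find_all('td')` returns an empty list for it. As an alternative to slicing with `df.loc[1:]`, a minimal sketch is to filter out empty rows before building the dataframe. It assumes `data` has the same list-of-lists shape as above; the sample rows are copied from the output shown earlier, and `df_clean` is a hypothetical name.

```python
import pandas as pd

# sample of the scraped rows; the leading [] comes from the <th>-only header row
data = [
    [],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M2A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
]

# keep only rows that actually contain <td> cells
rows = [row for row in data if row]

df_clean = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighbourhood'])
print(df_clean.shape)
```

Filtering this way also leaves the index starting at 0, so no later re-indexing is needed.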


## Part 3: Clean dataframe

Follow the directives in the assignment, which are also shown below.

In [150]:
# rebuild df from scratch in case it was altered during cleaning/testing...
df = pd.read_html(url, header=0)[0]
print(df.shape)
df.head()

(288, 3)


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


Ignore cells with a borough that is **Not assigned.**

In [151]:
# remove rows with borough 'Not assigned'
# .copy() makes df_na an independent frame, so later assignments don't target a slice of df
df_na = df[df['Borough'] != 'Not assigned'].copy()
df_na

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.

In [152]:
# check row with a 'Not assigned' Neighborhood
df_na.iloc[6,:]

Postcode         M7A         
Borough          Queen's Park
Neighbourhood    Not assigned
Name: 8, dtype: object

In [153]:
# loop over Neighbourhood column to find 'Not assigned', & replace w/ Borough value of that row
for h, hood in enumerate(df_na['Neighbourhood']):
    if hood == 'Not assigned':
        print(h, hood)
        # columns by position: 1 = Borough, 2 = Neighbourhood
        # .iat sets a single cell by integer position (a SettingWithCopyWarning here would
        # stem from df_na being a filtered slice of df, not from the accessor used)
        df_na.iat[h, 2] = df_na.iat[h, 1]

6 Not assigned


In [154]:
# re-check row which previously had a 'Not assigned' Neighborhood
df_na.iloc[6,:]

Postcode         M7A         
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 8, dtype: object
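
The same replacement can be written without an explicit loop. The following is a sketch rather than the assignment's prescribed method: it uses a small stand-in frame (`df_demo` and `df_fixed` are hypothetical names) with the same columns, and applies a boolean mask with `.loc` on an explicit copy.

```python
import pandas as pd

# stand-in for df_na, using rows shown in the output above
df_demo = pd.DataFrame({
    'Postcode': ['M6A', 'M7A', 'M9A'],
    'Borough': ['North York', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Lawrence Manor', 'Not assigned', 'Islington Avenue'],
})

# work on an explicit copy so the original frame is left untouched
df_fixed = df_demo.copy()

# True for rows whose neighbourhood is missing
mask = df_fixed['Neighbourhood'] == 'Not assigned'

# copy the borough name into the neighbourhood column for those rows only
df_fixed.loc[mask, 'Neighbourhood'] = df_fixed.loc[mask, 'Borough']
print(df_fixed)
```

Selecting by label and mask avoids relying on hard-coded column positions such as `1` and `2`.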

More than one neighborhood can exist in one postal code area. Combine all **Neighbourhood**s in the same **Postcode** into a single row, where the **Neighbourhood**s are separated by a comma and a space.

In [155]:
# aggregate all Neighborhoods with the same Postcode
df_post = df_na.groupby('Postcode').agg({'Borough':'first', 'Neighbourhood': ', '.join})
df_post

| Postcode | Borough | Neighbourhood |
|---|---|---|
| M1B | Scarborough | Rouge, Malvern |
| M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| M1E | Scarborough | Guildwood, Morningside, West Hill |
| M1G | Scarborough | Woburn |
| M1H | Scarborough | Cedarbrae |
| M1J | Scarborough | Scarborough Village |
| M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| M1N | Scarborough | Birch Cliff, Cliffside West |


In [156]:
# quick check against assignment solution image
df_post.loc[['M5V'], ['Neighbourhood']]

| Postcode | Neighbourhood |
|---|---|
| M5V | CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara |


In [157]:
# check the dataframe shape 
#'Postcode' is the index, and it's not included in shape ==> reset index
df_post.shape

(103, 2)

In [158]:
# reset index
df_post.reset_index(inplace=True)
df_post.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [161]:
# final dataframe shape and # of rows
print(df_post.shape)
print('The number of rows in the dataframe =', df_post.shape[0])

(103, 3)
The number of rows in the dataframe = 103
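
The separate `reset_index` step can also be folded into the aggregation. A minimal sketch, assuming a small stand-in frame built from rows shown earlier (`df_demo` and `df_grouped` are hypothetical names): passing `as_index=False` to `groupby` keeps `Postcode` as an ordinary column, so `shape` counts all three columns directly.

```python
import pandas as pd

# stand-in for df_na, using rows shown in the outputs above
df_demo = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M1B', 'M1B', 'M9A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'Scarborough', 'Scarborough', 'Etobicoke'],
    'Neighbourhood': ['Harbourfront', 'Regent Park',
                      'Rouge', 'Malvern', 'Islington Avenue'],
})

# as_index=False keeps the grouping key as a regular column
df_grouped = (
    df_demo
    .groupby('Postcode', as_index=False)
    .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
)
print(df_grouped.shape)
```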


## Part 4: Get Neighbourhood Latitudes & Longitudes

From the assignment:
"In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data"

Get latitude and longitude.

In [173]:
print(type(df_post.iat[0,0]))
df_post.iat[0,0]

<class 'str'>


'M1B'

In [175]:
# import geocoder
import geocoder

ModuleNotFoundError: No module named 'geocoder'

In [None]:
# initialize your variable to None
lat_lng_coords = None

# example postal code
postal_code = df_post.iat[0,0]

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
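
Because the loop above never exits if the service keeps returning `None`, a bounded retry may be safer. The sketch below is an illustration under stated assumptions: `get_latlng` and `fake_lookup` are hypothetical names, and the lookup is passed in as a callable so the example runs without the `geocoder` package. In practice one might pass something like `lambda q: geocoder.google(q).latlng`.

```python
import time

def get_latlng(postal_code, lookup, max_tries=5, delay=0.0):
    """Call lookup(query) up to max_tries times; return the coordinates or None."""
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # pause before the next attempt
    return None  # give up after max_tries failed attempts

# hypothetical stand-in for a flaky geocoder: fails twice, then succeeds
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [0.0, 0.0]  # placeholder coordinates

result = get_latlng('M1B', fake_lookup)
print(result, calls['n'])
```

Returning `None` after the last attempt lets the caller fall back to the CSV file mentioned in the assignment text.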

Add latitude and longitude to the dataframe.
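
One way to do this, sketched under assumptions, is to load the assignment's fallback CSV and merge it onto the cleaned frame by postal code. The column names `Postal Code`, `Latitude`, and `Longitude` are assumed here and are not confirmed by this notebook. The stand-in frames (`df_demo`, `coords`, `df_geo`) are hypothetical, and the coordinate values are placeholders, not real data.

```python
import pandas as pd

# stand-in for the cleaned frame df_post
df_demo = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})

# in practice: coords = pd.read_csv('http://cocl.us/Geospatial_data')
coords = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],   # assumed column name
    'Latitude': [1.0, 2.0],          # placeholder values
    'Longitude': [-1.0, -2.0],       # placeholder values
})

# left merge keeps every row (and the row order) of the cleaned frame
df_geo = (
    df_demo
    .merge(coords, left_on='Postcode', right_on='Postal Code', how='left')
    .drop(columns='Postal Code')
)
print(df_geo)
```

With `how='left'`, any postal code missing from the coordinates file would show up as `NaN` rather than being dropped.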


## Part 5: Explore and Cluster Neighbourhoods

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.

2. to generate maps to visualize your neighborhoods and how they cluster together.