# Clustering Toronto Neighborhood Data
## Data Scraping

Import libraries

In [3]:
# import libraries
from bs4 import BeautifulSoup
import requests

Use Beautiful Soup to extract the data from the Wikipedia page on Toronto neighborhoods

In [4]:
# set the URL of the Wikipedia page
postcodes_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# download the page and pass its HTML to Beautiful Soup
source = requests.get(postcodes_url).text
soup = BeautifulSoup(source, 'lxml')

# show in readable format
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":900271985,"wgRevisionId":900271985,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June",
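
The request above has no timeout or status check, so a slow or failed response would only surface later as a confusing parse result. Below is a minimal sketch of a more defensive fetch. The helper names `fetch_html` and `make_soup` are hypothetical, and Python's built-in `html.parser` is used here on the assumption that `lxml` may not be installed.

```python
from bs4 import BeautifulSoup
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def make_soup(html_text):
    """Parse HTML text with the parser bundled with Python."""
    return BeautifulSoup(html_text, 'html.parser')


# Usage (requires network access), with the URL from the cell above:
# soup = make_soup(fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'))
```

Keeping the download and the parsing in separate functions makes it possible to test the parsing step on a saved or inline HTML string without touching the network.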

Pull out the table with the neighborhood data

In [5]:
full_table = soup.find('table', class_='wikitable sortable')
print(full_table)

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

Loop through each row in the table, then through each cell in the row, to read the postcode, borough, and neighborhood

In [6]:
for table_row in full_table.find_all('tr'):
	counter = 1
	for table_cell in table_row.find_all('td'):
		if counter == 1:
			postcode = table_cell.text
			print("Postcode is "+postcode )
		elif counter == 2:
			borough = table_cell.text
			print("Borough is "+borough )
		else:
			neighborhood = table_cell.text
			print("Neighborhood is "+neighborhood )
		counter = counter + 1

Postcode is M1A
Borough is Not assigned
Neighborhood is Not assigned

Postcode is M2A
Borough is Not assigned
Neighborhood is Not assigned

Postcode is M3A
Borough is North York
Neighborhood is Parkwoods

Postcode is M4A
Borough is North York
Neighborhood is Victoria Village

Postcode is M5A
Borough is Downtown Toronto
Neighborhood is Harbourfront

Postcode is M5A
Borough is Downtown Toronto
Neighborhood is Regent Park

Postcode is M6A
Borough is North York
Neighborhood is Lawrence Heights

Postcode is M6A
Borough is North York
Neighborhood is Lawrence Manor

Postcode is M7A
Borough is Queen's Park
Neighborhood is Not assigned

Postcode is M8A
Borough is Not assigned
Neighborhood is Not assigned

Postcode is M9A
Borough is Etobicoke
Neighborhood is Islington Avenue

Postcode is M1B
Borough is Scarborough
Neighborhood is Rouge

Postcode is M1B
Borough is Scarborough
Neighborhood is Malvern

Postcode is M2B
Borough is Not assigned
Neighborhood is Not assigned

Postcode is M3B
Borough is 
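
The counter-based loop can be written more compactly by collecting the text of all cells in a row at once. The following is a minimal sketch on a small inline HTML fragment, which is an assumption standing in for the scraped table; `parse_rows`, `sample_html`, and `sample_table` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Small stand-in for the scraped Wikipedia table (illustrative fragment only)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_rows(table):
    """Return a list of (postcode, borough, neighborhood) tuples."""
    rows = []
    for table_row in table.find_all('tr'):
        # get_text(strip=True) removes the trailing newline in each cell
        cells = [cell.get_text(strip=True) for cell in table_row.find_all('td')]
        if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
            rows.append(tuple(cells))
    return rows


sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
for postcode, borough, neighborhood in parse_rows(sample_table):
    print(postcode, borough, neighborhood)
```

Stripping whitespace while reading each cell also avoids the blank lines that appear after every neighborhood in the output above.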

Create an empty pandas dataframe to store the neighborhood data

In [7]:
# create empty dataframe
import pandas as pd
column_names = ['Postcode', 'Borough', 'Neighborhood'] 
neighborhoods = pd.DataFrame(columns=column_names)

In [8]:
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighborhood


Use the above loop to load the neighborhood data, with these caveats:
1. Skip rows whose borough is Not assigned.
2. If a row has a borough but a Not assigned neighborhood, use the borough name as the neighborhood.
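
The two caveats can be sketched as a small function that is independent of the scraping code. This is only an illustration, assuming each row is a `(postcode, borough, neighborhood)` tuple; `apply_caveats` and `example` are hypothetical names.

```python
def apply_caveats(rows):
    """Apply the two 'Not assigned' rules to (postcode, borough, neighborhood) tuples."""
    cleaned = []
    for postcode, borough, neighborhood in rows:
        if borough == 'Not assigned':
            continue  # rule 1: drop rows without a borough
        if neighborhood == 'Not assigned':
            neighborhood = borough  # rule 2: fall back to the borough name
        cleaned.append((postcode, borough, neighborhood))
    return cleaned


example = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M7A', "Queen's Park", 'Not assigned'),
]
print(apply_caveats(example))
```

In this example the M1A row is dropped and the M7A row gets Queen's Park as both borough and neighborhood.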

In [9]:
# load data into dataframe
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for table_row in full_table.find_all('tr'):
    counter = 1
    postcode = 'NA'
    borough = 'NA'
    neighborhood = 'NA'
    for table_cell in table_row.find_all('td'):
        if counter == 1:
            postcode = table_cell.text
            print("Postcode is " + postcode)
        elif counter == 2:
            borough = table_cell.text
            print("Borough is " + borough)
        else:
            neighborhood = table_cell.text.strip('\n')
            print("Neighborhood is " + neighborhood)
        counter = counter + 1
    print('The postcode is {}, borough is {}, and neighborhood is {}.'.format(postcode, borough, neighborhood))
    if (postcode == 'NA') or (borough == 'Not assigned'):
        # header row (no <td> cells) or unassigned borough
        print('Skipping this row')
    elif neighborhood == 'Not assigned':
        # no neighborhood listed: reuse the borough name
        rows.append({'Postcode': postcode, 'Borough': borough, 'Neighborhood': borough})
    else:
        rows.append({'Postcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

# build the dataframe in one step from the collected rows
neighborhoods = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighborhood'])

The postcode is NA, borough is NA, and neighborhood is NA.
Skipping this row
Postcode is M1A
Borough is Not assigned
Neighborhood is Not assigned
The postcode is M1A, borough is Not assigned, and neighborhood is Not assigned.
Skipping this row
Postcode is M2A
Borough is Not assigned
Neighborhood is Not assigned
The postcode is M2A, borough is Not assigned, and neighborhood is Not assigned.
Skipping this row
Postcode is M3A
Borough is North York
Neighborhood is Parkwoods
The postcode is M3A, borough is North York, and neighborhood is Parkwoods.
Postcode is M4A
Borough is North York
Neighborhood is Victoria Village
The postcode is M4A, borough is North York, and neighborhood is Victoria Village.
Postcode is M5A
Borough is Downtown Toronto
Neighborhood is Harbourfront
The postcode is M5A, borough is Downtown Toronto, and neighborhood is Harbourfront.
Postcode is M5A
Borough is Downtown Toronto
Neighborhood is Regent Park
The postcode is M5A, borough is Downtown Toronto, and neighborhood i

Postcode is M5J
Borough is Downtown Toronto
Neighborhood is Toronto Islands
The postcode is M5J, borough is Downtown Toronto, and neighborhood is Toronto Islands.
Postcode is M5J
Borough is Downtown Toronto
Neighborhood is Union Station
The postcode is M5J, borough is Downtown Toronto, and neighborhood is Union Station.
Postcode is M6J
Borough is West Toronto
Neighborhood is Little Portugal
The postcode is M6J, borough is West Toronto, and neighborhood is Little Portugal.
Postcode is M6J
Borough is West Toronto
Neighborhood is Trinity
The postcode is M6J, borough is West Toronto, and neighborhood is Trinity.
Postcode is M7J
Borough is Not assigned
Neighborhood is Not assigned
The postcode is M7J, borough is Not assigned, and neighborhood is Not assigned.
Skipping this row
Postcode is M8J
Borough is Not assigned
Neighborhood is Not assigned
The postcode is M8J, borough is Not assigned, and neighborhood is Not assigned.
Skipping this row
Postcode is M9J
Borough is Not assigned
Neighborho

Postcode is M1P
Borough is Scarborough
Neighborhood is Scarborough Town Centre
The postcode is M1P, borough is Scarborough, and neighborhood is Scarborough Town Centre.
Postcode is M1P
Borough is Scarborough
Neighborhood is Wexford Heights
The postcode is M1P, borough is Scarborough, and neighborhood is Wexford Heights.
Postcode is M2P
Borough is North York
Neighborhood is York Mills West
The postcode is M2P, borough is North York, and neighborhood is York Mills West.
Postcode is M3P
Borough is Not assigned
Neighborhood is Not assigned
The postcode is M3P, borough is Not assigned, and neighborhood is Not assigned.
Skipping this row
Postcode is M4P
Borough is Central Toronto
Neighborhood is Davisville North
The postcode is M4P, borough is Central Toronto, and neighborhood is Davisville North.
Postcode is M5P
Borough is Central Toronto
Neighborhood is Forest Hill North
The postcode is M5P, borough is Central Toronto, and neighborhood is Forest Hill North.
Postcode is M5P
Borough is Centr

Postcode is M9V
Borough is Etobicoke
Neighborhood is Thistletown
The postcode is M9V, borough is Etobicoke, and neighborhood is Thistletown.
Postcode is M1W
Borough is Scarborough
Neighborhood is L'Amoreaux West
The postcode is M1W, borough is Scarborough, and neighborhood is L'Amoreaux West.
Postcode is M2W
Borough is Not assigned
Neighborhood is Not assigned
The postcode is M2W, borough is Not assigned, and neighborhood is Not assigned.
Skipping this row
Postcode is M3W
Borough is Not assigned
Neighborhood is Not assigned
The postcode is M3W, borough is Not assigned, and neighborhood is Not assigned.
Skipping this row
Postcode is M4W
Borough is Downtown Toronto
Neighborhood is Rosedale
The postcode is M4W, borough is Downtown Toronto, and neighborhood is Rosedale.
Postcode is M5W
Borough is Downtown Toronto
Neighborhood is Stn A PO Boxes 25 The Esplanade
The postcode is M5W, borough is Downtown Toronto, and neighborhood is Stn A PO Boxes 25 The Esplanade.
Postcode is M6W
Borough is N

Show the top five rows of the dataframe.

In [10]:
neighborhoods.head(5)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Show the size of the dataframe.

In [11]:
neighborhoods.shape

(211, 3)

## Geo Locations

Install and import geocoder

In [12]:
# import geocoder
! pip install geocoder
import geocoder



Create a new dataframe with geo location columns

In [13]:
# create a new dataframe to load
column_names = ['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
hoods_ll = pd.DataFrame(columns=column_names)
hoods_ll.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude


Load data into the dataframe

In [14]:
# load data into dataframe
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for table_row in full_table.find_all('tr'):
    counter = 1
    postcode = 'NA'
    borough = 'NA'
    neighborhood = 'NA'
    for table_cell in table_row.find_all('td'):
        if counter == 1:
            postcode = table_cell.text
        elif counter == 2:
            borough = table_cell.text
        else:
            neighborhood = table_cell.text.strip('\n')
        counter = counter + 1
    if (postcode == 'NA') or (borough == 'Not assigned'):
        # header row (no <td> cells) or unassigned borough
        print('Skipping this row')
    elif neighborhood == 'Not assigned':
        # no neighborhood listed: reuse the borough name
        rows.append({'Postcode': postcode, 'Borough': borough, 'Neighborhood': borough})
    else:
        rows.append({'Postcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

# Latitude and Longitude are not in the rows yet, so they start out empty
hoods_ll = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'])
hoods_ll.head(5)

Skipping this row
...

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,,
1,M4A,North York,Victoria Village,,
2,M5A,Downtown Toronto,Harbourfront,,
3,M5A,Downtown Toronto,Regent Park,,
4,M6A,North York,Lawrence Heights,,


Calls to geocoder just hang, even when I look up a single postal code. Instead, I am loading the coordinates from the provided CSV file into a dataframe.
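
One way to keep a geocoding attempt from running indefinitely is to cap the number of retries instead of looping until a result arrives. The sketch below is a generic helper, not the approach used in this notebook. `lookup_with_retries` and its parameters are hypothetical, and the `geocoder.arcgis` call in the closing comment is an assumption about which provider one might try.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) up to max_attempts times; return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            result = lookup(query)
        except Exception:
            result = None  # treat provider errors as a failed attempt
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # brief pause before trying again
    return None


# Hypothetical usage with the geocoder package (provider choice is an assumption):
# import geocoder
# latlng = lookup_with_retries(lambda q: geocoder.arcgis(q).latlng, 'M5A, Toronto, Ontario')
```

A retry cap only helps when each individual call eventually returns; if a single request blocks forever, the provider call itself would also need a timeout.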

In [15]:
# Read data from CSV file into a dataframe
csv_path = 'http://cocl.us/Geospatial_data'
geo_df = pd.read_csv(csv_path)	
geo_df.head(5)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Match each row in the neighborhood dataframe to the row with the same postal code in the geolocation dataframe, then copy that row's latitude and longitude into the neighborhood row.

In [16]:
for i in hoods_ll.index:
    # postal code of the current neighborhood row
    postal_code = hoods_ll.loc[i, 'Postcode']

    # first geolocation row with the same postal code (raises IndexError if there is none)
    match = geo_df.loc[geo_df['Postal Code'] == postal_code].iloc[0]

    # copy the coordinates into the neighborhood row
    hoods_ll.at[i, 'Latitude'] = match['Latitude']
    hoods_ll.at[i, 'Longitude'] = match['Longitude']

hoods_ll.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7533,-79.3297
1,M4A,North York,Victoria Village,43.7259,-79.3156
2,M5A,Downtown Toronto,Harbourfront,43.6543,-79.3606
3,M5A,Downtown Toronto,Regent Park,43.6543,-79.3606
4,M6A,North York,Lawrence Heights,43.7185,-79.4648
5,M6A,North York,Lawrence Manor,43.7185,-79.4648
6,M7A,Queen's Park,Queen's Park,43.6623,-79.3895
7,M9A,Etobicoke,Islington Avenue,43.6679,-79.5322
8,M1B,Scarborough,Rouge,43.8067,-79.1944
9,M1B,Scarborough,Malvern,43.8067,-79.1944
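
The row-by-row lookup can also be expressed as a single left join on the postal code. This is a minimal sketch on small stand-in frames whose values are copied from the outputs above. `hoods_demo`, `geo_demo`, and `merged` are hypothetical names, and the neighborhood frame is assumed to start without the empty Latitude and Longitude columns so the join does not produce duplicate column names.

```python
import pandas as pd

hoods_demo = pd.DataFrame({
    'Postcode': ['M1B', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern'],
})
geo_demo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Left join keeps every neighborhood row, even if a postal code has no match
merged = hoods_demo.merge(geo_demo, how='left',
                          left_on='Postcode', right_on='Postal Code')
merged = merged.drop(columns='Postal Code')  # drop the duplicate key column
print(merged)
```

With a left join, a postal code that is missing from the geolocation data yields empty coordinates for that row rather than an error, which differs from the indexed lookup in the loop.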
