<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

# Part 1
## Create Initial Dataframe with Toronto Postal Codes, Boroughs, and Neighborhoods
Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain its table of postal codes, then transform that table into a pandas DataFrame.

Download Dependencies

In [16]:
import pandas as pd # library for data analysis

# !conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup

import requests # library to handle requests

Read the wiki page into a string and parse it.

In [17]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_file = requests.get(url).text
soup = BeautifulSoup(html_file, 'lxml')

Drill down into the table to extract the headings and rows, then create an empty DataFrame that uses the headings as column names.

In [18]:
table = soup.find('table', class_='wikitable sortable') # Get the Postcode/Borough/Neighbourhood table.
headings = table.find_all('th') # Extract the 3 column headings.
rows = table.find_all('tr') # Get all rows in the table
# Create a DataFrame with the column headings, removing the newline character from the 3rd heading.
df = pd.DataFrame(columns=[headings[0].text, headings[1].text, headings[2].text.split('\n')[0]])
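The table lookup and heading extraction above can be tried offline on a small snippet. The following is a minimal sketch, assuming a hypothetical three-row miniature of the Wikipedia table (`SAMPLE_HTML`) and the parser `html.parser` from the standard library instead of `lxml`; the `sample_*` names are illustrative and do not appear in the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_table = sample_soup.find('table', class_='wikitable sortable')

# get_text(strip=True) drops surrounding whitespace, including the
# trailing newline that the last cell of each row carries.
sample_headings = [th.get_text(strip=True) for th in sample_table.find_all('th')]
sample_rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in sample_table.find_all('tr')[1:]  # skip the header row
]

print(sample_headings)
print(sample_rows)
```

Stripping whitespace with `get_text(strip=True)` is one alternative to `split('\n')[0]`; both remove the trailing newline, but the former also handles stray leading spaces.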

Loop through all rows to build the DataFrame one row at a time.

In [19]:
for row in rows[1:]: # Skip the 1st (header) row.
    # Get the Postcode, Borough, and Neighbourhood for the current row,
    # removing the trailing newline from the neighborhood.
    columns = row.find_all('td')
    postcode = columns[0].text
    borough = columns[1].text
    neighborhood = columns[2].text.split('\n')[0]
    if borough != 'Not assigned': # Skip any rows without a borough.
        if neighborhood == 'Not assigned': # Unassigned neighborhoods take on the name of their borough.
            neighborhood = borough
        if postcode in df['Postcode'].values:
            # Group neighborhoods within the same postcode into a single postcode row.
            # Assumption: A postcode includes only one borough
            # (though boroughs may span multiple postcodes).
            df.loc[df['Postcode'] == postcode, 'Neighbourhood'] += ", " + neighborhood
        else: # Add row for new postcode, borough, and neighborhood.
            # DataFrame.append was removed in pandas 2.0; assign the new row by label instead.
            df.loc[len(df)] = [postcode, borough, neighborhood]
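The same three rules from the loop above (skip rows without a borough, fall back to the borough name, merge neighbourhoods that share a postcode) can also be written without a row-by-row loop. This is a sketch of one possible vectorized alternative, not the notebook's method; the `raw` DataFrame holds hypothetical sample rows in the same three-column layout as the scraped table.

```python
import pandas as pd

# Hypothetical raw rows, laid out like the scraped Wikipedia table.
raw = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M6A', 'North York', 'Lawrence Heights'],
        ['M6A', 'North York', 'Lawrence Manor'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['Postcode', 'Borough', 'Neighbourhood'],
)

# 1. Skip any rows without a borough.
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Unassigned neighbourhoods take on the name of their borough.
missing = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[missing, 'Neighbourhood'] = assigned.loc[missing, 'Borough']

# 3. Join the neighbourhoods of each postcode into one comma-separated string.
#    sort=False keeps the postcodes in their order of first appearance.
grouped = (
    assigned.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)

print(grouped)
```

Grouping on both `Postcode` and `Borough` relies on the same assumption as the loop: a postcode belongs to exactly one borough. If that assumption failed, this version would produce one row per postcode and borough pair rather than one row per postcode.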

In [20]:
df.head(12) # Display the first 12 rows of the DataFrame

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Downtown Toronto | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


In [21]:
df.shape

(103, 3)

# Part 2
## Get the latitude and longitude coordinates of each neighborhood
Add Latitude and Longitude columns to the DataFrame.

Download dependencies

In [22]:
# !conda install -c conda-forge geocoder --yes
import geocoder # library to convert postal codes into latitude and longitude

Loop through each postal code to get its latitude and longitude, then add both lists to the DataFrame as new columns.

In [24]:
latitudes = []
longitudes = []
for postal_code in df['Postcode']:
    # using geocoder.arcgis rather than geocoder.google as the latter did not work
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))

    latitudes.append(g.latlng[0])
    longitudes.append(g.latlng[1])

df['Latitude'] = latitudes
df['Longitude'] = longitudes
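The loop above indexes `g.latlng` directly, which raises an error if a lookup returns no coordinates. The following is a minimal sketch of a retry wrapper, assuming a failed lookup yields `None` or an empty list. The helper `geocode_with_retry`, its parameters, and the stand-in `flaky_geocode` function are hypothetical and are not part of the `geocoder` library.

```python
import time


def geocode_with_retry(geocode, query, attempts=3, delay=0.0):
    """Call geocode(query) until it returns coordinates or attempts run out.

    geocode is any callable that returns a [lat, lng] pair on success and
    None or an empty list on failure (an assumption of this sketch).
    Returns (lat, lng), or (None, None) if every attempt fails.
    """
    for attempt in range(attempts):
        latlng = geocode(query)
        if latlng:  # truthy only for a non-empty coordinate pair
            return latlng[0], latlng[1]
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before the next attempt
    return None, None


# Stand-in geocoder for illustration: fails on the first call, then succeeds.
calls = {'n': 0}


def flaky_geocode(query):
    calls['n'] += 1
    return None if calls['n'] == 1 else [43.75242, -79.329242]


lat, lng = geocode_with_retry(flaky_geocode, 'M3A, Toronto, Ontario')
print(lat, lng)
```

In the notebook's loop, the callable could be something like `lambda q: geocoder.arcgis(q).latlng`, with the returned pair appended to `latitudes` and `longitudes`. Postal codes that still fail would then appear as missing values instead of stopping the loop.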

In [25]:
df.head(12) # Display the first 12 rows of the DataFrame

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.75242 | -79.329242 |
| 1 | M4A | North York | Victoria Village | 43.7306 | -79.313265 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.650295 | -79.359166 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.72327 | -79.451286 |
| 4 | M7A | Downtown Toronto | Queen's Park | 43.66115 | -79.391715 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.662299 | -79.528195 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.811525 | -79.195517 |
| 7 | M3B | North York | Don Mills North | 43.749055 | -79.362227 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.707535 | -79.311773 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657363 | -79.37818 |


In [26]:
df.shape

(103, 5)