# Segmenting and Clustering Neighborhoods in Toronto

In this notebook we will explore, segment, and cluster the neighborhoods in the city of Toronto. The neighborhood data is not readily available in a structured format, but a Wikipedia page lists all the information we need. We will scrape that page, clean the data, and read it into a pandas dataframe so that it is in a structured format.

Let's start by importing the required libraries. _requests_ will be used to retrieve the Wikipedia page, and _BeautifulSoup_ to parse it.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Retrieve the page with the data, and verify that the request succeeded (status code 200 = OK).

In [2]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
page.status_code

200
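
Checking the status code by eye works in a notebook, but the same check can be made automatic. The following is a minimal sketch, not part of the original notebook: `fetch_page` and its `session` parameter are hypothetical names introduced here, and the 10-second timeout is an arbitrary assumption.

```python
import requests

def fetch_page(url, timeout=10.0, session=None):
    """Return the page body as bytes, raising on network or HTTP errors.

    `session` is a hypothetical hook: any object with a requests-style
    `get` method can be passed in; by default the requests module is used.
    """
    http = session or requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

With this helper, a failed download stops the notebook at the request step instead of surfacing later as a parsing error.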

Now we use the _BeautifulSoup_ library to parse the downloaded page, and verify that the page title is correct.

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')
print("Page Title: '{}'".format(soup.title.get_text()))
neigh_table = soup.table

Page Title: 'List of postal codes of Canada: M - Wikipedia'


Now extract the data from the page, which is held in the table of postal codes. We loop over all the rows and columns to obtain the data. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

In [4]:
#columns = [col.get_text().strip('\n') for col in neigh_table.tbody.find_all('th')]
columns = ['PostalCode', 'Borough', 'Neighborhood']
rows = [[col.get_text().strip('\n') for col in row.find_all('td')] for row in neigh_table.tbody.find_all('tr')]
rows[0:5]

[[],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village']]
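
The empty first entry in `rows` comes from the header row, whose cells are `th` elements rather than `td`. A minimal sketch on a made-up HTML fragment (the `html` string below is illustrative, not the actual Wikipedia markup) reproduces the effect and shows one way to skip such rows:

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the layout of the postal code table.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods </td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").table

# The header row holds only <th> cells, so find_all("td") is empty for it.
all_rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]

# Keep only the rows that actually contain data cells.
data_rows = [row for row in all_rows if row]

print(all_rows[0])
print(data_rows)
```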

Then create a new dataframe. The first row is empty because the header row contains no `td` cells, so we keep only the non-empty rows.

In [6]:
df = pd.DataFrame([row for row in rows if len(row) > 0], columns = columns)
print("Number of neighborhoods: ", df.shape[0])
df.head()

Number of neighborhoods:  289


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


We need to do some processing to obtain the final dataframe. First, drop the rows whose borough is **Not assigned**.

In [7]:
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)

print("Number of neighborhoods: ", df.shape[0])
df.head()

Number of neighborhoods:  212


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


If a row has a borough but a **Not assigned** neighborhood, the neighborhood is set to the name of the borough.

In [8]:
neighborhood_na_mask = df['Neighborhood'] == 'Not assigned'
df.loc[neighborhood_na_mask, 'Neighborhood'] = df.loc[neighborhood_na_mask, 'Borough']
df[neighborhood_na_mask]

Unnamed: 0,PostalCode,Borough,Neighborhood
6,M7A,Queen's Park,Queen's Park
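
The two cleaning steps so far can be tried in isolation on a few rows. This is only a sketch with illustrative values; `toy` is a name introduced here, and `Series.mask` is used as an alternative to the `.loc` assignment above, not as the notebook's method.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative).
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Step 1: drop the rows that have no borough.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Step 2: where the neighborhood is missing, fall back to the borough name.
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)

print(toy)
```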


More than one neighborhood can exist in one postal code area. Rows that share a postal code will be combined into one row, with the neighborhoods separated by commas.

In [10]:
neighborhoods = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
print("Number of neighborhoods: ", neighborhoods.shape[0])
neighborhoods.head()

Number of neighborhoods:  103


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
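
To see what the grouping does, here is a small sketch on illustrative rows (the `toy` and `combined` names are introduced here). Rows sharing a postal code and borough collapse into one row whose neighborhood names are joined with `', '`:

```python
import pandas as pd

# Illustrative rows: two postal codes with two neighborhoods, one with a single one.
toy = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M6A", "M6A", "M7A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto",
                    "North York", "North York", "Queen's Park"],
        "Neighborhood": ["Harbourfront", "Regent Park",
                         "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
    }
)

# Group by postal code and borough, then join the names within each group.
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```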


Finally, we display the list of boroughs and the number of neighborhoods (grouped postal code areas) in each borough.

In [12]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)
neighborhoods['Borough'].value_counts().sort_index()

The dataframe has 11 boroughs and 103 neighborhoods.


Central Toronto      9
Downtown Toronto    18
East Toronto         5
East York            5
Etobicoke           12
Mississauga          1
North York          24
Queen's Park         1
Scarborough         17
West Toronto         6
York                 5
Name: Borough, dtype: int64
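
After the grouping step each row is a postal code area, so these counts are postal code areas per borough rather than individual neighborhood names. As a rough sketch on illustrative data (all names below are introduced here), the individual names could be counted by splitting the comma-separated strings again:

```python
import pandas as pd

# Illustrative grouped rows, in the same shape as the grouped dataframe.
toy = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1G", "M5A"],
        "Borough": ["Scarborough", "Scarborough", "Downtown Toronto"],
        "Neighborhood": ["Rouge, Malvern", "Woburn", "Harbourfront, Regent Park"],
    }
)

# Rows per borough = postal code areas per borough.
areas_per_borough = toy["Borough"].value_counts().sort_index()

# Split the joined names back into lists and count the individual names.
names = toy["Neighborhood"].str.split(", ")
names_per_borough = names.str.len().groupby(toy["Borough"]).sum()

print(areas_per_borough)
print(names_per_borough)
```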

We have now built a dataframe with the postal code, borough name, and neighborhood name of each neighborhood. To use the Foursquare location data, we also need the latitude and longitude coordinates of each neighborhood.

To get these coordinates, we can use the _geocoder_ package. However, in my tests it always returned no results, so I was unable to make it work.

In [14]:
# install the geocoder package (comment this line out if it is already installed)
!conda install -c conda-forge geocoder --yes
import geocoder

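def postalCodeLatLong(postal_code, max_attempts=10):
    # max_attempts is an added safeguard (the value 10 is arbitrary)
    # initialize the result to None
    lat_lng_coords = None

    # retry until we get the coordinates, but give up after max_attempts
    # so the loop cannot run forever when the service returns no result
    attempts = 0
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        attempts += 1

    # no coordinates were found for this postal code
    if lat_lng_coords is None:
        return (None, None)

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return (latitude, longitude)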

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geocoder:   1.38.1-py_0  conda-forge
    orderedset: 2.0-py35_0   conda-forge
    ratelim:    0.1.6-py35_0 conda-forge

orderedset-2.0 100% |################################| Time: 0:00:00  60.24 MB/s
ratelim-0.1.6- 100% |################################| Time: 0:00:00  12.72 MB/s
geocoder-1.38. 100% |################################| Time: 0:00:00  48.17 MB/s


In [15]:
g = geocoder.google('{}, Toronto'.format('M5A'))
print(g.latlng)

None


Since we were not able to get the geographical coordinates of the neighborhoods using the geocoder package, we instead obtain them from a CSV file that lists the geographical coordinates of each postal code.

So, first we read the data from a URL into a new dataframe.

In [16]:
#coordinates = pd.read_csv('Geospatial_Coordinates.csv')
coordinates = pd.read_csv('http://cocl.us/Geospatial_data')
coordinates = coordinates.rename(columns={'Postal Code': 'PostalCode'})
coordinates.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Then merge the previous dataframe with this one on the **PostalCode** column to obtain the final dataframe of neighborhoods with their geographical coordinates.

In [22]:
neighborhoods_coordinates = neighborhoods.merge(coordinates, on='PostalCode')
neighborhoods_coordinates

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [20]:
neighborhoods_coordinates.shape

(103, 5)
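
The shape confirms that all 103 rows survived the merge. Because `merge` performs an inner join by default, a postal code missing from the coordinates file would be dropped silently. A hedged sketch of a stricter check on illustrative data (the `areas`, `coords`, and `checked` names and the `M9Z` code are made up for this example):

```python
import pandas as pd

areas = pd.DataFrame(
    {"PostalCode": ["M1B", "M1C", "M9Z"],
     "Borough": ["Scarborough", "Scarborough", "Nowhere"]}
)
coords = pd.DataFrame(
    {"PostalCode": ["M1B", "M1C"],
     "Latitude": [43.806686, 43.784535],
     "Longitude": [-79.194353, -79.160497]}
)

# A left join keeps every area; validate raises MergeError if a postal
# code appears more than once on either side.
checked = areas.merge(coords, on="PostalCode", how="left",
                      validate="one_to_one", indicator=True)

# Rows marked "left_only" have no coordinates and would be lost in an inner join.
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

print(missing)
```

An empty `missing` list means the inner join used above loses nothing.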