# Segmenting and Clustering Neighborhoods in Toronto

##### In this notebook, we will explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information.

We will obtain the neighbourhood information from this <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Wikipedia page</a>, which we will scrape using the Beautiful Soup library.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser") # Get Wikipedia page in HTML
neigh_table = soup.find("table") # Find the table we want

df_neigh = pd.read_html(str(neigh_table))[0] # Encode table as pandas dataframe

df_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
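
As a cross-check on the scraping step, the same kind of table can also be read cell by cell with Beautiful Soup alone. The sketch below runs on a small inline HTML string that stands in for the Wikipedia page (the real markup may be structured differently), and the helper name `table_to_dataframe` is hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline stand-in for the page; the real Wikipedia markup may differ.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Convert the first <table> in `html` into a DataFrame (hypothetical helper)."""
    table = BeautifulSoup(html, "html.parser").find("table")
    rows = table.find_all("tr")
    # The first row holds the column names, the remaining rows hold the data
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)

sample_df = table_to_dataframe(sample_html)
print(sample_df)
```

This route only needs the built-in `html.parser`, at the cost of handling headers and cells by hand.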


We will now clean the data by dropping the rows that have "Not assigned" as their borough.

In [3]:
df_neigh.drop(df_neigh[df_neigh['Borough'] == "Not assigned"].index, inplace = True)
df_neigh.reset_index(drop = True, inplace = True)

df_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
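
The same cleaning step can be written as a boolean-mask filter that returns a new dataframe instead of modifying the original in place. This is a minimal sketch on a toy frame; the names `toy` and `toy_clean` are illustrative, and the rows are copied from the tables above.

```python
import pandas as pd

# Toy rows copied from the scraped table
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows whose borough is assigned, then renumber the index from 0
toy_clean = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(toy_clean)
```

Leaving `toy` untouched makes it easy to re-run the cell without first re-scraping the page.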


In [4]:
df_neigh.shape # Find shape of the dataframe

(103, 3)

We now have a table of Toronto neighbourhoods with 103 entries, giving the borough and neighbourhood information for each postal code.

Next, we want to find the latitude and longitude of each postal code. We will use Geocoder to do this.

In [5]:
import geocoder

In [6]:
# Initialize latitude and longitude columns with None (one entry per row)
df_neigh["Latitude"] = [None] * len(df_neigh)
df_neigh["Longitude"] = [None] * len(df_neigh)
df_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,,
1,M4A,North York,Victoria Village,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",,
3,M6A,North York,"Lawrence Manor, Lawrence Heights",,
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",,
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",,
99,M4Y,Downtown Toronto,Church and Wellesley,,
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",,
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",,


This would be the Geocoder code that we could use if the requests were not denied:
```
# Obtain coordinates for each postal code, retrying until the geocoder answers
for post_code in df_neigh['Postal Code']:
    coordinates = None

    # g.latlng is empty when the request is denied, so keep asking
    while not coordinates:
        g = geocoder.google('{}, Toronto, Ontario'.format(post_code))
        coordinates = g.latlng

    # Assign through .loc so df_neigh itself is updated, not a temporary copy
    df_neigh.loc[df_neigh['Postal Code'] == post_code, 'Latitude'] = coordinates[0]
    df_neigh.loc[df_neigh['Postal Code'] == post_code, 'Longitude'] = coordinates[1]
```
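
A retry loop like the one above never ends while every request is denied. One possible guard is to cap the number of attempts. The sketch below is an assumption, not part of the geocoder package: `geocode_with_retry` is a hypothetical helper, and `fake_lookup` is a stand-in that simulates two denied requests so the example runs without network access.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call `lookup(query)` until it returns coordinates or attempts run out.

    `lookup` is any callable returning [lat, lng], or None/[] on failure.
    Hypothetical helper, not part of the geocoder package.
    """
    for _ in range(max_attempts):
        result = lookup(query)
        if result:
            return result
        time.sleep(delay_seconds)  # pause before the next attempt
    return None

# Stand-in lookup that fails twice before succeeding (no network access)
calls = {"count": 0}

def fake_lookup(query):
    calls["count"] += 1
    return [43.7533, -79.3297] if calls["count"] >= 3 else None

found = geocode_with_retry(fake_lookup, "M3A, Toronto, Ontario")
missing = geocode_with_retry(lambda q: None, "M3A, Toronto, Ontario", max_attempts=2)
print(found, missing)
```

With a real service, `lookup` could wrap the geocoder call, and a `None` result would signal that a fallback such as the CSV file is needed.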

Since Geocoder is not working, we will import the coordinates from a CSV file instead.

In [8]:
coordinates = pd.read_csv("Geospatial_Coordinates.csv")

coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [30]:
# Copy the coordinates of every postal code into df_neigh
for post_code in df_neigh['Postal Code']:
    df_neigh.loc[df_neigh['Postal Code'] == post_code, 'Latitude'] = coordinates.loc[coordinates['Postal Code'] == post_code, 'Latitude'].iloc[0]
    df_neigh.loc[df_neigh['Postal Code'] == post_code, 'Longitude'] = coordinates.loc[coordinates['Postal Code'] == post_code, 'Longitude'].iloc[0]
    
df_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7533,-79.3297
1,M4A,North York,Victoria Village,43.7259,-79.3156
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7185,-79.4648
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623,-79.3895
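
The row-by-row lookup can also be expressed as a single join on the postal code. The following is a minimal sketch on toy frames (`neigh`, `coords`, and `merged` are illustrative names; the values are taken from the outputs above), assuming each postal code appears at most once in the coordinates table.

```python
import pandas as pd

neigh = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})
# Deliberately in a different row order to show that the join matches on the key
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.7259, 43.7533],
    "Longitude": [-79.3156, -79.3297],
})

# A left join keeps every neighbourhood row, even if a postal code has no coordinates
merged = neigh.merge(coords, on="Postal Code", how="left")
print(merged)
```

Unlike the loop, the join leaves missing coordinates as `NaN` rather than raising an error, and it keeps the latitude and longitude columns as floats.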
