
# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we explore and cluster the neighborhoods in Toronto, using simple pandas code to scrape a [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The page contains three tables; we analyze the first one, which lists the Postal Code, Borough, and Neighbourhood.


### Import necessary Libraries

In [1]:
import pandas as pd # library for data analysis

### Scrape the neighborhoods table

In [2]:
# Define the URL for the source data
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Read every HTML table on the page into a list of DataFrame objects
tables = pd.read_html(url)

# Get the number of tables scraped from the Wikipedia page
print(len(tables))

3


In [3]:
# Select the neighborhoods table (the first table on the page)
df = tables[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Cleaning and arranging the dataset

As the first two rows of the table above show, the DataFrame contains rows whose borough and neighborhood are not assigned. We therefore exclude every row with an unassigned borough.

In [4]:
# Exclude unassigned boroughs and reset the index of the DataFrame
df = df[df['Borough']!='Not assigned'].reset_index(drop=True)

# Rename the postal code column so that its name contains no space
df = df.rename(columns={"Postal Code": "PostalCode"})

df.head(10)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
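
The filter above drops rows by borough only. As a minimal sketch on a small hand-made sample (the rows, the postal code `M9Z`, and the names `sample` and `cleaned` are illustrative assumptions, not scraped data), the same step can be followed by a fallback for a case the live table may or may not contain: an assigned borough whose neighbourhood is still `Not assigned`. Here the borough name is reused as the neighbourhood, which is one possible convention rather than something the source prescribes.

```python
import pandas as pd

# Hand-made sample; "M9Z" is a hypothetical postal code used only for illustration
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront", "Not assigned"],
})

# Same filter as in the notebook: keep only rows with an assigned borough
cleaned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# Assumed fallback: reuse the borough name where the neighbourhood is unassigned
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# Sanity checks: no unassigned boroughs remain and postal codes are unique
print((cleaned["Borough"] != "Not assigned").all())
print(cleaned["PostalCode"].is_unique)
print(cleaned)
```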


### Get the shape of the DataFrame

In [5]:
df.shape

(103, 3)

### Linking a [CSV file](http://cocl.us/Geospatial_data) that has the geographical coordinates of each postal code

Explore the geographical DataFrame.

In [6]:
geo_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_df = geo_df.rename(columns={"Postal Code": "PostalCode"})
geo_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
geo_df.shape

(103, 3)

Merge the borough and geographical-coordinate DataFrames on the postal code.

In [8]:
geo_data = pd.merge(df, geo_df, on='PostalCode')
geo_data.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
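
`pd.merge` performs an inner join by default, so a postal code missing from either DataFrame would be dropped without any warning. The following sketch shows one way to make such gaps visible, using the `how`, `indicator`, and `validate` parameters of `pd.merge`. The two small frames (`boroughs`, `coords`) are hand-made for illustration, and `M5A` is deliberately left out of `coords` to trigger the check.

```python
import pandas as pd

# Illustrative frames; the coordinates for M5A are omitted on purpose
boroughs = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every borough row; indicator=True adds a "_merge" column,
# and validate raises if a postal code is duplicated on either side
checked = pd.merge(
    boroughs, coords, on="PostalCode",
    how="left", indicator=True, validate="one_to_one",
)

# Postal codes that found no coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

In the notebook both DataFrames have 103 rows, so an empty `unmatched` list there would indicate that every postal code received coordinates.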


## Explore and cluster the neighborhoods in Toronto

#### Import necessary Libraries for the geographic analysis

In [9]:
# import library to handle data in a vectorized manner
import numpy as np 

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# map rendering library
import folium 

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


### Select the rows whose Borough contains "Toronto"

In [10]:
toronto_data = geo_data[geo_data['Borough'].str.contains('Toronto',regex=False)].reset_index(drop=True)
toronto_data.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


#### Create a map of Toronto, Canada with neighborhoods superimposed on top.

In [11]:
# create map of Toronto using latitude and longitude values
map_torontoCa = folium.Map(location=[43.651070, -79.347015], zoom_start=10.5)

# add one marker per neighbourhood
for lat, lng, borough, neighbourhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_torontoCa)

# add a single large reference circle around downtown, once, outside the loop
folium.CircleMarker(
    location=[43.6534817, -79.3839347],
    radius=150,
    popup='Downtown Toronto',
    color='chartreuse',
    fill=False).add_to(map_torontoCa)

map_torontoCa

### Using k-means clustering to cluster the neighbourhoods

Run _k_-means to cluster the neighborhoods into 4 clusters, using only the latitude and longitude of each postal code.



In [12]:
# set number of clusters
kclusters = 4

# keep only the numeric coordinate columns for clustering
toronto_clustering = toronto_data.drop(columns=['PostalCode', 'Borough', 'Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# check the cluster labels generated for the first 20 rows of the dataframe
kmeans.labels_[0:20]

array([3, 3, 3, 3, 0, 3, 3, 1, 3, 1, 3, 1, 0, 3, 1, 0, 3, 0, 2, 2])
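
The notebook fixes the number of clusters at 4 without showing how that value compares with alternatives. One common, informal check is to fit k-means for several values of k and compare the inertia (the within-cluster sum of squared distances). The sketch below does this on the ten Toronto coordinates listed earlier; the range of k values and the explicit `n_init=10` are assumptions made for illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude of the first ten Toronto rows shown in the table above
points = np.array([
    [43.654260, -79.360636],
    [43.662301, -79.389494],
    [43.657162, -79.378937],
    [43.651494, -79.375418],
    [43.676357, -79.293031],
    [43.644771, -79.373306],
    [43.657952, -79.387383],
    [43.669542, -79.422564],
    [43.650571, -79.384568],
    [43.669005, -79.442259],
])

# Fit k-means for k = 1..5 and record the inertia of each fit
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, f"{value:.6f}")
```

A value of k after which the inertia stops dropping sharply is a reasonable candidate. Note that clustering raw degrees treats latitude and longitude as equally scaled, which is only an approximation because a degree of longitude covers less ground than a degree of latitude at Toronto's latitude.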

Now update the DataFrame to include the cluster labels.

In [13]:
# add clustering labels
toronto_data.insert(5, 'Cluster Labels', kmeans.labels_)
toronto_data.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,3
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,3
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,3
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,3
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0
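
Once the labels are attached, a grouped summary gives a quick view of how many neighbourhoods fall into each cluster and where each cluster is centred. This is a small sketch on the five labelled rows shown above; the names `labelled` and `summary` are illustrative, and in the notebook the same `groupby` could be applied to `toronto_data`.

```python
import pandas as pd

# The five labelled rows from the output above
labelled = pd.DataFrame({
    "Neighbourhood": [
        "Regent Park, Harbourfront",
        "Queen's Park, Ontario Provincial Government",
        "Garden District, Ryerson",
        "St. James Town",
        "The Beaches",
    ],
    "Latitude": [43.654260, 43.662301, 43.657162, 43.651494, 43.676357],
    "Longitude": [-79.360636, -79.389494, -79.378937, -79.375418, -79.293031],
    "Cluster Labels": [3, 3, 3, 3, 0],
})

# Count the rows per cluster and average their coordinates
summary = labelled.groupby("Cluster Labels").agg(
    neighbourhoods=("Neighbourhood", "count"),
    mean_latitude=("Latitude", "mean"),
    mean_longitude=("Longitude", "mean"),
)
print(summary)
```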



### Final visualization of the Toronto neighborhood clusters

In [14]:
# create map
map_clusters_toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# set color scheme: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per neighbourhood, colored by its cluster label
for lat, lon, neighbourhood, cluster in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood'], toronto_data['Cluster Labels']):
    label = folium.Popup('{} - Cluster {}'.format(neighbourhood, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters_toronto)

# add a single large reference circle around downtown, once, outside the loop
folium.CircleMarker(
    location=[43.6534817, -79.3839347],
    radius=150,
    popup='Downtown Toronto',
    color='chartreuse',
    fill=False).add_to(map_clusters_toronto)

map_clusters_toronto