# Toronto Neighborhood classification
We will explore, segment, and cluster the neighborhoods in the city of Toronto. This notebook is part of the IBM Applied Data Science Capstone.
For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. We will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe.
Once the data is in a structured format, we can apply the required analysis to explore and cluster the neighborhoods in the city of Toronto.
In this notebook we first create the dataframe, then add coordinates to it and run a simple clustering on the Toronto boroughs.

First we will import the required libraries.

In [253]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

# function for transforming JSON data into a pandas dataframe
from pandas import json_normalize

import folium # plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from pandas.io.html import read_html
import requests
from bs4 import BeautifulSoup


print('Libraries imported.')

Libraries imported.


The dataframe will consist of three columns: Postcode, Borough, and Neighbourhood.
We will use the BeautifulSoup package to transform the data in the table on the Wikipedia page into the above pandas dataframe.

In [None]:


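# Wikipedia page listing the postal codes that start with M (Toronto)
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# download the page and parse the HTML
res = requests.get(page)
soup = BeautifulSoup(res.content, 'lxml')

# take the first table on the page and let pandas parse it
table = soup.find_all('table')[0]
dfs = pd.read_html(str(table))

# use the first row as the header, then drop that row from the data
df = dfs[0]
df.columns = df.iloc[0]
df.drop([df.index[0]], inplace=True, axis=0, errors='ignore')

df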

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [None]:
df = df.dropna()
df = df.drop(df[df['Borough'] == "Not assigned"].index)


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.


In [None]:
df.loc[df['Neighbourhood']=="Not assigned","Neighbourhood"]=df.loc[df['Neighbourhood']=="Not assigned","Borough"]
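As a quick check of the two cleaning rules above, here is a minimal sketch on a hand-made dataframe. The rows in `toy` are invented for illustration and are not taken from the Wikipedia table; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    "Postcode": ["X1A", "X2A", "X3A"],
    "Borough": ["Not assigned", "Alpha", "Beta"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Gamma"],
})

# Rule 1: drop rows whose borough is "Not assigned"
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighbourhood is "Not assigned", reuse the borough name
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]

print(toy)
```

Storing the boolean mask in a variable keeps the assignment readable and avoids evaluating the same condition twice.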

More than one neighborhood can exist in one postal code area. These rows will be combined into one row with the neighborhoods separated with a comma.

In [None]:
df_grouped = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df_grouped
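To see what the grouping step does, the following sketch applies the same `groupby` and `','.join` pattern to a small made-up dataframe. The postal codes and names in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical rows: postal code X1A has two neighbourhoods
toy = pd.DataFrame({
    "Postcode": ["X1A", "X1A", "X2A"],
    "Borough": ["Alpha", "Alpha", "Beta"],
    "Neighbourhood": ["North", "South", "East"],
})

# One row per (Postcode, Borough); neighbourhood names joined with a comma
toy_grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(",".join)
    .reset_index()
)

print(toy_grouped)
```

The two X1A rows collapse into a single row, so the grouped dataframe has fewer rows than the input.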

Print the shape of the cleaned dataframe df (before grouping) to see its number of rows.

In [252]:
df.shape

(212, 3)

We now have a dataframe with the postal code, borough name, and neighborhood name. To use the Foursquare location data, we also need the latitude and longitude coordinates of each postal code.

In [255]:
df_postal = pd.read_csv("https://cocl.us/Geospatial_data")
df_postal.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Then we will merge the two dataframes to get a dataframe with the columns [Postcode, Borough, Neighbourhood, Latitude, Longitude].

In [259]:
df_merged = pd.merge(df_grouped, df_postal, how='left', left_on=['Postcode'], right_on=['Postal Code'])

In [260]:
df_merged.drop(columns=['Postal Code'], inplace=True)
df_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
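A left merge keeps every postal code from the left dataframe even when no coordinates are found, in which case Latitude and Longitude are NaN. One possible way to check for such rows is the `indicator` option of `pd.merge`, sketched here on made-up data. The frames `left` and `right` are hypothetical stand-ins for `df_grouped` and `df_postal`.

```python
import pandas as pd

# Hypothetical stand-ins for df_grouped and df_postal
left = pd.DataFrame({"Postcode": ["X1A", "X2A"], "Borough": ["Alpha", "Beta"]})
right = pd.DataFrame({"Postal Code": ["X1A"], "Latitude": [1.0], "Longitude": [-1.0]})

# indicator=True adds a _merge column: "both" or "left_only" for a left merge
checked = pd.merge(
    left, right,
    how="left",
    left_on="Postcode",
    right_on="Postal Code",
    indicator=True,
)

# postal codes that found no coordinates
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list on the real data would suggest that every postal code received coordinates.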


In [224]:
print(df_merged.head(5))
print(df_merged['Borough'].unique())
# Postcode holds codes such as 'M1B', never 'Toronto', so this filter returns an empty dataframe
d = df_merged[df_merged['Postcode'] == 'Toronto']
d

  Postcode      Borough                         Neighbourhood   Latitude  \
0      M1B  Scarborough                         Rouge,Malvern  43.806686   
1      M1C  Scarborough  Highland Creek,Rouge Hill,Port Union  43.784535   
2      M1E  Scarborough       Guildwood,Morningside,West Hill  43.763573   
3      M1G  Scarborough                                Woburn  43.770992   
4      M1H  Scarborough                             Cedarbrae  43.773136   

   Longitude  
0 -79.194353  
1 -79.160497  
2 -79.188711  
3 -79.216917  
4 -79.239476  
['Scarborough' 'North York' 'East York' 'East Toronto' 'Central Toronto'
 'Downtown Toronto' 'York' 'West Toronto' "Queen's Park" 'Mississauga'
 'Etobicoke']


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude


Then we will explore and cluster the neighborhoods in Toronto. We will work with only boroughs that contain the word Toronto.

In [261]:
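# boolean mask: True where the borough name contains "Toronto"
t = df_merged['Borough'].str.contains("Toronto")
# copy so that adding columns later does not write into a view of df_merged
df_merged1 = df_merged.loc[t, :].copy()
df_merged1.head(5)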

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
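The sketch below applies the same `str.contains` filter to the list of unique borough names printed earlier, to show which boroughs are kept. The variable names `boroughs` and `kept` are introduced here for illustration.

```python
import pandas as pd

# Borough names as printed by df_merged['Borough'].unique() above
boroughs = pd.Series([
    "Scarborough", "North York", "East York", "East Toronto",
    "Central Toronto", "Downtown Toronto", "York", "West Toronto",
    "Queen's Park", "Mississauga", "Etobicoke",
])

# keep only the names that contain the substring "Toronto"
kept = boroughs[boroughs.str.contains("Toronto")].tolist()
print(kept)
```

Four boroughs pass the filter, while names such as North York or Etobicoke are dropped because the match is on the literal substring.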


We will one-hot encode the Borough field of the dataframe and use the result to show which part of Toronto each Neighbourhood is located in.

In [262]:
Toronto_onehot = pd.get_dummies(df_merged1[['Borough']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighbourhood'] = df_merged1['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

Toronto_onehot.head(5)

Unnamed: 0,Neighbourhood,Central Toronto,Downtown Toronto,East Toronto,West Toronto
37,The Beaches,0,0,1,0
41,"The Danforth West,Riverdale",0,0,1,0
42,"The Beaches West,India Bazaar",0,0,1,0
43,Studio District,0,0,1,0
44,Lawrence Park,1,0,0,0
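For readers new to one-hot encoding, this small sketch shows the two properties the table above relies on: each row has exactly one 1, and the column holding that 1 is the original borough. The dataframe `toy` is made up, and `dtype=int` is passed only so that the output shows 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Hypothetical neighbourhoods A, B, C in two boroughs
toy = pd.DataFrame({
    "Borough": ["East Toronto", "Central Toronto", "East Toronto"],
    "Neighbourhood": ["A", "B", "C"],
})

# one column per distinct borough, in alphabetical order
onehot = pd.get_dummies(toy[["Borough"]], prefix="", prefix_sep="", dtype=int)

# the column with the largest value in each row is the original borough
recovered = onehot.idxmax(axis=1)

print(onehot)
print(recovered.tolist())
```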


Then we will cluster the neighborhoods of Toronto.
We chose 4 clusters, which matches the number of borough columns in the one-hot table.

In [263]:
kclusters = 4

Toronto_grouped_clustering = Toronto_onehot.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 2, 2, 2, 2, 2, 2])
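Because the only features are one-hot borough columns, rows from the same borough are identical points, so k-means with as many clusters as boroughs can be expected to give each borough its own cluster. The sketch below illustrates this on made-up categories `a` to `d`. The data and the explicit `n_init=10` setting are assumptions for the example, not values from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical category per row, with repeats
categories = ["a", "a", "b", "b", "b", "c", "d", "d"]
names = sorted(set(categories))

# build the one-hot matrix by hand: one column per category
X = np.array([[int(c == n) for n in names] for c in categories])

km = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X)

# distinct (category, cluster label) pairs
pairs = set(zip(categories, km.labels_.tolist()))
print(sorted(pairs))
print(km.inertia_)
```

If every category maps to exactly one label and the inertia is zero, the clustering has only rediscovered the categories. Adding features beyond the borough, such as venue data, would be needed for clusters that carry new information.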

We will add the cluster field to the dataframe.

In [266]:
df_merged1['Cluster Labels'] = kmeans.labels_
df_merged1 = df_merged1.merge(Toronto_onehot, on='Neighbourhood')
df_merged1.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,Central Toronto,Downtown Toronto,East Toronto,West Toronto
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,0,0,1,0
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,0,0,0,1,0
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,0,0,0,1,0
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,0,0,1,0
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,1,0,0,0


Then we create the map and assign one color per cluster.

In [267]:
# create map centered on Toronto
map_clusters = folium.Map(location=[43.653963, -79.387207], zoom_start=11)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_merged1['Latitude'], df_merged1['Longitude'], df_merged1['Neighbourhood'], df_merged1['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters - 1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
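The color scheme can be checked without drawing a map. This sketch builds one hex color per cluster label from matplotlib's rainbow colormap, as in the cell above. The `cluster_to_color` dictionary is a name introduced here for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# sample the colormap at kclusters evenly spaced points and convert to hex
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# explicit mapping from cluster label (0 .. kclusters - 1) to color
cluster_to_color = dict(enumerate(palette))
print(cluster_to_color)
```

Looking colors up by label, for example `cluster_to_color[0]`, makes it easier to add a legend or to compare the map against the Cluster Labels column.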