# Clustering Neighborhoods in Toronto

### This notebook was created as part of a peer-reviewed assignment for the IBM Data Science Professional Certificate.
**Author: Priya Dhawka, February 2020**

This notebook is divided into three parts, following the assignment requirements:
1. Scraping the Wikipedia page for Toronto postal codes, boroughs, and neighborhoods, and cleaning the collected data.
2. Getting latitude and longitude data for those postal codes.
3. Using KMeans clustering to cluster neighborhoods in boroughs whose names contain "Toronto", and mapping the clustering result.



## Part 1: Scraping Toronto Postal Code data

In [145]:

import pandas as pd #library for data analysis
import numpy as np #library to handle data in a vectorized manner
import requests #library to handle requests
import json #library to handle JSON files
from pandas import json_normalize #convert JSON object to a pandas dataframe (import path for pandas >= 1.0)
from geopy.geocoders import Nominatim #for address conversion into latitude and longitude
#importing matplotlib and necessary modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans #import kmeans module 
!conda install -c conda-forge folium=0.5.0
import folium #map plotting
print("Imported all necessary libraries")


Solving environment: done

# All requested packages already installed.

Imported all necessary libraries


In [148]:
#use pandas read_html method to access postal code table
df_page = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
df = df_page[0]
#rename columns; "Borough" already has the desired name
df.rename(columns={"Postcode": "PostalCode", "Neighbourhood": "Neighborhood"}, inplace=True)

#blank out rows with an unassigned borough (dropped below); give an unassigned neighborhood its borough's name
def get_borough(row):
    if row["Borough"] == "Not assigned":
        row["Neighborhood"] = None
        row["Borough"] = None
    else:
        if row["Neighborhood"] == "Not assigned":
            row["Neighborhood"] = row["Borough"]
    return row
df = df.apply(get_borough, axis=1)
df.dropna(inplace=True)
#combine neighborhoods that share a postal code into a single comma-separated row
df = df.groupby(["PostalCode","Borough"])["Neighborhood"].apply(', '.join).reset_index()
df.shape

(103, 3)
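
The cleaning rules above can also be written without a row-wise `apply`. The following is a minimal sketch on a small hand-made table (the rows are illustrative stand-ins, not the scraped data), assuming the same three column names used in this notebook:

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront",
                     "Regent Park", "Not assigned"],
})

# Drop rows whose borough is unassigned.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Where only the neighborhood is unassigned, fall back to the borough name.
unassigned = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighborhood"] = cleaned.loc[unassigned, "Borough"]

# Combine neighborhoods that share a postal code into one comma-separated row.
grouped = (
    cleaned.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(grouped)
```

Boolean masks apply each rule to whole columns at once, which tends to be easier to read and faster than mutating rows inside `apply`.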


## Part 2: Merging latitude and longitude information for postal codes
    
        

In [149]:
#Using suggested csv file for coordinates
#download csv file
!wget -q -O 'toronto_data.csv'  http://cocl.us/Geospatial_data
coordinates = pd.read_csv('toronto_data.csv')
new_df = pd.merge(df, coordinates, left_on="PostalCode",right_on='Postal Code')
new_df.drop('Postal Code',axis=1,inplace=True)
new_df.head()



| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
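
Because `pd.merge` defaults to an inner join, any postal code missing from the coordinates file would be dropped silently. The sketch below shows one way to check for that on toy data; the `M9Z` row and its borough are made up for illustration, and `validate` and `indicator` are standard `pd.merge` parameters:

```python
import pandas as pd

# Toy frames; "M9Z" is a hypothetical code with no coordinates.
postal = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every postal code; validate raises if keys are duplicated
# on either side; indicator adds a "_merge" column recording the match status.
checked = pd.merge(
    postal, coords,
    left_on="PostalCode", right_on="Postal Code",
    how="left", validate="one_to_one", indicator=True,
)

# Postal codes that found no coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

If `missing` is empty for the real data, the inner join used above loses no rows.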


## Part 3: Choosing a specific borough to use KMeans clustering

In [150]:
#keep only the rows whose borough name contains 'Toronto'
toronto_data = new_df[new_df["Borough"].str.contains("Toronto")]
toronto_data.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 41 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 42 | M4L | East Toronto | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |


In [151]:
#Run KMeans clustering
#set number of clusters
kclusters = 5
#keep only the numeric Latitude and Longitude columns as features
toronto_clustering = toronto_data.drop(columns=["PostalCode", "Borough", "Neighborhood"])
#run kmeans clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)
#check the cluster labels assigned to the first 10 rows
kmeans.labels_[0:10]


array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=int32)
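
The number of clusters is fixed at 5 here. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic points (three tight, made-up blobs near Toronto-like coordinates), not on the notebook's data, so its result says nothing about the right k for the real neighborhoods:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic (latitude, longitude) points around three made-up centers.
rng = np.random.default_rng(0)
centers = np.array([[43.65, -79.38], [43.70, -79.40], [43.67, -79.30]])
points = np.vstack([c + rng.normal(scale=0.002, size=(20, 2)) for c in centers])

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The k with the highest score gives the most clearly separated clusters.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

To apply the same loop to this notebook, `points` would be replaced by the latitude and longitude columns of `toronto_clustering`.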

In [153]:
#Add clustering labels to an explicit copy, which avoids pandas' SettingWithCopyWarning
toronto_data = toronto_data.copy()
toronto_data["Cluster Labels"] = kmeans.labels_
toronto_data.head()


| | PostalCode | Borough | Neighborhood | Latitude | Longitude | Cluster Labels |
|---|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 | 0 |
| 41 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 | 0 |
| 42 | M4L | East Toronto | The Beaches West, India Bazaar | 43.668999 | -79.315572 | 0 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 | 0 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 | 1 |


In [155]:
#create map for clusters
map_clusters = folium.Map(location=[43.741667, -79.373333],zoom_start=11)
#set color scheme: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(toronto_data["Latitude"], toronto_data["Longitude"], toronto_data["Neighborhood"], toronto_data["Cluster Labels"]):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
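
The marker colors come from sampling matplotlib's `rainbow` colormap once per cluster. As a minimal sketch of that step in isolation, the helper below (the name `cluster_palette` is hypothetical, not part of the notebook) returns one hex color per cluster label:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def cluster_palette(n_clusters):
    """Return one hex color per cluster, indexed directly by cluster label."""
    # Sample the rainbow colormap at evenly spaced positions in [0, 1].
    rgba = cm.rainbow(np.linspace(0, 1, n_clusters))
    # Convert each RGBA row to a "#rrggbb" string usable by folium.
    return [colors.to_hex(c) for c in rgba]

palette = cluster_palette(5)
print(palette)
```

Outside a notebook the map does not render inline; folium's `Map.save` method can write it to a standalone HTML file instead, for example `map_clusters.save("toronto_clusters.html")` (the file name is only an example).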