## Segmenting and Clustering Neighborhoods in Toronto

## Table of Contents

1. Download and Explore Dataset

2. Explore Neighborhoods in Toronto

3. Analyze Each Neighborhood

4. Cluster Neighborhoods

5. Examine Clusters

Before downloading and exploring the data, let's install and import all the dependencies we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
from pandas import DataFrame
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# install geopy; comment this line out if it is already installed
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# install folium; comment this line out if it is already installed
!pip install folium
import folium # map rendering library

import urllib.request
from bs4 import BeautifulSoup

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


## 1. Download and Explore Dataset

In [2]:
toronto_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(toronto_url)[0] # It's the first table in the designated web page
print('Data downloaded!')
print('The initial shape of dataframe is: ',df.shape)
df.head()

Data downloaded!
The initial shape of dataframe is:  (180, 3)


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Next, drop the rows whose borough is `Not assigned`.

In [3]:
df=df[~df['Borough'].isin(['Not assigned'])]
df.reset_index(drop=True,inplace=True)
print('The shape of dataframe is  reduced to: ',df.shape)
df.head()

The shape of dataframe is  reduced to:  (103, 3)


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
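
The filtering step above relies on a boolean mask over the `Borough` column. As a minimal sketch on a small made-up frame (the rows in `toy` are illustrative and not taken from the Wikipedia table), the same effect can be obtained with a direct inequality comparison:

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0
assigned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned.shape)
```

For a single sentinel value, `!=` and `~isin([...])` select the same rows; `isin` becomes the more convenient form if several placeholder values need to be excluded at once.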


In [4]:
print('The dataframe has {} boroughs.'.format(df['Borough'].nunique()))

The dataframe has 10 boroughs.
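
Beyond the number of distinct boroughs, it can be useful to see how many postal codes fall into each one. A minimal sketch, assuming a frame with a `Borough` column like the one above (the `boroughs` frame and its values are illustrative only):

```python
import pandas as pd

# Hypothetical sample, for illustration only
boroughs = pd.DataFrame({
    'Borough': ['North York', 'North York', 'Downtown Toronto', 'Etobicoke'],
})

# Number of distinct boroughs
n_boroughs = boroughs['Borough'].nunique()

# Number of rows (postal codes) per borough, most frequent first
rows_per_borough = boroughs['Borough'].value_counts()
print(n_boroughs)
print(rows_per_borough)
```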


Where a neighbourhood is `Not assigned`, replace it with the name of its borough.

In [5]:
df_array=np.array(df)
len(df_array)

103

In [6]:
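# Split each Neighbourhood cell (column 2) into its comma-separated names
Neighbourhood_array=[]
for i in range(len(df_array)):
    Neighbourhood_array.append(df_array[i][2].split(','))
# Replace any name equal to 'Not assigned' with the row's Borough (column 1), then rejoin
for i in range(len(df_array)):
    for j in range(len(Neighbourhood_array[i])):
        if Neighbourhood_array[i][j]=='Not assigned':
            Neighbourhood_array[i][j]=df_array[i][1]
    df_array[i][2]=','.join(Neighbourhood_array[i])
df_array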

array([['M3A', 'North York', 'Parkwoods'],
       ['M4A', 'North York', 'Victoria Village'],
       ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'],
       ['M6A', 'North York', 'Lawrence Manor, Lawrence Heights'],
       ['M7A', 'Downtown Toronto',
        "Queen's Park, Ontario Provincial Government"],
       ['M9A', 'Etobicoke', 'Islington Avenue, Humber Valley Village'],
       ['M1B', 'Scarborough', 'Malvern, Rouge'],
       ['M3B', 'North York', 'Don Mills'],
       ['M4B', 'East York', 'Parkview Hill, Woodbine Gardens'],
       ['M5B', 'Downtown Toronto', 'Garden District, Ryerson'],
       ['M6B', 'North York', 'Glencairn'],
       ['M9B', 'Etobicoke',
        'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale'],
       ['M1C', 'Scarborough', 'Rouge Hill, Port Union, Highland Creek'],
       ['M3C', 'North York', 'Don Mills'],
       ['M4C', 'East York', 'Woodbine Heights'],
       ['M5C', 'Downtown Toronto', 'St. James Town'],
       ['M6C', 'York

In [7]:
Toronto_neighbourhood_df = DataFrame(df_array,index=None,columns = ['Postal Code','Borough','Neighbourhood'])
Toronto_neighbourhood_df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [8]:
Toronto_neighbourhood_df.shape

(103, 3)
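
The `Not assigned` replacement above loops over a NumPy array. As a possible alternative, the sketch below does the substitution directly on the DataFrame with `Series.mask`. It assumes that a missing neighbourhood shows up as a whole cell equal to `Not assigned`, so unlike the loop it does not inspect individual comma-separated names. The `toy` frame and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Where the whole Neighbourhood cell is 'Not assigned', take the Borough value instead
is_missing = toy['Neighbourhood'] == 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].mask(is_missing, toy['Borough'])
print(toy)
```

This keeps the data in a DataFrame throughout, which avoids the round trip through `np.array` and back.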

Get the latitude and the longitude coordinates of each postal code.

In [9]:
Geospatial_Coordinates=pd.read_csv('http://cocl.us/Geospatial_data')

In [10]:
print('The Geospatial_Coordinates data frame shape is:',Geospatial_Coordinates.shape)
Geospatial_Coordinates.head()

The Geospatial_Coordinates data frame shape is: (103, 3)


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [11]:
Toronto_neighbourhood=pd.merge(Toronto_neighbourhood_df,Geospatial_Coordinates,how='left',on='Postal Code')
Toronto_neighbourhood.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


In [12]:
Toronto_neighbourhood.shape

(103, 5)
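
A left merge keeps every row of the left frame, so a postal code with no match in the coordinates file would silently end up with `NaN` latitude and longitude. The sketch below shows one way to check for that, using the `validate` and `indicator` options of `pd.merge`. The frames `left` and `coords` are small stand-ins, and the postal code `M9Z` is a made-up value chosen to have no match.

```python
import pandas as pd

# Stand-in frames for illustration; 'M9Z' is a hypothetical code with no coordinates
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises if 'Postal Code' is duplicated on either side;
# indicator adds a '_merge' column recording where each row came from
merged = pd.merge(left, coords, how='left', on='Postal Code',
                  validate='one_to_one', indicator=True)

# Postal codes that found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

If `unmatched` is empty, every postal code received coordinates, and the `_merge` column can be dropped before continuing.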