# Introduction
In this project I explore, segment, and cluster the neighborhoods in the city of Toronto. The neighborhood data though is not readily available on the internet.

For the Toronto neighborhood data, a Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M exists that has all the information we need to explore and cluster the neighborhoods in Toronto. We will have to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we can start the analysis to explore and cluster the neighborhoods in the city of Toronto.

We build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood. We only process the cells that have an assigned borough. We ignore cells with a borough that is Not assigned. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

There are different website scraping libraries and packages in Python. One of the most common packages is BeautifulSoup and we will use it in this project. Package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

Before we get the data and start exploring it, let's download all the dependencies that we will need.



In [1]:
import sys
!{sys.executable} -m pip install geocoder
!{sys.executable} -m pip install folium

print('Packages installed.')

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
[K     |████████████████████████████████| 98 kB 6.5 MB/s eta 0:00:011
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
[K     |████████████████████████████████| 93 kB 3.2 MB/s  eta 0:00:01
[?25hCollecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Packages installed.


In [2]:
pip install BeautifulSoup4

Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analsysis
import geocoder # import geocoder
import requests 
from bs4 import BeautifulSoup 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [8]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M') 
  
soup = BeautifulSoup(r.content, 'html5lib') 
table = soup.find('div', attrs = {'id':'container'}) 

# print(soup.prettify()) 
print('Page Scrapped.')

Page Scrapped.


In [9]:
postalCodes = [];
boroughs= [];
neighborhoods = [];
columnNum = 1;
passVal = False

for row in soup.find_all('td'):
    for cell in row:
        if cell.string and cell.string[0].isalpha() and len(cell.string) > 2:
            passVal = False
            if columnNum == 1:
                if passVal == False and cell.string[1].isdigit():
                    postalCodes.append(cell.string);   
                    columnNum = 2
                else:
                    continue
            elif columnNum == 2 :
                if cell.string == 'Not assigned':
                    passVal = True
                    del postalCodes[-1]
                    columnNum = 1
                    continue
                else:
                    boroughs.append(cell.string);      
                    columnNum = 3
            elif columnNum == 3 :
                if cell.string == 'Not assigned\n':
                    neighborhoods.append(boroughs[-1])
                else:
                    neighborhoods.append(cell.string); 
                columnNum = 1
                
print('Data Collected.')

Data Collected.


In [10]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighbors = pd.DataFrame(columns=column_names)

neighbors

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude


In [13]:

# initialize your variable to None
lat_lng_coords = None

for data in range(0, len(postalCodes)-1):
    code = postalCodes[data]
    borough = boroughs[data]
    neighborhood_name = neighborhoods[data]
    
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
    lat_lng_coords = g.latlng

    neighbors = neighbors.append({ 'PostalCode': code,
                                   'Borough': borough,
                                   'Neighborhood': neighborhood_name,
                                   'Latitude': lat_lng_coords[0],
                                   'Longitude': lat_lng_coords[1]}, ignore_index=True)
    
neighbors

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1A\n,Not assigned\n,Not assigned\n,43.64869,-79.38544
1,M2A\n,Not assigned\n,Not assigned\n,43.64869,-79.38544
2,M3A\n,North York\n,Parkwoods\n,43.75245,-79.32991
3,M4A\n,North York\n,Victoria Village\n,43.73057,-79.31306
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n",43.65512,-79.36264
...,...,...,...,...,...
366,M4Z\n,Not assigned\n,Not assigned\n,43.64869,-79.38544
367,M5Z\n,Not assigned\n,Not assigned\n,43.64869,-79.38544
368,M6Z\n,Not assigned\n,Not assigned\n,43.64869,-79.38544
369,M7Z\n,Not assigned\n,Not assigned\n,43.64869,-79.38544


In [14]:
neighbors.shape

(371, 5)

In [15]:
VERSION = '20180605' # Foursquare API version

neighborhood_name = neighbors.loc[0, 'Neighborhood'] # neighborhood name
neighborhood_latitude = neighbors.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighbors.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

radius = 500 # define radius
LIMIT = 100 # limit of number of venues returned by Foursquare API

# url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
#     CLIENT_ID, 
#     CLIENT_SECRET, 
#     VERSION, 
#     43.67635739999999, 
#     79.2930312, 
#     radius, 
#     LIMIT)

url = 'https://api.foursquare.com/v2/venues/explore?&client_id=JGGBRN5XODTLZGJOMCSWIQMRH1JLGJKPSFR10XNB2R5U25GR&client_secret=KWRAMLK2HOJBQ2XLICLKXRU3M4HOCC1U2VG4Y4OPP5JF03QX&v=20180605&ll=43.67635739999999,-79.2930312&radius=500&limit=100'

results = requests.get(url).json()
results

Latitude and longitude values of Not assigned
 are 43.648690000000045, -79.38543999999996.


{'meta': {'code': 200, 'requestId': '5fae2319f6c312434330bb57'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682094413784,
          'lng': -79.29394208780985}],
        'distanc

In [16]:
# From the Foursquare lab in the previous module, we know that all the information is in the items key. 
# Before we proceed, let's borrow the get_category_type function from the Foursquare lab.

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [17]:
from pandas.io.json import json_normalize

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869


In [18]:

print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


In [19]:
def getNearbyVenues(names, latitudes, longitudes):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
#         url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
#             CLIENT_ID, 
#             CLIENT_SECRET, 
#             VERSION, 
#             lat, 
#             lng, 
#             radius, 
#             LIMIT)
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [20]:
toronto_venues = getNearbyVenues(names=neighborhoods,
                                 latitudes=neighbors['Latitude'],
                                 longitudes=neighbors['Longitude'])

Not assigned

Not assigned

Parkwoods

Victoria Village

Regent Park, Harbourfront

Lawrence Manor, Lawrence Heights

Queen's Park, Ontario Provincial Government

Not assigned

Islington Avenue, Humber Valley Village

Malvern, Rouge

Not assigned

Don Mills

Parkview Hill, Woodbine Gardens

Garden District, Ryerson

Glencairn

Not assigned

Not assigned

West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale

Rouge Hill, Port Union, Highland Creek

Not assigned

Don Mills

Woodbine Heights

St. James Town

Humewood-Cedarvale

Not assigned

Not assigned

Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood

Guildwood, Morningside, West Hill

Not assigned

Not assigned

The Beaches

Berczy Park

Caledonia-Fairbanks

Not assigned

Not assigned

Not assigned

Woburn

Not assigned

Not assigned

Leaside

Central Bay Street

Christie

Not assigned

Not assigned

Not assigned

Cedarbrae

Hillcrest Village

Bathurst Manor, Wilson Heights, Downsview North

Thorncliffe P

In [21]:
# check the size of the resulting dataframe

print(toronto_venues.shape)
toronto_venues.head()

(720, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Not assigned\n,43.64869,-79.38544,Glen Manor Ravine,43.676821,-79.293942,Trail
1,Not assigned\n,43.64869,-79.38544,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,Not assigned\n,43.64869,-79.38544,Grover Pub and Grub,43.679181,-79.297215,Pub
3,Not assigned\n,43.64869,-79.38544,Upper Beaches,43.680563,-79.292869,Neighborhood
4,Not assigned\n,43.64869,-79.38544,Glen Manor Ravine,43.676821,-79.293942,Trail


In [22]:
# check how many venues were returned for each neighborhood

toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt\n,4,4,4,4,4,4
"Alderwood, Long Branch\n",4,4,4,4,4,4
"Bathurst Manor, Wilson Heights, Downsview North\n",4,4,4,4,4,4
Bayview Village\n,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East\n",4,4,4,4,4,4
...,...,...,...,...,...,...
"Willowdale, Willowdale West\n",4,4,4,4,4,4
Woburn\n,4,4,4,4,4,4
Woodbine Heights\n,4,4,4,4,4,4
York Mills West\n,4,4,4,4,4,4


In [23]:

# find out how many unique categories can be curated from all the returned venues

print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 4 uniques categories.


In [24]:

# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Trail,Health Food Store,Neighborhood,Pub
0,1,0,Not assigned\n,0
1,0,1,Not assigned\n,0
2,0,0,Not assigned\n,1
3,0,0,Not assigned\n,0
4,1,0,Not assigned\n,0


In [25]:

# examine the new dataframe size.

toronto_onehot.shape

(720, 4)

In [26]:
# group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Trail,Health Food Store,Pub
0,Agincourt\n,0.25,0.25,0.25
1,"Alderwood, Long Branch\n",0.25,0.25,0.25
2,"Bathurst Manor, Wilson Heights, Downsview North\n",0.25,0.25,0.25
3,Bayview Village\n,0.25,0.25,0.25
4,"Bedford Park, Lawrence Manor East\n",0.25,0.25,0.25
...,...,...,...,...
95,"Willowdale, Willowdale West\n",0.25,0.25,0.25
96,Woburn\n,0.25,0.25,0.25
97,Woodbine Heights\n,0.25,0.25,0.25
98,York Mills West\n,0.25,0.25,0.25


In [27]:

# confirm the new size

toronto_grouped.shape

(100, 4)

In [28]:
# print each neighborhood along with the top 5 most common venues

num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Alderwood, Long Branch
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Bathurst Manor, Wilson Heights, Downsview North
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Bayview Village
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Bedford Park, Lawrence Manor East
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Berczy Park
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Birch Cliff, Cliffside West
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.2

               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----The Kingsway, Montgomery Road, Old Mill North
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Thorncliffe Park
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Toronto Dominion Centre, Design Exchange
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----University of Toronto, Harbord
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Upper Rouge
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----Victoria Village
----
               venue  freq
0              Trail  0.25
1  Health Food Store  0.25
2                Pub  0.25


----West Dea

In [29]:

# sort the venues in descending order.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [30]:
# create the new dataframe and display the top 10 venues for each neighborhood.

num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Agincourt\n,Pub,Health Food Store,Trail
1,"Alderwood, Long Branch\n",Pub,Health Food Store,Trail
2,"Bathurst Manor, Wilson Heights, Downsview North\n",Pub,Health Food Store,Trail
3,Bayview Village\n,Pub,Health Food Store,Trail
4,"Bedford Park, Lawrence Manor East\n",Pub,Health Food Store,Trail


In [31]:
# Run k-means to cluster the neighborhood into 5 clusters.

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]



array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [32]:

# create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = neighbors

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,M1A\n,Not assigned\n,Not assigned\n,43.64869,-79.38544,0,Pub,Health Food Store,Trail
1,M2A\n,Not assigned\n,Not assigned\n,43.64869,-79.38544,0,Pub,Health Food Store,Trail
2,M3A\n,North York\n,Parkwoods\n,43.75245,-79.32991,0,Pub,Health Food Store,Trail
3,M4A\n,North York\n,Victoria Village\n,43.73057,-79.31306,0,Pub,Health Food Store,Trail
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n",43.65512,-79.36264,0,Pub,Health Food Store,Trail


In [33]:
# Finally, let's visualize the resulting clusters

# create map
map_clusters = folium.Map(location=[43.67635739999999, -79.2930312], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters