## Segmenting and Clustering Neighbourhoods

### Load Libraries

Load the libraries needed for this exercise.

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Fetch Wikipedia Page with Postal Codes

Connect to the Wikipedia page that lists the postal codes.

In [2]:
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_page = requests.get (wikipedia_link)
page = raw_page.text


### Create Data Frame for the Postal Codes

In [3]:
# Convert the page to a BeautifulSoup object to parse the postal codes

bs = BeautifulSoup (page, 'lxml')
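
As a rough illustration of what the parsed object gives access to, the sketch below walks the rows of a small inline table. The HTML string is a hypothetical stand-in for the Wikipedia page (the column layout is assumed to match), and it uses the standard-library `html.parser` backend so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page has many more rows.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
    <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
  </tbody>
</table>
"""


def table_rows(html):
    """Return the text of every <td> cell, one list per table row."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("tbody")
    rows = []
    for tr in body.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so it yields an empty list
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
print(rows)
```

Skipping rows without `<td>` cells is an alternative to slicing off the first row by position.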

In [4]:
# Create the postal code dataframe

postal_table = bs.find ('tbody')
cols = ['PostalCode', 'Borough', 'Neighborhood']
postal_df = pd.DataFrame(columns=cols)

# first row contains the table header information
for r, row in enumerate (postal_table.find_all('tr')[1:]):
    postalcode = ""
    borough = ""
    neighborhood = ""
    
    for c, col in enumerate (row.find_all('td')):
        if (c == 0):
            postalcode = col.text
        elif (c == 1):
            borough = col.text
        elif (c == 2):
            neighborhood = col.text.rstrip()
        else:
            print ('Should never get here')
    
    # Use the borough name if the neighborhood has not been assigned
    if (neighborhood == "Not assigned"):
        neighborhood = borough
    
    # if the borough is not assigned, then skip this row
    if (borough != "Not assigned"):
        dup_df = postal_df[postal_df ['PostalCode'] == postalcode]
        if (not dup_df.empty):
            idx = dup_df.index.values.astype(int)[0]
            appended_nh = neighborhood + ', ' + dup_df.loc[idx, 'Neighborhood'] 
            postal_df.loc[idx, 'Neighborhood'] = appended_nh
        else: 
            temp_df = pd.DataFrame ({'PostalCode': [postalcode], 'Borough': [borough], 'Neighborhood': [neighborhood]})
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            postal_df = pd.concat([postal_df, temp_df], ignore_index=True)

#with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', 100):
#    print(postal_df)
print (postal_df.shape)


(103, 3)
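
The same cleaning rules (skip unassigned boroughs, fall back to the borough name, combine neighborhoods that share a postal code) can also be expressed without a row-by-row loop. The following is a minimal sketch on a hypothetical four-row frame, not the scraped data. Note that it joins names in order of appearance, whereas the loop above prepends each new name.

```python
import pandas as pd

# Hypothetical sample rows in the same three-column layout as the scraped table.
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Drop rows whose borough is not assigned.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Use the borough name where the neighborhood is not assigned.
mask = assigned["Neighborhood"] == "Not assigned"
assigned.loc[mask, "Neighborhood"] = assigned.loc[mask, "Borough"]

# Combine neighborhoods that share a postal code into one comma-separated string.
combined = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```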


### Add Coordinates

Add the coordinate information from the CSV file at https://cocl.us/Geospatial_data and map it to the postal codes.

In [5]:
# read in the coordinates from the provided geospatial data file

geo_df = pd.read_csv('https://cocl.us/Geospatial_data')

In [6]:
# merge the dataframes using the postal code of each dataframe as the key
postal_geo_df = postal_df.merge (geo_df, left_on='PostalCode', right_on='Postal Code', how='left')

# remove the extra postal code column
postal_geo_df.drop (columns=['Postal Code'], inplace=True)

#with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', 100):
#    print (postal_geo_df)
#postal_geo_df
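
A left merge keeps every postal code even when no coordinates are found, so it can be worth checking for unmatched rows before dropping the extra key column. The sketch below uses two tiny hypothetical frames (the code `M9Z` and its borough are made up for illustration) and the `indicator` option of `merge` to list the codes that found no match.

```python
import pandas as pd

# Hypothetical frames mirroring the column names used in this notebook.
postal = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # "M9Z" is invented and has no coordinates
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
geo = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a "_merge" column saying where each row came from.
merged = postal.merge(
    geo, left_on="PostalCode", right_on="Postal Code", how="left", indicator=True
)

# Rows marked "left_only" had no coordinates in the geospatial file.
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()

# Remove the helper columns once the check is done.
merged = merged.drop(columns=["Postal Code", "_merge"])
print(missing)
```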

## Explore the Toronto Neighborhood

Similar to the exercise that explored New York, the rest of this notebook explores the neighborhoods of Scarborough, a borough of Toronto. We fetch venues from Foursquare using the postal codes and their geographic coordinates, then use k-means to segment the neighborhoods based on the frequency of their most common venue categories.

First import the necessary libraries and functions.

In [7]:

from geopy.geocoders import Nominatim 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

from pandas import json_normalize # transform JSON file into a pandas dataframe

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


### Scarborough Neighborhood Data

Take the subset of the Toronto postal codes that belongs to the borough of Scarborough.

In [8]:
scar_data = postal_geo_df[postal_geo_df['Borough'] == 'Scarborough'].reset_index(drop=True)

In [9]:
address = 'Scarborough, Ontario'

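# geopy 2.x requires an explicit user_agent; the name below is a placeholder
geolocator = Nominatim(user_agent="scarborough_explorer")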
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [10]:
scar_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(scar_data['Latitude'], scar_data['Longitude'], scar_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(scar_map)
    
scar_map

### Get the venues around the Scarborough neighborhoods

Now get the venues that are close to the location of each postal code. Since Scarborough is less dense than Manhattan, we need to search a wider radius than we did for Manhattan.

In [11]:
CLIENT_ID = 
CLIENT_SECRET =  
VERSION = '20180605'


In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
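
The list comprehension in `getNearbyVenues` assumes every response has at least one group and every venue has at least one category. A more defensive version of that flattening step might look like the sketch below. The helper name `extract_venues` and the `"Uncategorized"` fallback are assumptions made for illustration, and the sample payload is hand-written to mimic the response shape used above rather than taken from the API.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten an explore-style response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # Fall back to a single empty category when the list is missing or empty.
        categories = venue.get("categories") or [{}]
        rows.append((
            name,
            lat,
            lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Uncategorized"),
        ))
    return rows


# Hand-written payload in the shape the cell above indexes into.
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Toronto Zoo",
                           "location": {"lat": 43.820582, "lng": -79.181551},
                           "categories": [{"name": "Zoo"}]}},
                {"venue": {"name": "Unlabelled Venue",  # hypothetical entry
                           "location": {"lat": 43.8, "lng": -79.19},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venues(sample, "Malvern, Rouge", 43.806686, -79.194353)
empty = extract_venues({}, "Malvern, Rouge", 43.806686, -79.194353)
print(rows)
```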

Get the venues within a 2 km radius of each neighborhood.

In [13]:
# get all the venues around the Scarborough neighborhoods (within 2km)
scar_venues = getNearbyVenues(names=scar_data['Neighborhood'],
                              latitudes=scar_data['Latitude'],
                              longitudes=scar_data['Longitude'],
                              radius=2000)

Malvern, Rouge
Port Union, Rouge Hill, Highland Creek
West Hill, Morningside, Guildwood
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Oakridge, Golden Mile, Clairlea
Scarborough Village West, Cliffside, Cliffcrest
Cliffside West, Birch Cliff
Wexford Heights, Scarborough Town Centre, Dorset Park
Wexford, Maryvale
Agincourt
Tam O'Shanter, Sullivan, Clarks Corners
Steeles East, Milliken, L'Amoreaux East, Agincourt North
L'Amoreaux West
Upper Rouge


In [14]:
print(scar_venues.shape)
scar_venues.head()

(1124, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Malvern, Rouge | 43.806686 | -79.194353 | African Rainforest Pavilion | 43.817725 | -79.183433 | Zoo Exhibit |
| 1 | Malvern, Rouge | 43.806686 | -79.194353 | Toronto Pan Am Sports Centre | 43.790623 | -79.193869 | Athletics & Sports |
| 2 | Malvern, Rouge | 43.806686 | -79.194353 | Toronto Zoo | 43.820582 | -79.181551 | Zoo |
| 3 | Malvern, Rouge | 43.806686 | -79.194353 | Canadiana exhibit | 43.817962 | -79.193374 | Zoo Exhibit |
| 4 | Malvern, Rouge | 43.806686 | -79.194353 | Images Salon & Spa | 43.802283 | -79.198565 | Spa |


### k-means Clustering for Common Venues

Perform k-means clustering on the frequency of venue categories in each neighborhood.

In [15]:
scar_onehot = pd.get_dummies(scar_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
scar_onehot['Neighborhood'] = scar_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [scar_onehot.columns[-1]] + list(scar_onehot.columns[:-1])
scar_onehot = scar_onehot[fixed_columns]

scar_grouped = scar_onehot.groupby('Neighborhood').mean().reset_index()
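
To see why the mean of one-hot columns is a frequency, consider the small sketch below. The two neighborhoods `A` and `B` and their venues are hypothetical. Each row of the grouped result sums to 1 because every venue contributes to exactly one category column.

```python
import pandas as pd

# Hypothetical venues: four in neighborhood A, two in neighborhood B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Zoo", "Park", "Park"],
})

# One column per category, with 1.0 where the venue belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column within a group is the share of venues in that category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Here half of the venues in `A` are coffee shops, so its `Coffee Shop` value is 0.5, while `B` consists only of parks.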

In [23]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scar_grouped['Neighborhood']

for ind in np.arange(scar_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scar_grouped.iloc[ind, :], num_top_venues)

#neighborhoods_venues_sorted.head()
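
The column labels above rely on an exception to switch from `st`/`nd`/`rd` to `th`, which works for the ten columns used here but would label a 21st column as "21th". If more columns were ever needed, a small helper along these lines could build the labels instead. The function name `ordinal` is an assumption for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th (and the rest of the teens) always take "th".
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```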

In [24]:
# set number of clusters
kclusters = 5

scar_grouped_clustering = scar_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scar_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 


array([0, 4, 4, 4, 0, 3, 4, 4, 4, 1], dtype=int32)
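
The number of clusters is fixed at 5 in this notebook. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on a hypothetical toy dataset of two well-separated groups, not on the Scarborough data, so it only illustrates the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two tight groups of five points each, ten units apart.
base = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]], dtype=float)
X = np.vstack([base, base + 10.0])

# Fit k-means for a few candidate values of k and record the silhouette score.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With only 17 neighborhoods, scores on the real data are likely to be noisy, so they are better treated as a rough guide than as a definitive answer.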

In [25]:

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

scar_merged = scar_data

# join the sorted venue table onto scar_data so each neighborhood keeps its latitude/longitude
scar_merged = scar_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

scar_merged


| | PostalCode | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 | 3 | Zoo Exhibit | Fast Food Restaurant | Pizza Place | Gift Shop | Park | Zoo | Other Great Outdoors | Coffee Shop | Liquor Store | Skating Rink |
| 1 | M1C | Scarborough | Port Union, Rouge Hill, Highland Creek | 43.784535 | -79.160497 | 4 | Breakfast Spot | Coffee Shop | Sandwich Place | Pharmacy | Mexican Restaurant | Supermarket | Burger Joint | Liquor Store | Fish & Chips Shop | Fast Food Restaurant |
| 2 | M1E | Scarborough | West Hill, Morningside, Guildwood | 43.763573 | -79.188711 | 1 | Pizza Place | Fast Food Restaurant | Park | Pharmacy | Coffee Shop | Breakfast Spot | Convenience Store | Smoothie Shop | Fried Chicken Joint | Burger Joint |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 | 4 | Coffee Shop | Fast Food Restaurant | Pizza Place | Sandwich Place | Furniture / Home Store | Supermarket | Indian Restaurant | Beer Store | Discount Store | Chinese Restaurant |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 | 4 | Coffee Shop | Clothing Store | Fast Food Restaurant | Gym | Sandwich Place | Restaurant | Indian Restaurant | Pizza Place | Sporting Goods Shop | Pharmacy |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 | 4 | Fast Food Restaurant | Coffee Shop | Pizza Place | Sandwich Place | Pharmacy | Chinese Restaurant | Grocery Store | Liquor Store | Discount Store | Big Box Store |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park | 43.727929 | -79.262029 | 4 | Chinese Restaurant | Grocery Store | Fast Food Restaurant | Pharmacy | Coffee Shop | Discount Store | Sandwich Place | Pizza Place | Beer Store | Bank |
| 7 | M1L | Scarborough | Oakridge, Golden Mile, Clairlea | 43.711112 | -79.284577 | 4 | Coffee Shop | Fast Food Restaurant | Sandwich Place | Pizza Place | Burger Joint | Bakery | Cosmetics Shop | Sporting Goods Shop | Beer Store | Clothing Store |
| 8 | M1M | Scarborough | Scarborough Village West, Cliffside, Cliffcrest | 43.716316 | -79.239476 | 1 | Harbor / Marina | Fast Food Restaurant | Park | Pizza Place | Beach | Grocery Store | Pharmacy | Sandwich Place | Coffee Shop | Breakfast Spot |
| 9 | M1N | Scarborough | Cliffside West, Birch Cliff | 43.692657 | -79.264848 | 4 | Coffee Shop | Bank | Pizza Place | Park | Grocery Store | Fast Food Restaurant | Pharmacy | Beer Store | Sporting Goods Shop | Dog Run |


### Map the 5 Clusters


**Cluster 0** has many Chinese restaurants and coffee shops. <br>
**Cluster 1** is near the water with fast food and parks. <br>
**Cluster 2** is more of a scenic area with few restaurants.  It is also a cluster of only 1. <br>
**Cluster 3** has a mixture of attractions and food.  It is also a cluster of only 1. <br>
**Cluster 4** is the biggest cluster with coffee shops dominating the venues. <br>

In [26]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scar_merged['Latitude'], scar_merged['Longitude'], scar_merged['Neighborhood'], scar_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
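
The colour scheme only needs one distinct colour per cluster. A minimal sketch of building such a palette is shown below, using the same matplotlib modules imported earlier. The helper name `cluster_palette` is an assumption for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(n_clusters):
    """Return one hex colour per cluster, evenly spaced along the rainbow colormap."""
    rgba = cm.rainbow(np.linspace(0, 1, n_clusters))
    return [colors.to_hex(c) for c in rgba]


palette = cluster_palette(5)
print(palette)
```

Indexing the list with the cluster label itself (for example `palette[cluster]`) gives cluster 0 the first colour and cluster 4 the last.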