<h2>Segmenting and Clustering Neighborhoods in Toronto</h2>
<h3> Part 1 - create df, wrangle data and cluster neighborhoods</h3>
Import the numpy, pandas and requests libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import requests

import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means from clustering stage
from sklearn.cluster import KMeans

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Install the <b>Beautiful Soup (v4)</b> and <b>lxml</b> packages - if they are not already installed on the server

In [2]:
!conda install -c conda-forge beautifulsoup4 --yes
!conda install -c conda-forge lxml --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.9.0       |   py36h9f0ad1d_0         160 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    soupsieve-1.9.4            |   py36h9f0ad1d_1          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.3 MB

The following NEW packages will be INSTALLED:

  beautifulsoup4     conda-forge/linux-64::beautifulsoup4-4.9.0-py36h9f0ad1d_0
  soupsieve          conda-forge/linux-64::soupsieve-1.9.4-py36h9f0ad1d_1

The following packages will be UPDATED:

  openssl                           

Create a dataframe (neighborhoods) and set the column names as <b>PostCode</b>, <b>Borough</b> and <b>Neighborhood</b>

In [3]:
# Define list of column names to be used in the neighborhoods dataframe
column_names = ['PostCode', 'Borough', 'Neighborhood'] 

# Instantiate the dataframe and set the column name
neighborhoods = pd.DataFrame(columns=column_names)

Import the Beautiful Soup library, fetch the target Wikipedia page and locate the post code table in its HTML.

In [4]:
# Import the beautiful soup library
from bs4 import BeautifulSoup

# Set the target url and extract the html text from the wiki url
wiki_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Create the Beautifulsoup object and assign to variable soup
soup = BeautifulSoup(wiki_url,'lxml')

# Find the post code table (wikitable) and assign only those elements belonging to the table to a variable - pcode_tbl
pcode_tbl = soup.find('table', class_= 'wikitable')

Create a loop which cycles through each row in the table, scraping the postcode, borough and neighborhood data from each row and assigning these to the variables var_a, var_b and var_c. Remove the trailing newline characters ('\n') from the text strings and replace the forward slash separating neighborhoods with a comma. Rows with no assigned borough are skipped, and a postcode with a borough but no neighborhood takes the borough name as its neighborhood, before the cleaned data is appended to the neighborhoods dataframe.

In [5]:
# Create a loop which cycles through each row in the table 
for rows in pcode_tbl.find_all('tr'):
    
    # Assign the cells of each row (<td>) to the variable cells
    cells = rows.find_all('td')

    # First test if there are 3 cells in the row - representing post code, borough and neighborhood
    if len(cells) == 3:
    
        # Assign data scraped from the <td> cells to variables for postcode, borough and neighborhood and clean the text
        var_a = cells[0].find(text=True).rstrip('\n')
        var_b = cells[1].find(text=True).rstrip('\n')
        var_c = cells[2].find(text=True).rstrip('\n').replace(' /',',')
        
        # Omit rows where no borough is assigned
        if var_b != 'Not assigned':
            
            # Test for post codes with a borough, yet not assigned neighborhood
            if var_c == 'Not assigned':
                
                # Set the neighborhood name to the same as the borough
                var_c = var_b
            
            # Create row data for cleaned postcode, borough and neighborhoods
            new_row = {'PostCode':var_a, 'Borough':var_b, 'Neighborhood':var_c}
            
            # Append the row data to the neighborhoods dataframe
            # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
            neighborhoods = pd.concat([neighborhoods, pd.DataFrame([new_row])], ignore_index=True)

# Check the populated dataframe
neighborhoods

Unnamed: 0,PostCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
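
The cleaning rules applied inside the loop above can also be expressed with vectorised pandas operations. The following is a minimal sketch under stated assumptions rather than the notebook's own method: the sample rows (including the postcode M9Z) and the names <b>raw</b> and <b>cleaned</b> are illustrative stand-ins for the scraped table cells.

```python
import pandas as pd

# Illustrative sample rows, shaped like the text scraped from the table cells
raw = pd.DataFrame({
    'PostCode': ['M1A', 'M3A', 'M5A', 'M9Z'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto', 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Parkwoods\n',
                     'Regent Park / Harbourfront\n', 'Not assigned'],
})

cleaned = raw.copy()

# Strip trailing newlines and turn the ' /' separators into commas
cleaned['Neighborhood'] = (cleaned['Neighborhood']
                           .str.rstrip('\n')
                           .str.replace(' /', ',', regex=False))

# Drop rows where no borough is assigned
cleaned = cleaned[cleaned['Borough'] != 'Not assigned'].reset_index(drop=True)

# Where the neighborhood is missing, reuse the borough name
missing = cleaned['Neighborhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighborhood'] = cleaned.loc[missing, 'Borough']

print(cleaned)
```

Working on whole columns avoids growing the dataframe one row at a time, which tends to be slow on larger tables.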


In [6]:
neighborhoods.shape

(103, 3)

<br><h3>Part 2 - Create new df and add latitude and longitude data</h3>
<h4>NOTE: I had too many problems trying to obtain the latitude and longitude data from <b>geocoder</b>, so I have elected to use the data contained in the csv instead.</h4>
If using <b>geocoder</b> then first install the geocoder package on the server and import the geocoder library

In [6]:
!conda install -c conda-forge geocoder --yes
import geocoder

Collecting package metadata (current_repodata.json): / ^C
- 

Define a new dataframe (<b>toronto</b>) and set its column headers. Set a path to the csv containing the latitude and longitude data for the Toronto post codes and read the csv into a dataframe (<b>csv_df</b>).<br>

In [7]:
# Define list of column names to be used in the new dataframe
column_names = ['PostCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# Instantiate the new dataframe and set the column names
toronto = pd.DataFrame(columns=column_names)

# Set the path to the csv file and read it into the new dataframe
path='https://cocl.us/Geospatial_data'
csv_df = pd.read_csv(path)

# After reading in the csv file, the line below is used to check its contents - to ensure that it has worked.
# For now, I've commented this line out to conserve screen space

#csv_df

Loop through each of the rows in the <b>neighborhoods</b> DataFrame that was created in <b>Part 1</b> above and find the matching postcode entry in the <b>csv</b> data. After finding a match, join the row from the <b>neighborhoods</b> DataFrame with the latitude and longitude from the <b>csv</b> file and append the result to the new <b>toronto</b> DataFrame.

In [8]:
# Loop through each row of the neighborhoods dataframe
for ind in neighborhoods.index: 
    
    # Default the latitude and longitude to NaN in case no matching postcode is found
    csv_lat = np.nan
    csv_long = np.nan
    
    # Loop through each row in the csv to find a matching Postcode
    for c_ind in csv_df.index:
        
        # Test if the neighborhood postcode matches the csv postcode
        if csv_df['Postal Code'][c_ind] == neighborhoods.PostCode[ind]:
            
            #If there is a match, assign the latitude and longitude to the csv_lat and csv_long variables
            csv_lat = csv_df.Latitude[c_ind]
            csv_long = csv_df.Longitude[c_ind]
    
    # Create row data containing data from the neighborhood df and latitude and longitude variables
    new_row = {
        'PostCode':neighborhoods.PostCode[ind],
        'Borough':neighborhoods.Borough[ind],
        'Neighborhood':neighborhoods.Neighborhood[ind],
        'Latitude':csv_lat,
        'Longitude':csv_long
    }
            
    # Append the row data into the new dataframe
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    toronto = pd.concat([toronto, pd.DataFrame([new_row])], ignore_index=True)

# Display the new dataframe    
toronto

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
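
The nested loop above is effectively a left join on the postcode. As a hedged alternative, the same lookup might be written with <b>DataFrame.merge</b>. The two small frames below are illustrative samples standing in for <b>neighborhoods</b> and <b>csv_df</b>, and the names ending in <b>_sample</b> are assumptions made for this sketch.

```python
import pandas as pd

# Illustrative samples; the real inputs are the neighborhoods and csv_df dataframes
neighborhoods_sample = pd.DataFrame({
    'PostCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})
coords_sample = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# A left join keeps every neighborhood row; an unmatched postcode
# would receive NaN for its latitude and longitude
toronto_sample = (neighborhoods_sample
                  .merge(coords_sample, how='left',
                         left_on='PostCode', right_on='Postal Code')
                  .drop(columns='Postal Code'))

print(toronto_sample)
```

A merge does the matching in one call and keeps the row order of the left-hand dataframe.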


<h3>Part 3 - Visualise neighborhoods on a map</h3>
Install folium and geopy packages on the server

In [9]:
!conda install -c conda-forge folium=0.5.0 --yes
!conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.1.0               |             py_1         614 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    brotlipy-0.7.0             |py36h8c4c3a4_1000         346 KB  conda-forge
    chardet-3.0.4              |py36h9f0ad1d_1006         188 KB  conda-forge
    cryptography-2.9.2         |   py36h45558ae_0         613 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    pandas-1.0.3               |   py36h83

Use geopy library to get the latitude and longitude values of Toronto.

In [10]:
from geopy.geocoders import Nominatim

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Render a map of Toronto and place markers for each neighborhood/borough

In [11]:
# Import folium rendering library
import folium

# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for lat, lng, borough, neighborhood in zip(toronto['Latitude'], toronto['Longitude'], toronto['Borough'], toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

As GitHub does not render Folium images, you have to go to the following <a href="https://nbviewer.jupyter.org/github/epaki/applied-ds-capstone/blob/master/segclusttorontoMap.ipynb">link</a> to view the rendered map for Toronto along with the markers for the neighborhoods

<a href="https://nbviewer.jupyter.org/github/epaki/applied-ds-capstone/blob/master/segclusttorontoMap.ipynb">https://nbviewer.jupyter.org/github/epaki/applied-ds-capstone/blob/master/segclusttorontoMap.ipynb</a>

<h3>Part 4 - Cluster Toronto Postcodes</h3>
<h4>I've elected to create clusters using Toronto <b>postcodes</b>. This makes it easier to use a single latitude and longitude reference, where postcodes are associated with multiple neighborhoods.</h4>

In [13]:
CLIENT_ID = '###########' # I've redacted my Foursquare client ID for privacy after running the code
CLIENT_SECRET = '###########' # I've redacted my Foursquare client secret for privacy after running the code
VERSION = '20180605' # Foursquare API version
radius = 500 # Define the explore radius for the API call to 500m from the lat & long
LIMIT = 100 # Limit the venues returned to no more than 100
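
The settings above end up as query parameters in the Foursquare request URL. As a minimal sketch, the URL can be assembled from a parameter dictionary and inspected without sending anything, which keeps the encoding in the hands of <b>requests</b>. The credential strings below are placeholders, and the coordinates are those listed for postcode M3A in Part 2.

```python
import requests

# Placeholder credentials: substitute your own Foursquare values
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',
    'll': '43.753259,-79.329656',  # latitude,longitude of postcode M3A
    'radius': 500,
    'limit': 100,
}

# Prepare the request without sending it, so the final URL can be inspected
prepared = requests.Request(
    'GET', 'https://api.foursquare.com/v2/venues/explore', params=params
).prepare()

print(prepared.url)
```

Passing the same dictionary as <b>params</b> to <b>requests.get</b> would send the request.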

Create a function that calls the Foursquare API to generate a list of venues situated within a 500m radius of each Postcode.

In [14]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# function that retrieves the venues (name, location and category) near each PostCode in Toronto
def getNearbyVenues(postcode, latitudes, longitudes, radius=radius):
    
    venues_list=[]
    for pc, lat, lng in zip(postcode, latitudes, longitudes):
        print(pc)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            pc, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostCode', 
                  'PostCode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
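
The function above assumes every response contains at least one group and every venue has at least one category; if either is missing it raises an IndexError. A more defensive parsing step might look like the sketch below. The helper name <b>extract_venues</b> and the sample payload are hypothetical, and the payload mirrors only the keys that getNearbyVenues reads, not the full Foursquare response.

```python
def extract_venues(payload, postcode):
    """Return (postcode, name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Venues without a category get None instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((postcode,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hypothetical payload containing only the keys used above
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Example Park',
                   'location': {'lat': 43.75, 'lng': -79.33},
                   'categories': [{'name': 'Park'}]}},
        {'venue': {'name': 'Uncategorised Spot',
                   'location': {'lat': 43.76, 'lng': -79.32},
                   'categories': []}},
    ]}]}
}

print(extract_venues(sample_payload, 'M3A'))
print(extract_venues({}, 'M3A'))  # an empty payload yields an empty list
```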

Call the function created above for each postcode contained in the <i>toronto</i> dataframe and store the returned venues in a dataframe called <b>toronto_venues</b>

In [15]:
toronto_venues = getNearbyVenues(postcode=toronto['PostCode'],
                                   latitudes=toronto['Latitude'],
                                   longitudes=toronto['Longitude']
                                  )

M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M5K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z


Prepare the toronto_venues dataframe for clustering and plotting: one-hot encode the venue categories, then average them by postcode

In [16]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add PostCode column back to dataframe
toronto_onehot['PostCode'] = toronto_venues['PostCode'] 

# move PostCode column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# group rows by postcode, taking the mean of the frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('PostCode').mean().reset_index()
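
To see what the one-hot encoding and the grouped mean produce, here is a small worked example. The venue rows are made up for illustration. Each value in the result is the share of a postcode's venues that fall into that category, so every row sums to 1.

```python
import pandas as pd

# Made-up venues: three in M3A, one in M4A
venues = pd.DataFrame({
    'PostCode': ['M3A', 'M3A', 'M3A', 'M4A'],
    'Venue Category': ['Park', 'Park', 'Coffee Shop', 'Pizza Place'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'PostCode', venues['PostCode'])

# The mean of the indicator columns is the category frequency per postcode
grouped = onehot.groupby('PostCode').mean().reset_index()

print(grouped)
```

In this sample, M3A ends up with a Park frequency of 2/3 and a Coffee Shop frequency of 1/3.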

<h4>Create a function that returns the <b>most common</b> venue categories found in each postcode. The resulting dataframe lists each postcode's top 10 venue categories in descending order of frequency.</h4>

In [17]:
num_top_venues = 10 # limit the top venues to 10

# function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postcodes_venues_sorted = pd.DataFrame(columns=columns)
postcodes_venues_sorted['PostCode'] = toronto_grouped['PostCode']

for ind in np.arange(toronto_grouped.shape[0]):
    postcodes_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
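
The ranking step and the column labels above can be illustrated on a single row. This is a sketch under stated assumptions: the helper names <b>ordinal</b> and <b>top_categories</b> are hypothetical, and the frequencies are made up.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


def top_categories(row, n):
    """Return the n category names with the highest frequency in a row."""
    return row.sort_values(ascending=False).index[:n].tolist()


# Made-up category frequencies for one postcode
freq = pd.Series({'Park': 0.5, 'Coffee Shop': 0.3, 'Pizza Place': 0.2, 'Bank': 0.0})

labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(2)]
print(dict(zip(labels, top_categories(freq, 2))))
```

A suffix helper of this kind avoids relying on an exception to choose between 'st', 'nd', 'rd' and 'th'.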

Run <b>k-means</b> to cluster the Toronto postcodes into 5 clusters

In [18]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='PostCode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# add clustering labels
postcodes_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto

# join the cluster labels and top venues onto toronto, which holds the latitude/longitude for each postcode
toronto_merged = toronto_merged.join(postcodes_venues_sorted.set_index('PostCode'), on='PostCode')

In [19]:
# tidy up the data for plotting
toronto_merged = toronto_merged.dropna()
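
The k-means step above groups postcodes whose category frequencies look alike. The toy example below is only a sketch: the matrix <b>X</b> is made-up data with three categories, and it uses two clusters rather than five so the result is easy to check by eye.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category-frequency rows (each row sums to 1), not real Foursquare data
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print(labels)
```

The first two rows share one label and the last two share the other. The label numbers themselves are arbitrary, so only the grouping is meaningful.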

<h4>Plot each postcode on the map, colored according to which of the 5 clusters it belongs to.</h4>

In [20]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostCode'], toronto_merged['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
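
The color scheme above builds one hex color per cluster and indexes it with <b>cluster-1</b>, so label 0 wraps around to the last color in the list. Each cluster still gets its own color. A more direct mapping might look like the following sketch, in which the names <b>palette</b> and <b>color_for</b> are assumptions.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced rainbow color per cluster, converted to hex strings
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label straight to its color
color_for = {label: palette[label] for label in range(kclusters)}

print(color_for)
```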