<h2>Segmenting and Clustering Neighborhoods in Toronto</h2>
<h3>Part 1 - Create df, wrangle data and cluster neighborhoods</h3>
Import the pandas and requests libraries

In [1]:
import pandas as pd
import requests

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Install the <b>Beautiful Soup (v4)</b> and <b>lxml</b> packages - if they are not already installed on the server

In [None]:
!conda install -c conda-forge beautifulsoup4 --yes
!conda install -c conda-forge lxml --yes

Create a dataframe (neighborhoods) and set the column names as <b>PostCode</b>, <b>Borough</b> and <b>Neighborhood</b>

In [2]:
# Define list of column names to be used in the neighborhoods dataframe
column_names = ['PostCode', 'Borough', 'Neighborhood'] 

# Instantiate the dataframe and set the column names
neighborhoods = pd.DataFrame(columns=column_names)

Import the Beautiful Soup library and request the target Wikipedia page. Scrape the post code data from the Wikipedia table.

In [3]:
# Import the beautiful soup library
from bs4 import BeautifulSoup

# Request the target Wikipedia page and keep its HTML text
wiki_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Create the BeautifulSoup object and assign it to the variable soup
soup = BeautifulSoup(wiki_html, 'lxml')

# Find the post code table (wikitable) and assign only those elements belonging to the table to a variable - pcode_tbl
pcode_tbl = soup.find('table', class_= 'wikitable')

Create a loop which cycles through each row in the table, scraping the postcode, borough and neighborhood data from each row and assigning these to the variables var_a, var_b and var_c. Remove the trailing newline characters ('\n') from the text strings and replace the forward slash separating neighborhoods with a comma. Rows with no assigned borough are skipped, and where a borough has no assigned neighborhood the borough name is used as the neighborhood. The cleaned data is then appended to the neighborhoods dataframe.

In [4]:
# Create a loop which cycles through each row in the table 
for rows in pcode_tbl.find_all('tr'):
    
    # Assign the cells of each row (<td>) to the variable cells
    cells = rows.find_all('td')

    # First test if there are 3 cells in the row - representing post code, borough and neighborhood
    if len(cells) == 3:
    
        # Assign data scraped from the <td> cells to variables for postcode, borough and neighborhood and clean the text
        var_a = cells[0].find(text=True).rstrip('\n')
        var_b = cells[1].find(text=True).rstrip('\n')
        var_c = cells[2].find(text=True).rstrip('\n').replace(' /',',')
        
        # Omit rows where no borough is assigned
        if var_b != 'Not assigned':
            
            # Test for post codes with a borough, yet not assigned neighborhood
            if var_c == 'Not assigned':
                
                # Set the neighborhood name to the same as the borough
                var_c = var_b
            
            # Create row data for cleaned postcode, borough and neighborhoods
            new_row = {'PostCode':var_a, 'Borough':var_b, 'Neighborhood':var_c}
            
            # Append the row data to the neighborhoods dataframe
            # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
            neighborhoods = pd.concat([neighborhoods, pd.DataFrame([new_row])], ignore_index=True)

# Check the populated dataframe
neighborhoods

Unnamed: 0,PostCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [5]:
neighborhoods.shape

(103, 3)
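
The cleaning rules applied inside the loop above can also be sketched as a small standalone function, which makes them easier to check without touching the network. This is only an illustrative sketch: the helper name `clean_row` and the sample cell texts (including the made-up post code `M0Z`) are assumptions, not part of the original notebook.

```python
def clean_row(postcode, borough, neighborhood):
    """Return a cleaned row dict, or None when no borough is assigned."""
    postcode = postcode.strip()
    borough = borough.strip()
    # Neighborhoods sharing a post code are separated by ' /' in the table
    neighborhood = neighborhood.strip().replace(' /', ',')

    # Omit rows where no borough is assigned
    if borough == 'Not assigned':
        return None

    # A borough without a neighborhood reuses the borough name
    if neighborhood == 'Not assigned':
        neighborhood = borough

    return {'PostCode': postcode, 'Borough': borough, 'Neighborhood': neighborhood}


# Hypothetical raw cell texts, as they might come out of the <td> cells
raw_rows = [
    ('M1A\n', 'Not assigned\n', 'Not assigned\n'),
    ('M5A\n', 'Downtown Toronto\n', 'Regent Park / Harbourfront\n'),
    ('M0Z\n', 'Example Borough\n', 'Not assigned\n'),
]

# Keep only the rows that survive cleaning
cleaned = [row for row in (clean_row(*r) for r in raw_rows) if row is not None]
print(cleaned)
```

A list of dicts like `cleaned` can be passed to `pd.DataFrame(cleaned, columns=column_names)` in one call, which avoids growing the dataframe one row at a time.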

<br><h3>Part 2 - Create new df and add latitude and longitude data</h3>
<h4>NOTE: Obtaining the latitude and longitude data from <b>geocoder</b> proved too unreliable, so I have elected to use the data contained in the csv instead.</h4>
If using <b>geocoder</b> then first install the geocoder package on the server and import the geocoder library

In [6]:
!conda install -c conda-forge geocoder --yes
import geocoder

Collecting package metadata (current_repodata.json): / ^C
- 
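
If a geocoding service is used despite the problems noted above, one common way to cope with lookups that intermittently return nothing is a bounded retry. The sketch below is only an assumption about how that might look: `lookup_with_retry` and `flaky_lookup` are hypothetical names, and `flaky_lookup` is a stand-in for a real geocoding call, not part of the geocoder package.

```python
import time


def lookup_with_retry(lookup, query, attempts=3, delay=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for attempt in range(attempts):
        result = lookup(query)
        if result is not None:
            return result
        # Pause before the next attempt, but not after the last one
        if attempt < attempts - 1:
            time.sleep(delay)
    return None


# Hypothetical stand-in for a geocoding call: fails twice, then succeeds
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return (43.753259, -79.329656)


coords = lookup_with_retry(flaky_lookup, 'M3A, Toronto, Ontario')
print(coords, calls['count'])
```

Capping the number of attempts keeps the notebook from hanging indefinitely when the service never responds with a usable result.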

Define a new dataframe (<b>toronto</b>) and set its column headers. Set a path to the csv containing the latitude and longitude data for the Toronto post codes and read the csv into a dataframe (<b>csv_df</b>).

In [15]:
# Define list of column names to be used in the new dataframe
column_names = ['PostCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# Instantiate the new dataframe and set the column names
toronto = pd.DataFrame(columns=column_names)

# Set the path to the csv file and read it into the new dataframe
path='https://cocl.us/Geospatial_data'
csv_df = pd.read_csv(path)

# After reading in the csv file, the line below can be used to check its contents and confirm that the read worked.
# For now, I've commented this line out to conserve screen space

#csv_df

Loop through each of the rows in the <b>neighborhoods</b> DataFrame that was created in <b>Part 1</b> above and find the entry in the <b>csv</b> data with the same post code. After finding a match, join the row from the <b>neighborhoods</b> DataFrame with the latitude and longitude from the <b>csv</b> data and add the combined row to the new DataFrame.

In [16]:
# Loop through each row of the neighborhoods dataframe
for ind in neighborhoods.index: 
    
    # Initialise the lat and long variables; they stay None if no matching postcode is found
    csv_lat = None
    csv_long = None
    
    # Loop through each row in the csv to find a matching Postcode
    for c_ind in csv_df.index:
        
        # Test if the neighborhood postcode matches the csv postcode
        if csv_df['Postal Code'][c_ind] == neighborhoods.PostCode[ind]:
            
            #If there is a match, assign the latitude and longitude to the csv_lat and csv_long variables
            csv_lat = csv_df.Latitude[c_ind]
            csv_long = csv_df.Longitude[c_ind]
    
    # Create row data containing data from the neighborhood df and latitude and longitude variables
    new_row = {
        'PostCode':neighborhoods.PostCode[ind],
        'Borough':neighborhoods.Borough[ind],
        'Neighborhood':neighborhoods.Neighborhood[ind],
        'Latitude':csv_lat,
        'Longitude':csv_long
    }
            
    # Append the row data into the new dataframe
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    toronto = pd.concat([toronto, pd.DataFrame([new_row])], ignore_index=True)

# Display the new dataframe    
toronto

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
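
The nested loop above matches post codes one pair at a time. The same join can be sketched with `DataFrame.merge`, which does the matching in a single call. The two small frames below are illustrative samples rather than the full data, and the post code `M0Z` is made up to show what happens when there is no match.

```python
import pandas as pd

# Small sample standing in for the neighborhoods dataframe from Part 1
neighborhoods_sample = pd.DataFrame({
    'PostCode': ['M3A', 'M4A', 'M0Z'],  # 'M0Z' is a made-up code with no csv match
    'Borough': ['North York', 'North York', 'Example Borough'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Example Borough'],
})

# Small sample standing in for csv_df, using the csv's 'Postal Code' header
coords_sample = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# Rename the key column so both frames share 'PostCode', then left-join
merged = neighborhoods_sample.merge(
    coords_sample.rename(columns={'Postal Code': 'PostCode'}),
    on='PostCode',
    how='left',
)
print(merged)
```

With `how='left'` every neighborhood row is kept in its original order, and post codes missing from the csv get `NaN` coordinates instead of being dropped.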


<h3>Part 3 - Visualise neighborhoods on a map</h3>
Install the folium and geopy packages on the server

In [10]:
!conda install -c conda-forge folium=0.5.0 --yes
!conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



Use the geopy library to get the latitude and longitude values of Toronto.

In [19]:
from geopy.geocoders import Nominatim

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
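
geopy's `geocode` returns `None` when the address cannot be resolved, in which case reading `location.latitude` raises an error. A minimal sketch of one way to guard against that is shown below, falling back to the mean of the neighborhood coordinates. The helper name `map_center` and the sample values are assumptions for illustration, and `SimpleNamespace` merely stands in for a geopy location object.

```python
from statistics import mean
from types import SimpleNamespace


def map_center(location, latitudes, longitudes):
    """Use the geocoded location if available, else the mean of the given points."""
    if location is not None:
        return location.latitude, location.longitude
    return mean(latitudes), mean(longitudes)


# Stand-in for a successful geocoding result
found = SimpleNamespace(latitude=43.6534817, longitude=-79.3839347)

# Illustrative coordinate lists (not the real neighborhood data)
sample_lats = [43.0, 44.0]
sample_lngs = [-79.0, -80.0]

center_found = map_center(found, sample_lats, sample_lngs)
center_fallback = map_center(None, sample_lats, sample_lngs)
print(center_found, center_fallback)
```

In the notebook, the fallback lists would be `toronto['Latitude']` and `toronto['Longitude']`, so the map would still open roughly over the plotted neighborhoods.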


Render a map of Toronto and place markers for each neighborhood/borough

In [18]:
# Import folium rendering library
import folium

# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for lat, lng, borough, neighborhood in zip(toronto['Latitude'], toronto['Longitude'], toronto['Borough'], toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7
    ).add_to(map_toronto)
    
map_toronto