# Segmenting and clustering neighborhoods in Toronto (Capstone Week 3)


## Step 1: turning the wiki data into a dataframe

First, import BeautifulSoup, fetch the Wikipedia page, and collect the rows of the relevant table.

In [1]:
from bs4 import BeautifulSoup
import numpy as np, pandas as pd, requests

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

wiki_table = soup.find(class_="wikitable")

wiki_rows = wiki_table.find_all("tr")


Now define an empty pandas dataframe to hold the results, then loop through each row of the BeautifulSoup extract, mapping each cell to postcode, borough, or neighborhood by its position in the table. Rows where the borough is not assigned are discarded, and the borough name is used where the neighborhood is not assigned. Neighborhoods that share a postcode are joined into one comma-separated string. The loop also builds dictionaries of boroughs (brs) and neighborhoods (nbs), keyed by postcode, for later use.

In [2]:

# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

neighborhoods.set_index('PostalCode', inplace=True)

brs = {}
nbs = {}

for row in wiki_rows:
    cells = row.find_all("td")
    cell_list = [x.get_text() for x in cells]
    # header rows have no <td> cells; data rows have three
    if len(cell_list) >= 3 and cell_list[1] != 'Not assigned':
        postal_code = cell_list[0]
        borough = cell_list[1]
        # the last cell carries a trailing newline; fall back to the borough
        # name when no neighborhood is assigned
        neighborhood = cell_list[2].replace('\n', '')
        if neighborhood == 'Not assigned':
            neighborhood = borough
        
        brs[postal_code] = borough
        
        if postal_code in nbs:
            nbs[postal_code] = nbs[postal_code] + ", " + neighborhood
        else:
            nbs[postal_code] = neighborhood

            
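# DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once
records = [{'PostalCode': pcode, 'Borough': brs[pcode], 'Neighborhood': nbs[pcode]}
           for pcode in brs]
neighborhoods = pd.concat([neighborhoods, pd.DataFrame(records)], ignore_index=True)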
    


In [3]:
print("The resulting pandas dataframe is the following size:")      
neighborhoods.shape

The resulting pandas dataframe is the following size:


(103, 3)
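
The row rules used in Step 1 can be checked without touching the network. The following is only a minimal sketch: `build_lookup` is a hypothetical helper, the sample rows are hand-written to resemble the wiki table, and it works on plain tuples rather than BeautifulSoup cells.

```python
def build_lookup(rows):
    """Apply the Step 1 rules to (postcode, borough, neighborhood) tuples."""
    boroughs = {}
    hoods = {}
    for postcode, borough, hood in rows:
        if borough == 'Not assigned':
            continue  # discard rows without a borough
        hood = hood.strip()
        if hood == 'Not assigned':
            hood = borough  # fall back to the borough name
        boroughs[postcode] = borough
        if postcode in hoods:
            hoods[postcode] += ', ' + hood  # join neighborhoods sharing a postcode
        else:
            hoods[postcode] = hood
    return boroughs, hoods


# hand-written sample rows in the same shape as the wiki table (illustrative only)
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned\n'),
    ('M5A', 'Downtown Toronto', 'Harbourfront\n'),
    ('M5A', 'Downtown Toronto', 'Regent Park\n'),
    ('M7A', "Queen's Park", 'Not assigned\n'),
]
sample_brs, sample_nbs = build_lookup(sample_rows)
print(sample_nbs)
```

With these sample rows, M1A is dropped, the two M5A rows are merged into "Harbourfront, Regent Park", and M7A takes its borough name as its neighborhood.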

## Step 2: obtain geocoding per postcode

In [4]:
!conda install -c conda-forge geocoder --yes 
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library



The code below tries up to three times per postcode to obtain a latitude and longitude, in case the service does not respond. I use the ArcGIS geocoder because the other services did not give useful results: Bing returned nothing, GeoNames returned null responses for several postcodes, and OSM could not match on postcodes, while matching on borough was too coarse.

In [5]:
import geocoder

latitude = {}
longitude = {}

with requests.Session() as session:
    for pc in nbs:
        query = '{}, Toronto, Ontario'.format(pc)
        print(query)

        # try up to three times in case the service does not respond
        lat_lng_coords = None
        for attempt in range(3):
            # pass the session so the HTTP connection is reused across requests
            g = geocoder.arcgis(query, session=session)
            lat_lng_coords = g.latlng
            if lat_lng_coords:
                break

        # postcodes that never resolve are left out and become NaN in the dataframe
        if lat_lng_coords:
            latitude[pc] = lat_lng_coords[0]
            longitude[pc] = lat_lng_coords[1]

neighborhoods['Latitude'] = neighborhoods['PostalCode'].map(latitude)
neighborhoods['Longitude'] = neighborhoods['PostalCode'].map(longitude)

M3A, Toronto, Ontario
M4A, Toronto, Ontario
M5A, Toronto, Ontario
M6A, Toronto, Ontario
M7A, Toronto, Ontario
M9A, Toronto, Ontario
M1B, Toronto, Ontario
M3B, Toronto, Ontario
M4B, Toronto, Ontario
M5B, Toronto, Ontario
M6B, Toronto, Ontario
M9B, Toronto, Ontario
M1C, Toronto, Ontario
M3C, Toronto, Ontario
M4C, Toronto, Ontario
M5C, Toronto, Ontario
M6C, Toronto, Ontario
M9C, Toronto, Ontario
M1E, Toronto, Ontario
M4E, Toronto, Ontario
M5E, Toronto, Ontario
M6E, Toronto, Ontario
M1G, Toronto, Ontario
M4G, Toronto, Ontario
M5G, Toronto, Ontario
M6G, Toronto, Ontario
M1H, Toronto, Ontario
M2H, Toronto, Ontario
M3H, Toronto, Ontario
M4H, Toronto, Ontario
M5H, Toronto, Ontario
M6H, Toronto, Ontario
M1J, Toronto, Ontario
M2J, Toronto, Ontario
M3J, Toronto, Ontario
M4J, Toronto, Ontario
M5J, Toronto, Ontario
M6J, Toronto, Ontario
M1K, Toronto, Ontario
M2K, Toronto, Ontario
M3K, Toronto, Ontario
M4K, Toronto, Ontario
M5K, Toronto, Ontario
M6K, Toronto, Ontario
M1L, Toronto, Ontario
M2L, Toron
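
The retry logic used for geocoding can also be separated from the geocoder itself, which makes it easy to test offline. This is a sketch under assumptions: `first_successful` is a hypothetical helper name, and the fake responses stand in for a service that fails twice before answering.

```python
import time


def first_successful(fetch, attempts=3, pause=0.0):
    """Call fetch() up to `attempts` times; return the first non-empty result, else None."""
    for attempt in range(attempts):
        result = fetch()
        if result:
            return result
        if attempt < attempts - 1:
            time.sleep(pause)  # optional wait before the next try
    return None


# stand-in for a geocoder that returns nothing twice, then a [lat, lng] pair
responses = iter([None, None, [43.65512, -79.36264]])
coords = first_successful(lambda: next(responses))
print(coords)
```

In the notebook, the callable would wrap the real lookup, for example `lambda: geocoder.arcgis(query).latlng`. A small non-zero `pause` may be kinder to the service, though the original loop does not wait between attempts.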

## Result: Output the data frame

In [6]:
neighborhoods

|  | Borough | Neighborhood | PostalCode | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | North York | Parkwoods | M3A | 43.752440 | -79.329271 |
| 1 | North York | Victoria Village | M4A | 43.730421 | -79.313320 |
| 2 | Downtown Toronto | Harbourfront, Regent Park | M5A | 43.655120 | -79.362640 |
| 3 | North York | Lawrence Heights, Lawrence Manor | M6A | 43.723125 | -79.451589 |
| 4 | Queen's Park | Queen's Park | M7A | 43.661102 | -79.391035 |
| 5 | Etobicoke | Islington Avenue | M9A | 43.662242 | -79.528379 |
| 6 | Scarborough | Rouge, Malvern | M1B | 43.811525 | -79.195517 |
| 7 | North York | Don Mills North | M3B | 43.749195 | -79.361905 |
| 8 | East York | Woodbine Gardens, Parkview Hill | M4B | 43.707535 | -79.311773 |
| 9 | Downtown Toronto | Ryerson, Garden District | M5B | 43.657363 | -79.378180 |


### Map the results as a sense check

In [7]:
centre_latitude = 43.655115
centre_longitude = -79.380219

centres_map = folium.Map(location=[centre_latitude, centre_longitude], zoom_start=11) 

# add the postcode centres as red circle markers
for lat, lng, label in zip(neighborhoods.Latitude, neighborhoods.Longitude, neighborhoods.PostalCode):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='red',
        popup=label,
        fill = True,
        fill_color='red',
        fill_opacity=0.6
    ).add_to(centres_map)

# display map
centres_map
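
A numeric check can complement the visual one. The sketch below flags any point that falls outside a rough bounding box around Toronto. The box limits are approximate assumptions rather than official boundaries, and `outside_box` is a hypothetical helper name.

```python
# rough bounding box for Toronto; these limits are approximate assumptions
LAT_RANGE = (43.55, 43.90)
LNG_RANGE = (-79.70, -79.10)


def outside_box(points):
    """Return the postcodes whose (lat, lng) fall outside the assumed box."""
    flagged = []
    for postcode, (lat, lng) in points.items():
        in_lat = LAT_RANGE[0] <= lat <= LAT_RANGE[1]
        in_lng = LNG_RANGE[0] <= lng <= LNG_RANGE[1]
        if not (in_lat and in_lng):
            flagged.append(postcode)
    return flagged


# two rows from the result table plus one deliberately wrong point
points = {
    'M3A': (43.752440, -79.329271),
    'M1B': (43.811525, -79.195517),
    'XXX': (45.421530, -75.697193),  # made-up outlier far from Toronto
}
print(outside_box(points))
```

For the real data, the input could be built with `dict(zip(neighborhoods.PostalCode, zip(neighborhoods.Latitude, neighborhoods.Longitude)))`. A postcode whose geocoding failed has NaN coordinates, which fail both range comparisons and are therefore flagged as well.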