# Clustering Toronto Neighbourhoods

### A. Gutmanas
### Feb 2020



---
## Part 1:
### Get, clean and load neighbourhood locations

A quick Google search and some critical review point to the City of Toronto website (https://www.toronto.ca) and its "Open Data" portal (https://open.toronto.ca), which contains the necessary data. Nevertheless, I will follow the instructions from the course.

1. Scrape the list of postcodes for Toronto from the Wikipedia page at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and load it into a dataframe.
2. Drop rows where borough is _not assigned_.
3. Normalise the dataframe so that neighbourhoods with the same postal code are combined into a comma-separated list.

Let's start by importing the necessary libraries. Some of them will be needed later.

In [None]:
#!pip install folium    # uncomment if library not available
#!pip install shapely   # uncomment if library not available

In [1]:
# import libraries
import json
from shapely.geometry import shape, Point # will be needed later
import pandas as pd
import requests
import folium
import bs4

Get the raw HTML from the Wikipedia page

In [2]:
# get the raw data
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wiki_text = requests.get(wiki_url).text

Find the actual table on the page, get the column names (from _th_ tags) and the values (from _td_ tags), and load the resulting data into a dataframe. One could, of course, check for "Not assigned" values and combine neighbourhoods while parsing. It seems cleaner not to do that here, even if it means loading some rows first and dropping them afterwards.

In [3]:
# find the table with Toronto postal codes
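# name the parser explicitly so the result does not depend on which parsers are installed
wiki_tables = bs4.BeautifulSoup(wiki_text, "html.parser").find_all("table", attrs={"class": "wikitable sortable"})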

cols = [x.get_text().strip() for x in wiki_tables[0].find_all("th")]
rows = wiki_tables[0].find_all("tr")
values = []
for row in rows[1:]:
    values.append([x.get_text().strip() for x in row.find_all("td")])

toronto_postcodes = pd.DataFrame(columns=cols, data=values)
toronto_postcodes.head()






Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
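The _th_/_td_ extraction can also be tried offline on a small inline snippet. The sketch below is an illustration only: it swaps BeautifulSoup for the standard library's `html.parser`, and the `TableRows` class and `sample_html` string are made up for this example rather than taken from the Wikipedia page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # text outside a cell (e.g. whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical snippet in the same shape as the Wikipedia table
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td><a href="#">Parkwoods</a>
</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
```

Each entry of `records` is a dictionary keyed by column name, which `pd.DataFrame(records)` would accept directly.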


Now, some cleanup. Drop "Not assigned" boroughs.

In [4]:
toronto_postcodes.drop(toronto_postcodes.loc[toronto_postcodes['Borough']=="Not assigned"].index, inplace=True)
toronto_postcodes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Check if there are any neighbourhoods that are "Not assigned" and copy the name of the corresponding borough instead. Check how many such cases there are.

In [5]:
indices = toronto_postcodes.loc[toronto_postcodes['Neighbourhood']=="Not assigned"].index
toronto_postcodes.loc[indices,"Neighbourhood"] = toronto_postcodes.loc[indices,"Borough"]

toronto_postcodes.loc[indices,:] 

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Queen's Park


OK. Just that one case. 

_Actually, having lived in Toronto, I can say that Queen's Park doesn't exactly qualify as a borough, but it looks like the Ontario legislature and government wish to have a postal code area all to themselves!_

Now the fun bit: group the neighbourhoods by their postal code area and concatenate them into a comma-separated list.
Out of curiosity, also check whether any postal code covers more than one borough.

In [6]:
# group by a single column name so that the group keys are plain strings
tg = toronto_postcodes.groupby("Postcode")
postcodes = list(tg.groups.keys())
boroughs = []
neighbourhoods = []

for code in postcodes:
    area = tg.get_group(code)
    if area["Borough"].nunique() != 1:
        print(f"Postal code {code} covers an area in {area['Borough'].nunique()} boroughs. Keeping only the first one")
    # select the borough by column name rather than by position
    boroughs.append(area["Borough"].iloc[0])
    neighbourhoods.append(pd.Series(area["Neighbourhood"].unique()).str.cat(sep=", "))
    
toronto_work_df = pd.DataFrame({
    "Postal_code": postcodes,
    "Borough": boroughs,
    "Neighbourhood": neighbourhoods
})
    
toronto_work_df.head()    

Unnamed: 0,Postal_code,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


There is probably a better way of achieving this without looping over the individual groups of a dataframe, but we are not dealing with massive data, so I'll be lazy and leave it as is. (BTW, no postal code spreads over more than one borough.)
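For reference, one loop-free variant is a single `groupby` with named aggregations. This is only a sketch on a toy dataframe (the rows and the `Borough_count` column are made up for the illustration), not a replacement for the cell above.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative)
toy = pd.DataFrame({
    "Postcode": ["M1A", "M6A", "M6A", "M7A"],
    "Borough": ["Not assigned", "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

# Drop unassigned boroughs with a boolean mask
assigned = toy[toy["Borough"] != "Not assigned"]

# One pass: first borough, number of distinct boroughs, joined neighbourhoods
grouped = (
    assigned.groupby("Postcode", as_index=False)
    .agg(
        Borough=("Borough", "first"),
        Borough_count=("Borough", "nunique"),
        Neighbourhood=("Neighbourhood", lambda s: ", ".join(s.unique())),
    )
    .rename(columns={"Postcode": "Postal_code"})
)
```

Any row with `Borough_count` greater than 1 would flag a postal code that spans several boroughs, which replaces the `print` inside the loop.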

### End of Part 1
Check the size of the resulting dataframe.

In [7]:
toronto_work_df.shape

(103, 3)

### An alternative way to get a list of Toronto's neighbourhoods (with geolocation data!)
Just for fun, I will also load the neighbourhood geodata from https://open.toronto.ca/dataset/neighbourhoods, which can be downloaded in CSV, GeoJSON and a few other formats. This is easier than scraping Wikipedia, which also contains a different list of neighbourhoods and boroughs. The exact link for the CSV is https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=csv&projection=4326, and it is easy to load into a pandas dataframe.

This dataset lacks "boroughs", which are the old municipalities that existed before the City of Toronto was amalgamated in 1998. This information could be useful at some point, and the geographic boundaries for these areas are available from https://open.toronto.ca/dataset/former-municipality-boundaries/. The GeoJSON file can be downloaded from https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/f82dbe76-928e-4cec-8147-a21882f575e2?format=geojson&projection=4326

In [8]:
# Download the CSV with Toronto neighbourhoods and load into a dataframe
toronto_raw = pd.read_csv("https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=csv&projection=4326")
toronto_raw.head()

Unnamed: 0,_id,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,X,Y,LONGITUDE,LATITUDE,OBJECTID,Shape__Area,Shape__Length,geometry
0,3221,25886861,25926662,49885,94,94,Wychwood (94),Wychwood (94),,,-79.425515,43.676919,16491505,3217960.0,7515.779658,"{u'type': u'Polygon', u'coordinates': (((-79.4..."
1,3222,25886820,25926663,49885,100,100,Yonge-Eglinton (100),Yonge-Eglinton (100),,,-79.40359,43.704689,16491521,3160334.0,7872.021074,"{u'type': u'Polygon', u'coordinates': (((-79.4..."
2,3223,25886834,25926664,49885,97,97,Yonge-St.Clair (97),Yonge-St.Clair (97),,,-79.397871,43.687859,16491537,2222464.0,8130.411276,"{u'type': u'Polygon', u'coordinates': (((-79.3..."
3,3224,25886593,25926665,49885,27,27,York University Heights (27),York University Heights (27),,,-79.488883,43.765736,16491553,25418210.0,25632.335242,"{u'type': u'Polygon', u'coordinates': (((-79.5..."
4,3225,25886688,25926666,49885,31,31,Yorkdale-Glen Park (31),Yorkdale-Glen Park (31),,,-79.457108,43.714672,16491569,11566690.0,13953.408098,"{u'type': u'Polygon', u'coordinates': (((-79.4..."


In [17]:
# create a new dataframe with relevant columns only 
toronto_base = toronto_raw[["AREA_SHORT_CODE", "LONGITUDE", "LATITUDE"]].copy()
toronto_base.head()

Unnamed: 0,AREA_SHORT_CODE,LONGITUDE,LATITUDE
0,94,-79.425515,43.676919
1,100,-79.40359,43.704689
2,97,-79.397871,43.687859
3,27,-79.488883,43.765736
4,31,-79.457108,43.714672


In [18]:
# add cleaned up names of neighbourhoods
toronto_base["AREA_NAME"] = [x[:x.find('(')-1] for x in toronto_raw["AREA_NAME"]]                               
toronto_base.head()

Unnamed: 0,AREA_SHORT_CODE,LONGITUDE,LATITUDE,AREA_NAME
0,94,-79.425515,43.676919,Wychwood
1,100,-79.40359,43.704689,Yonge-Eglinton
2,97,-79.397871,43.687859,Yonge-St.Clair
3,27,-79.488883,43.765736,York University Heights
4,31,-79.457108,43.714672,Yorkdale-Glen Park
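The slice-based clean-up above works because every `AREA_NAME` in this dataset ends with a code in parentheses. If a name ever lacked one, `find` would return -1 and the slice would silently chop two characters off the end. A regular expression is one way to guard against that; the helper name `strip_area_code` and the third sample name are made up for this sketch.

```python
import re

# Matches a trailing numeric code in parentheses, e.g. " (94)"
_CODE_SUFFIX = re.compile(r"\s*\(\d+\)\s*$")


def strip_area_code(name):
    """Remove a trailing '(NN)' code; names without one are returned unchanged."""
    return _CODE_SUFFIX.sub("", name)


names = ["Wychwood (94)", "Yonge-St.Clair (97)", "No Code Here"]

cleaned = [strip_area_code(n) for n in names]
# The slice from the notebook cell, for comparison
sliced = [n[:n.find("(") - 1] for n in names]
```

On the two real names both approaches agree; only the made-up name without a code is handled differently.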


In [13]:
# download GeoJSON with data for old municipalities (i.e., boroughs)
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/f82dbe76-928e-4cec-8147-a21882f575e2?format=geojson&projection=4326"
boroughs_geoJSON = requests.get(url).json()

For each neighbourhood create a "Point" object and loop over the boroughs GeoJSON to check which borough the point belongs to. 

In [19]:
# build each borough polygon once instead of once per neighbourhood
borough_shapes = [(feature['properties']['AREA_NAME'], shape(feature['geometry']))
                  for feature in boroughs_geoJSON['features']]

boroughs = []
for index, neighbourhood in toronto_base.iterrows():
    point = Point(neighbourhood["LONGITUDE"], neighbourhood["LATITUDE"])
    for borough_name, polygon in borough_shapes:
        if polygon.contains(point):
            boroughs.append(borough_name)
            break
    else:
        # no borough contains this point; keep the list aligned with the dataframe
        boroughs.append(None)

toronto_base["BOROUGH"] = boroughs
toronto_base.sort_values(by=["BOROUGH", "AREA_SHORT_CODE"], inplace=True)
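To make the containment test less of a black box, here is a minimal pure-Python sketch of the classic ray-casting idea on toy squares. It is not how shapely implements `contains`, it only handles GeoJSON features of type `Polygon` with 2D coordinates, and the helper names and toy features are made up for the illustration.

```python
def point_in_ring(x, y, ring):
    """Ray casting: count how many ring edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_area(lon, lat, features):
    """Return AREA_NAME of the first Polygon feature containing the point, else None."""
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # MultiPolygon etc. are out of scope for this sketch
        outer, *holes = geometry["coordinates"]
        in_hole = any(point_in_ring(lon, lat, hole) for hole in holes)
        if point_in_ring(lon, lat, outer) and not in_hole:
            return feature["properties"]["AREA_NAME"]
    return None


# Two adjacent unit squares standing in for borough boundaries
toy_features = [
    {"properties": {"AREA_NAME": "WEST"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
    {"properties": {"AREA_NAME": "EAST"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}},
]

west = find_area(0.5, 0.5, toy_features)
east = find_area(1.5, 0.5, toy_features)
outside = find_area(3.0, 3.0, toy_features)
```

Returning `None` for a point outside every polygon mirrors the case where a neighbourhood centroid falls in no borough.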

In [20]:
toronto_base.head()

Unnamed: 0,AREA_SHORT_CODE,LONGITUDE,LATITUDE,AREA_NAME,BOROUGH
29,54,-79.312228,43.7068,O'Connor-Parkview,EAST YORK
57,55,-79.349984,43.707749,Thorncliffe Park,EAST YORK
9,56,-79.366072,43.703797,Leaside-Bennington,EAST YORK
91,57,-79.35563,43.688825,Broadview North,EAST YORK
32,58,-79.335488,43.696781,Old East York,EAST YORK


In [16]:
toronto_base.shape

(140, 5)

So, there are 103 postal code areas in Toronto, and 140 official neighbourhoods recognised by the City of Toronto. For the purposes of the project, it is probably immaterial which of the approaches is used. 

### End of alternative data download

---
## Part 2:
### Obtain latitude and longitude for the neighbourhoods

Following the instructions, use the geocoder library and look up each postcode, retrying when no result comes back (left unchecked, this could become an infinite loop).

In [22]:
!pip install geocoder


In [23]:
import geocoder

Let's define a function to obtain coordinates for a Toronto postcode area

In [58]:
def get_lat_lng(postal_code, suffix="Toronto, Ontario", max_iter=10):
    """
    Geocode a postal code combined with the city/province/country in the suffix.
    Make no more than max_iter attempts.
    Return a tuple of (latitude, longitude), or None if every attempt fails.
    """
    result = None
    i = 0
    while result is None and i < max_iter:
        # The google method failed to return anything even after 1000 iterations.
        # By trial and error I found that arcgis does the job.
        # I am not sure this is a permissible free use of the service, so will look for other options
        g = geocoder.arcgis(f'{postal_code}, {suffix}')
        result = g.json
        i += 1

    if result:
        return result['lat'], result['lng']
    else:
        return None
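The loop in `get_lat_lng` retries immediately, which can hammer a service that is rate-limiting. A common refinement is to wait a little longer after each failure. The sketch below shows the idea with a stand-in lookup function; `retry`, `flaky_lookup` and the delay values are assumptions made for the example, and no real geocoding call is involved.

```python
import time


def retry(fetch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    Returns None if every attempt fails.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for a geocoding call: fails twice, then succeeds
calls = {"n": 0}


def flaky_lookup():
    calls["n"] += 1
    return (43.7, -79.4) if calls["n"] >= 3 else None


# Record the delays instead of actually sleeping
delays = []
coords = retry(flaky_lookup, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run; with the default `time.sleep` the same call would pause between attempts.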

Now let's iterate over the postcodes and actually obtain the coordinates. Then add them to the dataframe.

In [62]:
latitudes = []
longitudes = []

for code in postcodes:
    ll = get_lat_lng(code)
    if ll is None:
        latitudes.append(None)
        longitudes.append(None)
        print("None for ", code)
    else:
        latitudes.append(ll[0])
        longitudes.append(ll[1])
        
    
toronto_work_df["Latitude"] = latitudes
toronto_work_df["Longitude"] = longitudes
toronto_work_df.head()

Unnamed: 0,Postal_code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193
3,M1G,Scarborough,Woburn,43.768369,-79.21759
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944


A quick sanity check.

In [65]:
toronto_work_df.loc[toronto_work_df["Postal_code"] == "M4R"]

Unnamed: 0,Postal_code,Borough,Neighbourhood,Latitude,Longitude
46,M4R,Central Toronto,North Toronto West,43.714523,-79.40696


In [67]:
# Coordinates for Yonge and Eglinton (roughly central)
longitude = -79.403590 
latitude = 43.704689
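The sanity check above compares the M4R result with Yonge and Eglinton by eye. One way to put a number on it is the haversine great-circle distance, sketched below with the two coordinate pairs from this notebook. The function name and the Earth-radius constant are choices made for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


m4r = (43.714523, -79.40696)              # geocoded result for M4R
yonge_eglinton = (43.704689, -79.403590)  # reference point used for the map

distance_km = haversine_km(*m4r, *yonge_eglinton)
```

The two points should come out on the order of a kilometre apart, which is consistent with M4R covering the area just north-west of the intersection.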

Now let's create a map!

In [86]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11, tiles='Stamen Watercolor')
# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_base['LATITUDE'], toronto_base['LONGITUDE'], toronto_base['BOROUGH'], toronto_base['AREA_NAME']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
for lat, lng, borough, neighborhood, postcode in zip(toronto_work_df['Latitude'], 
                                                     toronto_work_df['Longitude'], 
                                                     toronto_work_df['Borough'], 
                                                     toronto_work_df['Neighbourhood'],
                                                     toronto_work_df['Postal_code']):
    label = '{} {}, {}'.format(postcode, neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='pink',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
    
folium.GeoJson(
    boroughs_geoJSON,
    name='geojson'
).add_to(map_toronto)

map_toronto

The map above shows the postal code areas from Wikipedia in red and the neighbourhoods downloaded from the City of Toronto Open Data portal in blue.

There is a curious concentration of red dots (postal codes) in the city centre. This is not too surprising: the area has many high-rises and many businesses, so the postal code density is expected to be higher there.

---
### End of Part 2

## Part 3:
### Looking into venues with Foursquare, and clustering of neighbourhoods by popularity of venues.

