# Toronto Clustering Exercise

Explore and cluster the neighborhoods in the city of Toronto. 

# 1. Get and normalise the data we need

## Area codes
First up, let's get the postal codes from Wikipedia and parse the details until we get a suitable DataFrame. We could use html.parser to fully parse the source Wiki page, or we could use a regular expression, but pandas has a read_html function that works on this! It actually reads all of the tables on the page, but we are only interested in the first one (the first row is the header). Under the hood, read_html relies on an HTML parsing library; depending on what is installed, this can be lxml or the BeautifulSoup package mentioned in the document.

In [3]:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_list = pd.read_html(url, header=0)
df_codes = df_list[0]
df_codes.head(2)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned


## Area codes of interest only

In order to get the DataFrame of interest we need to:
- Remove rows whose borough is 'Not assigned'
- Combine neighbourhoods in the same postal code area into one comma-separated value (if you've ever used relational databases this seems so wrong!)
- Use the borough name as the neighbourhood wherever the neighbourhood is 'Not assigned'
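
As a rough illustration of these three steps, here is a minimal sketch on a tiny hand-made sample. It uses groupby with agg rather than the combining function defined in the next cell, so it is an alternative formulation, not the notebook's own method. The sample DataFrame is an assumption made up for the example, and the 'M9Z' / 'Example Borough' row is hypothetical.

```python
import pandas as pd

# Hand-made sample in the same layout as the Wikipedia table.
# The 'M9Z' / 'Example Borough' row is hypothetical.
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M1B', 'M1B', 'M9Z'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', 'Example Borough'],
    'Neighbourhood': ['Not assigned', 'Rouge', 'Malvern', 'Not assigned'],
})

# Step 1: drop rows whose borough is not assigned
assigned = sample[sample['Borough'] != 'Not assigned']

# Step 2: one row per postcode, neighbourhoods joined with commas
combined = (assigned
            .groupby(['Postcode', 'Borough'], as_index=False)
            .agg({'Neighbourhood': ','.join}))

# Step 3: fall back to the borough name where the neighbourhood is not assigned
combined['Neighbourhood'] = combined['Neighbourhood'].where(
    combined['Neighbourhood'] != 'Not assigned', combined['Borough'])

print(combined)
```

Grouping on both Postcode and Borough keeps the borough column in the result without needing to pick the first value by hand; this assumes each postcode belongs to exactly one borough.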


In [4]:
# define our combining function for postcode matches
# take the first Postcode and Borough and concatenate the Neighbourhoods
def doCombine(x):
    data = [' ']*3
    data[0] = x['Postcode'].iloc[0]
    data[1] = x['Borough'].iloc[0]
    data[2] = x['Neighbourhood'].str.cat(sep=',')
    return pd.Series(data, ['Postcode', 'Borough', 'Neighbourhood'])


In [5]:
import numpy as np

# drop unassigned Boroughs
df_codes_norm = df_codes[df_codes.Borough != 'Not assigned'] 

# combine Postcodes (Borough will also match so include)
df_codes_norm = df_codes_norm.groupby(['Postcode'], as_index=False).apply(doCombine) 

# now map neighbourhoods to boroughs
df_codes_norm['Neighbourhood'] = np.where(
    df_codes_norm['Neighbourhood'] == 'Not assigned', 
    df_codes_norm['Borough'], 
    df_codes_norm['Neighbourhood'])
df_codes_norm.head(2)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"


## Check the shape

Finally we check the shape of the result.

In [6]:
df_codes_norm.shape

(103, 3)

# 2. Geocoder Section 
Use the geocoder library to add coordinates. This is kept in its own section because it is time-consuming and we don't want to run it more than once.

In [12]:
# Uncomment if geocoder not already installed
# !conda install -c conda-forge geocoder --yes 


In [9]:
import geocoder # import the geocoder library

# Initialise the new columns with float values
df_codes_norm['Latitude']=0.0
df_codes_norm['Longitude']=0.0
is_in_error=False
for index, row in df_codes_norm.iterrows():

    # Retry up to 3 times until we get a match
    times = 0
    lat_lng_coords = None
    while(lat_lng_coords is None and times < 3):
        times = times + 1
        g = geocoder.google('{}, Toronto, Ontario'.format(row['Postcode']))
        lat_lng_coords = g.latlng  # stays None when the lookup fails
    
    if lat_lng_coords is None:
        print('Error encountered with geocoder')
        is_in_error = True
        break
    else:
        # write through .at: the row yielded by iterrows() is a copy
        df_codes_norm.at[index, 'Latitude'] = lat_lng_coords[0]
        df_codes_norm.at[index, 'Longitude'] = lat_lng_coords[1]
        

df_codes_norm.head()

Error encountered with geocoder


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",0,0
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",0,0
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",0,0
3,M1G,Scarborough,Woburn,0,0
4,M1H,Scarborough,Cedarbrae,0,0
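
The retry logic in the loop above can be isolated into a small helper, which makes it easier to test without calling a real service. This is only a sketch: lookup_with_retry and flaky_lookup are hypothetical names introduced here, and flaky_lookup is a stand-in that fails twice before returning the M1B coordinates, not a real geocoder call.

```python
import time

def lookup_with_retry(lookup, query, max_tries=3, delay=0.0):
    """Call lookup(query) until it returns a non-None value or tries run out."""
    for _ in range(max_tries):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay)  # optional pause between attempts
    return None

# Hypothetical stand-in for a geocoding service: fails twice, then succeeds
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    if calls['n'] < 3:
        return None
    return [43.806686, -79.194353]

coords = lookup_with_retry(flaky_lookup, 'M1B, Toronto, Ontario')
print(coords, 'after', calls['n'], 'calls')
```

With a real service, lookup could be a small function that wraps the geocoder call and returns its latlng value; a non-zero delay is usually kinder to rate-limited APIs.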


Since the geocoder lookup failed, the following lines download a CSV file of coordinates instead (wget needs to be installed first).

In [39]:
# Uncomment if the file hasn't been downloaded
# !wget -O Geospatial_data 'http://cocl.us/Geospatial_data'

# 'Postal Code' is kept as a regular column; the lookup below filters on it
df_dat = pd.read_csv('Geospatial_data')
df_dat.head()


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [45]:
if (is_in_error):
    for index, row in df_codes_norm.iterrows():

        # Look up the code
        code = row['Postcode']
        loc_row = df_dat.loc[df_dat['Postal Code'] == code]
        if not loc_row.empty:  # .loc returns an empty DataFrame, never None
            df_codes_norm.at[index,'Latitude'] = loc_row['Latitude'].iloc[0]
            df_codes_norm.at[index,'Longitude'] = loc_row['Longitude'].iloc[0]

df_codes_norm.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
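
The row-by-row lookup above could also be written as a single join. The following is a minimal sketch of that alternative, assuming the same column names as in the two DataFrames above; the small codes and coords frames are hand-copied from the outputs shown earlier so the example runs on its own.

```python
import pandas as pd

# Two rows copied from the outputs above, so the example is self-contained
codes = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge,Malvern', 'Highland Creek,Rouge Hill,Port Union'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Left join keeps every postcode, even one with no coordinates (it gets NaN)
merged = (codes
          .merge(coords, left_on='Postcode', right_on='Postal Code', how='left')
          .drop(columns='Postal Code'))

print(merged)
```

A left join avoids the explicit loop, and any postcode missing from the coordinates file shows up as NaN rather than being silently left at 0.0.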
