# Segmenting and Clustering Neighborhoods in Toronto
## PART 1 - LOADING DATA
The first step is to load the data from Wikipedia.

In [2]:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text, 'html.parser')

Now we have to:
    1. Convert the HTML table into a pandas DataFrame
    2. Remove the useless rows (the ones with no neighbourhood information, marked "Not assigned")

In [3]:
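from io import StringIO
import pandas as pd
import numpy as np
table = soup.select_one('table')  # the first table on the page holds the postal codes
# Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
dfX = pd.read_html(StringIO(table.prettify()))[0]
# Keep only the rows that have a neighbourhood assigned
dfX = dfX[dfX.Neighbourhood != "Not assigned"]
dfX.head()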

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Next, I aggregate the rows by Postcode and Borough, joining the neighbourhood names of each group into one comma-separated string.

In [4]:
df = dfX.groupby(by = ["Postcode", "Borough"]).agg(lambda col: ','.join(col)).reset_index()
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
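
To see the aggregation in isolation, here is a minimal sketch on a toy dataframe. The three rows are typed in by hand for illustration, and the names `toy` and `merged` are not part of the notebook. Each (Postcode, Borough) group collapses to one row whose neighbourhood names are joined with commas.

```python
import pandas as pd

# Toy rows typed in by hand, for illustration only
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per (Postcode, Borough); neighbourhood names joined with commas
merged = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(",".join)
       .reset_index()
)
print(merged)
```

Selecting the `Neighbourhood` column before calling `agg` makes it explicit which column is being joined, which may be easier to read than applying a lambda to every remaining column.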


Just a quick check that the aggregation worked: M5A should now list both of its neighbourhoods.

In [5]:
df[df.Postcode == "M5A"]

Unnamed: 0,Postcode,Borough,Neighbourhood
53,M5A,Downtown Toronto,"Harbourfront,Regent Park"


Print the shape of the dataframe:

In [6]:
df.shape

(102, 3)

## PART 2 - LOCATION

Load the ArcGIS coordinates for each postcode and store them in the dataframe.
Because geocoding is slow and unreliable, I saved the output to a file.

I will reuse that file in part 3 too :)
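
One possible way to structure the "save once, reuse later" idea is a small cache helper. This is only a sketch: `load_or_compute`, `fake_geocode_all` and `positions.json` are hypothetical names, the coordinates are placeholders, and the notebook itself saves a CSV with pandas rather than JSON.

```python
import json
import os
import tempfile

def load_or_compute(path, compute):
    """Return the cached result stored at `path`, or call `compute()` and cache it."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    result = compute()
    with open(path, "w") as fh:
        json.dump(result, fh)
    return result

# Stand-in for the slow geocoding loop; the coordinates are placeholders
calls = []
def fake_geocode_all():
    calls.append(1)
    return [{"code": "M1B", "lat": 0.0, "lng": 0.0}]

cache_path = os.path.join(tempfile.mkdtemp(), "positions.json")
first = load_or_compute(cache_path, fake_geocode_all)   # computes and writes the file
second = load_or_compute(cache_path, fake_geocode_all)  # reads the file, no recomputation
```

With a helper like this, rerunning the notebook only hits the geocoding service when the cache file is missing.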

In [33]:
import geocoder
postcodeposition = []
for code in df.Postcode.values:
    # Ask ArcGIS for the coordinates of this postcode
    pos = geocoder.arcgis(location = '{}, Toronto, Ontario'.format(code)).json
    if pos['ok']:
        postcodeposition.append({'code': code, 'lat': pos['lat'], 'lng': pos['lng']})
    else:
        print(pos)  # show the failed response so it can be inspected
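
Since the lookups can fail intermittently, one option is to retry a failed call a few times before giving up. The helper below is a sketch, not something the notebook does: `with_retries` and `flaky_lookup` are hypothetical names, and the flaky function only simulates a service that fails twice before answering with placeholder coordinates.

```python
import time

def with_retries(fetch, attempts=3, delay=1.0):
    """Call `fetch()` until it returns something other than None, or attempts run out."""
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(delay * (attempt + 1))  # wait a little longer after each failure
    return None

# Simulated unreliable lookup: fails twice, then succeeds (placeholder values)
state = {"calls": 0}
def flaky_lookup():
    state["calls"] += 1
    if state["calls"] < 3:
        return None
    return {"lat": 0.0, "lng": 0.0}

found = with_retries(flaky_lookup, attempts=5, delay=0)
```

In the loop above, `fetch` could be a small function that calls `geocoder.arcgis(...)` and returns the response only when its `'ok'` field is true, and `None` otherwise.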

In [44]:
# Index the geocoded positions by postcode for direct lookup
positions = {p['code']: p for p in postcodeposition}
df['lat'] = df['Postcode'].map(lambda code: positions[code]['lat'])
df['lng'] = df['Postcode'].map(lambda code: positions[code]['lng'])
df.to_csv('complete_dataframe.csv')

In [45]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,lat,lng
0,M1B,Scarborough,"Rouge,Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.78573,-79.15875
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.76569,-79.175256
3,M1G,Scarborough,Woburn,43.768359,-79.21759
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944
