## Segmenting and Clustering Neighborhoods in Toronto

Before fetching and exploring the data, let's import all the libraries we will need.

In [1]:
import requests # library to handle requests
from lxml import html
from bs4 import BeautifulSoup

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

In [4]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(r.content, 'html.parser')
data = []
table = soup.find('table', class_="wikitable")
table_body = table.find('tbody')
rows = table_body.find_all('tr')

for row in rows:    
    cols = row.find_all(['td','th'])
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values
    
# Build a DataFrame, using the first scraped row as the header.
df = pd.DataFrame(data)
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
# Ignore cells with a borough that is "Not assigned".
df = df[~df.Borough.str.contains("Not assigned")]
df.rename(columns={'Postcode': 'Postal Code'}, inplace=True)
df = df.reset_index(drop=True)

# Combine all neighbourhoods that share a postal code into one row, separated by commas.
# The grouping keys must be passed as a list; recent pandas versions treat a tuple as a single key.
gb = df.groupby(['Postal Code', 'Borough'])
result = gb['Neighbourhood'].agg(', '.join)
result = result.reset_index()

# A neighbourhood with the value 'Not assigned' is replaced by the name of its respective borough.
result['Neighbourhood'] = np.where(result['Neighbourhood']=='Not assigned', result['Borough'], result['Neighbourhood'])
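
To see what the cleaning steps above do without hitting Wikipedia, here is a minimal sketch on a small hand-made table. The rows in `toy` are hypothetical stand-ins for the scraped data, and the column names are assumed to match the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the scraped Wikipedia table.
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M7A', 'M1A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park", 'Not assigned'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Not assigned', 'Not assigned'],
})

# Drop rows whose borough is not assigned.
toy = toy[toy['Borough'] != 'Not assigned']

# One row per postal code, with its neighbourhoods joined by ', '.
grouped = (
    toy.groupby(['Postal Code', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

# Fall back to the borough name where the neighbourhood is not assigned.
grouped['Neighbourhood'] = np.where(
    grouped['Neighbourhood'] == 'Not assigned',
    grouped['Borough'],
    grouped['Neighbourhood'],
)

print(grouped)
```

In this sketch the two `M5A` rows collapse into a single row, and the `M7A` row takes its borough name as its neighbourhood.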


## 2. Loading Latitude and Longitude Data from a CSV File

In [3]:
filename='https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv'
latLongDf = pd.read_csv(filename)
latLongDf.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
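
The coordinates share the `Postal Code` column with the neighbourhood table, so one plausible next step is to join the two on that key. The sketch below assumes this is the intended use. It works on small stand-in frames (`neigh` and `coords` are hypothetical names, and the borough and neighbourhood values are illustrative) and drops the stray `Unnamed: 0` index column first.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned postal-code table (illustrative values).
neigh = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})

# Stand-in for the coordinates file, including the leftover index column.
coords = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Latitude': [43.806686, 43.784535, 43.763573],
    'Longitude': [-79.194353, -79.160497, -79.188711],
})

# Remove the leftover index column before joining.
coords = coords.drop(columns=['Unnamed: 0'])

# A left join keeps every neighbourhood row, even one without coordinates.
merged = neigh.merge(coords, on='Postal Code', how='left')

print(merged)
```

When reading the real file, passing `index_col=0` to `pd.read_csv` should avoid the `Unnamed: 0` column in the first place.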
