# Segmenting and Clustering Neighborhoods in Toronto

### Exploring and clustering the neighborhoods in Toronto
<img src="https://www.toronto.ca/wp-content/uploads/2017/12/9578-moving-to-toronto-7.png" alt="Toronto">


Step 1 - Scraping the neighborhood data from the Wikipedia page

Let's import the dependencies that we will need.

In [15]:
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

Now let's download the contents of the web page using urllib and parse them with BeautifulSoup.

In [16]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

Find the table with the data, which is the first table on the page.

In [17]:
table = soup.find_all('table')[0]

Now, let's clean up the table by removing its class attribute.

In [18]:
del table['class']

Next, convert it to a pandas dataframe with the correct column names. Let's check it using .head().

In [19]:
dfs = pd.read_html(table.prettify(), flavor='bs4')
df = dfs[0]
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
df.rename(columns={'Postcode':'PostalCode'}, inplace=True)
df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
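
The header-promotion step can be tried in isolation on a small frame. This is a minimal sketch, assuming a made-up `raw` DataFrame (hypothetical data, not the scraped table) whose first row holds the column names, as happens when header cells are parsed as ordinary data.

```python
import pandas as pd

# Toy stand-in for a parsed table whose header ended up in row 0 (hypothetical data).
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

# Promote the first row to column names, then keep only the data rows.
promoted = raw.iloc[1:].copy()
promoted.columns = raw.iloc[0].tolist()

# Rename the postal code column to the name used in the rest of the notebook.
promoted = promoted.rename(columns={"Postcode": "PostalCode"})

print(promoted)
```

The data rows keep their original index labels (starting at 1), which matches the index shown in the output above.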


Great, now we have a dataframe with the correct columns! But we still need to get rid of the "Not assigned" rows.

First we drop the rows whose borough is Not assigned, then we replace any Not assigned neighbourhood with the name of its borough.

In [20]:
df = df[df.Borough != 'Not assigned'].copy()
mask = df.Neighbourhood == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
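
The two cleaning rules can be checked on a small frame before trusting them on the full table. The sketch below assumes a made-up `toy` DataFrame with one row per case (hypothetical data) and uses a boolean mask with `.loc`, one common way to do a conditional fill in pandas.

```python
import pandas as pd

# Hypothetical rows covering the three cases in the scraped table.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: keep only rows with an assigned borough.
# .copy() makes the result an independent frame that is safe to modify.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighbourhood is missing, fall back to the borough name.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

Taking the fill value from the same masked rows keeps each neighbourhood paired with the borough on its own row.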

Now we will group the data by postal code and borough, joining the neighbourhood names into one comma-separated string. Again, let's check it using .head().

In [21]:
df = df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].apply(", ".join)
df = df.reset_index()
df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
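
To see what the group-and-join step does to rows that share a postal code, here is a minimal sketch on a three-row frame. The `toy` frame is made up for illustration, with values borrowed from the output above.

```python
import pandas as pd

# Two neighbourhoods share postal code M1B; M1G has only one (illustrative data).
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Group on the key columns and join each group's neighbourhood names with ", ".
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)

print(grouped)
```

Grouping on both `PostalCode` and `Borough` keeps the borough as a column in the result. Grouping on the postal code alone would drop it.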


In [22]:
df.shape

(103, 3)

Now let's add the latitude and longitude to each row! First, let's download the coordinate data, which is keyed by postal code.

In [26]:
data_url = "https://cocl.us/Geospatial_data"
c = pd.read_csv(data_url)

Now let's join the dataframes on the postal code.

In [32]:
df = df.set_index('PostalCode').join(c.set_index('Postal Code'))

In [36]:
df = df.reset_index()
df.head(11)

| | level_0 | index | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | 1 | 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | 2 | 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | 3 | 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | 4 | 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | 5 | 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | 6 | 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | 7 | 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | 8 | 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | 9 | 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
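
The same join can also be written with `DataFrame.merge`, which matches differently named key columns directly. This is a sketch on two small frames, not the notebook's method: the names `neighbourhoods` and `coords` are hypothetical, and the values are copied from the first two rows of the output above.

```python
import pandas as pd

# Left side: one row per postal code (values taken from the output above).
neighbourhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})

# Right side: coordinates keyed by "Postal Code", deliberately in a different order.
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Left join on the differently named key columns.
# validate="one_to_one" raises an error if a postal code is duplicated on either side.
merged = neighbourhoods.merge(
    coords,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")  # drop the redundant right-hand key column

print(merged)
```

Because `merge` returns a frame with a plain integer index, no `reset_index()` call is needed afterwards. This also avoids the stray `level_0` and `index` columns that repeated `reset_index()` calls leave behind.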
