# Segmenting and Clustering Neighbourhoods in the city of Toronto, Canada
#### Applied Data Science Capstone - Week 3 assignment

## Part 1 - Obtaining the list of neighbourhoods
In this part we use the BeautifulSoup package together with the HTML-parsing capabilities of pandas to load the list of Toronto-area postal codes from Wikipedia into a dataframe.

In [6]:
from bs4 import BeautifulSoup
import requests

import pandas as pd

In [37]:
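from io import StringIO

# Find the table in the Wikipedia page using BeautifulSoup
wiki_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
wiki_page.raise_for_status()  # stop early if the request failed
wiki_soup = BeautifulSoup(wiki_page.text, 'html.parser')  # name the parser explicitly to avoid a warning
html_table = wiki_soup.find(class_='wikitable sortable').prettify()

# Convert the HTML table to a pandas dataframe
# (recent pandas versions expect a file-like object rather than a literal HTML string)
df_list = pd.read_html(StringIO(html_table))
Toronto_hoods_df = df_list[0]
Toronto_hoods_df.head()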

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
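
To make the scraping step easier to follow, here is a minimal sketch of what turning an HTML table into a dataframe involves. It runs on a small inline HTML fragment instead of the live Wikipedia page; `SAMPLE_HTML` and `table_to_dataframe` are hypothetical names introduced for illustration, and the notebook itself relies on `pd.read_html` rather than this manual loop.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML fragment standing in for the Wikipedia table
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""


def table_to_dataframe(html, css_class="wikitable"):
    """Return the first table carrying css_class as a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=css_class)
    if table is None:
        raise ValueError(f"no table with class {css_class!r} found")
    # Collect the text of every header and data cell, row by row
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, *body = rows  # the first row holds the column names
    return pd.DataFrame(body, columns=header)


sample_df = table_to_dataframe(SAMPLE_HTML)
print(sample_df)
```

Raising an error when no table matches is a design choice of this sketch: it fails with a clear message if the page layout changes, rather than with an `AttributeError` on `None`.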


In [50]:
# Drop rows with an unassigned borough; .copy() keeps later in-place edits from acting on a slice of the original
Toronto_hoods_df = Toronto_hoods_df[Toronto_hoods_df['Borough'] != 'Not assigned'].copy()
Toronto_hoods_df.head(10)

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |
| 11 | M3B | North York | Don Mills |
| 12 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson |
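
The filter above is a boolean mask: the comparison yields one `True`/`False` per row, and indexing with it keeps only the `True` rows while preserving their original index labels. The sketch below shows this on four rows copied from the table; `demo_df`, `assigned_mask` and `assigned_df` are illustrative names only.

```python
import pandas as pd

# Four rows taken from the scraped table, enough to show the mask at work
demo_df = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# One boolean per row: True where the borough is assigned
assigned_mask = demo_df["Borough"] != "Not assigned"

# Keep the True rows; .copy() gives an independent dataframe
assigned_df = demo_df[assigned_mask].copy()

print(assigned_mask.tolist())       # [False, False, True, True]
print(assigned_df.index.tolist())   # [2, 3]
```

The surviving rows keep labels 2 and 3, which is why the filtered dataframe in the notebook starts at index 2 and has gaps.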


In [53]:
Toronto_hoods_df[Toronto_hoods_df['Neighbourhood']=='Not assigned'].index #No unassigned Neighbourhoods

Int64Index([], dtype='int64')
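
The empty index confirms that no row has an unassigned neighbourhood. As a sketch of a broader variant of the same check, one could count the placeholder value in every column at once; `check_df` and `unassigned_per_column` are hypothetical names, and the three rows are copied from the table above.

```python
import pandas as pd

# Three already-filtered rows copied from the notebook output
check_df = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

# eq() compares every cell with the placeholder; sum() counts matches per column
unassigned_per_column = check_df.eq("Not assigned").sum()
print(unassigned_per_column)

# The data is clean only if every column reports zero matches
assert (unassigned_per_column == 0).all()
```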

In [59]:
[g for _, g in Toronto_hoods_df.groupby('Postal Code') if len(g) > 1] #No duplicate postal codes

[]
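
The empty list means every postal code appears exactly once. For illustration, the sketch below shows how duplicates could be detected with `duplicated` and merged with `groupby` if they did occur. The split of `M5A` into two rows is a made-up scenario, not something present in the scraped data, and `dup_df`, `repeated` and `merged` are hypothetical names.

```python
import pandas as pd

# Hypothetical input: M5A deliberately split over two rows to create a duplicate
dup_df = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park", "Harbourfront"],
})

# keep=False flags every occurrence of a repeated postal code, not only the later ones
repeated = dup_df[dup_df["Postal Code"].duplicated(keep=False)]
print(repeated)

# Collapse duplicates into one row per postal code, joining the neighbourhood names
merged = (
    dup_df.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

Passing `sort=False` keeps the postal codes in their order of first appearance instead of sorting them alphabetically.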

In [67]:
#Clean up the index
Toronto_hoods_df.reset_index(drop=True, inplace=True)
Toronto_hoods_df.head(10)

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
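
Resetting the index replaces the gapped labels left by the filter with a fresh 0..n-1 range. The sketch below illustrates the effect of `drop=True` on a small frame with the same gaps as the notebook data; `gapped_df`, `clean_df` and `kept_df` are illustrative names.

```python
import pandas as pd

# Index labels copied from the filtered output above (1 and 7 were dropped)
gapped_df = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M5A", "M6A", "M7A", "M9A"],
        "Borough": ["North York", "North York", "Downtown Toronto",
                    "North York", "Downtown Toronto", "Etobicoke"],
    },
    index=[2, 3, 4, 5, 6, 8],
)

# drop=True discards the old labels and renumbers the rows from 0
clean_df = gapped_df.reset_index(drop=True)
print(clean_df.index.tolist())   # [0, 1, 2, 3, 4, 5]

# Without drop=True the old labels are kept as an extra column named "index"
kept_df = gapped_df.reset_index()
print(kept_df.columns.tolist())  # ['index', 'Postal Code', 'Borough']
```

Returning a new dataframe, as done here, is an alternative to `inplace=True` that leaves the original frame untouched.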


In [63]:
Toronto_hoods_df.shape

(103, 3)

We now have a clean dataframe with one row for each assigned postal code (103 in total), each listing its borough and the corresponding neighbourhoods.

## Part 2 - Obtaining the coordinates of the neighbourhoods