# Applied Data Science Capstone
*by IBM*

For this assignment, you will explore and cluster the neighborhoods in Toronto.

Start by creating a new notebook for this assignment. Use it to scrape the table of postal codes from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and transform the data into a pandas DataFrame like the one shown below:

In [18]:
import numpy as np
import pandas as pd

> Using `pandas.read_html`

In [77]:
df_pd = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',
    encoding='utf-8')[0]

# Keep only the rows that have an assigned borough;
# drop rows whose borough is 'Not assigned'.
df_pd = df_pd[df_pd['Borough'] != 'Not assigned'].reset_index(drop=True)

# If a row has a borough but a 'Not assigned' neighbourhood,
# use the borough as the neighbourhood.
df_pd['Neighbourhood'] = df_pd.apply(
    lambda x: x['Neighbourhood'] if x['Neighbourhood'] != 'Not assigned' else x['Borough'],
    axis=1)


df_pd.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
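
The row-wise `apply` used for the neighbourhood fallback can also be written as a column-wise operation. The following is a minimal sketch using `Series.where`, run on a small hand-made frame; the frame `toy` and its values are hypothetical stand-ins for the scraped table, not data from the page.

```python
import pandas as pd

# Toy rows standing in for the scraped table (hypothetical values).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Drop rows without an assigned borough.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Keep the neighbourhood where it is assigned; otherwise take the borough.
assigned = toy["Neighbourhood"] != "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].where(assigned, toy["Borough"])

print(toy)
```

`Series.where(cond, other)` keeps the original value wherever `cond` is true and substitutes `other` elsewhere, so no Python-level function is called per row.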


> Using `bs4.BeautifulSoup`

In [78]:
import requests
from bs4 import BeautifulSoup
from html import unescape

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Locate the first table inside the article body.
table = BeautifulSoup(requests.get(url).text, 'lxml').find('div', class_='mw-parser-output').table.tbody

# Serialize all rows to a single string and decode HTML entities.
table_string = unescape(str(table.find_all('tr')))

# Substitutions that turn the row markup into pipe-delimited lines.
# They are applied in insertion order; later keys clean up after earlier ones.
subs = {'[':'',']':'','\n':'','</th><th>': '| ', '<th>':'', '</th></tr>,': '',
    '<tr>':'\n', '<td>':'', '</td>':'| ', '</td><td>':'| ', '| </tr>,':'', '| </tr>': '',
    '| ':'|',' |':'','  ':' '}

for key, value in subs.items():
    table_string = table_string.replace(key, value)

table_string = table_string.strip().split('\n')

# The first line holds the column names; the remaining lines hold the data.
columns = [name.strip() for name in table_string[0].strip().split('|')]

df_bs = pd.DataFrame([dict(zip(columns, line.strip().split('|'))) for line in table_string[1:]])

# Keep only the rows that have an assigned borough;
# drop rows whose borough is 'Not assigned'.
df_bs = df_bs[df_bs['Borough'] != 'Not assigned'].reset_index(drop=True)

# If a row has a borough but a 'Not assigned' neighbourhood,
# use the borough as the neighbourhood.
df_bs['Neighbourhood'] = df_bs.apply(
    lambda x: x['Neighbourhood'] if x['Neighbourhood'] != 'Not assigned' else x['Borough'],
    axis=1)


df_bs.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
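
The string-substitution approach depends on the exact markup that the rows serialize to. As an alternative, the cells can be read directly from the parsed tree. The sketch below assumes a plain table with one header row; the inline `html` string is a hypothetical excerpt used so the example runs without a network request, and `html.parser` is used in place of `lxml` only to avoid an extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML standing in for the Wikipedia table (hypothetical excerpt).
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

rows = BeautifulSoup(html, "html.parser").find("table").find_all("tr")

# get_text(strip=True) drops the tags and strips whitespace around the text.
header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
records = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in rows[1:]
]

df_cells = pd.DataFrame(records, columns=header)
print(df_cells)
```

The same filtering and neighbourhood fallback steps used for `df_bs` could then be applied to a frame built this way.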


> Comparing approaches

In [79]:
df_test = df_bs.merge(df_pd, on='Postal Code', how='left')
print("Differences on 'Neighbourhood':", len(df_test[df_test['Neighbourhood_x'] != df_test['Neighbourhood_y']]))
print("Differences on 'Borough':", len(df_test[df_test['Borough_x'] != df_test['Borough_y']]))

Differences on 'Neighbourhood': 0
Differences on 'Borough': 0


**Note:** Both approaches lead to the same result: every postal code in `df_bs` has the same borough and neighbourhood in `df_pd`.
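
A left merge keeps only the keys of the left frame, so a postal code present in `df_pd` but absent from `df_bs` would not show up in the counts. One way to check both directions is an outer merge with `indicator=True`. The sketch below uses two small hypothetical frames, `left` and `right`, in place of `df_bs` and `df_pd`.

```python
import pandas as pd

# Two small frames standing in for df_bs and df_pd (hypothetical values).
left = pd.DataFrame({"Postal Code": ["M3A", "M4A"],
                     "Borough": ["North York", "North York"]})
right = pd.DataFrame({"Postal Code": ["M3A", "M5A"],
                      "Borough": ["North York", "Downtown Toronto"]})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'.
check = left.merge(right, on="Postal Code", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Postal Code", "_merge"]])

# Raises AssertionError if the two frames are not equal.
pd.testing.assert_frame_equal(left, left.copy())
```

When two frames are expected to be identical in values, column order, and index, `pd.testing.assert_frame_equal` performs the whole comparison in a single call.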

In [80]:
df = df_pd
print("DataFrame Shape:",df.shape)
df.head()

DataFrame Shape: (103, 3)


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [81]:
df.shape

(103, 3)