# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Parsing Toronto's neighborhoods from Wikipedia

Install and import necessary packages

In [2]:
!pip install bs4

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/e8/b5/7bb03a696f2c9b7af792a8f51b82974e51c268f15e925fc834876a4efa0b/beautifulsoup4-4.9.0-py3-none-any.whl (109kB)
Collecting soupsieve>1.2 (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/05/cf/ea245e52f55823f19992447b008bcbb7f78efc5960d77f6c34b5b45b36dd/soupsieve-2.0-py2.py3-none-any.whl
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.9.0 bs4-0.

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

Fetching the data from Wikipedia

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url)
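
The request above does not check whether the download succeeded. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the page body as text, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without network access (an assumption
    made for this sketch).
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Calling `fetch_html(url)` would then return the same text as `html.text` in the cell above, or raise an exception instead of silently passing an error page to the parser.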

Converting the HTML table into a pandas DataFrame

In [69]:
# Parse the page and take the first <table> it contains
table = BeautifulSoup(html.text, 'html.parser').find('table')
table_rows = table.find_all('tr')                # every <tr> element in that table

data = []
for row in table_rows:
    tds = row.find_all('td')                     # the <td> cells of this row
    new_row = [td.text.strip() for td in tds]    # cell text without the trailing "\n"
    if new_row:                                  # skip rows without <td> cells (such as a header row)
        data.append(new_row)

df = pd.DataFrame(data, columns=["Postal code", "Borough", "Neighborhood"])
df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
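
The row-extraction loop can be tried offline on a small table. The sketch below uses a made-up HTML snippet (`sample_html` is illustrative and is not the live Wikipedia markup) to show that a row containing only `<th>` cells yields an empty list and is therefore skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the Wikipedia table.
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td> M1A </td><td> Not assigned </td><td></td></tr>
  <tr><td> M3A </td><td> North York </td><td> Parkwoods </td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(cells)

print(rows)
```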


Remove the postal codes whose borough is not assigned

In [70]:
df = df.drop(df[df["Borough"] == "Not assigned"].index)
df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
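
Dropping rows by index, as done above, can also be written as a boolean mask. The sketch below compares the two on a small made-up dataframe (`toy` and its values are illustrative only); both keep the original index labels, which is why the index still starts at 2 in the output above.

```python
import pandas as pd

# Made-up sample with the same columns as the notebook's dataframe.
toy = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["", "Parkwoods", "Victoria Village"],
})

# Approach used in the notebook: look up the unwanted index labels and drop them.
dropped = toy.drop(toy[toy["Borough"] == "Not assigned"].index)

# Alternative: keep only the rows where the condition holds.
masked = toy[toy["Borough"] != "Not assigned"]

print(masked)
```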


Replace the slashes with commas and remove the stray "\n" in the "M5V" row (it renders oddly on Wikipedia too, so it is probably a typo in the source table)

In [71]:
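df["Neighborhood"] = df["Neighborhood"].str.replace(" / ", ", ", regex=False)   # "A / B" -> "A, B"
df["Neighborhood"] = df["Neighborhood"].str.replace("\n", "", regex=False)      # remove embedded newline characters
df.reset_index(inplace=True, drop=True)                                          # renumber the rows from 0
df.head()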

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
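
Whether `str.replace` treats its pattern as a regular expression by default depends on the pandas version (the default changed to literal matching in pandas 2.0), so passing `regex=False` makes the intent explicit. A minimal sketch on made-up values (the strings below are illustrative and are not the actual table contents):

```python
import pandas as pd

# Made-up values; the second one contains a real newline character.
names = pd.Series([
    "Regent Park / Harbourfront",
    "CN Tower / King and\n Spadina",
])

cleaned = (
    names
    .str.replace(" / ", ", ", regex=False)  # literal " / " becomes ", "
    .str.replace("\n", "", regex=False)     # remove embedded newline characters
)

print(cleaned.tolist())
```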


Check the shape of the dataframe

In [72]:
df.shape

(103, 3)

After removing the unassigned entries, 103 Toronto postal codes remain, each with one or more assigned neighborhoods.
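
Beyond the shape, a few sanity checks can confirm that the cleaning steps did what was intended. The sketch below is one possible set of checks; the helper name `check_clean`, the toy data, and the assumption that each postal code should appear only once are illustrative and do not come from the original notebook.

```python
import pandas as pd


def check_clean(frame):
    """Return a list of problems found in a cleaned postal-code dataframe."""
    problems = []
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned boroughs remain")
    if frame["Postal code"].duplicated().any():  # assumes one row per postal code
        problems.append("duplicate postal codes")
    if frame["Neighborhood"].str.contains("\n", regex=False).any():
        problems.append("newline left in a neighborhood name")
    return problems


# Made-up sample in the same layout as the cleaned dataframe.
toy = pd.DataFrame({
    "Postal code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

print(check_clean(toy))
```

Running `check_clean(df)` on the notebook's dataframe would return an empty list if all three conditions hold.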