# Segmenting and Clustering Neighborhoods in Toronto

## Objectives of this notebook

- Get all the postal codes in Toronto, with their respective neighbourhoods
- Extract the data from this [Wikipedia URL](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)
- Normalize and standardize the data
- Write the DataFrame to a CSV file

In [1]:
# basic packages for data manipulation
import pandas as pd
import numpy as np

# to send HTTP requests and parse the returned HTML
import requests
from bs4 import BeautifulSoup

In [2]:
# define the base url
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
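# GET the page; fail fast on a network stall or an HTTP error status
req = requests.get(URL, timeout=10)
req.raise_for_status()

# parse with an explicit parser and keep the first table of class "wikitable"
html = BeautifulSoup(req.text, "html.parser")
table_wiki = html.find_all("table", class_="wikitable")[0]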

In [4]:
# get the rows and columns of the table
rows = []
columns = []
for index, row in enumerate(table_wiki.find_all("tr")):
    if index != 0:
        rows.append(row.get_text().strip().split("\n\n"))
    else:
        columns = row.get_text().strip().split("\n\n")
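
The loop above relies on cells being separated by exactly two newlines in the rendered text, which depends on how the page markup happens to be laid out. As a hedged alternative, the sketch below reads each `<th>`/`<td>` cell individually. `SAMPLE_HTML`, `parse_table`, `sample_columns` and `sample_rows` are hypothetical names introduced for illustration, and the sample markup is a simplified stand-in for the real Wikipedia table, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the Wikipedia table markup.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""


def parse_table(table):
    """Return (columns, rows), reading every header and data cell on its own."""
    columns, rows = [], []
    for tr in table.find_all("tr"):
        # Collect the stripped text of each cell in document order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if not cells:
            continue  # skip rows that contain no cells
        if not columns and tr.find("th") is not None:
            columns = cells  # the first row with header cells names the columns
        else:
            rows.append(cells)
    return columns, rows


sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", class_="wikitable")
sample_columns, sample_rows = parse_table(sample_table)
```

Under the same assumption about the table layout, calling `parse_table(table_wiki)` would produce inputs for `pd.DataFrame(data=..., columns=...)` without depending on newline spacing.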

In [5]:
df = pd.DataFrame(data=rows, columns=columns)
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [6]:
df.shape

(180, 3)

In [7]:
df[df == "Not assigned"].count()

Postal Code       0
Borough          77
Neighbourhood    77
dtype: int64

In [8]:
df[df == "Not assigned"] = np.nan
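
The counting and replacement steps above can also be expressed with a boolean sum and `DataFrame.replace`, which returns a new frame instead of assigning through a mask. This is a minimal sketch on a hypothetical three-row frame (`sample_df`, `placeholder_counts`, `cleaned` and `missing_counts` are illustrative names), not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame shaped like the scraped table.
sample_df = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M3A", "M4A"],
        "Borough": ["Not assigned", "North York", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
    }
)

# Count the placeholder per column: True counts as 1 when summed.
placeholder_counts = (sample_df == "Not assigned").sum()

# Swap the placeholder for NaN; sample_df itself is left unchanged.
cleaned = sample_df.replace("Not assigned", np.nan)

# After the replacement, the missing values are visible to isna().
missing_counts = cleaned.isna().sum()
```

Once the placeholder is a real missing value, pandas methods such as `isna()` and `dropna()` handle it without any string comparison.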

In [9]:
# normalize the column names: underscores instead of spaces, all lowercase
df.columns = df.columns.str.replace(" ", "_").str.lower()

In [10]:
df.head()

| | postal_code | borough | neighbourhood |
|---|---|---|---|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [11]:
df.to_csv("datasets/toronto-postal-codes.csv")
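
By default `to_csv` also writes the row index, and reading such a file back with a plain `read_csv` yields an extra `Unnamed: 0` column. The sketch below illustrates the difference using an in-memory buffer instead of a file. The two-row frame and all variable names are hypothetical, and whether the index should be kept depends on how later notebooks load the file.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned table.
sample_out = pd.DataFrame(
    {"postal_code": ["M3A", "M4A"], "borough": ["North York", "North York"]}
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample_out.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
sample_out.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
```

If the index is written, passing `index_col=0` to `read_csv` restores it as the index when loading. Note also that `to_csv` does not create missing directories, so the `datasets/` folder is assumed to exist already.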