# Segmenting and Clustering Neighborhoods in Toronto - Part 1
## Yu Deng

First, import the necessary packages.

In [1]:
import pandas as pd
import numpy as np

Use pandas to scrape the table shown on the Wikipedia page. `pd.read_html` returns a list with one dataframe per HTML table it finds, so we take the first element.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
toronto_wiki = pd.read_html(url)[0]
toronto_wiki.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Drop all rows whose value in column Borough is "Not assigned".

In [3]:
toronto_neigh = toronto_wiki[~toronto_wiki["Borough"].isin(["Not assigned"])].reset_index(drop=True)
toronto_neigh.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


Replace every "Not assigned" in column Neighbourhood with the value of column Borough from the same row.

In [4]:
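# Use a boolean mask with .loc so the assignment writes to toronto_neigh itself
# (chained indexing such as df[mask][col] = ... may only modify a temporary copy).
not_assigned = toronto_neigh["Neighbourhood"] == "Not assigned"
toronto_neigh.loc[not_assigned, "Neighbourhood"] = toronto_neigh.loc[not_assigned, "Borough"]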
toronto_neigh.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
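
The replacement step can be illustrated on a small toy frame. This is only a sketch: the frame `demo` and its three rows are made up for illustration and are not taken from the scraped table. It shows that a boolean mask combined with `.loc` writes the Borough value into the original frame.

```python
import pandas as pd

# Hypothetical toy data, not the scraped Wikipedia table
demo = pd.DataFrame({
    "Postcode": ["M5A", "M7A", "M9A"],
    "Borough": ["Downtown Toronto", "Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Harbourfront", "Not assigned", "Islington Avenue"],
})

# Rows whose Neighbourhood is missing
mask = demo["Neighbourhood"] == "Not assigned"

# .loc selects rows and column in a single step, so the assignment
# modifies demo itself rather than a temporary copy
demo.loc[mask, "Neighbourhood"] = demo.loc[mask, "Borough"]

print(demo)
```

Assigning through two chained indexers, as in `demo[mask]["Neighbourhood"] = ...`, may act on a copy and leave `demo` unchanged, which is why `.loc` is used here.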


Check that no row still contains "Not assigned" in either column.

In [5]:
print(toronto_neigh["Borough"].isin(["Not assigned"]).value_counts(),"\n",toronto_neigh["Neighbourhood"].isin(["Not assigned"]).value_counts())

False    210
Name: Borough, dtype: int64 
 False    210
Name: Neighbourhood, dtype: int64


Combine the neighbourhoods that share the same postcode into a single comma-separated entry.

In [6]:
toronto = toronto_neigh.groupby(["Postcode","Borough"]).aggregate(lambda x:', '.join(x)).reset_index()
toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
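
To see what the grouping does, here is a minimal sketch on a made-up three-row frame (the name `demo` and its contents are assumptions for illustration only). It selects the Neighbourhood column explicitly and joins the strings within each (Postcode, Borough) group.

```python
import pandas as pd

# Hypothetical toy data: M6A appears twice, M5A once
demo = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Harbourfront"],
})

# Group by postcode and borough, then join the neighbourhood names
# of each group into one comma-separated string
combined = (
    demo.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

The two M6A rows collapse into one row, so the result has one row per postcode.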


Check the number of rows and columns in our final dataframe.

In [7]:
toronto.shape

(103, 3)

Save the final results to a CSV file named "toronto_neigh.csv" for my next notebook.

In [8]:
toronto.to_csv("toronto_neigh.csv")
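
One thing to keep in mind when reading the file back: by default `to_csv` also writes the row index, which `read_csv` then loads as a column named "Unnamed: 0". The sketch below demonstrates this on a made-up two-row frame and an in-memory buffer (both are assumptions for illustration, not part of the notebook's data), and shows that `index=False` avoids the extra column.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the final dataframe
demo = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Default behaviour: the index is written as an unnamed first column
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# With index=False only the data columns are written
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```

Alternatively, a file saved with the default settings can be read with `pd.read_csv(path, index_col=0)` to restore the index instead of creating the extra column.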