## Web Scraping - Segmenting and Clustering Neighborhoods in Toronto

First, we import the libraries: requests to download the page, BeautifulSoup to parse the HTML, and pandas to hold the resulting table.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup


We then download the Wikipedia page, locate the first TABLE element, and convert it into a pandas DataFrame.

In [2]:
req=requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

soup=BeautifulSoup(req.content, "html.parser")
table=soup.find(name="table")

df=pd.read_html(str(table), header=0)[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
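
`pd.read_html` depends on an optional HTML parser being installed (such as lxml). As a hedged alternative, the sketch below builds the frame directly from the table tags with BeautifulSoup. `SAMPLE_HTML` and `table_to_frame` are hypothetical names introduced for illustration, and the inline HTML is only a stand-in for the downloaded page so the example runs offline.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for req.content, so the sketch runs without network access.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_to_frame(html):
    """Turn the first <table> in `html` into a DataFrame, using the first row as header."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    # Collect the stripped text of every header/data cell, row by row.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


sample_df = table_to_frame(SAMPLE_HTML)
print(sample_df)
```

With the real page, the same function could be called on `req.content`, ideally after `req.raise_for_status()` so that a failed download is not parsed silently.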


In [3]:
df.shape

(288, 3)

The table contains 288 rows and 3 columns. Let's count the "Not assigned" values in the Borough column.

In [4]:
df["Borough"].value_counts()

Not assigned        77
Etobicoke           45
North York          38
Scarborough         37
Downtown Toronto    37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

Let's discard the rows whose Borough is "Not assigned". Since 77 of the 288 rows are unassigned, 211 rows should remain.

In [5]:
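# .copy() gives an independent frame, so later in-place edits do not act on a slice of the original
df=df[df.Borough!="Not assigned"].copy()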
print("Shape:" , df.shape)
df.head()

Shape: (211, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
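
The row arithmetic behind this filter (288 - 77 = 211) can be checked in code rather than by eye. The sketch below does so on a small toy frame. The frame `toy` and its values are illustrative assumptions, not the scraped data.

```python
import pandas as pd

# Toy data with the same columns as the scraped table (values are illustrative).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront", "Regent Park"],
})

# Number of rows we expect to survive the filter.
n_unassigned = (toy["Borough"] == "Not assigned").sum()
expected_rows = len(toy) - n_unassigned

# Boolean-mask filter; .copy() makes the result independent of `toy`.
assigned = toy[toy["Borough"] != "Not assigned"].copy()

assert len(assigned) == expected_rows
print(assigned.shape)
```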


Next, we rename the Postcode and Neighbourhood columns to PostalCode and Neighborhood.

In [6]:
df.rename(columns={"Postcode":"PostalCode","Neighbourhood":"Neighborhood"}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [7]:
df["PostalCode"].value_counts()

M9V    8
M8Y    8
M5V    7
M4V    5
M9B    5
M8Z    5
M6M    4
M9C    4
M1V    4
M9R    4
M5T    3
M5H    3
M5J    3
M8V    3
M1P    3
M6K    3
M6L    3
M5R    3
M3H    3
M1M    3
M8X    3
M2J    3
M1L    3
M1K    3
M1T    3
M1C    3
M1E    3
M6P    2
M4K    2
M4L    2
      ..
M6B    1
M5G    1
M4Y    1
M1S    1
M2N    1
M9L    1
M1X    1
M6C    1
M9A    1
M4N    1
M1W    1
M2H    1
M7R    1
M3B    1
M3N    1
M3L    1
M5C    1
M4M    1
M3A    1
M4W    1
M1J    1
M6E    1
M4H    1
M9P    1
M4P    1
M2R    1
M4J    1
M6G    1
M7A    1
M4S    1
Name: PostalCode, Length: 103, dtype: int64

This means that, after aggregating the neighborhoods that share a postal code into a single row, we should end up with a DataFrame of 103 rows (the number of distinct postal codes).
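
The number of distinct postal codes can also be read off directly with `Series.nunique`, without scanning the `value_counts` listing. A minimal sketch on toy data (the frame `toy` is an illustrative assumption):

```python
import pandas as pd

# Toy postal codes: two codes appear twice, one appears once.
toy = pd.DataFrame({"PostalCode": ["M5A", "M5A", "M6A", "M6A", "M7A"]})

# Number of distinct codes, i.e. the row count expected after aggregation.
n_codes = toy["PostalCode"].nunique()

# Codes that occur more than once and therefore need to be merged.
counts = toy["PostalCode"].value_counts()
repeated = counts[counts > 1]

print(n_codes)
print(sorted(repeated.index))
```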

Let's check how many neighborhoods are "Not assigned".

In [8]:
df[df["Neighborhood"]=="Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood
8,M7A,Queen's Park,Not assigned


Only one. We replace this "Not assigned" neighborhood with the name of its borough (Queen's Park).

In [9]:
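# Assign the result back: an in-place replace on a selected column may not update df in newer pandas versions
df["Neighborhood"]=df["Neighborhood"].replace({"Not assigned":"Queen's Park"})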
df.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
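
Hard-coding "Queen's Park" works here because only one row is affected. A more general variant, sketched below on toy data, copies the borough into every "Not assigned" neighborhood with `Series.mask`. The frame `toy` and the variable `missing` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with one unassigned neighborhood (values are illustrative).
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# True for rows whose neighborhood is missing.
missing = toy["Neighborhood"] == "Not assigned"

# Series.mask replaces values where the condition is True,
# taking the replacement from the aligned Borough column.
toy["Neighborhood"] = toy["Neighborhood"].mask(missing, toy["Borough"])

print(toy)
```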


Now we build a frame that groups the neighborhoods by postal code, joining their names with commas.

In [10]:
grouped_neigh=df.groupby("PostalCode")["Neighborhood"].agg([("Neighborhood", ",".join)])
grouped_neigh.head()

PostalCode,Neighborhood
M1B,"Rouge,Malvern"
M1C,"Highland Creek,Rouge Hill,Port Union"
M1E,"Guildwood,Morningside,West Hill"
M1G,Woburn
M1H,Cedarbrae


Next, we overwrite the Neighborhood value of every row with the aggregated, comma-separated list for its postal code.

In [11]:
for pc in df["PostalCode"]:
    df.loc[df["PostalCode"]==pc,"Neighborhood"]=str(grouped_neigh.loc[pc, "Neighborhood"])

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Harbourfront,Regent Park"
5,M5A,Downtown Toronto,"Harbourfront,Regent Park"
6,M6A,North York,"Lawrence Heights,Lawrence Manor"
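
The loop above visits each row and looks up its group. As a possible alternative, `groupby(...).transform` can broadcast the joined string back to every row of a group in one call. This is a sketch on toy data, not the notebook's original method, and the frame `toy` is an illustrative assumption.

```python
import pandas as pd

# Toy frame: M5A has two neighborhoods, M3A has one (values are illustrative).
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# transform returns a Series aligned with the original rows; the joined
# string computed per postal code is repeated for each row in that group.
toy["Neighborhood"] = (
    toy.groupby("PostalCode")["Neighborhood"]
       .transform(lambda names: ",".join(names))
)

print(toy)
```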


Rows that share a postal code are now identical, so we drop the duplicates and reset the index.

In [12]:
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
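
The aggregate, overwrite, and de-duplicate steps can also be collapsed into a single `groupby` over both PostalCode and Borough. The sketch below shows this on toy data, under the assumption that each postal code belongs to exactly one borough. The frame `toy` and the result name `merged` are illustrative.

```python
import pandas as pd

# Toy frame in the same layout as the cleaned table (values are illustrative).
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                "North York", "North York"],
    "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park",
                     "Lawrence Heights", "Lawrence Manor"],
})

# sort=False keeps the groups in order of first appearance, matching the
# row order produced by drop_duplicates; reset_index turns the group keys
# back into ordinary columns.
merged = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(",".join)
       .reset_index()
)

# One row per postal code is the invariant the notebook checks with df.shape.
assert merged["PostalCode"].is_unique
print(merged)
```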


In [13]:
df.shape

(103, 3)

We end up with 103 rows and 3 columns, as expected, since there are 103 unique postal codes.