# Segmenting and Clustering Neighborhoods in Toronto

I scrape the Wikipedia page <https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M> to obtain the table of postal codes and transform it into a pandas DataFrame.

## Install Packages and Import Libraries

In [1]:
#!conda install -c conda-forge beautifulsoup4 --yes
#!conda install -c conda-forge lxml --yes
#!conda install -c conda-forge requests --yes

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd


## Read the Wikipedia Page

Get the HTML text of the page:

In [3]:
page_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_response = requests.get(page_link).text
soup = BeautifulSoup(page_response, 'lxml')
#print(soup.prettify())

Extract only the table and transform the entries to a list:

In [4]:
data = []
table = soup.find('table', class_='wikitable sortable') 
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # drop empty strings

data = data[1:]  # drop the first row: the header has no <td> cells, so its list is empty
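
To see why the first entry has to be dropped, here is a minimal sketch on a small inline table. The HTML snippet and the `sample_*` names are made up for illustration and are not the real Wikipedia markup. It uses Python's built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table, for illustration only.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', class_='wikitable sortable')

sample_data = []
for row in sample_table.find_all('tr'):
    # The header row contains only <th> cells, so find_all('td') finds nothing.
    cells = [cell.text.strip() for cell in row.find_all('td')]
    sample_data.append(cells)

# The first list is empty (header row); the rest hold the table body.
header_row, body_rows = sample_data[0], sample_data[1:]
print(header_row)
print(body_rows)
```

The header row produces an empty list, which is why the slice removes exactly one leading entry.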

Create a pandas dataframe:

In [5]:
labels=['PostalCode', 'Borough', 'Neighborhood']
df = pd.DataFrame.from_records(data, columns=labels)

In [6]:
print(df.head(10))

  PostalCode           Borough      Neighborhood
0        M1A      Not assigned      Not assigned
1        M2A      Not assigned      Not assigned
2        M3A        North York         Parkwoods
3        M4A        North York  Victoria Village
4        M5A  Downtown Toronto      Harbourfront
5        M5A  Downtown Toronto       Regent Park
6        M6A        North York  Lawrence Heights
7        M6A        North York    Lawrence Manor
8        M7A      Queen's Park      Not assigned
9        M8A      Not assigned      Not assigned


Only rows with an assigned borough are processed. Rows whose borough is "Not assigned" are dropped.

In [7]:
df = df[df.Borough != 'Not assigned']

In [8]:
print(df.head(10))

   PostalCode           Borough      Neighborhood
2         M3A        North York         Parkwoods
3         M4A        North York  Victoria Village
4         M5A  Downtown Toronto      Harbourfront
5         M5A  Downtown Toronto       Regent Park
6         M6A        North York  Lawrence Heights
7         M6A        North York    Lawrence Manor
8         M7A      Queen's Park      Not assigned
10        M9A         Etobicoke  Islington Avenue
11        M1B       Scarborough             Rouge
12        M1B       Scarborough           Malvern
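
The index in the output above jumps from 8 to 10 because boolean filtering keeps the original row labels. The following sketch, on a made-up four-row frame (`toy_codes` is a hypothetical name), shows that behavior and how `reset_index(drop=True)` would renumber the rows if consecutive labels were wanted at this point.

```python
import pandas as pd

# Made-up sample rows, for illustration only.
toy_codes = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M8A', 'M9A'],
    'Borough': ['Not assigned', 'North York', 'Not assigned', 'Etobicoke'],
})

# Boolean filtering keeps the original labels of the surviving rows.
kept = toy_codes[toy_codes['Borough'] != 'Not assigned']
print(list(kept.index))

# drop=True discards the old labels instead of adding them as a column.
renumbered = kept.reset_index(drop=True)
print(list(renumbered.index))
```

In this notebook the gaps are harmless, since the later grouping step builds a new index anyway.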


If a row has a borough but its neighborhood is "Not assigned", the neighborhood is set to the borough. For the 9th cell in the table on the Wikipedia page, both the Borough and the Neighborhood columns therefore become Queen's Park. This is the only row with that combination:

In [9]:
print(df[df.Neighborhood == 'Not assigned'])

  PostalCode       Borough  Neighborhood
8        M7A  Queen's Park  Not assigned


In [10]:
df.loc[df.Neighborhood == 'Not assigned', 'Neighborhood'] = "Queen's Park"
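
The assignment above works because Queen's Park is the only affected row. If more such rows existed, the rule "neighborhood equals borough" could be applied without naming any borough. This is a sketch on made-up data (`toy_rows` is a hypothetical name), using `Series.mask` as one possible approach.

```python
import pandas as pd

# Made-up sample rows, for illustration only.
toy_rows = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

unassigned = toy_rows['Neighborhood'] == 'Not assigned'

# mask() replaces values where the condition is True with the aligned
# values from the second argument, here the Borough column.
toy_rows['Neighborhood'] = toy_rows['Neighborhood'].mask(unassigned, toy_rows['Borough'])
print(toy_rows)
```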

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [11]:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join)

In [12]:
print(df.head(10)) 

PostalCode  Borough    
M1B         Scarborough                                     Rouge, Malvern
M1C         Scarborough             Highland Creek, Rouge Hill, Port Union
M1E         Scarborough                  Guildwood, Morningside, West Hill
M1G         Scarborough                                             Woburn
M1H         Scarborough                                          Cedarbrae
M1J         Scarborough                                Scarborough Village
M1K         Scarborough        East Birchmount Park, Ionview, Kennedy Park
M1L         Scarborough                    Clairlea, Golden Mile, Oakridge
M1M         Scarborough    Cliffcrest, Cliffside, Scarborough Village West
M1N         Scarborough                        Birch Cliff, Cliffside West
Name: Neighborhood, dtype: object


The result is a pandas Series indexed by PostalCode and Borough, so I convert it to a DataFrame and reset the index:

In [13]:
df = df.to_frame().reset_index()
df.head()

  PostalCode      Borough                            Neighborhood
0        M1B  Scarborough                          Rouge, Malvern
1        M1C  Scarborough  Highland Creek, Rouge Hill, Port Union
2        M1E  Scarborough       Guildwood, Morningside, West Hill
3        M1G  Scarborough                                  Woburn
4        M1H  Scarborough                               Cedarbrae
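
As an alternative to grouping first and converting afterwards, the two steps can be combined by passing `as_index=False` to `groupby`. This is a sketch on made-up rows (`toy_hoods` and `merged` are hypothetical names), not a change to the notebook's own cells.

```python
import pandas as pd

# Made-up sample rows, for illustration only.
toy_hoods = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# as_index=False keeps the grouping keys as ordinary columns, so the
# result is already a DataFrame with a default integer index.
merged = (
    toy_hoods
    .groupby(['PostalCode', 'Borough'], as_index=False)
    .agg({'Neighborhood': ', '.join})
)
print(merged)
```

Groups are sorted by key by default, so M3A comes before M5A in the result, while the neighborhoods inside each group keep their original order.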


In [14]:
df.shape

(103, 3)
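
Beyond checking the shape, the three cleaning rules can be verified directly. The helper below is a hypothetical sketch (`check_clean` is not part of the notebook) that assumes the column names used above and is demonstrated on made-up frames.

```python
import pandas as pd

def check_clean(frame):
    """Return True if the frame satisfies the three cleaning rules (hypothetical helper)."""
    no_unassigned_borough = (frame['Borough'] != 'Not assigned').all()
    no_unassigned_neighborhood = (frame['Neighborhood'] != 'Not assigned').all()
    one_row_per_code = frame['PostalCode'].is_unique
    return bool(no_unassigned_borough and no_unassigned_neighborhood and one_row_per_code)

# Made-up frames, for illustration only.
clean_frame = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Harbourfront, Regent Park'],
})
dirty_frame = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Harbourfront', 'Regent Park'],
})

print(check_clean(clean_frame))
print(check_clean(dirty_frame))
```

Calling `check_clean(df)` on the final DataFrame would confirm that each postal code appears once and that no "Not assigned" values remain.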