# Segmenting and Clustering Neighborhoods in Toronto

### Andrew Shelstad

# Section 1: Toronto Wikipedia Data Retrieval & Preparation

## Step 1: Retrieving Data From Wikipedia

The postal code, borough, and neighborhood name for each Toronto neighborhood will be retrieved from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [1]:
# import html and requests to scrape the data from the webpage in html format

from lxml import html
import requests

# use the wikipedia url to save the html tree into a python variable

wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page = requests.get(wiki_url)
tree = html.fromstring(wiki_page.content)

# parse through table rows in html tree (xpath contains table row)
tr_elements = tree.xpath('//tr')

# empty list that will hold one (column name, values) pair per column
tab = []

# for each cell in the header row, store the column name together with an empty list for that column's values
for t in tr_elements[0]:
    name = t.text_content()
    tab.append((name, []))
    
print(tab)

[('Postcode', []), ('Borough', []), ('Neighbourhood\n', [])]


Now that an empty list exists for each column, the next step is to append the data from each table row to the matching list.

In [2]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Skip the first column (postal code); for the other columns,
        #convert any numerical value to an integer
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        tab[i][1].append(data)
        #Increment i for the next column
        i+=1
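
The header and row collection above can be tried offline on a small table. The sketch below is illustrative only: `sample_html` is a made-up three-column table standing in for the Wikipedia page, and `columns` plays the role of `tab`.

```python
from lxml import html

# Hypothetical miniature of the Wikipedia table (not the real page content)
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

tree = html.fromstring(sample_html)
rows = tree.xpath('//tr')

# Each header cell becomes a (column name, empty list) pair
columns = [(cell.text_content(), []) for cell in rows[0]]

# Append each data cell to the list of the column it belongs to
for row in rows[1:]:
    if len(row) != 3:
        # a row with a different number of cells is not part of this table
        break
    for (name, values), cell in zip(columns, row.iterchildren()):
        values.append(cell.text_content())

print(columns)
```

Pairing the columns with the cells through `zip` removes the need for a manual column counter. The last cell of each row keeps its trailing line break, which is the same effect seen in the notebook's data.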

## Step 2: Storing the Data in a Pandas Dataframe

Now that the data is stored in a list, I convert it to a dictionary and store the dictionary in a pandas dataframe.

In [3]:
# convert list to dictionary (named table_dict to avoid shadowing the built-in dict)

table_dict = {title:column for (title,column) in tab}

# store dictionary in pandas dataframe

import pandas as pd

tor_df = pd.DataFrame(table_dict)
print(tor_df.shape)
print(tor_df.head())

(288, 3)
  Postcode           Borough     Neighbourhood\n
0      M1A      Not assigned      Not assigned\n
1      M2A      Not assigned      Not assigned\n
2      M3A        North York         Parkwoods\n
3      M4A        North York  Victoria Village\n
4      M5A  Downtown Toronto      Harbourfront\n
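
Because `tab` is a list of `(name, values)` pairs, the conversion can also be done with the built-in `dict()` constructor. The sketch below assumes a hypothetical two-row stand-in for `tab`; the names `columns_by_name` and `sample_df` are illustrative.

```python
import pandas as pd

# Hypothetical two-row stand-in for the scraped `tab` list
tab = [
    ('Postcode', ['M3A', 'M4A']),
    ('Borough', ['North York', 'North York']),
    ('Neighbourhood\n', ['Parkwoods\n', 'Victoria Village\n']),
]

# dict() accepts an iterable of (key, value) pairs directly
columns_by_name = dict(tab)

# Each key becomes a column name, each list becomes that column's values
sample_df = pd.DataFrame(columns_by_name)
print(sample_df.shape)
```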


## Step 3: Cleaning Up & Organizing Data

The data in the dataframe needs to be organized. The first step is to rename the *"Postcode"* and *"Neighbourhood\n"* columns to *"PostalCode"* and *"Neighborhood"*.

In [4]:
# rename Postal Code and Neighborhood column

tor_df.rename(columns={'Postcode':'PostalCode', 'Neighbourhood\n':'Neighborhood'}, inplace=True)
tor_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n


All of the values in the third column end with a *"\n"* because the text of the last cell in each row of the HTML table ends with a line break. This needs to be removed from the dataframe.

In [5]:
# replace the "\n" in the third column with ""
tor_df['Neighborhood'] = tor_df['Neighborhood'].str.replace('\n','')
tor_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
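
As a possible alternative to the rename and replace steps above, `str.strip()` could remove the trailing line break from both the header and the values. This is only a sketch on a hypothetical two-row frame (`sample_df`), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the freshly scraped dataframe
sample_df = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood\n': ['Parkwoods\n', 'Victoria Village\n'],
})

# Strip surrounding whitespace (including '\n') from every column name
sample_df.columns = sample_df.columns.str.strip()

# Strip surrounding whitespace from the values of the third column
sample_df['Neighbourhood'] = sample_df['Neighbourhood'].str.strip()

# Rename to the column names used in the rest of the notebook
sample_df = sample_df.rename(columns={'Postcode': 'PostalCode',
                                      'Neighbourhood': 'Neighborhood'})
print(sample_df)
```

Unlike `str.replace('\n', '')`, `str.strip()` only touches the start and end of each string, so a line break in the middle of a value would be left in place.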


Next I drop all of the rows whose Borough value equals *"Not assigned"*.

In [6]:
# count unassigned values in borough
print(tor_df[tor_df.Borough == 'Not assigned'].count())
print('--------------')

# ignore unassigned values in borough
tor_df = tor_df[tor_df.Borough != 'Not assigned']

# double check that no 'Not assigned' values remain in borough
print(tor_df[tor_df.Borough == 'Not assigned'].count())

PostalCode      77
Borough         77
Neighborhood    77
dtype: int64
--------------
PostalCode      0
Borough         0
Neighborhood    0
dtype: int64
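
The count-then-filter pattern above can be written around a single boolean mask. A minimal sketch on a hypothetical three-row frame; `unassigned`, `n_unassigned`, and `filtered_df` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for tor_df
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# True for every row whose borough is unassigned
unassigned = sample_df['Borough'] == 'Not assigned'

# Summing a boolean mask counts the True values
n_unassigned = int(unassigned.sum())

# Keep only the rows where the mask is False
filtered_df = sample_df[~unassigned].reset_index(drop=True)

print(n_unassigned)
print(filtered_df)
```

Reusing the mask gives one row count rather than the same count repeated for every column.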


In the next cell I group the data by postal code and join the neighborhoods that share a postal code, separated by commas. I used an aggregate function for this, but I could not figure out how to keep the borough column in the dataframe without concatenating the repeated borough values as well.

In [7]:
# group by postal code and join the neighborhoods with the same postal code separated by a ','

df_temp = tor_df.groupby('PostalCode').agg({'Neighborhood':', '.join})
df_temp.reset_index(inplace=True)
df_temp.head(10)

Unnamed: 0,PostalCode,Neighborhood
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae
5,M1J,Scarborough Village
6,M1K,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,"Clairlea, Golden Mile, Oakridge"
8,M1M,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,"Birch Cliff, Cliffside West"


To bring the borough column back into the dataframe, I merge the temporary grouped dataframe from the previous cell with the original dataframe and then drop the duplicate rows. I then reorder the columns and reset the index.

In [8]:
# merge the dataframes back together to get boroughs
df_tor = pd.merge(df_temp[['PostalCode','Neighborhood']], tor_df[['PostalCode', 'Borough']], on = 'PostalCode', how = 'left')

#drop duplicate postal code values
df_tor.drop_duplicates(subset='PostalCode', inplace=True)

#reorder columns and reset index
df_tor = df_tor[['PostalCode', 'Borough', 'Neighborhood']]
df_tor.reset_index(drop = True, inplace = True)

df_tor.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
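
The group, merge, and de-duplicate steps above could probably be collapsed into a single aggregation that gives each column its own function. The sketch below uses a hypothetical three-row frame and assumes each postal code belongs to exactly one borough, so taking the `'first'` borough per group loses nothing.

```python
import pandas as pd

# Hypothetical stand-in for tor_df after the unassigned boroughs are dropped
sample_df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One aggregation per column: keep the first borough of each group
# (assumed identical within a postal code) and join the neighborhoods
grouped_df = (
    sample_df
    .groupby('PostalCode', as_index=False)
    .agg({'Borough': 'first', 'Neighborhood': ', '.join})
)

print(grouped_df)
```

With `as_index=False` the postal code stays a regular column, so no `reset_index` or column reordering is needed afterwards.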


In the next 3 cells, I check whether any unassigned values remain in the neighborhood column after dropping the rows with unassigned boroughs. It turns out there is one, so I replace it with the corresponding borough value. I then do a final check to make sure that no unassigned values remain in the neighborhood and borough columns.

In [9]:
# find if there are any not assigned values for neighborhood

df_tor[df_tor.Neighborhood == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


In [10]:
# replace not assigned value for neighborhood with the borough
df_tor.loc[df_tor.Neighborhood == 'Not assigned', 'Neighborhood'] = df_tor.loc[df_tor.Neighborhood == 'Not assigned', 'Borough']
df_tor[df_tor.PostalCode == 'M7A']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park
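
The same replacement can also be expressed with `Series.mask`, which swaps in values from another series wherever a condition is true. A minimal sketch on a hypothetical two-row frame; `is_unassigned` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for df_tor
sample_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

# True where the neighborhood is still unassigned
is_unassigned = sample_df['Neighborhood'] == 'Not assigned'

# Where the condition holds, take the value from Borough instead
sample_df['Neighborhood'] = sample_df['Neighborhood'].mask(
    is_unassigned, sample_df['Borough']
)

print(sample_df)
```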


In [11]:
# double check that no unassigned values remain
print('Neighborhood Not Assigned:','\n',df_tor[df_tor.Neighborhood == 'Not assigned'].count())
print('Borough Not Assigned:','\n', df_tor[df_tor.Borough == 'Not assigned'].count())

Neighborhood Not Assigned: 
 PostalCode      0
Borough         0
Neighborhood    0
dtype: int64
Borough Not Assigned: 
 PostalCode      0
Borough         0
Neighborhood    0
dtype: int64


Looks good. The last step is to print the shape of the dataframe.

In [12]:
print(df_tor.shape)

(103, 3)
