<h2>Segmenting and Clustering Neighborhoods in Toronto</h2>
Import the pandas and requests libraries, then install the Beautiful Soup (v4) and lxml packages, which are used to parse the scraped HTML.

In [7]:
import pandas as pd
import requests

!conda install -c conda-forge beautifulsoup4 --yes
!conda install -c conda-forge lxml --yes


Create an empty dataframe (neighborhoods) and set its column names to <b>PostCode</b>, <b>Borough</b> and <b>Neighborhood</b>.

In [8]:
# define list of column names to be used in the neighborhoods dataframe
column_names = ['PostCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe and set the column name
neighborhoods = pd.DataFrame(columns=column_names)
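
An empty dataframe can be grown row by row, but a common alternative is to collect the rows in a plain list of dicts and build the dataframe once at the end. The following is a minimal sketch of that pattern, not the approach used in this notebook; the name `demo` is hypothetical and the two sample rows are copied from the output shown further down.

```python
import pandas as pd

# same column names as the neighborhoods dataframe
column_names = ['PostCode', 'Borough', 'Neighborhood']

# rows collected as plain dicts (two sample rows taken from the notebook output)
rows = [
    {'PostCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    {'PostCode': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'},
]

# build the dataframe in a single call; columns= fixes the column order
demo = pd.DataFrame(rows, columns=column_names)
print(demo.shape)
```

Building the frame in one call avoids copying the whole dataframe each time a row is added, which matters more as the table grows.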

Import the Beautiful Soup library and download the HTML of the target URL. Scrape the post code data from the Wikipedia table.

In [9]:
#import the beautiful soup library
from bs4 import BeautifulSoup

#set the target url and extract the html text from the wiki url
wiki_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

#create the Beautifulsoup object and assign to variable soup
soup = BeautifulSoup(wiki_url,'lxml')

#find the post code table (wikitable) and assign only those elements belonging to the table to a variable - pcode_tbl
pcode_tbl = soup.find('table', class_= 'wikitable')

Create a loop which cycles through each row in the table, scraping the postcode, borough and neighborhood data from each row and assigning these to the variables var_a, var_b and var_c. Remove the trailing newline characters ('\n') from the text strings and replace the forward slash (' /') separating neighborhoods with a comma. Skip rows with no borough assigned and, where a borough has no neighborhood assigned, use the borough name as the neighborhood, before appending the cleaned data to the neighborhoods dataframe.

In [10]:
#create a loop which cycles through each row in the table 
for rows in pcode_tbl.find_all('tr'):
    
    #assign the cells of each row (<td>) to the variable cells
    cells = rows.find_all('td')

    #first test if there are 3 cells in the row - representing post code, borough and neighborhood
    if len(cells) == 3:
    
        #assign data scraped from the <td> cells to variables for postcode, borough and neighborhood and clean the text
        var_a = cells[0].find(text=True).rstrip('\n')
        var_b = cells[1].find(text=True).rstrip('\n')
        var_c = cells[2].find(text=True).rstrip('\n').replace(' /',',')
        
        #omit rows where no borough is assigned
        if var_b != 'Not assigned':
            
            #test for post codes with a borough, yet not assigned neighborhood
            if var_c == 'Not assigned':
                
                #set the neighborhood name to the same as the borough
                var_c = var_b
            
            #create row data for cleaned postcode, borough and neighborhood
            new_row = {'PostCode':var_a, 'Borough':var_b, 'Neighborhood':var_c}
            
            #append the row data to the neighborhoods dataframe
            #(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
            neighborhoods = pd.concat([neighborhoods, pd.DataFrame([new_row])], ignore_index=True)

#check the outcome
neighborhoods

Unnamed: 0,PostCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road , Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [11]:
neighborhoods.shape

(103, 3)
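
The cleaning rules applied in the scraping loop can be checked without touching the network. The sketch below restates them as a small pure-Python function; `clean_row` and `sample_rows` are hypothetical names introduced here for illustration, and the sample rows (including the made-up post code `M9Z`) are illustrative inputs, not scraped data.

```python
def clean_row(postcode, borough, neighborhood):
    """Apply the notebook's cleaning rules to one row.

    Returns a dict for rows that should be kept, or None for rows to skip.
    """
    # strip the trailing newline left by the table cell text
    postcode = postcode.rstrip('\n')
    borough = borough.rstrip('\n')
    # replace the ' /' separator between neighborhoods with a comma
    neighborhood = neighborhood.rstrip('\n').replace(' /', ',')

    # omit rows where no borough is assigned
    if borough == 'Not assigned':
        return None

    # a borough without a neighborhood reuses the borough name
    if neighborhood == 'Not assigned':
        neighborhood = borough

    return {'PostCode': postcode, 'Borough': borough, 'Neighborhood': neighborhood}


# illustrative inputs: the second tuple mimics a real row, the third is made up
sample_rows = [
    ('M1A\n', 'Not assigned\n', 'Not assigned\n'),
    ('M5A\n', 'Downtown Toronto\n', 'Regent Park / Harbourfront\n'),
    ('M9Z\n', 'Example Borough\n', 'Not assigned\n'),
]

cleaned = [row for row in (clean_row(*r) for r in sample_rows) if row is not None]
for row in cleaned:
    print(row)
```

Keeping the rules in one function makes each branch (skipped row, separator replacement, borough fallback) easy to verify in isolation before running the full scrape.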