<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

In this assignment, we explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike New York, however, the neighborhood data is not readily available on the internet, so we first scrape it from a Wikipedia page.

In [1]:
# First, install BeautifulSoup and the parsers it can use (uncomment to run)
#!pip install beautifulsoup4
#!pip install lxml
#!pip install html5lib

In [2]:
#Import libraries
import pandas as pd
import numpy as np
import requests

In [3]:
# Download the page using the requests library
wiki_toronto = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
wiki_toronto
# A status_code of 200 means the page downloaded successfully. We won't dive into status codes here, but a code starting with 2 generally indicates success, and a code starting with 4 or 5 indicates an error.

<Response [200]>
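
The rule of thumb about status codes in the comment above can be sketched as a small helper. `describe_status` is a hypothetical name used only for illustration and is not part of `requests`; in real code, calling `raise_for_status()` on the response raises an exception for 4xx and 5xx codes.

```python
def describe_status(status_code: int) -> str:
    """Classify an HTTP status code by its range (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


print(describe_status(200))  # success
print(describe_status(404))  # client error
```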

In [4]:
#wiki_toronto.content
# Using Chrome's dev tools, we can see that the table is in <table class="wikitable sortable jquery-tablesorter">

In [5]:
#Let's import the BeautifulSoup library
from bs4 import BeautifulSoup
soup = BeautifulSoup(wiki_toronto.content, 'html.parser')

In [6]:
#Let's preview how our page looks in HTML with the soup
print(soup.prettify()[:500])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3b3a


In [7]:
# The element with id "mw-content-text" holds the article body, including the table
# Retrieve it from the page's HTML
table = soup.find(id="mw-content-text")
items = table.find_all("table")  # all <table> elements inside that element
 

In [8]:
full_table = items[0]
#print(full_table.prettify())
#full_table
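
The lookup in the last two cells can be tried on a tiny page first. The sketch below uses made-up HTML standing in for the Wikipedia article (an assumption for illustration only); the names `soup_demo`, `content`, `tables`, and `first_table` are hypothetical.

```python
from bs4 import BeautifulSoup

# Toy page standing in for the Wikipedia article (made-up markup)
html = """
<div id="mw-content-text">
  <table><tr><th>PostalCode</th></tr><tr><td>M3A</td></tr></table>
  <table><tr><td>navigation box</td></tr></table>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")
content = soup_demo.find(id="mw-content-text")  # look the element up by its id
tables = content.find_all("table")              # every <table> tag inside it
first_table = tables[0]                         # the first one is the data table here

print(len(tables))                        # 2
print(first_table.find("td").get_text())  # M3A
```

Taking `tables[0]` only works while the data table is the first table inside that element, so it is worth checking the result if the page layout changes.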

In [9]:
# Define a helper class that copies every cell of an HTML table into a pandas DataFrame

class HTMLTableParser:  
    
        def parse_html_table(self, table):
            n_columns = 0
            n_rows=0
            column_names = []
    
            # Find number of rows and columns
            # we also find the column titles if we can
            for row in table.find_all('tr'):
                
                # Determine the number of rows in the table
                td_tags = row.find_all('td')
                if len(td_tags) > 0:
                    n_rows+=1
                    if n_columns == 0:
                        # Set the number of columns for our table
                        n_columns = len(td_tags)
                        
                # Handle column names if we find them
                th_tags = row.find_all('th') 
                if len(th_tags) > 0 and len(column_names) == 0:
                    for th in th_tags:
                        column_names.append(th.get_text())
    
            # Safeguard on Column Titles
            if len(column_names) > 0 and len(column_names) != n_columns:
                raise Exception("Column titles do not match the number of columns")
    
            columns = column_names if len(column_names) > 0 else range(0,n_columns)
            df = pd.DataFrame(columns = columns,
                              index= range(0,n_rows))
            row_marker = 0
            for row in table.find_all('tr'):
                column_marker = 0
                columns = row.find_all('td')
                for column in columns:
                    df.iat[row_marker,column_marker] = column.get_text()
                    column_marker += 1
                if len(columns) > 0:
                    row_marker += 1
            
            return df
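
A more compact variant of the same idea, shown on a toy table. This is only a sketch: `table_to_dataframe` and the markup are made up for illustration, and it differs from the class above in that it trims whitespace while extracting, using `get_text(strip=True)`.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(table_tag):
    """Use <th> text as column names and <td> text as rows (hypothetical helper)."""
    header, rows = [], []
    for tr in table_tag.find_all("tr"):
        th_cells = tr.find_all("th")
        td_cells = tr.find_all("td")
        if th_cells and not header:
            # Only the first header row is used for column names
            header = [th.get_text(strip=True) for th in th_cells]
        if td_cells:
            rows.append([td.get_text(strip=True) for td in td_cells])
    return pd.DataFrame(rows, columns=header or None)


# Toy table with trailing newlines inside the cells (made-up markup)
toy_html = """
<table>
  <tr><th>PostalCode
</th><th>Borough
</th></tr>
  <tr><td>M3A
</td><td>North York
</td></tr>
</table>
"""
toy_table = BeautifulSoup(toy_html, "html.parser").find("table")
toy_df = table_to_dataframe(toy_table)
print(toy_df)
```

Collecting the rows in a list and building the DataFrame once avoids having to count rows and columns in a separate pass.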

In [10]:
hp = HTMLTableParser()
table = hp.parse_html_table(full_table)  # parse the selected table into a DataFrame
table.head()

Unnamed: 0,PostalCode\n,Borough\n,Neighborhood\n
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [11]:
# Remove newline characters from the column names and from the cell values
table.columns = table.columns.str.strip()  # safer than slicing off the last character of each name
table = table.replace('\n', '', regex=True)
table.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
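
To see why cutting off the last character of each column name is fragile, here is a small comparison on toy data. The frame `demo` is made up, and one of its column names deliberately has no trailing newline.

```python
import pandas as pd

demo = pd.DataFrame({"PostalCode\n": ["M3A\n"], "Borough": ["North York\n"]})

# Slicing off the last character damages a name that has no trailing newline
sliced = demo.rename(lambda name: name[:-1], axis=1)
print(list(sliced.columns))       # ['PostalCode', 'Boroug']

# str.strip() only removes surrounding whitespace, so clean names are left alone
stripped = demo.copy()
stripped.columns = stripped.columns.str.strip()
stripped = stripped.apply(lambda col: col.str.strip())
print(list(stripped.columns))     # ['PostalCode', 'Borough']
print(stripped.iloc[0].tolist())  # ['M3A', 'North York']
```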


In [12]:
# Drop the rows whose Borough is "Not assigned"
table = table[table.Borough != 'Not assigned']

In [13]:
# Reset the index (reset_index returns a new DataFrame, so assign the result back)
table = table.reset_index(drop=True)
table

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
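
Filtering keeps the old index labels, and `reset_index` returns a new DataFrame rather than changing the existing one. A minimal sketch on made-up rows (the frame `demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

kept = demo[demo["Borough"] != "Not assigned"]
print(list(kept.index))      # [1, 2]: the old labels survive the filter

kept.reset_index(drop=True)  # result is discarded, kept is unchanged
print(list(kept.index))      # [1, 2]

kept = kept.reset_index(drop=True)
print(list(kept.index))      # [0, 1]
```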


In [14]:
# Check whether there are duplicates:
duplicateRows = table[table.duplicated()]
duplicateRows
# No duplicate rows, so there is nothing to merge. (I think somebody cleaned up the wiki page.)

Unnamed: 0,PostalCode,Borough,Neighborhood
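
If one postal code had been spread over several rows, those rows would need merging. Here is one way to sketch that on toy data, assuming the neighborhoods should be joined into a single comma-separated string. Note that `duplicated()` without arguments only flags rows that are identical in every column, so the sketch checks the `PostalCode` column explicitly.

```python
import pandas as pd

# Made-up rows: the same postal code appears twice with different neighborhoods
demo = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Full-row check finds nothing, the per-column check does
print(demo.duplicated().any())                     # False
print(demo.duplicated(subset="PostalCode").any())  # True

# Join the neighborhoods of each postal code into one string
merged = (
    demo.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
)
print(merged)
```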


In [15]:
# Just experimenting with filters
#table.Neighborhood[table.Neighborhood == "Not assigned"]
#table[table['Neighborhood'].str.contains('Regent Park', regex=False)]


In [16]:
# Let's check whether any Neighborhoods are left with the "Not assigned" value
table[table.Neighborhood == "Not assigned"]  # None, so there is no need to fill them in with the Borough name

Unnamed: 0,PostalCode,Borough,Neighborhood
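
Had any neighborhood still been "Not assigned", it could have been filled with the borough name. A minimal sketch on made-up rows (the values below are for illustration only, not taken from the scraped table):

```python
import pandas as pd

demo = pd.DataFrame({
    "PostalCode": ["X1A", "X2A"],
    "Borough": ["Borough One", "Borough Two"],
    "Neighborhood": ["Not assigned", "Some Neighborhood"],
})

# Boolean mask of the rows that still need a neighborhood
missing = demo["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only
demo.loc[missing, "Neighborhood"] = demo.loc[missing, "Borough"]
print(demo)
```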


In [17]:
# As requested, the last cell prints the shape of the dataframe:
print(table.shape)

(103, 3)


All comments and explanations are placed next to the code in the cells above.
No assumptions yet, as the dataframe doesn't contain any significant or interesting data (just the names of the Toronto neighborhoods).