# Clustering of Toronto Neighbourhoods

This notebook contains the work for clustering Toronto neighbourhoods.

## Part I: Process and Clean Data

We first import the required libraries: `requests` to fetch the HTML page, `lxml` to parse it, and `pandas` to handle the data.

In [147]:
import requests
import lxml.html as lh
import pandas as pd

Request the HTML page from Wikipedia that lists the postal codes of Canada beginning with M.

In [148]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
doc = lh.fromstring(page.content)

Select the rows of the table that contains the postal codes. The table is identified by its `class` attribute, which is set to `wikitable sortable`; only one table on the page has that attribute value.

In [149]:
table_one_elements=doc.xpath('//table[@class="wikitable sortable"]/tbody/tr')
table_one_elements_len = len(table_one_elements)
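
The `@class="..."` predicate compares the whole attribute string, so a table carrying any additional class would not be selected. The sketch below illustrates this on a hand-written snippet. It is an illustration under assumptions: it uses the standard library's `xml.etree.ElementTree` (which accepts the same attribute predicate) instead of `lxml`, and the snippet, including the `collapsible` class on the second table, is invented for the example rather than taken from the Wikipedia page.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed snippet (an assumption, not the real page).
snippet = """
<html><body>
<table class="wikitable sortable"><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody></table>
<table class="wikitable sortable collapsible"><tbody>
<tr><td>ignored</td></tr>
</tbody></table>
</body></html>
""".strip()

root = ET.fromstring(snippet)

# Exact string match on the class attribute: only the first table qualifies.
rows = root.findall(".//table[@class='wikitable sortable']/tbody/tr")

# The first selected row holds the header cells.
header = [cell.text.strip() for cell in rows[0]]
print(len(rows), header)
```

If the page layout ever adds a class to the table, the selector in the cell above would need to be loosened accordingly.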

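Extract the table header.

In [150]:
# Collect the text of each header cell, stripping surrounding whitespace
# such as the trailing newline in the last cell.
table_header = [col.text_content().strip() for col in table_one_elements[0].iterchildren()]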

Extract the table data into a Python dictionary keyed by postal code. Rows that do not have exactly three columns are skipped, since we don't know how to process them. Rows whose borough (column 2) is 'Not assigned' are skipped as well. If the neighbourhood is 'Not assigned', the borough name is used as the neighbourhood. If the postal code already exists in the dictionary, the neighbourhood of the current row is appended to the neighbourhoods already recorded for that postal code; otherwise a new entry is created with the borough and neighbourhood of the current row.
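
The rules above can be checked in isolation on a few hand-written rows before running them against the scraped table. This is a minimal sketch, not the notebook's own cell: the helper name `merge_rows` and the sample rows (including the deliberately malformed two-column `M9Z` row) are assumptions made for illustration.

```python
def merge_rows(rows):
    """Collapse (postcode, borough, neighbourhood) rows into one entry per postcode."""
    merged = {}
    for row in rows:
        # Skip rows that do not have exactly three columns.
        if len(row) != 3:
            continue
        postcode, borough, neighbourhood = (cell.strip() for cell in row)
        # Skip postal codes with no assigned borough.
        if borough == "Not assigned":
            continue
        # Fall back to the borough when no neighbourhood is assigned.
        if neighbourhood == "Not assigned":
            neighbourhood = borough
        # Append to the neighbourhoods already recorded for this postcode.
        if postcode in merged:
            neighbourhood = merged[postcode][1] + ", " + neighbourhood
        merged[postcode] = [borough, neighbourhood]
    return merged


# Hypothetical sample rows covering each rule.
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M7A", "Queen's Park", "Not assigned"),
    ("M9Z", "Etobicoke"),
]
result = merge_rows(sample_rows)
print(result)
```

Keeping the logic in a small function like this makes each rule testable without fetching the page.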

In [151]:
table_data = {}
for row_content in table_one_elements[1:]:
    # Skip rows that do not have exactly three columns; their layout is unknown.
    if len(row_content) != 3:
        continue

    postcode = row_content[0].text_content().strip()
    borough = row_content[1].text_content().strip()
    neighbourhood = row_content[2].text_content().strip()

    # Skip postal codes with no assigned borough.
    if borough == 'Not assigned':
        continue

    # Use the borough as the neighbourhood when none is assigned.
    # (Previously this fallback only ran for postal codes already in the dictionary.)
    if neighbourhood == 'Not assigned':
        neighbourhood = borough

    # Postal code seen before: append to the neighbourhoods already recorded.
    if postcode in table_data:
        neighbourhood = table_data[postcode][1] + ", " + neighbourhood

    table_data[postcode] = [borough, neighbourhood]

Convert the dictionary into a pandas DataFrame. Reset the index and rename the index column to `table_header[0]` (the first header cell extracted in the earlier step). Display the DataFrame to check that it looks right.

In [152]:
df = pd.DataFrame.from_dict(table_data, orient='index', columns=table_header[1:])
df = df.reset_index().rename(columns={'index': table_header[0]})
df

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


Print the shape of the DataFrame (number of rows, number of columns).

In [153]:
df.shape

(103, 3)