# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, you will explore and cluster the neighborhoods in Toronto.

Start by creating a new notebook for this assignment.
Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform the data into a pandas dataframe like the one shown below:

In [3]:
#!conda update -n base -c defaults conda
#!conda install -c anaconda lxml
import requests
import lxml.html as lh
import pandas as pd

In [31]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Download the page
page = requests.get(url)

# Parse the HTML and keep the document tree in doc
doc = lh.fromstring(page.content)

# Select every table row (<tr>...</tr>) in the document
tr_elements = doc.xpath('//tr')

In [32]:
# Check the number of cells in each of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
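
The row-length check above can be tried offline on a small inline table. The following is a minimal sketch that swaps lxml for the standard library's `html.parser`, so it is not the parser used in this notebook. The `RowCollector` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical three-column sample that mimics the layout of the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Close the cell: join its fragments and drop surrounding whitespace
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Close the row and discard any cell left open by malformed HTML
            self.rows.append(self._row)
            self._row = None
            self._cell = None


parser = RowCollector()
parser.feed(SAMPLE_HTML)
rows = parser.rows

# Same check as in the cell above: number of cells per row
row_lengths = [len(r) for r in rows]
print(row_lengths)
```

Every row of the sample has three cells, which is the property the notebook relies on to tell the postal code table apart from other tables on the page.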

In [33]:
# Parse the first row as the header
tr_elements = doc.xpath('//tr')

# List of (column name, values) pairs
col=[]
i=0

# For each cell of the header row, store its text and an empty list for the column values
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

1:"Postcode"
2:"Borough"
3:"Neighbourhood
"


In [34]:
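# Collect the table data column by column
# Since the first row is the header, the data starts at the second row

for j in range(1,len(tr_elements)):
    # T is the j'th row
    T=tr_elements[j]

    # Stop at the first row that does not have 3 cells: it is not part of the postal code table
    if len(T)!=3:
        break

    # i is the index of the current column
    i=0

    # Iterate through each cell of the row
    for t in T.iterchildren():
        data=t.text_content()
        # Leave the first column (Postcode) as text
        if i>0:
            # Convert any numerical value to an integer
            try:
                data=int(data)
            except ValueError:
                pass
        # Append the data to the list of the i'th column
        col[i][1].append(data)
        # Increment i for the next column
        i+=1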
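
The loop above turns a list of rows into one list per column and stops at the first row of a different width. A compact way to express the same idea on plain Python lists is sketched below. The `rows` data is a hypothetical stand-in for the scraped cell texts, and `takewhile` and `zip` are used here instead of the explicit index counter.

```python
from itertools import takewhile

# Hypothetical cell texts, one inner list per table row.
rows = [
    ["Postcode", "Borough", "Neighbourhood"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
    ["unrelated footer cell"],                    # a row from another table
    ["M5A", "Downtown Toronto", "Harbourfront"],  # never reached
]

header, *body = rows

# Keep rows until the first one whose width differs from the header's.
table_rows = list(takewhile(lambda r: len(r) == len(header), body))

# Transpose the row lists into one list per column.
columns = {name: list(values) for name, values in zip(header, zip(*table_rows))}
print(columns)
```

As in the notebook's loop, everything after the first row of a different width is ignored, so the approach assumes the table of interest comes first on the page.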

In [35]:
# Check that every column has the same length
[len(C) for (title,C) in col]

[288, 288, 288]

In [36]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
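
Building a dataframe from a dict of columns can be tried on a small sample. The sketch below assumes a hypothetical `sample_col` list shaped like `col`. It strips the header names while building the dict and selects the columns by name, because some older pandas releases sorted dict keys alphabetically while newer ones keep insertion order.

```python
import pandas as pd

# Hypothetical stand-in for the scraped `col` list: (header, values) pairs.
sample_col = [
    ("Postcode", ["M3A", "M4A"]),
    ("Borough", ["North York", "North York"]),
    ("Neighbourhood\n", ["Parkwoods\n", "Victoria Village\n"]),
]

# Strip whitespace from the headers while building the dict of columns.
sample_df = pd.DataFrame({title.strip(): values for title, values in sample_col})

# Select the columns explicitly so the order does not depend on the pandas version.
sample_df = sample_df[["Postcode", "Borough", "Neighbourhood"]]
print(sample_df.columns.tolist())
```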

In [10]:
# Preview the first rows of the dataframe

df.head()

   Postcode  Borough           Neighbourhood
0  M1A       Not assigned      Not assigned\n
1  M2A       Not assigned      Not assigned\n
2  M3A       North York        Parkwoods\n
3  M4A       North York        Victoria Village\n
4  M5A       Downtown Toronto  Harbourfront\n


In [42]:
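# Renaming and rearranging the columns

# Strip the trailing newline from the scraped header names
df.columns = [c.strip() for c in df.columns]

# Select the columns by name so the order does not depend on how the dataframe was built
df = df[['Postcode', 'Borough', 'Neighbourhood']]

df.head()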

   Postcode  Borough           Neighbourhood
0  M1A       Not assigned      Not assigned\n
1  M2A       Not assigned      Not assigned\n
2  M3A       North York        Parkwoods\n
3  M4A       North York        Victoria Village\n
4  M5A       Downtown Toronto  Harbourfront\n


In [43]:
# Replace the newline characters in the cell values with spaces

df = df.replace('\n',' ', regex=True)
df.head()

   Postcode  Borough           Neighbourhood
0  M1A       Not assigned      Not assigned
1  M2A       Not assigned      Not assigned
2  M3A       North York        Parkwoods
3  M4A       North York        Victoria Village
4  M5A       Downtown Toronto  Harbourfront
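
Replacing `'\n'` with a space leaves a trailing space in each cell, which is invisible in the preview. The sketch below, on a hypothetical two-row sample, compares that call with `str.strip()`, which removes the newline without leaving a space behind.

```python
import pandas as pd

# Hypothetical sample cells with the trailing newline seen in the scraped table.
sample = pd.DataFrame({"Neighbourhood": ["Parkwoods\n", "Victoria Village\n"]})

# Same call as in the cell above: the newline becomes a trailing space.
replaced = sample.replace("\n", " ", regex=True)

# Alternative: strip leading and trailing whitespace from the string column.
stripped = sample["Neighbourhood"].str.strip()

print(replaced["Neighbourhood"].tolist())
print(stripped.tolist())
```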


In [44]:
# Dropping all rows whose borough is Not assigned

df.drop(df.index[df['Borough'] == 'Not assigned'], inplace = True)

# Reset the index and discard the previous one
df = df.reset_index(drop=True)

df.head(10)

   Postcode  Borough           Neighbourhood
0  M3A       North York        Parkwoods
1  M4A       North York        Victoria Village
2  M5A       Downtown Toronto  Harbourfront
3  M5A       Downtown Toronto  Regent Park
4  M6A       North York        Lawrence Heights
5  M6A       North York        Lawrence Manor
6  M7A       Queen's Park      Not assigned
7  M9A       Etobicoke         Islington Avenue
8  M1B       Scarborough       Rouge
9  M1B       Scarborough       Malvern
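
The same filtering can also be written with a boolean mask instead of an in-place `drop`. This is a minimal sketch on a hypothetical three-row sample, not a change to the cell above.

```python
import pandas as pd

# Hypothetical sample with one unassigned borough.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only the rows with an assigned borough, then renumber the index from 0.
kept = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(kept)
```

The mask version returns a new dataframe and leaves `sample` untouched, which makes the cell safe to re-run.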


In [46]:
# Combining the neighbourhoods that share the same Postcode and Borough

df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(','.join).reset_index()
df.columns = ['Postcode','Borough','Neighbourhood']
df.head(10)

   Postcode  Borough      Neighbourhood
0  M1B       Scarborough  Rouge ,Malvern
1  M1C       Scarborough  Highland Creek ,Rouge Hill ,Port Union
2  M1E       Scarborough  Guildwood ,Morningside ,West Hill
3  M1G       Scarborough  Woburn
4  M1H       Scarborough  Cedarbrae
5  M1J       Scarborough  Scarborough Village
6  M1K       Scarborough  East Birchmount Park ,Ionview ,Kennedy Park
7  M1L       Scarborough  Clairlea ,Golden Mile ,Oakridge
8  M1M       Scarborough  Cliffcrest ,Cliffside ,Scarborough Village West
9  M1N       Scarborough  Birch Cliff ,Cliffside West
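
The combined strings above keep a space before each comma (for example `Rouge ,Malvern`) because the names still carry a trailing space when they are joined. A possible variant, sketched on a hypothetical sample, strips each name first and then joins with `', '`.

```python
import pandas as pd

# Hypothetical sample with the trailing spaces left by the newline replacement.
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M3A"],
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Rouge ", "Malvern ", "Parkwoods "],
})

# Strip every name before grouping so no stray spaces end up inside the joined string.
sample["Neighbourhood"] = sample["Neighbourhood"].str.strip()

# One row per (Postcode, Borough) pair, with the names joined by ", ".
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```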


In [47]:
# Removing leading and trailing whitespace from the combined string

df['Neighbourhood'] = df['Neighbourhood'].str.strip()

# Where the neighbourhood is Not assigned, use the borough name instead
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough']

In [48]:
df[df['Borough'] == "Queen's Park"]

    Postcode  Borough       Neighbourhood
85  M7A       Queen's Park  Queen's Park
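
The placeholder replacement can also be expressed with `Series.mask`, which swaps in values from another series wherever a condition holds. The following is a sketch on a hypothetical two-row sample, not the method used in the cell above.

```python
import pandas as pd

# Hypothetical sample: one postcode has no neighbourhood name.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Where the neighbourhood is the placeholder, take the value from Borough instead.
is_placeholder = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(is_placeholder, sample["Borough"])
print(sample)
```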


In [51]:
df.shape

(103, 3)

In [52]:
# Save the cleaned dataframe to a CSV file
df.to_csv(r'df_can.csv')
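
By default `to_csv` also writes the dataframe index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below uses a hypothetical one-row sample and in-memory buffers instead of files to show the difference that `index=False` makes.

```python
import io

import pandas as pd

# Hypothetical one-row sample.
sample = pd.DataFrame({
    "Postcode": ["M3A"],
    "Borough": ["North York"],
    "Neighbourhood": ["Parkwoods"],
})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to pass `index=False` depends on how the file is read later. Code that already expects the extra column would need `index_col=0` or a matching change.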