# Segmenting and Clustering Neighborhoods in Toronto
In this blog post, we will explore neighborhoods in Toronto, Canada.

First, we will write the code to scrape an HTML table from this Wikipedia page (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), which lists the Canadian postal codes that begin with M.  
We will need **requests** to fetch the HTML of the page and **lxml.html** to parse the relevant fields. We will then store the data in a **pandas** DataFrame.

In [0]:
# import libraries
import requests
import lxml.html as lh
import pandas as pd

### Scrape the HTML table cells.

In [0]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# request the page; `page` holds the HTTP response
page = requests.get(url)

# parse the response body into an element tree stored under doc
doc = lh.fromstring(page.content)

# select every table row (<tr>...</tr>) in the document
tr_elements = doc.xpath('//tr')
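
Because `//tr` matches rows from every table on a page, it can help to see what that looks like on a small example. The sketch below uses hand-written HTML rather than the real Wikipedia page, and the `wikitable` class in the narrower XPath is an assumption about how the target table is marked up, not something confirmed here.

```python
import lxml.html as lh

# Hypothetical, hand-written HTML standing in for the downloaded page
sample_html = """
<html><body>
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table class="navbox">
  <tr><td>Unrelated navigation row</td></tr>
</table>
</body></html>
"""

sample_doc = lh.fromstring(sample_html)

# '//tr' selects rows from both tables
all_rows = sample_doc.xpath('//tr')

# restricting the search to one table (assumed class name) skips the rest
table_rows = sample_doc.xpath('//table[contains(@class, "wikitable")]//tr')

# len(row) is the number of child cells in that row
all_widths = [len(row) for row in all_rows]
table_widths = [len(row) for row in table_rows]
print(all_widths)
print(table_widths)
```

The odd row width in `all_widths` is exactly the kind of signal the width check in the next step looks for.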

Ensure that all the rows have the same width. If they do not, we have probably picked up something more than just the table.

In [92]:
# check the length of the first 10 rows
[len(T) for T in tr_elements[:10]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

This means that there are 3 columns per row.  
Now parse the table header.

In [93]:
# select the rows again; the first one is the header
tr_elements = doc.xpath('//tr')

# list of (column name, column values) tuples
col = []
i = 0

# for each cell of the header row, store its text and an empty list for the column values
for t in tr_elements[0]:
    i+=1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

1:"Postcode"
2:"Borough"
3:"Neighbourhood
"


### Create pandas dataframe
Each header is now stored in a tuple alongside an empty list. Next, we fill those lists with the cell values from the remaining rows.

In [0]:
# the first row is the header, so the data starts at the second row
for j in range(1, len(tr_elements)):
    # T is the j-th row
    T = tr_elements[j]

    # a row that does not have 3 cells is not part of our table, so stop here
    if len(T) != 3:
        break

    # i is the index of the current column
    i = 0

    # iterate through each cell of the row
    for t in T.iterchildren():
        data = t.text_content()
        # leave the first column (Postcode) as text; in the other columns,
        # convert purely numeric values to integers
        if i > 0:
            try:
                data = int(data)
            except ValueError:
                pass
        # append the cell to the list of the i-th column
        col[i][1].append(data)
        # move on to the next column
        i += 1
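
The loop above turns row-wise cells into column-wise lists. As a minimal sketch of the same idea, the snippet below transposes a few hypothetical rows with `zip`. The sample values and the `.strip()` calls are illustrative choices, not part of the original notebook.

```python
# Hypothetical cell text, shaped like what text_content() returns
header = ['Postcode', 'Borough', 'Neighbourhood\n']
rows = [
    ['M1A', 'Not assigned', 'Not assigned\n'],
    ['M3A', 'North York', 'Parkwoods\n'],
    ['M4A', 'North York', 'Victoria Village\n'],
]

# zip(*rows) transposes the row-wise data into one tuple per column;
# strip() removes the trailing newline from names and values
columns = {
    name.strip(): [cell.strip() for cell in cells]
    for name, cells in zip(header, zip(*rows))
}
print(columns)
```

Stripping at this stage would make the later newline clean-up unnecessary, at the cost of hiding what the raw cells look like.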

Check the length of each column. They should all be the same.

In [95]:
[len(C) for (title, C) in col]

[287, 287, 287]

This shows that each of the columns has exactly 287 rows.
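
The same check can also be expressed as a single boolean instead of reading the list by eye. This is a small sketch on hypothetical data shaped like `col`; the name `sample_col` is made up for the example.

```python
# Hypothetical columns in the same (title, values) shape as `col`
sample_col = [
    ('Postcode', ['M1A', 'M3A']),
    ('Borough', ['Not assigned', 'North York']),
    ('Neighbourhood', ['Not assigned', 'Parkwoods']),
]

# map each column title to its length
lengths = {title: len(values) for title, values in sample_col}

# all columns have the same length if the set of lengths has one element
all_equal = len(set(lengths.values())) == 1
print(lengths, all_equal)
```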

Create the dataframe.

In [96]:
data = {title: column for (title, column) in col}
df = pd.DataFrame(data)
df.head()

|   | Postcode | Borough | Neighbourhood\n |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |


In [97]:
df.columns = ['Postcode', 'Borough', 'Neighborhood']
cols = df.columns.tolist()
cols

['Postcode', 'Borough', 'Neighborhood']
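
Assigning to `df.columns` relies on the columns being in the expected order. A possible alternative is `DataFrame.rename`, which maps old names to new ones explicitly. The sketch below runs on a one-row hypothetical frame, not on the scraped data.

```python
import pandas as pd

# Hypothetical frame whose last header still carries the trailing newline
sample = pd.DataFrame({
    'Postcode': ['M3A'],
    'Borough': ['North York'],
    'Neighbourhood\n': ['Parkwoods\n'],
})

# rename only the column that needs it; the others are left untouched
renamed = sample.rename(columns={'Neighbourhood\n': 'Neighborhood'})
print(renamed.columns.tolist())
```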

Replace the trailing newline character in each Neighborhood value with a space.

In [98]:
df = df.replace('\n', ' ', regex=True)
df.head()

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
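
Replacing `'\n'` with a space leaves a trailing space on each value, which the table above does not show. The sketch below compares that replacement with `str.strip()` on two hypothetical values.

```python
import pandas as pd

# Hypothetical values with the trailing newline seen in the scraped cells
sample = pd.DataFrame({'Neighborhood': ['Parkwoods\n', 'Victoria Village\n']})

# replacing the newline with a space keeps one trailing space per value
replaced = sample['Neighborhood'].str.replace('\n', ' ', regex=False)

# strip() removes leading and trailing whitespace, newline included
stripped = sample['Neighborhood'].str.strip()

print(replaced.tolist())
print(stripped.tolist())
```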


Drop all rows whose Borough is "Not assigned".

In [99]:
df.drop(df.index[df['Borough'] == 'Not assigned'], inplace=True)
# reset the index and drop the previous index
df = df.reset_index(drop=True)
df.head(10)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Downtown Toronto | Queen's Park |
| 6 | M9A | Etobicoke | Islington Avenue |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |
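
The same filtering can be written with a boolean mask, which avoids `inplace=True`. This is a sketch on a small hypothetical frame, not the scraped data.

```python
import pandas as pd

# Hypothetical rows, one of them without an assigned borough
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# keep only the rows with an assigned borough, then renumber the index
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```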


Combine the neighborhoods that share the same Postcode and Borough into a single row, with the neighborhood names separated by commas.

In [108]:
df = df.groupby(['Postcode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()
df.columns = ['Postcode', 'Borough', 'Neighborhood']
df.head(10)

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge ,Malvern |
| 1 | M1C | Scarborough | Highland Creek ,Rouge Hill ,Port Union |
| 2 | M1E | Scarborough | Guildwood ,Morningside ,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park ,Ionview ,Kennedy Park |
| 7 | M1L | Scarborough | Clairlea ,Golden Mile ,Oakridge |
| 8 | M1M | Scarborough | Cliffcrest ,Cliffside ,Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff ,Cliffside West |
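
The space before each comma in the output comes from the trailing space left on every name by the earlier newline replacement. The sketch below strips each name before joining, on hypothetical rows. It illustrates one possible ordering of the steps, not what the notebook did.

```python
import pandas as pd

# Hypothetical rows with the trailing space left by the newline replacement
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge ', 'Malvern ', 'Woburn '],
})

# strip every name first so the joined string has no stray spaces
sample['Neighborhood'] = sample['Neighborhood'].str.strip()

# one row per (Postcode, Borough), names joined with a comma
grouped = (
    sample.groupby(['Postcode', 'Borough'])['Neighborhood']
    .apply(','.join)
    .reset_index()
)
print(grouped)
```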


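Remove the whitespace left over from the newline replacement: the space before each comma and any leading or trailing spaces.

In [0]:
df['Neighborhood'] = df['Neighborhood'].str.replace(r'\s*,\s*', ',', regex=True).str.strip()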

Where the Neighborhood is "Not assigned", use the Borough name instead.

In [0]:
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
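
An equivalent way to express this fallback is `Series.mask`, which replaces values where a condition holds. The rows below are hypothetical and only illustrate the behaviour.

```python
import pandas as pd

# Hypothetical rows; the first one has no neighborhood assigned
sample = pd.DataFrame({
    'Postcode': ['X1A', 'M9A'],
    'Borough': ['Example Borough', 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Islington Avenue'],
})

# where the neighborhood is missing, take the value from Borough instead
unassigned = sample['Neighborhood'] == 'Not assigned'
sample['Neighborhood'] = sample['Neighborhood'].mask(unassigned, sample['Borough'])
print(sample)
```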

In [109]:
# shape of the dataframe
df.shape

(103, 3)
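
Besides the shape, it can be worth checking that each postcode now appears on exactly one row. That holds after the grouping only if every postcode belongs to a single borough, which is assumed here. The sketch uses a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical grouped result with one row per postcode
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge,Malvern', 'Highland Creek,Rouge Hill', 'Guildwood'],
})

# is_unique is True when no postcode is repeated
postcodes_unique = sample['Postcode'].is_unique
print(sample.shape, postcodes_unique)
```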