<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto Project</font></h1>

In [1]:
import numpy as np 
import pandas as pd

import requests


## Import the Wikipedia table with BeautifulSoup

In [2]:
# Import BeautifulSoup
from bs4 import BeautifulSoup

In [3]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
website_url = requests.get(url).text
soup = BeautifulSoup(website_url,'lxml')

In [4]:
my_table = soup.find('table',{'class':'wikitable sortable'})
print(my_table)

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

We now have the table as HTML.

Next, we create an empty dataframe with the columns we want and fill it with the text of each td cell in my_table, read in order.

In [5]:
# create empty dataframe

column_names = ['PostalCode', 'Borough', 'Neighborhood'] 
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
# Read the td cells of my_table in order and fill each of the three columns
# In each group of 3 cells, the first is the postal code, the second is the borough and the third is the neighborhood

i=0
postalcode=[]
borough=[]
neighborhood=[]

for row in my_table.find_all('td'):
    if i%3==0:
        postalcode.append(row.text)
    if i%3==1:
        borough.append(row.text)
    if i%3==2:
        neighborhood.append(row.text)
    i+=1

neighborhoods['PostalCode']=postalcode
neighborhoods['Borough']=borough
neighborhoods['Neighborhood']=neighborhood

In [7]:
# Check the result
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [8]:
# Remove the trailing \n from each cell
# str.strip() is used instead of str.replace(r'\n', ''): since pandas 2.0, str.replace
# treats the pattern as a literal string by default, so r'\n' would not match a newline

neighborhoods['PostalCode'] = neighborhoods['PostalCode'].str.strip()
neighborhoods['Borough'] = neighborhoods['Borough'].str.strip()
neighborhoods['Neighborhood'] = neighborhoods['Neighborhood'].str.strip()

In [9]:
# Check the result, this time it should be ok
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
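
The cell-by-cell loop and the newline cleanup above could also be written as a row-wise parse. The following is only a sketch on a small hypothetical HTML sample (`sample_html`, `sample_table`, `records` and `sample_df` are illustrative names, not part of the notebook). It relies on `get_text(strip=True)` to trim whitespace while reading each cell, and on the built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical inline sample that mimics the structure of the Wikipedia table
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

records = []
for tr in sample_table.find_all("tr"):
    # get_text(strip=True) drops the trailing newline of each cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        records.append(cells)

sample_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

Reading one `tr` at a time keeps the three values of a row together, so a row with a missing cell would be skipped rather than shifting every later value into the wrong column.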


In [10]:
# Count the rows so that we can later check that the unassigned rows were removed correctly
neighborhoods.shape

(180, 3)

The table is now in good shape.

We now need to clean the data:
- remove the rows where the borough is 'Not assigned'
- combine rows that have the same postal code
- when the neighborhood is not assigned, give it the name of the borough

## Remove rows with a Not assigned borough

The dropna function does not work here: it only drops missing values (NaN), and 'Not assigned' is an ordinary string.
So the idea is to filter the dataframe and keep only the rows whose borough is not 'Not assigned'.
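
To illustrate why dropna has no effect on a string placeholder, here is a small sketch on a hypothetical three-row sample (`demo`, `unchanged`, `masked` and `dropped` are illustrative names, not variables of this notebook). It also shows one possible alternative, assuming we are willing to convert the placeholder to NaN first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, not the scraped table
demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# dropna() removes nothing: 'Not assigned' is a string, not a missing value
unchanged = demo.dropna(subset=["Borough"])

# Option 1: boolean mask, as done in this notebook
masked = demo[demo["Borough"] != "Not assigned"]

# Option 2: turn the placeholder into NaN, then dropna() works
dropped = demo.assign(
    Borough=demo["Borough"].replace("Not assigned", np.nan)
).dropna(subset=["Borough"])

print(len(unchanged), len(masked), len(dropped))
```

Both options keep the same rows; the boolean mask is shorter when there is a single placeholder value to exclude.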

In [11]:
# Count the number of unassigned boroughs
neighborhoods[neighborhoods['Borough']=='Not assigned'].shape

(77, 3)

In [12]:
# Slice the dataframe by keeping only the rows with an assigned Borough
neighborhoods=neighborhoods[neighborhoods['Borough']!='Not assigned']

In [13]:
# Check the results on the first rows
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [14]:
# Check the result using the number of rows. We now expect 180 - 77 = 103 rows
neighborhoods.shape

(103, 3)

We have now removed the rows with a 'Not assigned' borough, which reduces the number of rows from 180 to 103.

## Combine rows with same Postal Code
Or at least check whether there are any, because the table seems to have combined them already.

In [15]:
neighborhoods['PostalCode'].value_counts()

M4T    1
M9L    1
M6K    1
M8V    1
M4P    1
      ..
M4V    1
M5M    1
M2K    1
M1P    1
M5R    1
Name: PostalCode, Length: 103, dtype: int64

Each postal code is already unique.
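
Had some postal codes appeared on several rows, they could have been merged with a groupby. The following is a minimal sketch on a hypothetical sample containing a duplicated postal code (`dup` and `combined` are illustrative names; the split of M5A into two rows is made up for the example).

```python
import pandas as pd

# Hypothetical sample in which M5A appears on two rows
dup = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group by postal code and borough, then join the neighborhood names with commas
combined = (
    dup.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

With `sort=False` the groups stay in their order of first appearance, so the combined table follows the order of the original rows.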

## Assign Borough name to unassigned neighborhood

In [16]:
# Count the number of unassigned neighborhoods
neighborhoods[neighborhoods['Neighborhood']=='Not assigned'].shape

(0, 3)

There are no unassigned neighborhoods.

In the Wikipedia table, we can check that every row with an unassigned neighborhood also has an unassigned borough (and we have already dropped those rows).
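
If the table had contained a row with an assigned borough but an unassigned neighborhood, the borough name could have been copied over as follows. This is only a sketch on a hypothetical two-row sample (`fill_demo` and `unassigned` are illustrative names, and the first row is invented for the example).

```python
import pandas as pd

# Hypothetical sample: the first row has a borough but no neighborhood
fill_demo = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Select the unassigned neighborhoods and copy the borough name into them
unassigned = fill_demo["Neighborhood"] == "Not assigned"
fill_demo.loc[unassigned, "Neighborhood"] = fill_demo.loc[unassigned, "Borough"]
print(fill_demo)
```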

## Final Table - End of part 1

In [17]:
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [18]:
neighborhoods.shape

(103, 3)