# Segmenting and Clustering Neighborhoods in Toronto 1/3
## Collecting Neighbourhood data from the web
This is the first part of my Capstone project. I'm collecting a list of Toronto neighbourhoods into a Pandas DataFrame.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

I'm using the Requests library to fetch the HTML page and Beautiful Soup to parse it.

In [2]:
wiki_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

r=requests.get(wiki_url)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify()[:1000]) #just the first 1000 characters to check if the load was ok

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"dcc5d878-223e-4b4b-a885-0285652ee233","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":979555370,"wgRevisionId":979555370,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communicati

I have already checked the source code in the browser, so I know that the list of postal codes is the first table on the page and that it is in a regular \<table> section:

In [3]:
print(soup.table.prettify()[:400]) #again, printing the first rows to check if the idea is correct

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postal Code
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </


An HTML \<table> can be easily loaded into a Pandas DataFrame with the read_html function. Be aware that the function returns a **list** of DataFrames, so we'll need the first (and only) DataFrame out of the list.

In [4]:
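from io import StringIO #newer pandas versions expect a file-like object instead of a raw HTML string
df_toronto=pd.read_html(StringIO(soup.table.prettify()),flavor='bs4',header=0,index_col=0)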
df_toronto

[                      Borough  \
 Postal Code                     
 M1A              Not assigned   
 M2A              Not assigned   
 M3A                North York   
 M4A                North York   
 M5A          Downtown Toronto   
 ...                       ...   
 M5Z              Not assigned   
 M6Z              Not assigned   
 M7Z              Not assigned   
 M8Z                 Etobicoke   
 M9Z              Not assigned   
 
                                                  Neighbourhood  
 Postal Code                                                     
 M1A                                               Not assigned  
 M2A                                               Not assigned  
 M3A                                                  Parkwoods  
 M4A                                           Victoria Village  
 M5A                                  Regent Park, Harbourfront  
 ...                                                        ...  
 M5Z                        

Please note that I'm using Postal Code as the index. Pandas does not enforce unique index labels, so a postal code found in multiple rows would not throw an exception; duplicated codes have to be ruled out explicitly, for example with `index.is_unique`. It seems that the wiki page has been modified since the assignment was created.
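
As far as I know, pandas accepts repeated index labels without complaint, so an explicit check on the index is the safe way to confirm that each postal code appears only once. A minimal sketch on a made-up frame (the `sample` data is hypothetical, not taken from the wiki page):

```python
import pandas as pd

# Hypothetical sample: 'M1B' deliberately appears twice in the index.
sample = pd.DataFrame(
    {"Borough": ["Scarborough", "Scarborough", "North York"],
     "Neighbourhood": ["Malvern", "Rouge", "Parkwoods"]},
    index=pd.Index(["M1B", "M1B", "M3A"], name="Postal Code"),
)

# Building a frame with repeated index labels does not raise an error.
print(sample.index.is_unique)          # False

# List the index labels that occur more than once.
repeated = sample.index[sample.index.duplicated(keep=False)].unique()
print(list(repeated))                  # ['M1B']
```

The same two calls could be run on the real DataFrame once it has been taken out of the list returned by read_html.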

In [5]:
df_toronto=df_toronto[0]
df_toronto.head()

| Postal Code | Borough | Neighbourhood |
|---|---|---|
| M1A | Not assigned | Not assigned |
| M2A | Not assigned | Not assigned |
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Regent Park, Harbourfront |
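
To make the list-of-DataFrames behaviour concrete, here is a small sketch on an inline table rather than the live page. The `html` snippet is made up for illustration, and wrapping it in `StringIO` is my assumption about what newer pandas versions prefer over a raw string. read_html also needs an HTML parser installed (lxml, or bs4 together with html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table mimicking the layout of the wiki table.
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html), header=0, index_col=0)
print(type(tables), len(tables))

# Take the first (here: only) DataFrame out of the list.
df = tables[0]
print(df.loc["M3A", "Borough"])
```

Indexing the list with `[0]` is the same step as in cell 5 above.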


In [6]:
df_toronto.shape

(180, 2)

Setting Postal Code as the index does not sort the DataFrame; the rows stay in the order of the wiki table. I'm not going to count whether all rows are loaded, but I'm spot-checking that two postal codes from the wiki page are in the DataFrame:

In [7]:
print(df_toronto.loc['M1S']) #a postal code picked from the wiki page
print(df_toronto.loc['M2L']) #another postal code from the wiki page

Borough          Scarborough
Neighbourhood      Agincourt
Name: M1S, dtype: object
Borough                        North York
Neighbourhood    York Mills, Silver Hills
Name: M2L, dtype: object


Dropping postal codes that are not assigned to boroughs:

In [8]:
df_toronto=df_toronto[df_toronto['Borough']!='Not assigned']
df_toronto.shape

(103, 2)

In [9]:
df_toronto.head()

| Postal Code | Borough | Neighbourhood |
|---|---|---|
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Regent Park, Harbourfront |
| M6A | North York | Lawrence Manor, Lawrence Heights |
| M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


The instruction says: *"If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough."* Interestingly, there are no postal codes that are assigned to a borough but not to a neighbourhood, as the empty result below shows. It seems that the wiki page has been modified since the assignment was made, which simplifies our job.

In [10]:
df_toronto[df_toronto['Neighbourhood']=='Not assigned']

| Postal Code | Borough | Neighbourhood |
|---|---|---|
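
Had the page still contained such rows, the rule from the instruction could have been applied along these lines. This is only a sketch on made-up data (the `sample` frame is hypothetical); it relies on `Series.mask`, which replaces values wherever the condition is true.

```python
import pandas as pd

# Hypothetical rows: M7A has a borough but no neighbourhood assigned.
sample = pd.DataFrame(
    {"Borough": ["North York", "Queen's Park"],
     "Neighbourhood": ["Parkwoods", "Not assigned"]},
    index=pd.Index(["M3A", "M7A"], name="Postal Code"),
)

# Where Neighbourhood is 'Not assigned', copy the value from Borough.
missing = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(missing, sample["Borough"])

print(sample.loc["M7A", "Neighbourhood"])   # Queen's Park
```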


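There are some duplicates, though: some neighbourhoods have more than one postal code. `duplicated()` compares the Borough and Neighbourhood columns and ignores the index, so I keep only the first postal code of each neighbourhood.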

In [11]:
df_toronto[df_toronto.duplicated(keep=False)]

| Postal Code | Borough | Neighbourhood |
|---|---|---|
| M3B | North York | Don Mills |
| M3C | North York | Don Mills |
| M3K | North York | Downsview |
| M3L | North York | Downsview |
| M3M | North York | Downsview |
| M3N | North York | Downsview |


In [12]:
df_toronto=df_toronto[~df_toronto.duplicated()]
df_toronto.head()

| Postal Code | Borough | Neighbourhood |
|---|---|---|
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Regent Park, Harbourfront |
| M6A | North York | Lawrence Manor, Lawrence Heights |
| M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
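
Dropping duplicates this way discards the extra postal codes of a neighbourhood. If those codes matter later, one alternative (my own suggestion, not part of the assignment) would be to collect them per neighbourhood instead. A minimal sketch on a few rows modelled on the output above, with `sample` as a hypothetical stand-in for the real DataFrame:

```python
import pandas as pd

# Hypothetical rows: two postal codes share the same borough and neighbourhood.
sample = pd.DataFrame(
    {"Borough": ["North York", "North York", "North York"],
     "Neighbourhood": ["Don Mills", "Don Mills", "Parkwoods"]},
    index=pd.Index(["M3B", "M3C", "M3A"], name="Postal Code"),
)

# Option 1: keep the first row of each duplicated (Borough, Neighbourhood) pair.
first_only = sample[~sample.duplicated()]
print(list(first_only.index))          # ['M3B', 'M3A']

# Option 2: keep every postal code by joining them per neighbourhood.
grouped = (
    sample.reset_index()
    .groupby(["Borough", "Neighbourhood"], sort=False)["Postal Code"]
    .agg(", ".join)
    .reset_index()
)
print(grouped.loc[0, "Postal Code"])   # M3B, M3C
```

Option 1 matches what the notebook does. Option 2 keeps one row per neighbourhood without losing any postal code.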


In [13]:
df_toronto.to_csv('toronto.csv')
df_toronto.shape

(99, 2)