### As part of a project to segment and cluster neighborhoods in the city of Toronto, Canada, this notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract the table of postal codes and load it into a pandas DataFrame

#### Importing all the libraries

In [112]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### We will be using the BeautifulSoup library for web scraping

In [113]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

#### We will get the postal codes of Toronto from this wiki page

In [114]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html=urlopen(url)

#### Creating a BeautifulSoup object

In [115]:
soup=BeautifulSoup(html,'lxml')
type(soup)

bs4.BeautifulSoup

#### Using the find_all() method of soup to extract the table row (tr) tags from the page

In [116]:
rows=soup.find_all('tr')
rows[:5]

[<tr>
 <th>Postal Code
 </th>
 <th>Borough
 </th>
 <th>Neighborhood
 </th></tr>, <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A
 </td>
 <td>North York
 </td>
 <td>Parkwoods
 </td></tr>, <tr>
 <td>M4A
 </td>
 <td>North York
 </td>
 <td>Victoria Village
 </td></tr>]
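As an aside, the cell text can also be read directly from each tag rather than from the string form of the row. The following is only a minimal sketch on a hand-written HTML fragment modelled on the output above; `sample_html` and `records` are illustrative names, and the built-in `html.parser` is used here instead of `lxml` purely to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the structure of the rows shown above
sample_html = """
<table>
<tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for tr in sample_soup.find_all("tr"):
    # get_text(strip=True) trims the trailing newline inside each cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        records.append(cells)

print(records)
```

Reading the text per cell keeps the spaces inside names such as "North York" and skips the header row without any positional slicing.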

#### Using regular expressions to match the HTML tags around the td cells of each table row and replace them with an empty string, then removing whitespace to reformat the text

In [117]:
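import re
tag_pattern = re.compile(r'<.*?>')  # matches any HTML tag (non-greedy)
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean2 = tag_pattern.sub('', str_cells)  # drop the tags, keep the cell text
    # Remove all whitespace; note this also removes the spaces inside names.
    # The raw string avoids an invalid-escape warning for "\s".
    clean3 = re.sub(r"\s", "", clean2)
    list_rows.append(clean3)
print(list_rows)
type(list_rows)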

['[]', '[M1A,Notassigned,Notassigned]', '[M2A,Notassigned,Notassigned]', '[M3A,NorthYork,Parkwoods]', '[M4A,NorthYork,VictoriaVillage]', '[M5A,DowntownToronto,RegentPark,Harbourfront]', '[M6A,NorthYork,LawrenceManor,LawrenceHeights]', "[M7A,DowntownToronto,Queen'sPark,OntarioProvincialGovernment]", '[M8A,Notassigned,Notassigned]', '[M9A,Etobicoke,IslingtonAvenue,HumberValleyVillage]', '[M1B,Scarborough,Malvern,Rouge]', '[M2B,Notassigned,Notassigned]', '[M3B,NorthYork,DonMills]', '[M4B,EastYork,ParkviewHill,WoodbineGardens]', '[M5B,DowntownToronto,GardenDistrict,Ryerson]', '[M6B,NorthYork,Glencairn]', '[M7B,Notassigned,Notassigned]', '[M8B,Notassigned,Notassigned]', '[M9B,Etobicoke,WestDeanePark,PrincessGardens,MartinGrove,Islington,Cloverdale]', '[M1C,Scarborough,RougeHill,PortUnion,HighlandCreek]', '[M2C,Notassigned,Notassigned]', '[M3C,NorthYork,DonMills]', '[M4C,EastYork,WoodbineHeights]', '[M5C,DowntownToronto,St.JamesTown]', '[M6C,York,Humewood-Cedarvale]', '[M7C,Notassigned,Notas

list
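To make the two substitutions concrete, here is a small sketch that applies them to a single row. The string `cells_str` is hand-written to imitate what `str(cells)` looks like for the M3A row; it is an assumption about the exact markup, not captured output.

```python
import re

# Imitation of str(cells) for one row: a list repr of three <td> tags
cells_str = "[<td>M3A\n</td>, <td>North York\n</td>, <td>Parkwoods\n</td>]"

TAG_RE = re.compile(r"<.*?>")

# Step 1: remove every tag, leaving the text plus list punctuation
no_tags = TAG_RE.sub("", cells_str)

# Step 2: remove all whitespace, including the space in "North York"
squashed = re.sub(r"\s", "", no_tags)

print(repr(no_tags))
print(squashed)
```

The second step is what turns "North York" into "NorthYork" in the list printed above.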

#### Storing the list in a DataFrame

In [118]:
df=pd.DataFrame(list_rows)

In [119]:
df.head()

Unnamed: 0,0
0,[]
1,"[M1A,Notassigned,Notassigned]"
2,"[M2A,Notassigned,Notassigned]"
3,"[M3A,NorthYork,Parkwoods]"
4,"[M4A,NorthYork,VictoriaVillage]"


#### Splitting each row on ',' into separate columns (at most two splits, so that a comma-separated list of neighborhoods stays in one column), stripping the '[' and ']' characters, and dropping all rows that contain the string 'Notassigned'

In [120]:
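# Split on the first two commas only; n must be passed by keyword in pandas 2.0+
df1=df[0].str.split(',',n=2,expand=True)
# Strip the leftover list brackets from each column
df1[0]=df1[0].str.strip('[]')
df1[1]=df1[1].str.strip('[]')
df1[2]=df1[2].str.strip('[]')
# Drop rows whose borough or neighborhood is not assigned
df1=df1[df1[1] != "Notassigned"]
df1=df1[df1[2] != "Notassigned"]
df1.reset_index(inplace=True,drop=True)
df1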

Unnamed: 0,0,1,2
0,,,
1,M3A,NorthYork,Parkwoods
2,M4A,NorthYork,VictoriaVillage
3,M5A,DowntownToronto,"RegentPark,Harbourfront"
4,M6A,NorthYork,"LawrenceManor,LawrenceHeights"
5,M7A,DowntownToronto,"Queen'sPark,OntarioProvincialGovernment"
6,M9A,Etobicoke,"IslingtonAvenue,HumberValleyVillage"
7,M1B,Scarborough,"Malvern,Rouge"
8,M3B,NorthYork,DonMills
9,M4B,EastYork,"ParkviewHill,WoodbineGardens"
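The effect of limiting the split to two commas can be seen on a tiny sample. This is a sketch only: the three strings are taken from the printed list above, while the names `sample`, `parts`, and `kept` are illustrative, and the brackets are stripped before splitting rather than after.

```python
import pandas as pd

# Three entries copied from the list printed earlier
sample = pd.Series([
    "[]",
    "[M1A,Notassigned,Notassigned]",
    "[M5A,DowntownToronto,RegentPark,Harbourfront]",
])

# Strip the brackets once, then split on the first two commas only
parts = sample.str.strip("[]").str.split(",", n=2, expand=True)
parts.columns = ["PostalCode", "Borough", "Neighborhood"]

# Drop the empty header row (missing values), then the unassigned rows
kept = parts.dropna()
kept = kept[kept["Borough"] != "Notassigned"].reset_index(drop=True)

print(kept)
```

Because `n=2` stops after the second comma, "RegentPark,Harbourfront" remains a single value in the third column.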


#### Eliminating the empty header row and the trailing rows that are not part of the postal code table

In [121]:
df1=(df1[1:104])
df1.reset_index(inplace=True,drop=True)
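The slice above relies on fixed row positions. One possible alternative, sketched here under the assumption that every valid code has the form "M", a digit, then an uppercase letter, is to keep rows by pattern instead. The frame below is a made-up miniature, and its last row ("Canadianpostalcodes") is a hypothetical stand-in for a trailer row.

```python
import pandas as pd

# Miniature stand-in for df1: an empty header row, two data rows, one trailer
frame = pd.DataFrame({
    0: ["", "M3A", "M4A", "Canadianpostalcodes"],
    1: [None, "NorthYork", "NorthYork", None],
    2: [None, "Parkwoods", "VictoriaVillage", None],
})

# Keep only rows whose first column looks like a Toronto postal code prefix
is_code = frame[0].str.match(r"^M\d[A-Z]$", na=False)
codes_only = frame[is_code].reset_index(drop=True)

print(codes_only)
```

A pattern-based filter keeps working if the number of header or trailer rows on the page changes, whereas the hard-coded bounds would need to be updated by hand.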

#### Setting column names for the DataFrame

In [122]:
df1.columns=["PostalCode","Borough","Neighborhood"]

In [123]:
df1

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,NorthYork,Parkwoods
1,M4A,NorthYork,VictoriaVillage
2,M5A,DowntownToronto,"RegentPark,Harbourfront"
3,M6A,NorthYork,"LawrenceManor,LawrenceHeights"
4,M7A,DowntownToronto,"Queen'sPark,OntarioProvincialGovernment"
5,M9A,Etobicoke,"IslingtonAvenue,HumberValleyVillage"
6,M1B,Scarborough,"Malvern,Rouge"
7,M3B,NorthYork,DonMills
8,M4B,EastYork,"ParkviewHill,WoodbineGardens"
9,M5B,DowntownToronto,"GardenDistrict,Ryerson"


In [124]:
df1.shape

(103, 3)