# Webscraping and geocoding postal codes of Canada

Part of IBM Capstone Data Science Course
**Part 1** - webscraping the postal codes of Canada
**Part 2** - geocoding the postal codes of Canada

<img src = "https://www.cannalawblog.com/files/2015/10/Flag_map_of_Greater_Canada.png" width = 400 align = 'center'>



## **Part 1** Webscraping the postal codes of Canada

## Prepare Dependencies

In [106]:
from bs4 import BeautifulSoup
import pandas as pd  # needed below to build the data frame
import requests

## Getting the source from Wikipedia

In [18]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [107]:
# put it into the soup

soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())

### Parsing out the information from the table

In [108]:
table = soup.find('table', class_ = 'wikitable sortable')
#print(table.prettify())

In [26]:
# All table cells (<td> elements) found in the page

for cell in soup.find_all('td'):
    print(cell)

<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td>
<td>M6A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Lawrence_Heights" title="Lawrence Heights">Lawrence Heights</a>
</td>
<td>M6A</td>
<td

In [109]:
# 4th step from labs/DP0701EN/Webscraping python BeautifulSoup parsing table.ipynb

data = [] #create empty list named data

table = soup.find('table', attrs={'class':'wikitable sortable'})
#print(table)

tbody = table.tbody
#print(tbody)

In [110]:
# 5th step from labs/DP0701EN/Webscraping python BeautifulSoup parsing table.ipynb

#table_body = table.find('tbody')

rows = tbody.find_all('tr')
for row in rows:
    #print(row)
    cols = row.find_all('td')
    #print(cols)
    cols = [ele.text.strip() for ele in cols]
    #print(cols)
    data.append([ele for ele in cols if ele]) # Get rid of empty values
    
#print(data) # print list data with lists of the rows scraped from the wikipedia page
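
The row-by-row extraction above can be tried offline on a small inline table. This is a minimal sketch, assuming a hypothetical two-row HTML snippet shaped like the Wikipedia table and the built-in `html.parser` backend instead of `lxml`; the names `sample_html`, `sample_soup` and `sample_rows` are illustrative and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the Wikipedia table (not the scraped page)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable")

sample_rows = []
for tr in sample_table.find_all("tr"):
    # get_text(strip=True) drops the trailing newline and any link markup
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        sample_rows.append(cells)

print(sample_rows)
```

Skipping rows without `<td>` cells at this point is one way to avoid the empty first row that the notebook otherwise removes later with `notnull()`.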

## Load data into a pandas data frame

In [111]:
df = pd.DataFrame(data)
df.columns = ['Postcode','Borough','Neighbourhood']
#df

### Clean up the data in the pandas frame

In [113]:
#drop all entries with no assigned Borough
df = df[df.Borough != 'Not assigned']
#df

In [114]:
#drop the empty first row (the table header row has no <td> cells, so its Postcode is null)
df = df[df.Postcode.notnull()]
#df

In [115]:
#reindex dataframe
df = df.reset_index(drop=True)
#df

In [116]:
#check if there are rows with value not assigned for the column Neighbourhood
df.loc[df['Neighbourhood'] == 'Not assigned'] 

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Not assigned


In [117]:
#Copy value of Borough to Neighbourhood when Neighbourhood is not assigned

df.Neighbourhood = df.Borough.where(df.Neighbourhood == 'Not assigned', df.Neighbourhood)
#df.loc[df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood


In [119]:
#Check the copied value
df.loc[df['Neighbourhood'] == "Queen's Park"]

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Queen's Park


In [121]:
#Group the dataframe by Postcode and Borough, combine Neighbourhood values separated with a comma, reindex data frame

#https://stackoverflow.com/questions/36392735/how-to-combine-multiple-rows-into-a-single-row-with-pandas
df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
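
The two clean-up steps above (filling an unassigned neighbourhood with the borough name, then joining neighbourhoods per postal code) can be checked on a toy frame. This is a sketch only: the three rows are a hypothetical sample in the same shape as the scraped data, and it uses `Series.mask` and `agg` as equivalent alternatives to the `where` and `apply` calls in the notebook.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped frame
toy = pd.DataFrame(
    {
        "Postcode": ["M5A", "M5A", "M7A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Harbourfront", "Regent Park", "Not assigned"],
    }
)

# Use the borough name wherever no neighbourhood is assigned
toy["Neighbourhood"] = toy["Neighbourhood"].mask(
    toy["Neighbourhood"] == "Not assigned", toy["Borough"]
)

# One row per (Postcode, Borough); neighbourhood names joined with ", "
combined = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Because `groupby` sorts its keys by default, the combined frame comes back ordered by postal code, which matters later when frames are lined up by position.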

## Result of webscraping the postal codes of Canada in the cleaned up pandas data frame

In [123]:
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


### Shape of the resulting data frame

In [124]:
df.shape

(103, 3)

## **Part 2** Geocoding the postal codes of Canada

Using the Python geocoder package - https://geocoder.readthedocs.io/index.html

<img src = "https://1netwiki.com/ueditor/php/upload/image/20180911/1536677667341132.jpg" width = 400 align = 'center'>



## Prepare dependencies and load data

In [193]:
# Create df working set (a copy, so later changes do not affect df)
dfpc = df.copy()

In [127]:
#dfpc

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


### load data

In [178]:
dfgeo = pd.read_csv('Geospatial_Coordinates.csv')

In [179]:
#dfgeo

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


### check if the loaded data is clean and can be added to the dataframe

In [None]:
# first check that the list of geospatial coordinates contains the same postal codes, in the same order, before tying them together

df1 = dfpc.iloc[:,0]
df2 = dfgeo.iloc[:,0]

ne = (df1 != df2).any(0) # False if the two postal code columns are identical

ne
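
The element-wise comparison above only works when both columns have the same length and index. As a complementary check, the sketch below separates "same codes" from "same order" using plain lists and sets; `codes_a` and `codes_b` are hypothetical stand-ins for the two postal code columns.

```python
import pandas as pd

# Hypothetical stand-ins for the two postal code columns
codes_a = pd.Series(["M1B", "M1C", "M1E"], name="Postcode")
codes_b = pd.Series(["M1C", "M1B", "M1E"], name="Postal Code")

# Same values in the same positions?
same_order = codes_a.tolist() == codes_b.tolist()

# Same set of codes, regardless of order?
same_codes = set(codes_a) == set(codes_b)

# Codes present in one column but missing from the other
only_in_a = sorted(set(codes_a) - set(codes_b))
only_in_b = sorted(set(codes_b) - set(codes_a))

print(same_order, same_codes, only_in_a, only_in_b)
```

If the codes match but the order does not, joining on the postal code is safer than placing the frames side by side by position.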

In [191]:
# ...visual check...

df_all = pd.concat([df1, df2], 
                   axis='columns', keys=['First', 'Second'])

df_all

Unnamed: 0,First,Second
0,M1B,M1B
1,M1C,M1C
2,M1E,M1E
3,M1G,M1G
4,M1H,M1H
5,M1J,M1J
6,M1K,M1K
7,M1L,M1L
8,M1M,M1M
9,M1N,M1N


## Put the data together

In [202]:
# add the geo coordinates to the Toronto PC data frame

dfpc = pd.concat([dfpc, dfgeo.iloc[:,[1,2]]], axis=1)
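
The `concat` above lines the frames up by row position, so it relies on both frames having identical order. A key-based join does not need that assumption. The following is a sketch, not the notebook's method: `pc` and `geo` are hypothetical miniature frames (coordinates copied from the first two rows of the table above) with the rows deliberately in different order.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames, rows in different order
pc = pd.DataFrame(
    {"Postcode": ["M1B", "M1C"], "Borough": ["Scarborough", "Scarborough"]}
)
geo = pd.DataFrame(
    {
        "Postal Code": ["M1C", "M1B"],
        "Latitude": [43.784535, 43.806686],
        "Longitude": [-79.160497, -79.194353],
    }
)

# Join on the postal code; validate raises if a code appears more than once
merged = pc.merge(
    geo,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")

print(merged)
```

With `how="left"`, a postal code that has no coordinates would keep its row and get `NaN` in `Latitude` and `Longitude`, which makes gaps easy to spot.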

### Final data frame of geolocated boroughs in Toronto, Canada

In [204]:
# Final data frame of geolocated boroughs in Toronto, Canada
dfpc

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
