## Scraping

This notebook scrapes https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the list of Canadian postal codes that begin with M (the Toronto area).

In [1]:
import pandas as pd

from urllib import request
from bs4 import BeautifulSoup

Load the webpage and parse its HTML.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = request.urlopen(url).read()

soup = BeautifulSoup(source, "html.parser")
table = soup.find('table', attrs={'class':'wikitable'})
table_rows = table.find_all('tr')

Extract the cell text from each table row.

In [3]:
res = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)
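Because the loop above drops empty cells, a row with a blank cell would end up with fewer than three values and no longer line up with the three column names used next. A minimal sketch of a check that could run before building the DataFrame; the helper name `split_rows_by_width` and the sample rows are hypothetical and not part of the original notebook.

```python
def split_rows_by_width(rows, expected_width=3):
    """Separate rows with the expected number of cells from those without."""
    complete = [row for row in rows if len(row) == expected_width]
    malformed = [row for row in rows if len(row) != expected_width]
    return complete, malformed


# Made-up sample: the second row has lost its neighbourhood cell.
sample_rows = [
    ["M3A", "North York", "Parkwoods"],
    ["M0X", "Example Borough"],
]

complete, malformed = split_rows_by_width(sample_rows)
print(len(complete), "complete rows;", len(malformed), "malformed rows")
```

In the notebook this could be called as `split_rows_by_width(res)`, inspecting the malformed rows before deciding whether to drop or repair them.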

In [4]:
df = pd.DataFrame(res, columns=["PostalCode", "Borough", "Neighborhood"])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Remove rows where the borough is not assigned.

In [5]:
df = df[df.Borough != "Not assigned"]

Combine neighbourhoods that share a postal code and borough into a single comma-separated row.

In [6]:
df = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
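To make the effect of the grouping step concrete, here is a small sketch on toy data. The rows are illustrative only and were not taken from the scraped table; the column names match the ones used in this notebook.

```python
import pandas as pd

# Illustrative rows only: two neighbourhoods share one postal code.
toy = pd.DataFrame(
    [
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M3A", "North York", "Parkwoods"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Same pattern as the cell above: one row per (PostalCode, Borough),
# with the neighbourhood names joined by ", ".
joined = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(joined)
```

Note that `groupby` sorts the result by the grouping keys by default, so the output is ordered by postal code rather than by the order of the input rows.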

Replace neighbourhoods that are not assigned with the borough name.

In [7]:
def replace(r):
    # fall back to the borough name when no neighbourhood is assigned
    if r['Neighborhood'] == "Not assigned":
        r["Neighborhood"] = r["Borough"]
    return r

In [8]:
df = df.apply(replace,axis=1)
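A row-wise `apply` works here, but the same replacement can also be written as a column operation. The sketch below is an alternative, not the approach the notebook uses; it relies on `Series.mask`, and the toy rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: the first one has no neighbourhood assigned.
toy = pd.DataFrame(
    {
        "PostalCode": ["M0X", "M3A"],
        "Borough": ["Example Borough", "North York"],
        "Neighborhood": ["Not assigned", "Parkwoods"],
    }
)

# Where the neighbourhood is "Not assigned", take the value from Borough instead.
missing = toy["Neighborhood"] == "Not assigned"
toy["Neighborhood"] = toy["Neighborhood"].mask(missing, toy["Borough"])
print(toy)
```

The column-wise form avoids calling a Python function once per row, which tends to matter only for much larger tables than this one.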

In [9]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [10]:
df.shape

(103, 3)

## Geocode

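The geocoder API did not work, so I downloaded the CSV of coordinates and used that instead. The cell below shows the attempted geocoder approach and was not executed.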

In [None]:
import geocoder

MAX_ATTEMPTS = 5  # cap retries so a failing service cannot loop forever

lat = []
long = []

for _, r in df.iterrows():
    postal_code = r['PostalCode']
    lat_lng_coords = None

    # retry until coordinates come back or the attempts run out
    for attempt in range(MAX_ATTEMPTS):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        if lat_lng_coords:
            break

    # record the result (None when every attempt failed)
    lat.append(lat_lng_coords[0] if lat_lng_coords else None)
    long.append(lat_lng_coords[1] if lat_lng_coords else None)

In [16]:
lat_long = pd.read_csv("Geospatial_Coordinates.csv")

In [19]:
df = df.merge(lat_long, left_on="PostalCode", right_on="Postal Code").drop(columns=["Postal Code"])
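The default inner merge silently drops any postal code that has no match in the coordinates file. A hedged sketch of how that could be checked, using the `how`, `validate`, and `indicator` parameters of `DataFrame.merge`; the frames below are toy stand-ins (the `M0X` row is made up, and the two coordinate pairs are copied from the output shown in this notebook).

```python
import pandas as pd

# Toy stand-in for df: the last postal code is made up and has no coordinates.
codes = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M0X"],
        "Borough": ["Scarborough", "Scarborough", "Example Borough"],
    }
)

# Toy stand-in for the coordinates CSV.
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# A left merge keeps every postal code; validate raises if keys are duplicated
# on either side; indicator adds a "_merge" column recording where each row matched.
checked = codes.merge(
    coords,
    how="left",
    left_on="PostalCode",
    right_on="Postal Code",
    validate="one_to_one",
    indicator=True,
)

unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", unmatched)
```

If `unmatched` is empty for the real data, the inner merge used above loses nothing and the row count should stay at 103.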

In [20]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476


In [1]:
## Explore and cluster the neighborhoods in Toronto,
## replicating the analysis we did on the New York City data.

## 1. Add enough Markdown cells to explain what you decided to do and to report any observations you make.
## 2. Generate maps to visualize your neighborhoods and how they cluster together.