# IBM Data Science Certificate - Capstone Project - Segmenting and Clustering 2/3


## Table of contents:
* [1. Scrape Toronto neighborhoods from Wikipedia](#first-part)
* [2. Add geo coordinates to Toronto neighborhoods](#second-part)

## 1. Scrape Toronto neighborhoods from Wikipedia <a class="anchor" id="first-part"></a>

In [3]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [4]:
# fetch the Wikipedia page listing the Canadian postal codes that start with the letter M
wikipedia_canada_post_codes_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipedia_canada_post_codes_page = requests.get(wikipedia_canada_post_codes_url)

In [5]:
# extract the HTML table containing the postal codes and load it into
# a pandas DataFrame: df
from io import StringIO

soup_post_codes = BeautifulSoup(wikipedia_canada_post_codes_page.content, 'html.parser')
table_post_codes = soup_post_codes.find("table", {"class": "wikitable"})
# read_html returns a list of DataFrames; we take the first (and only) one.
# Wrapping the markup in StringIO avoids the deprecation warning that recent
# pandas versions raise when read_html is given a literal HTML string.
df = pd.read_html(StringIO(str(table_post_codes)))[0]
print(df)

    Postcode           Borough           Neighborhood
0        M1A      Not assigned           Not assigned
1        M2A      Not assigned           Not assigned
2        M3A        North York              Parkwoods
3        M4A        North York       Victoria Village
4        M5A  Downtown Toronto           Harbourfront
..       ...               ...                    ...
282      M8Z         Etobicoke              Mimico NW
283      M8Z         Etobicoke     The Queensway West
284      M8Z         Etobicoke  Royal York South West
285      M8Z         Etobicoke         South of Bloor
286      M9Z      Not assigned           Not assigned

[287 rows x 3 columns]
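
To make the table extraction step less of a black box, here is a minimal standard-library sketch of turning `<tr>`/`<td>` markup into rows. It is not what `pd.read_html` or BeautifulSoup do internally; the class name `WikiTableParser` and the sample markup are hypothetical, with the sample rows borrowed from the output above. It assumes the page contains no nested tables.

```python
from html.parser import HTMLParser


class WikiTableParser(HTMLParser):
    """Collect the cell text of the first <table> in an HTML document.

    Hypothetical helper for illustration; assumes no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._cell = None       # text fragments of the cell currently being read
        self._in_table = False  # True while inside the first table
        self._done = False      # True once the first table has been closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._in_table:
            self._in_table = False
            self._done = True


# Made-up markup shaped like the Wikipedia table (illustrative only).
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = WikiTableParser()
parser.feed(sample_html)
header, *records = parser.rows  # first row holds the column names
print(header)
print(records)
```

In the notebook itself `pd.read_html` remains the simpler choice, since it also handles header detection and type inference.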


In [6]:
# remove rows where Borough is 'Not assigned'; chaining reset_index returns a new
# DataFrame, which avoids a SettingWithCopyWarning on the assignment below
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
# if a row has a borough but a 'Not assigned' neighborhood, use the borough name as the neighborhood
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])
df

    Postcode           Borough              Neighborhood
0        M3A        North York                 Parkwoods
1        M4A        North York          Victoria Village
2        M5A  Downtown Toronto              Harbourfront
3        M6A        North York          Lawrence Heights
4        M6A        North York            Lawrence Manor
..       ...               ...                       ...
205      M8Z         Etobicoke  Kingsway Park South West
206      M8Z         Etobicoke                 Mimico NW
207      M8Z         Etobicoke        The Queensway West
208      M8Z         Etobicoke     Royal York South West


In [7]:
df.shape

(210, 3)
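
The two cleaning rules used in this section can be checked in isolation on a tiny frame. The sketch below uses made-up toy rows (one per case: no borough, fully assigned, borough without neighborhood); the names `toy` and `clean` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows covering the three cases handled by the cleaning step (illustrative data).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: drop rows that have no borough; reset_index returns a fresh frame.
clean = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: where the neighborhood is missing, fall back to the borough name.
clean["Neighborhood"] = np.where(
    clean["Neighborhood"] == "Not assigned",
    clean["Borough"],
    clean["Neighborhood"],
)

print(clean)
```

After both rules are applied, no cell in `clean` should still contain the `'Not assigned'` placeholder, which is a quick sanity check worth running on the full `df` as well.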

## 2. Add geo coordinates to Toronto neighborhoods <a class="anchor" id="second-part"></a>

In [12]:
# read the CSV file containing the geo coordinates
df_geo = pd.read_csv('./data/Geospatial_data.csv')
# rename the postal code column so that it has the same name in both df and df_geo
df_geo = df_geo.rename(columns={'Postal Code': 'Postcode'})
df_geo

    Postcode   Latitude   Longitude
0        M1B  43.806686  -79.194353
1        M1C  43.784535  -79.160497
2        M1E  43.763573  -79.188711
3        M1G  43.770992  -79.216917
4        M1H  43.773136  -79.239476
..       ...        ...         ...
98       M9N  43.706876  -79.518188
99       M9P  43.696319  -79.532242
100      M9R  43.688905  -79.554724
101      M9V  43.739416  -79.588437


In [15]:
# inner join of df and df_geo on their shared 'Postcode' column
df = pd.merge(df, df_geo, on='Postcode', how='inner')
df

    Postcode           Borough              Neighborhood   Latitude   Longitude
0        M3A        North York                 Parkwoods  43.753259  -79.329656
1        M4A        North York          Victoria Village  43.725882  -79.315572
2        M5A  Downtown Toronto              Harbourfront  43.654260  -79.360636
3        M6A        North York          Lawrence Heights  43.718518  -79.464763
4        M6A        North York            Lawrence Manor  43.718518  -79.464763
..       ...               ...                       ...        ...         ...
205      M8Z         Etobicoke  Kingsway Park South West  43.628841  -79.520999
206      M8Z         Etobicoke                 Mimico NW  43.628841  -79.520999
207      M8Z         Etobicoke        The Queensway West  43.628841  -79.520999
208      M8Z         Etobicoke     Royal York South West  43.628841  -79.520999
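
An inner join silently drops any postcode that has no match in the coordinates file. One way to check for that, sketched here on toy data, is a left join with `indicator=True`, which adds a `_merge` column marking unmatched rows, together with `validate="many_to_one"`, which raises if a postcode appears more than once on the coordinates side. The frames `left` and `right` and the postcode `"M0X"` are made up for illustration.

```python
import pandas as pd

# Toy neighborhoods table; "M0X" is a made-up code with no coordinates.
left = pd.DataFrame({
    "Postcode": ["M3A", "M6A", "M6A", "M0X"],
    "Neighborhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor", "Nowhere"],
})

# Toy coordinates table: one row per postcode.
right = pd.DataFrame({
    "Postcode": ["M3A", "M6A"],
    "Latitude": [43.753259, 43.718518],
    "Longitude": [-79.329656, -79.464763],
})

# Left join keeps every neighborhood; _merge tells us which ones found a match.
checked = pd.merge(
    left, right,
    on="Postcode", how="left",
    validate="many_to_one",  # several neighborhoods may share one postcode
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print("Postcodes without coordinates:", unmatched)

# The inner join used in the notebook keeps only the matched rows.
inner = pd.merge(left, right, on="Postcode", how="inner")
print(len(left), "rows before,", len(inner), "rows after the inner join")
```

Comparing the row counts before and after the join gives the same information at a glance: if they differ, some postcodes had no coordinates.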
