## Applied Data Science Capstone
### Week 3: Segmenting and Clustering Neighborhoods in Toronto
#### Section 1: Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe

Import Required Libraries:

In [42]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import lxml

Fetch the Wikipedia page using requests

In [43]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")


Parse the contents of the HTML page using BeautifulSoup's HTML parser

In [44]:
soup = BeautifulSoup(page.content,'html.parser')

Find the first table on the page, which lists Postal Codes, Boroughs and Neighbourhoods

In [45]:
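from io import StringIO  # recent pandas versions expect a file-like object rather than a raw HTML string
table = soup.find('table')  # first <table> element on the page
df = pd.read_html(StringIO(str(table)))[0]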
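pd.read_html hides how rows and cells are pulled out of the markup. The following is a minimal sketch of the same idea using only the standard library's html.parser together with pandas, run on a small inline HTML string instead of the live page. The class name FirstTableParser and the variable names are illustrative assumptions and not part of the notebook; the sample rows are copied from outputs shown later in this notebook. The sketch does not handle nested tables or merged cells.

```python
from html.parser import HTMLParser

import pandas as pd


class FirstTableParser(HTMLParser):
    """Collect the cell text of the first <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._cell = None       # text fragments of the cell currently being read
        self._finished = False  # set once the first table has been closed

    def handle_starttag(self, tag, attrs):
        if self._finished:
            return
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is only recorded while inside a <td> or <th>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._finished:
            return
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == 'table':
            self._finished = True


sample_html = """
<table>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1B</td><td>Scarborough</td><td>Malvern, Rouge</td></tr>
<tr><td>M1G</td><td>Scarborough</td><td>Woburn</td></tr>
</table>
<table><tr><td>ignored</td></tr></table>
"""

parser = FirstTableParser()
parser.feed(sample_html)

# The first parsed row is the header; the rest are data rows
header, *body = parser.rows
df_demo = pd.DataFrame(body, columns=header)
print(df_demo)
```

In practice pd.read_html is the more robust choice; the sketch only illustrates what "read the first table into a dataframe" involves.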

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [46]:
df_filt = df[df['Borough'] != 'Not assigned']

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [47]:
df_filt[df_filt['Neighborhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighborhood
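The check above returns no rows, so nothing has to be replaced in the current table. If such rows did appear, the rule could be applied roughly as follows. This is a minimal sketch on a hand-made frame; the toy rows and the names toy, assigned and missing are illustrative assumptions, not data taken from the Wikipedia page.

```python
import pandas as pd

# Toy rows (hypothetical) with the same columns as the scraped table
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Regent Park, Harbourfront', 'Not assigned'],
})

# Keep only rows with an assigned borough; .copy() yields an independent frame
assigned = toy[toy['Borough'] != 'Not assigned'].copy()

# Where the neighborhood is 'Not assigned', reuse the borough name
missing = assigned['Neighborhood'] == 'Not assigned'
assigned.loc[missing, 'Neighborhood'] = assigned.loc[missing, 'Borough']

print(assigned)
```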


More than one neighborhood can exist in one postal code area. The Wikipedia page already combines them into a single row per postal code, so no postal code currently appears more than once in the dataframe.
Print the number of unique postal codes and the shape of the dataframe to verify this: the two row counts should match.

In [48]:
print(df_filt['Postal Code'].nunique())
print(df_filt.shape)

103
(103, 3)
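If a future version of the table listed one neighborhood per row, the rows could be combined per postal code along these lines. This is a sketch under that assumption; the frame long_df and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format rows: one neighborhood per row
long_df = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M1G'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'Scarborough'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Woburn'],
})

# Join the neighborhoods of each postal code into one comma-separated string
combined = (
    long_df.groupby(['Postal Code', 'Borough'], as_index=False)['Neighborhood']
           .agg(', '.join)
)

# Same verification as in the cell above: one row per postal code
is_unique = combined['Postal Code'].is_unique
print(combined)
print(is_unique)
```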


#### Section 2: Get geographic co-ordinates of each neighborhood and add it to the dataframe

Get the coordinates for every Postal Code and append them to the dataframe. The geocoder package did not work reliably, so instead download the CSV file containing the latitude and longitude coordinates of every Postal Code.

In [142]:
!wget -O Geospatial_Coordinates.csv http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv

--2020-07-07 15:01:00--  http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Resolving cocl.us (cocl.us)... 119.81.168.76, 119.81.168.75, 161.202.50.39
Connecting to cocl.us (cocl.us)|119.81.168.76|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv [following]
--2020-07-07 15:01:01--  https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Connecting to cocl.us (cocl.us)|119.81.168.76|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-07-07 15:01:04--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resol

Read the coordinates of every Postal code into df_coord dataframe

In [143]:
df_coord = pd.read_csv('Geospatial_Coordinates.csv')
df_coord.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


As we can see, df_coord is sorted by Postal Code. Sort df_filt the same way so that the rows of the two frames line up.

In [144]:
df_filt.sort_values('Postal Code', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Reset the index of df_filt so that it runs from 0 in the new sorted order

In [147]:
df_filt.reset_index(drop=True, inplace=True)
df_filt

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
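The SettingWithCopyWarning raised by the in-place sort arises because df_filt was created by filtering another dataframe. One common way to avoid it is to take an explicit copy when filtering and to reassign the result instead of using inplace=True. The following is a minimal sketch on toy data; the names raw and filtered and the rows are illustrative assumptions.

```python
import pandas as pd

# Toy frame (hypothetical rows), unsorted on purpose
raw = pd.DataFrame({
    'Postal Code': ['M9V', 'M1B', 'M4A'],
    'Borough': ['Etobicoke', 'Scarborough', 'Not assigned'],
})

# .copy() makes the filtered result an independent DataFrame,
# so later modifications cannot be mistaken for writes to a slice of raw
filtered = raw[raw['Borough'] != 'Not assigned'].copy()

# Reassigning the result avoids inplace=True and chains both steps
filtered = filtered.sort_values('Postal Code').reset_index(drop=True)

print(filtered)
```

The original frame raw is left unchanged, which makes it easier to rerun individual notebook cells.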


Merge both frames (df_filt and df_coord) on the Postal Code column

In [148]:
# Join on the shared 'Postal Code' column (inner join by default)
neighbors = pd.merge(df_filt, df_coord, on='Postal Code')

In [150]:
neighbors.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
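pd.merge matches rows by the values in the key column, not by row position, so the result does not depend on both frames being sorted identically. The sketch below illustrates this and adds two optional safety checks: validate='one_to_one' raises an error if a postal code is duplicated in either frame, and indicator=True adds a _merge column that shows which rows found a match. The frames left and right are small stand-ins built from rows shown in the outputs above; their names are illustrative.

```python
import pandas as pd

# Left frame deliberately out of order
left = pd.DataFrame({
    'Postal Code': ['M1G', 'M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
})
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E', 'M1G'],
    'Latitude': [43.806686, 43.784535, 43.763573, 43.770992],
    'Longitude': [-79.194353, -79.160497, -79.188711, -79.216917],
})

# how='left' keeps every row of the left frame, in its original order
merged = pd.merge(left, right, on='Postal Code', how='left',
                  validate='one_to_one', indicator=True)

# Rows of the left frame that found no coordinates (expected to be empty here)
unmatched = merged[merged['_merge'] != 'both']

print(merged)
print(len(unmatched))
```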
