# Week 3 Capstone assignment 
## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
### Adding Geographical Coordinates for Postal Codes

*Developer: Dan Grigore* \
*Created on: 2020/02/01* 


This Jupyter notebook presents the following work:

1. Install any necessary libraries
2. Get the Wikipedia page "**List of postal codes of Canada: M**"
3. Extract the table into a dataframe
4. Process the data according to the assignment requirements
5. Use a CSV file to add Latitude and Longitude columns to the dataframe

Let's start!

#### 1. Install BeautifulSoup4 library for scraping if needed

In [3]:
# Install the scraping dependencies if they are not already available.
# urllib is part of the Python standard library and needs no installation.
!pip install BeautifulSoup4
print('bs4 package installed!')
# lxml is the default HTML parser used by pandas.read_html
!pip install lxml
print('lxml package installed!')
!pip install parse

bs4 package installed!
lxml package installed!


#### 2. Import libraries

In [4]:
# Import the libraries used below.
# pandas.read_html relies on the lxml parser installed in step 1, so the
# scraping packages do not need to be imported explicitly.
import pandas as pd
import numpy as np

#### 3. Define variables

In [5]:
#define variables
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
print(url)

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


#### 4. Define a dataframe and read the table

In [6]:
#define a dataframe and read the table
tables = pd.read_html(url)
tables[0]

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


#### 5. Drop every row with a 'Not assigned' borough

In [7]:
# Keep only the rows whose borough is assigned; the filtered result is stored in new_df
new_df = tables[0][tables[0].Borough != 'Not assigned']
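
The filtering step can be tried in isolation on a small, hand-built dataframe. The sketch below is illustrative only: the `sample` and `assigned` names are assumptions, and the three rows are copied from the table output shown in step 4.

```python
import pandas as pd

# Toy stand-in for the scraped table (rows taken from the output in step 4).
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# The comparison yields a boolean Series; indexing with it keeps only True rows.
# reset_index(drop=True) renumbers the remaining rows from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Resetting the index is optional here. The notebook keeps the original index at this point and resets it later, after the grouping step.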

#### 6. Create the new dataframe from the result above and display it

In [8]:
# new_table is another name for the filtered dataframe (no copy is made)
new_table = new_df
new_table

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
...,...,...,...
281,M8Z,Etobicoke,Kingsway Park South West
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West


#### 7. Combine rows that share a postal code

More than one neighborhood can exist in one postal code area. These rows are combined into one row, with the neighborhoods separated by a comma. The function foo joins the neighbourhood names of each group with a comma. The result is checked for postcode M6M. Postcode M5A is not used for the check because it now has only one value, probably because the Wikipedia table changed recently.

In [9]:
# More than one neighborhood can exist in one postal code area.
# These rows are combined into one row, with the neighborhoods separated by a comma.
# foo joins the neighbourhood names of one group into a single string.
foo = lambda a: " , ".join(a)
#group by Postcode and Borough and aggregate on Neighbourhood
new_result = new_table.groupby(['Postcode','Borough']).agg({'Neighbourhood': foo}).reset_index()
new_result.loc[new_result['Postcode'] == "M6M"]

Unnamed: 0,Postcode,Borough,Neighbourhood
80,M6M,York,"Del Ray , Keelesdale , Mount Dennis , Silverthorn"
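
As a minimal sketch of the same grouping idea, the example below combines two rows that share a postcode. The `sample` and `combined` names are assumptions for illustration, the rows come from the outputs above, and a plain `", "` separator is used instead of the `" , "` used in the notebook.

```python
import pandas as pd

# Two rows share postcode M6A; one row has postcode M3A.
sample = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Group by the key columns, then join the neighbourhood strings of each group.
# str.join receives the group's values as an iterable of strings.
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Passing `", ".join` directly avoids the need for a named lambda, and `reset_index()` turns the group keys back into ordinary columns.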


#### 8. If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.

In [22]:
#If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
new_result['Neighbourhood'] = np.where((new_result.Neighbourhood=='Not assigned'),new_result['Borough'],new_result['Neighbourhood'])

#### 9. Verify that 'Not assigned' neighbourhoods were replaced with the borough value (example: postcode M9A)

In [23]:
# Verify that 'Not assigned' neighbourhoods were replaced with the borough value (example: postcode M9A)
new_result.loc[new_result['Postcode'] == "M9A"]

Unnamed: 0,Postcode,Borough,Neighbourhood
93,M9A,Queen's Park,Queen's Park
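
The conditional replacement can be checked on a two-row example. This is a sketch with assumed names (`sample`, `is_missing`, `via_mask`); it also shows `Series.mask` as one possible pandas-only alternative to `np.where`, which the notebook itself does not use.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M9A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

is_missing = sample["Neighbourhood"] == "Not assigned"

# Alternative: mask replaces the values where the condition is True.
via_mask = sample["Neighbourhood"].mask(is_missing, sample["Borough"])

# np.where picks the borough where the condition holds, else the neighbourhood.
sample["Neighbourhood"] = np.where(
    is_missing, sample["Borough"], sample["Neighbourhood"]
)
print(sample)
```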


#### 10. Use the shape attribute to show the number of rows and columns of the dataframe

In [14]:
# shape is an attribute that returns (number of rows, number of columns)
new_result.shape

(103, 3)

#### 11. Load geographical coordinates from the CSV file into a dataframe

In [15]:
# Load geographical coordinates from the CSV file into a dataframe
geog_data = pd.read_csv('Geospatial_Coordinates.csv')

#### 12. Display geog_data information

In [16]:
# Display dataframe data
geog_data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


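#### 13. Merge the two dataframes on the postal code and store the result in new_df

In [17]:
# Merge on the postal code values rather than on row position, so the result
# does not depend on both dataframes being sorted in the same order.
# how='left' keeps every row (and the row order) of new_result.
new_df = new_result.merge(geog_data, how='left', left_on='Postcode', right_on='Postal Code').drop(columns='Postal Code')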

#### 14. Verify for Postal Code M5G

In [18]:
#Verify JOIN for the example Postal Code M5G
new_df.loc[new_df['Postcode'] == "M5G"]

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
57,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
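
Combining the dataframes by postal code value and combining them by row position give the same result only when both dataframes are sorted identically. The sketch below illustrates the difference on three rows taken from the outputs above, with the coordinate rows deliberately shuffled. The names `neighbourhoods`, `coords`, `by_key` and `by_position` are assumptions for illustration.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})

# Same coordinates as in geog_data, but in a different row order.
coords = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

# Key-based merge: rows are matched on the postal code values.
# validate="one_to_one" raises an error if a postal code is duplicated on either side.
by_key = neighbourhoods.merge(
    coords,
    how="left",
    left_on="Postcode",
    right_on="Postal Code",
    validate="one_to_one",
).drop(columns="Postal Code")

# Position-based join: rows are matched on the index, so M1B receives
# the coordinates of M1E when the order differs.
by_position = neighbourhoods.join(coords[["Latitude", "Longitude"]])

print(by_key)
print(by_position)
```

After a left merge, `by_key["Latitude"].isna().any()` is a quick way to check whether any postal code was left without coordinates.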


In [19]:
#View data
new_df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge , Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village , Martin Grove Gardens , Ric...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens , Beaumond Heights , Humbergate...",43.739416,-79.588437


In [20]:
print('Done!')

Done!
