## Introduction

This project explores and clusters the neighborhoods of Toronto. Along the way, we learn how to scrape a webpage and manipulate the resulting dataframe, how to convert addresses into latitude and longitude values, and how to visualize the neighborhoods to see how they cluster together.

## Part A.
**To scrape a webpage and manipulate the dataframe**

In [1]:
# Import packages; BeautifulSoup (bs4) does most of the scraping work.

import os
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
os.getcwd()

In [24]:
# Specify the URL and fetch the target webpage.

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
webpage = urlopen(URL)
# Name the parser explicitly; otherwise bs4 warns that no parser was specified.
soup = BeautifulSoup(webpage, "lxml")
webpage.close()


In [25]:
# Fetch the source Wikipedia article as text and preview its first 100 characters.

article = requests.get(URL).text
article[:100]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

In [26]:
# Preview the parsed article (soup was built from the same URL above).

# soup = BeautifulSoup(article, 'lxml')
soup.prettify()[:100]

# To inspect only the table of postal codes instead:
# table = soup.find('table')
# table.prettify()[:100]

'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <'

To simplify later steps, we write the table to a new CSV file in the current working directory.

In [27]:
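# Write the first table on the page to a local CSV file.

tables = soup.find_all('table')
tab = tables[0]

# The with-block closes the file even if an error occurs while writing.
with open("postcode_canada.csv", "w") as newfile:
    for tr in tab.tbody.find_all('tr'):
        # Every cell is followed by a comma, so each row ends with a trailing
        # comma, which pandas later reads as an extra, empty fourth column.
        for th in tr.find_all('th'):
            newfile.write(th.get_text().strip() + ',')
        for td in tr.find_all('td'):
            newfile.write(td.get_text().strip() + ',')
        newfile.write('\n')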
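
Joining cell text with bare commas works only as long as no cell itself contains a comma. A minimal sketch of a more defensive variant, assuming the rows have already been extracted into lists of strings, uses the standard `csv` module so that such cells are quoted. The `rows` sample and the `rows_to_csv` helper are hypothetical and not part of the notebook.

```python
import csv
import io

# Hypothetical sample rows standing in for the scraped table cells.
rows = [
    ["Postcode", "Borough", "Neighbourhood"],
    ["M5A", "Downtown Toronto", "Harbourfront"],
    ["M9V", "Etobicoke", "Albion Gardens, Beaumond Heights"],  # cell contains a comma
]


def rows_to_csv(rows):
    """Serialise rows to CSV text, quoting any cell that contains a comma."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)

# Reading the text back yields the original cells, commas included.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
```

Because `csv.writer` emits no trailing comma, a file written this way would not produce the extra unnamed column when loaded with pandas.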

In [28]:
# Load the CSV into a dataframe and print the shape of the original data.

df = pd.read_csv('postcode_canada.csv')
# Drop the empty column created by the trailing comma on each row.
df.drop('Unnamed: 3', axis=1, inplace=True)
print(df.shape)
df.head()

(289, 3)


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [29]:
# Drop the rows whose borough is "Not assigned".

new_df = df[~df['Borough'].str.contains('Not assigned')]
print(new_df.shape)
new_df.head()

(212, 3)


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


Now combine the neighbourhoods that share the same postcode and borough into a single row.

In [30]:
# Group by postcode and borough, then join the neighbourhood names with ", ".

dfps = new_df.groupby(['Postcode','Borough'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ', '.join(x))
print(dfps.shape)
dfps.tail()

(103, 3)


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
| 102 | M9W | Etobicoke | Northwest |
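
The grouping step can be illustrated on a tiny frame. This is a sketch only: the `sample` frame reuses three rows shown earlier, and selecting the `Neighbourhood` column before aggregating is a simplification that avoids the dtype check; it is not the notebook's exact call.

```python
import pandas as pd

# Three rows copied from the filtered data shown earlier.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# One row per (Postcode, Borough); neighbourhood names joined with ", ".
combined = (
    sample.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"]
    .agg(", ".join)
)
print(combined)
```

The two M5A rows collapse into one, which is why the row count drops after grouping.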


We don't want "Not assigned" to appear as a value in our dataset, so we replace it.

In [31]:
# Replace "Not assigned" in column 'Neighbourhood' with the value in 'Borough'.
# A boolean mask with .loc updates dfps itself; assigning to a row obtained
# via dfps.iloc[i, :] may only modify a temporary copy.

mask = dfps['Neighbourhood'] == 'Not assigned'
dfps.loc[mask, 'Neighbourhood'] = dfps.loc[mask, 'Borough']

In [32]:
# Check that the update was applied correctly.

dfps.iloc[85,:]

Postcode                  M7A
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 85, dtype: object

In [33]:
# Print the number of rows and columns.

dfps.shape

(103, 3)

In [34]:
# Display the dataset (Part A).

dfps.head(12)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


## Part B.
**To get the latitude and the longitude coordinates of each neighborhood**

In [35]:
# Read the coordinates from the provided CSV file.

geo_df = pd.read_csv("Geospatial_Coordinates.csv")
geo_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [36]:
# Rename the first column so that it matches the key column of the previous dataset.

geo_df.rename(columns={'Postal Code':'Postcode'}, inplace = True)
geo_df.head()

| | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [37]:
# now merge the two related datasets and form a new dataframe.

new_df = pd.merge(dfps, geo_df, on='Postcode')
new_df.head(12)

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
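
`pd.merge` defaults to an inner join, so a postcode missing from either file would be dropped silently. A minimal sketch of how one might check for that, assuming small made-up frames (`left`, `right`, and the postcode `M9Z` are hypothetical), uses a left join with `indicator=True`:

```python
import pandas as pd

# Hypothetical stand-ins for the postcode table and the coordinates table.
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],  # M9Z is made up and has no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every postcode; the _merge column records where each row matched.
checked = pd.merge(left, right, on="Postcode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postcode found coordinates, in which case the inner join loses no rows.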


In [38]:
new_df.shape

(103, 5)

In [39]:
# Print the number of boroughs and the number of neighborhood rows (Part B).

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(new_df['Borough'].unique()),
        new_df.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.
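
Note that `new_df.shape[0]` counts rows, one per postcode, and several rows hold more than one neighbourhood name. A sketch of counting distinct names instead, assuming the names are separated by a comma and a space as in Part A (the `sample` frame is a made-up three-row excerpt):

```python
import pandas as pd

# Made-up excerpt in the same shape as the merged dataframe.
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1G", "M5A"],
    "Borough": ["Scarborough", "Scarborough", "Downtown Toronto"],
    "Neighbourhood": ["Rouge, Malvern", "Woburn", "Harbourfront, Regent Park"],
})

n_rows = sample.shape[0]                 # one row per postcode
n_boroughs = sample["Borough"].nunique()
# Split the combined strings and count each neighbourhood name once.
n_neighbourhoods = sample["Neighbourhood"].str.split(", ").explode().nunique()

print(n_rows, n_boroughs, n_neighbourhoods)
```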


## Part C.
**Use geopy library to get the latitude and longitude values of Toronto and to visualize the neighborhoods and see how they cluster together**

In [40]:
# Find the latitude and longitude of the city of Toronto.

from geopy.geocoders import Nominatim

address = 'Toronto, Canada'

# Recent geopy releases require an explicit user_agent;
# the name used here is an arbitrary placeholder.
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [20]:
# Optional block.
# If the geocoding service is unavailable or unreliable, set the coordinates by hand:

latitude = 43.653963
longitude = -79.387207
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
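
The manual fallback can also be folded into a small helper. This is a sketch under stated assumptions: `geocode_or_default` and the two stand-in geocoders are hypothetical names introduced here, and the helper relies only on the geocoder returning either `None` or an object with `latitude` and `longitude` attributes, which is how geopy's `geocode` behaves.

```python
from types import SimpleNamespace

TORONTO_DEFAULT = (43.653963, -79.387207)  # values printed earlier in this notebook


def geocode_or_default(geocode, address, default):
    """Return (latitude, longitude) from geocode(address), or default on failure."""
    try:
        location = geocode(address)
    except Exception:
        # Network errors, timeouts, or service errors fall back to the default.
        return default
    if location is None:
        # The geocoder found no match for the address.
        return default
    return (location.latitude, location.longitude)


# Stand-in geocoders so that the sketch runs without network access.
def failing_geocode(address):
    raise TimeoutError("service unavailable")


def working_geocode(address):
    return SimpleNamespace(latitude=43.7709163, longitude=-79.4124102)


fallback = geocode_or_default(failing_geocode, "Toronto, Canada", TORONTO_DEFAULT)
found = geocode_or_default(working_geocode, "North York, Ontario", TORONTO_DEFAULT)
print(fallback, found)
```

In the notebook, the first argument would be `geolocator.geocode` rather than a stand-in function.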


In [41]:
#!conda install -c conda-forge folium=0.7.0 --yes
import folium  # map rendering library
pd.set_option('display.max_rows', None)

In [42]:
# create map of Toronto using latitude and longitude values

map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(new_df['Latitude'], new_df['Longitude'], new_df['Borough'], new_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='darkblue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Toronto)  
    
map_Toronto

In [43]:
# Simplify the map above by focusing on a single borough.
# Slice the dataframe from Part B to create a new dataframe of the North York data.

north_york_data = new_df[new_df['Borough'] == 'North York'].reset_index(drop=True)
north_york_data.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M2H | North York | Hillcrest Village | 43.803762 | -79.363452 |
| 1 | M2J | North York | Fairview, Henry Farm, Oriole | 43.778517 | -79.346556 |
| 2 | M2K | North York | Bayview Village | 43.786947 | -79.385975 |
| 3 | M2L | North York | Silver Hills, York Mills | 43.75749 | -79.374714 |
| 4 | M2M | North York | Newtonbrook, Willowdale | 43.789053 | -79.408493 |


**Let's get the geographical coordinates of North York (Toronto, Canada)**

In [44]:
address = 'North York, Ontario'

# Recent geopy releases require an explicit user_agent;
# the name used here is an arbitrary placeholder.
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude1 = location.latitude
longitude1 = location.longitude
print('The geographical coordinates of North York (Ontario, Canada) are {}, {}.'.format(latitude1, longitude1))

The geographical coordinates of North York (Ontario, Canada) are 43.7709163, -79.4124102.


In [45]:
# create map of North York using the latitude1 and longitude1 values

map_north_york = folium.Map(location=[latitude1, longitude1], zoom_start=11)

# add markers to map
for lat, lng, label in zip(north_york_data['Latitude'], north_york_data['Longitude'], north_york_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_north_york)  
    
map_north_york

**(This notebook includes Part A, Part B and Part C only.)**
**This completes the notebook. Thank you!**