<h1 align="center"><font size="5">Segmenting and Clustering Neighborhoods in Toronto</font></h1>  

<a id='ToC'></a>  

<div class="alert alert-block alert-info" style="margin-top: 20px">
    
<h1> Table of Contents</h1>  <br>

1. <a href='#FirstSection'>Scrape Canadian Postal Code</a> <br>

     1.1. <a href='#11'>BeautifulSoup</a> <br>  
     
     1.2. <a href='#12'>Preprocessing</a> <br>  
     
     1.3. <a href='#13'>Clean DataFrame</a>  <br>  
     
2. <a href='#SecondSection'>Obtain the Latitude and Longitude coordinates</a> <br>  

     2.1. <a href='#21'>Using package - did not work</a> <br>  
     
     2.2. <a href='#22'>Using csv</a> <br>  
     
     2.3. <a href='#23'>Latitude Longitude DataFrame - Final</a>   
<!--3. [Explore and cluster the Toronto Neighborhood](#ThirdSection)  
     3.1. [Following the NY Cluster lab](#31)  
     3.2. [Subsetting to East, Central, Downtown, and West Toronto](#32)  
     3.3. [Decoupling Neighborhood](#33)  
     3.4. [Map of East, Central, Downtown, and West Toronto Neighborhood](#34)  
     3.5. [FourSquare API](#35)  
     3.6. [Preprocessing Neighborhood](#36)  
     3.7. [K-Means Clustering](#37)  
     3.8. [Cluster Map](#38)  
     3.9. [Examine Cluster](#39)  -->
</div>

<div class="alert alert-block alert-success" style="margin-top: 20px">

<a id='FirstSection'> <H1> First Assignment - Scrape Canadian Postal Code</H1></a>  

 1.1. <a href='#11'>BeautifulSoup</a> <br>  
 
 1.2. <a href='#12'>Preprocessing</a> <br>     
 
 1.3. <a href='#13'>Clean DataFrame</a>  <br>  
  
</div>

Let's import a couple of dependencies before we start:

In [1]:
# Segmenting and Clustering Neighborhoods in Toronto
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

<a id='11'> BeautifulSoup</a>

First things first: let's scrape the Wikipedia page using the BeautifulSoup package.

In [2]:
from bs4 import BeautifulSoup

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')

In [4]:
table = soup.find('table',class_='wikitable sortable')

In [5]:
# to show the table:
# table.prettify().splitlines()

In [6]:
# Convert the html into pandas dataframe:
df = pd.read_html(str(table),header=0)[0]

<a id='12'> Pre-processing</a>  
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [7]:
# rename the columns to the names specified by the assignment
df.columns = ['PostalCode','Borough','Neighborhood']

In [8]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [9]:
# sanity check:
df[df.PostalCode=='M2H']

Unnamed: 0,PostalCode,Borough,Neighborhood
63,M2H,North York,Hillcrest Village


* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [10]:
# remove cells with a borough that is "Not assigned"
df = df[df.Borough!='Not assigned']

In [11]:
# check that all the 'Not assigned' boroughs have been dropped:
sum(df.Borough=='Not assigned')

0

In [12]:
df.shape

(211, 3)

* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [13]:
# split the data frame into: duplicated and unique df. 
duplicated_df = df[df.duplicated(subset='PostalCode',keep=False)].sort_values(by='PostalCode')
duplicated_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
28,M1C,Scarborough,Rouge Hill
29,M1C,Scarborough,Port Union
27,M1C,Scarborough,Highland Creek


In [14]:
duplicated_df.shape

(165, 3)

Let's also separate out the rows with a unique ```PostalCode```. Once the duplicated ```PostalCode``` rows are cleaned up, we can add them back in. 

In [15]:
unique_df = df[~df.duplicated(subset='PostalCode',keep=False)].sort_values(by='PostalCode')

In [16]:
unique_df.shape

(46, 3)

In [17]:
unique_df[unique_df.PostalCode=='M2H']

Unnamed: 0,PostalCode,Borough,Neighborhood
63,M2H,North York,Hillcrest Village


The 211 rows of ```df``` have been split into:  
* ```duplicated_df``` with 165 rows
* ```unique_df``` with 46 rows
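
The split relies on how `duplicated` treats its `keep` argument. As a minimal sketch on a made-up three-row frame (the `toy` data is illustrative, not taken from the Wikipedia table), `keep=False` flags every row of a repeated postal code, while the default `keep='first'` leaves the first occurrence unflagged:

```python
import pandas as pd

# Hypothetical toy data: M1B appears twice, M2H once.
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M2H'],
    'Neighborhood': ['Rouge', 'Malvern', 'Hillcrest Village'],
})

# keep=False marks every member of a duplicated group.
all_dups = toy.duplicated(subset='PostalCode', keep=False)

# The default (keep='first') marks only the later occurrences.
later_dups = toy.duplicated(subset='PostalCode')

print(all_dups.tolist())    # [True, True, False]
print(later_dups.tolist())  # [False, True, False]
```

This is why `keep=False` is used to build `duplicated_df` and `unique_df`, and the default is used afterwards to keep one row per postal code.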

Let's clean up the ```duplicated_df``` data frame:

In [18]:
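# keep only the first row of each duplicated PostalCode (~ inverts the duplicated mask);
# .copy() makes this an independent frame so that columns can be inserted later without warnings
cleaned_dup_df = duplicated_df[~duplicated_df.duplicated(subset='PostalCode')].copy()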
cleaned_dup_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
11,M1B,Scarborough,Rouge
28,M1C,Scarborough,Rouge Hill
44,M1E,Scarborough,West Hill
93,M1K,Scarborough,Kennedy Park
107,M1L,Scarborough,Clairlea


In [19]:
cleaned_dup_df.shape

(57, 3)

In [20]:
# collect the neighborhoods of each duplicated PostalCode, in the same order as
# cleaned_dup_df, using groupby and get_group (the grouping is done once, outside the loop)
grouped = duplicated_df.groupby('PostalCode')
neighborhood_list = []
for postal in cleaned_dup_df['PostalCode']:
    neighborhood_list.append(list(grouped.get_group(postal)['Neighborhood']))

In [21]:
neighborhood_list[:5]

[['Rouge', 'Malvern'],
 ['Rouge Hill', 'Port Union', 'Highland Creek'],
 ['West Hill', 'Morningside', 'Guildwood'],
 ['Kennedy Park', 'Ionview', 'East Birchmount Park'],
 ['Clairlea', 'Golden Mile', 'Oakridge']]

In [22]:
# join each inner list into one comma-separated string with a list comprehension:
neighborhood_list = [', '.join(i) for i in neighborhood_list]
neighborhood_list[:5]

['Rouge, Malvern',
 'Rouge Hill, Port Union, Highland Creek',
 'West Hill, Morningside, Guildwood',
 'Kennedy Park, Ionview, East Birchmount Park',
 'Clairlea, Golden Mile, Oakridge']

In [23]:
# add a new column into the dataframe to hold the new list of multiple neighborhood in one row
cleaned_dup_df.insert(2,'Neighborhood_dup',neighborhood_list)

In [24]:
# remove the old neighborhood column:
cleaned_dup_df = cleaned_dup_df.drop('Neighborhood',axis=1)

In [25]:
# rename the 'Neighborhood_dup' with 'Neighborhood'
cleaned_dup_df.columns = ['PostalCode','Borough','Neighborhood']
cleaned_dup_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
11,M1B,Scarborough,"Rouge, Malvern"
28,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
44,M1E,Scarborough,"West Hill, Morningside, Guildwood"
93,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
107,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"


In [26]:
# combine the unique_df and cleaned_dup_df with concat function:
cleaned_df = pd.concat([cleaned_dup_df,unique_df])

In [27]:
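# reset the old index to a new 0..n-1 index.
# Note: sort_values returns a new DataFrame, so the unassigned call below does not
# reorder cleaned_df; the outputs that follow keep the concat order
# (combined rows first, then the unique rows).
cleaned_df.sort_values(by='PostalCode')
cleaned_df = cleaned_df.reset_index(drop=True)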

In [28]:
# For sanity check:
cleaned_df[cleaned_df.PostalCode=='M5A']

Unnamed: 0,PostalCode,Borough,Neighborhood
24,M5A,Downtown Toronto,"Harbourfront, Regent Park"
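
For comparison, the split, join, and concat steps can also be written as a single `groupby`. This is only a sketch on a small hand-made frame (`sample` and `combined` are illustrative names), and it assumes that every row of a postal code shares the same borough:

```python
import pandas as pd

# Hypothetical sample mirroring the structure of df after the
# 'Not assigned' boroughs have been dropped.
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M2H', 'M1B', 'M1B'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York',
                'Scarborough', 'Scarborough'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Hillcrest Village',
                     'Rouge', 'Malvern'],
})

# Group on PostalCode and Borough, join the neighborhoods of each group
# into one comma-separated string, then turn the group keys back into columns.
combined = (
    sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
          .agg(', '.join)
          .reset_index()
)

print(combined)
```

Unlike the concat approach, the result of `groupby` comes back ordered by postal code, so the row index would differ from the one shown in this notebook.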


* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [29]:
# find where the "Not assigned" neighborhood rows are located in cleaned_df
cleaned_df[cleaned_df.Neighborhood=='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
95,M7A,Queen's Park,Not assigned


In [30]:
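# a "Not assigned" neighborhood takes the name of its borough ("Queen's Park" here);
# selecting by condition avoids hard-coding the row index
not_assigned = cleaned_df.Neighborhood=='Not assigned'
cleaned_df.loc[not_assigned,'Neighborhood'] = cleaned_df.loc[not_assigned,'Borough']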

In [31]:
# Sanity check: compare with the list of postal codes given in this assignment
cleaned_df[cleaned_df['PostalCode'].isin(['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M7A'])]

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
8,M1R,Scarborough,"Maryvale, Wexford"
18,M4B,East York,"Parkview Hill, Woodbine Gardens"
35,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."
56,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
59,M1J,Scarborough,Scarborough Village
63,M2H,North York,Hillcrest Village
76,M4G,East York,Leaside
79,M4M,East Toronto,Studio District
88,M5G,Downtown Toronto,Central Bay Street


<a id='13'> Clean DataFrame</a>  
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

In [32]:
# Here is the finished dataframe:
cleaned_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"West Hill, Morningside, Guildwood"
3,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
4,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
5,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
6,M1N,Scarborough,"Birch Cliff, Cliffside West"
7,M1P,Scarborough,"Wexford Heights, Dorset Park, Scarborough Town..."
8,M1R,Scarborough,"Maryvale, Wexford"
9,M1T,Scarborough,"Clarks Corners, Sullivan, Tam O'Shanter"


* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [33]:
cleaned_df.shape

(103, 3)
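
The three requirements of this section (assigned boroughs only, one row per postal code, no unassigned neighborhoods) can also be checked programmatically. The helper below is a hypothetical sketch (`check_postal_df` is not part of the assignment) that returns a list of the problems it finds; the two small frames are made up for illustration:

```python
import pandas as pd

def check_postal_df(frame):
    """Return a list of human-readable problems found in a cleaned frame."""
    problems = []
    if list(frame.columns) != ['PostalCode', 'Borough', 'Neighborhood']:
        # Without the expected columns the remaining checks cannot run.
        return ['unexpected columns']
    if frame['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    if (frame['Borough'] == 'Not assigned').any():
        problems.append('unassigned borough')
    if (frame['Neighborhood'] == 'Not assigned').any():
        problems.append('unassigned neighborhood')
    return problems

# Illustrative data only.
good = pd.DataFrame({
    'PostalCode': ['M1B', 'M7A'],
    'Borough': ['Scarborough', "Queen's Park"],
    'Neighborhood': ['Rouge, Malvern', "Queen's Park"],
})
bad = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Not assigned'],
})

print(check_postal_df(good))  # []
print(check_postal_df(bad))   # ['duplicate postal codes', 'unassigned neighborhood']
```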

<div class="alert alert-block alert-success" style="margin-top: 20px">  
    
<a id='SecondSection'> <H1> Second Assignment - Obtain the Latitude and Longitude coordinates </H1></a>  

   2.1. <a href='#21'>Using package - did not work</a> <br>  
     
   2.2. <a href='#22'>Using csv</a> <br>  
     
   2.3. <a href='#23'>Latitude Longitude DataFrame - Final</a>   <br>  
   
[Back to the top](#ToC)  

</div>

In [34]:
## this is how I would have used the geocoder
## unfortunately, the code did not work 
#import geocoder # import geocoder
#lat_lng = []
#for postal_code in cleaned_df['PostalCode']:
#    lat_lng_coords = None
#    # loop until you get the coordinates
#    while(lat_lng_coords is None):
#        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#        lat_lng_coords = g.latlng
#    lat_lng.append(lat_lng_coords)

<a id='21'>Using package - did not work</a>  

In [35]:
# Instead, I am going to use the pypostalcode package.
# Example using 'M5G':
from pypostalcode import PostalCodeDatabase
pcdb = PostalCodeDatabase()
pc = 'M5G'
location = pcdb[pc]
print("latitude for M5G:",location.latitude)
print("longitude for M5G:",location.longitude)

latitude for M5G: 43.6519
longitude for M5G: -79.3874


In [36]:
# let's loop over every postal code and collect the coordinates:
pcdb = PostalCodeDatabase()
lat_ca = []
long_ca = []
for postal_code in cleaned_df['PostalCode']:
    try:
        location = pcdb[postal_code]
        lat_ca.append(location.latitude)
        long_ca.append(location.longitude)
    except Exception:
        # postal code not found in the database: record it as missing
        # instead of overwriting the previous location object
        lat_ca.append(np.nan)
        long_ca.append(np.nan)
# let's insert the latitude column into the cleaned_df:
cleaned_df.insert(3,'Latitude',lat_ca)
# let's insert the longitude column into the cleaned_df:
cleaned_df.insert(4,'Longitude',long_ca)
# check which rows are missing coordinates:
cleaned_df[cleaned_df['Latitude'].isnull()]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
96,M7R,Mississauga,Canada Post Gateway Processing Centre,,


In [37]:
# to get the value, we will use the csv file provided by the IBM Coursera course:
downloaded_df = pd.read_csv('Geospatial_Coordinates.csv')
downloaded_df.head()
# let's make sure that the column names match those in cleaned_df:
downloaded_df.columns = ['PostalCode','Latitude','Longitude']
# look up the M7R row in downloaded_df:
m7r = downloaded_df[downloaded_df['PostalCode']=='M7R'].iloc[0]
# let's replace the NaN values for M7R:
cleaned_df.loc[cleaned_df['Latitude'].isnull(),'Latitude'] = m7r['Latitude']
cleaned_df.loc[cleaned_df['Longitude'].isnull(),'Longitude'] = m7r['Longitude']
cleaned_df.loc[cleaned_df.PostalCode=='M7R']

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
96,M7R,Mississauga,Canada Post Gateway Processing Centre,43.636966,-79.615819
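
The patch for M7R handles a single missing row. If several postal codes were missing, the same idea could be applied to all of them at once. The sketch below uses two invented frames (`primary` and `reference`); the M5G reference coordinates are placeholders, and only the M7R values are taken from the output shown here:

```python
import numpy as np
import pandas as pd

# Hypothetical primary source with one missing row.
primary = pd.DataFrame({
    'PostalCode': ['M5G', 'M7R'],
    'Latitude': [43.6519, np.nan],
    'Longitude': [-79.3874, np.nan],
})
# Hypothetical fallback source (M5G values are placeholders).
reference = pd.DataFrame({
    'PostalCode': ['M5G', 'M7R'],
    'Latitude': [43.65, 43.636966],
    'Longitude': [-79.39, -79.615819],
})

ref = reference.set_index('PostalCode')
for col in ['Latitude', 'Longitude']:
    # map() looks up each postal code in the reference; fillna() uses the
    # looked-up value only where the primary value is missing.
    primary[col] = primary[col].fillna(primary['PostalCode'].map(ref[col]))

print(primary)
```

Rows that already had coordinates keep them, so the fallback source never overrides the primary one.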


In [38]:
# Sanity check: compare with the list of postal codes given in this assignment
cleaned_df[cleaned_df['PostalCode'].isin(['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A'])]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.7976,-79.227
8,M1R,Scarborough,"Maryvale, Wexford",43.7293,-79.3038
18,M4B,East York,"Parkview Hill, Woodbine Gardens",43.6979,-79.2986
24,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.6369,-79.3505
35,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.6525,-79.3686
56,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.73,-79.5542
59,M1J,Scarborough,Scarborough Village,43.7315,-79.246
63,M2H,North York,Hillcrest Village,43.7895,-79.3735
76,M4G,East York,Leaside,43.6918,-79.3708
79,M4M,East Toronto,Studio District,43.6505,-79.3369


<a id='22'> Using csv</a>  
The values do not exactly match the data provided by the Coursera assignment, so I will use the `downloaded_df` values instead:

In [39]:
# let's create a new dataframe and drop the Latitude and Longitude columns from the dataframe
cleaned_df2 = cleaned_df.copy()
cleaned_df2.drop(['Latitude','Longitude'],axis=1,inplace=True)
cleaned_df2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"West Hill, Morningside, Guildwood"
3,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
4,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"


In [40]:
# merge cleaned_df2 with downloaded_df to add latitude/longitude for each PostalCode
cleaned_df2 = cleaned_df2.join(downloaded_df.set_index('PostalCode'), on='PostalCode')

<a id='23'>Latitude Longitude DataFrame - Final</a>

In [41]:
# Sanity check: compare with the list of postal codes given in this assignment
cleaned_df2[cleaned_df2['PostalCode'].isin(['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A'])]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
8,M1R,Scarborough,"Maryvale, Wexford",43.750072,-79.295849
18,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
24,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
35,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442
56,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
59,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
63,M2H,North York,Hillcrest Village,43.803762,-79.363452
76,M4G,East York,Leaside,43.70906,-79.363452
79,M4M,East Toronto,Studio District,43.659526,-79.340923


These values are the same as the ones given in the Coursera assignment. 
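
To put a number on how far the pypostalcode coordinates are from the CSV coordinates, the great-circle distance between the two points can be computed with the haversine formula. This is a sketch, assuming a spherical Earth with a mean radius of 6371 km; `haversine_km` is an illustrative helper, and the inputs are the two M1B rows from the sanity-check tables in this section:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# M1B according to pypostalcode versus M1B according to the CSV file.
gap_km = haversine_km(43.7976, -79.227, 43.806686, -79.194353)
print(round(gap_km, 2))
```

For M1B the two sources differ by a few kilometres, which is large enough to matter when searching for venues around a neighborhood.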

In [42]:
# Show the whole dataframe:
cleaned_df2

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"West Hill, Morningside, Guildwood",43.763573,-79.188711
3,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
4,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
5,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
6,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
7,M1P,Scarborough,"Wexford Heights, Dorset Park, Scarborough Town...",43.757410,-79.273304
8,M1R,Scarborough,"Maryvale, Wexford",43.750072,-79.295849
9,M1T,Scarborough,"Clarks Corners, Sullivan, Tam O'Shanter",43.781638,-79.304302


Let's save the `cleaned_df2` DataFrame so that I can use it for the third assignment.

In [43]:
cleaned_df2.to_csv('toronto_lat_long.csv',index=False)
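
Before relying on the saved file, it can be worth reading it back to confirm that nothing was lost. The sketch below does the round trip through an in-memory buffer instead of a file on disk, using a two-row illustrative frame; `float_precision='round_trip'` is passed so that the coordinates are parsed back as exactly as possible:

```python
import io

import pandas as pd

# Illustrative two-row frame with the same columns as cleaned_df2.
frame = pd.DataFrame({
    'PostalCode': ['M1B', 'M2H'],
    'Borough': ['Scarborough', 'North York'],
    'Neighborhood': ['Rouge, Malvern', 'Hillcrest Village'],
    'Latitude': [43.806686, 43.803762],
    'Longitude': [-79.194353, -79.363452],
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # index=False: no extra index column is written
buffer.seek(0)

restored = pd.read_csv(buffer, float_precision='round_trip')
print(list(restored.columns))
print(restored['Neighborhood'].tolist())
```

The comma inside "Rouge, Malvern" survives because `to_csv` quotes fields that contain the delimiter.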