# Capstone Project - Segmenting and Clustering Neighborhoods in Toronto (Week 3)
### Applied Data Science Capstone 
*by Lim*   
*on 31st December 2019*

## Table of contents
* [Introduction](#introduction)
* [1. Data Scraping from Website (Question 1)](#data)
* [2. Coordinates of Neighborhoods (Question 2)](#coordinates)
* [3. Exploration and Clustering of the Neighborhoods in Toronto (Question 3)](#explore)

---

## Introduction: <a name="introduction"></a>

This notebook consists of three parts:
1. Data collection, using the data available at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 
2. Finding the coordinates of the neighborhoods, which utilises the Foursquare API and the Geocoder Python package
3. Exploration of the neighborhoods in Toronto, which analyses and clusters them based on their similarity.

## 1. Data Scraping from Website: <a name="data"></a>

Two libraries, [requests](https://realpython.com/python-requests/) and [BeautifulSoup4](https://beautiful-soup-4.readthedocs.io/en/latest/), are required to scrape and process the data from the website. The data will ultimately be saved as a local copy in .csv format, so the libraries needed for that are imported into the notebook as well:

In [1]:
# Install the beautifulsoup4 library if it is not installed yet
import sys
!{sys.executable} -m pip install beautifulsoup4



In [2]:
# Install the third-party lxml parser (optional; Python has a built-in HTML parser)
!{sys.executable} -m pip install lxml



In [3]:
# Import libraries
from bs4 import BeautifulSoup
import requests # library to handle requests

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # for data analysis

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes  
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

# import csv library 
import csv

Solving environment: | ^C
failed

CondaError: KeyboardInterrupt

Libraries imported.


The data is available on the Wikipedia page (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) as given in the assignment.

We start scraping by assigning the **URL** and fetching its content with **requests.get**.

In [4]:
# Assign URL
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

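# use 'requests' to get the website content
try:
    response = requests.get(url)
    response.raise_for_status()  # treat HTTP error codes (4xx/5xx) as failures too
    print("Website responded successfully")
except requests.exceptions.RequestException:
    print("Error occurred, failed to get data. Please check.")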


Website responded successfully


Next, we apply the HTML (or lxml) parser to the response obtained from the website, using the **BeautifulSoup** module.

In [5]:
# Apply beautifulsoup module and html parser
soup_html = BeautifulSoup(response.text, 'html.parser')

# Check the first 500 characters of the parsed text (it starts with the page title)
print(soup_html.text[0:500])





List of postal codes of Canada: M - Wikipedia
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"Xg1lggpAAEMAAHzLN7wAAADJ","wgCSPNonce"


Inspecting the elements of the Wikipedia page shows that the table has *class="wikitable sortable"*, as seen in the screenshot below.

![Screenshot of wikipedia page, inspect element](https://raw.githubusercontent.com/ahdelim/IBM_Coursera_Capstone_Course8-9/master/Inspect_element_week3.png)

Hence we can run the following command to find the table.


In [6]:
# use .find to look for the 'table' from the html file
result_table = soup_html.find('table', class_='wikitable sortable')
result_table.text[0:200]

'\n\nPostcode\nBorough\nNeighborhood\n\n\nM1A\nNot assigned\nNot assigned\n\n\nM2A\nNot assigned\nNot assigned\n\n\nM3A\nNorth York\nParkwoods\n\n\nM4A\nNorth York\nVictoria Village\n\n\nM5A\nDowntown Toronto\nHarbourfront\n\n\nM6A\nN'

The table is found and assigned to a variable.
Every row of the table is enclosed in `<tr>` and `</tr>`, so we extract the rows by running the following cell.

Note that the text of each cell is followed by a newline ('\n'), so we apply **.split** while assigning the rows to a list.

In [7]:
# Search for all rows from the table using .find_all('tr')
result_table_rows = result_table.find_all('tr')

# Initialize a list 
result_text = []

# Assign the rows to the list
for text_row in result_table_rows:
    result_text.append(text_row.text.split('\n'))

# To check the first 10 elements in the list
for i in range(0,10):
    print(result_text[i])

len(result_text) # check the total number of rows (including the header row)

['', 'Postcode', 'Borough', 'Neighborhood', '']
['', 'M1A', 'Not assigned', 'Not assigned', '']
['', 'M2A', 'Not assigned', 'Not assigned', '']
['', 'M3A', 'North York', 'Parkwoods', '']
['', 'M4A', 'North York', 'Victoria Village', '']
['', 'M5A', 'Downtown Toronto', 'Harbourfront', '']
['', 'M6A', 'North York', 'Lawrence Heights', '']
['', 'M6A', 'North York', 'Lawrence Manor', '']
['', 'M7A', 'Downtown Toronto', "Queen's Park", '']
['', 'M8A', 'Not assigned', 'Not assigned', '']


288
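
The split step can be reproduced on a single row to see why every list starts and ends with an empty string. This is only a small sketch: `row_text` is typed in by hand to mimic what `text_row.text` returns for one table row, it is not scraped.

```python
# Hand-written stand-in for the text of one <tr> element (assumption, not scraped)
row_text = '\nM3A\nNorth York\nParkwoods\n'

# Splitting on '\n' leaves an empty string before the first and after the last cell
cells = row_text.split('\n')
print(cells)

# Keeping elements 1..3 drops the empty strings and leaves the three table cells
postal_code, borough, neighborhood = cells[1:4]
print(postal_code, borough, neighborhood)
```

This is why the later cells slice each row with indices 1 to 3 instead of using the whole list.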

The first element holds the headings of the DataFrame, while the remaining elements hold the values.

The following commands will sort out the values and assign them accordingly to a DataFrame.

In [8]:
# Assign the data frame headings to a separate list
headings = result_text[0][1:4]
headings[0] = 'PostalCode' # same as the headings given in the assignment
headings

['PostalCode', 'Borough', 'Neighborhood']

- ### Create a new data frame for this table

In [9]:
# Assign data to list
data_element = []
for row in range(1, len(result_text)):
    data_element.append([result_text[row][1], result_text[row][2], result_text[row][3].rstrip('\n')])

# Create DataFrame using the list 
df_toronto = pd.DataFrame(data_element, columns = headings)
df_toronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


We may ignore the rows without a Borough assigned, i.e. those with "Not assigned" in the Borough column.

A total of 77 rows have 'Not assigned' as their Borough:

In [10]:
df_toronto.Borough.value_counts()

Not assigned        77
Etobicoke           44
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

- ### Drop rows whose Borough is 'Not assigned'

In [11]:
# Keep only the rows with an assigned Borough; reset_index returns a fresh, independent DataFrame
df_toronto_dropNA = df_toronto[df_toronto.Borough != 'Not assigned'].reset_index(drop=True)
df_toronto_dropNA.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Downtown Toronto | Queen's Park |
| 6 | M9A | Queen's Park | Not assigned |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |


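- ### Replace a Neighborhood value of 'Not assigned' with its Borough name

In [12]:
# Select the rows whose Neighborhood is 'Not assigned' and copy the Borough name into them.
# Writing through .loc updates the DataFrame itself; assigning to the rows yielded by
# iterrows() is not guaranteed to write back.
not_assigned = df_toronto_dropNA['Neighborhood'] == 'Not assigned'
df_toronto_dropNA.loc[not_assigned, 'Neighborhood'] = df_toronto_dropNA.loc[not_assigned, 'Borough']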

df_toronto_dropNA.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Downtown Toronto | Queen's Park |
| 6 | M9A | Queen's Park | Queen's Park |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |


- ### Group the neighborhoods that share a postal code

In [13]:
# group neighborhoods with the same postal code into one comma-separated row
# df_toronto_grouped = df_toronto_dropNA.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ', '.join(x))
df_toronto_grouped = df_toronto_dropNA.groupby("PostalCode").agg({"Borough":"first", "Neighborhood": ', '.join}).reset_index()
df_toronto_grouped.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
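
The grouping step can be tried on a handful of rows to see what each aggregation contributes. The sketch below uses three rows copied from the tables above; the names `toy` and `toy_grouped` are only for illustration.

```python
import pandas as pd

# A few rows taken from the tables above: two neighborhoods share postal code M1B
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per postal code: keep the first Borough, join the Neighborhood names
toy_grouped = (
    toy.groupby("PostalCode")
       .agg({"Borough": "first", "Neighborhood": ", ".join})
       .reset_index()
)
print(toy_grouped)
```

Taking `"first"` for the Borough assumes that a postal code never spans two boroughs, which appears to hold for this table.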


- ### Test whether the DataFrame was grouped and cleaned correctly
  - #### by comparing it with the table given in the question

In [14]:
# Create a DataFrame for testing: pick the rows for the selected postal codes, in the listed order
check_postalcode = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
check_df = pd.concat([df_toronto_grouped[df_toronto_grouped["PostalCode"] == code] for code in check_postalcode])

check_df.reset_index(drop=True)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M5G | Downtown Toronto | Central Bay Street |
| 1 | M2H | North York | Hillcrest Village |
| 2 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 3 | M1J | Scarborough | Scarborough Village |
| 4 | M4G | East York | Leaside |
| 5 | M4M | East Toronto | Studio District |
| 6 | M1R | Scarborough | Maryvale, Wexford |
| 7 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
| 8 | M9L | North York | Humber Summit |
| 9 | M5V | Downtown Toronto | CN Tower, Bathurst Quay, Island airport, Harbo... |


- ### Print the size of the cleaned dataframe

In [15]:
# The shape of the cleaned and grouped DataFrame
df_toronto_grouped.shape

(103, 3)

## 2. Coordinates of Neighborhoods in Toronto: <a name="coordinates"></a>

In [None]:
# Download the Geographical Coordinates from the given url in Coursera
!wget -q -O 'Geospatial_Coordinates.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

In [None]:
df_coord = pd.read_csv("Geospatial_Coordinates.csv")
df_coord.head()

- ### Rename the 'Postal Code' column to 'PostalCode' so that the column name is consistent

In [None]:
df_coord.rename(columns={"Postal Code":"PostalCode"}, inplace=True)
df_coord.head()

- ### Merge two DataFrames using 'PostalCode' as the key

In [None]:
df_toronto_new = pd.merge(df_toronto_grouped, df_coord, on="PostalCode")
df_toronto_new.head()
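
Because the default merge is an inner join, a postal code without coordinates would silently drop out of the result. The following sketch shows one way to check for that on toy data; the two small frames and their coordinate values are made up and only stand in for `df_toronto_grouped` and `df_coord`.

```python
import pandas as pd

# Toy frames (hypothetical values) standing in for df_toronto_grouped and df_coord
left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1E"],
                     "Borough": ["Scarborough"] * 3})
right = pd.DataFrame({"PostalCode": ["M1B", "M1C"],
                      "Latitude": [43.8, 43.7],
                      "Longitude": [-79.2, -79.1]})

# how="left" keeps every postal code; indicator=True adds a "_merge" column;
# validate="one_to_one" raises if a postal code is duplicated on either side
merged = pd.merge(left, right, on="PostalCode", how="left",
                  indicator=True, validate="one_to_one")

# Postal codes that found no coordinates
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

If `missing` is empty for the real data, the inner join used in the notebook loses no rows.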

- ### Test whether the new DataFrame is the same as the one in Question 2

In [None]:
# Create a DataFrame for testing: pick the rows for the selected postal codes, in the listed order
check_postalcode = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
check_df_coord = pd.concat([df_toronto_new[df_toronto_new["PostalCode"] == code] for code in check_postalcode])

check_df_coord.reset_index(drop=True)

- ### Check the shape of the new DataFrame (with coordinates)

In [None]:
df_toronto_new.shape

## 3. Explore and Cluster the Neighborhoods in Toronto: <a name="explore"></a>

- ### Get the coordinates of Toronto using the geopy library

In [None]:
address = 'Toronto'

geolocator = Nominatim(user_agent="trt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

- ### Create a map of Toronto with the neighborhoods superimposed on top

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto_new['Latitude'], df_toronto_new['Longitude'], df_toronto_new['Borough'], df_toronto_new['Neighborhood']):
    label = '{}, {}, {}, {}'.format(neighborhood, borough, lat, lng)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

- ### Focus the analysis on the Boroughs whose name contains the word 'Toronto'
  - This includes 'East Toronto, Central Toronto, West Toronto, Downtown Toronto'
  - We select these from the Borough column

In [None]:
# Initialize a borough list
Borough_Toronto_filtered = []

# Loop through the unique values of the Borough column and keep those containing the word 'Toronto'
for borough in df_toronto_new["Borough"].unique():
    if "toronto" in borough.lower():
        Borough_Toronto_filtered.append(borough)
        
# The Boroughs with the name Toronto
Borough_Toronto_filtered
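
The same selection can also be written without an explicit loop. The sketch below applies pandas' vectorised string matching to a hand-typed Series of borough names taken from the table earlier in the notebook; it is an alternative formulation, not what the cell above runs.

```python
import pandas as pd

# Borough names as they appear in the scraped table
boroughs = pd.Series(["North York", "Downtown Toronto", "Scarborough",
                      "East Toronto", "West Toronto", "Central Toronto", "York"])

# Case-insensitive substring test, evaluated for the whole column at once
has_toronto = boroughs.str.contains("toronto", case=False)
toronto_boroughs = boroughs[has_toronto].tolist()
print(toronto_boroughs)
```

The boolean mask `has_toronto` could be used directly to filter a DataFrame, which would make the intermediate list unnecessary.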

- ### Create a DataFrame with only the filtered Boroughs

In [None]:
# Keep only the filtered Boroughs; reset_index returns a fresh, independent DataFrame
df_borough_toronto = df_toronto_new[df_toronto_new["Borough"].isin(Borough_Toronto_filtered)].reset_index(drop=True)
print(df_borough_toronto.shape)
df_borough_toronto.head()

- ### Plot the map again with the filtered Boroughs

In [None]:
# create map of Toronto using latitude and longitude values
map_borough_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_borough_toronto['Latitude'], df_borough_toronto['Longitude'], df_borough_toronto['Borough'], df_borough_toronto['Neighborhood']):
    label = '{}, {}, {}, {}'.format(neighborhood, borough, lat, lng)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_borough_toronto)  
    
map_borough_toronto

- ### Explore the Neighborhoods using the Foursquare API
  - Import the Foursquare ID and Secret
  - Use requests to get and explore the top 100 venues near each neighborhood (within 500 metres) 

In [None]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# The variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are an assumed convention.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

# Avoid printing the secret into the notebook output
print('Your credentials are loaded.')

In [None]:
RADIUS = 500
LIMIT = 100

venues_list = []

for lat, lng, postal, borough, neighborhood in zip(df_borough_toronto['Latitude'], df_borough_toronto['Longitude'], df_borough_toronto['PostalCode'], 
                                                   df_borough_toronto['Borough'], df_borough_toronto['Neighborhood']):
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        lat, 
        lng, 
        VERSION, 
        RADIUS, 
        LIMIT)
    
    results = requests.get(url).json()['response']['groups'][0]['items']
    
    venues_list.append([(
            postal, 
            borough, 
            neighborhood,
            lat,
            lng,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    
nearby_venues = pd.DataFrame([item for venue_in_list in venues_list for item in venue_in_list])
nearby_venues.columns = ['PostalCode', 
                  'Borough', 
                  'Neighborhood', 
                  'Latitude',
                  'Longitude',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

print(nearby_venues.shape)  # Size of the nearby venues
nearby_venues.head() # Display the DataFrame of nearby venues
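
The list comprehension above indexes `categories[0]` directly, so it would fail if a venue were returned with an empty category list. The sketch below shows a more defensive way to flatten the same nesting (`response` → `groups` → `items` → `venue`). The helper name `extract_venues` and the mock payload are hypothetical and written by hand; no request is sent.

```python
def extract_venues(payload):
    """Flatten an explore-style response into (name, lat, lng, category) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0]['items'] if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None instead of failing on an empty category list
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written mock payload (hypothetical venues, same nesting as used above)
mock = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Spot B', 'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}

print(extract_venues(mock))
print(extract_venues({'response': {}}))
```

A response without any groups yields an empty list, so a neighborhood with no venues would simply contribute no rows.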

In [None]:
# Perform random check on one of the neighborhoods
nearby_venues[nearby_venues["Neighborhood"] == "Roselawn"]

In [None]:
# Check how many venues were returned for each neighborhood 
nearby_venues.groupby(["PostalCode", "Borough", "Neighborhood"]).count()

- ### Find out how many unique categories there are and what they are

In [None]:
print('There are {} unique categories.'.format(len(nearby_venues['Venue Category'].unique())))
nearby_venues['Venue Category'].unique()[0:20]

- ### Analyse each of the neighborhoods

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(nearby_venues[['Venue Category']], prefix="", prefix_sep="")

# add PostalCode, Borough and Neighborhood back to DataFrame
toronto_onehot['PostalCode'] = nearby_venues['PostalCode'] 
toronto_onehot['Borough'] = nearby_venues['Borough'] 
toronto_onehot['Neighborhoods'] = nearby_venues['Neighborhood'] 

# move the added columns to the front
fixed_columns = list(toronto_onehot.columns[-3:]) + list(toronto_onehot.columns[:-3])
toronto_onehot = toronto_onehot[fixed_columns]
print("There are {} rows and {} columns".format(toronto_onehot.shape[0], toronto_onehot.shape[1]))
toronto_onehot.head(10)

- ### Group the rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
df_toronto_grouped = toronto_onehot.groupby(['PostalCode', 'Borough', 'Neighborhoods']).mean().reset_index()
print("The new size: ", df_toronto_grouped.shape)
df_toronto_grouped
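
To see why the mean of the one-hot columns is a frequency, the two steps can be run on a tiny made-up venue list. The frame `venues` and the area names "A" and "B" below are hypothetical; `dtype=float` is passed so that the indicators are plain 0/1 numbers.

```python
import pandas as pd

# Toy venue list (hypothetical): area "A" has three venues, area "B" has one
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park"],
})

# One indicator column per category; dtype=float so the mean is a plain fraction
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the 0/1 indicators is the share of venues in each category
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts become comparable before clustering.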

- ### Find the top 10 venues and save into DataFrame

In [None]:
# Function definition - to sort venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[3:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode', 'Borough', 'Neighborhood']
for index in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(index+1, indicators[index]))
    except IndexError:
        # positions beyond the third have no entry in 'indicators' and take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(index+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = df_toronto_grouped['PostalCode']
neighborhoods_venues_sorted['Borough'] = df_toronto_grouped['Borough']
neighborhoods_venues_sorted['Neighborhood'] = df_toronto_grouped['Neighborhoods']
    
for index in np.arange(df_toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[index, 3:] = return_most_common_venues(df_toronto_grouped.iloc[index, :], num_top_venues)
    
neighborhoods_venues_sorted.head()
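
The column labels built above can also be generated with a small helper instead of relying on an exception. The function `ordinal` below is a hypothetical helper, shown only as a sketch; it also covers numbers such as 11 or 21, which the notebook does not need for its ten columns.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['PostalCode', 'Borough', 'Neighborhood']
columns += ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)]
print(columns[3:6])
```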

- ### Clustering Neighborhoods

In [None]:
# set number of clusters
kclusters = 8

toronto_grouped_clustering = df_toronto_grouped.drop(columns=['PostalCode','Borough','Neighborhoods'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:] 
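
To make concrete what the k-means step does with the frequency table, the following is a bare-bones NumPy version of the basic iteration (assign each point to its nearest center, then move each center to the mean of its points) on toy 2-D data. It is not scikit-learn's implementation, which by default uses k-means++ initialisation; the function name `simple_kmeans` and the data are made up for illustration.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=20):
    """Plain Lloyd iterations, seeded with the first k rows of X."""
    centers = X[:k].astype(float).copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points (left in place if it has none)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated toy blobs
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.2], [5.1, 4.9], [0.2, 0.1], [4.9, 5.2]])
labels, centers = simple_kmeans(X, k=2)
print(labels)
```

In the notebook each row is a neighborhood and each column a venue-category frequency, so "distance" measures how different two neighborhoods' venue mixes are.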

- ### Create new DataFrame that includes cluster and the top 10 venues

In [None]:
# add clustering labels
# neighborhoods_venues_sorted.drop(['Cluster Labels'], axis=1, inplace=True)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_borough_toronto

# join the cluster labels and top venues onto df_borough_toronto, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.drop(columns=["Borough", "Neighborhood"]).set_index('PostalCode'), on='PostalCode')

print(toronto_merged.shape)
toronto_merged.head() # check the last columns!

- ### Create map for the clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster+1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

- ### Examine the Clusters

#### Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 5

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 6

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 7

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 6, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 8

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 7, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
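
The eight cells above differ only in the label they filter on, so the same inspection can be sketched as a loop. The frame `toy_merged` and the helper `cluster_members` below are hypothetical stand-ins for `toronto_merged` and the repeated `.loc` expression.

```python
import pandas as pd

# Toy stand-in for toronto_merged (hypothetical rows and a single venue column)
toy_merged = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Cluster Labels": [0, 1, 0],
    "1st Most Common Venue": ["Cafe", "Park", "Cafe"],
})

def cluster_members(df, label):
    """Rows of df assigned to the given cluster label."""
    return df.loc[df["Cluster Labels"] == label]

# One pass over the labels that actually occur, in sorted order
for label in sorted(toy_merged["Cluster Labels"].unique()):
    members = cluster_members(toy_merged, label)
    print("Cluster", label + 1, "->", members["Neighborhood"].tolist())
```

Looping over the labels that occur also avoids empty cells when the number of clusters is changed.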