### PART 1 Web scraping

Pandas is used to read the table in this data scraping task.

The data is processed according to the assignment instructions:
- The dataframe consists of three columns: PostalCode, Borough, and Neighborhood
- Rows whose [Borough] is 'Not assigned' are removed
- All '/' are replaced with ',' in the [Neighborhood] column
- A [Neighborhood] value of 'Not assigned' is replaced with the value of the corresponding [Borough].
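
The cleaning rules above can be sketched on a small hand-made table. The rows below are illustrative stand-ins, not the scraped Wikipedia data, and the variable names `toy` and `missing` are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the scraped table (made-up values, for illustration only)
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Regent Park / Harbourfront', 'Not assigned'],
})

# 1. remove rows whose Borough is 'Not assigned'
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. replace ' /' separators with ',' (plain string match, not a regex)
toy['Neighborhood'] = toy['Neighborhood'].str.replace(' /', ',', regex=False)

# 3. where Neighborhood is 'Not assigned', fall back to the Borough value
missing = toy['Neighborhood'] == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

toy = toy.reset_index(drop=True)
print(toy)
print(toy.shape)
```

Filtering with a boolean mask and `.copy()` avoids modifying a slice in place, which is one common source of pandas warnings in this kind of pipeline.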

The output of this part is the shape of the dataframe.

In [11]:
import requests
import pandas as pd
import numpy as np

In [14]:
# scrape the table from the website provided
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url).content
df_table = pd.read_html(html)[0]

# update column names
df_table.columns=['PostalCode','Borough','Neighborhood']
df_table=df_table.iloc[1:]

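# data pre-processing
# drop rows without an assigned borough
df_table = df_table[df_table['Borough'] != 'Not assigned'].copy()
# combine neighborhoods that share a postal code (the result was previously discarded)
df_table = df_table.groupby(['PostalCode', 'Borough'], as_index=False, sort=False)['Neighborhood'].agg(', '.join)
# a neighborhood that is 'Not assigned' takes the name of its borough
not_assigned = df_table['Neighborhood'] == 'Not assigned'
df_table.loc[not_assigned, 'Neighborhood'] = df_table.loc[not_assigned, 'Borough']
# replace the ' /' separators with ',' (plain string match, not a regex)
df_table['Neighborhood'] = df_table['Neighborhood'].str.replace(' /', ',', regex=False)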

# update index
df_table=df_table.reset_index(drop = True)

df_table

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [13]:
print(df_table.shape)
# df_table
# df_table.to_csv('NeighbourhoodToronto.csv')

AttributeError: 'NoneType' object has no attribute 'shape'

### PART 2 Latitude and longitude coordinates



In [73]:
# import geocoder

# latitude = []
# longitude = []
# for postal_code in df_table['PostalCode']:
#     # initialize your variable to None
#     lat_lng_coords = None

#     # loop until you get the coordinates
#     while(lat_lng_coords is None):
#         g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#         lat_lng_coords = g.latlng

#     latitude.append(lat_lng_coords[0])
#     longitude.append(lat_lng_coords[1])

# df_table['latitude'] = latitude
# df_table['longitude'] = longitude
# df_table

In [74]:
df_geo = pd.read_csv('/Users/xiao/Documents/PProjects/Coursera Applied Data Science Capstone/Coursera_Capstone/Geospatial_Coordinates.csv')
df_table=df_table.join(df_geo.set_index('Postal Code'),on='PostalCode')
df_table

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


### PART 3 Explore and cluster the neighborhoods in Toronto

This analysis lets users choose which Toronto neighborhoods to explore and cluster simply by modifying the variable 'index'.

The Foursquare API is used to explore the neighborhoods. The getNearbyVenues function returns the name, coordinates, and category of up to LIMIT venues within a given radius (500 m by default) of each selected neighborhood.
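
A minimal sketch of how one such request could be assembled and its response flattened, without touching the network. The `sample` dictionary is hand-written to mimic the response shape that getNearbyVenues indexes into in this notebook; `build_explore_url` and `extract_venues` are hypothetical helper names, and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=30):
    """Assemble a venues/explore URL with URL-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

def extract_venues(name, lat, lng, response_json):
    """Flatten one response into (neighborhood, lat, lng, venue, venue lat, venue lng, category) tuples."""
    items = response_json['response']['groups'][0]['items']
    return [(name, lat, lng,
             v['venue']['name'],
             v['venue']['location']['lat'],
             v['venue']['location']['lng'],
             v['venue']['categories'][0]['name']) for v in items]

# Hand-written stand-in for a response (shape assumed from the indexing used in this notebook)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
]}]}}

url = build_explore_url('PLACEHOLDER_ID', 'PLACEHOLDER_SECRET', '20180604', 43.65426, -79.360636)
rows = extract_venues('Regent Park, Harbourfront', 43.65426, -79.360636, sample)
print(url)
print(rows)
```

Separating URL construction from parsing makes it possible to check the parsing logic against a saved response before spending API quota.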

The analysis then groups the selected neighborhoods into clusters with the k-means algorithm, using the frequency of each venue category in a neighborhood as its features.
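
The idea can be sketched end to end on made-up data: each neighborhood becomes a vector of venue-category frequencies, and k-means groups those vectors. The notebook itself relies on scikit-learn's KMeans; the hand-written `kmeans` function below is a simplified Lloyd's-algorithm stand-in with a fixed starting point, and all names and venue lists are hypothetical.

```python
from collections import Counter

# Made-up venue categories per neighborhood (illustration only)
venues = {
    'A': ['Coffee Shop', 'Coffee Shop', 'Bakery', 'Park'],
    'B': ['Coffee Shop', 'Bakery', 'Coffee Shop', 'Coffee Shop'],
    'C': ['Airport Lounge', 'Airport Gate', 'Airport Lounge', 'Airport Service'],
}

categories = sorted({c for cats in venues.values() for c in cats})

def frequency_vector(cats):
    """Share of each category among a neighborhood's venues (the mean of its one-hot rows)."""
    counts = Counter(cats)
    return [counts[c] / len(cats) for c in categories]

names = sorted(venues)
features = [frequency_vector(venues[n]) for n in names]

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, n_iter=10):
    """Simplified Lloyd's algorithm starting from the given centroids."""
    labels = []
    for _ in range(n_iter):
        # assignment step: nearest centroid for every point
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # update step: move each centroid to the mean of its members
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# fixed start: the first and last neighborhoods serve as initial centroids
labels = kmeans(features, [list(features[0]), list(features[-1])])
print(dict(zip(names, labels)))
```

Because every frequency vector sums to 1, neighborhoods with different numbers of venues remain comparable.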

The selected neighborhoods and their clusters are visualised on a map of Toronto using the Folium library.

In [75]:
import certifi
import ssl
import geopy.geocoders
from geopy.geocoders import Nominatim
ctx = ssl.create_default_context(cafile=certifi.where())
geopy.geocoders.options.default_ssl_context = ctx

import folium

import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans


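import os

# read the Foursquare credentials from the environment instead of hard-coding them
# (the variable names below are an assumed convention, not part of the Foursquare API)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret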
VERSION = '20180604'
LIMIT = 30

In [76]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
toronto_latitude = location.latitude
toronto_longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_latitude, toronto_longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [77]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_table['Latitude'], df_table['Longitude'], df_table['Borough'], df_table['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

# map_toronto

In [78]:
# specified areas
index = df_table['PostalCode'].str.contains('M5')

index_latitude = df_table.loc[index, 'Latitude'].values # neighborhood latitude value
index_longitude = df_table.loc[index, 'Longitude'].values # neighborhood longitude value
index_name = df_table.loc[index, 'Neighborhood'].values # neighborhood name

# for name, lat, lng in zip(index_name, index_latitude, index_longitude):
#         print('Latitude and longitude values of {} are {}, {}.'.format(name,lat,lng))

In [79]:
# get nearby venues for all selected neighborhoods
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

index_venues = getNearbyVenues(names=index_name,
                                   latitudes=index_latitude,
                                   longitudes=index_longitude
                                  )

print('done')

Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Bedford Park, Lawrence Manor East
Roselawn
Forest Hill North & West
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
done


In [80]:
index_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


In [81]:
# index_venues.groupby('Neighborhood').count()

In [82]:
# one hot encoding
index_onehot = pd.get_dummies(index_venues[['Venue Category']], prefix="", prefix_sep="")

# drop the 'Neighborhood' venue category, if present, so it does not clash with the neighborhood name column
index_onehot = index_onehot.drop(columns=['Neighborhood'], errors='ignore')
# add neighborhood column back to dataframe
index_onehot['Neighborhood'] = index_venues['Neighborhood']
# move neighborhood column to the first column
fixed_columns = [index_onehot.columns[-1]] + list(index_onehot.columns[:-1])
index_onehot = index_onehot[fixed_columns]

index_onehot.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,Art Gallery,...,Tea Room,Thai Restaurant,Theater,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [88]:
index_grouped = index_onehot.groupby('Neighborhood').mean().reset_index()
# set number of clusters
kclusters = 3
index_grouped_clustering = index_grouped.drop(columns='Neighborhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(index_grouped_clustering)
# add clustering labels
index_grouped.insert(0, 'Cluster Labels', kmeans.labels_)
index_grouped.head()

Unnamed: 0,Cluster Labels,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,...,Tea Room,Thai Restaurant,Theater,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,1,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,...,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
2,1,"CN Tower, King and Spadina, Railway Lands, Har...",0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.033333
4,1,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [87]:
index_merged = index_venues.join(index_grouped.set_index('Neighborhood'), on='Neighborhood')
index_merged.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Cluster Labels,Airport,Airport Food Court,...,Tea Room,Thai Restaurant,Theater,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery,1,0.0,0.0,...,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.033333
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop,1,0.0,0.0,...,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.033333
2,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot,1,0.0,0.0,...,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.033333
3,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center,1,0.0,0.0,...,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.033333
4,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa,1,0.0,0.0,...,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.033333


In [86]:
# create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(index_merged['Neighborhood Latitude'], index_merged['Neighborhood Longitude'], index_merged['Neighborhood'], index_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters