# Capstone - Webscrape Wikipedia for Table

This notebook scrapes a Wikipedia page to extract a table, using an HTTP GET request (via `requests`) and BeautifulSoup.

The desired table should look like the following Table.
<img src = "6b. Desired Table.jpg" width = 500>


In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## 0) Webscrape Table from Wikipedia Page

In [2]:
#send the GET request and parse the returned HTML (the page is HTML, so use an HTML parser rather than 'xml')
html_data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(html_data.content,'html.parser')

#locate the table: inspect the table element on the page to identify its tag and class
table = soup.find('table',{'class':'wikitable sortable'})

In [3]:
#Find the Table Headers
table_headers = table.find_all('th')
list_headers = []
for header in table_headers:
    list_headers.append(header.text.strip()) #strip can remove the '\n'
print(list_headers)

#Grab the data for each row
table_rows = table.find_all('tr')
table_data = []
for row in table_rows:
    table_data.append([cell.text.strip() for cell in row.find_all('td')])

['Postal Code', 'Borough', 'Neighbourhood']
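
The header and row extraction above can be tried without a network connection. The sketch below runs the same `find` / `find_all` steps on a small inline HTML string; `SAMPLE_HTML` and `parse_table` are illustrative names that do not appear in the notebook, and the markup is a made-up stand-in for the Wikipedia table.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table (made-up markup, two data rows).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_table(html):
    """Return (headers, rows) for the first 'wikitable sortable' table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    headers = [th.text.strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


headers, rows = parse_table(SAMPLE_HTML)
print(headers)
print(rows)
```

Skipping rows without `<td>` cells here plays the same role as the `dropna` call in the next section, which removes the empty row produced by the header `<tr>`.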


## 1) Pass Data into a Pandas DataFrame and begin Preprocessing

In [4]:
#Create the dataframe from the webscraped data (table_data and list_headers)
df = pd.DataFrame(table_data, columns=list_headers)
df.dropna(axis=0, inplace=True)
df 

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


### *1a) Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.*

In [5]:
#drop all rows where Borough is 'Not assigned'
df_noNotAssigned = df[df['Borough'] != 'Not assigned']
print(df_noNotAssigned.shape)
df_noNotAssigned.head(15)

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
9,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
10,M1B,Scarborough,"Malvern, Rouge"
12,M3B,North York,Don Mills
13,M4B,East York,"Parkview Hill, Woodbine Gardens"
14,M5B,Downtown Toronto,"Garden District, Ryerson"


### *1b) More than one neighborhood can exist in one postal code area.*

The checks below show that this no longer applies to the current Wikipedia page: each postal code already occupies a single row, with its neighbourhoods listed together in one comma-separated cell.

The grouping code is kept anyway, so that if a postal code did appear on several rows, its neighbourhoods would be merged into a single row per postal code and borough.

In [6]:
#Identify rows where Neighbourhood is 'Not assigned' (same spelling as in the table)
#as shown below, there are no rows with Neighbourhood == 'Not assigned'
df_noNotAssigned[df_noNotAssigned['Neighbourhood']=='Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


In [7]:
#Double check manually by exporting to csv and searching for any 'Not assigned'
# df_noNotAssigned.to_csv('df.csv')

In [8]:
#if there were multiple neighborhoods belonging to the same postal code and borough then the following code would
#group the neighborhoods into the same row
df_UniquePostalCode = df_noNotAssigned.groupby(['Postal Code','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df_UniquePostalCode

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [9]:
df_UniquePostalCode.shape

(103, 3)

In [10]:
len(df_noNotAssigned['Postal Code'].unique())
#this shows that after removing the rows where the Borough is 'Not assigned', the number of unique postal codes is 103,
#so the final dataframe, with all neighbourhoods sharing a postal code grouped together,
#should have 103 rows.

103
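
To see what the grouping in 1b would do if the table had one row per neighbourhood, here is a minimal sketch on made-up data. The `long_df` frame is hypothetical; only the `groupby(...).apply(', '.join)` pattern is taken from the notebook.

```python
import pandas as pd

# Hypothetical "long" layout: one row per neighbourhood.
long_df = pd.DataFrame({
    "Postal Code": ["M1B", "M1B", "M3A"],
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Malvern", "Rouge", "Parkwoods"],
})

# Join the neighbourhoods that share a postal code and borough.
merged = (
    long_df.groupby(["Postal Code", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(merged)
```

On data that is already one row per postal code, as here in the notebook, the same call leaves the row count unchanged, which is why the shape stays at 103.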

### 1c) If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [11]:
#For Neighbourhoods that say 'Not assigned' (same spelling as in the table), replace that with the Borough
condition = df_UniquePostalCode['Neighbourhood']=='Not assigned'
df_UniquePostalCode.loc[condition, 'Neighbourhood']=df_UniquePostalCode.loc[condition, 'Borough']

In [12]:
df_UniquePostalCode.shape

(103, 3)
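
The replacement rule in 1c is easy to check on a toy frame. The sketch below uses made-up rows and a case-insensitive comparison, which is one possible way to guard against capitalisation differences in the placeholder text; the `toy` and `mask` names are illustrative only.

```python
import pandas as pd

# Made-up rows: the first has a borough but no assigned neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Regent Park, Harbourfront"],
})

# Compare case-insensitively so 'Not assigned' and 'Not Assigned' both match.
mask = toy["Neighbourhood"].str.strip().str.lower() == "not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```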

#### __________________________________________END OF NOTEBOOK PART I_____________________________________________________

## 2) Adding in the Geospatial data (Latitude and Longitude)
*I could not get the geocoder module to work, so I downloaded the CSV file 'Geospatial_Coordinates.csv' instead.*

In [13]:
# %pip install geocoder

In [14]:
# import geocoder
# # initialize your variable to None
# lat_lng_coords = None
# postal_code= 'M5G'

# # loop until you get the coordinates
# while(lat_lng_coords is None):
#     g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#     lat_lng_coords = g.latlng
# latitude = lat_lng_coords[0]
# longitude = lat_lng_coords[1]

# print(latitude, longitude)

In [15]:
# %pip install wget

In [16]:
import wget

In [17]:
url = 'http://cocl.us/Geospatial_data'
wget.download(url)

'Geospatial_Coordinates.csv'

In [18]:
df_GeoCoord = pd.read_csv('Geospatial_Coordinates.csv')
df_GeoCoord

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [19]:
df_merged = df_UniquePostalCode.join(df_GeoCoord.set_index('Postal Code'),on='Postal Code')
df_merged

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


In [20]:
#just checking when compared to the example given in the assignment
df_merged[df_merged['Postal Code']=='M5G']

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
57,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
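
Beyond spot-checking one postal code, the merge itself can report rows that found no coordinates. The sketch below uses `DataFrame.merge` with its `validate` and `indicator` parameters on made-up frames. It is an alternative to the `join` call used above, not the notebook's own method, and the `M9Z` row is invented to show an unmatched code.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M1B", "M5G", "M9Z"],  # M9Z is a made-up code with no match
    "Borough": ["Scarborough", "Downtown Toronto", "Nowhere"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M5G"],
    "Latitude": [43.806686, 43.657952],
    "Longitude": [-79.194353, -79.387383],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from.
checked = left.merge(coords, on="Postal Code", how="left",
                     validate="one_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```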


#### __________________________________________END OF NOTEBOOK PART II_____________________________________________________

## 3) Segmenting and Clustering Datapoints

In [21]:
#filter dataframe to only consider Boroughs which include the word 'Toronto'
df_TorontoBors = df_merged[df_merged['Borough'].str.contains('Toronto')]
df_TorontoBors.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


In [22]:
df_TorontoBors.shape

(39, 5)

### 3a) Obtain venue information for the neighbourhoods

In [23]:
#access Foursquare API using my CLIENT_ID & CLIENT_SECRET stored as environment variables
#on my computer
import os

In [24]:
CLIENT_ID = os.environ.get('Foursquare_CLIENT_ID') # your Foursquare ID
CLIENT_SECRET = os.environ.get('Foursquare_CLIENT_SECRET') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

# print('Your credentials:')
# print('CLIENT_ID: ' + CLIENT_ID)
# print('CLIENT_SECRET:' + CLIENT_SECRET)
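
`os.environ.get` returns `None` when a variable is missing, and that `None` would then be formatted into the request URL as the text `None`. A minimal sketch of failing early instead is shown below. `require_env` is a hypothetical helper, and `DEMO_FOURSQUARE_ID` is a demo variable set only for this example.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demo only: set a throwaway variable so the example runs anywhere.
os.environ["DEMO_FOURSQUARE_ID"] = "abc123"
demo_id = require_env("DEMO_FOURSQUARE_ID")
print(demo_id)
```

In the notebook this could be used as `CLIENT_ID = require_env('Foursquare_CLIENT_ID')`, assuming the variable names shown in the cell above.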

In [25]:
#Function from IBM Data Science/Course 9/Week 3/ Lab 4a. Neighborhoods New York
#this function will take a series of names, latitudes, and longitudes (series of neighborhood geospatial data)
#and return a dataframe which includes venues (quantity depending on the limit variable) that are near the neighborhood lat & long

def getNearbyVenues(names, latitudes, longitudes, LIMIT=100, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
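
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that comes back with an empty category list. The sketch below separates the parsing step and tolerates missing fields. `extract_venues` is a hypothetical helper, and the `fake` dictionary only imitates the nested shape that the function above reads (`response`, `groups`, `items`, `venue`); the second venue is invented.

```python
def extract_venues(name, lat, lng, response_json):
    """Flatten one explore-style response into rows, tolerating missing fields."""
    groups = response_json.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        location = venue.get("location", {})
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


fake = {"response": {"groups": [{"items": [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.676821, "lng": -79.293942},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Uncategorised Place",  # made-up venue with no category
               "location": {"lat": 43.68, "lng": -79.29},
               "categories": []}},
]}]}}

rows = extract_venues("The Beaches", 43.676357, -79.293031, fake)
print(rows)
```

Keeping the parsing separate from the HTTP call also makes it testable without API credentials.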

In [26]:
Toronto_venues = getNearbyVenues(names=df_TorontoBors['Neighbourhood'],
                                   latitudes=df_TorontoBors['Latitude'],
                                   longitudes=df_TorontoBors['Longitude']
                                  )

In [27]:
Toronto_venues.shape

(1634, 7)

In [28]:
Toronto_venues.head(7)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
5,"The Danforth West, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
6,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


In [29]:
Toronto_venues.groupby('Neighborhood').count().head()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,57,57,57,57,57,57
"Brockton, Parkdale Village, Exhibition Place",25,25,25,25,25,25
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",19,19,19,19,19,19
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",15,15,15,15,15,15
Central Bay Street,63,63,63,63,63,63


### 3b) Analyze Each Neighborhood

In [30]:
Toronto_venues['Neighborhood'].head()

0                     The Beaches
1                     The Beaches
2                     The Beaches
3                     The Beaches
4    The Danforth West, Riverdale
Name: Neighborhood, dtype: object

In [31]:
Toronto_venues['Neighborhood'].shape

(1634,)

In [32]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighborhoods'] = Toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns] 

Toronto_onehot.head()

Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
#Add the Postal Code and Borough columns back in by merging on the Neighborhoods column of Toronto_onehot (left dataframe) against the index of the right dataframe

#first obtain df with Postal Codes & Boroughs & Neighborhoods from df_TorontoBors
df_right = df_TorontoBors[['Postal Code','Borough','Neighbourhood']].set_index('Neighbourhood')
df_right

#Confirm that total number of unique Neighbourhoods match between df_right & df_left=Toronto_onehot
print(len(df_right.index.unique()))
print(len(Toronto_onehot.Neighborhoods.unique()))

#Merge so that new df includes Postal Code, Borough, Neighborhoods, [venues]
Toronto_oneHotMerged=Toronto_onehot.merge(df_right, how='right',left_on='Neighborhoods',right_on='Neighbourhood')

#move Postal Code and Borough to the left of the Neighborhoods
sorted_cols = list(Toronto_oneHotMerged.columns[-2:])+list(Toronto_oneHotMerged.columns[:-2])
Toronto_oneHotMerged = Toronto_oneHotMerged[sorted_cols]

Toronto_oneHotMerged.head()

39
39


Unnamed: 0,Postal Code,Borough,Neighborhoods,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4K,East Toronto,"The Danforth West, Riverdale",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
Toronto_grouped = Toronto_oneHotMerged.groupby(['Postal Code','Borough','Neighborhoods']).sum().reset_index()
Toronto_grouped.head()

Unnamed: 0,Postal Code,Borough,Neighborhoods,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,M4K,East Toronto,"The Danforth West, Riverdale",0,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,1
2,M4L,East Toronto,"India Bazaar, The Beaches West",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4M,East Toronto,Studio District,0,0,0,0,0,0,2,...,0,0,0,0,0,0,1,0,0,1
4,M4N,Central Toronto,Lawrence Park,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
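
The one-hot and grouping steps can be followed on a toy frame. The sketch below uses made-up venues and shows both `sum()` (raw counts, as used above) and `mean()` (share of venues per category). Which of the two suits clustering better is a judgment call: counts let neighbourhoods with many venues dominate, while shares compare the mix of categories.

```python
import pandas as pd

# Made-up venues for two neighbourhoods.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Pub",
                       "Coffee Shop", "Park"],
})

# One column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

counts = onehot.groupby("Neighborhood").sum()   # number of venues per category
shares = onehot.groupby("Neighborhood").mean()  # fraction of venues per category
print(counts)
print(shares)
```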


In [35]:
#Function from IBM Data Science/Course 9/Week 3/ Lab 4a. Neighborhoods New York
#Function takes a row from a dataframe and returns the venues that appear most frequently in descending order
#A certain number of venues are returned (defined by the variable 'num_top_venues')

def return_most_common_venues(row, num_top_venues):
#     print(row)
    row_categories = row.iloc[3:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [36]:
#Now create a NEW dataframe with the top 10 most frequent venues in each neighborhood

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code','Borough','Neighborhoods']
for ind in np.arange(num_top_venues): 
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix left, so use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhoods'] = Toronto_grouped['Neighborhoods']
neighborhoods_venues_sorted['Postal Code'] = Toronto_grouped['Postal Code']
neighborhoods_venues_sorted['Borough'] = Toronto_grouped['Borough']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 3:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postal Code,Borough,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,Neighborhood,Health Food Store,Pub,Trail,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Deli / Bodega,Doner Restaurant
1,M4K,East Toronto,"The Danforth West, Riverdale",Greek Restaurant,Italian Restaurant,Coffee Shop,Ice Cream Shop,Restaurant,Furniture / Home Store,Liquor Store,Indian Restaurant,Spa,Juice Bar
2,M4L,East Toronto,"India Bazaar, The Beaches West",Pet Store,Movie Theater,Burrito Place,Sandwich Place,Fast Food Restaurant,Italian Restaurant,Fish & Chips Shop,Restaurant,Steakhouse,Sushi Restaurant
3,M4M,East Toronto,Studio District,Café,Coffee Shop,Brewery,Gastropub,Bakery,American Restaurant,Convenience Store,Seafood Restaurant,Sandwich Place,Cheese Shop
4,M4N,Central Toronto,Lawrence Park,Park,Bus Line,Swim School,Yoga Studio,Diner,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
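
In the table above, The Beaches appears to have only four venues, so its 5th to 10th "most common" entries are categories with a count of zero that happen to sort next. One possible refinement, sketched below on a made-up row, is to drop zero counts before ranking. `top_venue_categories` is a hypothetical helper, not the function from the IBM lab.

```python
import pandas as pd


def top_venue_categories(counts, n):
    """Return up to n category names with the highest nonzero counts."""
    nonzero = counts[counts > 0]
    ordered = nonzero.sort_values(ascending=False, kind="mergesort")
    return ordered.index[:n].tolist()


# Made-up category counts for one neighbourhood.
row = pd.Series({"Donut Shop": 0, "Pub": 1, "Trail": 2, "Yoga Studio": 0})
top = top_venue_categories(row, 3)
print(top)
```

The result can be shorter than `n`, so a caller filling a fixed number of columns would need to pad it, for example with `None`.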


### 3c) Cluster Neighborhoods

In [37]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [38]:
# set number of clusters
kclusters = 5

Toronto_grouped_clustering = Toronto_grouped.drop(['Postal Code','Borough','Neighborhoods'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 1, 4, 1, 4, 4, 4, 1, 4, 4])

In [39]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [40]:
#Making copies so that original dataframes are not impacted
TorontoBors_merged = df_TorontoBors.copy()
neighborhoods_venues_sorted1 = neighborhoods_venues_sorted.copy().drop(['Postal Code','Borough'], axis=1)
neighborhoods_venues_sorted1

# join the cluster labels and top venues onto the Toronto borough data, which already has latitude/longitude for each neighbourhood
TorontoBors_merged = TorontoBors_merged.join(neighborhoods_venues_sorted1.set_index('Neighborhoods'), on='Neighbourhood')

TorontoBors_merged.head() 

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,4,Neighborhood,Health Food Store,Pub,Trail,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Deli / Bodega,Doner Restaurant
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,1,Greek Restaurant,Italian Restaurant,Coffee Shop,Ice Cream Shop,Restaurant,Furniture / Home Store,Liquor Store,Indian Restaurant,Spa,Juice Bar
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,4,Pet Store,Movie Theater,Burrito Place,Sandwich Place,Fast Food Restaurant,Italian Restaurant,Fish & Chips Shop,Restaurant,Steakhouse,Sushi Restaurant
43,M4M,East Toronto,Studio District,43.659526,-79.340923,1,Café,Coffee Shop,Brewery,Gastropub,Bakery,American Restaurant,Convenience Store,Seafood Restaurant,Sandwich Place,Cheese Shop
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Park,Bus Line,Swim School,Yoga Studio,Diner,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


*Visualize the Clusters*

In [41]:
#Get latitude and longitude for Toronto, ON
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [42]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(TorontoBors_merged['Latitude'], TorontoBors_merged['Longitude'], TorontoBors_merged['Neighbourhood'], TorontoBors_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
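
The marker colours above are looked up with `rainbow[cluster-1]`, so label 0 wraps around to the last colour through negative indexing. The colours stay distinct, but the mapping is shifted by one. The sketch below shows a direct label-to-colour mapping. It uses the standard-library `colorsys` module as a stand-in for matplotlib's `cm.rainbow`, and `cluster_palette` is a hypothetical helper.

```python
import colorsys


def cluster_palette(k):
    """Return k distinct hex colours, one per cluster label 0..k-1."""
    palette = []
    for i in range(k):
        # Spread hues evenly around the colour wheel.
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(5)
colour_for = {label: palette[label] for label in range(5)}
print(colour_for)
```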

### 3d) Examine Clusters

In [43]:
TorontoBors_merged.loc[TorontoBors_merged['Cluster Labels'] == 0, 
                     TorontoBors_merged.columns[[2] + list(range(6, TorontoBors_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
54,"Garden District, Ryerson",Clothing Store,Coffee Shop,Café,Cosmetics Shop,Bubble Tea Shop,Italian Restaurant,Japanese Restaurant,Diner,Bookstore,Hotel


In [44]:
TorontoBors_merged.loc[TorontoBors_merged['Cluster Labels'] == 1, 
                     TorontoBors_merged.columns[[2] + list(range(6, TorontoBors_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
41,"The Danforth West, Riverdale",Greek Restaurant,Italian Restaurant,Coffee Shop,Ice Cream Shop,Restaurant,Furniture / Home Store,Liquor Store,Indian Restaurant,Spa,Juice Bar
43,Studio District,Café,Coffee Shop,Brewery,Gastropub,Bakery,American Restaurant,Convenience Store,Seafood Restaurant,Sandwich Place,Cheese Shop
47,Davisville,Pizza Place,Dessert Shop,Sandwich Place,Coffee Shop,Café,Italian Restaurant,Sushi Restaurant,Gym,Brewery,Gourmet Shop
51,"St. James Town, Cabbagetown",Coffee Shop,Café,Restaurant,Italian Restaurant,Bakery,Pub,Pizza Place,Park,Japanese Restaurant,Jewelry Store
65,"The Annex, North Midtown, Yorkville",Sandwich Place,Café,Coffee Shop,Pet Store,History Museum,Donut Shop,Burger Joint,Middle Eastern Restaurant,Indian Restaurant,Pub
66,"University of Toronto, Harbord",Café,Sandwich Place,Bar,Japanese Restaurant,Bookstore,Restaurant,Bakery,Beer Bar,Beer Store,Italian Restaurant
67,"Kensington Market, Chinatown, Grange Park",Café,Vegetarian / Vegan Restaurant,Coffee Shop,Mexican Restaurant,Bar,Vietnamese Restaurant,Gaming Cafe,Dessert Shop,Grocery Store,Park
77,"Little Portugal, Trinity",Bar,Coffee Shop,Café,Vietnamese Restaurant,Restaurant,Men's Store,Asian Restaurant,Yoga Studio,Deli / Bodega,Brewery
84,"Runnymede, Swansea",Coffee Shop,Café,Pizza Place,Sushi Restaurant,Italian Restaurant,Pub,Smoothie Shop,Bookstore,Sandwich Place,Burrito Place


In [45]:
TorontoBors_merged.loc[TorontoBors_merged['Cluster Labels'] == 2, 
                     TorontoBors_merged.columns[[2] + list(range(6, TorontoBors_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Men's Store,Hotel,Café,Yoga Studio,Bubble Tea Shop
53,"Regent Park, Harbourfront",Coffee Shop,Bakery,Park,Pub,Café,Breakfast Spot,Theater,Beer Store,Ice Cream Shop,Chocolate Shop
55,St. James Town,Café,Coffee Shop,Restaurant,Clothing Store,Cocktail Bar,Cosmetics Shop,American Restaurant,Breakfast Spot,Park,Moroccan Restaurant
56,Berczy Park,Coffee Shop,Seafood Restaurant,Bakery,Farmers Market,Restaurant,Cheese Shop,Cocktail Bar,Café,Beer Bar,Department Store
57,Central Bay Street,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Japanese Restaurant,Department Store,Bubble Tea Shop,Salad Place,Burger Joint,Miscellaneous Shop
69,Stn A PO Boxes,Coffee Shop,Café,Italian Restaurant,Japanese Restaurant,Seafood Restaurant,Gym,Beer Bar,Hotel,Restaurant,Pub
85,"Queen's Park, Ontario Provincial Government",Coffee Shop,Diner,Creperie,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,College Auditorium,Park


In [46]:
TorontoBors_merged.loc[TorontoBors_merged['Cluster Labels'] == 3, 
                     TorontoBors_merged.columns[[2] + list(range(6, TorontoBors_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
58,"Richmond, Adelaide, King",Coffee Shop,Café,Clothing Store,Hotel,Restaurant,Gym,Bar,Thai Restaurant,Steakhouse,Salad Place
59,"Harbourfront East, Union Station, Toronto Islands",Coffee Shop,Aquarium,Café,Hotel,Restaurant,Scenic Lookout,Brewery,Fried Chicken Joint,Baseball Stadium,Music Venue
60,"Toronto Dominion Centre, Design Exchange",Coffee Shop,Hotel,Café,Restaurant,Italian Restaurant,Salad Place,Seafood Restaurant,American Restaurant,Japanese Restaurant,Sushi Restaurant
61,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Restaurant,Hotel,American Restaurant,Gym,Seafood Restaurant,Japanese Restaurant,Italian Restaurant,Deli / Bodega
70,"First Canadian Place, Underground city",Coffee Shop,Café,Hotel,Japanese Restaurant,Restaurant,Gym,Seafood Restaurant,Salad Place,Steakhouse,Asian Restaurant


In [47]:
TorontoBors_merged.loc[TorontoBors_merged['Cluster Labels'] == 4, 
                     TorontoBors_merged.columns[[2] + list(range(6, TorontoBors_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,The Beaches,Neighborhood,Health Food Store,Pub,Trail,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Deli / Bodega,Doner Restaurant
42,"India Bazaar, The Beaches West",Pet Store,Movie Theater,Burrito Place,Sandwich Place,Fast Food Restaurant,Italian Restaurant,Fish & Chips Shop,Restaurant,Steakhouse,Sushi Restaurant
44,Lawrence Park,Park,Bus Line,Swim School,Yoga Studio,Diner,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
45,Davisville North,Pizza Place,Hotel,Food & Drink Shop,Sandwich Place,Department Store,Breakfast Spot,Gym / Fitness Center,Park,Convenience Store,Discount Store
46,"North Toronto West, Lawrence Park",Clothing Store,Coffee Shop,Yoga Studio,Mexican Restaurant,Salon / Barbershop,Fast Food Restaurant,Spa,Sporting Goods Shop,Café,Restaurant
48,"Moore Park, Summerhill East",Playground,Park,Tennis Court,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
49,"Summerhill West, Rathnelly, South Hill, Forest...",Coffee Shop,Pub,Bank,Sushi Restaurant,Bagel Shop,Light Rail Station,Fried Chicken Joint,Sports Bar,Restaurant,Supermarket
50,Rosedale,Park,Playground,Trail,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
63,Roselawn,Garden,Home Service,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
64,"Forest Hill North & West, Forest Hill Road Park",Jewelry Store,Trail,Bus Line,Sushi Restaurant,Yoga Studio,Diner,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
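
The five cells above repeat the same selection with a different label each time. A minimal sketch of doing this in a loop, together with a count of neighbourhoods per cluster, is shown below. The `toy` frame is a made-up stand-in with only a few columns, and `cluster_members` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for TorontoBors_merged with only the columns needed here.
toy = pd.DataFrame({
    "Neighbourhood": ["The Beaches", "Studio District", "Rosedale"],
    "Cluster Labels": [4, 1, 4],
    "1st Most Common Venue": ["Neighborhood", "Coffee Shop", "Park"],
})


def cluster_members(df, label):
    """Rows of df in one cluster, without the label column itself."""
    return df.loc[df["Cluster Labels"] == label].drop(columns="Cluster Labels")


# Number of neighbourhoods in each cluster, ordered by label.
sizes = toy["Cluster Labels"].value_counts().sort_index()
for label in sizes.index:
    print(f"Cluster {label}: {sizes.loc[label]} neighbourhood(s)")
    print(cluster_members(toy, label))
```

Cluster sizes are worth checking first: a cluster with a single member, like cluster 0 above, may indicate an outlier more than a meaningful group.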
