## <center>Capstone Project</center>

# <center>Segmenting and Clustering Neighborhoods in Toronto</center>

### **Introduction**

Toronto, the capital of the province of Ontario, is a major Canadian city along Lake Ontario’s northwestern shore. It's a dynamic metropolis with a core of soaring skyscrapers, all dwarfed by the iconic, free-standing CN Tower. Toronto also has many green spaces, from the orderly oval of Queen’s Park to 400-acre High Park and its trails, sports facilities and zoo.

The diverse population of Toronto reflects its current and historical role as an important destination for immigrants to Canada. More than 50 percent of residents belong to a visible minority population group, and over 200 distinct ethnic origins are represented among its inhabitants. While the majority of Torontonians speak English as their primary language, over 160 languages are spoken in the city.

This project explores and clusters the neighborhoods of Toronto, Canada.
The notebook visualizes the boroughs, neighborhoods and their venues in one place, to help anyone deciding where to move within the city of Toronto.

### **Part 1 - Creating the Dataframe**

 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Importing and installing all the necessary packages and libraries**

In [2]:
import numpy as np #library to handle data in a vectorized manner

import pandas as pd #library for data analysis

import json #library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim #convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize #transform JSON file into a pandas dataframe (top-level import, pandas 1.0 or later)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes
import folium #map rendering library

#Installing BeautifulSoup and importing it (requests is already imported above)
!pip install bs4
from bs4 import BeautifulSoup

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Scraping the HTML table with pandas and placing it in the postal codes dataframe**

In [4]:
html_table = pd.read_html(r"https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
df_PC = html_table[0] #choosing the first table on the Wikipedia page
df_PC

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Processing the cells that have an assigned borough**

In [5]:
df_PC_AB = df_PC[df_PC.Borough != 'Not assigned']
df_PC_AB

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Grouping the dataframe by postal code and resetting the index**

In [6]:
df_PC_AB_grp = df_PC_AB.groupby('Postal Code').sum()
df_PC_AB_res = df_PC_AB_grp.reset_index()
df_PC_AB_res

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [7]:
df_PC_AB_res.shape

(103, 3)
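
The grouping above relies on `.sum()`, which joins strings without a separator if a postal code ever spans several rows. The sketch below shows a more explicit aggregation under that assumption; the toy rows are made up for illustration and are not taken from the scraped table.

```python
import pandas as pd

# Toy rows (illustrative only): the postal code M5A appears twice.
rows = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Keep the first borough per postal code and join the neighbourhood
# names with a comma instead of concatenating them blindly.
grouped = (
    rows.groupby("Postal Code", as_index=False)
        .agg({"Borough": "first", "Neighbourhood": ", ".join})
)
print(grouped)
```

With one row per postal code, as in the table used here, both approaches should give the same result; the explicit version only matters when duplicates are present.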

### **Part 2 - Neighborhood Coordinates**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Reading the geographical coordinates into a dataframe**

In [8]:
df_GEO = pd.read_csv(r"https://cocl.us/Geospatial_data")
df_GEO

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Inserting/merging the latitude and longitude columns into the original dataframe**

In [9]:
df_GEO_merged = df_PC_AB_res[['Postal Code','Borough', 'Neighbourhood']].merge(df_GEO[['Latitude','Longitude', 'Postal Code']], on='Postal Code', how='left')
df_GEO_merged

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
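
A left merge silently leaves `NaN` coordinates for any postal code that is missing from the coordinates file. The following sketch shows one way to detect that, using the `validate` and `indicator` options of `DataFrame.merge`; the frames are small stand-ins, and the code `M9Z` is a hypothetical postal code with no coordinates.

```python
import pandas as pd

# Small stand-ins for the postal code and coordinate dataframes.
codes = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],  # M9Z is hypothetical
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column telling where each row came from.
merged = codes.merge(coords, on="Postal Code", how="left",
                     validate="one_to_one", indicator=True)

# Postal codes that found no coordinates.
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```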


### **Part 3 - Clustering and Exploring the Neighborhoods in Toronto**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**How many boroughs and neighborhoods are there in Toronto?**

In [10]:
print('Toronto has {} boroughs and {} neighborhoods.'.format(len(df_PC_AB_res['Borough'].unique()), df_PC_AB_res.shape[0]))

Toronto has 10 boroughs and 103 neighborhoods.


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Visualizing the neighborhoods of Toronto**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Getting the latitude and longitude of Toronto using the geopy library

In [12]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
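
geopy's `geocode` returns `None` when the service finds no match, in which case reading `location.latitude` would fail. Below is a minimal sketch of a guard around that case. `FakeLocation` and `coords_or_fallback` are hypothetical names introduced here so the example runs without a network call, and the fallback values are the coordinates printed above.

```python
from collections import namedtuple

# Stand-in for the object geopy returns; only the two attributes used here.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

# Fallback taken from the coordinates printed by the cell above.
TORONTO_FALLBACK = (43.6534817, -79.3839347)


def coords_or_fallback(location, fallback=TORONTO_FALLBACK):
    """Return (lat, lon) from a location-like object, or the fallback if it is None."""
    if location is None:
        return fallback
    return (location.latitude, location.longitude)


# A successful lookup passes through unchanged...
found = coords_or_fallback(FakeLocation(43.7, -79.4))
# ...and a failed lookup (None) falls back to the known coordinates.
not_found = coords_or_fallback(None)
print(found, not_found)
```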


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Creating the Toronto map showing all the boroughs and their neighborhoods as blue dots

In [13]:
#create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

#add markers to map
for lat, lng, borough, neighborhood in zip(df_GEO_merged['Latitude'], df_GEO_merged['Longitude'], df_GEO_merged['Borough'], df_GEO_merged['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Segmenting and clustering the neighborhoods in Downtown Toronto**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:green">The same analysis could be applied to every borough; to keep the project simple, Downtown Toronto is used as the example.</span>

In [14]:
dt_toronto_data = df_GEO_merged[df_GEO_merged['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dt_toronto_data

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Getting the coordinates of Downtown Toronto, assuming that the coordinates of the city of Toronto also represent its downtown

In [15]:
latitude_dt_toronto = latitude
longitude_dt_toronto = longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude_dt_toronto, longitude_dt_toronto))

The geographical coordinates of Downtown Toronto are 43.6534817, -79.3839347.


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Creating the Downtown Toronto map showing all its neighborhoods as green dots

In [16]:
#create map of Downtown Toronto using latitude and longitude values
map_dt_toronto = folium.Map(location=[latitude_dt_toronto, longitude_dt_toronto], zoom_start=11)

#add markers to map
for lat, lng, label in zip(dt_toronto_data['Latitude'], dt_toronto_data['Longitude'], dt_toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt_toronto)  
    
map_dt_toronto

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Exploring the neighborhoods and their venues in Downtown Toronto using the Foursquare API**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Defining the Foursquare Credentials and Version

In [4]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own Foursquare client secret
VERSION = '20200605'
LIMIT = 100

print('Credentials are defined')

Credentials are defined
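
Keeping credentials out of the notebook avoids publishing them by accident. One possible approach, sketched below, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=None):
    """Read the client ID and secret from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    # Collect every missing or empty variable so the error names them all.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]


# Demonstration with a plain dict instead of the real environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("Credentials are defined")
```

In the notebook, the call would take no argument so that the values come from the real environment.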


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Function that explores all the neighborhoods in Downtown Toronto

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
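
The function above assumes that every response contains a group and that every venue has at least one category; otherwise the indexing raises an error. The sketch below separates URL building from response parsing and skips incomplete venues. `build_explore_url` and `extract_venues` are hypothetical helper names, the sample payload is hand-written to mimic the fields the function reads, and no request is sent.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


def extract_venues(payload, name, lat, lng):
    """Turn a decoded JSON response into rows, skipping incomplete venues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        if not categories:
            continue  # no category to report for this venue
        location = venue.get("location", {})
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0]["name"]))
    return rows


# Hand-written payload: one complete venue and one without categories.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Rosedale Park",
               "location": {"lat": 43.682328, "lng": -79.378934},
               "categories": [{"name": "Playground"}]}},
    {"venue": {"name": "Uncategorised venue",
               "location": {"lat": 0.0, "lng": 0.0},
               "categories": []}},
]}]}}

url = build_explore_url("ID", "SECRET", "20200605", 43.65, -79.38)
rows = extract_venues(sample, "Rosedale", 43.679563, -79.377529)
print(rows)
```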

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Using the above function to explore all the venues in each neighborhood of Downtown Toronto

In [19]:
dt_toronto_venues = getNearbyVenues(names=dt_toronto_data['Neighbourhood'], latitudes=dt_toronto_data['Latitude'], longitudes=dt_toronto_data['Longitude'])
dt_toronto_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.678300,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,"St. James Town, Cabbagetown",43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner
...,...,...,...,...,...,...,...
1217,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,SUDS,43.659880,-79.394712,Bar
1218,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Convocation Hall,43.660828,-79.395245,College Auditorium
1219,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Tim Hortons,43.659415,-79.391221,Coffee Shop
1220,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Tim Hortons,43.658906,-79.388696,Coffee Shop


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;How many venues were returned for each neighborhood? Let's check that!

In [20]:
dt_toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,58,58,58,58,58,58
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,65,65,65,65,65,65
Christie,15,15,15,15,15,15
Church and Wellesley,78,78,78,78,78,78
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"First Canadian Place, Underground city",100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100
"Harbourfront East, Union Station, Toronto Islands",100,100,100,100,100,100
"Kensington Market, Chinatown, Grange Park",66,66,66,66,66,66


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;How many unique venue categories were returned? Let's find out!

In [21]:
print('There are {} unique categories found in the neighborhoods of Downtown Toronto.'.format(len(dt_toronto_venues['Venue Category'].unique())))

There are 212 unique categories found in the neighborhoods of Downtown Toronto.


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Analyzing the venues of each neighborhood of Downtown Toronto**

In [23]:
#Populating the dataframe of venues with dummies (0 and 1) to show if a venue exists in a neighborhood or not
df_venues = pd.get_dummies(dt_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#Adding the neighborhood column as first column to the df_venues dataframe
df_venues.insert(0, "Neighborhoods", dt_toronto_venues['Neighborhood'], True)

df_venues.head()

Unnamed: 0,Neighborhoods,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,"St. James Town, Cabbagetown",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
#Dropping the venue-category column named 'Neighborhood' so it is not confused with the 'Neighborhoods' label column
df_venues.drop(['Neighborhood'], axis=1, inplace=True)

In [25]:
#Checking how many columns and rows the complete dataframe has
df_venues.shape

(1222, 212)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Grouping the dataframe by neighborhood and calculating the mean frequency of occurrence of each category

In [26]:
df_venues_grouped = df_venues.groupby('Neighborhoods').mean()
df_venues_grouped_res = df_venues_grouped.reset_index()
df_venues_grouped_res

Unnamed: 0,Neighborhoods,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0
1,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.015385,0.0,0.0,0.015385,0.0,0.015385
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.012821,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,0.0,...,0.012821,0.012821,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641
5,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
6,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0
7,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0
8,"Harbourfront East, Union Station, Toronto Islands",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0
9,"Kensington Market, Chinatown, Grange Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.060606,0.0,0.045455,0.015152,0.0,0.0
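
To see why the mean of the one-hot columns is a relative frequency, here is a small sketch. The Rosedale rows reuse the four venue categories listed for Rosedale earlier; the second neighborhood, "Example", is made up for illustration.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["Rosedale"] * 4 + ["Example"] * 2,
    "Venue Category": ["Playground", "Park", "Park", "Trail",
                       "Coffee Shop", "Diner"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[["Venue Category"]],
                        prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhoods", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby("Neighborhoods").mean()
print(freq)
```

Two of Rosedale's four venues are parks, so its `Park` entry is 0.5, and each row sums to 1.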


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Showing each neighborhood along with the top 10 most common venues

In [27]:
top10_venues = 10
for neighborhood in df_venues_grouped_res['Neighborhoods']:
    print("----"+neighborhood+"----")
    
    #Selecting the row for this neighborhood, transposing it and resetting its index
    temp = df_venues_grouped_res[df_venues_grouped_res['Neighborhoods'] == neighborhood].T.reset_index()
    
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(top10_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.10
1        Cocktail Bar  0.05
2              Bakery  0.05
3            Pharmacy  0.03
4      Farmers Market  0.03
5            Beer Bar  0.03
6  Seafood Restaurant  0.03
7         Cheese Shop  0.03
8          Restaurant  0.03
9            Creperie  0.02


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
                 venue  freq
0      Airport Service  0.19
1       Airport Lounge  0.12
2     Airport Terminal  0.12
3      Harbor / Marina  0.06
4              Airport  0.06
5  Rental Car Location  0.06
6     Sculpture Garden  0.06
7        Boat or Ferry  0.06
8             Boutique  0.06
9                Plane  0.06


----Central Bay Street----
                       venue  freq
0                Coffee Shop  0.17
1                       Café  0.05
2             Sandwich Place  0.05
3         Italian Restaurant  0.05
4        Japanese Restaurant  0.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Putting the neighborhoods along with the top 10 most common venues into a dataframe**

In [28]:
#The below function sorts the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
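
As an alternative sketch, the same ranking can be written with `Series.nlargest`, which makes the numeric intent explicit and keeps the first of any tied values by default. The row below uses a few of the Berczy Park frequencies printed earlier; it is a shortened illustration, not the full row.

```python
import pandas as pd

# Shortened row: the label followed by a few category frequencies.
row = pd.Series({
    "Neighborhoods": "Berczy Park",
    "Coffee Shop": 0.10,
    "Cocktail Bar": 0.05,
    "Pharmacy": 0.03,
    "Creperie": 0.02,
})

# Drop the label, cast the remaining values to float, take the top 3.
freqs = row.drop("Neighborhoods").astype(float)
top3 = freqs.nlargest(3).index.tolist()
print(top3)
```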

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Creating the dataframe

In [29]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

#Create columns according to number of top venues
columns = ['Neighborhoods']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

#Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhoods'] = df_venues_grouped_res['Neighborhoods']

for ind in np.arange(df_venues_grouped_res.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_venues_grouped_res.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Pharmacy,Beer Bar,Seafood Restaurant,Farmers Market,Restaurant,Cheese Shop,Juice Bar
1,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Harbor / Marina,Boutique,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry,Airport Gate
2,Central Bay Street,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Bubble Tea Shop,Japanese Restaurant,Thai Restaurant,Burger Joint,Salad Place,Middle Eastern Restaurant
3,Christie,Grocery Store,Café,Park,Baby Store,Candy Store,Nightclub,Coffee Shop,Italian Restaurant,Restaurant,Discount Store
4,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Café,Fast Food Restaurant,Pub,Hotel,Yoga Studio
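
The column names above are correct for ten venues, but the same logic would produce "21th" if `num_top_venues` grew past 20. A small helper, sketched here under the hypothetical name `ordinal`, handles the English suffix rules in general.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Building the same column names as the cell above.
columns = ["Neighborhoods"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```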


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Clustering the neighborhoods into 4 clusters using k-means**

In [30]:
#Importing KMeans
from sklearn.cluster import KMeans

#Set the number of clusters
kclusters = 4

#Dropping the categorical neighborhoods column and keeping the rest of the columns with quantitative values so we can cluster them
dt_toronto_grouped_clustering = df_venues_grouped_res.drop(columns='Neighborhoods')

#Running k-means clustering with random_state 4
kmeans = KMeans(n_clusters=kclusters, random_state=4).fit(dt_toronto_grouped_clustering)

#Checking cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 0, 1, 3, 1, 1, 1, 1, 1, 1])
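
Each label in the array is the index of the centroid closest to that neighborhood's frequency vector, so the numbers themselves carry no ranking. The NumPy sketch below illustrates only this assignment step with made-up two-dimensional points; it is not scikit-learn's implementation, and `assign_to_nearest` is a hypothetical helper name.

```python
import numpy as np


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    # Pairwise differences with shape (n_points, n_centroids, n_features).
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Squared Euclidean distance from every point to every centroid.
    sq_dist = (diffs ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


# Made-up frequency vectors for three neighborhoods and two centroids.
points = np.array([[0.10, 0.00], [0.12, 0.02], [0.00, 0.50]])
centroids = np.array([[0.11, 0.01], [0.00, 0.50]])

labels = assign_to_nearest(points, centroids)
print(labels)
```

The first two points sit next to the first centroid and the third coincides with the second, so they receive labels 0, 0 and 1.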

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Creating a dataframe that includes the cluster labels and the top 10 venues for each neighborhood.

In [31]:
#Add clustering labels to the neighborhoods_venues_sorted dataframe
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

#Creating the new dataframe dt_toronto_merged from the original dt_toronto_data dataframe
dt_toronto_merged = dt_toronto_data

#Joining dt_toronto_merged with neighborhoods_venues_sorted so each neighborhood has its coordinates, cluster label and top venues
dt_toronto_merged = dt_toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhoods'), on='Neighbourhood')

dt_toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,2,Park,Trail,Playground,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675,1,Coffee Shop,Bakery,Café,Italian Restaurant,Restaurant,Pub,Pizza Place,General Entertainment,Liquor Store,Beer Store
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Café,Fast Food Restaurant,Pub,Hotel,Yoga Studio
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Bakery,Park,Pub,Café,Theater,Breakfast Spot,Event Space,Beer Store,Shoe Store
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Coffee Shop,Clothing Store,Café,Middle Eastern Restaurant,Hotel,Bubble Tea Shop,Cosmetics Shop,Japanese Restaurant,Italian Restaurant,Diner


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Visualizing the clusters of neighborhoods of Downtown Toronto

In [32]:
#Initializing the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

#Setting color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#Adding markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dt_toronto_merged['Latitude'], dt_toronto_merged['Longitude'], dt_toronto_merged['Neighbourhood'], dt_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
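
The cell above indexes the palette with `cluster-1`, so label 0 wraps around to the last colour through negative indexing. Indexing directly by label is easier to reason about. The sketch below builds one distinct hex colour per cluster with the standard-library `colorsys` module, used here as a stand-in for matplotlib's rainbow colormap; `cluster_palette` is a hypothetical helper name.

```python
import colorsys


def cluster_palette(n_clusters):
    """Return n_clusters hex colours with evenly spaced hues."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            int(r * 255), int(g * 255), int(b * 255)))
    return palette


# One colour per cluster label, looked up by the label itself.
palette = cluster_palette(4)
color_for = {label: palette[label] for label in range(4)}
print(color_for)
```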

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Observations on the clustering:**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1. Cluster 1 has the highest concentration of venues, which suggests that its neighborhoods share a similar lifestyle. <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2. The remaining clusters do not have a wide variety of venues and have little in common with one another. <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3. The clustering looks reasonable given how the points are spread across the map.