# Capstone Project

This is the Coursera capstone project, week 3 assignment.

In this project we will analyze the neighbourhoods of Toronto and cluster them according to the kinds of venues that are most common in each of them.

In [1]:
#Import necessary modules
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests #Makes web requests
import geocoder # import geocoder
import csv

import json # library to handle JSON files
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


# WEEK 3

## 1._ Get the data:
## This section corresponds to the first question of the assignment

First we will get the necessary data.

For that we will scrape Wikipedia in order to get the postal code, borough and neighborhood names. Later we will clean the data scraped from the web page and present it in a neat dataframe.

The rest of the data we will get from the Foursquare API.

### 1.1 Scrape Wikipedia

In [2]:
#Get the source
source=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

#html format
soup=BeautifulSoup(source,'lxml')

#Get the table
table=soup.find('table', class_='wikitable')

In [3]:
#We iterate to get a list of values for each column
data_list=[[],[],[]]
for tr in table.find_all('tr'): #Goes through all the <tr> tags
    cells=tr.find_all('td')
    if len(cells)!=3: #Skips the header row (it has <th> tags only) and any malformed row
        continue
    for i, td in enumerate(cells):
        data_list[i].append(td.text.rstrip()) #rstrip to remove the final \n

In [4]:
#I have all the data in a list with three lists
#I separate the data

Postal_code_list=data_list[0][:]
Borough_list=data_list[1][:]
Neighborhood_list=data_list[2][:]

#Create a dictionary and create a frame with it
dictionary = {'Postal_code':Postal_code_list,
              'Borough':Borough_list,
              'Neighborhood':Neighborhood_list}
df = pd.DataFrame(dictionary, columns = ['Postal_code','Borough','Neighborhood'])
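
The row-by-row extraction can be tried offline on a small HTML fragment. The sketch below is only an illustration: the `html` string is made-up sample markup shaped like the three-column table, and `html.parser` is used instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample markup with the same three columns as the scraped table
html = """
<table class="wikitable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable')

rows = []
for tr in table.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if len(cells) == 3:  # the header row only has <th> tags, so it is skipped
        rows.append(cells)

sample_df = pd.DataFrame(rows, columns=['Postal_code', 'Borough', 'Neighborhood'])
print(sample_df)
```

Collecting whole rows instead of three separate column lists keeps the columns the same length even if one row is malformed.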

### 1.2 Clean the data

We have all the information from the Wikipedia table stored in the dataframe.

Now we have to clean the data according to the assignment instructions:
 - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
 - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

 - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [5]:
#Get rid of all the rows with "Not assigned" in Borough
df=df[df['Borough'] != 'Not assigned'].copy() #copy so that later assignments do not act on a view
df.shape

(103, 3)

 - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [6]:
#We use boolean indexing
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
df.shape

(103, 3)

 - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

The table on the Wikipedia page now already has this done.

##### Check
We can make sure that our table satisfies this, and also that all the rows with the same postal code have the same Borough.

To do that, we compare the number of rows of the dataframe as it is with the number of groups when it is grouped by Postal_code and Borough, and when it is grouped by Postal_code alone. We see that all three are equal.

In [7]:
g_pc=df.groupby(['Postal_code'])
g_pc_b = df.groupby(['Postal_code','Borough'])
print(df.shape)
print(len(g_pc))
print(len(g_pc_b))

(103, 3)
103
103
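
If the table still listed one row per neighbourhood, the rows could be combined with a `groupby`. This is a minimal sketch under that assumption; the `raw` frame is made-up data in the older layout described in the assignment text, not data from this notebook.

```python
import pandas as pd

# Made-up rows in the older layout, where M5A appears twice
raw = pd.DataFrame({
    'Postal_code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per postal code, with the neighbourhoods joined by a comma
combined = (raw.groupby(['Postal_code', 'Borough'])['Neighborhood']
               .agg(', '.join)
               .reset_index())
print(combined)
```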


In [8]:
#To match the format used in the exercise, we replace "/" with ",".
df.reset_index(level=None, drop=True, inplace=True)
df.replace("/", ",", inplace=True, regex=True)

### 1.3 Result

In [9]:
df

Unnamed: 0,Postal_code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park , Harbourfront"
3,M6A,North York,"Lawrence Manor , Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,..."


### We end up with 103 rows

In [10]:
df.shape

(103, 3)

## 2._ Cluster the Neighbourhoods

Now we will start clustering the neighbourhoods that we obtained in the previous section.

Before we can start requesting information from the Foursquare API, we need the coordinates of each neighbourhood.

### Get the neighborhoods location features

In [11]:
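df_l = pd.read_csv('Geospatial_Coordinates.csv')
df_l.sort_values(by=['Postal Code'], ascending = True,inplace=True)
df_l.reset_index(level=None, drop=True, inplace=True) #Align its index with df, since the columns are copied by index below
df.sort_values(by=['Postal_code'], ascending = True,inplace=True)
df.reset_index(level=None, drop=True, inplace=True)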

In [12]:
#Add latitude and longitude as two new columns for my dataframe:
df['Latitude'] = df_l['Latitude']
df['Longitude'] = df_l['Longitude']
df

Unnamed: 0,Postal_code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern , Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill , Port Union , Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village , St. Phillips , Martin Grov...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles , Silverstone , Humbergate , Jam...",43.739416,-79.588437
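
Copying the columns across only works when both frames hold the same postal codes in the same order. A merge on the postal code does not depend on row order. The sketch below illustrates this with two rows taken from the table above; the frame names `codes`, `coords` and `located` are hypothetical.

```python
import pandas as pd

# Two rows taken from the table above, deliberately in different orders
codes = pd.DataFrame({
    'Postal_code': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Match rows on the postal code instead of on their position
located = codes.merge(coords, left_on='Postal_code', right_on='Postal Code', how='left')
located = located.drop(columns='Postal Code')
print(located)
```

With `how='left'`, a postal code that is missing from the coordinates file would show up as a row with NaN coordinates rather than shifting the other rows.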


In [13]:
latToronto=43.651070
longToronto = -79.347015
print('The geographical coordinates of Toronto are {}, {}.'.format(latToronto, longToronto))

The geographical coordinates of Toronto are 43.65107, -79.347015.


#### We create a preliminary map to display the different neighborhoods.

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latToronto, longToronto], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

# 2._ Foursquare API

### Set down the parameters for calling the API

We will gather information about each neighborhood using the Foursquare API.

##### Note: API Credentials are hidden in this public version of the notebook

In [15]:
## Define Foursquare credentials
CLIENT_ID = '**HIDDEN**' # your Foursquare ID
CLIENT_SECRET = '**HIDDEN**' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

In [16]:
LIMIT=200

The API returns a JSON response.

For our purposes we are interested in the category data.

In [17]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # the key is named differently in a normalized dataframe row
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

### Loop over all neighbourhoods to search info

In [18]:
#This function gets the data for my neighbourhoods and returns a list of tuples.
#Each of those tuples corresponds to a venue of interest.

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

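    #Flatten the per-neighbourhood lists into a single list of tuples, one per venue
    tuplasfilas=[item for venue_list in venues_list for item in venue_list]
    #The table with headers is created from this list in the next cell

    return tuplasfilas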
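
The list comprehension in `getNearbyVenues` assumes that every response contains a group and that every venue has at least one category. A more defensive parsing step could look like the sketch below. `rows_from_explore` is a hypothetical helper, and the `sample` payload is made up; its key layout only mirrors the indexing used in the function above.

```python
def rows_from_explore(payload, name, lat, lng):
    """Turn one explore response (already decoded from JSON) into venue rows.

    Hypothetical helper: the key layout mirrors the indexing used in
    getNearbyVenues. Venues without a category get None instead of failing.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Made-up payload for illustration
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.1, 'lng': -79.1},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.2, 'lng': -79.2},
               'categories': []}},
]}]}}

print(rows_from_explore(sample, 'Somewhere', 43.0, -79.0))
print(rows_from_explore({'response': {}}, 'Nowhere', 0.0, 0.0))
```

Keeping the parsing separate from the request also makes it possible to test it without calling the API.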

In [19]:
df_test=df

In [20]:
#We get the information for all the venues using getNearbyVenues and present them in a dataframe.

torontotuplasfilas=getNearbyVenues(names=df_test['Neighborhood'],
                                   latitudes=df_test['Latitude'],
                                   longitudes=df_test['Longitude']
                                  )
#print(torontotuplasfilas)

d = pd.DataFrame(torontotuplasfilas,columns=['Neighborhood', 
                'Neighborhood Latitude', 
                'Neighborhood Longitude', 
                'Venue', 
                'Venue Latitude', 
                'Venue Longitude', 
                'Venue Category'])
d

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern , Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge Hill , Port Union , Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood , Morningside , West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
3,"Guildwood , Morningside , West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.190720,Mexican Restaurant
4,"Guildwood , Morningside , West Hill",43.763573,-79.188711,Enterprise Rent-A-Car,43.764076,-79.193406,Rental Car Location
...,...,...,...,...,...,...,...
2201,"South Steeles , Silverstone , Humbergate , Jam...",43.739416,-79.588437,McDonald's,43.741757,-79.584230,Fast Food Restaurant
2202,"South Steeles , Silverstone , Humbergate , Jam...",43.739416,-79.588437,LCBO,43.741508,-79.584501,Liquor Store
2203,Northwest,43.706748,-79.594054,Economy Rent A Car,43.708471,-79.589943,Rental Car Location
2204,Northwest,43.706748,-79.594054,Logistics Distribution,43.707554,-79.589252,Bar


## Analysis of the different types of venues

We got more than two thousand results for the venues.

We are interested in the different groups that we can make with them depending on their category.

In [21]:
#Let's start analyzing the categories:
print('There are {} unique categories.'.format(len(d['Venue Category'].unique())))

There are 267 unique categories.


#### Count how many venues of each type are in each neighborhood

In [22]:
#We first create a binary table with the type of the venues in each neighbourhood

# one hot encoding
toronto_onehot = pd.get_dummies(d[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = d['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#toronto_onehot.head()

In [23]:
#Now we group them by neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
#toronto_grouped
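
To see what the grouped table contains, the same two steps can be run on a tiny example. The `venues` frame below is made up for illustration: the mean of the one-hot columns is the share of each category among the venues of a neighbourhood.

```python
import pandas as pd

# Made-up venues: three in neighbourhood A, one in neighbourhood B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Bar', 'Bar'],
})

onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='')
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the 0/1 columns is the share of each category per neighbourhood
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

In this example, neighbourhood A ends up with two thirds for Park and one third for Bar, so each row sums to 1 regardless of how many venues were found.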

Now I want to create a dataframe with the 10 most common types of venue in each neighbourhood.

In [24]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no suffix left in indicators, so use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
#The column names
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
#The neighbourhood names go in the neighbourhood column
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

#Sorting function
#It takes a row and the number of most common venues wanted

def return_most_common_venues(row, num_top_venues):
    #Keeps all the numbers, leaving out the neighbourhood name
    row_categories = row.iloc[1:]
    #Sorts them in descending order
    row_categories_sorted = row_categories.sort_values(ascending=False)
    #Returns the name (the index) of each of the most common venue types
    return row_categories_sorted.index.values[0:num_top_venues]


#Iterates over the neighbourhoods
for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Latin American Restaurant,Breakfast Spot,Clothing Store,Lounge,Eastern European Restaurant,Electronics Store,Dumpling Restaurant,Empanada Restaurant,Dessert Shop,Drugstore
1,"Alderwood , Long Branch",Pizza Place,Coffee Shop,Pharmacy,Sandwich Place,Skating Rink,Athletics & Sports,Pub,Gym,Comic Shop,Deli / Bodega
2,"Bathurst Manor , Wilson Heights , Downsview North",Bank,Coffee Shop,Convenience Store,Ice Cream Shop,Supermarket,Deli / Bodega,Sushi Restaurant,Restaurant,Middle Eastern Restaurant,Diner
3,Bayview Village,Café,Bank,Japanese Restaurant,Chinese Restaurant,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
4,"Bedford Park , Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Restaurant,Grocery Store,Thai Restaurant,Pub,Café,Sushi Restaurant,Indian Restaurant
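
The same ranking can also be computed for all rows at once. The sketch below is an alternative to the loop, not the method used in this notebook: it uses `np.argsort` on a made-up frequency table with the same layout as `toronto_grouped`, and the `Top 1`, `Top 2` column names are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up frequency table with the same layout as toronto_grouped
freq = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Bar': [0.2, 0.7],
    'Cafe': [0.5, 0.1],
    'Park': [0.3, 0.2],
})

top_n = 2
values = freq.drop(columns='Neighborhood')
# argsort on the negated values gives the column positions in descending order
order = np.argsort(-values.to_numpy(), axis=1)[:, :top_n]
top = pd.DataFrame(values.columns.to_numpy()[order],
                   columns=['Top {}'.format(i + 1) for i in range(top_n)])
top.insert(0, 'Neighborhood', freq['Neighborhood'])
print(top)
```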


# Cluster the neighbourhoods

Now we are going to cluster the neighbourhoods using the vectors that account for the relative presence of each type of venue.

In [25]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
#kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

kmeans = KMeans(init="k-means++", n_clusters=kclusters, n_init=12)
kmeans.fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0,
       4, 0, 0, 0, 0, 3, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 4, 1, 0, 0, 0, 1,
       0, 2, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 4, 4, 1,
       0, 0, 0, 0, 1])
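
Because no `random_state` is passed, the labels can change from one run to the next. If reproducible labels are wanted, the seed can be fixed. The sketch below shows this on made-up data rather than on the Toronto table.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two well-separated groups of points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

# The same seed gives the same centroids and labels on every run
labels_a = KMeans(n_clusters=2, n_init=12, random_state=0).fit(X).labels_
labels_b = KMeans(n_clusters=2, n_init=12, random_state=0).fit(X).labels_
print((labels_a == labels_b).all())
```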

Create a dataframe with the neighbourhoods, their most common venues and the cluster they belong to.

In [26]:
# add clustering labels
neighborhoods_venues_sorted['Cluster Labels']=kmeans.labels_

In [27]:
neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,Agincourt,Latin American Restaurant,Breakfast Spot,Clothing Store,Lounge,Eastern European Restaurant,Electronics Store,Dumpling Restaurant,Empanada Restaurant,Dessert Shop,Drugstore,0
1,"Alderwood , Long Branch",Pizza Place,Coffee Shop,Pharmacy,Sandwich Place,Skating Rink,Athletics & Sports,Pub,Gym,Comic Shop,Deli / Bodega,4
2,"Bathurst Manor , Wilson Heights , Downsview North",Bank,Coffee Shop,Convenience Store,Ice Cream Shop,Supermarket,Deli / Bodega,Sushi Restaurant,Restaurant,Middle Eastern Restaurant,Diner,0
3,Bayview Village,Café,Bank,Japanese Restaurant,Chinese Restaurant,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant,0
4,"Bedford Park , Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Restaurant,Grocery Store,Thai Restaurant,Pub,Café,Sushi Restaurant,Indian Restaurant,0
...,...,...,...,...,...,...,...,...,...,...,...,...
88,"Wexford , Maryvale",Smoke Shop,Auto Garage,Shopping Mall,Breakfast Spot,Bakery,Middle Eastern Restaurant,Sandwich Place,Dog Run,Diner,Discount Store,0
89,Willowdale,Coffee Shop,Pizza Place,Ramen Restaurant,Restaurant,Sushi Restaurant,Café,Discount Store,Sandwich Place,Bubble Tea Shop,Hotel,0
90,Woburn,Coffee Shop,Korean Restaurant,Insurance Office,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore,0
91,Woodbine Heights,Skating Rink,Dance Studio,Spa,Diner,Curling Ice,Athletics & Sports,Bus Stop,Cosmetics Shop,Beer Store,Pharmacy,0


No venues were found for some neighborhoods, so we will remove them from the list.

We also want to add latitude and longitude columns to the dataframe.

In [28]:
# join df_test with neighborhoods_venues_sorted to get the postal code, borough and latitude/longitude of each neighborhood
toronto_merged = df_test
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!
toronto_merged.tail()
#The index should still run up to 102
#Cluster Labels should be integer

#I have to drop every row that contains a NaN,
#which shortens the index
toronto_merged.dropna(axis=0, inplace=True)
toronto_merged=toronto_merged.reset_index(drop=True)

#The join converted the cluster column to float because of the NaNs.
#Once we have deleted them, we set the cluster variable back to integer.
toronto_merged=toronto_merged.astype({'Cluster Labels': 'int64'})
toronto_merged

Unnamed: 0,Postal_code,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,M1B,Scarborough,"Malvern , Rouge",43.806686,-79.194353,Fast Food Restaurant,Women's Store,Department Store,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant,3
1,M1C,Scarborough,"Rouge Hill , Port Union , Highland Creek",43.784535,-79.160497,Bar,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Falafel Restaurant,2
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711,Mexican Restaurant,Rental Car Location,Breakfast Spot,Intersection,Bank,Medical Center,Electronics Store,Cosmetics Shop,Costume Shop,Eastern European Restaurant,0
3,M1G,Scarborough,Woburn,43.770992,-79.216917,Coffee Shop,Korean Restaurant,Insurance Office,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore,0
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,Gas Station,Thai Restaurant,Fried Chicken Joint,Bank,Athletics & Sports,Caribbean Restaurant,Bakery,Hakka Restaurant,Drugstore,Donut Shop,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,M9N,York,Weston,43.706876,-79.518188,Park,Convenience Store,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Deli / Bodega,Doner Restaurant,1
94,M9P,Etobicoke,Westmount,43.696319,-79.532242,Pizza Place,Middle Eastern Restaurant,Chinese Restaurant,Coffee Shop,Discount Store,Sandwich Place,Intersection,Dessert Shop,Dim Sum Restaurant,Diner,4
95,M9R,Etobicoke,"Kingsview Village , St. Phillips , Martin Grov...",43.688905,-79.554724,Park,Bus Line,Pizza Place,Sandwich Place,Discount Store,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Distribution Center,4
96,M9V,Etobicoke,"South Steeles , Silverstone , Humbergate , Jam...",43.739416,-79.588437,Grocery Store,Beer Store,Fried Chicken Joint,Sandwich Place,Liquor Store,Fast Food Restaurant,Pizza Place,Pharmacy,Donut Shop,Department Store,4
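
The change of type during the join can be reproduced on a small example. In this sketch the frames `left` and `labels` are made up: neighbourhood B has no label, so the join fills it with NaN and the whole column becomes float until the row is dropped.

```python
import pandas as pd

left = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'C'], 'Cluster Labels': [0, 1]})

joined = left.join(labels.set_index('Neighborhood'), on='Neighborhood')
print(joined['Cluster Labels'].dtype)   # float64, because B has no label

# Drop the unlabeled row, then restore the integer type
cleaned = joined.dropna().reset_index(drop=True).astype({'Cluster Labels': 'int64'})
print(cleaned)
```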


# Let's visualize the clusters

In [29]:
# create map
map_clusters = folium.Map(location=[latToronto, longToronto], zoom_start=11)


# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

 # add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Examine clusters

We display the different clusters in order to see their venue types.

### Cluster 1

In [30]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
2,Scarborough,Mexican Restaurant,Rental Car Location,Breakfast Spot,Intersection,Bank,Medical Center,Electronics Store,Cosmetics Shop,Costume Shop,Eastern European Restaurant,0
3,Scarborough,Coffee Shop,Korean Restaurant,Insurance Office,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore,0
4,Scarborough,Gas Station,Thai Restaurant,Fried Chicken Joint,Bank,Athletics & Sports,Caribbean Restaurant,Bakery,Hakka Restaurant,Drugstore,Donut Shop,0
5,Scarborough,Grocery Store,Spa,Playground,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Donut Shop,0
6,Scarborough,Convenience Store,Coffee Shop,Discount Store,Department Store,Doner Restaurant,Dim Sum Restaurant,Diner,Distribution Center,Dog Run,Women's Store,0
...,...,...,...,...,...,...,...,...,...,...,...,...
85,Etobicoke,Bakery,Fried Chicken Joint,Mexican Restaurant,Gym,Restaurant,Pet Store,Pharmacy,Pizza Place,Liquor Store,American Restaurant,0
89,Etobicoke,Grocery Store,Hardware Store,Discount Store,Flower Shop,Burrito Place,Burger Joint,Supplement Shop,Convenience Store,Social Club,Fast Food Restaurant,0
90,Etobicoke,Cosmetics Shop,Café,Pet Store,Pizza Place,Coffee Shop,Beer Store,Convenience Store,Liquor Store,Doner Restaurant,Diner,0
92,North York,Baseball Field,Women's Store,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Falafel Restaurant,0


### Cluster 2

In [31]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
14,Scarborough,Park,Playground,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,1
20,North York,Park,Bank,Convenience Store,Bar,Women's Store,Doner Restaurant,Diner,Discount Store,Distribution Center,Dog Run,1
22,North York,Park,Bus Stop,Food & Drink Shop,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Ethiopian Restaurant,1
37,East York,Park,Convenience Store,Coffee Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,1
47,Downtown Toronto,Park,Playground,Trail,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Electronics Store,Donut Shop,Doner Restaurant,Dance Studio,1
71,York,Park,Market,Women's Store,Gluten-free Restaurant,Gift Shop,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant,1
76,North York,Park,Bakery,Construction & Landscaping,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,1
87,Etobicoke,Park,River,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Ethiopian Restaurant,Dog Run,1
88,Etobicoke,Park,Construction & Landscaping,Baseball Field,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,1
93,York,Park,Convenience Store,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Deli / Bodega,Doner Restaurant,1


### Cluster 3

In [32]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
1,Scarborough,Bar,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Falafel Restaurant,2


### Cluster 4

In [33]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,Scarborough,Fast Food Restaurant,Women's Store,Department Store,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant,3


### Cluster 5

In [34]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
8,Scarborough,American Restaurant,Motel,Intersection,Women's Store,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,4
13,Scarborough,Pharmacy,Pizza Place,Shopping Mall,Bank,Fast Food Restaurant,Intersection,Fried Chicken Joint,Thai Restaurant,Chinese Restaurant,Italian Restaurant,4
31,North York,Intersection,Coffee Shop,Pizza Place,Hockey Arena,Portuguese Restaurant,Women's Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,4
32,East York,Pizza Place,Fast Food Restaurant,Gastropub,Café,Athletics & Sports,Intersection,Bus Line,Bank,Pharmacy,Pet Store,4
69,North York,Park,Pub,Pizza Place,Japanese Restaurant,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,4
78,York,Pizza Place,Bus Line,Caribbean Restaurant,Brewery,Women's Store,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,4
86,Etobicoke,Pizza Place,Coffee Shop,Pharmacy,Sandwich Place,Skating Rink,Athletics & Sports,Pub,Gym,Comic Shop,Deli / Bodega,4
91,North York,Empanada Restaurant,Pizza Place,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant,4
94,Etobicoke,Pizza Place,Middle Eastern Restaurant,Chinese Restaurant,Coffee Shop,Discount Store,Sandwich Place,Intersection,Dessert Shop,Dim Sum Restaurant,Diner,4
95,Etobicoke,Park,Bus Line,Pizza Place,Sandwich Place,Discount Store,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Distribution Center,4


# Conclusions

Note: Since different initializations of the cluster centroids yield different results, the results may vary slightly in each run.

However, the rough conclusions are similar:

There is one cluster which holds the majority of neighbourhoods.

This cluster is the one to which all the city center neighbourhoods belong. It is characterized by parks and restaurants.

On the other hand, in the periphery there are neighborhoods belonging to all the different clusters. We can see some areas that look residential (usually one cluster), whereas another three clusters are usually more industrial, displaying big stores or gas stations and very few parks, playgrounds or schools.
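
The statement that one cluster holds the majority of neighbourhoods can be checked by counting the labels. The sketch below uses the `kmeans.labels_` array printed earlier in this notebook; a different run would give different counts.

```python
import numpy as np

# Labels copied from the kmeans.labels_ output shown earlier in this notebook
labels = np.array([0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0,
                   0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0,
                   4, 0, 0, 0, 0, 3, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 4, 1, 0, 0, 0, 1,
                   0, 2, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 4, 4, 1,
                   0, 0, 0, 0, 1])

# Number of neighbourhoods per cluster label
sizes = np.bincount(labels)
for cluster, size in enumerate(sizes):
    print('Cluster label {}: {} neighbourhoods ({:.0%})'.format(
        cluster, size, size / labels.size))
```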