# Neighborhoods in Toronto Clustering

### It is very important to notice that that the data provided is limited to the postal codes beggining with "M". These codes are for Toronto, Canada.

## 1. Getting all relevant geographic location data

In [1]:
import pandas as pd
import numpy as np

In [2]:
url1 = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
postal = pd.read_html(url1, header=0)
postal

[    Postal code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 5           M6A        North York   
 6           M7A  Downtown Toronto   
 7           M8A      Not assigned   
 8           M9A         Etobicoke   
 9           M1B       Scarborough   
 10          M2B      Not assigned   
 11          M3B        North York   
 12          M4B         East York   
 13          M5B  Downtown Toronto   
 14          M6B        North York   
 15          M7B      Not assigned   
 16          M8B      Not assigned   
 17          M9B         Etobicoke   
 18          M1C       Scarborough   
 19          M2C      Not assigned   
 20          M3C        North York   
 21          M4C         East York   
 22          M5C  Downtown Toronto   
 23          M6C              York   
 24          M7C      Not assigned   
 25         

In [3]:
type(postal)

list

In [4]:
postal[0]

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


In [5]:
postal[1]

Unnamed: 0.1,Unnamed: 0,Canadian postal codes,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL,NS,PE,NB,QC,ON,MB,SK,AB,...,L,M,N,P,R,S,T,V,X,Y
1,NL,NS,PE,NB,QC,ON,MB,SK,AB,BC,...,,,,,,,,,,
2,A,B,C,E,G,H,J,K,L,M,...,,,,,,,,,,
3,NL,NS,PE,NB,QC,ON,MB,SK,AB,BC,...,,,,,,,,,,
4,A,B,C,E,G,H,J,K,L,M,...,,,,,,,,,,


### As shown, the approach used returned a list with 2 data frames, the first containing the required information, so, we'll just create a new variable containainig the relevant data frame.

In [6]:
postaldf = postal[0]
postaldf.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


In [7]:
postaldf.Neighborhood.isna().sum()

77

### We'll just ingnore al the observations without a borough assigned

In [8]:
#create a mask for filtering out "not assigned", and remove them
print('Before:', postaldf.shape)
na_mask=postaldf['Borough']!='Not assigned'
postaldf = postaldf[na_mask]
print('After:', postaldf.shape)

Before: (180, 3)
After: (103, 3)


In [9]:
postaldf.reset_index(inplace=True, drop=True)
postaldf.head(6)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue


### Then, the " / " separator for neighborhoods associated with the same code will be replaced by a comma

In [10]:
#get indexes for rows than contains more than one Neighborhood:
plusone_n = postaldf[postaldf['Neighborhood'].str.contains("/")]
plusone_n.index

Int64Index([  2,   3,   4,   6,   8,  11,  12,  17,  18,  28,  30,  31,  33,
             34,  36,  37,  38,  41,  42,  43,  44,  45,  47,  48,  49,  51,
             52,  55,  56,  57,  58,  63,  65,  69,  71,  74,  75,  77,  80,
             81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  93,  96,  97,
             98, 101, 102],
           dtype='int64')

In [11]:
#replace slashes with commas:
postaldf['Neighborhood'] = postaldf['Neighborhood'].str.replace(' /', ',')
postaldf['Neighborhood'].head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


0                                      Parkwoods
1                               Victoria Village
2                      Regent Park, Harbourfront
3               Lawrence Manor, Lawrence Heights
4    Queen's Park, Ontario Provincial Government
5                               Islington Avenue
6                                 Malvern, Rouge
7                                      Don Mills
8                Parkview Hill, Woodbine Gardens
9                       Garden District, Ryerson
Name: Neighborhood, dtype: object

### Checking if there are still NaN or 'Not assigned' values in the 'Neighborhood' column:

In [12]:
postaldf['Neighborhood'].isna().sum()

0

In [13]:
postaldf[postaldf['Neighborhood']=='Not assigned']

Unnamed: 0,Postal code,Borough,Neighborhood


### As there aren't any missing values in Neighborhood rows, no furhter operations are needed for filling them, and at this point it's convenient to check how many rows we have:

In [14]:
postaldf.head(11)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


### ...and let's check how many rows are in the dataframe:

In [15]:
postaldf.shape

(103, 3)

# Get coordinates for each postal code:

### As the link provided is for Toronto, Canada (postal codes starting with "M"), all the analysis will focus in that specific area.

In [16]:
# !conda install -c conda-forge geocoder

In [17]:
#make a try
# import geocoder
# pru = geocoder.google('M6B, Toronto, Ontario')
# pru

### Since geocoder des not work, let's use the csv file provided

In [16]:
geo_coord = pd.read_csv('Geospatial_Coordinates.csv')
geo_coord.head()


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Having the required coordinate data, include it  into our first dataset, 'postaldf' as new columns:

In [17]:
print(geo_coord.shape)
print(postaldf.shape)

(103, 3)
(103, 3)


In [18]:
#to include the longitude, latitude columns, join using as key column Postal Code
geo_coord.rename(columns={'Postal Code': 'Postal code'}, inplace=True)
CAneighbor = postaldf.merge(geo_coord, how='inner', on='Postal code')

In [19]:
CAneighbor.head(12)

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [20]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(CAneighbor['Borough'].unique()),
        CAneighbor.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


In [21]:
CAneighbor['Neighborhood'].value_counts(dropna=False).sort_values(ascending=False).head(10)

Downsview                            4
Don Mills                            2
Willowdale                           2
Hillcrest Village                    1
Davisville                           1
Woodbine Heights                     1
Guildwood, Morningside, West Hill    1
Garden District, Ryerson             1
Westmount                            1
Roselawn                             1
Name: Neighborhood, dtype: int64

### There are some duplicated values in original set, this can be adressed addin suffixes

In [22]:
CAneighbor[CAneighbor['Neighborhood'].duplicated()]

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
13,M3C,North York,Don Mills,43.7259,-79.340923
46,M3L,North York,Downsview,43.739015,-79.506944
53,M3M,North York,Downsview,43.728496,-79.495697
60,M3N,North York,Downsview,43.761631,-79.520999
72,M2R,North York,Willowdale,43.782736,-79.442259


In [23]:
#add  suffix to dup

CAneighbor.loc[13, 'Neighborhood']= 'Don Mills1'
CAneighbor.loc[72, 'Neighborhood']= 'Willowdale1'
CAneighbor.loc[46, 'Neighborhood']= 'Downsview1'
CAneighbor.loc[53, 'Neighborhood']= 'Downsview2'
CAneighbor.loc[60, 'Neighborhood']= 'Downsview3'

In [24]:
CAneighbor[CAneighbor['Neighborhood']=='Don Mills']

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
7,M3B,North York,Don Mills,43.745906,-79.352188


In [25]:
CAneighbor[CAneighbor['Neighborhood']=='Willowdale']

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
59,M2N,North York,Willowdale,43.77012,-79.408493


In [26]:
CAneighbor[CAneighbor['Neighborhood']=='Downsview']

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
40,M3K,North York,Downsview,43.737473,-79.464763


In [27]:
#check if there are still duplicated
CAneighbor[CAneighbor['Neighborhood'].duplicated()]

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude


In [28]:
CAneighbor.reset_index(inplace=True, drop=True)

In [29]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(CAneighbor['Borough'].unique()),
        CAneighbor.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


In [30]:
CAneighbor.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


### Getting the information for longitudes and latitudes in Toronto:

In [31]:
# first, install geocoders for Nominatim

# !conda install -c conda-forge folium=0.5.0 --yes
import folium
from geopy.geocoders import Nominatim

address = 'Toronto, CA'
geolocator = Nominatim(user_agent="CAexplorer")   #arbitray name
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### Creating a map of Toronto, Canada to show the neighborhoods' locations

In [32]:
# create map of New York using latitude and longitude values

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10.2)

# add markers to map
for lat, lng, borough, neighborhood in zip(CAneighbor['Latitude'], CAneighbor['Longitude'], CAneighbor['Borough'], CAneighbor['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

# 2. Exploring venues near each Toronto neighborhood

### Input all required credentials for using Foursquare:

In [33]:
CLIENT_ID = 'OEIQPBBHFH2TRCVPE1OZOD1IHWMSGVMCATXOSKYKQ4XK0RXV' # your Foursquare ID
CLIENT_SECRET = '1BHTGTZUPAWPVHMKFVTD1YXZOGF024LQPAKJWR0FD0LLT5BK' # your Foursquare Secret
VERSION = '20181212' # Foursquare API version


### ...and for the analysis, we explore a maximum of 100 venues in 500 maround each neighborhood 

In [34]:
radius = 500
LIMIT = 100

In [35]:
#data frame with all neighoborhoods in Toronto:
CAneighbor.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [36]:
CAneighbor.shape

(103, 5)

In [37]:
#try the Foursqare url:
import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas datafram

platitude=CAneighbor.iloc[0,3]
plongitude=CAneighbor.iloc[0,4]
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, platitude, plongitude, VERSION, radius, LIMIT)

In [38]:
results=requests.get(url).json()
results.keys()

dict_keys(['meta', 'response'])

In [44]:
results['response']['groups'][0]['items']

[{'reasons': {'count': 0,
   'items': [{'summary': 'This spot is popular',
     'type': 'general',
     'reasonName': 'globalInteractionReason'}]},
  'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
   'name': 'Brookbanks Park',
   'location': {'address': 'Toronto',
    'lat': 43.751976046055574,
    'lng': -79.33214044722958,
    'labeledLatLngs': [{'label': 'display',
      'lat': 43.751976046055574,
      'lng': -79.33214044722958}],
    'distance': 245,
    'cc': 'CA',
    'city': 'Toronto',
    'state': 'ON',
    'country': 'Canada',
    'formattedAddress': ['Toronto', 'Toronto ON', 'Canada']},
   'categories': [{'id': '4bf58dd8d48988d163941735',
     'name': 'Park',
     'pluralName': 'Parks',
     'shortName': 'Park',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_',
      'suffix': '.png'},
     'primary': True}],
   'photos': {'count': 0, 'groups': []}},
  'referralId': 'e-0-4e8d9dcdd5fbbbb6b3003c7b-0'},
 {'reasons': {'count': 0,
   'items': [{'

### Create a function for explore all available venues, using Foursquare API: 

In [45]:
#will iterate over CAnighbor, extracting for each neghborhood,up to 100 venues in a 500m range  
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']   #choose nested info 
        
        # return only relevant information for each nearby venue
        venues_list.append([(                 
            name,                             #at this level the not nested, returns name of neighborhood
            lat,                              #at this level the not nested, returns lat of neighborhood
            lng,                              #at this level the not nested, returns lat of neighborhood   
            v['venue']['name'],               #gives tha name of establishment
            v['venue']['location']['lat'],    #as lat is in nested location dict, this is for the establishment
            v['venue']['location']['lng'],    #as lon is in nested location dict, this is for the establishment
            v['venue']['categories'][0]['name']) for v in results])   #vey nested, gives concept of establishment 

    venues_detail = []            
    for neigh in venues_list:      #iterate to get to list (first level) of the neighbohoods and all its venues
        for detailvenue in neigh:  #to get the tuples inside each venue
            venues_detail.append(detailvenue)
    
    nearby_venues = pd.DataFrame(venues_detail)    
    
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

### ...and use the function to get all the venues by neighborhood

In [46]:
toronto_venues = getNearbyVenues(names=CAneighbor['Neighborhood'],
                                   latitudes=CAneighbor['Latitude'],
                                   longitudes=CAneighbor['Longitude']
                                  )

In [47]:
print(toronto_venues.shape)
print(len(toronto_venues['Neighborhood'].unique())) #control neighborhood number
toronto_venues.head()

(2110, 7)
98


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


### As there are some neighborhods with no venues, update the data frame accordingly

In [49]:
#identify whic are not in resulting frame
unique_invenues = toronto_venues['Neighborhood'].value_counts(dropna=True).index
mask =CAneighbor['Neighborhood'].isin(unique_invenues)
CAneighbor[~mask]

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
11,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.650943,-79.554724
50,M9L,North York,Humber Summit,43.756303,-79.565963
52,M2M,North York,"Willowdale, Newtonbrook",43.789053,-79.408493
95,M1X,Scarborough,Upper Rouge,43.836125,-79.205636


In [50]:
#maintain only the neighborhoods  with venues
toronto_clean=CAneighbor[mask]

In [51]:
toronto_clean.reset_index(drop=True, inplace=True)
toronto_clean.tail()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
93,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
94,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
95,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558
96,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
97,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999


In [54]:
print('There are {} neighborhoods with near venues in Toronto'.format(toronto_clean.shape[0]))

There are 98 neighborhoods with near venues in Toronto


In [55]:
print('There are {} venues in Toronto'.format(toronto_venues.shape[0]))

There are 2110 venues in Toronto


### Let's get the venues count  by neighborhood:

In [65]:
toronto_venues.groupby('Neighborhood').count()['Venue']

Neighborhood
Agincourt                                                                                                          4
Alderwood, Long Branch                                                                                             8
Bathurst Manor, Wilson Heights, Downsview North                                                                   22
Bayview Village                                                                                                    4
Bedford Park, Lawrence Manor East                                                                                 22
Berczy Park                                                                                                       55
Birch Cliff, Cliffside West                                                                                        4
Brockton, Parkdale Village, Exhibition Place                                                                      24
Business reply mail Processing CentrE              

### Let's also find out the number of categories of venues

In [66]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 262 unique categories.


# 3. Analyze each neighborhood

### Measuring the presence of each category of venue, by neighborhood. 

In [67]:
#notice that there is a category "Neighborhood"
# one hot encoding
toronto1hot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto1hot.rename(columns= {'Neighborhood':'Neighborhoods'}, inplace=True)

# # add neighborhood column back to dataframe
toronto1hot['Neighborhood'] = toronto_venues['Neighborhood'] 

# # move neighborhood column to the first place
fixed_columns = [toronto1hot.columns[-1]] + list(toronto1hot.columns[:-1])   #puts the last col in first place
toronto1hot = toronto1hot[fixed_columns]
toronto1hot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### It is important to exlpain that there was a venue with the category 'Neighborhood". I order to avoid conflict with the columns with that name, we just renamed 

In [68]:
toronto1hot.shape

(2110, 263)

### Now, get the mean of occurrence of each venue by neighborhoot

In [69]:
toronto_group = toronto1hot.groupby('Neighborhood').mean().reset_index()
toronto_group.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [70]:
toronto_group.shape

(98, 263)

### Create a data frame with just the top 10 n venues for each neighborhood

In [71]:
#function for extracting top n venues:

def return_most_common_venues(row, num_top_venues):  #will run on each manhattan grouped row
    #takes all the row with means of occurences, excluding the first element (as its the neighborhood name):
    row_categories = row.iloc[1:]                    
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]   #order desc by index as it's oredered, return only "top n" 

In [72]:
num_top_venues = 10
#creates ordinals:
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues, for the df to bre generated
columna = ['Neighborhood']
for ind in np.arange(num_top_venues):      #This creates the columns names 1st, 2nd, ....
    try:   #until "idicators" is depleted
        columna.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columna.append('{}th Most Common Venue'.format(ind+1))   #once is depleted
        
columna[0] #each will be a ordinal with first element being the neighborhood, the following the ordinals(1st, 2nd, ...)


'Neighborhood'

In [73]:
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columna)
#get only the column with the neighborhoods, appropriately ordered
neighborhoods_venues_sorted['Neighborhood'] = toronto_group['Neighborhood'] 
#this will fill each row in the new df with the ordinals: starts in 1 because Neighborhood is already filled in the line above:
for ind in np.arange(toronto_group.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_group.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Latin American Restaurant,Skating Rink,Lounge,Breakfast Spot,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Pharmacy,Sandwich Place,Pub,Skating Rink,Gym,Comfort Food Restaurant,Dance Studio,Drugstore
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Sushi Restaurant,Chinese Restaurant,Shopping Mall,Diner,Bridal Shop,Restaurant,Deli / Bodega,Ice Cream Shop
3,Bayview Village,Café,Bank,Japanese Restaurant,Chinese Restaurant,Yoga Studio,Discount Store,Dessert Shop,Dim Sum Restaurant,Diner,Distribution Center
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Restaurant,Greek Restaurant,Thai Restaurant,Liquor Store,Indian Restaurant,Butcher,Pub


## 4. Cluster the neighborhoods

### Prepare the grouped dataframe

In [74]:
#drop the Neighborhood columns for running kmeans
toronto_group_kmean = toronto_group.drop('Neighborhood', 1)

from sklearn.cluster import KMeans
#define the algorithm:
kclusters = 6
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_group_kmean )
# check cluster labels generated for each row in the dataframe
kmeans.labels_[:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [75]:
#dimension check:
print(toronto_clean.shape)
print(neighborhoods_venues_sorted.shape)

(98, 5)
(98, 11)


### Merge with toronto_clean, containing the coordinate data for each Neighorhood

In [76]:

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

torontomerg = toronto_clean 

# merge toronto_grouped with toronto_clean to add latitude/longitude for each neighborhood
torontomerg = torontomerg.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

torontomerg.tail() # check the last columns!

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
93,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,5,River,Park,Pool,Diner,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Yoga Studio
94,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Yoga Studio,Hotel,Men's Store,Pub,Pizza Place,Smoke Shop
95,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558,1,Light Rail Station,Garden,Restaurant,Fast Food Restaurant,Auto Workshop,Farmers Market,Burrito Place,Pizza Place,Park,Recording Studio
96,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,4,Baseball Field,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run
97,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,1,Convenience Store,Fast Food Restaurant,Supplement Shop,Burrito Place,Tanning Salon,Burger Joint,Discount Store,Hardware Store,Bakery,Sandwich Place


In [77]:
torontomerg.shape

(98, 16)

### Visualize clusters in a map

In [127]:
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.colors as colors

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10, tiles='Stamen Terrain')

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(torontomerg['Latitude'], torontomerg['Longitude'], torontomerg['Neighborhood'], torontomerg['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [117]:
torontomerg.loc[torontomerg['Cluster Labels'] == 0, torontomerg.columns[[2] + list(range(6, torontomerg.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,Scarborough Village,Playground,Convenience Store,Yoga Studio,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
79,"Moore Park, Summerhill East",Playground,Yoga Studio,Distribution Center,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
81,"Milliken, Agincourt North, Steeles East, L'Amo...",Park,Playground,Yoga Studio,Discount Store,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Dog Run


### Cluster 0: The most fequent venues in this cluster are playgrounds, convenience stores and yoga studios, while there are some stores and shops, with less frequency. 

In [116]:
torontomerg.loc[torontomerg['Cluster Labels'] == 1, torontomerg.columns[[2] + list(range(6, torontomerg.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Victoria Village,French Restaurant,Portuguese Restaurant,Coffee Shop,Pizza Place,Hockey Arena,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant
2,"Regent Park, Harbourfront",Coffee Shop,Bakery,Park,Pub,Café,Breakfast Spot,Theater,Electronics Store,Hotel,Spa
3,"Lawrence Manor, Lawrence Heights",Clothing Store,Accessories Store,Arts & Crafts Store,Gift Shop,Furniture / Home Store,Event Space,Miscellaneous Shop,Coffee Shop,Boutique,Women's Store
4,"Queen's Park, Ontario Provincial Government",Coffee Shop,Sushi Restaurant,Diner,Yoga Studio,Creperie,Beer Bar,Sandwich Place,Burger Joint,Burrito Place,Café
6,Don Mills,Café,Gym / Fitness Center,Caribbean Restaurant,Japanese Restaurant,Baseball Field,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
7,"Parkview Hill, Woodbine Gardens",Pizza Place,Bank,Intersection,Fast Food Restaurant,Gym / Fitness Center,Pharmacy,Gastropub,Athletics & Sports,Department Store,Dessert Shop
8,"Garden District, Ryerson",Coffee Shop,Clothing Store,Café,Bubble Tea Shop,Restaurant,Cosmetics Shop,Middle Eastern Restaurant,Japanese Restaurant,Theater,Hotel
9,Glencairn,Bakery,Park,Japanese Restaurant,Italian Restaurant,Playground,Pub,Yoga Studio,Diner,Department Store,Dessert Shop
10,"Rouge Hill, Port Union, Highland Creek",Bar,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run
11,Don Mills1,Restaurant,Gym,Asian Restaurant,Coffee Shop,Beer Store,Italian Restaurant,Dim Sum Restaurant,Discount Store,Sporting Goods Shop,Chinese Restaurant


### Cluster 1: This is the largest cluster, where the presence of cofee shops an cafes is definitely prepomderant. The reamining categoes are very varied, something that is not surprising, taking into account the size of the cluster.

In [120]:
torontomerg.loc[torontomerg['Cluster Labels'] == 2, torontomerg.columns[[2] + list(range(6, torontomerg.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
43,"York Mills, Silver Hills",Cafeteria,Dance Studio,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


### Cluster 2: This cluster is made up by just one neighborhood. Tho most frequent venues in this neighborhood are cafeterias, dance studios and ethophian restaurants.

In [121]:
torontomerg.loc[torontomerg['Cluster Labels'] == 3, torontomerg.columns[[2] + list(range(6, torontomerg.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,"Malvern, Rouge",Fast Food Restaurant,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run


### Cluster 3: This cluster is made up by just one neighborhood. Tho most frequent venues in this neighborhood are fast food restaurants, yoga studios and deli/bodegas.

In [122]:
torontomerg.loc[torontomerg['Cluster Labels'] == 4, torontomerg.columns[[2] + list(range(6, torontomerg.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
49,Downsview2,Food Truck,Baseball Field,Yoga Studio,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Falafel Restaurant
53,"Humberlea, Emery",Food Service,Baseball Field,Yoga Studio,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Falafel Restaurant
96,"Old Mill South, King's Mill Park, Sunnylea, Hu...",Baseball Field,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run


### Cluster 4: Three neighborhoods makes up this cluster, which have a food trucks and food services as the most common types of venues. However there is an important presence of baseball fields and yoga studios in the area. 

In [123]:
torontomerg.loc[torontomerg['Cluster Labels'] == 5, torontomerg.columns[[2] + list(range(6, torontomerg.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Parkwoods,Park,Food & Drink Shop,Dance Studio,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
14,Humewood-Cedarvale,Park,Field,Hockey Arena,Trail,Yoga Studio,Discount Store,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant
19,Caledonia-Fairbanks,Park,Women's Store,Pool,Discount Store,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Distribution Center
33,East Toronto,Park,Convenience Store,Dance Studio,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
38,Downsview,Park,Airport,Bus Stop,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
47,"North Park, Maple Leaf Park, Upwood Park",Park,Bakery,Construction & Landscaping,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run,Dance Studio,Distribution Center
57,Lawrence Park,Swim School,Park,Bus Line,Distribution Center,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Yoga Studio
60,Weston,Park,Convenience Store,Dance Studio,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
62,York Mills West,Park,Convenience Store,Bank,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dance Studio
64,Forest Hill North & West,Sushi Restaurant,Trail,Jewelry Store,Park,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store


### Cluster 5: In this cluster, it is very clear the predominance of parks, convenience stores and dance studios. In much less proportions, convenience stores and convenience stores can be found. A fact that draws the attention of these areas, is the presence of rivers, playgrounds and pools, that reveals that physical actvities are somewhat common. 