## Segmenting and Clustering Neighborhoods in Toronto
### Let's get started
Import the necessary packages.


In [12]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
#  http://beautiful-soup-4.readthedocs.io/en/latest/    # for more advanced web scraping  

import lxml
import html5lib

print('Libraries imported.')

Libraries imported.


Get the postal code data from the Wikipedia page "List of postal codes of Canada: M".

Then turn it into a pandas dataframe.

In [13]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, attrs={"class": "wikitable"})[0]   # 0 is for the 1st table in this particular page
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Clean the data in the dataframe df

In [14]:
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)   # replace 'Not assigned' with NaN
df = df.dropna()
df.reset_index(drop=True, inplace=True)
print("Dropped 'Not assigned' values in the 'Borough' column!")

# pd.set_option('display.max_rows', None)     # show all data in dataframe

# check for misspelled 'Not assigned' values in 'Borough' that have not been replaced
print("Any misspelled 'Not assigned' values in 'Borough'?  " + str(df['Borough'][df['Borough'].str.lower().str.contains('t as', regex=False)]) + ", " + str(df['Borough'][df['Borough'].str.lower().str.contains('not ', regex=False)]) )

# find non-unique (duplicate) postal codes
has_duplicates = df['Postal Code'].duplicated().any()
print("There are " + ("" if has_duplicates else "no ") + "duplicate 'Postal Code' rows!")


# find neighborhoods with 'Not assigned' values
found = df['Neighbourhood'].isin(['Not assigned']).any()
assigned = "Found" if found else "Cannot find any"
print(assigned + " 'Not assigned' values in the 'Neighbourhood' column!")

df.head()

Dropped 'Not assigned' values in the 'Borough' column!
Any misspelled 'Not assigned' values in 'Borough'?  Series([], Name: Borough, dtype: object), Series([], Name: Borough, dtype: object)
There are no duplicate 'Postal Code' rows!
Cannot find any 'Not assigned' values in the 'Neighbourhood' column!


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Let's check the size of our dataframe

In [15]:
df.shape

(103, 3)


# Next Part

Import necessary packages: 

The Google Maps Geocoding API is no longer free, so the geocoder and geopy packages are used instead. Geocoding this way seems somewhat unreliable, so we also load a CSV file of postal code coordinates to fall back on.


In [16]:
df_coordinates=pd.read_csv('../datafiles/Geospatial_Coordinates.csv')
df_coordinates

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437
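
Because the CSV holds one row per postal code, the coordinates could also be attached with a single merge instead of a per-row lookup. The sketch below assumes `pandas` is installed and uses two small stand-in frames rather than the real `df` and `df_coordinates`; the two coordinate rows are copied from the table above, and `X0X` is a made-up code that only shows what an unmatched row looks like.

```python
import pandas as pd

# Small stand-ins for df and df_coordinates; "X0X" is a made-up code
# used only to show what an unmatched row looks like.
areas = pd.DataFrame({"Postal Code": ["M1B", "M1C", "X0X"]})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge keeps every row of `areas`, in its original order,
# and leaves NaN where the CSV has no entry.
merged = areas.merge(coords, on="Postal Code", how="left")

missing = merged[merged["Latitude"].isna()]["Postal Code"].tolist()
print(merged)
print("Postal codes without coordinates:", missing)
```

Listing the unmatched codes makes gaps in the CSV visible before any mapping or clustering is done.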


In [17]:
# pip install geopandas

In [18]:
import geocoder # import geocoder
import geopy
from geopy.geocoders import Nominatim

In [19]:
# https://towardsdatascience.com/geocode-with-python-161ec1e62b89

locator = Nominatim(user_agent="myGeocoder")
location = locator.geocode("Champ de Mars, Paris, France")

In [21]:

# Coordinate arrays
lat_list = np.array([])
long_list = np.array([])

# create the geocoder once and reuse it for every postal code
locator = Nominatim(user_agent="myGeocoder")

for i, pcode in enumerate(df['Postal Code']):
    print(i, pcode)
    # initialize the coordinates to None
    lat_coords = None
    lng_coords = None

    count = 0
    # loop until you get the coordinates
    while lng_coords is None:
        try:
            location = locator.geocode('{}, Toronto, Ontario'.format(pcode))
            lat_coords = location.latitude
            lng_coords = location.longitude
            # print("Coordinates: "+ str(lat_coords) + ", "+str(lng_coords) )
        except Exception:
            # geocode() returned None (AttributeError) or the request itself failed
            lat_coords = None
            lng_coords = None

        if count == 3:  # the limit: fall back to the CSV file
            # print(" Error 'Location' not found in Geocode! Switching to CSV file")
            lat_coords = df_coordinates.loc[df_coordinates['Postal Code'] == pcode, 'Latitude'].to_list()[0]
            lng_coords = df_coordinates.loc[df_coordinates['Postal Code'] == pcode, 'Longitude'].to_list()[0]
            # print("\nCoordinates: "+ str(lat_coords) + ", "+str(lng_coords) )

        count = count + 1

    lat_list = np.append(lat_list, lat_coords)
    long_list = np.append(long_list, lng_coords)

lat_list

0 M3A
1 M4A
 Error 'Location' not found in Geocode! Switching to excel file
2 M5A
 Error 'Location' not found in Geocode! Switching to excel file
3 M6A
 Error 'Location' not found in Geocode! Switching to excel file
4 M7A
5 M9A
 Error 'Location' not found in Geocode! Switching to excel file
6 M1B
7 M3B
 Error 'Location' not found in Geocode! Switching to excel file
8 M4B
 Error 'Location' not found in Geocode! Switching to excel file
9 M5B
 Error 'Location' not found in Geocode! Switching to excel file
10 M6B
 Error 'Location' not found in Geocode! Switching to excel file
11 M9B
 Error 'Location' not found in Geocode! Switching to excel file
12 M1C
13 M3C
14 M4C
 Error 'Location' not found in Geocode! Switching to excel file
15 M5C
 Error 'Location' not found in Geocode! Switching to excel file
16 M6C
 Error 'Location' not found in Geocode! Switching to excel file
17 M9C
18 M1E
 Error 'Location' not found in Geocode! Switching to excel file
19 M4E
 Error 'Location' not found in Geocode

array([43.6534817 , 43.7258823 , 43.6542599 , 43.718518  , 43.6534817 ,
       43.6678556 , 43.6534817 , 43.7459058 , 43.7063972 , 43.6571618 ,
       43.709577  , 43.6509432 , 43.6534817 , 43.7328216 , 43.6953439 ,
       43.6514939 , 43.6937813 , 43.64410993, 43.7635726 , 43.6763574 ,
       43.6421064 , 43.6890256 , 43.76571677, 43.7090604 , 43.6579524 ,
       43.669542  , 43.773136  , 43.8037622 , 43.7543283 , 43.7053689 ,
       43.64990081, 43.6690051 , 43.7447342 , 43.7797719 , 43.7679803 ,
       43.685347  , 43.6392586 , 43.6522219 , 43.7279292 , 43.7869473 ,
       43.7374732 , 43.6795571 , 43.6471768 , 43.63709691, 43.7111117 ,
       43.7574902 , 43.7390146 , 43.6727601 , 43.6481985 , 43.7137562 ,
       43.7563033 , 43.716316  , 43.7859621 , 43.7284964 , 43.6595255 ,
       43.7332825 , 43.6911158 , 43.7247659 , 43.692657  , 43.77923857,
       43.7616313 , 43.7280205 , 43.7116948 , 43.67455325, 43.706876  ,
       43.7574096 , 43.7527583 , 43.7127511 , 43.6969476 , 43.66

In [22]:
lat_list.shape

(103,)

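The shape is correct: one latitude per postal code (103 rows). Note that the value 43.6534817 appears several times in the array. It is the same latitude that geopy returns for Toronto as a whole further down, so for those postal codes the geocoder seems to have resolved only the city, not the postal code itself.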
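
The retry-then-fall-back pattern used in the loop above can be sketched on its own, without any network access. Everything here is a stand-in: `coords_with_fallback`, `FakeLocation` and `fake_geocode` are hypothetical names, and the coordinates returned by the fake geocoder are obviously artificial. The sketch also returns where each result came from, which makes it easy to see afterwards which rows relied on the CSV.

```python
def coords_with_fallback(pcode, geocode, fallback, max_tries=3):
    """Try `geocode` up to `max_tries` times, then fall back to a lookup table.

    `geocode` is any callable returning an object with .latitude/.longitude
    (or None); `fallback` maps postal code -> (lat, lng). Both are stand-ins.
    """
    for _ in range(max_tries):
        try:
            location = geocode('{}, Toronto, Ontario'.format(pcode))
        except Exception:
            location = None  # treat a failed request like "not found"
        if location is not None:
            return location.latitude, location.longitude, 'geocoder'
    lat, lng = fallback[pcode]
    return lat, lng, 'csv'


class FakeLocation:
    """Minimal stand-in for a geocoding result."""
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    # pretend the service only knows one postal code (artificial values)
    if query.startswith('M3A'):
        return FakeLocation(1.0, 2.0)
    return None


table = {'M1B': (43.806686, -79.194353)}   # one row taken from the CSV above
print(coords_with_fallback('M3A', fake_geocode, table))
print(coords_with_fallback('M1B', fake_geocode, table))
```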

In [23]:
df["Latitude"]=lat_list
df["Longitude"]=long_list
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.653482,-79.383935
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.653482,-79.383935
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


# Analysis

Explore and cluster the neighborhoods. Begin by importing the necessary libraries.

In [24]:
import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

The credentials for Foursquare are kept in another file, which is imported here.

In [25]:
# some_file.py
import sys
# insert at 1, 0 is the script path (or '' in REPL)
sys.path.insert(1, '../credentials/')

import config
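
The contents of `config.py` are not shown in this notebook. One possible layout, sketched below as an assumption, is a module that reads the two values from environment variables so that they never appear in the notebook itself. The attribute names `client_id` and `client_secret` match how `config` is used later; the environment variable names are made up for illustration.

```python
import os

# Hypothetical stand-in for ../credentials/config.py: read the two values
# from environment variables (the variable names are assumptions).
client_id = os.environ.get('FOURSQUARE_CLIENT_ID', '')
client_secret = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

if not client_id or not client_secret:
    print('Foursquare credentials are not set.')
```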

Continue exploring the data.

In [44]:
print('The dataframe of size {} has {} boroughs and {} neighborhoods.'.format(
        df.shape,
        len(df['Borough'].unique()),
        len(df['Neighbourhood'].unique())
    )
)

The dataframe of size (103, 5) has 10 boroughs and 99 neighborhoods.


### Use geopy library to get the latitude and longitude values of Toronto

In [45]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
to_latitude = location.latitude
to_longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(to_latitude, to_longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [46]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[to_latitude, to_longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Define Foursquare Credentials and Version

In [30]:
# the credentials for Foursquare are stored in another file
CLIENT_ID=config.client_id
CLIENT_SECRET=config.client_secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

### Explore the first neighborhood of Toronto

In [31]:
df.loc[0, 'Neighbourhood']

'Parkwoods'

In [33]:
neighborhood_latitude = df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.6534817, -79.3839347.


### Now, let's get the top 100 venues that are in 'Parkwoods' within 500 meters.

First, create the GET request URL

In [34]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

In [38]:
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
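
As an alternative to formatting the query string by hand, the same URL can be assembled from a dictionary with the standard library, which also takes care of escaping. This is only a sketch: `explore_url` is a hypothetical helper, and `MY_ID` / `MY_SECRET` are placeholders, not real credentials. The endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build the venues/explore URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# placeholder credentials, not real ones
url = explore_url('MY_ID', 'MY_SECRET', 43.6534817, -79.3839347, '20180605', 500, 100)
print(url)

# round trip: decode the query string again to check the parameters
print(parse_qs(urlsplit(url).query)['ll'])
```

`requests.get` also accepts a `params` dictionary directly, which would avoid building the URL string at all.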

Send the GET request and examine the results.

In [39]:
results = requests.get(url).json()

Use the function 'get_category_type' defined below to extract each venue's category from the JSON response.

In [40]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [41]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Downtown Toronto,Neighborhood,43.653232,-79.385296
1,Nathan Phillips Square,Plaza,43.65227,-79.383516
2,Eggspectation Bell Trinity Square,Breakfast Spot,43.653144,-79.38198
3,Japango,Sushi Restaurant,43.655268,-79.385165
4,Indigo,Bookstore,43.653515,-79.380696


In [42]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

90 venues were returned by Foursquare.


## Explore all the neighborhoods in Toronto 

Create a function to repeat the same process for all neighborhoods in Toronto.

In [47]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
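
The list comprehension above indexes `categories[0]` directly, which would raise an `IndexError` for any venue that comes back with an empty category list. A more defensive variant of that step could look like the sketch below. `venue_rows` is a hypothetical helper, and the two items are hand-written to mimic the shape of the response used above; they are not real API output.

```python
def venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style 'items' into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# hand-written items shaped like the response used above, not real API output
items = [
    {'venue': {'name': 'Venue A', 'location': {'lat': 1.0, 'lng': 2.0},
               'categories': [{'name': 'Plaza'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 3.0, 'lng': 4.0},
               'categories': []}},
]
rows = venue_rows('Parkwoods', 43.653482, -79.383935, items)
for row in rows:
    print(row)
```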

Run the above function.

In [53]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [54]:
print(toronto_venues.shape)
toronto_venues.head()

(2598, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.653482,-79.383935,Downtown Toronto,43.653232,-79.385296,Neighborhood
1,Parkwoods,43.653482,-79.383935,Nathan Phillips Square,43.65227,-79.383516,Plaza
2,Parkwoods,43.653482,-79.383935,Eggspectation Bell Trinity Square,43.653144,-79.38198,Breakfast Spot
3,Parkwoods,43.653482,-79.383935,Japango,43.655268,-79.385165,Sushi Restaurant
4,Parkwoods,43.653482,-79.383935,Indigo,43.653515,-79.380696,Bookstore


In [55]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 281 unique categories.


## Analyze each Neighbourhood

In [56]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,...,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [57]:
toronto_onehot.shape

(2598, 281)
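
One detail in the output above is worth a second look: the first column is `Yoga Studio` rather than `Neighborhood`, and the column count (281) equals the number of unique categories. A likely explanation is that one of the venue categories is itself called `Neighborhood` (it appears in the first row of `toronto_venues`), so the assignment overwrites that dummy column in place and the reordering step then moves the last column to the front. The sketch below reproduces the effect on a tiny made-up frame and shows one way around it, assuming `pandas` is installed; the key column name `Neighborhood Name` is an arbitrary choice for illustration.

```python
import pandas as pd

# tiny made-up frame; one venue has the category "Neighborhood"
venues = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village"],
    "Venue Category": ["Neighborhood", "Plaza", "Pizza Place"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
collides = "Neighborhood" in onehot.columns   # the dummy column already uses this name

# what the original assignment does: it replaces the dummy column in place
clobbered = onehot.copy()
clobbered["Neighborhood"] = venues["Neighborhood"]

# one way around it: use a key column name that no category uses
KEY = "Neighborhood Name"
safe = onehot.copy()
safe.insert(0, KEY, venues["Neighborhood"])
grouped = safe.groupby(KEY).mean().reset_index()

print(collides, clobbered.shape, safe.shape)
print(grouped)
```

With the separate key column, the `Neighborhood` category keeps its own frequency column and still takes part in the clustering.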

Next, let's group the rows by neighborhood and take the mean of each category column, which gives the relative frequency of each category in that neighborhood.

In [67]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,...,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [60]:
toronto_grouped.shape

(96, 281)
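
To see why the mean of the one-hot rows is a frequency, here is the same calculation done by hand on a made-up list of categories for a single neighbourhood. Each value is simply count divided by total, so a value such as 0.045455 in the table above is what one venue out of 22 would produce.

```python
from collections import Counter

# hypothetical venue categories for one neighbourhood
categories = ['Coffee Shop', 'Coffee Shop', 'Café', 'Bank']

counts = Counter(categories)
total = len(categories)

# relative frequency = count / total, the same number that the mean of the
# one-hot rows produces for each category column
freq = {cat: n / total for cat, n in counts.items()}
print(freq)
```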

Let's print each neighborhood along with the top 5 most common venues

In [82]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

  venue  freq
0   Chinese Restaurant  0.25
1                 Café  0.25
2  Japanese Restaurant  0.25
3                 Bank  0.25
4           Nail Salon  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0      Sandwich Place  0.09
1         Coffee Shop  0.09
2  Italian Restaurant  0.09
3     Thai Restaurant  0.05
4                Café  0.05


----Berczy Park----
           venue  freq
0    Coffee Shop  0.15
1  Boat or Ferry  0.07
2     Restaurant  0.04
3          Hotel  0.04
4         Bakery  0.03


----Birch Cliff, Cliffside West----
                     venue  freq
0          College Stadium  0.25
1    General Entertainment  0.25
2             Skating Rink  0.25
3                     Café  0.25
4  North Indian Restaurant  0.00


----Brockton, Parkdale Village, Exhibition Place----
                venue  freq
0  Tibetan Restaurant  0.09
1          Restaurant  0.06
2            Pharmacy  0.06
3                Café  0.06
4                 Bar  0.06


----Busi

Let's put that into a pandas dataframe

First, let's write a function to sort the venues in descending order

In [83]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
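
Note that this function always returns `num_top_venues` names, even when a neighbourhood has fewer categories than that. The printed lists above already show the effect: `Nail Salon` is listed with a frequency of 0.00. A variant that skips zero-frequency categories and breaks ties alphabetically could look like the sketch below. `top_venues` is a hypothetical helper working on a plain dictionary, and the sample row reuses the frequencies printed above.

```python
def top_venues(freqs, n):
    """Return up to n category names by descending frequency, skipping zeros.

    Ties are broken alphabetically so the result is reproducible.
    """
    nonzero = [(cat, f) for cat, f in freqs.items() if f > 0]
    ranked = sorted(nonzero, key=lambda item: (-item[1], item[0]))
    return [cat for cat, _ in ranked[:n]]


row = {'Bank': 0.25, 'Café': 0.25, 'Chinese Restaurant': 0.25,
       'Japanese Restaurant': 0.25, 'Nail Salon': 0.0}
print(top_venues(row, 5))
```

The result can be shorter than `n`, so a dataframe built from it would need padding (for example with `None`) for neighbourhoods with few venues.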

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [230]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Breakfast Spot,Latin American Restaurant,Clothing Store,Lounge,Skating Rink,Falafel Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space
1,"Alderwood, Long Branch",Pizza Place,Pub,Coffee Shop,Gym,Sandwich Place,Pharmacy,Fast Food Restaurant,Field,Farmers Market,Falafel Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Restaurant,Ice Cream Shop,Sushi Restaurant,Chinese Restaurant,Supermarket,Gas Station,Sandwich Place,Diner
3,Bayview Village,Bank,Chinese Restaurant,Japanese Restaurant,Café,Event Space,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Xinjiang Restaurant
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Liquor Store,Café,Restaurant,Juice Bar,Thai Restaurant,Pub,Indian Restaurant


## Cluster Neighborhoods
Run k-means to cluster the neighborhoods into 3 clusters.

In [231]:
# set number of clusters
kclusters = 3

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
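
For intuition about what the k-means step is doing, here is a plain-Python sketch of Lloyd's algorithm, the standard iteration behind k-means, run on four made-up two-category frequency vectors. It is not scikit-learn's implementation, which among other things uses a smarter initialization (`k-means++` by default). The function name `lloyd_kmeans`, the starting centroids and the toy data are all assumptions for illustration.

```python
def lloyd_kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on lists of floats; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # assignment step: nearest centroid by squared Euclidean distance
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


# made-up frequency vectors: [share of parks, share of coffee shops]
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = lloyd_kmeans(points, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(labels)
print(centroids)
```

Neighbourhoods with similar category frequencies end up with the same label, which is all the cluster labels in this notebook express.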

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [232]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

Before merging, let's check which neighbourhoods appear in only one of the two dataframes.

In [247]:

x=np.append(toronto_merged['Neighbourhood'].unique(), neighborhoods_venues_sorted['Neighborhood'].unique())

print("neighborhoods_venues_sorted shape ", neighborhoods_venues_sorted['Neighborhood'].unique().shape) 
print("toronto_merged shape ", np.unique(toronto_merged['Neighbourhood']).shape) 
print("appended shape ",np.unique(x).shape)


# Consider each row as indexing tuple & get linear indexing value             
# lid = np.ravel_multi_index(x.T,x.max(0)+1)
lid=x

# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)

# See which counts are exactly 1 and select the corresponding unique indices
# and thus the corresponding entries from the input as the final output
out = x[idx[count==1]]

# elements that are not in both 'toronto_merged' and 'neighborhoods_venues_sorted' (probably only in 'toronto_merged')
out



neighborhoods_venues_sorted shape  (96,)
toronto_merged shape  (99,)
appended shape  (99,)


array(['Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood',
       'Islington Avenue, Humber Valley Village', 'Upper Rouge'],
      dtype=object)
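
The same comparison can be written with Python sets, which also tells us on which side each unmatched name lives. The two lists below are short stand-ins for `toronto_merged['Neighbourhood']` and `neighborhoods_venues_sorted['Neighborhood']`, using names that appear in this notebook.

```python
# stand-in for toronto_merged['Neighbourhood']
left = ['Parkwoods', 'Upper Rouge', 'Agincourt']
# stand-in for neighborhoods_venues_sorted['Neighborhood']
right = ['Parkwoods', 'Agincourt']

only_left = sorted(set(left) - set(right))    # no venues were returned for these
only_right = sorted(set(right) - set(left))   # would indicate a naming mismatch

print(only_left)
print(only_right)
```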

In [234]:
# merge toronto_merged with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.653482,-79.383935,1.0,Coffee Shop,Clothing Store,Hotel,Office,Plaza,Restaurant,Diner,Bookstore,Thai Restaurant,Theater
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Intersection,Hockey Arena,Portuguese Restaurant,Pizza Place,Coffee Shop,Xinjiang Restaurant,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Distribution Center,Shoe Store,French Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Clothing Store,Women's Store,Vietnamese Restaurant,Boutique,Furniture / Home Store,Event Space,Coffee Shop,Gift Shop,Accessories Store,Deli / Bodega
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.653482,-79.383935,1.0,Coffee Shop,Clothing Store,Hotel,Office,Plaza,Restaurant,Diner,Bookstore,Thai Restaurant,Theater


In [235]:

print("toronto_merged cluster labels ",toronto_merged['Cluster Labels'].unique())
# toronto_merged[toronto_merged['Cluster Labels']== np.nan]
toronto_merged['Cluster Labels'].isna().sum() # how many nans? 

# Either give the NaN rows their own cluster label or drop them:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(kclusters)   # use kclusters as a placeholder label for the NaN rows
# toronto_merged = toronto_merged.dropna(subset=['Cluster Labels'])                     # alternative: drop the NaN rows

toronto_merged['Cluster Labels']=toronto_merged['Cluster Labels'].astype(int) # change type from float to int

toronto_merged cluster labels  [ 1. nan  0.  2.]
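
The idea behind the placeholder label can be shown without pandas. After the join, a neighbourhood with no venues has no k-means label, so it gets the next free integer (`kclusters`) instead. The dictionary below is a small stand-in for the real labels, using values that appear in the tables of this notebook.

```python
kclusters = 3
labels = {'Parkwoods': 1, 'Victoria Village': 1}      # neighbourhood -> k-means label
neighbourhoods = ['Parkwoods', 'Victoria Village', 'Upper Rouge']

# dict.get supplies the placeholder when a neighbourhood has no label
assigned = [labels.get(name, kclusters) for name in neighbourhoods]
unlabelled = [name for name in neighbourhoods if name not in labels]

print(assigned)
print(unlabelled)
```

Keeping the list of unlabelled names around makes it clear later that label 3 is not a real cluster.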


In [248]:
# toronto_merged['Cluster Labels']

Finally, let's visualize the resulting clusters

In [249]:
# create map
map_clusters = folium.Map(location=[to_latitude, to_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters+1)
ys = [i + x + (i*x)**2 for i in range(kclusters+1)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
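
A note on the colors: `rainbow[cluster-1]` works for label 0 only because a negative index wraps around to the last element of the list. Indexing a palette directly by label is easier to follow. The sketch below builds such a palette with the standard library instead of matplotlib; `cluster_colors` is a hypothetical helper, and the saturation and brightness values are arbitrary.

```python
import colorsys


def cluster_colors(n):
    """Return n evenly spaced hex colors, one per cluster label 0..n-1."""
    colors_hex = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        colors_hex.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return colors_hex


palette = cluster_colors(4)   # 3 k-means clusters + 1 placeholder label
print(palette)
```

A marker for a given row would then use `palette[cluster]` for both `color` and `fill_color`.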

## Examine clusters

### Cluster 1

In [238]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,York,0,Park,Pool,Women's Store,Afghan Restaurant,Falafel Restaurant,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant
33,North York,0,Park,Pharmacy,Pizza Place,Convenience Store,Middle Eastern Restaurant,Coffee Shop,Thai Restaurant,Wine Shop,Cuban Restaurant,Drugstore
35,East York,0,Convenience Store,Park,Intersection,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market
49,North York,0,Construction & Landscaping,Park,Bakery,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market,Drugstore
61,Central Toronto,0,Park,Bus Line,Swim School,Dumpling Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Xinjiang Restaurant
64,York,0,Park,Xinjiang Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant
66,North York,0,Convenience Store,Park,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant
68,Central Toronto,0,Sushi Restaurant,Park,Trail,Jewelry Store,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Fish & Chips Shop
69,West Toronto,0,Bowling Alley,Park,Residential Building (Apartment / Condo),Convenience Store,Cuban Restaurant,Curling Ice,Food,Flower Shop,Fish Market,Fish & Chips Shop
85,Scarborough,0,Playground,Bakery,Park,Xinjiang Restaurant,Falafel Restaurant,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant


### Cluster 2

In [239]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1,Coffee Shop,Clothing Store,Hotel,Office,Plaza,Restaurant,Diner,Bookstore,Thai Restaurant,Theater
1,North York,1,Intersection,Hockey Arena,Portuguese Restaurant,Pizza Place,Coffee Shop,Xinjiang Restaurant,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room
2,Downtown Toronto,1,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Distribution Center,Shoe Store,French Restaurant
3,North York,1,Clothing Store,Women's Store,Vietnamese Restaurant,Boutique,Furniture / Home Store,Event Space,Coffee Shop,Gift Shop,Accessories Store,Deli / Bodega
4,Downtown Toronto,1,Coffee Shop,Clothing Store,Hotel,Office,Plaza,Restaurant,Diner,Bookstore,Thai Restaurant,Theater
...,...,...,...,...,...,...,...,...,...,...,...,...
98,Etobicoke,1,Pool,River,Doner Restaurant,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space
99,Downtown Toronto,1,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Yoga Studio,Men's Store,Pub,Mediterranean Restaurant,Hotel
100,East Toronto,1,Comic Shop,Auto Workshop,Park,Recording Studio,Restaurant,Butcher,Skate Park,Farmers Market,Fast Food Restaurant,Burrito Place
101,Etobicoke,1,Construction & Landscaping,Baseball Field,Farmers Market,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Fast Food Restaurant,Dumpling Restaurant


### Cluster 3

In [243]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
45,North York,2,Martial Arts School,Xinjiang Restaurant,Farmers Market,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Fast Food Restaurant


### Cluster 4

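These are the rows with missing values: neighbourhoods that received no k-means label and were given the placeholder label 3 (the value of `kclusters`) above.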

In [241]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Etobicoke,3,,,,,,,,,,
17,Etobicoke,3,,,,,,,,,,
95,Scarborough,3,,,,,,,,,,
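
The rows without venue data can also be located directly, independent of the cluster label they ended up with. The following is a minimal sketch on a made-up stand-in for `toronto_merged`; the frame `toy` and its values are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for toronto_merged (illustrative values only)
toy = pd.DataFrame({
    'Borough': ['Etobicoke', 'North York', 'Scarborough'],
    'Cluster Labels': [3, 1, 3],
    '1st Most Common Venue': [np.nan, 'Coffee Shop', np.nan],
    '2nd Most Common Venue': [np.nan, 'Park', np.nan],
})

# Pick the venue ranking columns by their shared name suffix
venue_cols = [c for c in toy.columns if c.endswith('Most Common Venue')]

# Keep the rows in which every venue column is missing
no_venues = toy[toy[venue_cols].isna().all(axis=1)]
print(no_venues[['Borough', 'Cluster Labels']])
```

Applied to the real frame, the same mask should select the rows listed above, provided their venue columns hold NaN rather than empty strings.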


In [245]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


In [246]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
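
The last two queries return no rows, which suggests that labels 4 and 5 are not assigned to any neighbourhood. Instead of querying each label by hand, the number of rows per label can be listed in one step. A minimal sketch, assuming a `Cluster Labels` column like the one in `toronto_merged`; the toy values are made up.

```python
import pandas as pd

# Toy cluster labels (illustrative values only)
toy = pd.DataFrame({'Cluster Labels': [1, 1, 0, 2, 1, 3, 0]})

# Number of rows per label, ordered by label; labels that never occur are absent
cluster_sizes = toy['Cluster Labels'].value_counts().sort_index()
print(cluster_sizes)
```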


Sorted by cluster label, then by 1st and 2nd most common venue

In [399]:
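# drop the location columns so that only the cluster label and the venue rankings remain
to_me_test = toronto_merged.drop(['Postal Code','Borough','Neighbourhood','Latitude','Longitude'], axis='columns')
# group by cluster label and the two top venues; for every other column keep its most frequent value
to_me_test.groupby(['Cluster Labels', '1st Most Common Venue','2nd Most Common Venue']).agg(lambda x: x.value_counts().index[0])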

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
Cluster Labels,1st Most Common Venue,2nd Most Common Venue,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,Bowling Alley,Park,Residential Building (Apartment / Condo),Convenience Store,Cuban Restaurant,Curling Ice,Food,Flower Shop,Fish Market,Fish & Chips Shop
0,Construction & Landscaping,Park,Bakery,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market,Drugstore
0,Convenience Store,Park,Intersection,Dumpling Restaurant,Electronics Store,Electronics Store,Escape Room,Event Space,Falafel Restaurant,Farmers Market
0,Park,Bus Line,Swim School,Dumpling Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Xinjiang Restaurant
0,Park,Pharmacy,Pizza Place,Convenience Store,Middle Eastern Restaurant,Coffee Shop,Thai Restaurant,Wine Shop,Cuban Restaurant,Drugstore
...,...,...,...,...,...,...,...,...,...,...
1,Skating Rink,Beer Store,Curling Ice,Dance Studio,Park,Athletics & Sports,Xinjiang Restaurant,Escape Room,Ethiopian Restaurant,Event Space
1,Sporting Goods Shop,Coffee Shop,Burger Joint,Bank,Furniture / Home Store,Shopping Mall,Brewery,Breakfast Spot,Sports Bar,Smoothie Shop
1,Thai Restaurant,Hakka Restaurant,Caribbean Restaurant,Gas Station,Athletics & Sports,Bank,Bakery,Fried Chicken Joint,Fast Food Restaurant,Farmers Market
1,Tibetan Restaurant,Café,Pharmacy,Restaurant,Bar,Indian Restaurant,Light Rail Station,Thrift / Vintage Store,Liquor Store,Fast Food Restaurant


# Insights

### Display most common venues in the different clusters

In [348]:
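# display several dataframes side by side in one output cell
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    # rewrite only the opening <table tags, so closing tags and cell text containing 'table' stay untouched
    display_html(html_str.replace('<table','<table style="display:inline"'),raw=True)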
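
The string handling behind the side-by-side display can be checked outside a notebook. The sketch below builds the combined HTML for two made-up frames and only returns the string instead of displaying it. `side_by_side_html` is a hypothetical helper name, and this variant rewrites only the opening `<table` tags.

```python
import pandas as pd

def side_by_side_html(*frames):
    """Return one HTML string in which every frame is an inline table."""
    html = ''.join(frame.to_html() for frame in frames)
    # Only opening tags are rewritten, so closing </table> tags stay valid
    return html.replace('<table', '<table style="display:inline"')

# Two toy count tables (illustrative values only)
left = pd.DataFrame({'count': [5, 2]}, index=['Park', 'Convenience Store'])
right = pd.DataFrame({'count': [5, 1]}, index=['Park', 'Bakery'])

html = side_by_side_html(left, right)
print(html.count('<table style="display:inline"'))
```

In a notebook, passing the returned string to `display_html(..., raw=True)` renders the tables next to each other.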

In [398]:
col_array=np.array(['1st Most Common Venue','2nd Most Common Venue','3rd Most Common Venue','4th Most Common Venue','5th Most Common Venue','6th Most Common Venue','7th Most Common Venue','8th Most Common Venue','9th Most Common Venue','10th Most Common Venue', 'Borough'])  # columns
display_list=[0,0,0,0] # temporarily stores the dataframes to be displayed; one slot per venue column shown, so its length must match the inner loop's range(0,4)

for index in range(0,kclusters): 

    for i in range(0,4):
        
        to_test_cl = toronto_merged.loc[toronto_merged['Cluster Labels'] == index, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
        
        # remove all columns except the venue column of interest and 'Cluster Labels', which is turned into a count column
        to_test_cl=to_test_cl.drop( col_array[np.arange(len(col_array))!= i]  ,axis='columns')
        
        to_test_cl=to_test_cl.groupby( [col_array[ i ]] ).agg(['count']) # count the number of times each venue appears
        to_test_cl=to_test_cl['Cluster Labels']
        display_list[i]=to_test_cl.sort_values(['count'], ascending=[ False])  # sort by highest count first
    
    print("--- cluster "+ str(index)+" ---") 
    display_side_by_side(display_list[0],display_list[1],display_list[2],display_list[3])

--- cluster 0 ---


Unnamed: 0_level_0,count
1st Most Common Venue,Unnamed: 1_level_1
Park,5
Convenience Store,2
Bowling Alley,1
Construction & Landscaping,1
Playground,1
Sushi Restaurant,1

Unnamed: 0_level_0,count
2nd Most Common Venue,Unnamed: 1_level_1
Park,5
Bakery,1
Bus Line,1
Pharmacy,1
Playground,1
Pool,1
Xinjiang Restaurant,1

Unnamed: 0_level_0,count
3rd Most Common Venue,Unnamed: 1_level_1
Donut Shop,2
Trail,2
Bakery,1
Intersection,1
Park,1
Pizza Place,1
Residential Building (Apartment / Condo),1
Swim School,1
Women's Store,1

Unnamed: 0_level_0,count
4th Most Common Venue,Unnamed: 1_level_1
Dumpling Restaurant,3
Convenience Store,2
Xinjiang Restaurant,2
Afghan Restaurant,1
Eastern European Restaurant,1
Fish Market,1
Jewelry Store,1


--- cluster 1 ---


Unnamed: 0_level_0,count
1st Most Common Venue,Unnamed: 1_level_1
Coffee Shop,21
Grocery Store,7
Café,6
Pizza Place,4
Indian Restaurant,3
Bakery,3
Clothing Store,2
Playground,2
Pool,2
Korean Restaurant,2

Unnamed: 0_level_0,count
2nd Most Common Venue,Unnamed: 1_level_1
Coffee Shop,9
Pizza Place,8
Café,6
Park,6
Clothing Store,5
Pub,3
Sandwich Place,3
Chinese Restaurant,2
Italian Restaurant,2
Xinjiang Restaurant,2

Unnamed: 0_level_0,count
3rd Most Common Venue,Unnamed: 1_level_1
Hotel,8
Coffee Shop,8
Xinjiang Restaurant,5
Restaurant,5
Home Service,4
Sandwich Place,4
Park,4
Fast Food Restaurant,3
Dog Run,2
Shoe Store,2

Unnamed: 0_level_0,count
4th Most Common Venue,Unnamed: 1_level_1
Restaurant,4
Pizza Place,4
Office,4
Gym / Fitness Center,4
Café,4
Xinjiang Restaurant,3
Sushi Restaurant,3
Gym,3
Electronics Store,3
Dumpling Restaurant,2


--- cluster 2 ---


Unnamed: 0_level_0,count
1st Most Common Venue,Unnamed: 1_level_1
Martial Arts School,1

Unnamed: 0_level_0,count
2nd Most Common Venue,Unnamed: 1_level_1
Xinjiang Restaurant,1

Unnamed: 0_level_0,count
3rd Most Common Venue,Unnamed: 1_level_1
Farmers Market,1

Unnamed: 0_level_0,count
4th Most Common Venue,Unnamed: 1_level_1
Eastern European Restaurant,1


### Cluster summary

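Cluster 0 consists mostly of neighbourhoods with Parks: Park is the 1st most common venue in 5 neighbourhoods and the 2nd most common in another 5.

Cluster 1 is dominated by Coffee Shops (1st most common venue in 21 neighbourhoods), followed by Grocery Stores (7) and Cafés (6).

Cluster 2 contains only one neighbourhood (a single data point), so it is not much of a cluster.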
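
A summary like this can also be derived programmatically by counting, per cluster, how often each venue category is ranked first and keeping the top entry. The sketch below runs on made-up data shaped like `toronto_merged`; the frame `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for toronto_merged (illustrative values only)
toy = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 1, 1, 1, 2],
    '1st Most Common Venue': ['Park', 'Park', 'Playground',
                              'Coffee Shop', 'Coffee Shop', 'Café',
                              'Martial Arts School'],
})

# Count how often each venue category is ranked first within each cluster
counts = (toy.groupby(['Cluster Labels', '1st Most Common Venue'])
             .size()
             .rename('count')
             .reset_index())

# Keep the highest-count row of each cluster
top = (counts.sort_values(['Cluster Labels', 'count'], ascending=[True, False])
             .drop_duplicates('Cluster Labels')
             .set_index('Cluster Labels'))
print(top)
```

A cluster whose top count equals its total number of rows and is 1, like cluster 2 here, is a single data point rather than a real group.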