# Segmenting and Clustering Neighborhoods in Toronto
## Brian's Week 3 project notebook

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

In [2]:
# get the content for the Canadian postal codes
cpc_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
cpc_content = requests.get(cpc_url).text
cpc_content



'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of postal codes of Canada: M - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3ad865cd-79b8-49be-b442-e7b7797acd3d","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":979555370,"wgRevisionId":979555370,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communications in Ontar

Scrape the contents for the table we want

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(cpc_content,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3ad865cd-79b8-49be-b442-e7b7797acd3d","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":979555370,"wgRevisionId":979555370,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communicati

In [4]:
cpc_table = soup.find('table',{'class':'wikitable sortable'})
cpc_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

In [5]:
# build raw dataframe for postal code data
rows = []
for tr in cpc_table.find_all('tr'):
    tds = tr.find_all('td')
    if not tds:  # the header row holds only <th> cells, so it is skipped here
        continue
    postalcode, borough, neighborhood = [td.text.strip() for td in tds[:3]]
    print('pc:{}, bor:{}, n:{}'.format(postalcode, borough, neighborhood))
    rows.append({'PostalCode': postalcode, 'Borough': borough, 'Neighborhood': neighborhood})
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
cpc_raw_df = pd.DataFrame(rows, columns=['PostalCode','Borough','Neighborhood'])
cpc_raw_df
    

pc:M1A, bor:Not assigned, n:Not assigned
pc:M2A, bor:Not assigned, n:Not assigned
pc:M3A, bor:North York, n:Parkwoods
pc:M4A, bor:North York, n:Victoria Village
pc:M5A, bor:Downtown Toronto, n:Regent Park, Harbourfront
pc:M6A, bor:North York, n:Lawrence Manor, Lawrence Heights
pc:M7A, bor:Downtown Toronto, n:Queen's Park, Ontario Provincial Government
pc:M8A, bor:Not assigned, n:Not assigned
pc:M9A, bor:Etobicoke, n:Islington Avenue, Humber Valley Village
pc:M1B, bor:Scarborough, n:Malvern, Rouge
pc:M2B, bor:Not assigned, n:Not assigned
pc:M3B, bor:North York, n:Don Mills
pc:M4B, bor:East York, n:Parkview Hill, Woodbine Gardens
pc:M5B, bor:Downtown Toronto, n:Garden District, Ryerson
pc:M6B, bor:North York, n:Glencairn
pc:M7B, bor:Not assigned, n:Not assigned
pc:M8B, bor:Not assigned, n:Not assigned
pc:M9B, bor:Etobicoke, n:West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
pc:M1C, bor:Scarborough, n:Rouge Hill, Port Union, Highland Creek
pc:M2C, bor:Not assigned, n

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


## Analyze the raw data and make a clean version

In [6]:
# business rules for cleaning postal codes
# drop rows where Borough is 'Not assigned'
# merge any repeated postal codes by appending neighborhoods with comma delimiter
#   (the counts printed below show whether any postal code repeats)
# if Borough is assigned but Neighborhood == 'Not assigned', set Neighborhood to Borough
print('Raw Shape ', cpc_raw_df.shape)
print('Raw Distinct postal code count', len(cpc_raw_df.PostalCode.unique()))
print('Raw Distinct Boroughs count', len(cpc_raw_df.Borough.unique()))
print('Raw Distinct Boroughs list:\n', cpc_raw_df.Borough.value_counts())

# make clean df: filter and patch in one pass instead of appending row by row
# (DataFrame.append was removed in pandas 2.0)
cpc_df = cpc_raw_df[cpc_raw_df['Borough'] != 'Not assigned'].copy()
unnamed = cpc_df['Neighborhood'] == 'Not assigned'
cpc_df.loc[unnamed, 'Neighborhood'] = cpc_df.loc[unnamed, 'Borough']
cpc_df = cpc_df.reset_index(drop=True)

print('\nCleaned Distinct postal code count', len(cpc_df.PostalCode.unique()))
print('Cleaned Distinct Boroughs count', len(cpc_df.Borough.unique()))
print('Cleaned Distinct Boroughs list:\n', cpc_df.Borough.value_counts())
print('*** Cleaned Shape ', cpc_df.shape)
# PostalCode stays a regular column because later cells select and join on it
cpc_df.head(3)

Raw Shape  (180, 3)
Raw Distinct postal code count 180
Raw Distinct Boroughs count 11
Raw Distinct Boroughs list:
 Not assigned        77
North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
York                 5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64

Cleaned Distinct postal code count 103
Cleaned Distinct Boroughs count 10
Cleaned Distinct Boroughs list:
 North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Name: Borough, dtype: int64
*** Cleaned Shape  (103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
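
The second cleaning rule (merging repeated postal codes) never fires on this table, because the raw data has 180 rows and 180 distinct codes. As a rough sketch of what that rule could look like if duplicates did occur, here is a version on invented toy rows; the duplicated `M5A` entry is an assumption for illustration only:

```python
import pandas as pd

# Toy rows (invented): M5A appears twice with different neighborhoods
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per (PostalCode, Borough); neighborhoods joined with ', '
merged = (toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
             .agg(', '.join)
             .reset_index())
print(merged)
```

`sort=False` keeps the postal codes in their order of first appearance, which matches how the scraped table is ordered.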


In [18]:
#use a temp copy of df for testing:
tempdf = cpc_df.head()
# loop through postalcodes and get the lat/long for the borough
# !conda install -c conda-forge geocoder --yes
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
tries = 0
while (tries < 10) and (lat_lng_coords is None):
    tries += 1
    print('try ', tries)
    g = geocoder.google('{}, Toronto, Ontario'.format('M7A'))
    print(g.json)
    lat_lng_coords = g.latlng
if lat_lng_coords is not None:  # only unpack when a lookup actually succeeded
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

try  1
None
try  2
None
try  3
None
try  4
None
try  5
None
try  6
None
try  7
None
try  8
None
try  9
None
try  10
None
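
The retry loop above could also be factored into a small helper that reports both the result and the number of attempts used. This is only a sketch: `fake_lookup` is a hypothetical stand-in for the geocoder call, not part of the `geocoder` library, and the coordinates it returns are the M7A values from the CSV used later.

```python
import time

def retry(lookup, attempts=10, delay=0.0):
    """Call lookup() until it returns a non-None value or attempts run out."""
    for attempt in range(1, attempts + 1):
        result = lookup()
        if result is not None:
            return result, attempt
        time.sleep(delay)
    return None, attempts

# Hypothetical stand-in for a geocoder call: fails twice, then succeeds
calls = {'n': 0}
def fake_lookup():
    calls['n'] += 1
    return [43.662301, -79.389494] if calls['n'] >= 3 else None

coords, tries_used = retry(fake_lookup)
never, tries_spent = retry(lambda: None, attempts=4)
print(coords, tries_used, never, tries_spent)
```

Checking the returned value against `None`, rather than comparing the attempt counter to the limit, avoids mistaking a success on the last attempt for a failure.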


Since `geocoder.google` returned no result in any of the ten attempts, we'll switch to the provided CSV file of coordinates and merge it instead

In [19]:
ll_df = pd.read_csv('Geospatial_Coordinates.csv')
ll_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Joining the latitude/longitude columns onto the neighborhood table

In [20]:
cpc_df = cpc_df.join(ll_df.set_index('Postal Code'), on='PostalCode')
cpc_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
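
A left join silently leaves `NaN` coordinates for any postal code missing from the CSV. One way to check for that, sketched here on toy data, is `merge` with `indicator=True`; the code `M0X` is made up to force a miss:

```python
import pandas as pd

codes = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X']})  # 'M0X' is a made-up code
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# indicator=True adds a '_merge' column saying where each row was found
checked = codes.merge(coords, left_on='PostalCode', right_on='Postal Code',
                      how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```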


# Exploring Toronto's ethnic restaurants

I like a lot of different cuisines, so I'm interested in seeing which neighborhoods lean more toward international food. I'm going to pull Foursquare data for the restaurants in these neighborhoods and try to build a weighted metric that classifies the neighborhoods into categories.


In [23]:
# prepping for datawrangling and mapping
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge folium=0.5.0 geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

# requests was already imported in the first cell
# JSON is flattened later with pd.json_normalize; the older pandas.io.json import path is deprecated

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


Let's first find the lat/long for Toronto so we can visualize the area and see on a map

In [24]:
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="bbgh@mooman.com")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [25]:
map_ca = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, postalcode in zip(cpc_df['Latitude'], cpc_df['Longitude'], cpc_df['Borough'], cpc_df['PostalCode']):
    label = '{},{}'.format(postalcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ca)  
    
map_ca

Now for the Foursquare data

In [26]:
# @hidden_cell
import os

# Credentials are read from environment variables rather than hardcoded in the notebook.
# The variable names below are placeholders chosen for this notebook, not a Foursquare convention.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
MY_TOKEN = os.environ['FOURSQUARE_OAUTH_TOKEN']

print('Your credentials (partial for security):')
print('CLIENT_ID: ' + CLIENT_ID[:4] + '...')
print('CLIENT_SECRET:' + CLIENT_SECRET[:4] + '...')



Your credentials (partial for security):
CLIENT_ID: AR2G...
CLIENT_SECRET:SAYV...


For the points of interest, I'm using a fixed 500 meter radius. I checked the map: some of the more remote postal codes are about 2 km apart, while some downtown are only a few hundred meters apart.
That means we'll miss some places on the outskirts and double-count some downtown. Since the use case is restaurants we might want to eat at, and 500 m is an easy walking distance, walking into a neighboring postal code for a meal is not a problem, so I'm leaving the method unchanged.
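
To put a number on how far apart two postal codes are, a haversine distance can be compared against twice the search radius. This is a minimal sketch assuming a spherical Earth with a mean radius of 6,371 km; the two points are the M5A and M5B coordinates from this notebook's tables.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6371000.0):
    """Great-circle distance in meters between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

radius = 500
# Coordinates for M5A and M5B
d = haversine_m(43.654260, -79.360636, 43.657162, -79.378937)
overlaps = d < 2 * radius  # two search circles overlap when centers are closer than 2r
print(round(d), overlaps)
```

Running the same check over every pair of postal codes would show which downtown search circles overlap and therefore share venues.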

Also, I'm adding a 'section' parameter of 'food' to limit the results to just restaurants and exclude museums, parks, etc.

In [27]:
# initial explorations of postal code M5A
test_lat = 43.654260
test_long=-79.360636
radius = 500
section = 'food' #limit our findings to places to eat
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&radius={}&section={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, test_lat, test_long,MY_TOKEN, VERSION, radius, section, LIMIT)
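
As an alternative to one long format string, the query could be assembled from a dictionary with the standard library, which keeps each parameter next to its name and handles URL escaping. This is a sketch only; the credential values are placeholders and the OAuth token is left out for brevity.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = 'https://api.foursquare.com/v2/venues/explore'
params = {
    'client_id': 'YOUR_CLIENT_ID',          # placeholder
    'client_secret': 'YOUR_CLIENT_SECRET',  # placeholder
    'v': '20180605',
    'll': '{},{}'.format(43.654260, -79.360636),
    'radius': 500,
    'section': 'food',
    'limit': 100,
}
url = base + '?' + urlencode(params)

# Decode the query string again to confirm nothing was lost in escaping
round_trip = parse_qs(urlsplit(url).query)
print(url)
print(round_trip['ll'], round_trip['section'])
```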

In [28]:
results = requests.get(url).json()
results



{'meta': {'code': 200, 'requestId': '5fdf9a2bd5239e00e29b9ea4'},
 'notifications': [{'type': 'notificationTray', 'item': {'unreadCount': 0}}],
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'query': 'food',
  'totalResults': 35,
  'suggestedBounds': {'ne': {'lat': 43.6587600045, 'lng': -79.35442800013826},
   'sw': {'lat': 43.6497599955, 'lng': -79.36684399986174}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.65

We'll summarize what categories we got to see if we're happy with this set of query parameters.

In [29]:
venues = results['response']['groups'][0]['items']

In [30]:
venues

[{'reasons': {'count': 0,
   'items': [{'summary': 'This spot is popular',
     'type': 'general',
     'reasonName': 'globalInteractionReason'}]},
  'venue': {'id': '54ea41ad498e9a11e9e13308',
   'name': 'Roselle Desserts',
   'location': {'address': '362 King St E',
    'crossStreet': 'Trinity St',
    'lat': 43.653446723052674,
    'lng': -79.3620167174383,
    'labeledLatLngs': [{'label': 'display',
      'lat': 43.653446723052674,
      'lng': -79.3620167174383}],
    'distance': 143,
    'postalCode': 'M5A 1K9',
    'cc': 'CA',
    'city': 'Toronto',
    'state': 'ON',
    'country': 'Canada',
    'formattedAddress': ['362 King St E (Trinity St)',
     'Toronto ON M5A 1K9',
     'Canada']},
   'categories': [{'id': '4bf58dd8d48988d16a941735',
     'name': 'Bakery',
     'pluralName': 'Bakeries',
     'shortName': 'Bakery',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_',
      'suffix': '.png'},
     'primary': True}],
   'photos': {'count': 0, 'grou

In [32]:
nearby_venues = pd.json_normalize(venues)

In [39]:
#now we'll clean up those results and look at just the categories

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# replace each list of category dicts with the name of its first category,
# so value_counts() receives hashable strings
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues['venue.categories'].value_counts()


Pizza Place                 4
Bakery                      3
Café                        3
Restaurant                  3
Sandwich Place              2
Breakfast Spot              2
Mediterranean Restaurant    2
Italian Restaurant          2
Seafood Restaurant          1
Salad Place                 1
Thai Restaurant             1
Food Truck                  1
Gastropub                   1
Ethiopian Restaurant        1
Deli / Bodega               1
Japanese Restaurant         1
Mexican Restaurant          1
Sushi Restaurant            1
Chinese Restaurant          1
Asian Restaurant            1
Greek Restaurant            1
French Restaurant           1
Name: venue.categories, dtype: int64
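
When categories are pulled straight from the raw JSON instead of a dataframe, a venue with an empty `categories` list would make a bare `[0]` lookup fail. A small defensive helper might look like the sketch below; `first_category_name` and the empty-category venue are hypothetical, and the first sample mirrors the Roselle Desserts entry in the response above.

```python
def first_category_name(venue):
    """Return the first category name of a venue dict, or None if there is none."""
    categories = venue.get('categories') or []
    return categories[0].get('name') if categories else None

with_cat = {'name': 'Roselle Desserts', 'categories': [{'name': 'Bakery', 'primary': True}]}
without_cat = {'name': 'Hypothetical Venue', 'categories': []}  # invented edge case
print(first_category_name(with_cat), first_category_name(without_cat))
```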

I'm satisfied with these results, so now we'll repeat the process for a larger dataset. I'm going to use a latitude/longitude bounding box to limit the search to the central downtown area (just to throttle the number of calls to Foursquare; this could easily be extended to larger areas with a paid account).
After experimenting with Google Maps, I'm going to apply these ranges:
 - latitude: 43.6 to 43.7
 - longitude: -79.5 to -79.3

Here's a revised map showing what is included in that box...

In [44]:
llfilter = (cpc_df['Latitude'] >= 43.6) & (cpc_df['Latitude'] <= 43.7) & (cpc_df['Longitude'] >= -79.5) & (cpc_df['Longitude'] <= -79.3)
#can then filter dataframe for just things meeting that condition:
downtown_df = cpc_df[llfilter]
# downtown_df.shape # gave me (40,5), a good size for my queries

downtown_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, postalcode in zip(downtown_df['Latitude'], downtown_df['Longitude'], downtown_df['Borough'], downtown_df['PostalCode']):
    label = '{},{}'.format(postalcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(downtown_map)  


downtown_map

(40, 5)

In [48]:
# name will get used for postalcode this time, rather than neighborhood
def getNearbyFood(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&section={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            'food')
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_eats = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_eats.columns = ['PostalCode', 
                  'Postcode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_eats

In [49]:
downtown_eats = getNearbyFood(downtown_df['PostalCode'],downtown_df['Latitude'],downtown_df['Longitude'])

M5A
M7A
M5B
M4C
M5C
M6C
M5E
M6E
M5G
M6G
M5H
M6H
M4J
M5J
M6J
M4K
M5K
M6K
M4L
M5L
M4M
M6M
M6N
M5P
M6P
M5R
M6R
M5S
M6S
M4T
M5T
M4V
M5V
M4W
M5W
M4X
M5X
M4Y
M7Y
M8Y


In [50]:
print(downtown_eats.shape)
downtown_eats.head()

(1333, 7)


Unnamed: 0,PostalCode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
1,M5A,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
2,M5A,43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,M5A,43.65426,-79.360636,Brick Street Bakery,43.650574,-79.359539,Bakery
4,M5A,43.65426,-79.360636,Cluny Bistro & Boulangerie,43.650565,-79.357843,French Restaurant


In [51]:
# exploring the counts for each postal code in our bounding box
downtown_eats.groupby('PostalCode').count()

Unnamed: 0_level_0,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
M4C,4,4,4,4,4,4
M4J,2,2,2,2,2,2
M4K,36,36,36,36,36,36
M4L,12,12,12,12,12,12
M4M,22,22,22,22,22,22
M4T,2,2,2,2,2,2
M4V,9,9,9,9,9,9
M4W,1,1,1,1,1,1
M4X,28,28,28,28,28,28
M4Y,62,62,62,62,62,62


Three postal codes apparently hit our query limit of 100 venues, and a few more are in the 90s.

At the other end, a few postal codes have only a single venue. We'll want to remember that when building a weighted metric: if an area has 100% international restaurants but only a single venue, we probably want to exclude it for lack of options. So we might want a filter later that requires at least 10 eateries for the final contender list.
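
The minimum-count filter mentioned above could be sketched as follows. The counts are a handful of values from the table, and the threshold of 10 is the one suggested in the text rather than a tuned value.

```python
import pandas as pd

counts = pd.Series({'M4C': 4, 'M4J': 2, 'M4K': 36, 'M4L': 12, 'M4M': 22, 'M4W': 1},
                   name='RestaurantCount')
min_eateries = 10  # threshold suggested in the text
contenders = counts[counts >= min_eateries]
print(contenders)
```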

### Now to see the types of restaurants in our search area

In [52]:
print('The types of eateries in this area are:\n',downtown_eats['Venue Category'].value_counts())

The types of eateries in this area are:
 Café                               121
Restaurant                         109
Italian Restaurant                  68
Sandwich Place                      65
Pizza Place                         62
Bakery                              59
Sushi Restaurant                    56
Japanese Restaurant                 51
American Restaurant                 35
Fast Food Restaurant                33
Burger Joint                        33
Breakfast Spot                      31
Salad Place                         30
Seafood Restaurant                  28
Deli / Bodega                       28
Thai Restaurant                     26
Vegetarian / Vegan Restaurant       26
Steakhouse                          24
Asian Restaurant                    24
Greek Restaurant                    24
Chinese Restaurant                  22
Gastropub                           22
Diner                               19
Middle Eastern Restaurant           18
Burrito Place          

Now I'd like to group these into basic categories, so I'll create some sets that cover the broader type of each.
This is very subjective, but for the purposes of this exercise it should suffice.
I made my own CSV file assigning each of those venue types to a cuisine category, which I will now load and merge into the data.

In [56]:
resttypes_df = pd.read_csv('RestaurantTypes.csv')
resttypes_df.columns = ('Venue Category','Cuisine')
resttypes_df.head()                        

Unnamed: 0,Venue Category,Cuisine
0,American Restaurant,American
1,Arepa Restaurant,Latin
2,Argentinian Restaurant,Latin
3,Asian Restaurant,Asian
4,Bagel Shop,American


In [57]:
downtown_eats = downtown_eats.join(resttypes_df.set_index('Venue Category'), on='Venue Category')
downtown_eats.head()

Unnamed: 0,PostalCode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Cuisine
0,M5A,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant,NonSpecific
1,M5A,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot,American
2,M5A,43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot,American
3,M5A,43.65426,-79.360636,Brick Street Bakery,43.650574,-79.359539,Bakery,NonSpecific
4,M5A,43.65426,-79.360636,Cluny Bistro & Boulangerie,43.650565,-79.357843,French Restaurant,European
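
Because the cuisine file is hand-made, any venue category missing from it comes out of the join as `NaN`. A quick check, sketched on toy rows, lists the unmapped categories; `Made-Up Category` is invented, and falling back to `NonSpecific` is just one possible choice.

```python
import pandas as pd

venues = pd.DataFrame({'Venue Category': ['Restaurant', 'Breakfast Spot',
                                          'French Restaurant', 'Made-Up Category']})
lookup = pd.DataFrame({'Venue Category': ['Restaurant', 'Breakfast Spot', 'French Restaurant'],
                       'Cuisine': ['NonSpecific', 'American', 'European']})

joined = venues.join(lookup.set_index('Venue Category'), on='Venue Category')
# Categories that found no match in the lookup table
unmapped = joined.loc[joined['Cuisine'].isna(), 'Venue Category'].unique().tolist()
joined['Cuisine'] = joined['Cuisine'].fillna('NonSpecific')  # one possible fallback
print(unmapped)
```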


Now I'll one-hot encode the cuisine of each venue, keeping the postal code alongside so the rows can be grouped afterwards

In [58]:
# one hot encoding
downtown_onehot = pd.get_dummies(downtown_eats[['Cuisine']], prefix="", prefix_sep="")

# add postal code column back to dataframe
downtown_onehot['PostalCode'] = downtown_eats['PostalCode'] 

# move this postal code column to the first column
fixed_columns = [downtown_onehot.columns[-1]] + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()



Unnamed: 0,PostalCode,African,American,Asian,European,Healthy,Indian,Italian,Latin,Mediterranean,NonSpecific
0,M5A,0,0,0,0,0,0,0,0,0,1
1,M5A,0,1,0,0,0,0,0,0,0,0
2,M5A,0,1,0,0,0,0,0,0,0,0
3,M5A,0,0,0,0,0,0,0,0,0,1
4,M5A,0,0,0,1,0,0,0,0,0,0


In [59]:
downtown_onehot.shape

(1333, 11)

Now I'll group these by producing a mean occurrence for each cuisine in each postalcode, to characterize what cuisines dominate


In [61]:
downtown_grouped = downtown_onehot.groupby('PostalCode').mean().reset_index()
downtown_grouped


Unnamed: 0,PostalCode,African,American,Asian,European,Healthy,Indian,Italian,Latin,Mediterranean,NonSpecific
0,M4C,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5
1,M4J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,M4K,0.0,0.166667,0.194444,0.0,0.0,0.027778,0.083333,0.027778,0.333333,0.166667
3,M4L,0.0,0.416667,0.083333,0.083333,0.0,0.0,0.083333,0.083333,0.0,0.25
4,M4M,0.0,0.318182,0.090909,0.0,0.0,0.0,0.045455,0.090909,0.045455,0.409091
5,M4T,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.5
6,M4V,0.0,0.444444,0.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.333333
7,M4W,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M4X,0.0,0.178571,0.214286,0.0,0.0,0.035714,0.071429,0.035714,0.0,0.464286
9,M4Y,0.016129,0.209677,0.33871,0.032258,0.0,0.048387,0.032258,0.080645,0.048387,0.177419
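
The mean of a one-hot column within a group is simply the share of that group's venues with that cuisine, so each row of the grouped table sums to 1. A toy sketch of the idea follows; the individual venues are invented, shaped to match the M4C row (4 venues, half American, half NonSpecific) and the M4W row (one Asian venue).

```python
import pandas as pd

eats = pd.DataFrame({
    'PostalCode': ['M4C'] * 4 + ['M4W'],
    'Cuisine': ['American', 'American', 'NonSpecific', 'NonSpecific', 'Asian'],
})
onehot = pd.get_dummies(eats['Cuisine'], dtype=int)  # 0/1 indicator columns
onehot.insert(0, 'PostalCode', eats['PostalCode'])
shares = onehot.groupby('PostalCode').mean()          # fraction of venues per cuisine
print(shares)
```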


Now let's add the restaurant counts per postal code so we can filter the ones that had few options


In [69]:
downtown_eats.columns

Index(['PostalCode', 'Postcode Latitude', 'Postcode Longitude', 'Venue',
       'Venue Latitude', 'Venue Longitude', 'Venue Category', 'Cuisine'],
      dtype='object')

In [125]:
downtown_counts = downtown_eats.groupby('PostalCode').count().reset_index()
downtown_counts = downtown_counts[['PostalCode','Venue']]
downtown_counts.columns =['PostalCode','RestaurantCount']
downtown_counts.head()


Unnamed: 0,PostalCode,RestaurantCount
0,M4C,4
1,M4J,2
2,M4K,36
3,M4L,12
4,M4M,22


We'll glue that onto the end of the grouped file

In [86]:
# PostalCode stays a regular column here; set_index without assignment would have no effect
downtown_grouped.head()

Unnamed: 0,PostalCode,African,American,Asian,European,Healthy,Indian,Italian,Latin,Mediterranean,NonSpecific
0,M4C,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5
1,M4J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,M4K,0.0,0.166667,0.194444,0.0,0.0,0.027778,0.083333,0.027778,0.333333,0.166667
3,M4L,0.0,0.416667,0.083333,0.083333,0.0,0.0,0.083333,0.083333,0.0,0.25
4,M4M,0.0,0.318182,0.090909,0.0,0.0,0.0,0.045455,0.090909,0.045455,0.409091


In [129]:
# no set_index needed: merge(on='PostalCode') matches on the column values directly
downtown_grouped = downtown_grouped.merge(downtown_counts, on='PostalCode',how='left')
downtown_grouped.head()

Unnamed: 0,PostalCode,African,American,Asian,European,Healthy,Indian,Italian,Latin,Mediterranean,NonSpecific,RestaurantCount
0,M4C,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,4
1,M4J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2
2,M4K,0.0,0.166667,0.194444,0.0,0.0,0.027778,0.083333,0.027778,0.333333,0.166667,36
3,M4L,0.0,0.416667,0.083333,0.083333,0.0,0.0,0.083333,0.083333,0.0,0.25,12
4,M4M,0.0,0.318182,0.090909,0.0,0.0,0.0,0.045455,0.090909,0.045455,0.409091,22


In [135]:
downtown_grouped.shape

(39, 12)

## Cluster analysis using k-means

Now I can run the cluster analysis to see whether neighborhoods group together by their restaurant options.

I will keep the count in there, but rescale it to the range 0 to 1 with min-max scaling so it can act as another meaningful variable, along the lines of 'place with lots of options' vs 'restaurant wasteland'

In [141]:
norm = (downtown_grouped['RestaurantCount'] - downtown_grouped['RestaurantCount'].min()) / (downtown_grouped['RestaurantCount'].max() - downtown_grouped['RestaurantCount'].min())

cluster_df = downtown_grouped.drop(columns='PostalCode')  # positional axis argument was removed in pandas 2.0
cluster_df = cluster_df.drop(columns='RestaurantCount')
cluster_df['ScaledEats'] = norm
cluster_df.head()

Unnamed: 0,African,American,Asian,European,Healthy,Indian,Italian,Latin,Mediterranean,NonSpecific,ScaledEats
0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.030303
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.010101
2,0.0,0.166667,0.194444,0.0,0.0,0.027778,0.083333,0.027778,0.333333,0.166667,0.353535
3,0.0,0.416667,0.083333,0.083333,0.0,0.0,0.083333,0.083333,0.0,0.25,0.111111
4,0.0,0.318182,0.090909,0.0,0.0,0.0,0.045455,0.090909,0.045455,0.409091,0.212121
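
The `ScaledEats` values above follow from min-max scaling, (x - min) / (max - min). The sketch below reproduces the first five rows in plain Python; the minimum of 1 and maximum of 100 are assumptions inferred from the single-venue postal codes and the query limit, not values printed by the notebook.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid dividing by zero when all values are equal
    return [(v - lo) / (hi - lo) for v in values]

# First five counts from the table, plus an assumed minimum of 1 and maximum of 100
counts = [4, 2, 36, 12, 22, 1, 100]
scaled = min_max_scale(counts)
print([round(s, 6) for s in scaled])
```

With those assumptions, a count of 4 maps to 3/99, which matches the 0.030303 shown for M4C.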


In [143]:
kclusters = 5
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cluster_df)
kmeans.labels_

array([0, 3, 4, 0, 0, 3, 0, 2, 0, 4, 0, 1, 1, 4, 1, 1, 1, 1, 1, 3, 0, 0,
       4, 0, 1, 1, 3, 3, 3, 4, 0, 0, 0, 0, 0, 0, 4, 0, 0])

In [148]:
# add the labels to the group table
downtown_grouped.insert(1,'K5cluster', kmeans.labels_)

In [149]:
downtown_grouped.head()

Unnamed: 0,PostalCode,K5cluster,African,American,Asian,European,Healthy,Indian,Italian,Latin,Mediterranean,NonSpecific,RestaurantCount
0,M4C,0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,4
1,M4J,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2
2,M4K,4,0.0,0.166667,0.194444,0.0,0.0,0.027778,0.083333,0.027778,0.333333,0.166667,36
3,M4L,0,0.0,0.416667,0.083333,0.083333,0.0,0.0,0.083333,0.083333,0.0,0.25,12
4,M4M,0,0.0,0.318182,0.090909,0.0,0.0,0.0,0.045455,0.090909,0.045455,0.409091,22


Let's see if we can make sense of the groups by sorting the cluster labels and seeing what they have in common

In [150]:
downtown_grouped.sort_values(by='K5cluster', ascending=False, axis=0, inplace=False)

Unnamed: 0,PostalCode,K5cluster,African,American,Asian,European,Healthy,Indian,Italian,Latin,Mediterranean,NonSpecific,RestaurantCount
13,M5E,4,0.039216,0.235294,0.117647,0.156863,0.039216,0.019608,0.058824,0.039216,0.058824,0.235294,51
2,M4K,4,0.0,0.166667,0.194444,0.0,0.0,0.027778,0.083333,0.027778,0.333333,0.166667,36
36,M7A,4,0.0,0.275862,0.206897,0.034483,0.034483,0.068966,0.034483,0.103448,0.103448,0.137931,29
22,M5T,4,0.0,0.145455,0.218182,0.054545,0.072727,0.0,0.018182,0.181818,0.018182,0.290909,55
29,M6J,4,0.0,0.255814,0.162791,0.069767,0.046512,0.0,0.023256,0.139535,0.023256,0.27907,43
9,M4Y,4,0.016129,0.209677,0.33871,0.032258,0.0,0.048387,0.032258,0.080645,0.048387,0.177419,62
19,M5P,3,0.0,0.0,0.25,0.25,0.0,0.0,0.0,0.0,0.0,0.5,4
27,M6G,3,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.8,5
26,M6E,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
1,M4J,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2


From this we can see some clear patterns, mostly driven by the counts.
 - Cluster 4 looks like postal codes with a moderate number of restaurants and almost every cuisine represented.
 - Cluster 3 is the very sparse group with little to choose from.
 - Cluster 2 is an isolated case: a postal code with a single Asian restaurant. Interestingly, cluster 3 contains a postal code with a single NonSpecific restaurant, yet the two did not clump together.
 - Cluster 1 looks like the busiest areas, with the most options and the widest variety.
 - Cluster 0 looks like moderate-count places where NonSpecific is fairly high, suggesting these are fast-food and chain restaurants.
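
Reading patterns off a sorted table gets harder as it grows, so a per-cluster average can serve as a quick profile. This sketch uses a few rows and two columns copied from the sorted table above; it is a sample, not the full result.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M5E', 'M4K', 'M5P', 'M6G', 'M6E', 'M4J'],
    'K5cluster': [4, 4, 3, 3, 3, 3],
    'NonSpecific': [0.235294, 0.166667, 0.5, 0.8, 1.0, 1.0],
    'RestaurantCount': [51, 36, 4, 5, 1, 2],
})
# Mean NonSpecific share and mean restaurant count for each cluster label
profile = sample.groupby('K5cluster')[['NonSpecific', 'RestaurantCount']].mean()
print(profile)
```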

Let's map them and see what turns up. I'll also scale the marker size with the number of restaurants, to help the busy places stand out.

First, though, I'll need to make a new dataframe combining the clusters and counts with the lat/longs, since those are still independent.

In [153]:
mappable_df = pd.merge(downtown_df,downtown_grouped[['PostalCode','K5cluster', 'RestaurantCount']],on='PostalCode', how='left')
mappable_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,K5cluster,RestaurantCount
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,22.0
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,4.0,29.0
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1.0,98.0
3,M4C,East York,Woodbine Heights,43.695344,-79.318389,0.0,4.0
4,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1.0,73.0
5,M6C,York,Humewood-Cedarvale,43.693781,-79.428191,,
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,4.0,51.0
7,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512,3.0,1.0
8,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1.0,69.0
9,M6G,Downtown Toronto,Christie,43.669542,-79.422564,3.0,5.0


I notice there is a NaN for one postal code, M6C. I'll just verify that we got no restaurant data for it, and if so, I'll delete the row since it has no cluster assigned.

In [156]:
downtown_eats[(downtown_eats['PostalCode'] == 'M6C')]

Unnamed: 0,PostalCode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Cuisine


In [159]:
mappable_df = mappable_df[mappable_df.PostalCode != 'M6C']
mappable_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,K5cluster,RestaurantCount
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,22.0
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,4.0,29.0
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1.0,98.0
3,M4C,East York,Woodbine Heights,43.695344,-79.318389,0.0,4.0
4,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1.0,73.0
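
The manual check-and-drop above works fine for one postal code. A more general variant, sketched here on a few rows taken from the outputs above, is to let the merge itself report which rows found no match. The `indicator=True` option of `pd.merge` adds a `_merge` column for this; the frame names `left`, `right`, `merged` and `clean` are placeholders, not names used elsewhere in this notebook.

```python
import pandas as pd

# A few postal codes from the tables above; M6C has no restaurant data.
left = pd.DataFrame({
    "PostalCode": ["M5A", "M7A", "M6C", "M6E"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "York", "York"],
})
right = pd.DataFrame({
    "PostalCode": ["M5A", "M7A", "M6E"],
    "K5cluster": [0, 4, 3],
    "RestaurantCount": [22, 29, 1],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
merged = pd.merge(left, right, on="PostalCode", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print("No cluster assigned:", unmatched)

# Keep only matched rows and undo the int -> float upcast caused by the NaNs.
clean = (
    merged[merged["_merge"] == "both"]
    .drop(columns="_merge")
    .astype({"K5cluster": int, "RestaurantCount": int})
)
print(clean)
```

Casting back to `int` here also removes the need to fix the float cluster labels later when they are used as list indices.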


In [163]:
import math # to make sure I have a squareroot function
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, pc, cluster,eatscore in zip(mappable_df['Latitude'], mappable_df['Longitude'], mappable_df['PostalCode'], mappable_df['K5cluster'],mappable_df['RestaurantCount']):
    cluster = int(cluster) # the left merge turned the labels into floats; cast back so they can index the color list
    label = folium.Popup(str(pc) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=int(2 + math.sqrt(eatscore)/1.4), # designed to scale the circle size from 2-9
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
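
Two details of the map cell are easy to miss, so here is a small standalone sketch of them. The helper names `marker_radius` and `color_index` are hypothetical (the cell above inlines both expressions), and `k=5` is an assumption based on the `K5cluster` column name.

```python
import math

def marker_radius(restaurant_count):
    """Same formula as the map cell: a square-root scale, so a postal code
    with many restaurants gets a bigger circle without swamping the map."""
    return int(2 + math.sqrt(restaurant_count) / 1.4)

def color_index(cluster, k=5):
    """Position that rainbow[cluster - 1] picks in a list of k colors.
    For cluster 0 the index is -1, which Python reads as the last element."""
    return (cluster - 1) % k

# Counts seen in the tables above: the sparsest and busiest postal codes.
for count in (1, 4, 22, 98):
    print(count, "restaurants -> radius", marker_radius(count))

for cluster in range(5):
    print("cluster", cluster, "-> color position", color_index(cluster))
```

So the colors are shifted by one position relative to the cluster labels: every cluster still gets its own color, but Cluster 0 takes the last color in the list rather than the first.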

I could expand on this and construct a metric balancing the ratio of international restaurants to American ones, but that would exercise mostly the same skills already used here, so I will consider this a successful exploration as it stands.

We have found and mapped where in town the most restaurants are, and we've used a clustering algorithm to group postal codes with similar restaurant profiles. **Cluster 4** (orange in the map) seems to be places with a good variety of cuisines, and **Cluster 1** covers places with simply a large number of restaurants overall.
We will want to avoid areas identified as **Cluster 0**, as they tend to have more generic options, while **Clusters 2 and 3** simply have few options at all.

So with that, I will save and upload this notebook! *whew*