# Segmenting and Clustering Neighborhoods in Toronto

### Table of Contents

Part 1

1a. Download and Explore Dataset

Part 2

2a. Get Coordinates

Part 3

3a. Explore Neighborhoods in Toronto
3b. Analyze Each Set of Neighborhoods (based on Postal Code) in Downtown Toronto
3c. Cluster Neighborhoods
3d. Examine Clusters

### Preface

Acknowledgements: Thanks to the authors of the capstone course. Some of the code was copied from course materials.

## Part 1

This is the first part of the week 3 project of the Applied Data Science Capstone class. In this notebook, I will explore, segment, and cluster neighborhoods in Toronto, Canada.

### 1a. Download and Explore Dataset

#### Do the necessary installations and imports

In [1]:
# import Numerical and dataframe libraries
import numpy as np 
import pandas as pd 

!conda install --yes -c conda-forge geopy folium=0.5.0 beautifulsoup4 lxml geocoder

from geopy.geocoders import Nominatim 

# import web service libraries
import requests 
import json 
from pandas.io.json import json_normalize 

# import Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - beautifulsoup4
    - folium=0.5.0
    - geocoder
    - geopy
    - lxml


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    geocoder-1.38.1            |             py_1          53 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    altair-4.0.0               |             py_0         606 KB  conda-forge
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    beautifulsoup4-4.8.2       |           py36_0         157 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    lxml-4.4.1               

## 1. Download and Explore Dataset about Toronto Neighborhoods

### Parse the Wikipedia page that has Toronto neighborhood information

#### Get the Wikipedia page that has the information

In [2]:
canada_M_postal_code_page_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(canada_M_postal_code_page_url)
postal_code_page = ""
if (response.status_code == 200):
    postal_code_page = response.text
else:
    print ("Response to request was not OK. It was {}".format(response.status_code))
len(postal_code_page)

78651

#### Parse the Wikipedia page

In [3]:
from bs4 import BeautifulSoup
postal_soup = BeautifulSoup(postal_code_page, 'lxml')
# print(postal_soup.prettify())

#### The page has multiple tables. Get a list of them, print out their lengths, and then proceed with the longest one, since it is the one that has the Toronto postal code information
I assume that the HTML table element that is the longest when pretty printed is the one that has the information we are looking for.

In [4]:
tables = postal_soup.find_all('table')
table_lengths = [len (table.prettify()) for table in tables]
table_lengths

[55576, 148, 180, 8480, 6088]

In [5]:
postal_table = tables[0]
# print(postal_table.prettify())
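
The index `0` above is hardcoded from the printed lengths. A minimal sketch of picking the longest table programmatically, assuming the same "longest pretty-printed table" heuristic holds; the helper name `index_of_longest` is hypothetical, and the lengths are the ones printed above:

```python
def index_of_longest(lengths):
    """Return the position of the largest value in a list of lengths."""
    if not lengths:
        raise ValueError("no tables found on the page")
    return max(range(len(lengths)), key=lambda i: lengths[i])

# Lengths printed by the cell above
table_lengths = [55576, 148, 180, 8480, 6088]
longest_index = index_of_longest(table_lengths)
print(longest_index)
```

With the real page, `tables[index_of_longest(table_lengths)]` could stand in for `tables[0]`, which would keep working if the order of tables on the page changed.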

#### Get a list of all of the rows, since the rows have the postal data. Process each row, putting the ones that have data we need into a new dataframe containing Postal Codes, Boroughs, and Neighborhoods

Assumptions:
- I assume that each row in the HTML table has at least 3 columns
- I assume the first column contains only postal codes
- I assume the second column contains only a name or a hyperlink with a name as text in the hyperlink element
- I assume the third column contains only a name or multiple hyperlinks with the name in the text of the first hyperlink

In [6]:
table_rows = postal_table.find_all('tr')
print ("Wiki page table has {} rows".format(len(table_rows)))
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood']
# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)
# Iterate through all of the rows, except the first one, since the first one is the header row
for row in table_rows[1:]:
    # The columns of the HTML table are (0) postcode (1) Borough (2) Neighborhood
    table_data = row.find_all('td')
    borough = table_data[1].string
    if borough == 'Not assigned':
        # print(row.prettify())
        pass
    else:
        # print (row.prettify())
        pc = table_data[0].string.strip()
        bor_field = table_data[1]
        borough = ""
        # bor_field.a is None when the cell holds plain text instead of a hyperlink
        if bor_field.a is None:
            borough = bor_field.string.strip()
        else:
            borough = bor_field.a.string.strip()
        nei_field = table_data[2]
        nei = ""
        a_elems = nei_field.find_all('a')
        if len(a_elems) == 0:
            nei = nei_field.string.strip()
            if (nei.strip()) == "Not assigned":
                nei = borough
        else:
            nei = nei_field.a.string.strip()
        # print(pc)
        # print(borough)
        # print(nei)
        neighborhoods = neighborhoods.append({'PostalCode': pc,
                                              'Borough': borough, 
                                              'Neighborhood': nei}, ignore_index=True)
neighborhoods.sort_values(['PostalCode', 'Neighborhood'])

Wiki page table has 288 rows


Unnamed: 0,PostalCode,Borough,Neighborhood
8,M1B,Scarborough,Malvern
7,M1B,Scarborough,Rouge
20,M1C,Scarborough,Highland Creek
22,M1C,Scarborough,Port Union
21,M1C,Scarborough,Rouge Hill
32,M1E,Scarborough,Guildwood
33,M1E,Scarborough,Morningside
34,M1E,Scarborough,West Hill
38,M1G,Scarborough,Woburn
42,M1H,Scarborough,Cedarbrae


#### For the postal codes that have more than one neighborhood, combine the rows into one row 

In [7]:
def string_list(series):
    ''' Return a string that contains all of the unique strings that are in the series object,
    with the strings separated by commas (with a space after each comma)'''
    return (', '.join(series.unique()))

# Group the neighborhood dataframe by PostalCode
ng = neighborhoods.groupby(['PostalCode'])

# Create a new dataframe that has one row per PostalCode, with the Neighborhoods within each PostalCode joined into one string of names separated by ", ".
# Note that if a PostalCode contained more than one Borough, the row would have multiple borough names separated by commas, but a PostalCode should
# not have more than one Borough.
postal_codes_df_first = ng.agg(string_list)
postal_codes_df_first

Unnamed: 0_level_0,Borough,Neighborhood
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,Scarborough,"Rouge, Malvern"
M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
M1E,Scarborough,"Guildwood, Morningside, West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
M1J,Scarborough,Scarborough Village
M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
M1N,Scarborough,"Birch Cliff, Cliffside West"
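
To make the aggregation step concrete, here is a small sketch on toy rows shaped like the scraped table. The three rows are copied from the output above; everything else is just the same `groupby` and `agg` pattern in miniature, not a replacement for the cell above:

```python
import pandas as pd

# Toy rows in the same shape as the scraped table
rows = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

def string_list(series):
    """Join the unique values of a Series with ', ', keeping first-seen order."""
    return ", ".join(series.unique())

# One row per postal code; duplicate borough names collapse to a single name
combined = rows.groupby("PostalCode").agg(string_list).reset_index()
print(combined)
```

Because `unique()` keeps values in the order they first appear, the neighborhood names come out in the order they were scraped rather than alphabetically.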


#### Change the dataframe so that it is indexed by integers instead of PostalCodes

In [8]:
postal_codes_df = postal_codes_df_first.reset_index()
postal_codes_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [9]:
postal_codes_df.shape

(103, 3)

## Part 2

This is the second part of the week 3 project of the Applied Data Science Capstone class. In this notebook, I will explore, segment, and cluster neighborhoods in Toronto, Canada.

### 2a. Get Coordinates

##### Define a function to get the coordinates for one postal code

In [10]:
import geocoder 


postal_code = 'M9W'

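def get_coords(p_code):
    '''
    Use geocoder to get the latitude and longitude of the given postal code.
    Return (None, None) if no coordinates could be obtained.
    '''
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates or run out of attempts
    i = 0
    while lat_lng_coords is None and i < 2500:
        # use the function argument rather than the global postal_code
        g = geocoder.google('{}, Toronto, Ontario'.format(p_code))
        print(g)
        lat_lng_coords = g.latlng
        i = i + 1

    print('It took {} iteration(s)'.format(i))
    # the service may never answer, so guard against indexing None
    if lat_lng_coords is None:
        return (None, None)
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return (latitude, longitude)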

#get_coords(postal_code)


Note: I ran the get_coords function once, letting the loop iterate more than 1,800 times, and the value of the variable g was always
```
<[REQUEST_DENIED] Google - Geocode [empty]>
```
so I decided to use the CSV file provided by the course.
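
A denied request is usually not a transient failure, so retrying thousands of times is unlikely to help. The following is a minimal sketch of a bounded retry with a small cap and optional backoff. The helper `retry_until_result` and its parameters are hypothetical, and `fetch` stands in for whatever geocoding call is used:

```python
import time

def retry_until_result(fetch, max_attempts=5, base_delay=0.0):
    """Call fetch() until it returns a non-None value or attempts run out.

    fetch is any zero-argument callable (a stand-in for a geocoder call).
    Returns (result, attempts_used); result is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result, attempt
        if attempt < max_attempts:
            # wait a little longer after each failure (exponential backoff)
            time.sleep(base_delay * (2 ** (attempt - 1)))
    return None, max_attempts

# Stand-in that fails twice, then returns a sample coordinate pair
responses = iter([None, None, (43.806686, -79.194353)])
coords, attempts = retry_until_result(lambda: next(responses))
print(coords, attempts)
```

Returning the attempt count alongside the result makes it easy to log how long a lookup took without printing inside the loop.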

###### Get the file and read it into a dataframe

In [11]:
geo_coord = pd.read_csv("https://cocl.us/Geospatial_data", index_col="Postal Code")
geo_coord

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
M1J,43.744734,-79.239476
M1K,43.727929,-79.262029
M1L,43.711112,-79.284577
M1M,43.716316,-79.239476
M1N,43.692657,-79.264848


###### If the postal_codes_df and geo_coord dataframes have the same number of rows, consider the geo_coord dataframe valid

In [12]:
# If I couldn't visually inspect the data, I would check for invalid values in the code, but I inspected it visually.
if postal_codes_df.shape[0] == geo_coord.shape[0]: 
    print("geo_coords valid")

geo_coords valid
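
Equal row counts do not guarantee that both dataframes hold the same postal codes. A slightly stricter check compares the keys themselves; this is a sketch with a hypothetical helper `compare_keys` and toy keys, not the check used in this notebook:

```python
def compare_keys(left_keys, right_keys):
    """Return the keys found only on the left and only on the right."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)

# Toy example: same number of keys on both sides, yet they do not match
only_left, only_right = compare_keys(["M1B", "M1C", "M1E"], ["M1B", "M1C", "M1G"])
print(only_left, only_right)
```

With the real data, the call would be along the lines of `compare_keys(postal_codes_df["PostalCode"], geo_coord.index)`; two empty lists would mean the key sets are identical.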


In [13]:
# Join the two dataframes to add the coordinates to the neighborhood data
combined_postal = pd.merge(postal_codes_df, geo_coord, left_on="PostalCode", right_index=True, validate="1:1")
combined_postal

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
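
The `validate="1:1"` argument asks pandas to verify that the merge keys are unique on both sides. A small sketch of what that protects against, using two rows from the tables above and a deliberately duplicated key (the frame `left_dup` is made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M1B", "M1C"],
                     "Borough": ["Scarborough", "Scarborough"]})
coords = pd.DataFrame(
    {"Latitude": [43.806686, 43.784535], "Longitude": [-79.194353, -79.160497]},
    index=pd.Index(["M1B", "M1C"], name="Postal Code"),
)

# Same call shape as above: key column on the left, index on the right
merged = pd.merge(left, coords, left_on="PostalCode", right_index=True, validate="1:1")

# A duplicated key on the left violates the one-to-one contract
left_dup = pd.concat([left, left.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left_dup, coords, left_on="PostalCode", right_index=True, validate="1:1")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```

Without `validate`, the duplicated key would silently produce an extra row instead of an error.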


##### This ends Part 2

## Part 3

This is the last part of the week 3 project of the Applied Data Science Capstone class. 

### 3a. Explore Neighborhoods in Toronto


In [24]:
# Center the map on the median latitude and the median longitude of all postal codes

latitude = combined_postal.loc[:,"Latitude"].median()
longitude = combined_postal.loc[:,"Longitude"].median()
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6969476, -79.3887901.


In [31]:
# create map of Toronto using latitude and longitude values
map_toro = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(combined_postal['Latitude'], combined_postal['Longitude'], combined_postal['Borough'], combined_postal['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough) 
    label = folium.Popup(label, parse_html=True) 
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc', fill_opacity=0.7, parse_html=False).add_to(map_toro)
    
map_toro

##### Create a dataframe and a map containing only Downtown Toronto neighborhoods

In [57]:
# find just the neighborhoods that are in Downtown Toronto
borough = "Downtown Toronto"
downtown_df = combined_postal[combined_postal["Borough"]==borough]
downtown_df.reset_index(inplace=True, drop=True)
downtown_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


In [58]:
latitude = downtown_df.loc[:,"Latitude"].median()
longitude = downtown_df.loc[:,"Longitude"].median()
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6532057, -79.38175229999999.


In [59]:
# create map of Downtown Toronto using latitude and longitude values
map_dt = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, borough, neighborhood in zip(downtown_df['Latitude'], downtown_df['Longitude'], downtown_df['Borough'], downtown_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough) 
    label = folium.Popup(label, parse_html=True) 
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc', fill_opacity=0.7, parse_html=False).add_to(map_dt)
    
map_dt

### 3b. Analyze Each Set of Neighborhoods (based on Postal Code) in Downtown Toronto

##### Define Foursquare Credentials

In [62]:
# The code was removed by Watson Studio for sharing.

##### Define Foursquare Version

In [63]:
VERSION = '20180605' # Foursquare API version


##### Explore first neighborhood in Downtown Toronto

In [64]:
# Get the neighborhood's name
downtown_df.loc[0, "Neighborhood"]

'Rosedale'

In [66]:
# Get the neighborhood's long. and lat.
neighborhood_latitude = downtown_df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = downtown_df.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = downtown_df.loc[0, 'Neighborhood'] # neighborhood name 

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name,
                                                               neighborhood_latitude, neighborhood_longitude))

Latitude and longitude values of Rosedale are 43.6795626, -79.37752940000001.


##### Get the top 100 venues that are in Rosedale within a radius of 500 meters.

In [133]:
# Create the GET URL
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500
latitude = neighborhood_latitude
longitude = neighborhood_longitude
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
# Known-good URL for debugging
url1 = 'https://api.foursquare.com/v2/venues/explore?client_id=WSPXLU033XSGYW205KXJXWBIO4QPOKQU2LAWUVN4UXWOBGOM&client_secret=00D1TND5ZXNTZ4UBU3EURF3FIKFGDUHD1QD2AX2LONKYTQWU&ll=40.7896239,-73.9598939&v=20180605&radius=500&limit=100'


In [134]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e17530398205d001bb981bd'},
 'response': {'headerLocation': 'Rosedale',
  'headerFullLocation': 'Rosedale, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.6840626045, 'lng': -79.37131878274371},
   'sw': {'lat': 43.675062595499995, 'lng': -79.38374001725632}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4aff2d47f964a520743522e3',
       'name': 'Rosedale Park',
       'location': {'address': '38 Scholfield Ave.',
        'crossStreet': 'at Edgar Ave.',
        'lat': 43.68232820227814,
        'lng': -79.37893434347683,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.68232820227814,
          'lng': -79.37893434347683}],
        'distance': 32

###### From the Foursquare lab in the previous module, we know that all the information is in the items key. Before we proceed, let's borrow the get_category_type function from the Foursquare lab.

In [135]:
# function that extracts the category of the venue 
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0: 
        return None
    else:
        return categories_list[0]['name']
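
The nested response can also be walked directly with plain dictionaries. The sketch below assumes the structure visible in the printed response and in the code in this notebook (`response` → `groups` → `items` → `venue`); the helper `venue_rows` is hypothetical, and the second sample venue is invented to show the empty-category case:

```python
def venue_rows(results):
    """Turn an 'explore' response dict into flat (name, category, lat, lng) tuples."""
    groups = results.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # a venue may have no category at all, so avoid indexing an empty list
        category = categories[0]["name"] if categories else None
        rows.append((venue.get("name"), category, location.get("lat"), location.get("lng")))
    return rows

# Hand-written stand-in shaped like the response above (abbreviated)
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Rosedale Park",
               "location": {"lat": 43.682328, "lng": -79.378934},
               "categories": [{"name": "Playground"}]}},
    {"venue": {"name": "No Category Venue", "location": {}, "categories": []}},
]}]}}
rows = venue_rows(sample)
print(rows)
```

Using `.get` with defaults means a missing key yields `None` in the output rather than an exception partway through a batch of requests.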

In [136]:
# clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items'] 
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Rosedale Park,Playground,43.682328,-79.378934
1,Whitney Park,Park,43.682036,-79.373788
2,Alex Murray Parkette,Park,43.6783,-79.382773
3,Milkman's Lane,Trail,43.676352,-79.373842


In [137]:
# Print the number of venues that were returned
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


##### Explore Neighborhoods in Downtown Toronto

In [138]:
# Create a function to repeat the same process to all the neighborhoods in Downtown Toronto
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
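
Building the request URL with string formatting works, but it leaves escaping to the caller. One alternative is to let the standard library encode the query string. This is a sketch only: the helper `build_explore_url` and the placeholder credentials are hypothetical, while the endpoint and parameter names are the ones used in the function above:

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, lat, lng, version, radius=500, limit=100):
    """Build the explore URL with properly escaped query parameters."""
    query = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # keep the comma in 'll' readable instead of percent-encoding it
    return EXPLORE_ENDPOINT + "?" + urlencode(query, safe=",")

# Placeholder credentials, not real ones
url = build_explore_url("MY_ID", "MY_SECRET", 43.6795626, -79.3775294, "20180605")
print(url)
```

Keeping credentials in variables that are passed in, rather than embedded in a literal URL, also makes it easier to avoid sharing them by accident.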

##### Now write the code to run the above function on each neighborhood and create a new dataframe called dt_venues.

In [139]:
dt_venues = getNearbyVenues(names=downtown_df['Neighborhood'],
                                   latitudes=downtown_df['Latitude'],
                                   longitudes=downtown_df['Longitude']
                                  )


Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Queen's Park


In [140]:
print(dt_venues.shape)
dt_venues.head()

(1314, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,"Cabbagetown, St. James Town",43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


##### Let's check how many venues were returned for each neighborhood

In [141]:
dt_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,57,57,57,57,57,57
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",16,16,16,16,16,16
"Cabbagetown, St. James Town",43,43,43,43,43,43
Central Bay Street,84,84,84,84,84,84
"Chinatown, Grange Park, Kensington Market",91,91,91,91,91,91
Christie,17,17,17,17,17,17
Church and Wellesley,84,84,84,84,84,84
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"Design Exchange, Toronto Dominion Centre",100,100,100,100,100,100


##### Let's find out how many unique categories can be curated from all the returned venues

In [142]:
print('There are {} unique categories.'.format(len(dt_venues['Venue Category'].unique())))

There are 210 unique categories.


##### Do more detailed analysis

In [148]:
# one hot encoding
dt_onehot = pd.get_dummies(dt_venues[['Venue Category']], prefix="", prefix_sep="")


# add neighborhood column back to dataframe, at the end
#pieces = [dt_onehot.copy(), dt_venues['Neighborhood'].copy()]
#dt_onehot = pd.concat(pieces, axis=1)

dt_onehot = pd.merge(dt_onehot.copy(), dt_venues[['Neighborhood']], left_index=True, right_index=True, validate="1:1")
# move neighborhood column to the first column
fixed_columns = [dt_onehot.columns[-1]] + list(dt_onehot.columns[:-1])
dt_onehot = dt_onehot[fixed_columns]

dt_onehot
#dt_venues['Neighborhood']
#dt_onehot['Neighborhood']
#dt_onehot.columns

Unnamed: 0,Neighborhood_y,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Rosedale,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,"Cabbagetown, St. James Town",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"Cabbagetown, St. James Town",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,"Cabbagetown, St. James Town",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"Cabbagetown, St. James Town",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"Cabbagetown, St. James Town",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"Cabbagetown, St. James Town",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


###### Let's examine the shape

In [149]:
dt_onehot.shape

(1314, 211)

###### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [152]:
dtg = dt_onehot.groupby(['Neighborhood_y'])
dt_grouped = dtg.mean().reset_index()
dt_grouped

Unnamed: 0,Neighborhood_y,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,...,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,...,0.0,0.0,0.011905,0.0,0.0,0.011905,0.0,0.0,0.0,0.011905
5,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.054945,0.0,0.054945,0.010989,0.0,0.0,0.0,0.0
6,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.011905,0.0,0.02381
8,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,...,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0
9,"Design Exchange, Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,...,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0


###### Let's confirm the new size

In [153]:
dt_grouped.shape

(19, 211)
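
The one-hot encoding and the per-neighborhood mean can be seen end to end on a toy table. The sketch below uses made-up venues and attaches the neighborhood column with `insert` rather than a merge. The `_y` suffix in the column name above is likely added by the merge because one of the venue categories is itself called "Neighborhood", though that is an inference from the output rather than something shown directly:

```python
import pandas as pd

# Made-up venues: two in neighborhood A, one in neighborhood B
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Park", "Trail", "Park"],
})

# One indicator column per category; float dtype so the mean is a frequency
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicators = share of each category among a neighborhood's venues
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, which is why the values can be read as frequencies.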

#### Let's print each neighborhood along with the top 5 most common venues

In [156]:
num_top_venues = 5

for hood in dt_grouped['Neighborhood_y']:
    print("----"+hood+"----")
    temp = dt_grouped[dt_grouped['Neighborhood_y'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
          venue  freq
0   Coffee Shop  0.08
1    Steakhouse  0.04
2           Bar  0.04
3          Café  0.04
4  Burger Joint  0.03


----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2      Farmers Market  0.04
3  Seafood Restaurant  0.04
4            Beer Bar  0.04


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0   Airport Service  0.19
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3           Airport  0.06
4               Bar  0.06


----Cabbagetown, St. James Town----
         venue  freq
0  Coffee Shop  0.07
1  Pizza Place  0.05
2         Park  0.05
3         Café  0.05
4       Bakery  0.05


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.14
1  Italian Restaurant  0.06
2                Café  0.06
3      Ice Cream Shop  0.04
4        Burger Joint  0.04


----China

###### Let's put that into a *pandas* dataframe

First, let's reuse a function to sort the venues in descending order.

In [158]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
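
A quick way to see what this function does is to run the same idea on a single toy row. The sketch below uses `Series.nlargest` as an alternative to sorting; the helper `top_categories` and the frequencies are made up for illustration:

```python
import pandas as pd

def top_categories(row, n):
    """Return the n category names with the highest frequency in one grouped row."""
    # skip the first entry (the neighborhood name) and coerce the rest to float
    freqs = row.iloc[1:].astype(float)
    return freqs.nlargest(n).index.tolist()

# Toy row shaped like one row of dt_grouped
row = pd.Series({"Neighborhood_y": "Example", "Bar": 0.3, "Coffee Shop": 0.5, "Park": 0.2})
print(top_categories(row, 2))
```

The explicit conversion to float is needed because a row that mixes a name with numbers is stored with a generic object dtype.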

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [162]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood_y']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood_y'] = dt_grouped['Neighborhood_y']

for ind in np.arange(dt_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood_y,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Hotel,Burger Joint,Restaurant,Sushi Restaurant,Thai Restaurant,Asian Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Café,Beer Bar,Seafood Restaurant,Steakhouse,Bakery,Cheese Shop,Farmers Market,Liquor Store
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Airport,Airport Food Court,Airport Gate,Harbor / Marina,Bar,Boutique,Boat or Ferry
3,"Cabbagetown, St. James Town",Coffee Shop,Pizza Place,Italian Restaurant,Restaurant,Park,Bakery,Pub,Café,Caribbean Restaurant,Breakfast Spot
4,Central Bay Street,Coffee Shop,Italian Restaurant,Café,Burger Joint,Japanese Restaurant,Ice Cream Shop,Sandwich Place,Bubble Tea Shop,Juice Bar,Bakery
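
The column labels above rely on a three-element suffix list plus an exception for everything else, which is fine for ten columns but would label a 21st column "21th". A small sketch of a general ordinal helper (the function name `ordinal` is hypothetical):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens all take 'th'
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

columns = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns)
```

For the ten columns used here, the labels are identical to the ones produced above.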


### 3c. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [164]:
# set number of clusters
kclusters = 5

dt_grouped_clustering = dt_grouped.drop('Neighborhood_y', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dt_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 2, 0, 3, 0, 4, 0, 0, 0], dtype=int32)
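
Before interpreting the clusters, it can help to count how many neighborhoods landed in each one. A minimal sketch using only the ten labels printed above (the full array has one label per neighborhood, so the real counts will differ):

```python
import numpy as np

# First ten labels printed above
labels = np.array([0, 0, 2, 0, 3, 0, 4, 0, 0, 0])

# Count how many neighborhoods fall in each of the 5 clusters
cluster_sizes = np.bincount(labels, minlength=5)
print(cluster_sizes)
```

Among these ten labels, seven belong to cluster 0 and the rest are singletons, which suggests that several clusters may hold only one neighborhood each.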

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [168]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Unnamed: 0,Cluster Labels,Neighborhood_y,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Hotel,Burger Joint,Restaurant,Sushi Restaurant,Thai Restaurant,Asian Restaurant
1,0,Berczy Park,Coffee Shop,Cocktail Bar,Café,Beer Bar,Seafood Restaurant,Steakhouse,Bakery,Cheese Shop,Farmers Market,Liquor Store
2,2,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Airport,Airport Food Court,Airport Gate,Harbor / Marina,Bar,Boutique,Boat or Ferry
3,0,"Cabbagetown, St. James Town",Coffee Shop,Pizza Place,Italian Restaurant,Restaurant,Park,Bakery,Pub,Café,Caribbean Restaurant,Breakfast Spot
4,3,Central Bay Street,Coffee Shop,Italian Restaurant,Café,Burger Joint,Japanese Restaurant,Ice Cream Shop,Sandwich Place,Bubble Tea Shop,Juice Bar,Bakery
5,0,"Chinatown, Grange Park, Kensington Market",Café,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Coffee Shop,Dumpling Restaurant,Chinese Restaurant,Bakery,Bar,Mexican Restaurant,Comfort Food Restaurant
6,4,Christie,Grocery Store,Café,Park,Baby Store,Athletics & Sports,Candy Store,Italian Restaurant,Diner,Restaurant,Coffee Shop
7,0,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Theater,Bubble Tea Shop,Café,Gym
8,0,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Restaurant,Hotel,Steakhouse,Gastropub,Deli / Bodega,Seafood Restaurant,American Restaurant,Gym
9,0,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Hotel,Café,Bar,Restaurant,Steakhouse,Gastropub,Seafood Restaurant,American Restaurant,Italian Restaurant
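
One practical caveat: `DataFrame.insert` raises a `ValueError` when the column already exists, so re-running the cell above fails. A small sketch of a re-runnable variant follows. The helper `set_cluster_labels` and the two-row dataframe are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical stand-in for neighborhoods_venues_sorted
venues = pd.DataFrame({
    "Neighborhood_y": ["Berczy Park", "Christie"],
    "1st Most Common Venue": ["Coffee Shop", "Grocery Store"],
})
labels = [0, 4]  # hypothetical cluster labels, one per row


def set_cluster_labels(df, labels, column="Cluster Labels"):
    """Return a copy of df with the label column at position 0, replacing any earlier copy."""
    if column in df.columns:
        df = df.drop(columns=column)  # drop returns a new dataframe
    else:
        df = df.copy()
    df.insert(0, column, labels)
    return df


venues = set_cluster_labels(venues, labels)
venues = set_cluster_labels(venues, labels)  # the second call does not raise
print(venues.columns.tolist())
```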


In [175]:
dt_merged = downtown_df.copy()

# merge downtown_df with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
dt_merged = pd.merge(dt_merged, neighborhoods_venues_sorted.set_index('Neighborhood_y').copy(), left_on='Neighborhood', right_index=True, validate="1:1")

dt_merged.head() # check the last columns!


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,1,Park,Playground,Trail,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,0,Coffee Shop,Pizza Place,Italian Restaurant,Restaurant,Park,Bakery,Pub,Café,Caribbean Restaurant,Breakfast Spot
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Theater,Bubble Tea Shop,Café,Gym
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,3,Coffee Shop,Park,Bakery,Pub,Café,Mexican Restaurant,Breakfast Spot,Brewery,Beer Store,Ice Cream Shop
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,0,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Middle Eastern Restaurant,Ramen Restaurant,Pizza Place,Restaurant,Japanese Restaurant,Diner
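
Because `pd.merge` performs an inner join by default, a neighborhood without a matching row in the venues table would silently disappear from `dt_merged`. The sketch below shows one way to detect that, assuming miniature hypothetical versions of the two dataframes. The row "No Venues Here" is invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of downtown_df and neighborhoods_venues_sorted
left = pd.DataFrame({
    "Neighborhood": ["Rosedale", "Harbourfront", "No Venues Here"],
    "Latitude": [43.679563, 43.65426, 0.0],
})
right = pd.DataFrame({
    "Neighborhood_y": ["Rosedale", "Harbourfront"],
    "Cluster Labels": [1, 3],
}).set_index("Neighborhood_y")

# how="left" keeps every neighborhood; indicator=True records where each row matched
merged = pd.merge(left, right, left_on="Neighborhood", right_index=True,
                  how="left", validate="1:1", indicator=True)

# Rows marked "left_only" had no entry in the venues table
unmatched = merged.loc[merged["_merge"] == "left_only", "Neighborhood"].tolist()
print(unmatched)
```

If `unmatched` is empty on the real data, the default inner join loses nothing.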


Finally, let's visualize the resulting clusters on a map.

In [192]:
# create map
latitude = downtown_df.loc[:,"Latitude"].median()
longitude = downtown_df.loc[:,"Longitude"].median()
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(dt_merged['Latitude'], dt_merged['Longitude'], dt_merged['Neighborhood'], dt_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # keep the default outline color because it is easier to see on the map
        fill=True,
        # KMeans labels are zero-based, so index the palette with the label itself
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 3d. Examine Clusters

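Now examine each cluster and identify the venue categories that distinguish it from the others. Based on those categories, you can assign a descriptive name to each cluster; I leave the naming as an exercise for the reader. Note that the headings below number the clusters from 1, while the `Cluster Labels` column is zero-based, so Cluster 1 corresponds to label 0.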
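
One way to start is to count which venue category most often ranks first within each cluster. This is a rough sketch under stated assumptions: `sample` is a hypothetical four-row excerpt of `dt_merged` built from rows shown in this notebook, and looking only at the first-ranked venue is a simplification.

```python
import pandas as pd

# Hypothetical excerpt of dt_merged, using rows shown in this notebook
sample = pd.DataFrame({
    "Neighborhood": ["Rosedale", "Church and Wellesley", "Harbourfront", "Berczy Park"],
    "Cluster Labels": [1, 0, 3, 0],
    "1st Most Common Venue": ["Park", "Coffee Shop", "Coffee Shop", "Coffee Shop"],
})

# For each cluster, find the venue category that most often ranks first
top_venue = (
    sample.groupby("Cluster Labels")["1st Most Common Venue"]
    .agg(lambda s: s.value_counts().idxmax())
)
print(top_venue.to_dict())
```

When two clusters share the same leading category, as labels 0 and 3 do here, the lower-ranked venue columns are the place to look for the difference.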

##### Cluster 1

In [183]:
dt_merged.loc[dt_merged['Cluster Labels'] == 0, dt_merged.columns[[2] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Cabbagetown, St. James Town",0,Coffee Shop,Pizza Place,Italian Restaurant,Restaurant,Park,Bakery,Pub,Café,Caribbean Restaurant,Breakfast Spot
2,Church and Wellesley,0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Theater,Bubble Tea Shop,Café,Gym
4,"Ryerson, Garden District",0,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Middle Eastern Restaurant,Ramen Restaurant,Pizza Place,Restaurant,Japanese Restaurant,Diner
5,St. James Town,0,Coffee Shop,Café,Restaurant,Beer Bar,Hotel,Clothing Store,Cosmetics Shop,Breakfast Spot,Bakery,Park
6,Berczy Park,0,Coffee Shop,Cocktail Bar,Café,Beer Bar,Seafood Restaurant,Steakhouse,Bakery,Cheese Shop,Farmers Market,Liquor Store
8,"Adelaide, King, Richmond",0,Coffee Shop,Café,Bar,Steakhouse,Hotel,Burger Joint,Restaurant,Sushi Restaurant,Thai Restaurant,Asian Restaurant
10,"Design Exchange, Toronto Dominion Centre",0,Coffee Shop,Hotel,Café,Bar,Restaurant,Steakhouse,Gastropub,Seafood Restaurant,American Restaurant,Italian Restaurant
11,"Commerce Court, Victoria Hotel",0,Coffee Shop,Café,Restaurant,Hotel,Steakhouse,Gastropub,Deli / Bodega,Seafood Restaurant,American Restaurant,Gym
12,"Harbord, University of Toronto",0,Café,Bookstore,Bakery,Sandwich Place,Japanese Restaurant,Italian Restaurant,Bar,Restaurant,Sushi Restaurant,Poutine Place
13,"Chinatown, Grange Park, Kensington Market",0,Café,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Coffee Shop,Dumpling Restaurant,Chinese Restaurant,Bakery,Bar,Mexican Restaurant,Comfort Food Restaurant


##### Cluster 2

In [184]:
dt_merged.loc[dt_merged['Cluster Labels'] == 1, dt_merged.columns[[2] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Rosedale,1,Park,Playground,Trail,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


##### Cluster 3

In [186]:
dt_merged.loc[dt_merged['Cluster Labels'] == 2, dt_merged.columns[[2] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,"CN Tower, Bathurst Quay, Island airport, Harbo...",2,Airport Service,Airport Lounge,Airport Terminal,Airport,Airport Food Court,Airport Gate,Harbor / Marina,Bar,Boutique,Boat or Ferry


##### Cluster 4

In [187]:
dt_merged.loc[dt_merged['Cluster Labels'] == 3, dt_merged.columns[[2] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Harbourfront,3,Coffee Shop,Park,Bakery,Pub,Café,Mexican Restaurant,Breakfast Spot,Brewery,Beer Store,Ice Cream Shop
7,Central Bay Street,3,Coffee Shop,Italian Restaurant,Café,Burger Joint,Japanese Restaurant,Ice Cream Shop,Sandwich Place,Bubble Tea Shop,Juice Bar,Bakery
9,"Harbourfront East, Toronto Islands, Union Station",3,Coffee Shop,Aquarium,Café,Italian Restaurant,Hotel,Scenic Lookout,Brewery,Restaurant,Fried Chicken Joint,Sporting Goods Shop
18,Queen's Park,3,Coffee Shop,Sushi Restaurant,Gym,Park,Portuguese Restaurant,Bar,Chinese Restaurant,Beer Bar,Smoothie Shop,Japanese Restaurant
