# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this notebook, we build a dataset of Toronto neighborhoods by scraping a Wikipedia page, and then use the Foursquare API to explore those neighborhoods. We use the **explore** endpoint to get the most common venue categories in each neighborhood, and then use these features to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, we use the Folium library to visualize the neighborhoods and their emerging clusters.

# Part 1 - Creating a dataframe containing PostalCode, Borough and Neighborhood columns

In [1]:
# installing and importing the libraries required for web scraping
!pip install requests
!pip install bs4
from bs4 import BeautifulSoup 
import requests
import pandas as pd

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


To explore and cluster the neighborhoods in Toronto, we will scrape the following Wikipedia page and then read it into a pandas dataframe and clean it as follows:


In [2]:
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page= requests.get(wikipedia_link).text
page_lxml= BeautifulSoup(page,'lxml')
table=page_lxml.find('table')
#table.findAll('tr')

In [4]:
tables = pd.read_html(wikipedia_link)
dataframe = tables[0]
dataframe.columns = ['PostalCode', 'Borough', 'Neighborhood'] #renaming the PostCode column to PostalCode column

dataframe.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [5]:
# Keep only the rows that have an assigned Borough, i.e. drop rows whose Borough is 'Not assigned'.

df = dataframe[dataframe.Borough != 'Not assigned']
df.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [6]:
# Combine rows that share the same PostalCode into one row, with the neighborhoods separated by commas.
cleaned_df=df.groupby("PostalCode").agg(lambda x:','.join(set(x)))
cleaned_df=cleaned_df.reset_index()
cleaned_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"West Hill,Guildwood,Morningside"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [7]:
# If a row has a borough but a 'Not assigned' neighborhood, use the borough name as the neighborhood.
not_assigned = cleaned_df['Neighborhood'] == "Not assigned"
cleaned_df.loc[not_assigned, 'Neighborhood'] = cleaned_df.loc[not_assigned, 'Borough']

len(cleaned_df[cleaned_df['Neighborhood'] == 'Not assigned'])

0

In [9]:
cleaned_df.shape

(103, 3)
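
The cleaning steps of Part 1 can be reproduced on a small in-memory table, which makes the effect of each step easy to check. The sketch below is not the notebook's exact code. Four of the sample rows are taken from the outputs above, while the `M7A` row is a hypothetical example added to show the 'Not assigned' neighborhood case. It also joins the neighborhoods in sorted order rather than through a bare `set`, because the iteration order of a set of strings is not guaranteed to be the same between runs.

```python
import pandas as pd

# Sample rows copied from the outputs above; the last row is a hypothetical
# example of a borough whose neighborhood is 'Not assigned'.
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M3A", "North York", "Parkwoods"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# 1. Drop rows without an assigned borough.
assigned = raw[raw["Borough"] != "Not assigned"]

# 2. One row per postal code; neighborhoods joined in sorted order.
cleaned = assigned.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": lambda s: ",".join(sorted(set(s)))}
)

# 3. Fall back to the borough name where the neighborhood is unassigned.
mask = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[mask, "Neighborhood"] = cleaned.loc[mask, "Borough"]

print(cleaned)
```

Taking the `first` borough per postal code assumes that a postal code never spans two boroughs, which holds for the rows shown here.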

# Part 2 - Getting latitude and longitude coordinates

We now have a dataframe with the postal code, borough name and neighborhood name of each neighborhood. To use the Foursquare location data, we also need the latitude and longitude of each neighborhood. A CSV file with the geographical coordinates of each postal code is therefore merged with the cleaned dataframe, using an inner join on the column that both dataframes share (`PostalCode`). The result is a dataframe with the columns PostalCode, Borough, Neighborhood, Latitude and Longitude.

In [10]:
# Importing the csv file that contains Latitude and Longitude of a given PostalCode
lat_lon_df = pd.read_csv("http://cocl.us/Geospatial_data")
lat_lon_df.columns = ['PostalCode', 'Latitude', 'Longitude'] #renaming the PostCode column to PostalCode column
lat_lon_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [38]:
# Merging the above two data frames cleaned_df and lat_lon_df 
df_merged = pd.merge(cleaned_df, lat_lon_df, on='PostalCode', how='inner')
df_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"West Hill,Guildwood,Morningside",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
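
An inner join silently drops any postal code that is missing from either table. One way to check for this, sketched here on toy data, is an outer merge with the `indicator` and `validate` options of `pd.merge`. The postal code `M9Z` and its borough are made up for illustration; the other values come from the outputs above.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # 'M9Z' is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from.
check = pd.merge(left, right, on="PostalCode", how="outer",
                 validate="one_to_one", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that the inner join keeps every postal code.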


# Part 3 - Exploring and clustering the neighborhoods in Toronto

Before we start working on this part, let's install and import all the required libraries.

In [12]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')


In [17]:
# Quickly examine the resulting dataframe: count its boroughs and neighborhoods.

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_merged['Borough'].unique()),
        df_merged.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


In [40]:
# Keep only the boroughs whose name contains the word 'Toronto', using str.contains.

df_toronto = df_merged[df_merged['Borough'].str.contains('Toronto')]
df_toronto = df_toronto.reset_index(drop=True)
df_toronto = df_toronto.drop(columns='PostalCode')
df_toronto.head()


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,East Toronto,The Beaches,43.676357,-79.293031
1,East Toronto,"Riverdale,The Danforth West",43.679557,-79.352188
2,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,East Toronto,Studio District,43.659526,-79.340923
4,Central Toronto,Lawrence Park,43.72802,-79.38879


### Use geopy library to get the latitude and longitude values of Toronto.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent *Toronto_explorer*, as shown below.

In [28]:
address = 'Toronto'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


### Create a map of Toronto with neighborhoods superimposed on top.

In [30]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

For illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in Central Toronto. So let's slice the original dataframe and create a new dataframe of the Central Toronto data.

In [41]:
central_data = df_toronto[df_toronto['Borough'] == 'Central Toronto'].reset_index(drop=True)
central_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Central Toronto,Lawrence Park,43.72802,-79.38879
1,Central Toronto,Davisville North,43.712751,-79.390197
2,Central Toronto,North Toronto West,43.715383,-79.405678
3,Central Toronto,Davisville,43.704324,-79.38879
4,Central Toronto,"Summerhill East,Moore Park",43.689574,-79.38316


Let's get the geographical coordinates of Central Toronto.

In [47]:
address = 'Central Toronto, Toronto'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Central Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Central Toronto are 43.653963, -79.387207.
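
The geocoder returned the same coordinates for 'Central Toronto, Toronto' as it did for 'Toronto', so the map would be centered on the city rather than on the borough. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to center the map on the mean of the neighborhood coordinates. Only the five rows shown in `central_data.head()` are used; the full dataframe has nine.

```python
# Latitude/longitude pairs copied from the central_data.head() output above.
coords = [
    (43.72802, -79.38879),
    (43.712751, -79.390197),
    (43.715383, -79.405678),
    (43.704324, -79.38879),
    (43.689574, -79.38316),
]

def centroid(points):
    """Return the arithmetic mean of (lat, lon) pairs; adequate at city scale."""
    lats, lons = zip(*points)
    return sum(lats) / len(lats), sum(lons) / len(lons)

center_lat, center_lon = centroid(coords)
print(round(center_lat, 4), round(center_lon, 4))
```

With the full dataframe, `central_data[['Latitude', 'Longitude']].mean()` should give the equivalent values, which could then be passed as `location` to `folium.Map`.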


As we did with all of Toronto, let's visualize Central Toronto and the neighborhoods in it.

In [48]:
# create map of Central Toronto using latitude and longitude values
map_central = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(central_data['Latitude'], central_data['Longitude'], central_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_central)  
    
map_central

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

### Define Foursquare Credentials and Version

In [57]:
CLIENT_ID = '2L42TZQYJ4UULJGSDYSYSGWDSUNE0H0OWZQU3VQVTPIOZI15' # your Foursquare ID
CLIENT_SECRET = 'LQHQSPCBSQXINPE1RQEUYZGWYKW0WGZPA5DYBMWHUEQZBT4I' # your Foursquare Secret
VERSION = '20190718' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 2L42TZQYJ4UULJGSDYSYSGWDSUNE0H0OWZQU3VQVTPIOZI15
CLIENT_SECRET:LQHQSPCBSQXINPE1RQEUYZGWYKW0WGZPA5DYBMWHUEQZBT4I


### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [58]:
central_data.loc[0, 'Neighborhood']

'Lawrence Park'

Get the neighborhood's latitude and longitude values.

In [59]:
neighborhood_latitude = central_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = central_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = central_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Lawrence Park are 43.7280205, -79.3887901.


### Now, let's get the top 100 venues that are in Lawrence Park within a radius of 500 meters.

First, let's create the GET request URL. Name your URL **url**.

In [60]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=2L42TZQYJ4UULJGSDYSYSGWDSUNE0H0OWZQU3VQVTPIOZI15&client_secret=LQHQSPCBSQXINPE1RQEUYZGWYKW0WGZPA5DYBMWHUEQZBT4I&v=20190718&ll=43.7280205,-79.3887901&radius=500&limit=100'
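
The URL above embeds the client ID and secret, so displaying it also displays the credentials. A possible alternative is to read the credentials from environment variables and to build the query string with `urllib.parse.urlencode`. In this sketch the environment variable names and the function name `build_explore_url` are assumptions, and the placeholder defaults exist only so that the code runs; they are not valid credentials.

```python
import os
from urllib.parse import urlencode

# Hypothetical environment variable names; the placeholder defaults only keep
# the sketch runnable and are not valid credentials.
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET")

def build_explore_url(lat, lng, radius=500, limit=100, version="20190718"):
    """Build the explore URL from a dict of query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll parameter unescaped.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")

example_url = build_explore_url(43.7280205, -79.3887901)
# Print only the part of the URL that follows the credentials.
print(example_url.split("&v=")[1])
```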

Send the GET request and examine the results.

In [61]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d32b6572b274a002c71bf60'},
  'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.7325205045, 'lng': -79.3825744605273},
   'sw': {'lat': 43.7235204955, 'lng': -79.3950057394727}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '50e6da19e4b0d8a78a0e9794',
       'name': 'Lawrence Park Ravine',
       'location': {'address': '3055 Yonge Street',
        'crossStreet': 'Lawrence Avenue East',
        'lat': 43.72696303913755,
        'lng': -79.39438246708775,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.72696303913755,
          'lng': -79.39438246708775}],
        'distance': 465,
        'cc': 'CA',
  

In [62]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [63]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Lawrence Park Ravine,Park,43.726963,-79.394382
1,Zodiac Swim School,Swim School,43.728532,-79.38286
2,TTC Bus #162 - Lawrence-Donway,Bus Line,43.728026,-79.382805


And how many venues were returned by Foursquare?

In [64]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.
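
The flattening step can be tried without calling the API by using a hand-written fragment shaped like one element of `results['response']['groups'][0]['items']`. This is a sketch under assumptions: the shape of the `categories` entries is inferred from `get_category_type`, the second venue is a made-up edge case, and `pd.json_normalize` is the top-level name that pandas 1.0 and later provide for the function the notebook imports from `pandas.io.json`.

```python
import pandas as pd

# Hand-written fragment; the first venue uses values from the output above.
items = [
    {"venue": {"name": "Lawrence Park Ravine",
               "categories": [{"name": "Park"}],
               "location": {"lat": 43.72696303913755, "lng": -79.39438246708775}}},
    {"venue": {"name": "Venue without category",  # hypothetical edge case
               "categories": [],
               "location": {"lat": 43.7285, "lng": -79.3829}}},
]

flat = pd.json_normalize(items)  # nested keys become dotted column names
flat = flat[["venue.name", "venue.categories",
             "venue.location.lat", "venue.location.lng"]].copy()

# Keep the first category name, or None when the list is empty.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Drop the 'venue.' and 'location.' prefixes from the column names.
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```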


## 2. Explore Neighborhoods in Central Toronto

### Let's create a function to repeat the same process for all the neighborhoods in Central Toronto

In [65]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
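
`getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back without a category. The sketch below separates the row-extraction step into its own function so that it tolerates an empty category list and can be tested without calling the API. The function name `extract_venue_rows` and the second sample venue are hypothetical.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None instead of failing when the venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

sample_items = [
    {"venue": {"name": "Zodiac Swim School",
               "categories": [{"name": "Swim School"}],
               "location": {"lat": 43.728532, "lng": -79.38286}}},
    {"venue": {"name": "Uncategorized venue",  # hypothetical edge case
               "categories": [],
               "location": {"lat": 43.7281, "lng": -79.3885}}},
]
rows = extract_venue_rows("Lawrence Park", 43.7280205, -79.3887901, sample_items)
print(rows[0][-1], rows[1][-1])
```

Rows with a `None` category could then be dropped or labeled before one-hot encoding.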

### Now let's run the above function on each neighborhood and create a new dataframe called *central_venues*.

In [66]:

central_venues = getNearbyVenues(names=central_data['Neighborhood'],
                                   latitudes=central_data['Latitude'],
                                   longitudes=central_data['Longitude']
                                  )



Lawrence Park
Davisville North
North Toronto West
Davisville
Summerhill East,Moore Park
Forest Hill SE,Summerhill West,Rathnelly,Deer Park,South Hill
Roselawn
Forest Hill West,Forest Hill North
Yorkville,North Midtown,The Annex


### Let's check the size of the resulting dataframe

In [67]:
print(central_venues.shape)
central_venues.head()

(114, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.72802,-79.38879,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.72802,-79.38879,Zodiac Swim School,43.728532,-79.38286,Swim School
2,Lawrence Park,43.72802,-79.38879,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
3,Davisville North,43.712751,-79.390197,Sherwood Park,43.716551,-79.387776,Park
4,Davisville North,43.712751,-79.390197,Summerhill Market North,43.715499,-79.392881,Food & Drink Shop


Let's check how many venues were returned for each neighborhood

In [68]:
central_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Davisville,33,33,33,33,33,33
Davisville North,9,9,9,9,9,9
"Forest Hill SE,Summerhill West,Rathnelly,Deer Park,South Hill",15,15,15,15,15,15
"Forest Hill West,Forest Hill North",5,5,5,5,5,5
Lawrence Park,3,3,3,3,3,3
North Toronto West,20,20,20,20,20,20
Roselawn,2,2,2,2,2,2
"Summerhill East,Moore Park",3,3,3,3,3,3
"Yorkville,North Midtown,The Annex",24,24,24,24,24,24


### Let's find out how many unique categories can be curated from all the returned venues

In [69]:
print('There are {} unique categories.'.format(len(central_venues['Venue Category'].unique())))

There are 62 unique categories.


## 3. Analyze Each Neighborhood

In [70]:
# one hot encoding
cenral_onehot = pd.get_dummies(central_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
cenral_onehot['Neighborhood'] = central_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [cenral_onehot.columns[-1]] + list(cenral_onehot.columns[:-1])
cenral_onehot = cenral_onehot[fixed_columns]

cenral_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,BBQ Joint,Bagel Shop,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Cosmetics Shop,Dance Studio,Dessert Shop,Diner,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Health & Beauty Service,History Museum,Hotel,Indian Restaurant,Indoor Play Area,Italian Restaurant,Jewelry Store,Jewish Restaurant,Light Rail Station,Liquor Store,Metro Station,Mexican Restaurant,Miscellaneous Shop,Music Venue,Park,Pharmacy,Pizza Place,Playground,Pub,Rental Car Location,Restaurant,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Supermarket,Sushi Restaurant,Swim School,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Lawrence Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Lawrence Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,Lawrence Park,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Davisville North,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Davisville North,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [71]:
cenral_onehot.shape

(114, 63)

### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [72]:
central_grouped = cenral_onehot.groupby('Neighborhood').mean().reset_index()
central_grouped

Unnamed: 0,Neighborhood,American Restaurant,BBQ Joint,Bagel Shop,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Cosmetics Shop,Dance Studio,Dessert Shop,Diner,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Health & Beauty Service,History Museum,Hotel,Indian Restaurant,Indoor Play Area,Italian Restaurant,Jewelry Store,Jewish Restaurant,Light Rail Station,Liquor Store,Metro Station,Mexican Restaurant,Miscellaneous Shop,Music Venue,Park,Pharmacy,Pizza Place,Playground,Pub,Rental Car Location,Restaurant,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Supermarket,Sushi Restaurant,Swim School,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Davisville,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.060606,0.0,0.0,0.060606,0.0,0.030303,0.090909,0.030303,0.030303,0.0,0.0,0.030303,0.0,0.0,0.030303,0.030303,0.0,0.030303,0.0,0.0,0.0,0.0,0.030303,0.030303,0.060606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.060606,0.060606,0.0,0.0,0.0,0.030303,0.0,0.090909,0.030303,0.0,0.0,0.0,0.0,0.060606,0.0,0.030303,0.030303,0.0,0.0,0.0,0.0
1,Davisville North,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.111111,0.111111,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Forest Hill SE,Summerhill West,Rathnelly,Deer ...",0.066667,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.133333,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.066667,0.066667,0.066667,0.0,0.0,0.0,0.0,0.0,0.066667,0.0
3,"Forest Hill West,Forest Hill North",0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.2,0.0,0.0,0.0
4,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0
5,North Toronto West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.1,0.1,0.0,0.0,0.05,0.05,0.0,0.05,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.05,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.05,0.0,0.05,0.0,0.0,0.0,0.0,0.05,0.0,0.05,0.0,0.0,0.05,0.05,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.05
6,Roselawn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Summerhill East,Moore Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.333333,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Yorkville,North Midtown,The Annex",0.041667,0.041667,0.0,0.0,0.0,0.041667,0.0,0.125,0.0,0.0,0.125,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.041667,0.0,0.0,0.0,0.041667,0.0,0.041667,0.041667,0.0,0.0,0.0,0.041667,0.041667,0.083333,0.0,0.041667,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0


### Let's confirm the new size

In [73]:
central_grouped.shape

(9, 63)
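
Taking the mean of the one-hot columns within a neighborhood gives the fraction of that neighborhood's venues that fall in each category. The sketch below shows this on the first five rows of `central_venues`. Because it uses only a subset, the Davisville North values differ from the full table above.

```python
import pandas as pd

# First five rows of central_venues, reduced to the two columns that matter here.
venues = pd.DataFrame({
    "Neighborhood": ["Lawrence Park"] * 3 + ["Davisville North"] * 2,
    "Venue Category": ["Park", "Swim School", "Bus Line", "Park", "Food & Drink Shop"],
})

# One indicator column per category, stored as 0.0/1.0.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of 0/1 indicators = share of the neighborhood's venues in that category.
freq = onehot.groupby("Neighborhood").mean()
print(freq.round(2))
```

Each row of `freq` sums to 1, which is a quick sanity check on the grouped table.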

### Let's print each neighborhood along with the top 5 most common venues

In [74]:
num_top_venues = 5

for hood in central_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = central_grouped[central_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Davisville----
                venue  freq
0      Sandwich Place  0.09
1        Dessert Shop  0.09
2  Italian Restaurant  0.06
3            Pharmacy  0.06
4         Pizza Place  0.06


----Davisville North----
               venue  freq
0  Food & Drink Shop  0.11
1               Park  0.11
2      Grocery Store  0.11
3                Gym  0.11
4       Dance Studio  0.11


----Forest Hill SE,Summerhill West,Rathnelly,Deer Park,South Hill----
                 venue  freq
0          Coffee Shop  0.13
1                  Pub  0.13
2  American Restaurant  0.07
3         Liquor Store  0.07
4          Pizza Place  0.07


----Forest Hill West,Forest Hill North----
              venue  freq
0     Jewelry Store   0.2
1             Trail   0.2
2          Bus Line   0.2
3              Park   0.2
4  Sushi Restaurant   0.2


----Lawrence Park----
                 venue  freq
0             Bus Line  0.33
1          Swim School  0.33
2                 Park  0.33
3  American Restaurant  0.00
4       

### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [95]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [96]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = central_grouped['Neighborhood']

for ind in np.arange(central_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(central_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Davisville,Sandwich Place,Dessert Shop,Pharmacy,Sushi Restaurant,Pizza Place,Italian Restaurant,Café,Coffee Shop,Restaurant,Diner
1,Davisville North,Hotel,Grocery Store,Park,Dance Studio,Sandwich Place,Clothing Store,Food & Drink Shop,Gym,Breakfast Spot,Greek Restaurant
2,"Forest Hill SE,Summerhill West,Rathnelly,Deer ...",Pub,Coffee Shop,Sports Bar,Vietnamese Restaurant,Light Rail Station,Liquor Store,Fried Chicken Joint,Pizza Place,Restaurant,American Restaurant
3,"Forest Hill West,Forest Hill North",Bus Line,Trail,Park,Jewelry Store,Sushi Restaurant,Gourmet Shop,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden
4,Lawrence Park,Bus Line,Park,Swim School,Greek Restaurant,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gourmet Shop,Yoga Studio
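
Many categories tie within a neighborhood. For example, every category present in Davisville North has a frequency of 0.11, and the earlier top-5 printout lists its tied categories in a different order than the table above. If a reproducible order is wanted, one option is to break ties alphabetically, as in this sketch. The function name `top_categories` is hypothetical and is not part of the notebook.

```python
import pandas as pd

def top_categories(freqs, n):
    """Return the n most frequent categories, ties broken alphabetically."""
    # Sort by descending frequency first, then by category name.
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

# Lawrence Park frequencies from the grouped table, plus two of its zero entries.
row = pd.Series({"Yoga Studio": 0.0, "Swim School": 1 / 3, "Park": 1 / 3,
                 "Bus Line": 1 / 3, "American Restaurant": 0.0})
print(top_categories(row, 4))
```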


## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 2 clusters.

In [97]:
# set number of clusters
kclusters = 2

central_grouped_clustering = central_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(central_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=int32)
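
To illustrate what *k*-means does, the following is a minimal NumPy sketch of the basic assign-and-update loop (Lloyd's algorithm). It is not scikit-learn's implementation, which by default uses k-means++ initialization and is considerably more refined. As a simplification, the sketch starts from the first `k` rows as centers, and the six two-dimensional points are made up for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100):
    """Basic k-means: alternate nearest-center assignment and center update."""
    X = np.asarray(X, dtype=float)
    centers = X[:k].copy()  # simplification: first k rows as initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if it has none.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return labels, centers

# Two well-separated groups of made-up points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans_sketch(points, k=2)
print(labels)
```

In the notebook, each row of `central_grouped_clustering` plays the role of one point, with one coordinate per venue category.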

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [98]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

central_merged = central_data

# join the cluster labels and top venues onto central_data to add latitude/longitude for each neighborhood
central_merged = central_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

central_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Bus Line,Park,Swim School,Greek Restaurant,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gourmet Shop,Yoga Studio
1,Central Toronto,Davisville North,43.712751,-79.390197,0,Hotel,Grocery Store,Park,Dance Studio,Sandwich Place,Clothing Store,Food & Drink Shop,Gym,Breakfast Spot,Greek Restaurant
2,Central Toronto,North Toronto West,43.715383,-79.405678,0,Clothing Store,Coffee Shop,Yoga Studio,Furniture / Home Store,Fast Food Restaurant,Mexican Restaurant,Miscellaneous Shop,Park,Diner,Dessert Shop
3,Central Toronto,Davisville,43.704324,-79.38879,0,Sandwich Place,Dessert Shop,Pharmacy,Sushi Restaurant,Pizza Place,Italian Restaurant,Café,Coffee Shop,Restaurant,Diner
4,Central Toronto,"Summerhill East,Moore Park",43.689574,-79.38316,0,Park,Playground,Restaurant,Yoga Studio,Farmers Market,Health & Beauty Service,Gym / Fitness Center,Gym,Grocery Store,Greek Restaurant


Finally, let's visualize the resulting clusters

In [99]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(central_merged['Latitude'], central_merged['Longitude'], central_merged['Neighborhood'], central_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examine Clusters
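Now we can examine each cluster and identify the venue categories that distinguish it. Note that the tables below include a cluster label of 2, so they appear to come from an earlier run with three clusters; with `kclusters = 2` as set above, only the labels 0 and 1 occur.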
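
Before reading the per-cluster tables, it can help to check how many neighborhoods fall into each cluster. The sketch below uses the labels printed by `kmeans.labels_` above, which follow the row order of `central_grouped`, and writes out the neighborhood names by hand so that it runs on its own.

```python
from collections import defaultdict

# Neighborhoods in central_grouped order and the labels printed by kmeans.labels_.
neighborhoods = [
    "Davisville",
    "Davisville North",
    "Forest Hill SE,Summerhill West,Rathnelly,Deer Park,South Hill",
    "Forest Hill West,Forest Hill North",
    "Lawrence Park",
    "North Toronto West",
    "Roselawn",
    "Summerhill East,Moore Park",
    "Yorkville,North Midtown,The Annex",
]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0]

# Collect the neighborhood names that belong to each label.
members = defaultdict(list)
for hood, label in zip(neighborhoods, labels):
    members[label].append(hood)

for label in sorted(members):
    print("Cluster {}: {} neighborhood(s)".format(label + 1, len(members[label])))
```

With the dataframe at hand, `central_merged['Cluster Labels'].value_counts()` should give the same counts.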

### Cluster 1

In [90]:
central_merged.loc[central_merged['Cluster Labels'] == 0, central_merged.columns[[1] + list(range(5, central_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Lawrence Park,Bus Line,Park,Swim School,Greek Restaurant,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gourmet Shop,Yoga Studio
1,Davisville North,Hotel,Grocery Store,Park,Dance Studio,Sandwich Place,Clothing Store,Food & Drink Shop,Gym,Breakfast Spot,Greek Restaurant
2,North Toronto West,Clothing Store,Coffee Shop,Yoga Studio,Furniture / Home Store,Fast Food Restaurant,Mexican Restaurant,Miscellaneous Shop,Park,Diner,Dessert Shop
3,Davisville,Sandwich Place,Dessert Shop,Pharmacy,Sushi Restaurant,Pizza Place,Italian Restaurant,Café,Coffee Shop,Restaurant,Diner
5,"Forest Hill SE,Summerhill West,Rathnelly,Deer ...",Pub,Coffee Shop,Sports Bar,Vietnamese Restaurant,Light Rail Station,Liquor Store,Fried Chicken Joint,Pizza Place,Restaurant,American Restaurant
7,"Forest Hill West,Forest Hill North",Bus Line,Trail,Park,Jewelry Store,Sushi Restaurant,Gourmet Shop,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden
8,"Yorkville,North Midtown,The Annex",Coffee Shop,Café,Sandwich Place,Pizza Place,American Restaurant,Park,Pharmacy,Liquor Store,Pub,Jewish Restaurant


### Cluster 2

In [91]:
central_merged.loc[central_merged['Cluster Labels'] == 1, central_merged.columns[[1] + list(range(5, central_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,"Summerhill East,Moore Park",Park,Playground,Restaurant,Yoga Studio,Farmers Market,Health & Beauty Service,Gym / Fitness Center,Gym,Grocery Store,Greek Restaurant


### Cluster 3

In [92]:
central_merged.loc[central_merged['Cluster Labels'] == 2, central_merged.columns[[1] + list(range(5, central_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Roselawn,Music Venue,Garden,Yoga Studio,Farmers Market,History Museum,Health & Beauty Service,Gym / Fitness Center,Gym,Grocery Store,Greek Restaurant
