# **FEDE's "Seg and Clust Neighborhoods in Toronto"**

Segmenting and Clustering Neighborhoods in Toronto

## Objectives

For this assignment, I will be exploring and clustering the neighborhoods in Toronto.

This lab is split into three parts, following the assignment instructions. The parts and their contents are:

PART 1: Scraping and building the dataframe

PART 2: Obtaining Coordinates

PART 3: Explore and Cluster

## PART 3: Explore and Cluster

### A. Download and Explore

#### First of all, I read the dataframe generated in PART 2, which contains df_1 plus the coordinates.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
df_3 = pd.read_csv("df_2.csv", index_col=[0])

In [3]:
df_3.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Queen's Park,Ontario Provincial Government,43.66253,-79.39188


#### Importing necessary libraries.

In [4]:
#!conda install -c conda-forge geopy --yes
#!conda install -c conda-forge geocoder

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
#!pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


#### Use geopy library to get the latitude and longitude values of Toronto.


In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>to_explorer</em>, as shown below.

In [5]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
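
`geolocator.geocode` returns `None` when no match is found, in which case reading `location.latitude` would raise an `AttributeError`. A minimal sketch of a guard follows. `get_coordinates` and the stand-in `FakeGeolocator` are hypothetical names used only for illustration; the stub avoids a network call and is not part of geopy.

```python
from collections import namedtuple

# Stand-in for the object geopy returns; only the two attributes used here.
Location = namedtuple("Location", ["latitude", "longitude"])


class FakeGeolocator:
    """Hypothetical stub that mimics geocode() without a network call."""

    def __init__(self, known):
        self.known = known

    def geocode(self, address):
        # Like geopy, return None when the address is not found.
        return self.known.get(address)


def get_coordinates(geolocator, address):
    """Return (latitude, longitude) or raise a clear error if nothing is found."""
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude


stub = FakeGeolocator({"Toronto, Ontario": Location(43.6534817, -79.3839347)})
toronto_coords = get_coordinates(stub, "Toronto, Ontario")
print(toronto_coords)
```

The same `get_coordinates` function should work with the real `Nominatim` instance, since it only relies on `geocode` and the two coordinate attributes.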


#### Create a map of Toronto with neighborhoods superimposed on top.

In [6]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

Next, we add the markers to the map.

In [7]:
# add markers to map
for postal, lat, lng, borough, neighborhood in zip(df_3['PostalCode'], df_3['Latitude'], df_3['Longitude'], df_3['Borough'], df_3['Neighborhood']):
    label = '{}, {}, {}'.format(postal, neighborhood, borough)  # include the borough, which was passed but never shown
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### B. Explore the Boroughs

Let's identify the main boroughs and how many neighborhoods each one contains.

In [8]:
df_3.groupby('Borough').count()

Unnamed: 0_level_0,PostalCode,Neighborhood,Latitude,Longitude
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Central Toronto,9,9,9,9
Downtown Toronto,17,17,17,17
Downtown Toronto Stn A,1,1,1,1
East Toronto,4,4,4,4
East Toronto Business,1,1,1,1
East York,4,4,4,4
East York/East Toronto,1,1,1,1
Etobicoke,11,11,11,11
Etobicoke Northwest,1,1,1,1
Mississauga,1,1,1,1


In [9]:
len(df_3[df_3.Borough.str.contains('York')])

34

In [10]:
len(df_3[df_3.Borough.str.contains('Toronto')])

39

In [11]:
len(df_3[df_3.Borough.str.contains('Scarborough')])

17
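
The three counts above come from substring matches, so a borough name that contains two of the keywords is counted in both totals. For example, `East York/East Toronto` from the table above matches both `'York'` and `'Toronto'`. The sketch below uses a small hand-picked list of borough names (not the full dataset) to show the overlap and one way to count exact names with `value_counts`.

```python
import pandas as pd

# Illustrative subset of borough names taken from the tables above.
boroughs = pd.Series([
    "North York",
    "East York",
    "East York/East Toronto",
    "Downtown Toronto",
    "Scarborough",
])

york = boroughs.str.contains("York")
toronto = boroughs.str.contains("Toronto")

# Rows matched by both keywords are counted in both totals.
both = boroughs[york & toronto]
print(both.tolist())

# Exact-name counts avoid the double counting.
print(boroughs.value_counts())
```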

Let's simplify the above map and segment and cluster only the neighborhoods in SCARBOROUGH. So let's slice the original dataframe and create a new dataframe of the SCARBOROUGH DATA.

In [12]:
scar_data = df_3[df_3['Borough'] == 'Scarborough'].reset_index(drop=True)

In [13]:
scar_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.81139,-79.19662
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.78574,-79.15875
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.76575,-79.1747
3,M1G,Scarborough,Woburn,43.76812,-79.21761
4,M1H,Scarborough,Cedarbrae,43.76944,-79.23892


As we did for all of Toronto, let's visualize Scarborough and the neighborhoods in it.

In [14]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for postal, lat, lng, borough, neighborhood in zip(scar_data['PostalCode'], scar_data['Latitude'], scar_data['Longitude'], scar_data['Borough'], scar_data['Neighborhood']):
    label = '{}, {}, {}'.format(postal, neighborhood, borough)  # include the borough, which was passed but never shown
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### C. Explore Scarborough


We will use the explore function to get the most common **venue categories** in the neighborhood.
<br>Later, we will use this **feature** to group the neighborhoods into **clusters** with the **k-means clustering algorithm**.
<br>Finally, we will use the Folium library to visualize the neighborhoods in Scarborough and their emerging clusters.

#### We are going to start utilizing the Foursquare API to explore the neighborhoods in Scarborough and segment them.

Define Foursquare Credentials and Version

In [15]:
CLIENT_ID = '3SB5XYLCUUR5VRJEUPYMJKMH1KQYBPCEO0LMN3EZWPQLXNUI' # your Foursquare ID
CLIENT_SECRET = 'QMNUTP0JPJC5CMZI2IGTU30GP1ISFC4Z44GR4UZ3M2TCCVWJ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 20

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 3SB5XYLCUUR5VRJEUPYMJKMH1KQYBPCEO0LMN3EZWPQLXNUI
CLIENT_SECRET:QMNUTP0JPJC5CMZI2IGTU30GP1ISFC4Z44GR4UZ3M2TCCVWJ


In [16]:
import json # library to handle JSON files

In [17]:
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [18]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

#### Now, let's get the top 20 venues in each neighborhood within a radius of 800 meters.

Let's use a function to repeat the same process for all the neighborhoods in Scarborough.

In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=800):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
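
The function above assumes that every response contains `response -> groups[0] -> items` and that every venue has at least one category. A failed request or an uncategorized venue would raise a `KeyError` or `IndexError`. A hedged sketch of a more tolerant parsing step follows. `extract_venues` is a hypothetical helper, and the sample payload is hand-written to mirror the fields the function reads; it is not a real API response.

```python
def extract_venues(payload, name, lat, lng):
    """Turn one explore-style response into rows, skipping malformed entries."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        if "name" not in venue or "lat" not in location or "lng" not in location:
            continue  # not enough information to build a row
        # Fall back to a placeholder when the venue has no category.
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((name, lat, lng, venue["name"],
                     location["lat"], location["lng"], category))
    return rows


# Hand-written sample shaped like the fields used above.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Heron Park",
               "location": {"lat": 43.769327, "lng": -79.177201},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed spot",
               "location": {"lat": 43.77, "lng": -79.18},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Guildwood, Morningside, West Hill", 43.76575, -79.1747)
empty = extract_venues({"response": {}}, "Somewhere", 0.0, 0.0)
print(rows)
print(empty)
```

Returning an empty list for a neighborhood with no usable venues keeps the later list flattening step working without special cases.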

#### Now, the code to run the above function on each neighborhood and create a new dataframe called "scarborough_venues"

In [20]:
scarborough_venues = getNearbyVenues(names=scar_data['Neighborhood'],
                                   latitudes=scar_data['Latitude'],
                                   longitudes=scar_data['Longitude']
                                  )

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge


#### Let's check the size of the resulting dataframe

In [21]:
print(scarborough_venues.shape)
scarborough_venues

(169, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.81139,-79.19662,Canadiana exhibit,43.817962,-79.193374,Zoo Exhibit
1,"Malvern, Rouge",43.81139,-79.19662,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
2,"Malvern, Rouge",43.81139,-79.19662,Ecopainting inc.,43.808417,-79.202392,Construction & Landscaping
3,"Malvern, Rouge",43.81139,-79.19662,Grizzly Bear Exhibit,43.817031,-79.193458,Zoo Exhibit
4,"Rouge Hill, Port Union, Highland Creek",43.78574,-79.15875,Scarborough Historical Society,43.788755,-79.162438,History Museum
5,"Rouge Hill, Port Union, Highland Creek",43.78574,-79.15875,Royal Canadian Legion,43.782533,-79.163085,Bar
6,"Rouge Hill, Port Union, Highland Creek",43.78574,-79.15875,Malt & Salt Fish & Chips,43.783655,-79.150843,Fish & Chips Shop
7,"Rouge Hill, Port Union, Highland Creek",43.78574,-79.15875,Bramber Woods Park,43.788786,-79.166729,Park
8,"Guildwood, Morningside, West Hill",43.76575,-79.1747,Heron Park Community Centre,43.768867,-79.176958,Gym / Fitness Center
9,"Guildwood, Morningside, West Hill",43.76575,-79.1747,Heron Park,43.769327,-79.177201,Park


#### Let's check how many venues were returned for each neighborhood

In [22]:
scarborough_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,20,20,20,20,20,20
"Birch Cliff, Cliffside West",10,10,10,10,10,10
Cedarbrae,12,12,12,12,12,12
"Clarks Corners, Tam O'Shanter, Sullivan",20,20,20,20,20,20
"Cliffside, Cliffcrest, Scarborough Village West",12,12,12,12,12,12
"Dorset Park, Wexford Heights, Scarborough Town Centre",5,5,5,5,5,5
"Golden Mile, Clairlea, Oakridge",16,16,16,16,16,16
"Guildwood, Morningside, West Hill",5,5,5,5,5,5
"Kennedy Park, Ionview, East Birchmount Park",11,11,11,11,11,11
"Malvern, Rouge",4,4,4,4,4,4


#### Let's find out how many unique categories can be curated from all the returned venues


In [23]:
print('There are {} unique categories.'.format(len(scarborough_venues['Venue Category'].unique())))

There are 75 unique categories.


### D. Analyze Each Neighborhood in Scarborough


To use the categorical features, we need to convert them to binary indicator columns using pandas **get_dummies**.

In [24]:
# one hot encoding
scar_onehot = pd.get_dummies(scarborough_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
scar_onehot['Neighborhood'] = scarborough_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [scar_onehot.columns[-1]] + list(scar_onehot.columns[:-1])
scar_onehot = scar_onehot[fixed_columns]

scar_onehot.head()

Unnamed: 0,Neighborhood,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Beer Store,Big Box Store,Bistro,Breakfast Spot,Burger Joint,Bus Line,Bus Station,Café,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,College Stadium,Construction & Landscaping,Convenience Store,Deli / Bodega,Department Store,Diner,Discount Store,Electronics Store,Fast Food Restaurant,Fish & Chips Shop,Flower Shop,Gas Station,General Entertainment,Golf Course,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Hakka Restaurant,History Museum,Hobby Shop,Ice Cream Shop,Indian Restaurant,Intersection,Korean Restaurant,Latin American Restaurant,Light Rail Station,Liquor Store,Lounge,Malay Restaurant,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Nail Salon,Noodle House,Other Great Outdoors,Park,Pet Store,Pharmacy,Pizza Place,Playground,Pool,Pool Hall,Pub,Restaurant,Sandwich Place,Seafood Restaurant,Shopping Mall,Skating Rink,Supermarket,Sushi Restaurant,Thai Restaurant,Train Station,Video Game Store,Vietnamese Restaurant,Zoo Exhibit
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [25]:
scar_onehot.shape

(169, 76)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category


In [26]:
scar_grouped = scar_onehot.groupby('Neighborhood').mean().reset_index()
scar_grouped

Unnamed: 0,Neighborhood,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Beer Store,Big Box Store,Bistro,Breakfast Spot,Burger Joint,Bus Line,Bus Station,Café,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,College Stadium,Construction & Landscaping,Convenience Store,Deli / Bodega,Department Store,Diner,Discount Store,Electronics Store,Fast Food Restaurant,Fish & Chips Shop,Flower Shop,Gas Station,General Entertainment,Golf Course,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Hakka Restaurant,History Museum,Hobby Shop,Ice Cream Shop,Indian Restaurant,Intersection,Korean Restaurant,Latin American Restaurant,Light Rail Station,Liquor Store,Lounge,Malay Restaurant,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Nail Salon,Noodle House,Other Great Outdoors,Park,Pet Store,Pharmacy,Pizza Place,Playground,Pool,Pool Hall,Pub,Restaurant,Sandwich Place,Seafood Restaurant,Shopping Mall,Skating Rink,Supermarket,Sushi Restaurant,Thai Restaurant,Train Station,Video Game Store,Vietnamese Restaurant,Zoo Exhibit
0,Agincourt,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.05,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.05,0.05,0.0,0.05,0.05,0.0,0.05,0.05,0.05,0.05,0.0,0.05,0.05,0.0,0.0,0.0,0.0,0.0
1,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.1,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.1,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Cedarbrae,0.083333,0.0,0.166667,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0
3,"Clarks Corners, Tam O'Shanter, Sullivan",0.0,0.0,0.0,0.05,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.05,0.05,0.05,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.15,0.0,0.0,0.05,0.0,0.05,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.05,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.05,0.05,0.0
4,"Cliffside, Cliffcrest, Scarborough Village West",0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.166667,0.083333,0.0,0.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Dorset Park, Wexford Heights, Scarborough Town...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0
6,"Golden Mile, Clairlea, Oakridge",0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.1875,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Guildwood, Morningside, West Hill",0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Kennedy Park, Ionview, East Birchmount Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.181818,0.0,0.0,0.181818,0.0,0.0,0.0,0.181818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0
9,"Malvern, Rouge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5


#### Let's confirm the new size


In [27]:
scar_grouped.shape

(16, 76)
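
`scar_grouped` has 16 rows although 17 neighborhoods were queried: a neighborhood for which no venues were returned contributes no rows to `scarborough_venues`, so it never appears after the `groupby`. The toy example below, with made-up venues and hypothetical neighborhood names `A`, `B`, and `C`, illustrates how one-hot encoding followed by a group mean yields category frequencies, and why an empty neighborhood disappears.

```python
import pandas as pd

queried = ["A", "B", "C"]  # hypothetical neighborhoods; C returns no venues

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Bakery", "Bank", "Park", "Bank"],
})

# One indicator column per category, then the neighborhood column in front.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of 0/1 indicators is the share of venues in each category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)

# Neighborhoods that were queried but have no row after grouping.
missing = sorted(set(queried) - set(grouped["Neighborhood"]))
print(missing)
```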

#### Let's print each neighborhood along with the top 5 most common venues:


In [28]:
num_top_venues = 5

for hood in scar_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = scar_grouped[scar_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                venue  freq
0  Chinese Restaurant  0.20
1                Pool  0.05
2       Shopping Mall  0.05
3            Pharmacy  0.05
4         Pizza Place  0.05


----Birch Cliff, Cliffside West----
                   venue  freq
0                   Park   0.2
1                  Diner   0.1
2  General Entertainment   0.1
3      Convenience Store   0.1
4               Gym Pool   0.1


----Cedarbrae----
                  venue  freq
0                Bakery  0.17
1    Athletics & Sports  0.08
2  Caribbean Restaurant  0.08
3       Thai Restaurant  0.08
4            Playground  0.08


----Clarks Corners, Tam O'Shanter, Sullivan----
                  venue  freq
0  Fast Food Restaurant  0.15
1  Caribbean Restaurant  0.05
2            Restaurant  0.05
3        Sandwich Place  0.05
4           Coffee Shop  0.05


----Cliffside, Cliffcrest, Scarborough Village West----
            venue  freq
0        Pharmacy  0.17
1  Ice Cream Shop  0.17
2     Pizza Place  0.08
3  Dis

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.

In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
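
Many categories share the same frequency (several sit at 0.05 for Agincourt, for example), and `sort_values` with its default sort algorithm does not guarantee a particular order among ties. That is likely why the printed top-5 lists and the sorted table can rank tied categories differently. A sketch of a deterministic variant follows, assuming that breaking ties alphabetically is acceptable; `most_common_deterministic` is a hypothetical name and the sample row is made up.

```python
import pandas as pd


def most_common_deterministic(row, num_top_venues):
    """Top categories by frequency; ties are broken alphabetically."""
    freqs = row.iloc[1:].astype(float)  # skip the 'Neighborhood' entry
    ordered = sorted(freqs.index, key=lambda cat: (-freqs[cat], cat))
    return ordered[:num_top_venues]


# Made-up row shaped like one row of scar_grouped.
row = pd.Series({"Neighborhood": "X", "Pool": 0.05, "Chinese Restaurant": 0.2,
                 "Pharmacy": 0.05, "Bank": 0.05})
top3 = most_common_deterministic(row, 3)
print(top3)
```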

Now let's create the new dataframe and display the top 5 venues for each neighborhood.

In [30]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix left, fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scar_grouped['Neighborhood']

for ind in np.arange(scar_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scar_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Chinese Restaurant,Restaurant,Sandwich Place,Breakfast Spot,Pharmacy
1,"Birch Cliff, Cliffside West",Park,Skating Rink,Gym Pool,Convenience Store,Diner
2,Cedarbrae,Bakery,Hakka Restaurant,Playground,Bank,Caribbean Restaurant
3,"Clarks Corners, Tam O'Shanter, Sullivan",Fast Food Restaurant,Chinese Restaurant,Sandwich Place,Liquor Store,Coffee Shop
4,"Cliffside, Cliffcrest, Scarborough Village West",Ice Cream Shop,Pharmacy,Discount Store,Auto Garage,Coffee Shop


### E. Cluster Neighborhoods


Group the neighborhoods into clusters according to similar characteristics.
We run **_k_-means** to cluster the neighborhoods into 5 clusters.

For this I have to use the "scar_grouped" dataframe.

In [31]:
# set number of clusters
kclusters = 5

scar_grouped_clustering = scar_grouped.drop(columns='Neighborhood')  # keyword form; the positional axis argument is not accepted by recent pandas

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scar_grouped_clustering)

# check cluster labels generated for each row in the dataframe
#kmeans.labels_[0:5]
kmeans.labels_

array([0, 0, 0, 0, 0, 3, 0, 1, 0, 2, 1, 4, 0, 0, 0, 0])
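
The label array shows how uneven the clustering is. Counting the labels printed above (copied here by hand rather than recomputed) makes that explicit.

```python
from collections import Counter

# Labels copied from the output above, one per row of scar_grouped.
labels = [0, 0, 0, 0, 0, 3, 0, 1, 0, 2, 1, 4, 0, 0, 0, 0]

cluster_sizes = Counter(labels)
for cluster in sorted(cluster_sizes):
    print("Cluster {}: {} neighborhoods".format(cluster, cluster_sizes[cluster]))
```

Eleven of the sixteen neighborhoods fall into cluster 0 and three clusters hold a single neighborhood each, which suggests that a different number of clusters may be worth trying.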

Let's create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.

In [32]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

scar_merged = scar_data

# merge scar_grouped with scar_data to add latitude/longitude for each neighborhood
scar_merged = scar_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

scar_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.81139,-79.19662,2.0,Zoo Exhibit,Construction & Landscaping,Fast Food Restaurant,Deli / Bodega,Department Store
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.78574,-79.15875,4.0,Park,Fish & Chips Shop,History Museum,Bar,Deli / Bodega
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.76575,-79.1747,1.0,Park,Gymnastics Gym,Gym / Fitness Center,Athletics & Sports,Bank
3,M1G,Scarborough,Woburn,43.76812,-79.21761,0.0,Park,Supermarket,Chinese Restaurant,Department Store,Fast Food Restaurant
4,M1H,Scarborough,Cedarbrae,43.76944,-79.23892,0.0,Bakery,Hakka Restaurant,Playground,Bank,Caribbean Restaurant


#### I have to drop the neighborhood "Upper Rouge" because this area remains undeveloped: no venues were returned for it within the chosen radius, so its cluster label and venue columns are NaN after the join.

In [33]:
scar_merged = scar_merged.dropna()
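
The join leaves `NaN` in the new columns for any neighborhood absent from `neighborhoods_venues_sorted`, and a single `NaN` forces the integer labels to float (hence values such as `2.0` and `4.0` in the merged table). A small sketch with made-up rows shows how to list the affected neighborhoods before dropping them and then restore the integer type.

```python
import pandas as pd

# Made-up stand-ins for scar_data and neighborhoods_venues_sorted.
left = pd.DataFrame({"Neighborhood": ["A", "B", "C"],
                     "Latitude": [43.81, 43.78, 43.83]})
right = pd.DataFrame({"Neighborhood": ["A", "B"],
                      "Cluster Labels": [2, 0]})

merged = left.join(right.set_index("Neighborhood"), on="Neighborhood")

# Neighborhoods without a label are the ones missing from the right-hand table.
unlabeled = merged.loc[merged["Cluster Labels"].isna(), "Neighborhood"].tolist()
print(unlabeled)

# Drop them, then convert the labels back to integers on a fresh copy.
cleaned = merged.dropna(subset=["Cluster Labels"]).copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype(int)
print(cleaned.dtypes)
```

Working on an explicit `.copy()` also avoids pandas warnings about assigning to a slice of another dataframe.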

In [34]:
scar_merged

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.81139,-79.19662,2.0,Zoo Exhibit,Construction & Landscaping,Fast Food Restaurant,Deli / Bodega,Department Store
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.78574,-79.15875,4.0,Park,Fish & Chips Shop,History Museum,Bar,Deli / Bodega
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.76575,-79.1747,1.0,Park,Gymnastics Gym,Gym / Fitness Center,Athletics & Sports,Bank
3,M1G,Scarborough,Woburn,43.76812,-79.21761,0.0,Park,Supermarket,Chinese Restaurant,Department Store,Fast Food Restaurant
4,M1H,Scarborough,Cedarbrae,43.76944,-79.23892,0.0,Bakery,Hakka Restaurant,Playground,Bank,Caribbean Restaurant
5,M1J,Scarborough,Scarborough Village,43.74446,-79.23117,0.0,Ice Cream Shop,Sandwich Place,Indian Restaurant,Big Box Store,Fast Food Restaurant
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.72582,-79.26461,0.0,Coffee Shop,Discount Store,Convenience Store,Train Station,Bus Station
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.71289,-79.28506,0.0,Intersection,Bakery,Coffee Shop,Gym,Metro Station
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.7236,-79.23496,0.0,Ice Cream Shop,Pharmacy,Discount Store,Auto Garage,Coffee Shop
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.6951,-79.26466,0.0,Park,Skating Rink,Gym Pool,Convenience Store,Diner


In [35]:
scar_merged['Cluster Labels'] = scar_merged['Cluster Labels'].astype(int)

In [36]:
type(scar_merged['Cluster Labels'][0])

numpy.int32

We can check the mean geographic position of each cluster by averaging the coordinates of its neighborhoods.


In [37]:
scar_merged.groupby('Cluster Labels')[['Latitude', 'Longitude']].mean()

Unnamed: 0_level_0,Latitude,Longitude
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1
0,43.751855,-79.265768
1,43.79178,-79.22757
2,43.81139,-79.19662
3,43.75998,-79.2694
4,43.78574,-79.15875
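
Note that the table above averages only `Latitude` and `Longitude`, so it gives a geographic mean per cluster. The k-means centroids live in the venue-frequency space instead. Once k-means has converged, each centroid is (approximately) the mean of the feature rows assigned to it, which the following sketch reproduces on made-up data with two hypothetical feature columns.

```python
import pandas as pd

# Made-up venue frequencies for four neighborhoods and their cluster labels.
features = pd.DataFrame({
    "Park": [0.5, 0.25, 0.0, 0.0],
    "Coffee Shop": [0.0, 0.0, 0.25, 0.5],
})
labels = pd.Series([1, 1, 0, 0], name="Cluster Labels")

# Mean feature vector per cluster: one row per cluster, one column per category.
centroids = features.groupby(labels).mean()
print(centroids)
```

Applied to the real data, the same idea would group `scar_grouped_clustering` by `kmeans.labels_`; the fitted model also exposes its centroids directly as `kmeans.cluster_centers_`.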


### Let's visualize the resulting clusters

In [38]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scar_merged['Latitude'], scar_merged['Longitude'], scar_merged['Neighborhood'], scar_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
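
One detail of the color lookup: `rainbow[cluster-1]` maps cluster 0 to index -1, which in Python is the last element of the list. Each cluster still gets its own color, just shifted by one position. A tiny sketch with a hypothetical five-entry palette:

```python
kclusters = 5
palette = ["c0", "c1", "c2", "c3", "c4"]  # placeholder names, not real colors

# Same indexing as in the map code above.
assigned = {cluster: palette[cluster - 1] for cluster in range(kclusters)}
print(assigned)
```

Indexing with `palette[cluster]` would give the unshifted mapping if that is preferred.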

### F. Examine Clusters

#### Utilization of K-means

<em>k</em>-means partitioned the neighborhoods into five groups since we specified the algorithm to generate 5 clusters. The neighborhoods in each cluster are similar to each other in terms of the features included in the dataset (prevalent venue categories).


Now, we can examine each cluster and determine the venue categories that distinguish it.

#### Cluster 0

In [39]:
scar_merged.loc[scar_merged['Cluster Labels'] == 0, scar_merged.columns[[2] + list(range(5, scar_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
3,Woburn,0,Park,Supermarket,Chinese Restaurant,Department Store,Fast Food Restaurant
4,Cedarbrae,0,Bakery,Hakka Restaurant,Playground,Bank,Caribbean Restaurant
5,Scarborough Village,0,Ice Cream Shop,Sandwich Place,Indian Restaurant,Big Box Store,Fast Food Restaurant
6,"Kennedy Park, Ionview, East Birchmount Park",0,Coffee Shop,Discount Store,Convenience Store,Train Station,Bus Station
7,"Golden Mile, Clairlea, Oakridge",0,Intersection,Bakery,Coffee Shop,Gym,Metro Station
8,"Cliffside, Cliffcrest, Scarborough Village West",0,Ice Cream Shop,Pharmacy,Discount Store,Auto Garage,Coffee Shop
9,"Birch Cliff, Cliffside West",0,Park,Skating Rink,Gym Pool,Convenience Store,Diner
11,"Wexford, Maryvale",0,Pizza Place,Burger Joint,Coffee Shop,Sandwich Place,Vietnamese Restaurant
12,Agincourt,0,Chinese Restaurant,Restaurant,Sandwich Place,Breakfast Spot,Pharmacy
13,"Clarks Corners, Tam O'Shanter, Sullivan",0,Fast Food Restaurant,Chinese Restaurant,Sandwich Place,Liquor Store,Coffee Shop


#### Cluster 1

In [40]:
scar_merged.loc[scar_merged['Cluster Labels'] == 1, scar_merged.columns[[2] + list(range(5, scar_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,"Guildwood, Morningside, West Hill",1,Park,Gymnastics Gym,Gym / Fitness Center,Athletics & Sports,Bank
14,"Milliken, Agincourt North, Steeles East, L'Amo...",1,Park,Gym,Pharmacy,Intersection,Fast Food Restaurant


#### Cluster 2

In [41]:
scar_merged.loc[scar_merged['Cluster Labels'] == 2, scar_merged.columns[[2] + list(range(5, scar_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Malvern, Rouge",2,Zoo Exhibit,Construction & Landscaping,Fast Food Restaurant,Deli / Bodega,Department Store


#### Cluster 3

In [42]:
scar_merged.loc[scar_merged['Cluster Labels'] == 3, scar_merged.columns[[2] + list(range(5, scar_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
10,"Dorset Park, Wexford Heights, Scarborough Town...",3,Vietnamese Restaurant,Pet Store,Coffee Shop,Indian Restaurant,Electronics Store


#### Cluster 4

In [43]:
scar_merged.loc[scar_merged['Cluster Labels'] == 4, scar_merged.columns[[2] + list(range(5, scar_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,"Rouge Hill, Port Union, Highland Creek",4,Park,Fish & Chips Shop,History Museum,Bar,Deli / Bodega


## Author

Federico Sarrailh