<h2> Capstone project: Segmenting and Clustering Neighborhoods in Toronto, Canada</h2>
by: <b>Nur Cahyo Nugroho </b>

<h3> PART 1: Web scraping using BeautifulSoup </h3>

<h4>Import all necessary libraries </h4>

In [36]:
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

<h4> Call BeautifulSoup </h4>

In [5]:
#request to wiki based on URL
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

#parse the HTML with the lxml parser
soup = BeautifulSoup(res.content,'lxml')

<h4>Use BeautifulSoup to extract the postal code table from the page</h4>
<ul>
    <li>get the first table (index [0]), which holds all the postal code records </li>
    <li>skip header rows (tr elements that contain only th cells)</li>
    <li>loop over the td cells in each tr and apply the filtering logic before appending the record to the list </li>
</ul>
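The filtering and grouping steps listed above can also be sketched with a dictionary keyed on postal code and borough. This is a minimal illustration in plain Python, not the notebook's own implementation; the sample rows and the helper name `group_by_postcode` are hypothetical.

```python
# Hypothetical sample rows in (postal code, borough, neighborhood) form;
# the real rows come from the Wikipedia table.
rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M7A", "Queen's Park", "Not assigned"),
]


def group_by_postcode(rows):
    """Combine neighborhoods that share a postal code and borough."""
    grouped = {}  # (postal code, borough) -> list of neighborhoods
    for code, borough, hood in rows:
        if borough == "Not assigned":
            continue  # skip records without a borough
        if hood == "Not assigned":
            hood = borough  # fall back to the borough name
        grouped.setdefault((code, borough), []).append(hood)
    # Dictionaries keep insertion order, so the output follows the table order.
    return [[code, borough, ",".join(hoods)]
            for (code, borough), hoods in grouped.items()]


postcodes = group_by_postcode(rows)
print(postcodes)
```

A dictionary lookup avoids rescanning the whole list for every row, which matters little for about a hundred postal codes but keeps the logic short.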

In [8]:
#get the first table in the HTML (index 0), which contains all the postal code data
table = soup.find_all('table')[0] 

#create empty list
postcode_list = []

#loop all records in tr
for tr in table.find_all('tr'):
    #skip header rows, which contain only th cells
    if tr.find_all('td'):
        td_val = ''
   
        #loop all records in td
        for td in tr.find_all('td'):
            td_val += td.get_text().strip('\n') + ','
      
        #skip records whose borough is 'Not assigned'
        if 'Not assigned' not in td_val.split(',')[1]:
            rec = td_val.split(',')[0:3]

            #if 'Neighborhood' is not assigned, use the borough name instead
            if 'Not assigned' in rec[2]:
                rec[2] = rec[1]

            exist = False

            #compare with existing records, and append to 'Neighborhood' if the postal code and borough already exist
            for existing_rec in postcode_list:
                if existing_rec[0] == rec[0] and existing_rec[1] == rec[1]:
                    existing_rec[2] += ',' + rec[2]
                    exist = True
                    break

            #only add the record if it does not exist yet
            if not exist:
                postcode_list.append(rec)

postcode_list

[['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront,Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights,Lawrence Manor'],
 ['M7A', "Queen's Park", "Queen's Park"],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge,Malvern'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens,Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson,Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M9B',
  'Etobicoke',
  'Cloverdale,Islington,Martin Grove,Princess Gardens,West Deane Park'],
 ['M1C', 'Scarborough', 'Highland Creek,Rouge Hill,Port Union'],
 ['M3C', 'North York', 'Flemingdon Park,Don Mills South'],
 ['M4C', 'East York', 'Woodbine Heights'],
 ['M5C', 'Downtown Toronto', 'St. James Town'],
 ['M6C', 'York', 'Humewood-Cedarvale'],
 ['M9C',
  'Etobicoke',
  'Bloordale Gardens,Eringate,Markland Wood,Old Burnhamthorpe'],
 ['M1E', 'Scarborough', 'Guildwood,Morni

<h4> Create dataframe from the list </h4>

In [9]:
column_name = ['Postal Code', 'Borough', 'Neighborhood']
canada_postcode_df = pd.DataFrame(postcode_list, columns = column_name)
canada_postcode_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


<h4> Display the shape of the dataframe </h4>

In [10]:
canada_postcode_df.shape

(103, 3)

<h3> -------------------- This is the end of PART 1 --------------------</h3>


<h3> -------------------- This is the start of PART 2 --------------------</h3>

<h3> PART 2: Use the Geocoder package and update the dataframe </h3>

<h4>Import and/or install all necessary libraries</h4>

In [11]:
#!conda install -c conda-forge geopy --yes 
#!conda install -c conda-forge folium=0.5.0 --yes

import folium # plotting library

<h4> Get geospatial data from a CSV file and put it in a dataframe </h4>

In [12]:
geo_df = pd.read_csv('https://cocl.us/Geospatial_data')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h4> Merge dataframe canada_postcode_df and geo_df </h4>

In [13]:
merged_df = pd.merge(left=canada_postcode_df, right=geo_df, left_on='Postal Code', right_on='Postal Code')
merged_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
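`pd.merge` performs an inner join by default, so a postal code that is missing from the geospatial file would silently drop out of the result. The sketch below shows one way to check for that with a left join and `indicator=True`; the two small dataframes and the code `M9Z` are hypothetical stand-ins for `canada_postcode_df` and `geo_df`.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes.
left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M9Z"],
                     "Borough": ["North York", "North York", "Unknown"]})
right = pd.DataFrame({"Postal Code": ["M3A", "M4A"],
                      "Latitude": [43.753259, 43.725882],
                      "Longitude": [-79.329656, -79.315572]})

# A left join keeps every scraped row; indicator=True adds a '_merge'
# column that marks rows without coordinates as 'left_only'.
checked = pd.merge(left, right, on="Postal Code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join used in the notebook loses no rows.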


<h3> -------------------- This is the end of PART 2 --------------------</h3>


<h3> -------------------- This is the start of PART 3 --------------------</h3>

<h3>PART 3: Explore and cluster the neighborhoods in Toronto</h3>

<h4> Check uniqueness of borough and total neighborhoods </h4>

In [14]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(merged_df['Borough'].unique()), merged_df.shape[0]))

The dataframe has 11 boroughs and 103 neighborhoods.


<h4>Create a map of Toronto with neighborhoods superimposed on top </h4>

In [15]:
# create map of Toronto using latitude and longitude values
toronto_map = folium.Map(location=[43.65, -79.38], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(merged_df['Latitude'], merged_df['Longitude'], merged_df['Borough'], merged_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

<h3>1. Create a dataframe of the boroughs whose name contains 'York' </h3>

In [16]:
york_df = merged_df[merged_df['Borough'].str.contains("York")].reset_index(drop=True)
york_df.head(7)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
3,M3B,North York,Don Mills North,43.745906,-79.352188
4,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
5,M6B,North York,Glencairn,43.709577,-79.445073
6,M3C,North York,"Flemingdon Park,Don Mills South",43.7259,-79.340923


<h4>Create a map of York with neighborhoods superimposed on top </h4>

In [17]:
# create map of York using latitude and longitude values
york_map = folium.Map(location=[43.65, -79.38], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(york_df['Latitude'], york_df['Longitude'], york_df['Borough'], york_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='orange',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.7,
        parse_html=False).add_to(york_map)  
    
york_map

<h4>Get the latitude and longitude of the first neighborhood </h4>

In [18]:
neigh_lat = york_df.loc[0, 'Latitude'] # neighborhood latitude value
neigh_long = york_df.loc[0, 'Longitude'] # neighborhood longitude value

neighb_name = york_df.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighb_name, neigh_lat, neigh_long))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


<h4>Define Foursquare information </h4>

In [19]:
CLIENT_ID = 'DBSEJZIAXAJDY0J3H32YWG231WYT54IEEDGLBYIE1SINFIEY' # your Foursquare ID
CLIENT_SECRET = '1UWUHAE3LATTQKRUPZ5RLXCHL1SH3KGUEDHCQPNQU0COYKZI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: DBSEJZIAXAJDY0J3H32YWG231WYT54IEEDGLBYIE1SINFIEY
CLIENT_SECRET:1UWUHAE3LATTQKRUPZ5RLXCHL1SH3KGUEDHCQPNQU0COYKZI


<h4>Get the top 100 venues within a 500-meter radius of Parkwoods </h4>

Define the Foursquare API request:

In [20]:
radius=500
LIMIT=100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neigh_lat, neigh_long, VERSION, radius, LIMIT)
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d35c02abe7078003999f857'},
  'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA',
        'c
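The request URL above is assembled with string formatting. As a hedged alternative, the query string can be built with `urllib.parse.urlencode`, which escapes each parameter. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; the endpoint and parameter names are the ones used in this notebook.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, lat, lng,
                      version="20180605", radius=500, limit=100):
    """Build the venues/explore URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials, not real ones.
example_url = build_explore_url("MY_ID", "MY_SECRET", 43.7532586, -79.3296565)
print(example_url)
```

With `requests`, the same dictionary can instead be passed as `requests.get(base_url, params=params)`, which performs the encoding itself.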

Create get_category_type function:

In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
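To show what the function returns for each input layout, here is a small self-contained check. The function is repeated so the block runs on its own, and the three sample rows are hypothetical dictionaries that mimic a flattened row, a plain venue record, and a venue without categories.

```python
def get_category_type(row):
    """Return the name of the first category, or None if there is none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Hypothetical rows covering the cases the function handles.
flattened = {'venue.name': 'Brookbanks Park',
             'venue.categories': [{'name': 'Park'}]}
plain = {'name': 'KFC', 'categories': [{'name': 'Fast Food Restaurant'}]}
uncategorised = {'name': 'Somewhere', 'categories': []}

for row in (flattened, plain, uncategorised):
    print(get_category_type(row))
```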

Clean the JSON and put it in a dataframe:

In [22]:
venues = results['response']['groups'][0]['items']

# flatten JSON; pd.json_normalize is available from pandas 1.0 onward
nearby_venues = pd.json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,KFC,Fast Food Restaurant,43.754387,-79.333021
2,Variety Store,Food & Drink Shop,43.751974,-79.333114


Info returned by Foursquare:

In [37]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.


<h3>2. Explore neighborhoods in boroughs whose name contains 'York' </h3>

Create a function that finds nearby venues for every neighborhood in boroughs whose name contains 'York':

In [38]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
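The list comprehension in `getNearbyVenues` indexes `categories[0]`, which raises an `IndexError` for a venue that has no category. A minimal sketch of a more defensive row builder follows; the helper name `venue_rows` and the sample items are hypothetical, and the item layout is assumed to match the response shown earlier.

```python
def venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Use None instead of failing when a venue has no category.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hypothetical items: one venue with a category, one without.
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

sample_rows = venue_rows('Parkwoods', 43.7532586, -79.3296565, sample_items)
print(sample_rows)
```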

Call the function above and create a new dataframe:

In [39]:
york_venue = getNearbyVenues(names=york_df['Neighborhood'],
                                   latitudes=york_df['Latitude'],
                                   longitudes=york_df['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Heights,Lawrence Manor
Don Mills North
Woodbine Gardens,Parkview Hill
Glencairn
Flemingdon Park,Don Mills South
Woodbine Heights
Humewood-Cedarvale
Caledonia-Fairbanks
Leaside
Hillcrest Village
Bathurst Manor,Downsview North,Wilson Heights
Thorncliffe Park
Fairview,Henry Farm,Oriole
Northwood Park,York University
East Toronto
Bayview Village
CFB Toronto,Downsview East
Silver Hills,York Mills
Downsview West
Downsview,North Park,Upwood Park
Humber Summit
Newtonbrook,Willowdale
Downsview Central
Bedford Park,Lawrence Manor East
Del Ray,Keelesdale,Mount Dennis,Silverthorn
Emery,Humberlea
Willowdale South
Downsview Northwest
The Junction North,Runnymede
Weston
York Mills West
Willowdale West


Check the shape:

In [40]:
print(york_venue.shape)

(344, 7)


Check how many venues were returned for each neighborhood:

In [41]:
york_venue.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Bathurst Manor,Downsview North,Wilson Heights",20,20,20,20,20,20
Bayview Village,4,4,4,4,4,4
"Bedford Park,Lawrence Manor East",25,25,25,25,25,25
"CFB Toronto,Downsview East",2,2,2,2,2,2
Caledonia-Fairbanks,6,6,6,6,6,6
"Del Ray,Keelesdale,Mount Dennis,Silverthorn",5,5,5,5,5,5
Don Mills North,5,5,5,5,5,5
Downsview Central,2,2,2,2,2,2
Downsview Northwest,5,5,5,5,5,5
Downsview West,5,5,5,5,5,5


Find the number of unique venue categories:

In [42]:
print('There are {} unique categories.'.format(len(york_venue['Venue Category'].unique())))

There are 121 unique categories.


<h3>3. Analyze each neighborhood </h3>

In [43]:
# one hot encoding
york_onehot = pd.get_dummies(york_venue[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
york_onehot['Neighborhood'] = york_venue['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [york_onehot.columns[-1]] + list(york_onehot.columns[:-1])
york_onehot = york_onehot[fixed_columns]

york_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,Bank,...,Toy / Game Store,Trail,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check the new dataframe shape:

In [44]:
york_onehot.shape

(344, 122)

<h4> Group rows by neighborhood, taking the mean of the frequency of occurrence of each category </h4>

In [45]:
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()
york_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,Bank,...,Toy / Game Store,Trail,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store,Yoga Studio
0,"Bathurst Manor,Downsview North,Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,...,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park,Lawrence Manor East",0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CFB Toronto,Downsview East",0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Caledonia-Fairbanks,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0


Confirm new shape:

In [46]:
york_grouped.shape

(33, 122)
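To make the meaning of the grouped values concrete, here is a small sketch on hypothetical data. After one-hot encoding, each row is one venue, so the per-neighborhood mean of a category column is the share of that neighborhood's venues that fall in the category. The neighborhoods `A` and `B` and their venues are invented for illustration.

```python
import pandas as pd

# Hypothetical venues: four in neighborhood A, two in neighborhood B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Gym", "Bank", "Park", "Bank"],
})

# One row per venue, one 0/1 column per category.
onehot = pd.get_dummies(venues[["Venue Category"]],
                        prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the fraction of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
freq = grouped.set_index("Neighborhood")
print(freq)
```

Each row of `freq` sums to 1, which is why a neighborhood with only two venues shows frequencies of 0.5.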

<h4> Print each neighborhood along with the top 5 most common venues </h4>

In [47]:
num_top_venues = 5

for hood in york_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = york_grouped[york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor,Downsview North,Wilson Heights----
                 venue  freq
0          Coffee Shop  0.10
1          Supermarket  0.05
2  Fried Chicken Joint  0.05
3        Deli / Bodega  0.05
4             Pharmacy  0.05


----Bayview Village----
                 venue  freq
0                 Bank  0.25
1                 Café  0.25
2   Chinese Restaurant  0.25
3  Japanese Restaurant  0.25
4   Mexican Restaurant  0.00


----Bedford Park,Lawrence Manor East----
                venue  freq
0         Coffee Shop  0.08
1  Italian Restaurant  0.08
2           Juice Bar  0.08
3    Greek Restaurant  0.04
4         Pizza Place  0.04


----CFB Toronto,Downsview East----
         venue  freq
0      Airport   0.5
1         Park   0.5
2         Pool   0.0
3  Pizza Place   0.0
4    Piano Bar   0.0


----Caledonia-Fairbanks----
                  venue  freq
0                  Park  0.33
1  Fast Food Restaurant  0.17
2         Women's Store  0.17
3              Pharmacy  0.17
4                M

<h4> Put the results in a pandas dataframe </h4>

Create a function that sorts venue categories in descending order of frequency:

In [48]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
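Categories with equal frequency can come out in a different order from one sort to the next, which is why the two top-5 listings in this notebook do not always agree on ties. The sketch below shows the same ranking in plain Python with an explicit alphabetical tie-break; the helper name `top_categories` is hypothetical and the frequencies are a shortened version of the "CFB Toronto,Downsview East" row.

```python
def top_categories(freqs, n):
    """Return the n most frequent categories, breaking ties alphabetically."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


# Shortened frequency row for one neighborhood.
freqs = {"Pool": 0.0, "Park": 0.5, "Piano Bar": 0.0, "Airport": 0.5}

top3 = top_categories(freqs, 3)
print(top3)
```

Note that categories with a frequency of 0 still fill the remaining slots when a neighborhood has fewer distinct categories than `n`.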

Create a new dataframe and display the top 5 venues for each neighborhood:

In [49]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Bathurst Manor,Downsview North,Wilson Heights",Coffee Shop,Fried Chicken Joint,Sushi Restaurant,Park,Pharmacy
1,Bayview Village,Chinese Restaurant,Japanese Restaurant,Café,Bank,Yoga Studio
2,"Bedford Park,Lawrence Manor East",Juice Bar,Italian Restaurant,Coffee Shop,Greek Restaurant,Indian Restaurant
3,"CFB Toronto,Downsview East",Airport,Park,Yoga Studio,Empanada Restaurant,Construction & Landscaping
4,Caledonia-Fairbanks,Park,Women's Store,Pharmacy,Fast Food Restaurant,Market
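The column labels above rely on a three-element suffix list with a fallback to 'th', which is enough for five columns but would produce labels such as '21th' for a longer ranking. A hedged sketch of a general ordinal helper follows; the name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


labels = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 6)]
print(labels)
```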


<h3> 4. Cluster neighborhoods </h3>

Run *k*-means to cluster the neighborhoods into 5 clusters:

In [50]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

york_grouped_clustering = york_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:5]

array([0, 0, 0, 2, 0], dtype=int32)
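To make the clustering step less of a black box, here is a minimal plain-Python sketch of the assign-and-update loop (Lloyd's algorithm) that k-means is based on. It is an illustration only, not scikit-learn's implementation: the starting centroids are passed in by hand, and the four two-dimensional points are hypothetical frequency vectors.

```python
def kmeans(points, centroids, iterations=10):
    """Run Lloyd's algorithm on a list of equal-length tuples."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical frequency vectors: two park-heavy and two cafe-heavy areas.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
print(labels, centroids)
```

Neighborhoods with similar category frequencies end up with the same label, which is what the `kmeans.labels_` array above encodes for the real data.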

Create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood:

In [51]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

york_merged = york_df

# join the cluster labels and top venues onto york_df, which holds latitude/longitude for each neighborhood
york_merged = york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

york_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0.0,Park,Food & Drink Shop,Fast Food Restaurant,Yoga Studio,Electronics Store
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Coffee Shop,Portuguese Restaurant,Pizza Place,Hockey Arena,Intersection
2,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763,0.0,Clothing Store,Furniture / Home Store,Accessories Store,Athletics & Sports,Miscellaneous Shop
3,M3B,North York,Don Mills North,43.745906,-79.352188,0.0,Café,Gym / Fitness Center,Japanese Restaurant,Caribbean Restaurant,Baseball Field
4,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937,0.0,Pizza Place,Fast Food Restaurant,Bank,Café,Pharmacy


In [54]:
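# neighborhoods for which no venues were returned have no cluster label (NaN) after the join;
# they are assigned to cluster 0 here so that every row can be drawn on the map
york_merged['Cluster Labels'] = york_merged['Cluster Labels'].fillna(0)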
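Filling missing labels with 0 places neighborhoods that had no venues into a real cluster. One alternative, sketched here on hypothetical data, is to drop those rows and restore the integer type that the join turned into floats. The dataframes `hoods` and `labels` stand in for `york_df` and `neighborhoods_venues_sorted`.

```python
import pandas as pd

# Hypothetical stand-ins: neighborhood B returned no venues, so it has no label.
hoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
labels = pd.DataFrame({"Neighborhood": ["A", "C"], "Cluster Labels": [0, 2]})

merged = hoods.join(labels.set_index("Neighborhood"), on="Neighborhood")

# Drop unlabeled rows instead of assigning them to cluster 0,
# then convert the labels back from float to int.
clustered = merged.dropna(subset=["Cluster Labels"]).copy()
clustered["Cluster Labels"] = clustered["Cluster Labels"].astype(int)
print(clustered)
```

Which option is preferable depends on whether unlabeled neighborhoods should still appear on the map.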

Visualize the resulting clusters:

In [55]:
# create map
map_clusters = folium.Map(location=[43.7, -79.4], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'], york_merged['Neighborhood'], york_merged['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
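The map needs one distinct color per cluster. As a sketch of the same idea without matplotlib, the standard-library `colorsys` module can generate evenly spaced hues; the helper name `cluster_colors` is hypothetical, and the colors it produces differ from the rainbow colormap used above.

```python
import colorsys


def cluster_colors(k):
    """Return k hex colors with evenly spaced hues."""
    result = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.9, 0.9)
        result.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return result


palette = cluster_colors(5)
print(palette)
```

Indexing the palette directly with the cluster label (`palette[cluster]`) keeps the mapping between label and color easy to read.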

<h3>Examine the clusters </h3>

<h4>Cluster 1: </h4>

In [58]:
york_merged.loc[york_merged['Cluster Labels'] == 0, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,North York,0.0,Park,Food & Drink Shop,Fast Food Restaurant,Yoga Studio,Electronics Store
1,North York,0.0,Coffee Shop,Portuguese Restaurant,Pizza Place,Hockey Arena,Intersection
2,North York,0.0,Clothing Store,Furniture / Home Store,Accessories Store,Athletics & Sports,Miscellaneous Shop
3,North York,0.0,Café,Gym / Fitness Center,Japanese Restaurant,Caribbean Restaurant,Baseball Field
4,East York,0.0,Pizza Place,Fast Food Restaurant,Bank,Café,Pharmacy
5,North York,0.0,Sushi Restaurant,Pub,Park,Japanese Restaurant,Yoga Studio
6,North York,0.0,Coffee Shop,Asian Restaurant,Gym,Beer Store,Italian Restaurant
7,East York,0.0,Pharmacy,Athletics & Sports,Spa,Beer Store,Skating Rink
8,York,0.0,Field,Hockey Arena,Tennis Court,Trail,Yoga Studio
9,York,0.0,Park,Women's Store,Pharmacy,Fast Food Restaurant,Market


<h4>Cluster 2: </h4>

In [59]:
york_merged.loc[york_merged['Cluster Labels'] == 1, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
24,North York,1.0,Food Truck,Baseball Field,Yoga Studio,Empanada Restaurant,Construction & Landscaping
27,North York,1.0,Baseball Field,Yoga Studio,Empanada Restaurant,Construction & Landscaping,Convenience Store


<h4>Cluster 3: </h4>

In [60]:
york_merged.loc[york_merged['Cluster Labels'] == 2, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
16,East York,2.0,Park,Pizza Place,Coffee Shop,Convenience Store,Comfort Food Restaurant
18,North York,2.0,Airport,Park,Yoga Studio,Empanada Restaurant,Construction & Landscaping
31,York,2.0,Convenience Store,Park,Yoga Studio,Coffee Shop,Construction & Landscaping
32,North York,2.0,Convenience Store,Park,Bank,Yoga Studio,Empanada Restaurant


<h4>Cluster 4: </h4>

In [61]:
york_merged.loc[york_merged['Cluster Labels'] == 3, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
22,North York,3.0,Empanada Restaurant,Pizza Place,General Entertainment,Gastropub,Comfort Food Restaurant


<h4>Cluster 5: </h4>

In [62]:
york_merged.loc[york_merged['Cluster Labels'] == 4, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
23,North York,4.0,Piano Bar,Yoga Studio,Coffee Shop,Construction & Landscaping,Convenience Store
