<a id='title'></a>
<h1 align=center>Segmenting and Clustering Neighborhoods in Toronto Based on Top 10 Common Venues</h1>

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#section1">Extract Dataset</a>
<br>
2. <a href="#section2">Transform Dataset</a>
<br>
3. <a href="#section3">Explore Neighborhoods in Toronto</a>
<br>
4. <a href="#section4">Analyze Each Neighborhood</a>
<br>
5. <a href="#section5">Cluster Neighborhoods</a>
<br>
6. <a href="#section6">Examine Clusters</a>    
</font>
</div>

## WARNING

GitHub can't render the folium maps in this notebook (more explanation can be found [here](https://www.tutorialspoint.com/jupyter/sharing_jupyter_notebook_using_github_and_nbviewer.htm)).

You can view the notebook with its maps on [nbviewer](https://nbviewer.jupyter.org/github/celineinc/Coursera_Capstone/blob/master/Segmenting%20and%20Clustering%20Neighborhoods%20in%20Toronto.ipynb#section3).

Before we get started, we need to install and import all the modules and libraries needed to analyze and cluster the data.

#### Install and import all the dependencies that will be needed

In [1]:
#!pip install lxml

#!pip install pandas
# library for data analysis
import pandas as pd

#!pip install numpy
# library to handle data in a vectorized manner
import numpy as np

#!pip install geopy
# module to convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

#!pip install requests
import requests

# library to handle JSON files
import json
from pandas import json_normalize  # top-level location since pandas 1.0; pandas.io.json.json_normalize is deprecated

#!pip install scikit-learn
# import k-means from clustering stage
from sklearn.cluster import KMeans

#!pip install matplotlib
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!pip install folium
# map rendering library
import folium

<a id='section1'></a>
## 1. Extract Dataset

#### Download and transform the data into a *pandas* dataframe
To segment and cluster the neighborhoods in Toronto, we first need a list of those neighborhoods with their postal codes and boroughs.

The data is scraped from the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood':'Neighborhood'}, inplace=True)
print("\n", "The dataframe's shape: ", df.shape)
df.head()


 The dataframe's shape:  (287, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


<a id='section2'></a>
## 2. Transform Dataset

#### Cleaning the data
We will process only the rows that have an assigned borough. In other words, we drop every row whose borough is <b>Not assigned</b>.

In [3]:
del_row_index = df[df['Borough'] == 'Not assigned'].index
df.drop(del_row_index, inplace = True)
print("\n", "The dataframe's shape: ", df.shape)
df.head(10)


 The dataframe's shape:  (210, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Queen's Park,Not assigned
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


Some rows have a borough but a <b>Not assigned neighborhood</b>. To keep those rows usable, we <b>copy the Borough value into the Neighborhood cell</b>.

In [4]:
na_row_index = df[df['Neighborhood'] == 'Not assigned'].index
for index in na_row_index:
    df.at[index, 'Neighborhood'] = df.at[index, 'Borough']
print("\n", "The dataframe's shape: ", df.shape)
df.head(10)


 The dataframe's shape:  (210, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
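
The row-by-row loop above works, but the same fill can also be written as a single vectorized assignment. A minimal sketch on a toy frame (the two sample rows are illustrative, not the full scraped table):

```python
import pandas as pd

# Toy frame shaped like the scraped table; rows are for illustration only.
sample = pd.DataFrame({
    'PostalCode': ['M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ["Queen's Park", 'Not assigned'],
})

# Boolean mask of rows whose neighborhood is missing.
not_assigned = sample['Neighborhood'] == 'Not assigned'

# Copy the borough into the neighborhood for exactly those rows.
sample.loc[not_assigned, 'Neighborhood'] = sample.loc[not_assigned, 'Borough']

print(sample)
```

On a table of a few hundred rows the speed difference is negligible; the main benefit is that the intent is stated in one line.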


To simplify our dataframe by reducing the number of rows, we will <b>merge neighborhoods that share the same postal code</b>.

In [5]:
aggregation_functions = {'Borough': 'first', 'Neighborhood': ', '.join}
cleaned_df = df.groupby(df['PostalCode']).aggregate(aggregation_functions)
cleaned_df.reset_index(inplace=True)
print("\n", "The final dataframe's shape: ", cleaned_df.shape)
cleaned_df


 The final dataframe's shape:  (103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
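
To see what the aggregation does, here is a small sketch on three sample rows: `'first'` keeps the borough of the first row in each postal-code group, and `', '.join` concatenates the neighborhood names of that group.

```python
import pandas as pd

# Three rows in the same shape as the cleaned table (sample only).
rows = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One output row per postal code; as_index=False keeps PostalCode as a column.
merged = (
    rows.groupby('PostalCode', as_index=False)
        .agg({'Borough': 'first', 'Neighborhood': ', '.join})
)

print(merged)
```

Using `'first'` for the borough assumes every row of a postal code carries the same borough, which holds for this table as far as the sample shows.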



#### Add Latitude and Longitude Column

In [6]:
df_geo = pd.read_csv("Geospatial_Coordinates.csv")
df_geo.head()
toronto_data = pd.merge(cleaned_df, df_geo, on='PostalCode')
toronto_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
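
Note that `pd.merge` performs an inner join by default, so a postal code without a match in the coordinates file would be dropped silently. The sketch below shows one way to check for that, using a left join with `indicator=True`. The frames are toy data, and the idea that `M1X` lacks coordinates is an assumption made only for the example.

```python
import pandas as pd

# Toy inputs: three postal codes, coordinates known for only two of them.
left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M1X']})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every postal code; indicator=True adds a '_merge' column
# that says whether each row was found in both frames or only in the left one.
checked = pd.merge(left, coords, on='PostalCode', how='left', indicator=True)

# Postal codes that have no coordinates.
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```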


<a id='section3'></a>
## 3. Explore Neighborhoods in Toronto

#### Use the geopy library to get the latitude and longitude values of Toronto City.


In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>explorer</em>, as shown below.

In [7]:
address = 'City of Toronto'

geolocator = Nominatim(user_agent="explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.7170226, -79.41978303501344.


#### Create a map of Toronto with neighborhoods superimposed on top.


In [8]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{} -- {}'.format(borough, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_color='#b055e0',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<b>Folium</b> is a great visualization library. Feel free to zoom into the above map, and click on each circle marker to reveal the name of the neighborhood and its respective borough.

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [9]:
CLIENT_ID = '2GP4IM5S2WRSVY13A1P1N5OMWMGWB3VWPDDRTSABTMWWELX1' # my Foursquare ID
CLIENT_SECRET = 'LKDLYX4GLB3G1UWWNUMYXC0JKRYYZ5NX0SBCZFIR2C4VVXLB' # my Foursquare Secret
currentDate = pd.Timestamp.now().strftime('%Y%m%d') # get current date (pd.datetime was removed in pandas 2.0)
VERSION = currentDate # Foursquare API version
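
Credentials written directly into a notebook end up in every shared copy of it. A common alternative is to read them from environment variables. The sketch below is one way to do that; the function name and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and can be anything you choose.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of variables.

    The variable names are hypothetical; they are not defined by Foursquare.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None
```

With the variables set in the shell that starts Jupyter, the cell above would become `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`.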

#### Let's explore the first neighborhood in our dataframe.

Get the first neighborhood's name, latitude, and longitude values.

In [10]:
neighborhood_name = toronto_data.loc[0, 'Neighborhood'] # neighborhood name
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge, Malvern are 43.806686299999996, -79.19435340000001.


#### Now, let's get the top 100 venues that are in Rouge, Malvern within a radius of 500 meters.

First, let's create the GET request URL.

In [11]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=2GP4IM5S2WRSVY13A1P1N5OMWMGWB3VWPDDRTSABTMWWELX1&client_secret=LKDLYX4GLB3G1UWWNUMYXC0JKRYYZ5NX0SBCZFIR2C4VVXLB&v=20200119&ll=43.806686299999996,-79.19435340000001&radius=500&limit=100'
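
Building the query string with `str.format` works, but it is easy to misplace an argument. As an alternative sketch, the same URL can be assembled from a dictionary with the standard library's `urlencode`. The helper name `build_explore_url` is made up for this example, and `'ID'` and `'SECRET'` are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Assemble the explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes each value, so the comma in 'll' becomes %2C.
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


url = build_explore_url('ID', 'SECRET', '20200119', 43.8066863, -79.1943534)

# Decode the query again to confirm that each parameter round-trips.
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['radius'], query['limit'])
```

`requests.get` also accepts a `params` dictionary and performs the same encoding, which avoids building the string by hand at all.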

Send the GET request and examine the results.

In [12]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e2346e795feaf001b36ae97'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 2,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': "Wendy's",
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

From the Foursquare lab in the course, I know that all the venue information is in the *items* key. Before proceeding further, I am borrowing the **get_category_type** function from the Foursquare lab.

In [13]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
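
To make the behavior concrete, the sketch below repeats the function so that it runs on its own and calls it on two hypothetical rows: one with a category and one with an empty category list. Plain dictionaries stand in for the dataframe rows here.

```python
def get_category_type(row):
    # Same logic as the function above, repeated so this snippet is self-contained.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Hypothetical rows shaped like the flattened Foursquare response.
with_category = {'venue.categories': [{'name': 'Fast Food Restaurant'}]}
without_category = {'venue.categories': []}

print(get_category_type(with_category))
print(get_category_type(without_category))
```

Only the first category of each venue is kept, so a venue listed under several categories is counted once.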

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [14]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056
1,Interprovincial Group,Print Shop,43.80563,-79.200378


#### Now that we know what to do with one neighborhood, let's define a function that repeats the same process for all the neighborhoods in Toronto (the function is borrowed from the lab).

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_data in venues_list for item in venue_data])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
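
As written, the function raises a `KeyError` or `IndexError` if a response lacks the `response`/`groups` keys (for example when a request is rejected) or if a venue has an empty category list. The sketch below shows one defensive way to parse a single response. `extract_venue_rows` is a hypothetical helper, and the payloads are hand-written samples, not real API output.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Turn one explore response into row tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Use None when the venue carries no category at all.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hand-written sample payload with one categorized and one uncategorized venue.
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': "Wendy's",
               'location': {'lat': 43.8074, 'lng': -79.1990},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Uncategorized Venue',
               'location': {'lat': 43.8000, 'lng': -79.2000},
               'categories': []}},
]}]}}

rows = extract_venue_rows(payload, 'Rouge, Malvern', 43.806686, -79.194353)
print(rows)
```

A response without venues simply yields an empty list, so the loop over neighborhoods can continue instead of stopping halfway.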

In [16]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

#### Let's check the size of the resulting dataframe

In [17]:
print(toronto_venues.shape)
toronto_venues.head(10)

(2216, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge, Malvern",43.806686,-79.194353,Interprovincial Group,43.80563,-79.200378,Print Shop
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
5,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa
6,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant
7,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Enterprise Rent-A-Car,43.764076,-79.193406,Rental Car Location
8,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Woburn Medical Centre,43.766631,-79.192286,Medical Center
9,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Lawrence Ave E & Kingston Rd,43.767704,-79.18949,Intersection


Let's check how many venues were returned for each neighborhood

In [18]:
toronto_venues_counter = toronto_venues.groupby('Neighborhood').count().iloc[:,0:1]
toronto_venues_counter.rename(columns={'Neighborhood Latitude': 'Total Venues Available'}, inplace=True)
toronto_venues_counter

Unnamed: 0_level_0,Total Venues Available
Neighborhood,Unnamed: 1_level_1
"Adelaide, King, Richmond",100
Agincourt,3
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",3
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",11
"Alderwood, Long Branch",9
...,...
Willowdale West,7
Woburn,4
"Woodbine Gardens, Parkview Hill",12
Woodbine Heights,11


#### Let's find out how many unique categories can be curated from all the returned venues

In [19]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 272 unique categories.


<a id='section4'></a>
## 4. Analyze Each Neighborhood

In this section, we want to find the 10 most common venue categories in each neighborhood.

In [20]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhoods'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print("\n", "The dataframe's shape: ", toronto_onehot.shape)
toronto_onehot.head()


 The dataframe's shape:  (2216, 273)


Unnamed: 0,Neighborhoods,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Highland Creek, Rouge Hill, Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category

In [21]:
toronto_grouped = toronto_onehot.groupby('Neighborhoods').mean().reset_index()
print("\n", "The dataframe's shape: ", toronto_grouped.shape)
toronto_grouped.head()


 The dataframe's shape:  (100, 273)


Unnamed: 0,Neighborhoods,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
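
The one-hot encoding followed by a grouped mean gives, for each neighborhood, the share of its venues that falls in each category. A minimal sketch on made-up data (neighborhoods `A` and `B` are placeholders):

```python
import pandas as pd

# Made-up venues: four in neighborhood A, one in neighborhood B.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Cafe', 'Bar', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the fraction of venues in that category.
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of the result sums to 1, so a neighborhood with a single venue gets a frequency of 1.0 for that venue's category. This is worth keeping in mind when comparing sparse and dense neighborhoods.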


#### Let's print each neighborhood along with the top 5 most common venues

In [22]:
num_top_venues = 5

for neighborhood in toronto_grouped['Neighborhoods']:
    print("----", neighborhood, "----")
    temp = toronto_grouped[toronto_grouped['Neighborhoods'] == neighborhood].T.reset_index()
    temp.columns = ['Venue','Freq']
    temp = temp.iloc[1:]
    temp['Freq'] = temp['Freq'].astype(float)
    temp = temp.round({'Freq': 2})
    print(temp.sort_values('Freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


---- Adelaide, King, Richmond ----
             Venue  Freq
0      Coffee Shop  0.07
1              Bar  0.04
2             Café  0.04
3       Steakhouse  0.04
4  Thai Restaurant  0.03


---- Agincourt ----
                       Venue  Freq
0                     Lounge  0.33
1             Breakfast Spot  0.33
2  Latin American Restaurant  0.33
3          Accessories Store  0.00
4              Metro Station  0.00


---- Agincourt North, L'Amoreaux East, Milliken, Steeles East ----
                Venue  Freq
0          Playground  0.33
1                Park  0.33
2              Bakery  0.33
3   Accessories Store  0.00
4  Mexican Restaurant  0.00


---- Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown ----
                 Venue  Freq
0        Grocery Store  0.18
1             Pharmacy  0.09
2  Japanese Restaurant  0.09
3         Liquor Store  0.09
4           Beer Store  0.09


---- Alderwood, Long Branch ----
...
4           Accessories Store  0.00


---- East Birchmount Park, Ionview, Kennedy Park ----
                Venue  Freq
0      Discount Store   0.4
1   Convenience Store   0.2
2         Coffee Shop   0.2
3    Department Store   0.2
4  Miscellaneous Shop   0.0


---- East Toronto ----
                             Venue  Freq
0                             Park  0.67
1                Convenience Store  0.33
2                Accessories Store  0.00
3                    Metro Station  0.00
4  Molecular Gastronomy Restaurant  0.00


---- Emery, Humberlea ----
                           Venue  Freq
0  Paper / Office Supplies Store   0.5
1                 Baseball Field   0.5
2              Accessories Store   0.0
3                  Metro Station   0.0
4     Modern European Restaurant   0.0


---- Fairview, Henry Farm, Oriole ----
                  Venue  Freq
0        Clothing Store  0.11
1  Fast Food Restaurant  0.08
2           Coffee Shop  0.08
3              Tea Room  0.03
...

                 Venue  Freq
0          Coffee Shop  0.10
1       Clothing Store  0.05
2       Cosmetics Shop  0.04
3                 Café  0.04
4  Japanese Restaurant  0.03


---- Scarborough Village ----
                             Venue  Freq
0                       Playground   0.5
1                Convenience Store   0.5
2                Accessories Store   0.0
3                    Metro Station   0.0
4  Molecular Gastronomy Restaurant   0.0


---- Silver Hills, York Mills ----
                             Venue  Freq
0                        Cafeteria   1.0
1                Accessories Store   0.0
2               Mexican Restaurant   0.0
3  Molecular Gastronomy Restaurant   0.0
4       Modern European Restaurant   0.0


---- St. James Town ----
         Venue  Freq
0         Café  0.06
1  Coffee Shop  0.06
2   Restaurant  0.05
3        Hotel  0.03
4     Beer Bar  0.03


---- Stn A PO Boxes 25 The Esplanade ----
                Venue  Freq
0         Coffee Shop  0.12
...

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order (The function is borrowed from the lab).

In [23]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [24]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhoods']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Bar,Thai Restaurant,Hotel,Burger Joint,Sushi Restaurant,Clothing Store,Asian Restaurant
1,Agincourt,Lounge,Latin American Restaurant,Breakfast Spot,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Empanada Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Bakery,Playground,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Liquor Store,Discount Store,Pharmacy,Fried Chicken Joint,Pizza Place,Fast Food Restaurant,Sandwich Place,Beer Store,Japanese Restaurant
4,"Alderwood, Long Branch",Pizza Place,Gym,Pool,Skating Rink,Coffee Shop,Pharmacy,Pub,Sandwich Place,Dessert Shop,Dim Sum Restaurant
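
The `try`/`except` above produces the right suffixes for the first 10 positions, but it would label position 21 as `21th`. If more columns are ever needed, a small helper can compute the suffix instead. This is a sketch, and `ordinal` is a name introduced here for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11th, 12th and 13th (and 111th, 112th, ...) are irregular.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Rebuild the same column names as in the cell above.
num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i))
    for i in range(1, num_top_venues + 1)
]
print(columns)
```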


<a id='section5'></a>
## 5. Cluster Neighborhoods

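Run *k*-means to cluster the neighborhoods into <b>5 clusters</b>. The model is fitted on the venue-category frequencies in *toronto_grouped*; the top 10 venues in each neighborhood are used afterwards to describe each cluster.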

In [25]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhoods')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check the clustered labels generated and the respective number of rows which are clustered together
np.unique(kmeans.labels_, return_counts=True)

(array([0, 1, 2, 3, 4]), array([87,  9,  1,  1,  2], dtype=int64))
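
The cluster sizes are very uneven: 87 of the 100 neighborhoods fall into one cluster. The value `kclusters = 5` is a choice rather than a result, and one common way to sanity-check it is to compare the inertia (within-cluster sum of squares) for several values of *k*. The sketch below does this on random stand-in data, not on the real frequency matrix, so its numbers say nothing about Toronto.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for toronto_grouped_clustering:
# rows play the role of neighborhoods, columns of category frequencies.
rng = np.random.default_rng(0)
features = rng.random((40, 6))

# Fit k-means for a range of k and record the inertia of each fit.
inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(features)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

On real data, one would replace `features` with `toronto_grouped_clustering` and look for the value of *k* after which the inertia stops dropping noticeably.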

Let's create a new dataframe that includes the neighborhoods' location as well as the cluster and top 10 venues for each neighborhood.

In [26]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.0,Fast Food Restaurant,Print Shop,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Dessert Shop
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.0,Bar,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Festival
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.0,Rental Car Location,Medical Center,Spa,Mexican Restaurant,Breakfast Spot,Electronics Store,Pizza Place,Intersection,Discount Store,Dog Run
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,Coffee Shop,Indian Restaurant,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,Hakka Restaurant,Bakery,Fried Chicken Joint,Caribbean Restaurant,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Dumpling Restaurant,Drugstore


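Some rows end up with *NaN* values. Neighborhoods for which Foursquare returned no venues are absent from *toronto_grouped*, so the join leaves their cluster label and venue columns empty.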

In [27]:
# check for neighborhoods that received no cluster label
toronto_merged[toronto_merged['Cluster Labels'].isnull()]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,M1X,Scarborough,Upper Rouge,43.836125,-79.205636,,,,,,,,,,,
94,M9B,Etobicoke,"Cloverdale, Islington, Martin Grove, Princess ...",43.650943,-79.554724,,,,,,,,,,,
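
The missing labels can be reproduced on a toy frame: a left join leaves *NaN* for any neighborhood that has no row in the labelled table, and that also turns the integer labels into floats (which is why the labels appear as `0.0` above). The two frames below are illustrative.

```python
import pandas as pd

# All neighborhoods, including one for which no venues were found.
all_hoods = pd.DataFrame({'Neighborhood': ['Woburn', 'Upper Rouge']})

# Only neighborhoods with at least one venue receive a cluster label.
labels = pd.DataFrame({'Neighborhood': ['Woburn'], 'Cluster Labels': [0]})

# DataFrame.join is a left join by default, so 'Upper Rouge' is kept with NaN.
joined = all_hoods.join(labels.set_index('Neighborhood'), on='Neighborhood')

print(joined)
print(joined['Cluster Labels'].dtype)
```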


These rows have no cluster to plot, so we drop them before visualizing the clustering result.

In [28]:
# drop the rows with missing values
toronto_merged_no_missing = toronto_merged.dropna()

# convert the Cluster Labels type from float to int
toronto_merged_no_missing = toronto_merged_no_missing.astype({'Cluster Labels': int})

toronto_merged_no_missing

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0,Fast Food Restaurant,Print Shop,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Dessert Shop
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0,Bar,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Festival
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0,Rental Car Location,Medical Center,Spa,Mexican Restaurant,Breakfast Spot,Electronics Store,Pizza Place,Intersection,Discount Store,Dog Run
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0,Coffee Shop,Indian Restaurant,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0,Hakka Restaurant,Bakery,Fried Chicken Joint,Caribbean Restaurant,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Dumpling Restaurant,Drugstore
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188,1,Park,Yoga Studio,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
99,M9P,Etobicoke,Westmount,43.696319,-79.532242,0,Pizza Place,Middle Eastern Restaurant,Discount Store,Intersection,Coffee Shop,Sandwich Place,Chinese Restaurant,Yoga Studio,Doner Restaurant,Dog Run
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724,0,Mobile Phone Shop,Pizza Place,Bus Line,Sandwich Place,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Yoga Studio
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437,0,Grocery Store,Liquor Store,Discount Store,Pharmacy,Fried Chicken Joint,Pizza Place,Fast Food Restaurant,Sandwich Place,Beer Store,Japanese Restaurant


Finally, let's visualize the resulting clusters

In [29]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, hood, cluster in zip(toronto_merged_no_missing['Latitude'], toronto_merged_no_missing['Longitude'], toronto_merged_no_missing['Neighborhood'], toronto_merged_no_missing['Cluster Labels']):
    label = folium.Popup(str(hood) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='section6'></a>
## 6. Examine Clusters

Now, we will examine each cluster and determine the discriminating venue categories that distinguish each cluster.
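
Reading every row of a large cluster is tedious. One way to get a quick summary is to count which category appears most often as the 1st most common venue within each cluster. The sketch below does this on a small made-up frame; on the real data, `sample` would be `toronto_merged_no_missing`.

```python
import pandas as pd

# Made-up stand-in with the two columns the summary needs.
sample = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 1, 1],
    '1st Most Common Venue': ['Coffee Shop', 'Pizza Place', 'Coffee Shop',
                              'Park', 'Park'],
})

# For each cluster, pick the category that occurs most often in first place.
summary = (
    sample.groupby('Cluster Labels')['1st Most Common Venue']
          .agg(lambda venues: venues.value_counts().idxmax())
)
print(summary)
```

This only looks at the top-ranked category, so it is a rough label for a cluster rather than a full description.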

#### Cluster 1

In [30]:
toronto_merged_no_missing.loc[toronto_merged_no_missing['Cluster Labels'] == 0, toronto_merged_no_missing.columns[[2] + list(range(5, toronto_merged_no_missing.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Rouge, Malvern",0,Fast Food Restaurant,Print Shop,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Dessert Shop
1,"Highland Creek, Rouge Hill, Port Union",0,Bar,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Festival
2,"Guildwood, Morningside, West Hill",0,Rental Car Location,Medical Center,Spa,Mexican Restaurant,Breakfast Spot,Electronics Store,Pizza Place,Intersection,Discount Store,Dog Run
3,Woburn,0,Coffee Shop,Indian Restaurant,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,Cedarbrae,0,Hakka Restaurant,Bakery,Fried Chicken Joint,Caribbean Restaurant,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Dumpling Restaurant,Drugstore
...,...,...,...,...,...,...,...,...,...,...,...,...
96,Humber Summit,0,Furniture / Home Store,Pizza Place,Empanada Restaurant,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Dumpling Restaurant
99,Westmount,0,Pizza Place,Middle Eastern Restaurant,Discount Store,Intersection,Coffee Shop,Sandwich Place,Chinese Restaurant,Yoga Studio,Doner Restaurant,Dog Run
100,"Kingsview Village, Martin Grove Gardens, Richv...",0,Mobile Phone Shop,Pizza Place,Bus Line,Sandwich Place,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Yoga Studio
101,"Albion Gardens, Beaumond Heights, Humbergate, ...",0,Grocery Store,Liquor Store,Discount Store,Pharmacy,Fried Chicken Joint,Pizza Place,Fast Food Restaurant,Sandwich Place,Beer Store,Japanese Restaurant


#### Cluster 2

In [31]:
toronto_merged_no_missing.loc[toronto_merged_no_missing['Cluster Labels'] == 1, toronto_merged_no_missing.columns[[2] + list(range(5, toronto_merged_no_missing.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,"Agincourt North, L'Amoreaux East, Milliken, St...",1,Park,Bakery,Playground,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
23,York Mills West,1,Park,Bank,Convenience Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
25,Parkwoods,1,Park,Food & Drink Shop,Yoga Studio,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
30,"CFB Toronto, Downsview East",1,Park,Snack Place,Airport,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
40,East Toronto,1,Park,Convenience Store,Yoga Studio,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
44,Lawrence Park,1,Park,Bus Line,Swim School,Drugstore,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Dim Sum Restaurant
50,Rosedale,1,Park,Playground,Trail,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore
74,Caledonia-Fairbanks,1,Park,Market,Women's Store,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Grocery Store,Empanada Restaurant,Electronics Store,Eastern European Restaurant
98,Weston,1,Park,Yoga Studio,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant


#### Cluster 3

In [32]:
toronto_merged_no_missing.loc[toronto_merged_no_missing['Cluster Labels'] == 2, toronto_merged_no_missing.columns[[2] + list(range(5, toronto_merged_no_missing.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,"Silver Hills, York Mills",2,Cafeteria,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,College Stadium


#### Cluster 4

In [33]:
toronto_merged_no_missing.loc[toronto_merged_no_missing['Cluster Labels'] == 3, toronto_merged_no_missing.columns[[2] + list(range(5, toronto_merged_no_missing.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
63,Roselawn,3,Garden,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio,Dessert Shop


#### Cluster 5

In [34]:
toronto_merged_no_missing.loc[toronto_merged_no_missing['Cluster Labels'] == 4, toronto_merged_no_missing.columns[[2] + list(range(5, toronto_merged_no_missing.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
91,"Humber Bay, King's Mill Park, Kingsway Park So...",4,Baseball Field,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Festival
97,"Emery, Humberlea",4,Paper / Office Supplies Store,Baseball Field,Yoga Studio,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant


## Conclusion

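The per-cluster dataframes above show which Toronto neighborhoods share similar venue profiles. In Cluster 2, every neighborhood has a park as its most common venue. Cluster 5 pairs two neighborhoods where a baseball field ranks first or second. Clusters 3 and 4 each contain a single neighborhood (Silver Hills, York Mills and Roselawn, respectively), and Cluster 1 is the largest group, mixing restaurants, shops, and services.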
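To put numbers behind this kind of comparison, one option is to count the neighborhoods in each cluster and find the most frequent "1st Most Common Venue". The sketch below is an assumption rather than part of the original analysis: `top_first_venue` is a hypothetical helper, and `sample` holds only a few label and venue pairs copied from the tables above, not the full `toronto_merged_no_missing` frame.

```python
import pandas as pd

# A few (cluster label, 1st most common venue) pairs copied from the tables above
sample = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1, 1, 2, 3, 4, 4],
    "1st Most Common Venue": [
        "Fast Food Restaurant", "Bar", "Park", "Park", "Park",
        "Cafeteria", "Garden", "Baseball Field", "Paper / Office Supplies Store",
    ],
})


def top_first_venue(df):
    """Per cluster: number of neighborhoods and the most frequent top venue.

    Series.mode() returns its values sorted, so ties are resolved
    alphabetically by taking the first entry.
    """
    grouped = df.groupby("Cluster Labels")["1st Most Common Venue"]
    return pd.DataFrame({
        "Neighborhoods": grouped.size(),
        "Top Venue": grouped.agg(lambda s: s.mode().iloc[0]),
    })


summary = top_first_venue(sample)
print(summary)
```

On the full dataframe the same call would be `top_first_venue(toronto_merged_no_missing)`, assuming the column names shown in the output above.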

### Thank you for reading this notebook!

This notebook was created by [Chrysant Celine Setyawan](https://www.linkedin.com/in/celine-setyawan) as an assignment for the *Applied Data Science Capstone* course on **Coursera**, based on the lab tutorial.

You can take the course online by clicking [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2).

<a href="#title">Back to Top</a>