# Segmenting and Clustering Neighborhoods in TORONTO 

**TO: REVIEWERS**

This Jupyter Notebook presents the 3 parts of the completed assignment. <br>
According to the submission instructions, each part is submitted individually, even though all parts are presented in the same notebook.
Hence, each part should be reviewed and assessed separately. 
To make the notebook easier to navigate and review, the 3 stages of work have titles in color:

<font color='red'>**1. Toronto neighborhood collections based on their Postal Codes** </font>

<font color='red'>**2. Geographical Coordinates of Toronto neighborhoods** </font>

<font color='red'>**3. Clustering the specifically featured Toronto neighborhoods** </font>

To compare the results obtained here with the IMAGES (provided in the "My submission" section as checking points), please find the Markdown cells starting with "**checking point: IMAGE-01**" and "**checking point: IMAGE-02**". <br>
Please keep in mind that "*the screenshot given in the submission is only an example but the result you got should be in the given format only*", as the course instructor wrote. 

Thank you for reviewing my work. 

**A.B.**


In [1]:
import numpy as np       # library to handle vector arrays
import pandas as pd      # library for data analysis

import json         # library to handle JSON files
import requests     # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

from sklearn.cluster import KMeans    # import k-means for clustering data

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


In [2]:
!conda install -c conda-forge lxml --yes 
import lxml   # scraping html-tables from web-pages

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - lxml


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    lxml-4.4.1                 |   py36h7ec2d77_0         1.6 MB  conda-forge

The following packages will be UPDATED:

    lxml: 4.2.5-py37hefd8a0e_0 --> 4.4.1-py36h7ec2d77_0 conda-forge


Downloading and Extracting Packages
lxml-4.4.1           | 1.6 MB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [3]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim   # latitude and longitude coordinates for a given Postal Code

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          90 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.20.0-py_0 conda-forge


Downloading and Extracting Packages
geopy-1.20.0         | 57 KB     | ##################################### | 100% 
geographiclib-1.49   | 32 KB     | ##

In [4]:
!conda install -c conda-forge geocoder --yes 
import geocoder    # latitude and longitude coordinates for a given Postal Code

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - geocoder


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    click-7.0                  |             py_0          61 KB  conda-forge
    ratelim-0.1.6              |             py_2           6 KB  conda-forge
    geocoder-1.38.1            |             py_1          53 KB  conda-forge
    future-0.17.1              |        py36_1000         701 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         821 KB

The following NEW packages will be INSTALLED:

    future:   0.17.1-py36_1000 conda-forge
    geocoder: 1.38.1-py_1      conda-for

In [5]:
!conda install -c conda-forge folium=0.5.0 --yes 
import folium     # map rendering library

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.



<a id='item1'></a>

## <font color='red'>1. Toronto neighborhood collections based on their Postal Codes </font>

We look for the **Toronto neighborhood** data. <br> 
In many cases Wikipedia can be viewed as a reliable resource, 
so we scrape it with Python's web-scraping libraries. <br>

We use 2 different methods to fetch the **html-table** from the web-site into a **pandas dataframe**, <br>
just to be sure that both methods produce the same dataframe.<br>

NOTE.
When reading the Wikipedia table into a *pandas* dataframe, the *lxml* module parses the HTML. <br>
Note that *lxml* itself only accepts the http, ftp and file url protocols. <br> 
If you have a URL that starts with 'https', you might try removing the 's' to get 'http'.<br>


#### Download and Explore Dataset

In [6]:
# METHOD: use requests and pandas, "text" attribute and "read_html" method
#
urla = 'http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(urla)
df_0 = pd.read_html(r.text, header=0)[0]
# df_0.sort_values("Postcode", axis = 0, ascending = True, na_position ='last').reset_index(drop=True, inplace=True)
df_0

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor


In [7]:
# METHOD: use pandas only, "read_html" method
#
urla ='http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
tables_0 = pd.read_html(urla, header=0, keep_default_na=False)[0]
# tables_0.sort_values("Postcode", axis = 0, ascending = True, na_position ='last').reset_index(drop=True, inplace=True)
tables_0

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor


#### Writing useful functions for the Pre-processing of the DataFrame

In [8]:
def controldf (df):
    print( 'number of rows: ',  len( df.index ) )
    print( 'number of cols: ',  len( df.columns), '\n')
    print( 'Borough, unique: ', df.Borough.unique().shape[0] )
    print( 'Neighbourhood, unique: ', df.Neighbourhood.unique().shape[0] )


In [9]:
def cleanA(df, col_name, col_ok, mask):
    # Wherever df[col_name] equals mask, copy in the value of df[col_ok] from the same row.
    for idx in df.index[ df[col_name] == mask ]:
        print( df.loc[idx, col_name], '...REPLACED BY...', df.loc[idx, col_ok] )
        df.loc[idx, col_name] = df.loc[idx, col_ok]


In [10]:
def cleanB(df, col_name, bb):
    # Each cell of df[col_name] holds a list; store the first entry of that list in df[bb].
    for ii in range( len( df.index) ):
        cur =  df[col_name].values[ii][0]
        df[bb].values[ii] = cur   # print( df[bb].values[ii] )


In [11]:
def cleanC(df, col_name, nn): 
    # Each cell of df[col_name] holds a list of names; store them in df[nn] as one comma-separated string.
    for ii in range( len( df.index) ): 
        cur_list = df[col_name].values[ii] 
        combi = ", ".join(cur_list)
        df[nn].values[ii] = combi   # print( df[nn].values[ii] )


#### Pre-processing the DataFrame 

(1) &nbsp; 
There are 3 cells with special postal service info in the **Neighbourhood** column: 

1. *'Canada Post Gateway Processing Centre'* 
2. *'Stn A PO Boxes 25 The Esplanade'* 
3. *'Business Reply Mail Processing Centre 969 Eastern'* 

They are not neighbourhoods, hence they are replaced by the string *'Not assigned'*. 

(2) &nbsp; 
Only process the rows that have an assigned Borough. 
Ignore rows whose Borough is **Not assigned.** 

(3) &nbsp; 
More than one neighborhood can exist in one postal code area. 
Combine rows with the same **Postcode** into one row. 
In this row, list the corresponding neighborhoods separated by commas. 

(4) &nbsp; 
If a row has a Borough but its Neighbourhood is **Not assigned**, then the Neighbourhood is set to the Borough name. 
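
The four rules above can also be expressed with vectorized pandas operations. The following is a minimal sketch on a toy table, not the code this notebook runs: the sample rows and the names `toy`, `special`, `clean` and `grouped` are illustrative, and it assumes that every postal code belongs to exactly one borough, so taking the first borough per group is safe.

```python
import pandas as pd

# Toy rows in the same three-column layout as the Wikipedia table (illustrative data only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M1B", "M1B", "M7A", "M7Y"],
    "Borough": ["Not assigned", "Scarborough", "Scarborough", "Queen's Park", "East Toronto"],
    "Neighbourhood": ["Not assigned", "Rouge", "Malvern", "Not assigned",
                      "Business Reply Mail Processing Centre 969 Eastern"],
})

# Rule (1): postal-service entries are not neighbourhoods.
special = ["Business Reply Mail Processing Centre 969 Eastern"]
clean = toy.copy()
clean["Neighbourhood"] = clean["Neighbourhood"].replace(special, "Not assigned")

# Rule (2): keep only rows with an assigned borough.
clean = clean[clean["Borough"] != "Not assigned"].copy()

# Rule (4): an unassigned neighbourhood takes the borough name.
unassigned = clean["Neighbourhood"] == "Not assigned"
clean.loc[unassigned, "Neighbourhood"] = clean.loc[unassigned, "Borough"]

# Rule (3): one row per postal code, neighbourhoods joined with commas.
grouped = clean.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(grouped)
```

Applying rule (4) before rule (3), as the notebook does, means the joined string never contains the placeholder *'Not assigned'*.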


In [12]:
# Replace the special postal service info in 'Neighbourhood'
#
special = ['Canada Post Gateway Processing Centre',
           'Stn A PO Boxes 25 The Esplanade',
           'Business Reply Mail Processing Centre 969 Eastern']
df_0['Neighbourhood'] = df_0['Neighbourhood'].replace(special, 'Not assigned')
#
controldf(df_0)


number of rows:  288
number of cols:  3 

Borough, unique:  12
Neighbourhood, unique:  206


In [13]:
# Drop the rows with 'Not assigned' in 'Borough'. 
#
df0 = df_0[ df_0.Borough != 'Not assigned' ].copy()   # .copy() so later edits do not touch df_0
#
controldf(df0)


number of rows:  211
number of cols:  3 

Borough, unique:  11
Neighbourhood, unique:  206


In [14]:
# Replace Not assigned in 'Neighbourhood'
#
cleanA(df0, 'Neighbourhood','Borough', 'Not assigned')
#
controldf(df0)


Not assigned ...REPLACED BY... Queen's Park
Not assigned ...REPLACED BY... Mississauga
Not assigned ...REPLACED BY... Downtown Toronto
Not assigned ...REPLACED BY... East Toronto
number of rows:  211
number of cols:  3 

Borough, unique:  11
Neighbourhood, unique:  208


In [15]:
#  Columns: Postcode, Borough, Neighbourhood

print( "Number of Boroughs = ", df0.Borough.unique().shape[0] )
print( "Number of Neighbourhoods = ", df0.Neighbourhood.unique().shape[0], "\n" )
#
print( df0.Borough.unique() )
print( df0.Neighbourhood.unique() )


Number of Boroughs =  11
Number of Neighbourhoods =  208 

['North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke' 'Scarborough'
 'East York' 'York' 'East Toronto' 'West Toronto' 'Central Toronto'
 'Mississauga']
['Parkwoods' 'Victoria Village' 'Harbourfront' 'Regent Park'
 'Lawrence Heights' 'Lawrence Manor' "Queen's Park" 'Islington Avenue'
 'Rouge' 'Malvern' 'Don Mills North' 'Woodbine Gardens' 'Parkview Hill'
 'Ryerson' 'Garden District' 'Glencairn' 'Cloverdale' 'Islington'
 'Martin Grove' 'Princess Gardens' 'West Deane Park' 'Highland Creek'
 'Rouge Hill' 'Port Union' 'Flemingdon Park' 'Don Mills South'
 'Woodbine Heights' 'St. James Town' 'Humewood-Cedarvale'
 'Bloordale Gardens' 'Eringate' 'Markland Wood' 'Old Burnhamthorpe'
 'Guildwood' 'Morningside' 'West Hill' 'The Beaches' 'Berczy Park'
 'Caledonia-Fairbanks' 'Woburn' 'Leaside' 'Central Bay Street' 'Christie'
 'Cedarbrae' 'Hillcrest Village' 'Bathurst Manor' 'Downsview North'
 'Wilson Heights' 'Thorncliffe Park' 'Adelaid

Hence, the Toronto data set has a total of **11 boroughs** and **208 neighborhoods**. <br>

In [16]:
# Combine rows under the same Postcode into one row

df1 = df0.groupby('Postcode').agg( lambda x : list(x) ).reset_index()
#
print( "shape: ", df1.shape, '\n' ) 
print( "info: ", df1.info(), '\n' ) 
df1
#
# NOTE:  check that all 3 columns are present,
#        with the same names
# NOTE:  .reset_index() is important here;
#        without it 'Postcode' becomes the index and only 2 columns remain:
# df1 = tables_0.groupby('Postcode').agg( lambda x : list(x) )

shape:  (103, 3) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postcode         103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB
info:  None 



Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,"[Scarborough, Scarborough]","[Rouge, Malvern]"
1,M1C,"[Scarborough, Scarborough, Scarborough]","[Highland Creek, Rouge Hill, Port Union]"
2,M1E,"[Scarborough, Scarborough, Scarborough]","[Guildwood, Morningside, West Hill]"
3,M1G,[Scarborough],[Woburn]
4,M1H,[Scarborough],[Cedarbrae]
...,...,...,...
98,M9N,[York],[Weston]
99,M9P,[Etobicoke],[Westmount]
100,M9R,"[Etobicoke, Etobicoke, Etobicoke, Etobicoke]","[Kingsview Village, Martin Grove Gardens, Rich..."
101,M9V,"[Etobicoke, Etobicoke, Etobicoke, Etobicoke, E...","[Albion Gardens, Beaumond Heights, Humbergate,..."


In [17]:
df1['Borough'].values[4][0]

'Scarborough'

In [18]:
df1['Neighbourhood'].values[99][0]

'Westmount'

In [19]:
# Add 2 new columns for working with 2 existing columns 

df1['BB'] =  np.arange( len( df1.index)).astype(str)
df1['NN'] =  np.arange( len( df1.index)).astype(str)
# df1

In [20]:
# Re-arrange 2 columns, 'Borough' and 'Neighbourhood'
#
cleanB(df1, 'Borough', 'BB') 
# df1
cleanC(df1, 'Neighbourhood', 'NN')
# df1

In [22]:
# Re-assign the dataframe to the final look
#
df2 = df1[ ['Postcode', 'BB', 'NN'] ].copy()   # .copy() so the in-place rename acts on an independent frame
df2.rename( columns={'Postcode':'PostalCode', 'BB':'Borough', 'NN':'Neighborhood'}, inplace=True )
df2

# NOTE: the assignment submission instructions require the 3 column names as
#       PostalCode, Borough, Neighborhood.

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


 ## <font color='red'> checking point: IMAGE-01 </font>
 

In [23]:
print( "shape: ", df2.shape, '\n' )
print( "info:", df2.info(), '\n' ) 


shape:  (103, 3) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
PostalCode      103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB
info: None 



## <font color='red'> 2. Geographical Coordinates of Toronto neighborhoods </font> ##

We now have a dataframe of the postal codes, borough names and neighborhood names. <br>
In order to utilize the Foursquare location data, <br>
we need the latitude and longitude coordinates of each neighborhood.<br>

In order to segment the neighborhoods and explore them,  <br>
we essentially need  <br>
(1) a dataset that contains the boroughs and the neighborhoods that exist in each borough, <br>
(2) the latitude and longitude coordinates of each neighborhood. <br>


#### Trying the Geocoder Python Package to Get Geospatial Coordinates... 

In [None]:
# DOES NOT WORK: the loop below never ended (g.latlng stayed None), so the kernel was busy for ages and had to be interrupted

import geocoder 

# select the PostalCode
postal_code='M5G'

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

# DOES NOT WORK: Python Kernel has been busy for ages and interrupted
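
The `while` loop above has no exit other than a successful lookup, so it spins forever whenever the geocoder keeps returning `None`. A minimal sketch of a bounded alternative follows. The helper name `geocode_with_retry` and its parameters are hypothetical, and the lookup is passed in as a plain function so that the sketch does not depend on any particular geocoding service; in this notebook it could be something like `lambda q: geocoder.google(q).latlng`.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or the attempts run out.

    Returns the coordinates, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)   # pause before the next attempt
    return None

# Demonstration with a stand-in lookup that fails twice and then succeeds.
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return [43.65, -79.38] if calls["n"] >= 3 else None

coords = geocode_with_retry(flaky_lookup, "Toronto, Ontario", max_attempts=5, delay_seconds=0)
print(coords, "after", calls["n"], "calls")
```

Because the function returns `None` after the last attempt, the caller can fall back to another data source, as the next cells do with the CSV file of coordinates.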

In [None]:
# DOES NOT WORK: no geo data for Toronto, only Ottawa

import geocoder
#
g = geocoder.toronto('323 Yonge Street')  #  Toronto, ON M5B 1R7
g.json

# DOES NOT WORK: no geo data for Toronto, only Ottawa

#### Download and Explore Dataset with Geospatial Coordinates

In [24]:
!wget -q -O  Geospatial_Coordinates.csv  http://cocl.us/Geospatial_data

In [25]:
# Load a Dataset into pandas Dataframe
#
dg_0 = pd.read_csv('Geospatial_Coordinates.csv', header=0)
# dg_0.sort_values("Postal Code", axis = 0, ascending = True, na_position ='last').reset_index(drop=True, inplace=True)
dg_0

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [26]:
print( "shape: ", df2.shape, '\n' )
print( "info:", df2.info(), '\n' ) 


shape:  (103, 3) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
PostalCode      103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB
info: None 



#### Combined Dataframe: neighborhood collections and their geographical coordinates 

In [27]:
def simila (df, colf, dg, colg):
    # Compare two columns row by row; print every mismatch and return how many were found.
    mismatches = 0
    for ii in range( len( df.index) ): 
        if df[colf][ii] != dg[colg][ii]: 
            print( 'MISMATCH', 'ii=',ii, df[colf][ii], dg[colg][ii]) 
            mismatches += 1
    return mismatches


# Report success only if the comparison really found no mismatch.
if simila(df2, 'PostalCode', dg_0, 'Postal Code') == 0:
    print("No MISMATCH detected between the 'PostalCode' and 'Postal Code' entries.")
    print("Two columns, 'PostalCode' and 'Postal Code', in two dataframes are identical.")


No MISMATCH detected between the 'PostalCode' and 'Postal Code' entries.
Two columns, 'PostalCode' and 'Postal Code', in two dataframes are identical.


In [28]:
# Concatenate the 2 dataframes side by side (rows are aligned by index) to create the new one
#
df3 = pd.concat([ df2[['PostalCode', 'Borough', 'Neighborhood']], dg_0 [['Latitude', 'Longitude']] ], axis=1)
df3

# NOTE: the assignment submission instructions require the 5 column names as
#       PostalCode, Borough, Neighborhood, Latitude, Longitude

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
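
Concatenating side by side is only correct because both frames list the postal codes in the same order, which the comparison above verified. A key-based join does not rely on row order. The snippet below is a minimal sketch on toy frames: the sample rows and the names `left`, `right` and `merged` are illustrative. In this notebook the two inputs would be `df2` and `dg_0`.

```python
import pandas as pd

# Toy frames with the notebook's column names; 'right' is deliberately in a different row order.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

# Join on the postal code; validate raises an error if a code appears more than once on either side.
merged = (
    left.merge(right, left_on="PostalCode", right_on="Postal Code",
               how="left", validate="one_to_one")
        .drop(columns="Postal Code")
)
print(merged)
```

With `how="left"` the row order of the first frame is kept, and a postal code without coordinates would show up as missing values instead of being paired with the wrong row.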


 ## <font color='red'> checking point: IMAGE-02 </font>
 

In [29]:
print( "shape: ", df3.shape, '\n' )
print( "info:", df3.info(), '\n' ) 


shape:  (103, 5) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
PostalCode      103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
Latitude        103 non-null float64
Longitude       103 non-null float64
dtypes: float64(2), object(3)
memory usage: 4.1+ KB
info: None 



## <font color='red'> 3. Clustering the specifically featured Toronto neighborhoods </font> 

#### Nominal Latitude and Longitude of Toronto, ON with GeoPy Library

We have to know the nominal geospatial coordinates of the City of Toronto, ON. <br> 
We can obtain them with the **geolocator** utility from the GeoPy library. <br>
In order to define an instance of the **geolocator**, we need to define a user_agent. <br> 
We will name our agent <em>toronto_explorer</em>.

In [30]:

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode('TORONTO, ON')
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of TORONTO, ON are {}, {}.'.format(latitude, longitude))


The geographical coordinates of TORONTO, ON are 43.653963, -79.387207.


#### Create a map of Toronto with neighborhoods superimposed on top

*Folium* is a map visualization library for Python.   
Its maps are easy to navigate and respond quickly to zooming. 

In [31]:
# create a map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers of neighborhoods to the map
for lat, lng, borough, neighborhood in zip(df3['Latitude'], df3['Longitude'], df3['Borough'], df3['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

To simplify the analysis, we segment and cluster only the neighborhoods in **Downtown Toronto**. <br> 
For that we create a new dataframe with the **Downtown Toronto** data only. <br>
Then we visualize **Downtown Toronto** and the neighborhoods in this borough.

In [32]:
# create a DataFrame of the Downtown Toronto data only
#
dttoronto_data = df3[ df3['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dttoronto_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


There are **18 postal-code areas** (rows) in the **Downtown Toronto** borough; below, each row is treated as one neighborhood. 

In [33]:
print( "shape: ", dttoronto_data.shape, '\n' )
print( "info:", dttoronto_data.info(), '\n' ) 


shape:  (18, 5) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 5 columns):
PostalCode      18 non-null object
Borough         18 non-null object
Neighborhood    18 non-null object
Latitude        18 non-null float64
Longitude       18 non-null float64
dtypes: float64(2), object(3)
memory usage: 848.0+ bytes
info: None 



In [34]:
# create a map of Downtown Toronto using latitude and longitude values
map_dttoronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers of neighborhoods to the map
for lat, lng, label in zip(dttoronto_data['Latitude'], dttoronto_data['Longitude'], dttoronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dttoronto)  
    
map_dttoronto

#### Define Foursquare Credentials and Version

Now we start utilizing the **Foursquare API** to explore Downtown Toronto neighborhoods, segment them and cluster.

In [None]:
# DO NOT CHANGE THIS CELL
#
CLIENT_ID = 'your-client-ID' # your Foursquare ID
CLIENT_SECRET = 'your-client-secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
#
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

In [35]:
CLIENT_ID = 'MNTM0UL0DDU1BXQMMNJ5Z3RCBZJ5K1BVEQPGY0GO0Z3JZPVU'     # my Client ID
CLIENT_SECRET = '130X3PBB1XHCAN0FETVL5L45G0XIBBMNQUE2NEQW0GYQDKJQ' # my Client Secret
VERSION = '20190830'   # 2019-AUG-30    # Foursquare API version
#
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: MNTM0UL0DDU1BXQMMNJ5Z3RCBZJ5K1BVEQPGY0GO0Z3JZPVU
CLIENT_SECRET:130X3PBB1XHCAN0FETVL5L45G0XIBBMNQUE2NEQW0GYQDKJQ
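
Printing the client secret in a shared notebook exposes it to every reader. A minimal sketch of one alternative follows, assuming the credentials are stored in environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical choices for this sketch, not part of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    required = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [key for key in required if key not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]

def mask_secret(value, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return value[:visible] + "*" * max(len(value) - visible, 0)

# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "your-client-ID",
    "FOURSQUARE_CLIENT_SECRET": "your-client-secret",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask_secret(client_id))
print("CLIENT_SECRET: " + mask_secret(client_secret))
```

Masking the printed values still lets a reader confirm that the credentials were loaded, without publishing them in the cell output.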


#### (1) Explore One Neighborhood in Downtown Toronto 

We explore the first neighborhood in the dataframe.

In [36]:
# Get the neighborhood's Name as well as latitude and longitude values.
#
neighborhood_name = dttoronto_data.loc[0, 'Neighborhood']   # neighborhood name
#
print(neighborhood_name, ' is the first neighborhood in the dataframe. ')
#
neighborhood_latitude = dttoronto_data.loc[0, 'Latitude']   # neighborhood latitude value
neighborhood_longitude = dttoronto_data.loc[0, 'Longitude'] # neighborhood longitude value
#
print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))


Rosedale  is the first neighborhood in the dataframe. 
Latitude and longitude values of Rosedale are 43.6795626, -79.37752940000001.


Next we explore the venues in **Rosedale,** <br> 
which is one of the neighborhoods in the **Downtown Toronto** borough of the City of Toronto. <br>


Let's get the **top 100 venues** that are in "**Rosedale**" within a radius of 500 meters.

First, we create the GET request URL <br> 
by providing the proper formatting to **url**.

In [37]:
# method = GET for group = venue using endpoint = explore
#
LIMIT = 100     # limit number of venues returned by Foursquare API
#
radius = 500    # radius (meters) from the Rosedale location
#
# create URL,  endpoint = explore ( Regular API endpoint )
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
#
# display URL
url


'https://api.foursquare.com/v2/venues/explore?&client_id=MNTM0UL0DDU1BXQMMNJ5Z3RCBZJ5K1BVEQPGY0GO0Z3JZPVU&client_secret=130X3PBB1XHCAN0FETVL5L45G0XIBBMNQUE2NEQW0GYQDKJQ&v=20190830&ll=43.6795626,-79.37752940000001&radius=500&limit=100'
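
The same request URL can be assembled from a dictionary of parameters, which avoids the long format string and takes care of URL-encoding. This is a minimal sketch using only the standard library; the helper name `build_explore_url` is hypothetical, and the placeholder credentials are the ones from the template cell above. The parameter names are exactly those that appear in the URL printed above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore request URL from its query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma between latitude and longitude unescaped.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")

example_url = build_explore_url("your-client-ID", "your-client-secret",
                                "20190830", 43.6795626, -79.3775294)
print(example_url)
```

Alternatively, `requests.get` accepts such a dictionary directly through its `params` argument, so the URL does not have to be built by hand at all.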

Second, we send the GET request <br> 
and examine the response, which is returned in the **json format** and parsed by **requests.**

In [38]:
# method = GET for group = venue using endpoint = explore
#
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d6f485ebbed210032b64dc3'},
 'response': {'headerLocation': 'Rosedale',
  'headerFullLocation': 'Rosedale, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 5,
  'suggestedBounds': {'ne': {'lat': 43.6840626045, 'lng': -79.37131878274371},
   'sw': {'lat': 43.675062595499995, 'lng': -79.38374001725632}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bae2150f964a520df873be3',
       'name': 'Mooredale House',
       'location': {'address': '146 Crescent Rd.',
        'crossStreet': 'btwn. Lamport Ave. and Mt. Pleasant Rd.',
        'lat': 43.678630645646535,
        'lng': -79.38009142511322,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.678630645646535,
          'lng': -79.380091425113

From the **Foursquare** LAB in this course we know that 
all the information is in the *items* key. <br>
From the **Foursquare** LAB 
we copy the function **get_category_type** for further analysis.

In [39]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # the column still carries the 'venue.' prefix
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we clean the *json* structure and <br> 
port it into a *pandas* dataframe.

In [40]:
venues = results['response']['groups'][0]['items']
#
nearby_venues = json_normalize(venues)     # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

Unnamed: 0,name,categories,lat,lng
0,Mooredale House,Building,43.678631,-79.380091
1,Rosedale Park,Playground,43.682328,-79.378934
2,Whitney Park,Park,43.682036,-79.373788
3,Alex Murray Parkette,Park,43.6783,-79.382773
4,Milkman's Lane,Trail,43.676352,-79.373842


We print the total number of venues returned by **Foursquare API** <br> 
in the neighborhood **Rosedale** and within a radius of 500 meters. 

In [41]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues were returned by Foursquare.


#### (2) Explore All Neighborhoods in Downtown Toronto

Let's write a function to repeat the same process for all the neighborhoods in **Downtown Toronto.**

In [42]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
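
The list comprehension in `getNearbyVenues` reads `v['venue']['categories'][0]['name']`, which raises an `IndexError` for a venue whose category list is empty, a case that `get_category_type` above handles by returning `None`. A minimal sketch of a tolerant row builder follows. The helper name `venue_rows` and the sample items are hypothetical; the items only mimic the structure of the JSON response shown earlier.

```python
def venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples.

    Missing fields become None instead of raising an exception.
    """
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue.get("name"),
                     location.get("lat"),
                     location.get("lng"),
                     category))
    return rows

# Stand-in items: the second venue has no category.
sample_items = [
    {"venue": {"name": "Rosedale Park",
               "location": {"lat": 43.682328, "lng": -79.378934},
               "categories": [{"name": "Playground"}]}},
    {"venue": {"name": "Unnamed spot",
               "location": {"lat": 43.68, "lng": -79.38},
               "categories": []}},
]
rows = venue_rows("Rosedale", 43.6795626, -79.3775294, sample_items)
for row in rows:
    print(row)
```

Each tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues` in the function above.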

Let's explore all the neighborhoods in **Downtown Toronto**. <br>
We need to collect the top venues returned by Foursquare API in every neighborhood in **Downtown Toronto** and within a radius of 500 meters. Then, with the obtained results, we create a new dataframe named "**dttoronto_venues**". 
<br>

In [43]:
# Create the dataframe with the top venues in all neighborhoods in Downtown Toronto. 
#
dttoronto_venues = getNearbyVenues(names=dttoronto_data['Neighborhood'],
                                   latitudes=dttoronto_data['Latitude'],
                                   longitudes=dttoronto_data['Longitude']
                                  )


Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Downtown Toronto
First Canadian Place, Underground city
Christie


Hence, the code has generated data for all **18 neighborhoods** in the **Downtown Toronto** borough. 

In [44]:
print( "shape: ", dttoronto_venues.shape, '\n' )
dttoronto_venues.info()   # info() prints its own summary and returns None, so it is not wrapped in print()

dttoronto_venues

shape:  (1291, 7) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1291 entries, 0 to 1290
Data columns (total 7 columns):
Neighborhood              1291 non-null object
Neighborhood Latitude     1291 non-null float64
Neighborhood Longitude    1291 non-null float64
Venue                     1291 non-null object
Venue Latitude            1291 non-null float64
Venue Longitude           1291 non-null float64
Venue Category            1291 non-null object
dtypes: float64(4), object(3)
memory usage: 70.7+ KB



Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Mooredale House,43.678631,-79.380091,Building
1,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
2,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
3,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.678300,-79.382773,Park
4,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
...,...,...,...,...,...,...,...
1286,Christie,43.669542,-79.422564,Queens Club,43.672386,-79.418106,Athletics & Sports
1287,Christie,43.669542,-79.422564,Pioneer Gas,43.670355,-79.428400,Convenience Store
1288,Christie,43.669542,-79.422564,Marian Engel Park,43.673754,-79.423988,Park
1289,Christie,43.669542,-79.422564,Garrison Creek Park,43.671690,-79.427805,Park


Let's check how many venues were returned for each neighborhood.

In [45]:
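# select several columns with a list (double brackets); tuple-style selection is not supported by recent pandas
dttoronto_venues.groupby('Neighborhood')[['Venue', 'Venue Category']].count()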

Neighborhood,Venue,Venue Category
"Adelaide, King, Richmond",100,100
Berczy Park,55,55
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",13,13
"Cabbagetown, St. James Town",44,44
Central Bay Street,86,86
"Chinatown, Grange Park, Kensington Market",100,100
Christie,16,16
Church and Wellesley,89,89
"Commerce Court, Victoria Hotel",100,100
"Design Exchange, Toronto Dominion Centre",100,100

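Several neighborhoods in the table above show exactly 100 venues, which suggests the request limit was reached for them (assuming a limit of 100; the limit variable is not shown in this section). A minimal sketch on hypothetical toy data of how `groupby(...).size()` gives the same per-neighborhood counts in one column and how capped neighborhoods could be flagged:

```python
import pandas as pd

# hypothetical toy data standing in for dttoronto_venues
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Venue': ['v1', 'v2', 'v3', 'v4', 'v5', 'v6'],
})

# one count per neighborhood, largest first
venue_counts = toy.groupby('Neighborhood').size().sort_values(ascending=False)

# neighborhoods whose count reached the (assumed) request limit
toy_limit = 3
at_limit = venue_counts[venue_counts >= toy_limit].index.tolist()
```

For neighborhoods at the limit, the true number of nearby venues may be higher than the count shown.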

Let's check how many unique venues and unique venue categories appear among the returned venues.

In [46]:
num_venues_unique = dttoronto_venues['Venue'].nunique()
num_categ_unique = dttoronto_venues['Venue Category'].nunique()

print( 'Number of unique venues in Downtown Toronto: ', num_venues_unique, '\n')
print( 'Number of unique categories of venues in Downtown Toronto: ', num_categ_unique, '\n') 


Number of unique venues in Downtown Toronto:  763 

Number of unique categories of venues in Downtown Toronto:  203 
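
The 763 unique venue names are fewer than the 1291 returned rows, so some names repeat. A repeated name can mean either a chain with several locations or the same venue returned for two neighboring search circles. The sketch below, on hypothetical toy data, separates the two cases by also comparing coordinates.

```python
import pandas as pd

# hypothetical toy data: one chain with two locations, one of them returned twice
toy = pd.DataFrame({
    'Venue': ['Starbucks', 'Starbucks', 'Starbucks', 'Rosedale Park'],
    'Venue Latitude': [43.65, 43.66, 43.65, 43.68],
    'Venue Longitude': [-79.38, -79.39, -79.38, -79.37],
})

# distinct names only
unique_names = toy['Venue'].nunique()

# distinct physical places: same name AND same coordinates
unique_places = len(toy.drop_duplicates(['Venue', 'Venue Latitude', 'Venue Longitude']))
```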



#### (3) Analyze Each Neighborhood

Let's create a one-hot encoded dataframe covering the **203 unique categories** of venues, <br>
specifically, flagging the category of each venue in the **1291-venue set**.

In [47]:
# one hot encoding
dttoronto_onehot = pd.get_dummies(dttoronto_venues[['Venue Category']], prefix="", prefix_sep="")

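# add neighborhood column back to dataframe
# (caution: if a venue category is itself named "Neighborhood", this assignment
#  overwrites that dummy column in place instead of appending a new last column)
dttoronto_onehot['Neighborhood'] = dttoronto_venues['Neighborhood'] 

# move the last column to the first position
# (this is 'Neighborhood' only when the assignment above appended it; the output
#  below starts with 'Yoga Studio', which suggests the overwrite case applied here)
fixed_columns = [dttoronto_onehot.columns[-1]] + list(dttoronto_onehot.columns[:-1])
dttoronto_onehot = dttoronto_onehot[fixed_columns]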

dttoronto_onehot

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1286,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1287,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1288,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
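
The table above starts with `Yoga Studio` rather than `Neighborhood`, which is what happens when one of the venue categories is itself called "Neighborhood": assigning `dttoronto_onehot['Neighborhood']` then overwrites an existing dummy column instead of appending a new one. A minimal sketch on hypothetical toy data reproduces the effect and shows one possible workaround (renaming the colliding category before inserting the label column); the new column name `Neighborhood (category)` is an assumption for illustration.

```python
import pandas as pd

# hypothetical toy data; note the venue category literally named "Neighborhood"
toy_venues = pd.DataFrame({
    'Neighborhood': ['Rosedale', 'Rosedale', 'Christie'],
    'Venue Category': ['Park', 'Neighborhood', 'Yoga Studio'],
})

# naive approach: the assignment overwrites the existing dummy column in place
naive = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="")
naive['Neighborhood'] = toy_venues['Neighborhood']
last_column = naive.columns[-1]   # the last dummy column, not 'Neighborhood'

# defensive approach: rename the colliding category, then insert at position 0
safe = pd.get_dummies(toy_venues['Venue Category'])
safe = safe.rename(columns={'Neighborhood': 'Neighborhood (category)'})
safe.insert(0, 'Neighborhood', toy_venues['Neighborhood'])
```

With the defensive variant no category is lost, so the frame has one more column than there are categories.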


In [48]:
print( "shape: ", dttoronto_onehot.shape, '\n' )
dttoronto_onehot.info()   # info() prints its own summary and returns None


shape:  (1291, 203) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1291 entries, 0 to 1290
Columns: 203 entries, Yoga Studio to Wings Joint
dtypes: object(1), uint8(202)
memory usage: 264.9+ KB



Let's group the rows by neighborhood and take the mean of each one-hot column, which gives the relative frequency of each category within a neighborhood.

In [49]:
dttoronto_grouped = dttoronto_onehot.groupby('Neighborhood').mean().reset_index()
dttoronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.076923,0.076923,0.076923,0.153846,0.076923,0.153846,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022727,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.011628,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011628,...,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,0.0,0.011628,0.0
5,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.06,0.0,0.04,0.01,0.0
6,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Church and Wellesley,0.011236,0.011236,0.0,0.0,0.0,0.0,0.0,0.0,0.011236,...,0.011236,0.011236,0.0,0.0,0.0,0.0,0.011236,0.011236,0.0,0.011236
8,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
9,"Design Exchange, Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0
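
Each value in the table above is a share of a neighborhood's venues. For example, Berczy Park has 55 venues, so a single vegetarian / vegan restaurant yields 1/55 ≈ 0.018182. The sketch below checks this interpretation on hypothetical toy data by comparing the grouped mean of one-hot columns with a row-normalized cross-tabulation.

```python
import pandas as pd

# hypothetical toy data standing in for dttoronto_venues
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bar', 'Park', 'Park'],
})

# one-hot encode, then average the 0/1 flags per neighborhood
onehot = pd.get_dummies(toy['Venue Category']).astype(float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])
grouped = onehot.groupby('Neighborhood').mean().reset_index()

# the same numbers computed directly as row-normalized counts
relative = pd.crosstab(toy['Neighborhood'], toy['Venue Category'], normalize='index')
```

Because every venue contributes exactly one flag, each row of `grouped` sums to 1.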


In [50]:
print( "shape: ", dttoronto_grouped.shape, '\n' )
dttoronto_grouped.info()   # info() prints its own summary and returns None


shape:  (18, 203) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Columns: 203 entries, Neighborhood to Wings Joint
dtypes: float64(202), object(1)
memory usage: 28.7+ KB



Let's print each neighborhood along with its **top 5 most common venue categories**.

In [51]:
num_top_venues = 5

for hood in dttoronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = dttoronto_grouped[dttoronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
    

----Adelaide, King, Richmond----
         venue  freq
0  Coffee Shop  0.08
1         Café  0.05
2   Steakhouse  0.04
3          Bar  0.04
4   Restaurant  0.03


----Berczy Park----
          venue  freq
0   Coffee Shop  0.09
1  Cocktail Bar  0.05
2    Steakhouse  0.04
3          Café  0.04
4   Cheese Shop  0.04


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0  Airport Terminal  0.15
1    Airport Lounge  0.15
2   Harbor / Marina  0.08
3     Boat or Ferry  0.08
4  Sculpture Garden  0.08


----Cabbagetown, St. James Town----
         venue  freq
0  Coffee Shop  0.07
1   Restaurant  0.07
2         Park  0.07
3       Bakery  0.05
4         Café  0.05


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.14
1                Café  0.06
2  Italian Restaurant  0.05
3      Sandwich Place  0.05
4        Burger Joint  0.03


----Chinatown, Grange Park, Kensington Market----
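
The transpose-and-sort loop above can also be written with `Series.nlargest`. This is a minimal sketch on hypothetical toy frequencies, and the helper name `top_categories` is an assumption for illustration. When two categories tie, `nlargest` keeps the one that appears first in column order.

```python
import pandas as pd

# hypothetical toy data standing in for dttoronto_grouped
toy_grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Coffee Shop': [0.5, 0.0],
    'Park': [0.25, 1.0],
    'Bar': [0.25, 0.0],
})

def top_categories(grouped, hood, n):
    """Return the n most frequent categories of one neighborhood, rounded to 2 decimals."""
    row = grouped.set_index('Neighborhood').loc[hood]
    return row.astype(float).nlargest(n).round(2)

top_a = top_categories(toy_grouped, 'A', 2)
top_b = top_categories(toy_grouped, 'B', 1)
```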
 

Let's create a new dataframe to display the **top 10 venues** for each neighborhood.

In [52]:
# return the num_top_venues most frequent venue categories of a row, in descending order
#
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [53]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:   # no 'st'/'nd'/'rd' suffix beyond the 3rd position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = dttoronto_grouped['Neighborhood']

for ind in np.arange(dttoronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dttoronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Bar,Hotel,Cosmetics Shop,Burger Joint,Thai Restaurant,American Restaurant,Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Café,Cheese Shop,Steakhouse,Seafood Restaurant,Beer Bar,Farmers Market,Belgian Restaurant
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Lounge,Airport Terminal,Coffee Shop,Boutique,Sculpture Garden,Airport Service,Boat or Ferry,Airport Gate,Harbor / Marina,Airport
3,"Cabbagetown, St. James Town",Coffee Shop,Restaurant,Park,Pub,Pizza Place,Café,Italian Restaurant,Bakery,Liquor Store,Sandwich Place
4,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Ice Cream Shop,Burger Joint,Chinese Restaurant,Middle Eastern Restaurant,Spa,Indian Restaurant
5,"Chinatown, Grange Park, Kensington Market",Café,Vegetarian / Vegan Restaurant,Bar,Chinese Restaurant,Vietnamese Restaurant,Mexican Restaurant,Dumpling Restaurant,Bakery,Coffee Shop,Caribbean Restaurant
6,Christie,Grocery Store,Café,Park,Coffee Shop,Baby Store,Restaurant,Italian Restaurant,Diner,Nightclub,Convenience Store
7,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Men's Store,Dance Studio,Fast Food Restaurant,Gym,Mediterranean Restaurant
8,"Commerce Court, Victoria Hotel",Coffee Shop,Hotel,Café,Restaurant,American Restaurant,Deli / Bodega,Gastropub,Seafood Restaurant,Steakhouse,Gym
9,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Café,Hotel,Restaurant,Bar,Italian Restaurant,American Restaurant,Bakery,Gastropub,Deli / Bodega
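
The column labels above rely on a three-element suffix list plus an exception fallback, which is enough for 10 columns. As a sketch of a more general alternative, the hypothetical helper `ordinal` below builds the same labels and would also handle values such as 11th or 21st correctly.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 11 <= n % 100 <= 13:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
```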


#### (4) Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into **5 clusters.**

In [54]:
# set number of clusters
kclusters = 5

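# keep only the numeric frequency columns (the positional axis argument of drop() is not accepted by recent pandas)
dttoronto_grouped_clustering = dttoronto_grouped.drop(columns='Neighborhood')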

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dttoronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 3, 1, 1, 1, 4, 1, 1, 1], dtype=int32)
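
In the array above, eight of the first ten neighborhoods share label 1. The label numbers themselves carry no meaning; only the grouping does. A minimal sketch on hypothetical toy frequency rows shows the same behavior, checking which rows end up together rather than which number they receive. `n_init=10` is set explicitly here as an assumption to keep the result independent of library defaults.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical frequency rows: three similar "coffee-heavy" rows, two "park-heavy" rows
toy = np.array([
    [0.50, 0.40, 0.10, 0.00],
    [0.45, 0.45, 0.10, 0.00],
    [0.55, 0.35, 0.10, 0.00],
    [0.00, 0.00, 0.20, 0.80],
    [0.00, 0.05, 0.15, 0.80],
])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy)
labels = model.labels_

# number of rows assigned to each cluster label
sizes = np.bincount(labels)
```

Very uneven cluster sizes, as in this notebook, can be a hint to try a different number of clusters.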

Let's create a new dataframe that includes the **5 cluster labels** as well as the **top 10 venues** for each neighborhood.

In [55]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dttoronto_merged = dttoronto_data

# join the sorted-venues table (with its cluster labels) onto dttoronto_data, so each neighborhood keeps its latitude/longitude
dttoronto_merged = dttoronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

dttoronto_merged     # the last 10 columns represent 10 most common venues

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,2,Park,Playground,Trail,Building,Dance Studio,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Discount Store
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,1,Coffee Shop,Restaurant,Park,Pub,Pizza Place,Café,Italian Restaurant,Bakery,Liquor Store,Sandwich Place
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Men's Store,Dance Studio,Fast Food Restaurant,Gym,Mediterranean Restaurant
3,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,1,Coffee Shop,Pub,Park,Bakery,Café,Gym / Fitness Center,Theater,Mexican Restaurant,Restaurant,Breakfast Spot
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,1,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Italian Restaurant,Pizza Place,Ramen Restaurant,Bookstore,Restaurant,Bakery
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Bakery,Clothing Store,Cosmetics Shop,Gastropub,Breakfast Spot
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Coffee Shop,Cocktail Bar,Bakery,Café,Cheese Shop,Steakhouse,Seafood Restaurant,Beer Bar,Farmers Market,Belgian Restaurant
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Ice Cream Shop,Burger Joint,Chinese Restaurant,Middle Eastern Restaurant,Spa,Indian Restaurant
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,1,Coffee Shop,Café,Steakhouse,Bar,Hotel,Cosmetics Shop,Burger Joint,Thai Restaurant,American Restaurant,Restaurant
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752,1,Coffee Shop,Hotel,Aquarium,Café,Brewery,Fried Chicken Joint,Scenic Lookout,Baseball Stadium,Sporting Goods Shop,Sports Bar
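
`DataFrame.join` with `on='Neighborhood'` is a left join: every row of the left frame is kept, and neighborhoods without a match receive `NaN`. The sketch below uses hypothetical toy frames (the `Nowhere` row is invented for illustration) to show how unmatched rows could be detected before plotting.

```python
import pandas as pd

# toy stand-ins for dttoronto_data and neighborhoods_venues_sorted
coords = pd.DataFrame({
    'Neighborhood': ['Rosedale', 'Christie', 'Nowhere'],
    'Latitude': [43.679563, 43.669542, 0.0],
})
venues = pd.DataFrame({
    'Neighborhood': ['Christie', 'Rosedale'],
    'Cluster Labels': [4, 2],
    '1st Most Common Venue': ['Grocery Store', 'Park'],
})

# left join on the neighborhood name; row order of `coords` is preserved
merged = coords.join(venues.set_index('Neighborhood'), on='Neighborhood')

# neighborhoods that received no cluster label
unmatched = merged[merged['Cluster Labels'].isna()]['Neighborhood'].tolist()
```

In this notebook the `info()` output reports 18 non-null values in every column, so no neighborhood was left unmatched.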


In [56]:
print( "shape: ", dttoronto_merged.shape, '\n' )
dttoronto_merged.info()   # info() prints its own summary and returns None


shape:  (18, 16) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 16 columns):
PostalCode                18 non-null object
Borough                   18 non-null object
Neighborhood              18 non-null object
Latitude                  18 non-null float64
Longitude                 18 non-null float64
Cluster Labels            18 non-null int32
1st Most Common Venue     18 non-null object
2nd Most Common Venue     18 non-null object
3rd Most Common Venue     18 non-null object
4th Most Common Venue     18 non-null object
5th Most Common Venue     18 non-null object
6th Most Common Venue     18 non-null object
7th Most Common Venue     18 non-null object
8th Most Common Venue     18 non-null object
9th Most Common Venue     18 non-null object
10th Most Common Venue    18 non-null object
dtypes: float64(2), int32(1), object(13)
memory usage: 2.3+ KB



Finally, let's visualize the resulting clusters.

In [57]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dttoronto_merged['Latitude'], dttoronto_merged['Longitude'], dttoronto_merged['Neighborhood'], dttoronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters) 
       
map_clusters
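
The marker color is looked up as `rainbow[cluster-1]`, so cluster label 0 wraps around to the last color in the list. A minimal sketch, assuming the same five-color rainbow palette, makes the label-to-color mapping explicit; the dictionary name `label_to_color` is hypothetical.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# one hex color per cluster, evenly spaced along the rainbow colormap
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# mapping as used by the map cell: label 0 picks rainbow[-1], i.e. the last color
label_to_color = {label: rainbow[label - 1] for label in range(kclusters)}
```

Indexing with `rainbow[cluster]` instead would also give every label its own color, only in a different order.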

NOTE. <br>
Here, *Folium* added 18 circle markers to the map of **Downtown Toronto**, one per neighborhood, in the following order: <br>

2    #00b5eb    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
1    #8000ff    <br>
0    #ff0000    <br>
1    #8000ff    <br>
3    #80ffb4    <br>
1    #8000ff    <br>
1    #8000ff    <br>
4    #ffb360    <br>

The 1st column represents the cluster label shown in the marker popup.     <br>
The 2nd column represents the marker color.     <br>

<a id='item5'></a>

#### (5) Examine Clusters

Let's examine each cluster and determine the discriminating venue categories that distinguish each cluster. <br> 
Based on the defining categories, we can even assign a name to each cluster. <br>

#### Cluster 1 (label 0)

In [58]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 0, dttoronto_merged.columns[[1] + list(range(5, dttoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Downtown Toronto,0,Café,Restaurant,Bookstore,Sandwich Place,Japanese Restaurant,Bar,Bakery,Sushi Restaurant,Pub,Beer Store


#### Cluster 2 (label 1)

In [59]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 1, dttoronto_merged.columns[[1] + list(range(5, dttoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Downtown Toronto,1,Coffee Shop,Restaurant,Park,Pub,Pizza Place,Café,Italian Restaurant,Bakery,Liquor Store,Sandwich Place
2,Downtown Toronto,1,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Men's Store,Dance Studio,Fast Food Restaurant,Gym,Mediterranean Restaurant
3,Downtown Toronto,1,Coffee Shop,Pub,Park,Bakery,Café,Gym / Fitness Center,Theater,Mexican Restaurant,Restaurant,Breakfast Spot
4,Downtown Toronto,1,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Italian Restaurant,Pizza Place,Ramen Restaurant,Bookstore,Restaurant,Bakery
5,Downtown Toronto,1,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Bakery,Clothing Store,Cosmetics Shop,Gastropub,Breakfast Spot
6,Downtown Toronto,1,Coffee Shop,Cocktail Bar,Bakery,Café,Cheese Shop,Steakhouse,Seafood Restaurant,Beer Bar,Farmers Market,Belgian Restaurant
7,Downtown Toronto,1,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Ice Cream Shop,Burger Joint,Chinese Restaurant,Middle Eastern Restaurant,Spa,Indian Restaurant
8,Downtown Toronto,1,Coffee Shop,Café,Steakhouse,Bar,Hotel,Cosmetics Shop,Burger Joint,Thai Restaurant,American Restaurant,Restaurant
9,Downtown Toronto,1,Coffee Shop,Hotel,Aquarium,Café,Brewery,Fried Chicken Joint,Scenic Lookout,Baseball Stadium,Sporting Goods Shop,Sports Bar
10,Downtown Toronto,1,Coffee Shop,Café,Hotel,Restaurant,Bar,Italian Restaurant,American Restaurant,Bakery,Gastropub,Deli / Bodega


#### Cluster 3 (label 2)

In [60]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 2, dttoronto_merged.columns[[1] + list(range(5, dttoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,2,Park,Playground,Trail,Building,Dance Studio,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Discount Store


#### Cluster 4 (label 3)

In [61]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 3, dttoronto_merged.columns[[1] + list(range(5, dttoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Downtown Toronto,3,Airport Lounge,Airport Terminal,Coffee Shop,Boutique,Sculpture Garden,Airport Service,Boat or Ferry,Airport Gate,Harbor / Marina,Airport


#### Cluster 5 (label 4)

In [62]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 4, dttoronto_merged.columns[[1] + list(range(5, dttoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Downtown Toronto,4,Grocery Store,Café,Park,Coffee Shop,Baby Store,Restaurant,Italian Restaurant,Diner,Nightclub,Convenience Store
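
The five cells above repeat the same filter with a different label. As a sketch of a more compact alternative, the loop below summarizes every cluster in one pass and keeps the neighborhood names, which the tables above omit. The toy rows reuse four neighborhoods from this notebook; the `cluster_summary` structure is an assumption for illustration.

```python
import pandas as pd

# small excerpt in the shape of dttoronto_merged
toy_merged = pd.DataFrame({
    'Neighborhood': ['Rosedale', 'Berczy Park', 'Central Bay Street', 'Christie'],
    'Cluster Labels': [2, 1, 1, 4],
    '1st Most Common Venue': ['Park', 'Coffee Shop', 'Coffee Shop', 'Grocery Store'],
})

cluster_summary = {}
for label, members in toy_merged.groupby('Cluster Labels'):
    cluster_summary[int(label)] = {
        # which neighborhoods belong to this cluster
        'neighborhoods': members['Neighborhood'].tolist(),
        # the most frequent top-ranked venue category within the cluster
        'top_venue': members['1st Most Common Venue'].mode().iloc[0],
    }
```

The `top_venue` entry is one possible starting point for naming a cluster, for example a "coffee shop" cluster for label 1.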



### END.
