
<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>


# Introduction

In this lab, you will convert addresses into their equivalent latitude and longitude values and use the Foursquare API to explore neighborhoods in Toronto. You will use the **explore** endpoint to get the most common venue categories in each neighborhood, and then use these category frequencies as features to group the neighborhoods into clusters with the _k_-means clustering algorithm. Finally, you will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Download and Explore Dataset</a>

2.  <a href="#item2">Explore Neighborhoods in Toronto</a>

3.  <a href="#item3">Analyze Each Neighborhood</a>

4.  <a href="#item4">Cluster Neighborhoods</a>

5.  <a href="#item5">Examine Clusters</a>  
    </font>
    </div>


Before we get the data and start exploring it, let's install and import all the dependencies that we will need.


In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if Folium is already installed
import folium # map rendering library

# import BeautifulSoup to parse the scraped HTML
from bs4 import BeautifulSoup

print('Libraries imported.')


<a id='item1'></a>


## 1. Download and Explore Dataset


We will need a dataset that contains the neighborhoods, as well as the latitude and longitude coordinates of each neighborhood.


In [2]:
   
page=requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

#Create a BeautifulSoup object
soup=BeautifulSoup(page,"html5lib")

tag_object=soup.title
print("tag object:",tag_object)

#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

for index,table in enumerate(tables):
    if ("M1A" in str(table)):
        table_index = index
print("Table Index: ", table_index)

#Prettify table
print(tables[table_index].prettify())

tag object: <title>List of postal codes of Canada: M - Wikipedia</title>
Table Index:  0
<table cellpadding="2" cellspacing="0" rules="all" style="width:100%; border-collapse:collapse; border:1px solid #ccc;">
 <tbody>
  <tr>
   <td style="width:11%; vertical-align:top; color:#ccc;">
    <p>
     <b>
      M1A
     </b>
     <br/>
     <span style="font-size:85%;">
      <i>
       Not assigned
      </i>
     </span>
    </p>
   </td>
   <td style="width:11%; vertical-align:top; color:#ccc;">
    <p>
     <b>
      M2A
     </b>
     <br/>
     <span style="font-size:85%;">
      <i>
       Not assigned
      </i>
     </span>
    </p>
   </td>
   <td style="width:11%; vertical-align:top;">
    <p>
     <b>
      M3A
     </b>
     <br/>
     <span style="font-size:85%;">
      <a href="/wiki/North_York" title="North York">
       North York
      </a>
      <br/>
      (
      <a href="/wiki/Parkwoods" title="Parkwoods">
       Parkwoods
      </a>
      )
     </span>
    </p>
   </

Create a table of the postcodes and remove postcodes with unassigned boroughs


In [3]:
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    #remove postcodes with unassigned borough
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

#print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                            'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
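
The string handling in the cell above can be easier to follow when it is pulled into a small helper. The sketch below assumes that each cell's `span` text has the form `Borough(Neighborhood / Neighborhood)`, which is inferred from the prettified HTML rather than confirmed. The name `split_borough_and_neighborhoods` is hypothetical and not part of the original notebook.

```python
def split_borough_and_neighborhoods(span_text):
    """Split 'Borough(Hood A / Hood B)' into ('Borough', 'Hood A, Hood B').

    Assumes the borough precedes the first '(' and that neighborhoods
    inside the parentheses are separated by ' / '.
    """
    borough, _, rest = span_text.partition('(')
    neighborhoods = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighborhoods


# Illustrative inputs modelled on the table cells shown above.
print(split_borough_and_neighborhoods('North York(Parkwoods)'))
print(split_borough_and_neighborhoods('Downtown Toronto(Regent Park / Harbourfront)'))
```

Keeping the parsing in one function makes it possible to test odd cells (for example the long Canada Post entries) without re-running the scrape.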




Check the number of rows


In [4]:
df.shape

(103, 3)

Get latitude and longitude from GeoCoder

In [5]:
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


Get latitude and longitude for postal codes

In [6]:
import geocoder # import geocoder

for postal_code in df['PostalCode']:
    print(postal_code)
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    print("Latitude: ", latitude)
    print("Longitude: ", longitude)
    

M3A


KeyboardInterrupt: 
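
The `while` loop above keeps calling the geocoder until it answers, so it never finishes if the service returns nothing. A minimal sketch of a bounded retry is shown below. The helper name `geocode_with_retry`, its parameters, and the `fake_lookup` stand-in are all hypothetical; in the notebook the lookup could wrap `geocoder.google(query).latlng`.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # brief pause before the next try
    return None  # caller decides what to do when geocoding fails


# Stand-in for a geocoding call that never answers (hypothetical).
def fake_lookup(query):
    return None


print(geocode_with_retry(fake_lookup, 'M3A, Toronto, Ontario', delay_seconds=0))
```

Returning `None` after a fixed number of attempts lets the notebook fall back to another data source instead of hanging.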

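The geocoder loop above returned no coordinates and had to be interrupted, so import a CSV file with the coordinates of the Toronto postal codes instead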

In [7]:
df_lat_long = pd.read_csv(r'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv')
print(df_lat_long)

    Postal Code   Latitude  Longitude
0           M1B  43.806686 -79.194353
1           M1C  43.784535 -79.160497
2           M1E  43.763573 -79.188711
3           M1G  43.770992 -79.216917
4           M1H  43.773136 -79.239476
5           M1J  43.744734 -79.239476
6           M1K  43.727929 -79.262029
7           M1L  43.711112 -79.284577
8           M1M  43.716316 -79.239476
9           M1N  43.692657 -79.264848
10          M1P  43.757410 -79.273304
11          M1R  43.750072 -79.295849
12          M1S  43.794200 -79.262029
13          M1T  43.781638 -79.304302
14          M1V  43.815252 -79.284577
15          M1W  43.799525 -79.318389
16          M1X  43.836125 -79.205636
17          M2H  43.803762 -79.363452
18          M2J  43.778517 -79.346556
19          M2K  43.786947 -79.385975
20          M2L  43.757490 -79.374714
21          M2M  43.789053 -79.408493
22          M2N  43.770120 -79.408493
23          M2P  43.752758 -79.400049
24          M2R  43.782736 -79.442259
25          

Add Latitude and Longitude to dataframe

In [11]:
neighborhoods = pd.merge(df, df_lat_long, 
                     left_on = 'PostalCode', 
                     right_on = 'Postal Code', 
                     how='left')
print(neighborhoods)

    PostalCode                 Borough  \
0          M3A              North York   
1          M4A              North York   
2          M5A        Downtown Toronto   
3          M6A              North York   
4          M7A            Queen's Park   
5          M9A               Etobicoke   
6          M1B             Scarborough   
7          M3B              North York   
8          M4B               East York   
9          M5B        Downtown Toronto   
10         M6B              North York   
11         M9B               Etobicoke   
12         M1C             Scarborough   
13         M3C              North York   
14         M4C               East York   
15         M5C        Downtown Toronto   
16         M6C                    York   
17         M9C               Etobicoke   
18         M1E             Scarborough   
19         M4E            East Toronto   
20         M5E        Downtown Toronto   
21         M6E                    York   
22         M1G             Scarbor

Remove one of the two Postal Code columns

In [12]:
del neighborhoods['Postal Code']
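
A left merge keeps every row of `df` even when no coordinates are found, so it can be worth checking for unmatched postal codes before dropping the duplicate key column. The sketch below uses a small made-up frame: the `M0Z` row is hypothetical, and the coordinates are the values displayed elsewhere in this notebook. It relies on the `indicator` option of `pd.merge`.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0Z'],  # 'M0Z' is a made-up code with no match
    'Borough': ['North York', 'North York', 'Hypothetical'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.7532586, 43.725882],
    'Longitude': [-79.3296565, -79.315572],
})

# indicator=True adds a '_merge' column telling where each row came from
merged = pd.merge(left, right, left_on='PostalCode', right_on='Postal Code',
                  how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()

# drop the duplicate key and the helper column in one step
merged = merged.drop(columns=['Postal Code', '_merge'])
print(unmatched)
print(merged)
```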


Show final merged dataframe

In [13]:
print(neighborhoods)

    PostalCode                 Borough  \
0          M3A              North York   
1          M4A              North York   
2          M5A        Downtown Toronto   
3          M6A              North York   
4          M7A            Queen's Park   
5          M9A               Etobicoke   
6          M1B             Scarborough   
7          M3B              North York   
8          M4B               East York   
9          M5B        Downtown Toronto   
10         M6B              North York   
11         M9B               Etobicoke   
12         M1C             Scarborough   
13         M3C              North York   
14         M4C               East York   
15         M5C        Downtown Toronto   
16         M6C                    York   
17         M9C               Etobicoke   
18         M1E             Scarborough   
19         M4E            East Toronto   
20         M5E        Downtown Toronto   
21         M6E                    York   
22         M1G             Scarbor

#### Use geopy library to get the latitude and longitude values of Toronto.


In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>tor_explorer</em>, as shown below.


In [14]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


#### Create a map of Toronto with neighborhoods superimposed on top.


In [15]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Folium** renders interactive maps. Feel free to zoom into the above map and click on each circle marker to reveal the name of the neighborhood and its respective borough.


Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.


#### Define Foursquare Credentials and Version


In [16]:
CLIENT_ID = 'QJ4UWXTV5FE5ZJK2PGQ5AU0CIPJBA4ZKFAFASGMPAIP2FD3K' # your Foursquare ID
CLIENT_SECRET = 'KJCHW4VFBSMXPMUBHJ5QA2J2I0BOZCXXD20NF33CN5YEKCQM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: QJ4UWXTV5FE5ZJK2PGQ5AU0CIPJBA4ZKFAFASGMPAIP2FD3K
CLIENT_SECRET: KJCHW4VFBSMXPMUBHJ5QA2J2I0BOZCXXD20NF33CN5YEKCQM
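
Hard-coding and printing API credentials makes them easy to leak when a notebook is shared. One possible alternative, sketched below, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions chosen for illustration, not something the Foursquare API prescribes.

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    env = os.environ if env is None else env
    client_id = env.get('FOURSQUARE_CLIENT_ID')          # hypothetical variable name
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')  # hypothetical variable name
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first.')
    return client_id, client_secret


# Demonstration with a plain dict standing in for os.environ.
demo_env = {'FOURSQUARE_CLIENT_ID': 'your-id', 'FOURSQUARE_CLIENT_SECRET': 'your-secret'}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('Credentials loaded:', bool(client_id and client_secret))
```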


#### Let's explore the first neighborhood in our dataframe.


Get the neighborhood's name.


In [17]:
neighborhoods.loc[0, 'Neighborhood']

'Parkwoods'

Get the neighborhood's latitude and longitude values.


In [18]:
neighborhood_latitude = neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


#### Now, let's get the top 100 venues that are in Parkwoods within a radius of 500 meters.


First, let's create the GET request URL. Name your URL **url**.


In [19]:
LIMIT = 100 # maximum number of venues returned by the Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL
 


'https://api.foursquare.com/v2/venues/explore?&client_id=QJ4UWXTV5FE5ZJK2PGQ5AU0CIPJBA4ZKFAFASGMPAIP2FD3K&client_secret=KJCHW4VFBSMXPMUBHJ5QA2J2I0BOZCXXD20NF33CN5YEKCQM&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100'
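
The URL above is assembled with string formatting, which leaves a stray `&` after the `?` and does no escaping. As a sketch of an alternative, the query string can be built from a dictionary with `urllib.parse.urlencode`. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used above.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with an escaped query string."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


demo_url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', '20180605',
                             43.7532586, -79.3296565, 500, 100)
print(demo_url)
```

Note that `urlencode` percent-encodes the comma in `ll`; decoding the query string gives back the original `latitude,longitude` pair.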

Send the GET request and examine the results


In [20]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '608748bfd274d67370a9e8f7'},
 'response': {'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA'

From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened results store the list under 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a _pandas_ dataframe.


In [23]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print(nearby_venues.head())

  app.launch_new_instance()


AttributeError: 'Series' object has no attribute '_mgr'

And how many venues were returned by Foursquare?


In [24]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


<a id='item2'></a>


## 2. Explore Neighborhoods in Toronto


#### Let's create a function to repeat the same process for all the neighborhoods in Toronto


In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
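
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would fail for a venue that comes back without any category. The sketch below shows one way to extract the same fields defensively. `extract_venue_rows` is a hypothetical helper, and the sample items are illustrative: the first reuses the Brookbanks Park coordinates shown earlier with an assumed category, the second is made up.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976046055574, 'lng': -79.33214044722958},
               'categories': [{'name': 'Park'}]}},       # category assumed for illustration
    {'venue': {'name': 'Uncategorized Venue',            # made-up venue
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

for row in extract_venue_rows('Parkwoods', 43.7532586, -79.3296565, sample_items):
    print(row)
```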

#### Now write the code to run the above function on each neighborhood and create a new dataframe called _toronto_venues_.


In [26]:
toronto_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )


Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills North
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview East
The Danforth

#### Let's check the size of the resulting dataframe


In [27]:
print(toronto_venues.shape)


(2104, 7)


Let's check how many venues were returned for each neighborhood


In [29]:
print(toronto_venues.groupby('Neighborhood').count())

                                                    Neighborhood Latitude  \
Neighborhood                                                                
Agincourt                                                               4   
Alderwood, Long Branch                                                  8   
Bathurst Manor, Wilson Heights, Downsview North                        23   
Bayview Village                                                         4   
Bedford Park, Lawrence Manor East                                      22   
Berczy Park                                                            59   
Birch Cliff, Cliffside West                                             4   
Brockton, Parkdale Village, Exhibition Place                           22   
CN Tower, King and Spadina, Railway Lands, Harb...                     15   
Caledonia-Fairbanks                                                     4   
Cedarbrae                                                               8   

#### Let's find out how many unique categories can be curated from all the returned venues


In [31]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 268 unique categories.


<a id='item3'></a>


## 3. Analyze Each Neighborhood


In [32]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.head())

   Yoga Studio  Accessories Store  Adult Boutique  Airport  \
0            0                  0               0        0   
1            0                  0               0        0   
2            0                  0               0        0   
3            0                  0               0        0   
4            0                  0               0        0   

   Airport Food Court  Airport Gate  Airport Lounge  Airport Service  \
0                   0             0               0                0   
1                   0             0               0                0   
2                   0             0               0                0   
3                   0             0               0                0   
4                   0             0               0                0   

   Airport Terminal  American Restaurant  Antique Shop  Aquarium  Art Gallery  \
0                 0                    0             0         0            0   
1                 0             

And let's examine the new dataframe size.


In [33]:
toronto_onehot.shape

(2104, 268)
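
One detail worth checking in the encoding step: the Foursquare results may include a venue category that is itself called `Neighborhood`. In that case, assigning `toronto_onehot['Neighborhood']` would overwrite that category column instead of appending a new one, and the column move that follows would pick up a different column. The sketch below, on made-up data, sidesteps the clash by inserting the label under a separate, hypothetical column name (`Neighborhood Name`).

```python
import pandas as pd

# Made-up venues; one of them has the category 'Neighborhood'.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Park', 'Neighborhood', 'Café'],
})

onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='')

# insert() places the label first and cannot silently replace an existing column
onehot.insert(0, 'Neighborhood Name', venues['Neighborhood'])

grouped = onehot.groupby('Neighborhood Name').mean().reset_index()
print(grouped)
```

Using `insert` also raises an error if the chosen column name already exists, which surfaces the collision instead of hiding it.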

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category


In [34]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
print(toronto_grouped)

                                         Neighborhood  Yoga Studio  \
0                                           Agincourt     0.000000   
1                              Alderwood, Long Branch     0.000000   
2     Bathurst Manor, Wilson Heights, Downsview North     0.000000   
3                                     Bayview Village     0.000000   
4                   Bedford Park, Lawrence Manor East     0.000000   
5                                         Berczy Park     0.000000   
6                         Birch Cliff, Cliffside West     0.000000   
7        Brockton, Parkdale Village, Exhibition Place     0.000000   
8   CN Tower, King and Spadina, Railway Lands, Har...     0.000000   
9                                 Caledonia-Fairbanks     0.000000   
10                                          Cedarbrae     0.000000   
11                                 Central Bay Street     0.016393   
12                                           Christie     0.000000   
13                  

#### Let's confirm the new size


In [35]:
toronto_grouped.shape

(99, 268)

#### Let's print each neighborhood along with the top 5 most common venues


In [36]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0                     Lounge  0.25
1  Latin American Restaurant  0.25
2             Breakfast Spot  0.25
3               Skating Rink  0.25
4         Mexican Restaurant  0.00


----Alderwood, Long Branch----
          venue  freq
0   Pizza Place  0.25
1      Pharmacy  0.12
2   Coffee Shop  0.12
3  Skating Rink  0.12
4           Pub  0.12


----Bathurst Manor, Wilson Heights, Downsview North----
               venue  freq
0               Bank  0.09
1        Coffee Shop  0.09
2         Restaurant  0.04
3  Mobile Phone Shop  0.04
4        Supermarket  0.04


----Bayview Village----
                 venue  freq
0                 Café  0.25
1  Japanese Restaurant  0.25
2                 Bank  0.25
3   Chinese Restaurant  0.25
4        Movie Theater  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0         Coffee Shop  0.09
1      Sandwich Place  0.09
2  Italian Restaurant  0.09
3       Grocery Store  0.05
4  

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [38]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.head())


                                      Neighborhood      1st Most Common Venue  \
0                                        Agincourt  Latin American Restaurant   
1                           Alderwood, Long Branch                Pizza Place   
2  Bathurst Manor, Wilson Heights, Downsview North                Coffee Shop   
3                                  Bayview Village         Chinese Restaurant   
4                Bedford Park, Lawrence Manor East                Coffee Shop   

  2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue  \
0                Lounge          Skating Rink        Breakfast Spot   
1           Coffee Shop                   Gym              Pharmacy   
2                  Bank                  Park     Mobile Phone Shop   
3                  Café                  Bank   Japanese Restaurant   
4        Sandwich Place    Italian Restaurant         Grocery Store   

  5th Most Common Venue 6th Most Common Venue 7th Most Common Venue  \
0   Dumpling Re
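
The column-naming loop above relies on an `IndexError` to fall back to the `th` suffix, which is fine for ten columns but would label a 21st column as `21th`. A small ordinal helper, sketched below, covers the general case. The function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns)
```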

<a id='item4'></a>


## 4. Cluster Neighborhoods


Run _k_-means to cluster the neighborhoods into 5 clusters.


In [39]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
#kmeans.labels_[0:50] 
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0,
       0, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
       0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)
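
Before mapping the clusters, it can help to check how the neighborhoods are distributed across them. The sketch below copies the label array printed above into a plain list and counts the members of each cluster with `collections.Counter`; in the notebook you would pass `kmeans.labels_` directly.

```python
from collections import Counter

# Labels copied from the k-means output above (99 neighborhoods).
labels = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0,
    0, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
    0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2,
]

cluster_sizes = Counter(labels)
for cluster, size in sorted(cluster_sizes.items()):
    print('Cluster {}: {} neighborhoods'.format(cluster, size))
```

With these labels, cluster 0 holds 86 of the 99 neighborhoods, so most markers on the final map will share a single color.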

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [40]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = neighborhoods

# merge toronto_grouped with neighborhoods to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

print(toronto_merged.head())


  PostalCode           Borough                      Neighborhood   Latitude  \
0        M3A        North York                         Parkwoods  43.753259   
1        M4A        North York                  Victoria Village  43.725882   
2        M5A  Downtown Toronto         Regent Park, Harbourfront  43.654260   
3        M6A        North York  Lawrence Manor, Lawrence Heights  43.718518   
4        M7A      Queen's Park     Ontario Provincial Government  43.662301   

   Longitude  Cluster Labels 1st Most Common Venue 2nd Most Common Venue  \
0 -79.329656             0.0                  Park     Food & Drink Shop   
1 -79.315572             0.0           Coffee Shop           Pizza Place   
2 -79.360636             0.0           Coffee Shop                  Café   
3 -79.464763             0.0        Clothing Store     Accessories Store   
4 -79.389494             0.0           Coffee Shop      Sushi Restaurant   

    3rd Most Common Venue  4th Most Common Venue 5th Most Common Ven

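Change the "Cluster Labels" column from float64 to int32. The column is float64 because the join leaves NaN for neighborhoods that have no cluster label, so first inspect the merged dataframe and its dtypes.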

In [41]:
print(toronto_merged)
print(toronto_merged.dtypes)

    PostalCode                 Borough  \
0          M3A              North York   
1          M4A              North York   
2          M5A        Downtown Toronto   
3          M6A              North York   
4          M7A            Queen's Park   
5          M9A               Etobicoke   
6          M1B             Scarborough   
7          M3B              North York   
8          M4B               East York   
9          M5B        Downtown Toronto   
10         M6B              North York   
11         M9B               Etobicoke   
12         M1C             Scarborough   
13         M3C              North York   
14         M4C               East York   
15         M5C        Downtown Toronto   
16         M6C                    York   
17         M9C               Etobicoke   
18         M1E             Scarborough   
19         M4E            East Toronto   
20         M5E        Downtown Toronto   
21         M6E                    York   
22         M1G             Scarbor

In [42]:
# drop neighborhoods without a cluster label (NaN) and work on an explicit copy
toronto_merged_clean = toronto_merged.dropna().copy()
toronto_merged_clean['Cluster Labels'] = toronto_merged_clean['Cluster Labels'].astype(np.int32)
print(toronto_merged_clean.dtypes)
print(toronto_merged_clean)

PostalCode                 object
Borough                    object
Neighborhood               object
Latitude                  float64
Longitude                 float64
Cluster Labels              int32
1st Most Common Venue      object
2nd Most Common Venue      object
3rd Most Common Venue      object
4th Most Common Venue      object
5th Most Common Venue      object
6th Most Common Venue      object
7th Most Common Venue      object
8th Most Common Venue      object
9th Most Common Venue      object
10th Most Common Venue     object
dtype: object
    PostalCode                 Borough  \
0          M3A              North York   
1          M4A              North York   
2          M5A        Downtown Toronto   
3          M6A              North York   
4          M7A            Queen's Park   
6          M1B             Scarborough   
7          M3B              North York   
8          M4B               East York   
9          M5B        Downtown Toronto   
10         M6B        



Finally, let's visualize the resulting clusters on a map.


In [43]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per neighborhood, colored by its cluster label
for lat, lon, poi, cluster in zip(
        toronto_merged_clean['Latitude'],
        toronto_merged_clean['Longitude'],
        toronto_merged_clean['Neighborhood'],
        toronto_merged_clean['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters - 1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
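
The color scheme in the cell above comes down to sampling one color per cluster from Matplotlib's rainbow colormap and converting each to a hex string that Folium accepts. As a rough sketch of that step in isolation, assuming `kclusters = 5` to match the five clusters examined in this lab:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5  # assumed value, matching the five clusters examined in this lab

# Sample kclusters evenly spaced points of the rainbow colormap (RGBA rows)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA row to a hex string of the form '#rrggbb'
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Show which color each zero-based cluster label would get
for cluster, hex_color in enumerate(rainbow):
    print(cluster, hex_color)
```

Because the list has exactly `kclusters` entries, a zero-based cluster label can be used as an index into it.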

<a id='item5'></a>


## 5. Examine Clusters


Examine each cluster and identify the venue categories that distinguish it from the other clusters. The zero-based label `n` in the code corresponds to the heading "Cluster n + 1".
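
Besides reading the tables one by one, a quick way to spot the distinguishing categories is to count how often each category appears as the top venue within a cluster. The sketch below runs on a small made-up frame; `demo`, `top_venue_counts`, and `dominant` are hypothetical names, and the values are illustrative rather than taken from the lab's results:

```python
import pandas as pd

# Hypothetical miniature of toronto_merged_clean (values are illustrative only)
demo = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 2, 2],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Park", "Park", "Park"],
})

# For each cluster, count how often each category is the top venue
top_venue_counts = (
    demo.groupby("Cluster Labels")["1st Most Common Venue"].value_counts()
)
print(top_venue_counts)

# Keep only the single most frequent top venue per cluster
dominant = demo.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    lambda s: s.value_counts().idxmax()
)
print(dominant)
```

The same grouping could be applied to `toronto_merged_clean`, assuming it has the columns shown in the outputs of this section.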


#### Cluster 1


In [44]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 0, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

                    Borough  Cluster Labels       1st Most Common Venue  \
0                North York               0                        Park   
1                North York               0                 Coffee Shop   
2          Downtown Toronto               0                 Coffee Shop   
3                North York               0              Clothing Store   
4              Queen's Park               0                 Coffee Shop   
7                North York               0              Baseball Field   
8                 East York               0                 Pizza Place   
9          Downtown Toronto               0                 Coffee Shop   
10               North York               0                        Park   
12              Scarborough               0                         Bar   
13               North York               0                  Restaurant   
14                East York               0                Dance Studio   
15         Downtown Toron

#### Cluster 2


In [45]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 1, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

       Borough  Cluster Labels 1st Most Common Venue 2nd Most Common Venue  \
6  Scarborough               1  Fast Food Restaurant         Women's Store   

  3rd Most Common Venue        4th Most Common Venue 5th Most Common Venue  \
6          Dance Studio  Eastern European Restaurant   Dumpling Restaurant   

  6th Most Common Venue 7th Most Common Venue 8th Most Common Venue  \
6             Drugstore            Donut Shop      Doner Restaurant   

  9th Most Common Venue 10th Most Common Venue  
6               Dog Run    Distribution Center  

#### Cluster 3


In [46]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 2, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]


             Borough  Cluster Labels 1st Most Common Venue  \
21              York               2                  Park   
52        North York               2                  Park   
64              York               2                  Park   
66        North York               2                  Park   
91  Downtown Toronto               2                  Park   
98         Etobicoke               2                  Park   

   2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue  \
21         Women's Store                   Bar               Dog Run   
52         Women's Store   Distribution Center         Deli / Bodega   
64         Women's Store   Distribution Center         Deli / Bodega   
66     Convenience Store         Women's Store   Distribution Center   
91            Playground                 Trail                 Diner   
98                 River         Women's Store        Discount Store   

   5th Most Common Venue 6th Most Common Venue 7th Most Comm

#### Cluster 4


In [47]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 3, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

            Borough  Cluster Labels 1st Most Common Venue  \
32      Scarborough               3            Playground   
83  Central Toronto               3                  Park   
85      Scarborough               3                  Park   

   2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue  \
32         Women's Store   Distribution Center         Deli / Bodega   
83            Playground                 Trail          Tennis Court   
85            Playground          Intersection         Women's Store   

   5th Most Common Venue 6th Most Common Venue 7th Most Common Venue  \
32      Department Store          Dessert Shop    Dim Sum Restaurant   
83            Donut Shop      Doner Restaurant               Dog Run   
85        Discount Store         Deli / Bodega      Department Store   

   8th Most Common Venue 9th Most Common Venue 10th Most Common Venue  
32                 Diner        Discount Store                Dog Run  
83   Distribution Center        D

#### Cluster 5


In [48]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 4, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

        Borough  Cluster Labels 1st Most Common Venue 2nd Most Common Venue  \
53   North York               4        Baseball Field            Food Truck   
57   North York               4        Baseball Field               Dog Run   
101   Etobicoke               4        Baseball Field               Dog Run   

    3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue  \
53    Distribution Center      Department Store          Dessert Shop   
57       Department Store          Dessert Shop    Dim Sum Restaurant   
101      Department Store          Dessert Shop    Dim Sum Restaurant   

    6th Most Common Venue 7th Most Common Venue 8th Most Common Venue  \
53     Dim Sum Restaurant                 Diner        Discount Store   
57                  Diner        Discount Store   Distribution Center   
101                 Diner        Discount Store   Distribution Center   

    9th Most Common Venue 10th Most Common Venue  
53          Women's Store            Escape Ro