<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

# Collecting and Transforming Toronto Neighborhood Data from Wikipedia #

## Introduction

To obtain data about the neighborhoods in Toronto, this document walks through how to retrieve the table of postal codes from Wikipedia, along with the latitude and longitude coordinates of each neighborhood. Furthermore, we explain how the data is transformed and stored in a pandas dataframe.

## Table of Contents

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Get the Geographical Coordinates</a>

3. <a href="#item3">Using Geographical Coordinates in the map of Toronto</a>

4. <a href="#item4">Explore Neighborhoods in Toronto City</a>

5. <a href="#item5">Analyze Each Neighborhood</a>

6. <a href="#item5">Cluster Neighborhoods</a>

7. <a href="#item5">Examine Clusters</a>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests

## 1. Download and Explore Dataset

The dataset used to explore the neighborhoods in Toronto comes from a Wikipedia page.
In order to segment the neighborhoods and explore them, we need a dataset that contains the postal codes, the boroughs, and the neighborhoods that exist in each postal code.

After capturing and formatting the dataset, we will create a new dataframe that consists of three columns: PostalCode, Borough, and Neighborhood.

**Notes on scraping the Wikipedia page:**

Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that **M5A** is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. These two rows will be combined into one row with the neighborhoods separated with a **comma**.

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the **9th** cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be **Queen's Park**.
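
The three rules above can be sketched with pandas on a small hand-made sample. The rows below are illustrative and were not scraped from the page, and the names `sample` and `cleaned` are assumptions made for this example only.

```python
import pandas as pd

# Illustrative rows in the shape of the Wikipedia table (not the real scrape)
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Rule 1: keep only rows with an assigned borough
cleaned = sample[sample['Borough'] != 'Not assigned'].copy()

# Rule 3: an unassigned neighborhood takes the borough's name
unassigned = cleaned['Neighborhood'] == 'Not assigned'
cleaned.loc[unassigned, 'Neighborhood'] = cleaned.loc[unassigned, 'Borough']

# Rule 2: one row per postal code, neighborhoods joined by commas
cleaned = (cleaned.groupby(['PostalCode', 'Borough'])['Neighborhood']
                  .agg(','.join)
                  .reset_index())

print(cleaned)
```

The notebook itself applies the same rules step by step on the raw HTML rows; this sketch only shows the intended end result on a few rows.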

This dataset exists for free on the web.  Here is the link to the dataset: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
raw_toronto_wikipedia_page = requests.get(wikipedia_link)

Storing the wikipedia page in a page variable

In [4]:
page = raw_toronto_wikipedia_page.text

Finding Postal Code Table inside 'wikipedia page' and storing in a table_script variable

In [5]:
html_table_tag_start = "wikitable sortable"
html_table_tag_end = "</tbody></table>"
table_start = page.find(html_table_tag_start) + len(html_table_tag_start)
table_end = page.find(html_table_tag_end,table_start)
table_script = page[table_start:table_end]


Removing tags that are not needed for the dataset

In [6]:
table_script = table_script.replace("</a>","")
table_script = table_script.replace("<td>","")
table_script = table_script.replace("\n","")
table_script = table_script.replace("\t","")
table_script = table_script.replace("\"><tbody><tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th>","")

Removing rows that contain the "Not assigned" string and storing only the valid rows in a new list

In [7]:
tr_table = table_script.split("</tr>")
tr_table_valid = []
for p in tr_table:    
    not_assigned = p.find("Not assigned</td>Not assigned")
    if (not_assigned == -1):
        if (len(p) > 0):
            tr_table_valid.append(p)
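
The string-based steps above depend on the exact markup of the page at the time of scraping. As a hedged alternative, the sketch below reads table rows with the standard library's `html.parser`. The class `TableRowParser` and the fragment `sample_html` are hypothetical names written for this illustration; the fragment imitates the table markup and is not the live page.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row being read, or None outside a row
        self._cell = None     # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:                 # header rows hold only <th> cells
                self.rows.append(self._row)
            self._row = None

# Hand-written fragment imitating the table markup (not the live page)
sample_html = """
<table class="wikitable sortable"><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a></td></tr>
</tbody></table>
"""

parser = TableRowParser()
parser.feed(sample_html)

# Drop rows whose borough is "Not assigned"
valid_rows = [row for row in parser.rows if row[1] != 'Not assigned']
print(valid_rows)
```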

In [8]:
print(tr_table_valid)

['<tr>M3A</td><a href="/wiki/North_York" title="North York">North York</td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</td>', '<tr>M4A</td><a href="/wiki/North_York" title="North York">North York</td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</td>', '<tr>M5A</td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</td>', '<tr>M5A</td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</td>', '<tr>M6A</td><a href="/wiki/North_York" title="North York">North York</td><a href="/wiki/Lawrence_Heights" title="Lawrence Heights">Lawrence Heights</td>', '<tr>M6A</td><a href="/wiki/North_York" title="North York">North York</td><a href="/wiki/Lawrence_Manor" title="Lawrence Manor">Lawrence Manor</td>', '<tr>M7A</td><a href="/wiki/Queen%27s_Park_(Toronto)" title="

Create a new DataFrame

In [9]:
# define the dataframe columns
column_names = ['PostalCode','Borough', 'Neighborhood'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [10]:
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood


Extract Postal Code, Borough, Neighborhood for each row in the list and store in DataFrame

In [11]:
for r in tr_table_valid:
    PostalCode = ''
    Borough = ''
    neighborhood_name = '' 
    td_table = r.split("</td")
    if (td_table[0].rfind(">") > -1):
        PostalCode = td_table[0][td_table[0].rfind(">")+1:len(td_table[0])]    
    if (td_table[1].rfind(">") > -1):
        Borough =  td_table[1][td_table[1].rfind(">")+1:len(td_table[1])]
    if (td_table[2].rfind(">") > -1):
        neighborhood_name =  td_table[2][td_table[2].rfind(">")+1:len(td_table[2])]
    else:
        neighborhood_name = Borough
    if (neighborhood_name == "Not assigned"):
        neighborhood_name = Borough
        
    # DataFrame.append was removed in pandas 2.0, so add the row by label instead
    neighborhoods.loc[len(neighborhoods)] = [PostalCode, Borough, neighborhood_name]

Check dataframe results

In [12]:
neighborhoods.head(220)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Check the shape of the dataframe

In [13]:
neighborhoods.shape

(212, 3)

Create a new DataFrame

In [14]:
# instantiate the postal Code dataframe
postalcode = pd.DataFrame(columns=column_names)

Extract Postal Code, Borough, Neighborhood for each row in the list and store them in a DataFrame, **grouping by** Postal Code

In [15]:
grouped_PostalCode = neighborhoods.groupby('PostalCode')
for name,group in grouped_PostalCode:
    g_PostalCode = name
    g_Borough = group['Borough'].unique()[0]
    g_Neighborhood = ",".join(group['Neighborhood'].values.tolist())
    # DataFrame.append was removed in pandas 2.0, so add the row by label instead
    postalcode.loc[len(postalcode)] = [g_PostalCode, g_Borough, g_Neighborhood]
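
The grouping loop above takes the first borough of each group (`unique()[0]`), which assumes that every postal code belongs to exactly one borough. A minimal way to check that assumption is sketched below on made-up rows; `sample`, `boroughs_per_code` and `conflicts` are names introduced only for this example.

```python
import pandas as pd

# Made-up rows in the shape of the neighborhoods dataframe
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Number of distinct boroughs recorded for each postal code
boroughs_per_code = sample.groupby('PostalCode')['Borough'].nunique()

# Postal codes that map to more than one borough (empty if the assumption holds)
conflicts = boroughs_per_code[boroughs_per_code > 1]
print(conflicts.empty)
```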

Check dataframe results again using groups

In [16]:
postalcode.head(150)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


Check the shape of the dataframe

In [17]:
postalcode.shape

(103, 3)

## 2. Get the Geographical Coordinates

In order to utilize the Map location data, we need to get the latitude and the longitude coordinates of each Toronto neighborhood.

We first try the Geocoder Python package (https://geocoder.readthedocs.io/index.html) to get the latitude and longitude coordinates.
This package can be very unreliable, so in case you are not able to get the geographical coordinates of the neighborhoods with it, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

After capturing the latitude and longitude coordinates, we will create a new dataframe that consists of five columns: PostalCode, Borough, Neighborhood, Latitude and Longitude.

**Important Note:** There is a limit on how many times you can call the geocoder.google function: 2500 times per day. This should be more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in Toronto.

Install the libraries for the Geocoder API

In [18]:
#!conda install -c conda-forge geocoder --yes
#!conda install -c conda-forge/label/gcc7 geocoder --yes

In [19]:
#import geocoder

Below is the code to call the Geocoder API for a single postal code; the same call would be repeated for each row of the postalcode dataframe.

In [20]:
# initialize your variable to None
#lat_lng_coords = None
#postal_code = 'M4V'

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
#print('{} - - {}'.format(latitude, longitude))
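
The commented loop above retries until coordinates come back, so it never ends if the service keeps failing. A hedged sketch of a bounded retry follows. `retry_lookup` and `fake_lookup` are hypothetical helpers written for this example; `fake_lookup` stands in for a geocoding call and returns made-up coordinates.

```python
import time

def retry_lookup(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        if attempt > 0:
            time.sleep(delay_seconds)   # pause between attempts
        coords = lookup(query)
        if coords is not None:
            return coords
    return None                         # caller decides how to handle a failure

# Stand-in for a geocoding call: fails twice, then returns made-up coordinates
calls = {'count': 0}

def fake_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.0, -79.0]

coords = retry_lookup(fake_lookup, 'M4V, Toronto, Ontario')
print(coords, calls['count'])
```

Returning `None` after the last attempt lets the caller fall back to another source, such as the csv file mentioned earlier.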


Unfortunately, the package could not be used because of **faulty responses**, so we chose to work with the csv file that already has the coordinates.

In [21]:
df_geospatialdata = pd.read_csv("http://cocl.us/Geospatial_data")

Rename the 'Postal Code' column to 'PostalCode' and set it as the index

In [22]:
df_geospatialdata.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df_geospatialdata.set_index('PostalCode', inplace=True)

Check the dataframe with the latitude and longitude coordinates of each Toronto postal code

In [23]:
df_geospatialdata

PostalCode,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
M1J,43.744734,-79.239476
M1K,43.727929,-79.262029
M1L,43.711112,-79.284577
M1M,43.716316,-79.239476
M1N,43.692657,-79.264848


Set the same index column, PostalCode, on the postalcode dataframe as on df_geospatialdata

In [24]:
postalcode.set_index('PostalCode', inplace=True)

In [25]:
postalcode

PostalCode,Borough,Neighborhood
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
M1J,Scarborough,Scarborough Village
M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
M1N,Scarborough,"Birch Cliff,Cliffside West"


Merge the two dataframes, **postalcode** and **df_geospatialdata**, into a third dataframe, **result**.

The result dataframe will have 3 columns from the postalcode dataframe (PostalCode, Borough, Neighborhood) and 2 columns from the df_geospatialdata dataframe (Latitude, Longitude).

The key common to both dataframes is **PostalCode**.

In [26]:
result = pd.merge(postalcode,
                     df_geospatialdata[['Latitude','Longitude']],
                     on='PostalCode')
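
The merge above keeps only postal codes present in both dataframes, so a code without coordinates would drop out silently. One way to check for that is sketched below with `how='left'` and `indicator=True`. The frames `left` and `right` are small illustrative samples that keep PostalCode as an ordinary column rather than as the index, and the second row is deliberately left without coordinates.

```python
import pandas as pd

# Illustrative samples; 'M3A' is deliberately missing from the coordinates
left = pd.DataFrame({'PostalCode': ['M1B', 'M3A'],
                     'Borough': ['Scarborough', 'North York']})
right = pd.DataFrame({'PostalCode': ['M1B'],
                      'Latitude': [43.806686],
                      'Longitude': [-79.194353]})

# how='left' keeps every postal code; indicator=True adds a '_merge' column
checked = pd.merge(left, right, on='PostalCode', how='left', indicator=True)

# Rows marked 'left_only' found no coordinates in the second dataframe
missing = checked[checked['_merge'] == 'left_only']
print(missing['PostalCode'].tolist())
```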

Check Dataframe results

In [27]:
result.head(130)

PostalCode,Borough,Neighborhood,Latitude,Longitude
M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
M1G,Scarborough,Woburn,43.770992,-79.216917
M1H,Scarborough,Cedarbrae,43.773136,-79.239476
M1J,Scarborough,Scarborough Village,43.744734,-79.239476
M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


In [28]:
result.shape

(103, 4)

## 3. Using Geographical Coordinates in the map of Toronto

This section covers the process of exploring and clustering the neighborhoods in Toronto. As an example, we decided to work only with boroughs that contain the word Toronto.

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [None]:
from pandas import json_normalize # transform JSON file into a pandas dataframe

!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

print('Libraries imported.')

#### Use geopy library to get the latitude and longitude values of Toronto City.

In [None]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="toronto_explorer") # recent geopy versions require a user_agent; this name is arbitrary
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(latitude, longitude))

#### Create a map of Toronto with neighborhoods superimposed on top.

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(result['Latitude'], result['Longitude'], result['Borough'], result['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.

However, for illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in Toronto City. So let's slice the original dataframe and create a new dataframe with only the rows whose **Borough** contains **the word Toronto**.

In [None]:
toronto_data = result[result['Borough'].str.contains("Toronto")].reset_index(drop=True)
toronto_data.head()

In [None]:
toronto_data.shape

Let's get the geographical coordinates of Toronto again, to center the map of the boroughs that contain **the word Toronto**.

In [None]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="toronto_explorer") # recent geopy versions require a user_agent; this name is arbitrary
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(latitude, longitude))

As we did for all of Toronto City, let's visualize only the neighborhoods in boroughs that contain **the word Toronto**.

In [None]:
# create map of Toronto City using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [None]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [None]:
toronto_data.loc[0, 'Neighborhood']

Get the neighborhood's latitude and longitude values.

In [None]:
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

#### Now, let's get the top 100 venues that are in a radius of 500 meters.

First, let's create the GET request URL. Name your URL **url**.

In [None]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url
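
The URL above is assembled with string formatting. As a hedged alternative, the sketch below builds the same query string with the standard library's `urlencode`, which escapes parameter values. `build_explore_url` is a hypothetical helper, and the credentials and coordinates passed to it are placeholders for illustration only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the venues/explore URL with encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials and example coordinates, for illustration only
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 43.65, -79.38, 500, 100)

# Parse the query string back to confirm what was encoded
query = parse_qs(urlsplit(example_url).query)
print(query['ll'], query['radius'])
```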

Send the GET request and examine the results

In [None]:
results = requests.get(url).json()
results

Let's borrow the **get_category_type** function.

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

And how many venues were returned by Foursquare?

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

<a id='item2'></a>

## 4. Explore Neighborhoods in Toronto City

#### Let's create a function to repeat the same process to all the neighborhoods in Toronto City

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
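
The function above indexes `v['venue']['categories'][0]` directly, which raises an `IndexError` for a venue without categories, and it assumes every response contains at least one group. A defensive variant of the extraction step is sketched below. `extract_venue_rows` is a hypothetical helper, and `fake_payload` is a made-up response in the shape the function above indexes into.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Turn one explore response (already parsed from JSON) into flat tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Use None when a venue carries no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Made-up payload, for illustration only
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.1, 'lng': -79.1},
               'categories': [{'name': 'Cafe'}]}},
    {'venue': {'name': 'Spot B', 'location': {'lat': 43.2, 'lng': -79.2},
               'categories': []}},
]}]}}

rows = extract_venue_rows(fake_payload, 'Sample Neighborhood', 43.0, -79.0)
print(len(rows), rows[1][-1])
```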

#### Now write the code to run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [None]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

#### Let's check the size of the resulting dataframe

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

Let's check how many venues were returned for each neighborhood

In [None]:
toronto_venues.groupby('Neighborhood').count()

#### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

<a id='item3'></a>

## 5. Analyze Each Neighborhood

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

And let's examine the new dataframe size.

In [None]:
toronto_onehot.shape

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped
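
To see why the mean of the one-hot columns is a frequency, here is a small worked sketch on made-up venues. The names `venues`, `onehot` and `freq` and the data are assumptions for this example only.

```python
import pandas as pd

# Made-up venues: neighborhood A has four venues, neighborhood B has one
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Bar', 'Park'],
})

# One indicator column per category, 1.0 where the venue has that category
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of an indicator column is the share of venues in that category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

In neighborhood A, two of the four venues are cafes, so the Cafe column averages to 0.5.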

#### Let's confirm the new size

In [None]:
toronto_grouped.shape

#### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted
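
The column labels above come from indexing a three-element suffix list and falling back to 'th', which is correct for 1 through 10. A sketch of a helper that also handles larger numbers is shown below; `ordinal` is a hypothetical function and is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:          # 11th, 12th, 13th are irregular
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same labels as the columns built above, for the top 10 venues
column_labels = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(column_labels[0], '|', column_labels[3])
```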

<a id='item4'></a>

## 6. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
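
For intuition about what the clustering does with the frequency vectors, the sketch below runs a single assignment-and-update step of *k*-means in NumPy on toy data. It is an illustration only and not scikit-learn's implementation; `kmeans_step`, the points and the starting centroids are all assumptions made for this example.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One iteration: assign each point to its nearest centroid, then recompute centroids."""
    # Squared distance from every point to every centroid, shape (n_points, k)
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # New centroid = mean of its assigned points; keep the old one if it has none
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

# Toy frequency vectors: two rows dominated by the first category, two by the second
points = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
start = np.array([[1.0, 0.0], [0.0, 1.0]])

labels, centroids = kmeans_step(points, start)
print(labels)
```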

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
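# add clustering labels to the table of top venues; its rows are in the same
# order as toronto_grouped, which is what k-means was fitted on
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join the labels and top venues onto toronto_data, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_data.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# neighborhoods for which no venues were returned have no label; drop them and keep labels as integers
toronto_merged = toronto_merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})

toronto_merged.head() # check the last columns!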

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

## 7. Examine Clusters

Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster:

#### Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 5

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### About the Author:  
 [Clayton Magalhaes](https://www.linkedin.com/in/cvianam/) is a Fraud Prevention Specialist at IBM.



 <hr>
Copyright &copy; 2018 [cognitiveclass.ai](cognitiveclass.ai?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).