# Capstone Project for Data Science

In this notebook I will work on my final project for the Data Science course offered by IBM through Coursera.

<h1>The project: Segmenting and Clustering Neighborhoods in Toronto</h1>

In this project, I will segment the neighborhoods of Toronto and then cluster them into similar groups using the k-means clustering algorithm.

<h2>Data acquisition</h2>

I will use data from Wikipedia for this project; the page containing the data is [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The idea is to obtain boroughs and neighborhoods based on their postal codes. To achieve this, I will use the BeautifulSoup and pandas packages.

In [1]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Opening the Wikipedia page
contents = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [3]:
# Reading the contents and creating a Beautiful Soup object
html_contents = contents.read()
parser = BeautifulSoup(html_contents, 'html.parser')

In [4]:
# Creating an empty list that will store the data
toronto_data = []

In [5]:
# The data is stored in an HTML table, so I use the Beautiful Soup object to find its body
table = parser.find('tbody')

# Iterating over the table cells and formatting the data
for row in table.find_all('td'):
    # Skip postal codes that have no borough assigned
    if row.span.text == 'Not assigned':
        continue
    observation = {}
    # The first three characters of the cell are the postal code
    observation['PostalCode'] = row.p.text[0:3]
    # The borough is the text between the postal code and the opening parenthesis
    observation['Borough'] = (row.p.text[3:]).split('(')[0]
    # The neighborhoods are inside the parentheses, separated by slashes
    observation['Neighborhood'] = ((((row.span.text[3:]).split('(')[1]).strip(')').replace('/', ',')).replace(')', ' ').strip(' ')).replace(' ,', ',')
    toronto_data.append(observation)
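
The slicing above can also be expressed as a small helper, which makes the parsing rules easier to check in isolation. This is only a minimal sketch, not the code used in this notebook: the `parse_cell` name and the sample cell text (`M5ADowntown Toronto(Regent Park / Harbourfront)`) are assumptions inferred from the slicing logic above.

```python
import re

# Assumed cell layout: 3-character postal code, borough, then "(neighborhoods)"
CELL_PATTERN = re.compile(r"^([A-Z]\d[A-Z])([^(]+)\((.*)\)\s*$", re.DOTALL)


def parse_cell(text):
    """Return a dict for an assigned postal code, or None otherwise."""
    text = text.strip()
    if "Not assigned" in text:
        return None
    match = CELL_PATTERN.match(text)
    if match is None:
        return None
    postal_code, borough, hoods = match.groups()
    # Neighborhoods are separated by slashes in the assumed layout
    names = [name.strip() for name in hoods.split("/")]
    return {
        "PostalCode": postal_code,
        "Borough": borough.strip(),
        "Neighborhood": ", ".join(name for name in names if name),
    }


sample = parse_cell("M5ADowntown Toronto(Regent Park / Harbourfront)")
print(sample)
```

Returning `None` for unassigned or unparseable cells keeps the filtering decision in one place.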

In [6]:
data = pd.DataFrame(toronto_data)
data

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [7]:
# It seems like there are some errors in the Borough column; let's fix them
data['Borough']=data['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                         'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                         'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                         'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
data

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [8]:
# Let's check the shape of the dataframe
data.shape

(103, 3)

<h2>First attempt: Getting the latitude and longitude for each neighborhood based on their postal code using geocoder</h2>

<b>NOTE: This didn't work </b>

First, I need to install the geocoder package to get the latitude and longitude values for each neighborhood.
Although this did not work as expected, I'm keeping this section to show what I tried: the geocoder API didn't return anything, so the loop never ended.
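
The attempt in this section retries without any limit, which is why it can hang when the service never answers. A bounded retry is one way to avoid that. The sketch below is hypothetical: `geocode_with_retries` and the stand-in `always_empty` lookup are names invented for illustration, and no real geocoding service is called.

```python
import time


def geocode_with_retries(lookup, query, max_attempts=3, delay_seconds=1.0):
    """Call lookup(query) until it returns a value or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        # Wait before the next attempt, but not after the last one
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None


# Hypothetical stand-in for a geocoder that never returns coordinates
def always_empty(query):
    return None


coords = geocode_with_retries(always_empty, "M1B, Toronto, Ontario", delay_seconds=0)
print(coords)  # None, instead of looping forever
```

With this pattern, a postal code that cannot be resolved yields `None` and the caller decides whether to skip it or stop.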

In [9]:
# !pip install geocoder

In [10]:
# import geocoder

In [11]:
"""
postal_codes = list(data['PostalCode'])
latitudes = []
longitudes = []
# Iterate over the postal codes directly to avoid indexing past the end of the list
for postal_code in postal_codes:
    lat_lon = None
    # Keep asking until the geocoder returns coordinates
    while lat_lon is None:
        # f-string, so the postal code itself is sent in the query
        g = geocoder.google(f'{postal_code}, Toronto, Ontario')
        lat_lon = g.latlng
    latitudes.append(lat_lon[0])
    longitudes.append(lat_lon[1])
"""

<h2>Second attempt: Getting the latitude and longitude using a csv file</h2>

IBM provided a CSV file containing the coordinates for each postal code, so I'm going to use that data to keep working.
The hidden cell below loads the data into this notebook; only its output is shown, which is the dataframe containing the latitude and longitude of each postal code.

In [12]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now that I have the data I need, it is time to merge this dataframe with the original one, which holds the neighborhood data. To merge these dataframes, I first need to sort the original dataframe by its <b>PostalCode</b> column.

In [13]:
data.sort_values(by='PostalCode', inplace = True)
data

Unnamed: 0,PostalCode,Borough,Neighborhood
6,M1B,Scarborough,"Malvern, Rouge"
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae
...,...,...,...
64,M9N,York,Weston
70,M9P,Etobicoke,Westmount
77,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
89,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


Now that the dataframe is sorted and its PostalCode column matches the one in the latitude and longitude dataframe, I will merge the two using the **merge** method provided by pandas.

In [14]:
data = data.merge(lat_long, how='inner', on=lat_long['Postal Code'])
data

Unnamed: 0,key_0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,M1B,Scarborough,"Malvern, Rouge",M1B,43.806686,-79.194353
1,M1C,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",M1C,43.784535,-79.160497
2,M1E,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
...,...,...,...,...,...,...,...
98,M9N,M9N,York,Weston,M9N,43.706876,-79.518188
99,M9P,M9P,Etobicoke,Westmount,M9P,43.696319,-79.532242
100,M9R,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",M9R,43.688905,-79.554724
101,M9V,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",M9V,43.739416,-79.588437
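
The merge above passes a Series as `on`, which appears to pair rows by their position and is why the sort was needed first. A possible alternative, sketched here on a two-row sample with values taken from the tables above, is to join on the key columns by name with `left_on` and `right_on`. The frame names `neighborhoods` and `coordinates` are placeholders for illustration.

```python
import pandas as pd

# Deliberately unsorted, to show that row order does not matter
neighborhoods = pd.DataFrame({
    "PostalCode": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge Hill, Port Union, Highland Creek", "Malvern, Rouge"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = (
    neighborhoods
    .merge(
        coordinates,
        how="inner",
        left_on="PostalCode",
        right_on="Postal Code",
        validate="one_to_one",  # raise if a postal code is duplicated on either side
    )
    .drop(columns="Postal Code")
)
print(merged)
```

Matching on key values means no `key_0` column is created, and `validate` guards against accidental duplicates.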


It worked properly. Next, I will remove the unwanted **key_0** and **Postal Code** columns from the dataframe.

In [15]:
data.drop(['key_0', 'Postal Code'], axis = 1, inplace = True)
data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


<h2>Analyzing and clustering the data</h2>

In this section, I will use folium, the Foursquare API, and the k-means algorithm to explore Toronto's neighborhoods.

In [16]:
!pip install folium



In [17]:
# Importing modules
import folium
import requests
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim 
import json
from pandas import json_normalize
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

In [18]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The geographical coordinates of Toronto are {latitude}, {longitude}.')

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [19]:
# Creating a map of Toronto
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=10)
toronto_map

In [20]:
# Adding labels for each neighborhood
for lat, lng, borough, neighborhood in zip(data['Latitude'], data['Longitude'], data['Borough'], data['Neighborhood']):
    label = f'{neighborhood}, {borough}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#6E8DA5',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)
toronto_map

In [21]:
# The code was removed by Watson Studio for sharing.

This cell was used to define Foursquare credentials.


<h2>Obtaining data about venues</h2>

Now that I have the map and all the relevant data, my next step is to obtain data about the venues near each neighborhood in Toronto. I use a loop to query the Foursquare API for each neighborhood.

In [22]:
# Creating the function
def get_venues(names, lats, longs, radius = 500):
    venues_list = []
    for name, lat, lng in zip(names, lats, longs):
        print(f"Current neighborhood: {name}")
        
        # client, secret, version and limit are defined earlier in the notebook (not shown)
        url = f'https://api.foursquare.com/v2/venues/explore?&client_id={client}&client_secret={secret}&v={version}&ll={lat},{lng}&radius={radius}&limit={limit}'
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        except Exception as e:
            print(f'There was an error in obtaining the json response {e}')
            continue
        
        # Appending only the relevant data for each venue,
        # using this neighborhood's own latitude and longitude
        venues_list.append([(name, lat, lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return nearby_venues

In [23]:
# Applying the function
toronto_venues = get_venues(data['Neighborhood'], data['Latitude'], data['Longitude'])

Current neighborhood: Malvern, Rouge
Current neighborhood: Rouge Hill, Port Union, Highland Creek
Current neighborhood: Guildwood, Morningside, West Hill
Current neighborhood: Woburn
Current neighborhood: Cedarbrae
Current neighborhood: Scarborough Village
Current neighborhood: Kennedy Park, Ionview, East Birchmount Park
Current neighborhood: Golden Mile, Clairlea, Oakridge
Current neighborhood: Cliffside, Cliffcrest, Scarborough Village West
Current neighborhood: Birch Cliff, Cliffside West
Current neighborhood: Dorset Park, Wexford Heights, Scarborough Town Centre
Current neighborhood: Wexford, Maryvale
Current neighborhood: Agincourt
Current neighborhood: Clarks Corners, Tam O'Shanter, Sullivan
Current neighborhood: Milliken, Agincourt North, Steeles East, L'Amoreaux East
Current neighborhood: Steeles West, L'Amoreaux West
Current neighborhood: Upper Rouge
Current neighborhood: Hillcrest Village
Current neighborhood: Fairview, Henry Farm, Oriole
Current neighborhood: Bayview Village
Current neighborhood

In [24]:
print(f"There are {toronto_venues.shape[0]} observations in the data frame")
toronto_venues.head(10)

There are 2139 observations in the data frame


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,RIGHT WAY TO GOLF,43.785177,-79.161108,Golf Course
2,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Great Shine Window Cleaning,43.783145,-79.157431,Home Service
3,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
5,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
6,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Sail Sushi,43.765951,-79.191275,Restaurant
7,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant
8,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Enterprise Rent-A-Car,43.764076,-79.193406,Rental Car Location
9,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Krispy Kreme Doughnuts,43.767169,-79.18966,Donut Shop
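
The list comprehension in `get_venues` assumes every venue has at least one category, so a venue with an empty `categories` list would raise an `IndexError`. A more defensive way to flatten the response items might look like the sketch below. The `flatten_venues` helper is hypothetical, the dictionary keys mirror the ones used in `get_venues`, and the second sample venue is invented to exercise the empty-category case.

```python
def flatten_venues(name, lat, lng, items):
    """Turn Foursquare-style items into flat tuples, one per venue."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


sample_items = [
    {"venue": {"name": "Wendy’s",
               "location": {"lat": 43.807448, "lng": -79.199056},
               "categories": [{"name": "Fast Food Restaurant"}]}},
    # Hypothetical venue, included only to show the empty-category case
    {"venue": {"name": "Uncategorized venue",
               "location": {"lat": 43.8, "lng": -79.2},
               "categories": []}},
]

rows = flatten_venues("Malvern, Rouge", 43.806686, -79.194353, sample_items)
for row in rows:
    print(row)
```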


In [25]:
# Looking for the number of venues per neighborhood
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Wilson Heights, Downsview North",23,23,23,23,23,23
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",24,24,24,24,24,24
...,...,...,...,...,...,...
Willowdale West,6,6,6,6,6,6
"Willowdale, Newtonbrook",3,3,3,3,3,3
Woburn,4,4,4,4,4,4
Woodbine Heights,7,7,7,7,7,7


In [26]:
# Creating a data frame with One-hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Adding the neighborhood column to the dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Moving the neighborhood column to the first place
first_column = toronto_onehot.pop('Neighborhood')
toronto_onehot.insert(0, 'Neighborhood', first_column)
toronto_onehot

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2134,"South Steeles, Silverstone, Humbergate, Jamest...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2135,"Clairville, Humberwood, Woodbine Downs, West H...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2136,"Clairville, Humberwood, Woodbine Downs, West H...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2137,"Clairville, Humberwood, Woodbine Downs, West H...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
# Finding the frequency of each venue category per Neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Willowdale West,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,"Willowdale, Newtonbrook",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0
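
To make the two steps above concrete, here is a small sketch of one-hot encoding followed by a per-neighborhood mean. The Agincourt categories come from the output of this notebook, while "Example Hood" is a made-up neighborhood added for contrast. Passing `dtype=int` is a choice made here so the columns are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["Agincourt"] * 4 + ["Example Hood"] * 2,
    "Venue Category": [
        "Lounge", "Breakfast Spot", "Latin American Restaurant", "Chinese Restaurant",
        "Coffee Shop", "Coffee Shop",
    ],
})

# One column per category, with 1 marking the category of each venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1, so the values can be read as category frequencies within a neighborhood.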


In [28]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("\t\t---> "+hood+" <---")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['Venue Category','Frequency']
    temp = temp.iloc[1:]
    temp['Frequency'] = temp['Frequency'].astype(float)
    temp = temp.round({'Frequency': 2})
    print(temp.sort_values('Frequency', ascending=False).reset_index(drop=True).head(num_top_venues), "\n")

		---> Agincourt <---
              Venue Category  Frequency
0                     Lounge       0.25
1             Breakfast Spot       0.25
2  Latin American Restaurant       0.25
3         Chinese Restaurant       0.25
4              Metro Station       0.00 

		---> Alderwood, Long Branch <---
  Venue Category  Frequency
0    Pizza Place       0.25
1    Coffee Shop       0.12
2            Gym       0.12
3       Pharmacy       0.12
4            Pub       0.12 

		---> Bathurst Manor, Wilson Heights, Downsview North <---
   Venue Category  Frequency
0            Bank       0.09
1     Coffee Shop       0.09
2  Ice Cream Shop       0.04
3     Bridal Shop       0.04
4  Sandwich Place       0.04 

		---> Bayview Village <---
              Venue Category  Frequency
0                       Café       0.25
1                       Bank       0.25
2         Chinese Restaurant       0.25
3        Japanese Restaurant       0.25
4  Middle Eastern Restaurant       0.00 

		---> Bedford Park, Lawr

In [29]:
# Function to return the most common venue categories, ordered by frequency
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [30]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append(f'{ind+1}{indicators[ind]} Most Common Venue')
    except IndexError:
        columns.append(f'{ind+1}th Most Common Venue')

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.shape

(100, 11)
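
The column names above rely on a three-item suffix list with a fallback to "th", which is fine for ten columns but would produce labels such as "21th" for longer lists. A general suffix helper could look like this sketch; the `ordinal` function is a hypothetical name and is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11th, 12th and 13th are exceptions to the last-digit rule
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns)
```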

<h2> Clustering neighborhoods </h2>

In [31]:
# Creating and visualizing the dataframe that will be used for clustering
cluster_df = toronto_grouped.drop('Neighborhood', axis = 1)
cluster_df

Unnamed: 0,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
# Creating the k-Means object
n_clusters = 6

kmeans_model = KMeans(n_clusters=n_clusters)
kmeans_model

KMeans(n_clusters=6)

In [33]:
#Training the model
kmeans_model.fit(cluster_df)

# Showing the first 100 cluster labels
kmeans_model.labels_[:100]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       2, 0, 0, 0, 0, 0, 0, 5, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       5, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 5, 0, 0,
       0, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 5], dtype=int32)
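
Most of the labels above fall into cluster 0, and because k-means starts from random centroids the exact assignment can change between runs. The sketch below shows one way to make a run repeatable and to count cluster sizes. It uses a tiny made-up feature matrix rather than the notebook's data, and the `random_state` and `n_init` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency vectors: three rows near one corner, two near another
features = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
])

# A fixed random_state makes the centroid initialization repeatable
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
repeat = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Number of rows assigned to each cluster
sizes = np.bincount(labels, minlength=2)
print(labels, sizes)
```

Checking the cluster sizes is a quick way to spot a result where one cluster absorbs almost everything.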

In [34]:
# Checking how many cluster labels were produced
kmeans_model.labels_.shape

(100,)

In [35]:
# Adding the Cluster Labels to the neighborhood dataframe
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans_model.labels_)

#Joining the dataframes
toronto_merged = data
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,4.0,Fast Food Restaurant,Farmers Market,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,0.0,Golf Course,Bar,Home Service,Yoga Studio,Doner Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.0,Mexican Restaurant,Donut Shop,Bank,Intersection,Medical Center,Rental Car Location,Restaurant,Breakfast Spot,Electronics Store,Drugstore
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,Coffee Shop,Other Repair Shop,Korean BBQ Restaurant,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,Bank,Gas Station,Caribbean Restaurant,Hakka Restaurant,Fried Chicken Joint,Lounge,Thai Restaurant,Athletics & Sports,Bakery,Diner
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,0.0,Playground,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029,0.0,Department Store,Coffee Shop,Discount Store,Bus Station,Chinese Restaurant,Hobby Shop,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Dim Sum Restaurant
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577,0.0,Bakery,Park,Bus Line,Ice Cream Shop,Intersection,Bus Station,Metro Station,Donut Shop,Discount Store,Drugstore
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476,0.0,Motel,American Restaurant,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,0.0,Café,General Entertainment,Skating Rink,College Stadium,Concert Hall,Construction & Landscaping,Ethiopian Restaurant,Escape Room,Comfort Food Restaurant,Electronics Store


In [36]:
# Converting the Cluster Labels column from float to int (neighborhoods without venues have no label and fall back to cluster 0)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(0.0).astype(int)
toronto_merged.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,4,Fast Food Restaurant,Farmers Market,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,0,Golf Course,Bar,Home Service,Yoga Studio,Doner Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0,Mexican Restaurant,Donut Shop,Bank,Intersection,Medical Center,Rental Car Location,Restaurant,Breakfast Spot,Electronics Store,Drugstore
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0,Coffee Shop,Other Repair Shop,Korean BBQ Restaurant,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0,Bank,Gas Station,Caribbean Restaurant,Hakka Restaurant,Fried Chicken Joint,Lounge,Thai Restaurant,Athletics & Sports,Bakery,Diner
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,0,Playground,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029,0,Department Store,Coffee Shop,Discount Store,Bus Station,Chinese Restaurant,Hobby Shop,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Dim Sum Restaurant
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577,0,Bakery,Park,Bus Line,Ice Cream Shop,Intersection,Bus Station,Metro Station,Donut Shop,Discount Store,Drugstore
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476,0,Motel,American Restaurant,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,0,Café,General Entertainment,Skating Rink,College Stadium,Concert Hall,Construction & Landscaping,Ethiopian Restaurant,Escape Room,Comfort Food Restaurant,Electronics Store
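
Filling missing labels with 0 places neighborhoods that returned no venues in the same group as cluster 0. If that is not desired, one alternative is to drop those rows before converting the column to integers. The sketch below uses a two-row sample: the first row comes from the table above and the second is a hypothetical neighborhood without venues.

```python
import pandas as pd

merged = pd.DataFrame({
    "Neighborhood": ["Malvern, Rouge", "Hypothetical hood without venues"],
    "Cluster Labels": [4.0, float("nan")],
})

# Keep only neighborhoods that actually received a cluster label
labeled = merged.dropna(subset=["Cluster Labels"]).copy()
labeled["Cluster Labels"] = labeled["Cluster Labels"].astype(int)
print(labeled)
```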


In [37]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(n_clusters)
ys = [i + x + (i*x)**2 for i in range(n_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
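
The color scheme above can be written more directly. This sketch builds one hex color per cluster and indexes it by the cluster label itself, so cluster 0 takes the first color instead of wrapping around to the last one; the `cluster_color` mapping is a name introduced here for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters = 6

# Sample the rainbow colormap at evenly spaced points, one per cluster
colors_array = cm.rainbow(np.linspace(0, 1, n_clusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Map each cluster label to its own color
cluster_color = {cluster: rainbow[cluster] for cluster in range(n_clusters)}
print(cluster_color)
```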

Thank you for going through this notebook!