## Assignment - Segmenting and Clustering neighborhoods in Toronto

<br>
Hello! <br><br>
This Jupyter Notebook was created to segment and cluster neighborhoods in Toronto, as part of the Week 3 assignment of the Applied Data Science Capstone on Coursera. <br>
I hope you'll find it useful. <br>

<br> --------------------- <br>

### Phase 1 - Scrape Wikipedia for data using the Wikipedia or BeautifulSoup libraries

In this phase, we will build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe. <br><br>

There are various ways to do that, for example using the Wikipedia or BeautifulSoup Python libraries. <br>
For this tutorial, we decided to use the Wikipedia Python library, which makes it easy to access and parse data from Wikipedia. More details can be found at https://pypi.org/project/wikipedia/ <br>

For the curious, the last cell of this phase contains the code that uses the BeautifulSoup Python library instead. <br><br>

We will start by installing the Wikipedia, BeautifulSoup, and lxml packages. <br>

In [2]:
# installing Wikipedia package 
!conda install -c conda-forge --no-deps wikipedia --yes

# installing BeautifulSoup package
!conda install -c conda-forge --no-deps beautifulsoup4 --yes

# installing lxml package
!conda install -c conda-forge --no-deps lxml --yes


If you want to use the BeautifulSoup solution, you'll need the beautifulsoup4 and lxml packages installed in the cell above. <br>

<br> We will then use the wikipedia library to get the data from the page - code below: <br> 

In [3]:
# using Wikipedia package

import pandas as pd  # library for data analysis
import wikipedia as wp  # library to access Wikipedia
import bs4  # BeautifulSoup library
import requests  # library to handle requests

# Get the HTML source
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")

# Create our dataframe tor_df
tor_df = pd.read_html(html)[0]
tor_df.to_csv('beautifulsoup_pandas.csv',header=0,index=False)
tor_df.head()

  'The soupsieve package is not installed. CSS selectors cannot be used.'


Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


<br> <br>
The final dataframe should satisfy the following conditions: <br>
  - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood. <br>
  - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. <br>
  - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table. <br>
  - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. <br><br>

In [4]:
# Replace '/' by ',' in the Neighborhood column 
tor_df['Neighborhood'] = tor_df['Neighborhood'].str.replace(' /',',')

# Remove rows with a borough that is "Not assigned".
tor_df.drop(tor_df[tor_df['Borough']=="Not assigned"].index,axis=0, inplace=True)

# If a Postal Code is listed more than once, regroup all the neighborhoods in one row and separate them with ','
tor_df = tor_df.groupby("Postal code").agg(lambda x:','.join(set(x)))

# If a Borough has a "Not assigned" Neighborhood, copy the name of the Borough in the Neighborhood cell.
tor_df.loc[tor_df['Neighborhood']=="Not assigned",'Neighborhood']=tor_df.loc[tor_df['Neighborhood']=="Not assigned",'Borough']

tor_df.reset_index(inplace=True)

tor_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
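
As a small illustration of the four conditions listed earlier, the following sketch applies them to a made-up table. The column names mirror the scraped table, but the rows are invented for the example and the helper name `clean_postal_table` is hypothetical. Unlike `set(x)` on its own, whose iteration order is not guaranteed, the sketch joins the sorted unique values so that the result is reproducible.

```python
import pandas as pd

def clean_postal_table(df):
    """Apply the assignment's cleaning rules to a raw postal-code table (illustrative helper)."""
    # Rule: ignore rows whose borough is "Not assigned"
    out = df[df["Borough"] != "Not assigned"].copy()
    # Rule: a missing or "Not assigned" neighborhood takes the borough name
    missing = out["Neighborhood"].isna() | (out["Neighborhood"] == "Not assigned")
    out.loc[missing, "Neighborhood"] = out.loc[missing, "Borough"]
    # Rule: combine neighborhoods that share a postal code, in sorted order
    out = (
        out.groupby(["Postal code", "Borough"])["Neighborhood"]
        .agg(lambda names: ", ".join(sorted(set(names))))
        .reset_index()
    )
    return out

# Made-up rows, for illustration only
raw = pd.DataFrame({
    "Postal code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": [None, "Regent Park", "Harbourfront", "Not assigned"],
})
cleaned = clean_postal_table(raw)
print(cleaned)
```

The result keeps the three expected columns, with one row per postal code.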


<br> Let's check the shape of our DataFrame now: <br>

In [5]:
print(tor_df.shape)

print('There are {} unique Postal Codes.'.format(len(tor_df['Postal code'].unique())))
print('There are {} unique Boroughs.'.format(len(tor_df['Borough'].unique())))

(103, 3)
There are 103 unique Postal Codes.
There are 10 unique Boroughs.


<br> <br>

#### <u> Extra </u>
Below is the code if you want to use the BeautifulSoup library only <br>

In [9]:
# using BeautifulSoup package

from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

def get_wikipedia_data(u):
    page_html = requests.get(u).text
    soup = BeautifulSoup(page_html, 'html5lib')
    rows = soup.find('table', class_='wikitable').find_all('tr')

    parsed_table = []  # local list, so re-running the cell does not accumulate rows
    for row in rows:
        children = row.findChildren(recursive=False)
        row_text = []
        for child in children:
            clean_text = child.text
            clean_text = clean_text.split('&#91;')[0]     # This is to discard reference/citation links
            clean_text = clean_text.split('&#160;')[-1]   # This is to clean the header row of the sort icons
            clean_text = clean_text.strip()
            row_text.append(clean_text)
        # append each row once, after all of its cells have been collected
        parsed_table.append(row_text)

    return parsed_table

print('this is the data:')
tor_list = get_wikipedia_data(url)

# the first row holds the table headers, so skip it;
# a separate name avoids overwriting tor_df from the previous cells
bs_df = pd.DataFrame(tor_list[1:], columns=['PostalCode', 'Borough', 'Neighborhood'])
bs_df.head()

this is the data:


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


<br>
<br> --------------------- <br>

### Phase 2 - Getting latitude and longitude of each neighborhood using Geocoder package

Now that we have built our dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. <br>
We tried to use the Geocoder Python package: https://geocoder.readthedocs.io/index.html, but it wasn't successful.<br><br>

Given that this package can be very unreliable, Coursera provided us with the csv file containing the final data. We will download it and use it. <br>
Here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data  <br>
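
Because geocoding services can fail intermittently, a common pattern is to retry the lookup a bounded number of times before falling back to the CSV file. The sketch below assumes a hypothetical `lookup` callable that returns a `(latitude, longitude)` tuple or `None`. It is not the Geocoder package's API, and `fake_lookup` only simulates an unreliable service.

```python
import time

def geocode_with_retry(lookup, postal_code, max_attempts=5, delay_seconds=0.0):
    """Call `lookup` until it returns coordinates or the attempts run out.

    `lookup` is a hypothetical callable: postal_code -> (lat, lng) or None.
    """
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # wait before the next attempt
    return None  # the caller decides on the fallback (e.g. the CSV file)

# Simulated unreliable service: fails twice, then succeeds
calls = {"count": 0}

def fake_lookup(postal_code):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return (43.806686, -79.194353)  # coordinates of M1B as listed in the CSV file

result = geocode_with_retry(fake_lookup, "M1B")
print(result, "after", calls["count"], "attempts")
```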

In [6]:
# installing Geocoder package 
#!conda install -c conda-forge --no-deps geocoder --yes
#!conda install -c conda-forge --no-deps ratelim --yes
#!conda install -c conda-forge --no-deps click --yes

!conda install -c conda-forge --no-deps altair --yes
!conda install -c conda-forge --no-deps vincent --yes

# installing Geopy package 
!conda install -c conda-forge geopy=1.19.0 --yes 

# installing Folium package
!conda install -c conda-forge --no-deps folium=0.5.0 --yes


In [7]:
import csv

print("Libraries imported!")

Libraries imported!


<br> We will download and read the csv file: <br>

In [8]:
!wget -q -O 'toronto_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [8]:
ll_df = pd.read_csv('toronto_data.csv')
ll_df.tail()

Unnamed: 0,Postal Code,Latitude,Longitude
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437
102,M9W,43.706748,-79.594054


In [9]:
tor_df["Latitude"] = ll_df["Latitude"]
tor_df["Longitude"] = ll_df["Longitude"]
tor_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
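
The column assignment in the cell above relies on both tables having identical row order. A more defensive variant joins the two tables on the postal code itself. The sketch below uses small made-up frames with the same column names as `tor_df` and `ll_df`; the coordinates for M1B and M1C are copied from the output above, and the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for tor_df and ll_df, deliberately in different row orders
neighborhoods = pd.DataFrame({
    "Postal code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Join on the key instead of relying on row position
merged = neighborhoods.merge(
    coordinates, left_on="Postal code", right_on="Postal Code", how="left"
).drop(columns="Postal Code")
print(merged)
```

With `how="left"`, a postal code missing from the coordinates table would show up as `NaN` rather than silently receiving another row's coordinates.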


<br>
<br> --------------------- <br>

### Phase 3 - Exploring and clustering the neighborhoods in Toronto

We decided to work only with boroughs that contain the word Toronto and then replicate the analysis we did on the New York City data. <br>

In [10]:
import numpy as np
from geopy.geocoders import Nominatim
import requests
from pandas.io.json import json_normalize

import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium # map rendering library

print("Libraries imported!")

Libraries imported!


<br> We will create a new DataFrame with only the boroughs containing the word Toronto. <br>

In [11]:
# select rows with Toronto word in the Borough column
tor_df1 = tor_df[tor_df.Borough.str.contains('Toronto',case=False)]
tor_df1.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [12]:
print('There are {} unique Boroughs.'.format(len(tor_df1['Borough'].unique())))
print('There are {} unique Postal Codes.'.format(len(tor_df1['Postal code'].unique())))

There are 4 unique Boroughs.
There are 39 unique Postal Codes.


In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    """Query the Foursquare explore endpoint for venues around each neighborhood."""

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL (CLIENT_ID, CLIENT_SECRET and VERSION are defined in the next cell)
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
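
To make the list comprehension in `getNearbyVenues` easier to follow, here is a sketch that flattens a hand-written payload containing only the keys the function reads (`response`, `groups`, `items`, `venue`). The payload is a mock built from one venue that appears in this notebook's output, not a real API response, and `flatten_items` is a hypothetical helper name.

```python
def flatten_items(name, lat, lng, payload):
    """Turn one explore-style payload into flat rows (hypothetical helper)."""
    items = payload["response"]["groups"][0]["items"]
    return [
        (
            name,
            lat,
            lng,
            item["venue"]["name"],
            item["venue"]["location"]["lat"],
            item["venue"]["location"]["lng"],
            item["venue"]["categories"][0]["name"],
        )
        for item in items
    ]

# Mock payload containing only the keys that the function reads
mock_payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {
                        "venue": {
                            "name": "Glen Manor Ravine",
                            "location": {"lat": 43.676821, "lng": -79.293942},
                            "categories": [{"name": "Trail"}],
                        }
                    }
                ]
            }
        ]
    }
}

rows = flatten_items("The Beaches", 43.676357, -79.293031, mock_payload)
print(rows[0])
```

Each tuple has seven fields, matching the seven column names assigned to `nearby_venues`.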

In [15]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20190605' # Foursquare API version
radius = 500
limit = 100

toronto_venues = getNearbyVenues(names=tor_df1['Neighborhood'],
                                   latitudes=tor_df1['Latitude'],
                                   longitudes=tor_df1['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst  Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High Park, The Junction South
Parkdale, Ro

<br> Let's check the size of the resulting dataframe

In [16]:
print(toronto_venues.shape)
toronto_venues.head()

(1681, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.676357,-79.293031,Dip 'n Sip,43.678897,-79.297745,Coffee Shop


Let's check how many venues were returned for each neighborhood

In [17]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,55,55,55,55,55,55
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
Business reply mail Processing CentrE,16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,77,77,77,77,77,77
Christie,19,19,19,19,19,19
Church and Wellesley,79,79,79,79,79,79
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,36,36,36,36,36,36
Davisville North,8,8,8,8,8,8


Let's find out how many unique categories can be curated from all the returned venues

In [18]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 235 unique categories.


Let's analyze each neighborhood

In [34]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
# there is already a column called Neighborhood in the toronto_onehot dataframe (an element in Venue Category), so we are creating a new one called Neighborhoods
toronto_onehot['Neighborhoods'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [36]:
toronto_grouped = toronto_onehot.groupby('Neighborhoods').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing CentrE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,...,0.0,0.0,0.0,0.0,0.012987,0.0,0.0,0.012987,0.0,0.012987
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.012658,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025316
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,...,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
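
To see why the mean of one-hot columns is a frequency, consider a toy example with invented venues: a neighborhood with four venues, two of which are coffee shops, gets 0.5 in the Coffee Shop column. The data and variable names below are made up for the illustration.

```python
import pandas as pd

# Invented venues, for illustration only
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Pub", "Park", "Park"],
})

# One indicator column per category
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhoods", venues["Neighborhood"])

# The mean of 0/1 indicators is the share of venues in each category
frequencies = onehot.groupby("Neighborhoods").mean()
print(frequencies)
```

Each row of `frequencies` sums to 1, so neighborhoods with very different venue counts remain comparable.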


Let's get the top 10 most common venues for each neighborhood

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhoods']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # no 'st'/'nd'/'rd' suffix left, so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhoods'] = toronto_grouped['Neighborhoods']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Farmers Market,Restaurant,Cheese Shop,Café,Beer Bar,Department Store
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Coffee Shop,Nightclub,Furniture / Home Store,Burrito Place,Restaurant,Italian Restaurant,Stadium,Intersection
2,Business reply mail Processing CentrE,Yoga Studio,Pizza Place,Skate Park,Smoke Shop,Farmers Market,Spa,Fast Food Restaurant,Burrito Place,Restaurant,Light Rail Station
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport Terminal,Boutique,Airport,Airport Food Court,Airport Gate,Bar,Harbor / Marina,Sculpture Garden
4,Central Bay Street,Coffee Shop,Italian Restaurant,Japanese Restaurant,Burger Joint,Sandwich Place,Thai Restaurant,Gym / Fitness Center,Spa,Ice Cream Shop,Dessert Shop
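
The column labels in the cell above are built from a three-item suffix list with a fallback to "th", which is enough for ten columns. If more columns were ever needed, a small helper (hypothetical, not part of the notebook) could produce English ordinals directly, including cases such as 11th and 21st.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 10
columns = ["Neighborhoods"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```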


And now we will cluster our neighborhoods. <br>
Run k-means to cluster the neighborhoods into 5 clusters.

In [45]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhoods', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([2, 2, 0, 0, 2, 2, 2, 2, 2, 2], dtype=int32)
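
As a rough illustration of what k-means does with the frequency vectors, the sketch below clusters four invented neighborhoods described by two category frequencies. The data and the choice of two clusters are assumptions made for the example. `n_init` is set explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented frequency vectors: columns are (Coffee Shop, Park)
features = np.array([
    [0.90, 0.10],  # cafe-heavy
    [0.80, 0.20],  # cafe-heavy
    [0.10, 0.90],  # park-heavy
    [0.20, 0.80],  # park-heavy
])

kmeans_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = kmeans_toy.labels_

# Label numbers are arbitrary; only the grouping is meaningful
print(labels)
```

The two cafe-heavy rows share one label and the two park-heavy rows share the other. Which group is called 0 and which is called 1 carries no meaning.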

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [62]:
# add clustering labels (run once: insert raises a ValueError if the column already exists)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = tor_df1

# join tor_df1 with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhoods'), on='Neighborhood')

print(toronto_merged.shape)
toronto_merged.head() # check the last columns!

(39, 16)


Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Pub,Neighborhood,Coffee Shop,Health Food Store,Trail,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,2,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Bookstore,Ice Cream Shop,Yoga Studio,Spa,Brewery,Bubble Tea Shop
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,2,Sandwich Place,Sushi Restaurant,Food & Drink Shop,Burrito Place,Liquor Store,Restaurant,Fast Food Restaurant,Italian Restaurant,Intersection,Fish & Chips Shop
43,M4M,East Toronto,Studio District,43.659526,-79.340923,2,Café,Coffee Shop,American Restaurant,Bakery,Brewery,Gastropub,Yoga Studio,Fish Market,Pet Store,Park
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Park,Lawyer,Bus Line,Swim School,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


Finally, let's visualize the resulting clusters

In [64]:
# getting the latitude and longitude of Toronto

address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [65]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster.

#### Cluster 1

In [68]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
44,Central Toronto,0,Park,Lawyer,Bus Line,Swim School,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
68,Downtown Toronto,0,Airport Lounge,Airport Service,Airport Terminal,Boutique,Airport,Airport Food Court,Airport Gate,Bar,Harbor / Marina,Sculpture Garden
76,West Toronto,0,Pharmacy,Bakery,Gym / Fitness Center,Café,Bar,Bank,Supermarket,Brewery,Middle Eastern Restaurant,Art Gallery
82,West Toronto,0,Thai Restaurant,Bar,Mexican Restaurant,Café,Arts & Crafts Store,Diner,Bakery,Speakeasy,Italian Restaurant,Cajun / Creole Restaurant
87,East Toronto,0,Yoga Studio,Pizza Place,Skate Park,Smoke Shop,Farmers Market,Spa,Fast Food Restaurant,Burrito Place,Restaurant,Light Rail Station


#### Cluster 2

In [67]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
50,Downtown Toronto,1,Park,Playground,Trail,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
64,Central Toronto,1,Park,Jewelry Store,Trail,Sushi Restaurant,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


#### Cluster 3

In [69]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,East Toronto,2,Pub,Neighborhood,Coffee Shop,Health Food Store,Trail,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant
41,East Toronto,2,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Bookstore,Ice Cream Shop,Yoga Studio,Spa,Brewery,Bubble Tea Shop
42,East Toronto,2,Sandwich Place,Sushi Restaurant,Food & Drink Shop,Burrito Place,Liquor Store,Restaurant,Fast Food Restaurant,Italian Restaurant,Intersection,Fish & Chips Shop
43,East Toronto,2,Café,Coffee Shop,American Restaurant,Bakery,Brewery,Gastropub,Yoga Studio,Fish Market,Pet Store,Park
45,Central Toronto,2,Park,Breakfast Spot,Hotel,Food & Drink Shop,Department Store,Convenience Store,Sandwich Place,Gym,American Restaurant,Distribution Center
46,Central Toronto,2,Coffee Shop,Clothing Store,Yoga Studio,Mexican Restaurant,Spa,Fast Food Restaurant,Sporting Goods Shop,Salon / Barbershop,Diner,Restaurant
47,Central Toronto,2,Dessert Shop,Sandwich Place,Pizza Place,Café,Italian Restaurant,Gym,Coffee Shop,Sushi Restaurant,Restaurant,Seafood Restaurant
49,Central Toronto,2,Pub,Coffee Shop,Restaurant,American Restaurant,Sushi Restaurant,Bank,Fried Chicken Joint,Sports Bar,Pizza Place,Supermarket
51,Downtown Toronto,2,Coffee Shop,Park,Chinese Restaurant,Café,Restaurant,Pizza Place,Bakery,Pub,Italian Restaurant,Butcher
52,Downtown Toronto,2,Japanese Restaurant,Coffee Shop,Gay Bar,Restaurant,Sushi Restaurant,Café,Pub,Men's Store,Mediterranean Restaurant,Hotel


#### Cluster 4

In [70]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
48,Central Toronto,3,Playground,Summer Camp,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


#### Cluster 5

In [71]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
63,Central Toronto,4,Home Service,Pool,Garden,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


<br> <br>
That's the end of the notebook. I hope you found it useful.

### Thank you