# SEGMENTING AND CLUSTERING NEIGHBORHOODS IN TORONTO

This notebook contains the works for the following assignments on this topic.

Please refer to the following items in this notebook:

1. Creation of Dataframe for Toronto Neighborhood from data extracted from www.wikipedia.com (cells 1-7)

2. Update Toronto Neighborhood's Dataframe to add the latitude and longitude coordinates  (cells 8-11)

3. Explore Toronto Neighborhood  (cells 12- end)

**First, Let's install beautifulsoup4**

In [1]:
!pip install beautifulsoup4 

Collecting beautifulsoup4
[?25l  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)
[K     |████████████████████████████████| 122kB 19.6MB/s eta 0:00:01
[?25hCollecting soupsieve>1.2; python_version >= "3.0" (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/36/69/d82d04022f02733bf9a72bc3b96332d360c0c5307096d76f6bb7489f7e57/soupsieve-2.2.1-py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.3 soupsieve-2.2.1


**Let's import the libraries**

In [2]:
import pandas as pd
import requests
import lxml.html as lh
from bs4 import BeautifulSoup

### <font color=red>1. Creation of Dataframe for Toronto Neighborhood using data extracted from www.wikipedia.com ### </font>

**Let's scrape table cells found in the url**

In [3]:
#Define url
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Make the GET request to fetch the raw HTML content
html_content = requests.get(url).text

#Parse the HTML content
soup = BeautifulSoup(html_content, "lxml")

#Extract the content of the table into "table".
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell={}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        #cell['Neighborhood'] =(((row.span.text).split(')')[0]).split('('))[1]
        cell['Neighborhood'] = ((((row.span.text).split('(')[-1]).strip(')')).replace(' /',',')).strip(' ')
        table_contents.append(cell)


**Clean the data**

In [4]:
#Verify the values of the Borough field.
Toronto_df = pd.DataFrame(table_contents)
pd.DataFrame(Toronto_df['Borough'].unique())

Unnamed: 0,0
0,North York
1,Downtown Toronto
2,Queen's Park
3,Etobicoke
4,Scarborough
5,East York
6,York
7,East Toronto
8,West Toronto
9,East YorkEast Toronto


In [5]:
#Correct values of some Boroughs as advised in this assignment's Tip for Webscraping

Toronto_df['Borough'] = Toronto_df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                       'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                       'EtobicokeNorthwest': 'Etobicoke Northwest',
                                       'MississaugaCanada Post Gateway Processing Centre':'Mississauga',
                                       'East YorkEast Toronto': 'East York/East Toronto'})
pd.DataFrame(Toronto_df['Borough'].unique())  #re-check data

Unnamed: 0,0
0,North York
1,Downtown Toronto
2,Queen's Park
3,Etobicoke
4,Scarborough
5,East York
6,York
7,East Toronto
8,West Toronto
9,East York/East Toronto


In [6]:
Toronto_df.head(20)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills)North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
Toronto_df.shape

#This ends assigment number 1.

(103, 3)

### <font color=red>  2. Update Toronto Neighborhood's Dataframe to add the latitude and longitude coordinates</font>

**First, Let's install geopy**

In [8]:
pip install geocoder

Collecting geocoder
[?25l  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
[K     |████████████████████████████████| 102kB 1.1MB/s ta 0:00:011
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Note: you may need to restart the kernel to use updated packages.


**After trying a couple of times using geocoder, the api is giving me an Error which required me to signup for Cloud. 
Because of this, I decided to just make use of the csv file**

In [9]:
#Load the Geospatial_Coordinates.csv file
file_name='Geospatial_Coordinates.csv'
geoCoor_df=pd.read_csv(file_name)
geoCoor_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
#Add new columns Latitude and Longitude to Toronto_df
Toronto_df['Latitude']= ''
Toronto_df['Longitude']= ''
Toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,,
1,M4A,North York,Victoria Village,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",,
3,M6A,North York,"Lawrence Manor, Lawrence Heights",,
4,M7A,Queen's Park,Ontario Provincial Government,,


In [11]:
#Populate Toronto_df with the Latitude and longitude coordinates based on the geoCoor_df
Toronto_df['Latitude'] = Toronto_df['PostalCode'].map(geoCoor_df.set_index('Postal Code')['Latitude'])
Toronto_df['Longitude'] = Toronto_df['PostalCode'].map(geoCoor_df.set_index('Postal Code')['Longitude'])
Toronto_df.head(20)

#This ends assignment number 2.

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills)North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


### <font color=red>  3. Explore Toronto Neighborhood</font>

In [12]:
#Let's install geopy
!conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2021.5.30  |       ha878542_0         136 KB  conda-forge
    certifi-2021.5.30          |   py36h5fab9bb_0         141 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-2.1.0                |     pyhd3deb0d_0          64 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         375 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-2.1.0-pyhd3deb0d_0

The following packages will be

In [13]:
#Let's import the necessary libraries
import numpy as np
import folium
import json
from geopy.geocoders import Nominatim
from pandas.io.json import json_normalize	#transform JSON file into a pandas dataframe

#Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#import k-means for clustering stage
from sklearn.cluster import KMeans


In [14]:
#Let's just work on Boroughs that contain the word "Toronto"
Toronto_Borough = Toronto_df.loc[(Toronto_df['Borough'].str.contains('Toronto'))].reset_index(drop=True)
print(Toronto_Borough.shape)
Toronto_Borough.head(39)

(39, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
8,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
9,M4J,East York/East Toronto,The Danforth East,43.685347,-79.338106


**Let's get the geographical coordinates for Downtown Toronto**

In [15]:
address = 'Downtown Toronto'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Downtown Toronto are 43.6541737, -79.38081162653639.


**Create a map of Toronto with neighborhoods superimposed on top.**

In [16]:
#Create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

#Add markers to map
for lat, lng, borough, neighborhood in zip(Toronto_Borough['Latitude'], Toronto_Borough['Longitude'],
                                           Toronto_Borough['Borough'], Toronto_Borough['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity= 0.7,
        parse_html=False).add_to(map_Toronto)

map_Toronto

**Let's start utilizing the Foursquare API to explore the neighborhoods and segment them**

In [17]:
#Define Foursquare Credentials
CLIENT_ID='FXIGGFYXJ0F2WYJAIK5BWRQPNVNOCHSB5DSLAVYZM3FYZSYD'        #Foursquare ID
CLIENT_SECRET='UBPUPBMGHTTXBNIOO5XTM2NRZ1B5UY251OP3GN01FIJTNPNM'    #Foursquare Secret
ACCESS_TOKEN='DYEESP4IBFCOC5B0LEFIMFTF1HTH5CDDIVGL1QCKDWODIJCF'     #Your Foursquare Access Token
VERSION='20180604'
LIMIT=30


**Lets explore Harbourfront, the first neighborhood from Downtown Toronto**

In [18]:
#Let's get Harbourfront's latitude and longitude coordinates
neighborhood_latitude = Toronto_Borough.loc[0, 'Latitude']          #neighborhood latitude value
neighborhood_longitude = Toronto_Borough.loc[0, 'Longitude']        #neighborhood longitude value
neighborhood_name = Toronto_Borough.loc[0, 'Neighborhood']           #neighborhood name

print('Latitude and Longitude values of {} are {}, {}.'.format(neighborhood_name, neighborhood_latitude, neighborhood_longitude))


Latitude and Longitude values of Regent Park, Harbourfront are 43.6542599, -79.3606359.


**Now, let's get the top 100 venues that are in Harbourfront within a radius of 500 meters.**

In [19]:
#First, let's create the GET request URL. Name your URL url.

# type your answer here
radius=500
limit=100

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    limit
)
url


'https://api.foursquare.com/v2/venues/explore?&client_id=FXIGGFYXJ0F2WYJAIK5BWRQPNVNOCHSB5DSLAVYZM3FYZSYD&client_secret=UBPUPBMGHTTXBNIOO5XTM2NRZ1B5UY251OP3GN01FIJTNPNM&v=20180604&ll=43.6542599,-79.3606359&radius=500&limit=100'

In [20]:
#Send the GET request and examine the resutls
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60cfa341edff614526b3fe59'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 42,
  'suggestedBounds': {'ne': {'lat': 43.6587599045, 'lng': -79.3544279001486},
   'sw': {'lat': 43.6497598955, 'lng': -79.36684389985142}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '53b8466a498e83df908c3f21',
       'name': 'Tandem Coffee',
       'location': {'address': '368 King St E',
        'crossStreet': 'at Trinity St',
        'lat': 43.65355870959944,
        'lng': -79.36180945913513,
        'labeledLatLngs': [{'label': 'display',
 

In [21]:
#Let's create a function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


In [22]:
#Now we are ready to clean the json and structure it into a pandas dataframe.
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues)    #flatten JSON

#filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

#filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

#clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


  after removing the cwd from sys.path.


Unnamed: 0,name,categories,lat,lng
0,Tandem Coffee,Coffee Shop,43.653559,-79.361809
1,Roselle Desserts,Bakery,43.653447,-79.362017
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Impact Kitchen,Restaurant,43.656369,-79.35698


In [23]:
#How many venues were returned by Foursquare?
print('There are {} venues returned by Foursquare.'.format(nearby_venues.shape[0]))

There are 42 venues returned by Foursquare.


**Let's explore the neighborhood of the places that are in the Toronto_Borough**

*First, let's create a function to repeat the same process to all the neighborhoods in Toronto_Borough*


In [24]:
#This function gets the nearby venues given the name of the place, the coordinates and the radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        #create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        #make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        #return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

        nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
        nearby_venues.columns = ['Neighborhood',
                                 'Neighborhood Latitude',
                                 'Neighborhood Longitude',
                                 'Venue',
                                 'Venue Latitude',
                                 'Venue Longitude',
                                 'Venue Category']
    
    return(nearby_venues)


**Now write the code to run the above function on each neighborhood and create a new dataframe called Toronto_Borough_venues.**

In [25]:
Toronto_Borough_venues = getNearbyVenues(names=Toronto_Borough['Neighborhood'],
                                   latitudes=Toronto_Borough['Latitude'],
                                   longitudes=Toronto_Borough['Longitude'])

Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Enclave of M5E
St. James Town, Cabbagetown
First Canadi

**Let's check the size of the resulting dataframe**


In [26]:
print(Toronto_Borough_venues.shape)
Toronto_Borough_venues.head()

(795, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


*We can see that there are 795 Neighborhoods in Boroughs containing the word 'Toronto*

**Let's check how many venues were returned for each neighborhood**

In [27]:
Toronto_Borough_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,30,30,30,30,30,30
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17,17,17,17,17,17
Central Bay Street,30,30,30,30,30,30
Christie,15,15,15,15,15,15
Church and Wellesley,30,30,30,30,30,30
"Commerce Court, Victoria Hotel",30,30,30,30,30,30
Davisville,26,26,26,26,26,26
Davisville North,10,10,10,10,10,10
"Dufferin, Dovercourt Village",13,13,13,13,13,13


**Let's find out how many unique categories can be curated from all the returned venues**

In [28]:
print('There are {} uniques categories.'.format(len(Toronto_Borough_venues['Venue Category'].unique())))

There are 179 uniques categories.


**Let's <font color=red>ANALYZE</font> each neighborhood**

In [29]:
#one hot encoding
Toronto_Borough_onehot = pd.get_dummies(Toronto_Borough_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_Borough_onehot['Neighborhood'] = Toronto_Borough_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Toronto_Borough_onehot.columns[-1]] + list(Toronto_Borough_onehot.columns[:-1])
Toronto_Borough_onehot = Toronto_Borough_onehot[fixed_columns]

Toronto_Borough_onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
#And let's examine the new dataframe size.
Toronto_Borough_onehot.shape

(795, 179)

**Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category**

In [31]:
Toronto_Borough_grouped = Toronto_Borough_onehot.groupby('Neighborhood').mean().reset_index()
Toronto_Borough_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Church and Wellesley,0.033333,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333
6,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.033333,0.0
7,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Dufferin, Dovercourt Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
#Let's confirm the new size
Toronto_Borough_grouped.shape

(39, 179)

**Let's print each neighborhood along with the top 5 most common venues**

In [33]:
num_top_venues = 5

for hood in Toronto_Borough_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = Toronto_Borough_grouped[Toronto_Borough_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----Berczy Park----
                venue  freq
0      Farmers Market  0.07
1        Cocktail Bar  0.07
2  Seafood Restaurant  0.07
3            Beer Bar  0.07
4              Bakery  0.07


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0     Coffee Shop  0.09
1          Bakery  0.09
2            Café  0.09
3  Sandwich Place  0.09
4  Breakfast Spot  0.09


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue  freq
0   Airport Service  0.18
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3          Boutique  0.06
4           Airport  0.06


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.17
1       Sandwich Place  0.13
2     Sushi Restaurant  0.10
3  Japanese Restaurant  0.07
4                 Café  0.07


----Christie----
           venue  freq
0  Grocery Store  0.27
1           Café  0.20
2           Park  0.13
3      Nightclub  0.07
4  

**Let's put that into a pandas dataframe**


In [34]:
#First, let's write a function to sort the venues in descending order.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


**Now let's create the new dataframe and display the top 10 venues for each neighborhood.**

In [35]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Toronto_Borough_venues_sorted = pd.DataFrame(columns=columns)
Toronto_Borough_venues_sorted['Neighborhood'] = Toronto_Borough_grouped['Neighborhood']

for ind in np.arange(Toronto_Borough_grouped.shape[0]):
    Toronto_Borough_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_Borough_grouped.iloc[ind, :], num_top_venues)

Toronto_Borough_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Beer Bar,Farmers Market,Seafood Restaurant,Bakery,Cocktail Bar,Bagel Shop,Park,Jazz Club,Japanese Restaurant,Bistro
1,"Brockton, Parkdale Village, Exhibition Place",Café,Sandwich Place,Breakfast Spot,Coffee Shop,Bakery,Nightclub,Japanese Restaurant,Bar,Italian Restaurant,Stadium
2,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Coffee Shop,Harbor / Marina,Boutique,Sculpture Garden,Rental Car Location,Plane,Boat or Ferry
3,Central Bay Street,Coffee Shop,Sandwich Place,Sushi Restaurant,Café,Japanese Restaurant,Bank,Modern European Restaurant,Park,Middle Eastern Restaurant,Comic Shop
4,Christie,Grocery Store,Café,Park,Coffee Shop,Nightclub,Restaurant,Italian Restaurant,Baby Store,Athletics & Sports,Dance Studio


**Let's <font color='red'>Cluster</font> Neighborhoods**


In [36]:
#Run k-means to cluster the neighborhood into 5 clusters.
# set number of clusters
kclusters = 5

TorontoBor_grouped_clustering = Toronto_Borough_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(TorontoBor_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

**Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.**

In [37]:
# add clustering labels
Toronto_Borough_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

TorontonBor_merged = Toronto_Borough

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood

TorontonBor_merged = TorontonBor_merged.join(Toronto_Borough_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

TorontonBor_merged.head() # check the last columns!


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Park,Bakery,Pub,Performing Arts Venue,Spa,Sandwich Place,Restaurant,Breakfast Spot,Café
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Café,Middle Eastern Restaurant,Coffee Shop,Theater,Clothing Store,Japanese Restaurant,Burger Joint,Falafel Restaurant,Plaza,Ramen Restaurant
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Café,Italian Restaurant,Farmers Market,Restaurant,Coffee Shop,Gastropub,Molecular Gastronomy Restaurant,Japanese Restaurant,Bakery,Salon / Barbershop
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Health Food Store,Pub,Wine Shop,Cuban Restaurant,Dog Run,Distribution Center,Discount Store,Diner,Dessert Shop,Department Store
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Beer Bar,Farmers Market,Seafood Restaurant,Bakery,Cocktail Bar,Bagel Shop,Park,Jazz Club,Japanese Restaurant,Bistro


**Finally, let's visualize the resulting clusters**

In [38]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(TorontonBor_merged['Latitude'], TorontonBor_merged['Longitude'], TorontonBor_merged['Neighborhood'], 
                                  TorontonBor_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


**Now, let's examine the clusters**

In [40]:
#Cluster 0 - Boroughs in Cluster 0 shows Parks as the most common venues
TorontonBor_merged.loc[TorontonBor_merged['Cluster Labels'] == 0, TorontonBor_merged.columns[[1] + list(range(5, TorontonBor_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,0,Park,Bus Line,Swim School,Cuban Restaurant,Dog Run,Distribution Center,Discount Store,Diner,Dessert Shop,Department Store
21,Central Toronto,0,Park,Jewelry Store,Trail,Bus Line,Sushi Restaurant,Cuban Restaurant,Distribution Center,Discount Store,Diner,Dessert Shop
33,Downtown Toronto,0,Park,Trail,Playground,Cosmetics Shop,Distribution Center,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega


In [41]:
#Cluster 1 - Boroughs in Cluster 1 shows comprises venues for Parks, restaurants, cafes, shops, spa etc, and most likely the busiest are.
TorontonBor_merged.loc[TorontonBor_merged['Cluster Labels'] == 1, TorontonBor_merged.columns[[1] + list(range(5, TorontonBor_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,1,Coffee Shop,Park,Bakery,Pub,Performing Arts Venue,Spa,Sandwich Place,Restaurant,Breakfast Spot,Café
1,Downtown Toronto,1,Café,Middle Eastern Restaurant,Coffee Shop,Theater,Clothing Store,Japanese Restaurant,Burger Joint,Falafel Restaurant,Plaza,Ramen Restaurant
2,Downtown Toronto,1,Café,Italian Restaurant,Farmers Market,Restaurant,Coffee Shop,Gastropub,Molecular Gastronomy Restaurant,Japanese Restaurant,Bakery,Salon / Barbershop
3,East Toronto,1,Health Food Store,Pub,Wine Shop,Cuban Restaurant,Dog Run,Distribution Center,Discount Store,Diner,Dessert Shop,Department Store
4,Downtown Toronto,1,Beer Bar,Farmers Market,Seafood Restaurant,Bakery,Cocktail Bar,Bagel Shop,Park,Jazz Club,Japanese Restaurant,Bistro
5,Downtown Toronto,1,Coffee Shop,Sandwich Place,Sushi Restaurant,Café,Japanese Restaurant,Bank,Modern European Restaurant,Park,Middle Eastern Restaurant,Comic Shop
6,Downtown Toronto,1,Grocery Store,Café,Park,Coffee Shop,Nightclub,Restaurant,Italian Restaurant,Baby Store,Athletics & Sports,Dance Studio
7,Downtown Toronto,1,Café,Coffee Shop,Sushi Restaurant,Opera House,Japanese Restaurant,Bakery,Seafood Restaurant,Restaurant,Record Shop,Plaza
8,West Toronto,1,Music Venue,Bar,Brewery,Bakery,Café,Middle Eastern Restaurant,Bank,Pharmacy,Pet Store,Gas Station
10,Downtown Toronto,1,Park,Hotel,Aquarium,Café,Brewery,Pizza Place,IT Services,Skating Rink,Basketball Stadium,History Museum


In [42]:
#Cluster 2 - Cluster 2 has only  Borough but has a combination of Gym, shops and restaurants. 
TorontonBor_merged.loc[TorontonBor_merged['Cluster Labels'] == 2, TorontonBor_merged.columns[[1] + list(range(5, TorontonBor_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,Central Toronto,2,Gym,Wine Shop,Cuban Restaurant,Dog Run,Distribution Center,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega


In [43]:
#Cluster 4 - Borough in Cluster 4 shows combination of most venues for Parks, restaurants, cafes etc.
TorontonBor_merged.loc[TorontonBor_merged['Cluster Labels'] == 4, TorontonBor_merged.columns[[1] + list(range(5, TorontonBor_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,East York/East Toronto,4,Park,Metro Station,Convenience Store,Cuban Restaurant,Distribution Center,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega


**This ends works for question 3**