# The Battle of Neighborhoods, New York vs Toronto

Ade Irman Budi H.

25 July 2021

## 1. Introduction

New York and Toronto are metropolitan cities that draw people from all around the world. Both are also popular destinations for tourists, who come to shop and enjoy themselves, because each city offers a wide variety of experiences. In this project we group the neighbourhoods of New York and Toronto into clusters, show them on maps, and look at how similar or different the two cities are.




## 2. Business Problem

The aim is to help stakeholders make decisions about different kinds of businesses, such as industry, hotels, and cuisine, and to show what each city has to offer, so that tourists from around the world can choose between New York and Toronto according to the experiences they want.

## 3. Data Description

### 3.1 Foursquare API Data for Toronto and New York

We need data about the venues in the neighbourhoods of each borough. To get that information we use Foursquare location data. Foursquare is a location data provider with information about all manner of venues and events within an area of interest, including venue names, locations, menus and even photos. The Foursquare location platform is therefore the only source of venue data in this project, since all of the required venue information can be obtained through its API.

After finding the list of neighbourhoods, we then connect to the Foursquare API to gather information about venues inside each and every neighbourhood. For each neighbourhood, we have chosen the radius to be 500 meters.

The data retrieved from Foursquare contains information about the venues within a specified distance of the latitude and longitude of each neighbourhood. The information obtained per venue is as follows:

1. Neighbourhood : Name of the Neighbourhood
2. Neighbourhood Latitude : Latitude of the Neighbourhood
3. Neighbourhood Longitude : Longitude of the Neighbourhood
4. Venue : Name of the Venue
5. Venue Latitude : Latitude of Venue
6. Venue Longitude : Longitude of Venue
7. Venue Category : Category of Venue
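
As a rough illustration of how these seven fields come out of one API response, the sketch below flattens a hand-written sample payload. The nesting (`response`, `groups`, `items`, `venue`) is assumed from what the `getNearbyVenues` function in section 6.1 reads; the helper name `venues_to_rows` and the second sample venue are hypothetical and exist only for this example.

```python
import pandas as pd

def venues_to_rows(name, lat, lng, payload):
    """Flatten one 'explore' response into rows holding the seven fields listed above.

    Hypothetical helper: the payload layout is assumed, not taken from API docs.
    """
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append({
            "Neighbourhood": name,
            "Neighbourhood Latitude": lat,
            "Neighbourhood Longitude": lng,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            # fall back to None when a venue carries no category
            "Venue Category": categories[0]["name"] if categories else None,
        })
    return rows

# Hand-written sample; the first venue reuses values shown later in this notebook,
# the second one is invented to exercise the missing-category fallback.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Wendy's",
               "location": {"lat": 43.807448, "lng": -79.199056},
               "categories": [{"name": "Fast Food Restaurant"}]}},
    {"venue": {"name": "Example Venue Without Category",
               "location": {"lat": 43.8060, "lng": -79.1990},
               "categories": []}},
]}]}}

sample_df = pd.DataFrame(
    venues_to_rows("Malvern, Rouge", 43.806686, -79.194353, sample_payload)
)
print(sample_df)
```

Guarding against an empty `categories` list keeps a single uncategorised venue from stopping the whole collection run.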
    
This API is used for both New York and Toronto, but because it alone does not supply enough information for Toronto, I also use web scraping for Toronto.

### 3.2 Additional Data for Toronto

For the additional data, I scraped the table at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

I also use the dataset from <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv" target="_blank">Geospatial_Coordinates.csv</a>

The data retrieved from Wikipedia and from the downloaded dataset contains the boroughs, neighbourhoods, and postal codes of Toronto, together with the coordinates of each postal code. The information obtained per postal code is as follows:

1. Borough : Name of Borough
2. Neighborhood : Name of Neighbourhood
3. Postal Code : Postal code for Toronto
4. Latitude : Latitude of the postal code area
5. Longitude : Longitude of the postal code area



## 4. Methodology

### 4.1 Importing Libraries

The first step is to import the libraries. These are the ones I commonly use:

In [1]:
import pandas as pd                   # for manipulating data and reading .csv and .json files
import numpy as np                    # for math operations
import matplotlib.pyplot as plt       # for plotting and setting the figure size
import matplotlib.cm as cm            # for color map utilities
import matplotlib.colors as colors    # for generating colors
from geopy.geocoders import Nominatim # for converting an address into latitude and longitude values
import json                           # for reading .json files
import folium                         # for generating the maps of New York and Toronto
import requests                       # for sending HTTP requests
from sklearn.cluster import KMeans    # for building the clusters
from bs4 import BeautifulSoup         # for scraping the Wikipedia website

### 4.2 Data Collection

#### 4.2.1 Toronto Data Collection
First I scraped the Toronto data from <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" target="_blank">Wikipedia Link</a>. Then I downloaded <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv" target="_blank">Geospatial_Coordinates.csv</a> to add the latitude and longitude of each postal code. After these two steps, I merged the data in Ms. Excel; the final file is data.csv.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5')



table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)
        
        
        
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})


df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
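
The chained `split`/`strip`/`replace` call in the scraping loop is hard to read, so here is a small sketch of the same idea as a standalone helper. The function name `split_cell` is hypothetical, and the sample strings are assumptions about what the cell text looks like (a borough followed by the neighbourhoods in parentheses, separated by " / "), inferred from the code and output above.

```python
def split_cell(text):
    """Split 'Borough(Neighbourhood A / Neighbourhood B)' into (borough, neighbourhoods).

    Hypothetical helper mirroring the string handling in the scraping loop.
    """
    # everything before the first '(' is the borough, the rest lists the neighbourhoods
    borough, _, rest = text.partition("(")
    neighbourhoods = rest.rstrip(")").replace(" /", ",").strip()
    return borough.strip(), neighbourhoods

# Assumed examples of the cell text
print(split_cell("North York(Parkwoods)"))
print(split_cell("Downtown Toronto(Regent Park / Harbourfront)"))
```

Using `str.partition` also avoids an `IndexError` when a cell has no parentheses at all; in that case the neighbourhood part simply comes back empty.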


In [3]:
tor_data = pd.read_csv('data.csv')
tor_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


#### 4.2.2 New York Data Collection

I downloaded the .json file from <a href="https://geo.nyu.edu/catalog/nyu-2451-34572" target="_blank">NewYork.json</a> and then converted it to a pandas DataFrame, because the data is simpler to manipulate in a DataFrame.

In [4]:
with open('nyu-2451-34572-geojson.json') as json_data:
    newyork_data = json.load(json_data)
    
    
neighborhoods_data = newyork_data['features']


column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighbourhood, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

ny_data = neighborhoods
ny_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


### 4.3 Data Preprocessing

The data had already been processed in Ms. Excel, which is the tool I usually use for this step.
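
For readers who prefer to stay inside the notebook, the merge done in Excel could also be sketched in pandas. The two small frames below are stand-ins for the scraped table and for Geospatial_Coordinates.csv; the column name `Postal Code` for the coordinates file is an assumption based on the header of data.csv shown above, and the values are copied from that output.

```python
import pandas as pd

# Stand-in for the scraped Wikipedia table (column names as in the scraping cell)
postal_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})

# Stand-in for Geospatial_Coordinates.csv (column names are an assumption)
coords_df = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Join the two tables on the postal code, then keep a single key column
merged = postal_df.merge(
    coords_df, left_on="PostalCode", right_on="Postal Code", how="inner"
)
merged = merged.drop(columns="PostalCode")
merged = merged[["Postal Code", "Borough", "Neighborhood", "Latitude", "Longitude"]]
print(merged)
```

An inner join drops postal codes that appear in only one of the two sources, which is usually what is wanted before mapping, since rows without coordinates cannot be plotted.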

### 4.4 Feature Selection and Engineering
The tor_data table contains a Postal Code column, which I do not need, so I drop it.

In [5]:
tor_data = tor_data.drop('Postal Code', axis=1)
tor_data.head(3)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711


## 5. Visualizing the Neighbourhood of New York and Toronto

### 5.1 Visualize the Neighbourhood of New York

In [6]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [7]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(ny_data['Latitude'], ny_data['Longitude'], ny_data['Borough'], ny_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### 5.2 Visualize the Neighbourhood of Toronto

In [8]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The coordinates of Toronto are 43.6534817, -79.3839347.


In [9]:
map_toronto = folium.Map(location=[latitude,longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(tor_data['Latitude'], tor_data['Longitude'], tor_data['Borough'], tor_data['Neighborhood']):
    label = '{}, {}'.format(borough, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## 6. Neighbourhood of Toronto and New York with Venues

### 6.1 Searching for Venues with the Foursquare API

In [10]:
CLIENT_ID = 'your-client-id'          # Foursquare client ID (real value redacted)
CLIENT_SECRET = 'your-client-secret'  # Foursquare client secret (real value redacted)
VERSION = '20180605' 
LIMIT = 100 

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: your-client-id
CLIENT_SECRET:your-client-secret


In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)

In [12]:
# Redefines getNearbyVenues: this version adds the LIMIT parameter and the venue coordinates
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

### 6.2 Toronto Venues

In [13]:
venues_tor = getNearbyVenues(tor_data['Neighborhood'], tor_data['Latitude'], tor_data['Longitude'])

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Don Mills South
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
The Danforth  East
The Danforth West, Riverdale


In [14]:
venues_tor.head(3)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Malvern, Rouge",43.806686,-79.194353,Interprovincial Group,43.80563,-79.200378,Print Shop
2,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Great Shine Window Cleaning,43.783145,-79.157431,Home Service


In [15]:
venues_cat_tor = pd.get_dummies(venues_tor['Venue Category'])
venues_cat_tor.head(3)

Unnamed: 0,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
venues_cat_tor['Neighbourhood'] = venues_tor['Neighborhood']
grouped_df_tor = venues_cat_tor.groupby('Neighbourhood').mean().reset_index()
grouped_df_tor.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
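
The two cells above one-hot encode the venue categories and then average them per neighbourhood, so each value is the share of a neighbourhood's venues that fall in that category. A toy example with made-up neighbourhoods `A` and `B` makes the numbers easier to follow; the data here is invented for illustration.

```python
import pandas as pd

# Invented toy data: four venues in A, two venues in B
toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Bank", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", toy["Neighbourhood"])

# The mean of the indicator columns is the relative frequency of each category
freq = onehot.groupby("Neighbourhood").mean().reset_index()
print(freq)

# Most frequent category per neighbourhood
top_category = freq.set_index("Neighbourhood").idxmax(axis=1)
print(top_category)
```

In the toy data half of the venues in `A` are coffee shops, so the `Coffee Shop` column for `A` is 0.5, which is the same kind of value as the 0.041667 visible for American Restaurant in the Toronto table.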


In [18]:
def most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted_tor = pd.DataFrame(columns=columns)
venues_sorted_tor['Neighbourhood'] = grouped_df_tor['Neighbourhood']

for ind in np.arange(grouped_df_tor.shape[0]):
    venues_sorted_tor.iloc[ind, 1:] = most_common_venues(grouped_df_tor.iloc[ind, :], num_top_venues)

venues_sorted_tor.head(3)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Clothing Store,Lounge,Breakfast Spot,Skating Rink,Latin American Restaurant,Mexican Restaurant,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop
1,"Alderwood, Long Branch",Pizza Place,Playground,Pub,Sandwich Place,Coffee Shop,Pool,Gym,Mediterranean Restaurant,Men's Store,Metro Station
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Pizza Place,Gas Station,Pharmacy,Deli / Bodega,Diner,Restaurant,Mobile Phone Shop,Middle Eastern Restaurant


### 6.3 New York Venues

In [19]:
venues_ny = getNearbyVenues(ny_data['Neighborhood'], ny_data['Latitude'], ny_data['Longitude'])

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [20]:
venues_ny.drop(['Venue Latitude', 'Venue Longitude'], axis=1, inplace=True)
venues_ny.head(3)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,Dessert Shop
1,Wakefield,40.894705,-73.847201,Rite Aid,Pharmacy
2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,Ice Cream Shop


In [21]:
venues_cat_ny = pd.get_dummies(venues_ny['Venue Category'])
venues_cat_ny.head(3)

Unnamed: 0,ATM,Accessories Store,Acupuncturist,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Antique Shop,Arcade,...,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yemeni Restaurant,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
venues_cat_ny['Neighbourhood'] = venues_ny['Neighborhood']
grouped_df_ny = venues_cat_ny.groupby('Neighbourhood').mean().reset_index()
grouped_df_ny.head()

Unnamed: 0,Neighbourhood,ATM,Accessories Store,Acupuncturist,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Antique Shop,...,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yemeni Restaurant,Yoga Studio
0,Allerton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Annadale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Arden Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arlington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Arrochar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
def most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted_ny = pd.DataFrame(columns=columns)
venues_sorted_ny['Neighbourhood'] = grouped_df_ny['Neighbourhood']

for ind in np.arange(grouped_df_ny.shape[0]):
    venues_sorted_ny.iloc[ind, 1:] = most_common_venues(grouped_df_ny.iloc[ind, :], num_top_venues)

venues_sorted_ny.head(3)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allerton,Pizza Place,Supermarket,Deli / Bodega,Discount Store,Spa,Chinese Restaurant,Check Cashing Service,Fast Food Restaurant,Pharmacy,Smoke Shop
1,Annadale,Juice Bar,Pub,Park,Train Station,Pizza Place,Restaurant,Pharmacy,Diner,Dance Studio,Professional & Other Places
2,Arden Heights,Pharmacy,Coffee Shop,Pizza Place,Playground,Pakistani Restaurant,Peruvian Roast Chicken Joint,Peruvian Restaurant,Persian Restaurant,Perfume Shop,Performing Arts Venue


## 7. Building Cluster Model with KMeans

### 7.1 Building Cluster Model for Toronto

In [24]:
k_model_tor = KMeans(n_clusters=5).fit(grouped_df_tor.drop('Neighbourhood', axis=1))
k_model_tor.labels_[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 4])

In [25]:
venues_sorted_tor.insert(0, 'Cluster Labels', k_model_tor.labels_)

tor_data = tor_data.join(venues_sorted_tor.set_index('Neighbourhood'), on='Neighborhood')
tor_data.dropna(inplace=True)

tor_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,"Malvern, Rouge",43.806686,-79.194353,0.0,Fast Food Restaurant,Print Shop,Mexican Restaurant,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Accessories Store,Monument / Landmark
1,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,1.0,Home Service,Bar,Mexican Restaurant,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Accessories Store,Moroccan Restaurant
2,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.0,Donut Shop,Medical Center,Breakfast Spot,Intersection,Restaurant,Rental Car Location,Mexican Restaurant,Electronics Store,Bank,Accessories Store
3,Scarborough,Woburn,43.770992,-79.216917,0.0,Coffee Shop,Korean BBQ Restaurant,Indian Restaurant,Middle Eastern Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Accessories Store
4,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,Hakka Restaurant,Athletics & Sports,Lounge,Fried Chicken Joint,Bank,Bakery,Caribbean Restaurant,Thai Restaurant,Gas Station,Noodle House


### 7.2 Building Cluster Model for New York

In [26]:
k_model_ny = KMeans(n_clusters=5).fit(grouped_df_ny.drop('Neighbourhood', axis=1))
k_model_ny.labels_[:10]

array([4, 4, 4, 0, 0, 0, 0, 0, 0, 0])

In [27]:
venues_sorted_ny.insert(0, 'Cluster Labels', k_model_ny.labels_)

ny_data = ny_data.join(venues_sorted_ny.set_index('Neighbourhood'), on='Neighborhood')
ny_data.dropna(inplace=True)

ny_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bronx,Wakefield,40.894705,-73.847201,0.0,Pharmacy,Dessert Shop,Caribbean Restaurant,Donut Shop,Sandwich Place,Ice Cream Shop,Laundromat,ATM,Park,Paper / Office Supplies Store
1,Bronx,Co-op City,40.874294,-73.829939,4.0,Bus Station,Fast Food Restaurant,Donut Shop,Park,Salon / Barbershop,Grocery Store,Market,Baseball Field,Liquor Store,Pizza Place
2,Bronx,Eastchester,40.887556,-73.827806,4.0,Bus Station,Caribbean Restaurant,Diner,Deli / Bodega,Food & Drink Shop,Bowling Alley,Pizza Place,Chinese Restaurant,Cosmetics Shop,Bakery
3,Bronx,Fieldston,40.895437,-73.905643,0.0,Bus Station,River,Music Venue,Plaza,ATM,Outlet Store,Peruvian Restaurant,Persian Restaurant,Perfume Shop,Performing Arts Venue
4,Bronx,Riverdale,40.890834,-73.912585,0.0,Bus Station,Park,Baseball Field,Food Truck,Playground,Plaza,Gym,Medical Supply Store,Bank,Home Service
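
Both models use `n_clusters=5` without a stated reason, and because no `random_state` is set the labels can change between runs. One common way to check the choice of k is the silhouette score. The sketch below is not part of the original analysis: it runs on synthetic data from `make_blobs` with three well-separated groups, purely to show the mechanics; the same loop could be pointed at `grouped_df_tor.drop('Neighbourhood', axis=1)`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three tight, well-separated groups
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit KMeans for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score gives the most compact, best-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Fixing `random_state` also makes the cluster labels reproducible, which matters when the labels are later mapped to colours.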


## 8. Visualizing Clustered Toronto and New York Neighbourhood

### 8.1 Visualizing Clustered Toronto Neighbourhood

In [28]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The coordinates of Toronto are 43.6534817, -79.3839347.


In [29]:
map_clusters_tor = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tor_data['Latitude'], tor_data['Longitude'], tor_data['Neighborhood'], tor_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_tor)

map_clusters_tor

### 8.2 Visualizing Clustered New York Neighbourhood

In [30]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [31]:
map_clusters_ny = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ny_data['Latitude'], ny_data['Longitude'], ny_data['Neighborhood'], ny_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_ny)

map_clusters_ny

## 9. Discussion

The tables and the maps show that the neighbourhoods of New York and Toronto are much alike. According to the tables, the most common venues in the neighbourhoods of both cities are restaurants, cafes, and shopping centers. Stakeholders can use this to judge which kinds of business they could open.
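
One possible way to put a number on this similarity, which is not something the analysis above does, is to compare the overall category mix of the two cities with cosine similarity. The helper names `city_profile` and `cosine_similarity` are hypothetical and the two small frames are invented; with real data they would be replaced by `venues_tor` and `venues_ny`.

```python
import numpy as np
import pandas as pd

def city_profile(venues):
    """Share of each venue category across all venues of one city."""
    return venues["Venue Category"].value_counts(normalize=True)

def cosine_similarity(p, q):
    """Cosine similarity of two category profiles, aligned on the union of categories."""
    p, q = p.align(q, fill_value=0.0)
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

# Invented toy data standing in for venues_tor and venues_ny
toy_tor = pd.DataFrame({"Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Pizza Place"]})
toy_ny = pd.DataFrame({"Venue Category": ["Pizza Place", "Pizza Place", "Coffee Shop", "Deli / Bodega"]})

similarity = cosine_similarity(city_profile(toy_tor), city_profile(toy_ny))
print(round(similarity, 3))
```

A value near 1 means the two cities have almost the same mix of venue categories, and a value near 0 means they share almost none. Because a separate KMeans model is fitted for each city, the cluster labels themselves are not comparable between the cities, so a direct comparison like this is a useful complement.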

## 10. Conclusion

The neighbourhoods of New York and Toronto are very similar. On both maps, two clusters (red and orange) dominate the other three.
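
The statement about dominant clusters can be backed with counts rather than read off the map. A minimal sketch, using an invented column of labels in place of `tor_data['Cluster Labels']` or `ny_data['Cluster Labels']`:

```python
import pandas as pd

# Invented labels; after the join in section 7 the real labels are floats such as 0.0
toy_labels = pd.DataFrame({"Cluster Labels": [0.0, 0.0, 0.0, 4.0, 4.0, 1.0]})

# Share of neighbourhoods in each cluster, ordered by cluster number
cluster_share = (
    toy_labels["Cluster Labels"]
    .astype(int)
    .value_counts(normalize=True)
    .sort_index()
)
print(cluster_share)
```

Running the same lines on each city's table would show what fraction of neighbourhoods each cluster holds.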