# Applied Data Science Capstone Project : Final Assignment

## 1. Introduction

### 1.1. Background

New York City (NYC) and Toronto are both located in North America and are major global financial hubs. Both are filled with skyscrapers and business centers, are very cosmopolitan, and have a dynamic lifestyle. Apart from their commercial districts, they also have many high-rise residential buildings. Many global organizations have offices in these 2 cities, and many people relocate from other countries to work in their central business district (CBD) areas. Newcomers may not be aware of the similarities and differences between the 2 cities. One example is ethnic makeup: NYC has a much larger Black and Latino population, whereas Toronto has proportionally more Asians and Indians. Hence NYC is likely to have more American or South American restaurants than Toronto.

The target audience for this project is expatriates who will move to either city and work in the CBD areas. Hence the scope focuses on Manhattan in New York and on the East, Downtown, Central and West Toronto areas.

### 1.2. Problem and Interests

Given the cultural diversity of both cities, this project compares the following neighbourhoods of the two cities and determines how similar or dissimilar they are. In total,

    •	Manhattan consists of 40 neighbourhoods

    •	East, Downtown, Central and West Toronto (Toronto City Area) consist of 39 neighbourhoods

It focuses on 3 areas:

    •	Differences in venue categories between the 2 cities.

    •	Differences in food culture, based on the types of restaurant.

    •	Similarity between clusters of neighbourhoods. Each city is split independently into clusters by neighbourhood,
         and the clusters are then compared to identify similarities based on venue category.

The project is meant to give expatriates who plan to live in neighbourhoods around the CBD areas the information they need to choose the neighbourhoods that best suit their lifestyle and needs.

## 2. Data

### 2.1. Source of Data and Data Acquisition

Two datasets, one for Manhattan and one for Toronto, created in previous labs and projects of the training course, are used as the data source. These datasets are already populated with the boroughs and neighbourhoods of NYC and Toronto, together with their latitudes and longitudes.

Before the analysis, the candidate neighbourhoods need to be filtered from the source datasets. The outcome is 2 datasets:

•	Neighbourhood Candidates Set A - represents the 40 neighbourhoods of Manhattan

•	Neighbourhood Candidates Set B - represents the 39 neighbourhoods of East, Downtown, Central and West Toronto.


### 2.2. Feature Selection

The venues and venue categories are the key features for the analysis. Hence, the Foursquare API is used to extract the venues and venue categories of all the neighbourhoods in the 2 cities. These data are combined with Set A and Set B to create new datasets that hold the neighbourhoods and their venue categories.

#### Install BeautifulSoup and Selenium

In [1]:
!pip install beautifulsoup4
!pip install selenium
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np



#### Read the New York City data - borough, neighbourhood, latitude and longitude

In [2]:
df_ny = pd.read_csv("c:/Users/frank/Downloads/newyork.csv", header = None)
headers = ["number", "Borough", "Neighbourhood", "Latitude", "Longitude"]
df_ny.columns = headers

#### Filter out the Manhattan Data

In [3]:
df_ma = df_ny[df_ny['Borough'] == "Manhattan"]
print("Total # of Neighbourhoods in Manhattan : ", df_ma.count()['Neighbourhood'])
df_ma.head(5)

Total # of Neighbourhoods in Manhattan :  40


Unnamed: 0,number,Borough,Neighbourhood,Latitude,Longitude
6,6,Manhattan,Marble Hill,40.876551,-73.91066
100,100,Manhattan,Chinatown,40.715618,-73.994279
101,101,Manhattan,Washington Heights,40.851903,-73.9369
102,102,Manhattan,Inwood,40.867684,-73.92121
103,103,Manhattan,Hamilton Heights,40.823604,-73.949688


#### Read the Toronto data - borough, neighbourhood, latitude and longitude

In [4]:
df_temp = pd.read_csv("c:/Users/frank/Downloads/toronto.csv", header = None)
headers = ["number", "Postal Code", "Borough", "Neighbourhood", "Latitude", "Longitude"]
df_temp.columns = headers
# keep only the four boroughs that make up the Toronto City Area
df_to = df_temp.loc[df_temp['Borough'].isin(["East Toronto", "West Toronto", "Downtown Toronto", "Central Toronto"])]
print("Total # of Neighbourhoods in Toronto : ", df_to.count()['Neighbourhood'])
df_to.head(5)

Total # of Neighbourhoods in Toronto :  39


Unnamed: 0,number,Postal Code,Borough,Neighbourhood,Latitude,Longitude
38,37.0,M4E,East Toronto,The Beaches,43.67635739999999,-79.2930312
42,41.0,M4K,East Toronto,"The Danforth West, Riverdale",43.6795571,-79.352188
43,42.0,M4L,East Toronto,"India Bazaar, The Beaches West",43.6689985,-79.31557159999998
44,43.0,M4M,East Toronto,Studio District,43.6595255,-79.340923
45,44.0,M4N,Central Toronto,Lawrence Park,43.7280205,-79.3887901


### 2.3. Review the geographical locations of Manhattan and Toronto City

#### Install Geopy and folium, import matplotlib and folium in preparation to show the NYC & Toronto maps

In [5]:
!pip install geopy  
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
!pip install folium
import folium # map rendering library



#### Prepare and show the Boroughs and Neighbourhoods of Manhattan on a map

In [6]:
address = 'Manhattan, US'
geolocator = Nominatim(user_agent="TT_explorer")
location = geolocator.geocode(address)
ma_latitude = location.latitude
ma_longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(ma_latitude, ma_longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


In [7]:
# create map of Manhattan using ma_latitude and ma_longitude values
map_ma = folium.Map(location=[ma_latitude, ma_longitude], zoom_start=11)
# add markers to map
for lat, lng, borough, neighbourhood in zip(df_ma['Latitude'], df_ma['Longitude'], df_ma['Borough'], df_ma['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ma)  
map_ma

#### Prepare and show the Boroughs and Neighbourhoods of East, West, Downtown and Central Toronto on a map

In [8]:
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="TT_explorer")
location = geolocator.geocode(address)
to_latitude = location.latitude
to_longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(to_latitude, to_longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [9]:
# create map of Toronto using to_latitude and to_longitude values
map_toronto = folium.Map(location=[to_latitude, to_longitude], zoom_start=10)
# add markers to map
for lat, lng, borough, neighbourhood in zip(df_to['Latitude'], df_to['Longitude'], df_to['Borough'], df_to['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)    
map_toronto

It is interesting to see the different shapes of the 2 areas: Manhattan is roughly rectangular, while the Toronto City Area is closer to a square.

## 3. Methodology

After the source data have been loaded into dataframes, cleansed and filtered, the Foursquare API is used to collect the venues, their latitudes and longitudes, and the venue categories for the neighbourhoods of Manhattan and the Toronto City Area.

To address the 1st area of interest, datasets of venue categories are created, and set operations are then used to identify:

    1. The venue categories common to both cities.
    
    2. The venue categories that exist in Manhattan but not in the Toronto City Area.
    
    3. The venue categories that exist in the Toronto City Area but not in Manhattan.
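
The three comparisons listed above map directly onto Python's built-in set operators. The following is a minimal sketch on toy data: the category names come from this notebook, but the two small sets (`toy_manhattan`, `toy_toronto`) are made up for illustration and are not the real results.

```python
# Toy category sets; the membership here is illustrative only.
toy_manhattan = {"Coffee Shop", "Bakery", "Arepa Restaurant", "Bridge"}
toy_toronto = {"Coffee Shop", "Bakery", "Poutine Place", "Lake"}

common = toy_manhattan & toy_toronto            # 1. in both cities
only_manhattan = toy_manhattan - toy_toronto    # 2. in Manhattan, not in Toronto
only_toronto = toy_toronto - toy_manhattan      # 3. in Toronto, not in Manhattan

# sorted() gives a stable order, because sets themselves are unordered
print(sorted(common))
print(sorted(only_manhattan))
print(sorted(only_toronto))

# Each city's set splits exactly into "common" plus "only in that city"
assert len(common) + len(only_manhattan) == len(toy_manhattan)
assert len(common) + len(only_toronto) == len(toy_toronto)
```

The `-` operator is equivalent to the `.difference()` method, and `&` to `.intersection()`.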
    
The 2nd area of interest is the difference in food culture, based on the types of restaurant. The keyword "Restaurant" is used to extract the matching venue categories from the datasets above for analysis and comparison.

Finally, the similarity of neighbourhoods based on venue category is assessed. To do that, each city is split independently into clusters of neighbourhoods using the k-means clustering algorithm; the clusters are then compared to bring out similarities based on venue category.

## 4. Analysis

#### Explore the venues of Toronto City Center

In [10]:
import json
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas 1.0 or later)

#### Prepare the Foursquare API parameters

In [11]:
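import os
# Read the credentials from environment variables instead of hard-coding them
# (the two variable names are placeholders chosen for this notebook)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret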
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

#### Define the function to get the nearby venues

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)
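
The function above does two things for each neighbourhood: it builds the request and it flattens the JSON response into rows. The following is a rough sketch of those two steps as separate helpers that can be checked without a network call. The helper names (`build_explore_params`, `parse_explore_items`), the hand-made `sample_payload` and the `"Uncategorized"` fallback label are all assumptions for illustration; they are not part of the notebook or of the Foursquare API.

```python
def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius=500, limit=100):
    """Return the query parameters for one 'explore' request as a dict."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }


def parse_explore_items(name, lat, lng, payload):
    """Flatten one decoded JSON response into rows of
    (neighbourhood, lat, lng, venue, venue lat, venue lng, category)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Hypothetical fallback label for venues that carry no category
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Hand-made payload shaped like the response that getNearbyVenues indexes into
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]}]}}

params = build_explore_params("ID", "SECRET", "20180605", 40.0, -73.0)
rows = parse_explore_items("Sample Hood", 40.0, -73.0, sample_payload)
print(params["ll"])
print(rows)
```

A dict like `params` can be passed to `requests.get(url, params=params)`, which leaves the URL encoding to the library. The empty-category check matters because indexing `categories[0]` on an empty list raises an `IndexError`.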

#### Get the nearby venues for Toronto City Area

In [13]:
df_to_venues = getNearbyVenues(names=df_to['Neighbourhood'], 
                               latitudes=df_to['Latitude'],
                               longitudes=df_to['Longitude'])
df_to_venues.head(5)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.67635739999999,-79.2930312,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.67635739999999,-79.2930312,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.67635739999999,-79.2930312,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.67635739999999,-79.2930312,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.67635739999999,-79.2930312,Dip 'n Sip,43.678897,-79.297745,Coffee Shop


In [14]:
df_to_u_venues = df_to_venues['Venue Category'].unique()
print('There are {} unique categories in Toronto City areas.'.format(len(df_to_u_venues)))
set_to = set(df_to_u_venues)
set_to

There are 236 unique categories in Toronto City areas.


{'Adult Boutique',
 'Airport',
 'Airport Food Court',
 'Airport Gate',
 'Airport Lounge',
 'Airport Service',
 'Airport Terminal',
 'American Restaurant',
 'Antique Shop',
 'Aquarium',
 'Art Gallery',
 'Art Museum',
 'Arts & Crafts Store',
 'Asian Restaurant',
 'Athletics & Sports',
 'Auto Workshop',
 'BBQ Joint',
 'Baby Store',
 'Bagel Shop',
 'Bakery',
 'Bank',
 'Bar',
 'Baseball Stadium',
 'Basketball Stadium',
 'Beach',
 'Bed & Breakfast',
 'Beer Bar',
 'Beer Store',
 'Belgian Restaurant',
 'Bike Rental / Bike Share',
 'Bistro',
 'Boat or Ferry',
 'Bookstore',
 'Boutique',
 'Brazilian Restaurant',
 'Breakfast Spot',
 'Brewery',
 'Bubble Tea Shop',
 'Building',
 'Burger Joint',
 'Burrito Place',
 'Bus Line',
 'Butcher',
 'Café',
 'Cajun / Creole Restaurant',
 'Candy Store',
 'Caribbean Restaurant',
 'Cheese Shop',
 'Chinese Restaurant',
 'Chocolate Shop',
 'Church',
 'Climbing Gym',
 'Clothing Store',
 'Cocktail Bar',
 'Coffee Shop',
 'College Arts Building',
 'College Auditorium',


#### Get the nearby venues for Manhattan

In [15]:
df_ma_venues = getNearbyVenues(names=df_ma['Neighbourhood'], 
                               latitudes=df_ma['Latitude'],
                               longitudes=df_ma['Longitude'])
df_ma_venues.head(5)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
4,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop


In [16]:
df_ma_u_venues = df_ma_venues['Venue Category'].unique()
print('There are {} unique categories in Manhattan.'.format(len(df_ma_u_venues)))
set_ma = set(df_ma_u_venues)
set_ma

There are 333 unique categories in Manhattan.


{'Accessories Store',
 'Adult Boutique',
 'Afghan Restaurant',
 'African Restaurant',
 'American Restaurant',
 'Antique Shop',
 'Arepa Restaurant',
 'Argentinian Restaurant',
 'Art Gallery',
 'Art Museum',
 'Arts & Crafts Store',
 'Asian Restaurant',
 'Athletics & Sports',
 'Auditorium',
 'Australian Restaurant',
 'Austrian Restaurant',
 'BBQ Joint',
 'Baby Store',
 'Badminton Court',
 'Bagel Shop',
 'Bakery',
 'Bank',
 'Bar',
 'Baseball Field',
 'Basketball Court',
 'Bed & Breakfast',
 'Beer Bar',
 'Beer Garden',
 'Beer Store',
 'Big Box Store',
 'Bike Rental / Bike Share',
 'Bike Shop',
 'Bike Trail',
 'Bistro',
 'Board Shop',
 'Boat or Ferry',
 'Bookstore',
 'Boutique',
 'Boxing Gym',
 'Brazilian Restaurant',
 'Breakfast Spot',
 'Bridal Shop',
 'Bridge',
 'Bubble Tea Shop',
 'Building',
 'Burger Joint',
 'Burrito Place',
 'Bus Line',
 'Bus Station',
 'Butcher',
 'Cafeteria',
 'Café',
 'Cajun / Creole Restaurant',
 'Camera Store',
 'Candy Store',
 'Cantonese Restaurant',
 'Caribbean 

#### Identify the venue categories common to both cities

In [17]:
common_venues = set_to & set_ma
common_venues

{'Adult Boutique',
 'American Restaurant',
 'Antique Shop',
 'Art Gallery',
 'Art Museum',
 'Arts & Crafts Store',
 'Asian Restaurant',
 'Athletics & Sports',
 'BBQ Joint',
 'Baby Store',
 'Bagel Shop',
 'Bakery',
 'Bank',
 'Bar',
 'Bed & Breakfast',
 'Beer Bar',
 'Beer Store',
 'Bike Rental / Bike Share',
 'Bistro',
 'Boat or Ferry',
 'Bookstore',
 'Boutique',
 'Brazilian Restaurant',
 'Breakfast Spot',
 'Bubble Tea Shop',
 'Building',
 'Burger Joint',
 'Burrito Place',
 'Bus Line',
 'Butcher',
 'Café',
 'Cajun / Creole Restaurant',
 'Candy Store',
 'Caribbean Restaurant',
 'Cheese Shop',
 'Chinese Restaurant',
 'Chocolate Shop',
 'Climbing Gym',
 'Clothing Store',
 'Cocktail Bar',
 'Coffee Shop',
 'College Arts Building',
 'College Cafeteria',
 'Concert Hall',
 'Convenience Store',
 'Cosmetics Shop',
 'Creperie',
 'Cuban Restaurant',
 'Cupcake Shop',
 'Dance Studio',
 'Deli / Bodega',
 'Department Store',
 'Dessert Shop',
 'Diner',
 'Discount Store',
 'Dog Run',
 'Donut Shop',
 'Dump

#### Identify venue categories that exist in Manhattan but not in Toronto

In [18]:
set_ma.difference(set_to)

{'Accessories Store',
 'Afghan Restaurant',
 'African Restaurant',
 'Arepa Restaurant',
 'Argentinian Restaurant',
 'Auditorium',
 'Australian Restaurant',
 'Austrian Restaurant',
 'Badminton Court',
 'Baseball Field',
 'Basketball Court',
 'Beer Garden',
 'Big Box Store',
 'Bike Shop',
 'Bike Trail',
 'Board Shop',
 'Boxing Gym',
 'Bridal Shop',
 'Bridge',
 'Bus Station',
 'Cafeteria',
 'Camera Store',
 'Cantonese Restaurant',
 'Cha Chaan Teng',
 'Christmas Market',
 'Circus',
 'Club House',
 'College Academic Building',
 'College Bookstore',
 'College Theater',
 'Comedy Club',
 'Community Center',
 'Cooking School',
 'Cycle Studio',
 'Czech Restaurant',
 'Daycare',
 'Dim Sum Restaurant',
 'Dive Bar',
 "Doctor's Office",
 'Drugstore',
 'Dry Cleaner',
 'Duty-free Shop',
 'Empanada Restaurant',
 'English Restaurant',
 'Exhibit',
 'Eye Doctor',
 'Food Stand',
 'Golf Course',
 'Gym Pool',
 'Gymnastics Gym',
 'Hardware Store',
 'Hawaiian Restaurant',
 'Heliport',
 'High School',
 'Hill',
 

#### Identify venue categories that exist in Toronto but not in Manhattan

In [19]:
set_to.difference(set_ma)

{'Airport',
 'Airport Food Court',
 'Airport Gate',
 'Airport Lounge',
 'Airport Service',
 'Airport Terminal',
 'Aquarium',
 'Auto Workshop',
 'Baseball Stadium',
 'Basketball Stadium',
 'Beach',
 'Belgian Restaurant',
 'Brewery',
 'Church',
 'College Auditorium',
 'College Gym',
 'College Rec Center',
 'Colombian Restaurant',
 'Comfort Food Restaurant',
 'Comic Shop',
 'Coworking Space',
 'Distribution Center',
 'Doner Restaurant',
 'Fish & Chips Shop',
 'Fruit & Vegetable Store',
 'General Travel',
 'Gluten-free Restaurant',
 'Home Service',
 'Hospital',
 'IT Services',
 'Lake',
 'Lawyer',
 'Light Rail Station',
 'Neighborhood',
 'Other Great Outdoors',
 'Plane',
 'Poutine Place',
 'Sculpture Garden',
 'Stadium',
 'Stationery Store',
 'Swim School',
 'Tanning Salon',
 'Theme Restaurant',
 'Tibetan Restaurant'}

In [20]:
print("No. of venue categories common to both Manhattan and Toronto :", len(common_venues))
print('Manhattan has {} venue categories not found in Toronto City.'.format(len(set_ma.difference(set_to))))
print('Toronto City has {} venue categories not found in Manhattan.'.format(len(set_to.difference(set_ma))))

No. of venue categories common to both Manhattan and Toronto : 192
Manhattan has 141 venue categories not found in Toronto City.
Toronto City has 44 venue categories not found in Manhattan.
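
As a quick sanity check, the three counts above should add up to the unique category counts reported earlier for each city (333 for Manhattan, 236 for the Toronto City areas). The sketch below only re-uses numbers printed in this notebook; the variable names and the "share" figures are additions for illustration.

```python
# Counts as printed in this notebook
n_common = 192
n_only_manhattan = 141
n_only_toronto = 44
n_manhattan = 333   # unique categories reported for Manhattan
n_toronto = 236     # unique categories reported for Toronto City areas

# common + "only in this city" must equal that city's total
assert n_common + n_only_manhattan == n_manhattan
assert n_common + n_only_toronto == n_toronto

# Share of each city's categories that also appears in the other city
share_manhattan = n_common / n_manhattan
share_toronto = n_common / n_toronto
print(round(share_manhattan, 3), round(share_toronto, 3))
```

Both identities hold, so the set operations and the printed totals are consistent with each other. The shares show that most Toronto categories also occur in Manhattan, while a larger part of Manhattan's categories is unique to it.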


#### Create dataframes to compare the food culture of the two cities based on the types of restaurant (the 2nd area of interest in this project)

In [21]:
df_common_venues = pd.DataFrame(common_venues)
df_common_venues.shape

(192, 1)

In [22]:
df_in_ma_not_in_to = pd.DataFrame(set_ma.difference(set_to))
print(df_in_ma_not_in_to.shape)
df_in_to_not_in_ma = pd.DataFrame(set_to.difference(set_ma))
print(df_in_to_not_in_ma.shape)

(141, 1)
(44, 1)


In [23]:
df_common_venues_restaurant = df_common_venues.loc[df_common_venues[0].str.contains("Restaurant")]
print("List of common Restaurants")
df_common_venues_restaurant

List of common Restaurants


Unnamed: 0,0
11,Cuban Restaurant
13,Ethiopian Restaurant
17,Seafood Restaurant
26,Mediterranean Restaurant
32,Modern European Restaurant
43,Sushi Restaurant
45,Ramen Restaurant
47,Taiwanese Restaurant
54,Italian Restaurant
62,Brazilian Restaurant
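
The keyword filter used here can be sketched in plain Python on a toy set. The helper `filter_by_keywords` and the wider keyword list are hypothetical and are not used anywhere in the notebook. They are shown only to illustrate that a filter on "Restaurant" alone skips food venues such as "Burger Joint" or "Pizza Place", which appear in the category lists above.

```python
toy_categories = {"Cuban Restaurant", "Sushi Restaurant", "Burger Joint",
                  "Pizza Place", "Coffee Shop", "Park"}


def filter_by_keywords(categories, keywords):
    """Return the sorted categories that contain any of the keywords."""
    return sorted(c for c in categories if any(k in c for k in keywords))


# Same rule as the notebook: keep names containing "Restaurant"
restaurants = filter_by_keywords(toy_categories, ["Restaurant"])

# Hypothetical wider keyword list, shown only for comparison
wider = filter_by_keywords(toy_categories, ["Restaurant", "Joint", "Pizza"])

print(restaurants)
print(wider)
```

Sorting the result also makes the listing reproducible, since a Python set has no fixed order.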


The restaurant categories common to both cities cover almost all regions: Asia, Latin America, Europe and America.

In [24]:
df_in_ma_not_in_to_restaurant = df_in_ma_not_in_to.loc[df_in_ma_not_in_to[0].str.contains("Restaurant")]
print("List of Restaurants in Manhattan but not in Toronto City area")
df_in_ma_not_in_to_restaurant

List of Restaurants in Manhattan but not in Toronto City area


Unnamed: 0,0
7,Paella Restaurant
14,Tapas Restaurant
19,Swiss Restaurant
28,Japanese Curry Restaurant
30,Southern / Soul Food Restaurant
33,Arepa Restaurant
36,Argentinian Restaurant
39,Cantonese Restaurant
40,South American Restaurant
41,Malay Restaurant


Manhattan surfaces many restaurants tied to particular regions or provinces of different countries.

In [25]:
df_in_to_not_in_ma_restaurant = df_in_to_not_in_ma.loc[df_in_to_not_in_ma[0].str.contains("Restaurant")]
print("List of Restaurants in Toronto City area but not in Manhattan")
df_in_to_not_in_ma_restaurant

List of Restaurants in Toronto City area but not in Manhattan


Unnamed: 0,0
0,Theme Restaurant
2,Tibetan Restaurant
3,Belgian Restaurant
19,Doner Restaurant
24,Colombian Restaurant
32,Gluten-free Restaurant
43,Comfort Food Restaurant


The Toronto City Area has some interesting theme restaurants that do not exist in Manhattan.

#### Compare the clusters of neighbourhoods: split each city independently into clusters by neighbourhood, then identify similarities based on venue category

#### Analyse the neighbourhood of Manhattan

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

#### Use one-hot encoding to split the venue categories into separate columns

In [27]:
# one hot encoding
df_ma_onehot = pd.get_dummies(df_ma_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
df_ma_onehot['Neighbourhood'] = df_ma_venues['Neighbourhood'] 
# move neighborhood column to the first column
fixed_columns = [df_ma_onehot.columns[-1]] + list(df_ma_onehot.columns[:-1])
df_ma_onehot = df_ma_onehot[fixed_columns]
df_ma_onehot.head(5)

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
df_ma_grouped = df_ma_onehot.groupby('Neighbourhood').mean().reset_index()
df_ma_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.025974,0.0,0.0,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.011364,0.0,0.0,0.011364,0.0,...,0.0,0.011364,0.0,0.0,0.0,0.011364,0.034091,0.0,0.011364,0.034091
2,Central Harlem,0.0,0.0,0.0,0.065217,0.043478,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.01
4,Chinatown,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
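
Taking the mean of the one-hot columns per neighbourhood gives, for each category, its share of that neighbourhood's venues. The sketch below reproduces the same arithmetic with the standard library on a made-up venue list, which is an assumption for illustration and not real Foursquare data. It then picks the most frequent categories.

```python
from collections import Counter

# Toy venue list for one neighbourhood (illustrative, not real data)
venue_categories = ["Coffee Shop", "Coffee Shop", "Park", "Hotel",
                    "Park", "Coffee Shop", "Gym", "Bakery"]

counts = Counter(venue_categories)
total = len(venue_categories)

# Mean of a 0/1 indicator column = number of ones / number of rows
freq = {category: n / total for category, n in counts.items()}

# Top categories by frequency; ties are broken alphabetically for a stable order
top3 = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top3)
```

Because every venue contributes to exactly one category, the frequencies of a neighbourhood sum to 1, which makes neighbourhoods with different venue counts comparable.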


In [29]:
ma_num_top_venues = 3

for hood in df_ma_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = df_ma_grouped[df_ma_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(ma_num_top_venues))
    print('\n')

----Battery Park City----
         venue  freq
0  Coffee Shop  0.08
1         Park  0.08
2        Hotel  0.06


----Carnegie Hill----
         venue  freq
0  Coffee Shop  0.09
1         Café  0.05
2  Yoga Studio  0.03


----Central Harlem----
                venue  freq
0  African Restaurant  0.07
1      Cosmetics Shop  0.07
2  Chinese Restaurant  0.04


----Chelsea----
         venue  freq
0  Coffee Shop  0.06
1  Art Gallery  0.06
2       Bakery  0.05


----Chinatown----
                venue  freq
0  Chinese Restaurant  0.08
1              Bakery  0.07
2        Cocktail Bar  0.04


----Civic Center----
                  venue  freq
0           Coffee Shop  0.08
1                   Spa  0.05
2  Gym / Fitness Center  0.05


----Clinton----
                  venue  freq
0               Theater  0.06
1    Italian Restaurant  0.05
2  Gym / Fitness Center  0.05


----East Harlem----
                venue  freq
0  Mexican Restaurant  0.13
1     Thai Restaurant  0.08
2              Bakery  0

In [30]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
df_ma_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
df_ma_neighborhoods_venues_sorted['Neighbourhood'] = df_ma_grouped['Neighbourhood']
for ind in np.arange(df_ma_grouped.shape[0]):
    df_ma_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_ma_grouped.iloc[ind, :], num_top_venues)
df_ma_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Coffee Shop,Park,Hotel,Clothing Store,Gym,Memorial Site,Shopping Mall,Wine Shop,Burger Joint,Gourmet Shop
1,Carnegie Hill,Coffee Shop,Café,Yoga Studio,Cosmetics Shop,French Restaurant,Gym,Gym / Fitness Center,Bar,Bookstore,Pizza Place
2,Central Harlem,African Restaurant,Cosmetics Shop,French Restaurant,American Restaurant,Bar,Chinese Restaurant,Art Gallery,Seafood Restaurant,Spa,Event Space
3,Chelsea,Art Gallery,Coffee Shop,Bakery,French Restaurant,American Restaurant,Ice Cream Shop,Seafood Restaurant,Cocktail Bar,Park,Market
4,Chinatown,Chinese Restaurant,Bakery,Cocktail Bar,Bubble Tea Shop,Ice Cream Shop,Hotpot Restaurant,Salon / Barbershop,Optical Shop,Dessert Shop,American Restaurant
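
The column labels above are built by looking up a suffix in a three-item list and falling back to "th". That works for the 10 columns used here, but would produce labels such as "21th" for longer lists. A possible alternative is the small helper below; `ordinal` is a hypothetical name and is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"      # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Same column layout as the notebook: one name column plus the top 10 venues
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```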


#### Use K-means to create clusters for the neighbourhoods of Manhattan.

In [31]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 3
df_ma_grouped_clustering = df_ma_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans_ma = KMeans(n_clusters=kclusters, random_state=0).fit(df_ma_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans_ma.labels_[0:10]
df_ma_grouped_clustering

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.025974,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.011364,0.0,0.0,0.011364,0.0,0.022727,...,0.0,0.011364,0.0,0.0,0.0,0.011364,0.034091,0.0,0.011364,0.034091
2,0.0,0.0,0.0,0.065217,0.043478,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.06,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.01
4,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
5,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.01,0.0,0.02
6,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.01,0.0,...,0.0,0.03,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01
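
To make the clustering step more concrete, the sketch below hand-rolls the assign-and-update loop (Lloyd's iteration) that k-means is built on, using two-dimensional toy frequency rows. It is not scikit-learn's `KMeans`, which among other things chooses its starting centroids differently. The function names, the toy points and the fixed starting centroids are all assumptions for illustration.

```python
def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per point."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Mean of the points in each cluster; an empty cluster keeps its centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Toy rows: (coffee shop share, restaurant share) for four neighbourhoods
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
print(labels)
print(centroids)
```

Neighbourhoods with a similar venue mix end up with the same label, which is the property the cluster comparison in this notebook relies on.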


#### Add the clustering labels and merge df_ma_neighborhoods_venues_sorted with df_ma to add latitude/longitude for each neighbourhood

In [32]:
# add clustering labels
df_ma_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans_ma.labels_)
# merge the sorted-venues table with the Manhattan data to add latitude/longitude for each neighbourhood
df_ma_merged = df_ma.join(df_ma_neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
# drop the columns that are not needed for the cluster analysis
df_ma_merged = df_ma_merged.drop(['number', 'Borough'], axis=1)
df_ma_merged # check the last columns!

Unnamed: 0,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Marble Hill,40.876551,-73.91066,1,Gym,Discount Store,Sandwich Place,Coffee Shop,Yoga Studio,Pizza Place,Steakhouse,Shopping Mall,Seafood Restaurant,Department Store
100,Chinatown,40.715618,-73.994279,0,Chinese Restaurant,Bakery,Cocktail Bar,Bubble Tea Shop,Ice Cream Shop,Hotpot Restaurant,Salon / Barbershop,Optical Shop,Dessert Shop,American Restaurant
101,Washington Heights,40.851903,-73.9369,0,Café,Bakery,Grocery Store,Deli / Bodega,Chinese Restaurant,Mobile Phone Shop,New American Restaurant,Latin American Restaurant,Park,Pizza Place
102,Inwood,40.867684,-73.92121,0,Mexican Restaurant,Lounge,Restaurant,Café,Bakery,Spanish Restaurant,Frozen Yogurt Shop,Caribbean Restaurant,Chinese Restaurant,Deli / Bodega
103,Hamilton Heights,40.823604,-73.949688,0,Pizza Place,Coffee Shop,Café,Mexican Restaurant,Deli / Bodega,Cocktail Bar,Latin American Restaurant,Sushi Restaurant,Park,Yoga Studio
104,Manhattanville,40.816934,-73.957385,0,Seafood Restaurant,Coffee Shop,Italian Restaurant,Mexican Restaurant,Chinese Restaurant,Deli / Bodega,Sushi Restaurant,Climbing Gym,Supermarket,Boutique
105,Central Harlem,40.815976,-73.943211,0,African Restaurant,Cosmetics Shop,French Restaurant,American Restaurant,Bar,Chinese Restaurant,Art Gallery,Seafood Restaurant,Spa,Event Space
106,East Harlem,40.792249,-73.944182,0,Mexican Restaurant,Bakery,Thai Restaurant,Deli / Bodega,Spa,Latin American Restaurant,Sandwich Place,Taco Place,Gym,Grocery Store
107,Upper East Side,40.775639,-73.960508,1,Italian Restaurant,Coffee Shop,Exhibit,Bakery,Gym / Fitness Center,American Restaurant,Spa,French Restaurant,Hotel,Juice Bar
108,Yorkville,40.77593,-73.947118,0,Italian Restaurant,Gym,Coffee Shop,Deli / Bodega,Sushi Restaurant,Bar,Wine Shop,Diner,Japanese Restaurant,Pharmacy


In [33]:
# create map
df_ma_map_clusters = folium.Map(location=[ma_latitude, ma_longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
#print(rainbow)
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_ma_merged['Latitude'], df_ma_merged['Longitude'], df_ma_merged['Neighbourhood'], df_ma_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    #print(type(cluster), "cluster is ", cluster)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster-1], fill=True, fill_color=rainbow[cluster-1], fill_opacity=0.7).add_to(df_ma_map_clusters)
       
df_ma_map_clusters

#### Analyse the Manhattan clusters

In [34]:
df_ma_merged.loc[df_ma_merged['Cluster Labels'] == 0, df_ma_merged.columns[[0] + list(range(4, df_ma_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
100,Chinatown,Chinese Restaurant,Bakery,Cocktail Bar,Bubble Tea Shop,Ice Cream Shop,Hotpot Restaurant,Salon / Barbershop,Optical Shop,Dessert Shop,American Restaurant
101,Washington Heights,Café,Bakery,Grocery Store,Deli / Bodega,Chinese Restaurant,Mobile Phone Shop,New American Restaurant,Latin American Restaurant,Park,Pizza Place
102,Inwood,Mexican Restaurant,Lounge,Restaurant,Café,Bakery,Spanish Restaurant,Frozen Yogurt Shop,Caribbean Restaurant,Chinese Restaurant,Deli / Bodega
103,Hamilton Heights,Pizza Place,Coffee Shop,Café,Mexican Restaurant,Deli / Bodega,Cocktail Bar,Latin American Restaurant,Sushi Restaurant,Park,Yoga Studio
104,Manhattanville,Seafood Restaurant,Coffee Shop,Italian Restaurant,Mexican Restaurant,Chinese Restaurant,Deli / Bodega,Sushi Restaurant,Climbing Gym,Supermarket,Boutique
105,Central Harlem,African Restaurant,Cosmetics Shop,French Restaurant,American Restaurant,Bar,Chinese Restaurant,Art Gallery,Seafood Restaurant,Spa,Event Space
106,East Harlem,Mexican Restaurant,Bakery,Thai Restaurant,Deli / Bodega,Spa,Latin American Restaurant,Sandwich Place,Taco Place,Gym,Grocery Store
108,Yorkville,Italian Restaurant,Gym,Coffee Shop,Deli / Bodega,Sushi Restaurant,Bar,Wine Shop,Diner,Japanese Restaurant,Pharmacy
109,Lenox Hill,Italian Restaurant,Sushi Restaurant,Coffee Shop,Cocktail Bar,Pizza Place,Café,Gym / Fitness Center,Gym,Burger Joint,Salad Place
111,Upper West Side,Wine Bar,Bakery,Bar,Italian Restaurant,Indian Restaurant,Café,Coffee Shop,Pizza Place,Ice Cream Shop,Mediterranean Restaurant


The neighbourhoods in cluster 0 of Manhattan are dominated by a wide variety of restaurants.

In [35]:
df_ma_merged.loc[df_ma_merged['Cluster Labels'] == 1, df_ma_merged.columns[[0] + list(range(4, df_ma_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Marble Hill,Gym,Discount Store,Sandwich Place,Coffee Shop,Yoga Studio,Pizza Place,Steakhouse,Shopping Mall,Seafood Restaurant,Department Store
107,Upper East Side,Italian Restaurant,Coffee Shop,Exhibit,Bakery,Gym / Fitness Center,American Restaurant,Spa,French Restaurant,Hotel,Juice Bar
110,Roosevelt Island,Park,Gym,Dry Cleaner,Bubble Tea Shop,Soccer Field,Farmers Market,Supermarket,Metro Station,School,Outdoors & Recreation
113,Clinton,Theater,Italian Restaurant,Gym / Fitness Center,Coffee Shop,American Restaurant,Gym,Spa,Wine Shop,Hotel,Sandwich Place
114,Midtown,Hotel,Clothing Store,Coffee Shop,Sporting Goods Shop,Theater,Bookstore,Café,Steakhouse,Gym,Bakery
115,Murray Hill,Coffee Shop,Sandwich Place,Bar,Japanese Restaurant,American Restaurant,Gym / Fitness Center,Burger Joint,Hotel,Mediterranean Restaurant,Taco Place
125,Morningside Heights,Coffee Shop,Park,American Restaurant,Bookstore,Burger Joint,Café,Ice Cream Shop,New American Restaurant,Supermarket,Mediterranean Restaurant
127,Battery Park City,Coffee Shop,Park,Hotel,Clothing Store,Gym,Memorial Site,Shopping Mall,Wine Shop,Burger Joint,Gourmet Shop
128,Financial District,Coffee Shop,Pizza Place,Bar,Hotel,Gym,Cocktail Bar,Park,Mexican Restaurant,Gym / Fitness Center,Sandwich Place
247,Carnegie Hill,Coffee Shop,Café,Yoga Studio,Cosmetics Shop,French Restaurant,Gym,Gym / Fitness Center,Bar,Bookstore,Pizza Place


The neighbourhoods in cluster 1 of Manhattan have a good mix of venue categories, such as cafés / coffee shops, gyms, spas, restaurants, parks and hotels.

In [36]:
df_ma_merged.loc[df_ma_merged['Cluster Labels'] == 2, df_ma_merged.columns[[0] + list(range(4, df_ma_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
275,Stuyvesant Town,Park,Bar,Boat or Ferry,Coffee Shop,Heliport,Food Truck,Gas Station,Bistro,Skating Rink,Farmers Market


Cluster 2 of Manhattan contains only one neighbourhood, Stuyvesant Town.

#### Analyse the neighbourhood of Toronto City area

In [37]:
# one hot encoding
df_to_onehot = pd.get_dummies(df_to_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
df_to_onehot['Neighbourhood'] = df_to_venues['Neighbourhood'] 
# move neighborhood column to the first column
fixed_columns = [df_to_onehot.columns[-1]] + list(df_to_onehot.columns[:-1])
df_to_onehot = df_to_onehot[fixed_columns]
df_to_onehot.head(5)

Unnamed: 0,Neighbourhood,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
df_to_grouped = df_to_onehot.groupby('Neighbourhood').mean().reset_index()
df_to_grouped.head()

Unnamed: 0,Neighbourhood,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.055556,0.055556,0.055556,0.111111,0.166667,0.111111,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.016667,0.0,0.016667


In [39]:
num_top_venues = 3

for hood in df_to_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = df_to_grouped[df_to_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
          venue  freq
0   Coffee Shop  0.09
1  Cocktail Bar  0.05
2   Cheese Shop  0.04


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0            Café  0.13
1       Nightclub  0.09
2  Breakfast Spot  0.09


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
            venue  freq
0         Brewery  0.07
1  Farmers Market  0.07
2   Garden Center  0.07


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue  freq
0   Airport Service  0.17
1    Airport Lounge  0.11
2  Airport Terminal  0.11


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.18
1  Italian Restaurant  0.05
2                Café  0.05


----Christie----
           venue  freq
0  Grocery Store  0.24
1           Café  0.18
2           Park  0.12


----Church and Wellesley----
                 venue  freq
0          Coffee

In [40]:
#def return_most_common_venues(row, num_top_venues):
#    row_categories = row.iloc[1:]
#    row_categories_sorted = row_categories.sort_values(ascending=False)
#    return row_categories_sorted.index.values[0:num_top_venues]

In [41]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
df_to_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
df_to_neighborhoods_venues_sorted['Neighbourhood'] = df_to_grouped['Neighbourhood']
for ind in np.arange(df_to_grouped.shape[0]):
    df_to_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_to_grouped.iloc[ind, :], num_top_venues)
df_to_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Restaurant,Farmers Market,Beer Bar,Seafood Restaurant,Greek Restaurant,Basketball Stadium
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Nightclub,Coffee Shop,Pet Store,Stadium,Bar,Intersection,Bakery,Restaurant
2,"Business reply mail Processing Centre, South C...",Gym / Fitness Center,Farmers Market,Skate Park,Auto Workshop,Burrito Place,Garden,Fast Food Restaurant,Garden Center,Light Rail Station,Park
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Sculpture Garden,Harbor / Marina,Plane,Boat or Ferry,Rental Car Location,Boutique,Bar
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Burger Joint,Salad Place,Bubble Tea Shop,Poke Place,Portuguese Restaurant,Pizza Place


#### Use K-means to create clusters for the neighbourhoods of Toronto City area.

In [42]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 3
df_to_grouped_clustering = df_to_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans_to = KMeans(n_clusters=kclusters, random_state=0).fit(df_to_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans_to.labels_[0:10]
df_to_grouped_clustering

Unnamed: 0,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.055556,0.055556,0.055556,0.111111,0.166667,0.111111,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.016667,0.0,0.016667
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
# add clustering labels
df_to_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans_to.labels_)
df_to_merged = df_to
# join the cluster labels and top venues onto the Toronto neighbourhood data (which holds latitude/longitude)
df_to_merged = df_to_merged.join(df_to_neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
df_to_merged.head()  # check the last columns
df_to_merged = df_to_merged.drop(['number', 'Postal Code', 'Borough'], axis=1)
df_to_merged

Unnamed: 0,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
38,The Beaches,43.67635739999999,-79.2930312,2,Coffee Shop,Health Food Store,Neighborhood,Trail,Pub,Yoga Studio,Dog Run,Diner,Discount Store,Distribution Center
42,"The Danforth West, Riverdale",43.6795571,-79.352188,2,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Liquor Store,Indian Restaurant,Spa,Bookstore,Brewery
43,"India Bazaar, The Beaches West",43.6689985,-79.31557159999998,2,Sandwich Place,Park,Fast Food Restaurant,Coffee Shop,Food & Drink Shop,Light Rail Station,Restaurant,Italian Restaurant,Fish & Chips Shop,Steakhouse
44,Studio District,43.6595255,-79.340923,2,Coffee Shop,American Restaurant,Bakery,Brewery,Café,Gastropub,Yoga Studio,Fish Market,Pet Store,Park
45,Lawrence Park,43.7280205,-79.3887901,0,Park,Swim School,Bus Line,Yoga Studio,Diner,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
46,Davisville North,43.7127511,-79.3901975,2,Gym / Fitness Center,Hotel,Breakfast Spot,Food & Drink Shop,Sandwich Place,Department Store,Park,Convenience Store,Distribution Center,Ethiopian Restaurant
47,"North Toronto West, Lawrence Park",43.7153834,-79.40567840000001,2,Clothing Store,Coffee Shop,Yoga Studio,Fast Food Restaurant,Italian Restaurant,Café,Mexican Restaurant,Salon / Barbershop,Metro Station,Chinese Restaurant
48,Davisville,43.7043244,-79.3887901,2,Sandwich Place,Dessert Shop,Pizza Place,Coffee Shop,Sushi Restaurant,Café,Italian Restaurant,Gym,Gas Station,Toy / Game Store
49,"Moore Park, Summerhill East",43.6895743,-79.38315990000001,2,Lawyer,Restaurant,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
50,"Summerhill West, Rathnelly, South Hill, Forest...",43.68641229999999,-79.4000493,2,Coffee Shop,American Restaurant,Liquor Store,Restaurant,Bank,Bagel Shop,Supermarket,Sushi Restaurant,Fried Chicken Joint,Pub


In [44]:
# create map
df_to_map_clusters = folium.Map(location=[to_latitude, to_longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
#print(rainbow)
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_to_merged['Latitude'], df_to_merged['Longitude'], df_to_merged['Neighbourhood'], df_to_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    #print(type(cluster), "cluster is ", cluster)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster-1], fill=True, fill_color=rainbow[cluster-1], fill_opacity=0.7).add_to(df_to_map_clusters)
       
df_to_map_clusters

#### Analyse the clusters of Toronto

In [45]:
df_to_merged.loc[df_to_merged['Cluster Labels'] == 0, df_to_merged.columns[[0] + list(range(4, df_to_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
45,Lawrence Park,Park,Swim School,Bus Line,Yoga Studio,Diner,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
51,Rosedale,Park,Playground,Trail,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
65,"Forest Hill North & West, Forest Hill Road Park",Park,Sushi Restaurant,Trail,Jewelry Store,Yoga Studio,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant


Cluster 0 of Toronto City is dominated by parks, together with trails, a playground, restaurants and donut shops. It appears to be residential.

In [46]:
df_to_merged.loc[df_to_merged['Cluster Labels'] == 1, df_to_merged.columns[[0] + list(range(4, df_to_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
64,Roselawn,Home Service,Garden,Yoga Studio,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


Cluster 1 of Toronto City contains a single neighbourhood, Roselawn, with a home service, a garden, an electronics store, restaurants and a donut shop. It appears to be residential.

In [47]:
df_to_merged.loc[df_to_merged['Cluster Labels'] == 2, df_to_merged.columns[[0] + list(range(4, df_to_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
38,The Beaches,Coffee Shop,Health Food Store,Neighborhood,Trail,Pub,Yoga Studio,Dog Run,Diner,Discount Store,Distribution Center
42,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Liquor Store,Indian Restaurant,Spa,Bookstore,Brewery
43,"India Bazaar, The Beaches West",Sandwich Place,Park,Fast Food Restaurant,Coffee Shop,Food & Drink Shop,Light Rail Station,Restaurant,Italian Restaurant,Fish & Chips Shop,Steakhouse
44,Studio District,Coffee Shop,American Restaurant,Bakery,Brewery,Café,Gastropub,Yoga Studio,Fish Market,Pet Store,Park
46,Davisville North,Gym / Fitness Center,Hotel,Breakfast Spot,Food & Drink Shop,Sandwich Place,Department Store,Park,Convenience Store,Distribution Center,Ethiopian Restaurant
47,"North Toronto West, Lawrence Park",Clothing Store,Coffee Shop,Yoga Studio,Fast Food Restaurant,Italian Restaurant,Café,Mexican Restaurant,Salon / Barbershop,Metro Station,Chinese Restaurant
48,Davisville,Sandwich Place,Dessert Shop,Pizza Place,Coffee Shop,Sushi Restaurant,Café,Italian Restaurant,Gym,Gas Station,Toy / Game Store
49,"Moore Park, Summerhill East",Lawyer,Restaurant,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
50,"Summerhill West, Rathnelly, South Hill, Forest...",Coffee Shop,American Restaurant,Liquor Store,Restaurant,Bank,Bagel Shop,Supermarket,Sushi Restaurant,Fried Chicken Joint,Pub
52,"St. James Town, Cabbagetown",Coffee Shop,Café,Pizza Place,Restaurant,Italian Restaurant,Bakery,Pub,Beer Store,Chinese Restaurant,Diner


The neighbourhoods in cluster 2 of Toronto have a good mix of venue categories, such as cafés / coffee shops, gyms, spas, restaurants, parks and hotels.

## 5. Results and Discussion


It is interesting to see the different shapes of the two areas: Manhattan is roughly rectangular, while the Toronto City Area is closer to a square.

Manhattan and Toronto City have 192 venue categories in common. Manhattan has 141 venue categories that do not appear in Toronto City, and Toronto City has 44 venue categories that do not appear in Manhattan.
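Counts of shared and city-specific categories like these come down to set intersection and set difference. The sketch below is only an illustration of that idea: the function name `compare_categories` and the toy category lists are assumptions, not the notebook's actual code or data.

```python
def compare_categories(city_a, city_b):
    """Split two collections of venue categories into shared and city-specific sets."""
    a, b = set(city_a), set(city_b)  # duplicates collapse, only distinct categories count
    return {
        "shared": a & b,   # categories present in both cities
        "only_a": a - b,   # categories found only in the first city
        "only_b": b - a,   # categories found only in the second city
    }

# Toy lists for illustration only, not the real Foursquare results
manhattan_categories = ["Coffee Shop", "Park", "Heliport", "Hotpot Restaurant", "Park"]
toronto_categories = ["Coffee Shop", "Park", "Swim School"]

result = compare_categories(manhattan_categories, toronto_categories)
print(sorted(result["shared"]))
print(sorted(result["only_a"]))
print(sorted(result["only_b"]))
```

In the notebook, the inputs would presumably be the `Venue Category` columns of the two venue dataframes (for example `df_to_venues['Venue Category']` for Toronto); the name of the Manhattan counterpart is not shown in this section.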

The food culture of both cities covers almost all regions: Asia, Latin America, Europe, Australia, Africa, and North and South America. Manhattan has many restaurants that are tied to specific provinces of different countries. Toronto City area has some interesting themed restaurants that do not exist in Manhattan.

The neighbourhoods in clusters 0 and 1 of Manhattan are similar to those in cluster 2 of Toronto City: these clusters all have a good mix of venues and facilities such as cafés / coffee shops, gyms, spas, restaurants, parks and hotels.
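The similarity described here is judged by reading the most common venues of each cluster. One way it could be quantified, which this study did not do, is to compare the average venue-frequency profile of each cluster with cosine similarity. The sketch below assumes each cluster is summarised as a dictionary of category frequencies; the function name and all numbers are made up for illustration.

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity of two {category: mean frequency} dictionaries.

    A category missing from one profile is treated as frequency 0.
    """
    categories = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(c, 0.0) * profile_b.get(c, 0.0) for c in categories)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is not similar to anything
    return dot / (norm_a * norm_b)

# Hypothetical cluster profiles, not values computed in the notebook
ma_cluster_1 = {"Coffee Shop": 0.10, "Gym": 0.05, "Hotel": 0.04, "Park": 0.03}
to_cluster_2 = {"Coffee Shop": 0.12, "Café": 0.06, "Park": 0.03, "Gym": 0.02}
to_cluster_0 = {"Park": 0.30, "Trail": 0.10, "Playground": 0.10}

sim_mixed = cosine_similarity(ma_cluster_1, to_cluster_2)
sim_parks = cosine_similarity(ma_cluster_1, to_cluster_0)
print(round(sim_mixed, 2), round(sim_parks, 2))
```

A value near 1 means the two clusters have the same kinds of venues in similar proportions; a value near 0 means they share almost nothing.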

Cluster 0 of Toronto City has more parks, trails, playgrounds, restaurants and shops, which seems more suitable for family living.

Cluster 2 of Manhattan and cluster 1 of Toronto City each contain only one neighbourhood. Their venues are quite different, so the comparison between them is inconclusive.
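Clusters that hold a single neighbourhood can be spotted by counting the members of each cluster label. The following is a minimal sketch with a toy label list; the helper name `singleton_clusters` is an assumption and does not appear in the notebook.

```python
from collections import Counter

def singleton_clusters(labels):
    """Return the sorted cluster labels that contain exactly one member."""
    sizes = Counter(labels)  # maps each cluster label to its number of members
    return sorted(label for label, n in sizes.items() if n == 1)

# Toy label list for illustration, not the actual k-means output
toy_labels = [2, 2, 2, 0, 2, 0, 1, 0, 2]
print(Counter(toy_labels))
print(singleton_clusters(toy_labels))
```

With the notebook's fitted model, the labels could be passed as `kmeans_to.labels_.tolist()`.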


## 6. Conclusion

Although this study surfaced some differences, the majority of venue categories are similar. Expatriates who have lived in either city should have little problem adapting to the other.