# Battle of the Neighborhoods

### -- A Glance of the Major Cities in Canada from a Tourist's Point of View

## 1. Business Proglem

### 1.1 Inspiration
As a first time visitor in Canada, 3 of the major cities Toronto, Vandouver and Ottawa are on my list. 
While planning my trip, I want to learn about the cities on the high level so I know what to expect, and what to focus on in each city.
#####    1.1.1 Downtowns: How similar/dissimilar are the downtowns?
#####    1.1.2 Neighborhoods: What are the different neighborhoods in each city?
#####    1.1.3 Food Selection: What are the most popular cuisines in each city?
#####    1.1.4 To-do: What are the most popular recreational activities in each city?
#####    1.1.5 Other: What are some of the unique characteristics of each city?

### 1.2 Generalization
This analysis can be packaged into a product that provides customizable trip planning service to travelers.
##### 1.2.1 Custom Combination of Cities
Unlike most of the current travel sites that provide plans for only fixed combination of cities, with this tool customers can create their own combination of cities in the plan. The tool will then run the analysis in real time and provide costomized reports.
##### 1.2.2 Compare and Contrast
Rather than just giving out plans, compare and contrast the cities from various perspective to give travelers more background information. Travelers can then refine their plans to include/exclude, or increase/reduce time in certain cities based on what they see and what interest them
##### 1.2.3 Refining Plans by Category
In each categories (such as food, culture, activities), customers can have an overview of what’s most popular and what’s most unique to each city, so that they can prioritize based on their own interests.

## 2. Data

To achieve this, a few different datasets will be needed. I will discuss each of them in the upcoming section.

### 2.1 Zip Code and Neiborhood

##### 2.1.1 Formatted Data
For Toronto, there’s a well-formatted data source that contains zip code – neighborhood information:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Once scraped from the website, there's not a lot of transformation needed to convert it into a clean dataframe. However, some processing and cleansing is still needed to address the following issue: 1) Rows with missing information, and 2) Records that are outside of the area we are interested in

In [116]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests

# conda install -c anaconda beautifulsoup4 
from bs4 import BeautifulSoup

In [117]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
# Save initial DataFrame to df_toronto
df_toronto = df[0]
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [118]:
# Data Cleansing

# drop records where Borough = 'Not assigned'
df_toronto = df_toronto[df_toronto.Borough != 'Not assigned']
# Now check records where Borough = 'Not assigned'. There shoudn't be any records showing anymore
df_toronto[df_toronto.Borough == 'Not assigned']
# group data by by PostCode, and combine records by concatenating "Neighbourhood" values as comma deliminated strings
df_toronto = df_toronto.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda x: ','.join(x)).reset_index()
# If Neighbourhood = "Not assigned", assign Borough value to Neighbourhood
df_toronto[df_toronto.Neighbourhood == 'Not assigned'] = df_toronto[df_toronto.Neighbourhood == 'Not assigned'].assign(Neighbourhood = df_toronto[df_toronto.Neighbourhood == 'Not assigned'].Borough)
# Rename column "Postcode" to "PostalCode"
df_toronto.rename(columns = {'Postcode':'PostalCode','Neighbourhood':'Neighborhood'}, inplace = True)

In [119]:
# keep only records that has the string "Toronto" in their Boroughs
toronto_data = df_toronto[df_toronto['Borough'].str.contains('Toronto')]
np.shape(toronto_data)

(38, 3)

In [120]:
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
37,M4E,East Toronto,The Beaches
41,M4K,East Toronto,"The Danforth West,Riverdale"
42,M4L,East Toronto,"The Beaches West,India Bazaar"
43,M4M,East Toronto,Studio District
44,M4N,Central Toronto,Lawrence Park


##### 2.1.2 Unformatted Data
For Vancouver and Ottawa, more data processing is needed

Vancouver: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V

Ottawa: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_K

Data is not organized in the format we need, some processing is required
Some cleansing is needed for missing data and irrelevant data (for the screenshot shown, need to limit to Vancouver data only)

In [121]:
# Firstly, process Vancouver data
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))

In [122]:
df = pd.DataFrame(df[0])
df = pd.DataFrame(df.values.flatten())
df.head()

Unnamed: 0,0
0,V1AKimberley
1,V2APenticton
2,V3ALangley Township(Langley City)
3,V4ASurreySouthwest
4,V5ABurnaby(Government Road / Lake City / SFU /...


In [123]:
# parse out Postal Code (first 3 characters in column 0)
df['PostalCode'] = df[0].apply(lambda x: x[:3])
df['new_col'] = df[0].apply(lambda x: x[3:])
# keep only records that contain the string "Vancouver"
vancouver_data = df[df['new_col'].str.contains('Vancouver')]
# parse out the remaining two columns
# find location of the string "Vancouver", Bourough will be the string before that plus the string "Vancouver"
# Neighborhood will be the string after "Vancouver"
vancouver_data['Bourough'] = vancouver_data.new_col.apply(lambda x: x[:x.find('Vancouver') + 9])
vancouver_data['Neighborhood'] = vancouver_data.new_col.apply(lambda x: x[x.find('Vancouver') + 9:])
# Clean up the dataframe by dropping columns created for intermidiate steps
vancouver_data.drop(['new_col'], axis=1, inplace=True)
vancouver_data.drop(vancouver_data.columns[0], axis=1, inplace=True)
# check the shape of the final dataframe
np.shape(vancouver_data)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


(44, 3)

In [124]:
vancouver_data.head()

Unnamed: 0,PostalCode,Bourough,Neighborhood
5,V6A,Vancouver,(Strathcona / Chinatown / Downtown Eastside)
14,V6B,Vancouver,(NE Downtown / Gastown / Harbour Centre / Inte...
23,V6C,Vancouver,(Waterfront / Coal Harbour / Canada Place)
32,V6E,Vancouver,(SE West End / Davie Village)
41,V6G,Vancouver,(NW West End / Stanley Park)


In [125]:
# Now repeat the previous steps to process Ottawa data
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_K")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
df = pd.DataFrame(df[0])
df = pd.DataFrame(df.values.flatten())
df.head()

Unnamed: 0,0
0,K1AGovernment of CanadaOttawa and Gatineau off...
1,K2AOttawa(Highland Park / McKellar Park /Westb...
2,K4AOttawa(Fallingbrook)
3,K6AHawkesbury
4,K7ASmiths Falls


In [126]:
# parse out Postal Code (first 3 characters in column 0)
df['PostalCode'] = df[0].apply(lambda x: x[:3])
df['new_col'] = df[0].apply(lambda x: x[3:])
# keep only records that contain the string "Ottawa"
ottawa_data = df[df['new_col'].str.contains('Ottawa')]
# parse out the remaining two columns
# find location of the string "Ottawa", Bourough will be the string before that plus the string "Ottawa"
# Neighborhood will be the string after "Ottawa"
ottawa_data['Bourough'] = ottawa_data.new_col.apply(lambda x: x[:x.find('Ottawa') + 6])
ottawa_data['Neighborhood'] = ottawa_data.new_col.apply(lambda x: x[x.find('Ottawa') + 6:])
# Clean up the dataframe by dropping columns created for intermidiate steps
ottawa_data.drop(['new_col'], axis=1, inplace=True)
ottawa_data.drop(ottawa_data.columns[0], axis=1, inplace=True)
# In addition, we don't need the first record (Government of Canada, index = 0), so drop it
ottawa_data.drop(ottawa_data.index[0], inplace=True)
# check the shape of the final dataframe
np.shape(ottawa_data)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.


(40, 3)

In [127]:
ottawa_data.head()

Unnamed: 0,PostalCode,Bourough,Neighborhood
1,K2A,Ottawa,(Highland Park / McKellar Park /Westboro /Glab...
2,K4A,Ottawa,(Fallingbrook)
7,K1B,Ottawa,(Blackburn Hamlet / Pine View / Sheffield Glen)
8,K2B,Ottawa,(Britannia /Whitehaven / Bayshore / Pinecrest)
9,K4B,Ottawa,(Navan)


### 2.2 Coordinates
For Vancouver and Ottawa, coordinates data is downloaded in csv format from Google fusion table:
https://fusiontables.google.com/DataSource?docid=1H_cl-oyeG4FDwqJUTeI_aGKmmkJdPDzRNccp96M&hl=en_US&pli=1
for FSA starting with V and K

The datasets are saved to GitHub repository for this project.

For Toronto, coordinates data is found here:
http://cocl.us/Geospatial_data

In [128]:
# read csv from GitHub repository: FSA starting with V
coordinate = pd.read_csv('https://raw.githubusercontent.com/audrey-wj/Coursera_Capstone/master/Battle%20of%20Neighborhoods/Postal%20Codes_V-filtered.csv')
# original file is at individual Postal Code level. Grouping it by FSA (first 3 digit of the Postal Code)
fsa_v = coordinate.groupby('FSA')[['Latitude','Longitude']].mean().reset_index()
# renaming column "FSA" to "PostalCode" so the dataframe can be merged with Neiborhood dataframe
fsa_v.rename(columns = {'FSA':'PostalCode'}, inplace = True)
fsa_v.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,V1A,49.676884,-115.976582
1,V1B,50.245959,-119.23302
2,V1C,49.508152,-115.762707
3,V1E,50.695111,-119.272105
4,V1G,55.758085,-120.237495


In [129]:
# repeat the process for FSA starting with K
coordinate = pd.read_csv('https://raw.githubusercontent.com/audrey-wj/Coursera_Capstone/master/Battle%20of%20Neighborhoods/Postal%20Codes_K-filtered.csv')
fsa_k = coordinate.groupby('FSA')[['Latitude','Longitude']].mean().reset_index()
fsa_k.rename(columns = {'FSA':'PostalCode'}, inplace = True)
fsa_k.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,K1A,45.412145,-75.695946
1,K1B,45.422631,-75.593948
2,K1C,45.464042,-75.530664
3,K1E,45.473663,-75.505666
4,K1G,45.395864,-75.635204


In [130]:
# get coordicates data for Toronto
toronto_coordinate = pd.read_csv('http://cocl.us/Geospatial_data')
toronto_coordinate.rename(columns = {'Postal Code':'PostalCode'}, inplace = True)
toronto_coordinate.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [131]:
# Now merge Neighborhood data with Coordinates data
vancouver_data = pd.merge(vancouver_data, fsa_v, on = 'PostalCode', how = 'left')
vancouver_data.head()

Unnamed: 0,PostalCode,Bourough,Neighborhood,Latitude,Longitude
0,V6A,Vancouver,(Strathcona / Chinatown / Downtown Eastside),49.278507,-123.09159
1,V6B,Vancouver,(NE Downtown / Gastown / Harbour Centre / Inte...,49.278408,-123.11181
2,V6C,Vancouver,(Waterfront / Coal Harbour / Canada Place),49.28472,-123.117017
3,V6E,Vancouver,(SE West End / Davie Village),49.284366,-123.118055
4,V6G,Vancouver,(NW West End / Stanley Park),49.290228,-123.135351


In [132]:
ottawa_data = pd.merge(ottawa_data, fsa_k, on = 'PostalCode', how = 'left')
ottawa_data.head()

Unnamed: 0,PostalCode,Bourough,Neighborhood,Latitude,Longitude
0,K2A,Ottawa,(Highland Park / McKellar Park /Westboro /Glab...,45.377666,-75.764976
1,K4A,Ottawa,(Fallingbrook),45.472155,-75.479227
2,K1B,Ottawa,(Blackburn Hamlet / Pine View / Sheffield Glen),45.422631,-75.593948
3,K2B,Ottawa,(Britannia /Whitehaven / Bayshore / Pinecrest),45.361812,-75.789919
4,K4B,Ottawa,(Navan),45.411875,-75.402756


In [133]:
toronto_data = pd.merge(toronto_data, toronto_coordinate, on = 'PostalCode', how = 'left')
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


### 2.3 Venue Information
Use Foursqaure API to explore evenues near each Zip Code. The format of API is https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}

Once data is obtained, a cluster analysis will be performed on neighborhoods, to find different groups of neighborhoods in each city.

For analysis on “food selection”, “to-do”, and “others”, realized the “category name” of venues are too detailed/granular for the purpose of this analysis. For example, restaurants/food is separated into cuisines and subtypes of cuisines; similar for other categories. Some manual grouping (most likely based on string matches) will be needed before further analyses.

In [150]:
# Foursquare credentials and version

CLIENT_ID = '5EKKRYVDBZQFWXKVJJFRFVXQV0H0R4NPRIOXS5ITKCJRQUYU' # your Foursquare ID
CLIENT_SECRET = 'J11R0A1BPNPEOHIYOZNEBXAOKBIAWHXW5WVZTOPKOF5RSLZY' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 200

In [151]:
# define fuction getNearbyVenues to get venues information from Foursquare API
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [152]:
radius = 5000
toronto_venues = getNearbyVenues(names = toronto_data['Neighborhood'],
                                 latitudes = toronto_data['Latitude'], 
                                 longitudes = toronto_data['Longitude'], 
                                 radius = radius)

In [153]:
print(toronto_venues.shape)
toronto_venues.head()

(3800, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,The Fox Theatre,43.672801,-79.287272,Indie Movie Theater
1,The Beaches,43.676357,-79.293031,Ed's Real Scoop,43.67263,-79.287993,Ice Cream Shop
2,The Beaches,43.676357,-79.293031,Kew Gardens,43.669038,-79.298538,Park
3,The Beaches,43.676357,-79.293031,Tori's Bakeshop,43.672114,-79.290331,Vegetarian / Vegan Restaurant
4,The Beaches,43.676357,-79.293031,Woodbine Beach,43.663112,-79.306374,Beach


In [29]:
vancouver_venues = getNearbyVenues(names = vancouver_data['Neighborhood'],
                                 latitudes = vancouver_data['Latitude'], 
                                 longitudes = vancouver_data['Longitude'], 
                                 radius = radius)

(Strathcona / Chinatown / Downtown Eastside)
(NE Downtown / Gastown / Harbour Centre / International Village / Victory Square / Yaletown)
(Waterfront / Coal Harbour / Canada Place)
(SE West End / Davie Village)
(NW West End / Stanley Park)
 (district municipality)Outer East
(West Fairview / Granville Island / NE Shaughnessy)
 (district municipality)Inner East
(NW Shaughnessy / East Kitsilano / Quilchena)
 (district municipality)East Central
(North Hastings-Sunrise)
(Central Kitsilano / Greektown)
 (district municipality)North Central
(North Grandview-Woodland)
(NW Arbutus Ridge / NE Dunbar-Southlands)
 (city)South Central
(South Hastings-Sunrise / North Renfrew-Collingwood)
(South Shaughnessy / NW Oakridge / NE Kerrisdale / SE Arbutus Ridge)
 (city)Southwest Central
(South Grandview-Woodland / NE Kensington-Cedar Cottage)
(West Kerrisdale / South Dunbar-Southlands / Musqueam)
 (district municipality)Northwest Central
(SE Kensington-Cedar Cottage / Victoria-Fraserview)
(SE Kerrisdale / 

In [30]:
print(vancouver_venues.shape)
vancouver_venues.head()

(4244, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,(Strathcona / Chinatown / Downtown Eastside),49.278507,-123.09159,Phnom Penh,49.278517,-123.098214,Asian Restaurant
1,(Strathcona / Chinatown / Downtown Eastside),49.278507,-123.09159,Union Market,49.277371,-123.086989,Deli / Bodega
2,(Strathcona / Chinatown / Downtown Eastside),49.278507,-123.09159,The Mackenzie Room,49.283168,-123.094911,Restaurant
3,(Strathcona / Chinatown / Downtown Eastside),49.278507,-123.09159,Bao Bei,49.279491,-123.100595,Chinese Restaurant
4,(Strathcona / Chinatown / Downtown Eastside),49.278507,-123.09159,Matchstick Coffee Roasters,49.278626,-123.099303,Café


In [26]:
ottawa_venues = getNearbyVenues(names = ottawa_data['Neighborhood'],
                                 latitudes = ottawa_data['Latitude'], 
                                 longitudes = ottawa_data['Longitude'], 
                                 radius = radius)

In [27]:
print(ottawa_venues.shape)
ottawa_venues.head()

(3303, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,(Highland Park / McKellar Park /Westboro /Glab...,45.377666,-75.764976,S&G Fries,45.376841,-75.755878,Food Truck
1,(Highland Park / McKellar Park /Westboro /Glab...,45.377666,-75.764976,Equator Coffee Westboro,45.391714,-75.753717,Coffee Shop
2,(Highland Park / McKellar Park /Westboro /Glab...,45.377666,-75.764976,MEC Ottawa,45.391274,-75.754961,Sporting Goods Shop
3,(Highland Park / McKellar Park /Westboro /Glab...,45.377666,-75.764976,Golden Palace,45.369951,-75.771759,Chinese Restaurant
4,(Highland Park / McKellar Park /Westboro /Glab...,45.377666,-75.764976,Pure Kitchen,45.391683,-75.754797,Vegetarian / Vegan Restaurant


In [138]:
# save data to hard drive so don't need to call API repeatedly
# toronto_venues.to_csv('C:/Users/audre/Documents/Study/Coursera/toronto_venues.csv')
# vancouver_venues.to_csv('C:/Users/audre/Documents/Study/Coursera/vancouver_venues.csv')
# ottawa_venues.to_csv('C:/Users/audre/Documents/Study/Coursera/ottawa_venues.csv')

# read venue data from hard drive
# toronto_venues = pd.read_csv('C:/Users/audre/Documents/Study/Coursera/toronto_venues.csv')
# vancouver_venues = pd.read_csv('C:/Users/audre/Documents/Study/Coursera/vancouver_venues.csv')
# ottawa_venues = pd.read_csv('C:/Users/audre/Documents/Study/Coursera/ottawa_venues.csv')

The current "Venue Category" data is too granular for many parts of this analysis. A manual grouping will be performed to put them into more generic groups. This will be done in excel and then imported back.

In [32]:
# First, get a full list of unique "Venue Category" values in all 3 cities
all_venues = toronto_venues.append(vancouver_venues).append(ottawa_venues)
print(all_venues.shape)
venue_category = pd.DataFrame(all_venues['Venue Category'].unique())
print(venue_category.shape)

# Export the venue category list in a csv file to perform the manual grouping
# venue_category.to_csv('C:/Users/audre/Documents/Study/Coursera/venue_category.csv')
# The manual grouping is then done in Excel

(11347, 7)
(295, 1)


In [33]:
# Save the grouping file to GitHub, import it back to be used
category_group = pd.read_csv('https://raw.githubusercontent.com/audrey-wj/Coursera_Capstone/master/Battle%20of%20Neighborhoods/venue_category.csv')
category_group.head()

Unnamed: 0,Venue Category,Category Group
0,Indie Movie Theater,Entertainment
1,Vegetarian / Vegan Restaurant,Food/Restaurant
2,Ice Cream Shop,Specialty Food
3,Gastropub,Bar/Pub/Lounge
4,Park,Community


In [217]:
# Merge category group file to venue list dataframes
toronto_venues = pd.merge(toronto_venues, category_group, on = 'Venue Category', how = 'left')
vancouver_venues = pd.merge(vancouver_venues, category_group, on = 'Venue Category', how = 'left')
ottawa_venues = pd.merge(ottawa_venues, category_group, on = 'Venue Category', how = 'left')
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Category Group
0,The Beaches,43.676357,-79.293031,The Fox Theatre,43.672801,-79.287272,Indie Movie Theater,Entertainment
1,The Beaches,43.676357,-79.293031,Ed's Real Scoop,43.67263,-79.287993,Ice Cream Shop,Specialty Food
2,The Beaches,43.676357,-79.293031,Kew Gardens,43.669038,-79.298538,Park,Community
3,The Beaches,43.676357,-79.293031,Tori's Bakeshop,43.672114,-79.290331,Vegetarian / Vegan Restaurant,Food/Restaurant
4,The Beaches,43.676357,-79.293031,Woodbine Beach,43.663112,-79.306374,Beach,Outdoor


# 3. Methodology
## 3.1. Overview of the Cities
Compare the top venue types (Category Groups) for the downtown areas of the 3 cities.

In [140]:
# onehot coding of the Category Group data
toronto_onehot = pd.get_dummies(toronto_venues[['Category Group']], prefix="", prefix_sep="")
toronto_onehot['City'] = "Toronto"
toronto_onehot = pd.concat([toronto_venues['Neighborhood'],toronto_onehot], axis = 1, sort = False)
toronto_onehot = toronto_onehot[[toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])]

vancouver_onehot = pd.get_dummies(vancouver_venues[['Category Group']], prefix="", prefix_sep="")
vancouver_onehot['City'] = 'Vancouver'
vancouver_onehot = pd.concat([vancouver_venues['Neighborhood'],vancouver_onehot], axis = 1, sort = False)
vancouver_onehot = vancouver_onehot[[vancouver_onehot.columns[-1]] + list(vancouver_onehot.columns[:-1])]

ottawa_onehot = pd.get_dummies(ottawa_venues[['Category Group']], prefix="", prefix_sep="")
ottawa_onehot['City'] = 'Ottawa'
ottawa_onehot = pd.concat([ottawa_venues['Neighborhood'],ottawa_onehot], axis = 1, sort = False)
ottawa_onehot = ottawa_onehot[[ottawa_onehot.columns[-1]] + list(ottawa_onehot.columns[:-1])]

In [74]:
city_grouped = toronto_onehot.append(vancouver_onehot).append(ottawa_onehot).groupby('City').mean().reset_index()
city_grouped.head()

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


Unnamed: 0,City,Bank/Finanical Institusions,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space,Food/Restaurant,Fram,General Shopping,Grocery/Supermarket,Historic/Museum,Hotel/Accommodation,Library,Medicine,Office/Business,Outdoor,Service,Specialty Food,Specialty Store,Sports/Fitness,Tourism,Training/Education,Transportation
0,Ottawa,0.011807,0.050863,0.112322,0.023918,0.027551,0.000908,0.399031,0.002422,0.025734,0.041477,0.018771,0.019376,0.001211,0.016954,0.000606,0.01211,0.003936,0.042386,0.116561,0.055404,0.003936,,0.010899
1,Toronto,,0.073684,0.138421,0.076316,0.034474,0.004211,0.337632,,0.001316,0.031579,0.014211,0.025526,,0.000263,0.000263,0.020263,0.011316,0.071053,0.09,0.054474,0.009737,0.000789,0.004474
2,Vancouver,0.000707,0.049717,0.112158,0.079642,0.015316,0.006126,0.394675,0.001649,0.012488,0.02262,0.002356,0.027097,,0.002592,0.001414,0.048539,0.003299,0.089303,0.049717,0.063619,0.009896,0.002592,0.004006


In [75]:
# define function return_ost_common_venues to sort venues by frequency in decending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [76]:
# Create a new dataframe that has the top 10 venues for each city
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# initialize a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['City'] = city_grouped['City']

for ind in np.arange(city_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Ottawa,Food/Restaurant,Specialty Store,Coffee/Tea,Sports/Fitness,Bar/Pub/Lounge,Specialty Food,Grocery/Supermarket,Entertainment,General Shopping,Community
1,Toronto,Food/Restaurant,Coffee/Tea,Specialty Store,Community,Bar/Pub/Lounge,Specialty Food,Sports/Fitness,Entertainment,Grocery/Supermarket,Hotel/Accommodation
2,Vancouver,Food/Restaurant,Coffee/Tea,Specialty Food,Community,Sports/Fitness,Specialty Store,Bar/Pub/Lounge,Outdoor,Hotel/Accommodation,Grocery/Supermarket


#### Top 3 Venue Types (Category Groups)
No.1 Venue Type: For all 3 cities, Food/Restaurant is the most common venue type in the downtown area. 
Other members in Top 3 Places: Coffee/Tea places are also among the top 3 venue types for all 3 cities. The remaining member in the top 3 places in both Ottawa and Vancouver is "Specialty Store", wheareas in Vancouver "Specialty Food" made the 3rd place. This makes the desity of food venues in downtown Vancouver the highest among the 3 cities.

#### Shopping
With "Specialty Store" being the no.2 venue type, and General Shopping (which primarily comprises of department stores, shopping malls & etc.), Ottawa appears to be the city that will be most attractive to shoppers.

#### Car Dependency
With a significantly higher dencity of public trasportation facility (Category Group = Transportation), Ottawa is the least car dependent city, which usually means more flexibility and convenience to travelers.

#### Culture
Ottawa and Toronto have more Cultural venues with a combined Entertainment and Historic/Museum density of 5% (compared to 1.8% in Vancouver)

#### Outdoor
Vancouver has much more Outdoor acitivity options (5% venues being Outdoor compared to 2% in Toronto and only 1.2% in Ottawa). 

#### Business
Vancouver has much higher density of Office spaces.
Vancouver also has the highest Event Spaces and Hotel/Accommodation venues, which makes it a place very suitable for conventions, exhibitions, and other events.

#### Other
Toronto may be a more comfortable and convenient city to live in with higher density of Service venues (salon, spa, auto shop, & etc.), and a good balance of other venues including public facilities (Category Group = Community), grocery store/supermarket, gym/fitness, and good amount of shopping and food selection in general. It also seems to be an interesting place to live in with its high density of bars, pubs, and coffee/tea places, and the good amount of cultural venues including museuems and entertainment venues.

## 3.2. Neighborhood Clusters
Performa KNN clustering analysis on neighborhoods in each city.

In [83]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# conda install -c conda-forge geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# conda install -c conda-forge folium
import folium # map rendering library

Since the same process will be applied to each of the 3 cities, will define a function to do the clustering.

In [185]:
def neighborhood_cluster(num_cluster, onehot_data, zipcode_data):
    # prepare data
    neighborhood = onehot_data.groupby('Neighborhood').mean().reset_index()
    neighborhood_clustering = neighborhood.drop('Neighborhood',1)
    # run k-means clustering
    kmeans = KMeans(n_clusters = num_cluster, random_state=0).fit(neighborhood_clustering)
    # add clustering labels to neighborhood dataframe
    neighborhood.insert(0, 'Cluster Labels', kmeans.labels_)
    # merge Zip Code Location data with neighborhood data
    neighborhood = pd.merge(zipcode_data,neighborhood,on = 'Neighborhood', how = 'inner')
    
    return neighborhood

In [187]:
def cluster_analysis(num_top_venues, clustered_data):
    # average venue frequency by cluster
    cluster = clustered_data.groupby(['Cluster Labels']).mean().reset_index()
    cluster = cluster.drop(['Latitude','Longitude'], 1)
    
    # top venues by cluster
    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Cluster Labels']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except:
            columns.append('{}th Most Common Venue'.format(ind+1))

    # initialize a new dataframe
    cluster_sorted = pd.DataFrame(columns=columns)
    cluster_sorted['Cluster Labels'] = cluster['Cluster Labels']
    
    # use "return_most_common_venues" function to return top venues for each cluster
    for ind in np.arange(cluster.shape[0]):
        cluster_sorted.iloc[ind, 1:] = return_most_common_venues(cluster.iloc[ind, :], num_top_venues)

    return cluster, cluster_sorted

### 3.2.1. Toronto

In [190]:
# cluster toronto neighborhood
kclusters = 4
toronto_neighborhood = neighborhood_cluster(kclusters, toronto_onehot, toronto_data)
toronto_neighborhood.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space,Food/Restaurant,General Shopping,Grocery/Supermarket,Historic/Museum,Hotel/Accommodation,Medicine,Office/Business,Outdoor,Service,Specialty Food,Specialty Store,Sports/Fitness,Tourism,Training/Education,Transportation
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,0.16,0.11,0.11,0.01,0.0,0.32,0.0,0.01,0.0,0.0,0.0,0.0,0.07,0.0,0.11,0.05,0.05,0.0,0.0,0.0
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,3,0.06,0.19,0.09,0.01,0.0,0.35,0.0,0.05,0.03,0.03,0.0,0.0,0.0,0.02,0.08,0.04,0.05,0.0,0.0,0.0
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,2,0.12,0.15,0.09,0.01,0.0,0.38,0.0,0.01,0.0,0.01,0.0,0.0,0.05,0.0,0.09,0.06,0.03,0.0,0.0,0.0
3,M4M,East Toronto,Studio District,43.659526,-79.340923,3,0.06,0.18,0.08,0.04,0.0,0.33,0.0,0.06,0.04,0.02,0.0,0.0,0.03,0.0,0.08,0.04,0.04,0.0,0.0,0.0
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,0.04,0.18,0.06,0.01,0.0,0.37,0.01,0.03,0.0,0.0,0.0,0.0,0.02,0.02,0.11,0.07,0.07,0.01,0.0,0.0


In [198]:
# toronto neighborhood analysis
toronto_cluster, toronto_cluster_sorted = cluster_analysis(10, toronto_neighborhood)
toronto_cluster

Unnamed: 0,Cluster Labels,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space,Food/Restaurant,General Shopping,Grocery/Supermarket,Historic/Museum,Hotel/Accommodation,Medicine,Office/Business,Outdoor,Service,Specialty Food,Specialty Store,Sports/Fitness,Tourism,Training/Education,Transportation
0,0,0.026667,0.158333,0.063333,0.006667,0.003333,0.355,0.006667,0.043333,0.008333,0.003333,0.001667,0.0,0.03,0.023333,0.103333,0.09,0.065,0.01,0.001667,0.0
1,1,0.055333,0.117333,0.065333,0.058,0.006667,0.324,0.0,0.028,0.014667,0.04,0.0,0.0,0.014667,0.01,0.044,0.121333,0.075333,0.016,0.0,0.009333
2,2,0.146,0.127,0.092,0.011,0.002,0.374,0.001,0.016,0.002,0.01,0.0,0.001,0.029,0.006,0.098,0.054,0.028,0.002,0.0,0.001
3,3,0.05,0.182857,0.088571,0.041429,0.002857,0.3,0.0,0.051429,0.035714,0.035714,0.0,0.0,0.011429,0.011429,0.062857,0.074286,0.038571,0.007143,0.002857,0.002857


In [199]:
toronto_cluster_sorted

Unnamed: 0,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Food/Restaurant,Coffee/Tea,Specialty Food,Specialty Store,Sports/Fitness,Community,Grocery/Supermarket,Outdoor,Bar/Pub/Lounge,Service
1,1,Food/Restaurant,Specialty Store,Coffee/Tea,Sports/Fitness,Community,Entertainment,Bar/Pub/Lounge,Specialty Food,Hotel/Accommodation,Grocery/Supermarket
2,2,Food/Restaurant,Bar/Pub/Lounge,Coffee/Tea,Specialty Food,Community,Specialty Store,Outdoor,Sports/Fitness,Grocery/Supermarket,Entertainment
3,3,Food/Restaurant,Coffee/Tea,Community,Specialty Store,Specialty Food,Grocery/Supermarket,Bar/Pub/Lounge,Entertainment,Sports/Fitness,Historic/Museum


In [261]:
# Visualize the resulting clusters
# Use geopy library to get latitude and longitude values of Toronto, Canada
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# rainbow = [purple, teal, lime, red]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_neighborhood['Latitude'], toronto_neighborhood['Longitude'], toronto_neighborhood['Neighborhood'], toronto_neighborhood['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Cluster 0: Residential (red)
Venues include Shopping, Pharmacy, Service, and a decent amount of Grocery/Supermarket, Outdoor, and Sports/Fitness (including gyms and studios) facilities
#### Cluster 1: Event/Tourism (purple)
High concentration of Event Space, Hotel/Accommodation, Transportation, Tourism, Entertainment, Sports/Fitness (including statiums)
#### Cluster 2: Business/City Center (teal/blue)
High Concentration of Bar/Pub/Lounge, Food/Restaurant & Specialty Food. Office buildings are also found in this area, and it also has a decent amount of public facilities
#### Cluster 3: Culture/Education (lime/green)
Center of culture and education. Where you can find historic sites and museums, training and education institutions. There's also a decent amount of public facilities, entertainment, and hotel/accomodations.

### 3.2.2 Vancouver

In [203]:
# cluster Vancouver neighborhood
vancouver_neighborhood = neighborhood_cluster(kclusters, vancouver_onehot, vancouver_data)
vancouver_neighborhood.head()

Unnamed: 0,PostalCode,Bourough,Neighborhood,Latitude,Longitude,Cluster Labels,Bank/Finanical Institusions,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space,Food/Restaurant,Fram,General Shopping,Grocery/Supermarket,Historic/Museum,Hotel/Accommodation,Medicine,Office/Business,Outdoor,Service,Specialty Food,Specialty Store,Sports/Fitness,Tourism,Training/Education,Transportation
0,V6A,Vancouver,(Strathcona / Chinatown / Downtown Eastside),49.278507,-123.09159,2,0.0,0.1,0.13,0.04,0.03,0.01,0.42,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.02,0.01,0.08,0.05,0.04,0.0,0.01,0.0
1,V6B,Vancouver,(NE Downtown / Gastown / Harbour Centre / Inte...,49.278408,-123.11181,0,0.0,0.04,0.09,0.04,0.03,0.0,0.41,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.02,0.01,0.11,0.08,0.07,0.0,0.01,0.0
2,V6C,Vancouver,(Waterfront / Coal Harbour / Canada Place),49.28472,-123.117017,0,0.0,0.03,0.08,0.05,0.03,0.0,0.4,0.0,0.01,0.01,0.0,0.1,0.0,0.0,0.02,0.01,0.1,0.07,0.08,0.0,0.01,0.0
3,V6E,Vancouver,(SE West End / Davie Village),49.284366,-123.118055,0,0.0,0.02,0.08,0.06,0.03,0.0,0.4,0.0,0.01,0.01,0.0,0.1,0.0,0.0,0.02,0.01,0.1,0.07,0.08,0.0,0.01,0.0
4,V6G,Vancouver,(NW West End / Stanley Park),49.290228,-123.135351,0,0.0,0.03,0.05,0.09,0.03,0.0,0.32,0.0,0.02,0.01,0.0,0.12,0.0,0.0,0.06,0.01,0.08,0.09,0.05,0.04,0.0,0.0


In [204]:
# toronto neighborhood analysis
vancouver_cluster, vancouver_cluster_sorted = cluster_analysis(10, vancouver_neighborhood)
vancouver_cluster

Unnamed: 0,Cluster Labels,Bank/Finanical Institusions,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space,Food/Restaurant,Fram,General Shopping,Grocery/Supermarket,Historic/Museum,Hotel/Accommodation,Medicine,Office/Business,Outdoor,Service,Specialty Food,Specialty Store,Sports/Fitness,Tourism,Training/Education,Transportation
0,0,0.0,0.028667,0.082,0.074,0.028667,0.003333,0.382667,0.0,0.010667,0.01,0.0,0.054,0.0,0.0,0.049333,0.004667,0.126,0.059333,0.074667,0.007333,0.004667,0.0
1,1,0.0026,0.036417,0.133831,0.11288,0.005913,0.006263,0.272222,0.003665,0.031591,0.039723,0.005152,0.016331,0.007558,0.006161,0.112703,0.005589,0.033232,0.036065,0.075053,0.02455,0.0,0.031059
2,2,0.001111,0.11,0.117778,0.081111,0.007778,0.005556,0.398889,0.004444,0.01,0.025556,0.005556,0.014444,0.0,0.0,0.034444,0.004444,0.076667,0.034444,0.051111,0.013333,0.002222,0.001111
3,3,0.0,0.036364,0.138182,0.064545,0.01,0.010909,0.486364,0.0,0.003636,0.026364,0.001818,0.006364,0.004545,0.000909,0.014545,0.0,0.084545,0.055455,0.050909,0.0,0.001818,0.001818


In [205]:
vancouver_cluster_sorted

Unnamed: 0,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Food/Restaurant,Specialty Food,Coffee/Tea,Sports/Fitness,Community,Specialty Store,Hotel/Accommodation,Outdoor,Bar/Pub/Lounge,Entertainment
1,1,Food/Restaurant,Coffee/Tea,Community,Outdoor,Sports/Fitness,Grocery/Supermarket,Bar/Pub/Lounge,Specialty Store,Specialty Food,General Shopping
2,2,Food/Restaurant,Coffee/Tea,Bar/Pub/Lounge,Community,Specialty Food,Sports/Fitness,Outdoor,Specialty Store,Grocery/Supermarket,Hotel/Accommodation
3,3,Food/Restaurant,Coffee/Tea,Specialty Food,Community,Specialty Store,Sports/Fitness,Bar/Pub/Lounge,Grocery/Supermarket,Outdoor,Event Space


In [206]:
# Visualize the resulting clusters
# Use geopy library to get latitude and longitude values of Vancouver, Canada
address = 'Vancouver, British Columbia, Canada'

geolocator = Nominatim(user_agent="vancouver_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# rainbow = [purple, teal, lime, red]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(vancouver_neighborhood['Latitude'], vancouver_neighborhood['Longitude'], vancouver_neighborhood['Neighborhood'], vancouver_neighborhood['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Cluster 0: City Center (red)
Center of entertainment and hotel/accomodation for the city, also has good amount of food selection including specialty food and general restaurant. Some training and education institution and outdoor facilities can also be found.

#### Cluster 1: Residential, Harbor, Tourism & Outdoor (purple)
Higher centration of banks, public facilities, general shopping, grocery store/supermarket and pharmacies make it conveneint for residents. It has a good amount of office buildings, making commute easier for office workers. 
Unlike residential areas in Toronto, this function area in Vancouver also has a good amount ourdoor facilities, trouism sites, historical sites and museums. 
The harbor and public transportation facilities make it convenient for both residents and visitors.

#### Cluster 2: Culture & Entertainment (teal/blue)
This area is a good location for local cultural events and entertaiment, with high concentration of bars, pubs, and lounges, decent entertainment, food venues, and historic sites/museums.

#### Cluster 3: Event & Entertainment (lime/green)
This is the event center for the city. It has the highest density of event spaces, great food selection including both general restaurants and specialty food, and decent entertainment options.

### 3.2.3 Ottawa

In [207]:
# cluster Vancouver neighborhood
ottawa_neighborhood = neighborhood_cluster(kclusters, ottawa_onehot, ottawa_data)
ottawa_neighborhood.head()

Unnamed: 0,PostalCode,Bourough,Neighborhood,Latitude,Longitude,Cluster Labels,Bank/Finanical Institusions,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space,Food/Restaurant,Fram,General Shopping,Grocery/Supermarket,Historic/Museum,Hotel/Accommodation,Library,Medicine,Office/Business,Outdoor,Service,Specialty Food,Specialty Store,Sports/Fitness,Tourism,Transportation
0,K2A,Ottawa,(Highland Park / McKellar Park /Westboro /Glab...,45.377666,-75.764976,1,0.0,0.08,0.08,0.01,0.02,0.0,0.46,0.0,0.01,0.04,0.0,0.01,0.0,0.01,0.0,0.02,0.0,0.08,0.13,0.05,0.0,0.0
1,K4A,Ottawa,(Fallingbrook),45.472155,-75.479227,1,0.032609,0.021739,0.130435,0.0,0.021739,0.0,0.380435,0.0,0.032609,0.086957,0.0,0.01087,0.0,0.054348,0.0,0.01087,0.0,0.01087,0.141304,0.065217,0.0,0.0
2,K1B,Ottawa,(Blackburn Hamlet / Pine View / Sheffield Glen),45.422631,-75.593948,1,0.012195,0.02439,0.146341,0.012195,0.012195,0.0,0.280488,0.0,0.060976,0.073171,0.012195,0.02439,0.0,0.012195,0.0,0.0,0.02439,0.02439,0.158537,0.073171,0.0,0.04878
3,K2B,Ottawa,(Britannia /Whitehaven / Bayshore / Pinecrest),45.361812,-75.789919,1,0.0,0.02,0.1,0.02,0.04,0.0,0.43,0.0,0.04,0.04,0.0,0.0,0.0,0.03,0.0,0.02,0.0,0.03,0.18,0.05,0.0,0.0
4,K4B,Ottawa,(Navan),45.411875,-75.402756,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.4,0.0,0.0


In [208]:
# toronto neighborhood analysis
ottawa_cluster, ottawa_cluster_sorted = cluster_analysis(10, ottawa_neighborhood)
ottawa_cluster

Unnamed: 0,Cluster Labels,Bank/Finanical Institusions,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space,Food/Restaurant,Fram,General Shopping,Grocery/Supermarket,Historic/Museum,Hotel/Accommodation,Library,Medicine,Office/Business,Outdoor,Service,Specialty Food,Specialty Store,Sports/Fitness,Tourism,Transportation
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.1,0.125,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.325,0.0,0.125
1,1,0.018268,0.038357,0.105296,0.018458,0.021166,0.000385,0.396854,0.002952,0.036173,0.052919,0.004112,0.009597,0.0,0.022601,0.001203,0.008786,0.004098,0.023697,0.139618,0.073709,0.003564,0.015577
2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.25,0.0
3,3,0.0,0.072727,0.12,0.043636,0.035455,0.001818,0.400909,0.002727,0.008182,0.015455,0.049091,0.037273,0.003636,0.001818,0.0,0.016364,0.001818,0.078182,0.062727,0.030909,0.008182,0.009091


In [212]:
ottawa_neighborhood[ottawa_neighborhood['Cluster Labels'] == 2]

Unnamed: 0,PostalCode,Bourough,Neighborhood,Latitude,Longitude,Cluster Labels,Bank/Finanical Institusions,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space,Food/Restaurant,Fram,General Shopping,Grocery/Supermarket,Historic/Museum,Hotel/Accommodation,Library,Medicine,Office/Business,Outdoor,Service,Specialty Food,Specialty Store,Sports/Fitness,Tourism,Transportation
26,K4P,Ottawa,(Greely),45.252179,-75.572107,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.25,0.0


In [215]:
ottawa_venues[ottawa_venues['Neighborhood'] =='(Greely)']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Category Group
2232,(Greely),45.252179,-75.572107,Little Ray's Reptile Zoo,45.288291,-75.574538,Zoo,Tourism
2233,(Greely),45.252179,-75.572107,Canada Post,45.263368,-75.558906,Paper / Office Supplies Store,Specialty Store
2234,(Greely),45.252179,-75.572107,Poplar Grove Trailer Park,45.25927,-75.552722,Campground,Outdoor
2235,(Greely),45.252179,-75.572107,Third World Bazaar,45.254898,-75.625542,Flea Market,General Shopping


In [209]:
ottawa_cluster_sorted

Unnamed: 0,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Sports/Fitness,Transportation,Fram,Historic/Museum,Grocery/Supermarket,Service,Specialty Food,Specialty Store,Bar/Pub/Lounge,Coffee/Tea
1,1,Food/Restaurant,Specialty Store,Coffee/Tea,Sports/Fitness,Grocery/Supermarket,Bar/Pub/Lounge,General Shopping,Specialty Food,Medicine,Entertainment
2,2,General Shopping,Specialty Store,Outdoor,Tourism,Transportation,Bar/Pub/Lounge,Coffee/Tea,Community,Entertainment,Event Space
3,3,Food/Restaurant,Coffee/Tea,Specialty Food,Bar/Pub/Lounge,Specialty Store,Historic/Museum,Community,Hotel/Accommodation,Entertainment,Sports/Fitness


In [211]:
# Visualize the resulting clusters
# Use geopy library to get latitude and longitude values of Vancouver, Canada
address = 'Ottawa, Ontario, Canada'

geolocator = Nominatim(user_agent="ottawa_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# rainbow = [purple, teal, lime, red]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ottawa_neighborhood['Latitude'], ottawa_neighborhood['Longitude'], ottawa_neighborhood['Neighborhood'], ottawa_neighborhood['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Cluster 0: Outer City (red)
Only two neighborhoods are categorized in this cluster. This cluster is primarily comprised of farms, historical sites and museums, transportation facilities, and sports facilities.

#### Cluster 1: Residential (purple)
This is the residential area with easy access to banks, entertainment, food and restaurant, shopping (general and specialty goods), grocery store and supermarket, pharmacies, service facilities, and gyms and fitness facilities. 
There are also some office buildings which makes it convenient for commuters.  

#### Cluster 2: Suburb (teal/blue)
Only one neighborhood (Greely) is in this cluster. Only four venues are found in this neighborhood, which is a zoo, a post office, a trailer park, and a bazaar. This is a suburban neighborhood.

#### Cluster 3: City Center (lime/green)
This is the city center with high concentration of bars, pubs, and lounges, entertainment, event spaces, great food selections including general restaurant and specialty stores, and hotels and accommodations.
This is also the center of public cultural functions including public service facilities, historical site and museums, and libraries.

## 3.3. Deep Dive into Food Selection
In this section we will take a closer look at the "Food/Restaurant" and "Specialty Food" category groups

In [224]:
# limit venue data to just "Food/Restaurant" and "Specialty Food" Category Groups
toronto_food = toronto_venues[toronto_venues['Category Group'] == 'Food/Restaurant'].append(toronto_venues[toronto_venues['Category Group'] == 'Specialty Food'])
vancouver_food = vancouver_venues[vancouver_venues['Category Group'] == 'Food/Restaurant'].append(vancouver_venues[vancouver_venues['Category Group'] == 'Specialty Food'])
ottawa_food = ottawa_venues[ottawa_venues['Category Group'] == 'Food/Restaurant'].append(ottawa_venues[ottawa_venues['Category Group'] == 'Specialty Food'])

print(np.shape(toronto_food))
print(np.shape(vancouver_food))
print(np.shape(ottawa_food))

(1423, 8)
(2054, 8)
(1458, 8)


In [226]:
# define a function to perform onehot coding and sort venue data
def onehot_coding(city, venue_data):
    onehot_data = pd.get_dummies(venue_data[['Venue Category']], prefix="", prefix_sep="")
    onehot_data['City'] = city
    onehot_data = pd.concat([venue_data['Neighborhood'],onehot_data], axis = 1, sort = False)
    onehot_data = onehot_data[[onehot_data.columns[-1]] + list(onehot_data.columns[:-1])]
    
    return onehot_data

In [243]:
def sort_data(data, num_top_venues = 10):
    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['City']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except:
            columns.append('{}th Most Common Venue'.format(ind+1))

    # initialize a new dataframe
    venues_sorted = pd.DataFrame(columns=columns)
    venues_sorted['City'] = data['City']

    for ind in np.arange(data.shape[0]):
        venues_sorted.iloc[ind, 1:] = return_most_common_venues(data.iloc[ind, :], num_top_venues)

    return venues_sorted

In [229]:
toronto_food_onehot = onehot_coding('Toronto',toronto_food)
vancouver_food_onehot = onehot_coding('Vancouver', vancouver_food)
ottawa_food_onehot = onehot_coding('Ottawa', ottawa_food)

In [237]:
toronto_food_grouped = toronto_food_onehot.groupby('City').mean().reset_index()
vancouver_food_grouped = vancouver_food_onehot.groupby('City').mean().reset_index()
ottawa_food_grouped = ottawa_food_onehot.groupby('City').mean().reset_index()

In [246]:
num_top_venues = 15
toronto_food_sorted = sort_data(toronto_food_grouped, num_top_venues)
vancouver_food_sorted = sort_data(vancouver_food_grouped, num_top_venues)
ottawa_food_sorted = sort_data(ottawa_food_grouped, num_top_venues)

In [245]:
toronto_food_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
0,Toronto,Pizza Place,Sandwich Place,Japanese Restaurant,Bakery,Restaurant,Italian Restaurant,Dessert Shop,Mexican Restaurant,French Restaurant,Ice Cream Shop,Mediterranean Restaurant,Tapas Restaurant,Vegetarian / Vegan Restaurant,American Restaurant,Asian Restaurant


In [247]:
vancouver_food_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
0,Vancouver,Japanese Restaurant,Bakery,Restaurant,Italian Restaurant,Seafood Restaurant,Ice Cream Shop,Pizza Place,Dessert Shop,Sandwich Place,Sushi Restaurant,Vietnamese Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,French Restaurant,Breakfast Spot


In [248]:
ottawa_food_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
0,Ottawa,Restaurant,Fast Food Restaurant,Pizza Place,Sandwich Place,Italian Restaurant,Middle Eastern Restaurant,American Restaurant,Burger Joint,Breakfast Spot,Mexican Restaurant,Steakhouse,Deli / Bodega,Bakery,Diner,Thai Restaurant


-	Relative to the other two cities, Ottawa has less international food selections and more casual eating options (pizza, fast food, and sandwiches)
-	Toronto has more international food by having Japanese, Italian, Mexican, French, Mediterranean, and Spanish (Tapas) restaurants as top venues.
-	Vancouver has more Asian food by having Japanese, Vietnamese, and Indian restaurants in the top venue list. 

## 3.4. Deep Dive into Activities
In this sections, we will look more closely into "Outdoor", "Tourism", and "Historic/Museum" Category Groups

In [249]:
# limit venue data to just "Food/Restaurant" and "Specialty Food" Category Groups
toronto_activity = toronto_venues[toronto_venues['Category Group'] == 'Outdoor'].append(toronto_venues[toronto_venues['Category Group'] == 'Tourism']).append(toronto_venues[toronto_venues['Category Group'] == 'Historic/Museum'])
vancouver_activity = vancouver_venues[vancouver_venues['Category Group'] == 'Outdoor'].append(vancouver_venues[vancouver_venues['Category Group'] == 'Tourism']).append(vancouver_venues[vancouver_venues['Category Group'] == 'Historic/Museum'])
ottawa_activity = ottawa_venues[ottawa_venues['Category Group'] == 'Ourdoor'].append(ottawa_venues[ottawa_venues['Category Group'] == 'Tourism']).append(ottawa_venues[ottawa_venues['Category Group'] == 'Historic/Museum'])

print(np.shape(toronto_activity))
print(np.shape(vancouver_activity))
print(np.shape(ottawa_activity))

(200, 8)
(258, 8)
(75, 8)


In [250]:
toronto_activity_onehot = onehot_coding('Toronto',toronto_activity)
vancouver_activity_onehot = onehot_coding('Vancouver', vancouver_activity)
ottawa_activity_onehot = onehot_coding('Ottawa', ottawa_activity)

In [252]:
toronto_activity_grouped = toronto_activity_onehot.groupby('City').mean().reset_index()
vancouver_activity_grouped = vancouver_activity_onehot.groupby('City').mean().reset_index()
ottawa_activity_grouped = ottawa_activity_onehot.groupby('City').mean().reset_index()

In [257]:
num_top_venues = 9
toronto_activity_sorted = sort_data(toronto_activity_grouped, num_top_venues)
vancouver_activity_sorted = sort_data(vancouver_activity_grouped, num_top_venues)
ottawa_activity_sorted = sort_data(ottawa_activity_grouped, num_top_venues)

In [258]:
toronto_activity_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue
0,Toronto,Museum,Historic Site,Scenic Lookout,Monument / Landmark,Lake,Castle,Aquarium,Field,Other Great Outdoors


In [259]:
vancouver_activity_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue
0,Vancouver,Trail,Beach,Scenic Lookout,Sculpture Garden,Lighthouse,National Park,Museum,Fair,Mountain


In [260]:
ottawa_activity_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue
0,Ottawa,Museum,History Museum,Historic Site,Science Museum,Memorial Site,Art Museum,Monument / Landmark,Botanical Garden,Zoo


-	Ottawa has relatively limited outdoor activity options. There is ample amount of cultural activities in Ottawa such as museums, art, historical sites & etc.
-	There are plenty to look at in the Toronto, in addition to museums and historical sites like Ottawa, Toronto also has more varieties like Castles and Aquariums.
-	In Vancouver, visitors have many outdoor activity options. There are trails, beaches, scenic lookouts, national parks & etc. in Vancouver area.