# Coursera Capstone Project: Find Similar Hotels

## 1. Introduction

This notebook is the result of the Capstone Project of the IBM Data Science specialization on Coursera. Participants are free to choose their own business problem, so I decided to look at a common, almost "daily" problem that city travellers face.

Suppose you have travelled to a city and enjoyed the stay, especially the location of your hotel. Maybe you preferred a neighbourhood a bit outside the city center that is quieter and has some parks to walk in, yet still has a number of restaurants and bars nearby, so you could spend the evening right next to the hotel instead of travelling back late from the city center.

Now you want to travel to another city. You are not looking for a hotel that is similar by hotel standards (you don't need much more than a bed to sleep in and a shower), but for a hotel that is in a similar part of the city and has a similar environment in terms of the venues around it.

This is the problem I decided to take a deeper look at in this analysis.

## 2. Methodology Overview

### 2.1 Problem Definition

As a (frequent) traveller to a specific city you've become used to spending your stay in a certain neighbourhood, regarding the neighbourhood itself and the venues nearby. So, when travelling somewhere else you want to find a hotel which has similar venues nearby and is in a similar neighbourhood.

So, given a hotel in one city (which also defines the neighbourhood), we want to find similar neighbourhoods in some other city and then find the hotels there that have the most similar venues close by.

The *question* we want to answer is:

**Which 10 hotels (target hotels) in city X have the most similar environment compared to a given hotel (origin hotel) in city Y?**

For this sample analysis we'll use a hotel in New York and we'll try to find a list of similar locations to stay in Toronto.

### Business Relevance

The similarity of the environment of a hotel can help customers of online booking services like Booking.com to find not simply a similar hotel (which we do **not** look at in this analysis), but a hotel which is similarly located, based on the venues in its environment.

This can improve hotel recommendations significantly, because simple recommendations based on collaborative filtering or content-based filtering know no more than a simple location rating for a hotel and have no information at all about the hotel's environment. It is probably best to combine more than one recommendation algorithm to get the best result for the customer.

#### Exclusions

To get a good answer to our question we need to consider the general location of the hotel and its immediate vicinity. We do not consider travel time to and from the airport or similar; we assume the analysis is for a longer stay, so travel to and from the hotel is not part of the comparison. We also ignore the proximity to monuments or museums for simplicity's sake.

### 2.2 Analytical Approach

#### General Idea

We base our comparison solely on two things: the similarity of the hotel's neighbourhood, using a walking distance of 1000 m around the neighbourhood center, and the similarity of the hotel's close proximity, using a radius of 250 m around the hotel.

#### Data Requirements

We need the venues in the neighbourhoods and around the hotels. These can be obtained using online APIs, which require geocoordinates for the neighbourhoods and the hotels. We also need a list of neighbourhoods for the target city and potentially also for the origin city (from the origin city we only need information about the neighbourhood of the hotel).

#### Modeling

We will transform the data about venues into one-hot encoded information about the neighbourhoods and the hotels. Then we can sum up the venues by type and calculate the mean across all neighbourhoods/hotels. Using this vector the proximity of two neighbourhoods or hotels can be calculated.

After getting the information for the neighbourhood of our origin hotel and all neighbourhoods of the target city, we can calculate the similarity of the neighbourhoods using Euclidean distance and pick the top 3 target neighbourhoods. Then we go through all hotels in these three neighbourhoods, get the venues in their immediate environment, and again calculate the distance to the data vector of the origin hotel. We choose the top 10 results as candidate target hotels.
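The ranking idea described above can be sketched without any library support. The snippet below is a minimal illustration on made-up toy data (the neighbourhood names `A`, `B`, `C` and their venue lists are hypothetical, not results of this analysis). It counts venues per category, which is what summing one-hot rows amounts to, and ranks the candidates by Euclidean distance to the origin vector.

```python
from collections import Counter
from math import dist  # available from Python 3.8

# Hypothetical toy data: venue categories observed per neighbourhood
origin = ['Restaurant', 'Restaurant', 'Park', 'Café']
candidates = {
    'A': ['Restaurant', 'Restaurant', 'Park', 'Bar'],
    'B': ['Coffee Shop', 'Coffee Shop', 'Coffee Shop', 'Gym'],
    'C': ['Restaurant', 'Park', 'Café'],
}

# The comparison space: every category seen anywhere, in a fixed order
categories = sorted(set(origin).union(*candidates.values()))

def to_vector(venues):
    """Count venues per category (equivalent to summing one-hot rows)."""
    counts = Counter(venues)
    return [counts[c] for c in categories]

origin_vec = to_vector(origin)
distances = {name: dist(origin_vec, to_vector(venues))
             for name, venues in candidates.items()}

# Smallest distance first; slice the list to keep the top n
ranking = sorted(distances, key=distances.get)
print(ranking)
```

Dividing each count by the number of venues would give the mean-based variant mentioned above; which of the two is more appropriate depends on whether the absolute number of venues should influence the similarity.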

**Note:** Due to the limitations of the free FourSquare API we'll be limited to a maximum of 100 venues per neighbourhood/hotel. This should be fine for the direct comparison of hotels, but it is not enough to compare two lively neighbourhoods in two big cities. Therefore we will choose a hotel in a neighbourhood with fewer than 100 venues as our origin hotel.

### 2.2 Data Sources

#### New York

1. Neighbourhood information

We use the data from this link [https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json] for a list of neighbourhoods and their geocoordinates

2. Venue information

To get information about the venues close to the hotel and the venues in a neighbourhood we use the FourSquare "explore" API.

#### Toronto

1. Neighbourhood information

To get neighbourhood information about Toronto we use the Wikipedia page "List of postal codes of Canada: M". This list has no geocoordinates yet.

2. Neighbourhood geocoordinates

For geocoordinates for the postal codes of the neighbourhoods in Toronto we use data from the following link [https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv]

3. Venue information

To get venue information for the neighbourhoods and the hotels we use the FourSquare API as well

### 2.3 Steps of the Analysis

The analysis will follow the following sequence:

* Get data about origin city (New York) and choose origin hotel

* Get/Save venue information about the neighbourhood the origin hotel is in

* Get venue information about the direct environment of the origin hotel

* Get data about target city neighbourhoods

* Find the top 3 neighbourhoods in the target city that are most similar to the neighbourhood of the origin hotel

* Get all hotels from these 3 neighbourhoods

* Get the venues in the immediate vicinity of each hotel in the target city

* Find the most similar hotels

### 2.4 Data Challenges

As part of the analysis we found the following challenges

* The number of venues reported by FourSquare varies a lot from city to city; the data doesn't seem to have the same quality in all cities

* FourSquare seems to report only a very small number of hotels

* Venue categories cannot simply be compared between cities because of cultural differences. Restaurant types in particular will not be the same if you travel to a different country, and you are surely more interested in a similar number of restaurants than in having exactly the same restaurant types around your hotel. We combined all restaurant types into a single category "Restaurant" to overcome this

* Restaurants should probably be classified more coarsely, like "Fast Food", "Bar with Food", "Full-blown Restaurant". We left this open, but it could have been included in the data cleanup steps

* There are some almost duplicate categories like "Gym" and "Gym / Fitness Center". For production use this would probably need further analysis

* It turns out that even for two cities that are not so different (New York and Toronto), the categories used to classify venues overlap less than one might think. This might be due to cultural differences. To make neighbourhoods and hotel environments comparable we reduced the comparison to the categories found in both cities/neighbourhoods.

* FourSquare's 100-venue limit makes the free API less useful for this analysis for places with a lot of venues around them. For this reason we picked a neighbourhood with fewer than 100 venues, which should allow us to find a reasonably similar environment. The question remains whether we should also have excluded all target-city neighbourhoods with 100 venues from the comparison (which we didn't)
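
One way to approach the open question in the last point would be to drop every neighbourhood whose venue count reaches the cap before comparing. The following sketch assumes a venue frame with `Neighborhood` and `Venue` columns like the one built later in this notebook; the helper name `drop_capped` and the toy data are illustrative only and were not part of the analysis.

```python
import pandas as pd

LIMIT = 100  # venue cap of the free API tier, as described above

def drop_capped(venues, limit=LIMIT):
    """Keep only rows from neighbourhoods with fewer than `limit` venues."""
    # transform('count') repeats each group's size on every row of the group
    counts = venues.groupby('Neighborhood')['Venue'].transform('count')
    return venues[counts < limit]

# Toy data: 'Busy' hits the cap, 'Quiet' does not
toy = pd.DataFrame({
    'Neighborhood': ['Busy'] * 100 + ['Quiet'] * 3,
    'Venue': ['venue {}'.format(i) for i in range(103)],
})
kept = drop_capped(toy)
print(kept['Neighborhood'].unique())
```

Applying such a filter to the target city would remove the busiest neighbourhoods from the candidate list, so whether it helps depends on how strongly truncation distorts their category mix.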



## 3 Analysis/Methodology

### 3.1 Get data about New York and choose origin hotel

In this section we'll get the information for our origin hotel and its neighbourhood and take a look at the data to understand its structure. To do this we get the data for all neighbourhoods of our origin city, look at the number of venues in each neighbourhood, and then pick an origin hotel as per the note above.

In a second step we'll get the neighbourhood information for our target city.

Get dependencies

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


#### Download neighbourhood information for New York and print structure of result

In [5]:
#!wget -q -O newyork_data.json https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
#print('Data downloaded!')

Data downloaded!


In [7]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [8]:
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

Ok, the data we need is in the attribute 'features'. Let's get that part of the data and look at the first entry:

In [10]:
neighborhoods_data = newyork_data['features']
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Let's pick the coordinates, the borough and the name from the data and create a nice Pandas dataframe out of it

In [12]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighbourhood, then build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates']

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head(5)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Let's verify we have all 5 boroughs and 306 neighbourhoods

In [13]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### Now let's get the venues for these neighbourhoods and pick a neighbourhood for the hotel

First we define our FourSquare credentials and a method to get all the venues for a list of neighbourhoods

In [14]:
import os

# Read the Foursquare credentials from environment variables instead of
# hard-coding them (the variable names are placeholders; use your own)
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Credentials loaded.')

def getNearbyVenues(names, latitudes, longitudes, radius=500):

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-location lists into one dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues

Credentials loaded.
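
`getNearbyVenues` assumes that every response contains at least one group and that every venue has at least one category. A more defensive parser might look like the sketch below. The function name `extract_venue_rows`, the hand-written sample payload, and the choice to skip venues without a category are assumptions made for illustration; only the key names mirror the ones used above.

```python
def extract_venue_rows(payload):
    """Pull (name, lat, lng, category) tuples out of an explore-style payload.

    Venues without a category or coordinates are skipped;
    a payload without groups yields an empty list.
    """
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get('items', []):
        venue = item.get('venue', {})
        categories = venue.get('categories', [])
        location = venue.get('location', {})
        if not categories or 'lat' not in location or 'lng' not in location:
            continue
        rows.append((venue.get('name'), location['lat'], location['lng'],
                     categories[0]['name']))
    return rows

# Hand-written sample payload (not real API output)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Corner Café',
               'location': {'lat': 40.72, 'lng': -74.01},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'No Category Place',
               'location': {'lat': 40.73, 'lng': -74.02},
               'categories': []}},
]}]}}

rows = extract_venue_rows(sample)
print(rows)
```

Skipping incomplete venues keeps the loop over several hundred neighbourhoods from aborting on a single unusual entry, at the cost of silently losing those venues.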


Radius defines our "walking distance" around the center of the neighbourhood.

In [15]:
radius = 1000

##### Get the venues for all neighbourhoods

In [69]:
newyork_venues = getNearbyVenues(names=neighborhoods['Neighborhood'], latitudes=neighborhoods['Latitude'], 
                                 longitudes=neighborhoods['Longitude'],radius=radius)


Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [70]:
print(newyork_venues.shape)
newyork_venues.head(5)

(10153, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
3,Wakefield,40.894705,-73.847201,Subway,40.890468,-73.849152,Sandwich Place
4,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy


### Pick a hotel

#### Let's find the top 10 neighbourhoods with almost 100 venues... we'll choose a hotel from there

**Why are we doing this?** *FourSquare only returns a maximum of 100 venues. To be able to get reasonable results we need to work with neighbourhoods that fall below this maximum, because for other neighbourhoods the data is truncated and therefore erroneous for statistical purposes*

In [71]:
sorted_neighborhoods = newyork_venues.groupby('Neighborhood').count()
sorted_neighborhoods = sorted_neighborhoods[sorted_neighborhoods['Venue'] < 100].sort_values(by='Venue', ascending=False)
sorted_neighborhoods = sorted_neighborhoods.head(10)
sorted_neighborhoods

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Lincoln Square,99,99,99,99,99,99
Clinton Hill,99,99,99,99,99,99
Cobble Hill,98,98,98,98,98,98
Upper West Side,98,98,98,98,98,98
Gramercy,94,94,94,94,94,94
Tribeca,93,93,93,93,93,93
Carnegie Hill,91,91,91,91,91,91
Boerum Hill,89,89,89,89,89,89
Washington Heights,88,88,88,88,88,88
Battery Park City,87,87,87,87,87,87


#### Let's find hotels in these 10 neighborhoods

In [76]:
top10_venues = newyork_venues[newyork_venues['Neighborhood'].isin(sorted_neighborhoods.index.values)]
top10_hotel_venues = top10_venues[top10_venues['Venue Category']=='Hotel']
top10_hotel_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
4268,Lincoln Square,40.773529,-73.985338,The Phillips Club,40.774473,-73.983349,Hotel
4312,Lincoln Square,40.773529,-73.985338,The Empire Hotel,40.771545,-73.98263,Hotel
4990,Tribeca,40.721522,-74.010683,Greenwich Hotel,40.719876,-74.009949,Hotel
5023,Tribeca,40.721522,-74.010683,Arlo SoHo,40.724491,-74.007965,Hotel
5490,Gramercy,40.73721,-73.981376,Freehand New York,40.73974,-73.984622,Hotel
5506,Gramercy,40.73721,-73.981376,Gramercy Park Hotel,40.738533,-73.985651,Hotel
5588,Battery Park City,40.711932,-74.016869,Conrad New York Downtown,40.715035,-74.01584,Hotel
5605,Battery Park City,40.711932,-74.016869,New York Marriott Downtown,40.709504,-74.014672,Hotel
5621,Battery Park City,40.711932,-74.016869,W New York - Downtown,40.709277,-74.013658,Hotel
5627,Battery Park City,40.711932,-74.016869,Courtyard by Marriott New York Downtown Manhat...,40.709386,-74.01266,Hotel


#### Ok, let's just pick "Greenwich Hotel" in Tribeca. Let's check out the neighbourhood:

In [77]:
origin_hotel = top10_hotel_venues[top10_hotel_venues['Venue']=='Greenwich Hotel'].reset_index()
hotel_hood = origin_hotel.loc[0,'Neighborhood']
hotel_hood_venues = newyork_venues[newyork_venues['Neighborhood']==hotel_hood]
sorted_categories = hotel_hood_venues.groupby('Venue Category').count()
sorted_categories = sorted_categories.sort_values(by='Venue', ascending=False)
sorted_categories.head(10)

Venue Category,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
American Restaurant,6,6,6,6,6,6
Park,5,5,5,5,5,5
Italian Restaurant,5,5,5,5,5,5
Wine Bar,4,4,4,4,4,4
Spa,3,3,3,3,3,3
Café,3,3,3,3,3,3
French Restaurant,2,2,2,2,2,2
Steakhouse,2,2,2,2,2,2
Gym / Fitness Center,2,2,2,2,2,2
Hotel,2,2,2,2,2,2


Seems to be a nice environment with a lot of parks, but still a lot of restaurants.

#### Let's get the venues close to the hotel (radius 250 m), then take a look at the first 10 venue categories around the hotel (in alphabetical order)

In [82]:
hotel_venues = getNearbyVenues(names=origin_hotel['Venue'], latitudes=origin_hotel['Venue Latitude'], 
                                 longitudes=origin_hotel['Venue Longitude'],radius=250)
hotel_environment_summary = hotel_venues.groupby('Venue Category').count()
hotel_environment_summary.head(10)

Greenwich Hotel


Venue Category,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
American Restaurant,4,4,4,4,4,4
Art Gallery,2,2,2,2,2,2
Bar,1,1,1,1,1,1
Boutique,3,3,3,3,3,3
Café,1,1,1,1,1,1
Chinese Restaurant,1,1,1,1,1,1
Cocktail Bar,1,1,1,1,1,1
Coffee Shop,1,1,1,1,1,1
Cycle Studio,1,1,1,1,1,1
Greek Restaurant,1,1,1,1,1,1


### 3.2 Get Neighbourhood Information for Target City (Toronto)

#### Now that we've got the data for New York and our origin hotel, and we also know the structure of the data we get from FourSquare let's just get the same neighbourhood information for Toronto

##### Get the names of the boroughs and neighbourhoods

In [83]:
# Import libraries

!pip install wikipedia
import wikipedia

# Get page using wikipedia API

html = wikipedia.page("List of postal codes of Canada: M").html().encode("UTF-8")
df = pd.read_html(html, header = None)[0]
df.head()

# process to extract useful data as a pandas dataframe

neighbourhood_data = []

for row in range(0,df.shape[0]):
    for col in range(0,df.shape[1]):
        cell = df.iloc[row,col]
        if cell.endswith('Not assigned'):
            pass
        else:
            hoods = ((((cell.split('(')[1]).strip(')')).replace(' /',',')).strip(' '))
            data_row = {}
            data_row['PostalCode'] = cell[:3]
            data_row['Borough'] = cell[3:].split('(')[0]
            data_row['NeighbourHood'] = hoods
            neighbourhood_data.append(data_row)

toronto_hoods = pd.DataFrame(neighbourhood_data)
toronto_hoods['Borough']=toronto_hoods['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

toronto_hoods.head(10)



Unnamed: 0,PostalCode,Borough,NeighbourHood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills)North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
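
The entry "Don Mills)North" in the output above suggests that some cells carry text after the closing parenthesis, which `strip(')')` does not handle because it only removes characters at the ends of the string. A regex-based parser could treat such cells explicitly. The sketch below assumes the cell layout implied by the parsing code (three-character postal code, borough, neighbourhoods in parentheses separated by `/`); the sample cells are reconstructed from the output, not copied from the Wikipedia page, and `parse_cell` is a hypothetical helper.

```python
import re

# Assumed cell layout: postal code, borough, "(hood / hood)", optional suffix
CELL_PATTERN = re.compile(
    r'^(?P<code>\w{3})(?P<borough>[^(]+)\((?P<hoods>[^)]*)\)(?P<suffix>.*)$')

def parse_cell(cell):
    """Split one table cell into postal code, borough and neighbourhoods."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        return None  # no parentheses, e.g. a 'Not assigned' cell
    hoods = ', '.join(part.strip() for part in match['hoods'].split('/'))
    suffix = match['suffix'].strip()
    if suffix:
        # Text after the closing parenthesis is kept as a qualifier
        hoods = '{} {}'.format(hoods, suffix)
    return {'PostalCode': match['code'],
            'Borough': match['borough'].strip(),
            'NeighbourHood': hoods}

print(parse_cell('M5ADowntown Toronto(Regent Park / Harbourfront)'))
print(parse_cell('M3BNorth York(Don Mills)North'))
print(parse_cell('M1ANot assigned'))
```

Whether the trailing text belongs to the neighbourhood name or to the borough is a judgment call; the sketch appends it to the neighbourhood.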


#### Now combine with spatial data

Geospatial_Coordinates.csv is from [https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv]

In [84]:
spatial_data = pd.read_csv("Geospatial_Coordinates.csv")
toronto_hoods = pd.merge(toronto_hoods,spatial_data,left_on='PostalCode', right_on='Postal Code')
toronto_hoods = toronto_hoods.drop('Postal Code', axis=1)

In [85]:
toronto_hoods.head(10)

Unnamed: 0,PostalCode,Borough,NeighbourHood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills)North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
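
`pd.merge` performs an inner join by default, so any postal code without a match in the coordinates file would be dropped silently. A quick check for this could look like the sketch below, which uses small toy frames in place of `toronto_hoods` and `spatial_data` (the postal code `M9Z` and its neighbourhood are made up to force a mismatch).

```python
import pandas as pd

# Toy stand-ins for toronto_hoods and the coordinates file
hoods = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                      'NeighbourHood': ['Parkwoods', 'Victoria Village', 'Made-up Hood']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# A left join keeps every neighbourhood; `indicator` adds a `_merge` column
# that records whether a row was found in both frames or only in the left one
merged = pd.merge(hoods, coords, how='left',
                  left_on='PostalCode', right_on='Postal Code', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join used in the notebook lost no neighbourhoods.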


#### And get the venues for Toronto

In [86]:
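# NOTE: no radius is passed here, so the function default of 500 m applies,
# not the 1000 m used for New York above; pass radius=radius for a like-for-like comparison
toronto_venues = getNearbyVenues(names=toronto_hoods['NeighbourHood'],
                                   latitudes=toronto_hoods['Latitude'],
                                   longitudes=toronto_hoods['Longitude']
                                  )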

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills)North
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills)South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
The Danforth East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview)East
The Danforth 

### 3.3 How to compare neighbourhoods in different cities

#### Let's take a quick look at what we got

In [114]:
print('There are {} venues in Toronto'.format(toronto_venues.shape[0]))
print('There are {} unique categories in Toronto.'.format(len(toronto_venues['Venue Category'].unique())))
print('There are {} unique categories in New York.'.format(len(newyork_venues['Venue Category'].unique())))
print('There are {} unique categories in Origin Hotel Hood.'.format(len(hotel_hood_venues['Venue Category'].unique())))

There are 2126 venues in Toronto
There are 218 unique categories in Toronto.
There are 342 unique categories in New York.
There are 45 unique categories in Origin Hotel Hood.



### The Dimensions of Our Comparison Space

#### Ok. So there are a lot of categories we don't need

Let's remove everything from Toronto that doesn't have a category that exists in New York and also remove everything from our New York venues that doesn't have a category that exists in Toronto. This way we'll get the "dimensions" in which neighbourhoods or environments can be compared.

But first we will combine all restaurants into a single "Restaurant" category. This makes restaurants more useful as a criterion.

**Why do we do this?** *Well, if you compare different cities in different countries there is always a big difference in restaurant culture. If you compare venues at the granularity of restaurant types you will not get a good similarity, because the dominant cuisine differs from place to place: a French city will have about as many French restaurants as New York has American ones.*

It would probably be possible to be a bit more fine-grained, e.g. to distinguish between "Restaurant" and "Fast Food", but for simplicity's sake I'll leave it at a single restaurant category. However, we will consolidate "Gym" and "Gym / Fitness Center" into a single "Gym" category.

#### So, first let's combine all types of restaurants into a single category and merge the two types of gym categories

In [187]:
toronto_cleaned = pd.DataFrame(toronto_venues)
toronto_cleaned.loc[toronto_cleaned['Venue Category']
    .isin(['Restaurant','Diner','Burger Joint','Noodle House','Salad Place','Steakhouse']),'Venue Category'] = "Restaurant"
toronto_cleaned.loc[toronto_cleaned['Venue Category'].str.contains('Restaurant')>0,'Venue Category'] = "Restaurant"
toronto_cleaned.loc[toronto_cleaned['Venue Category'].str.contains('Gym / Fitness Center') > 0,'Venue Category'] = "Gym"

In [188]:
newyork_cleaned = pd.DataFrame(newyork_venues)
newyork_cleaned.loc[newyork_cleaned['Venue Category']
                    .isin(['Restaurant','Diner','Burger Joint','Noodle House','Salad Place','Steakhouse']),'Venue Category'] = "Restaurant"
# NOTE: the str.contains('Restaurant') step applied to the other frames is not applied here
newyork_cleaned.loc[newyork_cleaned['Venue Category'].str.contains('Gym / Fitness Center') > 0,'Venue Category'] = "Gym"

In [189]:
hotel_hood_cleaned = pd.DataFrame(hotel_hood_venues)
hotel_hood_cleaned.loc[hotel_hood_cleaned['Venue Category']
                       .isin(['Restaurant','Diner','Burger Joint','Noodle House','Salad Place','Steakhouse']),'Venue Category'] = "Restaurant"
hotel_hood_cleaned.loc[hotel_hood_cleaned['Venue Category'].str.contains('Restaurant')>0,'Venue Category'] = "Restaurant"
hotel_hood_cleaned.loc[hotel_hood_cleaned['Venue Category'].str.contains('Gym / Fitness Center') > 0,'Venue Category'] = "Gym"

In [190]:
hotel_venues_cleaned = pd.DataFrame(hotel_venues)
hotel_venues_cleaned.loc[hotel_venues_cleaned['Venue Category']
                         .isin(['Restaurant','Diner','Burger Joint','Noodle House','Salad Place','Steakhouse']),'Venue Category'] = "Restaurant"
hotel_venues_cleaned.loc[hotel_venues_cleaned['Venue Category'].str.contains('Restaurant')>0,'Venue Category'] = "Restaurant"
hotel_venues_cleaned.loc[hotel_venues_cleaned['Venue Category'].str.contains('Gym / Fitness Center') > 0,'Venue Category'] = "Gym"
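
The same three consolidation rules are repeated for each dataframe above. As a sketch, they could be expressed once as a plain function; `consolidate_category` and `RESTAURANT_LIKE` are hypothetical names introduced here, and the rules are copied from the cells above.

```python
# Categories treated as restaurants even though their name lacks the word
RESTAURANT_LIKE = {'Restaurant', 'Diner', 'Burger Joint', 'Noodle House',
                   'Salad Place', 'Steakhouse'}

def consolidate_category(name):
    """Map a raw venue category to its consolidated form."""
    if name in RESTAURANT_LIKE or 'Restaurant' in name:
        return 'Restaurant'
    if 'Gym / Fitness Center' in name:
        return 'Gym'
    return name

raw = ['Italian Restaurant', 'Steakhouse', 'Gym / Fitness Center', 'Gym', 'Park']
print([consolidate_category(c) for c in raw])
```

On a dataframe this could be applied with `df['Venue Category'].map(consolidate_category)`, ideally on a `.copy()` of the frame so that the raw venue data stays untouched.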

In [191]:
print('There are {} unique categories in Toronto.'.format(len(toronto_cleaned['Venue Category'].unique())))
print('There are {} unique categories in New York.'.format(len(newyork_cleaned['Venue Category'].unique())))
print('There are {} unique categories in Origin Hotel Hood.'.format(len(hotel_hood_cleaned['Venue Category'].unique())))
print('There are {} unique categories in Origin Hotel Vicinity.'.format(len(hotel_venues_cleaned['Venue Category'].unique())))

There are 212 unique categories in Toronto.
There are 336 unique categories in New York.
There are 39 unique categories in Origin Hotel Hood.
There are 20 unique categories in Origin Hotel Vicinity.


#### Now we remove all venues from Toronto that are in categories we don't find in our origin hotel hood and all venues in our origin hotel hood and origin hotel vicinity that have categories that don't exist in Toronto

In [192]:
toronto_cleaned = toronto_cleaned[toronto_cleaned['Venue Category'].isin(hotel_hood_cleaned['Venue Category'])]

In [193]:
hotel_hood_cleaned = hotel_hood_cleaned[hotel_hood_cleaned['Venue Category'].isin(toronto_cleaned['Venue Category'])]

In [194]:
hotel_venues_cleaned = hotel_venues_cleaned[hotel_venues_cleaned['Venue Category'].isin(toronto_cleaned['Venue Category'])]

In [195]:
print('There are {} unique categories in Toronto.'.format(len(toronto_cleaned['Venue Category'].unique())))
print('There are {} unique categories in Origin Hotel Hood.'.format(len(hotel_hood_cleaned['Venue Category'].unique())))
print('There are {} unique categories in Origin Hotel Vicinity.'.format(len(hotel_venues_cleaned['Venue Category'].unique())))

There are 28 unique categories in Toronto.
There are 28 unique categories in Origin Hotel Hood.
There are 15 unique categories in Origin Hotel Vicinity.
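
The filtering above can be illustrated on a small scale. The following is a minimal sketch on made-up data (the frames `target` and `origin` and their venues are hypothetical, not the notebook's data). It uses an explicit set intersection, which for two frames gives the same result as the two consecutive `isin` filters used in the notebook.

```python
import pandas as pd

# Hypothetical toy data, not the venues used in this notebook
target = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "Venue Category": ["Café", "Zoo", "Park", "Café"],
})
origin = pd.DataFrame({
    "Neighborhood": ["Origin"] * 3,
    "Venue Category": ["Café", "Park", "Wine Bar"],
})

# Categories that occur in both frames
shared = set(target["Venue Category"]) & set(origin["Venue Category"])

# Keep only the venues whose category is shared
target_shared = target[target["Venue Category"].isin(shared)]
origin_shared = origin[origin["Venue Category"].isin(shared)]

print(sorted(shared))
print(target_shared["Venue Category"].nunique(),
      origin_shared["Venue Category"].nunique())
```

After this step both frames contain exactly the same set of categories, which is what makes their count vectors comparable later on.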


Let's take a look at which categories remain:

In [138]:
toronto_cleaned.groupby('Venue Category').count()

Venue Category,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Art Gallery,14,14,14,14,14,14
Bakery,47,47,47,47,47,47
Bar,29,29,29,29,29,29
Boutique,3,3,3,3,3,3
Bridal Shop,1,1,1,1,1,1
Café,97,97,97,97,97,97
Clothing Store,33,33,33,33,33,33
Coffee Shop,191,191,191,191,191,191
Dog Run,3,3,3,3,3,3
Event Space,3,3,3,3,3,3


### Convert to one-hot encoding and calculate sum per neighbourhood to get a single vector for each neighbourhood

In [144]:
# One-hot encode the venue categories (one column per category)
toronto_onehot = pd.get_dummies(toronto_cleaned[['Venue Category']], prefix="", prefix_sep="")

# A venue category named 'Neighborhood' would clash with the column added below;
# none remains after the filtering above, so no column has to be dropped.

# Add the neighborhood column back to the dataframe
toronto_onehot['Neighborhood'] = toronto_cleaned['Neighborhood']

# Move the neighborhood column to the first position
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# Sum the one-hot rows to get one count vector per neighborhood
toronto_onehot = toronto_onehot.groupby('Neighborhood').sum().reset_index()
toronto_onehot.head()

Unnamed: 0,Neighborhood,Art Gallery,Bakery,Bar,Boutique,Bridal Shop,Café,Clothing Store,Coffee Shop,Dog Run,Event Space,Gastropub,Gym,Gym Pool,Hotel,Men's Store,Park,Performing Arts Venue,Playground,Poke Place,Pub,Restaurant,Scenic Lookout,Skate Park,Spa,Stationery Store,Wine Bar,Wine Shop,Yoga Studio
0,Agincourt,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,"Alderwood, Long Branch",0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,"Bathurst Manor, Wilson Heights, Downsview North",0,0,0,0,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0
3,Bayview Village,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0
4,"Bedford Park, Lawrence Manor East",0,0,0,1,0,1,0,2,0,0,0,0,0,0,0,0,0,0,0,1,10,0,0,0,0,0,0,0
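
To make the encoding step easier to follow, here is a minimal sketch on made-up data (the neighbourhoods `A` and `B` and their venues are hypothetical). It one-hot encodes the category column and sums per neighbourhood, as in the cell above. For comparison it also builds the same count table with `pd.crosstab`, which is an alternative and not what this notebook uses.

```python
import pandas as pd

# Hypothetical toy data: three venues in A, one venue in B
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Café", "Café", "Park", "Park"],
})

# One column per category, 1 where the venue has that category
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Summing the one-hot rows gives the number of venues per category
counts = onehot.groupby("Neighborhood").sum()

# Alternative: crosstab produces the same counts in a single call
counts_alt = pd.crosstab(venues["Neighborhood"], venues["Venue Category"])

print(counts)
```

Each row of `counts` is the vector that represents one neighbourhood in the distance calculation.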


In [145]:
# One-hot encode the venue categories (one column per category)
hotelhood_onehot = pd.get_dummies(hotel_hood_cleaned[['Venue Category']], prefix="", prefix_sep="")

# A venue category named 'Neighborhood' would clash with the column added below;
# none remains after the filtering above, so no column has to be dropped.

# Add the neighborhood column back to the dataframe
hotelhood_onehot['Neighborhood'] = hotel_hood_cleaned['Neighborhood']

# Move the neighborhood column to the first position
fixed_columns = [hotelhood_onehot.columns[-1]] + list(hotelhood_onehot.columns[:-1])
hotelhood_onehot = hotelhood_onehot[fixed_columns]

# Sum the one-hot rows to get a single count vector for the origin neighborhood
hotelhood_onehot = hotelhood_onehot.groupby('Neighborhood').sum().reset_index()
hotelhood_onehot.head()

Unnamed: 0,Neighborhood,Art Gallery,Bakery,Bar,Boutique,Bridal Shop,Café,Clothing Store,Coffee Shop,Dog Run,Event Space,Gastropub,Gym,Gym Pool,Hotel,Men's Store,Park,Performing Arts Venue,Playground,Poke Place,Pub,Restaurant,Scenic Lookout,Skate Park,Spa,Stationery Store,Wine Bar,Wine Shop,Yoga Studio
0,Tribeca,1,1,2,2,1,3,1,2,1,1,1,3,1,2,2,5,1,2,2,1,31,2,2,3,1,4,2,1


## Find closest neighbourhoods

In [159]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
allhoods = pd.concat([hotelhood_onehot, toronto_onehot])
allhoods.head()

Unnamed: 0,Neighborhood,Art Gallery,Bakery,Bar,Boutique,Bridal Shop,Café,Clothing Store,Coffee Shop,Dog Run,Event Space,Gastropub,Gym,Gym Pool,Hotel,Men's Store,Park,Performing Arts Venue,Playground,Poke Place,Pub,Restaurant,Scenic Lookout,Skate Park,Spa,Stationery Store,Wine Bar,Wine Shop,Yoga Studio
0,Tribeca,1,1,2,2,1,3,1,2,1,1,1,3,1,2,2,5,1,2,2,1,31,2,2,3,1,4,2,1
0,Agincourt,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,"Alderwood, Long Branch",0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,"Bathurst Manor, Wilson Heights, Downsview North",0,0,0,0,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0
3,Bayview Village,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0


### Calculate distance matrix and sort by distance from Tribeca

In [210]:
from scipy.spatial.distance import pdist, squareform

# Keep only the numeric category counts
allhoods_data = allhoods.drop('Neighborhood', axis=1)

# Pairwise Euclidean distances between all neighbourhood vectors
distances = pdist(allhoods_data.values, metric='euclidean')
dist_matrix = squareform(distances)
distances = pd.DataFrame(dist_matrix, index=allhoods['Neighborhood'], columns=allhoods['Neighborhood'])

# Sort by distance from Tribeca and drop Tribeca itself (distance 0)
distances = distances.sort_values(by="Tribeca")
distances = distances.drop("Tribeca", axis=0)
top3 = distances.head(3).index.values
print("Top 3 are ", top3)
distances.head(10)

Top 3 are  ['Church and Wellesley' 'St. James Town' 'Richmond, Adelaide, King']


Neighborhood,Tribeca,Agincourt,"Alderwood, Long Branch","Bathurst Manor, Wilson Heights, Downsview North",Bayview Village,"Bedford Park, Lawrence Manor East",Berczy Park,"Birch Cliff, Cliffside West","Brockton, Parkdale Village, Exhibition Place","CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",Caledonia-Fairbanks,Cedarbrae,Central Bay Street,Christie,Church and Wellesley,"Clairville, Humberwood, Woodbine Downs, West Humber, Kipling Heights, Rexdale, Elms, Tandridge, Old Rexdale","Clarks Corners, Tam O'Shanter, Sullivan","Cliffside, Cliffcrest, Scarborough Village West","Commerce Court, Victoria Hotel",Davisville,Davisville North,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",Don Mills)North,Don Mills)South,"Dorset Park, Wexford Heights, Scarborough Town Centre",Downsview)East,Downsview)Northwest,Downsview)West,"Dufferin, Dovercourt Village",Enclave of L4W,Enclave of M4L,Enclave of M5E,"Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood","Fairview, Henry Farm, Oriole","First Canadian Place, Underground city",Forest Hill North & West,"Garden District, Ryerson",Glencairn,"Golden Mile, Clairlea, Oakridge","Guildwood, Morningside, West Hill","Harbourfront East, Union Station, Toronto Islands","High Park, The Junction South",Hillcrest Village,Humber Summit,"India Bazaar, The Beaches West","Kennedy Park, Ionview, East Birchmount Park","Kensington Market, Chinatown, Grange Park","Kingsview Village, St. Phillips, Martin Grove Gardens, Richview Gardens","Lawrence Manor, Lawrence Heights",Lawrence Park,Leaside,"Little Portugal, Trinity","Malvern, Rouge","Milliken, Agincourt North, Steeles East, L'Amoreaux East","Mimico NW, The Queensway West, South of Bloor, Kingsway Park South West, Royal York South West","Moore Park, Summerhill East","New Toronto, Mimico South, Humber Bay Shores","North Park, Maple Leaf Park, Upwood Park",North Toronto West,"Northwood Park, York University",Ontario Provincial Government,"Parkdale, Roncesvalles","Parkview Hill, Woodbine Gardens",Parkwoods,"Regent Park, Harbourfront","Richmond, Adelaide, King",Rosedale,Roselawn,"Rouge Hill, Port Union, Highland Creek","Runnymede, Swansea",Scarborough Village,"South Steeles, Silverstone, Humbergate, Jamestown, Mount Olive, Beaumond Heights, Thistletown, Albion Gardens",St. James Town,"St. James Town, Cabbagetown","Steeles West, L'Amoreaux West",Studio District,"Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park","The Annex, North Midtown, Yorkville",The Beaches,The Danforth East,"The Danforth West, Riverdale","The Kingsway, Montgomery Road, Old Mill North",Thorncliffe Park,"Toronto Dominion Centre, Design Exchange","University of Toronto, Harbord",Victoria Village,"West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale",Westmount,Weston,"Wexford, Maryvale",Willowdale)South,Willowdale)West,"Willowdale, Newtonbrook",Woburn,Woodbine Heights,York Mills West
Church and Wellesley,10.392305,29.034462,29.799329,25.806976,28.035692,19.899749,14.142136,29.966648,27.748874,30.033315,30.016662,27.147744,8.185353,27.838822,0.0,30.033315,24.269322,29.051678,13.152946,20.0,30.016662,29.051678,28.05352,23.937418,26.172505,30.0,30.033315,30.0,29.068884,27.549955,28.089144,8.42615,29.765752,17.888544,10.049876,29.034462,9.69536,28.089144,30.066593,28.089144,18.303005,21.354157,28.071338,30.033315,24.228083,29.832868,13.152946,30.0,28.94823,30.0,24.677925,13.527749,29.051678,30.016662,28.124722,30.016662,26.907248,30.016662,23.727621,28.879058,23.4094,25.96151,29.966648,29.034462,25.690465,8.660254,30.033315,29.051678,30.033315,20.615528,30.049958,29.051678,7.483315,17.720045,25.0,20.760539,25.729361,26.495283,29.966648,30.0,13.784049,30.0,24.939928,10.488088,22.693611,28.861739,28.106939,27.910571,30.0,27.147744,17.944358,29.832868,30.016662,28.7054,30.0,30.0
St. James Town,10.954451,25.475478,26.229754,22.36068,24.372115,16.613248,10.86278,26.267851,23.664319,26.457513,26.400758,23.558438,7.549834,23.853721,7.483315,26.457513,20.856654,25.495098,14.525839,16.124515,26.324893,25.495098,24.310492,20.273135,22.693611,26.381812,26.381812,26.381812,25.199206,24.062419,24.515301,7.28011,26.038433,15.231546,9.327379,25.436195,9.380832,24.433583,26.305893,24.556058,16.340135,17.492856,24.576411,26.381812,20.712315,26.267851,9.0,26.381812,25.41653,26.381812,21.283797,10.630146,25.495098,26.400758,24.433583,26.400758,23.10844,26.324893,20.223748,25.337719,19.949937,22.538855,26.153394,25.436195,21.863211,9.327379,26.381812,25.495098,26.457513,17.058722,26.476405,25.495098,0.0,13.784049,21.563859,16.763055,22.36068,22.671568,26.457513,26.381812,11.045361,26.381812,21.447611,11.045361,18.466185,25.317978,24.494897,24.392622,26.381812,23.558438,14.491377,26.267851,26.362853,25.179357,26.381812,26.381812
"Richmond, Adelaide, King",12.529964,32.649655,33.27161,29.410882,31.670175,23.853721,18.083141,33.511192,30.708305,33.600595,33.674916,30.854497,9.486833,31.240999,8.660254,33.600595,28.213472,32.726136,7.348469,23.430749,33.376639,32.726136,31.559468,27.239677,30.0,33.660065,33.54102,33.660065,32.496154,30.757113,31.843367,9.486833,33.24154,20.174241,4.472136,32.741411,7.141428,31.780497,33.600595,31.811947,16.8523,25.159491,31.827661,33.54102,28.178006,33.361655,15.684387,33.660065,32.295511,33.660065,28.213472,17.832555,32.726136,33.674916,31.654384,33.674916,30.282008,33.615473,27.313001,32.388269,26.362853,29.647934,33.391616,32.741411,28.089144,0.0,33.719431,32.726136,33.600595,24.041631,33.674916,32.726136,9.327379,21.0,28.774989,23.916521,29.410882,29.74895,33.660065,33.660065,17.972201,33.660065,28.687977,6.082763,25.884358,32.434549,31.76476,31.527766,33.660065,30.854497,21.794495,33.361655,33.704599,32.171416,33.660065,33.660065
"First Canadian Place, Underground city",13.228757,32.465366,32.908965,29.240383,31.38471,23.769729,18.303005,33.18132,30.248967,33.361655,33.406586,30.659419,10.0,30.82207,10.049876,33.361655,28.106939,32.480764,6.78233,23.130067,32.924155,32.480764,31.208973,27.055499,29.832868,33.391616,33.211444,33.391616,32.218007,30.397368,31.622777,9.591663,32.908965,21.330729,0.0,32.496154,8.660254,31.559468,33.331667,31.591138,16.370706,25.0,31.606961,33.211444,27.964263,33.090784,15.937377,33.391616,32.264532,33.391616,28.071338,18.165902,32.480764,33.406586,31.368774,33.406586,29.949958,33.346664,27.276363,32.171416,26.134269,29.512709,32.969683,32.496154,27.513633,4.472136,33.451457,32.480764,33.361655,23.664319,33.406586,32.480764,9.327379,20.615528,28.635642,23.706539,29.206164,29.308702,33.361655,33.391616,18.193405,33.391616,28.478062,5.744563,25.41653,32.186954,31.543621,31.304952,33.391616,30.659419,21.702534,33.090784,33.436507,31.921779,33.391616,33.391616
Enclave of M5E,14.035669,26.11513,26.476405,22.781571,25.159491,17.406895,11.445523,26.925824,24.228083,27.037012,26.981475,24.289916,6.0,24.617067,8.42615,27.037012,21.863211,26.134269,12.961481,17.291616,26.832816,26.134269,25.099801,20.832667,23.537205,26.962938,26.962938,26.962938,25.845696,24.083189,25.21904,0.0,26.514147,15.394804,9.591663,26.07681,7.416198,25.099801,26.814175,25.258662,12.649111,19.052559,25.278449,26.962938,21.540659,26.664583,11.661904,26.962938,25.865034,26.962938,21.540659,12.884099,26.134269,26.981475,25.099801,26.981475,23.727621,26.870058,20.832667,25.787594,19.570386,23.17326,26.851443,26.07681,20.420578,9.486833,26.962938,26.134269,27.037012,17.435596,27.055499,26.134269,7.28011,14.106736,22.271057,17.492856,22.605309,22.912878,26.888659,26.962938,11.7047,26.962938,22.113344,9.848858,19.949937,25.768197,25.159491,24.899799,26.962938,24.289916,15.84298,26.664583,26.944387,25.436195,26.962938,26.962938
Central Bay Street,14.247807,25.845696,26.324893,22.338308,24.839485,17.058722,12.369317,26.627054,23.979158,26.739484,26.683328,24.083189,0.0,24.33105,8.185353,26.739484,21.494185,25.826343,12.961481,16.881943,26.720778,25.826343,24.819347,20.59126,23.194827,26.70206,26.70206,26.70206,25.806976,23.748684,24.939928,6.0,26.210685,15.459625,10.0,25.806976,7.28011,24.939928,26.776856,24.939928,11.313708,18.841444,24.959968,26.70206,21.494185,26.324893,11.575837,26.70206,25.632011,26.70206,21.023796,12.247449,25.826343,26.720778,24.939928,26.720778,23.515952,26.720778,20.445048,25.436195,18.947295,22.781571,26.589472,25.806976,20.952327,9.486833,26.739484,25.826343,26.739484,17.320508,26.720778,25.826343,7.549834,14.387495,21.863211,17.378147,22.338308,22.649503,26.739484,26.70206,11.18034,26.70206,21.794495,10.246951,20.049938,25.41653,24.959968,24.535688,26.70206,24.083189,15.264338,26.324893,26.720778,25.039968,26.70206,26.70206
"Toronto Dominion Centre, Design Exchange",14.764823,33.090784,33.585711,29.799329,32.062439,24.494897,18.814888,33.823069,30.91925,33.941125,34.014703,31.288976,10.246951,31.511903,10.488088,33.941125,28.827071,33.105891,6.708204,24.248711,33.600595,33.105891,32.015621,27.946377,30.512293,34.0,33.941125,34.0,32.787193,30.805844,32.264532,9.848858,33.496268,21.679483,5.744563,33.12099,8.3666,32.171416,33.882149,32.233523,15.32971,25.806976,32.249031,33.941125,28.827071,33.645208,16.643317,34.0,32.83291,34.0,28.583212,19.0,33.105891,34.014703,32.109189,34.014703,30.692019,33.926391,27.910571,32.710854,26.608269,30.099834,33.734256,33.12099,27.676705,6.082763,34.058773,33.105891,33.941125,24.515301,34.014703,33.105891,11.045361,21.447611,29.274562,24.433583,29.765752,29.899833,33.970576,34.0,18.867962,34.0,29.257478,0.0,26.286879,32.756679,32.15587,31.890437,34.0,31.288976,22.405357,33.645208,34.044089,32.434549,34.0,34.0
"Garden District, Ryerson",14.764823,28.827071,29.427878,25.651511,28.035692,20.542639,15.231546,29.765752,27.018512,29.866369,29.849623,27.294688,7.28011,27.477263,9.69536,29.866369,24.839485,29.017236,10.908712,20.297783,29.681644,29.017236,27.982137,23.600847,26.476405,29.866369,29.832868,29.866369,28.861739,26.814175,28.160256,7.416198,29.359837,15.427249,8.660254,29.0,0.0,28.124722,29.866369,28.160256,12.845233,22.135944,28.178006,29.832868,24.758837,29.495762,14.317821,29.866369,28.178006,29.866369,24.392622,15.716234,29.017236,29.883106,28.089144,29.883106,26.683328,29.849623,23.345235,28.600699,22.315914,26.038433,29.698485,29.0,23.706539,7.141428,29.899833,29.017236,29.866369,20.615528,29.883106,29.017236,9.380832,17.720045,25.199206,20.420578,25.612497,25.806976,29.866369,29.866369,15.099669,29.866369,25.13961,8.3666,23.0,28.618176,28.142495,27.766887,29.866369,27.294688,18.761663,29.495762,29.883106,28.248894,29.866369,29.866369
"Kensington Market, Chinatown, Grange Park",15.524175,19.949937,20.760539,17.058722,18.788294,11.789826,7.681146,20.615528,17.916473,20.615528,20.78461,18.0,11.575837,18.220867,13.152946,20.615528,15.556349,19.924859,21.307276,11.445523,20.976177,19.924859,18.814888,15.491933,17.262677,20.760539,20.856654,20.760539,19.183326,18.920888,18.973666,11.661904,20.420578,12.60952,15.937377,19.849433,14.317821,18.814888,20.566964,19.026298,17.378147,11.874342,19.052559,20.856654,15.491933,20.712315,0.0,20.760539,20.07486,20.760539,16.124515,5.830952,19.924859,20.78461,18.920888,20.78461,17.635192,20.639767,15.099669,19.570386,14.933185,16.881943,20.663978,19.849433,17.291616,15.684387,20.760539,19.924859,20.615528,11.83216,20.880613,19.924859,9.0,9.433981,16.248077,11.661904,17.058722,17.233688,20.856654,20.760539,7.937254,20.760539,16.217275,16.643317,12.165525,19.79899,18.894444,18.920888,20.760539,18.0,9.949874,20.712315,20.736441,19.723083,20.760539,20.760539
"Commerce Court, Victoria Hotel",16.583124,37.576588,37.960506,34.190642,36.565011,28.861739,23.259407,38.327536,35.425979,38.483763,38.470768,35.805028,12.961481,35.944402,13.152946,38.483763,33.286634,37.589892,0.0,28.407745,38.052595,37.589892,36.441734,32.15587,34.985711,38.457769,38.353618,38.457769,37.389838,35.24202,36.71512,12.961481,37.934153,25.70992,6.78233,37.576588,10.908712,36.660606,38.405729,36.71512,17.492856,30.380915,36.728735,38.353618,33.136083,38.091994,21.307276,38.457769,37.26929,38.457769,32.924155,23.4094,37.589892,38.470768,36.551334,38.470768,35.1141,38.418745,32.280025,37.215588,30.740852,34.597688,38.170669,37.576588,31.796226,7.348469,38.483763,37.589892,38.483763,28.879058,38.496753,37.589892,14.525839,25.70992,33.704599,28.84441,34.161382,34.278273,38.457769,38.457769,23.043437,38.457769,33.570821,6.708204,31.048349,37.20215,36.674242,36.331804,38.457769,35.805028,26.888659,38.091994,38.470768,36.837481,38.457769,38.457769


#### Looking at the distance values, choosing the top 3 seems to be reasonable as the distance goes up quickly

Let's take a quick look at how the origin neighbourhood and the top 3 target neighbourhoods compare:

In [211]:
allhoods[allhoods['Neighborhood'].isin(np.append(top3,['Tribeca']))]

Unnamed: 0,Neighborhood,Art Gallery,Bakery,Bar,Boutique,Bridal Shop,Café,Clothing Store,Coffee Shop,Dog Run,Event Space,Gastropub,Gym,Gym Pool,Hotel,Men's Store,Park,Performing Arts Venue,Playground,Poke Place,Pub,Restaurant,Scenic Lookout,Skate Park,Spa,Stationery Store,Wine Bar,Wine Shop,Yoga Studio
0,Tribeca,1,1,2,2,1,3,1,2,1,1,1,3,1,2,2,5,1,2,2,1,31,2,2,3,1,4,2,1
13,Church and Wellesley,0,0,0,0,0,2,1,6,1,0,1,0,0,2,2,1,0,0,0,2,29,0,0,0,0,0,0,2
64,"Richmond, Adelaide, King",1,2,2,0,0,5,3,10,0,1,1,4,0,3,0,0,0,0,1,0,31,0,0,0,0,0,0,0
71,St. James Town,2,2,0,0,0,5,1,5,0,0,2,2,0,1,0,2,1,0,0,0,25,0,0,0,0,1,0,0
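
The rows above can be used to check one entry of the distance table by hand. The following sketch copies the Tribeca and Church and Wellesley count vectors from the table and computes their Euclidean distance with NumPy only (the variable names are chosen for this illustration). The sum of squared differences is 108, so the distance is the square root of 108.

```python
import numpy as np

# Category counts copied from the comparison table above
tribeca = np.array([1, 1, 2, 2, 1, 3, 1, 2, 1, 1, 1, 3, 1, 2,
                    2, 5, 1, 2, 2, 1, 31, 2, 2, 3, 1, 4, 2, 1])
church_wellesley = np.array([0, 0, 0, 0, 0, 2, 1, 6, 1, 0, 1, 0, 0, 2,
                             2, 1, 0, 0, 0, 2, 29, 0, 0, 0, 0, 0, 0, 2])

# Euclidean distance: square root of the sum of squared differences
diff = tribeca - church_wellesley
squared_sum = int(np.sum(diff ** 2))
distance = np.sqrt(squared_sum)

print(squared_sum)         # 108
print(round(distance, 6))  # 10.392305
```

The result matches the value 10.392305 reported for Church and Wellesley in the distance table.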


## 3.4 Find most similar hotels

### Get hotels in top 3 closest neighbourhoods

In [172]:
top3_hotels = toronto_cleaned[toronto_cleaned['Neighborhood'].isin(top3)]
top3_hotels = top3_hotels[top3_hotels['Venue Category']=='Hotel']
top3_hotels

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
318,St. James Town,43.651494,-79.375418,Hotel Victoria,43.648084,-79.377582,Hotel
594,"Richmond, Adelaide, King",43.650571,-79.384568,Shangri-La Toronto,43.649129,-79.386557,Hotel
636,"Richmond, Adelaide, King",43.650571,-79.384568,The Adelaide Hotel Toronto,43.649831,-79.380164,Hotel
668,"Richmond, Adelaide, King",43.650571,-79.384568,DoubleTree by Hilton Hotel Toronto Downtown,43.654608,-79.385942,Hotel
2079,Church and Wellesley,43.66586,-79.38316,Town Inn Suites,43.669056,-79.382573,Hotel
2091,Church and Wellesley,43.66586,-79.38316,The Anndore House,43.668801,-79.385413,Hotel


### Get venues around hotels in top 3 neighbourhoods

In [173]:
top3_venues = getNearbyVenues(names=top3_hotels['Venue'],
                              latitudes=top3_hotels['Venue Latitude'],
                              longitudes=top3_hotels['Venue Longitude'],
                              radius=250
                             )
top3_venues.head()

Hotel Victoria
Shangri-La Toronto
The Adelaide Hotel Toronto
DoubleTree by Hilton Hotel Toronto Downtown
Town Inn Suites
The Anndore House


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hotel Victoria,43.648084,-79.377582,Hockey Hall Of Fame (Hockey Hall of Fame),43.646974,-79.377323,Museum
1,Hotel Victoria,43.648084,-79.377582,Mos Mos Coffee,43.648159,-79.378745,Café
2,Hotel Victoria,43.648084,-79.377582,Beerbistro,43.649419,-79.377237,Gastropub
3,Hotel Victoria,43.648084,-79.377582,Equinox Bay Street,43.6481,-79.379989,Gym
4,Hotel Victoria,43.648084,-79.377582,Berczy Park,43.648048,-79.375172,Park


In [180]:
print(top3_venues.shape)
top3_venues.groupby("Venue Category").count()

(250, 7)


Venue Category,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
American Restaurant,5,5,5,5,5,5
Art Gallery,3,3,3,3,3,3
Art Museum,1,1,1,1,1,1
Asian Restaurant,2,2,2,2,2,2
Bagel Shop,2,2,2,2,2,2
Bakery,2,2,2,2,2,2
Bank,2,2,2,2,2,2
Bar,2,2,2,2,2,2
Beer Bar,2,2,2,2,2,2
Bistro,1,1,1,1,1,1


### Clean up categories of found venues and origin hotel venues

In [196]:
# Work on a copy so that the raw top3_venues frame is left untouched
top3_cleaned = top3_venues.copy()
top3_cleaned.loc[top3_cleaned['Venue Category']
    .isin(['Diner','Burger Joint','Noodle House','Salad Place','Steakhouse']),'Venue Category'] = "Restaurant"
top3_cleaned.loc[top3_cleaned['Venue Category'].str.contains('Restaurant', na=False),'Venue Category'] = "Restaurant"
top3_cleaned.loc[top3_cleaned['Venue Category'].str.contains('Gym / Fitness Center', na=False),'Venue Category'] = "Gym"
# Show the cleaned categories
top3_cleaned.groupby("Venue Category").count()

Venue Category,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Art Gallery,3,3,3,3,3,3
Art Museum,1,1,1,1,1,1
Bagel Shop,2,2,2,2,2,2
Bakery,2,2,2,2,2,2
Bank,2,2,2,2,2,2
Bar,2,2,2,2,2,2
Beer Bar,2,2,2,2,2,2
Bistro,1,1,1,1,1,1
Bookstore,2,2,2,2,2,2
Breakfast Spot,5,5,5,5,5,5
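
The same category clean-up is applied to several dataframes in this notebook, so it could be wrapped in a small helper. The following is a sketch only: the function name `consolidate_categories` and the constant `RESTAURANT_LIKE` are made up for this illustration, and the list of restaurant-like categories is the one from the cell above.

```python
import pandas as pd

# Categories treated as restaurants, taken from the clean-up cell above
RESTAURANT_LIKE = ["Diner", "Burger Joint", "Noodle House", "Salad Place", "Steakhouse"]

def consolidate_categories(venues: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which restaurant-like and gym categories are merged."""
    cleaned = venues.copy()
    category = cleaned["Venue Category"]
    # na=False treats missing categories as non-matching
    is_restaurant = category.isin(RESTAURANT_LIKE) | category.str.contains("Restaurant", na=False)
    is_gym = category.str.contains("Gym / Fitness Center", na=False, regex=False)
    cleaned.loc[is_restaurant, "Venue Category"] = "Restaurant"
    cleaned.loc[is_gym, "Venue Category"] = "Gym"
    return cleaned

# Hypothetical sample to show the effect
sample = pd.DataFrame(
    {"Venue Category": ["Asian Restaurant", "Diner", "Gym / Fitness Center", "Café"]}
)
result_categories = consolidate_categories(sample)["Venue Category"].tolist()
print(result_categories)  # ['Restaurant', 'Restaurant', 'Gym', 'Café']
```

Because the helper returns a copy, the input frame keeps its original categories and can still be inspected afterwards.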


For comparison, here is the environment of our origin hotel:

In [197]:
hotel_venues_cleaned.groupby("Venue Category").count()

Venue Category,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Art Gallery,2,2,2,2,2,2
Bar,1,1,1,1,1,1
Boutique,3,3,3,3,3,3
Café,1,1,1,1,1,1
Coffee Shop,1,1,1,1,1,1
Gym,1,1,1,1,1,1
Gym Pool,1,1,1,1,1,1
Hotel,1,1,1,1,1,1
Men's Store,2,2,2,2,2,2
Poke Place,1,1,1,1,1,1


In [214]:
# Keep only the categories that occur around both the origin hotel and the target hotels
top3_cleaned = top3_cleaned[top3_cleaned['Venue Category'].isin(hotel_venues_cleaned['Venue Category'])]
hotel_venues_cleaned = hotel_venues_cleaned[hotel_venues_cleaned['Venue Category'].isin(top3_cleaned['Venue Category'])]

print('There are {} unique categories in Origin Hotel Vicinity.'.format(len(hotel_venues_cleaned['Venue Category'].unique())))
print('There are {} unique categories in Target Hotel Vicinity.'.format(len(top3_cleaned['Venue Category'].unique())))

There are 10 unique categories in Origin Hotel Vicinity.
There are 10 unique categories in Target Hotel Vicinity.


### Convert venue categories to one-hot encoding

In [199]:
# One-hot encode the venue categories (one column per category)
hotel_onehot = pd.get_dummies(hotel_venues_cleaned[['Venue Category']], prefix="", prefix_sep="")

# A venue category named 'Neighborhood' would clash with the column added below;
# none remains after the filtering above, so no column has to be dropped.

# Add the 'Neighborhood' column (it holds the hotel name here) back to the dataframe
hotel_onehot['Neighborhood'] = hotel_venues_cleaned['Neighborhood']

# Move that column to the first position
fixed_columns = [hotel_onehot.columns[-1]] + list(hotel_onehot.columns[:-1])
hotel_onehot = hotel_onehot[fixed_columns]

# Sum the one-hot rows to get a single count vector for the origin hotel
hotel_onehot = hotel_onehot.groupby('Neighborhood').sum().reset_index()
hotel_onehot.head()

Unnamed: 0,Neighborhood,Art Gallery,Bar,Café,Coffee Shop,Gym,Hotel,Men's Store,Poke Place,Restaurant,Wine Bar
0,Greenwich Hotel,2,1,1,1,1,1,2,1,15,1


In [200]:
# One-hot encode the venue categories (one column per category)
top3_onehot = pd.get_dummies(top3_cleaned[['Venue Category']], prefix="", prefix_sep="")

# A venue category named 'Neighborhood' would clash with the column added below;
# none remains after the filtering above, so no column has to be dropped.

# Add the 'Neighborhood' column (it holds the hotel name here) back to the dataframe
top3_onehot['Neighborhood'] = top3_cleaned['Neighborhood']

# Move that column to the first position
fixed_columns = [top3_onehot.columns[-1]] + list(top3_onehot.columns[:-1])
top3_onehot = top3_onehot[fixed_columns]

# Sum the one-hot rows to get one count vector per target hotel
top3_onehot = top3_onehot.groupby('Neighborhood').sum().reset_index()
top3_onehot.head()

Unnamed: 0,Neighborhood,Art Gallery,Bar,Café,Coffee Shop,Gym,Hotel,Men's Store,Poke Place,Restaurant,Wine Bar
0,DoubleTree by Hilton Hotel Toronto Downtown,0,0,0,0,0,1,0,1,2,0
1,Hotel Victoria,1,0,2,6,1,3,0,0,16,0
2,Shangri-La Toronto,0,1,3,2,0,2,0,0,23,0
3,The Adelaide Hotel Toronto,1,1,4,4,3,2,0,0,20,1
4,The Anndore House,1,0,2,2,1,2,0,0,23,0


In [204]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
allhotels = pd.concat([hotel_onehot, top3_onehot])
allhotels.head(10)

Unnamed: 0,Neighborhood,Art Gallery,Bar,Café,Coffee Shop,Gym,Hotel,Men's Store,Poke Place,Restaurant,Wine Bar
0,Greenwich Hotel,2,1,1,1,1,1,2,1,15,1
0,DoubleTree by Hilton Hotel Toronto Downtown,0,0,0,0,0,1,0,1,2,0
1,Hotel Victoria,1,0,2,6,1,3,0,0,16,0
2,Shangri-La Toronto,0,1,3,2,0,2,0,0,23,0
3,The Adelaide Hotel Toronto,1,1,4,4,3,2,0,0,20,1
4,The Anndore House,1,0,2,2,1,2,0,0,23,0
5,Town Inn Suites,0,0,2,5,0,2,1,0,9,0


### Calculate distances between hotels for ranking

In [212]:
allhotel_data = allhotels.drop('Neighborhood', axis=1)

distances = pdist(allhotel_data.values, metric='euclidean')
dist_matrix = squareform(distances)
hotel_distances = pd.DataFrame(dist_matrix, index = allhotels['Neighborhood'], columns=allhotels['Neighborhood'])
hotel_distances = hotel_distances.sort_values(by="Greenwich Hotel")
hotel_distances = hotel_distances.drop("Greenwich Hotel", axis = 0)
hotel_distances

Neighborhood,Greenwich Hotel,DoubleTree by Hilton Hotel Toronto Downtown,Hotel Victoria,Shangri-La Toronto,The Adelaide Hotel Toronto,The Anndore House,Town Inn Suites
Hotel Victoria,6.244998,15.588457,0.0,8.3666,5.567764,8.124038,7.348469
The Adelaide Hotel Toronto,7.348469,19.235384,5.567764,5.0,0.0,4.795832,11.789826
Town Inn Suites,7.937254,9.0,7.348469,14.422205,11.789826,14.422205,0.0
The Anndore House,8.660254,21.283797,8.124038,2.0,4.795832,0.0,14.422205
Shangri-La Toronto,9.0,21.377558,8.3666,0.0,5.0,2.0,14.422205
DoubleTree by Hilton Hotel Toronto Downtown,13.490738,0.0,15.588457,21.377558,19.235384,21.283797,9.0
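
The ranking step is the same for neighbourhoods and for hotels, so it could be expressed as one reusable function. The sketch below is an illustration under assumptions: the name `rank_by_distance` is made up, and it uses `scipy.spatial.distance.cdist` to compute only the distances from the origin row instead of the full `pdist` matrix used above. The three count vectors are copied from the `allhotels` table.

```python
import pandas as pd
from scipy.spatial.distance import cdist

def rank_by_distance(counts: pd.DataFrame, origin: str,
                     metric: str = "euclidean") -> pd.Series:
    """Distance from the origin row to every other row, smallest first."""
    others = counts.drop(index=origin)
    # cdist returns a 1 x n matrix here; take its only row
    d = cdist(counts.loc[[origin]].values, others.values, metric=metric)[0]
    return pd.Series(d, index=others.index, name="distance").sort_values()

# Count vectors copied from the allhotels table above
counts = pd.DataFrame(
    [[2, 1, 1, 1, 1, 1, 2, 1, 15, 1],
     [1, 0, 2, 6, 1, 3, 0, 0, 16, 0],
     [0, 0, 2, 5, 0, 2, 1, 0, 9, 0]],
    index=["Greenwich Hotel", "Hotel Victoria", "Town Inn Suites"],
)

ranking = rank_by_distance(counts, "Greenwich Hotel")
print(ranking.round(6))
```

For these rows the function reproduces the values 6.244998 and 7.937254 from the distance table. The `metric` argument accepts other metrics known to SciPy, such as `"cosine"`, which may be useful when comparing distance measures.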


## 4. Result

### Clean up distance table for result

In [213]:
top3 = hotel_distances.head(3).index.values
print("Top 3 are ", top3)
hotel_distances = hotel_distances.reset_index()
result = pd.DataFrame([])
result['Hotel'] = hotel_distances['Neighborhood']
result['Similarity Distance'] = hotel_distances['Greenwich Hotel']
result

Top 3 are  ['Hotel Victoria' 'The Adelaide Hotel Toronto' 'Town Inn Suites']


Unnamed: 0,Hotel,Similarity Distance
0,Hotel Victoria,6.244998
1,The Adelaide Hotel Toronto,7.348469
2,Town Inn Suites,7.937254
3,The Anndore House,8.660254
4,Shangri-La Toronto,9.0
5,DoubleTree by Hilton Hotel Toronto Downtown,13.490738


## 5. Discussion

### Is Euclidean the correct distance measure to calculate the similarity?

### Normalization/Calculation of mean() instead of using sum()

### Is similarity what the customer is actually looking for?

### Where is machine learning in this analysis?

## 6. Conclusion
