# Neighbourhood recommender system (The Battle of Neighbourhoods)

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Analysis and Results](#methodology)
* [Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

This project aims to build a system, based on Foursquare data, that uses a user's preferences to recommend the neighbourhoods that most closely match the majority of the user's desired points of interest.

The system will check the availability of sport and outdoor activities, cafes, restaurants, community centres, banks, gas stations, supermarkets and so on. It will use a Content Based Filtering methodology to provide neighbourhood recommendations to the user.

The system can complement real estate search engines or can be used as a stand-alone tool for the first discovery step in the user's journey to find the desired neighbourhood.


## Data <a name="data"></a>

* **Neighbourhood information**

In this pilot study we’ll be using neighbourhood data for Toronto, Canada, where postal codes start with the letter M. The data can be obtained here:

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
# Install and Import libraries
!pip install pgeocode
!conda install -c conda-forge folium=0.5.0 --yes # install folium for map rendering; comment this line out if folium is already installed

import numpy as np # library to handle data in a vectorized manner
import pandas as pd 

import pgeocode # to get postal code coordinates

from time import sleep # for time delay

import requests # library to handle requests
import json

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


#### Get data for Canadian postal codes with Borough and Neighbourhood information

Download the data and convert it into a pandas data frame. Remove rows with ‘Not assigned’ or ‘Mississauga’ in the **"Borough"** column.

In [3]:
# Get Canadian postal codes which start with M
HTMLtabs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
print(f'Total tables: {len(HTMLtabs)}')

# keep rows with an assigned Borough, excluding 'Mississauga'
# (.copy() makes df an independent frame, so the assignment below is safe)
df = HTMLtabs[0]
df = df[~df['Borough'].isin(['Not assigned', 'Mississauga'])].copy()

# replace 'Not assigned' Neighbourhood values with the Borough value, if any
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
print(f'df Shape: {df.shape}')

post_code_df = df.reset_index(drop=True)
post_code_df.head()

Total tables: 3
df Shape: (102, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


* **Geolocation information**

The following blocks of code will add Latitude and Longitude columns to the data frame of Canadian postal codes and use the "pgeocode" Python library to populate the geolocation data (more information about "pgeocode" can be found here https://pypi.org/project/pgeocode/)


In [4]:
# adding Latitude and Longitude columns to df
post_code_df = post_code_df.reindex(columns=[*post_code_df.columns, 'Latitude', 'Longitude'])
post_code_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,,
1,M4A,North York,Victoria Village,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",,
3,M6A,North York,"Lawrence Manor, Lawrence Heights",,
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",,


In [5]:
# run "pgeocode" package to get Latitude and Longitude
nm = pgeocode.Nominatim('ca') # to select Canada as Country

for row in post_code_df.index:
    # query each postal code once and write with .loc to avoid chained assignment
    location = nm.query_postal_code(post_code_df.loc[row, 'Postal Code'])
    post_code_df.loc[row, 'Latitude'] = location['latitude']
    post_code_df.loc[row, 'Longitude'] = location['longitude']
post_code_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889


* **Foursquare API**

The Foursquare API was used to get venue “categories” through its endpoint as a JSON object, which was converted to a pandas data frame. This data frame has 4 columns and 470 rows of venue categories. The Foursquare API "search" endpoint was then used to get data for each neighbourhood and for all 470 venue categories.

In [8]:
CLIENT_ID = 'XXXXX' # your Foursquare ID
CLIENT_SECRET = 'XXXXX' # your Foursquare Secret
VERSION = '20210101' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

#### Data acquisition for Points of Interest (POI) using the venue “categories” endpoint

In [9]:
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION)
            
# make the GET request
results = requests.get(url).json()["response"]['categories']

In [10]:
# create list of categories
categories_list=[]
for v in results:
    for i in range(len(v['categories'])):
        categories_list.append([(v['id'],v['name'],v['categories'][i]['id'],v['categories'][i]['name'])])
categories_list[:5]

[[('4d4b7104d754a06370d81259',
   'Arts & Entertainment',
   '56aa371be4b08b9a8d5734db',
   'Amphitheater')],
 [('4d4b7104d754a06370d81259',
   'Arts & Entertainment',
   '4fceea171983d5d06c3e9823',
   'Aquarium')],
 [('4d4b7104d754a06370d81259',
   'Arts & Entertainment',
   '4bf58dd8d48988d1e1931735',
   'Arcade')],
 [('4d4b7104d754a06370d81259',
   'Arts & Entertainment',
   '4bf58dd8d48988d1e2931735',
   'Art Gallery')],
 [('4d4b7104d754a06370d81259',
   'Arts & Entertainment',
   '4bf58dd8d48988d1e4931735',
   'Bowling Alley')]]

In [11]:
# make data frame from the list of categories 
categories_df = pd.DataFrame([item for sublist in categories_list for item in sublist])
categories_df.columns = ['Main Category ID', 
                  'Main Category Name', 
                  'Category ID', 
                  'Category Name']
categories_df.shape

(470, 4)

In [12]:
categories_df.head()

Unnamed: 0,Main Category ID,Main Category Name,Category ID,Category Name
0,4d4b7104d754a06370d81259,Arts & Entertainment,56aa371be4b08b9a8d5734db,Amphitheater
1,4d4b7104d754a06370d81259,Arts & Entertainment,4fceea171983d5d06c3e9823,Aquarium
2,4d4b7104d754a06370d81259,Arts & Entertainment,4bf58dd8d48988d1e1931735,Arcade
3,4d4b7104d754a06370d81259,Arts & Entertainment,4bf58dd8d48988d1e2931735,Art Gallery
4,4d4b7104d754a06370d81259,Arts & Entertainment,4bf58dd8d48988d1e4931735,Bowling Alley
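
The flattening done in cells 10 and 11 can also be written in one pass. The sketch below is only an illustration: it runs on a small hand-written stand-in for the `categories` response (the IDs and names are copied from the output above), and the helper name `flatten_categories` is an assumption, not part of the notebook. Like the notebook, it reads only the first level of subcategories.

```python
import pandas as pd

# Hand-written stand-in for requests.get(url).json()["response"]["categories"].
# Only the fields the notebook reads ('id', 'name', 'categories') are included.
toy_results = [
    {
        "id": "4d4b7104d754a06370d81259",
        "name": "Arts & Entertainment",
        "categories": [
            {"id": "56aa371be4b08b9a8d5734db", "name": "Amphitheater"},
            {"id": "4fceea171983d5d06c3e9823", "name": "Aquarium"},
        ],
    },
]


def flatten_categories(results):
    """Return one row per (main category, first-level subcategory) pair."""
    rows = [
        (main["id"], main["name"], sub["id"], sub["name"])
        for main in results
        for sub in main["categories"]
    ]
    return pd.DataFrame(
        rows,
        columns=["Main Category ID", "Main Category Name", "Category ID", "Category Name"],
    )


toy_categories_df = flatten_categories(toy_results)
print(toy_categories_df.shape)
```

Building the rows as plain tuples avoids the intermediate list-of-lists and gives the column names in the same call.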


In [13]:
# create random sample of 10 categories to use for testing later
categories_sample = categories_df.sample(n = 10).reset_index(drop=True) 
categories_sample

Unnamed: 0,Main Category ID,Main Category Name,Category ID,Category Name
0,4d4b7105d754a06374d81259,Food,4bf58dd8d48988d169941735,Australian Restaurant
1,4d4b7105d754a06374d81259,Food,4bf58dd8d48988d10d941735,German Restaurant
2,4d4b7105d754a06375d81259,Professional & Other Places,5fac002599ce226e27fe72e5,Architecture Firm
3,4d4b7105d754a06378d81259,Shop & Service,52f2ab2ebcbc57f1066b8b26,Fabric Shop
4,4d4b7105d754a06378d81259,Shop & Service,503287a291d4c4b30a586d65,Financial or Legal Service
5,4d4b7105d754a06378d81259,Shop & Service,4bf58dd8d48988d1f9941735,Food & Drink Shop
6,4d4b7105d754a06374d81259,Food,55d25775498e9f6a0816a37a,Friterie
7,4d4b7105d754a06378d81259,Shop & Service,52f2ab2ebcbc57f1066b8b44,Auto Garage
8,4d4b7105d754a06377d81259,Outdoors & Recreation,4bf58dd8d48988d162941735,Other Great Outdoors
9,4d4b7105d754a06378d81259,Shop & Service,554a5e17498efabeda6cc559,Photography Studio


In [14]:
# create random sample of 10 postal codes to use for testing
post_code_sample = post_code_df.sample(n = 10).reset_index(drop=True)
post_code_sample

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M6B,North York,Glencairn,43.7081,-79.4479
1,M6P,West Toronto,"High Park, The Junction South",43.6605,-79.4633
2,M6M,York,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",43.6934,-79.4857
3,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.6505,-79.5517
4,M5W,Downtown Toronto,Stn A PO Boxes,43.6437,-79.3787
5,M3N,North York,Downsview,43.7568,-79.521
6,M1P,Scarborough,"Dorset Park, Wexford Heights, Scarborough Town...",43.7612,-79.2707
7,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.8177,-79.2819
8,M2N,North York,"Willowdale, Willowdale East",43.7673,-79.4111
9,M4H,East York,Thorncliffe Park,43.7059,-79.3464


#### Let's use the Foursquare "search" endpoint to get the number of venues in the neighbourhood

In [15]:
# The following function accepts a venue Category ID and the latitude and longitude
# of a neighbourhood, and returns the number of venues found within the radius
def getPOInumber(category_id, lat, lng, radius=1000):

    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        lat, 
        lng, 
        radius, 
        category_id,
        LIMIT)

    # make the GET request and count the venues returned (at most LIMIT)
    POInumber = len(requests.get(url).json()["response"]['venues'])

    return POInumber

In [16]:
# just a random test for the above function 
getPOInumber('4bf58dd8d48988d150941735',43.6555,-79.3626)

3
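
As a side note on the request itself, the query string can be assembled from a dictionary so that each parameter is encoded consistently. This is a minimal sketch using only the standard library; the endpoint and parameter names are the ones used in `getPOInumber` above, while the helper name `build_search_url` and its keyword arguments are assumptions made for illustration. No request is sent.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(category_id, lat, lng, client_id, client_secret,
                     version="20210101", radius=1000, limit=100):
    """Build the venues/search URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",      # latitude,longitude as a single parameter
        "radius": radius,          # search radius in metres
        "categoryId": category_id,
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


example_url = build_search_url("4bf58dd8d48988d150941735", 43.6555, -79.3626,
                               client_id="XXXXX", client_secret="XXXXX")
print(example_url)
```

Because the function counts the venues in a single response, the count can never exceed the `limit` the API honours, so very dense categories may be undercounted.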

#### The following code tests the data acquisition methodology for 10 Category IDs in 10 Neighbourhoods

In [18]:
# first let's add a column for each Category ID to post_code_sample data frame created above
for row in categories_sample.index: 
    print(categories_sample['Category Name'][row])
    post_code_sample = pd.concat([post_code_sample, pd.DataFrame(columns = [categories_sample['Category Name'][row]])])
post_code_sample.head()

Australian Restaurant
German Restaurant
Architecture Firm
Fabric Shop
Financial or Legal Service
Food & Drink Shop
Friterie
Auto Garage
Other Great Outdoors
Photography Studio


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Australian Restaurant,German Restaurant,Architecture Firm,Fabric Shop,Financial or Legal Service,Food & Drink Shop,Friterie,Auto Garage,Other Great Outdoors,Photography Studio
0,M6B,North York,Glencairn,43.7081,-79.4479,,,,,,,,,,
1,M6P,West Toronto,"High Park, The Junction South",43.6605,-79.4633,,,,,,,,,,
2,M6M,York,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",43.6934,-79.4857,,,,,,,,,,
3,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.6505,-79.5517,,,,,,,,,,
4,M5W,Downtown Toronto,Stn A PO Boxes,43.6437,-79.3787,,,,,,,,,,


In [19]:
# populate post_code_sample dataframe
# this will make 100 requests to Foursquare API search end point
for row in post_code_sample.index:
    print(post_code_sample['Postal Code'][row], post_code_sample['Latitude'][row], post_code_sample['Longitude'][row])
    
    for col in list(post_code_sample.columns.values)[5:]:
        cat_id = categories_sample.loc[categories_sample['Category Name'] == col]['Category ID'].values[0]
        poi_count = getPOInumber(cat_id,post_code_sample['Latitude'][row],post_code_sample['Longitude'][row])
        # .loc writes into the frame itself; chained post_code_sample[col][row] may write to a copy
        post_code_sample.loc[row, col] = poi_count

M6B 43.7081 -79.4479
M6P 43.6605 -79.4633
M6M 43.6934 -79.4857
M9B 43.6505 -79.5517
M5W 43.6437 -79.3787
M3N 43.7568 -79.521
M1P 43.7612 -79.2707
M1V 43.8177 -79.2819
M2N 43.7673 -79.4111
M4H 43.7059 -79.3464


In [20]:
# print the results
post_code_sample

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Australian Restaurant,German Restaurant,Architecture Firm,Fabric Shop,Financial or Legal Service,Food & Drink Shop,Friterie,Auto Garage,Other Great Outdoors,Photography Studio
0,M6B,North York,Glencairn,43.7081,-79.4479,0,0,0,0,1,14,0,1,2,0
1,M6P,West Toronto,"High Park, The Junction South",43.6605,-79.4633,0,0,0,0,5,27,0,0,3,1
2,M6M,York,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",43.6934,-79.4857,0,0,0,0,0,4,0,0,2,0
3,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.6505,-79.5517,0,0,0,0,1,3,0,0,0,0
4,M5W,Downtown Toronto,Stn A PO Boxes,43.6437,-79.3787,3,2,0,0,8,45,0,0,36,4
5,M3N,North York,Downsview,43.7568,-79.521,0,0,0,0,1,11,0,2,0,0
6,M1P,Scarborough,"Dorset Park, Wexford Heights, Scarborough Town...",43.7612,-79.2707,0,0,0,0,4,7,0,6,2,0
7,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.8177,-79.2819,0,0,0,0,0,4,0,0,0,0
8,M2N,North York,"Willowdale, Willowdale East",43.7673,-79.4111,0,0,0,0,6,24,0,2,2,1
9,M4H,East York,Thorncliffe Park,43.7059,-79.3464,0,0,0,0,2,13,0,0,0,1


#### test looks good :)

## WARNING! DO NOT RUN! The following section of the code is only needed for the initial run. 
## It will send about 50000 requests to the Foursquare API 

In [209]:
# copy post_code_df to another dataframe
pc_data_raw = post_code_df.copy()
pc_data_raw.shape

(102, 5)

In [210]:
# add 470 categories as columns to pc_data_raw data frame
for row in categories_df.index: 
    print(categories_df['Category Name'][row])
    pc_data_raw = pd.concat([pc_data_raw, pd.DataFrame(columns = [categories_df['Category Name'][row]])])
pc_data_raw.head()

Amphitheater
Aquarium
Arcade
Art Gallery
Bowling Alley
Casino
Circus
Comedy Club
Concert Hall
Country Dance Club
Disc Golf
Escape Room
Exhibit
General Entertainment
Go Kart Track
Historic Site
Karaoke Box
Laser Tag
Memorial Site
Mini Golf
Movie Theater
Museum
Music Venue
Pachinko Parlor
Performing Arts Venue
Pool Hall
Public Art
Racecourse
Racetrack
Roller Rink
Salsa Club
Samba School
Stadium
Theme Park
Tour Provider
VR Cafe
Water Park
Zoo
College Academic Building
College Administrative Building
College Auditorium
College Bookstore
College Cafeteria
College Classroom
College Gym
College Lab
College Library
College Quad
College Rec Center
College Residence Hall
College Stadium
College Theater
Community College
Fraternity House
General College & University
Law School
Medical School
Sorority House
Student Center
Trade School
University
Christmas Market
Conference
Convention
Festival
Line / Queue
Music Festival
Other Event
Parade
Sporting Event
Stoop Sale
Street Fair
Trade Fair
Afghan Res

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Taxi,Toll Booth,Toll Plaza,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
0,M3A,North York,Parkwoods,43.7545,-79.33,,,,,,...,,,,,,,,,,
1,M4A,North York,Victoria Village,43.7276,-79.3148,,,,,,...,,,,,,,,,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,,,,,,...,,,,,,,,,,
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504,,,,,,...,,,,,,,,,,
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,,,,,,...,,,,,,,,,,


In [211]:
pc_data_raw.shape

(102, 475)

In [253]:
#### DO NOT RUN!!

# this will make ~50000 requests to the Foursquare API with a 1 hour delay between each ~5000
# about 12 hours of acquisition time 
for n, row in enumerate(pc_data_raw.index, start=1):
    print(pc_data_raw['Postal Code'][row], pc_data_raw['Latitude'][row], pc_data_raw['Longitude'][row], 'processing ...')
    
    for col in list(pc_data_raw.columns.values)[5:]:
        cat_id = categories_df.loc[categories_df['Category Name'] == col]['Category ID'].values[0]
        poi_count = getPOInumber(cat_id,pc_data_raw['Latitude'][row],pc_data_raw['Longitude'][row])
        # .loc writes into the frame itself and avoids chained assignment
        pc_data_raw.loc[row, col] = poi_count

    # pause for an hour after every 10th postal code (10 x 470 = 4700 requests)
    if n % 10 == 0:
        print ('waiting ...')
        sleep(3600)

M5X 43.6492 -79.3823 processing ...
M8X 43.6518 -79.5076 processing ...
M4Y 43.6656 -79.383 processing ...
M7Y 43.7804 -79.2505 processing ...
M8Y 43.6325 -79.4939 processing ...
M8Z 43.6256 -79.5231 processing ...
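
The "~50000" and "~5000" figures quoted in the cell above follow from simple arithmetic on numbers already given in this notebook (102 postal codes, 470 categories, a one-hour pause after every 10th postal code). The short check below only reproduces that arithmetic; the variable names are illustrative.

```python
# Figures taken from the notebook: 102 postal codes, 470 venue categories,
# and a one-hour pause after every 10th postal code.
n_postal_codes = 102
n_categories = 470
rows_per_batch = 10
pause_seconds = 3600

total_requests = n_postal_codes * n_categories        # one request per (postal code, category)
requests_per_batch = rows_per_batch * n_categories    # requests sent between two pauses
n_pauses = n_postal_codes // rows_per_batch           # pauses triggered over the full run
pause_hours = n_pauses * pause_seconds / 3600         # time spent sleeping, in hours

print(total_requests, requests_per_batch, n_pauses, pause_hours)
```

This gives 47940 requests in total, 4700 per batch, and 10 hours of pauses. The time spent on the requests themselves comes on top of that, which is in line with the estimate of about 12 hours.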


In [254]:
# print out the end of the created data frame
pc_data_raw[80:]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Taxi,Toll Booth,Toll Plaza,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
80,M6S,West Toronto,"Runnymede, Swansea",43.6512,-79.4828,0,0,0,5,0,...,0,0,0,0,0,0,0,0,0,0
81,M1T,Scarborough,"Clarks Corners, Tam O'Shanter, Sullivan",43.7812,-79.3036,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
82,M4T,Central Toronto,"Moore Park, Summerhill East",43.6899,-79.3853,0,0,1,8,0,...,0,0,0,1,0,0,2,1,0,0
83,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.6541,-79.3978,0,1,8,46,1,...,0,0,0,1,4,7,3,1,0,0
84,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.8177,-79.2819,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
85,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.6861,-79.4025,0,1,0,15,1,...,0,0,0,1,0,0,0,2,0,0
86,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.6404,-79.3995,0,7,5,46,1,...,0,0,0,1,8,5,3,1,0,1
87,M8V,Etobicoke,"New Toronto, Mimico South, Humber Bay Shores",43.6075,-79.5013,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
88,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.7432,-79.5876,0,0,0,6,0,...,0,0,0,0,0,0,0,0,0,0
89,M1W,Scarborough,"Steeles West, L'Amoreaux West",43.8016,-79.3216,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


## Important! Save the hard work done into a file.

In [255]:
# saving data to file
pc_data_raw.to_csv('TO_pc_data_raw_20210101.csv', index=False)

In [261]:
# make a copy of the data and continue with analysis
pc_data_clean = pc_data_raw.copy()

## Initial data acquisition part ends here (end of "DO NOT RUN!" part)

## Load data from file created during "Initial acquisition" and continue with analysis

You don't need to run this part if you just did **"Initial data acquisition"**

In [21]:
# load data from file
pc_data_clean = pd.read_csv('TO_pc_data_raw_20210101.csv')
pc_data_clean.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Taxi,Toll Booth,Toll Plaza,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
0,M3A,North York,Parkwoods,43.7545,-79.33,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,M4A,North York,Victoria Village,43.7276,-79.3148,0,0,0,1,0,...,0,0,0,0,0,0,2,0,0,0
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,0,0,4,47,1,...,0,0,0,2,0,6,1,1,0,0
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504,0,0,2,1,1,...,0,0,0,0,1,0,0,0,0,0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,1,1,3,45,1,...,1,0,0,0,2,4,1,0,0,0


## Analysis and Results<a name="methodology"></a>

Evaluate and use **Content Based Filtering** for the neighbourhood recommender system. First, check for and remove columns which have only 0 values.


In [22]:
# check dataset shape
pc_data_clean.shape

(102, 475)

In [24]:
# check and remove columns which have only 0 values
# (drop(columns=...) replaces the positional axis argument, which newer pandas versions reject)
zero_cols = [col for col in pc_data_clean.columns[5:] if pc_data_clean[col].sum() == 0]
pc_data_clean = pc_data_clean.drop(columns=zero_cols)

In [25]:
# check the new shape
pc_data_clean.shape

(102, 420)
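
The same filtering step can be seen on a toy table. This is a small sketch with made-up counts (the category names come from the notebook, the numbers do not); it keeps only the venue columns in which at least one postal code has a non-zero count.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M5A"],
        "Art Gallery": [0, 1, 47],
        "Casino": [0, 0, 0],        # never observed, so it carries no information
        "Train Station": [1, 0, 0],
    }
)

count_cols = toy.columns[1:]                           # venue-count columns only
keep = (toy[count_cols].sum(axis=0) != 0).to_numpy()   # True where any venue exists
toy_clean = toy[["Postal Code", *count_cols[keep]]]
print(list(toy_clean.columns))
```

Dropping these columns before normalization also matters for the next step, because a column whose minimum equals its maximum cannot be rescaled with the min-max method.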

In [26]:
pc_data_clean.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
0,M3A,North York,Parkwoods,43.7545,-79.33,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,M4A,North York,Victoria Village,43.7276,-79.3148,0,0,0,1,0,...,0,0,0,0,0,0,2,0,0,0
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,0,0,4,47,1,...,0,0,0,2,0,6,1,1,0,0
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504,0,0,2,1,1,...,0,0,0,0,1,0,0,0,0,0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,1,1,3,45,1,...,0,0,1,0,2,4,1,0,0,0


In [27]:
# make Postal Code the index of the data frame and drop extra information 
pc_idx_clean = pc_data_clean.set_index('Postal Code')
pc_idx_clean = pc_idx_clean.drop(columns=['Borough', 'Neighbourhood', 'Latitude', 'Longitude'])
pc_idx_clean.head()

Postal Code,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,Casino,Circus,Comedy Club,Concert Hall,Country Dance Club,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
M3A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
M4A,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,0
M5A,0,0,4,47,1,0,1,4,5,0,...,0,0,0,2,0,6,1,1,0,0
M6A,0,0,2,1,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
M7A,1,1,3,45,1,1,0,3,11,0,...,0,0,1,0,2,4,1,0,0,0


#### Data normalization 

Create a new data frame with normalized data using the min-max method, so that every venue category is rescaled to the range 0 to 1.

In [28]:
# use min-max method to normalize data
pc_idx_clean_norm=(pc_idx_clean-pc_idx_clean.min())/(pc_idx_clean.max()-pc_idx_clean.min())
pc_idx_clean_norm.head()

Postal Code,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,Casino,Circus,Comedy Club,Concert Hall,Country Dance Club,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
M3A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0
M4A,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0
M5A,0.0,0.0,0.333333,0.959184,0.333333,0.0,0.5,0.307692,0.192308,0.0,...,0.0,0.0,0.0,0.25,0.0,0.857143,0.2,0.111111,0.0,0.0
M6A,0.0,0.0,0.166667,0.020408,0.333333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0
M7A,0.5,0.1,0.25,0.918367,0.333333,0.166667,0.0,0.230769,0.423077,0.0,...,0.0,0.0,1.0,0.0,0.064516,0.571429,0.2,0.0,0.0,0.0
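
The min-max step above maps each column to x' = (x - min) / (max - min), where min and max are taken over all postal codes for that column. A minimal sketch on made-up counts (the numbers below are illustrative, not taken from the data set):

```python
import pandas as pd

toy_counts = pd.DataFrame(
    {"Art Gallery": [0, 1, 47], "Train Station": [1, 0, 31]},
    index=pd.Index(["M3A", "M4A", "M5A"], name="Postal Code"),
)

col_min = toy_counts.min()                      # per-column minimum
col_range = toy_counts.max() - col_min          # per-column max - min
toy_norm = (toy_counts - col_min) / col_range   # broadcast column by column
print(toy_norm)
```

After this step the postal code with the most venues in a category scores 1 for that category and the one with the fewest scores 0, so categories with large raw counts (such as Art Gallery) no longer dominate categories with small ones. A column with identical values everywhere would give 0/0 and become NaN, which is one more reason to drop the all-zero columns first.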


#### Visualize postal code data before continuing with the analysis 

This is good practice, which allows us to see what we are starting with.

In [29]:
# Use mean latitude and longitude values from pc_data_clean for approximate center
mean_lat = pc_data_clean['Latitude'].mean()
mean_long = pc_data_clean['Longitude'].mean()
print('Latitude:', mean_lat)
print('Longitude:', mean_long)

Latitude: 43.7067156862745
Longitude: -79.393987254902


##### Use Folium library to create a map 

In [30]:
# create map of Toronto with mean latitude and longitude values from df
pc_map = folium.Map(location=[mean_lat, mean_long], zoom_start=11)

# add markers to map
for lat, lng, label in zip(pc_data_clean['Latitude'], pc_data_clean['Longitude'], pc_data_clean['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(pc_map)  
    
pc_map

#### Generate test sets of potential user inputs 

* **Random selection**

The test set will be generated randomly from the full set of POIs 

In [31]:
# create list of possible user choices 
pio_names = pd.DataFrame(list(pc_data_clean.columns.values)[5:])
pio_names.set_index(0,inplace=True)
pio_names.head()

Amphitheater
Aquarium
Arcade
Art Gallery
Bowling Alley


In [34]:
# add 'User choice' column with 20 random values of 1
pio_names['User choice'] = 0
pio_names_sample = pio_names.sample(20)

# set the sampled rows in one .loc assignment (avoids chained assignment)
pio_names.loc[pio_names_sample.index, 'User choice'] = 1

In [35]:
#print chosen values
pio_names.loc[pio_names['User choice'] == 1]

0,User choice
College Cafeteria,1
Asian Restaurant,1
Athletics & Sports,1
Boat Launch,1
Dog Run,1
Hot Spring,1
Lighthouse,1
Distillery,1
Office,1
Power Plant,1


#### The list doesn't look very realistic. 

It includes unlikely choices such as Power Plant and Lighthouse, but it can still be used for testing and evaluation.

In [36]:
# transpose the list 
pio_names = pio_names.transpose()
poi_user_series = pio_names.iloc[0]
poi_user_series

0
Amphitheater              0
Aquarium                  0
Arcade                    0
Art Gallery               0
Bowling Alley             0
                         ..
Tram Station              0
Transportation Service    0
Travel Lounge             1
Truck Stop                0
Tunnel                    0
Name: User choice, Length: 415, dtype: int64

In [49]:
# create recommendation values for each neighbourhood
recommendationTable_df = ((pc_idx_clean_norm*poi_user_series).sum(axis=1))/(poi_user_series.sum())
recommendationTable_df.head()

Postal Code
M3A    0.012352
M4A    0.068138
M5A    0.442520
M6A    0.113546
M7A    0.382447
dtype: float64
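
The score computed above is, for each postal code, the sum of the normalized values of the chosen categories divided by the number of chosen categories, i.e. the mean normalized availability of what the user asked for. A minimal sketch with made-up normalized values (the names `toy_norm`, `toy_user` and `toy_scores` are illustrative):

```python
import pandas as pd

# Made-up normalized availability per postal code (rows) and category (columns).
toy_norm = pd.DataFrame(
    {"Park": [0.2, 1.0, 0.0], "Bank": [0.5, 0.5, 1.0], "Casino": [1.0, 0.0, 0.0]},
    index=pd.Index(["M3A", "M4A", "M5A"], name="Postal Code"),
)

# 1 = the user wants this category, 0 = the user does not care about it.
toy_user = pd.Series({"Park": 1, "Bank": 1, "Casino": 0})

# DataFrame * Series aligns the Series labels with the column names,
# so unwanted categories are multiplied by 0 and drop out of the row sum.
toy_scores = (toy_norm * toy_user).sum(axis=1) / toy_user.sum()
print(toy_scores.sort_values(ascending=False))
```

Here M4A scores (1.0 + 0.5) / 2 = 0.75, M5A scores 0.5 and M3A scores 0.35; the high Casino value of M3A has no effect because the user did not choose it. A score of 1 would mean the neighbourhood has the city-wide maximum for every chosen category.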

In [50]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head(10)

Postal Code
M5G    0.599294
M5H    0.565643
M5K    0.545878
M5C    0.532936
M5X    0.527880
M5L    0.527880
M5T    0.521196
M5V    0.477053
M5E    0.459912
M5A    0.442520
dtype: float64

In [51]:
# Show top 10 recommendations 
pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(10).keys())]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,0,0,4,47,1,...,0,0,0,2,0,6,1,1,0,0
15,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,0,4,10,45,2,...,1,2,1,8,30,4,5,8,0,0
20,M5E,Downtown Toronto,Berczy Park,43.6456,-79.3754,0,10,12,44,2,...,1,2,1,4,31,4,4,7,0,0
24,M5G,Downtown Toronto,Central Bay Street,43.6564,-79.386,0,0,10,46,2,...,1,2,1,7,12,6,3,6,0,0
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.6496,-79.3833,0,10,12,44,2,...,1,2,1,7,31,5,4,9,0,0
42,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.6469,-79.3823,0,10,12,44,2,...,1,2,1,8,31,5,4,8,0,0
48,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.6492,-79.3823,0,10,12,44,2,...,1,2,1,7,31,5,4,8,0,0
83,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.6541,-79.3978,0,1,8,46,1,...,0,0,0,1,4,7,3,1,0,0
86,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.6404,-79.3995,0,7,5,46,1,...,0,0,0,1,8,5,3,1,0,1
96,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.6492,-79.3823,0,10,12,44,2,...,1,2,1,7,31,5,4,8,0,0
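
Note that selecting rows with `isin` returns them in their original data frame order, not in order of recommendation score. If the ranking should be kept, one option is to look the rows up by the sorted index and attach the score as a column. The sketch below assumes a tiny stand-in table; the scores and neighbourhood names are copied from the outputs above, and the names `toy_info`, `toy_scores` and `ranked` are illustrative.

```python
import pandas as pd

toy_info = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M5A", "M5G"],
        "Neighbourhood": ["Parkwoods", "Regent Park, Harbourfront", "Central Bay Street"],
    }
)
toy_scores = pd.Series({"M3A": 0.012352, "M5A": 0.442520, "M5G": 0.599294})

# Take the two best scores, then reorder the info table to follow that ranking.
top = toy_scores.sort_values(ascending=False).head(2)
ranked = toy_info.set_index("Postal Code").loc[top.index].assign(Score=top)
print(ranked)
```

The result lists M5G first and M5A second, each with its neighbourhood name and score side by side.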


#### It looks like all the chosen locations are downtown, so let's plot the top 20 

This might not be surprising, because major amenities can indeed be found in the centre of the city. However, as will be shown later in the discussion section, bias in the data collected by Foursquare is also a factor.

In [52]:
# create map with top 20 recommended neighbourhoods

df = pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(20).keys())]

rec_map = folium.Map(location=[mean_lat, mean_long], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(rec_map)  
    
rec_map

* **Custom set of POI**

Let's try to create a set of 20 POIs more likely to be chosen by a user

In [41]:
real_user_list=[
'Stadium',
'Bakery',
'Bistro',
'Coffee Shop',
'Indian Restaurant',
'Italian Restaurant',
'Kebab Restaurant',
'Mediterranean Restaurant',
'Pizza Place',
'Sri Lankan Restaurant',
'Turkish Restaurant',
'Athletics & Sports',
'Park',
'Picnic Area',
'Pool',
'Community Center',
'Medical Center',
'School',
'Bank',
'Shopping Plaza']

The set is processed in the same way as before

In [42]:
# make data frame out of the list and assign 1 value to 'User choice' 
real_user_poi = pd.DataFrame(real_user_list)

real_user_poi.set_index(0,inplace=True)
real_user_poi['User choice'] = 1
real_user_poi

0,User choice
Stadium,1
Bakery,1
Bistro,1
Coffee Shop,1
Indian Restaurant,1
Italian Restaurant,1
Kebab Restaurant,1
Mediterranean Restaurant,1
Pizza Place,1
Sri Lankan Restaurant,1


In [43]:
# transpose the list
real_user_poi = real_user_poi.transpose()

poi_real_user_series = real_user_poi.iloc[0]
poi_real_user_series

0
Stadium                     1
Bakery                      1
Bistro                      1
Coffee Shop                 1
Indian Restaurant           1
Italian Restaurant          1
Kebab Restaurant            1
Mediterranean Restaurant    1
Pizza Place                 1
Sri Lankan Restaurant       1
Turkish Restaurant          1
Athletics & Sports          1
Park                        1
Picnic Area                 1
Pool                        1
Community Center            1
Medical Center              1
School                      1
Bank                        1
Shopping Plaza              1
Name: User choice, dtype: int64
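
A user-preference vector like the one above can also be built directly from the list and aligned with the category columns, which makes it easy to spot chosen names that do not exist in the data. This is a sketch under stated assumptions: `poi_columns` stands in for the columns of `pc_idx_clean_norm`, and the toy column set is invented for the example.

```python
import pandas as pd

poi_columns = ["Bakery", "Bank", "Casino", "Park"]     # stand-in for pc_idx_clean_norm.columns
chosen = ["Park", "Bakery", "Sri Lankan Restaurant"]   # the last name is absent from the toy columns

# 1 for every chosen category, then align to the known columns (0 elsewhere).
user_vector = pd.Series(1, index=chosen).reindex(poi_columns, fill_value=0)

# Chosen names that have no matching column and therefore cannot contribute.
missing = [name for name in chosen if name not in poi_columns]

print(user_vector.to_dict(), missing)
```

One design point to keep in mind: after reindexing, `user_vector.sum()` counts only the categories present in the data, whereas the notebook divides by the number of all chosen categories. Dividing by `len(chosen)` instead keeps the notebook's behaviour.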

In [53]:
# create recommendation values for each neighbourhood
recommendationTable_df = ((pc_idx_clean_norm*poi_real_user_series).sum(axis=1))/(poi_real_user_series.sum())
recommendationTable_df.head()

Postal Code
M3A    0.055640
M4A    0.069555
M5A    0.412016
M6A    0.116617
M7A    0.611862
dtype: float64

In [54]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head(10)

Postal Code
M5K    0.796147
M5X    0.778418
M5H    0.777631
M5L    0.777282
M5G    0.727171
M5C    0.719994
M5B    0.709972
M4Y    0.620515
M7A    0.611862
M5T    0.609684
dtype: float64

#### Values are much higher than for the random set

In contrast to the random set, as you can see in the table above, the values from the recommendation system are much higher.

In [55]:
# Show top 10 recommendations 
pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(10).keys())]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,1,1,3,45,1,...,0,0,1,0,2,4,1,0,0,0
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,0,0,6,45,3,...,0,3,0,5,3,5,1,3,0,0
15,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,0,4,10,45,2,...,1,2,1,8,30,4,5,8,0,0
24,M5G,Downtown Toronto,Central Bay Street,43.6564,-79.386,0,0,10,46,2,...,1,2,1,7,12,6,3,6,0,0
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.6496,-79.3833,0,10,12,44,2,...,1,2,1,7,31,5,4,9,0,0
42,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.6469,-79.3823,0,10,12,44,2,...,1,2,1,8,31,5,4,8,0,0
48,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.6492,-79.3823,0,10,12,44,2,...,1,2,1,7,31,5,4,8,0,0
83,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.6541,-79.3978,0,1,8,46,1,...,0,0,0,1,4,7,3,1,0,0
96,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.6492,-79.3823,0,10,12,44,2,...,1,2,1,7,31,5,4,8,0,0
98,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.383,0,1,4,44,2,...,0,1,1,1,2,2,1,0,0,0
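
Note that filtering with `isin` returns the rows in the original table order, not in score order. One possible way to list the neighbourhoods by rank together with their scores is sketched below on invented data (`places` and `scores` are hypothetical stand-ins for `pc_data_clean` and `recommendationTable_df`):

```python
import pandas as pd

# Invented stand-ins for the neighbourhood table and the score series.
places = pd.DataFrame({
    "Postal Code": ["M1A", "M2B", "M3C", "M4D"],
    "Neighbourhood": ["North", "East", "South", "West"],
})
scores = pd.Series({"M1A": 0.2, "M2B": 0.9, "M3C": 0.5, "M4D": 0.7}, name="Score")

top = scores.sort_values(ascending=False).head(3)

# Turn the ranked scores into a two-column frame, then pull in the
# neighbourhood details. A left merge keeps the rank order of `top`.
ranked = (
    top.rename_axis("Postal Code")
    .reset_index()
    .merge(places, on="Postal Code", how="left")
)
print(ranked)
```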


#### Create map of custom user recommendations


In [56]:
# create map with top 20 recommended neighbourhoods

df = pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(20).keys())]

rec_map = folium.Map(location=[mean_lat, mean_long], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(rec_map)  
    
rec_map

#### The map looks a bit different, with recommendations concentrated even more in the city centre

This can be attributed to the number of places to eat in the user's list.

* **Random postal code test set**

#### Let's make another set based on the top 20 POIs in postal code M9C

This neighbourhood is on the far west side of the city of Toronto.

In [57]:
# create test list first
m9c_norm_tst = pc_idx_clean_norm.loc[pc_idx_clean_norm.index == 'M9C'].transpose()

In [58]:
# sort values and select top 20
m9c_norm_tst = m9c_norm_tst['M9C'].sort_values(ascending=False).head(20)

In [59]:
# set the weight of every selected POI to 1
m9c_norm_tst[:] = 1
m9c_norm_tst

Scandinavian Restaurant                     1.0
Assisted Living                             1.0
Medical Center                              1.0
Office                                      1.0
Residential Building (Apartment / Condo)    1.0
Athletics & Sports                          1.0
Shopping Plaza                              1.0
Stationery Store                            1.0
Eastern European Restaurant                 1.0
Fish & Chips Shop                           1.0
Distribution Center                         1.0
Transportation Service                      1.0
Park                                        1.0
Gas Station                                 1.0
Casino                                      1.0
Car Wash                                    1.0
Pet Store                                   1.0
Spiritual Center                            1.0
Pool                                        1.0
Salon / Barbershop                          1.0
Name: M9C, dtype: float64
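
The three cells above turn one neighbourhood's row into a binary preference profile. The same idea can be sketched compactly with `nlargest`; the category values below are invented for illustration:

```python
import pandas as pd

# Hypothetical normalised POI values for a single neighbourhood.
row = pd.Series({"Park": 0.8, "Bank": 0.1, "Pool": 0.6, "Casino": 0.0, "School": 0.4})

k = 3
# nlargest returns the k highest values in descending order.
top_categories = row.nlargest(k).index

# Binary profile: every selected category gets weight 1.
profile = pd.Series(1.0, index=top_categories)
print(profile)
```

When several categories share the same value at the cut-off, which of them make it into the top `k` depends on their position in the row, so the resulting profile is somewhat arbitrary in that case.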

In [60]:
# create recommendations from the postal code set
recommendationTable_df = ((pc_idx_clean_norm*m9c_norm_tst).sum(axis=1))/(m9c_norm_tst.sum())
recommendationTable_df.head()

Postal Code
M3A    0.082932
M4A    0.178192
M5A    0.548327
M6A    0.193100
M7A    0.548378
dtype: float64

In [61]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head(10)

Postal Code
M5K    0.734931
M5G    0.719990
M5H    0.716474
M5C    0.702586
M5B    0.700461
M5X    0.695552
M5L    0.695552
M5T    0.651619
M4Y    0.584375
M5E    0.571511
dtype: float64

In [62]:
# Show top 10 recommendations 
pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(10).keys())]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,0,0,6,45,3,...,0,3,0,5,3,5,1,3,0,0
15,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,0,4,10,45,2,...,1,2,1,8,30,4,5,8,0,0
20,M5E,Downtown Toronto,Berczy Park,43.6456,-79.3754,0,10,12,44,2,...,1,2,1,4,31,4,4,7,0,0
24,M5G,Downtown Toronto,Central Bay Street,43.6564,-79.386,0,0,10,46,2,...,1,2,1,7,12,6,3,6,0,0
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.6496,-79.3833,0,10,12,44,2,...,1,2,1,7,31,5,4,9,0,0
42,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.6469,-79.3823,0,10,12,44,2,...,1,2,1,8,31,5,4,8,0,0
48,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.6492,-79.3823,0,10,12,44,2,...,1,2,1,7,31,5,4,8,0,0
83,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.6541,-79.3978,0,1,8,46,1,...,0,0,0,1,4,7,3,1,0,0
96,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.6492,-79.3823,0,10,12,44,2,...,1,2,1,7,31,5,4,8,0,0
98,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.383,0,1,4,44,2,...,0,1,1,1,2,2,1,0,0,0


#### Create map of top 20 recommendations based on postal code M9C

In [63]:
# create map with top 20 recommended neighbourhoods

df = pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(20).keys())]

rec_map = folium.Map(location=[mean_lat, mean_long], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(rec_map)  
    
rec_map

#### We can see the postal code M9C is not part of the top 20

The M9C neighbourhood is actually missing from the top 20 recommendations plotted on the map. This could simply be because other neighbourhoods have even higher normalised values for the selected POIs.
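
A toy example shows how a neighbourhood can be outranked on a profile built from its own POIs. It assumes that each category is normalised by its maximum count across all neighbourhoods; the names and counts are invented:

```python
import pandas as pd

# Invented raw POI counts: "Suburb" has a few of each POI, "Centre" has many.
counts = pd.DataFrame(
    {"Park": [2, 8], "Pool": [1, 4], "Gas Station": [3, 6]},
    index=["Suburb", "Centre"],
)

# Assumed normalisation: divide each category by its maximum across neighbourhoods.
norm = counts / counts.max()

# Profile made of Suburb's own top categories, all weighted 1.
profile = pd.Series(1.0, index=["Park", "Pool", "Gas Station"])

scores = (norm * profile).sum(axis=1) / profile.sum()
print(scores)
```

The binary profile records only which categories matter, not how many venues the source neighbourhood has, so any area with more venues in those categories scores higher.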

### No Downtown test
#### Let's remove the Downtown borough and see how it affects the results

In [64]:
# create a data frame with the neighbourhoods in 'Downtown Toronto' removed
downtown_pcs = pc_data_clean.loc[pc_data_clean['Borough'] == 'Downtown Toronto', 'Postal Code']
pc_no_dt_idx_clean_norm = pc_idx_clean_norm.drop(index=downtown_pcs)
pc_no_dt_idx_clean_norm

Postal Code,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,Casino,Circus,Comedy Club,Concert Hall,Country Dance Club,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
M3A,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.032258,0.000000,0.0,0.0,0.0,0.0
M4A,0.0,0.0,0.000000,0.020408,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.4,0.0,0.0,0.0
M6A,0.0,0.0,0.166667,0.020408,0.333333,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.032258,0.000000,0.0,0.0,0.0,0.0
M9A,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.038462,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
M1B,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.285714,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
M1X,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
M8X,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
M7Y,0.0,0.0,0.333333,0.000000,0.333333,0.0,0.0,0.0,0.000000,1.0,...,1.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
M8Y,0.0,0.0,0.000000,0.061224,0.000000,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.4,0.0,0.0,0.0


In [65]:
# make recommendations for M9C postal code list with Downtown removed 
recommendationTable_df = ((pc_no_dt_idx_clean_norm*m9c_norm_tst).sum(axis=1))/(m9c_norm_tst.sum())
recommendationTable_df.head()

Postal Code
M3A    0.082932
M4A    0.178192
M6A    0.193100
M9A    0.040489
M1B    0.029694
dtype: float64

In [66]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head(10)

Postal Code
M5R    0.498320
M6J    0.438233
M7Y    0.368827
M2N    0.368667
M4P    0.364959
M4M    0.355859
M6R    0.339694
M4V    0.335226
M6H    0.333318
M4S    0.328664
dtype: float64

In [68]:
pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(10).keys())]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.6655,-79.4378,0,0,0,24,0,...,0,1,0,0,0,0,0,0,0,0
37,M6J,West Toronto,"Little Portugal, Trinity",43.648,-79.4177,0,1,4,47,0,...,1,0,0,0,2,0,2,1,0,0
54,M4M,East Toronto,Studio District,43.6561,-79.3406,0,0,1,20,0,...,0,0,0,0,1,1,1,0,0,0
59,M2N,North York,"Willowdale, Willowdale East",43.7673,-79.4111,0,0,3,2,2,...,0,0,0,1,0,0,1,0,0,1
67,M4P,Central Toronto,Davisville North,43.7135,-79.3887,0,0,2,1,1,...,0,0,0,0,1,0,1,0,0,0
74,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.6736,-79.4035,1,2,2,47,0,...,1,0,0,0,0,2,1,2,0,1
75,M6R,West Toronto,"Parkdale, Roncesvalles",43.6469,-79.4521,2,0,0,27,1,...,0,0,0,0,2,0,1,0,0,0
78,M4S,Central Toronto,Davisville,43.702,-79.3853,0,0,2,5,1,...,0,0,0,0,1,0,1,0,0,0
85,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.6861,-79.4025,0,1,0,15,1,...,0,0,0,1,0,0,0,2,0,0
99,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505,0,0,4,0,1,...,1,0,0,0,0,0,0,0,0,0


In [69]:
# create map with top 20 recommended neighbourhoods

df = pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(20).keys())]

rec_map = folium.Map(location=[mean_lat, mean_long], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(rec_map)  
    
rec_map

In [75]:
# look for potential missing data
# these neighbourhoods report no bus stops at all, which looks suspicious
pc_data_clean.loc[pc_data_clean['Bus Stop'] == 0]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Amphitheater,Aquarium,Arcade,Art Gallery,Bowling Alley,...,Road,Taxi Stand,Taxi,Tourist Information Center,Train Station,Tram Station,Transportation Service,Travel Lounge,Truck Stop,Tunnel
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
7,M3B,North York,Don Mills,43.745,-79.359,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7878,-79.1564,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
36,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.623,-79.3936,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
38,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.7298,-79.2639,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
39,M2K,North York,Bayview Village,43.7797,-79.3813,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
50,M9L,North York,Humber Summit,43.7598,-79.5565,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
77,M1S,Scarborough,Agincourt,43.7946,-79.2644,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
82,M4T,Central Toronto,"Moore Park, Summerhill East",43.6899,-79.3853,0,0,1,8,0,...,0,0,0,1,0,0,2,1,0,0
84,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.8177,-79.2819,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
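
The same check can be run across every category at once to spot POI types that look under-annotated. A sketch on invented counts (the postal codes and values are examples only):

```python
import pandas as pd

# Invented POI counts per neighbourhood.
counts = pd.DataFrame(
    {"Bus Stop": [0, 0, 5, 0], "Park": [3, 1, 2, 4], "Casino": [0, 0, 0, 1]},
    index=["M1A", "M2B", "M3C", "M4D"],
)

# Number of neighbourhoods reporting zero venues, per category.
zero_counts = (counts == 0).sum().sort_values(ascending=False)
print(zero_counts)
```

A high zero count for something as common as a bus stop hints at missing data, while for rare venues such as casinos it is expected.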


### No Downtown test on the user picked test set 

In [70]:
# create recommendation values for each neighbourhood
recommendationTable_df = ((pc_no_dt_idx_clean_norm*poi_real_user_series).sum(axis=1))/(poi_real_user_series.sum())
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head(10)

Postal Code
M5R    0.403280
M6J    0.389854
M6H    0.267841
M6K    0.242622
M4P    0.236772
M4V    0.232944
M2N    0.232102
M4M    0.231580
M4L    0.231477
M4S    0.231151
dtype: float64
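
The score, sort and filter steps are repeated for every test set in this notebook. They could be wrapped in one helper; `recommend` and its parameters are hypothetical names introduced here, and the data is invented:

```python
import pandas as pd

def recommend(norm_df, profile, top_n=10, exclude=None):
    """Score neighbourhoods against a 0/1 POI profile and return the top_n."""
    if exclude is not None:
        # Drop the rows (postal codes) that should not be recommended.
        norm_df = norm_df.drop(index=list(exclude))
    scores = (norm_df * profile).sum(axis=1) / profile.sum()
    return scores.sort_values(ascending=False).head(top_n)

# Invented normalised POI matrix and user profile.
norm = pd.DataFrame(
    {"Park": [1.0, 0.5, 0.2], "Bank": [1.0, 0.0, 0.4]},
    index=["Centre", "East", "West"],
)
profile = pd.Series({"Park": 1, "Bank": 1})

print(recommend(norm, profile, top_n=2))
print(recommend(norm, profile, top_n=2, exclude=["Centre"]))
```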

In [71]:
# create map with top 20 recommended neighbourhoods

df = pc_data_clean.loc[pc_data_clean['Postal Code'].isin(recommendationTable_df.head(20).keys())]

rec_map = folium.Map(location=[mean_lat, mean_long], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(rec_map)  
    
rec_map

## Discussion <a name="results"></a>

This study evaluated content-based filtering as a methodology for a neighbourhood recommender system. The results are based on four different test sets. The top 10 neighbourhood scores of each test set show that the system performs best for the user-selected POI set: its neighbourhood scores are considerably higher than those of the random sample sets.

Interestingly, for any given test set the city centre neighbourhoods seem to score best. As stated in the results section, many desired amenities can indeed be found in the city centre. However, a close look at the dataset acquired from the Foursquare API shows that the city centre is also better annotated, and that certain POI types are better annotated in general. For example, a search for Bus Stop shows that, according to the Foursquare API data, 14 neighbourhoods in Toronto have no bus stops within a 1000 m radius, whereas a search in Google Maps shows multiple bus stops in those areas. The Foursquare API is a good source of POI information, but an alternative data source should also be evaluated.

Despite the drawbacks discussed above, content-based filtering seems to work well for the neighbourhood recommender. The model can be refined and improved with an alternative normalisation technique or additional prediction algorithms. There are also plenty of unsupervised approaches that could be used as an alternative to content-based filtering. K-means clustering, for example, could be evaluated, but it is not part of this study.
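
As an illustration of what an alternative normalisation could look like, the sketch below scores neighbourhoods by cosine similarity instead of the weighted average. This approach was not evaluated in this study, and all names and values are invented. Cosine similarity compares the mix of POIs rather than their volume, so dense neighbourhoods do not dominate automatically.

```python
import numpy as np
import pandas as pd

# Invented normalised POI matrix.
norm = pd.DataFrame(
    {"Park": [1.0, 0.3, 0.0], "Pool": [1.0, 0.3, 0.1], "Bar": [1.0, 0.0, 0.9]},
    index=["Centre", "Suburb", "Nightlife"],
)
profile = pd.Series({"Park": 1.0, "Pool": 1.0, "Bar": 0.0})

# Align the profile with the matrix columns; unselected categories count as 0.
p = profile.reindex(norm.columns).fillna(0.0).to_numpy()
m = norm.to_numpy()

# Cosine similarity between every neighbourhood row and the profile.
# Caveat: a row of all zeros would give a division by zero (NaN).
cosine = pd.Series(
    (m @ p) / (np.linalg.norm(m, axis=1) * np.linalg.norm(p)),
    index=norm.index,
)
print(cosine.sort_values(ascending=False))
```

In this toy case the suburb ranks first because its POI mix matches the profile exactly, even though the centre has more of everything.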

## Conclusion <a name="conclusion"></a>

This study evaluated content-based filtering as a methodology for a neighbourhood recommender system. Based on four different tests, the system performs well for the user-selected set of POIs in comparison with the random sample selections. Even though content-based filtering seems to suit the neighbourhood recommender well, alternative approaches, such as K-means clustering or other unsupervised methods, should also be evaluated.