# A DISTRICT RECOMMENDER
_by George Lekkas_ <p>
Please note the text commentary in this notebook is a subset of the full contents of the Project Report
    
### Table of Contents
* [Introduction: Business Problem](#Introduction)
* [Data](#Data)
    * [District Data](#District_Data)
    * [Geocoding](#Geocoding)
    * [Venue Data](#Venue_Data)
    * [Venue Classification](#Venue_Classification)
    * [User Preferences](#User_Preferences)
* [Methodology](#Methodology)
* [Analysis](#Analysis)
    * [Transform Data](#Transform_Data)
    * [Extract Data](#Extract_Data)
    * [Calculate User Profile](#Calculate_User_Profile)
    * [Assign Score to Each District](#Assign_Score_To_Each_District)
    * [Propose Best Recommendation](#Propose_Best_Recommendation)

# 1. Introduction / Business Problem<a name="Introduction"></a>
Anyone having to move to a new city for work or studies must decide where to stay. The new “District Recommender” service comes to support such decisions with objective data.<p>
Users interested in the new “District Recommender” service may be workers who accepted job offers that require them to relocate to a new city, or students who are about to start their first year at a university. They will declare their preferences in terms of areas in cities they have visited and enjoyed in the past.<p>
The service will be successful if it can produce neighborhood recommendations for new destinations that are compatible with a user’s prior preferences. The new system will be tested by evaluating its recommendations on a small number of cities.

# 2. Data<a name="Data"></a>
The District Recommender service will support the mobility of young workers or students in many countries. To start with, the system will be developed and tested with three popular destination cities: 
-	London, England;
-	Glasgow, Scotland;
-	Copenhagen, Denmark. 


## 2.1 District data<a name="District_Data"></a>
Neighborhoods in each city will be defined by partially aggregated postcodes: popular, coarse-grained subdivisions that cover entire boroughs. We will not use the detailed postcodes that identify specific streets or even single buildings in the UK or Denmark. So, in the UK we will use the postal district ‘EC2’ (Bishopsgate), not the full postcode ‘EC2M 4NR’, of which there are 1.7 million in the UK.

Care must be taken to eliminate any districts that are not residential but are used for mail distribution purposes.
The easiest way to obtain the postcodes is to scrape them from web sites such as Wikipedia.
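
As a minimal sketch of the coarse-grained code described above, the helper below trims a full UK postcode down to its area letters and district digits. The function name `postal_district` and the regular expression are illustrative assumptions, not part of the notebook; the pattern deliberately ignores any trailing letter of the outward code, which matches the ‘EC2’ example.

```python
import re

# Hypothetical helper: keep the area letters and the district digits of a UK postcode
_DISTRICT_RE = re.compile(r'^([A-Z]{1,2}\d{1,2})')

def postal_district(postcode):
    """Return the coarse postal district of a full UK postcode, or None if it does not parse."""
    match = _DISTRICT_RE.match(postcode.strip().upper())
    return match.group(1) if match else None

print(postal_district('EC2M 4NR'))   # EC2
print(postal_district('W11 2BQ'))    # W11
```

Scraped lists such as the ones used below already come at this level of detail, so the helper is only needed when the starting point is a full postcode.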


In [4729]:
import pandas as pd
import numpy as np

# Get the latest version of BeautifulSoup and use the lxml parser for speed
#!pip install beautifulsoup4
from bs4 import BeautifulSoup
from lxml import html
import requests

# Submit a User-Agent header, else the connection is refused by the Wikipedia server
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

### 2.1.1 London district data

Read the Web page

In [4730]:
# The list of London postal districts
url_london="https://www.doogal.co.uk/london_postcodes.php"

# GET request fetches the raw HTML content
html_content_london = requests.get(url_london, headers=headers).text

# Parse the html content, skipping useless part in the beginning
start = html_content_london.find('Download ')
soup_london = BeautifulSoup(html_content_london[start:], "lxml")

Put data in a dictionary for each table row, of the form {'District_code': '', 'District_name': ''} <br> This will prove useful when we want to use logic to exclude records, change values etc.

In [4731]:
import re
london_data = []
# Postal districts are some of the clickable hyperlinks, select them via a regular expression
# Parse district code and description
html_districts = soup_london.find_all("a", href=re.compile("UKPostcodes.php"))
for tag in html_districts:
    t_row = {}
    text = tag.text.split(':')
    t_row['District_code'] = text[0]
    if len(text) > 1:
        t_row['District_name'] = text[1].lstrip()
    else:
        t_row['District_name'] = text[0]  # The web page we scrape is ill-formed for some central districts 
    
    london_data.append(t_row)

Convert into a dataframe. All districts correspond to real locations; this Web page lists no special postal service codes.

In [4732]:
london_df = pd.DataFrame(london_data)
london_df.tail(7)

Unnamed: 0,District_code,District_name
160,IG,Ilford
161,KT,Kingston
162,RM,Romford
163,SM,Sutton
164,TW,Twickenham
165,UB,Uxbridge
166,WD,Watford


In [4733]:
print("There are", len(london_df), "districts in London")

There are 167 districts in London


### 2.1.2 Glasgow district data
Read the Web page

In [4734]:
# The list of Glasgow postal districts
url_glasgow="https://en.wikipedia.org/wiki/G_postcode_area"

# GET request fetches the raw HTML content
html_content_glasgow = requests.get(url_glasgow, headers=headers).text

# Parse the html content, try to locate the area of interest 
start = html_content_glasgow.find('approximate coverage')
soup_glasgow = BeautifulSoup(html_content_glasgow[start:], "lxml")

Parse data and place them in a dictionary

In [4735]:
glasgow_data = []
# Parse district code and description
html_districts = soup_glasgow.tbody.find_all("tr")
for tag in html_districts:
    t_row = {}
    t_row['District_code'] = tag.th.text.rstrip()
    col_idx = 0
    for cell in tag.find_all('td'):
        text = cell.text
        # The 3rd column is the 2nd <td> element due to the table formatting
        if col_idx == 1:
            text = cell.text.split(':')
            # some cells have a prefix ending in a colon that should be stripped, others don't
            if len(text) > 1:
                t_row['District_name'] = text[1].strip()
            else:
                t_row['District_name'] = text[0].rstrip()
                   
        # local authority area
        if col_idx == 2:
            t_row['Local_authority_area'] = cell.text.rstrip()
        
        col_idx = col_idx + 1
    

    glasgow_data.append(t_row)

In [4736]:
glasgow_data[0:2]

[{'District_code': 'Postcode district'},
 {'District_code': 'G1',
  'District_name': 'Merchant City',
  'Local_authority_area': 'Glasgow City'}]

Convert into dataframe. Must drop the non-geographic postcodes

In [4737]:
glasgow_df = pd.DataFrame(glasgow_data)

# drop empty initial row
glasgow_df.dropna(inplace=True)

# drop non geographic destinations
glasgow_df = glasgow_df[glasgow_df['Local_authority_area']!='non-geographic']

# Now we don't need that column anymore
glasgow_df.drop('Local_authority_area', axis=1, inplace=True)

glasgow_df.head(10)

Unnamed: 0,District_code,District_name
1,G1,Merchant City
2,G2,"Blythswood Hill, Anderston (part)"
3,G3,"Anderston, Finnieston, Garnethill, Park, Woodl..."
4,G4,"Calton (part), Cowcaddens (part), Drygate, Kel..."
5,G5,Gorbals
7,G11,"Broomhill, Partick, Partickhill"
8,G12,"West End (part), Cleveden, Dowanhill, Hillhead..."
9,G13,"Anniesland, Knightswood, Yoker"
10,G14,"Whiteinch, Scotstoun"
11,G15,Drumchapel


In [4738]:
print("There are", len(glasgow_df), "postcodes in Glasgow")

There are 51 postcodes in Glasgow


### 2.1.3 Copenhagen district data
Read the Web page

In [4739]:
# The list of Copenhagen postal districts
url_copenhagen="https://en.wikipedia.org/wiki/List_of_postal_codes_in_Denmark"

# GET request fetches the raw HTML content
html_content_copenhagen = requests.get(url_copenhagen, headers=headers).text

# Parse the html content, try to locate the area of interest 
start = html_content_copenhagen.find('mw-headline')
end = html_content_copenhagen.find('List_of_postal_codes_in_Europe') 
soup_copenhagen = BeautifulSoup(html_content_copenhagen[start:end], "lxml")

Insert it into a dictionary

In [4740]:
copenhagen_data = []
html_districts = soup_copenhagen.find_all("li")
for tag in html_districts:
    t_row = {}
    text = tag.text.split(' -')
    t_row['District_code'] = text[0].rstrip()
    if len(text) > 1:
        t_row['District_name'] = text[1].lstrip()
    else:
        t_row['District_name'] = text[0]
    
    copenhagen_data.append(t_row)

In [4741]:
copenhagen_data[0:2]

[{'District_code': '1000-1499', 'District_name': 'Copenhagen K'},
 {'District_code': '1500-1799', 'District_name': 'Copenhagen V'}]

Convert into a dataframe. The Web page lists all Denmark codes, so we must drop postcodes outside Copenhagen, i.e. greater than or equal to 3000.

In [4742]:
copenhagen_df = pd.DataFrame(copenhagen_data)

# drop empty initial row
#copenhagen_df.dropna(inplace=True)

# drop all codes outside Copenhagen, Frederiksberg and vicinity
# (the string comparison works because every code starts with four digits)
copenhagen_df = copenhagen_df[copenhagen_df['District_code'] < '3000']

# drop Santa Claus, Greenland post code
copenhagen_df = copenhagen_df[copenhagen_df['District_code'] != '2412']

copenhagen_df.head(6)

Unnamed: 0,District_code,District_name
0,1000-1499,Copenhagen K
1,1500-1799,Copenhagen V
2,1800-1999,Frederiksberg C
3,2000,Frederiksberg
4,2100,Copenhagen Ø
5,2150,Nordhavn


In [4743]:
print("There are", len(copenhagen_df), "postcodes in Copenhagen")

There are 51 postcodes in Copenhagen
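
The filter `District_code < '3000'` compares strings character by character, which works here because every code starts with four digits. A more explicit variant, shown as a sketch with a hypothetical helper `leading_code` and made-up sample codes, converts the first four characters to an integer before comparing.

```python
import pandas as pd

def leading_code(district_code):
    """Hypothetical helper: integer value of the first four digits, e.g. '1000-1499' -> 1000."""
    return int(district_code[:4])

# Illustrative sample of codes, including ranges as they appear on the scraped page
codes = pd.Series(['1000-1499', '2000', '2990', '3000', '4000'])
inside = codes[codes.map(leading_code) < 3000]
print(inside.tolist())   # ['1000-1499', '2000', '2990']
```

The numeric form keeps working if shorter or longer codes ever enter the list, where a plain string comparison would silently misorder them.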


## 2.2 Geocoding<a name="Geocoding"></a>
Now we need geocoding information for all the districts that we will use later. We define a function that queries a geocoding service and returns the latitudes and longitudes as two lists.

In [4744]:
#!pip install geocoder
import time
import geocoder

# Query a geocoding service about district codes related to a city 
# Input data is in a 'District_code' column of a dataframe
# Returns two lists (latitudes, longitudes) that the caller adds to the dataframe
def geocode_dataframe(df, city):

    latitudes = []
    longitudes = []
    print('Obtaining location addresses for {}: '.format(city), end='')
    for postal_code in df['District_code']:
        lat_lng_coords = None
        # loop until you get the coordinates
        while(lat_lng_coords is None):
            g = geocoder.arcgis('{}, {}'.format(postal_code, city))
            lat_lng_coords = g.latlng
            #print(postal_code, " returned lat =", g.lat)
            time.sleep(3)
        print(' .', end='')
        latitudes.append(lat_lng_coords[0])
        longitudes.append(lat_lng_coords[1])
    
    print(' done.')

    return latitudes, longitudes
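
The `while` loop in `geocode_dataframe` keeps retrying for as long as the service returns no coordinates. A bounded variant could look like the sketch below; `geocode_with_retry` and its parameters are hypothetical names, and the `lookup` argument stands in for a call such as `geocoder.arcgis(query).latlng`, which is not exercised here.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, pause=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out.

    lookup is any callable that returns [lat, lng] or None.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause)   # wait before asking again
    return None             # give up instead of looping forever

# Fake lookup that fails once and then succeeds, standing in for the real service
answers = iter([None, [55.6761, 12.5683]])
print(geocode_with_retry(lambda q: next(answers), '2100, Copenhagen, Denmark'))
```

Returning `None` after a fixed number of attempts lets the caller decide whether to drop the district or fall back to cached data.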

Obtain geocoding information for all test cities. To allow for repeated tests without problems on the limit of API calls per day, the data is saved to local files and read from there.

In [4745]:
online_geocoding = False

# Suppress warning, we are indeed using the correct assignment method
pd.set_option('mode.chained_assignment', None)

if online_geocoding:
    london_df.loc[:,'Latitude'], london_df.loc[:,'Longitude'] = geocode_dataframe(london_df, "London, England")
    glasgow_df.loc[:,'Latitude'], glasgow_df.loc[:,'Longitude'] = geocode_dataframe(glasgow_df, "Glasgow, Scotland")
    copenhagen_df.loc[:,'Latitude'], copenhagen_df.loc[:,'Longitude'] = geocode_dataframe(copenhagen_df, "Copenhagen, Denmark")
else:
    london_df = pd.read_csv('London_districts_geocoded.csv')
    glasgow_df = pd.read_csv('Glasgow_districts_geocoded.csv') 
    copenhagen_df = pd.read_csv('Copenhagen_districts_geocoded.csv') 
    
pd.set_option('mode.chained_assignment', 'warn')

In [4746]:
print("London", london_df.shape)
print("Glasgow", glasgow_df.shape)
print("Copenhagen", copenhagen_df.shape)

London (167, 5)
Glasgow (51, 5)
Copenhagen (51, 5)


In [4747]:
# If we contacted an online provider, save a recent copy of the results in case the API 
#    becomes unavailable, or our experimentation hits daily usage limits 
if online_geocoding:
    london_df.to_csv('London_districts_geocoded.csv')
    glasgow_df.to_csv('Glasgow_districts_geocoded.csv')
    copenhagen_df.to_csv('Copenhagen_districts_geocoded.csv')
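
Saving with the default settings writes the row index as an extra, unnamed column, which comes back as `Unnamed: 0` when the file is read again; this is why the cached dataframes report five columns. The sketch below, on a made-up two-row frame, shows the effect and how `index=False` avoids it.

```python
import io
import pandas as pd

df = pd.DataFrame({'District_code': ['G1', 'G2'], 'Latitude': [55.86, 55.86]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))
# index=False: only the real columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))      # ['Unnamed: 0', 'District_code', 'Latitude']
print(list(without_index.columns))   # ['District_code', 'Latitude']
```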

## 2.3 Venue data<a name="Venue_Data"></a>

### 2.3.1 Foursquare as source of data
For each district, we will use the Foursquare API https://developer.foursquare.com to obtain a list of available venues in each location. The list of venues will be used to define the profile of each district.<p>
    
First, let's define the API keys and the version of the API. The following cell will be hidden.

In [4748]:
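import os

# Read the Foursquare credentials from the environment rather than hard-coding them in the notebook
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are illustrative)
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version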


### 2.3.2 Concatenate city data sets 
We used separate dataframes up to and including geocoding because each city or country may have different sources for its district data. Now that every district has coordinates, we'll merge all available data into a single data set.

In [4749]:
# We concatenate the data and construct a hierarchical index, the first part will be the city name
# This lets us work with subsets of the data, once the city of destination is known

frames = [london_df, glasgow_df, copenhagen_df]
cities_df = pd.concat(frames, keys=['London', 'Glasgow', 'Copenhagen'], ignore_index=False, sort=False)

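# The saved index column only exists when the data was read back from the CSV files
cities_df.drop('Unnamed: 0', axis=1, inplace=True, errors='ignore')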
cities_df.tail()

Unnamed: 0,Unnamed: 1,District_code,District_name,Latitude,Longitude
Copenhagen,46,2950,Vedbæk,55.67567,12.56756
Copenhagen,47,2960,Rungsted Kyst,55.67567,12.56756
Copenhagen,48,2970,Hørsholm,55.67567,12.56756
Copenhagen,49,2980,Kokkedal,55.67567,12.56756
Copenhagen,50,2990,Nivå,55.67567,12.56756


In [4750]:
cities_df.shape

(269, 4)

### 2.3.3 Systematic extraction of venue data
Let's create a function that will, for each district:
1. Create a Foursquare API call with the appropriate coordinates
2. Retrieve nearby venues within the requested radius (500 m by default) in JSON format
3. Filter data to extract venue category
4. Return a dataframe containing venue name, venue location, venue category


In [4751]:
# Note that we don't have to use the city name to disambiguate between district codes, the geographic coordinates
#   already do the job 

LIMIT = 200    # maximum number of venues requested from the Foursquare API
radius = 1000   # NOTE: not passed to getNearbyVenues in the call below, which therefore uses its default of 500

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name, end= ' ')
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&time={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            'any')
            
        # make the GET request
        r = requests.get(url).json()

        # Start from an empty list for every district, so that a failed or empty response
        # never reuses the venues of the previous district
        groups = r.get("response", {}).get('groups', [])
        results = groups[0]['items'] if groups else []
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['District_code', 
                  'District Latitude', 
                  'District Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
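
`getNearbyVenues` relies on a specific nesting in the JSON payload: `response`, then `groups`, then `items`, each holding a `venue` with a name, a location and a list of categories. The sketch below flattens a hand-made payload with that nesting; the helper name `venues_from_response` and the sample values are illustrative, and no API call is made.

```python
import pandas as pd

# Hand-made payload with the nesting that getNearbyVenues reads; values are illustrative
sample = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Pilpel',
                           'location': {'lat': 51.515195, 'lng': -0.098462},
                           'categories': [{'name': 'Falafel Restaurant'}]}},
                {'venue': {'name': "Postman's Park",
                           'location': {'lat': 51.51686, 'lng': -0.097643},
                           'categories': [{'name': 'Park'}]}},
            ]
        }]
    }
}

def venues_from_response(payload):
    """Hypothetical helper: flatten one payload into (name, lat, lng, category) rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0]['items'] if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # A venue without categories gets None instead of raising an IndexError
        categories = venue.get('categories') or [{'name': None}]
        rows.append((venue['name'], venue['location']['lat'],
                     venue['location']['lng'], categories[0]['name']))
    return pd.DataFrame(rows, columns=['Venue', 'Venue Latitude',
                                       'Venue Longitude', 'Venue Category'])

print(venues_from_response(sample))
print(len(venues_from_response({'response': {}})))   # 0
```

Testing the parsing step on a fixed payload like this makes it possible to change the extraction logic without spending API calls.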

In [4752]:
# We may choose not to call the Foursquare API but read the data from a file that had been saved earlier
online_venueinfo = False

if online_venueinfo:
    city_venues = getNearbyVenues( names = cities_df['District_code'],
                               latitudes = cities_df['Latitude'],
                               longitudes = cities_df['Longitude']
                              )

In [4753]:
if online_venueinfo:
    city_venues.to_csv('city_venues.csv')
else:
    city_venues = pd.read_csv('city_venues.csv')

In [4754]:
city_venues.shape

(11639, 8)

In [4755]:
city_venues[0:3]

Unnamed: 0.1,Unnamed: 0,District_code,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,EC1A,51.516355,-0.099135,Pilpel,51.515195,-0.098462,Falafel Restaurant
1,1,EC1A,51.516355,-0.099135,Virgin Active,51.517952,-0.097651,Gym / Fitness Center
2,2,EC1A,51.516355,-0.099135,Postman's Park,51.51686,-0.097643,Park


## 2.4 Venue Classification<a name="Venue_Classification"></a>
The Foursquare API supports a hierarchy of categories https://developer.foursquare.com/docs/build-with-foursquare/categories/. We will try to use the hierarchy to improve neighborhood comparison.


Let's check how many distinct venue categories were collected.

In [4756]:
print('There are {} unique categories.'.format(len(city_venues['Venue Category'].unique())))

There are 359 unique categories.


### 2.4.1 Get Foursquare venue categories

In [4757]:
# Create the API request URL
url_categs = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)

root = requests.get(url_categs).json()

# json_categs holds the full hierarchical catalog of Foursquare categories in a JSON dictionary data structure
json_categs = root['response']['categories']

### 2.4.2 Steps to calculate venue aggregation groups

The Foursquare exploration API returns a detailed classification for each venue: for example, it distinguishes between a Spanish restaurant, an Italian restaurant, and so on. We would like to select certain categories in the category hierarchy, that can serve as __aggregation points__ for all detailed categories below.<p>
    
We need a new function that receives a venue category as input, and returns the corresponding, higher level category that will be used to aggregate venues.
The function may also return the value of None, if the venue belongs to a category that is of no interest to our analysis.<br>
The steps are:
1. Define a list of high-level categories deemed useful for aggregation and comparison purposes
2. Augment the Categories JSON dictionary with links that allow upward navigation in the hierarchy 
3. Define a search function that, given an input string, searches the Categories dictionary to locate the corresponding category
4. Define a lookup function that, starting from a specific node in the category tree, navigates up to the node that will act as point of aggregation
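
The four steps can be sketched on a toy tree before applying them to the real catalogue. The tree, the function names and the single aggregation point below are illustrative assumptions; only the `name` / `categories` layout mirrors the Foursquare data.

```python
# Toy category tree in the same shape as the Foursquare catalogue (names are illustrative)
toy_tree = [
    {'name': 'Food', 'categories': [
        {'name': 'Asian Restaurant', 'categories': [
            {'name': 'Chinese Restaurant', 'categories': []}]},
        {'name': 'Bakery', 'categories': []}]},
    {'name': 'Residence', 'categories': [
        {'name': 'Housing Development', 'categories': []}]},
]
aggregation_points = ['Food']              # step 1: categories used for comparison

def link_parents(node, parent=None):       # step 2: add upward links
    node['parent'] = parent
    for child in node.get('categories', []):
        link_parents(child, node)

def search(nodes, name):                   # step 3: find the node with this exact name
    for node in nodes:
        if node['name'] == name:
            return node
        found = search(node.get('categories', []), name)
        if found is not None:
            return found
    return None

def aggregate(node):                       # step 4: climb to the nearest aggregation point
    while node is not None:
        if node['name'] in aggregation_points:
            return node['name']
        node = node['parent']
    return None

for top in toy_tree:
    link_parents(top)

print(aggregate(search(toy_tree, 'Chinese Restaurant')))   # Food
print(aggregate(search(toy_tree, 'Housing Development')))  # None
```

A result of `None` marks a venue whose branch contains no aggregation point, so it can be left out of the comparison.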

### 2.4.3 Selection of categories used in comparison
We start by selecting most of the top-level venue categories.<br>
The approach lets us be selective and treat some parts of the graph in more detail than others. For example, we might add 'Movie Theaters' as a venue type of particular interest and use it in district comparisons.

In [4758]:
# Our initial select list of categories to be used in district comparisons
# The user can add to these in the Preferences section
useful_categories = ['Arts & Entertainment', 'College & University', 'Event', 'Food', 'Nightlife Spot', \
                     'Outdoors & Recreation', 'Professional & Other Places', 'Shop & Service', \
                     'Travel & Transport']

### 2.4.4 Augment hierarchy of categories with parent links

In [4759]:
# The set_parent function navigates each element in each tree of the JSON Category hierarchy 
# There are 10 distinct top-level categories
# A 'parent' element pointing upwards is added to each category

def set_parent(node):
    if 'categories' in node.keys():
        for n in node['categories']:
            n['parent'] = node
            set_parent(n)

# top_level_nodes is a global data structure that will hold the independent fragments of the Categories tree
top_level_nodes = []

for top in json_categs:
    top['parent'] = None
    top_level_nodes.append(top)
    set_parent(top)

### 2.4.5 A function to search categories

In [4760]:
# get_category receives a string as input, searches the Category trees and returns the corresponding node
# We don't need to check for cyclical graphs, etc. since the data is curated by Foursquare

# A top-down recursive search that looks for a perfect match to the category name
# and returns the matching node itself (not the ancestor through which it was reached)
def find_node(node, input):
    if node['name'] == input:
        return node
    for n in node.get('categories', []):
        found = find_node(n, input)
        if found is not None:
            return found
    return None
    
# Search each of the Category tree fragments for the input string
def get_category(input):
    for t in top_level_nodes:
        result = find_node(t, input)
        if result is not None:
            return result
    return None

### 2.4.6 Aggregation category lookup

In [4761]:
# The function starts from a specific point in the Category tree and navigates up the hierarchy 
# until it finds a node that can act as an aggregation point (i.e. is present in useful_categories)
# Note the function may also return None: not all venues are of interest for our comparison, and a value of None
# means the corresponding venue will later be dropped. None is also returned when the category was not found.

def get_aggregation_category(classification):
    node = classification
    while node is not None:
        if node['name'] in useful_categories:
            return node['name']
        node = node['parent']
    return None

In [4762]:
city_venues.head(3)

Unnamed: 0.1,Unnamed: 0,District_code,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,EC1A,51.516355,-0.099135,Pilpel,51.515195,-0.098462,Falafel Restaurant
1,1,EC1A,51.516355,-0.099135,Virgin Active,51.517952,-0.097651,Gym / Fitness Center
2,2,EC1A,51.516355,-0.099135,Postman's Park,51.51686,-0.097643,Park


Now we're going to augment the city_venues dataset with the High-Level Venue Category column 

In [4763]:
high_level_categories = []
for tc in city_venues['Venue Category']:
    dc = get_category(tc)
    #print("Detailed venue is:", tc, end=' ')
    hlc = get_aggregation_category(dc)
    #print(", \thigh-level category is: ", hlc)
    high_level_categories.append(hlc)
    
city_venues['HL Venue Category'] = high_level_categories

In [4764]:
city_venues[0:3]

Unnamed: 0.1,Unnamed: 0,District_code,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,HL Venue Category
0,0,EC1A,51.516355,-0.099135,Pilpel,51.515195,-0.098462,Falafel Restaurant,Food
1,1,EC1A,51.516355,-0.099135,Virgin Active,51.517952,-0.097651,Gym / Fitness Center,Outdoors & Recreation
2,2,EC1A,51.516355,-0.099135,Postman's Park,51.51686,-0.097643,Park,Outdoors & Recreation


## 2.5 User Preferences<a name="User_Preferences"></a>

The District Recommender service asks its users to submit their preferences in terms of districts they liked in a city they have visited in the past. These locations are used as input to calculate the user’s preferences in terms of type and density of venues in the district, and consequently of the lifestyle they would like to enjoy.
<br>
In our test example, the user rates a few districts of the city of London, with __W11__ (Notting Hill) obtaining the highest rating, followed by __N1__ (Islington) and __E9__ (Hackney). The user also declares the city of destination to be __Copenhagen__.

In [4765]:
# We assume the user is familiar with the city of London, so expressed preferences by picking areas 
#   on a London map and assigning them ratings from 1 to 5 
# The resulting data is shown below
user_input = [
            {'City': 'London', 'District_code':'W11', 'rating': 5.0},
            {'City': 'London', 'District_code':'N1', 'rating': 4.0},
            {'City': 'London', 'District_code':'E9', 'rating': 3.5}
            ]

input_districts = pd.DataFrame(user_input)

input_destination_city = 'Copenhagen'

Merge the above user preferences with the districts dataframe. This ensures that all user preferences correspond to districts that are in our database. If not so, we might have misspelled a code somewhere

In [4766]:
# Keep only the districts whose post code appears in the user's preferences
inputId = cities_df[cities_df['District_code'].isin(input_districts['District_code'].tolist())]
input_districts = pd.merge(inputId, input_districts)

# Let's recap user preferences
input_districts.set_index('District_code')

Unnamed: 0_level_0,District_name,Latitude,Longitude,City,rating
District_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
E9,"Hackney, Homerton",51.53777,-0.0448,London,3.5
N1,"Barnsbury, Canonbury, Islington",51.52969,-0.08697,London,4.0
W11,"Holland Park, Notting Hill",51.51244,-0.20639,London,5.0
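
An inner merge silently drops any preference whose code has no match, so a misspelled code simply disappears. One way to make such cases visible, sketched here on made-up frames with a deliberately wrong code, is a left merge with pandas' `indicator` option.

```python
import pandas as pd

known = pd.DataFrame({'District_code': ['W11', 'N1', 'E9'],
                      'District_name': ['Notting Hill', 'Islington', 'Hackney']})
# 'N11X' is a deliberately misspelled code
prefs = pd.DataFrame({'District_code': ['W11', 'N11X'], 'rating': [5.0, 4.0]})

# A left merge keeps every preference; the _merge column shows which ones found a match
checked = pd.merge(prefs, known, on='District_code', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'District_code'].tolist()
print(unmatched)   # ['N11X']
```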


# 3. Methodology<a name="Methodology"></a>

To build a __content-based recommender__ that recommends districts to our user, we need two ingredients that have become available after the work in the previous section:
* A full dataset of __districts & venue category populations__, that includes a city the user knows and a city where the user wants to move to. The dataset contains the number of venues in each venue category for each district.
* A set of __user preferences__ in terms of districts in the known city, and numerical ratings for each district.

The recommendation will be produced in five steps:
1. __Transform__ the dataset of districts and venue category populations, using one-hot encoding
2. __Extract__ from the full dataset only the districts that correspond to the user's preferences
3. __Calculate a user profile vector__ of weights assigned to venue categories, by multiplying user ratings and venue category numbers
4. Obtain a numeric __preference score for each unknown district__, by multiplying the user profile vector by the venue population data for each district
5. __Sort results__ and propose the top scoring alternatives
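
The five steps can be sketched end to end on a toy data set. All district codes, categories and ratings below are made up, and the sketch leaves out the scaling applied later in the analysis; it is meant only to show how the pieces fit together.

```python
import pandas as pd

# Toy venue list: one row per venue, with its district and high-level category
venues = pd.DataFrame({
    'District_code': ['A', 'A', 'A', 'B', 'B', 'X', 'X', 'Y', 'Y'],
    'Category':      ['Food', 'Food', 'Park', 'Bar', 'Food', 'Food', 'Park', 'Bar', 'Bar'],
})
ratings = pd.Series({'A': 5.0, 'B': 3.0})      # user ratings for known districts
candidates = ['X', 'Y']                        # districts in the destination city

# 1. Transform: one-hot encode, then average per district
onehot = pd.get_dummies(venues['Category']).astype(float)
features = onehot.groupby(venues['District_code']).mean()

# 2. Extract the districts the user rated
known = features.loc[ratings.index]

# 3. User profile: ratings-weighted sum of category frequencies
profile = known.T.dot(ratings)

# 4. Score each candidate district as a weighted average
scores = features.loc[candidates].dot(profile) / profile.sum()

# 5. Sort, best first
print(scores.sort_values(ascending=False))
```

District X, which mixes food venues and parks like the highly rated district A, scores higher than the bar-only district Y.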

# 4. Analysis<a name="Analysis"></a>

## 4.1 Transform data<a name="Transform_Data"></a>

In [4767]:
# one hot encoding
cities_onehot = pd.get_dummies(city_venues[['HL Venue Category']], prefix="", prefix_sep="")

cities_onehot.insert(0, 'District_code', city_venues['District_code'], allow_duplicates=False) 

cities_onehot.head()

Unnamed: 0,District_code,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Shop & Service,Travel & Transport
0,EC1A,0,0,1,0,0,0,0,0
1,EC1A,0,0,0,0,1,0,0,0
2,EC1A,0,0,0,0,1,0,0,0
3,EC1A,0,0,1,0,0,0,0,0
4,EC1A,0,0,0,0,1,0,0,0


In [4768]:
cities_onehot.shape

(11639, 9)

We'll group rows by district and take a numerical indicator of the frequency of occurrence of each category.
As a first approach to the problem, this will be the mean.

In [4769]:
cities_grouped = cities_onehot.groupby('District_code').mean().reset_index()
cities_grouped.head()

Unnamed: 0,District_code,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Shop & Service,Travel & Transport
0,1000-1499,0.261364,0.0,0.443182,0.102273,0.034091,0.0,0.102273,0.056818
1,1500-1799,0.0,0.0,0.538462,0.076923,0.153846,0.0,0.076923,0.153846
2,1800-1999,0.261364,0.0,0.443182,0.102273,0.034091,0.0,0.102273,0.056818
3,2000,0.0,0.0,0.411765,0.058824,0.176471,0.0,0.294118,0.058824
4,2100,0.127273,0.0,0.381818,0.054545,0.218182,0.0,0.218182,0.0


Looking at the values above, we see a large disparity between the values of the Food column and those of other columns such as Shop & Service. <br>
__The 5x-10x difference between the columns would have a disproportionate impact on the calculation of numerical values of the user preferences.__
<br>
For this reason we'll transform the data. Min-max scaling goes in the right direction but is not sufficient to cancel such a large discrepancy in values, so we'll use another transformation, the __QuantileTransformer__, which maps each column to a uniform distribution and is robust to outliers. 

In [4770]:
# We don't have a large dataset, so we use the n_quantiles parameter to silence the warning
from sklearn.preprocessing import QuantileTransformer

scaler = QuantileTransformer(n_quantiles=min(len(cities_grouped), 1000))

# We cannot pass a string column as parameter to the scaler
numeric_columns = cities_grouped.columns.drop('District_code')

cities_grouped[numeric_columns] = scaler.fit_transform(cities_grouped[numeric_columns])
cities_grouped.head()

Unnamed: 0,District_code,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Shop & Service,Travel & Transport
0,1000-1499,0.916667,0.0,0.494186,0.45155,0.300388,0.0,0.362403,0.527132
1,1500-1799,0.0,0.0,0.742248,0.284884,0.79845,0.0,0.207364,0.860465
2,1800-1999,0.916667,0.0,0.494186,0.45155,0.300388,0.0,0.362403,0.527132
3,2000,0.0,0.0,0.368217,0.228682,0.835271,0.0,0.841085,0.612403
4,2100,0.732558,0.0,0.304264,0.217054,0.883721,0.0,0.74031,0.0


This looks better
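
To see why the quantile transform evens out columns on very different scales, consider the small sketch below. The two columns are made-up values, one in a Food-like range and one much smaller; because the transform only looks at the rank of each value within its column, both end up on the same scale.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Two toy columns on very different scales (values are illustrative)
X = np.array([[0.40, 0.01],
              [0.45, 0.02],
              [0.50, 0.05],
              [0.55, 0.07],
              [0.60, 0.10]])

# One quantile per sample, as in the cell above
scaled = QuantileTransformer(n_quantiles=len(X)).fit_transform(X)
print(scaled)   # each column is spread evenly between 0 and 1
```

After the transform, a district's value in a column expresses where it ranks among all districts for that category, not how large the raw frequency is.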

In [4771]:
cities_grouped.shape

(259, 9)

## 4.2 Extract Data<a name="Extract_Data"></a>

Extract the subset of the cities' encoded data that belongs to districts known to the user

In [4772]:
# Keep only the districts that the user rated
user_city_data = cities_grouped[cities_grouped['District_code'].isin(input_districts['District_code'].tolist())]
user_city_data

Unnamed: 0,District_code,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Shop & Service,Travel & Transport
72,E9,0.521318,0.0,0.612403,0.618217,0.866279,0.0,0.476744,0.393411
145,N1,0.0,0.0,0.689922,0.926357,0.542636,0.992248,0.22093,0.271318
232,W11,0.432171,0.0,0.70155,0.655039,0.449612,0.0,0.75969,0.306202


## 4.3 Calculate User Profile<a name="Calculate_User_Profile"></a>


The user profile is represented as a vector with the weights of each venue category, e.g. Food 18, Shop & Service 7.5<br>  To obtain the weights we need to multiply the user ratings vector by the matrix containing the venue categories and weights of each district known to the user.<p>

In [4773]:
# The user data is a result of filtering, let's reset the index first
user_city_data = user_city_data.reset_index(drop=True)

In [4774]:
# Before the multiplication we'll drop any column that doesn't contain weights
user_categories = user_city_data.drop('District_code', axis=1)
user_categories

Unnamed: 0,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Shop & Service,Travel & Transport
0,0.521318,0.0,0.612403,0.618217,0.866279,0.0,0.476744,0.393411
1,0.0,0.0,0.689922,0.926357,0.542636,0.992248,0.22093,0.271318
2,0.432171,0.0,0.70155,0.655039,0.449612,0.0,0.75969,0.306202


We multiply the categories table by the ratings and then sum up the resulting table by column.<br>
This operation is actually a dot product between a matrix and a vector.

In [4775]:
# Calculate a dot product to get weights
user_profile = user_categories.transpose().dot(input_districts['rating'])

# Let's see the user profile (one weight per venue category, in column order)
user_profile

Arts & Entertainment           3.985465
College & University           0.000000
Food                           8.410853
Nightlife Spot                 9.144380
Outdoors & Recreation          7.450581
Professional & Other Places    3.968992
Shop & Service                 6.350775
Travel & Transport             3.993217
dtype: float64
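
Two entries of the profile can be checked by hand. The sketch below copies the Food and Nightlife Spot columns of the three rated districts from the table above, together with the ratings from the user input, and repeats the dot product with NumPy.

```python
import numpy as np

# Scaled weights of the three rated districts (rows: E9, N1, W11),
# copied from the table above (columns: Food, Nightlife Spot)
weights = np.array([[0.612403, 0.618217],
                    [0.689922, 0.926357],
                    [0.701550, 0.655039]])
ratings = np.array([3.5, 4.0, 5.0])   # ratings for E9, N1, W11

profile = weights.T @ ratings
print(profile.round(4))   # approximately [8.4108 9.1444]
```

Both values agree with the Food and Nightlife Spot entries of the profile printed above, up to rounding of the displayed inputs.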

## 4.4 Assign Score to Each District<a name="Assign_Score_To_Each_District"></a>

To arrive at a single number for each district, we'll calculate a __weighted average__: the weights corresponding to venue categories of each district will be multiplied by the weight of the user's preference for each category.
<br>
At this point we'll also use the __destination city__ parameter: there is no need to predict preference values for districts of other cities.

In [4776]:
# Get the district codes that are of interest to the user, i.e. filtered by the "Destination City" index
destination_districts = cities_df['District_code'].loc[input_destination_city]

# Let's get the venue category data of the districts that interest us
#venue_categories_table = cities_grouped.set_index(cities_df['District_code'])
venue_categories_table = \
   cities_grouped[cities_grouped['District_code'].isin(destination_districts)].set_index(destination_districts)

# Drop unnecessary information before multiplying
venue_categories_table = venue_categories_table.drop(columns='District_code')
venue_categories_table.head(7)

District_code,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Shop & Service,Travel & Transport
1000-1499,0.916667,0.0,0.494186,0.45155,0.300388,0.0,0.362403,0.527132
1500-1799,0.0,0.0,0.742248,0.284884,0.79845,0.0,0.207364,0.860465
1800-1999,0.916667,0.0,0.494186,0.45155,0.300388,0.0,0.362403,0.527132
2000,0.0,0.0,0.368217,0.228682,0.835271,0.0,0.841085,0.612403
2100,0.732558,0.0,0.304264,0.217054,0.883721,0.0,0.74031,0.0
2150,0.0,0.0,0.0,0.0,0.95155,0.0,0.0,1.0
2200,0.364341,0.0,0.608527,0.676357,0.819767,0.806202,0.689922,0.0


In [4777]:
venue_categories_table.shape

(51, 8)

We'll now calculate the weighted average score of every district.

In [4778]:
# Multiply the categories by the weights and then take the weighted average
recommendation_table = ((venue_categories_table*user_profile).sum(axis=1))/(user_profile.sum())
recommendation_table.head(5)

District_code
1000-1499    0.429140
1500-1799    0.451454
1800-1999    0.429140
2000         0.443338
2100         0.432967
dtype: float64
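
To make the weighted average concrete, the sketch below recomputes the score of one district (1500-1799) in plain Python, using the user profile and the district row copied from the outputs shown above. The dictionary names are illustrative only.

```python
# User profile weights, copied from the user_profile output above
user_profile = {
    "Arts & Entertainment": 3.985465,
    "College & University": 0.000000,
    "Food": 8.410853,
    "Nightlife Spot": 9.144380,
    "Outdoors & Recreation": 7.450581,
    "Professional & Other Places": 3.968992,
    "Shop & Service": 6.350775,
    "Travel & Transport": 3.993217,
}

# Venue category weights of district 1500-1799, copied from the table above
district_1500_1799 = {
    "Arts & Entertainment": 0.0,
    "College & University": 0.0,
    "Food": 0.742248,
    "Nightlife Spot": 0.284884,
    "Outdoors & Recreation": 0.79845,
    "Professional & Other Places": 0.0,
    "Shop & Service": 0.207364,
    "Travel & Transport": 0.860465,
}

# Weighted sum of the district's category weights, normalised by the total user weight
weighted_sum = sum(district_1500_1799[cat] * w for cat, w in user_profile.items())
score = weighted_sum / sum(user_profile.values())

print(round(score, 6))
```

The result should agree, up to rounding of the displayed inputs, with the value listed for 1500-1799 in the recommendation table. Because every category weight lies between 0 and 1, the score is also bounded by 0 and 1.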

## 4.5 Propose Best Recommendation<a name="Propose_Best_Recommendation"></a>

Sort the recommendation table in descending order and put the results in a dataframe.

In [4824]:
# Sort with highest score first
recommendation_table = recommendation_table.sort_values(ascending=False)
rec_df = pd.DataFrame({'District_code':recommendation_table.index, 'Score':recommendation_table.values})

# Get district names from the full dataset (order is by District code)
destination_district_names = cities_df.loc[input_destination_city].set_index('District_code')['District_name']
destination_lats = cities_df.loc[input_destination_city].set_index('District_code')['Latitude']
destination_longs = cities_df.loc[input_destination_city].set_index('District_code')['Longitude']

# Match names to the recommendations dataframe
rec_names = destination_district_names.loc[rec_df['District_code']]
rec_lats = destination_lats.loc[rec_df['District_code']]
rec_longs = destination_longs.loc[rec_df['District_code']]

# Put the district name before the score with insert(), then add the coordinates after the score
rec_df.insert(1, 'District_name', rec_names.values)
rec_df.insert(3, 'Latitude', rec_lats.values)
rec_df.insert(4, 'Longitude', rec_longs.values)

rec_df.head(5)

Unnamed: 0,District_code,District_name,Score,Latitude,Longitude
0,2200,Copenhagen N,0.610662,55.696715,12.543886
1,2400,Copenhagen NV,0.546174,55.709575,12.528592
2,1500-1799,Copenhagen V,0.451454,55.66645,12.533538
3,2000,Frederiksberg,0.443338,55.6704,12.511796
4,2100,Copenhagen Ø,0.432967,55.705645,12.572474


The final output shown to the user will be better understood if it contains the district name
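
An alternative way to attach the district names is a single `merge` on the district code, instead of separate lookups and `insert()` calls. The sketch below is only an illustration under assumed inputs: it uses three of the codes, names and scores from the output above, and the variable names `scores` and `districts` are hypothetical.

```python
import pandas as pd

# Scores indexed by district code (values taken from the output above)
scores = pd.Series(
    {"2000": 0.443338, "2200": 0.610662, "1500-1799": 0.451454},
    name="Score",
)
scores.index.name = "District_code"

# Lookup table with one row per district
districts = pd.DataFrame(
    {
        "District_code": ["1500-1799", "2000", "2200"],
        "District_name": ["Copenhagen V", "Frederiksberg", "Copenhagen N"],
    }
)

# Sort by score, turn the index into a column, then join the names on the code
rec_sketch = (
    scores.sort_values(ascending=False)
    .reset_index()
    .merge(districts, on="District_code", how="left")
)

print(rec_sketch)
```

A left merge keeps the sorted order of the scores and matches names by key rather than by position, so the result does not depend on how the lookup table is ordered.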

### 4.5.1 Show recommendation on a map
The above is the final recommendation table to be shown in the District Recommender app.<br>
We'll probably also want to show these results on a map.

In [4789]:
# Setup data for the map
input_dest_city_country = "Copenhagen, Denmark"

g = geocoder.arcgis('{}'.format(input_dest_city_country))
lat_lng_coords = g.latlng
latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

Copenhagen, Denmark  returned lat = 55.67567000000008


In [4862]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium
# create map and display it
destination_map = folium.Map(location=[latitude, longitude], zoom_start=12)

In [4864]:
# instantiate a feature group for the recommended districts in the dataframe
rec_districts = folium.map.FeatureGroup()

# loop through the first 5 recommendations and add each to the feature group
for index, row in rec_df.head(5).iterrows():
    name = row['District_name']
    score = row['Score']
    label_text = "%s, score: %s" % (name, str(round(score, 2)))
    lx = row['Latitude']
    ly = row['Longitude']
    
    rec_districts.add_child(
        folium.features.CircleMarker(
            [lx, ly],
            radius=5, # size of circle markers
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.8
        )
    )
    folium.Marker([lx, ly], popup=label_text).add_to(destination_map)

# add recommendations to map
destination_map.add_child(rec_districts)

# 5. Results<a name="Results"></a>

## 5.1 The Process Followed<a name="The_Process_Followed"></a>
The work performed had two main stages: Data Collection and Data Analysis.<p>
During the Data Collection phase:
1. Scraping of Web pages (Wikipedia) to obtain data related to the postcode districts of 3 major European cities: London, Copenhagen and Glasgow.
2. Geocoding of the districts and then use of the georeferencing information to query the Foursquare API about the venues present in each district.
3. Query of the Foursquare API to obtain the full hierarchy of categories in JSON format. Use of the hierarchy to add a new attribute to each venue, called High-Level Venue Category.
4. Collection of the user's preferences: the destination city was Copenhagen, Denmark, while the lifestyle preferences were declared as three neighborhoods with associated ratings: Hackney, Islington and Notting Hill.

During the Data Analysis phase:
1. The data on High-Level Venue Categories was transformed via one-hot encoding.
2. Each column was grouped by District, using the mean() value.
3. The mean() value alone was not appropriate for recommendations, so one more transformation was necessary. After some experimentation, each column was scaled using a Quantile Transformer (for more details, see the Project Report).
4. This grouped and transformed data on "Venue categories per district" is representative of the profile of each neighborhood. It is the full set of "district profiles" in terms of number and types of venues present.
5. The subset of this data related to user preferences was extracted and multiplied by the user ratings to calculate the User Profile, a vector of venue categories and numbers. It was of the form:


| High Level Venue Category | Score  |
| --- | --- |
|Arts & Entertainment           |3.985465|
|Food                           |8.410853|
|Nightlife Spot                 |9.144380|
|...          |...|

6. The rows related to the destination city were extracted from the district profiles and combined with the numerical user profile of the previous step to obtain a numerical score for each district. The score is a weighted average:
    District_score = sum(District_numeric_profile * User_numeric_profile) / sum(User_numeric_profile)
7. The top 5 recommendations were displayed on a map. The markers are clickable and display the name of each district and the associated score.
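
Steps 1 and 2 of the analysis phase (one-hot encoding, then the mean per district) can be sketched as follows. The venue rows and the column name `High_Level_Category` are hypothetical, and the Quantile Transformer scaling of step 3 is deliberately left out.

```python
import pandas as pd

# Hypothetical venues: one row per venue, with its district and high-level category
venues = pd.DataFrame(
    {
        "District_code": ["2200", "2200", "2200", "2100"],
        "High_Level_Category": ["Food", "Food", "Nightlife Spot", "Food"],
    }
)

# Step 1: one-hot encode the category (one column per category, 1.0 where it applies)
one_hot = pd.get_dummies(venues["High_Level_Category"], dtype=float)
one_hot.insert(0, "District_code", venues["District_code"])

# Step 2: group by district and take the mean of each category column
district_profiles = one_hot.groupby("District_code").mean()

print(district_profiles)
```

After this step each row holds the share of a district's venues that falls in each category, so a district with a single venue gets a share of 1.0 in that venue's category. This is one way to see why the raw mean needed a further transformation before it could be compared across districts.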

## 5.2 Assessment of results<a name="Assessment_Of_Results"></a>
The recommendations produced by the algorithm were credible. The user expressed preferences in terms of 3 London neighborhoods (Hackney, Islington and Notting Hill) that can roughly be associated with the "trendy / hipster" category.<p>
The District Recommender suggested the areas of Copenhagen North, North-West, West, East and of Frederiksberg. These are the central / semi-central boroughs of the city with many venues for Food, Art, Shops and Recreational activities; the names of Nørrebro, Vesterbro and Østerbro are well known internationally.<p>
The Project Report and presentation will contain the results of the Recommender under different inputs.

# 6. Discussion<a name="Discussion"></a>
A recommender using venues and locations presents some difficulties in the handling of the numerical values that represent the profile of each district. How much importance should we give to the existence of one or dozens of venues of the same type? Is it sufficient to mark "at least one venue exists", or should we differentiate districts to a greater degree based on the number of similar venues? Experiments with different scaling methods produced drastically different results.<p>
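
The effect of the scaling choice can be illustrated with a toy comparison. Everything in the sketch below is hypothetical (the districts, venue counts and user weights), and the two scalings are simple stand-ins, not the Quantile Transformer used in the notebook.

```python
# Hypothetical venue counts per district and hypothetical user weights
counts = {
    "District A": {"Food": 40, "Nightlife Spot": 1},
    "District B": {"Food": 5, "Nightlife Spot": 6},
}
user_profile = {"Food": 1.0, "Nightlife Spot": 2.0}


def weighted_average(profile):
    """Weighted average of a district profile using the user weights."""
    total = sum(user_profile.values())
    return sum(profile[cat] * w for cat, w in user_profile.items()) / total


# Scaling 1: binary presence ("at least one venue exists")
binary_scores = {
    name: weighted_average({cat: 1.0 if n > 0 else 0.0 for cat, n in row.items()})
    for name, row in counts.items()
}

# Scaling 2: divide each count by the largest count of that category in any district
max_per_cat = {cat: max(row[cat] for row in counts.values()) for cat in user_profile}
scaled_scores = {
    name: weighted_average({cat: n / max_per_cat[cat] for cat, n in row.items()})
    for name, row in counts.items()
}

print(binary_scores)
print(scaled_scores)
```

With binary presence the two districts are indistinguishable, while count-based scaling separates them clearly in favour of the district with more nightlife venues. Under these assumptions, the ranking depends as much on the scaling choice as on the underlying data.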
An improved approach in the future should integrate additional data on property prices. This is one aspect of the profile of each district that is not well represented in the venue profile obtained from Foursquare. Even down to the detail of "Italian Restaurant", there can be many price points and types of neighborhoods. Unfortunately, price data were not easy to find outside England. Where they are not available, a proxy could be used, such as hotel room prices for each area, given sufficient time to write programs and obtain the data.


# 7. Conclusion

The Content-Based Recommender of this notebook follows a well-known basic approach that can produce results based only on a user's preferences. It can be customised in detail: the user can select which categories to use for comparison and whether to use categories located at the bottom, middle or top level of the hierarchy.<p>
The system produces basic recommendations that are credible and not out of sync with reality, although somewhat coarse-grained. To improve the algorithm's accuracy, more data sources should be integrated and more advanced algorithms should be implemented, following the advances of specific research on the topic of Recommenders published in journals and discussed at international conferences.