# Chicago Restaurant Recommender System

## Table of Contents

1. Introduction
2. Methodology
3. Results
4. Conclusion

## Introduction

Anyone who has watched movies on Netflix or shopped on Amazon is familiar with the recommendations that appear while browsing.  The systems underpinning these suggestions have become an important feature for online businesses wishing to provide the best possible user experience.  They can reduce an overwhelming variety of choices to a small group of carefully chosen items.  They can also introduce users to items that they would not have searched for on their own.  By employing recommender systems, businesses hope to entice users to watch more movies and buy more products.

Foursquare is another company that provides recommendations to its users.  For the sake of this project, however, we will assume this is not the case.  In that world, Foursquare would be in severe jeopardy of losing users to competitors who can offer useful suggestions.  As a starting point, we will create a recommender system for restaurants in the city of Chicago, Illinois, USA.

We will employ data from Foursquare and the city of Chicago to build a content-based recommender system.  This method builds a profile for each user based on their previous ratings.  It then compares a prospective restaurant against that profile to estimate a rating.  Repeating this for the eateries within an area yields a list sorted from the highest estimated rating to the lowest.  The hope is that this will greatly improve customer satisfaction with Foursquare, so that the company can continue to be the leading location technology platform.

## Methodology

### Chicago Neighborhoods

The starting point for this task was the city of Chicago.  The city is well-known for its neighborhoods, officially called community areas.  Many people tend to frequent restaurants within a handful of neighborhoods due to the logistical challenges of going to places far from home or work.  With this in mind, we have used the neighborhoods as features for finding recommendations.

While the city publishes a rich assortment of data on its neighborhoods, none of it included the longitude/latitude information required for Foursquare API queries.  We therefore collected and entered this information manually into a spreadsheet of neighborhood census data, then deleted the extraneous columns.

In [3]:
# Import pandas library using an alias
import pandas as pd
# library to handle data in a vectorized manner
import numpy as np

# library to handle JSON files
import json
# library to handle requests
import requests
# transform JSON file into a pandas dataframe
# (pandas 1.0+; older versions expose it as pandas.io.json.json_normalize)
from pandas import json_normalize

# import geocoder
import geocoder
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# map rendering library
import folium

In [4]:
# Read the Excel file into a dataframe
df_chi_data = pd.read_excel('CCASF12010CMAP.xlsx')
df_chi_data.head()

Unnamed: 0,GEOGKEYX,GEOGNAME,LATITUDE,UNSIGNED LONGITUDE,LONGITUDE
0,GeogKey,Geog,,,
1,1,Rogers Park,42.016667,87.666667,-87.666667
2,2,West Ridge,42.0,87.683333,-87.683333
3,3,Uptown,41.966667,87.666667,-87.666667
4,4,Lincoln Square,41.966667,87.683333,-87.683333


In [5]:
# Drop the first row
df_chi_data.drop([0], axis = 0, inplace=True)

# Drop two unnecessary columns
df_chi_data.drop(['UNSIGNED LONGITUDE'], axis=1, inplace=True)
df_chi_data.drop(['GEOGKEYX'], axis=1, inplace=True)
df_chi_data.head()

Unnamed: 0,GEOGNAME,LATITUDE,LONGITUDE
1,Rogers Park,42.016667,-87.666667
2,West Ridge,42.0,-87.683333
3,Uptown,41.966667,-87.666667
4,Lincoln Square,41.966667,-87.683333
5,North Center,41.95,-87.683333


### Foursquare Data

Over the years Foursquare has built an extremely rich dataset that is centered on venues and users.  For each venue this set includes details such as the name, location and category.  While Foursquare gathers a plethora of data on its users, most of that information is not available to the public.  For this project, we are limited to identifying the users who liked each venue.

#### Building a Venue List

Working from our Chicago neighborhood data, we iterated through each neighborhood and queried Foursquare for venues within it.  We also collected the name, identification number, longitude and latitude for each venue.  The most important piece of information for our purposes is the category.  The reasoning is that if a user likes a lot of Thai restaurants, they are more likely to enjoy other Thai restaurants as well.  We limited our search results to those within the Food category and collected the subcategory for each.  This became the second feature for generating recommendations.
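As an illustration, the query string for one neighborhood could be assembled from a dictionary of parameters instead of a long format string.  This is only a sketch: the helper name `build_search_url` and the placeholder credentials are assumptions, while the endpoint and parameter names are the ones used in the search function in this section.  No request is sent.

```python
from urllib.parse import urlencode

# Endpoint taken from the notebook; the credentials below are placeholders.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lng, category_id, radius=500, limit=100,
                     intent="browse", client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET", version="20180605"):
    """Return the venue-search URL for one neighborhood center (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "categoryId": category_id,
        "radius": radius,
        "limit": limit,
        "intent": intent,
    }
    # urlencode escapes each value, so the comma in "ll" becomes %2C.
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url(42.016667, -87.666667, "4d4b7105d754a06374d81259")
print(url)
```

Keeping the parameters in a dictionary makes it easier to see which values vary per neighborhood (only `ll`) and which stay fixed for the whole run.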

In [6]:
# Define Foursquare variables
CLIENT_ID = 'CT3K4Z2AEBTWGOKQLQKZ135JJ3B44KOQTB4BMEJ4R0AXXWSD'
CLIENT_SECRET = 'TSJKVMWFPKG2ZV4NYZQDVIR5FIYOSRICDTHHKHWSMV5JHDMZ'
VERSION = '20180605'

In [7]:
# Create a function to search for venues in a given area
def getNearbyVenues(names, latitudes, longitudes, radius, LIMIT, CATEGORY, INTENT):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&radius={}&limit={}&intent={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            lat, 
            lng,
            VERSION,
            CATEGORY, 
            radius, 
            LIMIT, 
            INTENT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            v['name'], 
            v['id'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['location']['distance'],  
            v['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Venue', 
                  'Venue ID', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Distance',
                  'Venue Category']
    
    return(nearby_venues)

In [8]:
# Call function to create a dataframe of venues
chi_venues = getNearbyVenues(names=df_chi_data['GEOGNAME'],
                                   latitudes=df_chi_data['LATITUDE'],
                                   longitudes=df_chi_data['LONGITUDE'],
                                   radius=500,
                                   LIMIT=100, 
                                   CATEGORY = '4d4b7105d754a06374d81259',
                                   INTENT = 'browse'
                                  )
chi_venues.head()

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
0,Rogers Park,Charmers Cafe,5710dcf0498e87c71d20b69d,42.016164,-87.66825,142,Café
1,Rogers Park,Caribbean American Bakery,4b5dde6ef964a520fa7029e3,42.019371,-87.669705,392,Bakery
2,Rogers Park,Tjam Kitchen,5a2071ca47f876422319a3b6,42.01931,-87.66692,294,Restaurant
3,Rogers Park,Jarvis Grill,4c117c7e17002d7f4755e609,42.015989,-87.66888,198,Fast Food Restaurant
4,Rogers Park,Jamaican Bakery,51cf7f97498ee7d50a505393,42.018398,-87.669414,297,Bakery


In [9]:
# Check the dimensions
chi_venues.shape

(1338, 7)

One unexpected aspect of the data was the presence of chain restaurants.  While a user who likes one Starbucks will probably like another, we did not find value in this type of recommendation.  Chain restaurants, especially locations within the same city, offer an almost identical experience.  The purpose of this recommender system is to enhance the user experience, and including chains could skew results toward repeating the same experiences.  For those reasons, we have removed all venue names that appear more than five times.
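The chain filter can also be derived directly from the name counts rather than typed out by hand.  The sketch below uses a toy dataframe with made-up venue names in place of `chi_venues`; the threshold of five comes from the text, and the variable names are assumptions.

```python
import pandas as pd

# Toy data standing in for chi_venues; the names and counts are made up.
venues = pd.DataFrame({
    "Venue": ["Chain Cafe"] * 6 + ["Local Diner"] * 2 + ["Solo Bistro"],
    "Venue ID": ["id{}".format(i) for i in range(9)],
})

# Count how often each name occurs and keep the names above the threshold.
MAX_LOCATIONS = 5
name_counts = venues["Venue"].value_counts()
chain_names = name_counts[name_counts > MAX_LOCATIONS].index

# Keep only the rows whose name is not in the chain list.
independents = venues[~venues["Venue"].isin(chain_names)]
print(sorted(independents["Venue"].unique()))
```

Because the list is computed from the data, a chain that only shows up in a later query would be caught without editing the code.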

In [10]:
# Find venues that appear more than 5 times
chi_venues_grouped = chi_venues.groupby('Venue').filter(lambda x: len(x) > 5)
chi_venues_grouped

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
90,Uptown,Subway,4b5b3875f964a52031ec28e3,41.965282,-87.661418,461,Sandwich Place
121,Lincoln Square,Starbucks,4aa3dfaaf964a520384420e3,41.964799,-87.685861,294,Coffee Shop
128,Lincoln Square,Potbelly Sandwich Shop,49f4c21ff964a5204b6b1fe3,41.966985,-87.687272,327,Sandwich Place
143,Lincoln Square,Dunkin',4c52d9412543a593290bfc85,41.966271,-87.688664,443,Donut Shop
166,North Center,Starbucks,54273f56498e550c0584a8bb,41.947936,-87.688509,486,Coffee Shop
169,North Center,Potbelly Sandwich Shop,542eed7c498e15b89f63ad00,41.948428,-87.688678,475,Sandwich Place
179,Lake View,Dunkin',4b5b6256f964a520aef928e3,41.954298,-87.650165,478,Donut Shop
180,Lake View,Dunkin',532488dc498e89b38c11821c,41.947083,-87.653812,452,Donut Shop
185,Lake View,Subway,4bd4d3736798ef3bbf80628d,41.947220,-87.653939,449,Sandwich Place
202,Lake View,Subway,4a42e581f964a5205da61fe3,41.951745,-87.649436,199,Sandwich Place


In [11]:
# Remove all the rows matching each of those venues
chains = ["Dunkin'", "Potbelly Sandwich Shop", "Starbucks", "Subway",
          "Baskin-Robbins", "Popeyes Louisiana Kitchen", "McDonald's"]
chi_venues = chi_venues[~chi_venues['Venue'].isin(chains)]

# Check the new dimensions
chi_venues.shape

(1223, 7)

In [12]:
# Verify there are no more chains
chi_venues_grouped = chi_venues.groupby('Venue').filter(lambda x: len(x) > 5)
chi_venues_grouped

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category


Another thing that could skew our data is the possibility of duplicates.  Our Foursquare query used a radius from the center of each neighborhood.  Since neighborhoods come in a variety of shapes and sizes, this creates the possibility of retrieving the same establishment for multiple neighborhoods.  We therefore checked for duplicates based on the venue identification number.

In [13]:
# Find the duplicate venues
chi_venues_duplicates = chi_venues[chi_venues.duplicated(['Venue ID'])]
print("Duplicate venues are:", chi_venues_duplicates, sep='\n')

Duplicate venues are:
            Neighborhood                                          Venue  \
314           North Park                                   Coffee Joint   
315           North Park                                  Laschet's Inn   
317           North Park                                      Mod Pizza   
318           North Park                   Reclaimed Bar and Restaurant   
319           North Park                               Borinquen Lounge   
320           North Park                                   Pete's Pizza   
625   East Garfield Park                               Al's Under the L   
626   East Garfield Park                           Lake's Best Pizzaria   
627   East Garfield Park                                   Vegies Pizza   
628   East Garfield Park                            Supper Club Chicago   
629   East Garfield Park                               All Time Bar B Q   
630   East Garfield Park                                      C & D BBQ   
631

We found sixty-one duplicates in our dataset.  Because it would be difficult to accurately identify the true neighborhood for each of these, we assigned each venue to the neighborhood whose center it is closest to.  The first step was to sort our dataset on the venue identifier and distance.  Then we removed all duplicates aside from the first entry, which is the closest one.  We checked a sample entry before and after to confirm that only one row remains.
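The same "keep the closest neighborhood" rule can be expressed without sorting, by asking for the row with the minimum distance per venue.  This is an alternative sketch on made-up rows, not the approach used in the cells of this section.

```python
import pandas as pd

# Toy rows: venue "v1" was returned for two neighborhoods (values made up).
venues = pd.DataFrame({
    "Neighborhood": ["A", "B", "B"],
    "Venue ID": ["v1", "v1", "v2"],
    "Distance": [480, 120, 300],
})

# For each venue ID, find the row label with the smallest distance.
closest_rows = venues.groupby("Venue ID")["Distance"].idxmin()

# Select those rows, so each venue keeps its nearest neighborhood.
deduped = venues.loc[closest_rows].reset_index(drop=True)
print(deduped)
```

When two neighborhoods report the same distance for a venue, `idxmin` keeps whichever row comes first, so ties are resolved by row order rather than by geography.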

In [14]:
# Sort the venues by ID and distance
chi_venues.sort_values(by=['Venue ID', 'Distance'], inplace=True)
chi_venues.head()

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
746,The Loop,Atwood,3fd66200f964a520c7f01ee3,41.883205,-87.628191,426,New American Restaurant
234,Lincoln Park,Sai Cafe,3fd66200f964a520e1ed1ee3,41.918481,-87.653361,343,Sushi Restaurant
718,The Loop,Monk's Pub,40b28c80f964a52045fb1ee3,41.88564,-87.634339,269,Pub
130,Lincoln Square,Daily Bar & Grill,40b28c80f964a5205ffd1ee3,41.964823,-87.686073,305,Bar
168,North Center,Laschet's Inn,40b28c80f964a520a5fc1ee3,41.954091,-87.681978,469,German Restaurant


In [15]:
# Check entries for a given venue
chi_venues[chi_venues['Venue ID'] == '59b5338d28122f42d21e8656']

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
756,Near South Side,Woven & Bound - Marriott Marquis Chicago,59b5338d28122f42d21e8656,41.85331,-87.620071,464,American Restaurant
838,Douglas,Woven & Bound - Marriott Marquis Chicago,59b5338d28122f42d21e8656,41.85331,-87.620071,464,American Restaurant


In [16]:
# Drop duplicates, keeping the first entry only
chi_venues.drop_duplicates(subset="Venue ID", keep='first', inplace=True)

# Check our sample venue again
chi_venues[chi_venues['Venue ID'] == '59b5338d28122f42d21e8656']

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
756,Near South Side,Woven & Bound - Marriott Marquis Chicago,59b5338d28122f42d21e8656,41.85331,-87.620071,464,American Restaurant


In [17]:
# Check the new dimensions
chi_venues.shape

(1165, 7)

In [18]:
# Reset the index after removing rows
chi_venues = chi_venues.reset_index(drop=True)
chi_venues.head()

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
0,The Loop,Atwood,3fd66200f964a520c7f01ee3,41.883205,-87.628191,426,New American Restaurant
1,Lincoln Park,Sai Cafe,3fd66200f964a520e1ed1ee3,41.918481,-87.653361,343,Sushi Restaurant
2,The Loop,Monk's Pub,40b28c80f964a52045fb1ee3,41.88564,-87.634339,269,Pub
3,Lincoln Square,Daily Bar & Grill,40b28c80f964a5205ffd1ee3,41.964823,-87.686073,305,Bar
4,North Center,Laschet's Inn,40b28c80f964a520a5fc1ee3,41.954091,-87.681978,469,German Restaurant


In [19]:
# Check number of venues for each neighborhood
chi_venues.groupby('Neighborhood').count()

Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
Albany Park,47,47,47,47,47,47
Archer Heights,6,6,6,6,6,6
Armour Square,49,49,49,49,49,49
Ashburn,11,11,11,11,11,11
Auburn Gresham,14,14,14,14,14,14
Austin,15,15,15,15,15,15
Avalon Park,22,22,22,22,22,22
Avondale,15,15,15,15,15,15
Belmont Cragin,36,36,36,36,36,36
Beverly,30,30,30,30,30,30


In [20]:
# How many unique categories
print('There are {} unique categories.'.format(len(chi_venues['Venue Category'].unique())))

There are 106 unique categories.


#### Building a User List

With our list of viable restaurants in hand, the next step was to build a list of users.  As mentioned above, Foursquare limits the amount of data available to the public.  For this system we were limited to finding the users who liked each of our venues.  We looped through our list of venues and built a list of users and the venues they liked.  We discovered that some venues have not been liked by any users, so we skipped those.  However, those venues are still useful in our dataset as possible recommendations.  Due to the limitations of the available data, we recorded the rating of each liked venue as one (1).  Since we cannot access data on dislikes or how often a user has visited the venue, our results may not be as rich as they could be.
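The extraction step can be sketched separately from the network call.  The helper `extract_likes` and both sample payloads below are hypothetical; the payloads only mimic the nesting that the likes function in this section reads (`response`, `likes`, `items`, `id`).

```python
def extract_likes(venue_id, payload):
    """Return (venue_id, user_id, rating) tuples from a likes response."""
    likes = payload.get("response", {}).get("likes", {})
    # Every like is recorded with an implicit rating of 1.
    return [(venue_id, item["id"], 1) for item in likes.get("items", [])]


# Hypothetical payloads shaped the way the notebook code expects.
liked = {"response": {"likes": {"items": [{"id": "101"}, {"id": "202"}]}}}
unliked = {"response": {"likes": {}}}

rows = extract_likes("venueA", liked) + extract_likes("venueB", unliked)
print(rows)
```

A venue without likes simply contributes no rows, which matches the decision to skip such venues when building the user list.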

In [21]:
# Create a function to get users who liked each venue
def getVenueLikes(venueids):
    
    venues_list=[]
    for venue_id in venueids:
        #print(venue_id)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(
            venue_id, 
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)
            
        # make the GET request
        results = requests.get(url).json()['response']['likes']
        
        # record each user who liked the venue, with a rating of 1
        if "items" in results:
            for i in results['items']:
                venues_list.append([(venue_id, i['id'], 1) ])
        else:
            print("No likes for venue {}.".format(venue_id))

    venue_likes = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    venue_likes.columns = ['Venue ID', 'User ID', 'Rating']
    
    return(venue_likes)

In [22]:
# Call function for each venue in our dataset
chi_venue_likes = getVenueLikes(venueids=chi_venues['Venue ID'])
chi_venue_likes.head()

No likes for venue 4b453428f964a520e10726e3.
No likes for venue 4b5e41eaf964a520708629e3.
No likes for venue 4b8b1e8ff964a520c29332e3.
No likes for venue 4b904e7bf964a520ff8233e3.
No likes for venue 4b9c0829f964a520ed4236e3.
No likes for venue 4b9fe205f964a520464737e3.
No likes for venue 4ba2a3ddf964a520c30b38e3.
No likes for venue 4ba43b9af964a520768e38e3.
No likes for venue 4ba56a9ff964a520050539e3.
No likes for venue 4bb12974f964a520987f3ce3.
No likes for venue 4bb1463cf964a520dd883ce3.
No likes for venue 4bbce13607809521ffe2d991.
No likes for venue 4bbfde2974a9a5932350cff6.
No likes for venue 4bc248e74cdfc9b6b9e99521.
No likes for venue 4bca602468f976b0045c5f83.
No likes for venue 4bcb48ad3740b713d3236265.
No likes for venue 4bd46a2129eb9c74a17e91e1.
No likes for venue 4bd60445637ba5939978f770.
No likes for venue 4bfd7d64f61dc9b6fb1a9fde.
No likes for venue 4c093c466071a593798cdd32.
No likes for venue 4c116a33d41e76b0905e310d.
No likes for venue 4c1f7248e923ef3b37bd4e54.
No likes f

No likes for venue 4f322fd719836c91c7beaf62.
No likes for venue 4f32337719836c91c7c017f6.
No likes for venue 4f3233c519836c91c7c0375e.
No likes for venue 4f32357419836c91c7c0e161.
No likes for venue 4f3235f519836c91c7c11167.
No likes for venue 4f32366319836c91c7c13a57.
No likes for venue 4f3236c019836c91c7c15e7b.
No likes for venue 4f32379519836c91c7c1ad62.
No likes for venue 4f3239c119836c91c7c28aea.
No likes for venue 4f323a2619836c91c7c2b337.
No likes for venue 4f323b5b19836c91c7c32a4a.
No likes for venue 4f323e3a19836c91c7c44fb2.
No likes for venue 4f32413819836c91c7c57bcc.
No likes for venue 4f32427919836c91c7c5facf.
No likes for venue 4f3242c419836c91c7c617a6.
No likes for venue 4f3243fc19836c91c7c695e6.
No likes for venue 4f3244c319836c91c7c6e78e.
No likes for venue 4f32463b19836c91c7c781d2.
No likes for venue 4f32467919836c91c7c79b41.
No likes for venue 4f32474e19836c91c7c7f2e9.
No likes for venue 4f32478e19836c91c7c80cf2.
No likes for venue 4f32480719836c91c7c83b66.
No likes f

No likes for venue 5159e309e4b0862f1cec1b4f.
No likes for venue 515a28e570430dedbbba9679.
No likes for venue 5160772ae4b006e8f72b17bb.
No likes for venue 5166fa2be4b0b9e4740adc09.
No likes for venue 5180833c498e14703c78575a.
No likes for venue 518b1cbc498e4bb1c631e42c.
No likes for venue 51976b93498e7818792c30a1.
No likes for venue 51a100e0498e11239470f37a.
No likes for venue 51c63cbc498e8400012cdd88.
No likes for venue 51ca49b2498e9851c779c678.
No likes for venue 51cec2d9ccdaddcc550cb17b.
No likes for venue 51cf7f97498ee7d50a505393.
No likes for venue 51d20e12498eaadd679d665e.
No likes for venue 51d5f4ec498ee77b08b1c588.
No likes for venue 51e1657a8bbd9beea4fe0123.
No likes for venue 51e1e439498eef8b3bbe33ba.
No likes for venue 5223916d11d27dd32e355ce5.
No likes for venue 522e1dc611d25e98528453c5.
No likes for venue 523cd91f11d22b2bae77a371.
No likes for venue 5248714d11d27f049676e8bf.
No likes for venue 524b7a09498ed08ed444a71d.
No likes for venue 5256b0a411d2156e3fcd1960.
No likes f

Unnamed: 0,Venue ID,User ID,Rating
0,3fd66200f964a520c7f01ee3,24740490,1
1,3fd66200f964a520c7f01ee3,9589924,1
2,3fd66200f964a520c7f01ee3,465843554,1
3,3fd66200f964a520e1ed1ee3,24486209,1
4,3fd66200f964a520e1ed1ee3,476508060,1


In [23]:
# Check the dimensions
chi_venue_likes.shape

(1571, 3)

### One-Hot Encoding

In order to use machine learning to produce our recommendations, we employed the one-hot encoding technique.  This converts the categorical features that we wish to build our model on (restaurant category and neighborhood) into binary indicator columns.  That allows our model to easily compare venues to find those that are most similar to ones a user has liked in the past.

Since we chose to use two pieces of data, we performed this exercise twice.  First, we created a new list of the venues where all the different types of restaurants became individual columns.  Each venue was classified as one if it matched the type and zero if not.
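A toy example may make the encoding concrete.  The three venues below are made up; `dtype=int` is passed so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Three made-up venues with one category each.
toy = pd.DataFrame({
    "Venue ID": ["v1", "v2", "v3"],
    "Venue Category": ["Bakery", "Thai Restaurant", "Bakery"],
})

# One indicator column per distinct category; dtype=int gives 0/1 values.
onehot = pd.get_dummies(toy["Venue Category"], dtype=int)

# Put the venue identifier in front so each row can be traced back.
onehot.insert(0, "Venue ID", toy["Venue ID"])
print(onehot)
```

Each row has exactly one indicator set, because every venue carries a single category.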

In [24]:
# Use one hot encoding to create new dataframe of categories
chi_onehot_cat = pd.get_dummies(chi_venues[['Venue Category']], prefix="", prefix_sep="")
chi_onehot_cat.head()

Unnamed: 0,Afghan Restaurant,African Restaurant,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# Check the dimensions
chi_onehot_cat.shape

(1165, 106)

In [26]:
# Add the venue ID back to the dataframe
chi_onehot_cat['Venue ID'] = chi_venues['Venue ID']
chi_onehot_cat.head()

Unnamed: 0,Afghan Restaurant,African Restaurant,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,...,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint,Venue ID
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3fd66200f964a520c7f01ee3
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3fd66200f964a520e1ed1ee3
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,40b28c80f964a52045fb1ee3
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,40b28c80f964a5205ffd1ee3
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,40b28c80f964a520a5fc1ee3


In [27]:
# Move the venue ID to the first column
fixed_columns = [chi_onehot_cat.columns[-1]] + list(chi_onehot_cat.columns[:-1])
chi_onehot_cat = chi_onehot_cat[fixed_columns]
chi_onehot_cat.head()

Unnamed: 0,Venue ID,Afghan Restaurant,African Restaurant,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
0,3fd66200f964a520c7f01ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3fd66200f964a520e1ed1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,40b28c80f964a52045fb1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,40b28c80f964a5205ffd1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,40b28c80f964a520a5fc1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, we repeated this process for the neighborhoods.  In both cases we also added the venue identifier to the data and made sure we still had the same number as in our original venue dataset.

In [28]:
# Use one hot encoding to create new dataframe of neighborhoods
chi_onehot_nhood = pd.get_dummies(chi_venues[['Neighborhood']], prefix="", prefix_sep="")
chi_onehot_nhood.head()

Unnamed: 0,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,Beverly,...,Washington Heights,Washington Park,West Elsdon,West Englewood,West Garfield Park,West Lawn,West Pullman,West Ridge,West Town,Woodlawn
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
# Add the venue ID back to the dataframe
chi_onehot_nhood['Venue ID'] = chi_venues['Venue ID']

# Move the venue ID to the first column
fixed_columns = [chi_onehot_nhood.columns[-1]] + list(chi_onehot_nhood.columns[:-1])
chi_onehot_nhood = chi_onehot_nhood[fixed_columns]

chi_onehot_nhood.head()

Unnamed: 0,Venue ID,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,...,Washington Heights,Washington Park,West Elsdon,West Englewood,West Garfield Park,West Lawn,West Pullman,West Ridge,West Town,Woodlawn
0,3fd66200f964a520c7f01ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3fd66200f964a520e1ed1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,40b28c80f964a52045fb1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,40b28c80f964a5205ffd1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,40b28c80f964a520a5fc1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
# Check the dimensions
chi_onehot_nhood.shape

(1165, 66)

Finally, we combined both of the one-hot encoding lists for categories and neighborhoods into a single dataset.
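One way to guard this step is to ask pandas to verify that each venue identifier appears once on both sides of the merge.  The sketch below uses two made-up frames; the `validate` argument is an addition for illustration and is not part of the project's merge call.

```python
import pandas as pd

# Made-up one-hot frames for two venues.
nhood = pd.DataFrame({"Venue ID": ["v1", "v2"], "Uptown": [1, 0], "Loop": [0, 1]})
cat = pd.DataFrame({"Venue ID": ["v1", "v2"], "Bakery": [0, 1], "Pub": [1, 0]})

# validate="one_to_one" raises MergeError if a venue ID repeats on either side.
combined = pd.merge(nhood, cat, on="Venue ID", how="outer", validate="one_to_one")
print(combined.shape)
```

The combined frame should have one identifier column plus every neighborhood and category column, and no missing values if both inputs cover the same venues.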

In [31]:
# Merge the one-hot encoding dataframes into a new dataframe
chi_onehot_all = pd.merge(chi_onehot_nhood, chi_onehot_cat, on='Venue ID', how='outer')
chi_onehot_all.head()

Unnamed: 0,Venue ID,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
0,3fd66200f964a520c7f01ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3fd66200f964a520e1ed1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,40b28c80f964a52045fb1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,40b28c80f964a5205ffd1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,40b28c80f964a520a5fc1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
# Check the dimensions
chi_onehot_all.shape

(1165, 172)

## Results

In order to obtain the results from our recommender model we chose a test user.  One drawback to content-based recommender systems is that users with few ratings produce poor results.  To get a better outcome, we first identified users who have liked more than five venues.  Then we chose one of those users at random and extracted the venues they liked.

In [33]:
# Find users who have liked more than 5 venues
chi_venue_likes.groupby('User ID').filter(lambda x: len(x) > 5)

Unnamed: 0,Venue ID,User ID,Rating
101,4a4d5480f964a520e3ad1fe3,134989,1
102,4a4d5480f964a520e3ad1fe3,22744,1
113,4a5652f3f964a520feb41fe3,589793,1
136,4a6ca0e5f964a5200ed11fe3,589793,1
149,4a761079f964a52017e21fe3,18469680,1
191,4abe9675f964a520a58e20e3,38216300,1
202,4ac6ad04f964a520c6b520e3,22744,1
207,4ac79839f964a5204fb820e3,124547371,1
214,4ac90764f964a52028be20e3,418573148,1
218,4ad87465f964a5207f1121e3,3462946,1


In [34]:
# Get ratings for our test user
test_user_id = '589793'
test_user_ratings = chi_venue_likes[chi_venue_likes['User ID'] == test_user_id].drop(columns='User ID')
test_user_ratings

Unnamed: 0,Venue ID,Rating
113,4a5652f3f964a520feb41fe3,1
136,4a6ca0e5f964a5200ed11fe3,1
1092,54b84a42498ef55d202d374f,1
1510,5aebadb47e4b4e0031e37d81,1
1513,5b2c0a47916bc10039f8e072,1
1533,5b7c5b461fa763002c85e395,1


In [35]:
# Reset the index to avoid future issues
test_user_ratings = test_user_ratings.reset_index(drop=True)

# Show the user's ratings
test_user_ratings

Unnamed: 0,Venue ID,Rating
0,4a5652f3f964a520feb41fe3,1
1,4a6ca0e5f964a5200ed11fe3,1
2,54b84a42498ef55d202d374f,1
3,5aebadb47e4b4e0031e37d81,1
4,5b2c0a47916bc10039f8e072,1
5,5b7c5b461fa763002c85e395,1


Next, we extracted those venues from our one-hot encoding list and removed the unnecessary columns.

In [36]:
# Create a new dataframe of venues the test user liked
test_user_likes =  chi_onehot_all[chi_onehot_all['Venue ID'].isin(test_user_ratings['Venue ID'].tolist())]
test_user_likes

Unnamed: 0,Venue ID,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
40,4a5652f3f964a520feb41fe3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48,4a6ca0e5f964a5200ed11fe3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
873,54b84a42498ef55d202d374f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1100,5aebadb47e4b4e0031e37d81,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1107,5b2c0a47916bc10039f8e072,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1121,5b7c5b461fa763002c85e395,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
# Reset the index to avoid future issues
test_user_likes = test_user_likes.reset_index(drop=True)

# Drop unnecessary column
test_user_features = test_user_likes.drop(columns='Venue ID')

# Show the dataframe
test_user_features

Unnamed: 0,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,Beverly,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


To create a user profile, we then calculated the dot product of the transposed one-hot encoding list and a vector of the user’s ratings.  This produced a weight for each of our features.  Recall that the feature set is a combination of restaurant categories and neighborhoods.
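A small worked example shows what the dot product yields.  The feature rows below are made up; the names `liked_features`, `ratings`, and `profile` are chosen for this sketch only.

```python
import pandas as pd

# Made-up feature rows for three venues the user liked.
liked_features = pd.DataFrame(
    {"Uptown": [1, 1, 0], "Loop": [0, 0, 1], "Bakery": [1, 0, 0], "Pub": [0, 1, 1]}
)
ratings = pd.Series([1, 1, 1])  # every like counts as a rating of 1

# Transposing puts features in rows, so the dot product sums each feature's
# ratings across the liked venues.
profile = liked_features.transpose().dot(ratings)
print(profile.to_dict())
```

Because every rating equals one, each weight is simply the number of liked venues that have that feature.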

In [38]:
# View the vector of the test user's ratings
test_user_ratings['Rating']

0    1
1    1
2    1
3    1
4    1
5    1
Name: Rating, dtype: int64

In [39]:
# Use a dot product to get weights
userProfile = test_user_features.transpose().dot(test_user_ratings['Rating'])

# The dimensions of the user profile (too big to list in full)
userProfile.shape

(171,)

The next step was to create a clean dataset of our features.

In [40]:
# Get the features of every restaurant in our original dataframe
features_df = chi_onehot_all.set_index(chi_onehot_all['Venue ID'])
features_df.head()

Venue ID,Venue ID,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
3fd66200f964a520c7f01ee3,3fd66200f964a520c7f01ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3fd66200f964a520e1ed1ee3,3fd66200f964a520e1ed1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a52045fb1ee3,40b28c80f964a52045fb1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a5205ffd1ee3,40b28c80f964a5205ffd1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a520a5fc1ee3,40b28c80f964a520a5fc1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [41]:
# Drop the redundant 'Venue ID' column (it is now the index)
features_df = features_df.drop(columns='Venue ID')
features_df.head()

Venue ID,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,Beverly,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
3fd66200f964a520c7f01ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3fd66200f964a520e1ed1ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a52045fb1ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a5205ffd1ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a520a5fc1ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Using the user’s profile and the complete list of venue features, we computed a weighted score for every venue: the sum of the profile weights for the features the venue has, divided by the sum of all profile weights.  We then recommended the five highest-scoring restaurants.

In [42]:
# Multiply the features by the weights and then take the weighted average
recommendationTable_df = ((features_df*userProfile).sum(axis=1))/(userProfile.sum())

# Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
recommendationTable_df.head()

Venue ID
4a5652f3f964a520feb41fe3    0.5
5b7c5b461fa763002c85e395    0.5
5786fed8498e6ad5339dc258    0.5
4b170c41f964a52076c123e3    0.5
4a82fb6ef964a520b8f91fe3    0.5
dtype: float64
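
The scoring step can be illustrated on a small scale.  The sketch below uses a hypothetical profile and three made-up venues (`venue_a`, `venue_b`, `venue_c`); only the formula is taken from the cell above.

```python
import pandas as pd

# Hypothetical user profile: one weight per feature
toy_profile = pd.Series(
    {"Lincoln Square": 2, "Avondale": 1, "Thai Restaurant": 1, "Bakery": 2}
)

# Hypothetical candidate venues, one-hot encoded over the same features
toy_venues = pd.DataFrame(
    [[1, 0, 0, 1], [0, 1, 1, 0], [0, 0, 0, 0]],
    index=["venue_a", "venue_b", "venue_c"],
    columns=toy_profile.index,
)

# Sum the weights of the features each venue has, then normalize
toy_scores = (toy_venues * toy_profile).sum(axis=1) / toy_profile.sum()
toy_scores = toy_scores.sort_values(ascending=False)
print(toy_scores)
```

Because the denominator is the sum of all weights, scores fall between 0 and 1 when every rating is positive.  Venues whose features carry the same total weight receive identical scores, which is consistent with the tied values in the output above.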

In [43]:
# The final recommendation table
recommended_venues = chi_venues.loc[chi_venues['Venue ID'].isin(recommendationTable_df.head(5).keys())]
recommended_venues

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
40,Lincoln Square,Bistro Campagne,4a5652f3f964a520feb41fe3,41.963663,-87.685472,378,French Restaurant
58,Lincoln Square,Rosded Thai Cuisine,4a82fb6ef964a520b8f91fe3,41.966728,-87.687698,361,Thai Restaurant
104,Lincoln Square,Aroy Thai,4b170c41f964a52076c123e3,41.966642,-87.679138,347,Thai Restaurant
978,Lincoln Square,La Boulangerie,5786fed8498e6ad5339dc258,41.965011,-87.678553,436,Bakery
1121,Lincoln Square,Baker Miller,5b7c5b461fa763002c85e395,41.966328,-87.686981,304,Bakery


In [44]:
# Add the suggested ratings
recommended_ratings = pd.merge(recommended_venues, recommendationTable_df.rename('Rating'), left_on='Venue ID', right_index=True)
recommended_ratings

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category,Rating
40,Lincoln Square,Bistro Campagne,4a5652f3f964a520feb41fe3,41.963663,-87.685472,378,French Restaurant,0.5
58,Lincoln Square,Rosded Thai Cuisine,4a82fb6ef964a520b8f91fe3,41.966728,-87.687698,361,Thai Restaurant,0.5
104,Lincoln Square,Aroy Thai,4b170c41f964a52076c123e3,41.966642,-87.679138,347,Thai Restaurant,0.5
978,Lincoln Square,La Boulangerie,5786fed8498e6ad5339dc258,41.965011,-87.678553,436,Bakery,0.5
1121,Lincoln Square,Baker Miller,5b7c5b461fa763002c85e395,41.966328,-87.686981,304,Bakery,0.5


In [45]:
# Drop unnecessary columns
recommended_venues = recommended_ratings.drop(columns=['Venue ID', 'Distance'])
recommended_venues

Unnamed: 0,Neighborhood,Venue,Venue Latitude,Venue Longitude,Venue Category,Rating
40,Lincoln Square,Bistro Campagne,41.963663,-87.685472,French Restaurant,0.5
58,Lincoln Square,Rosded Thai Cuisine,41.966728,-87.687698,Thai Restaurant,0.5
104,Lincoln Square,Aroy Thai,41.966642,-87.679138,Thai Restaurant,0.5
978,Lincoln Square,La Boulangerie,41.965011,-87.678553,Bakery,0.5
1121,Lincoln Square,Baker Miller,41.966328,-87.686981,Bakery,0.5


Now that we had our recommendations, we were able to plot them on a map with markers that indicated the venue name, category and expected rating.

In [46]:
# Get coordinates for Chicago
address = 'Chicago, IL'

geolocator = Nominatim(user_agent="chi_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Chicago are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Chicago are 41.8755616, -87.6244212.


In [47]:
# Create a map
recommendation_map = folium.Map(location=[latitude, longitude], zoom_start=11)

In [48]:
# Add markers to the map for each recommendation
for lat, lon, name, cat, rat in zip(recommended_venues['Venue Latitude'], recommended_venues['Venue Longitude'], recommended_venues['Venue'], recommended_venues['Venue Category'], recommended_venues['Rating']):
    label = folium.Popup(str(name) + ' (' + str(cat) + ', ' + str(rat) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(recommendation_map)
       
# Display the map
recommendation_map

We repeated the process for another user to show that the recommendations differ based on each user’s profile.

In [49]:
# Get ratings for our 2nd test user
test_user_id = '38216300'
test_user_ratings = chi_venue_likes[chi_venue_likes['User ID'] == test_user_id].drop(columns='User ID')
test_user_ratings

Unnamed: 0,Venue ID,Rating
191,4abe9675f964a520a58e20e3,1
413,4b745822f964a52092d62de3,1
436,4b802b0df964a520b35830e3,1
468,4b9d42c3f964a520f89d36e3,1
475,4ba110cbf964a520499437e3,1
547,4c180cc6d4d9c9287ee4ee29,1
555,4c24e254a852c928f5dae36c,1
564,4c38ad440a71c9b6591041c9,1
608,4c618119b6f3be9a88d86173,1
694,4d28e38b8292236a88ca1dbb,1


In [50]:
# Reset the index to avoid future issues
test_user_ratings = test_user_ratings.reset_index(drop=True)

# Show the user's ratings
test_user_ratings

Unnamed: 0,Venue ID,Rating
0,4abe9675f964a520a58e20e3,1
1,4b745822f964a52092d62de3,1
2,4b802b0df964a520b35830e3,1
3,4b9d42c3f964a520f89d36e3,1
4,4ba110cbf964a520499437e3,1
5,4c180cc6d4d9c9287ee4ee29,1
6,4c24e254a852c928f5dae36c,1
7,4c38ad440a71c9b6591041c9,1
8,4c618119b6f3be9a88d86173,1
9,4d28e38b8292236a88ca1dbb,1


In [51]:
# Create a new dataframe of venues the test user liked
test_user_likes =  chi_onehot_all[chi_onehot_all['Venue ID'].isin(test_user_ratings['Venue ID'].tolist())]
test_user_likes

Unnamed: 0,Venue ID,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
70,4abe9675f964a520a58e20e3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
158,4b745822f964a52092d62de3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
166,4b802b0df964a520b35830e3,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
182,4b9d42c3f964a520f89d36e3,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
187,4ba110cbf964a520499437e3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
231,4c180cc6d4d9c9287ee4ee29,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
234,4c24e254a852c928f5dae36c,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
239,4c38ad440a71c9b6591041c9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
259,4c618119b6f3be9a88d86173,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
323,4d28e38b8292236a88ca1dbb,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
# Reset the index to avoid future issues
test_user_likes = test_user_likes.reset_index(drop=True)

# Drop unnecessary column
test_user_features = test_user_likes.drop(columns='Venue ID')

# Show the dataframe
test_user_features

Unnamed: 0,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,Beverly,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [53]:
# View the vector of the test user's ratings
test_user_ratings['Rating']

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
Name: Rating, dtype: int64

In [54]:
# Use a dot product to get weights
userProfile = test_user_features.transpose().dot(test_user_ratings['Rating'])

# The dimensions of the user profile (too big to list in full)
userProfile.shape

(171,)

In [55]:
# Get the features of every restaurant in our original dataframe
features_df = chi_onehot_all.set_index(chi_onehot_all['Venue ID'])
features_df.head()

Venue ID,Venue ID,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
3fd66200f964a520c7f01ee3,3fd66200f964a520c7f01ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3fd66200f964a520e1ed1ee3,3fd66200f964a520e1ed1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a52045fb1ee3,40b28c80f964a52045fb1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a5205ffd1ee3,40b28c80f964a5205ffd1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a520a5fc1ee3,40b28c80f964a520a5fc1ee3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
# Drop the redundant 'Venue ID' column (it is now the index)
features_df = features_df.drop(columns='Venue ID')
features_df.head()

Venue ID,Albany Park,Archer Heights,Armour Square,Ashburn,Auburn Gresham,Austin,Avalon Park,Avondale,Belmont Cragin,Beverly,...,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theme Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wings Joint
3fd66200f964a520c7f01ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3fd66200f964a520e1ed1ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a52045fb1ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a5205ffd1ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40b28c80f964a520a5fc1ee3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [57]:
# Multiply the features by the weights and then take the weighted average
recommendationTable_df = ((features_df*userProfile).sum(axis=1))/(userProfile.sum())

# Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
recommendationTable_df.head()

Venue ID
4c180cc6d4d9c9287ee4ee29    0.233333
4f36f610e4b0e313d311d8e9    0.233333
593478c3123a1963f50fc66f    0.233333
4c3770e32c8020a1c0068900    0.233333
4c38ad440a71c9b6591041c9    0.233333
dtype: float64

In [58]:
# The final recommendation table
recommended_venues = chi_venues.loc[chi_venues['Venue ID'].isin(recommendationTable_df.head(5).keys())]
recommended_venues

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category
231,Portage Park,Little Caesars Pizza,4c180cc6d4d9c9287ee4ee29,41.953485,-87.767461,393,Pizza Place
238,Belmont Cragin,Mama Mia Chicago Pizza,4c3770e32c8020a1c0068900,41.937767,-87.766347,494,Pizza Place
239,Portage Park,Cochiaro's,4c38ad440a71c9b6591041c9,41.945677,-87.766661,481,Pizza Place
641,Belmont Cragin,Papa Turhano's,4f36f610e4b0e313d311d8e9,41.929633,-87.765876,417,Pizza Place
1034,Portage Park,Easy Street Pizza & Beer Garden,593478c3123a1963f50fc66f,41.949234,-87.766956,88,Pizza Place


In [59]:
# Add the suggested ratings
recommended_ratings = pd.merge(recommended_venues, recommendationTable_df.rename('Rating'), left_on='Venue ID', right_index=True)
recommended_ratings

Unnamed: 0,Neighborhood,Venue,Venue ID,Venue Latitude,Venue Longitude,Distance,Venue Category,Rating
231,Portage Park,Little Caesars Pizza,4c180cc6d4d9c9287ee4ee29,41.953485,-87.767461,393,Pizza Place,0.233333
238,Belmont Cragin,Mama Mia Chicago Pizza,4c3770e32c8020a1c0068900,41.937767,-87.766347,494,Pizza Place,0.233333
239,Portage Park,Cochiaro's,4c38ad440a71c9b6591041c9,41.945677,-87.766661,481,Pizza Place,0.233333
641,Belmont Cragin,Papa Turhano's,4f36f610e4b0e313d311d8e9,41.929633,-87.765876,417,Pizza Place,0.233333
1034,Portage Park,Easy Street Pizza & Beer Garden,593478c3123a1963f50fc66f,41.949234,-87.766956,88,Pizza Place,0.233333


In [60]:
# Drop unnecessary columns
recommended_venues = recommended_ratings.drop(columns=['Venue ID', 'Distance'])
recommended_venues

Unnamed: 0,Neighborhood,Venue,Venue Latitude,Venue Longitude,Venue Category,Rating
231,Portage Park,Little Caesars Pizza,41.953485,-87.767461,Pizza Place,0.233333
238,Belmont Cragin,Mama Mia Chicago Pizza,41.937767,-87.766347,Pizza Place,0.233333
239,Portage Park,Cochiaro's,41.945677,-87.766661,Pizza Place,0.233333
641,Belmont Cragin,Papa Turhano's,41.929633,-87.765876,Pizza Place,0.233333
1034,Portage Park,Easy Street Pizza & Beer Garden,41.949234,-87.766956,Pizza Place,0.233333


In [61]:
# Create a map
recommendation_map = folium.Map(location=[latitude, longitude], zoom_start=11)

In [62]:
# Add markers to the map for each recommendation
for lat, lon, name, cat, rat in zip(recommended_venues['Venue Latitude'], recommended_venues['Venue Longitude'], recommended_venues['Venue'], recommended_venues['Venue Category'], recommended_venues['Rating']):
    label = folium.Popup(str(name) + ' (' + str(cat) + ', ' + str(rat) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(recommendation_map)
       
# Display the map
recommendation_map
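
Since the same steps were run for both test users, they could be wrapped in a single function.  The following is a minimal sketch rather than the code used in this project: `recommend_for_user` is a hypothetical helper, the toy tables only mimic the column layout of `chi_venue_likes` and `chi_onehot_all`, and it assumes each venue ID appears once in the one-hot table.

```python
import pandas as pd


def recommend_for_user(user_id, likes_df, onehot_df, top_n=5):
    """Return the top_n venue scores for one user (hypothetical helper)."""
    # Ratings given by this user
    ratings = likes_df.loc[likes_df["User ID"] == user_id, ["Venue ID", "Rating"]]
    rating_by_venue = ratings.set_index("Venue ID")["Rating"]

    # One-hot features of every venue, indexed by venue ID
    features = onehot_df.set_index("Venue ID")

    # Features of the venues the user rated
    liked = features[features.index.isin(ratings["Venue ID"])]

    # User profile: one weight per feature
    profile = liked.transpose().dot(rating_by_venue.reindex(liked.index))

    # Weighted score for every venue, highest first
    scores = (features * profile).sum(axis=1) / profile.sum()
    return scores.sort_values(ascending=False).head(top_n).rename("Rating")


# Made-up data for illustration only
toy_likes = pd.DataFrame(
    {
        "User ID": ["u1", "u1", "u2"],
        "Venue ID": ["v1", "v2", "v3"],
        "Rating": [1, 1, 1],
    }
)
toy_onehot = pd.DataFrame(
    {
        "Venue ID": ["v1", "v2", "v3", "v4"],
        "North": [1, 1, 0, 1],
        "South": [0, 0, 1, 0],
        "Pizza Place": [1, 0, 1, 1],
        "Bakery": [0, 1, 0, 0],
    }
)

u1_scores = recommend_for_user("u1", toy_likes, toy_onehot, top_n=4)
u2_scores = recommend_for_user("u2", toy_likes, toy_onehot, top_n=1)
print(u1_scores)
print(u2_scores)
```

Like the cells above, this sketch leaves venues the user has already liked in the candidate pool, so they can appear among the recommendations.  Dropping them from `scores` before sorting would be one way to surface only new venues.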

## Conclusion

The recommender system has proven useful for suggesting new eating venues based on each user’s profile.  Using restaurant categories allows the system to recommend other similar venues.  Meanwhile, the neighborhood information makes it less likely that the system will make suggestions that are in far-flung parts of the city.

There are, however, some aspects of the system that could be improved.  The limits of the Foursquare API pose the greatest challenge in this regard.  For example, the initial plan also called for including the price point of each venue, e.g. cheap, moderate or expensive.  However, the queries required to fetch that data are limited to five hundred requests per day.  Even after significantly reducing the number of neighborhoods, we exceeded this threshold.

The other major limitation of the Foursquare data is that we can only see when a user has liked a restaurant.  The data provides no insight into whether the user disliked an establishment or how many times she has eaten there.  That would be extremely valuable information for our system, as a simple yes/no rating does not provide much depth of knowledge.  Of course, Foursquare does have access to all of this data and can, therefore, offer much better recommendations.
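
To illustrate why richer ratings would help, the sketch below compares a profile built from likes alone with one built from a signed scale.  Everything here is hypothetical: the three venues are made up, and the rating of -1 for a dislike is an assumed encoding, since the data used in this project contains no such signal.

```python
import pandas as pd

# Hypothetical one-hot features for three venues a user has visited
rated_features = pd.DataFrame(
    {
        "Portage Park": [1, 1, 0],
        "Belmont Cragin": [0, 0, 1],
        "Pizza Place": [1, 0, 1],
        "Bakery": [0, 1, 0],
    }
)

# Likes only, as in our data: every rating is 1
likes_only = pd.Series([1, 1, 1])

# Assumed signed scale: the user disliked the second venue
signed = pd.Series([1, -1, 1])

profile_likes = rated_features.transpose().dot(likes_only)
profile_signed = rated_features.transpose().dot(signed)

print(pd.DataFrame({"likes only": profile_likes, "signed": profile_signed}))
```

With likes alone, the bakery visit raises the weight of both its neighborhood and its category.  With the signed rating, the same visit lowers them, so the profile separates what the user enjoyed from what they merely tried.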

The exercise presented here offers a good example of using machine learning to provide recommendations to users.  Past experiences and ratings allowed us to build a content-based recommender system for finding those suggestions.  Unfortunately, our example is not as rich in data as we would like, and the recommendations provided may not be the best possible.  Fortunately for Foursquare, in the real world they do have access to an extremely rich dataset that they can mine to offer much better recommendations.