# California Restaurant "Likes" Prediction Using Foursquare API and Machine Learning

## Introduction

California boasts an incredibly diverse collection of restaurants catering to different palates and appetites. A large part of marketing for a modern restaurant (or any company) is social media, where the number of "likes" a company receives shapes its brand and image in the eyes of the general public.

For a new business owner (or an existing company) planning to open a restaurant in California, knowing ahead of time what social media image to expect would reduce the ever-present business problem of uncertainty. In this case, the uncertainty concerns the performance of the restaurant's social media presence.

We can mitigate this uncertainty by leveraging data gathered from Foursquare's API. Specifically, we can pull "likes" data for different restaurants directly from the API, along with their location and category of cuisine. The question we will try to address is: how accurately can we predict the number of "likes" a new restaurant opening in this region can expect, based on the type of cuisine it will serve and the California city in which it will open? (For the purposes of this analysis, we limit the geographical scope to three heavily populated cities in California: San Francisco, Los Angeles, and San Diego.)

Leveraging this data addresses the problem because it allows the new business owner (or existing company) to make informed decisions before opening the restaurant: whether it is feasible to open one in this region and expect a good social media presence, which type of cuisine to serve, and which of the three cities would be best. This project will analyze and model the data with machine learning, comparing linear and logistic regression to see which method yields better predictive capability after training and testing.

Let us begin by importing the necessary packages.

In [2]:
import itertools
import json
from urllib.request import urlopen

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim

try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas

import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import pylab as pl

from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             log_loss, mean_squared_error, r2_score)

try:
    from sklearn.metrics import jaccard_similarity_score  # removed in scikit-learn 0.23
except ImportError:
    # for multiclass labels the removed function was equivalent to accuracy_score
    from sklearn.metrics import accuracy_score as jaccard_similarity_score

print('Libraries imported.')

Libraries imported.


## Data Scraping and Cleaning

In this section we first retrieve the geographical coordinates of the three cities (San Francisco, Los Angeles, and San Diego). Then, we leverage the Foursquare API to build URLs that return the raw data in JSON form. We query each URL separately to retrieve the following columns for each city: "name", "categories", "latitude", "longitude", and "id". We also add another column ("city") to indicate which city each restaurant is from.

It is important to note that the extracts do not cover every restaurant in those cities, but rather up to 100 venues (the request limit) within a 1,000-meter radius of the coordinates that the geolocator provides. Moreover, the Foursquare API returns venue data, so the extract includes venues other than restaurants, such as concert halls, stores, and libraries. The data therefore needs to be cleaned somewhat manually by removing the non-restaurant rows. Once this is complete, we have a shortened but cleaned list for which to pull "likes" data. The cleaning comes first mainly because pulling the "likes" data is the slowest step in this project, so we want to avoid pulling information that will end up being dropped anyway.
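The manual removal step could in principle be preceded by an automatic first pass. The sketch below is one hypothetical approach, not the method used in this project: it keeps a category only when one of its words appears on a small allow-list. The function name and the allow-list itself are assumptions made for illustration, and the list would need tuning against the real category names.

```python
import re

# Hypothetical allow-list of words that mark a food or drink venue.
FOOD_TOKENS = {"restaurant", "bar", "café", "cafe", "coffee", "joint",
               "place", "pub", "gastropub", "truck", "spot", "speakeasy"}


def is_food_venue(category):
    """Return True when any whole word in the category name is on the allow-list."""
    if not category:
        return False
    # Split into alphabetic words so that 'Barbershop' does not match 'bar'.
    words = re.findall(r"[^\W\d_]+", category.lower())
    return any(word in FOOD_TOKENS for word in words)


categories = ["Ramen Restaurant", "Concert Hall", "Coffee Shop", "Wine Shop", None]
kept = [c for c in categories if is_food_venue(c)]
print(kept)
```

An allow-list errs in the opposite direction from a removal list: unfamiliar categories are dropped by default, so the output still deserves a manual look.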

The "id" is an important column because it allows us to pull the "likes" from the API. We can retrieve the "likes" for each restaurant "id" and append them to the data frame. Once this is complete, we name the data frame 'raw_dataset', as it is the most complete compiled form before any processing for analysis via machine learning.
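The likes lookup can fail for an individual venue, for example when a response carries no like count. The sketch below is a minimal, offline illustration of one way to guard against that. The helper names and the stand-in responses are hypothetical; only the response path (response, likes, count) is taken from this notebook, and the real call would go through requests.

```python
def extract_like_count(payload):
    """Pull the like count out of a likes response, or None if it is absent."""
    try:
        return payload["response"]["likes"]["count"]
    except (KeyError, TypeError):
        return None


def collect_likes(venue_ids, fetch_json):
    """Map each venue id to its like count using the supplied fetch function."""
    return {vid: extract_like_count(fetch_json(vid)) for vid in venue_ids}


# Made-up payloads standing in for the HTTP call so the sketch runs offline.
fake_responses = {
    "a1": {"response": {"likes": {"count": 351}}},
    "b2": {"meta": {"code": 429}},  # a reply that carries no likes block
}
likes = collect_likes(["a1", "b2"], fake_responses.get)
print(likes)
```

Venues that come back as None can then be dropped or retried before the column is attached to the data frame.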

In [11]:
address1 = 'San Francisco, California'

# Nominatim expects an identifying user agent; this name is a placeholder
geolocator = Nominatim(user_agent='ca_restaurant_likes')
location1 = geolocator.geocode(address1)
latitude1 = location1.latitude
longitude1 = location1.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address1, latitude1, longitude1))


address2 = 'Los Angeles, California'

location2 = geolocator.geocode(address2)
latitude2 = location2.latitude
longitude2 = location2.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address2, latitude2, longitude2))

address3 = 'San Diego, California'

location3 = geolocator.geocode(address3)
latitude3 = location3.latitude
longitude3 = location3.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address3, latitude3, longitude3))

The geographical coordinates of San Francisco, California are 37.7790262, -122.4199061.
The geographical coordinates of Los Angeles, California are 34.0536909, -118.2427666.
The geographical coordinates of San Diego, California are 32.7174209, -117.1627714.


In [12]:
import os

# Read the credentials from the environment instead of hard-coding them.
# The variable names below are placeholders; use whatever names you export.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1000 # search radius in meters

# URL template shared by the three cities
explore_url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'

# create URLs
url1 = explore_url.format(CLIENT_ID, CLIENT_SECRET, VERSION,
                          latitude1, longitude1, radius, LIMIT)
url2 = explore_url.format(CLIENT_ID, CLIENT_SECRET, VERSION,
                          latitude2, longitude2, radius, LIMIT)
url3 = explore_url.format(CLIENT_ID, CLIENT_SECRET, VERSION,
                          latitude3, longitude3, radius, LIMIT)

# The URLs embed the client secret, so they are not printed here.


In [16]:
results1 = requests.get(url1).json()
results2 = requests.get(url2).json()
results3 = requests.get(url3).json()

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


# columns to keep from the flattened JSON
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat',
                    'venue.location.lng', 'venue.id']

def extract_venues(results):
    """Flatten one explore response into a data frame of venues."""
    venues = results['response']['groups'][0]['items']
    nearby = json_normalize(venues) # flatten JSON

    # filter columns
    nearby = nearby.loc[:, filtered_columns].copy()

    # keep only the first category name for each row
    nearby['venue.categories'] = nearby.apply(get_category_type, axis=1)

    # clean columns, e.g. 'venue.location.lat' becomes 'lat'
    nearby.columns = [col.split(".")[-1] for col in nearby.columns]
    return nearby


nearby_venues1 = extract_venues(results1) # first city
nearby_venues2 = extract_venues(results2) # second city
nearby_venues3 = extract_venues(results3) # third city

print('{} venues were returned by Foursquare.'.format(nearby_venues1.shape[0]))
print('{} venues were returned by Foursquare.'.format(nearby_venues2.shape[0]))
print('{} venues were returned by Foursquare.'.format(nearby_venues3.shape[0]))

100 venues were returned by Foursquare.
100 venues were returned by Foursquare.
100 venues were returned by Foursquare.


In [17]:
nearby_venues1['city'] = 'San Francisco'
nearby_venues2['city'] = 'Los Angeles'
nearby_venues3['city'] = 'San Diego'

In [18]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
nearby_venues = pd.concat([nearby_venues1, nearby_venues2, nearby_venues3])

In [19]:
nearby_venues['categories'].unique()

# venue categories to drop (duplicate entries removed)
removal_list = ['Concert Hall', 'Opera House', 'Dance Studio',
                'Performing Arts Venue', 'Art Museum', 'Park',
                'Massage Studio', 'Music Venue', 'Bookstore', 'Clothing Store',
                'Boutique', 'Furniture/Home Store', 'Jazz Club',
                'Theater', 'Optical Shop', "Men's Store", 'Rock Club',
                'Gym / Fitness Center', 'Wine Shop', 'Indie Movie Theater',
                'Chocolate Shop', 'Dessert Shop', 'Recreation Center',
                'Plaza', 'Hotel', 'Luggage Store', 'Farmers Market', 'Gym',
                'Jewelry Store', 'Furniture / Home Store', 'Butcher',
                'Bakery', 'Marijuana Dispensary', 'Ice Cream Shop',
                'Comic Shop', 'Bagel Shop', 'Spa', 'Liquor Store', 'Bike Shop',
                'Yoga Studio', 'Pedestrian Plaza', 'Candy Store',
                'Art Gallery', 'Supermarket', 'Museum', 'Building',
                'Historic Site', 'Pharmacy', 'Market', 'Movie Theater',
                'Cheese Shop', 'School', 'Gift Shop', 'Athletics & Sports',
                'Shoe Repair', 'General Entertainment', 'Stationery Store',
                'Toy / Game Store', 'Brewery', 'Business Service',
                'Donut Shop', 'Beer Store', 'Lounge', 'Health Food Store',
                'Lingerie Store', 'Mobile Phone Shop', 'Hostel',
                'Convenience Store', 'Cosmetics Shop', 'Piano Bar',
                'Nightclub', 'Comedy Club']

nearby_venues = nearby_venues[~nearby_venues['categories'].isin(removal_list)]

nearby_venues['categories'].unique()

array(['Ramen Restaurant', 'Beer Bar', 'Wine Bar', 'Coffee Shop',
       'Tiki Bar', 'Cocktail Bar', 'Italian Restaurant',
       'Sushi Restaurant', 'French Restaurant', 'Music School',
       'Poke Place', 'Food & Drink Shop',
       'Southern / Soul Food Restaurant', 'Event Space', 'Sandwich Place',
       'Thai Restaurant', 'Pizza Place', 'Beer Garden', 'Souvlaki Shop',
       'Juice Bar', 'American Restaurant', 'Mexican Restaurant',
       'New American Restaurant', 'Restaurant', 'Accessories Store',
       'German Restaurant', 'Café', 'Breakfast Spot', 'Speakeasy',
       'Udon Restaurant', 'Bar', 'Falafel Restaurant',
       'Filipino Restaurant', 'BBQ Joint', 'Japanese Restaurant',
       'Shopping Mall', 'Mediterranean Restaurant', 'Gastropub',
       'Latin American Restaurant', 'Cajun / Creole Restaurant',
       'Tea Room', 'Yoshoku Restaurant', 'Food Truck', 'Bubble Tea Shop',
       'High School', 'Train Station', 'Kids Store', 'Taco Place',
       'Seafood Restaurant', '

In [20]:
url_list = []
like_list = []
json_list = []

# build one "likes" URL per venue id
for venue_id in nearby_venues['id']:
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    url_list.append(venue_url)

# request each URL and keep the like count from the response
for link in url_list:
    result = requests.get(link).json()
    likes = result['response']['likes']['count']
    like_list.append(likes)
print(like_list)


nearby_venues['likes'] = like_list
nearby_venues.head()

[351, 291, 132, 480, 457, 941, 298, 1114, 602, 120, 347, 152, 823, 375, 56, 77, 530, 301, 1184, 29, 439, 370, 18, 69, 801, 1074, 79, 73, 677, 50, 59, 19, 36, 309, 206, 609, 175, 47, 65, 74, 817, 314, 260, 288, 239, 243, 490, 113, 118, 731, 155, 218, 123, 80, 356, 340, 320, 138, 57, 585, 33, 84, 55, 92, 64, 22, 25, 78, 162, 105, 200, 48, 32, 76, 120, 154, 395, 9, 327, 482, 357, 384, 32, 17, 58, 740, 21, 106, 37, 140, 34, 111, 31, 24, 982, 23, 56, 9, 130, 12, 112, 235, 10, 39, 32, 8, 12, 46, 157, 76, 62, 34, 9, 150, 77, 131, 170, 207, 26, 36, 41, 105, 3, 94, 10, 34, 65, 295, 18, 97, 7, 30, 32, 8, 127, 30, 18, 321, 119, 18, 104, 104, 196, 186, 19, 22, 83, 33, 78, 525, 101, 31, 489, 140, 534, 35, 204, 479, 131, 5, 175, 20, 54, 27, 116, 22, 54, 48, 10, 175, 52]


| | name | categories | lat | lng | id | city | likes |
|---|---|---|---|---|---|---|---|
| 12 | Nojo Ramen Tavern | Ramen Restaurant | 37.776637 | -122.42127 | 4d8eabc7d265236af9a71017 | San Francisco | 351 |
| 13 | The Beer Hall | Beer Bar | 37.776837 | -122.417916 | 519fcb37498e91a13cb23d6b | San Francisco | 291 |
| 14 | Birba | Wine Bar | 37.77775 | -122.424159 | 551b7760498e1612b67f33f9 | San Francisco | 132 |
| 16 | Philz Coffee | Coffee Shop | 37.781433 | -122.417073 | 5151a10ce4b06ae7735335db | San Francisco | 480 |
| 17 | Blue Bottle Coffee | Coffee Shop | 37.776286 | -122.416867 | 5560dbdb498e91a2bcde84f6 | San Francisco | 457 |


In [21]:
raw_dataset = nearby_venues
raw_dataset.head()

| | name | categories | lat | lng | id | city | likes |
|---|---|---|---|---|---|---|---|
| 12 | Nojo Ramen Tavern | Ramen Restaurant | 37.776637 | -122.42127 | 4d8eabc7d265236af9a71017 | San Francisco | 351 |
| 13 | The Beer Hall | Beer Bar | 37.776837 | -122.417916 | 519fcb37498e91a13cb23d6b | San Francisco | 291 |
| 14 | Birba | Wine Bar | 37.77775 | -122.424159 | 551b7760498e1612b67f33f9 | San Francisco | 132 |
| 16 | Philz Coffee | Coffee Shop | 37.781433 | -122.417073 | 5151a10ce4b06ae7735335db | San Francisco | 480 |
| 17 | Blue Bottle Coffee | Coffee Shop | 37.776286 | -122.416867 | 5560dbdb498e91a2bcde84f6 | San Francisco | 457 |


## Data Preparation

The data still needs more processing before it is suitable for model training and testing. Mainly, the "categories" column contains too many different types of cuisine for a model to yield meaningful results. However, the different cuisines have natural groupings based on conventionally accepted cultural categories. Broadly speaking, all of them can be reclassified as European, Latin American, Asian, North American, drinking establishments (bars), or casual establishments such as coffee shops or ice cream parlours. We can classify them manually, as there are not that many different types of cuisine.
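One compact way to express such a manual grouping is a lookup table from category to group. The sketch below uses abbreviated, illustrative lists and hypothetical names (groups, category_to_group, classify); it is an alternative formulation of the idea rather than the code used in this notebook.

```python
# Abbreviated example lists; the real groupings would contain every category.
groups = {
    "euro": ["French Restaurant", "Italian Restaurant"],
    "asian": ["Ramen Restaurant", "Sushi Restaurant"],
    "bar": ["Beer Bar", "Wine Bar"],
}

# Invert the mapping once so each lookup is a single dictionary access.
category_to_group = {cat: group for group, cats in groups.items() for cat in cats}


def classify(category):
    """Return the group for a category, or None when it is not listed."""
    return category_to_group.get(category)


print(classify("Wine Bar"), classify("Music School"))
```

With pandas, the same dictionary could be applied to a whole column through Series.map, which likewise leaves unlisted categories as missing values.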

As this project compares linear and logistic regression, it makes sense to have "likes" as both a continuous and a categorical (but ordinal) variable. To turn it into a categorical variable, we can bin the data based on percentiles and classify each restaurant into one of these ordinal percentile categories. I tried different ways of binning, but in the end splitting the sample into three bins yielded the best classification results from a prediction standpoint.
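As a hedged illustration of percentile binning, pandas provides qcut, which splits values into quantile bins of roughly equal size. The sample values below are made up, and this is a sketch of the idea rather than the exact cut-offs used in this project.

```python
import pandas as pd

# Made-up like counts, already sorted to make the bins easy to read.
likes = pd.Series([8, 20, 54, 90, 132, 160, 291, 480, 941])

# q=3 cuts at the 1/3 and 2/3 quantiles; labels give the ordinal ranking.
ranking = pd.qcut(likes, q=3, labels=[1, 2, 3]).astype(int)
print(ranking.tolist())
```

Because qcut derives the cut-offs from the data itself, the bins stay balanced if the sample changes, which hard-coded thresholds do not guarantee.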

As the last stage of data preparation, it is important to note that the regressors are categorical variables (3 different cities and 6 different categories of cuisine). Hence, they require dummy variable encoding for meaningful analysis. We can accomplish this via one-hot encoding.
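The sketch below shows one-hot encoding on a three-row toy frame. It also shows the drop_first option, which removes one column per variable; when a model includes an intercept, a full set of dummies for each variable is redundant because each set sums to one. The toy data is made up, and dropping a column is shown as an option, not as what this project does.

```python
import pandas as pd

sample = pd.DataFrame({
    "categories_classified": ["asian", "bar", "casual"],
    "city": ["San Francisco", "Los Angeles", "San Diego"],
})

# Full encoding: one 0/1 column per distinct value of each variable.
full = pd.get_dummies(sample, prefix="", prefix_sep="", dtype=int)

# Reduced encoding: the first level of each variable becomes the baseline.
reduced = pd.get_dummies(sample, prefix="", prefix_sep="",
                         drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced frame, coefficients are read relative to the dropped baseline levels (here "asian" and "Los Angeles").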

In [23]:
raw_dataset['categories'].unique()


array(['Ramen Restaurant', 'Beer Bar', 'Wine Bar', 'Coffee Shop',
       'Tiki Bar', 'Cocktail Bar', 'Italian Restaurant',
       'Sushi Restaurant', 'French Restaurant', 'Music School',
       'Poke Place', 'Food & Drink Shop',
       'Southern / Soul Food Restaurant', 'Event Space', 'Sandwich Place',
       'Thai Restaurant', 'Pizza Place', 'Beer Garden', 'Souvlaki Shop',
       'Juice Bar', 'American Restaurant', 'Mexican Restaurant',
       'New American Restaurant', 'Restaurant', 'Accessories Store',
       'German Restaurant', 'Café', 'Breakfast Spot', 'Speakeasy',
       'Udon Restaurant', 'Bar', 'Falafel Restaurant',
       'Filipino Restaurant', 'BBQ Joint', 'Japanese Restaurant',
       'Shopping Mall', 'Mediterranean Restaurant', 'Gastropub',
       'Latin American Restaurant', 'Cajun / Creole Restaurant',
       'Tea Room', 'Yoshoku Restaurant', 'Food Truck', 'Bubble Tea Shop',
       'High School', 'Train Station', 'Kids Store', 'Taco Place',
       'Seafood Restaurant', '

In [24]:
euro = ['French Restaurant', 'Scandinavian Restaurant', 'Souvlaki Shop', 
       'Mediterranean Restaurant', 'Italian Restaurant', 'Pizza Place']

latino = ['Mexican Restaurant', 'Latin American Restaurant', 
          'Brazilian Restaurant', 'Taco Place']

bar = ['Beer Bar', 'Cocktail Bar', 'Tiki Bar', 'Wine Bar', 'Hotel Bar',
       'Beer Garden', 'Speakeasy', 'Brewery', 'Pub', 'Bar', 'Gastropub',
       'Hookah Bar']

asian = ['Ramen Restaurant', 'Sushi Restaurant', 'Vietnamese Restaurant',
         'Thai Restaurant', 'Poke Place', 'Indian Restaurant', 
         'Japanese Curry Restaurant', 'Japanese Restaurant', 
         'Indonesian Restaurant', 'Udon Restaurant', 'Noodle House',
         'Falafel Restaurant', 'Filipino Restaurant', 'Turkish Restaurant',
         'Yoshoku Restaurant']

casual = ['Coffee Shop', 'Café', 'Sandwich Place', 'Food Truck',
          'Juice Bar', 'Frozen Yogurt Shop', 'Deli / Bodega', 'Dessert Shop',
          'Hot Dog Joint', 'Burger Joint', 'Breakfast Spot', 
          'Fondue Restaurant']

american = ['Southern / Soul Food Restaurant', 'Food & Drink Shop', 
            'Restaurant', 'American Restaurant', 'BBQ Joint', 
            'Theme Restaurant', 'New American Restaurant',
            'Vegetarian / Vegan Restaurant', 'Seafood Restaurant']

def conditions(s):
    """Map a row's Foursquare category to one of the six groups above."""
    if s['categories'] in euro:
        return 'euro'
    if s['categories'] in latino:
        return 'latino'
    if s['categories'] in asian:
        return 'asian'
    if s['categories'] in casual:
        return 'casual'
    if s['categories'] in american:
        return 'american'
    if s['categories'] in bar:
        return 'bar'
    # categories in none of the lists (e.g. 'Music School') stay unclassified
    return None


raw_dataset['categories_classified'] = raw_dataset.apply(conditions, axis=1)
raw_dataset

| | name | categories | lat | lng | id | city | likes | categories_classified |
|---|---|---|---|---|---|---|---|---|
| 12 | Nojo Ramen Tavern | Ramen Restaurant | 37.776637 | -122.42127 | 4d8eabc7d265236af9a71017 | San Francisco | 351 | asian |
| 13 | The Beer Hall | Beer Bar | 37.776837 | -122.417916 | 519fcb37498e91a13cb23d6b | San Francisco | 291 | bar |
| 14 | Birba | Wine Bar | 37.77775 | -122.424159 | 551b7760498e1612b67f33f9 | San Francisco | 132 | bar |
| 16 | Philz Coffee | Coffee Shop | 37.781433 | -122.417073 | 5151a10ce4b06ae7735335db | San Francisco | 480 | casual |
| 17 | Blue Bottle Coffee | Coffee Shop | 37.776286 | -122.416867 | 5560dbdb498e91a2bcde84f6 | San Francisco | 457 | casual |
| 18 | Blue Bottle Coffee | Coffee Shop | 37.77643 | -122.423224 | 43d3901ef964a5201f2e1fe3 | San Francisco | 941 | casual |
| 19 | Fig & Thistle Wine Bar | Wine Bar | 37.777256 | -122.423365 | 51aad1ee498e9bb839540dc7 | San Francisco | 298 | bar |
| 20 | Smuggler's Cove | Tiki Bar | 37.779386 | -122.423422 | 4afe6db4f964a520682f22e3 | San Francisco | 1114 | bar |
| 21 | Whitechapel | Cocktail Bar | 37.78223 | -122.418884 | 55df9e27498ef4c72ac83a77 | San Francisco | 602 | bar |
| 23 | Linden Room | Cocktail Bar | 37.776503 | -122.422794 | 57b3c7c8498e9b9e08349941 | San Francisco | 120 | bar |


In [25]:
pd.crosstab(index=raw_dataset["categories_classified"],
            columns="count")

| categories_classified | count |
|---|---|
| american | 18 |
| asian | 36 |
| bar | 28 |
| casual | 39 |
| euro | 21 |
| latino | 14 |


In [26]:
print(np.percentile(raw_dataset['likes'], 33))
print(np.percentile(raw_dataset['likes'], 66))

54.0
163.60000000000002


In [27]:
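# The cut-offs are hard-coded; note that they differ from the 33rd and 66th
# percentiles printed above (54.0 and 163.6).
def rankings(s):
    if s['likes']<=62:
        return 1
    if s['likes']<=193.22:
        return 2
    return 3


# alternative two-bin split (not used)
# def rankings(s):
#     if s['likes']<=111.5:
#         return 0
#     if s['likes']>111.5:
#         return 1

raw_dataset['ranking']=raw_dataset.apply(rankings, axis=1)
raw_dataset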

| | name | categories | lat | lng | id | city | likes | categories_classified | ranking |
|---|---|---|---|---|---|---|---|---|---|
| 12 | Nojo Ramen Tavern | Ramen Restaurant | 37.776637 | -122.42127 | 4d8eabc7d265236af9a71017 | San Francisco | 351 | asian | 3 |
| 13 | The Beer Hall | Beer Bar | 37.776837 | -122.417916 | 519fcb37498e91a13cb23d6b | San Francisco | 291 | bar | 3 |
| 14 | Birba | Wine Bar | 37.77775 | -122.424159 | 551b7760498e1612b67f33f9 | San Francisco | 132 | bar | 2 |
| 16 | Philz Coffee | Coffee Shop | 37.781433 | -122.417073 | 5151a10ce4b06ae7735335db | San Francisco | 480 | casual | 3 |
| 17 | Blue Bottle Coffee | Coffee Shop | 37.776286 | -122.416867 | 5560dbdb498e91a2bcde84f6 | San Francisco | 457 | casual | 3 |
| 18 | Blue Bottle Coffee | Coffee Shop | 37.77643 | -122.423224 | 43d3901ef964a5201f2e1fe3 | San Francisco | 941 | casual | 3 |
| 19 | Fig & Thistle Wine Bar | Wine Bar | 37.777256 | -122.423365 | 51aad1ee498e9bb839540dc7 | San Francisco | 298 | bar | 3 |
| 20 | Smuggler's Cove | Tiki Bar | 37.779386 | -122.423422 | 4afe6db4f964a520682f22e3 | San Francisco | 1114 | bar | 3 |
| 21 | Whitechapel | Cocktail Bar | 37.78223 | -122.418884 | 55df9e27498ef4c72ac83a77 | San Francisco | 602 | bar | 3 |
| 23 | Linden Room | Cocktail Bar | 37.776503 | -122.422794 | 57b3c7c8498e9b9e08349941 | San Francisco | 120 | bar | 2 |


In [28]:
reg_dataset = pd.get_dummies(raw_dataset[['categories_classified', 'city']],
                             prefix="",
                             prefix_sep="",
                             dtype=int)  # 0/1 columns rather than booleans

# add ranking, likes, and name columns back to dataframe
reg_dataset['ranking'] = raw_dataset['ranking']
reg_dataset['likes'] = raw_dataset['likes']
reg_dataset['name'] = raw_dataset['name']

# move name column to the first column
reg_columns = [reg_dataset.columns[-1]] + list(reg_dataset.columns[:-1])
reg_dataset = reg_dataset[reg_columns]

reg_dataset.head()

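| | name | american | asian | bar | casual | euro | latino | Los Angeles | San Diego | San Francisco | ranking | likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | Nojo Ramen Tavern | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 351 |
| 13 | The Beer Hall | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 291 |
| 14 | Birba | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 132 |
| 16 | Philz Coffee | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 480 |
| 17 | Blue Bottle Coffee | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 457 |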


## Methodology

This project uses both linear and logistic regression to train and test on the data. Linear regression is used in an attempt to predict the number of "likes" a new restaurant in this region will have. We use the scikit-learn package to run the model.
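Training and testing both depend on how the rows are split, and an unseeded random split gives different results on every run. The sketch below is one way to make a random mask repeatable; the function name, the seed, and the row count are illustrative assumptions, and this project's own split is not seeded.

```python
import numpy as np


def split_mask(n_rows, train_fraction=0.8, seed=0):
    """Boolean mask selecting roughly train_fraction of rows, repeatable per seed."""
    rng = np.random.default_rng(seed)
    return rng.random(n_rows) < train_fraction


# The same seed always yields the same mask, so metrics can be reproduced.
mask_a = split_mask(100, seed=42)
mask_b = split_mask(100, seed=42)
print(mask_a.sum(), (mask_a == mask_b).all())
```

The mask can be used exactly like an unseeded one: rows where it is True form the training set and the remaining rows form the test set.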

We can also use logistic regression as a classification method rather than predicting the number of likes directly. Since the number of "likes" can be binned into categories based on percentiles, it is also potentially possible to predict which range of "likes" a new restaurant in this region will fall into.

Since the "likes" are binned into more than two categories, the logistic regression is multinomial. Although the ranges are discrete categories, they are also ordinal in nature, so ideally the model would be both multinomial and ordinal. The LogisticRegression class in scikit-learn handles the multinomial case but treats the classes as unordered, so the model fitted in this project is a multinomial logistic regression that does not use the ordering of the bins.
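In a multinomial logistic regression, each class gets its own linear score and the scores are turned into probabilities with the softmax function. The sketch below shows that step in isolation; the three scores are illustrative numbers, not fitted values from this project.

```python
import math


def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores for rankings 1, 2, and 3.
probs = softmax([-0.5, 0.0, 0.5])
predicted_ranking = 1 + probs.index(max(probs))
print(probs, predicted_ranking)
```

Nothing in this step knows that ranking 2 lies between rankings 1 and 3, which is why the multinomial model treats the bins as unordered.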

## Results

### 1. Linear Regression

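A linear regression model was trained on a random subsample of about 80% of the sample and then tested on the remainder. To see if this is a reasonable model, the mean of the squared residuals and the variance score (R²) were both calculated. The original write-up reported 36805.64 and 0.01 respectively, while the run shown below gives 23267.60 and -0.24; the split is random and unseeded, so the figures change from run to run. A variance score near or below zero means the model predicts about as well as, or worse than, always guessing the mean number of "likes", so this is probably not a good way of modelling the data. Therefore, we move on to logistic regression.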
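To make the two metrics concrete, the sketch below computes the mean squared error and R² by hand on made-up numbers. It illustrates the definitions only; the function names and data are hypothetical and unrelated to the restaurant sample.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to variation around the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [10, 20, 30, 40]
good = [12, 18, 33, 37]   # close to the actual values
bad = [40, 30, 20, 10]    # trend reversed

print(mean_squared_error(y_true, good), r_squared(y_true, good))
print(mean_squared_error(y_true, bad), r_squared(y_true, bad))
```

R² goes negative whenever the residual variation exceeds the variation around the mean, as in the second case.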

In [29]:
msk = np.random.rand(len(reg_dataset)) < 0.8
train = reg_dataset[msk]
test = reg_dataset[~msk]

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles', 
                         'San Diego', 'San Francisco']])
y = np.asanyarray(train[['likes']])
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)

Coefficients:  [[ 191.62669301   82.77082522  133.2021904   162.31694155  139.51158023
   155.86375216  -31.17178745 -116.0104181   147.18220555]]


In [30]:
x = np.asanyarray(test[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles', 
                         'San Diego', 'San Francisco']])
y = np.asanyarray(test[['likes']])
y_hat = regr.predict(x)

# np.mean of the squared residuals is the mean squared error, not a sum
print("Mean squared error: %.2f"
      % np.mean((y_hat - y) ** 2))

# R^2 (variance) score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x, y))

Mean squared error: 23267.60
Variance score: -0.24


### 2. Logistic Regression

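A multinomial logistic regression model was trained on a random subsample of about 80% of the sample and then tested on the remainder. To see if this is a reasonable model, its Jaccard similarity score and log-loss were calculated. The original write-up reported 66.66% and 1.009 respectively, while the run shown below gives 45.7% and 1.063; as with the linear model, the split is unseeded, so the figures vary between runs. For multiclass labels this score is the share of test-set restaurants whose predicted ranking matches the actual ranking, not a similarity between the training set and the test set. With three bins of roughly equal size, random guessing would be right about a third of the time, so scores in this range are a modest improvement rather than a perfect prediction. The classification report is printed further below.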
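The two scores can be reproduced by hand on a small example. The sketch below uses made-up labels and probabilities and hypothetical function names; it shows what the metrics measure, not the values obtained in this project.

```python
import math


def accuracy(y_true, y_pred):
    """Share of predictions that match the actual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_log_loss(y_true, probas, classes):
    """Average negative log of the probability given to the actual class."""
    total = 0.0
    for label, row in zip(y_true, probas):
        p = row[classes.index(label)]
        total -= math.log(max(p, 1e-15))  # clip to avoid log(0)
    return total / len(y_true)


classes = [1, 2, 3]
y_true = [1, 2, 3, 3]
y_pred = [1, 1, 3, 3]
probas = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]

acc = accuracy(y_true, y_pred)
loss = mean_log_loss(y_true, probas, classes)
uniform = mean_log_loss(y_true, [[1 / 3] * 3] * 4, classes)
print(acc, loss, uniform)
```

For three classes, uniform guessing gives a log-loss of ln 3, about 1.099, which is a useful yardstick when reading a log-loss close to 1.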

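Given the modest accuracy of this model, we can also fit it on the full dataset. Note that the printed coefficients are coef_[0], the row for the first class, ranking 1 (the lowest "likes" bin). In that row, opening a restaurant in San Francisco, opening a bar, or serving cuisine that is american or asian in nature all carry negative coefficients. Other things equal, a negative coefficient lowers the modeled probability of landing in the lowest bin, so these coefficients should not be read as a negative association with "likes".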
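Pairing each coefficient with its feature name makes a row of coefficients easier to read. The sketch below does this for the full-dataset coefficients printed in this notebook, which belong to ranking 1 (the lowest bin); the variable names are illustrative.

```python
features = ['american', 'asian', 'bar', 'casual', 'euro', 'latino',
            'Los Angeles', 'San Diego', 'San Francisco']

# Full-dataset coefficients for the first class (ranking 1), as printed.
coef_class1 = [-0.77764509, -0.46537834, -0.93638089, -0.24884504,
               -0.11391841, -0.61207444, 0.19502015, 0.44170103,
               -0.63671828]

# Sort from most negative to most positive.
ranked = sorted(zip(features, coef_class1), key=lambda pair: pair[1])
for name, value in ranked:
    print('{:<15}{:>8.3f}'.format(name, value))
```

Features at the top of this listing pull the score for the lowest bin down the most, and those at the bottom push it up. The rows for rankings 2 and 3 would be needed for a complete picture.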

In [31]:
x_train = np.asanyarray(train[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles', 
                         'San Diego', 'San Francisco']])
y_train = np.asanyarray(train['ranking'])

x_test = np.asanyarray(test[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles', 
                         'San Diego', 'San Francisco']])
y_test = np.asanyarray(test['ranking'])


# LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
# LR

mul_ordinal = linear_model.LogisticRegression(multi_class='multinomial',
                                              solver='newton-cg',
                                              fit_intercept=True).fit(x_train,
                                                                      y_train)

mul_ordinal

coef = mul_ordinal.coef_[0]
print (coef)

[-0.85376276 -0.49363961 -0.66860068 -0.4388444  -0.07514553 -0.85578844
  0.19000397  0.56608058 -0.7560811 ]


In [32]:
yhat = mul_ordinal.predict(x_test)
yhat

yhat_prob = mul_ordinal.predict_proba(x_test)
yhat_prob


jaccard_similarity_score(y_test, yhat)

0.45714285714285713

In [33]:
log_loss(y_test, yhat_prob)


1.0627998374025773

In [34]:
x_all = np.asanyarray(reg_dataset[['american', 'asian', 'bar', 'casual',
                                   'euro', 'latino', 'Los Angeles', 
                                   'San Diego', 'San Francisco']])
y_all = np.asanyarray(reg_dataset['ranking'])



LR = linear_model.LogisticRegression(multi_class='multinomial',
                                            solver='newton-cg',
                                            fit_intercept=True).fit(x_all,
                                                                    y_all)

LR

coef = LR.coef_[0]
print (coef)

[-0.77764509 -0.46537834 -0.93638089 -0.24884504 -0.11391841 -0.61207444
  0.19502015  0.44170103 -0.63671828]


In [35]:
print (classification_report(y_test, yhat))


              precision    recall  f1-score   support

           1       0.45      0.67      0.54        15
           2       0.00      0.00      0.00        12
           3       0.46      0.75      0.57         8

   micro avg       0.46      0.46      0.46        35
   macro avg       0.31      0.47      0.37        35
weighted avg       0.30      0.46      0.36        35





## Discussion