# California Restaurant "Likes" Prediction Using Foursquare API and Machine Learning

**Oliver Ma** <br>
Capstone Project <br>
IBM Data Science Professional Certificate <br> 

## 1. Introduction

California boasts an incredibly diverse collection of restaurants catering to different palettes and appetites. A large part of marketing for a modern restaurant (or any company) is social media, where the number of "likes" that the company can receive will dictate its brand and image to the general public. <br>

For a new business owner (or existing company) to open a new restaurant in California, knowing ahead of time the potential social media image they can have would provide an excellent solution to the ever present business problem of uncertainty. In this case the uncertainty is regarding performance of social media presence. 
<br>

We can mitigate this uncertainty through leveraging data gathered from FourSquare's API, specifically, we are able to scrape "likes" data of different restaurants directly from the API as well as their location and category of cuisine. The question we will try to address is, how accurately can we predict the amount of "likes" a new restaurant opening in this region can expect to have based on the type of cuisine it will serve and which city in California it will open in. (For the purposes of this analysis, we will contain the geographical scope of analysis to three heavily populated cities in California, namely San Francisco, Los Angeles, and San Diego). 
<br>

Leveraging this data will solve the problem as it allows the new business owner (or existing company) to make preemptive business decisions regarding opening the restaurant in terms of whether it is feasible to open one in this region and expect good social media presence, what type of cuisine and which city of three would be the best. This project will analyze and model the data via machine learning through comparing both linear and logistic regressions to see which method will yield better predictive capabilities after training and testing. 
<br>

Let us begin by importing the necessary packages.



In [2]:
import numpy as np 
import pandas as pd 
import json
from geopy.geocoders import Nominatim 
import requests
from pandas.io.json import json_normalize 
import matplotlib.cm as cm
import matplotlib.colors as colors

import folium 
from urllib.request import urlopen
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import pylab as pl

from sklearn import linear_model
from sklearn.metrics import jaccard_similarity_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import log_loss
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error, r2_score
import itertools

## 1. Configs

In [22]:
# credentials
CLIENT_ID = '1AJETP2F5JJMQ22BA33HNS2YNOF0QKT5HDAJNHBRMK04LQVC' # your Foursquare ID
CLIENT_SECRET = 'KEVIFVL2Y0BINL5HGRGRTWGOKUETUMVY5L124X1BJHVBDEOD' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

# magic numbers thresholds for data to scrape
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

# addresses of interest 
address_list = ['San Francisco, California', 'Los Angeles, California', 'San Diego, California']

# 2. Data 

## 2.1 Data Scraping and Cleaning

In this section we will first retrieve the geographical coordinates of the three cities (San Francisco, Los Angeles, and San Diego). Then, we will leverage the FourSquare API to obtain URLs that lead to the raw data in JSON form. We will speerately scrape the raw data in these URLs in order to retrieve the following columns: "name", "categories", "latitude", "longitude". and "id" for each city. We can also provide another column ("city") to indicate which city the restaurants are from. 
<br>

It is important to note that the extracts are not of every restaurant in those cities but rather all of the restaurants within a 1000KM range of the geographical coordinates that geolocator was able to provide. However, the extraction from the FourSquare API actually obtains venue data so it will include venues other than restaurants such as concert halls, stores, libraries etc. As such, this means that the data will need to be further cleaned somewhat manually by removing all of the non-restaurant rows. Once this is complete, we have a shortened by cleaned list to pull "likes" data. The reason the cleaning takes precedence is mainly that pulling the "likes" data is the computing process which takes the longest time in this project so we want to make sure we are not pulling information that will end up being dropped anyways.

The "id" is an important column as it will allow us to further pull the "likes" from the API. We can retreive the "likes" based on the restaurant "id" and then append it to the data frame. Once this is complete, we finally name the dataframe 'raw_dataset' as it is the most complete compiled form before needing any processing for analysis via machine learning. 

In [7]:
address_tuple_list = []
for idx, address in enumerate(address_list):
  print("now fetching address for {} / {} locations".format(idx+1, len(address_list)))
  location = geolocator.geocode(address)
  latitude = location.latitude
  longitude = location.longitude 
  location_tuple = (longitude, latitude)
  address_tuple_list.append(location_tuple)


now fetching address for 1 / 3 locations
now fetching address for 2 / 3 locations
now fetching address for 3 / 3 locations


In [17]:
url_list = []
for idx, tup in enumerate(address_tuple_list):
  print("now fetching urls for {} / {} locations".format(idx+1, len(address_tuple_list)))
  url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    address_tuple_list[idx][1], 
    address_tuple_list[idx][0], 
    radius, 
    LIMIT)
  url_list.append(url)

now fetching urls for 1 / 3 locations
now fetching urls for 2 / 3 locations
now fetching urls for 3 / 3 locations


In [45]:
def get_category_type(row):
  """
    returns the category of the venue (if available)
  """
  try:
      categories_list = row['categories']
  except:
      categories_list = row['venue.categories']
      
  if len(categories_list) == 0:
      return None
  else:
      return categories_list[0]['name']

def clean_json(result_json):
  """
    returns a cleaned flattened json for the result of an urlib get request 
  """
  venues = result_json['response']['groups'][0]['items']
  nearby_venues = pd.json_normalize(venues) # flatten JSON
  # filter columns
  filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 
                      'venue.location.lng', 'venue.id']
  nearby_venues = nearby_venues.loc[:, filtered_columns]
  # filter the category for each row
  nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
  # clean columns
  nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
  return nearby_venues 

In [57]:
df_nearby_venues = pd.DataFrame()
for idx, url in enumerate(url_list):
  print("now fetching results for {} / {} urls".format(idx+1, len(url_list)))
  result = requests.get(url).json()
  nearby_venues = clean_json(result)
  nearby_venues['city'] = address_list[idx]
  df_nearby_venues = df_nearby_venues.append(nearby_venues)

now fetching results for 1 / 3 urls
now fetching results for 2 / 3 urls
now fetching results for 3 / 3 urls


In [61]:
# list of type of venues to remove 
removal_list = ['Concert Hall', 'Opera House', 'Dance Studio',
                'Performing Arts Venue', 'Art Museum', 'Park',
                'Massage Studio', 'Music Venue', 'Bookstore', 'Clothing Store',
                'Boutique', 'Furniture/Home Store', 'Jazz Club',
                'Theater', 'Optical Shop', "Men's Store", 'Rock Club',
                'Gym / Fitness Center', 'Wine Shop', 'Indie Movie Theater',
                'Chocolate Shop', 'Dessert Shop', 'Recreation Center', 
                'Plaza', 'Hotel', 'Luggage Store', 'Farmers Market', 'Gym',
                'Jewelry Store', 'Furniture / Home Store', 'Butcher', 
                'Bakery', 'Marijuana Dispensary', 'Ice Cream Shop',
                'Comic Shop', 'Bagel Shop', 'Spa', 'Liquor Store', 'Bike Shop',
                'Yoga Studio', 'Pedestrian Plaza', 'Candy Store',
                'Park', 'Bookstore', 'Candy Store',  'Jazz Club', 'Art Gallery', 
                 'Supermarket', 'Museum', 'Boutique', 'Plaza', 'Building', 'Bakery',
                 'Historic Site', 'Ice Cream Shop', ' Concert Hall', 'Pharmacy', 
                 'Market', 'Movie Theater', 'Performing Arts Venue', 'Music Venue',
                 'Theater', 'Art Museum', 'Cheese Shop', 'Opera House',
                 'Pedestrian Plaza', 'School', 'Gift Shop', 'Athletics & Sports',
                 'Shoe Repair', 'General Entertainment', 'Stationery Store',
                 'Toy / Game Store', 'Brewery', 'Hotel', 'Theater', 'Music Venue', 'Business Service',
                 'Donut Shop', 'Liquor Store', 'Beer Store',
                 'Lounge', 'Plaza', 'Health Food Store', 'Concert Hall', 
                 'Lingerie Store', 'Gym', 'Mobile Phone Shop',
                 'Chocolate Shop', 'Ice Cream Shop', 'Hostel', 'Convenience Store', 
                 'Park', 'Farmers Market', 'Cosmetics Shop', 'Piano Bar',
                 'Nightclub', 'Massage Studio', 'Comedy Club', 'Concert Hall']

df_nearby_venues = df_nearby_venues[~df_nearby_venues['categories'].isin(removal_list)]

# check that only restaurant data remains
df_nearby_venues['categories'].unique()     

array(['Vietnamese Restaurant', 'Tiki Bar', 'Ramen Restaurant',
       'Beer Bar', 'Wine Bar', 'Coffee Shop', 'Cocktail Bar',
       'French Restaurant', 'Vegetarian / Vegan Restaurant',
       'Italian Restaurant', 'Music School',
       'Southern / Soul Food Restaurant', 'Café', 'Food & Drink Shop',
       'Dumpling Restaurant', 'Event Space', 'Sandwich Place',
       'Poke Place', 'Souvlaki Shop', 'Pizza Place', 'Sushi Restaurant',
       'New American Restaurant', 'Beer Garden', 'Juice Bar',
       'Mexican Restaurant', 'Thai Restaurant', 'American Restaurant',
       'German Restaurant', 'Kids Store', 'Bar', 'Accessories Store',
       'Indian Restaurant', 'Speakeasy', 'Breakfast Spot',
       'Udon Restaurant', 'Japanese Restaurant', 'BBQ Joint',
       'Latin American Restaurant', 'Filipino Restaurant',
       'Deli / Bodega', 'Mediterranean Restaurant', 'Shopping Mall',
       'Cajun / Creole Restaurant', 'Yoshoku Restaurant', 'Restaurant',
       'Train Station', 'Noodle House

In [66]:
# pull the likes from the API based on venue ID

url_list = []
like_list = []
json_list = []

for idx, venue in enumerate(list(df_nearby_venues.id)):
  venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(venue, CLIENT_ID, CLIENT_SECRET, VERSION)
  url_list.append(venue_url)
print("venue url fetching complete")

for idx, link in enumerate(url_list):
  result = requests.get(link).json()
  likes = result['response']['likes']['count']
  like_list.append(likes)
print("venue likes fetching complete")

df_nearby_venues['likes'] = like_list

venue url fetching complete
venue likes fetching complete


In [68]:
# check that data has been appropriately scraped and parsed
df_nearby_venues.head()

Unnamed: 0,name,categories,lat,lng,id,city,likes
9,DragonEats,Vietnamese Restaurant,37.778289,-122.423266,5005c520e4b0a8ae9875b787,"San Francisco, California",175
11,Smuggler's Cove,Tiki Bar,37.779386,-122.423422,4afe6db4f964a520682f22e3,"San Francisco, California",1107
13,Nojo Ramen Tavern,Ramen Restaurant,37.776637,-122.42127,4d8eabc7d265236af9a71017,"San Francisco, California",350
14,The Beer Hall,Beer Bar,37.776837,-122.417916,519fcb37498e91a13cb23d6b,"San Francisco, California",291
15,Fig & Thistle Wine Bar,Wine Bar,37.777256,-122.423365,51aad1ee498e9bb839540dc7,"San Francisco, California",296


## 2.2 Data Preparation

The data still needs some more processing before it is suitable for model training and testing. Mainly, the "categories" column contains too many different types of cuisines to allow a model to yield any meaningful results. However, the different types of natural cuisines have natural groupings based on conventionally accepted cultural groupings of cuisine. Broadly speaking, all of the different types of cuisine could be reclassified as European, Latin American, Asian, North American, drinking establishments (bars), or casual establishments such as coffee shops or ice cream parlours. We can implement manual classification as there really aren't that many different types of cuisines.

As this project will compare both linear and logistic regression, it makes sense to have "likes" as both a continuous and categorical (but ordinal) variable. In the case of turning into a categorical variable, we can bin the data based on percentiles and classify them into these ordinal percentile categories. I tried different ways of binning but in the end, splitting the sample into three different bins proved to yield the best classification results from a prediction standpoint.

As the last stage of data preparation, it is important to note that the regressors are categorical variables (3 different cities and 6 different categories of cusines). Hence, they require dummy variable encoding for meaningful analysis. We can accomplish this via one-hot encoding. 

In [69]:
# inspecting the raw dataset shows that there may be too many different types of cuisines
raw_dataset = df_nearby_venues
raw_dataset['categories'].unique()

array(['Vietnamese Restaurant', 'Tiki Bar', 'Ramen Restaurant',
       'Beer Bar', 'Wine Bar', 'Coffee Shop', 'Cocktail Bar',
       'French Restaurant', 'Vegetarian / Vegan Restaurant',
       'Italian Restaurant', 'Music School',
       'Southern / Soul Food Restaurant', 'Café', 'Food & Drink Shop',
       'Dumpling Restaurant', 'Event Space', 'Sandwich Place',
       'Poke Place', 'Souvlaki Shop', 'Pizza Place', 'Sushi Restaurant',
       'New American Restaurant', 'Beer Garden', 'Juice Bar',
       'Mexican Restaurant', 'Thai Restaurant', 'American Restaurant',
       'German Restaurant', 'Kids Store', 'Bar', 'Accessories Store',
       'Indian Restaurant', 'Speakeasy', 'Breakfast Spot',
       'Udon Restaurant', 'Japanese Restaurant', 'BBQ Joint',
       'Latin American Restaurant', 'Filipino Restaurant',
       'Deli / Bodega', 'Mediterranean Restaurant', 'Shopping Mall',
       'Cajun / Creole Restaurant', 'Yoshoku Restaurant', 'Restaurant',
       'Train Station', 'Noodle House

In [70]:
# we can group some cuisines together to make a better categorical variable

euro = ['French Restaurant', 'Scandinavian Restaurant', 'Souvlaki Shop', 
       'Mediterranean Restaurant', 'Italian Restaurant', 'Pizza Place']

latino = ['Mexican Restaurant', 'Latin American Restaurant', 
          'Brazilian Restaurant', 'Taco Place']

bar = ['Beer Bar', 'Cocktail Bar', 'Tiki Bar', 'Wine Bar', 'Hotel Bar',
       'Beer Garden', 'Speakeasy', 'Brewery', 'Pub', 'Bar', 'Gastropub',
       'Hookah Bar']

asian = ['Ramen Restaurant', 'Sushi Restaurant', 'Vietnamese Restaurant',
         'Thai Restaurant', 'Poke Place', 'Indian Restaurant', 
         'Japanese Curry Restaurant', 'Japanese Restaurant', 
         'Indonesian Restaurant', 'Udon Restaurant', 'Noodle House',
         'Falafel Restaurant', 'Filipino Restaurant', 'Turkish Restaurant',
         'Yoshoku Restaurant']

casual = ['Coffee Shop', 'Café', 'Sandwich Place', 'Food Truck',
          'Juice Bar', 'Frozen Yogurt Shop', 'Deli / Bodega', 'Dessert Shop',
          'Hot Dog Joint', 'Burger Joint', 'Breakfast Spot', 
          'Fondue Restaurant']

american = ['Southern / Soul Food Restaurant', 'Food & Drink Shop', 
            'Restaurant', 'American Restaurant', 'BBQ Joint', 
            'Theme Restaurant', 'New American Restaurant',
            'Vegetarian / Vegan Restaurant', 'Seafood Restaurant']

def categorize_restaurants(df):
  """
    returns category of restaurant
  """
    if df['categories'] in euro:
        return 'euro'
    if df['categories'] in latino:
        return 'latino'
    if df['categories'] in asian:
        return 'asian'
    if df['categories'] in casual:
        return 'casual'
    if df['categories'] in american:
        return 'american'
    if df['categories'] in bar:
        return 'bar'


raw_dataset['categories_classified'] = raw_dataset.apply(categorize_restaurants, axis=1)
raw_dataset.head()

Unnamed: 0,name,categories,lat,lng,id,city,likes,categories_classified
9,DragonEats,Vietnamese Restaurant,37.778289,-122.423266,5005c520e4b0a8ae9875b787,"San Francisco, California",175,asian
11,Smuggler's Cove,Tiki Bar,37.779386,-122.423422,4afe6db4f964a520682f22e3,"San Francisco, California",1107,bar
13,Nojo Ramen Tavern,Ramen Restaurant,37.776637,-122.42127,4d8eabc7d265236af9a71017,"San Francisco, California",350,asian
14,The Beer Hall,Beer Bar,37.776837,-122.417916,519fcb37498e91a13cb23d6b,"San Francisco, California",291,bar
15,Fig & Thistle Wine Bar,Wine Bar,37.777256,-122.423365,51aad1ee498e9bb839540dc7,"San Francisco, California",296,bar


In [72]:
# check how many are of each category
pd.crosstab(index=raw_dataset["categories_classified"], columns="count")  

col_0,count
categories_classified,Unnamed: 1_level_1
american,19
asian,37
bar,28
casual,40
euro,18
latino,13


In [83]:
# classify the likes into different ranking levels
# Determine 3 rankings by binning based on percentiles

thres1 = np.percentile(raw_dataset['likes'], 33)
thres2 = np.percentile(raw_dataset['likes'], 66)


def apply_rankings(df, thres1, thres2):
  """
    return new column with rankings based on threshold bins
  """
  if df['likes'] < thres1:
      return 1
  if df['likes'] >= thres1 or df['likes'] <= thres2:
      return 2
  if df['likes'] > thres2:
      return 3

raw_dataset['ranking'] = raw_dataset.apply(apply_rankings, axis=1, args = [thres1, thres2])
raw_dataset.head()

Unnamed: 0,name,categories,lat,lng,id,city,likes,categories_classified,ranking
9,DragonEats,Vietnamese Restaurant,37.778289,-122.423266,5005c520e4b0a8ae9875b787,"San Francisco, California",175,asian,2
11,Smuggler's Cove,Tiki Bar,37.779386,-122.423422,4afe6db4f964a520682f22e3,"San Francisco, California",1107,bar,2
13,Nojo Ramen Tavern,Ramen Restaurant,37.776637,-122.42127,4d8eabc7d265236af9a71017,"San Francisco, California",350,asian,2
14,The Beer Hall,Beer Bar,37.776837,-122.417916,519fcb37498e91a13cb23d6b,"San Francisco, California",291,bar,2
15,Fig & Thistle Wine Bar,Wine Bar,37.777256,-122.423365,51aad1ee498e9bb839540dc7,"San Francisco, California",296,bar,2


In [86]:
# create dummies for linear regression modelling

# one hot encoding
reg_dataset = pd.get_dummies(raw_dataset[['categories_classified', 
                                          'city',]], 
                               prefix="", 
                               prefix_sep="")

# add name, ranking, and likes columns back to dataframe
reg_dataset['ranking'] = raw_dataset['ranking']
reg_dataset['likes'] = raw_dataset['likes']
reg_dataset['name'] = raw_dataset['name']

# move name column to the first column
reg_columns = [reg_dataset.columns[-1]] + list(reg_dataset.columns[:-1])
reg_dataset = reg_dataset[reg_columns]


reg_dataset.head()

Unnamed: 0,name,american,asian,bar,casual,euro,latino,"Los Angeles, California","San Diego, California","San Francisco, California",ranking,likes
9,DragonEats,0,1,0,0,0,0,0,0,1,2,175
11,Smuggler's Cove,0,0,1,0,0,0,0,0,1,2,1107
13,Nojo Ramen Tavern,0,1,0,0,0,0,0,0,1,2,350
14,The Beer Hall,0,0,1,0,0,0,0,0,1,2,291
15,Fig & Thistle Wine Bar,0,0,1,0,0,0,0,0,1,2,296


# 3. Methodology

This project will utilize both linear and logistic regression machine learning methods to train and test the data. Namely, linear regression will be used in an attempt to predict the number of "likes" a new restaurant in this region will have. We will utilize the Sci-Kit Learn Package to run the model. 

We can also utilize logisitc regression as a classification method rather than direct prediction of the number of likes. Since the number of "likes" can be binned into different categories based on different percentile bins, it is also potentiallly possible to see which range of "likes" a new restaurant in this region will have. 

Since the "likes" are binned into multiple (more than 2) categories, the type of logistic regression will be multinomial. Additionally, although the ranges are indeed discrete categories, they are also ordinal in nature. Therefore the logistic regression will need to be specified as being both multinomial and ordinal. This can be done through the Sci-Kit Learn Package as well.

# 4. Results 

## 4.1 Linear Regression Results

A linear regression model was trained on a random subsample of 80% of the sample and then tested on the other 20%. To see if this is a reasonable model. the residual sum of squares score and variance score were both calculated. Given the low variance score, this is probably not a valid/good way of modelling the data. Therefore, we move on to logistic regression.

In [89]:
# Multiple Linear Regression

msk = np.random.rand(len(reg_dataset)) < 0.8
train = reg_dataset[msk]
test = reg_dataset[~msk]

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles, California', 
                         'San Diego, California', 'San Francisco, California']])
y = np.asanyarray(train[['likes']])
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)

Coefficients:  [[ 158.48147819   50.54049676  121.2099482   126.95564966   49.64553015
    66.94623656  -57.78537586 -113.14345435  170.9288302 ]]


In [90]:
# Multiple Linear Regression Prediction Capabilities

y_hat= regr.predict(test[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles, California', 
                         'San Diego, California', 'San Francisco, California']])
x = np.asanyarray(test[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles, California', 
                         'San Diego, California', 'San Francisco, California']])
y = np.asanyarray(test[['likes']])
print("Residual sum of squares: %.2f"
      % np.mean((y_hat - y) ** 2))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x, y))

Residual sum of squares: 39571.39
Variance score: -0.80


## 4.2 Logistic Regression Results

A multinomial ordinal logisitc regression model was trained on a random subsample of 80% of the sample and then tested on the other 20%. To see if this is a reasonable model, its jaccard similarity score and log-loss were calculated (66.66% and 1.009 respectively). Although this is not a perfect prediction, a similarity of 66% between the training set and test set is a reasonable result. The classification report is also printed later on below.
<br>

Given the modestly accurate ability of this model, we can also run the model on the full dataset. The coefficients show that opening a restaurant in San Francisco, opening a bar, or serving cuisine that is american or asian in nature, are associated negatively with "likes." 

In [91]:
# Multinomial Ordinal Logistic Regression

x_train = np.asanyarray(train[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles, California', 
                         'San Diego, California', 'San Francisco, California']])
y_train = np.asanyarray(train['ranking'])

x_test = np.asanyarray(test[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Los Angeles, California', 
                         'San Diego, California', 'San Francisco, California']])
y_test = np.asanyarray(test['ranking'])

mul_ordinal = linear_model.LogisticRegression(multi_class='multinomial',
                                              solver='newton-cg',
                                              fit_intercept=True).fit(x_train,
                                                                      y_train)

mul_ordinal

coef = mul_ordinal.coef_[0]
print (coef)

[ 0.47041481  0.43095355  0.59559234  0.25901742 -0.22010891  0.3352171
 -0.32206993 -0.35064342  0.67271347]


In [95]:
# Multinomial Ordinal Logistic Regression Prediction Capabilities

yhat = mul_ordinal.predict(x_test)
yhat

yhat_prob = mul_ordinal.predict_proba(x_test)
yhat_prob


jaccard_similarity_score(y_test, yhat)



0.5151515151515151

In [96]:
log_loss(y_test, yhat_prob)

0.7306569195318565

In [98]:
# Exploration of Coefficient Magnitudes of Full Dataset

x_all = np.asanyarray(reg_dataset[['american', 'asian', 'bar', 'casual',
                                   'euro', 'latino', 'Los Angeles, California', 
                                   'San Diego, California', 'San Francisco, California']])
y_all = np.asanyarray(reg_dataset['ranking'])



LR = linear_model.LogisticRegression(
    multi_class='multinomial',
    solver='newton-cg',
    fit_intercept=True
  ).fit(x_all, y_all)

coef = LR.coef_[0]
print (coef)

[ 0.71096151  0.3983821   0.71476868  0.26933379  0.01300037  0.33648868
 -0.19695283 -0.35016416  0.54711218]


In [99]:
print (classification_report(y_test, yhat))

              precision    recall  f1-score   support

           1       0.33      0.07      0.11        15
           2       0.53      0.89      0.67        18

    accuracy                           0.52        33
   macro avg       0.43      0.48      0.39        33
weighted avg       0.44      0.52      0.41        33



## 5. Discussion

The first thing to note is that given the data, logistic regression presents a better fit for the data over linear regression. Using logistic regression we were able to obtain a Jaccard Similarity Score of 51%, which although not perfect, is more reasonable than the low variance score obtained from the linear regression. As stated before, please note that for the purposes of this project, we are assumming that likes are a good proxy for how well a new restaurant will do in terms of brand, image and by extension how well the restaurant will perform business-wise. Whether or not these assumptions hold up in a real-life scenario is up for discussion, but this project does contain limitations in scope due to the amount of data that can be fetched from the FourSquare API. 

<br>

As such, to obtain insights into this data, we can proceed with breaking down the results of the logistic regression model. The results showed that the model is better at predicting if a restaurant will fall into the best or worst percentile of likes. This allows us to roughly predict the potential performance of the business opportunity. Different binning methods for the classes were attempted, but the use of 3 bins yielded the best Jaccard Similarity Score. 

## 6. Conclusion

In conclusion, after analyzing restaurant "likes" in California from 300 restaurants, we have developed a general classification model for which "ranking" of likes a new restaurant will potentially fall into based on its 
characteristics.