# Capstone Project - Week 5

## Introduction:

This proposal aims to make it easier to explore restaurants in a few California cities. With many newcomers to the region and a society made more solitary by COVID and increased internet use, people could use help finding popular restaurants. I chose Oakland, San Diego, and Emeryville for this proposal.

Because mandatory shutdowns have devastated the restaurant industry, now may be a good opportunity for investors and entrepreneurs to make a careful investment in a restaurant. The type of restaurant is an important criterion when scouting a city for a potential property.

This project asks how accurately we can predict the number of "likes" a new restaurant in these cities can expect, based on the type of cuisine it serves and the city it opens in. I modeled the data with both linear and logistic regression and compared which method predicted better after training and testing.


## Data:

The steps below produce a dataframe named 'raw_dataset'. This is the most complete compiled form of the data, before any processing for analysis and machine learning.

We begin by retrieving the geographical coordinates of three cities in California: Oakland, Emeryville, and San Diego. We then use the Foursquare API to build the URLs that return the raw data in JSON form. From each response we keep the columns 'name', 'categories', 'latitude', 'longitude', and 'id', and add a 'city' column that records where each restaurant is located.

To keep the data within the scope of this project, I focused on venues within a 1000 meter radius of the coordinates provided by the geolocator. The Foursquare API returns more venue categories than we need, so we tidy the results by removing non-restaurant rows. The 'likes' data is necessary for the final analysis, but it is pulled only after this cleaning step so that we do not request information that would be discarded.

We use the 'id' column to pull the 'likes' through the API and append them to the dataframe. We conclude by naming the dataframe 'raw_dataset', which is used in the machine learning portion of the project.


## Methods:

Both linear and logistic regression were used to train and test the data. Linear regression was used to predict the number of 'likes' a new restaurant in this region will acquire. scikit-learn was employed for this stage.

Logistic regression was used as the classification method. Because the number of 'likes' is binned into classes, we can use multinomial logistic regression for the analysis. Although the ranges are discrete categories, they are ordinal in nature. scikit-learn's LogisticRegression fits the multinomial model but treats the classes as unordered, so the ordering of the bins is not used by the model fitted here.
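
For reference, the multinomial (softmax) model that scikit-learn fits when `multi_class='multinomial'` is set gives each of the three ranking classes its own weight vector. The symbols below ($w_k$, $b_k$, $\theta_k$) are notation introduced here for illustration and do not appear in the notebook.

```latex
P(y = k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{3} \exp(w_j^\top x + b_j)}, \qquad k \in \{1, 2, 3\}
```

Nothing in this form says that class 2 lies between classes 1 and 3. A model that uses the ordering, such as the proportional-odds model, would instead share one weight vector and fit ordered thresholds:

```latex
P(y \le k \mid x) = \sigma(\theta_k - w^\top x), \qquad \theta_1 < \theta_2
```

The second form is shown only for contrast; it is not what the code in this notebook fits.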


## Imported Libraries

In [2]:
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json
import numpy as np 
import itertools
import requests

!conda install -c conda-forge folium=0.5.0 --yes
import folium as fo
import pylab as pl
import warnings
import matplotlib.pyplot as plt
import matplotlib.rcsetup as rc
import matplotlib.cm as cm
import matplotlib.colors as colors
warnings.filterwarnings('ignore')

from urllib.request import urlopen
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim 
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             jaccard_score, log_loss,
                             mean_squared_error, r2_score)





## Retrieve Foursquare Data (City Coordinates)

In [3]:
# one geolocator instance is enough for all three lookups
geolocator = Nominatim(user_agent="foursquare_agent")

address1 = 'Oakland, California'
location1 = geolocator.geocode(address1)
latitude1 = location1.latitude
longitude1 = location1.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address1, latitude1, longitude1))

address2 = 'Emeryville, California'
location2 = geolocator.geocode(address2)
latitude2 = location2.latitude
longitude2 = location2.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address2, latitude2, longitude2))

address3 = 'San Diego, California'
location3 = geolocator.geocode(address3)
latitude3 = location3.latitude
longitude3 = location3.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address3, latitude3, longitude3))

The geographical coordinates of Oakland, California are 37.8044557, -122.2713563.
The geographical coordinates of Emeryville, California are 37.8314089, -122.2865266.
The geographical coordinates of San Diego, California are 32.7174202, -117.1627728.


## Foursquare Credentials:

In [4]:
import os

# Read the credentials from environment variables instead of hard-coding them.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20191008' # Foursquare API version

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

# Here we create one explore URL per city (the URLs embed the credentials, so they are not printed)
url1 = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude1, 
    longitude1, 
    radius, 
    LIMIT)

url2 = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude2, 
    longitude2, 
    radius, 
    LIMIT)

url3 = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude3, 
    longitude3, 
    radius, 
    LIMIT)
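
The three explore URLs differ only in their coordinates, so they could also be built by one helper. The following is a minimal sketch under stated assumptions: the function name `build_explore_url`, the `cities` dictionary, and the placeholder credential strings are introduced here for illustration and are not part of the notebook. No request is sent.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20191008", radius=1000, limit=100):
    """Build a Foursquare v2 venues/explore URL (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the "ll" parameter unescaped
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


# Coordinates as returned by the geolocator earlier in the notebook
cities = {
    "Oakland": (37.8044557, -122.2713563),
    "Emeryville": (37.8314089, -122.2865266),
    "San Diego": (32.7174202, -117.1627728),
}

# Placeholder credentials; real values would come from the environment
urls = {
    city: build_explore_url(lat, lng, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
    for city, (lat, lng) in cities.items()
}
```

Keeping the parameters in a dictionary makes it harder to mix up the positional arguments that the three `.format` calls rely on.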




## Data Exploration:


In [5]:
# Here we request the JSON data from the generated URLs

results1 = requests.get(url1).json()
results2 = requests.get(url2).json()
results3 = requests.get(url3).json()

# This is the function that extracts the category of the venue

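def get_category_type(row):
    # the column is called 'categories' or 'venue.categories',
    # depending on whether the column names have been tidied yet
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        # keep only the name of the first (primary) category
        return categories_list[0]['name']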
    

# This is our first city   

venues1 = results1['response']['groups'][0]['items']
nearby_venues1 = pd.json_normalize(venues1) # flatten JSON

# filter columns
filtered_columns1 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues1 = nearby_venues1.loc[:, filtered_columns1]

# filter the category for each row
nearby_venues1['venue.categories'] = nearby_venues1.apply(get_category_type, axis=1)

# tidy up the columns
nearby_venues1.columns = [col.split(".")[-1] for col in nearby_venues1.columns]

# SECOND CITY

venues2 = results2['response']['groups'][0]['items']
nearby_venues2 = pd.json_normalize(venues2) # flatten JSON

# filter columns
filtered_columns2 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues2 = nearby_venues2.loc[:, filtered_columns2]

# filter the category for each row
nearby_venues2['venue.categories'] = nearby_venues2.apply(get_category_type, axis=1)

# clean columns
nearby_venues2.columns = [col.split(".")[-1] for col in nearby_venues2.columns]

# THIRD CITY

venues3 = results3['response']['groups'][0]['items']
nearby_venues3 = pd.json_normalize(venues3) # flatten JSON

# filter columns
filtered_columns3 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues3 = nearby_venues3.loc[:, filtered_columns3]

# filter the category for each row
nearby_venues3['venue.categories'] = nearby_venues3.apply(get_category_type, axis=1)

# clean columns
nearby_venues3.columns = [col.split(".")[-1] for col in nearby_venues3.columns]





print('{} venues for Oakland, California were returned by Foursquare.'.format(nearby_venues1.shape[0]))
print()
print('{} venues for Emeryville, California were returned by Foursquare.'.format(nearby_venues2.shape[0]))
print()
print('{} venues for San Diego, California were returned by Foursquare.'.format(nearby_venues3.shape[0]))

100 venues for Oakland, California were returned by Foursquare.

100 venues for Emeryville, California were returned by Foursquare.

100 venues for San Diego, California were returned by Foursquare.
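
The per-city blocks above repeat the same flattening steps three times. As a rough sketch of the same idea in plain Python, the helper below reduces one entry of the `items` list to a flat record. The name `flatten_venue` and the sample payload are assumptions made for illustration; the payload only mimics the nesting that the notebook code relies on (`venue.name`, `venue.categories`, `venue.location.lat`, `venue.location.lng`, `venue.id`).

```python
def flatten_venue(item, city):
    """Reduce one 'items' entry of an explore response to a flat record."""
    venue = item["venue"]
    categories = venue.get("categories", [])
    return {
        "name": venue["name"],
        # keep only the name of the first category, as get_category_type does
        "categories": categories[0]["name"] if categories else None,
        "lat": venue["location"]["lat"],
        "lng": venue["location"]["lng"],
        "id": venue["id"],
        "city": city,
    }


# Hypothetical payload shaped like the nested structure used above
sample_items = [
    {"venue": {"name": "Golden Lotus Vegetarian Restaurant",
               "categories": [{"name": "Vegetarian / Vegan Restaurant"}],
               "location": {"lat": 37.80329, "lng": -122.270473},
               "id": "49cebb1bf964a520785a1fe3"}},
    {"venue": {"name": "Venue without a category",
               "categories": [],
               "location": {"lat": 37.8, "lng": -122.27},
               "id": "hypothetical-id"}},
]

records = [flatten_venue(item, "Oakland") for item in sample_items]
```

A list of such records can be passed to `pd.DataFrame(records)` to obtain the same columns that the notebook builds for each city.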


In [6]:
# This section will add location data to the data sets of each city

nearby_venues1['city'] = 'Oakland'
nearby_venues2['city'] = 'Emeryville'
nearby_venues3['city'] = 'San Diego'

In [7]:
# This section will combine the three cities into one data set

# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row labels
nearby_venues = pd.concat([nearby_venues1, nearby_venues2, nearby_venues3])

In [8]:
nearby_venues['categories'].unique()

array(['Clothing Store', 'Vegetarian / Vegan Restaurant', 'Bar',
       'Comic Shop', 'Café', 'Japanese Restaurant', 'Brewery',
       'Music Venue', 'Vietnamese Restaurant', 'Bagel Shop',
       'American Restaurant', 'Seafood Restaurant', 'Skating Rink',
       'Kitchen Supply Store', 'Coffee Shop', 'Mexican Restaurant',
       'Beer Bar', 'Furniture / Home Store', 'Juice Bar', 'Dessert Shop',
       'Brazilian Restaurant', 'Afghan Restaurant', 'Chinese Restaurant',
       'Caribbean Restaurant', 'Lounge', 'Salad Place', 'Bakery',
       'Ice Cream Shop', 'Sausage Shop', 'Wine Bar', 'Taco Place',
       'Cocktail Bar', 'Climbing Gym', 'Sandwich Place',
       'Bubble Tea Shop', 'Food Court', 'Burger Joint', 'Tiki Bar',
       'New American Restaurant', 'Nightclub', 'Rental Car Location',
       'Gay Bar', 'Tapas Restaurant', 'Burmese Restaurant',
       'Indian Restaurant', 'Cambodian Restaurant', 'Falafel Restaurant',
       'Museum', 'Beer Garden', 'Street Food Gathering', 'ATM',
 

## Data Cleaning:

In [9]:
# check list and manually remove all non-restaurant data

nearby_venues['categories'].unique()

removal_list = ['Clothing Store','Bar','Brewery', 
                'Comic Shop', 'Yoga Studio','Café', 
                'Coffee Shop', 'Tiki Bar', 'Music Venue', 
                'Wine Bar',  'Cocktail Bar', 'Dance Studio', 
                'Gym / Fitness Center','Beer Bar', 
                'Bubble Tea Shop', 'Nightclub', 'Food Court', 
                'Ice Cream Shop', 'Cupcake Shop', 'Skating Rink', 
                'Dessert Shop', 'Climbing Gym', 'Bakery', 
                'Farmers Market', 'Gay Bar','Beer Garden',
                'Tea Room','Arts & Crafts Store', 'Grocery Store', 
                'Sports Bar', 'Museum', 'Street Food Gathering', 
                'Library', 'Skate Park', 'Movie Theater','Park', 
                'Gym', 'Stadium', 'Furniture / Home Store', 'Discount Store', 
                'Playground', 'Cosmetics Shop', 'Casino', 
                'Pet Store','Electronics Store', 'Snack Place',
                'Salon / Barbershop', 'Shopping Plaza', 'Deli / Bodega', 
                'Candy Store', 'Liquor Store', 'Hotel', 
                'Shoe Store', 'Bookstore', 'Shopping Mall', 
                'Dive Bar', 'Video Game Store', 'Pharmacy', 
                'Accessories Store', 'Lingerie Store', 'Mobile Phone Shop', 
                'Pool Hall', 'Juice Bar', 'Kids Store', 
                'Supplement Shop', 'Big Box Store', 'Mattress Store', 
                'Hardware Store', 'Paper / Office Supplies Store', 'Theater', 
                'Business Service', 'Donut Shop', 'Beer Store', 
                'Lounge', 'Health Food Store', 'Pedestrian Plaza', 
                'Hookah Bar', 'Concert Hall', 'Chocolate Shop', 
                'Hostel', 'Convenience Store', 'Pub', 
                'Plaza', 'Comedy Club', 'Speakeasy', 
                'Tattoo Parlor', 'Massage Studio']

nearby_venues = nearby_venues[~nearby_venues['categories'].isin(removal_list)]

nearby_venues['categories'].unique().tolist()

['Vegetarian / Vegan Restaurant',
 'Japanese Restaurant',
 'Vietnamese Restaurant',
 'Bagel Shop',
 'American Restaurant',
 'Seafood Restaurant',
 'Kitchen Supply Store',
 'Mexican Restaurant',
 'Brazilian Restaurant',
 'Afghan Restaurant',
 'Chinese Restaurant',
 'Caribbean Restaurant',
 'Salad Place',
 'Sausage Shop',
 'Taco Place',
 'Sandwich Place',
 'Burger Joint',
 'New American Restaurant',
 'Rental Car Location',
 'Tapas Restaurant',
 'Burmese Restaurant',
 'Indian Restaurant',
 'Cambodian Restaurant',
 'Falafel Restaurant',
 'ATM',
 'Southern / Soul Food Restaurant',
 'Breakfast Spot',
 'Jazz Club',
 'Flower Shop',
 'Pizza Place',
 'Mediterranean Restaurant',
 'Diner',
 'Video Store',
 'Scandinavian Restaurant',
 'Wings Joint',
 "Men's Store",
 'Bank',
 'Asian Restaurant',
 'Burrito Place',
 'Fast Food Restaurant',
 'Optical Shop',
 'Sushi Restaurant',
 'Fried Chicken Joint',
 'Recreation Center',
 'Intersection',
 'Trail',
 'French Restaurant',
 'Fondue Restaurant',
 'Theme R
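
The list above still contains categories such as 'Kitchen Supply Store', 'ATM', 'Rental Car Location', and 'Bank', because a removal list only drops what it names. One possible alternative is an allow-list that keeps a row only when its category looks like a restaurant. The sketch below is an assumption, not the method used in this notebook: the function `is_restaurant` and the set `EXTRA_FOOD_CATEGORIES` are hypothetical, and applying them would change the row count and every number that follows.

```python
# Food categories that do not end in "Restaurant" (hypothetical allow-list)
EXTRA_FOOD_CATEGORIES = {
    "Bagel Shop", "Taco Place", "Sandwich Place", "Burger Joint",
    "Pizza Place", "Diner", "Breakfast Spot", "Wings Joint",
    "Burrito Place", "Fried Chicken Joint", "Salad Place",
}


def is_restaurant(category):
    """Return True when a Foursquare category name looks like a restaurant."""
    if category is None:
        return False
    return category.endswith("Restaurant") or category in EXTRA_FOOD_CATEGORIES


sample = ["Japanese Restaurant", "ATM", "Kitchen Supply Store",
          "Taco Place", "Rental Car Location", "Bank", None]
kept = [category for category in sample if is_restaurant(category)]
```

On the dataframe, the same predicate could be applied with `nearby_venues[nearby_venues['categories'].map(is_restaurant)]`.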

## DataFrame Creation:

In [10]:
# Here we set up to pull the likes from the API based on venue ID

url_list = []
like_list = []
json_list = []

for i in list(nearby_venues.id):
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(i, CLIENT_ID, CLIENT_SECRET, VERSION)
    url_list.append(venue_url)
for link in url_list:
    result = requests.get(link).json()
    likes = result['response']['likes']['count']
    like_list.append(likes)
print(like_list)


nearby_venues['likes'] = like_list
nearby_venues.head()

[77, 64, 33, 72, 22, 7, 63, 50, 115, 199, 40, 66, 43, 100, 51, 0, 176, 182, 367, 33, 25, 15, 24, 0, 39, 237, 1, 91, 171, 4, 27, 226, 38, 46, 9, 76, 1, 2, 42, 247, 13, 38, 66, 7, 191, 8, 54, 136, 132, 332, 24, 4, 0, 1, 16, 36, 18, 54, 0, 1, 2, 31, 60, 17, 63, 92, 22, 68, 17, 4, 1, 15, 65, 7, 2, 31, 7, 7, 4, 156, 75, 33, 9, 15, 81, 18, 131, 26, 105, 59, 171, 41, 24, 38, 22, 21, 31, 30, 2, 298, 35, 31, 61, 7, 2, 132, 12, 222, 31, 480, 97, 187, 23, 34, 104, 152, 16, 140, 92, 19, 108, 50, 51, 54]


index,name,categories,lat,lng,id,city,likes
1,Golden Lotus Vegetarian Restaurant,Vegetarian / Vegan Restaurant,37.80329,-122.270473,49cebb1bf964a520785a1fe3,Oakland,77
6,Abura-Ya,Japanese Restaurant,37.805959,-122.267693,539a69a7498ee67090b2b285,Oakland,64
8,Nature Vegetarian Restaurant,Vegetarian / Vegan Restaurant,37.802157,-122.270983,4f52deb2e4b0ac6d0c91df05,Oakland,33
10,Tay Ho Restaurant & Bar,Vietnamese Restaurant,37.802062,-122.269573,4c8b16ec52a98cfad73533e9,Oakland,72
11,Beauty’s Bagel Shop,Bagel Shop,37.806082,-122.268356,5bd0959cf1fdaf002ce03e11,Oakland,22
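
The loop above indexes straight into `result['response']['likes']['count']`, so a single response without that key would stop the whole run. A defensive variant might look like the sketch below. The helper name `extract_like_count` and both sample payloads are assumptions for illustration; in particular, the shape of the failed response is invented and is not taken from the Foursquare documentation.

```python
def extract_like_count(payload):
    """Return the like count from a venue 'likes' response, or None if it is missing."""
    try:
        return payload["response"]["likes"]["count"]
    except (KeyError, TypeError):
        return None


# Payload shaped like the structure the loop above relies on
ok_payload = {"response": {"likes": {"count": 77}}}
# Hypothetical payload for a request that did not return like data
failed_payload = {"meta": {"code": 429}, "response": {}}

like_counts = [extract_like_count(p) for p in (ok_payload, failed_payload)]
```

Rows that come back as `None` can then be retried or dropped explicitly instead of raising a `KeyError` partway through the loop.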


In [11]:
nearby_venues.count()

name          124
categories    124
lat           124
lng           124
id            124
city          124
likes         124
dtype: int64

In [12]:
# this is really the raw dataset now, so let us give it a more appropriate name
# (.copy() makes it an independent dataframe rather than a view of the filtered slice)

raw_dataset = nearby_venues.copy()
raw_dataset.head()

index,name,categories,lat,lng,id,city,likes
1,Golden Lotus Vegetarian Restaurant,Vegetarian / Vegan Restaurant,37.80329,-122.270473,49cebb1bf964a520785a1fe3,Oakland,77
6,Abura-Ya,Japanese Restaurant,37.805959,-122.267693,539a69a7498ee67090b2b285,Oakland,64
8,Nature Vegetarian Restaurant,Vegetarian / Vegan Restaurant,37.802157,-122.270983,4f52deb2e4b0ac6d0c91df05,Oakland,33
10,Tay Ho Restaurant & Bar,Vietnamese Restaurant,37.802062,-122.269573,4c8b16ec52a98cfad73533e9,Oakland,72
11,Beauty’s Bagel Shop,Bagel Shop,37.806082,-122.268356,5bd0959cf1fdaf002ce03e11,Oakland,22


## Data Preparation for Machine Learning:

In [13]:

# Inspecting the raw dataset shows that there may be too many different types of cuisine

raw_dataset['categories'].unique().tolist()

['Vegetarian / Vegan Restaurant',
 'Japanese Restaurant',
 'Vietnamese Restaurant',
 'Bagel Shop',
 'American Restaurant',
 'Seafood Restaurant',
 'Kitchen Supply Store',
 'Mexican Restaurant',
 'Brazilian Restaurant',
 'Afghan Restaurant',
 'Chinese Restaurant',
 'Caribbean Restaurant',
 'Salad Place',
 'Sausage Shop',
 'Taco Place',
 'Sandwich Place',
 'Burger Joint',
 'New American Restaurant',
 'Rental Car Location',
 'Tapas Restaurant',
 'Burmese Restaurant',
 'Indian Restaurant',
 'Cambodian Restaurant',
 'Falafel Restaurant',
 'ATM',
 'Southern / Soul Food Restaurant',
 'Breakfast Spot',
 'Jazz Club',
 'Flower Shop',
 'Pizza Place',
 'Mediterranean Restaurant',
 'Diner',
 'Video Store',
 'Scandinavian Restaurant',
 'Wings Joint',
 "Men's Store",
 'Bank',
 'Asian Restaurant',
 'Burrito Place',
 'Fast Food Restaurant',
 'Optical Shop',
 'Sushi Restaurant',
 'Fried Chicken Joint',
 'Recreation Center',
 'Intersection',
 'Trail',
 'French Restaurant',
 'Fondue Restaurant',
 'Theme R

In [14]:

# we can group some cuisines together to make a better categorical variable.  We could then sub-categorize after gleaning information from the broad categories.

european = ['Mediterranean Restaurant', 'Scandinavian Restaurant', 'Pizza Place',
       'French Restaurant', 'Falafel Restaurant', 'Italian Restaurant',
       'Turkish Restaurant']

latin = ['Mexican Restaurant', 'Taco Place', 'Brazilian Restaurant', 
          'Burrito Place']

asian = ['Japanese Restaurant', 'Vietnamese Restaurant', 'Chinese Restaurant',
         'Hot Dog Joint', 'Hotpot Restaurant', 'Indian Restaurant',
         'Thai Restaurant', 'Dumpling Restaurant', 'Dim Sum Restaurant',
         'Asian Restaurant', 'Filipino Restaurant', 'Sushi Restaurant',
         'Ramen Restaurant']

american = ['Vegetarian / Vegan Restaurant', 'Seafood Restaurant', 'Caribbean Restaurant',
           'Burger Joint', 'American Restaurant', 'New American Restaurant',
            'Southern / Soul Food Restaurant', 'Diner']

casual = ['Bagel Shop', 'Sandwich Place', 'Fried Chicken Joint', 
          'Breakfast Spot', 'Wings Joint', 'Fast Food Restaurant',
          'Theme Restaurant']

def conditions(s):
    if s['categories'] in european:
        return 'european'
    if s['categories'] in latin:
        return 'latin'
    if s['categories'] in asian:
        return 'asian'
    if s['categories'] in american:
        return 'american'
    if s['categories'] in casual:
        return 'casual'

raw_dataset['categories_classified'] = raw_dataset.apply(conditions, axis=1)
raw_dataset

index,name,categories,lat,lng,id,city,likes,categories_classified
1,Golden Lotus Vegetarian Restaurant,Vegetarian / Vegan Restaurant,37.80329,-122.270473,49cebb1bf964a520785a1fe3,Oakland,77,american
6,Abura-Ya,Japanese Restaurant,37.805959,-122.267693,539a69a7498ee67090b2b285,Oakland,64,asian
8,Nature Vegetarian Restaurant,Vegetarian / Vegan Restaurant,37.802157,-122.270983,4f52deb2e4b0ac6d0c91df05,Oakland,33,american
10,Tay Ho Restaurant & Bar,Vietnamese Restaurant,37.802062,-122.269573,4c8b16ec52a98cfad73533e9,Oakland,72,asian
11,Beauty’s Bagel Shop,Bagel Shop,37.806082,-122.268356,5bd0959cf1fdaf002ce03e11,Oakland,22,casual
12,Catered To You,American Restaurant,37.807046,-122.270191,4be210911dd22d7f354493bd,Oakland,7,american
13,The Cook And Her Farmer,Seafood Restaurant,37.801583,-122.27486,5376a879498e8eeb2402cd71,Oakland,63,american
14,Binh Minh Quan,Vietnamese Restaurant,37.80202,-122.269396,4a75022cf964a52043e01fe3,Oakland,50,asian
16,Umami Mart,Kitchen Supply Store,37.800483,-122.273744,501c7838e4b08947b4df755e,Oakland,115,
19,Cosecha,Mexican Restaurant,37.801607,-122.274889,4e179c7752b123a586cef176,Oakland,199,latin
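
The chain of `if` statements in `conditions` can also be expressed as a lookup table, which makes unmatched categories explicit. The sketch below is a minimal illustration, assuming abbreviated versions of the cuisine lists; the names `CUISINE_GROUPS`, `CATEGORY_TO_GROUP`, and `classify` are introduced here and are not part of the notebook.

```python
# Abbreviated versions of the cuisine lists defined above
CUISINE_GROUPS = {
    "european": ["Mediterranean Restaurant", "French Restaurant", "Pizza Place"],
    "latin": ["Mexican Restaurant", "Taco Place", "Burrito Place"],
    "asian": ["Japanese Restaurant", "Vietnamese Restaurant", "Sushi Restaurant"],
    "american": ["Vegetarian / Vegan Restaurant", "Seafood Restaurant", "Diner"],
    "casual": ["Bagel Shop", "Sandwich Place", "Wings Joint"],
}

# Invert the groups into a category -> group lookup
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in CUISINE_GROUPS.items()
    for category in categories
}


def classify(category):
    """Return the cuisine group for a category, or None when it is unmapped."""
    return CATEGORY_TO_GROUP.get(category)


groups = [classify(c) for c in ("Japanese Restaurant", "Bagel Shop", "Kitchen Supply Store")]
```

With pandas, `raw_dataset['categories'].map(CATEGORY_TO_GROUP)` would produce the same column, leaving unmapped categories such as 'Kitchen Supply Store' (Umami Mart in the table above) as missing values.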


In [15]:

# double check to make sure categories_classified has been created correctly

pd.crosstab(index = raw_dataset["categories_classified"],
            columns="count")

categories_classified,count
american,26
asian,23
casual,17
european,21
latin,14


In [16]:
raw_dataset['likes'].mean()

66.55645161290323

In [17]:
# create a function to bin for us

def rankings(df):
    
    if df['likes'] <= 60:
        return 3
    
    elif df['likes'] <= 100:
        return 2
    
    elif df['likes'] > 100:
        return 1

In [18]:
# apply rankings function to dataset

raw_dataset['ranking'] = raw_dataset.apply(rankings, axis=1)
raw_dataset

index,name,categories,lat,lng,id,city,likes,categories_classified,ranking
1,Golden Lotus Vegetarian Restaurant,Vegetarian / Vegan Restaurant,37.80329,-122.270473,49cebb1bf964a520785a1fe3,Oakland,77,american,2
6,Abura-Ya,Japanese Restaurant,37.805959,-122.267693,539a69a7498ee67090b2b285,Oakland,64,asian,2
8,Nature Vegetarian Restaurant,Vegetarian / Vegan Restaurant,37.802157,-122.270983,4f52deb2e4b0ac6d0c91df05,Oakland,33,american,3
10,Tay Ho Restaurant & Bar,Vietnamese Restaurant,37.802062,-122.269573,4c8b16ec52a98cfad73533e9,Oakland,72,asian,2
11,Beauty’s Bagel Shop,Bagel Shop,37.806082,-122.268356,5bd0959cf1fdaf002ce03e11,Oakland,22,casual,3
12,Catered To You,American Restaurant,37.807046,-122.270191,4be210911dd22d7f354493bd,Oakland,7,american,3
13,The Cook And Her Farmer,Seafood Restaurant,37.801583,-122.27486,5376a879498e8eeb2402cd71,Oakland,63,american,2
14,Binh Minh Quan,Vietnamese Restaurant,37.80202,-122.269396,4a75022cf964a52043e01fe3,Oakland,50,asian,3
16,Umami Mart,Kitchen Supply Store,37.800483,-122.273744,501c7838e4b08947b4df755e,Oakland,115,,1
19,Cosecha,Mexican Restaurant,37.801607,-122.274889,4e179c7752b123a586cef176,Oakland,199,latin,1
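
The same three bins can be written as a table of bin edges, which makes it easier to try other cut points. This is a sketch using the standard-library `bisect` module; the names `UPPER_EDGES`, `RANKS`, and `ranking_from_likes` are assumptions introduced for illustration.

```python
from bisect import bisect_left

UPPER_EDGES = [60, 100]  # inclusive upper bounds of ranking 3 and ranking 2
RANKS = [3, 2, 1]        # ranking 1 is the bin with more than 100 likes


def ranking_from_likes(likes):
    """Map a like count to the ranking used by the rankings function above."""
    return RANKS[bisect_left(UPPER_EDGES, likes)]


# Like counts of the ten rows shown in the table above
likes_sample = [77, 64, 33, 72, 22, 7, 63, 50, 115, 199]
rankings_sample = [ranking_from_likes(n) for n in likes_sample]
```

`bisect_left` places a value equal to an edge in the lower bin, which matches the `<=` comparisons in `rankings`.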


## Machine Learning and Linear Regression:

In [20]:
# create dummies for linear regression modelling

# one hot encoding
reg_dataset = pd.get_dummies(raw_dataset[['categories_classified', 
                                          'city',]], 
                               prefix="", 
                               prefix_sep="")

# add name, ranking, and likes columns back to dataframe
reg_dataset['ranking'] = raw_dataset['ranking']
reg_dataset['likes'] = raw_dataset['likes']
reg_dataset['name'] = raw_dataset['name']

# move name column to the first column
reg_columns = [reg_dataset.columns[-1]] + list(reg_dataset.columns[:-1])
reg_dataset = reg_dataset[reg_columns]


reg_dataset.head()

index,name,american,asian,casual,european,latin,Emeryville,Oakland,San Diego,ranking,likes
1,Golden Lotus Vegetarian Restaurant,1,0,0,0,0,0,1,0,2,77
6,Abura-Ya,0,1,0,0,0,0,1,0,2,64
8,Nature Vegetarian Restaurant,1,0,0,0,0,0,1,0,3,33
10,Tay Ho Restaurant & Bar,0,1,0,0,0,0,1,0,2,72
11,Beauty’s Bagel Shop,0,0,1,0,0,0,1,0,3,22


In [21]:
# Multiple Linear Regression

msk = np.random.rand(len(reg_dataset)) < 0.8
train = reg_dataset[msk]
test = reg_dataset[~msk]

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['american', 'asian', 'casual',
                         'european', 'latin', 'Oakland', 
                         'Emeryville', 'San Diego']])

y = np.asanyarray(train[['likes']])
regr.fit (x, y)

# The coefficients

print ('Coefficients: ', regr.coef_)

Coefficients:  [[ 53.2504383   -6.99860452 -14.28940242  -5.68454468  40.92777102
   -0.59106254 -11.14361164  11.73467418]]
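
The mask above is drawn without a seed, so the split, the coefficients, and every score change from run to run. A seeded generator makes the split repeatable. The sketch below assumes NumPy's `default_rng`; the function name `split_mask` and the seed value 42 are arbitrary choices made for illustration, and using them would not reproduce the numbers recorded in this notebook.

```python
import numpy as np


def split_mask(n_rows, train_fraction=0.8, seed=42):
    """Return a boolean mask that marks roughly train_fraction of the rows as training rows."""
    rng = np.random.default_rng(seed)
    return rng.random(n_rows) < train_fraction


n_rows = 124  # row count reported by nearby_venues.count()
msk = split_mask(n_rows)
train_idx = np.flatnonzero(msk)   # positions of the training rows
test_idx = np.flatnonzero(~msk)   # positions of the held-out rows
```

Because each row is assigned independently, the training share is only approximately 80%; the classification report later in the notebook shows 27 of the 124 rows in the test set.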


In [22]:
# Multiple Linear Regression Prediction Capabilities

x = np.asanyarray(test[['american', 'asian', 'casual',
                         'european', 'latin', 'Oakland', 
                         'Emeryville', 'San Diego']])

y = np.asanyarray(test[['likes']])

y_hat = regr.predict(x)

# np.mean of the squared errors is the mean squared error, not a sum of squares
print("Mean squared error: %.2f"
      % np.mean((y_hat - y) ** 2))

# regr.score returns R^2: 1 is perfect prediction
print('Variance score (R^2): %.2f' % regr.score(x, y))

Mean squared error: 7659.26
Variance score (R^2): 0.05
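
The first number printed above is an average of squared errors and the second is the R² value returned by `regr.score`. To make the difference between a residual sum of squares, a mean squared error, and R² concrete, here is a small worked sketch in plain Python. The function `regression_metrics` and the constant predictions are hypothetical; only the five like counts are taken from the table earlier in the notebook.

```python
def regression_metrics(y_true, y_pred):
    """Compute the residual sum of squares, mean squared error, and R^2."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return {"rss": rss, "mse": rss / n, "r2": 1 - rss / tss}


y_true = [77, 64, 33, 72, 22]  # first five like counts from the table above
y_pred = [60, 60, 60, 60, 60]  # hypothetical constant predictions
metrics = regression_metrics(y_true, y_pred)
```

The mean squared error is the residual sum of squares divided by the number of rows, and R² falls below zero whenever the predictions are worse than always predicting the mean of `y_true`.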


## Machine Learning and Logistic Regression:

In [23]:
# Multinomial Logistic Regression on the ordinal 'ranking' classes
# (LogisticRegression treats the classes as unordered)

x_train = np.asanyarray(train[['american', 'asian', 'casual',
                         'european', 'latin', 'Oakland', 
                         'Emeryville', 'San Diego']])

y_train = np.asanyarray(train['ranking'])

x_test = np.asanyarray(test[['american', 'asian', 'casual',
                         'european', 'latin', 'Oakland', 
                         'Emeryville', 'San Diego']])

y_test = np.asanyarray(test['ranking'])


mul_ordinal = linear_model.LogisticRegression(multi_class='multinomial',
                                              solver='newton-cg',
                                              fit_intercept=True).fit(x_train,
                                                                      y_train)

mul_ordinal

# coef_[0] holds the coefficients for the first class, ranking 1 (more than 100 likes)
coef = mul_ordinal.coef_[0]
print (coef)

[ 0.39210719 -0.55559261 -0.34088283  0.05179413  0.02554497 -0.01118164
 -0.32049529  0.33167705]


In [24]:
# Multinomial Logistic Regression Prediction Capabilities

yhat = mul_ordinal.predict(x_test)
yhat

yhat_prob = mul_ordinal.predict_proba(x_test)
yhat_prob


# average = None, average = 'micro', average = 'macro', or average = 'weighted'
jaccard_score(y_test, yhat, average='weighted')

0.554074074074074

In [25]:
log_loss(y_test, yhat_prob)

0.8313247034719107

In [26]:
# Exploration of Coefficient Magnitudes of Full Dataset

x_all = np.asanyarray(reg_dataset[['american', 'asian', 'casual',
                                   'european', 'latin', 'Oakland', 
                                   'Emeryville', 'San Diego']])

y_all = np.asanyarray(reg_dataset['ranking'])



LR = linear_model.LogisticRegression(multi_class='multinomial',
                                            solver='newton-cg',
                                            fit_intercept=True).fit(x_all,
                                                                    y_all)

LR

# coef_[0] holds the coefficients for the first class, ranking 1 (more than 100 likes)
coef = LR.coef_[0]
print(coef)

[ 0.4403209  -0.50643379 -0.21386951 -0.01138729  0.23092498  0.02776571
 -0.37434754  0.34659341]


In [27]:
print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

           1       1.00      0.40      0.57         5
           2       0.00      0.00      0.00         4
           3       0.72      1.00      0.84        18

    accuracy                           0.74        27
   macro avg       0.57      0.47      0.47        27
weighted avg       0.67      0.74      0.66        27
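
The weighted Jaccard score of 0.554 reported earlier can be rebuilt by hand from this report. The confusion counts below are inferred from the precision, recall, and support columns rather than printed by the notebook, so they should be read as a reconstruction. The names `confusion`, `jaccard_for_class`, and `weighted_jaccard` are introduced here for illustration.

```python
# Inferred counts: outer key = true class, inner key = predicted class
confusion = {
    1: {1: 2, 2: 0, 3: 3},   # recall 0.40 on 5 rows, no false positives
    2: {1: 0, 2: 0, 3: 4},   # class 2 is never predicted
    3: {1: 0, 2: 0, 3: 18},  # recall 1.00 on 18 rows
}


def jaccard_for_class(confusion, cls):
    """Jaccard index of one class: TP / (TP + FP + FN)."""
    tp = confusion[cls][cls]
    fn = sum(confusion[cls].values()) - tp
    fp = sum(confusion[other][cls] for other in confusion if other != cls)
    return tp / (tp + fp + fn)


# Support is the number of true rows per class
support = {cls: sum(row.values()) for cls, row in confusion.items()}
weighted_jaccard = (
    sum(support[cls] * jaccard_for_class(confusion, cls) for cls in confusion)
    / sum(support.values())
)
```

The per-class values are 0.40, 0.00, and 0.72, and weighting them by support (5, 4, and 18 rows) gives roughly 0.554. Most of that score comes from class 3, which makes up two thirds of the test set.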




## Results:

In this project, we created a linear regression model, trained it on a random subsample of roughly 80% of the data, and set aside the remaining 20% for testing. To evaluate whether the model was reasonable, the variance score (R²) and the mean squared error of the test predictions were calculated. The variance score of 0.05 is low, which suggests that linear regression is not a good way of modeling our data. We then moved on to logistic regression for our analysis.

A multinomial logistic regression model was then trained on the same random subsample of roughly 80% and tested on the remaining 20%. The Jaccard score and log-loss were calculated. Although the prediction is not ideal, the weighted Jaccard score of about 0.55 is plausible considering that we did not account for factors such as the skill of each restaurateur. The classification report is included in the analysis.

Given the modest predictive capability of this model, we also fit it on the complete dataset to examine the coefficients. The coefficients shown are those for the top class (more than 100 likes). They suggest that opening a restaurant in Emeryville, or serving Asian or casual cuisine, is negatively associated with 'likes'. For future analysis (perhaps another project), it would be interesting to correlate these lower like counts with the relative saturation of similar restaurants. Perhaps patrons are more selective, for instance, if the area has a high concentration of Asian or casual restaurants.


## Discussion:

Based on our data, logistic regression appears to fit the data better than linear regression. With logistic regression we obtained a weighted Jaccard score of about 55%. While this is not ideal, in part because of unaccounted variables (such as the skill of the restaurateur), it is more reasonable than the low variance score obtained from the linear regression. For the purposes of this project, we assume that likes are a good proxy for how well a new restaurant will do in terms of brand and image and, by extension, how well it will perform as a business. Whether these assumptions hold in a real-life scenario is beyond the scope of this project, which provides a starting point and is limited in scope because of the extensive amount of data that could be retrieved from the Foursquare API.

In the context of our intended audience, an entrepreneur, we want to draw insight from this data. We can proceed by breaking down the results of the logistic regression model. The classification report shows precision of 100%, 0%, and 72% and recall of 40%, 0%, and 100% for classes 1, 2, and 3 (highest, medium, lowest). The model is therefore better at predicting whether a restaurant will fall into the best or worst class of likes than the middle one, although it identified only 40% of the restaurants in the best class and the test set contained just 27 restaurants. This is useful, as we are mostly concerned with whether the restaurant will perform well or not, so better predictions for the two extremes are a desired feature. This allows a rough prediction of the general performance of the business opportunity. Different binning methods for the classes were attempted (and are not shown here in the interest of brevity), but the use of 3 bins yielded by far the best Jaccard score.

Our intended audience is not just attempting to predict general business performance; they are also interested in insight that informs the best business strategy. In this case, strategic insight can be gleaned from the coefficient values obtained by running the logistic regression on the full dataset. We can see that opening a restaurant in Emeryville, or serving cuisine that is Asian or casual in nature, is negatively associated with "likes." This suggests that opening a restaurant in either Oakland or San Diego, with a cuisine that is European, Latin, or American in nature, would be the best approach for maximizing likes, although the European coefficient is close to zero and the Oakland coefficient is only slightly positive.

## Conclusion:

By analyzing the 'likes' of the 124 venues that remained after cleaning the 300 retrieved for select California cities, we can reasonably conclude that the best approach for maximizing business performance (measured strictly by 'likes') is to open a restaurant that is European, Latin, or American. The data also suggest that opening the venue in Oakland or San Diego is better than Emeryville. We have also determined that the logistic regression model was better at classifying whether a restaurant fell in the best or worst class than in the middle class when the data was binned into 3 classes.