# IBM Applied Data Science Capstone

## Introduction

For this final project, I explore restaurant locations in the state of New York, specifically in the cities of Buffalo, Ithaca, and Albany, to gain better insight into restaurant investment opportunities. I examine the types of cuisine and their popularity, and what each city should concentrate on when surveying potential locations.

The question addressed is how much popularity a new restaurant opening in each city can expect, based on the type of food served. Popularity is measured by how many "likes" restaurants in each city have received on Foursquare. I use machine learning, comparing linear and logistic regression to see which method has better predictive capability.

## Data

In this project, I focus on venues within a 500-meter radius (the `radius` value passed to the API) of the city coordinates returned by the Nominatim geocoder. The Foursquare API returns more venue types than necessary, so I remove non-restaurant rows to focus the data. I also analyze and compare the 'likes' data in order to draw the final conclusions.

I start from the geographical coordinates of the three cities mentioned above. I use the Foursquare API to obtain raw data for each venue: its name, latitude and longitude, category/type, and the city it belongs to.

## Methodology

For this final project, I use linear and logistic regression to train and test on the data. Linear regression is used to predict the number of likes a restaurant might gain when opening in each city. I use scikit-learn for this purpose.

Logistic regression is used to classify restaurants into popularity rankings (binned 'likes') based on cuisine type and city. Because there are three ranking classes, the model needs to be multinomial. The rankings are ordinal in nature, but scikit-learn's `LogisticRegression` treats them as unordered classes.

### Libraries

In [3]:
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np 
import json
import requests
# json_normalize is called as pd.json_normalize in the cells below
import matplotlib.cm as cm
import matplotlib.colors as colors
!pip install folium
import matplotlib.pyplot as plt
import pylab as pl
import itertools
import warnings
warnings.filterwarnings('ignore')

from urllib.request import urlopen
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim 
from sklearn import linear_model
from sklearn.metrics import jaccard_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import log_loss
from sklearn.metrics import mean_squared_error, r2_score

print('Libraries successfully imported!')

Libraries successfully imported!


### Retrieving City Coordinates

In [4]:
address1 = 'Buffalo, New York'

geolocator = Nominatim(user_agent="foursquare_agent")
location1 = geolocator.geocode(address1)
latitude1 = location1.latitude
longitude1 = location1.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address1, latitude1, longitude1))

address2 = 'Ithaca, New York'

geolocator = Nominatim(user_agent="foursquare_agent")
location2 = geolocator.geocode(address2)
latitude2 = location2.latitude
longitude2 = location2.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address2, latitude2, longitude2))

address3 = 'Albany, New York'

geolocator = Nominatim(user_agent="foursquare_agent")
location3 = geolocator.geocode(address3)
latitude3 = location3.latitude
longitude3 = location3.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address3, latitude3, longitude3))

The geographical coordinates of Buffalo, New York are 42.8867166, -78.8783922.
The geographical coordinates of Ithaca, New York are 42.4396039, -76.4968019.
The geographical coordinates of Albany, New York are 42.6511674, -73.754968.


### Foursquare Credentials

In [3]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '' # Foursquare API version


LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URLs

url1 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude1, 
    longitude1, 
    radius, 
    LIMIT)


url2 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude2, 
    longitude2, 
    radius, 
    LIMIT)


url3 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude3, 
    longitude3, 
    radius, 
    LIMIT)
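The three URL blocks above differ only in the coordinates they pass. A minimal sketch of the same step as one helper, assuming the same endpoint and parameter order; `build_explore_url`, `city_coords`, and the placeholder credentials and version string are hypothetical names and values introduced for illustration:

```python
# Same URL template as in the cell above
EXPLORE_URL = ('https://api.foursquare.com/v2/venues/explore?&client_id={}'
               '&client_secret={}&v={}&ll={},{}&radius={}&limit={}')


def build_explore_url(lat, lng, client_id, client_secret, version,
                      radius=500, limit=100):
    """Return the venues/explore URL for one pair of coordinates."""
    return EXPLORE_URL.format(client_id, client_secret, version,
                              lat, lng, radius, limit)


# Coordinates as printed by the geocoding cell
city_coords = {
    'Buffalo': (42.8867166, -78.8783922),
    'Ithaca': (42.4396039, -76.4968019),
    'Albany': (42.6511674, -73.754968),
}

# 'ID', 'SECRET' and '20200101' are placeholders, not real credentials
urls = {city: build_explore_url(lat, lng, 'ID', 'SECRET', '20200101')
        for city, (lat, lng) in city_coords.items()}

for city, url in urls.items():
    print(city, url)
```

Keeping the cities in a dictionary means a fourth city could be added without copying another block.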

### Data Analysis

In [4]:
results1 = requests.get(url1).json()
results1

results2 = requests.get(url2).json()
results2

results3 = requests.get(url3).json()
results3

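def get_category_type(row):
    # the category list is stored under 'categories' or 'venue.categories',
    # depending on how the row was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']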
    

# Buffalo, NY 

venues1 = results1['response']['groups'][0]['items']
nearby_venues1 = pd.json_normalize(venues1) # flatten JSON

# filter columns
filtered_columns1 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues1 = nearby_venues1.loc[:, filtered_columns1]

# filter the category for each row
nearby_venues1['venue.categories'] = nearby_venues1.apply(get_category_type, axis=1)

# clean columns
nearby_venues1.columns = [col.split(".")[-1] for col in nearby_venues1.columns]

# Ithaca, NY

venues2 = results2['response']['groups'][0]['items']
nearby_venues2 = pd.json_normalize(venues2) # flatten JSON

# filter columns
filtered_columns2 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues2 = nearby_venues2.loc[:, filtered_columns2]

# filter the category for each row
nearby_venues2['venue.categories'] = nearby_venues2.apply(get_category_type, axis=1)

# clean columns
nearby_venues2.columns = [col.split(".")[-1] for col in nearby_venues2.columns]

# Albany, NY

venues3 = results3['response']['groups'][0]['items']
nearby_venues3 = pd.json_normalize(venues3) # flatten JSON

# filter columns
filtered_columns3 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues3 = nearby_venues3.loc[:, filtered_columns3]

# filter the category for each row
nearby_venues3['venue.categories'] = nearby_venues3.apply(get_category_type, axis=1)

# clean columns
nearby_venues3.columns = [col.split(".")[-1] for col in nearby_venues3.columns]


print('{} venues for Buffalo, New York were returned by Foursquare.'.format(nearby_venues1.shape[0]))
print()
print('{} venues for Ithaca, New York were returned by Foursquare.'.format(nearby_venues2.shape[0]))
print()
print('{} venues for Albany, New York were returned by Foursquare.'.format(nearby_venues3.shape[0]))

38 venues for Buffalo, New York were returned by Foursquare.

51 venues for Ithaca, New York were returned by Foursquare.

41 venues for Albany, New York were returned by Foursquare.
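The flattening, column filtering, and renaming steps are repeated once per city. A sketch of how they could be folded into one function, assuming the response layout used above (`response` → `groups` → `items`); `venues_to_frame` and the two-venue `sample_response` are hypothetical and exist only to make the example runnable offline:

```python
import pandas as pd

COLUMNS = ['venue.name', 'venue.categories', 'venue.location.lat',
           'venue.location.lng', 'venue.id']


def venues_to_frame(results, city):
    """Flatten one explore response into a tidy DataFrame for a city."""
    items = results['response']['groups'][0]['items']
    df = pd.json_normalize(items).loc[:, COLUMNS].copy()
    # keep only the name of the first category, or None if there is none
    df['venue.categories'] = df['venue.categories'].apply(
        lambda cats: cats[0]['name'] if cats else None)
    # 'venue.location.lat' -> 'lat', 'venue.name' -> 'name', ...
    df.columns = [col.split('.')[-1] for col in df.columns]
    df['city'] = city
    return df


# Made-up response with the same nesting as the API output
sample_response = {'response': {'groups': [{'items': [
    {'venue': {'id': 'a1', 'name': 'Example Pizza',
               'categories': [{'name': 'Pizza Place'}],
               'location': {'lat': 42.0, 'lng': -78.0}}},
    {'venue': {'id': 'b2', 'name': 'Example Venue',
               'categories': [],
               'location': {'lat': 42.1, 'lng': -78.1}}},
]}]}}

sample_frame = venues_to_frame(sample_response, 'Buffalo')
print(sample_frame)
```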


In [5]:
# add location data to the data sets of each respective city

nearby_venues1['city'] = 'Buffalo'
nearby_venues2['city'] = 'Ithaca'
nearby_venues3['city'] = 'Albany'

In [6]:
# combine the three cities into one data set

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
nearby_venues = pd.concat([nearby_venues1, nearby_venues2, nearby_venues3])

In [7]:
nearby_venues

| | name | categories | lat | lng | id | city |
|---|---|---|---|---|---|---|
| 0 | Osteria 166 | Trattoria/Osteria | 42.887629 | -78.876464 | 51af66fe498e8ab4596a02c8 | Buffalo |
| 1 | Statler City | Event Space | 42.887706 | -78.877718 | 4ce9872de888f04de784486b | Buffalo |
| 2 | Niagara Square | Plaza | 42.886556 | -78.878107 | 4c3cde9a7c1ee21e432a8d71 | Buffalo |
| 3 | New Era Flagship Store: Buffalo | Boutique | 42.888891 | -78.877346 | 4bb4ea62b1edef3b8c4b2cdd | Buffalo |
| 4 | Buffalo City Hall Observation Deck | Scenic Lookout | 42.886657 | -78.879279 | 4df8ef77e4cd2129701bac30 | Buffalo |
| 5 | Public Espresso + Coffee | Coffee Shop | 42.884952 | -78.873359 | 54f767b1498ea61cbaa0f998 | Buffalo |
| 6 | Hotel @ The Lafayette | Hotel | 42.885041 | -78.873417 | 4be6f3bbbcef2d7f76a405e5 | Buffalo |
| 7 | Hampton Inn & Suites | Hotel | 42.890419 | -78.877083 | 4b3a81a5f964a520e76825e3 | Buffalo |
| 8 | JJ's Casa di Pizza | Pizza Place | 42.886884 | -78.873355 | 55ccd10a498edd526dac99b1 | Buffalo |
| 9 | Lucky Day Whiskey Bar | Whisky Bar | 42.888448 | -78.87486 | 59428ab079f6c73dd779f2bb | Buffalo |


In [8]:
nearby_venues1['categories'].unique()

array(['Trattoria/Osteria', 'Event Space', 'Plaza', 'Boutique',
       'Scenic Lookout', 'Coffee Shop', 'Hotel', 'Pizza Place',
       'Whisky Bar', 'Steakhouse', 'Farmers Market', 'Art Gallery',
       'Hot Dog Joint', 'Greek Restaurant', 'Gym', 'Sports Bar',
       'Wine Bar', 'Mediterranean Restaurant', 'Cocktail Bar',
       'Restaurant', 'Office', 'Beer Bar', 'Bar', 'Sandwich Place',
       'Brewery', 'French Restaurant', 'Pub', 'American Restaurant',
       'Hotel Pool'], dtype=object)

In [9]:
# check list and manually remove all non-restaurant and irrelevant data

nearby_venues['categories'].unique()

removal_list = ['Clothing Store','Bar','Brewery', 
                'Comic Shop', 'Yoga Studio','Café', 
                'Coffee Shop', 'Tiki Bar', 'Music Venue', 
                'Wine Bar',  'Cocktail Bar', 'Dance Studio', 
                'Gym / Fitness Center','Beer Bar', 
                'Bubble Tea Shop', 'Nightclub', 'Food Court', 
                'Ice Cream Shop', 'Cupcake Shop', 'Skating Rink', 
                'Dessert Shop', 'Climbing Gym', 'Bakery', 
                'Farmers Market', 'Gay Bar','Beer Garden',
                'Tea Room','Arts & Crafts Store', 'Grocery Store', 
                'Sports Bar', 'Museum', 'Street Food Gathering', 
                'Library', 'Skate Park', 'Movie Theater','Park', 
                'Gym', 'Stadium', 'Furniture / Home Store', 'Discount Store', 
                'Playground', 'Cosmetics Shop', 'Casino', 
                'Pet Store','Electronics Store', 'Snack Place',
                'Salon / Barbershop', 'Shopping Plaza', 'Deli / Bodega', 
                'Candy Store', 'Liquor Store', 'Hotel', 
                'Shoe Store', 'Bookstore', 'Shopping Mall', 
                'Dive Bar', 'Video Game Store', 'Pharmacy', 
                'Accessories Store', 'Lingerie Store', 'Mobile Phone Shop', 
                'Pool Hall', 'Juice Bar', 'Kids Store', 
                'Supplement Shop', 'Big Box Store', 'Mattress Store', 
                'Hardware Store', 'Paper / Office Supplies Store', 'Theater', 
                'Business Service', 'Donut Shop', 'Beer Store', 
                'Lounge', 'Health Food Store', 'Pedestrian Plaza', 
                'Hookah Bar', 'Concert Hall', 'Chocolate Shop', 
                'Hostel', 'Convenience Store', 'Pub', 
                'Plaza', 'Comedy Club', 'Speakeasy', 
                'Tattoo Parlor', 'Massage Studio']

nearby_venues = nearby_venues[~nearby_venues['categories'].isin(removal_list)]

nearby_venues['categories'].unique().tolist()

['Trattoria/Osteria',
 'Event Space',
 'Boutique',
 'Scenic Lookout',
 'Pizza Place',
 'Whisky Bar',
 'Steakhouse',
 'Art Gallery',
 'Hot Dog Joint',
 'Greek Restaurant',
 'Mediterranean Restaurant',
 'Restaurant',
 'Office',
 'Sandwich Place',
 'French Restaurant',
 'American Restaurant',
 'Hotel Pool',
 'Breakfast Spot',
 'Latin American Restaurant',
 'Bagel Shop',
 'Tapas Restaurant',
 'Organic Grocery',
 'Asian Restaurant',
 'Wine Shop',
 'Hotel Bar',
 'New American Restaurant',
 'Mexican Restaurant',
 'Bed & Breakfast',
 'Gourmet Shop',
 'Thai Restaurant',
 'Korean Restaurant',
 'Vegetarian / Vegan Restaurant',
 'Ethiopian Restaurant',
 'Record Shop',
 'Trail',
 'Taco Place',
 'Bistro',
 'Indian Restaurant',
 'Bus Stop',
 'English Restaurant',
 'Performing Arts Venue',
 'Hockey Arena',
 'Food Truck',
 'Italian Restaurant',
 'Bank',
 'Rental Car Location',
 'Burger Joint',
 'Food']
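The list above still contains non-restaurant categories such as 'Event Space', 'Bank', and 'Bus Stop', because a removal list only drops what it names. One alternative is an allow-list that keeps a row only when its category looks like a food venue. This is a sketch under assumptions: `keep_restaurants` and the `EXTRA_FOOD` set are hypothetical, and which categories count as restaurants is a judgment call.

```python
import pandas as pd

# Assumed food categories that do not contain the word "Restaurant"
EXTRA_FOOD = {'Pizza Place', 'Steakhouse', 'Hot Dog Joint', 'Sandwich Place',
              'Breakfast Spot', 'Bagel Shop', 'Taco Place', 'Bistro',
              'Burger Joint', 'Food Truck', 'Trattoria/Osteria'}


def keep_restaurants(df):
    """Keep rows whose category names a restaurant or an allowed food venue."""
    cats = df['categories'].fillna('')
    is_food = cats.str.contains('Restaurant', regex=False) | cats.isin(EXTRA_FOOD)
    return df[is_food].copy()


# Three rows taken from the Buffalo output shown earlier
sample = pd.DataFrame({
    'name': ['Osteria 166', 'Statler City', "Taki's Restaurant"],
    'categories': ['Trattoria/Osteria', 'Event Space', 'Greek Restaurant'],
})

kept = keep_restaurants(sample)
print(kept)
```

With an allow-list, a category that has never been seen before is excluded by default instead of slipping through.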

### DataFrame

In [10]:
# pull the likes from the API based on venue ID

url_list = []
like_list = []
json_list = []

for i in list(nearby_venues.id):
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(i, CLIENT_ID, CLIENT_SECRET, VERSION)
    url_list.append(venue_url)
for link in url_list:
    result = requests.get(link).json()
    likes = result['response']['likes']['count']
    like_list.append(likes)
print(like_list)


nearby_venues['likes'] = like_list
nearby_venues.head()

[85, 76, 21, 3, 16, 19, 18, 19, 21, 7, 28, 7, 10, 20, 5, 24, 18, 16, 2, 4, 19, 76, 7, 49, 149, 68, 24, 26, 21, 10, 35, 14, 145, 27, 5, 20, 13, 25, 78, 89, 15, 48, 13, 43, 34, 24, 11, 57, 8, 35, 16, 6, 1, 14, 91, 44, 28, 18, 60, 218, 8, 24, 10, 25, 28, 1, 0, 8, 0, 2]


| | name | categories | lat | lng | id | city | likes |
|---|---|---|---|---|---|---|---|
| 0 | Osteria 166 | Trattoria/Osteria | 42.887629 | -78.876464 | 51af66fe498e8ab4596a02c8 | Buffalo | 85 |
| 1 | Statler City | Event Space | 42.887706 | -78.877718 | 4ce9872de888f04de784486b | Buffalo | 76 |
| 3 | New Era Flagship Store: Buffalo | Boutique | 42.888891 | -78.877346 | 4bb4ea62b1edef3b8c4b2cdd | Buffalo | 21 |
| 4 | Buffalo City Hall Observation Deck | Scenic Lookout | 42.886657 | -78.879279 | 4df8ef77e4cd2129701bac30 | Buffalo | 3 |
| 8 | JJ's Casa di Pizza | Pizza Place | 42.886884 | -78.873355 | 55ccd10a498edd526dac99b1 | Buffalo | 16 |


In [11]:
nearby_venues.count()

name          70
categories    70
lat           70
lng           70
id            70
city          70
likes         70
dtype: int64

In [12]:
# copy so that columns added later do not modify the filtered nearby_venues slice
raw_dataset = nearby_venues.copy()
raw_dataset.head()

| | name | categories | lat | lng | id | city | likes |
|---|---|---|---|---|---|---|---|
| 0 | Osteria 166 | Trattoria/Osteria | 42.887629 | -78.876464 | 51af66fe498e8ab4596a02c8 | Buffalo | 85 |
| 1 | Statler City | Event Space | 42.887706 | -78.877718 | 4ce9872de888f04de784486b | Buffalo | 76 |
| 3 | New Era Flagship Store: Buffalo | Boutique | 42.888891 | -78.877346 | 4bb4ea62b1edef3b8c4b2cdd | Buffalo | 21 |
| 4 | Buffalo City Hall Observation Deck | Scenic Lookout | 42.886657 | -78.879279 | 4df8ef77e4cd2129701bac30 | Buffalo | 3 |
| 8 | JJ's Casa di Pizza | Pizza Place | 42.886884 | -78.873355 | 55ccd10a498edd526dac99b1 | Buffalo | 16 |


### Preparing for Machine Learning

In [13]:
raw_dataset['categories'].unique().tolist()

['Trattoria/Osteria',
 'Event Space',
 'Boutique',
 'Scenic Lookout',
 'Pizza Place',
 'Whisky Bar',
 'Steakhouse',
 'Art Gallery',
 'Hot Dog Joint',
 'Greek Restaurant',
 'Mediterranean Restaurant',
 'Restaurant',
 'Office',
 'Sandwich Place',
 'French Restaurant',
 'American Restaurant',
 'Hotel Pool',
 'Breakfast Spot',
 'Latin American Restaurant',
 'Bagel Shop',
 'Tapas Restaurant',
 'Organic Grocery',
 'Asian Restaurant',
 'Wine Shop',
 'Hotel Bar',
 'New American Restaurant',
 'Mexican Restaurant',
 'Bed & Breakfast',
 'Gourmet Shop',
 'Thai Restaurant',
 'Korean Restaurant',
 'Vegetarian / Vegan Restaurant',
 'Ethiopian Restaurant',
 'Record Shop',
 'Trail',
 'Taco Place',
 'Bistro',
 'Indian Restaurant',
 'Bus Stop',
 'English Restaurant',
 'Performing Arts Venue',
 'Hockey Arena',
 'Food Truck',
 'Italian Restaurant',
 'Bank',
 'Rental Car Location',
 'Burger Joint',
 'Food']

In [14]:
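# grouping types of restaurants/cuisines together
# NOTE: categories that appear in none of the lists below (for example
# 'Greek Restaurant' and 'Steakhouse') are left unclassified by conditions().
# 'Hot Dog Joint' sits in the asian list, which looks like a misplacement;
# it is left in place here because the outputs shown were produced with it.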

european = ['Mediterranean Restaurant', 'Scandinavian Restaurant', 'Pizza Place',
       'French Restaurant', 'Falafel Restaurant', 'Italian Restaurant',
       'Turkish Restaurant']

latin = ['Mexican Restaurant', 'Taco Place', 'Brazilian Restaurant', 
          'Burrito Place']

asian = ['Japanese Restaurant', 'Vietnamese Restaurant', 'Chinese Restaurant',
         'Hot Dog Joint', 'Hotpot Restaurant', 'Indian Restaurant',
         'Thai Restaurant', 'Dumpling Restaurant', 'Dim Sum Restaurant',
         'Asian Restaurant', 'Filipino Restaurant', 'Sushi Restaurant',
         'Ramen Restaurant']

american = ['Vegetarian / Vegan Restaurant', 'Seafood Restaurant', 'Caribbean Restaurant',
           'Burger Joint', 'American Restaurant', 'New American Restaurant',
            'Southern / Soul Food Restaurant', 'Diner']

casual = ['Bagel Shop', 'Sandwich Place', 'Fried Chicken Joint', 
          'Breakfast Spot', 'Wings Joint', 'Fast Food Restaurant',
          'Theme Restaurant']

def conditions(s):
    if s['categories'] in european:
        return 'european'
    if s['categories'] in latin:
        return 'latin'
    if s['categories'] in asian:
        return 'asian'
    if s['categories'] in american:
        return 'american'
    if s['categories'] in casual:
        return 'casual'

raw_dataset['categories_classified'] = raw_dataset.apply(conditions, axis=1)
raw_dataset

| | name | categories | lat | lng | id | city | likes | categories_classified |
|---|---|---|---|---|---|---|---|---|
| 0 | Osteria 166 | Trattoria/Osteria | 42.887629 | -78.876464 | 51af66fe498e8ab4596a02c8 | Buffalo | 85 | |
| 1 | Statler City | Event Space | 42.887706 | -78.877718 | 4ce9872de888f04de784486b | Buffalo | 76 | |
| 3 | New Era Flagship Store: Buffalo | Boutique | 42.888891 | -78.877346 | 4bb4ea62b1edef3b8c4b2cdd | Buffalo | 21 | |
| 4 | Buffalo City Hall Observation Deck | Scenic Lookout | 42.886657 | -78.879279 | 4df8ef77e4cd2129701bac30 | Buffalo | 3 | |
| 8 | JJ's Casa di Pizza | Pizza Place | 42.886884 | -78.873355 | 55ccd10a498edd526dac99b1 | Buffalo | 16 | european |
| 9 | Lucky Day Whiskey Bar | Whisky Bar | 42.888448 | -78.87486 | 59428ab079f6c73dd779f2bb | Buffalo | 19 | |
| 11 | SEAR | Steakhouse | 42.889636 | -78.877703 | 581d47538d169071e52163e1 | Buffalo | 18 | |
| 13 | Western New York Book Arts Collaborative | Art Gallery | 42.887049 | -78.873117 | 4b636953f964a520ea772ae3 | Buffalo | 19 | |
| 14 | Ted's Hot Dogs | Hot Dog Joint | 42.890795 | -78.877662 | 564cc50b498e17660a87f54a | Buffalo | 21 | asian |
| 15 | Taki's Restaurant | Greek Restaurant | 42.886087 | -78.876165 | 4bc343784cdfc9b6450f9721 | Buffalo | 7 | |


In [15]:
pd.crosstab(index = raw_dataset["categories_classified"],
            columns="count")

| categories_classified | count |
|---|---|
| american | 10 |
| asian | 6 |
| casual | 8 |
| european | 10 |
| latin | 3 |
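The counts above add up to 37, so 33 of the 70 venues received no cuisine group. A dictionary lookup makes that gap easier to see than a chain of `if` statements. The following is a sketch, not the notebook's method: `CUISINE_GROUPS` is a shortened, hypothetical version of the lists above, and `classify` is a name introduced here.

```python
import pandas as pd

# Shortened illustration of the cuisine lists (not the full lists)
CUISINE_GROUPS = {
    'european': ['Mediterranean Restaurant', 'Pizza Place',
                 'French Restaurant', 'Italian Restaurant'],
    'latin': ['Mexican Restaurant', 'Taco Place'],
    'asian': ['Indian Restaurant', 'Thai Restaurant', 'Asian Restaurant'],
    'american': ['Burger Joint', 'American Restaurant'],
    'casual': ['Bagel Shop', 'Sandwich Place', 'Breakfast Spot'],
}

# Invert to a category -> group lookup
CATEGORY_TO_GROUP = {category: group
                     for group, categories in CUISINE_GROUPS.items()
                     for category in categories}


def classify(categories):
    """Map each category to its cuisine group; unknown categories become NaN."""
    return categories.map(CATEGORY_TO_GROUP)


sample = pd.Series(['Pizza Place', 'Greek Restaurant', 'Taco Place', 'Steakhouse'])
groups = classify(sample)

# dropna=False keeps the unclassified rows visible in the count
print(groups.value_counts(dropna=False))
```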


In [16]:
raw_dataset['likes'].mean()

31.557142857142857

In [17]:
# bin 'likes' into three rankings: 3 (60 or fewer), 2 (61 to 100), 1 (more than 100)

def rankings(df):
    
    if df['likes'] <= 60:
        return 3
    
    elif df['likes'] <= 100:
        return 2
    
    elif df['likes'] > 100:
        return 1

In [18]:
# apply the rankings function to every row of the dataset

raw_dataset['ranking'] = raw_dataset.apply(rankings, axis=1)
raw_dataset

| | name | categories | lat | lng | id | city | likes | categories_classified | ranking |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Osteria 166 | Trattoria/Osteria | 42.887629 | -78.876464 | 51af66fe498e8ab4596a02c8 | Buffalo | 85 | | 2 |
| 1 | Statler City | Event Space | 42.887706 | -78.877718 | 4ce9872de888f04de784486b | Buffalo | 76 | | 2 |
| 3 | New Era Flagship Store: Buffalo | Boutique | 42.888891 | -78.877346 | 4bb4ea62b1edef3b8c4b2cdd | Buffalo | 21 | | 3 |
| 4 | Buffalo City Hall Observation Deck | Scenic Lookout | 42.886657 | -78.879279 | 4df8ef77e4cd2129701bac30 | Buffalo | 3 | | 3 |
| 8 | JJ's Casa di Pizza | Pizza Place | 42.886884 | -78.873355 | 55ccd10a498edd526dac99b1 | Buffalo | 16 | european | 3 |
| 9 | Lucky Day Whiskey Bar | Whisky Bar | 42.888448 | -78.87486 | 59428ab079f6c73dd779f2bb | Buffalo | 19 | | 3 |
| 11 | SEAR | Steakhouse | 42.889636 | -78.877703 | 581d47538d169071e52163e1 | Buffalo | 18 | | 3 |
| 13 | Western New York Book Arts Collaborative | Art Gallery | 42.887049 | -78.873117 | 4b636953f964a520ea772ae3 | Buffalo | 19 | | 3 |
| 14 | Ted's Hot Dogs | Hot Dog Joint | 42.890795 | -78.877662 | 564cc50b498e17660a87f54a | Buffalo | 21 | asian | 3 |
| 15 | Taki's Restaurant | Greek Restaurant | 42.886087 | -78.876165 | 4bc343784cdfc9b6450f9721 | Buffalo | 7 | | 3 |
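The same three bins can be expressed with `pd.cut`, which applies the thresholds to the whole column at once. A minimal sketch, assuming the thresholds of the `rankings` function (60 and 100); `rank_likes` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd


def rank_likes(likes):
    """Bin likes into rankings: 3 (<= 60), 2 (61 to 100), 1 (> 100)."""
    # bins are right-inclusive by default: (-inf, 60], (60, 100], (100, inf]
    ranks = pd.cut(likes, bins=[-np.inf, 60, 100, np.inf], labels=[3, 2, 1])
    return ranks.astype(int)


# A few 'likes' values taken from the list printed earlier
likes = pd.Series([85, 76, 21, 3, 16, 218])
ranking = rank_likes(likes)
print(ranking.tolist())
```

Because the bin edges live in one list, changing the thresholds later means editing a single line.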


### Machine Learning with Linear Regression

In [19]:
# one hot encoding
reg_dataset = pd.get_dummies(raw_dataset[['categories_classified', 
                                          'city',]], 
                               prefix="", 
                               prefix_sep="")

# add name, ranking, and likes columns back to dataframe
reg_dataset['ranking'] = raw_dataset['ranking']
reg_dataset['likes'] = raw_dataset['likes']
reg_dataset['name'] = raw_dataset['name']

# move name column to the first column
reg_columns = [reg_dataset.columns[-1]] + list(reg_dataset.columns[:-1])
reg_dataset = reg_dataset[reg_columns]


reg_dataset.head()

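| | name | american | asian | casual | european | latin | Albany | Buffalo | Ithaca | ranking | likes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Osteria 166 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 85 |
| 1 | Statler City | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 76 |
| 3 | New Era Flagship Store: Buffalo | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 21 |
| 4 | Buffalo City Hall Observation Deck | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 3 |
| 8 | JJ's Casa di Pizza | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 16 |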


In [20]:
# Multiple Linear Regression

msk = np.random.rand(len(reg_dataset)) < 0.8
train = reg_dataset[msk]
test = reg_dataset[~msk]

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['american', 'asian', 'casual',
                         'european', 'latin', 'Buffalo', 
                         'Ithaca', 'Albany']])

y = np.asanyarray(train[['likes']])
regr.fit (x, y)

# The coefficients

print ('Coefficients: ', regr.coef_)

Coefficients:  [[10.38407138 -2.16867183 40.91159147 -2.59607991 64.13714141 -5.42653963
  10.37149945 -4.94495982]]
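The mask in the cell above comes from an unseeded `np.random.rand`, so the train/test split, and with it every coefficient and score below, changes on each run. A sketch of a seeded variant of the same mask-based split; `split_train_test` and the seed value are assumptions introduced here, and the numbers in this notebook were not produced with it:

```python
import numpy as np
import pandas as pd


def split_train_test(df, train_fraction=0.8, seed=0):
    """Split rows with a random mask that is reproducible for a given seed."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < train_fraction
    return df[mask], df[~mask]


# Stand-in frame with 70 rows, the size of the dataset above
demo = pd.DataFrame({'likes': range(70)})

train_a, test_a = split_train_test(demo, seed=0)
train_b, test_b = split_train_test(demo, seed=0)

# The same seed yields the same rows in both splits
print(train_a.index.equals(train_b.index))
print(len(train_a), len(test_a))
```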


In [21]:
# Multiple Linear Regression Prediction Capabilities

x = np.asanyarray(test[['american', 'asian', 'casual',
                         'european', 'latin', 'Buffalo', 
                         'Ithaca', 'Albany']])

y = np.asanyarray(test[['likes']])

y_hat = regr.predict(x)

# the mean of the squared errors is the mean squared error (MSE),
# not a residual sum of squares
print("Mean squared error: %.2f"
      % np.mean((y_hat - y) ** 2))

# regr.score returns R^2 (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x, y))

Mean squared error: 3389.30
Variance score: -0.43
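The score returned by `regr.score` is R², defined as 1 minus the ratio of the squared prediction errors to the squared deviations from the mean. It is 0 for a model that always predicts the mean and negative for a model that does worse than that. A small worked sketch of the definition; the `r2` helper and the deliberately poor `bad_pred` values are made up for illustration:

```python
import numpy as np


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
    return 1.0 - ss_res / ss_tot


# First five 'likes' values from the dataset
y_true = [85, 76, 21, 3, 16]

mean_pred = [np.mean(y_true)] * len(y_true)  # always predict the mean
bad_pred = [0, 0, 100, 100, 100]             # made-up predictions, worse than the mean

print(r2(y_true, mean_pred))
print(r2(y_true, bad_pred))
```

Read this way, a score of -0.43 says the fitted model was less accurate on the test set than a constant guess of the average number of likes.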


### Machine Learning with Logistic Regression

In [22]:
# Multinomial logistic regression on the ranking classes
# (the classes are treated as unordered, so this is not an ordinal model)

x_train = np.asanyarray(train[['american', 'asian', 'casual',
                         'european', 'latin', 'Buffalo', 
                         'Ithaca', 'Albany']])

y_train = np.asanyarray(train['ranking'])

x_test = np.asanyarray(test[['american', 'asian', 'casual',
                         'european', 'latin', 'Buffalo', 
                         'Ithaca', 'Albany']])

y_test = np.asanyarray(test['ranking'])

mul_ordinal = linear_model.LogisticRegression(multi_class='multinomial',
                                              solver='newton-cg',
                                              fit_intercept=True).fit(x_train, y_train)
mul_ordinal

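# coef_[0] holds the coefficients of the first class in mul_ordinal.classes_
# (ranking 1, the most-liked bin, when that class is present in the training data)
coef = mul_ordinal.coef_[0]
print (coef)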

[-0.17357656 -0.14464051  0.4220661  -0.14201318  0.77810885 -0.23183394
  0.44190246 -0.21009532]


In [23]:
# Multinomial Logistic Regression Prediction Capabilities

yhat = mul_ordinal.predict(x_test)
yhat

yhat_prob = mul_ordinal.predict_proba(x_test)
yhat_prob

# average = None, average = 'micro', average = 'macro', or average = 'weighted'
jaccard_score(y_test, yhat, average='weighted')

0.7901234567901234

In [24]:
log_loss(y_test, yhat_prob)

0.5605713573289273

In [25]:
# Exploration of Coefficient Magnitudes of Full Dataset

x_all = np.asanyarray(reg_dataset[['american', 'asian', 'casual',
                                   'european', 'latin', 'Buffalo', 
                                   'Ithaca', 'Albany']])

y_all = np.asanyarray(reg_dataset['ranking'])



LR = linear_model.LogisticRegression(multi_class='multinomial',
                                            solver='newton-cg',
                                            fit_intercept=True).fit(x_all,
                                                                    y_all)

LR

# coefficients of the first class in LR.classes_ (ranking 1, the most-liked bin)
coef = LR.coef_[0]
print(coef)

[-0.29461684 -0.1579739   0.34007561 -0.19153309  0.63332368 -0.40952155
  0.15929545  0.25022527]
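A bare array of coefficients is easy to misread, since each value only has meaning next to its feature name. The sketch below pairs the full-dataset coefficients printed above with the column order passed to the model and sorts them; the names `features`, `by_feature`, and `negative` are introduced here for illustration:

```python
# Column order used to build x_all
features = ['american', 'asian', 'casual', 'european', 'latin',
            'Buffalo', 'Ithaca', 'Albany']

# Values copied from the printed output above
coef = [-0.29461684, -0.1579739, 0.34007561, -0.19153309, 0.63332368,
        -0.40952155, 0.15929545, 0.25022527]

by_feature = dict(zip(features, coef))

# Print from most negative to most positive
for name, value in sorted(by_feature.items(), key=lambda item: item[1]):
    print(f'{name:>9}: {value:+.3f}')

negative = [name for name, value in by_feature.items() if value < 0]
print('negative:', negative)
```

In this output the negative coefficients belong to american, asian, european, and Buffalo, while casual, latin, Ithaca, and Albany are positive.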


In [26]:
print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         1
           3       0.89      1.00      0.94        16

    accuracy                           0.89        18
   macro avg       0.30      0.33      0.31        18
weighted avg       0.79      0.89      0.84        18
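The report shows recall of 1.00 for class 3 and 0.00 for classes 1 and 2, which means every one of the 18 test venues was predicted as class 3. It is worth comparing the scores with a baseline that always predicts class 3. The sketch below rebuilds the test labels from the support column (1, 1, and 16) and computes accuracy and a support-weighted Jaccard score by hand; `weighted_jaccard` is a hypothetical pure-Python helper, not the scikit-learn function.

```python
def weighted_jaccard(y_true, y_pred):
    """Per-class Jaccard (intersection over union), weighted by class support."""
    n = len(y_true)
    score = 0.0
    for c in sorted(set(y_true)):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        support = sum(1 for t in y_true if t == c)
        score += (support / n) * (inter / union)
    return score


# Test labels rebuilt from the support column of the report
y_test = [1, 2] + [3] * 16

# Baseline: predict the majority class for every venue
always_3 = [3] * len(y_test)

accuracy = sum(t == p for t, p in zip(y_test, always_3)) / len(y_test)
jaccard = weighted_jaccard(y_test, always_3)

print(round(accuracy, 2), round(jaccard, 4))
```

The baseline reaches about 0.89 accuracy and about 0.79 weighted Jaccard, the same figures as the fitted model, so those scores reflect the class imbalance and not learned structure.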



## Results

A linear regression model was trained on a random 80% subsample, and the remaining 20% was used for testing. To evaluate the model, the mean squared error (the mean of the squared differences between predicted and actual likes) and the variance score (R²) were calculated, with values of 3389.30 and -0.43 respectively. A negative R² means the model predicts the test set worse than simply predicting the mean number of likes, so linear regression is not a good way of modeling our data. From there, I moved on to logistic regression.

The multinomial logistic regression model was also trained on a random 80% subsample and tested on the remaining 20%. The weighted Jaccard score and log-loss were 79.01% and 0.561 respectively. A Jaccard score of 79% appears promising at first, but the classification report shows that it comes almost entirely from class 3, which makes up 16 of the 18 test venues.

Given the modest accuracy of this model, I also fit it on the complete dataset. The printed coefficients, which belong to the first class (ranking 1, the most-liked venues), are negative for Buffalo and for american, asian, and european cuisine, and positive for Ithaca, Albany, casual, and latin cuisine.

## Discussion 

From the results, I found that logistic regression presents a better fit for the data than linear regression. Logistic regression achieved a weighted Jaccard score of 79.01%, which seems more reasonable than the negative variance score from the linear regression, although the two metrics measure different things and are not directly comparable. It must be stressed that we are assuming that likes are a good proxy for how well a new restaurant will do, both in terms of its cuisine and image and in terms of business performance. Whether or not these findings hold in a real-world application remains to be seen, and this analysis is limited in scope by the small amount of data retrieved from the Foursquare API.

The precision scores for classes 1, 2, and 3 (highest, medium, lowest) were 0%, 0%, and 89%, respectively. The 100% recall for class 3, together with 0% recall for the other two classes, shows that the model assigned every test venue to class 3, so its 89% accuracy simply equals the share of class 3 in the test set. This requires further analysis, as the results favor one extreme over the others. Even so, these preliminary results offer a glimpse into how a new restaurant in each city might fare in terms of business.

Furthermore, we are attempting to predict general business performance as well as glean insights into potential business opportunities. In this case, a business opportunity can be read from the coefficient values obtained by running the logistic regression on the full dataset. Opening a restaurant in Buffalo, or serving american, asian, or european cuisine, trends negatively with the most-liked class, while Ithaca, Albany, casual, and latin cuisine trend positively. This suggests that opening a latin or casual restaurant in Ithaca or Albany would be the best opportunity to maximize likes.

## Conclusion 

In conclusion, after analyzing 'likes' for 70 venues in the state of New York (filtered from the 130 returned by Foursquare), we found that the best approach to maximize business performance (as measured by 'likes') is to open a restaurant serving latin or casual cuisine, and that opening the venue in Ithaca or Albany rather than Buffalo would present the best business opportunity. The logistic regression model appeared more accurate than linear regression, but only at identifying the lowest class: it did not correctly classify any test venue in the top two classes. More analysis and more specific data will be necessary, as the results favor one extreme over the others.