## 🍴⭐ Michelin Restaurant Star Prediction

Given *data about Michelin-starred restaurants*, let's try to predict the **number of stars** a given restaurant holds.

We will use a logistic regression model to make our predictions.

Data source: https://www.kaggle.com/datasets/jackywang529/michelin-restaurants

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

import re
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

In [2]:
one_star_df = pd.read_csv('archive/one-star-michelin-restaurants.csv')
two_star_df = pd.read_csv('archive/two-stars-michelin-restaurants.csv')
three_star_df = pd.read_csv('archive/three-stars-michelin-restaurants.csv')

In [3]:
one_star_df.head()

Unnamed: 0,name,year,latitude,longitude,city,region,zipCode,cuisine,price,url
0,Kilian Stuba,2019,47.34858,10.17114,Kleinwalsertal,Austria,87568,Creative,$$$$$,https://guide.michelin.com/at/en/vorarlberg/kl...
1,Pfefferschiff,2019,47.83787,13.07917,Hallwang,Austria,5300,Classic cuisine,$$$$$,https://guide.michelin.com/at/en/salzburg-regi...
2,Esszimmer,2019,47.80685,13.03409,Salzburg,Austria,5020,Creative,$$$$$,https://guide.michelin.com/at/en/salzburg-regi...
3,Carpe Diem,2019,47.80001,13.04006,Salzburg,Austria,5020,Market cuisine,$$$$$,https://guide.michelin.com/at/en/salzburg-regi...
4,Edvard,2019,48.216503,16.36852,Wien,Austria,1010,Modern cuisine,$$$$,https://guide.michelin.com/at/en/vienna/wien/r...


In [4]:
two_star_df.head()

Unnamed: 0,name,year,latitude,longitude,city,region,zipCode,cuisine,price,url
0,SENNS.Restaurant,2019,47.83636,13.06389,Salzburg,Austria,5020,Creative,$$$$$,https://guide.michelin.com/at/en/salzburg-regi...
1,Ikarus,2019,47.79536,13.00695,Salzburg,Austria,5020,Creative,$$$$$,https://guide.michelin.com/at/en/salzburg-regi...
2,Mraz & Sohn,2019,48.23129,16.37637,Wien,Austria,1200,Creative,$$$$$,https://guide.michelin.com/at/en/vienna/wien/r...
3,Konstantin Filippou,2019,48.21056,16.37996,Wien,Austria,1010,Modern cuisine,$$$$$,https://guide.michelin.com/at/en/vienna/wien/r...
4,Silvio Nickol Gourmet Restaurant,2019,48.20558,16.37693,Wien,Austria,1010,Modern cuisine,$$$$$,https://guide.michelin.com/at/en/vienna/wien/r...


In [5]:
three_star_df.head()

Unnamed: 0,name,year,latitude,longitude,city,region,zipCode,cuisine,price,url
0,Amador,2019,48.25406,16.35915,Wien,Austria,1190,Creative,$$$$$,https://guide.michelin.com/at/en/vienna/wien/r...
1,Manresa,2019,37.22761,-121.98071,South San Francisco,California,95030,Contemporary,$$$$,https://guide.michelin.com/us/en/california/so...
2,Benu,2019,37.78521,-122.39876,San Francisco,California,94105,Asian,$$$$,https://guide.michelin.com/us/en/california/sa...
3,Quince,2019,37.79762,-122.40337,San Francisco,California,94133,Contemporary,$$$$,https://guide.michelin.com/us/en/california/sa...
4,Atelier Crenn,2019,37.79835,-122.43586,San Francisco,California,94123,Contemporary,$$$$,https://guide.michelin.com/us/en/california/sa...


### Preprocessing

In [6]:
# Class labels are zero-based: 0 = one star, 1 = two stars, 2 = three stars
one_star_df['stars'] = 0
two_star_df['stars'] = 1
three_star_df['stars'] = 2

combined_df = pd.concat([one_star_df, two_star_df, three_star_df], axis=0).sample(frac=1.0).reset_index(drop=True)
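The labelling and concatenation step can be sketched on a tiny made-up example. The three small frames below are hypothetical stand-ins for the CSV files, not the real data; the point is only to show the zero-based label scheme and a quick class-balance check, which is worth running before trusting an accuracy score.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the three CSV files (not the real data).
one = pd.DataFrame({"name": ["A", "B", "C"]})
two = pd.DataFrame({"name": ["D", "E"]})
three = pd.DataFrame({"name": ["F"]})

# Same scheme as the notebook: label = number of stars - 1.
frames = [df.assign(stars=label) for label, df in enumerate([one, two, three])]
combined = pd.concat(frames, ignore_index=True)

# Count how many rows fall into each class.
class_counts = combined["stars"].value_counts().sort_index()
print(class_counts)
```

To turn a predicted label back into a star count, add one to it.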

In [7]:
combined_df

Unnamed: 0,name,year,latitude,longitude,city,region,zipCode,cuisine,price,url,stars
0,Spruce,2019,37.787720,-122.452640,San Francisco,California,94118,Californian,$$$,https://guide.michelin.com/us/en/california/sa...,0
1,Feng Wei Ju,2019,22.189960,113.547940,Macau,Macau,,Hunanese and Sichuan,$,https://guide.michelin.com/mo/en/macau-region/...,1
2,Joo Ok,2019,37.522520,127.043960,Seoul,South Korea,,Korean contemporary,$$$,https://guide.michelin.com/kr/en/seoul-capital...,0
3,Tate,2019,22.280996,114.152760,Hong Kong,Hong Kong,,Innovative,$$$$$,https://guide.michelin.com/hk/en/hong-kong-reg...,0
4,Simpsons,2019,52.469250,-1.923880,Birmingham,United Kingdom,B15 3DU,Modern cuisine,,https://guide.michelin.com/gb/en/west-midlands...,0
...,...,...,...,...,...,...,...,...,...,...,...
690,Atera,2019,40.716797,-74.005650,New York,New York City,10013,Contemporary,$$$$,https://guide.michelin.com/us/en/new-york-stat...,1
691,Acquerello,2019,37.791670,-122.421310,San Francisco,California,94109,Italian,$$$$,https://guide.michelin.com/us/en/california/sa...,1
692,Duddell's,2019,22.280080,114.157364,Hong Kong,Hong Kong,,Cantonese,$$$,https://guide.michelin.com/hk/en/hong-kong-reg...,0
693,Sushi Ginza Onodera,2019,34.082380,-118.376540,Los Angeles,California,,Japanese,$$$$,https://guide.michelin.com/us/en/california/us...,1


In [8]:
y = combined_df['stars'].copy()
X = combined_df.drop('stars', axis=1)

In [9]:
# Inspect the features; name, zipCode, and url are unneeded and will be dropped
X

Unnamed: 0,name,year,latitude,longitude,city,region,zipCode,cuisine,price,url
0,Spruce,2019,37.787720,-122.452640,San Francisco,California,94118,Californian,$$$,https://guide.michelin.com/us/en/california/sa...
1,Feng Wei Ju,2019,22.189960,113.547940,Macau,Macau,,Hunanese and Sichuan,$,https://guide.michelin.com/mo/en/macau-region/...
2,Joo Ok,2019,37.522520,127.043960,Seoul,South Korea,,Korean contemporary,$$$,https://guide.michelin.com/kr/en/seoul-capital...
3,Tate,2019,22.280996,114.152760,Hong Kong,Hong Kong,,Innovative,$$$$$,https://guide.michelin.com/hk/en/hong-kong-reg...
4,Simpsons,2019,52.469250,-1.923880,Birmingham,United Kingdom,B15 3DU,Modern cuisine,,https://guide.michelin.com/gb/en/west-midlands...
...,...,...,...,...,...,...,...,...,...,...
690,Atera,2019,40.716797,-74.005650,New York,New York City,10013,Contemporary,$$$$,https://guide.michelin.com/us/en/new-york-stat...
691,Acquerello,2019,37.791670,-122.421310,San Francisco,California,94109,Italian,$$$$,https://guide.michelin.com/us/en/california/sa...
692,Duddell's,2019,22.280080,114.157364,Hong Kong,Hong Kong,,Cantonese,$$$,https://guide.michelin.com/hk/en/hong-kong-reg...
693,Sushi Ginza Onodera,2019,34.082380,-118.376540,Los Angeles,California,,Japanese,$$$$,https://guide.michelin.com/us/en/california/us...


In [10]:
X = X.drop(['name', 'zipCode', 'url'], axis=1)
X

Unnamed: 0,year,latitude,longitude,city,region,cuisine,price
0,2019,37.787720,-122.452640,San Francisco,California,Californian,$$$
1,2019,22.189960,113.547940,Macau,Macau,Hunanese and Sichuan,$
2,2019,37.522520,127.043960,Seoul,South Korea,Korean contemporary,$$$
3,2019,22.280996,114.152760,Hong Kong,Hong Kong,Innovative,$$$$$
4,2019,52.469250,-1.923880,Birmingham,United Kingdom,Modern cuisine,
...,...,...,...,...,...,...,...
690,2019,40.716797,-74.005650,New York,New York City,Contemporary,$$$$
691,2019,37.791670,-122.421310,San Francisco,California,Italian,$$$$
692,2019,22.280080,114.157364,Hong Kong,Hong Kong,Cantonese,$$$
693,2019,34.082380,-118.376540,Los Angeles,California,Japanese,$$$$


#### Missing value imputation

In [11]:
X.isna().sum()

year           0
latitude       0
longitude      0
city           2
region         0
cuisine        0
price        176
dtype: int64

In [12]:
X['price'].value_counts()

price
$$$$     197
$$$      143
$$        75
$$$$$     73
$         31
Name: count, dtype: int64

In [16]:
# Fill missing prices with the most frequent price level
X['price'] = X['price'].fillna(X['price'].mode()[0])

In [17]:
X.isna().sum()

year         0
latitude     0
longitude    0
city         2
region       0
cuisine      0
price        0
dtype: int64
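Mode imputation fills every missing price with the most frequent level, which hides the fact that the value was missing in the first place. A minimal sketch of one possible variation, on made-up data, that records missingness before filling is shown here. The `was_missing` flag is a hypothetical addition and is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up price column with two missing entries.
prices = pd.Series(["$$", np.nan, "$$$$", "$$$$", np.nan], name="price")

# Record which rows were missing before they are overwritten.
was_missing = prices.isna().astype(int)

# mode() ignores NaN by default; [0] picks the most frequent value.
most_common = prices.mode()[0]
filled = prices.fillna(most_common)

print(pd.DataFrame({"price": filled, "was_missing": was_missing}))
```

Whether such an indicator helps the model here is untested; it simply keeps information that plain mode imputation discards.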

### Encoding

In [18]:
{column: list(X[column].unique()) for column in X.columns if X.dtypes[column] == 'object'}

{'city': ['San Francisco',
  'Macau',
  'Seoul',
  'Hong Kong',
  'Birmingham',
  'Aarhus',
  'Cambridge',
  'Malmö',
  'Wien',
  'Port Isaac',
  'Lovran',
  'Göteborg',
  "Burchett's Green",
  'Stockholm',
  "Saint James's",
  'City Centre',
  'Singapore',
  'Los Angeles',
  'Soho',
  'Baile Mhic Andáin/Thomastown',
  'Edinburgh',
  'Shoreditch',
  'New York',
  'Washington, D.C.',
  'Baltimore',
  'Taipei',
  'Rio de Janeiro - 22271',
  'Chicago',
  'Mayfair',
  'Mountsorrel',
  'Bangkok',
  'Budapest',
  'Auchterarder',
  'Machynlleth',
  'Torquay',
  nan,
  'Dorking',
  'Ilfracombe',
  'Fence',
  'Bray',
  'Gaillimh/Galway',
  'Birkenhead',
  'Upper Hambleton',
  'Helsingfors / Helsinki',
  'Kensington',
  'Egham',
  'São Paulo - 05416',
  'São Paulo - 05415',
  'Bristol',
  'Winteringham',
  'Athína',
  'Great Milton',
  'Belgravia',
  'Newcastle upon Tyne',
  'Ballydehob',
  'København',
  'Zagreb',
  'Nottingham',
  'Bloomsbury',
  'South San Francisco',
  'Rio de Janeiro

In [20]:
price_ordering = ['$', '$$', '$$$', '$$$$', '$$$$$']

X['price'] = X['price'].apply(lambda price: price_ordering.index(price))
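Price is ordinal, so each symbol is replaced by its position in `price_ordering`. An equivalent way to express this, sketched on a small made-up series, is a dictionary lookup with `Series.map`. The name `price_to_rank` is introduced here only for illustration.

```python
import pandas as pd

price_ordering = ["$", "$$", "$$$", "$$$$", "$$$$$"]

# Build the symbol -> rank lookup once instead of searching the list per row.
price_to_rank = {symbol: rank for rank, symbol in enumerate(price_ordering)}

# Made-up sample of price symbols.
prices = pd.Series(["$$$", "$", "$$$$$"])
encoded = prices.map(price_to_rank)

print(encoded.tolist())
```

One difference in behavior: `map` yields NaN for an unknown symbol, whereas `list.index` raises a `ValueError`.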

In [21]:
X

Unnamed: 0,year,latitude,longitude,city,region,cuisine,price
0,2019,37.787720,-122.452640,San Francisco,California,Californian,2
1,2019,22.189960,113.547940,Macau,Macau,Hunanese and Sichuan,0
2,2019,37.522520,127.043960,Seoul,South Korea,Korean contemporary,2
3,2019,22.280996,114.152760,Hong Kong,Hong Kong,Innovative,4
4,2019,52.469250,-1.923880,Birmingham,United Kingdom,Modern cuisine,3
...,...,...,...,...,...,...,...
690,2019,40.716797,-74.005650,New York,New York City,Contemporary,3
691,2019,37.791670,-122.421310,San Francisco,California,Italian,3
692,2019,22.280080,114.157364,Hong Kong,Hong Kong,Cantonese,2
693,2019,34.082380,-118.376540,Los Angeles,California,Japanese,3


In [26]:
# Strip trailing postal-code suffixes such as ' - 05416'; missing values pass through unchanged
X['city'] = X['city'].str.replace(r' - \d+$', '', regex=True)
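Some city names carry a trailing postal-code suffix, so the same city would otherwise be split into several categories. The effect of the pattern can be checked on a few names taken from the list of unique values printed earlier:

```python
import re

# City names as they appear in the unique-value listing.
examples = [
    "São Paulo - 05416",
    "Rio de Janeiro - 22271",
    "Hong Kong",
    "Helsingfors / Helsinki",
]

# ' - ' followed by digits, anchored at the end of the string.
suffix_pattern = re.compile(r" - \d+$")
cleaned = [suffix_pattern.sub("", city) for city in examples]

for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Because the pattern requires digits at the very end, names that merely contain a slash or spaces are left untouched.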

In [27]:
X

Unnamed: 0,year,latitude,longitude,city,region,cuisine,price
0,2019,37.787720,-122.452640,San Francisco,California,Californian,2
1,2019,22.189960,113.547940,Macau,Macau,Hunanese and Sichuan,0
2,2019,37.522520,127.043960,Seoul,South Korea,Korean contemporary,2
3,2019,22.280996,114.152760,Hong Kong,Hong Kong,Innovative,4
4,2019,52.469250,-1.923880,Birmingham,United Kingdom,Modern cuisine,3
...,...,...,...,...,...,...,...
690,2019,40.716797,-74.005650,New York,New York City,Contemporary,3
691,2019,37.791670,-122.421310,San Francisco,California,Italian,3
692,2019,22.280080,114.157364,Hong Kong,Hong Kong,Cantonese,2
693,2019,34.082380,-118.376540,Los Angeles,California,Japanese,3


In [28]:
def onehot_encode(df, columns, prefixes):
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df

In [29]:
nominal_columns = ['city', 'region', 'cuisine']
nominal_prefixes = ['C', 'R', 'CU']

X = onehot_encode(X, nominal_columns, nominal_prefixes)
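The `city` column still contains two missing values at this point. A small made-up example suggests why this does not break the encoding: by default `pd.get_dummies` creates no column for NaN, so such a row simply gets zeros in every city indicator.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing city.
toy = pd.DataFrame({"city": ["Wien", np.nan, "Seoul"], "price": [4, 2, 2]})

# Same call pattern as in onehot_encode above.
dummies = pd.get_dummies(toy["city"], prefix="C", dtype=int)
encoded = pd.concat([toy.drop(columns="city"), dummies], axis=1)

print(encoded)
```

Passing `dummy_na=True` to `pd.get_dummies` would add an explicit indicator column for the missing values instead.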

In [30]:
X

Unnamed: 0,year,latitude,longitude,price,C_Aarhus,C_Aird Mhór/Ardmore,C_Anstruther,C_Ascot,C_Athína,C_Auchterarder,...,CU_Taiwanese,CU_Taizhou,CU_Temple cuisine,CU_Teppanyaki,CU_Thai,CU_Thai Contemporary,CU_Traditional British,CU_Vegetarian,CU_creative,CU_modern
0,2019,37.787720,-122.452640,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2019,22.189960,113.547940,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2019,37.522520,127.043960,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2019,22.280996,114.152760,4,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2019,52.469250,-1.923880,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
690,2019,40.716797,-74.005650,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
691,2019,37.791670,-122.421310,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
692,2019,22.280080,114.157364,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
693,2019,34.082380,-118.376540,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Scaling and Splitting

In [31]:
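scaler = StandardScaler()

# Note: the scaler is fit on the full dataset before the split,
# so test-set statistics influence the scaling of the training data
X = scaler.fit_transform(X)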

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=40)
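In this notebook the scaler is fit on all rows before the train/test split, so the test rows contribute to the means and variances used for scaling. A common way to avoid that, sketched here on synthetic data from `make_classification` rather than the restaurant data, is to split first and wrap the scaler and the model in a pipeline. The `stratify` argument and `max_iter=1000` are assumptions added for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the encoded features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Split first, keeping class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=40, stratify=y_demo
)

# The scaler inside the pipeline is fit on the training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With the real data the effect of the leak may be small, but the pipeline form makes the evaluation easier to trust.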

### Training

In [35]:
models = []
Cs = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Fit one model per regularization setting (smaller C means stronger regularization)
for C in Cs:
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    models.append(model)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
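The warning above means at least one fit stopped at the default limit of 100 iterations before the solver converged, so the corresponding coefficients may not be the optimum for that `C`. A minimal sketch, on synthetic data, of raising `max_iter` and checking how many iterations were actually used through the fitted `n_iter_` attribute follows. The value 5000 is an arbitrary assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data standing in for the scaled features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Weak regularization (large C) tends to need more iterations.
model = LogisticRegression(C=100.0, max_iter=5000)
model.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually performed by the solver.
iterations_used = int(model.n_iter_.max())
hit_limit = iterations_used >= model.max_iter

print(f"Iterations used: {iterations_used} (limit {model.max_iter})")
print("Stopped at the limit" if hit_limit else "Stopped before the limit")
```

If `n_iter_` equals `max_iter`, the fit probably ran out of iterations and the limit should be raised further.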


### Results

In [36]:
model_acc = [model.score(X_test, y_test) for model in models]

for C, acc in zip(Cs, model_acc):
    print(f" Model Accuracy (C={C}):", acc)

 Model Accuracy (C=0.0001): 0.7751196172248804
 Model Accuracy (C=0.001): 0.7751196172248804
 Model Accuracy (C=0.01): 0.7703349282296651
 Model Accuracy (C=0.1): 0.7416267942583732
 Model Accuracy (C=1.0): 0.7129186602870813
 Model Accuracy (C=10.0): 0.7129186602870813
 Model Accuracy (C=100.0): 0.6028708133971292
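Accuracy is highest at the strongest regularization and identical for the two smallest values of `C`, which may indicate that those models predict the same class for nearly every restaurant. A majority-class baseline gives a reference point for interpreting these numbers. The sketch below uses made-up labels, not the notebook's split, and `DummyClassifier` from scikit-learn.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up, imbalanced labels: class 0 dominates, as one-star restaurants might.
y_train_demo = np.array([0] * 7 + [1] * 2 + [2] * 1)
y_test_demo = np.array([0] * 3 + [1] * 1 + [2] * 1)

# Features are irrelevant for this baseline, so zeros are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_accuracy = baseline.score(X_test_demo, y_test_demo)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Running the same baseline on `X_train`, `y_train`, `X_test`, and `y_test` from the notebook would show how much of the reported accuracy comes from class imbalance alone.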
