In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set()

In [2]:
foodie_df = pd.read_csv('Clean_FoodieX_data.csv')
foodie_df.head()

Unnamed: 0,Restaurant,Latitude,Longitude,Cuisines,Average_Cost,Minimum_Order,Rating,Votes,Reviews,Cook_Time_Mins,Numeric_Rating
0,ID_6321,39.262605,-85.837372,"Fast Food, Rolls, Burger, Salad, Wraps",20.0,50.0,3.5,12.0,4.0,30,3.5
1,ID_2882,39.775933,-85.740581,"Ice Cream, Desserts",10.0,50.0,3.5,11.0,4.0,30,3.5
2,ID_1595,39.253436,-85.123779,"Italian, Street Food, Fast Food",15.0,50.0,3.6,99.0,30.0,65,3.6
3,ID_5929,39.029841,-85.33205,"Mughlai, North Indian, Chinese",25.0,99.0,3.7,176.0,95.0,30,3.7
4,ID_6123,39.882284,-85.517407,"Cafe, Beverages",20.0,99.0,3.2,521.0,235.0,65,3.2


**Note:** I choose Elastic Net regression to predict ``Cook_Time_Mins`` for two reasons seen in the EDA. First, some of the features (e.g. ``Minimum_Order`` and ``Numeric_Rating``) are only weakly correlated with cook time. Second, some of the features are correlated with each other (e.g. ``Votes`` and ``Reviews``). Elastic Net combines an L1 and an L2 penalty, so it can shrink weak predictors toward zero while staying stable when predictors are correlated. For the detailed steps, see the comments in the function below.
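For reference, this is the objective that scikit-learn documents for `ElasticNet`, written in notation chosen here for illustration: $w$ are the coefficients, $n$ the number of training samples, $\alpha$ the `alpha` parameter and $\rho$ the `l1_ratio` parameter.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \rho \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
```

With $\rho = 1$ the penalty is pure L1 (lasso); with $\rho = 0$ it is pure L2 (ridge). The grid search in the function below only varies $\rho$ and leaves $\alpha$ at its default.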

In [3]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


def elastic_net_reg(df, X_cols, y_col):
    reg_df = df.dropna(subset=X_cols, axis=0)
    
    # Split data into training and test sets:
    X = reg_df[X_cols].values
    y = reg_df[y_col]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    # data preprocessing and standardizing:
    pipeline = make_pipeline(preprocessing.StandardScaler(), ElasticNet(random_state=0))
    
    # declare hyperparameter to tune:
    hyperparameters = {'elasticnet__l1_ratio' : [0.25, 0.5, 0.75]}
    
    # tune model using cross-validation pipeline:
    regr = GridSearchCV(pipeline, hyperparameters, cv=10)

    # fit data:
    regr.fit(X_train, y_train)
    
    # evaluate model on test data
    pred = regr.predict(X_test)
    print('R2 score:', r2_score(y_test, pred))
    print('Mean Absolute Error:', mean_absolute_error(y_test, pred))
    
    return regr


elnt_regr = elastic_net_reg(foodie_df, ['Average_Cost', 'Minimum_Order', 'Votes', 'Reviews', 'Numeric_Rating'], 
                            'Cook_Time_Mins')

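# elnt_regr is a fitted GridSearchCV, so the coefficients live on the ElasticNet step of the best pipeline:
# print(elnt_regr.best_estimator_.named_steps['elasticnet'].coef_)
# print(elnt_regr.best_estimator_.named_steps['elasticnet'].intercept_)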

R2 score: 0.06772592366739827
Mean Absolute Error: 9.099929126279246
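
The object returned by `elastic_net_reg` is a `GridSearchCV` wrapping a pipeline, so the selected `l1_ratio` and the fitted coefficients have to be read from its best estimator. The sketch below shows one way to do that. It uses synthetic stand-in data (an assumption made only so the example runs on its own), not the FoodieX data, and the names `X_demo`, `y_demo` and `search` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the FoodieX data), only to keep the sketch runnable.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

# Same pipeline and grid as in elastic_net_reg above.
pipeline = make_pipeline(StandardScaler(), ElasticNet(random_state=0))
search = GridSearchCV(pipeline, {'elasticnet__l1_ratio': [0.25, 0.5, 0.75]}, cv=10)
search.fit(X_demo, y_demo)

# GridSearchCV has no coef_ of its own; the refitted model is in best_estimator_.
best_model = search.best_estimator_.named_steps['elasticnet']
print('Best l1_ratio:', search.best_params_['elasticnet__l1_ratio'])
print('Coefficients (on standardized features):', best_model.coef_)
print('Intercept:', best_model.intercept_)
```

Because the features pass through `StandardScaler` first, the coefficients are per standard deviation of each feature rather than per original unit. Adding `'elasticnet__alpha'` to the grid would also tune the overall penalty strength, which the notebook's grid leaves at its default.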


In [4]:
log_foodie_df = foodie_df.copy()
log_foodie_df['Log_Votes'] = np.log(foodie_df.Votes)
log_foodie_df['Log_Reviews'] = np.log(foodie_df.Reviews)

log_elnt_regr = elastic_net_reg(log_foodie_df, ['Average_Cost', 'Minimum_Order', 'Log_Votes', 'Log_Reviews', 'Numeric_Rating'], 
                                'Cook_Time_Mins')

R2 score: 0.14993885314240085
Mean Absolute Error: 8.289981505136302


**Note:** As seen in the EDA, the relationships between ``Votes``, ``Reviews`` and ``Cook_Time_Mins`` are not linear, and some values of ``Votes`` and ``Reviews`` are very high. That is why I take their log, which reduces the differences in magnitude. Doing so raises the R2 score from about 0.07 to 0.15 and brings the MAE down from 9.1 to 8.3. In other words, the predicted cook time for a restaurant is off by about 8.3 minutes on average. The R2 score is still low: the model explains only about 15% of the variance in cook time.
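
As a small illustration of why the log helps, the sketch below applies it to hypothetical counts (not taken from the dataset) that span several orders of magnitude. It also shows `np.log1p`, which computes log(1 + x) and stays finite at zero. Whether the real ``Votes`` or ``Reviews`` columns contain zeros is not shown in this notebook, so treat `np.log1p` as an option to consider rather than a required fix.

```python
import numpy as np

# Hypothetical counts spanning several orders of magnitude, including a zero.
votes = np.array([0, 12, 99, 521, 9000], dtype=float)

# Plain log: compresses large values, but log(0) is -inf.
with np.errstate(divide='ignore'):
    log_votes = np.log(votes)

# log1p(x) = log(1 + x): same compression, and 0 maps to 0.
log1p_votes = np.log1p(votes)

print('raw:  ', votes)
print('log:  ', log_votes)
print('log1p:', log1p_votes)
```

A raw range of 0 to 9000 shrinks to a range of roughly 0 to 9 after `np.log1p`, so a few very popular restaurants no longer dominate the fit.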

In [5]:
# try predicting some random rows:

consider_cols = ['Average_Cost', 'Minimum_Order', 'Log_Votes', 'Log_Reviews', 'Numeric_Rating']

random_df = log_foodie_df.dropna(subset=consider_cols, axis=0).sample(10)
random_X = random_df[consider_cols].values
random_y = random_df.Cook_Time_Mins
result = pd.DataFrame(data={'Actual': random_y, 'Prediction': log_elnt_regr.predict(random_X)})

result

Unnamed: 0,Actual,Prediction
486,30,44.779407
711,30,38.091671
614,30,37.435981
1877,45,36.765665
183,45,41.05196
1338,30,35.820444
1235,45,43.445446
1025,30,41.399331
46,30,37.88758
798,45,39.032244
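
To tie the sample back to the MAE, the sketch below recomputes the mean absolute error on just these ten rows, with the values copied from the table above. The variable names are chosen here for illustration.

```python
import numpy as np

# Actual and predicted cook times copied from the ten sampled rows above.
actual = np.array([30, 30, 30, 45, 45, 30, 45, 30, 30, 45], dtype=float)
predicted = np.array([44.779407, 38.091671, 37.435981, 36.765665, 41.05196,
                      35.820444, 43.445446, 41.399331, 37.88758, 39.032244])

abs_errors = np.abs(actual - predicted)
sample_mae = abs_errors.mean()

print('Sample MAE:', round(sample_mae, 2))
print('Prediction range:', predicted.min(), 'to', predicted.max())
```

On this sample the MAE is about 7.5 minutes. The predictions all fall between roughly 36 and 45 minutes while the actual values are either 30 or 45, which is what one would expect from a model with a low R2 score: it stays close to the overall average.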


**Note:** So far, we have only been able to analyze records that have values for ``Reviews``, ``Votes``, and ``Numeric_Rating``. Some places have just opened (or are about to open), so data for these variables is not available yet. Now let's see whether the average cook time of new places differs from that of older, established restaurants.

In [6]:
def Cohen_effect_size(gr1, gr2):
    mean_diff = np.mean(gr1) - np.mean(gr2)
    
    n1, n2 = len(gr1), len(gr2)
    var1, var2 = np.var(gr1), np.var(gr2)
    pooled_var = (n1*var1 + n2*var2)/(n1 + n2) 
    
    d = mean_diff/ np.sqrt(pooled_var)
    return mean_diff, d
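
Written out, the function above computes the following quantity. The notation is chosen here for illustration: $\bar{x}_i$, $s_i^2$ and $n_i$ are the mean, variance and size of group $i$. Since `np.var` defaults to `ddof=0`, each $s_i^2$ is the population variance of that group.

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{\dfrac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}}},
\qquad
s_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_i \right)^2
```

A common textbook variant pools the sample variances with weights $n_i - 1$ instead. For groups of more than a few dozen rows the two versions give nearly the same value.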

In [7]:
foodie_df['Rating'] = foodie_df['Rating'].fillna('')
new_restaurants = foodie_df.loc[foodie_df.Rating.str.contains('NEW|Opening Soon')]
old_restaurants = foodie_df.loc[~foodie_df.Rating.str.contains('NEW|Opening Soon')]

mean_diff, effect = Cohen_effect_size(old_restaurants.Cook_Time_Mins, new_restaurants.Cook_Time_Mins)
print("Mean difference between new and old: ", mean_diff)
print("Cohen effect size: ", effect)

Mean difference between new and old:  6.681702180472708
Cohen effect size:  0.572005963719399


**Comment:** On average, new restaurants take around 6.7 minutes less to cook than established ones. This difference is about 0.57 (pooled) standard deviations, which is conventionally considered a medium effect size.

## Is fast food actually fast?

In [8]:
fast_food = foodie_df.loc[foodie_df.Cuisines.str.contains('Fast Food')]
slow_food = foodie_df.loc[~foodie_df.Cuisines.str.contains('Fast Food')]

mean_diff, effect = Cohen_effect_size(slow_food.Cook_Time_Mins, fast_food.Cook_Time_Mins)
print("Mean difference between fast and slow food: ", mean_diff)
print("Cohen effect size: ", effect)

Mean difference between fast and slow food:  0.09404135765791466
Cohen effect size:  0.00796869258961153


**Comment:** On average, a restaurant that serves Fast Food has about the same cook time as one that doesn't: the difference is under 0.1 minutes, an effect size of roughly 0.01.
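
The comparison above treats any restaurant whose ``Cuisines`` string contains `'Fast Food'` as a fast-food restaurant, even if it lists several other cuisines. The sketch below contrasts that substring match with the anchored pattern `'^Fast Food$'`, which only matches when the whole string is exactly `'Fast Food'`. It uses a few hypothetical cuisine strings written in the same comma-separated style.

```python
import pandas as pd

# Hypothetical cuisine strings in the same comma-separated style as the Cuisines column.
cuisines = pd.Series([
    'Fast Food',
    'Fast Food, Rolls, Burger',
    'Italian, Street Food, Fast Food',
    'Cafe, Beverages',
])

# Substring match: True wherever 'Fast Food' appears anywhere in the string.
contains_fast_food = cuisines.str.contains('Fast Food')

# Anchored regex: True only when the entire string is exactly 'Fast Food'.
only_fast_food = cuisines.str.contains('^Fast Food$')

print(contains_fast_food.tolist())  # [True, True, True, False]
print(only_fast_food.tolist())      # [True, False, False, False]
```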

In [9]:
only_fast_food = foodie_df.loc[foodie_df.Cuisines.str.contains('^Fast Food$')]
slow_food = foodie_df.loc[~foodie_df.Cuisines.str.contains('Fast Food')]

mean_diff, effect = Cohen_effect_size(slow_food.Cook_Time_Mins, only_fast_food.Cook_Time_Mins)
print("Mean difference between (only) fast food and slow food: ", mean_diff)
print("Cohen effect size: ", effect)

Mean difference between (only) fast food and slow food:  4.718958137075049
Cohen effect size:  0.4193577321361906


**Comment:** On average, restaurants that serve **ONLY** Fast Food have a cook time 4.7 minutes shorter than restaurants that don't include Fast Food in their menus at all. At about 0.42 pooled standard deviations, this effect falls between the conventional thresholds for a small (0.2) and a medium (0.5) effect size.
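
To compare the three effect sizes on one scale, the sketch below applies Cohen's conventional thresholds (0.2 small, 0.5 medium, 0.8 large) to the values printed above. `label_effect_size` is a hypothetical helper, not part of the notebook, and the "negligible" label for values under 0.2 is a common convention rather than one of Cohen's own categories.

```python
def label_effect_size(d):
    """Map |d| to a conventional label (hypothetical helper for illustration)."""
    size = abs(d)
    if size < 0.2:
        return 'negligible'
    if size < 0.5:
        return 'small'
    if size < 0.8:
        return 'medium'
    return 'large'


# Effect sizes copied from the outputs above.
effects = {
    'old vs. new restaurants': 0.572005963719399,
    'no fast food vs. any fast food': 0.00796869258961153,
    'no fast food vs. only fast food': 0.4193577321361906,
}

for name, d in effects.items():
    print(f'{name}: d = {d:.2f} ({label_effect_size(d)})')
```

These labels are rough guides, not tests of statistical significance. A group difference can be labelled "small" and still be reliably different from zero, or the reverse.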