# Logistic Regression Model

This notebook develops a logistic regression model that classifies a recipe's cuisine from its ingredients. The input data are the cleaned, one-hot encoded recipes. We first train a baseline model, then use cross-validation to tune the regularization hyperparameter. Finally, we use the fitted coefficients to identify the ingredients that are most predictive of each cuisine.

In [3]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [6]:
%%time 
train_data = pd.read_csv("../data/ohe_train_recipes_v2.csv",index_col="id")
test_data = pd.read_csv("../data/ohe_test_recipes_v2.csv",index_col="id")

CPU times: user 12.2 s, sys: 1.26 s, total: 13.5 s
Wall time: 13.8 s


In [7]:
train_data.head(2)

id,1% buttermilk,1% chocolate milk,1% cottage cheese,1% milk,"2 1/2 to 3 lb. chicken, cut into serving pieces",2% cottage cheese,2% low fat cheddar chees,2% lowfat greek yogurt,2% milk mozzarella cheese,2% reduced-fat milk,...,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms,cuisine
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,spanish
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,mexican


In [8]:
# Separate the one-hot ingredient features from the cuisine label.
X = train_data.drop(['cuisine'], axis=1)
response = train_data['cuisine']

In [10]:
X_train, X_validate, y_train, y_validate = train_test_split(X, response, test_size=0.3, random_state=22)

In [11]:
X_train.shape[0], X_validate.shape[0], X_validate.shape[0]/(X_validate.shape[0]+X_train.shape[0])

(27841, 11933, 0.30002011364207776)

In [12]:
%%time
lr=LogisticRegression()
lr.fit(X_train,y_train)

CPU times: user 8min 26s, sys: 4.61 s, total: 8min 30s
Wall time: 1min 15s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
lr.score(X_validate,y_validate), lr.score(X_train,y_train)

(0.7715578647448252, 0.890808519808915)

Given the convergence warning, we standardize the features with `StandardScaler` and retrain the model to see whether this improves convergence and accuracy.

In [18]:
from sklearn.preprocessing import StandardScaler

In [19]:
scaler = StandardScaler()
X_train_trans = scaler.fit_transform(X_train)

In [20]:
X_validate_trans = scaler.transform(X_validate)

In [21]:
%%time
lr=LogisticRegression()
lr.fit(X_train_trans,y_train)

CPU times: user 7min 50s, sys: 3.29 s, total: 7min 54s
Wall time: 1min 7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [24]:
lr.score(X_validate_trans,y_validate), lr.score(X_train_trans,y_train)

(0.6932875219978212, 0.9812147552171259)

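Scaling does not help. The solver still emits the convergence warning, and validation accuracy falls from 0.772 to 0.693 while training accuracy rises from 0.891 to 0.981, which suggests that the model fit on scaled features overfits. The remaining sections therefore use the unscaled features.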
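The warning text itself suggests raising `max_iter` as an alternative to scaling. The sketch below illustrates that option on a small synthetic indicator matrix, not on the recipe data. The names `X_demo`, `y_demo`, and `fit_and_report` are hypothetical, and the iteration limits are arbitrary values chosen for illustration.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a one-hot ingredient matrix (NOT the notebook's data).
rng = np.random.default_rng(0)
X_demo = (rng.random((300, 40)) < 0.1).astype(float)
W_true = rng.normal(size=(40, 3))
y_demo = np.argmax(X_demo @ W_true + rng.gumbel(size=(300, 3)), axis=1)


def fit_and_report(max_iter):
    """Fit a model and report whether the solver raised a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter).fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, warned


# A deliberately tiny limit stops the solver early; a generous one lets it finish.
_, warned_small = fit_and_report(max_iter=1)
model_large, warned_large = fit_and_report(max_iter=2000)
print("warning with max_iter=1:", warned_small)
print("warning with max_iter=2000:", warned_large)
print("iterations actually used:", model_large.n_iter_)
```

On the full recipe matrix a higher limit lengthens training time, so whether this trade-off is worthwhile would need to be checked against the validation score.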

## Hyperparameter Tuning 
In this section we use `GridSearchCV` to cross-validate over candidate values of `C`, the inverse regularization strength of the logistic regression model (smaller values mean stronger regularization).

In [34]:
from sklearn.model_selection import GridSearchCV

In [35]:
parameters = {'C':[0.05,0.1,0.5,1,2]}
lr=LogisticRegression()

In [36]:
clf = GridSearchCV(lr, parameters)

In [None]:
clf.fit(X_train, y_train)
clf.best_params_

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [22]:
lr
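As a rough illustration of how the search results can be inspected once fitting finishes, the sketch below runs the same `C` grid on a small synthetic dataset. The data, the names `X_demo`, `y_demo`, `search`, and `summary`, and the choices `max_iter=1000` and `cv=5` are assumptions made for this example and are not taken from the notebook's run.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic three-class problem with binary features (NOT the recipe data).
rng = np.random.default_rng(1)
X_demo = (rng.random((200, 30)) < 0.15).astype(float)
W_true = rng.normal(size=(30, 3))
y_demo = np.argmax(X_demo @ W_true + rng.gumbel(size=(200, 3)), axis=1)

# Same candidate values of C as in the notebook's grid.
parameters = {"C": [0.05, 0.1, 0.5, 1, 2]}
search = GridSearchCV(LogisticRegression(max_iter=1000), parameters, cv=5)
search.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate; keep the columns of interest.
summary = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score"]
]
print(summary.sort_values("mean_test_score", ascending=False))
print("best C:", search.best_params_["C"])
```

Looking at the mean and standard deviation per candidate, rather than only `best_params_`, shows whether the best value sits at the edge of the grid, in which case the grid may need to be widened.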

## Model Interpretability 

To better understand cuisines, we can use the learned coefficients from the logistic regression model to extract a list of top ingredients for any cuisine.
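For reference, and assuming the model is fit in its multinomial (softmax) form, which scikit-learn uses for multiclass targets with its default solver in recent versions, the predicted probability of cuisine $k$ for a recipe with one-hot ingredient vector $x$ is

$$
P(y = k \mid x) = \frac{\exp\left(\beta_k^\top x + b_k\right)}{\sum_{j} \exp\left(\beta_j^\top x + b_j\right)}
$$

where $\beta_k$ is the row of `coef_` for cuisine $k$ and $b_k$ is its intercept. Because each $x_i$ is 0 or 1, adding ingredient $i$ to a recipe changes the log-odds of cuisine $k$ relative to cuisine $j$ by $\beta_{k,i} - \beta_{j,i}$. Under this reading, the ingredients with the largest coefficients in a cuisine's row are the ones whose presence most strongly pushes predictions toward that cuisine.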

In [None]:
lr=LogisticRegression()
lr.fit(X_train,y_train)

In [14]:
log_reg_coeffs = pd.DataFrame(data=lr.coef_, index=lr.classes_, columns=X_train.columns)
log_reg_coeffs.head()

cuisine,1% buttermilk,1% chocolate milk,1% cottage cheese,1% milk,"2 1/2 to 3 lb. chicken, cut into serving pieces",2% cottage cheese,2% low fat cheddar chees,2% lowfat greek yogurt,2% milk mozzarella cheese,2% reduced-fat milk,...,yukon gold potatoes,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms
brazilian,-0.01196,-0.000657,-0.007501,-0.294994,-0.001545,-0.000605,0.0,-0.006912,-0.000152,0.227497,...,0.037151,-0.000102,-0.003631,-0.002167,-0.015516,-0.020227,-0.033111,-0.027343,-0.937937,-0.011308
british,-0.038661,-0.001647,-0.004116,-0.434116,-0.00043,-0.000672,0.0,-0.014513,-0.000164,0.635151,...,0.150851,-0.000107,-0.014011,-1.8e-05,-0.086049,-0.017721,-0.043599,-0.016618,0.199513,-0.001836
cajun_creole,-0.046311,-0.001809,-0.026893,0.326151,-0.016949,-0.003808,0.0,-0.025271,-0.00014,0.201637,...,-0.61606,-0.00016,-0.013003,-0.000146,-0.034674,-0.046637,-0.035379,-0.048493,-0.492954,-0.012062
chinese,-0.006983,-0.004305,-0.005239,-0.279634,-0.018244,-0.000521,0.0,-0.005794,-0.000216,-0.317954,...,-0.55647,-0.00566,-0.007454,-4.2e-05,-0.013384,-0.092291,-0.031417,-0.00857,-0.271191,-0.014274
filipino,-0.00448,-0.004331,-0.005384,-0.251803,-0.00229,-0.001327,0.0,-0.011288,-0.000123,-0.200673,...,-0.078106,-0.000286,-0.006766,-5.2e-05,-0.011075,-0.02387,-0.026977,-0.012995,-0.004062,-0.022963


In [15]:
def top_ingredients(coeffs_df, cuisine, n):
    """Return the n ingredients with the largest coefficients for a cuisine, comma-separated."""
    return ', '.join(coeffs_df.loc[cuisine, :].sort_values(ascending=False).head(n).index)

In [16]:
top_ingredients(log_reg_coeffs, "thai",10)

'Thai red curry paste, sweet chili sauce, chunky peanut butter, unsweetened coconut milk, fish sauce, thai basil, coconut milk, palm sugar, sticky rice, creamy peanut butter'

In [17]:
top_ingredients(log_reg_coeffs, "italian",10)

'polenta, arborio rice, pesto, mascarpone, marsala wine, spaghetti, italian sausage, fettucine, ricotta cheese, gnocchi'
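To compare cuisines side by side, the same ranking could be collected into one table. The sketch below is a minimal illustration on a hypothetical toy coefficient table. The values in `toy_coeffs` are made up, and `top_ingredients_table` is a helper introduced here, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy coefficients (NOT the fitted values from this notebook).
toy_coeffs = pd.DataFrame(
    {
        "fish sauce": [2.1, -0.8],
        "pesto": [-1.2, 1.9],
        "basil": [0.4, 0.9],
        "lime": [1.1, -0.3],
    },
    index=["thai", "italian"],
)


def top_ingredients_table(coeffs_df, n):
    """Return one row per cuisine listing its n highest-coefficient ingredients."""
    rows = {
        cuisine: coeffs_df.loc[cuisine].nlargest(n).index.tolist()
        for cuisine in coeffs_df.index
    }
    columns = [f"top_{i + 1}" for i in range(n)]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


table = top_ingredients_table(toy_coeffs, 2)
print(table)
```

With the notebook's own objects, the equivalent call would presumably be `top_ingredients_table(log_reg_coeffs, 10)`.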