#### Title: Machine learning predictive models and text analysis

This Jupyter notebook imports the recipes.csv file and runs six ML classification models (Logistic regression, Decision tree, Random forest, K-NN, Naive Bayes and Linear SVC) to predict whether a recipe is Italian. Before the models are fitted, a TF-IDF vectorizer converts each recipe's ingredient list into numeric features.

#### Author: Anuja Venkatachalam

In [5]:
# importing the requisite packages
import pandas as pd
import numpy as np

In [6]:
# reading in the raw file

In [7]:
df=pd.read_csv("recipes.csv")

In [8]:
df.shape

(39774, 3)

In [4]:
df.sample(5)

Unnamed: 0,cuisine,id,ingredient_list
13889,spanish,11833,"sugar, vanilla extract, egg whites, sweetened ..."
6788,mexican,16500,"beef, garlic, onions, stock, lemon, ground cor..."
27458,moroccan,45738,"whole wheat couscous, pinenuts, flat leaf pars..."
36837,mexican,3013,"tomatoes, garlic powder, salsa, boneless skinl..."
19383,cajun_creole,5229,"ground cloves, unsalted butter, ground cardamo..."


In [5]:
df.cuisine.value_counts()

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

In [9]:
# creating a binary outcome variable  
df["is_italian"]=np.where(df["cuisine"]=="italian", 1,0)

In [11]:
# creating and testing a custom tokenizer
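def custom_tokenizer(text):
    # Split the comma-separated ingredient string into one token per ingredient.
    # The strip() and lower() results in the earlier version were never used,
    # so tokens keep any leading space; TfidfVectorizer lowercases the text
    # itself by default before calling the tokenizer.
    split_words=text.split(",")
    return split_words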

In [13]:
text="my, name, is, anuja"
custom_tokenizer(text)

['my', ' name', ' is', ' anuja']
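The output above shows that tokens after the first keep their leading space, so `' name'` and `'name'` would be treated as different features. A minimal sketch of a variant that trims each token is given below; `strip_tokenizer` is a hypothetical name, not part of the notebook, and switching to it would change the features and therefore the results reported later.

```python
def strip_tokenizer(text):
    """Split on commas, trim whitespace and lowercase each ingredient."""
    tokens = []
    for piece in text.split(","):
        cleaned = piece.strip().lower()
        if cleaned:  # skip empty pieces produced by stray commas
            tokens.append(cleaned)
    return tokens


tokens = strip_tokenizer("my, name, is, anuja")
print(tokens)
```

With the same test string as above, every token comes back without surrounding whitespace.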

In [15]:
# sampling only 40% of the dataset to combat memory issues
df=df.sample(frac=0.40)

In [16]:
# vectorizing the ingredients list - setting max_features to combat memory issues
from sklearn.feature_extraction.text import TfidfVectorizer
model=TfidfVectorizer(stop_words="english",tokenizer=custom_tokenizer, use_idf=True, max_features=40)
result=model.fit_transform(df["ingredient_list"])

In [17]:
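# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available from 1.0) replaces it
results_df=pd.DataFrame(result.toarray(), columns=model.get_feature_names_out())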

In [18]:
results_df

Unnamed: 0,all-purpose flour,baking powder,black pepper,butter,carrots,chili powder,chopped cilantro fresh,corn starch,dried oregano,eggs,...,scallions,sesame oil,shallots,sour cream,soy sauce,sugar,tomatoes,unsalted butter,vegetable oil,water
0,0.000000,0.000000,0.616569,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.584619,0.000000,0.446574
1,0.000000,0.000000,0.000000,0.000000,0.543628,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.426099
2,0.466671,0.626052,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.546495,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
5,0.000000,0.000000,0.000000,0.000000,0.525150,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
6,0.000000,0.000000,0.558919,0.000000,0.000000,0.000000,0.000000,0.000000,0.579479,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
7,0.000000,0.000000,0.000000,0.424741,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.567784,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.566767,0.000000
9,0.395233,0.530216,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
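To make the weights in the table above less opaque, the following is a rough pure-Python sketch of how a TF-IDF row can be computed. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalization) and ignores `max_features` and stop words; `tfidf_rows` and the toy recipes are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs, tokenizer):
    """Return one {token: weight} dict per document, L2-normalized."""
    tokenized = [tokenizer(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = {tok: counts[tok] * idf[tok] for tok in counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows


# Toy ingredient lists (illustrative only, not taken from recipes.csv).
toy_docs = ["butter,sugar,eggs", "butter,garlic", "garlic,tomatoes,basil"]
rows = tfidf_rows(toy_docs, lambda text: text.split(","))
for row in rows:
    print({tok: round(w, 3) for tok, w in row.items()})
```

In the first toy recipe, `butter` gets a lower weight than `sugar` because it also appears in another recipe, which is the behaviour that makes rarer ingredients more informative.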


In [19]:
# Creating the X and y variables
X=results_df
y=df["is_italian"]

In [20]:
# Splitting the dataset into train and test 
from sklearn.model_selection import train_test_split
X_train,X_test,y_train, y_test=train_test_split(X,y,random_state=30)
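Since roughly a fifth of the recipes are Italian (7838 of 39774 in the value counts above), one option is to pass `stratify=y` so that the training and test sets keep the same class share. The sketch below uses toy arrays (`X_toy`, `y_toy` are illustrative names) rather than the notebook's data, and it is not what the notebook itself ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with 20% positives, similar to the share of Italian recipes.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the proportion of 1s equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=30
)

print("share of 1s in train:", y_tr.mean())
print("share of 1s in test: ", y_te.mean())
```

Fixing `random_state` here also makes the split repeatable across runs.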

In [21]:
# Checking the sizes of the training and test sets and the distribution of y.
# y is skewed (there are more 0s than 1s), so the models are likely to have low sensitivity and high specificity.
X_train.shape

(4773, 40)

In [22]:
X_test.shape

(1591, 40)

In [23]:
y_train.value_counts()

0    3820
1     953
Name: is_italian, dtype: int64

In [19]:
y_test.value_counts()

0    3173
1     805
Name: is_italian, dtype: int64
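Because the classes are skewed, accuracy is easier to interpret next to a majority-class baseline, i.e. the accuracy of always predicting 0. A small sketch using the test-set counts printed above:

```python
# Class counts copied from the y_test.value_counts() output above.
n_negative = 3173  # not Italian
n_positive = 805   # Italian

# Accuracy of a "model" that predicts 0 for every recipe.
baseline_accuracy = n_negative / (n_negative + n_positive) * 100
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}%")
# Majority-class baseline accuracy: 79.76%
```

Any model accuracy reported on this test set can be read against this figure.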

#### Logistic regression

In [20]:
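# import from the public module; the sklearn.linear_model.logistic path no longer exists in recent scikit-learn
from sklearn.linear_model import LogisticRegression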
logit=LogisticRegression(C=1e9, solver="lbfgs",max_iter=4000)
result=logit.fit(X_train,y_train)
predictions=logit.predict(X_test)

In [21]:
from sklearn.metrics import confusion_matrix
matrix=confusion_matrix(y_test, predictions)
matrix_df=pd.DataFrame(matrix, columns=["Predicted_0","Predicted_1"], index=["Actual_0","Actual_1"])

In [22]:
matrix_df

Unnamed: 0,Predicted_0,Predicted_1
Actual_0,3053,120
Actual_1,508,297


In [23]:
correct=matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_1","Predicted_1"]
total=matrix_df.to_numpy().sum()  # sum of all four cells of the confusion matrix
accuracy=correct/total*100
error_rate=100-accuracy
sensitivity_pos=matrix_df.loc["Actual_1","Predicted_1"]/(matrix_df.loc["Actual_1","Predicted_1"]+matrix_df.loc["Actual_1","Predicted_0"])*100
specificity_neg=matrix_df.loc["Actual_0","Predicted_0"]/(matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_0","Predicted_1"])*100

In [24]:
print(f"Logit(accuracy): {accuracy:.2f}%")
print(f"Logit(error_rate): {error_rate:.2f}%")
print(f"Logit(sensitivity): {sensitivity_pos:.2f}%")
print(f"Logit(specificity): {specificity_neg:.2f}%")

Logit(accuracy): 84.21%
Logit(error_rate): 15.79%
Logit(sensitivity): 36.89%
Logit(specificity): 96.22%
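The same four metrics are computed for every model in this notebook, so the calculation could be wrapped in a small helper. The sketch below is one possible version; `binary_metrics` is a hypothetical name, and the counts are the ones from the logistic regression confusion matrix above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, error rate, sensitivity and specificity, all in percent."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total * 100
    return {
        "accuracy": accuracy,
        "error_rate": 100 - accuracy,
        "sensitivity": tp / (tp + fn) * 100,  # share of actual 1s predicted as 1
        "specificity": tn / (tn + fp) * 100,  # share of actual 0s predicted as 0
    }


# Counts from the logistic regression confusion matrix above.
logit_metrics = binary_metrics(tn=3053, fp=120, fn=508, tp=297)
for name, value in logit_metrics.items():
    print(f"Logit({name}): {value:.2f}%")
```

With scikit-learn, the four counts can be unpacked directly with `tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()`.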


#### Decision tree

In [25]:
from sklearn.tree import DecisionTreeClassifier
tree=DecisionTreeClassifier(max_depth=5)
results=tree.fit(X_train,y_train)
predictions=tree.predict(X_test)
matrix_df=pd.DataFrame(confusion_matrix(y_test,predictions), columns=["Predicted_0","Predicted_1"], index=["Actual_0","Actual_1"])

In [26]:
matrix_df

Unnamed: 0,Predicted_0,Predicted_1
Actual_0,3057,116
Actual_1,559,246


In [27]:
correct=matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_1","Predicted_1"]
total=matrix_df.to_numpy().sum()  # sum of all four cells of the confusion matrix
accuracy=correct/total*100
error_rate=100-accuracy
sensitivity_pos=matrix_df.loc["Actual_1","Predicted_1"]/(matrix_df.loc["Actual_1","Predicted_1"]+matrix_df.loc["Actual_1","Predicted_0"])*100
specificity_neg=matrix_df.loc["Actual_0","Predicted_0"]/(matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_0","Predicted_1"])*100

In [28]:
print(f"Tree(accuracy): {accuracy:.2f}%")
print(f"Tree(error_rate): {error_rate:.2f}%")
print(f"Tree(sensitivity): {sensitivity_pos:.2f}%")
print(f"Tree(specificity): {specificity_neg:.2f}%")

Tree(accuracy): 83.03%
Tree(error_rate): 16.97%
Tree(sensitivity): 30.56%
Tree(specificity): 96.34%


#### Random forest

In [29]:
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier(n_estimators=100, max_depth = 5)
results=forest.fit(X_train,y_train)
predictions=forest.predict(X_test)
matrix_df=pd.DataFrame(confusion_matrix(y_test, predictions), columns=["Predicted_0","Predicted_1"], index=["Actual_0","Actual_1"])

In [30]:
matrix_df

Unnamed: 0,Predicted_0,Predicted_1
Actual_0,3154,19
Actual_1,677,128


In [31]:
correct=matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_1","Predicted_1"]
total=matrix_df.to_numpy().sum()  # sum of all four cells of the confusion matrix
accuracy=correct/total*100
error_rate=100-accuracy
sensitivity_pos=matrix_df.loc["Actual_1","Predicted_1"]/(matrix_df.loc["Actual_1","Predicted_1"]+matrix_df.loc["Actual_1","Predicted_0"])*100
specificity_neg=matrix_df.loc["Actual_0","Predicted_0"]/(matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_0","Predicted_1"])*100

In [32]:
print(f"Forest (accuracy): {accuracy:.2f}%")
print(f"Forest (error_rate): {error_rate:.2f}%")
print(f"Forest (sensitivity): {sensitivity_pos:.2f}%")
print(f"Forest (specificity): {specificity_neg:.2f}%")

Forest (accuracy): 82.50%
Forest (error_rate): 17.50%
Forest (sensitivity): 15.90%
Forest (specificity): 99.40%


#### K-NN

In [33]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=5)
results=knn.fit(X_train, y_train)
predictions=knn.predict(X_test)
matrix_df=pd.DataFrame(confusion_matrix(y_test,predictions), columns=["Predicted_0", "Predicted_1"], index=["Actual_0","Actual_1"])

In [34]:
matrix_df

Unnamed: 0,Predicted_0,Predicted_1
Actual_0,2943,230
Actual_1,481,324


In [35]:
correct=matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_1","Predicted_1"]
total=matrix_df.to_numpy().sum()  # sum of all four cells of the confusion matrix
accuracy=correct/total*100
error_rate=100-accuracy
sensitivity_pos=matrix_df.loc["Actual_1","Predicted_1"]/(matrix_df.loc["Actual_1","Predicted_1"]+matrix_df.loc["Actual_1","Predicted_0"])*100
specificity_neg=matrix_df.loc["Actual_0","Predicted_0"]/(matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_0","Predicted_1"])*100

In [36]:
print(f"K-NN (accuracy): {accuracy:.2f}%")
print(f"K-NN (error_rate): {error_rate:.2f}%")
print(f"K-NN (sensitivity): {sensitivity_pos:.2f}%")
print(f"K-NN (specificity): {specificity_neg:.2f}%")

K-NN (accuracy): 82.13%
K-NN (error_rate): 17.87%
K-NN (sensitivity): 40.25%
K-NN (specificity): 92.75%


#### Linear SVC

In [37]:
# import the class directly so the model variable does not shadow the sklearn.svm module
from sklearn.svm import SVC
svc=SVC(kernel="linear", C=1.0)
results=svc.fit(X_train, y_train)
predictions=svc.predict(X_test)
matrix_df=pd.DataFrame(confusion_matrix(y_test, predictions), columns=["Predicted_0","Predicted_1"], index=["Actual_0","Actual_1"])

In [38]:
matrix_df

Unnamed: 0,Predicted_0,Predicted_1
Actual_0,3063,110
Actual_1,556,249


In [39]:
correct=matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_1","Predicted_1"]
total=matrix_df.to_numpy().sum()  # sum of all four cells of the confusion matrix
accuracy=correct/total*100
error_rate=100-accuracy
sensitivity_pos=matrix_df.loc["Actual_1","Predicted_1"]/(matrix_df.loc["Actual_1","Predicted_1"]+matrix_df.loc["Actual_1","Predicted_0"])*100
specificity_neg=matrix_df.loc["Actual_0","Predicted_0"]/(matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_0","Predicted_1"])*100

In [40]:
print(f"Linear SVC (accuracy): {accuracy:.2f}%")
print(f"Linear SVC (error_rate): {error_rate:.2f}%")
print(f"Linear SVC (sensitivity): {sensitivity_pos:.2f}%")
print(f"Linear SVC (specificity): {specificity_neg:.2f}%")

Linear SVC (accuracy): 83.26%
Linear SVC (error_rate): 16.74%
Linear SVC (sensitivity): 30.93%
Linear SVC (specificity): 96.53%


#### Naive Bayes

In [41]:
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()

In [42]:
results=nb.fit(X_train, y_train)
predictions=nb.predict(X_test)
matrix_df=pd.DataFrame(confusion_matrix(y_test, predictions), columns=["Predicted_0","Predicted_1"], index=["Actual_0","Actual_1"])

In [43]:
matrix_df

Unnamed: 0,Predicted_0,Predicted_1
Actual_0,3132,41
Actual_1,635,170


In [44]:
correct=matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_1","Predicted_1"]
total=matrix_df.to_numpy().sum()  # sum of all four cells of the confusion matrix
accuracy=correct/total*100
error_rate=100-accuracy
sensitivity_pos=matrix_df.loc["Actual_1","Predicted_1"]/(matrix_df.loc["Actual_1","Predicted_1"]+matrix_df.loc["Actual_1","Predicted_0"])*100
specificity_neg=matrix_df.loc["Actual_0","Predicted_0"]/(matrix_df.loc["Actual_0","Predicted_0"]+matrix_df.loc["Actual_0","Predicted_1"])*100

In [45]:
print(f"NB (accuracy): {accuracy:.2f}%")
print(f"NB (error_rate): {error_rate:.2f}%")
print(f"NB (sensitivity): {sensitivity_pos:.2f}%")
print(f"NB (specificity): {specificity_neg:.2f}%")

NB (accuracy): 83.01%
NB (error_rate): 16.99%
NB (sensitivity): 21.12%
NB (specificity): 98.71%
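To compare the six models side by side, the confusion matrices printed above can be collected in one place. The sketch below copies those counts by hand (nothing is re-fitted) and reports which model scores highest on each metric; the dictionary and variable names are illustrative.

```python
# (tn, fp, fn, tp) copied from the confusion matrices printed above.
confusion = {
    "Logistic regression": (3053, 120, 508, 297),
    "Decision tree": (3057, 116, 559, 246),
    "Random forest": (3154, 19, 677, 128),
    "K-NN": (2943, 230, 481, 324),
    "Linear SVC": (3063, 110, 556, 249),
    "Naive Bayes": (3132, 41, 635, 170),
}

summary = {}
for model_name, (tn, fp, fn, tp) in confusion.items():
    summary[model_name] = {
        "accuracy": (tn + tp) / (tn + fp + fn + tp) * 100,
        "sensitivity": tp / (tp + fn) * 100,
        "specificity": tn / (tn + fp) * 100,
    }
    row = summary[model_name]
    print(f"{model_name:<20} acc={row['accuracy']:.2f}%  "
          f"sens={row['sensitivity']:.2f}%  spec={row['specificity']:.2f}%")

# Model with the highest value for each metric.
best = {
    metric: max(summary, key=lambda name: summary[name][metric])
    for metric in ("accuracy", "sensitivity", "specificity")
}
print(best)
```

No single model leads on all three metrics, so the choice depends on whether missed Italian recipes or false alarms matter more.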


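#### Conclusion

Logit has the highest accuracy (84.21%), but not the highest sensitivity or specificity: K-NN has the highest sensitivity (40.25%) and Random forest the highest specificity (99.40%). All six models have high specificity and low sensitivity (between 15.90% and 40.25%), as expected with the skewed outcome. On overall accuracy, Logit performs best at predicting, from the TF-IDF-weighted ingredients of a recipe, whether it is likely to be Italian or not.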

#### The End!