# [Cuisine Classification from Ingredients](https://www.kaggle.com/rahulsridhar2811/cuisine-classification-with-accuracy-78-88)

### **Description:** This project classifies the cuisine of a recipe from its list of ingredients.


The steps are as follows:
* Step-1: Download the dataset from [here](https://www.kaggle.com/c/whats-cooking/data), load it into a list of dictionaries and convert it to a DataFrame.
* Step-2: Text preprocessing:
  * Remove punctuation, digits and content inside parentheses using regular expressions
  * Remove brand names using a regular expression
  * Convert to lower case and remove stop words
  * Apply the Porter stemmer
* Step-3: Encode the cuisine column using the LabelEncoder of sklearn.
* Step-4: Convert the preprocessed ingredients column into a TF-IDF matrix.
* Step-5: Split the data (X: TF-IDF matrix, Y: label-encoded cuisine) into training and test sets (80:20).
* Step-6: Try different machine learning algorithms and keep the one with the best accuracy.
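
The steps above can be sketched end to end in a few lines. This is only a minimal illustration, assuming scikit-learn is installed: the four toy recipes and the names `recipes`, `cuisines` and `model` are made up for the example, and the notebook below does not use a `Pipeline`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy recipes standing in for the Kaggle data
# (each recipe is its ingredients joined into one string).
recipes = [
    "soy sauce ginger rice scallion",
    "soy sauce sesame oil rice noodles",
    "tomato basil olive oil pasta",
    "parmesan tomato garlic pasta",
]
cuisines = ["chinese", "chinese", "italian", "italian"]

# TF-IDF features followed by a linear SVM, fitted as a single estimator.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1))
model.fit(recipes, cuisines)

prediction = model.predict(["basil garlic pasta"])
print(prediction)
```

Wrapping both stages in one estimator means the vectorizer is always fitted on exactly the data the classifier is trained on.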


In [1]:
import json
import re

import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import PorterStemmer

from sklearn import svm
from sklearn.svm import SVC
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

nltk.download('punkt')
nltk.download('stopwords')  # needed by stopwords.words('english'); 'word_tokenize' is a function, not a downloadable package

[nltk_data] Downloading package punkt to /home/rsouza/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

In [2]:
with open('../data/yummly/train.json', 'r') as train_file:
    dict_train = json.load(train_file)

print(dict_train[0])

{'id': 10259, 'cuisine': 'greek', 'ingredients': ['romaine lettuce', 'black olives', 'grape tomatoes', 'garlic', 'pepper', 'purple onion', 'seasoning', 'garbanzo beans', 'feta cheese crumbles']}


In [3]:
len(dict_train)

39774

In [4]:
dict_train[0]['ingredients']

['romaine lettuce',
 'black olives',
 'grape tomatoes',
 'garlic',
 'pepper',
 'purple onion',
 'seasoning',
 'garbanzo beans',
 'feta cheese crumbles']

In [5]:
id_ = []
cuisine = []
ingredients = []
for i in range(len(dict_train)):
    id_.append(dict_train[i]['id'])
    cuisine.append(dict_train[i]['cuisine'])
    ingredients.append(dict_train[i]['ingredients'])

In [6]:
df = pd.DataFrame({'id':id_, 
                   'cuisine':cuisine, 
                   'ingredients':ingredients})
print(df.head(5))

      id      cuisine                                        ingredients
0  10259        greek  [romaine lettuce, black olives, grape tomatoes...
1  25693  southern_us  [plain flour, ground pepper, salt, tomatoes, g...
2  20130     filipino  [eggs, pepper, salt, mayonaise, cooking oil, g...
3  22213       indian                [water, vegetable oil, wheat, salt]
4  13162       indian  [black pepper, shallots, cornflour, cayenne pe...
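
Since `train.json` is a list of flat records, the three lists built in cell 5 are not strictly needed. A possible shortcut, assuming pandas is available, is sketched below; `records` and `df_small` are hypothetical names, and the two records are shortened copies of rows shown above.

```python
import pandas as pd

# Two records in the same shape as the entries of train.json
# (the first ingredient list is shortened for the example).
records = [
    {"id": 10259, "cuisine": "greek",
     "ingredients": ["romaine lettuce", "black olives"]},
    {"id": 22213, "cuisine": "indian",
     "ingredients": ["water", "vegetable oil", "wheat", "salt"]},
]

# A list of dicts converts directly; the keys become the columns.
df_small = pd.DataFrame(records)

# Join each ingredient list into one space-separated string.
df_small["ing"] = df_small["ingredients"].apply(" ".join)
print(df_small)
```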


In [7]:
df['cuisine'].value_counts()

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

In [8]:
new = []
for s in df['ingredients']:
    s = ' '.join(s)
    new.append(s)

In [9]:
df['ing'] = new

In [10]:
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per recipe
l = []
for s in df['ing']:

    # Remove punctuation (this also strips "(", ")" and "™")
    s = re.sub(r'[^\w\s]', '', s)

    # Remove digits
    s = re.sub(r'\d', '', s)

    # Remove content inside parentheses.
    # In this position the pattern never matches, because the parentheses were
    # already stripped above; it only takes effect if run before that step.
    s = re.sub(r'\([^)]*\)', '', s)

    # Remove brand names (a word followed by "™"); the same ordering caveat applies
    s = re.sub(r'\w*\u2122', '', s)

    # Convert to lowercase
    s = s.lower()

    # Remove stop words
    word_tokens = word_tokenize(s)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    # Apply the Porter stemmer to each remaining token
    s = ' '.join(ps.stem(w) for w in filtered_sentence)

    l.append(s)
df['ing_mod'] = l
print(df.head(10))

      id      cuisine                                        ingredients  \
0  10259        greek  [romaine lettuce, black olives, grape tomatoes...   
1  25693  southern_us  [plain flour, ground pepper, salt, tomatoes, g...   
2  20130     filipino  [eggs, pepper, salt, mayonaise, cooking oil, g...   
3  22213       indian                [water, vegetable oil, wheat, salt]   
4  13162       indian  [black pepper, shallots, cornflour, cayenne pe...   
5   6602     jamaican  [plain flour, sugar, butter, eggs, fresh ginge...   
6  42779      spanish  [olive oil, salt, medium shrimp, pepper, garli...   
7   3735      italian  [sugar, pistachio nuts, white almond bark, flo...   
8  16903      mexican  [olive oil, purple onion, fresh pineapple, por...   
9  12734      italian  [chopped tomatoes, fresh basil, garlic, extra-...   

                                                 ing  \
0  romaine lettuce black olives grape tomatoes ga...   
1  plain flour ground pepper salt tomatoes ground..
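
The order of the regex steps in the cleaning loop matters. The sketch below uses only the standard library and a made-up ingredient string (`raw` is hypothetical, not taken from the dataset) to compare stripping punctuation first, as in the loop, with stripping parenthesised content and `™` words first.

```python
import re

# Hypothetical ingredient string, not taken from the dataset.
raw = "(10 oz.) frozen spinach Kraft™ dressing"

# Same relative order as the loop: punctuation first.
s1 = re.sub(r"[^\w\s]", "", raw)     # strips "(", ")", "." and "™"
s1 = re.sub(r"\([^)]*\)", "", s1)    # no parentheses left to match
s1 = re.sub(r"\w*\u2122", "", s1)    # no "™" left to match

# Alternative order: parenthesised content and brand names first.
s2 = re.sub(r"\([^)]*\)", "", raw)
s2 = re.sub(r"\w*\u2122", "", s2)
s2 = re.sub(r"[^\w\s]", "", s2)
s2 = " ".join(s2.split())            # collapse the leftover whitespace

print(repr(s1))
print(repr(s2))
```

With punctuation removed first, the quantity and the brand word survive as ordinary tokens. Changing the order in the notebook would change the TF-IDF features and therefore the scores reported further down.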

In [11]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['ing_mod'])

print(X)
#print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

  (0, 645)	0.33593035910361563
  (0, 457)	0.15275671094077872
  (0, 839)	0.3187972762529108
  (0, 151)	0.20926361996623533
  (0, 948)	0.40698769941435575
  (0, 2129)	0.2353444283083364
  (0, 1678)	0.11254514115347572
  (0, 1936)	0.25074218315906177
  (0, 1788)	0.10459549609170087
  (0, 953)	0.11042361722588336
  (0, 2454)	0.14902078223349116
  (0, 1026)	0.35837089683191997
  (0, 1675)	0.14038392852393708
  (0, 207)	0.14748076920025835
  (0, 1350)	0.2791135588220773
  (0, 2038)	0.35874236830471423
  (1, 1667)	0.1171826390275072
  (1, 2545)	0.21067385712038839
  (1, 1525)	0.2131293756857962
  (1, 1483)	0.4041016783959638
  (1, 591)	0.22105888852639313
  (1, 2640)	0.2786478197048322
  (1, 1043)	0.18615449107858506
  (1, 778)	0.18468552446921774
  (1, 2438)	0.2735433360467422
  :	:
  (39772, 778)	0.08974390709815865
  (39772, 2087)	0.05279966977386053
  (39772, 884)	0.08958350385905274
  (39772, 953)	0.06205511227647933
  (39773, 426)	0.2653103714057182
  (39773, 2037)	0.41387002304539366


In [12]:
len(new)

39774

In [13]:
type(df['ing'][0])

str

In [14]:
s='1 1cool co1l coo1'
s=re.sub(r"(\d)", "", s)
print(s)

 cool col coo


In [15]:
s='hi 1(bye)'
s=re.sub(r'\([^)]*\)', '', s)
print(s)

hi 1


In [16]:
s='hi 1 Marvel™ hi'
s=re.sub(r'\w*\u2122', '', s)
print(s)

hi 1  hi


In [17]:
s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
s = pattern.sub('', s)
print(s)

I love phone, super fast 'much new cool things jelly bean....recently I'seen bugs.


In [19]:
example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
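
In the output above `'This'` survives the filter because the NLTK English stop-word list is lowercase and the membership test is case-sensitive. The punctuation tokens are kept as well. A minimal sketch of a case-insensitive filter is shown below; `stop_small` is a hypothetical five-word subset used so that the example runs without NLTK data.

```python
# Tokens copied from the word_tokenize output above.
tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'filtration', '.']

# Hypothetical subset of the stop-word list, all lowercase.
stop_small = {"this", "is", "a", "off", "the"}

# Compare in lowercase and keep alphabetic tokens only.
kept = [t for t in tokens if t.lower() not in stop_small and t.isalpha()]
print(kept)
```

In the main cleaning loop this is not an issue, because the text is lowercased and stripped of punctuation before the stop-word step.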


In [20]:
"love phone, super fast much cool jelly bean....but recently bugs."

'love phone, super fast much cool jelly bean....but recently bugs.'

In [21]:
s = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(s)
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [22]:
ps = PorterStemmer()

In [23]:
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

print(s)

This is a sample sentence, showing off the stop words filtration.


In [24]:

le = preprocessing.LabelEncoder()
le.fit(df['cuisine'])
df['cuisine']=le.transform(df['cuisine']) 

In [25]:
df['cuisine'].value_counts()

9     7838
13    6438
16    4320
7     3003
3     2673
5     2646
2     1546
18    1539
11    1423
6     1175
17     989
12     830
19     825
14     821
1      804
4      755
8      667
10     526
15     489
0      467
Name: cuisine, dtype: int64

In [26]:
cuisine_map={'0':'brazilian', '1':'british', '2':'cajun_creole', '3':'chinese', '4':'filipino', '5':'french', '6':'greek', '7':'indian', '8':'irish', '9':'italian', '10':'jamaican', '11':'japanese', '12':'korean', '13':'mexican', '14':'moroccan', '15':'russian', '16':'southern_us', '17':'spanish', '18':'thai', '19':'vietnamese'}
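
The hand-written `cuisine_map` matches the encoder because `LabelEncoder` numbers the classes in sorted order. The same mapping can be read from the fitted encoder instead of being typed out. The sketch below assumes scikit-learn and uses a hypothetical five-label sample (`labels`, `enc`, `mapping` are illustrative names).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of cuisine labels.
labels = ["greek", "southern_us", "filipino", "indian", "indian"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ holds the sorted class names; the position is the integer code.
mapping = dict(enumerate(enc.classes_))

# inverse_transform converts integer codes back to the original names.
decoded = enc.inverse_transform(codes)
print(mapping)
print(list(decoded))
```

Keeping the fitted encoder around avoids the string keys and the manual lookups used later when converting predictions back to cuisine names.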

In [27]:
Y = df['cuisine']


In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 100)
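
The split above is purely random. Because the classes are imbalanced (from 467 to 7838 recipes per cuisine), one option is a stratified split, which keeps the class proportions equal in both parts. The sketch below is an alternative under that assumption, not what the notebook ran: `X_toy` and `y_toy` are synthetic, and using `stratify` on the real data would change the scores reported below.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 of one class and 20 of another.
y_toy = ["italian"] * 80 + ["brazilian"] * 20
X_toy = [[i] for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy
)

# The 80:20 class ratio is preserved in the 20-sample test set.
counts = Counter(y_te)
print(counts)
```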

In [29]:
for K in range(25):
    K_value = K+1
    neigh = KNeighborsClassifier(n_neighbors = K_value, weights='uniform', algorithm='auto')
    neigh.fit(X_train, y_train) 
    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:",K_value)


Accuracy is  70.18227529855436 % for K-Value: 1
Accuracy is  68.4098051539912 % for K-Value: 2
Accuracy is  71.69076052796983 % for K-Value: 3
Accuracy is  72.84726587052168 % for K-Value: 4
Accuracy is  73.32495285983657 % for K-Value: 5
Accuracy is  73.99120050282842 % for K-Value: 6
Accuracy is  74.19233186675046 % for K-Value: 7
Accuracy is  74.51917033312382 % for K-Value: 8
Accuracy is  74.69516027655563 % for K-Value: 9
Accuracy is  74.46888749214331 % for K-Value: 10
Accuracy is  74.78315524827153 % for K-Value: 11
Accuracy is  74.70773098680075 % for K-Value: 12
Accuracy is  74.56945317410434 % for K-Value: 13
Accuracy is  74.46888749214331 % for K-Value: 14
Accuracy is  74.73287240729101 % for K-Value: 15
Accuracy is  74.72030169704588 % for K-Value: 16
Accuracy is  74.7705845380264 % for K-Value: 17
Accuracy is  74.6825895663105 % for K-Value: 18
Accuracy is  74.49402891263355 % for K-Value: 19
Accuracy is  74.55688246385921 % for K-Value: 20
Accuracy is  74.46888749214331 %

In [30]:
#Implement KNN(So we take K value to be 11)
neigh = KNeighborsClassifier(n_neighbors = 11, weights='uniform', algorithm='auto')
neigh.fit(X_train, y_train) 
y_pred = neigh.predict(X_test)
print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:11")

Accuracy is  74.78315524827153 % for K-Value:11
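
The sweep above picks K by looking at test-set accuracy, so the score reported for the chosen K is somewhat optimistic. A common alternative is to choose K by cross-validation on the training data and touch the test set only once. The sketch below illustrates the idea on synthetic data from `make_classification`; none of the names or numbers come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem standing in for the TF-IDF matrix.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 5-fold cross-validation over K = 1..25, using the training data only.
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": list(range(1, 26))}, cv=5
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_acc = search.score(X_te, y_te)  # single evaluation on held-out data
print(best_k, test_acc)
```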


In [31]:
# Grid search for the best C, gamma and kernel (RBF vs. linear)

parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)
clf.fit(X_train, y_train)   
print('Best score for data1:', clf.best_score_) 
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

Best score for data1: 0.7817028205469374
Best C: 1
Best Kernel: linear
Best Gamma: scale


In [32]:
# OVA SVM (grid search results: kernel = linear, C = 1; gamma is not used by the linear kernel)

lin_clf = svm.LinearSVC(C=1)
lin_clf.fit(X_train, y_train)
y_pred=lin_clf.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)

78.88120678818353


In [33]:
# Crammer-Singer multiclass SVM (same C = 1 as the OVA model)
lin_clf = svm.LinearSVC(C=1.0, multi_class='crammer_singer')
lin_clf.fit(X_train, y_train)
y_pred=lin_clf.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)

78.80578252671276




In [34]:
# Multinomial Naive Bayes (handles multiple classes natively, so no one-vs-all scheme is involved)

clf = MultinomialNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)

66.77561282212446


In [35]:
# Logistic Regression (with the default lbfgs solver this fits a multinomial model, not one-vs-all)

logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
y_pred = logisticRegr.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)

78.22752985543683


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
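
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` message means the lbfgs solver stopped at its iteration cap (`max_iter=100` by default) before converging. Raising `max_iter` is the first remedy the warning suggests. The sketch below shows the parameter on synthetic data; the value 1000 is an arbitrary choice for the example, and the effect on the cuisine data was not measured here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data standing in for the TF-IDF matrix.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# A higher cap gives the solver room to converge.
clf_demo = LogisticRegression(max_iter=1000)
clf_demo.fit(X_demo, y_demo)

# n_iter_ reports how many iterations were actually used.
iterations = int(clf_demo.n_iter_.max())
print(iterations)
```

If `n_iter_` equals `max_iter`, the solver still hit the cap and the limit needs to be raised further.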


In [39]:
df3 = pd.DataFrame({'id':y_test.index, 'cuisine':y_test.values})
y_test2=[]
y_test1=df3['cuisine'].tolist()
for i in range(len(df3['cuisine'])):
    y_test2.append(cuisine_map[str(df3['cuisine'][i])])
#print(y_test2)

In [40]:
# Convert the predicted labels back to cuisine names.
# At this point y_pred still holds the logistic regression predictions from cell 35;
# the best-scoring model is refit in the next cell to produce the final output.
y_pred1 = [cuisine_map[str(label)] for label in y_pred]
#print(y_pred1)

In [41]:
# We choose the OVA SVM, as it gives the best accuracy (78.88%)

lin_clf = svm.LinearSVC(C=1)
lin_clf.fit(X_train, y_train)
y_pred=lin_clf.predict(X_test)
print(accuracy_score(y_test,y_pred)*100)
# Convert the SVM predictions to cuisine names so the frame compares actual vs. predicted
y_pred2 = [cuisine_map[str(label)] for label in y_pred]
result=pd.DataFrame({'Actual Cuisine':y_test2, 'Predicted Cuisine':y_pred2})
print(result)

78.88120678818353
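
A single accuracy figure hides which cuisines get confused with each other. A confusion matrix and a per-class report are one way to look at that. The sketch below assumes scikit-learn and uses six made-up label pairs (`y_true_demo`, `y_pred_demo`) rather than the notebook's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up actual and predicted cuisine names.
y_true_demo = ["italian", "italian", "mexican", "indian", "mexican", "italian"]
y_pred_demo = ["italian", "mexican", "mexican", "indian", "mexican", "italian"]
labels_demo = ["indian", "italian", "mexican"]

# Rows are actual classes, columns are predicted classes, in labels_demo order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels_demo)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_true_demo, y_pred_demo, labels=labels_demo))
```

Off-diagonal entries count the mistakes: here one `italian` recipe was predicted as `mexican`.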


# Conclusion
The OVA SVM Algorithm gives the best accuracy of 78.88% for the given dataset.

|       Algorithm       |           Parameters           | Accuracy |
|:----------------------|:------------------------------:|:--------:|
|          KNN          |              K=11              |  74.78%  |
|        OVA SVM        |              C=1               |  78.88%  |
|      Crammer SVM      |C=1, multi_class=crammer_singer |  78.81%  |
|Multinomial Naive Bayes|               -                |  66.78%  |
|  Logistic Regression  |               -                |  78.23%  |
