### In this post, we will build a model that predicts a cuisine from a list of ingredients, using a simple Naive Bayes approach.

Importing Some Python Modules

In [5]:
import pandas as pd
import json
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

I will use the data provided in this Kaggle [competition](https://www.kaggle.com/c/whats-cooking) for the cuisine prediction.

In [10]:
fin = open('train.json')

In [11]:
t = json.loads(fin.read())

## The format of the training data

The data is given as a JSON list of recipes. Each recipe has an `id`, a `cuisine`, and a list of `ingredients`.

In [9]:
t[0]

{u'cuisine': u'greek',
 u'id': 10259,
 u'ingredients': [u'romaine lettuce',
  u'black olives',
  u'grape tomatoes',
  u'garlic',
  u'pepper',
  u'purple onion',
  u'seasoning',
  u'garbanzo beans',
  u'feta cheese crumbles']}

## Let's see which unique cuisines we have in our data

In [22]:
# Collect the distinct cuisine labels and print them, one per line, in title case.
uniqueCuisine = set(x['cuisine'] for x in t)
print('\n'.join(uniqueCuisine).title())

Irish
Mexican
Chinese
Filipino
Vietnamese
Moroccan
Brazilian
Japanese
British
Greek
Indian
Jamaican
French
Spanish
Russian
Cajun_Creole
Thai
Southern_Us
Korean
Italian


### Let's make dictionaries that map each cuisine to a unique integer and back.

In [28]:
# Sort the cuisine names once so both mappings use the same order.
sortedCuisines = sorted(uniqueCuisine)
cuisineDict = dict(zip(sortedCuisines, range(len(sortedCuisines))))  # name -> integer
dictCuisine = dict(enumerate(sortedCuisines))                        # integer -> name

In [29]:
dictCuisine

{0: u'brazilian',
 1: u'british',
 2: u'cajun_creole',
 3: u'chinese',
 4: u'filipino',
 5: u'french',
 6: u'greek',
 7: u'indian',
 8: u'irish',
 9: u'italian',
 10: u'jamaican',
 11: u'japanese',
 12: u'korean',
 13: u'mexican',
 14: u'moroccan',
 15: u'russian',
 16: u'southern_us',
 17: u'spanish',
 18: u'thai',
 19: u'vietnamese'}

In [30]:
cuisineDict

{u'brazilian': 0,
 u'british': 1,
 u'cajun_creole': 2,
 u'chinese': 3,
 u'filipino': 4,
 u'french': 5,
 u'greek': 6,
 u'indian': 7,
 u'irish': 8,
 u'italian': 9,
 u'jamaican': 10,
 u'japanese': 11,
 u'korean': 12,
 u'mexican': 13,
 u'moroccan': 14,
 u'russian': 15,
 u'southern_us': 16,
 u'spanish': 17,
 u'thai': 18,
 u'vietnamese': 19}
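
As an aside, scikit-learn's `LabelEncoder` builds a comparable sorted-order mapping between labels and integers. This is a swapped-in alternative to the two dictionaries above, shown on a made-up list of labels (`toy_cuisines` and `encoder` are illustrative names, not part of the original notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, for illustration only.
toy_cuisines = ['greek', 'indian', 'brazilian', 'greek']

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_cuisines)  # label -> integer

# classes_ holds the distinct labels in sorted order; a label's position is its code.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform goes back from integers to labels.
print(encoder.inverse_transform([0, 2]).tolist())
```

The hand-built dictionaries do the same job and keep the mapping easy to inspect, so either choice works here.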

In [15]:
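def listTowords(alist):
    # Join the words of each ingredient with underscores so it stays a single token
    # ('black olives' -> 'black_olives'), then join the ingredients with spaces.
    return ' '.join('_'.join(x.strip().split()) for x in alist)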

In [37]:
def dataFile(filename):
    # Load the recipes and return two parallel lists:
    # one ingredient string per recipe and one integer cuisine label per recipe.
    with open(filename) as fin:
        ingredientsAndcuisine = json.load(fin)
    ingredients = [listTowords(x['ingredients']) for x in ingredientsAndcuisine]
    cuisine = [cuisineDict[x['cuisine']] for x in ingredientsAndcuisine]
    return ingredients, cuisine

In [39]:
ingredients, cuisine = dataFile('train.json')

### Now we have the ingredients and cuisine data. Since most machine learning algorithms work with numbers, we will create a [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) from the ingredients data.

### To achieve this, we will use scikit-learn's CountVectorizer.

In [40]:
vec = CountVectorizer()
data_X = vec.fit_transform(ingredients)
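
To make the bag-of-words step concrete, here is a minimal sketch on two made-up recipes. The toy data and the names `toy_recipes`, `join_ingredients`, and `toy_vec` are assumptions for illustration only. It shows that joining multi-word ingredients with underscores keeps each ingredient as one token, because the default tokenizer of `CountVectorizer` treats an underscore as a word character.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up recipes, for illustration only.
toy_recipes = [
    ['romaine lettuce', 'feta cheese crumbles', 'garlic'],
    ['garam masala', 'garlic', 'ghee'],
]

def join_ingredients(ingredient_list):
    # Same idea as listTowords: 'romaine lettuce' -> 'romaine_lettuce'.
    return ' '.join('_'.join(item.strip().split()) for item in ingredient_list)

toy_docs = [join_ingredients(recipe) for recipe in toy_recipes]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index in the count matrix.
vocab = sorted(toy_vec.vocabulary_)
print(vocab)

# Each row is a recipe, each column a token, each cell a count.
print(toy_X.toarray())
```

Each recipe becomes a row of counts over the shared vocabulary, which is the numeric form the classifier is trained on.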

#### We will use the transformed data to train our model. 

In [42]:
mnb = MultinomialNB()
mnb.fit(data_X, cuisine)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
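
`Pipeline` is imported at the top but the walkthrough calls the vectorizer and the classifier separately. As a hedged sketch, the two steps could also be chained so that raw ingredient strings go in and predictions come out. The toy recipes, the step names `'vec'` and `'mnb'`, and the variable `toy_clf` are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training recipes, already joined the way listTowords would join them.
toy_docs = [
    'garam_masala ghee garlic tumeric',
    'ghee cardamom garam_masala',
    'feta_cheese_crumbles black_olives garlic',
    'black_olives romaine_lettuce feta_cheese_crumbles',
]
toy_labels = ['indian', 'indian', 'greek', 'greek']

# Chain the vectorizer and the classifier into a single estimator.
toy_clf = Pipeline([
    ('vec', CountVectorizer()),
    ('mnb', MultinomialNB()),
])
toy_clf.fit(toy_docs, toy_labels)

# Tokens never seen in training (here 'salt') are ignored by the vectorizer.
prediction = toy_clf.predict(['cardamom ghee salt'])
print(prediction[0])
```

This sketch passes string labels straight to the classifier, so no integer mapping is involved. The notebook's integer labels work equally well.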

### Let's test our model on some new data.

In [59]:
# Load the test recipes; the with-block closes the file when done.
with open('test.json') as fin:
    t = json.load(fin)

### Let's see which cuisine our model assigns to the following ingredients.

In [100]:
t[200]['ingredients']

[u'creamed coconut',
 u'almonds',
 u'chicken thigh fillets',
 u'cilantro leaves',
 u'greek yogurt',
 u'black peppercorns',
 u'milk',
 u'white poppy seeds',
 u'garlic',
 u'lemon juice',
 u'onions',
 u'clove',
 u'red chili peppers',
 u'garam masala',
 u'ginger',
 u'cardamom',
 u'ghee',
 u'tumeric',
 u'coriander seeds',
 u'cassia cinnamon',
 u'salt',
 u'gram flour',
 u'saffron']

In [99]:
(dictCuisine
 [mnb.predict(vec
              .transform([listTowords(t[200]['ingredients'])]))
              .item()])

u'indian'

### Our model suggests that the above ingredients belong to Indian cooking.

## Now we will check the accuracy of our model using the test data provided by Kaggle.

In [65]:
idx = [x['id'] for x in t]

In [69]:
data_X_test = vec.transform([listTowords(x['ingredients']) for x in t])

In [70]:
# Map predicted integers back to cuisine names and write the submission file.
(pd
 .DataFrame({'id': idx, 'cuisine': [dictCuisine[x] for x in mnb.predict(data_X_test)]})
 .to_csv('submission.csv', index=False))

## Our simple model shows 74% accuracy on the test data set. 
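
The submission file is scored on Kaggle's side. To get a rough estimate locally before submitting, one option is to hold out part of the labelled training data and score the model on it. The sketch below is not how the 74% figure was obtained. It uses made-up toy recipes and scikit-learn's `train_test_split` and `accuracy_score`, and every variable name in it is an assumption for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up recipes with labels; in practice these would come from dataFile('train.json').
toy_docs = [
    'ghee garam_masala cardamom',
    'ghee tumeric ginger',
    'ghee saffron cardamom',
    'ghee garam_masala ginger',
    'feta_cheese black_olives oregano',
    'feta_cheese romaine_lettuce lemon',
    'feta_cheese black_olives lemon',
    'feta_cheese oregano cucumber',
]
toy_labels = ['indian'] * 4 + ['greek'] * 4

# Keep 25% of the labelled data aside; stratify so both cuisines appear in each part.
docs_train, docs_val, y_train, y_val = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)

# Fit the vectorizer on the training part only, then reuse it on the held-out part.
toy_vec = CountVectorizer()
X_train = toy_vec.fit_transform(docs_train)
X_val = toy_vec.transform(docs_val)

toy_mnb = MultinomialNB()
toy_mnb.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, toy_mnb.predict(X_val))
print(val_accuracy)
```

On the real training data the held-out score would only approximate the Kaggle score, since the two are computed on different recipes.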