# What's cooking competition

**The competition** asks you to predict the category of a dish's cuisine given a list of its ingredients.




In the dataset, we include the **recipe id**, **the type of cuisine**, and **the list of ingredients of each recipe** (of variable length). The data is stored in JSON format. 

#### An example of a recipe node in train.json:

```json
 {
 "id": 24717,
 "cuisine": "indian",
 "ingredients": [
     "tumeric",
     "vegetable stock",
     "tomatoes",
     "garam masala",
     "naan",
     "red lentils",
     "red chili peppers",
     "onions",
     "spinach",
     "sweet potatoes"
 ]
 },
 ```
 
In the test file **test.json**, the format of a recipe is the same as in **train.json**, except that *the cuisine type* is removed, <font color=red>as it is the target variable you are going to predict.</font>

#### File descriptions:

**train.json -** the training set containing the recipe id, type of cuisine, and list of ingredients

**test.json -** the test set containing the recipe id and list of ingredients

**sample_submission.csv -** a sample submission file in the correct format

-------

In [1]:
# Import required libraries
import os
import sys
import data_utils as du
import numpy as np

In [2]:
# Read the data from files
trainingData = du.readJson('./data/train.json')
testData = du.readJson('./data/test.json')
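
`data_utils` is a local helper module whose source is not shown in this notebook. As a rough sketch of what loading involves, the snippet below parses a JSON array in the `train.json` layout using only the standard library. The inline `sample` string and the `read_recipes` name are illustrative assumptions, not the notebook's `du.readJson`.

```python
import json

# Hypothetical inline sample in the same layout as train.json.
sample = """
[
  {"id": 24717, "cuisine": "indian",
   "ingredients": ["tumeric", "vegetable stock", "tomatoes"]},
  {"id": 10259, "cuisine": "greek",
   "ingredients": ["romaine lettuce", "black olives"]}
]
"""

def read_recipes(text):
    """Parse a JSON array of recipe objects into a list of dicts."""
    records = json.loads(text)
    # Test records have no "cuisine" key, so read it with .get().
    return [
        {"id": r["id"], "cuisine": r.get("cuisine"), "ingredients": r["ingredients"]}
        for r in records
    ]

recipes = read_recipes(sample)
print(len(recipes), recipes[0]["cuisine"])
```

The notebook's own objects support `.head()`, so `du.readJson` presumably returns a pandas DataFrame rather than a list of dicts.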

### Exploring first few rows of each data set

In [3]:
print('TRAINING DATA:')
trainingData.head()

TRAINING DATA:


Unnamed: 0,cuisine,id,ingredients
0,greek,10259,"[romaine lettuce, black olives, grape tomatoes..."
1,southern_us,25693,"[plain flour, ground pepper, salt, tomatoes, g..."
2,filipino,20130,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,indian,22213,"[water, vegetable oil, wheat, salt]"
4,indian,13162,"[black pepper, shallots, cornflour, cayenne pe..."


In [4]:
print('TEST DATA:')
testData.head()

TEST DATA:


Unnamed: 0,id,ingredients
0,18009,"[baking powder, eggs, all-purpose flour, raisi..."
1,28583,"[sugar, egg yolks, corn starch, cream of tarta..."
2,41580,"[sausage links, fennel bulb, fronds, olive oil..."
3,29752,"[meat cuts, file powder, smoked sausage, okra,..."
4,35687,"[ground black pepper, salt, sausage casings, l..."


### Make a distribution map for ingredients and food types in the training set

In [5]:
ingredientsColumn = trainingData['ingredients']
recipeType = trainingData['cuisine']

ingredientsDist = du.dataDistributionMap(ingredientsColumn)
cuisineDist = du.dataDistributionMap(recipeType)

print ('Number of cuisines in the data set:',len(cuisineDist))
print ('Number of ingredients in the data set:',len(ingredientsDist))

<class 'list'>
<class 'str'>
Number of cuisines in the data set: 20
Number of ingredients in the data set: 6714
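
The `<class 'list'>` and `<class 'str'>` lines in the output suggest that the helper inspects the type of each column's entries. A minimal sketch of such a frequency map, assuming list-valued entries contribute one count per element and string entries one count each, could look like this. The name `distribution_map` and the toy data are hypothetical.

```python
from collections import Counter

def distribution_map(column):
    """Count how often each value occurs in a column.

    Entries may be lists (ingredients) or plain strings (cuisine).
    """
    counts = Counter()
    for entry in column:
        if isinstance(entry, list):
            counts.update(entry)   # one count per ingredient in the recipe
        else:
            counts[entry] += 1     # one count per recipe label
    return counts

# Toy data, not taken from the real data set.
ingredients = [["salt", "pepper"], ["salt", "water"]]
cuisines = ["greek", "indian", "indian"]

ingredient_counts = distribution_map(ingredients)
cuisine_counts = distribution_map(cuisines)
print(len(ingredient_counts), ingredient_counts["salt"])
print(cuisine_counts.most_common(1))
```

The length of each map gives the number of distinct values, which is how the counts of cuisines and ingredients above are obtained.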


-------

### Prepare the data to be used as input to a neural network and create training and dev data sets

1. We need to create input vectors representing each **recipe** and output vectors representing a **cuisine type**.
2. XYZ 
3. XYZ
4. XYZ

### Prepare the input matrix:

In [6]:
# create a map of ingredients
w2i, i2w = du.wordsToMap(list(ingredientsDist.keys()))
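
The source of `du.wordsToMap` is not shown. A plausible sketch, assuming it assigns zero-based indices in the order the words are given, is below. The name `words_to_map` is hypothetical.

```python
def words_to_map(words):
    """Return (word -> index, index -> word) dictionaries."""
    word_to_index = {word: index for index, word in enumerate(words)}
    index_to_word = {index: word for word, index in word_to_index.items()}
    return word_to_index, index_to_word

word_to_idx, idx_to_word = words_to_map(
    ["romaine lettuce", "black olives", "grape tomatoes", "garlic"]
)
print(word_to_idx["garlic"], idx_to_word[0])
```

Keeping both directions makes it cheap to encode recipes for the network and to decode indices back to names when checking results.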

In [7]:
max(len(recipe) for recipe in trainingData['ingredients'])

65

In [8]:
# Input matrix with the ingredients of each recipe encoded as indices and zero-padded.
# Each row is a recipe in the training set and will be the input to the first layer of the NN.
X = du.convertToInputMatrix(trainingData['ingredients'], w2i)
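
As an illustration of the encoding step, the sketch below maps each ingredient to its index and right-pads every row with zeros to a common length. It uses plain Python lists to stay dependency-free, whereas the notebook's helper returns NumPy arrays. The function name `to_input_matrix` and its parameters are assumptions.

```python
def to_input_matrix(recipes, word_to_index, max_len=None, pad_value=0):
    """Encode each recipe as a fixed-length row of ingredient indices."""
    if max_len is None:
        # Default to the longest recipe so that no row is truncated.
        max_len = max(len(recipe) for recipe in recipes)
    rows = []
    for recipe in recipes:
        indices = [word_to_index[item] for item in recipe][:max_len]
        # Right-pad short recipes so that all rows have the same length.
        rows.append(indices + [pad_value] * (max_len - len(indices)))
    return rows

# Toy vocabulary and recipes, not taken from the real data set.
vocab = {"lettuce": 0, "olives": 1, "garlic": 2, "salt": 3}
toy_recipes = [["lettuce", "olives", "garlic"], ["salt"]]
matrix = to_input_matrix(toy_recipes, vocab)
print(matrix)
```

With zero-based indices, as in the output of `X[0]` below, the first ingredient shares the value 0 with the padding. Starting the vocabulary at 1 is one common way to keep the two apart.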

### Sanity Check:

In [9]:
X[0]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [10]:
trainingData['ingredients'][0]

['romaine lettuce',
 'black olives',
 'grape tomatoes',
 'garlic',
 'pepper',
 'purple onion',
 'seasoning',
 'garbanzo beans',
 'feta cheese crumbles']

In [11]:
print(i2w[0],"\n",i2w[1], "\n",i2w[2], "\n",i2w[3])

romaine lettuce 
 black olives 
 grape tomatoes 
 garlic


### Prepare the expected output matrix:

In [12]:
# create maps from cuisine to index and from index to cuisine
c2i, i2c = du.wordsToMap(list(cuisineDist))

In [13]:
Y = du.convertToOutputMatrix(trainingData['cuisine'], c2i)
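
The output matrix is a one-hot encoding: each row has a single 1 in the column of the recipe's cuisine. A minimal sketch, assuming the helper works roughly like this, follows. The name `to_output_matrix` is hypothetical, and only four of the 20 cuisines are included to keep the example short.

```python
def to_output_matrix(labels, label_to_index):
    """One-hot encode labels: one row per label, one column per class."""
    n_classes = len(label_to_index)
    rows = []
    for label in labels:
        row = [0] * n_classes
        row[label_to_index[label]] = 1  # mark the column of this label
        rows.append(row)
    return rows

# Indices follow the first rows of the training data shown earlier.
label_to_index = {"greek": 0, "southern_us": 1, "filipino": 2, "indian": 3}
labels = ["greek", "southern_us", "filipino", "indian", "indian"]
one_hot = to_output_matrix(labels, label_to_index)
print(one_hot[1])
```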

### Sanity Check:


In [20]:
Y[0:20]

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0,

In [22]:
trainingData['cuisine'][1] 

'southern_us'

In [23]:
Y[1]

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [26]:
print(c2i['southern_us'], i2c[1])

1 southern_us
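
The spot check above can be extended to every row by decoding each one-hot vector back to its label and comparing it with the original column. The sketch below assumes rows are indexable sequences. `decode_one_hot` and the four-class toy data are hypothetical.

```python
def decode_one_hot(row, index_to_label):
    """Return the label at the position of the largest value in the row."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return index_to_label[best_index]

index_to_label = {0: "greek", 1: "southern_us", 2: "filipino", 3: "indian"}
labels = ["greek", "southern_us", "filipino", "indian", "indian"]
rows = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

decoded = [decode_one_hot(row, index_to_label) for row in rows]
print(decoded == labels)
```

Because the decoder picks the position of the largest value, the same function can also turn a row of predicted class probabilities into a cuisine name.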
