## Data Preprocessing - Numeric Generation

This notebook requires:
* train.json
* test.json

and produces:
* cuisineNum.json
* ingredientNum.json

This preprocessing step extends the exploratory data analysis. It collects every unique ingredient and assigns each one a unique integer. It does the same for every unique cuisine type.

In [1]:
import pandas as pd
import json

In [2]:
trainDF = pd.read_json('train.json')
trainDF.head()

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe..."


In [3]:
testDF = pd.read_json('test.json')
testDF.head()

Unnamed: 0,id,ingredients
0,18009,"[baking powder, eggs, all-purpose flour, raisi..."
1,28583,"[sugar, egg yolks, corn starch, cream of tarta..."
2,41580,"[sausage links, fennel bulb, fronds, olive oil..."
3,29752,"[meat cuts, file powder, smoked sausage, okra,..."
4,35687,"[ground black pepper, salt, sausage casings, l..."


### Numeric Generation for Ingredients

First, we find all the unique ingredients that appear in either the train set or the test set.

In [4]:
# Train set: record every ingredient name as a dict key.
# The dict acts as an insertion-ordered set; the value 1 is a placeholder.
igdDict = dict()
for igdList in trainDF['ingredients']:
    for igd in igdList:
        igdDict[igd] = 1
print(len(igdDict))

6714


In [5]:
# Test set: add any ingredient names not already seen in the train set.
for igdList in testDF['ingredients']:
    for igd in igdList:
        igdDict[igd] = 1
print(len(igdDict))

7137
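
The count grows from 6714 to 7137, so 423 ingredients occur only in the test set. The same bookkeeping can be sketched with Python sets on toy data. The helper `unique_ingredients` and the lists `toy_train` and `toy_test` are illustrative names, not part of the notebook.

```python
def unique_ingredients(recipes):
    """Return the set of distinct ingredient names across a list of recipes."""
    return {ingredient for ingredients in recipes for ingredient in ingredients}


# Toy stand-ins for the 'ingredients' columns of the train and test sets.
toy_train = [["salt", "pepper", "eggs"], ["salt", "water"]]
toy_test = [["eggs", "saffron"], ["water", "salt"]]

train_set = unique_ingredients(toy_train)
test_set = unique_ingredients(toy_test)

combined = train_set | test_set   # ingredients seen in either set
test_only = test_set - train_set  # ingredients never seen during training

print(len(train_set), len(combined), sorted(test_only))
# prints: 4 5 ['saffron']
```

Unlike the dict used in the cells above, a set does not preserve insertion order, so this form suits counting rather than numbering.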


Then, we assign a unique number to each ingredient.

In [6]:
# Number the ingredients 0, 1, 2, ... in the order they were first seen.
igdNumDict = {igd: i for i, igd in enumerate(igdDict)}

In [7]:
# print(igdNumDict)
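
Because a dict iterates in insertion order (guaranteed since Python 3.7), the numbers assigned above depend on the order in which recipes appear in the files. If an order-independent numbering is preferred, one option is to sort the names before enumerating. The sketch below is an alternative, not what the notebook does, and `assign_ids` is a hypothetical helper.

```python
def assign_ids(names):
    """Map each distinct name to an integer ID, using sorted order."""
    return {name: idx for idx, name in enumerate(sorted(set(names)))}


# The same three ingredients, encountered in two different orders.
order_a = ["salt", "eggs", "water"]
order_b = ["water", "salt", "eggs"]

# First-seen numbering (as in the notebook) differs between the two orders.
first_seen_a = {name: idx for idx, name in enumerate(dict.fromkeys(order_a))}
first_seen_b = {name: idx for idx, name in enumerate(dict.fromkeys(order_b))}
print(first_seen_a == first_seen_b)  # prints: False

# Sorted numbering is identical regardless of the order of appearance.
ids_a = assign_ids(order_a)
ids_b = assign_ids(order_b)
print(ids_a)           # prints: {'eggs': 0, 'salt': 1, 'water': 2}
print(ids_a == ids_b)  # prints: True
```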

We export the generated mapping as a JSON file to be used by the feature engineering step.

In [8]:
# Export the dictionary as ingredientNum.json
with open('ingredientNum.json', 'w') as fp:
    json.dump(igdNumDict, fp, indent=4)

In [9]:
# Load ingredientNum.json back to check the export
with open('ingredientNum.json') as json_file:
    igdNum = json.load(json_file)

In [10]:
# print(igdNum)
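
A small in-memory sketch of the round trip and of one way the mapping might be consumed later. The feature engineering step is not shown in this notebook, so `encode_recipe` and its choice to skip unknown ingredients are assumptions made for illustration.

```python
import json

# Toy mapping standing in for igdNumDict.
toy_ids = {"eggs": 0, "salt": 1, "water": 2}

# Serialise and parse again. JSON object keys are always strings and the
# integer values come back as ints, so the mapping survives unchanged.
restored = json.loads(json.dumps(toy_ids, indent=4))
print(restored == toy_ids)  # prints: True


def encode_recipe(ingredients, id_map):
    """Replace ingredient names with their IDs, skipping unknown names."""
    return [id_map[name] for name in ingredients if name in id_map]


encoded = encode_recipe(["salt", "water", "truffle"], restored)
print(encoded)  # prints: [1, 2]
```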

### Numeric Generation for Cuisines

First, we find all the unique cuisine types in the train set.

In [11]:
# Train Set
csnList = trainDF['cuisine'].unique()
print(csnList)

['greek' 'southern_us' 'filipino' 'indian' 'jamaican' 'spanish' 'italian'
 'mexican' 'chinese' 'british' 'thai' 'vietnamese' 'cajun_creole'
 'brazilian' 'french' 'japanese' 'irish' 'korean' 'moroccan' 'russian']


Then, we assign a unique number to each cuisine type.

In [12]:
# Number the cuisines 0, 1, 2, ... in order of first appearance in the train set.
csnNumDict = {csn: i for i, csn in enumerate(csnList)}

In [13]:
print(csnNumDict)

{'greek': 0, 'southern_us': 1, 'filipino': 2, 'indian': 3, 'jamaican': 4, 'spanish': 5, 'italian': 6, 'mexican': 7, 'chinese': 8, 'british': 9, 'thai': 10, 'vietnamese': 11, 'cajun_creole': 12, 'brazilian': 13, 'french': 14, 'japanese': 15, 'irish': 16, 'korean': 17, 'moroccan': 18, 'russian': 19}


In [14]:
# Export the dictionary as cuisineNum.json
with open('cuisineNum.json', 'w') as fp:
    json.dump(csnNumDict, fp, indent=4)

In [15]:
# Load cuisineNum.json back to check the export
with open('cuisineNum.json') as json_file:
    csnNum = json.load(json_file)

In [16]:
print(csnNum)

{'greek': 0, 'southern_us': 1, 'filipino': 2, 'indian': 3, 'jamaican': 4, 'spanish': 5, 'italian': 6, 'mexican': 7, 'chinese': 8, 'british': 9, 'thai': 10, 'vietnamese': 11, 'cajun_creole': 12, 'brazilian': 13, 'french': 14, 'japanese': 15, 'irish': 16, 'korean': 17, 'moroccan': 18, 'russian': 19}
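
As a hedged illustration of how this mapping could be used downstream, the sketch below encodes cuisine names as integer labels and inverts the mapping to turn numeric labels back into names. It uses only the first four entries of the mapping printed above. The names `toy_cuisine_ids` and `id_to_cuisine` are hypothetical, and the notebook itself does not perform this step.

```python
# First four entries of the cuisine mapping shown above.
toy_cuisine_ids = {"greek": 0, "southern_us": 1, "filipino": 2, "indian": 3}

# Encode a column of cuisine names as integer labels.
# With pandas this could be written as trainDF['cuisine'].map(csnNumDict).
cuisines = ["greek", "indian", "indian"]
labels = [toy_cuisine_ids[name] for name in cuisines]
print(labels)  # prints: [0, 3, 3]

# Invert the mapping so numeric labels can be decoded back into names.
id_to_cuisine = {idx: name for name, idx in toy_cuisine_ids.items()}
decoded = [id_to_cuisine[idx] for idx in labels]
print(decoded == cuisines)  # prints: True
```

The inversion is safe only because every cuisine has a distinct number, which the enumeration above guarantees.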
