## Feature Engineering (Number of Ingredients by Cuisine Types)

This notebook requires:
* train.json
* cuisineNum.json
* ingredientNum.json

and produces:
* trainEngineered.csv

The concept of this feature engineering:
* (1) Loop through the entire dataset and count how many times each ingredient occurs in each of the 20 cuisine types. For example, 'romaine lettuce' appears 39 times in greek recipes and 33 times in italian recipes.
* (2) From the EDA, we saw that the top 20 ingredients appear in all 20 cuisine types. Accordingly, an ingredient that appears at least once in every one of the 20 cuisine types is classified as a 'general' ingredient.
* (3) For an ingredient that does not appear in all 20 cuisine types, if its highest count in any single cuisine type is at least 10, it is assigned to every cuisine type in which it appears at least 10 times. For example, black olives appear in greek (31), italian (67), mexican (92), and brazilian (21) recipes, so black olives are classified as a greek, italian, mexican, and brazilian ingredient.
* (4) An ingredient that does not appear at least 10 times in any cuisine type is assigned only to the cuisine type with its highest count.
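
Rules (2) to (4) can be sketched as a single function over one ingredient's count vector. This is only an illustrative sketch: `classify_ingredient` and its `threshold` parameter are hypothetical names that do not appear in the notebook, and the black olives counts for the 16 cuisine types not mentioned above are assumed to be 0.

```python
def classify_ingredient(counts, threshold=10):
    """Return the cuisine indices assigned to one ingredient.

    counts[k] is how often the ingredient occurs in cuisine type k.
    The index len(counts) stands for the extra 'general' type.
    """
    general_index = len(counts)

    # Rule (2): present in every cuisine type -> 'general'
    if all(c >= 1 for c in counts):
        return [general_index]

    # Index of the highest count (the first one wins on ties)
    max_idx = max(range(len(counts)), key=lambda k: counts[k])

    # Rule (3): highest count reaches the threshold, so keep every
    # cuisine type at or above it, with the most frequent one first
    if counts[max_idx] >= threshold:
        others = [k for k, c in enumerate(counts)
                  if k != max_idx and c >= threshold]
        return [max_idx] + others

    # Rule (4): rare ingredient, keep only its most frequent cuisine type
    return [max_idx]


# Black olives example from the text; the remaining counts are
# assumed to be 0 purely for illustration.
black_olives = [0] * 20
black_olives[0] = 31   # greek
black_olives[6] = 67   # italian
black_olives[7] = 92   # mexican
black_olives[13] = 21  # brazilian

print(classify_ingredient(black_olives))  # [7, 0, 6, 13]
```

The most frequent cuisine type is placed first in the returned list because the integration step later gives the first entry a larger weight.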

In [1]:
import numpy as np 
import pandas as pd
import json

In [2]:
trainDF = pd.read_json('train.json')
trainDF.head()

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe..."


In [3]:
# Load the cuisineNum.json
with open('cuisineNum.json') as json_file: 
  cuisineDict = json.load(json_file)

In [4]:
# Add a general cuisine type
cuisineDict['general'] = 20

In [5]:
print(cuisineDict)

{'greek': 0, 'southern_us': 1, 'filipino': 2, 'indian': 3, 'jamaican': 4, 'spanish': 5, 'italian': 6, 'mexican': 7, 'chinese': 8, 'british': 9, 'thai': 10, 'vietnamese': 11, 'cajun_creole': 12, 'brazilian': 13, 'french': 14, 'japanese': 15, 'irish': 16, 'korean': 17, 'moroccan': 18, 'russian': 19, 'general': 20}


In [6]:
# Reversed cuisineDict (index -> cuisine name)
rCuisineDict = {index: name for name, index in cuisineDict.items()}
print(rCuisineDict)

{0: 'greek', 1: 'southern_us', 2: 'filipino', 3: 'indian', 4: 'jamaican', 5: 'spanish', 6: 'italian', 7: 'mexican', 8: 'chinese', 9: 'british', 10: 'thai', 11: 'vietnamese', 12: 'cajun_creole', 13: 'brazilian', 14: 'french', 15: 'japanese', 16: 'irish', 17: 'korean', 18: 'moroccan', 19: 'russian', 20: 'general'}


In [7]:
# Load the ingredientNum.json
with open('ingredientNum.json') as json_file: 
  igdNumDict = json.load(json_file)

In [8]:
# For each ingredient, create a list of 20 counters (one per cuisine type)
igdDict = {igd: [0] * 20 for igd in igdNumDict}

(1) Loop through the entire dataset and count how many times each ingredient occurs in each of the 20 cuisine types. For example, 'romaine lettuce' appears 39 times in greek recipes and 33 times in italian recipes.

In [9]:
# Main loop to count the number of each ingredient by cuisine type.
for igdList, cuisineType in zip(trainDF['ingredients'], trainDF['cuisine']):
  cuisineIdx = cuisineDict[cuisineType]
  for igd in igdList:
    igdDict[igd][cuisineIdx] += 1
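
The counting step can be tried on a tiny hand-made dataset. The sketch below is not the notebook's code: `toy_recipes` and `toy_cuisine_index` are made-up stand-ins for `trainDF` and `cuisineDict`, and a `defaultdict` replaces the pre-built `igdDict`.

```python
from collections import defaultdict

# Hypothetical toy data: (cuisine, ingredients) pairs
toy_recipes = [
    ("greek", ["romaine lettuce", "black olives"]),
    ("italian", ["romaine lettuce", "olive oil"]),
    ("greek", ["olive oil", "black olives"]),
]
toy_cuisine_index = {"greek": 0, "italian": 1}

# One counter per cuisine type, created the first time an ingredient is seen
counts = defaultdict(lambda: [0] * len(toy_cuisine_index))

for cuisine, ingredients in toy_recipes:
    cuisine_idx = toy_cuisine_index[cuisine]
    for ingredient in ingredients:
        counts[ingredient][cuisine_idx] += 1

print(dict(counts))
```

Here 'black olives' ends up with the counts `[2, 0]` (twice in greek, never in italian), while 'romaine lettuce' and 'olive oil' both end up with `[1, 1]`.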

In [10]:
# The dictionary for ingredients with the number by cuisine type
# print(igdDict)

In [11]:
igdCuisine = dict()

* (2) An ingredient that appears at least once in every one of the 20 cuisine types is classified as a 'general' ingredient.
* (3) For an ingredient that does not appear in all 20 cuisine types, if its highest count in any single cuisine type is at least 10, it is assigned to every cuisine type in which it appears at least 10 times. For example, black olives appear in greek (31), italian (67), mexican (92), and brazilian (21) recipes, so black olives are classified as a greek, italian, mexican, and brazilian ingredient.
* (4) An ingredient that does not appear at least 10 times in any cuisine type is assigned only to the cuisine type with its highest count.

In [12]:
# Assign corresponding cuisine types for each ingredient
# Core part of feature engineering
for igd in igdDict:
  counts = igdDict[igd]
  # maxCount and maxIdx track the highest count seen so far and its cuisine index
  maxCount = counts[0]
  maxIdx = 0
  # The number of cuisine types in which the ingredient appears
  numDiffType = 1 if maxCount >= 1 else 0

  # Find the cuisine type with the highest occurrence for the ingredient
  for i in range(1, len(counts)):
    if counts[i] >= 1:
      numDiffType += 1
    if maxCount < counts[i]:
      maxCount = counts[i]
      maxIdx = i

  # If an ingredient appears in all 20 cuisine types,
  # then it is a general ingredient
  if numDiffType == 20:
    igdCuisine[igd] = [cuisineDict['general']]

  # If an ingredient does not appear in all cuisine types and its highest
  # occurrence is at least 10, assign every cuisine type in which it
  # appears at least 10 times
  elif maxCount >= 10:
    indexList = [maxIdx] # The first cuisine type has the highest occurrence
    for j in range(len(counts)):
      if j == maxIdx:
        continue
      if counts[j] >= 10:
        indexList.append(j)
    igdCuisine[igd] = indexList

  # Otherwise, the ingredient belongs to the cuisine type of highest occurrence
  else:
    igdCuisine[igd] = [maxIdx]

In [13]:
# The dictionary mapping each ingredient to the indices of its assigned cuisine types
# print(igdCuisine)

In [14]:
# The dictionary for ingredients with its associated cuisine type
igdCuisineType = dict()
for i in igdCuisine:
  csnList = []
  for j in igdCuisine[i]:
    csnList.append(rCuisineDict[j])
  igdCuisineType[i] = csnList
# print(igdCuisineType)

### Integrating into the Dataset

In [15]:
# The dictionary to be converted into dataframe
finalDict = {i:[] for i in cuisineDict}
finalDict['cuisine'] = []
print(finalDict)

{'greek': [], 'southern_us': [], 'filipino': [], 'indian': [], 'jamaican': [], 'spanish': [], 'italian': [], 'mexican': [], 'chinese': [], 'british': [], 'thai': [], 'vietnamese': [], 'cajun_creole': [], 'brazilian': [], 'french': [], 'japanese': [], 'irish': [], 'korean': [], 'moroccan': [], 'russian': [], 'general': [], 'cuisine': []}


In [16]:
for i in range(len(trainDF)):
  igdList = trainDF['ingredients'].iloc[i]

  # Append a new entry initialized to 0 to every column for this sample
  for col in finalDict:
    finalDict[col].append(0)

  # Score each cuisine type: the first entry in an ingredient's list
  # (its most frequent cuisine type, or 'general') adds 2, any other adds 1
  for igd in igdList:
    for index, csnType in enumerate(igdCuisineType[igd]):
      if(index == 0):
        finalDict[csnType][i] += 2
      else:
        finalDict[csnType][i] += 1

  # Replace the placeholder 0 in the label column with the actual cuisine
  finalDict['cuisine'][i] = trainDF['cuisine'].iloc[i]
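
To make the weighting concrete, the per-recipe scoring can be sketched for a single sample. The helper `recipe_features` and the toy mapping below are hypothetical and only mirror the loop above: the first cuisine type listed for an ingredient adds 2, every further one adds 1.

```python
def recipe_features(ingredients, ingredient_types, columns):
    """Build one feature row (column name -> score) for a single recipe."""
    row = dict.fromkeys(columns, 0)
    for ingredient in ingredients:
        for rank, cuisine in enumerate(ingredient_types[ingredient]):
            # First listed cuisine type weighs 2, the rest weigh 1
            row[cuisine] += 2 if rank == 0 else 1
    return row


# Hypothetical ingredient -> cuisine type lists (most frequent type first)
toy_types = {
    "black olives": ["mexican", "greek", "italian"],
    "salt": ["general"],
    "feta": ["greek"],
}
toy_columns = ["greek", "italian", "mexican", "general"]

row = recipe_features(["black olives", "salt", "feta"], toy_types, toy_columns)
print(row)  # {'greek': 3, 'italian': 1, 'mexican': 2, 'general': 2}
```

Because of this weighting, the engineered columns are scores rather than plain ingredient counts: a 'general' ingredient always contributes 2 to the `general` column, which is why that column holds even numbers in the output below.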

In [17]:
pd.DataFrame(finalDict).to_csv('trainEngineered.csv', index=False)

In [18]:
finalDF = pd.read_csv('trainEngineered.csv')
finalDF

Unnamed: 0,greek,southern_us,filipino,indian,jamaican,spanish,italian,mexican,chinese,british,...,cajun_creole,brazilian,french,japanese,irish,korean,moroccan,russian,general,cuisine
0,6,1,0,2,0,0,6,7,1,0,...,1,0,3,0,0,0,1,0,8,greek
1,0,5,0,1,3,0,2,1,1,2,...,2,0,1,1,1,0,0,0,14,southern_us
2,0,0,1,2,0,0,1,3,1,0,...,1,0,2,1,0,1,0,0,18,filipino
3,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,6,indian
4,1,3,0,14,2,2,3,5,5,1,...,3,1,1,6,0,1,3,0,22,indian
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39769,0,3,0,3,1,0,4,1,1,2,...,1,0,2,1,4,1,0,0,16,irish
39770,0,1,0,1,1,1,10,2,3,0,...,1,0,0,0,0,1,0,0,2,italian
39771,1,3,1,3,2,2,2,1,0,4,...,3,0,4,0,2,0,2,1,14,irish
39772,2,7,2,7,3,1,6,7,18,2,...,3,1,2,7,2,4,2,1,16,chinese


Now we can use this engineered dataset to train new models. Notice that transforming the data has significantly reduced the number of features, from 6714 columns under one-hot encoding to just 21 columns. Such dimensionality reduction can speed up training.