### Feature Engineering - One-Hot Encoding

This notebook requires:
* train.json

and produces:
* trainOneHotEncoded.csv

As a first feature-engineering step, we one-hot encode the list of ingredients, so that each distinct ingredient becomes its own column. For example, if 'romaine lettuce' appears in a sample, that sample's value in the 'romaine lettuce' column is 1. Otherwise, the value is 0.
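
To make the idea concrete, here is a minimal sketch on three made-up recipes. The `recipes` list and its contents are illustrative stand-ins for entries of train.json, not real rows from the dataset.

```python
# Toy recipes standing in for entries of train.json (made-up data).
recipes = [
    {"cuisine": "greek", "ingredients": ["romaine lettuce", "feta cheese crumbles"]},
    {"cuisine": "indian", "ingredients": ["garlic", "pepper"]},
    {"cuisine": "greek", "ingredients": ["garlic", "romaine lettuce"]},
]

# One column per distinct ingredient (sorted here only to make the order predictable).
columns = sorted({igd for r in recipes for igd in r["ingredients"]})

# One row per recipe: 1 if the ingredient is listed, 0 otherwise.
rows = [[1 if col in r["ingredients"] else 0 for col in columns] for r in recipes]

print(columns)
for r, row in zip(recipes, rows):
    print(row, r["cuisine"])
```

Each row has exactly as many 1s as the recipe has distinct ingredients, and every other entry is 0.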

In [None]:
import numpy as np 
import pandas as pd
import json 

In [None]:
# Load the train dataset
with open('train.json') as json_file: 
    train_data = json.load(json_file)

In [None]:
# Number of samples (recipes) in the training set
length = len(train_data)

In [None]:
# Collect the unique ingredients as dictionary keys
# (the values are placeholders; only the keys are used)
igdDict = dict()
for x in train_data:
    for igd in x['ingredients']:
        igdDict[igd] = 0
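
The dictionary above is effectively used as an ordered set: since Python 3.7, dict keys keep their insertion order, so ingredients come out in order of first appearance. A minimal sketch of the same idea with `dict.fromkeys`, on made-up data (`toy_data` and `unique_igd` are illustrative names, not part of the notebook):

```python
# Made-up samples with the same shape as entries of train.json.
toy_data = [
    {"ingredients": ["garlic", "pepper"]},
    {"ingredients": ["pepper", "plain flour"]},
]

# dict.fromkeys drops duplicates while keeping first-appearance order.
unique_igd = list(dict.fromkeys(igd for x in toy_data for igd in x["ingredients"]))

print(unique_igd)  # ['garlic', 'pepper', 'plain flour']
```

A plain `set` would also deduplicate, but its iteration order is not guaranteed to follow first appearance, so the column order could differ between runs.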

In [None]:
# Initialise every ingredient column to 0 (not present) for all samples
igdValues = dict()
for igd in igdDict:
    igdValues[igd] = [0] * length

In [None]:
# One-hot encode: set 1 for each ingredient present in sample i, and record the sample's cuisine
cuisine = []
for i, x in enumerate(train_data):
    cuisine.append(x['cuisine'])
    for igd in x['ingredients']:
        igdValues[igd][i] = 1

In [None]:
# Add the cuisine labels under the key "cuisine",
# so they become the last column of the exported table

igdValues["cuisine"] = cuisine
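
As an alternative to a dictionary of Python lists, the same encoding can be sketched with a preallocated NumPy matrix, which tends to use less memory when there are thousands of ingredient columns. This is only a sketch on made-up data: `toy_data`, `col_index`, `matrix`, and `encoded` are illustrative names, and the `uint8` dtype is an assumption that suits 0/1 values.

```python
import numpy as np
import pandas as pd

# Made-up samples with the same shape as entries of train.json.
toy_data = [
    {"cuisine": "greek", "ingredients": ["romaine lettuce", "garlic"]},
    {"cuisine": "southern_us", "ingredients": ["plain flour", "pepper"]},
    {"cuisine": "filipino", "ingredients": ["garlic", "pepper"]},
]

# Unique ingredients in order of first appearance, and their column positions.
columns = list(dict.fromkeys(igd for x in toy_data for igd in x["ingredients"]))
col_index = {igd: j for j, igd in enumerate(columns)}

# Start with all zeros, then set 1 where sample i contains the ingredient.
matrix = np.zeros((len(toy_data), len(columns)), dtype=np.uint8)
for i, x in enumerate(toy_data):
    for igd in x["ingredients"]:
        matrix[i, col_index[igd]] = 1

# Wrap in a DataFrame and append the label column last.
encoded = pd.DataFrame(matrix, columns=columns)
encoded["cuisine"] = [x["cuisine"] for x in toy_data]
print(encoded)
```

The resulting table has the same layout as the dictionary-based version: one 0/1 column per ingredient, followed by `cuisine`.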

In [None]:
# Export for model training later
pd.DataFrame(igdValues).to_csv('trainOneHotEncoded.csv', index=False)

In [None]:
train_df = pd.read_csv('trainOneHotEncoded.csv')
train_df.head()

Unnamed: 0,romaine lettuce,black olives,grape tomatoes,garlic,pepper,purple onion,seasoning,garbanzo beans,feta cheese crumbles,plain flour,...,Challenge Butter,orange glaze,cholesterol free egg substitute,ciabatta loaf,Lipton® Iced Tea Brew Family Size Tea Bags,Hidden Valley® Greek Yogurt Original Ranch® Dip Mix,lop chong,tomato garlic pasta sauce,crushed cheese crackers,cuisine
0,1,1,1,1,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,greek
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,southern_us
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,filipino
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,indian
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,indian
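
A quick sanity check on an encoded table is to confirm that every feature value is 0 or 1 and that each row sums to the number of distinct ingredients in the corresponding recipe. The sketch below runs on a small hand-built table; `toy_data` and `encoded` are made-up stand-ins for `train_data` and `train_df`.

```python
import pandas as pd

# Made-up samples and their hand-written one-hot encoding.
toy_data = [
    {"cuisine": "greek", "ingredients": ["garlic"]},
    {"cuisine": "indian", "ingredients": ["pepper"]},
    {"cuisine": "indian", "ingredients": ["garlic", "pepper"]},
]
encoded = pd.DataFrame({
    "garlic": [1, 0, 1],
    "pepper": [0, 1, 1],
    "cuisine": ["greek", "indian", "indian"],
})

# Separate the 0/1 feature columns from the label column.
features = encoded.drop(columns="cuisine")

# Check 1: only 0 and 1 appear in the feature columns.
all_binary = bool(features.isin([0, 1]).all().all())

# Check 2: each row sum equals the recipe's number of distinct ingredients.
row_sums = features.sum(axis=1).tolist()
expected = [len(set(x["ingredients"])) for x in toy_data]

print(all_binary, row_sums == expected)
```

The same two checks could be applied to the full table, with `train_df` in place of `encoded` and `train_data` in place of `toy_data`.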
