# SchlemmerSlammer - Neural Network for recipes

## 1. Team members

### Maxim Bex
research, documentation, concept, coding

### Hannes Gelbhardt
coding, research, concept, documentation 

### York Smeddinck
documentation, concept, coding, research

## 2. Problem Description

We originally planned to train an AI to recommend or create recipes based on given attributes, e.g. rating, ingredients to include or exclude, calories, sodium, fat, or FODMAP content.
The following graphic is an entity relationship diagram for recipes, which we created to get a better understanding of the nature and complexity of our problem.

![Recipe Entity Relation Diagram](https://i.imgur.com/PGj3fdZ.png)

Since this would be a rather complex system to begin with, we decided to break the problem down, and start with a smaller application first.
Thus we are trying to train an AI to recommend recipe ingredients based on given ingredients.

Our rough plan is to use Word2Vec to put the content of the recipe directions in context, then train a model on this data, and finally visualize our results using t-SNE.




## 3. Dataset description
### Source: https://www.kaggle.com/hugodarwood/epirecipes
The dataset is a collection of about 21,000 recipes provided in .csv and .json formats. The JSON file contains all information about the recipes, while the CSV file is essentially a list of attributes and ingredients for each recipe.
We opted for the JSON file, which includes more context for the ingredients (e.g. "3 evenly chopped tomatoes") than the CSV file (e.g. "3 tomatoes"), which should result in a more complex model.
### Processing the data:
Because we focus on the relationships between ingredients, we preprocess the JSON data into a text file in which each line lists the ingredients of one recipe.
#### Problems during preprocessing
The first problem was that the Windows command line could not write certain characters from the dataset to a file. Setting the command-line charset to UTF-8 solved this only partially:
unfortunately, we could read and process only about 700 more recipes, not all of them.

In [0]:
import json

# Punctuation characters that are replaced by a space before writing
badchars = set("(),-*\"'<>|:")

with open('full_format_recipes.json') as json_file:
    data = json.load(json_file)
    outFile = "./preProcessedData.txt"
    with open(outFile, "w") as f:
        for recipe in data:
            try:
                ingredients = recipe['ingredients']
                for ingredient in ingredients:
                    for c in badchars:
                        ingredient = ingredient.replace(c, ' ')
                    try:
                        f.write(ingredient + " ")
                    except UnicodeEncodeError:
                        # The character cannot be written in the file's encoding
                        print("UnicodeEncodeError")
                # One recipe per line
                f.write("\n")
            except KeyError:
                # The recipe entry has no 'ingredients' key
                print("KeyError")
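
The encoding errors described above are typical of files opened without an explicit `encoding` argument, in which case Python falls back to the platform default. The following is a minimal sketch of the same preprocessing step with UTF-8 set explicitly for reading and writing. The helper name `preprocess` is an illustrative assumption, not part of the original notebook, and the sketch has not been run against the full dataset, so it may not recover every recipe.

```python
import json

BAD_CHARS = set("(),-*\"'<>|:")


def preprocess(in_path, out_path):
    """Write one line per recipe containing its cleaned ingredient strings.

    Returns the number of recipes written. `preprocess` is a hypothetical
    helper used for illustration only.
    """
    # Explicit UTF-8 avoids depending on the platform default encoding
    with open(in_path, encoding="utf-8") as json_file:
        data = json.load(json_file)

    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for recipe in data:
            ingredients = recipe.get("ingredients")
            if ingredients is None:
                # Skip entries without an ingredient list
                continue
            cleaned = []
            for ingredient in ingredients:
                for c in BAD_CHARS:
                    ingredient = ingredient.replace(c, " ")
                cleaned.append(ingredient)
            out.write(" ".join(cleaned) + "\n")
            written += 1
    return written
```

A call such as `preprocess('full_format_recipes.json', './preProcessedData.txt')` would mirror the cell above.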

In [0]:
import collections
import json
import math
import random
import sys
import time

# Add the parent directory to the module search path before importing d2l
sys.path.insert(0, '..')
import d2l

import numpy as np
import pandas as pd
from mxnet import autograd, gluon, nd
from mxnet.gluon import data as gdata, loss as gloss, nn

# Enable interactive Matplotlib plots in the notebook
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

In [0]:
# Read the input data, treating each line as one sentence, and print the number of sentences
with open('./data.txt', 'r') as f:
    lines = f.readlines()
    # st is the abbreviation of "sentence" in the loop
    ingredients = [st.split() for st in lines]

'# sentences: %d' % len(ingredients)

'# sentences: 1076'

In [0]:
# For test purposes, print the first 5 tokens of the first 3 lines
for st in ingredients[:3]:
    print('# tokens:', len(st), st[:5])

# tokens: 90 ["['4", 'cups', 'low-sodium', 'vegetable', 'or']
# tokens: 123 ["['1", '1/2', 'cups', 'whipping', "cream',"]
# tokens: 38 ["['1", 'fennel', 'bulb', '(sometimes', 'called']


In [0]:
# tk is an abbreviation for "token" in the loop
counter = collections.Counter([tk for st in ingredients for tk in st])
counter = dict(filter(lambda x: x[1] >= 5, counter.items()))

In [0]:
# Map tokens to indices, drop tokens that are not in the vocabulary,
# and count the total number of remaining tokens in the dataset
idx_to_token = [tk for tk, _ in counter.items()]
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
dataset = [[token_to_idx[tk] for tk in st if tk in token_to_idx]
           for st in ingredients]
num_tokens = sum([len(st) for st in dataset])
'# tokens: %d' % num_tokens

'# tokens: 56124'

In [0]:
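# Subsampling: randomly discard occurrences of frequent tokens so that they dominate less.
# A token with relative frequency f is discarded with probability max(0, 1 - sqrt(1e-4 / f)).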
def discard(idx):
    return random.uniform(0, 1) < 1 - math.sqrt(
        1e-4 / counter[idx_to_token[idx]] * num_tokens)

subsampled_dataset = [[tk for tk in st if not discard(tk)] for st in dataset]
'# tokens: %d' % sum([len(st) for st in subsampled_dataset])


'# tokens: 14761'

In [0]:

#Wordcount before and after subsampling of the word "cup"
def compare_counts(token):
    return '# %s: before=%d, after=%d' % (token, sum(
        [st.count(token_to_idx[token]) for st in dataset]), sum(
        [st.count(token_to_idx[token]) for st in subsampled_dataset]))

compare_counts('cup')

'# cup: before=2259, after=112'
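
As a sanity check on the subsampling rule, the expected number of surviving occurrences of "cup" can be computed directly from the figures printed above (2259 occurrences among 56124 tokens). The helper name `keep_probability` is an assumption introduced for this sketch; the threshold `t=1e-4` is the constant used in `discard`.

```python
import math


def keep_probability(count, num_tokens, t=1e-4):
    """Probability that a single occurrence of a token survives subsampling."""
    freq = count / num_tokens  # relative frequency f of the token
    # discard() drops a token with probability 1 - sqrt(t / f);
    # rare tokens (sqrt(t / f) >= 1) are therefore always kept
    return min(1.0, math.sqrt(t / freq))


# Figures taken from the notebook output above
num_tokens = 56124
cup_count = 2259

p_keep = keep_probability(cup_count, num_tokens)
expected_after = cup_count * p_keep
print('keep probability: %.4f' % p_keep)
print('expected count after subsampling: %.1f' % expected_after)
```

The expected count comes out at roughly 112.6, which agrees with the 112 occurrences observed after subsampling.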

In [0]:
# Extract center words and their context words: the context of a center word consists of
# the words within a randomly sized window (1 to max_window_size) around it
def get_centers_and_contexts(dataset, max_window_size):
    centers, contexts = [], []
    for st in dataset:
        # Each sentence needs at least 2 words to form a
        # "central target word - context word" pair
        if len(st) < 2:
            continue
        centers += st
        for center_i in range(len(st)):
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, center_i - window_size),
                                 min(len(st), center_i + 1 + window_size)))
            # Exclude the central target word from the context words
            indices.remove(center_i)
            contexts.append([st[idx] for idx in indices])
    return centers, contexts

In [0]:
# Create an artificial dataset with two sentences of 7 and 3 words, respectively
tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)

dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1]
center 1 has contexts [0, 2]
center 2 has contexts [1, 3]
center 3 has contexts [2, 4]
center 4 has contexts [2, 3, 5, 6]
center 5 has contexts [3, 4, 6]
center 6 has contexts [4, 5]
center 7 has contexts [8]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]


In [0]:
# Extract all center words and context words from the subsampled dataset, using a maximum window size of 5
all_centers, all_contexts = get_centers_and_contexts(subsampled_dataset, 5)

In [0]:
#Negative sampling for training
def get_negatives(all_contexts, sampling_weights, K):
    all_negatives, neg_candidates, i = [], [], 0
    population = list(range(len(sampling_weights)))
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            if i == len(neg_candidates):
                # An index of k words is randomly generated as noise words
                # based on the weight of each word (sampling_weights). For
                # efficient calculation, k can be set slightly larger
                i, neg_candidates = 0, random.choices(
                    population, sampling_weights, k=int(1e5))
            neg, i = neg_candidates[i], i + 1
            # Noise words cannot be context words
            if neg not in set(contexts):
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

sampling_weights = [counter[w]**0.75 for w in idx_to_token]
all_negatives = get_negatives(all_contexts, sampling_weights, 5)
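
Raising the counts to the power 0.75 flattens the sampling distribution, so rare tokens are drawn as noise words more often than their raw frequency would suggest. A small illustration with made-up counts follows; the function name `sampling_probabilities` and the toy counts are assumptions for this sketch, not values from the dataset.

```python
def sampling_probabilities(counts, power=0.75):
    """Normalize count**power into a probability distribution over tokens."""
    weights = {tk: c ** power for tk, c in counts.items()}
    total = sum(weights.values())
    return {tk: w / total for tk, w in weights.items()}


# Hypothetical token counts, chosen so that the powers are easy to verify by hand
toy_counts = {'cup': 81, 'butter': 16, 'saffron': 1}

raw = sampling_probabilities(toy_counts, power=1.0)  # plain relative frequency
smoothed = sampling_probabilities(toy_counts)        # count ** 0.75

for tk in toy_counts:
    print('%-8s raw=%.3f smoothed=%.3f' % (tk, raw[tk], smoothed[tk]))
```

With these counts the weights are 27, 8, and 1, so the most frequent token drops from about 0.83 to 0.75 while the rarest one rises from about 0.01 to about 0.03.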

In [0]:
# Collate function for minibatches: concatenate context and noise words, pad every example
# to the same length, and build the matching mask and label arrays
def batchify(data):
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    return (nd.array(centers).reshape((-1, 1)), nd.array(contexts_negatives),
            nd.array(masks), nd.array(labels))

In [0]:
# Use the previously defined batchify function as the data loader's collate function
# and print the shape of each variable in the first batch read
batch_size = 512
# Number of worker processes for data loading: none on Windows, 4 elsewhere
num_workers = 0 if sys.platform.startswith('win32') else 4
dataset = gdata.ArrayDataset(all_centers, all_contexts, all_negatives)
data_iter = gdata.DataLoader(dataset, batch_size, shuffle=True,
                             batchify_fn=batchify, num_workers=num_workers)
# Pair each array in the batch with its name
for batch in data_iter:
    for name, data in zip(['centers', 'contexts_negatives', 'masks',
                           'labels'], batch):
        print(name, 'shape:', data.shape)
    break

centers shape: (512, 1)
contexts_negatives shape: (512, 60)
masks shape: (512, 60)
labels shape: (512, 60)
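
To make the padding scheme easier to follow, here is the per-example logic of `batchify` written with plain Python lists. The helper name `pad_example` and the toy inputs are assumptions for illustration; the real function additionally stacks the results into MXNet arrays.

```python
def pad_example(context, negative, max_len):
    """Return (contexts_negatives, mask, label) for one training example."""
    cur_len = len(context) + len(negative)
    pad = max_len - cur_len
    # Context words first, then noise words, then zero padding
    contexts_negatives = context + negative + [0] * pad
    # 1 marks real entries, 0 marks padding
    mask = [1] * cur_len + [0] * pad
    # 1 marks context words (positives), 0 marks noise words and padding
    label = [1] * len(context) + [0] * (max_len - len(context))
    return contexts_negatives, mask, label


# Two hypothetical examples padded to a common length of 6
print(pad_example([2, 2], [3, 3, 3, 3], 6))
print(pad_example([2, 2, 2], [1, 1], 6))
```

The second example has only five real entries, so its last position is padding and its mask ends in 0.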


In [0]:
# Skip-gram model
# Embedding layer: maps each token index to a vector
embed = nn.Embedding(input_dim=20, output_dim=4)
embed.initialize()
embed.weight

Parameter embedding0_weight (shape=(20, 4), dtype=float32)

In [0]:
# The input of the embedding layer is an array of token indices; each index is replaced by its vector
x = nd.array([[1, 2, 3], [4, 5, 6]])
embed(x)


[[[ 0.01438687  0.05011239  0.00628365  0.04861524]
  [-0.01068833  0.01729892  0.02042518 -0.01618656]
  [-0.00873779 -0.02834515  0.05484822 -0.06206018]]

 [[ 0.06491279 -0.03182812 -0.01631819 -0.00312688]
  [ 0.0408415   0.04370362  0.00404529 -0.0028032 ]
  [ 0.00952624 -0.01501013  0.05958354  0.04705103]]]
<NDArray 2x3x4 @cpu(0)>

In [0]:
# Minibatch matrix multiplication: one matrix product per example in the batch
X = nd.ones((2, 1, 4))
Y = nd.ones((2, 4, 6))
nd.batch_dot(X, Y).shape

(2, 1, 6)

In [0]:
#Skip-Gram Forward Calculation
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = nd.batch_dot(v, u.swapaxes(1, 2))
    return pred
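
What `nd.batch_dot(v, u.swapaxes(1, 2))` computes can be spelled out with nested lists: for every example, the center vector is dotted with each context or noise vector. This sketch uses a hypothetical helper `skip_gram_scores` and drops the singleton middle axis, so it returns a `(batch, max_len)` list rather than the `(batch, 1, max_len)` array of the MXNet version.

```python
def skip_gram_scores(v, u):
    """Dot product of each center vector with its context and noise vectors.

    v: list of center vectors, shape (batch, embed_size)
    u: list of lists of vectors, shape (batch, max_len, embed_size)
    Returns a list of score lists, shape (batch, max_len).
    """
    return [[sum(a * b for a, b in zip(v_b, u_bj)) for u_bj in u_b]
            for v_b, u_b in zip(v, u)]


# Toy inputs: batch of 2, embedding size 2, max_len 3
v = [[1, 0], [0, 2]]
u = [[[1, 1], [2, 3], [0, 5]],
     [[1, 1], [2, 3], [0, 5]]]
print(skip_gram_scores(v, u))
```

Each score is later passed through a sigmoid inside the loss, so a large positive value means the model considers the pair a likely center/context pair.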

In [0]:
#Train a Model

In [0]:
#Gluon's binary cross entropy loss function SigmoidBinaryCrossEntropyLoss
loss = gloss.SigmoidBinaryCrossEntropyLoss()

In [0]:
# The mask excludes padded positions from the loss
pred = nd.array([[1.5, 0.3, -1, 2], [1.1, -0.6, 2.2, 0.4]])
# 1 and 0 in the label variables label represent context words and the noise
# words, respectively
label = nd.array([[1, 0, 0, 0], [1, 1, 0, 0]])
mask = nd.array([[1, 1, 1, 1], [1, 1, 1, 0]])  # Mask variable
loss(pred, label, mask) * mask.shape[1] / mask.sum(axis=1)


[0.8739896 1.2099689]
<NDArray 2 @cpu(0)>

In [0]:
# Compute the binary cross-entropy loss by hand for comparison: average the loss over
# the positions where the mask is 1 and ignore the padded position
def sigmd(x):
    return -math.log(1 / (1 + math.exp(-x)))

print('%.7f' % ((sigmd(1.5) + sigmd(-0.3) + sigmd(1) + sigmd(-2)) / 4))
print('%.7f' % ((sigmd(1.1) + sigmd(-0.6) + sigmd(-2.2)) / 3))

0.8739896
1.2099689


In [0]:
# Initialize the model parameters
embed_size = 100
net = nn.Sequential()
net.add(nn.Embedding(input_dim=len(idx_to_token), output_dim=embed_size),
        nn.Embedding(input_dim=len(idx_to_token), output_dim=embed_size))

In [0]:
#Training function
def train(net, lr, num_epochs):
    ctx = d2l.try_gpu()
    net.initialize(ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    for epoch in range(num_epochs):
        start, l_sum, n = time.time(), 0.0, 0
        for batch in data_iter:
            center, context_negative, mask, label = [
                data.as_in_context(ctx) for data in batch]
            with autograd.record():
                pred = skip_gram(center, context_negative, net[0], net[1])
                # Use the mask variable to avoid the effect of padding on loss
                # function calculations
                l = (loss(pred.reshape(label.shape), label, mask) *
                     mask.shape[1] / mask.sum(axis=1))
            l.backward()
            trainer.step(batch_size)
            l_sum += l.sum().asscalar()
            n += l.size
        print('epoch %d, loss %.2f, time %.2fs'
              % (epoch + 1, l_sum / n, time.time() - start))

In [0]:
train(net, 0.005, 3)

epoch 1, loss 0.67, time 15.31s
epoch 2, loss 0.48, time 15.15s
epoch 3, loss 0.43, time 15.30s


In [0]:
#Applying word embedding model
def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data()
    x = W[token_to_idx[query_token]]
    # The added 1e-9 is for numerical stability
    cos = nd.dot(W, x) / (nd.sum(W * W, axis=1) * nd.sum(x * x) + 1e-9).sqrt()
    topk = nd.topk(cos, k=k+1, ret_typ='indices').asnumpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print('cosine sim=%.3f: %s' % (cos[i].asscalar(), (idx_to_token[i])))

get_similar_tokens('toasted', 10, net[0])

cosine sim=0.863: 'Vegetable
cosine sim=0.860: halves
cosine sim=0.860: removed
cosine sim=0.858: top
cosine sim=0.858: jigger)
cosine sim=0.853: ginger']
cosine sim=0.849: any
cosine sim=0.849: springform
cosine sim=0.848: melted,
cosine sim=0.845: sprig
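
The ranking above is based on cosine similarity between embedding vectors. As a framework-free illustration, the same computation on a handful of made-up two-dimensional vectors looks as follows; the helper names and the toy vectors are assumptions and do not come from the trained model.

```python
import math


def cosine_similarity(a, b, eps=1e-9):
    """Cosine of the angle between a and b; eps guards against division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_sq = sum(x * x for x in a) * sum(y * y for y in b)
    return dot / math.sqrt(norm_sq + eps)


def most_similar(query, vectors, k):
    """Return the k tokens closest to `query` as (similarity, token) pairs."""
    sims = [(cosine_similarity(vectors[query], vec), tk)
            for tk, vec in vectors.items() if tk != query]
    return sorted(sims, reverse=True)[:k]


# Hypothetical embeddings for illustration only
toy_vectors = {
    'toasted': [1.0, 0.0],
    'roasted': [0.9, 0.1],
    'raw': [-1.0, 0.0],
}
for sim, tk in most_similar('toasted', toy_vectors, 2):
    print('cosine sim=%.3f: %s' % (sim, tk))
```

Vectors pointing in nearly the same direction score close to 1, opposite directions score close to -1.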


In [0]:
# Raw string so that the backslashes in the Windows path are kept literally
glove_file = r'D:\Studium\DeepLearning\Scripte\output3.txt'

word2vec_glove_file = get_tmpfile("output3.txt")
word2vec_glove_file

'C:\\Users\\Gazt\\AppData\\Local\\Temp\\output3.txt'

In [0]:
glove2word2vec(glove_file, word2vec_glove_file)

(1076, 89)

In [0]:
#model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

ValueError: could not convert string to float: 'cups'
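
The error above is consistent with `output3.txt` containing recipe text rather than numeric vectors (the 1076 lines reported by `glove2word2vec` match the sentence count printed earlier), although this cannot be confirmed from the notebook alone. One possible way forward is to export the trained embeddings in the word2vec text format, which consists of a header line with vocabulary size and dimension followed by one `token v1 v2 ...` line per token. The helper `save_word2vec_text` below is a hypothetical sketch, not part of the original notebook.

```python
def save_word2vec_text(path, tokens, vectors):
    """Write tokens and their vectors in the word2vec text format."""
    if len(tokens) != len(vectors):
        raise ValueError('tokens and vectors must have the same length')
    dim = len(vectors[0])
    with open(path, 'w', encoding='utf-8') as f:
        # Header line: vocabulary size and embedding dimension
        f.write('%d %d\n' % (len(tokens), dim))
        for tk, vec in zip(tokens, vectors):
            # Tokens must not contain whitespace in this format
            f.write(tk + ' ' + ' '.join('%.6f' % x for x in vec) + '\n')


# Toy data for illustration only
toy_tokens = ['cups', 'butter']
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
save_word2vec_text('toy_embeddings.txt', toy_tokens, toy_vectors)
```

In the notebook, `tokens` could be `idx_to_token` and `vectors` could be `net[0].weight.data().asnumpy().tolist()`; the resulting file should then be loadable with `KeyedVectors.load_word2vec_format`, though this combination has not been tested here.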