# SchlemmerSlammer - Neural Network for recipes
---
## 1. Team Members
### Maxim Bex
research, documentation, concept, coding
### Hannes Gelbhardt
coding, research, concept, documentation
### York Smeddinck
documentation, concept, coding, research

---
## 2. Problem Description
Our idea was to train an AI that recommends or creates recipes based on given attributes (e.g. rating, ingredients to include or exclude, calories, sodium, fat, FODMAP, etc.). The following graphic is an entity relationship diagram for recipes, which we created to get a better understanding of the nature and complexity of our problem.

![Entity Relationship Diagram for Recipes](https://i.imgur.com/SqWTTtY.png)

Since this would be a rather complex system to begin with, we decided to break the problem down and start with a smaller application first: we are trying to train a neural network to recommend recipe ingredients based on given ingredients.


### General Knowledge
One of the more basic problems in natural language processing is that, unlike humans, computers can only understand numbers, more specifically ones and zeroes. It is not too hard to represent a language's syntax in numbers, but when it comes to meaning and context, humans are far better at understanding each other. The main reason is that humans often share a similar context (background knowledge), so sometimes they can even communicate without words. To make this clearer: when we talk about cooking, most humans have a similar concept of preparing food in mind, whereas a computer only knows the word "cooking" from its language context, without actually knowing or understanding what it means.

---
## 3. Dataset
### Source: https://www.kaggle.com/hugodarwood/epirecipes
The dataset is a collection of about 21000 recipes provided in .csv and .json format. The .json file contains all information about the recipes, while the .csv file is basically a list of attributes and ingredients for all recipes. We opted for the .json file, which includes more context for the ingredients (e.g. "3 evenly chopped tomatoes") than the .csv file (e.g. "3 tomatoes"), which should result in a more complex model.
### Processing the data:
Because our focus is on the relationship between ingredients, we preprocess the data from the json format into a text file in which each line lists the ingredients of one recipe, excluding special characters (see the variable badchars in the code below).

The following script reads the input json file 'full_format_recipes.json', strips it of all unwanted characters defined in the variable 'badchars' and stores the result in the output file 'preProcessedData.txt'.

In [0]:
import json

badchars = set("(),-*\"'<>|:")

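# Read the JSON input explicitly as UTF-8 so that decoding does not depend on the platform default
with open('full_format_recipes.json', encoding='utf-8') as json_file:
    data = json.load(json_file)
    outFile = "./preProcessedData.txt"
    # One output line per recipe
    with open(outFile, "w") as f: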
        for recipe in data:
            try:
                ingredients = recipe['ingredients']
                for ingredient in ingredients:
                    for c in badchars:
                        ingredient = ingredient.replace(c,' ')
                    try:
                        f.write(ingredient+" ")
                    except UnicodeEncodeError:
                        print("UnicodeEncodeError")
                f.write("\n")
            except KeyError:
                print("KeyError")

### Problems during preprocessing
The first problem was that the Windows command line could not process specific characters of the dataset. We first tried to solve this by setting the command line charset to UTF-8, but even then we could only read and process about 700 recipes, not all of them.

Our solution to this was to do all the preprocessing in a Python script (see above).

Another problem was the unwanted "bad" characters (symbols such as brackets, quotes and dashes that are neither letters nor digits) and how to exclude them.
Our solution is to check every ingredient string for these bad characters and to replace each of them with a space.
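
The replacement can also be done in a single pass per string. The following is a minimal sketch, assuming the same badchars set as in our script; `clean_ingredient` is a hypothetical helper name that does not appear in our original code.

```python
# Same set of unwanted characters as in the preprocessing script
badchars = set("(),-*\"'<>|:")

# Translation table that maps every bad character to a single space
BAD_TO_SPACE = str.maketrans({c: " " for c in badchars})


def clean_ingredient(ingredient):
    """Replace every bad character in one ingredient string with a space."""
    return ingredient.translate(BAD_TO_SPACE)


print(clean_ingredient("1/2 cup (packed) fresh basil, chopped"))
```

The doubled spaces this leaves behind are harmless, because the lines are later split on whitespace.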

We also discussed whether to keep numbers (e.g. ¾) and measurement units (e.g. tbsp or cup). We decided that our neural network would probably benefit from being able to put those aspects into context as well, so we kept them.

## 4. Neural Network(s)
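In our project we tried two things: training our own skip-gram word embedding on the preprocessed ingredient lists with MXNet/Gluon, and loading pretrained GloVe vectors with gensim for comparison.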

In [0]:
import sys
# Make the parent directory importable before importing d2l from it
sys.path.insert(0, '..')
import d2l

import collections
import json
import math
import random
import time

from mxnet import autograd, gluon, nd
from mxnet.gluon import data as gdata, loss as gloss, nn
import numpy as np
import pandas as pd

# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec


We read the file and count its lines to get an idea of how many recipes we are working with.

In [0]:
with open('./preProcessedData.txt', 'r') as f:
    lines = f.readlines()
    ingredients = [st.split() for st in lines]
'# sentences: %d' % len(ingredients)

To get more meaningful results, we drop all tokens that occur fewer than 3 times in the dataset.

In [0]:
counter = collections.Counter([tk for st in ingredients for tk in st])
counter = dict(filter(lambda x: x[1] >= 3, counter.items()))

Next we map every remaining token to an index and count the total number of tokens in the dataset.

In [0]:
idx_to_token = [tk for tk, _ in counter.items()]
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
dataset = [[token_to_idx[tk] for tk in st if tk in token_to_idx]
           for st in ingredients]
num_tokens = sum([len(st) for st in dataset])
'# tokens: %d' % num_tokens

To make the relationship between the most frequent words and the less frequent ones more meaningful, and to speed up training, we use subsampling, which randomly discards occurrences of frequent tokens. Afterwards we print the number of tokens in the reduced dataset.

In [0]:
def discard(idx):
    return random.uniform(0, 1) < 1 - math.sqrt(
        1e-4 / counter[idx_to_token[idx]] * num_tokens)

subsampled_dataset = [[tk for tk in st if not discard(tk)] for st in dataset]
'# tokens: %d' % sum([len(st) for st in subsampled_dataset])
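
The discard function above drops an occurrence of a token with probability max(0, 1 - sqrt(t / f)), where f is the token's share of all tokens and t = 1e-4. The following sketch evaluates that probability for two made-up counts; the numbers are illustrative and not taken from our dataset, and `discard_probability` is a hypothetical helper.

```python
import math


def discard_probability(count, num_tokens, t=1e-4):
    """Probability that one occurrence of a token is dropped by subsampling."""
    frequency = count / num_tokens  # share of the token among all tokens
    return max(0.0, 1.0 - math.sqrt(t / frequency))


# Illustrative counts for a corpus of 1,000,000 tokens (assumed numbers)
for token, count in [("cup", 10_000), ("saffron", 50)]:
    print(token, round(discard_probability(count, 1_000_000), 3))
```

Tokens whose share of the dataset is below t are never discarded, so only the very frequent ones are thinned out.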

For comparison we count the occurrences of the frequently used word "cup" before and after subsampling.

In [0]:
def compare_counts(token):
    return '# %s: before=%d, after=%d' % (token, sum(
        [st.count(token_to_idx[token]) for st in dataset]), sum(
        [st.count(token_to_idx[token]) for st in subsampled_dataset]))

compare_counts('cup')

In [0]:
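# For every token (the center word), collect the tokens inside a randomly
# sized window around it (its context words)
def get_centers_and_contexts(dataset, max_window_size):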
    centers, contexts = [], []
    for st in dataset:
        if len(st) < 2:
            continue
        centers += st
        for center_i in range(len(st)):
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, center_i - window_size),
                                 min(len(st), center_i + 1 + window_size)))
            indices.remove(center_i)
            contexts.append([st[idx] for idx in indices])
    return centers, contexts

In [0]:
#Create an artificial dataset with two sentences of 7 and 3 words
tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)

In the following steps we implement the skip-gram model. The first thing to do is to give the individual words a context-based meaning, so that, for example, in "I cook carrots" and "I cook potatoes" the words carrots, potatoes and cook end up closely related. On top of that, subsampling makes the later training faster, because the dataset is smaller.

In [0]:
#Extract all center words and their context words with a maximum window size of 5
all_centers, all_contexts = get_centers_and_contexts(subsampled_dataset, 5)

Negative sampling is a second mechanism to improve the performance of training, and it also tends to give slightly better results. Without it, every training step would have to update the weights of all other words in the vocabulary and not only those of the target word, which would require an enormous amount of compute power. With negative sampling we only pick a few randomly chosen words that do not occur in the context and update their weights with respect to the target word.


In [0]:
#Negative sampling for training
def get_negatives(all_contexts, sampling_weights, K):
    all_negatives, neg_candidates, i = [], [], 0
    population = list(range(len(sampling_weights)))
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            if i == len(neg_candidates):
                i, neg_candidates = 0, random.choices(
                    population, sampling_weights, k=int(1e5))
            neg, i = neg_candidates[i], i + 1
            # Noise words cannot be context words
            if neg not in set(contexts):
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

sampling_weights = [counter[w]**0.75 for w in idx_to_token]
all_negatives = get_negatives(all_contexts, sampling_weights, 5)
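
The sampling weights above raise each token count to the power 0.75, which shifts some probability from very frequent tokens to rarer ones. A minimal sketch with assumed counts (not taken from our dataset) shows the effect; `sampling_probabilities` is a hypothetical helper.

```python
def sampling_probabilities(counts, power=0.75):
    """Turn raw token counts into noise-sampling probabilities."""
    weights = {tk: c ** power for tk, c in counts.items()}
    total = sum(weights.values())
    return {tk: w / total for tk, w in weights.items()}


# Illustrative counts (assumed, not taken from our dataset)
counts = {"cup": 1000, "saffron": 10}
raw = sampling_probabilities(counts, power=1.0)
smoothed = sampling_probabilities(counts, power=0.75)
for tk in counts:
    print('%s: raw=%.4f, smoothed=%.4f' % (tk, raw[tk], smoothed[tk]))
```

With the exponent, the rare token is drawn as a noise word more often than its raw frequency would suggest, while the frequent token is drawn less often.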


In [0]:
#Collate function for the data loader: pads all examples of a mini-batch to the same length and builds the matching masks and labels
def batchify(data):
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    return (nd.array(centers).reshape((-1, 1)), nd.array(contexts_negatives),
            nd.array(masks), nd.array(labels))
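
To see what batchify does to examples of different lengths, here is a plain-Python sketch of the same padding logic that returns lists instead of NDArrays. `pad_batch` is a hypothetical name used only for this illustration, and the token indices are made up.

```python
def pad_batch(data):
    """Pad (center, context, negative) examples to a common length."""
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        pad = max_len - cur_len
        centers.append(center)
        contexts_negatives.append(context + negative + [0] * pad)
        masks.append([1] * cur_len + [0] * pad)  # 0 marks padding
        labels.append([1] * len(context) + [0] * (max_len - len(context)))
    return centers, contexts_negatives, masks, labels


# Two toy examples: center 1 with 2 context words, center 4 with 1
example = [(1, [2, 3], [7, 8, 9, 5]), (4, [6], [8, 9])]
centers, contexts_negatives, masks, labels = pad_batch(example)
print(contexts_negatives)
print(masks)
print(labels)
```

The mask separates real entries from padding, while the label separates context words (1) from noise words and padding (0).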

In [0]:
#We use the previously defined batchify function to specify the data loader instance
#and print the shape of each variable in the first batch read
batch_size = 512
#Use 4 worker processes for data loading, except on Windows, where we load in the main process
num_workers = 0 if sys.platform.startswith('win32') else 4
dataset = gdata.ArrayDataset(all_centers, all_contexts, all_negatives)
data_iter = gdata.DataLoader(dataset, batch_size, shuffle=True,
                             batchify_fn=batchify, num_workers=num_workers)
#Print the shape of each array in the first batch only
for batch in data_iter:
    for name, data in zip(['centers', 'contexts_negatives', 'masks',
                           'labels'], batch):
        print(name, 'shape:', data.shape)
    break

In [0]:
#Skip-Gram Model
#Embedding layer for a vocabulary of 20 tokens (input_dim) with 4-dimensional vectors (output_dim)
embed = nn.Embedding(input_dim=20, output_dim=4)
embed.initialize()
embed.weight

In [0]:
#The input of the embedding layer is the index of a word; every index is replaced
#by its vector, so an input of shape (2, 3) gives an output of shape (2, 3, 4)
x = nd.array([[1, 2, 3], [4, 5, 6]])
embed(x)

In [0]:
#Mini-batch matrix multiplication: shapes (2, 1, 4) and (2, 4, 6) give (2, 1, 6)
X = nd.ones((2, 1, 4))
Y = nd.ones((2, 4, 6))
nd.batch_dot(X, Y).shape


In [0]:
#Skip-Gram Forward Calculation
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = nd.batch_dot(v, u.swapaxes(1, 2))
    return pred
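
The batch_dot call in skip_gram computes, for every center word, the dot product between its vector and the vector of each context or noise word. The following plain-Python sketch shows the same arithmetic on toy vectors. It leaves out the extra axis of size 1 that the NDArray version carries, and `skip_gram_scores` is a hypothetical name.

```python
def skip_gram_scores(v, u):
    """Dot product of each center vector with its candidate vectors.

    v: list of center vectors, one per example
    u: list of lists of context/noise vectors, one list per example
    """
    return [[sum(a * b for a, b in zip(center, cand)) for cand in cands]
            for center, cands in zip(v, u)]


# One example, embedding size 2, three candidate words (made-up numbers)
v = [[1.0, 2.0]]
u = [[[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]]
print(skip_gram_scores(v, u))
```

A large positive score means the model considers the candidate a likely context word for the center word.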


In [0]:
#We use Gluon's sigmoid binary cross-entropy loss function, SigmoidBinaryCrossEntropyLoss
loss = gloss.SigmoidBinaryCrossEntropyLoss()

In [0]:
#The mask variable selects which predictions and labels take part in the loss calculation
pred = nd.array([[1.5, 0.3, -1, 2], [1.1, -0.6, 2.2, 0.4]])
# 1 and 0 in the label variable represent context words and noise
# words, respectively
label = nd.array([[1, 0, 0, 0], [1, 1, 0, 0]])
mask = nd.array([[1, 1, 1, 1], [1, 1, 1, 0]])  # Mask variable
loss(pred, label, mask) * mask.shape[1] / mask.sum(axis=1)

In [0]:
#Manual calculation of the binary cross-entropy loss for comparison:
#only predictions with a mask value of 1 are included, and the sign of the prediction is flipped where the label is 0
def sigmd(x):
    return -math.log(1 / (1 + math.exp(-x)))

print('%.7f' % ((sigmd(1.5) + sigmd(-0.3) + sigmd(1) + sigmd(-2)) / 4))
print('%.7f' % ((sigmd(1.1) + sigmd(-0.6) + sigmd(-2.2)) / 3))
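
The two values printed above correspond to the masked, normalized loss for row $i$ of the mini-batch, written with prediction $x_{ij}$, label $y_{ij}$, mask $m_{ij}$ and the sigmoid function $\sigma$. This assumes that Gluon's loss applies the mask as a weight and then averages over the second axis, which is why the code multiplies by mask.shape[1] and divides by the mask sum:

$$
\ell_i = \frac{1}{\sum_j m_{ij}} \sum_j m_{ij} \left[ -y_{ij} \log \sigma(x_{ij}) - (1 - y_{ij}) \log\left(1 - \sigma(x_{ij})\right) \right]
$$

Because $1 - \sigma(x) = \sigma(-x)$, a noise word (label 0) contributes $-\log \sigma(-x_{ij})$, which is why sigmd is called with the negated prediction for those entries.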

In [0]:
#Initialize the model: two embedding layers (for center words and for context/noise words)
#with an embedding size of 100, held in a Sequential container
embed_size = 100
net = nn.Sequential()
net.add(nn.Embedding(input_dim=len(idx_to_token), output_dim=embed_size),
        nn.Embedding(input_dim=len(idx_to_token), output_dim=embed_size))

In [0]:
#Function for the training process
def train(net, lr, num_epochs):
    ctx = d2l.try_gpu()
    net.initialize(ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    for epoch in range(num_epochs):
        start, l_sum, n = time.time(), 0.0, 0
        for batch in data_iter:
            center, context_negative, mask, label = [
                data.as_in_context(ctx) for data in batch]
            with autograd.record():
                pred = skip_gram(center, context_negative, net[0], net[1])
                # Use the mask variable to avoid the effect of padding on loss
                # function calculations
                l = (loss(pred.reshape(label.shape), label, mask) *
                     mask.shape[1] / mask.sum(axis=1))
            l.backward()
            trainer.step(batch_size)
            l_sum += l.sum().asscalar()
            n += l.size
        print('epoch %d, loss %.2f, time %.2fs'
              % (epoch + 1, l_sum / n, time.time() - start))

In [0]:
train(net, 0.005, 3)

In [0]:
def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data()
    x = W[token_to_idx[query_token]]
    # The added 1e-9 is for numerical stability
    cos = nd.dot(W, x) / (nd.sum(W * W, axis=1) * nd.sum(x * x) + 1e-9).sqrt()
    topk = nd.topk(cos, k=k+1, ret_typ='indices').asnumpy().astype('int32')
    for i in topk[1:]:  # Skip the first hit, which is the query token itself
        print('cosine sim=%.3f: %s' % (cos[i].asscalar(), (idx_to_token[i])))

get_similar_tokens('toasted', 10, net[0])
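
The similarity search above ranks all tokens by the cosine similarity between their vector and the query vector. Here is a plain-Python sketch of the same idea on made-up 2-dimensional vectors; `cosine_similarity` and `most_similar` are hypothetical helpers, not part of MXNet.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-9)  # 1e-9 avoids division by zero


def most_similar(query, vectors, k):
    """Return the k tokens whose vectors are closest to the query token."""
    scores = [(cosine_similarity(vectors[query], vec), tk)
              for tk, vec in vectors.items() if tk != query]
    return [tk for _, tk in sorted(scores, reverse=True)[:k]]


# Toy 2-dimensional embeddings (made up for illustration)
toy_vectors = {
    "toasted": [1.0, 0.1],
    "roasted": [0.9, 0.2],
    "sugar": [0.1, 1.0],
}
print(most_similar("toasted", toy_vectors, 2))
```

Cosine similarity only compares directions, so tokens with vectors of very different length can still count as similar.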

In [0]:
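#Path to the pretrained GloVe vectors (100 dimensions)
glove_file = './glove.6B.100d.txt'

#Temporary file for the GloVe vectors converted to word2vec format
#(despite its name it does not contain our recipe data)
word2vec_glove_file = get_tmpfile("preProcessedData.txt")

glove2word2vec(glove_file, word2vec_glove_file)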

In [0]:
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

As we can see below, the pretrained GloVe model gives us pretty good results for word similarity.

In [0]:
model.most_similar('tomato')

In the following step we define a function that plots the given words, or a random sample of the vocabulary, in two dimensions via Principal Component Analysis (PCA).

In [0]:
def display_pca_scatterplot(model, words=None, sample=0):
    if words is None:
        # model.vocab is the gensim 3.x API; gensim 4 renamed it to model.key_to_index
        if sample > 0:
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = [word for word in model.vocab]

    word_vectors = np.array([model[w] for w in words])

    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)

In [0]:
display_pca_scatterplot(model, ['tomato','lettuce','onion','lamb','pork','salt','pepper','chopped','sliced','happy','beef','minced','flour','cake','bake','raw','italian','german','french','greek'], sample=100)

As we can see in the figure, the result includes words that are unrelated to our recipes. This is due to the underlying GloVe file, which was built from text collected by a Wikipedia crawler rather than from recipe data.

The problem we ran into here is that we cannot work with our previously trained network in the same way: we could not use net.save, t-SNE or similar functions that are available for word2vec models.
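
One possible workaround, sketched here under assumptions and not tested against our trained network, is to write the learned vectors to the plain-text word2vec format: a header line with vocabulary size and dimension, followed by one line per token. gensim's KeyedVectors.load_word2vec_format can read this format. `save_word2vec_text` is a hypothetical helper, and the toy lists stand in for idx_to_token and the embedding matrix, which in the notebook would presumably come from net[0].weight.data().asnumpy().

```python
import os
import tempfile


def save_word2vec_text(path, tokens, vectors):
    """Write tokens and their vectors in the word2vec text format."""
    with open(path, "w", encoding="utf-8") as f:
        # Header line: vocabulary size and embedding dimension
        f.write("%d %d\n" % (len(tokens), len(vectors[0])))
        for token, vector in zip(tokens, vectors):
            f.write(token + " " + " ".join("%.6f" % x for x in vector) + "\n")


# Toy data standing in for idx_to_token and the embedding matrix
tokens = ["salt", "pepper"]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
out_path = os.path.join(tempfile.gettempdir(), "recipe_vectors.txt")
save_word2vec_text(out_path, tokens, vectors)

with open(out_path, encoding="utf-8") as f:
    written = f.read().splitlines()
print(written)
```

A file written this way could then be loaded like the converted GloVe file, so that most_similar and the PCA plot operate on our own recipe embeddings.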