# 🥙 LSTM on Recipe Data

## Table of contents
0. [Parameters](#parameters)
1. [Load the Data](#load)
2. [Tokenise the Data](#tokenise)
3. [Create the Training Set](#create)
4. [Build the LSTM](#build)
5. [Train the LSTM](#train)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import os
import json
from pprint import pprint
import random
import re
import string

import tensorflow as tf
import tensorflow.keras as keras

## 0. Parameters <a name="parameters"></a>

In [3]:
VOCAB_SIZE = 10000
MAX_LEN = 200
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32

## 1. Load the data <a name="load"></a>

In [4]:
# Load the full dataset
with open('/app/data/epirecipes/full_format_recipes.json') as json_data:
    recipe_data = json.load(json_data)
    

In [5]:
# Filter the dataset
filtered_data = ['Recipe for ' + x['title']+ ' | ' + ' '.join(x['directions']) for x in recipe_data
              if 'title' in x
              and x['title'] is not None
              and 'directions' in x
              and x['directions'] is not None
             ]

In [6]:
# Count the recipes
n_recipes = len(filtered_data)
print(f'{n_recipes} recipes loaded')

20111 recipes loaded
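
The filter above keeps only records that have both a non-null `title` and non-null `directions`. A minimal sketch on made-up records shows which entries survive; the `toy_records` list is hypothetical and not taken from the dataset:

```python
# Hypothetical records mimicking the structure of the recipe JSON.
toy_records = [
    {"title": "Pea Soup", "directions": ["Boil peas.", "Blend."]},
    {"title": None, "directions": ["Mix."]},   # dropped: title is None
    {"directions": ["Bake."]},                 # dropped: no title key
    {"title": "Toast", "directions": None},    # dropped: directions is None
]

# Same condition as the notebook's filter, written with dict.get():
# a missing key and an explicit None are both rejected.
toy_filtered = [
    "Recipe for " + r["title"] + " | " + " ".join(r["directions"])
    for r in toy_records
    if r.get("title") is not None and r.get("directions") is not None
]

print(toy_filtered)
```

Only the first record passes, and its directions are joined into a single string after the ` | ` separator.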


In [7]:
example = filtered_data[9]
print(example)

Recipe for Ham Persillade with Mustard Potato Salad and Mashed Peas  | Chop enough parsley leaves to measure 1 tablespoon; reserve. Chop remaining leaves and stems and simmer with broth and garlic in a small saucepan, covered, 5 minutes. Meanwhile, sprinkle gelatin over water in a medium bowl and let soften 1 minute. Strain broth through a fine-mesh sieve into bowl with gelatin and stir to dissolve. Season with salt and pepper. Set bowl in an ice bath and cool to room temperature, stirring. Toss ham with reserved parsley and divide among jars. Pour gelatin on top and chill until set, at least 1 hour. Whisk together mayonnaise, mustard, vinegar, 1/4 teaspoon salt, and 1/4 teaspoon pepper in a large bowl. Stir in celery, cornichons, and potatoes. Pulse peas with marjoram, oil, 1/2 teaspoon pepper, and 1/4 teaspoon salt in a food processor to a coarse mash. Layer peas, then potato salad, over ham.


## 2. Tokenise the data <a name="tokenise"></a>

In [8]:
# Pad punctuation with spaces so that each mark is treated as a separate 'word'
def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}])", r' \1 ', s)
    s = re.sub(' +', ' ', s)
    return s

text_data = [pad_punctuation(x) for x in filtered_data]

In [9]:
# Display an example of a recipe
example_data = text_data[9]
example_data

'Recipe for Ham Persillade with Mustard Potato Salad and Mashed Peas | Chop enough parsley leaves to measure 1 tablespoon ; reserve . Chop remaining leaves and stems and simmer with broth and garlic in a small saucepan , covered , 5 minutes . Meanwhile , sprinkle gelatin over water in a medium bowl and let soften 1 minute . Strain broth through a fine - mesh sieve into bowl with gelatin and stir to dissolve . Season with salt and pepper . Set bowl in an ice bath and cool to room temperature , stirring . Toss ham with reserved parsley and divide among jars . Pour gelatin on top and chill until set , at least 1 hour . Whisk together mayonnaise , mustard , vinegar , 1 / 4 teaspoon salt , and 1 / 4 teaspoon pepper in a large bowl . Stir in celery , cornichons , and potatoes . Pulse peas with marjoram , oil , 1 / 2 teaspoon pepper , and 1 / 4 teaspoon salt in a food processor to a coarse mash . Layer peas , then potato salad , over ham . '
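
The effect of the padding is easier to see on a short input. The sketch below repeats the same two substitutions on a made-up sentence (the sentence itself is an assumption, chosen to include a fraction and a comma):

```python
import re
import string

def pad_punctuation(s):
    # Surround every punctuation character with spaces...
    s = re.sub(f"([{string.punctuation}])", r' \1 ', s)
    # ...then collapse any runs of spaces this creates.
    s = re.sub(' +', ' ', s)
    return s

padded = pad_punctuation("Mix 1/4 cup sugar, then stir.")
print(repr(padded))
print(padded.split())
```

After padding, a plain whitespace split yields `/`, `,` and `.` as their own tokens, which is what lets the vectorisation layer assign them separate indices.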

In [10]:
# Convert to a TensorFlow Dataset
text_ds = tf.data.Dataset.from_tensor_slices(text_data).batch(BATCH_SIZE).shuffle(1000)


In [11]:
# Create a vectorisation layer
vectorize_layer = keras.layers.TextVectorization(
    standardize = 'lower',
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [12]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

In [13]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f'{i}: {word}')

0: 
1: [UNK]
2: .
3: ,
4: and
5: to
6: in
7: the
8: with
9: a


In [14]:
# Display the same example converted to ints
example_tokenised = vectorize_layer(example_data)
print(example_tokenised.numpy())

[  26   16  557    1    8  298  335  189    4 1054  494   27  332  228
  235  262    5  594   11  133   22  311    2  332   45  262    4  671
    4   70    8  171    4   81    6    9   65   80    3  121    3   59
   12    2  299    3   88  650   20   39    6    9   29   21    4   67
  529   11  164    2  320  171  102    9  374   13  643  306   25   21
    8  650    4   42    5  931    2   63    8   24    4   33    2  114
   21    6  178  181 1245    4   60    5  140  112    3   48    2  117
  557    8  285  235    4  200  292  980    2  107  650   28   72    4
  108   10  114    3   57  204   11  172    2   73  110  482    3  298
    3  190    3   11   23   32  142   24    3    4   11   23   32  142
   33    6    9   30   21    2   42    6  353    3 3224    3    4  150
    2  437  494    8 1281    3   37    3   11   23   15  142   33    3
    4   11   23   32  142   24    6    9  291  188    5    9  412  572
    2  230  494    3   46  335  189    3   20  557    2    0    0    0
    0 
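
The trailing zeros in this array are padding: `output_sequence_length=MAX_LEN + 1` makes every recipe exactly 201 tokens long, padding short recipes with 0 and cutting long ones off. A plain-Python sketch of that behaviour, where `pad_or_truncate` is a hypothetical helper rather than part of Keras:

```python
def pad_or_truncate(tokens, length, pad_token=0):
    """Return `tokens` cut or right-padded to exactly `length` items."""
    clipped = tokens[:length]
    return clipped + [pad_token] * (length - len(clipped))

short_recipe = [26, 16, 557]
long_recipe = [26, 16, 557, 1, 8, 298]

print(pad_or_truncate(short_recipe, 5))  # padded with zeros on the right
print(pad_or_truncate(long_recipe, 4))   # truncated to the first four tokens
```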

## 3. Create the Training Set <a name="create"></a>

In [15]:
# Create the training set of recipes and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

train_ds = text_ds.map(prepare_inputs)
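
The two slices in `prepare_inputs` line each input token up with the token that follows it, which is also why the vectoriser produces `MAX_LEN + 1` tokens: after slicing, `x` and `y` both have length `MAX_LEN`. A small sketch on the first few token ids of the example recipe (plain lists stand in for tensors here):

```python
# First six tokens of the example recipe shown earlier.
tokens = [26, 16, 557, 1, 8, 298]

x = tokens[:-1]  # everything except the last token
y = tokens[1:]   # everything except the first token

# At position i the model has seen x[:i + 1] and should predict y[i].
for i, target in enumerate(y):
    print(f"context={x[:i + 1]} -> target={target}")
```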

## 4. Build the LSTM <a name="build"></a>

In [22]:
inputs = keras.layers.Input(shape=(None,), dtype="int32")
x = keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = keras.layers.LSTM(N_UNITS, return_sequences=True)(x)
outputs = keras.layers.Dense(VOCAB_SIZE, activation = 'softmax')(x)
model = keras.models.Model(inputs, outputs)
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_1 (Embedding)     (None, None, 100)         1000000   
                                                                 
 lstm_1 (LSTM)               (None, None, 128)         117248    
                                                                 
 dense_1 (Dense)             (None, None, 10000)       1290000   
                                                                 
Total params: 2,407,248
Trainable params: 2,407,248
Non-trainable params: 0
_________________________________________________________________
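
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias:

```python
VOCAB_SIZE = 10000
EMBEDDING_DIM = 100
N_UNITS = 128

# One EMBEDDING_DIM-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# Four gates, each with weights for the input and the previous hidden state, plus a bias.
lstm_params = 4 * ((EMBEDDING_DIM + N_UNITS) * N_UNITS + N_UNITS)

# Dense layer mapping each hidden state to a score per vocabulary entry.
dense_params = N_UNITS * VOCAB_SIZE + VOCAB_SIZE

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The three figures match the `Param #` column, and most of the model's weights sit in the embedding and output layers rather than in the LSTM itself.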


In [23]:
if LOAD_MODEL:
    # model.load_weights('./models/model')
    # Rebind `model`; otherwise the loaded network is discarded
    model = keras.models.load_model('./models/model', compile=False)

## 5. Train the LSTM <a name="train"></a>

In [24]:
loss_fn = keras.losses.SparseCategoricalCrossentropy()
model.compile("adam", loss_fn)
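
With softmax outputs and integer targets, sparse categorical cross-entropy is the negative log of the probability the model assigned to the correct next token, averaged over positions. A hand-rolled sketch on a made-up three-word distribution (it ignores the small clipping Keras applies for numerical stability):

```python
import math

def sparse_categorical_crossentropy(target_index, predicted_probs):
    """Loss for one position: -log of the probability given to the true token."""
    return -math.log(predicted_probs[target_index])

toy_probs = [0.1, 0.7, 0.2]  # hypothetical softmax output over a 3-word vocabulary

confident_loss = sparse_categorical_crossentropy(1, toy_probs)  # true token had p = 0.7
unsure_loss = sparse_categorical_crossentropy(0, toy_probs)     # true token had p = 0.1

print(confident_loss, unsure_loss, (confident_loss + unsure_loss) / 2)
```

The loss is small when the true token received a high probability and grows quickly as that probability approaches zero.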

In [25]:
# Create a TextGenerator callback that prints a sampled recipe after each epoch
class TextGenerator(keras.callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        super().__init__()
        self.index_to_word = index_to_word
        self.word_to_index = {}
        for index, word in enumerate(index_to_word):
            self.word_to_index[word] = index

    def sample_from(self, probs, temperature):
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs 
    
    def generate(self, start_prompt, max_tokens = 40, temperature = 1.0):
        start_tokens = [self.word_to_index.get(x, 1) for x in start_prompt.split()]
        num_tokens_generated = 0
        tokens_generated = []
        info = []
        sample_token = None
        while num_tokens_generated <= max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            y = self.model.predict(x)
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            
            info.append({'prompt': start_prompt , 'word_probs': probs})
    
            tokens_generated.append(sample_token)
            start_tokens.append(sample_token)
            start_prompt = start_prompt + ' ' + self.index_to_word[sample_token]
            
            num_tokens_generated = len(tokens_generated)
   
        print(f"generated text:\n{start_prompt}\n")
        return info
        
    def on_epoch_end(self, epoch, logs=None):
        self.generate("recipe for", max_tokens = 40)
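
The `temperature` argument of `sample_from` controls how adventurous the sampling is: each probability is raised to the power `1 / temperature` and the result is renormalised. A standalone sketch on a made-up three-word distribution (`apply_temperature` is a hypothetical helper that repeats only the rescaling step, not the sampling):

```python
def apply_temperature(probs, temperature):
    """Rescale probabilities as in `sample_from`: p ** (1 / T), then renormalise."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

toy_probs = [0.5, 0.3, 0.2]  # hypothetical next-word distribution

for temperature in (1.0, 0.5, 0.1):
    rescaled = apply_temperature(toy_probs, temperature)
    print(temperature, [round(p, 4) for p in rescaled])
```

A temperature of 1.0 leaves the distribution unchanged, while values below 1.0 push probability mass towards the most likely word, so sampling becomes close to always picking the top candidate.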
        

In [26]:
# Create a model save checkpoint
model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath="./checkpoint/checkpoint.ckpt",
    save_weights_only=True,
    save_freq="epoch",
    verbose=0,
)

tensorboard_callback = keras.callbacks.TensorBoard(log_dir="./logs")

# Create the text generation callback, passing it the vocabulary
text_generator = TextGenerator(vocab)

In [27]:
model.fit(
    train_ds, 
    epochs=25, 
    # steps_per_epoch = 3,
    callbacks = [model_checkpoint_callback, tensorboard_callback, text_generator]
)

Epoch 1/25
recipe for sugar and press soufflè with [UNK] wine | pat motion bag . oven to cinnamon covered , serve . the pour heat and large tablespoon sugar . lamb pieces . butter and center . healthy in taste . a it .

Epoch 2/25
recipe for rib salsa | arrange butter in in pan in a section of cheesecloth of syrup . repeat to a baking sheet alongside . flip on processor and salt with celery , over 325 orange dishes . mix the egg and and

Epoch 3/25
recipe for carpaccio with roasted bread wild cream salad | preheat oven masala to 475°f . 

Epoch 4/25
recipe for herbed ceviche ' spring meat | whisk eggs husks , chocolate , and fennel sugar in large bowl . let drain , melt to large baking plate , add coarsely sugar ( a medium skillet over coat . add stew to

Epoch 5/25
recipe for basic thumbprints | combine beets , chopped guacamole , garlic , vinegar , sugar , shallot , and carrot in a coarse bowl . put in each of jelly - golden them . bake until meat slightly softened , on lightly

Epoc

<keras.callbacks.History at 0x7f8256608a30>

In [28]:
# Save the final model
model.save("./models/model")

2022-05-19 21:22:46.516187: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: ./models/model/assets




## 6. Generate text using the LSTM <a name="generate"></a>

In [31]:
def print_probs(info, vocab, top_k = 5):
    # For each generation step, print the top_k most probable next words
    for step in info:
        print(f"\nPROMPT: {step['prompt']}")
        word_probs = step['word_probs']
        p_sorted = np.sort(word_probs)[::-1][:top_k]
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        # Use distinct names so the outer loop variable is not shadowed
        for p, idx in zip(p_sorted, i_sorted):
            print(f'{vocab[idx]}:   \t{np.round(100*p,2)}%') 
        print('--------\n')

In [32]:
info = text_generator.generate("recipe for", max_tokens = 40, temperature = 1.0)

generated text:
recipe for watermelon [UNK] with tequila | peel morels . whisk together cream and vanilla in a shallow bowl . stir in sugar and sugar and beat together both egg whites and cream of tartar and a pinch of salt in a bowl



In [33]:
print_probs(info, vocab)


PROMPT: recipe for
grilled:   	2.25%
roasted:   	2.02%
chicken:   	1.85%
[UNK]:   	1.58%
chocolate:   	1.53%
--------


PROMPT: recipe for watermelon
and:   	27.86%
,:   	18.81%
-:   	17.7%
salad:   	4.02%
with:   	3.31%
--------


PROMPT: recipe for watermelon [UNK]
|:   	70.57%
with:   	13.92%
on:   	1.05%
':   	0.84%
and:   	0.63%
--------


PROMPT: recipe for watermelon [UNK] with
lemon:   	5.82%
orange:   	4.68%
mint:   	4.62%
ginger:   	2.1%
fresh:   	2.05%
--------


PROMPT: recipe for watermelon [UNK] with tequila
and:   	39.89%
|:   	30.2%
,:   	7.6%
ice:   	5.4%
-:   	2.85%
--------


PROMPT: recipe for watermelon [UNK] with tequila |
in:   	17.51%
combine:   	10.73%
stir:   	8.63%
bring:   	4.92%
peel:   	4.79%
--------


PROMPT: recipe for watermelon [UNK] with tequila | peel
the:   	31.19%
and:   	11.64%
,:   	7.39%
beets:   	5.46%
potatoes:   	1.76%
--------


PROMPT: recipe for watermelon [UNK] with tequila | peel morels
and:   	53.05%
,:   	17.1%
.:   	14.76%
(:   	5.7

In [153]:
info = text_generator.generate("recipe for", max_tokens = 40, temperature = 0.1)

generated text:
recipe for grilled steak with roasted peppers and garlic | preheat oven to 350°f . butter 13x9x2 - inch glass baking dish . combine 1 / 2 cup sugar , and 1 / 2 cup sugar in heavy medium saucepan over medium heat



In [154]:
print_probs(info, vocab)


PROMPT: recipe for
grilled:   	97.53%
roasted:   	1.67%
chicken:   	0.65%
chocolate:   	0.1%
lemon:   	0.04%
--------


PROMPT: recipe for grilled
chicken:   	94.24%
pork:   	5.1%
salmon:   	0.55%
steak:   	0.1%
beef:   	0.0%
--------


PROMPT: recipe for grilled steak
with:   	100.0%
and:   	0.0%
salad:   	0.0%
chops:   	0.0%
,:   	0.0%
--------


PROMPT: recipe for grilled steak with
roasted:   	36.56%
garlic:   	34.7%
lemon:   	10.92%
green:   	5.64%
tomato:   	3.55%
--------


PROMPT: recipe for grilled steak with roasted
peppers:   	91.95%
garlic:   	7.67%
tomatoes:   	0.29%
red:   	0.07%
-:   	0.02%
--------


PROMPT: recipe for grilled steak with roasted peppers
and:   	100.0%
,:   	0.0%
|:   	0.0%
.:   	0.0%
with:   	0.0%
--------


PROMPT: recipe for grilled steak with roasted peppers and
garlic:   	98.66%
red:   	1.33%
arugula:   	0.0%
olives:   	0.0%
cilantro:   	0.0%
--------


PROMPT: recipe for grilled steak with roasted peppers and garlic
|:   	100.0%
dressing:   	0.0%
