# 🥙 LSTM on Recipe Data

In this notebook, we'll walk through the steps required to train your own LSTM on the recipes dataset.

In [2]:
%load_ext autoreload
%autoreload 2

import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

## 0. Parameters <a name="parameters"></a>

In [3]:
VOCAB_SIZE = 10000
MAX_LEN = 200
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 25

## 1. Load the data <a name="load"></a>

In [4]:
# Load the full dataset
with open("full_format_recipes.json") as json_data:
    recipe_data = json.load(json_data)

In [5]:
# Filter the dataset
filtered_data = [
    "Recipe for " + x["title"] + " | " + " ".join(x["directions"])
    for x in recipe_data
    if "title" in x
    and x["title"] is not None
    and "directions" in x
    and x["directions"] is not None
]

In [6]:
# Count the recipes
n_recipes = len(filtered_data)
print(f"{n_recipes} recipes loaded")

20111 recipes loaded


In [7]:
example = filtered_data[9]
print(example)

Recipe for Ham Persillade with Mustard Potato Salad and Mashed Peas  | Chop enough parsley leaves to measure 1 tablespoon; reserve. Chop remaining leaves and stems and simmer with broth and garlic in a small saucepan, covered, 5 minutes. Meanwhile, sprinkle gelatin over water in a medium bowl and let soften 1 minute. Strain broth through a fine-mesh sieve into bowl with gelatin and stir to dissolve. Season with salt and pepper. Set bowl in an ice bath and cool to room temperature, stirring. Toss ham with reserved parsley and divide among jars. Pour gelatin on top and chill until set, at least 1 hour. Whisk together mayonnaise, mustard, vinegar, 1/4 teaspoon salt, and 1/4 teaspoon pepper in a large bowl. Stir in celery, cornichons, and potatoes. Pulse peas with marjoram, oil, 1/2 teaspoon pepper, and 1/4 teaspoon salt in a food processor to a coarse mash. Layer peas, then potato salad, over ham.


## 2. Tokenise the data

In [8]:
# Pad punctuation with spaces so that each mark is treated as a separate 'word'
def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s


text_data = [pad_punctuation(x) for x in filtered_data]

In [9]:
# Display an example of a recipe
example_data = text_data[9]
example_data

'Recipe for Ham Persillade with Mustard Potato Salad and Mashed Peas | Chop enough parsley leaves to measure 1 tablespoon ; reserve . Chop remaining leaves and stems and simmer with broth and garlic in a small saucepan , covered , 5 minutes . Meanwhile , sprinkle gelatin over water in a medium bowl and let soften 1 minute . Strain broth through a fine - mesh sieve into bowl with gelatin and stir to dissolve . Season with salt and pepper . Set bowl in an ice bath and cool to room temperature , stirring . Toss ham with reserved parsley and divide among jars . Pour gelatin on top and chill until set , at least 1 hour . Whisk together mayonnaise , mustard , vinegar , 1 / 4 teaspoon salt , and 1 / 4 teaspoon pepper in a large bowl . Stir in celery , cornichons , and potatoes . Pulse peas with marjoram , oil , 1 / 2 teaspoon pepper , and 1 / 4 teaspoon salt in a food processor to a coarse mash . Layer peas , then potato salad , over ham . '

In [10]:
# Convert to a TensorFlow Dataset (shuffle runs after batch, so whole batches are shuffled)
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)

In [11]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [12]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()



In [13]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: .
3: ,
4: and
5: to
6: in
7: the
8: with
9: a
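
The first two entries are reserved: index 0 is the padding token (the empty string) and index 1 is `[UNK]`, which stands in for any word outside the `VOCAB_SIZE` most frequent tokens. The lookup can be sketched in plain Python. This is only an illustration: `toy_vocab`, `encode` and `decode` are hypothetical names, and the ten-word vocabulary simply mirrors the listing above rather than using the adapted layer.

```python
# Hypothetical toy vocabulary laid out like the listing above:
# index 0 is padding, index 1 is the unknown-word token.
toy_vocab = ["", "[UNK]", ".", ",", "and", "to", "in", "the", "with", "a"]
word_to_index = {word: index for index, word in enumerate(toy_vocab)}


def encode(text, length):
    """Lower-case, split on whitespace, map words to ids, then pad with 0."""
    ids = [word_to_index.get(word, 1) for word in text.lower().split()]
    ids = ids[:length]  # truncate sequences that are too long
    return ids + [0] * (length - len(ids))  # pad sequences that are too short


def decode(ids):
    """Map ids back to words, dropping the padding token."""
    return " ".join(toy_vocab[i] for i in ids if i != 0)


encoded = encode("Toss the ham with a spoon .", length=10)
print(encoded)
print(decode(encoded))
```

Words that are missing from the vocabulary ("toss", "ham", "spoon" in this toy case) cannot be recovered after encoding, which is why `[UNK]` can also show up in generated text.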


In [14]:
# Display the same example converted to ints
example_tokenised = vectorize_layer(example_data)
print(example_tokenised.numpy())

[  26   16  557    1    8  298  335  189    4 1054  494   27  332  228
  235  262    5  594   11  133   22  311    2  332   45  262    4  671
    4   70    8  171    4   81    6    9   65   80    3  121    3   59
   12    2  299    3   88  650   20   39    6    9   29   21    4   67
  529   11  164    2  320  171  102    9  374   13  643  306   25   21
    8  650    4   42    5  931    2   63    8   24    4   33    2  114
   21    6  178  181 1245    4   60    5  140  112    3   48    2  117
  557    8  285  235    4  200  292  980    2  107  650   28   72    4
  108   10  114    3   57  204   11  172    2   73  110  482    3  298
    3  190    3   11   23   32  142   24    3    4   11   23   32  142
   33    6    9   30   21    2   42    6  353    3 3224    3    4  150
    2  437  494    8 1281    3   37    3   11   23   15  142   33    3
    4   11   23   32  142   24    6    9  291  188    5    9  412  572
    2  230  494    3   46  335  189    3   20  557    2    0    0    0
    0 

## 3. Create the Training Set

In [15]:
# Create the training set of recipes and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


train_ds = text_ds.map(prepare_inputs)
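
The slicing in `prepare_inputs` pairs every token with the token that follows it, which is also why the vectorisation layer pads to `MAX_LEN + 1`: dropping one token from each end leaves inputs and targets of length `MAX_LEN`. A minimal sketch with plain Python lists, where `shift_by_one` and the token ids are made up for illustration:

```python
def shift_by_one(token_ids):
    """Split one tokenised sequence into inputs and next-token targets."""
    x = token_ids[:-1]  # everything except the last token
    y = token_ids[1:]   # everything except the first token
    return x, y


# Hypothetical ids for a short sequence padded to length 5 + 1
tokens = [26, 16, 557, 2, 0, 0]
x, y = shift_by_one(tokens)

# At each position the model sees x[t] (and everything before it)
# and is trained to predict y[t], the token that comes next.
for inp, tgt in zip(x, y):
    print(f"input {inp:>4} -> target {tgt:>4}")
```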

## 4. Build the LSTM <a name="build"></a>

In [16]:
inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(N_UNITS, return_sequences=True)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm = models.Model(inputs, outputs)
lstm.summary()
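
The output of `lstm.summary()` is not reproduced here, but the parameter counts can be estimated by hand. The sketch below assumes the default Keras settings for each layer (bias terms enabled, and an LSTM with four sets of input weights, recurrent weights and biases); treat the totals as a back-of-the-envelope check rather than a substitute for the summary.

```python
VOCAB_SIZE = 10000
EMBEDDING_DIM = 100
N_UNITS = 128

# Embedding: one vector of size EMBEDDING_DIM per vocabulary entry
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# LSTM: four weight sets (input, forget, cell candidate, output), each with
# an input kernel, a recurrent kernel and a bias vector
lstm_params = 4 * (N_UNITS * (EMBEDDING_DIM + N_UNITS) + N_UNITS)

# Dense: a weight matrix from the hidden state to the vocabulary, plus biases
dense_params = N_UNITS * VOCAB_SIZE + VOCAB_SIZE

total_params = embedding_params + lstm_params + dense_params
print(f"Embedding: {embedding_params:,}")
print(f"LSTM:      {lstm_params:,}")
print(f"Dense:     {dense_params:,}")
print(f"Total:     {total_params:,}")
```

Under these assumptions, more than half of the parameters sit in the final `Dense` layer, because it has to produce a score for every word in the vocabulary at every time step.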

In [17]:
if LOAD_MODEL:
    # Load the model written by lstm.save("lstm.keras") at the end of section 5
    lstm = models.load_model("lstm.keras", compile=False)

## 5. Train the LSTM <a name="train"></a>

In [18]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm.compile("adam", loss_fn)
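
Sparse categorical cross-entropy takes integer targets (token ids) and averages the negative log-probability that the model assigns to the correct token. The following is a simplified pure-Python sketch of that idea, not the Keras implementation; the function name and the four-word distributions are hypothetical. As far as the code in this notebook shows, no masking is configured, so padded positions would count towards the average as well.

```python
import math


def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-probability assigned to the true token id."""
    step_losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(step_losses) / len(step_losses)


# Hypothetical predictions over a 4-word vocabulary for 3 time steps
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.10, 0.10, 0.70],
]
targets = [0, 2, 3]  # true token id at each step

loss = sparse_categorical_crossentropy(probs, targets)
perplexity = math.exp(loss)
print(f"loss = {loss:.4f}, perplexity = {perplexity:.2f}")

# Reference point: a model that guesses uniformly over 10,000 words
uniform_loss = math.log(10000)
print(f"uniform baseline loss = {uniform_loss:.2f}")
```

The uniform baseline gives a rough scale for reading the training log: a loss well below `ln(10000)` means the model has learned to concentrate probability on a much smaller set of plausible next words.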

In [19]:
# Create a TextGenerator callback
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        super().__init__()
        # top_k is kept for compatibility; it is not used below
        self.index_to_word = index_to_word
        # Inverse vocabulary mapping: word -> token id
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        # Rescale the probabilities by the temperature, then renormalise.
        # Casting to float64 avoids rounding errors in np.random.choice's sum check.
        probs = np.asarray(probs, dtype="float64") ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        # Map the prompt words to token ids; unknown words map to 1 ([UNK])
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        # Stop at max_tokens or when the padding token (0) is sampled
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            # Predicted distribution over the vocabulary at every position
            y = self.model.predict(x, verbose=0)
            # Sample the next token from the distribution at the last position
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append({"prompt": start_prompt, "word_probs": probs})
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("recipe for", max_tokens=100, temperature=1.0)
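
The temperature step in `sample_from` raises each probability to the power `1 / temperature` and renormalises. Values below 1 sharpen the distribution towards the most likely word, and values above 1 flatten it. A small pure-Python sketch of that rescaling follows; `apply_temperature` is a hypothetical helper, and the five probabilities are the top-5 values printed for the prompt `recipe for roasted vegetables | chop 1 /` in section 6 at temperature 1.0. They do not sum to 1 because the rest of the vocabulary is left out, which does not matter here since the rescaling is invariant to a constant factor.

```python
def apply_temperature(probs, temperature):
    """Raise each probability to 1 / temperature and renormalise."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


# Top-5 next-word probabilities at temperature 1.0 (words: 2, 4, 3, 8, off)
probs = [0.5172, 0.3259, 0.1256, 0.0159, 0.0040]

cold = apply_temperature(probs, 0.2)  # sharper: favours the top word
hot = apply_temperature(probs, 2.0)   # flatter: spreads probability out

print("temperature 0.2:", [round(100 * p, 2) for p in cold])
print("temperature 2.0:", [round(100 * p, 2) for p in hot])
```

At temperature 0.2 the leading word ends up with roughly 91% of the mass, in line with the value printed for the same prompt in section 6.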

In [20]:
# Create a model save checkpoint
model_checkpoint_callback = callbacks.ModelCheckpoint(
    filepath="./checkpoint/checkpoint.weights.h5",
    save_weights_only=True,
    save_freq="epoch",
    verbose=0,
)

tensorboard_callback = callbacks.TensorBoard(log_dir="./logs")

# Create the text generation callback
text_generator = TextGenerator(vocab)

In [30]:
lstm.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[model_checkpoint_callback, tensorboard_callback, text_generator],
)

Epoch 1/25
629/629 ━━━━━━━━━━━━━━━━━━━━ 0s 421ms/step - loss: 4.2907
generated text:
recipe for thyme with gently tomato greens | drop juice first rack and a boil in large water , / taste ; cook and low with just until time 

629/629 ━━━━━━━━━━━━━━━━━━━━ 266s 422ms/step - loss: 4.2901
Epoch 2/25
629/629 ━━━━━━━━━━━━━━━━━━━━ 0s 427ms/step - loss: 2.9667
generated text:
recipe for cookies goblets with mushroom squash , and roast , and rice | combine first 1 tablespoon butter in a large a medium manner . add by 3 teaspoons sauce in blender . stir in ground the fried pepper and until slightly grill , about 5 minutes . place preheat dish ; stir in the water bulb . using mixer the pot halfway through 12 cast over a ball , just wedge , at quart lower stream , ( . transfer 

629/629 ━━━━━━━━━━━━━━━━━━━━ 271s 430ms/step - loss: 2.9665
Epoch 3/25
629/629

<keras.src.callbacks.history.History at 0x347de3590>

In [31]:
# Save the final model
lstm.save("lstm.keras")

## 6. Generate text using the LSTM

In [32]:
def print_probs(info, vocab, top_k=5):
    for step in info:
        print(f"\nPROMPT: {step['prompt']}")
        word_probs = step["word_probs"]
        # Indices of the top_k most probable tokens, highest first
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for i in i_sorted:
            print(f"{vocab[i]}:   \t{np.round(100 * word_probs[i], 2)}%")
        print("--------\n")

In [33]:
info = text_generator.generate(
    "recipe for roasted vegetables | chop 1 /", max_tokens=10, temperature=1.0
)


generated text:
recipe for roasted vegetables | chop 1 / 4 cup



In [34]:
print_probs(info, vocab)


PROMPT: recipe for roasted vegetables | chop 1 /
2:   	51.72%
4:   	32.59%
3:   	12.56%
8:   	1.59%
off:   	0.4%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 4
cup:   	75.12%
teaspoon:   	6.09%
of:   	5.28%
-:   	3.66%
":   	2.08%
--------



In [35]:
info = text_generator.generate(
    "recipe for roasted vegetables | chop 1 /", max_tokens=50, temperature=0.2
)


generated text:
recipe for roasted vegetables | chop 1 / 2 cup all the ingredients with salt and pepper in a small bowl . in a small bowl stir together the garlic paste , the garlic , the salt , and the oil in a large bowl . add the remaining 1



In [36]:
print_probs(info, vocab)


PROMPT: recipe for roasted vegetables | chop 1 /
2:   	90.89%
4:   	9.03%
3:   	0.08%
8:   	0.0%
off:   	0.0%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 2
cup:   	100.0%
teaspoon:   	0.0%
":   	0.0%
-:   	0.0%
inch:   	0.0%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 2 cup
all:   	53.03%
bacon:   	36.96%
garlic:   	5.61%
water:   	2.17%
of:   	0.88%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 2 cup all
the:   	95.32%
of:   	4.55%
over:   	0.13%
but:   	0.0%
onion:   	0.0%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 2 cup all the
ingredients:   	99.87%
onion:   	0.09%
garlic:   	0.03%
oil:   	0.0%
juice:   	0.0%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 2 cup all the ingredients
with:   	97.25%
in:   	1.51%
.:   	0.88%
and:   	0.23%
together:   	0.05%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 2 cup all the ingredients with
salt:   	73.53%
the:   	15.53%
oil:   	10.59%
a:   	0.34%
mortar

In [37]:
info = text_generator.generate(
    "recipe for chocolate ice cream |", max_tokens=7, temperature=1.0
)
print_probs(info, vocab)


generated text:
recipe for chocolate ice cream | blend


PROMPT: recipe for chocolate ice cream |
whisk:   	13.11%
combine:   	12.12%
in:   	10.58%
preheat:   	7.04%
beat:   	6.75%
--------



In [38]:
info = text_generator.generate(
    "recipe for chocolate ice cream |", max_tokens=500, temperature=0.2
)
print_probs(info, vocab)


generated text:
recipe for chocolate ice cream | whisk together cream , sugar , and salt in a bowl . stir in remaining 1 / 2 cup sugar and 1 / 4 cup water . stir in remaining 1 / 4 cup sugar and salt . add milk mixture to a boil , whisking until sugar is dissolved , then whisk in milk . cook custard over moderate heat , stirring , until gelatin is dissolved , then whisk in remaining 1 / 4 cup sugar and stir until smooth . 


PROMPT: recipe for chocolate ice cream |
whisk:   	46.7%
combine:   	31.53%
in:   	16.03%
preheat:   	2.09%
beat:   	1.69%
--------


PROMPT: recipe for chocolate ice cream | whisk
together:   	97.14%
first:   	2.6%
cream:   	0.19%
eggs:   	0.03%
flour:   	0.02%
--------


PROMPT: recipe for chocolate ice cream | whisk together
flour:   	53.86%
cream:   	41.24%
sugar:   	2.22%
milk:   	2.06%
yolks:   	0.61%
--------


PROMPT: recipe for chocolate ice cream | whisk together cream
,:   	97.26%
and:   	2.74%
cheese:   	0.0%
in:   	0.0%
of:   	0.0%
--------


PROMPT: 

In [43]:
info = text_generator.generate("recipe for chicken", max_tokens=100, temperature=0.5)

print_probs(info, vocab)


generated text:
recipe for chicken with olives , fennel , and bacon | cook bacon in large pot of boiling salted water until tender but still firm to bite . drain ; transfer to paper towels . drain beans and rinse under cold water . drain well . place 1 cup beans in large saucepan . add enough water to cover by 2 inches . bring to boil . reduce heat to medium - low ; simmer until liquid is reduced to 1 / 2 cup , about 30 minutes . do ahead : can be made 1 day ahead . cover


PROMPT: recipe for chicken
with:   	40.86%
and:   	38.03%
,:   	11.93%
stock:   	4.22%
breasts:   	1.19%
--------


PROMPT: recipe for chicken with
tomato:   	11.59%
sausage:   	6.75%
green:   	6.74%
red:   	6.68%
bacon:   	6.29%
--------


PROMPT: recipe for chicken with olives
,:   	54.51%
and:   	45.14%
|:   	0.34%
with:   	0.0%
.:   	0.0%
--------


PROMPT: recipe for chicken with olives ,
olives:   	19.2%
tomatoes:   	18.5%
bacon:   	10.56%
tomato:   	6.86%
garlic:   	6.34%
--------


PROMPT: recipe for chicke

In [42]:
info = text_generator.generate("recipe for chicken", max_tokens=100, temperature=0.8)

print_probs(info, vocab)


generated text:
recipe for chicken and scallops soup | separate garlic cloves , and thyme sprig into pan . heat 3 / 4 cup butter in same wok or wok over high heat . add brown sugar and stir - fry until mixture begins to brown , turning once , about 6 minutes . add tomatoes , onion and garlic ; sauté 1 minute . add pork , remaining 1 / 2 cup reserved juice , 2 tablespoons cilantro , and 2 cups water and cook , stirring occasionally , until beans are wilted and translucent , about 5 minutes . remove


PROMPT: recipe for chicken
with:   	26.28%
and:   	25.13%
,:   	12.18%
stock:   	6.36%
breasts:   	2.89%
--------


PROMPT: recipe for chicken and
mushroom:   	10.91%
tomato:   	6.49%
sausage:   	4.58%
vegetable:   	3.61%
noodle:   	3.45%
--------


PROMPT: recipe for chicken and scallops
with:   	58.19%
on:   	4.56%
salad:   	3.9%
stew:   	2.52%
stir:   	2.39%
--------


PROMPT: recipe for chicken and scallops soup
|:   	83.16%
with:   	16.63%
in:   	0.11%
[UNK]:   	0.01%
on:   	0.01%
---