<a href="https://colab.research.google.com/github/ekandemir/FoodRecipeGenerator/blob/main/CreativityProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creativity Project

# IMPORTANT INFORMATION FOR THIS NOTEBOOK

!!! **THE SYSTEM GIVES BEST GENERATION RESULTS WITH PEGASUS** !!!

- The cells in the Necessary Installations and Assignments section must be run at the start of every session. The working directory can be changed in that section.

- Data Processing creates the training datasets. The archive.zip file (provided in the Google Drive link) must be in the current path.

- **For training, the provided dataset_1.csv, dataset_2.txt, and dataset_3.txt files must be in the Data/ directory.**

- The RNN Model Train section trains the RNN models. Choose the dataset in the "Choose dataset" cell, then train either the Letter or the Word approach.

- The PEGASUS Model Train section fine-tunes the PEGASUS model.
  - The output of the PEGASUS train cell only shows that training is possible. Because the model takes a long time to train, it was reloaded and trained multiple times. The final trained model can be tried in the TEST section.

TO TEST THE MODELS, RUN THE INSTALLATIONS AND ASSIGNMENTS CELLS, THEN JUMP TO THE TEST SECTIONS.

Enter the ingredients as a comma-separated list; the post-processed generated text is printed in the Generate Recipe cells.
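
The file requirements listed above can be checked before any training cell is run. The following is a minimal sketch and not part of the original notebook: the helper name `missing_files` is hypothetical, and it assumes the working directory has already been set as described.

```python
import os

# File names are taken from the notes above; the helper itself is illustrative.
REQUIRED_FILES = ["Data/dataset_1.csv", "Data/dataset_2.txt", "Data/dataset_3.txt"]

def missing_files(base_dir=".", required=REQUIRED_FILES):
    """Return the required files that do not exist under base_dir."""
    return [f for f in required if not os.path.isfile(os.path.join(base_dir, f))]

missing = missing_files()
if missing:
    print("Missing training files:", missing)
else:
    print("All training files are in place.")
```

An empty result means the training sections should be able to find their inputs.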



## Necessary Installations and Assignments

### Installations

In [None]:
!pip install sentencepiece
!pip install datasets
!pip install transformers



### Library Imports

In [None]:
import os
import pandas as pd
import csv
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, Trainer, TrainingArguments
import torch
from datasets import load_dataset
import numpy as np
import tensorflow as tf
import random
from ast import literal_eval
import time

### Drive Mount and Dataset Unzip Operations

In [None]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#@title Create or Change to Drive Directory
path = "/content/drive/MyDrive/Computer_Creativity_Project" # @param {type:"string"}
if not os.path.exists(path):
  # Create a new directory because it does not exist 
  os.makedirs(path)
  print(path, " directory created.")

os.chdir(path)

In [None]:
"""
  archive.zip file added to shared Google Drive link.
  PUT archive.zip file obtained from https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions
  in path. Since during the download authentication is needed, download command has not been added here. 
"""


In [None]:
DATASET_PATH = "./food_dataset"
if not os.path.exists(DATASET_PATH):
  os.system("unzip archive.zip -d food_dataset/")

# Data Processing

In [None]:
# LOAD RAW DATA
data = pd.read_csv("food_dataset/RAW_recipes.csv", nrows=10000)
data.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8


The following cells create the "*dataset_1.csv*" file. The same file is already provided in the Google Drive link, so you can either create it here or copy it from there.

In [None]:
def tokenize_list(df_column, max_length=128, pad_to_max_length=True):
  """
  Join each stringified list in a column into one "<sep>"-delimited string.

  Despite the name, no tokenization is done here; max_length and
  pad_to_max_length are unused and kept only for compatibility.

  :param df_column: pandas Series whose cells are stringified lists of strings
  :param max_length: int, unused
  :param pad_to_max_length: bool, unused
  :return: list of strings of the form "item_1 <sep> item_2 <sep> item_3 ..."; None for cells that cannot be parsed
  """
  def join_strings(str_list):
    try:
      return " <sep> ".join(literal_eval(str_list))
    except (ValueError, SyntaxError, TypeError):
      # literal_eval could not parse the cell, or it did not hold a list of strings
      print("Error occurred on: ", str_list)
      return None

  return list(map(join_strings, df_column))


In [None]:
# Tokenize ingredients and steps columns as two different lists
ingredients_joined = tokenize_list(data.ingredients)
steps_joined = tokenize_list(data.steps)

In [None]:
#@title Creating dataset_1
data_dict = {"ingredients": ingredients_joined, "steps": steps_joined}
with open("dataset_1.csv", "w", newline="") as outfile:
   writer = csv.writer(outfile)
   writer.writerow(data_dict.keys())
   writer.writerows(zip(*data_dict.values()))


A CSV file with the columns "**ingredients, steps**" is created and saved as "*dataset_1.csv*".
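
To make the row format concrete, here is a small sketch of how one row is built and read back. It writes to an in-memory buffer instead of a file; the helper name `join_items` and the shortened sample values are illustrative assumptions, not taken from the notebook's code.

```python
import csv
import io
from ast import literal_eval

def join_items(raw, sep=" <sep> "):
    """Parse a stringified Python list and join its items with the separator."""
    return sep.join(literal_eval(raw))

# Raw cells shaped like the `ingredients` and `steps` columns (shortened, illustrative).
raw_ingredients = "['winter squash', 'mexican seasoning']"
raw_steps = "['make a choice and proceed with recipe', 'preheat oven to 425 degrees f']"

# Write a header and one row, as the dataset_1 cell does, but into memory.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ingredients", "steps"])
writer.writerow([join_items(raw_ingredients), join_items(raw_steps)])

# Read the row back by column name.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["ingredients"])
print(rows[0]["steps"])
```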

The following cell creates the "*dataset_2.txt*" file. The same file is already provided in the Google Drive link, so you can either create it here or copy it from there.

In [None]:
#@title Creating dataset_2
# 
# Merge all ingredients and steps
text = ''
for i in range(len(data)):
  text = text + str(data.ingredients[i]) + str(data.steps[i])

# Save the corpus to dataset_2.txt
with open("dataset_2.txt","w") as dataset:
  dataset.write(text)


The following cell creates the "*dataset_3.txt*" file. The same file is already provided in the Google Drive link, so you can either create it here or copy it from there.

In [None]:
#@title Creating dataset_3
# 
# Merge all ingredients and steps
text = ''
for i in range(len(data)):
  text = text + " <rec> " + str(data.ingredients[i]).replace("<sep>","<ing>") +" <s_stp> "+ str(data.steps[i]).replace("<sep>","<stp>")+ " </rec> "

# Save the corpus to dataset_3.txt
with open("dataset_3.txt","w") as dataset:
  dataset.write(text)
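
The tag layout of one dataset_3 record can be sketched as follows. This is an illustration under an assumption: the inputs are already `<sep>`-joined strings such as those returned by `tokenize_list`. On a raw stringified list the two `replace` calls find no `<sep>` and leave the text unchanged. The helper names `make_record` and `build_corpus` are hypothetical.

```python
def make_record(ingredients, steps):
    """Wrap one recipe in the <rec>/<ing>/<s_stp>/<stp> tags used by dataset_3."""
    ing_part = ingredients.replace("<sep>", "<ing>")
    step_part = steps.replace("<sep>", "<stp>")
    return " <rec> " + ing_part + " <s_stp> " + step_part + " </rec> "

def build_corpus(ingredient_rows, step_rows):
    # Build all records first and join once, instead of growing one string in a loop.
    return "".join(make_record(i, s) for i, s in zip(ingredient_rows, step_rows))

corpus = build_corpus(["eggs <sep> sugar"], ["beat eggs <sep> add sugar"])
print(repr(corpus))
```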


# RNN Model Train Codes

In [None]:
#@title Choose dataset
dataset = "dataset_2" # @param ["dataset_2", "dataset_3"]
data_file = "Data/"+ dataset + ".txt"

## Letter Based RNN Approach

In [None]:
#@title Load and Process the data

text = ""
with open(data_file,"r") as dataset_f:
  text = dataset_f.read()

# The length of text is the number of characters in it
print (len(text))
vocabulary = sorted(set(text))
char2idx = {u:i for i, u in enumerate(vocabulary)}
idx2char = np.array(vocabulary)

7365210
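
The character lookup tables built above can be tried on a tiny string. This sketch uses plain lists instead of `np.array` to stay dependency-free; the helper names `build_vocab`, `encode`, and `decode` are illustrative, not from the notebook.

```python
def build_vocab(text):
    """Sorted list of distinct characters plus the character-to-index mapping."""
    vocabulary = sorted(set(text))
    char2idx = {ch: i for i, ch in enumerate(vocabulary)}
    return vocabulary, char2idx

def encode(text, char2idx):
    # Raises KeyError for characters that were not in the corpus.
    return [char2idx[ch] for ch in text]

def decode(ids, vocabulary):
    return "".join(vocabulary[i] for i in ids)

vocabulary, char2idx = build_vocab("add sugar")
ids = encode("sugar", char2idx)
print(vocabulary)
print(ids, "->", decode(ids, vocabulary))
```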


In [None]:
#@title Dataset parameters

# batch size, default: 64
BATCH_SIZE = 2048  # @param {type: "integer"}
# buffer size to shuffle our dataset, default 10000
BUFFER_SIZE = 10000  # @param {type: "integer"}
# number of RNN units, default 1024
N_RNN_UNITS = 1024  # @param {type: "integer"}
# length of text chunks for training, default 100
MAX_LENGTH = 100  # @param {type: "integer"}
# size of the embedding layer, default 256
EMBEDDING_DIM = 256    # @param {type: "integer"}

VOCAB_SIZE = len(vocabulary)  # length of the vocabulary in chars
print("Batch size: {} \nBuffer size: {} \n# RNN Units: {}\
       \nMax input length: {} \nVocabulary size: {} \nEmbedding dimension: {}".format(
            BATCH_SIZE, BUFFER_SIZE, N_RNN_UNITS, MAX_LENGTH, VOCAB_SIZE, EMBEDDING_DIM
        )
)

Batch size: 2048 
Buffer size: 10000 
# RNN Units: 1024       
Max input length: 100 
Vocabulary size: 65 
Embedding dimension: 256


In [None]:
# Obtain input and target data
input_text = []
target_text = []

for c in range(0, len(text)-MAX_LENGTH, MAX_LENGTH):
    inps = text[c : c + MAX_LENGTH]
    tars = text[c + 1 : c + 1 + MAX_LENGTH]

    input_text.append([char2idx[i] for i in inps])
    target_text.append([char2idx[t] for t in tars])
    
print (np.array(input_text).shape)
print (np.array(target_text).shape)

(73652, 100)
(73652, 100)
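
The loop above cuts the corpus into non-overlapping windows and pairs each window with the same window shifted one position to the right. A minimal sketch on a toy sequence (the helper name `make_windows` is illustrative):

```python
def make_windows(tokens, max_length):
    """Return (inputs, targets) where each target is its input shifted by one token."""
    inputs, targets = [], []
    for start in range(0, len(tokens) - max_length, max_length):
        inputs.append(tokens[start : start + max_length])
        targets.append(tokens[start + 1 : start + 1 + max_length])
    return inputs, targets

inputs, targets = make_windows(list("abcdefghij"), max_length=3)
for inp, tar in zip(inputs, targets):
    print("".join(inp), "->", "".join(tar))

# Number of windows for the corpus length and MAX_LENGTH reported above.
print(len(range(0, 7365210 - 100, 100)))
```

With 7365210 characters and a window length of 100, the same `range` expression gives the 73652 rows shown in the shapes above.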


In [None]:
# Create TF datasets
dataset = tf.data.Dataset.from_tensor_slices((input_text, target_text)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [None]:
#@title Set up generator network structure

# Define the loss function
def loss_function(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Define input and output around the RNN (LSTM)
def build_model(vocab_size=VOCAB_SIZE, embedding_dim=EMBEDDING_DIM, n_rnn_units=N_RNN_UNITS, batch_size=BATCH_SIZE):
    model = tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                      batch_input_shape=[batch_size, None]),
            tf.keras.layers.LSTM(n_rnn_units,
                                return_sequences=True,
                                stateful=True,
                                recurrent_activation='sigmoid',
                                recurrent_initializer='glorot_uniform'),
            tf.keras.layers.Dense(vocab_size)
        ])
    model.summary()
    return model

model = build_model()

# Define the optimiser
# default: 0.001
opt_learning_rate = 0.001  #@param{type:"raw"}
# default: 0.5
opt_beta = 0.5 #@param{type:"raw"}
optimizer = tf.keras.optimizers.Adam(opt_learning_rate, beta_1=opt_beta)

# Compile the model
model.compile(optimizer, loss_function)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (2048, None, 256)         16640     
                                                                 
 lstm_4 (LSTM)               (2048, None, 1024)        5246976   
                                                                 
 dense_4 (Dense)             (2048, None, 65)          66625     
                                                                 
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________
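
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer with four gates, and a dense layer with bias; the function names are illustrative.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return input_dim * output_dim + output_dim

vocab_size, embedding_dim, units = 65, 256, 1024
total = (embedding_params(vocab_size, embedding_dim)
         + lstm_params(embedding_dim, units)
         + dense_params(units, vocab_size))
print(total)  # 5330241
```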


In [None]:
# Test Configuration
for input_example_batch, target_example_batch in dataset.take(1):
    # Run the batch through the model
    example_batch_predictions = model(input_example_batch)

    # Print output shape
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

    # To get the predictions, sample over the output distribution
    sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
    sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy() 
    
    # Decode the indices to see the text predicted by the (untrained) model
    print("Input: \n", repr("".join(idx2char[input_example_batch[0]])), "\n")
    print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices])))

(2048, 100, 66) # (batch_size, sequence_length, vocab_size)
Input: 
 "e spice , baking powder and baking soda', 'mix in the flour at low speed', 'stir in the carrots', 'f" 

Next Char Predictions: 
 '1\'&b16rvm<-3j=q6pg,^z#\\t3i"%f[3v&=m`%#p<e8()!+t4s6 @d,0a@tdw=gb]gtxk5%5"=3$3_\'-e(@:+}j" >(7yvt7ov/j}'


In [None]:
def generate_text(model, input_text, n_characters_output=1000):
    # First, vectorize the input text as before
    input_eval = [char2idx[s] for s in input_text]
    input_eval = tf.expand_dims(input_eval, 0)

    # We'll store results in this variable
    text_generated = []

    # Generate the number of characters desired
    model.reset_states()
    for i in range(n_characters_output):
        # Run input through model
        predictions = model(input_eval)

        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Using a categorical distribution to predict the character returned by the model
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character as the next input to the model
        input_eval = tf.expand_dims([predicted_id], 0)

        # Add the predicted character to the output
        text_generated.append(idx2char[predicted_id])

    # Return output
    return (input_text + ''.join(text_generated))
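
`generate_text` picks each next character by sampling from the distribution defined by the logits rather than by taking the most likely character. The pure-Python sketch below mimics that sampling step without TensorFlow. The `temperature` argument is an assumption added for illustration; the notebook's `generate_text` does not have it.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_id(logits, temperature=1.0, rng=random):
    """Draw one index with probability proportional to softmax(logits)."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
print([sample_id([2.0, 1.0, 0.1], rng=rng) for _ in range(10)])
```

Sampling keeps the output varied between runs, at the cost of occasionally picking unlikely characters.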

In [None]:
#@title Choose where to save models
model_path = "./dataset_3_letter/" # @param {type : "string"}
full_path = model_path + "ckpt_{epoch}" 

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
                      filepath=full_path,
                      save_weights_only=True)

In [None]:
#@title Train the model

n_epochs =  50# @param{type: "integer"} 
history = model.fit(dataset, epochs=n_epochs, callbacks=[checkpoint_callback])


Epoch 1/50
...
Epoch 50/50


## Word Based RNN Approach

In [None]:
#@title Load and Process the data

text = ""
with open(data_file,"r") as dataset_f:
  text = dataset_f.read()

# Split the corpus on spaces; its length is the number of word tokens
words = text.split(" ")
print (len(words))
vocabulary_words = sorted(set(words))
word2idx = {u:i for i, u in enumerate(vocabulary_words)}
idx2word = np.array(vocabulary_words)

1319788


In [None]:
#@title Dataset parameters

# batch size, default: 64
BATCH_SIZE =   512# @param {type: "integer"}
# buffer size to shuffle our dataset, default 10000
BUFFER_SIZE = 10000  # @param {type: "integer"}
# number of RNN units, default 1024
N_RNN_UNITS = 1024  # @param {type: "integer"}
# length of text chunks for training, default 100
MAX_LENGTH = 100  # @param {type: "integer"}
# size of the embedding layer, default 256
EMBEDDING_DIM = 100    # @param {type: "integer"}

VOCAB_SIZE = len(vocabulary_words)  # length of the vocabulary in words
print("Batch size: {} \nBuffer size: {} \n# RNN Units: {}\
       \nMax input length: {} \nVocabulary size: {} \nEmbedding dimension: {}".format(
            BATCH_SIZE, BUFFER_SIZE, N_RNN_UNITS, MAX_LENGTH, VOCAB_SIZE, EMBEDDING_DIM
        )
)

Batch size: 512 
Buffer size: 10000 
# RNN Units: 1024       
Max input length: 100 
Vocabulary size: 23816 
Embedding dimension: 100


In [None]:
# Obtain input and target data
input_text = []
target_text = []

for c in range(0, len(words)-MAX_LENGTH, MAX_LENGTH):
    inps = words[c : c + MAX_LENGTH]
    tars = words[c + 1 : c + 1 + MAX_LENGTH]

    input_text.append([word2idx[i] for i in inps])
    target_text.append([word2idx[t] for t in tars])
    
print (np.array(input_text).shape)
print (np.array(target_text).shape)

(12125, 100)
(12125, 100)


In [None]:
# Create TF datasets
dataset = tf.data.Dataset.from_tensor_slices((input_text, target_text)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)


In [None]:
#@title Set up generator network structure

# Define the loss function
def loss_function(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Define input and output around the RNN (LSTM)
def build_model(vocab_size=VOCAB_SIZE, embedding_dim=EMBEDDING_DIM, n_rnn_units=N_RNN_UNITS, batch_size=BATCH_SIZE):
    model = tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                      batch_input_shape=[batch_size, None]),
            tf.keras.layers.LSTM(n_rnn_units,
                                return_sequences=True,
                                stateful=True,
                                recurrent_activation='sigmoid',
                                recurrent_initializer='glorot_uniform'),
            tf.keras.layers.Dense(vocab_size)
        ])
    model.summary()
    return model

model = build_model()

# Define the optimiser
# default: 0.001
opt_learning_rate = 0.001  #@param{type:"raw"}
# default: 0.5
opt_beta = 0.5 #@param{type:"raw"}
optimizer = tf.keras.optimizers.Adam(opt_learning_rate, beta_1=opt_beta)

# Compile the model
model.compile(optimizer, loss_function)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (512, None, 100)          2381600   
                                                                 
 lstm_1 (LSTM)               (512, None, 1024)         4608000   
                                                                 
 dense_1 (Dense)             (512, None, 23816)        24411400  
                                                                 
Total params: 31,401,000
Trainable params: 31,401,000
Non-trainable params: 0
_________________________________________________________________


In [None]:
# Test Configuration
for input_example_batch, target_example_batch in dataset.take(1):
    # Run the batch through the model
    example_batch_predictions = model(input_example_batch)

    # Print output shape
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

    # To get the predictions, sample over the output distribution
    sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
    sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy() 
    
    # Decode the indices to see the text predicted by the (untrained) model
    print("Input: \n", repr(" ".join(idx2word[input_example_batch[0]])), "\n")
    print("Next Char Predictions: \n", repr(" ".join(idx2word[sampled_indices])))

(512, 100, 23816) # (batch_size, sequence_length, vocab_size)
Input: 
 "'press edges together with a fork to seal', 'heat 1 cup of cooking oil to 400 degrees in an electric skillet', 'place several empanadas in hot oil at a time , cooking on one side , and turning over when golden brown', 'cook second side until golden brown', 'remove and drain on paper towels', 'sprinkle tops of empanadas with cinnamon and sugar mixture'] </rec>  <rec> ['apple pie filling', 'flour tortillas', 'margarine', 'white sugar', 'brown sugar', 'water', 'ground cinnamon'] <s_stp> ['preheat oven to 350 degrees f', 'warm tortillas in the microwave for approximately 20 seconds', 'this will make them easier" 

Next Char Predictions: 
 '55-65 bisucits \'miso\', horrible 170 12-quart stones\', [\'candy:\', disposable [\'directions\', done milk"] allowed [\'set \'meat 350-degrees french-fry sling commercial [\'puff cassis powdered-sugar buttermilk-- condiment\'] scratch: ros \'securely margo\'s tablespoonsmargarine ti

In [None]:
def generate_text(model, input_text, n_tokens_output=256):
    # First, vectorize the input text as before
    input_eval = [word2idx[s] for s in input_text.split(" ")]
    input_eval = tf.expand_dims(input_eval, 0)

    # We'll store results in this variable
    text_generated = []

    # Generate the desired number of words
    model.reset_states()
    for i in range(n_tokens_output):
        # Run input through model
        predictions = model(input_eval)

        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Sample the next word id from the categorical distribution given by the model's logits
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted word as the next input to the model
        input_eval = tf.expand_dims([predicted_id], 0)

        # Add the predicted word to the output
        text_generated.append(idx2word[predicted_id])

    # Return output
    return (input_text + "\n"+' '.join(text_generated))
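
Because `generate_text` looks up every input word with `word2idx[s]`, a word that never occurred in the training corpus raises a `KeyError`. One possible guard, sketched here with a hypothetical helper `split_known_words` and a toy vocabulary, is to separate known from unknown words before calling the model:

```python
def split_known_words(input_text, word2idx):
    """Split the space-separated input into in-vocabulary and out-of-vocabulary words."""
    known, unknown = [], []
    for word in input_text.split(" "):
        if word in word2idx:
            known.append(word)
        else:
            unknown.append(word)
    return known, unknown

# Toy vocabulary for illustration; the real word2idx is built from the corpus.
toy_word2idx = {"eggs": 0, "sugar": 1, "<ing>": 2}
known, unknown = split_known_words("eggs <ing> saffron <ing> sugar", toy_word2idx)
print("known:", known)
print("unknown:", unknown)
```

Whether unknown words should be dropped or reported back to the user is a design choice the notebook leaves open.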

In [None]:
#@title Choose where to save models
model_path = "./dataset_3_word/" # @param {type : "string"}
full_path = model_path + "ckpt_{epoch}" 

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
                      filepath=full_path,
                      save_weights_only=True)

In [None]:
#@title Train the model

n_epochs =  50# @param{type: "integer"} 
history = model.fit(dataset, epochs=n_epochs, callbacks=[checkpoint_callback])

Epoch 1/50
...
Epoch 50/50


# Pegasus Model Train

In [None]:
"""
Creating Pegasus Dataset and Trainer functions

Reference:
  https://gist.github.com/jiahao87/50cec29725824da7ff6dd9314b53c4b3
"""
class PegasusDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
      self.encodings = encodings
      self.labels = labels
  def __getitem__(self, idx):
      item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
      item['labels'] = torch.tensor(self.labels['input_ids'][idx])  
      return item
  def __len__(self):
      return len(self.labels['input_ids'])

class PegasusTrainer():

  model = None
  tokenizer = None
  training_args = None
  train_dataset = None
  trainer = None

  def __init__(self, model_name, train_texts, train_labels, training_args, freeze_encoder = False):
    """
    Create model, tokenizer
    Create train dataset and return trainer
    """
    torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
    self.model = PegasusForConditionalGeneration.from_pretrained(model_name)
    self.tokenizer = PegasusTokenizer.from_pretrained(model_name)
    self.training_args = training_args
    self.train_dataset = self.create_encoding_decoding(train_texts, train_labels)
    self.trainer = self.create_trainer(freeze_encoder)
    
  def create_encoding_decoding(self, train_texts, train_labels):
    """
    Prepare input data for model fine-tuning
    """
    encodings = self.tokenizer(train_texts, truncation=True, padding=True, max_length=128)
    decodings = self.tokenizer(train_labels, truncation=True, padding=True, max_length=512)
    dataset_tokenized = PegasusDataset(encodings, decodings)

    return dataset_tokenized


  def create_trainer(self, freeze_encoder):
    """
    Prepare configurations and base model for fine-tuning
    """
    if freeze_encoder:
      for param in self.model.model.encoder.parameters():
        param.requires_grad = False


    trainer = Trainer(
      model=self.model,                         
      args=self.training_args,                  
      train_dataset=self.train_dataset,
      tokenizer=self.tokenizer
    )

    return trainer


In [None]:
# Load Dataset as Huggingface dataset object
# Only the first 1000 samples are used due to memory limits
dataset = load_dataset("csv", data_files = "Data/dataset_1.csv")
train_texts, train_labels = dataset['train']['ingredients'][:1000], dataset['train']['steps'][:1000]

Using custom data configuration default-f6cc69f207a5cff2


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-f6cc69f207a5cff2/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-f6cc69f207a5cff2/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



In [None]:
#@title Choose Pre_Trained Pegasus model
model_name = 'google/pegasus-large' # @param ["google/pegasus-large", "sshleifer/distill-pegasus-xsum-16-4"]

In [None]:
#@title Pegasus Training Arguments
output_dir="./results"          # @param {type: "string"}
num_train_epochs=0.1            # @param {type: "number"}
per_device_train_batch_size=1   # @param {type: "integer"}
save_steps=1000                 # @param {type: "integer"}
save_total_limit=5              # @param {type: "integer"}
warmup_steps=500                # @param {type: "integer"}
weight_decay=0.01               # @param {type: "raw"}
logging_dir='./logs'            # @param {type: "string"}
logging_steps=100               # @param {type: "integer"}


training_args = TrainingArguments(
  output_dir=output_dir,           # output directory
  num_train_epochs=num_train_epochs,           # number of epochs
  per_device_train_batch_size=per_device_train_batch_size,   # batch size per device during training, can increase if memory allows
  save_steps=save_steps,                  # number of updates steps before checkpoint saves
  save_total_limit=save_total_limit,              # limit the total amount of checkpoints and deletes the older checkpoints
  warmup_steps=warmup_steps,                # number of warmup steps for learning rate scheduler
  weight_decay=weight_decay,               # strength of weight decay
  logging_dir=logging_dir,            # directory for storing logs
  logging_steps=logging_steps,
  eval_accumulation_steps= 1
)


In [None]:
trainer = PegasusTrainer(model_name = model_name, 
                         train_texts = train_texts, 
                         train_labels = train_labels, 
                         training_args = training_args, 
                         freeze_encoder = False)



In [None]:
#@title Train model
trainer.trainer.train()

***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 100


Step,Training Loss
100,10.6265




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=100, training_loss=10.62646240234375, metrics={'train_runtime': 41.4685, 'train_samples_per_second': 2.411, 'train_steps_per_second': 2.411, 'total_flos': 36118305177600.0, 'train_loss': 10.62646240234375, 'epoch': 0.1})
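
The "Total optimization steps = 100" figure in the log follows from the training arguments: 1000 examples at batch size 1 give 1000 updates per epoch, and 0.1 epochs of that is 100 steps. The sketch below reproduces this arithmetic. It approximates how the figure arises and is not the Trainer's own code, whose internals may differ between versions; the function name is illustrative.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Estimate the number of optimizer updates for a (possibly fractional) epoch count."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)

print(total_optimization_steps(num_examples=1000, batch_size=1, num_epochs=0.1))  # 100
```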

In [None]:
# Clear the GPU memory
print("Initial GPU usage: ",torch.cuda.memory_reserved())
del trainer
time.sleep(10)
torch.cuda.empty_cache()
print("Last GPU usage: ",torch.cuda.memory_reserved())


Initial GPU usage:  13136560128
Last GPU usage:  0


# TEST: PEGASUS

In [None]:
#@title Choose model to test
model_path = "Pegasus_Recipe" # @param ["Pegasus_Recipe"]
model_path = "Models/"+ model_path
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = PegasusForConditionalGeneration.from_pretrained(model_path).to(torch_device)
tokenizer = PegasusTokenizer.from_pretrained(model_path)


In [None]:
#@title Write Input Ingredients (Split by using comma)
input_ingredients = "penne pasta, chicken, onion, mushroom, double cream, olive oil" # @param {type: "string"}
input_text = " <sep> ".join([ingredient.strip() for ingredient in input_ingredients.split(",")])

temperature = 1.0 # @param {type: "raw"}
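
The join expression above turns the comma-separated input into the `<sep>`-delimited form used in dataset_1. If the input ends with a comma, that expression also produces an empty trailing ingredient. The sketch below, with a hypothetical helper `format_ingredients`, shows one way to drop empty entries:

```python
def format_ingredients(raw, separator=" <sep> "):
    """Split on commas, trim whitespace, drop empty entries, and join with the separator."""
    items = [item.strip() for item in raw.split(",")]
    return separator.join(item for item in items if item)

print(format_ingredients("penne pasta, chicken, olive oil, "))
```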

In [None]:
#@title Generate Recipe
input_tokens = tokenizer(input_text, max_length=128, return_tensors="pt")

# Generate Recipe
summary_ids = model.generate(input_tokens["input_ids"].to(torch_device), temperature=temperature, early_stopping=True)
generated_text = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print("Ingredient List : "+ input_ingredients)
print("Generated Recipe : ")
for step in generated_text.split("sep> "):
  print("-",step)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Ingredient List : penne pasta, chicken, onion, mushroom, double cream, olive oil, 
Generated Recipe : 
- cook pasta as directed on package 
- drain 
- add chicken , onion , mushrooms and cream to pan 
- bring to a boil 
- reduce heat to medium low 
- simmer 10 minutes or until pasta is tender 
- stir in olive oil to coat 
- serve hot


# TEST: RNN Model

## TEST: Letter Based Generation

In [None]:
#@title Choose the dataset the model was trained on
dataset = "dataset_3" # @param ["dataset_2", "dataset_3"]{type:"string"}
data_file = "Data/"+dataset + ".txt"
model_name = "Models/"+dataset + "_letter"
model_name = model_name+"/"

In [None]:
#@title Load model

text = ""
with open(data_file,"r") as dataset_f:
  text = dataset_f.read()

# Rebuild the character vocabulary from the training corpus
vocabulary = sorted(set(text))
char2idx = {u:i for i, u in enumerate(vocabulary)}
idx2char = np.array(vocabulary)

def build_model(vocab_size=len(vocabulary), embedding_dim=256, n_rnn_units=1024, batch_size=1):
    model = tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                      batch_input_shape=[batch_size, None]),
            tf.keras.layers.LSTM(n_rnn_units,
                                return_sequences=True,
                                stateful=True,
                                recurrent_activation='sigmoid',
                                recurrent_initializer='glorot_uniform'),
            tf.keras.layers.Dense(vocab_size)
        ])
    return model

def generate_text(model, input_text, n_characters_output=1000):
    # First, vectorize the input text as before
    input_eval = [char2idx[s] for s in input_text]
    input_eval = tf.expand_dims(input_eval, 0)

    # We'll store results in this variable
    text_generated = []

    # Generate the number of characters desired
    model.reset_states()
    for i in range(n_characters_output):
        # Run input through model
        predictions = model(input_eval)

        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Using a categorical distribution to predict the character returned by the model
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character as the next input to the model
        input_eval = tf.expand_dims([predicted_id], 0)

        # Add the predicted character to the output
        text_generated.append(idx2char[predicted_id])

    # Return output
    return (input_text + "\n"+''.join(text_generated))

model = build_model()
model.load_weights(tf.train.latest_checkpoint(model_name))
model.build(tf.TensorShape([1, None]))
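
`generate_text` above draws each next character with `tf.random.categorical`, which treats the model outputs as unnormalized log-probabilities (logits) and samples from the corresponding softmax distribution. The sketch below reproduces that idea with the standard library only, so it can be read without TensorFlow. The `temperature` parameter is an assumption added for illustration and is not used anywhere in this notebook; a value of 1.0 corresponds to plain categorical sampling. The helper names and the toy logits are hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; `temperature` is a hypothetical knob."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(logits, temperature=1.0, rng=random):
    """Draw one index with probability softmax(logits / temperature)."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy logits for a 4-symbol vocabulary (made-up numbers, not model output).
toy_logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)
samples = [sample_categorical(toy_logits, rng=rng) for _ in range(1000)]

# Higher logits should be drawn more often than lower ones.
print([samples.count(i) for i in range(len(toy_logits))])
```

Lowering `temperature` below 1.0 concentrates the probability mass on the most likely symbols, while raising it makes the samples more varied.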



In [None]:
#@title Write Input Ingredients (Split by using comma)
input_ingredients = "eggs, sugar, butter, flour, cocoa, milk" # @param {type: "string"}
input_text = ""
if dataset == "dataset_2":
  input_text = " <sep> ".join([ingredient.strip() for ingredient in input_ingredients.split(",")])
if dataset == "dataset_3":
  input_text = " <rec> " + " <ing> ".join([ingredient.strip() for ingredient in input_ingredients.split(",")]) + " <s_stp> "
n_characters_output = 256 #@param {type:"integer"}
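
The cell above turns the comma-separated ingredient string into a prompt that uses the special tokens of the chosen dataset. As a self-contained sketch, the same logic can be wrapped in a function; the name `build_prompt` is hypothetical, and the two token formats are copied from the cell above.

```python
def build_prompt(input_ingredients, dataset):
    """Format comma-separated ingredients with the dataset's special tokens."""
    ingredients = [item.strip() for item in input_ingredients.split(",")]
    if dataset == "dataset_2":
        # dataset_2 separates ingredients with <sep>
        return " <sep> ".join(ingredients)
    if dataset == "dataset_3":
        # dataset_3 wraps the list in <rec> ... <s_stp> and separates with <ing>
        return " <rec> " + " <ing> ".join(ingredients) + " <s_stp> "
    raise ValueError("unknown dataset: " + dataset)

print(repr(build_prompt("eggs, sugar, butter", "dataset_2")))
print(repr(build_prompt("eggs, sugar, butter", "dataset_3")))
```

Raising an error for an unrecognized dataset name, instead of returning an empty prompt, is a design choice of this sketch rather than the behavior of the cell above.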


In [None]:
#@title Generate Recipe
generated_text = generate_text(model, input_text=input_text, n_characters_output=n_characters_output)

print("Ingredient List : "+ input_ingredients)
print("Generated Recipe : ")

if dataset == "dataset_3":
  generated_text = generated_text.split("\n")[1]
  for step in generated_text.split("<stp>"):
    print("-",step)
if dataset == "dataset_2":
  generated_text = generated_text.split("\n")[1]
  for step in generated_text.split("<sep>"):
    print("-",step)

Ingredient List : eggs, sugar, butter, flour, cocoa, milk
Generated Recipe : 
- mix more cream cheese and butter 
-  beat in egg whites 
-  add about 3 minutes 
-  transfer to a tender 
-  place in warm place for 5 minutes 
-  pulse with apple mixture 
-  pour into 9x13 baking dish 
-  meanwhile , puree artichoke mixture
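
The printed steps above keep whatever whitespace the model emitted around each separator token. A minimal sketch of a stricter post-processing step follows; `split_steps` is a hypothetical helper that is not part of this notebook, and the sample string is made up for illustration.

```python
def split_steps(generated_text, separator="<stp>"):
    """Split generated text on `separator`, trim each step, drop empty ones."""
    steps = [step.strip() for step in generated_text.split(separator)]
    return [step for step in steps if step]

# Made-up text in the dataset_3 step format, including one empty step.
sample = "mix butter and sugar <stp>  beat in eggs <stp>  <stp> pour into dish "
for step in split_steps(sample):
    print("-", step)
```

Passing `separator="<sep>"` would apply the same cleanup to text in the dataset_2 format.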


## TEST: Word-Based Generation

In [None]:
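#@title Choose the dataset the model was trained on
# Note: the cells below read `dataset`, so the form value must be stored under that name.
dataset = "dataset_2" # @param ["dataset_2", "dataset_3"]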
data_file = "Data/"+dataset + ".txt"
model_name = "Models/" + dataset + "_word"
model_name = model_name+"/"

In [None]:
#@title Load Model
# Read the training text and build the word vocabulary and the word <-> index lookup tables


text = ""
with open(data_file,"r") as dataset_f:
  text = dataset_f.read()
words = text.split(" ")
vocabulary_words = sorted(set(words))
word2idx = {u:i for i, u in enumerate(vocabulary_words)}
idx2word = np.array(vocabulary_words)

def build_word_model(vocab_size=len(vocabulary_words), embedding_dim=100, n_rnn_units=1024, batch_size=1):
    model = tf.keras.Sequential([
            tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                      batch_input_shape=[batch_size, None]),
            tf.keras.layers.LSTM(n_rnn_units,
                                return_sequences=True,
                                stateful=True,
                                recurrent_activation='sigmoid',
                                recurrent_initializer='glorot_uniform'),
            tf.keras.layers.Dense(vocab_size)
        ])
    return model

def generate_text(model, input_text, n_tokens_output=256):
    # First, vectorize the input text as before
    input_eval = [word2idx[s] for s in input_text.split(" ")]
    input_eval = tf.expand_dims(input_eval, 0)

    # We'll store results in this variable
    text_generated = []

    # Generate the number of tokens (words) desired
    model.reset_states()
    for i in range(n_tokens_output):
        # Run input through model
        predictions = model(input_eval)

        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Sample the next word from the categorical distribution defined by the model's logits
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted word as the next input to the model
        input_eval = tf.expand_dims([predicted_id], 0)

        # Add the predicted word to the output
        text_generated.append(idx2word[predicted_id])

    # Return output
    return (input_text + "\n"+' '.join(text_generated))

model = build_word_model()
model.load_weights(tf.train.latest_checkpoint(model_name))
model.build(tf.TensorShape([1, None]))

In [None]:
#@title Write Input Ingredients (Split by using comma)
input_ingredients = "eggs, sugar, butter, flour, cocoa, milk" # @param {type: "string"}
input_text = ""
if dataset == "dataset_2":
  input_text = " <sep> ".join([ingredient.strip() for ingredient in input_ingredients.split(",")])
if dataset == "dataset_3":
  input_text = " <rec> " + " <ing> ".join([ingredient.strip() for ingredient in input_ingredients.split(",")]) + " <s_stp> "
n_tokens_output = 100 #@param {type:"integer"}
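
The word-level `generate_text` looks every prompt token up in `word2idx`, so an ingredient that never appears in the training text raises a `KeyError`. One possible guard, sketched below under the assumption that dropping unknown words is acceptable, is to filter the prompt before encoding it. The helper name `encode_known` and the toy vocabulary are hypothetical.

```python
def encode_known(input_text, word2idx):
    """Return (ids, unknown) for the space-separated tokens of `input_text`."""
    ids, unknown = [], []
    for token in input_text.split(" "):
        if token in word2idx:
            ids.append(word2idx[token])
        elif token:  # ignore empty strings produced by repeated spaces
            unknown.append(token)
    return ids, unknown

# Toy vocabulary built the same way as in the Load Model cell.
toy_words = "eggs <sep> sugar <sep> butter".split(" ")
toy_word2idx = {u: i for i, u in enumerate(sorted(set(toy_words)))}

ids, unknown = encode_known("eggs <sep> saffron <sep> sugar", toy_word2idx)
print("encoded ids:", ids)
print("skipped words:", unknown)
```

Reporting the skipped words lets the user see which ingredients the model cannot condition on, instead of failing inside the generation loop.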



In [None]:
#@title Generate Recipe
generated_text = generate_text(model, input_text=input_text, n_tokens_output=n_tokens_output)

print("Ingredient List : "+ input_ingredients)
print("Generated Recipe : ")

if dataset == "dataset_3":
  generated_text = generated_text.split("<s_stp>")[2]
  for step in generated_text.split("<stp>"):
    print("-",step)
if dataset == "dataset_2":
  generated_text = generated_text.split("\n")[1]
  for step in generated_text.split("<sep>"):
    print("-",step)

Ingredient List : eggs, sugar, butter, flour, cocoa, milk
Generated Recipe : 
-  in a 3 quart saucepan or flour , 1 tablespoon of water 
-  cook sausage , onion & bell pepper in large skillet over medium heat 
-  add olive oil to skillet 
-  bring to a boil add noodles on low speed and add tomatoes and honey to a simmer and cook 20 minutes , cool 
-  pour peanut liquid pectin and sprinkle with icing sugar and 1 / 2 teaspoon tomato sauce 
-  simmer until cheese is a sauce 
-  transfer tomatoes
