# Evaluating Language Models
This notebook describes how multiple language models are queried with a test corpus
in order to benchmark and compare them.

Models Used:
1. [LLaMa-2 (13B) conversational chat model](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ)
2. [OpenAI GPT4-turbo](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo)
3. [OpenAI GPT3.5-turbo](https://platform.openai.com/docs/models/gpt-3-5)

### Scrape recipe information for benchmark corpus

In [1]:
import pandas as pd
import os
os.chdir('../capstone_crawlers')
print(os.getcwd())

import sys
sys.path.append('../')
sys.path.append('../capstone_crawlers/')
sys.path.append('../capstone_crawlers/capstone_crawlers')

import scrapy
from scrapy.crawler import CrawlerProcess, Crawler
from scrapy.utils.project import get_project_settings
from capstone_crawlers import settings
from capstone_crawlers.spiders import recipe_spider
from src.data import process_recipes

/mnt/c/Users/Alvin/Desktop/ML Test Scripts/mec-capstone/capstone_crawlers


#### Load test recipe metadata

In [2]:
fpath = "../datasets/llm_test_cases/Test Cases.xlsx"
df = pd.read_excel(fpath)

In [3]:
urls = df[['start_url','end_url']].to_dict(orient='list')

# Combine start and end urls into a list
tmp = list()
for key in urls.keys():
    tmp += urls[key]
urls = list(set(tmp))

#### Run crawlers

In [4]:
# os.chdir('../notebooks/')
# with open('../datasets/llm_test_cases/test_recipes.txt','w') as f:
#     for url in urls:
#         f.write(url+'\n')
# os.chdir('../capstone_crawlers/')

In [5]:
# !scrapy crawl recipe
# !mv ./test_recipes.jl ../datasets/llm_test_cases/test_recipes.jl

#### Parse and clean recipes

In [6]:
os.chdir('../notebooks')
schema = ".recipe_name, .ingredients, [.instructions[].text], .source_url"
columns = ["recipe_name", "ingredients", "instructions", "url"]
fpath = ["../datasets/llm_test_cases/test_recipes.jl"]
outpath = ["../datasets/llm_test_cases/test_recipes_cleaned.jl"]

rp = process_recipes.RecipeProcessor(schema)
rp.process_recipes(dict(zip(fpath, outpath)), columns=columns)

Now processing ../datasets/llm_test_cases/test_recipes.jl...
Rows dropped: 2
Finished processing ../datasets/llm_test_cases/test_recipes.jl, exported to ../datasets/llm_test_cases/test_recipes_cleaned.jl.


In [7]:
cleaned_df = pd.read_json(outpath[0],lines=True)
cleaned_df.head()

Unnamed: 0,recipe_name,ingredients,instructions,url,step_instructions,whole_recipe
0,Beef Stroganoff,600 g / 1.2 lb scotch fillet steak / boneles...,Use your fist (or rolling pin or mallet) to fl...,https://www.recipetineats.com/beef-stroganoff/,1. Use your fist (or rolling pin or mallet) to...,"Recipe Name: Beef Stroganoff, \nIngredients: 6..."
1,Creamy Homemade Baked Mac and Cheese,"1 lb. dried elbow pasta, 1/2 cup unsalted butt...",Preheat oven to 325 degrees F and grease a 3 q...,https://www.thechunkychef.com/family-favorite-...,1. Preheat oven to 325 degrees F and grease a ...,Recipe Name: Creamy Homemade Baked Mac and Che...
2,Vegetarian Butternut Squash Chipotle Chili wit...,"2 tablespoons olive oil, 1 medium red onion, c...",In a 4- to 6-quart Dutch oven or stockpot over...,https://cookieandkate.com/butternut-squash-chi...,1. In a 4- to 6-quart Dutch oven or stockpot o...,Recipe Name: Vegetarian Butternut Squash Chipo...
3,Best Bolognese,"1 medium onion, chopped, 1 celery stalk, chopp...","Pulse onion, celery, and carrot in a food proc...",https://www.bonappetit.com/recipe/bas-best-bol...,"1. Pulse onion, celery, and carrot in a food p...","Recipe Name: Best Bolognese, \nIngredients: 1 ..."
4,Easy Brown Rice Risotto with Mushrooms and Fre...,"3 tablespoons olive oil, divided, 1 small yell...",Make sure your oven rack is in the middle posi...,https://cookieandkate.com/easy-brown-rice-riso...,1. Make sure your oven rack is in the middle p...,Recipe Name: Easy Brown Rice Risotto with Mush...


In [8]:
raw_df = pd.read_json(fpath[0],lines=True)
raw_df.head()

Unnamed: 0,recipe_name,yields,time,ingredients,instructions,body,source_url
0,Beef Stroganoff,"[4, 4 - 5 people]","{'prep': 'PT15M', 'cook': 'PT15M', 'total': 'P...",[600 g / 1.2 lb scotch fillet steak / bonele...,"[{'@type': 'HowToStep', 'text': 'Use your fist...","{""@type"": ""Recipe"", ""name"": ""Beef Stroganoff"",...",https://www.recipetineats.com/beef-stroganoff/
1,Creamy Homemade Baked Mac and Cheese,"[8, 8 - 10 servings]","{'prep': 'PT20M', 'cook': 'PT15M', 'total': 'P...","[1 lb. dried elbow pasta, 1/2 cup unsalted but...","[{'@type': 'HowToStep', 'text': 'Preheat oven ...","{""@type"": ""Recipe"", ""name"": ""Creamy Homemade B...",https://www.thechunkychef.com/family-favorite-...
2,Vegetarian Butternut Squash Chipotle Chili wit...,4,"{'prep': 'PT20M', 'cook': 'PT1H', 'total': 'PT...","[2 tablespoons olive oil, 1 medium red onion, ...","[{'@type': 'HowToStep', 'text': 'In a 4- to 6-...","{""@context"": ""https://schema.org/"", ""@type"": ""...",https://cookieandkate.com/butternut-squash-chi...
3,Stuffed Cabbage Rolls,5 serving(s),"{'prep': 'PT25M', 'cook': 'PT0S', 'total': 'PT...","[2 tbsp. extra-virgin olive oil, 1/2 large oni...","[{'@type': 'HowToSection', 'name': 'For the sa...","{""@type"": ""Recipe"", ""thumbnailUrl"": ""https://h...",https://www.delish.com/cooking/recipe-ideas/a2...
4,Best Bolognese,4 servings,"{'prep': 0, 'cook': 0, 'total': 0}","[1 medium onion, chopped, 1 celery stalk, chop...","[{'@type': 'HowToStep', 'text': 'Pulse onion, ...","{""@context"": ""http://schema.org"", ""@type"": ""Re...",https://www.bonappetit.com/recipe/bas-best-bol...


Some test recipes come from websites whose JSON schema could not be parsed by the Scrapy crawler.
These recipes are dropped and omitted from the test corpus.
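
In the raw output above, the instructions for 'Stuffed Cabbage Rolls' are wrapped in a `HowToSection` object rather than flat `HowToStep` entries, and a filter such as `.instructions[].text` finds no `text` key at that level. Whether this is what caused each dropped row is an assumption, not something the notebook confirms. As a minimal sketch, a hypothetical helper (`flatten_instruction_text`, not part of `process_recipes`) could walk nested sections like this:

```python
def flatten_instruction_text(instructions):
    """Collect step text from a schema.org-style instruction list.

    Handles plain strings, HowToStep entries with a "text" key, and
    HowToSection entries whose steps are nested under "itemListElement".
    """
    texts = []
    for item in instructions:
        if isinstance(item, str):
            texts.append(item)
        elif item.get("@type") == "HowToSection":
            # Recurse into the steps grouped under this section.
            texts.extend(flatten_instruction_text(item.get("itemListElement", [])))
        elif "text" in item:
            texts.append(item["text"])
    return texts


# Hypothetical sample data, not taken from the test corpus.
sample = [
    {
        "@type": "HowToSection",
        "name": "For the sauce",
        "itemListElement": [{"@type": "HowToStep", "text": "Simmer the tomatoes."}],
    },
    {"@type": "HowToStep", "text": "Roll the cabbage leaves."},
]
print(flatten_instruction_text(sample))
```

The notebook itself simply drops such recipes; this sketch only illustrates what handling the nested layout might involve.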

In [9]:
# Show missing recipes
set(raw_df['recipe_name'].values) - set(cleaned_df['recipe_name'].values)

{'Chef John&#39;s Buttermilk Fried Chicken',
 'Keto Stuffed Cabbage',
 'Stuffed Cabbage Rolls'}

In [10]:
# Merge metadata and recipe information into a single dataframe
merge_recipe = df.merge(cleaned_df, left_on='start_url', right_on='url')
merge_recipe = merge_recipe.merge(cleaned_df, left_on='end_url', right_on='url', suffixes=('_start', '_end'))
merge_recipe.head()

Unnamed: 0,start_name,end_name,change,remove,add,start_url,end_url,Intermediate Reasoning,Sub Mapping,recipe_name_start,...,instructions_start,url_start,step_instructions_start,whole_recipe_start,recipe_name_end,ingredients_end,instructions_end,url_end,step_instructions_end,whole_recipe_end
0,Classic Risotto,Easy Brown Rice Risotto with Mushrooms and Fre...,Vegetarian,Chicken broth,Vegetable broth,https://www.delish.com/cooking/recipe-ideas/a2...,https://cookieandkate.com/easy-brown-rice-riso...,,,Classic Risotto,...,"In a medium pot over medium-high heat, heat br...",https://www.delish.com/cooking/recipe-ideas/a2...,"1. In a medium pot over medium-high heat, heat...","Recipe Name: Classic Risotto, \nIngredients: 4...",Easy Brown Rice Risotto with Mushrooms and Fre...,"3 tablespoons olive oil, divided, 1 small yell...",Make sure your oven rack is in the middle posi...,https://cookieandkate.com/easy-brown-rice-riso...,1. Make sure your oven rack is in the middle p...,Recipe Name: Easy Brown Rice Risotto with Mush...
1,Easy Brown Rice Risotto with Mushrooms and Fre...,Classic Risotto,Add Protein,Vegetable broth,Chicken broth,https://cookieandkate.com/easy-brown-rice-riso...,https://www.delish.com/cooking/recipe-ideas/a2...,,,Easy Brown Rice Risotto with Mushrooms and Fre...,...,Make sure your oven rack is in the middle posi...,https://cookieandkate.com/easy-brown-rice-riso...,1. Make sure your oven rack is in the middle p...,Recipe Name: Easy Brown Rice Risotto with Mush...,Classic Risotto,4 c. store-bought or homemade low-sodium chick...,"In a medium pot over medium-high heat, heat br...",https://www.delish.com/cooking/recipe-ideas/a2...,"1. In a medium pot over medium-high heat, heat...","Recipe Name: Classic Risotto, \nIngredients: 4..."
2,Chipotle Chili,Butternut Squash Chipotle Chili with Avocado,Vegetarian,Ground turkey or beef,Butternut squash,https://www.chilipeppermadness.com/recipes/chi...,https://cookieandkate.com/butternut-squash-chi...,,,Chipotle Chili Recipe,...,Heat the oil in a large pot to medium-high hea...,https://www.chilipeppermadness.com/recipes/chi...,1. Heat the oil in a large pot to medium-high ...,"Recipe Name: Chipotle Chili Recipe, \nIngredie...",Vegetarian Butternut Squash Chipotle Chili wit...,"2 tablespoons olive oil, 1 medium red onion, c...",In a 4- to 6-quart Dutch oven or stockpot over...,https://cookieandkate.com/butternut-squash-chi...,1. In a 4- to 6-quart Dutch oven or stockpot o...,Recipe Name: Vegetarian Butternut Squash Chipo...
3,Butternut Squash Chipotle Chili with Avocado,Chipotle Chili,Add Protein,Butternut squash,Ground turkey or beef,https://cookieandkate.com/butternut-squash-chi...,https://www.chilipeppermadness.com/recipes/chi...,,,Vegetarian Butternut Squash Chipotle Chili wit...,...,In a 4- to 6-quart Dutch oven or stockpot over...,https://cookieandkate.com/butternut-squash-chi...,1. In a 4- to 6-quart Dutch oven or stockpot o...,Recipe Name: Vegetarian Butternut Squash Chipo...,Chipotle Chili Recipe,"1 tablespoon olive oil, 1.5 pound ground turke...",Heat the oil in a large pot to medium-high hea...,https://www.chilipeppermadness.com/recipes/chi...,1. Heat the oil in a large pot to medium-high ...,"Recipe Name: Chipotle Chili Recipe, \nIngredie..."
4,Beef Stroganoff,Portobello Mushroom Stroganoff,Vegetarian,Ground beef,Portobello mushroom,https://www.recipetineats.com/beef-stroganoff/,https://www.allrecipes.com/recipe/60923/portob...,,,Beef Stroganoff,...,Use your fist (or rolling pin or mallet) to fl...,https://www.recipetineats.com/beef-stroganoff/,1. Use your fist (or rolling pin or mallet) to...,"Recipe Name: Beef Stroganoff, \nIngredients: 6...",Portobello Mushroom Stroganoff,"3 tablespoons butter, 1 large onion, chopped, ...",Bring a large pot of lightly salted water to a...,https://www.allrecipes.com/recipe/60923/portob...,1. Bring a large pot of lightly salted water t...,"Recipe Name: Portobello Mushroom Stroganoff, \..."


### Evaluating performance at recipe customization

#### Load LLMs, embedding models, vector stores, and retrievers

In [11]:
import torch
import tiktoken
import warnings

from dotenv import load_dotenv

from operator import itemgetter

from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFacePipeline, OpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain.schema import format_document
from langchain.globals import set_debug, set_verbose
from langchain.memory import ConversationBufferMemory

set_debug(False)
set_verbose(False)
warnings.filterwarnings("ignore")
load_dotenv('../.env')

model_name = "BAAI/bge-large-en"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": True}
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)
chroma_store = Chroma(collection_name='summed', persist_directory='../datasets/processed/chroma_db', embedding_function=hf)
retriever = chroma_store.as_retriever()
model_path = "TheBloke/Llama-2-13B-chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", revision="main"
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    # temperature=0.00,
    # top_p=0.95,
    repetition_penalty=1.15,
)
output_parser = StrOutputParser()
llama = HuggingFacePipeline(pipeline=pipe)

# For counting token usage
encoding = tiktoken.get_encoding("cl100k_base")



In [12]:
# Used to estimate GPT4-turbo costs without sending an API request for generation
def usage_breakdown(input_strs = [""], output_strs = [""], input_cost = 0.01/1e3, output_cost = 0.03/1e3):
    input_tokens = 0
    output_tokens = 0
    for input in input_strs:
        input_tokens += len(encoding.encode(input))
    
    for output in output_strs:
        output_tokens += len(encoding.encode(output))
    
    total_input_cost = input_tokens*input_cost
    total_output_cost = output_tokens * output_cost
    cost = round(total_input_cost + total_output_cost, 6)

    print(f'Input Tokens: {input_tokens}')
    print(f'Output Tokens: {output_tokens}')
    print(f'Input Cost: {round(total_input_cost,6)}, Output Cost: {round(total_output_cost,6)}, Total Cost: {cost}')

# Prompt used to extract text from retrieved documents
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")

def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)
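
The cost arithmetic in `usage_breakdown` can be checked by hand once token counts are known. The sketch below uses the same default rates (0.01 and 0.03 per 1,000 tokens) but takes token counts directly instead of encoding strings; `estimate_cost` and the token counts are hypothetical and chosen only for illustration.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_cost=0.01 / 1e3, output_cost=0.03 / 1e3):
    """Return the estimated cost for the given token counts.

    Rates are expressed per token, matching the defaults of usage_breakdown.
    """
    total = input_tokens * input_cost + output_tokens * output_cost
    return round(total, 6)


# Hypothetical counts: 1,200 prompt tokens and 400 generated tokens.
print(estimate_cost(1200, 400))
```

Here 1,200 input tokens contribute 1,200 × 0.00001 = 0.012 and 400 output tokens contribute 400 × 0.00003 = 0.012, for a total of 0.024 in the same currency units as the rates.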

### Retriever chains
Retriever chains perform similarity search over the entire vector database with a query string.
During evaluation, retrievers are limited to providing a single recipe as additional context
because of token limits.

In [13]:
sample_recipe = chroma_store.similarity_search("Chicken Noodle Soup", k=5)
noodle_recipe = sample_recipe[1].page_content
print(noodle_recipe)

Recipe Name: Chicken Noodle Soup with Egg Noodles, 
Ingredients: 1 (3 1/2) pound chicken, cut into 8 pieces, 4 (16 ounce) cans low-sodium chicken broth, 2 medium carrots, peeled and thinly sliced, 2 stalks celery, sliced, 0.5 cup chopped onion, 1 (8 ounce) package dried egg noodles, 0.5 cup finely chopped fresh parsley, 1 tablespoon freshly squeezed lemon juice, salt and ground black pepper to taste, 
Cooking Instructions: 1. Place chicken into a Dutch oven; pour in broth and bring to a boil over medium-high heat. Reduce the heat, cover partially, and simmer until chicken is no longer pink at the bone and the juices run clear, about 20 minutes. An instant-read thermometer inserted near the bone should read 165 degrees F (74 degrees C). Use tongs to transfer to a large bowl. Allow chicken and broth to cool slightly, about 10 minutes.
2. Remove and discard skin and bones from cooled chicken. Cut meat into bite-sized pieces. Skim fat off the top of the cooled broth and discard.
3. Return 

### Prompt definitions
Each type of prompt describes the requested task to the LLM. Every prompt defines named placeholders
so that variable inputs, such as retrieved documents, can be placed within
a fixed input prompt. <br>
<br>
Three prompts correspond to combinations of two prompt engineering techniques (RAG and COT),
and a fourth, naive prompt is included for comparison.
1. Naive
2. Retrieval-augmented generation (RAG)
3. Chain-of-thought (COT) prompting
4. Multiple inference chains with RAG + COT
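
To illustrate how named placeholders work, the sketch below uses Python's built-in `str.format` as a stand-in for `ChatPromptTemplate`. The template is a shortened, hypothetical version of the prompts in this notebook, and the filled-in values are made up for the example.

```python
from string import Formatter

# Shortened, hypothetical template using the same placeholder names.
template = (
    "User Request: Create a new recipe called {new_name} from this recipe "
    "for {recipe_name}. Replace the {old_ingredient} with {new_ingredient}.\n"
    "{user_recipe}"
)

# List the names that must be supplied before the prompt can be rendered.
placeholders = [name for _, name, _, _ in Formatter().parse(template) if name]
print(placeholders)

values = {
    "new_name": "Seared Tuna Noodle Soup",
    "recipe_name": "chicken noodle soup",
    "old_ingredient": "chicken",
    "new_ingredient": "seared tuna",
    "user_recipe": "Recipe Name: Chicken Noodle Soup with Egg Noodles",
}
prompt_text = template.format(**values)
print(prompt_text)
```

Keeping the fixed instructions in the template and the per-test values in a dictionary means the same prompt can be reused for every test case.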

In [14]:
# Naive generation without retrieval-augmentation or chain-of-thought prompting
base_prompt = """
You are a helpful, respectful and honest cooking assistant. Always answer as helpfully as possible. Your job is to create a new recipe based on the user's instructions. You will 
be given an original recipe and a request from the user to customize the recipe.

### Input:
User Request: Create a new recipe called {new_name} from this recipe for {recipe_name}. In the new recipe, replace the {old_ingredient} from the original recipe with {new_ingredient}.
{user_recipe}

### Response:
"""
base_prompt = ChatPromptTemplate.from_template(base_prompt)
output_parser = StrOutputParser()
base_memory = ConversationBufferMemory()
base_chain = base_prompt | {'output':llama|output_parser, 'prompt': RunnablePassthrough()}
consts = {"user_recipe":noodle_recipe, 'old_ingredient': 'chicken', 'new_ingredient': 'seared tuna', 'new_name':'Seared Tuna Noodle Soup', 'recipe_name':'chicken noodle soup'}
res = base_chain.invoke(consts)

#### Defining chains using LangChain LCEL
The LangChain Expression Language (LCEL) allows larger chains to be composed from individual components by "piping" the
output of one component (or of several parallel ones) into the next. This defines a computation graph
that runs from the provided inputs to the LLM output. <br>
<br>
The simplest chain used here is the one defined below for naive response generation without prompt engineering techniques.
The chain inputs are incorporated as context in the LLM prompt, then the LLM is queried,
and both the generated response and the prompt are returned as the chain output.

LCEL coerces Python dictionaries into parallel chains. In the example below,
the `output` and `prompt` chains receive the same input prompt but perform two different tasks.
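
The piping and dictionary-coercion behaviour can be mimicked in a few lines of plain Python. This is a toy model for intuition only, not LangChain's implementation: `Step`, `coerce`, and `fake_llm` are hypothetical names, and the "LLM" is a stub that echoes its prompt.

```python
class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        nxt = coerce(other)
        # Feed this step's output into the next step.
        return Step(lambda x: nxt.func(self.func(x)))

    def __ror__(self, other):
        # Supports a plain dict or function on the left-hand side.
        return coerce(other) | self

    def invoke(self, x):
        return self.func(x)


def coerce(obj):
    """Turn dicts into parallel branches and bare functions into steps."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: coerce(value) for key, value in obj.items()}
        # Every branch receives the same input; results are keyed by name.
        return Step(lambda x: {key: b.func(x) for key, b in branches.items()})
    return Step(obj)


fill_prompt = Step(lambda d: "Replace {old} with {new}.".format(**d))
fake_llm = Step(lambda prompt: "LLM response to: " + prompt)  # stub, not a real model
passthrough = Step(lambda x: x)

chain = fill_prompt | {"output": fake_llm, "prompt": passthrough}
result = chain.invoke({"old": "chicken", "new": "seared tuna"})
print(result)
```

Both branches of the dictionary see the rendered prompt, so the result carries the stubbed response under `output` and the prompt itself under `prompt`.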

In [15]:
base_chain = base_prompt | {'output':llama|output_parser, 'prompt': RunnablePassthrough()}

In [16]:
print(res['prompt'].messages[0].content)


You are a helpful, respectful and honest cooking assistant. Always answer as helpfully as possible. Your job is to create a new recipe based on the user's instructions. You will 
be given an original recipe and a request from the user to customize the recipe.

### Input:
User Request: Create a new recipe called Seared Tuna Noodle Soup from this recipe for chicken noodle soup. In the new recipe, replace the chicken from the original recipe with seared tuna.
Recipe Name: Chicken Noodle Soup with Egg Noodles, 
Ingredients: 1 (3 1/2) pound chicken, cut into 8 pieces, 4 (16 ounce) cans low-sodium chicken broth, 2 medium carrots, peeled and thinly sliced, 2 stalks celery, sliced, 0.5 cup chopped onion, 1 (8 ounce) package dried egg noodles, 0.5 cup finely chopped fresh parsley, 1 tablespoon freshly squeezed lemon juice, salt and ground black pepper to taste, 
Cooking Instructions: 1. Place chicken into a Dutch oven; pour in broth and bring to a boil over medium-high heat. Reduce the heat, c

In [17]:
print(res['output'])
usage_breakdown(input_strs=[base_prompt.format(**consts)], output_strs=[res['output']])

Here is your new recipe, Seared Tuna Noodle Soup!

Seared Tuna Noodle Soup

Servings: 6

Ingredients:

* 1 (12 ounce) piece of sushi-grade tuna steak, seared
* 4 (16 ounce) cans low-sodium tuna broth
* 2 medium carrots, peeled and thinly sliced
* 2 stalks celery, sliced
* 0.5 cup chopped onion
* 1 (8 ounce) package dried egg noodles
* 0.5 cup finely chopped fresh parsley
* 1 tablespoon freshly squeezed lemon juice
* Salt and ground black pepper to taste

Cooking Instructions:

1. Preheat a skillet or wok over medium-high heat. Add oil and swirl it around. Place the tuna steak in the pan and sear for 2-3 minutes per side, or until browned and cooked through. Remove from the pan and set aside.
2. Pour the tuna broth into the same pan and bring to a simmer. Add the carrots, celery, and onion, and cook for 8 minutes, or until the vegetables are softened.
3. Add the egg noodles to the pan and cook according to the package instructions.
4. Add the seared tuna back into the pan and stir to co

#### Retrieval augmentation
Retrieval augmentation inserts additional context into the input prompt by retrieving relevant documents from an external
source of knowledge, such as the internet or a document database. Relevancy is determined here using
cosine similarity between embedding vectors managed through a vector database.
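
As a rough illustration of similarity-based retrieval, the sketch below ranks documents by cosine similarity in pure Python. The three-dimensional vectors are made up for the example; real embeddings come from the embedding model and are stored and searched by the vector database, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings, not produced by any real model.
docs = {
    "chicken noodle soup": [0.9, 0.1, 0.0],
    "beef stroganoff": [0.1, 0.9, 0.2],
    "tuna salad": [0.5, 0.0, 0.8],
}
query = [1.0, 0.2, 0.1]
print(top_k(query, docs, k=1))
```

When embeddings are normalized to unit length, as requested here through `normalize_embeddings`, cosine similarity reduces to a plain dot product.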

In [18]:
# RAG
retrieval_prompt_base = """
You are a helpful, respectful and honest cooking assistant. Always answer as helpfully as possible. Your job is to create a new recipe based on the user's instructions. You will 
be given an original recipe and a request from the user to customize the recipe. To help you customize the recipe, several example recipes will be provided based on the user's request.

### Example Recipe:
{retrieved_recipes}

### Input:
User Request: Create a new recipe called {new_name} from this recipe for {recipe_name}. In the new recipe, replace the {old_ingredient} from the original recipe with {new_ingredient}.
{user_recipe}

### Response:
Recipe Name:
"""
retrieval_prompt = ChatPromptTemplate.from_template(retrieval_prompt_base)
output_parser = StrOutputParser()

### Chain-of-thought prompting
Chain-of-thought prompting tries to improve response quality by providing worked examples of
the requested task. Each example spells out the reasoning process that the language model should follow to complete the task.
For this set of prompts, the worked example customizes a recipe for beef stroganoff by replacing the ground beef with
portobello mushrooms.
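
The general shape of such a prompt is a worked question, its reasoning and answer, and then the new question. The helper below is a hypothetical sketch of that assembly (`build_cot_prompt` is not defined in the notebook, which uses a template string instead); the section markers follow the layout used by the prompts here.

```python
def build_cot_prompt(example_question, example_reasoning, example_answer, new_question):
    """Assemble a one-shot chain-of-thought prompt from its parts."""
    return (
        "### Instruction:\n"
        f"Q. {example_question}\n\n"
        f"A. {example_reasoning}\n{example_answer}\n\n"
        "### Input:\n"
        f"Q. {new_question}\n"
        "### Response:\n"
    )


cot_text = build_cot_prompt(
    example_question="Replace the ground beef with portobello mushrooms.",
    example_reasoning="Steps 1 and 4 mention beef, so both must be updated.",
    example_answer="Recipe Name: Portobello Mushroom Stroganoff",
    new_question="Replace the chicken with seared tuna.",
)
print(cot_text)
```

The worked example stays fixed across tests; only the final question changes.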

In [19]:
# COT
beef_recipe = """Recipe Name: Beef Stroganoff, 
Ingredients: 
1 pound ground beef, 
0.5 cup chopped onion, 
1 tablespoon all-purpose flour, 
0.5 teaspoon salt, 
0.25 teaspoon paprika, 
1 (10.75 ounce) can condensed cream of mushroom soup, 
1 cup sour cream, 
8 ounces egg noodles, 

Cooking Instructions: 
1. In a large skillet over medium heat, sauté beef and onions for 10 minutes, or until meat is browned and onion is tender.
2. Stir in flour, salt, and paprika. Add condensed soup, mix well, and cook, uncovered, for 20 minutes.
3. Reduce heat to low and add sour cream, stirring well and allowing to heat through. Cover and set this mixture aside.
4. Cook egg noodles according to package directions. Drain. Serve beef mixture over noodles., """

mushroom_recipe = """Recipe Name: Portobello Mushroom Stroganoff, 
Ingredients: 
0.75 pound portobello mushrooms, 
0.5 cup chopped onion, 
1 tablespoon all-purpose flour, 
0.5 teaspoon salt, 
0.25 teaspoon paprika, 
1 (10.75 ounce) can condensed cream of mushroom soup, 
1 cup sour cream, 
8 ounces egg noodles, 

Cooking Instructions: 
1. In a large skillet over medium heat, sauté mushroom and onions for 10 minutes, or until meat is browned and onion is tender.
2. Stir in flour, salt, and paprika. Add condensed soup, mix well, and cook, uncovered, for 20 minutes.
3. Reduce heat to low and add sour cream, stirring well and allowing to heat through. Cover and set this mixture aside.
4. Cook egg noodles according to package directions. Drain. Serve mushroom mixture over noodles., """

cot_prompt_base = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. Your job is to create a new recipe based on the user's instructions. You will 
be given an original recipe and a request from the user to customize the recipe.

### Instruction:
Q. Complete the following task by reasoning step-by-step. Create a new recipe called Mushroom Stroganoff from this recipe for Beef Stroganoff. In the new recipe, replace the ground beef from the original recipe with portobello mushrooms.
{beef_recipe}

A. The request to replace ground beef for portobello mushrooms indicates that any references to ground beef should be changed to use portobello mushrooms instead. 
In the ingredient list, ground beef is mentioned as an ingredient for the original recipe, which should now be replaced with portobello mushrooms instead. 
In addition, the cooking steps will need to be modified reflect the change in ingredients.
Steps 1 and 4 directly mention ground beef, so these steps will need to be updated. Step 1 should say to "sauté mushroom and onions" instead of "sauté beef and onions".
Step 4 should be changed to mention the mushroom mixture instead of the beef mixture. The modified recipe will look like this:
{mushroom_recipe}

### Input:
Q. Complete the following task by reasoning step-by-step. Create a new recipe called {new_name} from this recipe for {recipe_name}. In the new recipe, replace the {old_ingredient} from the original recipe with {new_ingredient}.
{user_recipe}
### Response:

"""

cot_prompt = ChatPromptTemplate.from_template(cot_prompt_base)

### Multiple inference chains
This chain combines retrieval-augmentation and chain-of-thought prompting. Additionally, another 
LLM query is made before the RAG + COT prompt to generate a new name for the customized recipe. 
This is to simulate the use of the LLM chain in an application setting where users may not know
the name of the recipe they want in advance.
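
The order of operations can be sketched with stub functions standing in for the LLM and the retriever. Everything below is hypothetical scaffolding for illustration: the stubs return canned strings, and the actual chains are composed with LCEL rather than called as plain functions.

```python
def run_multi_inference(test_case, name_llm, strip_llm, retrieve, recipe_llm):
    """Generate a name, isolate it, retrieve context, then write the recipe."""
    reasoning = name_llm(test_case)            # 1. reason about a new name
    new_name = strip_llm(reasoning).strip()    # 2. keep only the final name
    retrieved = retrieve(new_name)             # 3. look up a similar recipe
    prompt_inputs = {**test_case, "new_name": new_name, "retrieved_recipes": retrieved}
    return {
        "new_name": new_name,
        "context_recipes": retrieved,
        "output": recipe_llm(prompt_inputs),   # 4. generate the recipe
    }


# Stubs that imitate each stage without calling a model.
def fake_name_llm(case):
    name = case["new_ingredient"].title() + " Stroganoff"
    return "The beef is replaced, so the name changes. Final Answer: " + name


def fake_strip_llm(text):
    return text.split("Final Answer:")[-1]


def fake_retrieve(name):
    return "(recipe retrieved for: " + name + ")"


def fake_recipe_llm(inputs):
    return "Recipe Name: " + inputs["new_name"]


case = {
    "recipe_name": "Beef Stroganoff",
    "old_ingredient": "ground beef",
    "new_ingredient": "portobello mushroom",
    "user_recipe": "Recipe Name: Beef Stroganoff",
}
multi_result = run_multi_inference(
    case, fake_name_llm, fake_strip_llm, fake_retrieve, fake_recipe_llm
)
print(multi_result)
```

Because the generated name drives both retrieval and the final prompt, an error in the naming step propagates to everything after it.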

In [20]:
# RAG + COT + Multi-inference
modified_name_prompt_cotless = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. Your job is to come up with new names for recipes. You will 
be given an original recipe and a request from the user to change the name of the recipe to a more fitting one.

### Input:
Create a new recipe name for {recipe_name} where {old_ingredient} will be replaced with {new_ingredient}.
Return the name of the recipe at the end of your response.
{user_recipe}

### Response:
A.
"""

modified_name_prompt_base = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. Your job is to come up with new names for recipes. You will 
be given an original recipe and a request from the user to change the name of the recipe to a more fitting one.

### Instruction:
Q. Complete the following task by reasoning step-by-step. Create a new recipe name for 'Beef Stroganoff' where beef will be replaced with portobello mushrooms,
and vegetable oil will be replaced with canola oil. Return the name of the recipe at the end of your response.
{beef_recipe}

A. Each modification to the recipe should be reflected in the name of the new recipe. The original recipe name is "Beef Stroganoff" because it contains beef. Since portobello mushrooms
will be substituted in place of beef, 'Beef' should be removed from the recipe's name as it will no longer be in the new recipe.
The word 'Beef' in the title will need to be replaced with the new ingredient, which is 'Portobello Mushroom'. Vegetable oil is listed as an ingredient in the original recipe,
but it is not mentioned in the recipe name. The replacement, canola oil, should be mentioned in the name of the new recipe to indicate that this ingredient was changed. Thus, the new name for the modified recipe
should be 'Portobello Mushroom Stroganoff with Canola Oil'. Final Answer: Portobello Mushroom Stroganoff with Canola Oil

### Input:
Q. Complete the following task by reasoning step-by-step. Create a new recipe name for {recipe_name} where {old_ingredient} will be replaced with {new_ingredient}.
Return the name of the recipe at the end of your response.
{user_recipe}
### Response:
A.
"""

strip_name_prompt = ChatPromptTemplate.from_template("""
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. Your task is to find the final name of the recipe in a group of sentences. Return the name of the recipe and nothing else.

    ### Input:
    Find the final name of the recipe in the following sentences. Return the name of the recipe and nothing else.
    {resp}
    
    ### Response:
""")
modified_name_prompt = ChatPromptTemplate.from_template(modified_name_prompt_base)

In [21]:
cot_multi_prompt_base = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. Your job is to create a new recipe based on the user's instructions. You will 
be given an original recipe and a request from the user to customize the recipe.To help you customize the recipe, several example recipes will be provided based on the user's request.


### Instruction:
Q. Complete the following task by reasoning step-by-step. Create a new recipe called Mushroom Stroganoff from this recipe for Beef Stroganoff. In the new recipe, replace the ground beef from the original recipe with portobello mushrooms.
{beef_recipe}

A. The request to replace ground beef for portobello mushrooms indicates that any references to ground beef should be changed to use portobello mushrooms instead. 
In the ingredient list, ground beef is mentioned as an ingredient for the original recipe, which should now be replaced with portobello mushrooms instead. 
In addition, the cooking steps will need to be modified reflect the change in ingredients.
Steps 1 and 4 directly mention ground beef, so these steps will need to be updated. Step 1 should say to "sauté mushroom and onions" instead of "sauté beef and onions".
Step 4 should be changed to mention the mushroom mixture instead of the beef mixture. The modified recipe will look like this:
{mushroom_recipe}

### Input:

Example Recipes:
{retrieved_recipes}

Q. Complete the following task by reasoning step-by-step. Create a new recipe called {new_name} from this recipe for {recipe_name}. In the new recipe, replace the {old_ingredient} from the original recipe with {new_ingredient}.
{user_recipe}

### Response:
A. 
"""

cot_multi_prompt = ChatPromptTemplate.from_template(cot_multi_prompt_base)
multi_prompt_retriever = chroma_store.as_retriever(search_kwargs = {"k": 1})

### Creating evaluation tests
Each test contains the same type of task: replace one ingredient in a recipe with another to create a new recipe. Every test also has a pair of recipes and a pair of ingredients. One recipe is the starting recipe and the other is the target recipe; the pair of ingredients specifies the requested substitution. Information from the starting recipe and the substitution is provided to the language model, while information about the intended target is held out for later evaluation and excluded from the prompt context. Each test is performed in a separate conversation context so that information from one test cannot leak into other tests.<br>

Test Inputs:
- starting recipe
- ingredient to replace
- new ingredient to replace with
- name of the target recipe <sup>1</sup>
- (RAG) Similar recipes to the target recipe, based on the target recipe's name
- (COT) Worked example for substitution task, same example used across all tests

<sup>1. Omitted as input for the multiple inference chain and instead delegated to a separate task for the LLM.</sup>

Test Outputs:
- input prompt
- generated output
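
One way to make the separation between prompt inputs and held-out reference data explicit is sketched below. The helper `split_test_case` is hypothetical and not part of the notebook; the key names mirror those used when the test cases are packaged, and the values are invented for the example.

```python
def split_test_case(test_case, held_out=("expected",)):
    """Separate the fields shown to the model from those kept for scoring."""
    prompt_inputs = {k: v for k, v in test_case.items() if k not in held_out}
    reference = {k: test_case[k] for k in held_out if k in test_case}
    return prompt_inputs, reference


# Hypothetical test case with made-up values.
example_case = {
    "recipe_name": "Beef Stroganoff",
    "new_name": "Portobello Mushroom Stroganoff",
    "old_ingredient": "Ground beef",
    "new_ingredient": "Portobello mushroom",
    "user_recipe": "Recipe Name: Beef Stroganoff",
    "expected": "Recipe Name: Portobello Mushroom Stroganoff",
}
inputs, reference = split_test_case(example_case)
print(sorted(inputs))
print(reference)
```

Splitting the dictionary up front guards against the target recipe reaching a prompt by accident.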

In [22]:
# Package test cases into single batch of dicts
recipe_tests = merge_recipe.copy()
recipe_tests['index'] = recipe_tests.index
select_cols = ['index','start_name', 'end_name', 'remove', 'add', 'whole_recipe_start', 'whole_recipe_end']
key_names = ['index', 'recipe_name', 'new_name', 'old_ingredient', 'new_ingredient', 'user_recipe', 'expected']
recipe_tests.rename(columns=dict(zip(select_cols,key_names)), inplace=True)
test_cases = recipe_tests[key_names].to_dict(orient='records')

### Generating responses with LLaMa-2, GPT4-turbo, and GPT3.5-turbo
Each language model is placed into its own set of LCEL chains. The chains are then executed
using each recipe in the test corpus as input.

In [1]:
def evaluate_test_cases(LLM_model, output_name):
    # Base
    base_chain = base_prompt | {'output':LLM_model| output_parser, 'prompt': RunnablePassthrough()}

    # RAG
    retriever = chroma_store.as_retriever(search_kwargs = {"k": 1})
    setup = {"retrieved": itemgetter("new_name") | retriever | _combine_documents, "user_recipe": itemgetter('user_recipe'), "inputs":RunnablePassthrough()}
    prompt_inputs = {"inputs": lambda x: {"retrieved_recipes": x["retrieved"], "user_recipe": x['user_recipe']} | x["inputs"]}
    prompt_chain = prompt_inputs | RunnablePassthrough() | itemgetter("inputs") | retrieval_prompt
    output_chain = {"output": prompt_chain | LLM_model | output_parser, "context_recipes": itemgetter("retrieved"), "prompt": prompt_chain}
    rag_chain = setup | RunnablePassthrough() | output_chain

    # COT
    cot_chain = cot_prompt | {'output':LLM_model|output_parser, 'prompt': RunnablePassthrough()}

    # RAG + COT, multi-inference
    retriever_branch = itemgetter("new_name") | multi_prompt_retriever | _combine_documents
    generate_new_name =  {"new_name":{"resp": modified_name_prompt | LLM_model | output_parser} | strip_name_prompt | LLM_model | output_parser, "inputs": RunnablePassthrough()}
    prompt_inputs = {"inputs":lambda x: x["inputs"] | {"new_name": x["new_name"], "retrieved_recipes":  x["retrieved"]}}
    parallels = {"retrieved": retriever_branch, "new_name": itemgetter("new_name"), "inputs": itemgetter("inputs")}
    output_chain = RunnableParallel({"output": prompt_inputs | RunnablePassthrough() | itemgetter("inputs") | cot_multi_prompt | LLM_model | output_parser,
        "context_recipes": itemgetter("retrieved"), "prompt":prompt_inputs | RunnablePassthrough() | itemgetter("inputs") | cot_multi_prompt})
    rag_cot_chain = generate_new_name | RunnablePassthrough() | parallels | output_chain

    test_results = list()
    cot_examples = {'beef_recipe':beef_recipe,'mushroom_recipe':mushroom_recipe}
    for tcase in tqdm(test_cases):
        consts = tcase.copy()
        del consts['index']
        multi_chain_consts = consts.copy()
        del multi_chain_consts['new_name']

        base_case = base_chain.invoke(consts)
        rag_case = rag_chain.invoke(consts)
        cot_case = cot_chain.invoke(consts|cot_examples)
        rag_cot_case = rag_cot_chain.invoke(multi_chain_consts|cot_examples)
        test_results.append({'index':tcase['index'],'consts': consts,'base':base_case,'RAG':rag_case,'COT':cot_case,'multi':rag_cot_case})

    case_names = ['base', 'RAG', 'COT', 'multi']
    output_cols = [name + '_output' for name in case_names]
    prompt_cols = [name + '_prompt' for name in case_names]
    test_records = list()
    for test_res in test_results:
        recipe_index = test_res['index']
        outputs = [test_res[name]['output'] for name in case_names]
        prompt = [test_res[name]['prompt'].messages[0].content.strip() for name in case_names]

        outputs = dict(zip(output_cols,outputs))
        prompt = dict(zip(prompt_cols, prompt))
        test_records.append(outputs | prompt)

    test_df = pd.DataFrame.from_records(test_records)
    test_df.to_json(f'./{output_name}.jsonl', lines=True, force_ascii=False, orient='records')
    return test_df

In [None]:
openai_model_4 = OpenAI(model_name="gpt-4-1106-preview", max_retries=0)
openai_model_3 = OpenAI(model_name="gpt-3.5-turbo-1106", max_retries=0)

test_df = evaluate_test_cases(llama, "../datasets/llm_test_cases/test_cases_llama_v3")
_ = evaluate_test_cases(openai_model_4, "../datasets/llm_test_cases/test_cases_gpt4turbo_v3")
_ = evaluate_test_cases(openai_model_3, "../datasets/llm_test_cases/test_cases_gpt3-5turbo_v3")

In [34]:
test_df.head()

| | base_output | RAG_output | COT_output | multi_output | base_prompt | RAG_prompt | COT_prompt | multi_prompt |
|---|---|---|---|---|---|---|---|---|
| 0 | Here is your new recipe, Easy Brown Rice Risot... | Easy Brown Rice Risotto with Mushrooms and Fre... | To create a vegetarian version of the classic ... | \nTo create a vegetarian version of the classi... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest assis... | You are a helpful, respectful and honest assis... |
| 1 | As requested, I have created a new recipe base... | Classic Risotto with Mushrooms and Fresh Orega... | To complete the task of creating a new recipe ... | \nTo create a vegetarian version of the recipe... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest assis... | You are a helpful, respectful and honest assis... |
| 2 | As a helpful, respectful and honest cooking as... | Butternut Squash Chipotle Chili with Avocado\n... | To create a vegetarian version of the Chipotle... | \nTo create a vegetarian version of the Chipot... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest assis... | You are a helpful, respectful and honest assis... |
| 3 | As requested, here is a new recipe that substi... | Vegetarian Butternut Squash Chipotle Chili wit... | To complete the task of creating a new recipe ... | \nTo create a vegetarian version of the Butter... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest assis... | You are a helpful, respectful and honest assis... |
| 4 | Here is your new recipe, Portobello Mushroom S... | Portobello Mushroom Stroganoff, \nIngredients:... | To complete the task of creating a new recipe ... | \nThe new recipe, Portobello Mushroom Strogano... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest cooki... | You are a helpful, respectful and honest assis... | You are a helpful, respectful and honest assis... |


### Breakdown of test case execution

##### Insert language model into LCEL chains

In [23]:
openai_model = OpenAI(model_name="gpt-4-1106-preview", max_retries=0)

# Base
base_chain_openai = base_prompt | {'output':openai_model|output_parser, 'prompt': RunnablePassthrough()}
# RAG
retriever = chroma_store.as_retriever(search_kwargs = {"k": 1})
setup = {"retrieved": itemgetter("new_name") | retriever | _combine_documents, "user_recipe": itemgetter('user_recipe'), "inputs":RunnablePassthrough()}
prompt_inputs = {"inputs": lambda x: {"retrieved_recipes": x["retrieved"], "user_recipe": x['user_recipe']} | x["inputs"]}
prompt_chain = prompt_inputs | RunnablePassthrough() | itemgetter("inputs") | retrieval_prompt
output_chain_openai = {"output": prompt_chain | openai_model | output_parser, "context_recipes": itemgetter("retrieved"), "prompt": prompt_chain}
rag_chain_openai = setup | RunnablePassthrough() | output_chain_openai

# COT
cot_chain_openai = cot_prompt | {'output':openai_model|output_parser, 'prompt': RunnablePassthrough()}

# RAG + COT, multi-inference
pass_inputs = RunnablePassthrough()
retriever_branch = itemgetter("new_name") | multi_prompt_retriever | _combine_documents
generate_new_name_openai =  {"new_name":{"resp": modified_name_prompt | openai_model | output_parser} | strip_name_prompt | openai_model | output_parser, "inputs": RunnablePassthrough()}
prompt_inputs = {"inputs":lambda x: x["inputs"] | {"new_name": x["new_name"], "retrieved_recipes":  x["retrieved"]}}
parallels = {"retrieved": retriever_branch, "new_name": itemgetter("new_name"), "inputs": itemgetter("inputs")}
output_chain_openai = RunnableParallel({"output": prompt_inputs | RunnablePassthrough() | itemgetter("inputs") | cot_multi_prompt | openai_model | output_parser,
     "context_recipes": itemgetter("retrieved"), "prompt":prompt_inputs | RunnablePassthrough() | itemgetter("inputs") | cot_multi_prompt})
rag_cot_chain_openai = generate_new_name_openai | RunnablePassthrough() | parallels | output_chain_openai
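
The RAG + COT chain is the hardest of the four to read in LCEL form. Its data flow can be approximated with plain functions: the model first proposes a recipe name, that name drives retrieval, and the results are merged into the inputs of the final prompt. The helpers `propose_name`, `retrieve`, and `answer` below are hypothetical stubs that return placeholder strings; they only mark where the real prompts, retriever, and model would run. The rendered `prompt` that the real chain also returns is omitted here.

```python
def propose_name(inputs):
    # Stub for: modified_name_prompt | LLM | strip_name_prompt | LLM
    return f"{inputs['recipe_name']} with {inputs['new_ingredient']}"


def retrieve(name):
    # Stub for: multi_prompt_retriever | _combine_documents
    return f"<recipes similar to '{name}'>"


def answer(prompt_inputs):
    # Stub for: cot_multi_prompt | LLM | output_parser
    return f"<model output for {prompt_inputs['new_name']}>"


def rag_cot(inputs):
    # Step 1 (generate_new_name): the model names the target recipe itself
    state = {'new_name': propose_name(inputs), 'inputs': inputs}
    # Step 2 (parallels): retrieve context using the generated name
    state['retrieved'] = retrieve(state['new_name'])
    # Step 3 (prompt_inputs): merge everything the final prompt needs
    prompt_inputs = state['inputs'] | {
        'new_name': state['new_name'],
        'retrieved_recipes': state['retrieved'],
    }
    # Step 4 (output_chain): return the answer together with its context
    return {'output': answer(prompt_inputs), 'context_recipes': state['retrieved']}


# Hypothetical input; note that no 'new_name' key is supplied
result = rag_cot({'recipe_name': 'Example Stew',
                  'old_ingredient': 'beef',
                  'new_ingredient': 'lentils'})
print(result)
```

This is why `new_name` is removed from the inputs of the multi-inference chain: the chain produces that value in its first step.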
 

##### Query LLM for each recipe in test corpus and collect results

In [24]:
test_results_openai = list()
cot_examples = {'beef_recipe':beef_recipe,'mushroom_recipe':mushroom_recipe}
# A full run iterates over all of test_cases; one case is enough for the example outputs below
for tcase in tqdm([test_cases[0]]):
    consts = tcase.copy()
    del consts['index']
    multi_chain_consts = consts.copy()
    del multi_chain_consts['new_name']

    base_case_openai = base_chain_openai.invoke(consts)
    rag_case_openai = rag_chain_openai.invoke(consts)
    cot_case_openai = cot_chain_openai.invoke(consts|cot_examples)
    rag_cot_case_openai = rag_cot_chain_openai.invoke(multi_chain_consts|cot_examples)
    test_results_openai.append({'index':tcase['index'],'consts': consts,'base':base_case_openai,'RAG':rag_case_openai,'COT':cot_case_openai,'multi':rag_cot_case_openai})
    

100%|██████████| 1/1 [02:33<00:00, 153.36s/it]
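
The loop above builds a slightly different input dict for each chain. A minimal sketch of that bookkeeping follows, using a hypothetical test case and placeholder strings for the worked examples. The dict union operator `|` requires Python 3.9 or later.

```python
# Hypothetical test case; values are illustrative only
tcase = {
    'index': 0,
    'recipe_name': 'Example Stew',
    'new_name': 'Example Stew with Lentils',
    'old_ingredient': 'beef',
    'new_ingredient': 'lentils',
    'user_recipe': '<original recipe text>',
    'expected': '<expected recipe text>',
}
# Placeholders for the worked chain-of-thought example recipes
cot_examples = {'beef_recipe': '<worked example: original>',
                'mushroom_recipe': '<worked example: substituted>'}

# Base and RAG chains: everything except the bookkeeping index
consts = {k: v for k, v in tcase.items() if k != 'index'}

# COT chain: the same inputs plus the worked example
cot_inputs = consts | cot_examples

# Multi-inference chain: no target name, since the chain generates it
multi_inputs = {k: v for k, v in consts.items() if k != 'new_name'} | cot_examples

print(sorted(multi_inputs))
```

Building new dicts instead of deleting keys in place leaves `tcase` untouched, which matters because its `index` is still needed when the results are stored.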


##### Merge results into a dataframe and export to a JSON Lines file

In [25]:
case_names = ['base', 'RAG', 'COT', 'multi']
output_cols = [name + '_output' for name in case_names]
prompt_cols = [name + '_prompt' for name in case_names]
test_records = list()
for test_res in test_results_openai:
    recipe_index = test_res['index']
    outputs = [test_res[name]['output'] for name in case_names]
    prompt = [test_res[name]['prompt'].messages[0].content.strip() for name in case_names]

    outputs = dict(zip(output_cols,outputs))
    prompt = dict(zip(prompt_cols, prompt))
    test_records.append(outputs | prompt)

test_df = pd.DataFrame.from_records(test_records)
test_df.to_json('../datasets/llm_test_cases/example_test.jsonl', lines=True, force_ascii=False, orient='records')
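
For readers unfamiliar with the export format, the flattening and JSON Lines output can be sketched with the standard library alone. The example below uses placeholder strings instead of real chain results and writes to a temporary directory; it is meant to show the record layout, not to replace the pandas call above.

```python
import json
import os
import tempfile

case_names = ['base', 'RAG', 'COT', 'multi']

# Hypothetical result for one test case; strings stand in for the
# model outputs and the rendered prompt text
test_res = {name: {'output': f'<{name} output>', 'prompt': f'<{name} prompt>'}
            for name in case_names}

# One flat record per test case: *_output columns first, then *_prompt columns
record = {f'{name}_output': test_res[name]['output'] for name in case_names}
record |= {f'{name}_prompt': test_res[name]['prompt'] for name in case_names}

# JSON Lines: one JSON object per line; ensure_ascii=False keeps
# non-ASCII characters readable, like force_ascii=False in pandas
path = os.path.join(tempfile.mkdtemp(), 'example_test.jsonl')
with open(path, 'w', encoding='utf-8') as f:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read the file back to confirm the round trip
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f]
print(list(loaded[0]))
```

Because each line is an independent JSON object, results from additional test cases can be appended without rewriting the file.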

In [26]:
print(base_case_openai['output'])

Recipe Name: Easy Brown Rice Risotto with Mushrooms and Fresh Oregano

Ingredients:
- 4 cups store-bought or homemade low-sodium vegetable broth
- 5 tablespoons unsalted butter, divided
- 1 medium white onion, finely chopped (about 1 1/2 cups)
- 2 1/2 teaspoons kosher salt, divided
- 1 1/2 cups brown rice, preferably short grain
- 1 cup dry white wine
- 3 oz. Parmesan, finely grated (about 1 1/2 cups)
- 3/4 teaspoon freshly ground black pepper
- 8 oz. fresh mushrooms, cleaned and sliced
- 2 tablespoons fresh oregano, chopped

Cooking Instructions:

1. In a medium pot over medium-high heat, heat the vegetable broth until steaming. Reduce the heat to low, cover, and keep the broth warm.

2. In a large skillet over medium heat, melt 3 tablespoons of butter. Add the onion and 1/2 teaspoon of salt. Cook, stirring occasionally, until the onion is softened and translucent, about 8 to 10 minutes. 

3. Add the sliced mushrooms to the skillet and cook until they release their moisture and begin 

In [27]:
print(rag_case_openai['output'])

Easy Brown Rice Risotto with Mushrooms and Fresh Oregano

### Ingredients:
- 4 cups vegetable broth, preferably low-sodium
- 3 tablespoons unsalted butter, divided
- 1 medium white onion, finely chopped (about 1 1/2 cups)
- 1 1/2 teaspoons kosher salt, divided, plus more to taste
- 1 1/2 cups brown rice, preferably short-grain for creaminess
- 1 cup dry white wine
- 1 pound assorted fresh mushrooms, such as shiitake or cremini, cleaned and sliced
- Freshly ground black pepper, to taste
- 1/2 cup grated Parmesan cheese, plus more for serving
- 1/4 cup fresh oregano leaves, chopped, plus more for garnish

### Cooking Instructions:
1. In a medium pot over medium heat, bring the vegetable broth to a simmer. Reduce the heat to low, cover, and keep warm.

2. In a large, deep skillet or saucepan, melt 2 tablespoons of butter over medium heat. Add the onion and 1/2 teaspoon of salt. Cook, stirring occasionally, until the onions are soft and translucent, about 6 to 8 minutes.

3. Add the brown 

In [28]:
print(cot_case_openai['output'])

A. To create the new recipe called "Easy Brown Rice Risotto with Mushrooms and Fresh Oregano," we will modify the Classic Risotto recipe. The key change here is that we're replacing chicken broth with vegetable broth, making the recipe vegetarian. Furthermore, we will be adding mushrooms and fresh oregano to enhance the flavors associated with these ingredients. The original steps will be adjusted to accommodate the cooking time of brown rice and the addition of mushrooms and oregano.

Let's adjust the recipe ingredient by ingredient, and then rewrite the cooking instructions accordingly.

Recipe Name: Easy Brown Rice Risotto with Mushrooms and Fresh Oregano

Ingredients:
- 4 c. store-bought or homemade low-sodium vegetable broth
- 5 tbsp. unsalted butter, divided
- 1 medium white onion, finely chopped (about 1 1/2 c.)
- 2 1/2 tsp. kosher salt, divided
- 1 1/2 c. brown arborio or short-grain brown rice
- 1 c. dry white wine
- 3 oz. Parmesan, finely grated (about 1 1/2 c.)
- 3/4 tsp. fr

In [29]:
print(rag_cot_case_openai['output'])

To create a new recipe called Classic Risotto with Vegetable Broth, we will modify the original Classic Risotto recipe by replacing the chicken broth with vegetable broth, making it suitable for vegetarians. This change will also alter the overall flavor to be more vegetable-focused. Here is the modified recipe:

Recipe Name: Classic Risotto with Vegetable Broth

Ingredients:
- 4 cups store-bought or homemade low-sodium vegetable broth
- 5 tbsp. unsalted butter, divided
- 1 medium white onion, finely chopped (about 1 1/2 c.)
- 2 1/2 tsp. kosher salt, divided
- 1 1/2 c. arborio or carnaroli rice
- 1 c. dry white wine
- 3 oz. Parmesan, finely grated (about 1 1/2 c.)
- 3/4 tsp. freshly ground black pepper

Cooking Instructions:
1. In a medium pot over medium-high heat, heat the vegetable broth until steaming. Reduce heat to low, cover, and keep warm.
2. Meanwhile, in a large skillet over medium heat, melt 4 tablespoons of butter. Add onion and 1/2 teaspoon of salt and cook, stirring occas