# Evaluating Test Results
This notebook details the process of formatting the test responses and presents the results
of manual evaluation.

In [1]:
import pandas as pd
import sys
sys.path.append('../')

##### Merge recipe metadata into single dataframe

In [2]:
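# Test case definitions (start/end recipe names and URLs)
fpath = "../datasets/llm_test_cases/Test Cases.xlsx"
df = pd.read_excel(fpath)

# Field selector, column names, and paths for the raw and cleaned recipe files.
# Only outpath is used in this cell.
schema = ".recipe_name, .ingredients, [.instructions[].text], .source_url"
columns = ["recipe_name", "ingredients", "instructions", "url"]
fpath = ["../datasets/llm_test_cases/test_recipes.jl"]
outpath = ["../datasets/llm_test_cases/test_recipes_cleaned.jsonl"]

# Load the cleaned recipes (one JSON object per line)
cleaned_df = pd.read_json(outpath[0], lines=True)
cleaned_df.head()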

,recipe_name,ingredients,instructions,url,step_instructions,whole_recipe
0,Beef Stroganoff,600 g / 1.2 lb scotch fillet steak / boneles...,Use your fist (or rolling pin or mallet) to fl...,https://www.recipetineats.com/beef-stroganoff/,1. Use your fist (or rolling pin or mallet) to...,"Recipe Name: Beef Stroganoff, \nIngredients: 6..."
1,Creamy Homemade Baked Mac and Cheese,"1 lb. dried elbow pasta, 1/2 cup unsalted butt...",Preheat oven to 325 degrees F and grease a 3 q...,https://www.thechunkychef.com/family-favorite-...,1. Preheat oven to 325 degrees F and grease a ...,Recipe Name: Creamy Homemade Baked Mac and Che...
2,Vegetarian Butternut Squash Chipotle Chili wit...,"2 tablespoons olive oil, 1 medium red onion, c...",In a 4- to 6-quart Dutch oven or stockpot over...,https://cookieandkate.com/butternut-squash-chi...,1. In a 4- to 6-quart Dutch oven or stockpot o...,Recipe Name: Vegetarian Butternut Squash Chipo...
3,Best Bolognese,"1 medium onion, chopped, 1 celery stalk, chopp...","Pulse onion, celery, and carrot in a food proc...",https://www.bonappetit.com/recipe/bas-best-bol...,"1. Pulse onion, celery, and carrot in a food p...","Recipe Name: Best Bolognese, \nIngredients: 1 ..."
4,Easy Brown Rice Risotto with Mushrooms and Fre...,"3 tablespoons olive oil, divided, 1 small yell...",Make sure your oven rack is in the middle posi...,https://cookieandkate.com/easy-brown-rice-riso...,1. Make sure your oven rack is in the middle p...,Recipe Name: Easy Brown Rice Risotto with Mush...


In [3]:
merge_recipe = df.merge(cleaned_df, left_on='start_url', right_on='url')
merge_recipe = merge_recipe.merge(cleaned_df, left_on='end_url', right_on='url', suffixes=('_start', '_end'))
merge_recipe.shape

(21, 21)
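
The merged frame has 21 rows, while the scoring section later refers to 20 test cases. One possible cause (an assumption, not something the notebook confirms) is a URL that appears more than once in one of the tables, since a merge emits one row per matching pair. A minimal sketch in plain Python, using made-up URLs and a hypothetical helper name, of how duplicated keys could be spotted before merging:

```python
from collections import Counter

def duplicated_keys(keys):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key in counts if counts[key] > 1]

# Hypothetical URLs standing in for cleaned_df['url']
urls = [
    "https://example.com/risotto",
    "https://example.com/chili",
    "https://example.com/risotto",
]
print(duplicated_keys(urls))
```

In pandas, `cleaned_df['url'].duplicated().any()` performs a similar check, and `DataFrame.merge` accepts a `validate` argument (for example `validate='many_to_one'`) that raises an error when the right-hand keys are not unique.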

In [4]:
# Store results for later
merge_recipe.to_excel('../datasets/llm_test_cases/test_recipes.xlsx')

##### Write responses to spreadsheet

In [5]:
recipe_targets = merge_recipe[['start_name','end_name']].copy()
recipe_targets.head()

,start_name,end_name
0,Classic Risotto,Easy Brown Rice Risotto with Mushrooms and Fre...
1,Easy Brown Rice Risotto with Mushrooms and Fre...,Classic Risotto
2,Chipotle Chili,Butternut Squash Chipotle Chili with Avocado
3,Butternut Squash Chipotle Chili with Avocado,Chipotle Chili
4,Beef Stroganoff,Portobello Mushroom Stroganoff


In [6]:
def create_evaluation_df(fpath):
    test_df = pd.read_json(fpath, lines=True)
    return recipe_targets.merge(test_df, left_index=True,right_index=True)

In [7]:
def create_results_spreadsheet():
    # Keys double as Excel sheet names; 'gpt4.5' refers to the GPT4-turbo results.
    test_results_paths = {
        'llama': '../datasets/llm_test_cases/test_cases_llama_v2.jsonl',
        'gpt4.5': '../datasets/llm_test_cases/test_cases_gpt4turbo_v2.jsonl',
        'gpt3.5': '../datasets/llm_test_cases/test_cases_gpt3-5_turbo_v2.jsonl',
    }
    test_results = dict()
    with pd.ExcelWriter('../datasets/llm_test_cases/test_evaluation_v3.xlsx') as test_writer:
        for model_name, result_path in test_results_paths.items():
            eval_df = create_evaluation_df(result_path)
            test_results[model_name] = eval_df
            eval_df.to_excel(test_writer, sheet_name=model_name)
    # Return the per-model dataframes so the call below has a value to assign.
    return test_results
# test_results = create_results_spreadsheet()

### Test Summary
Each test will contain the same type of task: to replace one ingredient from a recipe with another ingredient to create a new recipe as the output. Every test will also have a pair of recipes and a pair of ingredients. In the case of the recipes, one recipe represents the starting recipe and another represents the target recipe. The pair of ingredients details the requested substitution for the test. Information from the starting recipe and the substitution is provided to the language model, while information about the intended target is preserved for test evaluation later and is excluded from the prompt context.

Refer to *execute_test_cases.ipynb* for more details on model tests.

Models Used:
1. [LLaMa-2 (13B) conversational chat model](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ)
2. [OpenAI GPT4-turbo](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo)
3. [OpenAI GPT3.5-turbo](https://platform.openai.com/docs/models/gpt-3-5)

## Evaluation Scores

Each response to a recipe was scored based on four categories on a pass/fail basis.

1. Formatting: Response contains answer in the form of a recipe (Recipe name, followed by ingredients, then followed by instructions)
2. Replaced Ingredient: Response makes requested substitutions with appropriate replacements and amounts.
3. Customized Instructions: Response tries to implement the requested customizations in the instructions for preparing the recipe. Accurate instructions are not required for a passing score in this category; accuracy is evaluated in the next category.
4. Cooking Understanding: Response demonstrates an understanding of cooking; i.e., the response neither contains nonsensical or impossible instructions nor simply copies the instructions from the provided recipe.

In addition, the multiple inference task for each test was also scored on the generated name for the new recipe. <br>

5. Naming: Generated name appropriately describes the results of the substitution task.

<br>
Scores are aggregated into an accuracy metric representing the proportion 
of the 20 test cases passed in each category.
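
As an illustration of the aggregation described above, the sketch below averages pass/fail marks per category and skips entries that were not scored (such as Naming outside the multiple inference task). The marks and the `pass_rates` helper are hypothetical; the notebook itself computes the averages with a pandas `groupby(...).mean()`.

```python
def pass_rates(scores):
    """Average pass/fail marks per category, skipping unscored (None) entries."""
    totals, counts = {}, {}
    for row in scores:
        for category, mark in row.items():
            if mark is None:
                continue
            totals[category] = totals.get(category, 0) + mark
            counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

# Hypothetical marks for four responses (1 = pass, 0 = fail, None = not scored)
example = [
    {"Formatting": 1, "Replaced Ingredient": 1, "Naming": None},
    {"Formatting": 1, "Replaced Ingredient": 0, "Naming": None},
    {"Formatting": 1, "Replaced Ingredient": 1, "Naming": None},
    {"Formatting": 1, "Replaced Ingredient": 1, "Naming": None},
]
print(pass_rates(example))  # {'Formatting': 1.0, 'Replaced Ingredient': 0.75}
```

A category with no scored entries is simply omitted here, whereas the pandas mean reports it as `NaN`, which is why the Naming column is blank for most rows in the score tables.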

In [8]:
# Cleaning up the imported dataframe
def format_scores(score_df:pd.DataFrame):
    formatted_df = score_df.copy()
    formatted_df.drop(columns=['Note'], inplace=True)
    formatted_df['Index'] = formatted_df['Index'].ffill(axis=0).astype(int)
    formatted_df.dropna(how='all', subset=['Formatting','Replaced Ingredient','Customized Instructions','Cooking Understanding'], inplace=True)
    formatted_df.set_index(['Index','Prompt Type'], inplace=True)

    return formatted_df
    
# Load scores from spreadsheets
# Reorganize into multiindex dataframe
def load_scores(fpath):
    sheet_names = ['llama','gpt4.5','gpt3.5']
    score_sheets = dict()
    for name in sheet_names:
        score_sheets[name] = format_scores(pd.read_excel(fpath,sheet_name=name))\
                                            .groupby(by='Prompt Type')\
                                            .mean()
    return pd.concat(score_sheets)


In [9]:
v1_scores = load_scores('../datasets/llm_test_cases/evaluation_scores.xlsx')
v2_scores = load_scores('../datasets/llm_test_cases/evaluation_scores_v2.xlsx')

In [10]:
v1_scores

Model,Prompt Type,Formatting,Replaced Ingredient,Customized Instructions,Cooking Understanding,Naming
llama,RAG,1.0,0.6,0.55,0.55,
llama,base,1.0,0.6,0.55,0.4,
llama,multi,1.0,0.75,0.7,0.5,0.9
gpt4.5,RAG,1.0,0.8,0.8,0.8,
gpt4.5,base,1.0,0.95,0.95,0.95,
gpt4.5,multi,1.0,1.0,1.0,1.0,1.0
gpt3.5,RAG,1.0,0.65,0.65,0.65,
gpt3.5,base,1.0,0.8,0.8,0.8,
gpt3.5,multi,0.95,0.95,0.9,0.95,0.95


## Refining input prompts
After reviewing the first round of test results, several issues were identified in the input prompts used
to generate the responses. The task description was ambiguous about the direction of substitution:
was the task to replace ingredient *x* with a new ingredient *y*, or the other way around?
The LLM often interpreted the replacement in the opposite direction from the one intended.
To fix this problem, the phrase "substitute *x* for *y*" was changed to
"replace ingredient *x* from the original recipe with ingredient *y*".<br>

Additionally, the chain-of-thought prompt was specialized too heavily toward making a recipe vegetarian,
which hurt response generation for substitutions that were not expected to yield a vegetarian recipe.
This was remedied by replacing the worked example with one that simply shows how to reason through substituting ingredients in a recipe, without catering to dietary restrictions. <br>

The responses for vegetarian substitutions did suggest that several specialized examples may elicit higher-quality
responses, albeit at the cost of flexibility in handling other substitution types.
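
The rephrased task description can be sketched as a small template function. The function name and the surrounding sentence structure are assumptions made for illustration; the prompts actually used are defined in *execute_test_cases.ipynb*.

```python
def substitution_task(original: str, replacement: str) -> str:
    """Build a task description that states the direction of the swap explicitly."""
    original, replacement = original.strip(), replacement.strip()
    if not original or not replacement:
        raise ValueError("both ingredients must be non-empty")
    return (
        f"Replace {original} from the original recipe "
        f"with {replacement}."
    )

print(substitution_task("butternut squash", "ground beef"))
```

Naming the source of the first ingredient ("from the original recipe") leaves only one reading of which ingredient is removed and which is added.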



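The models were reevaluated using the test corpus. The revised prompts raised pass rates in most categories;
for example, LLaMa-2's pass rate for Replaced Ingredient rose from 0.60 to 0.95 with the base prompt and from 0.60 to 1.00 with RAG. One exception was LLaMa-2's Cooking Understanding score on the multiple inference prompt, which fell from 0.50 to 0.45.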
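
To see where the prompt changes helped and where they did not, the two score tables can be differenced. The sketch below does this in plain Python for the LLaMa-2 rows only, with values copied from the `v1_scores` and `v2_scores` outputs; the `score_changes` helper is hypothetical.

```python
# LLaMa-2 pass rates copied from the v1_scores and v2_scores tables.
CATEGORIES = ("Replaced Ingredient", "Customized Instructions", "Cooking Understanding")
v1 = {"RAG": (0.6, 0.55, 0.55), "base": (0.6, 0.55, 0.4), "multi": (0.75, 0.7, 0.5)}
v2 = {"RAG": (1.0, 1.0, 0.9), "base": (0.95, 0.95, 0.8), "multi": (0.9, 0.85, 0.45)}

def score_changes(before, after):
    """Per-prompt, per-category difference (after - before), rounded to 2 places."""
    return {
        prompt: {
            category: round(new - old, 2)
            for category, old, new in zip(CATEGORIES, before[prompt], after[prompt])
        }
        for prompt in before
        if prompt in after
    }

for prompt, deltas in score_changes(v1, v2).items():
    print(prompt, deltas)
```

With the dataframes loaded in the notebook, `v2_scores - v1_scores` should give the same comparison for all models, since pandas aligns the subtraction on the index; rows present in only one table (such as COT) would come out as `NaN`.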

In [11]:
v2_scores

Model,Prompt Type,Formatting,Replaced Ingredient,Customized Instructions,Cooking Understanding,Naming
llama,COT,1.0,0.95,0.95,0.9,
llama,RAG,1.0,1.0,1.0,0.9,
llama,base,0.95,0.95,0.95,0.8,
llama,multi,1.0,0.9,0.85,0.45,1.0
gpt4.5,COT,1.0,1.0,1.0,1.0,
gpt4.5,RAG,1.0,1.0,1.0,1.0,
gpt4.5,base,1.0,1.0,1.0,1.0,
gpt4.5,multi,1.0,1.0,1.0,1.0,1.0
gpt3.5,COT,1.0,0.95,0.9,0.9,
gpt3.5,RAG,1.0,1.0,1.0,0.95,


### Notes and Observations
- GPT4-turbo produced high-quality responses regardless of prompt type. As a result, most of the following observations were noted with respect to the less complex models, namely LLaMa-2 and GPT3.5-turbo.

- Retrieval-augmented generation appeared to benefit creativity and breadth of knowledge in the generated responses. Uncommon substitution tasks that the models failed with the base and chain-of-thought prompts were passed when using RAG. This suggests that the language models can extract information from the retrieved documents and apply it to the task. The additional context also helped the models adapt the instructions to match the requested substitutions. With LLaMa-2, the instructions for the output recipe were often too similar to those of the input recipe, which led to inappropriate steps such as shredding pork before cooking it in the oven to make pulled pork. Providing an example pulled pork sandwich recipe allowed the model to move the shredding step to after the cooking steps. Some nuances of preparing ingredients that eluded the LLM with the base prompt were captured when context recipes were provided. For example, when replacing butternut squash with ground beef, additional context was needed to indicate that the ground beef should be browned and that steps would be needed to handle the rendered fat, which is absent from the original recipe.
- The downsides of providing more than one recipe as additional context for RAG outweigh the benefit of the extra information, for two reasons. First, each additional recipe adds to the input prompt's token length, which in turn leads to significantly longer generation times. In the case of LLaMa-2, the prompt exceeded the model's maximum token length, so the model failed to generate a proper response. Second, the extra text from the additional recipes would "dilute" the information from the task description and the original recipe. This led to responses in which the output recipe simply reused instructions from the context recipes without regard for the ingredients involved, as was evident in the multiple inference outputs generated by LLaMa-2. Although the longer prompts also increased generation times for GPT4 and GPT3.5, those models did not exhibit the instruction reuse, probably because of their longer maximum token lengths.
- COT helps with the stability of answers. Showing the LLM a worked example in context strongly influenced the generation to follow the example's formatting, which helped produce consistent responses. It also means that formatting in COT examples matters more than in prompts using other techniques when a specific structure is expected in the output. For example, the worked examples in the first iteration used a different spacing format to separate the recipe name, ingredients, and instructions. Although the information was structured the same way, the LLM adopted the spacing from the example, which made the responses difficult to read during manual evaluation.
- Wei et al.<sup>1</sup> note that larger models benefit more from chain-of-thought prompting. This is consistent with the difference in the effect of COT prompts on GPT4-turbo compared to LLaMa-2 (13B) and GPT3.5.
- Although LLaMa-2 performed poorly in this benchmark compared to GPT4 and GPT3.5, it is important to note that the size of the model was severely restricted due to hardware limitations, whereas the GPT models did not have the same restrictions. The LLaMa-2 model parameter count was only 13 billion and the parameters were quantized into 4 bits with a large quantization group size (128) to further reduce GPU memory requirements. Without these adjustments, the model would have been too expensive to run on a single GPU.
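
The token-length problem noted in the RAG observation above could be limited by capping how much retrieved context is added to a prompt. The sketch below is one possible approach under stated assumptions: `approx_tokens` is a crude whitespace word count rather than a real tokenizer, and the function names, example texts, and budget are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())

def select_context(recipes, prompt, max_tokens):
    """Keep context recipes, in ranked order, while the total stays within max_tokens."""
    used = approx_tokens(prompt)
    kept = []
    for recipe in recipes:
        cost = approx_tokens(recipe)
        if used + cost > max_tokens:
            break  # stop at the first recipe that would exceed the budget
        kept.append(recipe)
        used += cost
    return kept

prompt = "Replace butternut squash from the original recipe with ground beef."
retrieved = ["brown the beef and drain the fat", "simmer the chili for one hour " * 3]
print(len(select_context(retrieved, prompt, max_tokens=20)))  # 1
```

In practice the count would come from the target model's own tokenizer, and the budget would also need to leave room for the generated response.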

<sup>1</sup>[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf)

### Considerations for Deployment
- When considering how user inputs are fed into the input prompt, strict controls over the range of input formats are advisable for deployment; specifying queries is non-trivial, and loosely specified queries can produce significant variance in responses.

- Utilizing retrieval-augmentation imposes a large memory requirement to load the embeddings model and the database index into memory. Additionally, performing similarity search will require a meaningful amount of computation time. This will need to be considered when designing around API latencies and hardware costs.
- GPT3.5 seems to be a good candidate for an LLM-backed application, as it strikes a balance between response quality and API costs.
- While the operating costs of using the OpenAI API for LLM inference are minimal for a small application, at larger scale the cumulative API cost would probably exceed the initial cost of hosting a model of similar complexity on local hardware.
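
As one way to apply strict controls over user input before it reaches the prompt, the sketch below normalizes an ingredient name and rejects anything outside a narrow pattern. The pattern, length limit, and function name are hypothetical choices for illustration, not rules taken from this project.

```python
import re

# Hypothetical rule: a letter followed by letters, spaces, hyphens, or
# apostrophes; 2 to 40 characters in total.
_INGREDIENT = re.compile(r"[A-Za-z][A-Za-z '\-]{1,39}")

def clean_ingredient(raw: str) -> str:
    """Normalize a user-supplied ingredient name or raise ValueError."""
    text = " ".join(raw.split()).lower()  # collapse whitespace, lowercase
    if not _INGREDIENT.fullmatch(text):
        raise ValueError(f"unsupported ingredient name: {raw!r}")
    return text

print(clean_ingredient("  Ground   Beef "))  # ground beef
```

Restricting input to short ingredient names keeps free-form text out of the task description, which narrows the range of prompts the model has to handle.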

### Conclusion
Prompt engineering using retrieval-augmentation and chain-of-thought prompting proved effective in improving the quality of responses for the task of modifying recipes to substitute ingredients. LLaMa-2 (13B) struggled to provide good responses for multiple inference prompts, likely because of token limits, but demonstrated that locally hosted, smaller models can still provide quality results. As expected, GPT4-turbo excelled at this task and was clearly the best of the three models. GPT3.5 proved to be a cost-effective alternative to GPT4/GPT4-turbo.