# AutoPrompt - Auto Write Evaluation Prompt


## The Idea + TODOs


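It's hard to know what the right prompt is, and it's hard to know if you've found it. This project aims to automate the search for a good evaluation prompt: one model writes a prompt, the prompt is scored against a labelled dataset, and the wrong answers guide the next revision.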

**System IO:**

- [ ] Input `PROMPT_IDEA_SEED`: A seed idea or guidance for the prompt to be written. For example: _Compare the writing style of two pieces of text. Score only the writing style, not the actual content meaning._
- [ ] Input `TEXT_1`: 500 words written by the Known Author.
- [ ] Input `TEXT_2`: One of two cases:
  - [ ] 500 words written by an Unknown Author
  - [ ] 500 words written by the same Known Author
- [ ] Output `SCORE`: A score from 1 to 10 for how similar the writing style of the two texts is.
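
The inputs and output above could be collected into a dataset file roughly as follows. This is a minimal sketch using only the standard library: the column names are taken from the dataset preview shown later in this notebook, while `build_rows`, the sample strings, and the choice of CSV are assumptions made for illustration.

```python
import csv
import io

# Column names follow the dataset preview shown later in this notebook.
COL_TEXT_1 = "INPUT: TEXT_1"
COL_TEXT_2 = "INPUT: TEXT_2"
COL_SCORE = "OUTPUT: SCORE_SAME_AUTHOR"


def build_rows(known_texts, same_author_text, unknown_author_text):
    """Pair each known-author text with one same-author and one unknown-author text."""
    rows = []
    for text_1 in known_texts:
        # Same author: target score at the top of the 1-10 scale.
        rows.append({COL_TEXT_1: text_1, COL_TEXT_2: same_author_text, COL_SCORE: 10})
        # Unknown author: target score at the bottom of the scale.
        rows.append({COL_TEXT_1: text_1, COL_TEXT_2: unknown_author_text, COL_SCORE: 1})
    return rows


# Hypothetical short stand-ins; real rows would hold about 500 words each.
rows = build_rows(
    known_texts=["Known author sample A", "Known author sample B"],
    same_author_text="Another sample by the known author",
    unknown_author_text="A sample by someone else",
)

# Write to an in-memory buffer; swap in open("dataset.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[COL_TEXT_1, COL_TEXT_2, COL_SCORE])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```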

**Next steps:**

- [ ] Build dataset file with:
  - [ ] INPUT: `TEXT_1` and `TEXT_2` as inputs.
  - [ ] OUTPUT: `SCORE` as output (1 when `TEXT_2` is by the unknown author, 10 when it is by the same known author).
- [ ] Adapt the code below to write the prompt automatically.
- [ ] Replace usage of `to_markdown()` with a better format for long-form text within the prompt.
- [ ] Evaluation Goal (+/- 2 from the target score, within 1-10 limits):
  - Comparing to Unknown Author must get a **score <= 3**,
  - Comparing to Known Author must get a **score >= 7**.
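
The evaluation goal can be read as a tolerance check on each predicted score. The sketch below assumes a prediction counts as correct when it is within 2 points of the truth; the actual rule lives inside `invoke_test_prompt_against_dataset`, which is not shown here, so `is_correct` and `accuracy_percent` are hypothetical names. The scores are copied from the first evaluation run shown later in the notebook.

```python
TOLERANCE = 2  # assumed: correct means within +/- 2 of the truth score


def is_correct(predicted: int, truth: int, tolerance: int = TOLERANCE) -> bool:
    """Return True when the predicted score is close enough to the truth."""
    return abs(predicted - truth) <= tolerance


def accuracy_percent(predicted, truths, tolerance: int = TOLERANCE) -> float:
    """Percentage of rows whose prediction falls within the tolerance."""
    hits = sum(is_correct(p, t, tolerance) for p, t in zip(predicted, truths))
    return 100.0 * hits / len(truths)


# Predicted scores and truths from the first evaluation run in this notebook.
predicted = [9, 6, 9, 2, 9, 5, 9, 1]
truths = [9, 1, 10, 2, 9, 2, 8, 1]

print(f"Correct answers: {accuracy_percent(predicted, truths):.2f}%")
```

Under this assumption, rows 1 and 5 are the only misses (off by 5 and 3 points), which gives the same 75.00% that the run reports.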

**Improvements:**

- IDEA: Try providing last 3 diffs of the prompt changes, to help guide the next move.
- IDEA: Would using the Mixtral API be cheaper? See [ChatMistralAI](https://js.langchain.com/docs/integrations/chat/mistral).
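
The "last 3 diffs" idea could be prototyped with the standard library's `difflib`. This is only a sketch: `record_prompt_change`, `recent_diffs_to_str`, and the example prompts are hypothetical and not part of the project code.

```python
import difflib
from collections import deque

MAX_DIFFS = 3  # keep only the most recent prompt changes

# A bounded deque silently drops the oldest diff once the limit is reached.
recent_diffs = deque(maxlen=MAX_DIFFS)


def record_prompt_change(old_prompt: str, new_prompt: str, attempt: int) -> str:
    """Store a unified diff between two prompt versions and return it."""
    diff_lines = difflib.unified_diff(
        old_prompt.splitlines(),
        new_prompt.splitlines(),
        fromfile=f"prompt_{attempt - 1}",
        tofile=f"prompt_{attempt}",
        lineterm="",
    )
    diff_text = "\n".join(diff_lines)
    recent_diffs.append(diff_text)
    return diff_text


def recent_diffs_to_str() -> str:
    """Join the stored diffs so they can be pasted into the update prompt."""
    return "\n\n".join(recent_diffs)


# Hypothetical prompt history: each version adds one line.
versions = [
    "Score the style.",
    "Score the style.\nIgnore the topic.",
    "Score the style.\nIgnore the topic.\nCompare tone.",
    "Score the style.\nIgnore the topic.\nCompare tone.\nCompare punctuation.",
    "Score the style.\nIgnore the topic.\nCompare tone.\nCompare punctuation.\nAnswer 1-10.",
]
for attempt in range(2, len(versions) + 1):
    record_prompt_change(versions[attempt - 2], versions[attempt - 1], attempt)

print(recent_diffs_to_str())
```

Diffs keep the context short compared with sending full prompt versions, which matters when the dataset texts already take up most of the prompt.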


## Let's build it!


In [1]:
from utils import extract_prompt_from_answer, save_tmp_file
from utils_multiline_table import df_to_multiline_table
from data_handling import load_and_clean_dataset
from eval import invoke_test_prompt_against_dataset
from update_prompt import (
    invoke_update_prompt,
    previous_attempts_add,
    previous_attempts_to_str,
)
from config import TMP_DIR, ROWS_INITIAL, IDEA_SEED, PROMPT_INIT_FILE, DATABASE_PATH

import os
from langchain.prompts import load_prompt
from langchain.chat_models import ChatOpenAI
from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache
from langchain_core.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser

### The Main loop

The main loop runs until the prompt is accurate enough or the maximum number of iterations is reached.
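
In outline, the loop has the shape sketched below. The `evaluate` and `update` callables are hypothetical stand-ins for the project's evaluation and prompt-update steps, and the defaults mirror the thresholds used in the loop cell further down. Keeping track of the best-scoring prompt is an addition for illustration, not something the notebook does; it may be worth considering because accuracy can drop from one attempt to the next.

```python
def run_loop(first_prompt, evaluate, update, target_accuracy=90.0, max_attempts=7):
    """Revise a prompt until it is accurate enough or attempts run out.

    `evaluate(prompt)` returns an accuracy percentage and `update(prompt)`
    returns a revised prompt; both are hypothetical stand-ins.
    """
    prompt = first_prompt
    accuracy = evaluate(prompt)
    best_prompt, best_accuracy = prompt, accuracy
    attempt = 1
    while accuracy < target_accuracy and attempt < max_attempts:
        attempt += 1
        prompt = update(prompt)
        accuracy = evaluate(prompt)
        # Remember the best prompt seen so far, not just the latest one.
        if accuracy > best_accuracy:
            best_prompt, best_accuracy = prompt, accuracy
    return best_prompt, best_accuracy, attempt


# Toy stand-ins: each revision appends a marker, and accuracy follows a made-up script.
scripted_accuracy = iter([75.0, 50.0, 87.5, 100.0])
best_prompt, best_accuracy, attempts = run_loop(
    "v1",
    evaluate=lambda prompt: next(scripted_accuracy),
    update=lambda prompt: prompt + "+",
)
```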


In [2]:
# empty ./_tmp directory
for filename in os.listdir(TMP_DIR):
    os.remove(os.path.join(TMP_DIR, filename))

# Load the dataset
# dataset_file = "./datasets/sentiment_analysis_examples_10.csv"
dataset_file = "./datasets/dataset-writing-style-v-not-v.xlsx"
df_all = load_and_clean_dataset(dataset_file)

# If df_all has more rows than ROWS_INITIAL, take the first ROWS_INITIAL rows
df_sample = df_all
if len(df_all) > ROWS_INITIAL:
    df_sample = df_all.head(ROWS_INITIAL)

df_sample

|   | INPUT: TEXT_1 | INPUT: TEXT_2 | OUTPUT: SCORE_SAME_AUTHOR |
|---|---|---|---|
| 0 | # The Ultimate Travel Nurse Salary Guide: 4 Mi... | # Find the Best Travel Nursing Jobs\n\nLooking... | 9 |
| 1 | # The Ultimate Travel Nurse Salary Guide: 4 Mi... | ## Background on Nursing Job Search Book\n\nTh... | 1 |
| 2 | ## Get A Travel Nurse Pay Breakdown\n\nUnderst... | # Find the Best Travel Nursing Jobs\n\nLooking... | 10 |
| 3 | ## Get A Travel Nurse Pay Breakdown\n\nUnderst... | ## Background on Nursing Job Search Book\n\nTh... | 2 |


In [3]:
# Set up LangChain models
set_llm_cache(SQLiteCache(database_path=DATABASE_PATH))
model_gpt4 = ChatOpenAI(model="gpt-4-1106-preview", temperature=0.5, max_tokens=2000)
model_gpt35 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.5, max_tokens=750)

# Load the WRITEP prompt and set up the LangChain chain
prompt_init = load_prompt(PROMPT_INIT_FILE)
chain4 = prompt_init | model_gpt4 | StrOutputParser()

# print(prompt_init.format(dataset_table=df_to_multiline_table(df_sample), idea_seed=IDEA_SEED))


# Invoke the LangChain chain to generate the prompt
save_tmp_file(
    "01-prompt_init.md",
    prompt_init.format(
        dataset_table=df_to_multiline_table(df_sample), idea_seed=IDEA_SEED
    ),
)
print("Generating initial prompt...")
answer = chain4.invoke(
    {"dataset_table": df_to_multiline_table(df_sample), "idea_seed": IDEA_SEED}
)
save_tmp_file("02-prompt_init-response.md", answer)

# Extract the generated prompt
prompt_generated_str = extract_prompt_from_answer(answer)
prompt_generated_str = prompt_generated_str.replace(
    "%%%INPUT_TABLE%%%", "{input_table}"
)
prompt_generated = PromptTemplate.from_template(prompt_generated_str)

# print(prompt_generated_str)

Generating initial prompt...
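
One thing to watch in the placeholder swap above: assuming `PromptTemplate.from_template` is used with its default f-string style formatting, any literal `{` or `}` in the generated prompt (for example a JSON snippet) would be read as a template variable. The sketch below shows one way to escape them first. It uses plain `str.format` as a stand-in for the template, and `to_template_string` is a hypothetical helper, not part of the project's `utils`.

```python
PLACEHOLDER = "%%%INPUT_TABLE%%%"


def to_template_string(generated_prompt: str) -> str:
    """Escape literal braces, then turn the placeholder into a template variable."""
    # Double every brace so the formatter treats it as literal text.
    escaped = generated_prompt.replace("{", "{{").replace("}", "}}")
    # The placeholder contains no braces, so it survives the escaping untouched.
    return escaped.replace(PLACEHOLDER, "{input_table}")


# Hypothetical model answer containing a JSON example with literal braces.
generated = 'Return JSON like {"score": 7}.\n\n%%%INPUT_TABLE%%%'
template = to_template_string(generated)
filled = template.format(input_table="| TEXT_1 | TEXT_2 |")
print(filled)
```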


In [4]:
# The prompt counter used for the main loop
i_prompt = 1

df_generated, accuracy = invoke_test_prompt_against_dataset(
    prompt_generated, df_all, model_gpt35, i_prompt
)

df_generated

Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Correct answers: 75.00%


|   | ROW_NO | Thinking step by step | OUTPUT: SCORE_SAME_AUTHOR | INPUT: TEXT_1 | INPUT: TEXT_2 | Truth | Is Correct? |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Both texts use similar formatting, including h... | 9 | # The Ultimate Travel Nurse Salary Guide: 4 Mi... | # Find the Best Travel Nursing Jobs\n\nLooking... | 9 | True |
| 1 | 1 | Both texts have a similar tone of providing in... | 6 | # The Ultimate Travel Nurse Salary Guide: 4 Mi... | ## Background on Nursing Job Search Book\n\nTh... | 1 | False |
| 2 | 2 | Both texts have a similar tone, using a conver... | 9 | ## Get A Travel Nurse Pay Breakdown\n\nUnderst... | # Find the Best Travel Nursing Jobs\n\nLooking... | 10 | True |
| 3 | 3 | The writing style of TEXT_1 is informative and... | 2 | ## Get A Travel Nurse Pay Breakdown\n\nUnderst... | ## Background on Nursing Job Search Book\n\nTh... | 2 | True |
| 4 | 4 | Both texts have a similar structure, with head... | 9 | ### Taxable vs Tax Free Pay\n\nFollowing on fr... | # Find the Best Travel Nursing Jobs\n\nLooking... | 9 | True |
| 5 | 5 | Both texts have a similar tone of providing in... | 5 | ### Taxable vs Tax Free Pay\n\nFollowing on fr... | ## Background on Nursing Job Search Book\n\nTh... | 2 | False |
| 6 | 6 | Both texts use a conversational tone and engag... | 9 | ### Top 5 Highest Paying States for Travel Nur... | # Find the Best Travel Nursing Jobs\n\nLooking... | 8 | True |
| 7 | 7 | Both texts have a different formatting style. ... | 1 | ### Top 5 Highest Paying States for Travel Nur... | ## Background on Nursing Job Search Book\n\nTh... | 1 | True |


In [5]:
previous_attempts = []
previous_attempts_add(previous_attempts, i_prompt, accuracy, "First attempt.")


# Loop until accuracy reaches 90% or 7 prompt attempts have been made
try:
    while accuracy < 90 and i_prompt < 7:
        i_prompt = i_prompt + 1

        previous_attempts_str = previous_attempts_to_str(previous_attempts, df_all)
        print(f"Previous attempts:\n{previous_attempts_str}\n\n")

        prompt_previous = prompt_generated_str

        prompt_generated_str, changes_made_str = invoke_update_prompt(
            df_generated,
            prompt_previous,
            model_gpt4,
            previous_attempts_str,
            i_prompt=i_prompt,
        )

        prompt_updated = PromptTemplate.from_template(prompt_generated_str)
        df_generated, accuracy = invoke_test_prompt_against_dataset(
            prompt_updated, df_all, model_gpt35, i_prompt
        )

        previous_attempts_add(previous_attempts, i_prompt, accuracy, changes_made_str)

    # print(f"\n\nFinal prompt:\n{prompt_generated_str}")
    save_tmp_file("10-prompt_final.md", prompt_generated_str)
    print(f"\nFinal prompt saved with accuracy {accuracy:.2f}%")
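except ValueError as e:
    # A "TRUTH_IS_WRONG" error ends the loop quietly so the summary below
    # still prints; any other ValueError is re-raised with its traceback.
    if str(e) != "TRUTH_IS_WRONG":
        raise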


# print(json.dumps(previous_attempts, indent=2))
print("\n\n\n\n\n\n")
print(previous_attempts_to_str(previous_attempts, df_all))

Previous attempts:
### Attempt 1: 75.00% accuracy (2 wrong out of 8 test rows)
Changes made:
First attempt.




Incorrect answers count: 2
Pick the first 1 examples...
Updating prompt using gpt-4-turbo...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Getting answer, chunk of 1 rows...
Correct answers: 50.00%
Previous attempts:
### Attempt 1: 75.00% accuracy (2 wrong out of 8 test rows)
Changes made:
First attempt.

### Attempt 2: 50.00% accuracy (4 wrong out of 8 test rows)
Changes made:
- Clarified that the score must take into consideration elements such as syntax, vocabulary, punctuation, tone, and use of literary devices, rather than content, subject matter, or meaning.




Incorrect answers count: 4
Pick the first 1 examples...
Updating prompt using gpt-4-turbo...
Getting answer, chunk