# AutoPrompt - Auto Write Evaluation Prompt


## The Idea

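It is hard to know what the right prompt is, and just as hard to know whether you have found it. This project automates the search: a prompt-writer model drafts an evaluation prompt, an evaluation model runs it against a labeled dataset, and the prompt is revised until it reaches a target accuracy or runs out of attempts.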
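
The loop behind this idea can be sketched in a few lines. This is not the project's implementation: `refine_prompt`, `accuracy_pct`, `toy_evaluate`, and `toy_rewrite` are hypothetical names, and the two toy functions stand in for the evaluation and prompt-writer models so that the example runs without any API calls.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, str]  # (input text, expected label)


def accuracy_pct(prompt: str, rows: List[Row],
                 evaluate: Callable[[str, str], str]) -> Tuple[float, List[Row]]:
    """Return the accuracy in percent and the rows the prompt got wrong."""
    wrong = [row for row in rows if evaluate(prompt, row[0]) != row[1]]
    return 100.0 * (len(rows) - len(wrong)) / len(rows), wrong


def refine_prompt(seed: str, rows: List[Row],
                  evaluate: Callable[[str, str], str],
                  rewrite: Callable[[str, List[Row]], str],
                  goal_accuracy: float = 99.0,
                  max_attempts: int = 5) -> Tuple[str, float]:
    """Revise the prompt until it reaches the goal accuracy or attempts run out."""
    prompt = seed
    best_prompt, best_acc = prompt, -1.0
    for _ in range(max_attempts):
        acc, wrong = accuracy_pct(prompt, rows, evaluate)
        if acc > best_acc:
            best_prompt, best_acc = prompt, acc
        if acc >= goal_accuracy:  # the goal is treated as inclusive
            break
        prompt = rewrite(prompt, wrong)
    return best_prompt, best_acc


def toy_evaluate(prompt: str, text: str) -> str:
    # Hypothetical stand-in for the evaluation model:
    # it only answers "negative" if the prompt mentions that label.
    if "broke" in text and "negative" in prompt:
        return "negative"
    return "positive"


def toy_rewrite(prompt: str, wrong_rows: List[Row]) -> str:
    # Hypothetical stand-in for the prompt-writer model:
    # it appends the labels that the previous prompt missed.
    missed = sorted({expected for _, expected in wrong_rows})
    return prompt + " Possible labels: " + ", ".join(missed) + "."


rows = [
    ("I love this new phone", "positive"),
    ("Unfortunately, it broke the first day I used it", "negative"),
]
best_prompt, best_acc = refine_prompt(
    "Decide the sentiment of the input text.", rows, toy_evaluate, toy_rewrite
)
print(f"{best_acc:.2f}%")  # 100.00%
```

The first attempt scores 50% because the toy evaluator never answers "negative"; the rewritten prompt names the missed label and the second attempt reaches the goal.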


## Load dependencies

In [1]:
# ! pip install -r requirements.txt

import os

from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache

from utils import save_log_file, load_model
from data_handling import load_and_clean_dataset

from strategies.sequential import SequentialStrategy
from strategies.basic import BasicStrategy
from strategies.tree_search import TreeSearchStrategy

## Configuration

In [2]:
DATASET_FILE = "./datasets/sentiment_analysis_examples_25.csv"
# DATASET_FILE = "./datasets/dataset-writing-style-v-not-v.xlsx"
# DATASET_FILE = "./datasets/writing-style.xlsx"
# DATASET_FILE = "./datasets/writing-style-30-100-words.xlsx"

# Seed Idea for prompt generation
IDEA_SEED = """Decide the sentiment of the input text."""
# IDEA_SEED = """Compare the two pieces of text. Your OUTPUT MUST ONLY take the writing style, voice, tone and structure into consideration. Do not consider the meaning or thematic similarity of the texts.""".strip()


# Initial prompt. If `None`, the initial prompt will be generated automatically
# PROMPT_TO_EVAL_FILE = None
# PROMPT_TO_EVAL_FILE = "_scored_100/writing-style-01-gpt-turbo-3.5-temp-0.3.md"

# Maximum number of rows to use from the dataset for initial prompt generation
ROWS_INITIAL = 10
# Maximum number of rows in each chunk
ROWS_MAX = 15
# Number of rows to use as `incorrect` examples
ROWS_INCORRECT = 5


# Use few-shot (True) or zero-shot (False) prompting
IS_FEW_SHOT = True
EVAL_CONCURRENCY = 10


# Stopping criteria (inclusive)
GOAL_ACCURACY = 99
MAX_ATTEMPTS_PER_PLAN = 5 # including the initial attempt before using UPDATEP


# Model configurations
# MODEL_PROMPT_WRITER_NAME = "gpt-3.5-turbo"
# MODEL_PROMPT_WRITER_NAME = "gpt-4-1106-preview"
MODEL_PROMPT_WRITER_NAME = "gpt-3.5-turbo-16k"
# MODEL_PROMPT_WRITER_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"
MODEL_PROMPT_WRITER_TEMPERATURE = 0.7
MODEL_PROMPT_WRITER_MAX_TOKENS = 2200

MODEL_EVALUATE_NAME = "gpt-3.5-turbo"
# MODEL_EVALUATE_NAME = "gpt-3.5-turbo-16k"
# MODEL_EVALUATE_NAME = "gpt-4-1106-preview"
# MODEL_EVALUATE_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# MODEL_EVALUATE_NAME = "togethercomputer/llama-2-70b-chat"
MODEL_EVALUATE_TEMPERATURE = 0.1
MODEL_EVALUATE_MAX_TOKENS = 2000
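
How `ROWS_MAX` and the stopping criteria interact can be illustrated with a short sketch. It assumes the dataset is evaluated in consecutive chunks of at most `ROWS_MAX` rows, which is consistent with the run log for the 25-row dataset (one chunk of 15 rows, one of 10), and that a plan stops once accuracy reaches `GOAL_ACCURACY` or the attempt count reaches `MAX_ATTEMPTS_PER_PLAN`. `chunk_sizes` and `should_stop` are hypothetical helpers, not functions from the project.

```python
from typing import List


def chunk_sizes(total_rows: int, rows_max: int) -> List[int]:
    """Split total_rows into consecutive chunks of at most rows_max rows."""
    if rows_max <= 0:
        raise ValueError("rows_max must be positive")
    sizes = []
    remaining = total_rows
    while remaining > 0:
        size = min(rows_max, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def should_stop(accuracy: float, attempt: int,
                goal_accuracy: float = 99, max_attempts: int = 5) -> bool:
    """Assumed stopping rule: goal reached (inclusive) or attempts exhausted."""
    return accuracy >= goal_accuracy or attempt >= max_attempts


print(chunk_sizes(25, 15))   # [15, 10]
print(should_stop(88.0, 2))  # False
print(should_stop(99.0, 1))  # True
print(should_stop(88.0, 5))  # True
```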

### Load Models


In [3]:
# Set up LangChain models

# if both model names start with `gpt-`, set cache
if MODEL_PROMPT_WRITER_NAME.startswith("gpt-") and MODEL_EVALUATE_NAME.startswith(
    "gpt-"
):
    print("Enabling LLM cache...")
    set_llm_cache(SQLiteCache(database_path=".langchain.db"))


# Setup the prompt writer model
model_prompt_writer = load_model(
    MODEL_PROMPT_WRITER_NAME,
    MODEL_PROMPT_WRITER_TEMPERATURE,
    MODEL_PROMPT_WRITER_MAX_TOKENS,
)

# Setup the evaluation model
model_evaluate = load_model(
    MODEL_EVALUATE_NAME,
    MODEL_EVALUATE_TEMPERATURE,
    MODEL_EVALUATE_MAX_TOKENS,
)

Enabling LLM cache...
Loading ChatOpenAI model: gpt-3.5-turbo-16k
Loading ChatOpenAI model: gpt-3.5-turbo


### Load the dataset


In [4]:

# Create the ./_log directory if it does not exist; otherwise empty it
if not os.path.exists("_log"):
    os.mkdir("_log")
else:
    for filename in os.listdir("_log"):
        os.remove(os.path.join("_log", filename))

# Load the dataset
df_all = load_and_clean_dataset(DATASET_FILE)
df_all.head(ROWS_INITIAL)

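# Preview the first rows with slugified column names
# (a notebook cell displays only the value of its last expression)
from utils_xml_table import slugify_column_names
slugify_column_names(df_all).head(ROWS_INITIAL)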

| | ROW_NO | INPUT_Sentence | OUTPUT_Sentiment |
|---|---|---|---|
| 0 | 1 | I love this new phone | positive |
| 1 | 2 | This is just okay. Nothing special. 😐 | neutral |
| 2 | 3 | Unfortunately, it broke the first day I used it | negative |
| 3 | 4 | I guess it could've been worse 😅 | neutral |
| 4 | 5 | Waiting forever for a response... 😒 | negative |
| 5 | 6 | The movie was both amazing and boring 😕 | neutral |
| 6 | 7 | Not sure if I liked it or not | neutral |
| 7 | 8 | Absolutely fantastic experience! | positive |
| 8 | 9 | Mediocre service, wouldn't recommend 😑 | negative |
| 9 | 10 | Hard to tell if it's good or bad 😶 | neutral |
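
The column names suggest a naming convention: an `INPUT_` prefix for the text shown to the model and an `OUTPUT_` prefix for the expected label. Assuming that reading (the project code may treat the columns differently), the split can be sketched in plain Python. `split_columns` is a hypothetical helper, and the two rows are copied from the preview above.

```python
from typing import Dict, List, Tuple


def split_columns(columns: List[str]) -> Tuple[List[str], List[str]]:
    """Separate column names into model inputs and expected outputs by prefix."""
    inputs = [c for c in columns if c.startswith("INPUT_")]
    outputs = [c for c in columns if c.startswith("OUTPUT_")]
    return inputs, outputs


sample_rows: List[Dict[str, object]] = [
    {"ROW_NO": 1, "INPUT_Sentence": "I love this new phone",
     "OUTPUT_Sentiment": "positive"},
    {"ROW_NO": 3, "INPUT_Sentence": "Unfortunately, it broke the first day I used it",
     "OUTPUT_Sentiment": "negative"},
]

input_cols, output_cols = split_columns(list(sample_rows[0].keys()))
print(input_cols)   # ['INPUT_Sentence']
print(output_cols)  # ['OUTPUT_Sentiment']
```

Columns without either prefix, such as `ROW_NO`, fall into neither list and would be treated as bookkeeping under this assumption.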


### Call the Auto Prompt Main Loop + Strategy

In [5]:
# Setup the strategy executor
# Options so far: SequentialStrategy, BasicStrategy, TreeSearchStrategy
auto_prompt_strategy = TreeSearchStrategy(
    model_prompt_writer=model_prompt_writer,
    model_evaluate=model_evaluate,
    df_original=df_all,
    idea_seed=IDEA_SEED,
    goal_accuracy=GOAL_ACCURACY,
    max_attempts_per_plan=MAX_ATTEMPTS_PER_PLAN,
    is_few_shot=IS_FEW_SHOT,
    eval_concurrency=EVAL_CONCURRENCY,
    rows_initial=ROWS_INITIAL,
    rows_max=ROWS_MAX,
    rows_incorrect=ROWS_INCORRECT,
    is_use_eval_aware_dataset=True,
)

# Execute the strategy
# prompt_str, accuracy, plan = auto_prompt_strategy.run()
prompt_str, accuracy, plan = auto_prompt_strategy.run(max_mutations=1, min_acceptable_accuracy=50)

# Save the final prompt
save_log_file("10-prompt_final.md", prompt_str)
print(f"\nFinal prompt saved with accuracy {accuracy:.2f}%")
if plan is not None:
    print(f"Plan {plan.id}: {plan.expert_title}")

Generating 5 ranked ToT prompt construction plans...

-> Running plan 5 with mutation 1...


Plan 5:
Decide the sentiment of the input text. Combine NLP techniques, machine learning models, psychological analysis, and social media data. Use sentiment analysis algorithms to analyze the text and classify it as positive, negative, or neutral. Consider psychological aspects and emotional cues in the text. Additionally, leverage social media data and sentiment analysis tools to enhance the accuracy of sentiment classification. 

Generating initial prompt...
Getting chunk 1 retry 0 with 15 rows...
Getting chunk 2 retry 0 with 10 rows...
Correct answers: 88.00%
Incorrect answers count: 3
Pick the first 3 incorrect examples...

Updating prompt...
Getting chunk 1 retry 0 with 15 rows...
Getting chunk 2 retry 0 with 10 rows...


Correct answers: 88.00%

---
### Attempt 1: 88.00% accuracy (3 wrong out of 25 test rows)
First attempt.
### Attempt 2: 88.00% accuracy (3 wrong out of 25 test rows)
Changes made to the prompt compared to attempt 1:
- Updated the prompt instructions for clarity and accuracy.
---

Incorrect answers count: 3
Pick the first 3 incorrect examples...

Updating prompt...
>> Total cost: 0.009 USD, tokens used 2696
Getting chunk 1 retry 0 with 15 rows...
Getting chunk 2 retry 0 with 10 rows...
Correct answers: 88.00%

---
### Attempt 1: 88.00% accuracy (3 wrong out of 25 test rows)
First attempt.
### Attempt 2: 88.00% accuracy (3 wrong out of 25 test rows)
Changes made to the prompt compared to attempt 1:
- Updated the prompt instructions for clarity and accuracy.
### Attempt 3: 88.00% accuracy (3 wrong out of 25 test rows)
Changes made to the prompt compared to attempt 2:
- Clarified the importance of considering social media data in the sentiment analysis.
- Emphasized the need to leverage ps