# AutoPrompt - Auto Write Evaluation Prompt


## The Idea

It's hard to know what the right prompt is, and it's hard to know if you've found it. This project automates that search: it writes candidate evaluation prompts, scores each one against a labeled dataset, and keeps iterating until a target accuracy is reached.


## Load deps

In [8]:
# ! pip install -r requirements.txt

import os

from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache

from utils import save_tmp_file, load_model
from data_handling import load_and_clean_dataset
from strategy_sequential import SequentialStrategy
from strategy_basic_memory import BasicMemoryStrategy

## Configuration

In [9]:
DATASET_FILE = "./datasets/sentiment_analysis_examples_25.csv"
# DATASET_FILE = "./datasets/dataset-writing-style-v-not-v.xlsx"
# DATASET_FILE = "./datasets/writing-style.xlsx"
# DATASET_FILE = "./datasets/writing-style-30-100-words.xlsx"

# Seed Idea for prompt generation
IDEA_SEED = """Decide the sentiment of the input text."""
# IDEA_SEED = """Compare the writing style of the two pieces of text. Your OUTPUT MUST ONLY take the writing style into consideration, NOT the meaning or thematic similarity of the texts.""".strip()


# Initial prompt. If `None`, the initial prompt will be generated automatically
PROMPT_TO_EVAL_FILE = None
# PROMPT_TO_EVAL_FILE = "_scored_100/writing-style-01-gpt-turbo-3.5-temp-0.3.md"

# Maximum number of rows to use from the dataset for initial prompt generation
ROWS_INITIAL = 6
# Maximum number of rows in each chunk
ROWS_MAX = 13
# Number of rows to use as `incorrect` examples
ROWS_INCORRECT = 5


# Use few-shot (True) or zero-shot (False) prompting?
IS_FEW_SHOT = True
# Number of evaluations to run concurrently
EVAL_CONCURRENCY = 4


# Stopping criteria (inclusive)
GOAL_ACCURACY = 98
MAX_ATTEMPTS_PER_PLAN = 2


# Model configurations
# MODEL_PROMPT_WRITER_NAME = "gpt-4-1106-preview"
MODEL_PROMPT_WRITER_NAME = "gpt-3.5-turbo"
# MODEL_PROMPT_WRITER_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"
MODEL_PROMPT_WRITER_TEMPERATURE = 0.9
MODEL_PROMPT_WRITER_MAX_TOKENS = 2000

MODEL_EVALUATE_NAME = "gpt-3.5-turbo"
# MODEL_EVALUATE_NAME = "gpt-4-1106-preview"
# MODEL_EVALUATE_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# MODEL_EVALUATE_NAME = "togethercomputer/llama-2-70b-chat"
MODEL_EVALUATE_TEMPERATURE = 0.01
MODEL_EVALUATE_MAX_TOKENS = 1600
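
The two stopping criteria above can be read as an early-exit search over plans. The following is a minimal sketch of that reading, not the project's actual strategy code: `search_plans`, the `attempt` callback, and the fake scores are all hypothetical names introduced for illustration. It assumes the goal is inclusive (`>=`), as the configuration comment says.

```python
from typing import Callable, Iterable, Optional, Tuple

GOAL_ACCURACY = 98
MAX_ATTEMPTS_PER_PLAN = 2


def search_plans(
    plans: Iterable[str],
    attempt: Callable[[str, int], float],
    goal_accuracy: float = GOAL_ACCURACY,
    max_attempts_per_plan: int = MAX_ATTEMPTS_PER_PLAN,
) -> Tuple[Optional[str], float]:
    """Try each plan up to `max_attempts_per_plan` times; stop at the goal."""
    best_plan, best_accuracy = None, -1.0
    for plan in plans:
        for attempt_no in range(max_attempts_per_plan):
            # `attempt` is a hypothetical callback returning accuracy in percent.
            accuracy = attempt(plan, attempt_no)
            if accuracy > best_accuracy:
                best_plan, best_accuracy = plan, accuracy
            if accuracy >= goal_accuracy:  # inclusive stopping criterion
                return best_plan, best_accuracy
    # Goal never reached: fall back to the best result seen.
    return best_plan, best_accuracy


# Made-up scores, only to exercise the loop.
fake_scores = {
    ("plan-a", 0): 90.0,
    ("plan-a", 1): 96.0,
    ("plan-b", 0): 98.0,
    ("plan-b", 1): 50.0,
}
calls = []


def fake_attempt(plan: str, attempt_no: int) -> float:
    calls.append((plan, attempt_no))
    return fake_scores[(plan, attempt_no)]


best_plan, best_accuracy = search_plans(["plan-a", "plan-b"], fake_attempt)
print(best_plan, best_accuracy)
```

With these made-up scores the search stops on the first attempt of `plan-b`, so the fourth attempt is never run.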

### Load Models


In [10]:
# Set up LangChain models

# if both model names start with `gpt-`, set cache
if MODEL_PROMPT_WRITER_NAME.startswith("gpt-") and MODEL_EVALUATE_NAME.startswith(
    "gpt-"
):
    print("Enabling LLM cache...")
    set_llm_cache(SQLiteCache(database_path=".langchain.db"))


# Set up the prompt writer model
model_prompt_writer = load_model(
    MODEL_PROMPT_WRITER_NAME,
    MODEL_PROMPT_WRITER_TEMPERATURE,
    MODEL_PROMPT_WRITER_MAX_TOKENS,
)

# Set up the evaluation model
model_evaluate = load_model(
    MODEL_EVALUATE_NAME,
    MODEL_EVALUATE_TEMPERATURE,
    MODEL_EVALUATE_MAX_TOKENS,
)

Enabling LLM cache...
Loading ChatOpenAI model: gpt-3.5-turbo
Loading ChatOpenAI model: gpt-3.5-turbo




### Load the dataset


In [11]:
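# Empty the ./_tmp directory (create it on first run)
os.makedirs("_tmp", exist_ok=True)
for filename in os.listdir("_tmp"):
    path = os.path.join("_tmp", filename)
    if os.path.isfile(path):
        os.remove(path)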

# Load the dataset
df_all = load_and_clean_dataset(DATASET_FILE)
df_all.head(4)

| | ROW_NO | INPUT: Sentence | OUTPUT: Sentiment |
|---|---|---|---|
| 0 | 1 | I love this new phone | positive |
| 1 | 2 | This is just okay. Nothing special. 😐 | neutral |
| 2 | 3 | Unfortunately, it broke the first day I used it | negative |
| 3 | 4 | I guess it could've been worse 😅 | neutral |
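
The column names in the preview suggest a convention where `INPUT:` and `OUTPUT:` prefixes mark which columns feed the prompt and which hold the expected label. Below is a small standard-library sketch of reading such a file and splitting columns by prefix. It is an illustration only: `load_rows` and `split_columns` are hypothetical helpers, not the project's `load_and_clean_dataset`, and the whitespace trimming is an assumed cleaning step.

```python
import csv
import io

# Two rows copied from the dataset preview above.
SAMPLE_CSV = '''ROW_NO,INPUT: Sentence,OUTPUT: Sentiment
1,I love this new phone,positive
3,"Unfortunately, it broke the first day I used it",negative
'''


def load_rows(text):
    """Parse CSV text into (column names, list of row dicts), trimming whitespace."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [{key: (value or "").strip() for key, value in row.items()} for row in reader]
    return list(reader.fieldnames or []), rows


def split_columns(fieldnames):
    """Separate input columns from output columns by their name prefix."""
    inputs = [name for name in fieldnames if name.startswith("INPUT:")]
    outputs = [name for name in fieldnames if name.startswith("OUTPUT:")]
    return inputs, outputs


fieldnames, rows = load_rows(SAMPLE_CSV)
input_cols, output_cols = split_columns(fieldnames)
print(input_cols, output_cols)
print(rows[1][input_cols[0]], "->", rows[1][output_cols[0]])
```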


### Call the Auto Prompt Main Loop + Strategy

In [12]:
# Set up the strategy executor
# Options so far: SequentialStrategy, BasicMemoryStrategy
auto_prompt_strategy = BasicMemoryStrategy(
    model_prompt_writer=model_prompt_writer,
    model_evaluate=model_evaluate,
    df_original=df_all,
    idea_seed=IDEA_SEED,
    goal_accuracy=GOAL_ACCURACY,
    max_attempts_per_plan=MAX_ATTEMPTS_PER_PLAN,
    is_few_shot=IS_FEW_SHOT,
    eval_concurrency=EVAL_CONCURRENCY,
    rows_initial=ROWS_INITIAL,
    rows_max=ROWS_MAX,
    rows_incorrect=ROWS_INCORRECT,
    is_use_eval_aware_dataset=True,
)

# Execute the strategy
prompt_str, accuracy, plan = auto_prompt_strategy.run()

# Save the final prompt
save_tmp_file("10-prompt_final.md", prompt_str)
print(f"\nFinal prompt saved with accuracy {accuracy:.2f}%")
if plan is not None:
    print(f"Plan {plan.id}: {plan.expert_title}")

Generating 5 ranked ToT prompt construction plans...


Plan 5:
Decide the sentiment of the input text. Combine the approaches of NLP engineering for sentiment analysis, emotional intelligence analysis, psychological analysis, and subject matter expertise. Train a sentiment analysis model using a diverse dataset and fine-tune it with NLP techniques. Incorporate the understanding of emotions, attitudes, and context from emotional intelligence and psychology. Consult subject matter experts to gain insights into domain-specific sentiments. This integrated strategy should yield a more accurate determination of the sentiment as positive, negative, or neutral.

Generating initial prompt...
Getting chunk 1 retry 0 with 13 rows...
Getting chunk 2 retry 0 with 12 rows...


Correct answers: 96.00%


Plan 1:
Decide the sentiment of the input text. Train a sentiment analysis model using a large dataset of labeled text, including positive, negative, and neutral sentiments. Fine-tune the model using advanced NLP techniques such as pre-trained language models like BERT or GPT. Apply the trained model to predict the sentiment of the input text.

Generating initial prompt...
Getting chunk 1 retry 0 with 13 rows...
Getting chunk 2 retry 0 with 12 rows...
Correct answers: 96.00%


Plan 3:
Decide the sentiment of the input text. Apply principles of psychology to assess the underlying emotions and attitudes in the input text. Consider factors such as language usage, tone, and connotations. Determine the sentiment as positive, negative, or neutral based on the psychological analysis.

Generating initial prompt...
Getting chunk 1 retry 0 with 13 rows...
Getting chunk 2 retry 0 with 12 rows...
Correct answers: 92.00%


Plan 2:
Decide the sentiment of the input text. Le
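
The log above evaluates the 25-row dataset in two chunks (13 and 12 rows) with `ROWS_MAX = 13`, and reports accuracies such as 96.00% and 92.00%. The arithmetic can be sketched as follows. `chunk_sizes` and `accuracy_percent` are hypothetical helpers, and the project may split chunks differently; reading 96% as 24 of 25 correct assumes every row was scored exactly once.

```python
from typing import List


def chunk_sizes(total_rows: int, rows_max: int) -> List[int]:
    """Sizes of consecutive chunks holding at most `rows_max` rows each."""
    return [min(rows_max, total_rows - start) for start in range(0, total_rows, rows_max)]


def accuracy_percent(correct: int, total: int) -> float:
    """Share of correctly answered rows, as a percentage."""
    return 100 * correct / total


sizes = chunk_sizes(25, 13)
print(sizes)  # [13, 12]
print(f"Correct answers: {accuracy_percent(24, 25):.2f}%")  # Correct answers: 96.00%
print(f"Correct answers: {accuracy_percent(23, 25):.2f}%")  # Correct answers: 92.00%
```

With only 25 rows, each misclassified row costs 4 percentage points, so a `GOAL_ACCURACY` of 98 can only be met by a prompt that gets every row right.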

In [13]:
# # Create an instance of the EvalAwareDataset class
# dataset_tracker = EvalAwareDataset(df_all)

# # Get initial sample of the dataset
# df_sample = dataset_tracker.get_sample(ROWS_INITIAL)

# # df_all
# df_sample
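
One way to think about a dataset that tracks mistakes is a sampler that prefers rows the evaluator has previously gotten wrong. The sketch below is a guess at the general idea only: `MistakeTrackingSampler` and its methods are hypothetical and are not the project's `EvalAwareDataset` API. It works on plain row IDs rather than a DataFrame.

```python
from collections import Counter


class MistakeTrackingSampler:
    """Hypothetical sampler that ranks rows by how often they were misclassified."""

    def __init__(self, row_ids):
        self.row_ids = list(row_ids)
        self.mistakes = Counter()

    def record_mistakes(self, wrong_row_ids):
        # Count one more mistake for every row answered incorrectly.
        self.mistakes.update(wrong_row_ids)

    def get_sample(self, n):
        # Most-missed rows first; the stable sort keeps original order on ties.
        ranked = sorted(self.row_ids, key=lambda row_id: -self.mistakes[row_id])
        return ranked[:n]


sampler = MistakeTrackingSampler(range(1, 7))
sampler.record_mistakes([4, 2])  # first evaluation round
sampler.record_mistakes([4])     # second evaluation round
sample = sampler.get_sample(3)
print(sample)  # [4, 2, 1]
```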

In [14]:
# # If PROMPT_TO_EVAL_FILE is not None, load the prompt from the file
# prompt_str = ""
# if PROMPT_TO_EVAL_FILE is not None:
#     print(f"Loading prompt from {PROMPT_TO_EVAL_FILE}")
#     with open(PROMPT_TO_EVAL_FILE, "r") as f:
#         prompt_str = f.read()