# Numbers every developer should know about prompt engineering

A developer needs to know a few core numbers to be effective and efficient. Memory footprint, the performance of a program, the latency, or bandwidth of a network come to mind. With the advent of AI, especially Generative AI like LLMs, prompt engineering is not resigned for just prompt engineers. Copilot has already been used by every developer I know. That prompted (pun not intended) the question of the article: what are the numbers that a developer should know about prompt engineering?



## Prepare sample data

In [1]:
# I want to first prepare some data. It's mimicking prompt/response for an AI agent.
# Query sentences followed by the actual responses from AI and the expected responses (ground truth).
# I want 10 such pairs in csv format

CSV file 'ai_agent_data.csv' created successfully.


## Set OPEN_API_KEY

In [2]:
import os

os.environ["OPENAI_API_KEY"] = "OPEN_API_KEY"

#

## What is a token?

In [None]:
!pip install tiktoken

In [13]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    enc = tiktoken.encoding_for_model(encoding_name)
    tokens = enc.encode(string)
    print(tokens)                         # → [11455, 15639, 13959]
    print([enc.decode([t]) for t in tokens])  # → ['low', ' lower', ' lowest']

    return len(tokens)

num_tokens_from_string("low lower lowest", "gpt-3.5")

[10516, 4827, 15821]
['low', ' lower', ' lowest']


3

In [14]:
num_tokens_from_string("low lower lowest", "gpt-4")

[10516, 4827, 15821]
['low', ' lower', ' lowest']


3

In [18]:
num_tokens_from_string("The quick brown fox", "gpt-4")

[791, 4062, 14198, 39935]
['The', ' quick', ' brown', ' fox']


4

In [16]:
num_tokens_from_string("developer", "gpt-4o")

[77944]
['developer']


1

In [17]:
enc = tiktoken.encoding_for_model("gpt-4o")

words = ["develop", "developer", "development", "developed", "undeveloped"]
for word in words:
    tokens = enc.encode(word)
    print(f"{word}: {len(tokens)} token(s) -> {tokens} → {[enc.decode([t]) for t in tokens]}")


develop: 1 token(s) -> [88886] → ['develop']
developer: 1 token(s) -> [77944] → ['developer']
development: 1 token(s) -> [71620] → ['development']
developed: 2 token(s) -> [88886, 295] → ['develop', 'ed']
undeveloped: 2 token(s) -> [14171, 112997] → ['unde', 'veloped']


In [10]:
import tiktoken
from tiktoken.model import MODEL_TO_ENCODING

print("Available models for tiktoken.encoding_for_model():")
for model_name in MODEL_TO_ENCODING.keys():
    print(model_name)

Available models for tiktoken.encoding_for_model():
o1
o3
gpt-4o
gpt-4
gpt-3.5-turbo
gpt-3.5
gpt-35-turbo
davinci-002
babbage-002
text-embedding-ada-002
text-embedding-3-small
text-embedding-3-large
text-davinci-003
text-davinci-002
text-davinci-001
text-curie-001
text-babbage-001
text-ada-001
davinci
curie
babbage
ada
code-davinci-002
code-davinci-001
code-cushman-002
code-cushman-001
davinci-codex
cushman-codex
text-davinci-edit-001
code-davinci-edit-001
text-similarity-davinci-001
text-similarity-curie-001
text-similarity-babbage-001
text-similarity-ada-001
text-search-davinci-doc-001
text-search-curie-doc-001
text-search-babbage-doc-001
text-search-ada-doc-001
code-search-babbage-code-001
code-search-ada-code-001
gpt2
gpt-2


In [12]:
enc2 = tiktoken.encoding_for_model("gpt-2")
tokens2 = enc.encode("low lower lowest")
print(tokens2)                         # → [11455, 15639, 13959]
print([enc.decode([t]) for t in tokens2])  # → ['low', ' lower', ' lowest']

[10516, 4827, 15821]
['low', ' lower', ' lowest']


In [19]:
import tiktoken

string = "developer"

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode(string)
print(tokens)                         # → [11455, 15639, 13959]
print([enc.decode([t]) for t in tokens])  # → ['low', ' lower', ' lowest']

[77944]
['developer']


In [22]:
type(tokens)

list

# G-Eval Example

## Install and import DeepEval

In [None]:
!pip install deepeval

In [7]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
from deepeval.test_case import LLMTestCase

## Criteria, evaluation steps, and parameters

In [15]:
CRITERIA = "Determine whether the actual output is factually correct based on the expected output."
EVALUATION_STEPS = [
    "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
    "You should also heavily penalize omission of detail",
    "Vague language, or contradicting OPINIONS, are OK",
    "The reason should be summarized in less than 50 words. Also the score should be in 2 decimal places."
]
EVALUATION_PARAMS = [
    LLMTestCaseParams.INPUT,
    LLMTestCaseParams.ACTUAL_OUTPUT,
    LLMTestCaseParams.EXPECTED_OUTPUT
]

In [8]:

correctness_metric_chatgpt = GEval(
    name="Correctness",
    criteria=CRITERIA,
    evaluation_steps=EVALUATION_STEPS,
    evaluation_params=EVALUATION_PARAMS,
    model="gpt-4o"
)

query = "What is the capital of France?"
output = "The capital of France is Paris."
expected = "The capital of France is Paris."

test_case = LLMTestCase(
    input=query,
    actual_output=output,
    expected_output=expected
)
correctness_metric_chatgpt.measure(test_case)

Output()

1.0

In [10]:
score = correctness_metric_chatgpt.score
reason = correctness_metric_chatgpt.reason
print(f"Score: {score}\r\nReason: {reason}")

Score: 1.0
Reason: The actual output matches the expected output exactly, with no contradictions or omissions of detail.


## Custom Logging Class to log the prompt before calling LLM

In [16]:
class LoggingGEval(GEval):
    def measure(self, test_case: LLMTestCase):
        # Build the prompt manually
        prompt = self._construct_prompt(test_case)
        print("\n🔍 Prompt sent to evaluator model:\n")
        print(prompt)
        print("\n--- End of Prompt ---\n")

        # Continue with standard GEval behavior
        return super().measure(test_case)

    def _construct_prompt(self, test_case: LLMTestCase) -> str:
        """
        Manually constructs the evaluation prompt using the criteria and evaluation steps,
        mimicking what GEval internally does.
        """
        input_text = f"# Input\n{test_case.input}" if LLMTestCaseParams.INPUT in self.evaluation_params else ""
        actual_output = f"# Actual Output\n{test_case.actual_output}" if LLMTestCaseParams.ACTUAL_OUTPUT in self.evaluation_params else ""
        expected_output = f"# Expected Output\n{test_case.expected_output}" if LLMTestCaseParams.EXPECTED_OUTPUT in self.evaluation_params else ""

        steps = "\n".join(f"{i+1}. {step}" for i, step in enumerate(self.evaluation_steps))
        return f"""# Evaluation Criteria
{self.criteria}

# Evaluation Steps
{steps}

{input_text}
{actual_output}
{expected_output}

# Your Task:
Please assess the Actual Output against the Expected Output based on the above criteria and steps. Provide:
1. A brief justification (under 50 words).
2. A numerical score between 0.00 and 1.00 (2 decimal places).
"""


In [17]:
correctness_metric_chatgpt = LoggingGEval(
    name="Correctness",
    criteria=CRITERIA,
    evaluation_steps=EVALUATION_STEPS,
    evaluation_params=EVALUATION_PARAMS,
    model="gpt-4o"
)

query = "What is the capital of France?"
output = "The capital of France is Paris."
expected = "The capital of France is Paris."

test_case = LLMTestCase(
    input=query,
    actual_output=output,
    expected_output=expected
)

result = correctness_metric_chatgpt.measure(test_case)
print("✅ Evaluation result:", result)


Output()


🔍 Prompt sent to evaluator model:

# Evaluation Criteria
Determine whether the actual output is factually correct based on the expected output.

# Evaluation Steps
1. Check whether the facts in 'actual output' contradicts any facts in 'expected output'
2. You should also heavily penalize omission of detail
3. Vague language, or contradicting OPINIONS, are OK
4. The reason should be summarized in less than 50 words. Also the score should be in 2 decimal places.

# Input
What is the capital of France?
# Actual Output
The capital of France is Paris.
# Expected Output
The capital of France is Paris.

# Your Task:
Please assess the Actual Output against the Expected Output based on the above criteria and steps. Provide:
1. A brief justification (under 50 words).
2. A numerical score between 0.00 and 1.00 (2 decimal places).


--- End of Prompt ---



✅ Evaluation result: 1.0


## Define LoggingGEvalTokenCount Helper class
This wrapper class is responsible for counting the tokens.

In [31]:
import tiktoken
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models.llms.openai_model import GPTModel

class LoggingGEvalTokenCount(GEval):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._total_prompt_tokens = 0
        self._total_response_tokens = 0
        self._model_name_str = None # Store model name as string

    def measure(self, test_case: LLMTestCase):
        # Build the prompt manually
        prompt = self._construct_prompt(test_case)
        # print("\n🔍 Prompt sent to evaluator model:\n") # Keep this for debugging if needed
        # print(prompt)
        # print("\n--- End of Prompt ---\n")

        # Count tokens in the prompt
        prompt_token_count = 0
        try:
            # Extract the model name string if it's a DeepEval model object
            self._model_name_str = self.model.model_name if isinstance(self.model, GPTModel) else self.model
            encoding = tiktoken.encoding_for_model(self._model_name_str)
            prompt_token_count = len(encoding.encode(prompt))
            self._total_prompt_tokens += prompt_token_count
            print(f"📊 Prompt token count ({self._model_name_str}): {prompt_token_count}\n") # Keep for per-case logging
        except Exception as e:
            print(f"⚠️ Could not count prompt tokens for model {self.model}: {e}\n")

        # Continue with standard GEval behavior to get the score and set self.reason
        score = super().measure(test_case)

        # Attempt to count response tokens - using self.reason
        response_token_count = 0
        try:
             if hasattr(self, 'reason') and self.reason: # Access reason from self
                 encoding = tiktoken.encoding_for_model(self._model_name_str)
                 response_token_count = len(encoding.encode(self.reason))
                 self._total_response_tokens += response_token_count
                 print(f"📊 Estimated response token count ({self._model_name_str}): {response_token_count}\n") # Keep for per-case logging
        except Exception as e:
             print(f"⚠️ Could not estimate response tokens for model {self.model}: {e}\n")


        return score # Return the score

    def _construct_prompt(self, test_case: LLMTestCase) -> str:
        """
        Manually constructs the evaluation prompt using the criteria and evaluation steps,
        mimicking what GEval internally does.
        """
        input_text = f"# Input\n{test_case.input}" if LLMTestCaseParams.INPUT in self.evaluation_params else ""
        actual_output = f"# Actual Output\n{test_case.actual_output}" if LLMTestCaseParams.ACTUAL_OUTPUT in self.evaluation_params else ""
        expected_output = f"# Expected Output\n{test_case.expected_output}" if LLMTestCaseParams.EXPECTED_OUTPUT in self.evaluation_params else ""

        steps = "\n".join(f"{i+1}. {step}" for i, step in enumerate(self.evaluation_steps))
        return f"""# Evaluation Criteria
{self.criteria}

# Evaluation Steps
{steps}

{input_text}
{actual_output}
{expected_output}

# Your Task:
Please assess the Actual Output against the Expected Output based on the above criteria and steps. Provide:
1. A brief justification (under 50 words).
2. A numerical score between 0.00 and 1.00 (2 decimal places).
"""

    def get_total_token_counts(self):
        """Returns the total prompt and estimated response token counts."""
        return self._total_prompt_tokens, self._total_response_tokens

# G-Eval version
Call the API using the wrapper class that counts the tokens

In [32]:
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset
import pandas as pd
import tiktoken

# Assuming the CSV is already created and contains the data
try:
    df = pd.read_csv('ai_agent_data.csv')
except FileNotFoundError:
    print("Error: 'ai_agent_data.csv' not found. Please run the data preparation cell first.")
    df = None

if df is not None:
    # --- DeepEval Evaluation with Token Counting ---

    # Define the evaluation criteria and steps (same as before)
    CRITERIA = "Determine whether the actual output is factually correct based on the expected output."
    EVALUATION_STEPS = [
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK",
        "The reason should be summarized in less than 50 words. Also the score should be in 2 decimal places."
    ]
    EVALUATION_PARAMS = [
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ]

    # Create test cases
    test_cases = []
    for index, row in df.iterrows():
        test_cases.append(
            LLMTestCase(
                input=row['Query'],
                actual_output=row['Actual Response'],
                expected_output=row['Expected Response']
            )
        )

    # Create an EvaluationDataset and add test cases
    dataset = EvaluationDataset()
    for test_case in test_cases:
        dataset.add_test_case(test_case)

    # Instantiate the LoggingGEvalTokenCount (assuming it's defined in a previous cell)
    try:
        correctness_metric_deepeval = LoggingGEvalTokenCount(
            name="Correctness",
            criteria=CRITERIA,
            evaluation_steps=EVALUATION_STEPS,
            evaluation_params=EVALUATION_PARAMS,
            model="gpt-4o"
        )

        print("\n--- Running DeepEval Evaluation ---")
        for test_case in test_cases:
            correctness_metric_deepeval.measure(test_case)

        deepeval_total_prompt_tokens, deepeval_total_response_tokens = correctness_metric_deepeval.get_total_token_counts()
        deepeval_grand_total_tokens = deepeval_total_prompt_tokens + deepeval_total_response_tokens

    except NameError:
        print("\nError: LoggingGEvalTokenCount class not found. Please ensure the cell defining LoggingGEvalTokenCount is executed.")
        deepeval_total_prompt_tokens, deepeval_total_response_tokens, deepeval_grand_total_tokens = None, None, None


print(f"{deepeval_total_prompt_tokens}, {deepeval_total_response_tokens}")

print("\n--- DeepEval OpenAI API Call Token Totals ---")
print(f"Total Prompt Tokens: {deepeval_total_prompt_tokens}")
print(f"Total Response Tokens: {deepeval_total_response_tokens}")
print(f"Grand Total Tokens: {deepeval_total_prompt_tokens + deepeval_total_response_tokens if deepeval_total_prompt_tokens is not None and deepeval_total_response_tokens is not None else None}")

Output()


--- Running DeepEval Evaluation ---
📊 Prompt token count (gpt-4o): 182



Output()

📊 Estimated response token count (gpt-4o): 16

📊 Prompt token count (gpt-4o): 210



Output()

📊 Estimated response token count (gpt-4o): 42

📊 Prompt token count (gpt-4o): 208



Output()

📊 Estimated response token count (gpt-4o): 43

📊 Prompt token count (gpt-4o): 204



Output()

📊 Estimated response token count (gpt-4o): 34

📊 Prompt token count (gpt-4o): 203



Output()

📊 Estimated response token count (gpt-4o): 33

📊 Prompt token count (gpt-4o): 206



Output()

📊 Estimated response token count (gpt-4o): 36

📊 Prompt token count (gpt-4o): 199



Output()

📊 Estimated response token count (gpt-4o): 32

📊 Prompt token count (gpt-4o): 239



Output()

📊 Estimated response token count (gpt-4o): 54

📊 Prompt token count (gpt-4o): 227



Output()

📊 Estimated response token count (gpt-4o): 33

📊 Prompt token count (gpt-4o): 181



📊 Estimated response token count (gpt-4o): 32

2059, 355

--- DeepEval OpenAI API Call Token Totals ---
Total Prompt Tokens: 2059
Total Response Tokens: 355
Grand Total Tokens: 2414


# Single-Prompt OpenAI version

In [4]:
import os
from openai import OpenAI
import pandas as pd
import tiktoken

# Assuming the OPENAI_API_KEY is already set in the environment or Colab secrets
client = OpenAI()

# Load the data (assuming the CSV is already created)
try:
    df = pd.read_csv('ai_agent_data.csv')
except FileNotFoundError:
    print("Error: 'ai_agent_data.csv' not found. Please run the data preparation cell first.")
    df = None

if df is not None:
    # Define the evaluation criteria and steps
    CRITERIA = "Determine whether the actual output is factually correct based on the expected output."
    EVALUATION_STEPS = [
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK",
        "The reason should be summarized in less than 50 words. Also the score should be in 2 decimal places."
    ]

    # Manually construct a single prompt for all test cases
    steps = "\n".join(f"{i+1}. {step}" for i, step in enumerate(EVALUATION_STEPS))
    full_prompt = f"""# Overall Evaluation Criteria
{CRITERIA}

# Overall Evaluation Steps
{steps}

# Data to Evaluate (Query, Actual Output, Expected Output pairs)
Evaluate each pair below based on the criteria and steps provided above. For each pair, provide a brief justification (under 50 words) and a numerical score between 0.00 and 1.00 (2 decimal places).

"""

    # Add each data pair to the prompt
    for index, row in df.iterrows():
        full_prompt += f"""---
Pair {index + 1}:
Input: {row['Query']}
Actual Output: {row['Actual Response']}
Expected Output: {row['Expected Response']}

Evaluation for Pair {index + 1}:
"""

    print("Constructed a single prompt for all queries:")
    print(full_prompt)
    print("--- End of Full Prompt ---")

    model_name = "gpt-4o" # Define the model name here

    # Calculate prompt token count
    full_prompt_token_count = None
    try:
        encoding = tiktoken.encoding_for_model(model_name)
        full_prompt_token_count = len(encoding.encode(full_prompt))
        print(f"\n📊 Full Prompt token count ({model_name}): {full_prompt_token_count}\n")
    except Exception as e:
        print(f"\n⚠️ Could not count tokens for full prompt ({model_name}): {e}\n")


    try:
        # Call the OpenAI API with the single prompt
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are an AI assistant that evaluates text based on provided criteria for multiple cases."},
                {"role": "user", "content": full_prompt}
            ],
            max_tokens=1000 # Adjust max_tokens to accommodate evaluation for all pairs
        )

        evaluation_text = response.choices[0].message.content.strip()
        print("OpenAI Response for all queries:")
        print(evaluation_text)
        print("--- End of Full Response ---")

        # Calculate response token count
        full_response_token_count = None
        try:
            encoding = tiktoken.encoding_for_model(model_name)
            full_response_token_count = len(encoding.encode(evaluation_text))
            print(f"📊 Full Response token count ({model_name}): {full_response_token_count}\n")
        except Exception as e:
            print(f"⚠️ Could not count tokens for full response ({model_name}): {e}\n")

        # Note: Parsing scores and reasons from a single response for multiple pairs
        # would require more sophisticated parsing logic. This code focuses on token count.

        # Display total token counts for this single-prompt approach
        print("\n--- Single-Prompt OpenAI API Call Token Totals ---")
        print(f"Total Prompt Tokens (Single Call): {full_prompt_token_count}")
        print(f"Total Response Tokens (Single Call): {full_response_token_count}")
        print(f"Grand Total Tokens (Single Call): {full_prompt_token_count + full_response_token_count if full_prompt_token_count is not None and full_response_token_count is not None else None}")


    except Exception as e:
        print(f"Error during single OpenAI API call: {e}")

Constructed a single prompt for all queries:
# Overall Evaluation Criteria
Determine whether the actual output is factually correct based on the expected output.

# Overall Evaluation Steps
1. Check whether the facts in 'actual output' contradicts any facts in 'expected output'
2. You should also heavily penalize omission of detail
3. Vague language, or contradicting OPINIONS, are OK
4. The reason should be summarized in less than 50 words. Also the score should be in 2 decimal places.

# Data to Evaluate (Query, Actual Output, Expected Output pairs)
Evaluate each pair below based on the criteria and steps provided above. For each pair, provide a brief justification (under 50 words) and a numerical score between 0.00 and 1.00 (2 decimal places).

---
Pair 1:
Input: What is the capital of France?
Actual Output: Paris is the capital of France.
Expected Output: The capital of France is Paris.

Evaluation for Pair 1:
---
Pair 2:
Input: Tell me about the history of the internet.
Actual Outp