# Benchmarking:
### Objective: Systematically test prompts, GPT models, and other techniques for D&D AI use cases

### Description: 
* There are 6 simple scenarios in which two characters might have a dialogue.
* Each scenario is set up by 3 predefined prior messages.
* The chatbot responds one or more times, alternating which character it roleplays.
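
The alternating-responder loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in bm1.py: the message shape (`speaker`/`text` keys), the `run_scenario` and `echo_bot` names, and the character names are all hypothetical, and `echo_bot` is a stand-in for the real model call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # hypothetical shape: {"speaker": ..., "text": ...}


def run_scenario(
    prior_messages: List[Message],
    characters: List[str],
    respond: Callable[[str, str, List[Message]], str],
    k: int = 1,
) -> List[Message]:
    """Generate k responses, alternating which character the bot plays."""
    history = list(prior_messages)
    for _ in range(k):
        prompter = history[-1]["speaker"]
        # The responder is whichever of the two characters did not speak last.
        responder = characters[0] if prompter == characters[1] else characters[1]
        text = respond(responder, prompter, history)
        history.append({"speaker": responder, "text": text})
    # Return only the newly generated messages.
    return history[len(prior_messages):]


def echo_bot(responder: str, prompter: str, history: List[Message]) -> str:
    # Stand-in for the real model call; it only reports who answers whom.
    return f"{responder} replies to {prompter}"


# Hypothetical scenario: 3 prior messages between two made-up characters.
prior = [
    {"speaker": "Ayla", "text": "Did you hear that?"},
    {"speaker": "Borin", "text": "Hear what?"},
    {"speaker": "Ayla", "text": "Something is moving in the dark."},
]
replies = run_scenario(prior, ["Ayla", "Borin"], echo_bot, k=4)
for reply in replies:
    print(reply["speaker"], "->", reply["text"])
```

With `k=1` this corresponds to the single-response setup, and with `k=4` to the multiple-response setup used later in this notebook.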

Related files: 
* bm1.py -- runs benchmark
* dialogue-basic.json -- defines scenarios
---

In [None]:
import json


def display_json(filename):
    """Load a JSON file and pretty-print its contents."""
    with open(filename, 'r') as f:
        output = json.load(f)
    print(json.dumps(output, indent=4))


display_json('../scenarios/dialogue-basic.json')

### Benchmark 1: Simple Dialogue Part 1 -- Single Response

Why? 
* Chatbots should be able to respond to dialogue with dialogue in-character and in a clear and error-free manner.

Challenges:
1. All answers should address the previous message.
2. All answers should be dialogue, with no extra text.
3. Responses should contain no named-entity recognition (NER) errors: character names must be used correctly.
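
Challenge 2 could be screened automatically before manual review. The sketch below is a rough heuristic and an assumption on my part, not the evaluation used in this benchmark: `dialogue_only_issues` is a hypothetical helper that flags asterisk stage directions, leading speaker labels, and the responder's own name appearing in the text (a common sign of third-person narration). It will produce false positives, for example on a line that legitimately starts with "Listen:".

```python
import re
from typing import List


def dialogue_only_issues(response: str, responder_name: str) -> List[str]:
    """Return a list of heuristic warnings; an empty list means no issue found."""
    issues = []
    # Stage directions such as *draws his axe*.
    if re.search(r"\*[^*]+\*", response):
        issues.append("stage direction in asterisks")
    # A leading speaker label such as "Borin:".
    if re.match(r"\s*[A-Z][A-Za-z' ]{0,30}:", response):
        issues.append("speaker label")
    # The responder's own name in the text suggests third-person narration.
    if re.search(rf"\b{re.escape(responder_name)}\b", response):
        issues.append("responder named in third person")
    return issues


# Hypothetical responses for a made-up character named Borin.
print(dialogue_only_issues("Stay close, the bridge is not safe.", "Borin"))
print(dialogue_only_issues('Borin grips his axe. "Stay behind me."', "Borin"))
```

A flagged response still needs a human check; the heuristic only narrows down which outputs to read first.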


### Attempt 1
Notes: Fails challenges *1* and *3*: some responses do not address the correct message, and some contain NER errors.

In [None]:
file = './../mlruns/834045129700860192/9a69d75c91674318adc509b9fd0b06c3/artifacts/bm1_mr1.json'
display_json(file)

### Attempt 2

Let's try to improve NER by injecting names into the qa_system_prompt:

qa_system_prompt: 
"...You are roleplaying as {responder_name}, and you are responding to {prompter_name}...."

Notes: Passes all challenges. 
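
For illustration, the name injection can be sketched with plain `str.format`. This assumes the placeholders are filled by simple string substitution; how bm1.py actually fills the template is not shown here. Only the fragment quoted above is used, since the rest of the prompt is elided, and `NAME_FRAGMENT`, `build_name_fragment`, and the character names are hypothetical.

```python
# Only the fragment of qa_system_prompt quoted above; the elided parts are omitted.
NAME_FRAGMENT = (
    "You are roleplaying as {responder_name}, "
    "and you are responding to {prompter_name}."
)


def build_name_fragment(responder_name: str, prompter_name: str) -> str:
    """Fill the two name placeholders (assumes plain str.format substitution)."""
    return NAME_FRAGMENT.format(
        responder_name=responder_name,
        prompter_name=prompter_name,
    )


# Hypothetical character names, for illustration only.
print(build_name_fragment("Borin", "Ayla"))
```

Because the responder alternates between turns, the two names swap on every call.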

In [None]:
file = './../mlruns/834045129700860192/6d5d4d5f3de94309b51541b6f751b35c/artifacts/bm1_mr1.json'
display_json(file)

---
### Benchmark 1: Simple Dialogue Part 2 -- Multiple Responses (k=4)

Why? 
* To verify challenges from part 1 at a larger scale
* To test conversational progression and ensure conversations don't get "stuck"

Challenges:
1. All answers should address the previous message.
2. All answers should be dialogue, with no extra text.
3. Responses should contain no named-entity recognition (NER) errors: character names must be used correctly.
4. There should be clear conversational progression.
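
One rough way to flag a "stuck" conversation for challenge 4 is to look for near-duplicate responses. The sketch below is an assumption, not the method used in this benchmark: it compares every pair of responses with the standard library's `difflib.SequenceMatcher`, and the `looks_stuck` name and the 0.9 threshold are arbitrary, hypothetical choices.

```python
from difflib import SequenceMatcher
from typing import List


def max_pairwise_similarity(responses: List[str]) -> float:
    """Return the highest similarity ratio (0.0 to 1.0) between any two responses."""
    best = 0.0
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            ratio = SequenceMatcher(
                None, responses[i].lower(), responses[j].lower()
            ).ratio()
            best = max(best, ratio)
    return best


def looks_stuck(responses: List[str], threshold: float = 0.9) -> bool:
    """Flag a conversation when two responses are near-duplicates."""
    return max_pairwise_similarity(responses) >= threshold


# Hypothetical responses, for illustration only.
repeating = ["We should leave.", "We should leave.", "Agreed.", "Now."]
progressing = [
    "The door is locked.",
    "Then we find another way in.",
    "There is a window on the east wall.",
    "Lead on.",
]
print(looks_stuck(repeating), looks_stuck(progressing))
```

Textual similarity only catches literal repetition; a conversation can still circle the same topic in different words, so this complements rather than replaces reading the output.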

### Attempt 1 -- no changes

Notes: 
* Responses in 3 scenarios include third-person narration. (Fails 2)
* Clear conversational progression. (Passes 4)

In [None]:
file = './../mlruns/946018013973686521/b3dfd936a6784bf28b537d9bfc8aafdf/artifacts/bm1_mr4.json'
display_json(file)

### Attempt 2 -- Directly specify to use only dialogue in response

qa_system_prompt:
"You are a role playing chatbot for a Dungeons and Dragons game. Use the retrieved context to answer the prompt. If the context does not apply, respond in a way that makes sense. Use three sentences maximum, and keep the answer concise. You are roleplaying as {responder_name}, and you are responding to {prompter_name}. Use only dialogue in your response.\n\n CONTEXT: {context}"

Notes: 
* All scenarios contain only dialogue. (Passes 2)
* Clear conversational progression.
* Passes all challenges


In [None]:
file = './../mlruns/946018013973686521/ba90cb1dccc844cf88a06c70f10c3e1d/artifacts/bm1_mr4.json'
display_json(file)

# Takeaways and Next Steps

### Takeaways
1. Simple dialogue is consistent: it passes all challenges with both single and multiple (k=4) responses on every scenario.
2. Direct instructions work best for constraining behavior. (Don't make the bot guess the expected behavior!)

### Next Steps
1. Implement a tags-based prompt template, which appends instructions based on tags. For example, a bot with the tag "dialogue-only" should have the sentence "Use only dialogue in your response." appended to its prompt.
2. Implement an action benchmark for action-only and action-dialogue scenarios.
3. Implement an agentic benchmark in which the chatbot decides whether an action or dialogue is appropriate.
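
The tags-based template in step 1 might look roughly like the sketch below. Since this is a planned feature, everything here is a hypothetical design: the `TAG_INSTRUCTIONS` mapping, the `build_system_prompt` function, and the shortened base prompt are assumptions. Only the "dialogue-only" tag and its sentence come from the text above.

```python
from typing import Dict, Iterable

# Shortened base prompt for illustration; the real prompt has more instructions.
BASE_PROMPT = (
    "You are roleplaying as {responder_name}, "
    "and you are responding to {prompter_name}."
)

# Hypothetical tag-to-instruction mapping; only "dialogue-only" is from the notes.
TAG_INSTRUCTIONS: Dict[str, str] = {
    "dialogue-only": "Use only dialogue in your response.",
}


def build_system_prompt(
    tags: Iterable[str], responder_name: str, prompter_name: str
) -> str:
    """Append one instruction sentence per tag to the base prompt."""
    parts = [
        BASE_PROMPT.format(
            responder_name=responder_name, prompter_name=prompter_name
        )
    ]
    for tag in tags:
        if tag not in TAG_INSTRUCTIONS:
            # Fail loudly so a misspelled tag does not silently drop an instruction.
            raise KeyError(f"Unknown tag: {tag}")
        parts.append(TAG_INSTRUCTIONS[tag])
    return " ".join(parts)


# Hypothetical character names, for illustration only.
print(build_system_prompt(["dialogue-only"], "Borin", "Ayla"))
```

Raising on unknown tags is one possible design choice; it keeps the prompt explicit, in line with the takeaway that direct instructions constrain behavior best.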