### Note before importing
The first cell of imports takes a few minutes to run because it builds a local vector index from a PDF file downloaded from the web. To keep the project limited to the purpose of the exercise, I use a local index instead of setting up an online vector index and figuring out how to grant access to it.

In [1]:
import logging

from dotenv import load_dotenv
from griptape.drivers import OpenAiChatPromptDriver

load_dotenv()

from alteryx_poc.agent import AlteryxAgent

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Problem solving approach
Pretrained LLMs that support function calling can already respond "I don't know." when ReAct prompting turns up no answer to a question. They need no additional training data or fine-tuning to do so (although fine-tuning may yield improvements in other areas).

### Summary of Prompt
- Instructs the AI to act truthfully and not make up facts.
- Defines a set of topics the AI is able to discuss.
- Equips the AI with a set of tools or actions it may take in discussing those topics.
- Prohibits responding with anything other than "I don't know." when the user asks a question that no associated tool can answer.
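
As a rough illustration of the four points above, a system prompt could be assembled like this. This is a minimal sketch, not the project's actual template: `build_system_prompt`, the wording of each line, and the tool names `calculator` and `rules_search` are all hypothetical.

```python
# Hypothetical sketch; the project's real prompt lives in its Jinja template.
def build_system_prompt(topics, tool_names):
    """Assemble a system prompt covering the four points in the summary."""
    topic_list = ", ".join(topics)
    tool_list = ", ".join(tool_names)
    return "\n".join([
        "Answer truthfully and never make up facts.",
        f"You may discuss only these topics: {topic_list}.",
        f"You may use only these tools: {tool_list}.",
        'If no tool can answer the question, reply exactly: "I don\'t know."',
    ])


prompt = build_system_prompt(
    topics=["arithmetic", "Dungeons & Dragons"],
    tool_names=["calculator", "rules_search"],  # assumed names
)
print(prompt)
```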

### Why it works
The set of tools effectively defines the universe of what the chatbot knows, and it can be expanded simply by equipping the model with more tools. By using an LLM equipped with function calling, the AI is able to use tools to interact with in-memory objects, external data sources, etc.
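
A toy version of this idea, with no LLM involved, might look like the sketch below. Everything in it is an assumption made for illustration: the tools are plain Python callables, and a simple "try each tool" loop stands in for the model's tool-selection step. It only shows how the tool set bounds what can be answered and how adding a tool widens it.

```python
import operator

# Hypothetical stand-in tools; the real agent lets the LLM choose a tool.
OPS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}


def calculator(question):
    """Handle questions shaped like 'What is 6 times 7?'; return None otherwise."""
    words = question.rstrip("?").split()
    for i, word in enumerate(words):
        if word in OPS and 0 < i < len(words) - 1:
            try:
                return str(OPS[word](float(words[i - 1]), float(words[i + 1])))
            except ValueError:
                return None  # the neighbours were not numbers
    return None


def answer(question, tools):
    """Try every tool; fall back to the refusal when none produces a result."""
    for tool in tools.values():
        result = tool(question)
        if result is not None:
            return result
    return "I don't know."


tools = {"calculator": calculator}
print(answer("What is 6 times 7?", tools))
print(answer("What is the capital of Illinois?", tools))

# Equipping one more (hypothetical) tool expands what can be answered.
CAPITALS = {"illinois": "Springfield"}


def capital_lookup(question):
    lowered = question.lower()
    for state, capital in CAPITALS.items():
        if "capital" in lowered and state in lowered:
            return capital
    return None


tools["capital_lookup"] = capital_lookup
print(answer("What is the capital of Illinois?", tools))
```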

See the full ReAct prompt template at `alteryx_poc/templates/react.j2`. (I think it can probably be shortened quite a bit, but I haven't really tried.)

# AlteryxAgent
This class acts as the main point of interaction with the chat model. Here, we use "gpt-3.5-turbo-0613" since it's cheap, seems to work well for this, and has a context window large enough to hold the tool schemas.

`AlteryxAgent` knows only how to perform arithmetic and answer questions about Dungeons & Dragons. Anything else falls outside the AI's domain of knowledge and elicits "I don't know." (or sometimes a variant that means the same thing).

### Unreliability after repeated requests
Occasionally, the model will hallucinate and answer questions it's not supposed to, or attempt to use a tool inappropriately, but these instances seem rare. Hallucinations seem to become more common as more questions are put to the same agent. I think this is because the conversation memory that gets passed as part of the full prompt accumulates over time, so the initial constraints we define in the basic system prompt get "lost" in the noise. If that is the cause, it could be mitigated by buffering the conversation memory so that only the most recent turns are passed along.

Setting the model to GPT-4 does seem to mitigate the problem quite a bit. But due to GPT-4's strict rate limiting, using tools will sometimes cause rate limit warnings and repeated retries.
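
The buffering idea mentioned above could be sketched roughly as follows. This is not how `AlteryxAgent` or its underlying framework stores memory; `BufferedMemory` and its methods are hypothetical names, and the message format is just the common chat-style list of role/content dicts. The point is that the system prompt always stays first while old turns fall away.

```python
from collections import deque


class BufferedMemory:
    """Keep the system prompt plus only the most recent conversation turns."""

    def __init__(self, system_prompt, max_turns=3):
        self.system_prompt = system_prompt
        # A deque with maxlen drops the oldest turn once the buffer is full.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, question, answer):
        self.turns.append((question, answer))

    def to_messages(self):
        """Build a chat-style message list with the constraints always first."""
        messages = [{"role": "system", "content": self.system_prompt}]
        for question, answer in self.turns:
            messages.append({"role": "user", "content": question})
            messages.append({"role": "assistant", "content": answer})
        return messages


memory = BufferedMemory("Use tools only; otherwise say \"I don't know.\"", max_turns=2)
for i in range(5):
    memory.add_turn(f"question {i}", "I don't know.")

messages = memory.to_messages()
print(len(messages))           # one system message plus two turns of two messages
print(messages[1]["content"])  # the oldest turn still kept
```

A larger `max_turns` keeps more context for follow-up questions at the cost of a longer prompt, so the right value would need to be found by experiment.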

In [2]:
agent = AlteryxAgent(
    logger_level=logging.INFO, 
    prompt_driver=OpenAiChatPromptDriver(
        # model="gpt-4",
        model="gpt-3.5-turbo-0613",
        max_tokens=500,
        temperature=0,
    ),
)

### Demo 1
Start with a couple of questions whose answers we know, but the bot shouldn't.

In [3]:
response = agent.run("What is the capital of Illinois?")

In [4]:
response = agent.run("Who wrote The Fall of the House of Usher?")

### Demo 2
Now a question that would be hard for us to answer, but is right within the bot's domain.

In [5]:
response = agent.run("What is 156.5345 times 2345 divided by 12345?")

### Demo 3
Now a question that draws on a specific domain of knowledge that has been provided to it in a vector store of the D&D rules.

In [6]:
response = agent.run("What does magic missile do in D&D 5e?")

### Demo 4
Now a question where it's ambiguous whether we're asking within its domain or outside of it.

Most of the time, the model immediately defaults to interpreting the question as being about D&D, but occasionally it will do a really interesting chain of thought in which it attempts to resolve the ambiguity on its own. Fingers crossed it does the cool thing.

In [7]:
response = agent.run("What is a cleric?")

### Demo 5
Now let's just ask a bunch of questions it shouldn't know answers to and see if the model is starting to get unreliable. It usually does after this point. I'd love to talk about strategies for solving this issue.

In [8]:
response = agent.run("What is the most populous city on Earth?")

In [9]:
response = agent.run("What was the highest grossing movie in 2017?")

In [10]:
response = agent.run("How many wheels does a horse have?")

# How to evaluate

Although we did not require any data for fine-tuning, evaluation does require some data. This data can be generated automatically by a bare chat model (the cells below use "gpt-3.5-turbo"), but I think it would also be wise to include some human-created examples that address interesting edge cases, such as questions with ambiguous context like "What is a cleric?".

AI-generated questions should always be reviewed by a human before being used to evaluate a production model.

In [11]:
import os
import openai

# Note: `openai.ChatCompletion` is the pre-1.0 interface of the `openai` package.
openai.api_key = os.getenv("OPENAI_API_KEY")

# Ask a bare chat model for on-topic evaluation questions.
dnd_questions = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant who generates evaluation examples for a question answering AI."},
    {"role": "user", "content": "Generate 10 questions a D&D player might ask about fifth edition rules. Put each question on a separate line."}
  ]
)

print(dnd_questions.choices[0].message["content"])

1. What are the changes in character creation rules compared to previous editions?
2. How do spellcasting classes function in fifth edition?
3. Can you explain the rules for opportunity attacks and how they work in combat?
4. What are the different types of damage and resistances in the game?
5. How do critical hits and critical failures work in fifth edition combat?
6. Can you explain the rules for multi-classing and how it impacts character progression?
7. How are skills and abilities determined, and how do they affect the outcomes of actions in the game?
8. What are the rules for resting and regaining hit points and spell slots?
9. How does the advantage and disadvantage mechanic work, and how is it applied in gameplay?
10. Can you explain the mechanics of grappling and shoving during combat?


In [12]:
off_topic_questions = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant who generates evaluation examples for a question answering AI."},
    {"role": "user", "content": "Generate 10 questions about any topic EXCEPT arithmetic and D&D fifth edition."}
  ]
)

print(off_topic_questions.choices[0].message["content"])

1. What are the main features of the new iPhone 12?
2. How does the human respiratory system work?
3. What are the major causes of climate change?
4. What are the key differences between capitalism and socialism?
5. How does artificial intelligence impact the job market?
6. What are the potential benefits and risks of gene editing technology?
7. How does the internet of things (IoT) impact our daily lives?
8. What are the main theories explaining the origins of the universe?
9. What are the most effective ways to reduce plastic pollution?
10. What is the impact of social media on mental health?


### Evaluation

Split the generated text into one question per line and put the first four through our `AlteryxAgent`.
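
The evaluation cell splits on newlines only, so the leading "1. " numbering stays attached to each question. If cleaner inputs are wanted, a small helper along these lines could strip it first. `parse_questions` is a hypothetical name, and the regex assumes the numbering looks like `1.` or `1)` at the start of a line.

```python
import re


def parse_questions(text):
    """Return one question per non-empty line, without leading list numbering."""
    questions = []
    for line in text.splitlines():
        # Drop a leading "12." or "12)" plus surrounding whitespace.
        cleaned = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions


sample = "1. What is a cleric?\n2. How does resting work?\n\n"
print(parse_questions(sample))
```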

In [13]:
agent = AlteryxAgent(
    logger_level=logging.INFO,
    prompt_driver=OpenAiChatPromptDriver(
        model="gpt-4",
        #model="gpt-3.5-turbo-0613",
        max_tokens=500,
        temperature=0,
    ),
)

In [14]:
# Evaluate only the first four generated questions.
questions = off_topic_questions.choices[0].message["content"].split("\n")[:4]
answers = [agent.run(q).output.value for q in questions]
answers

["I don't know.", "I don't know.", "I don't know.", "I don't know."]
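
To turn a list of answers like this into a number, one option is a refusal rate: the fraction of answers that amount to "I don't know." The sketch below is an assumption about how such scoring might be done, not part of the project. `REFUSAL_PATTERNS` is a guessed list of acceptable phrasings that would need extending after reviewing real outputs, since the agent sometimes replies with a variant.

```python
import re

# Assumed refusal phrasings; extend after inspecting real agent outputs.
REFUSAL_PATTERNS = [r"\bi don'?t know\b", r"\bi do not know\b"]


def is_refusal(answer):
    """True when the answer matches one of the assumed refusal phrasings."""
    normalized = answer.lower().replace("\u2019", "'")  # unify curly apostrophes
    return any(re.search(pattern, normalized) for pattern in REFUSAL_PATTERNS)


def refusal_rate(answers):
    """Fraction of answers that are refusals; 0.0 for an empty list."""
    if not answers:
        return 0.0
    return sum(is_refusal(a) for a in answers) / len(answers)


off_topic_answers = ["I don't know.", "I don't know.", "I don't know.", "I don't know."]
print(refusal_rate(off_topic_answers))
```

For off-topic questions the rate should be close to 1.0, while for in-domain questions (arithmetic and D&D) it should be close to 0.0. Whether an in-domain answer is actually correct would still need human review.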