# Automated Testing for LLMOps
This notebook follows the DeepLearning.AI course [Automated Testing for LLMOps](https://learn.deeplearning.ai/courses/automated-testing-llmops/lesson/1/introduction),
taught by Rob Zuber from CircleCI.

*Goals:*
- Run tests whenever you commit changes to your code base.
- Combine per-commit and pre-release evals.
- Detect hallucinations in LLM responses.

## Rule-based evals
- Use string or pattern matching, e.g. regular expression matching.
- Use them whenever you evaluate outputs with a clear right answer, e.g. sentiment classification where you have ground truth labels.
- Fast and cheap to run, so they are a good fit for per-commit runs.
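As a rough illustration of the two cases above, here is a minimal sketch of a pattern-based check and a label comparison. The function names and the sample strings are made up for this example and are not part of the course code.

```python
import re


def rule_based_eval(output: str, pattern: str) -> bool:
    """Return True if the output contains a match for the regular expression."""
    return re.search(pattern, output, flags=re.IGNORECASE) is not None


def exact_label_eval(predicted: str, expected: str) -> bool:
    """Compare a predicted label with its ground truth label,
    ignoring case and surrounding whitespace."""
    return predicted.strip().lower() == expected.strip().lower()


# Hypothetical outputs, made up for illustration
print(rule_based_eval("Question 1: Who painted the Mona Lisa?", r"question\s+\d+:"))  # True
print(exact_label_eval(" Positive ", "positive"))  # True
```

Both checks are plain string operations, so they need no LLM call at evaluation time.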

## Model graded evals
- Use them whenever there are many possible good or bad outputs, e.g. when an LLM writes text content, there are potentially many high-quality responses.
- An evaluation LLM grades the output of your application LLM.
- More expensive to run, so they are a good fit for pre-release runs.
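The idea can be sketched without calling a real model by passing the grader in as a function. Everything here is an assumption for illustration: the grading prompt, the Y/N answer convention, and `fake_grader`, which stands in for an evaluation LLM.

```python
from typing import Callable

# Hypothetical grading prompt; a real one would be tuned to the application.
GRADER_TEMPLATE = (
    "You are grading a quiz generated by another model.\n"
    "Quiz:\n{quiz}\n\n"
    "Does the text above look like a quiz with questions? Answer with Y or N."
)


def model_graded_eval(quiz: str, grader: Callable[[str], str]) -> bool:
    """Ask the grader about the quiz and interpret a leading 'Y' as a pass."""
    prompt = GRADER_TEMPLATE.format(quiz=quiz)
    verdict = grader(prompt)
    return verdict.strip().upper().startswith("Y")


def fake_grader(prompt: str) -> str:
    """Stand-in for an evaluation LLM so the sketch runs offline."""
    return "Y" if "Question 1:" in prompt else "N"


print(model_graded_eval("Question 1: Who painted the Mona Lisa?", fake_grader))  # True
print(model_graded_eval("I cannot help with that.", fake_grader))  # False
```

In a real setup, `grader` would wrap a call to the evaluation LLM, which is why this kind of eval costs more per run than a rule-based check.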

## Introduction to Continuous Integration (CI)
CI means testing your code every time you make a change or contribute a feature, so that buggy code is not merged.
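A CI job typically runs a test suite on every commit. Below is a minimal sketch of what such tests could look like in pytest style, where functions whose names start with `test_` are collected automatically. `generate_quiz` is a hypothetical stand-in for the application; a real test would call the LLM chain instead.

```python
def generate_quiz(category: str) -> str:
    """Hypothetical stand-in for the application under test."""
    return f"Question 1: Name one fact about {category}."


def test_quiz_mentions_category():
    # The generated quiz should refer to the requested category.
    quiz = generate_quiz("science")
    assert "science" in quiz.lower()


def test_quiz_uses_question_format():
    # The generated quiz should start with the expected question label.
    quiz = generate_quiz("art")
    assert quiz.startswith("Question 1:")
```

If any assertion fails, the test run fails and the CI job can block the merge.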

## Overview of automated evals
There are general benchmarks like MMLU or HellaSwag, but it is often necessary to build your own
evaluation for a specific use case.

What should you evaluate?
- context adherence
- context relevance
- correctness
- bias and toxicity

When should you evaluate?
- After every change (bug fixes, feature updates, data changes)
- Pre-deployment (merges to production branch, end of sprint, prior to shipping hotfix)
- Post-deployment (on demand based on business needs)

### The task
Create a quiz generator application.

In [None]:
human_template  = "{question}"

# The quiz bank is the data the LLM should draw from. The LLM must not use any other sources to come up with questions.
quiz_bank = """1. Subject: Leonardo DaVinci
   Categories: Art, Science
   Facts:
    - Painted the Mona Lisa
    - Studied zoology, anatomy, geology, optics
    - Designed a flying machine
  
2. Subject: Paris
   Categories: Art, Geography
   Facts:
    - Location of the Louvre, the museum where the Mona Lisa is displayed
    - Capital of France
    - Most populous city in France
    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie

3. Subject: Telescopes
   Category: Science
   Facts:
    - Device to observe different objects
    - The first refracting telescopes were invented in the Netherlands in the 17th Century
    - The James Webb space telescope is the largest telescope in space. It uses a gold-beryllium mirror

4. Subject: Starry Night
   Category: Art
   Facts:
    - Painted by Vincent van Gogh in 1889
    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence

5. Subject: Physics
   Category: Science
   Facts:
    - The sun doesn't change color during sunset.
    - Water slows the speed of light
    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal."""

In [None]:
# Open question: why use a delimiter here? It does not enclose the text, it only marks the beginning.
# Also, the steps are delimited, not only the question. Is this a mistake?
delimiter = "####"

prompt_template = f"""
Follow these steps to generate a customized quiz for the user.
The question will be delimited with four hashtags i.e {delimiter}

The user will provide a category that they want to create a quiz for. Any questions included in the quiz
should only refer to the category.

Step 1:{delimiter} First identify the category user is asking about from the following list:
* Geography
* Science
* Art

Step 2:{delimiter} Determine the subjects to generate questions about. The list of topics are below:

{quiz_bank}

Pick up to two subjects that fit the user's category. 

Step 3:{delimiter} Generate a quiz for the user. Based on the selected subjects generate 3 questions for the user using the facts about the subject.

Use the following format for the quiz:
Question 1:{delimiter} <question 1>

Question 2:{delimiter} <question 2>

Question 3:{delimiter} <question 3>

"""

We expect a few things from the LLM:
- Identify the right category from the question.
- Only ask questions that belong to the category.
- Only use facts from our data bank and nothing else.

So how can we evaluate this? One option is to assume that, when we ask the LLM to design a science
quiz, it only creates questions drawing on entries 1, 3, and 5 in our data bank.

Entries 1, 3, and 5 are about Leonardo DaVinci, telescopes, and physics. Hence, one way to
evaluate whether the LLM did a good job is to check whether words related to these topics appear in
the generated questions.

We can define some words we would expect to see in the questions: ["davinci", "telescope", "physics", "curie"]
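Before wiring this into the LLM chain, the check itself can be tried offline. This is a minimal sketch; `contains_expected_word` and both sample answers are made up for illustration and are not real model output.

```python
from typing import Sequence


def contains_expected_word(answer: str, expected_words: Sequence[str]) -> bool:
    """Return True if at least one expected word occurs in the answer (case-insensitive)."""
    answer_lower = answer.lower()
    return any(word in answer_lower for word in expected_words)


expected_words = ["davinci", "telescope", "physics", "curie"]

# Made-up answers, not real model output
science_answer = "Question 1: In which century were the first refracting telescopes invented?"
spaced_answer = "Question 1: Which artist painted the Mona Lisa? Leonardo da Vinci"

print(contains_expected_word(science_answer, expected_words))  # True
print(contains_expected_word(spaced_answer, expected_words))   # False
```

The second case shows a limitation of substring matching: "da Vinci" written with a space does not match "davinci". The check also passes as soon as a single word matches, so it cannot confirm that every question draws on the data bank.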

Let's look at all components one by one:

In [None]:
from langchain.prompts import ChatPromptTemplate
chat_prompt = ChatPromptTemplate.from_messages([("human", prompt_template)])
# print to observe the content or generated object
chat_prompt

In [None]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm

In [None]:
# parser
from langchain.schema.output_parser import StrOutputParser
output_parser = StrOutputParser()
output_parser

In [None]:
# using the pipe operator to put the prompt into the llm and parse the llm's output
chain = chat_prompt | llm | output_parser
chain

In [None]:
# bundle all components into one reusable function
def assistant_chain(
    system_message,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
  
  chat_prompt = ChatPromptTemplate.from_messages([
      ("system", system_message),
      ("human", human_template),
  ])
  return chat_prompt | llm | output_parser

In [None]:
def eval_expected_words(
    system_message,
    question,
    expected_words,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
    
  assistant = assistant_chain(
      system_message,
      human_template,
      llm,
      output_parser)
    
  
  answer = assistant.invoke({"question": question})
    
  print(answer)
    
  # Pass if at least one expected word occurs in the answer.
  assert any(word in answer.lower()
             for word in expected_words), (
    f"Expected the assistant questions to include one of "
    f"'{expected_words}', but it did not")

In [None]:
question  = "Generate a quiz about science."
expected_words = ["davinci", "telescope", "physics", "curie"]

In [None]:
eval_expected_words(
    prompt_template,
    question,
    expected_words
)

In [None]:
def evaluate_refusal(
    system_message,
    question,
    decline_response,
    human_template="{question}", 
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
    
  # assistant_chain expects system_message first, then human_template.
  assistant = assistant_chain(system_message,
                              human_template,
                              llm,
                              output_parser)
  
  answer = assistant.invoke({"question": question})
  print(answer)
  
  assert decline_response.lower() in answer.lower(), (
    f"Expected the bot to decline with "
    f"'{decline_response}' got {answer}")

In [None]:
question  = "Generate a quiz about Rome."
decline_response = "I'm sorry"

In [None]:
evaluate_refusal(
    prompt_template,
    question,
    decline_response
)
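The refusal check is a case-insensitive substring match, which can be tried without an LLM call. This is a minimal sketch; `is_refusal` and the sample answers are made up for illustration.

```python
def is_refusal(answer: str, decline_response: str) -> bool:
    """Return True if the decline phrase occurs in the answer (case-insensitive)."""
    return decline_response.lower() in answer.lower()


decline_response = "I'm sorry"

# Made-up answers, not real model output
plain = "I'm sorry, I do not have information about that topic."
curly = "I\u2019m sorry, I do not have information about that topic."  # typographic apostrophe
quiz = "Question 1: What is the capital of France?"

print(is_refusal(plain, decline_response))  # True
print(is_refusal(curly, decline_response))  # False
print(is_refusal(quiz, decline_response))   # False
```

The second case shows how brittle an exact phrase can be: an answer that uses a typographic apostrophe, or a different wording such as "I apologize", fails the check even though the model declined.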