In [1]:
#%pip install langchain langsmith openai pyyaml PyGithub

In [3]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
import utils

# load API tokens for our 3rd party APIs.
cci_api_key = utils.get_circle_api_key()
gh_api_key = utils.get_gh_api_key()
openai_api_key = utils.get_openai_api_key()

# set up our github branch
course_repo = utils.get_repo_name()
course_branch = utils.get_branch()

# Introduction to LLMs and evals
LLM-based applications introduce two new problems to software testing: non-deterministic output and subjectivity.

* LLMs work by learning a probability distribution over training data and then predicting the next token in a sequence. Sampling from that distribution introduces randomness, which makes output sound more human-like but also makes it difficult to predict what the LLM will say.
* Because LLMs deal with text output, there's also more subjectivity. If your application produces summaries, there are multiple "good" ways to answer a question. Similarly with code, there's no universally correct way to write a function.
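
To make the first point concrete, here is a minimal sketch of temperature sampling over a toy next-token distribution. The tokens, the scores, and the function name `sample_next_token` are made up for illustration; a real LLM scores a far larger vocabulary, but the sampling step is similar in spirit. Treating a temperature of 0 as "always pick the top token" is a common convention and is assumed here.

```python
import math
import random

def sample_next_token(scores, temperature, rng):
    """Pick a token from `scores` (token -> unnormalized log-score).

    temperature == 0 is treated as greedy decoding (always the top token);
    higher temperatures flatten the distribution and add randomness.
    """
    if temperature == 0:
        return max(scores, key=scores.get)
    # Softmax with temperature, shifted by the max score for numerical stability.
    top = max(scores.values())
    weights = {tok: math.exp((s - top) / temperature) for tok, s in scores.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical scores for the token that follows "The capital of France is".
toy_scores = {"Paris": 5.0, "a": 2.0, "located": 1.5, "Lyon": 0.5}

rng = random.Random(0)
greedy = [sample_next_token(toy_scores, 0, rng) for _ in range(5)]
sampled = [sample_next_token(toy_scores, 2.0, rng) for _ in range(200)]
print(greedy)                # the same token every time
print(sorted(set(sampled)))  # several different tokens show up
```

The chain defined later in this notebook sets `temperature=0`, which reduces this randomness but does not remove the subjectivity problem.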

It's also important to note that LLMs can produce harmful, toxic, or offensive content. This is a new problem for application testing compared to traditional software, where outputs can be constrained by the programmer.

To deal with these new testing challenges, AI researchers developed the concept of "evals" to assess how well LLMs do at different tasks. There are many common datasets for different tasks, including MMLU, HellaSwag, and HumanEval. LLMs are tested on these datasets so that researchers have a point of comparison between models.

Standard benchmarks are a good starting point, but they don't cover the specifics of __your__ application. To assess how well your agents, chatbots, and assistants perform, you need to write evaluations for your application's use cases.

## What you will learn
In this course we'll show you how to write those evaluations and automate running them in CircleCI. This will give you a way to see how well your application performs while you build as well as give you a set of tests to run as your application changes.


* We'll give you a framework to think about:
  * What to evaluate - starting with basic evals, using LLM-based evals to grade your application output, and dealing with difficult-to-automate cases
  * When to run evaluations:
    * Starting with evals when you make application changes
    * Running evals before you deploy
    * Periodically evaluating your entire application

# The sample application

We are going to build an AI-powered quiz generator.

The app will have a data set of facts categorized across Art, Science, and Geography. The facts are grouped into specific subjects. Some subjects apply to multiple categories; for example, Paris is home to many great works of art and scientific inventions.

The user will ask our bot to write a quiz about a given topic and get back a set of questions. We'll write evaluations to check that the bot is using the appropriate facts and only using facts in our data set.
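
One way to picture the data set is as a small in-memory structure. The sketch below is for illustration only: the notebook itself stores the facts directly in the prompt, the names `QUIZ_BANK` and `subjects_for` are hypothetical, and only three of the subjects are included.

```python
# Hypothetical in-memory version of part of the quiz data set.
QUIZ_BANK = {
    "Leonardo DaVinci": {
        "categories": {"Art", "Science"},
        "facts": ["Painted the Mona Lisa", "Designed a flying machine"],
    },
    "Paris": {
        "categories": {"Art", "Geography"},
        "facts": ["Capital of France", "Most populous city in France"],
    },
    "Telescopes": {
        "categories": {"Science"},
        "facts": ["Device to observe different objects"],
    },
}

def subjects_for(category):
    """Return the subjects tagged with `category` (case-insensitive), sorted by name."""
    wanted = category.strip().lower()
    return sorted(
        name
        for name, entry in QUIZ_BANK.items()
        if wanted in {c.lower() for c in entry["categories"]}
    )

print(subjects_for("art"))
print(subjects_for("History"))  # no subjects: the bot should decline
```

A category with no matching subjects, such as "History" here, is the kind of request we want the bot to decline rather than answer from its own knowledge.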

In [20]:
from langchain.prompts                import ChatPromptTemplate
from langchain.chat_models            import ChatOpenAI
from langchain.schema.output_parser   import StrOutputParser

delimiter = "####"

# Note: Our topics are stored in the prompt. In a real application you might use a database
# or files to hold the data.
system_message = f"""
Follow these steps to generate a customized quiz for the user.
The question will be delimited with four hashtags i.e {delimiter}

Step 1:{delimiter} First identify the category user is asking about from the following list:
* Geography
* Science
* Art

Step 2:{delimiter} Determine the subjects to generate questions about. The list of topics are below:
1. Subject: Leonardo DaVinci
   Categories: Art, Science
   Facts:
    - Painted the Mona Lisa
    - Studied zoology, anatomy, geology, optics
    - Designed a flying machine
  
2. Subject: Paris
   Categories: Art, Geography
   Facts:
    - Location of the Louvre, the museum where the Mona Lisa is displayed
    - Capital of France
    - Most populous city in France
    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie

3. Subject: Telescopes
   Category: Science
   Facts:
    - Device to observe different objects
    - The first refracting telescopes were invented in the Netherlands in the 17th Century
    - The James Webb space telescope is the largest telescope in space. It uses a gold-beryllium mirror

4. Subject: Starry Night
   Category: Art
   Facts:
    - Painted by Vincent van Gogh in 1889
    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence

5. Subject: Physics
   Category: Science
   Facts:
    - The sun doesn't change color during sunset.
    - Water slows the speed of light
    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.

Pick up to two subjects that fit the user's category. 

Step 3:{delimiter} Generate a quiz for the user. Based on the selected subjects generate 3 questions for the user using the facts about the subject.
Use the following format:
Question 1:{delimiter} <question 1>

Question 2:{delimiter} <question 2>

Question 3:{delimiter} <question 3>
"""

def assistant_chain():
  human_template  = "{question}"

  chat_prompt = ChatPromptTemplate.from_messages([
      ("system", system_message),
      ("human", human_template),
  ])
  return chat_prompt | ChatOpenAI(model="gpt-3.5-turbo", temperature=0) | StrOutputParser()

Let's create some basic evaluations for our assistant.

In [9]:
def evaluate_science_facts():
  assistant = assistant_chain()
  question  = "Generate a quiz about science."
  answer = assistant.invoke({"question": question})
  expected_subjects = ["davinci", "telescope", "physics", "curie"]
  print(answer)
  assert any(subject in answer.lower() for subject in expected_subjects), f"Expected the assistant questions to include '{expected_subjects}', but it did not"

In [8]:
evaluate_science_facts()

Great! Here are three science questions for you:

Question 1:#### What is the largest telescope in space called and what material is its mirror made of?

Question 2:#### True or False: Water slows down the speed of light.

Question 3:#### What did Marie and Pierre Curie discover in Paris?



Now, let's write a failing test case.

We'll ask our application to answer a question it doesn't have information about. We want the application to decline to answer rather than make up questions, but we don't have any restrictions in our prompt.

In [21]:
def evaluate_geography_facts():
  assistant = assistant_chain()
  question  = "Help me create a quiz about Rome"
  answer = assistant.invoke({"question": question})
  print(answer)
  # Look for a substring of the message the bot prints when it declines a question.
  decline_response = "I'm sorry"
  assert decline_response.lower() in answer.lower(), f"Expected the bot to decline with '{decline_response}' got {answer}"

In [22]:
evaluate_geography_facts()

Great! Since you mentioned Rome, we will focus on the category of Geography. Let's generate some questions about Rome for your quiz.

Question 1:####
What is the capital city of Italy?

Question 2:####
Which famous ancient structure in Rome was used for gladiatorial contests and other public spectacles?

Question 3:####
What is the name of the river that runs through Rome?

Feel free to use these questions for your quiz about Rome!


AssertionError: Expected the bot to decline with 'I'm sorry' got Great! Since you mentioned Rome, we will focus on the category of Geography. Let's generate some questions about Rome for your quiz.

Question 1:####
What is the capital city of Italy?

Question 2:####
Which famous ancient structure in Rome was used for gladiatorial contests and other public spectacles?

Question 3:####
What is the name of the river that runs through Rome?

Feel free to use these questions for your quiz about Rome!
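
The decline check itself can be exercised without calling the model, which is a cheap way to confirm the assertion logic before spending API calls. The sketch below uses canned strings in place of real model output; the helper name `declined` and both sample responses are assumptions made for illustration.

```python
def declined(answer, marker="I'm sorry"):
    """Return True if `answer` contains the decline marker, ignoring case."""
    return marker.lower() in answer.lower()

# Canned responses standing in for real model output (hypothetical text).
canned_decline = "I'm sorry, but I do not have information on that topic."
canned_quiz = "Question 1:#### What is the capital city of Italy?"

print(declined(canned_decline))
print(declined(canned_quiz))
```

A substring check like this is deliberately loose: it only looks for the start of the decline message, so small changes to the rest of the wording will not break the eval.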

## Try to fix the prompt
Try updating the prompt so that it handles the case where the user asks about a subject the application has no information about.

**Note: Any code you write will be saved to a public GitHub repository. If you want to use a private repository, you will need to create your own GitHub and CircleCI API keys.**

# Running evaluations in a CircleCI pipeline

Now that you have a set of evaluations, we'll show you how to automate running them in CircleCI.

For our first round of evaluations we'll focus on adding basic checks to make sure our assistant is being set up properly and producing valid results.

From there, we'll add more rigorous evals that we run prior to release and finally evals that we want to run periodically to smoke test the entire application.

## Notes
* For this notebook, we are using the GitHub API to commit code. In your normal workflow you would use the `git` or `gh` command line tools or a GitHub GUI application.
* As a reminder, any code you push to GitHub will be publicly visible.
* We've updated the application prompt to decline to generate quizzes for topics it has no information about.

In [10]:
%%writefile app.py
from langchain.prompts                import ChatPromptTemplate
from langchain.chat_models            import ChatOpenAI
from langchain.schema.output_parser   import StrOutputParser

delimiter = "####"

quiz_information_bank = """1. Subject: Leonardo DaVinci
   Categories: Art, Science
   Facts:
    - Painted the Mona Lisa
    - Studied zoology, anatomy, geology, optics
    - Designed a flying machine
  
2. Subject: Paris
   Categories: Art, Geography
   Facts:
    - Location of the Louvre, the museum where the Mona Lisa is displayed
    - Capital of France
    - Most populous city in France
    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie

3. Subject: Telescopes
   Category: Science
   Facts:
    - Device to observe different objects
    - The first refracting telescopes were invented in the Netherlands in the 17th Century
    - The James Webb space telescope is the largest telescope in space. It uses a gold-beryllium mirror

4. Subject: Starry Night
   Category: Art
   Facts:
    - Painted by Vincent van Gogh in 1889
    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence

5. Subject: Physics
   Category: Science
   Facts:
    - The sun doesn't change color during sunset.
    - Water slows the speed of light
    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.
"""

system_message = f"""
Follow these steps to generate a customized quiz for the user.
The question will be delimited with four hashtags i.e {delimiter}

Step 1:{delimiter} First identify the category user is asking about from the following list:
* Geography
* Science
* Art

Step 2:{delimiter} Determine the subjects to generate questions about. The list of topics are below:

{quiz_information_bank}

Pick up to two subjects that fit the user's category.

Step 3:{delimiter} Generate a quiz for the user. Based on the selected subjects generate 3 questions for the user using the facts about the subject.
Only reference facts in the included list of topics.
Use the following format:
Question 1:{delimiter} <question 1>

Question 2:{delimiter} <question 2>

Question 3:{delimiter} <question 3>

If the user asks about a subject you do not have information about, tell them "I'm sorry, but I do not have information on that topic."
"""

def assistant_chain():
  human_template  = "{question}"

  chat_prompt = ChatPromptTemplate.from_messages([
      ("system", system_message),
      ("human", human_template),
  ])
  return chat_prompt | ChatOpenAI(model="gpt-3.5-turbo", temperature=0) | StrOutputParser()

Overwriting app.py


In [11]:
%%writefile test_assistant.py
from app import assistant_chain
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

def test_science_quiz():
  assistant = assistant_chain()
  question  = "Generate a quiz about science."
  answer = assistant.invoke({"question": question})
  expected_subjects = ["davinci", "telescope", "physics", "curie"]
  print(answer)
  assert any(subject.lower() in answer.lower() for subject in expected_subjects), f"Expected the assistant questions to include '{expected_subjects}', but it did not"

def test_geography_quiz():
  assistant = assistant_chain()
  question  = "Generate a quiz about geography."
  answer = assistant.invoke({"question": question})
  expected_subjects = ["paris", "france", "louvre"]
  print(answer)
  assert any(subject.lower() in answer.lower() for subject in expected_subjects), f"Expected the assistant questions to include '{expected_subjects}', but it did not"

def test_decline_unknown_subjects():
  assistant = assistant_chain()
  question  = "Help me create a quiz about Rome"
  answer = assistant.invoke({"question": question})
  print(answer)
  # Look for a substring of the message the bot prints when it declines a question.
  decline_response = "I'm sorry"
  assert decline_response.lower() in answer.lower(), f"Expected the bot to decline with '{decline_response}' got {answer}"

Overwriting test_assistant.py


# The CircleCI config file
Now let's set up our tests to run automatically in CircleCI.

For this course, we've created a working CircleCI config file. Let's take a look at the configuration.

In the config we will define a **workflow** that describes the steps to build, test, and deploy our application. The workflow consists of **jobs** that run each step of our process.

In this config, we only have one workflow and one job, which conditionally runs tests based on passed-in parameters.
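
The config printed below is cut off before the job definition, so the exact steps are not visible here. As a rough sketch of the idea, the conditional logic amounts to choosing which test files to run from the `eval-mode` parameter. The mode names `release` and `full`, the use of pytest, and the functions `select_test_files` and `pytest_command` are assumptions for illustration; only the `commit` default appears in the config.

```python
# Hypothetical mapping from the pipeline's eval-mode parameter to test files.
# Only "commit" is confirmed by the config; "release" and "full" are assumed names.
EVAL_SUITES = {
    "commit": ["test_assistant.py"],
    "release": ["test_release_evals.py"],
}
EVAL_SUITES["full"] = EVAL_SUITES["commit"] + EVAL_SUITES["release"]

def select_test_files(eval_mode="commit"):
    """Return the test files to run for `eval_mode`, or raise on an unknown mode."""
    try:
        return list(EVAL_SUITES[eval_mode])
    except KeyError:
        raise ValueError(f"unknown eval mode: {eval_mode!r}") from None

def pytest_command(eval_mode="commit"):
    """Build the pytest invocation as an argument list (it is not executed here)."""
    return ["python", "-m", "pytest", "-s", *select_test_files(eval_mode)]

print(pytest_command("full"))
```

Failing loudly on an unknown mode is a deliberate choice in this sketch: a typo in the parameter should stop the pipeline rather than silently run no evals.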

In [3]:
with open("circle_config.yml") as f:
  print(f.read())

version: 2.1
orbs:
  # The python orb contains a set of prepackaged CircleCI configuration you can use repeatedly in your configuration files.
  # Orb commands and jobs help you with common scripting around a language/tool
  # so you don't have to copy and paste it everywhere.
  # See the orb documentation here: https://circleci.com/developer/orbs/orb/circleci/python
  python: circleci/python@2.1.1

parameters:
  eval-mode:
    type: string
    default: "commit"


workflows:
  evalaute-app:  # This is the name of the workflow, feel free to change it to better match your workflow.
    # Inside the workflow, you define the jobs you want to run.
    # For more details on extending your workflow, see the configuration docs: https://circleci.com/docs/2.0/configuration-reference/#workflows
    jobs:
      - run-evals:
          context:
            - dl-ai-courses

jobs:
  # Our main job to run evals.
  # Based on parameters we will run evals on every commit, a pre-release set of evals, or al

# Run the per-commit evals
The evals we have now are quick checks that we run whenever we change our application.

Now, when we save our code to GitHub, CircleCI will run our tests.

## Steps
To run the evals we will:
1. Write our test file in the course notebook
2. Push the file to GitHub and run the CircleCI pipeline
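
The `utils` helpers used in the next cell are not shown in this notebook. As a rough idea of what triggering a pipeline involves, the sketch below builds (but does not send) a request to the CircleCI API v2 "trigger a new pipeline" endpoint. The function name `build_trigger_request`, the placeholder repository, branch, and token, and the assumption that the helper passes `eval-mode` as a pipeline parameter are all illustrative.

```python
import json
import urllib.request

def build_trigger_request(repo, branch, token, eval_mode="commit"):
    """Build (but do not send) a CircleCI API v2 request that triggers a pipeline.

    `repo` is "org/name" for a GitHub project. How utils.trigger_commit_evals
    is actually implemented is not shown, so treat this as an assumption.
    """
    url = f"https://circleci.com/api/v2/project/gh/{repo}/pipeline"
    body = {"branch": branch, "parameters": {"eval-mode": eval_mode}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Circle-Token": token, "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder values; substitute your own repository, branch, and API token.
request = build_trigger_request("my-org/my-repo", "my-branch", "dummy-token")
print(request.get_method(), request.full_url)
# To actually trigger the pipeline you would call urllib.request.urlopen(request).
```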

In [13]:
from utils import push_files, trigger_commit_evals, trigger_release_evals
push_files(course_repo, course_branch, ["app.py", "test_assistant.py"])
trigger_commit_evals(course_repo, course_branch, cci_api_key)

uploading test_assistant.py
uploading app.py
dl-cci-long-lasting-radar-7 already exists in the repository pushing updated changes
Please visit https://app.circleci.com/pipelines/github/mw-courses/cci-dl-ai-course/63


# Running pre-release evals
Now let's look at running pre-release evals.

So far, our evals are meant to catch obvious errors in our application. As the application grows, though, having a set of good pre-release tests can help catch more subtle regressions.

## Steps
To run the evals we will:
1. Write our test file in the course notebook
2. Push the file to GitHub and run the CircleCI pipeline

## A first model graded eval

Evaluating LLM output can be tricky because a "good response" to a query is subjective. We could try to write custom rules, as we did for our initial evals, to make sure expected data is in the output, but this approach gets more fragile as an application expands.
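
To illustrate that fragility, the sketch below reuses the keyword rule from the science eval and feeds it two hand-written responses. Both sample strings are hypothetical, and the function name `keyword_check` is introduced only for this example.

```python
expected_subjects = ["davinci", "telescope", "physics", "curie"]

def keyword_check(answer):
    """The rule-based check from the earlier evals: any expected subject appears."""
    return any(subject in answer.lower() for subject in expected_subjects)

# A refusal that happens to mention a keyword passes, even though it is not a quiz.
refusal = "I can't write a quiz about telescopes right now."
# A reasonable question that never uses one of the keywords fails.
valid = "Question 1:#### Who painted the Mona Lisa?"

print(keyword_check(refusal))
print(keyword_check(valid))
```

The rule accepts the refusal and rejects the valid question, which is the kind of mismatch that grows as the data set and the range of phrasings expand.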

One approach to checking the output of an LLM is to use another LLM as a grader. This is referred to as "model-graded evaluation". We'll show a quick example that checks whether our model is actually producing output in the form of a quiz.

We aren't concerned with the content just yet, just that the LLM is giving back responses that look like a set of questions.

We are including a passing and a failing case. If you want to see a passing build in CircleCI, update the failing case with the expected response before committing the file.
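
A grader model may wrap its verdict in whitespace, punctuation, or a short explanation, so an exact comparison such as `eval_response == "Y"` can be brittle. The helper below is a hypothetical sketch of a more tolerant parser; the name `parse_verdict` and the decision to accept "YES" and "NO" as well are assumptions, not part of the course code.

```python
def parse_verdict(grader_output):
    """Map a grader reply to True (quiz), False (not a quiz), or raise if unclear.

    Only the first word is inspected, so "Y", "y.", and "Y - looks like a quiz"
    all count as a yes.
    """
    words = grader_output.strip().split()
    if not words:
        raise ValueError("empty grader output")
    first = words[0].strip(".,:;!\"'").upper()
    if first in ("Y", "YES"):
        return True
    if first in ("N", "NO"):
        return False
    raise ValueError(f"unrecognized grader output: {grader_output!r}")

print(parse_verdict(" Y\n"))
print(parse_verdict("N. The response is not a quiz."))
```

Raising on anything unrecognized keeps an ambiguous grader reply from being silently counted as a pass or a fail.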

In [11]:
%%writefile test_release_evals.py
# Note: you will need to run the cell that writes app.py for these imports to work.
from app import system_message, quiz_information_bank, assistant_chain
from langchain.prompts                import ChatPromptTemplate
from langchain.chat_models            import ChatOpenAI
from langchain.schema.output_parser   import StrOutputParser

def create_eval_chain(agent_response):
  delimiter = "####"
  eval_system_prompt = f"""You are an assistant that evaluates whether or not an assistant is producing valid quizzes.
  The assistant should be producing output in the format of Question N:{delimiter} <question N>?"""
  
  eval_user_message = f"""You are evaluating a generated quiz based on the context that the assistant uses to create the quiz.
  Here is the data:
    [BEGIN DATA]
    ************
    [Response]: {agent_response}
    ************
    [END DATA]

Read the response carefully and determine if it looks like a quiz or test. Do not evaluate if the information is correct
only evaluate if the data is in the expected format.

Output Y if the response is a quiz, output N if the response does not look like a quiz.
"""
  eval_prompt = ChatPromptTemplate.from_messages([
      ("system", eval_system_prompt),
      ("human", eval_user_message),
  ])

  return eval_prompt | ChatOpenAI(model="gpt-3.5-turbo", temperature=0) | StrOutputParser()

def test_model_graded_eval():
  assistant = assistant_chain()
  quiz_request = "Write me a quiz about geography."
  result = assistant.invoke({"question": quiz_request})
  print(result)
  eval_agent = create_eval_chain(result)
  eval_response = eval_agent.invoke({})
  assert eval_response == "Y"

def test_model_graded_eval_should_fail():
  # In this test we are using output that will fail the evaluation.
  # This is a good way to check your evaluator is behaving as expected
  known_bad_result = "There are lots of interesting facts. Tell me more about what you'd like to know"
  print(known_bad_result)
  eval_agent = create_eval_chain(known_bad_result)
  eval_response = eval_agent.invoke({})
  assert eval_response == "Y", f"expected failure, asserted the response should be 'Y', got back '{eval_response}'"


In [16]:
from utils import push_files, trigger_release_evals
push_files(course_repo, course_branch, ["test_release_evals.py"])
trigger_release_evals(course_repo, course_branch, cci_api_key)

uploading test_release_evals.py
dl-cci-long-lasting-radar-7 already exists in the repository pushing updated changes
Please visit https://app.circleci.com/pipelines/github/mw-courses/cci-dl-ai-course/65


# Pulling it together: Running all of our evaluations
Finally, we can run the full set of commit and pre-release evals.

You may want to do this to debug the full application or as a periodic check.

## Steps
To run the evals we will:
1. Run our pipeline in CircleCI passing in a parameter to run all evaluations.

In [17]:
from utils import trigger_full_evals
trigger_full_evals(course_repo, course_branch, cci_api_key)

Please visit https://app.circleci.com/pipelines/github/mw-courses/cci-dl-ai-course/67
