# Rule-based evaluation of LLMs

Suppose we have an LLM-based system that we want to evaluate. One option is rule-based evaluation: we write explicit checks in code and run them against the system's output.

For example, imagine we have a system that summarises news articles. A simple rule is to check whether the summary contains key words that appear in the original article.

Suppose we have an article about generative AI:

 > Policy Lab has experimented with Artificial Intelligence (AI) in policy development with teams across government, and beyond, for a number of years. In 2019 we worked with the Department for Transport’s data science team to consider the role that AI could play in improving the efficiency and effectiveness of the policy consultation process. In 2022 we used AI to create a vision for the future of Hounslow with the local authority. In 2023, we commissioned the creation of the Ecological Intelligence Agency, a speculative artefact to help experience the role AI might have in future decision-making in environmental policy. 
 >
 >This blog explains how Policy Lab used generative AI in policy relating to the future of the subsurface. Broadly speaking, generative AI can be understood as systems that create new data, which could be new code, text, images, video or other forms of data. We used generative AI to visualise evidence, bring to life policy ideas, and create this blog, written iteratively with a system trained on Policy Lab’s publicly available content. 


In [1]:
article = ("Policy Lab has experimented with Artificial Intelligence (AI) in policy development with teams across government, "
           "and beyond, for a number of years. In 2019 we worked with the Department for Transport’s data science team to consider "
           "the role that AI could play in improving the efficiency and effectiveness of the policy consultation process. "
           "In 2022 we used AI to create a vision for the future of Hounslow with the local authority. "
           "In 2023, we commissioned the creation of the Ecological Intelligence Agency, a speculative artefact to help experience "
           "the role AI might have in future decision-making in environmental policy. "
           "\n\n"
           "This blog explains how Policy Lab used generative AI in policy relating to the future of the subsurface. "
           "Broadly speaking, generative AI can be understood as systems that create new data, which could be new code, text, images, "
           "video or other forms of data. We used generative AI to visualise evidence, bring to life policy ideas, and create this blog, "
           "written iteratively with a system trained on Policy Lab’s publicly available content.")

In [2]:
from common import get_response

In [None]:
instruction_1 = "Give me a short summary of the following article:"

prompt_1 = f"{instruction_1}\n\n{article}"

response_1 = get_response(prompt_1)

print(response_1)

## Check if the response contains the expected words

We expect the response to contain the following words: *__"Policy Lab"__*, *__"AI"__*, *__"generative AI"__*, *__"government"__*

In [None]:
expected_words = ["Policy Lab", "AI", "generative AI", "government"]

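def check_expected_words(response, expected_words):
    """Assert that every expected word appears in the response (case-insensitive substring match)."""
    response_lower = response.lower()
    for word in expected_words:
        assert word.lower() in response_lower, f"The word '{word}' is not in the response"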

check_expected_words(response_1, expected_words)
print("All expected words are in the response")
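One caveat with substring matching: a short term such as "AI" also matches inside unrelated words ("explains", "trained"), so the check can pass even when the term is absent. Below is a minimal sketch of a stricter variant, assuming whole-word matching is what we want. `find_missing_terms` is a hypothetical helper, not part of the notebook's `common` module.

```python
import re

def find_missing_terms(response, expected_terms):
    """Return the expected terms that do not appear as whole words or phrases."""
    missing = []
    for term in expected_terms:
        # (?<!\w) and (?!\w) reject matches that are glued to other word characters
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if not re.search(pattern, response, flags=re.IGNORECASE):
            missing.append(term)
    return missing

sample = "This blog explains how a system was trained on public content."
print(find_missing_terms(sample, ["AI", "Policy Lab"]))
```

On this sample the plain substring check would accept "AI" because of "explains", whereas the sketch reports both terms as missing. Returning a list rather than asserting on the first miss also shows every missing term at once.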

## What if we want to make sure the response does not contain lists?

Imagine we want the response to be short: a couple of sentences at most, with no lists.

We need a new prompt and a new evaluation function.

In [None]:
instruction_2 = ("Give me a short summary of the following article, "
                "maximum a couple of sentences with no lists or quotes:")

prompt_2 = f"{instruction_2}\n\n{article}"

response_2 = get_response(prompt_2)

print(response_2)

In [None]:
check_expected_words(response_2, expected_words)
print("All expected words are in the response")

In [None]:
def no_lists_in_response(response):
    """Assert that the response has no numbered list, using the marker "1. " as a simple heuristic."""
    assert "1. " not in response, "The response contains a list"

no_lists_in_response(response_2)
print("The response contains no lists")
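Checking for the literal string "1. " is a rough heuristic: it misses bullet lists, and it fires on ordinary sentences that happen to end in a 1, such as "...in 2021. The...". Below is a hedged sketch of a line-based check, assuming a list item is a line that starts with a bullet character or with a number followed by "." or ")". The name `contains_list` and the pattern are illustrative choices.

```python
import re

# A line that starts (after optional indentation) with "-", "*" or "•",
# or with digits followed by "." or ")", and then whitespace.
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", flags=re.MULTILINE)

def contains_list(response):
    """Return True if any line of the response looks like a list item."""
    return LIST_ITEM.search(response) is not None

print(contains_list("Policy Lab began this work in 2021. It continues today."))
print(contains_list("Key points:\n1. Policy Lab used AI\n2. It wrote a blog"))
```

The first call returns `False` because "2021. " is not at the start of a line. The second returns `True`.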

In [None]:
# response_1 was generated without the "no lists" instruction, so this check may fail
no_lists_in_response(response_1)
print("The response contains no lists")
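The second prompt also asks for "maximum a couple of sentences", which the checks so far do not verify. Below is a rough sketch of such a rule, assuming sentences end in ".", "!" or "?" followed by whitespace. The splitter is naive (abbreviations such as "e.g." are over-counted), and the limit of 2 is an assumed reading of "a couple".

```python
import re

def count_sentences(text):
    """Naively count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])

def check_max_sentences(response, max_sentences=2):
    """Assert that the response has at most max_sentences sentences."""
    n = count_sentences(response)
    assert n <= max_sentences, f"The response has {n} sentences, expected at most {max_sentences}"

check_max_sentences("Policy Lab used generative AI. It helped write this blog.")
print("The response is short enough")
```

In the notebook this would be called as `check_max_sentences(response_2)`.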