# Evaluating QA System Correctness on Ollama-Provided LLMs Using LangSmith

Evaluating a question-answering (Q&A) system can help you improve its system design as well as its prompt and model quality. We tend to improve what we can measure, so checking for accuracy is a top priority. One challenge in measuring accuracy is that the responses are unstructured text. A Q&A system can generate lengthy responses, making traditional metrics like BLEU or ROUGE unreliable. For this scenario, a well-labeled dataset and LLM-assisted evaluators can help you grade your system's response quality. This complements human review and any other measurements you have already implemented.

In this walkthrough, we will use LangSmith to check the correctness of a Q&A system against an example dataset. The main steps are:

1. Run two LLMs locally.
2. Create a dataset of questions and answers.
3. Define your question-answering system.
4. Run evaluation using LangSmith.
5. Iterate to improve the system.

###### Inspired by this [documentation](https://docs.smith.langchain.com/old/cookbook/testing-examples/qa-correctness)

## Prerequisites


1. To install Ollama, see this [link](https://github.com/ollama/ollama?tab=readme-ov-file).
2. The full Ollama API docs can be found [here](https://github.com/ollama/ollama/blob/main/docs/api.md).
3. The getting-started guide for [LangSmith](https://docs.smith.langchain.com/).

We will use some basic prompting techniques and simple queries to evaluate our LLM. 

You can further tune the prompts and model hyperparameters. Here is a request that sets the full list of hyperparameters for Llama3 from Ollama:
```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {
    "num_keep": 5,
    "seed": 42,
    "num_predict": 100,
    "top_k": 20,
    "top_p": 0.9,
    "tfs_z": 0.5,
    "typical_p": 0.7,
    "repeat_last_n": 33,
    "temperature": 0.8,
    "repeat_penalty": 1.2,
    "presence_penalty": 1.5,
    "frequency_penalty": 1.0,
    "mirostat": 1,
    "mirostat_tau": 0.8,
    "mirostat_eta": 0.6,
    "penalize_newline": true,
    "stop": ["\n", "user:"],
    "numa": false,
    "num_ctx": 1024,
    "num_batch": 2,
    "num_gpu": 1,
    "main_gpu": 0,
    "low_vram": false,
    "f16_kv": true,
    "vocab_only": false,
    "use_mmap": true,
    "use_mlock": false,
    "num_thread": 8
  }
}'
```
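The same request body can be assembled in Python. The sketch below is only an illustration: `build_generate_payload` and `OLLAMA_GENERATE_URL` are hypothetical names introduced here, and only a handful of the options from the request above are set. It assumes, as in the curl example, that sampling parameters are nested under `"options"`.

```python
import json

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_payload(model, prompt, stream=False, **options):
    """Build the JSON body for /api/generate; sampling options go under "options"."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    if options:
        payload["options"] = dict(options)
    return payload


payload = build_generate_payload(
    "llama3", "Why is the sky blue?", temperature=0.8, top_k=20, top_p=0.9, seed=42
)
body = json.dumps(payload)  # string to send as the request body

# To actually send it (requires a running Ollama server and the requests package):
# import requests
# reply = requests.post(OLLAMA_GENERATE_URL, data=body).json()["response"]
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the exact options before a run.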

Optionally, you can install Ollama-UI from this [git repo](https://github.com/ollama-ui/ollama-ui) to experiment with a Q&A application backed by Ollama models.

```
git clone https://github.com/ollama-ui/ollama-ui
cd ollama-ui
make

open http://localhost:8000 # in browser

```

### Check if the Llama3 model is running on the local machine

In [4]:
import requests
import json

# Default local Ollama endpoint for text generation
url = 'http://localhost:11434/api/generate'
data = {"model": "llama3", "prompt": "Why is the sky blue?"}
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
}
x = requests.post(url, headers=headers, data=json.dumps(data))
# 200 means the model answered; 404 means the model was not found
print(x.status_code)
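When `"stream"` is not set to `false`, the generate endpoint replies with newline-delimited JSON chunks rather than a single object, so the body of `x` cannot be parsed with one `json.loads` call. A minimal sketch of stitching the chunks together follows. The helper name `join_stream_chunks` is hypothetical, and the sample lines are hand-written stand-ins, not real model output.

```python
import json


def join_stream_chunks(lines):
    """Concatenate the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk carries "done": true
    return "".join(parts)


# Hand-written sample chunks (not real model output), shaped like the stream
sample_lines = [
    '{"model": "llama3", "response": "The sky ", "done": false}',
    '{"model": "llama3", "response": "is blue.", "done": false}',
    '{"model": "llama3", "response": "", "done": true}',
]
text = join_stream_chunks(sample_lines)
```

With the request above, the full answer could then be recovered with `join_stream_chunks(x.text.splitlines())`.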

### OR
#### Test with a Single-Shot Prompt

In [17]:
from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate

# The stop token ends generation at the end of the assistant turn
llm = Ollama(model="llama3", stop=["<|eot_id|>"])

def get_model_response(user_prompt, system_prompt):
    # NOTE: No f string and no whitespace in curly braces
    template = """
        <|begin_of_text|>
        <|start_header_id|>system<|end_header_id|>
        {system_prompt}
        <|eot_id|>
        <|start_header_id|>user<|end_header_id|>
        {user_prompt}
        <|eot_id|>
        <|start_header_id|>assistant<|end_header_id|>
        """

    # Build a prompt template with the two placeholders
    prompt = PromptTemplate(
        input_variables=["system_prompt", "user_prompt"],
        template=template
    )

    # Fill in the template and send the resulting prompt to the model
    response = llm.invoke(prompt.format(system_prompt=system_prompt, user_prompt=user_prompt))
    
    return response


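# Pass by keyword: the function signature is (user_prompt, system_prompt)
response = get_model_response(
    user_prompt="Can you give me a recipe of blueberry muffins?",
    system_prompt="You are an expert at baking.",
)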
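To see exactly what text reaches the model, it can help to render the template without LangChain. The sketch below is an illustration under two assumptions: that the default `PromptTemplate` format uses the same `{name}` placeholder syntax as `str.format`, and that stripping the source-code indentation with `textwrap.dedent` is acceptable for your prompt. `LLAMA3_TEMPLATE` and `render_prompt` are hypothetical names.

```python
from textwrap import dedent

# Same template as above, with the source-code indentation stripped
LLAMA3_TEMPLATE = dedent("""\
    <|begin_of_text|>
    <|start_header_id|>system<|end_header_id|>
    {system_prompt}
    <|eot_id|>
    <|start_header_id|>user<|end_header_id|>
    {user_prompt}
    <|eot_id|>
    <|start_header_id|>assistant<|end_header_id|>
    """)


def render_prompt(system_prompt, user_prompt):
    """Fill the two placeholders using plain str.format."""
    return LLAMA3_TEMPLATE.format(system_prompt=system_prompt, user_prompt=user_prompt)


rendered = render_prompt(
    "You are an expert at baking.",
    "Can you give me a recipe of blueberry muffins?",
)
```

Printing `rendered` shows the system turn, the user turn, and the open assistant header in the order the model will read them.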

In [18]:
response

"Blueberry muffins, a classic favorite! As an expert in the kitchen, I'd be delighted to share with you my tried-and-true recipe for the most scrumptious blueberry muffins. Here's what you'll need:\n\nIngredients:\n\n* 2 1/4 cups all-purpose flour\n* 1 cup granulated sugar\n* 2 teaspoons baking powder\n* 1 teaspoon salt\n* 1/2 cup unsalted butter, melted\n* 1 large egg\n* 1 cup plain yogurt (low-fat or nonfat is fine)\n* 2 teaspoons vanilla extract\n* 2 cups fresh or frozen blueberries\n* Confectioners' sugar for topping (optional)\n\nInstructions:\n\n1. Preheat your oven to 375°F (190°C). Line a 12-cup muffin tin with paper liners.\n2. In a medium bowl, whisk together flour, sugar, baking powder, and salt.\n3. In a large bowl, whisk together melted butter, egg, yogurt, and vanilla extract.\n4. Add the dry ingredients to the wet ingredients and stir until just combined. Do not overmix!\n5. Gently fold in those lovely blueberries.\n6. Divide the batter evenly among the muffin cups.\n7. 

If you receive a 404 response, run the following command in the terminal.
```
ollama run llama3
```

Rerun previous cells.
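Before rerunning, you may want to confirm which models Ollama has available locally. The API docs linked in the prerequisites describe a `/api/tags` endpoint that lists local models. The sketch below only shows how such a listing could be checked: `model_is_available` is a hypothetical helper, and `sample_tags` is a hand-written stand-in for the response, not real output.

```python
def model_is_available(tags_response, model):
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [entry.get("name", "") for entry in tags_response.get("models", [])]
    # Models are listed with a tag suffix such as "llama3:latest"
    return any(name == model or name.split(":")[0] == model for name in names)


# Hand-written sample shaped like the /api/tags response (not real output)
sample_tags = {"models": [{"name": "llama3:latest"}, {"name": "llama2:latest"}]}
llama3_ready = model_is_available(sample_tags, "llama3")
mistral_ready = model_is_available(sample_tags, "mistral")

# Against a live server (requires the requests package):
# import requests
# tags = requests.get("http://localhost:11434/api/tags").json()
```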

## LangSmith Evaluation of Llama Models

For the example below, you will need both Llama2 and Llama3. A grader (or evaluator) model scores the responses from a test model.
We will use Llama2 as the test model and Llama3 as the grader model.
You will need to ensure the following:
##### a. llama2 is running locally.
If not, run the following command in the terminal:
```
ollama run llama2
```
##### b. You have a [LangSmith account and API key](https://docs.smith.langchain.com/how_to_guides/setup/create_account_api_key).

### Create a Dataset

[LangSmith documentation](https://docs.smith.langchain.com/how_to_guides/datasets/manage_datasets_in_application)


In [39]:
# Load API key from secrets.json

import os
import json

os.environ["LANGCHAIN_TRACING_V2"] = "true"


def get_secrets():
    with open('secrets.json') as secrets_file:
        secrets = json.load(secrets_file)

    return secrets


if __name__ == "__main__":
    secrets = get_secrets()
    # The LangSmith client reads the key from this environment variable
    os.environ["LANGCHAIN_API_KEY"] = secrets.get("LANGCHAIN_API_KEY")
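If `secrets.json` lacks the key, `secrets.get(...)` returns `None` and assigning `None` into `os.environ` raises a `TypeError` that does not say what went wrong. One possible guard is sketched below. `resolve_api_key` is a hypothetical helper, and the key values are placeholders for illustration only.

```python
def resolve_api_key(secrets, environ, name="LANGCHAIN_API_KEY"):
    """Prefer an already-set environment variable, then fall back to the secrets dict."""
    key = environ.get(name) or secrets.get(name)
    if not key:
        raise KeyError(f"{name} not found in the environment or in secrets.json")
    return key


# Placeholder values, for illustration only
demo_key = resolve_api_key({"LANGCHAIN_API_KEY": "from-file"}, {})
env_wins = resolve_api_key(
    {"LANGCHAIN_API_KEY": "from-file"}, {"LANGCHAIN_API_KEY": "from-env"}
)
```

In the cell above this could be used as `os.environ["LANGCHAIN_API_KEY"] = resolve_api_key(get_secrets(), os.environ)`.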


In [None]:
from langsmith import Client

client = Client()

# Define dataset: these are your test cases
dataset_name = "QA Example Dataset"
dataset = client.create_dataset(dataset_name)
client.create_examples(
    inputs=[
        {"question": "What is LangChain?"},
        {"question": "What is LangSmith?"},
        {"question": "What is OpenAI?"},
        {"question": "What is Google?"},
        {"question": "What is Mistral?"},
    ],
    outputs=[
        {"answer": "A framework for building LLM applications"},
        {"answer": "A platform for observing and evaluating LLM applications"},
        {"answer": "A company that creates Large Language Models"},
        {"answer": "A technology company known for search"},
        {"answer": "A company that creates Large Language Models"},
    ],
    dataset_id=dataset.id,
)
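`create_examples` takes two parallel lists, so a question and its answer are matched only by position. Keeping each pair together and splitting it just before upload is one way to avoid misaligned entries. The helper `split_examples` below is a hypothetical sketch, not part of the LangSmith client.

```python
def split_examples(pairs):
    """Turn (question, answer) pairs into the parallel lists used by create_examples."""
    inputs, outputs = [], []
    for question, answer in pairs:
        if not question.strip() or not answer.strip():
            raise ValueError(f"Empty question or answer in pair: {(question, answer)!r}")
        inputs.append({"question": question})
        outputs.append({"answer": answer})
    return inputs, outputs


qa_pairs = [
    ("What is LangChain?", "A framework for building LLM applications"),
    ("What is LangSmith?", "A platform for observing and evaluating LLM applications"),
]
inputs, outputs = split_examples(qa_pairs)
```

The two lists can then be passed as `inputs=inputs, outputs=outputs` in the call above.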

### Define metrics
After creating our dataset, we can now define some metrics to evaluate our responses on. Since we have an expected answer, we can compare to that as part of our evaluation. However, we do not expect our application to output those exact answers, but rather something that is similar. This makes our evaluation a little trickier.

In addition to evaluating correctness, let's also make sure our answers are short and concise. This will be a little easier - we can define a simple Python function to measure the length of the response.

Let's go ahead and define these two metrics.

For the first, we will use an LLM to judge whether the output is correct (with respect to the expected output). This LLM-as-a-judge approach is relatively common for cases that are too complex to measure with a simple function. We can define our own prompt and LLM to use for evaluation here:

In [3]:
from langchain_core.prompts.prompt import PromptTemplate
from langsmith.evaluation import LangChainStringEvaluator
from langchain_community.llms import Ollama
from langsmith import traceable

_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
Respond with CORRECT or INCORRECT:
Grade:
"""

PROMPT = PromptTemplate(
    input_variables=["query", "answer", "result"], template=_PROMPT_TEMPLATE
)

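# Llama3 acts as the grader; the stop token ends generation after its verdict
eval_llm = Ollama(model="llama3", stop=["<|eot_id|>"], temperature=0.4, top_k=3, top_p=0.9)
# The "qa" evaluator grades each prediction against the reference answer
qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm, "prompt": PROMPT})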
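The prompt asks the grader to reply with CORRECT or INCORRECT, and that verdict has to become a numeric score. The sketch below illustrates the idea only. It is not the evaluator's actual implementation, and `parse_grade` is a hypothetical helper.

```python
def parse_grade(verdict):
    """Map a grader reply to 1 (CORRECT), 0 (INCORRECT), or None if unclear."""
    words = verdict.upper().replace(":", " ").replace(".", " ").split()
    # Compare whole words so that INCORRECT is never mistaken for CORRECT
    if "INCORRECT" in words:
        return 0
    if "CORRECT" in words:
        return 1
    return None


grades = [parse_grade(reply) for reply in ["CORRECT", "Grade: INCORRECT", "I am not sure"]]
```

Returning `None` for an unclear reply keeps a malformed verdict from being silently counted as a failure.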

[OPTIONAL] Evaluating the length of the response is a lot easier: we can define a simple function that checks whether the actual output is less than 2x the length of the expected result.

In [4]:
# from langsmith.schemas import Run, Example

# def evaluate_length(run: Run, example: Example) -> dict:
#     prediction = run.outputs.get("output") or ""
#     required = example.outputs.get("answer") or ""
#     score = int(len(prediction) < 2 * len(required))
#     return {"key":"length", "score": score}
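The core of the length check can be tried on plain strings before wiring it to LangSmith's `Run` and `Example` objects. The sketch below assumes the same rule as the commented-out function (shorter than twice the reference length). `length_score` and the sample answers are hypothetical.

```python
def length_score(prediction, reference, max_ratio=2):
    """Return 1 if the prediction is shorter than max_ratio times the reference, else 0."""
    return int(len(prediction) < max_ratio * len(reference))


reference = "A framework for building LLM applications"
short_answer = "A framework for building apps with LLMs."
long_answer = short_answer * 3  # deliberately over the limit

short_ok = length_score(short_answer, reference)
long_ok = length_score(long_answer, reference)
```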

### Run Evaluations
Great! So now how do we run evaluations? Now that we have a dataset and evaluators, all that we need is our application! We will build a simple application that just has a system message with instructions on how to respond and then passes it to the LLM. We will build this using LangChain's `Ollama` wrapper, with Llama2 as the test model:

In [5]:
from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate

# Llama2 is the model under test
llm = Ollama(model="llama2", stop=["<|eot_id|>"])

def get_model_response(user_prompt, system_prompt):
    # NOTE: No f string and no whitespace in curly braces
    template = """
        <|begin_of_text|>
        <|start_header_id|>system<|end_header_id|>
        {system_prompt}
        <|eot_id|>
        <|start_header_id|>user<|end_header_id|>
        {user_prompt}
        <|eot_id|>
        <|start_header_id|>assistant<|end_header_id|>
        """

    # Build a prompt template with the two placeholders
    prompt = PromptTemplate(
        input_variables=["system_prompt", "user_prompt"],
        template=template
    )

    # Fill in the template and send the resulting prompt to the model
    response = llm.invoke(prompt.format(system_prompt=system_prompt, user_prompt=user_prompt))
    
    return response


Before running this through LangSmith evaluations, we need to define a simple wrapper that maps the input keys from our dataset to the function we want to call, and then also maps the output of the function to the output key we expect.

In [6]:
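def langsmith_app(inputs):
    system_prompt = "Respond to the user's question in a short, concise manner (one short sentence)."
    # Pass by keyword: get_model_response takes (user_prompt, system_prompt)
    output = get_model_response(user_prompt=inputs["question"], system_prompt=system_prompt)
    return {"output": output}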
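The key mapping can be checked without calling a model by injecting the response function. The sketch below is one possible arrangement, not the notebook's own design: `make_langsmith_app` and `echo_model` are hypothetical names, and `echo_model` is a stand-in that only echoes its arguments.

```python
def make_langsmith_app(respond, system_prompt):
    """Wrap `respond(user_prompt, system_prompt)` so it accepts and returns dicts."""
    def app(inputs):
        answer = respond(user_prompt=inputs["question"], system_prompt=system_prompt)
        return {"output": answer}
    return app


# Stand-in for the real model call, used only to exercise the key mapping
def echo_model(user_prompt, system_prompt):
    return f"[{system_prompt}] {user_prompt}"


demo_app = make_langsmith_app(echo_model, "Be brief.")
demo_result = demo_app({"question": "What is LangChain?"})
```

Passing `get_model_response` instead of `echo_model` would produce a wrapper equivalent to `langsmith_app`.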

In [40]:
from langsmith.evaluation import evaluate

experiment_results = evaluate(
    langsmith_app, # Your AI system
    data=dataset_name, # The data to predict and grade over
    evaluators=[qa_evaluator], # The evaluators to score the results
    experiment_prefix="Llama-2-local-correctness", # A prefix for your experiment names to easily identify them
)

View the evaluation results for experiment: 'Llama-2-local-correctness-38267b4d' at:
https://smith.langchain.com/o/103e639e-1fea-5efb-81b6-6b537ff4132d/datasets/e5a1a05f-e750-4972-ae35-0b707d2be74d/compare?selectedSessions=2d020155-f629-48cf-9b8f-d40d1ff5fa8b


In [38]:
experiment_results._results[0].keys()

dict_keys(['run', 'example', 'evaluation_results'])
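Each result row bundles the run, the dataset example, and the evaluator feedback, so a summary score can be computed by walking the rows. The sketch below rests on assumptions that should be checked against your installed LangSmith version: that `row["evaluation_results"]["results"]` is a list of objects with `key` and `score` attributes, and that the correctness evaluator reports under the key `"correctness"`. `mean_score` is a hypothetical helper, and `fake_rows` is made-up data that mimics this assumed structure.

```python
from types import SimpleNamespace


def mean_score(rows, key):
    """Average the numeric scores recorded under `key` across result rows."""
    scores = [
        result.score
        for row in rows
        for result in row["evaluation_results"]["results"]
        if result.key == key and result.score is not None
    ]
    return sum(scores) / len(scores) if scores else None


# Made-up rows that mimic the assumed structure (not real experiment output)
fake_rows = [
    {"evaluation_results": {"results": [SimpleNamespace(key="correctness", score=1)]}},
    {"evaluation_results": {"results": [SimpleNamespace(key="correctness", score=0)]}},
]
demo_mean = mean_score(fake_rows, "correctness")
```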