Evaluating Retrieval

The Judge UI:

We custom-built a Judge UI, with FastAPI as the backend and Vue.js as the frontend, for searching different asset types via semantic or keyword search. It is essentially a thin layer of wiring that makes simple REST calls to the router. Through SQLAlchemy, the FastAPI backend writes to a small SQLite database that stores the feedback we generate by voting on results.
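
To make the storage side concrete, here is a minimal sketch of what a feedback table could look like. It uses the standard-library `sqlite3` module as a stand-in for SQLAlchemy and an in-memory database as a stand-in for the real database file. The column names are the ones queried in the analysis cells of this notebook (`username`, `query`, `search_type`, `asset_type`, `chunk_id`, `vote_value`); the column types, the 0/1 vote encoding, the `record_vote` helper, and the sample values are assumptions for illustration, not the Judge UI's actual schema.

```python
import sqlite3

# Hypothetical schema: column names mirror those queried in this notebook,
# but the types and constraints are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS feedback (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT NOT NULL,
    query TEXT NOT NULL,
    search_type TEXT NOT NULL,
    asset_type TEXT NOT NULL,
    chunk_id TEXT NOT NULL,
    vote_value INTEGER NOT NULL CHECK (vote_value IN (0, 1))
)
"""

def record_vote(con, username, query, search_type, asset_type, chunk_id, thumbs_up):
    """Store one thumbs-up (1) or thumbs-down (0) vote."""
    con.execute(
        "INSERT INTO feedback "
        "(username, query, search_type, asset_type, chunk_id, vote_value) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (username, query, search_type, asset_type, chunk_id, int(thumbs_up)),
    )
    con.commit()

con = sqlite3.connect(":memory:")  # in-memory stand-in for the real database file
con.execute(SCHEMA)

# Made-up votes, one per search mode.
record_vote(con, "alice", "H200", "semantic", "techblog_summary", "chunk-001", True)
record_vote(con, "alice", "*H200", "keyword", "techblog_summary", "chunk-002", False)

rows = con.execute(
    "SELECT query, search_type, vote_value FROM feedback ORDER BY id"
).fetchall()
print(rows)
```

Storing one row per vote keeps the write path trivial and leaves all aggregation (precision, recall) to the analysis side.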

In [None]:
%%js
// hostname excludes any port, so appending ':5007' always yields a valid URL
var host = window.location.hostname;
var url = 'http://' + host + ':5007';
element.innerHTML = '<a style="color:green;" target="_blank" href="' + url + '">Click to open the judge UI.</a>';

Voting with Judge UI

Now let's vote. First, let's evaluate how semantic search does on the query `H200`.

1. Make sure we're in Semantic Search mode.
2. Type `H200` in the search box.
3. Make sure only "TechBlog Posts Summaries" is checked in Asset Types.
4. Vote thumbs up or down on the 10 results. This populates our SQLite database.

Now let's evaluate how keyword search does on the same query.

1. Switch to Keyword Search mode.
2. Type `*H200` in the search box. **Include the asterisk so that we get wildcard matches on the closely-related GH200.**
3. Make sure only "TechBlog Posts Summaries" is checked in Asset Types.
4. Vote thumbs up or down on the 10 results. This populates our SQLite database.

In [None]:
import os
os.getcwd()

sql_db_filepath = os.path.abspath(os.path.join(os.getcwd(), "db", "sql_app.db"))
sql_db_filepath

import sqlite3
from contextlib import closing

import pandas as pd

def select_all_feedback() -> pd.DataFrame:
    # Read every row of the feedback table into a pandas DataFrame.
    # closing() releases the connection even if the query raises.
    with closing(sqlite3.connect(sql_db_filepath)) as con:
        return pd.read_sql_query("SELECT * FROM feedback", con)

df = select_all_feedback()

# Verify that result of SQL query is stored in the dataframe
print(df.head(20))

Precision and Recall

As in most data science evaluation tasks, we are interested in precision and recall. As a reminder, the definitions are:

- **Precision:** Total number of *relevant* documents retrieved / Total number of documents retrieved.
- **Recall:** Total number of relevant documents *retrieved* / Total number of relevant documents *in the database*.

For retrievers, these scores are typically calculated with the system set to return a fixed number of results, K.

You may see variations on these, such as Mean Average Precision (MAP) and F1 score, or rank-based metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). We'll focus on just Precision and Recall in this course.
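
As a small worked example of these definitions at a fixed K, the sketch below computes both metrics for one ranked result list. The document IDs and the function names `precision_at_k` and `recall_at_k` are made up for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in set(retrieved[:k]) if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical ranked results and ground-truth relevant set.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

p = precision_at_k(retrieved, relevant, k=5)  # 2 relevant out of 5 retrieved
r = recall_at_k(retrieved, relevant, k=5)     # 2 found out of 4 relevant
print(p, r)  # 0.4 0.5
```

Note that recall needs the full set of relevant documents, which is rarely known exactly for a real corpus; in practice it is estimated from the documents that have been judged.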


In [None]:
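# First, filter to the feedback we entered manually (a.k.a. human feedback).
# .copy() avoids pandas' SettingWithCopyWarning when we modify the column below.
hf = df[df['username'] != 'llmjudge'].copy()

# Next, normalize the query column: strip wildcards and lowercase everything.
# regex=False treats "*" as a literal character rather than a regex quantifier.
hf["query"] = hf["query"].str.replace("*", "", regex=False).str.lower()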

print(hf.shape)
print(hf.head())

result = hf.groupby(["query", "search_type", "asset_type"]).aggregate(precision=("vote_value", "mean")).reset_index(drop=False)
print(result)

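# Calculating Recall

# We can't know every relevant chunk in the database, so we approximate that total per query
# by counting the distinct chunks voted relevant across all search types and asset types.
totalpos = hf[hf["vote_value"] == 1].groupby("query").aggregate(totalpos=("chunk_id", "nunique")).reset_index(drop=False)
totalpos

# Attach the per-query total to every feedback row, then aggregate per search configuration.
hf = pd.merge(hf, totalpos, on=["query"])
hf_grouped = hf.groupby(["query", "search_type", "asset_type"]).aggregate(precision=("vote_value", "mean"), recall=("vote_value", "sum"), totalpos=("totalpos", "first")).reset_index(drop=False)
# Recall = relevant chunks retrieved by this configuration / total relevant chunks known for the query.
hf_grouped["recall"] = hf_grouped["recall"] / hf_grouped["totalpos"]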
hf_grouped

LLM as a judge

Human feedback is important for evaluation. However, unless you have an automated way of continually collecting user preferences, the way search engines track whether a user clicked on a search result, building up your database of feedback requires many dedicated person-hours, especially as your database of document chunks scales.

This leads to a natural follow-up question: can we ask a machine to evaluate whether a document is relevant or not? If so, we could solve the problem by just throwing compute at it.

Scoring how relevant a passage is to a query is also what a reranker does, so we are getting close to another important topic in information retrieval systems: **rerankers**.

In [None]:
import json

# load the summaries from the json file
with open("data/techblogs_summaries/saved.json", "r") as f:
    saved_summaries = json.load(f)

summary = saved_summaries['https://developer.nvidia.com/blog/create-share-and-scale-enterprise-ai-workflows-with-nvidia-ai-workbench-now-in-beta/'][0]['text']
print(summary)

In [None]:
# Use an LLM to check relevance

import asyncio

from llms import llms
llm = llms.nim_mixtral_llm

# Semaphore that allows at most 3 LLM requests in flight at once.
limit = asyncio.Semaphore(3)

async def async_generate(llm, msg):
    # Hold the semaphore for the duration of the request so concurrency stays capped.
    async with limit:
        resp = await llm.agenerate([msg])
    return resp.generations[0][0].text
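
The semaphore above is meant to cap how many requests run at once. A minimal, self-contained sketch of that pattern follows. `fake_llm_call` is a hypothetical stand-in for a real LLM request (it just sleeps), and the `state` dictionary exists only to record the peak concurrency.

```python
import asyncio

async def fake_llm_call(msg, state):
    # Hypothetical stand-in for an LLM request; tracks how many calls overlap.
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # simulate network latency
    state["active"] -= 1
    return f"response to {msg}"

async def limited_call(semaphore, msg, state):
    # At most `max_concurrency` coroutines can be inside this block at once.
    async with semaphore:
        return await fake_llm_call(msg, state)

async def run_all(messages, max_concurrency=3):
    semaphore = asyncio.Semaphore(max_concurrency)
    state = {"active": 0, "peak": 0}
    # gather() returns results in the same order as the inputs.
    results = await asyncio.gather(
        *(limited_call(semaphore, m, state) for m in messages)
    )
    return results, state["peak"]

results, peak = asyncio.run(run_all([f"msg {i}" for i in range(10)]))
print(len(results), peak)
```

Inside a Jupyter notebook an event loop is already running, so there you would write `results, peak = await run_all(...)` instead of calling `asyncio.run`.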

In [None]:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful AI bot being used by NVIDIA to determine if a passage of text is relevant to a search query. "
            + "All of the passages of text are in some way related to NVIDIA, so in order to be relevant it needs to be a strict match between the topic of the passage and the topic of the query. "
            + 'Format your output as a JSON object with a single boolean field "relevant". ',
        ),
        (
            "user",
            'Is the following passage strictly relevant to a search query for "{query}"?\nPassage: {passage}',
        ),
    ]
)



In [None]:
batch_messages = []

# truncating the list to limit to a subset of urls
# this example is just to illustrate the point
# the first 5 urls should be irrelevant to H200, the next 2 should be relevant
urls = list(saved_summaries.keys())[0:5] + ['https://developer.nvidia.com/blog/one-giant-superchip-for-llms-recommenders-and-gnns-introducing-nvidia-gh200-nvl32/', 'https://developer.nvidia.com/blog/nvidia-tensorrt-llm-enhancements-deliver-massive-large-language-model-speedups-on-nvidia-h200/']

for url in urls:
    title = saved_summaries[url][0]['document_title']
    summary = saved_summaries[url][0]['text']
    print(title)
    print(summary)
    print("-----")
    passage = title + "\n" + summary
    messages = template.format_messages(query="H200", passage=passage)
    batch_messages.append(messages)


In [None]:
response = llm.generate(batch_messages)
for gen in response.generations:
    print(gen[0].text)
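
The precision and recall cells earlier filter out rows whose `username` is `llmjudge`, which suggests the LLM's verdicts are meant to be stored as votes alongside the human ones. A minimal sketch of turning the raw text output into a vote value is below. The `verdict_to_vote` helper and the sample outputs are made up for illustration, and the 1/0 encoding is an assumption.

```python
import json

def verdict_to_vote(raw_text):
    """Return 1 (relevant), 0 (not relevant), or None if the output can't be parsed."""
    # Models often wrap the JSON in extra prose, so slice from the first '{' to the last '}'.
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        verdict = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    relevant = verdict.get("relevant") if isinstance(verdict, dict) else None
    if not isinstance(relevant, bool):
        return None  # missing field or wrong type: treat as unparseable
    return int(relevant)

# Made-up examples of what a model might return.
sample_outputs = [
    '{"relevant": true}',
    'Sure! Here is my answer:\n{"relevant": false}',
    'I am not sure.',
]
votes = [verdict_to_vote(text) for text in sample_outputs]
print(votes)  # [1, 0, None]
```

Returning `None` for unparseable output, rather than guessing a vote, keeps malformed responses from silently skewing the metrics.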

In [None]:
import json
from langchain_core.messages import AIMessage, SystemMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

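def extract_json(text):
    """Return the first balanced {...} span in `text` parsed as JSON, or None on failure.

    Braces are counted naively, so a brace inside a JSON string value can confuse the scan.
    """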
    stack = []
    start_index = None

    for i, char in enumerate(text):
        if char == '{':
            if not stack:
                start_index = i
            stack.append(char)
        elif char == '}':
            if stack:
                stack.pop()
                if not stack:
                    end_index = i + 1
                    json_str = text[start_index:end_index]
                    try:
                        json_obj = json.loads(json_str)
                        return json_obj
                    except json.JSONDecodeError:
                        print("Error: JSON decoding failed.")
                        return None
            else:
                print("Error: Unmatched '}' character.")
                return None

    print("No JSON object found in the text.")
    return None
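
As an alternative sketch, the standard library's `json.JSONDecoder.raw_decode` can parse a JSON value starting at a given index and ignore any trailing text, which sidesteps manual brace counting and copes with braces inside string values. The function name `extract_first_json_object` and the sample strings are made up for illustration.

```python
import json

def extract_first_json_object(text):
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and
            # returns (value, index just past the value).
            obj, _end = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find("{", pos + 1)
    return None

reply = 'Answer: {"note": "a } inside a string", "relevant": true} Thanks!'
parsed = extract_first_json_object(reply)
print(parsed)  # {'note': 'a } inside a string', 'relevant': True}
```

The trade-off is that a malformed object is skipped silently rather than reported, so add logging if you need to know when the model's output was not valid JSON.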

In [None]:
system_message = "You are a helpful AI bot being used in a technical domain. Format your output as a JSON object."
human_msg_pt = HumanMessagePromptTemplate.from_template(
    'First, is the following text a user question that needs answering or just a topic to learn more about? Second, if the text is a user question that needs answering, is the question asking for code to be written?\nText: {text}'
)
# The three classification categories, used as the few-shot example answers below
code_question = AIMessage(content="{\n  \"is_user_question\": true,\n  \"asks_for_code\": true\n}")
regular_question = AIMessage(content="{\n  \"is_user_question\": true,\n  \"asks_for_code\": false\n}")
not_question = AIMessage(content="{\n  \"is_user_question\": false\n}")

prompt = ChatPromptTemplate(
    messages=[
        SystemMessage(content=system_message),
        human_msg_pt.format(text="how do I install cuda drivers"),
        code_question,
        human_msg_pt.format(text="what is the right NVIDIA SDK to use for computer vision"),
        regular_question,
        human_msg_pt.format(text="recommender systems for online shopping"),
        not_question,
        human_msg_pt.format(text="How to import rapids cudf in python?"),
        code_question,
        human_msg_pt.format(text="Generate code to make a Python web server."),
        code_question,
        human_msg_pt.format(text="biomedical devices"),
        not_question,
        human_msg_pt.format(text="write some code that prints hello world"),
        code_question,
        human_msg_pt.format(text="The leading cause of death in the 16th century was infection."),
        not_question,
        human_msg_pt.format(text="NVIDIA Merlin SDK for recommendation systems"),
        not_question,
        human_msg_pt.format(text="who founded the company NVIDIA?"),
        regular_question,
        human_msg_pt,
    ]
)
chain = prompt | llm

generation = chain.invoke({"text": "what libraries should I learn in C++"})
print(extract_json(generation.content))

In [None]:
generation = chain.invoke({"text": "What is a major seventh chord?"})
print(extract_json(generation.content))

In [None]:
generation = chain.invoke({"text": "omniverse scene lighting"})
print(extract_json(generation.content))

In [None]:
generation = chain.invoke({"text": "Generate code to write a simple Python web app."})
print(extract_json(generation.content))

In [None]:
generation = chain.invoke({"text": "Deep learning techniques for obstacle avoidance in autonomous mobile robots"})
print(extract_json(generation.content))

Web App

In [None]:
%%js
// hostname excludes any port, so appending ':5000' always yields a valid URL
var host = window.location.hostname;
var url = 'http://' + host + ':5000';
element.innerHTML = '<a style="color:green;" target="_blank" href="' + url + '">Click to open the final product web app.</a>';