# README

**IMPORTANT: This notebook has cached outputs, so you can demo WITHOUT re-running it. See [xx]() for a clone of this notebook without the demo script but with the outputs for demoing.**

How to give this demo:
1. **Tee up the problem:** Developers struggle to get their SMEs to create evaluation sets - it takes weeks to months, and the resulting data is often poor quality because SMEs struggle to create diverse questions from scratch.
2. **Tee up the solution:** In partnership with Mosaic AI Research, we created the Agent Evaluation Synthetic API to address this challenge - it allows you to create a diverse, representative evaluation set based on your documents. With this data, you can immediately start iterating on quality!
3. **Tee up the demo:** Today, I'll walk you through the process of generating synthetic data, comparing the quality/cost/latency of a function-calling Agent with a vector search retrieval tool across several LLMs, and then deploying the best Agent to either production or to a web-based chat app so your SMEs can provide feedback on its quality. We'll do this with the Databricks documentation.
4. **Give the demo:** (follow talk track inline)

Quick links to get UIs w/ visuals:
* [**MLflow Evaluation UI:** see the generated questions & LLM judge quality evaluation]()
* [**MLflow Runs UI:** see metrics comparing Agent versions on quality/cost/latency]()
* [**Review App:** web-based chat app for SMEs]()

*IMPORTANT: The Review App is scaled to zero. Ask it a question ~15 minutes before your demo to let it warm up!*

FAQ:
- Q: How do I get my SMEs to review the synthetic data to ensure it's accurate?
  - A: We have the evaluation set review UI, which lets SMEs review and edit the evaluation set. If you are interested, we can enroll you in the Private Preview.
- Q: How do I tune the generated questions for my use case?
  - A: You can use the `guidelines` parameter which allows you to provide English instructions about the style of question you want, the target user persona, and more. The API will take these instructions into account when generating data.
- Q: How many questions should I generate?
  - A: We suggest at least 2 - 3 questions per document.


**If you want to modify this notebook, please clone [this copy]() that has `user_name` parameterized.**

In [0]:
%pip install -U -qqqq "git+https://github.com/mlflow/mlflow.git@master" "https://ml-team-public-read.s3.us-west-2.amazonaws.com/wheels/rag-studio/staging/databricks_agents-0.8.1.dev0-py3-none-any.whl" "https://ml-team-public-read.s3.us-west-2.amazonaws.com/wheels/managed-evals/staging/databricks_managed_evals-latest-py3-none-any.whl" databricks-vectorsearch databricks-sdk[openai] 
dbutils.library.restartPython()

In [0]:
import mlflow
from databricks.sdk import WorkspaceClient

# Get the current user's name & email so that users don't overwrite each other's outputs
w = WorkspaceClient()
user_email = w.current_user.me().user_name
user_name = user_email.split("@")[0].replace(".", "_")

experiment = mlflow.set_experiment(f"/Users/{user_email}/agents-demo-experiment")

synthetic_evaluation_set_delta_table = (
    f"agents_demo.synthetic_data.db_docs_synthetic_eval_set__{user_name}"
)

managed_eval_delta_table = (
    f"agents_demo.synthetic_data.db_docs_managed_eval_set__{user_name}"
)

uc_model_name = f"agents_demo.synthetic_data.db_docs__{user_name}"

print(f"User: {user_name}")
print()
print(f"MLflow Experiment: {experiment.name}")
print()
print(f"Synthetic Data output Delta Table: {synthetic_evaluation_set_delta_table}")
print(f"Managed Evaluation Set Delta Table: {managed_eval_delta_table}")
print(f"Unity Catalog Model: {uc_model_name}")
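
The naming scheme in the cell above can be tried outside a workspace. The following is a minimal sketch that assumes a hypothetical email address in place of `w.current_user.me().user_name`; `derive_user_name` is an illustrative helper, not part of the Databricks SDK.

```python
def derive_user_name(user_email: str) -> str:
    """Turn an email address into a suffix that is safe to use in table names."""
    # Keep the part before "@" and replace dots, mirroring the settings cell.
    return user_email.split("@")[0].replace(".", "_")


# Hypothetical email address, used only for illustration.
example_email = "jane.doe@example.com"
example_user_name = derive_user_name(example_email)

example_table = (
    f"agents_demo.synthetic_data.db_docs_synthetic_eval_set__{example_user_name}"
)
print(example_table)
```

Because every table, experiment, and model name carries this suffix, two presenters can run the notebook in the same workspace without colliding.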

## Generate synthetic evaluation set

DEMO SCRIPT: Here, I'll pass my documents (Databricks docs in this demo) to the Synthetic API along with some guidance to tune the generated questions for my use case. The API will generate synthetic questions & ground truth responses using our proprietary synthetic data pipeline that we built in partnership with Mosaic AI Research.

```
def generate_evals_df(
    docs: Union[pd.DataFrame, "pyspark.sql.DataFrame"], *,
    num_questions_per_doc: int = 3,
    guidelines: Optional[str] = None,
) -> pd.DataFrame:
    """
    Run the synthetic generation pipeline to generate evaluations for a given set of documents.
    Generated evaluation set can be used with Databricks Agent Evaluation (https://docs.databricks.com/en/generative-ai/agent-evaluation/evaluate-agent.html).

    :param docs: A pandas/Spark DataFrame with a string column `content` and a string `doc_uri` column.
    :param num_questions_per_doc: The number of questions to generate for each document. Default is 3.
    :param guidelines: Optional guidelines to guide the question generation.
    """
```
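
As a rough illustration of the input that `docs` expects, the sketch below checks that each record carries string `content` and `doc_uri` fields before it is turned into a DataFrame. `validate_docs` and the sample records are hypothetical and are not part of `databricks.agents.eval`.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("content", "doc_uri")


def validate_docs(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Raise ValueError if any record lacks a required string column."""
    for position, record in enumerate(records):
        for column in REQUIRED_COLUMNS:
            if not isinstance(record.get(column), str):
                raise ValueError(
                    f"Record {position} needs a string value for '{column}'"
                )
    return records


# Hypothetical documents, used only for illustration.
sample_docs = validate_docs(
    [
        {"content": "Text of the first page.", "doc_uri": "https://example.com/a"},
        {"content": "Text of the second page.", "doc_uri": "https://example.com/b"},
    ]
)
```

The validated records can then be wrapped with `pd.DataFrame(sample_docs)` to build the `docs` argument.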

In [0]:
import mlflow
from databricks.agents.eval import generate_evals_df
from pyspark.sql.functions import rand

# Get parsed documents
df = (
    spark.table("agents_demo.agents.db_docs_docs__initial")
    .orderBy(rand())
    .withColumnRenamed("doc_content", "content")
    .limit(10)  # Only use 10 documents for the demo
)

# Optional guideline

# NOTE: The guidelines are a free-form string. The markdown layout below is the suggested format, but you are free to add your own sections. These guidelines prompt-engineer the LLM that generates the synthetic data, so you may need to iterate on them before you get the results you want.
guidelines = """
# Task Description
You are generating an evaluation dataset which will be used to test a RAG chatbot on its ability to answer questions about Databricks documentation, providing support for Databricks APIs and its UI.

# Content Guidelines
- Address scenarios where data engineers are trying to understand the product capabilities
- Simulate real-world scenarios a data engineer may encounter when writing data pipelines

# Example questions
- what is in a good eval set
- saving files in uc volume
- spark sql join
- How do I add a secret?

# Style Guidelines
- Questions should be succinct, and human-like.

# Personas
- A Data Scientist using Databricks.
"""

# Generate 1 question for each document
synthetic_data = generate_evals_df(
    docs=df, guidelines=guidelines, num_questions_per_doc=1
)
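
Because the guidelines are a free-form string, one way to keep them maintainable is to assemble them from named sections. The sketch below assumes the markdown layout suggested in the cell above; `build_guidelines` is a hypothetical helper, and the section contents are abbreviated.

```python
from typing import Dict, List, Union


def build_guidelines(sections: Dict[str, Union[str, List[str]]]) -> str:
    """Render sections as markdown: a '# Title' line followed by text or bullets."""
    blocks = []
    for title, body in sections.items():
        if isinstance(body, str):
            rendered = body
        else:
            rendered = "\n".join(f"- {item}" for item in body)
        blocks.append(f"# {title}\n{rendered}")
    return "\n\n".join(blocks) + "\n"


example_guidelines = build_guidelines(
    {
        "Task Description": "You are generating an evaluation dataset for a RAG chatbot.",
        "Example questions": ["spark sql join", "How do I add a secret?"],
        "Style Guidelines": ["Questions should be succinct, and human-like."],
    }
)
print(example_guidelines)
```

The resulting string can be passed as the `guidelines` argument, which makes it easier to vary one section at a time while iterating.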

DEMO SCRIPT: Now, let's look at a few of the questions. You can see that for each document, we generated a synthetic question and the expected facts (ground truth) that the Agent must generate to answer the question correctly. We generate JUST the facts, rather than a fully written answer, since this helps the accuracy of the proprietary LLM judges we will see later. If your SMEs will review these questions, having just the facts, rather than a generated response, helps make them more efficient in their review.

*Note: Click the visualization tab to see a formatted rendering.*

In [0]:
synthetic_data_spark_df = spark.createDataFrame(synthetic_data)

# Display generated questions/ground truth
display(
    synthetic_data_spark_df.select(
        "request", "expected_facts", "expected_retrieved_context.doc_uri"
    )
)

# Write to Delta Table
synthetic_data_spark_df.write.format("delta").mode("overwrite").saveAsTable(
    synthetic_evaluation_set_delta_table
)


## Use the evaluation set to evaluate a RAG Agent

DEMO SCRIPT: Now, let's use this synthetic evaluation set to evaluate the quality of our Agent.  Here, I'll use a function-calling Agent with a vector search Retriever that I grabbed from our [AI Cookbook](https://ai-cookbook.io).  Before this call, I built a vector index of the Databricks docs using Vector Search.  Let's quickly look at the MLflow Trace to see what this Agent is doing.

In [0]:
%run ./function_calling_agent_openai_sdk


OPTIONAL DEMO SCRIPT (for advanced teams only, hide cell for others): Here, we can inspect the Agent's configuration.  To improve the agent's quality, you'll tune these parameters, along with your vector index's data pipeline and the agent's code itself.  Note that these parameters are just a starting point - Agent Framework allows you full control over your Agent's code and config which allows you to achieve production-ready quality.

In [0]:
# Pydantic classes to make configuration easier to use. Developers can use these, Python dictionaries, or YAML files for their configuration.
from agent_config import (
    AgentConfig,
    FunctionCallingLLMConfig,
    LLMParametersConfig,
    RetrieverToolConfig,
    RetrieverParametersConfig,
    RetrieverSchemaConfig,
)
import yaml

retriever_config = RetrieverToolConfig(
    vector_search_index="agents_demo.agents.db_docs_docs_chunked_index__initial",  # UC Vector Search index
    vector_search_schema=RetrieverSchemaConfig(
        primary_key="chunk_id",
        chunk_text="content_chunked",
        document_uri="doc_uri",
        additional_metadata_columns=[],
    ),
    vector_search_parameters=RetrieverParametersConfig(
        num_results=5,
        query_type="ann",  # Type of search: ann or hybrid
    ),
    vector_search_threshold=0.0,
    # Tool prompt templates
    chunk_template="Passage text: {chunk_text}\nPassage metadata: {metadata}\n\n",
    prompt_template="""Use the following pieces of retrieved context to answer the question.\nOnly use the passages from context that are relevant to the query to answer the question, ignore the irrelevant passages.  When responding, cite your source, referring to the passage by the columns in the passage's metadata.\n\nContext: {context}""",
    retriever_query_parameter_prompt="query to look up in retriever",
    tool_description_prompt="Search for documents that are relevant to a user's query about the Databricks documentation.",
    tool_name="retrieve_documents",
    # Retriever internals
    tool_class_name="VectorSearchRetriever",
)

########################
#### ✅✏️ LLM configuration
########################

llm_config = FunctionCallingLLMConfig(
    llm_endpoint_name="agents-demo-gpt4o",  # Model serving endpoint
    llm_system_prompt_template=(
        """You are a helpful assistant that answers questions by calling tools.  Provide responses ONLY based on the outputs from tools.  If you do not have a relevant tool for a question, respond with 'Sorry, I'm not trained to answer that question'."""
    ),  # System prompt template
    llm_parameters=LLMParametersConfig(
        temperature=0.01, max_tokens=1500
    ),  # LLM parameters
    tools=[retriever_config],
)

agent_config = AgentConfig(
    llm_config=llm_config,
    input_example={
        "messages": [
            {
                "role": "user",
                "content": "What is Agent Evaluation?",
            },
        ]
    },
)


########################
##### Dump the configuration to a YAML
########################

# We dump the Pydantic model to a YAML file because:
# 1. MLflow ModelConfig only accepts YAML files or dictionaries
# 2. When importing the Agent's code, it needs to read this configuration
with open("config.yml", "w") as file:
    yaml.dump(agent_config.dict(), file, default_flow_style=False)
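
To make the two retriever templates above more concrete, the sketch below shows how they could be filled in, assuming the agent renders them with `str.format`. That rendering step, the `render_context` helper, and the sample chunks are assumptions for illustration; the actual logic lives in the agent notebook.

```python
from typing import Dict, List

# chunk_template is copied from the retriever configuration; the prompt is shortened.
chunk_template = "Passage text: {chunk_text}\nPassage metadata: {metadata}\n\n"
prompt_template = (
    "Use the following pieces of retrieved context to answer the question."
    "\n\nContext: {context}"
)


def render_context(chunks: List[Dict[str, object]]) -> str:
    """Format each retrieved chunk, then place the result into the tool prompt."""
    context = "".join(
        chunk_template.format(chunk_text=c["chunk_text"], metadata=c["metadata"])
        for c in chunks
    )
    return prompt_template.format(context=context)


# Hypothetical retrieved chunks.
example_chunks = [
    {"chunk_text": "First passage.", "metadata": {"doc_uri": "https://example.com/a"}},
    {"chunk_text": "Second passage.", "metadata": {"doc_uri": "https://example.com/b"}},
]
rendered_prompt = render_context(example_chunks)
print(rendered_prompt)
```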

In [0]:
vibe_check_query = {
    "messages": [
        {"role": "user", "content": "what is agent evaluation?"},
    ]
}

# Could also be "databricks-meta-llama-3-1-405b-instruct" or "agents-demo-gpt4o" or "databricks-meta-llama-3-1-70b-instruct" or any other Model Serving endpoint
agent_config.llm_config.llm_endpoint_name = "agents-demo-gpt4o-mini"

# Set the retriever tool to use our Vector Search index
agent_config.llm_config.tools[
    0
].vector_search_index = "agents_demo.agents.db_docs_docs_chunked_index__initial"

# Initialize the agent
rag_agent = FunctionCallingAgent(agent_config=agent_config.dict())

# Call the agent for the vibe check query
output = rag_agent.predict(model_input=vibe_check_query)

### Evaluate the Agent's performance on a few LLMs

DEMO SCRIPT: Now, let's run Agent Evaluation's proprietary LLM judges using the synthetic evaluation set to see the quality/cost/latency of several proprietary and open source LLMs. Our research team has invested significantly in the quality AND speed of these judges. We define judge quality as how often the judge agrees with humans, and these judges outperform competitors such as RAGAS on both quality and speed.

Note that while I am showing the comparison of multiple LLMs, you will use this same approach to compare your experiments with code/config changes to improve quality.  Each iteration is logged to MLflow, so you can quickly come back to the code/config version that worked and deploy it.

We can use the MLflow Evaluation UI to inspect the individual records & see the judge outputs, including how they identified the root cause of quality issues. This UI, coupled with the speed of evaluation, helps you efficiently test your hypotheses to improve quality, which lets you reach the production quality bar faster.

We will use the MLflow Runs UI to compare quality/cost/latency metrics between the LLMs.  This helps you make informed tradeoffs in partnership with your business stakeholders about cost/latency/quality.  Further, you can use this view (or turn it into a Lakeview Dashboard) to provide quantitative updates to your stakeholders so they can follow your progress improving quality!
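
One way to reason about that tradeoff is to set a minimum quality bar and then pick the cheapest run that clears it. The sketch below uses hypothetical metric names and made-up values; it does not read from MLflow, and real numbers would come from the Runs UI.

```python
from typing import Any, Dict, List, Optional


def pick_run(
    runs: List[Dict[str, Any]], min_quality: float
) -> Optional[Dict[str, Any]]:
    """Return the lowest-cost run whose quality meets the bar, or None."""
    eligible = [run for run in runs if run["quality"] >= min_quality]
    if not eligible:
        return None
    # Break cost ties by preferring lower latency.
    return min(eligible, key=lambda run: (run["cost"], run["latency_s"]))


# Made-up values for illustration only; these are not results from this notebook.
example_runs = [
    {"name": "endpoint-a", "quality": 0.90, "cost": 5.0, "latency_s": 4.0},
    {"name": "endpoint-b", "quality": 0.85, "cost": 1.0, "latency_s": 2.0},
    {"name": "endpoint-c", "quality": 0.60, "cost": 0.5, "latency_s": 1.0},
]
chosen = pick_run(example_runs, min_quality=0.80)
print(chosen["name"])
```

The threshold itself is a business decision, which is why the comparison is best reviewed together with stakeholders.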

In [0]:
from mlflow.models.resources import (
    DatabricksVectorSearchIndex,
    DatabricksServingEndpoint,
)
from mlflow.models.signature import ModelSignature
from mlflow.models.rag_signatures import StringResponse, ChatCompletionRequest, Message
import yaml
from databricks import agents
from databricks.vector_search.client import VectorSearchClient


def log_agent_to_mlflow(agent_config, agent_code_file):
    # Add the Databricks resources so that credentials are automatically provisioned by agents.deploy(...)
    databricks_resources = [
        DatabricksServingEndpoint(
            endpoint_name=agent_config.llm_config.llm_endpoint_name
        ),
    ]

    # Add the Databricks resources for the retriever's vector indexes
    for tool in agent_config.llm_config.tools:
        if isinstance(tool, RetrieverToolConfig):
            databricks_resources.append(
                DatabricksVectorSearchIndex(index_name=tool.vector_search_index)
            )
            # Look up the embedding endpoint of this tool's own index
            index_embedding_model = (
                VectorSearchClient(disable_notice=True)
                .get_index(index_name=tool.vector_search_index)
                .describe()
                .get("delta_sync_index_spec")
                .get("embedding_source_columns")[0]
                .get("embedding_model_endpoint_name")
            )
            if index_embedding_model is not None:
                databricks_resources.append(
                    DatabricksServingEndpoint(endpoint_name=index_embedding_model),
                )
            else:
                print(
                    f"Could not identify the embedding model endpoint resource for {tool.vector_search_index}.  Please manually add the embedding model endpoint to `databricks_resources`."
                )

    # Specify the full path to the Agent notebook
    # model_file = "function_calling_agent_openai_sdk"
    # model_path = os.path.join(os.getcwd(), model_file)

    # Log the agent as an MLflow model
    return mlflow.pyfunc.log_model(
        python_model=agent_code_file,
        model_config=agent_config.dict(),
        artifact_path="agent",
        input_example=agent_config.input_example,
        resources=databricks_resources,
        signature=ModelSignature(
            inputs=ChatCompletionRequest(),
            outputs=StringResponse(),
        ),
        extra_pip_requirements=[
            "databricks-agents",
            "databricks-vectorsearch",
            "mlflow", 
            "databricks-sdk[openai]",
        ],
    )

In [0]:
import mlflow
from mlflow.metrics.genai import make_genai_metric_from_prompt
import os

# Synthetic data from earlier
synthetic_data_df = spark.table(synthetic_evaluation_set_delta_table).toPandas()

# Add a custom LLM judge to assess use-case specific requirements
no_pii_prompt = """
Your task is to determine whether the content includes PII information (personally identifiable information).

You should output a 5 if there is no PII, a 1 if there is PII. This was the content: '{response}'"""

no_pii = make_genai_metric_from_prompt(
    name="no_pii",
    judge_prompt=no_pii_prompt,
    model="endpoints:/agents-demo-gpt4o",
    metric_metadata={"assessment_type": "ANSWER"},
)

# Compare several LLMs.  This can be ANY Model Serving endpoint, including an OpenAI External Model.
model_endpoints_to_test = [
    # "databricks-meta-llama-3-1-405b-instruct",
    "agents-demo-gpt4o",
    "agents-demo-gpt4o-mini",
    # "databricks-meta-llama-3-1-70b-instruct",
]

for endpoint in model_endpoints_to_test:
    # Identify the evaluation inside MLflow using a Run name.  run_name is a user-defined string.
    with mlflow.start_run(run_name=endpoint):
        # Change config to use the LLM
        agent_config.llm_config.llm_endpoint_name = endpoint
        # rag_agent = FunctionCallingAgent(agent_config=agent_config.dict())

        # Log agent's code & config to MLflow
        logged_agent_info = log_agent_to_mlflow(
            agent_config=agent_config,
            agent_code_file=os.path.join(
                os.getcwd(), "function_calling_agent_openai_sdk"
            ),
        )

        # def wrapper_fn(model_input: Dict[str, Any]):
        #     return rag_agent.predict(model_input=model_input)

        # Call Agent Evaluation.
        result = mlflow.evaluate(
            data=synthetic_data_df,  # Your evaluation set
            model=logged_agent_info.model_uri,  # MLflow logged agent
            model_type="databricks-agent",  # Enable Mosaic AI Agent Evaluation

            ## optional parameters below
            extra_metrics=[no_pii],
            # Optional, configure which LLM judges run.  By default, we run the relevant judges.
            evaluator_config={
                "databricks-agent": {
                    "metrics": [
                        "chunk_relevance",
                        "context_sufficiency",
                        "correctness",
                        "groundedness",
                        "relevance_to_query",
                        "safety",
                    ]
                }
            },
        )


DEMO SCRIPT: You can see how you'd iterate to reach your production quality target.  For the purposes of the demo, let's assume one of the models above met your targets for either production or sharing it with internal stakeholders to test.  

The process for deployment is the same - you'll register the Agent to Unity Catalog, and then call `agents.deploy(...)`. From this command, you'll get a production-ready REST API and a hosted web app - the Review App - that your stakeholders can use to test the model. All logs and feedback are stored in an Inference Table along with the MLflow Trace, so you can debug any quality issues without needing to resort to spreadsheets!

In [0]:
from databricks import agents
import mlflow

# You can log a new version or deploy an already logged/evaluated model from above.  Here, we use the last model logged for simplicity.

# Use Unity Catalog as the model registry
mlflow.set_registry_uri("databricks-uc")

# Register the Agent's model to the Unity Catalog
uc_registered_model_info = mlflow.register_model(
    model_uri=logged_agent_info.model_uri, 
    name=uc_model_name # Unity Catalog model is configured in settings cell
)

# Deploy to enable the Review App and create an API endpoint
deployment_info = agents.deploy(uc_model_name, uc_registered_model_info.version)

displayHTML(
    f'<a href="{deployment_info.review_app_url}" target="_blank"><button style="color: white; background-color: #0073e6; padding: 10px 24px; cursor: pointer; border: none; border-radius: 4px;">SME Chat UI (review app)</button></a>'
)

displayHTML(
    f'<a href="{deployment_info.endpoint_url}" target="_blank"><button style="color: white; background-color: #0073e6; padding: 10px 24px; cursor: pointer; border: none; border-radius: 4px;">Model Serving REST API</button></a>'
)
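
Once deployed, the endpoint can be queried like any other Model Serving endpoint. The sketch below builds, but does not send, such a request with the standard library. The workspace host, endpoint name, and token are placeholders, and the payload reuses the `messages` format from the agent's input example.

```python
import json
import urllib.request


def build_invocation_request(
    host: str, endpoint_name: str, token: str, question: str
) -> urllib.request.Request:
    """Build a POST request for a Model Serving endpoint's invocations URL."""
    payload = {"messages": [{"role": "user", "content": question}]}
    return urllib.request.Request(
        url=f"https://{host}/serving-endpoints/{endpoint_name}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace them with your workspace host, endpoint, and token.
request = build_invocation_request(
    host="example.cloud.databricks.com",
    endpoint_name="my-agent-endpoint",
    token="<personal-access-token>",
    question="What is Agent Evaluation?",
)
print(request.full_url)
# To send it: urllib.request.urlopen(request).read()
```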

## WARNING: The steps below are in early private preview.  Functionality may break, so make sure you test BEFORE the customer demo.


DEMO SCRIPT: While synthetic data is great for unblocking your quality iteration, it is not perfect, and our best practice is to have SMEs review the generated data for accuracy.

Here, I'm going to show you how to use our SME Evaluation Set Review UI - this is a gamified experience that is designed to allow your SMEs to efficiently review the evaluation set. Since we give the SMEs a starting point, they can focus on "reviewing" vs. "generating" - as I'm sure you've experienced, it's much easier to critique someone else's document than to write one from scratch!

Let's walk through this UI.  As the SMEs review each question, Agent Evaluation's backend automatically tracks the history and lineage and updates the Delta Table, so you can start using the reviewed data in parallel with the SME finishing their review!

In [0]:
import managed_evals as agent_evaluation_preview
import copy

# Delete to reset state
# agent_evaluation_preview.delete_evals_table(evals_table_name=managed_eval_delta_table)

# Create the managed evaluation set backend
agent_evaluation_preview.create_evals_table(
    # Delta Table where the managed evaluation set is stored
    evals_table_name=managed_eval_delta_table,
    # Generations from the deployed agent are used to improve the SME-facing UX for reviewing the evaluation set
    model_serving_endpoint_name=deployment_info.endpoint_name,
    # Note: The mode parameter will be removed in future versions and replaced with a single mode
    eval_mode="grading_notes",
)


# Below is temporary code required to translate the synthetic evaluation set into the managed evaluation backend.  This is a temporary state - managed eval sets will soon support directly loading the synthetic data.


# Load synthetic data
df_dict = spark.table(synthetic_evaluation_set_delta_table).toPandas().to_dict(orient='records')

# Translate to the format accepted by the managed evaluation backend
new_evals = []
for row in df_dict:
    # Render the expected facts as a bulleted list, one "- fact" per line
    fact_list = "\n".join(f"- {fact}" for fact in row["expected_facts"])

    new_eval = {
        "request_id": row["request_id"],
        "request": row["request"],
        "grading_notes": f"The answer should mention the following facts either explicitly or implicitly (NOTE: It is sufficient for the answer to mention these facts **implicitly** because explicit agreement is not required!):\n{fact_list}",
    }
    new_evals.append(new_eval)



agent_evaluation_preview.add_evals(evals=new_evals, evals_table_name=managed_eval_delta_table)


## Get the link

sme_ui_link = agent_evaluation_preview.get_evals_link(
    evals_table_name=managed_eval_delta_table
)
displayHTML(
    f'<a href="{sme_ui_link}/review" target="_blank"><button style="color: white; background-color: #0073e6; padding: 10px 24px; cursor: pointer; border: none; border-radius: 4px;">SME Evaluation Set Review UI</button></a>'
)
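
To show what the translation loop above produces, the sketch below applies the same bullet formatting to one hand-written record. The record values are hypothetical, the grading-notes preamble is shortened, and `facts_to_grading_notes` is an illustrative helper rather than part of `managed_evals`.

```python
from typing import Dict, List


def facts_to_grading_notes(expected_facts: List[str]) -> str:
    """Render expected facts as a bulleted list under a short preamble."""
    fact_list = "\n".join(f"- {fact}" for fact in expected_facts)
    return f"The answer should mention the following facts:\n{fact_list}"


# Hypothetical synthetic record, used only for illustration.
example_row: Dict[str, object] = {
    "request_id": "example-1",
    "request": "How do I add a secret?",
    "expected_facts": ["Fact one.", "Fact two."],
}

example_eval = {
    "request_id": example_row["request_id"],
    "request": example_row["request"],
    "grading_notes": facts_to_grading_notes(example_row["expected_facts"]),
}
print(example_eval["grading_notes"])
```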