## LLM RAG Evaluation with MLflow

In this notebook, we demonstrate how to evaluate a RAG system with MLflow and RAGAS. We use Claude 3.5 Sonnet as the judge model, accessed through the AWS Bedrock API.

### Installing Requirements

Before proceeding with this tutorial, install the necessary dependencies using `poetry`:

```bash
poetry install
```

### Configuration

Set up the Loka AWS SSO profile `loka-mlengineer`. The session created below uses this profile to build the Bedrock client.
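A quick way to confirm that the profile exists before running the notebook is to look for its section in the AWS config file. The sketch below parses config text with the standard library only. The helper name `has_aws_profile` and every value in `sample_config` are illustrative placeholders, not real account details.

```python
import configparser


def has_aws_profile(config_text: str, profile: str) -> bool:
    """Return True if the AWS config text defines the given named profile."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Named profiles live in "[profile <name>]" sections; the default profile
    # is stored as "[default]".
    section = "default" if profile == "default" else f"profile {profile}"
    return parser.has_section(section)


# Placeholder config text (hypothetical values, for illustration only).
sample_config = """
[profile loka-mlengineer]
sso_start_url = https://example.awsapps.com/start
sso_region = us-east-1
sso_account_id = 123456789012
sso_role_name = ExampleRole
region = us-east-1
"""

profile_found = has_aws_profile(sample_config, "loka-mlengineer")
print(profile_found)
```

In practice the text would come from `~/.aws/config`, for example via `pathlib.Path.home() / ".aws" / "config"`.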

In [1]:
import time
import boto3
import pandas as pd
import faiss
from itertools import product
from collections import defaultdict
from tqdm import tqdm
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_community.callbacks.manager import get_bedrock_anthropic_callback
from langchain_aws import BedrockEmbeddings, ChatBedrock
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from ragas import SingleTurnSample, EvaluationDataset
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    SemanticSimilarity,
    ContextPrecision,
    ResponseRelevancy,
)
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper


import mlflow

boto_session = boto3.Session(profile_name="loka-mlengineer", region_name="us-east-1")
bedrock_client = boto_session.client("bedrock-runtime")



## Create a RAG system

Use LangChain and FAISS to create a RAG system that answers questions based on the MLflow documentation.

In [2]:
# Load the data
loader = WebBaseLoader(
    [
        "https://mlflow.org/docs/latest/index.html",
        "https://mlflow.org/docs/latest/tracking/autolog.html",
        "https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html",
        "https://mlflow.org/docs/latest/python_api/mlflow.deployments.html",
    ]
)
documents = loader.load()
documents

[Document(metadata={'source': 'https://mlflow.org/docs/latest/index.html', 'title': 'MLflow: A Tool for Managing the Machine Learning Lifecycle', 'language': 'en'}, page_content="\n\n\n\n  \n\n\n\nMLflow: A Tool for Managing the Machine Learning Lifecycle\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n2.17.0\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n MLflow\n\nMLflow Overview\nGetting Started with MLflow\nNew Features\nLLMs\nMLflow Tracing\nModel Evaluation\nDeep Learning\nTraditional ML\nDeployment\nMLflow Tracking\nSystem Metrics\nMLflow Projects\nMLflow Models\nMLflow Model Registry\nMLflow Recipes\nMLflow Plugins\nMLflow Authentication\nCommand-Line Interface\nSearch Runs\nSearch Experiments\nPython API\nR API\nJava API\nREST API\nOfficial MLflow Docker Image\nCommunity Model Flavors\nTutorials and Examples\n\n\n\n\nContribute\n\n\n\n\n\n\n\n\n\n\nDocumentation \nMLflow: A Tool for Managing the Machine Learning Lifecycle\n\n\n\n\n\n\nMLflow: A Tool for Managing t

In [3]:
# Chunk the documents into smaller pieces
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
print(f"Split into {len(texts)} documents")

# Initialize the components of the chain
embeddings = BedrockEmbeddings(
    client=bedrock_client, model_id="amazon.titan-embed-text-v1"
)
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

# Add the documents to the vector store
vector_store.add_documents(documents=texts)

Created a chunk of size 745, which is longer than the specified 500
Created a chunk of size 567, which is longer than the specified 500
Created a chunk of size 583, which is longer than the specified 500
Created a chunk of size 670, which is longer than the specified 500
Created a chunk of size 841, which is longer than the specified 500
Created a chunk of size 1133, which is longer than the specified 500
Created a chunk of size 629, which is longer than the specified 500
Created a chunk of size 617, which is longer than the specified 500
Created a chunk of size 559, which is longer than the specified 500
Created a chunk of size 607, which is longer than the specified 500
Created a chunk of size 530, which is longer than the specified 500
Created a chunk of size 793, which is longer than the specified 500
Created a chunk of size 713, which is longer than the specified 500
Created a chunk of size 804, which is longer than the specified 500
Created a chunk of size 771, which is longer th

Split into 155 documents


['03e7d370-e419-4c87-8b7f-c3bc9f3caf52',
 'e0cc3e67-9fec-4fdf-bce7-5b2ea6ee6c03',
 '732407c9-d446-4096-8821-a4ea300d6b79',
 'be0b25de-5b93-4da4-9dd5-5b240f705ab1',
 'fa9fe616-8530-4320-8203-92bd671aad2b',
 '98d5f5e3-75f3-420a-b9c8-643ebe0eb646',
 '24de5746-13af-4c87-90b2-eef82d86e70b',
 'cfb4fda6-06e3-4d71-b56f-c26a66143a66',
 '49c44a41-8518-4d62-be55-d9e7d62c748f',
 '46811949-d8f4-41f1-bc2e-608d5abb691a',
 'f9f030da-9f0e-4ac9-bd41-f479ec3225fb',
 '01905bbf-a4c9-46cf-9e71-ccd6b0747102',
 '7e6760fd-5b77-48e7-abb4-ba9909abdac9',
 '1f257387-1f02-473f-91d1-99272f8c1227',
 '3e2e487c-a7b4-4500-918b-d8406e4ceca5',
 '08b00e1e-502e-4648-a972-231f4b60dfb0',
 '9978340b-6566-4f64-80e5-5189ac640905',
 '1390da84-cc13-430e-8ae2-528190a3777a',
 '0011cde9-bdcd-475d-9d6e-f48f4239aefc',
 'f2891cbb-0a24-4eff-a1c6-29651c09996c',
 'e22679a6-d301-4527-9b29-e7bc020171e8',
 '8a760f94-5020-4fd8-a900-e4ca5b720918',
 '89151717-f249-43c2-87df-e1c91a42b0f0',
 '1348ee8f-f26c-4fbf-94cb-1385584c1d2b',
 '55ecc44d-f36c-
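The "Created a chunk of size ..., which is longer than the specified 500" warnings above arise because a separator-based splitter only cuts at its separator. The following is a simplified sketch of that behaviour, not LangChain's implementation: it assumes a `"\n\n"` separator, ignores `chunk_overlap`, and uses a hypothetical helper name `split_on_separator`. A piece that is longer than `chunk_size` on its own is emitted unchanged, which is what produces an oversized chunk.

```python
def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Split on the separator, then greedily merge pieces up to chunk_size."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits into the chunk being built.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; it may exceed chunk_size if the piece alone is too long.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Toy text: the middle paragraph (120 characters) cannot be cut any further.
text = "a" * 30 + "\n\n" + "b" * 120 + "\n\n" + "c" * 40
chunks = split_on_separator(text, chunk_size=100)
chunk_lengths = [len(chunk) for chunk in chunks]
oversized = [length for length in chunk_lengths if length > 100]
print(chunk_lengths, oversized)
```

If strict size limits matter, a splitter that falls back to finer separators (LangChain's `RecursiveCharacterTextSplitter` is one option) may be worth trying.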

In [4]:
# Create the prompt
prompt = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.

Question: {question}
Context: {context}
Answer:"""

prompt = ChatPromptTemplate.from_template(prompt)

In [5]:
# Build qa chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


llm = ChatBedrock(
    client=bedrock_client,
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    temperature=0.1,
)

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": vector_store.as_retriever(), "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)
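The data flow of this chain can be hard to read from the runnable composition alone. The sketch below mimics it with plain Python functions: the retrieved documents stay in the result under `context`, while the model only sees them joined into one string. `Doc`, `fake_retriever`, `fake_llm`, and `TEMPLATE` are hypothetical stand-ins for the vector store retriever, the Bedrock model, and the prompt; no LangChain call is made.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    page_content: str


def fake_retriever(question: str) -> list[Doc]:
    # Stand-in for vector_store.as_retriever(): returns fixed documents.
    return [Doc("MLflow is an open-source platform."), Doc("It manages the ML lifecycle.")]


def format_docs(docs: list[Doc]) -> str:
    return "\n\n".join(doc.page_content for doc in docs)


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the chat model followed by StrOutputParser().
    return f"(answer based on {len(prompt_text)} prompt characters)"


TEMPLATE = "Question: {question}\nContext: {context}\nAnswer:"


def rag_with_source(question: str) -> dict:
    # Step 1 (RunnableParallel): build the context and question keys side by side.
    state = {"context": fake_retriever(question), "question": question}
    # Step 2 (inner chain): format the documents only for the prompt.
    prompt_inputs = {**state, "context": format_docs(state["context"])}
    # Step 3 (.assign): attach the answer while keeping the original documents.
    state["answer"] = fake_llm(TEMPLATE.format(**prompt_inputs))
    return state


result = rag_with_source("What is MLflow?")
print(list(result))
```

Keeping the raw documents in the output is what later allows the evaluation to read `result["context"]` and `result["answer"]` from a single invocation.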

In [6]:
# Testing the chain
result = rag_chain_with_source.invoke("What is MLflow?")
result

{'context': [Document(metadata={'source': 'https://mlflow.org/docs/latest/index.html', 'title': 'MLflow: A Tool for Managing the Machine Learning Lifecycle', 'language': 'en'}, page_content='Learn about MLflowMLflow BasicsMLflow Models IntroductionGenAI QuickstartsDeep Learning Quickstarts\nLearn about the core components of MLflow\n\n\nQuickstarts\n\n                Get Started with MLflow in our 5-minute tutorial\n\nGuides\n\n                Learn the core components of MLflow with this in-depth guide to Tracking\n\n\nLearn how to perform common tasks in MLflow\n\n\nGuides\n\nAutologging tutorial for effortless model tracking'),
  Document(metadata={'source': 'https://mlflow.org/docs/latest/index.html', 'title': 'MLflow: A Tool for Managing the Machine Learning Lifecycle', 'language': 'en'}, page_content='Contribute\n\n\nDocumentation \nMLflow: A Tool for Managing the Machine Learning Lifecycle\n\n\nMLflow: A Tool for Managing the Machine Learning Lifecycle \nMLflow is an open-source

## Evaluate the RAG system using MLflow and RAGAS

Create an evaluation dataset.

In [7]:
# Create the evaluation dataset
eval_df = pd.DataFrame(
    {
        "question": [
            "What is MLflow?",
            "What is the mlflow.evaluate() function?",
            "How can I log a table with MLFlow?",
            "How can I load a saved table?",
        ],
        "reference": [
            "MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment.",
            "The mlflow.evaluate() function evaluates the model on the given dataset.",
            "You can log a table with MLFlow using the mlflow.log_table() function.",
            "You can load a saved table using the mlflow.load_table() function.",
        ],
    }
)
eval_df

| | question | reference |
|---|---|---|
| 0 | What is MLflow? | MLflow is an open source platform to manage th... |
| 1 | What is the mlflow.evaluate() function? | The mlflow.evaluate() function evaluates the m... |
| 2 | How can I log a table with MLFlow? | You can log a table with MLFlow using the mlfl... |
| 3 | How can I load a saved table? | You can load a saved table using the mlflow.lo... |


Run the chain on `eval_df` and record the token usage, cost, and latency of each invocation.

In [8]:
invoke_metadata = defaultdict(list)
samples = []
for _, row in eval_df.iterrows():
    # Invoke the chain while capturing the token usage/cost
    with get_bedrock_anthropic_callback() as cb:
        start_time = time.time()
        result = rag_chain_with_source.invoke(row["question"])
        end_time = time.time()

    invoke_metadata["user_input"].append(row["question"])
    invoke_metadata["total_tokens"].append(cb.total_tokens)
    invoke_metadata["total_cost"].append(cb.total_cost)
    invoke_metadata["latency"].append(end_time - start_time)

    samples.append(
        SingleTurnSample(
            user_input=row["question"],
            reference=row["reference"],
            response=result["answer"],
            retrieved_contexts=[i.page_content for i in result["context"]],
        )
    )

metadata_df = pd.DataFrame(invoke_metadata)
scoring_dataset = EvaluationDataset(samples=samples)
scoring_dataset.to_pandas()

| | user_input | retrieved_contexts | response | reference |
|---|---|---|---|---|
| 0 | What is MLflow? | [Learn about MLflowMLflow BasicsMLflow Models ... | MLflow is an open-source platform designed to ... | MLflow is an open source platform to manage th... |
| 1 | What is the mlflow.evaluate() function? | [MLflow Overview\nGetting Started with MLflow\... | I apologize, but I don't have any specific inf... | The mlflow.evaluate() function evaluates the m... |
| 2 | How can I log a table with MLFlow? | [Automatic Logging with MLflow Tracking\n\n\n2... | To log a table with MLflow, you can use the `m... | You can log a table with MLFlow using the mlfl... |
| 3 | How can I load a saved table? | [Clicking on the run (“clumsy-steed-426” in th... | I apologize, but I don't have any specific inf... | You can load a saved table using the mlflow.lo... |


In [9]:
metadata_df

| | user_input | total_tokens | total_cost | latency |
|---|---|---|---|---|
| 0 | What is MLflow? | 863 | 0.003417 | 2.48342 |
| 1 | What is the mlflow.evaluate() function? | 938 | 0.003486 | 2.757172 |
| 2 | How can I log a table with MLFlow? | 641 | 0.002799 | 2.670629 |
| 3 | How can I load a saved table? | 633 | 0.002679 | 2.757582 |
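The `total_cost` column can be sanity-checked with simple arithmetic. The sketch below assumes per-1K-token prices of 0.003 USD for input and 0.015 USD for output; these prices are an assumption and do not appear in the notebook. Under that assumption, the total token count and total cost of a row determine a single prompt/completion split, and the first row works out to whole token counts.

```python
# Assumed prices in USD per 1,000 tokens (not taken from the notebook).
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one call given its prompt and completion token counts."""
    return (
        prompt_tokens / 1000 * INPUT_PRICE_PER_1K
        + completion_tokens / 1000 * OUTPUT_PRICE_PER_1K
    )


def infer_split(total_tokens: int, total_cost: float) -> tuple[int, int]:
    """Solve the two linear equations for (prompt_tokens, completion_tokens)."""
    completion = (total_cost * 1000 - total_tokens * INPUT_PRICE_PER_1K) / (
        OUTPUT_PRICE_PER_1K - INPUT_PRICE_PER_1K
    )
    completion_tokens = round(completion)
    return total_tokens - completion_tokens, completion_tokens


# First row of the metadata table: 863 tokens, 0.003417 USD.
prompt_tokens, completion_tokens = infer_split(863, 0.003417)
recomputed_cost = estimate_cost(prompt_tokens, completion_tokens)
print(prompt_tokens, completion_tokens, round(recomputed_cost, 6))
```

Most of each call's tokens are prompt tokens, which fits a RAG prompt that embeds several retrieved chunks and asks for a short answer.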


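Evaluate the scoring dataset with RAGAS, using Claude 3.5 Sonnet as the judge LLM and Titan embeddings for the embedding-based metrics.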

In [10]:
# Define the evaluator LLM and embeddings
evaluator_llm = LangchainLLMWrapper(
    ChatBedrock(
        client=bedrock_client,
        model="anthropic.claude-3-5-sonnet-20240620-v1:0",
        temperature=0.4,
    )
)
evaluator_embeddings = BedrockEmbeddings(
    client=bedrock_client, model_id="amazon.titan-embed-text-v1"
)

# Define the metrics
metrics = [
    LLMContextRecall(),
    Faithfulness(),
    ContextPrecision(),
    ResponseRelevancy(),
    SemanticSimilarity(),
]

# Evaluate the model
results = evaluate(
    dataset=scoring_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)

results_df = results.to_pandas()
results_df = results_df.merge(metadata_df, on="user_input")
results_df

Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.12s/it]


| | user_input | retrieved_contexts | response | reference | context_recall | faithfulness | context_precision | answer_relevancy | semantic_similarity | total_tokens | total_cost | latency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What is MLflow? | [Learn about MLflowMLflow BasicsMLflow Models ... | MLflow is an open-source platform designed to ... | MLflow is an open source platform to manage th... | 1.0 | 0.636364 | 0.5 | 0.962677 | 0.933165 | 863 | 0.003417 | 2.48342 |
| 1 | What is the mlflow.evaluate() function? | [MLflow Overview\nGetting Started with MLflow\... | I apologize, but I don't have any specific inf... | The mlflow.evaluate() function evaluates the m... | 0.0 | 1.0 | 0.0 | 0.0 | 0.683917 | 938 | 0.003486 | 2.757172 |
| 2 | How can I log a table with MLFlow? | [Automatic Logging with MLflow Tracking\n\n\n2... | To log a table with MLflow, you can use the `m... | You can log a table with MLFlow using the mlfl... | 0.0 | 0.0 | 0.0 | 0.988673 | 0.895391 | 641 | 0.002799 | 2.670629 |
| 3 | How can I load a saved table? | [Clicking on the run (“clumsy-steed-426” in th... | I apologize, but I don't have any specific inf... | You can load a saved table using the mlflow.lo... | 0.0 | 0.5 | 0.0 | 0.0 | 0.868268 | 633 | 0.002679 | 2.757582 |
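Of the metrics above, `semantic_similarity` is the easiest to reproduce by hand: as far as we understand RAGAS, it is the cosine similarity between the embedding of the response and the embedding of the reference. The sketch below shows that computation on toy vectors; the three-dimensional vectors are made up for illustration and are not Titan embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values).
response_vec = [1.0, 2.0, 0.0]
reference_vec = [2.0, 4.0, 0.0]   # same direction as response_vec
unrelated_vec = [0.0, 0.0, 3.0]   # orthogonal to response_vec

same_direction = cosine_similarity(response_vec, reference_vec)
orthogonal = cosine_similarity(response_vec, unrelated_vec)
print(same_direction, orthogonal)
```

This also helps explain why the "I don't have any specific information" answers still score around 0.68 to 0.87: a refusal that repeats the topic of the question can sit close to the reference in embedding space even though it answers nothing.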


In [11]:
def create_agg_metrics_dict(df, metrics):
    agg_metrics = {}
    for metric in metrics:
        agg_metrics[metric + "_mean"] = df[metric].mean()
        agg_metrics[metric + "_std"] = df[metric].std()
    return agg_metrics


# Set our tracking server URI for logging
mlflow.set_tracking_uri(uri="http://localhost:8080")

# Create a new MLflow Experiment
mlflow.set_experiment("Question-Answering Evaluation 0.1")

with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(
        {
            "text_splitter": "langchain.text_splitter.CharacterTextSplitter",
            "text_splitter__chunk_size": 500,
            "text_splitter__chunk_overlap": 100,
            "llm_model": "anthropic.claude-3-5-sonnet-20240620-v1:0",
            "llm_model__temperature": 0.1,
            "embedding_model": "amazon.titan-embed-text-v1",
            "user_prompt": prompt.messages[0].prompt.template,
        }
    )

    # Log the user_prompt
    mlflow.log_text(prompt.messages[0].prompt.template, "user_prompt.txt")

    # Log the evaluation dataset
    mlflow.log_input(
        mlflow.data.from_pandas(eval_df),
        context="test",
    )

    # Log the evaluation results
    mlflow.log_table(
        results_df,
        "results_df.json",
    )

    # Log the metric results
    mlflow.log_metrics(
        create_agg_metrics_dict(
            results_df,
            [
                "context_recall",
                "faithfulness",
                "context_precision",
                "answer_relevancy",
                "semantic_similarity",
                "total_tokens",
                "total_cost",
                "latency",
            ],
        )
    )

2024/10/22 10:50:39 INFO mlflow.tracking._tracking_service.client: 🏃 View run rambunctious-fly-398 at: http://localhost:8080/#/experiments/511731529628869183/runs/6a98d2ac73134132b3ae6a16fd854929.
2024/10/22 10:50:39 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:8080/#/experiments/511731529628869183.
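To make the logged aggregates concrete, the sketch below recomputes `context_recall_mean` and `context_recall_std` for the four `context_recall` values in the results table, using only the standard library. It relies on the fact that pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), which `statistics.stdev` also computes. The helper name `agg_metrics_from_lists` is hypothetical and simply mirrors `create_agg_metrics_dict` for plain lists.

```python
import statistics


def agg_metrics_from_lists(columns: dict[str, list[float]]) -> dict[str, float]:
    """Mean and sample standard deviation for each named column."""
    agg = {}
    for name, values in columns.items():
        agg[name + "_mean"] = statistics.mean(values)
        agg[name + "_std"] = statistics.stdev(values)  # sample std, like pandas' default
    return agg


# context_recall column from the results table above.
agg = agg_metrics_from_lists({"context_recall": [1.0, 0.0, 0.0, 0.0]})
print(agg)
```

With only four questions the standard deviation is large relative to the mean, so differences between runs should be read with caution.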


## Comparing different hyperparameters

In [12]:
def product_hyperparameters(**kwargs):
    keys = kwargs.keys()
    return [dict(zip(keys, instance)) for instance in product(*kwargs.values())]


hyperparameter_grid = {
    "llm_model": [
        "anthropic.claude-3-5-sonnet-20240620-v1:0",
        "anthropic.claude-3-haiku-20240307-v1:0",
    ],
    "user_prompt": [
        {
            "v1": "Use the following pieces of retrieved context to answer the question.\n\nQuestion: {question}\nContext: {context}\nAnswer:"
        },
        {
            "v2": "You are an assistant for question-answering tasks.\nUse the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. \nUse three sentences maximum and keep the answer concise.\n\nQuestion: {question}\nContext: {context}\nAnswer:"
        },
    ],
    "temperature": [0.1, 0.7],
}

# Build every combination of the hyperparameter grid
unique_hyper_product = product_hyperparameters(**hyperparameter_grid)

print("Hyperparameter Grid Size: ", len(list(unique_hyper_product)))

Hyperparameter Grid Size:  8


In [13]:
unique_hyper_product[0]

{'llm_model': 'anthropic.claude-3-5-sonnet-20240620-v1:0',
 'user_prompt': {'v1': 'Use the following pieces of retrieved context to answer the question.\n\nQuestion: {question}\nContext: {context}\nAnswer:'},
 'temperature': 0.1}
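The grid size is the product of the number of options per hyperparameter (2 × 2 × 2 = 8). Because each prompt is stored as a one-entry dict keyed by a version label, a combination has to be flattened before it can be logged as MLflow parameters. The sketch below illustrates both points with shortened placeholder values; `loggable_params` is a hypothetical helper name.

```python
from itertools import product
from math import prod

# Placeholder grid with the same shape as the one above (values shortened).
grid = {
    "llm_model": ["sonnet", "haiku"],
    "user_prompt": [{"v1": "prompt text A"}, {"v2": "prompt text B"}],
    "temperature": [0.1, 0.7],
}

# Cartesian product of all option lists, one dict per combination.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
expected_size = prod(len(options) for options in grid.values())


def loggable_params(params: dict) -> dict:
    """Replace one-entry dict values (such as the prompt) by their key."""
    return {
        key: next(iter(value)) if isinstance(value, dict) else value
        for key, value in params.items()
    }


first = loggable_params(combos[0])
print(len(combos), expected_size, first)
```

Logging the short label (`v1`, `v2`) as the parameter keeps the MLflow UI readable, while the full prompt text can still be stored as an artifact.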

In [14]:
def evaluate_run(params, metrics, evaluator_llm, evaluator_embeddings) -> dict:
    """
    Evaluates a QA model run using the provided parameters, metrics, and evaluators.
    Args:
        params (dict): Parameters for building the QA chain.
        metrics (list): List of metrics to evaluate the model.
        evaluator_llm (object): The language model evaluator.
        evaluator_embeddings (object): The embeddings evaluator.
    Returns:
        dict: A dictionary containing:
            - "eval_df" (pd.DataFrame): The evaluation DataFrame.
            - "results_df" (pd.DataFrame): The results DataFrame with evaluation metrics and metadata.
    """

    # Build qa chain
    qa_chain = build_qa_chain(params)

    # Build scoring and metadata dataset
    print("Running the QA chain on the evaluation dataset...")
    invoke_metadata = defaultdict(list)
    samples = []
    for _, row in eval_df.iterrows():
        # Invoke the chain while capturing the token usage/cost
        with get_bedrock_anthropic_callback() as cb:
            start_time = time.time()
            result = qa_chain.invoke(row["question"])
            end_time = time.time()

        invoke_metadata["user_input"].append(row["question"])
        invoke_metadata["total_tokens"].append(cb.total_tokens)
        invoke_metadata["total_cost"].append(cb.total_cost)
        invoke_metadata["latency"].append(end_time - start_time)

        samples.append(
            SingleTurnSample(
                user_input=row["question"],
                reference=row["reference"],
                response=result["answer"],
                retrieved_contexts=[i.page_content for i in result["context"]],
            )
        )

    metadata_df = pd.DataFrame(invoke_metadata)
    scoring_dataset = EvaluationDataset(samples=samples)

    # Evaluate the model
    print("Evaluating the QA chain...")
    results = evaluate(
        dataset=scoring_dataset,
        metrics=metrics,
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
    )

    results_df = results.to_pandas()
    results_df = results_df.merge(metadata_df, on="user_input")
    return {"eval_df": eval_df, "results_df": results_df}


def log_run(run_output: dict) -> None:
    """
    Logs the run output to the active MLflow run.

    Parameters:
    - run_output (dict): Run data with the keys "params", "user_prompt", "eval_df", and "results_df".

    Returns:
    - None
    """
    print("Logging the run output to MLflow...")
    # Log the hyperparameters
    mlflow.log_params(run_output["params"])

    # Log the user_prompt
    mlflow.log_text(run_output["user_prompt"], "user_prompt.txt")

    # Log the evaluation dataset
    mlflow.log_input(
        mlflow.data.from_pandas(run_output["eval_df"]),
        context="test",
    )

    # Log the evaluation results
    mlflow.log_table(
        run_output["results_df"],
        "results_df.json",
    )

    # Log the metric results
    mlflow.log_metrics(
        create_agg_metrics_dict(
            run_output["results_df"],
            [
                "context_recall",
                "faithfulness",
                "context_precision",
                "answer_relevancy",
                "semantic_similarity",
                "total_tokens",
                "total_cost",
                "latency",
            ],
        )
    )


def build_qa_chain(params):
    # Build qa chain
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    llm = ChatBedrock(
        client=bedrock_client,
        model=params["llm_model"],
        temperature=params["temperature"],
    )

    rag_chain_from_docs = (
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | ChatPromptTemplate.from_template(list(params["user_prompt"].values())[0])
        | llm
        | StrOutputParser()
    )

    rag_chain_with_source = RunnableParallel(
        {"context": vector_store.as_retriever(), "question": RunnablePassthrough()}
    ).assign(answer=rag_chain_from_docs)

    return rag_chain_with_source


In [None]:
# Define the evaluator LLM and embeddings
evaluator_llm = LangchainLLMWrapper(
    ChatBedrock(
        client=bedrock_client,
        model="anthropic.claude-3-5-sonnet-20240620-v1:0",
        temperature=0.4,
    )
)
evaluator_embeddings = BedrockEmbeddings(
    client=bedrock_client, model_id="amazon.titan-embed-text-v1"
)

# Define the metrics
metrics = [
    LLMContextRecall(),
    Faithfulness(),
    ContextPrecision(),
    ResponseRelevancy(),
    SemanticSimilarity(),
]

# Set our tracking server URI for logging
mlflow.set_tracking_uri(uri="http://localhost:8080")

# Create a new MLflow Experiment
mlflow.set_experiment("Question-Answering Evaluation 0.2")

# Enable LangChain autologging
# Note that models and examples are not required to be logged in order to log traces.
# Simply enabling autolog for LangChain via mlflow.langchain.autolog() will enable trace logging.
mlflow.langchain.autolog()

experiment_cost = 0
# Iterate through combinations of hyperparameters
for params in tqdm(
    unique_hyper_product,
    desc="Iterating on each unique Hyperparameter combination",
):
    # Create the run output dictionary
    params_with_tags = {
        k: v if not isinstance(v, dict) else list(v.keys())[0]
        for k, v in params.items()
    }
    output_map = {
        "params": params_with_tags,
        "user_prompt": list(params["user_prompt"].values())[0],
    }

    with mlflow.start_run():
        # Evaluate the run
        run_output = evaluate_run(params, metrics, evaluator_llm, evaluator_embeddings)

        # Add the cost of the experiment to the total cost
        experiment_cost += run_output["results_df"]["total_cost"].sum()

        # Log mlflow run
        log_run(output_map | run_output)


print(f"Total Cost of Experiment: {experiment_cost} USD")

Iterating on each unique Hyperparameter combination:   0%|          | 0/8 [00:00<?, ?it/s]

Running the QA chain on the evaluation dataset...
Evaluating the QA chain...


Exception raised in Job[1]: TimeoutError()
Evaluating: 100%|██████████| 20/20 [03:00<00:00,  9.00s/it]
2024/10/22 10:56:47 INFO mlflow.tracking._tracking_service.client: 🏃 View run whimsical-robin-126 at: http://localhost:8080/#/experiments/183325462269967356/runs/c03f781176ea42969492139d777fa200.
2024/10/22 10:56:47 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:8080/#/experiments/183325462269967356.
Iterating on each unique Hyperparameter combination:  12%|█▎        | 1/8 [03:37<25:24, 217.80s/it]

Logging the run output to MLflow...
Running the QA chain on the evaluation dataset...
Evaluating the QA chain...


Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output after 0 retries.
P

Logging the run output to MLflow...
Running the QA chain on the evaluation dataset...
Evaluating the QA chain...


Evaluating: 100%|██████████| 20/20 [00:21<00:00,  1.07s/it]
2024/10/22 11:00:53 INFO mlflow.tracking._tracking_service.client: 🏃 View run salty-grub-820 at: http://localhost:8080/#/experiments/183325462269967356/runs/939e3b0ca0f54a618755ebc1f22865e2.
2024/10/22 11:00:53 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:8080/#/experiments/183325462269967356.
Iterating on each unique Hyperparameter combination:  38%|███▊      | 3/8 [07:44<11:01, 132.25s/it]

Logging the run output to MLflow...
Running the QA chain on the evaluation dataset...
Evaluating the QA chain...


Evaluating: 100%|██████████| 20/20 [00:23<00:00,  1.17s/it]
2024/10/22 11:01:31 INFO mlflow.tracking._tracking_service.client: 🏃 View run zealous-foal-236 at: http://localhost:8080/#/experiments/183325462269967356/runs/bf3c9f3a7ba140c0a1ec9b2f18a6fca6.
2024/10/22 11:01:31 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:8080/#/experiments/183325462269967356.
Iterating on each unique Hyperparameter combination:  50%|█████     | 4/8 [08:21<06:18, 94.69s/it] 

Logging the run output to MLflow...
Running the QA chain on the evaluation dataset...
Evaluating the QA chain...
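The loop above passes `output_map | run_output` to `log_run`. The `|` operator on dicts requires Python 3.9 or later and returns a new dict in which the right-hand side wins on duplicate keys. The sketch below shows the merge with placeholder values (strings stand in for the DataFrames) together with the equivalent unpacking form for older interpreters.

```python
# Placeholder values; in the notebook these hold parameters and DataFrames.
output_map = {"params": {"temperature": 0.1}, "user_prompt": "prompt text"}
run_output = {"eval_df": "eval placeholder", "results_df": "results placeholder"}

# Python 3.9+: dict union returns a new dict and leaves both inputs unchanged.
merged = output_map | run_output

# Equivalent form that also works on older Python 3 versions.
merged_unpacked = {**output_map, **run_output}

# On a key collision the right-hand operand takes precedence.
collision = {"a": 1} | {"a": 2}
print(list(merged), collision)
```

Here the two dicts share no keys, so the merge simply gives `log_run` the four entries it reads.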




## Tracing

- MLflow documentation on automatic tracing: https://mlflow.org/docs/latest/llms/tracing/index.html#automatic-tracing
- Loka `mlops-tech-stack` MLflow tracing example: https://github.com/LokaHQ/mlops-tech-stack/tree/main/monitoring/mlflow-tracing

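## Tracking Server

- MLflow documentation on the tracking server: https://mlflow.org/docs/latest/tracking/server.html
- Loka `mlops-tech-stack` SageMaker managed MLflow example: https://github.com/LokaHQ/mlops-tech-stack/tree/main/experiment-tracking/sagemaker-mlflow-managed
- Loka `mlops-tech-stack` MLflow experiment tracking example: https://github.com/LokaHQ/mlops-tech-stack/tree/main/experiment-tracking/mlflow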

## Prompt Engineering UI

- MLflow documentation on the Prompt Engineering UI: https://mlflow.org/docs/latest/llms/prompt-engineering/index.html