<a href="https://colab.research.google.com/github/anshupandey/Generative-AI-for-Professionals/blob/main/LLM_Evaluation_Question_Answering_Evaluation_with_mlflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Course Name: LLM Conversational Engagement**

---



## Module: Analytics and performance metrics in conversational AI systems
## Lab: LLM Question Answering Evaluation with MLflow

---




**Installing required libraries for this Lab**

In [1]:
!pip install -U -q openai mlflow pandas evaluate tiktoken



Import the required libraries and set the OPENAI_API_KEY environment variable from a value the user enters at runtime through getpass, so the API key is never hard-coded or displayed in the notebook.







In [2]:
import openai
import pandas as pd
import mlflow
import os, getpass

os.environ["OPENAI_API_KEY"]=getpass.getpass()


··········
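
When the notebook is rerun in the same session, the key is usually already set and prompting again is unnecessary. A minimal sketch, assuming you only want to prompt when the variable is missing; `resolve_api_key` and its `env` and `prompt` parameters are hypothetical names introduced here for illustration and are not part of the original lab.

```python
import getpass
import os


def resolve_api_key(env=os.environ, prompt=getpass.getpass):
    """Return the OpenAI key from env, prompting only when it is missing.

    Hypothetical helper for illustration; not part of the original lab.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so the key never shows up in cell output.
        key = prompt("OpenAI API key: ").strip()
        if not key:
            raise ValueError("An OpenAI API key is required to run the evaluation.")
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `resolve_api_key()` in place of the direct assignment keeps the same behavior on a fresh session and skips the prompt afterwards.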


Create a small test set: inputs holds the questions passed to the model, and ground_truth holds the reference answers that the model's generated outputs are compared against.

In [3]:
import pandas as pd

eval_df = pd.DataFrame(
    {
        "inputs": [
            "How is the useEffect() function utilized?",
            "What is the significance of the static keyword within a function?",
            "Explain the purpose of the 'finally' block in Python.",
            "Differentiate between multiprocessing and multithreading?",
        ],
        "ground_truth": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
    }
)


**Model Logging and Evaluation:**
* The script uses MLflow to manage the machine learning lifecycle, including model logging and evaluation.
* Within the mlflow.start_run() context, the script defines a system prompt and logs a model with mlflow.openai.log_model(). The call specifies the model (gpt-3.5-turbo), the task (openai.chat.completions), and the location to store the model (artifact_path="model").
* The messages argument holds the prompt template as a list of role and content pairs: the system prompt, followed by a user message whose {question} placeholder is filled with each input.
* mlflow.evaluate() then runs the logged model on the provided DataFrame (eval_df) and compares its responses to the ground_truth column, using the metrics selected by the question-answering model type and the default evaluator.
* Finally, results.metrics displays the aggregate evaluation metrics once the evaluation is complete.
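
The role and content structure with its {question} placeholder can be illustrated without calling the API. The sketch below assumes plain `str.format` substitution; `build_messages` is a hypothetical helper written for this illustration, and MLflow's own template handling may differ in detail.

```python
def build_messages(system_prompt, question):
    """Fill the chat template for one question (hypothetical helper, for illustration)."""
    template = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "{question}"},
    ]
    # Substitute the placeholder in every message; this system prompt
    # contains no placeholder, so it passes through unchanged.
    return [
        {"role": m["role"], "content": m["content"].format(question=question)}
        for m in template
    ]


messages = build_messages(
    "Answer the following question in two sentences",
    "Explain the purpose of the 'finally' block in Python.",
)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

Each row of the evaluation data produces one such message list, which is what the chat completions task receives.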


In [4]:
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )
results.metrics


2024/05/13 09:07:10 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/05/13 09:07:10 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/05/13 09:07:13 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...





{'toxicity/v1/mean': 0.00024036074682953767,
 'toxicity/v1/variance': 4.567699126591084e-09,
 'toxicity/v1/p90': 0.00031514947477262467,
 'toxicity/v1/ratio': 0.0,
 'exact_match/v1': 0.0}
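
An exact_match/v1 of 0.0 is expected here: free-form answers almost never reproduce the reference text character for character. A minimal sketch of the idea, assuming exact match is the fraction of predictions that equal their targets as strings; the `exact_match` function below is a hypothetical illustration, and MLflow's implementation may differ in detail.

```python
def exact_match(predictions, targets):
    """Fraction of predictions that equal their target string exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(predictions)


# These strings are made up for illustration.
targets = ["Paris", "A finally block always runs."]
predictions = ["Paris", "The finally block runs whether or not an error occurs."]
print(exact_match(predictions, targets))  # 0.5
```

The second prediction is a correct paraphrase yet counts as a miss, which is why a metric that judges meaning rather than surface form is more informative for question answering.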

In [5]:
results.tables["eval_results_table"]



| | inputs | ground_truth | outputs | token_count | toxicity/v1/score |
|---|---|---|---|---|---|
| 0 | How is the useEffect() function utilized? | The useEffect() hook tells React that your com... | The `useEffect()` hook in React is utilized to... | 55 | 0.000267 |
| 1 | What is the significance of the static keyword... | Static members belongs to the class, rather th... | The static keyword within a function in C or C... | 58 | 0.000157 |
| 2 | Explain the purpose of the 'finally' block in ... | 'Finally' defines a block of code to run when ... | The 'finally' block in Python is used to defin... | 57 | 0.000336 |
| 3 | Differentiate between multiprocessing and mult... | Multithreading refers to the ability of a proc... | Multiprocessing involves running multiple proc... | 64 | 0.000202 |


**LLM-judged correctness with OpenAI GPT-4**

This step uses MLflow's mlflow.metrics.genai module, which provides metrics scored by an LLM judge. The EvaluationExample class is instantiated with several parameters:

* input: This is the question or prompt given to the model.
* output: This is the model's response to the input.
* score: This is a numerical score representing the quality of the model's output, with 5 being the highest. The score of 4 used here indicates a high-quality response with room for improvement.
* justification: This provides a reason for the assigned score. In this case, it explains that the definition provided by the model was effective, but it could have been more concise.
* grading_context: This is the ground truth or target answer against which the model's output is being evaluated.

The answer_similarity metric is then constructed with OpenAI's GPT-4 model as the judge. For each row, the judge is given the model's output, the target answer from the grading context, a grading rubric, and the examples, and it returns a numerical score together with a justification.

In [6]:
from mlflow.metrics.genai import EvaluationExample, answer_similarity
# Create an example that shows the judge what a scored answer looks like for this problem.
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is, "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "targets": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
# Construct the metric using OpenAI GPT-4 as the judge
answer_similarity_metric = answer_similarity(model="openai:/gpt-4", examples=[example])
print(answer_similarity_metric)


EvaluationMetric(name=answer_similarity, greater_is_better=True, long_name=answer_similarity, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's answer_similarity based on the rubric
justification: Your reasoning about the model's answer_similarity score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_similarity based on the input and output.
A definition of answer_similarity and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them be
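
The prompt printed above asks the judge to reply in two lines, a score and a justification. A minimal sketch of parsing such a reply, assuming the judge follows that format; `parse_judge_reply` is a hypothetical helper written for illustration and is not MLflow's internal parser.

```python
import re


def parse_judge_reply(reply):
    """Extract (score, justification) from a 'score: ...' / 'justification: ...' reply."""
    score_match = re.search(r"score:\s*(\d+)", reply, flags=re.IGNORECASE)
    just_match = re.search(
        r"justification:\s*(.+)", reply, flags=re.IGNORECASE | re.DOTALL
    )
    if score_match is None or just_match is None:
        # A judge that ignores the format gives nothing reliable to record.
        return None, None
    return int(score_match.group(1)), just_match.group(1).strip()


reply = "score: 4\njustification: The answer matches the target but omits one detail."
print(parse_judge_reply(reply))
```

Returning `(None, None)` for a malformed reply keeps a single bad judge response from being mistaken for a real score.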

**Call mlflow.evaluate() again but with your new answer_similarity_metric**

The `with mlflow.start_run() as run:` statement starts a new MLflow run. Everything logged inside the block is recorded to this run until the context manager exits.

Inside the context manager, mlflow.evaluate() is called with the following arguments:
* basic_qa_model.model_uri: This is the URI of the logged model which you want to evaluate.
* eval_df: This is a DataFrame containing the data to evaluate the model against. In this context, it includes questions (inputs) and their corresponding correct answers (ground_truth).
* targets="ground_truth": This specifies the column in eval_df that contains the correct answers against which the model's predictions will be evaluated.
* model_type="question-answering": This indicates the type of model being evaluated. Specific evaluators or metrics may be triggered based on the model type.
* evaluators="default": This uses the default set of evaluators provided by MLflow for the specified model type.
* extra_metrics=[answer_similarity_metric]: This adds the custom similarity metric defined earlier to the evaluation. The answer_similarity_metric is expected to compute the semantic similarity between the model's output and the target answer, adding an additional layer of analysis to the evaluation.

Once the evaluation is complete, results.metrics is read outside the with block. It contains the default metrics for the question-answering task as well as the custom similarity metric that was passed to mlflow.evaluate().


In [7]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[answer_similarity_metric],  # use the answer similarity metric created above
    )
results.metrics


2024/05/13 09:07:43 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/05/13 09:07:43 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/05/13 09:07:44 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



{'toxicity/v1/mean': 0.00022141765293781646,
 'toxicity/v1/variance': 4.083352028979105e-09,
 'toxicity/v1/p90': 0.0002911611110903323,
 'toxicity/v1/ratio': 0.0,
 'exact_match/v1': 0.0,
 'answer_similarity/v1/mean': 4.25,
 'answer_similarity/v1/variance': 0.1875,
 'answer_similarity/v1/p90': 4.7}

In [8]:
results.tables["eval_results_table"]


| | inputs | ground_truth | outputs | token_count | toxicity/v1/score | answer_similarity/v1/score | answer_similarity/v1/justification |
|---|---|---|---|---|---|---|---|
| 0 | How is the useEffect() function utilized? | The useEffect() hook tells React that your com... | The useEffect() function is utilized in React ... | 56 | 0.000199 | 4 | The output effectively explains what the useEf... |
| 1 | What is the significance of the static keyword... | Static members belongs to the class, rather th... | Within a function, the static keyword is used ... | 50 | 0.000186 | 4 | The output accurately describes the function o... |
| 2 | Explain the purpose of the 'finally' block in ... | 'Finally' defines a block of code to run when ... | The 'finally' block in Python is used to defin... | 56 | 0.000331 | 5 | The model's output closely aligns with the pro... |
| 3 | Differentiate between multiprocessing and mult... | Multithreading refers to the ability of a proc... | Multiprocessing involves running multiple proc... | 42 | 0.000169 | 4 | The model's output aligns with the provided ta... |
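
The aggregate answer_similarity values in results.metrics can be reproduced from the per-row scores in the table. A minimal sketch, assuming a population variance and a linearly interpolated 90th percentile; these assumptions agree with the values reported above but are not taken from MLflow's source, and `percentile` is a helper defined here for illustration.

```python
import math
import statistics


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (q / 100) * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# answer_similarity/v1/score column from the table above
scores = [4, 4, 5, 4]

print(statistics.fmean(scores))          # 4.25
print(statistics.pvariance(scores))      # 0.1875
print(round(percentile(scores, 90), 6))  # 4.7
```

With only four rows, a single score of 5 is enough to move the p90 to 4.7, so these aggregates are best read alongside the per-row justifications.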


To browse the logged runs and their metrics, start the MLflow UI from a terminal:

>  mlflow ui

Then open the UI at http://127.0.0.1:5000

## Thank You