# **Vertex Evaluation For LLM - Demo**

**Authors**: jsndai@

This notebook showcases how to launch a Kubeflow pipeline (KFP) using the [`LLM Evaluation Component`](google3/third_party/py/vertexevaluation/llm/component/eval_component.py) for Generative Language Models on Vertex AI Managed Pipelines.

***Please make a copy of this notebook to execute your own pipelines.***

Terms of Service: This content is experimental functionality covered by the Pre-GA Offerings Terms of your Google Cloud Platform [Terms of Service](https://cloud.google.com/terms).

# Instructions

# ***Please create a COPY of this colab before running.***


1. Please follow these sections below:
   - `Setup`
   - `Configure your GCP project`
   - `Test Vertex SDK for LLM Evaluation`
   - `Test LLM Evaluation Pipeline`

2. Update the pipeline parameters in the `Define the Inputs Specific to the pipeline` section if you would like to customize the evaluation job.

3. If any bugs arise or a pipeline fails, please file a ticket [here](https://b.corp.google.com/issues/new?component=865810&template=1816845).

#  1. Setup

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

### To use the current version of LLM Eval in the Vertex SDK, you'll need:

* A tuned LLM model in the prod environment. This notebook uses one in the `pyc-llm-dev` GCP project; you can use one in your own project if you'd like.

* Read access to the `gs://vertex_sdk_private_releases/` bucket (you should already have access)

## Authenticate your GCP account

In [1]:
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth as google_auth
  google_auth.authenticate_user()

## Install dependencies

Please make sure to click on the "RESTART RUNTIME" button in the output after pip install completes.

In [2]:
# Install the Google Cloud Pipeline Components (GCPC) & Vertex SDK.
import os

if not os.getenv("IS_TESTING"):
    USER = "--user"
else:
    USER = ""
!pip3 install {USER} --upgrade google-cloud-aiplatform -q --no-warn-conflicts
!pip3 install {USER} --upgrade google-cloud-pipeline-components -q --no-warn-conflicts
!pip3 install {USER} --upgrade kfp -q --no-warn-conflicts


In [3]:
# !gsutil cp gs://vertex_sdk_private_releases/sara_test/google_cloud_aiplatform-1.26.dev20230530+language.models.eval-py2.py3-none-any.whl .


In [4]:
# !pip uninstall protobuf

In [5]:
# Installing the SDK from a whl file
# !gsutil cp gs://vertex_sdk_private_releases/sara_test/google_cloud_aiplatform-1.26.dev20230530+language.models.eval-py2.py3-none-any.whl .

!pip install google_cloud_aiplatform-1.26.dev20230530+language.models.eval-py2.py3-none-any.whl "shapely<2.0.0" --force-reinstall --user

Processing ./google_cloud_aiplatform-1.26.dev20230530+language.models.eval-py2.py3-none-any.whl
Collecting shapely<2.0.0
  Using cached Shapely-1.8.5.post1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.0 MB)
Collecting google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.0.0dev,>=1.32.0 (from google-cloud-aiplatform==1.26.dev20230530+language.models.eval)
  Obtaining dependency information for google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.0.0dev,>=1.32.0 from https://files.pythonhosted.org/packages/c4/1e/924dcad4725d2e697888e044edf7a433db84bf9a3e40d3efa38ba859d0ce/google_api_core-2.14.0-py3-none-any.whl.metadata
  Using cached google_api_core-2.14.0-py3-none-any.whl.metadata (2.6 kB)
Collecting proto-plus<2.0.0dev,>=1.22.0 (from google-cloud-aiplatform==1.26.dev20230530+language.models.eval)
  Obtaining dependency information for proto-plus<2.0.0dev,>=1.22.0 from https://files.pythonhosted.

### Restart the kernel runtime

In [1]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Import modules

In [2]:
# !pip install etils
# !pip install kfp --user

In [3]:
# from google.cloud import aiplatform
# print(f"Vertex AI SDK version: {aiplatform.__version__}")



In [4]:
!pip show google.cloud.aiplatform 

Name: google-cloud-aiplatform
Version: 1.26.dev20230530+language.models.eval
Summary: Vertex AI API client library
Home-page: https://github.com/googleapis/python-aiplatform
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /home/jupyter/.local/lib/python3.10/site-packages
Requires: google-api-core, google-cloud-bigquery, google-cloud-resource-manager, google-cloud-storage, packaging, proto-plus, protobuf, shapely
Required-by: google-cloud-pipeline-components


In [5]:
# Import required modules
import os
import sys
import uuid
# import kfp
# from etils import epath
from google.cloud import aiplatform

In [6]:
# Append to sys.path in this kernel; `!python -c ...` would only change a separate subprocess.
sys.path.append('/opt/conda/envs/pytorch/lib/python3.10/site-packages/')

In [7]:
import vertexai
from vertexai.preview import language_models

# from vertexai.preview import language_models_eval

# import google.cloud.aiplatform #private_preview import language_models_eval #_evaluatable_language_models

from google.cloud.aiplatform.private_preview.language_models_eval import _evaluatable_language_models

# 2. Configure your GCP project

## Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [8]:
# PROJECT_ID = "pyc-llm-dev"  # @param {type:"string"}
REGION = "us-central1"  # @param {type: "string"}

VERTEX_API_PROJECT = PROJECT_ID = "my-project-0004-346516" #'your-project' #@param {"type": "string"}
# REGION = LOCATION = "europe-west4" # 'us-central1' #"europe-west4"
GCS_BUCKET = STAGING_BUCKET = DATA_STAGING_GCS_LOCATION = 'gs://my-project-0004-346516' #"my-project-0004-346516-vertex-pipelines-europe-west4"


# Initialize Vertex AI SDK
import vertexai
vertexai.init(project=PROJECT_ID, location=REGION)

In [9]:
aiplatform.init(
    project=PROJECT_ID,
    location=REGION
)
!gcloud config set project {PROJECT_ID}

Updated property [core/project].


## Configure a test GCS bucket


In [10]:
BUCKET_NAME = "my-project-0004-346516" # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

# bucket_path = epath.Path(BUCKET_URI)

# SDK Configuration
vertexai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Create a Cloud Storage bucket

Create a storage bucket to hold intermediate artifacts, such as datasets, and the output performance metrics files.


**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [11]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Creating gs://my-project-0004-346516/...
ServiceException: 409 A Cloud Storage bucket named 'my-project-0004-346516' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


## Configure GCS folder for running pipelines

Evaluation-related files (evaluation metrics, batch prediction results) will
be saved to the GCS bucket. The pipeline does not clean up these files, since
some of them might be useful to you. Please make sure to clean them up when
they are no longer needed.

In [12]:
gcs_base_path = f"{BUCKET_URI}/eval-fishfooding-pipelines"
gcs_base_path

'gs://my-project-0004-346516/eval-fishfooding-pipelines'
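The `gcs_base_path` above is a `gs://` URI, while the `google-cloud-storage` client works with a bucket name and an object path. The following is a minimal sketch of that split; `split_gcs_uri` is a hypothetical helper name and is not part of the Vertex SDK.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # partition() keeps everything after the first '/' as the object path.
    bucket, _, blob = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    return bucket, blob


bucket_name, blob_name = split_gcs_uri(
    "gs://my-project-0004-346516/eval-fishfooding-pipelines"
)
print(bucket_name)  # my-project-0004-346516
print(blob_name)    # eval-fishfooding-pipelines
```

Raising on a malformed URI, rather than slicing off the first five characters unconditionally, makes a wrong input fail early with a readable message.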

# 3. Test LLM Evaluation Pipeline

## Load compiled template pipeline

In [13]:
from google.cloud import storage
import json

storage_client = storage.Client()

# Define the function to read metrics content from GCS
def get_metrics_blob(job, nlp_task):
  expected_task_name = "model-evaluation-text-generation" if nlp_task != "" else "model-evaluation-classification"
  task_detail = None
  for detail in job.task_details:
    if detail.task_name == expected_task_name:
      task_detail = detail
  if not task_detail:
    # Fail here with a clear message instead of an AttributeError on the next lines.
    raise ValueError(f"Not able to find the task {expected_task_name}.")
  metrics_uri = None
  for k, v in task_detail.outputs.items():
    if k != "evaluation_metrics":
      continue
    for artifact in v.artifacts:
      if artifact.display_name == "evaluation_metrics":
        metrics_uri = artifact.uri[5:]
  if not metrics_uri:
    # Fail here with a clear message instead of an AttributeError on the next line.
    raise ValueError("Not able to find the metric.")
  splits = metrics_uri.split("/")
  bucket_name = splits[0]
  blob_name = '/'.join(splits[1:])
  bucket = storage_client.bucket(bucket_name)
  blob = bucket.blob(blob_name)
  with blob.open("r") as f:
    return json.loads(f.read())

# Define the function to plot confusion matrix
import matplotlib
matplotlib.use('Agg')
%matplotlib inline

import matplotlib.pyplot as plt
import numpy
from sklearn import metrics

def plot_confusion_matrix(job, nlp_task):
  overall_metrics = get_metrics_blob(job, nlp_task)
  confusion_matrix = []
  for slice_metric in overall_metrics['slicedMetrics']:
    if 'value' in slice_metric['singleOutputSlicingSpec']:
      continue
    if 'confusionMatrix' not in slice_metric['metrics']['classification']:
      print("No Confusion Matrix found")
      print(f"Evaluation metrics is: {slice_metric}")
      return
    for row in slice_metric['metrics']['classification']['confusionMatrix']['rows']:
      confusion_matrix.append(row['dataItemCounts'])
  # Plot the matrix
  confusion_matrix = numpy.array(confusion_matrix)

  cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = evaluation_class_labels)

  fig, ax = plt.subplots(figsize=(8,8))
  cm_display.plot(ax=ax)
  plt.show()


# Define the function to print nlp metrics
from tabulate import tabulate

def print_nlp_metrics(job, nlp_task):
  metrics = get_metrics_blob(job, nlp_task)
  metric_names = []
  if nlp_task == "question-answering":
    metric_names = ["exact_match"]
  elif nlp_task == "summarization":
    metric_names = ["rougeLSum"]
  else:
    metric_names = ["bleu", "rougeLSum"]
  table = [metric_names, [metrics[metric_name] for metric_name in metric_names]]
  print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))

# Define the function to print classification metrics
def print_classification_metrics(job, nlp_task):
  all_metrics = get_metrics_blob(job, nlp_task)['slicedMetrics']
  overall_metrics = all_metrics[0]['metrics']['classification']
  metric_names = ["Metric Slice", "auPrc", "auRoc", "logLoss"]
  f1_metrics = ["f1Score"]
  aggregated_f1_metrics = ["f1ScoreMicro", "f1ScoreMacro"]
  table = [metric_names + f1_metrics + aggregated_f1_metrics]
  for metrics in all_metrics:
    classification_metric = metrics['metrics']['classification']
    slice_name = "class - " + metrics['singleOutputSlicingSpec']['value'] if 'value' in metrics['singleOutputSlicingSpec'] else "Overall"
    slice_metric_values = [slice_name]
    slice_metric_values.extend([classification_metric.get(metric_name, 0) for metric_name in metric_names[1:]])
    slice_metric_values.extend([classification_metric['confidenceMetrics'][0].get(metric_name, 0) for metric_name in f1_metrics])
    slice_metric_values.extend([classification_metric['confidenceMetrics'][0].get(metric_name, 'n/a') for metric_name in aggregated_f1_metrics])
    table.append(slice_metric_values)
  print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))

# Define the function to print confidence metrics
def print_confidence_metrics(job, nlp_task, expected_confidence_threshold):
  all_metrics = get_metrics_blob(job, nlp_task)['slicedMetrics']
  confidence_metric_names = ["Metric Slice", "recall", "precision", "falsePositiveRate", "f1Score", "truePositiveCount", "falsePositiveCount"]
  table = [confidence_metric_names]
  for metrics in all_metrics:
    classification_metric = metrics['metrics']['classification']
    slice_name = "class - " + metrics['singleOutputSlicingSpec']['value'] if 'value' in metrics['singleOutputSlicingSpec'] else "Overall"
    slice_metric_values = [slice_name]
    confidence_metrics = None
    found_threshold_distance = 1
    for candidate in classification_metric['confidenceMetrics']:
      confidence_threshold = candidate['confidenceThreshold'] if 'confidenceThreshold' in candidate else 0
      if abs(expected_confidence_threshold-confidence_threshold) <= found_threshold_distance:
        confidence_metrics = candidate
        found_threshold_distance = abs(expected_confidence_threshold-confidence_threshold)
    slice_metric_values.extend([confidence_metrics.get(metric_name, 0) for metric_name in confidence_metric_names[1:]])
    table.append(slice_metric_values)
  print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))



evaluation_llm_text_generation_pipeline = "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"#@param {type:"string"}
evaluation_llm_classification_pipeline = "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-classification-pipeline/1.0.1"#@param {type:"string"}

In [14]:
# !wget "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"

In [15]:
# Model in prod environment from `pyc-llm-dev` project.
model_name = "projects/255766800726/locations/us-central1/models/1231208931527753728" #@param {type:"string"}
model_name = "projects/255766800726/locations/us-central1/models/998292786346196992"

# model_name = "projects/255766800726/locations/us-central1/models/8377665076164820992" #@param {type:"string"}



# "publishers/google/models/text-bison@001"
# ""


## Select your test dataset file.



Select a public test dataset for Fishfooding. You are also encouraged to use your own test dataset for testing.

Task Type | GCS URI
--------- | -------
Text Generation | gs://vertex-evaluation-llm-dataset-us-central1/test_datasets/text_generation_bp_input_with_ground_truth.jsonl
Text Classification | gs://vertex-evaluation-llm-dataset-us-central1/test_datasets/llm_classification_bp_input_prompts_with_ground_truth.jsonl
Question Answering | gs://vertex-evaluation-llm-dataset-us-central1/test_datasets/qa_bp_input.jsonl
Summarization | gs://vertex-evaluation-llm-dataset-us-central1/test_datasets/summarization_bp_input.jsonl



In [16]:
# This is a shared GCS location that may not be accessible to the pipeline's service account, so copy the file to your own bucket first.

batch_predict_gcs_source_uris = 'gs://vertex-evaluation-llm-dataset-us-central1/test_datasets/text_generation_bp_input_with_ground_truth.jsonl'#@param {type:"string"}

In [17]:
! gsutil cp -r $batch_predict_gcs_source_uris gs://my-project-0004-346516/eval-fishfooding-pipelines

# ! gsutil cp -r $batch_predict_gcs_source_uris gs://my-project-0004-346516my-project-0004-346516

Copying gs://vertex-evaluation-llm-dataset-us-central1/test_datasets/text_generation_bp_input_with_ground_truth.jsonl [Content-Type= text/plain]...
/ [1 files][  1.5 KiB/  1.5 KiB]                                                
Operation completed over 1 objects/1.5 KiB.                                      


In [18]:
batch_predict_gcs_source_uris = 'gs://my-project-0004-346516/eval-fishfooding-pipelines/text_generation_bp_input_with_ground_truth.jsonl'#@param {type:"string"}

In [19]:
# Peek at your BP input file.
! gsutil cat $batch_predict_gcs_source_uris | head -n 5

{"prompt":"Basketball teams in the Midwest.", "ground_truth":"There are several basketball teams located in the Midwest region of the United States. Here are some of them:"}
{"prompt":"How to bake gluten-free bread?", "ground_truth":"Baking gluten-free bread can be a bit challenging because gluten is the protein that gives bread its structure and elasticity."}
{"prompt":"Want to buy a new phone.", "ground_truth":"Great! There are many factors to consider when buying a new phone, including your budget, preferred operating system, desired features, and more. Here are some general steps to follow to help you make an informed decision:"}
{"prompt":"I told them \"see you tomorrow\"", "ground_truth":"If you told someone \"see you tomorrow,\" you most likely meant that you will see them the following day. This is a common phrase used when saying goodbye to someone with the intention of seeing them again soon. If you are unable to meet with them as planned, it is always polite to let them know

In [20]:
target_field_name='instance.ground_truth' #@param {type:"string"}
prediction_field_name='predictions.content' #@param {type:"string"}
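Each line of the batch prediction input peeked at above is a JSON object with a `prompt` and a `ground_truth` field. Before submitting a pipeline with your own dataset, it can help to check the file locally. The sketch below assumes those two keys are the only required ones; `find_invalid_lines` is a hypothetical helper, and the sample records are shortened illustrations rather than the real dataset.

```python
import json
from typing import Iterable, List

REQUIRED_KEYS = ("prompt", "ground_truth")


def find_invalid_lines(lines: Iterable[str]) -> List[int]:
    """Return 1-based numbers of lines that are not JSON objects with the required keys."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Ignore blank lines.
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(record, dict) or any(key not in record for key in REQUIRED_KEYS):
            bad.append(number)
    return bad


sample = [
    '{"prompt": "How to bake gluten-free bread?", "ground_truth": "Baking gluten-free bread can be a bit challenging."}',
    '{"prompt": "Want to buy a new phone."}',
    'not json',
]
print(find_invalid_lines(sample))  # [2, 3]
```

For a file in GCS, the same function can be applied to the lines of a locally downloaded copy.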

## [Option 1] Submit an Eval Pipeline for QA task

### Define the Inputs Specific to the pipeline

In [26]:
evaluation_task = nlp_task = 'question-answering' #@param {type:"string"}


### Submit a pipeline to Vertex

In [22]:
# We need to provide a value for every argument that does not have a default value.
parameters = {
    "project": PROJECT_ID,
    "location": REGION,
    "evaluation_task": evaluation_task,
    "batch_predict_gcs_source_uris": [batch_predict_gcs_source_uris],
    "batch_predict_gcs_destination_output_uri": gcs_base_path,
    "model_name": model_name,
}

job_id = "fishfood-llm-eval-test-QA-{}".format(uuid.uuid4()).lower()
job = aiplatform.PipelineJob(
    display_name=job_id,
    template_path=evaluation_llm_text_generation_pipeline,
    job_id=job_id,
    pipeline_root=gcs_base_path,
    parameter_values=parameters,
    enable_caching=False,
)

job.run()

Creating PipelineJob
PipelineJob created. Resource name: projects/255766800726/locations/us-central1/pipelineJobs/fishfood-llm-eval-test-qa-9c54fb6c-ebaf-4bd0-96dd-fcbdee75d760
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/255766800726/locations/us-central1/pipelineJobs/fishfood-llm-eval-test-qa-9c54fb6c-ebaf-4bd0-96dd-fcbdee75d760')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/fishfood-llm-eval-test-qa-9c54fb6c-ebaf-4bd0-96dd-fcbdee75d760?project=255766800726
PipelineJob projects/255766800726/locations/us-central1/pipelineJobs/fishfood-llm-eval-test-qa-9c54fb6c-ebaf-4bd0-96dd-fcbdee75d760 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/255766800726/locations/us-central1/pipelineJobs/fishfood-llm-eval-test-qa-9c54fb6c-ebaf-4bd0-96dd-fcbdee75d760 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/255766800726/locations/us-central1/pipelin

### View the QA task Evaluation metrics

In [27]:
print_nlp_metrics(job, nlp_task)

╒═══════════════╕
│   exact_match │
╞═══════════════╡
│             0 │
╘═══════════════╛
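For intuition about the `exact_match` value above, here is a minimal sketch of an exact-match score: the fraction of predictions that are string-identical to their reference. This is an illustration only. How the evaluation component normalizes text before comparing (whitespace, casing, punctuation) is not documented in this notebook, so the whitespace stripping below is an assumption.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)


print(exact_match(["Paris", "42 "], ["Paris", "43"]))  # 0.5
```

A score of 0 is expected whenever free-form generations are compared against long reference answers, because even a single differing word counts as a miss.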


## [Option 2] Submit an Eval Pipeline for Summarization task

### Define the Inputs Specific to the pipeline

In [28]:
evaluation_task = 'summarization' #@param {type:"string"}


### Submit a pipeline to Vertex

In [29]:
# We need to provide a value for every argument that does not have a default value.
parameters = {
    "project": PROJECT_ID,
    "location": REGION,
    "evaluation_task": evaluation_task,
    "batch_predict_gcs_source_uris": [batch_predict_gcs_source_uris],
    "batch_predict_gcs_destination_output_uri": gcs_base_path,
    "model_name": model_name,

}

job_id = "fishfood-llm-eval-test-summarization-{}".format(uuid.uuid4()).lower()
job = aiplatform.PipelineJob(
    display_name=job_id,
    template_path=evaluation_llm_text_generation_pipeline,
    job_id=job_id,
    pipeline_root=gcs_base_path,
    parameter_values=parameters,
    enable_caching=False,
)

job.run()

Creating PipelineJob
PipelineJob created. Resource name: projects/255766800726/locations/us-central1/pipelineJobs/fishfood-llm-eval-test-summarization-4f35dcc0-74dc-4b50-971d-689ef2358c40
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/255766800726/locations/us-central1/pipelineJobs/fishfood-llm-eval-test-summarization-4f35dcc0-74dc-4b50-971d-689ef2358c40')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/fishfood-llm-eval-test-summarization-4f35dcc0-74dc-4b50-971d-689ef2358c40?project=255766800726
PipelineJob projects/255766800726/locations/us-central1/pipelineJobs/fishfood-llm-eval-test-summarization-4f35dcc0-74dc-4b50-971d-689ef2358c40 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/255766800726/locations/us-central1/pipelineJobs/fishfood-llm-eval-test-summarization-4f35dcc0-74dc-4b50-971d-689ef2358c40 current state:
PipelineState.PIPELINE_STATE_RUNNING
Pipeline

### View the Summarization task Evaluation metrics

In [31]:
print_nlp_metrics(job, evaluation_task)

╒═════════════╕
│   rougeLSum │
╞═════════════╡
│   0.0888592 │
╘═════════════╛
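To give a rough idea of what a ROUGE-L style score measures, the sketch below computes a plain ROUGE-L F1 from the longest common subsequence (LCS) of whitespace-separated tokens. It is a simplified illustration, not the pipeline's implementation: the `rougeLSum` metric reported above is a summary-level variant, and the tokenization and lowercasing used here are assumptions.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(f"{score:.3f}")  # 0.833
```

In the example, five of the six tokens form a common subsequence, so precision and recall are both 5/6 and so is the F1.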


## [Option 3] Submit an Eval Pipeline for General Text Generation task

### Define the Inputs Specific to the pipeline

In [None]:
evaluation_task = 'text-generation' #@param {type:"string"}

### Submit a pipeline to Vertex

In [None]:
# We need to provide a value for every argument that does not have a default value.
parameters = {
    "project": PROJECT_ID,
    "location": REGION,
    "evaluation_task": evaluation_task,
    "batch_predict_gcs_source_uris": [batch_predict_gcs_source_uris],
    "batch_predict_gcs_destination_output_uri": gcs_base_path,
    "model_name": model_name,
}

job_id = "fishfood-llm-eval-test-text-generation-{}".format(uuid.uuid4()).lower()
job = aiplatform.PipelineJob(
    display_name=job_id,
    template_path=evaluation_llm_text_generation_pipeline,
    job_id=job_id,
    pipeline_root=gcs_base_path,
    parameter_values=parameters,
    enable_caching=False,
)

job.run()

### View the General Text Generation task Evaluation metrics

In [None]:
print_nlp_metrics(job, evaluation_task)

## [Option 4] Submit an Eval Pipeline for Classification task

### Define the Inputs Specific to the pipeline

In [None]:
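target_field_name='ground_truth' #@param {type:"string"}
evaluation_class_labels=['nature', 'news', 'sports', 'health', 'startups'] #@param
# An empty task name makes get_metrics_blob look up the classification metrics task.
evaluation_task = nlp_task = ''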

### Submit a pipeline to Vertex

In [None]:
# We need to provide a value for every argument that does not have a default value.
parameters = {
    "project": PROJECT_ID,
    "location": REGION,
    "batch_predict_gcs_destination_output_uri": gcs_base_path,
    "evaluation_class_labels": evaluation_class_labels,
    "batch_predict_gcs_source_uris": [batch_predict_gcs_source_uris],
    "target_field_name": target_field_name,
    "model_name": model_name,
}

job_id = "fishfood-llm-eval-test-classification-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    template_path=evaluation_llm_classification_pipeline,
    job_id=job_id,
    pipeline_root=gcs_base_path,
    parameter_values=parameters,
    enable_caching=False,
)

job.run()

### View the Classification task Evaluation metrics for whole Dataset and each Class

In [None]:
print_classification_metrics(job, evaluation_task)

### Spot check Confusion Matrix for the whole Dataset

In [None]:
# plot_confusion_matrix draws the figure and returns nothing, so there is no result to keep.
plot_confusion_matrix(job, evaluation_task)
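Beyond the plot, a confusion matrix can be summarized numerically. The sketch below derives per-class recall from the matrix rows. It assumes that `rows[i][j]` counts items whose ground-truth label is class `i` and whose predicted label is class `j`; please confirm that orientation against your own metrics file. The counts and the three-label list are made up for illustration, and `per_class_recall` is a hypothetical helper.

```python
from typing import Dict, List


def per_class_recall(rows: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Recall per class, assuming rows[i][j] = items with true label i predicted as label j."""
    recall = {}
    for i, label in enumerate(labels):
        total = sum(rows[i])  # All items whose true label is `label`.
        recall[label] = rows[i][i] / total if total else 0.0
    return recall


# Hypothetical counts, not output from a real pipeline run.
rows = [[8, 2, 0], [1, 9, 0], [0, 5, 5]]
labels = ["nature", "news", "sports"]
print(per_class_recall(rows, labels))  # {'nature': 0.8, 'news': 0.9, 'sports': 0.5}
```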

### Spot check Confidence Metrics for the whole Dataset and each Class

In [None]:
print_confidence_metrics(job, nlp_task, expected_confidence_threshold=0.6)
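`print_confidence_metrics` reports, for each slice, the metrics entry whose `confidenceThreshold` is closest to the requested value. The selection step on its own can be sketched as follows; `closest_confidence_entry` is a hypothetical helper and the entries are made-up values. As in the notebook's helper, an entry without a `confidenceThreshold` key is treated as threshold 0.

```python
from typing import Dict, List, Optional


def closest_confidence_entry(entries: List[Dict], target: float) -> Optional[Dict]:
    """Return the entry whose 'confidenceThreshold' is closest to `target`."""
    if not entries:
        return None
    return min(entries, key=lambda e: abs(target - e.get("confidenceThreshold", 0)))


# Hypothetical entries, not output from a real pipeline run.
entries = [
    {"recall": 1.0},
    {"confidenceThreshold": 0.25, "recall": 0.9},
    {"confidenceThreshold": 0.5, "recall": 0.8},
    {"confidenceThreshold": 0.75, "recall": 0.6},
]
print(closest_confidence_entry(entries, 0.6))  # {'confidenceThreshold': 0.5, 'recall': 0.8}
```

One difference to be aware of: on an exact tie, `min` keeps the first matching entry, whereas the loop in the notebook's helper keeps the last one.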

# Clean up

To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -rf {BUCKET_URI}

# How It Works / FAQ
To use evaluation on Vertex Managed Pipelines, there are a few terms to get familiar with. ***(Check out the bolded parts in each section!)***
### 1. Kubeflow Pipelines (KFP) vs Managed Pipelines (MP)
Kubeflow is a machine learning toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.

[Kubeflow pipelines](https://github.com/kubeflow/pipelines) are reusable end-to-end ML workflows built using the Kubeflow Pipelines SDK.

[Managed Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) helps you to automate, monitor, and govern your ML systems by orchestrating your ML workflow in a serverless manner, and storing your workflow's artifacts using Vertex ML Metadata.

***In summary: Managed Pipelines, as used in this fishfood, is a hosted orchestration of Kubeflow Pipelines on the Vertex platform.***

### 2. Pipeline Components
Pipeline components are self-contained sets of code that perform one part of a pipeline's workflow, such as data preprocessing, data transformation, and training a model.

Components are composed of a set of inputs, a set of outputs, and the location of a container image. A component's container image is a package that includes the component's executable code and a definition of the environment that the code runs in.

***The Vertex AI Evaluation Component is a pipeline component that runs after the Vertex AI Batch Prediction Component.***


### 3. The Pipeline
Kubeflow pipeline components are factory functions that create pipeline steps. Each component describes the inputs, outputs, and implementation of the component.

These components are linked together to create a reusable pipeline. This is done with the `kfp.dsl` package.

***The Vertex AI Evaluation Component will be a step in the pipeline. This pipeline will be compiled then sent to MP to execute.***

