<a href="https://colab.research.google.com/github/barbaroja2000/rubric/blob/main/Rubric_Vertex_AI_Gemini_Pro_(gemini_pro).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating Vertex AI Gemini Pro against GPT-4 with a custom rubric

Uses:

*   LangSmith for the custom rubric dataset and evaluation framework
*   GPT-4 to evaluate LLM output
*   Vertex AI

---

Test:

* Provides a quality assessment from 0 to 5 and a rationale for the score
* The interactive table can be used to check the model output against the GPT-4 assessment
* A possible improvement would be the ability to manually update the score returned from the model

Model gemini-pro (Gemini Pro):

* The API raises an error when a prompt trips the model's safety threshold: rather than returning an explanation in the response, it rejects the request. This probably accounts, at least partially, for the relatively poor score this model received.
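
One way to keep rejected prompts in the results, rather than losing them to an exception, is to wrap the model call and substitute a placeholder answer. The sketch below assumes the rejection surfaces as a `ValueError` (as in the run output later in this notebook). `safe_invoke`, `BLOCKED_PLACEHOLDER` and `fake_model` are hypothetical names used for illustration, not part of Vertex AI or LangChain.

```python
from typing import Callable

# Hypothetical placeholder text; any fixed string the evaluator can score would do.
BLOCKED_PLACEHOLDER = "[no answer: request rejected by the model API]"


def safe_invoke(call_model: Callable[[str], str], prompt: str) -> str:
    """Call the model, returning a placeholder if the call raises ValueError."""
    try:
        return call_model(prompt)
    except ValueError:
        # Assumption: a blocked prompt raises ValueError ("Content has no parts.").
        return BLOCKED_PLACEHOLDER


def fake_model(prompt: str) -> str:
    """Stand-in for the real model call, used only to exercise the wrapper."""
    if "blocked" in prompt:
        raise ValueError("Content has no parts.")
    return f"answer to: {prompt}"


print(safe_invoke(fake_model, "What is a central bank?"))
print(safe_invoke(fake_model, "a blocked prompt"))
```

With a wrapper like this, a rejected prompt would reach the evaluator as an unanswered question instead of dropping out of the scored set.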

---

Ref:

https://python.langchain.com/docs/langsmith/

https://cloud.google.com/blog/products/ai-machine-learning/gemini-support-on-vertex-ai

https://deepmind.google/technologies/gemini/#introduction

---

Requires:
*   Test dataset in LangSmith. This notebook shows how to set it up: https://github.com/barbaroja2000/rubric/blob/main/Rubric_Langsmith_Dataset_Creator.ipynb
*   Account on Google Cloud Platform
*   Google Cloud project ID
*   LangChain API key
*   OpenAI API key

In [None]:
!pip install -qU google-cloud-aiplatform

In [None]:
!pip install -q langchain langsmith openai tiktoken

In [None]:
from google.colab import userdata
import os
import requests

os.environ['LANGCHAIN_ENDPOINT']= "https://api.smith.langchain.com"
os.environ['LANGCHAIN_API_KEY']= userdata.get('langchain_api_key')
os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')
PROJECT_ID = userdata.get('gcloud_project_id')

In [None]:
import vertexai
from vertexai.preview.language_models import TextGenerationModel

vertexai.init(project=PROJECT_ID, location="us-central1")

In [None]:
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [None]:
from langchain.llms import VertexAI
llm = VertexAI(model_name="gemini-pro")

## Run Metadata

In [None]:
project_name  = "Google Vertex AI - Gemini Pro"
rubric_model = "gemini-pro"
rubric_temperature=0
evaluator_model="gpt-4-1106-preview"
evaluator_temperature=0
dataset_name = "General Rubric: Extended"

In [None]:
from langsmith import Client
client = Client()

In [None]:
from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, ChatPromptTemplate

# create a string template for a System role
system_template = """
Please provide a detailed, accurate, and contextually relevant response to the following query.
Your answer should demonstrate a clear understanding of the subject matter, considering any logical,
ethical, historical, or cultural aspects involved. Ensure the response is tailored to the specific
requirements of the question, whether it involves analysis, explanation, creativity, or problem-solving.
"""

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnableMap, RunnablePassthrough

def create_runnable():
    prompt = ChatPromptTemplate.from_messages([
      ("system", system_template),
      ("human", "{input}.")
    ])
    return RunnableMap({"input": RunnablePassthrough()}) | prompt | llm
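
To see what text the `human` message could end up containing, the template can be imitated with plain `str.format`. This is only an illustration under an assumption: that the runnable is invoked with the whole example inputs dict (for instance with `category` and `question` keys) rather than with the question string alone. `render_human_message` is a hypothetical helper, not a LangChain function.

```python
HUMAN_TEMPLATE = "{input}."  # same template string as in create_runnable


def render_human_message(value) -> str:
    """Imitate the template substitution with plain str.format."""
    return HUMAN_TEMPLATE.format(input=value)


# Made-up example inputs, shaped like a dataset row with two fields.
example_inputs = {"category": "Reframing", "question": "Explain inflation"}

# Passing the whole dict embeds its string form in the prompt.
print(render_human_message(example_inputs))
# Passing only the question gives the plain question text.
print(render_human_message(example_inputs["question"]))
```

If the dict form is what reaches the model, selecting the `question` field before the prompt would be one way to avoid it.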

## Evaluator

https://docs.smith.langchain.com/evaluation/evaluator-implementations

In [None]:
import re
from typing import Any, Optional

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import StringEvaluator

class QualityEvaluator(StringEvaluator):
    """An LLM-based quality  evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model=evaluator_model, temperature=evaluator_temperature)

        template = """ Evaluate the answer provided from 0 to 5, with 5 being the best quality answer, and 0 where the answer is false or not provided
        --------
        QUESTION: {input}
        --------
        ANSWER: {prediction}
        --------
        Concisely reason step by step about why the given score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = LLMChain.from_string(llm=llm, template=template)

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "quality"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain(
            dict(input=input, prediction=prediction), **kwargs
        )
        # The prompt asks the model to repeat the score alone on the final line,
        # so the last line of the response is parsed as the numeric score.
        reasoning = evaluator_result["text"].strip()
        score = float(reasoning.split("\n")[-1].strip())
        return {"score": score, "reasoning": reasoning}
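
The last-line parse above relies on the evaluator model following the instruction exactly. A slightly more tolerant variant, shown here as a sketch rather than as the notebook's method, takes the first number on the final non-empty line and returns `None` when there is none. `parse_final_score` is a hypothetical helper name.

```python
import re
from typing import Optional


def parse_final_score(text: str) -> Optional[float]:
    """Return the first number on the final non-empty line, or None if absent."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return None
    match = re.search(r"\d+(?:\.\d+)?", lines[-1])
    return float(match.group()) if match else None


print(parse_final_score("The answer is accurate.\n4"))
print(parse_final_score("Reasoning only.\nScore: 3/5"))
print(parse_final_score("No score given."))
```

Returning `None` instead of raising leaves it to the caller to decide whether an unparseable response should be skipped or retried.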

In [None]:
import random
import string

def rando():
  # Generating a random 4-letter string
  random_string = ''.join(random.choices(string.ascii_letters, k=4))
  return random_string

## Test Run

In [None]:
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    input_key="question",
    output_key="output",
    custom_evaluators = [QualityEvaluator()],
)

# Use string keys; bare names would use the variables' values as the keys.
project_metadata = {
  "rubric_model": rubric_model,
  "rubric_temperature": rubric_temperature,
  "evaluator_model": evaluator_model,
  "evaluator_temperature": evaluator_temperature,
}

run = run_on_dataset(
    client=client,
    project_metadata = project_metadata,
    project_name = f"{project_name}-{rando()}",
    dataset_name=dataset_name,
    llm_or_chain_factory=create_runnable,
    evaluation=evaluation_config,
    concurrency_level=1,
)

View the evaluation results for project 'Google Vertex AI - Gemini Pro-edGi' at:
https://smith.langchain.com/o/8bebdbff-9433-40f6-b7e1-f013521a8100/datasets/dfedd53c-204a-4b69-9f64-51a61568a7fd/compare?selectedSessions=eb45791d-99dc-45ae-9056-6f8bb2a22e30

View all tests for Dataset General Rubric: Extended at:
https://smith.langchain.com/o/8bebdbff-9433-40f6-b7e1-f013521a8100/datasets/dfedd53c-204a-4b69-9f64-51a61568a7fd
[------------>                                     ] 32/123

Error Type: ValueError, Message: Content has no parts.


[------------>                                     ] 33/123

Error Type: ValueError, Message: Content has no parts.


[-------------->                                   ] 37/123

Error Type: ValueError, Message: Content has no parts.


[---------------------------->                     ] 72/123

Error Type: ValueError, Message: Content has no parts.


[----------------------------->                    ] 74/123

Error Type: ValueError, Message: Content has no parts.


[-------------------------------->                 ] 80/123

Error Type: ValueError, Message: Content has no parts.


[------------------------------------------>       ] 105/123

Error Type: ValueError, Message: Content has no parts.


[------------------------------------------->      ] 108/123

Error Type: ValueError, Message: Content has no parts.


[--------------------------------------------->    ] 114/123

Error Type: ValueError, Message: Content has no parts.


[------------------------------------------------> ] 120/123

Error Type: ValueError, Message: Content has no parts.


[------------------------------------------------->] 123/123

In [None]:
# Extracting the required information
extracted_data = []
for key, value in run['results'].items():
    output = value
    question = value['input']['question']
    for feedback_item in value['feedback']:
        if feedback_item.key == 'quality':
            quality = feedback_item.score
            feedback = feedback_item.comment
            extracted_data.append({
                'question': question,
                'output': output,
                'quality': quality,
                'feedback': feedback
            })

In [None]:
import pandas as pd
from google.colab import data_table
df = pd.DataFrame(extracted_data)

data_table.DataTable(df, include_index=False, num_rows_per_page=10)

Unnamed: 0,question,output,quality,feedback
0,"I WAS born in the year 1632, in the city of Yo...","{'input': {'category': 'Summarization', 'quest...",4.0,The provided answer summarizes the introductor...
1,"I WAS born in the year 1632, in the city of Yo...","{'input': {'category': 'Summarization', 'quest...",5.0,Evaluation of the Answer:\n\n- The answer corr...
2,"Y curiosity, in a sense, was stronger than my ...","{'input': {'category': 'Summarization', 'quest...",2.0,The provided answer identifies three main them...
3,Explain the role of a central bank in an econo...,"{'input': {'category': 'Reframing', 'question'...",2.0,The provided answer begins with an analogy tha...
4,Describe the basics of how the stock market wo...,"{'input': {'category': 'Reframing', 'question'...",2.0,The provided answer begins well by explaining ...
...,...,...,...,...
108,Assess the emotional tone in this news headlin...,"{'input': {'category': 'Emotion Analysis', 'qu...",4.0,The answer provided does a good job of analyzi...
109,Analyze the emotional content of this diary en...,"{'input': {'category': 'Emotion Analysis', 'qu...",4.0,The answer provided offers a thoughtful analys...
110,What is the emotional tone of this speech exce...,"{'input': {'category': 'Emotion Analysis', 'qu...",5.0,The answer provided does a good job of analyzi...
111,Analyze the emotion in this customer review: '...,"{'input': {'category': 'Emotion Analysis', 'qu...",5.0,The provided answer does an excellent job of a...


In [None]:
pd.to_numeric(df.quality).mean()

3.4601769911504423
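
The mean above is taken over the rows present in `df`, which appears to exclude the requests that failed with `ValueError`. It may be worth comparing it with a mean that also counts those failures. The sketch below assumes a failed request is scored 0, which is a choice made here for illustration and not something the notebook does. `mean_with_failures` is a hypothetical helper, and the scores are made-up values.

```python
from statistics import mean
from typing import Sequence


def mean_with_failures(scores: Sequence[float], n_failed: int) -> float:
    """Mean score when each failed request is counted as a 0."""
    if n_failed < 0:
        raise ValueError("n_failed must be non-negative")
    total = len(scores) + n_failed
    if total == 0:
        raise ValueError("no scores and no failures to average")
    return sum(scores) / total


toy_scores = [4.0, 5.0, 2.0]  # made-up values, not taken from the run
print(mean(toy_scores))                   # mean over scored rows only
print(mean_with_failures(toy_scores, 1))  # one failure counted as 0
```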

In [None]:
examples = list(client.list_examples(dataset_name=dataset_name))

In [None]:
# Creating a new DataFrame from the list
list_df = pd.DataFrame([{'question': item.inputs['question'], 'category': item.inputs['category']} for item in examples])

In [None]:
# Merging the DataFrames
merged_df = pd.merge(df, list_df, on='question')
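
Because the merge key is the question text, any question string that appears more than once on either side would produce extra rows in `merged_df`. A quick pre-check, sketched here with the standard library and made-up data, lists the repeated keys. `duplicated_keys` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_keys(keys: Iterable[str]) -> List[str]:
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]


# Made-up question strings, not taken from the dataset.
questions = ["Explain inflation", "Summarize the passage", "Explain inflation"]
print(duplicated_keys(questions))
```

With pandas itself, passing `validate="one_to_one"` to `pd.merge` performs a similar check and raises an error if the keys repeat.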

In [None]:
# Group by 'category' and calculate the mean of 'quality' scores
avg_scores_per_category = merged_df.groupby('category')['quality'].mean()

print(avg_scores_per_category)

category
Categorization                           4.363636
Coding                                   2.461538
Creative Writing                         3.200000
Cultural and Contextual Understanding    3.600000
Emotion Analysis                         4.375000
Ethics Analysis                          2.666667
Fact Analysis                            4.500000
Logical Reasoning                        2.400000
Reading Comprehension                    4.222222
Reframing                                2.875000
Safety and Security                      4.800000
Summarization                            3.666667
Name: quality, dtype: float64
