<a href="https://colab.research.google.com/github/barbaroja2000/rubric/blob/main/Rubric_Anthropic_Claude_3_Opus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating Anthropic Claude 3 Opus against a custom rubric with GPT-4

Uses:

*   LangSmith for custom rubric datasets and the evaluation framework
*   GPT-4 to evaluate the LLM output

---

Test:

*   GPT-4 provides a quality assessment from 0 to 5 and a rationale for the score
*   The interactive table can be used to check the model output against the GPT-4 assessment
*   A possible improvement is to allow the score returned from the model to be updated manually

---

Model: Anthropic Claude 3 Opus

* 200K-token context window at launch; Anthropic says the models can accept inputs of more than 1 million tokens
* Fewer incorrect refusals than previous Claude models

https://www.anthropic.com/news/claude-3-family

https://python.langchain.com/docs/langsmith/


---

Requires:
*   A test dataset in LangSmith; this notebook shows how to set one up: https://github.com/barbaroja2000/rubric/blob/main/Langsmith_Rubric_Dataset_Creator.ipynb
*   Anthropic API key
*   LangChain API key (used for LangSmith)
*   OpenAI API key





In [None]:
!pip install -q langchain langsmith anthropic openai tiktoken

In [None]:
from google.colab import userdata
import os
import requests

os.environ['LANGCHAIN_ENDPOINT']= "https://api.smith.langchain.com"
os.environ['LANGCHAIN_API_KEY']= userdata.get('langchain_api_key')
os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')
os.environ['ANTHROPIC_API_KEY']= userdata.get('anthropic_api_key')

## Run Metadata

In [None]:
project_name  = "Anthropic Claude  Opus"
rubric_model = "claude-3-opus-20240229"
rubric_temperature=0
evaluator_model="gpt-4-1106-preview"
evaluator_temperature=0
dataset_name = "General Rubric: Extended"

In [None]:
from langsmith import Client
client = Client()

In [None]:
from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, ChatPromptTemplate

# create a string template for a System role
system_template = """
Please provide a detailed, accurate, and contextually relevant response to the following query.
Your answer should demonstrate a clear understanding of the subject matter, considering any logical,
ethical, historical, or cultural aspects involved. Ensure the response is tailored to the specific
requirements of the question, whether it involves analysis, explanation, creativity, or problem-solving.
"""

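The template above is written for a System role. The Anthropic Messages API accepts a top-level `system` parameter for that purpose, so one option is to send the template there and keep the user turn for the question alone. The sketch below only builds the keyword arguments: `build_request` is a hypothetical helper, and its default model, temperature, and token limit are assumptions that mirror values used elsewhere in this notebook.

```python
def build_request(question, system_prompt, model="claude-3-opus-20240229",
                  temperature=0.0, max_tokens=1024):
    """Return keyword arguments for a Messages API call (hypothetical helper)."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        # The system prompt travels in its own top-level field...
        "system": system_prompt.strip(),
        # ...so the user turn contains only the dataset question.
        "messages": [{"role": "user", "content": question}],
    }

request = build_request("Explain the role of a central bank.", "Answer accurately.\n")
print(request["system"])
```

With an Anthropic client in hand, the dictionary could be passed on as `client.messages.create(**request)`.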
In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnableMap, RunnablePassthrough
import os
from anthropic import Anthropic
from google.colab import userdata


anthropic_api_key = userdata.get('anthropic_api_key')
anthropic_client = Anthropic(
    # This is the default and can be omitted
    api_key=anthropic_api_key,
)

def create_runnable():

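    def get_response(input):
      # Claude 3 Opus is not yet supported in LangChain, so call the Anthropic SDK directly
      message = anthropic_client.messages.create(
        max_tokens=1024,
        temperature=rubric_temperature,  # apply the temperature recorded in the run metadata
        messages=[
            {
                "role": "user",
                "content": f'{system_template} \n---------------\n {input["input"]["question"]}',
            }
        ],
        model=rubric_model,
    )

      # message.content is a list of content blocks; each block's text is in .text
      return message.content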

    return RunnableMap({"input": RunnablePassthrough()}) | get_response
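`message.content` is a list of content blocks rather than a string, which is why the extraction cell further down reads `value["output"][0].text`. If a plain string is preferred, the text blocks can be joined before returning. A minimal sketch, assuming each block exposes `type` and `text` attributes; `blocks_to_text` is a hypothetical helper and the `SimpleNamespace` objects merely stand in for a real response.

```python
from types import SimpleNamespace

def blocks_to_text(blocks):
    """Join the text of all text-type content blocks into one string."""
    return "".join(
        block.text for block in blocks if getattr(block, "type", None) == "text"
    )

# Stand-in for message.content from a real API response.
fake_content = [
    SimpleNamespace(type="text", text="A central bank "),
    SimpleNamespace(type="text", text="sets monetary policy."),
]
print(blocks_to_text(fake_content))
```

If `get_response` returned a string like this, any later code that indexes into the output list would need to be adjusted to match.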

## Evaluator

https://docs.smith.langchain.com/evaluation/evaluator-implementations

In [None]:
import re
from typing import Any, Optional

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import StringEvaluator

class QualityEvaluator(StringEvaluator):
    """An LLM-based quality evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model=evaluator_model, temperature=evaluator_temperature)

        template = """ Evaluate the answer provided from 0 to 5, with 5 being the best quality answer, and 0 where the answer is false or not provided
        --------
        QUESTION: {input}
        --------
        ANSWER: {prediction}
        --------
        Concisely reason step by step about why the given score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = LLMChain.from_string(llm=llm, template=template)

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "quality"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain(
            dict(input=input, prediction=prediction), **kwargs
        )
        reasoning = evaluator_result["text"].strip()
        # The prompt asks for the score alone on the final line
        last_line = reasoning.split("\n")[-1].strip()
        try:
            score = float(last_line)
        except ValueError:
            # The reply did not end with a bare number; record no score instead of crashing
            score = None
        return {"score": score, "reasoning": reasoning}
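The evaluator relies on GPT-4 ending its reply with the score alone on the last line. A more tolerant alternative, sketched here with a hypothetical `parse_score` helper, searches the last non-empty line for a number and rejects anything outside the 0 to 5 scale. The sample replies are made up for illustration.

```python
import re
from typing import Optional

def parse_score(text: str, low: float = 0.0, high: float = 5.0) -> Optional[float]:
    """Extract a score from the last non-empty line, or None if there is none."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return None
    # Take the first number on the last line, e.g. "Score: 4" or "4/5" gives 4.0
    match = re.search(r"\d+(?:\.\d+)?", lines[-1])
    if match is None:
        return None
    score = float(match.group())
    # Discard values that cannot be valid rubric scores
    return score if low <= score <= high else None

print(parse_score("The answer is accurate.\n5"))
print(parse_score("Mostly right.\nScore: 4"))
print(parse_score("No verdict given."))
```

Returning `None` instead of raising keeps a single malformed reply from aborting the whole evaluation run, at the cost of having to filter missing scores afterwards.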

In [None]:
import random
import string

def rando():
  # Generating a random 4-letter string
  random_string = ''.join(random.choices(string.ascii_letters, k=4))
  return random_string

## Test Run

In [None]:
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    input_key="question",
    output_key="output",
    custom_evaluators = [QualityEvaluator()],
)

# Use string keys; bare variable names would make the values double as their own keys
project_metadata = {
  "rubric_model": rubric_model,
  "rubric_temperature": rubric_temperature,
  "evaluator_model": evaluator_model,
  "evaluator_temperature": evaluator_temperature,
}

run = run_on_dataset(
    client=client,
    project_metadata = project_metadata,
    project_name = f"{project_name}-{rando()}",
    dataset_name=dataset_name,
    llm_or_chain_factory=create_runnable,
    evaluation=evaluation_config,
    concurrency_level=1,  # Must be set explicitly to 1: without a paid Anthropic account, the default or any other value fails
)

View the evaluation results for project 'Anthropic Claude  Opus-ejHU' at:
https://smith.langchain.com/o/8bebdbff-9433-40f6-b7e1-f013521a8100/datasets/dfedd53c-204a-4b69-9f64-51a61568a7fd/compare?selectedSessions=1396f633-8d33-4c15-b166-43c5f2a4dd7d

View all tests for Dataset General Rubric: Extended at:
https://smith.langchain.com/o/8bebdbff-9433-40f6-b7e1-f013521a8100/datasets/dfedd53c-204a-4b69-9f64-51a61568a7fd
[>                                                 ] 0/123

[---------->                                       ] 27/123

Error Type: InternalServerError, Message: Error code: 500 - {'type': 'error', 'error': {'type': 'api_error', 'message': 'Internal server error'}}


[------------------------------------------------->] 123/123
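The progress output above includes one `InternalServerError` (HTTP 500) from the Anthropic API, so that example produced no output. Transient errors like this are commonly handled by retrying with exponential backoff. The sketch below is generic and not tied to any SDK: `call_with_retries` and `flaky` are hypothetical names, and the `sleep` argument exists only so the example runs instantly. The Anthropic client also has a `max_retries` option that may cover the same need.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # real code should catch only the transient error type
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)

# Stand-in for an API call that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Internal server error")
    return "ok"

delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

In this notebook the wrapped function would be the `anthropic_client.messages.create(...)` request inside `get_response`.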

In [None]:
# for key, value in run['results'].items():
#  print(key, value)

In [None]:
# Extracting the required information
extracted_data = []
for key, value in run['results'].items():
  try:
    output = value["output"][0].text
    question = value['input']['question']
    for feedback_item in value['feedback']:
        if feedback_item.key == 'quality':
            quality = feedback_item.score
            feedback = feedback_item.comment
            extracted_data.append({
                'question': question,
                'output': output,
                'quality': quality,
                'feedback': feedback
            })
  except Exception as e:
      # Runs whose model call failed have no 'output' key; report the error and skip them
      print(e)

'output'
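The `'output'` line printed above is a `KeyError`: the run that hit the server error has no `output` entry. The sketch below performs the same extraction but skips such runs explicitly and keeps track of them. The `results` dictionary is a made-up stand-in shaped like the fields the cell above reads, and `extract_rows` is a hypothetical helper.

```python
from types import SimpleNamespace

def extract_rows(results):
    """Return (rows, failed_keys) from a results mapping shaped like run['results']."""
    rows, failed = [], []
    for key, value in results.items():
        if "output" not in value:
            failed.append(key)  # the model call errored, nothing to score
            continue
        for item in value.get("feedback", []):
            if item.key == "quality":
                rows.append({
                    "question": value["input"]["question"],
                    "output": value["output"][0].text,
                    "quality": item.score,
                    "feedback": item.comment,
                })
    return rows, failed

# Made-up data: one successful run and one run without output.
results = {
    "a": {
        "input": {"question": "What is 2 + 2?"},
        "output": [SimpleNamespace(text="4")],
        "feedback": [SimpleNamespace(key="quality", score=5.0, comment="Correct.")],
    },
    "b": {"input": {"question": "Unanswered"}, "feedback": []},
}
rows, failed = extract_rows(results)
print(len(rows), failed)
```

Keeping the failed keys makes it easy to report how many examples are missing from the averages that follow.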


In [None]:
import pandas as pd
from google.colab import data_table
df = pd.DataFrame(extracted_data)

data_table.DataTable(df, include_index=False, num_rows_per_page=10)

|  | question | output | quality | feedback |
| --- | --- | --- | --- | --- |
| 0 | I WAS born in the year 1632, in the city of Yo... | In the opening paragraphs of Daniel Defoe's no... | 5.0 | The provided answer is a summary of the openin... |
| 1 | I WAS born in the year 1632, in the city of Yo... | Here is a concise summary of the key points in... | 5.0 | Evaluation of the Answer:\n\n- The answer accu... |
| 2 | Y curiosity, in a sense, was stronger than my ... | This text, which appears to be an excerpt from... | 5.0 | The provided answer does a good job of summari... |
| 3 | Explain the role of a central bank in an econo... | A central bank plays a crucial role in managin... | 5.0 | The provided answer gives a comprehensive expl... |
| 4 | Describe the basics of how the stock market wo... | Here is a basic explanation of how the stock m... | 5.0 | The provided answer is comprehensive, accurate... |
| ... | ... | ... | ... | ... |
| 117 | Analyze the emotional content of this diary en... | Upon analyzing the emotional content of the di... | 5.0 | The provided answer does a thorough job of ana... |
| 118 | What is the emotional tone of this speech exce... | The emotional tone of the speech excerpt "We h... | 5.0 | The provided answer does a thorough job of ana... |
| 119 | Determine the emotional tone of this message: ... | Based on the detailed analysis you requested, ... | 5.0 | The provided answer offers a comprehensive ana... |
| 120 | Analyze the emotion in this customer review: '... | This customer review expresses a very positive... | 5.0 | The provided answer does an excellent job of a... |


In [None]:
pd.to_numeric(df.quality).mean()

4.872950819672131

In [None]:
examples = list(client.list_examples(dataset_name=dataset_name))

In [None]:
# Creating a new DataFrame from the list
list_df = pd.DataFrame([{'question': item.inputs['question'], 'category': item.inputs['category']} for item in examples])

In [None]:
# Merging the DataFrames
merged_df = pd.merge(df, list_df, on='question')
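Merging on `question` assumes each question text appears once in the dataset. If two examples share the same text, a plain merge pairs every matching result with every matching example and inflates the row count. The sketch below uses made-up frames (`results` and `examples_df` are stand-ins, not the notebook's data) to show the effect and how the `validate` argument of `pd.merge` can guard against it.

```python
import pandas as pd

# Made-up stand-ins for the results frame and the dataset examples frame.
results = pd.DataFrame({
    "question": ["q1", "q1", "q2"],
    "quality": [5.0, 4.0, 3.0],
})
examples_df = pd.DataFrame({
    "question": ["q1", "q1", "q2"],
    "category": ["Summarization", "Summarization", "Coding"],
})

# Plain merge: the two q1 results pair with both q1 examples (2 x 2), plus one q2 row.
naive = pd.merge(results, examples_df, on="question")

# De-duplicate the lookup side, then ask pandas to verify it is unique per question.
unique_examples = examples_df.drop_duplicates(subset="question")
checked = pd.merge(results, unique_examples, on="question", validate="many_to_one")

print(len(naive), len(checked))
```

Dropping duplicates keeps only the first category for a repeated question, so this is only appropriate when repeated questions share a category; otherwise a unique example identifier would be a safer merge key.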

In [None]:
# Group by 'category' and calculate the mean of 'quality' scores
avg_scores_per_category = merged_df.groupby('category')['quality'].mean()

print(avg_scores_per_category)

category
Categorization                           4.909091
Coding                                   4.961538
Creative Writing                         4.500000
Cultural and Contextual Understanding    5.000000
Emotion Analysis                         5.000000
Ethics Analysis                          4.944444
Fact Analysis                            5.000000
Logical Reasoning                        4.818182
Reading Comprehension                    4.777778
Reframing                                4.750000
Safety and Security                      4.875000
Summarization                            5.000000
Name: quality, dtype: float64
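A per-category mean is easier to interpret next to the number of examples behind it, since a category with only a few examples can swing on a single score. A small sketch using made-up scores (the `scores` frame is a stand-in for `merged_df`) that reports both with one `agg` call:

```python
import pandas as pd

# Made-up scores standing in for merged_df.
scores = pd.DataFrame({
    "category": ["Coding", "Coding", "Creative Writing", "Creative Writing"],
    "quality": [5.0, 4.0, 4.0, 5.0],
})

# One row per category with the average score and the number of scored examples.
summary = scores.groupby("category")["quality"].agg(["mean", "count"])
print(summary)
```

The same call applied to `merged_df` would give the averages above alongside their sample sizes.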
