<a href="https://colab.research.google.com/github/barbaroja2000/rubric/blob/main/Rubric_Mistral_Medium.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating Mistral Medium with GPT-4

Uses:

*   LangSmith for custom rubric datasets and the evaluation framework
*   Mistral AI platform for inference
*   GPT-4 to evaluate LLM output

---

Test:

*   Provides a quality assessment from 0 to 5 and a rationale for the score
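
As a rough illustration of what "a score plus a rationale" means in practice, the sketch below splits a grader reply into the two parts. It assumes the grader ends its reply with the score alone on the last line; the function name and the example reply are hypothetical and are not part of this notebook's pipeline.

```python
import re
from typing import Optional, Tuple


def split_score_and_rationale(reply: str) -> Tuple[Optional[float], str]:
    """Return (score, rationale) from a grader reply whose last line holds the score."""
    text = reply.strip()
    last_line = text.split("\n")[-1]
    # Take the first number on the last line, so "Score: 4" and "4" both work.
    match = re.search(r"\d+(?:\.\d+)?", last_line)
    score = float(match.group()) if match else None
    # Discard numbers outside the 0-5 rubric range rather than guess.
    if score is not None and not 0 <= score <= 5:
        score = None
    return score, text


# Hypothetical reply, written here only to illustrate the expected format.
example_reply = "The answer is accurate but omits one step.\nScore: 4\n4"
score, rationale = split_score_and_rationale(example_reply)
print(score)
```

Returning `None` for an unparseable or out-of-range score keeps a malformed reply from being silently counted as a valid grade.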

---

Model: Mistral-Medium

---

Refs:

*   https://mistral.ai/news/la-plateforme/
*   https://python.langchain.com/docs/langsmith/

---

Requires:
*   Test dataset in LangSmith. This notebook shows how to set one up: https://github.com/barbaroja2000/rubric/blob/main/Rubric_Langsmith_Dataset_Creator.ipynb
*   Mistral AI API key
*   LangChain API key (used for LangSmith)
*   OpenAI API key
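
A small pre-flight check can catch a missing key before a long evaluation run starts. This is a minimal sketch, assuming the three secret names that the setup cell in this notebook reads from Colab's `userdata`; the helper `missing_secrets` is hypothetical and works on any plain mapping, so it can be tried without Colab.

```python
from typing import Iterable, List, Mapping, Optional

# Secret names as read by the setup cell of this notebook.
REQUIRED_SECRETS = ("langchain_api_key", "openai_api_key", "mistral_api_key")


def missing_secrets(
    secrets: Mapping[str, Optional[str]],
    required: Iterable[str] = REQUIRED_SECRETS,
) -> List[str]:
    """Return the required secret names that are absent or empty."""
    return [name for name in required if not secrets.get(name)]


# Placeholder values only; real keys should never be hard-coded.
print(missing_secrets({"langchain_api_key": "placeholder", "openai_api_key": ""}))
```

An empty result means all three keys are present; anything else lists the names still to be added.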


In [None]:
!pip install -q langchain langsmith openai tiktoken mistralai


In [None]:
from google.colab import userdata
import os
import requests

os.environ['LANGCHAIN_ENDPOINT']= "https://api.smith.langchain.com"
os.environ['LANGCHAIN_API_KEY']= userdata.get('langchain_api_key')
os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')

mistralai_api_key = userdata.get('mistral_api_key')

In [None]:
project_name  = "Mistal AI - Medium"
rubric_model = "mistral-medium"
rubric_temperature=0
evaluator_model="gpt-4-1106-preview"
evaluator_temperature=0
dataset_name = "General Rubric: Extended" #Use quick to test the runner

In [None]:
from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, ChatPromptTemplate

# create a string template for a System role
system_template = """
Please provide a detailed, accurate, and contextually relevant response to the following query.
Your answer should demonstrate a clear understanding of the subject matter, considering any logical,
ethical, historical, or cultural aspects involved. Ensure the response is tailored to the specific
requirements of the question, whether it involves analysis, explanation, creativity, or problem-solving.
"""

In [None]:
from langsmith import Client
client = Client()

In [None]:
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
from langchain.schema.runnable import RunnableMap, RunnablePassthrough

# Assumes the legacy (0.x) mistralai client, which exposes these import paths
mistral_client = MistralClient(api_key=mistralai_api_key)

def create_runnable():

  def get_response(input):
    # No Mistral LLM in LangChain yet, so call the Mistral client directly
    chat_response = mistral_client.chat(
      model=rubric_model,
      temperature=rubric_temperature,  # same value that is recorded in project_metadata
      messages=[
        ChatMessage(role="system", content=system_template),
        ChatMessage(role="user", content=input["input"]["question"]),
      ],
    )
    return chat_response.choices[0].message.content

  return RunnableMap({"input": RunnablePassthrough()}) | get_response

In [None]:
import re
from typing import Any, Optional

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import StringEvaluator

class QualityEvaluator(StringEvaluator):
    """An LLM-based quality evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model=evaluator_model, temperature=evaluator_temperature)

        template = """ Evaluate the answer provided from 0 to 5, with 5 being the best quality answer, and 0 where the answer is false or not provided
        --------
        QUESTION: {input}
        --------
        ANSWER: {prediction}
        --------
        Concisely reason step by step about why the given score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = LLMChain.from_string(llm=llm, template=template)

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "quality"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain(
            dict(input=input, prediction=prediction), **kwargs
        )
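        reasoning = evaluator_result["text"].strip()
        # The prompt asks for the score alone on the final line. Extract the
        # number with a regex so stray text (e.g. "Score: 4") does not break float().
        last_line = reasoning.split("\n")[-1]
        match = re.search(r"\d+(?:\.\d+)?", last_line)
        score = float(match.group()) if match else None
        return {"score": score, "reasoning": reasoning}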

In [None]:
import random
import string

def rando():
  # Generating a random 4-letter string
  random_string = ''.join(random.choices(string.ascii_letters, k=4))
  return random_string

In [None]:
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    input_key="question",
    custom_evaluators = [QualityEvaluator()]
)

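# Keys must be strings; using the variables themselves as keys would
# collapse entries (both temperatures are 0) and lose the field names.
project_metadata = {
  "rubric_model": rubric_model,
  "rubric_temperature": rubric_temperature,
  "evaluator_model": evaluator_model,
  "evaluator_temperature": evaluator_temperature
}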

run = run_on_dataset(
    client=client,
    project_metadata = project_metadata,
    project_name = f"{project_name}-{rando()}",
    dataset_name=dataset_name,
    llm_or_chain_factory=create_runnable,
    evaluation=evaluation_config,
    concurrency_level=1, #This will bork if not provided or set to anything other than 1
)

View the evaluation results for project 'Mistal AI - Medium-ZhqP' at:
https://smith.langchain.com/o/8bebdbff-9433-40f6-b7e1-f013521a8100/datasets/dfedd53c-204a-4b69-9f64-51a61568a7fd/compare?selectedSessions=adbca504-3173-45c1-998c-26560ffc4b78

View all tests for Dataset General Rubric: Extended at:
https://smith.langchain.com/o/8bebdbff-9433-40f6-b7e1-f013521a8100/datasets/dfedd53c-204a-4b69-9f64-51a61568a7fd
[------------------------------------------------->] 123/123

In [None]:
# Extracting the required information
extracted_data = []
for key, value in run['results'].items():
    output = value["output"] if "output" in value else None
    question = value['input']['question']
    for feedback_item in value['feedback']:
        if feedback_item.key == 'quality':
            quality = feedback_item.score
            feedback = feedback_item.comment
            extracted_data.append({
                'question': question,
                'output': output,
                'quality': quality,
                'feedback': feedback
            })

In [None]:
import pandas as pd
from google.colab import data_table
df = pd.DataFrame(extracted_data)

data_table.DataTable(df, include_index=False, num_rows_per_page=10)

|   | question | output | quality | feedback |
|---|---|---|---|---|
| 0 | Explain the role of a central bank in an econo... | A central bank plays a crucial role in managin... | 5.0 | The answer provided is comprehensive and expla... |
| 1 | Y curiosity, in a sense, was stronger than my ... | The object in the narrator's breast pocket is ... | 3.0 | The answer provided does a reasonable job of i... |
| 2 | Verify the accuracy of this statement: 'The Gr... | The statement "The Great Wall of China can be ... | 5.0 | The answer provided is accurate and informativ... |
| 3 | Write a Python program to find the Fibonacci s... | Sure, here's a Python program that finds the F... | 4.0 | The provided answer includes a Python program ... |
| 4 | Create a comedic scene about a alien game show... | Title: "Intergalactic Idol: The Cosmic Comedy"... | 4.0 | The provided answer is a creative and detailed... |
| 5 | Discuss the historical context behind the cele... | Juneteenth, also known as Freedom Day, Jubilee... | 5.0 | The provided answer gives a comprehensive and ... |
| 6 | Evaluate the truthfulness of the statement: 'G... | The statement "Goldfish have a three-second me... | 5.0 | The answer provided is of high quality as it d... |
| 7 | Analyze the emotion in this customer review: '... | In the customer review, 'I've never been more ... | 5.0 | The provided answer does an excellent job of a... |
| 8 | Explain the concept of blockchain and its uses... | Blockchain is a type of database, but instead ... | 5.0 | The provided answer gives a clear and comprehe... |
| 9 | Interpret the emotion behind this comment: 'I... | The comment 'I just love when people are late ... | 5.0 | The answer provided is comprehensive and accur... |


In [None]:
pd.to_numeric(df.quality).mean()

4.6
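
A single mean can hide how the scores are spread. The sketch below uses only the ten quality scores visible in the results table above to show the mean alongside a per-score count; `summarize` is a hypothetical helper, not part of the notebook's pipeline.

```python
from collections import Counter
from statistics import mean

# Quality scores copied from the ten rows shown in the results table above.
scores = [5.0, 3.0, 5.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0]


def summarize(values):
    """Return the mean, the lowest score, and how often each score occurs."""
    return {
        "mean": round(mean(values), 2),
        "min": min(values),
        "counts": dict(sorted(Counter(values).items())),
    }


summary = summarize(scores)
print(summary)
```

The counts make it easy to see that the average is pulled down by a few lower-scoring answers rather than by uniformly middling ones.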

In [None]:
examples = list(client.list_examples(dataset_name=dataset_name))

In [None]:
# Build a DataFrame of (question, category) pairs from the dataset examples
list_df = pd.DataFrame([
    {'question': item.inputs['question'], 'category': item.inputs['category']}
    for item in examples
])

In [None]:
# Merging the DataFrames
merged_df = pd.merge(df, list_df, on='question')

In [None]:
# Group by 'category' and calculate the mean of 'quality' scores
avg_scores_per_category = merged_df.groupby('category')['quality'].mean()

print(avg_scores_per_category)

category
Coding                                   4.0
Creative Writing                         4.0
Cultural and Contextual Understanding    5.0
Emotion Analysis                         5.0
Fact Analysis                            5.0
Reading Comprehension                    3.0
Reframing                                5.0
Name: quality, dtype: float64
