# Using Selene Mini as a Custom Model with DeepEval

## Load Selene from HF

Install required packages:

In [1]:
!pip install deepeval --quiet
!pip install -U bitsandbytes --quiet
!pip install lm-format-enforcer --quiet

Load the model + tokenizer:

In [2]:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

selene_model = AutoModelForCausalLM.from_pretrained(
    "AtlaAI/Selene-1-Mini-Llama-3.1-8B",
    device_map="auto",
    quantization_config=quantization_config # remove to load FP16 model
)
selene_tokenizer = AutoTokenizer.from_pretrained(
    "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
)
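
As a rough sanity check on why 4-bit loading helps, the sketch below estimates the memory needed for the weights alone from the parameter count and the bits per weight. The 8-billion parameter figure is an approximation inferred from the "8B" in the model name, and the estimate ignores quantization constants, activations, and the KV cache, so treat the numbers as ballpark only.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: ~8e9 parameters, inferred from "8B" in the model name.
NUM_PARAMS = 8e9

fp16_gb = estimate_weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = estimate_weight_memory_gb(NUM_PARAMS, 4)

print(f"FP16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions, 4-bit quantization cuts the weight footprint to about a quarter of the FP16 size, which is what makes the model fit on smaller GPUs.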


Let's test the model to make sure it's loaded correctly.

In [24]:
from transformers import pipeline

pipe = pipeline(model=selene_model, tokenizer=selene_tokenizer, task="text-generation", max_new_tokens=256)
pipe("Why did the chicken cross the road?")

Device set to use cuda:0


[{'generated_text': "Why did the chicken cross the road? To get away from the farmer's constant questions about its motives.\nIn a humorous twist on the classic joke, this chicken has had enough of being interrogated about its actions. It decides to take matters into its own hands (or wings) and make a break for it.\nAs it struts across the road, it's not just fleeing the farmer's questions; it's also making a statement about the absurdity of being asked to justify one's every move. After all, who needs to explain why a chicken wants to cross a road? It's a fundamental right, isn't it?\nOf course, this chicken's rebellion is short-lived, as it soon finds itself facing a new set of questions from curious onlookers who are more interested in its crossing than its reasons for doing so. But hey, at least it's a chicken with attitude! \nSo the next time you're tempted to ask a chicken why it crossed the road, just remember: it's not about the reasons; it's about the freedom to roam – or at 

## Create CustomModel class for DeepEval

To use Selene as a judge in DeepEval, wrap it in a class that does the following:

1. Inherits from `DeepEvalBaseLLM`.

2. Implements `load_model()`, which returns the model object.

3. Implements `generate()`, which takes the prompt as a string and returns the output generated by Selene.

4. Implements `a_generate()`, the asynchronous counterpart of `generate()`.

5. Implements `get_model_name()`, which returns a string naming the custom model.

In [9]:
from deepeval.models.base_model import DeepEvalBaseLLM
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)
from transformers import pipeline
from pydantic import BaseModel
import json

class CustomModel(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel = None) -> BaseModel | str:
        model = self.load_model()

        # Build a Hugging Face text-generation pipeline around the loaded model.
        # Named `pipe` so it does not shadow the imported `pipeline` function.
        pipe = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            num_return_sequences=1,
            max_new_tokens=512,
            do_sample=True,
            top_k=5,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        if schema is not None:
            # Constrain decoding to JSON that matches the schema, using lm-format-enforcer
            parser = JsonSchemaParser(schema.model_json_schema())
            prefix_function = build_transformers_prefix_allowed_tokens_fn(
                pipe.tokenizer, parser
            )
            # The pipeline echoes the prompt, so slice it off before parsing the JSON
            output_dict = pipe(prompt, prefix_allowed_tokens_fn=prefix_function)
            output = output_dict[0]["generated_text"][len(prompt):]
            json_result = json.loads(output)
            # Return an instance of the schema DeepEval supplied
            return schema(**json_result)
        # Without a schema, return the raw pipeline output (a list of dicts)
        return pipe(prompt)

    async def a_generate(self, prompt: str, schema: BaseModel = None) -> BaseModel | str:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Atla Selene Mini"
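
The `a_generate()` method above calls the synchronous `generate()` directly, so the event loop is blocked while the model runs. One possible alternative, sketched below with a stand-in class rather than the real `CustomModel`, is to hand the blocking call to a worker thread with `asyncio.to_thread` (Python 3.9+). `SlowModel` and its sleep-based `generate()` are hypothetical placeholders for the Hugging Face pipeline call.

```python
import asyncio
import time


class SlowModel:
    """Hypothetical stand-in for CustomModel; sleeps instead of running an LLM."""

    def generate(self, prompt: str) -> str:
        time.sleep(0.2)  # simulates a blocking inference call
        return f"echo: {prompt}"

    async def a_generate(self, prompt: str) -> str:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.generate, prompt)


async def main() -> tuple[list[str], float]:
    model = SlowModel()
    start = time.perf_counter()
    # gather() preserves the order of its arguments in the returned results.
    results = await asyncio.gather(
        model.a_generate("first"),
        model.a_generate("second"),
    )
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

On a single GPU this is unlikely to make inference itself faster; the point is only that other coroutines can keep running while a generation is in flight.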

In [10]:
custom_selene = CustomModel(model=selene_model, tokenizer=selene_tokenizer)

Let's test out this custom model class to make sure it's working correctly:

In [27]:
custom_selene.generate("Why did the chicken cross the road?")

Device set to use cuda:0


[{'generated_text': 'Why did the chicken cross the road? To get to the other side, of course! But have you ever wondered what might have driven that chicken to take such a bold step? Was it a desire for adventure, a quest for food, or perhaps a need to escape a predator? Whatever the reason, it\'s clear that the chicken\'s decision was motivated by a sense of self-preservation and a desire to improve its circumstances.\n\nIn this sense, the chicken\'s story is not so different from our own. We, too, make decisions that are driven by our own needs and desires. We strive to improve our circumstances, to achieve our goals, and to overcome obstacles. And just like the chicken, we often find ourselves at crossroads, facing choices that will determine the course of our lives.\n\nSo the next time you find yourself pondering the age-old question of why the chicken crossed the road, remember that it\'s not just a joke or a play on words. It\'s a reflection of our own human experience, a reminde
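
As the output shows, the text-generation pipeline returns a list of dicts whose `generated_text` begins with the prompt itself. If a plain string is wanted, a small helper along these lines could strip the prompt. `extract_completion` is a hypothetical name, and the sample data is a hand-written stand-in for real pipeline output.

```python
def extract_completion(pipeline_output: list[dict], prompt: str) -> str:
    """Return only the newly generated text from a text-generation result."""
    text = pipeline_output[0]["generated_text"]
    # The prompt is echoed at the start of the text, so drop it when present.
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


prompt = "Why did the chicken cross the road?"
# Hand-written stand-in for the structure returned by the pipeline.
sample_output = [{"generated_text": prompt + " To get to the other side, of course!"}]

completion = extract_completion(sample_output, prompt)
print(completion)
```

Passing `return_full_text=False` to the pipeline is another way to get the completion without the prompt, if you prefer to handle it at generation time.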

## Start running evals with DeepEval

> Set `model=custom_selene` when you initialize an evaluation metric, and Selene will be used as the judge model for that evaluation.

### Example: RAG QA Chatbot Evaluation

We'll mock the evaluation of a RAG QA chatbot using DeepEval and assess the following three metrics: **Context Relevance**, **Groundedness**, and **Answer Relevance**.

These evaluations help detect hallucinations in LLM responses by ensuring that:

1. **Context is relevant.**
2. **Responses are grounded.**
3. **Answers align with user queries.**

#### Contextual Relevance

In [14]:
from deepeval import evaluate
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric = ContextualRelevancyMetric(
    # DeepEval's default metrics output a score between 0 and 1; this metric passes if the score is >= threshold
    threshold=0.7,
    model=custom_selene, # Set the model used as Selene
    include_reason=True
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Device set to use cuda:0
Device set to use cuda:0


0.3333333333333333
The score is 0.33 because the retrieval context is irrelevant to the input since it mentions refund policies, which is unrelated to the question about shoe fit. Additionally, the relevant statement about refund eligibility does not address the user's concern about the shoes not fitting. The user's concern is about the physical fit, not the return process. Quotes like 'All customers are eligible for a 30 day full refund at no extra cost' do not help in this context. It is a poor match because the user's question is about the physical properties of the shoes, not the return policy, which is why the relevancy score is low.
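
DeepEval's documentation describes contextual relevancy as the share of statements in the retrieval context that the judge model marks as relevant to the input. The sketch below illustrates that ratio. The verdict list is made up for illustration (the per-statement verdicts behind the 0.33 above are not shown in the output), and `ratio_score` is a hypothetical helper, not part of DeepEval.

```python
def ratio_score(verdicts: list[bool]) -> float:
    """Fraction of statements judged relevant; 0.0 when there are none."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Made-up verdicts: one relevant statement out of three.
example_verdicts = [True, False, False]
score = ratio_score(example_verdicts)
print(round(score, 2))
```

Because the score is a ratio over a small number of statements, a single changed verdict can move it substantially, which is worth keeping in mind when reading results from short contexts.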


#### Groundedness / RAG Hallucination

In [13]:
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    context=retrieval_context
)
metric = HallucinationMetric(
    # For HallucinationMetric a lower score is better, so the metric passes if the score is <= threshold
    threshold=0.7,
    model=custom_selene # Set the model used as Selene
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Device set to use cuda:0
Device set to use cuda:0


0.0
The score is 0.00 because the actual output perfectly aligns with the context with no discrepancies, indicating a strong agreement between the actual output and the context.
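
Note that the pass condition differs by metric: for relevancy-style metrics a higher score is better, while for `HallucinationMetric` a lower score is better and the threshold acts as a maximum. The sketch below spells out that logic using scores printed in this notebook. `passes` is a hypothetical helper; with DeepEval itself you would typically call `metric.is_successful()` instead.

```python
def passes(score: float, threshold: float, higher_is_better: bool = True) -> bool:
    """Apply a threshold in the direction appropriate for the metric."""
    return score >= threshold if higher_is_better else score <= threshold


# Contextual relevancy: 0.33 is below the 0.7 minimum, so it fails.
relevancy_ok = passes(0.3333333333333333, 0.7)
# Hallucination: 0.0 is below the 0.7 maximum, so it passes.
hallucination_ok = passes(0.0, 0.7, higher_is_better=False)

print(relevancy_ok, hallucination_ok)
```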


#### Answer Relevance

In [18]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model=custom_selene, # Set the model used as Selene
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Device set to use cuda:0
Device set to use cuda:0
Device set to use cuda:0


0.0
The score is 0.00 because the output contains multiple irrelevant statements about unrelated topics such as 'We', 'offer', and '30-day' that do not address the user's question about shoe fit and return policies, while only one relevant statement about refund is present, which is not enough to elevate the overall relevance score to a higher level. However, the presence of relevant content does not outweigh the dominance of irrelevant statements, leading to a score of 0.00.


### Example: G-Eval Evaluation Metric

[G-Eval](https://docs.confident-ai.com/docs/metrics-llm-evals) is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM outputs against any custom criteria.

In [None]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    model=custom_selene, # Set the model used as Selene
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: criteria and evaluation_steps are meant as alternatives; DeepEval's docs say to provide one or the other, so drop one of them in real use
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog.",
    expected_output="The cat."
)

correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)