# Obtaining logprobs from GPT-4o-mini RAG systems




This notebook showcases how the GPT-4o-mini logprob results were obtained in our tutorial on integrating Cleanlab TLM within a RAG system. We reference code from OpenAI's [blogpost](https://cookbook.openai.com/examples/using_logprobs) on logprobs.

From the blogpost,
> Log probabilities of output tokens indicate the likelihood of each token occurring in the sequence given the context. To simplify, a logprob is log(p), where p = probability of a token occurring at a specific position based on the previous tokens in the context.
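
To make the definition concrete, here is a minimal sketch that converts a logprob back into a probability. The helper name `logprob_to_percent` and the sample values are illustrative assumptions, not part of the original notebook or of the OpenAI API. A logprob of 0 corresponds to 100%, and more negative values correspond to less likely tokens (about -0.105 is roughly 90%, about -2.303 is roughly 10%).

```python
import math


def logprob_to_percent(logprob: float) -> float:
    """Convert a natural-log probability into a percentage."""
    return math.exp(logprob) * 100


# Illustrative values only, not taken from a real API response
for lp in (0.0, -0.105, -2.303):
    print(f"logprob={lp:>7.3f} -> {logprob_to_percent(lp):6.2f}%")
```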

## Installing LlamaIndex

In [None]:
%pip install -U cleanlab-studio llama-index llama-index-embeddings-huggingface

### Initialize OpenAI Client

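The following cell stores your OpenAI API key in an environment variable and creates the client used to call GPT-4o-mini. Replace `<API Key>` with your own key. (LlamaIndex also uses OpenAI's embedding models by default, although this notebook later switches to a local HuggingFace embedding model.)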

In [12]:
from openai import OpenAI
import os
API_KEY = "<API Key>"
os.environ['OPENAI_API_KEY'] = API_KEY
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", API_KEY))

## Integrating GPT-4o-mini with LlamaIndex

The following code defines two utility functions. `get_completion()` returns the raw chat completion from the OpenAI client given a list of messages and a few other customizable parameters. `parse()` extracts the fields we need from that completion (the response text and the per-token logprobs), calculates the average logprob and the corresponding linear probability over the tokens in the response, and returns everything as a single formatted string.

In [13]:
from math import exp
import numpy as np
from IPython.display import display, HTML
from typing import Dict, List, Optional

from openai.types.chat import ChatCompletion

def get_completion(
    messages: List[Dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens=500,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # if True, return the log probability of each output token in the message content
    top_logprobs=None,  # number of most likely alternative tokens to return per position (requires logprobs=True)
) -> ChatCompletion:
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

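def parse(api_response):
    """Format the response text together with its average logprob and linear probability."""
    choice = api_response.choices[0]
    content = choice.message.content
    # One logprob per generated token; requires the request to be made with logprobs=True
    logprobs = [logprob.logprob for logprob in choice.logprobs.content]

    average_logprob = np.mean(logprobs)
    # exp of the mean logprob, expressed as a percentage
    average_linear_prob = np.exp(average_logprob) * 100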
    
    result_string = (f"Response: {content}\n"
                     f"Average Log Probability: {average_logprob:.4f}\n"
                     f"Average Linear Probability: {average_linear_prob:.2f}%")
    
    return result_string
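
As a rough illustration of what `parse()` computes, the sketch below applies the same arithmetic to a handful of made-up token logprobs. The function name `summarize_logprobs` and the values in `token_logprobs` are assumptions for illustration; real values come from `choice.logprobs.content`. It also shows that exponentiating the mean logprob gives the geometric mean of the per-token probabilities, which is never larger than their arithmetic mean.

```python
import math

# Hypothetical per-token logprobs (not from a real API response)
token_logprobs = [-0.01, -0.20, -1.50, -0.05]


def summarize_logprobs(logprobs):
    """Return (mean logprob, exp(mean logprob) as a percentage)."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    average_logprob = sum(logprobs) / len(logprobs)
    average_linear_prob = math.exp(average_logprob) * 100
    return average_logprob, average_linear_prob


average_logprob, average_linear_prob = summarize_logprobs(token_logprobs)

# Compare against the per-token probabilities directly
token_probs = [math.exp(lp) for lp in token_logprobs]
geometric_mean = math.prod(token_probs) ** (1 / len(token_probs)) * 100
arithmetic_mean = sum(token_probs) / len(token_probs) * 100

print(f"Average Log Probability: {average_logprob:.4f}")
print(f"Average Linear Probability: {average_linear_prob:.2f}%")
print(f"Geometric mean of token probabilities: {geometric_mean:.2f}%")
print(f"Arithmetic mean of token probabilities: {arithmetic_mean:.2f}%")
```

Because a single low-probability token pulls the geometric mean down more than it would an arithmetic mean, this score is somewhat sensitive to individual uncertain tokens.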

The following code sets GPT-4o-mini (with logprobs) as the underlying large language model (LLM) for our RAG system. `GPTWrapper` is built on top of LlamaIndex's [CustomLLM](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html#using-custom-llm-advanced) class. 


In [14]:
from typing import Any, Callable, Optional, Sequence

from llama_index.core.base.llms.types import (
    ChatMessage,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.core.callbacks import CallbackManager
from llama_index.core.llms.callbacks import llm_completion_callback
from llama_index.core.llms.custom import CustomLLM
from llama_index.core.types import PydanticProgramMode
from llama_index.core import Settings
import json

class GPTWrapper(CustomLLM):
    context_window: int = 3900
    num_output: int = 256
    model_name: str = "ChatGPT-4o-mini"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        API_RESPONSE = get_completion(
            [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            model="gpt-4o-mini",
            logprobs=True,
        )
        return CompletionResponse(text=parse(API_RESPONSE))


    @llm_completion_callback()
    def stream_complete(
        self, prompt: str, **kwargs: Any
    ) -> CompletionResponseGen:
        API_RESPONSE = get_completion(
            [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            model="gpt-4o-mini",
            logprobs=True,
        )
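        # Iterating over the raw API response would not yield text,
        # so stream the parsed result string one character at a time.
        text = ""
        for char in parse(API_RESPONSE):
            text += char
            yield CompletionResponse(text=text, delta=char)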

Settings.llm = GPTWrapper()

## Load data and build an index

This tutorial uses NVIDIA's earnings report for Q1 FY2024. The commands below download it and save it in a folder called `data`.

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
!mkdir -p data
!mv NVIDIA_Financial_Results_Q1_FY2024.md data/

Your directory structure should look like this:
```
├── tlm-rag-tutorial.ipynb
└── data
    └── NVIDIA_Financial_Results_Q1_FY2024.md
```

The following cells configure a local HuggingFace embedding model (`BAAI/bge-small-en-v1.5`) and then construct an index from the content in the `data` folder.

In [15]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


In [16]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

## Query your data

Now you can create a Q&A engine over your index and send it prompts.

In [17]:
query_engine = index.as_query_engine()

In [18]:
response = query_engine.query("What is Nvidia's largest product?")
print(response)

Response: The context information provided does not specify what NVIDIA's largest product is. However, NVIDIA is well-known for its graphics processing units (GPUs), particularly the RTX series, which are widely used in gaming, professional visualization, and AI applications. If you are looking for a specific product, the NVIDIA RTX™ GPUs mentioned in the highlights could be considered among their largest offerings.
Average Log Probability: -0.2541
Average Linear Probability: 77.56%
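
Because `parse()` bundles the answer and the scores into one string, a downstream step may want to separate them again. The following is a minimal sketch under that assumption; `ScoredResponse` and `split_scored_response` are hypothetical names introduced here, and the pattern relies on the exact three-part format produced by `parse()` above.

```python
import re
from typing import NamedTuple


class ScoredResponse(NamedTuple):
    text: str
    average_logprob: float
    average_linear_prob: float  # percentage, e.g. 77.56


# DOTALL lets the answer text span several lines; the greedy match
# backtracks to the last "Average Log Probability" line in the string.
_PATTERN = re.compile(
    r"^Response: (?P<text>.*)\n"
    r"Average Log Probability: (?P<logprob>-?\d+\.\d+)\n"
    r"Average Linear Probability: (?P<prob>\d+\.\d+)%\s*$",
    re.DOTALL,
)


def split_scored_response(result: str) -> ScoredResponse:
    """Split a parse()-style string into the answer text and its two scores."""
    match = _PATTERN.match(result)
    if match is None:
        raise ValueError("string does not follow the expected parse() format")
    return ScoredResponse(
        text=match["text"],
        average_logprob=float(match["logprob"]),
        average_linear_prob=float(match["prob"]),
    )


# Made-up example string in the same format as the output above
example = (
    "Response: line one\nline two\n"
    "Average Log Probability: -0.2541\n"
    "Average Linear Probability: 77.56%"
)
scored = split_scored_response(example)
print(scored.text)
print(scored.average_logprob, scored.average_linear_prob)
```

In the notebook this could be applied to the query result with something like `split_scored_response(str(response))`, assuming the string form of the response is exactly what `parse()` returned.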


In [45]:
response = query_engine.query("How many RTX GPU models, including all custom versions released by third-party manufacturers and all revisions across different series, were officially announced by NVIDIA?")
print(response)

Response: NVIDIA officially announced a total of **seven RTX GPU models** in the context provided: 

1. **GeForce RTX 4060 family of GPUs**
2. **GeForce RTX 4070 GPU**
3. **Six new NVIDIA RTX GPUs for mobile and desktop workstations** (based on the Ada architecture)

The total includes the RTX 4060 family as one model and the RTX 4070 as another, along with the six new workstation GPUs.
Average Log Probability: -0.1716
Average Linear Probability: 84.23%
