# NHLBI: BioData Catalyst OpenAI Tutorial
## Using logprobs for classification and Q&A evaluation
**Authors**: Andrew Blair, James Hills, Shyamal Anadkats


This notebook demonstrates the use of the `logprobs` parameter in the Chat Completions API. When `logprobs` is enabled, the API returns the log probabilities of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. The relevant request parameters are:
* `logprobs`: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message. This option is currently not available on the `gpt-4-vision-preview` model.
* `top_logprobs`: An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to true if this parameter is used.

Log probabilities of output tokens indicate the likelihood of each token occurring in the sequence given the context. To simplify, a logprob is `log(p)`, where `p` = probability of a token occurring at a specific position based on the previous tokens in the context. Some key points about `logprobs`:
* Higher log probabilities suggest a higher likelihood of the token in that context. This allows users to gauge the model's confidence in its output or explore alternative responses the model considered.
* Logprob can be any negative number or `0.0`. `0.0` corresponds to 100% probability.
* Logprobs allow us to compute the joint log probability of a sequence as the sum of the logprobs of the individual tokens; exponentiating that sum gives the joint probability. This is useful for scoring and ranking model outputs. Another common approach is to take the average per-token logprob of a sentence to choose the best generation.
* We can examine the `logprobs` assigned to different candidate tokens to understand what options the model considered plausible or implausible.
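
As a minimal sketch of the arithmetic in the points above, the snippet below compares two candidate generations by joint logprob and by average per-token logprob. The logprob values are made up for illustration (they are not API output), and the helper names `joint_logprob` and `mean_logprob` are hypothetical.

```python
from math import exp, isclose

# Hypothetical per-token logprobs for two candidate generations (illustrative values only).
candidate_a = [-0.10, -0.20, -0.05]  # short candidate
candidate_b = [-0.05] * 8            # longer candidate, each token individually likely


def joint_logprob(token_logprobs: list[float]) -> float:
    """Sum of per-token logprobs, i.e. the log of the joint probability."""
    return sum(token_logprobs)


def mean_logprob(token_logprobs: list[float]) -> float:
    """Average per-token logprob, which normalizes for sequence length."""
    return joint_logprob(token_logprobs) / len(token_logprobs)


for name, lps in [("A", candidate_a), ("B", candidate_b)]:
    print(
        f"Candidate {name}: joint logprob={joint_logprob(lps):.3f}, "
        f"joint probability={exp(joint_logprob(lps)):.3f}, "
        f"mean logprob={mean_logprob(lps):.3f}"
    )

# A logprob of 0.0 corresponds to a probability of 1.0 (100%).
assert isclose(exp(0.0), 1.0)
```

With these illustrative numbers, candidate A has the higher joint logprob while candidate B has the higher average per-token logprob, which is why the choice of scoring rule matters when candidates differ in length.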



While there is a wide array of use cases for `logprobs`, this notebook will focus on its use for:

1. Classification tasks

* Large Language Models excel at many classification tasks, but accurately measuring the model's confidence in its outputs can be challenging. `logprobs` provide a probability associated with each class prediction, enabling users to set their own classification or confidence thresholds.
  
* We demonstrate how to develop a classification prompt and use `logprobs` to assess the confidence for classifying cardiac research abstracts into four categories: **Congenital Heart Disease, Cardiometabolic Disorder, Atrial Fibrillation, and Congestive Heart Failure**.


2. Retrieval (Q&A) evaluation

* `logprobs` can assist with self-evaluation in retrieval applications. In the Q&A example, the model outputs a contrived `has_sufficient_context_for_answer` boolean, which can serve as a confidence score of whether the answer is contained in the retrieved content. Evaluations of this type can reduce retrieval-based hallucinations and enhance accuracy.

3. Autocomplete
*  `logprobs` could help us decide how to suggest words as a user is typing.

4. Token highlighting and outputting bytes
*  Users can easily create a token highlighter using the built-in tokenization that comes with enabling `logprobs`. Additionally, the `bytes` field contains the UTF-8 byte values of each output token, which is particularly useful for reproducing emojis and special characters.

## 0. Imports and utils

In [1]:
from openai import OpenAI
from dotenv import load_dotenv
import os

from math import exp
import numpy as np
from IPython.display import display, HTML

load_dotenv()
client = OpenAI()

In [2]:
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=500,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # if True, return the log probability of each output token in the message content
    top_logprobs=None,  # number of most likely tokens to return at each position (requires logprobs=True)
):  # returns the full ChatCompletion response object, not a string
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

## 1. Using `logprobs` to assess confidence for classification tasks

Let's say we want to create a system to classify research article abstracts into a set of pre-defined categories. Without `logprobs`, we can use Chat Completions to do this, but it is much more difficult to assess the certainty with which the model made its classifications.

Now, with `logprobs` enabled, we can see exactly how confident the model is in its predictions, which is crucial for creating an accurate and trustworthy classifier. For example, if the log probability for the chosen category is high, this suggests the model is quite confident in its classification. If it's low, this suggests the model is less confident. This can be particularly useful in cases where the model's classification is not what you expected, or when the model's output needs to be reviewed or validated by a human.

We'll begin with a prompt that presents the model with four categories: **Congenital Heart Disease, Cardiometabolic Disorder, Atrial Fibrillation, and Congestive Heart Failure**. The model is then tasked with classifying articles into these categories based solely on their abstracts.

In [3]:
CLASSIFICATION_PROMPT = """You will be given an abstract of a research article.
Classify the article into one of the following categories: Congenital Heart Disease, Cardiometabolic Disorder, Atrial Fibrillation, and Congestive Heart Failure.
Return only the name of one of the four categories, and nothing else.
MAKE SURE your output is one of the four categories stated.
Article Abstract: {headline}"""


Let's look at four sample abstracts, and first begin with a standard Chat Completions output, without `logprobs`.

In [4]:
abstracts = [
    
    # Congenital Heart Disease
    # Richter, F., Morton, S.U., Kim, S.W. et al. Genomic analyses implicate noncoding de novo variants in congenital heart disease. Nat Genet 52, 769–777 (2020). 
    "A genetic etiology is identified for one-third of patients with congenital heart disease (CHD), with 8% of cases attributable to coding de novo variants (DNVs). To assess the contribution of noncoding DNVs to CHD, we compared genome sequences from 749 CHD probands and their parents with those from 1,611 unaffected trios. Neural network prediction of noncoding DNV transcriptional impact identified a burden of DNVs in individuals with CHD (n = 2,238 DNVs) compared to controls (n = 4,177; P = 8.7 × 10−4). Independent analyses of enhancers showed an excess of DNVs in associated genes (27 genes versus 3.7 expected, P = 1 × 10−5). We observed significant overlap between these transcription-based approaches (odds ratio (OR) = 2.5, 95% confidence interval (CI) 1.1–5.0, P = 5.4 × 10−3). CHD DNVs altered transcription levels in 5 of 31 enhancers assayed. Finally, we observed a DNV burden in RNA-binding-protein regulatory sites (OR = 1.13, 95% CI 1.1–1.2, P = 8.8 × 10−5). Our findings demonstrate an enrichment of potentially disruptive regulatory noncoding DNVs in a fraction of CHD at least as high as that observed for damaging coding DNVs.",
    
    # Cardiometabolic
    # Laber, S. et al. Discovering cellular programs of intrinsic and extrinsic drivers of metabolic traits using LipocyteProfiler. Cell Genom. 3, 100346 (2023).
    "A primary obstacle in translating genetic associations with disease into therapeutic strategies is elucidating the cellular programs affected by genetic risk variants and effector genes. Here, we introduce LipocyteProfiler, a cardiometabolic-disease-oriented high-content image-based profiling tool that enables evaluation of thousands of morphological and cellular profiles that can be systematically linked to genes and genetic variants relevant to cardiometabolic disease. We show that LipocyteProfiler allows surveillance of diverse cellular programs by generating rich context- and process-specific cellular profiles across hepatocyte and adipocyte cell-state transitions. We use LipocyteProfiler to identify known and novel cellular mechanisms altered by polygenic risk of metabolic disease, including insulin resistance, fat distribution, and the polygenic contribution to lipodystrophy. LipocyteProfiler paves the way for large-scale forward and reverse deep phenotypic profiling in lipocytes and provides a framework for the unbiased identification of causal relationships between genetic variants and cellular programs relevant to human disease.",
    
    # Atrial Fibrillation
    # Miyazawa, K. et al. Cross-ancestry genome-wide analysis of atrial fibrillation unveils disease biology and enables cardioembolic risk prediction. Nat. Genet. 55, 187–197 (2023).
    "Atrial fibrillation (AF) is a common cardiac arrhythmia resulting in increased risk of stroke. Despite highly heritable etiology, our understanding of the genetic architecture of AF remains incomplete. Here we performed a genome-wide association study in the Japanese population comprising 9,826 cases among 150,272 individuals and identified East Asian-specific rare variants associated with AF. A cross-ancestry meta-analysis of >1 million individuals, including 77,690 cases, identified 35 new susceptibility loci. Transcriptome-wide association analysis identified IL6R as a putative causal gene, suggesting the involvement of immune responses. Integrative analysis with ChIP-seq data and functional assessment using human induced pluripotent stem cell-derived cardiomyocytes demonstrated ERRg as having a key role in the transcriptional regulation of AF-associated genes. A polygenic risk score derived from the cross-ancestry meta-analysis predicted increased risks of cardiovascular and stroke mortalities and segregated individuals with cardioembolic stroke in undiagnosed AF patients. Our results provide new biological and clinical insights into AF genetics and suggest their potential for clinical applications.",
    
    # Congestive Heart Failure
    # Joseph, J. et al. Genetic architecture of heart failure with preserved versus reduced ejection fraction. Nat. Commun. 13, 7753 (2022).
    "Pharmacologic clinical trials for heart failure with preserved ejection fraction have been largely unsuccessful as compared to those for heart failure with reduced ejection fraction. Whether differences in the genetic underpinnings of these major heart failure subtypes may provide insights into the disparate outcomes of clinical trials remains unknown. We utilize a large, uniformly phenotyped, single cohort of heart failure sub-classified into heart failure with reduced and with preserved ejection fractions based on current clinical definitions, to conduct detailed genetic analyses of the two heart failure sub-types. We find different genetic architectures and distinct genetic association profiles between heart failure with reduced and with preserved ejection fraction suggesting differences in underlying pathobiology. The modest genetic discovery for heart failure with preserved ejection fraction (one locus) compared to heart failure with reduced ejection fraction (13 loci) despite comparable sample sizes indicates that clinically defined heart failure with preserved ejection fraction likely represents the amalgamation of several, distinct pathobiological entities. Development of consensus sub-phenotyping of heart failure with preserved ejection fraction is paramount to better dissect the underlying genetic signals and contributors to this highly prevalent condition."
]


In [5]:
for article in abstracts:
    print(CLASSIFICATION_PROMPT.format(headline=article))
    API_RESPONSE = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=article)}],
    )
    print(f"Category: {API_RESPONSE.choices[0].message.content}\n")

You will be given an abstract of a research article.
Classify the article into one of the following categories: Congenital Heart Disease, Cardiometabolic Disorder, Atrial Fibrillation, and Congestive Heart Failure.
Return only the name of one of the four categories, and nothing else.
MAKE SURE your output is one of the four categories stated.
Article Abstract: A genetic etiology is identified for one-third of patients with congenital heart disease (CHD), with 8% of cases attributable to coding de novo variants (DNVs). To assess the contribution of noncoding DNVs to CHD, we compared genome sequences from 749 CHD probands and their parents with those from 1,611 unaffected trios. Neural network prediction of noncoding DNV transcriptional impact identified a burden of DNVs in individuals with CHD (n = 2,238 DNVs) compared to controls (n = 4,177; P = 8.7 × 10−4). Independent analyses of enhancers showed an excess of DNVs in associated genes (27 genes versus 3.7 expected, P = 1 × 10−5). We o

Here we can see the selected category for each abstract. However, we have no visibility into the confidence of the model in its predictions. Let's rerun the same prompt but with `logprobs` enabled, and `top_logprobs` set to 2 (this will show us the 2 most likely output tokens for each token). We can also output the linear probability of each output token, converting the log probability to the more easily interpretable scale of 0-100%.

In [6]:
for article in abstracts:
    print(f"Article Abstract: {article}\n")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=article)}],
        model="gpt-3.5-turbo",
        logprobs=True,
        top_logprobs=2,
    )
    top_two_logprobs = API_RESPONSE.choices[0].logprobs.content[0].top_logprobs
    print(f"Category: {API_RESPONSE.choices[0].message.content}\n")
    html_content = ""
    for i, logprob in enumerate(top_two_logprobs, start=1):
        html_content += (
            f"<span style='color: cyan'>Output token {i}:</span> {logprob.token}, "
            f"<span style='color: darkorange'>logprobs:</span> {logprob.logprob}, "
            f"<span style='color: magenta'>linear probability:</span> {np.round(np.exp(logprob.logprob)*100,2)}%<br>"
        )
    display(HTML(html_content))
    print("\n")

Article Abstract: A genetic etiology is identified for one-third of patients with congenital heart disease (CHD), with 8% of cases attributable to coding de novo variants (DNVs). To assess the contribution of noncoding DNVs to CHD, we compared genome sequences from 749 CHD probands and their parents with those from 1,611 unaffected trios. Neural network prediction of noncoding DNV transcriptional impact identified a burden of DNVs in individuals with CHD (n = 2,238 DNVs) compared to controls (n = 4,177; P = 8.7 × 10−4). Independent analyses of enhancers showed an excess of DNVs in associated genes (27 genes versus 3.7 expected, P = 1 × 10−5). We observed significant overlap between these transcription-based approaches (odds ratio (OR) = 2.5, 95% confidence interval (CI) 1.1–5.0, P = 5.4 × 10−3). CHD DNVs altered transcription levels in 5 of 31 enhancers assayed. Finally, we observed a DNV burden in RNA-binding-protein regulatory sites (OR = 1.13, 95% CI 1.1–1.2, P = 8.8 × 10−5). Our fi



Article Abstract: A primary obstacle in translating genetic associations with disease into therapeutic strategies is elucidating the cellular programs affected by genetic risk variants and effector genes. Here, we introduce LipocyteProfiler, a cardiometabolic-disease-oriented high-content image-based profiling tool that enables evaluation of thousands of morphological and cellular profiles that can be systematically linked to genes and genetic variants relevant to cardiometabolic disease. We show that LipocyteProfiler allows surveillance of diverse cellular programs by generating rich context- and process-specific cellular profiles across hepatocyte and adipocyte cell-state transitions. We use LipocyteProfiler to identify known and novel cellular mechanisms altered by polygenic risk of metabolic disease, including insulin resistance, fat distribution, and the polygenic contribution to lipodystrophy. LipocyteProfiler paves the way for large-scale forward and reverse deep phenotypic pr



Article Abstract: Atrial fibrillation (AF) is a common cardiac arrhythmia resulting in increased risk of stroke. Despite highly heritable etiology, our understanding of the genetic architecture of AF remains incomplete. Here we performed a genome-wide association study in the Japanese population comprising 9,826 cases among 150,272 individuals and identified East Asian-specific rare variants associated with AF. A cross-ancestry meta-analysis of >1 million individuals, including 77,690 cases, identified 35 new susceptibility loci. Transcriptome-wide association analysis identified IL6R as a putative causal gene, suggesting the involvement of immune responses. Integrative analysis with ChIP-seq data and functional assessment using human induced pluripotent stem cell-derived cardiomyocytes demonstrated ERRg as having a key role in the transcriptional regulation of AF-associated genes. A polygenic risk score derived from the cross-ancestry meta-analysis predicted increased risks of cardi



Article Abstract: Pharmacologic clinical trials for heart failure with preserved ejection fraction have been largely unsuccessful as compared to those for heart failure with reduced ejection fraction. Whether differences in the genetic underpinnings of these major heart failure subtypes may provide insights into the disparate outcomes of clinical trials remains unknown. We utilize a large, uniformly phenotyped, single cohort of heart failure sub-classified into heart failure with reduced and with preserved ejection fractions based on current clinical definitions, to conduct detailed genetic analyses of the two heart failure sub-types. We find different genetic architectures and distinct genetic association profiles between heart failure with reduced and with preserved ejection fraction suggesting differences in underlying pathobiology. The modest genetic discovery for heart failure with preserved ejection fraction (one locus) compared to heart failure with reduced ejection fraction (






This shows how important `logprobs` can be: when we use LLMs for classification tasks, we can set confidence thresholds, or output several candidate tokens if the log probability of the selected output is not sufficiently high. For instance, if we are creating a recommendation engine to tag articles and abstracts, we can automatically classify abstracts that cross a certain threshold and send the less certain ones for manual review.
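
The thresholding idea above could be sketched as follows. This is a minimal illustration, not part of the original notebook: the `route_prediction` helper, the 0.90 threshold, and the sample logprobs are all assumptions. It also assumes, as the loop above does, that the logprob of the first output token is a reasonable proxy for the whole category; categories whose names begin with the same token may need additional tokens to be told apart.

```python
from math import exp

# Assumed threshold: tune this on labeled validation data for your own task.
CONFIDENCE_THRESHOLD = 0.90


def route_prediction(logprob: float, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return 'auto' if the linear probability clears the threshold, else 'manual_review'."""
    probability = exp(logprob)
    return "auto" if probability >= threshold else "manual_review"


# Hypothetical (label, first-token logprob) pairs, not real API output.
predictions = [
    ("Atrial Fibrillation", -0.01),
    ("Cardiometabolic Disorder", -0.70),
]

for label, logprob in predictions:
    print(f"{label}: p={exp(logprob):.2%} -> {route_prediction(logprob)}")
```

In practice the `logprob` argument would come from `API_RESPONSE.choices[0].logprobs.content[0].logprob`, as read in the loop above.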

## 2. Retrieval confidence scoring to reduce hallucinations

To reduce hallucinations and improve the performance of our RAG-based Q&A system, we can use `logprobs` to evaluate how confident the model is in its retrieval.

Let's say we have built a retrieval system using RAG for Q&A, but are struggling with hallucinated answers to our questions. *Note:* we will use a hardcoded article for this example, but see other entries in the cookbook for tutorials on using RAG for Q&A.

In [7]:
# Article retrieved
priest_results = """Meta-analysis  identified  16  novel  loci,  including  12  rare  variants, 
which  displayed  moderate  or  large  effect  sizes  (median  odds  ratio,  3.02)  
for  4  separate  CHD  categories.  Analyses  of  chromatin  structure  link  13  
of  the  genome-wide significant  loci  to  key  genes  in  cardiac  development; 
rs373447426  (minor  allele  frequency,  0.003  [odds  ratio,  3.37  for  Conotruncal heart disease]; P=1.49×10−8) 
is predicted to disrupt chromatin structure for 2 nearby genes BDH1 and DLG1 involved  in  Conotruncal  development.
A  lead  variant  rs189203952  (minor  allele  frequency,  0.01  [odds  ratio,  2.4  for  left ventricular outflow tract obstruction]; P=1.46×10−8) 
is predicted to disrupt the binding sites of 4 transcription factors known to participate in cardiac development in the promoter of SPAG9. 
A tissue-specific model of chromatin conformation suggests that common variant rs78256848 (minor allele frequency, 0.11 [odds ratio, 1.4 for Conotruncal heart disease]; P=2.6×10−8) 
physically interacts with NCAM1 (PFDR=1.86×10−27), a neural adhesion molecule acting in cardiac development. Importantly, while each individual 
malformation displayed substantial heritability (observed h2 ranging from 0.26 for complex malformations to 0.37 for left ventricular outflow tract obstructive disease) 
the risk for different CHD malformations appeared to be separate, without genetic correlation measured by linkage disequilibrium score regression or regional colocalization."""

In [4]:

# Questions that can be easily answered given the article
easy_questions = [
    "How many rare variants were identified?",
    "How many separate CHD categories were included in the meta-analysis?"
]

# Questions that are not fully covered in the article
medium_questions = [
    "What did the tissue-specific model of chromatin conformation suggest? Name the common variant identifier and gene name that is in physical contact.",
    "Why does the risk for different CHD malformations appear to be separate?",
    "What genomic data and sequencing technology was used in the tissue specific model of chromatin conformation?"
]

Now, what we can do is ask the model to respond to the question, but then also evaluate its response. Specifically, we will ask the model to output a boolean `has_sufficient_context_for_answer`. We can then evaluate the `logprobs` to see just how confident the model is that its answer was contained in the provided context.

In [5]:
PROMPT = """You retrieved this article: {article}. The question is: {question}.
Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.
Your output should JUST be the boolean true or false, of if you have sufficient information in the article to answer the question.
Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False', nothing else.
"""


In [10]:
html_output = ""
html_output += "Questions clearly answered in article"

for question in easy_questions:
    API_RESPONSE = get_completion(
        [
            {
                "role": "user",
                "content": PROMPT.format(
                    article=priest_results, question=question
                ),
            }
        ],
        model="gpt-3.5-turbo",
        logprobs=True,
    )
    html_output += f'<p style="color:green">Question: {question}</p>'
    print(API_RESPONSE.choices[0].logprobs)
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'
    break  # stops after the first question; remove this line to evaluate every easy question

# html_output += "Questions only partially covered in the article"

# for question in medium_questions:
#     API_RESPONSE = get_completion(
#         [
#             {
#                 "role": "user",
#                 "content": PROMPT.format(
#                     article=priest_results, question=question
#                 ),
#             }
#         ],
#         model="gpt-3.5-turbo",
#         logprobs=True,
#         top_logprobs=3,
#     )
#     html_output += f'<p style="color:green">Question: {question}</p>'
#     for logprob in API_RESPONSE.choices[0].logprobs.content:
#         html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'

display(HTML(html_output))

ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token='True', bytes=[84, 114, 117, 101], logprob=-0.1508361, top_logprobs=[])])


In the run shown above, only the first easy question is evaluated, and the model outputs `True` with a log probability of about -0.15 (roughly 86% linear probability), so it is fairly confident that the article has sufficient context to answer the question.<br><br>
For the trickier questions, which are less clearly answered in the article, we would expect the model to be less confident that it has sufficient context; uncommenting the second loop lets you check this. This kind of check is a useful guardrail to help ensure our retrieved content is sufficient.<br><br>
This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when the linear probability of `True` for `has_sufficient_context_for_answer` falls below a certain threshold. Methods like this have been reported to significantly reduce hallucinations and errors in RAG-based Q&A ([example](https://jfan001.medium.com/how-we-cut-the-rate-of-gpt-hallucinations-from-20-to-less-than-2-f3bfcc10e4ec)).
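
One way to turn this self-evaluation into a gate is sketched below. The `has_sufficient_context` helper and the 0.80 minimum probability are assumptions for illustration; the first logprob is the value from the `ChoiceLogprobs` output shown above, and the other two cases are hypothetical.

```python
from math import exp


def has_sufficient_context(token: str, logprob: float, min_probability: float = 0.80) -> bool:
    """Accept the context only if the model said 'True' with enough probability."""
    return token.strip().lower() == "true" and exp(logprob) >= min_probability


# Value taken from the ChoiceLogprobs output shown above.
print(has_sufficient_context("True", -0.1508361))

# Hypothetical cases: a low-confidence 'True' and a confident 'False'.
print(has_sufficient_context("True", -1.2))
print(has_sufficient_context("False", -0.05))
```

When the gate returns `False`, the application could decline to answer, retrieve more context, or ask the user to rephrase the question.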

## 3. Autocomplete

Another use case for `logprobs` is autocomplete systems. Without creating the entire autocomplete system end-to-end, let's demonstrate how `logprobs` could help us decide how to suggest words as a user is typing.

First, let's come up with a sample sentence: `"A variant disrupts transcription factors in cardiac development."` Let's say we want it to dynamically recommend the next word or token as we are typing the sentence, but *only* if the model is quite sure of what the next word will be. To demonstrate this, let's break up the sentence into sequential components.

In [27]:
sentence_list = [

    # A  lead  variant  rs189203952 is predicted to disrupt the binding sites of
    # 4 transcription factors known to participate in cardiac development in the promoter of SPAG9. 
    "A",
    "A variant",
    "A variant disrupts",
    "A variant disrupts transcription",
    "A variant disrupts transcription factors",
    "A variant disrupts transcription factors in",
    "A variant disrupts transcription factors in cardiac",
    "A variant disrupts transcription factors in cardiac development"
]

Now, we can ask `gpt-3.5-turbo` to act as an autocomplete engine with whatever context the model is given. We can enable `logprobs` and can see how confident the model is in its prediction.

In [28]:
high_prob_completions = {}
low_prob_completions = {}
html_output = ""

for sentence in sentence_list:
    PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""
    API_RESPONSE = get_completion(
        [{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        model="gpt-3.5-turbo",
        logprobs=True,
        top_logprobs=3,
    )
    html_output += f'<p>Sentence: {sentence}</p>'
    first_token = True
    for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
        html_output += f'<p style="color:cyan">Predicted next token: {token.token}, <span style="color:darkorange">logprobs: {token.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(token.logprob)*100,2)}%</span></p>'
        if first_token:
            if np.exp(token.logprob) > 0.95:
                high_prob_completions[sentence] = token.token
            if np.exp(token.logprob) < 0.60:
                low_prob_completions[sentence] = token.token
        first_token = False
    html_output += "<br>"

display(HTML(html_output))

Let's look at the high confidence autocompletions:

In [30]:
high_prob_completions


{'A variant': 'of'}

This looks reasonable! We can feel confident in this suggestion. It's pretty likely you want to write 'of' after writing 'A variant' (e.g., 'a variant of unknown significance')! Now let's look at the autocompletion suggestions the model was less confident about:

In [32]:
low_prob_completions


{'A': 'dog', 'A variant disrupts transcription factors': 'and'}

These are logical as well. It's pretty unclear what the user is going to say with just the prefix 'A', and it's really anyone's guess what or how the variant disrupts transcription factors. <br><br>
So, using `gpt-3.5-turbo`, we can create the root of a dynamic autocompletion engine with `logprobs`!
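
The suggestion rule used above (only surface a completion when the top token is very likely) can be isolated into a small function. This is a sketch under assumptions: `suggest_next_token` is a hypothetical helper, the candidates are made-up `(token, logprob)` pairs standing in for the API's `top_logprobs` entries, and the 0.95 cutoff simply mirrors the one used in the loop above.

```python
from math import exp
from typing import Optional


def suggest_next_token(
    top_logprobs: list[tuple[str, float]], min_probability: float = 0.95
) -> Optional[str]:
    """Return the most likely token if its probability clears the threshold, else None."""
    if not top_logprobs:
        return None
    token, logprob = max(top_logprobs, key=lambda pair: pair[1])
    return token if exp(logprob) > min_probability else None


# Hypothetical candidates, not real API output.
confident = [(" of", -0.02), (" in", -4.5), (" that", -5.1)]
uncertain = [(" dog", -1.1), (" cat", -1.3), (" man", -1.6)]

print(suggest_next_token(confident))  # a suggestion is returned
print(suggest_next_token(uncertain))  # no suggestion: the model is not sure enough
```

Returning `None` instead of a low-confidence guess keeps the interface quiet until the model is fairly sure what comes next.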

## 4. Highlighter and bytes parameter

Let's quickly touch on creating a simple token highlighter with `logprobs`, and using the `bytes` field. First, we can create a function that counts and highlights each token. While this doesn't use the log probabilities, it uses the built-in tokenization that comes with enabling `logprobs`.

In [33]:
PROMPT = """What's the longest gene name according to the HUGO gene name nomenclature?"""

API_RESPONSE = get_completion(
    [{"role": "user", "content": PROMPT}], model="gpt-3.5-turbo", logprobs=True, top_logprobs=5
)


def highlight_text(api_response):
    colors = [
        "#FF00FF",  # Magenta
        "#008000",  # Green
        "#FF8C00",  # Dark Orange
        "#FF0000",  # Red
        "#0000FF",  # Blue
    ]
    tokens = api_response.choices[0].logprobs.content

    color_idx = 0  # Initialize color index
    html_output = ""  # Initialize HTML output
    for t in tokens:
        token_str = bytes(t.bytes).decode("utf-8")  # Decode bytes to string

        # Add colored token to HTML output
        html_output += f"<span style='color: {colors[color_idx]}'>{token_str}</span>"

        # Move to the next color
        color_idx = (color_idx + 1) % len(colors)
    display(HTML(html_output))  # Display HTML output
    print(f"Total number of tokens: {len(tokens)}")

In [34]:
highlight_text(API_RESPONSE)


Total number of tokens: 40


Next, let's reconstruct a sentence using the `bytes` field. With `logprobs` enabled, we are given both each token and the UTF-8 byte values (as decimal integers) of the token string. These byte values can be helpful when handling tokens that are, or contain, emojis or special characters.

In [37]:
PROMPT = """Output the genetics emoji and its name."""
API_RESPONSE = get_completion(
    [{"role": "user", "content": PROMPT}], model="gpt-3.5-turbo", logprobs=True
)

aggregated_bytes = []
joint_logprob = 0.0

# Iterate over tokens, aggregate bytes and calculate joint logprob
for token in API_RESPONSE.choices[0].logprobs.content:
    print("Token:", token.token)
    print("Log prob:", token.logprob)
    print("Linear prob:", np.round(exp(token.logprob) * 100, 2), "%")
    print("Bytes:", token.bytes, "\n")
    aggregated_bytes += token.bytes
    joint_logprob += token.logprob

# Decode the aggregated bytes to text
aggregated_text = bytes(aggregated_bytes).decode("utf-8")

# Assert that the decoded text is the same as the message content
assert API_RESPONSE.choices[0].message.content == aggregated_text

# Print the results
print("Bytes array:", aggregated_bytes)
print(f"Decoded bytes: {aggregated_text}")
print("Joint prob:", np.round(exp(joint_logprob) * 100, 2), "%")

Token: \xf0\x9f
Log prob: -0.48728606
Linear prob: 61.43 %
Bytes: [240, 159] 

Token: \xa7
Log prob: -0.009539441
Linear prob: 99.05 %
Bytes: [167] 

Token: \xac
Log prob: -5.1331983e-05
Linear prob: 99.99 %
Bytes: [172] 

Token:  DNA
Log prob: -0.8761488
Linear prob: 41.64 %
Bytes: [32, 68, 78, 65] 

Token:  Double
Log prob: -1.1560525
Linear prob: 31.47 %
Bytes: [32, 68, 111, 117, 98, 108, 101] 

Token:  Hel
Log prob: -4.3941356e-05
Linear prob: 100.0 %
Bytes: [32, 72, 101, 108] 

Token: ix
Log prob: -6.704273e-07
Linear prob: 100.0 %
Bytes: [105, 120] 

Bytes array: [240, 159, 167, 172, 32, 68, 78, 65, 32, 68, 111, 117, 98, 108, 101, 32, 72, 101, 108, 105, 120]
Decoded bytes: 🧬 DNA Double Helix
Joint prob: 7.97 %


Here, we see that while the first token was `\xf0\x9f`, which is not a complete character on its own, we can take its byte values and append them to a bytes array. Then, we can easily decode this array into a full sentence, and validate with our assert statement that the decoded bytes are the same as our completion message!
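
To see why the bytes need to be aggregated before decoding, here is a small standalone sketch using the byte values of the three emoji tokens from the output above. The variable names are illustrative only.

```python
# Per-token byte values copied from the output above for the three emoji tokens.
token_bytes = [[240, 159], [167], [172]]

# The first token alone is an incomplete UTF-8 sequence, so decoding it fails.
try:
    bytes(token_bytes[0]).decode("utf-8")
    decoded_alone = True
except UnicodeDecodeError:
    decoded_alone = False

# Concatenating the bytes of all three tokens yields one valid character.
combined = bytes(b for chunk in token_bytes for b in chunk)
emoji = combined.decode("utf-8")

print("First token decodes on its own:", decoded_alone)
print("Combined:", emoji, "| code point:", hex(ord(emoji)))
```

This also means that decoding each token's bytes separately, as `highlight_text` does, can raise `UnicodeDecodeError` when a character is split across tokens; passing `errors="replace"` to `decode` is one way to make per-token decoding tolerant of that.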

Additionally, we can get the joint probability of the entire completion, which is the exponential of the sum of each token's log probability (equivalently, the product of the per-token probabilities). This tells us how likely this completion is given the prompt. Since our prompt is quite directive (asking for a certain emoji and its name), the joint probability of this output is relatively high for a multi-token completion (about 8%). If we ask for a random output, however, we'll see a much lower joint probability. This can also be a good tactic for developers during prompt engineering.

## 5. Conclusion

Nice! We were able to use the `logprobs` parameter to build a more robust classifier, evaluate retrieval for a Q&A system, sketch an autocomplete engine, and encode and decode the bytes of our tokens! `logprobs` adds useful information and signal to our completions output, and we are excited to see how developers incorporate it to improve applications.

## 6. Possible extensions

There are many other use cases for `logprobs` that are not covered in this cookbook. We can use `logprobs` for:
  - Evaluations (e.g.: calculate `perplexity` of outputs, which is the evaluation metric of uncertainty or surprise of the model at its outcomes)
  - Moderation
  - Keyword selection
  - Improve prompts and interpretability of outputs
  - Token healing
  - and more!
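
As a small illustration of the evaluation idea in the list above, perplexity can be computed directly from per-token logprobs as the exponential of the negative mean logprob. The sketch below reuses the logprobs printed for the emoji completion in section 4; the `perplexity` helper is a hypothetical name, not part of the API.

```python
from math import exp

# Per-token logprobs copied from the emoji completion output in section 4.
token_logprobs = [
    -0.48728606,
    -0.009539441,
    -5.1331983e-05,
    -0.8761488,
    -1.1560525,
    -4.3941356e-05,
    -6.704273e-07,
]


def perplexity(logprobs: list[float]) -> float:
    """Perplexity is the exponential of the negative mean per-token logprob."""
    return exp(-sum(logprobs) / len(logprobs))


print(f"Joint probability: {exp(sum(token_logprobs)):.2%}")
print(f"Perplexity: {perplexity(token_logprobs):.3f}")
```

A perplexity of 1.0 would mean the model assigned 100% probability to every token; larger values indicate more uncertainty per token.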