In [3]:
import os
import re
from dotenv import load_dotenv
from openai import OpenAI
from datasets import load_dataset  # Hugging Face's `datasets` library

# Lesson five: Reward functions with LLM-as-a-judge

Most of the lectures are either fully general (the intro ones) or specific to solving Wordle (like lessons three and four). This one is specific, but not to Wordle. It's a good way to think about different kinds of rewards, especially for tasks where 'good' and 'bad' are harder to quantify than in a game or in code (i.e., where the output isn't directly and objectively verifiable).

In [8]:
_ = load_dotenv(override=True)  # populate env from the .env file; override=True replaces values that are already set, which is ok here

MODEL_NAME='predibase/Meta-Llama-3.1-8B-Instruct-dequantized'

# the examples use both OpenAI's API and Predibase's API.
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

pb_client = OpenAI(
    base_url=os.environ['PREDIBASE_LLAMA_MODEL_URL'],
    api_key=os.environ['PREDIBASE_API_TOKEN']
)
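
Both clients read their credentials with `os.environ[...]`, which raises a bare `KeyError` if a variable is missing from the `.env` file. A minimal sketch of a check that lists everything that's missing up front; `missing_env_vars` and `REQUIRED_VARS` are names I made up for this note, not part of the lecture code:

```python
import os

# The three variables the client setup above reads.
REQUIRED_VARS = ['OPENAI_API_KEY', 'PREDIBASE_LLAMA_MODEL_URL', 'PREDIBASE_API_TOKEN']

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f'Missing from .env / environment: {missing}')
```

Passing `env` explicitly makes the helper easy to try with a plain dict instead of the real environment.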

## Load earnings call data and use Predibase API to get summaries

In [9]:
ds = load_dataset('mrSoul7766/ECTSum')
transcript = ds['train'][1]['text']
print(transcript[:1983])

I'm joined by Tom Greco, our President and Chief Executive Officer; and Jeff Shepherd, our Executive Vice President and Chief Financial Officer.
We also hope that you and your families are healthy and safe.
The health and safety of our team members and customers has been a top priority over the past year.
With strength across all channels, we delivered comparable store sales growth of 24.7%, and margin expansion of 478 basis points versus the prior year.
On a two-year stack, our comp sales growth was 15.4%.
Adjusted diluted earnings per share of $3.34 represented an all-time quarterly high for AAP, and improved more than 230% compared to Q1 2020.
Free cash flow of $259 million was up significantly versus the prior year, and we returned over $203 million to our shareholders through a combination of share repurchases and our quarterly cash dividend.
In addition, we recently announced an updated capital allocation framework targeting top quartile total shareholder return, highlighted by o

The above is a transcript of an earnings call. For this example notebook, suppose we want a good summary of the call with the key takeaways and no hallucinations.

In [10]:
SUMMARIZE_PROMPT = """Generate a concise summary of the information in the following earnings call transcript.

Only respond with the summary, do not include any extraneous text.

Transcript:

{transcript}
"""

In [11]:
def summarize(transcript, n=1):
    prompt = SUMMARIZE_PROMPT.format(transcript=transcript)
    messages = [
        {'role': 'user', 'content': prompt},
    ]

    return pb_client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        n=n,
        temperature=0.9,
    )

resp = summarize(transcript)
summary = resp.choices[0].message.content
print(summary)

Advance Auto Parts reported Q1 2021 results:

* Comparable store sales growth of 24.7%
* Margin expansion of 478 basis points
* Adjusted diluted earnings per share of $3.34, up 230% from Q1 2020
* Free cash flow of $259 million
* Returned $203 million to shareholders through share repurchases and dividend
* Raised comp sales guidance to up 4-6%
* Updated adjusted OI margin range to 9-9.2%
* Confident in long-term strategic plans to deliver strong and sustainable total shareholder return


Interestingly, the summary I get above is much shorter (closer to the requested conciseness) than the one the lecture shows. I think I'm using the same model, but perhaps something else has changed?

Side question: if the point of this RFT work is to get better summaries, and models already summarize well (especially given good prompts), and newer and/or bigger models probably summarize even better, then we need metrics to judge how much our RFT work improves things and how the result compares to existing and new models. Maybe some of the reward function work here is useful for that too. Or would that be 'teaching to the test', so that we'd want a separate metric?

## Use a separate LLM as a judge of the quality of the summary

Above we used a Predibase-hosted Llama-3.1-8B model to generate a summary. Here we'll use a separate model, GPT-4o-mini via OpenAI directly, to assign a reward score to a summary. That is, we're using the second LLM as a cheaper and faster replacement for having a subject-matter expert (SME) judge and rate the quality of the summary.

In [12]:
JUDGE_PROMPT_V1 = """
Rate the following summary of an earnings call transcript on a 
scale from 1 to 10. 

1 means the summary is very poor, 10 means the summary is very good.

Provide reasoning followed by the final score at the end 
surrounded by <score> tags.

For example:

<score>1</score>

Transcript:

{transcript}

Summary:

{summary}
"""

In [15]:
def judge_reward_v1(
    transcript: str,
    summary: str,
    model: str = 'gpt-4o-mini',
    verbose: bool = False
) -> float:
    prompt = JUDGE_PROMPT_V1.format(transcript=transcript, summary=summary)
    messages = [ { 'role': 'user', 'content': prompt } ]

    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        n=1,
        temperature=0,  # favor the most likely score (near-deterministic, not guaranteed identical across calls)
    )
    completion = resp.choices[0].message.content

    if verbose:
        print(completion)

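    # message.content can be None, so fall back to an empty string before matching.
    # Allow optional whitespace around the digits inside the tags.
    match = re.search(r'<score>\s*(\d+)\s*</score>', completion or '')
    if match is None:
        return 0.0  # no parsable score: treat it as the lowest reward

    # Scale the 1-10 score to a reward in [0, 1].
    return int(match.group(1)) / 10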

In [14]:
score = judge_reward_v1(transcript, summary, verbose=True)
print(score)

The summary provided captures the key financial metrics and strategic updates from the earnings call transcript effectively. It highlights significant achievements such as the impressive comparable store sales growth, margin expansion, and the substantial increase in adjusted diluted earnings per share. Additionally, it notes the company's commitment to returning value to shareholders and updates on guidance, which are crucial for investors.

However, the summary could be improved by including more context about the factors driving these results, such as the impact of federal stimulus, changes in consumer behavior, and the company's strategic initiatives. It also lacks mention of specific categories or regions that contributed to the growth, which could provide a more comprehensive view of the company's performance.

Overall, while the summary is concise and covers the essential points, it misses some of the nuances and details that would give a fuller picture of the earnings call. The
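
The score-extraction step in `judge_reward_v1` can be pulled into a small pure function and checked without calling the API. A minimal sketch: `parse_score` is a hypothetical helper name (not from the lecture), and two choices here are my own assumptions rather than the lecture's behavior: it takes the last `<score>` tag (the prompt asks for the score at the end) and it returns 0.0 for values outside 1 to 10.

```python
import re

# Tolerate whitespace inside the tags, e.g. "<score> 7 </score>".
SCORE_RE = re.compile(r'<score>\s*(\d+)\s*</score>')

def parse_score(completion: str) -> float:
    """Return the last <score>N</score> value scaled to [0, 1], or 0.0 if missing or out of range."""
    matches = SCORE_RE.findall(completion or '')
    if not matches:
        return 0.0
    score = int(matches[-1])
    if not 1 <= score <= 10:
        return 0.0  # assumption: treat an out-of-range score as unparsable
    return score / 10

print(parse_score('Reasoning goes here. <score>8</score>'))
print(parse_score('Reasoning with no score tag.'))
```

Keeping the parsing separate from the API call means the failure cases (no tag, extra whitespace, a score of 11) can be tried with plain strings.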

Now, eight separate summaries, with a score for each.

In [16]:
resp = summarize(transcript, n=8)
summaries = [choice.message.content for choice in resp.choices]

In [17]:
scores = [judge_reward_v1(transcript, summary) for summary in summaries]
scores

[0.8, 0.7, 0.8, 0.8, 0.8, 0.7, 0.8, 0.8]

Using an LLM as a judge has the problem we see above: there's little diversity in the scores, with no 'this is really bad' or 'this is really good'. Apparently this is common with this approach.
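
To put a number on that lack of diversity, here is a minimal sketch that summarizes the spread of the eight scores above. `score_spread` is a name I made up for this note, and the choice of statistics (mean, population standard deviation, range, count of distinct values) is mine, not the lecture's.

```python
from statistics import mean, pstdev

# Scores copied from the run above.
scores = [0.8, 0.7, 0.8, 0.8, 0.8, 0.7, 0.8, 0.8]

def score_spread(scores):
    """Summarize how much a group of reward scores varies."""
    return {
        'mean': mean(scores),
        'std': pstdev(scores),                # population standard deviation
        'range': max(scores) - min(scores),   # gap between best and worst score
        'distinct': len(set(scores)),         # how many different values appear
    }

print(score_spread(scores))
```

Only two distinct values show up across eight summaries, so as I understand it the reward gives very little signal for telling a better summary from a worse one.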

Finished through 5:15.