# LLM Evaluation and Tracing with W&B

<!--- @wandbcode{dlai_04} -->

## 1. Using Tables for Evaluation

In this section, we call an OpenAI LLM to generate names for our game assets and use W&B Tables to evaluate the generations.

In [2]:
import os
import random
import time
import datetime

import openai

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential, # for exponential backoff
)  
import wandb
from wandb.sdk.data_types.trace_tree import Trace

In [3]:
# get openai API key (os and openai were imported in the previous cell)
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

In [4]:
PROJECT = "dlai_llm"
MODEL_NAME = "gpt-3.5-turbo"

In [5]:
wandb.login(anonymous="allow")

[34m[1mwandb[0m: Currently logged in as: [33meromoseleeigbedion[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [6]:
# initialize a run
run = wandb.init(project=PROJECT, job_type="generation")

### Simple generations
Let's start by generating names for our game assets using OpenAI `ChatCompletion`, and saving the resulting generations in W&B Tables. 

In [7]:
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)
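
The `@retry` decorator above retries a failed API call with a randomized, exponentially growing wait. The general idea can be sketched with the standard library alone. This is an illustration of the technique, not tenacity's actual implementation; `call_with_backoff` and its parameters are hypothetical names chosen for this sketch.

```python
import random
import time


def call_with_backoff(fn, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with randomized exponential backoff.

    Hypothetical helper for illustration; not part of tenacity or openai.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound of the wait doubles each attempt, capped at max_wait.
            ceiling = min(max_wait, min_wait * 2 ** (attempt - 1))
            # Randomizing the wait keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` argument is injectable only so the helper can be exercised without actually waiting; in the notebook, the tenacity decorator already covers this behaviour.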

In [8]:
def generate_and_print(system_prompt, user_prompt, table, n=5):
    """Generate n names, print them, and add one row holding all of them to the table."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    start_time = time.time()
    responses = completion_with_backoff(
        model=MODEL_NAME,
        messages=messages,
        n=n,
    )
    elapsed_time = time.time() - start_time
    generations = [choice.message.content for choice in responses.choices]
    for generation in generations:
        print(generation)
    # one row per API call; the "generations" cell holds the list of all n names
    table.add_data(
        system_prompt,
        user_prompt,
        generations,
        elapsed_time,
        datetime.datetime.fromtimestamp(responses.created),
        responses.model,
        responses.usage.prompt_tokens,
        responses.usage.completion_tokens,
        responses.usage.total_tokens,
    )

In [9]:
system_prompt = """You are a creative copywriter.
You're given a category of game asset, \
and your goal is to design a name of that asset.
The game is set in a fantasy world \
where everyone laughs and respects each other, 
while celebrating diversity."""

In [10]:
# Define W&B Table to store generations
columns = ["system_prompt", "user_prompt", "generations", "elapsed_time", "timestamp",
           "model", "prompt_tokens", "completion_tokens", "total_tokens"]
table = wandb.Table(columns=columns)

In [11]:
user_prompt = "hero"
generate_and_print(system_prompt, user_prompt, table)

Unity Chalice: A token of camaraderie and strength
Unity's Champion
Unity Knight: The Laughing Savior
Unity Guardian
Kindhearted Defender


In [12]:
user_prompt = "jewel"
generate_and_print(system_prompt, user_prompt, table)

Harmony Sparkle-Gems
Harmony Gems
Rainbow Harmony Gem
Harmony Gems
Gem of Harmony


In [13]:
# log the table to the run under the key "simple_generations"
wandb.log({"simple_generations": table})
run.finish()
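
The jewel generations above repeat themselves ("Harmony Gems" appears twice). One simple way to put a number on that when reviewing a table is the share of distinct names per prompt. The sketch below is an assumption about how you might score a row, not something W&B computes for you; `distinct_ratio` is a hypothetical helper.

```python
def distinct_ratio(generations):
    """Fraction of generations that are unique after trimming and lowercasing.

    Hypothetical scoring helper for illustration; not part of wandb.
    """
    if not generations:
        return 0.0
    normalized = {g.strip().lower() for g in generations}
    return len(normalized) / len(generations)


# The five names printed for the "jewel" prompt above.
jewel_names = [
    "Harmony Sparkle-Gems",
    "Harmony Gems",
    "Rainbow Harmony Gem",
    "Harmony Gems",
    "Gem of Harmony",
]
print(distinct_ratio(jewel_names))  # 4 unique names out of 5 -> 0.8
```

A score like this could be stored as one more column next to `generations`, so that prompts can be sorted by how varied their outputs are.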


## 2. Using Tracer to log more complex chains

How can we get more creative outputs? Let's design an LLM chain that first picks a fantasy world at random and then generates character names. We will use Tracer to log this chain: the inputs and outputs, the start and end times, whether the OpenAI call was successful, the token usage, and additional metadata.
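
Before using the real class, it can help to see the shape of the data being logged: a tree of spans, each with a name, a kind, millisecond timestamps, a status, and inputs and outputs. The sketch below is a plain-Python stand-in written for this explanation. `Span` and `now_ms` are hypothetical names, and this is not the W&B `Trace` class or its API.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


def now_ms():
    """Current wall-clock time in whole milliseconds."""
    return round(time.time() * 1000)


@dataclass
class Span:
    """Illustrative stand-in for one node of a trace tree (not wandb's Trace)."""
    name: str
    kind: str
    start_time_ms: int
    end_time_ms: Optional[int] = None
    status_code: Optional[str] = None
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child):
        """Attach a child span, e.g. a tool call or an LLM call."""
        self.children.append(child)

    def duration_ms(self):
        """Elapsed time of the span; requires end_time_ms to be set."""
        return self.end_time_ms - self.start_time_ms


# A chain span that contains one tool span.
root = Span(name="MyCreativeChain", kind="chain", start_time_ms=now_ms())
tool = Span(name="WorldPicker", kind="tool", start_time_ms=root.start_time_ms,
            end_time_ms=now_ms(), status_code="success",
            inputs={"input": "hero"})
root.add_child(tool)
root.end_time_ms = tool.end_time_ms
```

The chain in the next cells follows the same pattern: a root span for the chain, one child per step, and the root's end time set once the last child has finished.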

In [14]:
worlds = [
    "a mystic medieval island inhabited by intelligent and funny frogs",
    "a modern castle sitting on top of a volcano in a faraway galaxy",
    "a digital world inhabited by friendly machine learning engineers"
]

In [15]:
# define your config
model_name = "gpt-3.5-turbo"
temperature = 0.7
system_message = """You are a creative copywriter. 
You're given a category of game asset and a fantasy world.
Your goal is to design a name of that asset.
Provide the resulting name only, no additional description.
Single name, max 3 words output, remember!"""

In [16]:
def run_creative_chain(query):
    # part 1 - a chain is started...
    start_time_ms = round(datetime.datetime.now().timestamp() * 1000)

    root_span = Trace(
          name="MyCreativeChain",
          kind="chain",
          start_time_ms=start_time_ms,
          metadata={"user": "student_1"},
          model_dict={"_kind": "CreativeChain"}
          )

    # part 2 - your chain picks a fantasy world
    time.sleep(3)
    world = random.choice(worlds)
    expanded_prompt = f'Game asset category: {query}; fantasy world description: {world}'
    tool_end_time_ms = round(datetime.datetime.now().timestamp() * 1000)

    # create a Tool span 
    tool_span = Trace(
          name="WorldPicker",
          kind="tool",
          status_code="success",
          start_time_ms=start_time_ms,
          end_time_ms=tool_end_time_ms,
          inputs={"input": query},
          outputs={"result": expanded_prompt},
          model_dict={"_kind": "tool", "num_worlds": len(worlds)}
          )

    # add the TOOL span as a child of the root
    root_span.add_child(tool_span)

    # part 3 - the LLMChain calls an OpenAI LLM...
    messages=[
      {"role": "system", "content": system_message},
      {"role": "user", "content": expanded_prompt}
    ]

    response = completion_with_backoff(model=model_name,
                                       messages=messages,
                                       max_tokens=12,
                                       temperature=temperature)   

    llm_end_time_ms = round(datetime.datetime.now().timestamp() * 1000)
    response_text = response["choices"][0]["message"]["content"]
    token_usage = response["usage"].to_dict()

    llm_span = Trace(
          name="OpenAI",
          kind="llm",
          status_code="success",
          metadata={"temperature":temperature,
                    "token_usage": token_usage, 
                    "model_name":model_name},
          start_time_ms=tool_end_time_ms,
          end_time_ms=llm_end_time_ms,
          inputs={"system_prompt":system_message, "query":expanded_prompt},
          outputs={"response": response_text},
          model_dict={"_kind": "Openai", "engine": response["model"], "model": response["object"]}
          )

    # add the LLM span as a child of the Chain span...
    root_span.add_child(llm_span)

    # add the inputs and outputs to the Chain span
    root_span.add_inputs_and_outputs(
          inputs={"query":query},
          outputs={"response": response_text})

    # update the Chain span's end time
    root_span.end_time_ms = llm_end_time_ms


    # part 4 - log all spans to W&B by logging the root span
    root_span.log(name="creative_trace")
    print(f"Result: {response_text}")
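
The function above always records `status_code="success"`, because an exception from the OpenAI call would abort it before any span is built. If you want the trace to show failed calls as well, one option is to capture the outcome first and build the span afterwards. The helper below is a minimal sketch using only the standard library; `timed_call` is a hypothetical name, and passing `"error"` as the status of a span is an assumption you should check against your wandb version.

```python
import time


def timed_call(fn):
    """Run fn() and return (result, status_code, status_message, start_ms, end_ms).

    Hypothetical helper for illustration; not part of wandb or openai.
    """
    start_ms = round(time.time() * 1000)
    try:
        result, status_code, status_message = fn(), "success", None
    except Exception as exc:
        # Keep the error text so it can be attached to the span.
        result, status_code, status_message = None, "error", str(exc)
    end_ms = round(time.time() * 1000)
    return result, status_code, status_message, start_ms, end_ms
```

The returned values map onto the fields used for the spans above: the status goes into `status_code`, and the two timestamps into `start_time_ms` and `end_time_ms`.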


In [17]:
# Let's start a new wandb run
wandb.init(project=PROJECT, job_type="generation")

In [18]:
run_creative_chain("hero")

Result: Froggy Sage


In [19]:
run_creative_chain("jewel")

Result: Bytegem Sparkle


In [20]:
wandb.finish()


## 3. LangChain agent

In the third scenario, we introduce an agent that uses tools such as WorldPicker and NameValidator to come up with the ultimate name. We also use LangChain here and demonstrate its W&B integration.

In [21]:
# Import things that are needed generically
from langchain.agents import AgentType, initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.tools import BaseTool

from typing import Optional

from langchain.callbacks.manager import (
    AsyncCallbackManagerForToolRun,
    CallbackManagerForToolRun,
)

In [22]:
wandb.init(project=PROJECT, job_type="generation")

In [23]:
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

In [24]:
# define the Agent tools
class WorldPickerTool(BaseTool):
    name = "pick_world"
    description = "pick a virtual game world for your character or item naming"
    worlds = [
                "a mystic medieval island inhabited by intelligent and funny frogs",
                "a modern anthill featuring a cyber-ant queen and her cyber-ant-workers",
                "a digital world inhabited by friendly machine learning engineers"
            ]

    def _run(
        self, query: str, run_manager: Optional[CallbackManagerForToolRun] = None
    ) -> str:
        """Use the tool."""
        time.sleep(1)
        return random.choice(self.worlds)

    async def _arun(
        self, query: str, run_manager: Optional[AsyncCallbackManagerForToolRun] = None
    ) -> str:
        """Use the tool asynchronously."""
        raise NotImplementedError("pick_world does not support async")
        
class NameValidatorTool(BaseTool):
    name = "validate_name"
    description = "validate if the name is properly generated"

    def _run(
        self, query: str, run_manager: Optional[CallbackManagerForToolRun] = None
    ) -> str:
        """Use the tool."""
        time.sleep(1)
        if len(query) < 20:
            return f"This is a correct name: {query}"
        else:
            return "This name is too long. It should be shorter than 20 characters."

    async def _arun(
        self, query: str, run_manager: Optional[AsyncCallbackManagerForToolRun] = None
    ) -> str:
        """Use the tool asynchronously."""
        raise NotImplementedError("validate_name does not support async")
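
The rule inside `NameValidatorTool` is a plain length check, so it can be tried out without LangChain or an API key. The sketch below restates it as a standalone function. `validate_name` and `MAX_NAME_LENGTH` are hypothetical names for this illustration, and the tool itself does not need to be written this way.

```python
MAX_NAME_LENGTH = 20  # mirrors the threshold used in NameValidatorTool


def validate_name(name, max_length=MAX_NAME_LENGTH):
    """Return the message the tool would give for a candidate name."""
    if len(name) < max_length:
        return f"This is a correct name: {name}"
    return f"This name is too long. It should be shorter than {max_length} characters."


print(validate_name("ByteBites"))
print(validate_name("The Extremely Long Hero Name"))
```

Note that the check is strict: a name of exactly 20 characters is rejected.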

In [25]:
llm = ChatOpenAI(temperature=0.7)

In [26]:
tools = [WorldPickerTool(), NameValidatorTool()]
agent = initialize_agent(
    tools, 
    llm, 
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    handle_parsing_errors=True,
    verbose=True
)

In [27]:
agent.run(
    "Find a virtual game world for me and imagine the name of a hero in that world"
)



> Entering new AgentExecutor chain...
I should pick a virtual game world and generate a name for a hero in that world.
Action: pick_world
Action Input: Fantasy World
Observation: a digital world inhabited by friendly machine learning engineers
Thought: I should now generate a name for a hero in the Fantasy World.
Action: validate_name
Action Input: Sirathor
Observation: This is a correct name: Sirathor
Thought: I now have a suitable hero name for the Fantasy World.
Final Answer: The hero's name in the Fantasy World is Sirathor.

> Finished chain.


"The hero's name in the Fantasy World is Sirathor."

In [28]:
agent.run(
    "Find a virtual game world for me and imagine the name of a jewel in that world"
)



> Entering new AgentExecutor chain...
I need to pick a virtual game world and come up with a name for a jewel within that world.
Action: pick_world
Action Input: Fantasy Realm
Observation: a mystic medieval island inhabited by intelligent and funny frogs
Thought: I now need to come up with a name for a jewel in this Fantasy Realm.
Action: validate_name
Action Input: Dragon's Tear
Observation: This is a correct name: Dragon's Tear
Thought: I now know the final answer
Final Answer: Dragon's Tear in the Fantasy Realm

> Finished chain.


"Dragon's Tear in the Fantasy Realm"

In [29]:
agent.run(
    "Find a virtual game world for me and imagine the name of food in that world."
)



> Entering new AgentExecutor chain...
I should pick a virtual game world for the character or item naming.
Action: pick_world
Action Input: Natura
Observation: a digital world inhabited by friendly machine learning engineers
Thought: I should think of a food name that would fit in a world inhabited by friendly machine learning engineers.
Action: validate_name
Action Input: ByteBites
Observation: This is a correct name: ByteBites
Thought: I now know the final answer
Final Answer: In the virtual game world of Natura, there is a food called ByteBites.

> Finished chain.


'In the virtual game world of Natura, there is a food called ByteBites.'

In [30]:
wandb.finish()


**Note**: LLM outputs are variable. Your results may not match those in the video.