pip install textgrad dspy python-dotenv tqdm

In [6]:
import textgrad as tg

import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Verify the API key is loaded
if os.getenv("OPENAI_API_KEY") is None:
    raise ValueError("OPENAI_API_KEY not found in environment variables")


tg.set_backward_engine("gpt-4o", override=True)

# Step 1: Set up the LLM and the question
model = tg.BlackboxLLM("gpt-4o")
question_string = ("If it takes 1 hour to dry 25 shirts under the sun, "
                    "how long will it take to dry 30 shirts under the sun? "
                    "Reason step by step.")

question = tg.Variable(question_string, role_description="question to the LLM", requires_grad=False)

# Step 2: Get the LLM's response
answer = model(question)
print(answer)

To determine how long it will take to dry 30 shirts under the sun, we need to consider the drying process and whether it is affected by the number of shirts.

1. **Understand the Drying Process**: Drying shirts under the sun is typically a parallel process. Each shirt dries independently of the others, assuming there is enough space and sunlight for all shirts to be exposed equally.

2. **Initial Information**: We know that 25 shirts take 1 hour to dry. This implies that each shirt, when exposed to the sun, takes 1 hour to dry.

3. **Drying 30 Shirts**: Since drying is a parallel process and each shirt dries independently, adding more shirts does not increase the drying time for each shirt. Therefore, drying 30 shirts will also take 1 hour, provided that all shirts have equal exposure to sunlight and there is no limitation in space or sunlight.

4. **Conclusion**: The time it takes to dry 30 shirts is the same as the time it takes to dry 25 shirts, which is 1 hour, assuming all conditi

In [7]:
answer.set_role_description("concise and accurate answer to the question")

optimizer = tg.TGD(parameters=[answer], verbose=1)

evaluation_instruction = (f"Here's a question: {question_string}. "
                           "Evaluate any given answer to this question, "
                           "be smart, logical, and very critical. "
                           "Just provide concise feedback.")

loss_fn = tg.TextLoss(evaluation_instruction)


In [8]:
loss = loss_fn(answer)
loss.backward()
optimizer.step()
answer

-----------------------TextualGradientDescent------------------------
To determine how long it will take to dry 30 shirts under the sun, we need to consider the drying process. Drying shirts under the sun is a parallel process, meaning each shirt dries independently, assuming there is sufficient space and sunlight for all shirts. Given that 25 shirts take 1 hour to dry, each shirt takes 1 hour to dry. Therefore, drying 30 shirts will also take 1 hour, assuming equal exposure and no space limitations. The drying time is independent of the number of shirts as long as conditions remain constant.


Variable(value=To determine how long it will take to dry 30 shirts under the sun, we need to consider the drying process. Drying shirts under the sun is a parallel process, meaning each shirt dries independently, assuming there is sufficient space and sunlight for all shirts. Given that 25 shirts take 1 hour to dry, each shirt takes 1 hour to dry. Therefore, drying 30 shirts will also take 1 hour, assuming equal exposure and no space limitations. The drying time is independent of the number of shirts as long as conditions remain constant., role=concise and accurate answer to the question, grads={Variable(value=To improve the concise and accurate answer to the question, consider the following feedback:

1. **Clarify Assumptions**: While the answer correctly identifies that the drying process is parallel, it could benefit from explicitly stating the assumption that there is sufficient space and sunlight for all shirts. This would preemptively address any potential concerns about limitati

In [10]:
question_string = "What are high memory and low memory in Linux?"

question = tg.Variable(question_string, role_description="question to the LLM", requires_grad=False)

# Get the LLM's response
answer = model(question)
answer

Variable(value=In the context of Linux, "high memory" and "low memory" refer to different regions of the system's physical memory, particularly on 32-bit systems. This distinction arises from the way memory is managed and accessed by the operating system and hardware.

### Low Memory

- **Definition**: Low memory is the portion of physical memory that is directly accessible by the kernel without any special handling. On 32-bit systems, this typically includes the first 896 MB of RAM.
- **Address Space**: It is mapped directly into the kernel's address space, allowing the kernel to access it easily and efficiently.
- **Usage**: Low memory is used for kernel data structures, buffers, and other critical components that require fast and direct access by the kernel.
- **Limitations**: The size of low memory is limited by the architecture and the kernel's address space layout, which can be a constraint on systems with large amounts of RAM.

### High Memory

- **Definition**: High memory refe

In [14]:
import json

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [json.loads(line) for line in f]

data[2]

{'question': 'why are my text messages coming up as maybe?',
 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.',
 'gold_doc_ids': [3956, 3957, 8034]}
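
Each line of the JSONL file is expected to hold one record with `question`, `response`, and `gold_doc_ids` keys, as in the record printed above. A minimal sketch of a loader that checks this, assuming all three keys are required on every line. The `load_jsonl` helper and the inline sample are hypothetical and are not part of the dataset tooling.

```python
import io
import json

REQUIRED_KEYS = {"question", "response", "gold_doc_ids"}  # assumed from the record shown above

def load_jsonl(handle):
    """Parse one JSON object per non-empty line and check the assumed keys."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        if not line.strip():
            continue  # skip blank lines rather than failing on them
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
        records.append(record)
    return records

# Hypothetical sample standing in for the real file (two records, one blank line).
sample = io.StringIO(
    '{"question": "q1", "response": "r1", "gold_doc_ids": [1]}\n'
    '\n'
    '{"question": "q2", "response": "r2", "gold_doc_ids": [2, 3]}\n'
)
records = load_jsonl(sample)
```

Failing early on a malformed line is cheaper than discovering a missing `response` field partway through a 300-example evaluation run.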

In [12]:
import random

random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

len(trainset), len(devset), len(testset)

(200, 300, 500)
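
Seeding a dedicated `random.Random` instance makes the shuffle, and therefore the split, repeatable across runs. A small sketch of that property on stand-in data: a list of integers replaces the real records, and `split` is a hypothetical helper that mirrors the slicing above.

```python
import random

def split(records, seed=0):
    """Shuffle a copy with a fixed seed, then slice into train/dev/test."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:200], shuffled[200:500], shuffled[500:1000]

stand_in = list(range(1000))  # stand-in for the loaded records
train_a, dev_a, test_a = split(stand_in)
train_b, dev_b, test_b = split(stand_in)

sizes = (len(train_a), len(dev_a), len(test_a))
same_split = (train_a, dev_a, test_a) == (train_b, dev_b, test_b)
no_overlap = len(set(train_a) | set(dev_a) | set(test_a)) == 1000
```

Note that slicing never raises on a short list: with fewer than 1000 records the later splits would silently shrink, so checking the lengths as the cell above does is worthwhile.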

In [32]:
# Just importing dspy for the metric only
from dspy.evaluate import SemanticF1
import dspy

# Instantiate the metric.
metric = SemanticF1(decompositional=True)
model = tg.BlackboxLLM("gpt-4o")

# Produce a prediction from the TextGrad model, using the example loaded above as input.
example = data[2]
question = tg.Variable(example["question"], role_description="question to the LLM", requires_grad=False)
pred = model(question)

# Configure the LM that DSPy uses as the judge when computing the metric.
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

def evaluate_single(the_model, the_example):
    # Wrap the raw dict so the DSPy metric can read .question and .response.
    the_example = dspy.Example(
        question=the_example["question"],
        response=the_example["response"]
    )
    question = tg.Variable(the_example.question, role_description="question to the LLM", requires_grad=False)
    # The model returns a tg.Variable; pass its text (.value) to the metric.
    pred = dspy.Prediction(
        response=the_model(question).value
    )
    score = metric(the_example, pred)
    # print("Question:\n", the_example.question)
    # print("\n\nGround truth:\n", the_example.response)
    # print("\n\nPrediction:\n", pred.response)
    # print("\n\nSemantic F1 score:", score)
    return score

In [31]:
from tqdm import tqdm

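def evaluate(the_model):
    total_score = 0
    top_score = 0
    progress = tqdm(devset)
    for example in progress:
        # Refresh the description on every iteration: an f-string passed as
        # `desc=` is evaluated only once, so it would stay at 0/0.
        progress.set_description(
            f"Evaluating (score: {total_score:.4g}/{top_score}, {total_score/max(1, top_score):.2%})"
        )
        total_score += evaluate_single(the_model, example)
        top_score += 1
    # Guard against an empty devset.
    return total_score / max(1, top_score)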

evaluate(model)

Evaluating (score: 0/0, 0.00%):   0%|          | 0/300 [00:00<?, ?it/s]
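
The loop above averages one score per dev example. To check that logic without spending API calls, it can be dry-run with stand-ins. The sketch below assumes a stub model that returns canned answers and a plain token-overlap F1 in place of `SemanticF1`. The names `token_f1`, `stub_model`, and `average_score` are hypothetical, and token overlap is a much cruder substitute, not how `SemanticF1` scores responses.

```python
from collections import Counter

def token_f1(reference, prediction):
    """Token-overlap F1: a crude stand-in metric, not SemanticF1."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def average_score(the_model, examples, metric_fn):
    """Mean metric over examples; returns 0.0 for an empty list."""
    scores = [metric_fn(ex["response"], the_model(ex["question"])) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical examples and canned answers.
examples = [
    {"question": "q1", "response": "use the ls command"},
    {"question": "q2", "response": "restart the service"},
]
canned = {"q1": "use the ls command", "q2": "reboot now"}
stub_model = canned.get  # maps a question string to its canned answer

mean = average_score(stub_model, examples, token_f1)
```

Here the first answer matches its reference exactly and the second shares no tokens with it, so the mean lands halfway between the two extremes.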