pip install textgrad dspy faiss-cpu mlflow python-dotenv tqdm litellm

In [87]:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("textgrad-test")

mlflow.litellm.autolog()

2025/03/14 21:21:15 INFO mlflow.tracking.fluent: Experiment with name 'textgrad-test' does not exist. Creating a new experiment.


In [75]:
import textgrad as tg

import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Verify the API key is loaded
if os.getenv("OPENAI_API_KEY") is None:
    raise ValueError("OPENAI_API_KEY not found in environment variables")


tg.set_backward_engine("gpt-4o", override=True)

# Step 1: Set up the model and the question
model = tg.BlackboxLLM("gpt-4o")
question_string = ("If it takes 1 hour to dry 25 shirts under the sun, "
                    "how long will it take to dry 30 shirts under the sun? "
                    "Reason step by step.")

question = tg.Variable(question_string, role_description="question to the LLM", requires_grad=False)

# Step 2: Get the LLM's response
answer = model(question)
print(answer)

To determine how long it will take to dry 30 shirts under the sun, we need to consider the drying process and whether it is affected by the number of shirts.

1. **Understand the Drying Process**: Drying shirts under the sun is typically a parallel process. Each shirt dries independently of the others, assuming there is enough space and sunlight for all shirts to be exposed equally.

2. **Initial Information**: We know that 25 shirts take 1 hour to dry. This implies that each shirt, when exposed to the sun, takes 1 hour to dry.

3. **Drying 30 Shirts**: Since drying is a parallel process and each shirt dries independently, adding more shirts does not increase the drying time for each shirt. Therefore, drying 30 shirts will also take 1 hour, provided that all shirts have equal exposure to sunlight and there is no limitation in space or sunlight.

4. **Conclusion**: The time it takes to dry 30 shirts is the same as the time it takes to dry 25 shirts, which is 1 hour, assuming all conditi

In [76]:
answer.set_role_description("concise and accurate answer to the question")

optimizer = tg.TGD(parameters=[answer], verbose=1)

evaluation_instruction = (f"Here's a question: {question_string}. "
                           "Evaluate any given answer to this question, "
                           "be smart, logical, and very critical. "
                           "Just provide concise feedback.")

loss_fn = tg.TextLoss(evaluation_instruction)


In [77]:
loss = loss_fn(answer)
loss.backward()
optimizer.step()
answer

-----------------------TextualGradientDescent------------------------
To determine how long it will take to dry 30 shirts under the sun, we need to consider the drying process. Drying shirts under the sun is a parallel process, meaning each shirt dries independently, assuming there is sufficient space and sunlight for all shirts. Given that 25 shirts take 1 hour to dry, each shirt takes 1 hour to dry. Therefore, drying 30 shirts will also take 1 hour, assuming equal exposure and no space limitations. The drying time is independent of the number of shirts as long as conditions remain constant.


Variable(value=To determine how long it will take to dry 30 shirts under the sun, we need to consider the drying process. Drying shirts under the sun is a parallel process, meaning each shirt dries independently, assuming there is sufficient space and sunlight for all shirts. Given that 25 shirts take 1 hour to dry, each shirt takes 1 hour to dry. Therefore, drying 30 shirts will also take 1 hour, assuming equal exposure and no space limitations. The drying time is independent of the number of shirts as long as conditions remain constant., role=concise and accurate answer to the question, grads={Variable(value=To improve the concise and accurate answer to the question, consider the following feedback:

1. **Clarify Assumptions**: While the answer correctly identifies that the drying process is parallel, it could benefit from explicitly stating the assumption that there is sufficient space and sunlight for all shirts. This would preemptively address any potential concerns about limitati
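
The cells above follow a forward, critique, revise pattern: produce an answer, ask for textual feedback, then rewrite the answer using that feedback. Below is a toy sketch of that loop shape only. `critic` and `revise` are hypothetical plain-Python stand-ins for the two LLM calls; they are not TextGrad functions and do not reflect how TextGrad is implemented.

```python
def critic(answer: str, max_words: int = 12) -> str:
    """Stand-in for the loss LLM: return textual feedback, or '' if satisfied."""
    if len(answer.split()) > max_words:
        return "Too long. Keep only the conclusion."
    return ""


def revise(answer: str, feedback: str) -> str:
    """Stand-in for the optimizer LLM: apply the feedback to the answer."""
    if "Keep only the conclusion" in feedback:
        return answer.split(". ")[-1]
    return answer


answer = ("Drying is a parallel process. Each shirt dries independently. "
          "So 30 shirts also take 1 hour.")
steps = 0
# Keep revising while the critic still has feedback, with a small step budget.
while (feedback := critic(answer)) and steps < 3:
    answer = revise(answer, feedback)
    steps += 1
print(answer)
```

In the real notebook both roles are played by LLM calls, so the feedback and the rewrite are free-form text rather than fixed rules.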

In [78]:
question_string = ("what are high memory and low memory in linux?")

question = tg.Variable(question_string, role_description="question to the LLM", requires_grad=False)

# Get the LLM's response
answer = model(question)
answer

Variable(value=In Linux, the terms "high memory" and "low memory" refer to different regions of the system's physical memory, particularly in the context of 32-bit architectures. This distinction is primarily relevant for systems with large amounts of RAM.

### Low Memory
- **Definition**: Low memory is the portion of physical memory that is directly accessible by the kernel without any special handling.
- **Address Range**: On 32-bit systems, low memory typically refers to the first 896 MB of RAM. This is because the Linux kernel reserves the upper 128 MB of the 4 GB address space for its own use, leaving 3 GB for user space and 1 GB for kernel space.
- **Usage**: Low memory is used for kernel data structures, buffers, and other critical components that need to be accessed quickly and efficiently.

### High Memory
- **Definition**: High memory is the portion of physical memory that is not directly mapped into the kernel's address space.
- **Address Range**: High memory starts just abo

In [79]:
import json

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [json.loads(line) for line in f]

data[2]

{'question': 'why are my text messages coming up as maybe?',
 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.',
 'gold_doc_ids': [3956, 3957, 8034]}

In [80]:
import random

random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

len(trainset), len(devset), len(testset)

(200, 300, 500)
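
A seeded shuffle followed by consecutive slices gives reproducible, non-overlapping splits. The sketch below wraps that idea in a hypothetical helper, `split_dataset`, and runs it on made-up records so the properties can be checked without the dataset file.

```python
import random


def split_dataset(records, sizes=(200, 300, 500), seed=0):
    """Shuffle a copy of `records` deterministically and cut consecutive slices."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    splits, start = [], 0
    for size in sizes:
        splits.append(shuffled[start:start + size])
        start += size
    return splits


# Toy records standing in for the JSONL rows.
toy = [{"id": i} for i in range(1200)]
train, dev, test = split_dataset(toy)
print(len(train), len(dev), len(test))
```

Unlike the cell above, this version shuffles a copy, so the original list keeps its order.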

In [130]:
# Just importing dspy for the metric only
from dspy.evaluate import SemanticF1
import dspy
from textgrad.engine import get_engine
import litellm

litellm.set_verbose = False

engine = get_engine("experimental:gpt-4o", cache=False)

system_prompt = tg.Variable("You are a helpful assistant that can answer questions about the given context.", role_description="system prompt for the LLM", requires_grad=True)

# Instantiate the metric.
metric = SemanticF1(decompositional=True)
model = tg.BlackboxLLM(engine=engine, system_prompt=system_prompt)

# Pick a sample example and wrap its question as a TextGrad variable.
example = data[2]
question = tg.Variable(example["question"], role_description="question to the LLM", requires_grad=False)
# pred = model(question)

# Configure the LM that DSPy's SemanticF1 metric uses as its judge.
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

def evaluate_single(the_model, the_example):
    the_example = dspy.Example(
        question=the_example["question"],
        response=the_example["response"]
    )
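    # Models built on tg.BlackboxLLM return a tg.Variable; unwrap it to plain text.
    output = the_model(the_example["question"])
    pred = dspy.Prediction(
        response=getattr(output, "value", output)
    )
    score = metric(the_example, pred)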
    # print("Question:\n", example.question)
    # print("\n\nGround truth:\n", example.response)
    # print("\n\nPrediction:\n", pred.response)
    # print("\n\nSemantic F1 score:", score)
    return score
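
The metric's judge is expected to return `recall` and `precision` fields (they appear in the error message later in this notebook). Assuming those two numbers are combined as a standard F1, the final step would look like the sketch below. The helper name `f1_score` and the clamping to [0, 1] are assumptions for illustration, not a copy of DSPy's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, with both clamped to [0, 1]."""
    precision = min(max(precision, 0.0), 1.0)
    recall = min(max(recall, 0.0), 1.0)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(1.0, 0.5))
```

Because F1 is a harmonic mean, a response that covers every key idea but adds a lot of unsupported material (high recall, low precision) still scores low.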

In [82]:
from tqdm import tqdm
# Clear instances
tqdm._instances.clear()

# Reset monitor thread
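# tqdm.monitor always exists as a class attribute and may be None, so check its value
if getattr(tqdm, 'monitor', None) is not None:
    tqdm.monitor.exit()
    tqdm.monitor = None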

In [83]:
from tqdm import tqdm

# def evaluate(the_model):
#     total_score = 0
#     top_score = 0
#     pbar = tqdm(devset)
#     for example in pbar:
#         score = evaluate_single(the_model, example)
#         total_score += score
#         top_score += 1
#         pbar.set_description(f"Evaluating (score: {total_score:.1f}/{top_score}, {total_score/max(1, top_score):.2%})")
#     return total_score / top_score

from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate(the_model):
    total_score = 0
    pbar = tqdm(total=len(devset), position=0, leave=True)

    # Use ThreadPoolExecutor since the work is I/O bound (API calls)
    with ThreadPoolExecutor(max_workers=24) as executor:
        # Submit all tasks
        future_to_example = {
            executor.submit(evaluate_single, the_model, example): example 
            for example in devset
        }
        
        # Process completed tasks as they finish
        for future in as_completed(future_to_example):
            score = future.result()
            total_score += score
            pbar.update(1)
            pbar.set_description(f"Evaluating (score: {total_score:.1f}/{pbar.n}, {total_score/max(1, pbar.n):.2%})")
    
    pbar.close()
    return total_score / len(devset)


In [90]:
evaluate(model)

Evaluating (score: 149.8/300, 49.93%): 100%|██████████| 300/300 [03:31<00:00,  1.42it/s]


0.4993400821965219
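
In the threaded `evaluate`, a single exception from `future.result()` aborts the whole run. A minimal sketch of a more tolerant variant follows, using a made-up scorer so it runs without any API calls. `evaluate_concurrently` and `toy_score` are hypothetical names, and averaging over successful examples only is a design choice, not what the notebook does.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate_concurrently(score_fn, examples, max_workers=8):
    """Return (mean score over successful examples, number of failed examples)."""
    scores, failures = [], 0
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(score_fn, ex) for ex in examples]
        for future in as_completed(futures):
            try:
                scores.append(future.result())
            except Exception:
                # Count the failure and keep going instead of aborting the run.
                failures += 1
    mean = sum(scores) / len(scores) if scores else 0.0
    return mean, failures


def toy_score(example):
    """Made-up scorer: fails on every tenth example, otherwise returns 0.5."""
    if example["id"] % 10 == 0:
        raise ValueError("malformed judge output")
    return 0.5


toy_examples = [{"id": i} for i in range(20)]
mean_score, failed = evaluate_concurrently(toy_score, toy_examples)
print(mean_score, failed)
```

Reporting the failure count next to the mean keeps the result honest: a score computed over fewer examples is not directly comparable to one computed over all 300.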

In [85]:
import json

max_characters = 6000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [json.loads(line)['text'][:max_characters] for line in f]

embedder = dspy.Embedder('openai/text-embedding-3-small', dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

Training a 32-byte FAISS index with 337 partitions, based on 28436 x 512-dim embeddings
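
Conceptually, the retriever embeds the query and returns the `k` corpus entries whose embeddings are most similar to it. The sketch below shows that idea with exact brute-force cosine similarity on tiny made-up vectors. It is not how the FAISS index mentioned in the log works internally; `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, corpus_vecs, k=5):
    """Return indices of the k corpus vectors most similar to the query."""
    ranked = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 2-dimensional "embeddings" for four documents.
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
nearest = top_k([1.0, 0.1], corpus_vecs, k=2)
print(nearest)
```

Brute force is fine for a handful of vectors; with tens of thousands of 512-dimensional embeddings, an index structure trades a little accuracy for much faster lookups.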


In [133]:
class RAG:
    """Retrieve passages for a question, prepend them as context, and query the model."""
    def __init__(self, model, search):
        self.model = model
        self.search = search

    def __call__(self, question):
        docs = self.search(question)
        context = "Context:\n"
        for doc in docs.passages:
            context += "-" + doc + "\n"
        question = tg.Variable(context + "\n\nQuestion:\n" + question, role_description="question to the LLM", requires_grad=False)
        # print(question)
        return self.model(question)

rag = RAG(model, search)

rag("what are high memory and low memory in linux?").predecessors

{Variable(value=Context:
 -As far as I remember, High Memory is used for application space and Low Memory for the kernel. Advantage is that (user-space) applications cant access kernel-space memory.
 -This is relevant to the Linux kernel; Im not sure how any Unix kernel handles this. The High Memory is the segment of memory that user-space programs can address. It cannot touch Low Memory. Low Memory is the segment of memory that the Linux kernel can address directly. If the kernel must access High Memory, it has to map it into its own address space first. There was a patch introduced recently that lets you control where the segment is. The tradeoff is that you can take addressable memory away from user space so that the kernel can have more memory that it does not have to map before using. Additional resources: http://tldp.org/HOWTO/KernelAnalysis-HOWTO-7.html http://linux-mm.org/HighMemory
 -HIGHMEM is a range of kernels memory space, but it is NOT memory you access but its a place wh
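
The prompt layout that `RAG.__call__` assembles can be pulled into a pure function, which makes it easy to inspect or unit-test without calling the retriever or the LLM. A small sketch, where `build_prompt` is a hypothetical helper that reproduces the same string layout as the class above:

```python
from typing import Sequence


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Assemble 'Context:' bullet lines followed by the question."""
    context = "Context:\n" + "".join(f"-{passage}\n" for passage in passages)
    return context + "\n\nQuestion:\n" + question


prompt = build_prompt("what is swap?", ["Swap is disk-backed memory.", "See vm.swappiness."])
print(prompt)
```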

In [95]:
evaluate(rag)

Evaluating (score: 193.8/300, 64.61%): 100%|██████████| 300/300 [04:18<00:00,  1.16it/s]


0.646081424389819

In [137]:
from textgrad.tasks import load_task

optimizer_prompt = """
You are part of an optimization system that improves text (i.e., variable). You will be asked to creatively and critically improve prompts, solutions to problems, code, or any other text-based variable. You will receive some feedback, and use the feedback to improve the variable. The feedback may be noisy, identify what is important and what is correct. Pay attention to the role description of the variable, and the context in which it is used. This is very important: You MUST give your response by sending the improved variable between {new_variable_start_tag} {{improved variable}} {new_variable_end_tag} tags. The text you send between the tags will directly replace the variable.


### Glossary of tags that will be sent to you:
# - <LM_SYSTEM_PROMPT>: The system prompt for the language model.
# - <LM_INPUT>: The input to the language model.
# - <LM_OUTPUT>: The output of the language model.
# - <FEEDBACK>: The feedback to the variable.
# - <CONVERSATION>: The conversation history.
# - <FOCUS>: The focus of the optimization.
# - <ROLE>: The role description of the variable.
"""

train_loader = tg.tasks.DataLoader(trainset, batch_size=3, shuffle=True)

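# Despite its name, eval_model is the engine the optimizer uses to rewrite the system prompt.
eval_model = get_engine("experimental:anthropic/claude-3-7-sonnet-latest", cache=False)
# Note: optimizer_prompt above is never passed in, so the optimizer runs with its built-in prompt.
optimizer = tg.TextualGradientDescent(engine=eval_model, parameters=[system_prompt])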
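
The optimizer prompt is a `str.format` template: single braces are placeholders for the tag names, and doubled braces render as literal braces. The sketch below shows how such a template renders and how a reply could be parsed. The tag strings `<IMPROVED_VARIABLE>` and `</IMPROVED_VARIABLE>` and the helper `extract_new_value` are assumptions for illustration, not taken from this notebook.

```python
import re

# Assumed tag names, for illustration only.
START_TAG = "<IMPROVED_VARIABLE>"
END_TAG = "</IMPROVED_VARIABLE>"

template = ("Send the improved variable between {new_variable_start_tag} "
            "{{improved variable}} {new_variable_end_tag} tags.")
rendered = template.format(new_variable_start_tag=START_TAG,
                           new_variable_end_tag=END_TAG)


def extract_new_value(reply: str) -> str:
    """Return the text between the start and end tags of an optimizer reply."""
    pattern = re.escape(START_TAG) + r"(.*?)" + re.escape(END_TAG)
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("optimizer reply did not contain the expected tags")
    return match.group(1).strip()


reply = f"Here is my revision. {START_TAG} You are a concise technical assistant. {END_TAG}"
new_prompt = extract_new_value(reply)
print(rendered)
print(new_prompt)
```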

In [138]:
TOTAL_EPOCHS = 3

print("sys prompt: ", system_prompt)

def evaluate_response(response, the_example):
    the_example = dspy.Example(
        question=the_example["question"],
        response=the_example["response"]
    )
    pred = dspy.Prediction(
        response=response.value
    )
    score = metric(the_example, pred)
    # print("Question:\n", example.question)
    # print("\n\nGround truth:\n", example.response)
    # print("\n\nPrediction:\n", pred.response)
    # print("\n\nSemantic F1 score:", score)
    return score


for epoch in range(TOTAL_EPOCHS):
    print(f"Epoch {epoch}/{TOTAL_EPOCHS}")
    pbar = tqdm(train_loader, position=0)
    for step, batch in enumerate(pbar):
        pbar.set_description(f"Training step {step}. Epoch {epoch}")
        optimizer.zero_grad()
        losses = []
        for example in batch:
            response = rag(example["question"])
            score = evaluate_response(response, example)
            # Wrap the numeric loss (1 - SemanticF1) as a text Variable whose predecessor is the response.
            loss = tg.Variable(
                f"{1 - score:.3f}",
                role_description="loss",
                requires_grad=True,
                predecessors=[response]
            )
            losses.append(loss)
        loss = tg.sum(losses)
        # print("loss: ", loss)
        loss.backward()
        optimizer.step()
    
        print("\n\n")
        print("sys prompt: ", system_prompt)
        print("\n\n")

        # Stop each epoch after four optimizer steps (steps 0 through 3).
        if step == 3:
            break

sys prompt:  You are a helpful assistant that can answer questions about the given context.
Epoch 0/3


Training step 0. Epoch 0: : 0it [00:00, ?it/s]

loss:  0.600
0.167
0.187


Training step 1. Epoch 0: : 1it [00:21, 21.03s/it]




sys prompt:  You are a specialized technical assistant with expertise in cybersecurity, computer systems, and user interfaces. When answering questions about the given context:

1. Analyze the context thoroughly to provide accurate, relevant, and technically sound information.
2. Offer clear, step-by-step instructions when explaining processes or technical procedures.
3. Prioritize clarity and conciseness while ensuring your responses are comprehensive.
4. Consider the user's technical knowledge level and provide explanations that balance technical accuracy with accessibility.
5. When appropriate, suggest additional relevant information or best practices even if not explicitly asked.
6. For security-related topics, emphasize practical risk assessment and provide balanced advice that considers both technical possibilities and real-world applicability.
7. Define technical terms or concepts that may be unfamiliar to users.

Your goal is to be not just helpful, but insightful, precise, 

Training step 2. Epoch 0: : 2it [01:57, 65.22s/it]




sys prompt:  You are a specialized technical assistant with expertise in cybersecurity, computer systems, user interfaces, operating systems, programming languages, and software development. When answering questions about the given context:

1. Analyze the context thoroughly to identify key concepts, potential misconceptions, and underlying assumptions to provide accurate, relevant, and technically sound information.
2. Offer clear, step-by-step instructions when explaining processes or technical procedures, using practical examples and analogies to illustrate complex concepts.
3. Prioritize clarity and conciseness while ensuring your responses are comprehensive and appropriately detailed for the topic.
4. Dynamically assess the user's technical knowledge level and provide explanations that balance technical accuracy with accessibility, adjusting your language based on context clues.
5. When appropriate, suggest additional relevant information, best practices, or visual aids/externa

Training step 3. Epoch 0: : 3it [03:30, 78.10s/it]




sys prompt:  You are a specialized technical assistant with expertise in cybersecurity, computer systems, user interfaces, operating systems, programming languages, and software development. When answering questions about the given context:

1. Analyze the context thoroughly to identify key concepts, potential misconceptions, and underlying assumptions, considering both historical evolution and current relevance of technologies to provide accurate, relevant, and technically sound information.

2. Offer clear, step-by-step instructions when explaining processes or technical procedures, using practical examples and analogies to illustrate complex concepts that relate to everyday experiences.

3. Prioritize clarity and conciseness while balancing them with the necessary level of detail, ensuring your responses are comprehensive, accessible, and appropriately tailored to the topic's complexity.

4. Dynamically assess the user's technical knowledge level based on their language and quest

Training step 3. Epoch 0: : 3it [05:09, 103.25s/it]





sys prompt:  You are a specialized technical assistant with expertise in cybersecurity, computer systems, user interfaces, operating systems, programming languages, and software development. When answering questions about the given context:

1. Thoroughly analyze the context to identify the user's intent, key concepts, potential misconceptions, and underlying assumptions. Consider both the historical evolution and current relevance of technologies to provide accurate, technically sound information that directly addresses the user's specific needs.

2. Offer clear, step-by-step instructions when explaining processes or technical procedures, using practical examples and analogies to illustrate complex concepts. For technical processes, anticipate common errors and include troubleshooting tips proactively.

3. Prioritize clarity and conciseness while balancing them with the necessary level of detail. Ensure your responses are comprehensive, accessible, and appropriately tailored to the

Training step 0. Epoch 1: : 0it [01:02, ?it/s]


ValueError: Expected dict_keys(['reasoning', 'ground_truth_key_ideas', 'system_response_key_ideas', 'discussion', 'recall', 'precision']) but got dict_keys(['reasoning', 'ground_truth_key_ideas'])
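
The `ValueError` above comes from the metric's judge returning an incomplete set of fields, which stops the training loop. One possible mitigation is to retry the metric call a few times and optionally fall back to a default score. The sketch below uses a made-up flaky metric; `score_with_retry` and `flaky_metric` are hypothetical names, and whether a retry actually helps with this particular failure is an assumption.

```python
def score_with_retry(score_fn, example, attempts=3, fallback=None):
    """Call score_fn(example), retrying on ValueError up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return score_fn(example)
        except ValueError as err:
            last_error = err
    if fallback is not None:
        return fallback
    raise last_error


calls = {"n": 0}


def flaky_metric(example):
    """Made-up metric that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("judge returned an incomplete set of fields")
    return 0.8


result = score_with_retry(flaky_metric, {"question": "q"})
print(result, calls["n"])
```

A fallback score biases the average, so it is usually safer to skip failed examples and report how many were skipped.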

In [None]:
# print(system_prompt)

engine = get_engine("experimental:gpt-4o", cache=False)
model = tg.BlackboxLLM(engine, system_prompt=system_prompt)
rag = RAG(model, search)

evaluate(rag)
# rag("what are high memory and low memory in linux?")

Evaluating (score: 170.6/289, 59.02%):  96%|█████████▋| 289/300 [04:37<00:07,  1.43it/s]
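
To put the three evaluation numbers side by side, the differences can be computed in percentage points. The values are copied from the outputs in this notebook. The last one is taken from a progress bar at 289 of 300 examples and is assumed to reflect the system prompt after the training loop, so that comparison is indicative only.

```python
baseline = 0.4993400821965219   # evaluate(model), no retrieval
with_rag = 0.646081424389819    # evaluate(rag), original system prompt
partial_optimized = 0.5902      # last cell, progress bar at 289/300


def delta_points(new: float, old: float) -> float:
    """Difference between two scores, in percentage points."""
    return (new - old) * 100


rag_gain = delta_points(with_rag, baseline)
optimized_change = delta_points(partial_optimized, with_rag)
print(f"RAG vs. baseline: {rag_gain:+.1f} points")
print(f"Optimized prompt (partial run) vs. RAG: {optimized_change:+.1f} points")
```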