
Conversation

mikecann
Contributor

This PR adds the ability to "disable Braintrust", that is, to entirely skip reporting results to Braintrust and to skip using the Braintrust AI proxy.

The motivation for this is that I wanted to be able to run some experiments locally without polluting Braintrust or using up Braintrust tokens.

I was also regularly getting frustrating errors from Braintrust:

Retrying API request experiment-comparison2 {'experiment_id': 'ef7e6686-1643-4f75-9390-06322be77788', 'base_experiment_id': '51b388e1-383d-4476-886c-ac175c765897'} 504 {"message": "Endpoint request timed out"}

I am also planning a "self-improving" AI loop that would run over an extended period of time, as per #31, and I don't want to report to Braintrust during that process.

There are more changes in here than I would like: while I was at it, I made some additional changes that help with tracking down the cause of eval failures.

[screenshot]

There are now more high-level log lines reported to the console so you know what is happening, and when an eval finishes you are given a link to view the eval output directory.

[screenshot]

Inside that directory there is now a run.log file that records what happened, so you can track down failures. This will be very important for a self-improving agent.

There is also a nice "Eval Summary" reported to the console each run:

[screenshot]

To be clear, this PR should not affect normal running and reporting to Braintrust; I have tested that it still works (https://www.braintrust.dev/app/Convex/p/Convex%20Coding). It simply adds a new env var that you can optionally use to disable Braintrust.
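
The gating boils down to a single environment check along these lines (the variable and helper names here are illustrative, not necessarily what the diff uses):

```python
import os

def braintrust_disabled() -> bool:
    # Hypothetical env var name, shown only to illustrate the idea
    return os.environ.get("DISABLE_BRAINTRUST", "").lower() in ("1", "true", "yes")

# Reporting and the AI proxy are only used when Braintrust is enabled:
# if not braintrust_disabled():
#     report_results_to_braintrust(results)   # hypothetical helper
```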

mikecann requested a review from jordanhunt22 on August 21, 2025 05:18
Collaborator

jordanhunt22 left a comment

In general, looks good, but a decent number of comments about code style inside.

Comment on lines +4 to +17
def sanitize_output(text: str) -> str:
    try:
        # Remove ANSI CSI, OSC (BEL/ST terminated), hyperlinks, and 7-bit C1 escapes
        patterns = [
            r"\x1B\[[0-?]*[ -/]*[@-~]",  # CSI sequences
            r"\x1B\][^\x07]*\x07",       # OSC sequences terminated by BEL
            r"\x1B\]8;;.*?\x1B\\",       # OSC 8 hyperlinks (ST-terminated)
            r"\x1B[@-Z\\-_]",            # 7-bit C1 escapes
        ]
        out = text
        for p in patterns:
            out = re.sub(p, "", out)
        return out
    except Exception:
Collaborator

what is the reason for doing this?

Contributor Author

Windows thing. I was getting strange character outputs.

Comment on lines 20 to 30
if model.provider == ModelProvider.OPENAI:
    url = "https://api.openai.com/v1"
elif model.provider == ModelProvider.ANTHROPIC:
    url = "https://api.anthropic.com/v1"
elif model.provider == ModelProvider.TOGETHER:
    url = "https://api.together.xyz/v1"
elif model.provider == ModelProvider.GOOGLE:
    url = "https://generativelanguage.googleapis.com/v1beta"
elif model.provider == ModelProvider.XAI:
    url = "https://api.x.ai/v1"
else:
Collaborator

a match statement here would probably be a bit cleaner

Contributor Author

Sure, yeah, I unfortunately don't know a lot about Python; I was wondering if this would be better as some sort of pattern match. I'll get that fixed up.
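
Something along these lines, I think (requires Python 3.10+; the else branch isn't visible in the diff above, so the fallback here is just a guess):

```python
match model.provider:
    case ModelProvider.OPENAI:
        url = "https://api.openai.com/v1"
    case ModelProvider.ANTHROPIC:
        url = "https://api.anthropic.com/v1"
    case ModelProvider.TOGETHER:
        url = "https://api.together.xyz/v1"
    case ModelProvider.GOOGLE:
        url = "https://generativelanguage.googleapis.com/v1beta"
    case ModelProvider.XAI:
        url = "https://api.x.ai/v1"
    case _:
        raise ValueError(f"Unsupported provider: {model.provider}")
```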

runner/scorer.py Outdated

try:
    generate_code(output_project_dir_abs)
    print(f"[{category}/{name}] Running convex codegen", flush=True)
Collaborator

Should these log lines be behind an env variable? It seems like they would be pretty noisy if they are on by default, especially if you run these tests locally.

Contributor Author

Okay, I'll put them behind an env var, but I really wanted more visibility into what the run is doing; otherwise it would seemingly hang for a long time with no output.
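
Roughly what I have in mind (the names are placeholders, not necessarily what ends up in the PR):

```python
import os

# Opt-in verbosity for per-step progress lines
VERBOSE = os.environ.get("EVAL_VERBOSE", "").lower() in ("1", "true", "yes")

def log_progress(message: str) -> None:
    # Only echo progress lines to the console when explicitly enabled
    if VERBOSE:
        print(message, flush=True)

# e.g. log_progress(f"[{category}/{name}] Running convex codegen")
```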

runner/scorer.py Outdated
def typecheck_code(project_dir):
    results = []
    convex_dir = os.path.abspath(os.path.join(project_dir, "convex"))
    cmd1 = ["bunx", "tsc", "-noEmit", "-p", convex_dir]
Collaborator

Can we add better names here instead of cmd1 and cmd2?

Contributor Author

sure!
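
For example, the command lists could just be named for what they run (illustrative names only; cmd2 isn't shown in the excerpts above):

```python
# typecheck_code
typecheck_cmd = ["bunx", "tsc", "-noEmit", "-p", convex_dir]

# lint_code
lint_cmd = ["bunx", "eslint", "-c", eslint_config, "convex"]
```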

runner/scorer.py Outdated
def lint_code(project_dir):
    results = []
    eslint_config = os.path.abspath("eslint.config.mjs")
    cmd1 = ["bunx", "eslint", "-c", eslint_config, "convex"]
Collaborator

same comment as above

runner/scorer.py Outdated

try:
install_dependencies(output_project_dir_abs)
def run_command_step(log_path, handler, prefix, error_label, *, cmd_prefix=""):
Collaborator

Should we define this function in the same place as log_cmd_results down below?

Contributor Author

sure!

@mikecann
Contributor Author

Okay, thanks for the review @jordanhunt22, I have now implemented the things mentioned above :)

mikecann mentioned this pull request on Aug 27, 2025
Collaborator

jordanhunt22 left a comment

LGTM besides a few small comments. Great work!

- Per-eval result with ✅/❌ and a clickable output dir
- `local_results.jsonl` plus a `run.log` in each eval’s output directory

Optional Convex summary posting (still local mode): set both `CONVEX_EVAL_ENDPOINT` and `CONVEX_AUTH_TOKEN`.
Collaborator

we should probably disable this when running locally
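
One way to gate it, using the two env vars mentioned in the docs above (the local-run flag and helper name here are illustrative):

```python
import os

def should_post_convex_summary(is_local_run: bool) -> bool:
    # Never post for purely local runs; otherwise require explicit opt-in via both env vars
    if is_local_run:
        return False
    return bool(os.environ.get("CONVEX_EVAL_ENDPOINT")) and bool(os.environ.get("CONVEX_AUTH_TOKEN"))
```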


url = matching_asset["browser_download_url"]
print("Downloading:", url)
log_info("Downloading:", url)
Collaborator

I think we should still always print out the log lines that we had previously with print, but for all the new ones, use the log_info pattern. The logging before felt useful for seeing progress.
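
i.e. roughly this split, taking the line above as the example (the second line here is illustrative, not from the diff):

```python
# This line existed before the PR, so it keeps its plain print for progress visibility
print("Downloading:", url)

# Genuinely new log lines use the log_info pattern instead, presumably captured in run.log
log_info("Download complete:", url)
```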

mikecann merged commit 9d4fb0f into main on Aug 29, 2025
@mikecann
Contributor Author

Whoops, sorry, I did fix the things you mentioned there but did it on the wrong branch; I'll commit those changes directly to main now.
