Improving error handling #39

dmgerman · 2026-06-22T01:08:38Z

dmgerman
Jun 22, 2026

Great work, everybody.

blobExec silently ignores any errors in the script that is trying to filter the blobs.

I have added error checking to blobExec that outputs debug information to stderr. I also think it would be valuable to have a command-line option that aborts processing if there is an error, because one error might completely invalidate the rest of the processing.

I have added a flag, --abort-on-error.

If that is of interest, I'll submit a pull request.
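
To make the proposal concrete, here is a minimal sketch of what abort-on-error behaviour could look like. It is not the blobExec code: the names `run_filter` and `BlobFilterError`, and the choice to fall back to the unmodified blob when the flag is off, are assumptions made for illustration.

```python
import subprocess
import sys


class BlobFilterError(Exception):
    """Raised when the filter fails and abort_on_error is set (hypothetical name)."""


def run_filter(command, blob, blob_id, abort_on_error=False):
    """Pipe one blob (bytes) through a filter command and return the filtered bytes.

    Assumption: when the filter fails and abort_on_error is False,
    the original blob is returned unchanged.
    """
    result = subprocess.run(command, input=blob, capture_output=True)
    if result.returncode != 0:
        # Diagnostics go to stderr so they never mix with normal output.
        print(f"filter failed for blob {blob_id}: exit code {result.returncode}",
              file=sys.stderr)
        if result.stderr:
            sys.stderr.write(result.stderr.decode(errors="replace"))
        if abort_on_error:
            raise BlobFilterError(blob_id)
        return blob
    return result.stdout
```

With `abort_on_error=True`, the first failing blob stops the run instead of letting a bad result propagate into everything processed afterwards.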

But now that we have agents, why don't we do all the processing inside blobExec? One of the biggest problems with cregit is that it takes a long time on big repositories because of all the forking. What if the user could alternatively provide a dynamic library that processes the blob instead of a script? On top of that, one of the major costs of processing a repo repeatedly is the lack of incremental processing. That also seems to be a feature that might not be difficult to implement now (the old repo becomes the source of memoized blobs). Wow, the work you have all done opens up a lot of improvements in speed and flexibility.

please see this:

EllianCarlos · 2026-06-22T02:44:32Z

EllianCarlos
Jun 22, 2026
Maintainer

Thank you for taking a look at our work, and for the collaboration!

> I have added a flag --abort-on-error
>
> if that is of interest, I'll submit a pull-request.

An --abort-on-error flag is a great addition. I think blobExec already logs some errors, but to stdout rather than stderr, so we might also want to change that. If you can submit a PR, we'd be glad to add it.

> But now that we have agents, why don't we do all the processing inside blobExec? One of the biggest problems with cregit is that it takes a long time on big repositories because of all the forking.

I think that makes more sense, since blobExec already runs for every file and we have already turned it into a library locally, so this is also an easier change now.

> What if the user could alternatively provide a dynamic library that processes the blob instead of a script?

I agree this is a good path. The transform(blob, filename, blobId) interface is basically what blobExec already passes the script, so it's not a big change. The main thing to watch is making it thread-safe, since we'd want to run blobs in parallel across cores. Also, we would be able to keep things loaded between blobs (like the srcML parser and the memo cache) instead of starting cold every time.
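
As a rough illustration of that interface, the sketch below keeps state loaded between blobs and guards it with a lock so that `transform` can be called from several threads. Only the `transform(blob, filename, blobId)` shape comes from this thread; the `Transformer` class, `transform_all`, and the upper-casing placeholder are hypothetical. Note also that in CPython, threads do not give real multi-core speedup for pure-Python work, so this shows the locking pattern rather than the performance.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Transformer:
    """Hypothetical in-process replacement for the per-blob script."""

    def __init__(self):
        # State created once and reused across blobs
        # (a stand-in for a loaded parser or a memo cache).
        self._memo = {}
        self._lock = threading.Lock()

    def transform(self, blob, filename, blob_id):
        with self._lock:
            cached = self._memo.get(blob_id)
        if cached is not None:
            return cached
        # Placeholder work; a real transformer would tokenize the source here.
        result = blob.upper()
        with self._lock:
            self._memo[blob_id] = result
        return result


def transform_all(transformer, blobs, workers=4):
    """blobs: iterable of (blob, filename, blob_id). Returns {blob_id: output}."""
    blobs = list(blobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda args: transformer.transform(*args), blobs)
        return {blob_id: out for (_, _, blob_id), out in zip(blobs, results)}
```

The lock is held only around the shared dictionary, not around the transformation itself, so slow blobs do not block the others.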

> On top of that, one of the major costs of processing a repo repeatedly is the lack of incremental processing. That also seems to be a feature that might not be difficult to implement now (the old repo becomes the source of memoized blobs). Wow, the work you have all done opens up a lot of improvements in speed and flexibility.

This is actually something I thought we could improve after I ran some tests of cregit on the jq repository. It might be easier than we think: since we already have BFG_MEMO_DIR, we just need to share it between runs. Its location could be saved in a cregit.lock file, or we could let the user supply it.

We would need to take more things into account when generating the SHA, since the cache key right now is just the blob content.
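
One way to picture a richer cache key is sketched below. Which extra inputs belong in the key is not settled in this thread; the filter command and a version string are assumptions chosen for illustration (the filename could be another candidate), and `memo_key` is a hypothetical name.

```python
import hashlib


def memo_key(blob, filter_command, filter_version):
    """Derive a cache key from the blob content plus inputs that change the output."""
    h = hashlib.sha1()
    for part in (filter_command.encode(), filter_version.encode(), blob):
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()
```

With a key like this, changing the filter or its version yields a cache miss instead of silently reusing output produced by an older filter.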

0 replies

dmgerman · 2026-06-22T15:00:16Z

dmgerman
Jun 22, 2026
Author

I was thinking about this. I think the best solution is not to do the translation in situ. It was done that way because of the use of bfg.

The proper way is to say: here is the repo, create me another one. By maintaining a mapping of blobs, commits and tags, an update process becomes straightforward: check what has been done, and add the ones that are missing. There is no need to memoize, because the new repository is already memoizing everything. This has another advantage: it can recover from failure. With the current cregit, if one blob fails to be processed, it can invalidate the entire new repo.

cregit is already doing this: for every blob in the original repo, it adds a new one; for every commit, it adds a new one. That is why, at the end of a cregit run, one has to do garbage collection (to remove all the dangling objects). I can't recall how tags are handled, or whether the old ones are explicitly removed.

So all that is needed is maintaining this mapping.

So maybe this should be the first step.
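
The "check what has been done, and add the ones missing" step can be sketched as follows. This is a toy model, not cregit's implementation: the repositories are plain dictionaries, and `update_new_repo` and its parameters are hypothetical. The one detail taken from git itself is how a blob's object id is computed.

```python
import hashlib


def git_blob_id(content):
    """Id git assigns to a blob: SHA-1 of a 'blob <size>' header, a NUL byte, and the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def update_new_repo(source_blobs, blob_map, new_objects, transform):
    """Translate only the blobs that are not yet in blob_map.

    source_blobs: {old_id: content} from the original repo
    blob_map:     {old_id: new_id}, updated in place
    new_objects:  {new_id: content}, stand-in for the new repo's object store
    Returns a list of (old_id, error message) for blobs that failed.
    """
    failed = []
    for old_id, content in source_blobs.items():
        if old_id in blob_map:
            continue  # already translated in an earlier run
        try:
            new_content = transform(content)
        except Exception as exc:
            # A failure is recorded, not fatal: the blob stays unmapped
            # and is retried on the next run.
            failed.append((old_id, str(exc)))
            continue
        new_id = git_blob_id(new_content)
        new_objects[new_id] = new_content
        blob_map[old_id] = new_id
    return failed
```

Running the function a second time only touches blobs that are still missing from the mapping, which is what makes both incremental updates and recovery after a failure cheap.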

1 reply

EllianCarlos Jun 23, 2026
Maintainer

This makes sense to me. It's more consistent and longer-lived than the memo approach and fits git's model better, though it's a harder implementation than just memoization.

It might also let us drop the intermediate rewrite copy and avoid cloning the repo twice, though we'd still keep the working clones that blame and the HTML need. We should watch disk usage with this approach, because we might not want a big output repo. That said, it is a price I'm willing to pay rather than spending many hours on a large git repo.

About the mappings, the commitmap already fits this suggested design. For blobs, inside the BlobExecModifier we already have the old->new mapping, which we just need to keep. The old->new tag mapping should also be easy, since bfg handles tags in a similar way to commits.
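
For the persistence step, one possible shape is a single file holding the three old->new maps, written atomically so that an interrupted run cannot leave a half-written file behind. The JSON layout and the names `save_mappings` and `load_mappings` are assumptions for illustration, not an existing cregit or bfg format.

```python
import json
import os
import tempfile


def save_mappings(path, commits, blobs, tags):
    """Write the three old->new maps to one JSON file, replacing it atomically."""
    data = {"commits": commits, "blobs": blobs, "tags": tags}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(data, handle, indent=1, sort_keys=True)
    os.replace(tmp, path)


def load_mappings(path):
    """Return (commits, blobs, tags); empty maps when no previous run exists."""
    if not os.path.exists(path):
        return {}, {}, {}
    with open(path) as handle:
        data = json.load(handle)
    return data["commits"], data["blobs"], data["tags"]
```

A first run starts from three empty maps, and each later run loads the file, adds whatever is missing, and saves it again.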

I'd like to raise this as an issue and a milestone here and work it out. I can open one for the first step, which is to persist the mapping, and then we can do the rest incrementally. I don't yet know how to envision this as a whole plan, though, so if you have any suggestions I would be happy to take them.

dmgerman · 2026-06-23T00:56:21Z

dmgerman
Jun 23, 2026
Author

About disk space... you made me laugh. This is my current consumption. The tokenize repo is "only" 6 gigs. The rest is blame info.

***@***.***:/home/linux] % du -h -s 7.0
59G 7.0
***@***.***:/home/linux] % du -h -s memo
291G memo

-- --dmg

--- D M German http://turingmachine.org

0 replies

dmgerman · 2026-06-23T00:57:10Z

dmgerman
Jun 23, 2026
Author

Create the issue. I think it should be easy and straightforward to do. I can give it a try.

0 replies

EllianCarlos · 2026-06-23T01:16:24Z

EllianCarlos
Jun 23, 2026
Maintainer

Opened #40 to track the new-repo mapping approach, scoped to the Phase 2 milestone. If you want, feel free to work on it.

0 replies

EllianCarlos · 2026-06-25T03:02:41Z

EllianCarlos
Jun 25, 2026
Maintainer

Great work on #41, @dmgerman, and thanks a lot for the cooperation. I left some minor comments and ideas, but thanks for the contribution!

0 replies
