Replies: 6 comments 1 reply
-
|
Thank you for taking a look at our work and the collaboration!
A flag --abort-on-error is a great addition, I think that it logs some errors, but to
I think that makes more sense, since blobExec already executes for all files and we have already turned it into a library locally, so this is also an easier change now.
I agree this is a good path. The transform(blob, filename, blobId) interface is basically what blobExec already passes the script, so it's not a big change. The main thing to watch is making it thread-safe, since we'd want to run blobs in parallel across cores. Also, we would be able to keep things loaded between blobs (like the srcML parser and the memo cache) instead of starting cold every time.
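To make the thread-safety point concrete, here is a minimal sketch in Python rather than the project's own code. Only the transform(blob, filename, blobId) shape comes from the discussion; the `Transformer` class, the `_load_parser` stand-in and `transform_all` are hypothetical names. Python threads also will not use several cores for CPU-bound work, so this only illustrates the shared-state concern, not the speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Transformer:
    """Hypothetical in-process replacement for the per-blob script."""

    def __init__(self):
        self._lock = threading.Lock()
        self._memo = {}          # blob_id -> transformed bytes, shared by all threads
        self.parser_loads = 0
        self._parser = self._load_parser()   # loaded once, reused for every blob

    def _load_parser(self):
        # Stand-in for an expensive start-up step (e.g. a tokenizer);
        # upper-casing is only a placeholder transformation.
        self.parser_loads += 1
        return lambda blob: blob.upper()

    def transform(self, blob: bytes, filename: str, blob_id: str) -> bytes:
        with self._lock:                     # guard the shared memo cache
            cached = self._memo.get(blob_id)
        if cached is not None:
            return cached
        result = self._parser(blob)          # done outside the lock so blobs overlap
        with self._lock:
            self._memo.setdefault(blob_id, result)
        return result


def transform_all(transformer, blobs, workers=4):
    """blobs maps blob_id -> (filename, content); returns blob_id -> new content."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            blob_id: pool.submit(transformer.transform, data, name, blob_id)
            for blob_id, (name, data) in blobs.items()
        }
        return {blob_id: f.result() for blob_id, f in futures.items()}
```

The one design choice worth noting is that the lock covers only the cache lookups, not the transformation itself, so two threads may occasionally compute the same blob twice but never block each other on the slow step.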
This is actually something I had been thinking about how to improve after I ran some tests of cregit on the jq repository. This one might be easier than we think, since we already have … We would need to take more things into account when generating the SHA, since the cache key right now is just the blob content. |
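As an illustration of a cache key that covers more than the blob content, here is a small Python sketch. Which extra inputs actually matter is an assumption on my part: `filename` and `transformer_id` (some identifier for the script or tokenizer version) are hypothetical choices, not what cregit does today.

```python
import hashlib


def cache_key(blob: bytes, filename: str, transformer_id: str) -> str:
    """Hash the blob together with the other inputs that can change the output.

    filename and transformer_id are assumed inputs for illustration only.
    """
    h = hashlib.sha1()
    for part in (transformer_id.encode(), filename.encode(), blob):
        # Length-prefix every part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()
```

With a key like this, bumping the transformer identifier invalidates old entries automatically instead of silently reusing output produced by an older script.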
-
|
I was thinking about this. I think the best solution is not to do the translation in situ. It was done that way because of the use of bfg. The proper way is to say: here is the repo, create me another one. And by maintaining a mapping of blobs, commits and tags, an update process becomes straightforward: check what has been done, and add the ones that are missing. No need to memoize, because the new repository is already memoizing everything.
This has another advantage: it can recover from failure. With the current cregit, if one blob fails to be processed, it can invalidate the entire new repo.
cregit is already doing this: for every blob in the original repo it adds a new one, and for every commit it adds a new one. That is why, at the end of processing with cregit, one has to run garbage collection (to remove all the dangling objects). I can't recall how tags are handled, and whether the old ones are explicitly removed. So all that is needed is maintaining this mapping, so maybe this should be the first step. |
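A rough Python sketch of the "check what has been done, and add the ones missing" idea, under the assumption that the mapping is kept in a single JSON file. `ObjectMap`, `update` and the `convert` callback are hypothetical names, and real object creation in the new repository is left out entirely.

```python
import json
from pathlib import Path


class ObjectMap:
    """Persistent old-id -> new-id mapping for blobs, commits and tags."""

    def __init__(self, path):
        self.path = Path(path)
        self.maps = {"blob": {}, "commit": {}, "tag": {}}
        if self.path.exists():
            self.maps.update(json.loads(self.path.read_text()))

    def missing(self, kind, old_ids):
        """Return the ids of this kind that have not been converted yet."""
        done = self.maps[kind]
        return [old for old in old_ids if old not in done]

    def record(self, kind, old_id, new_id):
        self.maps[kind][old_id] = new_id

    def save(self):
        # Write to a temporary file first, then rename over the real one.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.maps, indent=2, sort_keys=True))
        tmp.replace(self.path)


def update(object_map, kind, old_ids, convert):
    """Convert only unmapped objects; a failure is reported, not fatal."""
    failed = []
    for old_id in object_map.missing(kind, old_ids):
        try:
            object_map.record(kind, old_id, convert(old_id))
        except Exception as exc:
            failed.append((old_id, exc))
    object_map.save()
    return failed
```

Because a failed object is simply never recorded, rerunning `update` retries it and leaves everything already mapped untouched, which is the recovery property described above.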
-
|
About disk space... you made me laugh. This is my current consumption. The
tokenize repo is "only" 6 gigs. The rest is blame info.
% du -h -s 7.0
59G 7.0
% du -h -s memo
291G memo
…On Mon, Jun 22, 2026 at 5:42 PM Ellian Carlos ***@***.***> wrote:
This makes sense to me. It's more consistent and longer-lived than the
memo approach and fits git's model better, though it's a harder
implementation than just memoization.
It might also let us drop the intermediate rewrite copy and avoid cloning the
repo twice, though we'd still keep the working clones that blame and the
HTML need. We should watch disk usage with this approach because we might
not want a big output repo, although that is a price I'm willing to pay
to avoid many hours of execution on a large git repo.
About the mappings, the commitmap already fits this suggested design. For
blobs, BlobExecModifier already has the old->new mapping, which we just
need to keep. The old->new tag mapping should also be easy, since bfg
handles tags in a similar way to commits.
I'd like to raise this as an issue and a milestone here and work it out. I
can open one for the first step, which is to persist the mapping, and then
we do it incrementally. I don't know how to envision this as a whole plan
though, so if you have any suggestions I would be happy to take them.
--
--dmg
---
D M German
http://turingmachine.org
|
-
|
Create the issue. I think it should be easy and straightforward to do.
I can give it a try.
|
-
|
Opened #40 to track the new-repo mapping approach, scoped to the Phase 2 milestone. If you want, feel free to work on it. |
-
|
Great work on #41, @dmgerman, and thanks a lot for the cooperation. I left some minor comments and ideas, but thanks for the contribution! |
-
Great work, everybody.
blobExec silently ignores any errors in the script that filters the blobs.
I have added error checking to blobExec that outputs debug information to stderr. I also think it would be valuable to have a command-line option that aborts processing if there is an error, because one error might completely invalidate the rest of the processing.
I have added a flag, --abort-on-error.
If that is of interest, I'll submit a pull request.
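For anyone following along, the behaviour described above could look roughly like this. It is a Python sketch, not the actual blobExec change: `filter_blob` and `BlobFilterError` are hypothetical names, and keeping the original blob when a non-aborting run hits an error is an assumption about the fallback.

```python
import subprocess
import sys


class BlobFilterError(RuntimeError):
    """Raised when the filter script fails and abort-on-error is requested."""


def filter_blob(command, blob, blob_id, abort_on_error=False):
    """Pipe one blob through the filter command and check its exit status."""
    proc = subprocess.run(command, input=blob, capture_output=True)
    if proc.returncode == 0:
        return proc.stdout
    # Report the failure on stderr instead of silently ignoring it.
    detail = proc.stderr.decode(errors="replace").strip()
    print(f"blob {blob_id}: filter exited with status {proc.returncode}: {detail}",
          file=sys.stderr)
    if abort_on_error:
        raise BlobFilterError(f"filter failed on blob {blob_id}")
    return blob  # assumed fallback: leave the blob unchanged
```

Raising an exception rather than exiting on the spot lets the caller decide how to clean up a partially rewritten repository before stopping.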
But now that we have agents, why don't we do all the processing inside blobExec? One of the biggest problems with cregit is that it takes a long time on big repositories because of all the forking. What if the user could alternatively provide a dynamic library that processes the blob instead of a script? On top of that, one of the major costs of processing a repo repeatedly is the lack of incremental processing. That also seems to be a feature that might not be difficult to implement now (the old repo becomes the source of memoized blobs). Wow, the work you have all done opens up a lot of improvements in speed and flexibility.
please see this: