Execution Model Inversion #2666
base: master
Conversation
This PR inverts the execution model -- from recursively calling nodes to using a topological sort of the nodes. This change allows for modification of the node graph during execution, which brings two major advantages:

1. The implementation of lazy evaluation in nodes. For example, if a "Mix Images" node has a mix factor of exactly 0.0, the second image input doesn't even need to be evaluated (and vice versa if the mix factor is 1.0).
2. Dynamic expansion of nodes. This allows for the creation of dynamic "node groups". Specifically, custom nodes can return subgraphs that replace the original node in the graph. This is an incredibly powerful concept. Using this functionality, it was easy to implement:
   a. Components (a.k.a. node groups)
   b. Flow control (i.e. while loops) via tail recursion
   c. All-in-one nodes that replicate the WebUI functionality
   d. and more

All of those were implemented entirely via custom nodes, so those features are *not* a part of this PR. (There are some front-end changes that should occur before that functionality is made widely available, particularly around variant sockets.) The custom nodes associated with this PR can be found at: https://github.com/BadCafeCode/execution-inversion-demo-comfyui Note that some of them require that variant socket types ("*") be enabled.
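A minimal sketch of what advantage (1) looks like from a node author's point of view, assuming the lazy-input convention used by the demo repo linked above (an input option marking it lazy, plus a `check_lazy_status` hook the executor calls before evaluating upstream nodes). Names and signatures here are illustrative, not the PR's exact API:

```python
class MixImages:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "factor": ("FLOAT", {"default": 0.5, "min": 0.0, "max": 1.0}),
            # "lazy": True marks inputs that are only evaluated on demand
            "image1": ("IMAGE", {"lazy": True}),
            "image2": ("IMAGE", {"lazy": True}),
        }}

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "mix"

    def check_lazy_status(self, factor, image1, image2):
        # Called before upstream subgraphs run; returns the names of the
        # lazy inputs this particular execution actually needs.
        needed = []
        if factor < 1.0 and image1 is None:
            needed.append("image1")
        if factor > 0.0 and image2 is None:
            needed.append("image2")
        return needed

    def mix(self, factor, image1, image2):
        if factor <= 0.0:
            return (image1,)
        if factor >= 1.0:
            return (image2,)
        return (image1 * (1.0 - factor) + image2 * factor,)
```

With `factor == 0.0`, `check_lazy_status` only asks for `image1`, so the entire subgraph feeding `image2` is skipped.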
If anyone can test and report whether it works with their most complex workflows, that would be very helpful.

All errors I get on this one seem to be from custom nodes. The missing-input and invalid-image-file errors happen regardless, due to how this workflow is built.
started trying to test this. first, the job-iterator extension has to be disabled. it's also not compatible with rgthree's nodes, even if you disable the executor stuff in it, because it will always try to patch the executor regardless. i commented out that part and rgthree didn't seem to be causing problems after that. those problems are basically expected: stuff that messes with the executor is not going to be compatible with these changes.
this is more of an issue. i added a debug print after line 107, which calls the method that can return

seems like it failed fetching info for the standard VAE node. maybe other weird stuff is going on here, but there definitely should be more graceful handling for methods that can return

don't want to seem negative, i definitely appreciate the work you put into these changes, and a better approach to execution is certainly very welcome and needed!

edit: did some more digging, the issue seemed to be an incompatibility with the use everywhere nodes (i was using that to broadcast the VAE, so maybe it wasn't a standard VAE node after all). are these changes expected to be compatible with use everywhere?
I see that compatibility of existing custom nodes may be compromised by the new structure. More important than the immediate compatibility break is verifying whether each extension can provide compatibility patches for the new structure.
Job iterator only patches the executor to let it run multiple times in a loop. This pull request should make it obsolete.
Thanks for the ping and the opportunity to fix before breaking. The execution optimization was meant to be forward compatible, but it did assume methods weren't being removed. I just pushed rgthree/rgthree-comfy@6aa0392, which is forwards compatible with this PR by no longer attempting to patch ComfyUI's execution optimization if the recursive methods don't exist, as is the case in this PR. I've patched this PR in and verified it.

Question: I assume this PR will render the rgthree optimization obsolete? Currently, the patch I provide reduces iterations and times by thousands of percent (from 250,496,808 to just 142, and 158.13 seconds to 0.0, as tested on my machine).

Also, I noticed the client API events have changed (the client-side progress bar shows 100% complete even though the workflow is still running). Are there more details on breaking changes?
Crashed for the following nodes while attempting to create:

Potentially problematic line: It looks like the problematic key is

[FIXED] Assertion crash during execution. Looks like there's some incorrect assumption when calling https://github.com/guill/ComfyUI/blob/36b2214e30db955a10b27ae0d58453bab99dac96/execution.py#L457

Then the rest: Crash at https://github.com/guill/ComfyUI/blob/36b2214e30db955a10b27ae0d58453bab99dac96/comfy/caching.py#L175. The root cause is kinda hard to explain, but take this workflow for instance (workflow embedded in the image). This should run fine if everything is as it is (notice that the node id order MUST BE

```python
class concat_text_O:
    """
    This node will concatenate two strings together
    """
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "text1": ("STRING", {"multiline": True, "defaultBehavior": "input"}),
            "text2": ("STRING", {"multiline": True, "defaultBehavior": "input"}),
            "separator": ("STRING", {"multiline": False, "default": ","}),
        }}

    RETURN_TYPES = ("STRING",)
    FUNCTION = "fun"
    CATEGORY = "O/text/operations"

    @staticmethod
    def fun(text1, separator, text2):
        return (text1 + separator + text2,)

    @classmethod
    def IS_CHANGED(cls, *args, **kwargs):
        return float("NaN")
```

Comparing with https://github.com/omar92/ComfyUI-QualityOfLifeSuit_Omar92/blob/ebcaad0edbfbd8783eb6ad3cb979f23ee3e71c5e/src/QualityOfLifeSuit_Omar92.py#L1189, the new code added

[FIXED] Another bug when a node is added but does not get executed. This occurs when the node has all of its inputs optional (which may include link inputs, not value inputs) AND the same output is connected to two nodes: the

Not sure how to proceed with this. The only way I could think of is adding a way to force-add a node to
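As an aside on why `IS_CHANGED` returning `float("NaN")` interacts badly with the new caching code: the NaN trick works under an equality-based dirty check (NaN never equals itself, so the node always re-runs), but NaN behaves differently once it participates in a hash-based cache key, which is plausibly where the crash surfaces. A standalone illustration of the asymmetry:

```python
nan = float("NaN")

# Equality-based dirty checks always see a change: NaN never equals itself.
assert nan != nan

# But dict/hash-based caches check identity before equality, so the *same*
# NaN object still hits its own cache entry...
cache = {nan: "cached"}
assert nan in cache

# ...while a freshly computed NaN (what IS_CHANGED returns on each run)
# misses the entry entirely.
assert float("NaN") not in cache
```

So depending on whether the executor compares `IS_CHANGED` results with `!=` or hashes them into a key, the same node class can look "always changed", "never changed", or crash an assertion.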
I managed to trigger an error like this by running out of VRAM mid-generation and then trying to run another generation. ComfyUI then seems to get stuck, unable to do anything.

I use ComfyBox as a frontend to ComfyUI; I didn't yet try reproducing this without it. I fixed it by changing the code to use
Update: I've managed to test the PR now, and apart from having to disable rgthree, my flows seem to work fine. My fear that mapping over lists might be broken seems to be unfounded.

Original: I'm not able to test this at the moment, but does this preserve the behaviour that a node will map over lists given as inputs? Looking at the change, it appears it might have been removed. For example, using the Impact Pack's read prompts from file followed by unzip prompts will give you lists of +ve and -ve prompts, which a CLIP prompt encode will turn into lists of conditioning, which a KSampler will turn into lists of latents. Also really useful for shmoo-ing parameters over ranges with CR Float / Integer Range List.
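For readers unfamiliar with the list-mapping behaviour being discussed: mainline executes a node once per list element when inputs are lists, padding shorter lists by reusing their last element. A rough standalone sketch of that semantics (my paraphrase of the idea, not the actual executor code):

```python
def map_node_over_list(fn, inputs):
    """Run fn once per index up to the longest list-valued input,
    reusing the last element of shorter lists (assumed behaviour,
    loosely mirroring mainline's list handling)."""
    if not inputs:
        return [fn()]
    length = max(len(v) for v in inputs.values())
    results = []
    for i in range(length):
        # clamp the index so shorter lists repeat their final element
        args = {k: v[min(i, len(v) - 1)] for k, v in inputs.items()}
        results.append(fn(**args))
    return results
```

For example, a list of three prompts zipped against a single seed produces three results, with the seed reused each time.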
suppose i already have a way to distributed-compute whole workflows robustly and transparently inside python. with these changes, is it a small lift to distribute pieces of the graph / individual nodes? the idea would be to make it practicable to integrate many models together performantly - each individual model may take up the entire VRAM, but distributed on different machines. this is coming from the POV of having a working implementation. if the signature of node execution were async, it would be a very small lift to parallelize individual node execution among consumers (aka workers). it would bring no changes to the current blocking behavior in the ordinary 1-producer-1-consumer case.
@WeeBull Yeah, the intention is that functionality is maintained. I tested it via some contrived test cases, but it's not functionality I use much in my actual workflows. Good to hear that it seems to work for you!

@Seedsa As mentioned in the PR summary, this PR does not enable "*"-type sockets. You'll have to manually re-enable those via a patch, just as you would on mainline. @comfyanonymous How do you feel about a command-line argument that enables "*" types so people don't have to circulate another patch file for it after this PR?

@doctorpangloss That should be a good bit easier with this architecture than the old one, though it's likely any old working code won't transfer over unchanged. In this execution model, nodes can already return

There might be a way to wrap the
@Trung0246 I believe the issue with

I'm unable to reproduce any of your other issues. Any chance you could upload some workflows? (I'm not sure what node pack the 'process' or 'prompt_expand' nodes are from.)
This allows the use of nodes that have sockets of type '*' without applying a patch to the code.
Pasting a code block that @Trung0246 gave me on Matrix here so that it's documented:
This could happen when attempting to evaluate `IS_CHANGED` for a node during the creation of the cache (in order to create the cache key).
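One way to make that failure mode graceful is to compute the `IS_CHANGED` component of the cache key defensively, falling back to an always-miss sentinel when the method raises. This is a hypothetical helper sketching the idea, not the PR's actual code:

```python
def is_changed_value(node_class, inputs):
    """Compute the IS_CHANGED part of a node's cache key, degrading
    gracefully: missing method -> None (inputs alone define the key),
    raising method -> unique sentinel (never equal, always re-execute).
    Hypothetical helper illustrating the idea."""
    is_changed = getattr(node_class, "IS_CHANGED", None)
    if is_changed is None:
        return None
    try:
        # Stringify so float("NaN") compares stably as "nan" instead of
        # tripping NaN's self-inequality in key comparisons.
        return str(is_changed(**inputs))
    except Exception:
        return object()  # fresh object: never equal to a prior key
```

The sentinel approach means a buggy `IS_CHANGED` merely disables caching for that node rather than aborting cache creation for the whole prompt.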
I believe all the issues reported in this PR so far have been addressed. Please let me know if you encounter any new ones (or I'm wrong about the existing reports being resolved).
Getting this error in a complex workflow - working on identifying the source so I can upload a minimal test case:
Looks like just the Load Image node will cause this.

Edit: I verified this is still happening after merging in the latest of the master branch, so something is breaking the LoadImage node.
Behavior should now match the master branch with regard to undeclared inputs. Undeclared inputs that are socket connections will be used while undeclared inputs that are literals will be ignored.
you should probably get a promise from @comfyanonymous before putting any further work into this. i can also merge this into my fork and merge tests, because i don't care that much about third party nodes |
I have evaluated the code, and it ran a set of complex workflows we can test against, for stuff that doesn't use looping or conditional evaluation. These tests included over 30 node packs and used LLMs and audio/video functionality. I am concerned by what Streect has posted, but I suspect we can solve it rather quickly if we work to narrow it down and identify the exact issue; it seems to be a case of node mis-identification in some cases? I suspect we can probably get this in soon, but that it's not immediately ready at this second. @Streect By chance, can we have the JSON for these test flows so we can dig in and figure out what it's doing? I ask as a member of the community on this one.
@Streect Is it possible that you restarted the back-end without refreshing your browser at some point? I've seen bugs like that happen (both with and without this PR) in that case.

Multiple nodes of the same type with the exact same input will share output unless they have the NOT_IDEMPOTENT attribute. That generally shouldn't be an issue for the front-end (since the UI output is reused as well), but if you have a front-end extension that assumes that the node will re-run, it may need a fix.

If you're both confident that you didn't have a restarting issue and weren't using any custom front-end plugins, I can take a look at some point -- it's possible that a front-end change broke things since I started relying on the automated tests. To be honest though, I probably won't be putting too much more effort into this PR unless I get a clear indication that it's likely to be merged.
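The output-sharing rule described here can be sketched as a small deduplication pass over the execution list. The `NOT_IDEMPOTENT` attribute name comes from the comment above; everything else (the helper, the node shapes) is a hypothetical illustration:

```python
def execute_with_sharing(nodes):
    """Execute {node_id: (node_class, inputs)} entries, sharing results
    between nodes of the same class with identical inputs unless the
    class sets NOT_IDEMPOTENT. Illustrative sketch, not executor code."""
    cache = {}
    results = {}
    executions = 0
    for node_id, (cls, inputs) in nodes.items():
        key = None
        if not getattr(cls, "NOT_IDEMPOTENT", False):
            # idempotent nodes share a cache slot keyed by class + inputs
            key = (cls.__name__, tuple(sorted(inputs.items())))
        if key is not None and key in cache:
            results[node_id] = cache[key]
            continue
        executions += 1
        results[node_id] = cls().run(**inputs)
        if key is not None:
            cache[key] = results[node_id]
    return results, executions
```

Two identical idempotent nodes execute once and share output; a node marked `NOT_IDEMPOTENT` always runs fresh.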
@guill I was testing this PR under WSL with a clean Python venv, no plugins or extensions. This is not really a performance issue, more like a usability issue, since it only affects Output nodes with the same exact input in different places, so I just wondered whether it's intended or not. Also, since people confirmed it is working for them, the only thing I can imagine is that I was not using xformers (since it's not installed by default) in this PR's venv. I can check later today. edit: well, xformers didn't help.
guill, I think the conversation at this point is happening because yours is one of the few PRs that is being strongly considered to *get* merged, and folks just have a lot of questions because it impacts so much all at once. To the community and developers, it's a great PR. To a maintainer, it's a tall order and takes a leap of faith - which we're both being extended because the community wants it.

A thing I've learned lately is that it's very useful to a maintainer to be able to break down the bits into smaller steps and know what the changes are.
@Streect Thank you for the report. I was able to reproduce the issue and get a fix in 👍. This issue only affected the UI output, which is why it hasn't shown up in previous automated testing. I did add some new automated testing to guard against regressions though.

@daniel-lewis-ab I totally understand the questions and concerns on the PR. As long as those continue coming in, I have no issue with delaying a merge. The part that has made me question the value of continuing to maintain the PR is the months of radio silence with no new concerns raised. Each time I merge the main branch into this PR to keep it up to date, it becomes more work (both to resolve conflicts and to test things). Is there any chance you could share your corpus of complex workflows? That could significantly reduce the burden of continuing to maintain this branch 😅
@guill Glad it helped. I really like the way this PR works; execution feels faster and more predictable. The only thing I'd like to be optional is this:

I commented these lines out for my own use case with my custom nodes (caching and reusing output on disconnected nodes), but it could be useful for others in the future. Anyway, thanks for your work.
@Streect I'm interested to hear why you don't like that heuristic. It should just be making deterministic something that can happen non-deterministically anyway. Are you actually encountering any issues due to it prioritizing output nodes? Or is there some aspect of your workflow that makes that behavior feel worse? |
Yeah, I’m caching the output of a KSampler and then reusing it in detailers not connected to the KSampler. With this behavior it executes the detailer node before the KSampler, since it doesn’t know that the output of the caching node will change. In my workflows it’s a little more complicated, but other than that it’s just wirelessly connected nodes, so some nodes rely on the output of others just as if they were directly connected, but they are not. It can look like a weird case, but I can imagine less weird cases where this would also be a problem :) In my opinion, this behavior makes execution less predictable; it partly returns us to the recursive model. It also feels like the software is trying to decide what’s better for me as a user without my input, so an option for things like this would be cool :)
I see. I think there may be better ways of solving what you're trying to do. Even without this PR, execution order could change depending on where Preview nodes (or other output) happen to be placed and what order the nodes themselves were created in. Relying on a specific execution order that isn't determined by the graph structure is also bound to cause issues in the future. (If this PR ever gets merged, I want to add support for async nodes next -- it shouldn't be a big lift on top of this execution model.)

IMO, "portals" should really be a front-end plugin -- one that just lets you toggle certain edges in the rendered UI. I would highly recommend going that route, as it's really going to be the most maintainable and give the best user experience. If you really want to do it on the node execution side, you might be able to implement portals using node expansion like this: BadCafeCode/execution-inversion-demo-comfyui@fa3a038 By using node expansion, you can impose your required execution order at runtime. Again, I would really recommend making that a front-end plugin though.
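For anyone curious what the node-expansion approach to portals might look like, here is a loose standalone sketch modeled on the linked demo commit. The `result`/`expand` return shape mirrors the demo repo's expansion convention, but the `Identity` passthrough node and all field names are illustrative assumptions, not real ComfyUI API:

```python
class PortalOut:
    FUNCTION = "expand"

    def expand(self, portal_name, source_node_id, source_slot=0):
        # Emit a one-node subgraph whose single passthrough node is wired
        # to the portal's source; because the replacement graph depends on
        # the source, the source must execute first.
        node_id = f"portal_{portal_name}"
        graph = {
            node_id: {
                "class_type": "Identity",  # hypothetical passthrough node
                # link format: [upstream node id, upstream output slot]
                "inputs": {"value": [source_node_id, source_slot]},
            }
        }
        # "result" points consumers at the new node's output slot 0;
        # "expand" is the subgraph that replaces this node at runtime.
        return {"result": [(node_id, 0)], "expand": graph}
```

The key point is that the ordering constraint is expressed as a real graph edge at expansion time, rather than as a side channel the scheduler can't see.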
@guill, I worked around this issue by serializing the ancestors of the KSampler and detailers myself, and based on changes to them I’m removing or adding output nodes to the output JSON. It works but feels overcomplicated, and could be replaced by just using switches, which I also don’t really like. Thank you for the suggestions; I’ll check it when I have time. I hope this PR will be merged ASAP - I see way more perspective in this.
any switch from @rgthree doesn't work.
Looks like it wasn't compatible with a few nodes (Power Lora Loader, Any Switch, Context Switch, Context Merge). These are flexible nodes that allow for dynamic input types as well as a dynamic number of inputs. rgthree/rgthree-comfy@bd958e4 makes them forwards-compatible with this PR (as it exists now, at least).
Almost. @rgthree, can you check whether this is related to the last update you did on Any Switch?
No, that is and has always been broken with this PR, and is unrelated to rgthree-comfy nodes, or even the Any Switch (you can remove the Any Switch in your screenshot and it's still broken).

That Simple Math node is from ComfyUI Essentials, and it looks like it defines the type as a string value of "INT,FLOAT". Interesting that this works; I guess the JS client allows for this flexibility, which I didn't know. Unfortunately, it looks like that was never really meant to work in the backend, but the current ComfyUI only checks the types for required inputs, and Simple Math's

At least, that's as far as I can tell.
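If the backend were to honor the comma-separated convention ComfyUI Essentials relies on, the edge check could treat a type string like `"INT,FLOAT"` as a set of acceptable types. A hypothetical checker sketching that rule (not ComfyUI's actual validation code):

```python
def types_compatible(output_type, input_type):
    """An edge is valid if the comma-separated type sets intersect,
    or if either side is the variant type "*". Hypothetical rule
    sketching the convention discussed above."""
    a = set(output_type.split(","))
    b = set(input_type.split(","))
    return "*" in a or "*" in b or bool(a & b)
```

Under this rule an `INT` output could feed an `"INT,FLOAT"` input, while a `STRING` output could not, and `"*"` would match anything.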