Blazor signalr connection breaks after syncing large dom / dom element #9683
Comments
It sounds like you are hitting the buffer size limit in SignalR. @anurse Is there any reason to not remove/increase the limit on the outbound buffer size?
The main reason for the limit is to ensure that clients can't over-allocate server resources. I think I vaguely remember that this may not be as much of an issue in recent previews because of how Pipelines tracks the limit now (+ @davidfowl)
I can understand limiting the inbound buffer size for this reason, but doesn't the server completely control how much data it wants to send to the client?
Turns out in preview4 we shipped with a 32 MB limit right? Are you sure you're hitting that?
@tn-5 does anything show up in the server logs to indicate the nature of the error?
I'm having deja vu about this conversation. I feel like we talked about this before and removed the outbound limit. Or maybe Blazor configured it away... I can't find the issue though.
This is true. Due to a typo we actually had a 32 MB limit in preview 4, so it doesn't seem too likely this would be the case.
Yes, so you can set the limit higher if you'd like.
After some more testing I'm convinced it is not related to a buffer size per se. At this point I'm thinking it is either a timing issue or something related to the actual dom diff code itself. I've tested very large dom syncs (total dom + individual component) - this all works 100%. In this specific instance the reproducible error happens when I request the data from an async api. I'll investigate further to see if it can be reproduced more reliably.
I've managed to reproduce this reliably with some small changes to the template project. It seems to be a timing-related issue linked to a large DOM sync. The large DOM makes some code take a bit longer, and then the SignalR message sequences are different. It happens specifically when navigating from one page to another.
Trace detail
Working case:
Failure case:
You'll see in the failure case there is a websocketstransport message between the "invoke notifylocationchanged" and the "remoteurihelper - location changed to..." message.
Reproduce
At this point the application works fine; navigating between different pages works as expected. What seems to then break it is when navigation occurs from another layout (it may also be that there are more different elements to sync). Anyway, to reproduce:
In this scenario the issue reproduces reliably:
@danroth27 I think I saw this the other day when trying out a separate bug. I think what is happening is that the server takes a long time to process the event/render and the SignalR connection gives up along the way. I will investigate more, but I’m not sure we will be able to do anything reasonable here. I was rendering 100,000 elements at the time, so I didn’t give it much importance, given that I was able to reliably render at least 1000 and that if you render large payloads these things are going to happen eventually. I also have a proposal based on this to not produce large batch updates but to split them into smaller ones to improve responsiveness. For the record, I would avoid delivering assets this way.
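For illustration only, the idea of splitting one large batch into several smaller ones could look roughly like the sketch below. This is not how Blazor builds or sends render batches; `split_batch`, `max_bytes`, and the byte-string updates are hypothetical names chosen for the example.

```python
from typing import Iterable, Iterator, List


def split_batch(updates: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group already-encoded updates into chunks of at most max_bytes each.

    Hypothetical helper: an update that is larger than max_bytes on its own
    is still emitted, alone, in its own chunk.
    """
    chunk: List[bytes] = []
    size = 0
    for update in updates:
        # Close the current chunk if adding this update would exceed the budget.
        if chunk and size + len(update) > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(update)
        size += len(update)
    if chunk:
        yield chunk


if __name__ == "__main__":
    updates = [b"a" * 10, b"b" * 10, b"c" * 25, b"d" * 5]
    for i, part in enumerate(split_batch(updates, max_bytes=20)):
        print(i, [len(u) for u in part])
```

Each chunk could then go out as its own message, so no single message grows with the size of the whole update.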
I'd definitely agree with @javiercn here. SignalR isn't really designed with large message payloads in mind. It's designed for frequent small-to-medium size updates. Generally, when users ask about transferring large files "through" SignalR, we strongly recommend hosting the content at some URL (blob storage, CDN, etc.) and sending the URL through SignalR instead of the actual content.
I agree in general that this should not be done, at least not without a good reason. In our use case this logo is a client logo in a large multi-tenant installation. The logo is stored in the database, retrieved from an API, and then displayed. Normally the logo should be small (< 20 KB) and retrieved only once (since it is present on the main layout across all pages). We are designing our Blazor app primarily to be used client side (wasm), which means the code will directly receive this data from the API using our normal authentication etc. We only found this issue originally because there was a misconfiguration on one tenant (resulting in a very large logo, several hundred KB).
@javiercn I am concerned about the way this is exhibiting itself. In the failure case all the data is actually transferred correctly and the client side seems OK: it still keeps sending SignalR messages to the server. It seems more likely that some state in the SignalR handling on the server side is getting confused than a simple timeout (we are talking total transfer times of several ms here, not seconds). I am guessing that this is rather indicative of a bug related to receiving responses out of an expected sequence.
@pranavkm didn’t you fix this issue the other day in the message pack implementation?
Generally we would also recommend not storing the image content in the database (instead storing it on blob storage and storing a link) but I realize you may not want to redesign your entire app :). I'd still advise you consider using a standard server-side ASP.NET Core middleware to allow you to fetch a tenant's image using standard HTTP given the necessary data (tenant ID, authentication, etc.) and then use that URL in the image tag. This is a little tangential of course, since we do still need to investigate the Blazor issue further. Just throwing my design thoughts out there 😸
@davidfowl I've tested again with latest nightly (11681) and the problem is still there, so it is not fixed. When I tested now I realised that it only happens when the navigation is performed by using In the repro example you can copy So it seems the issue is somehow related to the |
I dunno if the fix is merged but there is an issue that @pranavkm found in the current impl. I dunno if it's related to this but it's a bug in the protocol parsing which might be related.
I've investigated this in more detail. The root cause is that there is an issue with the lock handling in the signalR Specifically, the use case I have fails if the following sequence of events happen:
In this scenario task [2] is triggered when task [1] releases the lock (but before the actual return from At this point the lock seems to be allocated (although it shouldn't be). No further comms is then allowed until the connection is reset. I am not sure what the root cause of this is, it may be a subtle issue with the ValueTask being awaited (maybe it caches the same one & then awaits it multiple times which is not allowed?). So to summarise this issue occurs when the |
@BrennanConroy can you take a look?
I have the same error when I try to transfer an HTML page of about 23,000 characters (no pictures, just HTML). The application crashes as described in the first post.
We're continuing to dig through this. Moving out to preview7 because we aren't confident we can have a fix for preview6.
We've identified the issue and @rynowak is working on a fix.
Fixes: #9683 - SignalR connection breaks on large DOM The root cause here is a misbehaving sync context. It's not legal for a Post() implementation to run a callback synchronously. We want that behavior for most of the functionality Blazor calls directly, but Post() should always go async or else various threading primitives are broken.
* Fix RendererSyncContext.Post()
* Fix incorrect tests: These tests assumed that setting the result of a TCS will execute continuations synchronously. This was a bug in our SyncContext, and these tests needed updating to be more resilient.
* Remove a delegate allocation
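To illustrate why a `Post()` that runs its callback inline can upset code built on top of it, here is a small language-neutral model written in Python. It is not the Blazor or .NET implementation: `InlineContext`, `QueuedContext`, `SimpleLock`, and `run` are hypothetical names, and the toy lock only mimics the general idea of handing a lock to the next waiter through the context.

```python
from collections import deque


class InlineContext:
    """Hypothetical context whose post() runs the callback immediately."""

    def post(self, callback):
        callback()

    def drain(self):
        pass  # nothing is ever queued


class QueuedContext:
    """Hypothetical context whose post() only enqueues; drain() runs callbacks later."""

    def __init__(self):
        self._queue = deque()

    def post(self, callback):
        self._queue.append(callback)

    def drain(self):
        while self._queue:
            self._queue.popleft()()


class SimpleLock:
    """Toy lock: release() hands ownership to the next waiter via context.post()."""

    def __init__(self, context):
        self.context = context
        self.held = False
        self.waiters = deque()

    def acquire(self, continuation):
        if not self.held:
            self.held = True
            continuation()
        else:
            self.waiters.append(continuation)

    def release(self):
        if self.waiters:
            # Ownership passes directly to the next waiter.
            self.context.post(self.waiters.popleft())
        else:
            self.held = False


def run(context):
    """Run two tasks contending for one lock and record the order of events."""
    log = []
    lock = SimpleLock(context)

    def task2():
        log.append("task2: acquired")
        lock.release()

    def task1():
        log.append("task1: acquired")
        lock.acquire(task2)  # task2 has to wait behind task1
        log.append("task1: releasing")
        lock.release()
        log.append("task1: returned from release")

    lock.acquire(task1)
    context.drain()
    return log


if __name__ == "__main__":
    for ctx in (InlineContext(), QueuedContext()):
        print(type(ctx).__name__, run(ctx))
```

With the inline context, task 2 runs in the middle of task 1's `release()` call, before task 1 has returned from it. With the queued context, task 1 finishes first and task 2 runs afterwards. Code that assumes the second ordering can end up in an inconsistent state under the first.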
Describe the bug
On login in our application we load a logo which is displayed. Normally this is quite small but since it is user specific it sometimes can be large. With preview 4 the server connection is broken when this is too large. The client then "freezes" - i.e. does not respond to clicks. The clicks do generate websocket messages but no responses. After a timeout the auto reconnect establishes a new connection and everything is then ok (until the next large dom sync).
This worked fine in preview 3.
To Reproduce
I tried to reproduce with a small update to the template project but could not. Let me know if I should try again. It is 100% reproducible in our application.
I've artificially generated images of various sizes, and it seems to break when the image is between 25440 and 25450 bytes. Tracing in Chrome, the websocket message for this dom sync is between 43772 bytes (working) and 43784 bytes (non-working).
Please let me know if there is anything I can do to get more diagnostics on this for the case where it fails.