Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blazor signalr connection breaks after syncing large dom / dom element #9683

Closed
tn-5 opened this issue Apr 23, 2019 · 27 comments · Fixed by #10451
Closed

Blazor signalr connection breaks after syncing large dom / dom element #9683

tn-5 opened this issue Apr 23, 2019 · 27 comments · Fixed by #10451
Assignees
Labels
area-blazor Includes: Blazor, Razor Components bug This issue describes a behavior which is not expected - a bug. Done This issue has been fixed

Comments

@tn-5
Copy link

tn-5 commented Apr 23, 2019

Describe the bug

On login in our application we load a logo which is displayed. Normally this is quite small but since it is user specific it sometimes can be large. With preview 4 the server connection is broken when this is too large. The client then "freezes" - i.e. does not respond to clicks. The clicks do generate websocket messages but no responses. After a timeout the auto reconnect establishes a new connection and everything is then ok (until the next large dom sync).

This worked fine in preview 3.

To Reproduce

I tried to reproduce with a small update to the template project but could not. Let me know if I should try again. It is 100% reproducible in our application.

I've artificially generated various size images and it seems to break when the image is between 25440 and 25450 bytes. Tracing in chrome the websocket message for this dom sync is between 43772 and 43784 bytes (working / non-working).

Please let me know if there is anything I can do to get more diagnostics on this for the case where it fails.

@Eilon Eilon added the area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates label Apr 23, 2019
@danroth27 danroth27 added the area-blazor Includes: Blazor, Razor Components label Apr 23, 2019
@danroth27
Copy link
Member

It sounds like you are hitting the buffer size limit in SignalR.

@anurse Is there any reason to not remove/increase the limit on the outbound buffer size?

@analogrelay
Copy link
Contributor

Is there any reason to not remove/increase the limit on the outbound buffer size?

The main reason for the limit is to ensure that clients can't over-allocate server resources. I think I vaguely remember that this may not be as much of an issue in recent previews because of how Pipelines tracks the limit now (+ @davidfowl)

@danroth27
Copy link
Member

The main reason for the limit is to ensure that clients can't over-allocate server resources.

I can understand limiting the inbound buffer size for this reason, but doesn't the server completely control how much data it wants to send to the client?

@davidfowl
Copy link
Member

Turns out in preview4 we shipped with a 32 MB limit right? Are you sure you're hitting that?

@danroth27
Copy link
Member

@tn-5 does anything show up in the server logs to indicate the nature of the error?

@analogrelay
Copy link
Contributor

doesn't the server completely control how much data it wants to send to the client?

I'm having deja vu about this conversation. I feel like we talked about this before and removed the outbound limit. Or maybe Blazor configured it away... I can't find the issue though.

Turns out in preview4 we shipped with a 32 MB limit right? Are you sure you're hitting that?

This is true. Due to a typo we actually had a 32 MB limit in preview 4, so it doesn't seem too likely this would be the case.

@BrennanConroy
Copy link
Member

doesn't the server completely control how much data it wants to send to the client?

Yes, so you can set the limit higher if you'd like.

@tn-5
Copy link
Author

tn-5 commented Apr 25, 2019

After some more testing I'm convinced it is not related to a buffer size per se. At this point I'm thinking it is either a timing issue or something related to the actual dom diff code itself. I've tested very large dom syncs (total dom + individual component) - this all works 100%. In this specific instance the reproducible error happens when I request the data from an async api. I'll investigate further to see if it can be reproduced more reliably.

@tn-5
Copy link
Author

tn-5 commented Apr 25, 2019

I've managed to reproduce this reliably with some small changes to the template project. it seems to be a time related issue linked to a large dom sync. The large dom makes some code take a bit longer and then the signalr messages sequences are different. It happens specifically when navigating from one page to another.

Trace detail

Working case:

trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 75 bytes.
trce: Microsoft.AspNetCore.Components.Browser.Rendering.RemoteRenderer[101]
      Begin remote rendering of components on client bsz_lKfQEEbDuvRBtUAOmw.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 86 bytes.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 123, EndOfMessage: True.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 93, EndOfMessage: True.
dbug: Microsoft.AspNetCore.SignalR.Internal.DefaultHubDispatcher[1]
      Received hub invocation: InvocationMessage { InvocationId: "", Target: "BeginInvokeDotNetFromJS", Arguments: [ 6, Microsoft.AspNetCore.Components.Server, NotifyLocationChanged, 0, ["http://localhost:49993/"] ], StreamIds: [  ] }.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 26, EndOfMessage: True.
dbug: Microsoft.AspNetCore.Components.Server.Circuits.RemoteUriHelper[0]
      Location changed to 'http://localhost:49993/'.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 86 bytes.

Failure case:

trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 75 bytes.
trce: Microsoft.AspNetCore.Components.Browser.Rendering.RemoteRenderer[101]
      Begin remote rendering of components on client S5PSgzO0YLBrGnYHx1rSHQ.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 86 bytes.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 123, EndOfMessage: True.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 93, EndOfMessage: True.
dbug: Microsoft.AspNetCore.SignalR.Internal.DefaultHubDispatcher[1]
      Received hub invocation: InvocationMessage { InvocationId: "", Target: "BeginInvokeDotNetFromJS", Arguments: [ 6, Microsoft.AspNetCore.Components.Server, NotifyLocationChanged, 0, ["http://localhost:49993/"] ], StreamIds: [  ] }.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 26, EndOfMessage: True.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 86 bytes.
dbug: Microsoft.AspNetCore.Components.Server.Circuits.RemoteUriHelper[0]
      Location changed to 'http://localhost:49993/'.

You'll see in the failure case there is a websocketstransport message between the "invoke notifylocationchanged" and the "remoteurihelper - location changed to..." message.

Reproduce

  • dotnet new blazorserverside
  • Add large dom element to counter.razor
  • <img src="data:image/png;base64, @dummydata" alt="[LOGO]" />
  • Initialise to large data (500k should reliably break it) - functions block:
string dummydata = "";
protected override async Task OnInitAsync()
{
    Random rnd = new Random();
    int i = rnd.Next(32, 127);
    dummydata  = new string((char)(i), 500000); 
}

At this point the application works fine, navigating between different pages works as expected

What seems to then break it is when navigation occurs from another layout (it may also be that there are more different elements to sync). Anyway -, to reproduce:

  • Create a layout file Layout1.razor (no specific content required):
@inherits LayoutComponentBase
<div>@Body</div>
  • Change the fetchdata.razor file to use this layout
  • Add a button on fetchdata to enable navigation back to the counter page (due to the layout change this is required):
    <button onclick="@(() => UriHelper.NavigateTo("counter"))">Counter</button>
  • This also requires: @inject IUriHelper UriHelper

In this scenario the issue reproduces reliably:

  1. Run application
  2. Navigate to "fetchdata" page
  3. Navigate to "counter" page
  4. Clicks are not being processed. After a timeout the connection is reestablished and it works again

@danroth27 danroth27 added the bug This issue describes a behavior which is not expected - a bug. label Apr 25, 2019
@danroth27 danroth27 added this to the 3.0.0-preview6 milestone Apr 30, 2019
@javiercn
Copy link
Member

@danroth27 I think I saw this the other day when trying out a separate bug.

I think what is happening is that the server takes a long time to process the event/render and the signalr connection gives up along the way.

I will investigate more, but I’m not sure we will be able to do anything reasonable here.

I was rendering 100.000 elements at the time, so I didn’t give it much importance given that I was able to reliably render at least 1000 and that if you render large payloads these things are going to happen eventually.

I also have a proposal based on this to not produce large batch updates but to split them into smaller ones to improve responsiveness.

For the record, I would avoid delivering assets this way.

@analogrelay
Copy link
Contributor

analogrelay commented Apr 30, 2019

  • <img src="data:image/png;base64, @dummydata" alt="[LOGO]" />

For the record, I would avoid delivering assets this way.

I'd definitely agree with @javiercn here. SignalR isn't really designed with large message payloads in mind. It's designed for frequent small-to-medium size updates. Generally, when users ask about transferring large files "through" SignalR, we strongly recommend hosting the content at some URL (blob storage, CDN, etc.) and sending the URL through SignalR instead of the actual content.

@tn-5
Copy link
Author

tn-5 commented May 1, 2019

I agree in general that this should not be done, at least not without a good reason.

In our use case this logo is a client logo in a large multi-tenant installation. The logo is stored in the database and retrieved from an API and then displayed. Normally the logo should be small (< 20k) and retrieved only once (since it is present on the main layout across all pages).

We are designing our Blazor app primarily to be used client side (wasm) which means the code will directly receive this data from the api using our normal authentication etc.

We only found this issue originally since there was a misconfiguration on one tenant (resulting in a very large logo - several 100k).

@javiercn I am concerned about the way this is exhibiting itself. In the failure case all the data is actually transferred correctly and the client side seems ok - it still keeps sending signalr messages to the server. It seems more likely that some state in the signalr handling on the server side is getting confused than a a simple timeout (we are talking total transfer times of several ms here, not seconds). I am guessing that this is rather indicative of a bug related to receiving responses out of an expected sequence.

@davidfowl
Copy link
Member

@pranavkm didn’t you fix this issue the other day in the message pack implementation?

@analogrelay
Copy link
Contributor

In our use case this logo is a client logo in a large multi-tenant installation. The logo is stored in the database and retrieved from an API and then displayed.

Generally we would also recommend not storing the image content in the database (instead storing it on blob storage and storing a link) but I realize you may not want to redesign your entire app :).

I'd still advise you consider using a standard server-side ASP.NET Core middleware to allow you to fetch a tenant's image using standard HTTP given the necessary data (tenant ID, authentication, etc.) and then use that URL in the image tag. data URLs require transferring the entire image on every render which means you can't take advantage of browser caching. If, instead, you fill in a URL like <img src=/api/tenant/{tenantId}/logo />, the browser can cache that image appropriately and save a lot of wasted bandwidth (even for small logos) on full-page refreshes.

This is a little tangential of course, since we do still need to investigate the Blazor issue further. Just throwing my design thoughts out there 😸

@tn-5
Copy link
Author

tn-5 commented May 2, 2019

@davidfowl I've tested again with latest nightly (11681) and the problem is still there, so it is not fixed.

When I tested now I realised that it only happens when the navigation is performed by using NavigateTo from the UriHelper.

In the repro example you can copy Mainlayout.razor to Layout1.razor. When navigating using the button click (which uses NavigateTo) it fails. When clicking the link in the navpane it works.

So it seems the issue is somehow related to the UriHelper implementation

@davidfowl
Copy link
Member

I dunno if the fix is merged but there is an issue that @pranavkm found in the current impl. I dunno if it's related to this but it's a bug in the protocol parsing which might be related.

@tn-5
Copy link
Author

tn-5 commented May 8, 2019

I've investigated this in more detail. The root cause is that there is an issue with the lock handling in the signalR HubConnectionContext.

Specifically, the use case I have fails if the following sequence of events happen:

  1. A call to WriteAsync does not complete immediately and it returns a CompleteWriteAsync task
  2. Immediately afterwards another message is sent, no immediate lock is obtained and it returns a CompleteWriteAsync task.

In this scenario task [2] is triggered when task [1] releases the lock (but before the actual return from CompleteWriteAsync). Upon completion of [2] the _writeLock count is 1 (which is correct). Control flows back to CompleteWriteAsync(to return) and at that point the _writeLock count is 0 (incorrect).

At this point the lock seems to be allocated (although it shouldn't be). No further comms is then allowed until the connection is reset.

I am not sure what the root cause of this is, it may be a subtle issue with the ValueTask being awaited (maybe it caches the same one & then awaits it multiple times which is not allowed?).

So to summarise this issue occurs when the CompleteWriteAsync is still busy and another message needs to be sent and ends up in WriteSlowAsync, awaiting the lock.

@davidfowl
Copy link
Member

davidfowl commented May 8, 2019

@anurse Can we get somebody to look at this? Thanks for looking deeper @tn-5

@analogrelay
Copy link
Contributor

@BrennanConroy can you take a look?

@mkArtakMSFT mkArtakMSFT removed area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates labels May 9, 2019
@ghost
Copy link

ghost commented May 21, 2019

I have the same error when I try to transfer a html page with a weight of 23,000 characters (no pictures, just html), the application crashes as described in the first post.

@analogrelay analogrelay added the blocked The work on this issue is blocked due to some dependency label May 21, 2019
@analogrelay
Copy link
Contributor

We're continuing to dig through this. Moving out to preview7 because we aren't confident we can have a fix for preview6.

@analogrelay analogrelay pinned this issue May 21, 2019
@analogrelay analogrelay unpinned this issue May 21, 2019
@BrennanConroy BrennanConroy removed the area-signalr Includes: SignalR clients and servers label May 22, 2019
@BrennanConroy BrennanConroy removed the blocked The work on this issue is blocked due to some dependency label May 22, 2019
@davidfowl
Copy link
Member

We've identified the issue and @rynowak is working on a fix

rynowak added a commit that referenced this issue May 22, 2019
Fixes: #9683 - SignalR connection breaks on large DOM

The root cause here is a misbehaving sync context. It's not legal for a
Post() implementation to run a callback synchronously. We want that
behavior for most of the functionality Blazor calls directly, but Post()
should always go async or else various threading primitives are broken.
rynowak added a commit that referenced this issue May 23, 2019
* Fix RendererSyncContext.Post()

Fixes: #9683 - SignalR connection breaks on large DOM

The root cause here is a misbehaving sync context. It's not legal for a
Post() implementation to run a callback synchronously. We want that
behavior for most of the functionality Blazor calls directly, but Post()
should always go async or else various threading primitives are broken.

* Fix incorrect tests

These tests have the assumption that setting the result of a TCS will
execution continuations synchronously. This was a bug in our
SyncContext, and these tests needed updating to be more resiliant.

* Remove a delegate allocation
@rynowak rynowak added Done This issue has been fixed and removed 2 - Working labels May 23, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Dec 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-blazor Includes: Blazor, Razor Components bug This issue describes a behavior which is not expected - a bug. Done This issue has been fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants