Blazor worker "memory access out of bounds" bug #4059


Closed · StachuDotNet opened this issue Jun 1, 2022 · 55 comments · Fixed by #4089

StachuDotNet commented Jun 1, 2022

Weird - I think this is actually happening before the request was made. Maybe I was misinterpreting things before.
Edit: nevermind.


StachuDotNet commented Jun 1, 2022

I've now gotten the analysis request payload that causes the issue.
It's indeed quite large (70814 chars of JSON).

FWIW, this request seems to include a lot of data that isn't needed to perform the analysis (though I haven't verified this).

I'm seeing three paths forward:

  • somehow increase the memory limit (so far, this seems impossible without recompiling blazor and mono/wasm)
  • send less stuff in the request (i.e. filter out stuff not needed for analysis)
  • send the request in batches, e.g. per https://stackoverflow.com/a/57973792 or "as a file."
    I suppose we could create a .json file, and send that across the wire? haven't looked into this yet.

Currently experimenting to see what the char threshold of failure is.

StachuDotNet commented:

One more path forward: if the request is huge, ask the backend instead of wasm.
(we could do this for a bit, and follow up with other options later)

StachuDotNet commented:

Note: I can replicate this in Release AOT builds, but not Debug builds.
Now checking to see if I can repro in a non-AOT release build.


StachuDotNet commented Jun 1, 2022

Seems I can't - i.e. the bug seems AOT-specific.
Edit: still didn't determine char threshold of failure - actually testing that now.


pbiggar commented Jun 1, 2022

Do you know what function this is happening in? Perhaps we're able to separate reading the json from doing the analysis in some way. Maybe we parse the json in JS code and then call wasm using JSInterop or whatever it's called.

StachuDotNet commented:

> Do you know what function this is happening in?

Maybe I'm answering the wrong question, but:

Once we've defined messageHandler:
const messageHandler = Module.mono_bind_static_method("[Wasm]Wasm.EvalWorker:OnMessage");

calling messageHandler(hugePayload); is what breaks it - we don't get to the first line of OnMessage (likely because we can't even read the message arg).

> Maybe we parse the json in JS code and then call wasm using JSInterop or whatever it's called.

I don't know what you mean by this. Could you please spell it out?


pbiggar commented Jun 1, 2022

Spelt out below but I think the fact that it happens before OnMessage might scupper what I was suggesting.

Currently we do:

let callWasm data = 
  let parsed = parseData(data) // happens in F#/WASM
  execute(parsed) // more F#/Wasm
callWasm data // JS caller

Could we instead do:

parsed = parseDataInJS(data); // happens in JS
callWasm(parsed) // F# code happens within here
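
If that direction were viable, the receiving side might look roughly like this ([<JSInvokable>] and DotNet.invokeMethodAsync are real Blazor interop APIs; the type and method names are invented, and whether Blazor's own JSON binding fares any better here is an open question):

open Microsoft.JSInterop

type EvalWorkerInterop =
  // JS would call: DotNet.invokeMethodAsync("Wasm", "OnMessageParsed", parsedObject)
  // and Blazor would bind the JSON into the typed parameter itself.
  [<JSInvokable>]
  static member OnMessageParsed(args : ClientInterop.performAnalysisParams) =
    Eval.performAnalysis args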


pbiggar commented Jun 1, 2022

>   • send the request in batches, e.g. per https://stackoverflow.com/a/57973792 or "as a file."
>     I suppose we could create a .json file, and send that across the wire? haven't looked into this yet.

That's a good idea.


StachuDotNet commented Jun 1, 2022

TIL "scupper."

Yeah I don't think that'll work (because we don't get to OnMessage)


So, seems we have two options:

  • create a .json file, "upload" it, and then process that upload. (not yet sure how this will pan out...)
  • (more manual) break the request down into multiple messages, send them chunk by chunk as requests, update Wasm.fs to wait for all of the chunks (likely with some little protocol to ensure ordering and a signal that "we're done sending"), and then process the request; a sketch of the receiving side follows this list.
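
A minimal sketch of the worker-side reassembly for that chunk protocol (hypothetical: ChunkMessage and its fields are invented names, processAnalysisRequest stands in for the existing OnMessage logic, and it assumes no chunk is the empty string):

type ChunkMessage =
  { requestId : string
    index : int
    total : int
    data : string }

let private pending =
  System.Collections.Generic.Dictionary<string, string array>()

let onChunk (chunk : ChunkMessage) =
  // allocate one slot per expected chunk the first time we see this request
  if not (pending.ContainsKey chunk.requestId) then
    pending.[chunk.requestId] <- Array.create chunk.total ""
  pending.[chunk.requestId].[chunk.index] <- chunk.data

  // once every slot is filled, reassemble and hand off the full payload
  let slots = pending.[chunk.requestId]
  if slots |> Array.forall (fun s -> s <> "") then
    pending.Remove chunk.requestId |> ignore
    slots |> String.concat "" |> processAnalysisRequest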

I'm currently trying to see if anything in https://github.com/Tewr/BlazorFileReader/blob/main/src/Blazor.FileReader/Tewr.Blazor.FileReader.md could help us out. Or maybe there's something official. It's hard not to be distracted by all of the view/Razor-specific stuff in this Blazor world.


pbiggar commented Jun 1, 2022

Yeah, either of those sound good. I like the ArrayBuffer/file approach but does that also have size limits?

Are we sure this is the problem though? ~71k isn't small but it's also not huge. The Stackoverflow thing says the limit is over 50MB now.


StachuDotNet commented Jun 1, 2022

I'm now a bit less confident.

I added this to Wasm.fs:

  static member CanWeGetHere(message : string) =
    message |> String.length |> printfn "Yes %i"

and posted the giant json to that handler rather than the OnMessage one. It works? Digging in..


StachuDotNet commented Jun 1, 2022

I was wrong before! We are getting to within OnMessage. We're even deserializing OK. The failure seems to be happening within the actual analysis.

Added some logs; waiting for another AOT build - seems we're closer.

StachuDotNet commented:

Even more interesting:

with my code like this:

let! result =
  task {
    match args with
    | Error e -> return Error e
    | Ok args ->
      try
        printfn "Performing analysis"
        let! result = Eval.performAnalysis args
        printfn "Performed analysis"
        return Ok result
      with
      | e ->
        let metadata = Exception.nestedMetadata e
        System.Console.WriteLine("Error running analysis in Blazor")
        System.Console.WriteLine($"called with message: {message}")
        System.Console.WriteLine(
          $"caught exception: \"{e.Message}\" \"{metadata}\""
        )
        return Error($"exception: {e.Message}, metadata: {metadata}")
  }

printfn "Serializing result"
let serialized =
  try
    Json.Vanilla.serialize result
  with
  | e ->
    let metadata = Exception.nestedMetadata e
    System.Console.WriteLine("Error serializing results of Blazor analysis")
    System.Console.WriteLine($"called with message: {message}")
    System.Console.WriteLine(
      $"caught exception: \"{e.Message}\" \"{metadata}\""
    )
    Json.Vanilla.serialize ($"exception: {e.Message}, metadata: {metadata}")

printfn "Posting result back"
EvalWorker.postMessage serialized

I got "Performed analysis" and it dies right after, without ever emitting "Serializing result". Maybe something to do with the task.


pbiggar commented Jun 1, 2022

Well that's optimistic at least!


StachuDotNet commented Jun 1, 2022

So, I got it to sort of work.

here's some code (with minor simplifications):

static member OnMessage(message : string) =
  let args =
    try
      Ok(Json.Vanilla.deserialize<ClientInterop.performAnalysisParams> message)
    with
    | e ->
      let metadata = Exception.nestedMetadata e
      Error($"exception: {e.Message}, metadata: {metadata}")

  task {
    let serializeAndPost result =
      let serialized =
        try Json.Vanilla.serialize result
        with
        | e ->
          let metadata = Exception.nestedMetadata e
          Json.Vanilla.serialize ($"exception: {e.Message}, metadata: {metadata}")

      EvalWorker.postMessage serialized

    match args with
    | Error e -> Error e |> serializeAndPost
    | Ok args ->
      try
        let! result = Eval.performAnalysis args
        Ok result |> serializeAndPost
      with
      | e ->
        let metadata = Exception.nestedMetadata e
        Error($"exception: {e.Message}, metadata: {metadata}")
        |> serializeAndPost
  }

Right now, it successfully runs the full deserialization, analysis, serialization of the result, and post-back, and then fails with the memory exception (after JS gets the response).

I think the key thing here is that the task containing the result is garbage-collected after the post-back.
Seems relevant to dotnet/aspnetcore#16692, a closed issue linked from dotnet/aspnetcore#5617

StachuDotNet commented:

Not sure of the exact reason this code gets further, but it definitely feels related to GC and/or tasks.


StachuDotNet commented Jun 1, 2022

Curiously, if I

  • remove the task {}
  • change the let! result = Eval.performAnalysis args to a let result = Eval.performAnalysis args |> Async.AwaitTask |> Async.RunSynchronously

...then I never see Performed analysis (reconstruction below).

Seems to indicate that this is a matter of GC-ing result - as opposed to GC-ing message (the big incoming payload).
(or maybe it's some combination of the two)
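
Concretely, the failing variant would look roughly like this (a reconstruction from the description above, not the actual code):

static member OnMessage(message : string) =
  let args =
    Json.Vanilla.deserialize<ClientInterop.performAnalysisParams> message
  printfn "Performing analysis"
  let result =
    Eval.performAnalysis args |> Async.AwaitTask |> Async.RunSynchronously
  printfn "Performed analysis" // never reached in the AOT build
  result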

Some notes about this test:

  • the incoming (serialized) payload is 593969 chars long
  • the response payload (after serialization) is 93622 chars long

May be totally irrelevant to the serialization, fwiw. Maybe we're not GCing something correctly in our code?


pbiggar commented Jun 1, 2022

What if, instead of sending one response payload, we didn't save it in a table and instead sent each result back to the main worker immediately (e.g. implementing #4008)?

Would help with #4035 too.


StachuDotNet commented Jun 1, 2022

I'd like that solution, but some bad news:

If I update serializeAndPost to just let serializeAndPost result = (), such that we never Json.Vanilla.serialize result or EvalWorker.postMessage serialized, it still fails.

So, seems that the issue is GC-ing (or something) the un-serialized result, and not with the later stuff.

For full clarity, this fails:

static member OnMessage(message : string) =
  let args =
    try
      printfn "Deserializing"
      Ok(Json.Vanilla.deserialize<ClientInterop.performAnalysisParams> message)
    with
    | e ->
      let metadata = Exception.nestedMetadata e
      Error($"exception: {e.Message}, metadata: {metadata}")

  task {
    let serializeAndPost result = ()

    match args with
    | Error e -> Error e |> serializeAndPost
    | Ok args ->
      try
        let! result = Eval.performAnalysis args
        Ok result |> serializeAndPost
      with
      | e ->
        let metadata = Exception.nestedMetadata e
        Error($"exception: {e.Message}, metadata: {metadata}")
        |> serializeAndPost
  }

Edit: let me make sure these results are right. Don't believe this quite yet.
Edit2: yeah, that's the case - it fails even without serialization or posting back.


StachuDotNet commented Jun 1, 2022

Given ^, thinking how to proceed... we can get the analysis results, but upon exiting the task (due to GC, or something else), we get the error. And this is unrelated to serializing those results or posting them back to JS - we get the error when those actions aren't present.

So, the strat from #4008 seems irrelevant here, for now.


StachuDotNet commented Jun 1, 2022

Note: if I...

let! result = Eval.performAnalysis args
printfn "Performed analysis"
GC.Collect()
printfn "manually gc'd"

We don't reach "manually gc'd"

I'm going to look at various things like GC.GetGCMemoryInfo() |> printfn "%A" briefly, and see if I find anything interesting.

(this isn't something I've played with before - just an idea. I'm in foreign .NET stuff)


StachuDotNet commented Jun 1, 2022

I queried various methods on GC right before and right after performing analysis, with two expressions (2+3, and the big analysis request).

| Syntax | 2+3, before | 2+3, after | big request, before | big request, after |
| --- | --- | --- | --- | --- |
| GetAllocatedBytesForCurrentThread() | 2,408,736 | 5,665,920 | 15,621,904 | 20,014,232 |
| GetTotalAllocatedBytes() | 0 | 4,412,192 | 8,888,560 | 17,691,712 |
| GetTotalMemory(false) | 4,352,064 | 4,918,256 | 13,946,776 | 14,602,064 |

This is comparing "cold start" (first analysis request after page load) to "cold start".

Edit: added some commas to make it easier to read/compare.

I also queried a few other things on GC, but found them to be useless.

I haven't actually studied these numbers yet - just collected. Not sure if they're even useful.
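
For reference, a sketch of how these numbers could be gathered around the analysis call (the three System.GC methods are real .NET APIs; the wrapper and its placement are assumptions):

let logGCStats (label : string) =
  printfn
    "%s: thread-allocated=%d total-allocated=%d total-memory=%d"
    label
    (System.GC.GetAllocatedBytesForCurrentThread())
    (System.GC.GetTotalAllocatedBytes())
    (System.GC.GetTotalMemory(false))

let instrumented (args : ClientInterop.performAnalysisParams) =
  task {
    logGCStats "before"
    let! result = Eval.performAnalysis args
    logGCStats "after"
    return result
  }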

StachuDotNet commented:

quick thought: maybe we're GCing the analysis engine or something, while we could keep it running eternally instead? Not sure if that makes any sense :)


StachuDotNet commented Jun 1, 2022

I wonder if I can replicate this in Debug mode, with a manual GC.Collect() - maybe Release mode GCs more often, and the manual call will allow us to get a stack trace? 🚧 🤞

Edit: nope, it didn't fail. Still not sure if GC is relevant.


StachuDotNet commented Jun 1, 2022

I'm going to try experimenting with "how much of the analysis engine is being thrown away each time, and can we reduce that?"

For example,

    let stdlib =
      LibExecutionStdLib.StdLib.fns
      |> Map.fromListBy (fun fn -> RT.FQFnName.Stdlib fn.name)

    // TODO: get packages from caller
    let libraries : RT.Libraries = { stdlib = stdlib; packageFns = Map.empty }

is currently run/loaded (and then thrown away) every request, but doesn't have to be.

Edit: hit a dead end fast.
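
For reference, the caching attempt was along these lines (a sketch built from the snippet above; cachedLibraries is an invented name):

// build the libraries once per worker lifetime, instead of on every request
let private cachedLibraries : Lazy<RT.Libraries> =
  lazy (
    let stdlib =
      LibExecutionStdLib.StdLib.fns
      |> Map.fromListBy (fun fn -> RT.FQFnName.Stdlib fn.name)
    { stdlib = stdlib; packageFns = Map.empty }
  )

// per request: let libraries = cachedLibraries.Value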

StachuDotNet commented:

A fix has been found (will link PR shortly).
The exact reason the fix works is unfortunately unknown - refer to the upcoming PR for further commentary.

StachuDotNet commented:

Unfortunately, while #4070 got past that error, another related one occurs shortly after:

Uncaught Error: System.ArgumentNullException: ArgumentNull_Generic Arg_ParamName_Name, key
   at System.Collections.Generic.Dictionary`2[[System.Object, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.Int32, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Remove(Object )
   at System.Runtime.InteropServices.JavaScript.Runtime.ReleaseJSOwnedObjectByGCHandle(Int32 gcHandle)
    at Object._convert_exception_for_method_call (https://static.darklang.com/blazor/dotnet.6.0.5.epfcsc29vy-2e5c0b99229188abd29a4572dbc1760c0a73e6bb6a05b5358b8be034a503a2f0.js:1:176256)
    at Object._handle_exception_for_call (https://static.darklang.com/blazor/dotnet.6.0.5.epfcsc29vy-2e5c0b99229188abd29a4572dbc1760c0a73e6bb6a05b5358b8be034a503a2f0.js:1:178225)
    at Object.managed_BINDINGS_ReleaseJSOwnedObjectByGCHandle [as _release_js_owned_object_by_gc_handle] (https://mono-wasm.invalid/managed_BINDINGS_ReleaseJSOwnedObjectByGCHandle:17:21)
    at Object._js_owned_object_finalized (https://static.darklang.com/blazor/dotnet.6.0.5.epfcsc29vy-2e5c0b99229188abd29a4572dbc1760c0a73e6bb6a05b5358b8be034a503a2f0.js:1:150045)
    at FinalizationRegistry.cleanupSome (<anonymous>)


StachuDotNet commented Jun 2, 2022

Little discovery worth its own comment:
[screenshot]

For context, the original payload was ~500k characters, compared to this one's 89k. (note: here I'm comparing non-minified JSON against non-minified JSON, which is what's sent over the wire currently)

I've managed to repro the bug with a much smaller payload, both in size and complexity.

  • I updated the expected types in Wasm.fs to just be:

    module ClientInterop =
      type handler_analysis_param =
        { user_fns : List<ORT.user_fn<ORT.fluidExpr>> }

      type performAnalysisParams =
        | AnalyzeHandler of handler_analysis_param

  • I ripped out all but one user_fn, and all of the now-unexpected stuff from the payload
    (the remaining user fn was the biggest of all of them, by a significant margin)

I'm continuing to thin down the remaining user fn. Maybe the problem occurs due to the depth of the AST? Or some specific type/data within the tree?


pbiggar commented Jun 2, 2022

I believe STJ has a default recursion depth, maybe 250. You can change it.

StachuDotNet commented:

We already have options.MaxDepth <- System.Int32.MaxValue ("infinite") set, it seems.

StachuDotNet commented:

In case it doesn't respect MaxValue, I'll set it to some reasonably large value smaller than that.
Still making good progress thinning the payload down, fwiw. Getting close to a minimal payload - I can feel it.
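
For reference, setting a concrete limit is one line (JsonSerializerOptions.MaxDepth is the real System.Text.Json property; 1024 is just an example value):

let options = System.Text.Json.JsonSerializerOptions(MaxDepth = 1024)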


StachuDotNet commented Jun 2, 2022

I set the STJ MaxDepth to 1024, and it didn't change anything.


Now failing with only a 6740-character payload. Much less than our original payload. Almost there...

I've just now made my feedback loop much tighter, so the rest of the progress should be faster. (edit: maybe. now I'm manually manipulating ASTs so it's tricky :) )

StachuDotNet commented:

Here's the current state of a payload that deserializes OK in Debug or non-AOT Release builds, but not AOT builds (in which, it fails with memory access out of bounds):
https://gist.github.com/StachuDotNet/17203813e61fcdfcb2c49c6d9fcf41cf

This started off as user data 😅

Really seems related to depth, but haven't quite figured out how.


StachuDotNet commented Jun 2, 2022

When I see this AST, I interpret it as:

let x = null
let x = null
// for a total of 16 bindings for `x`
let x = null
...

and then I'm not sure how to read it. Certainly there's at least one pipe involved, but I'm not sure how to grok the fact that there are ... 3 pipes but only one target?

I realize that the AST looks a bit absurd as-is, but: Debug mode is OK deserializing it, and AOT build isn't. I'm trying to figure out why that is. @pbiggar any ideas?

StachuDotNet commented:

To be clear, if I remove any lets (even just one), deserialization is OK in AOT mode.


StachuDotNet commented Jun 2, 2022

I'm not sure if this is purely a matter of depth (I'll check by replacing a few layers with more lets), or something specific to these expression types.

Edit: replacing them with many lets deserializes OK.


pbiggar commented Jun 2, 2022

I'm thinking of what could cause memory errors in this code.

Is there an assertion hit by weird pipes perhaps?

Or an array index?


pbiggar commented Jun 2, 2022

What about the pipes? Does removing them work?


StachuDotNet commented Jun 2, 2022

Aha! I can remove one of them and still get the error, but if I remove 2 (leaving only 1), no error! (verifying just to be sure I didn't mess up the AST)

Edit: yes, I think that's it.

StachuDotNet commented:

Still baffling, but I think we can create a reproducible case pretty easily now, anyway.

StachuDotNet commented:

....why do we need 16 lets?


pbiggar commented Jun 2, 2022

StachuDotNet commented:

putting a printfn "Hello - %A" arglist there to see if that's the issue


StachuDotNet commented Jun 2, 2022

Wait, but the problem is just in deserialization - we're not running the interpreter at all. Unless the interpreter is somehow called during deserialization.

The OnMessage at this point is just:

  static member OnMessage(args : string) =
    try
      Json.Vanilla.deserialize<ClientInterop.performAnalysisParams> args
      |> ignore<ClientInterop.performAnalysisParams>
    with
    | _ -> printfn "failed to deserialize"

    task { return () }


pbiggar commented Jun 2, 2022

ah. JsonFSharpConverter?


StachuDotNet commented Jun 2, 2022

Good hunch - no error if the serializer doesn't register that converter. Something to follow up on tomorrow.
That said, although no error, it also doesn't deserialize - will try the expected shape tomorrow, and go from there.
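
For context, the registration being toggled here is the standard FSharp.SystemTextJson setup, roughly (the converter and namespaces are the library's real API; the exact options Json.Vanilla builds are not shown):

open System.Text.Json
open System.Text.Json.Serialization

let options = JsonSerializerOptions()
// FSharp.SystemTextJson's converter, which handles F# unions, records, options, etc.
options.Converters.Add(JsonFSharpConverter())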

StachuDotNet commented:

I vaguely recall some flag existing where certain packages could be excluded from AOT compilation. Something to look into.

StachuDotNet commented:

also, a rebuild of fsharp.systemtextjson on a newer .net version may help https://github.com/Tarmil/FSharp.SystemTextJson/blob/4e4560619b1902eda3749432e8f8447c0c8ed9a1/global.json#L3


pbiggar commented Jun 2, 2022

Maybe report the issue to FSTJ to see if they've any idea. They might be motivated to help.

StachuDotNet commented:

Will do, once I have a minimal repro.


StachuDotNet commented Jun 3, 2022

Maybe the easiest option: ship a Debug compilation or a non-AOT Release compilation of FSharp.SystemTextJson rather than the AOT Release one, if that truly is the problem. (if I can't find the flag I recall around excluding some packages from AOT)

StachuDotNet commented:

Resolved by #4089 - seems this is specific to Ply.
I'm continuing #4071 to create a minimal repro, separately, so we can inform relevant parties.
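
For context: Ply is the third-party library that supplies the task { } builder used in the snippets above; F# 6 also ships a built-in one. A sketch of the kind of swap that would take Ply out of the AOT path (an assumption about the shape of the fix, not a description of what #4089 actually did):

// before: Ply's task builder, brought into scope via Ply's namespace open
// after: F# 6's built-in task builder, which needs no extra open
let run (args : ClientInterop.performAnalysisParams) =
  task {
    let! result = Eval.performAnalysis args
    return Ok result
  }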
