baseline stack overrun in Twitch livestream frames #607

Closed
NapalmSauce opened this issue Jun 9, 2020 · 43 comments
@NapalmSauce
Contributor

(Would tenderapp have been a better place for this?)

This didn't happen before, but Twitch frames now bomb the browser. Here's an example URL with a bogus channel that still crashes (for me, at the very least):
https://embed.twitch.tv/?autoplay=true&channel=Whatever&height=480

If I have Baseline enabled (with or without Ion), the above takes TFx down. This can be a nuisance, since you might not expect one of these frames to be embedded on some sites.

From the crash reporter, running TFxDebug-fpr23:


Exception Type:  EXC_BAD_ACCESS (SIGBUS)
Exception Codes: KERN_PROTECTION_FAILURE at 0xf1a19ff0
Crashed Thread:  35

Thread 35 crashed with PPC Thread State 32:
  srr0: 0x0c254fc8  srr1: 0x0000f030   dar: 0xf1a19ff0 dsisr: 0x42000000
    r0: 0x0c015614    r1: 0xf1a1a040    r2: 0x062ad800    r3: 0x23cbb000
    r4: 0x35111ba8    r5: 0x35104678    r6: 0x35104408    r7: 0x00000000
    r8: 0x35111ba8    r9: 0x23cbb000   r10: 0x3609dec8   r11: 0xf1a1a120
   r12: 0x0c0155b4   r13: 0xf1a67b30   r14: 0x37f04dc0   r15: 0xffffff88
   r16: 0xf1a1a2ac   r17: 0xf1b10608   r18: 0x0be10a64   r19: 0x3501d044
   r20: 0x0000009c   r21: 0xf1b109c0   r22: 0x00000002   r23: 0xf1b105d8
   r24: 0x00000000   r25: 0x00000000   r26: 0x35200000   r27: 0x35100000
   r28: 0x23cbb000   r29: 0x3501d010   r30: 0x35111ba8   r31: 0x0c0155bc
    cr: 0x24424222   xer: 0x00000003    lr: 0x0c015614   ctr: 0x0c0155b4
vrsave: 0x00000000

(huge) backtrace for thread 35:


0   XUL                0x0c254fc8 js::CurrentThreadCanAccessRuntime(JSRuntime*) + 12
1   XUL                0x0c015610 js::jit::AssertValidObjectPtr(JSContext*, JSObject*) + 92
2   ???                0x24f1c4f4 0 + 619824372
3   XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
4   XUL                0x0be12fa8 js::jit::EnterBaselineMethod(JSContext*, js::RunState&) + 472
5   XUL                0x0c221f9c js::RunScript(JSContext*, js::RunState&) + 488
6   XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
7   XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
8   XUL                0x0be3325c __ZN2js3jitL14DoCallFallbackEP9JSContextPNS0_13BaselineFrameEPNS0_15ICCall_FallbackEjPN2JS5ValueENS7_13MutableHandleIS8_EE + 1128
9   ???                0x26eb78f4 0 + 652966132
10  XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
11  XUL                0x0be12fa8 js::jit::EnterBaselineMethod(JSContext*, js::RunState&) + 472
12  XUL                0x0c21e850 __ZL9InterpretP9JSContextRN2js8RunStateE + 49328
13  XUL                0x0c221f10 js::RunScript(JSContext*, js::RunState&) + 348
14  XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
15  XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
16  XUL                0x0be3325c __ZN2js3jitL14DoCallFallbackEP9JSContextPNS0_13BaselineFrameEPNS0_15ICCall_FallbackEjPN2JS5ValueENS7_13MutableHandleIS8_EE + 1128
17  ???                0x26eb78f4 0 + 652966132
18  XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
19  XUL                0x0be12fa8 js::jit::EnterBaselineMethod(JSContext*, js::RunState&) + 472
20  XUL                0x0c221f9c js::RunScript(JSContext*, js::RunState&) + 488
21  XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
22  XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
23  XUL                0x0be3325c __ZN2js3jitL14DoCallFallbackEP9JSContextPNS0_13BaselineFrameEPNS0_15ICCall_FallbackEjPN2JS5ValueENS7_13MutableHandleIS8_EE + 1128
24  ???                0x26eb78f4 0 + 652966132
25  XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
26  XUL                0x0be12fa8 js::jit::EnterBaselineMethod(JSContext*, js::RunState&) + 472
27  XUL                0x0c221f9c js::RunScript(JSContext*, js::RunState&) + 488
28  XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
29  XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
30  XUL                0x0be3325c __ZN2js3jitL14DoCallFallbackEP9JSContextPNS0_13BaselineFrameEPNS0_15ICCall_FallbackEjPN2JS5ValueENS7_13MutableHandleIS8_EE + 1128
31  ???                0x26eb78f4 0 + 652966132
32  XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
33  XUL                0x0be12fa8 js::jit::EnterBaselineMethod(JSContext*, js::RunState&) + 472
34  XUL                0x0c21e850 __ZL9InterpretP9JSContextRN2js8RunStateE + 49328
35  XUL                0x0c221f10 js::RunScript(JSContext*, js::RunState&) + 348
36  XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
37  XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
38  XUL                0x0be3325c __ZN2js3jitL14DoCallFallbackEP9JSContextPNS0_13BaselineFrameEPNS0_15ICCall_FallbackEjPN2JS5ValueENS7_13MutableHandleIS8_EE + 1128
39  ???                0x26eb78f4 0 + 652966132
40  XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
41  XUL                0x0be28224 js::jit::EnterBaselineAtBranch(JSContext*, js::InterpreterFrame*, unsigned char*) + 804
42  XUL                0x0c21ee60 __ZL9InterpretP9JSContextRN2js8RunStateE + 50880
43  XUL                0x0c221f10 js::RunScript(JSContext*, js::RunState&) + 348
44  XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
45  XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
46  XUL                0x0be3325c __ZN2js3jitL14DoCallFallbackEP9JSContextPNS0_13BaselineFrameEPNS0_15ICCall_FallbackEjPN2JS5ValueENS7_13MutableHandleIS8_EE + 1128
47  ???                0x26eb78f4 0 + 652966132
48  XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
49  XUL                0x0be28224 js::jit::EnterBaselineAtBranch(JSContext*, js::InterpreterFrame*, unsigned char*) + 804
50  XUL                0x0c21ee60 __ZL9InterpretP9JSContextRN2js8RunStateE + 50880
51  XUL                0x0c221f10 js::RunScript(JSContext*, js::RunState&) + 348
52  XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
53  XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
54  XUL                0x0be3325c __ZN2js3jitL14DoCallFallbackEP9JSContextPNS0_13BaselineFrameEPNS0_15ICCall_FallbackEjPN2JS5ValueENS7_13MutableHandleIS8_EE + 1128
55  ???                0x26eb78f4 0 + 652966132
56  XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
57  XUL                0x0be12fa8 js::jit::EnterBaselineMethod(JSContext*, js::RunState&) + 472
58  XUL                0x0c21e850 __ZL9InterpretP9JSContextRN2js8RunStateE + 49328
59  XUL                0x0c221f10 js::RunScript(JSContext*, js::RunState&) + 348
60  XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
61  XUL                0x0c0c3b8c js::fun_apply(JSContext*, unsigned int, JS::Value*) + 820
62  XUL                0x0c229534 js::CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), JS::CallArgs const&) + 264
63  XUL                0x0c222234 js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 400
64  XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
65  XUL                0x0be3325c __ZN2js3jitL14DoCallFallbackEP9JSContextPNS0_13BaselineFrameEPNS0_15ICCall_FallbackEjPN2JS5ValueENS7_13MutableHandleIS8_EE + 1128
66  ???                0x26eb78f4 0 + 652966132
67  XUL                0x0be10a60 __ZL13EnterBaselineP9JSContextRN2js3jit12EnterJitDataE + 392
68  XUL                0x0be12fa8 js::jit::EnterBaselineMethod(JSContext*, js::RunState&) + 472
69  XUL                0x0c221f9c js::RunScript(JSContext*, js::RunState&) + 488
70  XUL                0x0c22233c js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) + 664
71  XUL                0x0c223090 js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) + 640
72  XUL                0x0c03d06c JS::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) + 220
73  XUL                0x09f383cc mozilla::dom::EventHandlerNonNull::Call(JSContext*, JS::Handle<JS::Value>, mozilla::dom::Event&, JS::MutableHandle<JS::Value>, mozilla::ErrorResult&) + 496
74  XUL                0x0a201110 mozilla::JSEventHandler::HandleEvent(nsIDOMEvent*) + 1588
75  XUL                0x0a201930 mozilla::EventListenerManager::HandleEventSubType(mozilla::EventListenerManager::Listener*, nsIDOMEvent*, mozilla::dom::EventTarget*) + 328
76  XUL                0x0a201c34 mozilla::EventListenerManager::HandleEventInternal(nsPresContext*, mozilla::WidgetEvent*, nsIDOMEvent**, mozilla::dom::EventTarget*, nsEventStatus*) + 692
77  XUL                0x0a1e56c0 mozilla::EventTargetChainItem::HandleEvent(mozilla::EventChainPostVisitor&, mozilla::ELMCreationDetector&) + 476
78  XUL                0x0a1cc9c4 mozilla::EventTargetChainItem::HandleEventTargetChain(nsTArray<mozilla::EventTargetChainItem>&, mozilla::EventChainPostVisitor&, mozilla::EventDispatchingCallback*, mozilla::ELMCreationDetector&) + 752
79  XUL                0x0a1da374 mozilla::EventDispatcher::Dispatch(nsISupports*, nsPresContext*, mozilla::WidgetEvent*, nsIDOMEvent*, nsEventStatus*, mozilla::EventDispatchingCallback*, nsTArray<mozilla::dom::EventTarget*>*) + 3552
80  XUL                0x0a1da6d0 mozilla::EventDispatcher::DispatchDOMEvent(nsISupports*, mozilla::WidgetEvent*, nsIDOMEvent*, nsPresContext*, nsEventStatus*) + 352
81  XUL                0x0a97e850 __ZN12_GLOBAL__N_120MessageEventRunnable16DispatchDOMEventEP9JSContextPN7mozilla3dom7workers13WorkerPrivateEPNS3_20DOMEventTargetHelperEb.isra.1193 + 1164
82  XUL                0x0a948a9c mozilla::dom::workers::WorkerRunnable::Run() + 1096
83  XUL                0x08718c64 nsThread::ProcessNextEvent(bool, bool*) + 844
84  XUL                0x0875c480 NS_ProcessNextEvent(nsIThread*, bool) + 60
85  XUL                0x0a97098c mozilla::dom::workers::WorkerPrivate::DoRunLoop(JSContext*) + 1248
86  XUL                0x0a904630 (anonymous namespace)::WorkerThreadPrimaryRunnable::Run() + 1620
87  XUL                0x08718c64 nsThread::ProcessNextEvent(bool, bool*) + 844
88  XUL                0x0875c480 NS_ProcessNextEvent(nsIThread*, bool) + 60
89  XUL                0x08b69c8c mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) + 360
90  XUL                0x08b3173c MessageLoop::RunInternal() + 148
91  XUL                0x08b317cc MessageLoop::Run() + 64
92  XUL                0x087138b8 nsThread::ThreadFunc(void*) + 396
93  libnss3.dylib      0x07ac88f0 _pt_root + 228
94  libSystem.B.dylib  0x0026ff70 _pthread_start + 316

Looking at the core dump:

(the bad access is at 0xf1a19ff0; r1 is 0xf1a1a040)

(Guard page?)

Load command 1869
      cmd LC_SEGMENT
  cmdsize 56
  segname 
   vmaddr 0xf1a19000
   vmsize 0x00001000
  fileoff 2241069056
 filesize 4096
  maxprot 0x00000007
 initprot 0x00000000
   nsects 0
    flags 0xff000000

(thread's stack segment?)

Load command 1870
      cmd LC_SEGMENT
  cmdsize 56
  segname 
   vmaddr 0xf1a1a000
   vmsize 0x00101000
  fileoff 2241073152
 filesize 1052672
  maxprot 0x00000007
 initprot 0x00000003
   nsects 0
    flags 0xff000000

The backtrace makes it look like it's just stack exhaustion from excessive recursion.
(I re-checked with gdb768 to see if things are consistent; they are.)

@classilla
Owner

You've pretty much hit the nail on the head -- stack exhaustion is exactly what's going on here.

Unfortunately, we already dedicate a full GB to the CPU stack, which is already half the available 32-bit addressing space. Disabling the JIT by default doesn't seem like a winning strategy, and since the site doesn't work anyway, one way around it might be to unconditionally blacklist one of the Twitch components so it doesn't load.

@NapalmSauce
Contributor Author

What mechanism would be the most appropriate for this?

After more verification today: Twitch.tv has a front-page livestream and the same crash happens there, so this nearly entirely prevents Twitch from being browsed at all.

The crashing script(s) seem to reside on static.twitchcdn.net, and blocking that CDN outright simply prevents Twitch domains from loading.
Even if someone wants to browse the site by any means, TenFourFox's adblock being preffed off by default doesn't make it an ideal vehicle for a blacklist.

It's not obvious how this should be handled. The Intel people and/or others tracking the repo changesets likely won't want this, so it smells like #ifdef __ppc__.

I can put up a PR and do some testing for the next beta (if that's okay with you); just let me know where the blacklist entry should go.

@classilla
Owner

Sure, feel free to take a whack at it. If you could get a PR up by the end of this week, that would give me enough time to pull it in. I think we should do it like #469, but in a block before it, gated with a different pref that should not exist by default (tenfourfox.allow_troublesome_js sounds good). If the pref exists and is true, then the scripts are not blocked. The code can otherwise be copy-pasted.

I bet there will be other hosts in this category eventually, so let's have a systematic solution for it.

NapalmSauce added a commit to NapalmSauce/tenfourfox that referenced this issue Jun 10, 2020
…t can crash the browser, but have no obvious workaround
NapalmSauce added a commit to NapalmSauce/tenfourfox that referenced this issue Jun 10, 2020
…t can crash the browser, but have no obvious workaround
@NapalmSauce
Contributor Author

(Sorry about the above; I should stop force-pushing my tree.)

So I'm using a browser built off NapalmSauce@cf9bf7d and it's working as intended.

I've added a logging function and a logging pref as well. The string-matching macro name is BLOC (vs. BLOK) to make it easy to tell the blacklist apart from the adblock, should it grow larger.

I think it's ready for you to review, so I'll put up my PR shortly.

classilla pushed a commit that referenced this issue Jun 12, 2020
…sh the browser, but have no obvious workaround (#609)
@classilla
Owner

Mitigated by #609 but leaving open in case I figure out what's really going on.

@NapalmSauce
Contributor Author

The first tag to crash is 45.5.0b1 (45.4.0 survives). #453 had a common regression line. I lack the background here to tell whether it has to do with hybrid typed-array endianness, but *-leopard-webkit doesn't seem to crash on Twitch. That's probably not a great comparison; I can just reverse the JIT portion of 312362 in changesets-20160923 (see comment 4) and let you know what happens. It won't cost me much, and it should better document what might be going on.

@NapalmSauce
Contributor Author

Backing out that portion of 312362 doesn't change anything about the crash, so maybe the scripts are accessing floats as ints, or something else sketchy.

@classilla
Owner

No, I don't think it's an endian bug, because it would show up in the interpreter too. I think either there is a JIT-specific bug, or (more likely) the interpreter just correctly terminates when the stack exhausts and the JIT doesn't.

@classilla
Owner

I'm homing in on CodeGenerator::visitCheckOverRecursed and I suspect it is not adding a recursion check because it thinks we're using less stack than we are (we have famously huge turds). I wonder if the heuristic it's using is not properly adjusted. However, I'm also running out of time for the beta, so I think we'll ship this and hold this over.

@classilla
Owner

Could also be VMFunctions::CheckOverRecursedWithExtra or CheckOverRecursed, though, since you mention this happens even with just Baseline and the Code Generator is Ion-specific.

@classilla
Owner

We'd need some prints in there to see (a) if the limit is at the right address and (b) that we're actually extracting the stack pointer from the BaselineFrame correctly.

@NapalmSauce
Contributor Author

45.4 and below only had the bug wallpapered over, since there's no recursion abort in the console. On ≥ 45.5b1 with Baseline disabled, there's a recursion error for the script (a 1.7MB static.twitchcdn.net/assets/worker.min-a76eeba0bf85a2f41428.js, with Emscripten-related names in it), so 45.4 simply halts the script before it ever gets to the crash.

I'm doing a debug build to start investigating the recursion limits.

@NapalmSauce
Contributor Author

I think I've found an explanation for how the overflow happens. In

#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 512 * sizeof(size_t) * 1024

DOM worker threads are given a 2MB stack limit, but core dumps generated from the crash reveal a roughly 1MB stack segment at the SIGBUS address.

Reducing the stack limit there to 1MB still crashes in C++ code, probably because there's still not enough stack space left to set up new frames, but the last build I did had a 512KB limit, got no stack overflow, and correctly aborted.

Where might the JSRuntime and the VM mapping disagree on stack space?

@NapalmSauce
Contributor Author

https://github.com/classilla/tenfourfox/blob/master/dom/workers/WorkerThread.cpp#L29

The stack size for worker threads, per dom/workers/WorkerThread.cpp, is 1MB.
Looks like the fix is just to set kWorkerStackSize to something larger than 2MB (setting it to 2MB + 512KB works). With this done, the worker script just aborts like it should with Baseline enabled.

I also tested enlarging the worker thread stack to 4MB + 512KB (giving 4MB to the JS stack), and the script still aborts with a recursion error. I don't know at this point whether there's some infinite recursion going on, but at the very least it's crash-free.

Let me know what stack size (and JS stack size) worker threads should be given, and I can put up a new PR that fixes the crash without the need for a blocklist. I can also wrap the stack sizes in an #ifdef __ppc__ so an #else provides Intel builders with the Mozilla-defined stack sizes (for i386: 1MB C stack, 512KB JS stack).

@classilla
Owner

That's a good question. I will look at both of these things; let me do some thinking. Good detective work!

@NapalmSauce
Contributor Author

Fine by me :P Thanks!

@classilla
Owner

classilla commented Jul 4, 2020

Something's a little odd here.

/* worked */
#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 128 * sizeof(size_t) * 1024
/* didn't */
#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 4224 * sizeof(size_t) * 1024
#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 768 * sizeof(size_t) * 1024
#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 512 * sizeof(size_t) * 1024
#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 256 * sizeof(size_t) * 1024

What values worked for you?

@NapalmSauce
Contributor Author

By "didn't work" I assume you mean it crashed on loading Twitch?

What did you set kWorkerStackSize in WorkerThread.cpp to? (I think I'm just not explaining myself too well.)

const uint32_t kWorkerStackSize = 256 * sizeof(size_t) * 1024;

Right now it's set to 1MB, so worker threads are spawned with a 1MB stack segment that's enforced by VM permissions.

#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 256 * sizeof(size_t) * 1024

doesn't work, since the JIT code can use up the whole 1MB stack before aborting in the recursion checks, then call into C++ code, which sets up extra stack frames, goes past the limit, and raises SIGBUS.

#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 128 * sizeof(size_t) * 1024

works, since the JS stack uses at most 512KB and will always abort with a safe extra 512KB between the JS stack limit and the real stack limit as enforced by the VM, so C++ code can safely set up stack frames without crashing.

The 512KB safety zone is somewhat arbitrary, but it's the same "extra" as 32-bit Mozilla builds of Firefox 45, and it sounds like a pretty big margin.

So what I'm saying is that kWorkerStackSize should also be equal to or larger than
( WORKER_CONTEXT_NATIVE_STACK_LIMIT + 512KB ).

@classilla
Owner

classilla commented Jul 4, 2020

Derp. You know, I didn't get a lot of sleep last night. So, I tried

const uint32_t kWorkerStackSize = 4096 * sizeof(size_t) * 1024;

This still doesn't load any videos, but the chat, incredibly, works, and it doesn't crash. There is still a recursion error or two. How does this perform on your system? (This was my maxed-out Quad G5.)

@NapalmSauce
Contributor Author

I haven't done an opt build since (the last one's an -Og debug build), so it's not very fast, but once Twitch has loaded it's usable on my 2x1.5GHz MDD. Olga's fpr4 build does play Twitch streams, and Ken's fpr23 build doesn't error out and reaches the point where it complains H.264 is unsupported (just a quick test; I haven't tried the MP4 enabler). So if the recursion error can be avoided... Twitch might just work (!)

@classilla
Owner

Interesting. Even with Baseline off, I still get recursion errors, so whatever's going on is either an interpreter bug or an endian one (or an endian interpreter bug). Anyway, I set it to 64MB for giggles to see if this unbreaks anything else. I wonder if it should be even higher.

@classilla
Owner

Just for laughs, I revved it to 512MB. No more recursion error!

The video thread doesn't start, though. It seems to be a failure in NSPR. Investigating that further.

@NapalmSauce
Contributor Author

Wow, impressive. How long does the script take to complete? If it takes this much stack space, that might still be an endian bug, no? Versus a mere 512KB-1MB on Intel...

I did a new opt build for my MDD with some testing flags to reduce code size, and the Twitch chat works decently; it just slows down because the chat gets spammed on front-page channels.

@classilla
Owner

Surprisingly, not all that long (admittedly on a G5, but this is a debug build). However, the problem is that pthread_create is throwing EAGAIN. Either we're out of memory or out of threads, or both. If I back down to 256MB, the recursion errors come back, and the threads still aren't created. I wonder which hard limit we're exceeding. No crashes, though.

If I can't sort it out, I might just ship 512MB as the limit (half the 1GB stack). It may make other things start working, at least, and it fixes the crash here even if it doesn't fix the site.

@NapalmSauce
Contributor Author

It could also be trying to spawn another worker thread, which the 1GB stack obviously can't house simultaneously. In that sense, shipping a 512MB stack for worker threads would break any site that uses multiple workers. :/

@classilla
Owner

That's a possible explanation too. Maybe let's stick with 64MB and see what shakes out. I can't see anything to twiddle for boosting the system thread max, at least not in sysctl.

@NapalmSauce
Contributor Author

NapalmSauce commented Jul 6, 2020

You might want to consider something small like 3MB or 4MB, as no site before Twitch used the full 1MB stack space (otherwise we would have gotten this crash much earlier). 64MB looks like overkill if it doesn't make Twitch functional.

@classilla
Owner

No, we need large frames (that's why we have the 1GB stack segment). I don't think a 64MB limit is that excessive under those circumstances. It's an easy change to back out if the beta has problems.

@NapalmSauce
Contributor Author

NapalmSauce commented Jul 6, 2020

Good point!

(Off-topic: a little while ago I compiled my own gcc, and I hadn't realized before, but I think the gcc from MacPorts is built at -O0. My -O2 gcc build takes about 2h33m to complete an opt TenFourFox build on a 2GHz dual-core G5. It took about 5 hours before with gcc-5 from MacPorts!)

@classilla
Owner

Does that extend to libgcc, etc.? Do you notice a performance difference using those libraries instead?

@NapalmSauce
Contributor Author

It does. I even did builds with custom -mcpu values to compare, and even -O3 builds. Both -O3 and -mcpu get filtered out and replaced by -O2 where they're inappropriate. I extracted 7400 and 7450 sets of libgcc, libatomic, and libstdc++ for both -O2 and -O3; the -O3 libs changed my SunSpider score from about 1520ms to 1460ms on the MDD. It might be libstdc++ that helps more than the others. It's not drastic, but it definitely feels smoother.

@kencu
Contributor

kencu commented Jul 6, 2020

I can make gcc on MacPorts build pretty much any way we want, within reason. Iain Sandoe tells us not to be strict with the optimization on the throwaway bootstrap gcc compiler (ergo the weak optimization there) for a bunch of reasons: the system bootstrap compiler can't handle it, the bootstrap build takes much longer, and the final optimization used in building the final compiler has nothing to do with the opts we set for the bootstrap compiler.

As a heads-up, there is a BIG problem with an ABI break in libgcc 7.5.0 that we did not see with libgcc 7.4.0. Iain either can't fix it or finds it too time-consuming to fix, but some of you guys might be able to.

@classilla
Owner

I opened #613 for this. Something that would really save me time is if someone could generate those -O3 libs for our four target subarches on 10.4 (750, 7400, 7450, 970).

@kencu, is there a ticket open on it?

@NapalmSauce
Contributor Author

NapalmSauce commented Jul 15, 2020

I didn't really notice before because I was looking at #613. I see in the head commit that the worker stack size is bumped to 64MB (and remains 1MB for Intel), but the worker-context JS stack limit is still defined as 2MB for both PowerPC and Intel in dom/workers/RuntimeService.cpp line 98:

#define WORKER_CONTEXT_NATIVE_STACK_LIMIT 512 * sizeof(size_t) * 1024

So I'm just asking: is the change to WORKER_CONTEXT_NATIVE_STACK_LIMIT not pushed to GitHub yet? Sorry for being late.

@kencu
Contributor

kencu commented Jul 15, 2020

@kencu, is there a ticket open on it?

No ticket in MacPorts to generate -O3 builds of gcc or libgcc at present, no.

@classilla
Owner

Isn't it kWorkerStackSize? I don't think that needs to be changed along with the worker stack size change now. Do you disagree?

(I do need to push; I was really tired last night and went to bed early. NOT COVID-19, @kencu. But that change isn't in there.)

@NapalmSauce
Contributor Author

NapalmSauce commented Jul 16, 2020

WORKER_CONTEXT_NATIVE_STACK_LIMIT is the value from which JS_SetNativeStackQuota() sets the DOM worker JSRuntime's native stack limit; at the other end, VMFunctions::CheckOverRecursedWithExtra() and CheckOverRecursed() depend on this value for how much stack the JIT can use before either check fails and the script aborts.

So the JS can use up to WORKER_CONTEXT_NATIVE_STACK_LIMIT bytes of the stack segment that's kWorkerStackSize bytes long. If WORKER_CONTEXT_NATIVE_STACK_LIMIT equals 2MB, then after the JIT has used 2MB, CheckOverRecursed* will fail and the script will abort regardless of the stack's actual size (kWorkerStackSize).

kWorkerStackSize must be greater than WORKER_CONTEXT_NATIVE_STACK_LIMIT so that even once the limit has been reached, C++ code can safely set up stack frames for a few functions without crashing. From what I tried while fixing Twitch, if kWorkerStackSize is 512KB larger than what WORKER_CONTEXT_NATIVE_STACK_LIMIT allows, it's safe and no crash occurs anymore.

Both are relevant. The browser won't be unstable without this change; it just won't be able to use the stack according to the size you gave it.

@classilla
Owner

Then what value do you think it should be, so that we're agreed on it? I don't think I want to do that now, since it could enable functionality that isn't being tested (unless I end up spinning another beta for another reason). I could shovel that into FPR26.

@NapalmSauce
Contributor Author

NapalmSauce commented Jul 16, 2020

Since kWorkerStackSize is 64MB, something like 63MB should let DOM workers take advantage of their large stack, but indeed there's no emergency.

I'm a bit confused right now. I've spent most of my time on this bug reading how the interfaces for the recursion checks work, and I'm just wondering if I'm missing anything important: kWorkerStackSize changes how much stack gets allocated without influencing the recursion checks, so if you change kWorkerStackSize but not WORKER_CONTEXT_NATIVE_STACK_LIMIT in accordance, then (crash aside) doesn't that mean it has no effect? For example, whether kWorkerStackSize is 64MB or 512MB, Twitch will do just as much recursion (use at most 2MB of stack) before the script aborts, in both cases, if WORKER_CONTEXT_NATIVE_STACK_LIMIT stays at 2MB.

I'm wondering if Twitch didn't throw a recursion error on your 512MB stack test because it couldn't even spawn a single worker thread. I did a test on Twitch with kWorkerStackSize = 4.5MB and WORKER_CONTEXT_NATIVE_STACK_LIMIT = 4MB, and the page seemed to take forever to become responsive, though admittedly that was on an MDD in a debug build. In your earlier comment about the 512MB stack on the Quad with a debug build, was it a matter of a few seconds, or more like nearly a minute?

Am I missing something? That said, there's no problem if you're busy; I'm just trying to help out.

classilla added a commit that referenced this issue Aug 16, 2020
classilla added a commit that referenced this issue Aug 16, 2020
@classilla
Owner

Bumped to 60MB for PPC. That should leave ample stack for bailing out.

@classilla
Owner

Anything more to do here?

@NapalmSauce
Contributor Author

Unless something else comes up, this looks complete to me. All that's left is for Twitch to fix their client, and who knows if that will happen. :)

@classilla
Owner

Unfortunately, yes. Thanks for doing this and figuring it out.
